How to Scrape Wikipedia Data for Knowledge Graphs
Wikipedia is one of the richest sources of structured knowledge on the internet — millions of articles with infoboxes, categories, links, and Wikidata connections. If you're building a knowledge graph for NLP, search, or recommendation systems, scraping Wikipedia data gives you a massive head start.
This tutorial covers three approaches: the Wikipedia API for article content, Wikidata SPARQL for structured triples, and SearchHive for scraping when APIs aren't enough. Each includes working Python code ready to run.
Key Takeaways
- Wikipedia API returns article text, links, categories, and infobox data as structured JSON
- Wikidata SPARQL queries give you subject-predicate-object triples ready for knowledge graph construction
- SearchHive ScrapeForge handles full-page scraping including tables, infoboxes, and dynamic content
- Combining all three sources gives you the most complete knowledge graph possible
- Wikipedia's rate limits are generous (200 requests/minute for unregistered users; use SearchHive for higher volumes)
Prerequisites
pip install searchhive wikipedia-api SPARQLWrapper networkx matplotlib
- searchhive — Web scraping API (free tier: 50K requests/month)
- wikipedia-api — Python wrapper for the Wikipedia API
- SPARQLWrapper — Query Wikidata's SPARQL endpoint
- networkx + matplotlib — Visualize the knowledge graph
Step 1: Fetch Wikipedia Articles via API
The Wikipedia API gives you structured access to article content, summaries, links, and categories:
import wikipediaapi

def get_article(title: str) -> dict:
    """Fetch a Wikipedia article with metadata."""
    wiki = wikipediaapi.Wikipedia('KnowledgeGraphBot/1.0', 'en')
    page = wiki.page(title)
    if not page.exists():
        return {"error": f"Article '{title}' not found"}
    return {
        'title': page.title,
        'summary': page.summary,
        'full_text': page.text,
        'categories': [c.split(':')[-1] for c in page.categories.keys()],
        'links': list(page.links.keys())[:100],
        'backlinks': list(page.backlinks.keys())[:100],
        'sections': [s.title for s in page.sections],
    }
# Usage
article = get_article("Machine learning")
print(f"Title: {article['title']}")
print(f"Summary: {article['summary'][:200]}...")
print(f"Categories: {article['categories'][:10]}")
print(f"Linked articles: {len(article['links'])}")
Extract Section-Level Content
For knowledge graphs, you often want structured section data rather than full text:
def get_sections(page) -> list[dict]:
    """Extract all sections with their content."""
    sections = []

    def traverse(section, depth=0):
        sections.append({
            'title': section.title,
            'text': section.text[:1000],  # First 1000 chars
            'depth': depth,
            'subsections': len(section.sections),
        })
        for subsection in section.sections:
            traverse(subsection, depth + 1)

    for section in page.sections:
        traverse(section)
    return sections

wiki = wikipediaapi.Wikipedia('KnowledgeGraphBot/1.0', 'en')
page = wiki.page("Artificial intelligence")
sections = get_sections(page)
for s in sections[:10]:
    indent = "  " * s['depth']
    print(f"{indent}{s['title']} ({len(s['text'])} chars)")
Step 2: Query Wikidata for Structured Triples
Wikidata stores structured data as subject-predicate-object triples — exactly what you need for a knowledge graph:
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata(sparql_query: str) -> list[dict]:
    """Execute a SPARQL query against Wikidata."""
    # Wikidata rejects default user agents -- always set a descriptive one
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="KnowledgeGraphBot/1.0")
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    bindings = results.get("results", {}).get("bindings", [])
    # Flatten bindings to simple dicts
    rows = []
    for binding in bindings:
        row = {}
        for key, value in binding.items():
            row[key] = value.get("value", "")
            # Extract Q-id if it's a Wikidata entity
            if 'wikidata.org/entity/' in row[key]:
                row[f"{key}_id"] = row[key].split('/')[-1]
        rows.append(row)
    return rows
Get Properties for a Specific Entity
# Get all properties for "Machine learning" (Q2539)
query = """
SELECT ?propertyLabel ?value ?valueLabel WHERE {
  wd:Q2539 ?prop ?statement.
  ?statement ?ps ?value.
  ?property wikibase:claim ?prop.
  ?property wikibase:statementProperty ?ps.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""
properties = query_wikidata(query)
for p in properties[:10]:
    print(f"{p.get('propertyLabel', '?')}: {p.get('valueLabel', p.get('value', '?'))}")
Find Entities by Category
# Find all companies in the AI industry
query = """
SELECT ?company ?companyLabel ?founded WHERE {
  ?company wdt:P31 wd:Q4830453;   # Instance of: business
           wdt:P452 wd:Q11660;    # Industry: artificial intelligence
           wdt:P571 ?founded.     # Inception date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?founded)
LIMIT 30
"""
ai_companies = query_wikidata(query)
for c in ai_companies:
    print(f"{c.get('companyLabel')}: founded {c.get('founded', '?')[:10]}")
Step 3: Scrape Wikipedia Infoboxes with SearchHive
Infoboxes contain the most structured data in Wikipedia articles — but extracting them from the API requires parsing wikitext. SearchHive can scrape the rendered HTML directly:
from searchhive import ScrapeForge

def scrape_infobox(article_title: str) -> dict:
    """Scrape a Wikipedia infobox using SearchHive."""
    client = ScrapeForge()
    url = f"https://en.wikipedia.org/wiki/{article_title.replace(' ', '_')}"
    result = client.scrape(
        url=url,
        selectors={
            'title': 'h1.firstHeading',
            'infobox_rows': {
                'each': 'table.infobox tr',
                'fields': {
                    'label': 'th',
                    'value': 'td',
                },
            },
            'categories': '#mw-normal-catlinks ul li a',
            'first_paragraph': '#mw-content-text p:first-of-type',
        },
    )
    # Parse infobox rows into key-value pairs
    infobox = {}
    for row in result.data.get('infobox_rows', []):
        label = row.get('label', '').strip()
        value = row.get('value', '').strip()
        if label and value:
            infobox[label] = value
    return {
        'title': result.data.get('title'),
        'infobox': infobox,
        'categories': result.data.get('categories', []),
        'summary': result.data.get('first_paragraph', ''),
    }

# Usage
data = scrape_infobox("Tesla, Inc.")
print(f"Title: {data['title']}")
print(f"Infobox fields: {len(data['infobox'])}")
for key, val in list(data['infobox'].items())[:8]:
    print(f"  {key}: {val}")
Batch Scrape Multiple Articles
def scrape_multiple_articles(titles: list[str]) -> list[dict]:
    """Scrape infoboxes from multiple Wikipedia articles."""
    client = ScrapeForge()
    urls = [
        f"https://en.wikipedia.org/wiki/{t.replace(' ', '_')}"
        for t in titles
    ]
    results = client.scrape_batch(
        urls,
        selectors={
            'title': 'h1.firstHeading',
            'infobox_rows': {
                'each': 'table.infobox tr',
                'fields': {'label': 'th', 'value': 'td'}
            },
            'links': '#mw-content-text a[href^="/wiki/"] @href',
        },
        concurrency=5
    )
    articles = []
    for r in results:
        if r.success and r.data:
            infobox = {}
            for row in r.data.get('infobox_rows', []):
                label = row.get('label', '').strip()
                value = row.get('value', '').strip()
                if label and value:
                    infobox[label] = value
            articles.append({
                'url': r.url,
                'title': r.data.get('title'),
                'infobox': infobox,
                'links': [l.split('/')[-1] for l in r.data.get('links', [])[:200]],
            })
    return articles

articles = scrape_multiple_articles(["Python (programming language)", "JavaScript", "Rust (programming language)"])
for a in articles:
    print(f"{a['title']}: {len(a['infobox'])} infobox fields, {len(a['links'])} links")
Step 4: Build the Knowledge Graph with NetworkX
Convert your scraped data into a graph structure:
import networkx as nx
from SPARQLWrapper import SPARQLWrapper, JSON

def build_knowledge_graph(entity: str) -> nx.DiGraph:
    """Build a knowledge graph starting from a Wikidata entity."""
    G = nx.DiGraph()
    # Query the entity's direct (truthy) claims and label them
    query = f"""
    SELECT ?item ?itemLabel ?propertyLabel WHERE {{
      wd:{entity} ?p ?item.
      ?property wikibase:directClaim ?p.
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT 100
    """
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="KnowledgeGraphBot/1.0")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results['results']['bindings']:
        item = binding.get('item', {}).get('value', '')
        item_label = binding.get('itemLabel', {}).get('value', '')
        prop_label = binding.get('propertyLabel', {}).get('value', '')
        if item and prop_label:
            G.add_edge(entity, item_label or item, relation=prop_label)
    return G

# Usage
G = build_knowledge_graph("Q2539")  # Machine learning
print(f"Knowledge graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
Visualize the Graph
import matplotlib.pyplot as plt

def visualize_graph(G: nx.DiGraph, filename: str = "knowledge_graph.png"):
    """Visualize a knowledge graph."""
    plt.figure(figsize=(16, 12))
    pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
    # Draw edges with labels
    nx.draw_networkx_edges(G, pos, alpha=0.3, edge_color='gray')
    edge_labels = nx.get_edge_attributes(G, 'relation')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=6)
    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=800, alpha=0.7)
    nx.draw_networkx_labels(G, pos, font_size=7)
    plt.title("Wikipedia Knowledge Graph")
    plt.axis('off')
    plt.tight_layout()
    plt.savefig(filename, dpi=150, bbox_inches='tight')
    print(f"Graph saved to {filename}")

visualize_graph(G)
Step 5: Export Knowledge Graph to Standard Formats
import json

def export_to_json(G: nx.DiGraph, filename: str = "knowledge_graph.json"):
    """Export knowledge graph as node-link JSON."""
    from networkx.readwrite import json_graph
    data = json_graph.node_link_data(G)
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Exported {G.number_of_nodes()} nodes to {filename}")

def export_to_triples(G: nx.DiGraph, filename: str = "knowledge_graph.nt"):
    """Export as N-Triples format."""
    with open(filename, 'w') as f:
        for source, target, data in G.edges(data=True):
            f.write(f'<{source}> <{data["relation"]}> <{target}> .\n')
    print(f"Exported {G.number_of_edges()} triples to {filename}")

export_to_json(G)
export_to_triples(G)
Complete Code Example
from searchhive import ScrapeForge
from SPARQLWrapper import SPARQLWrapper, JSON
import networkx as nx

class WikipediaKnowledgeGraph:
    def __init__(self):
        self.client = ScrapeForge()
        self.graph = nx.DiGraph()

    def query_wikidata(self, sparql: str) -> list[dict]:
        wrapper = SPARQLWrapper("https://query.wikidata.org/sparql",
                                agent="KnowledgeGraphBot/1.0")
        wrapper.setQuery(sparql)
        wrapper.setReturnFormat(JSON)
        results = wrapper.query().convert()
        return results.get("results", {}).get("bindings", [])

    def add_entity(self, qid: str, label: str = ""):
        node = label or qid
        if not self.graph.has_node(node):
            self.graph.add_node(node, qid=qid)
        return node

    def build_from_query(self, sparql: str, subject_key: str = "subLabel",
                         pred_key: str = "propLabel", object_key: str = "objLabel"):
        bindings = self.query_wikidata(sparql)
        for b in bindings:
            subj = b.get(subject_key, {}).get("value", "")
            pred = b.get(pred_key, {}).get("value", "")
            obj = b.get(object_key, {}).get("value", "")
            if subj and pred and obj:
                self.graph.add_edge(subj, obj, relation=pred)
        return len(bindings)

# Usage
kg = WikipediaKnowledgeGraph()

# Build a graph around "Artificial intelligence" (Q11660)
query = """
SELECT ?subLabel ?propLabel ?objLabel WHERE {
  BIND(wd:Q11660 AS ?sub)
  ?sub ?p ?obj.
  ?prop wikibase:directClaim ?p.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
count = kg.build_from_query(query)
print(f"Built knowledge graph: {kg.graph.number_of_nodes()} nodes, {kg.graph.number_of_edges()} edges")
Common Issues
Wikidata SPARQL queries timeout on complex queries
Break complex queries into smaller ones. Use LIMIT and OFFSET for pagination. For very large extractions, use Wikidata dumps instead of the SPARQL endpoint.
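The LIMIT/OFFSET approach can be sketched as a small helper. This is a minimal version, assuming the `query_wikidata()` function from step 2 (passed in here as `run_query` so the sketch stays self-contained); it stops as soon as a page comes back short.

```python
def paged_queries(base_query: str, page_size: int = 500, max_pages: int = 10):
    """Yield the base query with LIMIT/OFFSET appended for each page."""
    for page in range(max_pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"

def fetch_all(base_query: str, run_query, page_size: int = 500) -> list[dict]:
    """Collect rows page by page, stopping at the first short or empty page."""
    rows: list[dict] = []
    for query in paged_queries(base_query, page_size):
        batch = run_query(query)
        rows.extend(batch)
        if len(batch) < page_size:  # last page reached
            break
    return rows
```

Keeping each page small also makes a timeout cheap to retry: you only re-run one page, not the whole extraction.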
Wikipedia API returns 403 (rate limited)
The API allows 200 requests/minute for unregistered users. Use SearchHive for batch scraping — its proxy rotation distributes requests across IPs.
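If you stay with direct API calls, a client-side guard helps you avoid the 403 in the first place. A minimal sketch, assuming the 200 requests/minute figure above; `fetch` stands in for whatever function performs one request, and the `RuntimeError` is a stand-in for your HTTP client's rate-limit exception:

```python
import time

def throttled(fetch, min_interval: float = 60 / 200, retries: int = 3,
              backoff: float = 1.0):
    """Wrap a request function with pacing and exponential backoff."""
    last_call = [0.0]  # mutable closure cell holding the last call time

    def wrapper(*args, **kwargs):
        for attempt in range(retries + 1):
            # Pace: wait until min_interval has passed since the last call
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            try:
                return fetch(*args, **kwargs)
            except RuntimeError:  # stand-in for an HTTP 403/429 error
                if attempt == retries:
                    raise
                time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s ...
    return wrapper
```

Wrap your article fetcher once (`get_article = throttled(get_article)`) and every call is paced automatically.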
Infoboxes have inconsistent formats across articles
Different Wikipedia templates use different field names. Write normalization logic for common fields (founded, ceo, revenue, etc.) rather than relying on exact field names.
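One way to sketch that normalization is an alias table mapping raw labels onto canonical keys. The aliases below are illustrative, not a complete survey of Wikipedia's templates; extend the map for the article types you actually scrape.

```python
# Hypothetical alias map: canonical key -> set of lowercase infobox labels
FIELD_ALIASES = {
    "founded": {"founded", "formation", "established", "inception"},
    "ceo": {"ceo", "chief executive officer", "key people"},
    "revenue": {"revenue", "turnover"},
    "headquarters": {"headquarters", "hq location", "location"},
}

def normalize_infobox(infobox: dict) -> dict:
    """Map raw infobox labels onto canonical field names where possible."""
    normalized = {}
    for label, value in infobox.items():
        key = label.strip().lower()
        for canonical, aliases in FIELD_ALIASES.items():
            if key in aliases:
                normalized[canonical] = value
                break
        else:
            normalized[key] = value  # keep unknown fields under their raw label
    return normalized
```

Run this on the `infobox` dict returned by `scrape_infobox()` before loading the fields into your graph, so "Formation" and "Founded" land on the same node attribute.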
NetworkX graph becomes too large to visualize
For graphs with more than 500 nodes, filter by degree centrality or use Gephi for interactive visualization.
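A degree-based filter can be sketched like this, assuming a graph built as in step 4; the 500-node cutoff is just the rule of thumb from the note above:

```python
import networkx as nx

def top_nodes_subgraph(G: nx.DiGraph, max_nodes: int = 500) -> nx.DiGraph:
    """Return the subgraph induced by the max_nodes highest-degree nodes."""
    if G.number_of_nodes() <= max_nodes:
        return G
    # G.degree yields (node, degree) pairs; rank by degree, descending
    ranked = sorted(G.degree, key=lambda pair: pair[1], reverse=True)
    keep = [node for node, _ in ranked[:max_nodes]]
    return G.subgraph(keep).copy()
```

Pass the result to `visualize_graph()` instead of the full graph; the hub entities survive the cut, which is usually what you want to see in a knowledge-graph plot.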
Next Steps
- Use SearchHive DeepDive for extracting structured data from non-Wikipedia sources to enrich your knowledge graph
- Check /blog/how-to-extract-contact-info-from-websites-with-python for extracting entity relationships from business websites
- Explore /compare/apify for how SearchHive compares to other scraping APIs for large-scale data extraction
Start building knowledge graphs with SearchHive's free tier — 50,000 requests/month with proxy rotation. Scrape Wikipedia at scale without hitting rate limits. Read the docs.