How to Scrape Wikipedia Data for Knowledge Graphs
Wikipedia is one of the richest sources of structured knowledge on the internet — millions of articles with infoboxes, categories, links, and Wikidata connections. If you're building a knowledge graph for NLP, search, or recommendation systems, scraping Wikipedia data gives you a massive head start.
This tutorial covers three approaches: the Wikipedia API for article content, Wikidata SPARQL for structured triples, and SearchHive for scraping when APIs aren't enough. Each includes working Python code ready to run.
Key Takeaways
- Wikipedia API returns article text, links, categories, and infobox data as structured JSON
- Wikidata SPARQL queries give you subject-predicate-object triples ready for knowledge graph construction
- SearchHive ScrapeForge handles full-page scraping including tables, infoboxes, and dynamic content
- Combining all three sources gives you the most complete knowledge graph possible
- Wikipedia's rate limits are generous (200 requests/minute for unregistered users; use SearchHive for higher volumes)
Prerequisites
pip install searchhive wikipedia-api SPARQLWrapper networkx matplotlib
- searchhive — Web scraping API (free tier: 50K requests/month)
- wikipedia-api — Python wrapper for the Wikipedia API
- SPARQLWrapper — Query Wikidata's SPARQL endpoint
- networkx + matplotlib — Visualize the knowledge graph
Step 1: Fetch Wikipedia Articles via API
The Wikipedia API gives you structured access to article content, summaries, links, and categories:
import wikipediaapi

def get_article(title: str) -> dict:
    """Fetch a Wikipedia article with metadata."""
    wiki = wikipediaapi.Wikipedia('KnowledgeGraphBot/1.0', 'en')
    page = wiki.page(title)
    if not page.exists():
        return {"error": f"Article '{title}' not found"}
    return {
        'title': page.title,
        'summary': page.summary,
        'full_text': page.text,
        'categories': [c.split(':')[-1] for c in page.categories.keys()],
        'links': list(page.links.keys())[:100],
        'backlinks': list(page.backlinks.keys())[:100],
        'sections': [s.title for s in page.sections],
    }
# Usage
article = get_article("Machine learning")
print(f"Title: {article['title']}")
print(f"Summary: {article['summary'][:200]}...")
print(f"Categories: {article['categories'][:10]}")
print(f"Linked articles: {len(article['links'])}")
Extract Section-Level Content
For knowledge graphs, you often want structured section data rather than full text:
def get_sections(page) -> list[dict]:
    """Extract all sections with their content."""
    sections = []

    def traverse(section, depth=0):
        sections.append({
            'title': section.title,
            'text': section.text[:1000],  # First 1000 chars
            'depth': depth,
            'subsections': len(section.sections),
        })
        for subsection in section.sections:
            traverse(subsection, depth + 1)

    for section in page.sections:
        traverse(section)
    return sections

wiki = wikipediaapi.Wikipedia('KnowledgeGraphBot/1.0', 'en')
page = wiki.page("Artificial intelligence")
sections = get_sections(page)
for s in sections[:10]:
    indent = "  " * s['depth']
    print(f"{indent}{s['title']} ({len(s['text'])} chars)")
Step 2: Query Wikidata for Structured Triples
Wikidata stores structured data as subject-predicate-object triples — exactly what you need for a knowledge graph:
from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata(sparql_query: str) -> list[dict]:
    """Execute a SPARQL query against Wikidata."""
    # Wikidata rejects default user agents -- always set a descriptive one
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="KnowledgeGraphBot/1.0")
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    bindings = results.get("results", {}).get("bindings", [])
    # Flatten bindings to simple dicts
    rows = []
    for binding in bindings:
        row = {}
        for key, value in binding.items():
            row[key] = value.get("value", "")
            # Extract Q-id if it's a Wikidata entity
            if 'wikidata.org/entity/' in row[key]:
                row[f"{key}_id"] = row[key].split('/')[-1]
        rows.append(row)
    return rows
Get Properties for a Specific Entity
# Get all properties for "Machine learning" (Q2539)
query = """
SELECT ?propertyLabel ?value ?valueLabel WHERE {
  wd:Q2539 ?prop ?statement.
  ?statement ?ps ?value.
  ?property wikibase:claim ?prop.
  ?property wikibase:statementProperty ?ps.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""
properties = query_wikidata(query)
for p in properties[:10]:
    print(f"{p.get('propertyLabel', '?')}: {p.get('valueLabel', p.get('value', '?'))}")
Find Entities by Category
# Find all companies in the AI industry
query = """
SELECT ?company ?companyLabel ?founded WHERE {
  ?company wdt:P31 wd:Q4830453;   # Instance of: business
           wdt:P452 wd:Q11660;    # Industry: artificial intelligence
           wdt:P571 ?founded.     # Inception date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?founded)
LIMIT 30
"""
ai_companies = query_wikidata(query)
for c in ai_companies:
    print(f"{c.get('companyLabel')}: founded {c.get('founded', '?')[:10]}")
Step 3: Scrape Wikipedia Infoboxes with SearchHive
Infoboxes contain the most structured data in Wikipedia articles — but extracting them from the API requires parsing wikitext. SearchHive can scrape the rendered HTML directly:
from searchhive import ScrapeForge

def scrape_infobox(article_title: str) -> dict:
    """Scrape a Wikipedia infobox using SearchHive."""
    client = ScrapeForge()
    url = f"https://en.wikipedia.org/wiki/{article_title.replace(' ', '_')}"
    result = client.scrape(
        url=url,
        selectors={
            'title': 'h1.firstHeading',
            'infobox_rows': {
                'each': 'table.infobox tr',
                'fields': {
                    'label': 'th',
                    'value': 'td',
                },
            },
            'categories': '#mw-normal-catlinks ul li a',
            'first_paragraph': '#mw-content-text p:first-of-type',
        },
    )
    # Parse infobox rows into key-value pairs
    infobox = {}
    for row in result.data.get('infobox_rows', []):
        label = row.get('label', '').strip()
        value = row.get('value', '').strip()
        if label and value:
            infobox[label] = value
    return {
        'title': result.data.get('title'),
        'infobox': infobox,
        'categories': result.data.get('categories', []),
        'summary': result.data.get('first_paragraph', ''),
    }

# Usage
data = scrape_infobox("Tesla, Inc.")
print(f"Title: {data['title']}")
print(f"Infobox fields: {len(data['infobox'])}")
for key, val in list(data['infobox'].items())[:8]:
    print(f"  {key}: {val}")
Batch Scrape Multiple Articles
def scrape_multiple_articles(titles: list[str]) -> list[dict]:
    """Scrape infoboxes from multiple Wikipedia articles."""
    client = ScrapeForge()
    urls = [
        f"https://en.wikipedia.org/wiki/{t.replace(' ', '_')}"
        for t in titles
    ]
    results = client.scrape_batch(
        urls,
        selectors={
            'title': 'h1.firstHeading',
            'infobox_rows': {
                'each': 'table.infobox tr',
                'fields': {'label': 'th', 'value': 'td'}
            },
            'links': '#mw-content-text a[href^="/wiki/"] @href',
        },
        concurrency=5
    )
    articles = []
    for r in results:
        if r.success and r.data:
            infobox = {}
            for row in r.data.get('infobox_rows', []):
                label = row.get('label', '').strip()
                value = row.get('value', '').strip()
                if label and value:
                    infobox[label] = value
            articles.append({
                'url': r.url,
                'title': r.data.get('title'),
                'infobox': infobox,
                'links': [l.split('/')[-1] for l in r.data.get('links', [])[:200]],
            })
    return articles

articles = scrape_multiple_articles(["Python (programming language)", "JavaScript", "Rust (programming language)"])
for a in articles:
    print(f"{a['title']}: {len(a['infobox'])} infobox fields, {len(a['links'])} links")
Step 4: Build the Knowledge Graph with NetworkX
Convert your scraped data into a graph structure:
import networkx as nx
from SPARQLWrapper import SPARQLWrapper, JSON

def build_knowledge_graph(entity: str) -> nx.DiGraph:
    """Build a knowledge graph starting from a Wikidata entity."""
    G = nx.DiGraph()
    # Query the entity's direct (truthy) claims and label them
    query = f"""
    SELECT ?item ?itemLabel ?propertyLabel WHERE {{
      wd:{entity} ?p ?item.
      ?property wikibase:directClaim ?p.
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT 100
    """
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="KnowledgeGraphBot/1.0")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results['results']['bindings']:
        item = binding.get('item', {}).get('value', '')
        item_label = binding.get('itemLabel', {}).get('value', '')
        prop_label = binding.get('propertyLabel', {}).get('value', '')
        if item and prop_label:
            G.add_edge(entity, item_label or item, relation=prop_label)
    return G

# Usage
G = build_knowledge_graph("Q2539")  # Machine learning
print(f"Knowledge graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
Visualize the Graph
import matplotlib.pyplot as plt

def visualize_graph(G: nx.DiGraph, filename: str = "knowledge_graph.png"):
    """Visualize a knowledge graph."""
    plt.figure(figsize=(16, 12))
    pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
    # Draw edges with labels
    nx.draw_networkx_edges(G, pos, alpha=0.3, edge_color='gray')
    edge_labels = nx.get_edge_attributes(G, 'relation')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=6)
    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=800, alpha=0.7)
    nx.draw_networkx_labels(G, pos, font_size=7)
    plt.title("Wikipedia Knowledge Graph")
    plt.axis('off')
    plt.tight_layout()
    plt.savefig(filename, dpi=150, bbox_inches='tight')
    print(f"Graph saved to {filename}")

visualize_graph(G)
Step 5: Export Knowledge Graph to Standard Formats
import json

def export_to_json(G: nx.DiGraph, filename: str = "knowledge_graph.json"):
    """Export knowledge graph as node-link JSON."""
    from networkx.readwrite import json_graph
    data = json_graph.node_link_data(G)
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Exported {G.number_of_nodes()} nodes to {filename}")

def export_to_triples(G: nx.DiGraph, filename: str = "knowledge_graph.nt"):
    """Export as N-Triples format."""
    with open(filename, 'w') as f:
        for source, target, data in G.edges(data=True):
            f.write(f'<{source}> <{data["relation"]}> <{target}> .\n')
    print(f"Exported {G.number_of_edges()} triples to {filename}")

export_to_json(G)
export_to_triples(G)
Complete Code Example
from searchhive import ScrapeForge
from SPARQLWrapper import SPARQLWrapper, JSON
import networkx as nx

class WikipediaKnowledgeGraph:
    def __init__(self):
        self.client = ScrapeForge()
        self.graph = nx.DiGraph()

    def query_wikidata(self, sparql: str) -> list[dict]:
        wrapper = SPARQLWrapper("https://query.wikidata.org/sparql",
                                agent="KnowledgeGraphBot/1.0")
        wrapper.setQuery(sparql)
        wrapper.setReturnFormat(JSON)
        results = wrapper.query().convert()
        return results.get("results", {}).get("bindings", [])

    def add_entity(self, qid: str, label: str = ""):
        node = label or qid
        if not self.graph.has_node(node):
            self.graph.add_node(node, qid=qid)
        return node

    def build_from_query(self, sparql: str, subject_key: str = "subLabel",
                         pred_key: str = "propLabel", object_key: str = "objLabel"):
        bindings = self.query_wikidata(sparql)
        for b in bindings:
            subj = b.get(subject_key, {}).get("value", "")
            pred = b.get(pred_key, {}).get("value", "")
            obj = b.get(object_key, {}).get("value", "")
            if subj and pred and obj:
                self.graph.add_edge(subj, obj, relation=pred)
        return len(bindings)

# Usage
kg = WikipediaKnowledgeGraph()

# Build a graph around "Artificial intelligence" (Q11660)
query = """
SELECT ?subLabel ?propLabel ?objLabel WHERE {
  BIND(wd:Q11660 AS ?sub)
  ?sub ?p ?obj.
  ?prop wikibase:directClaim ?p.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
count = kg.build_from_query(query)
print(f"Built knowledge graph: {kg.graph.number_of_nodes()} nodes, {kg.graph.number_of_edges()} edges")
Common Issues
Wikidata SPARQL queries timeout on complex queries
Break complex queries into smaller ones. Use LIMIT and OFFSET for pagination. For very large extractions, use Wikidata dumps instead of the SPARQL endpoint.
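The LIMIT/OFFSET approach can be sketched as a small helper. This is a minimal version, assuming the `query_wikidata()` function from step 2 (passed in here as `run_query` so the sketch stays self-contained); it stops as soon as a page comes back short.

```python
def paged_queries(base_query: str, page_size: int = 500, max_pages: int = 10):
    """Yield the base query with LIMIT/OFFSET appended for each page."""
    for page in range(max_pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"

def fetch_all(base_query: str, run_query, page_size: int = 500) -> list[dict]:
    """Collect rows page by page, stopping at the first short or empty page."""
    rows: list[dict] = []
    for query in paged_queries(base_query, page_size):
        batch = run_query(query)
        rows.extend(batch)
        if len(batch) < page_size:  # last page reached
            break
    return rows
```

Keeping each page small also makes a timeout cheap to retry: you only re-run one page, not the whole extraction.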
Wikipedia API returns 403 (rate limited)
The API allows 200 requests/minute for unregistered users. Use SearchHive for batch scraping — its proxy rotation distributes requests across IPs.
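If you stay with direct API calls, a client-side guard helps you avoid the 403 in the first place. A minimal sketch, assuming the 200 requests/minute figure above; `fetch` stands in for whatever function performs one request, and the `RuntimeError` is a stand-in for your HTTP client's rate-limit exception:

```python
import time

def throttled(fetch, min_interval: float = 60 / 200, retries: int = 3,
              backoff: float = 1.0):
    """Wrap a request function with pacing and exponential backoff."""
    last_call = [0.0]  # mutable closure cell holding the last call time

    def wrapper(*args, **kwargs):
        for attempt in range(retries + 1):
            # Pace: wait until min_interval has passed since the last call
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            try:
                return fetch(*args, **kwargs)
            except RuntimeError:  # stand-in for an HTTP 403/429 error
                if attempt == retries:
                    raise
                time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s ...
    return wrapper
```

Wrap your article fetcher once (`get_article = throttled(get_article)`) and every call is paced automatically.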
Infoboxes have inconsistent formats across articles
Different Wikipedia templates use different field names. Write normalization logic for common fields (founded, ceo, revenue, etc.) rather than relying on exact field names.
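One way to sketch that normalization is an alias table mapping raw labels onto canonical keys. The aliases below are illustrative, not a complete survey of Wikipedia's templates; extend the map for the article types you actually scrape.

```python
# Hypothetical alias map: canonical key -> set of lowercase infobox labels
FIELD_ALIASES = {
    "founded": {"founded", "formation", "established", "inception"},
    "ceo": {"ceo", "chief executive officer", "key people"},
    "revenue": {"revenue", "turnover"},
    "headquarters": {"headquarters", "hq location", "location"},
}

def normalize_infobox(infobox: dict) -> dict:
    """Map raw infobox labels onto canonical field names where possible."""
    normalized = {}
    for label, value in infobox.items():
        key = label.strip().lower()
        for canonical, aliases in FIELD_ALIASES.items():
            if key in aliases:
                normalized[canonical] = value
                break
        else:
            normalized[key] = value  # keep unknown fields under their raw label
    return normalized
```

Run this on the `infobox` dict returned by `scrape_infobox()` before loading the fields into your graph, so "Formation" and "Founded" land on the same node attribute.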
NetworkX graph becomes too large to visualize
For graphs with more than 500 nodes, filter by degree centrality or use Gephi for interactive visualization.
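A degree-based filter can be sketched like this, assuming a graph built as in step 4; the 500-node cutoff is just the rule of thumb from the note above:

```python
import networkx as nx

def top_nodes_subgraph(G: nx.DiGraph, max_nodes: int = 500) -> nx.DiGraph:
    """Return the subgraph induced by the max_nodes highest-degree nodes."""
    if G.number_of_nodes() <= max_nodes:
        return G
    # G.degree yields (node, degree) pairs; rank by degree, descending
    ranked = sorted(G.degree, key=lambda pair: pair[1], reverse=True)
    keep = [node for node, _ in ranked[:max_nodes]]
    return G.subgraph(keep).copy()
```

Pass the result to `visualize_graph()` instead of the full graph; the hub entities survive the cut, which is usually what you want to see in a knowledge-graph plot.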
Next Steps
- Use SearchHive DeepDive for extracting structured data from non-Wikipedia sources to enrich your knowledge graph
- Check /blog/how-to-extract-contact-info-from-websites-with-python for extracting entity relationships from business websites
- Explore /compare/apify for how SearchHive compares to other scraping APIs for large-scale data extraction
Start building knowledge graphs with SearchHive's free tier — 50,000 requests/month with proxy rotation. Scrape Wikipedia at scale without hitting rate limits. Read the docs.