Building a news aggregator means solving three problems: finding articles, extracting content, and presenting it in a useful format. This tutorial walks through building a complete news aggregation pipeline using Python and web APIs -- from search to structured data output.
Key Takeaways
- A news aggregator needs three components: discovery (search), extraction (scraping), and organization
- SearchHive's SwiftSearch handles article discovery; ScrapeForge handles content extraction
- The full pipeline fits in roughly 150 lines of Python
- Clean markdown output from ScrapeForge works directly with LLM summarization
- SQLite is sufficient for most aggregator databases -- skip PostgreSQL until you need it
Prerequisites
- Python 3.9+
- requests library (pip install requests)
- A SearchHive API key (free tier with 500 credits)
Architecture Overview
Search Query -> SwiftSearch API -> Article URLs
|
Article URLs -> ScrapeForge API -> Markdown Content
|
Markdown -> SQLite Database -> API / Frontend
One search call per topic, one scrape call per article, one local database insert. Linear and simple.
Step 1: Search for Articles
Use SwiftSearch to discover recent articles matching your topics.
import requests
from datetime import datetime, timedelta
SEARCHHIVE_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {SEARCHHIVE_KEY}"}
def search_articles(topic: str, hours_back: int = 24) -> list:
"""Search for recent articles on a topic."""
resp = requests.get(
"https://api.searchhive.dev/v1/search",
headers=headers,
params={
"q": f"{topic} news",
"limit": 10,
"fresh": "day" # Results from last 24 hours
}
)
    resp.raise_for_status()
    results = resp.json().get("results", [])
return [
{
"title": r["title"],
"url": r["url"],
"snippet": r.get("snippet", ""),
"source": r.get("source", ""),
}
for r in results
]
# Example: find AI news
articles = search_articles("artificial intelligence")
for a in articles:
print(f"[{a['source']}] {a['title']}")
print(f" {a['url']}")
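Topic queries overlap, so the same story can surface under more than one search. Deduplicating before scraping saves credits. Here's a sketch; the tracking-parameter list is an assumption you'd extend for the sources you aggregate:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that track campaigns rather than identify content
# (an assumed starter list -- extend it for your sources).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Strip tracking parameters and fragments so duplicate articles compare equal."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def dedupe_articles(articles: list) -> list:
    """Keep the first article seen for each normalized URL."""
    seen, unique = set(), []
    for a in articles:
        key = normalize_url(a["url"])
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique
```

Run search results through dedupe_articles before the scrape step; the UNIQUE constraint in the database catches stragglers, but skipping the scrape call entirely is what saves credits.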
Step 2: Extract Article Content
ScrapeForge renders JavaScript and returns clean markdown -- no parsing HTML manually.
from typing import Optional

def extract_article(url: str) -> Optional[dict]:
    """Extract clean content from a news article URL; returns None on failure."""
resp = requests.post(
"https://api.searchhive.dev/v1/scrape",
headers=headers,
json={
"url": url,
"render_js": True,
"format": "markdown",
"wait_for": 2000,
}
)
if resp.status_code != 200:
return None
data = resp.json()
return {
"title": data.get("title", ""),
"content": data.get("markdown", ""),
"word_count": len(data.get("markdown", "").split()),
"url": url,
}
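News pages fail transiently: slow renders, rate limits, flaky origins. A small retry helper (a local convenience, not part of the API) keeps one bad response from dropping an article out of a run:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff.

    Returns fn()'s result, or re-raises the last exception after all attempts.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Usage would look like with_retries(lambda: extract_article(url)) in the pipeline loop.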
Step 3: Store in SQLite
SQLite handles most aggregator workloads. No separate database server needed.
import sqlite3
from datetime import datetime
def init_db():
"""Create the articles database."""
conn = sqlite3.connect("news_aggregator.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
url TEXT UNIQUE NOT NULL,
source TEXT,
content TEXT,
word_count INTEGER,
topic TEXT,
                scraped_at TEXT
            )
""")
conn.commit()
conn.close()
def save_article(article: dict, topic: str):
"""Save an article to the database."""
conn = sqlite3.connect("news_aggregator.db")
try:
conn.execute(
"""INSERT OR IGNORE INTO articles
(title, url, source, content, word_count, topic, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(
article["title"],
article["url"],
article.get("source", ""),
article.get("content", ""),
article.get("word_count", 0),
topic,
datetime.utcnow().isoformat(),
)
)
conn.commit()
except sqlite3.IntegrityError:
pass # Article already exists
finally:
conn.close()
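Every scrape call spends credits, so it pays to check the database before re-fetching a URL that already appeared in an earlier run. A hypothetical helper, written against an open connection so it works with any database path:

```python
import sqlite3

def url_exists(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if this article URL is already stored (so we can skip scraping)."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is not None
```

In the pipeline, call url_exists before extract_article; INSERT OR IGNORE already prevents duplicate rows, but this check prevents the duplicate scrape.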
Step 4: Complete Pipeline
Wire it all together with topic tracking and deduplication.
import time
TOPICS = [
"artificial intelligence",
"web development",
"startup funding",
"python programming",
]
def run_aggregator():
"""Run the full news aggregation pipeline."""
init_db()
for topic in TOPICS:
print(f"\nSearching: {topic}")
# Step 1: Find articles
articles = search_articles(topic, hours_back=24)
print(f" Found {len(articles)} articles")
# Step 2 & 3: Extract and save each article
for article in articles:
print(f" Scraping: {article['title'][:60]}...")
extracted = extract_article(article["url"])
if extracted:
extracted["source"] = article.get("source", "")
save_article(extracted, topic)
time.sleep(1) # Be polite to servers
# Report results
conn = sqlite3.connect("news_aggregator.db")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(f"\nTotal articles in database: {count}")
conn.close()
if __name__ == "__main__":
run_aggregator()
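The loop above scrapes one article at a time. Scraping is I/O-bound, so a small thread pool speeds it up without changing the rest of the pipeline. This is a sketch that takes the fetch function as a parameter; keep max_workers low so you stay as polite as the sleep-based version:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls: list, fetch, max_workers: int = 4) -> list:
    """Fetch several URLs concurrently, dropping entries where fetch returned None.

    `fetch` is any callable taking a URL -- e.g. extract_article above.
    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, urls)
    return [r for r in results if r is not None]
```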
Step 5: Query and Serve Results
from typing import Optional

def get_recent_articles(topic: Optional[str] = None, limit: int = 20) -> list:
    """Query articles from the database."""
conn = sqlite3.connect("news_aggregator.db")
conn.row_factory = sqlite3.Row
if topic:
rows = conn.execute(
"SELECT * FROM articles WHERE topic = ? ORDER BY scraped_at DESC LIMIT ?",
(topic, limit)
).fetchall()
else:
rows = conn.execute(
"SELECT * FROM articles ORDER BY scraped_at DESC LIMIT ?",
(limit,)
).fetchall()
conn.close()
return [dict(row) for row in rows]
def get_summary(topic: str) -> str:
"""Generate a text summary of recent articles for a topic."""
articles = get_recent_articles(topic, limit=10)
if not articles:
return f"No recent articles found for '{topic}'"
lines = [f"Recent news on '{topic}':\n"]
for a in articles:
date = a["scraped_at"][:10]
lines.append(f"- [{date}] {a['title']}")
lines.append(f" {a['content'][:150]}...")
lines.append(f" Source: {a['source']} | {a['word_count']} words")
lines.append("")
return "\n".join(lines)
# Usage
print(get_summary("artificial intelligence"))
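If you want keyword search over stored article bodies, SQLite's FTS5 extension handles it without adding another service. A minimal standalone sketch -- most Python builds ship FTS5, and a production setup would index the existing articles table rather than keeping a separate copy:

```python
import sqlite3

def init_search_index(conn: sqlite3.Connection):
    """Create a full-text index over article titles and bodies (FTS5)."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS articles_fts "
        "USING fts5(title, content)"
    )

def index_article(conn: sqlite3.Connection, title: str, content: str):
    """Add one article to the full-text index."""
    conn.execute("INSERT INTO articles_fts (title, content) VALUES (?, ?)",
                 (title, content))

def search_index(conn: sqlite3.Connection, query: str) -> list:
    """Return matching titles, best match first (FTS5 rank ordering)."""
    rows = conn.execute(
        "SELECT title FROM articles_fts WHERE articles_fts MATCH ? ORDER BY rank",
        (query,)
    ).fetchall()
    return [r[0] for r in rows]
```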
Running on a Schedule
Use the schedule library or a cron job:
# Run every 6 hours
0 */6 * * * cd /path/to/project && python3 aggregator.py >> aggregator.log 2>&1
Cost Estimate
With SearchHive's pricing (1 credit = $0.0001):
- Search calls: ~2 credits per call
- Scrape calls: ~1 credit per page
- 4 topics x 10 articles = 4 search calls + 40 scrape calls = ~48 credits/run
- Running 4x/day = ~5,760 credits/month
- Starter plan ($9/mo for 5K credits) covers lightweight setups
- Builder plan ($49/mo for 100K credits) handles aggressive multi-topic aggregation
The free 500-credit tier is enough to test the full pipeline before committing.
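If you change the topic count or run frequency, the arithmetic is easy to parameterize. A quick estimator using the per-call rates above (assumed constant across plans):

```python
def monthly_credits(topics: int, articles_per_topic: int, runs_per_day: int,
                    search_cost: int = 2, scrape_cost: int = 1,
                    days: int = 30) -> int:
    """Estimate monthly credit spend: searches per run plus scrapes per run,
    multiplied out over the month."""
    per_run = topics * search_cost + topics * articles_per_topic * scrape_cost
    return per_run * runs_per_day * days
```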
Next Steps
Add LLM summarization. The markdown content from ScrapeForge is already LLM-ready. Pipe it through any LLM API to generate summaries, sentiment scores, or topic clustering.
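Long article bodies usually need splitting to fit a model's context window first. A rough word-count chunker that breaks on paragraph boundaries (word counts only approximate tokens, so pick max_words conservatively):

```python
def chunk_markdown(text: str, max_words: int = 700) -> list:
    """Split markdown into chunks of at most max_words, breaking on paragraph
    boundaries where possible. A lone paragraph longer than max_words becomes
    its own chunk rather than being split mid-sentence."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```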
Add RSS feeds. Combine API search with RSS feed parsing for broader coverage.
Add notifications. Send digests via email, Slack, or Discord when new articles match specific keywords.
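As a sketch of the notification idea, here is a digest formatted for a Slack incoming webhook; the webhook URL is a placeholder you would create in Slack's app settings:

```python
import requests

def build_digest(articles: list, topic: str) -> dict:
    """Build a Slack incoming-webhook payload listing new articles as links."""
    lines = [f"*New articles on {topic}:*"]
    for a in articles:
        lines.append(f"• <{a['url']}|{a['title']}>")  # Slack link markup
    return {"text": "\n".join(lines)}

def send_digest(webhook_url: str, payload: dict):
    """POST the digest to a Slack incoming webhook."""
    requests.post(webhook_url, json=payload, timeout=10)

# Usage (webhook URL is a placeholder):
# send_digest("https://hooks.slack.com/services/...", build_digest(articles, "ai"))
```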
Get started with SearchHive free -- 500 credits, no credit card, full API access.
See also: how to build an AI research assistant and best news search APIs.