Building a news aggregator means solving three problems: finding articles, extracting content, and presenting it in a useful format. This tutorial walks through building a complete news aggregation pipeline using Python and web APIs -- from search to structured data output.
Key Takeaways
- A news aggregator needs three components: discovery (search), extraction (scraping), and organization
- SearchHive's SwiftSearch handles article discovery; ScrapeForge handles content extraction
- The full pipeline fits in roughly 150 lines of Python
- Clean markdown output from ScrapeForge works directly with LLM summarization
- SQLite is sufficient for most aggregator databases -- skip PostgreSQL until you need it
Prerequisites
- Python 3.9+
- requests library (pip install requests)
- A SearchHive API key (free tier with 500 credits)
Architecture Overview
Search Query -> SwiftSearch API -> Article URLs
|
Article URLs -> ScrapeForge API -> Markdown Content
|
Markdown -> SQLite Database -> API / Frontend
One search call per topic, one scrape call per article, one local database insert. Linear and simple.
Step 1: Search for Articles
Use SwiftSearch to discover recent articles matching your topics.
import requests
from datetime import datetime, timedelta
SEARCHHIVE_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {SEARCHHIVE_KEY}"}
def search_articles(topic: str, hours_back: int = 24) -> list:
"""Search for recent articles on a topic."""
resp = requests.get(
"https://api.searchhive.dev/v1/search",
headers=headers,
params={
"q": f"{topic} news",
"limit": 10,
"fresh": "day" # Results from last 24 hours
}
)
    resp.raise_for_status()
    results = resp.json().get("results", [])
return [
{
"title": r["title"],
"url": r["url"],
"snippet": r.get("snippet", ""),
"source": r.get("source", ""),
}
for r in results
]
# Example: find AI news
articles = search_articles("artificial intelligence")
for a in articles:
print(f"[{a['source']}] {a['title']}")
print(f" {a['url']}")
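Topic queries overlap, so the same story can surface under more than one search. Deduplicating before scraping saves credits. Here's a sketch; the tracking-parameter list is an assumption you'd extend for the sources you aggregate:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that track campaigns rather than identify content
# (an assumed starter list -- extend it for your sources).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Strip tracking parameters and fragments so duplicate articles compare equal."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def dedupe_articles(articles: list) -> list:
    """Keep the first article seen for each normalized URL."""
    seen, unique = set(), []
    for a in articles:
        key = normalize_url(a["url"])
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique
```

Run search results through dedupe_articles before the scrape step; the UNIQUE constraint in the database catches stragglers, but skipping the scrape call entirely is what saves credits.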
Step 2: Extract Article Content
ScrapeForge renders JavaScript and returns clean markdown -- no parsing HTML manually.
from typing import Optional

def extract_article(url: str) -> Optional[dict]:
    """Extract clean content from a news article URL; returns None on failure."""
resp = requests.post(
"https://api.searchhive.dev/v1/scrape",
headers=headers,
json={
"url": url,
"render_js": True,
"format": "markdown",
"wait_for": 2000,
}
)
if resp.status_code != 200:
return None
data = resp.json()
return {
"title": data.get("title", ""),
"content": data.get("markdown", ""),
"word_count": len(data.get("markdown", "").split()),
"url": url,
}
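News pages fail transiently: slow renders, rate limits, flaky origins. A small retry helper (a local convenience, not part of the API) keeps one bad response from dropping an article out of a run:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff.

    Returns fn()'s result, or re-raises the last exception after all attempts.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Usage would look like with_retries(lambda: extract_article(url)) in the pipeline loop.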
Step 3: Store in SQLite
SQLite handles most aggregator workloads. No separate database server needed.
import sqlite3
from datetime import datetime
def init_db():
"""Create the articles database."""
conn = sqlite3.connect("news_aggregator.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
url TEXT UNIQUE NOT NULL,
source TEXT,
content TEXT,
word_count INTEGER,
topic TEXT,
                scraped_at TEXT
            )
""")
conn.commit()
conn.close()
def save_article(article: dict, topic: str):
"""Save an article to the database."""
conn = sqlite3.connect("news_aggregator.db")
try:
conn.execute(
"""INSERT OR IGNORE INTO articles
(title, url, source, content, word_count, topic, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(
article["title"],
article["url"],
article.get("source", ""),
article.get("content", ""),
article.get("word_count", 0),
topic,
datetime.utcnow().isoformat(),
)
)
conn.commit()
except sqlite3.IntegrityError:
pass # Article already exists
finally:
conn.close()
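Every scrape call spends credits, so it pays to check the database before re-fetching a URL that already appeared in an earlier run. A hypothetical helper, written against an open connection so it works with any database path:

```python
import sqlite3

def url_exists(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if this article URL is already stored (so we can skip scraping)."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is not None
```

In the pipeline, call url_exists before extract_article; INSERT OR IGNORE already prevents duplicate rows, but this check prevents the duplicate scrape.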
Step 4: Complete Pipeline
Wire it all together with topic tracking and deduplication.
import time
TOPICS = [
"artificial intelligence",
"web development",
"startup funding",
"python programming",
]
def run_aggregator():
"""Run the full news aggregation pipeline."""
init_db()
for topic in TOPICS:
print(f"\nSearching: {topic}")
# Step 1: Find articles
articles = search_articles(topic, hours_back=24)
print(f" Found {len(articles)} articles")
# Step 2 & 3: Extract and save each article
for article in articles:
print(f" Scraping: {article['title'][:60]}...")
extracted = extract_article(article["url"])
if extracted:
extracted["source"] = article.get("source", "")
save_article(extracted, topic)
time.sleep(1) # Be polite to servers
# Report results
conn = sqlite3.connect("news_aggregator.db")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(f"\nTotal articles in database: {count}")
conn.close()
if __name__ == "__main__":
run_aggregator()
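The loop above scrapes one article at a time. Scraping is I/O-bound, so a small thread pool speeds it up without changing the rest of the pipeline. This is a sketch that takes the fetch function as a parameter; keep max_workers low so you stay as polite as the sleep-based version:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls: list, fetch, max_workers: int = 4) -> list:
    """Fetch several URLs concurrently, dropping entries where fetch returned None.

    `fetch` is any callable taking a URL -- e.g. extract_article above.
    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, urls)
    return [r for r in results if r is not None]
```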
Step 5: Query and Serve Results
from typing import Optional

def get_recent_articles(topic: Optional[str] = None, limit: int = 20) -> list:
    """Query articles from the database."""
conn = sqlite3.connect("news_aggregator.db")
conn.row_factory = sqlite3.Row
if topic:
rows = conn.execute(
"SELECT * FROM articles WHERE topic = ? ORDER BY scraped_at DESC LIMIT ?",
(topic, limit)
).fetchall()
else:
rows = conn.execute(
"SELECT * FROM articles ORDER BY scraped_at DESC LIMIT ?",
(limit,)
).fetchall()
conn.close()
return [dict(row) for row in rows]
def get_summary(topic: str) -> str:
"""Generate a text summary of recent articles for a topic."""
articles = get_recent_articles(topic, limit=10)
if not articles:
return f"No recent articles found for '{topic}'"
lines = [f"Recent news on '{topic}':\n"]
for a in articles:
date = a["scraped_at"][:10]
lines.append(f"- [{date}] {a['title']}")
lines.append(f" {a['content'][:150]}...")
lines.append(f" Source: {a['source']} | {a['word_count']} words")
lines.append("")
return "\n".join(lines)
# Usage
print(get_summary("artificial intelligence"))
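If you want keyword search over stored article bodies, SQLite's FTS5 extension handles it without adding another service. A minimal standalone sketch -- most Python builds ship FTS5, and a production setup would index the existing articles table rather than keeping a separate copy:

```python
import sqlite3

def init_search_index(conn: sqlite3.Connection):
    """Create a full-text index over article titles and bodies (FTS5)."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS articles_fts "
        "USING fts5(title, content)"
    )

def index_article(conn: sqlite3.Connection, title: str, content: str):
    """Add one article to the full-text index."""
    conn.execute("INSERT INTO articles_fts (title, content) VALUES (?, ?)",
                 (title, content))

def search_index(conn: sqlite3.Connection, query: str) -> list:
    """Return matching titles, best match first (FTS5 rank ordering)."""
    rows = conn.execute(
        "SELECT title FROM articles_fts WHERE articles_fts MATCH ? ORDER BY rank",
        (query,)
    ).fetchall()
    return [r[0] for r in rows]
```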
Running on a Schedule
Use the schedule library or a cron job:
# Run every 6 hours
0 */6 * * * cd /path/to/project && python3 aggregator.py >> aggregator.log 2>&1
Cost Estimate
With SearchHive's pricing (1 credit = $0.0001):
- Search calls: ~2 credits per call
- Scrape calls: ~1 credit per page
- 4 topics x 10 articles = 4 search calls + 40 scrape calls = ~48 credits/run
- Running 4x/day = ~5,760 credits/month
- Starter plan ($9/mo for 5K credits) covers lightweight setups
- Builder plan ($49/mo for 100K credits) handles aggressive multi-topic aggregation
The free 500-credit tier is enough to test the full pipeline before committing.
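If you change the topic count or run frequency, the arithmetic is easy to parameterize. A quick estimator using the per-call rates above (assumed constant across plans):

```python
def monthly_credits(topics: int, articles_per_topic: int, runs_per_day: int,
                    search_cost: int = 2, scrape_cost: int = 1,
                    days: int = 30) -> int:
    """Estimate monthly credit spend: searches per run plus scrapes per run,
    multiplied out over the month."""
    per_run = topics * search_cost + topics * articles_per_topic * scrape_cost
    return per_run * runs_per_day * days
```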
Next Steps
Add LLM summarization. The markdown content from ScrapeForge is already LLM-ready. Pipe it through any LLM API to generate summaries, sentiment scores, or topic clustering.
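Long article bodies usually need splitting to fit a model's context window first. A rough word-count chunker that breaks on paragraph boundaries (word counts only approximate tokens, so pick max_words conservatively):

```python
def chunk_markdown(text: str, max_words: int = 700) -> list:
    """Split markdown into chunks of at most max_words, breaking on paragraph
    boundaries where possible. A lone paragraph longer than max_words becomes
    its own chunk rather than being split mid-sentence."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```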
Add RSS feeds. Combine API search with RSS feed parsing for broader coverage.
Add notifications. Send digests via email, Slack, or Discord when new articles match specific keywords.
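As a sketch of the notification idea, here is a digest formatted for a Slack incoming webhook; the webhook URL is a placeholder you would create in Slack's app settings:

```python
import requests

def build_digest(articles: list, topic: str) -> dict:
    """Build a Slack incoming-webhook payload listing new articles as links."""
    lines = [f"*New articles on {topic}:*"]
    for a in articles:
        lines.append(f"• <{a['url']}|{a['title']}>")  # Slack link markup
    return {"text": "\n".join(lines)}

def send_digest(webhook_url: str, payload: dict):
    """POST the digest to a Slack incoming webhook."""
    requests.post(webhook_url, json=payload, timeout=10)

# Usage (webhook URL is a placeholder):
# send_digest("https://hooks.slack.com/services/...", build_digest(articles, "ai"))
```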
Get started with SearchHive free -- 500 credits, no credit card, full API access.
See also: how to build an AI research assistant and best news search APIs.