How to Build a Real-Time News Scraper with Python
Real-time news monitoring powers trading signals, competitive intelligence, brand monitoring, and research workflows. Whether you're tracking industry trends, monitoring press coverage, or building a news aggregation platform, Python has the tools to get it done.
This tutorial builds a production-ready news scraper from scratch — RSS feed polling, full-text extraction, deduplication, and analysis — using battle-tested Python libraries and SearchHive for content enrichment.
Key Takeaways
- feedparser is the standard for RSS/Atom parsing — handles all feed formats with zero config
- trafilatura is the best open-source library for full-text article extraction (better than newspaper3k)
- Google News RSS still works in 2026 for topic-based and keyword-based news monitoring
- Real-time monitoring uses polling with conditional GET (ETag/Last-Modified) to minimize bandwidth
- SearchHive's DeepDive adds sentiment analysis, entity extraction, and topic classification to raw articles
Prerequisites
- Python 3.9+
- A SearchHive API key for content enrichment (free tier available)
- Optional: PostgreSQL for storage, Redis for deduplication
```shell
pip install feedparser trafilatura httpx searchhive

# Optional dependencies
pip install readability-lxml newspaper3k apscheduler
```
Step 1: Set Up RSS Feed Parsing
RSS feeds are the most reliable way to monitor news. They're structured, lightweight, and designed for automated consumption.
```python
import feedparser

def parse_feed(feed_url):
    """Parse an RSS or Atom feed and return entries."""
    feed = feedparser.parse(feed_url)
    if feed.bozo and not feed.entries:
        print(f"Feed parse error: {feed.bozo_exception}")
        return []

    entries = []
    for entry in feed.entries:
        entries.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "author": entry.get("author", ""),
            "source": feed.feed.get("title", "Unknown"),
        })
    return entries

# Google News RSS — still working in 2026
entries = parse_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en")
for e in entries[:5]:
    print(f"[{e['source']}] {e['title']}")
    print(f"  {e['link']}")
```
Key RSS Feed Sources
| Source | URL | Notes |
|---|---|---|
| Google News (top) | news.google.com/rss | 141+ countries, 41+ languages |
| Google News (search) | news.google.com/rss/search?q=QUERY | Keyword-based monitoring |
| Reuters | feeds.reuters.com/reuters/topNews | Major global news wire |
| BBC News | feeds.bbci.co.uk/news/rss.xml | UK/international focus |
| NYT | rss.nytimes.com/services/xml/rss/nyt/HomePage.xml | US focus |
| The Guardian | theguardian.com/world/rss | Global news |
| NPR | feeds.npr.org/1001/rss.xml | US public radio |
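The search endpoint in the table above accepts arbitrary keyword queries. A small helper can build those URLs for the `parse_feed()` function from Step 1 (a sketch; the function name is our own choice):

```python
from urllib.parse import quote_plus

def google_news_search_feed(query, hl="en-US", gl="US"):
    """Build a Google News RSS search URL for keyword-based monitoring."""
    lang = hl.split("-")[0]  # "en-US" -> "en" for the ceid parameter
    return (
        "https://news.google.com/rss/search"
        f"?q={quote_plus(query)}&hl={hl}&gl={gl}&ceid={gl}:{lang}"
    )

# Build a feed URL for a topic, then hand it to parse_feed() from Step 1
url = google_news_search_feed("artificial intelligence regulation")
print(url)
```

Register one of these URLs per topic you want to track, and the rest of the pipeline works unchanged.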
Step 2: Extract Full Article Text
RSS summaries are typically only 100-300 characters. For real analysis, you need the full article body, and trafilatura is the gold standard for extracting it.
```python
import json
import requests
from readability import Document
from trafilatura import fetch_url, extract

def extract_article(url):
    """Extract full article text with trafilatura + readability fallback."""
    # Primary: trafilatura (best accuracy)
    try:
        downloaded = fetch_url(url)
        result = extract(
            downloaded,
            url=url,
            output_format="json",
            with_metadata=True,
            include_comments=False,
            include_tables=True,
            include_links=False,
        )
        if result:
            data = json.loads(result)
            return {
                "title": data.get("title", ""),
                "author": data.get("author", ""),
                "date": data.get("date", ""),
                "text": data.get("text", ""),
                "language": data.get("language", ""),
                "source_domain": data.get("source-hostname", ""),
                "description": data.get("description", ""),
            }
    except Exception as e:
        print(f"Trafilatura failed for {url}: {e}")

    # Fallback: readability-lxml
    try:
        resp = requests.get(url, timeout=15, headers={
            "User-Agent": "Mozilla/5.0 (compatible; NewsBot/1.0)"
        })
        doc = Document(resp.content)
        return {
            "title": doc.title(),
            "text": doc.summary(),
            "fallback": True,
        }
    except Exception as e:
        print(f"Readability fallback also failed: {e}")
    return None

# Test (use a specific article URL, not a section front page, for best results)
article = extract_article("https://www.bbc.com/news")
if article:
    print(f"Title: {article['title']}")
    print(f"Text length: {len(article['text'])} chars")
```
Step 3: Build a Smart Feed Poller
Efficient polling uses conditional GET requests — if a feed hasn't changed, the server returns 304 (Not Modified) and you download nothing.
```python
import feedparser
import hashlib

class SmartFeedPoller:
    """Poll RSS feeds efficiently with deduplication and conditional GET."""

    def __init__(self):
        self.feeds = {}  # url -> {name, etag, modified, seen}

    def add_feed(self, url, name=None):
        """Register a feed to monitor."""
        self.feeds[url] = {
            "name": name or url,
            "etag": None,
            "modified": None,
            "seen": set(),
        }

    def poll(self, url):
        """Poll a single feed for new entries."""
        config = self.feeds[url]

        # Conditional GET headers
        kwargs = {}
        if config["etag"]:
            kwargs["etag"] = config["etag"]
        if config["modified"]:
            kwargs["modified"] = config["modified"]

        feed = feedparser.parse(url, **kwargs)

        # Handle HTTP status (the attribute is absent when the request itself failed)
        status = getattr(feed, "status", None)
        if status == 304:
            return []  # Not modified — skip
        if status is not None and status >= 400:
            print(f"Error {status} for {config['name']}")
            return []

        # Save conditional GET headers
        config["etag"] = feed.get("etag")
        config["modified"] = feed.get("modified")

        # Process new entries only
        new_entries = []
        for entry in feed.entries:
            entry_id = hashlib.md5(
                entry.get("link", entry.get("id", "")).encode()
            ).hexdigest()
            if entry_id not in config["seen"]:
                config["seen"].add(entry_id)
                new_entries.append({
                    "title": entry.get("title", ""),
                    "link": entry.get("link", ""),
                    "published": entry.get("published", ""),
                    "summary": entry.get("summary", ""),
                    "source": config["name"],
                })
        return new_entries

    def poll_all(self):
        """Poll all registered feeds."""
        all_new = []
        for url in self.feeds:
            all_new.extend(self.poll(url))
        return all_new

# Usage
poller = SmartFeedPoller()
poller.add_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en", "Google News")
poller.add_feed("https://feeds.bbci.co.uk/news/rss.xml", "BBC News")
poller.add_feed("http://feeds.reuters.com/reuters/topNews", "Reuters")

new_articles = poller.poll_all()
print(f"Found {len(new_articles)} new articles")
for a in new_articles:
    print(f"  [{a['source']}] {a['title']}")
```
Step 4: Add Full-Text Extraction Pipeline
Combine feed polling with article extraction for a complete pipeline:
```python
import json
from datetime import datetime

class NewsScraperPipeline:
    """Complete news scraping pipeline."""

    def __init__(self, searchhive_key=None):
        self.poller = SmartFeedPoller()
        self.searchhive_key = searchhive_key
        self.articles = []

    def add_feed(self, url, name=None):
        self.poller.add_feed(url, name)

    def run_once(self, extract_full_text=False, enrich=False):
        """Run one polling cycle."""
        new_entries = self.poller.poll_all()
        processed = []
        for entry in new_entries:
            article = {
                "title": entry["title"],
                "url": entry["link"],
                "source": entry["source"],
                "published": entry["published"],
                "scraped_at": datetime.now().isoformat(),
            }

            # Extract full text
            if extract_full_text:
                full = extract_article(entry["link"])
                if full:
                    article["full_text"] = full.get("text", "")
                    article["author"] = full.get("author", "")
                    article["language"] = full.get("language", "")

            # Enrich with SearchHive
            if enrich and self.searchhive_key and entry.get("link"):
                article["analysis"] = self._enrich_article(entry["link"])

            processed.append(article)
            print(f"  Processed: {entry['title'][:80]}")

        self.articles.extend(processed)
        return processed

    def _enrich_article(self, url):
        """Enrich article with SearchHive analysis."""
        try:
            from searchhive import DeepDive
            dd = DeepDive(api_key=self.searchhive_key)
            analysis = dd.analyze(
                url=url,
                summarize=True,
                extract_entities=True
            )
            return {
                "summary": analysis.get("summary", "")[:300],
                "entities": analysis.get("entities", [])[:10],
                "sentiment": analysis.get("sentiment", "neutral"),
            }
        except Exception as e:
            return {"error": str(e)}

    def save(self, filename="news_articles.json"):
        with open(filename, "w") as f:
            json.dump(self.articles, f, indent=2, default=str)
        print(f"Saved {len(self.articles)} articles to {filename}")

# Usage
scraper = NewsScraperPipeline(searchhive_key="your_key")
scraper.add_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en", "Google News")
scraper.add_feed("https://feeds.bbci.co.uk/news/rss.xml", "BBC News")
scraper.add_feed("http://feeds.reuters.com/reuters/topNews", "Reuters")

new = scraper.run_once(extract_full_text=True, enrich=False)
print(f"Processed {len(new)} new articles")
scraper.save()
```
Step 5: Real-Time Monitoring with Scheduling
For true real-time monitoring, run the poller on a schedule:
```python
import time
from datetime import datetime

def run_continuous(pipeline, interval_seconds=300, extract_full_text=True):
    """Run the news scraper continuously."""
    print(f"Starting continuous news monitoring (every {interval_seconds}s)")
    print(f"Feeds: {list(pipeline.poller.feeds.keys())}")
    print("---")

    while True:
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"\n[{timestamp}] Polling feeds...")
        try:
            new = pipeline.run_once(
                extract_full_text=extract_full_text,
                enrich=False  # Enable for SearchHive analysis
            )
            if new:
                print(f"  Found {len(new)} new articles")
                for a in new:
                    print(f"  - [{a['source']}] {a['title'][:70]}")
                pipeline.save()
            else:
                print("  No new articles")
        except Exception as e:
            print(f"  Error: {e}")

        time.sleep(interval_seconds)

# Run the blocking loop
# run_continuous(scraper, interval_seconds=300)  # Every 5 minutes
```
Step 6: Enrich with SearchHive
Raw article text is useful, but enriched data is actionable. SearchHive adds analysis that turns text into intelligence:
```python
from searchhive import DeepDive, SwiftSearch

def analyze_news_article(url, api_key):
    """Deep analysis of a news article."""
    dd = DeepDive(api_key=api_key)
    analysis = dd.analyze(
        url=url,
        summarize=True,
        extract_entities=True,
    )
    return {
        "summary": analysis.get("summary", ""),
        "key_entities": analysis.get("entities", []),
        "topics": analysis.get("topics", []),
    }

def search_news_context(topic, api_key):
    """Find related coverage across the web."""
    search = SwiftSearch(api_key=api_key)
    return search.search(
        query=f"{topic} news analysis",
        extract_fields=["title", "description", "url", "date"]
    )
```
What SearchHive adds to raw news data:
| Raw Article | SearchHive Analysis |
|---|---|
| Full text content | Concise summary (2-3 sentences) |
| Author name | Author profile and other publications |
| Publication date | Trend analysis — is this topic gaining or losing momentum? |
| Source domain | Source credibility assessment and bias detection |
| Plain text | Named entity recognition (people, orgs, locations, dates) |
| Keywords | Topic classification into predefined categories |
Step 7: Complete Production Architecture
For a production news scraper, you'll want this architecture:
```
RSS Feeds (poll every 1-5 min)
        |
        v
feedparser (parse + conditional GET)
        |
        v
Deduplication (Redis URL hash set)
        |
        v
Full-text Extraction (trafilatura + readability fallback)
        |
        v
SearchHive Enrichment (summary, entities, sentiment)
        |
        v
Storage (PostgreSQL + Elasticsearch)
        |
        v
API / Dashboard / Alerts
```
Production Dependencies
```shell
pip install feedparser trafilatura readability-lxml httpx searchhive
pip install redis sqlalchemy psycopg2-binary elasticsearch
pip install apscheduler  # For scheduling
```
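The Redis deduplication stage in the diagram needs only a set and the SADD command. A sketch (the class name and the `news:seen_urls` key are our choices; pass a real `redis.Redis()` client in production):

```python
import hashlib

class UrlDeduper:
    """Mark-and-check URL dedup backed by a Redis set, so seen-state
    survives restarts and is shared across worker processes."""

    def __init__(self, client, key="news:seen_urls"):
        self.client = client  # e.g. redis.Redis(decode_responses=True)
        self.key = key

    def is_new(self, url):
        digest = hashlib.sha256(url.encode()).hexdigest()
        # SADD returns 1 only if the member wasn't already in the set
        return self.client.sadd(self.key, digest) == 1
```

Filter each polling cycle with `[e for e in entries if deduper.is_new(e["link"])]` before running extraction.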
Common Issues
Feed Returns 403 Forbidden
Cause: The publisher blocks non-browser requests. Fix: Send a browser-like User-Agent header. Some feeds (Google News) work reliably; others may require rotating user agents.
Trafilatura Returns Empty Text
Cause: Paywalled content, JavaScript-rendered articles, or blocked requests. Fix: Use readability-lxml as fallback. For JS-heavy sites, use ScrapeForge with Playwright rendering.
Too Many Duplicate Articles
Cause: Multiple outlets covering the same story with different URLs. Fix: Implement fuzzy title matching or content hashing beyond simple URL dedup. SearchHive can help detect duplicate coverage.
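A stdlib-only sketch of the fuzzy-title approach (function names and the 0.85 threshold are our choices; tune the threshold against your own data):

```python
import difflib
import re

def normalize_title(title):
    """Lowercase, strip punctuation, and drop a trailing ' - Publisher' suffix."""
    title = re.sub(r"\s+[-|]\s+[^-|]*$", "", title)
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def is_duplicate_story(title_a, title_b, threshold=0.85):
    """Treat two headlines as the same story when their similarity ratio is high."""
    ratio = difflib.SequenceMatcher(
        None, normalize_title(title_a), normalize_title(title_b)
    ).ratio()
    return ratio >= threshold
```

For exact-duplicate detection at scale, hash the normalized title instead of comparing pairs; reserve the pairwise ratio for candidates that share rare keywords.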
Feed Stops Updating
Cause: Feed URL changed or publisher discontinued the feed. Fix: Implement feed health monitoring — alert when a feed returns 404 or hasn't updated in 24+ hours.
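A minimal health-check sketch (the data shape and function name are ours; feed it the HTTP status and last-new-entry timestamp you already see in each polling cycle):

```python
from datetime import datetime, timedelta, timezone

def check_feed_health(feed_states, max_silence_hours=24):
    """Return alert strings for feeds that errored or have gone quiet.

    feed_states maps a feed name to {"last_new_entry": datetime,
    "last_status": int}; record both values after every poll.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_silence_hours)
    alerts = []
    for name, state in feed_states.items():
        if state.get("last_status", 200) >= 400:
            alerts.append(f"{name}: HTTP {state['last_status']}")
        elif state.get("last_new_entry") and state["last_new_entry"] < cutoff:
            alerts.append(f"{name}: no new entries in {max_silence_hours}h+")
    return alerts
```

Run it once per polling cycle and route non-empty results to your alerting channel.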
Rate Limiting by Publishers
Cause: Requesting full text from too many articles too quickly. Fix: Add 1-3 second delays between article fetches. Respect robots.txt. Limit concurrent connections per domain.
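A per-domain throttle sketch (the class name is ours) that enforces that 1-3 second gap before each fetch:

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_hit = {}  # domain -> last request time (monotonic clock)

    def wait(self, url):
        """Sleep just long enough to respect the per-domain delay, then record the hit."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(domain, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[domain] = time.monotonic()
```

Call `throttle.wait(entry["link"])` immediately before `extract_article(entry["link"])` in the pipeline; requests to different domains are not delayed.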
Next Steps
- Scale with Celery — use distributed workers for high-volume feed monitoring
- Add alerts — trigger SearchHive deep-dives or notifications when specific topics appear
- Build a search index — use ElasticSearch or Meilisearch over extracted articles
- Create a dashboard — Streamlit or Gradio for quick article browsing and filtering
- Monitor feed health — track which feeds are active, their update frequency, and article counts
Start monitoring news intelligently. Get SearchHive's free tier — 100 free requests/month for article analysis and enrichment. See the API documentation for quickstart guides.
See also: How to scrape GitHub data | SearchHive vs ScraperAPI | Python scraping guide