How to Build a Real-Time News Scraper with Python
Real-time news monitoring powers trading signals, competitive intelligence, brand monitoring, and research workflows. Whether you're tracking industry trends, monitoring press coverage, or building a news aggregation platform, Python has the tools to get it done.
This tutorial builds a production-ready news scraper from scratch — RSS feed polling, full-text extraction, deduplication, and analysis — using battle-tested Python libraries and SearchHive for content enrichment.
Key Takeaways
- feedparser is the standard for RSS/Atom parsing — handles all feed formats with zero config
- trafilatura is the best open-source library for full-text article extraction (better than newspaper3k)
- Google News RSS still works in 2026 for topic-based and keyword-based news monitoring
- Real-time monitoring uses polling with conditional GET (ETag/Last-Modified) to minimize bandwidth
- SearchHive's DeepDive adds sentiment analysis, entity extraction, and topic classification to raw articles
Prerequisites
- Python 3.9+
- A SearchHive API key for content enrichment (free tier available)
- Optional: PostgreSQL for storage, Redis for deduplication
```shell
pip install feedparser trafilatura httpx searchhive

# Optional dependencies
pip install readability-lxml newspaper3k apscheduler
```
Step 1: Set Up RSS Feed Parsing
RSS feeds are the most reliable way to monitor news. They're structured, lightweight, and designed for automated consumption.
```python
import feedparser

def parse_feed(feed_url):
    """Parse an RSS or Atom feed and return entries."""
    feed = feedparser.parse(feed_url)
    if feed.bozo and not feed.entries:
        print(f"Feed parse error: {feed.bozo_exception}")
        return []

    entries = []
    for entry in feed.entries:
        entries.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "author": entry.get("author", ""),
            "source": feed.feed.get("title", "Unknown"),
        })
    return entries

# Google News RSS — still working in 2026
entries = parse_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en")
for e in entries[:5]:
    print(f"[{e['source']}] {e['title']}")
    print(f"  {e['link']}")
```
Key RSS Feed Sources
| Source | URL | Notes |
|---|---|---|
| Google News (top) | news.google.com/rss | 141+ countries, 41+ languages |
| Google News (search) | news.google.com/rss/search?q=QUERY | Keyword-based monitoring |
| Reuters | feeds.reuters.com/reuters/topNews | Major global news wire |
| BBC News | feeds.bbci.co.uk/news/rss.xml | UK/international focus |
| NYT | rss.nytimes.com/services/xml/rss/nyt/HomePage.xml | US focus |
| The Guardian | theguardian.com/world/rss | Global news |
| NPR | feeds.npr.org/1001/rss.xml | US public radio |
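The search endpoint in the table above accepts arbitrary keyword queries. A small helper can build those URLs for the `parse_feed()` function from Step 1 (a sketch; the function name is our own choice):

```python
from urllib.parse import quote_plus

def google_news_search_feed(query, hl="en-US", gl="US"):
    """Build a Google News RSS search URL for keyword-based monitoring."""
    lang = hl.split("-")[0]  # "en-US" -> "en" for the ceid parameter
    return (
        "https://news.google.com/rss/search"
        f"?q={quote_plus(query)}&hl={hl}&gl={gl}&ceid={gl}:{lang}"
    )

# Build a feed URL for a topic, then hand it to parse_feed() from Step 1
url = google_news_search_feed("artificial intelligence regulation")
print(url)
```

Register one of these URLs per topic you want to track, and the rest of the pipeline works unchanged.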
Step 2: Extract Full Article Text
RSS summaries are typically only 100-300 characters. For real analysis, you need the full article body, and trafilatura is the gold standard for extracting it.
```python
import json
import requests
from readability import Document
from trafilatura import fetch_url, extract

def extract_article(url):
    """Extract full article text with trafilatura + readability fallback."""
    # Primary: trafilatura (best accuracy)
    try:
        downloaded = fetch_url(url)
        result = extract(
            downloaded,
            url=url,
            output_format="json",
            with_metadata=True,
            include_comments=False,
            include_tables=True,
            include_links=False,
        )
        if result:
            data = json.loads(result)
            return {
                "title": data.get("title", ""),
                "author": data.get("author", ""),
                "date": data.get("date", ""),
                "text": data.get("text", ""),
                "language": data.get("language", ""),
                "source_domain": data.get("source-hostname", ""),
                "description": data.get("description", ""),
            }
    except Exception as e:
        print(f"Trafilatura failed for {url}: {e}")

    # Fallback: readability-lxml
    try:
        resp = requests.get(url, timeout=15, headers={
            "User-Agent": "Mozilla/5.0 (compatible; NewsBot/1.0)"
        })
        doc = Document(resp.content)
        return {
            "title": doc.title(),
            "text": doc.summary(),
            "fallback": True,
        }
    except Exception as e:
        print(f"Readability fallback also failed: {e}")
    return None

# Test (use a specific article URL, not a section front page, for best results)
article = extract_article("https://www.bbc.com/news")
if article:
    print(f"Title: {article['title']}")
    print(f"Text length: {len(article['text'])} chars")
```
Step 3: Build a Smart Feed Poller
Efficient polling uses conditional GET requests — if a feed hasn't changed, the server returns 304 (Not Modified) and you download nothing.
```python
import feedparser
import hashlib

class SmartFeedPoller:
    """Poll RSS feeds efficiently with deduplication and conditional GET."""

    def __init__(self):
        self.feeds = {}  # url -> {name, etag, modified, seen}

    def add_feed(self, url, name=None):
        """Register a feed to monitor."""
        self.feeds[url] = {
            "name": name or url,
            "etag": None,
            "modified": None,
            "seen": set(),
        }

    def poll(self, url):
        """Poll a single feed for new entries."""
        config = self.feeds[url]

        # Conditional GET headers
        kwargs = {}
        if config["etag"]:
            kwargs["etag"] = config["etag"]
        if config["modified"]:
            kwargs["modified"] = config["modified"]

        feed = feedparser.parse(url, **kwargs)

        # Handle HTTP status (the attribute is absent when the request itself failed)
        status = getattr(feed, "status", None)
        if status == 304:
            return []  # Not modified — skip
        if status is not None and status >= 400:
            print(f"Error {status} for {config['name']}")
            return []

        # Save conditional GET headers
        config["etag"] = feed.get("etag")
        config["modified"] = feed.get("modified")

        # Process new entries only
        new_entries = []
        for entry in feed.entries:
            entry_id = hashlib.md5(
                entry.get("link", entry.get("id", "")).encode()
            ).hexdigest()
            if entry_id not in config["seen"]:
                config["seen"].add(entry_id)
                new_entries.append({
                    "title": entry.get("title", ""),
                    "link": entry.get("link", ""),
                    "published": entry.get("published", ""),
                    "summary": entry.get("summary", ""),
                    "source": config["name"],
                })
        return new_entries

    def poll_all(self):
        """Poll all registered feeds."""
        all_new = []
        for url in self.feeds:
            all_new.extend(self.poll(url))
        return all_new

# Usage
poller = SmartFeedPoller()
poller.add_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en", "Google News")
poller.add_feed("https://feeds.bbci.co.uk/news/rss.xml", "BBC News")
poller.add_feed("http://feeds.reuters.com/reuters/topNews", "Reuters")

new_articles = poller.poll_all()
print(f"Found {len(new_articles)} new articles")
for a in new_articles:
    print(f"  [{a['source']}] {a['title']}")
```
Step 4: Add Full-Text Extraction Pipeline
Combine feed polling with article extraction for a complete pipeline:
```python
import json
from datetime import datetime

class NewsScraperPipeline:
    """Complete news scraping pipeline."""

    def __init__(self, searchhive_key=None):
        self.poller = SmartFeedPoller()
        self.searchhive_key = searchhive_key
        self.articles = []

    def add_feed(self, url, name=None):
        self.poller.add_feed(url, name)

    def run_once(self, extract_full_text=False, enrich=False):
        """Run one polling cycle."""
        new_entries = self.poller.poll_all()
        processed = []
        for entry in new_entries:
            article = {
                "title": entry["title"],
                "url": entry["link"],
                "source": entry["source"],
                "published": entry["published"],
                "scraped_at": datetime.now().isoformat(),
            }

            # Extract full text
            if extract_full_text:
                full = extract_article(entry["link"])
                if full:
                    article["full_text"] = full.get("text", "")
                    article["author"] = full.get("author", "")
                    article["language"] = full.get("language", "")

            # Enrich with SearchHive
            if enrich and self.searchhive_key and entry.get("link"):
                article["analysis"] = self._enrich_article(entry["link"])

            processed.append(article)
            print(f"  Processed: {entry['title'][:80]}")

        self.articles.extend(processed)
        return processed

    def _enrich_article(self, url):
        """Enrich article with SearchHive analysis."""
        try:
            from searchhive import DeepDive
            dd = DeepDive(api_key=self.searchhive_key)
            analysis = dd.analyze(
                url=url,
                summarize=True,
                extract_entities=True
            )
            return {
                "summary": analysis.get("summary", "")[:300],
                "entities": analysis.get("entities", [])[:10],
                "sentiment": analysis.get("sentiment", "neutral"),
            }
        except Exception as e:
            return {"error": str(e)}

    def save(self, filename="news_articles.json"):
        with open(filename, "w") as f:
            json.dump(self.articles, f, indent=2, default=str)
        print(f"Saved {len(self.articles)} articles to {filename}")

# Usage
scraper = NewsScraperPipeline(searchhive_key="your_key")
scraper.add_feed("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en", "Google News")
scraper.add_feed("https://feeds.bbci.co.uk/news/rss.xml", "BBC News")
scraper.add_feed("http://feeds.reuters.com/reuters/topNews", "Reuters")

new = scraper.run_once(extract_full_text=True, enrich=False)
print(f"Processed {len(new)} new articles")
scraper.save()
```
Step 5: Real-Time Monitoring with Scheduling
For true real-time monitoring, run the poller on a schedule:
```python
import time
from datetime import datetime

def run_continuous(pipeline, interval_seconds=300, extract_full_text=True):
    """Run the news scraper continuously."""
    print(f"Starting continuous news monitoring (every {interval_seconds}s)")
    print(f"Feeds: {list(pipeline.poller.feeds.keys())}")
    print("---")

    while True:
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"\n[{timestamp}] Polling feeds...")
        try:
            new = pipeline.run_once(
                extract_full_text=extract_full_text,
                enrich=False  # Enable for SearchHive analysis
            )
            if new:
                print(f"  Found {len(new)} new articles")
                for a in new:
                    print(f"  - [{a['source']}] {a['title'][:70]}")
                pipeline.save()
            else:
                print("  No new articles")
        except Exception as e:
            print(f"  Error: {e}")

        time.sleep(interval_seconds)

# Run the blocking loop
# run_continuous(scraper, interval_seconds=300)  # Every 5 minutes
```
Step 6: Enrich with SearchHive
Raw article text is useful, but enriched data is actionable. SearchHive adds analysis that turns text into intelligence:
```python
from searchhive import DeepDive, SwiftSearch

def analyze_news_article(url, api_key):
    """Deep analysis of a news article."""
    dd = DeepDive(api_key=api_key)
    analysis = dd.analyze(
        url=url,
        summarize=True,
        extract_entities=True,
    )
    return {
        "summary": analysis.get("summary", ""),
        "key_entities": analysis.get("entities", []),
        "topics": analysis.get("topics", []),
    }

def search_news_context(topic, api_key):
    """Find related coverage across the web."""
    search = SwiftSearch(api_key=api_key)
    return search.search(
        query=f"{topic} news analysis",
        extract_fields=["title", "description", "url", "date"]
    )
```
What SearchHive adds to raw news data:
| Raw Article | SearchHive Analysis |
|---|---|
| Full text content | Concise summary (2-3 sentences) |
| Author name | Author profile and other publications |
| Publication date | Trend analysis — is this topic gaining or losing momentum? |
| Source domain | Source credibility assessment and bias detection |
| Plain text | Named entity recognition (people, orgs, locations, dates) |
| Keywords | Topic classification into predefined categories |
Step 7: Complete Production Architecture
For a production news scraper, you'll want this architecture:
```
RSS Feeds (poll every 1-5 min)
        |
        v
feedparser (parse + conditional GET)
        |
        v
Deduplication (Redis URL hash set)
        |
        v
Full-text Extraction (trafilatura + readability fallback)
        |
        v
SearchHive Enrichment (summary, entities, sentiment)
        |
        v
Storage (PostgreSQL + Elasticsearch)
        |
        v
API / Dashboard / Alerts
```
Production Dependencies
```shell
pip install feedparser trafilatura readability-lxml httpx searchhive
pip install redis sqlalchemy psycopg2-binary elasticsearch
pip install apscheduler  # For scheduling
```
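The Redis deduplication stage in the diagram needs only a set and the SADD command. A sketch (the class name and the `news:seen_urls` key are our choices; pass a real `redis.Redis()` client in production):

```python
import hashlib

class UrlDeduper:
    """Mark-and-check URL dedup backed by a Redis set, so seen-state
    survives restarts and is shared across worker processes."""

    def __init__(self, client, key="news:seen_urls"):
        self.client = client  # e.g. redis.Redis(decode_responses=True)
        self.key = key

    def is_new(self, url):
        digest = hashlib.sha256(url.encode()).hexdigest()
        # SADD returns 1 only if the member wasn't already in the set
        return self.client.sadd(self.key, digest) == 1
```

Filter each polling cycle with `[e for e in entries if deduper.is_new(e["link"])]` before running extraction.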
Common Issues
Feed Returns 403 Forbidden
Cause: The publisher blocks non-browser requests. Fix: Send a browser-like User-Agent header. Some feeds (Google News) work reliably; others may require rotating user agents.
Trafilatura Returns Empty Text
Cause: Paywalled content, JavaScript-rendered articles, or blocked requests. Fix: Use readability-lxml as fallback. For JS-heavy sites, use ScrapeForge with Playwright rendering.
Too Many Duplicate Articles
Cause: Multiple outlets covering the same story with different URLs. Fix: Implement fuzzy title matching or content hashing beyond simple URL dedup. SearchHive can help detect duplicate coverage.
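A stdlib-only sketch of the fuzzy-title approach (function names and the 0.85 threshold are our choices; tune the threshold against your own data):

```python
import difflib
import re

def normalize_title(title):
    """Lowercase, strip punctuation, and drop a trailing ' - Publisher' suffix."""
    title = re.sub(r"\s+[-|]\s+[^-|]*$", "", title)
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def is_duplicate_story(title_a, title_b, threshold=0.85):
    """Treat two headlines as the same story when their similarity ratio is high."""
    ratio = difflib.SequenceMatcher(
        None, normalize_title(title_a), normalize_title(title_b)
    ).ratio()
    return ratio >= threshold
```

For exact-duplicate detection at scale, hash the normalized title instead of comparing pairs; reserve the pairwise ratio for candidates that share rare keywords.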
Feed Stops Updating
Cause: Feed URL changed or publisher discontinued the feed. Fix: Implement feed health monitoring — alert when a feed returns 404 or hasn't updated in 24+ hours.
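A minimal health-check sketch (the data shape and function name are ours; feed it the HTTP status and last-new-entry timestamp you already see in each polling cycle):

```python
from datetime import datetime, timedelta, timezone

def check_feed_health(feed_states, max_silence_hours=24):
    """Return alert strings for feeds that errored or have gone quiet.

    feed_states maps a feed name to {"last_new_entry": datetime,
    "last_status": int}; record both values after every poll.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_silence_hours)
    alerts = []
    for name, state in feed_states.items():
        if state.get("last_status", 200) >= 400:
            alerts.append(f"{name}: HTTP {state['last_status']}")
        elif state.get("last_new_entry") and state["last_new_entry"] < cutoff:
            alerts.append(f"{name}: no new entries in {max_silence_hours}h+")
    return alerts
```

Run it once per polling cycle and route non-empty results to your alerting channel.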
Rate Limiting by Publishers
Cause: Requesting full text from too many articles too quickly. Fix: Add 1-3 second delays between article fetches. Respect robots.txt. Limit concurrent connections per domain.
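A per-domain throttle sketch (the class name is ours) that enforces that 1-3 second gap before each fetch:

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_hit = {}  # domain -> last request time (monotonic clock)

    def wait(self, url):
        """Sleep just long enough to respect the per-domain delay, then record the hit."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(domain, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[domain] = time.monotonic()
```

Call `throttle.wait(entry["link"])` immediately before `extract_article(entry["link"])` in the pipeline; requests to different domains are not delayed.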
Next Steps
- Scale with Celery — use distributed workers for high-volume feed monitoring
- Add alerts — trigger SearchHive deep-dives or notifications when specific topics appear
- Build a search index — use ElasticSearch or Meilisearch over extracted articles
- Create a dashboard — Streamlit or Gradio for quick article browsing and filtering
- Monitor feed health — track which feeds are active, their update frequency, and article counts
Start monitoring news intelligently. Get SearchHive's free tier — 100 free requests/month for article analysis and enrichment. See the API documentation for quickstart guides.
See also: How to scrape GitHub data | SearchHive vs ScraperAPI | Python scraping guide