Complete Guide to Scraping Dynamic Content
Modern websites rely heavily on JavaScript to render content. React, Vue, Angular, and Next.js have made single-page applications (SPAs) the default, which means a simple HTTP request often returns an empty HTML shell with no actual data. If your scraper only fetches the initial HTML, you get nothing.
This guide covers every approach for scraping dynamic content -- from detecting JS-rendered pages to handling anti-bot systems, choosing the right tools, and writing production-ready scraping code.
Key Takeaways
- Static fetchers fail on JS-rendered pages -- you need headless browsers or API interception
- Multi-tier escalation (static -> stealth -> headless browser) handles 95%+ of sites
- API interception is the fastest approach when you can find the underlying data endpoint
- ScrapeForge handles dynamic content automatically with JS rendering and anti-bot bypass
- Always check for hidden APIs before reaching for a headless browser
What Is Dynamic Content?
Dynamic content is any page content that loads after the initial HTML response. This includes:
- Client-side rendered (CSR) apps -- React, Vue, Angular SPAs where the server sends minimal HTML and JavaScript builds the DOM
- Lazy-loaded content -- images, comments, or article text that loads on scroll or user interaction
- Infinite scroll -- social media feeds, product listings that append content as you scroll
- AJAX-loaded sections -- product reviews, pricing tables, search results fetched via XHR/Fetch
Common signs a page uses dynamic rendering:
- Empty `<body>` in the raw HTML source
- `<div id="root"></div>` or `<div id="app"></div>` with no content
- Script tags loading bundle files (`.js`, `.chunk.js`)
- Content that appears only after a loading spinner
Detecting Dynamic Content
Before choosing a scraping strategy, confirm the page is actually JS-rendered:
```python
import requests
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_length = 0

    def handle_data(self, data):
        self.text_length += len(data.strip())

def check_dynamic(url):
    """Check if a page likely uses dynamic rendering."""
    resp = requests.get(url, timeout=10)
    parser = TagCounter()
    parser.feed(resp.text)

    # If the raw HTML has very little text but the page looks full in a
    # browser, it's dynamically rendered
    text_ratio = parser.text_length / max(len(resp.text), 1)
    has_js_frameworks = any(
        marker in resp.text.lower()
        for marker in ['react', 'vue', 'angular', 'next.js', '__next', 'nuxt']
    )

    print(f"Text ratio: {text_ratio:.2%}")
    print(f"JS framework detected: {has_js_frameworks}")
    print(f"Likely dynamic: {text_ratio < 0.05 or has_js_frameworks}")
    return text_ratio < 0.05 or has_js_frameworks
```
Approach 1: API Interception (Fastest)
Many dynamic sites load their data from REST or GraphQL APIs. If you can find the API endpoint, you skip the browser entirely and get clean JSON directly.
How to find hidden APIs:
- Open DevTools (F12) -> Network tab
- Reload the page
- Look for XHR/Fetch requests returning JSON
- Check for GraphQL endpoints (`/graphql`, `/api/query`)
```python
import json
import re

import requests

def scrape_via_api(url):
    """Look for structured data or hydration payloads embedded in a page."""
    # Many news sites use a pattern like /article/slug -> /api/articles/{id}.
    # This example shows a generic approach.

    # Step 1: Get the page to find the data source
    resp = requests.get(url, timeout=10)

    # Step 2: Look for JSON-LD data in script tags (common pattern)
    json_match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        resp.text, re.DOTALL
    )
    if json_match:
        data = json.loads(json_match.group(1))
        print(f"Found structured data: {data.get('headline', 'N/A')}")
        return data

    # Step 3: Check for __NEXT_DATA__ or similar hydration payloads
    next_match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        resp.text, re.DOTALL
    )
    if next_match:
        data = json.loads(next_match.group(1))
        props = data.get('props', {}).get('pageProps', {})
        print(f"Found Next.js data: {list(props.keys())}")
        return props

    return None
```
This is the fastest approach -- typically under 1 second with no browser overhead. It works on Next.js (__NEXT_DATA__), Nuxt (__NUXT__), and many WordPress sites (wp-json).
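For the WordPress case, the REST API is usually reachable directly once you know the post slug. Here is a minimal sketch of mapping a post URL to its `wp-json` endpoint; `blog.example.com` is a placeholder, and while `/wp-json/wp/v2/posts?slug=` is WordPress's standard route, some sites disable or rename it:

```python
from urllib.parse import urlsplit

def wp_api_url(post_url):
    """Map a WordPress post URL to its wp-json REST endpoint.

    WordPress supports lookup by slug, so the last path segment of a
    post URL is enough to query the API directly -- no browser needed.
    """
    parts = urlsplit(post_url)
    slug = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return f"{parts.scheme}://{parts.netloc}/wp-json/wp/v2/posts?slug={slug}"

print(wp_api_url("https://blog.example.com/2024/05/my-post/"))
# -> https://blog.example.com/wp-json/wp/v2/posts?slug=my-post
```

The JSON response includes the rendered title and content, so a single GET replaces an entire browser session.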
Approach 2: Stealth Fetching
When API interception isn't possible, the next step is a stealth HTTP client that mimics real browser behavior. This handles basic bot detection without the overhead of a full browser.
```python
# Using the SearchHive ScrapeForge API, which handles this automatically
import requests

def scrape_dynamic(url):
    """Scrape a dynamically rendered page via ScrapeForge."""
    resp = requests.post(
        "https://api.searchhive.dev/api/v1/scrape",
        json={"url": url},
        timeout=60,
    )
    data = resp.json()

    if data.get("error"):
        print(f"Scrape error: {data['error']}")
        return None

    return {
        "title": data.get("title"),
        "text": data.get("text"),
        "url": data.get("url"),
    }

# Scrape a React-rendered page
result = scrape_dynamic("https://example.com/products")
print(result["title"])
print(result["text"][:500])
```
ScrapeForge uses a multi-tier escalation strategy internally:
1. Static fetch with stealth headers (8s timeout)
2. Stealth fetch with anti-detection (10s timeout)
3. Headless browser via Patchright (30s timeout)
It automatically escalates when it detects challenge pages, empty content, or bot detection signatures.
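If you build this pattern yourself, the escalation loop can be sketched roughly as follows. This is illustrative, not ScrapeForge's actual internals: the block-marker list and the 500-byte "too short to be real content" threshold are heuristic placeholders you would tune per site.

```python
import requests

# Signatures that commonly appear on challenge or block pages
# (heuristic list for illustration -- tune for the sites you target)
BLOCK_MARKERS = ["cf-challenge", "captcha", "access denied", "are you a robot"]

def looks_blocked(html):
    """Heuristic: does the response look like a challenge page or empty shell?"""
    lowered = html.lower()
    return len(html) < 500 or any(m in lowered for m in BLOCK_MARKERS)

def escalating_fetch(url, tiers):
    """Try each fetch tier in order until one returns usable content.

    `tiers` is a list of (name, fetch_fn) pairs, cheapest first --
    e.g. a plain requests.get, a stealth client, then a headless browser.
    Returns (tier_name, html) on success, or (None, None) if all fail.
    """
    for name, fetch_fn in tiers:
        try:
            html = fetch_fn(url)
        except requests.RequestException:
            continue  # network error -> escalate to the next tier
        if html and not looks_blocked(html):
            return name, html
    return None, None
```

The key design point is that each tier is only paid for when the cheaper one fails, which is why the average cost per page stays low even when some pages need a full browser.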
Approach 3: Headless Browsers
For heavily JS-dependent sites where API interception and stealth fetching fail, you need a real browser engine. The two main options are Playwright (and its stealth fork Patchright) and Puppeteer.
Playwright
```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_with_playwright(url):
    """Scrape with Playwright -- handles React, Vue, Angular, infinite scroll."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        # Wait for specific content to appear
        await page.wait_for_selector("article, .content, .product-card", timeout=10000)

        # Handle infinite scroll
        await auto_scroll(page)

        # Extract content
        title = await page.title()
        text = await page.evaluate("""
            () => {
                const article = document.querySelector('article, main, .content');
                return article ? article.innerText : document.body.innerText;
            }
        """)

        await browser.close()
        return {"title": title, "text": text}

async def auto_scroll(page, pause=1000):
    """Scroll to bottom to trigger lazy loading."""
    last_height = await page.evaluate("document.body.scrollHeight")
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Note: page.pause() opens the Playwright inspector;
        # wait_for_timeout() is the correct way to sleep
        await page.wait_for_timeout(pause)
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

# Run it
result = asyncio.run(scrape_with_playwright("https://news.example.com"))
print(result["title"])
```
Patchright (Anti-Detection Playwright Fork)
Patchright is a fork of Playwright that patches the browser runtime to avoid detection by Cloudflare, DataDome, and similar anti-bot systems:
```python
# pip install patchright
import asyncio

from patchright.async_api import async_playwright

async def scrape_stealthy(url):
    """Scrape with Patchright -- bypasses basic anti-bot detection."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        await page.wait_for_timeout(3000)  # Extra wait for JS rendering

        text = await page.evaluate("document.body.innerText")
        await browser.close()
        return text

result = asyncio.run(scrape_stealthy("https://js-heavy-site.example.com"))
```
Approach 4: Using ScrapeForge for Production
For production workloads, managing your own headless browser infrastructure is expensive and fragile. ScrapeForge handles all of this:
```python
import requests

def batch_scrape(urls):
    """Scrape multiple dynamic pages in a single request."""
    resp = requests.post(
        "https://api.searchhive.dev/api/v1/scrape/batch",
        json={"urls": urls},
        timeout=120,
    )
    results = resp.json()

    for item in results:
        print(f"[{item.get('url', 'unknown')}]")
        print(f"  Title: {item.get('title', 'N/A')}")
        print(f"  Content length: {len(item.get('text', ''))}")

    return results

# Scrape a batch of dynamic pages
pages = [
    "https://example.com/products/1",
    "https://example.com/products/2",
    "https://example.com/products/3",
]
batch_scrape(pages)
```
Best Practices for Scraping Dynamic Content
1. Always check for APIs first. A hidden JSON endpoint is 10-100x faster than launching a browser. Check DevTools Network tab before writing any scraping code.
2. Respect rate limits. Add delays between requests, use rotating proxies for high-volume scraping, and respect robots.txt. Getting blocked wastes everyone's time.
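One simple way to add those delays is a small per-host rate limiter. This is a minimal sketch; the one-second minimum delay and jitter range are placeholder defaults, not recommendations for any particular site:

```python
import random
import time

class RateLimiter:
    """Enforce a minimum delay (with jitter) between requests per host."""

    def __init__(self, min_delay=1.0, jitter=0.5):
        self.min_delay = min_delay
        self.jitter = jitter
        self.last_request = {}  # host -> timestamp of the last request

    def wait(self, host):
        """Block until enough time has passed since the last request to host."""
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, 0.0)
        delay = self.min_delay + random.uniform(0, self.jitter)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Call `limiter.wait(host)` before each request; the jitter makes the traffic pattern look less mechanical than a fixed interval.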
3. Cache aggressively. Re-scraping the same page is wasteful. Cache results with a TTL based on how often the content changes:
```python
import hashlib
import json
import os
import time

CACHE_DIR = "/tmp/scrape_cache"
CACHE_TTL = 3600  # 1 hour

def cached_scrape(url, scrape_fn):
    """Cache scraped content to avoid redundant requests."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = f"{CACHE_DIR}/{cache_key}.json"

    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if time.time() - cached["timestamp"] < CACHE_TTL:
            return cached["data"]
    except (FileNotFoundError, json.JSONDecodeError):
        pass

    data = scrape_fn(url)
    with open(cache_path, "w") as f:
        json.dump({"timestamp": time.time(), "data": data}, f)
    return data
```
4. Handle failures gracefully. Dynamic pages can timeout, crash, or serve challenge pages. Always wrap scraping in try/except with retries and fallbacks.
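A minimal retry wrapper with exponential backoff might look like this; the attempt count and delay values are illustrative defaults:

```python
import random
import time

def scrape_with_retries(scrape_fn, url, max_attempts=3, base_delay=1.0):
    """Call scrape_fn(url), retrying with exponential backoff on failure.

    Returns the first non-empty result, or None after max_attempts.
    """
    for attempt in range(max_attempts):
        try:
            result = scrape_fn(url)
            if result:  # treat empty results as failures too
                return result
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return None
```

The fallback step fits naturally here too: pass your cheap fetcher as `scrape_fn` first, and call the function again with a headless-browser fetcher if it returns None.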
5. Use specific selectors. Instead of grabbing all text, target the content area: `article`, `main`, `.post-content`, or whatever the site uses. This avoids scraping navigation, footers, and cookie banners.
6. Monitor and adapt. Sites change their HTML structure regularly. Set up alerts when your scraper returns empty or significantly different content.
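A cheap monitoring heuristic is to compare each scrape's content length against recent successful runs. This sketch flags sudden drops; the 30% threshold is an arbitrary starting point, not a tuned value:

```python
def content_looks_broken(new_text, history, min_ratio=0.3):
    """Flag a scrape whose content length drops sharply vs. recent runs.

    `history` is a list of text lengths from recent successful scrapes.
    A sudden drop usually means a layout change, a block page, or an
    empty render -- all worth alerting on.
    """
    if not new_text:
        return True
    if not history:
        return False  # nothing to compare against yet
    baseline = sum(history) / len(history)
    return len(new_text) < baseline * min_ratio
```

Run it after every scrape and route `True` results to whatever alerting you already use; the point is catching silent breakage before it pollutes downstream data.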
When to Use Each Approach
| Approach | Speed | Reliability | Cost | Best For |
|---|---|---|---|---|
| API Interception | <1s | High | Free | Sites with hidden APIs (Next.js, WordPress) |
| Stealth Fetch | 2-5s | Medium | Free-Low | Sites with basic bot detection |
| Headless Browser | 5-30s | High | Medium | JS-heavy SPAs, infinite scroll |
| ScrapeForge API | 3-15s | Very High | $0.0005-$0.004/page | Production workloads, multi-tier handling |
Conclusion
Scraping dynamic content doesn't have to be complicated. Start with API interception (fastest, cheapest), escalate to stealth fetching for basic anti-bot, and use headless browsers or ScrapeForge for the hardest cases. The key insight is that most sites fall into the first two categories -- a full browser is only needed for the most heavily JS-rendered, anti-bot-protected pages.
For production scraping that handles all three tiers automatically, SearchHive ScrapeForge offers 500 free credits and plans starting at $9/5K requests. It's significantly cheaper than Firecrawl ($83/100K) and handles the complexity of multi-tier escalation so you don't have to.