Complete Guide to Scraping Dynamic Content
Modern websites rely heavily on JavaScript to render content. React, Vue, Angular, and Next.js have made single-page applications (SPAs) the default, which means a simple HTTP request often returns an empty HTML shell with no actual data. If your scraper only fetches the initial HTML, you get nothing.
This guide covers every approach for scraping dynamic content -- from detecting JS-rendered pages to handling anti-bot systems, choosing the right tools, and writing production-ready scraping code.
Key Takeaways
- Static fetchers fail on JS-rendered pages -- you need headless browsers or API interception
- Multi-tier escalation (static -> stealth -> headless browser) handles 95%+ of sites
- API interception is the fastest approach when you can find the underlying data endpoint
- ScrapeForge handles dynamic content automatically with JS rendering and anti-bot bypass
- Always check for hidden APIs before reaching for a headless browser
What Is Dynamic Content?
Dynamic content is any page content that loads after the initial HTML response. This includes:
- Client-side rendered (CSR) apps -- React, Vue, Angular SPAs where the server sends minimal HTML and JavaScript builds the DOM
- Lazy-loaded content -- images, comments, or article text that loads on scroll or user interaction
- Infinite scroll -- social media feeds, product listings that append content as you scroll
- AJAX-loaded sections -- product reviews, pricing tables, search results fetched via XHR/Fetch
Common signs a page uses dynamic rendering:
- Empty `<body>` in the raw HTML source
- `<div id="root"></div>` or `<div id="app"></div>` with no content
- Script tags loading bundle files (`.js`, `.chunk.js`)
- Content that appears only after a loading spinner
Detecting Dynamic Content
Before choosing a scraping strategy, confirm the page is actually JS-rendered:
```python
import requests
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_length = 0

    def handle_data(self, data):
        self.text_length += len(data.strip())

def check_dynamic(url):
    """Check if a page likely uses dynamic rendering."""
    resp = requests.get(url, timeout=10)
    parser = TagCounter()
    parser.feed(resp.text)

    # If the raw HTML has very little text but the page looks full in a
    # browser, it's dynamically rendered
    text_ratio = parser.text_length / max(len(resp.text), 1)
    has_js_frameworks = any(
        marker in resp.text.lower()
        for marker in ['react', 'vue', 'angular', 'next.js', '__next', 'nuxt']
    )

    print(f"Text ratio: {text_ratio:.2%}")
    print(f"JS framework detected: {has_js_frameworks}")
    print(f"Likely dynamic: {text_ratio < 0.05 or has_js_frameworks}")
    return text_ratio < 0.05 or has_js_frameworks
```
Approach 1: API Interception (Fastest)
Many dynamic sites load their data from REST or GraphQL APIs. If you can find the API endpoint, you skip the browser entirely and get clean JSON directly.
How to find hidden APIs:
- Open DevTools (F12) -> Network tab
- Reload the page
- Look for XHR/Fetch requests returning JSON
- Check for GraphQL endpoints (`/graphql`, `/api/query`)
```python
import json
import re

import requests

def scrape_via_api(url):
    """Look for structured data or hydration payloads embedded in a page."""
    # Many news sites use a pattern like /article/slug -> /api/articles/{id}.
    # This example shows a generic approach.

    # Step 1: Get the page to find the data source
    resp = requests.get(url, timeout=10)

    # Step 2: Look for JSON-LD data in script tags (common pattern)
    json_match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        resp.text, re.DOTALL
    )
    if json_match:
        data = json.loads(json_match.group(1))
        print(f"Found structured data: {data.get('headline', 'N/A')}")
        return data

    # Step 3: Check for __NEXT_DATA__ or similar hydration payloads
    next_match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        resp.text, re.DOTALL
    )
    if next_match:
        data = json.loads(next_match.group(1))
        props = data.get('props', {}).get('pageProps', {})
        print(f"Found Next.js data: {list(props.keys())}")
        return props

    return None
```
This is the fastest approach -- typically under 1 second with no browser overhead. It works on Next.js (__NEXT_DATA__), Nuxt (__NUXT__), and many WordPress sites (wp-json).
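For the WordPress case, the REST API is usually reachable directly once you know the post slug. Here is a minimal sketch of mapping a post URL to its `wp-json` endpoint; `blog.example.com` is a placeholder, and while `/wp-json/wp/v2/posts?slug=` is WordPress's standard route, some sites disable or rename it:

```python
from urllib.parse import urlsplit

def wp_api_url(post_url):
    """Map a WordPress post URL to its wp-json REST endpoint.

    WordPress supports lookup by slug, so the last path segment of a
    post URL is enough to query the API directly -- no browser needed.
    """
    parts = urlsplit(post_url)
    slug = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return f"{parts.scheme}://{parts.netloc}/wp-json/wp/v2/posts?slug={slug}"

print(wp_api_url("https://blog.example.com/2024/05/my-post/"))
# -> https://blog.example.com/wp-json/wp/v2/posts?slug=my-post
```

The JSON response includes the rendered title and content, so a single GET replaces an entire browser session.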
Approach 2: Stealth Fetching
When API interception isn't possible, the next step is a stealth HTTP client that mimics real browser behavior. This handles basic bot detection without the overhead of a full browser.
```python
# Using the SearchHive ScrapeForge API, which handles this automatically
import requests

def scrape_dynamic(url):
    """Scrape a dynamically rendered page via ScrapeForge."""
    resp = requests.post(
        "https://api.searchhive.dev/api/v1/scrape",
        json={"url": url},
        timeout=60,
    )
    data = resp.json()

    if data.get("error"):
        print(f"Scrape error: {data['error']}")
        return None

    return {
        "title": data.get("title"),
        "text": data.get("text"),
        "url": data.get("url"),
    }

# Scrape a React-rendered page
result = scrape_dynamic("https://example.com/products")
print(result["title"])
print(result["text"][:500])
```
ScrapeForge uses a multi-tier escalation strategy internally:
1. Static fetch with stealth headers (8s timeout)
2. Stealth fetch with anti-detection (10s timeout)
3. Headless browser via Patchright (30s timeout)
It automatically escalates when it detects challenge pages, empty content, or bot detection signatures.
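If you build this pattern yourself, the escalation loop can be sketched roughly as follows. This is illustrative, not ScrapeForge's actual internals: the block-marker list and the 500-byte "too short to be real content" threshold are heuristic placeholders you would tune per site.

```python
import requests

# Signatures that commonly appear on challenge or block pages
# (heuristic list for illustration -- tune for the sites you target)
BLOCK_MARKERS = ["cf-challenge", "captcha", "access denied", "are you a robot"]

def looks_blocked(html):
    """Heuristic: does the response look like a challenge page or empty shell?"""
    lowered = html.lower()
    return len(html) < 500 or any(m in lowered for m in BLOCK_MARKERS)

def escalating_fetch(url, tiers):
    """Try each fetch tier in order until one returns usable content.

    `tiers` is a list of (name, fetch_fn) pairs, cheapest first --
    e.g. a plain requests.get, a stealth client, then a headless browser.
    Returns (tier_name, html) on success, or (None, None) if all fail.
    """
    for name, fetch_fn in tiers:
        try:
            html = fetch_fn(url)
        except requests.RequestException:
            continue  # network error -> escalate to the next tier
        if html and not looks_blocked(html):
            return name, html
    return None, None
```

The key design point is that each tier is only paid for when the cheaper one fails, which is why the average cost per page stays low even when some pages need a full browser.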
Approach 3: Headless Browsers
For heavily JS-dependent sites where API interception and stealth fetching fail, you need a real browser engine. The two main options are Playwright (and its stealth fork Patchright) and Puppeteer.
Playwright
```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_with_playwright(url):
    """Scrape with Playwright -- handles React, Vue, Angular, infinite scroll."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        # Wait for specific content to appear
        await page.wait_for_selector("article, .content, .product-card", timeout=10000)

        # Handle infinite scroll
        await auto_scroll(page)

        # Extract content
        title = await page.title()
        text = await page.evaluate("""
            () => {
                const article = document.querySelector('article, main, .content');
                return article ? article.innerText : document.body.innerText;
            }
        """)

        await browser.close()
        return {"title": title, "text": text}

async def auto_scroll(page, pause=1000):
    """Scroll to bottom to trigger lazy loading."""
    last_height = await page.evaluate("document.body.scrollHeight")
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Note: page.pause() opens the Playwright inspector;
        # wait_for_timeout() is the correct way to sleep
        await page.wait_for_timeout(pause)
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

# Run it
result = asyncio.run(scrape_with_playwright("https://news.example.com"))
print(result["title"])
```
Patchright (Anti-Detection Playwright Fork)
Patchright is a fork of Playwright that patches the browser runtime to avoid detection by Cloudflare, DataDome, and similar anti-bot systems:
```python
# pip install patchright
import asyncio

from patchright.async_api import async_playwright

async def scrape_stealthy(url):
    """Scrape with Patchright -- bypasses basic anti-bot detection."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        await page.wait_for_timeout(3000)  # Extra wait for JS rendering

        text = await page.evaluate("document.body.innerText")
        await browser.close()
        return text

result = asyncio.run(scrape_stealthy("https://js-heavy-site.example.com"))
```
Approach 4: Using ScrapeForge for Production
For production workloads, managing your own headless browser infrastructure is expensive and fragile. ScrapeForge handles all of this:
```python
import requests

def batch_scrape(urls):
    """Scrape multiple dynamic pages in a single request."""
    resp = requests.post(
        "https://api.searchhive.dev/api/v1/scrape/batch",
        json={"urls": urls},
        timeout=120,
    )
    results = resp.json()

    for item in results:
        print(f"[{item.get('url', 'unknown')}]")
        print(f"  Title: {item.get('title', 'N/A')}")
        print(f"  Content length: {len(item.get('text', ''))}")

    return results

# Scrape a batch of dynamic pages
pages = [
    "https://example.com/products/1",
    "https://example.com/products/2",
    "https://example.com/products/3",
]
batch_scrape(pages)
```
Best Practices for Scraping Dynamic Content
1. Always check for APIs first. A hidden JSON endpoint is 10-100x faster than launching a browser. Check DevTools Network tab before writing any scraping code.
2. Respect rate limits. Add delays between requests, use rotating proxies for high-volume scraping, and respect robots.txt. Getting blocked wastes everyone's time.
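One simple way to add those delays is a small per-host rate limiter. This is a minimal sketch; the one-second minimum delay and jitter range are placeholder defaults, not recommendations for any particular site:

```python
import random
import time

class RateLimiter:
    """Enforce a minimum delay (with jitter) between requests per host."""

    def __init__(self, min_delay=1.0, jitter=0.5):
        self.min_delay = min_delay
        self.jitter = jitter
        self.last_request = {}  # host -> timestamp of the last request

    def wait(self, host):
        """Block until enough time has passed since the last request to host."""
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, 0.0)
        delay = self.min_delay + random.uniform(0, self.jitter)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Call `limiter.wait(host)` before each request; the jitter makes the traffic pattern look less mechanical than a fixed interval.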
3. Cache aggressively. Re-scraping the same page is wasteful. Cache results with a TTL based on how often the content changes:
```python
import hashlib
import json
import os
import time

CACHE_DIR = "/tmp/scrape_cache"
CACHE_TTL = 3600  # 1 hour

def cached_scrape(url, scrape_fn):
    """Cache scraped content to avoid redundant requests."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = f"{CACHE_DIR}/{cache_key}.json"

    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if time.time() - cached["timestamp"] < CACHE_TTL:
            return cached["data"]
    except (FileNotFoundError, json.JSONDecodeError):
        pass

    data = scrape_fn(url)
    with open(cache_path, "w") as f:
        json.dump({"timestamp": time.time(), "data": data}, f)
    return data
```
4. Handle failures gracefully. Dynamic pages can timeout, crash, or serve challenge pages. Always wrap scraping in try/except with retries and fallbacks.
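A minimal retry wrapper with exponential backoff might look like this; the attempt count and delay values are illustrative defaults:

```python
import random
import time

def scrape_with_retries(scrape_fn, url, max_attempts=3, base_delay=1.0):
    """Call scrape_fn(url), retrying with exponential backoff on failure.

    Returns the first non-empty result, or None after max_attempts.
    """
    for attempt in range(max_attempts):
        try:
            result = scrape_fn(url)
            if result:  # treat empty results as failures too
                return result
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return None
```

The fallback step fits naturally here too: pass your cheap fetcher as `scrape_fn` first, and call the function again with a headless-browser fetcher if it returns None.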
5. Use specific selectors. Instead of grabbing all text, target the content area: `article`, `main`, `.post-content`, or whatever the site uses. This avoids scraping navigation, footers, and cookie banners.
6. Monitor and adapt. Sites change their HTML structure regularly. Set up alerts when your scraper returns empty or significantly different content.
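A cheap monitoring heuristic is to compare each scrape's content length against recent successful runs. This sketch flags sudden drops; the 30% threshold is an arbitrary starting point, not a tuned value:

```python
def content_looks_broken(new_text, history, min_ratio=0.3):
    """Flag a scrape whose content length drops sharply vs. recent runs.

    `history` is a list of text lengths from recent successful scrapes.
    A sudden drop usually means a layout change, a block page, or an
    empty render -- all worth alerting on.
    """
    if not new_text:
        return True
    if not history:
        return False  # nothing to compare against yet
    baseline = sum(history) / len(history)
    return len(new_text) < baseline * min_ratio
```

Run it after every scrape and route `True` results to whatever alerting you already use; the point is catching silent breakage before it pollutes downstream data.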
When to Use Each Approach
| Approach | Speed | Reliability | Cost | Best For |
|---|---|---|---|---|
| API Interception | <1s | High | Free | Sites with hidden APIs (Next.js, WordPress) |
| Stealth Fetch | 2-5s | Medium | Free-Low | Sites with basic bot detection |
| Headless Browser | 5-30s | High | Medium | JS-heavy SPAs, infinite scroll |
| ScrapeForge API | 3-15s | Very High | $0.0005-$0.004/page | Production workloads, multi-tier handling |
Conclusion
Scraping dynamic content doesn't have to be complicated. Start with API interception (fastest, cheapest), escalate to stealth fetching for basic anti-bot, and use headless browsers or ScrapeForge for the hardest cases. The key insight is that most sites fall into the first two categories -- a full browser is only needed for the most heavily JS-rendered, anti-bot-protected pages.
For production scraping that handles all three tiers automatically, SearchHive ScrapeForge offers 500 free credits and plans starting at $9/5K requests. It's significantly cheaper than Firecrawl ($83/100K) and handles the complexity of multi-tier escalation so you don't have to.