Scraping single page applications (SPAs) built with React, Vue, Angular, or Svelte is fundamentally different from scraping traditional server-rendered websites. SPAs render content in the browser using JavaScript, which means a simple HTTP request returns an empty HTML shell — not the data you want.
This tutorial walks you through how to scrape single page applications using modern Python tools, including headless browsers, API interception, and managed scraping services that handle the hard parts for you.
Key Takeaways
- SPAs return empty HTML shells on initial load — you need JavaScript execution to get the actual content
- Three approaches work: headless browsers (Playwright/Puppeteer), API interception (reverse-engineer XHR calls), and managed scraping APIs
- API interception is fastest and most reliable when you can find the data endpoints
- SearchHive's ScrapeForge handles SPA rendering automatically — no browser management needed
- Always check robots.txt and respect rate limits when scraping any website
Prerequisites
Before starting, you need:
- Python 3.8+ installed
- pip for package management
- A basic understanding of HTTP requests and HTML structure
- A SearchHive account (free tier includes 500 credits)
Install the dependencies we will use:
pip install requests playwright searchhive
playwright install chromium
Step 1: Understand Why SPAs Break Traditional Scraping
A traditional server-rendered page returns complete HTML:
<!-- Server-rendered: content is in the HTML -->
<html>
<body>
<h1>Product Details</h1>
<p class="price">$49.99</p>
<p class="description">Amazing widget...</p>
</body>
</html>
An SPA returns an empty shell:
<!-- SPA: content is loaded by JavaScript -->
<html>
<body>
<div id="root"></div>
<script src="/static/js/main.a1b2c3.js"></script>
</body>
</html>
When you send a requests.get() to an SPA, you get the second version — no product data, no prices, nothing useful. The JavaScript bundle executes in the browser and populates the DOM. That is why traditional scraping fails.
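Before writing any scraper, you can confirm you are dealing with an SPA by checking the raw HTML for an empty mount point. The heuristic below is a sketch — the `root`/`app` id names and the word-count threshold are assumptions that fit most React and Vue apps, not a universal rule:

```python
import re

def looks_like_spa_shell(html: str) -> bool:
    """Heuristic: an empty mount point plus a JS bundle suggests an SPA."""
    has_empty_root = re.search(r'<div id="(root|app)">\s*</div>', html) is not None
    has_bundle = "<script" in html
    # Very little visible text is another strong SPA signal
    visible = re.sub(r"<[^>]+>", " ", html)
    return has_empty_root and has_bundle and len(visible.split()) < 20

server_rendered = '<html><body><h1>Product</h1><p class="price">$49.99</p></body></html>'
spa_shell = ('<html><body><div id="root"></div>'
             '<script src="/static/js/main.a1b2c3.js"></script></body></html>')

print(looks_like_spa_shell(server_rendered))  # False
print(looks_like_spa_shell(spa_shell))        # True
```

Run this against `requests.get(url).text` — if it returns True, skip ahead to the headless-browser or API-interception approaches.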
Step 2: Check If the Site Has a Backend API
Most SPAs fetch data from a backend API. Before reaching for a headless browser, check the network tab in your browser DevTools:
- Open the SPA in Chrome/Firefox
- Press F12 to open DevTools
- Go to the Network tab
- Reload the page
- Look for XHR/Fetch requests that return JSON data
If you find a clean JSON endpoint, you can skip the browser entirely:
import requests
# Example: many React apps fetch data from /api/ endpoints
api_url = "https://example-spa.com/api/products/123"
response = requests.get(api_url, headers={
"Accept": "application/json",
"User-Agent": "Mozilla/5.0"
})
product = response.json()
print(f"Name: {product['name']}")
print(f"Price: {product['price']}")
print(f"Description: {product['description']}")
This approach is fast, reliable, and does not require a browser. But many SPAs use authentication tokens, complex headers, or obfuscated endpoints that make direct API access difficult.
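When the endpoint does require a token, you can often replay the request by copying the headers from the DevTools Network tab ("Copy as cURL" shows them all). A minimal sketch — the header names below are typical for React/Vue apps, but substitute whatever your own capture shows:

```python
import requests

def api_session(token: str) -> requests.Session:
    """Build a session that mimics the browser's XHR headers.

    The header set is illustrative; copy the real values from your
    own DevTools capture of the working request.
    """
    s = requests.Session()
    s.headers.update({
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
        "Authorization": f"Bearer {token}",
        "X-Requested-With": "XMLHttpRequest",
    })
    return s

# Usage (endpoint is hypothetical):
# resp = api_session("token-from-devtools").get(
#     "https://example-spa.com/api/products", params={"page": 1})
# resp.raise_for_status()
```

If the token expires quickly or is derived by obfuscated JavaScript, move on to Step 3 instead of reverse-engineering it.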
Step 3: Use Playwright for Full Page Rendering
When API interception is not feasible, Playwright gives you a real browser that executes JavaScript and waits for content to render:
from playwright.sync_api import sync_playwright
import json
def scrape_spa(url):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate and wait for content to render
page.goto(url, wait_until="networkidle")
# Wait for specific elements if needed
page.wait_for_selector(".product-title", timeout=10000)
# Extract data
title = page.locator(".product-title").first.text_content()
price = page.locator(".price-value").first.text_content()
description = page.locator(".product-description").first.text_content()
browser.close()
return {
"title": title.strip(),
"price": price.strip(),
"description": description.strip()
}
result = scrape_spa("https://example-spa.com/products/123")
print(json.dumps(result, indent=2))
The downside: Playwright is slow (1-3 seconds per page), resource-heavy, and many sites detect headless browsers. You will need stealth plugins, proxy rotation, and constant maintenance.
Step 4: Handle Infinite Scroll and Dynamic Loading
Many SPAs lazy-load content as the user scrolls. To capture everything:
from playwright.sync_api import sync_playwright
def scrape_infinite_scroll(url, max_scrolls=10):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(url, wait_until="networkidle")
items = []
previous_count = 0
for i in range(max_scrolls):
# Scroll to bottom
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(1500) # Wait for lazy load
# Collect items
current_items = page.locator(".list-item").all()
new_items = current_items[previous_count:]
for item in new_items:
items.append({
"title": item.locator(".item-title").text_content(),
"price": item.locator(".item-price").text_content()
})
if len(current_items) == previous_count:
break # No new items loaded
previous_count = len(current_items)
browser.close()
return items
results = scrape_infinite_scroll("https://example-spa.com/shop")
print(f"Collected {len(results)} items")
Step 5: Use SearchHive ScrapeForge for Production Scraping
Managing Playwright instances, proxy rotation, and anti-bot detection at scale is a full-time job. SearchHive's ScrapeForge handles all of it:
from searchhive import ScrapeForge
client = ScrapeForge(api_key="your_api_key")
# Scrape any SPA — handles JS rendering, anti-bot, CAPTCHAs
page = client.scrape(
url="https://react-spa-example.com/products/123",
render_js=True, # Automatically renders JavaScript
wait_for=".product-title" # Waits for specific element
)
print(f"Title: {page.title}")
print(f"Content: {len(page.content)} chars")
print(f"Rendered HTML length: {len(page.html)} chars")
# Extract structured data with CSS selectors
products = page.css(".product-card").extract_all([
{"name": "title", "selector": ".product-title", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "url", "selector": "a", "type": "attribute", "attr": "href"}
])
for product in products:
print(f"{product['title']}: {product['price']}")
Benefits of using ScrapeForge over managing Playwright yourself:
- No infrastructure — no servers running headless browsers
- Anti-bot built-in — 99.3% success rate against Cloudflare and similar protections
- 17ms average latency — Rust-based, not Python-based
- Pay per use — from $9/mo, no idle server costs
Step 6: Scrape Multiple SPA Pages at Scale
For production workloads, you need to scrape hundreds or thousands of pages:
from searchhive import ScrapeForge
from concurrent.futures import ThreadPoolExecutor
import json
client = ScrapeForge(api_key="your_api_key")
def scrape_product(url):
try:
page = client.scrape(url, render_js=True)
return {
"url": url,
"title": page.title,
"success": True,
"content_length": len(page.content)
}
except Exception as e:
return {"url": url, "success": False, "error": str(e)}
# Scrape 50 product pages in parallel
urls = [f"https://spa-store.com/products/{i}" for i in range(1, 51)]
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(scrape_product, urls))
successful = sum(1 for r in results if r["success"])
print(f"Scraped {successful}/{len(urls)} pages successfully")
# Save results
with open("spa_products.json", "w") as f:
json.dump(results, f, indent=2)
Common Issues When Scraping SPAs
Content Not Loading
If the page appears empty, the JavaScript may need more time. Add a wait condition or increase the timeout:
page = client.scrape(url, render_js=True, wait_for=5000) # Wait 5 seconds
Authentication Required
Some SPAs require login. SearchHive supports passing cookies and headers:
page = client.scrape(
url="https://app.example.com/dashboard",
render_js=True,
cookies={"session_id": "your_session_token"},
headers={"Authorization": "Bearer your_token"}
)
Rate Limiting
If you are hitting rate limits, add delays between requests or use SearchHive's built-in rate limiting:
import time
for url in urls:
result = scrape_product(url)
time.sleep(0.5) # 500ms delay between requests
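A fixed sleep is fragile: if the site's limit tightens, every request fails. Exponential backoff with jitter recovers from 429s more gracefully. A minimal sketch — the retry count and base delay are arbitrary starting points; the callable should raise on a rate-limit response (e.g. via response.raise_for_status()):

```python
import random
import time

def with_backoff(fn, max_retries=4, base_delay=0.5):
    """Retry fn() with exponential backoff plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # 0.5s, 1s, 2s, 4s... plus jitter so parallel workers desync
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
            time.sleep(delay)

# Example with a flaky callable that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

print(with_backoff(flaky))  # "ok" after two retries
```

Wrap your scrape_product calls in with_backoff and you can keep the ThreadPoolExecutor from Step 6 without hammering the target when it pushes back.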
Next Steps
Now that you know how to scrape single page applications, here are related tutorials to level up:
- Set up a competitor monitoring API — monitor competitor SPAs for changes
- AI agent MCP integration — connect AI agents to web data
- Automation for monitoring guide — build complete monitoring systems
Start scraping SPAs today — get your free SearchHive API key with 500 credits. No credit card required.
For more scraping examples, check the SearchHive documentation.