Scraping single page applications (SPAs) built with React, Vue, Angular, or Svelte is fundamentally different from scraping traditional server-rendered websites. SPAs render content in the browser using JavaScript, which means a simple HTTP request returns an empty HTML shell — not the data you want.
This tutorial walks you through how to scrape single page applications using modern Python tools, including headless browsers, API interception, and managed scraping services that handle the hard parts for you.
Key Takeaways
- SPAs return empty HTML shells on initial load — you need JavaScript execution to get the actual content
- Three approaches work: headless browsers (Playwright/Puppeteer), API interception (reverse-engineer XHR calls), and managed scraping APIs
- API interception is fastest and most reliable when you can find the data endpoints
- SearchHive's ScrapeForge handles SPA rendering automatically — no browser management needed
- Always check robots.txt and respect rate limits when scraping any website
Prerequisites
Before starting, you need:
- Python 3.8+ installed
- pip for package management
- A basic understanding of HTTP requests and HTML structure
- A SearchHive account (free tier includes 500 credits)
Install the dependencies we will use:
pip install requests playwright searchhive
playwright install chromium
Step 1: Understand Why SPAs Break Traditional Scraping
A traditional server-rendered page returns complete HTML:
<!-- Server-rendered: content is in the HTML -->
<html>
<body>
<h1>Product Details</h1>
<p class="price">$49.99</p>
<p class="description">Amazing widget...</p>
</body>
</html>
An SPA returns an empty shell:
<!-- SPA: content is loaded by JavaScript -->
<html>
<body>
<div id="root"></div>
<script src="/static/js/main.a1b2c3.js"></script>
</body>
</html>
When you send a requests.get() to an SPA, you get the second version — no product data, no prices, nothing useful. The JavaScript bundle executes in the browser and populates the DOM. That is why traditional scraping fails.
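Before writing any scraper, you can confirm you are dealing with an SPA by checking the raw HTML for an empty mount point. The heuristic below is a sketch — the `root`/`app` id names and the word-count threshold are assumptions that fit most React and Vue apps, not a universal rule:

```python
import re

def looks_like_spa_shell(html: str) -> bool:
    """Heuristic: an empty mount point plus a JS bundle suggests an SPA."""
    has_empty_root = re.search(r'<div id="(root|app)">\s*</div>', html) is not None
    has_bundle = "<script" in html
    # Very little visible text is another strong SPA signal
    visible = re.sub(r"<[^>]+>", " ", html)
    return has_empty_root and has_bundle and len(visible.split()) < 20

server_rendered = '<html><body><h1>Product</h1><p class="price">$49.99</p></body></html>'
spa_shell = ('<html><body><div id="root"></div>'
             '<script src="/static/js/main.a1b2c3.js"></script></body></html>')

print(looks_like_spa_shell(server_rendered))  # False
print(looks_like_spa_shell(spa_shell))        # True
```

Run this against `requests.get(url).text` — if it returns True, skip ahead to the headless-browser or API-interception approaches.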
Step 2: Check If the Site Has a Backend API
Most SPAs fetch data from a backend API. Before reaching for a headless browser, check the network tab in your browser DevTools:
- Open the SPA in Chrome/Firefox
- Press F12 to open DevTools
- Go to the Network tab
- Reload the page
- Look for XHR/Fetch requests that return JSON data
If you find a clean JSON endpoint, you can skip the browser entirely:
import requests
# Example: many React apps fetch data from /api/ endpoints
api_url = "https://example-spa.com/api/products/123"
response = requests.get(api_url, headers={
"Accept": "application/json",
"User-Agent": "Mozilla/5.0"
})
product = response.json()
print(f"Name: {product['name']}")
print(f"Price: {product['price']}")
print(f"Description: {product['description']}")
This approach is fast, reliable, and does not require a browser. But many SPAs use authentication tokens, complex headers, or obfuscated endpoints that make direct API access difficult.
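When the endpoint does require a token, you can often replay the request by copying the headers from the DevTools Network tab ("Copy as cURL" shows them all). A minimal sketch — the header names below are typical for React/Vue apps, but substitute whatever your own capture shows:

```python
import requests

def api_session(token: str) -> requests.Session:
    """Build a session that mimics the browser's XHR headers.

    The header set is illustrative; copy the real values from your
    own DevTools capture of the working request.
    """
    s = requests.Session()
    s.headers.update({
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
        "Authorization": f"Bearer {token}",
        "X-Requested-With": "XMLHttpRequest",
    })
    return s

# Usage (endpoint is hypothetical):
# resp = api_session("token-from-devtools").get(
#     "https://example-spa.com/api/products", params={"page": 1})
# resp.raise_for_status()
```

If the token expires quickly or is derived by obfuscated JavaScript, move on to Step 3 instead of reverse-engineering it.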
Step 3: Use Playwright for Full Page Rendering
When API interception is not feasible, Playwright gives you a real browser that executes JavaScript and waits for content to render:
from playwright.sync_api import sync_playwright
import json
def scrape_spa(url):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate and wait for content to render
page.goto(url, wait_until="networkidle")
# Wait for specific elements if needed
page.wait_for_selector(".product-title", timeout=10000)
# Extract data
title = page.locator(".product-title").first.text_content()
price = page.locator(".price-value").first.text_content()
description = page.locator(".product-description").first.text_content()
browser.close()
return {
"title": title.strip(),
"price": price.strip(),
"description": description.strip()
}
result = scrape_spa("https://example-spa.com/products/123")
print(json.dumps(result, indent=2))
The downside: Playwright is slow (1-3 seconds per page), resource-heavy, and many sites detect headless browsers. You will need stealth plugins, proxy rotation, and constant maintenance.
Step 4: Handle Infinite Scroll and Dynamic Loading
Many SPAs lazy-load content as the user scrolls. To capture everything:
from playwright.sync_api import sync_playwright
def scrape_infinite_scroll(url, max_scrolls=10):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(url, wait_until="networkidle")
items = []
previous_count = 0
for i in range(max_scrolls):
# Scroll to bottom
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(1500) # Wait for lazy load
# Collect items
current_items = page.locator(".list-item").all()
new_items = current_items[previous_count:]
for item in new_items:
items.append({
"title": item.locator(".item-title").text_content(),
"price": item.locator(".item-price").text_content()
})
if len(current_items) == previous_count:
break # No new items loaded
previous_count = len(current_items)
browser.close()
return items
results = scrape_infinite_scroll("https://example-spa.com/shop")
print(f"Collected {len(results)} items")
Step 5: Use SearchHive ScrapeForge for Production Scraping
Managing Playwright instances, proxy rotation, and anti-bot detection at scale is a full-time job. SearchHive's ScrapeForge handles all of it:
from searchhive import ScrapeForge
client = ScrapeForge(api_key="your_api_key")
# Scrape any SPA — handles JS rendering, anti-bot, CAPTCHAs
page = client.scrape(
url="https://react-spa-example.com/products/123",
render_js=True, # Automatically renders JavaScript
wait_for=".product-title" # Waits for specific element
)
print(f"Title: {page.title}")
print(f"Content: {len(page.content)} chars")
print(f"Rendered HTML length: {len(page.html)} chars")
# Extract structured data with CSS selectors
products = page.css(".product-card").extract_all([
{"name": "title", "selector": ".product-title", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "url", "selector": "a", "type": "attribute", "attr": "href"}
])
for product in products:
print(f"{product['title']}: {product['price']}")
Benefits of using ScrapeForge over managing Playwright yourself:
- No infrastructure — no servers running headless browsers
- Anti-bot built-in — 99.3% success rate against Cloudflare and similar protections
- 17ms average latency — Rust-based, not Python-based
- Pay per use — from $9/mo, no idle server costs
Step 6: Scrape Multiple SPA Pages at Scale
For production workloads, you need to scrape hundreds or thousands of pages:
from searchhive import ScrapeForge
from concurrent.futures import ThreadPoolExecutor
import json
client = ScrapeForge(api_key="your_api_key")
def scrape_product(url):
try:
page = client.scrape(url, render_js=True)
return {
"url": url,
"title": page.title,
"success": True,
"content_length": len(page.content)
}
except Exception as e:
return {"url": url, "success": False, "error": str(e)}
# Scrape 50 product pages in parallel
urls = [f"https://spa-store.com/products/{i}" for i in range(1, 51)]
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(scrape_product, urls))
successful = sum(1 for r in results if r["success"])
print(f"Scraped {successful}/{len(urls)} pages successfully")
# Save results
with open("spa_products.json", "w") as f:
json.dump(results, f, indent=2)
Common Issues When Scraping SPAs
Content Not Loading
If the page appears empty, the JavaScript may need more time. Add a wait condition or increase the timeout:
page = client.scrape(url, render_js=True, wait_for=5000) # Wait 5 seconds
Authentication Required
Some SPAs require login. SearchHive supports passing cookies and headers:
page = client.scrape(
url="https://app.example.com/dashboard",
render_js=True,
cookies={"session_id": "your_session_token"},
headers={"Authorization": "Bearer your_token"}
)
Rate Limiting
If you are hitting rate limits, add delays between requests or use SearchHive's built-in rate limiting:
import time
for url in urls:
result = scrape_product(url)
time.sleep(0.5) # 500ms delay between requests
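A fixed sleep is fragile: if the site's limit tightens, every request fails. Exponential backoff with jitter recovers from 429s more gracefully. A minimal sketch — the retry count and base delay are arbitrary starting points; the callable should raise on a rate-limit response (e.g. via response.raise_for_status()):

```python
import random
import time

def with_backoff(fn, max_retries=4, base_delay=0.5):
    """Retry fn() with exponential backoff plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # 0.5s, 1s, 2s, 4s... plus jitter so parallel workers desync
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
            time.sleep(delay)

# Example with a flaky callable that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

print(with_backoff(flaky))  # "ok" after two retries
```

Wrap your scrape_product calls in with_backoff and you can keep the ThreadPoolExecutor from Step 6 without hammering the target when it pushes back.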
Next Steps
Now that you know how to scrape single page applications, here are related tutorials to level up:
- Set up a competitor monitoring API — monitor competitor SPAs for changes
- AI agent MCP integration — connect AI agents to web data
- Automation for monitoring guide — build complete monitoring systems
Start scraping SPAs today — get your free SearchHive API key with 500 credits. No credit card required.
For more scraping examples, check the SearchHive documentation.