# How to Extract Data from Any Website with Python
Every developer runs into the same problem: there's data on a website you need, and no API to access it. Whether it's competitor prices, government records, real estate listings, or job postings, the data exists in a browser but not in a format your code can use.
This guide walks through the full spectrum of data extraction approaches in Python, from simple static pages to complex JavaScript-heavy sites, with real code you can adapt to any target.
## Key Takeaways
- Static HTML pages need only `requests` + `BeautifulSoup` -- fast, simple, and reliable
- JavaScript-rendered pages require headless browsers or managed APIs
- SearchHive's ScrapeForge handles JS rendering, proxy rotation, and CAPTCHAs at scale
- Structured extraction schemas let you define exactly which fields you want and get clean JSON back
- Always check for an existing API or RSS feed before scraping
## Step 1: Identify What You're Extracting From
Before writing a single line of code, determine the page type:
| Page Type | Rendering / Barrier | Example | Recommended Approach |
|---|---|---|---|
| Static HTML | Server-side | Blogs, Wikipedia, government sites | requests + BeautifulSoup |
| Dynamic/JS | Client-side | SPAs, React apps, infinite scroll | Playwright / ScrapeForge API |
| Protected | CAPTCHAs, login walls | Amazon, LinkedIn, Instagram | Managed scraping service |
| Paginated | Multiple pages | Search results, product listings | API with pagination logic |
## Step 2: Static HTML Extraction with BeautifulSoup
For pages that serve their content in raw HTML (check by viewing page source vs. what's in the browser):
```python
import requests
from bs4 import BeautifulSoup

def extract_static_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all headings
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Extract all links
    links = [
        {"text": a.get_text(strip=True), "href": a.get("href")}
        for a in soup.find_all("a", href=True)
    ]

    # Extract all images
    images = [img.get("src") for img in soup.find_all("img", src=True)]

    # Extract tables
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
            rows.append(cells)
        tables.append(rows)

    return {"headings": headings, "links": links, "images": images, "tables": tables}

data = extract_static_page("https://en.wikipedia.org/wiki/Web_scraping")
print(f"Found {len(data['headings'])} headings, {len(data['links'])} links")
```
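A quick way to make the static-vs-dynamic call from Step 1 in code: fetch the raw HTML and check whether a string you can see in the browser is actually present in the response. A minimal sketch (`needs_js_rendering` and `check_page` are illustrative helpers, not part of any library):

```python
import requests

def needs_js_rendering(raw_html: str, probe_text: str) -> bool:
    # If text that is visible in the browser is absent from the raw
    # response, the content is being injected client-side by JavaScript.
    return probe_text not in raw_html

def check_page(url: str, probe_text: str) -> bool:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return needs_js_rendering(response.text, probe_text)
```

If `check_page` returns True, skip ahead to Step 3 or 4; otherwise BeautifulSoup is all you need.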
## Step 3: Handle JavaScript-Rendered Pages
Many modern sites render content with JavaScript after the initial page load. `requests` only gets the initial HTML shell -- the actual data is loaded by client-side scripts.
Using Playwright:
```python
import asyncio
from playwright.async_api import async_playwright

async def extract_js_page(url, wait_for=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        if wait_for:
            await page.wait_for_selector(wait_for, timeout=15000)

        # Extract visible text
        text = await page.inner_text("body")

        # Extract story titles (".titleline a" matches Hacker News markup)
        titles = await page.eval_on_selector_all(
            ".titleline a", "els => els.map(e => e.textContent)"
        )

        await browser.close()
        return {"text": text[:5000], "titles": titles}

result = asyncio.run(extract_js_page("https://news.ycombinator.com", wait_for=".athing"))
print(result["titles"][:5])
```
Playwright works well but adds ~200ms per request in overhead and requires managing browser instances. For a handful of pages, fine. For thousands, you want something that scales without the operational burden.
## Step 4: Use SearchHive ScrapeForge for Production Workloads
SearchHive's ScrapeForge API renders JavaScript, rotates proxies, handles CAPTCHAs, and returns structured data. You define what you want; it does the rest.
Basic extraction -- get clean markdown:
```python
import requests

API_KEY = "your-searchhive-api-key"

def scrape_any_page(url):
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={"url": url, "formats": ["markdown"]},
    )
    return response.json()

result = scrape_any_page("https://example.com/product/12345")
print(result.get("markdown", "")[:500])
```
Structured extraction with a schema -- get clean JSON:
```python
def extract_structured(url):
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "extract": {
                "prompt": "Extract the main product information from this page",
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Product or page name"},
                        "price": {"type": "string", "description": "Price if shown"},
                        "description": {"type": "string", "description": "Main description text"},
                        "features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Key features listed on the page",
                        },
                    },
                },
            },
        },
    )
    return response.json().get("extracted", {})

product = extract_structured("https://example.com/product/12345")
print(product)
```
The schema defines exactly what you need. ScrapeForge renders the page, extracts the fields, and returns typed JSON. No regex gymnastics, no fragile CSS selectors.
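Even with schema-based extraction, it's worth sanity-checking a record before it enters your pipeline. A minimal stdlib sketch (`validate_product` is an illustrative helper; the field names mirror the schema in the example above):

```python
def validate_product(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    # The string fields from the schema must be present and non-empty
    for field in ("name", "price", "description"):
        value = extracted.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty: {field}")
    # "features" must be a list of strings
    features = extracted.get("features")
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        problems.append("features must be a list of strings")
    return problems
```

Records with problems can be queued for a retry or flagged for manual review instead of silently polluting your dataset.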
## Step 5: Batch Extraction at Scale
When you need to extract data from hundreds or thousands of pages:
```python
import requests
import time

API_KEY = "your-searchhive-api-key"

def batch_extract(urls, delay=0.3):
    results = []
    for i, url in enumerate(urls):
        try:
            response = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={
                    "Authorization": f"Bearer {API_KEY}",
                    "Content-Type": "application/json",
                },
                json={"url": url, "formats": ["markdown"]},
                timeout=30,
            )
            if response.status_code == 200:
                results.append({"url": url, "status": "ok", "data": response.json()})
            else:
                results.append({"url": url, "status": "error", "code": response.status_code})
        except Exception as e:
            results.append({"url": url, "status": "failed", "error": str(e)})
        if (i + 1) % 10 == 0:
            print(f"Processed {i + 1}/{len(urls)}")
        time.sleep(delay)
    return results

urls = [f"https://example.com/listing/{i}" for i in range(1, 101)]
results = batch_extract(urls)
successful = sum(1 for r in results if r["status"] == "ok")
print(f"Extracted {successful}/{len(urls)} pages")
```
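The sequential loop is easy to reason about but slow for thousands of URLs. A thread-pooled sketch, assuming the same endpoint and payload as above (`max_workers` should stay within your plan's rate limit, which you'd confirm in the provider's docs):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

API_KEY = "your-searchhive-api-key"

def scrape_one(url):
    try:
        response = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"url": url, "formats": ["markdown"]},
            timeout=30,
        )
        status = "ok" if response.status_code == 200 else "error"
        return {"url": url, "status": status}
    except Exception as e:
        return {"url": url, "status": "failed", "error": str(e)}

def batch_extract_concurrent(urls, max_workers=8):
    # Submit all URLs, then collect results in the original order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_one, u) for u in urls]
        return [f.result() for f in futures]
```

Threads work well here because the workload is network-bound; each worker spends nearly all its time waiting on the API.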
## Step 6: Clean and Store the Data
Raw scraped data needs normalization before it's useful:
```python
import json
from datetime import datetime, timezone

def clean_scraped_data(raw_data):
    cleaned = []
    for item in raw_data:
        if item["status"] != "ok":
            continue
        markdown = item["data"].get("markdown", "")
        cleaned.append({
            "url": item["url"],
            "content": markdown,
            "word_count": len(markdown.split()),
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        })
    return cleaned

def save_to_jsonl(data, filename="output.jsonl"):
    with open(filename, "w") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

cleaned = clean_scraped_data(results)
save_to_jsonl(cleaned)
print(f"Saved {len(cleaned)} records to output.jsonl")
```
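One advantage of JSONL is that it reads back one record per line, which makes deduplication and incremental reprocessing straightforward. A sketch (`load_jsonl` and `dedupe_by_url` are illustrative helpers):

```python
import json

def load_jsonl(filename="output.jsonl"):
    # One JSON record per line; skip blank lines
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def dedupe_by_url(records):
    # Keep the first record seen for each URL
    seen, unique = set(), []
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            unique.append(record)
    return unique
```

Dedup matters in practice: retries and overlapping batch runs routinely produce the same URL twice.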
## When to Use Each Approach
| Scenario | Best Tool | Why |
|---|---|---|
| One-off extraction from a simple page | requests + BeautifulSoup | Zero overhead, familiar library |
| SPA or JS-heavy page | Playwright | Handles dynamic rendering |
| Protected site (CAPTCHAs, login) | SearchHive ScrapeForge | Handles auth, CAPTCHAs, proxies |
| 100+ pages at scale | SearchHive ScrapeForge | No infrastructure management |
| Need specific fields extracted | ScrapeForge with schema | Returns clean JSON, not raw HTML |
| Real estate, e-commerce, directories | SearchHive DeepDive | Research-grade extraction with context |
## Common Mistakes to Avoid
- Not checking for APIs first. Many sites have undocumented JSON endpoints. Check Network tab in DevTools before scraping.
- Parsing with regex. HTML is not regular. Use a proper parser.
- Hardcoding selectors. Sites change their DOM structure. Use semantic extraction (ScrapeForge) or flexible selectors.
- No error handling. Network timeouts, 403s, and CAPTCHAs will happen. Handle all failure modes.
- No rate limiting. Even with proxies, hammering a site can get your IPs burned.
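The last two mistakes share one fix: retry transient failures with exponential backoff plus jitter, which both recovers from timeouts and naturally spaces out your requests. A minimal sketch (the helper names are illustrative, not from any library):

```python
import random
import time

def backoff_delays(retries=4, base=1.0, cap=30.0):
    # Exponential schedule: base * 2**attempt, capped at `cap` seconds
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

def fetch_with_retries(fetch, url, retries=4, base=1.0):
    """Call fetch(url); on failure, sleep with backoff + jitter and retry."""
    delays = backoff_delays(retries, base=base)
    for attempt, delay in enumerate(delays):
        try:
            return fetch(url)
        except Exception:
            if attempt == len(delays) - 1:
                raise  # out of retries: surface the last error
            # Jitter spreads retries out so parallel workers don't sync up
            time.sleep(delay + random.uniform(0, delay))
```

Pass any callable as `fetch` -- a `requests.get` wrapper, a ScrapeForge call, or a Playwright coroutine runner.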
## Get Started
SearchHive's free tier gives you 500 credits to test extraction from any website. Sign up, grab your API key, and have your first structured extraction running in minutes. The Builder plan at $49/month covers 100,000 pages -- enough for most data pipeline workflows.