# How to Extract Data from Any Website with Python
Every developer runs into the same problem: there's data on a website you need, and no API to access it. Whether it's competitor prices, government records, real estate listings, or job postings, the data exists in a browser but not in a format your code can use.
This guide walks through the full spectrum of data extraction approaches in Python, from simple static pages to complex JavaScript-heavy sites, with real code you can adapt to any target.
## Key Takeaways
- Static HTML pages need only `requests` + `BeautifulSoup` -- fast, simple, and reliable
- JavaScript-rendered pages require headless browsers or managed APIs
- SearchHive's ScrapeForge handles JS rendering, proxy rotation, and CAPTCHAs at scale
- Structured extraction schemas let you define exactly which fields you want and get clean JSON back
- Always check for an existing API or RSS feed before scraping
## Step 1: Identify What You're Extracting From
Before writing a single line of code, determine the page type:
| Page Type | Rendering / Barrier | Example | Recommended Approach |
|---|---|---|---|
| Static HTML | Server-side | Blogs, Wikipedia, government sites | requests + BeautifulSoup |
| Dynamic/JS | Client-side | SPAs, React apps, infinite scroll | Playwright / ScrapeForge API |
| Protected | CAPTCHAs, login walls | Amazon, LinkedIn, Instagram | Managed scraping service |
| Paginated | Multiple pages | Search results, product listings | API with pagination logic |
## Step 2: Static HTML Extraction with BeautifulSoup
For pages that serve their content in raw HTML (check by viewing page source vs. what's in the browser):
```python
import requests
from bs4 import BeautifulSoup

def extract_static_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all headings
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Extract all links
    links = [
        {"text": a.get_text(strip=True), "href": a.get("href")}
        for a in soup.find_all("a", href=True)
    ]

    # Extract all images
    images = [img.get("src") for img in soup.find_all("img", src=True)]

    # Extract tables
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
            rows.append(cells)
        tables.append(rows)

    return {"headings": headings, "links": links, "images": images, "tables": tables}

data = extract_static_page("https://en.wikipedia.org/wiki/Web_scraping")
print(f"Found {len(data['headings'])} headings, {len(data['links'])} links")
```
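A quick way to make the static-vs-dynamic call from Step 1 in code: fetch the raw HTML and check whether a string you can see in the browser is actually present in the response. A minimal sketch (`needs_js_rendering` and `check_page` are illustrative helpers, not part of any library):

```python
import requests

def needs_js_rendering(raw_html: str, probe_text: str) -> bool:
    # If text that is visible in the browser is absent from the raw
    # response, the content is being injected client-side by JavaScript.
    return probe_text not in raw_html

def check_page(url: str, probe_text: str) -> bool:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return needs_js_rendering(response.text, probe_text)
```

If `check_page` returns True, skip ahead to Step 3 or 4; otherwise BeautifulSoup is all you need.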
## Step 3: Handle JavaScript-Rendered Pages
Many modern sites render content with JavaScript after the initial page load. `requests` only gets the initial HTML shell -- the actual data is loaded by client-side scripts.
Using Playwright:
```python
import asyncio
from playwright.async_api import async_playwright

async def extract_js_page(url, wait_for=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        if wait_for:
            await page.wait_for_selector(wait_for, timeout=15000)

        # Extract visible text
        text = await page.inner_text("body")

        # Extract story titles (".titleline a" matches Hacker News markup)
        titles = await page.eval_on_selector_all(
            ".titleline a", "els => els.map(e => e.textContent)"
        )

        await browser.close()
        return {"text": text[:5000], "titles": titles}

result = asyncio.run(extract_js_page("https://news.ycombinator.com", wait_for=".athing"))
print(result["titles"][:5])
```
Playwright works well but adds ~200ms per request in overhead and requires managing browser instances. For a handful of pages, fine. For thousands, you want something that scales without the operational burden.
## Step 4: Use SearchHive ScrapeForge for Production Workloads
SearchHive's ScrapeForge API renders JavaScript, rotates proxies, handles CAPTCHAs, and returns structured data. You define what you want; it does the rest.
Basic extraction -- get clean markdown:
```python
import requests

API_KEY = "your-searchhive-api-key"

def scrape_any_page(url):
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={"url": url, "formats": ["markdown"]},
    )
    return response.json()

result = scrape_any_page("https://example.com/product/12345")
print(result.get("markdown", "")[:500])
```
Structured extraction with a schema -- get clean JSON:
```python
def extract_structured(url):
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "extract": {
                "prompt": "Extract the main product information from this page",
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Product or page name"},
                        "price": {"type": "string", "description": "Price if shown"},
                        "description": {"type": "string", "description": "Main description text"},
                        "features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Key features listed on the page",
                        },
                    },
                },
            },
        },
    )
    return response.json().get("extracted", {})

product = extract_structured("https://example.com/product/12345")
print(product)
```
The schema defines exactly what you need. ScrapeForge renders the page, extracts the fields, and returns typed JSON. No regex gymnastics, no fragile CSS selectors.
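Even with schema-based extraction, it's worth sanity-checking a record before it enters your pipeline. A minimal stdlib sketch (`validate_product` is an illustrative helper; the field names mirror the schema in the example above):

```python
def validate_product(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    # The string fields from the schema must be present and non-empty
    for field in ("name", "price", "description"):
        value = extracted.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty: {field}")
    # "features" must be a list of strings
    features = extracted.get("features")
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        problems.append("features must be a list of strings")
    return problems
```

Records with problems can be queued for a retry or flagged for manual review instead of silently polluting your dataset.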
## Step 5: Batch Extraction at Scale
When you need to extract data from hundreds or thousands of pages:
```python
import requests
import time

API_KEY = "your-searchhive-api-key"

def batch_extract(urls, delay=0.3):
    results = []
    for i, url in enumerate(urls):
        try:
            response = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={
                    "Authorization": f"Bearer {API_KEY}",
                    "Content-Type": "application/json",
                },
                json={"url": url, "formats": ["markdown"]},
                timeout=30,
            )
            if response.status_code == 200:
                results.append({"url": url, "status": "ok", "data": response.json()})
            else:
                results.append({"url": url, "status": "error", "code": response.status_code})
        except Exception as e:
            results.append({"url": url, "status": "failed", "error": str(e)})
        if (i + 1) % 10 == 0:
            print(f"Processed {i + 1}/{len(urls)}")
        time.sleep(delay)
    return results

urls = [f"https://example.com/listing/{i}" for i in range(1, 101)]
results = batch_extract(urls)
successful = sum(1 for r in results if r["status"] == "ok")
print(f"Extracted {successful}/{len(urls)} pages")
```
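The sequential loop is easy to reason about but slow for thousands of URLs. A thread-pooled sketch, assuming the same endpoint and payload as above (`max_workers` should stay within your plan's rate limit, which you'd confirm in the provider's docs):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

API_KEY = "your-searchhive-api-key"

def scrape_one(url):
    try:
        response = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"url": url, "formats": ["markdown"]},
            timeout=30,
        )
        status = "ok" if response.status_code == 200 else "error"
        return {"url": url, "status": status}
    except Exception as e:
        return {"url": url, "status": "failed", "error": str(e)}

def batch_extract_concurrent(urls, max_workers=8):
    # Submit all URLs, then collect results in the original order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_one, u) for u in urls]
        return [f.result() for f in futures]
```

Threads work well here because the workload is network-bound; each worker spends nearly all its time waiting on the API.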
## Step 6: Clean and Store the Data
Raw scraped data needs normalization before it's useful:
```python
import json
from datetime import datetime, timezone

def clean_scraped_data(raw_data):
    cleaned = []
    for item in raw_data:
        if item["status"] != "ok":
            continue
        markdown = item["data"].get("markdown", "")
        cleaned.append({
            "url": item["url"],
            "content": markdown,
            "word_count": len(markdown.split()),
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        })
    return cleaned

def save_to_jsonl(data, filename="output.jsonl"):
    with open(filename, "w") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

cleaned = clean_scraped_data(results)
save_to_jsonl(cleaned)
print(f"Saved {len(cleaned)} records to output.jsonl")
```
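One advantage of JSONL is that it reads back one record per line, which makes deduplication and incremental reprocessing straightforward. A sketch (`load_jsonl` and `dedupe_by_url` are illustrative helpers):

```python
import json

def load_jsonl(filename="output.jsonl"):
    # One JSON record per line; skip blank lines
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def dedupe_by_url(records):
    # Keep the first record seen for each URL
    seen, unique = set(), []
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            unique.append(record)
    return unique
```

Dedup matters in practice: retries and overlapping batch runs routinely produce the same URL twice.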
## When to Use Each Approach
| Scenario | Best Tool | Why |
|---|---|---|
| One-off extraction from a simple page | requests + BeautifulSoup | Zero overhead, familiar library |
| SPA or JS-heavy page | Playwright | Handles dynamic rendering |
| Protected site (CAPTCHAs, login) | SearchHive ScrapeForge | Handles auth, CAPTCHAs, proxies |
| 100+ pages at scale | SearchHive ScrapeForge | No infrastructure management |
| Need specific fields extracted | ScrapeForge with schema | Returns clean JSON, not raw HTML |
| Real estate, e-commerce, directories | SearchHive DeepDive | Research-grade extraction with context |
## Common Mistakes to Avoid
- Not checking for APIs first. Many sites have undocumented JSON endpoints. Check Network tab in DevTools before scraping.
- Parsing with regex. HTML is not regular. Use a proper parser.
- Hardcoding selectors. Sites change their DOM structure. Use semantic extraction (ScrapeForge) or flexible selectors.
- No error handling. Network timeouts, 403s, and CAPTCHAs will happen. Handle all failure modes.
- No rate limiting. Even with proxies, hammering a site can get your IPs burned.
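The last two mistakes share one fix: retry transient failures with exponential backoff plus jitter, which both recovers from timeouts and naturally spaces out your requests. A minimal sketch (the helper names are illustrative, not from any library):

```python
import random
import time

def backoff_delays(retries=4, base=1.0, cap=30.0):
    # Exponential schedule: base * 2**attempt, capped at `cap` seconds
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

def fetch_with_retries(fetch, url, retries=4, base=1.0):
    """Call fetch(url); on failure, sleep with backoff + jitter and retry."""
    delays = backoff_delays(retries, base=base)
    for attempt, delay in enumerate(delays):
        try:
            return fetch(url)
        except Exception:
            if attempt == len(delays) - 1:
                raise  # out of retries: surface the last error
            # Jitter spreads retries out so parallel workers don't sync up
            time.sleep(delay + random.uniform(0, delay))
```

Pass any callable as `fetch` -- a `requests.get` wrapper, a ScrapeForge call, or a Playwright coroutine runner.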
## Get Started
SearchHive's free tier gives you 500 credits to test extraction from any website. Sign up, grab your API key, and have your first structured extraction running in minutes. The Builder plan at $49/month covers 100,000 pages -- enough for most data pipeline workflows.