Complete Guide to Data Extraction with Python
Data extraction with Python is the backbone of modern applications -- from price monitoring and lead generation to AI training pipelines and market research. This guide covers everything from basic HTML parsing to production-grade extraction pipelines, including when to use APIs versus scraping, and how SearchHive fits into the picture.
Key Takeaways
- Requests + BeautifulSoup handles most static page extraction needs
- Playwright/Selenium are necessary for JavaScript-rendered content
- API-based extraction (like SearchHive's ScrapeForge) avoids bot detection entirely
- Polars outperforms Pandas for large-scale data wrangling
- Rate limiting, retry logic, and error handling are non-negotiable for production
The Data Extraction Landscape in Python
Python dominates data extraction for good reason. The ecosystem is mature, the syntax is readable, and the community has built battle-tested libraries for every extraction scenario.
The main approaches break down into three categories:
- HTML parsing -- Download pages and parse the HTML directly
- Browser automation -- Control a real browser to render pages first
- API-based extraction -- Use a service that handles rendering and parsing for you
Each has trade-offs around cost, complexity, reliability, and scale.
Getting Started: Requests and BeautifulSoup
For static websites where all content is in the raw HTML, requests and BeautifulSoup are the fastest path to working extraction:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0"}
)
soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select(".product-card"):
    name = card.select_one(".product-name").get_text(strip=True)
    price = card.select_one(".price").get_text(strip=True)
    url = card.select_one("a")["href"]
    products.append({"name": name, "price": price, "url": url})

print(f"Extracted {len(products)} products")
```
This works when the server sends complete HTML. The moment a page loads data via JavaScript after the initial request, this approach breaks.
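A quick way to diagnose this before reaching for a browser is to check whether your target selector matches anything in the raw HTML. A rough heuristic sketch -- the selectors and HTML snippets here are illustrative, not from a real site:

```python
from bs4 import BeautifulSoup

def needs_js_rendering(html: str, selector: str) -> bool:
    """Rough heuristic: if the CSS selector matches nothing in the raw
    HTML, the content is probably rendered client-side."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector)) == 0

# A typical SPA shell: the root div is empty until JavaScript runs
spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static_html = '<html><body><div class="product-card">Widget A</div></body></html>'

print(needs_js_rendering(spa_html, ".product-card"))     # True
print(needs_js_rendering(static_html, ".product-card"))  # False
```

This catches the common case; some sites ship partial HTML and hydrate the rest, so spot-check a few pages by hand too.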
Handling JavaScript-Rendered Pages
Modern web apps (React, Vue, Next.js) often render content client-side. The HTML you get from requests contains no actual data -- just an empty `<div id="root">` and a bundle of JavaScript.
Option A: Find the underlying API
Many SPAs fetch data from internal APIs. Open DevTools, check the Network tab, and look for XHR/Fetch requests:
```python
import requests

# The SPA calls this API internally
response = requests.get(
    "https://example.com/api/v1/products",
    params={"page": 1, "limit": 50},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json"
    }
)
data = response.json()

for product in data["items"]:
    print(product["name"], product["price"])
```
This is the cleanest approach when it works: no browser needed, no rendering overhead, and the data comes back as pre-structured JSON.
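Internal APIs like this are usually paginated, so a small loop that keeps requesting pages until a short page signals the end covers most cases. A sketch under the page/limit parameter scheme assumed above -- the fetch_page callable stands in for the requests.get call:

```python
from typing import Callable, List

def paginate(fetch_page: Callable[[int], list], limit: int = 50, max_pages: int = 100) -> list:
    """Collect items page by page until a short or empty page signals the end."""
    items: List = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        items.extend(batch)
        if len(batch) < limit:  # a short page means we've reached the end
            break
    return items

# Demo with a fake fetcher standing in for the API call above
fake_data = {1: ["a"] * 50, 2: ["b"] * 50, 3: ["c"] * 7}
result = paginate(lambda page: fake_data.get(page, []), limit=50)
print(len(result))  # 107
```

The max_pages guard keeps a misbehaving endpoint (one that never returns a short page) from looping forever.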
Option B: Browser automation with Playwright
When the data is embedded in the rendered page and no API endpoint is available, use Playwright:
```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def extract_dynamic_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content

html = extract_dynamic_page("https://example.com/dynamic-page")
soup = BeautifulSoup(html, "html.parser")
# Now parse the fully rendered HTML as in the static example
```
Playwright is faster than Selenium, has better API design, and handles modern web features natively. The trade-off is resource consumption -- each browser instance uses 100-200MB of RAM.
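If you do run Playwright yourself at any volume, cap the number of simultaneous browser instances so memory stays bounded. A sketch of the pattern using an asyncio semaphore -- fake_render here is a stand-in for a real render call via Playwright's async API:

```python
import asyncio

MAX_BROWSERS = 3  # cap concurrent browser instances to bound memory use

async def render_with_limit(urls, render_page):
    """Run render_page over many URLs, at most MAX_BROWSERS at a time."""
    sem = asyncio.Semaphore(MAX_BROWSERS)

    async def bounded(url):
        async with sem:
            return await render_page(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))

# Demo with a stand-in renderer; swap in Playwright's async API in practice
async def fake_render(url: str) -> str:
    await asyncio.sleep(0.01)  # simulates page load time
    return f"<html>{url}</html>"

urls = [f"https://example.com/p{i}" for i in range(10)]
results = asyncio.run(render_with_limit(urls, fake_render))
print(len(results))  # 10
```

Three concurrent Chromium instances at 100-200MB each keeps you comfortably under 1GB; tune the cap to your machine.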
Why API-Based Extraction Wins at Scale
Running browser instances at scale is expensive and fragile. Sites change their DOM structure, add CAPTCHAs, and implement anti-bot measures that break your scrapers constantly.
SearchHive's ScrapeForge API handles all of this for you:
```python
import httpx

SEARCHHIVE_API_KEY = "your-api-key-here"

def scrape_with_searchhive(url: str) -> dict:
    response = httpx.post(
        "https://api.searchhive.dev/v1/scrape",
        json={"url": url, "format": "markdown"},
        headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"}
    )
    return response.json()

# Extract structured data from any page
result = scrape_with_searchhive("https://example.com/products")
print(result.get("title", ""))
print(result.get("content", "")[:500])
```
Key advantages over running your own scrapers:
- No browser infrastructure -- SearchHive manages headless browsers at scale
- Built-in anti-detection -- Rotating proxies, fingerprint spoofing, and CAPTCHA handling
- Structured output -- Get markdown, JSON, or clean HTML without writing parsers
- Unified API -- Same endpoint handles static pages, SPAs, and blocked sites
At $49/month for 100K credits on the Builder plan, ScrapeForge costs less than running a single cloud server for browser automation. See the pricing page for full details.
Processing Extracted Data with Polars
Once you've extracted data, you need to clean and transform it. Polars is significantly faster than Pandas for this:
```python
import polars as pl

# Load extracted data
df = pl.DataFrame({
    "name": ["Widget A", "Widget B", "Widget C"],
    "price_raw": ["$29.99", "$49.99", "FREE"],
    "rating": ["4.5/5", "3.8/5", "N/A"]
})

# Clean price column: strip currency symbols, coerce failures to 0
df = df.with_columns(
    pl.col("price_raw")
    .str.replace_all(r"[^\d.]", "")
    .cast(pl.Float64, strict=False)
    .fill_null(0)
    .alias("price")
)

# Clean rating column: pull the numeric part out of "4.5/5"
df = df.with_columns(
    pl.col("rating")
    .str.extract(r"(\d+\.?\d*)")
    .cast(pl.Float64, strict=False)
    .alias("rating_num")
)

# Filter and sort
results = (
    df.filter(pl.col("price") > 0)
    .sort("rating_num", descending=True)
)
print(results)
```
Building a Production Extraction Pipeline
Here's a complete pipeline that combines web search with scraping for comprehensive data collection:
```python
import asyncio

import httpx

SEARCHHIVE_API_KEY = "your-api-key-here"

async def search_and_extract(query: str, num_results: int = 5) -> list:
    async with httpx.AsyncClient() as client:
        # Step 1: Search for relevant pages
        search_resp = await client.get(
            "https://api.searchhive.dev/v1/search",
            params={"q": query, "limit": num_results},
            headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"}
        )
        urls = [r["url"] for r in search_resp.json().get("results", [])]

        # Step 2: Scrape each result concurrently
        tasks = [extract_page(client, url) for url in urls]
        return await asyncio.gather(*tasks)

async def extract_page(client: httpx.AsyncClient, url: str) -> dict:
    try:
        resp = await client.post(
            "https://api.searchhive.dev/v1/scrape",
            json={"url": url, "format": "markdown"},
            headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"}
        )
        data = resp.json()
        return {"url": url, "title": data.get("title"), "content": data.get("content", "")}
    except Exception as e:
        return {"url": url, "error": str(e)}

# Run the pipeline
results = asyncio.run(search_and_extract("best python data extraction libraries 2026"))
for r in results:
    if "error" not in r:
        print(f"[{r['title']}] {r['content'][:100]}...")
```
Best Practices for Data Extraction
Respect rate limits. Add delays between requests, check for Retry-After headers, and use exponential backoff:
```python
import time

import requests

def fetch_with_retry(url: str, max_retries: int = 3) -> requests.Response:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                # Honor Retry-After if the server sends it; otherwise back off exponentially
                wait = int(e.response.headers.get("Retry-After", 2 ** attempt + 1))
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
            else:
                raise
    raise Exception(f"Failed after {max_retries} retries")
```
Validate extracted data. Always check that extracted fields match expected formats:
```python
import re

def validate_price(price_str: str) -> float:
    match = re.search(r"\d+\.?\d*", price_str)
    if match:
        return float(match.group())
    raise ValueError(f"Could not parse price from: {price_str}")
```
Cache aggressively. Don't fetch the same page twice. Use SQLite or Redis:
```python
import sqlite3

class ExtractionCache:
    def __init__(self, db_path="cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data TEXT, fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
        )

    def get(self, url: str):
        row = self.conn.execute("SELECT data FROM cache WHERE url=?", (url,)).fetchone()
        return row[0] if row else None

    def set(self, url: str, data: str):
        self.conn.execute("INSERT OR REPLACE INTO cache (url, data) VALUES (?, ?)", (url, data))
        self.conn.commit()
```
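Wiring the cache into a fetch path is then a one-line check before the network call. A usage sketch -- the class is repeated in compact form so the snippet runs standalone, and fake_fetch stands in for a real HTTP request:

```python
import sqlite3

class ExtractionCache:  # compact repeat of the class above, for a standalone demo
    def __init__(self, db_path="cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data TEXT)")

    def get(self, url):
        row = self.conn.execute("SELECT data FROM cache WHERE url=?", (url,)).fetchone()
        return row[0] if row else None

    def set(self, url, data):
        self.conn.execute("INSERT OR REPLACE INTO cache (url, data) VALUES (?, ?)", (url, data))
        self.conn.commit()

cache = ExtractionCache(":memory:")  # in-memory SQLite, handy for quick tests

def cached_fetch(url: str, fetch) -> str:
    """Check the cache before hitting the network; store misses."""
    hit = cache.get(url)
    if hit is not None:
        return hit
    data = fetch(url)
    cache.set(url, data)
    return data

calls = []
def fake_fetch(url):  # stand-in for a real HTTP fetch
    calls.append(url)
    return f"content of {url}"

cached_fetch("https://example.com/a", fake_fetch)
cached_fetch("https://example.com/a", fake_fetch)  # second call is served from cache
print(len(calls))  # 1 -- the network was hit only once
```

For long-running scrapers, add an age check on fetched_at so stale entries eventually refresh.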
When to Choose Each Approach
| Scenario | Best Approach |
|---|---|
| Static HTML pages | requests + BeautifulSoup |
| SPA with hidden API | Direct API calls |
| Any page, JS-heavy | Playwright (local) or ScrapeForge (API) |
| Large-scale production | ScrapeForge API + Polars pipeline |
| Need search + extraction | SwiftSearch + ScrapeForge combo |
Get Started with SearchHive
SearchHive's free tier gives you 500 credits to test all three APIs -- SwiftSearch for web search, ScrapeForge for page extraction, and DeepDive for research. No credit card required.
Sign up at searchhive.dev and check the API documentation for complete reference.
Related: /compare/firecrawl | /tutorials/headless-browser-scraping