Structured data extraction turns unstructured web pages into clean, machine-readable formats: JSON, CSV, or database records. It is the backbone of price monitoring, lead generation, market research, and AI training data pipelines.
This guide covers the practical side of structured data extraction -- from choosing the right approach to implementing reliable pipelines that handle JavaScript rendering, anti-bot systems, and schema validation.
Key Takeaways
- Structured data extraction converts HTML into JSON/CSV using parsers, CSS selectors, or LLM-based extraction
- JavaScript-heavy sites require rendering before extraction -- headless browsers or hosted APIs
- Schema validation catches extraction errors before they propagate downstream
- SearchHive ScrapeForge handles rendering, proxy rotation, and schema extraction in a single API call
- Production pipelines need retry logic, rate limiting, and data quality checks
What Is Structured Data Extraction?
Structured data extraction is the process of identifying specific data points on a web page and converting them into a consistent format. Instead of raw HTML, you get:
{
  "products": [
    {"name": "Widget Pro", "price": "$49.99", "rating": "4.5", "in_stock": true},
    {"name": "Widget Lite", "price": "$19.99", "rating": "4.2", "in_stock": true}
  ]
}
This is different from simply downloading a page. Extraction implies parsing, transforming, and validating the data.
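To make "parsing, transforming, and validating" concrete, here is a minimal, dependency-free sketch of the transform and validate steps. The field names mirror the JSON example above; the parse step is assumed to have already pulled raw strings out of the HTML.

```python
def transform(raw: dict) -> dict:
    """Normalize raw extracted strings into typed fields."""
    return {
        "name": raw["name"].strip(),
        "price": float(raw["price"].lstrip("$")),   # "$49.99" -> 49.99
        "rating": float(raw["rating"]),
        "in_stock": raw["in_stock"].lower() == "true",
    }

def validate(record: dict, required=("name", "price")) -> list:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = [f"missing {f}" for f in required if not record.get(f)]
    if "price" in record and record["price"] < 0:
        errors.append("negative price")
    return errors

raw = {"name": " Widget Pro ", "price": "$49.99", "rating": "4.5", "in_stock": "true"}
record = transform(raw)
```

Downloading gives you `raw`; extraction gives you `record` plus a guarantee (via `validate`) that required fields actually arrived.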
Common Approaches
1. CSS Selector Parsing
The most direct approach: load HTML, select elements by CSS class or XPath, extract text content.
import requests
from bs4 import BeautifulSoup

def extract_products(url):
    resp = requests.get(url)
    resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    products = []
    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one("h3.product-name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "rating": card.select_one(".stars")["data-rating"],
            "url": card.select_one("a")["href"]
        })
    return products

data = extract_products("https://example.com/category/widgets")
print(f"Extracted {len(data)} products")
Limitations: Breaks when the site changes its HTML structure. Cannot handle JavaScript-rendered content.
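One way to soften the selector-breakage problem is to try several selectors in priority order rather than hardcoding one. This is a generic sketch, not a specific library API: `select_one` here is assumed to be any bs4-style callable that returns a node or None.

```python
def select_with_fallbacks(select_one, selectors, default=None):
    """Return the first non-None match from a list of candidate selectors."""
    for sel in selectors:
        node = select_one(sel)
        if node is not None:
            return node
    return default

# Simulated page where the primary class was renamed in a redesign;
# a dict's .get stands in for a bs4-style select_one for illustration.
fake_dom = {".price--current": "$49.99"}
price = select_with_fallbacks(fake_dom.get, [".price", ".price--current"])
```

With real BeautifulSoup objects you would pass `card.select_one` as the callable; the extraction then survives one round of class renames per fallback you register.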
2. XPath Extraction
XPath offers more powerful selection than CSS selectors -- you can select elements by text content, position, or computed values.
from lxml import html
import requests

def extract_with_xpath(url):
    resp = requests.get(url)
    tree = html.fromstring(resp.content)
    # Extract all product names that contain "Pro"
    names = tree.xpath('//h3[contains(@class, "product-name") and contains(text(), "Pro")]/text()')
    # Extract prices from sibling elements
    prices = tree.xpath('//h3[contains(@class, "product-name")]/following-sibling::div[contains(@class, "price")]/text()')
    return list(zip(names, prices))
3. LLM-Based Extraction
Use an LLM to understand the page context and extract structured data. Works well for sites with inconsistent HTML.
import requests

def extract_with_llm(url, api_key):
    # First, get the page content as markdown
    page_resp = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "format": "markdown"}
    )
    markdown_content = page_resp.json()["data"]["content"]
    # Then extract structured data using DeepDive
    extract_resp = requests.post(
        "https://api.searchhive.dev/v1/deepdive",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "query": f"Extract all products from this page as JSON with fields: name, price, rating, availability. Page content:\n\n{markdown_content[:3000]}",
            "output_format": "json"
        }
    )
    return extract_resp.json()["data"]
4. API-Based Extraction with SearchHive
The most production-ready approach: let SearchHive handle rendering, parsing, and schema validation.
import requests
import json

def extract_structured(url, api_key):
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "render_js": True,
            "extract": {
                "type": "schema",
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "price": {"type": "string"},
                        "description": {"type": "string"},
                        "specifications": {
                            "type": "object",
                            "properties": {
                                "brand": {"type": "string"},
                                "model": {"type": "string"},
                                "weight": {"type": "string"},
                                "dimensions": {"type": "string"}
                            }
                        },
                        "reviews": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "author": {"type": "string"},
                                    "rating": {"type": "number"},
                                    "text": {"type": "string"}
                                }
                            }
                        },
                        "availability": {"type": "string"},
                        "shipping": {"type": "string"}
                    },
                    "required": ["title", "price"]
                }
            }
        }
    )
    if response.status_code == 200:
        return response.json()["data"]
    else:
        raise Exception(f"Extraction failed: {response.status_code} {response.text}")

# Usage
result = extract_structured(
    "https://example.com/product/widget-pro-2026",
    "your-api-key"
)
print(json.dumps(result, indent=2))
Handling JavaScript-Rendered Pages
Many modern sites (React, Vue, Angular, Next.js) render content client-side. A simple requests.get() returns an empty page or loading skeleton.
Solutions:
| Approach | Pros | Cons |
|---|---|---|
| Headless browser (Puppeteer/Playwright) | Full control, no vendor lock-in | Heavy infrastructure, slow |
| API service (ScrapeForge) | Zero infrastructure, fast | Vendor dependency |
| Pre-render service (Jina Reader) | Free tier available | Single-page only, no crawling |
SearchHive ScrapeForge renders JavaScript by default. Set render_js: true (or just omit it -- rendering is on by default for ScrapeForge).
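Before reaching for a headless browser or enabling rendering, it can help to detect whether a page even needs it. A rough heuristic (the 200-character threshold is an arbitrary assumption to tune): strip tags and scripts, and flag pages whose HTML carries almost no visible text, which is the signature of a client-rendered skeleton.

```python
import re

def looks_client_rendered(html_text: str, min_text_chars: int = 200) -> bool:
    """Rough heuristic: flag pages whose HTML carries almost no visible text."""
    # Drop script/style bodies, then strip all remaining tags
    stripped = re.sub(r"(?s)<(script|style).*?</\1>", "", html_text)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars

skeleton = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
static = "<html><body>" + "<p>Real product copy here.</p>" * 40 + "</body></html>"
```

Running this on fetched HTML lets a pipeline fall back to plain `requests` for static pages and pay the rendering cost only where the heuristic fires.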
Building a Production Pipeline
A reliable extraction pipeline needs more than a single API call. Here is a production-ready pattern:
import requests
import json
import time
import logging
from typing import Optional, Dict, Any

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class StructuredExtractor:
    def __init__(self, api_key: str, max_retries: int = 3):
        self.api_key = api_key
        self.max_retries = max_retries
        self.base_url = "https://api.searchhive.dev/v1/scrape"

    def extract(self, url: str, schema: Dict[str, Any]) -> Optional[Dict]:
        for attempt in range(self.max_retries):
            try:
                response = requests.post(
                    self.base_url,
                    headers={
                        "Authorization": f"Bearer {self.api_key}",
                        "Content-Type": "application/json"
                    },
                    json={
                        "url": url,
                        "render_js": True,
                        "extract": {"type": "schema", "schema": schema}
                    },
                    timeout=30
                )
                if response.status_code == 200:
                    data = response.json().get("data", {})
                    self._validate(data, schema)
                    return data
                elif response.status_code == 429:
                    wait = 2 ** attempt
                    logger.warning(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
                else:
                    logger.error(f"HTTP {response.status_code}: {response.text[:200]}")
                    return None
            except requests.exceptions.Timeout:
                logger.warning(f"Timeout on attempt {attempt + 1}")
                time.sleep(2 ** attempt)
            except Exception as e:
                logger.error(f"Error: {e}")
                return None
        return None

    def _validate(self, data: Dict, schema: Dict):
        # Check required fields exist
        required = schema.get("required", [])
        for field in required:
            if field not in data:
                logger.warning(f"Missing required field: {field}")

    def batch_extract(self, urls: list, schema: Dict) -> list:
        results = []
        for url in urls:
            result = self.extract(url, schema)
            if result:
                results.append({"url": url, "data": result})
            time.sleep(0.5)  # Polite delay
        return results

# Usage
extractor = StructuredExtractor("your-api-key")
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "rating": {"type": "string"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["name", "price"]
}
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3"
]
results = extractor.batch_extract(urls, product_schema)
for r in results:
    print(f"{r['url']}: {r['data'].get('name', 'N/A')} - {r['data'].get('price', 'N/A')}")
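A batch run like the one above only tells you what succeeded; in production you also want a failure-rate check over the whole batch. A minimal sketch, assuming the `{"url": ..., "data": ...}` result shape from `batch_extract` (the 5% alert threshold is an assumption to tune per source):

```python
def extraction_health(attempted_urls, results, max_failure_rate=0.05):
    """Summarize a batch run and flag failure rates above the threshold.

    `results` is assumed to be a list of {"url": ..., "data": ...} dicts,
    one per successful extraction.
    """
    succeeded = {r["url"] for r in results}
    failed = [u for u in attempted_urls if u not in succeeded]
    rate = len(failed) / len(attempted_urls) if attempted_urls else 0.0
    return {
        "attempted": len(attempted_urls),
        "failed": failed,
        "failure_rate": rate,
        "alert": rate > max_failure_rate,
    }
```

Wiring this into a scheduler lets a spike in failures (a site redesign, a new CAPTCHA wall) page you before bad or missing data reaches the database.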
Common Pitfalls
1. Fragile selectors. Sites change their CSS classes frequently. Solution: use schema-based extraction (ScrapeForge) or LLM-powered extraction (DeepDive) instead of hardcoding selectors.
2. Pagination handling. Many data sets span multiple pages. SearchHive ScrapeForge supports paginate: true to automatically crawl paginated content.
3. Encoding issues. International sites use various encodings. Always specify response.encoding = response.apparent_encoding when using requests, or let SearchHive handle it server-side.
4. Rate limiting. Aggressive scraping triggers blocks. Build delays into your pipeline and use rotating proxies (included with SearchHive).
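The encoding pitfall above is easiest to see with the same bytes decoded two ways: "café" encoded as UTF-8 but misread as Latin-1 becomes the classic mojibake of a missed charset. On a real `requests` response, setting `response.encoding = response.apparent_encoding` performs this charset detection for you.

```python
# The same bytes, decoded with the right and the wrong codec.
raw = "café".encode("utf-8")

correct = raw.decode("utf-8")      # what the site meant
mojibake = raw.decode("latin-1")   # what a wrong charset guess produces
```

If extracted text shows sequences like `Ã©` where accented characters should be, the decoding step (not the selector) is the bug.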
When to Use What
| Scenario | Best Tool |
|---|---|
| Static HTML, simple structure | BeautifulSoup + CSS selectors |
| Dynamic pages, need rendering | SearchHive ScrapeForge |
| Inconsistent HTML, need context | SearchHive DeepDive (LLM) |
| LinkedIn/social media profiles | PhantomBuster (pre-built) |
| Bulk crawling, many pages | ScrapeForge with pagination |
| Custom schemas, validation | ScrapeForge schema extraction |
Lessons from Production
After building extraction pipelines for hundreds of customers, here is what works:
1. Define your schema first. Know exactly what fields you need before writing a single line of code. This prevents scope creep and makes validation straightforward.
2. Use schema-based extraction. Tools like ScrapeForge validate output against your schema, catching missing fields before they hit your database.
3. Build idempotent pipelines. Every URL should be extractable multiple times without side effects. Store URLs in a queue and track status.
4. Monitor data quality. Set up alerts for extraction failure rates above 5%. Common causes: site redesigns, CAPTCHAs, rate limits.
5. Start with an API, graduate to self-hosted if needed. SearchHive handles 90% of use cases. Only invest in Puppeteer infrastructure if you need custom browser behavior that APIs cannot replicate.
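The idempotent-pipeline lesson can be sketched with a stdlib-only URL queue that tracks status, so re-enqueueing or re-running a batch is a no-op for work already done. The table name and the `pending`/`done` statuses here are illustrative assumptions, not a prescribed design.

```python
import sqlite3

class UrlQueue:
    """Minimal idempotent URL queue: duplicate enqueues and re-runs are no-ops."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
        )

    def enqueue(self, urls):
        # PRIMARY KEY + INSERT OR IGNORE makes duplicate enqueues harmless
        self.db.executemany(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)", [(u,) for u in urls]
        )
        self.db.commit()

    def pending(self):
        rows = self.db.execute("SELECT url FROM queue WHERE status = 'pending'")
        return [r[0] for r in rows]

    def mark_done(self, url):
        self.db.execute("UPDATE queue SET status = 'done' WHERE url = ?", (url,))
        self.db.commit()
```

A crashed run restarts by simply draining `pending()` again; finished URLs stay finished, so each page is extracted exactly once per batch.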
Ready to extract structured data at scale? Get started with SearchHive free -- 500 credits, no credit card, full access to SwiftSearch, ScrapeForge, and DeepDive. Check the ScrapeForge documentation for schema extraction examples and compare with other tools.