How to Build a Web Scraping API -- Step-by-Step Tutorial
A web scraping API lets your applications extract data from websites programmatically. Instead of manual copy-paste or fragile scripts, you send HTTP requests and receive structured data. This tutorial walks through building a production-ready web scraping API using Python, from handling JavaScript rendering to parsing structured data.
Key Takeaways
- Modern ecommerce sites require JavaScript rendering -- plain HTTP requests return empty pages
- CSS selectors are the most reliable way to extract specific data points from HTML
- SearchHive's ScrapeForge API handles rendering, proxy rotation, and extraction in a single call
- Rate limiting and error handling are essential for any scraping pipeline
- A complete scraping API can be built in under 100 lines of Python
Prerequisites
- Python 3.8 or later
- A SearchHive API key (free tier -- 500 credits/month)
- Basic familiarity with HTTP requests and HTML structure
Install the required packages:
pip install requests
Step 1: Understand the Target Page Structure
Before writing any scraping code, inspect the target page to identify what data you need and how it's structured in the HTML.
- Open the target URL in your browser
- Right-click the data you want to extract and select "Inspect"
- Note the HTML tag, class names, and attributes that contain your target data
For example, on a product page you might find:
- Product title in <h1 class="product-title">
- Price in <span class="price" data-price="49.99">
- Rating in <div class="star-rating" data-score="4.5">
- Availability in <span class="stock-status">In Stock</span>
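Before spending API credits, you can sanity-check that the class names you noted actually carry the text you expect. The sketch below uses only the standard library's html.parser (no SearchHive call involved), and the sample HTML is invented for illustration; it only handles the flat, single-level markup shown here:

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the text content of tags whose class list contains a target class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.capturing = True
            self.matches.append("")

    def handle_data(self, data):
        # Grab the first non-whitespace text chunk after a matching tag
        if self.capturing and data.strip():
            self.matches[-1] += data.strip()
            self.capturing = False

sample = '''
<h1 class="product-title">Sample Widget</h1>
<span class="price" data-price="49.99">$49.99</span>
'''

for cls in ("product-title", "price"):
    finder = ClassFinder(cls)
    finder.feed(sample)
    print(cls, "->", finder.matches)
```

If a class you expected comes back empty here, it usually means the content is injected by JavaScript, which is exactly what Step 3 addresses.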
Step 2: Make Your First Scrape Request
The simplest way to scrape a page is a single API call with ScrapeForge:
import requests
import json

API_KEY = "your_searchhive_api_key"

response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://example.com/product/sample-product"
    }
)

if response.status_code == 200:
    data = response.json()
    print(json.dumps(data, indent=2))
else:
    print(f"Error {response.status_code}: {response.text}")
This returns the full HTML content of the page. If the page uses JavaScript to render content (most modern sites do), you need to enable rendering.
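A quick way to guess whether a page needs rendering is to check how much visible text the raw HTML actually carries. This is a crude, stdlib-only heuristic sketch (the threshold and the idea of inspecting the returned HTML this way are my assumptions, not documented API behavior):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip script/style blocks and tags, then measure visible text.
    Very little visible text on a non-trivial page suggests client-side rendering."""
    no_scripts = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                        flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars

# Typical "app shell" served by a React/Vue site before JS runs
shell_page = "<html><body><div id='root'></div><script>/* bundle */</script></body></html>"
print(looks_js_rendered(shell_page))  # True: almost no visible text
```

When this returns True for a page you scraped without rendering, retry the request with JavaScript rendering enabled as shown in the next step.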
Step 3: Enable JavaScript Rendering
JavaScript-rendered pages return a nearly empty HTML document when scraped with a basic HTTP request. The actual content loads dynamically via JavaScript after the initial page load.
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/product/sample-product",
        "render_js": True
    }
)
Setting render_js: True tells the API to load the page in a headless browser, wait for JavaScript to execute, and return the fully rendered HTML. This handles React, Vue, Angular, and any other JavaScript framework.
Step 4: Extract Structured Data with CSS Selectors
Raw HTML is hard to work with. ScrapeForge supports inline extraction rules that pull specific data points using CSS selectors:
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/product/sample-product",
        "render_js": True,
        "extract": {
            "product_name": "h1.product-title",
            "price": "span.price",
            "rating": "div.star-rating",
            "description": "div.product-description",
            "image_url": "img.main-product@src",
            "availability": "span.stock-status"
        }
    }
)

product = response.json()
print(f"Name: {product['product_name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product['rating']}")
print(f"In Stock: {product['availability']}")
The @src syntax extracts an attribute value (the src attribute of an <img> tag) instead of the text content.
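If you build your own helpers around this convention, the parsing rule is simple: split on the last @, treat the left side as the selector and the right side as the attribute name, and fall back to text content when no @ is present. A small sketch of that rule (assuming, as in the examples above, that @ never appears inside the selector itself):

```python
def parse_extract_rule(rule: str):
    """Split a 'selector@attr' rule into (selector, attribute).
    attribute is None when the rule targets the element's text content."""
    if "@" in rule:
        selector, attr = rule.rsplit("@", 1)
        return selector, attr
    return rule, None

print(parse_extract_rule("img.main-product@src"))  # ('img.main-product', 'src')
print(parse_extract_rule("span.price"))            # ('span.price', None)
```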
Step 5: Scrape Multiple Pages with Rate Limiting
Scraping multiple pages requires rate limiting to avoid getting blocked by the target site and to stay within your API credit limits:
import time
from typing import List, Dict

def scrape_products(urls: List[str], delay: float = 1.0) -> List[Dict]:
    """Scrape multiple product pages with rate limiting."""
    results = []
    for i, url in enumerate(urls):
        try:
            resp = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={
                    "url": url,
                    "render_js": True,
                    "extract": {
                        "name": "h1",
                        "price": "[data-price]",
                        "rating": ".review-score",
                        "in_stock": ".stock-status"
                    }
                }
            )
            if resp.status_code == 200:
                results.append({"url": url, **resp.json()})
                print(f"[{i+1}/{len(urls)}] OK: {url}")
            else:
                print(f"[{i+1}/{len(urls)}] FAIL {resp.status_code}: {url}")
        except requests.RequestException as e:
            print(f"[{i+1}/{len(urls)}] ERROR: {url} - {e}")
        # Rate limit: wait between requests
        if i < len(urls) - 1:
            time.sleep(delay)
    return results

urls = [
    "https://store.com/product/1",
    "https://store.com/product/2",
    "https://store.com/product/3",
    "https://store.com/product/4",
    "https://store.com/product/5",
]

products = scrape_products(urls, delay=1.5)
print(f"\nScraped {len(products)} products successfully")
Step 6: Handle Errors Gracefully
Production scraping code needs to handle multiple failure modes:
import json
from datetime import datetime

def scrape_with_retry(url: str, max_retries: int = 3) -> dict:
    """Scrape a page with retry logic for transient failures."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={
                    "url": url,
                    "render_js": True,
                    "extract": {
                        "title": "h1",
                        "price": "[data-price], .price",
                        "content": "main"
                    }
                },
                timeout=30
            )
            if resp.status_code == 200:
                data = resp.json()
                # Validate that we got meaningful data
                if data.get("title") or data.get("content"):
                    return data
                else:
                    print(f"Retry {attempt+1}: No data extracted from {url}")
            elif resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 5))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue
            else:
                print(f"Error {resp.status_code} on attempt {attempt+1}")
        except requests.Timeout:
            print(f"Timeout on attempt {attempt+1} for {url}")
        except requests.ConnectionError:
            print(f"Connection error on attempt {attempt+1}")
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff
    return {"error": f"Failed after {max_retries} attempts", "url": url}
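The plain 2 ** attempt backoff works, but when several workers fail at the same moment they all retry in lockstep and hit the server in synchronized bursts. Adding random jitter is a common refinement (my suggestion, not something the API requires); a minimal sketch with a cap on the maximum delay:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Each retry waits a random fraction of an exponentially growing window
for attempt in range(5):
    ceiling = min(30.0, 2 ** attempt)
    print(f"attempt {attempt}: window 0-{ceiling:.0f}s -> chose {backoff_delay(attempt):.2f}s")
```

To use it, replace the `time.sleep(2 ** attempt)` line in `scrape_with_retry` with `time.sleep(backoff_delay(attempt))`.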
Step 7: Save Results to JSON
After scraping, save your results for downstream processing:
import json
from datetime import datetime
from pathlib import Path

def save_results(products: list, filename: str = "products.json"):
    """Save scraped products to a JSON file."""
    output = {
        "scraped_at": datetime.utcnow().isoformat(),
        "total": len(products),
        "products": products
    }
    Path(filename).write_text(json.dumps(output, indent=2, ensure_ascii=False))
    print(f"Saved {len(products)} products to {filename}")

save_results(products)
Step 8: Combine Search and Scraping
For complete web data collection, combine search (finding URLs) with scraping (extracting data from those URLs):
def research_topic(query: str, num_results: int = 5):
    """Search for relevant pages, then scrape the top results."""
    # Step 1: Find relevant URLs
    search_resp = requests.post(
        "https://api.searchhive.dev/v1/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "num_results": num_results}
    )
    urls = [r["url"] for r in search_resp.json().get("results", [])]
    print(f"Found {len(urls)} results for: {query}")

    # Step 2: Scrape each result
    articles = []
    for url in urls:
        resp = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": url,
                "extract": {
                    "title": "h1",
                    "content": "article, main"
                }
            }
        )
        if resp.status_code == 200:
            articles.append(resp.json())
        time.sleep(1)
    return articles

data = research_topic("best project management tools 2025")
for article in data:
    print(f"\n--- {article.get('title', 'No title')} ---")
    content = article.get('content', '')[:200]
    print(content)
Common Issues and Fixes
Empty responses from JS-rendered pages. Make sure render_js: True is set. Some sites use lazy loading -- the content may appear only after scrolling. Check if your target data is in the initial viewport.
Blocked by anti-bot protection. Some sites (Cloudflare, Datadome) block scraping even with rendering enabled. SearchHive ScrapeForge handles most anti-bot measures automatically. If you still get blocked, contact support for residential proxy options.
Inconsistent data across pages. Not all pages on a site use the same HTML structure. Category pages, product pages, and search results may have different selectors. Test your selectors against multiple pages before running a large scrape.
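One cheap guard against structural drift is to audit your extraction results before trusting them: count, per field, how many pages came back empty. This sketch works on any list of extracted dicts (the field names and sample values below are invented for illustration):

```python
def missing_field_report(results):
    """For each field seen across results, count pages where it was empty or missing."""
    fields = {f for r in results for f in r}
    return {f: sum(1 for r in results if not r.get(f)) for f in sorted(fields)}

pages = [
    {"name": "Widget A", "price": "$10", "rating": "4.2"},
    {"name": "Widget B", "price": "", "rating": "4.0"},  # price selector missed
    {"name": "Widget C", "price": "$15"},                # no rating field at all
]
print(missing_field_report(pages))  # {'name': 0, 'price': 1, 'rating': 1}
```

A field that is empty on most pages usually means the selector only matches one page template; a field empty on a few pages points at genuinely missing data.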
Rate limit errors (429). Slow down your request rate. Add delays between requests and implement exponential backoff for retries. Monitor your SearchHive dashboard for credit consumption.
Next Steps
Now that you have a working web scraping pipeline, here are ways to extend it:
- Schedule daily scrapes with cron or a task queue (Celery, RQ)
- Store results in a database instead of flat JSON files for querying and analysis
- Add change detection to get notified when scraped data changes
- Use DeepDive for research tasks that need multi-page synthesis
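The change-detection idea can be sketched with nothing but stdlib hashing: fingerprint each product's extracted fields, keep the fingerprints between runs, and report URLs whose fingerprint changed. Everything here is a local sketch (storing the fingerprints in a file or database is left out, and the sample records are invented):

```python
import hashlib
import json

def fingerprint(product: dict) -> str:
    """Stable hash of a product's extracted fields (key order normalized)."""
    payload = json.dumps(product, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous: dict, new_products: list) -> list:
    """Return URLs whose fingerprint differs from the previous run; updates previous in place."""
    changed = []
    for p in new_products:
        url = p.get("url", "")
        fp = fingerprint({k: v for k, v in p.items() if k != "url"})
        if previous.get(url) != fp:
            changed.append(url)
        previous[url] = fp
    return changed

seen = {}
run1 = [{"url": "https://store.com/product/1", "price": "$10"}]
print(detect_changes(seen, run1))  # first run: every URL counts as changed
run2 = [{"url": "https://store.com/product/1", "price": "$12"}]
print(detect_changes(seen, run2))  # price changed, so the URL is reported again
```

Persist the `seen` dict (for example as JSON next to your results file) and the same comparison works across daily scheduled runs.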
The complete SearchHive API documentation covers all available parameters, advanced extraction rules, and integration patterns. Start with the free tier -- 500 credits per month, no credit card required. For more advanced scraping patterns, see /blog/complete-guide-to-automation-for-data-collection and /compare/firecrawl.