How to Build a Web Scraping API -- Step-by-Step Tutorial
A web scraping API lets your applications extract data from websites programmatically. Instead of manual copy-paste or fragile scripts, you send HTTP requests and receive structured data. This tutorial walks through building a production-ready web scraping API using Python, from handling JavaScript rendering to parsing structured data.
Key Takeaways
- Modern ecommerce sites require JavaScript rendering -- plain HTTP requests return empty pages
- CSS selectors are the most reliable way to extract specific data points from HTML
- SearchHive's ScrapeForge API handles rendering, proxy rotation, and extraction in a single call
- Rate limiting and error handling are essential for any scraping pipeline
- A complete scraping API can be built in under 100 lines of Python
Prerequisites
- Python 3.8 or later
- A SearchHive API key (free tier -- 500 credits/month)
- Basic familiarity with HTTP requests and HTML structure
Install the required packages:
pip install requests
Step 1: Understand the Target Page Structure
Before writing any scraping code, inspect the target page to identify what data you need and how it's structured in the HTML.
- Open the target URL in your browser
- Right-click the data you want to extract and select "Inspect"
- Note the HTML tag, class names, and attributes that contain your target data
For example, on a product page you might find:
- Product title in <h1 class="product-title">
- Price in <span class="price" data-price="49.99">
- Rating in <div class="star-rating" data-score="4.5">
- Availability in <span class="stock-status">In Stock</span>
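Before spending API credits, you can sanity-check that the class names you noted actually carry the text you expect. The sketch below uses only the standard library's html.parser (no SearchHive call involved), and the sample HTML is invented for illustration; it only handles the flat, single-level markup shown here:

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the text content of tags whose class list contains a target class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.capturing = True
            self.matches.append("")

    def handle_data(self, data):
        # Grab the first non-whitespace text chunk after a matching tag
        if self.capturing and data.strip():
            self.matches[-1] += data.strip()
            self.capturing = False

sample = '''
<h1 class="product-title">Sample Widget</h1>
<span class="price" data-price="49.99">$49.99</span>
'''

for cls in ("product-title", "price"):
    finder = ClassFinder(cls)
    finder.feed(sample)
    print(cls, "->", finder.matches)
```

If a class you expected comes back empty here, it usually means the content is injected by JavaScript, which is exactly what Step 3 addresses.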
Step 2: Make Your First Scrape Request
The simplest way to scrape a page is a single API call with ScrapeForge:
import requests
import json

API_KEY = "your_searchhive_api_key"

response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://example.com/product/sample-product"
    }
)

if response.status_code == 200:
    data = response.json()
    print(json.dumps(data, indent=2))
else:
    print(f"Error {response.status_code}: {response.text}")
This returns the full HTML content of the page. If the page uses JavaScript to render content (most modern sites do), you need to enable rendering.
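A quick way to guess whether a page needs rendering is to check how much visible text the raw HTML actually carries. This is a crude, stdlib-only heuristic sketch (the threshold and the idea of inspecting the returned HTML this way are my assumptions, not documented API behavior):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip script/style blocks and tags, then measure visible text.
    Very little visible text on a non-trivial page suggests client-side rendering."""
    no_scripts = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                        flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars

# Typical "app shell" served by a React/Vue site before JS runs
shell_page = "<html><body><div id='root'></div><script>/* bundle */</script></body></html>"
print(looks_js_rendered(shell_page))  # True: almost no visible text
```

When this returns True for a page you scraped without rendering, retry the request with JavaScript rendering enabled as shown in the next step.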
Step 3: Enable JavaScript Rendering
JavaScript-rendered pages return a nearly empty HTML document when scraped with a basic HTTP request. The actual content loads dynamically via JavaScript after the initial page load.
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/product/sample-product",
        "render_js": True
    }
)
Setting render_js: True tells the API to load the page in a headless browser, wait for JavaScript to execute, and return the fully rendered HTML. This handles React, Vue, Angular, and any other JavaScript framework.
Step 4: Extract Structured Data with CSS Selectors
Raw HTML is hard to work with. ScrapeForge supports inline extraction rules that pull specific data points using CSS selectors:
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/product/sample-product",
        "render_js": True,
        "extract": {
            "product_name": "h1.product-title",
            "price": "span.price",
            "rating": "div.star-rating",
            "description": "div.product-description",
            "image_url": "img.main-product@src",
            "availability": "span.stock-status"
        }
    }
)

product = response.json()
print(f"Name: {product['product_name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product['rating']}")
print(f"In Stock: {product['availability']}")
The @src syntax extracts an attribute value (the src attribute of an <img> tag) instead of the text content.
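If you build your own helpers around this convention, the parsing rule is simple: split on the last @, treat the left side as the selector and the right side as the attribute name, and fall back to text content when no @ is present. A small sketch of that rule (assuming, as in the examples above, that @ never appears inside the selector itself):

```python
def parse_extract_rule(rule: str):
    """Split a 'selector@attr' rule into (selector, attribute).
    attribute is None when the rule targets the element's text content."""
    if "@" in rule:
        selector, attr = rule.rsplit("@", 1)
        return selector, attr
    return rule, None

print(parse_extract_rule("img.main-product@src"))  # ('img.main-product', 'src')
print(parse_extract_rule("span.price"))            # ('span.price', None)
```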
Step 5: Scrape Multiple Pages with Rate Limiting
Scraping multiple pages requires rate limiting to avoid getting blocked by the target site and to stay within your API credit limits:
import time
from typing import List, Dict

def scrape_products(urls: List[str], delay: float = 1.0) -> List[Dict]:
    """Scrape multiple product pages with rate limiting."""
    results = []
    for i, url in enumerate(urls):
        try:
            resp = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={
                    "url": url,
                    "render_js": True,
                    "extract": {
                        "name": "h1",
                        "price": "[data-price]",
                        "rating": ".review-score",
                        "in_stock": ".stock-status"
                    }
                }
            )
            if resp.status_code == 200:
                results.append({"url": url, **resp.json()})
                print(f"[{i+1}/{len(urls)}] OK: {url}")
            else:
                print(f"[{i+1}/{len(urls)}] FAIL {resp.status_code}: {url}")
        except requests.RequestException as e:
            print(f"[{i+1}/{len(urls)}] ERROR: {url} - {e}")
        # Rate limit: wait between requests
        if i < len(urls) - 1:
            time.sleep(delay)
    return results

urls = [
    "https://store.com/product/1",
    "https://store.com/product/2",
    "https://store.com/product/3",
    "https://store.com/product/4",
    "https://store.com/product/5",
]

products = scrape_products(urls, delay=1.5)
print(f"\nScraped {len(products)} products successfully")
Step 6: Handle Errors Gracefully
Production scraping code needs to handle multiple failure modes:
import json
from datetime import datetime

def scrape_with_retry(url: str, max_retries: int = 3) -> dict:
    """Scrape a page with retry logic for transient failures."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                "https://api.searchhive.dev/v1/scrape",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={
                    "url": url,
                    "render_js": True,
                    "extract": {
                        "title": "h1",
                        "price": "[data-price], .price",
                        "content": "main"
                    }
                },
                timeout=30
            )
            if resp.status_code == 200:
                data = resp.json()
                # Validate that we got meaningful data
                if data.get("title") or data.get("content"):
                    return data
                else:
                    print(f"Retry {attempt+1}: No data extracted from {url}")
            elif resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 5))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue
            else:
                print(f"Error {resp.status_code} on attempt {attempt+1}")
        except requests.Timeout:
            print(f"Timeout on attempt {attempt+1} for {url}")
        except requests.ConnectionError:
            print(f"Connection error on attempt {attempt+1}")
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff
    return {"error": f"Failed after {max_retries} attempts", "url": url}
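The plain 2 ** attempt backoff works, but when several workers fail at the same moment they all retry in lockstep and hit the server in synchronized bursts. Adding random jitter is a common refinement (my suggestion, not something the API requires); a minimal sketch with a cap on the maximum delay:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Each retry waits a random fraction of an exponentially growing window
for attempt in range(5):
    ceiling = min(30.0, 2 ** attempt)
    print(f"attempt {attempt}: window 0-{ceiling:.0f}s -> chose {backoff_delay(attempt):.2f}s")
```

To use it, replace the `time.sleep(2 ** attempt)` line in `scrape_with_retry` with `time.sleep(backoff_delay(attempt))`.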
Step 7: Save Results to JSON
After scraping, save your results for downstream processing:
import json
from datetime import datetime
from pathlib import Path

def save_results(products: list, filename: str = "products.json"):
    """Save scraped products to a JSON file."""
    output = {
        "scraped_at": datetime.utcnow().isoformat(),
        "total": len(products),
        "products": products
    }
    Path(filename).write_text(json.dumps(output, indent=2, ensure_ascii=False))
    print(f"Saved {len(products)} products to {filename}")

save_results(products)
Step 8: Combine Search and Scraping
For complete web data collection, combine search (finding URLs) with scraping (extracting data from those URLs):
def research_topic(query: str, num_results: int = 5):
    """Search for relevant pages, then scrape the top results."""
    # Step 1: Find relevant URLs
    search_resp = requests.post(
        "https://api.searchhive.dev/v1/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "num_results": num_results}
    )
    urls = [r["url"] for r in search_resp.json().get("results", [])]
    print(f"Found {len(urls)} results for: {query}")

    # Step 2: Scrape each result
    articles = []
    for url in urls:
        resp = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": url,
                "extract": {
                    "title": "h1",
                    "content": "article, main"
                }
            }
        )
        if resp.status_code == 200:
            articles.append(resp.json())
        time.sleep(1)
    return articles

data = research_topic("best project management tools 2025")
for article in data:
    print(f"\n--- {article.get('title', 'No title')} ---")
    content = article.get('content', '')[:200]
    print(content)
Common Issues and Fixes
Empty responses from JS-rendered pages. Make sure render_js: True is set. Some sites use lazy loading -- the content may appear only after scrolling. Check if your target data is in the initial viewport.
Blocked by anti-bot protection. Some sites (Cloudflare, Datadome) block scraping even with rendering enabled. SearchHive ScrapeForge handles most anti-bot measures automatically. If you still get blocked, contact support for residential proxy options.
Inconsistent data across pages. Not all pages on a site use the same HTML structure. Category pages, product pages, and search results may have different selectors. Test your selectors against multiple pages before running a large scrape.
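One cheap guard against structural drift is to audit your extraction results before trusting them: count, per field, how many pages came back empty. This sketch works on any list of extracted dicts (the field names and sample values below are invented for illustration):

```python
def missing_field_report(results):
    """For each field seen across results, count pages where it was empty or missing."""
    fields = {f for r in results for f in r}
    return {f: sum(1 for r in results if not r.get(f)) for f in sorted(fields)}

pages = [
    {"name": "Widget A", "price": "$10", "rating": "4.2"},
    {"name": "Widget B", "price": "", "rating": "4.0"},  # price selector missed
    {"name": "Widget C", "price": "$15"},                # no rating field at all
]
print(missing_field_report(pages))  # {'name': 0, 'price': 1, 'rating': 1}
```

A field that is empty on most pages usually means the selector only matches one page template; a field empty on a few pages points at genuinely missing data.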
Rate limit errors (429). Slow down your request rate. Add delays between requests and implement exponential backoff for retries. Monitor your SearchHive dashboard for credit consumption.
Next Steps
Now that you have a working web scraping pipeline, here are ways to extend it:
- Schedule daily scrapes with cron or a task queue (Celery, RQ)
- Store results in a database instead of flat JSON files for querying and analysis
- Add change detection to get notified when scraped data changes
- Use DeepDive for research tasks that need multi-page synthesis
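The change-detection idea can be sketched with nothing but stdlib hashing: fingerprint each product's extracted fields, keep the fingerprints between runs, and report URLs whose fingerprint changed. Everything here is a local sketch (storing the fingerprints in a file or database is left out, and the sample records are invented):

```python
import hashlib
import json

def fingerprint(product: dict) -> str:
    """Stable hash of a product's extracted fields (key order normalized)."""
    payload = json.dumps(product, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous: dict, new_products: list) -> list:
    """Return URLs whose fingerprint differs from the previous run; updates previous in place."""
    changed = []
    for p in new_products:
        url = p.get("url", "")
        fp = fingerprint({k: v for k, v in p.items() if k != "url"})
        if previous.get(url) != fp:
            changed.append(url)
        previous[url] = fp
    return changed

seen = {}
run1 = [{"url": "https://store.com/product/1", "price": "$10"}]
print(detect_changes(seen, run1))  # first run: every URL counts as changed
run2 = [{"url": "https://store.com/product/1", "price": "$12"}]
print(detect_changes(seen, run2))  # price changed, so the URL is reported again
```

Persist the `seen` dict (for example as JSON next to your results file) and the same comparison works across daily scheduled runs.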
The complete SearchHive API documentation covers all available parameters, advanced extraction rules, and integration patterns. Start with the free tier -- 500 credits per month, no credit card required. For more advanced scraping patterns, see /blog/complete-guide-to-automation-for-data-collection and /compare/firecrawl.