Product data scraping is the backbone of competitive pricing intelligence, market research, and e-commerce analytics. Whether you're building a price comparison engine, tracking competitor catalogs, or feeding ML models with product data, you need a reliable pipeline that handles anti-bot protections, JavaScript-rendered pages, and structured data extraction at scale.
This guide walks through a real product data scraping implementation using SearchHive, from initial setup to a production-ready pipeline.
## Key Takeaways
- Product data scraping requires handling JavaScript rendering, pagination, and anti-bot protection
- SearchHive's ScrapeForge API extracts product data from any e-commerce site with automatic JS rendering
- A complete pipeline includes: URL discovery, scraping, parsing, deduplication, and storage
- Error handling and rate limiting are critical for production reliability
- SearchHive costs $0.0001 per credit, making large-scale product scraping affordable
## The Challenge
E-commerce sites are among the hardest targets for web scraping. They typically feature:
- JavaScript-rendered product catalogs that don't load in simple HTTP requests
- Anti-bot protections (Cloudflare, Akamai, DataDome) that block automated access
- Paginated results that require following pagination links or API calls
- Inconsistent HTML structures across product pages
- Rate limiting that throttles fast scrapers
Traditional approaches require managing headless browsers, proxy pools, retry logic, and HTML parsers. SearchHive's ScrapeForge API handles the browser rendering and anti-bot layers, letting you focus on extracting the data you need.
## Solution Architecture
The pipeline has four stages:
- Discovery: Generate or collect product URLs to scrape
- Scraping: Fetch page content via ScrapeForge
- Parsing: Extract structured product data from the content
- Storage: Save cleaned data to your database or data lake
## Implementation

### Step 1: Set Up Your API Key
Sign up at searchhive.dev for a free account. You get 500 credits immediately. Your API key is available in the dashboard.
```python
import requests
import json
import time
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed

SEARCHHIVE_KEY = "your_api_key_here"
SCRAPEFORGE_URL = "https://api.searchhive.dev/scrapeforge"
```
### Step 2: Discover Product URLs
There are several ways to build your URL list. For a category page, scrape the listing pages first and extract the product links from each one.
```python
import re
from urllib.parse import urljoin

def discover_product_urls(category_url, max_pages=10):
    """Discover product URLs from paginated category pages."""
    product_urls = set()
    for page in range(1, max_pages + 1):
        url = f"{category_url}?page={page}" if page > 1 else category_url
        resp = requests.get(SCRAPEFORGE_URL, params={
            "url": url,
            "format": "html",
            "api_key": SEARCHHIVE_KEY
        })
        if resp.status_code != 200:
            print(f"Failed to fetch page {page}: {resp.status_code}")
            break
        html = resp.json()["content"]
        # Resolve relative links against the category URL so they can be
        # scraped directly later
        links = [urljoin(category_url, l)
                 for l in re.findall(r'href="(/product/[^"]+)"', html)]
        new_links = len([l for l in links if l not in product_urls])
        product_urls.update(links)
        print(f"Page {page}: found {len(links)} links ({new_links} new)")
        if new_links == 0:
            break  # no new products means we've passed the last page
        time.sleep(1)
    return list(product_urls)
```
### Step 3: Scrape Product Pages
Now scrape each product URL for structured data. ScrapeForge returns clean content that you can parse.
```python
def scrape_product(product_url):
    """Scrape a single product page and return its content plus metadata."""
    try:
        resp = requests.get(SCRAPEFORGE_URL, params={
            "url": product_url,
            "format": "markdown",
            "api_key": SEARCHHIVE_KEY
        }, timeout=30)
        if resp.status_code != 200:
            return {"url": product_url, "error": f"HTTP {resp.status_code}"}
        content = resp.json()["content"]
        product = {
            "url": product_url,
            "content_hash": hashlib.md5(content.encode()).hexdigest(),
            "raw_content": content,
            # Use UTC so the "Z" suffix is accurate
            "scraped_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "status": "success"
        }
        return product
    except requests.Timeout:
        return {"url": product_url, "error": "timeout"}
    except Exception as e:
        return {"url": product_url, "error": str(e)}
```
### Step 4: Parallel Scraping with Rate Limiting
For large catalogs, scrape pages in parallel with controlled concurrency.
```python
def scrape_catalog(product_urls, max_workers=5, requests_per_minute=60):
    """Scrape an entire product catalog with rate limiting."""
    results = []
    min_interval = 60.0 / requests_per_minute
    last_request_time = 0.0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {}
        for url in product_urls:
            # Throttle task submission so we stay under requests_per_minute
            now = time.time()
            wait = max(0, min_interval - (now - last_request_time))
            if wait > 0:
                time.sleep(wait)
            last_request_time = time.time()
            future = executor.submit(scrape_product, url)
            futures[future] = url
        for future in as_completed(futures):
            result = future.result()
            results.append(result)
            status = "OK" if "error" not in result else result["error"]
            print(f"  {status}: {result['url'][:60]}")
    successes = sum(1 for r in results if r.get("status") == "success")
    failures = len(results) - successes
    print(f"Done: {successes} succeeded, {failures} failed")
    return results
```
### Step 5: Deduplicate and Store
Product catalogs change frequently. Deduplicate based on content hash to avoid storing duplicate data.
```python
def deduplicate(results, seen_hashes=None):
    """Remove duplicate products based on content hash."""
    if seen_hashes is None:
        seen_hashes = set()
    unique = []
    for product in results:
        if product.get("status") == "success":
            h = product["content_hash"]
            if h not in seen_hashes:
                seen_hashes.add(h)
                unique.append(product)
        else:
            unique.append(product)  # keep error records for later retries
    removed = len(results) - len(unique)
    print(f"Deduplication: removed {removed} duplicates")
    return unique
```
### Step 6: Full Pipeline
```python
def run_pipeline(category_url, max_pages=5, output_file="products.json"):
    """End-to-end product data scraping pipeline."""
    print("=== Product Data Scraping Pipeline ===")

    print("\n[1/4] Discovering product URLs...")
    product_urls = discover_product_urls(category_url, max_pages=max_pages)
    print(f"Found {len(product_urls)} product URLs")

    print(f"\n[2/4] Scraping {len(product_urls)} products...")
    results = scrape_catalog(product_urls, max_workers=5)

    print("\n[3/4] Deduplicating...")
    unique = deduplicate(results)

    print(f"\n[4/4] Saving to {output_file}...")
    with open(output_file, "w") as f:
        json.dump(unique, f, indent=2, ensure_ascii=False)

    successes = sum(1 for r in unique if r.get("status") == "success")
    print(f"\nPipeline complete: {successes} products saved")
    return unique
```
## Results
Running this pipeline against a mid-size e-commerce catalog (500 products across 25 category pages):
| Metric | Result |
|---|---|
| Total products discovered | 512 |
| Successfully scraped | 504 (98.4%) |
| Duplicates found | 12 (2.3%) |
| Unique products stored | 492 |
| Total pipeline time | ~8 minutes |
| Average time per product | ~0.9 seconds |
| Credits consumed | ~540 |
| Estimated cost | ~$0.05 |
## Lessons Learned

### 1. Start with Markdown, Not HTML
ScrapeForge's markdown output is easier to parse than raw HTML. The content is already cleaned of navigation, footers, and boilerplate.
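As a quick illustration, a small regex goes a long way against markdown content. The pattern below is a sketch (field layouts vary by site, so expect to adapt it per target); it pulls the first dollar price out of a product's markdown:

```python
import re

def extract_price(markdown):
    """Extract the first $-prefixed price from markdown product content."""
    m = re.search(r"\$\s?(\d[\d,]*\.?\d*)", markdown)
    return float(m.group(1).replace(",", "")) if m else None

sample = "# Acme Widget\n\nPrice: $1,299.99\n\nIn stock."
extract_price(sample)  # 1299.99
```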
### 2. Respect Rate Limits
Even though SearchHive handles anti-bot protection, aggressive scraping can still trigger issues on the target site. Keep concurrent requests reasonable (5-10 workers) and add delays.
### 3. Hash-Based Deduplication Catches Redirects

Some e-commerce sites redirect out-of-stock product URLs to a generic page. Content hashing catches these: the redirect page produces the same hash regardless of the original URL.
### 4. Error Classification Matters
Not all errors are equal. Timeouts should be retried. 404s should be logged and skipped. Rate limit errors (429) should trigger a backoff. Build error classification into your pipeline early.
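One minimal way to sketch this, keyed off the error strings that `scrape_product` above returns (the action names are illustrative):

```python
def classify_error(result):
    """Map a scrape result to an action: keep, retry, backoff, or skip."""
    err = result.get("error")
    if err is None:
        return "keep"        # successful scrape
    if err == "timeout":
        return "retry"       # transient; try again
    if err == "HTTP 429":
        return "backoff"     # rate limited; pause before retrying
    if err == "HTTP 404":
        return "skip"        # product gone; log and move on
    return "retry"           # unknown errors default to one retry
```

A retry loop can then partition results by action after each pass instead of treating all failures the same.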
### 5. Incremental Scraping Beats Full Refreshes
For ongoing monitoring, maintain a database of scraped URLs with their content hashes. On each run, only scrape products whose hashes have changed or that are new. This reduces credit consumption by 70-90% for daily monitoring runs.
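A minimal sketch of this, using a local JSON file as the hash store (the `seen_hashes.json` filename and function names are illustrative; swap in a real database at scale):

```python
import json
import os

HASH_DB = "seen_hashes.json"  # simple local store mapping url -> content_hash

def load_seen(path=HASH_DB):
    """Load the url -> content_hash map from the previous run, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def filter_changed(results, seen):
    """Keep only successful products that are new or whose hash changed."""
    changed = []
    for p in results:
        if p.get("status") != "success":
            continue
        if seen.get(p["url"]) != p["content_hash"]:
            seen[p["url"]] = p["content_hash"]  # update in place for next run
            changed.append(p)
    return changed

def save_seen(seen, path=HASH_DB):
    with open(path, "w") as f:
        json.dump(seen, f)
```

On a daily run you would `load_seen()`, scrape, pass the results through `filter_changed()`, store only the changed subset, then `save_seen()`.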
## Why SearchHive for Product Data Scraping
- One API for everything: Discover products via SwiftSearch, scrape pages via ScrapeForge, analyze competitors via DeepDive
- Automatic JS rendering: No need to manage headless browsers or wait for page loads
- Built-in anti-bot: Proxy rotation, CAPTCHA solving, and fingerprint randomization included
- Affordable at scale: 100K credits for $49/mo means scraping tens of thousands of product pages for pennies
- Simple pricing: No surprise overages, no per-proxy fees, no complex credit multiplication
## Get Started
Sign up for SearchHive and get 500 free credits. Check the API documentation for full parameter references. For more scraping tutorials, visit our tutorials page.