Product Data Scraping: Common Questions Answered
Product data scraping extracts information like prices, titles, descriptions, ratings, and availability from e-commerce websites. Whether you are building a price comparison engine, monitoring competitor catalogs, or training recommendation models, product data scraping is the foundation. This FAQ covers the most common questions with practical solutions using SearchHive's APIs.
Frequently Asked Questions
What is product data scraping?
Product data scraping means programmatically extracting structured product information from e-commerce websites. This includes product names, prices, images, descriptions, specifications, reviews, stock status, and seller information. Unlike manual data entry, scraping lets you collect data from thousands of products across multiple stores in minutes.
How do I extract product data from an e-commerce site?
The most reliable approach is using SearchHive's DeepDive API, which uses AI to understand page structure and extract exactly the fields you need:
```python
import httpx
import json

response = httpx.post(
    "https://api.searchhive.dev/v1/deepdive",
    headers={"Authorization": "Bearer sh_live_..."},
    json={
        "url": "https://example.com/product/wireless-earbuds-pro",
        "extract": {
            "title": {"type": "string", "description": "Product title"},
            "price": {"type": "number", "description": "Current price in USD"},
            "original_price": {"type": "number", "description": "Original price before discount"},
            "rating": {"type": "string", "description": "Customer rating (e.g. 4.5/5)"},
            "review_count": {"type": "integer", "description": "Number of reviews"},
            "availability": {"type": "string", "description": "In stock or out of stock"},
            "description": {"type": "string", "description": "Product description"},
            "features": {
                "type": "array",
                "description": "Product features/bullet points",
                "items": {"type": "string"}
            },
            "images": {
                "type": "array",
                "description": "Product image URLs",
                "items": {"type": "string"}
            }
        }
    }
)

data = response.json().get("data", {})
print(json.dumps(data, indent=2))
```
Which e-commerce platforms are easiest to scrape?
- Shopify: Product pages have a consistent JSON-LD schema embedded in the HTML. Easy to extract even without AI.
- WooCommerce: Uses standard WordPress structure. Product data is in predictable HTML elements.
- Amazon: Heavily protected with CAPTCHAs and rate limiting. Requires proxy rotation and careful request pacing.
- eBay: Varies by listing type (auction vs buy-now). Structure is inconsistent.
DeepDive adapts to different platforms automatically because it uses AI to understand page layout rather than relying on CSS selectors.
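Since Shopify's embedded JSON-LD is the easiest case, here is a minimal sketch that pulls it out with only the standard library. The sample HTML and the regex-based matching are simplifications of my own; real pages may order attributes differently, so a proper HTML parser is safer at scale:

```python
import json
import re

def extract_json_ld(html: str) -> dict:
    """Return the first JSON-LD Product object found in the HTML."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for match in re.findall(pattern, html, re.DOTALL):
        data = json.loads(match)
        if data.get("@type") == "Product":
            return data
    return {}

# Hypothetical snippet of a Shopify product page
sample_html = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Wireless Earbuds Pro",
 "offers": {"@type": "Offer", "price": "79.99", "priceCurrency": "USD"}}
</script>
"""

product = extract_json_ld(sample_html)
print(product["name"], product["offers"]["price"])
```

No API calls, no AI: for Shopify-style pages this alone often recovers name, price, and currency.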
How do I scrape prices from multiple competitors?
Build a monitoring pipeline that scrapes the same product across multiple stores:
```python
import httpx
import json
import time
from datetime import datetime

SEARCHHIVE_API_KEY = "sh_live_..."

def scrape_product_price(url: str, store_name: str) -> dict:
    """Extract pricing data from a product page."""
    response = httpx.post(
        "https://api.searchhive.dev/v1/deepdive",
        headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"},
        json={
            "url": url,
            "extract": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "availability": {"type": "string"},
                "currency": {"type": "string"}
            }
        }
    )
    data = response.json().get("data", {})
    return {
        "timestamp": datetime.now().isoformat(),
        "store": store_name,
        "url": url,
        **data
    }

# Monitor the same product across competitors
products = [
    ("https://store-a.com/product/earbuds", "StoreA"),
    ("https://store-b.com/product/earbuds", "StoreB"),
    ("https://store-c.com/product/earbuds", "StoreC"),
]

results = []
for url, store in products:
    try:
        data = scrape_product_price(url, store)
        results.append(data)
        print(f"{store}: ${data.get('price', 'N/A')} ({data.get('availability', 'N/A')})")
        time.sleep(1)  # Respect rate limits
    except Exception as e:
        print(f"{store}: FAILED - {e}")

# Save results
with open("price_monitor.json", "w") as f:
    json.dump(results, f, indent=2)
```
How much does product data scraping cost?
| Method | Cost per 1,000 Products | Infrastructure |
|---|---|---|
| SearchHive DeepDive | $2-5 (Builder plan) | None |
| SearchHive ScrapeForge | $1-3 (Builder plan) | None |
| Firecrawl | ~$5.33 (at $16 per 3K pages) | None |
| ScrapingBee | ~$0.05 (at $49 per 1M requests) | None |
| Self-hosted (Playwright) | $0 + server costs | Server + proxies |
| Octoparse (no-code) | $89-249/month (subscription) | Cloud |
SearchHive's Builder plan at $49/month gives you 100K credits, enough to extract data from 10K-20K products per month depending on page complexity.
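That 10K-20K figure is simple division; a quick sanity check, assuming 5-10 credits per DeepDive extraction as the range above implies:

```python
def products_per_month(monthly_credits: int, credits_per_product: int) -> int:
    """How many product pages a given credit budget covers."""
    return monthly_credits // credits_per_product

# Builder plan: 100K credits/month
print(products_per_month(100_000, 5))   # simple pages  -> 20000
print(products_per_month(100_000, 10))  # complex pages -> 10000
```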
How do I handle pagination on category pages?
Most e-commerce sites paginate product listings. Use ScrapeForge to scrape each page:
```python
import time

import httpx

def scrape_category(base_url: str, num_pages: int = 5) -> list:
    """Scrape all products from a paginated category."""
    all_products = []
    for page in range(1, num_pages + 1):
        url = f"{base_url}?page={page}"
        response = httpx.post(
            "https://api.searchhive.dev/v1/scrapeforge",
            headers={"Authorization": "Bearer sh_live_..."},
            json={
                "url": url,
                "render_js": True,
                "format": "html"
            }
        )
        # Parse product URLs from the HTML,
        # then deep dive each product URL for structured data
        print(f"Scraped page {page}/{num_pages}")
        time.sleep(0.5)
    return all_products
```
Can I scrape Amazon product data?
Yes, but Amazon has aggressive anti-scraping measures. Key strategies:
- Use proxy rotation (built into ScrapeForge)
- Space requests at least 2-3 seconds apart
- Rotate user agents (handled by ScrapeForge)
- Focus on specific ASINs rather than broad category crawling
- Use the Unicorn plan for residential proxies
DeepDive can extract Amazon product data reliably at moderate volumes (under 500 products/day). For high-volume Amazon scraping, consider dedicated Amazon API services.
How do I handle product variants (size, color, etc.)?
Product variants add complexity. DeepDive can extract variant data:
```python
import httpx

response = httpx.post(
    "https://api.searchhive.dev/v1/deepdive",
    headers={"Authorization": "Bearer sh_live_..."},
    json={
        "url": "https://example.com/product/tshirt",
        "extract": {
            "title": {"type": "string"},
            "variants": {
                "type": "array",
                "description": "Available variants",
                "items": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "available": {"type": "string"}
                }
            }
        }
    }
)
```
What about images and product media?
DeepDive can extract image URLs. To download images, use httpx to save them locally or to cloud storage:
```python
import httpx

# "data" is the DeepDive response from the earlier extraction example
image_urls = data.get("images", [])
for i, img_url in enumerate(image_urls):
    resp = httpx.get(img_url, follow_redirects=True)
    with open(f"product_image_{i}.jpg", "wb") as f:
        f.write(resp.content)
    print(f"Saved image {i+1}/{len(image_urls)}")
```
Is product data scraping legal?
In most jurisdictions, scraping publicly available data is legal. However:
- Respect robots.txt (ScrapeForge handles this automatically)
- Do not scrape behind login walls without permission
- Do not scrape personal data (GDPR, CCPA apply)
- Follow the CFAA and local computer fraud laws
- Use data responsibly and do not redistribute copyrighted content
Consult a lawyer for specific legal guidance.
Summary
Product data scraping is straightforward with the right tools:
- Define what you need (price, title, specs, images, reviews)
- Use DeepDive to extract structured data from any product page
- Build a pipeline for multi-store monitoring with rate limiting
- Start free with SearchHive's 500 credits, then scale to $49/month for serious volume
See /blog/how-to-ecommerce-automation-step-by-step for a full automation tutorial, or /compare/firecrawl for scraping API alternatives.