Amazon product data is the backbone of price comparison tools, competitor monitoring, and market research platforms. The challenge: Amazon's anti-bot systems are aggressive, their HTML structure shifts regularly, and scraping at scale requires proxy rotation.
This tutorial walks through building a production-ready Amazon product scraper using Python and SearchHive's ScrapeForge API. No browser automation, no proxy management, no CAPTCHA solving infrastructure — the API handles all of that.
Key Takeaways
- Direct requests to Amazon fail fast — you need proxy rotation and JS rendering to get reliable data
- SearchHive's ScrapeForge API handles anti-bot bypassing, proxy rotation, and CAPTCHA solving automatically
- DeepDive (AI extraction) converts raw Amazon pages into structured JSON without fragile CSS selectors
- SearchHive SwiftSearch finds Amazon product URLs when you don't know the exact ASIN
- Rate limiting and respectful crawling are non-negotiable for sustained access
Prerequisites
- Python 3.8+
- `requests` library (`pip install requests`)
- A SearchHive API key (free tier available)
Step 1: Scrape a Single Amazon Product Page
The simplest case — you have a product URL (or ASIN) and want structured data back.
```python
import requests
import json

SEARCHHIVE_API_KEY = "your_api_key_here"
BASE_URL = "https://api.searchhive.dev/v1"
HEADERS = {"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"}

def scrape_amazon_product(asin: str) -> dict:
    """Scrape a single Amazon product page and return structured data."""
    url = f"https://www.amazon.com/dp/{asin}"
    resp = requests.post(f"{BASE_URL}/extract", json={
        "url": url,
        "prompt": """Extract the following product information:
        - Product title
        - Price (current sale price)
        - Original/list price
        - Rating (star count)
        - Number of reviews
        - Availability status
        - Product description (first paragraph)
        - Main category
        - All bullet points (feature list)"""
    }, headers=HEADERS)
    if resp.status_code == 200:
        return resp.json()["data"]
    else:
        raise Exception(f"Scrape failed: {resp.status_code} - {resp.text}")

# Example usage
product = scrape_amazon_product("B09V3KXJPB")
print(json.dumps(product, indent=2))
```
This uses DeepDive, SearchHive's AI-powered extraction endpoint. Instead of writing CSS selectors that break when Amazon updates their DOM, you describe what you want in natural language and get structured JSON back.
Sample output:
```json
{
  "product_title": "Apple AirPods Pro (2nd Generation)",
  "price": "$189.99",
  "original_price": "$249.00",
  "rating": 4.7,
  "review_count": 112453,
  "availability": "In Stock",
  "description": "Active Noise Cancellation...",
  "category": "Electronics > Earbuds",
  "bullet_points": [
    "Active Noise Cancellation removes background noise",
    "Transparency mode lets outside sound in",
    "Customizable fit with silicone ear tips"
  ]
}
```
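Note that `price` and `original_price` come back as display strings like `"$189.99"`, not numbers. If you plan to compare or chart prices, normalize them to floats first. A minimal sketch (the `parse_price` helper is ours, not part of the API, and assumes US-style `$1,299.99` formatting):

```python
import re

def parse_price(price_str):
    """Convert a display price like '$189.99' or '$1,299.00' to a float.
    Returns None if no number is found (e.g. 'Currently unavailable')."""
    if not price_str:
        return None
    match = re.search(r"[\d,]+(?:\.\d+)?", price_str)
    if not match:
        return None
    return float(match.group(0).replace(",", ""))
```

Adjust the regex if you scrape marketplaces that use comma decimal separators (amazon.de, amazon.fr).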
Step 2: Find Amazon Products with Search
If you don't know the exact ASIN, use SwiftSearch to find products matching a query.
```python
import requests

def search_amazon_products(query: str, num_results: int = 10) -> list:
    """Search Amazon for products using SearchHive SwiftSearch."""
    resp = requests.get(f"{BASE_URL}/search", params={
        "q": f"{query} site:amazon.com",
        "engine": "google",
        "num": num_results
    }, headers=HEADERS)
    products = []
    for result in resp.json().get("results", []):
        url = result["url"]
        # Filter to product pages only
        if "/dp/" in url or "/gp/product/" in url:
            products.append({
                "title": result["title"],
                "url": url,
                "snippet": result.get("snippet", "")
            })
    return products

# Find wireless earbuds on Amazon
results = search_amazon_products("wireless earbuds bestseller 2025")
for p in results:
    print(f"{p['title']}\n  {p['url']}\n")
```
This searches Google for Amazon product pages matching your query. It's more reliable than scraping Amazon's internal search, which is heavily gated.
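Each result URL embeds the ASIN, which `scrape_amazon_product` from Step 1 expects. A small regex helper can pull it out (ASINs are conventionally 10 uppercase alphanumerics after `/dp/` or `/gp/product/`; this is an observed pattern, not a documented guarantee):

```python
import re
from typing import Optional

def extract_asin(url: str) -> Optional[str]:
    """Extract the 10-character ASIN from an Amazon product URL.
    Returns None for non-product URLs (search pages, category pages)."""
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None
```

With this, the output of `search_amazon_products` feeds directly into the scraper: `extract_asin(p["url"])` for each result.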
Step 3: Bulk Scrape with Rate Limiting
Scraping multiple products requires rate limiting to avoid triggering anti-bot systems (even with proxy rotation, respect the platform).
```python
import time
import json
from pathlib import Path

def scrape_product_list(asins: list, delay: float = 2.0, output_file: str = "products.json"):
    """Scrape multiple Amazon products with rate limiting."""
    results = []
    seen_asins = set()

    # Resume from existing results if file exists
    if Path(output_file).exists():
        with open(output_file) as f:
            existing = json.load(f)
        for item in existing:
            seen_asins.add(item.get("asin"))
        results = existing
        print(f"Resuming — {len(seen_asins)} products already scraped")

    for i, asin in enumerate(asins):
        if asin in seen_asins:
            continue
        try:
            product = scrape_amazon_product(asin)
            product["asin"] = asin
            # Use gmtime so the trailing "Z" (UTC) is accurate
            product["scraped_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            results.append(product)
            # Save incrementally
            with open(output_file, "w") as f:
                json.dump(results, f, indent=2)
            print(f"[{i+1}/{len(asins)}] {asin} — {product.get('price', 'N/A')}")
        except Exception as e:
            print(f"[{i+1}/{len(asins)}] {asin} — FAILED: {e}")
            results.append({"asin": asin, "error": str(e)})
        time.sleep(delay)

    successful = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    print(f"Done. {len(successful)} scraped, {len(failed)} failed.")
    return results

# Scrape a list of ASINs
asins = ["B09V3KXJPB", "B0CHWRXH8B", "B0C2P3F5T7", "B0BSHF7WHW", "B0D1XD1ZV3"]
scrape_product_list(asins, delay=2.0)
Key design decisions in this code:
- Incremental saves — if the script crashes at product 47 of 100, you resume at 48, not from scratch
- 2-second delay — conservative rate limit that works reliably for most volumes
- Error capture — failed scrapes are logged with the error, not silently dropped
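For long runs you may also want transient failures to retry instead of going straight to the error list. A generic retry wrapper with exponential backoff is one way to do that (a sketch; wrap `scrape_amazon_product` or any other function in it):

```python
import time

def with_retries(fn, *args, max_retries=3, base_delay=2.0):
    """Call fn(*args); on failure, retry with exponential backoff
    (2s, 4s, 8s, ...). Re-raises the last error if all attempts fail."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

Inside the loop, `product = with_retries(scrape_amazon_product, asin)` replaces the direct call; permanent failures still land in the error list after the final attempt.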
Step 4: Extract Reviews from Product Pages
Product reviews require scraping a different URL pattern. SearchHive's DeepDive can extract reviews alongside product data.
```python
def scrape_amazon_reviews(asin: str, num_pages: int = 3) -> list:
    """Scrape reviews from an Amazon product page."""
    reviews = []
    for page in range(1, num_pages + 1):
        url = f"https://www.amazon.com/product-reviews/{asin}?pageNumber={page}"
        resp = requests.post(f"{BASE_URL}/extract", json={
            "url": url,
            "prompt": """Extract all reviews on this page. For each review:
            - Rating (1-5 stars)
            - Review title
            - Review body text
            - Review date
            - Verified purchase status
            - Helpful votes count"""
        }, headers=HEADERS)
        if resp.status_code == 200:
            data = resp.json()["data"]
            if isinstance(data, list):
                reviews.extend(data)
            else:
                reviews.append(data)
        time.sleep(2)
    return reviews

reviews = scrape_amazon_reviews("B09V3KXJPB", num_pages=2)
for r in reviews[:3]:
    print(f"{'★' * r.get('rating', 0)} {r.get('title', 'No title')}")
    print(f"  {r.get('body', 'No body')[:120]}...\n")
```
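With the review list in hand, you can compute quick aggregates without any extra requests. A sketch, assuming each review dict carries `rating` and `verified_purchase` keys (the exact key names depend on how DeepDive structures the response, so inspect one response first):

```python
def summarize_reviews(reviews):
    """Return count, average rating, and verified-purchase share
    for a list of review dicts. Entries without a numeric rating are skipped."""
    rated = [r for r in reviews if isinstance(r.get("rating"), (int, float))]
    if not rated:
        return {"count": 0, "avg_rating": None, "verified_share": None}
    avg = sum(r["rating"] for r in rated) / len(rated)
    verified = sum(1 for r in rated if r.get("verified_purchase"))
    return {
        "count": len(rated),
        "avg_rating": round(avg, 2),
        "verified_share": round(verified / len(rated), 2),
    }
```

Comparing your computed average against the page's displayed star rating is also a cheap sanity check that extraction worked.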
Step 5: Build a Price Tracker
Combine search, scraping, and persistence into a recurring price tracker.
```python
import json
import time
from datetime import datetime
from pathlib import Path

class AmazonPriceTracker:
    def __init__(self, api_key: str, data_file: str = "price_history.json"):
        self.headers = {"Authorization": f"Bearer {api_key}"}
        self.base = "https://api.searchhive.dev/v1"
        self.data_file = data_file
        self._load_data()

    def _load_data(self):
        self.data = {}
        if Path(self.data_file).exists():
            with open(self.data_file) as f:
                self.data = json.load(f)

    def _save_data(self):
        with open(self.data_file, "w") as f:
            json.dump(self.data, f, indent=2)

    def track_product(self, asin: str) -> dict:
        """Scrape current price and append to history."""
        product = scrape_amazon_product(asin)
        now = datetime.utcnow().isoformat()
        if asin not in self.data:
            self.data[asin] = {
                "title": product.get("product_title", ""),
                "url": f"https://www.amazon.com/dp/{asin}",
                "history": []
            }
        entry = {
            "timestamp": now,
            "price": product.get("price"),
            "original_price": product.get("original_price"),
            "rating": product.get("rating"),
            "availability": product.get("availability")
        }
        self.data[asin]["history"].append(entry)
        self._save_data()

        # Check for price drop — compare as numbers, not strings
        # ("$189.99" < "$249.00" as strings is lexicographic, not numeric)
        history = self.data[asin]["history"]
        if len(history) >= 2:
            try:
                prev = float(str(history[-2]["price"]).replace("$", "").replace(",", ""))
                curr = float(str(entry["price"]).replace("$", "").replace(",", ""))
            except (TypeError, ValueError):
                prev = curr = None
            if prev and curr and curr < prev:
                print(f"PRICE DROP: {asin} — {prev} -> {curr}")
        return entry

    def get_price_history(self, asin: str) -> list:
        return self.data.get(asin, {}).get("history", [])

# Usage
tracker = AmazonPriceTracker("your_api_key")
watchlist = ["B09V3KXJPB", "B0CHWRXH8B", "B0C2P3F5T7"]
for asin in watchlist:
    tracker.track_product(asin)
    time.sleep(2)

# Check history
for asin in watchlist:
    history = tracker.get_price_history(asin)
    print(f"{asin}: {len(history)} data points")
```
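Once the tracker has accumulated data, you can flatten `price_history.json` into one CSV row per snapshot for pandas or a spreadsheet. A sketch against the file structure the tracker writes:

```python
import csv
import json

def export_history_to_csv(data_file="price_history.json", out_file="price_history.csv"):
    """Flatten the tracker's per-ASIN history into one CSV row per snapshot."""
    with open(data_file) as f:
        data = json.load(f)
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["asin", "title", "timestamp", "price",
                         "original_price", "rating", "availability"])
        for asin, record in data.items():
            for entry in record.get("history", []):
                writer.writerow([
                    asin,
                    record.get("title", ""),
                    entry.get("timestamp"),
                    entry.get("price"),
                    entry.get("original_price"),
                    entry.get("rating"),
                    entry.get("availability"),
                ])
```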
Common Issues and Fixes
Amazon returns CAPTCHA pages. The ScrapeForge API handles this with automatic proxy rotation and CAPTCHA solving. If you see consistent failures, increase the delay between requests to 3-5 seconds.
Prices returned as "N/A" or missing. Some Amazon pages show prices only when items are in stock. Check the availability field — if it says "Out of Stock," the price field may be empty.
Different Amazon marketplace needed? Change the URL from amazon.com to amazon.co.uk, amazon.de, etc. Set the country parameter in your scrape request to match.
Rate limiting (429 errors). If SearchHive returns rate limit errors, you've hit your plan's throughput limit. Either reduce your scrape frequency or upgrade your plan.
Best Practices for Amazon Scraping
- Respect robots.txt and terms of service. Only scrape data you have a legitimate use for.
- Use the minimum delay necessary. 2 seconds between requests is a reasonable default. Don't hammer the site.
- Cache aggressively. Product titles and descriptions don't change often — cache them and only refresh prices daily.
- Handle partial failures gracefully. Some products will fail to scrape. Log and retry, don't crash.
- Store raw responses. Save the raw API response alongside your parsed data for debugging.
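The caching advice can be as simple as a JSON file with per-entry timestamps. A minimal sketch (the file name and one-day TTL are arbitrary choices, not SearchHive features):

```python
import json
import time
from pathlib import Path

class SimpleCache:
    """File-backed cache with a time-to-live, for fields that rarely change
    (titles, descriptions). Fetch prices fresh; cache the rest."""
    def __init__(self, path="product_cache.json", ttl_seconds=86400):
        self.path = Path(path)
        self.ttl = ttl_seconds
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key):
        entry = self.data.get(key)
        if entry and time.time() - entry["stored_at"] < self.ttl:
            return entry["value"]
        return None  # missing or expired

    def set(self, key, value):
        self.data[key] = {"value": value, "stored_at": time.time()}
        self.path.write_text(json.dumps(self.data, indent=2))
```

Check the cache before scraping; on a miss, scrape and `set` the result. Every hit is an API credit and a request to Amazon you didn't spend.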
What's Next
- Combine with SearchHive's SwiftSearch to automatically discover new products in your niche
- Use DeepDive's structured extraction to build competitor comparison dashboards
- Schedule the price tracker with `cron` for daily automated monitoring
- Export price history to CSV/JSON for analysis in pandas or your BI tool
Start Building
All the code above works with SearchHive's free tier. Sign up at searchhive.dev, grab your API key, and start scraping Amazon product data in under 5 minutes.