How to Scrape Amazon Product Data with Python — Complete Guide
Amazon hosts over 600 million products. Whether you're building a price tracker, competitor analysis tool, or market research pipeline, extracting product data from Amazon is one of the most common web scraping tasks developers face. It's also one of the hardest.
Amazon invests heavily in anti-bot systems. Rotating CAPTCHAs, dynamic DOM structures, JavaScript-heavy rendering, and aggressive rate limiting mean naive HTTP requests break constantly. This guide covers what actually works in 2026, including production-ready Python code you can adapt today.
Key Takeaways
- Direct HTTP requests to Amazon fail within hours due to CAPTCHA rotations and DOM changes
- Headless browsers solve rendering but get detected by Amazon's fingerprinting unless you use residential proxies
- SearchHive's ScrapeForge API handles proxy rotation, CAPTCHA solving, and JS rendering for you at $0.0001/credit
- A hybrid approach (API for data collection, Python for processing) is the most reliable architecture
- Always respect robots.txt and rate limits to avoid IP bans and legal issues
Why Scraping Amazon Is Hard
Amazon's anti-bot stack includes:
- CAPTCHA challenges that rotate based on request patterns
- TLS fingerprinting to detect non-browser clients
- DOM randomization where CSS class names and element IDs change between sessions
- JavaScript rendering for prices, reviews, and availability data
- Rate limiting that kicks in at 10-20 requests/second without proxy rotation
- Honeypot elements hidden in the page to trap scrapers that parse everything
A script that works today will silently break next week when Amazon changes a CSS class. This is why direct scraping is a maintenance nightmare for production systems.
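One way to soften this breakage is to try several known selectors per field instead of pinning everything to one. A minimal sketch (the selector list is illustrative, not an exhaustive record of what Amazon uses):

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Return the text of the first selector that matches, else None."""
    for css in selectors:
        el = soup.select_one(css)
        if el:
            return el.get_text(strip=True)
    return None

# Illustrative fallback chain for the product title field.
TITLE_SELECTORS = ["#productTitle", "#title span", "h1.a-size-large"]

html = '<html><h1 class="a-size-large">Demo Product</h1></html>'
soup = BeautifulSoup(html, "html.parser")
title = select_first(soup, TITLE_SELECTORS)  # falls through to the h1 match
```

This doesn't stop breakage, but it turns "every field is None" into a slower degradation you can monitor.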
Approach 1: Requests + BeautifulSoup (Fragile)
The simplest approach, but also the most brittle:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}

def scrape_amazon_product(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    resp = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    rating = soup.select_one("#acrPopover .a-icon-alt")
    review_count = soup.select_one("#acrCustomerReviewText")

    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "rating": rating.get_text(strip=True) if rating else None,
        "reviews": review_count.get_text(strip=True) if review_count else None,
    }

data = scrape_amazon_product("B09V3KXJPB")
print(data)
```
This works intermittently. Within days, selectors like #productTitle or .a-offscreen change, and your parser returns None for every field. Amazon also serves CAPTCHA pages instead of product pages once it detects scraping patterns.
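It helps to detect those CAPTCHA interstitials explicitly rather than parsing them as empty products. A rough heuristic check (the marker strings are ones commonly seen on Amazon's challenge page; verify them against the responses you actually receive):

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic: does this response look like Amazon's CAPTCHA
    interstitial rather than a product page? Marker strings are
    illustrative and may need updating over time."""
    markers = (
        "Enter the characters you see below",
        "/errors/validateCaptcha",
        "api-services-support@amazon.com",
    )
    return any(m in html for m in markers)
```

Logging CAPTCHA hits separately from parse failures tells you whether you're blocked or just have stale selectors.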
Approach 2: Playwright Headless Browser (More Reliable)
Playwright handles JavaScript rendering and can mimic real browser behavior:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_with_playwright(asin):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await page.goto(f"https://www.amazon.com/dp/{asin}")
        await page.wait_for_selector("#productTitle", timeout=15000)

        title = await page.inner_text("#productTitle")
        price_el = await page.query_selector(".a-price .a-offscreen")
        price = await price_el.inner_text() if price_el else "N/A"

        await browser.close()
        return {"title": title.strip(), "price": price}

result = asyncio.run(scrape_with_playwright("B09V3KXJPB"))
print(result)
```
Better, but still gets flagged by Amazon's fingerprinting after sustained use. You need residential proxies ($5-15/GB), stealth plugins, and constant selector maintenance.
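If you do run Playwright yourself, proxies are configured at launch time via the `proxy` option on `chromium.launch()`. A sketch of a helper that builds that dict (the proxy host, port, and credentials are placeholders for whatever your provider issues):

```python
def proxy_config(host, port, username=None, password=None):
    """Build the proxy dict accepted by Playwright's launch()."""
    cfg = {"server": f"http://{host}:{port}"}
    if username:
        cfg.update({"username": username, "password": password})
    return cfg

# Usage inside scrape_with_playwright (placeholder endpoint):
#   browser = await p.chromium.launch(
#       headless=True,
#       proxy=proxy_config("proxy.example.com", 8080, "user", "pass"),
#   )
```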
Approach 3: SearchHive ScrapeForge (Production-Ready)
SearchHive's ScrapeForge API handles the hard parts: proxy rotation, CAPTCHA solving, JavaScript rendering, and structured extraction. You send a URL, you get clean JSON back.
```python
import requests

API_KEY = "your-searchhive-api-key"

def scrape_amazon_with_searchhive(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "formats": ["markdown", "html"],
            "extract": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string", "description": "Product title"},
                        "price": {"type": "string", "description": "Current price"},
                        "rating": {"type": "string", "description": "Star rating"},
                        "review_count": {"type": "string", "description": "Number of reviews"},
                        "availability": {"type": "string", "description": "In stock status"},
                        "features": {"type": "array", "items": {"type": "string"}},
                    }
                }
            }
        },
    )
    data = response.json()
    return data.get("extracted", data.get("markdown", ""))

product = scrape_amazon_with_searchhive("B09V3KXJPB")
print(product)
```
One API call. No browser to manage, no proxies to rotate, no selectors to maintain. The structured extraction schema pulls exactly the fields you need. At $0.0001/credit, scraping 1,000 Amazon product pages costs roughly $0.10 on the Starter plan.
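Network calls to any scraping API can still fail transiently (timeouts, 5xx responses), so a retry wrapper with exponential backoff is worth having. A generic sketch (the attempt count and delays are arbitrary defaults, not SearchHive recommendations):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.
    Delay doubles each attempt: 1s, 2s, 4s by default."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** i)

# Usage: with_retries(lambda: scrape_amazon_with_searchhive("B09V3KXJPB"))
```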
Scraping Amazon Search Results at Scale
For collecting product data across categories:
```python
import requests
import time
from urllib.parse import quote_plus

API_KEY = "your-searchhive-api-key"

def search_amazon_category(query, pages=5):
    results = []
    for page in range(1, pages + 1):
        # URL-encode the query so multi-word searches form a valid URL
        search_url = f"https://www.amazon.com/s?k={quote_plus(query)}&page={page}"
        response = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": search_url,
                "extract": {
                    "schema": {
                        "type": "object",
                        "properties": {
                            "products": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "title": {"type": "string"},
                                        "asin": {"type": "string"},
                                        "price": {"type": "string"},
                                        "rating": {"type": "string"},
                                        "url": {"type": "string"},
                                    }
                                }
                            }
                        }
                    }
                }
            },
        )
        if response.status_code == 200:
            data = response.json()
            products = data.get("extracted", {}).get("products", [])
            results.extend(products)
        time.sleep(0.5)  # small delay between pages
    return results

products = search_amazon_category("wireless earbuds", pages=3)
print(f"Found {len(products)} products")
for p in products[:5]:
    print(f"  {p.get('title', 'N/A')[:60]} - {p.get('price', 'N/A')}")
```
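Search pages often repeat sponsored listings across pages, so deduplicating by ASIN before analysis is a good habit. A small sketch:

```python
def dedupe_by_asin(products):
    """Drop duplicate listings, keeping the first occurrence of each
    ASIN. Items without an ASIN are kept as-is."""
    seen = set()
    unique = []
    for p in products:
        asin = p.get("asin")
        if asin and asin in seen:
            continue
        if asin:
            seen.add(asin)
        unique.append(p)
    return unique
```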
Handling Amazon's Anti-Bot Systems
If you're going the self-hosted route, you need these layers:
| Technique | What It Solves | Reliability |
|---|---|---|
| Rotating user agents | Basic bot detection | Low |
| Residential proxies | IP-based blocking | Medium |
| Stealth browser plugins | TLS fingerprinting | Medium |
| CAPTCHA solving services | Challenge pages | High (but expensive) |
| Request throttling | Rate limiting | High |
Each layer adds cost and complexity. By the time you have a reliable setup, you're spending more than a ScrapeForge subscription costs.
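Of the layers in the table above, request throttling is the cheapest to implement: just enforce a minimum interval between requests on the client side. A minimal sketch (the interval is illustrative; tune it against what you observe):

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage: call throttle.wait() before each request in a scrape loop.
```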
Processing Scraped Amazon Data
Once you have the raw data, cleaning and structuring is where Python shines:
```python
import re

def clean_price(price_str):
    """Extract a numeric value from strings like "$1,299.99"."""
    if not price_str:
        return None
    match = re.search(r"[\d,]+\.?\d*", price_str)
    return float(match.group().replace(",", "")) if match else None

def clean_rating(rating_str):
    """Extract the numeric rating from strings like "4.6 out of 5 stars"."""
    if not rating_str:
        return None
    match = re.search(r"([\d.]+)", rating_str)
    return float(match.group()) if match else None

def parse_product(raw):
    return {
        "title": raw.get("title", "").strip(),
        "price_usd": clean_price(raw.get("price")),
        "rating": clean_rating(raw.get("rating")),
        # Reuses clean_price to pull the number out of "1,234 ratings"
        "review_count": clean_price(raw.get("review_count")),
    }
```
Legal Considerations
Amazon's Terms of Service prohibit automated data collection. While web scraping itself is generally legal (per hiQ Labs v. LinkedIn), scraping behind authentication or violating ToS creates legal risk. For commercial applications:
- Use official APIs where available (Amazon Product Advertising API)
- Respect robots.txt directives
- Don't scrape at volumes that impact site performance
- Consider using a managed scraping service that handles compliance
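Checking robots.txt programmatically is straightforward with the standard library's `urllib.robotparser`. A sketch using made-up rules (not Amazon's actual robots.txt, which you should fetch and inspect yourself):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules parsed from a string."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules only -- fetch the real file from
# https://www.amazon.com/robots.txt before relying on this.
rules = """\
User-agent: *
Disallow: /gp/
Allow: /dp/
"""

ok = allowed(rules, "MyScraper", "https://www.amazon.com/dp/B09V3KXJPB")
```

In production you would fetch the live file (e.g. with `RobotFileParser.set_url()` and `.read()`) rather than hard-coding rules.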
Alternatives to Direct Scraping
| Method | Pros | Cons |
|---|---|---|
| Amazon PA API | Official, legal | Rate limited, requires approval, incomplete data |
| Third-party APIs (Rainforest API, Keepa) | Structured data | Expensive at scale |
| SearchHive ScrapeForge | Full HTML + extraction, cheap | Third-party dependency; ToS risk still applies |
| Self-hosted scrapers | Full control | High maintenance cost |
Get Started with SearchHive
SearchHive gives you 500 free credits to start. Sign up, get your API key, and scrape your first Amazon product page in under 5 minutes. The Starter plan at $9/month gives you 5,000 credits, and the Builder plan at $49/month covers 100K pages -- enough for most price tracking and market research workflows.
Check the docs and compare SearchHive with Firecrawl to see how the pricing stacks up.