How to Scrape Amazon Product Data with Python — Complete Guide
Amazon hosts over 600 million products. Whether you're building a price tracker, competitor analysis tool, or market research pipeline, extracting product data from Amazon is one of the most common web scraping tasks developers face. It's also one of the hardest.
Amazon invests heavily in anti-bot systems. Rotating CAPTCHAs, dynamic DOM structures, JavaScript-heavy rendering, and aggressive rate limiting mean naive HTTP requests break constantly. This guide covers what actually works in 2026, including production-ready Python code you can adapt today.
Key Takeaways
- Direct HTTP requests to Amazon fail within hours due to CAPTCHA rotations and DOM changes
- Headless browsers solve rendering but get detected by Amazon's fingerprinting unless you use residential proxies
- SearchHive's ScrapeForge API handles proxy rotation, CAPTCHA solving, and JS rendering for you at $0.0001/credit
- A hybrid approach (API for data collection, Python for processing) is the most reliable architecture
- Always respect robots.txt and rate limits to avoid IP bans and legal issues
Why Scraping Amazon Is Hard
Amazon's anti-bot stack includes:
- CAPTCHA challenges that rotate based on request patterns
- TLS fingerprinting to detect non-browser clients
- DOM randomization where CSS class names and element IDs change between sessions
- JavaScript rendering for prices, reviews, and availability data
- Rate limiting that kicks in at 10-20 requests/second without proxy rotation
- Honeypot elements hidden in the page to trap scrapers that parse everything
A script that works today will silently break next week when Amazon changes a CSS class. This is why direct scraping is a maintenance nightmare for production systems.
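One way to soften this breakage is to try several known selectors per field instead of pinning everything to one. A minimal sketch (the selector list is illustrative, not an exhaustive record of what Amazon uses):

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Return the text of the first selector that matches, else None."""
    for css in selectors:
        el = soup.select_one(css)
        if el:
            return el.get_text(strip=True)
    return None

# Illustrative fallback chain for the product title field.
TITLE_SELECTORS = ["#productTitle", "#title span", "h1.a-size-large"]

html = '<html><h1 class="a-size-large">Demo Product</h1></html>'
soup = BeautifulSoup(html, "html.parser")
title = select_first(soup, TITLE_SELECTORS)  # falls through to the h1 match
```

This doesn't stop breakage, but it turns "every field is None" into a slower degradation you can monitor.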
Approach 1: Requests + BeautifulSoup (Fragile)
The simplest approach, but also the most brittle:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}

def scrape_amazon_product(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    resp = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    rating = soup.select_one("#acrPopover .a-icon-alt")
    review_count = soup.select_one("#acrCustomerReviewText")

    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "rating": rating.get_text(strip=True) if rating else None,
        "reviews": review_count.get_text(strip=True) if review_count else None,
    }

data = scrape_amazon_product("B09V3KXJPB")
print(data)
```
This works intermittently. Within days, selectors like #productTitle or .a-offscreen change, and your parser returns None for every field. Amazon also serves CAPTCHA pages instead of product pages once it detects scraping patterns.
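It helps to detect those CAPTCHA interstitials explicitly rather than parsing them as empty products. A rough heuristic check (the marker strings are ones commonly seen on Amazon's challenge page; verify them against the responses you actually receive):

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic: does this response look like Amazon's CAPTCHA
    interstitial rather than a product page? Marker strings are
    illustrative and may need updating over time."""
    markers = (
        "Enter the characters you see below",
        "/errors/validateCaptcha",
        "api-services-support@amazon.com",
    )
    return any(m in html for m in markers)
```

Logging CAPTCHA hits separately from parse failures tells you whether you're blocked or just have stale selectors.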
Approach 2: Playwright Headless Browser (More Reliable)
Playwright handles JavaScript rendering and can mimic real browser behavior:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_with_playwright(asin):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await page.goto(f"https://www.amazon.com/dp/{asin}")
        await page.wait_for_selector("#productTitle", timeout=15000)

        title = await page.inner_text("#productTitle")
        price_el = await page.query_selector(".a-price .a-offscreen")
        price = await price_el.inner_text() if price_el else "N/A"

        await browser.close()
        return {"title": title.strip(), "price": price}

result = asyncio.run(scrape_with_playwright("B09V3KXJPB"))
print(result)
```
Better, but still gets flagged by Amazon's fingerprinting after sustained use. You need residential proxies ($5-15/GB), stealth plugins, and constant selector maintenance.
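If you do run Playwright yourself, proxies are configured at launch time via the `proxy` option on `chromium.launch()`. A sketch of a helper that builds that dict (the proxy host, port, and credentials are placeholders for whatever your provider issues):

```python
def proxy_config(host, port, username=None, password=None):
    """Build the proxy dict accepted by Playwright's launch()."""
    cfg = {"server": f"http://{host}:{port}"}
    if username:
        cfg.update({"username": username, "password": password})
    return cfg

# Usage inside scrape_with_playwright (placeholder endpoint):
#   browser = await p.chromium.launch(
#       headless=True,
#       proxy=proxy_config("proxy.example.com", 8080, "user", "pass"),
#   )
```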
Approach 3: SearchHive ScrapeForge (Production-Ready)
SearchHive's ScrapeForge API handles the hard parts: proxy rotation, CAPTCHA solving, JavaScript rendering, and structured extraction. You send a URL, you get clean JSON back.
```python
import requests

API_KEY = "your-searchhive-api-key"

def scrape_amazon_with_searchhive(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "formats": ["markdown", "html"],
            "extract": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string", "description": "Product title"},
                        "price": {"type": "string", "description": "Current price"},
                        "rating": {"type": "string", "description": "Star rating"},
                        "review_count": {"type": "string", "description": "Number of reviews"},
                        "availability": {"type": "string", "description": "In stock status"},
                        "features": {"type": "array", "items": {"type": "string"}},
                    }
                }
            }
        },
    )
    data = response.json()
    return data.get("extracted", data.get("markdown", ""))

product = scrape_amazon_with_searchhive("B09V3KXJPB")
print(product)
```
One API call. No browser to manage, no proxies to rotate, no selectors to maintain. The structured extraction schema pulls exactly the fields you need. At $0.0001/credit, scraping 1,000 Amazon product pages costs roughly $0.10 on the Starter plan.
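Network calls to any scraping API can still fail transiently (timeouts, 5xx responses), so a retry wrapper with exponential backoff is worth having. A generic sketch (the attempt count and delays are arbitrary defaults, not SearchHive recommendations):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.
    Delay doubles each attempt: 1s, 2s, 4s by default."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** i)

# Usage: with_retries(lambda: scrape_amazon_with_searchhive("B09V3KXJPB"))
```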
Scraping Amazon Search Results at Scale
For collecting product data across categories:
```python
import requests
import time
from urllib.parse import quote_plus

API_KEY = "your-searchhive-api-key"

def search_amazon_category(query, pages=5):
    results = []
    for page in range(1, pages + 1):
        # URL-encode the query so multi-word searches form a valid URL
        search_url = f"https://www.amazon.com/s?k={quote_plus(query)}&page={page}"
        response = requests.post(
            "https://api.searchhive.dev/v1/scrape",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": search_url,
                "extract": {
                    "schema": {
                        "type": "object",
                        "properties": {
                            "products": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "title": {"type": "string"},
                                        "asin": {"type": "string"},
                                        "price": {"type": "string"},
                                        "rating": {"type": "string"},
                                        "url": {"type": "string"},
                                    }
                                }
                            }
                        }
                    }
                }
            },
        )
        if response.status_code == 200:
            data = response.json()
            products = data.get("extracted", {}).get("products", [])
            results.extend(products)
        time.sleep(0.5)  # small delay between pages
    return results

products = search_amazon_category("wireless earbuds", pages=3)
print(f"Found {len(products)} products")
for p in products[:5]:
    print(f"  {p.get('title', 'N/A')[:60]} - {p.get('price', 'N/A')}")
```
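Search pages often repeat sponsored listings across pages, so deduplicating by ASIN before analysis is a good habit. A small sketch:

```python
def dedupe_by_asin(products):
    """Drop duplicate listings, keeping the first occurrence of each
    ASIN. Items without an ASIN are kept as-is."""
    seen = set()
    unique = []
    for p in products:
        asin = p.get("asin")
        if asin and asin in seen:
            continue
        if asin:
            seen.add(asin)
        unique.append(p)
    return unique
```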
Handling Amazon's Anti-Bot Systems
If you're going the self-hosted route, you need these layers:
| Technique | What It Solves | Reliability |
|---|---|---|
| Rotating user agents | Basic bot detection | Low |
| Residential proxies | IP-based blocking | Medium |
| Stealth browser plugins | TLS fingerprinting | Medium |
| CAPTCHA solving services | Challenge pages | High (but expensive) |
| Request throttling | Rate limiting | High |
Each layer adds cost and complexity. By the time you have a reliable setup, you're spending more than a ScrapeForge subscription costs.
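Of the layers in the table above, request throttling is the cheapest to implement: just enforce a minimum interval between requests on the client side. A minimal sketch (the interval is illustrative; tune it against what you observe):

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage: call throttle.wait() before each request in a scrape loop.
```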
Processing Scraped Amazon Data
Once you have the raw data, cleaning and structuring is where Python shines:
```python
import re

def clean_price(price_str):
    """Extract a numeric value from strings like "$1,299.99"."""
    if not price_str:
        return None
    match = re.search(r"[\d,]+\.?\d*", price_str)
    return float(match.group().replace(",", "")) if match else None

def clean_rating(rating_str):
    """Extract the numeric rating from strings like "4.6 out of 5 stars"."""
    if not rating_str:
        return None
    match = re.search(r"([\d.]+)", rating_str)
    return float(match.group()) if match else None

def parse_product(raw):
    return {
        "title": raw.get("title", "").strip(),
        "price_usd": clean_price(raw.get("price")),
        "rating": clean_rating(raw.get("rating")),
        # Reuses clean_price to pull the number out of "1,234 ratings"
        "review_count": clean_price(raw.get("review_count")),
    }
```
Legal Considerations
Amazon's Terms of Service prohibit automated data collection. While web scraping itself is generally legal (per hiQ Labs v. LinkedIn), scraping behind authentication or violating ToS creates legal risk. For commercial applications:
- Use official APIs where available (Amazon Product Advertising API)
- Respect robots.txt directives
- Don't scrape at volumes that impact site performance
- Consider using a managed scraping service that handles compliance
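Checking robots.txt programmatically is straightforward with the standard library's `urllib.robotparser`. A sketch using made-up rules (not Amazon's actual robots.txt, which you should fetch and inspect yourself):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules parsed from a string."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules only -- fetch the real file from
# https://www.amazon.com/robots.txt before relying on this.
rules = """\
User-agent: *
Disallow: /gp/
Allow: /dp/
"""

ok = allowed(rules, "MyScraper", "https://www.amazon.com/dp/B09V3KXJPB")
```

In production you would fetch the live file (e.g. with `RobotFileParser.set_url()` and `.read()`) rather than hard-coding rules.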
Alternatives to Direct Scraping
| Method | Pros | Cons |
|---|---|---|
| Amazon PA API | Official, legal | Rate limited, requires approval, incomplete data |
| Third-party APIs (Rainforest API, Keepa) | Structured data | Expensive at scale |
| SearchHive ScrapeForge | Full HTML + extraction, cheap | Third-party dependency; ToS risk still applies |
| Self-hosted scrapers | Full control | High maintenance cost |
Get Started with SearchHive
SearchHive gives you 500 free credits to start. Sign up, get your API key, and scrape your first Amazon product page in under 5 minutes. The Starter plan at $9/month gives you 5,000 credits, and the Builder plan at $49/month covers 100K pages -- enough for most price tracking and market research workflows.
Check the docs and compare SearchHive with Firecrawl to see how the pricing stacks up.