Amazon is one of the hardest websites to scrape. They invest heavily in anti-bot technology -- CAPTCHAs, IP blocking, behavioral analysis, JavaScript challenges, and device fingerprinting. If you try to scrape Amazon with basic requests and no precautions, you'll get blocked within minutes.
This FAQ covers every technique for scraping Amazon data reliably, from simple approaches to production-grade solutions, with code examples and honest cost comparisons.
Key Takeaways
- Amazon blocks aggressively -- basic HTTP requests and datacenter proxies get detected and banned quickly
- Residential proxies are mandatory for any volume beyond a few requests
- API-based services (SearchHive ScrapeForge, ScrapingBee) handle Amazon's anti-bot measures for you
- Pre-built scrapers (Apify Amazon Scraper) work but are expensive for ongoing use
- SearchHive DeepDive can research Amazon product data without you writing any scraper logic
- Respect rate limits and terms of service -- aggressive scraping risks legal action
Why is Amazon so hard to scrape?
Amazon uses multiple layers of anti-bot protection:
- IP reputation tracking: Datacenter IPs get flagged fast. Amazon maintains a massive database of known proxy and VPN IPs.
- Behavioral analysis: They track mouse movements, scroll patterns, click timing, and navigation patterns to distinguish bots from humans.
- JavaScript challenges: Pages load dynamic content via JS that requires a real browser engine to execute.
- CAPTCHAs: Triggered when suspicious activity is detected -- image puzzles, checkbox challenges.
- Device fingerprinting: Browser fingerprinting (canvas, WebGL, fonts, plugins) identifies automated browsers.
- Rate limiting per session: Even legitimate-looking sessions get throttled if they request too many pages too fast.
- TLS fingerprinting: Amazon checks the TLS handshake to verify you're using a real browser's SSL implementation.
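One practical consequence of these layers: a blocked scraper often receives an HTTP 200 carrying Amazon's robot-check interstitial instead of the product page, so you must detect blocks by inspecting the body. A minimal sketch (the marker strings are assumptions based on commonly reported block pages, not an official list -- verify against responses you actually receive):

```python
# Hypothetical helper: detect Amazon's "robot check" interstitial so a
# scraper can back off instead of parsing garbage. These markers are
# assumptions -- adjust them to the block pages you actually observe.
BLOCK_MARKERS = (
    "Enter the characters you see below",
    "api-services-support@amazon.com",
    "To discuss automated access to Amazon data",
)

def looks_blocked(html: str) -> bool:
    """Return True if the response body resembles Amazon's bot-check page."""
    return any(marker in html for marker in BLOCK_MARKERS)
```

Checking every response this way lets you pause and rotate IPs early, before Amazon escalates to a hard ban.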
What's the easiest way to scrape Amazon?
The easiest approach: use an API that handles Amazon's anti-bot measures for you.
SearchHive ScrapeForge
```python
import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Scrape an Amazon product page -- proxy rotation and JS rendering included
response = requests.get(
    "https://api.searchhive.dev/scrapeforge",
    headers=headers,
    params={
        "url": "https://www.amazon.com/dp/B0C4JVT6KQ",
        "format": "markdown",
        "js_render": True
    }
).json()

print(response["markdown"][:1000])
```
ScrapeForge automatically handles:
- Residential proxy rotation (different IP for each request)
- JavaScript rendering (waits for dynamic content to load)
- CAPTCHA solving (when triggered)
- Browser fingerprint spoofing
At $49/month for 100K credits, this is significantly cheaper than managing your own proxy infrastructure.
SearchHive DeepDive for product research
If you don't need to scrape individual pages but want product data:
```python
import requests

response = requests.post(
    "https://api.searchhive.dev/deepdive",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "query": "Find the top 5 best-selling wireless earbuds on Amazon under $100 in 2026, with prices and ratings",
        "depth": "comprehensive"
    }
).json()

print(response["answer"])
```
How do I scrape Amazon with Python (manual approach)?
If you want to build your own scraper, here's what you need:
1. Residential Proxies
Datacenter proxies won't work. You need residential proxies -- IPs from real ISPs that look like normal users.
Popular residential proxy providers:
- Bright Data: From $5.50/GB, large proxy pool
- Oxylabs: From $8/GB, good reliability
- Smartproxy: From $4.40/GB, budget option
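Once you have a proxy endpoint, wiring it into a request is straightforward. A sketch, assuming the common user:pass@host:port gateway format (the hostname, port, and credentials below are placeholders -- each provider documents its own):

```python
# Placeholder proxy URL -- substitute your provider's gateway and credentials.
# Most residential providers use the user:pass@host:port form shown here.
PROXY = "http://USERNAME:PASSWORD@gateway.example-provider.com:7000"

proxies = {"http": PROXY, "https": PROXY}
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}

# requests routes through the proxy when you pass `proxies=`:
# requests.get("https://www.amazon.com/dp/B0C4JVT6KQ",
#              headers=headers, proxies=proxies, timeout=30)
```

Note that proxying alone is not enough against Amazon -- you still need a real browser engine for the JS challenges, which is what the next step covers.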
2. Undetected Browser
Use a stealth-hardened browser such as undetected-chromedriver or playwright-stealth. The example below uses plain Playwright with a realistic user agent and viewport; layering a stealth plugin on top improves survival further:
```python
from playwright.sync_api import sync_playwright
import time

def scrape_amazon_product(asin, proxy):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": proxy}
        )
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080}
        )
        page = context.new_page()
        page.goto(f"https://www.amazon.com/dp/{asin}")
        page.wait_for_selector("#productTitle", timeout=10000)
        time.sleep(2)  # Simulate human reading time

        title = page.locator("#productTitle").inner_text().strip()
        price = page.locator(".a-price .a-offscreen").first.inner_text().strip()
        rating = page.locator("#acrPopover .a-size-base").first.inner_text().strip()

        browser.close()
        return {"title": title, "price": price, "rating": rating}
```
3. Rate Limiting
Add random delays between requests to mimic human behavior:
```python
import random
import time

asins = ["B0C4JVT6KQ", "B09V3KXJPB", "B0CJ5J7TVR"]

for asin in asins:
    result = scrape_amazon_product(asin, "http://your-proxy:port")
    print(result)
    time.sleep(random.uniform(5, 15))  # 5-15 second delay
```
4. Session Management
Maintain cookies across requests to appear as a returning user:
```python
context = browser.new_context(
    storage_state="amazon_cookies.json",  # Load saved session
    user_agent="..."
)

# After a successful scrape, save cookies
context.storage_state(path="amazon_cookies.json")
```
What about Apify's Amazon scraper?
Apify offers a pre-built Amazon scraper (Actor) that handles proxy rotation and CAPTCHAs:
- Amazon Product Scraper: $49/month (Personal plan), ~$0.001-0.003 per product
- Amazon Reviews Scraper: Same pricing
- Amazon Search Scraper: Same pricing
Pros: Works out of the box, handles anti-bot measures, good for non-developers.
Cons: Expensive at scale, limited customization, and Apify's credit system is confusing.
At 100K products/month, Apify costs $100-300/month depending on credit usage. SearchHive ScrapeForge does the same for $49/month.
What data can you legally scrape from Amazon?
Generally considered fair game:
- Product titles, prices, ratings: Publicly available on product pages
- Product descriptions and specs: Public information
- Customer reviews: Publicly available, but be careful with user-generated content
- Seller information: Public on storefront pages
- Best seller rankings: Publicly displayed
Avoid:
- Scraping personal information (buyer names, addresses)
- Bypassing paywalls or login-gated content
- Scraping at volumes that cause service disruption
- Using scraped data to compete directly on Amazon (may violate their Terms of Service)
The hiQ v. LinkedIn litigation (Ninth Circuit, 2022) suggested that scraping publicly available data generally does not violate the US Computer Fraud and Abuse Act. However, Amazon's ToS explicitly prohibits automated access, which creates a contractual risk even where the scraping itself is lawful.
How do I scrape Amazon search results?
To get product listings from Amazon search:
```python
import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Scrape an Amazon search results page
response = requests.get(
    "https://api.searchhive.dev/scrapeforge",
    headers=headers,
    params={
        "url": "https://www.amazon.com/s?k=wireless+earbuds&ref=nb_sb_noss",
        "format": "markdown",
        "js_render": True
    }
).json()

# Parse the markdown output for product data
# In production, use structured extraction or an LLM to parse the markdown
print(response["markdown"][:2000])
```
For structured extraction, use SearchHive's JSON output format or parse the markdown with regex or an LLM.
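A hedged sketch of the regex route, assuming the markdown renders prices like "$79.99" and ratings like "4.3 out of 5 stars" -- the exact shapes depend on the page, so verify the patterns against real output before relying on them:

```python
import re

# Hypothetical patterns for the markdown a scrape returns. These assume
# prices appear as "$1,234.56" and ratings as "4.3 out of 5 stars".
def extract_prices(markdown: str) -> list[str]:
    return re.findall(r"\$\d{1,4}(?:,\d{3})*\.\d{2}", markdown)

def extract_ratings(markdown: str) -> list[str]:
    return re.findall(r"(\d\.\d) out of 5 stars", markdown)

sample = "Echo Buds ... 4.3 out of 5 stars ... $49.99 (was $119.99)"
print(extract_prices(sample))   # ['$49.99', '$119.99']
print(extract_ratings(sample))  # ['4.3']
```

Regex is brittle across layout changes; an LLM-based extractor costs more per page but survives markup churn.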
How do I scrape Amazon at scale?
For large-scale Amazon scraping (thousands to millions of products):
- Use a queue system: Redis or SQS to manage URLs to scrape
- Distribute across workers: Multiple scraper instances with different proxy IPs
- Implement exponential backoff: When blocked, wait longer before retrying
- Monitor success rates: Track blocks, CAPTCHAs, and empty responses
- Rotate user agents: Cycle through hundreds of realistic user agent strings
- Respect crawl rate: 1 request per 5-15 seconds per IP minimum
```python
import json
import random
import requests
import time

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

def batch_scrape_amazon(asins, output_file="amazon_products.json"):
    results = []
    for i, asin in enumerate(asins):
        try:
            response = requests.get(
                "https://api.searchhive.dev/scrapeforge",
                headers=headers,
                params={
                    "url": f"https://www.amazon.com/dp/{asin}",
                    "format": "markdown",
                    "js_render": True
                },
                timeout=30
            )
            if response.status_code == 200:
                data = response.json()
                results.append({"asin": asin, "content": data.get("markdown", "")})
            # Rate limiting
            time.sleep(random.uniform(2, 5))
        except Exception as e:
            print(f"Error scraping {asin}: {e}")
            continue

        # Save a checkpoint every 100 products
        if (i + 1) % 100 == 0:
            with open(output_file, "w") as f:
                json.dump(results, f)
            print(f"Checkpoint: {i+1}/{len(asins)} products scraped")
    return results
```
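The batch function above simply skips failures; the exponential-backoff point from the checklist can be sketched as a standalone helper (the base delay, cap, and jitter values here are illustrative, not tuned for Amazon specifically):

```python
import random
import time

# Illustrative backoff schedule: wait base * 2^attempt seconds, capped,
# plus up to one second of jitter so workers don't retry in lockstep.
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def fetch_with_backoff(fetch, max_retries: int = 5):
    """Call `fetch()` until it succeeds or retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("still blocked after retries")
```

Wrap each product fetch in `fetch_with_backoff` so a burst of blocks slows the worker down instead of burning proxy bandwidth on doomed retries.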
What are the alternatives to scraping Amazon?
If you don't want to deal with scraping at all:
- Amazon Product Advertising API (PA-API): Official API, but requires Amazon Associates registration and has strict usage limits
- Amazon SP-API (Selling Partner API): Only available to registered Amazon sellers
- Third-party data providers: Keepa, Jungle Scout, Helium 10 -- provide historical price data, sales estimates, and BSR tracking
- SearchHive DeepDive: Research product categories, pricing trends, and competitor analysis without scraping individual pages
Get started
Don't fight Amazon's anti-bot systems alone. SearchHive ScrapeForge handles proxy rotation, JS rendering, and CAPTCHA solving so you can focus on your application.
- Free tier: 500 credits to test
- Starter plan: $9/month for 5,000 credits
- Builder plan: $49/month for 100,000 credits
Get your API key: https://searchhive.dev
For more scraping guides, see /blog/how-to-automate-web-scraping-complete-answer and /blog/what-is-the-best-web-scraping-api-complete-answer.