Review Scraping and Analysis: Common Questions Answered
Review scraping extracts customer feedback from Amazon, Google, Trustpilot, G2, App Store, and other review platforms. Businesses use it for competitive intelligence, sentiment tracking, feature demand analysis, and product improvement.
This FAQ covers the legal, technical, and practical aspects of review scraping and analysis -- the questions developers and product teams ask most.
Key Takeaways
- Scraping public reviews is generally lawful in most jurisdictions, but terms of service vary by platform
- Python with requests + BeautifulSoup handles static review pages; Playwright/Selenium for JavaScript-rendered sites
- SearchHive's ScrapeForge API handles proxy rotation and JS rendering for review sites with anti-bot protection
- Sentiment analysis pipelines typically combine keyword matching with transformer models
- Popular products on major platforms can accumulate 1,000-50,000 reviews -- plan storage accordingly
Is Review Scraping Legal?
Generally yes, for publicly available reviews. In the US, the Ninth Circuit held in hiQ Labs v. LinkedIn (2022) that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). However:
- Platform terms of service may prohibit scraping. This is a contract issue, not a criminal one.
- Personal data is regulated (GDPR in the EU, CCPA in California). Reviewer names and profile links count as personal data, so don't store them without a lawful basis such as consent.
- Copyright applies to the review text. Using scraped reviews for commercial purposes (like displaying them on your site) may require licensing.
For internal analysis and competitive intelligence, scraping public reviews is widely practiced and legally defensible.
Which Platforms Can I Scrape Reviews From?
Most review platforms are technically scrapable, but difficulty varies:
| Platform | Difficulty | Notes |
|---|---|---|
| Amazon | Hard | Aggressive anti-bot. Needs residential proxies and JS rendering. |
| Google Reviews | Medium | Maps API available but limited. Direct scraping needs proxy rotation. |
| Trustpilot | Easy | Clean HTML structure, minimal anti-bot. |
| G2 / Capterra | Easy | Well-structured pages, no significant blocking. |
| App Store / Google Play | Medium | Rate-limited. Use official APIs when possible. |
| Yelp | Hard | Aggressive blocking. Honor robots.txt. |
| TripAdvisor | Medium | Moderate anti-bot. Respect crawl rate. |
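Several rows above mention honoring robots.txt and crawl rates. As a minimal sketch, Python's standard `urllib.robotparser` can check whether a URL is allowed before you request it. The robots.txt content, the `my-review-bot` user agent, and the `build_robot_rules` helper are all made up for illustration; in practice you would load the target site's real file.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("my-review-bot", "https://example.com/reviews/widget")
blocked = rules.can_fetch("my-review-bot", "https://example.com/private/data")
delay = rules.crawl_delay("my-review-bot")  # seconds, or None if not declared
```

If `crawl_delay` returns a value, treat it as the minimum pause between requests to that site.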
How Do I Scrape Reviews from Amazon?
Amazon is the hardest major review platform to scrape. One approach that works is to route the request through a scraping API and parse the returned markdown:
import requests

API_KEY = "your-searchhive-key"

def scrape_amazon_reviews(product_url):
    """Fetch a product page via ScrapeForge and split it into reviews."""
    resp = requests.get(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"url": product_url},
        timeout=60,
    )
    resp.raise_for_status()  # Fail loudly on HTTP errors
    data = resp.json()
    content = data.get("markdown", "")

    # Parse review data from the markdown output.
    # Each "out of 5 stars" line starts a new review; the non-empty
    # lines that follow are collected as that review's text.
    reviews = []
    current_review = None
    for line in content.split("\n"):
        if "out of 5 stars" in line:
            if current_review:
                reviews.append(current_review)
            current_review = {"rating": line.strip(), "text": ""}
        elif current_review and line.strip():
            current_review["text"] += line + "\n"
    if current_review:
        reviews.append(current_review)
    return reviews
For Amazon specifically, ScrapeForge handles the proxy rotation and JavaScript rendering that Amazon requires. Direct scraping with requests alone gets blocked quickly.
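The parser above keeps each rating as the raw line of text (for example, "4.0 out of 5 stars"). If you want a number for storage or averaging, a small helper can extract it. This is a sketch that assumes the English "N out of 5 stars" wording; `parse_rating` and `RATING_PATTERN` are hypothetical names, not part of any SearchHive API.

```python
import re
from typing import Optional

# Matches strings such as "4.0 out of 5 stars" or "5 out of 5 stars".
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+out of\s+5\s+stars")

def parse_rating(line: str) -> Optional[float]:
    """Return the star rating as a float, or None if no rating is found."""
    match = RATING_PATTERN.search(line)
    return float(match.group(1)) if match else None
```

Returning `None` instead of raising lets you skip malformed lines without stopping the whole scrape.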
What About Google Reviews?
Google Reviews are attached to Google Maps listings, which are heavily JavaScript-rendered.
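# Reuses requests and API_KEY from the Amazon example
def scrape_google_reviews(place_url):
    """Fetch a Google Maps listing and return its content as markdown."""
    resp = requests.get(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"url": place_url},
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    # The rendered page comes back as markdown; review parsing is left to the caller
    return data.get("markdown", "")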
Alternatively, use SearchHive's DeepDive to research a business and get a summary of its review sentiment:
resp = requests.post(
    "https://api.searchhive.dev/v1/deepdive",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={"query": "Acme Corp reviews and ratings summary 2025", "depth": "shallow"}
)
summary = resp.json()
How Do I Perform Sentiment Analysis on Reviews?
Start simple with keyword-based classification, then upgrade to a transformer model.
Basic approach (no ML):
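import re

def basic_sentiment(text):
    """Classify text by counting positive and negative keywords."""
    text = text.lower()
    positive = ["great", "excellent", "love", "best", "amazing", "perfect"]
    negative = ["terrible", "worst", "hate", "awful", "waste", "broken"]
    # Anchor each keyword at a word start so "hate" does not match inside
    # "whatever", while "loved" still counts for "love".
    pos = sum(1 for w in positive if re.search(rf"\b{w}", text))
    neg = sum(1 for w in negative if re.search(rf"\b{w}", text))
    if pos > neg:
        return "positive"
    elif neg > pos:
        return "negative"
    return "neutral"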
Better approach (transformer model):
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def ml_sentiment(text):
    # Truncate at the tokenizer level: the 512 limit is in tokens, not characters
    result = classifier(text, truncation=True, max_length=512)[0]
    return result["label"].lower()  # "positive" or "negative"
The DistilBERT model reports about 91% accuracy on the SST-2 benchmark, which consists of movie review sentences. For domain-specific reviews (software, restaurants, electronics), fine-tuning on labeled in-domain data can improve accuracy, in some cases to 95%+.
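One gap worth noting: the SST-2 model only returns "positive" or "negative", while the keyword approach also returns "neutral". A possible workaround is to treat low-confidence predictions as neutral. The sketch below assumes the result dictionary has the `label` and `score` keys that the Hugging Face sentiment pipeline returns; `label_with_neutral` and the 0.75 threshold are illustrative choices, not tuned values.

```python
def label_with_neutral(result: dict, threshold: float = 0.75) -> str:
    """Map a pipeline result to positive, negative, or neutral.

    `result` is expected to look like {"label": "POSITIVE", "score": 0.98}.
    Predictions scoring below `threshold` are reported as neutral.
    """
    if result["score"] < threshold:
        return "neutral"
    return result["label"].lower()
```

For example, `label_with_neutral(classifier(text, truncation=True)[0])` would slot into the function above. Check the threshold against a hand-labeled sample before relying on it.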
How Many Reviews Do I Need for Meaningful Analysis?
It depends on your goal:
- Overall sentiment: 100+ reviews gives a reliable positive/negative ratio
- Feature-level analysis: 500+ reviews to identify common feature mentions
- Trend detection: 1,000+ reviews over time to spot rating shifts
- Competitive comparison: 500+ reviews per competitor for statistical significance
Popular Amazon products can have 1,000-50,000 reviews, though many listings have far fewer. Software products on G2 typically have 50-500 reviews.
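To see why sample size matters for the thresholds above, you can estimate the margin of error on a positive-review share. This is a rough sketch using the normal approximation for a proportion at 95% confidence (z = 1.96); it assumes reviews behave like independent random samples, which real reviews only approximate. `margin_of_error` is a hypothetical helper name.

```python
import math

def margin_of_error(positive_share: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a proportion."""
    return z * math.sqrt(positive_share * (1 - positive_share) / n)

for n in (100, 500, 1000):
    moe = margin_of_error(0.7, n)
    print(f"n={n}: 70% positive, +/- {moe * 100:.1f} points")
```

With 70% positive reviews, the margin is roughly 9 points at 100 reviews, 4 points at 500, and 2.8 points at 1,000. That is why comparing two competitors that differ by only a few points needs the larger sample.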
How Do I Extract Feature-Specific Feedback?
Use named entity recognition (NER) or keyword extraction to map reviews to features.
import re

def extract_features(text, features):
    """Map a review to the features it mentions (uses basic_sentiment from above)."""
    text_lower = text.lower()
    found = {}
    for feature in features:
        # Check for feature name and common synonyms
        keywords = feature.get("keywords", [feature["name"]])
        # Anchor at a word start so "lag" does not match inside "flagship"
        mentions = [k for k in keywords
                    if re.search(r"\b" + re.escape(k), text_lower)]
        if mentions:
            found[feature["name"]] = {
                "mentioned": True,
                # Sentiment of the whole review, not of the feature alone
                "sentiment": basic_sentiment(text),
                "context": text[:200]
            }
    return found

FEATURES = [
    {"name": "battery", "keywords": ["battery", "charge", "battery life"]},
    {"name": "display", "keywords": ["screen", "display", "resolution", "brightness"]},
    {"name": "performance", "keywords": ["speed", "fast", "slow", "lag", "performance"]},
]
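Per-review results become useful once you aggregate them. The sketch below tallies sentiment labels per feature across many reviews. It assumes each input item has the same shape as the dictionary returned by `extract_features`; `summarize_features` is a hypothetical helper and the sample data is hand-written.

```python
from collections import Counter, defaultdict

def summarize_features(per_review_results):
    """Count sentiment labels per feature across many reviews.

    `per_review_results` is a list of dicts shaped like
    {feature_name: {"sentiment": "...", ...}}.
    """
    summary = defaultdict(Counter)
    for found in per_review_results:
        for name, info in found.items():
            summary[name][info["sentiment"]] += 1
    return {name: dict(counts) for name, counts in summary.items()}

# Hand-written sample results, for illustration only.
sample = [
    {"battery": {"mentioned": True, "sentiment": "negative", "context": "..."}},
    {"battery": {"mentioned": True, "sentiment": "positive", "context": "..."},
     "display": {"mentioned": True, "sentiment": "positive", "context": "..."}},
    {},  # a review that mentions no tracked feature
]
feature_summary = summarize_features(sample)
```

A feature with many mentions and a high share of negative labels is a reasonable first candidate for closer reading.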
Can I Scrape Reviews at Scale?
Yes, but you need to handle rate limiting and respect platform guidelines:
- Respect robots.txt: Check the target site's crawl policies
- Add delays: Wait at least 1-3 seconds between requests
- Rotate proxies: Use residential proxies for sites with anti-bot measures
- Batch processing: Scrape during off-peak hours to reduce server load
- Error handling: Implement exponential backoff for 429/503 responses
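The exponential backoff item above can be sketched as follows. This is a minimal illustration, not a production client: `fetch_with_backoff` is a hypothetical helper, `fetch` is any callable that returns an object with a `status_code` attribute (such as `requests.get`), and the retry count, base delay, and jitter range are arbitrary starting values.

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry on 429/503 with exponential backoff.

    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt == max_retries:
            break
        # Delay doubles each attempt (1s, 2s, 4s, ...) plus random jitter.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        sleep(delay)
    raise RuntimeError(f"Gave up on {url} after {max_retries} retries")
```

Usage would look like `fetch_with_backoff(requests.get, "https://example.com/reviews")`. The jitter keeps parallel workers from retrying in lockstep.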
With SearchHive's ScrapeForge, proxy rotation and rate limiting are handled automatically. A Builder plan ($49/mo, 100K credits) lets you scrape 10,000+ review pages per month.
How Do I Store and Query Scraped Reviews?
For most use cases, a simple approach works:
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("reviews.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY,
        platform TEXT,
        product TEXT,
        rating REAL,
        text TEXT,
        sentiment TEXT,
        scraped_at TEXT
    )
""")

def save_review(platform, product, rating, text, sentiment):
    """Insert one review, stamped with the current UTC time."""
    conn.execute(
        "INSERT INTO reviews (platform, product, rating, text, sentiment, scraped_at) VALUES (?, ?, ?, ?, ?, ?)",
        (platform, product, rating, text, sentiment, datetime.now(timezone.utc).isoformat())
    )
    conn.commit()
For larger scale (100K+ reviews), use PostgreSQL with full-text search, or export to Parquet files for analysis with Pandas.
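The storage code covers inserts; querying is just as short. The sketch below uses the same table layout in an in-memory database and computes a per-product summary. The rows are made up and `product_summary` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute("""
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        platform TEXT, product TEXT, rating REAL,
        text TEXT, sentiment TEXT, scraped_at TEXT
    )
""")
# Made-up rows, for illustration only.
rows = [
    ("amazon", "widget", 5.0, "Great", "positive", "2025-01-01T00:00:00"),
    ("amazon", "widget", 2.0, "Broke fast", "negative", "2025-01-02T00:00:00"),
    ("g2", "widget", 4.0, "Solid", "positive", "2025-01-03T00:00:00"),
]
conn.executemany(
    "INSERT INTO reviews (platform, product, rating, text, sentiment, scraped_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

def product_summary(conn, product):
    """Return (review_count, average_rating, positive_share) for a product."""
    return conn.execute(
        """
        SELECT COUNT(*),
               AVG(rating),
               AVG(CASE WHEN sentiment = 'positive' THEN 1.0 ELSE 0.0 END)
        FROM reviews
        WHERE product = ?
        """,
        (product,),
    ).fetchone()

count, avg_rating, positive_share = product_summary(conn, "widget")
```

Adding `GROUP BY platform` to the query would give the same summary per review site.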
What Are the Alternatives to Scraping?
If you'd rather not scrape:
- Platform APIs: Amazon Product Advertising API, Google Places API, App Store Connect API. Limited data (Google Places, for example, returns only a handful of reviews per place) and strict rate limits.
- Third-party providers: ReviewTrackers, Reputation.com, Brandwatch. Expensive ($200-2,000/mo) but handle the scraping for you.
- SearchHive: ScrapeForge + SwiftSearch combination lets you collect review data without managing infrastructure.
Get Started
Ready to build your review scraping pipeline? SearchHive gives you 500 free credits to test ScrapeForge on review platforms. No credit card, no setup -- just an API key and a curl command.
Explore the documentation for advanced scraping options, or compare SearchHive against dedicated scraping tools for your use case.