Review Scraping and Analysis: Common Questions Answered

Review scraping extracts customer feedback from Amazon, Google, Trustpilot, G2, App Store, and other review platforms. Businesses use it for competitive intelligence, sentiment tracking, feature demand analysis, and product improvement.

This FAQ covers the legal, technical, and practical aspects of review scraping and analysis -- the questions developers and product teams ask most.

Key Takeaways

Review scraping is legal for public reviews in most jurisdictions, but terms of service vary by platform
Python with requests + BeautifulSoup handles static review pages; Playwright/Selenium for JavaScript-rendered sites
SearchHive's ScrapeForge API handles proxy rotation and JS rendering for review sites with anti-bot protection
Sentiment analysis pipelines typically combine keyword matching with transformer models
Expect 1,000-50,000 reviews per product on major platforms such as Amazon -- plan storage accordingly

Is Review Scraping Legal?

Generally yes, for publicly available reviews. In hiQ Labs v. LinkedIn (2022), the US Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). However:

Platform terms of service may prohibit scraping. This is a contract issue, not a criminal one.
Personal data (EU GDPR, CCPA) requires careful handling. Don't store personally identifiable information without consent.
Copyright applies to the review text. Using scraped reviews for commercial purposes (like displaying them on your site) may require licensing.

For internal analysis and competitive intelligence, scraping public reviews is widely practiced and legally defensible.

Which Platforms Can I Scrape Reviews From?

Most review platforms are technically scrapable, but difficulty varies:

Platform	Difficulty	Notes
Amazon	Hard	Aggressive anti-bot. Needs residential proxies and JS rendering.
Google Reviews	Medium	Maps API available but limited. Direct scraping needs proxy rotation.
Trustpilot	Easy	Clean HTML structure, minimal anti-bot.
G2 / Capterra	Easy	Well-structured pages, no significant blocking.
App Store / Google Play	Medium	Rate-limited. Use official APIs when possible.
Yelp	Hard	Aggressive blocking. Honor robots.txt.
TripAdvisor	Medium	Moderate anti-bot. Respect crawl rate.
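
The "honor robots.txt" and "respect crawl rate" notes in the table can be checked programmatically before any scraping starts. The sketch below uses Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the user agent name are hypothetical and inlined so the example runs offline; against a real site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""

USER_AGENT = "review-research-bot"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch() answers "may this user agent request this URL?"
allowed = parser.can_fetch(USER_AGENT, "https://example.com/reviews/widget")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/account/settings")

# crawl_delay() returns the Crawl-delay value in seconds, or None if none is set.
delay = parser.crawl_delay(USER_AGENT)
```

If crawl_delay() returns None, falling back to a fixed delay of your own is a reasonable default.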

How Do I Scrape Reviews from Amazon?

Amazon is the hardest major review platform to scrape. Here's the approach that works:

import requests

API_KEY = "your-searchhive-key"

def scrape_amazon_reviews(product_url):
    """Fetch a product page through ScrapeForge and split it into reviews."""
    resp = requests.get(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"url": product_url},
        timeout=60,
    )
    resp.raise_for_status()  # Fail on 4xx/5xx instead of parsing an error body
    content = resp.json().get("markdown", "")

    # Each "N out of 5 stars" line starts a new review; the non-empty
    # lines that follow it are collected as that review's text.
    reviews = []
    current_review = None
    for line in content.split("\n"):
        if "out of 5 stars" in line:
            if current_review:
                reviews.append(current_review)
            current_review = {"rating": line.strip(), "text": ""}
        elif current_review and line.strip():
            current_review["text"] += line + "\n"
    if current_review:
        reviews.append(current_review)
    return reviews

For Amazon specifically, ScrapeForge handles the proxy rotation and JavaScript rendering that Amazon requires. Direct scraping with requests alone gets blocked quickly.
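
The parser above keeps each rating as the raw line of text. To store or average ratings, that line needs to become a number. The helper below is a minimal sketch that assumes the English "N out of 5 stars" wording; parse_rating is a hypothetical name, and other locales would need their own pattern.

```python
import re
from typing import Optional

# Matches "4.0 out of 5 stars" or "5 out of 5 stars" anywhere in a line.
_RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s+out of 5 stars")


def parse_rating(line: str) -> Optional[float]:
    """Return the star rating found in a line, or None if there is none."""
    match = _RATING_RE.search(line)
    if match is None:
        return None
    value = float(match.group(1))
    # Discard values outside the 0-5 scale rather than storing bad data.
    return value if 0.0 <= value <= 5.0 else None


rating = parse_rating("4.0 out of 5 stars Great product")
```

Returning None instead of raising lets the caller decide whether to skip a review with an unreadable rating or keep it with a missing value.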

What About Google Reviews?

Google Reviews are attached to Google Maps listings, which are heavily JavaScript-rendered.

def scrape_google_reviews(place_url):
    # Reuses requests and API_KEY from the Amazon example
    resp = requests.get(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"url": place_url},
        timeout=60,
    )
    resp.raise_for_status()
    # The rendered page comes back as markdown; parsing out individual reviews is left to the caller
    return resp.json().get("markdown", "")

Alternatively, use SearchHive's DeepDive to research a business and get a summary of its review sentiment:

resp = requests.post(
    "https://api.searchhive.dev/v1/deepdive",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={"query": "Acme Corp reviews and ratings summary 2025", "depth": "shallow"}
)
summary = resp.json()

How Do I Perform Sentiment Analysis on Reviews?

Start simple with keyword-based classification, then upgrade to a transformer model.

Basic approach (no ML):

def basic_sentiment(text):
    text = text.lower()
    positive = ["great", "excellent", "love", "best", "amazing", "perfect"]
    negative = ["terrible", "worst", "hate", "awful", "waste", "broken"]
    pos = sum(1 for w in positive if w in text)
    neg = sum(1 for w in negative if w in text)
    if pos > neg:
        return "positive"
    elif neg > pos:
        return "negative"
    return "neutral"

Better approach (transformer model):

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def ml_sentiment(text):
    # Truncate by tokens (the model accepts at most 512), not by characters
    result = classifier(text, truncation=True, max_length=512)[0]
    return result["label"].lower()  # "positive" or "negative"

This DistilBERT checkpoint reports about 91% accuracy on the SST-2 benchmark, which consists of movie-review sentences. For domain-specific reviews (software, restaurants, electronics), fine-tuning on labeled in-domain data can improve accuracy, potentially to 95%+.
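
Benchmark accuracy may not carry over to your own reviews, so it is worth measuring on a small hand-labeled sample before relying on any classifier. The sketch below is one way to do that. The accuracy helper, the sample reviews, and toy_classifier are all hypothetical; in practice you would pass basic_sentiment or ml_sentiment as the classify argument.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], str],
             labeled: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (text, label) pairs for which classify(text) == label."""
    pairs = list(labeled)
    if not pairs:
        raise ValueError("need at least one labeled review")
    correct = sum(1 for text, label in pairs if classify(text) == label)
    return correct / len(pairs)


# Hypothetical hand-labeled sample; a real one should have a few hundred rows.
LABELED = [
    ("Great battery, love it", "positive"),
    ("Terrible screen, broken on arrival", "negative"),
    ("Works fine", "positive"),
    ("Awful support", "negative"),
    ("Stopped working after a week", "negative"),
]


def toy_classifier(text: str) -> str:
    # Stand-in for a real classifier: flags only two negative words.
    lowered = text.lower()
    return "negative" if any(w in lowered for w in ("terrible", "awful")) else "positive"


score = accuracy(toy_classifier, LABELED)  # 4 of 5 correct
```

The one miss ("Stopped working after a week") contains no listed keyword, which is the kind of case where a transformer model tends to do better than keyword matching.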

How Many Reviews Do I Need for Meaningful Analysis?

It depends on your goal:

Overall sentiment: 100+ reviews gives a reliable positive/negative ratio
Feature-level analysis: 500+ reviews to identify common feature mentions
Trend detection: 1,000+ reviews over time to spot rating shifts
Competitive comparison: 500+ reviews per competitor for statistical significance

Most Amazon products have 1,000-50,000 reviews. Software products on G2 typically have 50-500 reviews.
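
To see why sample size matters, it helps to put a confidence interval around the share of positive reviews. The sketch below uses the Wilson score interval at 95% confidence (z = 1.96), which is one common choice rather than the only valid one. It treats reviews as independent samples, which is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(positives: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the true share of positive reviews."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positives / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin, center + margin


low_100, high_100 = wilson_interval(80, 100)      # 80% positive from 100 reviews
low_1000, high_1000 = wilson_interval(800, 1000)  # 80% positive from 1,000 reviews
```

With 80 positive reviews out of 100, the interval is roughly 71% to 87%. With 800 out of 1,000 it narrows to roughly 77% to 82%, which is why trend detection and competitive comparisons call for larger samples.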

How Do I Extract Feature-Specific Feedback?

Use named entity recognition (NER) or keyword extraction to map reviews to features.

def extract_features(text, features):
    """Return the features mentioned in a review, keyed by feature name."""
    text_lower = text.lower()
    found = {}
    for feature in features:
        # Match on the feature's keywords, falling back to its name
        keywords = feature.get("keywords", [feature["name"]])
        if any(k.lower() in text_lower for k in keywords):
            found[feature["name"]] = {
                "mentioned": True,
                # Sentiment of the whole review, not just the sentence naming the feature
                "sentiment": basic_sentiment(text),
                "context": text[:200],  # First 200 characters of the review
            }
    return found

FEATURES = [
    {"name": "battery", "keywords": ["battery", "charge", "battery life"]},
    {"name": "display", "keywords": ["screen", "display", "resolution", "brightness"]},
    {"name": "performance", "keywords": ["speed", "fast", "slow", "lag", "performance"]},
]
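
Per-review results become useful once they are rolled up across many reviews. The sketch below counts sentiment labels per feature. summarize_features is a hypothetical helper, and the sample input is hand-written in the shape that extract_features returns.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def summarize_features(per_review: Iterable[Dict[str, dict]]) -> Dict[str, Dict[str, int]]:
    """Count sentiment labels per feature across many reviews."""
    summary = defaultdict(Counter)
    for found in per_review:
        for name, info in found.items():
            summary[name][info["sentiment"]] += 1
    return {name: dict(counts) for name, counts in summary.items()}


# Hypothetical results, one dict per review, shaped like extract_features output.
results = [
    {"battery": {"mentioned": True, "sentiment": "positive", "context": "..."}},
    {
        "battery": {"mentioned": True, "sentiment": "negative", "context": "..."},
        "display": {"mentioned": True, "sentiment": "negative", "context": "..."},
    },
    {},  # a review that mentions no tracked feature
]

summary = summarize_features(results)
```

Here summary maps "battery" to one positive and one negative mention, and "display" to one negative mention.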

Can I Scrape Reviews at Scale?

Yes, but you need to handle rate limiting and respect platform guidelines:

Respect robots.txt: Check the target site's crawl policies
Add delays: 1-3 second delays between requests minimum
Rotate proxies: Use residential proxies for sites with anti-bot measures
Batch processing: Scrape during off-peak hours to reduce server load
Error handling: Implement exponential backoff for 429/503 responses
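
The error-handling item above can be sketched as a small retry wrapper. This is a minimal illustration, not a production client: fetch_with_backoff and backoff_delay are hypothetical names, the base delay of 1 second and cap of 60 seconds are assumptions, and fetch is any zero-argument callable you supply that returns a (status_code, body) tuple.

```python
import random
import time
from typing import Callable, Tuple

RETRY_STATUSES = {429, 503}  # rate-limited or temporarily unavailable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> Tuple[int, str]:
    """Call fetch(), retrying on 429/503 with exponential backoff plus jitter."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    status, body = fetch()
    for attempt in range(max_retries):
        if status not in RETRY_STATUSES:
            break
        # Jitter (0-1 s) spreads retries out so parallel workers do not sync up.
        sleep(backoff_delay(attempt) + jitter())
        status, body = fetch()
    return status, body
```

Passing sleep and jitter as parameters keeps the function easy to test without real waiting. If retries run out, the last response is returned so the caller can log or skip that page.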

With SearchHive's ScrapeForge, proxy rotation and rate limiting are handled automatically. A Builder plan ($49/mo, 100K credits) lets you scrape 10,000+ review pages per month.

How Do I Store and Query Scraped Reviews?

For most use cases, a simple approach works:

import json
import sqlite3

conn = sqlite3.connect("reviews.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY,
        platform TEXT,
        product TEXT,
        rating REAL,
        text TEXT,
        sentiment TEXT,
        scraped_at TEXT
    )
""")

def save_review(platform, product, rating, text, sentiment):
    from datetime import datetime, timezone
    conn.execute(
        "INSERT INTO reviews (platform, product, rating, text, sentiment, scraped_at) VALUES (?, ?, ?, ?, ?, ?)",
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        (platform, product, rating, text, sentiment, datetime.now(timezone.utc).isoformat())
    )
    conn.commit()

For larger scale (100K+ reviews), use PostgreSQL with full-text search, or export to Parquet files for analysis with Pandas.
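
Querying the table is plain SQL. The sketch below uses an in-memory database with the same schema and a few made-up rows, then groups one product's reviews by sentiment. sentiment_breakdown and the sample data are hypothetical.

```python
import sqlite3

# In-memory database with the same schema, so the example leaves no file behind.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        platform TEXT,
        product TEXT,
        rating REAL,
        text TEXT,
        sentiment TEXT,
        scraped_at TEXT
    )
""")

# Made-up sample rows for illustration only.
demo_conn.executemany(
    "INSERT INTO reviews (platform, product, rating, text, sentiment, scraped_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("amazon", "widget", 5.0, "Great", "positive", "2025-01-01T00:00:00+00:00"),
        ("amazon", "widget", 4.0, "Love it", "positive", "2025-01-02T00:00:00+00:00"),
        ("trustpilot", "widget", 1.0, "Broken", "negative", "2025-01-03T00:00:00+00:00"),
        ("amazon", "gadget", 3.0, "It is ok", "neutral", "2025-01-04T00:00:00+00:00"),
    ],
)
demo_conn.commit()


def sentiment_breakdown(conn, product):
    """Return (sentiment, review_count, average_rating) rows for one product."""
    cursor = conn.execute(
        "SELECT sentiment, COUNT(*), AVG(rating) FROM reviews "
        "WHERE product = ? GROUP BY sentiment ORDER BY sentiment",
        (product,),
    )
    return cursor.fetchall()


breakdown = sentiment_breakdown(demo_conn, "widget")
```

For "widget" this returns one negative review averaging 1.0 stars and two positive reviews averaging 4.5 stars. If filtering by product or platform gets slow as the table grows, adding an index on those columns is the usual first step.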

What Are the Alternatives to Scraping?

If you'd rather not scrape:

Platform APIs: Amazon Product Advertising API, Google Places API, App Store Connect API. Limited data and strict rate limits.
Third-party providers: ReviewTrackers, Reputation.com, Brandwatch. Expensive ($200-2,000/mo) but handle the scraping for you.
SearchHive: ScrapeForge + SwiftSearch combination lets you collect review data without managing infrastructure.

Get Started

Ready to build your review scraping pipeline? SearchHive gives you 500 free credits to test ScrapeForge on review platforms. No credit card, no setup -- just an API key and a curl command.

Explore the documentation for advanced scraping options, or compare SearchHive against dedicated scraping tools for your use case.

