Product reviews are one of the most valuable data sources on the web. They drive purchase decisions, inform competitive analysis, and power sentiment analysis pipelines. But scraping reviews from sites like Amazon, Yelp, and G2 is notoriously difficult -- they all use aggressive anti-bot protection, JavaScript rendering, and paginated layouts that change frequently. This tutorial shows you how to scrape product reviews reliably using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Amazon, Yelp, and G2 all block basic HTTP requests -- you need residential proxies and JS rendering
- SearchHive's ScrapeForge API handles anti-bot protection, CAPTCHAs, and proxy rotation automatically
- Each site has a different HTML structure, so you need source-specific parsers
- Store reviews in a structured format (parquet or SQLite) for analysis
- Respect robots.txt and rate limits -- scrape during off-peak hours and add delays
Prerequisites
- Python 3.8+
- pip install requests beautifulsoup4 lxml pandas pyarrow
- A SearchHive API key (free at searchhive.dev)
- A product URL or search query for Amazon, Yelp, or G2
Step 1: Set Up the SearchHive Scraper
All three review platforms block requests from data center IPs and require JavaScript rendering. SearchHive's ScrapeForge handles both:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import json
import time
import re
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape_page(url, wait_for=None):
    """Scrape a page with JS rendering and anti-bot bypass via SearchHive."""
    payload = {
        "url": url,
        "render_js": True,
        "follow_redirects": True,
        "extract_links": True
    }
    if wait_for:
        payload["wait_for"] = wait_for
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json=payload
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")
This single function works for all three platforms because ScrapeForge handles the platform-specific anti-bot measures (Cloudflare, reCAPTCHA, DataDome) at the infrastructure level.
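Even with anti-bot handling, an occasional response can still be a CAPTCHA or interstitial page rather than real content. A quick heuristic check before parsing can catch this; the marker phrases below are illustrative guesses, not an official ScrapeForge feature:

```python
from bs4 import BeautifulSoup

# Phrases commonly seen on block/CAPTCHA pages (a heuristic, not exhaustive)
BLOCK_MARKERS = [
    "enter the characters you see below",
    "verify you are a human",
    "access denied",
]

def looks_blocked(soup):
    """Return True if the parsed page resembles a CAPTCHA or block page."""
    text = soup.get_text(" ", strip=True).lower()
    return any(marker in text for marker in BLOCK_MARKERS)

blocked = BeautifulSoup("<html><body>Enter the characters you see below</body></html>", "html.parser")
normal = BeautifulSoup("<html><body><div data-hook='review'>Great!</div></body></html>", "html.parser")
print(looks_blocked(blocked), looks_blocked(normal))  # True False
```

Calling looks_blocked() on each scrape_page() result lets you skip or retry a page instead of silently parsing zero reviews.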
Step 2: Scrape Amazon Reviews
Amazon review pages are JavaScript-rendered and behind aggressive bot protection. The review content loads dynamically, so a basic HTTP request returns an empty container:
def scrape_amazon_reviews(product_url, max_pages=5):
    """Scrape reviews from an Amazon product page."""
    reviews = []
    for page in range(1, max_pages + 1):
        # Amazon paginates reviews with the pageNumber query parameter
        if page == 1:
            url = product_url
        else:
            url = f"{product_url}?reviewerType=all_reviews&pageNumber={page}"
        try:
            soup = scrape_page(url, wait_for="[data-hook='review']")
            review_elements = soup.select("[data-hook='review']")
            if not review_elements:
                print(f"  Page {page}: No reviews found (may be blocked or no more reviews)")
                break
            for el in review_elements:
                rating_el = el.select_one("[data-hook='review-star-rating'], [data-hook='cmps-review-star-rating']")
                rating = 0.0  # default to 0, not 5, so missing ratings don't skew averages
                if rating_el:
                    match = re.search(r"([\d.]+)", rating_el.text)
                    if match:
                        rating = float(match.group(1))
                date_el = el.select_one("[data-hook='review-date']")
                date_str = date_el.text.strip() if date_el else ""
                body_el = el.select_one("[data-hook='review-body'] span")
                body = body_el.text.strip() if body_el else ""
                title_el = el.select_one("[data-hook='review-title'] span:not([class])")
                title = title_el.text.strip() if title_el else ""
                author_el = el.select_one(".a-profile-name")
                author = author_el.text.strip() if author_el else ""
                helpful_el = el.select_one("[data-hook='helpful-vote-statement']")
                helpful = helpful_el.text.strip() if helpful_el else "0"
                reviews.append({
                    "platform": "amazon",
                    "rating": rating,
                    "title": title,
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": helpful,
                    "product_url": product_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)  # Be polite between pages
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
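The helpful_votes field above is kept as raw text like "12 people found this helpful". If you want it numeric for analysis, a small best-effort normalizer (a hypothetical helper, not part of any Amazon API) does the job:

```python
import re

def parse_helpful_votes(text):
    """Convert an Amazon helpful-vote statement into an int (best-effort)."""
    if not text:
        return 0
    text = text.replace(",", "")  # "1,024 people..." -> "1024 people..."
    if text.lower().startswith("one person"):
        return 1  # Amazon spells out the singular case
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else 0

print(parse_helpful_votes("1,024 people found this helpful"))  # 1024
print(parse_helpful_votes("One person found this helpful"))    # 1
```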
Step 3: Scrape Yelp Reviews
Yelp uses a different review structure but is similarly protected. Reviews load in a grid layout with rich structured data:
def scrape_yelp_reviews(business_url, max_pages=5):
    """Scrape reviews from a Yelp business page."""
    reviews = []
    for page in range(1, max_pages + 1):
        url = f"{business_url}?start={(page - 1) * 20}"
        try:
            soup = scrape_page(url, wait_for=".review, [data-review-id]")
            review_elements = soup.select(".review, [data-review-id]")
            if not review_elements:
                print(f"  Page {page}: No reviews found")
                break
            for el in review_elements:
                rating_el = el.select_one("[aria-label*='star'], .i-stars")
                rating = 0
                if rating_el:
                    match = re.search(r"([\d.]+)", rating_el.get("aria-label", rating_el.get("title", "")))
                    if match:
                        rating = float(match.group(1))
                # Yelp star ratings can also be encoded in a CSS class like stars_4
                if rating == 0:
                    class_str = " ".join(rating_el.get("class", [])) if rating_el else ""
                    match = re.search(r"stars_(\d)", class_str)
                    if match:
                        rating = float(match.group(1))
                body_el = el.select_one(".review__text, [lang]")
                body = body_el.text.strip() if body_el else ""
                author_el = el.select_one(".review__user-info a, .user-passport-info a")
                author = author_el.text.strip() if author_el else ""
                date_el = el.select_one("time")
                date_str = date_el.text.strip() if date_el else ""
                reviews.append({
                    "platform": "yelp",
                    "rating": rating,
                    "title": "",
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": "",
                    "product_url": business_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
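The two-step rating extraction above (aria-label first, then a stars_N CSS class) can be exercised offline against static snippets before burning API credits on a full scrape. The markup below is illustrative and not guaranteed to match Yelp's live DOM:

```python
import re
from bs4 import BeautifulSoup

def extract_yelp_rating(el):
    """Mirror the fallback logic: aria-label/title first, then a stars_N class."""
    rating = 0.0
    match = re.search(r"([\d.]+)", el.get("aria-label", el.get("title", "")))
    if match:
        rating = float(match.group(1))
    if rating == 0:
        # Fall back to a rating encoded in the class list, e.g. "stars_3"
        match = re.search(r"stars_(\d)", " ".join(el.get("class", [])))
        if match:
            rating = float(match.group(1))
    return rating

labeled = BeautifulSoup('<span class="i-stars" aria-label="4.5 star rating"></span>', "html.parser").span
classed = BeautifulSoup('<span class="i-stars stars_3"></span>', "html.parser").span
print(extract_yelp_rating(labeled), extract_yelp_rating(classed))  # 4.5 3.0
```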
Step 4: Scrape G2 Reviews
G2 (formerly G2 Crowd) is a software review platform with a different pagination model. Reviews are loaded dynamically and paginated with offset parameters:
def scrape_g2_reviews(product_url, max_pages=5):
    """Scrape reviews from a G2 product page."""
    reviews = []
    for page in range(1, max_pages + 1):
        url = f"{product_url}/reviews?offset={(page - 1) * 10}"
        try:
            soup = scrape_page(url, wait_for=".review-item, [data-testid]")
            review_elements = soup.select(".review-item, [data-testid='review-card']")
            if not review_elements:
                # Try alternate selectors
                review_elements = soup.select("div[class*='review']")
            if not review_elements:
                print(f"  Page {page}: No reviews found")
                break
            for el in review_elements:
                # G2 shows individual star ratings for multiple criteria
                rating_els = el.select("span[class*='star']")
                overall_rating = 0
                if rating_els:
                    match = re.search(r"([\d.]+)", rating_els[0].text)
                    if match:
                        overall_rating = float(match.group(1))
                title_el = el.select_one("h3, [class*='title']")
                title = title_el.text.strip() if title_el else ""
                body_el = el.select_one("p, [class*='body'], [class*='content']")
                body = body_el.text.strip() if body_el else ""
                author_el = el.select_one("[class*='author'], [class*='user']")
                author = author_el.text.strip() if author_el else ""
                date_el = el.select_one("time, [class*='date']")
                date_str = date_el.text.strip() if date_el else ""
                reviews.append({
                    "platform": "g2",
                    "rating": overall_rating,
                    "title": title,
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": "",
                    "product_url": product_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
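All three parsers store the review date as a display string. If your analysis needs real timestamps, a best-effort normalizer can try a few common formats; the format list below is an assumption you should adjust to what each site actually renders:

```python
from datetime import datetime

# Display formats the sites are assumed to use -- extend as needed
DATE_FORMATS = ["%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%m/%d/%Y"]

def normalize_date(date_str):
    """Try several display formats; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # relative dates like "last week" are left unresolved

print(normalize_date("Jan 5, 2024"))  # 2024-01-05
print(normalize_date("last week"))    # None
```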
Step 5: Normalize and Store Reviews
All three parsers output the same schema. Combine them into a single dataset:
import pandas as pd
def save_reviews(reviews, output_path="reviews.parquet"):
    """Save reviews to parquet for analysis."""
    if not reviews:
        print("No reviews to save")
        return
    df = pd.DataFrame(reviews)
    df.to_parquet(output_path, index=False, engine="pyarrow")
    print(f"Saved {len(df)} reviews to {output_path}")
    return df

def analyze_reviews(df):
    """Basic review analysis."""
    print("\n=== Review Analysis ===")
    print(f"Total reviews: {len(df)}")
    print("\nBy platform:")
    print(df.groupby("platform")["rating"].agg(["count", "mean"]).round(2))
    print("\nRating distribution:")
    print(df["rating"].value_counts().sort_index(ascending=False))
    # Average review length by platform
    df["body_length"] = df["body"].str.len()
    print("\nAverage review length by platform:")
    print(df.groupby("platform")["body_length"].mean().round(0))
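The Key Takeaways mention SQLite as an alternative store. If parquet/pyarrow isn't available in your environment, the same records fit in a table using only the standard library; the schema below is a sketch matching the parser output:

```python
import sqlite3

def save_reviews_sqlite(reviews, db_path="reviews.db"):
    """Store review dicts in a SQLite table; returns the table's row count."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS reviews (
        platform TEXT, rating REAL, title TEXT, body TEXT,
        author TEXT, date TEXT, helpful_votes TEXT,
        product_url TEXT, scraped_at TEXT)""")
    # Named placeholders map directly onto the review dict keys
    conn.executemany(
        """INSERT INTO reviews VALUES (:platform, :rating, :title, :body,
           :author, :date, :helpful_votes, :product_url, :scraped_at)""",
        reviews
    )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    conn.close()
    return count

sample = [{"platform": "amazon", "rating": 4.0, "title": "Good", "body": "Works well",
           "author": "A", "date": "Jan 1, 2024", "helpful_votes": "3",
           "product_url": "https://example.com", "scraped_at": "2024-01-02T00:00:00"}]
print(save_reviews_sqlite(sample, ":memory:"))  # 1
```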
Step 6: Run the Full Pipeline
# Configuration -- replace with real product URLs
TARGETS = {
    "amazon": "https://www.amazon.com/product-reviews/PRODUCT_ID_HERE",
    "yelp": "https://www.yelp.com/biz/your-business-name",
    "g2": "https://www.g2.com/products/PRODUCT_NAME/reviews"
}

all_reviews = []
for platform, url in TARGETS.items():
    print(f"\nScraping {platform} reviews from {url}")
    if platform == "amazon":
        reviews = scrape_amazon_reviews(url, max_pages=3)
    elif platform == "yelp":
        reviews = scrape_yelp_reviews(url, max_pages=3)
    elif platform == "g2":
        reviews = scrape_g2_reviews(url, max_pages=3)
    else:
        reviews = []
    all_reviews.extend(reviews)
    print(f"  Total: {len(reviews)} reviews")

# Save and analyze
df = save_reviews(all_reviews, "all_reviews.parquet")
if df is not None:
    analyze_reviews(df)
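Re-running the pipeline appends the same reviews again, so it's worth deduplicating before analysis. A simple pandas dedup keyed on platform + author + body (one reasonable key choice, not the only one):

```python
import pandas as pd

def dedupe_reviews(df):
    """Drop exact repeats of the same review, keeping the first scrape."""
    return df.drop_duplicates(subset=["platform", "author", "body"], keep="first")

df = pd.DataFrame([
    {"platform": "amazon", "author": "A", "body": "Great", "scraped_at": "t1"},
    {"platform": "amazon", "author": "A", "body": "Great", "scraped_at": "t2"},
    {"platform": "yelp", "author": "B", "body": "Great", "scraped_at": "t1"},
])
print(len(dedupe_reviews(df)))  # 2
```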
Complete Code Example
Here is the full self-contained script:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd
import re
import time
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape(url, wait_for=None):
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "wait_for": wait_for, "follow_redirects": True}
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")

def parse_rating(el):
    if not el:
        return 0.0
    match = re.search(r"([\d.]+)", el.text or el.get("aria-label", ""))
    return float(match.group(1)) if match else 0.0

def scrape_reviews(platform, url, max_pages=3):
    reviews = []
    for page in range(1, max_pages + 1):
        try:
            soup = scrape(url + (f"?page={page}" if page > 1 else ""), wait_for=".review, [data-hook='review']")
            for el in soup.select(".review, [data-hook='review'], [data-review-id]"):
                reviews.append({
                    "platform": platform,
                    "rating": parse_rating(el.select_one("[class*='star'], [aria-label*='star']")),
                    "body": (el.select_one("[data-hook='review-body'] span, .review__text, p") or el).text.strip()[:1000],
                    "author": (el.select_one(".a-profile-name, .review__user-info a") or el).text.strip()[:50],
                    "date": (el.select_one("[data-hook='review-date'], time") or el).text.strip()[:30],
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  {platform} page {page}: {len(reviews)} total")
            time.sleep(2)
        except Exception as e:
            print(f"  {platform} page {page}: {e}")
            break
    return reviews

# Run for all platforms
all_reviews = []
for p, u in {"amazon": "https://amazon.com/dp/ID", "yelp": "https://yelp.com/biz/ID", "g2": "https://g2.com/products/ID/reviews"}.items():
    all_reviews.extend(scrape_reviews(p, u))

df = pd.DataFrame(all_reviews)
df.to_parquet("reviews.parquet", index=False)
print(f"\nTotal: {len(df)} reviews across {df['platform'].nunique()} platforms")
print(df.groupby("platform")["rating"].agg(["count", "mean"]).round(2))
Common Issues
1. Amazon blocks all scraping attempts. Amazon is one of the hardest sites to scrape. SearchHive's residential proxies help, but you may still get CAPTCHA pages. Reduce your request frequency and use wait_for to ensure content loads before extraction.
2. Yelp shows different content to bots. Yelp serves different layouts to logged-in vs. non-logged-in users and may detect automated access. ScrapeForge's anti-detection helps, but results may vary between runs.
3. G2 uses heavy JavaScript with infinite scroll. G2 loads reviews dynamically with offset-based pagination. The wait_for selector ensures reviews are present before parsing. If parsing fails, try adjusting the selector to match G2's current DOM structure.
4. Review text is truncated. Some platforms show truncated reviews with a "Read more" link. SearchHive does not click links, so you get the visible text only. For full review text, you may need to scrape individual review pages separately.
5. Rate limiting and IP bans. All three platforms rate-limit aggressively. Stick to 2-3 second delays between pages and limit to 3-5 pages per run. SearchHive's proxy rotation prevents IP bans, but the platforms may still throttle your access.
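For issue 5, a retry wrapper with exponential backoff around scrape_page keeps a transient throttle from killing the whole run. This is a generic sketch; the attempt count and delay values are arbitrary starting points, not SearchHive recommendations:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2^n and retry, re-raising at the end."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a flaky function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(with_retries(flaky, attempts=4, base_delay=0.01))  # ok
```

In the pipeline you would wrap each page fetch, e.g. `soup = with_retries(lambda: scrape_page(url, wait_for="[data-hook='review']"))`.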
Next Steps
- How to Extract Structured Data from HTML with Python -- deeper parsing techniques
- How to Build a Product Comparison Tool with Web Scraping -- combine review data with product data
- SearchHive ScrapeForge API docs -- full parameter reference and advanced options
Start scraping reviews with SearchHive's free tier -- 500 credits to experiment, no credit card needed.