Product reviews are one of the most valuable data sources on the web. They drive purchase decisions, inform competitive analysis, and power sentiment analysis pipelines. But scraping reviews from sites like Amazon, Yelp, and G2 is notoriously difficult -- they all use aggressive anti-bot protection, JavaScript rendering, and paginated layouts that change frequently. This tutorial shows you how to scrape product reviews reliably using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Amazon, Yelp, and G2 all block basic HTTP requests -- you need residential proxies and JS rendering
- SearchHive's ScrapeForge API handles anti-bot protection, CAPTCHAs, and proxy rotation automatically
- Each site has a different HTML structure, so you need source-specific parsers
- Store reviews in a structured format (parquet or SQLite) for analysis
- Respect robots.txt and rate limits -- scrape during off-peak hours and add delays
Prerequisites
- Python 3.8+
- pip install requests beautifulsoup4 lxml pandas pyarrow
- A SearchHive API key (free at searchhive.dev)
- A product URL or search query for Amazon, Yelp, or G2
Step 1: Set Up the SearchHive Scraper
All three review platforms block requests from data center IPs and require JavaScript rendering. SearchHive's ScrapeForge handles both:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import json
import time
import re
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape_page(url, wait_for=None):
    """Scrape a page with JS rendering and anti-bot bypass via SearchHive."""
    payload = {
        "url": url,
        "render_js": True,
        "follow_redirects": True,
        "extract_links": True
    }
    if wait_for:
        payload["wait_for"] = wait_for
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json=payload
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")
This single function works for all three platforms because ScrapeForge handles the platform-specific anti-bot measures (Cloudflare, reCAPTCHA, DataDome) at the infrastructure level.
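Even with anti-bot handling, an occasional response can still be a CAPTCHA or interstitial page rather than real content. A quick heuristic check before parsing can catch this; the marker phrases below are illustrative guesses, not an official ScrapeForge feature:

```python
from bs4 import BeautifulSoup

# Phrases commonly seen on block/CAPTCHA pages (a heuristic, not exhaustive)
BLOCK_MARKERS = [
    "enter the characters you see below",
    "verify you are a human",
    "access denied",
]

def looks_blocked(soup):
    """Return True if the parsed page resembles a CAPTCHA or block page."""
    text = soup.get_text(" ", strip=True).lower()
    return any(marker in text for marker in BLOCK_MARKERS)

blocked = BeautifulSoup("<html><body>Enter the characters you see below</body></html>", "html.parser")
normal = BeautifulSoup("<html><body><div data-hook='review'>Great!</div></body></html>", "html.parser")
print(looks_blocked(blocked), looks_blocked(normal))  # True False
```

Calling looks_blocked() on each scrape_page() result lets you skip or retry a page instead of silently parsing zero reviews.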
Step 2: Scrape Amazon Reviews
Amazon review pages are JavaScript-rendered and behind aggressive bot protection. The review content loads dynamically, so a basic HTTP request returns an empty container:
def scrape_amazon_reviews(product_url, max_pages=5):
    """Scrape reviews from an Amazon product page."""
    reviews = []
    for page in range(1, max_pages + 1):
        # Amazon paginates reviews with the pageNumber query parameter
        if page == 1:
            url = product_url
        else:
            url = f"{product_url}?reviewerType=all_reviews&pageNumber={page}"
        try:
            soup = scrape_page(url, wait_for="[data-hook='review']")
            review_elements = soup.select("[data-hook='review']")
            if not review_elements:
                print(f"  Page {page}: No reviews found (may be blocked or no more reviews)")
                break
            for el in review_elements:
                rating_el = el.select_one("[data-hook='review-star-rating'], [data-hook='cmps-review-star-rating']")
                rating = 0.0  # default to 0, not 5, so missing ratings don't skew averages
                if rating_el:
                    match = re.search(r"([\d.]+)", rating_el.text)
                    if match:
                        rating = float(match.group(1))
                date_el = el.select_one("[data-hook='review-date']")
                date_str = date_el.text.strip() if date_el else ""
                body_el = el.select_one("[data-hook='review-body'] span")
                body = body_el.text.strip() if body_el else ""
                title_el = el.select_one("[data-hook='review-title'] span:not([class])")
                title = title_el.text.strip() if title_el else ""
                author_el = el.select_one(".a-profile-name")
                author = author_el.text.strip() if author_el else ""
                helpful_el = el.select_one("[data-hook='helpful-vote-statement']")
                helpful = helpful_el.text.strip() if helpful_el else "0"
                reviews.append({
                    "platform": "amazon",
                    "rating": rating,
                    "title": title,
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": helpful,
                    "product_url": product_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)  # Be polite between pages
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
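The helpful_votes field above is kept as raw text like "12 people found this helpful". If you want it numeric for analysis, a small best-effort normalizer (a hypothetical helper, not part of any Amazon API) does the job:

```python
import re

def parse_helpful_votes(text):
    """Convert an Amazon helpful-vote statement into an int (best-effort)."""
    if not text:
        return 0
    text = text.replace(",", "")  # "1,024 people..." -> "1024 people..."
    if text.lower().startswith("one person"):
        return 1  # Amazon spells out the singular case
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else 0

print(parse_helpful_votes("1,024 people found this helpful"))  # 1024
print(parse_helpful_votes("One person found this helpful"))    # 1
```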
Step 3: Scrape Yelp Reviews
Yelp uses a different review structure but is similarly protected. Reviews load in a grid layout with rich structured data:
def scrape_yelp_reviews(business_url, max_pages=5):
    """Scrape reviews from a Yelp business page."""
    reviews = []
    for page in range(1, max_pages + 1):
        url = f"{business_url}?start={(page - 1) * 20}"
        try:
            soup = scrape_page(url, wait_for=".review, [data-review-id]")
            review_elements = soup.select(".review, [data-review-id]")
            if not review_elements:
                print(f"  Page {page}: No reviews found")
                break
            for el in review_elements:
                rating_el = el.select_one("[aria-label*='star'], .i-stars")
                rating = 0
                if rating_el:
                    match = re.search(r"([\d.]+)", rating_el.get("aria-label", rating_el.get("title", "")))
                    if match:
                        rating = float(match.group(1))
                # Yelp star ratings can also be encoded in a CSS class like stars_4
                if rating == 0:
                    class_str = " ".join(rating_el.get("class", [])) if rating_el else ""
                    match = re.search(r"stars_(\d)", class_str)
                    if match:
                        rating = float(match.group(1))
                body_el = el.select_one(".review__text, [lang]")
                body = body_el.text.strip() if body_el else ""
                author_el = el.select_one(".review__user-info a, .user-passport-info a")
                author = author_el.text.strip() if author_el else ""
                date_el = el.select_one("time")
                date_str = date_el.text.strip() if date_el else ""
                reviews.append({
                    "platform": "yelp",
                    "rating": rating,
                    "title": "",
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": "",
                    "product_url": business_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
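The two-step rating extraction above (aria-label first, then a stars_N CSS class) can be exercised offline against static snippets before burning API credits on a full scrape. The markup below is illustrative and not guaranteed to match Yelp's live DOM:

```python
import re
from bs4 import BeautifulSoup

def extract_yelp_rating(el):
    """Mirror the fallback logic: aria-label/title first, then a stars_N class."""
    rating = 0.0
    match = re.search(r"([\d.]+)", el.get("aria-label", el.get("title", "")))
    if match:
        rating = float(match.group(1))
    if rating == 0:
        # Fall back to a rating encoded in the class list, e.g. "stars_3"
        match = re.search(r"stars_(\d)", " ".join(el.get("class", [])))
        if match:
            rating = float(match.group(1))
    return rating

labeled = BeautifulSoup('<span class="i-stars" aria-label="4.5 star rating"></span>', "html.parser").span
classed = BeautifulSoup('<span class="i-stars stars_3"></span>', "html.parser").span
print(extract_yelp_rating(labeled), extract_yelp_rating(classed))  # 4.5 3.0
```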
Step 4: Scrape G2 Reviews
G2 (formerly G2 Crowd) is a software review platform with a different pagination model. Reviews are loaded dynamically and paginated with offset parameters:
def scrape_g2_reviews(product_url, max_pages=5):
    """Scrape reviews from a G2 product page."""
    reviews = []
    for page in range(1, max_pages + 1):
        url = f"{product_url}/reviews?offset={(page - 1) * 10}"
        try:
            soup = scrape_page(url, wait_for=".review-item, [data-testid]")
            review_elements = soup.select(".review-item, [data-testid='review-card']")
            if not review_elements:
                # Try alternate selectors
                review_elements = soup.select("div[class*='review']")
            if not review_elements:
                print(f"  Page {page}: No reviews found")
                break
            for el in review_elements:
                # G2 shows individual star ratings for multiple criteria
                rating_els = el.select("span[class*='star']")
                overall_rating = 0
                if rating_els:
                    match = re.search(r"([\d.]+)", rating_els[0].text)
                    if match:
                        overall_rating = float(match.group(1))
                title_el = el.select_one("h3, [class*='title']")
                title = title_el.text.strip() if title_el else ""
                body_el = el.select_one("p, [class*='body'], [class*='content']")
                body = body_el.text.strip() if body_el else ""
                author_el = el.select_one("[class*='author'], [class*='user']")
                author = author_el.text.strip() if author_el else ""
                date_el = el.select_one("time, [class*='date']")
                date_str = date_el.text.strip() if date_el else ""
                reviews.append({
                    "platform": "g2",
                    "rating": overall_rating,
                    "title": title,
                    "body": body,
                    "author": author,
                    "date": date_str,
                    "helpful_votes": "",
                    "product_url": product_url,
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  Page {page}: {len(review_elements)} reviews")
            time.sleep(2)
        except Exception as e:
            print(f"  Page {page}: Error - {e}")
            break
    return reviews
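All three parsers store the review date as a display string. If your analysis needs real timestamps, a best-effort normalizer can try a few common formats; the format list below is an assumption you should adjust to what each site actually renders:

```python
from datetime import datetime

# Display formats the sites are assumed to use -- extend as needed
DATE_FORMATS = ["%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%m/%d/%Y"]

def normalize_date(date_str):
    """Try several display formats; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # relative dates like "last week" are left unresolved

print(normalize_date("Jan 5, 2024"))  # 2024-01-05
print(normalize_date("last week"))    # None
```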
Step 5: Normalize and Store Reviews
All three parsers output the same schema. Combine them into a single dataset:
import pandas as pd
def save_reviews(reviews, output_path="reviews.parquet"):
    """Save reviews to parquet for analysis."""
    if not reviews:
        print("No reviews to save")
        return
    df = pd.DataFrame(reviews)
    df.to_parquet(output_path, index=False, engine="pyarrow")
    print(f"Saved {len(df)} reviews to {output_path}")
    return df

def analyze_reviews(df):
    """Basic review analysis."""
    print("\n=== Review Analysis ===")
    print(f"Total reviews: {len(df)}")
    print("\nBy platform:")
    print(df.groupby("platform")["rating"].agg(["count", "mean"]).round(2))
    print("\nRating distribution:")
    print(df["rating"].value_counts().sort_index(ascending=False))
    # Average review length by platform
    df["body_length"] = df["body"].str.len()
    print("\nAverage review length by platform:")
    print(df.groupby("platform")["body_length"].mean().round(0))
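The Key Takeaways mention SQLite as an alternative store. If parquet/pyarrow isn't available in your environment, the same records fit in a table using only the standard library; the schema below is a sketch matching the parser output:

```python
import sqlite3

def save_reviews_sqlite(reviews, db_path="reviews.db"):
    """Store review dicts in a SQLite table; returns the table's row count."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS reviews (
        platform TEXT, rating REAL, title TEXT, body TEXT,
        author TEXT, date TEXT, helpful_votes TEXT,
        product_url TEXT, scraped_at TEXT)""")
    # Named placeholders map directly onto the review dict keys
    conn.executemany(
        """INSERT INTO reviews VALUES (:platform, :rating, :title, :body,
           :author, :date, :helpful_votes, :product_url, :scraped_at)""",
        reviews
    )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    conn.close()
    return count

sample = [{"platform": "amazon", "rating": 4.0, "title": "Good", "body": "Works well",
           "author": "A", "date": "Jan 1, 2024", "helpful_votes": "3",
           "product_url": "https://example.com", "scraped_at": "2024-01-02T00:00:00"}]
print(save_reviews_sqlite(sample, ":memory:"))  # 1
```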
Step 6: Run the Full Pipeline
# Configuration -- replace with real product URLs
TARGETS = {
    "amazon": "https://www.amazon.com/product-reviews/PRODUCT_ID_HERE",
    "yelp": "https://www.yelp.com/biz/your-business-name",
    "g2": "https://www.g2.com/products/PRODUCT_NAME/reviews"
}

all_reviews = []
for platform, url in TARGETS.items():
    print(f"\nScraping {platform} reviews from {url}")
    if platform == "amazon":
        reviews = scrape_amazon_reviews(url, max_pages=3)
    elif platform == "yelp":
        reviews = scrape_yelp_reviews(url, max_pages=3)
    elif platform == "g2":
        reviews = scrape_g2_reviews(url, max_pages=3)
    else:
        reviews = []
    all_reviews.extend(reviews)
    print(f"  Total: {len(reviews)} reviews")

# Save and analyze
df = save_reviews(all_reviews, "all_reviews.parquet")
if df is not None:
    analyze_reviews(df)
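Re-running the pipeline appends the same reviews again, so it's worth deduplicating before analysis. A simple pandas dedup keyed on platform + author + body (one reasonable key choice, not the only one):

```python
import pandas as pd

def dedupe_reviews(df):
    """Drop exact repeats of the same review, keeping the first scrape."""
    return df.drop_duplicates(subset=["platform", "author", "body"], keep="first")

df = pd.DataFrame([
    {"platform": "amazon", "author": "A", "body": "Great", "scraped_at": "t1"},
    {"platform": "amazon", "author": "A", "body": "Great", "scraped_at": "t2"},
    {"platform": "yelp", "author": "B", "body": "Great", "scraped_at": "t1"},
])
print(len(dedupe_reviews(df)))  # 2
```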
Complete Code Example
Here is the full self-contained script:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd
import re
import time
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape(url, wait_for=None):
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "wait_for": wait_for, "follow_redirects": True}
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")

def parse_rating(el):
    if not el:
        return 0.0
    match = re.search(r"([\d.]+)", el.text or el.get("aria-label", ""))
    return float(match.group(1)) if match else 0.0

def scrape_reviews(platform, url, max_pages=3):
    reviews = []
    for page in range(1, max_pages + 1):
        try:
            soup = scrape(url + (f"?page={page}" if page > 1 else ""), wait_for=".review, [data-hook='review']")
            for el in soup.select(".review, [data-hook='review'], [data-review-id]"):
                reviews.append({
                    "platform": platform,
                    "rating": parse_rating(el.select_one("[class*='star'], [aria-label*='star']")),
                    "body": (el.select_one("[data-hook='review-body'] span, .review__text, p") or el).text.strip()[:1000],
                    "author": (el.select_one(".a-profile-name, .review__user-info a") or el).text.strip()[:50],
                    "date": (el.select_one("[data-hook='review-date'], time") or el).text.strip()[:30],
                    "scraped_at": datetime.now().isoformat()
                })
            print(f"  {platform} page {page}: {len(reviews)} total")
            time.sleep(2)
        except Exception as e:
            print(f"  {platform} page {page}: {e}")
            break
    return reviews

# Run for all platforms
all_reviews = []
for p, u in {"amazon": "https://amazon.com/dp/ID", "yelp": "https://yelp.com/biz/ID", "g2": "https://g2.com/products/ID/reviews"}.items():
    all_reviews.extend(scrape_reviews(p, u))

df = pd.DataFrame(all_reviews)
df.to_parquet("reviews.parquet", index=False)
print(f"\nTotal: {len(df)} reviews across {df['platform'].nunique()} platforms")
print(df.groupby("platform")["rating"].agg(["count", "mean"]).round(2))
Common Issues
1. Amazon blocks all scraping attempts. Amazon is one of the hardest sites to scrape. SearchHive's residential proxies help, but you may still get CAPTCHA pages. Reduce your request frequency and use wait_for to ensure content loads before extraction.
2. Yelp shows different content to bots. Yelp serves different layouts to logged-in vs. non-logged-in users and may detect automated access. ScrapeForge's anti-detection helps, but results may vary between runs.
3. G2 uses heavy JavaScript with infinite scroll. G2 loads reviews dynamically with offset-based pagination. The wait_for selector ensures reviews are present before parsing. If parsing fails, try adjusting the selector to match G2's current DOM structure.
4. Review text is truncated. Some platforms show truncated reviews with a "Read more" link. SearchHive does not click links, so you get the visible text only. For full review text, you may need to scrape individual review pages separately.
5. Rate limiting and IP bans. All three platforms rate-limit aggressively. Stick to 2-3 second delays between pages and limit to 3-5 pages per run. SearchHive's proxy rotation prevents IP bans, but the platforms may still throttle your access.
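For issue 5, a retry wrapper with exponential backoff around scrape_page keeps a transient throttle from killing the whole run. This is a generic sketch; the attempt count and delay values are arbitrary starting points, not SearchHive recommendations:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2^n and retry, re-raising at the end."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a flaky function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(with_retries(flaky, attempts=4, base_delay=0.01))  # ok
```

In the pipeline you would wrap each page fetch, e.g. `soup = with_retries(lambda: scrape_page(url, wait_for="[data-hook='review']"))`.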
Next Steps
- How to Extract Structured Data from HTML with Python -- deeper parsing techniques
- How to Build a Product Comparison Tool with Web Scraping -- combine review data with product data
- SearchHive ScrapeForge API docs -- full parameter reference and advanced options
Start scraping reviews with SearchHive's free tier -- 500 credits to experiment, no credit card needed.