A product comparison tool aggregates pricing, features, and reviews from multiple sources into a single view. Whether you are building a price comparison engine, a SaaS directory, or a competitive intelligence dashboard, the core workflow is the same: scrape product pages from several sites, normalize the data, and present it side by side. This tutorial walks through building one with Python and SearchHive.
Key Takeaways
- Product comparison tools need three layers: scraping, normalization, and presentation
- SearchHive's ScrapeForge handles JavaScript-rendered product pages that block basic scrapers
- Normalize data into a common schema so different sources become comparable
- Use pandas for data manipulation and CSV/JSON formats for output
- Schedule regular scraping to keep comparison data fresh
Prerequisites
- Python 3.8+
- pip install requests beautifulsoup4 lxml pandas schedule searchhive-client
- A SearchHive API key (free at searchhive.dev)
- Target product pages to compare (2-5 competitor sites)
Step 1: Define Your Product Schema
Before scraping anything, define the structure your comparison tool expects. Every source needs to map into this schema:
from dataclasses import dataclass, asdict, field
from typing import Optional, List

@dataclass
class Product:
    name: str
    price: float
    currency: str = "USD"
    source: str = ""
    url: str = ""
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = field(default_factory=list)
    availability: str = "unknown"
    category: str = ""
This schema captures the fields most comparison tools display. You can extend it with brand, shipping cost, or specs as needed.
Step 2: Scrape Product Pages with SearchHive
Most product pages are JavaScript-rendered and protected by anti-bot systems. SearchHive's ScrapeForge API handles both:
import requests
from bs4 import BeautifulSoup
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape_product_page(url, source_name):
    """Scrape a product page with JS rendering and return a Product object."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={
            "url": url,
            "render_js": True,
            "wait_for": ".product-details",
            "follow_redirects": True
        }
    )
    resp.raise_for_status()
    html = resp.json().get("content", "")
    soup = BeautifulSoup(html, "lxml")
    # These selectors depend on the target site -- customize per source
    return Product(
        name=soup.select_one("h1.product-title").text.strip(),
        price=parse_price(soup.select_one(".price").text),
        source=source_name,
        url=url,
        rating=parse_rating(soup.select_one(".rating-value")),
        availability="in stock" if soup.select(".in-stock") else "out of stock"
    )

def parse_price(text):
    """Extract numeric price from text like '$299.99' or '299,99 EUR'."""
    import re
    match = re.search(r"[\d,.]+", text.replace(",", "."))
    return float(match.group()) if match else 0.0
The wait_for parameter tells ScrapeForge to wait until the product details element loads before extracting HTML. This is critical for SPAs where content appears after API calls complete.
Step 3: Build Source-Specific Parsers
Different e-commerce sites have different HTML structures. Create a parser for each source that maps its HTML into your Product schema:
def scrape_amazon(url):
    """Parser for Amazon product pages."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "follow_redirects": True}
    )
    soup = BeautifulSoup(resp.json().get("content", ""), "lxml")
    return Product(
        name=(soup.select_one("#productTitle") or soup.select_one("h1")).text.strip(),
        price=parse_price((soup.select_one(".a-price-whole") or soup.select_one(".price")).text),
        source="Amazon",
        url=url,
        rating=parse_rating(soup.select_one("#acrPopover")),
        review_count=parse_int(soup.select_one("#acrCustomerReviewText"))
    )

def scrape_bestbuy(url):
    """Parser for Best Buy product pages."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "follow_redirects": True}
    )
    soup = BeautifulSoup(resp.json().get("content", ""), "lxml")
    return Product(
        name=(soup.select_one("h1.sku-title") or soup.select_one("h1")).text.strip(),
        price=parse_price((soup.select_one(".priceView-customer-price") or soup.select_one(".price")).text),
        source="Best Buy",
        url=url,
        rating=parse_rating(soup.select_one(".c-review-average")),
        features=[li.text.strip() for li in soup.select(".spec-features li")[:10]]
    )

def parse_rating(el):
    if not el:
        return None
    import re
    match = re.search(r"[\d.]+", el.text)
    return float(match.group()) if match else None

def parse_int(el):
    if not el:
        return None
    import re
    match = re.search(r"[\d,]+", el.text.replace(",", ""))
    return int(match.group()) if match else None
Step 4: Aggregate and Compare with Pandas
Once you have products from multiple sources, use pandas to compare them side by side. Pandas makes it easy to sort by price, filter by availability, and calculate statistics:
import pandas as pd

# Scrape the same product from multiple retailers
urls = {
    "Amazon": "https://www.amazon.com/dp/PRODUCT_ID",
    "Best Buy": "https://www.bestbuy.com/site/PRODUCT_ID",
    "Walmart": "https://www.walmart.com/ip/PRODUCT_ID"
}
parsers = {
    "Amazon": scrape_amazon,
    "Best Buy": scrape_bestbuy,
    "Walmart": scrape_product_page
}

products = []
for source, url in urls.items():
    try:
        product = parsers[source](url)
        products.append(asdict(product))
    except Exception as e:
        print(f"Failed to scrape {source}: {e}")

# Build comparison DataFrame
df = pd.DataFrame(products)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.sort_values("price")
print("=== Product Comparison ===")
print(df[["name", "source", "price", "rating", "availability"]].to_string(index=False))
print(f"\nBest price: ${df['price'].min():.2f} ({df.iloc[0]['source']})")
Step 5: Export the Comparison Data
Generate a clean comparison table for your frontend or export to CSV and JSON for downstream consumption. The export step also serves as a validation checkpoint -- if your parsers produced unexpected data types or missing fields, the export will surface those issues:
import json

def export_comparison(products, output_path="comparison.json"):
    """Export products as a comparison-ready JSON file."""
    sorted_products = sorted(products, key=lambda p: p.get("price", float("inf")))
    comparison = {
        "generated_at": pd.Timestamp.now().isoformat(),
        "product_count": len(sorted_products),
        "best_price": sorted_products[0] if sorted_products else None,
        "all_products": sorted_products
    }
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(comparison, f, indent=2, ensure_ascii=False, default=str)
    print(f"Comparison saved to {output_path}")
    # Also export CSV
    df = pd.DataFrame(products)
    df.to_csv("comparison.csv", index=False)
Step 6: Automate with Scheduled Scraping
Product prices change constantly -- sometimes multiple times per day. Set up a cron job or scheduled task to refresh comparison data automatically. The schedule library provides a simple Python interface for this, but for production use, consider Celery with a Redis backend or a systemd timer on your server. Aim for at least daily updates, and hourly for high-velocity categories like electronics:
import schedule
import time

def update_comparison():
    """Scrape all sources and update the comparison file."""
    print(f"Updating comparison at {pd.Timestamp.now()}")
    products = []
    for source, url in urls.items():
        try:
            product = parsers[source](url)
            products.append(asdict(product))
            print(f"  {source}: ${product.price}")
        except Exception as e:
            print(f"  {source}: FAILED - {e}")
    export_comparison(products)

# Schedule daily at 9 AM
schedule.every().day.at("09:00").do(update_comparison)

if __name__ == "__main__":
    update_comparison()  # Run immediately
    while True:
        schedule.run_pending()
        time.sleep(60)
Complete Code Example
import requests
import json
import re
import pandas as pd
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict, field
from typing import Optional, List

SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}

@dataclass
class Product:
    name: str
    price: float
    currency: str = "USD"
    source: str = ""
    url: str = ""
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = field(default_factory=list)
    availability: str = "unknown"

def scrape_with_sh(url, wait_for=None):
    payload = {"url": url, "render_js": True, "follow_redirects": True}
    if wait_for:
        payload["wait_for"] = wait_for
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json=payload
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")

def parse_price(text):
    if not text:
        return 0.0
    match = re.search(r"[\d,.]+", text.replace(",", "."))
    return float(match.group()) if match else 0.0

def compare_products(source_urls):
    products = []
    for source, url in source_urls.items():
        try:
            soup = scrape_with_sh(url)
            products.append({
                "name": soup.select_one("h1").text.strip()[:100],
                "price": parse_price(soup.select_one(".price, .a-price-whole, [data-price]").text),
                "source": source,
                "url": url,
                "rating": None,
                "availability": "in stock" if soup.select(".in-stock, .available") else "check site"
            })
        except Exception as e:
            print(f"{source} failed: {e}")
    df = pd.DataFrame(products).sort_values("price")
    print(df.to_string(index=False))
    df.to_csv("product_comparison.csv", index=False)
    return df

# Usage
compare_products({
    "Store A": "https://store-a.example.com/product/123",
    "Store B": "https://store-b.example.com/product/456",
    "Store C": "https://store-c.example.com/product/789"
})
Common Issues
1. Different price formats. Some sites show "$299", others "299.99 USD", others "299,00 EUR". The parse_price function handles common formats, but you may need to add currency detection if comparing across regions. Consider storing both the numeric price and the original currency string, then applying exchange rates for true comparison.
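As a sketch of that approach, here is a parser that returns both the numeric amount and a detected currency code. The symbol-to-code mapping and the European-format heuristic are assumptions -- extend both to match the sources you actually scrape:

```python
import re

# Assumed mapping of symbols/codes to ISO currencies -- extend per source
CURRENCY_HINTS = {"$": "USD", "\u20ac": "EUR", "\u00a3": "GBP", "USD": "USD", "EUR": "EUR", "GBP": "GBP"}

def parse_price_with_currency(text):
    """Return (amount, currency) from strings like '$1,299.99' or '299,99 EUR'."""
    currency = "USD"  # default when no hint is found
    for hint, code in CURRENCY_HINTS.items():
        if hint in text:
            currency = code
            break
    digits = re.sub(r"[^\d,.]", "", text)
    # European format: comma is the decimal separator, dot the thousands separator
    if re.fullmatch(r"\d{1,3}(\.\d{3})*,\d{2}", digits):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")  # strip US thousands separators
    return (float(digits) if digits else 0.0, currency)
```

Storing the tuple instead of a bare float keeps the original currency available for exchange-rate conversion later.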
2. Product matching. Different stores may use different product names or SKUs for the same item. Use UPC/EAN codes or manufacturer part numbers when available to ensure you are comparing the exact same product. When those are not available, fuzzy matching on product names using difflib.SequenceMatcher or fuzzywuzzy can help identify the same product across retailers.
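A minimal fuzzy-matching sketch using the standard library's difflib; the 0.6 threshold is an assumption you should tune against real listing titles from your sources:

```python
from difflib import SequenceMatcher

def same_product(name_a, name_b, threshold=0.6):
    """Heuristic check that two listing titles refer to the same product.

    Compares lowercased titles; the threshold is a starting point, not a
    universal constant -- validate it against known matches in your data.
    """
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold
```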
3. Blocked requests. Major retailers aggressively block scrapers. SearchHive's residential proxies and CAPTCHA solving are essential here -- basic requests will get 403 responses.
4. Dynamic pricing. Some sites show different prices based on location, cookies, or time of day. ScrapeForge's proxy rotation helps, but be aware that comparison data may vary between runs. Cache results and take the median price over multiple scrapes for accuracy.
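The median-over-multiple-scrapes idea can be as simple as this helper, which also discards zero or missing prices left behind by failed parses:

```python
from statistics import median

def stable_price(samples):
    """Median of prices collected across separate scrape runs.

    The median resists one-off dynamic-pricing outliers better than the mean;
    `samples` is a list of floats, one per run.
    """
    valid = [p for p in samples if p and p > 0]
    return median(valid) if valid else None
```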
5. Out-of-stock detection. An item might exist on a product page but be unavailable. Always check for stock status elements and filter out unavailable items from your comparison results. Display "out of stock" in your UI rather than showing a stale price.
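With the DataFrame from Step 4, that filtering is one expression. This sketch assumes the "in stock" / "out of stock" strings produced by the parsers above:

```python
import pandas as pd

def filter_in_stock(df):
    """Drop unavailable items before ranking by price.

    Stale prices on out-of-stock listings would otherwise win the
    "best price" slot in the comparison.
    """
    return df[df["availability"].str.lower() == "in stock"].sort_values("price")
```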
Extending the Comparison Tool
Once you have the basic pipeline working, consider these enhancements:
- Image scraping: Download product images alongside data using ScrapeForge's file extraction capabilities
- Historical price tracking: Store daily snapshots in a database and build price history charts
- Alert system: Notify users when a product drops below a target price (combine with the webhook monitoring approach from our webhook tutorial)
- Multi-category support: Extend the source parsers to handle different product categories with category-specific fields (tech specs for electronics, size/color for clothing)
- API endpoint: Wrap the comparison logic in a FastAPI endpoint and serve results to a frontend
- Email notifications: Send weekly comparison digests to subscribers using your email service of choice
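As one illustration, the alert-system idea boils down to comparing freshly scraped prices against user targets. This sketch uses a plain dict of name fragments to target prices as a stand-in for a real subscription store; the function name and structure are assumptions, not SearchHive API:

```python
def check_price_alerts(products, alerts):
    """Return alert hits: products whose current price dropped to or below a target.

    `products` is the list of dicts produced by the scraping loop;
    `alerts` maps a lowercase name fragment to the user's target price.
    """
    hits = []
    for product in products:
        for name_fragment, target in alerts.items():
            if name_fragment.lower() in product["name"].lower() and product["price"] <= target:
                hits.append({
                    "product": product["name"],
                    "price": product["price"],
                    "target": target,
                })
    return hits
```

Run this at the end of update_comparison and hand the hits to whatever notification channel you use.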
Next Steps
- How to Extract Structured Data from HTML with Python -- deeper dive into HTML parsing techniques
- How to Scrape Product Reviews with Python -- add review data to your comparisons
- SearchHive API docs -- full ScrapeForge parameter reference
Build your product comparison tool with SearchHive's free tier -- 500 credits to start, no credit card needed.