A product comparison tool aggregates pricing, features, and reviews from multiple sources into a single view. Whether you are building a price comparison engine, a SaaS directory, or a competitive intelligence dashboard, the core workflow is the same: scrape product pages from several sites, normalize the data, and present it side by side. This tutorial walks through building one with Python and SearchHive.
Key Takeaways
- Product comparison tools need three layers: scraping, normalization, and presentation
- SearchHive's ScrapeForge handles JavaScript-rendered product pages that block basic scrapers
- Normalize data into a common schema so different sources become comparable
- Use pandas for data manipulation and CSV/JSON formats for output
- Schedule regular scraping to keep comparison data fresh
Prerequisites
- Python 3.8+
- pip install requests beautifulsoup4 lxml pandas schedule searchhive-client
- A SearchHive API key (free at searchhive.dev)
- Target product pages to compare (2-5 competitor sites)
Step 1: Define Your Product Schema
Before scraping anything, define the structure your comparison tool expects. Every source needs to map into this schema:
from dataclasses import dataclass, asdict, field
from typing import Optional, List

@dataclass
class Product:
    name: str
    price: float
    currency: str = "USD"
    source: str = ""
    url: str = ""
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = field(default_factory=list)
    availability: str = "unknown"
    category: str = ""
This schema captures the fields most comparison tools display. You can extend it with brand, shipping cost, or specs as needed.
Step 2: Scrape Product Pages with SearchHive
Most product pages are JavaScript-rendered and protected by anti-bot systems. SearchHive's ScrapeForge API handles both:
import requests
from bs4 import BeautifulSoup
SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}
def scrape_product_page(url, source_name):
    """Scrape a product page with JS rendering and return a Product object."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={
            "url": url,
            "render_js": True,
            "wait_for": ".product-details",
            "follow_redirects": True
        }
    )
    resp.raise_for_status()
    html = resp.json().get("content", "")
    soup = BeautifulSoup(html, "lxml")
    # These selectors depend on the target site -- customize per source
    return Product(
        name=soup.select_one("h1.product-title").text.strip(),
        price=parse_price(soup.select_one(".price").text),
        source=source_name,
        url=url,
        rating=parse_rating(soup.select_one(".rating-value")),
        availability="in stock" if soup.select(".in-stock") else "out of stock"
    )

def parse_price(text):
    """Extract numeric price from text like '$299.99' or '299,99 EUR'."""
    import re
    match = re.search(r"[\d,.]+", text.replace(",", "."))
    return float(match.group()) if match else 0.0
The wait_for parameter tells ScrapeForge to wait until the product details element loads before extracting HTML. This is critical for SPAs where content appears after API calls complete.
Step 3: Build Source-Specific Parsers
Different e-commerce sites have different HTML structures. Create a parser for each source that maps its HTML into your Product schema:
def scrape_amazon(url):
    """Parser for Amazon product pages."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "follow_redirects": True}
    )
    soup = BeautifulSoup(resp.json().get("content", ""), "lxml")
    return Product(
        name=(soup.select_one("#productTitle") or soup.select_one("h1")).text.strip(),
        price=parse_price((soup.select_one(".a-price-whole") or soup.select_one(".price")).text),
        source="Amazon",
        url=url,
        rating=parse_rating(soup.select_one("#acrPopover")),
        review_count=parse_int(soup.select_one("#acrCustomerReviewText"))
    )

def scrape_bestbuy(url):
    """Parser for Best Buy product pages."""
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json={"url": url, "render_js": True, "follow_redirects": True}
    )
    soup = BeautifulSoup(resp.json().get("content", ""), "lxml")
    return Product(
        name=(soup.select_one("h1.sku-title") or soup.select_one("h1")).text.strip(),
        price=parse_price((soup.select_one(".priceView-customer-price") or soup.select_one(".price")).text),
        source="Best Buy",
        url=url,
        rating=parse_rating(soup.select_one(".c-review-average")),
        features=[li.text.strip() for li in soup.select(".spec-features li")[:10]]
    )

def parse_rating(el):
    if not el:
        return None
    import re
    match = re.search(r"[\d.]+", el.text)
    return float(match.group()) if match else None

def parse_int(el):
    if not el:
        return None
    import re
    match = re.search(r"[\d,]+", el.text.replace(",", ""))
    return int(match.group()) if match else None
Step 4: Aggregate and Compare with Pandas
Once you have products from multiple sources, use pandas to compare them side by side. Pandas makes it easy to sort by price, filter by availability, and calculate statistics:
import pandas as pd

# Scrape the same product from multiple retailers
urls = {
    "Amazon": "https://www.amazon.com/dp/PRODUCT_ID",
    "Best Buy": "https://www.bestbuy.com/site/PRODUCT_ID",
    "Walmart": "https://www.walmart.com/ip/PRODUCT_ID"
}
parsers = {
    "Amazon": scrape_amazon,
    "Best Buy": scrape_bestbuy,
    "Walmart": scrape_product_page
}

products = []
for source, url in urls.items():
    try:
        product = parsers[source](url)
        products.append(asdict(product))
    except Exception as e:
        print(f"Failed to scrape {source}: {e}")

# Build comparison DataFrame
df = pd.DataFrame(products)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.sort_values("price")
print("=== Product Comparison ===")
print(df[["name", "source", "price", "rating", "availability"]].to_string(index=False))
print(f"\nBest price: ${df['price'].min():.2f} ({df.iloc[0]['source']})")
Step 5: Export the Comparison Data
Generate a clean comparison table for your frontend or export to CSV and JSON for downstream consumption. The export step also serves as a validation checkpoint -- if your parsers produced unexpected data types or missing fields, the export will surface those issues:
import json

def export_comparison(products, output_path="comparison.json"):
    """Export products as a comparison-ready JSON file."""
    sorted_products = sorted(products, key=lambda p: p.get("price", float("inf")))
    comparison = {
        "generated_at": pd.Timestamp.now().isoformat(),
        "product_count": len(sorted_products),
        "best_price": sorted_products[0] if sorted_products else None,
        "all_products": sorted_products
    }
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(comparison, f, indent=2, ensure_ascii=False, default=str)
    print(f"Comparison saved to {output_path}")
    # Also export CSV
    df = pd.DataFrame(products)
    df.to_csv("comparison.csv", index=False)
Step 6: Automate with Scheduled Scraping
Product prices change constantly -- sometimes multiple times per day. Set up a cron job or scheduled task to refresh comparison data automatically. The schedule library provides a simple Python interface for this, but for production use, consider Celery with a Redis backend or a systemd timer on your server. Aim for at least daily updates, and hourly for high-velocity categories like electronics:
import schedule
import time

def update_comparison():
    """Scrape all sources and update the comparison file."""
    print(f"Updating comparison at {pd.Timestamp.now()}")
    products = []
    for source, url in urls.items():
        try:
            product = parsers[source](url)
            products.append(asdict(product))
            print(f"  {source}: ${product.price}")
        except Exception as e:
            print(f"  {source}: FAILED - {e}")
    export_comparison(products)

# Schedule daily at 9 AM
schedule.every().day.at("09:00").do(update_comparison)

if __name__ == "__main__":
    update_comparison()  # Run immediately
    while True:
        schedule.run_pending()
        time.sleep(60)
Complete Code Example
import requests
import json
import re
import pandas as pd
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict, field
from typing import Optional, List

SH_KEY = "sk_live_your_key_here"
SH_HEADERS = {"Authorization": f"Bearer {SH_KEY}", "Content-Type": "application/json"}

@dataclass
class Product:
    name: str
    price: float
    currency: str = "USD"
    source: str = ""
    url: str = ""
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = field(default_factory=list)
    availability: str = "unknown"

def scrape_with_sh(url, wait_for=None):
    payload = {"url": url, "render_js": True, "follow_redirects": True}
    if wait_for:
        payload["wait_for"] = wait_for
    resp = requests.post(
        "https://www.searchhive.dev/api/v1/scrapeforge",
        headers=SH_HEADERS,
        json=payload
    )
    resp.raise_for_status()
    return BeautifulSoup(resp.json().get("content", ""), "lxml")

def parse_price(text):
    if not text:
        return 0.0
    match = re.search(r"[\d,.]+", text.replace(",", "."))
    return float(match.group()) if match else 0.0

def compare_products(source_urls):
    products = []
    for source, url in source_urls.items():
        try:
            soup = scrape_with_sh(url)
            products.append({
                "name": soup.select_one("h1").text.strip()[:100],
                "price": parse_price(soup.select_one(".price, .a-price-whole, [data-price]").text),
                "source": source,
                "url": url,
                "rating": None,
                "availability": "in stock" if soup.select(".in-stock, .available") else "check site"
            })
        except Exception as e:
            print(f"{source} failed: {e}")
    df = pd.DataFrame(products).sort_values("price")
    print(df.to_string(index=False))
    df.to_csv("product_comparison.csv", index=False)
    return df

# Usage
compare_products({
    "Store A": "https://store-a.example.com/product/123",
    "Store B": "https://store-b.example.com/product/456",
    "Store C": "https://store-c.example.com/product/789"
})
Common Issues
1. Different price formats. Some sites show "$299", others "299.99 USD", others "299,00 EUR". The parse_price function handles common formats, but you may need to add currency detection if comparing across regions. Consider storing both the numeric price and the original currency string, then applying exchange rates for true comparison.
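As a sketch of that approach, here is a parser that returns both the numeric amount and a detected currency code. The symbol-to-code mapping and the European-format heuristic are assumptions -- extend both to match the sources you actually scrape:

```python
import re

# Assumed mapping of symbols/codes to ISO currencies -- extend per source
CURRENCY_HINTS = {"$": "USD", "\u20ac": "EUR", "\u00a3": "GBP", "USD": "USD", "EUR": "EUR", "GBP": "GBP"}

def parse_price_with_currency(text):
    """Return (amount, currency) from strings like '$1,299.99' or '299,99 EUR'."""
    currency = "USD"  # default when no hint is found
    for hint, code in CURRENCY_HINTS.items():
        if hint in text:
            currency = code
            break
    digits = re.sub(r"[^\d,.]", "", text)
    # European format: comma is the decimal separator, dot the thousands separator
    if re.fullmatch(r"\d{1,3}(\.\d{3})*,\d{2}", digits):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")  # strip US thousands separators
    return (float(digits) if digits else 0.0, currency)
```

Storing the tuple instead of a bare float keeps the original currency available for exchange-rate conversion later.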
2. Product matching. Different stores may use different product names or SKUs for the same item. Use UPC/EAN codes or manufacturer part numbers when available to ensure you are comparing the exact same product. When those are not available, fuzzy matching on product names using difflib.SequenceMatcher or fuzzywuzzy can help identify the same product across retailers.
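A minimal fuzzy-matching sketch using the standard library's difflib; the 0.6 threshold is an assumption you should tune against real listing titles from your sources:

```python
from difflib import SequenceMatcher

def same_product(name_a, name_b, threshold=0.6):
    """Heuristic check that two listing titles refer to the same product.

    Compares lowercased titles; the threshold is a starting point, not a
    universal constant -- validate it against known matches in your data.
    """
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold
```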
3. Blocked requests. Major retailers aggressively block scrapers. SearchHive's residential proxies and CAPTCHA solving are essential here -- basic requests will get 403 responses.
4. Dynamic pricing. Some sites show different prices based on location, cookies, or time of day. ScrapeForge's proxy rotation helps, but be aware that comparison data may vary between runs. Cache results and take the median price over multiple scrapes for accuracy.
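The median-over-multiple-scrapes idea can be as simple as this helper, which also discards zero or missing prices left behind by failed parses:

```python
from statistics import median

def stable_price(samples):
    """Median of prices collected across separate scrape runs.

    The median resists one-off dynamic-pricing outliers better than the mean;
    `samples` is a list of floats, one per run.
    """
    valid = [p for p in samples if p and p > 0]
    return median(valid) if valid else None
```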
5. Out-of-stock detection. An item might exist on a product page but be unavailable. Always check for stock status elements and filter out unavailable items from your comparison results. Display "out of stock" in your UI rather than showing a stale price.
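With the DataFrame from Step 4, that filtering is one expression. This sketch assumes the "in stock" / "out of stock" strings produced by the parsers above:

```python
import pandas as pd

def filter_in_stock(df):
    """Drop unavailable items before ranking by price.

    Stale prices on out-of-stock listings would otherwise win the
    "best price" slot in the comparison.
    """
    return df[df["availability"].str.lower() == "in stock"].sort_values("price")
```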
Extending the Comparison Tool
Once you have the basic pipeline working, consider these enhancements:
- Image scraping: Download product images alongside data using ScrapeForge's file extraction capabilities
- Historical price tracking: Store daily snapshots in a database and build price history charts
- Alert system: Notify users when a product drops below a target price (combine with the webhook monitoring approach from our webhook tutorial)
- Multi-category support: Extend the source parsers to handle different product categories with category-specific fields (tech specs for electronics, size/color for clothing)
- API endpoint: Wrap the comparison logic in a FastAPI endpoint and serve results to a frontend
- Email notifications: Send weekly comparison digests to subscribers using your email service of choice
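As one illustration, the alert-system idea boils down to comparing freshly scraped prices against user targets. This sketch uses a plain dict of name fragments to target prices as a stand-in for a real subscription store; the function name and structure are assumptions, not SearchHive API:

```python
def check_price_alerts(products, alerts):
    """Return alert hits: products whose current price dropped to or below a target.

    `products` is the list of dicts produced by the scraping loop;
    `alerts` maps a lowercase name fragment to the user's target price.
    """
    hits = []
    for product in products:
        for name_fragment, target in alerts.items():
            if name_fragment.lower() in product["name"].lower() and product["price"] <= target:
                hits.append({
                    "product": product["name"],
                    "price": product["price"],
                    "target": target,
                })
    return hits
```

Run this at the end of update_comparison and hand the hits to whatever notification channel you use.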
Next Steps
- How to Extract Structured Data from HTML with Python -- deeper dive into HTML parsing techniques
- How to Scrape Product Reviews with Python -- add review data to your comparisons
- SearchHive API docs -- full ScrapeForge parameter reference
Build your product comparison tool with SearchHive's free tier -- 500 credits to start, no credit card needed.