How to Build a Price Comparison API — Step-by-Step Tutorial
Building a price comparison API lets you aggregate product prices from multiple retailers, track price changes over time, and deliver real-time pricing data to your applications. Whether you're building a shopping comparison site, a price alert system, or a competitive intelligence dashboard, this tutorial walks you through the complete process.
We'll use SearchHive's ScrapeForge API for web data extraction — it handles JavaScript rendering and bot detection automatically, which is critical when scraping e-commerce sites.
Key Takeaways
- A price comparison API needs three components: data extraction, normalization, and a serving layer
- E-commerce sites are heavily protected — you need anti-bot bypass capabilities
- SearchHive's ScrapeForge handles JS rendering and bot detection in a single API call
- Price normalization (currency conversion, unit standardization) is essential for accurate comparisons
- Caching and rate limiting keep your API fast and cost-effective
Prerequisites
Before starting, you'll need:
- Python 3.9+ installed
- A SearchHive API key (free tier available — 1,000 requests/month)
- Basic familiarity with REST APIs and Python
- A code editor or IDE
```shell
pip install requests fastapi uvicorn redis python-dotenv
```
Step 1: Define Your Data Model
Start by defining the structure of your price comparison data:
```python
# models.py
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class ProductPrice:
    product_name: str
    retailer: str
    price: float
    currency: str
    url: str
    in_stock: bool
    timestamp: str
    shipping_cost: Optional[float] = None
    condition: str = "new"

    def to_dict(self):
        return asdict(self)


@dataclass
class PriceComparison:
    query: str
    results: list[ProductPrice]
    lowest_price: Optional[ProductPrice] = None
    highest_price: Optional[ProductPrice] = None
    scraped_at: str = ""

    def __post_init__(self):
        self.scraped_at = datetime.utcnow().isoformat()
        if self.results:
            prices = [r for r in self.results if r.in_stock]
            if prices:
                self.lowest_price = min(prices, key=lambda r: r.price)
                self.highest_price = max(prices, key=lambda r: r.price)
```
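To confirm the summary fields behave as intended, here's a quick standalone check — the dataclasses are repeated from above so the snippet runs on its own, and the product data is made up:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProductPrice:  # repeated from models.py (fields trimmed to the essentials)
    product_name: str
    retailer: str
    price: float
    currency: str
    url: str
    in_stock: bool
    timestamp: str


@dataclass
class PriceComparison:  # repeated from models.py
    query: str
    results: list
    lowest_price: Optional[ProductPrice] = None
    highest_price: Optional[ProductPrice] = None
    scraped_at: str = ""

    def __post_init__(self):
        self.scraped_at = datetime.utcnow().isoformat()
        if self.results:
            prices = [r for r in self.results if r.in_stock]
            if prices:
                self.lowest_price = min(prices, key=lambda r: r.price)
                self.highest_price = max(prices, key=lambda r: r.price)


now = datetime.utcnow().isoformat()
comparison = PriceComparison(
    query="wireless headphones",
    results=[
        ProductPrice("Sony WH-1000XM5", "amazon", 248.00, "USD", "https://example.com/a", True, now),
        ProductPrice("Sony WH-1000XM5", "walmart", 279.00, "USD", "https://example.com/b", True, now),
        # Out of stock: cheaper, but excluded from the min/max summary
        ProductPrice("Sony WH-1000XM5", "target", 199.00, "USD", "https://example.com/c", False, now),
    ],
)
print(comparison.lowest_price.retailer)  # amazon
print(comparison.highest_price.price)    # 279.0
```

Note that the out-of-stock item is skipped: a $199 listing you can't buy shouldn't win the comparison.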
Step 2: Set Up SearchHive ScrapeForge for E-Commerce
Each retailer needs its own extraction configuration. Here's how to extract prices from a typical product page:
```python
# scrapers.py
import os
from urllib.parse import quote_plus

import requests

SEARCHHIVE_API_KEY = os.getenv("SEARCHHIVE_API_KEY")
SEARCHHIVE_BASE = "https://api.searchhive.dev/v1/scrapeforge"

headers = {"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"}

RETAILER_CONFIGS = {
    "amazon": {
        "url_template": "https://www.amazon.com/s?k={query}",
        "extraction": {
            "type": "structured",
            "fields": {
                "products": {
                    "selector": "[data-component-type='s-search-result']",
                    "multiple": True,
                    "fields": {
                        "name": "h2 a span",
                        "price": ".a-price .a-offscreen::text",
                        "url": "h2 a::attr(href)",
                        "rating": ".a-icon-star-small .a-icon-alt::text",
                        "reviews": ".a-size-small .a-link-normal::text"
                    }
                }
            }
        }
    },
    "walmart": {
        "url_template": "https://www.walmart.com/search?q={query}",
        "extraction": {
            "type": "structured",
            "fields": {
                "products": {
                    "selector": "[data-item-id]",
                    "multiple": True,
                    "fields": {
                        "name": "[data-automation-id='product-title']::text",
                        "price": "[data-automation-id='product-price']::text",
                        "url": "a::attr(href)",
                        "rating": ".rating-number::text"
                    }
                }
            }
        }
    }
}


def scrape_retailer(retailer: str, query: str) -> dict:
    """Scrape product prices from a specific retailer."""
    if retailer not in RETAILER_CONFIGS:
        raise ValueError(f"Unsupported retailer: {retailer}")
    config = RETAILER_CONFIGS[retailer]
    # quote_plus encodes spaces and special characters safely
    url = config["url_template"].format(query=quote_plus(query))
    response = requests.post(
        SEARCHHIVE_BASE,
        headers=headers,
        json={
            "url": url,
            "render_js": True,
            "anti_bot": True,
            "extraction": config["extraction"]
        },
        timeout=30
    )
    response.raise_for_status()
    return response.json()
```
Step 3: Normalize and Clean Price Data
Raw scraped data is messy. Prices come in different formats and currencies, and may include text like "Free shipping" or "Was $99.99". Here's how to normalize:
```python
# normalizer.py
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

from models import ProductPrice


def parse_price(raw_price: str) -> float:
    """Extract the first numeric price from messy scraped text."""
    if not raw_price:
        return 0.0
    # Match the first price-like number so "$30.00 (was $45.00)" yields 30.0
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw_price)
    if not match:
        return 0.0
    cleaned = match.group().replace(',', '')
    try:
        return float(Decimal(cleaned))
    except (InvalidOperation, ValueError):
        return 0.0


def normalize_retailer_results(raw_data: dict, retailer: str, query: str) -> list[ProductPrice]:
    """Convert raw scrape results into ProductPrice objects."""
    products = []
    raw_products = raw_data.get("data", {}).get("products", [])
    for item in raw_products[:10]:  # Top 10 results
        price = parse_price(item.get("price", ""))
        if price <= 0:
            continue
        url = item.get("url", "")
        if url.startswith("/"):
            # Relative links need the retailer's base domain prepended
            base = {
                "amazon": "https://www.amazon.com",
                "walmart": "https://www.walmart.com"
            }.get(retailer, "")
            url = base + url
        products.append(ProductPrice(
            product_name=item.get("name", "").strip()[:200],
            retailer=retailer,
            price=price,
            currency="USD",
            url=url,
            in_stock=price > 0,
            timestamp=datetime.utcnow().isoformat()
        ))
    return products
```
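A few quick checks show what price parsing can and can't recover. This standalone version grabs the first price-like number in the string, so a strike-through "was" price doesn't pollute the result:

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(raw_price: str) -> float:
    """Extract the first numeric price from messy scraped text."""
    if not raw_price:
        return 0.0
    # First price-like number wins; commas are treated as thousands separators
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw_price)
    if not match:
        return 0.0
    cleaned = match.group().replace(',', '')
    try:
        return float(Decimal(cleaned))
    except (InvalidOperation, ValueError):
        return 0.0


print(parse_price("$29.99"))                      # 29.99
print(parse_price("1,299.00"))                    # 1299.0
print(parse_price("Price: $30.00 (was $45.00)"))  # 30.0
print(parse_price("Out of stock"))                # 0.0
```

European formats with comma decimal separators ("29,99 €") would come out wrong here — that case is covered under Common Issues below.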
Step 4: Build the Comparison Engine
Aggregate results from multiple retailers and sort by price:
```python
# engine.py
import concurrent.futures
from typing import Optional

from models import PriceComparison
from normalizer import normalize_retailer_results
from scrapers import RETAILER_CONFIGS, scrape_retailer


def compare_prices(query: str, retailers: Optional[list[str]] = None) -> PriceComparison:
    """Compare prices across multiple retailers."""
    if retailers is None:
        retailers = list(RETAILER_CONFIGS.keys())
    all_products = []
    # Scrape retailers in parallel for speed
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            executor.submit(scrape_retailer, r, query): r
            for r in retailers
        }
        for future in concurrent.futures.as_completed(futures):
            retailer = futures[future]
            try:
                raw_data = future.result()
                products = normalize_retailer_results(raw_data, retailer, query)
                all_products.extend(products)
            except Exception as e:
                print(f"Failed to scrape {retailer}: {e}")
    # Sort by price (lowest first)
    all_products.sort(key=lambda p: p.price)
    return PriceComparison(query=query, results=all_products)
```
Step 5: Create the FastAPI Endpoint
Wrap everything in a REST API:
```python
# main.py
import time

from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware

from engine import compare_prices

app = FastAPI(
    title="Price Comparison API",
    description="Compare product prices across multiple retailers",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["GET"],
    allow_headers=["*"],
)

# Simple in-memory cache: {cache_key: (response, cached_at)}
cache = {}


@app.get("/compare")
def compare(
    q: str = Query(..., description="Product to search for"),
    retailers: str = Query("amazon,walmart", description="Comma-separated retailer list"),
    cache_ttl: int = Query(3600, description="Cache TTL in seconds")
):
    cache_key = f"{q}:{retailers}"
    # Check cache first
    if cache_key in cache:
        cached_data, cached_time = cache[cache_key]
        if time.time() - cached_time < cache_ttl:
            return cached_data
    # Run comparison
    retailer_list = [r.strip() for r in retailers.split(",")]
    try:
        result = compare_prices(q, retailer_list)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    response = {
        "query": result.query,
        "total_results": len(result.results),
        "lowest_price": result.lowest_price.to_dict() if result.lowest_price else None,
        "highest_price": result.highest_price.to_dict() if result.highest_price else None,
        "results": [r.to_dict() for r in result.results],
        "scraped_at": result.scraped_at
    }
    # Cache the result
    cache[cache_key] = (response, time.time())
    return response


@app.get("/health")
def health():
    return {"status": "ok"}
```
Step 6: Run and Test
```shell
# Start the server
uvicorn main:app --host 0.0.0.0 --port 8000

# Test it
curl "http://localhost:8000/compare?q=wireless+headphones&retailers=amazon,walmart"
```
Expected response:
```json
{
  "query": "wireless headphones",
  "total_results": 18,
  "lowest_price": {
    "product_name": "Sony WH-1000XM5",
    "retailer": "amazon",
    "price": 248.00,
    "currency": "USD",
    "url": "https://www.amazon.com/dp/B09XS7JWHH",
    "in_stock": true
  },
  "highest_price": {
    "product_name": "Apple AirPods Max",
    "retailer": "walmart",
    "price": 549.00
  },
  "results": [...]
}
```
Step 7: Add Persistence and Price History
Track price changes over time with Redis:
```python
# history.py
import json
from datetime import datetime

import redis

r = redis.Redis(host='localhost', port=6379, db=0)


def save_price_history(comparison_result: dict):
    """Save price data for historical tracking."""
    query = comparison_result["query"]
    for product in comparison_result["results"]:
        key = f"price:{query}:{product['retailer']}:{product['product_name']}"
        # Score each snapshot by its Unix timestamp so we can range-query later
        r.zadd(key, {json.dumps(product): datetime.utcnow().timestamp()})


def get_price_trend(product_name: str, retailer: str, days: int = 30) -> list:
    """Get price history for a product across all stored queries."""
    pattern = f"price:*:{retailer}:{product_name}"
    cutoff = datetime.utcnow().timestamp() - days * 86400
    history = []
    for matching_key in r.scan_iter(pattern):
        for entry in r.zrangebyscore(matching_key, cutoff, "+inf"):
            history.append(json.loads(entry))
    return sorted(history, key=lambda x: x["timestamp"])
```
Common Issues and Solutions
1. Bot Detection Blocking Your Scrapes
Problem: E-commerce sites return CAPTCHAs or 403 errors.
Solution: SearchHive's anti-bot bypass is enabled by default on paid plans. Ensure anti_bot: True is set in your ScrapeForge request.
2. Inconsistent Price Formats
Problem: Prices appear as "$29.99", "29,99 €", "Price: $30.00 (was $45.00)".
Solution: The parse_price() function in Step 3 handles most formats. Add retailer-specific parsers for edge cases.
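As one example of a retailer-specific parser, here's a sketch for European-style prices, where the comma is the decimal separator and dots mark thousands. Treat the format detection as a heuristic, not a production-ready locale handler:

```python
import re


def parse_price_eu(raw_price: str) -> float:
    """Parse prices like '29,99 €' or '1.299,00 EUR' (comma as decimal separator)."""
    match = re.search(r'\d[\d.]*(?:,\d+)?', raw_price)
    if not match:
        return 0.0
    # Dots are thousands separators; the comma is the decimal point
    cleaned = match.group().replace('.', '').replace(',', '.')
    try:
        return float(cleaned)
    except ValueError:
        return 0.0


print(parse_price_eu("29,99 €"))       # 29.99
print(parse_price_eu("1.299,00 EUR"))  # 1299.0
```

You could then dispatch on retailer (or on the scraped currency symbol) to pick the right parser before constructing `ProductPrice`.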
3. Rate Limiting
Problem: Scraping too fast triggers rate limits.
Solution: Add delays between requests and use concurrent futures with limited workers:
```python
import time

# Add a small delay between retailer scrapes
for retailer in retailers:
    result = scrape_retailer(retailer, query)
    time.sleep(2)  # 2-second delay
```
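If a retailer still throttles you, retrying with exponential backoff is a common complement to fixed delays. A minimal sketch (the usage line assumes the `scrape_retailer` function from Step 2):

```python
import random
import time


def with_backoff(fn, *args, max_retries=3, base_delay=1.0):
    """Call fn(*args); on failure, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the original error
            # Exponential backoff with jitter avoids synchronized retry bursts
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Usage (hypothetical):
# result = with_backoff(scrape_retailer, "amazon", "wireless headphones")
```

Keep `max_retries` low — each retry is a billable ScrapeForge request.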
4. Product Matching Across Retailers
Problem: The same product has different names on different sites.
Solution: Use fuzzy string matching:
```python
from difflib import SequenceMatcher


def match_products(products_a: list, products_b: list, threshold: float = 0.7) -> list:
    """Pair up likely-identical products from two retailers by name similarity."""
    matches = []
    for pa in products_a:
        for pb in products_b:
            ratio = SequenceMatcher(
                None, pa.product_name.lower(), pb.product_name.lower()
            ).ratio()
            if ratio >= threshold:
                matches.append((pa, pb, ratio))
    return matches
```
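A standalone demonstration, with a minimal stand-in for `ProductPrice` and made-up listing names:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class P:  # stand-in for ProductPrice; only the name matters here
    product_name: str


def match_products(products_a: list, products_b: list, threshold: float = 0.7) -> list:
    """Pair up likely-identical products from two retailers by name similarity."""
    matches = []
    for pa in products_a:
        for pb in products_b:
            ratio = SequenceMatcher(
                None, pa.product_name.lower(), pb.product_name.lower()
            ).ratio()
            if ratio >= threshold:
                matches.append((pa, pb, ratio))
    return matches


amazon = [P("Sony WH-1000XM5 Wireless Headphones"), P("Apple AirPods Max")]
walmart = [P("Sony WH1000XM5 Noise Canceling Wireless Headphones")]

matches = match_products(amazon, walmart)
print(len(matches))  # only the Sony pair clears the threshold
```

Tune `threshold` per category: consumer electronics names are verbose enough that 0.7 works, while short generic names ("HDMI cable") need stricter matching or extra signals like model numbers.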
Next Steps
- Add more retailers: Extend `RETAILER_CONFIGS` with eBay, Target, and Best Buy
- Add alerts: Use webhooks to notify users when prices drop below a threshold
- Deploy: Containerize with Docker and deploy to your preferred cloud
- Scale: Add Redis caching, rate limiting, and a task queue for high-volume use
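As a starting point for the alerts idea, here's a sketch of a threshold check you could run after each comparison. The alert shape (`query`, `max_price`) is hypothetical — adapt it to however you store user subscriptions, and wire the triggered list into your webhook sender:

```python
def check_price_alerts(comparison: dict, alerts: list[dict]) -> list[dict]:
    """Return the alerts triggered by this comparison result.

    Each alert looks like {"query": ..., "max_price": ...} -- a made-up
    shape for illustration.
    """
    triggered = []
    for alert in alerts:
        if alert["query"] != comparison["query"]:
            continue
        lowest = comparison.get("lowest_price")
        if lowest and lowest["price"] <= alert["max_price"]:
            # Attach the current price so the notification can show it
            triggered.append({**alert, "current_price": lowest["price"]})
    return triggered


alerts = [{"query": "wireless headphones", "max_price": 250.00}]
comparison = {
    "query": "wireless headphones",
    "lowest_price": {"price": 248.00, "retailer": "amazon"},
}
print(check_price_alerts(comparison, alerts))
```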
Ready to build? Get your free SearchHive API key — 1,000 requests/month included, no credit card required. Check the ScrapeForge docs for advanced extraction options.