How to Scrape Restaurant Data for Competitive Analysis
Restaurant data -- menus, prices, ratings, hours, and location details -- is valuable for market research, competitive intelligence, and food delivery platforms. Whether you're building a restaurant directory, tracking pricing trends, or analyzing competitor strategy, this tutorial shows you how to scrape restaurant data at scale using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Restaurant data lives across Google Maps, Yelp, TripAdvisor, and individual restaurant websites
- SearchHive's ScrapeForge API handles JavaScript rendering and anti-bot protection on review sites
- The SwiftSearch API finds restaurant listings by location, cuisine, and price range
- Structured extraction with DeepDive converts raw page content into clean JSON
- A complete pipeline scrapes, structures, and exports data in under 50 lines of Python
Prerequisites
- Python 3.8 or later
- SearchHive API key (free tier -- 500 credits)
- pip install requests searchhive
Step 1: Find Restaurants with SwiftSearch
Before scraping individual restaurant pages, you need a list of targets. SearchHive's SwiftSearch API queries search engines to find restaurants matching your criteria:
from searchhive import SwiftSearch

client = SwiftSearch(api_key="YOUR_API_KEY")

results = client.search(
    "best Italian restaurants in Brooklyn NY 2024",
    num_results=20
)

for r in results:
    print(f"{r.title} - {r.url}")
This returns ranked results with titles, URLs, and snippets. You can filter by cuisine type, neighborhood, or any other search criteria. SwiftSearch queries real search engines, so you get the same results a user would see.
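In practice you will often want to separate restaurants' own websites from aggregator pages before scraping, since each needs different extraction schemas. A minimal post-filtering sketch -- plain dicts stand in for SwiftSearch result objects here, and the `AGGREGATORS` set is an illustrative assumption, not part of the SDK:

```python
from urllib.parse import urlparse

# Hypothetical aggregator domains you may want to handle separately
AGGREGATORS = {"yelp.com", "tripadvisor.com", "opentable.com"}

def split_results(results):
    """Split search results into restaurant sites and aggregator pages."""
    own_sites, aggregator_pages = [], []
    for r in results:
        host = urlparse(r["url"]).netloc
        if host.startswith("www."):
            host = host[4:]
        (aggregator_pages if host in AGGREGATORS else own_sites).append(r)
    return own_sites, aggregator_pages

# Sample data shaped like SwiftSearch results
results = [
    {"title": "Luigi's Trattoria", "url": "https://www.luigis-example.com"},
    {"title": "Luigi's - Yelp", "url": "https://www.yelp.com/biz/luigis-brooklyn"},
]
own, agg = split_results(results)
print(len(own), len(agg))  # 1 1
```

You could route the aggregator pages through the review-extraction schema from Step 4 and the rest through the basic restaurant schema.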
Step 2: Scrape Individual Restaurant Pages
Once you have URLs, use ScrapeForge to extract content from each restaurant's page. Most restaurant websites use JavaScript for menus, hours, and reservation widgets, so headless rendering is essential:
from searchhive import ScrapeForge
client = ScrapeForge(api_key="YOUR_API_KEY")
url = "https://www.example-restaurant.com"
result = client.scrape(url, format="markdown", render_js=True)
print(result.content[:500])
The render_js=True parameter tells ScrapeForge to use headless Chrome, ensuring dynamic content like embedded menus and Google Maps widgets loads fully. The format="markdown" option converts the HTML into clean, readable text.
Step 3: Extract Structured Data with DeepDive
Raw markdown is useful for reading, but for a database or analysis pipeline, you need structured data. SearchHive's DeepDive API uses AI to extract specific fields from scraped content:
from searchhive import DeepDive
client = DeepDive(api_key="YOUR_API_KEY")
structured = client.extract(
    content=raw_markdown,  # from ScrapeForge
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Restaurant name"},
            "cuisine": {"type": "string", "description": "Type of cuisine"},
            "address": {"type": "string"},
            "phone": {"type": "string"},
            "hours": {"type": "string", "description": "Operating hours"},
            "price_range": {"type": "string", "description": "Dollar sign rating"},
            "rating": {"type": "string"},
            "menu_highlights": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Notable menu items"
            }
        }
    }
)
print(structured.data)
DeepDive returns a Python dictionary matching your schema. The AI understands context -- it recognizes that "$$" indicates moderate pricing and that "Mon-Fri 11am-10pm" belongs in the hours field.
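Since sparse pages may leave some schema fields out of the result, read the returned dictionary defensively. A short sketch with an illustrative payload (the dict below is sample data, not real API output):

```python
# Illustrative DeepDive-style output; real payloads depend on the page scraped
data = {
    "name": "Trattoria Esempio",
    "price_range": "$$",
    "menu_highlights": ["cacio e pepe", "tiramisu"],
}

# .get() with defaults avoids KeyError when a field was missing from the page
name = data.get("name", "Unknown")
rating = data.get("rating", "n/a")  # absent here, so the default is used
highlights = ", ".join(data.get("menu_highlights", []))

print(f"{name} ({data.get('price_range', '?')}) rating={rating}: {highlights}")
```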
Step 4: Scrape Review Platforms for Ratings
Google Maps, Yelp, and TripAdvisor are the main sources for restaurant ratings and reviews. These sites use heavy JavaScript rendering and aggressive anti-bot measures, making them difficult to scrape with basic HTTP requests:
from searchhive import ScrapeForge, DeepDive

client = ScrapeForge(api_key="YOUR_API_KEY")

# Scrape a Yelp restaurant page
yelp_url = "https://www.yelp.com/biz/example-restaurant-brooklyn"
result = client.scrape(yelp_url, format="markdown", render_js=True)

# Extract review data with DeepDive
deep = DeepDive(api_key="YOUR_API_KEY")
review_data = deep.extract(
    content=result.content,
    schema={
        "type": "object",
        "properties": {
            "overall_rating": {"type": "string"},
            "total_reviews": {"type": "string"},
            "price_range": {"type": "string"},
            "top_dishes": {
                "type": "array",
                "items": {"type": "string"}
            },
            "recent_reviews": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "rating": {"type": "string"},
                        "text": {"type": "string"},
                        "date": {"type": "string"}
                    }
                }
            }
        }
    }
)
print(f"Rating: {review_data.data['overall_rating']} ({review_data.data['total_reviews']} reviews)")
ScrapeForge handles the JavaScript rendering and proxy rotation automatically. Yelp and similar sites typically block direct scraping after a few requests -- SearchHive's built-in proxy rotation distributes requests across multiple IPs to avoid detection.
Step 5: Build a Batch Scraping Pipeline
For competitive analysis, you typically need data from dozens or hundreds of restaurants. Here's a complete pipeline that finds, scrapes, structures, and exports restaurant data:
from searchhive import SwiftSearch, ScrapeForge, DeepDive
import json
import csv
import time
API_KEY = "YOUR_API_KEY"
search = SwiftSearch(api_key=API_KEY)
scrape = ScrapeForge(api_key=API_KEY)
deep = DeepDive(api_key=API_KEY)
# Step 1: Find restaurants
query = "top rated sushi restaurants in Manhattan NYC"
search_results = search.search(query, num_results=15)
restaurant_urls = [r.url for r in search_results if r.url]
# Step 2: Scrape and extract each restaurant
restaurants = []
for i, url in enumerate(restaurant_urls):
    try:
        raw = scrape.scrape(url, format="markdown", render_js=True)
        structured = deep.extract(
            content=raw.content,
            schema={
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "cuisine": {"type": "string"},
                    "address": {"type": "string"},
                    "phone": {"type": "string"},
                    "price_range": {"type": "string"},
                    "rating": {"type": "string"},
                    "url": {"type": "string"}
                }
            }
        )
        structured.data["url"] = url
        restaurants.append(structured.data)
        print(f"[{i+1}/{len(restaurant_urls)}] {structured.data.get('name', 'Unknown')}")
        time.sleep(1)  # Rate limiting courtesy
    except Exception as e:
        print(f"[{i+1}/{len(restaurant_urls)}] Error: {e}")

# Step 3: Export to CSV (use the union of keys so rows with extra or
# missing fields don't break DictWriter)
if restaurants:
    fieldnames = sorted({k for r in restaurants for k in r})
    with open("restaurants.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(restaurants)

# Step 4: Save raw JSON for further analysis
with open("restaurants.json", "w") as f:
    json.dump(restaurants, f, indent=2)

print(f"Scraped {len(restaurants)} restaurants successfully")
Step 6: Handle Common Issues
Rate limiting. Restaurant directories and review sites throttle aggressive scraping. SearchHive's built-in rate limiting and retries handle transient errors, but add a small delay (1-2 seconds) between requests for courtesy.
Missing data. Some restaurant websites have minimal online presence. The DeepDive schema uses optional fields -- missing data simply won't appear in the output rather than causing errors.
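If you export such rows to CSV, make sure every row ends up with the same columns. A small normalization sketch that pads missing fields with empty strings (the key list mirrors the schema used above):

```python
import csv
import io

SCHEMA_KEYS = ["name", "cuisine", "address", "phone", "price_range", "rating"]

def normalize(rows, keys=SCHEMA_KEYS):
    """Give every row exactly the schema's columns, filling gaps with ''."""
    return [{k: r.get(k, "") for k in keys} for r in rows]

# Sample rows with missing and extra fields
rows = [
    {"name": "Sushi Example", "rating": "4.5"},
    {"name": "Pizza Example", "phone": "555-0100", "extra": "dropped"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=SCHEMA_KEYS)
writer.writeheader()
writer.writerows(normalize(rows))
print(buf.getvalue().splitlines()[0])  # name,cuisine,address,phone,price_range,rating
```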
Anti-bot challenges. Sites like Yelp and Google Maps serve CAPTCHAs to suspicious traffic. ScrapeForge's proxy rotation and headless Chrome fingerprints minimize this, but if you hit CAPTCHAs, reduce your request rate or use SearchHive's built-in retry logic.
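When a request still fails, retrying with exponential backoff is a common fallback on top of the SDK's own retries. A generic sketch -- `flaky_scrape` below is a stand-in for a real `client.scrape` call, not part of the SearchHive SDK:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller handle it
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Stand-in for a scrape call that fails twice, then succeeds
calls = {"n": 0}
def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("blocked")
    return "page content"

print(with_backoff(flaky_scrape, base_delay=0.01))  # page content
```

In the pipeline above you would wrap the scrape call as `with_backoff(lambda: scrape.scrape(url, format="markdown", render_js=True))`.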
Duplicate listings. The same restaurant may appear on multiple platforms. Deduplicate by name and address after scraping:
def deduplicate(restaurants):
    seen = set()
    unique = []
    for r in restaurants:
        key = (r.get("name", "").lower(), r.get("address", "").lower())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

restaurants = deduplicate(restaurants)
Complete Code Example
Here's the full pipeline as a reusable script:
from searchhive import SwiftSearch, ScrapeForge, DeepDive
import json, time

API_KEY = "YOUR_API_KEY"

RESTAURANT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "cuisine": {"type": "string"},
        "address": {"type": "string"},
        "phone": {"type": "string"},
        "price_range": {"type": "string"},
        "rating": {"type": "string"}
    }
}

def scrape_restaurants(query, num_results=15):
    search = SwiftSearch(api_key=API_KEY)
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)

    results = search.search(query, num_results=num_results)
    urls = [r.url for r in results if r.url]

    restaurants = []
    for i, url in enumerate(urls):
        try:
            raw = scrape.scrape(url, format="markdown", render_js=True)
            data = deep.extract(content=raw.content, schema=RESTAURANT_SCHEMA)
            data.data["source_url"] = url
            restaurants.append(data.data)
            print(f"  [{i+1}/{len(urls)}] {data.data.get('name', 'N/A')}")
            time.sleep(1)
        except Exception as e:
            print(f"  [{i+1}/{len(urls)}] Failed: {e}")
    return restaurants

if __name__ == "__main__":
    data = scrape_restaurants("best pizza restaurants in Chicago", num_results=10)
    with open("restaurant_data.json", "w") as f:
        json.dump(data, f, indent=2)
    print(f"Done: {len(data)} restaurants saved to restaurant_data.json")
Next Steps
- Monitor competitors: Run this pipeline on a schedule to track menu changes, price updates, and new reviews
- Geographic expansion: Query different neighborhoods and cities to build a regional database
- Combine data sources: Cross-reference Yelp ratings with Google Maps data for a more complete picture
- Sentiment analysis: Feed review text into an NLP pipeline to track customer sentiment over time
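For the geographic-expansion idea, one approach is to template queries over a neighborhood list and feed each one to `scrape_restaurants`. A sketch -- the neighborhoods and cuisines below are just sample values:

```python
NEIGHBORHOODS = ["Williamsburg", "Astoria", "Harlem"]
CUISINES = ["sushi", "pizza"]

def build_queries(cuisines, areas, city="NYC"):
    """One search query per cuisine/area pair."""
    return [f"best {c} restaurants in {a} {city}" for c in cuisines for a in areas]

queries = build_queries(CUISINES, NEIGHBORHOODS)
print(len(queries))   # 6
print(queries[0])     # best sushi restaurants in Williamsburg NYC
```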
Get started with SearchHive's free tier -- 500 credits, no credit card needed. Check the API docs for full SDK reference.
See also: /blog/how-to-scrape-google-search-results-with-python and /blog/how-to-monitor-competitor-prices-with-python-automated-system.