How to Scrape Restaurant Data for Competitive Analysis
Restaurant data -- menus, prices, ratings, hours, and location details -- is valuable for market research, competitive intelligence, and food delivery platforms. Whether you're building a restaurant directory, tracking pricing trends, or analyzing competitor strategy, this tutorial shows you how to scrape restaurant data at scale using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Restaurant data lives across Google Maps, Yelp, TripAdvisor, and individual restaurant websites
- SearchHive's ScrapeForge API handles JavaScript rendering and anti-bot protection on review sites
- The SwiftSearch API finds restaurant listings by location, cuisine, and price range
- Structured extraction with DeepDive converts raw page content into clean JSON
- A complete pipeline scrapes, structures, and exports data in under 50 lines of Python
Prerequisites
- Python 3.8 or later
- SearchHive API key (free tier -- 500 credits)
- pip install requests searchhive
Step 1: Find Restaurants with SwiftSearch
Before scraping individual restaurant pages, you need a list of targets. SearchHive's SwiftSearch API queries search engines to find restaurants matching your criteria:
from searchhive import SwiftSearch

client = SwiftSearch(api_key="YOUR_API_KEY")

results = client.search(
    "best Italian restaurants in Brooklyn NY 2024",
    num_results=20
)

for r in results:
    print(f"{r.title} - {r.url}")
This returns ranked results with titles, URLs, and snippets. You can filter by cuisine type, neighborhood, or any other search criteria. SwiftSearch queries real search engines, so you get the same results a user would see.
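In practice you will often want to separate restaurants' own websites from aggregator pages before scraping, since each needs different extraction schemas. A minimal post-filtering sketch -- plain dicts stand in for SwiftSearch result objects here, and the `AGGREGATORS` set is an illustrative assumption, not part of the SDK:

```python
from urllib.parse import urlparse

# Hypothetical aggregator domains you may want to handle separately
AGGREGATORS = {"yelp.com", "tripadvisor.com", "opentable.com"}

def split_results(results):
    """Split search results into restaurant sites and aggregator pages."""
    own_sites, aggregator_pages = [], []
    for r in results:
        host = urlparse(r["url"]).netloc
        if host.startswith("www."):
            host = host[4:]
        (aggregator_pages if host in AGGREGATORS else own_sites).append(r)
    return own_sites, aggregator_pages

# Sample data shaped like SwiftSearch results
results = [
    {"title": "Luigi's Trattoria", "url": "https://www.luigis-example.com"},
    {"title": "Luigi's - Yelp", "url": "https://www.yelp.com/biz/luigis-brooklyn"},
]
own, agg = split_results(results)
print(len(own), len(agg))  # 1 1
```

You could route the aggregator pages through the review-extraction schema from Step 4 and the rest through the basic restaurant schema.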
Step 2: Scrape Individual Restaurant Pages
Once you have URLs, use ScrapeForge to extract content from each restaurant's page. Most restaurant websites use JavaScript for menus, hours, and reservation widgets, so headless rendering is essential:
from searchhive import ScrapeForge
client = ScrapeForge(api_key="YOUR_API_KEY")
url = "https://www.example-restaurant.com"
result = client.scrape(url, format="markdown", render_js=True)
print(result.content[:500])
The render_js=True parameter tells ScrapeForge to use headless Chrome, ensuring dynamic content like embedded menus and Google Maps widgets loads fully. The format="markdown" option converts the HTML into clean, readable text.
Step 3: Extract Structured Data with DeepDive
Raw markdown is useful for reading, but for a database or analysis pipeline, you need structured data. SearchHive's DeepDive API uses AI to extract specific fields from scraped content:
from searchhive import DeepDive
client = DeepDive(api_key="YOUR_API_KEY")
structured = client.extract(
    content=raw_markdown,  # from ScrapeForge
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Restaurant name"},
            "cuisine": {"type": "string", "description": "Type of cuisine"},
            "address": {"type": "string"},
            "phone": {"type": "string"},
            "hours": {"type": "string", "description": "Operating hours"},
            "price_range": {"type": "string", "description": "Dollar sign rating"},
            "rating": {"type": "string"},
            "menu_highlights": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Notable menu items"
            }
        }
    }
)
print(structured.data)
DeepDive returns a Python dictionary matching your schema. The AI understands context -- it recognizes that "$$" indicates moderate pricing and that "Mon-Fri 11am-10pm" belongs in the hours field.
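Since sparse pages may leave some schema fields out of the result, read the returned dictionary defensively. A short sketch with an illustrative payload (the dict below is sample data, not real API output):

```python
# Illustrative DeepDive-style output; real payloads depend on the page scraped
data = {
    "name": "Trattoria Esempio",
    "price_range": "$$",
    "menu_highlights": ["cacio e pepe", "tiramisu"],
}

# .get() with defaults avoids KeyError when a field was missing from the page
name = data.get("name", "Unknown")
rating = data.get("rating", "n/a")  # absent here, so the default is used
highlights = ", ".join(data.get("menu_highlights", []))

print(f"{name} ({data.get('price_range', '?')}) rating={rating}: {highlights}")
```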
Step 4: Scrape Review Platforms for Ratings
Google Maps, Yelp, and TripAdvisor are the main sources for restaurant ratings and reviews. These sites use heavy JavaScript rendering and aggressive anti-bot measures, making them difficult to scrape with basic HTTP requests:
from searchhive import ScrapeForge, DeepDive

client = ScrapeForge(api_key="YOUR_API_KEY")

# Scrape a Yelp restaurant page
yelp_url = "https://www.yelp.com/biz/example-restaurant-brooklyn"
result = client.scrape(yelp_url, format="markdown", render_js=True)

# Extract review data with DeepDive
deep = DeepDive(api_key="YOUR_API_KEY")
review_data = deep.extract(
    content=result.content,
    schema={
        "type": "object",
        "properties": {
            "overall_rating": {"type": "string"},
            "total_reviews": {"type": "string"},
            "price_range": {"type": "string"},
            "top_dishes": {
                "type": "array",
                "items": {"type": "string"}
            },
            "recent_reviews": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "rating": {"type": "string"},
                        "text": {"type": "string"},
                        "date": {"type": "string"}
                    }
                }
            }
        }
    }
)
print(f"Rating: {review_data.data['overall_rating']} ({review_data.data['total_reviews']} reviews)")
ScrapeForge handles the JavaScript rendering and proxy rotation automatically. Yelp and similar sites typically block direct scraping after a few requests -- SearchHive's built-in proxy rotation distributes requests across multiple IPs to avoid detection.
Step 5: Build a Batch Scraping Pipeline
For competitive analysis, you typically need data from dozens or hundreds of restaurants. Here's a complete pipeline that finds, scrapes, structures, and exports restaurant data:
from searchhive import SwiftSearch, ScrapeForge, DeepDive
import json
import csv
import time
API_KEY = "YOUR_API_KEY"
search = SwiftSearch(api_key=API_KEY)
scrape = ScrapeForge(api_key=API_KEY)
deep = DeepDive(api_key=API_KEY)
# Step 1: Find restaurants
query = "top rated sushi restaurants in Manhattan NYC"
search_results = search.search(query, num_results=15)
restaurant_urls = [r.url for r in search_results if r.url]
# Step 2: Scrape and extract each restaurant
restaurants = []
for i, url in enumerate(restaurant_urls):
    try:
        raw = scrape.scrape(url, format="markdown", render_js=True)
        structured = deep.extract(
            content=raw.content,
            schema={
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "cuisine": {"type": "string"},
                    "address": {"type": "string"},
                    "phone": {"type": "string"},
                    "price_range": {"type": "string"},
                    "rating": {"type": "string"},
                    "url": {"type": "string"}
                }
            }
        )
        structured.data["url"] = url
        restaurants.append(structured.data)
        print(f"[{i+1}/{len(restaurant_urls)}] {structured.data.get('name', 'Unknown')}")
        time.sleep(1)  # Rate limiting courtesy
    except Exception as e:
        print(f"[{i+1}/{len(restaurant_urls)}] Error: {e}")

# Step 3: Export to CSV (use the union of keys so rows with extra or
# missing fields don't break DictWriter)
if restaurants:
    fieldnames = sorted({k for r in restaurants for k in r})
    with open("restaurants.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(restaurants)

# Step 4: Save raw JSON for further analysis
with open("restaurants.json", "w") as f:
    json.dump(restaurants, f, indent=2)

print(f"Scraped {len(restaurants)} restaurants successfully")
Step 6: Handle Common Issues
Rate limiting. Restaurant directories and review sites throttle aggressive scraping. SearchHive's built-in rate limiting and retries handle transient errors, but add a small delay (1-2 seconds) between requests for courtesy.
Missing data. Some restaurant websites have minimal online presence. The DeepDive schema uses optional fields -- missing data simply won't appear in the output rather than causing errors.
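If you export such rows to CSV, make sure every row ends up with the same columns. A small normalization sketch that pads missing fields with empty strings (the key list mirrors the schema used above):

```python
import csv
import io

SCHEMA_KEYS = ["name", "cuisine", "address", "phone", "price_range", "rating"]

def normalize(rows, keys=SCHEMA_KEYS):
    """Give every row exactly the schema's columns, filling gaps with ''."""
    return [{k: r.get(k, "") for k in keys} for r in rows]

# Sample rows with missing and extra fields
rows = [
    {"name": "Sushi Example", "rating": "4.5"},
    {"name": "Pizza Example", "phone": "555-0100", "extra": "dropped"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=SCHEMA_KEYS)
writer.writeheader()
writer.writerows(normalize(rows))
print(buf.getvalue().splitlines()[0])  # name,cuisine,address,phone,price_range,rating
```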
Anti-bot challenges. Sites like Yelp and Google Maps serve CAPTCHAs to suspicious traffic. ScrapeForge's proxy rotation and headless Chrome fingerprints minimize this, but if you hit CAPTCHAs, reduce your request rate or use SearchHive's built-in retry logic.
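When a request still fails, retrying with exponential backoff is a common fallback on top of the SDK's own retries. A generic sketch -- `flaky_scrape` below is a stand-in for a real `client.scrape` call, not part of the SearchHive SDK:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller handle it
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Stand-in for a scrape call that fails twice, then succeeds
calls = {"n": 0}
def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("blocked")
    return "page content"

print(with_backoff(flaky_scrape, base_delay=0.01))  # page content
```

In the pipeline above you would wrap the scrape call as `with_backoff(lambda: scrape.scrape(url, format="markdown", render_js=True))`.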
Duplicate listings. The same restaurant may appear on multiple platforms. Deduplicate by name and address after scraping:
def deduplicate(restaurants):
    seen = set()
    unique = []
    for r in restaurants:
        key = (r.get("name", "").lower(), r.get("address", "").lower())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

restaurants = deduplicate(restaurants)
Complete Code Example
Here's the full pipeline as a reusable script:
from searchhive import SwiftSearch, ScrapeForge, DeepDive
import json, time

API_KEY = "YOUR_API_KEY"

RESTAURANT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "cuisine": {"type": "string"},
        "address": {"type": "string"},
        "phone": {"type": "string"},
        "price_range": {"type": "string"},
        "rating": {"type": "string"}
    }
}

def scrape_restaurants(query, num_results=15):
    search = SwiftSearch(api_key=API_KEY)
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)

    results = search.search(query, num_results=num_results)
    urls = [r.url for r in results if r.url]

    restaurants = []
    for i, url in enumerate(urls):
        try:
            raw = scrape.scrape(url, format="markdown", render_js=True)
            data = deep.extract(content=raw.content, schema=RESTAURANT_SCHEMA)
            data.data["source_url"] = url
            restaurants.append(data.data)
            print(f"  [{i+1}/{len(urls)}] {data.data.get('name', 'N/A')}")
            time.sleep(1)
        except Exception as e:
            print(f"  [{i+1}/{len(urls)}] Failed: {e}")
    return restaurants

if __name__ == "__main__":
    data = scrape_restaurants("best pizza restaurants in Chicago", num_results=10)
    with open("restaurant_data.json", "w") as f:
        json.dump(data, f, indent=2)
    print(f"Done: {len(data)} restaurants saved to restaurant_data.json")
Next Steps
- Monitor competitors: Run this pipeline on a schedule to track menu changes, price updates, and new reviews
- Geographic expansion: Query different neighborhoods and cities to build a regional database
- Combine data sources: Cross-reference Yelp ratings with Google Maps data for a more complete picture
- Sentiment analysis: Feed review text into an NLP pipeline to track customer sentiment over time
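For the geographic-expansion idea, one approach is to template queries over a neighborhood list and feed each one to `scrape_restaurants`. A sketch -- the neighborhoods and cuisines below are just sample values:

```python
NEIGHBORHOODS = ["Williamsburg", "Astoria", "Harlem"]
CUISINES = ["sushi", "pizza"]

def build_queries(cuisines, areas, city="NYC"):
    """One search query per cuisine/area pair."""
    return [f"best {c} restaurants in {a} {city}" for c in cuisines for a in areas]

queries = build_queries(CUISINES, NEIGHBORHOODS)
print(len(queries))   # 6
print(queries[0])     # best sushi restaurants in Williamsburg NYC
```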
Get started with SearchHive's free tier -- 500 credits, no credit card needed. Check the API docs for full SDK reference.
See also: /blog/how-to-scrape-google-search-results-with-python and /blog/how-to-monitor-competitor-prices-with-python-automated-system.