Glassdoor is one of the richest sources of employer intelligence available publicly. Salary data, employee reviews, interview experiences, and company culture insights -- all structured by company, role, and location. For HR researchers, recruiters, and job market analysts, this data is extremely valuable.
But Glassdoor's anti-scraping measures are among the most aggressive of any public website. The site combines Cloudflare protection, CAPTCHA challenges, and dynamic content loading, which together make traditional scraping nearly impossible. This guide covers realistic approaches and working code.
Key Takeaways
- Glassdoor blocks most scraping attempts with Cloudflare, CAPTCHAs, and dynamic rendering.
- Dedicated APIs (SerpAPI, Bright Data, RapidAPI endpoints) are the most reliable source of Glassdoor data.
- SearchHive ScrapeForge + DeepDive can extract some Glassdoor data for light research needs, but heavy scraping requires specialized tools.
- Respect robots.txt and legal boundaries. Glassdoor has taken legal action against scrapers in the past.
Prerequisites
- Python 3.8+
- requests library (pip install requests)
- A SearchHive API key (free signup with 500 credits)
- (Optional) SerpAPI account for reliable Glassdoor data
Step 1: Understand Glassdoor's Protection Stack
Glassdoor employs multiple layers of defense:
- Cloudflare -- Blocks requests from data centers and known proxy networks. Most cloud-hosted scrapers are rejected immediately.
- CAPTCHA challenges -- Even legitimate users face CAPTCHAs after a few page loads.
- JavaScript rendering -- Review content, salary data, and company info load via AJAX after the initial page render.
- Session tracking -- Glassdoor ties sessions to browser fingerprints and IP addresses.
- Rate limiting per IP -- After 10-20 requests, you'll get blocked or CAPTCHA'd.
A direct scraping attempt:
# This will likely return a Cloudflare challenge page
import requests

resp = requests.get("https://www.glassdoor.com/Reviews/company-reviews.htm")
print(resp.status_code)           # typically 403 when the request is blocked
print("Cloudflare" in resp.text)  # typically True on a challenge page
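Before parsing any response, it helps to check whether it is a block page rather than real content. The sketch below is a heuristic only: the function name `looks_like_challenge`, the status codes, and the marker strings are assumptions chosen for illustration, and challenge pages change often enough that the markers will need tuning.

```python
from typing import Mapping, Optional

# Assumed markers: strings that often appear on challenge interstitials.
# They are illustrative, not an official or stable list.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "attention required")
BLOCK_STATUS_CODES = {403, 429, 503}


def looks_like_challenge(status_code: int, body: str,
                         headers: Optional[Mapping[str, str]] = None) -> bool:
    """Heuristically decide whether a response is a block or challenge page."""
    normalized = {k.lower(): v for k, v in (headers or {}).items()}
    # A "cf-mitigated: challenge" header is treated as a strong hint.
    if normalized.get("cf-mitigated", "").lower() == "challenge":
        return True
    if status_code not in BLOCK_STATUS_CODES:
        return False
    text = body.lower()
    return any(marker in text for marker in CHALLENGE_MARKERS)
```

With a `requests` response this would be called as `looks_like_challenge(resp.status_code, resp.text, resp.headers)`, and a positive result is a signal to stop and switch to one of the API-based approaches rather than retrying.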
Step 2: Extract Glassdoor Data via SearchHive DeepDive
For light research needs -- pulling a few pages of reviews or salary data -- SearchHive's DeepDive can extract the content from publicly indexed Glassdoor pages:
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.searchhive.dev/v1"


def scrape_glassdoor_reviews(company_name):
    """Extract Glassdoor review data for a specific company.

    Uses web search to find indexed Glassdoor pages, then extracts data.
    Args: company_name
    Returns: structured review data
    """
    # First, find the Glassdoor page via search
    search_response = requests.post(
        BASE_URL + "/search",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "query": "site:glassdoor.com " + company_name + " reviews",
            "num_results": 5
        }
    )
    results = search_response.json().get("results", [])
    if not results:
        return {"error": "No Glassdoor pages found in search results"}

    # Extract data from the top result
    top_url = results[0].get("url", "")
    extract_response = requests.post(
        BASE_URL + "/deepdive",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "url": top_url,
            "prompt": (
                "Extract Glassdoor review data for " + company_name + ": "
                "overall rating out of 5, number of reviews, "
                "pros and cons from the most recent 5 reviews, "
                "CEO approval rating, and recommend to a friend percentage. "
                "Return as structured JSON."
            )
        }
    )
    return extract_response.json()


# Example
data = scrape_glassdoor_reviews("Google")
print("Rating: " + str(data.get("overall_rating", "N/A")) + "/5")
print("Reviews: " + str(data.get("num_reviews", "N/A")))
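Taking `results[0]` works when the search honors the `site:` filter, but the top hit is not guaranteed to be a Glassdoor page. A small filter can pick the first result that really is hosted on glassdoor.com. This is a sketch: `first_glassdoor_url` and its `path_hint` parameter are hypothetical helpers, and the only thing assumed about the search response is the `url` key already used above.

```python
from typing import Iterable, Mapping, Optional
from urllib.parse import urlparse


def first_glassdoor_url(results: Iterable[Mapping[str, str]],
                        path_hint: str = "") -> Optional[str]:
    """Return the first result URL hosted on glassdoor.com, or None.

    path_hint is an optional, case-insensitive substring that the URL
    path must contain (an assumption about how pages are organized).
    """
    for result in results:
        url = result.get("url", "")
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        # Match the bare domain and any subdomain, but not look-alike hosts.
        on_glassdoor = host == "glassdoor.com" or host.endswith(".glassdoor.com")
        if on_glassdoor and path_hint.lower() in parsed.path.lower():
            return url
    return None
```

In `scrape_glassdoor_reviews`, `top_url = first_glassdoor_url(results)` could replace the direct index, with the function returning the existing error dict when the result is `None`.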
Extract Salary Data
def scrape_glassdoor_salaries(company_name, job_title=None):
    """Extract salary data from Glassdoor for a company and role."""
    query = "site:glassdoor.com " + company_name
    if job_title:
        query += " " + job_title
    query += " salary"

    search_response = requests.post(
        BASE_URL + "/search",
        headers={"Authorization": "Bearer " + API_KEY},
        json={"query": query, "num_results": 5}
    )
    results = search_response.json().get("results", [])
    if not results:
        return {"error": "No salary pages found"}

    url = results[0].get("url", "")
    response = requests.post(
        BASE_URL + "/deepdive",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "url": url,
            "prompt": (
                "Extract salary data: job title, base salary range (low, average, high), "
                "total compensation if shown, and any location-specific data. "
                "Return as a JSON array of salary entries."
            )
        }
    )
    return response.json()


salaries = scrape_glassdoor_salaries("Google", "Software Engineer")
# The response may be a bare list or a dict wrapping the list under "salaries"
entries = salaries if isinstance(salaries, list) else salaries.get("salaries", [])
for s in entries:
    print(str(s.get("title", "N/A")) + ": $" + str(s.get("avg", "N/A")) + "/yr")
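Because the extraction is prompt-driven, the shape of the returned JSON can vary between calls. A defensive normalizer keeps the printing loop from failing on an error dict or a malformed entry. The helper names below are hypothetical, and the `salaries`, `title`, and `avg` keys are the same assumptions the example above already makes about the response.

```python
from typing import Any, Dict, List


def normalize_salary_entries(payload: Any) -> List[Dict[str, Any]]:
    """Coerce a salary response into a list of dict entries."""
    if isinstance(payload, list):
        entries = payload
    elif isinstance(payload, dict):
        entries = payload.get("salaries") or []
    else:
        entries = []
    if not isinstance(entries, list):
        return []
    # Drop anything that is not a dict so callers can safely use .get().
    return [entry for entry in entries if isinstance(entry, dict)]


def format_salary(entry: Dict[str, Any]) -> str:
    """Render one entry, falling back to N/A when the average is not numeric."""
    title = entry.get("title", "N/A")
    avg = entry.get("avg")
    if isinstance(avg, (int, float)) and not isinstance(avg, bool):
        return f"{title}: ${avg:,.0f}/yr"
    return f"{title}: N/A"
```

Usage would be `for entry in normalize_salary_entries(salaries): print(format_salary(entry))`.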
Step 3: Use SerpAPI for Reliable Glassdoor Extraction
For production workloads, SerpAPI maintains a dedicated Glassdoor endpoint that handles all the anti-bot complexity:
SERPAPI_KEY = "your-serpapi-key"


def glassdoor_serpapi(company, location=None):
    """Use SerpAPI's Glassdoor endpoint for reliable data extraction."""
    # Confirm the engine name and response fields against SerpAPI's
    # current documentation before relying on them.
    params = {
        "engine": "glassdoor",
        "q": company + " reviews",
        "api_key": SERPAPI_KEY
    }
    if location:
        params["l"] = location

    response = requests.get(
        "https://serpapi.com/search",
        params=params
    )
    if response.status_code == 200:
        data = response.json()
        company_info = data.get("company_info", {})
        reviews = data.get("reviews", [])
        return {
            "company": company_info.get("name"),
            "rating": company_info.get("rating"),
            "review_count": company_info.get("rating_count"),
            "recommend": company_info.get("recommend_to_friend"),
            "ceo_approval": company_info.get("ceo_approval"),
            "recent_reviews": [
                {
                    "title": r.get("title"),
                    "rating": r.get("rating"),
                    "date": r.get("date"),
                    "pros": r.get("pros"),
                    "cons": r.get("cons")
                }
                for r in reviews[:10]
            ]
        }
    raise Exception("SerpAPI error: " + str(response.status_code))
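Since `glassdoor_serpapi` raises on any non-200 response, a production pipeline will usually want to retry transient failures instead of aborting. The following is a generic exponential backoff sketch, not part of any SerpAPI client: `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay: float = 1.0,
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on failure with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            # Re-raise the original error once the attempts are used up.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

A call such as `call_with_backoff(lambda: glassdoor_serpapi("Google"))` would wait 1, 2, and then 4 seconds between its four attempts. Retrying every exception is a simplification; in practice only rate-limit and server errors are worth retrying.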
Step 4: Build a Company Benchmarking Tool
Combine data sources to benchmark companies against each other:
import json


class CompanyBenchmark:
    def __init__(self, searchhive_key):
        self.api_key = searchhive_key
        self.base_url = "https://api.searchhive.dev/v1"
        self.headers = {"Authorization": "Bearer " + searchhive_key}

    def get_review_summary(self, company):
        """Get a summary of Glassdoor reviews for a company."""
        response = requests.post(
            self.base_url + "/search",
            headers=self.headers,
            json={
                "query": company + " glassdoor reviews rating",
                "num_results": 5
            }
        )
        results = response.json().get("results", [])
        if not results:
            return None

        # Use DeepDive on the most relevant result
        url = results[0].get("url", "")
        response = requests.post(
            self.base_url + "/deepdive",
            headers=self.headers,
            json={
                "url": url,
                "prompt": (
                    "Extract the Glassdoor rating (out of 5), number of reviews, "
                    "and summarize the top 3 pros and top 3 cons mentioned. "
                    "Return as JSON with keys: rating, review_count, top_pros (array), top_cons (array)."
                )
            }
        )
        return response.json()

    def get_salary_data(self, company, role):
        """Get salary estimates for a role at a company."""
        response = requests.post(
            self.base_url + "/search",
            headers=self.headers,
            json={
                "query": company + " " + role + " glassdoor salary range",
                "num_results": 3
            }
        )
        results = response.json().get("results", [])
        if results:
            response = requests.post(
                self.base_url + "/deepdive",
                headers=self.headers,
                json={
                    "url": results[0].get("url", ""),
                    "prompt": (
                        "Extract the salary range for " + role + " at " + company + ". "
                        "Include base salary (low, median, high) and total compensation if available."
                    )
                }
            )
            return response.json()
        return None

    def benchmark(self, companies, role=None):
        """Benchmark multiple companies."""
        benchmark_data = {}
        for company in companies:
            print("Benchmarking " + company + "...")
            review_data = self.get_review_summary(company)
            salary_data = self.get_salary_data(company, role) if role else None
            benchmark_data[company] = {
                "reviews": review_data,
                "salary": salary_data
            }
        return benchmark_data


# Compare competitors
bench = CompanyBenchmark(API_KEY)
comparison = bench.benchmark(
    companies=["Google", "Meta", "Apple"],
    role="Software Engineer"
)
print(json.dumps(comparison, indent=2))
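The nested dictionary returned by `benchmark` is awkward to compare at a glance. One option is to flatten it into ranked rows and export them as CSV. This sketch assumes the review summary comes back with the `rating` and `review_count` keys requested in the prompt above, which the extraction does not guarantee; `benchmark_to_rows` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
from typing import Any, Dict, List

FIELDS = ["company", "rating", "review_count", "has_salary_data"]


def benchmark_to_rows(benchmark_data: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten benchmark output into one row per company, best rating first."""
    rows = []
    for company, data in benchmark_data.items():
        reviews = data.get("reviews")
        if not isinstance(reviews, dict):
            reviews = {}  # missing or malformed review summary
        rows.append({
            "company": company,
            "rating": reviews.get("rating"),
            "review_count": reviews.get("review_count"),
            "has_salary_data": data.get("salary") is not None,
        })

    def sort_key(row):
        rating = row["rating"]
        # Numeric ratings sort descending; anything else goes last.
        if isinstance(rating, (int, float)):
            return (0, -rating)
        return (1, 0)

    rows.sort(key=sort_key)
    return rows


def rows_to_csv(rows: List[Dict[str, Any]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `print(rows_to_csv(benchmark_to_rows(comparison)))` would then give a table that can be pasted into a spreadsheet.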
Step 5: Compare Glassdoor Data API Providers
| Provider | Reliability | Pricing | Best For |
|---|---|---|---|
| SearchHive | Medium (search-based) | $0-$199/mo | Light research, multi-purpose scraping |
| SerpAPI | High (dedicated endpoint) | $50+/mo (5K searches) | Production Glassdoor data pipelines |
| Bright Data | High (proxy + scraping) | Custom pricing | Enterprise-scale data collection |
| RapidAPI endpoints | Variable | $10-50/mo | Quick prototyping |
Common Issues
Cloudflare blocking
SearchHive uses proxy rotation that handles most Cloudflare challenges. If you're still getting blocked:
- Try accessing cached or archived versions of Glassdoor pages (Google retired its public cache in 2024, so a web archive is the likelier option)
- Use search-based extraction (finding indexed pages in search results) rather than direct Glassdoor URLs
- For high-volume needs, switch to SerpAPI's dedicated Glassdoor endpoint
Data accuracy
Glassdoor data is self-reported, which means:
- Salary data may be inflated or outdated
- Reviews may be biased (disgruntled employees are more likely to post)
- Company ratings can be manipulated
Always cross-reference with other sources (Levels.fyi, LinkedIn Salary, Payscale) for salary benchmarking.
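A simple way to cross-reference is to collect one estimate per source for the same role and flag any source that sits far from the median. The sketch below is illustrative: `flag_salary_outliers` is a hypothetical helper, the 15% tolerance is an arbitrary assumption, and the figures in the usage note are made up rather than real salary data.

```python
from statistics import median
from typing import Any, Dict


def flag_salary_outliers(estimates: Dict[str, float],
                         tolerance: float = 0.15) -> Dict[str, Any]:
    """Return the median estimate and the sources deviating by more than tolerance."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    mid = median(estimates.values())
    outliers = sorted(
        source for source, value in estimates.items()
        if abs(value - mid) > tolerance * mid
    )
    return {"median": mid, "outliers": outliers}
```

For example, with invented inputs `{"glassdoor": 200000, "levels_fyi": 160000, "payscale": 150000}`, the median is 160000 and only `glassdoor` falls outside the 15% band, which would be a cue to look more closely at that figure before using it.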
Legal considerations
Glassdoor's Terms of Service prohibit automated data collection. Using an API service like SerpAPI may shift some of the legal risk, but it does not remove it. Keep the following in mind:
- Don't republish Glassdoor content without attribution
- Don't use scraped data for commercial products without permission
- Check your jurisdiction's laws on web scraping and data collection
Next Steps
- Start researching today: Sign up at searchhive.dev with 500 free credits and extract your first company benchmark.
- For production HR analytics: Consider SerpAPI's Glassdoor endpoint for reliable, structured data at scale.
- Read the docs: Visit searchhive.dev/docs for the complete API reference.