How to Scrape Glassdoor Data for HR Research

Glassdoor is one of the richest publicly available sources of employer intelligence. It offers salary data, employee reviews, interview experiences, and company culture insights, all structured by company, role, and location. For HR researchers, recruiters, and job market analysts, this data is extremely valuable.

But Glassdoor's anti-scraping measures are among the most aggressive of any public website. It uses Cloudflare protection, CAPTCHA challenges, and dynamic content loading that together make traditional scraping nearly impossible. This guide covers realistic approaches, with example code for each.

Key Takeaways

Glassdoor blocks most scraping attempts with Cloudflare, CAPTCHAs, and dynamic rendering.
Dedicated APIs (SerpAPI, Bright Data, RapidAPI endpoints) are the most reliable source of Glassdoor data.
SearchHive ScrapeForge + DeepDive can extract some Glassdoor data for light research needs, but heavy scraping requires specialized tools.
Respect robots.txt and legal boundaries. Glassdoor has taken legal action against scrapers in the past.

Prerequisites

Python 3.8+
requests library (pip install requests)
A SearchHive API key (free signup with 500 credits)
(Optional) SerpAPI account for reliable Glassdoor data

Step 1: Understand Glassdoor's Protection Stack

Glassdoor employs multiple layers of defense:

Cloudflare -- Blocks requests from data centers and known proxy networks. Most cloud-hosted scrapers are rejected immediately.
CAPTCHA challenges -- Even legitimate users face CAPTCHAs after a few page loads.
JavaScript rendering -- Review content, salary data, and company info load via AJAX after the initial page render.
Session tracking -- Glassdoor ties sessions to browser fingerprints and IP addresses.
Rate limiting per IP -- After 10-20 requests, you'll get blocked or CAPTCHA'd.
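
The per-IP limit in the last item is a reason to pace any client you run. As an illustration only, here is a sketch of a capped exponential backoff schedule. The helper name `backoff_delays` and its default values are hypothetical choices, not thresholds published by Glassdoor.

```python
def backoff_delays(retries: int, base: float = 2.0, cap: float = 60.0) -> list:
    """Return wait times in seconds: base, 2*base, 4*base, ... never above cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

# Waits to apply between successive retries after a block or CAPTCHA.
print(backoff_delays(5))
```

Doubling the wait after each failure keeps a blocked client from hammering the site, and the cap stops the delay from growing without bound.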

A direct scraping attempt:

# This will likely return a Cloudflare challenge page instead of review content
import requests

resp = requests.get("https://www.glassdoor.com/Reviews/company-reviews.htm", timeout=10)
print(resp.status_code)  # typically 403 when the request is blocked
print("Cloudflare" in resp.text)  # usually True on a challenge page
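
A response like the one above can be classified before any parsing is attempted. The following is a rough heuristic sketch: the status codes and marker strings are assumptions about what block and challenge pages commonly look like, not a documented Glassdoor or Cloudflare contract, and `looks_blocked` is a hypothetical helper name.

```python
# Heuristic values; these are assumptions and may need tuning.
BLOCK_STATUS_CODES = {403, 429, 503}
CHALLENGE_MARKERS = ("captcha", "just a moment", "attention required")

def looks_blocked(status_code: int, body: str) -> bool:
    """Guess whether a response is an anti-bot page rather than real content."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)

print(looks_blocked(403, ""))
print(looks_blocked(200, "<html><body>Company reviews</body></html>"))
```

Checking this first avoids feeding a challenge page into an extraction step and mistaking the empty result for missing data.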

Step 2: Extract Glassdoor Data via SearchHive DeepDive

For light research needs -- pulling a few pages of reviews or salary data -- SearchHive's DeepDive can extract the content from publicly indexed Glassdoor pages:

import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.searchhive.dev/v1"

def scrape_glassdoor_reviews(company_name):
    """Extract Glassdoor review data for a specific company.

    Uses web search to find indexed Glassdoor pages, then extracts data.
    Args: company_name
    Returns: structured review data
    """
    
    # First, find the Glassdoor page via search
    search_response = requests.post(
        BASE_URL + "/search",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "query": "site:glassdoor.com " + company_name + " reviews",
            "num_results": 5
        }
    )
    
    results = search_response.json().get("results", [])
    
    if not results:
        return {"error": "No Glassdoor pages found in search results"}
    
    # Extract data from the top result
    top_url = results[0].get("url", "")
    
    extract_response = requests.post(
        BASE_URL + "/deepdive",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "url": top_url,
            "prompt": (
                "Extract Glassdoor review data for " + company_name + ": "
                "overall rating out of 5, number of reviews, "
                "pros and cons from the most recent 5 reviews, "
                "CEO approval rating, and recommend to a friend percentage. "
                "Return as structured JSON."
            )
        }
    )
    
    return extract_response.json()

# Example
data = scrape_glassdoor_reviews("Google")
# Key names depend on the JSON that DeepDive returns for the prompt above
print(f"Rating: {data.get('overall_rating', 'N/A')}/5")
print(f"Reviews: {data.get('num_reviews', 'N/A')}")
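
Prompt-based extraction can hand back the rating in more than one shape, for example a number, a string such as "4.3", or "4.3/5". That variability is an assumption about model output rather than documented DeepDive behavior, and `coerce_rating` is a hypothetical helper sketched here to normalize it.

```python
import re
from typing import Optional

def coerce_rating(value) -> Optional[float]:
    """Convert a loosely formatted rating to a float in [0, 5], else None."""
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value))  # first number in the text
    if not match:
        return None
    rating = float(match.group())
    # Anything outside the 0-5 scale is probably a different field (e.g. a percentage).
    return rating if 0.0 <= rating <= 5.0 else None

print(coerce_rating("4.3/5"))
print(coerce_rating("N/A"))
```

Rejecting out-of-range values keeps a misread field from silently entering a benchmark.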

Extract Salary Data

def scrape_glassdoor_salaries(company_name, job_title=None):
    # Extract salary data from Glassdoor for a company and role.
    query = "site:glassdoor.com " + company_name
    if job_title:
        query += " " + job_title
    query += " salary"
    
    search_response = requests.post(
        BASE_URL + "/search",
        headers={"Authorization": "Bearer " + API_KEY},
        json={"query": query, "num_results": 5}
    )
    
    results = search_response.json().get("results", [])
    
    if not results:
        return {"error": "No salary pages found"}
    
    url = results[0].get("url", "")
    
    response = requests.post(
        BASE_URL + "/deepdive",
        headers={"Authorization": "Bearer " + API_KEY},
        json={
            "url": url,
            "prompt": (
                "Extract salary data: job title, base salary range (low, average, high), "
                "total compensation if shown, and any location-specific data. "
                "Return as a JSON array of salary entries."
            )
        }
    )
    
    return response.json()

salaries = scrape_glassdoor_salaries("Google", "Software Engineer")
# The response may be a bare list or an object wrapping one; handle both
entries = salaries if isinstance(salaries, list) else salaries.get("salaries", [])
for s in entries:
    print(f"{s.get('title', 'N/A')}: ${s.get('avg', 'N/A')}/yr")
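
Salary figures extracted from page text often arrive as strings rather than numbers. As a sketch, assuming formats like "$145,000" or "$120K" (these formats and the helper name `parse_salary` are illustrative assumptions, not a documented output format), they can be normalized to whole dollars:

```python
import re
from typing import Optional

def parse_salary(text: str) -> Optional[int]:
    """Parse strings like "$145,000" or "$120K" into whole dollars per year."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([kK])?", text)
    if not match:
        return None
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing K means thousands
        amount *= 1000
    return int(round(amount))

print(parse_salary("$145,000"))
print(parse_salary("$120K"))
```

Numeric values make it possible to sort, average, and compare entries across companies.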

Step 3: Use SerpAPI for Reliable Glassdoor Extraction

For production workloads, SerpAPI maintains a dedicated Glassdoor endpoint that handles all the anti-bot complexity:

SERPAPI_KEY = "your-serpapi-key"

def glassdoor_serpapi(company, location=None):
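    # Use SerpAPI's Glassdoor endpoint for reliable data extraction.
    # The engine name and the response keys read below are as given in this
    # guide; confirm them against SerpAPI's current documentation.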
    params = {
        "engine": "glassdoor",
        "q": company + " reviews",
        "api_key": SERPAPI_KEY
    }
    
    if location:
        params["l"] = location
    
    response = requests.get(
        "https://serpapi.com/search",
        params=params,
        timeout=30  # avoid hanging indefinitely on a stalled connection
    )
    
    if response.status_code == 200:
        data = response.json()
        
        company_info = data.get("company_info", {})
        reviews = data.get("reviews", [])
        
        return {
            "company": company_info.get("name"),
            "rating": company_info.get("rating"),
            "review_count": company_info.get("rating_count"),
            "recommend": company_info.get("recommend_to_friend"),
            "ceo_approval": company_info.get("ceo_approval"),
            "recent_reviews": [
                {
                    "title": r.get("title"),
                    "rating": r.get("rating"),
                    "date": r.get("date"),
                    "pros": r.get("pros"),
                    "cons": r.get("cons")
                }
                for r in reviews[:10]
            ]
        }
    
    raise RuntimeError("SerpAPI error: " + str(response.status_code))
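
The dictionary returned above lends itself to a quick summary. Here is a minimal sketch that averages the ratings in `recent_reviews`; the sample data is made up to match that return shape, and `summarize_reviews` is a hypothetical helper.

```python
from statistics import mean

def summarize_reviews(result: dict) -> dict:
    """Average the numeric ratings found in result["recent_reviews"]."""
    ratings = [
        r["rating"]
        for r in result.get("recent_reviews", [])
        if isinstance(r.get("rating"), (int, float))
    ]
    return {
        "company": result.get("company"),
        "sampled": len(ratings),
        "avg_recent_rating": round(mean(ratings), 2) if ratings else None,
    }

# Made-up sample shaped like the return value of the function above.
sample = {
    "company": "ExampleCorp",
    "recent_reviews": [
        {"title": "Good team", "rating": 4.0},
        {"title": "Long hours", "rating": 2.0},
        {"title": "No score given", "rating": None},
    ],
}
print(summarize_reviews(sample))
```

Comparing this recent average with the overall company rating gives a rough signal of whether sentiment is trending up or down.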

Step 4: Build a Company Benchmarking Tool

Combine data sources to benchmark companies against each other:

import json

class CompanyBenchmark:
    def __init__(self, searchhive_key):
        self.api_key = searchhive_key
        self.base_url = "https://api.searchhive.dev/v1"
        self.headers = {"Authorization": "Bearer " + searchhive_key}
    
    def get_review_summary(self, company):
        # Get a summary of Glassdoor reviews for a company.
        response = requests.post(
            self.base_url + "/search",
            headers=self.headers,
            json={
                "query": company + " glassdoor reviews rating",
                "num_results": 5
            }
        )
        results = response.json().get("results", [])
        
        if not results:
            return None
        
        # Use DeepDive on the most relevant result
        url = results[0].get("url", "")
        response = requests.post(
            self.base_url + "/deepdive",
            headers=self.headers,
            json={
                "url": url,
                "prompt": (
                    "Extract the Glassdoor rating (out of 5), number of reviews, "
                    "and summarize the top 3 pros and top 3 cons mentioned. "
                    "Return as JSON with keys: rating, review_count, top_pros (array), top_cons (array)."
                )
            }
        )
        return response.json()
    
    def get_salary_data(self, company, role):
        # Get salary estimates for a role at a company.
        response = requests.post(
            self.base_url + "/search",
            headers=self.headers,
            json={
                "query": company + " " + role + " glassdoor salary range",
                "num_results": 3
            }
        )
        results = response.json().get("results", [])
        
        if results:
            response = requests.post(
                self.base_url + "/deepdive",
                headers=self.headers,
                json={
                    "url": results[0].get("url", ""),
                    "prompt": (
                        "Extract the salary range for " + role + " at " + company + ". "
                        "Include base salary (low, median, high) and total compensation if available."
                    )
                }
            )
            return response.json()
        return None
    
    def benchmark(self, companies, role=None):
        # Benchmark multiple companies.
        benchmark_data = {}
        
        for company in companies:
            print("Benchmarking " + company + "...")
            review_data = self.get_review_summary(company)
            salary_data = self.get_salary_data(company, role) if role else None
            
            benchmark_data[company] = {
                "reviews": review_data,
                "salary": salary_data
            }
        
        return benchmark_data


# Compare competitors
bench = CompanyBenchmark(API_KEY)
comparison = bench.benchmark(
    companies=["Google", "Meta", "Apple"],
    role="Software Engineer"
)

print(json.dumps(comparison, indent=2))
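
For side-by-side comparison in a spreadsheet, the nested result can be flattened to CSV. This sketch assumes each company's `reviews` value is a dictionary with `rating` and `review_count` at the top level, as the extraction prompt requests; the actual response shape may differ, and `benchmark_to_csv` is a hypothetical helper.

```python
import csv
import io

def benchmark_to_csv(benchmark: dict) -> str:
    """Flatten {company: {"reviews": {...}, "salary": ...}} into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["company", "rating", "review_count"])
    for company, data in benchmark.items():
        reviews = data.get("reviews") or {}  # tolerate None when extraction failed
        writer.writerow([company, reviews.get("rating", ""), reviews.get("review_count", "")])
    return buffer.getvalue()

# Made-up values for illustration.
example = {
    "ExampleCorp": {"reviews": {"rating": 4.1, "review_count": 120}, "salary": None},
    "SampleInc": {"reviews": None, "salary": None},
}
print(benchmark_to_csv(example))
```

Companies with no extracted review data still get a row, so gaps stay visible instead of disappearing from the comparison.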

Step 5: Compare Glassdoor Data API Providers

Provider	Reliability	Pricing	Best For
SearchHive	Medium (search-based)	$0-$199/mo	Light research, multi-purpose scraping
SerpAPI	High (dedicated endpoint)	$50+/mo (5K searches)	Production Glassdoor data pipelines
Bright Data	High (proxy + scraping)	Custom pricing	Enterprise-scale data collection
RapidAPI endpoints	Variable	$10-50/mo	Quick prototyping
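
The pricing column can be turned into a per-search figure for budgeting. The arithmetic below takes the SerpAPI row at face value ($50 for 5,000 searches); the figure of 3 searches per company is a hypothetical workload chosen for illustration.

```python
def cost_per_search(monthly_price_usd: float, included_searches: int) -> float:
    """Unit price implied by a flat monthly plan."""
    return monthly_price_usd / included_searches

def companies_per_month(included_searches: int, searches_per_company: int) -> int:
    """How many companies fit in the quota at a fixed number of searches each."""
    return included_searches // searches_per_company

print(cost_per_search(50, 5000))     # dollars per search
print(companies_per_month(5000, 3))  # companies covered per month
```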

Common Issues

Cloudflare blocking

SearchHive uses proxy rotation that handles most Cloudflare challenges. If you're still getting blocked:

Try archived copies of Glassdoor pages (Google retired its public cache in 2024, so a web archive such as the Wayback Machine is the practical option)
Use search-based extraction (finding indexed pages in search results) rather than direct Glassdoor URLs
For high-volume needs, switch to SerpAPI's dedicated Glassdoor endpoint
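
The options above amount to a fallback chain: try one source, and move to the next if it fails or returns nothing. A minimal sketch of that pattern follows. The three fetchers are stand-ins that make no network calls, and all names here are hypothetical.

```python
from typing import Callable, Optional, Sequence

Fetcher = Callable[[str], Optional[dict]]

def first_successful(fetchers: Sequence[Fetcher], company: str) -> Optional[dict]:
    """Try each fetcher in order and return the first non-empty result."""
    for fetch in fetchers:
        try:
            result = fetch(company)
        except Exception:
            continue  # treat any failure as "try the next source"
        if result:
            return result
    return None

# Stand-in fetchers; real ones would wrap the API calls shown earlier.
def blocked_source(company: str) -> Optional[dict]:
    raise RuntimeError("blocked by challenge page")

def empty_source(company: str) -> Optional[dict]:
    return None

def working_source(company: str) -> Optional[dict]:
    return {"company": company, "rating": 4.0}

print(first_successful([blocked_source, empty_source, working_source], "ExampleCorp"))
```

Ordering the fetchers from cheapest to most expensive keeps paid API calls as a last resort.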

Data accuracy

Glassdoor data is self-reported, which means:

Salary data may be inflated or outdated
Reviews may be biased (disgruntled employees are more likely to post)
Company ratings can be manipulated

Always cross-reference with other sources (Levels.fyi, LinkedIn Salary, Payscale) for salary benchmarking.
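
One simple way to cross-reference is to compare each source against the median and flag large deviations. The sketch below uses made-up salary figures and a hypothetical 15% tolerance; `flag_outliers` is an illustrative helper, not part of any API.

```python
from statistics import median

def flag_outliers(estimates: dict, tolerance: float = 0.15) -> dict:
    """Return sources whose figure differs from the median by more than tolerance."""
    mid = median(estimates.values())
    return {
        source: value
        for source, value in estimates.items()
        if abs(value - mid) / mid > tolerance
    }

# Made-up figures for illustration only.
estimates = {"Glassdoor": 190000, "Levels.fyi": 150000, "Payscale": 145000}
print(flag_outliers(estimates))
```

A flagged source is not necessarily wrong, but it is the one worth checking for stale entries or a mismatched job title.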

Legal considerations

Glassdoor's Terms of Service prohibit automated data collection. Using an API service like SerpAPI may shift some of the legal risk to the provider, but you remain responsible for how you use the data:

Don't republish Glassdoor content without attribution
Don't use scraped data for commercial products without permission
Check your jurisdiction's laws on web scraping and data collection
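
Alongside these points, the robots.txt advice from the takeaways can be checked in code before any request goes out. This is a minimal sketch using Python's standard `urllib.robotparser`. The rules and bot name are made-up samples, not Glassdoor's actual robots.txt; for a live site you would load the real file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not Glassdoor's real file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", "https://example.com/private/report"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", "https://example.com/public/page"))
```

A robots.txt check covers crawler etiquette only; it does not replace reading the site's Terms of Service.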

Next Steps

Start researching today: Sign up at searchhive.dev with 500 free credits and extract your first company benchmark.
For production HR analytics: Consider SerpAPI's Glassdoor endpoint for reliable, structured data at scale.
Read the docs: Visit searchhive.dev/docs for the complete API reference.
