How to Scrape GitHub Data for Developer Research
GitHub hosts over 500 million repositories and 100 million developers. That data is a goldmine for developer research — whether you're analyzing technology trends, evaluating open-source projects, recruiting contributors, or gathering competitive intelligence.
This tutorial shows you how to scrape GitHub data using Python, from basic API calls to advanced research workflows. We'll also show how SearchHive complements GitHub's API with web-based research and content analysis.
Key Takeaways
- GitHub's REST API provides 5,000 requests/hour with authentication — enough for most research projects
- PyGithub is the recommended Python library, but raw `requests` gives you more control
- The Search API has separate, stricter rate limits (30 requests/minute when authenticated)
- SearchHive's DeepDive adds qualitative analysis that the GitHub API can't provide alone
- Always use authentication and implement rate limiting to avoid getting blocked
Prerequisites
Before you start, you'll need:
- Python 3.9+ installed
- A GitHub Personal Access Token (PAT) — create one at GitHub Developer Settings
- Required token scopes: `public_repo` (minimal) or `repo` (full access)
- Python packages: `pip install PyGithub requests`
Step 1: Set Up Authentication
GitHub's unauthenticated rate limit is 60 requests/hour. With a Personal Access Token, you get 5,000 requests/hour — an 83x improvement.
```python
# Method 1: PyGithub (recommended)
from github import Auth, Github

auth = Auth.Token("ghp_your_token_here")
g = Github(auth=auth)

# Verify authentication
user = g.get_user()
print(f"Authenticated as: {user.login}")
print(f"Rate limit: {g.get_rate_limit().core.remaining} remaining")

# Method 2: Raw requests
import requests

GITHUB_TOKEN = "ghp_your_token_here"
HEADERS = {
    "Authorization": f"Bearer {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
}
```
Step 2: Search Repositories
GitHub's search API is the fastest way to find repositories matching specific criteria. Use search qualifiers to narrow results.
```python
from github import Auth, Github

g = Github(auth=Auth.Token("ghp_your_token_here"))

# Search for trending Python repos created recently
results = g.search_repositories(
    query="language:python created:>2025-06-01 stars:>500",
    sort="stars",
    order="desc",
)

print(f"Total results: {results.totalCount}")
for repo in results[:10]:
    desc = (repo.description or "N/A")[:60]
    print(f"Stars: {repo.stargazers_count:>6} | {repo.full_name:40s} | {desc}")
```
Key search qualifiers:
| Qualifier | Example | Description |
|---|---|---|
| `stars:` | `stars:>1000` | Filter by star count |
| `forks:` | `forks:>50` | Filter by fork count |
| `language:` | `language:rust` | Filter by primary language |
| `created:` | `created:>2025-01-01` | Filter by creation date |
| `pushed:` | `pushed:>2026-01-01` | Filter by last push date |
| `topic:` | `topic:machine-learning` | Filter by topic tag |
| `user:` | `user:microsoft` | Filter by owner |
| `org:` | `org:vercel` | Filter by organization |
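Qualifiers compose by simple concatenation, so you can build queries programmatically. A minimal sketch — `build_search_query` is a hypothetical helper, not part of PyGithub:

```python
def build_search_query(keywords="", **qualifiers):
    """Compose a GitHub search query from free text plus qualifiers.

    Values are used verbatim, so range syntax like '>500' or
    '2025-01-01..2025-06-30' passes through unchanged.
    """
    parts = [keywords] if keywords else []
    parts += [f"{name}:{value}" for name, value in qualifiers.items()]
    return " ".join(parts)

query = build_search_query("web framework", language="python", stars=">500")
print(query)  # web framework language:python stars:>500
```

The resulting string drops straight into `g.search_repositories(query=query)`.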
Step 3: Extract Repository Metrics
For developer research, you'll want detailed metrics about each repository.
```python
from datetime import datetime, timedelta, timezone

def get_repo_health(g, owner, repo_name):
    """Extract comprehensive repository health metrics."""
    repo = g.get_repo(f"{owner}/{repo_name}")

    # get_license() raises a 404 when the repo has no license file
    try:
        license_id = repo.get_license().license.spdx_id
    except Exception:
        license_id = None

    # Core metrics
    metrics = {
        "name": repo.full_name,
        "stars": repo.stargazers_count,
        "forks": repo.forks_count,
        "watchers": repo.subscribers_count,
        "open_issues": repo.open_issues_count,
        "language": repo.language,
        "topics": repo.get_topics(),
        "license": license_id,
        "created": repo.created_at.isoformat(),
        "last_pushed": repo.pushed_at.isoformat(),
        "description": repo.description,
    }

    # Recent activity: commits in the last 90 days
    # (PyGithub expects a datetime for `since`, not an ISO string)
    since = datetime.now(timezone.utc) - timedelta(days=90)
    metrics["commits_90d"] = repo.get_commits(since=since).totalCount

    # Top contributors
    metrics["top_contributors"] = [
        {"login": c.login, "contributions": c.contributions}
        for c in repo.get_contributors()[:5]
    ]
    return metrics

# Example usage
health = get_repo_health(g, "vercel", "next.js")
for key, value in health.items():
    print(f"{key}: {value}")
```
Step 4: Analyze Contributors
Contributor analysis is valuable for recruiting, community research, and identifying key maintainers.
```python
def analyze_contributors(g, owner, repo_name, limit=20):
    """Analyze top contributors for a repository."""
    repo = g.get_repo(f"{owner}/{repo_name}")
    contributors = []
    for c in repo.get_contributors()[:limit]:
        user = g.get_user(c.login)  # one extra API call per contributor
        contributors.append({
            "login": c.login,
            "commits": c.contributions,
            "public_repos": user.public_repos,
            "followers": user.followers,
            "company": user.company,
            "location": user.location,
            "bio": (user.bio or "")[:100],
            "profile_url": user.html_url,
        })
    # Sort by commit count
    contributors.sort(key=lambda x: x["commits"], reverse=True)
    return contributors

contributors = analyze_contributors(g, "facebook", "react")
for c in contributors[:10]:
    print(f"{c['login']:20s} | {c['commits']:>5} commits | "
          f"{c['followers']:>4} followers | {c['company'] or 'N/A'}")
```
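A common follow-up question is how concentrated a project is in a few people. A rough "bus factor" proxy — the smallest set of contributors accounting for half the commits — can be computed from the commit counts alone. This is an illustrative helper, not a standard metric:

```python
def bus_factor(commit_counts, threshold=0.5):
    """Smallest number of top contributors covering `threshold` of commits."""
    total = sum(commit_counts)
    covered = 0
    for n, commits in enumerate(sorted(commit_counts, reverse=True), start=1):
        covered += commits
        if covered / total >= threshold:
            return n
    return len(commit_counts)

counts = [1200, 300, 120, 45, 10]  # e.g. commit counts from analyze_contributors
print(bus_factor(counts))  # 1  (the top contributor alone has >50% of commits)
```

A bus factor of 1 or 2 on a large project is a risk signal worth flagging in research reports.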
Step 5: Track Issues and Pull Requests
Issue and PR data reveals project health, community engagement, and maintenance activity.
```python
from datetime import datetime, timedelta, timezone

def analyze_issues(g, owner, repo_name, days=30):
    """Analyze issue resolution patterns."""
    repo = g.get_repo(f"{owner}/{repo_name}")
    # PyGithub expects a datetime for `since`, not an ISO string
    since = datetime.now(timezone.utc) - timedelta(days=days)
    stats = {"opened": 0, "closed": 0, "avg_resolution_hours": 0, "resolution_times": []}
    for issue in repo.get_issues(state="all", since=since):
        if issue.pull_request:  # the issues endpoint also returns PRs; skip them
            continue
        if issue.created_at > since:
            stats["opened"] += 1
        if issue.closed_at:
            stats["closed"] += 1
            resolution = (issue.closed_at - issue.created_at).total_seconds() / 3600
            stats["resolution_times"].append(resolution)
    if stats["resolution_times"]:
        stats["avg_resolution_hours"] = round(
            sum(stats["resolution_times"]) / len(stats["resolution_times"]), 1
        )
    return stats

issue_stats = analyze_issues(g, "microsoft", "vscode", days=30)
print(f"Issues opened (30d): {issue_stats['opened']}")
print(f"Issues closed (30d): {issue_stats['closed']}")
print(f"Avg resolution: {issue_stats['avg_resolution_hours']} hours")
```
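One caveat: the average resolution time is easily skewed by a few ancient issues that finally get closed. Reporting the median alongside the mean gives a fairer picture of the typical experience. A sketch with sample resolution times (the numbers are made up for illustration):

```python
from statistics import mean, median

# Sample resolution times in hours -- one stale issue dominates the mean
resolution_times = [2.5, 4.0, 6.0, 9.5, 700.0]

print(f"Mean:   {mean(resolution_times):.1f}h")    # dragged up by the outlier
print(f"Median: {median(resolution_times):.1f}h")  # the typical case
```

You could extend `analyze_issues` to store both statistics from its `resolution_times` list.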
Step 6: Implement Rate Limiting
Hitting the rate limit will stop your research dead. Here's a robust rate limiter:
```python
import time

import requests

class GitHubRateLimiter:
    """Intelligent rate limiting for GitHub API requests."""

    def __init__(self, token, min_interval=0.1):
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        })
        self.min_interval = min_interval
        self._last_request = 0

    def get(self, url, params=None):
        """Make a rate-limit-aware GET request."""
        # Throttle between requests
        elapsed = time.time() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        resp = self.session.get(url, params=params)
        self._last_request = time.time()
        # Handle rate limiting (GitHub uses 403 or 429)
        if resp.status_code in (403, 429):
            remaining = int(resp.headers.get("x-ratelimit-remaining", 0))
            if remaining == 0:
                reset = int(resp.headers.get("x-ratelimit-reset", 0))
                wait = max(reset - int(time.time()), 1) + 1
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                return self.get(url, params)  # Retry
        resp.raise_for_status()
        return resp.json()

# Usage
limiter = GitHubRateLimiter("ghp_your_token")
data = limiter.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:python stars:>10000", "sort": "stars", "per_page": 100},
)
```
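The wait calculation deserves a closer look, since off-by-one sleeps cause needless retries. Extracted as a pure function (a sketch mirroring the limiter's logic), it can be tested without touching the API:

```python
import time

def seconds_until_reset(headers, now=None):
    """Seconds to sleep once x-ratelimit-remaining hits 0.

    `x-ratelimit-reset` is a Unix timestamp. We add a 1-second buffer
    and never sleep less than 2 seconds, even if the reset time has
    already passed by the time we read it.
    """
    now = int(time.time()) if now is None else int(now)
    reset = int(headers.get("x-ratelimit-reset", now))
    return max(reset - now, 1) + 1

print(seconds_until_reset({"x-ratelimit-reset": "1700000100"}, now=1700000000))  # 101
```

Passing `now` explicitly makes the function deterministic, which is what makes it unit-testable.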
Step 7: Build a Complete Research Pipeline
Combine everything into a reusable research pipeline:
```python
import json
from datetime import datetime

from github import Auth, Github

def github_research_pipeline(token, topic, min_stars=1000, language="python", limit=50):
    """Complete GitHub research pipeline for a given topic."""
    g = Github(auth=Auth.Token(token))
    print(f"Researching: {topic} ({language}, >{min_stars} stars)")

    # Step 1: Search repos
    query = f"topic:{topic} language:{language} stars:>{min_stars}"
    repos = g.search_repositories(query=query, sort="stars", order="desc")

    results = []
    for repo in repos[:limit]:
        metrics = get_repo_health(g, repo.owner.login, repo.name)  # from Step 3

        # Step 2: Use SearchHive for qualitative analysis
        try:
            from searchhive import DeepDive
            dd = DeepDive(api_key="your_key")
            analysis = dd.analyze(
                url=f"https://github.com/{repo.full_name}",
                summarize=True,
            )
            metrics["readme_quality"] = analysis.get("summary", "")[:200]
        except Exception:
            metrics["readme_quality"] = "N/A"

        results.append(metrics)
        print(f"  Done: {repo.full_name} ({repo.stargazers_count} stars)")

    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"github_research_{topic}_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(results, f, indent=2, default=str)

    print(f"Saved {len(results)} repos to {filename}")
    return results

# Run the pipeline
results = github_research_pipeline(
    token="ghp_your_token",
    topic="machine-learning",
    min_stars=5000,
    limit=20,
)
```
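The JSON dump is handy for archiving, but stakeholders usually want a flat table. Here is a sketch that flattens the pipeline's metrics dicts to CSV — the column choice is an assumption, so pick whichever fields you actually collected:

```python
import csv
import io

def to_csv(results, fields=("name", "stars", "forks", "language", "commits_90d")):
    """Flatten research results to CSV, silently dropping extra keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()

sample = [{"name": "vercel/next.js", "stars": 120000, "forks": 26000,
           "language": "JavaScript", "commits_90d": 450, "topics": ["react"]}]
print(to_csv(sample))
```

Writing to a `StringIO` buffer keeps the helper testable; swap in `open("report.csv", "w", newline="")` to write a real file.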
Step 8: Complement with SearchHive
The GitHub API gives you quantitative data (stars, forks, commit counts). SearchHive adds qualitative intelligence that the API can't provide:
```python
from searchhive import SwiftSearch, DeepDive

# Discover trending projects beyond GitHub's search
search = SwiftSearch(api_key="your_key")
trending = search.search(
    query="best Python web frameworks 2026",
    domains=["news.ycombinator.com", "dev.to", "reddit.com"],
    extract_fields=["title", "url", "content"],
)

# DeepDive into project documentation quality
dd = DeepDive(api_key="your_key")
analysis = dd.analyze(
    url="https://docs.djangoproject.com",
    extract_features=True,
    summarize=True,
)
```
What SearchHive adds that GitHub can't:
| GitHub API | SearchHive |
|---|---|
| Star counts, fork counts | Sentiment analysis of community discussions |
| Language metadata | Documentation quality assessment |
| Issue counts | Trend detection across blogs, forums, social media |
| Contributor logins | Developer profiling (blogs, talks, publications) |
| Commit timestamps | Market intelligence (job postings, tutorial popularity) |
Common Issues
Rate Limit Exceeded (403)
Cause: Too many requests. Fix: Use g.get_rate_limit() to monitor remaining requests. Implement the GitHubRateLimiter class from Step 6. Use per_page=100 to minimize requests.
Search Returns Fewer Results Than Expected
Cause: GitHub search caps at 1,000 results total. Fix: Use more specific queries, or paginate through different time periods.
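One way to paginate past the 1,000-result cap is to slice the query into `created:` date windows and run each window as its own search. A sketch — the 30-day window size is an arbitrary assumption; shrink it for high-volume queries:

```python
from datetime import date, timedelta

def date_window_queries(base_query, start, end, days=30):
    """Split a search into created: date windows, each small enough
    to stay under GitHub's 1,000-result cap."""
    queries = []
    cursor = start
    while cursor <= end:
        stop = min(cursor + timedelta(days=days), end)
        queries.append(f"{base_query} created:{cursor}..{stop}")
        cursor = stop + timedelta(days=1)
    return queries

for q in date_window_queries("language:python stars:>100",
                             date(2025, 1, 1), date(2025, 2, 15)):
    print(q)
```

Run each query through `search_repositories` and merge the results; windows never overlap, so no deduplication is needed.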
Missing Fields in Responses
Cause: Default API responses don't include every field (topics, full license details). Fix: Call repo.get_topics() explicitly, or use the raw API with the full Accept and API-version headers so responses include the complete object.
Conditional Requests Failing
Fix: Use If-None-Match (ETag) and If-Modified-Since headers. 304 responses don't count against your rate limit — this is free caching.
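In practice this means caching each URL's ETag and replaying it on the next request. The cache-handling half is pure logic and can be sketched without the network — the `{url: etag}` cache shape here is an assumption, not a requests or GitHub convention:

```python
def conditional_headers(etag_cache, url):
    """Build headers that turn a repeat fetch into a free 304 when possible."""
    headers = {"Accept": "application/vnd.github+json"}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    return headers

def remember_etag(etag_cache, url, response_headers):
    """Store the ETag from a 200 response for next time."""
    etag = response_headers.get("ETag")
    if etag:
        etag_cache[url] = etag

cache = {}
remember_etag(cache, "https://api.github.com/repos/vercel/next.js", {"ETag": 'W/"abc123"'})
print(conditional_headers(cache, "https://api.github.com/repos/vercel/next.js"))
```

On a real fetch, a 304 status means your cached copy is still current: reuse it and move on without spending rate limit.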
Next Steps
Now that you can scrape GitHub data, here are ways to level up:
- Store results in a database (PostgreSQL + SQLAlchemy) for historical tracking
- Set up scheduled research — run weekly snapshots to track star velocity and project health over time
- Combine with SearchHive for qualitative research that complements GitHub's quantitative data
- Build dashboards using the scraped data (Streamlit, Grafana, or Metabase)
- Export to CSV/Excel for stakeholder reporting
Ready to go beyond the GitHub API? Get started with SearchHive's free tier for web research, content analysis, and data enrichment. Check out the API docs for quickstart guides.
See also: How to build a lead generation scraper | SearchHive vs Apify | Web scraping with Python guide