How to Scrape GitHub Data for Developer Research
GitHub hosts over 500 million repositories and 100 million developers. That data is a goldmine for developer research — whether you're analyzing technology trends, evaluating open-source projects, recruiting contributors, or gathering competitive intelligence.
This tutorial shows you how to scrape GitHub data using Python, from basic API calls to advanced research workflows. We'll also show how SearchHive complements GitHub's API with web-based research and content analysis.
Key Takeaways
- GitHub's REST API provides 5,000 requests/hour with authentication — enough for most research projects
- PyGithub is the recommended Python library, but raw `requests` gives you more control
- The Search API has separate, stricter rate limits (30 requests/minute when authenticated)
- SearchHive's DeepDive adds qualitative analysis that the GitHub API can't provide alone
- Always use authentication and implement rate limiting to avoid getting blocked
Prerequisites
Before you start, you'll need:
- Python 3.9+ installed
- A GitHub Personal Access Token (PAT) — create one at GitHub Developer Settings
- Required token scopes: `public_repo` (minimal) or `repo` (full access)
- Python packages: `pip install PyGithub requests`
Step 1: Set Up Authentication
GitHub's unauthenticated rate limit is 60 requests/hour. With a Personal Access Token, you get 5,000 requests/hour — an 83x improvement.
```python
# Method 1: PyGithub (recommended)
from github import Auth, Github

auth = Auth.Token("ghp_your_token_here")
g = Github(auth=auth)

# Verify authentication
user = g.get_user()
print(f"Authenticated as: {user.login}")
print(f"Rate limit: {g.get_rate_limit().core.remaining} remaining")

# Method 2: Raw requests
import requests

GITHUB_TOKEN = "ghp_your_token_here"
HEADERS = {
    "Authorization": f"Bearer {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
}
```
Step 2: Search Repositories
GitHub's search API is the fastest way to find repositories matching specific criteria. Use search qualifiers to narrow results.
```python
from github import Auth, Github

g = Github(auth=Auth.Token("ghp_your_token_here"))

# Search for trending Python repos created recently
results = g.search_repositories(
    query="language:python created:>2025-06-01 stars:>500",
    sort="stars",
    order="desc",
)

print(f"Total results: {results.totalCount}")
for repo in results[:10]:
    desc = (repo.description or "N/A")[:60]
    print(f"Stars: {repo.stargazers_count:>6} | {repo.full_name:40s} | {desc}")
```
Key search qualifiers:
| Qualifier | Example | Description |
|---|---|---|
| `stars:` | `stars:>1000` | Filter by star count |
| `forks:` | `forks:>50` | Filter by fork count |
| `language:` | `language:rust` | Filter by primary language |
| `created:` | `created:>2025-01-01` | Filter by creation date |
| `pushed:` | `pushed:>2026-01-01` | Filter by last push date |
| `topic:` | `topic:machine-learning` | Filter by topic tag |
| `user:` | `user:microsoft` | Filter by owner |
| `org:` | `org:vercel` | Filter by organization |
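Qualifiers compose by simple concatenation, so you can build queries programmatically. A minimal sketch — `build_search_query` is a hypothetical helper, not part of PyGithub:

```python
def build_search_query(keywords="", **qualifiers):
    """Compose a GitHub search query from free text plus qualifiers.

    Values are used verbatim, so range syntax like '>500' or
    '2025-01-01..2025-06-30' passes through unchanged.
    """
    parts = [keywords] if keywords else []
    parts += [f"{name}:{value}" for name, value in qualifiers.items()]
    return " ".join(parts)

query = build_search_query("web framework", language="python", stars=">500")
print(query)  # web framework language:python stars:>500
```

The resulting string drops straight into `g.search_repositories(query=query)`.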
Step 3: Extract Repository Metrics
For developer research, you'll want detailed metrics about each repository.
```python
from datetime import datetime, timedelta, timezone

def get_repo_health(g, owner, repo_name):
    """Extract comprehensive repository health metrics."""
    repo = g.get_repo(f"{owner}/{repo_name}")

    # get_license() raises a 404 when the repo has no license file
    try:
        license_id = repo.get_license().license.spdx_id
    except Exception:
        license_id = None

    # Core metrics
    metrics = {
        "name": repo.full_name,
        "stars": repo.stargazers_count,
        "forks": repo.forks_count,
        "watchers": repo.subscribers_count,
        "open_issues": repo.open_issues_count,
        "language": repo.language,
        "topics": repo.get_topics(),
        "license": license_id,
        "created": repo.created_at.isoformat(),
        "last_pushed": repo.pushed_at.isoformat(),
        "description": repo.description,
    }

    # Recent activity: commits in the last 90 days
    # (PyGithub expects a datetime for `since`, not an ISO string)
    since = datetime.now(timezone.utc) - timedelta(days=90)
    metrics["commits_90d"] = repo.get_commits(since=since).totalCount

    # Top contributors
    metrics["top_contributors"] = [
        {"login": c.login, "contributions": c.contributions}
        for c in repo.get_contributors()[:5]
    ]
    return metrics

# Example usage
health = get_repo_health(g, "vercel", "next.js")
for key, value in health.items():
    print(f"{key}: {value}")
```
Step 4: Analyze Contributors
Contributor analysis is valuable for recruiting, community research, and identifying key maintainers.
```python
def analyze_contributors(g, owner, repo_name, limit=20):
    """Analyze top contributors for a repository."""
    repo = g.get_repo(f"{owner}/{repo_name}")
    contributors = []
    for c in repo.get_contributors()[:limit]:
        user = g.get_user(c.login)  # one extra API call per contributor
        contributors.append({
            "login": c.login,
            "commits": c.contributions,
            "public_repos": user.public_repos,
            "followers": user.followers,
            "company": user.company,
            "location": user.location,
            "bio": (user.bio or "")[:100],
            "profile_url": user.html_url,
        })
    # Sort by commit count
    contributors.sort(key=lambda x: x["commits"], reverse=True)
    return contributors

contributors = analyze_contributors(g, "facebook", "react")
for c in contributors[:10]:
    print(f"{c['login']:20s} | {c['commits']:>5} commits | "
          f"{c['followers']:>4} followers | {c['company'] or 'N/A'}")
```
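A common follow-up question is how concentrated a project is in a few people. A rough "bus factor" proxy — the smallest set of contributors accounting for half the commits — can be computed from the commit counts alone. This is an illustrative helper, not a standard metric:

```python
def bus_factor(commit_counts, threshold=0.5):
    """Smallest number of top contributors covering `threshold` of commits."""
    total = sum(commit_counts)
    covered = 0
    for n, commits in enumerate(sorted(commit_counts, reverse=True), start=1):
        covered += commits
        if covered / total >= threshold:
            return n
    return len(commit_counts)

counts = [1200, 300, 120, 45, 10]  # e.g. commit counts from analyze_contributors
print(bus_factor(counts))  # 1  (the top contributor alone has >50% of commits)
```

A bus factor of 1 or 2 on a large project is a risk signal worth flagging in research reports.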
Step 5: Track Issues and Pull Requests
Issue and PR data reveals project health, community engagement, and maintenance activity.
```python
from datetime import datetime, timedelta, timezone

def analyze_issues(g, owner, repo_name, days=30):
    """Analyze issue resolution patterns."""
    repo = g.get_repo(f"{owner}/{repo_name}")
    # PyGithub expects a datetime for `since`, not an ISO string
    since = datetime.now(timezone.utc) - timedelta(days=days)
    stats = {"opened": 0, "closed": 0, "avg_resolution_hours": 0, "resolution_times": []}
    for issue in repo.get_issues(state="all", since=since):
        if issue.pull_request:  # the issues endpoint also returns PRs; skip them
            continue
        if issue.created_at > since:
            stats["opened"] += 1
        if issue.closed_at:
            stats["closed"] += 1
            resolution = (issue.closed_at - issue.created_at).total_seconds() / 3600
            stats["resolution_times"].append(resolution)
    if stats["resolution_times"]:
        stats["avg_resolution_hours"] = round(
            sum(stats["resolution_times"]) / len(stats["resolution_times"]), 1
        )
    return stats

issue_stats = analyze_issues(g, "microsoft", "vscode", days=30)
print(f"Issues opened (30d): {issue_stats['opened']}")
print(f"Issues closed (30d): {issue_stats['closed']}")
print(f"Avg resolution: {issue_stats['avg_resolution_hours']} hours")
```
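One caveat: the average resolution time is easily skewed by a few ancient issues that finally get closed. Reporting the median alongside the mean gives a fairer picture of the typical experience. A sketch with sample resolution times (the numbers are made up for illustration):

```python
from statistics import mean, median

# Sample resolution times in hours -- one stale issue dominates the mean
resolution_times = [2.5, 4.0, 6.0, 9.5, 700.0]

print(f"Mean:   {mean(resolution_times):.1f}h")    # dragged up by the outlier
print(f"Median: {median(resolution_times):.1f}h")  # the typical case
```

You could extend `analyze_issues` to store both statistics from its `resolution_times` list.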
Step 6: Implement Rate Limiting
Hitting the rate limit will stop your research dead. Here's a robust rate limiter:
```python
import time

import requests

class GitHubRateLimiter:
    """Intelligent rate limiting for GitHub API requests."""

    def __init__(self, token, min_interval=0.1):
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        })
        self.min_interval = min_interval
        self._last_request = 0

    def get(self, url, params=None):
        """Make a rate-limit-aware GET request."""
        # Throttle between requests
        elapsed = time.time() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        resp = self.session.get(url, params=params)
        self._last_request = time.time()
        # Handle rate limiting (GitHub uses 403 or 429)
        if resp.status_code in (403, 429):
            remaining = int(resp.headers.get("x-ratelimit-remaining", 0))
            if remaining == 0:
                reset = int(resp.headers.get("x-ratelimit-reset", 0))
                wait = max(reset - int(time.time()), 1) + 1
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                return self.get(url, params)  # Retry
        resp.raise_for_status()
        return resp.json()

# Usage
limiter = GitHubRateLimiter("ghp_your_token")
data = limiter.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:python stars:>10000", "sort": "stars", "per_page": 100},
)
```
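The wait calculation deserves a closer look, since off-by-one sleeps cause needless retries. Extracted as a pure function (a sketch mirroring the limiter's logic), it can be tested without touching the API:

```python
import time

def seconds_until_reset(headers, now=None):
    """Seconds to sleep once x-ratelimit-remaining hits 0.

    `x-ratelimit-reset` is a Unix timestamp. We add a 1-second buffer
    and never sleep less than 2 seconds, even if the reset time has
    already passed by the time we read it.
    """
    now = int(time.time()) if now is None else int(now)
    reset = int(headers.get("x-ratelimit-reset", now))
    return max(reset - now, 1) + 1

print(seconds_until_reset({"x-ratelimit-reset": "1700000100"}, now=1700000000))  # 101
```

Passing `now` explicitly makes the function deterministic, which is what makes it unit-testable.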
Step 7: Build a Complete Research Pipeline
Combine everything into a reusable research pipeline:
```python
import json
from datetime import datetime

from github import Auth, Github

def github_research_pipeline(token, topic, min_stars=1000, language="python", limit=50):
    """Complete GitHub research pipeline for a given topic."""
    g = Github(auth=Auth.Token(token))
    print(f"Researching: {topic} ({language}, >{min_stars} stars)")

    # Step 1: Search repos
    query = f"topic:{topic} language:{language} stars:>{min_stars}"
    repos = g.search_repositories(query=query, sort="stars", order="desc")

    results = []
    for repo in repos[:limit]:
        metrics = get_repo_health(g, repo.owner.login, repo.name)  # from Step 3

        # Step 2: Use SearchHive for qualitative analysis
        try:
            from searchhive import DeepDive
            dd = DeepDive(api_key="your_key")
            analysis = dd.analyze(
                url=f"https://github.com/{repo.full_name}",
                summarize=True,
            )
            metrics["readme_quality"] = analysis.get("summary", "")[:200]
        except Exception:
            metrics["readme_quality"] = "N/A"

        results.append(metrics)
        print(f"  Done: {repo.full_name} ({repo.stargazers_count} stars)")

    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"github_research_{topic}_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(results, f, indent=2, default=str)

    print(f"Saved {len(results)} repos to {filename}")
    return results

# Run the pipeline
results = github_research_pipeline(
    token="ghp_your_token",
    topic="machine-learning",
    min_stars=5000,
    limit=20,
)
```
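The JSON dump is handy for archiving, but stakeholders usually want a flat table. Here is a sketch that flattens the pipeline's metrics dicts to CSV — the column choice is an assumption, so pick whichever fields you actually collected:

```python
import csv
import io

def to_csv(results, fields=("name", "stars", "forks", "language", "commits_90d")):
    """Flatten research results to CSV, silently dropping extra keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()

sample = [{"name": "vercel/next.js", "stars": 120000, "forks": 26000,
           "language": "JavaScript", "commits_90d": 450, "topics": ["react"]}]
print(to_csv(sample))
```

Writing to a `StringIO` buffer keeps the helper testable; swap in `open("report.csv", "w", newline="")` to write a real file.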
Step 8: Complement with SearchHive
The GitHub API gives you quantitative data (stars, forks, commit counts). SearchHive adds qualitative intelligence that the API can't provide:
```python
from searchhive import SwiftSearch, DeepDive

# Discover trending projects beyond GitHub's search
search = SwiftSearch(api_key="your_key")
trending = search.search(
    query="best Python web frameworks 2026",
    domains=["news.ycombinator.com", "dev.to", "reddit.com"],
    extract_fields=["title", "url", "content"],
)

# DeepDive into project documentation quality
dd = DeepDive(api_key="your_key")
analysis = dd.analyze(
    url="https://docs.djangoproject.com",
    extract_features=True,
    summarize=True,
)
```
What SearchHive adds that GitHub can't:
| GitHub API | SearchHive |
|---|---|
| Star counts, fork counts | Sentiment analysis of community discussions |
| Language metadata | Documentation quality assessment |
| Issue counts | Trend detection across blogs, forums, social media |
| Contributor logins | Developer profiling (blogs, talks, publications) |
| Commit timestamps | Market intelligence (job postings, tutorial popularity) |
Common Issues
Rate Limit Exceeded (403)
Cause: Too many requests. Fix: Use g.get_rate_limit() to monitor remaining requests. Implement the GitHubRateLimiter class from Step 6. Use per_page=100 to minimize requests.
Search Returns Fewer Results Than Expected
Cause: GitHub search caps at 1,000 results total. Fix: Use more specific queries, or paginate through different time periods.
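One way to paginate past the 1,000-result cap is to slice the query into `created:` date windows and run each window as its own search. A sketch — the 30-day window size is an arbitrary assumption; shrink it for high-volume queries:

```python
from datetime import date, timedelta

def date_window_queries(base_query, start, end, days=30):
    """Split a search into created: date windows, each small enough
    to stay under GitHub's 1,000-result cap."""
    queries = []
    cursor = start
    while cursor <= end:
        stop = min(cursor + timedelta(days=days), end)
        queries.append(f"{base_query} created:{cursor}..{stop}")
        cursor = stop + timedelta(days=1)
    return queries

for q in date_window_queries("language:python stars:>100",
                             date(2025, 1, 1), date(2025, 2, 15)):
    print(q)
```

Run each query through `search_repositories` and merge the results; windows never overlap, so no deduplication is needed.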
Missing Fields in Responses
Cause: Default API responses don't include every field (topics, full license details). Fix: Call repo.get_topics() explicitly, or use the raw API with the full Accept and API-version headers so responses include the complete object.
Conditional Requests Failing
Fix: Use If-None-Match (ETag) and If-Modified-Since headers. 304 responses don't count against your rate limit — this is free caching.
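In practice this means caching each URL's ETag and replaying it on the next request. The cache-handling half is pure logic and can be sketched without the network — the `{url: etag}` cache shape here is an assumption, not a requests or GitHub convention:

```python
def conditional_headers(etag_cache, url):
    """Build headers that turn a repeat fetch into a free 304 when possible."""
    headers = {"Accept": "application/vnd.github+json"}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    return headers

def remember_etag(etag_cache, url, response_headers):
    """Store the ETag from a 200 response for next time."""
    etag = response_headers.get("ETag")
    if etag:
        etag_cache[url] = etag

cache = {}
remember_etag(cache, "https://api.github.com/repos/vercel/next.js", {"ETag": 'W/"abc123"'})
print(conditional_headers(cache, "https://api.github.com/repos/vercel/next.js"))
```

On a real fetch, a 304 status means your cached copy is still current: reuse it and move on without spending rate limit.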
Next Steps
Now that you can scrape GitHub data, here are ways to level up:
- Store results in a database (PostgreSQL + SQLAlchemy) for historical tracking
- Set up scheduled research — run weekly snapshots to track star velocity and project health over time
- Combine with SearchHive for qualitative research that complements GitHub's quantitative data
- Build dashboards using the scraped data (Streamlit, Grafana, or Metabase)
- Export to CSV/Excel for stakeholder reporting
Ready to go beyond the GitHub API? Get started with SearchHive's free tier for web research, content analysis, and data enrichment. Check out the API docs for quickstart guides.
See also: How to build a lead generation scraper | SearchHive vs Apify | Web scraping with Python guide