Many valuable websites keep their best data behind authentication -- pricing pages, dashboards, member directories, analytics panels. Scraping authenticated content requires handling sessions, cookies, CSRF tokens, and sometimes multi-factor authentication. This tutorial covers the practical techniques for scraping behind login walls using Python and SearchHive.
Prerequisites
- Python 3.8+ with requests and beautifulsoup4 installed
- A SearchHive API key (free tier includes 500 credits)
- Target website credentials (for sites you have permission to access)
pip install requests beautifulsoup4
Key Takeaways
- Scraping behind login requires managing session state -- cookies, CSRF tokens, and authentication headers
- SearchHive's ScrapeForge can handle session-based authentication by forwarding cookies
- For complex login flows (OAuth, 2FA, CAPTCHA on login), SearchHive's headless browser rendering with proxy rotation gives you the most reliable approach
- Always check terms of service before scraping authenticated content -- scraping data you are authorized to access is different from bypassing access controls
Step 1: Understand the Authentication Method
Before writing any code, identify how the target site authenticates:
- Cookie-based sessions: Most common. Login POST request sets session cookies that you forward with subsequent requests
- Token-based (JWT): Login returns an access token that you include in the Authorization header
- OAuth 2.0: Redirects through a third-party auth provider
- CSRF-protected: Login form includes a hidden CSRF token that must be submitted with credentials
Inspect the login flow using your browser's DevTools:
- Open DevTools (F12), go to the Network tab
- Submit the login form normally
- Look at the POST request to the login endpoint
- Note the request headers (especially Cookie and CSRF token), form data, and response headers (especially Set-Cookie)
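A quick shortcut while you are still exploring: copy the Cookie header from the authenticated request in DevTools and replay it from Python. The helper below is a minimal sketch; the cookie names shown are placeholders and will differ per site.

```python
import requests

def parse_cookie_header(raw: str) -> dict:
    """Split a raw Cookie header (as copied from DevTools) into a name -> value dict."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies

# Paste the Cookie header from the authenticated request in DevTools:
raw_header = "sessionid=abc123; csrftoken=xyz789"
session = requests.Session()
session.cookies.update(parse_cookie_header(raw_header))
# session.get("https://example.com/dashboard") now sends your browser's cookies
```

This only works until the browser session expires, but it is a fast way to confirm which cookies actually gate access before you automate the full login.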
Step 2: Session-Based Login with requests
The simplest case -- a standard HTML form login with cookie-based sessions:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: GET the login page to pick up the CSRF token and initial cookies
login_page = session.get("https://example.com/login")

# Extract the CSRF token (varies by framework)
# Django: <input type="hidden" name="csrfmiddlewaretoken" value="...">
soup = BeautifulSoup(login_page.text, "html.parser")
csrf_token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

# Step 2: POST login credentials with the CSRF token
login_response = session.post(
    "https://example.com/login",
    data={
        "username": "your_username",
        "password": "your_password",
        "csrfmiddlewaretoken": csrf_token,
    },
    headers={"Referer": "https://example.com/login"},
)

# Step 3: Verify login succeeded. A 200 alone is not reliable -- many sites
# return 200 with an error page. Check for a redirect to a logged-in URL.
if "dashboard" in login_response.url:
    print("Login successful")
else:
    print(f"Login may have failed: {login_response.status_code}")

# Step 4: Scrape authenticated pages using the session
dashboard = session.get("https://example.com/dashboard/data")
print(dashboard.text[:500])
The key is using requests.Session() -- it automatically persists cookies across requests, maintaining your authenticated state.
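Because the session's cookie jar holds your authenticated state, you can persist it to disk and skip the login step on the next run. A minimal sketch (the file name is an assumption; note that flattening to a plain dict drops per-cookie domain and path attributes, which is fine for single-site scrapers):

```python
import json
import pathlib

import requests

COOKIE_FILE = pathlib.Path("session_cookies.json")  # assumed location

def save_cookies(session: requests.Session, path: pathlib.Path = COOKIE_FILE) -> None:
    """Dump the session's cookie jar to disk as plain JSON."""
    path.write_text(json.dumps(session.cookies.get_dict()))

def load_cookies(session: requests.Session, path: pathlib.Path = COOKIE_FILE) -> bool:
    """Restore saved cookies into a fresh session; returns False if none saved."""
    if not path.exists():
        return False
    session.cookies.update(json.loads(path.read_text()))
    return True
```

Try `load_cookies` first, make a test request, and fall back to the full login only if the saved session has expired.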
Step 3: Token-Based Authentication (JWT)
Many modern web apps return a JWT after login instead of cookies:
import requests

# Login to get an access token
response = requests.post(
    "https://api.example.com/auth/login",
    json={
        "email": "your@email.com",
        "password": "your_password",
    },
)

if response.status_code == 200:
    token = response.json().get("access_token")
    headers = {"Authorization": f"Bearer {token}"}

    # Use the token for subsequent requests
    data = requests.get(
        "https://api.example.com/dashboard/reports",
        headers=headers,
    )
    print(data.json())
else:
    print(f"Auth failed: {response.text}")
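JWTs usually carry their own expiry in the `exp` claim, so you can schedule a refresh before the token lapses instead of waiting for a 401. The payload is just base64url-encoded JSON, readable with the standard library (this decodes without verifying the signature, which is fine for reading your own token's expiry):

```python
import base64
import json
import time

def jwt_expiry(token: str):
    """Read the exp claim (a Unix timestamp) from a JWT payload without
    verifying the signature. Returns None if the claim is absent."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims.get("exp")

def seconds_until_expiry(token: str) -> float:
    """How long the token remains valid; infinite if it has no exp claim."""
    exp = jwt_expiry(token)
    return exp - time.time() if exp else float("inf")
```

Refresh when `seconds_until_expiry` drops below a safety margin (say, 60 seconds) rather than on a fixed timer.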
Step 4: Using SearchHive for Authenticated Scraping
For sites with complex login flows (JavaScript-rendered login forms, OAuth redirects, CAPTCHAs on login), SearchHive's ScrapeForge with headless browser rendering is the most reliable approach:
import requests

API_KEY = "your_searchhive_key"
BASE_URL = "https://api.searchhive.dev/v1"
sh_headers = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: Scrape the login page to get cookies and the CSRF token
login_page = requests.post(
    f"{BASE_URL}/scrape",
    headers=sh_headers,
    json={
        "url": "https://example.com/login",
        "render_js": True,
        "cookies": {},  # Fresh session
    },
)

# Extract cookies from the response
cookies = login_page.json().get("cookies", {})

# Step 2: Submit the login form (the actual login happens on your side or via ScrapeForge)
# For simple cookie-based auth, forward the session cookies:
protected_data = requests.post(
    f"{BASE_URL}/scrape",
    headers=sh_headers,
    json={
        "url": "https://example.com/dashboard",
        "render_js": True,
        "cookies": cookies,
        "extract": {
            "table_data": {
                "selector": "table.data-table tr",
                "fields": ["name", "value", "date"],
            }
        },
    },
)
print(protected_data.json())
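You can also log in locally with requests (as in Step 2) and hand the resulting cookies to ScrapeForge for the heavy JavaScript pages. The cookie jar flattens to the plain name -> value mapping that the `cookies` field in the payload above uses (assumed format, per the example):

```python
import requests

def cookies_for_forwarding(session: requests.Session) -> dict:
    """Flatten a requests cookie jar into the plain name -> value mapping
    used by the "cookies" field in the /scrape payload above."""
    return session.cookies.get_dict()

session = requests.Session()
session.cookies.set("sessionid", "abc123")  # normally set by your login POST
payload_cookies = cookies_for_forwarding(session)
```

This split keeps the cheap login on your side while ScrapeForge handles only the pages that need rendering.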
Step 5: Handle JavaScript-Rendered Login Forms
Many sites use SPAs (React, Angular, Vue) where the login form is rendered by JavaScript. The actual authentication happens via an XHR/fetch POST to an API endpoint. Here is how to handle this:
import requests

session = requests.Session()

# First, load the SPA to pick up any initial cookies
session.get("https://example.com")

# Find the actual auth endpoint by inspecting DevTools
# Typically POST /api/auth/login or similar
auth_response = session.post(
    "https://example.com/api/auth/login",
    json={
        "email": "your@email.com",
        "password": "your_password",
    },
    headers={
        "Content-Type": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    },
)

if auth_response.status_code == 200:
    token = auth_response.json().get("token")
    session.headers.update({"Authorization": f"Bearer {token}"})

    # Now scrape authenticated endpoints
    data = session.get("https://example.com/api/dashboard/data")
    records = data.json()
    for record in records[:10]:
        print(record)
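Many SPA backends also require a CSRF header on state-changing requests, using the double-submit pattern: the server sets a CSRF cookie and expects the same value echoed back in a header. A small helper (the cookie and header names are assumptions; Django, for example, uses `csrftoken` / `X-CSRFToken`, so adjust per site):

```python
import requests

def csrf_header(session: requests.Session,
                cookie_name: str = "csrftoken",
                header_name: str = "X-CSRF-Token") -> dict:
    """Double-submit CSRF: echo the CSRF cookie back as a request header.
    Returns an empty dict if the cookie has not been set yet."""
    token = session.cookies.get(cookie_name)
    return {header_name: token} if token else {}
```

Merge the result into the headers of your login POST if the API rejects the request with a 403 despite correct credentials.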
Step 6: Handle Multi-Step Login Flows
Some sites require multiple steps -- email entry, password entry, 2FA code:
import requests

session = requests.Session()

# Step 1: Submit email
session.post("https://example.com/auth/email", json={"email": "you@example.com"})

# Step 2: Submit password
session.post("https://example.com/auth/password", json={"password": "your_password"})

# Step 3: If 2FA is required, handle it one of two ways (use one, not both)

# Option A: Automated 2FA (if the account uses TOTP; pip install pyotp)
import pyotp

totp = pyotp.TOTP("YOUR_TOTP_SECRET")
code = totp.now()
session.post("https://example.com/auth/2fa", json={"code": code})

# Option B: Manual 2FA (pause and wait for user input)
# code = input("Enter 2FA code: ")
# session.post("https://example.com/auth/2fa", json={"code": code})

# Step 4: Access authenticated data
data = session.get("https://example.com/api/protected")
print(data.json())
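If you would rather not add the pyotp dependency, TOTP (RFC 6238) is compact enough to implement with the standard library. This sketch assumes the default SHA-1 / 30-second parameters that authenticator apps use; `secret_b32` is the base32 secret shown on the authenticator-app setup screen:

```python
import base64
import hmac
import struct
import time

def totp(secret_b32: str, digits: int = 6, period: int = 30, at=None) -> str:
    """RFC 6238 TOTP code from a base32 secret, using only the stdlib.
    `at` is a Unix timestamp for testing; defaults to now."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The test vectors in RFC 6238 (secret `12345678901234567890`, i.e. `GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ` in base32) confirm the implementation.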
Step 7: Maintain Session Longevity
Sessions expire. Here is how to handle session refresh:
import time

import requests


class AuthenticatedScraper:
    def __init__(self, base_url, credentials):
        self.base_url = base_url
        self.credentials = credentials
        self.session = requests.Session()
        self.last_refresh = 0
        self.refresh_interval = 3600  # Refresh every hour

    def login(self):
        """Authenticate and get fresh tokens/cookies."""
        response = self.session.post(
            f"{self.base_url}/auth/login",
            json=self.credentials,
        )
        if response.status_code == 200:
            token = response.json().get("access_token")
            self.session.headers.update({"Authorization": f"Bearer {token}"})
            self.last_refresh = time.time()
            return True
        return False

    def get(self, path):
        """GET with auto-refresh."""
        if time.time() - self.last_refresh > self.refresh_interval:
            print("Refreshing session...")
            if not self.login():
                raise Exception("Session refresh failed")
        return self.session.get(f"{self.base_url}{path}")

    def scrape_all(self, paths):
        """Scrape multiple authenticated paths."""
        self.login()
        results = {}
        for path in paths:
            try:
                response = self.get(path)
                results[path] = response.json() if response.status_code == 200 else None
            except Exception as e:
                results[path] = {"error": str(e)}
            time.sleep(1)  # Be polite between requests
        return results


# Usage
scraper = AuthenticatedScraper(
    "https://example.com",
    {"email": "you@example.com", "password": "pass"},
)
data = scraper.scrape_all(["/api/reports", "/api/users", "/api/analytics"])
Common Issues
- Session expiration: Tokens and cookies expire. Implement auto-refresh (Step 7) to avoid broken scrapers.
- CSRF token mismatches: Some sites rotate CSRF tokens per request. Always fetch a fresh token before each form submission.
- CAPTCHA on login: If the login page shows a CAPTCHA, automated scraping becomes difficult. SearchHive can bypass some CAPTCHAs, but consider whether you have legitimate access.
- IP blocking: Frequent login attempts from the same IP can trigger security blocks. SearchHive's proxy rotation distributes requests across IPs.
- 2FA requirements: Fully automated 2FA is generally only practical with TOTP. SMS-based 2FA requires manual intervention.
- Terms of service: Only scrape authenticated content from accounts you own or have explicit permission to access.
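The session-expiration issue above can also be handled reactively: instead of refreshing on a timer, re-login when a request comes back 401 or 403. This sketch is deliberately generic; `fetch` and `login` are callables you supply, e.g. lambdas wrapping `session.get` and your login routine:

```python
def request_with_reauth(fetch, login, retries: int = 1):
    """Run fetch(); if the response is a 401/403 (expired session),
    call login() and retry up to `retries` times."""
    response = fetch()
    for _ in range(retries):
        if response.status_code not in (401, 403):
            break
        login()
        response = fetch()
    return response
```

Combining this with the timer-based refresh from Step 7 covers both predictable expiry and sessions the server invalidates early.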
Next Steps
- Store session tokens securely using environment variables or secret managers
- Build retry logic with exponential backoff for failed requests
- Combine authenticated scraping with SearchHive's DeepDive for AI-powered data extraction from protected pages
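The retry and secrets bullets above can be sketched together; the environment-variable names are assumptions to adapt to your setup:

```python
import os
import time

def retry_with_backoff(fn, retries: int = 4, base: float = 1.0, cap: float = 30.0):
    """Call fn(), retrying failures with exponentially growing delays
    (base, 2*base, 4*base, ... capped at `cap` seconds)."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(min(cap, base * 2 ** attempt))

# Keep credentials out of source code (variable names are assumptions):
credentials = {
    "email": os.environ.get("SCRAPER_EMAIL"),
    "password": os.environ.get("SCRAPER_PASSWORD"),
}
```

In production, catch only transient errors (timeouts, 5xx responses) rather than the bare `Exception` shown here, so permanent failures like bad credentials surface immediately.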
Get Started with SearchHive
SearchHive handles the hardest parts of authenticated scraping -- JavaScript rendering, proxy rotation, and CAPTCHA bypass. Start with 500 free credits and scrape authenticated pages reliably. Sign up free -- no credit card required. Check the API docs for the complete reference.