Many valuable datasets live behind login walls -- pricing pages, dashboards, member directories, SaaS analytics, and internal tools. Scraping authenticated content requires handling sessions, cookies, CSRF tokens, and sometimes multi-factor authentication.
This tutorial shows you how to scrape websites behind login walls using Python, covering both session-based and token-based authentication patterns, plus SearchHive's ScrapeForge API for when you want to skip the setup entirely.
Prerequisites
- Python 3.9+ with `httpx`, `beautifulsoup4`, and `playwright`
- Target website credentials (use a test account, never scrape with production credentials)
- Basic understanding of HTTP requests and cookies
```shell
pip install httpx beautifulsoup4 playwright
playwright install chromium
```
Step 1: Understand the Authentication Type
Before writing any code, identify how the target site handles authentication:
- Session cookies -- Most common. Login sets a `session_id` or similar cookie that must be sent with every subsequent request.
- Bearer tokens (JWT) -- Login returns a token in the response body. Send it in the `Authorization: Bearer <token>` header.
- API keys -- Static keys in headers. Easy to handle but often rate-limited.
- OAuth 2.0 -- Multi-step flow with redirects. More complex but well-documented.
How to identify: Open your browser's DevTools Network tab, log into the site, and watch the requests. Look at what changes in the request headers and cookies before and after login.
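Part of that DevTools inspection can be turned into code. The heuristic below is a rough sketch: the header names and body keys it checks are common conventions (assumptions on my part), not a complete detector -- always confirm against what you actually see in the Network tab.

```python
from typing import Optional

# Rough heuristic for guessing a site's auth pattern from its login
# response. The checked header names and body keys are common
# conventions, not guarantees.

def classify_auth(response_headers: dict, json_body: Optional[dict] = None) -> str:
    """Guess the auth pattern from a login response's headers and body."""
    headers = {k.lower() for k in response_headers}
    body = json_body or {}
    if "set-cookie" in headers:
        return "session-cookie"
    if any(key in body for key in ("access_token", "token", "jwt")):
        return "bearer-token"
    if "www-authenticate" in headers:
        return "http-basic"
    return "unknown"
```

Feed it the headers and parsed body of the login response you captured in DevTools; `classify_auth({"Set-Cookie": "session_id=abc"})` returns `"session-cookie"`, for example.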
Step 2: Session-Based Login with httpx
Most websites use session-based auth. Here is how to handle it:
```python
import httpx
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"
DASHBOARD_URL = "https://example.com/dashboard"

with httpx.Client(follow_redirects=True) as client:
    # Step 1: GET the login page (may set initial cookies / CSRF token)
    login_page = client.get(LOGIN_URL)
    soup = BeautifulSoup(login_page.text, "html.parser")

    # Step 2: Extract CSRF token if present
    csrf_input = soup.find("input", {"name": "csrf_token"})
    csrf_token = csrf_input["value"] if csrf_input else ""

    # Step 3: POST login credentials
    response = client.post(
        LOGIN_URL,
        data={
            "email": "your-email@example.com",
            "password": "your-password",
            "csrf_token": csrf_token,
        },
        headers={"Referer": LOGIN_URL},
    )

    # Step 4: Verify login succeeded. A redirect to the dashboard is a
    # stronger signal than a 200, which failed logins often return too.
    if "dashboard" in str(response.url):
        print("Login successful")

        # Step 5: Now scrape authenticated pages
        dashboard = client.get(DASHBOARD_URL)
        dashboard_soup = BeautifulSoup(dashboard.text, "html.parser")
        for item in dashboard_soup.select(".data-row"):
            title = item.select_one(".title").text.strip()
            value = item.select_one(".value").text.strip()
            print(f"{title}: {value}")
    else:
        print(f"Login failed: {response.status_code}")
```
Key points:
- Use a single `httpx.Client` session -- it automatically manages cookies between requests
- Extract CSRF tokens -- many frameworks (Django, Rails, Laravel) require them
- Set the Referer header -- some sites reject login POSTs without it
- Check the response URL -- successful logins often redirect
Step 3: Token-Based Authentication (JWT)
APIs and SPAs commonly use JWT tokens:
```python
import httpx

API_BASE = "https://api.example.com/v1"

def get_authenticated_client(email: str, password: str) -> httpx.Client:
    """Login and return a client with the auth token set."""
    response = httpx.post(
        f"{API_BASE}/auth/login",
        json={"email": email, "password": password},
    )
    response.raise_for_status()
    token = response.json()["access_token"]
    return httpx.Client(
        base_url=API_BASE,
        headers={"Authorization": f"Bearer {token}"},
    )

# Use it
client = get_authenticated_client("user@example.com", "pass")
data = client.get("/reports/monthly").json()
print(data)
```
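JWTs expire, and a long-running scraper should re-login before its token does. Because a JWT's `exp` claim is just base64-encoded JSON, you can check it locally without verifying the signature. A minimal sketch (the 30-second leeway is an arbitrary choice, not part of the JWT spec):

```python
import base64
import json
import time

def token_is_expired(token: str, leeway: int = 30) -> bool:
    """Check a JWT's exp claim locally (no signature verification)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = payload.get("exp")
    if exp is None:
        return False  # no expiry claim; assume the token is still usable
    return time.time() >= exp - leeway
```

Call this before each batch of requests and re-run `get_authenticated_client` when it returns `True`.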
Step 4: Handle JavaScript-Rendered Login Pages
Many modern sites use JavaScript-heavy login flows that httpx cannot handle. Use Playwright for these:
```python
import httpx
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to login
    page.goto("https://example.com/login")

    # Fill in credentials
    page.fill('input[name="email"]', "your-email@example.com")
    page.fill('input[name="password"]', "your-password")
    page.click('button[type="submit"]')

    # Wait for dashboard to load
    page.wait_for_url("**/dashboard**", timeout=10000)

    # Extract cookies for use with httpx
    cookies = page.context.cookies()
    cookie_jar = httpx.Cookies()
    for cookie in cookies:
        cookie_jar.set(cookie["name"], cookie["value"])

    browser.close()

# Now use httpx with these cookies for faster scraping
with httpx.Client(cookies=cookie_jar) as client:
    data = client.get("https://example.com/api/reports").json()
    print(data)
```
The hybrid approach (Playwright for login, httpx for scraping) is much faster than using Playwright for everything. Playwright launches a full browser -- it is slow. Use it only for the parts that require JavaScript execution.
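One wrinkle in that hand-off: `context.cookies()` can include cookies that have already expired, and copying those over just adds noise. A small filter helps (sketch; Playwright reports `expires` as epoch seconds, with `-1` meaning a session-only cookie):

```python
import time
from typing import Optional

def fresh_cookies(cookies: list, now: Optional[float] = None) -> dict:
    """Map cookie name -> value, dropping entries whose `expires`
    timestamp is already in the past. expires == -1 marks a
    session cookie, which is kept."""
    now = time.time() if now is None else now
    return {
        c["name"]: c["value"]
        for c in cookies
        if c.get("expires", -1) == -1 or c["expires"] > now
    }
```

The resulting dict can be passed straight to `httpx.Client(cookies=...)` in place of the loop above.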
Step 5: Handle MFA and CAPTCHAs
Multi-factor authentication and CAPTCHAs make automated login significantly harder.
CAPTCHA handling options:
- Use CAPTCHA-solving services (2Captcha, Anti-Captcha) -- these add cost and latency
- Use a headless browser that renders the CAPTCHA and manually solve it during development
- Check if the site offers API tokens or session cookies that bypass the login flow entirely
MFA handling options:
- Use TOTP libraries (`pyotp`) if you have access to the MFA secret key
- Some sites offer "app passwords" or API tokens that bypass MFA
- Session cookies from a manual browser login can be extracted and reused
```python
import httpx
import pyotp

# If you have the TOTP secret from the MFA setup
totp = pyotp.TOTP("YOUR_TOTP_SECRET")
mfa_code = totp.now()

# session_id comes from the initial login response
response = httpx.post(
    "https://example.com/auth/verify-mfa",
    json={"code": mfa_code, "session_id": session_id},
)
```
Step 6: Persist Sessions Across Runs
If you need to scrape the same site repeatedly, persist the session to avoid logging in every time:
```python
import httpx
import json
import os

SESSION_FILE = ".session_cookies.json"

def save_session(client: httpx.Client):
    """Save cookies to disk for reuse."""
    cookies = dict(client.cookies)
    with open(SESSION_FILE, "w") as f:
        json.dump(cookies, f)

def load_session() -> httpx.Client:
    """Load saved cookies into a new client."""
    if not os.path.exists(SESSION_FILE):
        raise FileNotFoundError("No saved session. Login first.")
    with open(SESSION_FILE) as f:
        cookies = json.load(f)
    return httpx.Client(cookies=httpx.Cookies(cookies))

def is_session_valid(client: httpx.Client) -> bool:
    """Check if the saved session is still active."""
    response = client.get("https://example.com/api/me")
    return response.status_code == 200
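You can also skip obviously stale session files before spending a network round trip on the `/api/me` check. A cheap local test based on file age (the 12-hour cutoff is an assumption -- tune it to the target site's actual session lifetime):

```python
import os
import time

MAX_SESSION_AGE = 12 * 60 * 60  # assumed session lifetime: 12 hours

def session_is_stale(path: str, max_age: float = MAX_SESSION_AGE) -> bool:
    """True if the cookie file is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    return time.time() - os.path.getmtime(path) > max_age
```

If `session_is_stale(SESSION_FILE)` returns `True`, log in from scratch instead of calling `load_session`.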
Step 7: The Easy Way -- SearchHive ScrapeForge
If you do not want to manage sessions, cookies, and CAPTCHAs manually, SearchHive's ScrapeForge API handles authentication for you:
```python
import httpx

response = httpx.post(
    "https://api.searchhive.dev/v1/scrape",
    json={
        "url": "https://example.com/members/dashboard",
        "format": "markdown",
        "auth": {
            "type": "basic",
            "username": "your-email@example.com",
            "password": "your-password",
        },
    },
    headers={"Authorization": "Bearer YOUR_SEARCHHIVE_API_KEY"},
)

data = response.json()
print(data.get("content", "")[:500])
```
ScrapeForge supports multiple auth types:
- Basic auth -- username/password via Authorization header
- Bearer token -- pass a pre-authenticated token
- Cookie-based -- pass session cookies directly
- Custom headers -- pass any auth headers your target site requires
This eliminates the need for Playwright, cookie management, and session persistence. ScrapeForge handles the rendering and authentication server-side.
Common Issues
"Login works in browser but fails in code": Check for JavaScript-set cookies, CSRF tokens in custom headers, and request timing. Use DevTools to compare your code's requests against the browser's.
Session expires mid-scrape: Implement session validation before each batch of requests. Re-authenticate when the session is invalid.
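That re-authenticate-when-invalid pattern can live in one small wrapper. This sketch treats an HTTP 401 as the expiry signal (an assumption -- some sites instead redirect to the login page, so adapt the check); `fetch` and `login` are injected callables:

```python
def with_reauth(fetch, login, max_retries: int = 1):
    """Call fetch(); on a 401 status, re-run login() and retry.

    fetch() must return a (status_code, body) tuple; login() should
    refresh whatever session state fetch() relies on.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 401:
            return status, body
        if attempt < max_retries:
            login()
    return status, body
```

In practice `fetch` would wrap a `client.get(...)` call and `login` would repeat the login POST from Step 2.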
Rate limiting after login: Space out your requests with `time.sleep()`. Many sites tolerate roughly 1-5 requests per second from authenticated users, but check the target's terms and observed behavior.
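Rather than sprinkling `sleep()` calls through your scraper, a tiny throttle object keeps the pacing in one place. A sketch targeting a fixed requests-per-second budget (the default of 2 rps is an arbitrary choice):

```python
import time

class Throttle:
    """Block so that successive wait() calls are at least 1/rps apart."""

    def __init__(self, rps: float = 2.0):
        self.interval = 1.0 / rps
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval has not yet elapsed
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()
```

Usage: create one `Throttle` per client and call `throttle.wait()` immediately before each `client.get(...)`.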
Cloudflare or bot detection: Residential proxies and realistic headers help. ScrapeForge handles this server-side with built-in proxy rotation.
Next Steps
Once you have authenticated scraping working:
- Schedule regular scrapes using cron or a task queue (Celery, RQ)
- Store results in a database (PostgreSQL, SQLite) for historical analysis
- Build change detection to get alerts when dashboard data changes
- Combine with search -- use SwiftSearch to find pages, then ScrapeForge to extract authenticated content
For a simpler approach to authenticated scraping, try SearchHive's ScrapeForge. The free tier includes 500 credits per month -- enough to evaluate the API without any cost.