Many valuable datasets live behind login walls -- pricing pages, dashboards, member directories, SaaS analytics, and internal tools. Scraping authenticated content requires handling sessions, cookies, CSRF tokens, and sometimes multi-factor authentication.
This tutorial shows you how to scrape websites behind login walls using Python, covering both session-based and token-based authentication patterns, plus SearchHive's ScrapeForge API for when you want to skip the setup entirely.
Prerequisites
- Python 3.9+ with `httpx`, `beautifulsoup4`, and `playwright`
- Target website credentials (use a test account, never scrape with production credentials)
- Basic understanding of HTTP requests and cookies
```shell
pip install httpx beautifulsoup4 playwright
playwright install chromium
```
Step 1: Understand the Authentication Type
Before writing any code, identify how the target site handles authentication:
- Session cookies -- Most common. Login sets a `session_id` or similar cookie that must be sent with every subsequent request.
- Bearer tokens (JWT) -- Login returns a token in the response body. Send it in the `Authorization: Bearer <token>` header.
- API keys -- Static keys in headers. Easy to handle but often rate-limited.
- OAuth 2.0 -- Multi-step flow with redirects. More complex but well-documented.
How to identify: Open your browser's DevTools Network tab, log into the site, and watch the requests. Look at what changes in the request headers and cookies before and after login.
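Part of that DevTools inspection can be turned into code. The heuristic below is a rough sketch: the header names and body keys it checks are common conventions (assumptions on my part), not a complete detector -- always confirm against what you actually see in the Network tab.

```python
from typing import Optional

# Rough heuristic for guessing a site's auth pattern from its login
# response. The checked header names and body keys are common
# conventions, not guarantees.

def classify_auth(response_headers: dict, json_body: Optional[dict] = None) -> str:
    """Guess the auth pattern from a login response's headers and body."""
    headers = {k.lower() for k in response_headers}
    body = json_body or {}
    if "set-cookie" in headers:
        return "session-cookie"
    if any(key in body for key in ("access_token", "token", "jwt")):
        return "bearer-token"
    if "www-authenticate" in headers:
        return "http-basic"
    return "unknown"
```

Feed it the headers and parsed body of the login response you captured in DevTools; `classify_auth({"Set-Cookie": "session_id=abc"})` returns `"session-cookie"`, for example.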
Step 2: Session-Based Login with httpx
Most websites use session-based auth. Here is how to handle it:
```python
import httpx
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"
DASHBOARD_URL = "https://example.com/dashboard"

with httpx.Client(follow_redirects=True) as client:
    # Step 1: GET the login page (may set initial cookies / CSRF token)
    login_page = client.get(LOGIN_URL)
    soup = BeautifulSoup(login_page.text, "html.parser")

    # Step 2: Extract CSRF token if present
    csrf_input = soup.find("input", {"name": "csrf_token"})
    csrf_token = csrf_input["value"] if csrf_input else ""

    # Step 3: POST login credentials
    response = client.post(
        LOGIN_URL,
        data={
            "email": "your-email@example.com",
            "password": "your-password",
            "csrf_token": csrf_token,
        },
        headers={"Referer": LOGIN_URL},
    )

    # Step 4: Verify login succeeded. A redirect to the dashboard is a
    # stronger signal than a 200, which failed logins often return too.
    if "dashboard" in str(response.url):
        print("Login successful")

        # Step 5: Now scrape authenticated pages
        dashboard = client.get(DASHBOARD_URL)
        dashboard_soup = BeautifulSoup(dashboard.text, "html.parser")
        for item in dashboard_soup.select(".data-row"):
            title = item.select_one(".title").text.strip()
            value = item.select_one(".value").text.strip()
            print(f"{title}: {value}")
    else:
        print(f"Login failed: {response.status_code}")
```
Key points:
- Use a single `httpx.Client` session -- it automatically manages cookies between requests
- Extract CSRF tokens -- many frameworks (Django, Rails, Laravel) require them
- Set the Referer header -- some sites reject login POSTs without it
- Check the response URL -- successful logins often redirect
Step 3: Token-Based Authentication (JWT)
APIs and SPAs commonly use JWT tokens:
```python
import httpx

API_BASE = "https://api.example.com/v1"

def get_authenticated_client(email: str, password: str) -> httpx.Client:
    """Login and return a client with the auth token set."""
    response = httpx.post(
        f"{API_BASE}/auth/login",
        json={"email": email, "password": password},
    )
    response.raise_for_status()
    token = response.json()["access_token"]
    return httpx.Client(
        base_url=API_BASE,
        headers={"Authorization": f"Bearer {token}"},
    )

# Use it
client = get_authenticated_client("user@example.com", "pass")
data = client.get("/reports/monthly").json()
print(data)
```
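JWTs expire, and a long-running scraper should re-login before its token does. Because a JWT's `exp` claim is just base64-encoded JSON, you can check it locally without verifying the signature. A minimal sketch (the 30-second leeway is an arbitrary choice, not part of the JWT spec):

```python
import base64
import json
import time

def token_is_expired(token: str, leeway: int = 30) -> bool:
    """Check a JWT's exp claim locally (no signature verification)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = payload.get("exp")
    if exp is None:
        return False  # no expiry claim; assume the token is still usable
    return time.time() >= exp - leeway
```

Call this before each batch of requests and re-run `get_authenticated_client` when it returns `True`.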
Step 4: Handle JavaScript-Rendered Login Pages
Many modern sites use JavaScript-heavy login flows that httpx cannot handle. Use Playwright for these:
```python
import httpx
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to login
    page.goto("https://example.com/login")

    # Fill in credentials
    page.fill('input[name="email"]', "your-email@example.com")
    page.fill('input[name="password"]', "your-password")
    page.click('button[type="submit"]')

    # Wait for dashboard to load
    page.wait_for_url("**/dashboard**", timeout=10000)

    # Extract cookies for use with httpx
    cookies = page.context.cookies()
    cookie_jar = httpx.Cookies()
    for cookie in cookies:
        cookie_jar.set(cookie["name"], cookie["value"])

    browser.close()

# Now use httpx with these cookies for faster scraping
with httpx.Client(cookies=cookie_jar) as client:
    data = client.get("https://example.com/api/reports").json()
    print(data)
```
The hybrid approach (Playwright for login, httpx for scraping) is much faster than using Playwright for everything. Playwright launches a full browser -- it is slow. Use it only for the parts that require JavaScript execution.
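One wrinkle in that hand-off: `context.cookies()` can include cookies that have already expired, and copying those over just adds noise. A small filter helps (sketch; Playwright reports `expires` as epoch seconds, with `-1` meaning a session-only cookie):

```python
import time
from typing import Optional

def fresh_cookies(cookies: list, now: Optional[float] = None) -> dict:
    """Map cookie name -> value, dropping entries whose `expires`
    timestamp is already in the past. expires == -1 marks a
    session cookie, which is kept."""
    now = time.time() if now is None else now
    return {
        c["name"]: c["value"]
        for c in cookies
        if c.get("expires", -1) == -1 or c["expires"] > now
    }
```

The resulting dict can be passed straight to `httpx.Client(cookies=...)` in place of the loop above.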
Step 5: Handle MFA and CAPTCHAs
Multi-factor authentication and CAPTCHAs make automated login significantly harder.
CAPTCHA handling options:
- Use CAPTCHA-solving services (2Captcha, Anti-Captcha) -- these add cost and latency
- Use a headless browser that renders the CAPTCHA and manually solve it during development
- Check if the site offers API tokens or session cookies that bypass the login flow entirely
MFA handling options:
- Use TOTP libraries (`pyotp`) if you have access to the MFA secret key
- Some sites offer "app passwords" or API tokens that bypass MFA
- Session cookies from a manual browser login can be extracted and reused
```python
import httpx
import pyotp

# If you have the TOTP secret from the MFA setup
totp = pyotp.TOTP("YOUR_TOTP_SECRET")
mfa_code = totp.now()

# session_id comes from the initial login response
response = httpx.post(
    "https://example.com/auth/verify-mfa",
    json={"code": mfa_code, "session_id": session_id},
)
```
Step 6: Persist Sessions Across Runs
If you need to scrape the same site repeatedly, persist the session to avoid logging in every time:
```python
import httpx
import json
import os

SESSION_FILE = ".session_cookies.json"

def save_session(client: httpx.Client):
    """Save cookies to disk for reuse."""
    cookies = dict(client.cookies)
    with open(SESSION_FILE, "w") as f:
        json.dump(cookies, f)

def load_session() -> httpx.Client:
    """Load saved cookies into a new client."""
    if not os.path.exists(SESSION_FILE):
        raise FileNotFoundError("No saved session. Login first.")
    with open(SESSION_FILE) as f:
        cookies = json.load(f)
    return httpx.Client(cookies=httpx.Cookies(cookies))

def is_session_valid(client: httpx.Client) -> bool:
    """Check if the saved session is still active."""
    response = client.get("https://example.com/api/me")
    return response.status_code == 200
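You can also skip obviously stale session files before spending a network round trip on the `/api/me` check. A cheap local test based on file age (the 12-hour cutoff is an assumption -- tune it to the target site's actual session lifetime):

```python
import os
import time

MAX_SESSION_AGE = 12 * 60 * 60  # assumed session lifetime: 12 hours

def session_is_stale(path: str, max_age: float = MAX_SESSION_AGE) -> bool:
    """True if the cookie file is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    return time.time() - os.path.getmtime(path) > max_age
```

If `session_is_stale(SESSION_FILE)` returns `True`, log in from scratch instead of calling `load_session`.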
Step 7: The Easy Way -- SearchHive ScrapeForge
If you do not want to manage sessions, cookies, and CAPTCHAs manually, SearchHive's ScrapeForge API handles authentication for you:
```python
import httpx

response = httpx.post(
    "https://api.searchhive.dev/v1/scrape",
    json={
        "url": "https://example.com/members/dashboard",
        "format": "markdown",
        "auth": {
            "type": "basic",
            "username": "your-email@example.com",
            "password": "your-password",
        },
    },
    headers={"Authorization": "Bearer YOUR_SEARCHHIVE_API_KEY"},
)

data = response.json()
print(data.get("content", "")[:500])
```
ScrapeForge supports multiple auth types:
- Basic auth -- username/password via Authorization header
- Bearer token -- pass a pre-authenticated token
- Cookie-based -- pass session cookies directly
- Custom headers -- pass any auth headers your target site requires
This eliminates the need for Playwright, cookie management, and session persistence. ScrapeForge handles the rendering and authentication server-side.
Common Issues
"Login works in browser but fails in code": Check for JavaScript-set cookies, CSRF tokens in custom headers, and request timing. Use DevTools to compare your code's requests against the browser's.
Session expires mid-scrape: Implement session validation before each batch of requests. Re-authenticate when the session is invalid.
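That re-authenticate-when-invalid pattern can live in one small wrapper. This sketch treats an HTTP 401 as the expiry signal (an assumption -- some sites instead redirect to the login page, so adapt the check); `fetch` and `login` are injected callables:

```python
def with_reauth(fetch, login, max_retries: int = 1):
    """Call fetch(); on a 401 status, re-run login() and retry.

    fetch() must return a (status_code, body) tuple; login() should
    refresh whatever session state fetch() relies on.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 401:
            return status, body
        if attempt < max_retries:
            login()
    return status, body
```

In practice `fetch` would wrap a `client.get(...)` call and `login` would repeat the login POST from Step 2.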
Rate limiting after login: Space out your requests with `time.sleep()`. Many sites tolerate roughly 1-5 requests per second from authenticated users, but check the target's terms and observed behavior.
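Rather than sprinkling `sleep()` calls through your scraper, a tiny throttle object keeps the pacing in one place. A sketch targeting a fixed requests-per-second budget (the default of 2 rps is an arbitrary choice):

```python
import time

class Throttle:
    """Block so that successive wait() calls are at least 1/rps apart."""

    def __init__(self, rps: float = 2.0):
        self.interval = 1.0 / rps
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval has not yet elapsed
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()
```

Usage: create one `Throttle` per client and call `throttle.wait()` immediately before each `client.get(...)`.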
Cloudflare or bot detection: Residential proxies and realistic headers help. ScrapeForge handles this server-side with built-in proxy rotation.
Next Steps
Once you have authenticated scraping working:
- Schedule regular scrapes using cron or a task queue (Celery, RQ)
- Store results in a database (PostgreSQL, SQLite) for historical analysis
- Build change detection to get alerts when dashboard data changes
- Combine with search -- use SwiftSearch to find pages, then ScrapeForge to extract authenticated content
For a simpler approach to authenticated scraping, try SearchHive's ScrapeForge. The free tier includes 500 credits per month -- enough to evaluate the API without any cost.