Many valuable websites keep their best data behind authentication -- pricing pages, dashboards, member directories, analytics panels. Scraping authenticated content requires handling sessions, cookies, CSRF tokens, and sometimes multi-factor authentication. This tutorial covers the practical techniques for scraping behind login walls using Python and SearchHive.
Prerequisites
- Python 3.8+ with requests and beautifulsoup4 installed
- A SearchHive API key (free tier includes 500 credits)
- Target website credentials (for sites you have permission to access)
pip install requests beautifulsoup4
Key Takeaways
- Scraping behind login requires managing session state -- cookies, CSRF tokens, and authentication headers
- SearchHive's ScrapeForge can handle session-based authentication by forwarding cookies
- For complex login flows (OAuth, 2FA, CAPTCHA on login), SearchHive's headless browser rendering with proxy rotation gives you the most reliable approach
- Always check terms of service before scraping authenticated content -- scraping data you are authorized to access is different from bypassing access controls
Step 1: Understand the Authentication Method
Before writing any code, identify how the target site authenticates:
- Cookie-based sessions: Most common. Login POST request sets session cookies that you forward with subsequent requests
- Token-based (JWT): Login returns an access token that you include in the Authorization header
- OAuth 2.0: Redirects through a third-party auth provider
- CSRF-protected: Login form includes a hidden CSRF token that must be submitted with credentials
Inspect the login flow using your browser's DevTools:
- Open DevTools (F12), go to the Network tab
- Submit the login form normally
- Look at the POST request to the login endpoint
- Note the request headers (especially Cookie and CSRF token), form data, and response headers (especially Set-Cookie)
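A quick shortcut while you are still exploring: copy the Cookie header from the authenticated request in DevTools and replay it from Python. The helper below is a minimal sketch; the cookie names shown are placeholders and will differ per site.

```python
import requests

def parse_cookie_header(raw: str) -> dict:
    """Split a raw Cookie header (as copied from DevTools) into a name -> value dict."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies

# Paste the Cookie header from the authenticated request in DevTools:
raw_header = "sessionid=abc123; csrftoken=xyz789"
session = requests.Session()
session.cookies.update(parse_cookie_header(raw_header))
# session.get("https://example.com/dashboard") now sends your browser's cookies
```

This only works until the browser session expires, but it is a fast way to confirm which cookies actually gate access before you automate the full login.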
Step 2: Session-Based Login with requests
The simplest case -- a standard HTML form login with cookie-based sessions:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: GET the login page to pick up the CSRF token and initial cookies
login_page = session.get("https://example.com/login")

# Extract the CSRF token (varies by framework)
# Django: <input type="hidden" name="csrfmiddlewaretoken" value="...">
soup = BeautifulSoup(login_page.text, "html.parser")
csrf_token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

# Step 2: POST login credentials with the CSRF token
login_response = session.post(
    "https://example.com/login",
    data={
        "username": "your_username",
        "password": "your_password",
        "csrfmiddlewaretoken": csrf_token,
    },
    headers={"Referer": "https://example.com/login"},
)

# Step 3: Verify login succeeded. A 200 alone is not reliable -- many sites
# return 200 with an error page. Check for a redirect to a logged-in URL.
if "dashboard" in login_response.url:
    print("Login successful")
else:
    print(f"Login may have failed: {login_response.status_code}")

# Step 4: Scrape authenticated pages using the session
dashboard = session.get("https://example.com/dashboard/data")
print(dashboard.text[:500])
The key is using requests.Session() -- it automatically persists cookies across requests, maintaining your authenticated state.
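Because the session's cookie jar holds your authenticated state, you can persist it to disk and skip the login step on the next run. A minimal sketch (the file name is an assumption; note that flattening to a plain dict drops per-cookie domain and path attributes, which is fine for single-site scrapers):

```python
import json
import pathlib

import requests

COOKIE_FILE = pathlib.Path("session_cookies.json")  # assumed location

def save_cookies(session: requests.Session, path: pathlib.Path = COOKIE_FILE) -> None:
    """Dump the session's cookie jar to disk as plain JSON."""
    path.write_text(json.dumps(session.cookies.get_dict()))

def load_cookies(session: requests.Session, path: pathlib.Path = COOKIE_FILE) -> bool:
    """Restore saved cookies into a fresh session; returns False if none saved."""
    if not path.exists():
        return False
    session.cookies.update(json.loads(path.read_text()))
    return True
```

Try `load_cookies` first, make a test request, and fall back to the full login only if the saved session has expired.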
Step 3: Token-Based Authentication (JWT)
Many modern web apps return a JWT after login instead of cookies:
import requests

# Login to get an access token
response = requests.post(
    "https://api.example.com/auth/login",
    json={
        "email": "your@email.com",
        "password": "your_password",
    },
)

if response.status_code == 200:
    token = response.json().get("access_token")
    headers = {"Authorization": f"Bearer {token}"}

    # Use the token for subsequent requests
    data = requests.get(
        "https://api.example.com/dashboard/reports",
        headers=headers,
    )
    print(data.json())
else:
    print(f"Auth failed: {response.text}")
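JWTs usually carry their own expiry in the `exp` claim, so you can schedule a refresh before the token lapses instead of waiting for a 401. The payload is just base64url-encoded JSON, readable with the standard library (this decodes without verifying the signature, which is fine for reading your own token's expiry):

```python
import base64
import json
import time

def jwt_expiry(token: str):
    """Read the exp claim (a Unix timestamp) from a JWT payload without
    verifying the signature. Returns None if the claim is absent."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims.get("exp")

def seconds_until_expiry(token: str) -> float:
    """How long the token remains valid; infinite if it has no exp claim."""
    exp = jwt_expiry(token)
    return exp - time.time() if exp else float("inf")
```

Refresh when `seconds_until_expiry` drops below a safety margin (say, 60 seconds) rather than on a fixed timer.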
Step 4: Using SearchHive for Authenticated Scraping
For sites with complex login flows (JavaScript-rendered login forms, OAuth redirects, CAPTCHAs on login), SearchHive's ScrapeForge with headless browser rendering is the most reliable approach:
import requests

API_KEY = "your_searchhive_key"
BASE_URL = "https://api.searchhive.dev/v1"
sh_headers = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: Scrape the login page to get cookies and the CSRF token
login_page = requests.post(
    f"{BASE_URL}/scrape",
    headers=sh_headers,
    json={
        "url": "https://example.com/login",
        "render_js": True,
        "cookies": {},  # Fresh session
    },
)

# Extract cookies from the response
cookies = login_page.json().get("cookies", {})

# Step 2: Submit the login form (the actual login happens on your side or via ScrapeForge)
# For simple cookie-based auth, forward the session cookies:
protected_data = requests.post(
    f"{BASE_URL}/scrape",
    headers=sh_headers,
    json={
        "url": "https://example.com/dashboard",
        "render_js": True,
        "cookies": cookies,
        "extract": {
            "table_data": {
                "selector": "table.data-table tr",
                "fields": ["name", "value", "date"],
            }
        },
    },
)
print(protected_data.json())
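You can also log in locally with requests (as in Step 2) and hand the resulting cookies to ScrapeForge for the heavy JavaScript pages. The cookie jar flattens to the plain name -> value mapping that the `cookies` field in the payload above uses (assumed format, per the example):

```python
import requests

def cookies_for_forwarding(session: requests.Session) -> dict:
    """Flatten a requests cookie jar into the plain name -> value mapping
    used by the "cookies" field in the /scrape payload above."""
    return session.cookies.get_dict()

session = requests.Session()
session.cookies.set("sessionid", "abc123")  # normally set by your login POST
payload_cookies = cookies_for_forwarding(session)
```

This split keeps the cheap login on your side while ScrapeForge handles only the pages that need rendering.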
Step 5: Handle JavaScript-Rendered Login Forms
Many sites use SPAs (React, Angular, Vue) where the login form is rendered by JavaScript. The actual authentication happens via an XHR/fetch POST to an API endpoint. Here is how to handle this:
import requests

session = requests.Session()

# First, load the SPA to pick up any initial cookies
session.get("https://example.com")

# Find the actual auth endpoint by inspecting DevTools
# Typically POST /api/auth/login or similar
auth_response = session.post(
    "https://example.com/api/auth/login",
    json={
        "email": "your@email.com",
        "password": "your_password",
    },
    headers={
        "Content-Type": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    },
)

if auth_response.status_code == 200:
    token = auth_response.json().get("token")
    session.headers.update({"Authorization": f"Bearer {token}"})

    # Now scrape authenticated endpoints
    data = session.get("https://example.com/api/dashboard/data")
    records = data.json()
    for record in records[:10]:
        print(record)
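Many SPA backends also require a CSRF header on state-changing requests, using the double-submit pattern: the server sets a CSRF cookie and expects the same value echoed back in a header. A small helper (the cookie and header names are assumptions; Django, for example, uses `csrftoken` / `X-CSRFToken`, so adjust per site):

```python
import requests

def csrf_header(session: requests.Session,
                cookie_name: str = "csrftoken",
                header_name: str = "X-CSRF-Token") -> dict:
    """Double-submit CSRF: echo the CSRF cookie back as a request header.
    Returns an empty dict if the cookie has not been set yet."""
    token = session.cookies.get(cookie_name)
    return {header_name: token} if token else {}
```

Merge the result into the headers of your login POST if the API rejects the request with a 403 despite correct credentials.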
Step 6: Handle Multi-Step Login Flows
Some sites require multiple steps -- email entry, password entry, 2FA code:
import requests

session = requests.Session()

# Step 1: Submit email
session.post("https://example.com/auth/email", json={"email": "you@example.com"})

# Step 2: Submit password
session.post("https://example.com/auth/password", json={"password": "your_password"})

# Step 3: If 2FA is required, handle it one of two ways (use one, not both)

# Option A: Automated 2FA (if the account uses TOTP; pip install pyotp)
import pyotp

totp = pyotp.TOTP("YOUR_TOTP_SECRET")
code = totp.now()
session.post("https://example.com/auth/2fa", json={"code": code})

# Option B: Manual 2FA (pause and wait for user input)
# code = input("Enter 2FA code: ")
# session.post("https://example.com/auth/2fa", json={"code": code})

# Step 4: Access authenticated data
data = session.get("https://example.com/api/protected")
print(data.json())
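If you would rather not add the pyotp dependency, TOTP (RFC 6238) is compact enough to implement with the standard library. This sketch assumes the default SHA-1 / 30-second parameters that authenticator apps use; `secret_b32` is the base32 secret shown on the authenticator-app setup screen:

```python
import base64
import hmac
import struct
import time

def totp(secret_b32: str, digits: int = 6, period: int = 30, at=None) -> str:
    """RFC 6238 TOTP code from a base32 secret, using only the stdlib.
    `at` is a Unix timestamp for testing; defaults to now."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The test vectors in RFC 6238 (secret `12345678901234567890`, i.e. `GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ` in base32) confirm the implementation.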
Step 7: Maintain Session Longevity
Sessions expire. Here is how to handle session refresh:
import time

import requests


class AuthenticatedScraper:
    def __init__(self, base_url, credentials):
        self.base_url = base_url
        self.credentials = credentials
        self.session = requests.Session()
        self.last_refresh = 0
        self.refresh_interval = 3600  # Refresh every hour

    def login(self):
        """Authenticate and get fresh tokens/cookies."""
        response = self.session.post(
            f"{self.base_url}/auth/login",
            json=self.credentials,
        )
        if response.status_code == 200:
            token = response.json().get("access_token")
            self.session.headers.update({"Authorization": f"Bearer {token}"})
            self.last_refresh = time.time()
            return True
        return False

    def get(self, path):
        """GET with auto-refresh."""
        if time.time() - self.last_refresh > self.refresh_interval:
            print("Refreshing session...")
            if not self.login():
                raise Exception("Session refresh failed")
        return self.session.get(f"{self.base_url}{path}")

    def scrape_all(self, paths):
        """Scrape multiple authenticated paths."""
        self.login()
        results = {}
        for path in paths:
            try:
                response = self.get(path)
                results[path] = response.json() if response.status_code == 200 else None
            except Exception as e:
                results[path] = {"error": str(e)}
            time.sleep(1)  # Be polite between requests
        return results


# Usage
scraper = AuthenticatedScraper(
    "https://example.com",
    {"email": "you@example.com", "password": "pass"},
)
data = scraper.scrape_all(["/api/reports", "/api/users", "/api/analytics"])
Common Issues
- Session expiration: Tokens and cookies expire. Implement auto-refresh (Step 7) to avoid broken scrapers.
- CSRF token mismatches: Some sites rotate CSRF tokens per request. Always fetch a fresh token before each form submission.
- CAPTCHA on login: If the login page shows a CAPTCHA, automated scraping becomes difficult. SearchHive can bypass some CAPTCHAs, but consider whether you have legitimate access.
- IP blocking: Frequent login attempts from the same IP can trigger security blocks. SearchHive's proxy rotation distributes requests across IPs.
- 2FA requirements: Fully automated 2FA is generally only practical with TOTP. SMS-based 2FA requires manual intervention.
- Terms of service: Only scrape authenticated content from accounts you own or have explicit permission to access.
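The session-expiration issue above can also be handled reactively: instead of refreshing on a timer, re-login when a request comes back 401 or 403. This sketch is deliberately generic; `fetch` and `login` are callables you supply, e.g. lambdas wrapping `session.get` and your login routine:

```python
def request_with_reauth(fetch, login, retries: int = 1):
    """Run fetch(); if the response is a 401/403 (expired session),
    call login() and retry up to `retries` times."""
    response = fetch()
    for _ in range(retries):
        if response.status_code not in (401, 403):
            break
        login()
        response = fetch()
    return response
```

Combining this with the timer-based refresh from Step 7 covers both predictable expiry and sessions the server invalidates early.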
Next Steps
- Store session tokens securely using environment variables or secret managers
- Build retry logic with exponential backoff for failed requests
- Combine authenticated scraping with SearchHive's DeepDive for AI-powered data extraction from protected pages
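The retry and secrets bullets above can be sketched together; the environment-variable names are assumptions to adapt to your setup:

```python
import os
import time

def retry_with_backoff(fn, retries: int = 4, base: float = 1.0, cap: float = 30.0):
    """Call fn(), retrying failures with exponentially growing delays
    (base, 2*base, 4*base, ... capped at `cap` seconds)."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(min(cap, base * 2 ** attempt))

# Keep credentials out of source code (variable names are assumptions):
credentials = {
    "email": os.environ.get("SCRAPER_EMAIL"),
    "password": os.environ.get("SCRAPER_PASSWORD"),
}
```

In production, catch only transient errors (timeouts, 5xx responses) rather than the bare `Exception` shown here, so permanent failures like bad credentials surface immediately.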
Get Started with SearchHive
SearchHive handles the hardest parts of authenticated scraping -- JavaScript rendering, proxy rotation, and CAPTCHA bypass. Start with 500 free credits and scrape authenticated pages reliably. Sign up free -- no credit card required. Check the API docs for the complete reference.