Complete Guide to Web Scraping Without Getting Blocked
Web scraping in 2025 is a cat-and-mouse game. Every time scrapers get smarter, websites deploy more sophisticated anti-bot systems. This guide covers the practical strategies that actually work — tested across thousands of real scraping jobs — to extract data reliably without getting blocked.
Background
Modern websites protect their data with multiple layers of defense:
- IP-level blocking — rate limits, subnet bans, and IP reputation blacklists
- Browser fingerprinting — detecting headless browsers via Canvas, WebGL, and font fingerprinting
- JavaScript challenges — Cloudflare, Datadome, PerimeterX, Kasada
- Behavioral analysis — mouse movement patterns, scroll speed, click timing
- CAPTCHA — visual puzzles, reCAPTCHA, hCaptcha, Turnstile
- TLS fingerprinting — identifying non-browser HTTP clients (JA3/JA4 hashes)
Getting blocked isn't just inconvenient — it can mean losing access to critical data sources mid-pipeline, corrupting datasets with error pages instead of real content, and wasting credits on failed requests.
The Challenge
A typical scraping scenario: you need to collect product data from 50,000 pages on a modern e-commerce site. The site uses Cloudflare, renders content with JavaScript, and rotates its blocking rules weekly.
Using naive requests with a fixed IP, you'll get blocked within the first 100 requests. Using requests with rotating data center proxies, you'll get through maybe 1,000-5,000 requests before Cloudflare flags the proxy subnet. Using a basic headless browser without stealth patches, you'll get fingerprinted and blocked almost immediately.
The solution isn't a single technique — it's combining the right tools and strategies for each layer of protection.
Solution with SearchHive
SearchHive's ScrapeForge API handles the hardest parts of anti-blocking automatically:
- Automatic proxy rotation — residential proxies that look like real users
- JavaScript rendering — full browser rendering for dynamic content
- Stealth browser profiles — realistic fingerprints that pass bot detection
- CAPTCHA solving — handled transparently
- Wait conditions — wait for specific elements before extracting
```python
import requests

headers = {
    "Authorization": "Bearer sh_live_your_api_key_here",
    "Content-Type": "application/json"
}

# ScrapeForge handles all anti-blocking automatically
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers=headers,
    json={
        "url": "https://protected-ecommerce-site.com/products/page-1",
        "render_js": True,
        "proxy": "auto",
        "wait_for": "div.product-grid",
        "format": "markdown"
    }
)

data = response.json()
print(data["data"]["content"][:500])
```
For projects that need even more control, here are the underlying strategies that ScrapeForge uses under the hood.
Implementation: Anti-Blocking Strategies That Work
1. Residential Proxies Over Data Center Proxies
Data center proxies are cheap ($0.10-1.00/IP) but easily detected. Cloudflare and similar services maintain lists of known data center IP ranges and block entire subnets.
Residential proxies ($3-15/GB) use real home and office IPs. They're harder to detect and far more reliable for scraping.
```python
import requests

# Bad: data center proxy — easily detected
proxies_dc = {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"}

# Better: residential proxy pool with rotation
# SearchHive handles this with "proxy": "auto"
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers=headers,
    json={"url": "https://target-site.com/data", "proxy": "auto"}
)
```
2. Realistic Request Patterns
Bots make requests in predictable patterns: same interval, same headers, sequential URLs. Real humans don't behave this way.
- Randomize delays — 2-15 seconds between requests, with occasional longer pauses
- Vary headers — rotate user agents, accept languages, and referer headers
- Simulate navigation — visit the homepage first, then navigate to target pages
- Avoid sequential URLs — shuffle or randomize the order of pages you scrape
```python
import time
import random

pages = ["https://target.com/page/1", "https://target.com/page/2", "https://target.com/page/3"]
random.shuffle(pages)

for page in pages:
    response = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers=headers,
        json={"url": page, "render_js": True, "proxy": "auto"}
    )
    # Random delay between 3 and 8 seconds
    time.sleep(random.uniform(3, 8))
```
3. Stealth Browser Configuration
If you're running your own headless browsers (Playwright, Puppeteer), apply stealth patches:
- Disable automation flags — remove the `navigator.webdriver` indicator
- Realistic viewport — common screen resolutions (1920x1080, 1440x900)
- Realistic browser headers — match the browser/OS combination you're simulating
- WebGL vendor spoofing — return realistic GPU vendor strings
- Canvas fingerprint randomization — add subtle noise to canvas rendering
Popular libraries: puppeteer-extra-plugin-stealth for Puppeteer, playwright-stealth for Playwright.
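If you're wiring this up yourself, the pieces can be sketched as a small profile builder. This is illustrative only (the `build_stealth_profile` helper, the viewport list, and the user-agent string are assumptions, not any library's API):

```python
import random

# Common desktop resolutions; headless defaults like 800x600 stand out
COMMON_VIEWPORTS = [(1920, 1080), (1440, 900), (1366, 768)]

# Init script that hides the navigator.webdriver automation flag
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', "
    "{ get: () => undefined });"
)

def build_stealth_profile():
    """Assemble settings you would pass to a browser context."""
    width, height = random.choice(COMMON_VIEWPORTS)
    return {
        "viewport": {"width": width, "height": height},
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "init_script": HIDE_WEBDRIVER_JS,
    }

# With Playwright, you would apply it roughly like:
#   context = browser.new_context(
#       viewport=profile["viewport"], user_agent=profile["user_agent"])
#   context.add_init_script(profile["init_script"])
```

The stealth plugins mentioned above do this (and much more) for you; rolling your own is only worth it when you need to control every detail of the fingerprint.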
4. Handle JavaScript Challenges
Many sites render a challenge page that requires JavaScript execution to solve. A simple HTTP client can't do this.
Options:
- Headless browser — Full rendering solves most JS challenges automatically
- TLS fingerprinting — Use `curl_cffi` or `tls-client` in Python to match browser TLS fingerprints (JA3/JA4). This alone bypasses many basic protections.
- Scraping API — Let the service handle it. SearchHive's ScrapeForge uses headless Chrome with residential proxies.
```python
# For sites that just need TLS fingerprint matching
# (no JS rendering required)
from curl_cffi import requests as cf_requests

session = cf_requests.Session(impersonate="chrome120")
resp = session.get("https://target-site.com/data")
```
5. Session and Cookie Management
Some sites use cookies to track sessions and detect anomalies:
- Maintain cookie jars across requests within a session
- Rotate sessions periodically to avoid cookie-based rate limiting
- Extract cookies from a real browser and use them in your scraper (use with caution; browser cookies are short-lived)
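The first two points can be sketched as a session pool: a generator hands out one `requests.Session` (with its cookie jar) and swaps in a fresh one every N requests. The `session_pool` helper and the `rotate_every` threshold are illustrative assumptions, not a SearchHive feature:

```python
import itertools
import requests

def session_pool(rotate_every=25):
    """Yield a shared requests.Session (one cookie jar), swapping in a
    fresh session, and therefore fresh cookies, every `rotate_every`
    requests so cookie-based rate limiting resets."""
    session = requests.Session()
    for count in itertools.count(1):
        yield session
        if count % rotate_every == 0:
            session.close()
            session = requests.Session()

# Usage: each request reuses the same cookies until the pool rotates
pool = session_pool(rotate_every=25)
# resp = next(pool).get("https://target-site.com/data")
```

Tune `rotate_every` to the target: rotate too often and you lose the "established session" signal that makes traffic look human; too rarely and the accumulated cookies become a tracking handle.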
6. Respect Rate Limits
The most overlooked strategy: don't hit the site harder than you need to.
- Start with conservative rates and increase gradually
- Monitor response codes — if you see 429s or 403s, slow down immediately
- Scrape during off-peak hours (2-6 AM target timezone)
- Cache responses to avoid re-scraping the same pages
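One way to make "slow down immediately on 429s" concrete is an adaptive delay. A minimal sketch (the `adaptive_delay` function and its base/ceiling defaults are assumptions to tune per target):

```python
def adaptive_delay(status_code, current_delay, base=3.0, ceiling=120.0):
    """Double the wait after a 429/403 response; recover slowly on success."""
    if status_code in (429, 403):
        # Back off hard, but cap the wait at `ceiling` seconds
        return min(max(current_delay, base) * 2, ceiling)
    # Success: shrink the delay gradually, never below `base`
    return max(base, current_delay * 0.9)

# In a scrape loop:
#   delay = 3.0
#   for page in pages:
#       resp = fetch(page)                               # your fetch helper
#       delay = adaptive_delay(resp.status_code, delay)
#       time.sleep(delay)
```

Doubling on failure and shrinking by only 10% on success keeps the scraper biased toward caution, which is exactly what you want against a site that rotates its blocking rules.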
Results: What to Expect
With these strategies in place, here's what realistic success rates look like:
| Approach | Success Rate | Pages/Minute | Cost per 100K Pages |
|---|---|---|---|
| Naive requests | 0-5% | High but useless | $0 (wasted effort) |
| Data center proxies | 30-60% | 50-100 | $5-20 |
| Residential proxies | 80-95% | 10-30 | $50-150 |
| SearchHive ScrapeForge | 90-98% | 20-50 | $49 (Builder plan) |
| Manual stealth browser | 85-95% | 5-15 | $200-500+ (infra costs) |
The key insight: using a managed scraping API is cheaper and more reliable than self-hosting infrastructure for most teams. The $49/month SearchHive Builder plan gets you 100K credits with automatic anti-blocking — compare that to the infrastructure and maintenance costs of running your own proxy-rotating browser farm.
Lessons Learned
- Start with the simplest approach that works. Don't over-engineer. If the target site has no anti-bot protection, a simple requests-based scraper is fine.
- Scraping APIs beat self-hosted for most use cases. The engineering time saved on proxy management, CAPTCHA solving, and browser maintenance is worth far more than the monthly API cost.
- Respect the target. Even with anti-blocking tools, scraping at abusive rates is unethical and counterproductive. Rate limit, respect robots.txt, and don't impact the target site's performance.
- Have a fallback plan. Sites change their anti-bot protection regularly. Build scrapers that gracefully handle failures and can switch between strategies.
- Monitor your success rate. Track the ratio of successful extractions to total attempts. If it drops below 90%, investigate and adjust before it gets worse.
Get Started
Stop fighting anti-bot systems manually. Let SearchHive handle proxy rotation, JavaScript rendering, and stealth configuration so you can focus on the data that matters.
Get 500 free credits — test ScrapeForge against your hardest targets. No credit card required.
For more data extraction strategies, see /blog/data-extraction-for-ai-common-questions-answered or compare scraping tools at /compare/firecrawl.