Parallel web scraping is the difference between waiting hours for data and getting it in minutes. But doing it wrong -- too many concurrent requests, no rate limiting, no error handling -- gets your IPs blocked fast. This FAQ answers the most common questions about parallel scraping, with practical advice and working code examples.
Key Takeaways
- Start with 5-10 concurrent workers and scale up only after testing
- Rate limiting is non-negotiable -- even with rotating proxies, aggressive scraping triggers blocks
- SearchHive's ScrapeForge API handles proxy rotation and anti-bot detection, letting you focus on concurrency control
- Error handling in parallel scraping requires different patterns than sequential scraping
- Async IO beats threading for I/O-bound scraping workloads in Python
Q: How many concurrent requests should I make?
Start with 5-10 concurrent workers. Increase gradually while monitoring failure rates. If your error rate jumps above 2-3%, you've gone too far.
The right concurrency depends on several factors:
- Target site tolerance: Major e-commerce sites tolerate more traffic; smaller sites block faster
- Proxy quality: Residential proxies handle higher concurrency than datacenter proxies
- API vs direct scraping: When using an API like SearchHive, the API provider manages proxy pools, so you can push harder
- Time of day: Off-peak hours allow slightly higher concurrency
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "your_searchhive_key"
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]

def scrape(url):
    return requests.get("https://api.searchhive.dev/scrapeforge", params={
        "url": url, "format": "json", "api_key": API_KEY
    }).json()

# Start with 5 workers, increase if error rate stays low
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(scrape, u): u for u in urls}
    for future in as_completed(futures):
        result = future.result()
        url = futures[future]
        print(f"{'OK' if 'content' in result else 'FAIL'}: {url}")
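The 2-3% error threshold can also drive the worker count for you instead of being checked by hand. A minimal sketch of batch-based adaptive concurrency (the function name, batch size, and thresholds are illustrative assumptions, not part of any SearchHive API):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_adaptive(urls, scrape_fn, start_workers=5, max_workers=20,
                    batch_size=50, error_threshold=0.03):
    """Scrape in batches, raising worker count while the error rate stays low."""
    workers = start_workers
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=workers) as executor:
            batch_results = list(executor.map(scrape_fn, batch))
        # Count failures in this batch (assumes scrape_fn marks them with "error")
        errors = sum(1 for r in batch_results if "error" in r)
        if errors / len(batch) > error_threshold:
            workers = max(start_workers, workers // 2)  # too many failures: back off
        elif workers < max_workers:
            workers += 2  # healthy batch: push a little harder
        results.extend(batch_results)
    return results
```

Halving on failure and growing slowly on success mirrors how TCP congestion control probes for capacity without hammering the target.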
Q: ThreadPoolExecutor vs asyncio -- which is better?
For I/O-bound scraping (which most web scraping is), asyncio with aiohttp is generally faster than ThreadPoolExecutor with requests. Async IO avoids the overhead of thread context switching and can manage hundreds of concurrent connections efficiently.
However, ThreadPoolExecutor is simpler to write and debug. For most teams, the performance difference matters less than development speed.
Use ThreadPoolExecutor when:
- You're scraping fewer than 1,000 URLs
- Your team is more comfortable with synchronous code
- You're using a third-party API (like SearchHive) that handles the heavy lifting
Use asyncio when:
- You're scraping 10,000+ URLs
- Every millisecond of latency matters
- You're comfortable with async/await patterns
import asyncio
import aiohttp

async def scrape_async(session, url):
    params = {"url": url, "format": "json", "api_key": "your_key"}
    async with session.get(
        "https://api.searchhive.dev/scrapeforge",
        params=params
    ) as resp:
        return await resp.json()

async def main(urls, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_scrape(session, url):
        async with semaphore:
            return await scrape_async(session, url)

    async with aiohttp.ClientSession() as session:
        tasks = [limited_scrape(session, u) for u in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(urls))
Q: How do I handle rate limiting?
Three strategies, used together:
- Semaphore-based limiting: Control maximum concurrent requests
- Token bucket: Control requests per time window
- Exponential backoff: Retry failed requests with increasing delays
import time
import threading

class RateLimiter:
    def __init__(self, max_per_minute):
        self.interval = 60.0 / max_per_minute
        self.lock = threading.Lock()
        self.last_request = 0.0

    def wait(self):
        with self.lock:
            now = time.time()
            wait_time = self.interval - (now - self.last_request)
            if wait_time > 0:
                time.sleep(wait_time)
            self.last_request = time.time()

# Use it
limiter = RateLimiter(max_per_minute=60)
for url in urls:
    limiter.wait()
    result = scrape(url)
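The interval-based limiter above spaces requests evenly but cannot absorb short bursts. A token bucket (strategy 2 in the list) allows bursts up to a fixed size while still capping the sustained rate. A minimal thread-safe sketch, with illustrative rate and burst values:

```python
import time
import threading

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec    # tokens refilled per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.lock = threading.Lock()
        self.last_refill = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last_refill) * self.rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.01)  # release the lock while waiting for a refill

# Allow bursts of 10 requests, but no more than 2/sec sustained
bucket = TokenBucket(rate_per_sec=2, burst=10)
```

Call `bucket.acquire()` before each request; unlike the fixed-interval limiter, the first `burst` requests go out immediately, which matters when a worker pool spins up all at once.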
Q: What happens when a request fails in a parallel pipeline?
You need different handling depending on the failure type:
- Timeouts: Retry with backoff (the server might be slow, not blocked)
- 429 Rate Limited: Wait longer between requests, reduce concurrency
- 403 Forbidden: The target is blocking you -- switch proxies or slow down
- 500 Server Error: Transient -- retry once, then skip
- Connection Error: Network issue -- retry
import time
import requests

def scrape_with_retry(url, max_retries=3, base_delay=2):
    for attempt in range(max_retries):
        try:
            resp = requests.get("https://api.searchhive.dev/scrapeforge", params={
                "url": url, "format": "json", "api_key": "your_key"
            }, timeout=30)
            if resp.status_code == 200:
                return resp.json()
            elif resp.status_code == 429:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited, waiting {delay}s...")
                time.sleep(delay)
            elif resp.status_code == 403:
                print(f"Blocked by target: {url}")
                return {"url": url, "error": "blocked"}
            else:
                # Transient server error: flat delay, then retry
                time.sleep(base_delay)
        except requests.Timeout:
            delay = base_delay * (2 ** attempt)
            print(f"Timeout, retrying in {delay}s...")
            time.sleep(delay)
        except requests.ConnectionError:
            print(f"Connection error: {url}")
            return {"url": url, "error": "connection"}
    return {"url": url, "error": "max_retries_exceeded"}
Q: How do I avoid getting blocked?
Layer your defenses:
- Use a scraping API (like SearchHive) that handles proxy rotation and browser fingerprinting
- Randomize request timing -- don't send requests at perfectly regular intervals
- Rotate user agents -- most scraping APIs do this automatically
- Respect robots.txt -- at minimum, check it before scraping
- Cache aggressively -- don't scrape the same URL twice if the data hasn't changed
SearchHive's ScrapeForge API handles items 1, 2, and 3 automatically. You still need to implement caching on your end.
import hashlib
import json
from pathlib import Path

import requests

CACHE_DIR = Path("./scrape_cache")

def scrape_with_cache(url):
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"

    # Check cache first
    if cache_file.exists():
        with open(cache_file) as f:
            return json.load(f)

    # Scrape fresh
    result = requests.get("https://api.searchhive.dev/scrapeforge", params={
        "url": url, "format": "json", "api_key": "your_key"
    }).json()

    # Save to cache
    CACHE_DIR.mkdir(exist_ok=True)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result
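If you are not routing through an API that already varies request timing for you, randomizing the gap between requests (item 2 in the list above) takes only a few lines. The delay range here is an illustrative assumption:

```python
import random
import time

def jittered_sleep(base_delay=1.0, jitter=0.5):
    """Sleep for base_delay plus or minus a random jitter, never below zero."""
    delay = max(0.0, base_delay + random.uniform(-jitter, jitter))
    time.sleep(delay)
    return delay

# Between requests: sleeps somewhere in [0.5, 1.5] seconds,
# so the traffic pattern never looks metronome-regular
```

Perfectly regular intervals are one of the easiest bot signals to detect; even a half-second of jitter makes the timing histogram look far more human.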
Q: How much does parallel scraping cost with SearchHive?
SearchHive charges per credit. Each ScrapeForge request costs 1 credit for a basic page scrape. Pricing:
- Free: 500 credits (enough to test your pipeline)
- Starter: $9/mo for 5,000 credits
- Builder: $49/mo for 100,000 credits
- Unicorn: $199/mo for 500,000 credits
At Builder tier, scraping 100,000 pages costs $0.00049 per page. Compared to managing your own proxy infrastructure (residential proxies alone cost $5-15/GB), SearchHive is significantly cheaper when you factor in engineering time.
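Using the numbers above (1 credit per basic scrape), the per-page math is straightforward; this hypothetical helper just makes the arithmetic explicit for comparing tiers:

```python
def cost_per_page(monthly_price, credits, credits_per_page=1):
    """Effective dollar cost per scraped page at a given tier."""
    return monthly_price / credits * credits_per_page

# Builder tier: $49 for 100,000 credits
builder = cost_per_page(49, 100_000)   # 0.00049 dollars/page
# Starter tier: $9 for 5,000 credits
starter = cost_per_page(9, 5_000)      # 0.0018 dollars/page
```

Note that pages requiring JavaScript rendering or other premium features may consume more than one credit, which raises the effective per-page cost proportionally.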
Q: Can I scrape multiple sites in parallel?
Yes, and this is where parallel scraping delivers the most value. Instead of scraping Site A completely before starting Site B, interleave requests across all sites:
import random
from concurrent.futures import ThreadPoolExecutor

all_urls = (
    [f"https://site-a.com/product/{i}" for i in range(1, 51)] +
    [f"https://site-b.com/product/{i}" for i in range(1, 51)] +
    [f"https://site-c.com/product/{i}" for i in range(1, 51)]
)
random.shuffle(all_urls)  # Interleave across sites

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(scrape_with_retry, all_urls))
Interleaving reduces per-site request density, which lowers your chance of getting blocked on any individual target.
Q: How do I monitor a parallel scraping pipeline?
Track these metrics in real time:
- Success rate: Should stay above 95%
- Average response time: Spikes indicate the target is slowing or blocking
- Error breakdown by type: Identifies systemic issues (rate limiting vs blocking)
- Credits consumed: Prevents unexpected overages
from collections import Counter
import time

class ScrapingMonitor:
    def __init__(self):
        self.results = []
        self.start_time = time.time()
        self.status_counts = Counter()

    def record(self, url, result):
        status = "success" if "content" in result else result.get("error", "unknown")
        self.status_counts[status] += 1
        self.results.append(result)
        elapsed = time.time() - self.start_time
        rate = len(self.results) / elapsed if elapsed > 0 else 0
        print(f"[{len(self.results)}] {status} ({rate:.1f}/s): {url[:50]}")

    def summary(self):
        total = len(self.results)
        print("\n=== Pipeline Summary ===")
        print(f"Total: {total} in {time.time() - self.start_time:.0f}s")
        for status, count in self.status_counts.most_common():
            pct = count / total * 100 if total > 0 else 0
            print(f"  {status}: {count} ({pct:.1f}%)")

monitor = ScrapingMonitor()
# Use monitor.record(url, result) in your scraping loop
Q: What about scraping with headless browsers in parallel?
Running headless browsers (Playwright, Puppeteer) in parallel is resource-intensive. Each browser instance consumes 100-300MB of RAM. Running 10 concurrent browsers means 1-3GB just for browser processes.
SearchHive's ScrapeForge handles JavaScript rendering server-side, so you don't need to run local browsers. Your Python process stays lightweight while the API handles the heavy rendering work.
Summary
Parallel web scraping is powerful when done right. The key principles:
- Start conservative -- 5-10 workers, scale up based on error rates
- Rate limit always -- semaphore or token bucket, not hope
- Handle errors properly -- retry with backoff, classify failures
- Cache aggressively -- don't re-scrape unchanged pages
- Use a scraping API -- SearchHive handles anti-bot so you can focus on your pipeline
For production scraping at scale, SearchHive's API-first approach eliminates the infrastructure complexity of managing proxies, browsers, and anti-bot systems. Get started free with 500 credits and see the documentation for full API references.