Parallel web scraping is the difference between waiting hours for data and getting it in minutes. But doing it wrong -- too many concurrent requests, no rate limiting, no error handling -- gets your IPs blocked fast. This FAQ answers the most common questions about parallel scraping, with practical advice and working code examples.
Key Takeaways
- Start with 5-10 concurrent workers and scale up only after testing
- Rate limiting is non-negotiable -- even with rotating proxies, aggressive scraping triggers blocks
- SearchHive's ScrapeForge API handles proxy rotation and anti-bot detection, letting you focus on concurrency control
- Error handling in parallel scraping requires different patterns than sequential scraping
- Async IO beats threading for I/O-bound scraping workloads in Python
Q: How many concurrent requests should I make?
Start with 5-10 concurrent workers. Increase gradually while monitoring failure rates. If your error rate jumps above 2-3%, you've gone too far.
The right concurrency depends on several factors:
- Target site tolerance: Major e-commerce sites tolerate more traffic; smaller sites block faster
- Proxy quality: Residential proxies handle higher concurrency than datacenter proxies
- API vs direct scraping: When using an API like SearchHive, the API provider manages proxy pools, so you can push harder
- Time of day: Off-peak hours allow slightly higher concurrency
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "your_searchhive_key"
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]

def scrape(url):
    return requests.get("https://api.searchhive.dev/scrapeforge", params={
        "url": url, "format": "json", "api_key": API_KEY
    }).json()

# Start with 5 workers, increase if error rate stays low
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(scrape, u): u for u in urls}
    for future in as_completed(futures):
        result = future.result()
        url = futures[future]
        print(f"{'OK' if 'content' in result else 'FAIL'}: {url}")
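The 2-3% error threshold can also drive the worker count for you instead of being checked by hand. A minimal sketch of batch-based adaptive concurrency (the function name, batch size, and thresholds are illustrative assumptions, not part of any SearchHive API):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_adaptive(urls, scrape_fn, start_workers=5, max_workers=20,
                    batch_size=50, error_threshold=0.03):
    """Scrape in batches, raising worker count while the error rate stays low."""
    workers = start_workers
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=workers) as executor:
            batch_results = list(executor.map(scrape_fn, batch))
        # Count failures in this batch (assumes scrape_fn marks them with "error")
        errors = sum(1 for r in batch_results if "error" in r)
        if errors / len(batch) > error_threshold:
            workers = max(start_workers, workers // 2)  # too many failures: back off
        elif workers < max_workers:
            workers += 2  # healthy batch: push a little harder
        results.extend(batch_results)
    return results
```

Halving on failure and growing slowly on success mirrors how TCP congestion control probes for capacity without hammering the target.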
Q: ThreadPoolExecutor vs asyncio -- which is better?
For I/O-bound scraping (which most web scraping is), asyncio with aiohttp is generally faster than ThreadPoolExecutor with requests. Async IO avoids the overhead of thread context switching and can manage hundreds of concurrent connections efficiently.
However, ThreadPoolExecutor is simpler to write and debug. For most teams, the performance difference matters less than development speed.
Use ThreadPoolExecutor when:
- You're scraping fewer than 1,000 URLs
- Your team is more comfortable with synchronous code
- You're using a third-party API (like SearchHive) that handles the heavy lifting
Use asyncio when:
- You're scraping 10,000+ URLs
- Every millisecond of latency matters
- You're comfortable with async/await patterns
import asyncio
import aiohttp

async def scrape_async(session, url):
    params = {"url": url, "format": "json", "api_key": "your_key"}
    async with session.get(
        "https://api.searchhive.dev/scrapeforge",
        params=params
    ) as resp:
        return await resp.json()

async def main(urls, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_scrape(session, url):
        async with semaphore:
            return await scrape_async(session, url)

    async with aiohttp.ClientSession() as session:
        tasks = [limited_scrape(session, u) for u in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(urls))
Q: How do I handle rate limiting?
Three strategies, used together:
- Semaphore-based limiting: Control maximum concurrent requests
- Token bucket: Control requests per time window
- Exponential backoff: Retry failed requests with increasing delays
import time
import threading

class RateLimiter:
    def __init__(self, max_per_minute):
        self.interval = 60.0 / max_per_minute
        self.lock = threading.Lock()
        self.last_request = 0.0

    def wait(self):
        with self.lock:
            now = time.time()
            wait_time = self.interval - (now - self.last_request)
            if wait_time > 0:
                time.sleep(wait_time)
            self.last_request = time.time()

# Use it
limiter = RateLimiter(max_per_minute=60)
for url in urls:
    limiter.wait()
    result = scrape(url)
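The interval-based limiter above spaces requests evenly but cannot absorb short bursts. A token bucket (strategy 2 in the list) allows bursts up to a fixed size while still capping the sustained rate. A minimal thread-safe sketch, with illustrative rate and burst values:

```python
import time
import threading

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec    # tokens refilled per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.lock = threading.Lock()
        self.last_refill = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last_refill) * self.rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.01)  # release the lock while waiting for a refill

# Allow bursts of 10 requests, but no more than 2/sec sustained
bucket = TokenBucket(rate_per_sec=2, burst=10)
```

Call `bucket.acquire()` before each request; unlike the fixed-interval limiter, the first `burst` requests go out immediately, which matters when a worker pool spins up all at once.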
Q: What happens when a request fails in a parallel pipeline?
You need different handling depending on the failure type:
- Timeouts: Retry with backoff (the server might be slow, not blocked)
- 429 Rate Limited: Wait longer between requests, reduce concurrency
- 403 Forbidden: The target is blocking you -- switch proxies or slow down
- 500 Server Error: Transient -- retry once, then skip
- Connection Error: Network issue -- retry
import time
import requests

def scrape_with_retry(url, max_retries=3, base_delay=2):
    for attempt in range(max_retries):
        try:
            resp = requests.get("https://api.searchhive.dev/scrapeforge", params={
                "url": url, "format": "json", "api_key": "your_key"
            }, timeout=30)
            if resp.status_code == 200:
                return resp.json()
            elif resp.status_code == 429:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited, waiting {delay}s...")
                time.sleep(delay)
            elif resp.status_code == 403:
                print(f"Blocked by target: {url}")
                return {"url": url, "error": "blocked"}
            else:
                # Transient server error: flat delay, then retry
                time.sleep(base_delay)
        except requests.Timeout:
            delay = base_delay * (2 ** attempt)
            print(f"Timeout, retrying in {delay}s...")
            time.sleep(delay)
        except requests.ConnectionError:
            print(f"Connection error: {url}")
            return {"url": url, "error": "connection"}
    return {"url": url, "error": "max_retries_exceeded"}
Q: How do I avoid getting blocked?
Layer your defenses:
- Use a scraping API (like SearchHive) that handles proxy rotation and browser fingerprinting
- Randomize request timing -- don't send requests at perfectly regular intervals
- Rotate user agents -- most scraping APIs do this automatically
- Respect robots.txt -- at minimum, check it before scraping
- Cache aggressively -- don't scrape the same URL twice if the data hasn't changed
SearchHive's ScrapeForge API handles items 1, 2, and 3 automatically. You still need to implement caching on your end.
import hashlib
import json
from pathlib import Path

import requests

CACHE_DIR = Path("./scrape_cache")

def scrape_with_cache(url):
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"

    # Check cache first
    if cache_file.exists():
        with open(cache_file) as f:
            return json.load(f)

    # Scrape fresh
    result = requests.get("https://api.searchhive.dev/scrapeforge", params={
        "url": url, "format": "json", "api_key": "your_key"
    }).json()

    # Save to cache
    CACHE_DIR.mkdir(exist_ok=True)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result
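If you are not routing through an API that already varies request timing for you, randomizing the gap between requests (item 2 in the list above) takes only a few lines. The delay range here is an illustrative assumption:

```python
import random
import time

def jittered_sleep(base_delay=1.0, jitter=0.5):
    """Sleep for base_delay plus or minus a random jitter, never below zero."""
    delay = max(0.0, base_delay + random.uniform(-jitter, jitter))
    time.sleep(delay)
    return delay

# Between requests: sleeps somewhere in [0.5, 1.5] seconds,
# so the traffic pattern never looks metronome-regular
```

Perfectly regular intervals are one of the easiest bot signals to detect; even a half-second of jitter makes the timing histogram look far more human.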
Q: How much does parallel scraping cost with SearchHive?
SearchHive charges per credit. Each ScrapeForge request costs 1 credit for a basic page scrape. Pricing:
- Free: 500 credits (enough to test your pipeline)
- Starter: $9/mo for 5,000 credits
- Builder: $49/mo for 100,000 credits
- Unicorn: $199/mo for 500,000 credits
At Builder tier, scraping 100,000 pages costs $0.00049 per page. Compared to managing your own proxy infrastructure (residential proxies alone cost $5-15/GB), SearchHive is significantly cheaper when you factor in engineering time.
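Using the numbers above (1 credit per basic scrape), the per-page math is straightforward; this hypothetical helper just makes the arithmetic explicit for comparing tiers:

```python
def cost_per_page(monthly_price, credits, credits_per_page=1):
    """Effective dollar cost per scraped page at a given tier."""
    return monthly_price / credits * credits_per_page

# Builder tier: $49 for 100,000 credits
builder = cost_per_page(49, 100_000)   # 0.00049 dollars/page
# Starter tier: $9 for 5,000 credits
starter = cost_per_page(9, 5_000)      # 0.0018 dollars/page
```

Note that pages requiring JavaScript rendering or other premium features may consume more than one credit, which raises the effective per-page cost proportionally.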
Q: Can I scrape multiple sites in parallel?
Yes, and this is where parallel scraping delivers the most value. Instead of scraping Site A completely before starting Site B, interleave requests across all sites:
import random
from concurrent.futures import ThreadPoolExecutor

all_urls = (
    [f"https://site-a.com/product/{i}" for i in range(1, 51)] +
    [f"https://site-b.com/product/{i}" for i in range(1, 51)] +
    [f"https://site-c.com/product/{i}" for i in range(1, 51)]
)
random.shuffle(all_urls)  # Interleave across sites

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(scrape_with_retry, all_urls))
Interleaving reduces per-site request density, which lowers your chance of getting blocked on any individual target.
Q: How do I monitor a parallel scraping pipeline?
Track these metrics in real time:
- Success rate: Should stay above 95%
- Average response time: Spikes indicate the target is slowing or blocking
- Error breakdown by type: Identifies systemic issues (rate limiting vs blocking)
- Credits consumed: Prevents unexpected overages
from collections import Counter
import time

class ScrapingMonitor:
    def __init__(self):
        self.results = []
        self.start_time = time.time()
        self.status_counts = Counter()

    def record(self, url, result):
        status = "success" if "content" in result else result.get("error", "unknown")
        self.status_counts[status] += 1
        self.results.append(result)
        elapsed = time.time() - self.start_time
        rate = len(self.results) / elapsed if elapsed > 0 else 0
        print(f"[{len(self.results)}] {status} ({rate:.1f}/s): {url[:50]}")

    def summary(self):
        total = len(self.results)
        print("\n=== Pipeline Summary ===")
        print(f"Total: {total} in {time.time() - self.start_time:.0f}s")
        for status, count in self.status_counts.most_common():
            pct = count / total * 100 if total > 0 else 0
            print(f"  {status}: {count} ({pct:.1f}%)")

monitor = ScrapingMonitor()
# Use monitor.record(url, result) in your scraping loop
Q: What about scraping with headless browsers in parallel?
Running headless browsers (Playwright, Puppeteer) in parallel is resource-intensive. Each browser instance consumes 100-300MB of RAM. Running 10 concurrent browsers means 1-3GB just for browser processes.
SearchHive's ScrapeForge handles JavaScript rendering server-side, so you don't need to run local browsers. Your Python process stays lightweight while the API handles the heavy rendering work.
Summary
Parallel web scraping is powerful when done right. The key principles:
- Start conservative -- 5-10 workers, scale up based on error rates
- Rate limit always -- semaphore or token bucket, not hope
- Handle errors properly -- retry with backoff, classify failures
- Cache aggressively -- don't re-scrape unchanged pages
- Use a scraping API -- SearchHive handles anti-bot so you can focus on your pipeline
For production scraping at scale, SearchHive's API-first approach eliminates the infrastructure complexity of managing proxies, browsers, and anti-bot systems. Get started free with 500 credits and see the documentation for full API references.