Parallel Web Scraping -- Common Questions Answered
Parallel web scraping lets you fetch hundreds or thousands of pages simultaneously instead of one at a time. It's the difference between waiting 3 hours for a sequential crawl and finishing in 5 minutes. This FAQ covers the most common questions developers have about scraping at scale.
Key Takeaways
- Parallel scraping can reduce crawl times by 10-100x compared to sequential requests
- Async HTTP clients (httpx, aiohttp) are the foundation of parallel scraping in Python
- Rate limiting and politeness are non-negotiable -- ignore them and you'll get blocked
- SearchHive's ScrapeForge handles parallelism, proxying, and anti-bot evasion for you
- Concurrency limits prevent both target server overload and your own IP bans
What Is Parallel Web Scraping?
Parallel web scraping means making multiple HTTP requests concurrently rather than one after another. Instead of:
```
Page 1 --> wait --> Page 2 --> wait --> Page 3 --> wait ...
(3 minutes total)
```
You do:
```
Page 1 -----+
Page 2 -----+--> all complete in 10 seconds
Page 3 -----+
```
In Python, this is typically done with asyncio and async HTTP libraries like httpx or aiohttp.
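The speedup comes entirely from overlapping the waiting. This toy sketch simulates ten 0.2-second round-trips with asyncio.sleep (no real network involved) and shows they complete together in roughly the time of one:

```python
import asyncio
import time

async def fake_fetch(i):
    # Simulate a 0.2-second network round-trip
    await asyncio.sleep(0.2)
    return i

async def main():
    start = time.perf_counter()
    # Launch all ten "requests" concurrently; gather preserves order
    results = await asyncio.gather(*(fake_fetch(i) for i in range(10)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(f"Fetched {len(results)} pages in {elapsed:.2f}s")  # ~0.2s, not ~2s
```

Sequentially, the same ten waits would take about 2 seconds; concurrently they take about 0.2.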
How Many Requests Can I Run in Parallel?
It depends on several factors:
| Factor | Conservative | Aggressive |
|---|---|---|
| Your own server limits | 10-50 concurrent | 100-500 concurrent |
| Target server (small site) | 2-5 concurrent | 5-10 concurrent |
| Target server (large site) | 10-50 concurrent | 50-200 concurrent |
| With proxy rotation | 50-200 concurrent | 500+ concurrent |
| With SearchHive ScrapeForge | Up to 100 concurrent | Scales with plan |
The practical sweet spot for most scraping tasks is 10-50 concurrent requests. Going higher without proxy rotation or the target server's consent leads to IP bans and CAPTCHAs.
What's the Best Library for Parallel Scraping in Python?
The main options are:
- httpx -- Modern async HTTP client, easy to use, good error handling
- aiohttp -- Mature, fast, but more boilerplate
- requests + ThreadPoolExecutor -- Synchronous but parallelized, simpler for small tasks
Here's a basic httpx implementation:
```python
import asyncio
import httpx

async def scrape_page(client, url):
    try:
        resp = await client.get(url, timeout=30.0)
        resp.raise_for_status()
        return {"url": url, "status": "ok", "content": resp.text[:5000]}
    except httpx.HTTPError as e:
        return {"url": url, "status": "error", "error": str(e)}

async def scrape_parallel(urls, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_scrape(client, url):
        async with semaphore:
            return await scrape_page(client, url)

    async with httpx.AsyncClient() as client:
        tasks = [limited_scrape(client, url) for url in urls]
        return await asyncio.gather(*tasks)

urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
results = asyncio.run(scrape_parallel(urls, concurrency=20))
ok_count = sum(1 for r in results if r["status"] == "ok")
print(f"Scraped {ok_count}/{len(urls)} pages")
```
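For the requests + ThreadPoolExecutor option listed above, a minimal sketch looks like this. The `fetch` stand-in is a placeholder; in real code it would be a `requests.get(url, timeout=30)` call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for: resp = requests.get(url, timeout=30); resp.text
    return {"url": url, "status": "ok"}

def scrape_threaded(urls, max_workers=20):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL, then collect results as threads finish
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            results.append(future.result())
    return results

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]
results = scrape_threaded(urls)
print(f"Scraped {len(results)} pages")
```

Threads carry more per-request overhead than coroutines, but this approach needs no async/await and is often simpler for a few hundred URLs.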
How Do I Handle Rate Limiting?
Rate limiting protects both you and the target server. Implement it at three levels:
1. Concurrency limit with a semaphore:
```python
semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests
```
2. Delay before each request:
```python
import asyncio

semaphore = asyncio.Semaphore(10)

async def scrape_with_delay(client, url, delay=0.5):
    async with semaphore:
        await asyncio.sleep(delay)
        return await scrape_page(client, url)
```
3. Respect rate limit headers:
```python
async def scrape_respecting_limits(client, url):
    resp = await client.get(url)
    # Check for rate limit headers; names and units vary by site, and some
    # send an epoch timestamp in X-RateLimit-Reset rather than seconds
    remaining = resp.headers.get("X-RateLimit-Remaining")
    reset_after = resp.headers.get("X-RateLimit-Reset")
    if remaining == "0" and reset_after:
        wait = int(reset_after) + 1
        print(f"Rate limited, waiting {wait}s")
        await asyncio.sleep(wait)
    return resp.text
```
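A semaphore caps how many requests are in flight, but not how many start per second. If a target publishes a requests-per-second budget, a small shared limiter -- a hand-rolled sketch here, not a library API -- can space request starts evenly across all tasks:

```python
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` request starts per second, shared across tasks."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_time = 0.0

    async def acquire(self):
        async with self._lock:
            # Reserve the next start slot, then sleep outside the lock
            now = time.monotonic()
            wait = self._next_time - now
            self._next_time = max(now, self._next_time) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

async def main():
    limiter = RateLimiter(rate=50)  # budget: 50 request starts per second

    async def task(i):
        await limiter.acquire()  # in real code, follow with client.get(url)
        return i

    start = time.monotonic()
    results = await asyncio.gather(*(task(i) for i in range(10)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(f"10 acquisitions took {elapsed:.2f}s")
```

Combined with a semaphore, this bounds both concurrency and request rate.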
Many sites return HTTP 429 when rate limited. Handle this with exponential backoff:
```python
async def scrape_with_backoff(client, url, max_retries=3):
    for attempt in range(max_retries):
        resp = await client.get(url)
        if resp.status_code == 429:
            wait = 2 ** attempt
            print(f"429 on {url}, retry in {wait}s")
            await asyncio.sleep(wait)
            continue
        return resp.text
    return None
```
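The schedule above is fixed. Many servers also send a Retry-After header alongside a 429, and honoring it when present is politer than guessing. A variant that prefers the header and falls back to exponential backoff:

```python
import asyncio
import email.utils
import time

def retry_after_seconds(headers, attempt):
    """Prefer the server's Retry-After header; fall back to exponential backoff."""
    value = headers.get("Retry-After")
    if value is None:
        return 2 ** attempt
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return int(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    parsed = email.utils.parsedate_to_datetime(value)
    return max(0.0, parsed.timestamp() - time.time())

async def scrape_honoring_retry_after(client, url, max_retries=3):
    for attempt in range(max_retries):
        resp = await client.get(url)
        if resp.status_code == 429:
            await asyncio.sleep(retry_after_seconds(resp.headers, attempt))
            continue
        return resp.text
    return None
```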
How Do Proxies Work with Parallel Scraping?
Proxies rotate your IP address across requests, preventing the target server from associating all requests with a single IP. There are three types:
- Datacenter proxies -- Fast, cheap, easily detected ($1-5/GB)
- Residential proxies -- Real residential IPs, harder to detect ($5-15/GB)
- Mobile proxies -- Mobile carrier IPs, hardest to detect ($15-50/GB)
For parallel scraping:
```python
import random
import httpx

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

# httpx configures proxies per client, not per request (use proxies=
# on older httpx versions), so build one AsyncClient per proxy and
# pick one at random for each request
clients = [httpx.AsyncClient(proxy=proxy) for proxy in PROXIES]

async def scrape_with_proxy(url):
    client = random.choice(clients)
    resp = await client.get(url)
    return resp.text
```
How Does SearchHive Handle Parallel Scraping?
SearchHive's ScrapeForge abstracts away the complexity of parallel scraping. You send a list of URLs and get structured results back -- ScrapeForge handles concurrency, proxying, retries, and anti-bot evasion internally.
```python
import httpx

resp = httpx.post(
    "https://api.searchhive.dev/v1/scrape/batch",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "urls": [
            "https://competitor1.com/pricing",
            "https://competitor2.com/pricing",
            "https://competitor3.com/pricing"
        ],
        "format": "markdown",
        "concurrency": 10,
        "remove_selectors": ["nav", "footer"]
    }
)
results = resp.json()["results"]
# Returns structured content for each URL
```
ScrapeForge manages rate limiting, proxy rotation, and JavaScript rendering across all requests. You don't write asyncio code, handle retries, or manage proxy pools.
Compare ScrapeForge with alternatives like Firecrawl and ScrapingBee in our comparison guides.
What Error Handling Do I Need?
Parallel scraping fails differently than sequential scraping. Common issues:
- Connection timeouts -- Use per-request timeouts
- SSL errors -- Some sites have misconfigured certs
- HTTP 403/429 -- Anti-bot measures or rate limits
- Memory pressure -- Storing all page content in memory
- DNS failures -- Flaky resolvers at scale
```python
async def robust_scrape(client, url):
    try:
        resp = await client.get(url, timeout=30.0)
        resp.raise_for_status()
        return {"url": url, "content": resp.text, "status": 200}
    except httpx.TimeoutException:
        return {"url": url, "error": "timeout", "status": 0}
    except httpx.HTTPStatusError as e:
        return {"url": url, "error": str(e), "status": e.response.status_code}
    except Exception as e:
        return {"url": url, "error": str(e), "status": 0}

# After running all tasks (with scrape_parallel calling robust_scrape
# in place of scrape_page), filter out failures
results = asyncio.run(scrape_parallel(urls))
successful = [r for r in results if r.get("status") == 200]
failed = [r for r in results if r.get("status") != 200]
print(f"Success: {len(successful)}, Failed: {len(failed)}")

# Retry failed URLs
if failed:
    retry_urls = [r["url"] for r in failed]
    retried = asyncio.run(scrape_parallel(retry_urls))
    successful.extend([r for r in retried if r.get("status") == 200])
```
Is Parallel Scraping Legal?
Parallel scraping itself is legal. What matters is what you scrape and how you scrape it:
- Legal: Publicly available data, data behind consent-gated access, data with clear terms of service allowing scraping
- Gray area: Personal data behind logins, data behind paywalls, data with no explicit scraping policy
- Illegal: Data protected by authentication you bypassed, copyrighted content reproduced without license, personal data subject to GDPR/CCPA
Always check robots.txt, respect rate limits, and follow terms of service. See our web scraping legality guide for a detailed breakdown.
What Are the Memory Considerations at Scale?
Scraping 10,000 pages in parallel while storing full HTML can consume gigabytes of RAM. Strategies to manage this:
- Stream results to disk instead of keeping everything in memory
- Process pages as they arrive rather than collecting all results first
- Limit content size -- most pages only need the first 50KB
- Use generators instead of lists
```python
async def scrape_and_save(client, url, output_dir):
    """Scrape a page and immediately save to disk."""
    result = await scrape_page(client, url)
    if result["status"] == "ok":
        filename = url.replace("/", "_").replace(":", "")
        with open(f"{output_dir}/{filename}.txt", "w") as f:
            f.write(result["content"])
    return result
```
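To act on the "process pages as they arrive" point above, asyncio.as_completed hands you each result as soon as its task finishes, so content can be written out and dropped instead of accumulating in one big list. The `fetch` coroutine here is a stand-in for a real `scrape_page(client, url)` call:

```python
import asyncio

async def fetch(url):
    # Stand-in for scrape_page(client, url); no real network here
    await asyncio.sleep(0.01)
    return {"url": url, "content": "..."}

async def scrape_streaming(urls):
    saved = 0
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    # Handle each page the moment its task completes, in completion order
    for finished in asyncio.as_completed(tasks):
        result = await finished
        # Write result["content"] to disk here, then drop the reference
        saved += 1
    return saved

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
saved = asyncio.run(scrape_streaming(urls))
print(f"Saved {saved} pages")
```

Peak memory then tracks the concurrency limit rather than the total number of pages.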
Summary
Parallel web scraping is essential for any project that needs data at scale. Python's async ecosystem makes it straightforward with httpx and asyncio, but production scraping requires rate limiting, error handling, proxy rotation, and anti-bot evasion. SearchHive's ScrapeForge handles all of this for you -- send URLs, get structured data.
Start scraping in parallel for free with 500 API credits. Explore ScrapeForge docs and SwiftSearch to build your first parallel pipeline today.