Most web scraping APIs still use a synchronous request-response model: send a URL, wait, get the result. That works for a handful of pages, but it breaks down when you need hundreds or thousands. Async web scraping APIs solve this by letting you fire hundreds of requests in parallel and collect results as they arrive. This guide covers the async scraping landscape, which APIs support it, and how to build high-throughput extraction pipelines.
## Key Takeaways
- Async scraping cuts wall-clock time by 10-50x compared to sequential requests for batch jobs
- SearchHive ScrapeForge supports concurrent scraping with configurable parallelism out of the box
- Python's asyncio + aiohttp is the standard pattern for client-side async scraping
- Concurrency limits matter — most APIs enforce rate limits that throttle raw parallelism
- Firecrawl, ScrapingBee, and ZenRows all support concurrent requests but with tier-based limits
- The fastest pipelines combine async clients with connection pooling and smart retry logic
## How Async Web Scraping Works
Traditional scraping: you send request 1, wait 2 seconds, send request 2, wait 2 seconds. For 100 pages, that's 200 seconds minimum.
Async scraping: you send all 100 requests simultaneously (or in batches), and results arrive as each page completes. The total time is roughly max(individual request times) plus overhead — often under 10 seconds for the same 100 pages.
The pattern looks like this:
```python
import asyncio
import aiohttp

async def scrape_page(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_page(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Scraped {len(results)} pages")

asyncio.run(main())
```
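You can verify the wall-clock win without any network at all. This sketch uses `asyncio.sleep` as a stand-in for request latency; the 0.1-second delay and the task count are arbitrary assumptions, not measurements of any real API:

```python
import asyncio
import time

async def fake_request(i):
    # Stand-in for a scrape call with ~0.1 s of network latency
    await asyncio.sleep(0.1)
    return f"page-{i}"

async def main():
    start = time.monotonic()
    # Launch all 100 "requests" concurrently and wait for all of them
    results = await asyncio.gather(*(fake_request(i) for i in range(100)))
    elapsed = time.monotonic() - start
    # Sequentially this would take ~10 s; concurrently it finishes in ~0.1 s
    print(f"{len(results)} pages in {elapsed:.2f}s")
    return results, elapsed

results, elapsed = asyncio.run(main())
```

The total time tracks the slowest single task rather than the sum, which is exactly the `max(individual request times)` behavior described above.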
The problem: if you're scraping through an API, you're limited by their concurrency rules, not your client code.
## API Concurrency Limits Compared
Not all APIs handle parallel requests the same way. Here's what the major providers offer:
| Provider | Free Tier Concurrency | Paid Concurrency | Pricing (Entry) |
|---|---|---|---|
| SearchHive | 5 concurrent | 25-100+ | $9/mo |
| Firecrawl | 2 concurrent | 5-150 | $16/mo |
| ScrapingBee | 10 concurrent | 10-200 | $49/mo |
| ZenRows | 5 concurrent | 50-500 | $49/mo |
| Apify | Varies by actor | Varies | $49/mo |
| Jina Reader | Limited | Rate-limited | Free |
The key difference: SearchHive and Firecrawl use credit-based concurrency (requests beyond your limit are queued or throttled rather than rejected), while ScrapingBee and ZenRows enforce hard caps on the number of simultaneous connections.
## Building an Async Pipeline with SearchHive ScrapeForge
SearchHive's ScrapeForge API works naturally with async patterns. Here's a production-ready pattern:
```python
import asyncio
import aiohttp
from datetime import datetime, timezone

API_KEY = "sh_live_your_key_here"
BASE_URL = "https://api.searchhive.dev/v1/scrape"
BATCH_SIZE = 20  # Tune based on your plan's concurrency limit

class ScrapeResult:
    def __init__(self, url, success, data=None, error=None):
        self.url = url
        self.success = success
        self.data = data
        self.error = error
        self.timestamp = datetime.now(timezone.utc)

async def scrape_one(session, url, semaphore):
    """Scrape a single URL with semaphore-controlled concurrency."""
    async with semaphore:
        try:
            async with session.post(
                BASE_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={
                    "url": url,
                    "render_js": True,
                    "format": "markdown",
                    "timeout": 15
                },
                timeout=aiohttp.ClientTimeout(total=20)
            ) as resp:
                if resp.status == 200:
                    data = await resp.json()
                    return ScrapeResult(url, success=True, data=data)
                else:
                    text = await resp.text()
                    return ScrapeResult(url, success=False, error=f"HTTP {resp.status}: {text}")
        except asyncio.TimeoutError:
            return ScrapeResult(url, success=False, error="Timeout after 20s")
        except Exception as e:
            return ScrapeResult(url, success=False, error=str(e))

async def scrape_batch(urls, max_concurrent=BATCH_SIZE):
    """Scrape a batch of URLs with controlled parallelism."""
    semaphore = asyncio.Semaphore(max_concurrent)
    connector = aiohttp.TCPConnector(limit=max_concurrent, limit_per_host=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [scrape_one(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
    successful = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    print(f"Batch complete: {len(successful)}/{len(urls)} succeeded")
    if failed:
        print(f"Failures: {len(failed)}")
        for f in failed[:3]:
            print(f"  - {f.url}: {f.error}")
    return successful, failed

async def main():
    # Example: scrape product pages in parallel
    product_urls = [
        f"https://store.example.com/product/{i}"
        for i in range(1, 201)
    ]
    # Process in chunks to avoid hitting rate limits
    chunk_size = 50
    all_results = []
    for i in range(0, len(product_urls), chunk_size):
        chunk = product_urls[i:i + chunk_size]
        print(f"Processing chunk {i // chunk_size + 1} ({len(chunk)} URLs)...")
        success, failed = await scrape_batch(chunk, max_concurrent=20)
        all_results.extend(success)
        # Small delay between chunks to stay under sustained rate limits
        if i + chunk_size < len(product_urls):
            await asyncio.sleep(1)
    print(f"\nTotal scraped: {len(all_results)} pages")
    # Extract data from results
    for result in all_results[:3]:
        print(f"\n--- {result.url} ---")
        print(result.data["content"][:200])

asyncio.run(main())
```
## Comparison: Async Patterns Across Scraping APIs

### Firecrawl

Firecrawl's `/v1/scrape` endpoint is synchronous but supports concurrent connections. Their API doesn't offer a native batch endpoint, so you handle parallelism client-side:
```python
import asyncio
import aiohttp

# Assumes API_KEY holds your Firecrawl API key
async def firecrawl_scrape(session, url):
    async with session.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=aiohttp.ClientTimeout(total=30)
    ) as resp:
        return await resp.json()
```
Limitations: the free tier allows only 2 concurrent requests. You need the Standard plan ($83/mo) for 50 concurrent requests.
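The client-side parallelism pattern is the same for every provider in this comparison: cap in-flight requests with a semaphore, then gather. A minimal, provider-agnostic sketch; `bounded_gather` and the stand-in `fake_scrape` are illustrative names, not part of any provider's SDK:

```python
import asyncio

async def bounded_gather(coros, limit):
    """Run coroutines concurrently, but at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    # gather preserves input order, so results line up with the URLs
    return await asyncio.gather(*(run(c) for c in coros))

# Stand-in coroutine; swap in firecrawl_scrape(session, url) or similar
async def fake_scrape(i):
    await asyncio.sleep(0.01)
    return i * 2

out = asyncio.run(bounded_gather((fake_scrape(i) for i in range(10)), limit=2))
```

Set `limit` to your plan's concurrency cap so the client never tries to exceed what the API allows.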
### ScrapingBee
ScrapingBee's API is inherently synchronous — each GET request returns one page. Async support comes from client-side parallelism:
```python
import asyncio
import aiohttp

SCRAPINGBEE_KEY = "your_key"

async def scrape_bee(session, url):
    params = {
        "api_key": SCRAPINGBEE_KEY,
        "url": url,
        "render_js": "true",
    }
    async with session.get(
        "https://app.scrapingbee.com/api/v1/",
        params=params,
        timeout=aiohttp.ClientTimeout(total=30)
    ) as resp:
        return await resp.text()
```
ScrapingBee's concurrency limits are tied to your plan: Freelance ($49/mo) gives 10 concurrent, Business ($249/mo) gives 100.
### Jina Reader
Jina Reader is the simplest for async — it's just HTTP GET with a URL prefix:
```python
import asyncio
import aiohttp

async def jina_read(session, url):
    async with session.get(
        f"https://r.jina.ai/{url}",
        headers={"Accept": "text/markdown"},
        timeout=aiohttp.ClientTimeout(total=15)
    ) as resp:
        return await resp.text()
```
The catch: no API key means no concurrency guarantees. Jina rate-limits aggressively on high volume.
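When a service rate-limits by request volume rather than capping concurrency, spacing requests out helps more than a semaphore does. A hedged sketch of a minimal async rate limiter; the 0.02-second interval is illustrative, not Jina's documented limit:

```python
import asyncio
import time

class RateLimiter:
    """Enforce a minimum interval between request starts."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

async def main():
    limiter = RateLimiter(min_interval=0.02)  # ~50 request starts per second
    start = time.monotonic()

    async def task(i):
        await limiter.wait()
        return i  # here you'd call jina_read(session, url)

    results = await asyncio.gather(*(task(i) for i in range(10)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
```

The lock serializes the interval bookkeeping, so even hundreds of concurrent tasks start at most one request per `min_interval`.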
## Advanced: Async with Retry and Backpressure
Production async pipelines need more than `asyncio.gather`. Here's a robust pattern with exponential backoff and backpressure:
```python
import asyncio
import random

import aiohttp

class AsyncScraper:
    def __init__(self, api_key, max_concurrent=20, max_retries=3):
        self.api_key = api_key
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.max_retries = max_retries
        self.results = []
        self.failed_count = 0

    async def _scrape_with_retry(self, session, url):
        for attempt in range(self.max_retries):
            async with self.semaphore:
                try:
                    async with session.post(
                        "https://api.searchhive.dev/v1/scrape",
                        headers={"Authorization": f"Bearer {self.api_key}"},
                        json={"url": url, "format": "markdown"},
                        timeout=aiohttp.ClientTimeout(total=30)
                    ) as resp:
                        if resp.status == 200:
                            return await resp.json()
                        if resp.status != 429:
                            return None  # non-retryable error
                        # 429: fall through to backoff and retry
                except asyncio.TimeoutError:
                    pass  # fall through to backoff and retry
            # Sleep outside the semaphore so the slot frees up for other tasks
            wait = 2 ** attempt + random.random()
            await asyncio.sleep(wait)
        return None

    async def scrape_all(self, urls):
        connector = aiohttp.TCPConnector(limit=self.max_concurrent)
        async with aiohttp.ClientSession(connector=connector) as session:
            tasks = [self._scrape_with_retry(session, url) for url in urls]
            results = await asyncio.gather(*tasks)
        self.results = [r for r in results if r]
        self.failed_count = len(results) - len(self.results)
        return self.results
```
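The backoff expression above, `2 ** attempt + random.random()`, is worth pulling into a helper so the schedule is easy to test and tune. A sketch, with the base, jitter, and 60-second cap as assumptions rather than any API's requirement:

```python
import random

def backoff_delay(attempt, base=2.0, max_delay=60.0, rng=random.random):
    """Exponential backoff with full jitter added, capped at max_delay."""
    return min(base ** attempt + rng(), max_delay)

# First four retry waits land in [1, 2), [2, 3), [4, 5), [8, 9) seconds
delays = [backoff_delay(a) for a in range(4)]
```

The jitter matters: without it, every task that hit a 429 at the same moment retries at the same moment, producing another synchronized spike.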
## Best Practices for Async Scraping

- **Respect rate limits.** Use semaphores to cap concurrency. Even if an API doesn't enforce hard limits, aggressive parallelism triggers IP blocks.
- **Batch with delays.** Don't fire 10,000 requests at once. Process in chunks of 50-200 with brief pauses between batches.
- **Handle 429 responses.** Implement exponential backoff when you hit rate limits. Most APIs return HTTP 429 with a `Retry-After` header.
- **Use connection pooling.** `aiohttp.TCPConnector` with `limit` and `limit_per_host` prevents TCP connection exhaustion.
- **Set timeouts.** Always set client-side timeouts. A single stuck request shouldn't block the entire batch.
- **Log failures for retry.** Capture failed URLs and retry them separately after the main batch completes.
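When handling 429s, prefer the server's own `Retry-After` value over your backoff schedule when it's present. A small sketch; it handles only the integer-seconds form of the header (the spec also allows an HTTP date, which this falls back past):

```python
def retry_wait(headers, attempt, base=2.0):
    """Honor Retry-After if present and numeric, else use exponential backoff."""
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return max(float(value), 0.0)
        except ValueError:
            pass  # HTTP-date form; fall back to the backoff schedule
    return base ** attempt

print(retry_wait({"Retry-After": "7"}, attempt=0))  # 7.0
print(retry_wait({}, attempt=3))                    # 8.0
```

Wiring this into the retry loop means the client waits exactly as long as the API asks, instead of guessing.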
## When to Use Sync vs Async
| Scenario | Use Sync | Use Async |
|---|---|---|
| Scraping 1-10 pages | Yes | Overkill |
| Real-time agent (scrape on user request) | Yes | No (adds latency) |
| Batch data collection (100+ pages) | No | Yes |
| Scheduled scraping job that runs hourly | No | Yes |
| Prototyping | Yes | No |
## Get Started with SearchHive
SearchHive offers 500 free credits with full access to ScrapeForge, SwiftSearch, and DeepDive. The async-friendly API design means you can start with synchronous calls and scale to parallel pipelines without switching providers.
```bash
pip install searchhive
```

```python
from searchhive import ScrapeForge
import asyncio

async def main():
    sf = ScrapeForge('sh_live_your_key')
    result = await sf.ascrape('https://example.com', format='markdown')
    print(result['content'])

asyncio.run(main())
```
Read the docs or sign up for free to start building high-throughput scraping pipelines.