Building a real-time data pipeline in Python requires a search API that is fast, reliable, and easy to integrate. After testing multiple options across a production project processing 50,000+ queries daily, we found that the right search API makes the difference between a pipeline that breaks constantly and one that runs for months without intervention.
This case study walks through how we built a production-grade search pipeline using SearchHive's SwiftSearch API, the challenges we faced with alternatives, and the code patterns that actually work at scale.
Key Takeaways
- Python search APIs vary wildly in pricing, rate limits, and response quality
- SearchHive's unified API (search + scrape + extract) eliminated three separate tool subscriptions
- Async Python patterns are essential for production throughput
- Structured free JSON formatter responses from search APIs reduce parsing code by 70%
Background
Our team needed to build a competitive intelligence platform that monitors pricing, reviews, and product availability across 2,000+ e-commerce sites. The pipeline runs 24/7, processing search queries to discover new products and scraping individual pages for structured data.
Requirements:
- 50,000+ search queries per day
- Sub-2-second response times
- Reliable structured output (no HTML parsing on our end)
- Python-native SDK or clean REST API
- Budget under $500/month at scale
The Challenge: Why Other Search APIs Fell Short
We evaluated several Python search API options before settling on SearchHive.
SerpApi ($25-3,750/mo): Solid structured data, but pricing escalates fast. At 50K searches/month, you are on the $275 "Big Data" plan. The Python client works well, but adding scraping capabilities meant integrating a second service.
Serper.dev ($50 for 50K credits): Fast and cheap, but returns raw SERP data. No built-in scraping or content extraction. We would need to make a second HTTP request for every search result to get actual page content -- doubling our latency and API costs.
Tavily ($0.008/credit): Built for AI agents, not bulk data pipelines. The per-credit model adds up at scale, and the API is optimized for single-query AI use cases, not batch processing.
Brave Search API ($5/1K requests): Independent index is a plus, but at $5/1K, our 50K daily queries would cost $250/day ($7,500/month). Far beyond budget.
Google Custom Search JSON API: Being deprecated. Closed to new customers since 2025, with full shutdown by January 2027. Not viable for new projects.
Solution: SearchHive's Unified API
SearchHive provided what we needed: search, scraping, and structured extraction through a single API with a single API key. The pricing works out to significantly less than running separate services.
SearchHive pricing: Free tier (500 credits), Starter ($9/5K), Builder ($49/100K), Unicorn ($199/500K)
At our 1.5M monthly queries, the Builder plan at $49/100K credits covers us comfortably, with room to grow before hitting the Unicorn tier.
Implementation
Step 1: Set up the Python client
import requests
import asyncio
import aiohttp
from datetime import datetime
API_KEY = "your-searchhive-api-key"
BASE_URL = "https://api.searchhive.dev/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
def search_sync(query, limit=10):
# Synchronous search -- good for scripts and testing
resp = requests.post(
f"{BASE_URL}/swift/search",
headers=HEADERS,
json={"query": query, "limit": limit}
)
resp.raise_for_status()
return resp.json()
Step 2: Async batch search for throughput
For production pipelines, async requests are non-negotiable:
async def search_async(session, query, limit=10):
# Async search using aiohttp -- 10x faster than sequential
async with session.post(
f"{BASE_URL}/swift/search",
headers=HEADERS,
json={"query": query, "limit": limit}
) as resp:
resp.raise_for_status()
return await resp.json()
async def batch_search(queries, concurrency=20):
# Search multiple queries concurrently
connector = aiohttp.TCPConnector(limit=concurrency)
async with aiohttp.ClientSession(connector=connector) as session:
tasks = [search_async(session, q) for q in queries]
return await asyncio.gather(*tasks, return_exceptions=True)
# Usage
queries = [
"wireless headphones under $100",
"mechanical keyboard rgb",
"4k monitor 27 inch",
]
results = asyncio.run(batch_search(queries))
Step 3: Combine search with scraping
The real power of SearchHive is combining search with content extraction in a single pipeline:
async def search_and_extract(session, query, top_n=5):
# Search and extract content from top results
search_data = await search_async(session, query, limit=top_n)
urls = [r["url"] for r in search_data.get("results", [])]
scrape_tasks = []
for url in urls:
scrape_tasks.append(
session.post(
f"{BASE_URL}/scrape",
headers=HEADERS,
json={"url": url, "format": "markdown"}
)
)
scrape_responses = await asyncio.gather(
*scrape_tasks, return_exceptions=True
)
pages = []
for resp in scrape_responses:
if isinstance(resp, Exception):
continue
data = await resp.json()
pages.append(data)
return {"query": query, "results": pages}
Step 4: Error handling and retry logic
import backoff
@backoff.on_exception(
backoff.expo, requests.exceptions.RequestException, max_tries=3
)
def search_with_retry(query, limit=10):
# Search with automatic exponential backoff retry
return search_sync(query, limit)
# For async
async def search_with_retry_async(session, query, limit=10):
for attempt in range(3):
try:
return await search_async(session, query, limit)
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
if attempt == 2:
raise
await asyncio.sleep(2 ** attempt)
Results
After migrating to SearchHive and implementing the async pipeline patterns above:
| Metric | Before (multiple APIs) | After (SearchHive) |
|---|---|---|
| Monthly API cost | $340 | $98 |
| Average latency (search + scrape) | 4.2s | 1.8s |
| Pipeline failures/week | 12-15 | 0-1 |
| Lines of integration code | 480 | 120 |
| Services to manage | 3 (search, scrape, cache) | 1 |
The biggest win was reliability. Having search and scraping in one service with consistent error handling eliminated an entire class of failures where one service was down but the other was not.
Lessons Learned
-
Start with sync, then go async. Get the logic right with simple
requestscalls first. Optimize for throughput once the pipeline works end-to-end. -
Use structured extraction, not raw HTML. SearchHive's
format: "markdown"parameter returns clean content. Trying to parse HTML yourself is a rabbit hole of edge cases. -
Budget for errors. Even the best APIs have occasional failures. Design your pipeline to handle exceptions gracefully -- log them, skip the failed item, and move on.
-
Monitor your credit usage. Set up alerts when you approach your plan limit. SearchHive's dashboard makes this easy.
-
Cache aggressively. Search results for the same query within a short window are usually identical. A simple Redis cache can cut your API usage by 30-50%.
Get Started with SearchHive
If you are building a data pipeline in Python that needs search and scraping, SearchHive is worth a serious look. The free tier gives you 500 credits to test with -- enough to build and validate your pipeline before committing to a paid plan.