API Caching Strategies: How a Data Pipeline Cut Costs by 80% with SearchHive
API caching strategies can reduce your external API costs by 50-90% while simultaneously improving response times. This case study shows how a data analytics team used SearchHive's search and scraping APIs combined with intelligent caching to transform their cost structure.
Background
The team runs a competitive intelligence platform that monitors 500+ websites daily. Each monitoring cycle involves:
- Searching for recent mentions of tracked companies (500 queries)
- Scraping top results for full content (2,000+ pages)
- Extracting structured data from scraped pages (2,000+ extractions)
On their initial setup using separate providers (SerpAPI for search, Firecrawl for scraping, custom parsers for extraction), this workflow cost over $400/month and took 4+ hours per cycle.
The Challenge
Three problems drove the team to rethink their approach:
Redundant API calls. The same search queries were being run every cycle, even when results hadn't changed. Companies mentioned once often appeared in the same search results for days.
Repeated scraping of static pages. Blog posts, press releases, and news articles don't change after publication. Yet the pipeline scraped the same pages every cycle.
No deduplication. Multiple search queries sometimes returned the same URLs, resulting in the same pages being scraped multiple times in a single cycle.
The team estimated that 60-70% of their API calls were redundant — fetching data that hadn't changed since the last cycle.
Solution: Multi-Layer Caching with SearchHive
The team migrated to SearchHive for the unified API and implemented a three-layer caching strategy:
Layer 1: Response Cache (URL-based)
Cache API responses keyed by URL and query parameters. Same URL or same search query returns the cached response without hitting the API.
```python
import hashlib
import json
import os
import time

import requests

CACHE_DIR = "/tmp/api_cache"
os.makedirs(CACHE_DIR, exist_ok=True)
DEFAULT_TTL = 3600  # 1 hour default

def get_cache_key(url, params=None):
    key_str = f"{url}:{json.dumps(params or {}, sort_keys=True)}"
    return hashlib.sha256(key_str.encode()).hexdigest()

def cached_api_call(url, params=None, ttl=DEFAULT_TTL, method="GET",
                    post_json=None, headers=None):
    # Include the POST body in the key so different payloads
    # to the same endpoint don't collide on one cache entry.
    cache_key = get_cache_key(url, {"params": params, "json": post_json})
    cache_file = os.path.join(CACHE_DIR, f"{cache_key}.json")

    # Check cache
    if os.path.exists(cache_file):
        with open(cache_file, "r") as f:
            cached = json.load(f)
        if time.time() - cached["timestamp"] < ttl:
            return cached["data"], True  # (data, from_cache)

    # Cache miss -- make the API call
    if method == "POST":
        resp = requests.post(url, json=post_json, headers=headers, timeout=30)
    else:
        resp = requests.get(url, params=params, headers=headers, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    # Save to cache
    with open(cache_file, "w") as f:
        json.dump({"timestamp": time.time(), "data": data}, f)
    return data, False
```
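Because `params` are serialized with `sort_keys=True`, the cache key is insensitive to parameter order, so logically identical queries share one cache entry. A quick self-contained check (a standalone copy of `get_cache_key`, for illustration only):

```python
import hashlib
import json

def get_cache_key(url, params=None):
    # Sorted-key serialization makes the key order-insensitive
    key_str = f"{url}:{json.dumps(params or {}, sort_keys=True)}"
    return hashlib.sha256(key_str.encode()).hexdigest()

k1 = get_cache_key("https://api.searchhive.dev/v1/search", {"query": "acme", "limit": 10})
k2 = get_cache_key("https://api.searchhive.dev/v1/search", {"limit": 10, "query": "acme"})
k3 = get_cache_key("https://api.searchhive.dev/v1/search", {"query": "other", "limit": 10})
assert k1 == k2   # same logical query, same cache entry
assert k1 != k3   # different query, different entry
```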
Layer 2: Content Fingerprinting
For scraped content, store a hash of the content alongside the timestamp of the last scrape. Within the TTL window, serve the cached copy outright; after a re-fetch, the stored hash tells you whether the page actually changed, so unchanged pages can skip downstream extraction. This is the client-side equivalent of an HTTP ETag.
```python
import hashlib
import time

import requests

def content_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

def should_rescrape(url, fingerprint_store, ttl=86400):
    """Check if a page needs re-scraping based on its fingerprint record."""
    record = fingerprint_store.get(url)
    if not record:
        return True   # Never scraped before
    if time.time() - record["timestamp"] > ttl:
        return True   # TTL expired -- re-fetch and re-hash
    return False      # Within TTL -- serve the cached copy

# In-memory fingerprint store (use Redis/DB in production)
fingerprints = {}

def smart_scrape(url):
    if not should_rescrape(url, fingerprints, ttl=86400):
        cached = fingerprints[url]
        return {"content": cached["content"], "from_cache": True}

    # Fresh scrape via SearchHive
    resp = requests.post(
        "https://api.searchhive.dev/v1/scrape",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"url": url, "render_js": True, "format": "markdown"},
        timeout=60,
    )
    data = resp.json()
    content = data.get("content", "")

    # Update the fingerprint so change detection works on the next fetch
    fingerprints[url] = {
        "hash": content_hash(content),
        "content": content,
        "timestamp": time.time(),
    }
    return {"content": content, "from_cache": False}
```
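The stored hash earns its keep on the fetch after the TTL expires: if the new content hashes to the same value, downstream extraction can be skipped even though the page was re-scraped. A sketch of that comparison (the `needs_reprocessing` helper is illustrative, not part of the pipeline above):

```python
import hashlib
import time

def content_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

fingerprints = {}

def needs_reprocessing(url, new_content):
    """After a fresh scrape, decide whether extraction must re-run.

    Returns True when the page is new or its content hash changed.
    """
    new_hash = content_hash(new_content)
    record = fingerprints.get(url)
    changed = record is None or record["hash"] != new_hash
    fingerprints[url] = {"hash": new_hash, "timestamp": time.time()}
    return changed

first = needs_reprocessing("https://example.com/post", "Hello world")    # True: never seen
second = needs_reprocessing("https://example.com/post", "Hello world")   # False: unchanged
third = needs_reprocessing("https://example.com/post", "Hello, world!")  # True: content changed
```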
Layer 3: Deduplication Queue
Before processing, deduplicate URLs across all search results in a single cycle.
```python
API_KEY = "YOUR_API_KEY"

def deduplicated_pipeline(search_queries):
    """Run multiple searches, deduplicate URLs, then scrape unique pages."""
    seen_urls = set()
    unique_results = []

    # Step 1: Search with response caching (Layer 1)
    for query in search_queries:
        data, from_cache = cached_api_call(
            "https://api.searchhive.dev/v1/search",
            params={"query": query, "limit": 10},
            ttl=1800,  # 30 min cache for search results
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        for result in data.get("results", []):
            if result["url"] not in seen_urls:
                seen_urls.add(result["url"])
                unique_results.append(result)

    # Step 2: Scrape only unique URLs with content fingerprinting (Layer 2)
    scraped_content = []
    for result in unique_results:
        scrape_result = smart_scrape(result["url"])
        scraped_content.append({
            "url": result["url"],
            "title": result.get("title", ""),
            "content": scrape_result["content"],
            "from_cache": scrape_result["from_cache"],
        })
    return scraped_content
```
Implementation Details
The team deployed this caching system with four TTL tiers tuned to their use case:
| Data Type | Cache TTL | Rationale |
|---|---|---|
| Search results | 30 minutes | Search rankings change slowly |
| Blog post content | 7 days | Published articles don't change |
| News article content | 6 hours | News pages update more frequently |
| Product pages | 24 hours | Prices and availability change daily |
They used Redis as the production cache backend, replacing the file-based cache shown in the examples. Redis provided sub-millisecond cache lookups and built-in TTL expiration.
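The migration is mostly mechanical because Redis exposes the same set-with-expiry/get shape the file cache implements. A minimal in-memory stand-in illustrating that interface (production code would use `redis-py`'s `set(key, value, ex=ttl)` and `get`; the `TTLCache` class here is a hypothetical sketch, not a Redis client):

```python
import time

class TTLCache:
    """In-memory stand-in for Redis SET-with-EX / GET semantics."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ex):
        # `ex` mirrors Redis's expiry-in-seconds argument
        self._store[key] = (value, time.time() + ex)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.time() >= expires_at:
            del self._store[key]   # lazy expiration reads as a cache miss
            return None
        return value

cache = TTLCache()
cache.set("search:acme", '{"results": []}', ex=0.05)
hit = cache.get("search:acme")    # fresh: returns the cached value
time.sleep(0.06)
miss = cache.get("search:acme")   # expired: returns None
```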
Results
After implementing the three-layer caching strategy with SearchHive:
API costs dropped 80%. From over $400/month with separate providers to $49/month on SearchHive's Builder plan (100K credits, with headroom to spare).
Cycle time dropped 65%. From 4+ hours to under 1.5 hours. Cache hits return in milliseconds, and deduplication eliminated 40% of scraping work per cycle.
Reliability improved. SearchHive's unified API meant one set of retry logic, one monitoring dashboard, and one provider relationship instead of three.
Cache hit rates stabilized at 65%. After the initial warm-up cycle, roughly two-thirds of all API calls were served from cache.
Cost Breakdown Comparison
| Component | Before (Separate) | After (SearchHive + Cache) |
|---|---|---|
| Search API | $150/mo | ~$15/mo (65% cached) |
| Scraping API | $200/mo | ~$25/mo (70% cached) |
| Extraction | $50/mo (custom infra) | ~$9/mo (included) |
| Total | $400/mo | $49/mo (Builder plan) |
Lessons Learned
1. Not all data needs real-time freshness. Most competitive intelligence doesn't change hour by hour. Aggressive caching (30-minute to 7-day TTLs) dramatically reduces costs with minimal staleness.
2. Deduplication is the biggest quick win. Before adding any caching, simply deduplicating URLs across search queries eliminated 40% of redundant work. This alone justified the migration.
3. Unified APIs make caching simpler. With SearchHive, one caching layer covers search, scraping, and extraction. With separate providers, you need separate caching logic for each.
4. Monitor cache hit rates. If your hit rate drops below 50%, your TTLs are too short or your data access patterns have changed. Set up alerts on cache hit rates.
5. Start with file-based caching, graduate to Redis. The file-based approach works for prototyping and low-volume use. Redis adds persistence, atomic operations, and distributed access for production workloads.
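Lesson 4 is easy to wire up because `cached_api_call` already returns a `(data, from_cache)` tuple. A small tracker sketch (the `CacheStats` class and the 50% alert threshold are illustrative):

```python
class CacheStats:
    """Track hit rate from the (data, from_cache) tuples the cache layer returns."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, from_cache):
        if from_cache:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for from_cache in [True, True, False, True]:   # e.g. collected over one cycle
    stats.record(from_cache)

assert stats.hit_rate == 0.75
if stats.hit_rate < 0.5:
    print("ALERT: cache hit rate below 50% -- revisit TTLs")
```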
Start Optimizing with SearchHive
Whether you're running a competitive intelligence pipeline, a content aggregation system, or an AI agent workflow, caching SearchHive API calls can cut your costs dramatically. Start with the free tier — 500 credits/month to prototype your caching layer. The Builder plan ($49/month, 100K credits) handles most production workloads even without aggressive caching.
- Free tier: 500 credits/month
- Starter: $9/month for 5,000 credits
- Builder: $49/month for 100,000 credits
- Docs: searchhive.dev/docs
See also: /compare/serpapi for a detailed pricing comparison showing how SearchHive + caching beats SerpAPI on cost, or /compare/firecrawl for scraping-specific optimization strategies.