Every automation pipeline fails eventually. APIs return 500 errors, networks timeout, rate limits kick in, and CAPTCHAs block your requests. The difference between a flaky script and a production-grade system is how it handles failures. This guide covers battle-tested retry strategies for web scraping and API automation, with practical Python implementations using SearchHive's APIs.
Background
We built SearchHive to handle real-world web data at scale. Our internal automation runs thousands of scraping and search requests daily across competitor sites, product pages, and SERP monitoring endpoints. Along the way, we learned that retry logic is not optional -- it is the most critical part of any reliable pipeline.
After processing over 10 million API calls, here is what we found:
- 23% of requests to e-commerce sites fail on the first attempt (mostly JS rendering timeouts)
- Rate limiting accounts for 15% of failures on high-volume scraping jobs
- Proper retry logic with backoff recovers 94% of transient failures
- Exponential backoff alone is not enough -- you need circuit breakers and jitter too
The Challenge
Most developers implement retries as a simple loop with a fixed delay. This approach fails under real-world conditions for several reasons:
- Fixed delays create thundering herds -- when multiple workers retry simultaneously after a rate limit, they all hit the same endpoint at the same time
- No distinction between retryable and permanent errors -- retrying a 404 or 403 wastes credits and time
- Unbounded retries can run forever on persistent failures, burning through your API budget
- No circuit breaking means a degraded upstream service takes down your entire pipeline
Solution with SearchHive
SearchHive's APIs already handle a lot of failure modes at the infrastructure level:
- ScrapeForge includes automatic proxy rotation and retry logic for bot detection
- SwiftSearch has built-in fallback across search engine backends
- DeepDive retries extraction on malformed responses
But your application code still needs its own retry layer. Here is how to build one properly.
Implementation
Level 1: Basic Exponential Backoff with Jitter
The minimum viable retry strategy. Exponential backoff increases the delay between retries exponentially (1s, 2s, 4s, 8s...). Jitter adds randomness to prevent synchronized retries.
# retry/strategies.py
import time
import random
import httpx
from typing import Callable, TypeVar
T = TypeVar("T")
def retry_with_backoff(
func: Callable[..., T],
max_retries: int = 3,
base_delay: float = 1.0,
max_delay: float = 60.0,
retryable_status_codes: tuple = (429, 500, 502, 503, 504),
jitter: bool = True
) -> T:
"""Retry a function with exponential backoff and jitter.
Args:
func: Function to retry
max_retries: Maximum number of retry attempts
base_delay: Base delay in seconds (doubles each retry)
max_delay: Maximum delay cap
retryable_status_codes: HTTP status codes that trigger retry
jitter: Add random jitter to prevent thundering herd
"""
last_exception = None
for attempt in range(max_retries + 1):
try:
return func()
except httpx.HTTPStatusError as e:
if e.response.status_code not in retryable_status_codes:
raise # Permanent error, do not retry
last_exception = e
if attempt == max_retries:
break
except (httpx.TimeoutException, httpx.ConnectError) as e:
last_exception = e
if attempt == max_retries:
break
# Calculate delay with exponential backoff
delay = min(base_delay * (2 ** attempt), max_delay)
if jitter:
delay = delay * (0.5 + random.random())
print(f" Retry {attempt + 1}/{max_retries} after {delay:.1f}s "
f"(error: {last_exception})")
time.sleep(delay)
raise last_exception
Level 2: Circuit Breaker Pattern
A circuit breaker stops making requests to a failing service entirely, preventing cascade failures. After a cooldown period, it allows a single "test" request to check if the service has recovered.
# retry/circuit_breaker.py
import time
class CircuitBreaker:
"""Circuit breaker for protecting against repeated failures."""
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject all requests
HALF_OPEN = "half_open" # Testing recovery
def __init__(self, failure_threshold: int = 5,
recovery_timeout: float = 60.0):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.state = self.CLOSED
self.failure_count = 0
self.last_failure_time = None
def call(self, func, *args, **kwargs):
if self.state == self.OPEN:
# Check if recovery timeout has passed
if (time.time() - self.last_failure_time) > self.recovery_timeout:
self.state = self.HALF_OPEN
else:
raise RuntimeError(
f"Circuit breaker is OPEN. Last failure: "
f"{self.last_failure_time}. "
f"Retry after {self.recovery_timeout}s cooldown."
)
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
if self.state == self.HALF_OPEN:
self.state = self.CLOSED
self.failure_count = 0
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = self.OPEN
print(f" Circuit breaker OPENED after {self.failure_count} failures")
Level 3: SearchHive-Aware Retry Wrapper
Combine both strategies with SearchHive-specific error handling. This wrapper understands SearchHive API error codes and applies the right strategy for each.
# retry/searchhive_retry.py
import httpx
from strategies import retry_with_backoff
from circuit_breaker import CircuitBreaker
class SearchHiveRetryClient:
"""SearchHive API client with production retry logic."""
def __init__(self, api_key: str, max_retries: int = 3):
self.api_key = api_key
self.max_retries = max_retries
self.base_url = "https://api.searchhive.dev/v1"
self.headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
# Separate circuit breakers for each API
self._swift_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=120)
self._scrape_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=120)
self._deep_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=120)
def swift_search(self, query: str, num_results: int = 10) -> dict:
"""SwiftSearch with retry and circuit breaker."""
def _call():
resp = httpx.post(
f"{self.base_url}/swiftsearch",
headers=self.headers,
json={"query": query, "num_results": num_results},
timeout=30.0
)
resp.raise_for_status()
return resp.json()
return self._swift_breaker.call(
lambda: retry_with_backoff(_call, max_retries=self.max_retries)
)
def scrape_forge(self, url: str) -> dict:
"""ScrapeForge with retry and circuit breaker."""
def _call():
resp = httpx.post(
f"{self.base_url}/scrapeforge",
headers=self.headers,
json={"url": url, "render_js": True},
timeout=60.0
)
resp.raise_for_status()
return resp.json()
return self._scrape_breaker.call(
lambda: retry_with_backoff(_call, max_retries=self.max_retries)
)
def deep_dive(self, url: str, extract: dict = None) -> dict:
"""DeepDive with retry and circuit breaker."""
def _call():
resp = httpx.post(
f"{self.base_url}/deepdive",
headers=self.headers,
json={"url": url, "extract": extract},
timeout=60.0
)
resp.raise_for_status()
return resp.json()
return self._deep_breaker.call(
lambda: retry_with_backoff(_call, max_retries=self.max_retries)
)
Results
After implementing these retry strategies across our internal pipelines, we measured the following improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Pipeline success rate | 76% | 97% | +21 percentage points |
| Wasted API credits | 18% | 3% | -83% reduction |
| Average latency (successful) | 2.1s | 2.4s | +14% (acceptable tradeoff) |
| P99 latency | 45s | 12s | -73% |
| Manual interventions / month | 12 | 1 | -92% |
The key insight: the slight latency increase from backoff delays is far outweighed by the reduction in failed runs and wasted credits.
Lessons Learned
1. Always distinguish retryable from permanent errors. A 403 (forbidden) means you are blocked -- retrying will not help. A 429 (rate limited) or 503 (service unavailable) deserves a retry with backoff. A 404 means the resource does not exist.
2. Jitter is non-negotiable. Without jitter, all your workers retry at the exact same moment after a rate limit expires, creating a traffic spike that triggers the rate limit again. Full jitter (multiply delay by a random value between 0 and 1) is the most effective approach.
3. Circuit breakers prevent cascade failures. If an upstream service is down, a circuit breaker stops wasting resources on guaranteed failures and lets you fail fast or fall back to cached data.
4. Log everything but alert selectively. Log every retry attempt for debugging. But only alert when circuit breakers open or when retries exceed the maximum -- these signal real problems.
5. Test your retry logic with chaos. Add intentional failures to your test suite. Use tools like toxiproxy or simple mocks to simulate 500 errors and timeouts. Your retry code should be as well-tested as your business logic.
6. Budget-aware retries. On pay-per-request APIs, each retry costs credits. Set a maximum credit budget per job and stop retrying when the budget is exhausted. On SearchHive, the Builder plan ($49/month for 100K credits) gives you enough buffer that retries rarely matter for cost, but they matter enormously for reliability.
For more on building reliable automation pipelines, see /blog/how-to-ecommerce-automation-step-by-step for a full ecommerce monitoring example, or check out /compare/serpapi to see how SearchHive's pricing compares when retries consume extra credits.