Complete Guide to Automation Error Handling: Building Resilient Workflows
Automation error handling is the difference between a workflow that runs reliably for months and one that silently fails at 3 AM, corrupting data and missing deadlines. Whether you're building web scrapers, API pipelines, or AI agent workflows, errors are inevitable — how you handle them determines your system's reliability.
This guide covers practical error handling patterns for automation workflows, with code examples using SearchHive's APIs as the reference implementation.
Key Takeaways
- Errors are a normal part of automation, not rare exceptions — design for failure from the start
- Retry with exponential backoff handles transient failures — most API errors resolve within seconds
- Circuit breakers prevent cascade failures — stop calling a failing service before it takes down your whole pipeline
- Dead letter queues preserve failed work — never silently drop data
- Logging and alerting make failures visible — a failing system with no alerts is worse than no system at all
- SearchHive's unified API reduces error surface area — one provider means one set of error codes, one retry strategy, one monitoring dashboard
Why Automation Fails
Understanding why automated workflows fail helps you design better error handling. The most common failure categories:
Network failures. DNS resolution fails, TCP connections drop, TLS handshakes time out. These failures are usually transient and resolve with retries.
Rate limiting. APIs return 429 Too Many Requests when you exceed their rate limits. You need to respect the Retry-After header.
Service failures. The API you're calling is down or returning 500 errors. Circuit breakers prevent you from hammering a dying service.
Data format changes. The website you're scraping changed its HTML structure. Your CSS selectors return empty results.
Authentication failures. API keys expire, tokens get revoked. You need credential rotation.
Resource exhaustion. Memory leaks, disk full, file descriptor limits. Long-running processes need health checks.
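These categories map onto two broad handling strategies: transient failures are worth retrying, while permanent ones need a code or credential fix. A rough classifier sketch (the status-code sets and function name are illustrative, not tied to any particular API):

```python
# Status codes that usually resolve on their own: overload, rate limits, 5xx.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def classify_failure(exc=None, status_code=None):
    """Return 'transient' (retry with backoff) or 'permanent' (fail fast)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"      # network-level failure: usually recovers
    if status_code in TRANSIENT_STATUS:
        return "transient"      # server overload or rate limiting
    return "permanent"          # 4xx, bad auth, parse errors: retrying won't help
```

The patterns in the rest of this guide are essentially different responses to these two classes.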
Pattern 1: Retry with Exponential Backoff
The single most impactful error handling pattern. Most failures are transient — the API was briefly overloaded, a network packet was lost, a connection timed out. Retry with increasing delays gives the system time to recover.
```python
import logging
import random
import time

import requests

logger = logging.getLogger(__name__)

def api_call_with_retry(url, method="GET", max_retries=5,
                        base_delay=1.0, **kwargs):
    for attempt in range(max_retries):
        try:
            response = requests.request(method, url, timeout=30, **kwargs)

            # Success
            if response.status_code == 200:
                return response.json()

            # Rate limited — respect Retry-After header
            if response.status_code == 429:
                retry_after = int(response.headers.get(
                    "Retry-After", base_delay * (2 ** attempt)))
                logger.warning(f"Rate limited, retrying in {retry_after}s "
                               f"(attempt {attempt + 1})")
                time.sleep(retry_after)
                continue

            # Client error — don't retry 4xx (except 429)
            if 400 <= response.status_code < 500:
                logger.error(f"Client error {response.status_code}: "
                             f"{response.text[:200]}")
                raise RuntimeError(f"Client error: {response.status_code}")

            # Server error — fall through to backoff and retry
            logger.warning(f"Server error {response.status_code}, retrying "
                           f"(attempt {attempt + 1})")
        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}")
        except requests.exceptions.ConnectionError:
            logger.warning(f"Connection error on attempt {attempt + 1}")

        # Exponential backoff with jitter; skip the sleep after the last attempt
        if attempt < max_retries - 1:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

    raise RuntimeError(f"Failed after {max_retries} retries")
```
Applied to SearchHive
```python
API_KEY = "your-searchhive-api-key"

def searchhive_search(query, limit=10):
    return api_call_with_retry(
        "https://api.searchhive.dev/v1/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"query": query, "limit": limit},
    )

def searchhive_scrape(url, format="markdown"):
    return api_call_with_retry(
        "https://api.searchhive.dev/v1/scrape",
        method="POST",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url, "render_js": True, "format": format},
    )
```
The retry wrapper handles SearchHive's transient errors (network timeouts, 429 rate limits, 5xx server errors) without any per-API custom logic.
Pattern 2: Circuit Breaker
When an external service is consistently failing, stop calling it. Circuit breakers prevent cascade failures where one slow dependency takes down your entire pipeline.
```python
import logging
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half_open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half_open"
                logging.info("Circuit breaker entering half-open state")
            else:
                raise RuntimeError(
                    f"Circuit breaker is OPEN (failures: {self.failure_count})")
        try:
            result = func(*args, **kwargs)
            # Any success resets the consecutive-failure count
            if self.state == "half_open":
                self.state = "closed"
                logging.info("Circuit breaker reset to closed")
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                logging.error(
                    f"Circuit breaker OPENED after {self.failure_count} failures")
            raise

# Usage with SearchHive
search_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

def safe_search(query):
    return search_breaker.call(searchhive_search, query)
```
If SearchHive's API goes down (or any provider), the circuit breaker stops calling it after 3 consecutive failures and waits 30 seconds before trying again. This prevents your pipeline from wasting time and credits on a dead service.
Pattern 3: Dead Letter Queue
When a task fails after all retries, don't silently drop it. Put it in a dead letter queue for manual inspection and retry.
```python
import json
import logging
import os
import time
from datetime import datetime, timezone

DEAD_LETTER_DIR = "/tmp/dead_letter_queue"

def save_to_dead_letter(task_type, input_data, error):
    os.makedirs(DEAD_LETTER_DIR, exist_ok=True)
    record = {
        "task_type": task_type,
        "input": input_data,
        "error": str(error),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    filename = f"{task_type}_{int(time.time())}.json"
    filepath = os.path.join(DEAD_LETTER_DIR, filename)
    with open(filepath, "w") as f:
        json.dump(record, f, indent=2)
    logging.error(f"Saved to dead letter queue: {filepath}")

# Usage in a pipeline
def scrape_with_dead_letter(url):
    try:
        return searchhive_scrape(url)
    except Exception as e:
        save_to_dead_letter("scrape", {"url": url}, e)
        return None
```
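The flip side of a dead letter queue is a replay tool that re-reads the saved records and retries them once the underlying problem is fixed. A minimal sketch, assuming the JSON record format written above (the handler-mapping design is an illustration, not a fixed API):

```python
import json
import logging
import os

def replay_dead_letters(handlers, directory="/tmp/dead_letter_queue"):
    """Retry every record in the dead letter queue.

    handlers maps task_type -> callable taking the saved input dict.
    Successfully replayed records are deleted; failures stay queued.
    """
    replayed, still_failing = 0, 0
    if not os.path.isdir(directory):
        return replayed, still_failing
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(directory, name)
        with open(path) as f:
            record = json.load(f)
        handler = handlers.get(record["task_type"])
        if handler is None:
            continue                     # no handler registered: leave it queued
        try:
            handler(record["input"])
            os.remove(path)              # success: clear the record
            replayed += 1
        except Exception as e:
            logging.warning(f"Replay failed for {name}: {e}")
            still_failing += 1
    return replayed, still_failing
```

Running it periodically (or manually after an incident) turns the dead letter queue from a graveyard into a retry buffer, for example `replay_dead_letters({"scrape": lambda inp: searchhive_scrape(inp["url"])})`.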
Pattern 4: Graceful Degradation
When a dependency fails, don't crash the entire workflow. Fall back to a less capable but still functional alternative.
```python
import logging

import requests

def get_page_content(url):
    # Primary: SearchHive ScrapeForge (full JS rendering)
    try:
        result = searchhive_scrape(url, format="markdown")
        if result and result.get("content"):
            return result["content"]
    except Exception as e:
        logging.warning(f"ScrapeForge failed: {e}")

    # Fallback: plain HTTP request (no JS rendering)
    try:
        response = requests.get(url, timeout=10, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36",
        })
        if response.status_code == 200:
            return response.text[:50000]
    except Exception as e:
        logging.warning(f"HTTP fallback failed: {e}")

    # Last resort: return a cached version or None
    logging.error(f"All extraction methods failed for {url}")
    return None
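The "cached version" last resort only works if successful extractions are saved somewhere. A minimal file-backed cache sketch that could sit behind `get_page_content` (the directory, key scheme, and 24-hour TTL are all assumptions):

```python
import hashlib
import json
import os
import time

CACHE_DIR = "/tmp/content_cache"

def _cache_path(url):
    # Hash the URL so arbitrary URLs map to safe, fixed-length filenames
    return os.path.join(CACHE_DIR,
                        hashlib.sha256(url.encode()).hexdigest() + ".json")

def cache_put(url, content):
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(_cache_path(url), "w") as f:
        json.dump({"url": url, "content": content,
                   "fetched_at": time.time()}, f)

def cache_get(url, max_age=86400):
    """Return cached content if younger than max_age seconds, else None."""
    try:
        with open(_cache_path(url)) as f:
            entry = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if time.time() - entry["fetched_at"] > max_age:
        return None                      # stale: caller should re-fetch
    return entry["content"]
```

Call `cache_put` on every successful extraction, and serve `cache_get(url)` as the final fallback: stale content is usually better than no content.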
Pattern 5: Batch Processing with Partial Failure Handling
When processing hundreds or thousands of items, don't let one failure stop the batch. Process each item independently and report aggregate results.
```python
import logging

def process_url_batch(urls, process_fn):
    results = {"success": 0, "failed": 0, "errors": []}
    for url in urls:
        try:
            process_fn(url)
            results["success"] += 1
        except Exception as e:
            results["failed"] += 1
            results["errors"].append({"url": url, "error": str(e)})
    logging.info(f"Batch complete: {results['success']} success, "
                 f"{results['failed']} failed")
    if results["failed"] > 0:
        logging.warning(f"Failed URLs: {[e['url'] for e in results['errors']]}")
    return results

# Process 100 URLs — continue even if some fail
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
stats = process_url_batch(urls, searchhive_scrape)
```
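For larger batches, the same partial-failure pattern carries over to a thread pool: each future is handled independently, so one failure never aborts the rest. A sketch (the `max_workers` default is an assumption; tune it to your rate limits):

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_url_batch_concurrent(urls, process_fn, max_workers=8):
    """Run process_fn over urls concurrently, collecting per-item outcomes."""
    results = {"success": 0, "failed": 0, "errors": []}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()          # re-raises the worker's exception here
                results["success"] += 1
            except Exception as e:
                results["failed"] += 1
                results["errors"].append({"url": url, "error": str(e)})
    logging.info(f"Batch complete: {results['success']} success, "
                 f"{results['failed']} failed")
    return results
```

Keep `max_workers` modest when calling a rate-limited API: more concurrency just converts server capacity into 429 responses.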
Error Handling Anti-Patterns
Swallowing exceptions. Never use a bare `except: pass`. Every error should be logged or explicitly handled.
Retrying indefinitely. Always set a maximum retry count. Infinite retries can deadlock your pipeline.
Ignoring rate limits. Always check for 429 responses and respect Retry-After headers.
No alerting on failures. If your dead letter queue grows to 1,000 items and nobody notices, your error handling is incomplete. Set up alerts on failure rates.
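A failure-rate check can be a few lines wired into any batch job. A sketch (the 10% threshold is an assumption; in production the log call would be a pager, Slack, or email hook):

```python
import logging

def check_failure_rate(success, failed, threshold=0.10):
    """Return True and log an alert when a batch's failure rate exceeds threshold."""
    total = success + failed
    if total == 0:
        return False
    rate = failed / total
    if rate > threshold:
        logging.error(f"ALERT: failure rate {rate:.1%} exceeds "
                      f"{threshold:.0%} threshold")
        return True
    return False
```

Calling this on every batch's `results` dict turns silent degradation into a visible signal before the dead letter queue fills up.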
Single point of failure. Don't run your automation on a single server with no redundancy. Use a queue system (Redis, SQS) to decouple task submission from execution.
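The decoupling idea can be illustrated in-process with the standard library; in production the queue would be Redis or SQS and the workers separate processes, but the shape is the same (a sketch; the worker count and sentinel-shutdown design are assumptions):

```python
import logging
import queue
import threading

def run_workers(tasks, handler, num_workers=4):
    """Decouple task submission from execution via a work queue."""
    q = queue.Queue()
    results, lock = [], threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:             # sentinel: shut this worker down
                q.task_done()
                return
            try:
                out = handler(task)
                with lock:
                    results.append(out)
            except Exception as e:
                logging.warning(f"Task failed: {e}")  # real systems: dead-letter it
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)                      # submission returns immediately
    for _ in threads:
        q.put(None)                      # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results
```

Because submission only enqueues, a slow or crashing worker never blocks new tasks from being accepted, which is the property an external queue gives you across machines.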
Why SearchHive Reduces Error Handling Complexity
Every additional API provider in your stack multiplies your error handling surface area. Each provider has different:
- Authentication methods and token refresh logic
- Rate limit behavior and Retry-After headers
- Error response formats and status codes
- Timeout characteristics and retry recommendations
SearchHive unifies search, scraping, and extraction into one API with consistent error handling. One retry strategy, one circuit breaker, one monitoring dashboard. This means fewer failure modes to handle and less code to maintain.
The cost advantage compounds too. At $9/month for 5,000 credits on the Starter plan, SearchHive is significantly cheaper than running separate SerpAPI ($50/month) + Firecrawl ($16/month) subscriptions with separate error handling for each.
Getting Started
Build more resilient automation with SearchHive. The free tier gives you 500 credits to test error handling patterns in a real environment — no credit card required.
- Free tier: 500 credits/month
- Starter: $9/month for 5,000 credits
- Builder: $49/month for 100,000 credits
- Documentation: searchhive.dev/docs
See also: /compare/serper for a comparison of API reliability between SearchHive and Serper, or /compare/firecrawl for scraping-specific error handling considerations.