# How to Build a Proxy Rotator for Web Scraping with Python
If you're scraping at any real volume, you need a proxy rotator. Websites track IP addresses and block scrapers that make too many requests from the same source. A rotating proxy system distributes your requests across multiple IPs, making your scraper look like traffic from many different users.
This tutorial builds a production-ready proxy rotator in Python — with health checking, automatic failover, retry logic, and smart rotation strategies. We'll also show how SearchHive eliminates the need for a custom proxy rotator entirely.
## Key Takeaways
- Round-robin rotation is the simplest strategy but can leak IP patterns to sophisticated detectors
- Health checking with timeouts and success-rate tracking removes bad proxies before they cause failures
- Backoff and retry logic handles temporary blocks without losing data
- SearchHive includes built-in proxy rotation — skip the infrastructure and get 50,000 free requests/month
- Free proxy lists work for light use but fail at scale; paid proxy providers charge per GB of bandwidth, roughly $0.50–$15/GB depending on provider and proxy type
## Prerequisites

```shell
pip install requests searchhive aiohttp
```
- requests — Synchronous HTTP client for basic scraping
- aiohttp — Async HTTP client for high-concurrency scraping
- searchhive — SearchHive Python SDK with built-in proxy rotation
You'll also need proxy servers. Options:
- Free lists: ProxyScrape, FreeProxyList (unreliable, for testing only)
- Paid providers: Bright Data ($5+/GB), Oxylabs, SmartProxy ($2.2+/GB), Webshare ($0.46+/GB)
- SearchHive: Proxy rotation included — no separate proxy provider needed
## Step 1: Build a Basic Proxy Pool

Start with a proxy pool that tracks availability and performance:

```python
import requests
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ProxyStats:
    url: str
    protocol: str  # http, https, socks5
    success_count: int = 0
    fail_count: int = 0
    avg_response_time: float = 0.0
    last_used: float = 0.0
    last_check: float = 0.0
    is_alive: bool = True


class ProxyPool:
    def __init__(self):
        self.proxies: dict[str, ProxyStats] = {}  # url -> stats

    def add(self, proxy_url: str):
        """Add a proxy to the pool."""
        parsed = urlparse(proxy_url)
        protocol = parsed.scheme or "http"
        if protocol not in ("http", "https", "socks5"):
            protocol = "http"
        self.proxies[proxy_url] = ProxyStats(url=proxy_url, protocol=protocol)

    def add_from_list(self, proxy_list: list[str]):
        """Add multiple proxies from a list."""
        for url in proxy_list:
            self.add(url)
        print(f"Added {len(proxy_list)} proxies. Pool size: {len(self.proxies)}")

    @property
    def alive_proxies(self) -> list[ProxyStats]:
        """Get all proxies currently marked as alive."""
        return [p for p in self.proxies.values() if p.is_alive]

    @property
    def size(self) -> int:
        return len(self.proxies)

    @property
    def alive_count(self) -> int:
        return len(self.alive_proxies)


# Usage
pool = ProxyPool()
pool.add_from_list([
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
    "socks5://9.10.11.12:1080",
    "http://user:pass@13.14.15.16:8080",
])
```
## Step 2: Implement Rotation Strategies

Different rotation strategies work better for different scraping scenarios:

```python
import random
import time


class RotatingProxyPool(ProxyPool):
    def __init__(self, strategy: str = "round_robin"):
        super().__init__()
        self.strategy = strategy
        self._index = 0
        # domain -> last_used_time (hook for a domain-aware strategy)
        self._domain_cooldown: dict[str, float] = {}

    def get_proxy(self, domain: str = "") -> Optional[str]:
        """Get the next proxy based on the rotation strategy."""
        alive = self.alive_proxies
        if not alive:
            raise RuntimeError("No alive proxies in pool")
        if self.strategy == "round_robin":
            proxy = alive[self._index % len(alive)]
            self._index += 1
        elif self.strategy == "random":
            proxy = random.choice(alive)
        elif self.strategy == "least_used":
            proxy = min(alive, key=lambda p: p.last_used)
        elif self.strategy == "fastest":
            alive_with_data = [p for p in alive if p.avg_response_time > 0]
            if alive_with_data:
                proxy = min(alive_with_data, key=lambda p: p.avg_response_time)
            else:
                proxy = random.choice(alive)
        else:
            # Unknown strategy: fall back to round-robin
            proxy = alive[self._index % len(alive)]
            self._index += 1
        proxy.last_used = time.time()
        return proxy.url

    def get_requests_proxies(self, proxy_url: str) -> dict:
        """Format proxy URL for the requests library."""
        parsed = urlparse(proxy_url)
        protocol = parsed.scheme or "http"
        auth = f"{parsed.username}:{parsed.password}@" if parsed.username else ""
        port = f":{parsed.port}" if parsed.port else ""
        formatted = f"{protocol}://{auth}{parsed.hostname}{port}"
        return {"http": formatted, "https": formatted}
```
### Choosing a Strategy
| Strategy | Best For | Trade-off |
|---|---|---|
| round_robin | General scraping, easy to implement | Predictable pattern, detectable |
| random | Avoiding IP sequence detection | May reuse recently blocked IPs |
| least_used | Targeting a single domain | Slower proxy reuse |
| fastest | High-throughput scraping | Overloads fast proxies |
| domain_aware | Scraping multiple domains | More complex implementation |
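The `domain_aware` row in the table isn't implemented above. As a rough standalone sketch of the idea (the class name, the 30-second default, and the LRU fallback are choices of this sketch, not a fixed recipe), track when each proxy last hit each domain and avoid reusing an IP against the same domain within a cooldown window:

```python
import time
import random


class DomainAwareRotator:
    """Sketch of a domain-aware strategy: avoid reusing the same IP
    against the same domain within a cooldown window."""

    def __init__(self, proxies: list[str], cooldown: float = 30.0):
        self.proxies = proxies
        self.cooldown = cooldown
        # (domain, proxy) -> last time that proxy hit that domain
        self._last_hit: dict[tuple[str, str], float] = {}

    def get_proxy(self, domain: str) -> str:
        now = time.time()
        # Prefer proxies whose cooldown for this domain has expired
        ready = [
            p for p in self.proxies
            if now - self._last_hit.get((domain, p), 0.0) >= self.cooldown
        ]
        if ready:
            proxy = random.choice(ready)
        else:
            # All proxies are cooling down: fall back to the least recently used
            proxy = min(self.proxies, key=lambda p: self._last_hit.get((domain, p), 0.0))
        self._last_hit[(domain, proxy)] = now
        return proxy
```

Because cooldowns are tracked per domain, the same proxy can serve two different domains back to back without conflict.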
## Step 3: Add Health Checking

Dead proxies waste time and cause failures. Health checking removes bad proxies proactively:

```python
import concurrent.futures


class SmartProxyPool(RotatingProxyPool):
    CHECK_URL = "https://httpbin.org/ip"
    CHECK_TIMEOUT = 10
    MIN_SUCCESS_RATE = 0.5

    def check_proxy(self, proxy_url: str) -> tuple[str, bool, float]:
        """Check if a single proxy is alive and measure response time."""
        try:
            start = time.time()
            resp = requests.get(
                self.CHECK_URL,
                proxies=self.get_requests_proxies(proxy_url),
                timeout=self.CHECK_TIMEOUT
            )
            elapsed = time.time() - start
            is_alive = resp.status_code == 200
            return proxy_url, is_alive, elapsed
        except Exception:
            return proxy_url, False, self.CHECK_TIMEOUT

    def health_check_all(self, max_workers: int = 20):
        """Check all proxies concurrently."""
        print(f"Health checking {len(self.proxies)} proxies...")
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(self.check_proxy, url): url
                for url in self.proxies
            }
            for future in concurrent.futures.as_completed(futures):
                url, is_alive, response_time = future.result()
                stats = self.proxies[url]
                stats.is_alive = is_alive
                stats.last_check = time.time()
                if is_alive:
                    stats.success_count += 1
                    # Exponential moving average for response time
                    if stats.avg_response_time == 0:
                        stats.avg_response_time = response_time
                    else:
                        stats.avg_response_time = (
                            0.7 * stats.avg_response_time + 0.3 * response_time
                        )
                else:
                    stats.fail_count += 1
        alive = sum(1 for p in self.proxies.values() if p.is_alive)
        print(f"Health check complete: {alive}/{len(self.proxies)} proxies alive")

    def record_result(self, proxy_url: str, success: bool, response_time: float = 0):
        """Record the result of a scraping request."""
        if proxy_url not in self.proxies:
            return
        stats = self.proxies[proxy_url]
        if success:
            stats.success_count += 1
            if response_time > 0:
                stats.avg_response_time = (
                    0.7 * stats.avg_response_time + 0.3 * response_time
                ) if stats.avg_response_time else response_time
        else:
            stats.fail_count += 1
            # Auto-disable proxy if success rate drops too low
            total = stats.success_count + stats.fail_count
            if total >= 5 and stats.success_count / total < self.MIN_SUCCESS_RATE:
                stats.is_alive = False
                print(f"Disabled proxy {proxy_url}: success rate too low")
```
## Step 4: Build the Rotating Request Handler

Wrap everything into a reusable request handler that handles rotation, retries, and failover:

```python
class RotatingScraper:
    def __init__(self, pool: SmartProxyPool, max_retries: int = 3, retry_delay: float = 2):
        self.pool = pool
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    def get(self, url: str, **kwargs) -> requests.Response:
        """Make a GET request with rotating proxy and retries."""
        domain = urlparse(url).netloc
        for attempt in range(self.max_retries):
            proxy_url = self.pool.get_proxy(domain)
            proxies = self.pool.get_requests_proxies(proxy_url)
            try:
                start = time.time()
                response = requests.get(url, proxies=proxies, timeout=15, **kwargs)
                elapsed = time.time() - start
                if response.status_code == 200:
                    self.pool.record_result(proxy_url, True, elapsed)
                    return response
                elif response.status_code in (403, 429):
                    # Blocked or rate limited — try a different proxy
                    self.pool.record_result(proxy_url, False)
                    print(f"  [{attempt+1}] Blocked ({response.status_code}), rotating proxy...")
                    time.sleep(self.retry_delay * (attempt + 1))
                    continue
                else:
                    # The proxy itself worked; let the caller handle other status codes
                    self.pool.record_result(proxy_url, True, elapsed)
                    return response
            except requests.exceptions.ProxyError:
                self.pool.record_result(proxy_url, False)
                print(f"  [{attempt+1}] Proxy error, rotating...")
                time.sleep(self.retry_delay)
                continue
            except requests.exceptions.Timeout:
                self.pool.record_result(proxy_url, False, 15)
                print(f"  [{attempt+1}] Timeout, rotating...")
                continue
            except Exception as e:
                self.pool.record_result(proxy_url, False)
                print(f"  [{attempt+1}] Error: {e}")
                time.sleep(self.retry_delay)
                continue
        raise RuntimeError(f"Failed after {self.max_retries} retries: {url}")


# Usage
pool = SmartProxyPool(strategy="fastest")
pool.add_from_list([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])
pool.health_check_all()

scraper = RotatingScraper(pool)
response = scraper.get("https://httpbin.org/ip")
print(f"Response: {response.text}")
```
## Step 5: Fetch Free Proxy Lists Programmatically

For testing purposes, you can fetch proxies from free lists:

```python
def fetch_free_proxies() -> list[str]:
    """Fetch a list of free proxies from ProxyScrape."""
    try:
        url = "https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=5000&country=all&ssl=all&anonymity=all"
        response = requests.get(url, timeout=10)
        proxies = [
            f"http://{line.strip()}"
            for line in response.text.strip().split('\n')
            if line.strip()
        ]
        return proxies
    except Exception as e:
        print(f"Failed to fetch proxies: {e}")
        return []


# Fetch and test free proxies
free_proxies = fetch_free_proxies()
print(f"Fetched {len(free_proxies)} free proxies")

pool = SmartProxyPool(strategy="fastest")
pool.add_from_list(free_proxies[:50])  # Test first 50
pool.health_check_all(max_workers=50)
print(f"Alive: {pool.alive_count}/{pool.size}")
```
Free proxies have ~20–30% reliability. They work for testing but not for production scraping.
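Free lists also contain junk: malformed lines, out-of-range ports, and duplicates. A small sanitizing pass before `add_from_list` saves health-check time. This helper is a sketch (the name `clean_proxy_list` is ours, and it assumes plain `ip:port` lines like ProxyScrape returns):

```python
import re

# host:port, where host looks like an IPv4 address
_PROXY_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")


def clean_proxy_list(lines: list[str]) -> list[str]:
    """Drop malformed entries and duplicates from a raw free-proxy list."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        m = _PROXY_RE.match(line)
        if not m:
            continue  # not ip:port at all
        host, port = m.group(1), int(m.group(2))
        if not (1 <= port <= 65535):
            continue  # impossible port
        if any(int(octet) > 255 for octet in host.split(".")):
            continue  # impossible IPv4 octet
        url = f"http://{line}"
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

Feed the cleaned list into the pool instead of the raw response lines, so the health checker only spends time on plausible candidates.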
## Step 6: The Simpler Alternative — SearchHive

Managing proxy rotation infrastructure is time-consuming. SearchHive handles it all for you — residential proxy rotation, health checking, automatic failover, and retry logic — built into every API call:

```python
from searchhive import ScrapeForge

client = ScrapeForge()

# Single request — automatic proxy rotation, no configuration needed
result = client.scrape(
    url="https://example.com/products",
    render_js=True,
    selectors={
        "title": "h1",
        "price": ".price-tag",
        "description": ".product-desc"
    }
)
print(result.data)
```
No proxy pool to manage. No health checking. No retry logic to write. Each request automatically routes through a different residential IP.
### Batch Scraping with Built-in Rotation

```python
urls = [f"https://store.example.com/page-{i}" for i in range(1, 101)]

results = client.scrape_batch(
    urls,
    render_js=True,
    selectors={"product_name": "h2", "price": ".price"},
    concurrency=10  # 10 parallel requests with auto rate limiting
)

successes = [r for r in results if r.success]
print(f"Scraped {len(successes)} of {len(urls)} pages")
```
10 concurrent requests, each through a different proxy, with automatic retry on failure. The equivalent custom proxy rotator would need hundreds of lines of code and a paid proxy provider.
### Cost Comparison
| Approach | Setup Time | Monthly Cost (100K requests) | Reliability |
|---|---|---|---|
| Custom + free proxies | 2-4 hours | $0 | 20-30% success rate |
| Custom + paid proxies | 1-2 hours | $50-150 (proxy cost) | 85-95% success rate |
| SearchHive Free | 5 minutes | $0 | 95%+ success rate |
| SearchHive Pro | 5 minutes | $29 | 99%+ success rate |
SearchHive's free tier alone handles what would cost $50-150/month in proxy provider fees — with better reliability.
## Complete Code Example

Here's the full proxy rotator system:

```python
import requests
import time
import random
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor


@dataclass
class Proxy:
    url: str
    success: int = 0
    fails: int = 0
    avg_time: float = 0.0
    alive: bool = True


class ProxyRotator:
    def __init__(self, strategy="round_robin", retries=3):
        self.proxies: dict[str, Proxy] = {}
        self.strategy = strategy
        self.retries = retries
        self._idx = 0

    def load(self, urls: list[str]):
        for u in urls:
            self.proxies[u] = Proxy(url=u)
        print(f"Loaded {len(self.proxies)} proxies")

    def check_health(self, workers=20):
        def _check(url):
            try:
                t = time.time()
                r = requests.get("https://httpbin.org/ip",
                                 proxies={"http": url, "https": url}, timeout=10)
                return url, r.ok, time.time() - t
            except Exception:
                return url, False, 10

        with ThreadPoolExecutor(workers) as ex:
            for url, ok, t in ex.map(_check, self.proxies):
                p = self.proxies[url]
                p.alive = ok
                if ok:
                    p.success += 1
                    p.avg_time = 0.7 * p.avg_time + 0.3 * t if p.avg_time else t
                else:
                    p.fails += 1
        alive = sum(1 for p in self.proxies.values() if p.alive)
        print(f"Health check: {alive}/{len(self.proxies)} alive")

    def next_proxy(self) -> str:
        alive = [p for p in self.proxies.values() if p.alive]
        if not alive:
            raise RuntimeError("No proxies available")
        if self.strategy == "random":
            p = random.choice(alive)
        elif self.strategy == "fastest":
            timed = [x for x in alive if x.avg_time > 0]
            p = min(timed, key=lambda x: x.avg_time) if timed else random.choice(alive)
        else:  # round_robin
            p = alive[self._idx % len(alive)]
            self._idx += 1
        return p.url

    def fetch(self, url: str, **kw) -> requests.Response:
        for attempt in range(self.retries):
            proxy = self.next_proxy()
            try:
                t = time.time()
                r = requests.get(url, proxies={"http": proxy, "https": proxy},
                                 timeout=15, **kw)
                p = self.proxies[proxy]
                if r.ok:
                    p.success += 1
                    p.avg_time = 0.7 * p.avg_time + 0.3 * (time.time() - t)
                    return r
                p.fails += 1
                if r.status_code in (403, 429):
                    time.sleep(2 ** attempt)  # exponential backoff on blocks
            except Exception:
                self.proxies[proxy].fails += 1
        raise RuntimeError(f"Failed: {url}")


if __name__ == "__main__":
    rotator = ProxyRotator(strategy="fastest")
    # Add your proxies here
    rotator.load(["http://proxy1:8080", "http://proxy2:8080"])
    rotator.check_health()
    resp = rotator.fetch("https://httpbin.org/ip")
    print(resp.json())
```
## Common Issues
### All proxies fail health check
Free proxy lists expire quickly. Refresh every few hours. For production, use a paid provider or switch to SearchHive.
### Getting 403 despite rotating proxies

The target site may use browser fingerprinting. Consider adding randomized User-Agent and Accept-Language headers to each request.
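A minimal sketch of that header rotation, to pair with the rotating proxy (the User-Agent strings below are illustrative examples and go stale, so keep whatever list you use current):

```python
import random

# Example desktop User-Agent strings (illustrative only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "en-US,en;q=0.5"]


def random_headers() -> dict[str, str]:
    """Build a randomized header set for each rotated request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

# Pass alongside the rotated proxy, e.g.:
# scraper.get(url, headers=random_headers())
```

This way each request varies both its IP and its header fingerprint, rather than sending one static signature through many IPs.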
### SOCKS5 proxies slower than HTTP
SOCKS5 proxies add overhead for DNS resolution through the proxy. Prefer HTTP/HTTPS proxies unless you need UDP support.
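If you do use SOCKS5 with `requests` (which needs `pip install requests[socks]`), note that the URL scheme controls where DNS resolution happens: `socks5://` resolves hostnames locally before tunneling, while `socks5h://` pushes resolution to the proxy. A small helper illustrating the difference (the function name is this sketch's, not part of `requests`):

```python
def socks_proxies(host: str, port: int, remote_dns: bool = True) -> dict[str, str]:
    """Build a requests-style proxies dict for a SOCKS5 proxy.

    socks5://  - hostname resolved locally, then tunneled (leaks DNS lookups)
    socks5h:// - hostname resolved by the proxy itself
    """
    scheme = "socks5h" if remote_dns else "socks5"
    url = f"{scheme}://{host}:{port}"
    return {"http": url, "https": url}

# Usage (requires requests[socks] installed):
# requests.get("https://httpbin.org/ip", proxies=socks_proxies("9.10.11.12", 1080))
```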
### Proxy provider costs adding up
Calculate your cost per successful request. If you're paying $10/GB and each request is 50KB, that's $0.0005/request. SearchHive Pro at $29/mo for 100K requests is $0.00029/request — often cheaper than raw proxy costs.
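That arithmetic assumes every request succeeds. With pay-per-GB proxies you pay for failed requests too, so divide by your success rate. A quick calculator (a sketch; treats 1 GB as 10^6 KB for round numbers):

```python
def cost_per_success(price_per_gb: float, kb_per_request: float,
                     success_rate: float) -> float:
    """Effective cost per *successful* request on a pay-per-GB proxy.

    Failed requests still consume bandwidth, so the per-request cost
    is divided by the fraction of requests that succeed.
    """
    cost_per_request = price_per_gb * kb_per_request / 1_000_000  # 1 GB ~ 1e6 KB
    return cost_per_request / success_rate


# $10/GB, 50 KB per request, 90% success rate
print(f"${cost_per_success(10, 50, 0.9):.6f}")  # → $0.000556
```

At a 90% success rate the $0.0005 figure climbs past $0.00055 per successful request, and it degrades further as the success rate drops.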
## Next Steps
- If you're building a production scraper, consider whether managing your own proxy rotator is worth it — SearchHive eliminates the infrastructure entirely
- Check /blog/how-to-extract-contact-info-from-websites-with-python for a scraping tutorial that uses SearchHive's built-in rotation
- See /compare/brightdata for a detailed cost comparison between SearchHive and enterprise proxy providers
Stop managing proxy infrastructure. Start with SearchHive's free tier — 50,000 requests/month with automatic proxy rotation, JS rendering, and a clean Python SDK. Read the docs.