# How to Build a Proxy Rotator for Web Scraping with Python
If you're scraping at any real volume, you need a proxy rotator. Websites track IP addresses and block scrapers that make too many requests from the same source. A rotating proxy system distributes your requests across multiple IPs, making your scraper look like traffic from many different users.
This tutorial builds a production-ready proxy rotator in Python — with health checking, automatic failover, retry logic, and smart rotation strategies. We'll also show how SearchHive eliminates the need for a custom proxy rotator entirely.
## Key Takeaways
- Round-robin rotation is the simplest strategy but can leak IP patterns to sophisticated detectors
- Health checking with timeouts and success-rate tracking removes bad proxies before they cause failures
- Backoff and retry logic handles temporary blocks without losing data
- SearchHive includes built-in proxy rotation — skip the infrastructure and get 50,000 free requests/month
- Free proxy lists work for light use but fail at scale; paid proxy providers charge per GB of bandwidth, roughly $0.50–$15/GB depending on provider and proxy type
## Prerequisites

```shell
pip install requests searchhive aiohttp
```
- requests — Synchronous HTTP client for basic scraping
- aiohttp — Async HTTP client for high-concurrency scraping
- searchhive — SearchHive Python SDK with built-in proxy rotation
You'll also need proxy servers. Options:
- Free lists: ProxyScrape, FreeProxyList (unreliable, for testing only)
- Paid providers: Bright Data ($5+/GB), Oxylabs, SmartProxy ($2.2+/GB), Webshare ($0.46+/GB)
- SearchHive: Proxy rotation included — no separate proxy provider needed
## Step 1: Build a Basic Proxy Pool

Start with a proxy pool that tracks availability and performance:

```python
import requests
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ProxyStats:
    url: str
    protocol: str  # http, https, socks5
    success_count: int = 0
    fail_count: int = 0
    avg_response_time: float = 0.0
    last_used: float = 0.0
    last_check: float = 0.0
    is_alive: bool = True


class ProxyPool:
    def __init__(self):
        self.proxies: dict[str, ProxyStats] = {}  # url -> stats

    def add(self, proxy_url: str):
        """Add a proxy to the pool."""
        parsed = urlparse(proxy_url)
        protocol = parsed.scheme or "http"
        if protocol not in ("http", "https", "socks5"):
            protocol = "http"
        self.proxies[proxy_url] = ProxyStats(url=proxy_url, protocol=protocol)

    def add_from_list(self, proxy_list: list[str]):
        """Add multiple proxies from a list."""
        for url in proxy_list:
            self.add(url)
        print(f"Added {len(proxy_list)} proxies. Pool size: {len(self.proxies)}")

    @property
    def alive_proxies(self) -> list[ProxyStats]:
        """Get all proxies currently marked as alive."""
        return [p for p in self.proxies.values() if p.is_alive]

    @property
    def size(self) -> int:
        return len(self.proxies)

    @property
    def alive_count(self) -> int:
        return len(self.alive_proxies)


# Usage
pool = ProxyPool()
pool.add_from_list([
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
    "socks5://9.10.11.12:1080",
    "http://user:pass@13.14.15.16:8080",
])
```
## Step 2: Implement Rotation Strategies

Different rotation strategies work better for different scraping scenarios:

```python
import random
import time


class RotatingProxyPool(ProxyPool):
    def __init__(self, strategy: str = "round_robin"):
        super().__init__()
        self.strategy = strategy
        self._index = 0
        # domain -> last_used_time (hook for a domain-aware strategy)
        self._domain_cooldown: dict[str, float] = {}

    def get_proxy(self, domain: str = "") -> Optional[str]:
        """Get the next proxy based on the rotation strategy."""
        alive = self.alive_proxies
        if not alive:
            raise RuntimeError("No alive proxies in pool")
        if self.strategy == "round_robin":
            proxy = alive[self._index % len(alive)]
            self._index += 1
        elif self.strategy == "random":
            proxy = random.choice(alive)
        elif self.strategy == "least_used":
            proxy = min(alive, key=lambda p: p.last_used)
        elif self.strategy == "fastest":
            alive_with_data = [p for p in alive if p.avg_response_time > 0]
            if alive_with_data:
                proxy = min(alive_with_data, key=lambda p: p.avg_response_time)
            else:
                proxy = random.choice(alive)
        else:
            # Unknown strategy: fall back to round-robin
            proxy = alive[self._index % len(alive)]
            self._index += 1
        proxy.last_used = time.time()
        return proxy.url

    def get_requests_proxies(self, proxy_url: str) -> dict:
        """Format proxy URL for the requests library."""
        parsed = urlparse(proxy_url)
        protocol = parsed.scheme or "http"
        auth = f"{parsed.username}:{parsed.password}@" if parsed.username else ""
        port = f":{parsed.port}" if parsed.port else ""
        formatted = f"{protocol}://{auth}{parsed.hostname}{port}"
        return {"http": formatted, "https": formatted}
```
### Choosing a Strategy
| Strategy | Best For | Trade-off |
|---|---|---|
| round_robin | General scraping, easy to implement | Predictable pattern, detectable |
| random | Avoiding IP sequence detection | May reuse recently blocked IPs |
| least_used | Targeting a single domain | Slower proxy reuse |
| fastest | High-throughput scraping | Overloads fast proxies |
| domain_aware | Scraping multiple domains | More complex implementation |
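The `domain_aware` row in the table isn't implemented above. As a rough standalone sketch of the idea (the class name, the 30-second default, and the LRU fallback are choices of this sketch, not a fixed recipe), track when each proxy last hit each domain and avoid reusing an IP against the same domain within a cooldown window:

```python
import time
import random


class DomainAwareRotator:
    """Sketch of a domain-aware strategy: avoid reusing the same IP
    against the same domain within a cooldown window."""

    def __init__(self, proxies: list[str], cooldown: float = 30.0):
        self.proxies = proxies
        self.cooldown = cooldown
        # (domain, proxy) -> last time that proxy hit that domain
        self._last_hit: dict[tuple[str, str], float] = {}

    def get_proxy(self, domain: str) -> str:
        now = time.time()
        # Prefer proxies whose cooldown for this domain has expired
        ready = [
            p for p in self.proxies
            if now - self._last_hit.get((domain, p), 0.0) >= self.cooldown
        ]
        if ready:
            proxy = random.choice(ready)
        else:
            # All proxies are cooling down: fall back to the least recently used
            proxy = min(self.proxies, key=lambda p: self._last_hit.get((domain, p), 0.0))
        self._last_hit[(domain, proxy)] = now
        return proxy
```

Because cooldowns are tracked per domain, the same proxy can serve two different domains back to back without conflict.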
## Step 3: Add Health Checking

Dead proxies waste time and cause failures. Health checking removes bad proxies proactively:

```python
import concurrent.futures


class SmartProxyPool(RotatingProxyPool):
    CHECK_URL = "https://httpbin.org/ip"
    CHECK_TIMEOUT = 10
    MIN_SUCCESS_RATE = 0.5

    def check_proxy(self, proxy_url: str) -> tuple[str, bool, float]:
        """Check if a single proxy is alive and measure response time."""
        try:
            start = time.time()
            resp = requests.get(
                self.CHECK_URL,
                proxies=self.get_requests_proxies(proxy_url),
                timeout=self.CHECK_TIMEOUT
            )
            elapsed = time.time() - start
            is_alive = resp.status_code == 200
            return proxy_url, is_alive, elapsed
        except Exception:
            return proxy_url, False, self.CHECK_TIMEOUT

    def health_check_all(self, max_workers: int = 20):
        """Check all proxies concurrently."""
        print(f"Health checking {len(self.proxies)} proxies...")
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(self.check_proxy, url): url
                for url in self.proxies
            }
            for future in concurrent.futures.as_completed(futures):
                url, is_alive, response_time = future.result()
                stats = self.proxies[url]
                stats.is_alive = is_alive
                stats.last_check = time.time()
                if is_alive:
                    stats.success_count += 1
                    # Exponential moving average for response time
                    if stats.avg_response_time == 0:
                        stats.avg_response_time = response_time
                    else:
                        stats.avg_response_time = (
                            0.7 * stats.avg_response_time + 0.3 * response_time
                        )
                else:
                    stats.fail_count += 1
        alive = sum(1 for p in self.proxies.values() if p.is_alive)
        print(f"Health check complete: {alive}/{len(self.proxies)} proxies alive")

    def record_result(self, proxy_url: str, success: bool, response_time: float = 0):
        """Record the result of a scraping request."""
        if proxy_url not in self.proxies:
            return
        stats = self.proxies[proxy_url]
        if success:
            stats.success_count += 1
            if response_time > 0:
                stats.avg_response_time = (
                    0.7 * stats.avg_response_time + 0.3 * response_time
                ) if stats.avg_response_time else response_time
        else:
            stats.fail_count += 1
            # Auto-disable proxy if success rate drops too low
            total = stats.success_count + stats.fail_count
            if total >= 5 and stats.success_count / total < self.MIN_SUCCESS_RATE:
                stats.is_alive = False
                print(f"Disabled proxy {proxy_url}: success rate too low")
```
## Step 4: Build the Rotating Request Handler

Wrap everything into a reusable request handler that handles rotation, retries, and failover:

```python
class RotatingScraper:
    def __init__(self, pool: SmartProxyPool, max_retries: int = 3, retry_delay: float = 2):
        self.pool = pool
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    def get(self, url: str, **kwargs) -> requests.Response:
        """Make a GET request with rotating proxy and retries."""
        domain = urlparse(url).netloc
        for attempt in range(self.max_retries):
            proxy_url = self.pool.get_proxy(domain)
            proxies = self.pool.get_requests_proxies(proxy_url)
            try:
                start = time.time()
                response = requests.get(url, proxies=proxies, timeout=15, **kwargs)
                elapsed = time.time() - start
                if response.status_code == 200:
                    self.pool.record_result(proxy_url, True, elapsed)
                    return response
                elif response.status_code in (403, 429):
                    # Blocked or rate limited — try a different proxy
                    self.pool.record_result(proxy_url, False)
                    print(f"  [{attempt+1}] Blocked ({response.status_code}), rotating proxy...")
                    time.sleep(self.retry_delay * (attempt + 1))
                    continue
                else:
                    # The proxy itself worked; let the caller handle other status codes
                    self.pool.record_result(proxy_url, True, elapsed)
                    return response
            except requests.exceptions.ProxyError:
                self.pool.record_result(proxy_url, False)
                print(f"  [{attempt+1}] Proxy error, rotating...")
                time.sleep(self.retry_delay)
                continue
            except requests.exceptions.Timeout:
                self.pool.record_result(proxy_url, False, 15)
                print(f"  [{attempt+1}] Timeout, rotating...")
                continue
            except Exception as e:
                self.pool.record_result(proxy_url, False)
                print(f"  [{attempt+1}] Error: {e}")
                time.sleep(self.retry_delay)
                continue
        raise RuntimeError(f"Failed after {self.max_retries} retries: {url}")


# Usage
pool = SmartProxyPool(strategy="fastest")
pool.add_from_list([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])
pool.health_check_all()

scraper = RotatingScraper(pool)
response = scraper.get("https://httpbin.org/ip")
print(f"Response: {response.text}")
```
## Step 5: Fetch Free Proxy Lists Programmatically

For testing purposes, you can fetch proxies from free lists:

```python
def fetch_free_proxies() -> list[str]:
    """Fetch a list of free proxies from ProxyScrape."""
    try:
        url = "https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=5000&country=all&ssl=all&anonymity=all"
        response = requests.get(url, timeout=10)
        proxies = [
            f"http://{line.strip()}"
            for line in response.text.strip().split('\n')
            if line.strip()
        ]
        return proxies
    except Exception as e:
        print(f"Failed to fetch proxies: {e}")
        return []


# Fetch and test free proxies
free_proxies = fetch_free_proxies()
print(f"Fetched {len(free_proxies)} free proxies")

pool = SmartProxyPool(strategy="fastest")
pool.add_from_list(free_proxies[:50])  # Test first 50
pool.health_check_all(max_workers=50)
print(f"Alive: {pool.alive_count}/{pool.size}")
```
Free proxies have ~20–30% reliability. They work for testing but not for production scraping.
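Free lists also contain junk: malformed lines, out-of-range ports, and duplicates. A small sanitizing pass before `add_from_list` saves health-check time. This helper is a sketch (the name `clean_proxy_list` is ours, and it assumes plain `ip:port` lines like ProxyScrape returns):

```python
import re

# host:port, where host looks like an IPv4 address
_PROXY_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")


def clean_proxy_list(lines: list[str]) -> list[str]:
    """Drop malformed entries and duplicates from a raw free-proxy list."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        m = _PROXY_RE.match(line)
        if not m:
            continue  # not ip:port at all
        host, port = m.group(1), int(m.group(2))
        if not (1 <= port <= 65535):
            continue  # impossible port
        if any(int(octet) > 255 for octet in host.split(".")):
            continue  # impossible IPv4 octet
        url = f"http://{line}"
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

Feed the cleaned list into the pool instead of the raw response lines, so the health checker only spends time on plausible candidates.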
## Step 6: The Simpler Alternative — SearchHive

Managing proxy rotation infrastructure is time-consuming. SearchHive handles it all for you — residential proxy rotation, health checking, automatic failover, and retry logic — built into every API call:

```python
from searchhive import ScrapeForge

client = ScrapeForge()

# Single request — automatic proxy rotation, no configuration needed
result = client.scrape(
    url="https://example.com/products",
    render_js=True,
    selectors={
        "title": "h1",
        "price": ".price-tag",
        "description": ".product-desc"
    }
)
print(result.data)
```
No proxy pool to manage. No health checking. No retry logic to write. Each request automatically routes through a different residential IP.
### Batch Scraping with Built-in Rotation

```python
urls = [f"https://store.example.com/page-{i}" for i in range(1, 101)]

results = client.scrape_batch(
    urls,
    render_js=True,
    selectors={"product_name": "h2", "price": ".price"},
    concurrency=10  # 10 parallel requests with auto rate limiting
)

successes = [r for r in results if r.success]
print(f"Scraped {len(successes)} of {len(urls)} pages")
```
10 concurrent requests, each through a different proxy, with automatic retry on failure. The equivalent custom proxy rotator would need hundreds of lines of code and a paid proxy provider.
### Cost Comparison
| Approach | Setup Time | Monthly Cost (100K requests) | Reliability |
|---|---|---|---|
| Custom + free proxies | 2-4 hours | $0 | 20-30% success rate |
| Custom + paid proxies | 1-2 hours | $50-150 (proxy cost) | 85-95% success rate |
| SearchHive Free | 5 minutes | $0 | 95%+ success rate |
| SearchHive Pro | 5 minutes | $29 | 99%+ success rate |
SearchHive's free tier alone handles what would cost $50-150/month in proxy provider fees — with better reliability.
## Complete Code Example

Here's the full proxy rotator system:

```python
import requests
import time
import random
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor


@dataclass
class Proxy:
    url: str
    success: int = 0
    fails: int = 0
    avg_time: float = 0.0
    alive: bool = True


class ProxyRotator:
    def __init__(self, strategy="round_robin", retries=3):
        self.proxies: dict[str, Proxy] = {}
        self.strategy = strategy
        self.retries = retries
        self._idx = 0

    def load(self, urls: list[str]):
        for u in urls:
            self.proxies[u] = Proxy(url=u)
        print(f"Loaded {len(self.proxies)} proxies")

    def check_health(self, workers=20):
        def _check(url):
            try:
                t = time.time()
                r = requests.get("https://httpbin.org/ip",
                                 proxies={"http": url, "https": url}, timeout=10)
                return url, r.ok, time.time() - t
            except Exception:
                return url, False, 10

        with ThreadPoolExecutor(workers) as ex:
            for url, ok, t in ex.map(_check, self.proxies):
                p = self.proxies[url]
                p.alive = ok
                if ok:
                    p.success += 1
                    p.avg_time = 0.7 * p.avg_time + 0.3 * t if p.avg_time else t
                else:
                    p.fails += 1
        alive = sum(1 for p in self.proxies.values() if p.alive)
        print(f"Health check: {alive}/{len(self.proxies)} alive")

    def next_proxy(self) -> str:
        alive = [p for p in self.proxies.values() if p.alive]
        if not alive:
            raise RuntimeError("No proxies available")
        if self.strategy == "random":
            p = random.choice(alive)
        elif self.strategy == "fastest":
            timed = [x for x in alive if x.avg_time > 0]
            p = min(timed, key=lambda x: x.avg_time) if timed else random.choice(alive)
        else:  # round_robin
            p = alive[self._idx % len(alive)]
            self._idx += 1
        return p.url

    def fetch(self, url: str, **kw) -> requests.Response:
        for attempt in range(self.retries):
            proxy = self.next_proxy()
            try:
                t = time.time()
                r = requests.get(url, proxies={"http": proxy, "https": proxy},
                                 timeout=15, **kw)
                p = self.proxies[proxy]
                if r.ok:
                    p.success += 1
                    p.avg_time = 0.7 * p.avg_time + 0.3 * (time.time() - t)
                    return r
                p.fails += 1
                if r.status_code in (403, 429):
                    time.sleep(2 ** attempt)  # exponential backoff on blocks
            except Exception:
                self.proxies[proxy].fails += 1
        raise RuntimeError(f"Failed: {url}")


if __name__ == "__main__":
    rotator = ProxyRotator(strategy="fastest")
    # Add your proxies here
    rotator.load(["http://proxy1:8080", "http://proxy2:8080"])
    rotator.check_health()
    resp = rotator.fetch("https://httpbin.org/ip")
    print(resp.json())
```
## Common Issues
### All proxies fail health check
Free proxy lists expire quickly. Refresh every few hours. For production, use a paid provider or switch to SearchHive.
### Getting 403 despite rotating proxies

The target site may use browser fingerprinting. Consider adding randomized User-Agent and Accept-Language headers to each request.
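A minimal sketch of that header rotation, to pair with the rotating proxy (the User-Agent strings below are illustrative examples and go stale, so keep whatever list you use current):

```python
import random

# Example desktop User-Agent strings (illustrative only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "en-US,en;q=0.5"]


def random_headers() -> dict[str, str]:
    """Build a randomized header set for each rotated request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

# Pass alongside the rotated proxy, e.g.:
# scraper.get(url, headers=random_headers())
```

This way each request varies both its IP and its header fingerprint, rather than sending one static signature through many IPs.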
### SOCKS5 proxies slower than HTTP
SOCKS5 proxies add overhead for DNS resolution through the proxy. Prefer HTTP/HTTPS proxies unless you need UDP support.
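If you do use SOCKS5 with `requests` (which needs `pip install requests[socks]`), note that the URL scheme controls where DNS resolution happens: `socks5://` resolves hostnames locally before tunneling, while `socks5h://` pushes resolution to the proxy. A small helper illustrating the difference (the function name is this sketch's, not part of `requests`):

```python
def socks_proxies(host: str, port: int, remote_dns: bool = True) -> dict[str, str]:
    """Build a requests-style proxies dict for a SOCKS5 proxy.

    socks5://  - hostname resolved locally, then tunneled (leaks DNS lookups)
    socks5h:// - hostname resolved by the proxy itself
    """
    scheme = "socks5h" if remote_dns else "socks5"
    url = f"{scheme}://{host}:{port}"
    return {"http": url, "https": url}

# Usage (requires requests[socks] installed):
# requests.get("https://httpbin.org/ip", proxies=socks_proxies("9.10.11.12", 1080))
```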
### Proxy provider costs adding up
Calculate your cost per successful request. If you're paying $10/GB and each request is 50KB, that's $0.0005/request. SearchHive Pro at $29/mo for 100K requests is $0.00029/request — often cheaper than raw proxy costs.
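That arithmetic assumes every request succeeds. With pay-per-GB proxies you pay for failed requests too, so divide by your success rate. A quick calculator (a sketch; treats 1 GB as 10^6 KB for round numbers):

```python
def cost_per_success(price_per_gb: float, kb_per_request: float,
                     success_rate: float) -> float:
    """Effective cost per *successful* request on a pay-per-GB proxy.

    Failed requests still consume bandwidth, so the per-request cost
    is divided by the fraction of requests that succeed.
    """
    cost_per_request = price_per_gb * kb_per_request / 1_000_000  # 1 GB ~ 1e6 KB
    return cost_per_request / success_rate


# $10/GB, 50 KB per request, 90% success rate
print(f"${cost_per_success(10, 50, 0.9):.6f}")  # → $0.000556
```

At a 90% success rate the $0.0005 figure climbs past $0.00055 per successful request, and it degrades further as the success rate drops.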
## Next Steps
- If you're building a production scraper, consider whether managing your own proxy rotator is worth it — SearchHive eliminates the infrastructure entirely
- Check /blog/how-to-extract-contact-info-from-websites-with-python for a scraping tutorial that uses SearchHive's built-in rotation
- See /compare/brightdata for a detailed cost comparison between SearchHive and enterprise proxy providers
Stop managing proxy infrastructure. Start with SearchHive's free tier — 50,000 requests/month with automatic proxy rotation, JS rendering, and a clean Python SDK. Read the docs.