How to Solve CAPTCHAs for Web Scraping: Step-by-Step
Web scraping hits a wall the moment a site throws a CAPTCHA. Whether you're building a price monitor, a lead generation pipeline, or a research tool, CAPTCHAs are the most common barrier between your scraper and the data you need.
This guide covers the main approaches to solving CAPTCHAs while scraping -- from manual services to fully automated API-based solutions -- with working code examples using SearchHive's built-in CAPTCHA handling and standalone solving services.
Key Takeaways
- CAPTCHAs come in 6 major types: reCAPTCHA v2, reCAPTCHA v3, hCaptcha, Cloudflare Turnstile, Cloudflare Challenge, and text/audio CAPTCHAs
- Third-party solving services (2Captcha, CapSolver, Anti-Captcha) cost $0.50-$3.00 per 1000 solves
- SearchHive handles CAPTCHAs automatically as part of its scraping API -- no separate service needed
- The most reliable approach combines proxy rotation, request fingerprinting, and automatic CAPTCHA solving
Prerequisites
Before we start, you'll need:
- Python 3.8+ installed
- A SearchHive account (free tier includes CAPTCHA solving on 500 requests/month)
- Basic familiarity with HTTP requests and web scraping concepts
- For standalone services: accounts with 2Captcha, CapSolver, or Anti-Captcha
Install the required packages:
pip install requests searchhive-sdk
If you're using standalone solving services:
pip install 2captcha-python anticaptchaofficial capsolver
Step 1: Identify the CAPTCHA Type
Before solving anything, you need to know what you're dealing with. Here's a quick reference:
| CAPTCHA Type | Visual | Common Sites | Difficulty |
|---|---|---|---|
| reCAPTCHA v2 | Checkbox + image grids | Google services, broad adoption | Medium |
| reCAPTCHA v3 | Invisible (no user interaction) | Google services, enterprise | Hard |
| hCaptcha | Image selection (boats, crosswalks) | Cloudflare sites, Discord | Medium |
| Cloudflare Turnstile | Invisible checkbox | Modern Cloudflare-protected sites | Medium |
| Cloudflare Challenge | Full-page interstitial with JS challenge | E-commerce, SaaS dashboards | Very Hard |
| Text/Audio | Distorted text or audio clips | Legacy sites, government forms | Low |
Detecting the CAPTCHA type programmatically:
import re
import requests

def detect_captcha_type(html):
    # Check for reCAPTCHA v3 first: it loads api.js (or enterprise.js) with
    # ?render=<site key> and shares the script host with v2
    v3 = re.search(r'recaptcha/(?:api|enterprise)\.js\?[^"\']*\brender=([^"\'&]+)', html)
    if v3 and v3.group(1) not in ('explicit', 'onload'):
        return 'recaptcha_v3'
    # Check for reCAPTCHA v2
    if 'google.com/recaptcha/api.js' in html or 'g-recaptcha' in html:
        return 'recaptcha_v2'
    # Check for hCaptcha
    if 'hcaptcha.com' in html or 'h-captcha' in html:
        return 'hcaptcha'
    # Check for Cloudflare Turnstile
    if 'challenges.cloudflare.com/turnstile' in html or 'cf-turnstile' in html:
        return 'turnstile'
    # Check for Cloudflare Challenge
    if 'challenge-platform' in html or 'cf-browser-verification' in html:
        return 'cloudflare_challenge'
    return 'unknown'

url = 'https://example.com/protected-page'
resp = requests.get(url, timeout=10)
captcha_type = detect_captcha_type(resp.text)
print(f'Detected CAPTCHA type: {captcha_type}')
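The standalone solvers in the later steps need the page's site key as well as its CAPTCHA type. The sketch below is one way to pull it out of the HTML. It assumes the widget is embedded with the usual `data-sitekey` attribute or, for reCAPTCHA v3, with a `render=` query parameter on the script URL. `extract_sitekey` is a hypothetical helper name, and pages that inject the widget from JavaScript will need a rendered DOM instead.

```python
import re

def extract_sitekey(html):
    """Return the first site key found in the HTML, or None."""
    # reCAPTCHA v2, hCaptcha and Turnstile widgets normally carry data-sitekey
    match = re.search(r'data-sitekey\s*=\s*["\']([^"\']+)["\']', html)
    if match:
        return match.group(1)
    # reCAPTCHA v3 passes the key as ?render=<site key> on the script URL
    match = re.search(
        r'recaptcha/(?:api|enterprise)\.js\?[^"\']*\brender=([^"\'&]+)', html
    )
    if match and match.group(1) not in ('explicit', 'onload'):
        return match.group(1)
    return None

sample = '<div class="g-recaptcha" data-sitekey="SITE_KEY_HERE"></div>'
print(extract_sitekey(sample))
```

If the function returns `None` on a page you know is protected, the key is probably set at runtime and you would have to read it from the rendered page.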
Step 2: Use SearchHive's Built-in CAPTCHA Solving
The simplest approach is to let SearchHive handle everything. SearchHive's ScrapeForge API includes automatic CAPTCHA detection and solving -- no extra configuration required.
from searchhive import ScrapeForge

client = ScrapeForge(api_key='your-api-key')

# Single page scrape with automatic CAPTCHA handling
result = client.scrape(
    url='https://example.com/protected-page',
    render_js=True,       # Enable for SPAs and JS-rendered CAPTCHAs
    anti_bot=True,        # Enable anti-bot fingerprinting
    solve_captchas=True   # Automatically solve detected CAPTCHAs
)

if result.success:
    print(f'Title: {result.data.get("title")}')
    print(f'Content length: {len(result.html)} chars')
else:
    print(f'Error: {result.error}')
SearchHive supports all major CAPTCHA types out of the box:
- reCAPTCHA v2 and v3 -- solved automatically
- hCaptcha -- solved automatically
- Cloudflare Turnstile -- bypassed via browser fingerprinting
- Cloudflare Challenge -- handled with residential proxies + headless browser
Step 3: Integrate 2Captcha for Standalone Solving
If you need more control over the solving process, 2Captcha is the most popular standalone service at around $2.99 per 1000 reCAPTCHA v2 solves.
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

# Solve reCAPTCHA v2
result = solver.recaptcha(
    sitekey='6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-',
    url='https://example.com/page-with-recaptcha'
)
print(f'Token: {result["code"]}')

# Solve hCaptcha
result = solver.hcaptcha(
    sitekey='SITE_KEY_HERE',
    url='https://example.com/page-with-hcaptcha'
)
print(f'Token: {result["code"]}')

# Solve an image CAPTCHA with distorted text
# (normal() takes an image file path or a base64-encoded image)
result = solver.normal('path/to/captcha.jpg')
print(f'Solution: {result["code"]}')
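A solved token only helps once it is sent back to the site in the field the widget would have filled in. The sketch below shows one way to do that. The field names are the ones the reCAPTCHA, hCaptcha, and Turnstile widgets normally use; the `build_payload` and `submit_form` helpers, the form fields, and the action URL are hypothetical placeholders for your target page. `session` is assumed to be a `requests.Session`.

```python
# Form field each widget normally populates in the browser
TOKEN_FIELDS = {
    "recaptcha_v2": "g-recaptcha-response",
    "hcaptcha": "h-captcha-response",
    "turnstile": "cf-turnstile-response",
}

def build_payload(form_fields, token, captcha_type):
    """Merge the page's own form fields with the solved token."""
    payload = dict(form_fields)  # copy so the caller's dict is not mutated
    payload[TOKEN_FIELDS[captcha_type]] = token
    return payload

def submit_form(session, action_url, form_fields, token, captcha_type):
    """POST the form through an existing session (assumed requests.Session)."""
    payload = build_payload(form_fields, token, captcha_type)
    return session.post(action_url, data=payload, timeout=15)

payload = build_payload({"q": "laptops"}, "TOKEN_FROM_SOLVER", "recaptcha_v2")
print(payload)
```

Sending the request through the same session that loaded the page keeps its cookies attached, which some sites check alongside the token.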
Step 4: Integrate CapSolver for Faster Solving
CapSolver offers AI-powered solving with faster response times. Pricing starts at $0.80 per 1000 reCAPTCHA v2 solves -- significantly cheaper than 2Captcha.
import capsolver

capsolver.api_key = 'YOUR_CAPSOLVER_API_KEY'

# Solve reCAPTCHA v2
solution = capsolver.solve({
    "type": "ReCaptchaV2TaskProxyLess",
    "websiteURL": "https://example.com",
    "websiteKey": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
})
print(f'g-recaptcha-response: {solution["gRecaptchaResponse"]}')

# Solve Cloudflare Turnstile
solution = capsolver.solve({
    "type": "AntiTurnstileTaskProxyLess",
    "websiteURL": "https://example.com",
    "websiteKey": "SITE_KEY_HERE"
})
print(f'Turnstile token: {solution["token"]}')
Step 5: Build a Retry Pipeline with CAPTCHA Fallback
Production scrapers need robust error handling. Here's a pipeline that tries the request first, detects CAPTCHAs, solves them, and retries:
import time
from searchhive import ScrapeForge

def robust_scrape(url, max_retries=3):
    client = ScrapeForge(api_key='your-api-key')
    for attempt in range(max_retries):
        result = client.scrape(
            url=url,
            render_js=True,
            anti_bot=True,
            solve_captchas=True
        )
        if result.success:
            return result
        # A 403 or 503 usually means a CAPTCHA or anti-bot block
        if result.status_code in (403, 503):
            print(f'Attempt {attempt + 1}: CAPTCHA detected, retrying...')
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
        else:
            # Any other error is not worth retrying
            print(f'Attempt {attempt + 1}: Error {result.status_code}')
            break
    return None

# Scrape multiple pages with rate limiting
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
]

for url in urls:
    result = robust_scrape(url)
    if result and result.success:
        print(f'Scraped: {result.data.get("title")}')
    time.sleep(1)  # Be polite between requests
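Plain exponential backoff grows without limit and makes parallel workers retry in lockstep. A common refinement, sketched here as an assumption rather than anything SearchHive prescribes, is to cap the delay and add a little random jitter. `backoff_delays` and its parameters are hypothetical names.

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=0.5, rng=random):
    """Yield one delay per retry: base * 2**attempt, capped, plus jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        # Jitter spreads out retries from workers that failed at the same time
        yield delay + rng.uniform(0, jitter)

for attempt, delay in enumerate(backoff_delays(4), start=1):
    print(f"retry {attempt}: wait {delay:.2f}s")
```

In the pipeline above you could iterate over these delays instead of computing `2 ** attempt` inline.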
Step 6: Handle Cloudflare Challenge Pages
Cloudflare Challenge pages are the hardest CAPTCHAs to bypass because they combine JS challenges, browser fingerprinting, and rate limiting. SearchHive uses residential proxies and headless browsers to handle these:
from searchhive import ScrapeForge

client = ScrapeForge(api_key='your-api-key')

# Cloudflare-protected pages need residential proxies
result = client.scrape(
    url='https://cloudflare-protected-site.com/data',
    render_js=True,
    anti_bot=True,
    solve_captchas=True,
    proxy_type='residential',  # Use residential proxies
    country='us'               # Geo-target if needed
)

if result.success:
    print('Cloudflare challenge bypassed successfully')
    print(result.html[:500])
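If you also fetch pages with plain HTTP requests, it helps to recognize a challenge response before deciding to route a URL through the residential-proxy path. The heuristic below is a sketch under two assumptions: that Cloudflare marks challenge responses with a `cf-mitigated: challenge` header (which its documentation describes, to my knowledge), and that the body markers from Step 1 still appear on interstitial pages. `looks_like_cloudflare_challenge` is a hypothetical helper, not part of any SDK.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic check for a Cloudflare challenge response."""
    # Header names are case-insensitive, so normalize before looking them up
    normalized = {key.lower(): value for key, value in headers.items()}
    if normalized.get("cf-mitigated", "").lower() == "challenge":
        return True
    served_by_cloudflare = "cloudflare" in normalized.get("server", "").lower()
    if status_code in (403, 503) and served_by_cloudflare:
        markers = ("challenge-platform", "cf-browser-verification")
        return any(marker in body for marker in markers)
    return False

print(looks_like_cloudflare_challenge(403, {"CF-Mitigated": "challenge"}, ""))
```

With `requests`, you would call it as `looks_like_cloudflare_challenge(resp.status_code, resp.headers, resp.text)`.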
Step 7: Cost Optimization
CAPTCHA solving costs add up fast. Here's how to keep them under control:
| Approach | Cost per 1000 solves | Best For |
|---|---|---|
| SearchHive (included) | $0 (bundled with API calls) | All-purpose scraping |
| 2Captcha | $2.99 (reCAPTCHA v2) | Budget scraping, text CAPTCHAs |
| CapSolver | $0.80 (reCAPTCHA v2) | High-volume, cost-sensitive |
| Anti-Captcha | $1.80 (reCAPTCHA v2) | Balanced cost/reliability |
| CapMonster Cloud | $0.70 (reCAPTCHA v2) | Maximum throughput |
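To see what the per-1000 prices mean for a monthly bill, a rough estimate is enough. The sketch below copies the reCAPTCHA v2 prices from the table. The request volume and the share of requests that hit a CAPTCHA are made-up inputs, so substitute your own numbers and check current pricing.

```python
# Per-1000 prices copied from the table above (reCAPTCHA v2)
PRICE_PER_1000 = {
    "2Captcha": 2.99,
    "CapSolver": 0.80,
    "Anti-Captcha": 1.80,
    "CapMonster Cloud": 0.70,
}

def monthly_cost(requests_per_month, captcha_rate, price_per_1000):
    """Estimated monthly spend when only a fraction of requests hit a CAPTCHA."""
    solves = requests_per_month * captcha_rate
    return round(solves / 1000 * price_per_1000, 2)

# Hypothetical workload: 100,000 requests/month, 20% of them challenged
for name, price in PRICE_PER_1000.items():
    print(f"{name}: ${monthly_cost(100_000, 0.2, price):.2f}")
```

Lowering the CAPTCHA rate has the same effect on the bill as a cheaper provider, which is why the strategies that reduce triggers matter as much as the per-solve price.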
Cost-saving strategies:
- Use SearchHive first -- CAPTCHA solving is included in every request, so you don't pay per-solve fees
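- Don't waste tokens -- reCAPTCHA tokens expire after 2 minutes and can be verified only once, so request a token right before you submit it; what you can reuse is the session the site grants after a successful solve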
- Rotate proxies -- changing your IP reduces the frequency of CAPTCHA triggers
- Respect rate limits -- space requests 1-3 seconds apart to avoid triggering anti-bot systems
- Use session cookies -- maintain login sessions to avoid repeated CAPTCHA challenges
Step 8: Legal and Ethical Considerations
Before implementing CAPTCHA solving in production:
- Check the site's Terms of Service -- many sites explicitly prohibit automated access
- Respect robots.txt -- check robots.txt before scraping
- Rate limit your requests -- aggressive scraping degrades service for other users
- Don't bypass security for malicious purposes -- CAPTCHA solving for legitimate data collection is generally acceptable; bypassing security for credential stuffing or DDoS is not
- GDPR and data privacy -- ensure you're not scraping personal data without a legal basis
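The robots.txt check in the list above can be automated with Python's standard library. This is a minimal sketch that parses a sample file inline so it runs offline. The rules and the `my-scraper` user agent are illustrative. For a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS)
print(parser.can_fetch("my-scraper", "https://example.com/products/1"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
print(parser.crawl_delay("my-scraper"))
```

Skipping URLs where `can_fetch` is false and honoring any `Crawl-delay` covers the robots.txt and rate-limit points in one place.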
Complete Code Example
Here's a production-ready scraper that handles CAPTCHAs automatically:
import json
import time
from searchhive import ScrapeForge

def scrape_with_retry(urls, api_key, delay=1.5, max_retries=2):
    client = ScrapeForge(api_key=api_key)
    results = []
    for i, url in enumerate(urls):
        print(f'[{i+1}/{len(urls)}] Scraping {url}...')
        for attempt in range(max_retries + 1):
            result = client.scrape(
                url=url,
                render_js=True,
                anti_bot=True,
                solve_captchas=True
            )
            if result.success:
                results.append({
                    'url': url,
                    'title': result.data.get('title', ''),
                    'status': 'success'
                })
                print(f'  Success: {result.data.get("title", "")[:60]}')
                break
            if result.status_code in (403, 503):
                # Likely a CAPTCHA or block: back off exponentially, then retry
                wait = delay * (2 ** attempt)
                print(f'  CAPTCHA/block detected, waiting {wait}s...')
                time.sleep(wait)
            else:
                results.append({
                    'url': url,
                    'status': 'failed',
                    'error': str(result.error)
                })
                print(f'  Failed: {result.error}')
                break
        else:
            # Every attempt was blocked; record it so no URL goes missing
            results.append({
                'url': url,
                'status': 'failed',
                'error': 'blocked after retries'
            })
        time.sleep(delay)  # Pause between URLs
    return results

if __name__ == '__main__':
    API_KEY = 'your-searchhive-api-key'
    urls = [
        'https://httpbin.org/html',
        'https://example.com',
    ]
    results = scrape_with_retry(urls, API_KEY)
    print(json.dumps(results, indent=2))
Common Issues
CAPTCHA solving takes too long. Most solving services return results in 10-30 seconds. If you need faster turnaround, use SearchHive's built-in solving, which averages 3-5 seconds because solving happens inside the scrape request.
High failure rate on Cloudflare sites. Datacenter proxies get flagged quickly. Use residential proxies (available on SearchHive Builder and Unicorn plans) for Cloudflare-protected sites.
CAPTCHAs keep appearing even after solving. This means your request pattern is being detected. Add random delays (2-5 seconds), rotate user agents, and use session management to maintain cookies.
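The random delays and session handling suggested above can be sketched as follows. The User-Agent strings are examples only, and `pick_user_agent` and `polite_get` are hypothetical helpers. `session` is assumed to be a `requests.Session`, which keeps cookies between calls. The sketch picks one User-Agent per session rather than per request, on the assumption that a browser identity that changes mid-session looks more suspicious than a stable one.

```python
import random
import time

# Example User-Agent strings; replace with current, realistic values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def pick_user_agent(rng=random):
    """Choose one User-Agent to use for the lifetime of a session."""
    return rng.choice(USER_AGENTS)

def polite_get(session, url, min_delay=2.0, max_delay=5.0,
               rng=random, sleep=time.sleep):
    """Wait a random 2-5 seconds, then fetch through the shared session."""
    sleep(rng.uniform(min_delay, max_delay))
    return session.get(url, timeout=15)

# Usage with requests (not executed here):
#   session = requests.Session()
#   session.headers["User-Agent"] = pick_user_agent()
#   response = polite_get(session, "https://example.com/product/1")
```

Start a fresh session with a new User-Agent when you rotate to a new proxy, so the IP address, cookies, and browser identity change together.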
Next Steps
- Start with SearchHive's free tier -- 500 requests/month with CAPTCHA solving included
- Read the SearchHive API docs for advanced anti-bot configuration
- Check out /blog/searchhive-vs-2captcha-captcha-solving for a detailed service comparison
- For large-scale scraping, see /blog/how-to-build-a-web-scraper-that-scales
Ready to scrape without worrying about CAPTCHAs? Get started with SearchHive free -- no credit card required, CAPTCHA solving included on every plan.