Complete Guide to Python Web Scraping in 2025
Python web scraping is the process of automatically extracting data from websites using Python libraries and frameworks. Whether you're building price monitors, lead databases, or training datasets, Python remains the dominant language for web scraping thanks to its rich ecosystem of libraries, straightforward syntax, and massive community support.
This guide covers everything from basic HTML parsing to production-grade scraping pipelines that handle JavaScript rendering, rate limiting, and anti-bot detection.
Key Takeaways
- Beautiful Soup handles static HTML parsing -- use it for simple sites
- Playwright and Selenium handle JavaScript-rendered pages with full browser automation
- Scrapy is the go-to framework for large-scale crawling projects
- APIs like SearchHive's ScrapeForge eliminate proxy management and CAPTCHA handling entirely
- Always respect robots.txt, rate-limit requests, and handle errors gracefully
- Structured output (JSON, CSV) is easier to work with than raw HTML
1. Understanding the Basics of Web Scraping
Web scraping works by sending HTTP requests to a server, receiving HTML (or JSON) responses, and parsing the content to extract structured data. The fundamental workflow looks like this:
- Request -- Send an HTTP GET request to the target URL
- Response -- Receive the HTML document
- Parse -- Navigate the DOM to find the data you need
- Extract -- Pull text, attributes, or structured data
- Store -- Save to JSON, CSV, or a database
The simplest scraper
Every Python web scraper starts with requests and BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for item in soup.select(".product-card"):
    name = item.select_one(".product-name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    products.append({"name": name, "price": price})

print(f"Found {len(products)} products")
```
This works for static sites where all content is present in the initial HTML. Many modern websites, however, render content dynamically with JavaScript -- which brings us to the next section.
2. Handling JavaScript-Rendered Pages
Sites built with React, Vue, or Angular often return minimal HTML that gets populated by JavaScript after page load. requests won't see this content because it doesn't execute JavaScript.
Using Playwright for dynamic content
Playwright is the modern standard for browser automation in Python:
```python
from playwright.sync_api import sync_playwright

def scrape_dynamic(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Wait for specific elements to appear
        page.wait_for_selector(".product-card")

        items = page.query_selector_all(".product-card")
        results = []
        for item in items:
            name = item.query_selector(".product-name").inner_text()
            price = item.query_selector(".price").inner_text()
            results.append({"name": name, "price": price})

        browser.close()
        return results

data = scrape_dynamic("https://example.com/products")
```
Playwright handles JavaScript execution, network interception, screenshots, and even file downloads. The tradeoff is speed -- browser automation is significantly slower than raw HTTP requests.
Selenium as an alternative
Selenium has been around longer and has broad browser support, but Playwright is generally faster and has a cleaner API. Use Selenium if you need to support older browsers or integrate with existing Selenium Grid infrastructure.
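As a rough Selenium 4 equivalent of the Playwright snippet above, a sketch might look like this. The selector names are carried over from the earlier examples as assumptions, and running it requires the selenium package plus a local Chrome install:

```python
def scrape_dynamic_selenium(url):
    # Imported inside the function so the sketch reads without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait up to 10 seconds for the product cards to render
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
        )
        results = []
        for item in driver.find_elements(By.CSS_SELECTOR, ".product-card"):
            name = item.find_element(By.CSS_SELECTOR, ".product-name").text
            price = item.find_element(By.CSS_SELECTOR, ".price").text
            results.append({"name": name, "price": price})
        return results
    finally:
        driver.quit()
```

The explicit WebDriverWait plays the same role as Playwright's wait_for_selector: don't read the DOM until the JavaScript has populated it.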
3. Building Scalable Crawlers with Scrapy
When you need to scrape hundreds or thousands of pages, a framework like Scrapy provides the structure you need:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".product-name::text").get(),
                "price": product.css(".price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }

        # Follow pagination
        next_page = response.css(".next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Scrapy provides built-in features that are tedious to implement manually:
- Request scheduling with concurrency control
- Retry middleware for failed requests
- Duplicate filtering to avoid re-visiting URLs
- Item pipelines for data cleaning and storage
- Middleware for cookies, proxies, and user-agent rotation
Run the spider from the command line with `scrapy runspider spider.py -o products.json`.
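Those built-in features are controlled through Scrapy's settings. A sketch of the common knobs, with illustrative values rather than recommendations for any particular site:

```python
# settings.py -- illustrative starting points; tune per target site
CONCURRENT_REQUESTS = 8        # parallel requests across all domains
DOWNLOAD_DELAY = 1.0           # base delay between requests to the same domain
AUTOTHROTTLE_ENABLED = True    # adapt the delay to server response times
RETRY_ENABLED = True
RETRY_TIMES = 3                # retry failed requests up to 3 times
ROBOTSTXT_OBEY = True          # honor robots.txt automatically
```

These can also be set per-spider via a custom_settings dict on the spider class.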
4. Dealing with Anti-Bot Protections
Most production websites employ anti-scraping measures. Here's how to handle the common ones:
Rate limiting
Space out your requests to avoid triggering rate limits:
```python
import time
import random

for url in urls:
    response = requests.get(url)
    process(response)
    time.sleep(random.uniform(1.5, 4.0))
```
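If a server pushes back anyway with an HTTP 429, retrying with exponential backoff is a common complement to a fixed delay. A minimal sketch, with helper names of my own:

```python
import random
import time

import requests

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with jitter: base * 2^attempt, capped, plus 0-1s jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def get_with_retries(url, max_attempts=5):
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:  # not rate-limited -- return immediately
            return response
        time.sleep(backoff_delay(attempt))
    response.raise_for_status()  # still rate-limited after all attempts
```

The jitter matters: without it, many workers retrying in lockstep hit the server at the same instant again.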
Rotating user agents
Websites check the User-Agent header to identify bots:
```python
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}
response = requests.get(url, headers=headers)
```
Proxy rotation
For high-volume scraping, residential proxies are essential:
```python
proxies = {
    "http": "http://user:pass@proxy1.example.com:8080",
    "https": "http://user:pass@proxy1.example.com:8080",
}
response = requests.get(url, proxies=proxies)
```
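To rotate through a pool rather than pinning every request to one proxy, a simple round-robin sketch with itertools.cycle (the proxy URLs are placeholders):

```python
from itertools import cycle

import requests

# Placeholder pool -- replace with your own proxy endpoints
PROXY_POOL = cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def get_via_proxy(url):
    proxy = next(PROXY_POOL)  # advances round-robin through the pool
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
```

A production rotator would also track failures and evict dead proxies, but cycling is the core idea.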
Managing proxies, handling CAPTCHAs, and dealing with fingerprinting is where most scraping projects burn time and budget. This is exactly the problem SearchHive's ScrapeForge API was built to solve.
5. Using SearchHive's ScrapeForge API
Instead of managing proxies, browsers, and CAPTCHAs yourself, SearchHive's ScrapeForge handles all of that server-side:
```python
import requests

API_KEY = "your-searchhive-api-key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Scrape a single page -- ScrapeForge handles JS rendering, proxies, CAPTCHAs
response = requests.post(
    "https://api.searchhive.dev/v1/scrape",
    headers=headers,
    json={"url": "https://example.com/products"},
)

data = response.json()
for product in data.get("results", []):
    print(f"{product['name']}: {product['price']}")
```
ScrapeForge gives you clean, structured data without the operational overhead. Compared to running your own Playwright cluster with proxy rotation:
- No infrastructure -- no servers to manage, no browsers to patch
- Built-in proxy rotation -- residential proxies included
- CAPTCHA handling -- solved automatically
- JS rendering -- full browser rendering on every request
- Rate limiting -- managed for you, with configurable throughput
For large-scale extraction, SearchHive's DeepDive endpoint crawls entire sites, following links and extracting structured data across thousands of pages in a single API call.
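A DeepDive call might look like the following. The request shape here is an assumption modeled on the scrape endpoint above, so check SearchHive's docs for the actual field names and parameters:

```python
import requests

API_KEY = "your-searchhive-api-key"  # placeholder

def deepdive_crawl(start_url, max_pages=1000):
    # Hypothetical request shape -- verify against the official DeepDive docs
    response = requests.post(
        "https://api.searchhive.dev/v1/deepdive",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": start_url, "max_pages": max_pages},
    )
    response.raise_for_status()
    return response.json()
```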
6. Data Extraction Best Practices
Prefer selectors over regex
CSS selectors and XPath are more resilient to HTML changes than regex:
```python
import re

# Fragile -- breaks if class names or whitespace change
price = re.search(r"\$(\d+\.\d{2})", html)

# Robust -- adapts to structural changes
price = soup.select_one(".price-value").get_text(strip=True)
```
Handle pagination and infinite scroll
For paginated content, detect the "next page" link and follow it. For infinite scroll, you'll need browser automation to scroll and trigger AJAX loads.
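With Playwright, a common approach to infinite scroll is to keep scrolling until the page height stops growing. A sketch, where the round cap and timings are arbitrary choices:

```python
def scrape_infinite_scroll(url, max_rounds=20):
    # Imported here so the sketch reads without playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        last_height = 0
        for _ in range(max_rounds):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1000)  # give AJAX loads time to finish
            height = page.evaluate("document.body.scrollHeight")
            if height == last_height:  # no new content appeared -- done
                break
            last_height = height

        html = page.content()
        browser.close()
        return html
```

The returned HTML can then be handed to Beautiful Soup for parsing, as in the static example earlier.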
Cache responses during development
Scraping the same page repeatedly during development wastes bandwidth and risks getting blocked. Cache responses locally:
```python
import hashlib
import os

import requests

def fetch_cached(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    fname = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(cache_dir, fname)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = requests.get(url).text
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```
Respect robots.txt
Before scraping a site, check its robots.txt to see which paths are allowed:
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/products"))
```
7. Storing Scraped Data
For most projects, JSON and CSV are sufficient:
```python
import json

with open("products.json", "w") as f:
    json.dump(products, f, indent=2, ensure_ascii=False)
```
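CSV works just as well for flat records, using only the standard library. The sample rows below stand in for the list your scraper builds:

```python
import csv

# Sample rows -- in practice this is the list built by your scraper
products = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget", "price": "$24.50"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(products)
```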
For production pipelines, consider:
- PostgreSQL for structured relational data
- MongoDB for flexible document storage
- S3 or GCS for raw HTML archives
- An ETL pipeline -- see our full guide on building production data extraction pipelines
Conclusion
Python web scraping ranges from a five-line script with requests and BeautifulSoup to distributed crawling systems with Scrapy. The right tool depends on your scale and complexity requirements:
- Static pages: requests + BeautifulSoup
- Dynamic pages: Playwright
- Large-scale crawls: Scrapy
- Production scraping without the ops burden: SearchHive ScrapeForge
Get started with 500 free credits on SearchHive's free tier -- no credit card required. Full access to SwiftSearch, ScrapeForge, and DeepDive endpoints. Check the docs for integration guides.