Python web scraping is one of the most practical skills a developer can learn. Whether you need to collect product data, monitor competitor prices, gather research material, or build a dataset for machine learning, Python has the tools to get it done.
This step-by-step guide covers the fundamentals of Python web scraping in 2025 -- from basic HTTP requests to handling JavaScript-rendered pages, with real code examples you can run today.
Prerequisites
Before starting, you'll need:
- Python 3.8 or newer installed
- A code editor or terminal
- Basic understanding of HTML and CSS selectors
- The `requests` and `beautifulsoup4` libraries
Install the dependencies:
pip install requests beautifulsoup4
Step 1: Understand the Target Page
Before writing any code, inspect the page you want to scrape. Open it in your browser, right-click the element you need, and select "Inspect" to see the HTML structure.
Key things to identify:
- The HTML tags and CSS classes containing your target data
- Whether data is loaded statically or via JavaScript
- Any pagination patterns (page=2, ?offset=20, etc.)
- Rate limiting indicators or anti-bot protections
For this tutorial, we'll scrape a hypothetical bookstore's product listings.
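Part of this reconnaissance is checking the site's robots.txt. A minimal sketch using the standard library's `urllib.robotparser` (the rules below are made-up examples; against a real site you would call `rp.set_url("https://books.toscrape.com/robots.txt")` and `rp.read()` instead of `parse()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for demonstration
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(rules)

# Check whether a path is allowed before scraping it
print(rp.can_fetch("MyScraper/1.0", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/data.html"))      # False
print(rp.crawl_delay("MyScraper/1.0"))  # 2
```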
Step 2: Fetch the Page with Requests
The simplest approach is fetching HTML with the requests library:
import requests
url = "https://books.toscrape.com/"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} bytes")
Always include a User-Agent header. Many sites block requests that use the default `python-requests` user agent.
Check the HTTP status code before parsing:
- `200` -- success
- `403` -- forbidden (often means your request was blocked)
- `404` -- page not found
- `429` -- rate limited (slow down your requests)
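Turning those codes into decisions can be sketched as follows; `decide_action` is a hypothetical helper, not part of `requests`:

```python
def decide_action(status_code: int) -> str:
    """Map an HTTP status code to a scraper action (illustrative only)."""
    if status_code == 200:
        return "parse"
    if status_code == 429:
        return "back off and retry"
    if status_code in (403, 404):
        return "skip"   # blocked or missing; retrying rarely helps
    return "retry"      # assume a transient server error (5xx)

print(decide_action(200))  # parse
print(decide_action(429))  # back off and retry
```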
Step 3: Parse HTML with BeautifulSoup
Once you have the HTML content, parse it with BeautifulSoup to extract specific elements:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# Find all book containers
books = soup.select("article.product_pod")
for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating = book.select_one("p.star-rating")["class"][1]
    availability = book.select_one(".instock.availability").text.strip()
    print(f"{title} | {price} | Rating: {rating} | {availability}")
CSS selectors are the most flexible way to target elements:
- `.class` -- select by class
- `#id` -- select by ID
- `tag` -- select by tag name
- `div.class1.class2` -- elements with multiple classes
- `a[href]` -- anchor tags with an href attribute
- `table tr:nth-child(odd)` -- every other table row
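These selectors can be tried out on a small inline document; the HTML below is a made-up fragment mimicking the bookstore markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for demonstrating the selectors above
html = """
<div id="shelf">
  <article class="product_pod featured"><h3><a href="/b1" title="Book One">Book One</a></h3></article>
  <article class="product_pod"><h3><a href="/b2" title="Book Two">Book Two</a></h3></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".product_pod")))                  # by class -> 2
print(soup.select_one("#shelf").name)                    # by id -> div
print(len(soup.select("article.product_pod.featured")))  # multiple classes -> 1
print(soup.select_one("a[href]")["title"])               # attribute selector -> Book One
```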
Step 4: Handle Pagination
Most real websites spread data across multiple pages. Handle pagination by detecting the "next" link and looping:
import time
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []
for page in range(1, 6):  # Scrape first 5 pages
    print(f"Scraping page {page}...")
    response = requests.get(base_url.format(page), headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    })
    soup = BeautifulSoup(response.text, "html.parser")
    books = soup.select("article.product_pod")
    for book in books:
        title = book.select_one("h3 a")["title"]
        price = book.select_one(".price_color").text
        all_books.append({"title": title, "price": price})
    time.sleep(1)  # Be respectful -- don't hammer the server
print(f"Total books collected: {len(all_books)}")
Always add delays between requests. time.sleep(1) is a minimum -- for production scrapers, use 2-5 second delays and respect the site's robots.txt.
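A common refinement is adding random jitter so request timing looks less mechanical. `polite_delay` below is a hypothetical helper (tiny values are used only so the demo finishes quickly; use seconds in practice):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep between `base` and `base + jitter` seconds; return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

d = polite_delay(base=0.01, jitter=0.01)  # demo values only
print(f"slept {d:.3f}s")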
Step 5: Handle JavaScript-Rendered Pages
Many modern websites load content dynamically via JavaScript. The HTML you receive from requests.get() may be incomplete because the actual data is fetched by JavaScript after the page loads.
For JavaScript-rendered pages, you have two options:
Option A: Use a scraping API
The simplest approach is to use a service that renders JavaScript for you. SearchHive's ScrapeForge API handles this automatically:
import requests
# ScrapeForge renders JavaScript and returns clean content
response = requests.post(
    "https://api.searchhive.dev/v1/scrapeforge",
    headers={"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "urls": ["https://example.com/dynamic-page"],
        "format": "markdown",
        "render_js": True
    }
)
content = response.json()["results"][0]["content"]
print(content[:500])
This handles JavaScript rendering, proxy rotation, and CAPTCHA management. You get clean Markdown or HTML back without managing a browser.
Option B: Use Playwright locally
For full control, use Playwright to run a real browser:
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dynamic-page")
    page.wait_for_selector(".product-list")  # Wait for JS content
    products = page.query_selector_all(".product-item")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")
    browser.close()
Playwright is more powerful but heavier -- it runs a full browser instance. For production workloads, the API approach (Option A) is usually more reliable and scalable.
Step 6: Export Data
Once you've collected your data, save it to a usable format:
import json
import csv
# Save as JSON
with open("books.json", "w") as f:
    json.dump(all_books, f, indent=2)
# Save as CSV
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(all_books)
print("Data exported to books.json and books.csv")
For larger datasets, consider SQLite or a proper database.
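A minimal sketch of the SQLite option, reusing the `all_books` record shape from Step 4 (an in-memory database here; pass a filename like `"books.db"` for a real one):

```python
import sqlite3

# Sample records in the shape produced by the pagination loop
all_books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

conn = sqlite3.connect(":memory:")  # use "books.db" for a file on disk
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)")
# executemany with named placeholders accepts the list of dicts directly
conn.executemany("INSERT INTO books (title, price) VALUES (:title, :price)", all_books)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(f"Stored {count} books")  # Stored 2 books
conn.close()
```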
Step 7: Handle Errors and Rate Limiting
Production scrapers need error handling:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Configure automatic retries
retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

def safe_scrape(url, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
            }, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
    return None
Step 8: Full Working Example
Here's a complete scraper that combines all the concepts:
import requests
import csv
import time
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def setup_session():
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 503])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    return session

def scrape_books(max_pages=5):
    session = setup_session()
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    for page in range(1, max_pages + 1):
        try:
            response = session.get(
                base_url.format(page),
                headers={"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"},
                timeout=30
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            articles = soup.select("article.product_pod")
            for article in articles:
                book = {
                    "title": article.select_one("h3 a")["title"],
                    "price": article.select_one(".price_color").text,
                    "rating": article.select_one("p.star-rating")["class"][1],
                    "url": base_url.format(page).rsplit("/", 1)[0] + "/" + article.select_one("h3 a")["href"]
                }
                all_books.append(book)
            print(f"Page {page}: collected {len(articles)} books")
            time.sleep(1.5)
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break
    return all_books

def export_to_csv(books, filename="books_output.csv"):
    if not books:
        print("No books to export")
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=books[0].keys())
        writer.writeheader()
        writer.writerows(books)
    print(f"Exported {len(books)} books to {filename}")

if __name__ == "__main__":
    books = scrape_books(max_pages=5)
    export_to_csv(books)
Common Issues
Getting blocked (403): Add a realistic User-Agent header, rotate between multiple user agents, and add delays between requests. For heavily protected sites, use a service like SearchHive ScrapeForge which handles proxy rotation and anti-detection.
Missing data (empty selectors): The data might be loaded by JavaScript. Check the Network tab in DevTools to see if content comes from an API call. If it does, call that API directly instead of parsing HTML.
Rate limiting (429): Slow down your requests. Use exponential backoff (doubling the delay after each failure). Respect robots.txt.
Encoding issues: Try response.encoding = response.apparent_encoding or specify encoding explicitly when opening files: open("file.csv", "w", encoding="utf-8").
Next Steps
Once you're comfortable with the basics, explore:
- Async scraping with `aiohttp` + `asyncio` for parallel requests
- API scraping -- many sites load data from JSON APIs that are easier to parse than HTML
- Scraping APIs like SearchHive ScrapeForge for production workloads that need reliability and scale
- Sitemap parsing to discover all pages on a site before scraping
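Sitemap parsing needs nothing beyond the standard library. The sketch below parses an inline example document rather than fetching a real sitemap; in practice you would first download e.g. `https://example.com/sitemap.xml`:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content for demonstration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1.html</loc></url>
  <url><loc>https://example.com/page-2.html</loc></url>
</urlset>"""

# Sitemaps use a namespace, so register it for the XPath-style lookup
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/page-1.html', 'https://example.com/page-2.html']
```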
For production scraping that handles JavaScript rendering, proxy rotation, and rate limiting automatically, check out SearchHive's free tier -- 500 free credits with full access to all scraping features.
See also: How to Scrape JavaScript-Rendered Pages | SearchHive vs Firecrawl Comparison | SearchHive vs ScrapingBee