Python web scraping is one of the most practical skills a developer can learn. Whether you need to collect product data, monitor competitor prices, gather research material, or build a dataset for machine learning, Python has the tools to get it done.
This step-by-step guide covers the fundamentals of Python web scraping in 2025 -- from basic HTTP requests to handling JavaScript-rendered pages, with real code examples you can run today.
Prerequisites
Before starting, you'll need:
- Python 3.8 or newer installed
- A code editor or terminal
- Basic understanding of HTML and CSS selectors
- The `requests` and `beautifulsoup4` libraries
Install the dependencies:
pip install requests beautifulsoup4
Step 1: Understand the Target Page
Before writing any code, inspect the page you want to scrape. Open it in your browser, right-click the element you need, and select "Inspect" to see the HTML structure.
Key things to identify:
- The HTML tags and CSS classes containing your target data
- Whether data is loaded statically or via JavaScript
- Any pagination patterns (page=2, ?offset=20, etc.)
- Rate limiting indicators or anti-bot protections
For this tutorial, we'll scrape a hypothetical bookstore's product listings.
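Part of this reconnaissance is checking the site's robots.txt. A minimal sketch using the standard library's `urllib.robotparser` (the rules below are made-up examples; against a real site you would call `rp.set_url("https://books.toscrape.com/robots.txt")` and `rp.read()` instead of `parse()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for demonstration
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(rules)

# Check whether a path is allowed before scraping it
print(rp.can_fetch("MyScraper/1.0", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/data.html"))      # False
print(rp.crawl_delay("MyScraper/1.0"))  # 2
```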
Step 2: Fetch the Page with Requests
The simplest approach is fetching HTML with the requests library:
import requests
url = "https://books.toscrape.com/"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} bytes")
Always include a User-Agent header. Many sites block requests that use the default `python-requests` user agent.
Check the HTTP status code before parsing:
- `200` -- success
- `403` -- forbidden (often means your request was blocked)
- `404` -- page not found
- `429` -- rate limited (slow down your requests)
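Turning those codes into decisions can be sketched as follows; `decide_action` is a hypothetical helper, not part of `requests`:

```python
def decide_action(status_code: int) -> str:
    """Map an HTTP status code to a scraper action (illustrative only)."""
    if status_code == 200:
        return "parse"
    if status_code == 429:
        return "back off and retry"
    if status_code in (403, 404):
        return "skip"   # blocked or missing; retrying rarely helps
    return "retry"      # assume a transient server error (5xx)

print(decide_action(200))  # parse
print(decide_action(429))  # back off and retry
```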
Step 3: Parse HTML with BeautifulSoup
Once you have the HTML content, parse it with BeautifulSoup to extract specific elements:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# Find all book containers
books = soup.select("article.product_pod")
for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating = book.select_one("p.star-rating")["class"][1]
    availability = book.select_one(".instock.availability").text.strip()
    print(f"{title} | {price} | Rating: {rating} | {availability}")
CSS selectors are the most flexible way to target elements:
- `.class` -- select by class
- `#id` -- select by ID
- `tag` -- select by tag name
- `div.class1.class2` -- elements with multiple classes
- `a[href]` -- anchor tags with an href attribute
- `table tr:nth-child(odd)` -- every other table row
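These selectors can be tried out on a small inline document; the HTML below is a made-up fragment mimicking the bookstore markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for demonstrating the selectors above
html = """
<div id="shelf">
  <article class="product_pod featured"><h3><a href="/b1" title="Book One">Book One</a></h3></article>
  <article class="product_pod"><h3><a href="/b2" title="Book Two">Book Two</a></h3></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".product_pod")))                  # by class -> 2
print(soup.select_one("#shelf").name)                    # by id -> div
print(len(soup.select("article.product_pod.featured")))  # multiple classes -> 1
print(soup.select_one("a[href]")["title"])               # attribute selector -> Book One
```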
Step 4: Handle Pagination
Most real websites spread data across multiple pages. Handle pagination by detecting the "next" link and looping:
import time
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []
for page in range(1, 6):  # Scrape first 5 pages
    print(f"Scraping page {page}...")
    response = requests.get(base_url.format(page), headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    })
    soup = BeautifulSoup(response.text, "html.parser")
    books = soup.select("article.product_pod")
    for book in books:
        title = book.select_one("h3 a")["title"]
        price = book.select_one(".price_color").text
        all_books.append({"title": title, "price": price})
    time.sleep(1)  # Be respectful -- don't hammer the server
print(f"Total books collected: {len(all_books)}")
Always add delays between requests. time.sleep(1) is a minimum -- for production scrapers, use 2-5 second delays and respect the site's robots.txt.
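A common refinement is adding random jitter so request timing looks less mechanical. `polite_delay` below is a hypothetical helper (tiny values are used only so the demo finishes quickly; use seconds in practice):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep between `base` and `base + jitter` seconds; return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

d = polite_delay(base=0.01, jitter=0.01)  # demo values only
print(f"slept {d:.3f}s")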
Step 5: Handle JavaScript-Rendered Pages
Many modern websites load content dynamically via JavaScript. The HTML you receive from requests.get() may be incomplete because the actual data is fetched by JavaScript after the page loads.
For JavaScript-rendered pages, you have two options:
Option A: Use a scraping API
The simplest approach is to use a service that renders JavaScript for you. SearchHive's ScrapeForge API handles this automatically:
import requests
# ScrapeForge renders JavaScript and returns clean content
response = requests.post(
    "https://api.searchhive.dev/v1/scrapeforge",
    headers={"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "urls": ["https://example.com/dynamic-page"],
        "format": "markdown",
        "render_js": True
    }
)
content = response.json()["results"][0]["content"]
print(content[:500])
This handles JavaScript rendering, proxy rotation, and CAPTCHA management. You get clean Markdown or HTML back without managing a browser.
Option B: Use Playwright locally
For full control, use Playwright to run a real browser:
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dynamic-page")
    page.wait_for_selector(".product-list")  # Wait for JS content
    products = page.query_selector_all(".product-item")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")
    browser.close()
Playwright is more powerful but heavier -- it runs a full browser instance. For production workloads, the API approach (Option A) is usually more reliable and scalable.
Step 6: Export Data
Once you've collected your data, save it to a usable format:
import json
import csv
# Save as JSON
with open("books.json", "w") as f:
    json.dump(all_books, f, indent=2)
# Save as CSV
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(all_books)
print("Data exported to books.json and books.csv")
For larger datasets, consider SQLite or a proper database.
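A minimal sketch of the SQLite option, reusing the `all_books` record shape from Step 4 (an in-memory database here; pass a filename like `"books.db"` for a real one):

```python
import sqlite3

# Sample records in the shape produced by the pagination loop
all_books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

conn = sqlite3.connect(":memory:")  # use "books.db" for a file on disk
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)")
# executemany with named placeholders accepts the list of dicts directly
conn.executemany("INSERT INTO books (title, price) VALUES (:title, :price)", all_books)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(f"Stored {count} books")  # Stored 2 books
conn.close()
```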
Step 7: Handle Errors and Rate Limiting
Production scrapers need error handling:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Configure automatic retries
retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

def safe_scrape(url, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
            }, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
    return None
Step 8: Full Working Example
Here's a complete scraper that combines all the concepts:
import requests
import csv
import time
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def setup_session():
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 503])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    return session

def scrape_books(max_pages=5):
    session = setup_session()
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    for page in range(1, max_pages + 1):
        try:
            response = session.get(
                base_url.format(page),
                headers={"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"},
                timeout=30
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            articles = soup.select("article.product_pod")
            for article in articles:
                book = {
                    "title": article.select_one("h3 a")["title"],
                    "price": article.select_one(".price_color").text,
                    "rating": article.select_one("p.star-rating")["class"][1],
                    "url": base_url.format(page).rsplit("/", 1)[0] + "/" + article.select_one("h3 a")["href"]
                }
                all_books.append(book)
            print(f"Page {page}: collected {len(articles)} books")
            time.sleep(1.5)
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break
    return all_books

def export_to_csv(books, filename="books_output.csv"):
    if not books:
        print("No books to export")
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=books[0].keys())
        writer.writeheader()
        writer.writerows(books)
    print(f"Exported {len(books)} books to {filename}")

if __name__ == "__main__":
    books = scrape_books(max_pages=5)
    export_to_csv(books)
Common Issues
Getting blocked (403): Add a realistic User-Agent header, rotate between multiple user agents, and add delays between requests. For heavily protected sites, use a service like SearchHive ScrapeForge which handles proxy rotation and anti-detection.
Missing data (empty selectors): The data might be loaded by JavaScript. Check the Network tab in DevTools to see if content comes from an API call. If it does, call that API directly instead of parsing HTML.
Rate limiting (429): Slow down your requests. Use exponential backoff (doubling the delay after each failure). Respect robots.txt.
Encoding issues: Try response.encoding = response.apparent_encoding or specify encoding explicitly when opening files: open("file.csv", "w", encoding="utf-8").
Next Steps
Once you're comfortable with the basics, explore:
- Async scraping with `aiohttp` + `asyncio` for parallel requests
- API scraping -- many sites load data from JSON APIs that are easier to parse than HTML
- Scraping APIs like SearchHive ScrapeForge for production workloads that need reliability and scale
- Sitemap parsing to discover all pages on a site before scraping
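Sitemap parsing needs nothing beyond the standard library. The sketch below parses an inline example document rather than fetching a real sitemap; in practice you would first download e.g. `https://example.com/sitemap.xml`:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content for demonstration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1.html</loc></url>
  <url><loc>https://example.com/page-2.html</loc></url>
</urlset>"""

# Sitemaps use a namespace, so register it for the XPath-style lookup
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/page-1.html', 'https://example.com/page-2.html']
```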
For production scraping that handles JavaScript rendering, proxy rotation, and rate limiting automatically, check out SearchHive's free tier -- 500 free credits with full access to all scraping features.
See also: How to Scrape JavaScript-Rendered Pages | SearchHive vs Firecrawl Comparison | SearchHive vs ScrapingBee