How to Extract Contact Info from Websites with Python
Need to extract contact info from websites at scale? Whether you're building a sales leads database, verifying business listings, or enriching CRM data, Python gives you the tools to pull emails, phone numbers, and addresses from web pages reliably.
This tutorial walks through building a complete contact extraction system — from basic regex matching to structured data extraction with CSS selectors. We'll cover error handling, rate limiting, and how to avoid common scraping pitfalls.
Key Takeaways
- Regex patterns are the foundation for extracting emails and phone numbers from raw HTML
- CSS selectors + BeautifulSoup pull structured data like names, titles, and physical addresses
- SearchHive ScrapeForge handles rendering, proxy rotation, and parsing in a single API call
- Always respect robots.txt, rate-limit your requests, and include opt-out mechanisms
- Combine multiple extraction methods (regex + DOM parsing) for maximum coverage
Prerequisites
Before starting, install these packages:
pip install requests beautifulsoup4 lxml searchhive
- requests — HTTP client for fetching web pages
- beautifulsoup4 + lxml — HTML parsing and DOM traversal
- searchhive — SearchHive Python SDK (free tier: 50K requests/month)
You'll also want a text editor or Jupyter notebook for testing your extraction patterns.
Step 1: Fetch the Web Page
Start by fetching the target page's HTML content:
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> BeautifulSoup:
    """Fetch and parse a web page."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")
This is the basic approach. For production use, you need error handling, retries, and proxy rotation — we'll get to that.
Step 2: Extract Email Addresses with Regex
Email addresses follow a consistent pattern. A well-tuned regex catches most of them:
import re

def extract_emails(html: str) -> list[str]:
    """Extract email addresses from HTML content."""
    # Match standard email formats, avoiding common false positives
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = set(re.findall(pattern, html))
    # Filter out obvious false positives
    blacklist = {'example.com', 'domain.com', 'email.com', 'company.com'}
    return sorted([
        email for email in emails
        if not any(email.endswith(f'@{d}') for d in blacklist)
    ])

# Usage
soup = fetch_page("https://example.com/contact")
emails = extract_emails(str(soup))
print(f"Found {len(emails)} emails: {emails}")
Common false positives to watch for:
- Images and SVGs containing @ symbols
- CSS class names with @ (like icon@2x)
- Obfuscated emails using JavaScript ([at] or dot notation)
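One way to screen out the first two cases is a post-filter on the regex matches. The sketch below drops matches that are really asset filenames; the extension list is illustrative, not exhaustive, and `filter_email_candidates` is a hypothetical helper name.

```python
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")

def filter_email_candidates(candidates: list[str]) -> list[str]:
    """Drop regex matches that are really filenames like logo@2x.png."""
    emails = []
    for candidate in candidates:
        lowered = candidate.lower()
        # Asset references such as logo@2x.png match the email regex
        # but end in a file extension rather than a real TLD
        if lowered.endswith(ASSET_EXTENSIONS):
            continue
        # Retina-style suffixes (@2x, @3x) start the "domain" with a digit;
        # candidates come from the email regex, so an @ is always present
        domain = lowered.split("@", 1)[1]
        if domain[0].isdigit():
            continue
        emails.append(candidate)
    return sorted(set(emails))
```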
Step 3: Extract Phone Numbers
Phone number formats vary widely by country. This pattern covers the most common formats:
def extract_phones(html: str) -> list[str]:
    """Extract phone numbers from HTML content."""
    patterns = [
        # US/Canada: (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx
        r'\(\d{3}\)\s*\d{3}[-.]?\d{4}',
        r'\d{3}[-.]\d{3}[-.]\d{4}',
        # International: +1-xxx-xxx-xxxx, +44 xxx xxx xxxx
        r'\+\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}',
        # Toll-free
        r'\b(?:1[-.]?)?800[-.]?\d{3}[-.]?\d{4}\b',
    ]
    phones = set()
    for pattern in patterns:
        matches = re.findall(pattern, html)
        phones.update(matches)
    return sorted(phones)

# Usage
phones = extract_phones(str(soup))
print(f"Found {len(phones)} phone numbers: {phones}")
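Matches come back in whatever punctuation the page used, so a normalization pass helps before deduplication. This is a rough sketch that assumes North American defaults when no country code is present; for production work, the `phonenumbers` library handles per-country rules properly.

```python
def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Normalize a matched phone string to a +<country><digits> form."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if raw.strip().startswith("+"):
        # Country code already included in the match
        return "+" + digits
    # Assume a North American number when exactly 10 digits remain
    if len(digits) == 10:
        return "+" + default_country_code + digits
    return "+" + digits
```

With this, "(415) 555-0132" and "415.555.0132" collapse to the same key for deduplication.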
Step 4: Extract Structured Contact Data with CSS Selectors
Most business websites have a dedicated contact page with structured information. Use CSS selectors to pull specific fields:
def extract_contact_page(soup: BeautifulSoup, base_url: str) -> dict:
    """Extract structured contact data from a contact page."""
    contact = {"source": base_url}

    # Business name — usually in an h1 or the site title
    name_tag = soup.find("h1") or soup.find("title")
    if name_tag:
        contact["business_name"] = name_tag.get_text(strip=True)

    # Email links — mailto: hrefs are the most reliable source
    mailto_links = soup.select('a[href^="mailto:"]')
    contact["emails"] = list(set(
        link["href"].replace("mailto:", "").strip()
        for link in mailto_links
    ))

    # Phone links — tel: hrefs
    tel_links = soup.select('a[href^="tel:"]')
    contact["phones"] = list(set(
        link["href"].replace("tel:", "").strip()
        for link in tel_links
    ))

    # Address — look for common address patterns
    address_selectors = [
        ".address", ".contact-address", "[itemprop=address]",
        "address", ".street-address", ".location"
    ]
    for selector in address_selectors:
        addr = soup.select_one(selector)
        if addr and addr.get_text(strip=True):
            contact["address"] = addr.get_text(strip=True)
            break

    # Social media links
    social_patterns = {
        "linkedin": 'a[href*="linkedin.com"]',
        "twitter": 'a[href*="twitter.com"]',
        "facebook": 'a[href*="facebook.com"]',
    }
    for platform, selector in social_patterns.items():
        link = soup.select_one(selector)
        if link:
            contact[platform] = link["href"]

    return contact

# Usage
contact_data = extract_contact_page(soup, "https://example.com/contact")
for key, value in contact_data.items():
    print(f"{key}: {value}")
Using mailto: and tel: href attributes is more reliable than regex on raw HTML — these are structured data points the website owner intentionally provides.
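One wrinkle: mailto: hrefs often carry query strings and percent-encoding, so it's worth stripping those before storing. A small helper (hypothetical name) might look like this:

```python
from urllib.parse import unquote

def clean_mailto(href: str) -> str:
    """Turn a mailto: href into a bare, lowercased address.

    mailto links often carry query strings (mailto:a@b.com?subject=Hi)
    and percent-encoding (%40 for @); both need stripping/decoding.
    """
    address = href.removeprefix("mailto:")
    address = address.split("?", 1)[0]  # drop ?subject=... and friends
    return unquote(address).strip().lower()
```

The same idea applies to tel: hrefs, which sometimes carry extensions after a semicolon.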
Step 5: Handle JavaScript-Rendered Pages
Many modern websites render contact information with JavaScript. The raw HTML won't contain the data you need. Two approaches:
Approach A: Use Selenium (Heavy)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/contact")
# implicitly_wait() only affects find_element calls, so wait explicitly
# for the JS-rendered contact block before reading page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "address, .contact-info"))
)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "lxml")
Approach B: Use SearchHive ScrapeForge (Recommended)
SearchHive handles rendering, proxy rotation, and parsing in one call:
from searchhive import ScrapeForge

client = ScrapeForge()
result = client.scrape(
    url="https://example.com/contact",
    render_js=True,
    wait_for="address, .contact-info",
    selectors={
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address, .contact-address, [itemprop=address]",
        "business_name": "h1",
        "social_links": 'a[href*="linkedin.com"], a[href*="twitter.com"] @href'
    }
)
print(result.data)
One API call. No browser to manage. Automatic retries and proxy rotation. The free tier includes 50,000 requests/month.
Step 6: Extract Contacts at Scale with Batch Processing
When you need to extract contact info from hundreds or thousands of pages, batch processing with rate limiting is essential:
from searchhive import ScrapeForge
import time
import json

client = ScrapeForge()

urls = [
    "https://company1.com/contact",
    "https://company2.com/about-us",
    "https://company3.com/contact-us",
]

all_contacts = []
for url in urls:
    try:
        result = client.scrape(
            url=url,
            render_js=True,
            selectors={
                "emails": 'a[href^="mailto:"] @href',
                "phones": 'a[href^="tel:"] @href',
                "address": "address",
                "name": "h1",
            },
            timeout=15
        )
        if result.data:
            result.data["url"] = url
            all_contacts.append(result.data)
            print(f"OK {url}: {len(result.data.get('emails', []))} emails")
    except Exception as e:
        print(f"FAIL {url}: {e}")
    time.sleep(1)

with open("contacts.json", "w") as f:
    json.dump(all_contacts, f, indent=2)

print(f"Extracted contacts from {len(all_contacts)} pages")
For larger batches, use SearchHive's scrape_batch with built-in concurrency control:
results = client.scrape_batch(
    urls,
    render_js=True,
    selectors={
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address",
    },
    concurrency=5
)
valid_contacts = [r.data for r in results if r.success]
print(f"Successfully scraped {len(valid_contacts)} of {len(urls)} pages")
Step 7: Validate and Deduplicate Results
Raw extraction produces duplicates and false positives. Clean your data before using it:
import re

def is_valid_email(email: str) -> bool:
    """Validate an email address format."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

def deduplicate_contacts(contacts: list[dict]) -> list[dict]:
    """Remove duplicate contacts based on email addresses."""
    seen = set()
    unique = []
    for contact in contacts:
        emails = contact.get("emails", [])
        valid_emails = [e for e in emails if is_valid_email(e)]
        if valid_emails:
            key = valid_emails[0].lower()
            if key not in seen:
                seen.add(key)
                contact["emails"] = valid_emails
                unique.append(contact)
    return unique

cleaned = deduplicate_contacts(all_contacts)
print(f"Cleaned: {len(all_contacts)} to {len(cleaned)} unique contacts")
Step 8: Export to CSV for CRM Import
import csv

def export_to_csv(contacts: list[dict], filename: str = "contacts.csv"):
    """Export contacts to CSV for CRM import."""
    if not contacts:
        return
    fieldnames = ["business_name", "emails", "phones", "address", "url"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for contact in contacts:
            row = contact.copy()
            row["emails"] = "; ".join(row.get("emails", []))
            row["phones"] = "; ".join(row.get("phones", []))
            writer.writerow(row)
    print(f"Exported {len(contacts)} contacts to {filename}")

export_to_csv(cleaned)
Complete Code Example
Here's the full pipeline — from URL list to clean CSV:
from searchhive import ScrapeForge
import csv

def extract_contacts(urls: list[str]) -> list[dict]:
    """Extract contact info from a list of URLs using SearchHive."""
    client = ScrapeForge()
    selectors = {
        "business_name": "h1, title",
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address, .contact-address, [itemprop=address]",
    }
    results = client.scrape_batch(
        urls, render_js=True, selectors=selectors, concurrency=3
    )
    contacts = []
    for r in results:
        if r.success and r.data:
            r.data["url"] = r.url
            if "emails" in r.data:
                r.data["emails"] = [
                    e.replace("mailto:", "") for e in r.data["emails"]
                ]
            if "phones" in r.data:
                r.data["phones"] = [
                    p.replace("tel:", "") for p in r.data["phones"]
                ]
            contacts.append(r.data)
    return contacts

def export_contacts(contacts: list[dict], filename: str):
    """Export to CSV."""
    fieldnames = ["business_name", "emails", "phones", "address", "url"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for c in contacts:
            row = c.copy()
            row["emails"] = "; ".join(row.get("emails", []))
            row["phones"] = "; ".join(row.get("phones", []))
            writer.writerow(row)

if __name__ == "__main__":
    urls = [
        "https://example.com/contact",
        "https://example.org/about",
        "https://example.net/contact-us",
    ]
    contacts = extract_contacts(urls)
    export_contacts(contacts, "contacts.csv")
    print(f"Exported {len(contacts)} contacts to contacts.csv")
Common Issues
Emails hidden behind Cloudflare or CAPTCHAs
SearchHive's proxy rotation and anti-detection features handle most bot protection. If you're hitting CAPTCHAs on specific sites, reduce concurrency and increase delays between requests.
JavaScript-obfuscated emails (name [at] domain.com)
Regex patterns won't catch these. Add a secondary pattern:
obfuscated = re.findall(r'(\S+)\s*\[at\]\s*(\S+)', html)
emails = [f"{name}@{domain}" for name, domain in obfuscated]
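To also catch the [dot] notation mentioned earlier, you can extend this to a three-part pattern. This is a sketch: it handles [at]/(at) and [dot]/(dot) spellings with a single dot, and will miss more creative obfuscations.

```python
import re

def deobfuscate_emails(html: str) -> list[str]:
    """Recover emails written as 'name [at] domain [dot] com'."""
    pattern = (
        r'([\w.+-]+)\s*(?:\[at\]|\(at\))\s*'   # local part, then [at] or (at)
        r'([\w-]+)\s*(?:\[dot\]|\(dot\))\s*'   # domain, then [dot] or (dot)
        r'(\w+)'                               # top-level domain
    )
    found = []
    for name, domain, tld in re.findall(pattern, html, flags=re.IGNORECASE):
        found.append(f"{name}@{domain}.{tld}")
    return found
```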
Duplicate entries across pages
Deduplicate by email address after extraction (Step 7 above).
robots.txt blocking scraping
Always check robots.txt before scraping. SearchHive respects robots.txt by default but allows you to configure this per-request.
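If you're fetching pages yourself, the standard library's urllib.robotparser can check URLs against a robots.txt you've retrieved. A minimal sketch, with the rules inlined for illustration (in practice you'd fetch https://site.com/robots.txt first):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """
User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "MyBot", "https://site.com/contact"))    # True
print(allowed_by_robots(rules, "MyBot", "https://site.com/private/x"))  # False
```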
Next Steps
- Use SearchHive SwiftSearch to find relevant pages before extracting contact data — search for "contact us" pages across domains
- Check out /blog/how-to-build-proxy-rotator-for-web-scraping-with-python for advanced proxy rotation techniques
- Explore /compare/scraperapi to see how SearchHive compares to other scraping APIs
Start extracting contact data today with SearchHive's free tier — 50,000 requests/month with JS rendering and proxy rotation included. Read the docs.