How to Extract Contact Info from Websites with Python
Need to extract contact info from websites at scale? Whether you're building a sales leads database, verifying business listings, or enriching CRM data, Python gives you the tools to pull emails, phone numbers, and addresses from web pages reliably.
This tutorial walks through building a complete contact extraction system — from basic regex matching to structured data extraction with CSS selectors. We'll cover error handling, rate limiting, and how to avoid common scraping pitfalls.
Key Takeaways
- Regex patterns are the foundation for extracting emails and phone numbers from raw HTML
- CSS selectors + BeautifulSoup pull structured data like names, titles, and physical addresses
- SearchHive ScrapeForge handles rendering, proxy rotation, and parsing in a single API call
- Always respect robots.txt, rate-limit your requests, and include opt-out mechanisms
- Combine multiple extraction methods (regex + DOM parsing) for maximum coverage
Prerequisites
Before starting, install these packages:
pip install requests beautifulsoup4 lxml searchhive
- requests — HTTP client for fetching web pages
- beautifulsoup4 + lxml — HTML parsing and DOM traversal
- searchhive — SearchHive Python SDK (free tier: 50K requests/month)
You'll also want a text editor or Jupyter notebook for testing your extraction patterns.
Step 1: Fetch the Web Page
Start by fetching the target page's HTML content:
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> BeautifulSoup:
    """Fetch and parse a web page."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")
This is the basic approach. For production use, you need error handling, retries, and proxy rotation — we'll get to that.
Step 2: Extract Email Addresses with Regex
Email addresses follow a consistent pattern. A well-tuned regex catches most of them:
import re

def extract_emails(html: str) -> list[str]:
    """Extract email addresses from HTML content."""
    # Match standard email formats, avoiding common false positives
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    emails = set(re.findall(pattern, html))
    # Filter out obvious false positives
    blacklist = {'example.com', 'domain.com', 'email.com', 'company.com'}
    return sorted([
        email for email in emails
        if not any(email.endswith(f'@{d}') for d in blacklist)
    ])

# Usage
soup = fetch_page("https://example.com/contact")
emails = extract_emails(str(soup))
print(f"Found {len(emails)} emails: {emails}")
Common false positives to watch for:
- Images and SVGs containing @ symbols
- CSS class names with @ (like icon@2x)
- Obfuscated emails using JavaScript ([at] or dot notation)
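One way to screen out the first two cases is a post-filter on the regex matches. The sketch below drops matches that are really asset filenames; the extension list is illustrative, not exhaustive, and `filter_email_candidates` is a hypothetical helper name.

```python
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")

def filter_email_candidates(candidates: list[str]) -> list[str]:
    """Drop regex matches that are really filenames like logo@2x.png."""
    emails = []
    for candidate in candidates:
        lowered = candidate.lower()
        # Asset references such as logo@2x.png match the email regex
        # but end in a file extension rather than a real TLD
        if lowered.endswith(ASSET_EXTENSIONS):
            continue
        # Retina-style suffixes (@2x, @3x) start the "domain" with a digit;
        # candidates come from the email regex, so an @ is always present
        domain = lowered.split("@", 1)[1]
        if domain[0].isdigit():
            continue
        emails.append(candidate)
    return sorted(set(emails))
```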
Step 3: Extract Phone Numbers
Phone number formats vary widely by country. This pattern covers the most common formats:
def extract_phones(html: str) -> list[str]:
    """Extract phone numbers from HTML content."""
    patterns = [
        # US/Canada: (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx
        r'\(\d{3}\)\s*\d{3}[-.]?\d{4}',
        r'\d{3}[-.]\d{3}[-.]\d{4}',
        # International: +1-xxx-xxx-xxxx, +44 xxx xxx xxxx
        r'\+\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}',
        # Toll-free
        r'\b(?:1[-.]?)?800[-.]?\d{3}[-.]?\d{4}\b',
    ]
    phones = set()
    for pattern in patterns:
        matches = re.findall(pattern, html)
        phones.update(matches)
    return sorted(phones)

# Usage
phones = extract_phones(str(soup))
print(f"Found {len(phones)} phone numbers: {phones}")
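Matches come back in whatever punctuation the page used, so a normalization pass helps before deduplication. This is a rough sketch that assumes North American defaults when no country code is present; for production work, the `phonenumbers` library handles per-country rules properly.

```python
def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Normalize a matched phone string to a +<country><digits> form."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if raw.strip().startswith("+"):
        # Country code already included in the match
        return "+" + digits
    # Assume a North American number when exactly 10 digits remain
    if len(digits) == 10:
        return "+" + default_country_code + digits
    return "+" + digits
```

With this, "(415) 555-0132" and "415.555.0132" collapse to the same key for deduplication.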
Step 4: Extract Structured Contact Data with CSS Selectors
Most business websites have a dedicated contact page with structured information. Use CSS selectors to pull specific fields:
def extract_contact_page(soup: BeautifulSoup, base_url: str) -> dict:
    """Extract structured contact data from a contact page."""
    contact = {"source": base_url}

    # Business name — usually in an h1 or the site title
    name_tag = soup.find("h1") or soup.find("title")
    if name_tag:
        contact["business_name"] = name_tag.get_text(strip=True)

    # Email links — mailto: hrefs are the most reliable source
    mailto_links = soup.select('a[href^="mailto:"]')
    contact["emails"] = list(set(
        link["href"].replace("mailto:", "").strip()
        for link in mailto_links
    ))

    # Phone links — tel: hrefs
    tel_links = soup.select('a[href^="tel:"]')
    contact["phones"] = list(set(
        link["href"].replace("tel:", "").strip()
        for link in tel_links
    ))

    # Address — look for common address patterns
    address_selectors = [
        ".address", ".contact-address", "[itemprop=address]",
        "address", ".street-address", ".location"
    ]
    for selector in address_selectors:
        addr = soup.select_one(selector)
        if addr and addr.get_text(strip=True):
            contact["address"] = addr.get_text(strip=True)
            break

    # Social media links
    social_patterns = {
        "linkedin": 'a[href*="linkedin.com"]',
        "twitter": 'a[href*="twitter.com"]',
        "facebook": 'a[href*="facebook.com"]',
    }
    for platform, selector in social_patterns.items():
        link = soup.select_one(selector)
        if link:
            contact[platform] = link["href"]

    return contact

# Usage
contact_data = extract_contact_page(soup, "https://example.com/contact")
for key, value in contact_data.items():
    print(f"{key}: {value}")
Using mailto: and tel: href attributes is more reliable than regex on raw HTML — these are structured data points the website owner intentionally provides.
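One wrinkle: mailto: hrefs often carry query strings and percent-encoding, so it's worth stripping those before storing. A small helper (hypothetical name) might look like this:

```python
from urllib.parse import unquote

def clean_mailto(href: str) -> str:
    """Turn a mailto: href into a bare, lowercased address.

    mailto links often carry query strings (mailto:a@b.com?subject=Hi)
    and percent-encoding (%40 for @); both need stripping/decoding.
    """
    address = href.removeprefix("mailto:")
    address = address.split("?", 1)[0]  # drop ?subject=... and friends
    return unquote(address).strip().lower()
```

The same idea applies to tel: hrefs, which sometimes carry extensions after a semicolon.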
Step 5: Handle JavaScript-Rendered Pages
Many modern websites render contact information with JavaScript. The raw HTML won't contain the data you need. Two approaches:
Approach A: Use Selenium (Heavy)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/contact")
# implicitly_wait() only affects find_element calls, so wait explicitly
# for the JS-rendered contact block before reading page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "address, .contact-info"))
)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "lxml")
Approach B: Use SearchHive ScrapeForge (Recommended)
SearchHive handles rendering, proxy rotation, and parsing in one call:
from searchhive import ScrapeForge

client = ScrapeForge()
result = client.scrape(
    url="https://example.com/contact",
    render_js=True,
    wait_for="address, .contact-info",
    selectors={
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address, .contact-address, [itemprop=address]",
        "business_name": "h1",
        "social_links": 'a[href*="linkedin.com"], a[href*="twitter.com"] @href'
    }
)
print(result.data)
One API call. No browser to manage. Automatic retries and proxy rotation. The free tier includes 50,000 requests/month.
Step 6: Extract Contacts at Scale with Batch Processing
When you need to extract contact info from hundreds or thousands of pages, batch processing with rate limiting is essential:
from searchhive import ScrapeForge
import time
import json

client = ScrapeForge()

urls = [
    "https://company1.com/contact",
    "https://company2.com/about-us",
    "https://company3.com/contact-us",
]

all_contacts = []
for url in urls:
    try:
        result = client.scrape(
            url=url,
            render_js=True,
            selectors={
                "emails": 'a[href^="mailto:"] @href',
                "phones": 'a[href^="tel:"] @href',
                "address": "address",
                "name": "h1",
            },
            timeout=15
        )
        if result.data:
            result.data["url"] = url
            all_contacts.append(result.data)
            print(f"OK {url}: {len(result.data.get('emails', []))} emails")
    except Exception as e:
        print(f"FAIL {url}: {e}")
    time.sleep(1)

with open("contacts.json", "w") as f:
    json.dump(all_contacts, f, indent=2)

print(f"Extracted contacts from {len(all_contacts)} pages")
For larger batches, use SearchHive's scrape_batch with built-in concurrency control:
results = client.scrape_batch(
    urls,
    render_js=True,
    selectors={
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address",
    },
    concurrency=5
)
valid_contacts = [r.data for r in results if r.success]
print(f"Successfully scraped {len(valid_contacts)} of {len(urls)} pages")
Step 7: Validate and Deduplicate Results
Raw extraction produces duplicates and false positives. Clean your data before using it:
import re

def is_valid_email(email: str) -> bool:
    """Validate an email address format."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

def deduplicate_contacts(contacts: list[dict]) -> list[dict]:
    """Remove duplicate contacts based on email addresses."""
    seen = set()
    unique = []
    for contact in contacts:
        emails = contact.get("emails", [])
        valid_emails = [e for e in emails if is_valid_email(e)]
        if valid_emails:
            key = valid_emails[0].lower()
            if key not in seen:
                seen.add(key)
                contact["emails"] = valid_emails
                unique.append(contact)
    return unique

cleaned = deduplicate_contacts(all_contacts)
print(f"Cleaned: {len(all_contacts)} to {len(cleaned)} unique contacts")
Step 8: Export to CSV for CRM Import
import csv

def export_to_csv(contacts: list[dict], filename: str = "contacts.csv"):
    """Export contacts to CSV for CRM import."""
    if not contacts:
        return
    fieldnames = ["business_name", "emails", "phones", "address", "url"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for contact in contacts:
            row = contact.copy()
            row["emails"] = "; ".join(row.get("emails", []))
            row["phones"] = "; ".join(row.get("phones", []))
            writer.writerow(row)
    print(f"Exported {len(contacts)} contacts to {filename}")

export_to_csv(cleaned)
Complete Code Example
Here's the full pipeline — from URL list to clean CSV:
from searchhive import ScrapeForge
import csv

def extract_contacts(urls: list[str]) -> list[dict]:
    """Extract contact info from a list of URLs using SearchHive."""
    client = ScrapeForge()
    selectors = {
        "business_name": "h1, title",
        "emails": 'a[href^="mailto:"] @href',
        "phones": 'a[href^="tel:"] @href',
        "address": "address, .contact-address, [itemprop=address]",
    }
    results = client.scrape_batch(
        urls, render_js=True, selectors=selectors, concurrency=3
    )
    contacts = []
    for r in results:
        if r.success and r.data:
            r.data["url"] = r.url
            if "emails" in r.data:
                r.data["emails"] = [
                    e.replace("mailto:", "") for e in r.data["emails"]
                ]
            if "phones" in r.data:
                r.data["phones"] = [
                    p.replace("tel:", "") for p in r.data["phones"]
                ]
            contacts.append(r.data)
    return contacts

def export_contacts(contacts: list[dict], filename: str):
    """Export to CSV."""
    fieldnames = ["business_name", "emails", "phones", "address", "url"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for c in contacts:
            row = c.copy()
            row["emails"] = "; ".join(row.get("emails", []))
            row["phones"] = "; ".join(row.get("phones", []))
            writer.writerow(row)

if __name__ == "__main__":
    urls = [
        "https://example.com/contact",
        "https://example.org/about",
        "https://example.net/contact-us",
    ]
    contacts = extract_contacts(urls)
    export_contacts(contacts, "contacts.csv")
    print(f"Exported {len(contacts)} contacts to contacts.csv")
Common Issues
Emails hidden behind Cloudflare or CAPTCHAs
SearchHive's proxy rotation and anti-detection features handle most bot protection. If you're hitting CAPTCHAs on specific sites, reduce concurrency and increase delays between requests.
JavaScript-obfuscated emails (name [at] domain.com)
Regex patterns won't catch these. Add a secondary pattern:
obfuscated = re.findall(r'(\S+)\s*\[at\]\s*(\S+)', html)
emails = [f"{name}@{domain}" for name, domain in obfuscated]
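To also catch the [dot] notation mentioned earlier, you can extend this to a three-part pattern. This is a sketch: it handles [at]/(at) and [dot]/(dot) spellings with a single dot, and will miss more creative obfuscations.

```python
import re

def deobfuscate_emails(html: str) -> list[str]:
    """Recover emails written as 'name [at] domain [dot] com'."""
    pattern = (
        r'([\w.+-]+)\s*(?:\[at\]|\(at\))\s*'   # local part, then [at] or (at)
        r'([\w-]+)\s*(?:\[dot\]|\(dot\))\s*'   # domain, then [dot] or (dot)
        r'(\w+)'                               # top-level domain
    )
    found = []
    for name, domain, tld in re.findall(pattern, html, flags=re.IGNORECASE):
        found.append(f"{name}@{domain}.{tld}")
    return found
```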
Duplicate entries across pages
Deduplicate by email address after extraction (Step 7 above).
robots.txt blocking scraping
Always check robots.txt before scraping. SearchHive respects robots.txt by default but allows you to configure this per-request.
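If you're fetching pages yourself, the standard library's urllib.robotparser can check URLs against a robots.txt you've retrieved. A minimal sketch, with the rules inlined for illustration (in practice you'd fetch https://site.com/robots.txt first):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """
User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "MyBot", "https://site.com/contact"))    # True
print(allowed_by_robots(rules, "MyBot", "https://site.com/private/x"))  # False
```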
Next Steps
- Use SearchHive SwiftSearch to find relevant pages before extracting contact data — search for "contact us" pages across domains
- Check out /blog/how-to-build-proxy-rotator-for-web-scraping-with-python for advanced proxy rotation techniques
- Explore /compare/scraperapi to see how SearchHive compares to other scraping APIs
Start extracting contact data today with SearchHive's free tier — 50,000 requests/month with JS rendering and proxy rotation included. Read the docs.