How to Build an Email Finder Tool with Web Scraping
Finding email addresses from websites is a common need for lead generation, outreach campaigns, and sales prospecting. This tutorial shows you how to build an email finder tool that scrapes contact pages, extracts email patterns, and validates addresses -- all using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Email addresses appear on contact pages, about pages, and in page metadata (meta tags, schema markup)
- SearchHive's ScrapeForge API renders JavaScript-heavy contact pages that basic HTTP requests miss
- Regex patterns catch most email formats, but structured extraction with DeepDive is more reliable
- Email validation checks format, domain MX records, and common disposable domains
- The complete tool scrapes, extracts, deduplicates, and validates emails from any website
Prerequisites
- Python 3.8 or later
- SearchHive API key (free tier -- 500 credits)
- Required packages, installed via pip:

```shell
pip install requests searchhive dnspython
```
Step 1: Scrape Contact Pages
Most business websites list email addresses on dedicated contact pages, about pages, or in the footer. ScrapeForge renders JavaScript so you can extract emails from dynamically loaded content:
```python
from searchhive import ScrapeForge

client = ScrapeForge(api_key="YOUR_API_KEY")

def scrape_site_emails(base_url):
    common_paths = [
        "/contact", "/contact-us", "/about", "/about-us",
        "/team", "/legal", "/privacy", "/terms"
    ]
    all_content = ""
    for path in common_paths:
        url = base_url.rstrip("/") + path
        try:
            result = client.scrape(url, format="markdown", render_js=True)
            if result.content:
                all_content += result.content + "\n"
                print(f"  Scraped {url} ({len(result.content)} chars)")
        except Exception:
            pass  # Page may not exist
    return all_content

content = scrape_site_emails("https://www.example-company.com")
```
This checks multiple common paths where emails typically appear. The markdown format gives you clean text without HTML noise.
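Fixed paths can miss sites that use unusual URLs (say, /get-in-touch). As a complement — a standard-library sketch, not a ScrapeForge feature — you can fetch the homepage HTML first and collect links whose href suggests a contact page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords that suggest a contact-related page (extend to taste)
CONTACT_HINTS = ("contact", "about", "team", "impressum")

class ContactLinkFinder(HTMLParser):
    """Collects absolute URLs of links whose href contains a contact hint."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if any(hint in href.lower() for hint in CONTACT_HINTS):
            self.links.add(urljoin(self.base_url, href))

def find_contact_links(html, base_url):
    parser = ContactLinkFinder(base_url)
    parser.feed(html)
    return sorted(parser.links)
```

You would feed this the homepage content (scraped as HTML rather than markdown, assuming ScrapeForge exposes an HTML output format) and then scrape each discovered link with the function above.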
Step 2: Extract Emails with Regex
A regex pattern catches most standard email formats from scraped text:
```python
import re

def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    emails = set(re.findall(pattern, text))
    # Filter out common false positives
    false_positives = {
        "example@example.com", "test@example.com", "email@example.com",
        "name@company.com", "user@domain.com", "your@email.com",
        "admin@example.com", "noreply@", "no-reply@"
    }
    filtered = set()
    for email in emails:
        if not any(fp in email.lower() for fp in false_positives):
            filtered.add(email.lower())
    return list(filtered)

emails = extract_emails(content)
print(f"Found {len(emails)} unique emails")
for e in emails:
    print(f"  {e}")
```
The regex `[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}` matches standard email formats. The false-positive filter removes placeholder emails commonly found in templates and documentation.
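One optional refinement — a deduplication policy choice, not something the extractor above does — is to treat plus-addressed aliases (user+tag@domain) as the same mailbox. Note that not every mail provider interprets `+` this way, so apply it only if it fits your data:

```python
def normalize_email(email):
    """Collapse plus-addressing: user+tag@domain -> user@domain."""
    local, _, domain = email.lower().partition("@")
    return local.split("+", 1)[0] + "@" + domain

def dedupe_emails(emails):
    """Deduplicate a list of addresses by normalized mailbox."""
    return sorted({normalize_email(e) for e in emails})
```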
Step 3: Use DeepDive for Smarter Extraction
Regex works well for obvious email addresses but misses obfuscated patterns like "contact [at] company [dot] com" or emails embedded in structured data. DeepDive uses AI to find emails regardless of format:
```python
from searchhive import DeepDive

deep = DeepDive(api_key="YOUR_API_KEY")

def extract_emails_ai(content):
    result = deep.extract(
        content=content,
        schema={
            "type": "object",
            "properties": {
                "emails": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "All email addresses found on the page"
                },
                "contact_methods": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Other contact methods (phone, forms, social media)"
                }
            }
        }
    )
    return result.data

contact_data = extract_emails_ai(content)
print(f"AI extracted {len(contact_data.get('emails', []))} emails")
print(f"Also found: {contact_data.get('contact_methods', [])}")
```
DeepDive catches obfuscated emails, mailto: links parsed from HTML, and emails mentioned in prose. Combining regex and AI extraction gives the highest coverage:
```python
def combined_extraction(content):
    regex_emails = set(extract_emails(content))
    ai_data = extract_emails_ai(content)
    ai_emails = set(e.lower() for e in ai_data.get("emails", []))
    all_emails = regex_emails | ai_emails
    return sorted(all_emails)

all_emails = combined_extraction(content)
```
Step 4: Validate Email Addresses
Not every extracted email is valid or deliverable. Add validation to filter out bad addresses:
```python
import re
import dns.resolver

def is_valid_format(email):
    pattern = r'^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

def has_valid_mx(email):
    domain = email.split("@")[1]
    try:
        records = dns.resolver.resolve(domain, "MX")
        return len(records) > 0
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers):
        return False
    except Exception:
        return True  # Assume valid if the DNS check itself fails (e.g. timeout)

DISPOSABLE_DOMAINS = {
    "mailinator.com", "guerrillamail.com", "tempmail.com",
    "throwaway.email", "yopmail.com", "10minutemail.com"
}

def validate_email(email):
    if not is_valid_format(email):
        return False, "invalid format"
    domain = email.split("@")[1]
    if domain in DISPOSABLE_DOMAINS:
        return False, "disposable domain"
    if not has_valid_mx(email):
        return False, "no MX record"
    return True, "valid"

# Validate all found emails
for email in all_emails:
    valid, reason = validate_email(email)
    status = "OK" if valid else f"SKIP ({reason})"
    print(f"  {email}: {status}")
```
Format validation checks the structure, MX record verification confirms the domain accepts email, and the disposable domain check filters temporary email services.
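Since MX records are a property of the domain rather than the individual address, you can cut DNS traffic by resolving each domain once and reusing the result. A sketch of the grouping step — the lookup itself is passed in as `check_domain`, a hypothetical callable standing in for `has_valid_mx`:

```python
from collections import defaultdict

def validate_in_bulk(emails, check_domain):
    """Validate many addresses with one domain check per distinct domain.

    check_domain is any callable domain -> bool (e.g. an MX lookup).
    Returns a dict mapping each email to its result.
    """
    by_domain = defaultdict(list)
    for email in emails:
        if "@" in email:
            by_domain[email.split("@")[1].lower()].append(email)
    results = {}
    for domain, addrs in by_domain.items():
        ok = check_domain(domain)  # one lookup covers every address at this domain
        for addr in addrs:
            results[addr] = ok
    return results
```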
Step 5: Build a Batch Processing Pipeline
For lead generation, you typically need to process hundreds of websites. Here's a complete pipeline:
```python
from searchhive import ScrapeForge, DeepDive
import re, json, dns.resolver, time

API_KEY = "YOUR_API_KEY"

DISPOSABLE_DOMAINS = {
    "mailinator.com", "guerrillamail.com", "tempmail.com",
    "throwaway.email", "yopmail.com", "10minutemail.com"
}
CONTACT_PATHS = ["/contact", "/about", "/team", "/about-us", "/contact-us"]

def find_emails_for_site(url):
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)
    all_content = ""
    for path in CONTACT_PATHS:
        try:
            result = scrape.scrape(url.rstrip("/") + path, format="markdown", render_js=True)
            if result.content:
                all_content += result.content + "\n"
        except Exception:
            pass
    if not all_content.strip():
        return []
    # Regex extraction
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    regex_emails = set(re.findall(pattern, all_content))
    # AI extraction
    try:
        ai_result = deep.extract(
            content=all_content,
            schema={
                "type": "object",
                "properties": {
                    "emails": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                }
            }
        )
        ai_emails = set(e.lower() for e in ai_result.data.get("emails", []))
    except Exception:
        ai_emails = set()
    # Merge, deduplicate, validate
    all_emails = regex_emails | ai_emails
    validated = []
    for email in all_emails:
        email = email.lower()
        domain = email.split("@")[1] if "@" in email else ""
        if domain not in DISPOSABLE_DOMAINS and re.match(r'^[\w.+-]+@[\w.-]+\.[a-z]{2,}$', email):
            validated.append(email)
    return list(set(validated))

# Process a list of websites
websites = [
    "https://www.company-a.com",
    "https://www.company-b.com",
    "https://www.company-c.com",
]

results = {}
for site in websites:
    print(f"Processing {site}...")
    emails = find_emails_for_site(site)
    results[site] = emails
    print(f"  Found {len(emails)} emails: {emails}")
    time.sleep(2)  # Be polite between sites

with open("email_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Done: processed {len(websites)} sites")
```
Step 6: Handle Common Issues
Obfuscated emails. Some sites use "name [at] domain [dot] com" or JavaScript encoding to prevent scraping. DeepDive catches most obfuscation patterns. For JavaScript-encoded emails, ScrapeForge's `render_js=True` executes the decoding script.
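If you want a lightweight fallback that avoids an AI call, a regex pass can normalize the most common patterns before running extraction. This sketch only covers the "[at]"/"(at)" and "[dot]"/"(dot)" variants — real-world obfuscation is far more varied:

```python
import re

def deobfuscate(text):
    """Rewrite 'name [at] domain [dot] com' style text into name@domain.com."""
    # Replace bracketed/parenthesized 'at' (and surrounding spaces) with @
    text = re.sub(r'\s*[\[\(]\s*at\s*[\]\)]\s*', '@', text, flags=re.IGNORECASE)
    # Replace bracketed/parenthesized 'dot' with a literal .
    text = re.sub(r'\s*[\[\(]\s*dot\s*[\]\)]\s*', '.', text, flags=re.IGNORECASE)
    return text
```

Run it on the scraped content first, then feed the result to the regex extractor from Step 2.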
Rate limiting and blocking. Scraping multiple pages from the same domain triggers rate limits. SearchHive's proxy rotation distributes requests across IPs, but add delays (2-3 seconds) between requests for multi-page scraping on the same domain.
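For transient failures (rate-limit responses, timeouts), a simple retry wrapper with exponential backoff is usually enough. This is a generic sketch — the retry count, delays, and the broad `Exception` catch are assumptions to tune against the errors you actually see:

```python
import time

def with_backoff(fn, retries=3, base_delay=2.0):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... before retrying.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_backoff(lambda: client.scrape(url, format="markdown", render_js=True))`.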
Privacy compliance. Only scrape publicly listed email addresses. Check each website's robots.txt and terms of service. GDPR and CAN-SPAM regulations apply to how you use collected emails for outreach.
Data staleness. Contact pages change frequently. For ongoing lead generation, re-scrape targets on a weekly or monthly schedule.
Complete Code Example
```python
from searchhive import ScrapeForge, DeepDive
import re, json, time

API_KEY = "YOUR_API_KEY"

def extract_emails_from_site(url):
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)
    paths = ["/contact", "/about", "/about-us", "/team", "/contact-us"]
    content = ""
    for p in paths:
        try:
            r = scrape.scrape(url.rstrip("/") + p, format="markdown", render_js=True)
            if r.content:
                content += r.content + "\n"
        except Exception:
            continue
    if not content.strip():
        return {"url": url, "emails": []}
    # Regex pass (case-insensitive so uppercase addresses are not missed)
    emails = set(e.lower() for e in re.findall(r'[\w.+-]+@[\w.-]+\.[a-z]{2,}', content, re.IGNORECASE))
    # AI pass
    try:
        ai = deep.extract(
            content=content,
            schema={"type": "object", "properties": {"emails": {"type": "array", "items": {"type": "string"}}}}
        )
        emails |= set(e.lower() for e in ai.data.get("emails", []))
    except Exception:
        pass
    return {"url": url, "emails": sorted(emails)}

if __name__ == "__main__":
    targets = [
        "https://www.company-a.com",
        "https://www.company-b.com",
    ]
    results = []
    for t in targets:
        print(f"Scanning {t}...")
        r = extract_emails_from_site(t)
        results.append(r)
        print(f"  Found {len(r['emails'])} emails")
        time.sleep(2)
    with open("emails_found.json", "w") as f:
        json.dump(results, f, indent=2)
    print(f"Complete: {sum(len(r['emails']) for r in results)} total emails")
```
Next Steps
- Bulk processing: Feed a list of domains from a CSV or database for large-scale lead generation
- Categorize by role: Use DeepDive to identify job titles (CEO, CTO, support) associated with each email
- Integrate with CRM: Push validated emails directly to HubSpot, Salesforce, or Pipedrive
- Monitor for changes: Schedule weekly re-scrapes to detect when contact info changes
Get started with SearchHive's free tier -- 500 credits, no credit card required. See the API docs for full reference.
See also: /blog/how-to-extract-contact-info-from-websites-with-python and /blog/linkedin-scraping-apis-best-tools-for-lead-generation.