How to Build an Email Finder Tool with Web Scraping
Finding email addresses from websites is a common need for lead generation, outreach campaigns, and sales prospecting. This tutorial shows you how to build an email finder tool that scrapes contact pages, extracts email patterns, and validates addresses -- all using Python and SearchHive's ScrapeForge API.
Key Takeaways
- Email addresses appear on contact pages, about pages, and in page metadata (meta tags, schema markup)
- SearchHive's ScrapeForge API renders JavaScript-heavy contact pages that basic HTTP requests miss
- Regex patterns catch most email formats, but structured extraction with DeepDive is more reliable
- Email validation checks format, domain MX records, and common disposable domains
- The complete tool scrapes, extracts, deduplicates, and validates emails from any website
Prerequisites
- Python 3.8 or later
- SearchHive API key (free tier -- 500 credits)
- Required packages, installed via pip:

```shell
pip install requests searchhive dnspython
```
Step 1: Scrape Contact Pages
Most business websites list email addresses on dedicated contact pages, about pages, or in the footer. ScrapeForge renders JavaScript so you can extract emails from dynamically loaded content:
```python
from searchhive import ScrapeForge

client = ScrapeForge(api_key="YOUR_API_KEY")

def scrape_site_emails(base_url):
    common_paths = [
        "/contact", "/contact-us", "/about", "/about-us",
        "/team", "/legal", "/privacy", "/terms"
    ]
    all_content = ""
    for path in common_paths:
        url = base_url.rstrip("/") + path
        try:
            result = client.scrape(url, format="markdown", render_js=True)
            if result.content:
                all_content += result.content + "\n"
                print(f"  Scraped {url} ({len(result.content)} chars)")
        except Exception:
            pass  # Page may not exist
    return all_content

content = scrape_site_emails("https://www.example-company.com")
```
This checks multiple common paths where emails typically appear. The markdown format gives you clean text without HTML noise.
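Fixed paths can miss sites that use unusual URLs (say, /get-in-touch). As a complement — a standard-library sketch, not a ScrapeForge feature — you can fetch the homepage HTML first and collect links whose href suggests a contact page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords that suggest a contact-related page (extend to taste)
CONTACT_HINTS = ("contact", "about", "team", "impressum")

class ContactLinkFinder(HTMLParser):
    """Collects absolute URLs of links whose href contains a contact hint."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if any(hint in href.lower() for hint in CONTACT_HINTS):
            self.links.add(urljoin(self.base_url, href))

def find_contact_links(html, base_url):
    parser = ContactLinkFinder(base_url)
    parser.feed(html)
    return sorted(parser.links)
```

You would feed this the homepage content (scraped as HTML rather than markdown, assuming ScrapeForge exposes an HTML output format) and then scrape each discovered link with the function above.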
Step 2: Extract Emails with Regex
A regex pattern catches most standard email formats from scraped text:
```python
import re

def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    emails = set(re.findall(pattern, text))
    # Filter out common false positives
    false_positives = {
        "example@example.com", "test@example.com", "email@example.com",
        "name@company.com", "user@domain.com", "your@email.com",
        "admin@example.com", "noreply@", "no-reply@"
    }
    filtered = set()
    for email in emails:
        if not any(fp in email.lower() for fp in false_positives):
            filtered.add(email.lower())
    return list(filtered)

emails = extract_emails(content)
print(f"Found {len(emails)} unique emails")
for e in emails:
    print(f"  {e}")
```
The regex `[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}` matches standard email formats. The false-positive filter removes placeholder emails commonly found in templates and documentation.
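One optional refinement — a deduplication policy choice, not something the extractor above does — is to treat plus-addressed aliases (user+tag@domain) as the same mailbox. Note that not every mail provider interprets `+` this way, so apply it only if it fits your data:

```python
def normalize_email(email):
    """Collapse plus-addressing: user+tag@domain -> user@domain."""
    local, _, domain = email.lower().partition("@")
    return local.split("+", 1)[0] + "@" + domain

def dedupe_emails(emails):
    """Deduplicate a list of addresses by normalized mailbox."""
    return sorted({normalize_email(e) for e in emails})
```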
Step 3: Use DeepDive for Smarter Extraction
Regex works well for obvious email addresses but misses obfuscated patterns like "contact [at] company [dot] com" or emails embedded in structured data. DeepDive uses AI to find emails regardless of format:
```python
from searchhive import DeepDive

deep = DeepDive(api_key="YOUR_API_KEY")

def extract_emails_ai(content):
    result = deep.extract(
        content=content,
        schema={
            "type": "object",
            "properties": {
                "emails": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "All email addresses found on the page"
                },
                "contact_methods": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Other contact methods (phone, forms, social media)"
                }
            }
        }
    )
    return result.data

contact_data = extract_emails_ai(content)
print(f"AI extracted {len(contact_data.get('emails', []))} emails")
print(f"Also found: {contact_data.get('contact_methods', [])}")
```
DeepDive catches obfuscated emails, mailto: links parsed from HTML, and emails mentioned in prose. Combining regex and AI extraction gives the highest coverage:
```python
def combined_extraction(content):
    regex_emails = set(extract_emails(content))
    ai_data = extract_emails_ai(content)
    ai_emails = set(e.lower() for e in ai_data.get("emails", []))
    all_emails = regex_emails | ai_emails
    return sorted(all_emails)

all_emails = combined_extraction(content)
```
Step 4: Validate Email Addresses
Not every extracted email is valid or deliverable. Add validation to filter out bad addresses:
```python
import re
import dns.resolver

def is_valid_format(email):
    pattern = r'^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

def has_valid_mx(email):
    domain = email.split("@")[1]
    try:
        records = dns.resolver.resolve(domain, "MX")
        return len(records) > 0
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers):
        return False
    except Exception:
        return True  # Assume valid if the DNS check itself fails (e.g. timeout)

DISPOSABLE_DOMAINS = {
    "mailinator.com", "guerrillamail.com", "tempmail.com",
    "throwaway.email", "yopmail.com", "10minutemail.com"
}

def validate_email(email):
    if not is_valid_format(email):
        return False, "invalid format"
    domain = email.split("@")[1]
    if domain in DISPOSABLE_DOMAINS:
        return False, "disposable domain"
    if not has_valid_mx(email):
        return False, "no MX record"
    return True, "valid"

# Validate all found emails
for email in all_emails:
    valid, reason = validate_email(email)
    status = "OK" if valid else f"SKIP ({reason})"
    print(f"  {email}: {status}")
```
Format validation checks the structure, MX record verification confirms the domain accepts email, and the disposable domain check filters temporary email services.
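Since MX records are a property of the domain rather than the individual address, you can cut DNS traffic by resolving each domain once and reusing the result. A sketch of the grouping step — the lookup itself is passed in as `check_domain`, a hypothetical callable standing in for `has_valid_mx`:

```python
from collections import defaultdict

def validate_in_bulk(emails, check_domain):
    """Validate many addresses with one domain check per distinct domain.

    check_domain is any callable domain -> bool (e.g. an MX lookup).
    Returns a dict mapping each email to its result.
    """
    by_domain = defaultdict(list)
    for email in emails:
        if "@" in email:
            by_domain[email.split("@")[1].lower()].append(email)
    results = {}
    for domain, addrs in by_domain.items():
        ok = check_domain(domain)  # one lookup covers every address at this domain
        for addr in addrs:
            results[addr] = ok
    return results
```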
Step 5: Build a Batch Processing Pipeline
For lead generation, you typically need to process hundreds of websites. Here's a complete pipeline:
```python
from searchhive import ScrapeForge, DeepDive
import re, json, dns.resolver, time

API_KEY = "YOUR_API_KEY"

DISPOSABLE_DOMAINS = {
    "mailinator.com", "guerrillamail.com", "tempmail.com",
    "throwaway.email", "yopmail.com", "10minutemail.com"
}
CONTACT_PATHS = ["/contact", "/about", "/team", "/about-us", "/contact-us"]

def find_emails_for_site(url):
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)
    all_content = ""
    for path in CONTACT_PATHS:
        try:
            result = scrape.scrape(url.rstrip("/") + path, format="markdown", render_js=True)
            if result.content:
                all_content += result.content + "\n"
        except Exception:
            pass
    if not all_content.strip():
        return []
    # Regex extraction
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    regex_emails = set(re.findall(pattern, all_content))
    # AI extraction
    try:
        ai_result = deep.extract(
            content=all_content,
            schema={
                "type": "object",
                "properties": {
                    "emails": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                }
            }
        )
        ai_emails = set(e.lower() for e in ai_result.data.get("emails", []))
    except Exception:
        ai_emails = set()
    # Merge, deduplicate, validate
    all_emails = regex_emails | ai_emails
    validated = []
    for email in all_emails:
        email = email.lower()
        domain = email.split("@")[1] if "@" in email else ""
        if domain not in DISPOSABLE_DOMAINS and re.match(r'^[\w.+-]+@[\w.-]+\.[a-z]{2,}$', email):
            validated.append(email)
    return list(set(validated))

# Process a list of websites
websites = [
    "https://www.company-a.com",
    "https://www.company-b.com",
    "https://www.company-c.com",
]

results = {}
for site in websites:
    print(f"Processing {site}...")
    emails = find_emails_for_site(site)
    results[site] = emails
    print(f"  Found {len(emails)} emails: {emails}")
    time.sleep(2)  # Be polite between sites

with open("email_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Done: processed {len(websites)} sites")
```
Step 6: Handle Common Issues
Obfuscated emails. Some sites use "name [at] domain [dot] com" or JavaScript encoding to prevent scraping. DeepDive catches most obfuscation patterns. For JavaScript-encoded emails, ScrapeForge's `render_js=True` executes the decoding script.
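If you want a lightweight fallback that avoids an AI call, a regex pass can normalize the most common patterns before running extraction. This sketch only covers the "[at]"/"(at)" and "[dot]"/"(dot)" variants — real-world obfuscation is far more varied:

```python
import re

def deobfuscate(text):
    """Rewrite 'name [at] domain [dot] com' style text into name@domain.com."""
    # Replace bracketed/parenthesized 'at' (and surrounding spaces) with @
    text = re.sub(r'\s*[\[\(]\s*at\s*[\]\)]\s*', '@', text, flags=re.IGNORECASE)
    # Replace bracketed/parenthesized 'dot' with a literal .
    text = re.sub(r'\s*[\[\(]\s*dot\s*[\]\)]\s*', '.', text, flags=re.IGNORECASE)
    return text
```

Run it on the scraped content first, then feed the result to the regex extractor from Step 2.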
Rate limiting and blocking. Scraping multiple pages from the same domain triggers rate limits. SearchHive's proxy rotation distributes requests across IPs, but add delays (2-3 seconds) between requests for multi-page scraping on the same domain.
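For transient failures (rate-limit responses, timeouts), a simple retry wrapper with exponential backoff is usually enough. This is a generic sketch — the retry count, delays, and the broad `Exception` catch are assumptions to tune against the errors you actually see:

```python
import time

def with_backoff(fn, retries=3, base_delay=2.0):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... before retrying.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_backoff(lambda: client.scrape(url, format="markdown", render_js=True))`.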
Privacy compliance. Only scrape publicly listed email addresses. Check each website's robots.txt and terms of service. GDPR and CAN-SPAM regulations apply to how you use collected emails for outreach.
Data staleness. Contact pages change frequently. For ongoing lead generation, re-scrape targets on a weekly or monthly schedule.
Complete Code Example
```python
from searchhive import ScrapeForge, DeepDive
import re, json, time

API_KEY = "YOUR_API_KEY"

def extract_emails_from_site(url):
    scrape = ScrapeForge(api_key=API_KEY)
    deep = DeepDive(api_key=API_KEY)
    paths = ["/contact", "/about", "/about-us", "/team", "/contact-us"]
    content = ""
    for p in paths:
        try:
            r = scrape.scrape(url.rstrip("/") + p, format="markdown", render_js=True)
            if r.content:
                content += r.content + "\n"
        except Exception:
            continue
    if not content.strip():
        return {"url": url, "emails": []}
    # Regex pass (case-insensitive so uppercase addresses are not missed)
    emails = set(e.lower() for e in re.findall(r'[\w.+-]+@[\w.-]+\.[a-z]{2,}', content, re.IGNORECASE))
    # AI pass
    try:
        ai = deep.extract(
            content=content,
            schema={"type": "object", "properties": {"emails": {"type": "array", "items": {"type": "string"}}}}
        )
        emails |= set(e.lower() for e in ai.data.get("emails", []))
    except Exception:
        pass
    return {"url": url, "emails": sorted(emails)}

if __name__ == "__main__":
    targets = [
        "https://www.company-a.com",
        "https://www.company-b.com",
    ]
    results = []
    for t in targets:
        print(f"Scanning {t}...")
        r = extract_emails_from_site(t)
        results.append(r)
        print(f"  Found {len(r['emails'])} emails")
        time.sleep(2)
    with open("emails_found.json", "w") as f:
        json.dump(results, f, indent=2)
    print(f"Complete: {sum(len(r['emails']) for r in results)} total emails")
```
Next Steps
- Bulk processing: Feed a list of domains from a CSV or database for large-scale lead generation
- Categorize by role: Use DeepDive to identify job titles (CEO, CTO, support) associated with each email
- Integrate with CRM: Push validated emails directly to HubSpot, Salesforce, or Pipedrive
- Monitor for changes: Schedule weekly re-scrapes to detect when contact info changes
Get started with SearchHive's free tier -- 500 credits, no credit card required. See the API docs for full reference.
See also: /blog/how-to-extract-contact-info-from-websites-with-python and /blog/linkedin-scraping-apis-best-tools-for-lead-generation.