Lead generation is the backbone of sales-driven businesses, and web scraping is one of the fastest ways to build a pipeline of potential customers. This tutorial walks you through building a complete lead generation tool in Python using SearchHive's APIs -- from scraping business directories to enriching contact data and storing results in a structured format.
## Prerequisites

- Python 3.8+ with `requests` and `pandas` installed
- A SearchHive API key (free tier includes 500 credits)
- Basic familiarity with Python and REST APIs

```bash
pip install requests pandas
```
## Key Takeaways
- A practical lead generation tool needs three components: source scraping, data enrichment, and structured output
- SearchHive's SwiftSearch API finds business listings, ScrapeForge extracts contact details, and DeepDive enriches incomplete records
- The complete tool handles pagination, rate limiting, and deduplication
- You can process thousands of leads per month on SearchHive's Builder plan ($49/mo)
## Step 1: Define Your Lead Source
Start by identifying where your target leads exist online. Common sources include:
- Business directories (Yelp, Yellow Pages, industry-specific directories)
- Google Maps / Google Business profiles
- LinkedIn company pages
- Industry conference attendee lists
- Chamber of commerce websites
For this tutorial, we will scrape business listings from a directory-style page.
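If you plan to target more than one source, it helps to keep them in a single configuration structure so the scraping loop later stays generic. A minimal sketch -- the directory names and URL templates below are placeholders, not real endpoints:

```python
# Placeholder seed sources -- swap in the directories you actually target.
LEAD_SOURCES = {
    "example-directory": "https://example-directory.com/restaurants?page={page}",
    "another-directory": "https://another-directory.example/listings?p={page}",
}

def build_urls(source_name, pages=10):
    """Expand a source's URL template into one URL per page."""
    template = LEAD_SOURCES[source_name]
    return [template.format(page=page) for page in range(1, pages + 1)]
```

The pagination loops in the following steps can then iterate over `build_urls(...)` instead of hard-coding one directory's URL pattern.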
## Step 2: Scrape Business Listings
Use SearchHive's ScrapeForge API to extract structured listing data from a business directory:
```python
import requests
import time

API_KEY = "your_searchhive_key"
BASE_URL = "https://api.searchhive.dev/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

def scrape_directory(url):
    """Scrape business listings from a directory page."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers=headers,
        json={
            "url": url,
            "render_js": True,
            "extract": {
                "businesses": {
                    "selector": ".listing-card",
                    "fields": {
                        "name": ".business-name",
                        "category": ".category",
                        "phone": ".phone-number",
                        "address": ".address",
                        "website": {"selector": "a.website", "attr": "href"},
                        "rating": ".rating"
                    }
                }
            }
        }
    )
    if response.status_code == 200:
        return response.json().get("businesses", [])
    else:
        print(f"Error scraping {url}: {response.status_code}")
        return []

# Scrape multiple pages
leads = []
for page in range(1, 11):
    url = f"https://example-directory.com/restaurants?page={page}"
    page_leads = scrape_directory(url)
    leads.extend(page_leads)
    print(f"Page {page}: {len(page_leads)} leads found")
    time.sleep(1)  # Rate limiting
```
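Directory pages occasionally fail with transient errors (timeouts, 429s, 5xx), and a single flaky page shouldn't drop out of the run. A small retry wrapper with exponential backoff handles this; it is a sketch, not part of any SearchHive client, and the retry counts and delays are assumptions you should tune:

```python
import time

import requests

def post_with_retry(url, max_retries=3, base_delay=1.0, **kwargs):
    """POST with exponential backoff on transient failures.

    Retries on connection errors, timeouts, 429, and 5xx responses.
    Delays grow as base_delay * 2**attempt (1s, 2s, 4s by default).
    """
    for attempt in range(max_retries):
        try:
            response = requests.post(url, timeout=30, **kwargs)
            if response.status_code not in (429, 500, 502, 503, 504):
                return response
        except requests.RequestException:
            pass  # Fall through to the backoff sleep below
        time.sleep(base_delay * (2 ** attempt))
    return None  # Caller treats None as a failed scrape
```

You could call `post_with_retry(f"{BASE_URL}/scrape", headers=headers, json=payload)` inside `scrape_directory` in place of the bare `requests.post`, treating a `None` return like a non-200 response.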
## Step 3: Enrich Leads with DeepDive
Not every directory listing has complete contact info. Use SearchHive's DeepDive API to visit each business website and extract additional details:
```python
def enrich_lead(website_url):
    """Visit a business website and extract additional info."""
    response = requests.post(
        f"{BASE_URL}/deepdive",
        headers=headers,
        json={
            "url": website_url,
            "prompt": (
                "Extract the following if available: "
                "contact email, phone number, company size "
                "(small/medium/large/enterprise), industry, "
                "and whether they have a careers/jobs page (boolean). "
                "Return as JSON."
            )
        }
    )
    if response.status_code == 200:
        return response.json()
    return {}

# Enrich leads that have websites
enriched_leads = []
for lead in leads:
    if lead.get("website"):
        extra = enrich_lead(lead["website"])
        lead.update(extra)
        enriched_leads.append(lead)
        time.sleep(2)  # Be respectful to target sites
    else:
        enriched_leads.append(lead)

print(f"Total leads: {len(enriched_leads)}")
```
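Because the prompt asks the model to "Return as JSON," the extracted fields may come back as an already-parsed dict or as a JSON string, possibly wrapped in extra prose -- how SearchHive structures this payload is an assumption here. A small normalizer defends against both shapes so `lead.update(...)` always receives a dict:

```python
import json
import re

def parse_enrichment(payload):
    """Normalize an enrichment payload into a plain dict.

    Accepts either a dict (already parsed) or a string containing a
    JSON object, and returns {} for anything unparseable.
    """
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, str):
        # Pull out the first {...} block in case the model added prose.
        match = re.search(r"\{.*\}", payload, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                return {}
    return {}
```

With this in place, the enrichment loop becomes `lead.update(parse_enrichment(enrich_lead(lead["website"])))`.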
## Step 4: Find Email Addresses with SwiftSearch
For leads missing email addresses, use SwiftSearch to search for publicly listed contact emails:
```python
import re

def find_email(business_name, website):
    """Search for a business email address."""
    response = requests.post(
        f"{BASE_URL}/search",
        headers=headers,
        json={
            "query": f"{business_name} {website} contact email",
            "num_results": 3
        }
    )
    if response.status_code == 200:
        results = response.json().get("results", [])
        # Extract emails from search result snippets
        emails = set()
        for r in results:
            found = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', r.get("snippet", ""))
            emails.update(found)
        return list(emails)
    return []

# Fill in missing emails
for lead in enriched_leads:
    if not lead.get("email"):
        emails = find_email(lead["name"], lead.get("website", ""))
        if emails:
            lead["email"] = emails[0]  # Use first found
```
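The regex above will also match strings that merely look like addresses -- asset filenames such as `logo@2x.png` are a classic false positive -- and "use first found" treats every match equally. A small filter that drops filename lookalikes and prefers addresses on the lead's own domain tightens the results (the suffix blocklist here is an illustrative assumption, not exhaustive):

```python
# File extensions that commonly produce lookalike matches (e.g. logo@2x.png).
NON_EMAIL_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")

def pick_best_email(emails, website=""):
    """Drop filename lookalikes and prefer emails on the lead's own domain."""
    candidates = [e for e in emails if not e.lower().endswith(NON_EMAIL_SUFFIXES)]
    if not candidates:
        return None
    # Prefer an address whose domain appears in the lead's website URL.
    for email in candidates:
        domain = email.split("@", 1)[1].lower()
        if domain and domain in website.lower():
            return email
    return candidates[0]
```

In the fill-in loop, `lead["email"] = pick_best_email(emails, lead.get("website", ""))` then replaces the bare `emails[0]`.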
## Step 5: Deduplicate and Score Leads
Clean your lead list by removing duplicates and applying a basic scoring system:
```python
import pandas as pd

def score_lead(lead):
    """Score a lead from 0-100 based on data completeness."""
    score = 0
    if lead.get("email"):
        score += 30
    if lead.get("phone"):
        score += 20
    if lead.get("website"):
        score += 15
    try:
        if float(lead.get("rating") or 0) >= 4.0:
            score += 15
    except (TypeError, ValueError):
        pass  # Missing or non-numeric rating earns no points
    if lead.get("company_size") in ("medium", "large", "enterprise"):
        score += 20
    return min(score, 100)

# Convert to a DataFrame
df = pd.DataFrame(enriched_leads)

# Deduplicate by name + phone
df = df.drop_duplicates(subset=["name", "phone"], keep="first")

# Apply scoring
df["score"] = df.apply(score_lead, axis=1)

# Sort by score, best leads first
df = df.sort_values("score", ascending=False)

print(f"Unique leads: {len(df)}")
print(f"High-quality leads (score >= 70): {len(df[df['score'] >= 70])}")
```
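One caveat on deduplicating by `name` + `phone`: the same business often appears with differently formatted numbers ("(555) 123-4567" vs "555-123-4567"), which defeats an exact-match dedupe. Normalizing phones to bare digits first catches those variants -- a generic sketch, not tied to any SearchHive feature, and the US-centric country-code handling is an assumption:

```python
import re

def normalize_phone(phone):
    """Reduce a phone number to its digits so formatting variants match."""
    if not isinstance(phone, str):
        return ""
    digits = re.sub(r"\D", "", phone)
    # Treat US numbers with a leading country code as the same number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```

Apply it with `df["phone_key"] = df["phone"].map(normalize_phone)` and deduplicate on `["name", "phone_key"]` instead.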
## Step 6: Export Results
Save your leads in multiple formats for different use cases:
```python
# CSV for spreadsheets
df.to_csv("leads.csv", index=False)

# JSON for CRM import
df.to_json("leads.json", orient="records", indent=2)

# Filter high-quality leads for immediate outreach
hot_leads = df[df["score"] >= 70]
hot_leads.to_csv("hot-leads.csv", index=False)

print(f"Exported {len(df)} leads to CSV and JSON")
print(f"Hot leads ready for outreach: {len(hot_leads)}")
```
## Step 7: Automate with Scheduling
Wrap everything in a scheduled pipeline that runs daily or weekly:
```python
import datetime

import schedule  # pip install schedule

def daily_lead_gen():
    """Run the full lead generation pipeline."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M")

    # Scrape
    leads = []
    for page in range(1, 11):
        url = f"https://example-directory.com/restaurants?page={page}"
        page_leads = scrape_directory(url)
        leads.extend(page_leads)
        time.sleep(1)

    # Enrich top leads only (save credits)
    leads_with_sites = [lead for lead in leads if lead.get("website")]
    for lead in leads_with_sites[:50]:  # Cap at 50 enrichments
        extra = enrich_lead(lead["website"])
        lead.update(extra)
        time.sleep(2)

    # Score and export
    df = pd.DataFrame(leads)
    df = df.drop_duplicates(subset=["name", "phone"], keep="first")
    df["score"] = df.apply(score_lead, axis=1)
    df = df.sort_values("score", ascending=False)
    df.to_csv(f"leads_{timestamp}.csv", index=False)
    print(f"Pipeline complete: {len(df)} leads saved")

# Schedule: run daily at 8 AM
schedule.every().day.at("08:00").do(daily_lead_gen)

while True:
    schedule.run_pending()
    time.sleep(60)
```
## Common Issues

- Rate limiting: Most directories limit how many pages you can scrape per minute. Use `time.sleep()` between requests and respect `robots.txt`. SearchHive's proxy rotation handles IP-level rate limits automatically.
- Missing data: Not all listings have complete info. The enrichment step (Step 3) fills gaps, but budget your DeepDive credits -- use them only on leads that have websites.
- Changing page structures: If a directory updates its HTML, your CSS selectors will break. Consider using DeepDive with natural language prompts instead of fragile selectors.
- Legal considerations: Only scrape publicly available data. Check the target site's terms of service. Do not scrape personal data (PII) without consent. Many jurisdictions have data protection laws that apply to automated collection.
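Checking `robots.txt` before scraping a directory is straightforward with Python's standard library. A minimal check -- whether and how strictly your scraper honors the result is a policy decision, but the lookup itself is cheap:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # No reachable robots.txt: assume allowed
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed(url)` before `scrape_directory(url)` in the pagination loop lets you skip disallowed pages instead of fetching them.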
## Next Steps
- Connect your lead pipeline to a CRM using an API integration
- Build a simple web dashboard to view and filter leads
- Add email verification (SearchHive can check if emails are valid before you reach out)
- Scale up by scraping multiple directories in parallel
## Start Building for Free
SearchHive's free tier gives you 500 credits to test the entire pipeline -- SwiftSearch for discovery, ScrapeForge for extraction, and DeepDive for enrichment. No credit card required. Sign up here and check the API docs for the complete reference.