Lead generation is the backbone of sales-driven businesses, and web scraping is one of the fastest ways to build a pipeline of potential customers. This tutorial walks you through building a complete lead generation tool in Python using SearchHive's APIs -- from scraping business directories to enriching contact data and storing results in a structured format.
## Prerequisites

- Python 3.8+ with `requests` and `pandas` installed
- A SearchHive API key (free tier includes 500 credits)
- Basic familiarity with Python and REST APIs

```bash
pip install requests pandas
```
## Key Takeaways
- A practical lead generation tool needs three components: source scraping, data enrichment, and structured output
- SearchHive's SwiftSearch API finds business listings, ScrapeForge extracts contact details, and DeepDive enriches incomplete records
- The complete tool handles pagination, rate limiting, and deduplication
- You can process thousands of leads per month on SearchHive's Builder plan ($49/mo)
## Step 1: Define Your Lead Source
Start by identifying where your target leads exist online. Common sources include:
- Business directories (Yelp, Yellow Pages, industry-specific directories)
- Google Maps / Google Business profiles
- LinkedIn company pages
- Industry conference attendee lists
- Chamber of commerce websites
For this tutorial, we will scrape business listings from a directory-style page.
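If you plan to target more than one source, it helps to keep them in a single configuration structure so the scraping loop later stays generic. A minimal sketch -- the directory names and URL templates below are placeholders, not real endpoints:

```python
# Placeholder seed sources -- swap in the directories you actually target.
LEAD_SOURCES = {
    "example-directory": "https://example-directory.com/restaurants?page={page}",
    "another-directory": "https://another-directory.example/listings?p={page}",
}

def build_urls(source_name, pages=10):
    """Expand a source's URL template into one URL per page."""
    template = LEAD_SOURCES[source_name]
    return [template.format(page=page) for page in range(1, pages + 1)]
```

The pagination loops in the following steps can then iterate over `build_urls(...)` instead of hard-coding one directory's URL pattern.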
## Step 2: Scrape Business Listings
Use SearchHive's ScrapeForge API to extract structured listing data from a business directory:
```python
import requests
import time

API_KEY = "your_searchhive_key"
BASE_URL = "https://api.searchhive.dev/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

def scrape_directory(url):
    """Scrape business listings from a directory page."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers=headers,
        json={
            "url": url,
            "render_js": True,
            "extract": {
                "businesses": {
                    "selector": ".listing-card",
                    "fields": {
                        "name": ".business-name",
                        "category": ".category",
                        "phone": ".phone-number",
                        "address": ".address",
                        "website": {"selector": "a.website", "attr": "href"},
                        "rating": ".rating"
                    }
                }
            }
        }
    )
    if response.status_code == 200:
        return response.json().get("businesses", [])
    else:
        print(f"Error scraping {url}: {response.status_code}")
        return []

# Scrape multiple pages
leads = []
for page in range(1, 11):
    url = f"https://example-directory.com/restaurants?page={page}"
    page_leads = scrape_directory(url)
    leads.extend(page_leads)
    print(f"Page {page}: {len(page_leads)} leads found")
    time.sleep(1)  # Rate limiting
```
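Directory pages occasionally fail with transient errors (timeouts, 429s, 5xx), and a single flaky page shouldn't drop out of the run. A small retry wrapper with exponential backoff handles this; it is a sketch, not part of any SearchHive client, and the retry counts and delays are assumptions you should tune:

```python
import time

import requests

def post_with_retry(url, max_retries=3, base_delay=1.0, **kwargs):
    """POST with exponential backoff on transient failures.

    Retries on connection errors, timeouts, 429, and 5xx responses.
    Delays grow as base_delay * 2**attempt (1s, 2s, 4s by default).
    """
    for attempt in range(max_retries):
        try:
            response = requests.post(url, timeout=30, **kwargs)
            if response.status_code not in (429, 500, 502, 503, 504):
                return response
        except requests.RequestException:
            pass  # Fall through to the backoff sleep below
        time.sleep(base_delay * (2 ** attempt))
    return None  # Caller treats None as a failed scrape
```

You could call `post_with_retry(f"{BASE_URL}/scrape", headers=headers, json=payload)` inside `scrape_directory` in place of the bare `requests.post`, treating a `None` return like a non-200 response.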
## Step 3: Enrich Leads with DeepDive
Not every directory listing has complete contact info. Use SearchHive's DeepDive API to visit each business website and extract additional details:
```python
def enrich_lead(website_url):
    """Visit a business website and extract additional info."""
    response = requests.post(
        f"{BASE_URL}/deepdive",
        headers=headers,
        json={
            "url": website_url,
            "prompt": (
                "Extract the following if available: "
                "contact email, phone number, company size "
                "(small/medium/large/enterprise), industry, "
                "and whether they have a careers/jobs page (boolean). "
                "Return as JSON."
            )
        }
    )
    if response.status_code == 200:
        return response.json()
    return {}

# Enrich leads that have websites
enriched_leads = []
for lead in leads:
    if lead.get("website"):
        extra = enrich_lead(lead["website"])
        lead.update(extra)
        enriched_leads.append(lead)
        time.sleep(2)  # Be respectful to target sites
    else:
        enriched_leads.append(lead)

print(f"Total leads: {len(enriched_leads)}")
```
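Because the prompt asks the model to "Return as JSON," the extracted fields may come back as an already-parsed dict or as a JSON string, possibly wrapped in extra prose -- how SearchHive structures this payload is an assumption here. A small normalizer defends against both shapes so `lead.update(...)` always receives a dict:

```python
import json
import re

def parse_enrichment(payload):
    """Normalize an enrichment payload into a plain dict.

    Accepts either a dict (already parsed) or a string containing a
    JSON object, and returns {} for anything unparseable.
    """
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, str):
        # Pull out the first {...} block in case the model added prose.
        match = re.search(r"\{.*\}", payload, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                return {}
    return {}
```

With this in place, the enrichment loop becomes `lead.update(parse_enrichment(enrich_lead(lead["website"])))`.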
## Step 4: Find Email Addresses with SwiftSearch
For leads missing email addresses, use SwiftSearch to search for publicly listed contact emails:
```python
import re

def find_email(business_name, website):
    """Search for a business email address."""
    response = requests.post(
        f"{BASE_URL}/search",
        headers=headers,
        json={
            "query": f"{business_name} {website} contact email",
            "num_results": 3
        }
    )
    if response.status_code == 200:
        results = response.json().get("results", [])
        # Extract emails from search result snippets
        emails = set()
        for r in results:
            found = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', r.get("snippet", ""))
            emails.update(found)
        return list(emails)
    return []

# Fill in missing emails
for lead in enriched_leads:
    if not lead.get("email"):
        emails = find_email(lead["name"], lead.get("website", ""))
        if emails:
            lead["email"] = emails[0]  # Use first found
```
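The regex above will also match strings that merely look like addresses -- asset filenames such as `logo@2x.png` are a classic false positive -- and "use first found" treats every match equally. A small filter that drops filename lookalikes and prefers addresses on the lead's own domain tightens the results (the suffix blocklist here is an illustrative assumption, not exhaustive):

```python
# File extensions that commonly produce lookalike matches (e.g. logo@2x.png).
NON_EMAIL_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")

def pick_best_email(emails, website=""):
    """Drop filename lookalikes and prefer emails on the lead's own domain."""
    candidates = [e for e in emails if not e.lower().endswith(NON_EMAIL_SUFFIXES)]
    if not candidates:
        return None
    # Prefer an address whose domain appears in the lead's website URL.
    for email in candidates:
        domain = email.split("@", 1)[1].lower()
        if domain and domain in website.lower():
            return email
    return candidates[0]
```

In the fill-in loop, `lead["email"] = pick_best_email(emails, lead.get("website", ""))` then replaces the bare `emails[0]`.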
## Step 5: Deduplicate and Score Leads
Clean your lead list by removing duplicates and applying a basic scoring system:
```python
import pandas as pd

def score_lead(lead):
    """Score a lead from 0-100 based on data completeness."""
    score = 0
    if lead.get("email"):
        score += 30
    if lead.get("phone"):
        score += 20
    if lead.get("website"):
        score += 15
    try:
        if float(lead.get("rating") or 0) >= 4.0:
            score += 15
    except (TypeError, ValueError):
        pass  # Missing or non-numeric rating earns no points
    if lead.get("company_size") in ("medium", "large", "enterprise"):
        score += 20
    return min(score, 100)

# Convert to a DataFrame
df = pd.DataFrame(enriched_leads)

# Deduplicate by name + phone
df = df.drop_duplicates(subset=["name", "phone"], keep="first")

# Apply scoring
df["score"] = df.apply(score_lead, axis=1)

# Sort by score, best leads first
df = df.sort_values("score", ascending=False)

print(f"Unique leads: {len(df)}")
print(f"High-quality leads (score >= 70): {len(df[df['score'] >= 70])}")
```
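One caveat on deduplicating by `name` + `phone`: the same business often appears with differently formatted numbers ("(555) 123-4567" vs "555-123-4567"), which defeats an exact-match dedupe. Normalizing phones to bare digits first catches those variants -- a generic sketch, not tied to any SearchHive feature, and the US-centric country-code handling is an assumption:

```python
import re

def normalize_phone(phone):
    """Reduce a phone number to its digits so formatting variants match."""
    if not isinstance(phone, str):
        return ""
    digits = re.sub(r"\D", "", phone)
    # Treat US numbers with a leading country code as the same number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```

Apply it with `df["phone_key"] = df["phone"].map(normalize_phone)` and deduplicate on `["name", "phone_key"]` instead.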
## Step 6: Export Results
Save your leads in multiple formats for different use cases:
```python
# CSV for spreadsheets
df.to_csv("leads.csv", index=False)

# JSON for CRM import
df.to_json("leads.json", orient="records", indent=2)

# Filter high-quality leads for immediate outreach
hot_leads = df[df["score"] >= 70]
hot_leads.to_csv("hot-leads.csv", index=False)

print(f"Exported {len(df)} leads to CSV and JSON")
print(f"Hot leads ready for outreach: {len(hot_leads)}")
```
## Step 7: Automate with Scheduling
Wrap everything in a scheduled pipeline that runs daily or weekly:
```python
import datetime

import schedule  # pip install schedule

def daily_lead_gen():
    """Run the full lead generation pipeline."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M")

    # Scrape
    leads = []
    for page in range(1, 11):
        url = f"https://example-directory.com/restaurants?page={page}"
        page_leads = scrape_directory(url)
        leads.extend(page_leads)
        time.sleep(1)

    # Enrich top leads only (save credits)
    leads_with_sites = [lead for lead in leads if lead.get("website")]
    for lead in leads_with_sites[:50]:  # Cap at 50 enrichments
        extra = enrich_lead(lead["website"])
        lead.update(extra)
        time.sleep(2)

    # Score and export
    df = pd.DataFrame(leads)
    df = df.drop_duplicates(subset=["name", "phone"], keep="first")
    df["score"] = df.apply(score_lead, axis=1)
    df = df.sort_values("score", ascending=False)
    df.to_csv(f"leads_{timestamp}.csv", index=False)
    print(f"Pipeline complete: {len(df)} leads saved")

# Schedule: run daily at 8 AM
schedule.every().day.at("08:00").do(daily_lead_gen)

while True:
    schedule.run_pending()
    time.sleep(60)
```
## Common Issues

- Rate limiting: Most directories limit how many pages you can scrape per minute. Use `time.sleep()` between requests and respect `robots.txt`. SearchHive's proxy rotation handles IP-level rate limits automatically.
- Missing data: Not all listings have complete info. The enrichment step (Step 3) fills gaps, but budget your DeepDive credits -- use them only on leads that have websites.
- Changing page structures: If a directory updates its HTML, your CSS selectors will break. Consider using DeepDive with natural language prompts instead of fragile selectors.
- Legal considerations: Only scrape publicly available data. Check the target site's terms of service. Do not scrape personal data (PII) without consent. Many jurisdictions have data protection laws that apply to automated collection.
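Checking `robots.txt` before scraping a directory is straightforward with Python's standard library. A minimal check -- whether and how strictly your scraper honors the result is a policy decision, but the lookup itself is cheap:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # No reachable robots.txt: assume allowed
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed(url)` before `scrape_directory(url)` in the pagination loop lets you skip disallowed pages instead of fetching them.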
## Next Steps
- Connect your lead pipeline to a CRM using an API integration
- Build a simple web dashboard to view and filter leads
- Add email verification (SearchHive can check if emails are valid before you reach out)
- Scale up by scraping multiple directories in parallel
## Start Building for Free
SearchHive's free tier gives you 500 credits to test the entire pipeline -- SwiftSearch for discovery, ScrapeForge for extraction, and DeepDive for enrichment. No credit card required. Sign up here and check the API docs for the complete reference.