Machine learning models are only as good as their training data. For NLP tasks, computer vision, recommendation systems, and anomaly detection, the web is the largest labeled and unlabeled dataset available. The problem isn't finding data — it's collecting it reliably at scale without spending your entire engineering budget on infrastructure.
This guide covers how web scraping APIs fit into ML data collection pipelines, which APIs work best for different ML tasks, and how to build collection workflows that actually scale.
Key Takeaways
- ML data collection needs reliability and volume more than fancy features — uptime and throughput matter more than UI
- SearchHive's credit system is ideal for ML workflows — search to discover data sources, scrape to collect, research to validate
- Structured extraction APIs eliminate labeling overhead — pulling fields directly into JSON skips the annotation pipeline
- Rate limiting and proxy management are the real bottlenecks — API-based scrapers handle this; self-hosted scrapers don't
- Budget 5-15% of your ML project cost for data collection — it's the foundation everything else builds on
- Incremental collection beats bulk scraping — schedule regular small crawls instead of massive one-time jobs
Why Use a Scraping API Instead of Building Your Own
Running your own scraping infrastructure means managing proxy pools, headless browser clusters, CAPTCHA solvers, IP rotation logic, and retry queues. That's a full-time engineering effort before you collect a single data point.
Scraping APIs abstract all of that into a simple HTTP call. You pay per page or per credit, and the provider handles:
- Proxy rotation — residential and datacenter IP pools
- JavaScript rendering — headless Chrome/Firefox for SPAs
- Anti-bot detection — request fingerprinting, timing randomization
- Rate limiting — automatic throttling to avoid blocks
- Error handling — retries, fallback proxies, dead-page detection
The tradeoff is cost. Self-hosted scraping is cheaper per page at very high volumes (millions+) but requires dedicated engineering. API-based scraping starts working immediately and scales linearly.
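To make that tradeoff concrete, here is a rough break-even sketch. The per-page prices and the monthly engineering cost are illustrative assumptions, not vendor quotes — plug in your own numbers:

```python
def breakeven_pages(api_cost_per_page, selfhosted_cost_per_page, monthly_eng_cost):
    """Pages per month above which self-hosting beats an API.

    Self-hosted total = monthly_eng_cost + pages * selfhosted_cost_per_page
    API total         = pages * api_cost_per_page
    Break-even is where the two totals are equal.
    """
    return monthly_eng_cost / (api_cost_per_page - selfhosted_cost_per_page)

# Illustrative numbers: $0.0005/page via API vs $0.0001/page in self-hosted
# infra costs, plus ~$8,000/month of engineering time to run the fleet.
pages = breakeven_pages(0.0005, 0.0001, 8000)
print(f"Self-hosting pays off above ~{pages:,.0f} pages/month")
```

With these assumptions, the break-even lands around 20 million pages per month — consistent with the "millions+" threshold above.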
Choosing the Right Scraping API for ML Workloads
Different ML tasks have different data requirements. Here's how to match APIs to use cases.
Text Classification and NLP
NLP models need large corpora of text — articles, reviews, product descriptions, social media posts. Clean text output (markdown or plain text) matters more than HTML structure.
Best options: SearchHive ScrapeForge (markdown output, structured extraction), Firecrawl (markdown), Jina AI Reader (free Markdown conversion)
```python
import requests

API_KEY = "your_searchhive_key"
BASE = "https://api.searchhive.dev/v1"

def collect_training_corpus(query, num_pages=100):
    # Discover relevant pages
    search_resp = requests.get(f"{BASE}/search", params={
        "api_key": API_KEY,
        "query": query,
        "num_results": min(num_pages, 20)
    })
    corpus = []
    urls = [r["url"] for r in search_resp.json()["results"]]

    # Scrape each page for clean text
    for url in urls:
        try:
            scrape_resp = requests.get(f"{BASE}/scrape", params={
                "api_key": API_KEY,
                "url": url,
                "format": "markdown",
                "remove_links": True,
                "remove_images": True
            }, timeout=30)
            content = scrape_resp.json().get("content", "")
            if len(content) > 200:
                corpus.append({
                    "url": url,
                    "text": content,
                    "label": query,
                    "char_count": len(content)
                })
        except Exception as e:
            print(f"Failed to scrape {url}: {e}")

    total = sum(c["char_count"] for c in corpus)
    print(f"Collected {len(corpus)} documents ({total:,} chars)")
    return corpus
```
Computer Vision
CV models need images — product photos, street scenes, medical imagery, documents. The scraping API needs to handle image-heavy pages and support binary download.
Best options: Apify (Actor-based, handles image downloading), ScrapingBee (raw HTML with image URLs)
```python
from apify_client import ApifyClient

client = ApifyClient("your_apify_token")

run = client.actor("apify/web-scraper").call(run_input={
    "startUrls": [{"url": "https://example.com/category"}],
    # pageFunction runs in the browser; jQuery comes from the actor's context
    "pageFunction": """async function pageFunction(context) {
        const $ = context.jQuery;
        const imgs = [];
        $("img.product-image").each(function () {
            imgs.push($(this).attr("src"));
        });
        return { url: context.request.url, images: imgs };
    }""",
    "maxPagesPerCrawl": 50
})

product_data = list(client.dataset(run["defaultDatasetId"]).iterate_items())
image_urls = []
for item in product_data:
    image_urls.extend(item.get("images", []))
print(f"Found {len(image_urls)} images across {len(product_data)} products")
```
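Once you have the image URLs, you still have to fetch the binaries yourself. A minimal download sketch — the size floor, `.jpg` naming, and directory layout are arbitrary choices, not part of any API:

```python
import hashlib
from pathlib import Path

import requests

def download_images(image_urls, out_dir="training_data/images", min_bytes=1024):
    """Fetch image binaries, naming each file by a hash of its URL
    so re-runs skip completed work and duplicate URLs collapse to one file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for url in image_urls:
        dest = out / (hashlib.sha256(url.encode()).hexdigest()[:16] + ".jpg")
        if dest.exists():
            continue  # already downloaded on a previous run
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            if len(resp.content) >= min_bytes:  # skip tracking pixels and icons
                dest.write_bytes(resp.content)
                saved += 1
        except requests.RequestException as e:
            print(f"Skipping {url}: {e}")
    return saved
```

Hashing the URL for the filename makes the download idempotent: interrupted runs can simply be restarted.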
Recommendation Systems
Recsys models need structured data — product names, prices, categories, ratings, descriptions. The key is extracting these fields consistently across thousands of pages with different HTML structures.
Best options: SearchHive ScrapeForge (structured extraction with extract parameter), Firecrawl (structured scraping)
```python
import requests
import json

API_KEY = "your_searchhive_key"
BASE = "https://api.searchhive.dev/v1"

def scrape_product_data(urls):
    products = []
    for url in urls:
        resp = requests.get(f"{BASE}/scrape", params={
            "api_key": API_KEY,
            "url": url,
            "format": "json",
            "extract": "name,price,category,rating,description"
        }, timeout=30)
        data = resp.json()
        if data.get("extracted"):
            products.append({"url": url, **data["extracted"]})
    return products

# Scrape product listings from search results
search = requests.get(f"{BASE}/search", params={
    "api_key": API_KEY,
    "query": "site:example.com/products",
    "num_results": 20
})
product_urls = [r["url"] for r in search.json()["results"]]
products = scrape_product_data(product_urls)

with open("training_data/products.jsonl", "w") as f:
    for p in products:
        f.write(json.dumps(p) + "\n")
print(f"Saved {len(products)} products to training data")
```
Building a Scalable Data Collection Pipeline
One-off scraping scripts break. Production ML data collection needs scheduling, deduplication, incremental updates, and monitoring.
Pipeline Architecture
```
Schedule Trigger
      |
      v
Source Discovery (SwiftSearch API)
      |
      v
URL Dedup (SQLite/local DB)
      |
      v
Content Extraction (ScrapeForge API)
      |
      v
Data Validation (schema check, min length)
      |
      v
Storage (JSONL / Parquet / Vector DB)
      |
      v
Training Pipeline
```
Implementation with Error Handling
```python
import requests
import json
import time
import sqlite3
from pathlib import Path

API_KEY = "your_searchhive_key"
BASE = "https://api.searchhive.dev/v1"

class DataCollector:
    def __init__(self, db_path="collected_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls "
            "(url TEXT PRIMARY KEY, scraped_at TIMESTAMP, status TEXT)"
        )
        self.conn.commit()

    def is_new(self, url):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE url = ?", (url,)
        ).fetchone()
        return row is None

    def mark_scraped(self, url, status="success"):
        self.conn.execute(
            "INSERT OR REPLACE INTO urls VALUES (?, datetime('now'), ?)",
            (url, status)
        )
        self.conn.commit()

    def discover_and_collect(self, query, max_pages=200, delay=0.5):
        # Discover
        search = requests.get(f"{BASE}/search", params={
            "api_key": API_KEY,
            "query": query,
            "num_results": 20
        })
        urls = [r["url"] for r in search.json()["results"]]
        new_urls = [u for u in urls if self.is_new(u)]
        print(f"Found {len(urls)} URLs, {len(new_urls)} new")

        # Collect
        collected = []
        for url in new_urls[:max_pages]:
            try:
                resp = requests.get(f"{BASE}/scrape", params={
                    "api_key": API_KEY,
                    "url": url,
                    "format": "markdown",
                    "extract": "title,body",
                    "max_chars": 10000
                }, timeout=30)
                data = resp.json()
                content = data.get("content", "")
                if len(content) < 100:
                    self.mark_scraped(url, "too_short")
                    continue
                collected.append({"url": url, "content": content})
                self.mark_scraped(url)
            except Exception as e:
                self.mark_scraped(url, f"error: {e}")
            time.sleep(delay)

        # Save
        out_path = Path("training_data/latest.jsonl")
        out_path.parent.mkdir(exist_ok=True)
        with open(out_path, "a") as f:
            for item in collected:
                f.write(json.dumps(item) + "\n")
        print(f"Collected {len(collected)} new documents")
        return collected

collector = DataCollector()
collector.discover_and_collect("machine learning tutorial dataset", max_pages=50)
```
Cost Planning for ML Data Collection
Budget planning for web data needs to account for discovery, collection, and validation phases:
| Phase | What You Do | Typical API Usage | SearchHive Cost |
|---|---|---|---|
| Discovery | Find relevant URLs | 5-20 search queries | ~50-200 credits |
| Collection | Scrape pages | 1K-100K pages | 1K-100K credits |
| Validation | Re-scrape failed pages | 10-20% of collection | 100-20K credits |
| Incremental updates | Weekly refresh | 5-10% of total monthly | Varies |
At the Builder tier ($49/month, 100K credits), you can collect roughly 80K-90K validated documents per month after accounting for discovery and retries. At the Unicorn tier ($199/month, 500K credits), that scales to 400K+ documents.
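The table above can be turned into a quick budget check. This sketch assumes one credit per scraped page (matching the table's 1K-100K pages to 1K-100K credits), a 15% validation re-scrape rate, and a flat 200 credits for discovery — adjust all three for your workload:

```python
def estimate_monthly_credits(pages, validation_rate=0.15, discovery_credits=200):
    """Rough monthly credit estimate from the phase breakdown:
    discovery is roughly flat, collection costs ~1 credit/page,
    and validation re-scrapes 10-20% of pages (15% assumed here)."""
    collection = pages
    validation = int(pages * validation_rate)
    return {
        "discovery": discovery_credits,
        "collection": collection,
        "validation": validation,
        "total": discovery_credits + collection + validation,
    }

budget = estimate_monthly_credits(80_000)
print(budget["total"])  # 92200 -- fits within the 100K-credit Builder tier
```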
Best Practices for ML Data Collection
Validate before you scale. Start with 100 pages. Check data quality — are fields populated correctly? Is the text clean? Are there encoding issues? Fix problems at 100 pages, not 100,000.
Deduplicate aggressively. The same content appears on multiple URLs (AMP pages, canonical duplicates, syndication). Dedup by content hash, not just URL.
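Content-hash dedup takes only a few lines. Hashing a whitespace- and case-normalized version of the text catches AMP and syndicated copies that differ only in URL or formatting:

```python
import hashlib

def dedup_by_content(docs, text_key="text"):
    """Drop documents whose normalized text hashes to one already seen."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc[text_key].lower().split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    {"url": "https://a.example/post", "text": "Same   article body."},
    {"url": "https://amp.example/post", "text": "same article body."},
]
print(len(dedup_by_content(docs)))  # 1 -- the AMP copy is dropped
```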
Track provenance. Record the source URL, scrape timestamp, and any extraction errors. When your model produces unexpected results, you need to trace the data back.
Respect robots.txt and terms of service. Some sites explicitly prohibit automated collection. Check before you scrape — especially for commercial ML training data.
Version your datasets. Models trained on different data versions produce different results. Tag each collection run and keep historical snapshots.
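One lightweight way to version JSONL datasets like the ones written by the pipeline above: copy each collection run to a timestamped snapshot and record its content hash in a manifest. The manifest layout here is one possible convention, not a standard:

```python
import hashlib
import json
import time
from pathlib import Path

def snapshot_dataset(jsonl_path, snapshot_dir="training_data/snapshots"):
    """Copy the current JSONL to a timestamped snapshot and write a
    manifest with its content hash, so training runs stay traceable."""
    src = Path(jsonl_path)
    data = src.read_bytes()
    tag = time.strftime("%Y%m%d-%H%M%S")
    out = Path(snapshot_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{src.stem}-{tag}.jsonl").write_bytes(data)
    manifest = {
        "tag": tag,
        "source": str(src),
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }
    (out / f"{src.stem}-{tag}.manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest
```

Recording the SHA-256 alongside the tag means two runs with identical content are detectable even if their timestamps differ.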
API Comparison for ML Workloads
| API | Best ML Use Case | Output Format | 100K Pages | Structured Fields |
|---|---|---|---|---|
| SearchHive ScrapeForge | NLP, Recsys | Markdown + JSON | $49/mo | Yes |
| Firecrawl | NLP | Markdown | $83/mo | Limited |
| Apify | CV, Complex | Custom | ~$100+/mo | Custom |
| ScrapingBee | General | HTML/Text | ~$99/mo | Via rules |
| Jina AI Reader | NLP (prototyping) | Markdown | $300/mo | No |
| Exa Contents | NLP | Text | $100/mo | No |
Getting Started
The fastest path from zero to training data: use SearchHive's free tier (500 credits) to build and validate your collection pipeline, then upgrade to Builder ($49/month) for production volume.
- Sign up at searchhive.dev — no credit card required
- Discover sources with SwiftSearch — find the sites with the data you need
- Collect with ScrapeForge — markdown output, structured extraction, one API
- Validate and iterate — check data quality before scaling
- Schedule regular collection — keep your training data fresh
The unified API means your search, scraping, and research all draw from the same credit pool. No juggling three subscriptions, no separate rate limits, no integration headaches.