Cloudflare Workers for Web Scraping: Complete Edge Computing Guide
Cloudflare Workers run JavaScript at the edge -- in data centers across 300+ cities worldwide. This makes them an interesting option for web scraping: low latency, no server management, and pay-per-request pricing. But Workers have real limitations for scraping that most tutorials gloss over. This guide covers what Workers can do, where they fall short, and when a dedicated scraping API is the better choice.
Key Takeaways
- Cloudflare Workers excel at lightweight API-based scraping and simple page fetches
- Workers have a 128MB memory limit, a CPU-time cap (10 ms per request on the free plan, 30 s on paid), and no DOM -- so JavaScript-heavy sites cannot be rendered with Workers alone
- For production scraping at scale, dedicated APIs like SearchHive's ScrapeForge handle JS rendering, proxies, and CAPTCHAs that Workers cannot
- Workers are best suited as orchestration layers that call dedicated scraping APIs, not as scrapers themselves
Prerequisites
- A Cloudflare account (free tier works)
- Node.js 18+ installed
- Basic JavaScript knowledge
- Wrangler CLI installed: npm install -g wrangler
Step 1: Set Up Your Cloudflare Worker
Initialize a new Worker project:
mkdir edge-scraper && cd edge-scraper
npx wrangler init
Choose "hello world" template when prompted. This creates the basic project structure:
edge-scraper/
wrangler.toml
package.json
src/index.js
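wrangler init also writes the wrangler.toml configuration file. A minimal version for this project looks roughly like the following -- the name and compatibility date are placeholders to adjust for your own setup:

```toml
# wrangler.toml -- minimal Worker configuration (values are placeholders)
name = "edge-scraper"
main = "src/index.js"
compatibility_date = "2024-01-01"
```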
Step 2: Basic Page Fetching
The simplest scraping use case is fetching page content. Workers can do this with the built-in fetch API:
// src/index.js
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const target = url.searchParams.get("url");
    if (!target) {
      return new Response("Missing ?url= parameter", { status: 400 });
    }
    const response = await fetch(target, {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
        "Accept": "text/html",
      },
    });
    const html = await response.text();
    const title = html.match(/<title>(.*?)<\/title>/)?.[1] || "No title";
    const textContent = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim()
      .slice(0, 5000);
    return Response.json({
      url: target,
      title: title,
      content: textContent,
    });
  },
};
Deploy with npx wrangler deploy. Test it:
curl "https://your-worker.workers.dev/?url=https://example.com"
This works for static HTML pages. The problem? Most modern websites are not static.
Step 3: The JavaScript Rendering Problem
Cloudflare Workers have no DOM and no browser rendering engine. They run JavaScript on V8, but they cannot execute a page's client-side JavaScript, which means sites built with React, Vue, Angular, or any SPA framework will return empty content shells.
Some people try to use htmlparser2 or linkedom in Workers for basic HTML parsing:
import { parseHTML } from "linkedom";

export default {
  async fetch(request, env) {
    const target = new URL(request.url).searchParams.get("url");
    const resp = await fetch(target);
    const html = await resp.text();
    const { document } = parseHTML(html);
    const headings = [...document.querySelectorAll("h1, h2, h3")].map(
      (el) => el.textContent
    );
    const links = [...document.querySelectorAll("a[href]")].map(
      (el) => ({ text: el.textContent, href: el.href })
    );
    return Response.json({ headings, links });
  },
};
This parses the raw HTML, but it still cannot execute JavaScript. If the content you need is rendered client-side, you are out of luck with Workers alone.
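A cheap heuristic can tell you whether a fetched page is likely a client-rendered shell before you pay for a headless render: strip the markup and see how much visible text survives relative to the script tags. This is an illustrative sketch, not part of any Workers API, and the thresholds are arbitrary assumptions to tune for your targets:

```javascript
// Heuristic: guess whether raw HTML is a client-rendered SPA shell.
// The 200-character threshold is an arbitrary assumption -- tune it.
function looksClientRendered(html) {
  const scripts = (html.match(/<script\b/gi) || []).length;
  const visibleText = html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // Scripts present but almost no visible text usually means an SPA shell.
  return visibleText.length < 200 && scripts >= 1;
}
```

If this returns true, fall back to a JS-rendering service instead of parsing the shell.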
Step 4: Using Workers as an Orchestration Layer
A better pattern: use Cloudflare Workers to orchestrate calls to a dedicated scraping API that handles JS rendering. Here is an example using SearchHive's ScrapeForge API:
export default {
  async fetch(request, env) {
    const SEARCHHIVE_KEY = env.SEARCHHIVE_API_KEY;
    const target = new URL(request.url).searchParams.get("url");
    if (!target) {
      return new Response("Missing ?url= parameter", { status: 400 });
    }
    // Call SearchHive ScrapeForge for JS-rendered content
    const scrapeResp = await fetch(
      "https://api.searchhive.dev/v1/scrapeforge",
      {
        method: "POST",
        headers: {
          "Authorization": `Bearer ${SEARCHHIVE_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          url: target,
          render_js: true,
          format: "markdown",
          timeout: 30000,
        }),
      }
    );
    const data = await scrapeResp.json();
    return Response.json({
      url: target,
      title: data.title,
      content: data.content,
      status: data.status_code,
    });
  },
};
This runs at the edge for low latency while delegating the hard part (JS rendering, proxy rotation, anti-bot bypass) to ScrapeForge. Best of both worlds.
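Upstream calls from the edge can still fail transiently (timeouts, 429s, 5xxs), so it is worth wrapping the API call in a small retry helper. This is a generic sketch, not part of the Workers or SearchHive APIs; the fetch function is injected so the helper stays runtime-agnostic, and the retry counts and delays are assumptions:

```javascript
// Retry a fetch-like call with exponential backoff.
// fetchFn is injected so the helper is runtime-agnostic and testable.
async function fetchWithRetry(fetchFn, url, options, retries = 3, baseDelayMs = 250) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const resp = await fetchFn(url, options);
      // Retry on 429 and 5xx; return everything else (including other 4xx) as-is.
      if (resp.status !== 429 && resp.status < 500) return resp;
      lastError = new Error(`HTTP ${resp.status}`);
    } catch (err) {
      lastError = err;
    }
    // Exponential backoff: 250ms, 500ms, 1000ms, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw lastError;
}
```

In the Worker above you would call `fetchWithRetry(fetch, "https://api.searchhive.dev/v1/scrapeforge", { ... })` in place of the bare fetch.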
Step 5: Batch Scraping with Workers
For scraping multiple URLs in parallel at the edge:
export default {
  async fetch(request, env) {
    if (request.method !== "POST") {
      return new Response("Use POST with JSON body", { status: 405 });
    }
    const { urls } = await request.json();
    const SEARCHHIVE_KEY = env.SEARCHHIVE_API_KEY;
    // Cap the batch: Workers limit subrequests per invocation
    // (50 on the free plan), and only a few connections run at once
    const results = await Promise.allSettled(
      urls.slice(0, 10).map(async (url) => {
        const resp = await fetch(
          "https://api.searchhive.dev/v1/swiftsearch",
          {
            method: "POST",
            headers: {
              "Authorization": `Bearer ${SEARCHHIVE_KEY}`,
              "Content-Type": "application/json",
            },
            body: JSON.stringify({
              query: `site:${url}`,
              num_results: 1,
            }),
          }
        );
        return { url, data: await resp.json() };
      })
    );
    return Response.json({
      // Index into the input list so rejected entries still report their URL
      results: results.map((r, i) => ({
        url: urls[i],
        success: r.status === "fulfilled",
        data: r.value?.data,
        error: r.reason?.message,
      })),
    });
  },
};
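Promise.allSettled fires everything at once, but Workers cap simultaneous open connections per invocation (historically six), so the extra fetches queue anyway. A small pool helper makes the concurrency explicit; this is a generic sketch, not a Workers API:

```javascript
// Run async tasks over `items` with at most `limit` running concurrently.
// Results come back in the same order as the input.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    // Each runner pulls the next unclaimed index until the list is drained.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

In the batch Worker you could replace the Promise.allSettled call with `mapWithConcurrency(urls.slice(0, 10), 5, scrapeOne)`, where scrapeOne wraps the per-URL fetch.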
Step 6: Configure Environment Variables
Set your API key as a secret:
npx wrangler secret put SEARCHHIVE_API_KEY
This keeps your key out of source code and accessible in the Worker via env.SEARCHHIVE_API_KEY.
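For local development with wrangler dev, Wrangler reads plaintext variables from a .dev.vars file in the project root instead of deployed secrets. A quick setup, with a placeholder key value:

```shell
# Local dev: Wrangler picks up env vars from .dev.vars (keep it out of git).
# The key value below is a placeholder -- substitute your real key.
echo 'SEARCHHIVE_API_KEY=your-key-here' > .dev.vars
echo '.dev.vars' >> .gitignore
```

Then npx wrangler dev exposes the variable as env.SEARCHHIVE_API_KEY locally, while the deployed Worker uses the secret you set above.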
Cloudflare Workers Limitations for Scraping
| Feature | Cloudflare Workers | SearchHive ScrapeForge |
|---|---|---|
| JS rendering | No | Yes (headless Chrome) |
| DOM access | No (use linkedom) | Full DOM |
| Memory limit | 128MB | No practical limit |
| CPU time limit | 10 ms (free), 30 s (paid) | 60s default, 300s+ |
| Proxy rotation | No | Built-in residential proxies |
| CAPTCHA handling | No | Automatic |
| Concurrent requests | Limited by plan | Up to 100 QPS |
| Pricing | Free: 100K req/day | $49/mo for 100K credits |
Common Issues
Workers returning empty content: The site uses JavaScript rendering. Switch to ScrapeForge or another headless browser API.
403/429 from target sites: Workers run on Cloudflare's IP ranges, which many sites block. A dedicated scraping API with rotating proxies avoids this entirely.
Memory exceeded on large pages: The 128MB limit means you cannot process very large HTML documents. Stream responses or use a backend API.
CORS errors: If calling your Worker from a browser, add CORS headers to the response.
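For the CORS case, a small helper that copies a response and adds permissive headers is enough for testing. The wide-open `*` origin is deliberate for this sketch -- lock it down to your own origin in production:

```javascript
// Wrap a Response with permissive CORS headers.
// Response and Headers are available in Workers and in Node 18+.
function withCors(response) {
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", "*");
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type");
  return new Response(response.body, {
    status: response.status,
    headers,
  });
}
```

Wrap each return value in the Worker, e.g. `return withCors(Response.json({ ... }))`, and answer OPTIONS preflight requests with an empty withCors response.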
Next Steps
- Sign up for SearchHive to get a free API key with 500 credits
- Read the ScrapeForge documentation for advanced scraping options
- Check out our comparison of web scraping APIs for a detailed feature breakdown
- Learn about Supabase Edge Functions for scraping as an alternative serverless approach
Get Started with SearchHive
SearchHive combines search, scraping, and deep research in one API. The free tier gives you 500 credits to test everything. At scale, the Builder plan ($49/mo) delivers 100K credits per month with JS rendering, proxy rotation, and CAPTCHA handling built in.
Get your free API key -- no credit card required.