Cloudflare Workers for Web Scraping: Complete Edge Computing Guide
Cloudflare Workers run JavaScript at the edge -- in data centers across 300+ cities worldwide. This makes them an interesting option for web scraping: low latency, no server management, and pay-per-request pricing. But Workers have real limitations for scraping that most tutorials gloss over. This guide covers what Workers can do, where they fall short, and when a dedicated scraping API is the better choice.
Key Takeaways
- Cloudflare Workers excel at lightweight API-based scraping and simple page fetches
- Workers have a 128MB memory limit, a CPU-time cap (10 ms per request on the free plan, 30 s on paid), and no DOM -- so JavaScript-heavy sites cannot be rendered with Workers alone
- For production scraping at scale, dedicated APIs like SearchHive's ScrapeForge handle JS rendering, proxies, and CAPTCHAs that Workers cannot
- Workers are best suited as orchestration layers that call dedicated scraping APIs, not as scrapers themselves
Prerequisites
- A Cloudflare account (free tier works)
- Node.js 18+ installed
- Basic JavaScript knowledge
- Wrangler CLI installed: npm install -g wrangler
Step 1: Set Up Your Cloudflare Worker
Initialize a new Worker project:
mkdir edge-scraper && cd edge-scraper
npx wrangler init
Choose "hello world" template when prompted. This creates the basic project structure:
edge-scraper/
wrangler.toml
package.json
src/index.js
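wrangler init also writes the wrangler.toml configuration file. A minimal version for this project looks roughly like the following -- the name and compatibility date are placeholders to adjust for your own setup:

```toml
# wrangler.toml -- minimal Worker configuration (values are placeholders)
name = "edge-scraper"
main = "src/index.js"
compatibility_date = "2024-01-01"
```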
Step 2: Basic Page Fetching
The simplest scraping use case is fetching page content. Workers can do this with the built-in fetch API:
// src/index.js
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const target = url.searchParams.get("url");
    if (!target) {
      return new Response("Missing ?url= parameter", { status: 400 });
    }
    const response = await fetch(target, {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
        "Accept": "text/html",
      },
    });
    const html = await response.text();
    const title = html.match(/<title>(.*?)<\/title>/)?.[1] || "No title";
    const textContent = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim()
      .slice(0, 5000);
    return Response.json({
      url: target,
      title: title,
      content: textContent,
    });
  },
};
Deploy with npx wrangler deploy. Test it:
curl "https://your-worker.workers.dev/?url=https://example.com"
This works for static HTML pages. The problem? Most modern websites are not static.
Step 3: The JavaScript Rendering Problem
Cloudflare Workers have no DOM and no browser rendering engine. They run JavaScript on V8, but they cannot execute a page's client-side JavaScript, which means sites built with React, Vue, Angular, or any SPA framework will return empty content shells.
Some people try to use htmlparser2 or linkedom in Workers for basic HTML parsing:
import { parseHTML } from "linkedom";

export default {
  async fetch(request, env) {
    const target = new URL(request.url).searchParams.get("url");
    const resp = await fetch(target);
    const html = await resp.text();
    const { document } = parseHTML(html);
    const headings = [...document.querySelectorAll("h1, h2, h3")].map(
      (el) => el.textContent
    );
    const links = [...document.querySelectorAll("a[href]")].map(
      (el) => ({ text: el.textContent, href: el.href })
    );
    return Response.json({ headings, links });
  },
};
This parses the raw HTML, but it still cannot execute JavaScript. If the content you need is rendered client-side, you are out of luck with Workers alone.
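A cheap heuristic can tell you whether a fetched page is likely a client-rendered shell before you pay for a headless render: strip the markup and see how much visible text survives relative to the script tags. This is an illustrative sketch, not part of any Workers API, and the thresholds are arbitrary assumptions to tune for your targets:

```javascript
// Heuristic: guess whether raw HTML is a client-rendered SPA shell.
// The 200-character threshold is an arbitrary assumption -- tune it.
function looksClientRendered(html) {
  const scripts = (html.match(/<script\b/gi) || []).length;
  const visibleText = html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // Scripts present but almost no visible text usually means an SPA shell.
  return visibleText.length < 200 && scripts >= 1;
}
```

If this returns true, fall back to a JS-rendering service instead of parsing the shell.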
Step 4: Using Workers as an Orchestration Layer
A better pattern: use Cloudflare Workers to orchestrate calls to a dedicated scraping API that handles JS rendering. Here is an example using SearchHive's ScrapeForge API:
export default {
  async fetch(request, env) {
    const SEARCHHIVE_KEY = env.SEARCHHIVE_API_KEY;
    const target = new URL(request.url).searchParams.get("url");
    if (!target) {
      return new Response("Missing ?url= parameter", { status: 400 });
    }
    // Call SearchHive ScrapeForge for JS-rendered content
    const scrapeResp = await fetch(
      "https://api.searchhive.dev/v1/scrapeforge",
      {
        method: "POST",
        headers: {
          "Authorization": `Bearer ${SEARCHHIVE_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          url: target,
          render_js: true,
          format: "markdown",
          timeout: 30000,
        }),
      }
    );
    const data = await scrapeResp.json();
    return Response.json({
      url: target,
      title: data.title,
      content: data.content,
      status: data.status_code,
    });
  },
};
This runs at the edge for low latency while delegating the hard part (JS rendering, proxy rotation, anti-bot bypass) to ScrapeForge. Best of both worlds.
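Upstream calls from the edge can still fail transiently (timeouts, 429s, 5xxs), so it is worth wrapping the API call in a small retry helper. This is a generic sketch, not part of the Workers or SearchHive APIs; the fetch function is injected so the helper stays runtime-agnostic, and the retry counts and delays are assumptions:

```javascript
// Retry a fetch-like call with exponential backoff.
// fetchFn is injected so the helper is runtime-agnostic and testable.
async function fetchWithRetry(fetchFn, url, options, retries = 3, baseDelayMs = 250) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const resp = await fetchFn(url, options);
      // Retry on 429 and 5xx; return everything else (including other 4xx) as-is.
      if (resp.status !== 429 && resp.status < 500) return resp;
      lastError = new Error(`HTTP ${resp.status}`);
    } catch (err) {
      lastError = err;
    }
    // Exponential backoff: 250ms, 500ms, 1000ms, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw lastError;
}
```

In the Worker above you would call `fetchWithRetry(fetch, "https://api.searchhive.dev/v1/scrapeforge", { ... })` in place of the bare fetch.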
Step 5: Batch Scraping with Workers
For scraping multiple URLs in parallel at the edge:
export default {
  async fetch(request, env) {
    if (request.method !== "POST") {
      return new Response("Use POST with JSON body", { status: 405 });
    }
    const { urls } = await request.json();
    const SEARCHHIVE_KEY = env.SEARCHHIVE_API_KEY;
    // Cap the batch: Workers limit subrequests per invocation
    // (50 on the free plan), and only a few connections run at once
    const results = await Promise.allSettled(
      urls.slice(0, 10).map(async (url) => {
        const resp = await fetch(
          "https://api.searchhive.dev/v1/swiftsearch",
          {
            method: "POST",
            headers: {
              "Authorization": `Bearer ${SEARCHHIVE_KEY}`,
              "Content-Type": "application/json",
            },
            body: JSON.stringify({
              query: `site:${url}`,
              num_results: 1,
            }),
          }
        );
        return { url, data: await resp.json() };
      })
    );
    return Response.json({
      // Index into the input list so rejected entries still report their URL
      results: results.map((r, i) => ({
        url: urls[i],
        success: r.status === "fulfilled",
        data: r.value?.data,
        error: r.reason?.message,
      })),
    });
  },
};
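Promise.allSettled fires everything at once, but Workers cap simultaneous open connections per invocation (historically six), so the extra fetches queue anyway. A small pool helper makes the concurrency explicit; this is a generic sketch, not a Workers API:

```javascript
// Run async tasks over `items` with at most `limit` running concurrently.
// Results come back in the same order as the input.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    // Each runner pulls the next unclaimed index until the list is drained.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

In the batch Worker you could replace the Promise.allSettled call with `mapWithConcurrency(urls.slice(0, 10), 5, scrapeOne)`, where scrapeOne wraps the per-URL fetch.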
Step 6: Configure Environment Variables
Set your API key as a secret:
npx wrangler secret put SEARCHHIVE_API_KEY
This keeps your key out of source code and accessible in the Worker via env.SEARCHHIVE_API_KEY.
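For local development with wrangler dev, Wrangler reads plaintext variables from a .dev.vars file in the project root instead of deployed secrets. A quick setup, with a placeholder key value:

```shell
# Local dev: Wrangler picks up env vars from .dev.vars (keep it out of git).
# The key value below is a placeholder -- substitute your real key.
echo 'SEARCHHIVE_API_KEY=your-key-here' > .dev.vars
echo '.dev.vars' >> .gitignore
```

Then npx wrangler dev exposes the variable as env.SEARCHHIVE_API_KEY locally, while the deployed Worker uses the secret you set above.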
Cloudflare Workers Limitations for Scraping
| Feature | Cloudflare Workers | SearchHive ScrapeForge |
|---|---|---|
| JS rendering | No | Yes (headless Chrome) |
| DOM access | No (use linkedom) | Full DOM |
| Memory limit | 128MB | No practical limit |
| CPU time limit | 10 ms (free), 30 s (paid) | 60s default, 300s+ |
| Proxy rotation | No | Built-in residential proxies |
| CAPTCHA handling | No | Automatic |
| Concurrent requests | Limited by plan | Up to 100 QPS |
| Pricing | Free: 100K req/day | $49/mo for 100K credits |
Common Issues
Workers returning empty content: The site uses JavaScript rendering. Switch to ScrapeForge or another headless browser API.
403/429 from target sites: Workers run on Cloudflare's IP ranges, which many sites block. A dedicated scraping API with rotating proxies avoids this entirely.
Memory exceeded on large pages: The 128MB limit means you cannot process very large HTML documents. Stream responses or use a backend API.
CORS errors: If calling your Worker from a browser, add CORS headers to the response.
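For the CORS case, a small helper that copies a response and adds permissive headers is enough for testing. The wide-open `*` origin is deliberate for this sketch -- lock it down to your own origin in production:

```javascript
// Wrap a Response with permissive CORS headers.
// Response and Headers are available in Workers and in Node 18+.
function withCors(response) {
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", "*");
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type");
  return new Response(response.body, {
    status: response.status,
    headers,
  });
}
```

Wrap each return value in the Worker, e.g. `return withCors(Response.json({ ... }))`, and answer OPTIONS preflight requests with an empty withCors response.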
Next Steps
- Sign up for SearchHive to get a free API key with 500 credits
- Read the ScrapeForge documentation for advanced scraping options
- Check out our comparison of web scraping APIs for a detailed feature breakdown
- Learn about Supabase Edge Functions for scraping as an alternative serverless approach
Get Started with SearchHive
SearchHive combines search, scraping, and deep research in one API. The free tier gives you 500 credits to test everything. At scale, the Builder plan ($49/mo) delivers 100K credits per month with JS rendering, proxy rotation, and CAPTCHA handling built in.
Get your free API key -- no credit card required.