Scrapy has been Python's go-to web scraping framework since 2008. It's powerful, extensible, and free. But in 2026, many teams are questioning whether maintaining a Scrapy pipeline is worth the effort when scraping APIs like SearchHive ScrapeForge can handle the same workloads with a single HTTP call.
This comparison breaks down when Scrapy is the right choice and when you're better off with an API, based on real tradeoffs -- not ideology.
Key Takeaways
- Scrapy excels at large-scale, custom scraping where you need fine-grained control over every step
- Scraping APIs win on speed of implementation -- a single API call vs. writing and maintaining a spider
- Cost comparison: Scrapy is "free" but costs developer time; APIs cost money but save engineering hours
- Scrapy handles complex, site-specific logic that no generic API can match
- The hybrid approach (Scrapy for complex sites, API for standard pages) is often the best answer
How Scrapy Works
Scrapy is a framework for writing web spiders -- programs that navigate websites, extract data, and follow links. A typical Scrapy spider defines:
- Which URLs to start from
- How to parse each page (CSS/XPath selectors)
- What data to extract
- Which links to follow next
- How to handle pagination, retries, and rate limiting
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        for article in response.css("article.post"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("a::attr(href)").get(),
                "excerpt": article.css(".excerpt::text").get(),
                "date": article.css("time::attr(datetime)").get(),
            }
        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
How API Scraping Works
A scraping API handles all the infrastructure -- proxy rotation, JavaScript rendering, CAPTCHA solving, retries -- behind a simple HTTP interface. You send a URL, you get the content back.
```python
import requests

# SearchHive ScrapeForge: fetch the same page as the Scrapy spider above
# with a single HTTP call -- no spider class, no selectors, no crawl loop
resp = requests.post(
    "https://api.searchhive.dev/v1/scrapeforge/scrape",
    json={
        "url": "https://example.com/blog",
        "format": "markdown",
        "render_js": True,
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
print(resp.json().get("markdown", "")[:500])
```
Feature-by-Feature Comparison
| Feature | Scrapy | SearchHive ScrapeForge | ScrapingBee | ScraperAPI |
|---|---|---|---|---|
| Setup time | Hours to days | Minutes | Minutes | Minutes |
| JS rendering | Via downloader middleware (Splash/Playwright) | Built-in | Built-in (5-25x credits) | Built-in |
| Proxy rotation | Via middleware or custom | Built-in | Built-in | Built-in |
| CAPTCHA handling | Manual or third-party | Built-in | Built-in | Built-in |
| Anti-bot bypass | Manual (headers, delays, fingerprints) | Built-in | Built-in | Built-in |
| Rate limiting | Built-in (settings) | Built-in | Built-in | Built-in |
| Pagination | Custom logic per site | Crawl endpoint (defined sitemaps only) | N/A | N/A |
| Custom selectors | Full CSS/XPath support | Format-based (markdown/HTML/text) | Extract rules | N/A |
| Site-specific logic | Unlimited (it's code) | None (generic) | None | None |
| Concurrent requests | Configurable (Twisted async) | API-managed | 10-200 concurrent | 20-200 concurrent |
| Data pipelines | Built-in (items, pipelines) | Your code | Your code | Your code |
| Error handling | Full control | API returns errors | API returns errors | API returns errors |
| Cost | Free (open source) + infrastructure | Free tier + per-request | $49+/mo | $49+/mo |
| Maintenance | High (selectors break when sites change) | Zero | Zero | Zero |
When Scrapy Is the Better Choice
1. Large-Scale Site Crawling
If you need to crawl an entire website -- thousands of pages, following links, handling pagination -- Scrapy's crawler engine is purpose-built for this. Most scraping APIs have no equivalent single-operation, site-wide crawl; ScrapeForge's crawl endpoint comes closest, but it only handles multi-page crawls over defined sitemaps.
```python
import scrapy


class SiteCrawler(scrapy.Spider):
    name = "fullsite"
    start_urls = ["https://example.com"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.5,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        # Follow all links to pages of a specific type
        for page in response.css("a[href*='/docs/']"):
            yield response.follow(page, callback=self.parse_doc)

    def parse_doc(self, response):
        yield {
            "title": response.css("title::text").get(),
            "content": response.css("main ::text").getall(),
            "url": response.url,
        }
```
2. Complex, Site-Specific Extraction Logic
Some sites have non-standard layouts, nested structures, or data split across multiple elements. Scrapy lets you write arbitrary Python to handle these cases.
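For instance (a hypothetical layout, with the selector results modeled as plain lists so the snippet is self-contained), a callback can merge values that a site splits across parallel `<dt>`/`<dd>` elements:

```python
def merge_spec_rows(labels, values):
    """Merge data split across parallel <dt>/<dd> element lists into one
    record -- the kind of site-specific glue logic a Scrapy callback can
    express but a generic, format-based API cannot."""
    record = {}
    for label, value in zip(labels, values):
        key = label.strip().rstrip(":").lower().replace(" ", "_")
        record[key] = value.strip()
    return record


# Texts as they might come back from response.css("dt::text") / ("dd::text")
labels = ["Price:", "In Stock:", "Ship Weight:"]
values = [" $19.99 ", "Yes", "1.2 kg"]
print(merge_spec_rows(labels, values))
# {'price': '$19.99', 'in_stock': 'Yes', 'ship_weight': '1.2 kg'}
```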
3. Custom Data Pipelines
Scrapy's item pipeline system lets you clean, validate, deduplicate, and store scraped data through a composable chain of processors.
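A minimal sketch of that contract (the `DropItem` class here is a local stand-in for `scrapy.exceptions.DropItem`, so the snippet runs without Scrapy installed):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class DedupeAndCleanPipeline:
    """Mirrors Scrapy's pipeline contract: process_item(item, spider)
    returns the (possibly cleaned) item, or raises DropItem to discard it."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider=None):
        url = item.get("url")
        if not url or not item.get("title"):
            raise DropItem("missing required field")
        if url in self.seen_urls:
            raise DropItem(f"duplicate: {url}")
        self.seen_urls.add(url)
        item["title"] = item["title"].strip()
        return item
```

In a real project you would register the class under `ITEM_PIPELINES` in `settings.py`, and Scrapy would run every yielded item through it.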
4. Zero API Cost at Scale
If you're scraping millions of pages, API costs add up. Scrapy is free -- you only pay for proxy infrastructure and hosting. At very high volumes, this can be cheaper than any API.
When an API Is the Better Choice
1. Quick Content Extraction
Need the text of a single page? Or 100 pages you already have URLs for? An API call is faster to write, faster to run, and requires zero maintenance.
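Batch extraction over a list of known URLs is just a loop around the same call (endpoint and payload as in the example above; `build_payload` and `scrape_known_urls` are helper names introduced here for clarity):

```python
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.searchhive.dev/v1/scrapeforge/scrape"


def build_payload(url):
    """Request body for one page; markdown keeps post-processing trivial."""
    return {"url": url, "format": "markdown", "render_js": True}


def scrape_known_urls(urls):
    """Fetch each known URL through the API -- no spider, no selectors."""
    results = {}
    for url in urls:
        resp = requests.post(
            API_URL,
            json=build_payload(url),
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        results[url] = resp.json().get("markdown", "")
    return results
```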
2. JavaScript-Heavy Sites
Scrapy can handle JS rendering through Splash or Playwright integration, but setting it up is non-trivial. Scraping APIs handle it out of the box.
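For comparison, this is roughly what the `scrapy-playwright` wiring looks like -- a sketch of that plugin's documented settings, not a drop-in config (the plugin and a browser must also be installed):

```python
# settings.py -- scrapy-playwright wiring
# Prerequisites: pip install scrapy-playwright && playwright install chromium
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# The plugin requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Then each request opts in to browser rendering inside a spider:
# yield scrapy.Request(url, meta={"playwright": True})
```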
3. Anti-Bot Protected Sites
Sites with Cloudflare, DataDome, PerimeterX, or similar protections are hard to scrape with Scrapy alone. Scraping APIs invest heavily in bypass technology that individual developers can't easily replicate.
4. Prototyping and MVPs
When you're building a prototype, you don't want to write and debug Scrapy spiders. An API call gets you data in minutes, not hours.
The True Cost of Scrapy
Scrapy is "free" in the same way that Linux is "free" -- the software costs nothing, but the operational cost is real:
- Development time: Writing, testing, and debugging spiders takes hours per site
- Maintenance: Sites change their HTML constantly. Scrapy selectors break and need updating
- Infrastructure: You need servers, proxies, and monitoring
- Proxy costs: Rotating residential proxies cost $2-8/GB
- JS rendering: Running Splash or Playwright instances adds infrastructure complexity
A single Scrapy spider for a complex site can take 4-8 hours to build, plus ongoing maintenance. At typical developer rates, that's $400-1,600+ per spider. An equivalent API integration takes about 30 minutes to write and costs on the order of $0.001 per page.
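Those figures make the break-even volume easy to estimate (the hourly rate and hours below are illustrative assumptions, not benchmarks):

```python
def breakeven_pages(build_cost, monthly_maintenance, months, cost_per_page):
    """Page volume at which API spend equals the engineering cost of a spider."""
    engineering = build_cost + monthly_maintenance * months
    return engineering / cost_per_page


# 6 hours at $150/hr to build, 1 hr/month upkeep over a year, $0.001/page
pages = breakeven_pages(6 * 150, 1 * 150, 12, 0.001)
print(f"{pages:,.0f}")  # 2,700,000 -- below this volume, the API is cheaper
```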
The Hybrid Approach (Recommended)
The most practical approach for most teams:
```python
import scrapy
from scrapy.http import JsonRequest

SEARCHHIVE_KEY = "your_key"
API_ENDPOINT = "https://api.searchhive.dev/v1/scrapeforge/scrape"


class HybridSpider(scrapy.Spider):
    """Use the API for JS/protected pages, plain Scrapy for static pages."""

    name = "hybrid"
    start_urls = ["https://example.com"]

    def parse(self, response):
        for link in response.css("a[href]"):
            url = response.urljoin(link.attrib["href"])
            if self.needs_js_or_protection(url):
                # Route the request through the API instead of downloading
                # the page directly, so rendering and bot bypass happen there.
                # JsonRequest stays inside Scrapy's async engine -- no
                # blocking HTTP calls in callbacks.
                yield JsonRequest(
                    API_ENDPOINT,
                    data={"url": url, "format": "markdown", "render_js": True},
                    headers={"Authorization": f"Bearer {SEARCHHIVE_KEY}"},
                    callback=self.parse_api,
                    meta={"source_url": url},
                    dont_filter=True,
                )
            else:
                yield response.follow(url, callback=self.parse_standard)

    def parse_api(self, response):
        """Handle API responses for JS-rendered or protected pages."""
        url = response.meta["source_url"]
        data = response.json()
        yield {"url": url, "content": data.get("markdown", "")}

    def parse_standard(self, response):
        """Parse simple static pages directly with Scrapy."""
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "content": " ".join(response.css("main ::text").getall()),
        }

    def needs_js_or_protection(self, url):
        # Custom logic: which URLs need API-based scraping
        protected_domains = ["shopify.com", "cloudflare.com"]
        return any(d in url for d in protected_domains)
```
Verdict
Use Scrapy when: you need site-wide crawling, custom extraction logic, or you're processing millions of pages where API costs would be prohibitive.
Use a scraping API when: you need to extract content from known URLs, you're building an MVP, the sites use JS rendering or anti-bot protection, or you don't want to maintain spiders.
Use both: the hybrid approach is the most practical for teams that need Scrapy's crawling power and an API's scraping capabilities.
For most development teams in 2026, SearchHive ScrapeForge covers 80% of web scraping needs with a single API call. Scrapy remains the right tool for the other 20% -- the complex, high-volume, site-specific workloads that require custom logic.
Related: Playwright vs Scraping APIs | Puppeteer vs Scraping APIs | Best Web Scraping APIs with Python SDK
Get the best of both worlds. Try SearchHive free -- ScrapeForge for API scraping, SwiftSearch for search, DeepDive for extraction.