Is Web Scraping Legal?
Yes, web scraping is generally legal — but with important caveats. In the United States, scraping publicly accessible data does not violate the federal anti-hacking statute under the hiQ v. LinkedIn line of decisions (9th Cir., reaffirmed 2022). In the EU, scraping is permitted but heavily constrained by GDPR when personal data is involved. The act of scraping itself isn't illegal, but how you scrape, what you scrape, and what you do with the data can create serious legal liability.
This guide covers the current legal landscape, key court cases, and practical guidelines for staying on the right side of the law.
Key Takeaways
- Scraping public data does not violate the CFAA — confirmed by hiQ v. LinkedIn (9th Cir., reaffirmed 2022)
- CFAA no longer covers TOS violations — Van Buren v. United States (2021) narrowed the Computer Fraud and Abuse Act
- GDPR applies regardless of public access — scraping personal data in the EU triggers data protection obligations
- robots.txt matters — not legally binding on its own, but ignoring it is used as evidence of bad faith
- Using a legitimate API like SearchHive reduces most scraping-specific legal risk compared to direct scraping
What does US law say about web scraping?
The most significant legal precedent is hiQ Labs v. LinkedIn (9th Cir. 2019; vacated and remanded by the Supreme Court in 2021 in light of Van Buren; reaffirmed by the Ninth Circuit in 2022).
The facts: hiQ scraped publicly available LinkedIn profiles to build workforce analytics. LinkedIn sent a cease-and-desist, then deployed technical blocks. hiQ sought a preliminary injunction.
The ruling: The Ninth Circuit ruled that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA). LinkedIn's terms of service prohibition and technical measures did not create "unauthorized access" to public pages.
What this means for you: If data is publicly accessible on the web (no login required), scraping it does not constitute unauthorized access under the CFAA. This is binding precedent in the 9th Circuit and persuasive authority elsewhere. Note, however, that the CFAA holding is not blanket immunity: hiQ itself ultimately settled after a district court found it had breached LinkedIn's User Agreement.
What is the CFAA and how does it apply?
The Computer Fraud and Abuse Act (18 U.S.C. § 1030) is the primary federal law used against scrapers. Two court decisions reshaped its application:
Van Buren v. United States (2021)
The Supreme Court narrowed "exceeds authorized access" to mean accessing areas of a computer you lack permission to access — not misusing data you're permitted to access. This eliminated the theory that violating a website's terms of service constitutes a CFAA violation.
Post-Van Buren landscape
| Scraping Scenario | CFAA Risk |
|---|---|
| Public pages (no login) | Low — no CFAA violation per hiQ |
| Pages behind free registration | Medium — gray area, fact-dependent |
| Bypassing authentication | High — likely "without authorization" |
| Insider scraping (employee) | High — per Cvent v. Leidholm |
| After explicit revocation of permission | Medium-High — borderline |
The bottom line: CFAA is no longer a blanket weapon against scraping, but it remains potent when you bypass authentication or exceed scope-limited access.
What about other US court cases?
eBay v. Bidder's Edge (2000)
Bidder's Edge scraped eBay auction data despite robots.txt restrictions, consuming ~1.5% of eBay's server capacity. The court granted an injunction based on trespass to chattels — the scraping consumed measurable server resources after eBay explicitly revoked consent.
Lesson: Even if scraping is legal, causing measurable harm to a target's servers can create liability under trespass to chattels. Ignoring robots.txt was a key factor in the ruling.
Cvent v. Leidholm (2023)
A departing employee used automated scripts to download company data while still technically employed but already planning to join a competitor. The court found CFAA violations because her access was scope-limited by employment context.
Lesson: Insider scraping carries much higher legal risk than third-party scraping of public data. Authorization isn't just about technical access — it's about the context of that access.
How does EU law differ?
Scraping in the EU faces tighter restrictions:
GDPR
GDPR applies to personal data regardless of whether it was publicly accessible:
- Article 14 requires providing privacy notices to data subjects whose data was obtained indirectly (which includes scraping)
- Article 6 requires a lawful basis — consent or legitimate interest (with documented assessment)
- Article 5(1)(c) mandates data minimization — scraping everything possible likely violates this
- Article 17 gives data subjects the right to erasure — people can request deletion of their scraped data
Database Directive
The EU's sui generis database right can protect collections of data even if individual elements aren't copyrighted. Scraping and republishing entire databases can infringe this right.
DSM Directive (Text and Data Mining)
- Article 3 permits text and data mining by research organizations for scientific research
- Article 4 permits TDM for any other purpose, but rights holders can opt out via "machine-readable means" (robots.txt may qualify)
- This means robots.txt has stronger legal teeth in the EU than in the US
Does robots.txt have legal force?
Short answer: it's a technical standard, not a law. But its indirect legal weight is significant:
| Context | robots.txt Impact |
|---|---|
| US — CFAA claims | Not determinative (hiQ court declined to treat it as creating "unauthorized access") |
| US — Trespass to chattels | Factor in eBay v. Bidder's Edge (evidence of willful lack of consent) |
| EU — GDPR/DSM | May serve as machine-readable opt-out for commercial TDM |
| Breach of contract | Strengthens claims if TOS incorporates robots.txt compliance |
| General litigation | Ignoring it demonstrates bad faith, weakens nearly any defense |
Practical guidance: Always respect robots.txt. It won't necessarily save you from a lawsuit, but ignoring it will almost certainly be used against you.
```python
from urllib import robotparser
from urllib.parse import urlparse

def check_robots_txt(url, user_agent='MyBot/1.0'):
    """Check if scraping a URL is allowed before making requests."""
    parts = urlparse(url)  # split the URL into scheme, host, path, etc.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```
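To illustrate what the parser actually enforces, here is a self-contained sketch that feeds robotparser an inline sample robots.txt instead of fetching one over the network (the bot name, rules, and URLs are made up):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a sample robots.txt inline rather than fetching it
rp.parse("""\
User-agent: *
Crawl-delay: 2
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/public"))     # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot/1.0"))  # 2
```

can_fetch answers whether a given path is allowed for your user agent, and crawl_delay surfaces any Crawl-delay directive you should honor between requests.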
How does personal vs. commercial use affect legality?
The act of scraping public data is judged the same regardless of purpose. The practical differences lie in enforcement and damages:
| Factor | Personal/Non-Commercial | Commercial |
|---|---|---|
| Enforcement | Rarely targeted | Companies actively sue competitors |
| Damages | Hard to prove | Lost revenue, competitive harm, statutory damages |
| Copyright | Personal fair use defensible | Commercial reproduction harder to defend |
| Trade secrets | Rarely applicable | Major risk for competitor data |
| GDPR (EU) | Same obligations | Same obligations, higher DPA scrutiny |
What are the ethical scraping guidelines?
Beyond legal minimums, these practices reduce risk and build goodwill:
- Respect robots.txt — parse and obey it before scraping any domain
- Rate limit requests — 1–2 requests per second, check Crawl-delay directives
- Identify yourself — use a descriptive User-Agent string with contact information
- Don't bypass authentication — scraping behind login walls without permission increases legal risk
- Minimize personal data — if you collect personal data, document your legal basis and implement data subject rights
- Don't republish copyrighted content — scraping for internal analysis is defensible; republishing full articles is not
- Cache responsibly — don't store more data than needed, set retention limits
- Implement opt-out — allow people to request removal of their data
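The rate-limiting advice above can be sketched as a small per-domain throttle. This is a minimal illustration, not from any particular library; the class name and default delay are arbitrary, and in practice you would set the delay from a site's Crawl-delay directive when one exists:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests to the same domain."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = {}  # domain -> monotonic timestamp of the last request

    def wait(self, domain):
        """Sleep just long enough to honor the per-domain delay, then record the hit."""
        last = self._last.get(domain)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self._last[domain] = time.monotonic()
```

Call `throttle.wait(domain)` before each request; requests to different domains are not delayed against each other, so a polite crawler can still make progress across sites.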
How does using SearchHive reduce legal risk?
Using a legitimate web data API like SearchHive instead of direct scraping reduces legal exposure in several ways:
- No direct server contact — you're calling an API, not hitting target websites directly
- No CFAA exposure — there's no "access" to target servers to dispute
- No trespass to chattels — your traffic doesn't consume target server resources
- No robots.txt violations — the API manages upstream compliance
- No CAPTCHA circumvention — a legal gray area that direct scrapers often navigate
```python
from searchhive import SwiftSearch, ScrapeForge

# Search: you never touch the target site directly
search = SwiftSearch(api_key='your-key')
results = search.query('competitor pricing data')

# Scrape: SearchHive handles the upstream request
scraper = ScrapeForge(api_key='your-key')
content = scraper.scrape('https://example.com/page')
```
This doesn't make you immune to all legal claims (data usage and copyright still apply), but it eliminates the most common vectors for scraping-related litigation. See How to Use SearchHive with Python for implementation details.
Summary
Web scraping of publicly accessible data does not violate the CFAA in the US under hiQ v. LinkedIn, and the CFAA no longer reaches TOS violations after Van Buren. In the EU, GDPR and database rights create additional constraints. Always respect robots.txt, rate limit your requests, minimize personal data collection, and identify yourself with a descriptive User-Agent. For teams that want to minimize legal risk, a managed API like SearchHive removes the most common vectors of scraping-related liability.
Disclaimer: This article is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for guidance specific to your situation.
Start with SearchHive's free tier for compliant web data access.
Related reading: How to Handle Rate Limiting in Web Scraping | What Is the Best Proxy for Web Scraping | How to Use SearchHive with Python