Is Web Scraping Legal?
Yes, web scraping is generally legal — but with important caveats. In the United States, scraping publicly accessible data does not violate the federal anti-hacking statute under the hiQ v. LinkedIn line of decisions (9th Cir., reaffirmed 2022). In the EU, scraping is permitted but heavily constrained by GDPR when personal data is involved. The act of scraping itself isn't illegal, but how you scrape, what you scrape, and what you do with the data can create serious legal liability.
This guide covers the current legal landscape, key court cases, and practical guidelines for staying on the right side of the law.
Key Takeaways
- Scraping public data does not violate the CFAA — confirmed by hiQ v. LinkedIn (9th Cir., reaffirmed 2022)
- CFAA no longer covers TOS violations — Van Buren v. United States (2021) narrowed the Computer Fraud and Abuse Act
- GDPR applies regardless of public access — scraping personal data in the EU triggers data protection obligations
- robots.txt matters — not legally binding on its own, but ignoring it is used as evidence of bad faith
- Using a legitimate API like SearchHive reduces most scraping-specific legal risk compared to direct scraping
What does US law say about web scraping?
The most significant legal precedent is hiQ Labs v. LinkedIn (9th Cir. 2019; vacated and remanded by the Supreme Court in 2021 in light of Van Buren; reaffirmed by the Ninth Circuit in 2022).
The facts: hiQ scraped publicly available LinkedIn profiles to build workforce analytics. LinkedIn sent a cease-and-desist, then deployed technical blocks. hiQ sought a preliminary injunction.
The ruling: The Ninth Circuit ruled that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA). LinkedIn's terms of service prohibition and technical measures did not create "unauthorized access" to public pages.
What this means for you: If data is publicly accessible on the web (no login required), scraping it does not constitute unauthorized access under the CFAA. This is binding precedent in the 9th Circuit and persuasive authority elsewhere. Note, however, that the CFAA holding is not blanket immunity: hiQ itself ultimately settled after a district court found it had breached LinkedIn's User Agreement.
What is the CFAA and how does it apply?
The Computer Fraud and Abuse Act (18 U.S.C. § 1030) is the primary federal law used against scrapers. Two court decisions reshaped its application:
Van Buren v. United States (2021)
The Supreme Court narrowed "exceeds authorized access" to mean accessing areas of a computer you lack permission to access — not misusing data you're permitted to access. This eliminated the theory that violating a website's terms of service constitutes a CFAA violation.
Post-Van Buren landscape
| Scraping Scenario | CFAA Risk |
|---|---|
| Public pages (no login) | Low — no CFAA violation per hiQ |
| Pages behind free registration | Medium — gray area, fact-dependent |
| Bypassing authentication | High — likely "without authorization" |
| Insider scraping (employee) | High — per Cvent v. Leidholm |
| After explicit revocation of permission | Medium-High — borderline |
The bottom line: CFAA is no longer a blanket weapon against scraping, but it remains potent when you bypass authentication or exceed scope-limited access.
What about other US court cases?
eBay v. Bidder's Edge (2000)
Bidder's Edge scraped eBay auction data despite robots.txt restrictions, consuming ~1.5% of eBay's server capacity. The court granted an injunction based on trespass to chattels — the scraping consumed measurable server resources after eBay explicitly revoked consent.
Lesson: Even if scraping is legal, causing measurable harm to a target's servers can create liability under trespass to chattels. Ignoring robots.txt was a key factor in the ruling.
Cvent v. Leidholm (2023)
A departing employee used automated scripts to download company data while still technically employed but already planning to join a competitor. The court found CFAA violations because her access was scope-limited by employment context.
Lesson: Insider scraping carries much higher legal risk than third-party scraping of public data. Authorization isn't just about technical access — it's about the context of that access.
How does EU law differ?
Scraping in the EU faces tighter restrictions:
GDPR
GDPR applies to personal data regardless of whether it was publicly accessible:
- Article 14 requires providing privacy notices to data subjects whose data was obtained indirectly (which includes scraping)
- Article 6 requires a lawful basis — consent or legitimate interest (with documented assessment)
- Article 5(1)(c) mandates data minimization — scraping everything possible likely violates this
- Article 17 gives data subjects the right to erasure — people can request deletion of their scraped data
Database Directive
The EU's sui generis database right can protect collections of data even if individual elements aren't copyrighted. Scraping and republishing entire databases can infringe this right.
DSM Directive (Text and Data Mining)
- Article 3 permits text and data mining by research organizations for scientific research
- Article 4 permits TDM for any other purpose, but rights holders can opt out via "machine-readable means" (robots.txt may qualify)
- This means robots.txt has stronger legal teeth in the EU than in the US
Does robots.txt have legal force?
Short answer: it's a technical standard, not a law. But its indirect legal weight is significant:
| Context | robots.txt Impact |
|---|---|
| US — CFAA claims | Not determinative (hiQ court declined to treat it as creating "unauthorized access") |
| US — Trespass to chattels | Factor in eBay v. Bidder's Edge (evidence of willful lack of consent) |
| EU — GDPR/DSM | May serve as machine-readable opt-out for commercial TDM |
| Breach of contract | Strengthens claims if TOS incorporates robots.txt compliance |
| General litigation | Ignoring it demonstrates bad faith, weakens nearly any defense |
Practical guidance: Always respect robots.txt. It won't necessarily save you from a lawsuit, but ignoring it will almost certainly be used against you.
```python
from urllib import robotparser
from urllib.parse import urlparse

def check_robots_txt(url, user_agent='MyBot/1.0'):
    """Check if scraping a URL is allowed before making requests."""
    parts = urlparse(url)  # split the URL into scheme, host, path, etc.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```
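To illustrate what the parser actually enforces, here is a self-contained sketch that feeds robotparser an inline sample robots.txt instead of fetching one over the network (the bot name, rules, and URLs are made up):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a sample robots.txt inline rather than fetching it
rp.parse("""\
User-agent: *
Crawl-delay: 2
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/public"))     # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot/1.0"))  # 2
```

can_fetch answers whether a given path is allowed for your user agent, and crawl_delay surfaces any Crawl-delay directive you should honor between requests.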
How does personal vs. commercial use affect legality?
The act of scraping public data is judged the same regardless of purpose. The practical differences lie in enforcement and damages:
| Factor | Personal/Non-Commercial | Commercial |
|---|---|---|
| Enforcement | Rarely targeted | Companies actively sue competitors |
| Damages | Hard to prove | Lost revenue, competitive harm, statutory damages |
| Copyright | Personal fair use defensible | Commercial reproduction harder to defend |
| Trade secrets | Rarely applicable | Major risk for competitor data |
| GDPR (EU) | Same obligations | Same obligations, higher DPA scrutiny |
What are the ethical scraping guidelines?
Beyond legal minimums, these practices reduce risk and build goodwill:
- Respect robots.txt — parse and obey it before scraping any domain
- Rate limit requests — 1–2 requests per second, check Crawl-delay directives
- Identify yourself — use a descriptive User-Agent string with contact information
- Don't bypass authentication — scraping behind login walls without permission increases legal risk
- Minimize personal data — if you collect personal data, document your legal basis and implement data subject rights
- Don't republish copyrighted content — scraping for internal analysis is defensible; republishing full articles is not
- Cache responsibly — don't store more data than needed, set retention limits
- Implement opt-out — allow people to request removal of their data
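The rate-limiting advice above can be sketched as a small per-domain throttle. This is a minimal illustration, not from any particular library; the class name and default delay are arbitrary, and in practice you would set the delay from a site's Crawl-delay directive when one exists:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests to the same domain."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = {}  # domain -> monotonic timestamp of the last request

    def wait(self, domain):
        """Sleep just long enough to honor the per-domain delay, then record the hit."""
        last = self._last.get(domain)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self._last[domain] = time.monotonic()
```

Call `throttle.wait(domain)` before each request; requests to different domains are not delayed against each other, so a polite crawler can still make progress across sites.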
How does using SearchHive reduce legal risk?
Using a legitimate web data API like SearchHive instead of direct scraping reduces legal exposure in several ways:
- No direct server contact — you're calling an API, not hitting target websites directly
- No CFAA exposure — there's no "access" to target servers to dispute
- No trespass to chattels — your traffic doesn't consume target server resources
- No robots.txt violations — the API manages upstream compliance
- No CAPTCHA circumvention — a legal gray area that direct scrapers often navigate
```python
from searchhive import SwiftSearch, ScrapeForge

# Search: you never touch the target site directly
search = SwiftSearch(api_key='your-key')
results = search.query('competitor pricing data')

# Scrape: SearchHive handles the upstream request
scraper = ScrapeForge(api_key='your-key')
content = scraper.scrape('https://example.com/page')
```
This doesn't make you immune to all legal claims (data usage and copyright still apply), but it eliminates the most common vectors for scraping-related litigation. See How to Use SearchHive with Python for implementation details.
Summary
Web scraping of publicly accessible data does not violate the CFAA in the US under hiQ v. LinkedIn, and the CFAA no longer reaches TOS violations after Van Buren. In the EU, GDPR and database rights create additional constraints. Always respect robots.txt, rate limit your requests, minimize personal data collection, and identify yourself with a descriptive User-Agent. For teams that want to minimize legal risk, a managed API like SearchHive removes the most common vectors of scraping-related liability.
Disclaimer: This article is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for guidance specific to your situation.
Start with SearchHive's free tier for compliant web data access.
Related reading: How to Handle Rate Limiting in Web Scraping | What Is the Best Proxy for Web Scraping | How to Use SearchHive with Python