Extracting data from a website's API is one of the most reliable ways to get structured data at scale. Unlike scraping HTML, API responses come pre-formatted as JSON or XML, with consistent schemas and documented fields.
This guide covers every angle of extracting data from website APIs -- from finding hidden APIs to handling authentication, pagination, and rate limits.
Key Takeaways
- Hidden APIs (used by the site's own frontend) are everywhere and often undocumented
- Browser DevTools Network tab is the fastest way to discover what APIs a site uses
- SearchHive's ScrapeForge handles API extraction, proxy rotation, and auth for you
- Always check terms of service and rate limits before hitting any API
How do I find a website's hidden API?
Most modern websites are single-page applications (SPAs) that fetch data from internal APIs. These APIs are the same ones the site's own JavaScript calls, which means they return exactly the data the site displays.
Here's how to find them:
- Open the site in Chrome or Firefox
- Open DevTools (F12) and go to the Network tab
- Filter by Fetch/XHR requests
- Interact with the site (scroll, click, search)
- Look for JSON responses in the request list
You'll typically see endpoints like /api/products, /api/search, or /graphql. Click on any request to see the full URL, headers, and response body.
Many sites also have public API documentation. Check for /docs, /api-docs, or a developer portal link in the footer.
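Once you've spotted an endpoint in the Network tab, you can rebuild the request outside the browser. Here's a sketch using Python's requests library with a hypothetical /api/search endpoint -- swap in the URL, headers, and parameters you actually see in DevTools:

```python
import requests

# Hypothetical endpoint discovered in the Network tab -- replace with
# the real URL, headers, and params copied from DevTools.
req = requests.Request(
    "GET",
    "https://example.com/api/search",
    headers={"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
    params={"q": "shoes", "page": 1},
)
prepared = req.prepare()

# Inspect exactly what would be sent before firing it off
print(prepared.url)

# To actually send it:
# resp = requests.Session().send(prepared)
# data = resp.json()
```

Preparing the request first lets you verify the full URL and headers match what the browser sent, which is the quickest way to debug a replayed API call.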
What tools do I need to extract data from an API?
At minimum, you need:
- HTTP client -- requests (Python), fetch (JavaScript), or curl (CLI)
- JSON parser -- built into most languages
- Authentication handler -- for APIs requiring API keys or tokens
For production extraction, add:
- Rate limiter -- to respect API limits
- Retry logic -- for transient failures
- Data storage -- database or file system
- Monitoring -- to detect when APIs change or break
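For the storage piece, one simple approach is JSON Lines: append one object per line so a crashed job can resume without re-parsing a giant file. A minimal sketch (the path and record shape here are placeholders):

```python
import json

def save_items(items, path="products.jsonl"):
    # Append-only JSONL: one JSON object per line, so partial runs
    # leave a valid, resumable file behind.
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

save_items([{"name": "Widget", "price": 9.99}])
```

For larger jobs you'd swap the file for a database, but the append-per-batch pattern stays the same.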
How do I extract data from a REST API?
REST APIs use standard HTTP methods (GET, POST, PUT, DELETE) and return JSON. Here's a basic extraction pattern:
import requests

headers = {
    "Accept": "application/json",
    "User-Agent": "MyDataExtractor/1.0"
}

response = requests.get(
    "https://api.example.com/products",
    headers=headers,
    params={"page": 1, "limit": 50}
)
data = response.json()

for product in data["items"]:
    print(product["name"], product["price"])
Most APIs require some form of authentication. Common patterns include:
- API key in header: Authorization: Bearer YOUR_KEY
- API key in query param: ?api_key=YOUR_KEY
- OAuth 2.0: multi-step token exchange
- Cookie-based: session cookies from a login flow
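The first two patterns only differ in where the credential goes. A sketch with requests -- the endpoint and key are placeholders, and a real API usually expects one style, not both:

```python
import requests

API_KEY = "YOUR_KEY"  # placeholder -- substitute your real key

# Pattern 1: bearer token in the Authorization header (most common)
headers = {"Authorization": f"Bearer {API_KEY}"}

# Pattern 2: API key as a query parameter
params = {"api_key": API_KEY}

# Prepare (without sending) to see how each credential is attached
req = requests.Request(
    "GET", "https://api.example.com/products",
    headers=headers, params=params,
).prepare()
print(req.headers["Authorization"])
print(req.url)
```

OAuth 2.0 and cookie-based flows are more involved: both require an initial exchange (token request or login) before you can attach the resulting credential the same way.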
How do I handle API pagination?
APIs return data in pages to avoid sending massive responses. Common pagination patterns include:
Offset-based (most common):
all_products = []
page = 1

while True:
    resp = requests.get(
        "https://api.example.com/products",
        params={"page": page, "limit": 100}
    )
    data = resp.json()
    if not data["items"]:
        break
    all_products.extend(data["items"])
    page += 1
Cursor-based (better for large datasets):
all_products = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get("https://api.example.com/products", params=params)
    data = resp.json()
    all_products.extend(data["items"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
Cursor-based pagination doesn't skip or duplicate items if data changes between requests, which makes it more reliable for large extraction jobs.
What about GraphQL APIs?
GraphQL APIs use a single endpoint with query-based requests. You request exactly the fields you need:
# Build a GraphQL query; $cursor must be declared as a variable
query = """
query ($cursor: String) {
  products(first: 50, after: $cursor) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""

response = requests.post(
    "https://api.example.com/graphql",
    json={"query": query, "variables": {"cursor": None}},
    headers={"Authorization": "Bearer YOUR_KEY"}
)
Many sites (including Shopify, GitHub, and Twitter/X) use GraphQL. You can discover the schema using introspection queries if the API supports it.
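If introspection is enabled, a single query lists every type the schema exposes. A sketch -- the endpoint is a placeholder, and note that many production APIs disable introspection:

```python
import requests

# Minimal introspection query: list every type name and kind
introspection = "{ __schema { types { name kind } } }"

def fetch_schema_types(endpoint, token=None):
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.post(
        endpoint, json={"query": introspection}, headers=headers
    )
    resp.raise_for_status()
    return [t["name"] for t in resp.json()["data"]["__schema"]["types"]]

# Example (hypothetical endpoint):
# print(fetch_schema_types("https://api.example.com/graphql"))
```

The full introspection query (fields, arguments, deprecations) is larger, but even this minimal version tells you which object types exist and are worth querying.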
How does SearchHive help with API extraction?
If you don't want to deal with authentication, rate limits, proxy rotation, and pagination manually, SearchHive's ScrapeForge handles all of it:
from searchhive import Client

client = Client(api_key="your-key")

# Extract data from any URL -- ScrapeForge handles the rest
result = client.scrapeforge.scrape(
    url="https://example.com/products",
    format="json",
    extract={
        "products": {
            "type": "list",
            "selector": ".product-card",
            "fields": {
                "name": "h3.title",
                "price": ".price",
                "rating": ".stars::attr(data-rating)"
            }
        }
    }
)

for product in result["data"]["products"]:
    print(product["name"], product["price"])
ScrapeForge can also extract from JavaScript-rendered pages that require browser execution. It runs headless Chrome under the hood, so you get the same data a human browser would see.
For even deeper extraction, SearchHive's DeepDive uses AI to understand page structure and extract entities:
result = client.deepdive.extract(
    url="https://example.com/company/about",
    entities=["company_name", "founded_year", "employees", "headquarters"]
)
print(result["entities"])
How do I handle rate limits?
Respect the API's rate limits to avoid getting blocked. Common strategies:
- Check response headers for rate limit info (X-RateLimit-Remaining, Retry-After)
- Add delays between requests using time.sleep()
- Use exponential backoff when you get 429 responses
- Distribute requests across multiple API keys or IPs if allowed
import time
import requests

def api_get(url, params=None, max_retries=3):
    for attempt in range(max_retries):
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:
            wait = int(resp.headers.get("Retry-After", 60))
            time.sleep(wait)
            continue
        resp.raise_for_status()
    raise Exception(f"Failed after {max_retries} retries")
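The retry helper above trusts the Retry-After header. When that header is absent, exponential backoff with jitter is the usual fallback: double the wait after each failure, cap it, and randomize so many clients don't retry in lockstep. A minimal sketch of just the wait schedule:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    # attempt 0 -> up to ~1s, 1 -> ~2s, 2 -> ~4s ... capped at `cap`.
    # Full jitter: pick uniformly from [0, capped maximum] so
    # concurrent clients spread their retries out.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(60.0, 2 ** attempt):.0f}s")
```

Plugging backoff_delay(attempt) into time.sleep() where the 429 branch currently falls back to 60 seconds gives you graceful behavior on APIs that don't send Retry-After.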
Is extracting data from APIs legal?
Generally yes, if the API is publicly accessible and you're not violating terms of service. Key considerations:
- Public APIs with documentation are typically fine for commercial use within their limits
- Undocumented APIs exist in a gray area -- the site may block you or change the endpoint
- Authentication bypass (accessing paid APIs for free) is not legal
- Data usage matters -- even legally obtained data may have usage restrictions (GDPR, CCPA)
Always read the terms of service and respect rate limits. If in doubt, ask the API provider for permission.
Summary
Extracting data from website APIs is more reliable than HTML scraping and often faster to implement. Start by finding the API endpoints in DevTools, handle pagination and authentication, and respect rate limits.
For production workloads, SearchHive's ScrapeForge and DeepDive handle the hard parts -- proxy rotation, JavaScript rendering, auth management, and AI-powered extraction -- so you can focus on using the data, not fighting to get it.
Start extracting data in minutes. Sign up for SearchHive's free tier and get 500 credits to try SwiftSearch, ScrapeForge, and DeepDive. Read the docs for quickstart guides.