Extracting data from a website's API is one of the most reliable ways to get structured data at scale. Unlike scraping HTML, API responses come pre-formatted as JSON or XML, with consistent schemas and documented fields.
This guide covers every angle of extracting data from website APIs -- from finding hidden APIs to handling authentication, pagination, and rate limits.
Key Takeaways
- Hidden APIs (used by the site's own frontend) are everywhere and often undocumented
- Browser DevTools Network tab is the fastest way to discover what APIs a site uses
- SearchHive's ScrapeForge handles API extraction, proxy rotation, and auth for you
- Always check terms of service and rate limits before hitting any API
How do I find a website's hidden API?
Most modern websites are single-page applications (SPAs) that fetch data from internal APIs. These APIs are the same ones the site's own JavaScript calls, which means they return exactly the data the site displays.
Here's how to find them:
- Open the site in Chrome or Firefox
- Open DevTools (F12) and go to the Network tab
- Filter by Fetch/XHR requests
- Interact with the site (scroll, click, search)
- Look for JSON responses in the request list
You'll typically see endpoints like /api/products, /api/search, or /graphql. Click on any request to see the full URL, headers, and response body.
Many sites also have public API documentation. Check for /docs, /api-docs, or a developer portal link in the footer.
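Once you've spotted an endpoint in the Network tab, you can rebuild the request outside the browser. Here's a sketch using Python's requests library with a hypothetical /api/search endpoint -- swap in the URL, headers, and parameters you actually see in DevTools:

```python
import requests

# Hypothetical endpoint discovered in the Network tab -- replace with
# the real URL, headers, and params copied from DevTools.
req = requests.Request(
    "GET",
    "https://example.com/api/search",
    headers={"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
    params={"q": "shoes", "page": 1},
)
prepared = req.prepare()

# Inspect exactly what would be sent before firing it off
print(prepared.url)

# To actually send it:
# resp = requests.Session().send(prepared)
# data = resp.json()
```

Preparing the request first lets you verify the full URL and headers match what the browser sent, which is the quickest way to debug a replayed API call.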
What tools do I need to extract data from an API?
At minimum, you need:
- HTTP client -- requests (Python), fetch (JavaScript), or curl (CLI)
- JSON parser -- built into most languages
- Authentication handler -- for APIs requiring API keys or tokens
For production extraction, add:
- Rate limiter -- to respect API limits
- Retry logic -- for transient failures
- Data storage -- database or file system
- Monitoring -- to detect when APIs change or break
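For the storage piece, one simple approach is JSON Lines: append one object per line so a crashed job can resume without re-parsing a giant file. A minimal sketch (the path and record shape here are placeholders):

```python
import json

def save_items(items, path="products.jsonl"):
    # Append-only JSONL: one JSON object per line, so partial runs
    # leave a valid, resumable file behind.
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

save_items([{"name": "Widget", "price": 9.99}])
```

For larger jobs you'd swap the file for a database, but the append-per-batch pattern stays the same.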
How do I extract data from a REST API?
REST APIs use standard HTTP methods (GET, POST, PUT, DELETE) and return JSON. Here's a basic extraction pattern:
import requests

headers = {
    "Accept": "application/json",
    "User-Agent": "MyDataExtractor/1.0"
}

response = requests.get(
    "https://api.example.com/products",
    headers=headers,
    params={"page": 1, "limit": 50}
)
data = response.json()

for product in data["items"]:
    print(product["name"], product["price"])
Most APIs require some form of authentication. Common patterns include:
- API key in header: Authorization: Bearer YOUR_KEY
- API key in query param: ?api_key=YOUR_KEY
- OAuth 2.0: multi-step token exchange
- Cookie-based: session cookies from a login flow
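The first two patterns only differ in where the credential goes. A sketch with requests -- the endpoint and key are placeholders, and a real API usually expects one style, not both:

```python
import requests

API_KEY = "YOUR_KEY"  # placeholder -- substitute your real key

# Pattern 1: bearer token in the Authorization header (most common)
headers = {"Authorization": f"Bearer {API_KEY}"}

# Pattern 2: API key as a query parameter
params = {"api_key": API_KEY}

# Prepare (without sending) to see how each credential is attached
req = requests.Request(
    "GET", "https://api.example.com/products",
    headers=headers, params=params,
).prepare()
print(req.headers["Authorization"])
print(req.url)
```

OAuth 2.0 and cookie-based flows are more involved: both require an initial exchange (token request or login) before you can attach the resulting credential the same way.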
How do I handle API pagination?
APIs return data in pages to avoid sending massive responses. Common pagination patterns include:
Offset-based (most common):
all_products = []
page = 1

while True:
    resp = requests.get(
        "https://api.example.com/products",
        params={"page": page, "limit": 100}
    )
    data = resp.json()
    if not data["items"]:
        break
    all_products.extend(data["items"])
    page += 1
Cursor-based (better for large datasets):
all_products = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get("https://api.example.com/products", params=params)
    data = resp.json()
    all_products.extend(data["items"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
Cursor-based pagination doesn't skip or duplicate items if data changes between requests, which makes it more reliable for large extraction jobs.
What about GraphQL APIs?
GraphQL APIs use a single endpoint with query-based requests. You request exactly the fields you need:
# Build a GraphQL query; $cursor must be declared as a variable
query = """
query ($cursor: String) {
  products(first: 50, after: $cursor) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""

response = requests.post(
    "https://api.example.com/graphql",
    json={"query": query, "variables": {"cursor": None}},
    headers={"Authorization": "Bearer YOUR_KEY"}
)
Many sites (including Shopify, GitHub, and Twitter/X) use GraphQL. You can discover the schema using introspection queries if the API supports it.
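If introspection is enabled, a single query lists every type the schema exposes. A sketch -- the endpoint is a placeholder, and note that many production APIs disable introspection:

```python
import requests

# Minimal introspection query: list every type name and kind
introspection = "{ __schema { types { name kind } } }"

def fetch_schema_types(endpoint, token=None):
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.post(
        endpoint, json={"query": introspection}, headers=headers
    )
    resp.raise_for_status()
    return [t["name"] for t in resp.json()["data"]["__schema"]["types"]]

# Example (hypothetical endpoint):
# print(fetch_schema_types("https://api.example.com/graphql"))
```

The full introspection query (fields, arguments, deprecations) is larger, but even this minimal version tells you which object types exist and are worth querying.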
How does SearchHive help with API extraction?
If you don't want to deal with authentication, rate limits, proxy rotation, and pagination manually, SearchHive's ScrapeForge handles all of it:
from searchhive import Client

client = Client(api_key="your-key")

# Extract data from any URL -- ScrapeForge handles the rest
result = client.scrapeforge.scrape(
    url="https://example.com/products",
    format="json",
    extract={
        "products": {
            "type": "list",
            "selector": ".product-card",
            "fields": {
                "name": "h3.title",
                "price": ".price",
                "rating": ".stars::attr(data-rating)"
            }
        }
    }
)

for product in result["data"]["products"]:
    print(product["name"], product["price"])
ScrapeForge can also extract from JavaScript-rendered pages that require browser execution. It runs headless Chrome under the hood, so you get the same data a human browser would see.
For even deeper extraction, SearchHive's DeepDive uses AI to understand page structure and extract entities:
result = client.deepdive.extract(
    url="https://example.com/company/about",
    entities=["company_name", "founded_year", "employees", "headquarters"]
)
print(result["entities"])
How do I handle rate limits?
Respect the API's rate limits to avoid getting blocked. Common strategies:
- Check response headers for rate limit info (X-RateLimit-Remaining, Retry-After)
- Add delays between requests using time.sleep()
- Use exponential backoff when you get 429 responses
- Distribute requests across multiple API keys or IPs if allowed
import time
import requests

def api_get(url, params=None, max_retries=3):
    for attempt in range(max_retries):
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:
            wait = int(resp.headers.get("Retry-After", 60))
            time.sleep(wait)
            continue
        resp.raise_for_status()
    raise Exception(f"Failed after {max_retries} retries")
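The retry helper above trusts the Retry-After header. When that header is absent, exponential backoff with jitter is the usual fallback: double the wait after each failure, cap it, and randomize so many clients don't retry in lockstep. A minimal sketch of just the wait schedule:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    # attempt 0 -> up to ~1s, 1 -> ~2s, 2 -> ~4s ... capped at `cap`.
    # Full jitter: pick uniformly from [0, capped maximum] so
    # concurrent clients spread their retries out.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(60.0, 2 ** attempt):.0f}s")
```

Plugging backoff_delay(attempt) into time.sleep() where the 429 branch currently falls back to 60 seconds gives you graceful behavior on APIs that don't send Retry-After.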
Is extracting data from APIs legal?
Generally yes, if the API is publicly accessible and you're not violating terms of service. Key considerations:
- Public APIs with documentation are typically fine for commercial use within their limits
- Undocumented APIs exist in a gray area -- the site may block you or change the endpoint
- Authentication bypass (accessing paid APIs for free) is not legal
- Data usage matters -- even legally obtained data may have usage restrictions (GDPR, CCPA)
Always read the terms of service and respect rate limits. If in doubt, ask the API provider for permission.
Summary
Extracting data from website APIs is more reliable than HTML scraping and often faster to implement. Start by finding the API endpoints in DevTools, handle pagination and authentication, and respect rate limits.
For production workloads, SearchHive's ScrapeForge and DeepDive handle the hard parts -- proxy rotation, JavaScript rendering, auth management, and AI-powered extraction -- so you can focus on using the data, not fighting to get it.
Start extracting data in minutes. Sign up for SearchHive's free tier and get 500 credits to try SwiftSearch, ScrapeForge, and DeepDive. Read the docs for quickstart guides.