PDFs hold a massive amount of the world's structured data -- invoices, reports, research papers, government filings, and contracts. But extracting data from PDFs programmatically is notoriously painful. You're dealing with inconsistent layouts, embedded tables, scanned images, and encoding quirks.
Web APIs solve this by handling the parsing on their servers and returning structured JSON. This guide walks through the main approaches: dedicated PDF APIs, SearchHive's ScrapeForge for PDF URLs, and when to use each method.
Key Takeaways
- Dedicated PDF APIs (PDF.co, Adobe, AWS Textract) specialize in text extraction, OCR, and table parsing from PDFs.
- SearchHive ScrapeForge can extract content from publicly hosted PDFs when you just need the raw text or markdown conversion.
- OCR is necessary for scanned PDFs -- regular text extraction only works on native (text-based) PDFs.
- For programmatic workflows, combining a PDF API with SearchHive's DeepDive gives you both raw extraction and intelligent data structuring.
Prerequisites
- Python 3.8+
- requests library (pip install requests)
- A SearchHive API key (free signup with 500 credits)
- (Optional) PDF.co or AWS Textract account for advanced extraction
Step 1: Determine Your PDF Type
Not all PDFs are created equal. The extraction method depends on what you're working with:
Native (text-based) PDFs -- Created directly from digital documents. The text is embedded as vector data and can be extracted directly. Examples: exported reports, generated invoices, digital whitepapers.
Scanned (image-based) PDFs -- Created by scanning physical documents. They're essentially a series of images with no selectable text. These require OCR (Optical Character Recognition).
# Quick check: does the PDF contain extractable text?
import requests

def has_text_layer(url):
    # Download the first ~10 KB and look for text-layer markers.
    resp = requests.get(url, stream=True)
    data = b""
    for part in resp.iter_content(2048):
        data += part
        if len(data) > 10000:
            break
    resp.close()
    # Native PDFs contain BT (Begin Text) and Tj (show text) operators.
    # Caveat: many PDFs compress their content streams, which hides these
    # markers, so treat a negative result as "probably scanned", not proof.
    return b"BT" in data or b"Tj" in data

print(has_text_layer("https://example.com/report.pdf"))
Step 2: Extract Text from Native PDFs with SearchHive
For publicly hosted PDFs, SearchHive's ScrapeForge can fetch and convert the content to markdown or text in a single API call:
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.searchhive.dev/v1"

def extract_pdf_content(pdf_url, output_format="markdown"):
    # Extract text content from a publicly hosted PDF.
    # Args: pdf_url (direct URL), output_format (markdown or text)
    # Returns: extracted content as a string
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": pdf_url,
            "format": output_format,
            "render_js": False
        }
    )
    if response.status_code == 200:
        data = response.json()
        return data.get("content", data.get("markdown", ""))
    else:
        raise Exception(f"Extraction failed: {response.status_code} - {response.text}")

# Example: extract content from a research paper
content = extract_pdf_content(
    "https://arxiv.org/pdf/2301.00234.pdf",
    output_format="markdown"
)
print(content[:500])
This works well for research papers, whitepapers, and publicly accessible PDFs. The markdown output preserves headings, lists, and basic structure.
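Once you have the markdown, a common next step is splitting it into sections by heading for downstream processing (indexing, summarization, per-section extraction). The splitter below is an illustration, not part of any SearchHive SDK — a minimal sketch using only the standard library:

```python
import re

def split_by_headings(markdown: str) -> dict:
    """Split markdown into {heading: body} pairs for # and ## headings."""
    sections = {}
    current = "_preamble"
    lines = []
    for line in markdown.splitlines():
        m = re.match(r"#{1,2}\s+(.*)", line)
        if m:
            # Close out the previous section and start a new one.
            sections[current] = "\n".join(lines).strip()
            current = m.group(1).strip()
            lines = []
        else:
            lines.append(line)
    sections[current] = "\n".join(lines).strip()
    return sections
```

This keeps heading hierarchy flat, which is usually enough for chunking; nested section trees need a more involved parser.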
Step 3: Extract Structured Data with DeepDive
Raw text is useful, but you often need specific data points extracted -- line items from invoices, fields from forms, or specific sections from reports. SearchHive's DeepDive API handles this with natural language prompts:
def extract_structured_pdf(pdf_url, prompt):
    # Use AI to extract specific structured data from a PDF.
    # Args: pdf_url (URL to the PDF), prompt (what to extract)
    # Returns: structured data as a dictionary
    response = requests.post(
        f"{BASE_URL}/deepdive",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": pdf_url,
            "prompt": prompt
        }
    )
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"DeepDive failed: {response.status_code}")

# Extract invoice line items
invoice_data = extract_structured_pdf(
    "https://example.com/invoices/march-2025.pdf",
    "Extract all line items with description, quantity, unit price, and total. Also extract the invoice number, date, and grand total."
)
print(f"Invoice #{invoice_data['invoice_number']}")
print(f"Date: {invoice_data['date']}")
for item in invoice_data["line_items"]:
    print(f"  {item['description']}: {item['quantity']} x ${item['unit_price']} = ${item['total']}")
print(f"Grand Total: ${invoice_data['grand_total']}")
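AI-driven extraction can occasionally omit a field, so it's worth validating the response shape before indexing into it as the example above does. A minimal sketch of such a validator (the field names mirror the prompt above; adapt them to your own schema):

```python
def validate_invoice(data: dict) -> list:
    """Return a list of human-readable errors for missing invoice fields."""
    errors = []
    for field in ("invoice_number", "date", "grand_total", "line_items"):
        if field not in data:
            errors.append(f"missing field: {field}")
    for i, item in enumerate(data.get("line_items", [])):
        for field in ("description", "quantity", "unit_price", "total"):
            if field not in item:
                errors.append(f"line_items[{i}] missing {field}")
    return errors
```

If the error list is non-empty, you can retry with a more explicit prompt or route the document to manual review instead of crashing on a KeyError.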
Step 4: Handle Scanned PDFs with OCR
When you're dealing with scanned documents (no text layer), you need OCR. Here's how to handle this:
Option A: SearchHive + PDF.co for OCR
PDF_CO_KEY = "your-pdfco-key"

def ocr_pdf(pdf_url):
    # Extract text from scanned PDFs using PDF.co OCR.
    # Returns raw text that you can then process further.
    response = requests.post(
        "https://api.pdf.co/v1/pdf/convert/to/text",
        headers={"x-api-key": PDF_CO_KEY},
        json={
            "url": pdf_url,
            "inline": True,
            "pages": ""
        }
    )
    if response.status_code == 200:
        return response.json().get("body", "")
    raise Exception(f"OCR failed: {response.status_code}")

# Chain: OCR extraction then SearchHive structuring
raw_text = ocr_pdf("https://example.com/scanned-contract.pdf")
print(raw_text[:500])
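For documents with a predictable layout, the raw OCR text can also be structured locally with plain regular expressions, which avoids a second API call. A sketch with hypothetical field labels — the patterns here are illustrative and need adapting to your actual documents:

```python
import re

def extract_fields(raw_text, patterns):
    """Pull labelled fields out of OCR text using one regex per field."""
    out = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, raw_text, re.IGNORECASE)
        # Missing fields map to None so callers can detect OCR gaps.
        out[name] = m.group(1).strip() if m else None
    return out

fields = extract_fields(
    "Effective Date: March 1, 2025\nParty A: Acme Corp",
    {
        "effective_date": r"Effective Date:\s*(.+)",
        "party_a": r"Party A:\s*(.+)",
    },
)
```

Regex works well for label-value layouts; for free-form contracts, prompt-based extraction is usually more robust.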
Option B: AWS Textract for Tables and Forms
AWS Textract excels at extracting tables and form fields from scanned documents:
import boto3

textract = boto3.client("textract", region_name="us-east-1")

def extract_with_textract(pdf_bytes):
    # Extract tables and form data from scanned PDF bytes.
    # Note: the synchronous analyze_document call accepts single-page
    # documents via Bytes; multi-page PDFs require the asynchronous
    # start_document_analysis API with an S3 object.
    response = textract.analyze_document(
        Document={"Bytes": pdf_bytes},
        FeatureTypes=["TABLES", "FORMS"]
    )
    blocks = response["Blocks"]
    by_id = {b["Id"]: b for b in blocks}

    def cell_text(cell):
        # A CELL block's CHILD relationships point at WORD blocks.
        words = []
        for rel in cell.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words.extend(by_id[i].get("Text", "") for i in rel["Ids"])
        return " ".join(words).strip()

    results = {"tables": [], "forms": []}
    for block in blocks:
        if block["BlockType"] == "TABLE":
            # A TABLE's CHILD relationships point at its CELL blocks.
            cells = [
                by_id[i]
                for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD"
                for i in rel["Ids"]
                if by_id[i]["BlockType"] == "CELL"
            ]
            results["tables"].append([cell_text(c) for c in cells])
    return results
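Textract returns a flat Blocks list, so if you need rows and columns rather than a flat list of cell strings, you can rebuild the grid from each CELL's RowIndex and ColumnIndex. The helper below is a standalone sketch that works on any Textract-style Blocks list (the toy data in the usage example is illustrative, not real Textract output):

```python
def table_to_rows(blocks):
    """Rebuild a 2-D table from Textract-style CELL and WORD blocks."""
    by_id = {b["Id"]: b for b in blocks}
    rows = {}
    for b in blocks:
        if b.get("BlockType") != "CELL":
            continue
        text = ""
        for rel in b.get("Relationships", []):
            if rel["Type"] == "CHILD":
                text = " ".join(by_id[i].get("Text", "") for i in rel["Ids"])
        # Index cells by their declared grid position.
        rows.setdefault(b["RowIndex"], {})[b["ColumnIndex"]] = text.strip()
    return [[rows[r][c] for c in sorted(rows[r])] for r in sorted(rows)]
```

From here, the nested lists drop straight into csv.writer or a pandas DataFrame.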
Step 5: Build a Complete PDF Extraction Pipeline
Here's a production-ready pipeline that combines everything:
import requests

class PDFExtractor:
    def __init__(self, searchhive_key):
        self.api_key = searchhive_key
        self.base_url = "https://api.searchhive.dev/v1"

    def extract(self, pdf_url, prompt=None, format="markdown"):
        # Extract content from a PDF URL.
        # If prompt is provided, uses DeepDive for structured extraction.
        # Otherwise, uses ScrapeForge for raw content extraction.
        headers = {"Authorization": f"Bearer {self.api_key}"}
        if prompt:
            response = requests.post(
                f"{self.base_url}/deepdive",
                headers=headers,
                json={"url": pdf_url, "prompt": prompt}
            )
        else:
            response = requests.post(
                f"{self.base_url}/scrape",
                headers=headers,
                json={"url": pdf_url, "format": format}
            )
        response.raise_for_status()
        return response.json()

    def batch_extract(self, pdf_urls, prompt=None):
        # Extract data from multiple PDFs, collecting per-URL results.
        results = []
        for url in pdf_urls:
            try:
                data = self.extract(url, prompt=prompt)
                results.append({"url": url, "status": "success", "data": data})
            except Exception as e:
                results.append({"url": url, "status": "error", "error": str(e)})
        return results

# Usage
extractor = PDFExtractor("your-api-key")

# Simple content extraction
content = extractor.extract(
    "https://example.com/annual-report.pdf",
    format="markdown"
)

# Structured data extraction
financials = extractor.extract(
    "https://example.com/annual-report.pdf",
    prompt="Extract revenue, net income, and EPS for each fiscal year mentioned in the report. Return as a list of objects with year, revenue, net_income, and eps fields."
)

# Batch processing multiple PDFs
urls = [
    "https://example.com/reports/Q1-2025.pdf",
    "https://example.com/reports/Q2-2025.pdf",
    "https://example.com/reports/Q3-2025.pdf",
]
results = extractor.batch_extract(urls, prompt="Extract quarter, revenue, and operating expenses")
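The batch_extract method above processes URLs one at a time. Since the work is I/O-bound, a thread pool can speed up larger batches considerably. A sketch that is independent of the class — pass it any single-URL extract function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_extract_concurrent(extract_fn, urls, max_workers=4):
    """Run extract_fn over urls concurrently, recording per-URL status."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_fn, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results.append({"url": url, "status": "success", "data": fut.result()})
            except Exception as e:
                # One failed PDF should not abort the whole batch.
                results.append({"url": url, "status": "error", "error": str(e)})
    return results
```

Keep max_workers modest: most APIs enforce rate limits, and too much concurrency just converts successes into 429 errors.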
Common Issues
PDF URLs that require authentication
SearchHive's ScrapeForge can only access publicly available PDFs. For authenticated PDFs, download the file first and use a local extraction library like pdfplumber or PyMuPDF:
import pdfplumber

def extract_local_pdf(filepath):
    with pdfplumber.open(filepath) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""
        return full_text
Large PDFs timing out
Break large PDFs into page ranges. Most APIs accept a pages parameter:
def extract_pages(pdf_url, page_range="1-10"):
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": pdf_url, "format": "markdown", "pages": page_range}
    )
    return response.json()
Encoded or corrupted text
Some PDFs use custom font encodings that produce garbled text. OCR is the fallback for these cases -- treat the PDF as image-based even if it has a text layer.
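A cheap heuristic can flag garbled output automatically before you fall back to OCR. The sketch below counts replacement and non-printable characters; the 15% threshold is an arbitrary starting point, not a tuned value:

```python
def looks_garbled(text, threshold=0.15):
    """Heuristic: flag text with a high share of unprintable characters."""
    if not text:
        return True
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t\r")
    )
    return bad / len(text) > threshold
```

Run it on the first page's output: if it trips, route the whole document through your OCR path instead of trusting the text layer.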
Next Steps
- Start extracting today: Sign up for SearchHive with 500 free credits and try the ScrapeForge API on your PDFs.
- For OCR-heavy workflows: Combine SearchHive with PDF.co or AWS Textract for scanned documents.
- Check our docs: Visit searchhive.dev/docs for the full API reference and Python SDK.
Related: /blog/building-ai-agents-with-web-scraping-apis | /compare/firecrawl | /compare/scrapingbee