PDFs are the internet's stubborn data format. They carry invoices, research papers, financial reports, government filings, and product manuals — all structured information locked behind a page-layout format designed for printing, not for data extraction.
Extracting text, tables, and structured data from PDFs programmatically is harder than it looks. Layout-based PDFs (the kind most applications generate) don't have semantic markup — there's no concept of a "table" or "header" in the PDF specification, only positioned characters on a page. This guide covers the APIs that handle this complexity and how to integrate PDF extraction into your data pipeline.
Key Takeaways
- Text-based PDFs (most invoices, reports) can be parsed with simple libraries — no API needed
- Scanned PDFs (images of documents) require OCR, which is where extraction APIs earn their cost
- Table extraction from PDFs is the hardest problem — few tools do it well
- SearchHive DeepDive can extract structured data from PDF URLs using AI, bypassing the parsing problem entirely
- Costs range from free (self-hosted Tesseract) to $0.05-0.50 per page for commercial APIs
The PDF Extraction Problem
PDFs come in three main flavors, each requiring a different extraction approach:
1. Text-based PDFs — Created by word processors, report generators, or form tools. The text is embedded as character streams with position data. These are relatively easy to extract from with libraries like PyMuPDF, pdfplumber, or PyPDF2.
2. Scanned PDFs — Images of paper documents. No embedded text layer. Require OCR (optical character recognition) to convert images to text.
3. Hybrid PDFs — Combination of text layers and embedded images. Common in financial reports with embedded charts and signatures.
The extraction challenge scales with complexity: text extraction is straightforward, table extraction requires layout analysis, and structured data extraction (fields, key-value pairs) requires understanding document semantics.
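A pragmatic way to route documents between these approaches is to check how much of each page carries an embedded text layer: pages with little or no extractable text are almost certainly scans. A minimal sketch of that heuristic, where the `classify_pdf` helper and the 200-character threshold are illustrative assumptions rather than library defaults:

```python
# Heuristic PDF classifier: route documents by how much embedded text they carry.
# The 200-character threshold is an illustrative assumption; tune it per corpus.

def classify_pdf(page_char_counts, min_chars_per_page=200):
    """Classify a PDF from per-page extracted-character counts.

    page_char_counts: one len(page.get_text()) value per page (e.g. from
    PyMuPDF). Returns "text", "scanned", or "hybrid".
    """
    if not page_char_counts:
        return "scanned"
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    if text_pages == len(page_char_counts):
        return "text"       # every page has a usable text layer
    if text_pages == 0:
        return "scanned"    # no text layer anywhere: OCR required
    return "hybrid"         # mix of text pages and image-only pages

print(classify_pdf([1500, 2200, 1800]))  # text
print(classify_pdf([0, 0, 0]))           # scanned
print(classify_pdf([1500, 0, 1800]))     # hybrid
```

Running this check first lets you send text-based files to a free library and reserve paid OCR calls for the pages that need them.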
Self-Hosted: Python Libraries for Text-Based PDFs
Before paying for an API, check whether open-source Python libraries handle your use case.
```python
# Option 1: PyMuPDF (fast, reliable, AGPL-licensed with a commercial option)
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page in doc:
    text = page.get_text()
    print(text[:500])
```
```python
# Option 2: pdfplumber (best for table extraction)
import pdfplumber

with pdfplumber.open("financial-report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```
Cost: Free.
Best for: Text-based PDFs with simple layouts.
Limitations: No OCR, limited table detection, no AI-powered understanding, struggles with complex layouts.
SearchHive DeepDive — AI-Powered PDF Extraction
SearchHive's DeepDive API can extract structured data from PDF URLs using natural language prompts. Instead of parsing PDF layouts, you describe what data you need and the AI handles the extraction.
```python
import requests

API_KEY = "your-searchhive-key"

# Extract structured data from a PDF
resp = requests.post(
    "https://api.searchhive.dev/v1/deepdive",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/invoice-2026.pdf",
        "prompt": "Extract invoice number, date, vendor name, line items (description, quantity, unit price, total), and the grand total"
    }
)
result = resp.json()
print(result["structured_data"])
# {"invoice_number": "INV-2026-0412", "date": "2026-04-01", "vendor": "Acme Corp",
#  "line_items": [...], "grand_total": 4250.00}
```
This approach works for both text-based and scanned PDFs because it uses AI vision for OCR and layout understanding. No need to configure parsing rules or train on document templates.
Pricing: Uses SearchHive credits. Free tier: 500 credits; Starter: $9 for 5K credits; Builder: $49 for 100K credits. A PDF extraction costs more credits than a simple web scrape, reflecting the computational complexity.
Best for: Structured data extraction from invoices, reports, forms, and documents where you know what fields you need.
AWS Textract — Best Cloud OCR Service
Amazon Textract uses machine learning to extract text, tables, and forms from scanned documents. It handles both text-based and scanned PDFs.
```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Note: the synchronous analyze_document call accepts images and single-page
# PDFs. Multi-page PDFs must go through the asynchronous
# start_document_analysis / get_document_analysis API, with the file in S3.
with open("document.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"]
    )

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```
Pricing: $1.50 per 1,000 pages (first 1M pages), $0.60/1K thereafter. Table detection adds $0.15/1K pages.
Strengths: Excellent table extraction, form field detection, AWS ecosystem integration.
Weaknesses: No structured output (you get text blocks, not key-value pairs), AWS lock-in, complex response format.
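That complex response format is navigable with a small amount of glue code. Here is a sketch of reassembling Textract's `KEY_VALUE_SET` blocks into a plain dict, based on the documented `Blocks` structure; the `textract_forms_to_dict` helper and the simplified sample response are illustrative, not part of the SDK.

```python
# Reassemble Textract FORMS output (KEY_VALUE_SET blocks) into a {key: value} dict.
def textract_forms_to_dict(blocks):
    by_id = {b["Id"]: b for b in blocks}

    def child_text(block):
        # Concatenate the WORD children referenced by a block.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    child = by_id[cid]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    result = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key = child_text(block)
            value = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        value = child_text(by_id[vid])
            if key:
                result[key] = value
    return result

# Simplified sample response for illustration:
sample = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Total:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "$4,250.00"},
]
print(textract_forms_to_dict(sample))  # {'Total:': '$4,250.00'}
```

The same pattern (an ID index plus relationship traversal) works for reconstructing tables from `TABLE` and `CELL` blocks.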
Google Cloud Document AI — Best for Template-Based Extraction
Google's Document AI offers specialized processors for invoices, receipts, contracts, and tax forms. You can also train custom processors for your document types.
Pricing: $1.50/1,000 pages for the general processor. Specialized processors (invoices, receipts) are $0.60-$2.00/1K pages.
Strengths: Pre-trained processors for common document types, custom training, good accuracy.
Weaknesses: Google Cloud dependency, setup complexity, training custom processors requires labeled data.
Adobe PDF Services API — Best for PDF Manipulation + Extraction
Adobe's PDF Services API handles extraction alongside creation, conversion, and manipulation. Their text extraction leverages Adobe's deep PDF knowledge.
Pricing: Free 500 pages/mo, then $0.05/page. Pay-as-you-go.
```python
import requests

# Extract text via Adobe PDF Services. The PDF must first be uploaded through
# the Assets API, which returns the assetID referenced below.
resp = requests.post(
    "https://pdf-services.adobe.io/operation/extractpdf",
    headers={
        "Authorization": "Bearer YOUR_TOKEN",
        "x-api-key": "YOUR_KEY",
        "Content-Type": "application/json"
    },
    json={
        "assetID": "your-asset-id",
        "elementsToExtract": ["text", "tables"]
    }
)
```
Strengths: Official Adobe product, reliable text + table extraction, manipulation tools included.
Weaknesses: No OCR on base plan, no AI understanding, proprietary API format.
Comparison Table
| Service | OCR Support | Table Extraction | Structured Output | Free Tier | Price per Page |
|---|---|---|---|---|---|
| SearchHive DeepDive | Yes (AI) | Yes (AI) | Yes (NL prompts) | 500 credits | ~$0.001-$0.005 |
| PyMuPDF/pdfplumber | No | Limited (text) | No | Free | $0 |
| AWS Textract | Yes | Yes | Partial | 1K free/mo | $0.0015 |
| Google Document AI | Yes | Yes | Partial (pre-trained) | 50 free/mo | $0.0015 |
| Adobe PDF Services | No (base plan) | Yes | No | 500/mo | $0.05 |
Best Practices
Classify your PDFs before choosing a tool. Text-based PDFs don't need OCR. Scanned documents need vision-capable extraction. Hybrid PDFs need both.
Batch process when possible. Most APIs offer bulk processing at reduced rates. Processing 100 pages in one API call is cheaper than 100 individual calls.
Validate extraction accuracy. PDF extraction is never perfect. Spot-check results, especially for table data where cell boundaries can be misidentified.
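For structured output, a cheap first line of defense is a schema check: confirm that every field you asked for came back and that numeric fields actually parse. A minimal sketch, where `validate_extraction` is an illustrative helper (not part of any SDK) and the field names mirror the invoice example earlier:

```python
# Validate an extraction result before it enters the rest of the pipeline.
def validate_extraction(data, required_fields, numeric_fields=()):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in required_fields:
        if field not in data or data[field] in (None, ""):
            problems.append(f"missing field: {field}")
    for field in numeric_fields:
        try:
            float(data.get(field, ""))
        except (TypeError, ValueError):
            problems.append(f"non-numeric field: {field}")
    return problems

record = {"invoice_number": "INV-2026-0412", "vendor": "Acme Corp", "grand_total": 4250.00}
print(validate_extraction(record, ["invoice_number", "vendor", "grand_total"], ["grand_total"]))
# []
print(validate_extraction({}, ["invoice_number"]))
# ['missing field: invoice_number']
```

Records that fail validation can be routed to a retry with a different tool or flagged for manual review instead of silently corrupting downstream data.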
Cache results. PDFs don't change. Once you've extracted data from a document, store it locally rather than re-processing.
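A content-addressed cache keyed on the PDF's SHA-256 makes re-processing a no-op even if the file is later renamed or re-downloaded. A minimal sketch, assuming a local JSON file per document (the `extract_with_cache` helper and on-disk layout are illustrative choices):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def extract_with_cache(pdf_bytes, extract_fn, cache_dir):
    """Return a cached extraction for these exact bytes, or compute and cache it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(pdf_bytes).hexdigest()  # content-addressed: renames don't matter
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = extract_fn(pdf_bytes)  # the expensive API call happens at most once
    cache_file.write_text(json.dumps(result))
    return result

# Demo with a stand-in extractor so the cache behavior is visible:
calls = []
def fake_extract(_pdf_bytes):
    calls.append(1)
    return {"pages": 1}

with tempfile.TemporaryDirectory() as d:
    first = extract_with_cache(b"%PDF-1.7 ...", fake_extract, d)
    second = extract_with_cache(b"%PDF-1.7 ...", fake_extract, d)

print(first, second, len(calls))  # {'pages': 1} {'pages': 1} 1
```

At $0.05 per page, skipping even a few hundred repeat extractions pays for the storage many times over.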
Use AI extraction for structured data needs. If you need specific fields (invoice totals, report metrics, form values), AI-powered extraction (SearchHive DeepDive, Textract forms) outperforms raw text extraction followed by regex parsing.
Getting Started
For quick extraction from text-based PDFs, start with pdfplumber — it's free and handles most cases. When you need OCR, table extraction, or structured data from complex documents, SearchHive DeepDive eliminates the parsing complexity with natural language prompts.
Get 500 free credits and extract your first PDF in minutes. Documentation covers DeepDive extraction, SwiftSearch, and ScrapeForge workflows.