How to Use an Academic Search API — Step-by-Step
Academic search APIs let you programmatically access research papers, citations, metadata, and full-text content from scholarly databases. Whether you're building a literature review tool, an AI research assistant, or a citation network analyzer, an academic search API is the foundation.
This tutorial walks through building a complete academic search pipeline using Python and SearchHive's APIs.
Prerequisites
- Python 3.8+ installed
- A SearchHive API key (get one free)
- Basic familiarity with Python and the requests library
Install dependencies:
pip install requests aiohttp
Step 1: Set Up Your API Client
Start by creating a reusable API client:
# academic_search.py
import requests
import json

class AcademicSearchClient:
    def __init__(self, api_key):
        self.base_url = "https://api.searchhive.dev/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def search(self, query, num=10):
        # Use SwiftSearch with academic-focused queries
        response = requests.get(
            f"{self.base_url}/search",
            headers=self.headers,
            params={
                "q": query + " site:arxiv.org OR site:scholar.google.com OR site:semanticscholar.org",
                "num": num
            }
        )
        return response.json().get("results", [])

    def scrape_paper(self, url):
        # Extract structured data from paper pages
        response = requests.post(
            f"{self.base_url}/scrape",
            headers=self.headers,
            json={
                "url": url,
                "extract": {
                    "fields": [
                        {"name": "title", "selector": "h1"},
                        {"name": "abstract", "selector": ".abstract"},
                        {"name": "authors", "selector": ".authors"},
                        {"name": "date", "selector": "time", "attr": "datetime"},
                        {"name": "body", "selector": "article"}
                    ]
                }
            }
        )
        return response.json()

# Initialize
client = AcademicSearchClient("your_api_key_here")
Step 2: Search for Papers
Search across multiple academic sources with a single query:
# Search for recent LLM research
papers = client.search(
    "large language model reasoning 2026",
    num=10
)

for p in papers:
    print(f"Title: {p.get('title', 'N/A')}")
    print(f"URL: {p.get('url', '')}")
    print(f"Snippet: {p.get('snippet', '')[:150]}...")
    print()
The site: operator in the query targets academic sources like arXiv, Google Scholar, and Semantic Scholar. You can customize these sources based on your field.
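For instance, a biomedical project might target PubMed and bioRxiv instead of arXiv. A small helper (hypothetical, not part of the client above) can build the source filter from a domain list:

```python
# Build a site-restricted query from a list of academic domains.
# The default domains are illustrative; swap in sources for your field.
def build_academic_query(query, domains=None):
    if domains is None:
        domains = ["arxiv.org", "semanticscholar.org"]
    site_filter = " OR ".join(f"site:{d}" for d in domains)
    return f"{query} {site_filter}"

biomed_query = build_academic_query(
    "protein folding prediction",
    domains=["pubmed.ncbi.nlm.nih.gov", "biorxiv.org"]
)
print(biomed_query)
```

You can then pass the result straight to client.search() in place of a manually assembled query string.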
Step 3: Extract Paper Metadata
For each search result, extract structured metadata from the paper page:
def get_paper_details(client, url):
    try:
        data = client.scrape_paper(url)
        return {
            "title": data.get("title", ""),
            "abstract": data.get("abstract", ""),
            "authors": data.get("authors", ""),
            "date": data.get("date", ""),
            "url": url,
            "body_length": len(data.get("body", ""))
        }
    except Exception as e:
        return {"url": url, "error": str(e)}

# Get details for top results
paper_details = []
for p in papers[:5]:
    details = get_paper_details(client, p["url"])
    paper_details.append(details)
    print(f"Extracted: {details.get('title', 'No title')}")

print(f"\nTotal papers with full metadata: {len(paper_details)}")
Step 4: Filter and Rank Results
Not every search result is equally relevant. Add filtering logic:
def filter_papers(papers, min_body_length=500, keywords=None):
    if keywords is None:
        keywords = []
    filtered = []
    for paper in papers:
        # Skip papers with errors or insufficient content
        if "error" in paper:
            continue
        if paper.get("body_length", 0) < min_body_length:
            continue
        # Score by keyword presence in title and abstract
        text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
        score = sum(5 for kw in keywords if kw.lower() in text)
        paper["relevance_score"] = score
        filtered.append(paper)
    # Sort by relevance score descending
    return sorted(filtered, key=lambda p: p.get("relevance_score", 0), reverse=True)

ranked = filter_papers(
    paper_details,
    keywords=["reasoning", "chain-of-thought", "transformer"]
)

print("Top papers by relevance:")
for p in ranked[:3]:
    print(f"  [{p.get('relevance_score', 0)}pts] {p.get('title', 'N/A')}")
Step 5: Use DeepDive for Research Synthesis
For a comprehensive overview of a research topic, use SearchHive's DeepDive API:
def research_topic(client, query):
    response = requests.post(
        f"{client.base_url}/deepdive",
        headers=client.headers,
        json={"query": query}
    )
    return response.json()

# Get a synthesized research overview
synthesis = research_topic(
    client,
    "current state of chain-of-thought reasoning in large language models 2026"
)

print("Research Synthesis:")
print(synthesis.get("content", "No content")[:500])
DeepDive goes beyond simple search — it synthesizes information from multiple sources into a coherent research summary. Useful for getting up to speed on new topics quickly.
Step 6: Save Results to JSON
Export your research findings for downstream use:
from datetime import datetime

def save_papers(papers, filename="research_papers.json"):
    output = {
        "query": "large language model reasoning 2026",
        "total_results": len(papers),
        "papers": papers,
        "exported_at": datetime.now().isoformat()
    }
    with open(filename, "w") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(papers)} papers to {filename}")

save_papers(ranked)
Step 7: Build a Monitoring Pipeline
Academic research moves fast. Set up a weekly monitor for new papers in your field:
import hashlib

TRACKING_FILE = "tracked_papers.json"

def load_tracked():
    try:
        with open(TRACKING_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"seen": []}

def save_tracked(data):
    with open(TRACKING_FILE, "w") as f:
        json.dump(data, f, indent=2)

def check_new_papers(client, query):
    tracked = load_tracked()
    seen_hashes = set(tracked["seen"])
    results = client.search(query, num=20)
    new_papers = []
    for r in results:
        url_hash = hashlib.md5(r["url"].encode()).hexdigest()
        if url_hash not in seen_hashes:
            seen_hashes.add(url_hash)
            new_papers.append(r)
    tracked["seen"] = list(seen_hashes)[-1000:]  # Keep last 1000
    save_tracked(tracked)
    return new_papers

# Run weekly via cron
new = check_new_papers(client, "attention mechanism transformer architecture 2026")
if new:
    print(f"Found {len(new)} new papers!")
    for p in new:
        print(f"  - {p.get('title', 'N/A')}")
else:
    print("No new papers found.")
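To actually schedule the check, one option is a weekly crontab entry. The script name and paths below are assumptions; adjust them to wherever you saved the monitor script:

```shell
# Run every Monday at 08:00; adjust paths and interpreter to your setup
0 8 * * 1 /usr/bin/python3 /home/you/research/monitor_papers.py >> /home/you/research/monitor.log 2>&1
```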
Step 8: Complete Pipeline
Here's the complete pipeline combining all steps:
# complete_academic_pipeline.py
import requests, json, hashlib
from datetime import datetime

API_KEY = "your_api_key_here"
BASE = "https://api.searchhive.dev/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def academic_search(query, num=10):
    # Search academic sources
    resp = requests.get(f"{BASE}/search", headers=HEADERS, params={
        "q": f"{query} site:arxiv.org OR site:semanticscholar.org",
        "num": num
    })
    return resp.json().get("results", [])

def extract_paper(url):
    # Get structured paper data
    resp = requests.post(f"{BASE}/scrape", headers=HEADERS, json={
        "url": url,
        "extract": {"fields": [
            {"name": "title", "selector": "h1"},
            {"name": "abstract", "selector": ".abstract"},
            {"name": "authors", "selector": ".authors"}
        ]}
    })
    return resp.json()

def run_pipeline(query):
    print(f"Searching: {query}")
    results = academic_search(query, num=10)
    papers = []
    for r in results[:5]:
        try:
            data = extract_paper(r["url"])
            papers.append({
                "title": data.get("title", r.get("title", "")),
                "abstract": data.get("abstract", ""),
                "authors": data.get("authors", ""),
                "url": r["url"]
            })
        except Exception as e:
            print(f"  Error extracting {r['url']}: {e}")
    output = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "papers_found": len(papers),
        "papers": papers
    }
    with open(f"academic_results_{datetime.now():%Y%m%d}.json", "w") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    print(f"Pipeline complete. {len(papers)} papers saved.")
    return output

if __name__ == "__main__":
    run_pipeline("graph neural networks for molecular property prediction 2026")
Common Issues and Fixes
Search results aren't academic: Make sure your query includes site: operators for academic domains. Try site:arxiv.org, site:semanticscholar.org, site:pubmed.ncbi.nlm.nih.gov.
Extraction returns empty fields: Different academic sites use different HTML structures. Adjust your CSS selectors for each source, or use a more generic selector like article for the full body text.
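One pragmatic pattern is a per-domain selector map keyed by hostname. The selectors below are purely illustrative — verify them against each site's actual HTML before relying on them:

```python
from urllib.parse import urlparse

# Illustrative per-domain CSS selectors; verify against each site's real markup.
SELECTORS = {
    "arxiv.org": {"title": "h1.title", "abstract": "blockquote.abstract"},
}
DEFAULT = {"title": "h1", "abstract": ".abstract", "body": "article"}

def selectors_for(url):
    # Pick selectors by hostname, falling back to generic defaults.
    host = urlparse(url).netloc
    return SELECTORS.get(host, DEFAULT)

print(selectors_for("https://arxiv.org/abs/2401.00001"))
print(selectors_for("https://example.com/paper"))
```

You would then build the "fields" list passed to the scrape endpoint from the returned dict instead of hardcoding one selector set for every source.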
Rate limiting: If you're processing many papers, add delays between requests or use async I/O with aiohttp for concurrent extraction.
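A minimal stdlib-only throttle between sequential requests looks like this; the delay value is an assumption, so check your plan's actual limits:

```python
import time

def throttled(urls, fetch, delay=1.0):
    # Call fetch(url) for each URL, sleeping `delay` seconds between calls
    # to stay under the API's rate limit.
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Usage sketch: throttled(paper_urls, client.scrape_paper, delay=0.5)
```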
Authentication errors: Verify your API key is correct and has sufficient credits. Check your SearchHive dashboard for usage details.
Next Steps
Once your basic pipeline works, consider adding:
- Citation network analysis — extract references from papers and build a graph
- Automated summarization — feed paper abstracts to an LLM for concise summaries
- Trend detection — track keyword frequency over time to spot emerging research areas
- Alert system — notify via email or Slack when new papers match your criteria
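As a taste of the trend-detection idea, counting keyword frequency across the abstracts you already collect needs only the stdlib (the sample paper and keywords here are placeholders):

```python
from collections import Counter
import re

def keyword_trends(papers, keywords):
    # Count total occurrences of each keyword across all abstracts.
    counts = Counter()
    for paper in papers:
        text = paper.get("abstract", "").lower()
        for kw in keywords:
            counts[kw] += len(re.findall(re.escape(kw.lower()), text))
    return counts

sample = [{"abstract": "Chain-of-thought reasoning improves reasoning."}]
print(keyword_trends(sample, ["reasoning", "chain-of-thought"]))
```

Run this over each weekly batch from the monitoring pipeline and compare the counters to spot rising terms.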
Conclusion
Building an academic search pipeline takes less than 100 lines of Python with SearchHive's APIs. SwiftSearch discovers papers across academic sources, ScrapeForge extracts structured metadata, and DeepDive synthesizes research overviews — all through a single API key.
Start building your academic search pipeline today. Get 500 free credits and have your first search running in 5 minutes. No credit card required. Read the docs for API reference and examples.
See also: /tutorials/web-scraping-api-python, /compare/serpapi