How to Use a Search API for RAG -- Step-by-Step Tutorial
RAG (Retrieval-Augmented Generation) combines an LLM with real-time data retrieval to produce accurate, grounded answers. A search API is the retrieval engine that feeds relevant documents into the LLM's context.
This tutorial walks through building a production RAG pipeline using SearchHive's SwiftSearch API, from basic setup to advanced optimization.
Prerequisites
- Python 3.9+
- A SearchHive API key (free, 500 credits)
- An LLM API key (OpenAI, Anthropic, or local model)
- Basic familiarity with Python and REST APIs
Key Takeaways
- RAG reduces hallucinations by 40-60% compared to bare LLMs
- Search quality matters more than LLM quality for factual accuracy
- Token-efficient retrieval saves money -- clean snippets beat raw web pages
- SearchHive provides the cheapest unified search + scrape + research API for RAG
Step 1: Understand the RAG Architecture
A RAG system has four components:
- Query processing: Transform the user question into an effective search query
- Retrieval: Search for relevant documents using a search API
- Context assembly: Format and rank the retrieved documents
- Generation: Pass the context to an LLM for answer generation
The search API is the most critical component. Garbage in, garbage out -- if retrieval returns irrelevant results, the LLM can't generate a good answer.
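The four stages above can be sketched as plain function composition. The stubs below are placeholders to show the data flow, not real API or LLM calls:

```python
def process_query(question: str) -> str:
    # Strip filler words to form a tighter search query (toy example)
    return " ".join(w for w in question.split() if w.lower() not in {"what", "is", "the"})

def retrieve(query: str) -> list[dict]:
    # Stand-in for a real search API call
    return [{"title": "Doc A", "snippet": "short snippet text", "url": "https://example.com"}]

def assemble_context(results: list[dict]) -> str:
    # Number each result so the LLM can cite it
    return "\n".join(f"[{i}] {r['title']}: {r['snippet']}" for i, r in enumerate(results, 1))

def generate(context: str, question: str) -> str:
    # Stand-in for an LLM call
    return f"Answer to '{question}' based on:\n{context}"

def rag(question: str) -> str:
    query = process_query(question)
    results = retrieve(query)
    context = assemble_context(results)
    return generate(context, question)
```

Each stage is independently swappable, which is why the rest of this tutorial can upgrade retrieval and query processing without touching the others.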
Step 2: Set Up the Search Client
Create a search client that wraps SearchHive's SwiftSearch API:
```python
import requests
from typing import Optional

SEARCHHIVE_API_KEY = "your_key"
SEARCHHIVE_BASE = "https://api.searchhive.dev/v1/swift-search"

def search_web(
    query: str,
    limit: int = 5,
    format: str = "markdown",
    recency: Optional[str] = None,
) -> list[dict]:
    """Search the web and return structured results."""
    payload = {
        "query": query,
        "limit": limit,
        "format": format,
    }
    if recency:
        payload["recency"] = recency
    resp = requests.post(
        SEARCHHIVE_BASE,
        headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"},
        json=payload,
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()
    return data.get("results", [])
```
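Network calls to any search API can fail transiently (timeouts, rate limits). A thin retry wrapper with exponential backoff keeps the pipeline resilient; this is a generic sketch, not part of any SDK:

```python
import time
from typing import Callable

def search_with_retry(search_fn: Callable[[str], list], query: str,
                      retries: int = 3, backoff: float = 0.5) -> list:
    """Retry transient search failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return search_fn(query)
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: surface the error
            time.sleep(backoff * (2 ** attempt))
    return []
```

In production you would catch a narrower exception type (e.g. `requests.RequestException`) and skip retries on 4xx client errors.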
Step 3: Build the Context Window
Format search results into a context string that the LLM can use. The key is to include source attribution and keep the context token-efficient:
```python
def build_context(results: list[dict], max_chars: int = 8000) -> str:
    """Build a token-efficient context from search results."""
    parts = []
    total_chars = 0
    for i, result in enumerate(results, 1):
        title = result.get("title", "Untitled")
        snippet = result.get("snippet", "")
        url = result.get("url", "")
        header = f"[{i}] {title}\nSource: {url}\n"
        entry = f"{header}{snippet}\n"
        entry_len = len(entry)
        if total_chars + entry_len > max_chars:
            # Truncate the last entry's snippet to fit the remaining budget
            remaining = max_chars - total_chars - len(header) - 1
            if remaining > 100:
                parts.append(f"{header}{snippet[:remaining]}\n")
            break
        parts.append(entry)
        total_chars += entry_len
    return "\n".join(parts)
```
Step 4: Implement the RAG Pipeline
Connect the search retrieval to your LLM for answer generation:
```python
OPENAI_API_KEY = "your_openai_key"

def rag_query(question: str) -> str:
    """Full RAG pipeline: search -> context -> generate."""
    # Step 1: Retrieve relevant documents
    results = search_web(question, limit=5)
    if not results:
        return "I couldn't find relevant information for your question. Try rephrasing or check the topic."

    # Step 2: Build context
    context = build_context(results)

    # Step 3: Generate answer using LLM
    system_prompt = """You are a research assistant. Answer questions using ONLY the provided search results.
- Cite sources using [1], [2], etc. referencing the numbered results
- If the search results don't contain enough information, say so clearly
- Be concise and factual
- Do not make up information not present in the search results"""

    user_prompt = f"""SEARCH RESULTS:
{context}

QUESTION: {question}

ANSWER (with source citations):"""

    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "temperature": 0,  # Factual, not creative
            "max_tokens": 1000,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Usage
answer = rag_query("What is the pricing for SearchHive API?")
print(answer)
```
Step 5: Add Query Expansion for Better Retrieval
User questions are often vague. Query expansion reformulates the question into more effective search queries:
```python
def expand_query(question: str) -> list[str]:
    """Generate multiple search queries from a single question."""
    # Strategy 1: Direct query
    queries = [question]

    # Strategy 2: Extract key terms (simple approach)
    # Remove question words and common filler
    stop_words = {"what", "is", "the", "how", "does", "a", "an", "to", "for", "of", "and", "in", "on", "can", "best"}
    terms = [w for w in question.lower().split() if w not in stop_words and len(w) > 2]
    if terms:
        queries.append(" ".join(terms))

    # Strategy 3: Add context qualifiers
    if any(w in question.lower() for w in ["price", "cost", "pricing", "cheap", "expensive"]):
        queries.append(f"{question} pricing comparison 2026")
    if any(w in question.lower() for w in ["best", "top", "recommend"]):
        queries.append(f"{question} review comparison")

    return queries[:3]  # Limit to avoid excessive API calls
```
```python
def rag_with_expansion(question: str) -> str:
    """RAG pipeline with query expansion."""
    all_results = []
    seen_urls = set()
    for query in expand_query(question):
        results = search_web(query, limit=5)
        for r in results:
            url = r.get("url", "")
            if url not in seen_urls:
                all_results.append(r)
                seen_urls.add(url)

    # Deduplicate and take top results
    context = build_context(all_results[:7])
    # ... proceed with LLM generation (same as Step 4)
```
Step 6: Implement Multi-Source Retrieval
For complex questions, combine search snippets with full page content for deeper context:
```python
def deep_rag(question: str) -> str:
    """RAG with fallback to full page content for complex queries."""
    # Start with search snippets
    results = search_web(question, limit=5)

    # For complex questions, also scrape the top 2 results for full content
    full_contents = []
    for result in results[:2]:
        url = result.get("url")
        if url:
            try:
                scrape_resp = requests.post(
                    "https://api.searchhive.dev/v1/scrapeforge",
                    headers={"Authorization": f"Bearer {SEARCHHIVE_API_KEY}"},
                    json={
                        "url": url,
                        "format": "markdown",
                        "render_js": True,
                    },
                    timeout=30,
                )
                scrape_resp.raise_for_status()
                content = scrape_resp.json().get("content", "")
                if content:
                    full_contents.append(f"FULL CONTENT from {url}:\n{content[:3000]}\n")
            except requests.RequestException:
                continue  # Skip pages that fail to scrape

    # Combine snippets and full content
    snippet_context = build_context(results)
    full_context = "\n\n".join(full_contents)
    combined_context = f"SNIPPETS:\n{snippet_context}\n\nFULL PAGES:\n{full_context}"

    # Pass to LLM with combined context
    # ... (same LLM call as Step 4, but with combined_context)
    return combined_context  # In practice, pass to LLM
```
Step 7: Add Response Quality Evaluation
Evaluate whether your RAG system is actually improving answers:
```python
def evaluate_rag(question: str, answer: str, sources: list[dict]) -> dict:
    """Simple self-evaluation of RAG response quality."""
    # Check if answer cites sources
    has_citations = any(f"[{i}]" in answer for i in range(1, len(sources) + 1))

    # Check if answer acknowledges insufficient info (honesty)
    honest_disclaimer = any(phrase in answer.lower() for phrase in [
        "couldn't find", "not enough information", "unclear", "not specified",
    ])

    # Token efficiency (rough estimate: 1 token ~ 4 chars)
    context_chars = sum(len(s.get("snippet", "")) for s in sources)
    answer_chars = len(answer)
    efficiency = answer_chars / max(context_chars, 1)

    return {
        "has_citations": has_citations,
        "honest_when_uncertain": honest_disclaimer,
        "context_used_chars": context_chars,
        "answer_length_chars": answer_chars,
        "context_efficiency": round(efficiency, 2),
        "num_sources": len(sources),
    }
```
Complete Code Example
Here's the full pipeline in one file:
```python
import requests

SEARCHHIVE_KEY = "your_key"
OPENAI_KEY = "your_openai_key"

def search_web(query: str, limit: int = 5) -> list:
    resp = requests.post(
        "https://api.searchhive.dev/v1/swift-search",
        headers={"Authorization": f"Bearer {SEARCHHIVE_KEY}"},
        json={"query": query, "limit": limit, "format": "markdown"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def build_context(results: list, max_chars: int = 8000) -> str:
    parts, total = [], 0
    for i, r in enumerate(results, 1):
        entry = f"[{i}] {r.get('title', '')}\n{r.get('url', '')}\n{r.get('snippet', '')}\n"
        if total + len(entry) > max_chars:
            break
        parts.append(entry)
        total += len(entry)
    return "\n".join(parts)

def ask(question: str) -> str:
    results = search_web(question, limit=5)
    if not results:
        return "No relevant results found."
    context = build_context(results)
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Answer using ONLY the provided search results. Cite sources as [1], [2], etc."},
                {"role": "user", "content": f"RESULTS:\n{context}\n\nQ: {question}\nA:"},
            ],
            "temperature": 0,
            "max_tokens": 800,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Test it
print(ask("What is SearchHive and how much does it cost?"))
```
Common Issues
Irrelevant search results: Improve your query formulation. Add site-specific operators (e.g., site:docs.openai.com) for technical questions. Use the recency parameter for time-sensitive queries.
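As one illustration of tighter query formulation, a small helper can attach a site restriction and the recency parameter before calling the API. The `site:` operator syntax is an assumption about what the search backend honors; verify against the docs:

```python
from typing import Optional

def refine_query(question: str, site: Optional[str] = None,
                 recency: Optional[str] = None) -> dict:
    """Build a search payload with an optional site restriction and recency filter."""
    query = question if site is None else f"{question} site:{site}"
    payload = {"query": query, "limit": 5}
    if recency:
        payload["recency"] = recency
    return payload
```

The returned dict can be passed directly as the JSON body of the search request.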
Context window overflow: Large search results can exceed the LLM's context limit. The build_context function with max_chars prevents this. GPT-4o supports 128K tokens (~500K chars), but keeping context under 8K chars improves accuracy and reduces cost.
Slow responses: SearchHive SwiftSearch responds in ~200ms. The bottleneck is usually the LLM. Use gpt-4o-mini for speed, gpt-4o for quality, or local models for cost.
Cost optimization: At GPT-4o-mini pricing ($0.15/1M input tokens), a typical RAG query with 5 search results costs less than $0.001 in LLM tokens. SearchHive credits at $0.0001 each make the search layer equally cheap. Total per-query cost: under $0.002.
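The arithmetic above can be packaged as a rough per-query estimator. The rates are hard-coded from the figures quoted here and output-token cost is omitted, so treat the result as a lower bound and verify current pricing:

```python
def per_query_cost(input_tokens: int, searches: int = 1) -> float:
    """Estimate per-query cost in dollars: gpt-4o-mini input at $0.15/1M tokens,
    one search credit per search at $0.0001 each. Output tokens not included."""
    llm = input_tokens / 1_000_000 * 0.15
    search = searches * 0.0001
    return llm + search
```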
Next Steps
- Add caching: Cache search results for common questions to reduce API calls
- Implement DeepDive: Use SearchHive's DeepDive for research-heavy questions that need multi-page synthesis
- Build evaluation: Track answer accuracy over time with a test set
- Add filtering: Restrict search to trusted sources for domain-specific applications
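For the caching suggestion above, a minimal in-memory sketch might look like this; the `search_fn` parameter stands in for whatever search client you use, and a production system would likely use Redis or similar instead of a module-level dict:

```python
import hashlib
import json
import time
from typing import Callable

_CACHE: dict = {}
CACHE_TTL_SECONDS = 3600  # Serve cached results for up to an hour

def cached_search(query: str, search_fn: Callable[..., list], limit: int = 5) -> list:
    """Serve repeated queries from an in-memory cache within the TTL window."""
    key = hashlib.sha256(json.dumps([query, limit]).encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # Cache hit: skip the API call entirely
    results = search_fn(query, limit=limit)
    _CACHE[key] = (time.time(), results)
    return results
```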
Summary
A search API is the foundation of any effective RAG system. SearchHive's SwiftSearch provides fast, token-efficient retrieval that grounds LLM answers in real-time web data. Combined with ScrapeForge for deep content extraction, you have everything needed for production-grade RAG.
Start with 500 free credits -- no credit card required. Build your first RAG pipeline in under 50 lines of Python. Check out the docs for complete API references and integration guides.
For more on LLM integration patterns, see /blog/api-for-llm-integration-common-questions-answered and /compare/tavily.