RAG Pipeline Architecture: The Complete Guide
Retrieval-Augmented Generation (RAG) has become the standard architecture for building LLM applications that need accurate, up-to-date information. Instead of relying solely on training data, RAG systems retrieve relevant context at inference time and feed it to the LLM alongside the user's query.
This guide covers the complete RAG pipeline architecture -- from document ingestion to generation -- with practical code examples showing how SearchHive's APIs power the retrieval layer.
Key Takeaways
- RAG reduces hallucinations by grounding LLM responses in retrieved documents
- Hybrid search (dense + BM25) with a cross-encoder re-ranker is the current best practice
- Web search APIs serve as an external retrieval source alongside vector databases
- SearchHive provides the complete retrieval layer -- search, scrape, and analyze in one API
- A well-architected RAG pipeline, measured with evaluation frameworks like RAGAS, can reach faithfulness scores above 0.9
What Is RAG?
RAG is an architecture pattern that augments LLM responses with external knowledge retrieved at inference time. The core flow is simple:
User Query --> Retrieve Relevant Documents --> Augment Prompt --> LLM Generates Answer
Why it matters:
- Reduces hallucinations on factual queries by grounding responses in retrieved documents
- Enables knowledge updates without retraining -- just update the knowledge base
- Provides citations and source traceability
- Reduces cost vs. fine-tuning for knowledge-heavy tasks
Core RAG Pipeline Components
1. Document Processing and Chunking
Breaking raw documents into semantically coherent pieces that embed well:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple docs, baseline |
| Recursive character | Splits on paragraphs, lines, sentences hierarchically | General purpose |
| Semantic | Embed sentences, group similar adjacent ones | Preserving meaning boundaries |
| Document structure | Split on headers, sections, chapters | Markdown/HTML docs |
| Late chunking | Embed the full document with a long-context model, then pool token embeddings per chunk | Cross-chunk context |
Best practices:
- Use 10-20% overlap between chunks
- Chunk size: 256-1024 tokens is typical
- Include metadata with each chunk (source, page number, section)
- Consider parent-child chunking: retrieve small chunks but pass parent to LLM
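The simplest of these strategies, fixed-size chunking with overlap, can be sketched in a few lines. This is illustrative only: it counts whitespace-separated words, whereas production systems typically count tokens (e.g. with a tokenizer like tiktoken), and the size/overlap values are just the defaults from the guidelines above.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into fixed-size word chunks, with `overlap` words shared
    between consecutive chunks so context is not cut at hard boundaries."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, each chunk shares its last 30 words with the start of the next chunk, which corresponds to the 10-20% overlap recommended above.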
2. Embedding Models
Convert text chunks into vector representations:
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Best general-purpose |
| text-embedding-3-large (OpenAI) | 3072 | Higher accuracy, more storage |
| BGE-M3 (BAAI, open) | 1024 | Best open-source |
| Nomic embed | 768 | Open, strong per dollar |
| GTE-Qwen2 | 768/1536 | Strong open-source option |
Rule: Use the same embedding model for indexing and querying.
3. Vector Databases
Store and query embeddings at scale:
| Type | Examples | Best For |
|---|---|---|
| Purpose-built | Pinecone, Weaviate, Milvus, Qdrant | Production at scale |
| Vector-capable SQL | pgvector (Postgres) | When you already use Postgres |
| Local/Embedded | Chroma, LanceDB, FAISS | Development, prototyping |
Trend: pgvector + Postgres is increasingly popular for teams wanting to avoid a separate database.
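To make concrete what these systems do under the hood, here is a toy in-memory store using brute-force cosine similarity — a sketch only, since real vector databases add ANN indexes (HNSW), metadata filtering, and persistence on top of this core operation:

```python
import math

class MiniVectorStore:
    """Toy vector store: normalized vectors + brute-force cosine search."""

    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], chunk: str) -> None:
        # Normalize on insert so a dot product later equals cosine similarity
        norm = math.sqrt(sum(x * x for x in embedding))
        self.items.append(([x / norm for x in embedding], chunk))

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        norm = math.sqrt(sum(x * x for x in query_embedding))
        q = [x / norm for x in query_embedding]
        scored = [
            (sum(a * b for a, b in zip(vec, q)), chunk)
            for vec, chunk in self.items
        ]
        scored.sort(key=lambda s: -s[0])
        return [(chunk, score) for score, chunk in scored[:top_k]]
```

Brute force is fine for prototyping (and is essentially what FAISS's flat index does); purpose-built databases exist to make this fast at millions of vectors.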
4. Retrieval
Single methods:
- Dense (semantic): Cosine similarity on embeddings -- good for meaning
- Sparse (keyword): BM25 -- good for exact terms and proper nouns
Advanced techniques:
- Hybrid search: Combine dense + sparse with reciprocal rank fusion -- current best practice
- Re-ranking: After initial retrieval (top-50), use a cross-encoder to get top-10. This is the highest-ROI improvement.
- Query transformation: Rewrite queries for better retrieval (HyDE, query expansion, decomposition)
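Reciprocal rank fusion, the merging step in hybrid search, is simple enough to show in full: each result list contributes 1/(k + rank) per document, and documents ranked well by both dense and sparse retrieval float to the top. The constant k = 60 is the commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that cosine scores and BM25 scores live on incompatible scales.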
5. Generation
The LLM receives: System prompt + Retrieved Context + User Query. Use structured output (JSON mode) for extractive tasks, and streaming for user-facing applications.
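Assembling that augmented prompt might look like the sketch below. The exact layout is an assumption — prompt formats vary by model — but numbering the sources is a common trick that lets the model cite them as [1], [2], and so on.

```python
def build_rag_prompt(system_prompt: str, chunks: list[dict], user_query: str) -> str:
    """Assemble: instructions, numbered sources, then the question."""
    context = "\n\n".join(
        f"[{i}] (source: {c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"{system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
        "Answer using only the context above, citing sources as [n]."
    )
```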
Where Web Search APIs Fit In
Web search APIs serve as an external retrieval source alongside or instead of a local vector database.
Architecture Patterns
Pattern A: Search-as-Fallback
Query --> Vector DB --> [If low confidence] --> Web Search --> Generate
Pattern B: Search-First (Web RAG)
Query --> Web Search API --> Scrape/Extract --> Chunk --> Embed --> Generate
Pattern C: Parallel/Hybrid Retrieval
Query --> [Vector DB] + [Web Search API] --> Merge/Rerank --> Generate
Pattern D: Search-Grounded Pre-Processing
Query --> Web Search --> Summarize results --> Use as context --> Generate
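Pattern A's routing logic can be sketched as below. The `vector_search` and `web_search` callables and the 0.75 confidence threshold are placeholders — in practice you would wire in your vector store client, a web search API, and a threshold tuned on your own evaluation data.

```python
from typing import Callable

def retrieve_with_fallback(
    query: str,
    vector_search: Callable[[str], list[tuple[str, float]]],  # (chunk, score) pairs
    web_search: Callable[[str], list[str]],
    min_score: float = 0.75,
) -> list[str]:
    """Pattern A: try the local vector DB first; fall back to web search
    when the best local similarity score is below the confidence threshold."""
    local = vector_search(query)
    if local and local[0][1] >= min_score:
        return [chunk for chunk, _ in local]
    return web_search(query)
```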
Popular Web Search APIs for RAG
| API | Strength | Notes |
|---|---|---|
| SearchHive | Unified search + scrape + analyze | One API key, one SDK |
| Tavily | Built for AI/RAG | Returns clean content |
| Brave Search | Privacy-focused | Good for general search |
| Serper | Google results via API | Popular in LangChain |
| Bing Web Search | Enterprise-grade | Microsoft ecosystem |
The key advantage of SearchHive: you get search AND scraping AND analysis in one API. No need for separate tools.
Building a RAG Pipeline with SearchHive
Pattern B: Web RAG Implementation
```python
from searchhive import SwiftSearch, ScrapeForge, DeepDive

search = SwiftSearch(api_key="your-key")
scraper = ScrapeForge(api_key="your-key")
analyzer = DeepDive(api_key="your-key")

def web_rag_pipeline(query: str, num_sources: int = 5) -> dict:
    """Complete Web RAG pipeline using SearchHive."""
    # 1. Retrieve relevant URLs via search
    results = search.search(query=query, num_results=num_sources)
    urls = [r.url for r in results.organic]

    # 2. Fetch and convert to clean markdown
    pages = scraper.scrape_urls(urls=urls, format="markdown")

    # 3. Build context from scraped content
    context_parts = []
    sources = []
    for page in pages:
        # Summarize to reduce context length
        analysis = analyzer.analyze(text=page.content, summarize=True)
        context_parts.append(f"Source: {page.url}\n{analysis.summary}")
        sources.append(page.url)

    context = "\n---\n".join(context_parts)
    return {
        "query": query,
        "context": context,
        "sources": sources,
        "context_length": len(context),
    }

# Use it
result = web_rag_pipeline("What are the latest advances in transformer architecture?")
print(f"Context length: {result['context_length']} chars")
print(f"Sources: {result['sources']}")
```
Pattern C: Hybrid RAG with Local + Web
```python
from searchhive import SwiftSearch, ScrapeForge

search = SwiftSearch(api_key="your-key")
scraper = ScrapeForge(api_key="your-key")

def hybrid_rag(query: str, vector_results: list, web_results_count: int = 3) -> str:
    """Combine local vector DB results with web search results."""
    # Local retrieval (from your vector DB)
    local_context = "\n".join(f"[Local] {r['text']}" for r in vector_results)

    # Web retrieval
    web_results = search.search(query=query, num_results=web_results_count)
    web_pages = scraper.scrape_urls(
        urls=[r.url for r in web_results.organic],
        format="markdown",
    )
    web_context = "\n".join(
        f"[Web: {p.url}] {p.content[:500]}..." for p in web_pages
    )

    return local_context + "\n--- WEB RESULTS ---\n" + web_context
```
Evaluation: How Do You Know Your RAG Works?
Track these metrics using frameworks like RAGAS or TruLens:
| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | Are retrieved chunks relevant? | > 0.7 |
| Context Recall | Are all needed chunks retrieved? | > 0.8 |
| Faithfulness | Is the answer grounded in context? | > 0.9 |
| Answer Relevance | Does the answer address the question? | > 0.85 |
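As a rough intuition for the first two metrics, here are simplified set-based versions. This is a sketch only: frameworks like RAGAS compute these with an LLM judge over individual claims rather than exact chunk matching, so treat the functions below as illustrations of the definitions, not drop-in replacements.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)
```

A pipeline retrieving many irrelevant chunks fails precision (wasted context, distracted model); one missing needed chunks fails recall (the answer cannot be grounded).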
Common Challenges and Solutions
| Challenge | Solution |
|---|---|
| Retrieves wrong documents | Better chunking, semantic chunking |
| Misses exact term matches | Add BM25 -- switch to hybrid search |
| Retrieves relevant but not most relevant | Add a cross-encoder re-ranker |
| Complex multi-faceted queries fail | Query decomposition, sub-query retrieval |
| Stale knowledge | Incremental indexing, web search fallback |
| High token costs | Markdown conversion, re-ranking to reduce context |
| Slow retrieval | HNSW tuning, caching, pgvector for simpler stacks |
Recommended Architecture (2025 Best Practice)
The single highest-impact improvement for most RAG systems is adding a cross-encoder re-ranker after initial retrieval. The second is switching to hybrid search (dense + BM25).
For the retrieval layer, SearchHive provides:
- SwiftSearch for web-based retrieval (Pattern B and C)
- ScrapeForge for fetching and converting content to LLM-ready markdown
- DeepDive for summarization and analysis of retrieved content
One API key, one SDK, one invoice. Start with the free tier -- 100 searches and 50 scrapes per day is enough to prototype and evaluate your RAG pipeline before scaling.
Read the SearchHive documentation and see our search API integration tutorial for more implementation details.
See also: /blog/complete-guide-to-ai-agent-web-scraping | /blog/how-to-search-api-integration-step-by-step