RAG Pipeline Architecture: The Complete Guide
Retrieval-Augmented Generation (RAG) has become the standard architecture for building LLM applications that need accurate, up-to-date information. Instead of relying solely on training data, RAG systems retrieve relevant context at inference time and feed it to the LLM alongside the user's query.
This guide covers the complete RAG pipeline architecture -- from document ingestion to generation -- with practical code examples showing how SearchHive's APIs power the retrieval layer.
Key Takeaways
- RAG reduces hallucinations by grounding LLM responses in retrieved documents
- Hybrid search (dense + BM25) with a cross-encoder re-ranker is the current best practice
- Web search APIs serve as an external retrieval source alongside vector databases
- SearchHive provides the complete retrieval layer -- search, scrape, and analyze in one API
- A well-architected RAG pipeline, measured with evaluation frameworks like RAGAS, can reach faithfulness scores above 0.9
What Is RAG?
RAG is an architecture pattern that augments LLM responses with external knowledge retrieved at inference time. The core flow is simple:
User Query --> Retrieve Relevant Documents --> Augment Prompt --> LLM Generates Answer
Why it matters:
- Reduces hallucinations on factual queries by grounding responses in retrieved documents
- Enables knowledge updates without retraining -- just update the knowledge base
- Provides citations and source traceability
- Reduces cost vs. fine-tuning for knowledge-heavy tasks
Core RAG Pipeline Components
1. Document Processing and Chunking
Breaking raw documents into semantically coherent pieces that embed well:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple docs, baseline |
| Recursive character | Splits on paragraphs, lines, sentences hierarchically | General purpose |
| Semantic | Embed sentences, group similar adjacent ones | Preserving meaning boundaries |
| Document structure | Split on headers, sections, chapters | Markdown/HTML docs |
| Late chunking | Embed the full document with a long-context model, then pool token embeddings per chunk | Cross-chunk context |
Best practices:
- Use 10-20% overlap between chunks
- Chunk size: 256-1024 tokens is typical
- Include metadata with each chunk (source, page number, section)
- Consider parent-child chunking: retrieve small chunks but pass parent to LLM
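The simplest of these strategies, fixed-size chunking with overlap, can be sketched in a few lines. This is illustrative only: it counts whitespace-separated words, whereas production systems typically count tokens (e.g. with a tokenizer like tiktoken), and the size/overlap values are just the defaults from the guidelines above.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into fixed-size word chunks, with `overlap` words shared
    between consecutive chunks so context is not cut at hard boundaries."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, each chunk shares its last 30 words with the start of the next chunk, which corresponds to the 10-20% overlap recommended above.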
2. Embedding Models
Convert text chunks into vector representations:
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Best general-purpose |
| text-embedding-3-large (OpenAI) | 3072 | Higher accuracy, more storage |
| BGE-M3 (BAAI, open) | 1024 | Best open-source |
| Nomic embed | 768 | Open, strong per dollar |
| GTE-Qwen2 | 768/1536 | Strong open-source option |
Rule: Use the same embedding model for indexing and querying.
3. Vector Databases
Store and query embeddings at scale:
| Type | Examples | Best For |
|---|---|---|
| Purpose-built | Pinecone, Weaviate, Milvus, Qdrant | Production at scale |
| Vector-capable SQL | pgvector (Postgres) | When you already use Postgres |
| Local/Embedded | Chroma, LanceDB, FAISS | Development, prototyping |
Trend: pgvector + Postgres is increasingly popular for teams wanting to avoid a separate database.
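To make concrete what these systems do under the hood, here is a toy in-memory store using brute-force cosine similarity — a sketch only, since real vector databases add ANN indexes (HNSW), metadata filtering, and persistence on top of this core operation:

```python
import math

class MiniVectorStore:
    """Toy vector store: normalized vectors + brute-force cosine search."""

    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], chunk: str) -> None:
        # Normalize on insert so a dot product later equals cosine similarity
        norm = math.sqrt(sum(x * x for x in embedding))
        self.items.append(([x / norm for x in embedding], chunk))

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        norm = math.sqrt(sum(x * x for x in query_embedding))
        q = [x / norm for x in query_embedding]
        scored = [
            (sum(a * b for a, b in zip(vec, q)), chunk)
            for vec, chunk in self.items
        ]
        scored.sort(key=lambda s: -s[0])
        return [(chunk, score) for score, chunk in scored[:top_k]]
```

Brute force is fine for prototyping (and is essentially what FAISS's flat index does); purpose-built databases exist to make this fast at millions of vectors.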
4. Retrieval
Single methods:
- Dense (semantic): Cosine similarity on embeddings -- good for meaning
- Sparse (keyword): BM25 -- good for exact terms and proper nouns
Advanced techniques:
- Hybrid search: Combine dense + sparse with reciprocal rank fusion -- current best practice
- Re-ranking: After initial retrieval (top-50), use a cross-encoder to get top-10. This is the highest-ROI improvement.
- Query transformation: Rewrite queries for better retrieval (HyDE, query expansion, decomposition)
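Reciprocal rank fusion, the merging step in hybrid search, is simple enough to show in full: each result list contributes 1/(k + rank) per document, and documents ranked well by both dense and sparse retrieval float to the top. The constant k = 60 is the commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that cosine scores and BM25 scores live on incompatible scales.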
5. Generation
The LLM receives: System prompt + Retrieved Context + User Query. Use structured output (JSON mode) for extractive tasks, and streaming for user-facing applications.
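Assembling that augmented prompt might look like the sketch below. The exact layout is an assumption — prompt formats vary by model — but numbering the sources is a common trick that lets the model cite them as [1], [2], and so on.

```python
def build_rag_prompt(system_prompt: str, chunks: list[dict], user_query: str) -> str:
    """Assemble: instructions, numbered sources, then the question."""
    context = "\n\n".join(
        f"[{i}] (source: {c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"{system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
        "Answer using only the context above, citing sources as [n]."
    )
```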
Where Web Search APIs Fit In
Web search APIs serve as an external retrieval source alongside or instead of a local vector database.
Architecture Patterns
Pattern A: Search-as-Fallback
Query --> Vector DB --> [If low confidence] --> Web Search --> Generate
Pattern B: Search-First (Web RAG)
Query --> Web Search API --> Scrape/Extract --> Chunk --> Embed --> Generate
Pattern C: Parallel/Hybrid Retrieval
Query --> [Vector DB] + [Web Search API] --> Merge/Rerank --> Generate
Pattern D: Search-Grounded Pre-Processing
Query --> Web Search --> Summarize results --> Use as context --> Generate
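Pattern A's routing logic can be sketched as below. The `vector_search` and `web_search` callables and the 0.75 confidence threshold are placeholders — in practice you would wire in your vector store client, a web search API, and a threshold tuned on your own evaluation data.

```python
from typing import Callable

def retrieve_with_fallback(
    query: str,
    vector_search: Callable[[str], list[tuple[str, float]]],  # (chunk, score) pairs
    web_search: Callable[[str], list[str]],
    min_score: float = 0.75,
) -> list[str]:
    """Pattern A: try the local vector DB first; fall back to web search
    when the best local similarity score is below the confidence threshold."""
    local = vector_search(query)
    if local and local[0][1] >= min_score:
        return [chunk for chunk, _ in local]
    return web_search(query)
```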
Popular Web Search APIs for RAG
| API | Strength | Notes |
|---|---|---|
| SearchHive | Unified search + scrape + analyze | One API key, one SDK |
| Tavily | Built for AI/RAG | Returns clean content |
| Brave Search | Privacy-focused | Good for general search |
| Serper | Google results via API | Popular in LangChain |
| Bing Web Search | Enterprise-grade | Microsoft ecosystem |
The key advantage of SearchHive: you get search AND scraping AND analysis in one API. No need for separate tools.
Building a RAG Pipeline with SearchHive
Pattern B: Web RAG Implementation
```python
from searchhive import SwiftSearch, ScrapeForge, DeepDive

search = SwiftSearch(api_key="your-key")
scraper = ScrapeForge(api_key="your-key")
analyzer = DeepDive(api_key="your-key")

def web_rag_pipeline(query: str, num_sources: int = 5) -> dict:
    """Complete Web RAG pipeline using SearchHive."""
    # 1. Retrieve relevant URLs via search
    results = search.search(query=query, num_results=num_sources)
    urls = [r.url for r in results.organic]

    # 2. Fetch and convert to clean markdown
    pages = scraper.scrape_urls(urls=urls, format="markdown")

    # 3. Build context from scraped content
    context_parts = []
    sources = []
    for page in pages:
        # Summarize to reduce context length
        analysis = analyzer.analyze(text=page.content, summarize=True)
        context_parts.append(f"Source: {page.url}\n{analysis.summary}")
        sources.append(page.url)

    context = "\n---\n".join(context_parts)
    return {
        "query": query,
        "context": context,
        "sources": sources,
        "context_length": len(context),
    }

# Use it
result = web_rag_pipeline("What are the latest advances in transformer architecture?")
print(f"Context length: {result['context_length']} chars")
print(f"Sources: {result['sources']}")
```
Pattern C: Hybrid RAG with Local + Web
```python
from searchhive import SwiftSearch, ScrapeForge

search = SwiftSearch(api_key="your-key")
scraper = ScrapeForge(api_key="your-key")

def hybrid_rag(query: str, vector_results: list, web_results_count: int = 3) -> str:
    """Combine local vector DB results with web search results."""
    # Local retrieval (from your vector DB)
    local_context = "\n".join(f"[Local] {r['text']}" for r in vector_results)

    # Web retrieval
    web_results = search.search(query=query, num_results=web_results_count)
    web_pages = scraper.scrape_urls(
        urls=[r.url for r in web_results.organic],
        format="markdown",
    )
    web_context = "\n".join(
        f"[Web: {p.url}] {p.content[:500]}..." for p in web_pages
    )

    return local_context + "\n--- WEB RESULTS ---\n" + web_context
```
Evaluation: How Do You Know Your RAG Works?
Track these metrics using frameworks like RAGAS or TruLens:
| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | Are retrieved chunks relevant? | > 0.7 |
| Context Recall | Are all needed chunks retrieved? | > 0.8 |
| Faithfulness | Is the answer grounded in context? | > 0.9 |
| Answer Relevance | Does the answer address the question? | > 0.85 |
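As a rough intuition for the first two metrics, here are simplified set-based versions. This is a sketch only: frameworks like RAGAS compute these with an LLM judge over individual claims rather than exact chunk matching, so treat the functions below as illustrations of the definitions, not drop-in replacements.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)
```

A pipeline retrieving many irrelevant chunks fails precision (wasted context, distracted model); one missing needed chunks fails recall (the answer cannot be grounded).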
Common Challenges and Solutions
| Challenge | Solution |
|---|---|
| Retrieves wrong documents | Better chunking, semantic chunking |
| Misses exact term matches | Add BM25 -- switch to hybrid search |
| Retrieves relevant but not most relevant | Add a cross-encoder re-ranker |
| Complex multi-faceted queries fail | Query decomposition, sub-query retrieval |
| Stale knowledge | Incremental indexing, web search fallback |
| High token costs | Markdown conversion, re-ranking to reduce context |
| Slow retrieval | HNSW tuning, caching, pgvector for simpler stacks |
Recommended Architecture (2025 Best Practice)
The single highest-impact improvement for most RAG systems is adding a cross-encoder re-ranker after initial retrieval. The second is switching to hybrid search (dense + BM25).
For the retrieval layer, SearchHive provides:
- SwiftSearch for web-based retrieval (Pattern B and C)
- ScrapeForge for fetching and converting content to LLM-ready markdown
- DeepDive for summarization and analysis of retrieved content
One API key, one SDK, one invoice. Start with the free tier -- 100 searches and 50 scrapes per day is enough to prototype and evaluate your RAG pipeline before scaling.
Read the SearchHive documentation and see our search API integration tutorial for more implementation details.
See also: /blog/complete-guide-to-ai-agent-web-scraping | /blog/how-to-search-api-integration-step-by-step