RAG Pipeline Architecture -- Common Questions Answered
Retrieval-Augmented Generation (RAG) pipeline architecture determines how well your LLM applications retrieve relevant context and generate accurate answers. Get the architecture wrong, and you get hallucinations, slow responses, or missed documents. Get it right, and your application delivers reliable, grounded answers at scale.
This FAQ covers the most common questions developers ask when building production RAG systems -- from chunking strategies to vector database selection to evaluation.
Key Takeaways
- Chunking strategy matters more than vector database choice for most applications
- Hybrid search (dense + sparse) consistently outperforms pure dense retrieval
- Reranking models add 100-200ms latency but significantly improve relevance
- Document ingestion pipelines need metadata enrichment, not just text splitting
- SearchHive's DeepDive API can replace custom web-crawling components in RAG pipelines
What are the core components of a RAG pipeline?
A production RAG pipeline has five stages:
- Document ingestion -- load documents (PDFs, web pages, databases), extract text, normalize format
- Chunking -- split documents into semantically meaningful segments
- Embedding -- convert chunks to vector representations
- Retrieval -- find the most relevant chunks for a given query
- Generation -- feed retrieved chunks as context to the LLM
Each stage has design decisions that affect quality, cost, and latency. Skipping optimization at any stage creates a bottleneck.
```python
from searchhive import ScrapeForge, DeepDive

# Replace custom web crawling with SearchHive for document ingestion
scrape = ScrapeForge(api_key="sk-YOUR_KEY")
extract = DeepDive(api_key="sk-YOUR_KEY")

# Ingest a documentation page
page = scrape.scrape("https://docs.example.com/guide", format="markdown")

# Extract structured sections for better chunking
sections = extract.extract(
    page["content"],
    schema={"fields": ["title", "content", "section_type"]}
)
```
What chunking strategy should I use?
There are three main approaches:
- Fixed-size chunking -- split every N tokens (e.g., 512). Simple but breaks mid-sentence and mid-paragraph. Works as a baseline.
- Recursive character splitting -- split on paragraph boundaries, then sentences, then words if needed. Available in LangChain's RecursiveCharacterTextSplitter. Better semantic coherence.
- Semantic chunking -- use embeddings to detect topic shifts and split at natural boundaries. Most expensive but produces the highest quality chunks.
For most production systems, recursive character splitting with a chunk size of 512-1024 tokens and overlap of 50-100 tokens is the sweet spot. Semantic chunking adds value when documents have highly variable topic density.
Rule of thumb: Your chunk size should be large enough to contain a complete idea, but small enough that multiple chunks fit in the LLM's context window alongside the query.
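The recursive approach can be sketched in plain Python. This is a simplified stand-in for LangChain's RecursiveCharacterTextSplitter, using character counts rather than tokens; the separator list and limits are illustrative.

```python
def recursive_split(text, max_chars=1000, separators=("\n\n", ". ", " ")):
    """Split text at the coarsest boundary that keeps chunks under max_chars."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator doesn't occur here; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(part) <= max_chars:
                    current = part
                else:
                    # part is still too long: recurse with finer separators
                    current = ""
                    chunks.extend(recursive_split(part, max_chars, separators))
        if current:
            chunks.append(current)
        return chunks
    # No separator matched at all: hard-split every max_chars characters
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A production splitter would count tokens instead of characters and add the 50-100 token overlap described above, but the fall-through from paragraphs to sentences to words is the core idea.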
Which vector database should I use?
| Database | Best For | Scaling | Self-Hosted |
|---|---|---|---|
| Chroma | Prototyping, small apps | In-memory, limited | Yes |
| FAISS | Local/embedded, research | Single-node | Yes |
| Qdrant | Production, mid-scale | Horizontal | Yes |
| Pinecone | Production, large-scale | Fully managed | No |
| Weaviate | Production, hybrid search | Horizontal | Yes |
For startups and small teams: Qdrant or Chroma. For large-scale production: Pinecone or Weaviate.
The database matters less than how you index. Add metadata filters (source, date, section type) at ingestion time -- this lets you narrow retrieval before the expensive vector search step.
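A minimal sketch of that pre-filtering pattern, using a hypothetical in-memory index. Real vector databases expose this as a filter argument on their query API; the chunk records, fields, and cosine scoring here are all illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk store: each chunk carries metadata set at ingestion time
chunks = [
    {"id": 1, "vec": [0.9, 0.1], "source": "docs", "year": 2024},
    {"id": 2, "vec": [0.8, 0.2], "source": "blog", "year": 2022},
    {"id": 3, "vec": [0.1, 0.9], "source": "docs", "year": 2024},
]

def filtered_search(query_vec, top_k=2, **filters):
    # Narrow by metadata first -- cheap equality checks before similarity math
    candidates = [c for c in chunks
                  if all(c.get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in candidates[:top_k]]
```

The filter shrinks the candidate set before any similarity computation runs, which is exactly what a `source="docs"` filter does inside Qdrant, Pinecone, or Weaviate at much larger scale.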
How does hybrid search improve RAG accuracy?
Pure dense retrieval (embedding similarity) misses exact keyword matches. Pure sparse retrieval (BM25/TF-IDF) misses semantic matches. Hybrid search combines both.
The implementation pattern:
- Run BM25 search on your text index
- Run dense vector search on your embedding index
- Combine scores with reciprocal rank fusion (RRF) or learned weights
- Rerank the merged results
```python
# Hybrid search pattern (conceptual -- vector_db and bm25_index stand in
# for your vector database and BM25 index clients)
def hybrid_search(query, query_embedding, top_k=10):
    # Dense search via vector DB
    dense_results = vector_db.search(query_embedding, top_k=top_k * 2)
    # Sparse search via BM25
    sparse_results = bm25_index.search(query, top_k=top_k * 2)

    # Reciprocal Rank Fusion: score = sum of 1 / (rank + k), conventionally k=60
    scores = {}
    for rank, result in enumerate(dense_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rank + 60)
    for rank, result in enumerate(sparse_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rank + 60)

    # Return top-k (doc_id, score) pairs by fused score
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```
Benchmark studies consistently show hybrid search improving recall by 10-30% over dense-only retrieval, especially for queries containing technical terms, product names, or exact phrases.
Should I use a reranking model?
Yes, if retrieval quality matters more than ~100-200ms of added latency. Reranking models (Cohere Rerank, cross-encoders, ColBERT) score each retrieved chunk against the full query context, not just a single embedding.
The pattern:
- Retrieve 50-100 candidates with hybrid search (cheap, fast)
- Rerank to top 5-10 with a cross-encoder (expensive, slow but accurate)
- Feed the top chunks to the LLM
Reranking is most valuable when your document corpus is large (100K+ chunks) and queries are specific. For small corpora, the marginal improvement is smaller.
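The two-stage pattern above can be sketched as follows. Here `stub_score` is a deliberately simple word-overlap stand-in for a real cross-encoder (Cohere Rerank or a sentence-transformers CrossEncoder would score each query-chunk pair jointly instead).

```python
def stub_score(query, chunk):
    # Hypothetical relevance score via word overlap.
    # A real cross-encoder model replaces this function.
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve_then_rerank(query, candidates, top_k=3):
    # Stage 1 (assumed already done): `candidates` is the 50-100 chunk
    # shortlist from cheap hybrid search.
    # Stage 2: apply the expensive scorer only to that small set.
    scored = sorted(candidates, key=lambda c: stub_score(query, c), reverse=True)
    return scored[:top_k]
```

The economics are the point: the expensive model never sees the full corpus, only the shortlist, so its latency cost stays bounded regardless of corpus size.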
How do I handle web data in a RAG pipeline?
Web data requires crawling, extraction, and cleaning before it's useful in a RAG system. This is where most teams spend the most engineering time.
Common approaches:
- Scrapy/BeautifulSoup -- maximum control, maximum engineering effort
- Firecrawl -- good developer experience, higher cost at scale
- SearchHive -- unified search + scrape + extract API, cheaper than Firecrawl
```python
from searchhive import SwiftSearch, ScrapeForge

def ingest_web_knowledge(topic, max_pages=50):
    search = SwiftSearch(api_key="sk-YOUR_KEY")
    scrape = ScrapeForge(api_key="sk-YOUR_KEY")

    # Find relevant pages
    results = search.search(topic, num=max_pages)

    chunks = []
    for result in results["organic"]:
        page = scrape.scrape(result["url"], format="markdown")
        if page.get("content"):
            # Split and embed (your chunking logic here)
            chunks.append({
                "text": page["content"],
                "source": result["url"],
                "title": result["title"],
            })
    return chunks
```
How do I evaluate RAG pipeline quality?
Use the RAGAS framework or build custom evaluation with these metrics:
- Faithfulness -- does the generated answer follow from the retrieved context?
- Answer relevance -- does the answer actually address the question?
- Context precision -- are the retrieved chunks relevant to the question?
- Context recall -- did retrieval find all the information needed?
Run these against a held-out test set of question-answer pairs. Track metrics over time as you iterate on chunking, retrieval, and prompting.
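The two retrieval-side metrics reduce to simple set arithmetic once relevance judgments exist. RAGAS obtains those judgments from an LLM judge; in this sketch they are supplied directly so the metrics themselves are clear.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the needed (relevant) chunks that retrieval found."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so trivially recalled
    retrieved = set(retrieved_ids)
    found = sum(1 for cid in relevant_ids if cid in retrieved)
    return found / len(relevant_ids)
```

Faithfulness and answer relevance need an LLM judge and cannot be reduced to set operations, which is why a framework like RAGAS is worth adopting rather than rebuilding.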
What causes RAG hallucinations?
Three root causes:
- Insufficient retrieval -- the answer isn't in the retrieved chunks. Fix: improve retrieval (hybrid search, reranking, better embeddings).
- Contradictory context -- multiple chunks disagree. Fix: add source attribution, use more precise retrieval, or add a grounding prompt.
- LLM overconfidence -- the model fills gaps with fabricated information. Fix: explicitly instruct the model to say "I don't know" when context is insufficient.
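That third fix is a prompt-engineering change. One way to phrase a grounding prompt is sketched below; the exact wording is illustrative, not a fixed template from any provider.

```python
# Illustrative grounding prompt: instructs the model to refuse rather than
# fabricate, and to attribute claims to sources (addresses causes 2 and 3).
GROUNDING_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "I don't know."
Cite the source URL for every claim you make.

Context:
{context}

Question: {question}"""

def build_prompt(context_chunks, question):
    # Prefix each chunk with its source so the model can attribute claims
    context = "\n\n".join(
        f"[{c['source']}]\n{c['text']}" for c in context_chunks
    )
    return GROUNDING_PROMPT.format(context=context, question=question)
```

Pairing the refusal instruction with per-chunk source labels also makes contradictory context visible: the model can report that sources disagree instead of silently picking one.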
How much does a production RAG system cost?
Breakdown for a mid-scale deployment (1M documents, 10M chunks, 1K queries/day):
| Component | Monthly Cost |
|---|---|
| Vector database (Pinecone) | $70-150 |
| Embedding API (OpenAI text-embedding-3-small) | $3-10 |
| LLM generation (GPT-4o-mini) | $15-30 |
| Web data extraction (SearchHive Builder) | $49 |
| Reranking (Cohere) | $10-20 |
| Total | $147-259/month |
SearchHive is the most cost-effective option for the web data extraction component -- $49/month for 100K universal credits vs. Firecrawl's $83/month for the same volume.
Summary
Building a RAG pipeline isn't rocket science, but the details matter. Start with recursive chunking, hybrid search, and a managed vector database. Add reranking when you need higher precision. Use SearchHive for web data ingestion to avoid building custom crawlers.
For the extraction and search layer that powers your RAG pipeline, try SearchHive free with 500 credits. The unified API handles search, scraping, and AI extraction -- three things every RAG system needs.