Retrieval Strategies for RAG Systems
RAG systems retrieve relevant context before generating responses. Learn retrieval strategies, ranking, and optimization techniques.
TL;DR
RAG (Retrieval-Augmented Generation) finds relevant documents before answering questions. Key strategies: semantic search with embeddings, reranking top results, and optimizing chunk size.
What is RAG?
Definition:
Combining retrieval (search) with generation (LLMs).
Flow:
- User asks question
- Search relevant documents
- Add documents to LLM prompt
- LLM generates answer using context
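A minimal sketch of this flow in Python, with hypothetical `search` and `generate` placeholders standing in for your retriever and LLM client:

```python
# Minimal RAG flow. `search` and `generate` are hypothetical
# placeholders for your retriever and LLM client.
def answer(question: str, search, generate, k: int = 5) -> str:
    docs = search(question, k=k)                     # 1. retrieve top-k docs
    context = "\n\n".join(d["text"] for d in docs)   # 2. pack them as context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                          # 3. grounded generation
```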
Benefits:
- Answers based on your data
- No fine-tuning needed
- Easy to update knowledge
Retrieval methods
Semantic search:
- Convert query and documents to embeddings
- Find closest matches (cosine similarity)
- Captures meaning, not just keywords
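A minimal sketch of semantic search, assuming a hypothetical `embed` function that maps a list of texts to an array of vectors (any embedding model works here):

```python
import numpy as np

def semantic_search(query: str, docs: list[str], embed, k: int = 5):
    """Rank docs by cosine similarity to the query embedding.

    `embed` is a hypothetical function mapping a list of texts to a
    2-D array of embedding vectors, one row per text.
    """
    doc_vecs = embed(docs)                 # shape: (n_docs, dim)
    q_vec = embed([query])[0]              # shape: (dim,)
    # Cosine similarity = dot product of L2-normalized vectors.
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]     # indices of the closest matches
    return [(docs[i], float(scores[i])) for i in top]
```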
Keyword search:
- Traditional full-text search
- BM25 algorithm
- Good for exact matches
Hybrid:
- Combine semantic + keyword
- Best of both worlds
- Weighted fusion
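One way to do weighted fusion, assuming you already have per-document semantic and BM25 score arrays (the latter from, e.g., the rank_bm25 package); scores on different scales are min-max normalized before mixing:

```python
import numpy as np

def hybrid_scores(semantic: np.ndarray, bm25: np.ndarray, alpha: float = 0.6):
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    def norm(x):
        # Min-max normalize so both score sets live on [0, 1].
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(semantic) + (1 - alpha) * norm(bm25)

# ranked = np.argsort(hybrid_scores(sem_scores, bm25_scores))[::-1]
```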
Chunking strategies
Fixed size:
- Split at N characters/tokens
- Simple but may break sentences
Sentence-based:
- Keep sentences intact
- More coherent chunks
Paragraph-based:
- Split by paragraphs
- Better semantic units
Sliding window:
- Overlapping chunks
- Ensures no context loss at boundaries
Optimal chunk size:
- Too small: Lacks context
- Too large: Dilutes relevance
- Sweet spot: 200-500 tokens
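A sketch of a sliding-window chunker; it splits on whitespace words as a rough stand-in for tokens, and the 400/50 defaults are just one point inside the sweet spot:

```python
def sliding_window_chunks(text: str, size: int = 400, overlap: int = 50):
    """Split text into overlapping ~size-word chunks.

    Words stand in for tokens here; a real pipeline would use the
    model's tokenizer. Overlap keeps boundary context in both chunks.
    """
    assert 0 <= overlap < size
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window reached the end of the text
    return chunks
```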
Reranking
Problem:
- First-pass retrieval may miss nuances
- Top results not always best
Solution:
- Retrieve top 20-50 candidates
- Use more sophisticated model to rerank
- Return top 3-5 to LLM
Reranking models:
- Cross-encoders (more accurate, slower)
- Cohere rerank
- Custom scoring functions
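A sketch of retrieve-then-rerank using the CrossEncoder class from sentence-transformers; the checkpoint name is a commonly used public model, not a requirement:

```python
from sentence_transformers import CrossEncoder

# Checkpoint name is a commonly used public model, not a requirement.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5):
    """Rescore first-pass candidates jointly with the query, keep the best."""
    # Cross-encoders read (query, doc) pairs together, so they are more
    # accurate than embedding similarity but slower per pair.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

# candidates = first_pass_search(query, k=30)  # retrieve 20-50 candidates
# best = rerank(query, candidates, top_k=5)    # hand top 3-5 to the LLM
```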
Query optimization
Query expansion:
- Generate multiple versions of query
- "Fix leaky faucet" → "repair dripping tap," "plumbing leak"
HyDE (Hypothetical Document Embeddings):
- Generate hypothetical answer
- Search for docs similar to that answer
- Often finds better matches
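A sketch of HyDE, assuming hypothetical `generate` (LLM call), `embed`, and `search_by_vector` helpers:

```python
def hyde_search(query: str, generate, embed, search_by_vector, k: int = 5):
    """HyDE: embed a hypothetical answer, then search near that vector."""
    # 1. Ask the LLM to write a plausible (possibly wrong) answer.
    fake_answer = generate(f"Write a short passage answering: {query}")
    # 2. Retrieve real documents whose embeddings sit near the fake answer;
    #    answer-shaped text often lands closer to relevant docs than the
    #    question itself does.
    return search_by_vector(embed([fake_answer])[0], k=k)
```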
Query decomposition:
- Break complex queries into sub-questions
- Retrieve for each
- Combine results
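A sketch of decomposition with the same hypothetical `generate` and `search` helpers; query expansion works the same way, with paraphrases instead of sub-questions:

```python
def decompose_and_retrieve(query: str, generate, search, k_each: int = 3):
    """Break a complex query into sub-questions, retrieve for each, merge."""
    prompt = f"List, one per line, the sub-questions needed to answer: {query}"
    sub_questions = [q for q in generate(prompt).splitlines() if q.strip()]
    seen, merged = set(), []
    for sub in sub_questions:
        for doc in search(sub, k=k_each):
            if doc["id"] not in seen:      # de-duplicate across sub-queries
                seen.add(doc["id"])
                merged.append(doc)
    return merged
```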
Metadata filtering
- Filter by date, category, author
- Combine with semantic search
- "Recent product docs" + semantic match
Evaluation
Metrics:
- Recall: % of relevant docs retrieved
- Precision: % of retrieved docs relevant
- MRR (Mean Reciprocal Rank): Position of first relevant result
Process:
- Build a test set of queries + expected docs
- Measure retrieval quality
- Iterate on strategy
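A sketch computing these metrics over such a test set; the field names (`query`, `relevant`) are illustrative:

```python
def evaluate(test_set, search, k: int = 5):
    """test_set: list of {'query': str, 'relevant': set of doc ids}."""
    recalls, precisions, rrs = [], [], []
    for case in test_set:
        retrieved = [d["id"] for d in search(case["query"], k=k)]
        relevant = case["relevant"]
        hits = [i for i in retrieved if i in relevant]
        recalls.append(len(hits) / len(relevant))      # % of relevant retrieved
        precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
        # Reciprocal rank of the first relevant result (0 if none appears).
        rr = next((1.0 / r for r, i in enumerate(retrieved, 1) if i in relevant), 0.0)
        rrs.append(rr)
    n = len(test_set)
    return {"recall@k": sum(recalls) / n,
            "precision@k": sum(precisions) / n,
            "mrr": sum(rrs) / n}
```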
Common issues
Poor retrieval:
- Query wording doesn't match document wording
- Chunks too large or too small for the question
- Solution: Hybrid search, query expansion, tune chunk size
Context overload:
- Too many docs in prompt
- Exceeds context window
- Solution: Better reranking, fewer but better docs
Stale index:
- Newer docs not indexed yet
- Solution: Regular re-indexing
Best practices
- Experiment with chunk sizes
- Use hybrid search when possible
- Always rerank top results
- Add metadata for filtering
- Monitor and iterate
What's next
- Vector Databases
- Embeddings Explained
- Building RAG Applications
Key Terms Used in This Guide
RAG (Retrieval-Augmented Generation)
A technique where AI searches your documents for relevant info, then uses it to generate accurate, grounded answers.
Context Window
How much text an AI can 'see' or 'remember' at once. Older messages fall off when the window fills up.
Related Guides
Semantic Search: Search by Meaning, Not Keywords
Intermediate: Semantic search finds results based on meaning, not exact keyword matches. Learn how it works and how to implement it.
Fine-Tuning Fundamentals: Customizing AI Models
Intermediate: Fine-tuning adapts pre-trained models to your specific use case. Learn when to fine-tune, how it works, and alternatives.
Vector Database Fundamentals
Intermediate: Vector databases store and search embeddings efficiently. Learn how they work, when to use them, and popular options.