TL;DR

RAG (Retrieval-Augmented Generation) retrieves relevant documents and supplies them to an LLM as context before it answers a question. Key strategies: semantic search with embeddings, reranking top results, and optimizing chunk size.

What is RAG?

Definition:
Combining retrieval (search) with generation (LLMs).

Flow:

  1. User asks question
  2. Search relevant documents
  3. Add documents to LLM prompt
  4. LLM generates answer using context
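
A minimal sketch of this flow in Python. The embed_and_search() retriever and generate() LLM call are hypothetical placeholders, not any specific library's API:

    # Hypothetical helpers:
    #   embed_and_search(query, k) -> list of relevant document strings
    #   generate(prompt)           -> LLM completion string

    def answer(question, k=4):
        # 1-2. Search for the documents most relevant to the question
        docs = embed_and_search(question, k=k)
        # 3. Add the retrieved documents to the LLM prompt as context
        context = "\n\n".join(docs)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        # 4. The LLM generates an answer grounded in that context
        return generate(prompt)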

Benefits:

  • Answers based on your data
  • No fine-tuning needed
  • Easy to update knowledge

Retrieval methods

Semantic search:

  • Convert query and documents to embeddings
  • Find closest matches (cosine similarity)
  • Captures meaning, not just keywords
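
A sketch of semantic search with the sentence-transformers library; the model name is just one commonly used example:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["How to fix a dripping tap", "Quarterly sales report", "Replacing a faucet washer"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)   # unit-length vectors

    def semantic_search(query, top_k=2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                      # cosine similarity on normalized vectors
        best = np.argsort(scores)[::-1][:top_k]    # highest similarity first
        return [(docs[i], float(scores[i])) for i in best]

    print(semantic_search("fix leaky faucet"))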

Keyword search:

  • Traditional full-text search
  • BM25 algorithm
  • Good for exact matches
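
A sketch of BM25 keyword search using the rank_bm25 package with naive whitespace tokenization:

    from rank_bm25 import BM25Okapi

    docs = ["how to fix a dripping tap", "quarterly sales report", "replacing a faucet washer"]
    tokenized = [d.split() for d in docs]       # naive whitespace tokenization
    bm25 = BM25Okapi(tokenized)

    query_tokens = "fix leaky faucet".split()
    scores = bm25.get_scores(query_tokens)      # one BM25 score per document
    print(sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True))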

Hybrid:

  • Combine semantic + keyword
  • Keyword search catches exact terms; semantic search catches paraphrases
  • Weighted fusion
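
A sketch of weighted fusion, assuming each search returns a dict mapping a document id to its raw score:

    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                     # avoid division by zero
        return {doc: (s - lo) / span for doc, s in scores.items()}

    def hybrid_search(semantic_scores, keyword_scores, alpha=0.7):
        sem, kw = normalize(semantic_scores), normalize(keyword_scores)
        fused = {doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
                 for doc in set(sem) | set(kw)}
        return sorted(fused, key=fused.get, reverse=True)   # best-scoring docs first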

Chunking strategies

Fixed size:

  • Split at N characters/tokens
  • Simple but may break sentences

Sentence-based:

  • Keep sentences intact
  • More coherent chunks

Paragraph-based:

  • Split by paragraphs
  • Better semantic units

Sliding window:

  • Overlapping chunks
  • Ensures no context loss at boundaries
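
A sliding-window chunker sketch that counts whitespace-separated words; a real pipeline would count tokenizer tokens instead:

    def sliding_window_chunks(text, size=300, overlap=50):
        tokens = text.split()                      # whitespace "tokens" for simplicity
        step = size - overlap
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + size]))
            if start + size >= len(tokens):        # last window already reaches the end
                break
        return chunks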

Optimal chunk size:

  • Too small: Lacks context
  • Too large: Dilutes relevance
  • Common sweet spot: 200-500 tokens (validate on your own data)

Reranking

Problem:

  • First-pass retrieval may miss nuances
  • Top results not always best

Solution:

  • Retrieve top 20-50 candidates
  • Use more sophisticated model to rerank
  • Return top 3-5 to LLM

Reranking models:

  • Cross-encoders (more accurate, slower)
  • Cohere rerank
  • Custom scoring functions
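
A reranking sketch with a sentence-transformers cross-encoder; the model name is one public example:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, candidates, top_k=5):
        # Score each (query, document) pair jointly -- slower than bi-encoders, more accurate
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

Retrieve the 20-50 candidates first, then pass only the reranked top 3-5 to the LLM.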

Query optimization

Query expansion:

  • Generate multiple versions of query
  • "Fix leaky faucet" → "repair dripping tap," "plumbing leak"

HyDE (Hypothetical Document Embeddings):

  • Generate hypothetical answer
  • Search for docs similar to that answer
  • Often finds better matches
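
A HyDE sketch built on the same hypothetical helpers:

    def hyde_search(query, top_k=5):
        hypothetical = generate(f"Write a short passage that answers: {query}")
        # The hypothetical answer often sits closer to real answer passages
        # in embedding space than the question itself does.
        return semantic_search(hypothetical, top_k=top_k)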

Query decomposition:

  • Break complex queries into sub-questions
  • Retrieve for each
  • Combine results
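
A decomposition sketch, again using the same hypothetical helpers:

    def decomposed_search(query, per_question=3):
        prompt = f"Break this question into simple sub-questions, one per line:\n{query}"
        sub_questions = [q.strip() for q in generate(prompt).splitlines() if q.strip()]
        best = {}
        for sub in sub_questions:
            for doc, score in semantic_search(sub, top_k=per_question):
                best[doc] = max(score, best.get(doc, float("-inf")))   # merge and dedupe
        return sorted(best.items(), key=lambda pair: pair[1], reverse=True)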

Metadata filtering

  • Filter by date, category, author
  • Combine with semantic search
  • "Recent product docs" + semantic match

Evaluation

Metrics:

  • Recall: % of relevant docs retrieved
  • Precision: % of retrieved docs relevant
  • MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result across queries
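
Sketches of the three metrics for a single query, where retrieved is the ranked list of returned doc ids and relevant is the expected set:

    def recall(retrieved, relevant):
        return len(set(retrieved) & relevant) / len(relevant)

    def precision(retrieved, relevant):
        return len(set(retrieved) & relevant) / len(retrieved)

    def reciprocal_rank(retrieved, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                return 1.0 / rank
        return 0.0                                 # no relevant doc retrieved

    # MRR is the mean of reciprocal_rank() over every query in the test set.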

Testing:

  • Build test set of queries + expected docs
  • Measure retrieval quality
  • Iterate on strategy

Common issues

Poor retrieval:

  • Embedding model doesn't match domain
  • Chunks too large/small
  • Insufficient metadata

Context overload:

  • Too many retrieved chunks dilute the prompt and crowd the context window
  • Solution: Rerank and pass only the top 3-5 chunks to the LLM

Stale index:

  • Newer docs not indexed yet
  • Solution: Regular re-indexing

Best practices

  1. Experiment with chunk sizes
  2. Use hybrid search when possible
  3. Always rerank top results
  4. Add metadata for filtering
  5. Monitor and iterate

What's next

  • Vector Databases
  • Embeddings Explained
  • Building RAG Applications