TL;DR

RAG (Retrieval-Augmented Generation) retrieves relevant documents and supplies them to an LLM as context before it answers a question. Key strategies: semantic search with embeddings, reranking top results, and optimizing chunk size.

What is RAG?

Definition:
Combining retrieval (search) with generation (LLMs).

Flow:

  1. User asks question
  2. Search relevant documents
  3. Add documents to LLM prompt
  4. LLM generates answer using context
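
A minimal sketch of this flow in Python. The embed_and_search() retriever and generate() LLM call are hypothetical placeholders, not any specific library's API:

    # Hypothetical helpers:
    #   embed_and_search(query, k) -> list of relevant document strings
    #   generate(prompt)           -> LLM completion string

    def answer(question, k=4):
        # 1-2. Search for the documents most relevant to the question
        docs = embed_and_search(question, k=k)
        # 3. Add the retrieved documents to the LLM prompt as context
        context = "\n\n".join(docs)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        # 4. The LLM generates an answer grounded in that context
        return generate(prompt)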

Benefits:

  • Answers based on your data
  • No fine-tuning needed
  • Easy to update knowledge

Retrieval methods

Semantic search:

  • Convert query and documents to embeddings
  • Find closest matches (cosine similarity)
  • Captures meaning, not just keywords
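
A sketch of semantic search with the sentence-transformers library; the model name is just one commonly used example:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["How to fix a dripping tap", "Quarterly sales report", "Replacing a faucet washer"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)   # unit-length vectors

    def semantic_search(query, top_k=2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                      # cosine similarity on normalized vectors
        best = np.argsort(scores)[::-1][:top_k]    # highest similarity first
        return [(docs[i], float(scores[i])) for i in best]

    print(semantic_search("fix leaky faucet"))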

Keyword search:

  • Traditional full-text search
  • BM25 algorithm
  • Good for exact matches
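
A sketch of BM25 keyword search using the rank_bm25 package with naive whitespace tokenization:

    from rank_bm25 import BM25Okapi

    docs = ["how to fix a dripping tap", "quarterly sales report", "replacing a faucet washer"]
    tokenized = [d.split() for d in docs]       # naive whitespace tokenization
    bm25 = BM25Okapi(tokenized)

    query_tokens = "fix leaky faucet".split()
    scores = bm25.get_scores(query_tokens)      # one BM25 score per document
    print(sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True))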

Hybrid:

  • Combine semantic + keyword
  • Keyword search catches exact terms; semantic search catches paraphrases
  • Weighted fusion
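
A sketch of weighted fusion, assuming each search returns a dict mapping a document id to its raw score:

    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                     # avoid division by zero
        return {doc: (s - lo) / span for doc, s in scores.items()}

    def hybrid_search(semantic_scores, keyword_scores, alpha=0.7):
        sem, kw = normalize(semantic_scores), normalize(keyword_scores)
        fused = {doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
                 for doc in set(sem) | set(kw)}
        return sorted(fused, key=fused.get, reverse=True)   # best-scoring docs first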

Chunking strategies

Fixed size:

  • Split at N characters/tokens
  • Simple but may break sentences

Sentence-based:

  • Keep sentences intact
  • More coherent chunks

Paragraph-based:

  • Split by paragraphs
  • Better semantic units

Sliding window:

  • Overlapping chunks
  • Ensures no context loss at boundaries
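
A sliding-window chunker sketch that counts whitespace-separated words; a real pipeline would count tokenizer tokens instead:

    def sliding_window_chunks(text, size=300, overlap=50):
        tokens = text.split()                      # whitespace "tokens" for simplicity
        step = size - overlap
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + size]))
            if start + size >= len(tokens):        # last window already reaches the end
                break
        return chunks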

Optimal chunk size:

  • Too small: Lacks context
  • Too large: Dilutes relevance
  • Common sweet spot: 200-500 tokens (validate on your own data)

Reranking

Problem:

  • First-pass retrieval may miss nuances
  • Top results not always best

Solution:

  • Retrieve top 20-50 candidates
  • Use more sophisticated model to rerank
  • Return top 3-5 to LLM

Reranking models:

  • Cross-encoders (more accurate, slower)
  • Cohere rerank
  • Custom scoring functions
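
A reranking sketch with a sentence-transformers cross-encoder; the model name is one public example:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, candidates, top_k=5):
        # Score each (query, document) pair jointly -- slower than bi-encoders, more accurate
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

Retrieve the 20-50 candidates first, then pass only the reranked top 3-5 to the LLM.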

Query optimization

Query expansion:

  • Generate multiple versions of query
  • "Fix leaky faucet" → "repair dripping tap," "plumbing leak"

HyDE (Hypothetical Document Embeddings):

  • Generate hypothetical answer
  • Search for docs similar to that answer
  • Often finds better matches
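
A HyDE sketch built on the same hypothetical helpers:

    def hyde_search(query, top_k=5):
        hypothetical = generate(f"Write a short passage that answers: {query}")
        # The hypothetical answer often sits closer to real answer passages
        # in embedding space than the question itself does.
        return semantic_search(hypothetical, top_k=top_k)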

Query decomposition:

  • Break complex queries into sub-questions
  • Retrieve for each
  • Combine results
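
A decomposition sketch, again using the same hypothetical helpers:

    def decomposed_search(query, per_question=3):
        prompt = f"Break this question into simple sub-questions, one per line:\n{query}"
        sub_questions = [q.strip() for q in generate(prompt).splitlines() if q.strip()]
        best = {}
        for sub in sub_questions:
            for doc, score in semantic_search(sub, top_k=per_question):
                best[doc] = max(score, best.get(doc, float("-inf")))   # merge and dedupe
        return sorted(best.items(), key=lambda pair: pair[1], reverse=True)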

Metadata filtering

  • Filter by date, category, author
  • Combine with semantic search
  • "Recent product docs" + semantic match

Evaluation

Metrics:

  • Recall: % of relevant docs retrieved
  • Precision: % of retrieved docs relevant
  • MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result across queries
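
Sketches of the three metrics for a single query, where retrieved is the ranked list of returned doc ids and relevant is the expected set:

    def recall(retrieved, relevant):
        return len(set(retrieved) & relevant) / len(relevant)

    def precision(retrieved, relevant):
        return len(set(retrieved) & relevant) / len(retrieved)

    def reciprocal_rank(retrieved, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                return 1.0 / rank
        return 0.0                                 # no relevant doc retrieved

    # MRR is the mean of reciprocal_rank() over every query in the test set.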

Testing:

  • Build test set of queries + expected docs
  • Measure retrieval quality
  • Iterate on strategy

Common issues

Poor retrieval:

  • Embedding model doesn't match domain
  • Chunks too large/small
  • Insufficient metadata

Context overload:

  • Too many retrieved chunks dilute the prompt and crowd the context window
  • Solution: Rerank and pass only the top 3-5 chunks to the LLM

Stale index:

  • Newer docs not indexed yet
  • Solution: Regular re-indexing

Best practices

  1. Experiment with chunk sizes
  2. Use hybrid search when possible
  3. Always rerank top results
  4. Add metadata for filtering
  5. Monitor and iterate

What's next

  • Vector Databases
  • Embeddings Explained
  • Building RAG Applications