TL;DR

Basic RAG (Retrieval-Augmented Generation) often retrieves irrelevant or incomplete context, leading to poor answers. Advanced techniques -- hybrid search, reranking, smarter chunking, query decomposition, and systematic evaluation -- fix these problems. Implementing even two or three of these techniques can dramatically improve your RAG system's accuracy.

Why it matters

Most teams that build a basic RAG system hit the same wall: it works great for simple questions but falls apart for anything nuanced. Users ask a complex question, the system retrieves the wrong documents, and the AI generates a confident but incorrect answer.

The problem is rarely the language model -- it is the retrieval. If you feed the right context to an LLM, it will usually generate a good answer. If you feed it irrelevant or incomplete context, even the most powerful model will struggle. Advanced RAG techniques focus on the retrieval side: getting better context to the model so it can do its job.

These techniques are not theoretical. Companies running production RAG systems use them every day. The difference between a RAG system that frustrates users and one that genuinely helps them often comes down to these optimizations.

The limitations of naive RAG

A basic RAG system typically works like this: split documents into fixed-size chunks, embed them, store in a vector database, and retrieve the top K most similar chunks for each query. This approach has several weaknesses:

Poor chunking

Fixed-size chunking (for example, splitting every 500 tokens) ignores document structure. A chunk might start mid-sentence and end in the middle of a different paragraph. Important context gets split across chunks, and each chunk lacks the surrounding information needed to make sense of it.

Irrelevant retrieval

Vector similarity is not the same as relevance. A chunk might be semantically similar to the query but not actually answer the question. Searching for "What is the return policy for electronics?" might retrieve a chunk about electronics product descriptions because it contains similar words, even though it says nothing about returns.

Missing the full picture

Top-K retrieval grabs the K most similar chunks independently. For complex questions that require synthesizing information from multiple documents ("How has our return policy changed over the last three years?"), naive RAG often misses important pieces because each chunk is retrieved in isolation.

No handling of ambiguous queries

When a user asks a vague question, naive RAG does its best with the literal query. It does not try to clarify, expand, or decompose the question into more searchable components.

Advanced chunking strategies

How you split your documents has a massive impact on retrieval quality.

Semantic chunking

Instead of splitting at fixed intervals, split at natural boundaries: paragraph breaks, section headings, topic changes. This ensures each chunk contains a coherent piece of information rather than an arbitrary slice.

How it works: Use an embedding model to measure the similarity between consecutive sentences. When the similarity drops sharply (indicating a topic change), insert a chunk boundary. This creates chunks that align with actual content structure.
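The boundary-detection idea can be sketched in a few lines. The `embed` function below is a hashed bag-of-words stand-in for a real embedding model, and the 0.3 threshold is an arbitrary illustrative value you would tune on real data:

```python
import math

def embed(sentence, dim=64):
    # Stand-in embedding: hashed bag-of-words. In practice, call a
    # real sentence-embedding model here instead.
    vec = [0.0] * dim
    for word in sentence.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.3):
    """Start a new chunk whenever similarity between consecutive
    sentences drops below the threshold (a likely topic change)."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev = vec
    chunks.append(" ".join(current))
    return chunks
```

With a real embedding model, similarity between same-topic sentences is high and the drops line up with actual topic changes; the threshold is the main knob to tune.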

Hierarchical chunking

Create chunks at multiple levels of granularity:

  • Large chunks (full sections or pages) for context and overview
  • Small chunks (individual paragraphs) for specific answers

When a small chunk matches a query, you can also pull in its parent chunk for additional context. This gives the LLM both the specific information it needs and the surrounding context to make sense of it.
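A minimal way to represent the parent/child relationship, assuming documents are already split into titled sections of paragraphs (the field names here are illustrative, not a particular library's schema):

```python
def build_hierarchy(sections):
    """sections: list of (title, [paragraphs]) pairs.
    Returns small chunks, each linked to its parent section text."""
    chunks = []
    for title, paragraphs in sections:
        parent_text = "\n\n".join(paragraphs)  # large chunk: full section
        for i, para in enumerate(paragraphs):
            chunks.append({
                "id": f"{title}/{i}",
                "text": para,           # small chunk: matched against queries
                "parent": parent_text,  # pulled in for surrounding context
            })
    return chunks
```

At query time, you embed and search only the `text` field; when a small chunk matches, you send both its `text` and its `parent` to the LLM.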

Overlapping chunks

Add overlap between consecutive chunks (typically 10-20% of the chunk size). If an important piece of information lands right at a chunk boundary, the overlap ensures it appears in at least one complete chunk. Simple to implement and surprisingly effective.
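A sketch of token-level chunking with overlap; the defaults here (500-token chunks, 50 tokens of overlap, i.e. 10%) are illustrative starting points:

```python
def overlapping_chunks(tokens, size=500, overlap=50):
    """Fixed-size chunks where each chunk repeats the last `overlap`
    tokens of its predecessor, so boundary content stays intact."""
    step = max(size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```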

Document metadata enrichment

Attach metadata to each chunk: the document title, section heading, page number, date, author, and document type. This metadata enables filtering (only search HR documents for policy questions) and helps the LLM understand where the information comes from.
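Concretely, each chunk becomes a record carrying its text plus a metadata dictionary, and a filter over that metadata narrows the search space before any vector comparison (the field names and values below are made up for illustration):

```python
chunks = [
    {"text": "Employees accrue 1.5 vacation days per month.",
     "metadata": {"doc_type": "hr_policy", "section": "Time Off", "page": 12}},
    {"text": "The Q3 roadmap prioritizes the mobile app.",
     "metadata": {"doc_type": "planning", "section": "Roadmap", "page": 3}},
]

def filter_by_metadata(chunks, **criteria):
    """Keep only chunks whose metadata matches every criterion."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]
```

Production vector databases apply this kind of filter natively and far more efficiently, but the logical operation is the same: restrict first, then search.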

Hybrid search: combining vector and keyword retrieval

Pure vector search misses important things that keyword search catches, and vice versa. Hybrid search combines both for better results.

Vector search strengths: Understands meaning, handles synonyms ("car" matches "automobile"), captures conceptual similarity.

Keyword search strengths: Exact matches for names, acronyms, product codes, and specific terms that vector search might embed incorrectly. If a user searches for "HIPAA compliance," keyword search reliably finds documents containing exactly "HIPAA."

How to combine them:

  1. Run both a vector search and a keyword search (BM25) on the same query.
  2. Each returns a ranked list of results.
  3. Combine the lists using Reciprocal Rank Fusion (RRF): for each document, calculate a score based on its rank in each list. Documents that appear high in both lists get the highest combined scores.
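Step 3 can be implemented directly. The constant k=60 below is the value commonly used with RRF; each document scores 1 / (k + rank) per list it appears in, and the scores are summed:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over multiple ranked lists of doc ids.
    Documents ranked high in several lists accumulate the most score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a vector ranking ["a", "b", "c"] with a keyword ranking ["a", "c", "d"] puts "a" first, since it tops both lists.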

Most vector databases (Weaviate, Qdrant, Pinecone) support hybrid search natively. Enabling it is often a configuration change, not a major engineering effort.

Practical impact: Hybrid search typically improves retrieval accuracy by 10-20% compared to vector-only search, with the biggest gains on queries containing specific names, codes, or technical terms.

Reranking: a second pass for better results

Reranking is one of the highest-impact improvements you can make. The idea is a two-stage process:

  1. Fast first stage: Retrieve a larger set of candidates (50-100 chunks) using vector or hybrid search. This is fast but not perfectly accurate.
  2. Accurate second stage: Use a more powerful model (a cross-encoder) to re-score each candidate against the original query. Return only the top 3-5 best results.

Why does this work? Embedding-based search (bi-encoder) processes the query and documents separately and compares their embeddings. A cross-encoder processes the query and each document together, allowing much deeper comparison. It is more accurate but too slow to run against your entire document collection -- which is why you use the fast first stage to narrow down candidates.
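The two-stage flow looks like this in outline. The scoring function below is a word-overlap stand-in so the sketch is self-contained; in a real system you would replace it with an actual cross-encoder call (for example, Cohere's Rerank API or a Sentence Transformers cross-encoder model):

```python
def cross_encoder_score(query, doc):
    # Stand-in scorer: fraction of query words present in the doc.
    # A real system calls a cross-encoder model here instead.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def rerank(query, candidates, top_n=3):
    """Second stage: re-score every first-stage candidate against
    the query, then keep only the best few for the LLM prompt."""
    scored = sorted(candidates,
                    key=lambda doc: cross_encoder_score(query, doc),
                    reverse=True)
    return scored[:top_n]
```

The `candidates` list is the output of the fast first stage (the 50-100 chunks from vector or hybrid search), which is what keeps the slow scorer affordable.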

Tools for reranking:

  • Cohere Rerank API: Easy to integrate, pay-per-use
  • Cross-encoder models from Sentence Transformers: Self-hosted, free
  • Jina Reranker: Open-source and API options

Practical impact: Adding a reranking step typically improves answer quality by 15-25% as measured by human evaluation.

Query decomposition for complex questions

When a user asks a complex question, breaking it into simpler sub-questions often produces better results than searching for the original query.

Example: "How does our vacation policy compare to industry standards for tech companies?"

A naive system searches for this exact query. A smarter system decomposes it:

  • Sub-query 1: "What is our company's vacation policy?"
  • Sub-query 2: "What are industry standard vacation policies for tech companies?"

Each sub-query is searched separately. The retrieved chunks are combined and sent to the LLM, which now has the right information to make the comparison.
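The flow can be sketched with pluggable functions, where `decompose`, `search`, and `llm` stand in for an LLM-based decomposer, your retriever, and your generator; all three interfaces are assumptions of this sketch, not a specific library's API:

```python
def answer_complex(question, decompose, search, llm, per_query=5):
    """Decompose the question, retrieve per sub-query, deduplicate
    the chunks, then generate one answer from the combined context."""
    context, seen = [], set()
    for sub_q in decompose(question):
        for chunk in search(sub_q)[:per_query]:
            if chunk not in seen:
                seen.add(chunk)
                context.append(chunk)
    prompt = ("Answer using only this context:\n"
              + "\n".join(context)
              + f"\n\nQuestion: {question}")
    return llm(prompt)
```

Note that the original question, not the sub-queries, goes into the final prompt: the sub-queries exist only to improve retrieval.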

HyDE (Hypothetical Document Embeddings) is a clever variation: instead of searching for the question, use an LLM to generate a hypothetical answer, then search for documents similar to that hypothetical answer. This often retrieves better results because the hypothetical answer is closer in embedding space to the actual answer than the question is.
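HyDE changes only the retrieval step: embed a generated hypothetical answer rather than the question. In this sketch, `llm`, `embed`, and `vector_search` are assumed interfaces you would wire to your own stack:

```python
def hyde_retrieve(question, llm, embed, vector_search, top_k=5):
    """Search with the embedding of a hypothetical answer instead of
    the question itself (HyDE)."""
    hypothetical = llm(
        "Write a short passage that plausibly answers this question, "
        "even if you must guess the details: " + question
    )
    return vector_search(embed(hypothetical), top_k)
```

Even when the hypothetical answer contains wrong details, its vocabulary and structure resemble real answer passages, which is what makes it a better search probe than the question.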

Evaluating and debugging RAG systems

The hardest part of RAG is knowing whether it is working well. You need systematic evaluation.

Retrieval evaluation

  • Recall@K: What percentage of relevant documents appear in the top K results? The most fundamental metric.
  • MRR (Mean Reciprocal Rank): How high is the first relevant document ranked? An MRR of 0.5 means the first relevant document is typically the second result.
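Both metrics are a few lines each, given ranked result ids and a set of relevant document ids per query:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top K results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank over (ranked results, relevant ids) pairs:
    average of 1/rank of the first relevant document per query."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Run these against your test set after every change; a chunking or reranking tweak that does not move Recall@K or MRR is probably not worth keeping.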

End-to-end evaluation

  • Answer correctness: Does the final answer match a known correct answer? Requires a test set with questions and ground-truth answers.
  • Faithfulness: Does the answer only use information from the retrieved context, or does it hallucinate? LLM-as-judge can assess this automatically.
  • Relevance: Is the answer actually relevant to what the user asked?

Debugging framework

When the system gives a wrong answer, diagnose where it failed:

  1. Was the right document retrieved? If not, it is a retrieval problem -- improve chunking, search, or reranking.
  2. Was the right document retrieved but ranked too low? Add reranking or adjust retrieval parameters.
  3. Was the right document retrieved and ranked high, but the answer was still wrong? It is a generation problem -- improve your prompt, add instructions to cite sources, or try a different model.

This diagnostic process tells you exactly which component to optimize, rather than guessing.

Practical implementation tips

  1. Start simple, then optimize. Get a basic RAG system working first. Measure its performance. Then add advanced techniques one at a time, measuring the impact of each.
  2. Build a test set early. Create 50-100 question-answer pairs that represent real user queries. Use this set to measure every change you make. Without it, you are optimizing blind.
  3. Log everything. For every query, log the retrieved chunks, their scores, the final prompt sent to the LLM, and the generated answer. This data is essential for debugging and improvement.
  4. Hybrid search is the best first upgrade. If you only implement one advanced technique, make it hybrid search. It is easy to enable in most vector databases and provides consistent improvement.
  5. Reranking is the best second upgrade. After hybrid search, adding a reranking step provides the next biggest quality improvement for the implementation effort.

Common mistakes

  • Optimizing the model instead of the retrieval. When RAG answers are wrong, teams often try a more powerful (and expensive) LLM. In most cases, the problem is that the wrong documents were retrieved. Fix retrieval first.
  • Using fixed-size chunking without overlap. This is the default in many tutorials but is one of the weakest approaches. At minimum, add overlap. Ideally, use semantic or hierarchical chunking.
  • Retrieving too many or too few chunks. Too few (1-2) and you miss relevant context. Too many (20+) and the LLM gets overwhelmed with irrelevant information. Start with 5-10 chunks and tune based on your evaluation metrics.
  • Not evaluating systematically. "It seems to work okay" is not evaluation. Build a test set, measure retrieval and answer quality, and track these metrics over time.
  • Ignoring metadata filtering. If your documents span multiple topics, time periods, or access levels, always filter by metadata before semantic search. Searching your entire corpus when you know the answer is in the HR policy section wastes retrieval capacity and introduces noise.

What's next?

  • Embeddings and RAG -- foundational understanding of how RAG works
  • Custom Embedding Models -- improving retrieval with domain-specific embeddings
  • Vector Database Examples -- choosing and configuring vector storage for production RAG
  • Context Management -- handling long conversations and context windows effectively