TL;DR

Basic RAG works, but production systems need advanced retrieval. Chunking strategy determines what information is available. Hybrid search combines keyword and semantic matching. Re-ranking refines results for accuracy. Metadata enables precise filtering. Master these techniques to build RAG systems that consistently find the right information.

Why advanced retrieval matters

You've built a basic RAG system. It works... sometimes. Users ask questions and get vague answers, or answers that miss critical information buried in your docs. The problem isn't the LLM; it's retrieval.

If the search doesn't find the right chunks, the LLM can't generate the right answer. Garbage in, garbage out.

Advanced retrieval techniques dramatically improve accuracy by ensuring the most relevant information reaches the LLM every time.

Chunking: The foundation of retrieval

Chunking is how you split documents into searchable pieces. Get it wrong, and everything downstream fails.

Common pitfalls

Too small: "Return within 30 days" (missing: "with receipt" from the next sentence)
Too large: 2000-token chunks with one relevant sentence buried in noise
No context: Chunks that start mid-thought or reference "the above section"

Advanced chunking strategies

1. Semantic boundaries

Split at natural breakpoints: headings, paragraphs, topic shifts. For markdown or HTML, use the structure (h2 tags, section breaks). For plain text, use sentence boundaries and paragraph breaks.

When to use: Structured documents (docs, wikis, reports)
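
For markdown, heading-based splitting can be a single regex pass. A minimal sketch; the function is illustrative and not tied to any particular library:

```python
import re

def chunk_by_headings(markdown_text: str) -> list[str]:
    """Split a markdown document right before each h1-h3 heading."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]  # drop empty fragments

doc = """# Refund policy
Items can be returned within 30 days with a receipt.

## Exceptions
Final-sale items cannot be returned.
"""

for chunk in chunk_by_headings(doc):
    print(chunk)
    print("---")
```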

2. Overlapping windows

Create chunks with 10-20% overlap so context isn't lost at boundaries.

Example (500-token chunks, 100-token overlap):

  • Chunk 1: Tokens 0-500
  • Chunk 2: Tokens 400-900
  • Chunk 3: Tokens 800-1300

When to use: Dense technical content where every word matters (legal, medical, code)
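
A minimal sketch of the 500/100 windowing above, assuming the document has already been tokenized into a list (real systems typically reuse the embedding model's tokenizer):

```python
def sliding_window_chunks(tokens, size=500, overlap=100):
    """Return overlapping windows of `size` tokens, advancing by size - overlap."""
    step = size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append((start, end, tokens[start:end]))
        if end == len(tokens):
            break
        start += step
    return chunks

tokens = [f"tok{i}" for i in range(1300)]
for i, (start, end, _) in enumerate(sliding_window_chunks(tokens), start=1):
    print(f"Chunk {i}: tokens {start}-{end}")
# Chunk 1: tokens 0-500, Chunk 2: tokens 400-900, Chunk 3: tokens 800-1300
```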

3. Dynamic sizing

Vary chunk size based on content density. Short chunks for dense info (code snippets, definitions), longer for narrative text.

When to use: Mixed-content documents (tutorials with code examples, research papers with prose and data)
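
There is no standard recipe for this; one toy heuristic, with made-up thresholds, might look like:

```python
def target_chunk_size(block: str) -> int:
    """Pick a smaller chunk size for dense blocks, a larger one for prose.
    Thresholds here are illustrative, not tuned values."""
    looks_like_code = "def " in block or block.lstrip().startswith(("import ", "class "))
    is_short_definition = len(block.split()) < 60
    if looks_like_code or is_short_definition:
        return 200   # dense info stays in tight, focused chunks
    return 800       # narrative text tolerates larger windows

blocks = ["def greet(name):\n    return f'Hello {name}'", "A long narrative paragraph " * 40]
print([target_chunk_size(b) for b in blocks])  # [200, 800]
```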

4. Parent-child chunking

Store small chunks for retrieval, but include parent context (the full section or page) when passing to the LLM.

Example:

  • Small chunk: "API rate limit: 100 requests/min"
  • Parent context: Full "Rate Limits" section with examples and error handling

When to use: When you need precise retrieval but comprehensive answers
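
A toy sketch of the idea, with keyword matching standing in for vector search and illustrative section names:

```python
# Retrieval happens on small chunks; the LLM receives the full parent section.
parents = {
    "rate-limits": (
        "Rate Limits\n"
        "The API allows 100 requests/min per key. Exceeding the limit returns "
        "HTTP 429; back off and retry with exponential delay."
    ),
}
children = [
    {"parent_id": "rate-limits", "text": "API rate limit: 100 requests/min"},
    {"parent_id": "rate-limits", "text": "HTTP 429 means back off and retry"},
]

def retrieve(query: str) -> list[str]:
    """Match on small chunks (word overlap stands in for vector search),
    then return the deduplicated parent sections."""
    words = set(query.lower().split())
    hits = [c for c in children if words & set(c["text"].lower().split())]
    return list({c["parent_id"]: parents[c["parent_id"]] for c in hits}.values())

print(retrieve("what is the rate limit"))
```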

Metadata: Making search smarter

Metadata lets you filter and prioritize before semantic search even happens.

What to tag

  • Document type: "policy," "tutorial," "API reference"
  • Date/version: "2024-Q3," "v2.1"
  • Author/department: "Engineering," "Legal," "Sales"
  • Audience: "internal," "customer-facing," "developer"
  • Freshness: Last updated timestamp

How to use it

Pre-filter: "Only search documents tagged 'API reference' and 'v2.1'"
Boost results: Prioritize recent docs over old ones
Multi-tenancy: Filter by customer ID or organization

Example: User asks "What's our refund policy?" Pre-filter to documents tagged "policy" and "customer-facing," then search only those chunks. Faster and more accurate.
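
In production the filter is usually pushed down into the vector database query; the sketch below shows the same idea over an in-memory list, with illustrative metadata values:

```python
# Metadata pre-filtering before semantic scoring.
chunks = [
    {"text": "Refunds are issued within 14 days of purchase.",
     "doc_type": "policy", "audience": "customer-facing"},
    {"text": "Internal escalation process for refund disputes.",
     "doc_type": "policy", "audience": "internal"},
    {"text": "How to call the /refunds endpoint.",
     "doc_type": "API reference", "audience": "developer"},
]

def prefilter(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [c for c in chunks if all(c.get(k) == v for k, v in required.items())]

candidates = prefilter(chunks, doc_type="policy", audience="customer-facing")
print([c["text"] for c in candidates])  # only the customer-facing policy chunk remains
```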

Indexing strategies

How you organize embeddings affects search speed and quality.

Flat indexing

All chunks in one vector space. Simple, works for small-to-medium datasets (< 100K chunks).

Pros: Easy to implement, fast to query
Cons: Slows down with scale, no hierarchical context

Hierarchical indexing

Organize chunks in layers: document summaries at the top, sections in the middle, chunks at the bottom.

Search flow:

  1. Search summaries: "Which docs mention refunds?"
  2. Search sections within those docs
  3. Retrieve relevant chunks

Pros: Faster, better context, works at scale
Cons: More complex to build

When to use: Large knowledge bases (10K+ documents), multi-layered content
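
A self-contained sketch of the two-stage flow, using word overlap as a stand-in for embedding similarity:

```python
import string

# Rank document summaries first, then only the chunks inside the best-matching docs.
docs = {
    "billing-guide": {
        "summary": "Billing, invoices, refunds and payment methods",
        "chunks": ["Refunds are processed within 14 days.", "Invoices are emailed monthly."],
    },
    "api-reference": {
        "summary": "Endpoints, authentication and rate limits",
        "chunks": ["Authenticate with a bearer token.", "Rate limit: 100 requests/min."],
    },
}

def overlap(a: str, b: str) -> int:
    norm = lambda s: {w.strip(string.punctuation) for w in s.lower().split()}
    return len(norm(a) & norm(b))

def hierarchical_search(query: str, top_docs: int = 1, top_chunks: int = 2) -> list[str]:
    ranked_docs = sorted(docs, key=lambda d: overlap(query, docs[d]["summary"]), reverse=True)
    candidates = [c for d in ranked_docs[:top_docs] for c in docs[d]["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_chunks]

print(hierarchical_search("Which docs mention refunds?"))
```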

Multi-vector indexing

Store multiple embeddings per chunk: one for the full text, one for a summary, one for keywords.

Query different representations depending on the question type.

Pros: Captures multiple aspects of meaning
Cons: Higher storage cost

When to use: Complex documents where single embeddings miss nuance (research papers, technical specs)

Hybrid search: Best of both worlds

Vector search finds meaning. Keyword search finds exact terms. Hybrid search combines them.

Why you need both

Vector search alone misses:

  • Exact product names ("Model X-200")
  • Acronyms ("HIPAA," "GDPR")
  • Serial numbers, IDs, version numbers

Keyword search alone misses:

  • Synonyms ("refund" vs "money back")
  • Paraphrased questions
  • Semantic meaning

How it works

  1. Run BM25 (keyword search): Score chunks by term frequency and rarity
  2. Run vector search: Score chunks by semantic similarity
  3. Combine scores (weighted sum or reciprocal rank fusion)
  4. Return top results

Example weights:

  • 70% vector, 30% keyword (general Q&A)
  • 50% vector, 50% keyword (mixed)
  • 30% vector, 70% keyword (technical lookup)

When to use: Production RAG systems where accuracy is critical
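
A minimal sketch of the weighted-sum fusion step with made-up scores; reciprocal rank fusion is a common alternative that skips normalization by combining ranks instead of raw scores:

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_scores(vector_scores: dict, keyword_scores: dict, w_vec: float = 0.7) -> dict:
    """Weighted sum of normalized scores; chunks missing from one list score 0 there."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    ids = set(v) | set(k)
    return {i: w_vec * v.get(i, 0.0) + (1 - w_vec) * k.get(i, 0.0) for i in ids}

vector_scores  = {"chunk_a": 0.82, "chunk_b": 0.79, "chunk_c": 0.40}   # cosine similarity
keyword_scores = {"chunk_b": 12.4, "chunk_c": 9.1, "chunk_d": 3.0}     # raw BM25 scores
ranked = sorted(hybrid_scores(vector_scores, keyword_scores).items(), key=lambda x: -x[1])
print(ranked)  # chunk_b ranks first: strong on both signals
```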

Re-ranking: Refining your results

Retrieval gives you 20 candidate chunks. Re-ranking picks the best 5 to send to the LLM.

Why re-rank?

Initial retrieval is fast but imprecise. Re-ranking uses slower, more accurate models to refine the shortlist.

Methods

1. Cross-encoders

Unlike embeddings (which encode query and document separately), cross-encoders process them together and output a relevance score.

Example: MS MARCO cross-encoders, BERT re-rankers

Pros: More accurate than pure vector similarity
Cons: Slower (can't pre-compute)
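
A short example using the sentence-transformers CrossEncoder class with a publicly available MS MARCO model (swap in whichever checkpoint suits your domain):

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "What is the API rate limit?"
candidates = [
    "The API allows 100 requests per minute per key.",
    "Our refund policy covers purchases made within 30 days.",
    "Rate limits reset every 60 seconds; a 429 means you should back off.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])  # one relevance score per pair

# Keep the top-k candidates by cross-encoder score.
top_k = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:2]
for doc, score in top_k:
    print(f"{score:.2f}  {doc}")
```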

2. LLM-based re-ranking

Ask the LLM: "On a scale of 1-10, how relevant is this chunk to the question?"

Pros: Highly accurate, understands nuance
Cons: Expensive, slow

When to use: High-stakes queries (legal, medical), or as a final filtering step
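
The prompt can be as simple as the question above. A sketch of the prompt construction only; the LLM call itself depends on whatever client library you already use:

```python
def rerank_prompt(question: str, chunk: str) -> str:
    """Build a 1-10 relevance-scoring prompt for a single (question, chunk) pair."""
    return (
        "On a scale of 1-10, how relevant is the following passage to the question?\n"
        f"Question: {question}\n"
        f"Passage: {chunk}\n"
        "Answer with a single integer."
    )

print(rerank_prompt("What is the API rate limit?", "The API allows 100 requests/min per key."))
```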

Re-ranking workflow

  1. Retrieve 50 chunks (fast, broad search)
  2. Re-rank to top 10 (cross-encoder)
  3. Send top 5 to LLM (context window limit)

Evaluating retrieval quality

You can't improve what you don't measure.

Key metrics

Recall: Did you retrieve all relevant chunks?

  • Formula: Relevant retrieved / Total relevant
  • Example: 7 relevant chunks exist, you retrieved 5 → Recall = 71%

Precision: Are the retrieved chunks actually relevant?

  • Formula: Relevant retrieved / Total retrieved
  • Example: Retrieved 10 chunks, 5 relevant → Precision = 50%

MRR (Mean Reciprocal Rank): How high does the first relevant result rank?

  • Formula: 1 / rank of first relevant result, averaged across test queries
  • Example: First relevant result at position 3 → reciprocal rank = 1/3 = 0.33

NDCG (Normalized Discounted Cumulative Gain): Rewards putting highly relevant results at the top

  • Range: 0 to 1 (higher is better)
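
The worked examples above map directly to code. A minimal sketch of recall, precision, and reciprocal rank for a single query (MRR averages the reciprocal rank over all test questions):

```python
def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved)

def reciprocal_rank(ranked: list, relevant: set) -> float:
    for i, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1 / i
    return 0.0

relevant  = {"c1", "c2", "c3", "c4", "c5", "c6", "c7"}   # 7 relevant chunks exist
ranked    = ["x1", "x2", "c1", "c2", "c3", "c4", "c5", "x3", "x4", "x5"]
retrieved = set(ranked)

print(f"recall    = {recall(retrieved, relevant):.2f}")       # 5/7  ≈ 0.71
print(f"precision = {precision(retrieved, relevant):.2f}")    # 5/10 = 0.50
print(f"RR        = {reciprocal_rank(ranked, relevant):.2f}")  # first hit at rank 3 → 0.33
```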

How to evaluate

  1. Build a test set: 50-100 questions with known correct chunks
  2. Run retrieval and measure metrics
  3. Iterate: Tune chunking, retrieval, re-ranking
  4. Re-test

Tools: Ragas, LlamaIndex evaluators, custom scripts

Debugging retrieval failures

Problem: Missing chunks (low recall)

Symptoms: LLM says "I don't have that information" (but the docs contain it)

Fixes:

  • Check chunking (did you split the answer across chunks?)
  • Try different embedding models (some are better for your domain)
  • Increase number of retrieved chunks
  • Add keyword search (hybrid) to catch exact terms

Problem: Irrelevant results (low precision)

Symptoms: LLM gets confused by noisy chunks, gives vague answers

Fixes:

  • Use metadata to pre-filter
  • Add re-ranking
  • Improve chunk quality (remove boilerplate, headers, footers)
  • Tune similarity threshold (only return chunks above a certain score)

Problem: Semantic drift

Symptoms: Search returns chunks that are topically related but don't answer the question

Fixes:

  • Use cross-encoder re-ranking
  • Add keyword search to anchor results
  • Fine-tune embeddings on your domain (advanced)

Tools and libraries

LangChain: Full RAG framework with chunking, retrieval, re-ranking modules. Great for rapid prototyping.

LlamaIndex: Specializes in indexing and retrieval strategies (hierarchical, multi-vector). Excellent for complex data.

Haystack: Production-ready, supports hybrid search, re-ranking, and custom pipelines.

Weaviate, Qdrant, Pinecone: Vector databases with built-in hybrid search and filtering.

Cohere Rerank, Jina Reranker: API-based re-ranking services.

Practical example: Building a support bot

Goal: Answer customer questions using your help docs.

Setup:

  1. Chunk help articles by H2 headings (semantic boundaries)
  2. Add metadata: Article category, last updated date
  3. Index in a vector DB with hybrid search enabled
  4. Retrieval flow:
    • Pre-filter by category (if mentioned in question)
    • Hybrid search (60% vector, 40% keyword)
    • Retrieve top 20 chunks
    • Re-rank with cross-encoder to top 5
    • Send to LLM with question

Result: 85% answer accuracy vs. 60% with basic RAG
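
Tying it together, a condensed sketch of that flow with stub helpers standing in for the vector DB, BM25 index, cross-encoder, and LLM client (all names and data are illustrative):

```python
def prefilter(chunks, category=None):
    return [c for c in chunks if category is None or c["category"] == category]

def hybrid_search(query, chunks, w_vec=0.6, k=20):
    # Stand-in scorer: a real system combines embedding and BM25 scores here.
    def score(c):
        keyword_hits = sum(w in c["text"].lower() for w in query.lower().split())
        return w_vec * 0.5 + (1 - w_vec) * keyword_hits
    return sorted(chunks, key=score, reverse=True)[:k]

def rerank(query, chunks, k=5):
    return chunks[:k]  # stand-in for the cross-encoder step shown earlier

def answer(query, help_chunks):
    # Crude intent-to-category mapping, just for the demo.
    category = "billing" if "refund" in query.lower() else None
    candidates = prefilter(help_chunks, category=category)
    top = rerank(query, hybrid_search(query, candidates))
    context = "\n\n".join(c["text"] for c in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

help_chunks = [
    {"text": "Refunds are issued within 14 days of purchase.", "category": "billing"},
    {"text": "Reset your password from the account settings page.", "category": "account"},
]
print(answer("How do refunds work?", help_chunks))
```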

Use responsibly

  • Monitor retrieval quality (not just LLM output)
  • Audit metadata for bias (don't over-filter)
  • Test edge cases (rare queries, ambiguous questions)
  • Version your chunks (so you can roll back if chunking changes break things)

What's next?

  • Evaluating AI Answers: Measure hallucination rates and answer quality
  • Vector DBs 101 (coming soon): Deep dive into indexing and scaling vector databases
  • Embeddings & RAG Explained: Start here if you're new to RAG
  • Prompting 101: Craft better prompts to guide LLMs using retrieved chunks