TL;DR

Basic RAG works, but production systems need advanced retrieval. Chunking strategy determines what information is available. Hybrid search combines keyword and semantic matching. Re-ranking refines results for accuracy. Metadata enables precise filtering. Master these techniques to build RAG systems that consistently find the right information.

Why advanced retrieval matters

You've built a basic RAG system. It works... sometimes. Users ask questions and get vague answers, or answers that miss critical information buried in your docs. The problem isn't the LLM; it's retrieval.

If the search doesn't find the right chunks, the LLM can't generate the right answer. Garbage in, garbage out.

Advanced retrieval techniques dramatically improve accuracy by ensuring the most relevant information reaches the LLM every time.

Chunking: The foundation of retrieval

Chunking is how you split documents into searchable pieces. Get it wrong, and everything downstream fails.

Common pitfalls

Too small: "Return within 30 days" (missing: "with receipt" from the next sentence)
Too large: 2000-token chunks with one relevant sentence buried in noise
No context: Chunks that start mid-thought or reference "the above section"

Advanced chunking strategies

1. Semantic boundaries

Split at natural breakpoints: headings, paragraphs, topic shifts. For markdown or HTML, use the structure (h2 tags, section breaks). For plain text, use sentence boundaries and paragraph breaks.

When to use: Structured documents (docs, wikis, reports)
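
For markdown, heading-based splitting can be a single regex pass. A minimal sketch; the function is illustrative and not tied to any particular library:

```python
import re

def chunk_by_headings(markdown_text: str) -> list[str]:
    """Split a markdown document right before each h1-h3 heading."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]  # drop empty fragments

doc = """# Refund policy
Items can be returned within 30 days with a receipt.

## Exceptions
Final-sale items cannot be returned.
"""

for chunk in chunk_by_headings(doc):
    print(chunk)
    print("---")
```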

2. Overlapping windows

Create chunks with 10-20% overlap so context isn't lost at boundaries.

Example (500-token chunks, 100-token overlap):

  • Chunk 1: Tokens 0-500
  • Chunk 2: Tokens 400-900
  • Chunk 3: Tokens 800-1300

When to use: Dense technical content where every word matters (legal, medical, code)
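
A minimal sketch of the 500/100 windowing above, assuming the document has already been tokenized into a list (real systems typically reuse the embedding model's tokenizer):

```python
def sliding_window_chunks(tokens, size=500, overlap=100):
    """Return overlapping windows of `size` tokens, advancing by size - overlap."""
    step = size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append((start, end, tokens[start:end]))
        if end == len(tokens):
            break
        start += step
    return chunks

tokens = [f"tok{i}" for i in range(1300)]
for i, (start, end, _) in enumerate(sliding_window_chunks(tokens), start=1):
    print(f"Chunk {i}: tokens {start}-{end}")
# Chunk 1: tokens 0-500, Chunk 2: tokens 400-900, Chunk 3: tokens 800-1300
```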

3. Dynamic sizing

Vary chunk size based on content density. Short chunks for dense info (code snippets, definitions), longer for narrative text.

When to use: Mixed-content documents (tutorials with code examples, research papers with prose and data)
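
There is no standard recipe for this; one toy heuristic, with made-up thresholds, might look like:

```python
def target_chunk_size(block: str) -> int:
    """Pick a smaller chunk size for dense blocks, a larger one for prose.
    Thresholds here are illustrative, not tuned values."""
    looks_like_code = "def " in block or block.lstrip().startswith(("import ", "class "))
    is_short_definition = len(block.split()) < 60
    if looks_like_code or is_short_definition:
        return 200   # dense info stays in tight, focused chunks
    return 800       # narrative text tolerates larger windows

blocks = ["def greet(name):\n    return f'Hello {name}'", "A long narrative paragraph " * 40]
print([target_chunk_size(b) for b in blocks])  # [200, 800]
```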

4. Parent-child chunking

Store small chunks for retrieval, but include parent context (the full section or page) when passing to the LLM.

Example:

  • Small chunk: "API rate limit: 100 requests/min"
  • Parent context: Full "Rate Limits" section with examples and error handling

When to use: When you need precise retrieval but comprehensive answers
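
A toy sketch of the idea, with keyword matching standing in for vector search and illustrative section names:

```python
# Retrieval happens on small chunks; the LLM receives the full parent section.
parents = {
    "rate-limits": (
        "Rate Limits\n"
        "The API allows 100 requests/min per key. Exceeding the limit returns "
        "HTTP 429; back off and retry with exponential delay."
    ),
}
children = [
    {"parent_id": "rate-limits", "text": "API rate limit: 100 requests/min"},
    {"parent_id": "rate-limits", "text": "HTTP 429 means back off and retry"},
]

def retrieve(query: str) -> list[str]:
    """Match on small chunks (word overlap stands in for vector search),
    then return the deduplicated parent sections."""
    words = set(query.lower().split())
    hits = [c for c in children if words & set(c["text"].lower().split())]
    return list({c["parent_id"]: parents[c["parent_id"]] for c in hits}.values())

print(retrieve("what is the rate limit"))
```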

Metadata: Making search smarter

Metadata lets you filter and prioritize before semantic search even happens.

What to tag

  • Document type: "policy," "tutorial," "API reference"
  • Date/version: "2024-Q3," "v2.1"
  • Author/department: "Engineering," "Legal," "Sales"
  • Audience: "internal," "customer-facing," "developer"
  • Freshness: Last updated timestamp

How to use it

Pre-filter: "Only search documents tagged 'API reference' and 'v2.1'"
Boost results: Prioritize recent docs over old ones
Multi-tenancy: Filter by customer ID or organization

Example: User asks "What's our refund policy?" Pre-filter to documents tagged "policy" and "customer-facing," then search only those chunks. Faster and more accurate.
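
In production the filter is usually pushed down into the vector database query; the sketch below shows the same idea over an in-memory list, with illustrative metadata values:

```python
# Metadata pre-filtering before semantic scoring.
chunks = [
    {"text": "Refunds are issued within 14 days of purchase.",
     "doc_type": "policy", "audience": "customer-facing"},
    {"text": "Internal escalation process for refund disputes.",
     "doc_type": "policy", "audience": "internal"},
    {"text": "How to call the /refunds endpoint.",
     "doc_type": "API reference", "audience": "developer"},
]

def prefilter(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [c for c in chunks if all(c.get(k) == v for k, v in required.items())]

candidates = prefilter(chunks, doc_type="policy", audience="customer-facing")
print([c["text"] for c in candidates])  # only the customer-facing policy chunk remains
```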

Indexing strategies

How you organize embeddings affects search speed and quality.

Flat indexing

All chunks in one vector space. Simple, works for small-to-medium datasets (< 100K chunks).

Pros: Easy to implement, fast to query
Cons: Slows down with scale, no hierarchical context

Hierarchical indexing

Organize chunks in layers: document summaries at the top, sections in the middle, chunks at the bottom.

Search flow:

  1. Search summaries: "Which docs mention refunds?"
  2. Search sections within those docs
  3. Retrieve relevant chunks

Pros: Faster, better context, works at scale
Cons: More complex to build

When to use: Large knowledge bases (10K+ documents), multi-layered content
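
A self-contained sketch of the two-stage flow, using word overlap as a stand-in for embedding similarity:

```python
import string

# Rank document summaries first, then only the chunks inside the best-matching docs.
docs = {
    "billing-guide": {
        "summary": "Billing, invoices, refunds and payment methods",
        "chunks": ["Refunds are processed within 14 days.", "Invoices are emailed monthly."],
    },
    "api-reference": {
        "summary": "Endpoints, authentication and rate limits",
        "chunks": ["Authenticate with a bearer token.", "Rate limit: 100 requests/min."],
    },
}

def overlap(a: str, b: str) -> int:
    norm = lambda s: {w.strip(string.punctuation) for w in s.lower().split()}
    return len(norm(a) & norm(b))

def hierarchical_search(query: str, top_docs: int = 1, top_chunks: int = 2) -> list[str]:
    ranked_docs = sorted(docs, key=lambda d: overlap(query, docs[d]["summary"]), reverse=True)
    candidates = [c for d in ranked_docs[:top_docs] for c in docs[d]["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_chunks]

print(hierarchical_search("Which docs mention refunds?"))
```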

Multi-vector indexing

Store multiple embeddings per chunk: one for the full text, one for a summary, one for keywords.

Query different representations depending on the question type.

Pros: Captures multiple aspects of meaning
Cons: Higher storage cost

When to use: Complex documents where single embeddings miss nuance (research papers, technical specs)

Hybrid search: Best of both worlds

Vector search finds meaning. Keyword search finds exact terms. Hybrid search combines them.

Why you need both

Vector search alone misses:

  • Exact product names ("Model X-200")
  • Acronyms ("HIPAA," "GDPR")
  • Serial numbers, IDs, version numbers

Keyword search alone misses:

  • Synonyms ("refund" vs "money back")
  • Paraphrased questions
  • Semantic meaning

How it works

  1. Run BM25 (keyword search): Score chunks by term frequency and rarity
  2. Run vector search: Score chunks by semantic similarity
  3. Combine scores (weighted sum or reciprocal rank fusion)
  4. Return top results

Example weights:

  • 70% vector, 30% keyword (general Q&A)
  • 50% vector, 50% keyword (mixed)
  • 30% vector, 70% keyword (technical lookup)

When to use: Production RAG systems where accuracy is critical
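
A minimal sketch of the weighted-sum fusion step with made-up scores; reciprocal rank fusion is a common alternative that skips normalization by combining ranks instead of raw scores:

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_scores(vector_scores: dict, keyword_scores: dict, w_vec: float = 0.7) -> dict:
    """Weighted sum of normalized scores; chunks missing from one list score 0 there."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    ids = set(v) | set(k)
    return {i: w_vec * v.get(i, 0.0) + (1 - w_vec) * k.get(i, 0.0) for i in ids}

vector_scores  = {"chunk_a": 0.82, "chunk_b": 0.79, "chunk_c": 0.40}   # cosine similarity
keyword_scores = {"chunk_b": 12.4, "chunk_c": 9.1, "chunk_d": 3.0}     # raw BM25 scores
ranked = sorted(hybrid_scores(vector_scores, keyword_scores).items(), key=lambda x: -x[1])
print(ranked)  # chunk_b ranks first: strong on both signals
```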

Re-ranking: Refining your results

Retrieval gives you 20 candidate chunks. Re-ranking picks the best 5 to send to the LLM.

Why re-rank?

Initial retrieval is fast but imprecise. Re-ranking uses slower, more accurate models to refine the shortlist.

Methods

1. Cross-encoders

Unlike embeddings (which encode query and document separately), cross-encoders process them together and output a relevance score.

Example: MS MARCO cross-encoders, BERT re-rankers

Pros: More accurate than pure vector similarity
Cons: Slower (can't pre-compute)
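
A short example using the sentence-transformers CrossEncoder class with a publicly available MS MARCO model (swap in whichever checkpoint suits your domain):

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "What is the API rate limit?"
candidates = [
    "The API allows 100 requests per minute per key.",
    "Our refund policy covers purchases made within 30 days.",
    "Rate limits reset every 60 seconds; a 429 means you should back off.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])  # one relevance score per pair

# Keep the top-k candidates by cross-encoder score.
top_k = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:2]
for doc, score in top_k:
    print(f"{score:.2f}  {doc}")
```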

2. LLM-based re-ranking

Ask the LLM: "On a scale of 1-10, how relevant is this chunk to the question?"

Pros: Highly accurate, understands nuance
Cons: Expensive, slow

When to use: High-stakes queries (legal, medical), or as a final filtering step
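
The prompt can be as simple as the question above. A sketch of the prompt construction only; the LLM call itself depends on whatever client library you already use:

```python
def rerank_prompt(question: str, chunk: str) -> str:
    """Build a 1-10 relevance-scoring prompt for a single (question, chunk) pair."""
    return (
        "On a scale of 1-10, how relevant is the following passage to the question?\n"
        f"Question: {question}\n"
        f"Passage: {chunk}\n"
        "Answer with a single integer."
    )

print(rerank_prompt("What is the API rate limit?", "The API allows 100 requests/min per key."))
```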

Re-ranking workflow

  1. Retrieve 50 chunks (fast, broad search)
  2. Re-rank to top 10 (cross-encoder)
  3. Send top 5 to LLM (context window limit)

Evaluating retrieval quality

You can't improve what you don't measure.

Key metrics

Recall: Did you retrieve all relevant chunks?

  • Formula: Relevant retrieved / Total relevant
  • Example: 7 relevant chunks exist, you retrieved 5 → Recall = 71%

Precision: Are the retrieved chunks actually relevant?

  • Formula: Relevant retrieved / Total retrieved
  • Example: Retrieved 10 chunks, 5 relevant → Precision = 50%

MRR (Mean Reciprocal Rank): How high does the first relevant result rank?

  • Formula: 1 / rank of first relevant result, averaged across test queries
  • Example: First relevant result at position 3 → reciprocal rank = 1/3 = 0.33

NDCG (Normalized Discounted Cumulative Gain): Rewards putting highly relevant results at the top

  • Range: 0 to 1 (higher is better)
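
The worked examples above map directly to code. A minimal sketch of recall, precision, and reciprocal rank for a single query (MRR averages the reciprocal rank over all test questions):

```python
def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved)

def reciprocal_rank(ranked: list, relevant: set) -> float:
    for i, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1 / i
    return 0.0

relevant  = {"c1", "c2", "c3", "c4", "c5", "c6", "c7"}   # 7 relevant chunks exist
ranked    = ["x1", "x2", "c1", "c2", "c3", "c4", "c5", "x3", "x4", "x5"]
retrieved = set(ranked)

print(f"recall    = {recall(retrieved, relevant):.2f}")       # 5/7  ≈ 0.71
print(f"precision = {precision(retrieved, relevant):.2f}")    # 5/10 = 0.50
print(f"RR        = {reciprocal_rank(ranked, relevant):.2f}")  # first hit at rank 3 → 0.33
```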

How to evaluate

  1. Build a test set: 50-100 questions with known correct chunks
  2. Run retrieval and measure metrics
  3. Iterate: Tune chunking, retrieval, re-ranking
  4. Re-test

Tools: Ragas, LlamaIndex evaluators, custom scripts

Debugging retrieval failures

Problem: Missing chunks (low recall)

Symptoms: LLM says "I don't have that information" (but the docs contain it)

Fixes:

  • Check chunking (did you split the answer across chunks?)
  • Try different embedding models (some are better for your domain)
  • Increase number of retrieved chunks
  • Add keyword search (hybrid) to catch exact terms

Problem: Irrelevant results (low precision)

Symptoms: LLM gets confused by noisy chunks, gives vague answers

Fixes:

  • Use metadata to pre-filter
  • Add re-ranking
  • Improve chunk quality (remove boilerplate, headers, footers)
  • Tune similarity threshold (only return chunks above a certain score)

Problem: Semantic drift

Symptoms: Search returns chunks that are topically related but don't answer the question

Fixes:

  • Use cross-encoder re-ranking
  • Add keyword search to anchor results
  • Fine-tune embeddings on your domain (advanced)

Tools and libraries

LangChain: Full RAG framework with chunking, retrieval, re-ranking modules. Great for rapid prototyping.

LlamaIndex: Specializes in indexing and retrieval strategies (hierarchical, multi-vector). Excellent for complex data.

Haystack: Production-ready, supports hybrid search, re-ranking, and custom pipelines.

Weaviate, Qdrant, Pinecone: Vector databases with built-in hybrid search and filtering.

Cohere Rerank, Jina Reranker: API-based re-ranking services.

Practical example: Building a support bot

Goal: Answer customer questions using your help docs.

Setup:

  1. Chunk help articles by H2 headings (semantic boundaries)
  2. Add metadata: Article category, last updated date
  3. Index in a vector DB with hybrid search enabled
  4. Retrieval flow:
    • Pre-filter by category (if mentioned in question)
    • Hybrid search (60% vector, 40% keyword)
    • Retrieve top 20 chunks
    • Re-rank with cross-encoder to top 5
    • Send to LLM with question

Result: 85% answer accuracy vs. 60% with basic RAG
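
Tying it together, a condensed sketch of that flow with stub helpers standing in for the vector DB, BM25 index, cross-encoder, and LLM client (all names and data are illustrative):

```python
def prefilter(chunks, category=None):
    return [c for c in chunks if category is None or c["category"] == category]

def hybrid_search(query, chunks, w_vec=0.6, k=20):
    # Stand-in scorer: a real system combines embedding and BM25 scores here.
    def score(c):
        keyword_hits = sum(w in c["text"].lower() for w in query.lower().split())
        return w_vec * 0.5 + (1 - w_vec) * keyword_hits
    return sorted(chunks, key=score, reverse=True)[:k]

def rerank(query, chunks, k=5):
    return chunks[:k]  # stand-in for the cross-encoder step shown earlier

def answer(query, help_chunks):
    # Crude intent-to-category mapping, just for the demo.
    category = "billing" if "refund" in query.lower() else None
    candidates = prefilter(help_chunks, category=category)
    top = rerank(query, hybrid_search(query, candidates))
    context = "\n\n".join(c["text"] for c in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

help_chunks = [
    {"text": "Refunds are issued within 14 days of purchase.", "category": "billing"},
    {"text": "Reset your password from the account settings page.", "category": "account"},
]
print(answer("How do refunds work?", help_chunks))
```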

Use responsibly

  • Monitor retrieval quality (not just LLM output)
  • Audit metadata for bias (don't over-filter)
  • Test edge cases (rare queries, ambiguous questions)
  • Version your chunks (so you can roll back if chunking changes break things)

What's next?

  • Evaluating AI Answers: Measure hallucination rates and answer quality
  • Vector DBs 101 (coming soon): Deep dive into indexing and scaling vector databases
  • Embeddings & RAG Explained: Start here if you're new to RAG
  • Prompting 101: Craft better prompts to guide LLMs using retrieved chunks