Retrieval 201: Chunking, Indexing, and Hybrid Search
Go beyond basic RAG. Advanced techniques for chunking documents, indexing strategies, re-ranking, and hybrid search.
TL;DR
Basic RAG works, but production systems need advanced retrieval. Chunking strategy determines what information is available. Hybrid search combines keyword and semantic matching. Re-ranking refines results for accuracy. Metadata enables precise filtering. Master these techniques to build RAG systems that consistently find the right information.
Why advanced retrieval matters
You've built a basic RAG system. It works... sometimes. Users ask questions and get vague answers, or miss critical information buried in your docs. The problem isn't the LLM; it's retrieval.
If the search doesn't find the right chunks, the LLM can't generate the right answer. Garbage in, garbage out.
Advanced retrieval techniques dramatically improve accuracy by ensuring the most relevant information reaches the LLM every time.
Chunking: The foundation of retrieval
Chunking is how you split documents into searchable pieces. Get it wrong, and everything downstream fails.
Common pitfalls
Too small: "Return within 30 days" (missing: "with receipt" from the next sentence)
Too large: 2000-token chunks with one relevant sentence buried in noise
No context: Chunks that start mid-thought or reference "the above section"
Advanced chunking strategies
1. Semantic boundaries
Split at natural breakpoints: headings, paragraphs, topic shifts. For markdown or HTML, use the structure (h2 tags, section breaks). For plain text, use sentence boundaries and paragraph breaks.
When to use: Structured documents (docs, wikis, reports)
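For markdown, a few lines of Python can do this. A minimal sketch, assuming H2 headings mark the section boundaries (the helper name and sample document are illustrative):

```python
import re

def chunk_by_headings(markdown_text: str) -> list[str]:
    """Split a markdown document into chunks at H2 boundaries."""
    # Split on newlines that are immediately followed by "## ",
    # so each heading stays attached to its own section.
    sections = re.split(r"\n(?=## )", markdown_text)
    return [s.strip() for s in sections if s.strip()]

doc = "# Guide\nIntro text.\n## Refunds\nReturn within 30 days with receipt.\n## Shipping\nShips in 2 days."
for chunk in chunk_by_headings(doc):
    print(chunk[:40])
```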
2. Overlapping windows
Create chunks with 10-20% overlap so context isn't lost at boundaries.
Example (500-token chunks, 100-token overlap):
- Chunk 1: Tokens 0-500
- Chunk 2: Tokens 400-900
- Chunk 3: Tokens 800-1300
When to use: Dense technical content where every word matters (legal, medical, code)
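A minimal sketch of a sliding-window chunker that reproduces the numbers above, using whitespace-split words as a rough stand-in for real tokenizer tokens:

```python
def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    tokens = text.split()           # approximate tokens; use your model's tokenizer in practice
    step = size - overlap           # a 400-token stride for 500-token chunks with 100 overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break                   # the last window already reached the end of the text
    return chunks

text = " ".join(f"tok{i}" for i in range(1300))
print(len(sliding_window_chunks(text)))  # 3 windows: 0-500, 400-900, 800-1300
```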
3. Dynamic sizing
Vary chunk size based on content density. Short chunks for dense info (code snippets, definitions), longer for narrative text.
When to use: Mixed-content documents (tutorials with code examples, research papers with prose and data)
4. Parent-child chunking
Store small chunks for retrieval, but include parent context (the full section or page) when passing to the LLM.
Example:
- Small chunk: "API rate limit: 100 requests/min"
- Parent context: Full "Rate Limits" section with examples and error handling
When to use: When you need precise retrieval but comprehensive answers
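One way to wire this up is to index small chunks that each carry a pointer back to their parent section. A minimal sketch, assuming in-memory data structures rather than any particular vector database's API:

```python
parents: dict[str, str] = {}            # parent_id -> full section text (sent to the LLM)
children: list[tuple[str, str]] = []    # (small chunk text, parent_id) -> embedded and searched

def add_section(parent_id: str, section_text: str, child_words: int = 12):
    parents[parent_id] = section_text
    words = section_text.split()
    for i in range(0, len(words), child_words):
        children.append((" ".join(words[i:i + child_words]), parent_id))

def expand_to_parents(matched: list[tuple[str, str]]) -> list[str]:
    # Deduplicate so the LLM sees each parent section only once.
    ordered_ids = dict.fromkeys(pid for _, pid in matched)
    return [parents[pid] for pid in ordered_ids]

add_section("rate-limits", "API rate limit: 100 requests/min. Exceeding it returns "
                           "HTTP 429; back off and retry after the window resets.")
# Pretend vector search matched the first small chunk:
print(expand_to_parents([children[0]]))  # returns the full Rate Limits section
```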
Metadata: Making search smarter
Metadata lets you filter and prioritize before semantic search even happens.
What to tag
- Document type: "policy," "tutorial," "API reference"
- Date/version: "2024-Q3," "v2.1"
- Author/department: "Engineering," "Legal," "Sales"
- Audience: "internal," "customer-facing," "developer"
- Freshness: Last updated timestamp
How to use it
Pre-filter: "Only search documents tagged 'API reference' and 'v2.1'"
Boost results: Prioritize recent docs over old ones
Multi-tenancy: Filter by customer ID or organization
Example: User asks "What's our refund policy?" Pre-filter to documents tagged "policy" and "customer-facing," then search only those chunks. Faster and more accurate.
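A minimal sketch of that pre-filter step; the chunk records and metadata field names below are hypothetical:

```python
chunks = [
    {"text": "Refunds are issued within 30 days with receipt.",
     "metadata": {"doc_type": "policy", "audience": "customer-facing"}},
    {"text": "POST /v1/refunds creates a refund object.",
     "metadata": {"doc_type": "API reference", "audience": "developer"}},
]

def prefilter(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in required.items())]

# Only these candidates are passed on to vector or hybrid search.
candidates = prefilter(chunks, doc_type="policy", audience="customer-facing")
print(len(candidates))  # 1
```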
Indexing strategies
How you organize embeddings affects search speed and quality.
Flat indexing
All chunks in one vector space. Simple, works for small-to-medium datasets (< 100K chunks).
Pros: Easy to implement, fast to query
Cons: Slows down with scale, no hierarchical context
Hierarchical indexing
Organize chunks in layers: document summaries at the top, sections in the middle, chunks at the bottom.
Search flow:
- Search summaries: "Which docs mention refunds?"
- Search sections within those docs
- Retrieve relevant chunks
Pros: Faster, better context, works at scale
Cons: More complex to build
When to use: Large knowledge bases (10K+ documents), multi-layered content
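A minimal sketch of the two-stage flow. A real system would score both stages with embeddings; a toy word-overlap score stands in here so the control flow stays visible:

```python
summaries = {"doc1": "Refund and return policy for customers",
             "doc2": "API reference for the payments service"}
chunks = [
    {"doc": "doc1", "text": "Refunds are issued within 30 days with a receipt."},
    {"doc": "doc2", "text": "POST /v1/refunds creates a refund object."},
]

def score(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_search(query: str, top_docs: int = 1, top_chunks: int = 3):
    # Stage 1: pick the most promising documents from their summaries.
    ranked_docs = sorted(summaries, key=lambda d: score(query, summaries[d]), reverse=True)
    selected = set(ranked_docs[:top_docs])
    # Stage 2: search only the chunks that belong to those documents.
    candidates = [c for c in chunks if c["doc"] in selected]
    return sorted(candidates, key=lambda c: score(query, c["text"]), reverse=True)[:top_chunks]

print(hierarchical_search("what is the refund policy"))
```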
Multi-vector indexing
Store multiple embeddings per chunk: one for the full text, one for a summary, one for keywords.
Query different representations depending on the question type.
Pros: Captures multiple aspects of meaning
Cons: Higher storage cost
When to use: Complex documents where single embeddings miss nuance (research papers, technical specs)
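A minimal sketch of what a multi-vector record might look like, scored by its best-matching representation. The vectors are tiny hand-written toy values, not real embeddings:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

record = {
    "text": "Rate limit: 100 requests/min per API key.",
    "vectors": {
        "full_text": [0.9, 0.1, 0.3],
        "summary":   [0.7, 0.2, 0.1],
        "keywords":  [0.2, 0.9, 0.4],   # e.g. "rate limit", "API key"
    },
}

def best_score(query_vec, record):
    # Score a chunk by whichever of its representations matches the query best.
    return max(dot(query_vec, v) for v in record["vectors"].values())

print(round(best_score([0.1, 0.8, 0.2], record), 2))  # a keyword-style query matches the keyword vector best
```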
Hybrid search: Best of both worlds
Vector search finds meaning. Keyword search finds exact terms. Hybrid search combines them.
Why you need both
Vector search alone misses:
- Exact product names ("Model X-200")
- Acronyms ("HIPAA," "GDPR")
- Serial numbers, IDs, version numbers
Keyword search alone misses:
- Synonyms ("refund" vs "money back")
- Paraphrased questions
- Semantic meaning
How it works
- Run BM25 (keyword search): Score chunks by term frequency and rarity
- Run vector search: Score chunks by semantic similarity
- Combine scores (weighted sum or reciprocal rank fusion)
- Return top results
Example weights:
- 70% vector, 30% keyword (general Q&A)
- 50% vector, 50% keyword (mixed)
- 30% vector, 70% keyword (technical lookup)
When to use: Production RAG systems where accuracy is critical
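Reciprocal rank fusion is often the simpler of the two combining options because it only needs the two ranked lists, not scores on a comparable scale. A minimal sketch, with hypothetical chunk IDs and the commonly used constant k = 60:

```python
def reciprocal_rank_fusion(keyword_ranked: list[str],
                           vector_ranked: list[str],
                           k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranked = ["c7", "c2", "c9"]   # BM25 results, best first
vector_ranked  = ["c2", "c5", "c7"]   # semantic results, best first
print(reciprocal_rank_fusion(keyword_ranked, vector_ranked))  # c2 and c7 rise to the top
```

A weighted sum works the same way conceptually, but you first need to normalize the BM25 and cosine-similarity scores onto a common scale before applying the weights above.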
Re-ranking: Refining your results
Retrieval gives you 20 candidate chunks. Re-ranking picks the best 5 to send to the LLM.
Why re-rank?
Initial retrieval is fast but imprecise. Re-ranking uses slower, more accurate models to refine.
Methods
1. Cross-encoders
Unlike embeddings (which encode query and document separately), cross-encoders process them together and output a relevance score.
Example: MS MARCO cross-encoders, BERT re-rankers
Pros: More accurate than pure vector similarity
Cons: Slower (can't pre-compute)
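A sketch using the sentence-transformers CrossEncoder class (assumed installed); the MS MARCO model name is one common choice, and both it and the example texts are assumptions rather than requirements:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I get a refund?"
candidates = [
    "Refunds are issued within 30 days with a receipt.",
    "Our office is open Monday to Friday.",
]

# The cross-encoder scores each (query, chunk) pair jointly,
# unlike bi-encoder embeddings that encode them separately.
scores = model.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```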
2. LLM-based re-ranking
Ask the LLM: "On a scale of 1-10, how relevant is this chunk to the question?"
Pros: Highly accurate, understands nuance
Cons: Expensive, slow
When to use: High-stakes queries (legal, medical), or as a final filtering step
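A sketch of that scoring approach. `call_llm` is a hypothetical stand-in for whichever chat client you use, and the prompt wording is illustrative:

```python
def llm_relevance_score(question: str, chunk: str, call_llm) -> int:
    prompt = (
        "On a scale of 1-10, how relevant is this passage to the question? "
        "Reply with a single number.\n\n"
        f"Question: {question}\n\nPassage: {chunk}"
    )
    reply = call_llm(prompt)
    try:
        return int(reply.strip())
    except ValueError:
        return 1  # treat unparseable replies as low relevance

def llm_rerank(question, chunks, call_llm, top_k=5):
    scored = [(llm_relevance_score(question, c, call_llm), c) for c in chunks]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
```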
Re-ranking workflow
- Retrieve 50 chunks (fast, broad search)
- Re-rank to top 10 (cross-encoder)
- Send top 5 to LLM (context window limit)
Evaluating retrieval quality
You can't improve what you don't measure.
Key metrics
Recall: Did you retrieve all relevant chunks?
- Formula: Relevant retrieved / Total relevant
- Example: 7 relevant chunks exist, you retrieved 5 → Recall = 5/7 ≈ 71%
Precision: Are the retrieved chunks actually relevant?
- Formula: Relevant retrieved / Total retrieved
- Example: Retrieved 10 chunks, 5 relevant → Precision = 50%
MRR (Mean Reciprocal Rank): How high does the first relevant result rank?
- Formula: 1 / rank of the first relevant result, averaged across all test queries
- Example: First relevant result at position 3 → reciprocal rank = 1/3 ≈ 0.33
NDCG (Normalized Discounted Cumulative Gain): Rewards putting highly relevant results at the top
- Range: 0 to 1 (higher is better)
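The per-query versions of these metrics are a few lines of Python each. A minimal sketch with hypothetical chunk IDs (MRR is the mean of reciprocal_rank over your whole test set):

```python
def recall(retrieved: list[str], relevant: set[str]) -> float:
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved: list[str], relevant: set[str]) -> float:
    return len(set(retrieved) & relevant) / len(retrieved)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c3", "c8", "c1"]           # what search returned, best first
relevant = {"c1", "c4"}                  # ground-truth chunks for the question
print(recall(retrieved, relevant))       # 0.5
print(precision(retrieved, relevant))    # 0.333...
print(reciprocal_rank(retrieved, relevant))  # first hit at rank 3 -> 0.333...
```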
How to evaluate
- Build a test set: 50-100 questions with known correct chunks
- Run retrieval and measure metrics
- Iterate: Tune chunking, retrieval, re-ranking
- Re-test
Tools: Ragas, LlamaIndex evaluators, custom scripts
Debugging retrieval failures
Problem: Missing chunks (low recall)
Symptoms: LLM says "I don't have that information" (but the docs contain it)
Fixes:
- Check chunking (did you split the answer across chunks?)
- Try different embedding models (some are better for your domain)
- Increase number of retrieved chunks
- Add keyword search (hybrid) to catch exact terms
Problem: Irrelevant results (low precision)
Symptoms: LLM gets confused by noisy chunks, gives vague answers
Fixes:
- Use metadata to pre-filter
- Add re-ranking
- Improve chunk quality (remove boilerplate, headers, footers)
- Tune similarity threshold (only return chunks above a certain score)
Problem: Semantic drift
Symptoms: Search returns chunks that are topically related but don't answer the question
Fixes:
- Use cross-encoder re-ranking
- Add keyword search to anchor results
- Fine-tune embeddings on your domain (advanced)
Tools and libraries
LangChain: Full RAG framework with chunking, retrieval, re-ranking modules. Great for rapid prototyping.
LlamaIndex: Specializes in indexing and retrieval strategies (hierarchical, multi-vector). Excellent for complex data.
Haystack: Production-ready, supports hybrid search, re-ranking, and custom pipelines.
Weaviate, Qdrant, Pinecone: Vector databases with built-in hybrid search and filtering.
Cohere Rerank, Jina Reranker: API-based re-ranking services.
Practical example: Building a support bot
Goal: Answer customer questions using your help docs.
Setup:
- Chunk help articles by H2 headings (semantic boundaries)
- Add metadata: article category, last updated date
- Index in a vector DB with hybrid search enabled
Retrieval flow:
- Pre-filter by category (if mentioned in the question)
- Hybrid search (60% vector, 40% keyword)
- Retrieve the top 20 chunks
- Re-rank with a cross-encoder to the top 5
- Send the top 5 to the LLM with the question
Result: 85% answer accuracy vs. 60% with basic RAG
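Put together, the flow reads roughly like the sketch below. Every helper in it (prefilter, hybrid_search, cross_encoder_rerank, ask_llm) is a hypothetical stand-in for the pieces sketched earlier in this guide or for your own vector DB and model clients:

```python
def answer(question: str, chunks, category=None) -> str:
    # Step 1: metadata pre-filter, if the question implies a category.
    candidates = prefilter(chunks, category=category) if category else chunks
    # Step 2: hybrid search over the candidates, keeping a broad top 20.
    top20 = hybrid_search(question, candidates, vector_weight=0.6,
                          keyword_weight=0.4, top_k=20)
    # Step 3: cross-encoder re-ranking down to the 5 chunks the LLM will see.
    top5 = cross_encoder_rerank(question, top20, top_k=5)
    context = "\n\n".join(c["text"] for c in top5)
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```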
Use responsibly
- Monitor retrieval quality (not just LLM output)
- Audit metadata for bias (don't over-filter)
- Test edge cases (rare queries, ambiguous questions)
- Version your chunks (so you can roll back if chunking changes break things)
What's next?
- Evaluating AI Answers: Measure hallucination rates and answer quality
- Vector DBs 101 (coming soon): Deep dive into indexing and scaling vector databases
- Embeddings & RAG Explained: Start here if you're new to RAG
- Prompting 101: Craft better prompts to guide LLMs using retrieved chunks
Key Terms Used in This Guide
RAG (Retrieval-Augmented Generation)
A technique where AI searches your documents for relevant info, then uses it to generate accurate, grounded answers.