Context Management: Handling Long Conversations and Documents
Master context window management for AI. Learn strategies for long conversations, document processing, memory systems, and context optimization.
TL;DR
AI models have context windows—a limit on how much text they can process at once. When conversations get long or documents get large, you hit this wall. Context management is the art of working within these limits. Rolling windows keep recent conversation history. Summarization condenses old context. Chunking breaks documents into processable pieces. RAG retrieves only relevant information instead of loading everything. Memory systems store and recall important facts across sessions. Master these techniques to build AI applications that handle unlimited conversations and massive documents without hitting limits or ballooning costs.
Why context management matters
You've built a chatbot. It works great for the first 5 messages. Then users hit a wall: "Why did you forget what I said 10 minutes ago?" or "This document is too large to analyze."
The problem is the context window—the amount of text a model can process in a single request. It's measured in tokens (roughly 3/4 of a word). Models have hard limits:
- GPT-3.5: 4K-16K tokens (~3K-12K words)
- GPT-4: 8K-128K tokens (~6K-96K words)
- Claude 3.5 Sonnet: 200K tokens (~150K words)
- Gemini 1.5 Pro: 2M tokens (~1.5M words)
What happens when you exceed the limit?
- The API rejects your request
- The model forgets the beginning of the conversation
- Costs skyrocket (long context = expensive)
- Response quality degrades (too much noise)
Context management solves these problems. It's essential for production AI systems.
Understanding context windows
Think of context as the model's "working memory." Everything you send—system instructions, conversation history, retrieved documents, your current question—must fit in this window.
Context structure
A typical LLM request contains:
Total context (must fit in window):
├─ System prompt (100-500 tokens)
├─ Conversation history (500-10K tokens)
├─ Retrieved documents/RAG chunks (1K-50K tokens)
├─ User's current message (50-500 tokens)
└─ Reserved for response (500-4K tokens)
Example (Claude 3.5 Sonnet, 200K limit):
- System prompt: 200 tokens
- Last 20 messages: 5,000 tokens
- RAG context: 3,000 tokens
- User question: 100 tokens
- Response space: 4,000 tokens
- Total used: 12,300 tokens (6% of limit)
Plenty of room! But what if the conversation reaches 100 messages? Or you need to include a 50-page document? That's when context management becomes critical.
The forgetting problem
When context exceeds the limit, the overflow is typically handled in one of two ways:
1. Truncation (most chat frameworks): Drop the oldest messages until the request fits. The model never sees the earlier parts of the conversation.
2. Sliding window: Keep only the most recent N messages, discard the rest. Same result: earlier information is lost.
Neither is ideal. You need smarter strategies.
Strategy 1: Conversation memory management
For long conversations, you can't keep everything. Choose what to remember.
Rolling window with summarization
Keep recent messages in full detail, summarize older ones.
Example flow:
- Messages 1-10: Keep full history (under 2K tokens)
- Message 11 arrives: Summarize messages 1-5 into 200 tokens
- New context: Summary (200 tokens) + Messages 6-11 (full detail)
- Repeat as conversation grows
Implementation:
def manage_conversation_context(messages, recent_threshold=10):
    """Keep the last `recent_threshold` messages in full, summarize the rest."""
    if len(messages) <= recent_threshold:
        return messages

    # Split into old (to summarize) and recent (keep in full)
    old_messages = messages[:-recent_threshold]
    recent_messages = messages[-recent_threshold:]

    # Summarize the old messages into a short recap (~200 tokens)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary = call_llm(f"Summarize this conversation concisely:\n{transcript}")

    # Build the new, shorter context
    return [
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages,
    ]
When to use: Customer support bots, long-running assistance sessions, any chat that exceeds 50 messages.
Semantic memory selection
Instead of keeping the last N messages, keep the most relevant ones.
How it works:
- Embed all messages: Convert each message to a vector
- User asks new question: Embed the question
- Retrieve relevant past messages: Use vector similarity to find the 10 most relevant historical messages
- Build context: System prompt + relevant past messages + recent 5 messages + current question
When to use: Technical support (need to recall specific issues mentioned earlier), educational tutors (reference past lessons), complex problem-solving.
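A minimal sketch of the steps above, assuming an embed() helper that returns a vector (any embedding API works) and numpy for similarity; the function names and limits are illustrative:
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_semantic_context(system_prompt, history, question, embed, top_k=10, recent_k=5):
    """Keep the latest turns plus the past messages most similar to the question."""
    recent = history[-recent_k:]
    older = history[:-recent_k]

    q_vec = embed(question)
    relevant = sorted(
        older,
        key=lambda m: cosine(embed(m["content"]), q_vec),
        reverse=True,
    )[:top_k]  # in practice, embed each message once and cache the vectors

    return [
        {"role": "system", "content": system_prompt},
        *relevant,   # most relevant older messages
        *recent,     # unbroken recent context
        {"role": "user", "content": question},
    ]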
Memory systems (advanced)
For truly long-term memory, use an external store.
Architecture:
User message
↓
Extract facts/entities → Store in memory DB
↓
Retrieve relevant memories → Add to context
↓
Send to LLM
What to store:
- User preferences: "I prefer Python over JavaScript"
- Important facts: "Project deadline is March 15"
- Entities: "John is the project manager"
- Decisions: "We agreed to use PostgreSQL"
Tools: LangChain Memory modules, Mem0, Zep (conversation memory stores).
Implementation example:
# Store important facts
memory_db.store({
    "user_id": "user123",
    "fact": "Prefers explanations with code examples",
    "timestamp": "2024-01-15"
})

# Retrieve before each request
relevant_memories = memory_db.retrieve(
    user_id="user123",
    query="How should I explain this concept?",
    limit=5
)

# Include in context
context = f"User preferences: {relevant_memories}\n\nCurrent question: ..."
When to use: Personal assistants, ongoing project collaboration, enterprise tools with multi-session continuity.
Strategy 2: Document processing
Even when a 200-page document technically fits in a 200K-token window, loading all of it is expensive, slow, and noisy. Smart chunking and retrieval are essential.
Revisiting chunking for context limits
From our RAG guides, you know chunking splits documents into pieces. For context management, chunking serves two purposes:
1. Fit documents into retrieval systems (covered in RAG guides)
2. Ensure retrieved chunks fit in context windows
Context-aware chunking guidelines:
- Small models (8K context): Chunks of 500-800 tokens, retrieve 3-5 chunks (total: ~3K tokens)
- Medium models (32K context): Chunks of 1000-1500 tokens, retrieve 8-10 chunks (total: ~12K tokens)
- Large models (200K+ context): Chunks of 2000-3000 tokens, retrieve 20-30 chunks (total: ~60K tokens)
Always leave 20-30% of context for conversation history, system prompts, and response generation.
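A quick way to sanity-check those numbers for your own setup is to compute the budget directly. This sketch assumes a fixed token overhead for the system prompt, history, and question; the figures are a ceiling, not a target:
def plan_retrieval_budget(context_window, chunk_tokens,
                          reserve_fraction=0.3, overhead_tokens=2000):
    """How many chunks of a given size fit, after reserving room for the
    response (reserve_fraction) and for prompt/history overhead."""
    usable = int(context_window * (1 - reserve_fraction)) - overhead_tokens
    return max(usable // chunk_tokens, 0)

print(plan_retrieval_budget(8000, 600))  # 8K model, 600-token chunks -> at most ~6 chunks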
Progressive loading
Don't load everything at once. Load what's needed when needed.
Example: Legal document analysis
Naive approach:
- User uploads 100-page contract
- Chunk entire document, embed, store
- User asks: "What's the termination clause?"
- Retrieve 10 chunks about termination
- Generate answer
Progressive approach:
- User uploads contract
- Generate table of contents / section summaries (1-2K tokens total)
- User asks: "What's the termination clause?"
- Identify relevant section from TOC
- Load only that section (2-3 pages, ~1K tokens)
- Generate answer
Benefits: Faster, cheaper, more accurate (less noise).
Implementation:
def analyze_large_document(document, question):
    # Step 1: Split by headings and summarize each section
    sections = {s.title: s for s in split_into_sections(document)}
    summaries = {title: summarize(s.content) for title, s in sections.items()}

    # Step 2: Identify relevant sections from the summaries alone
    relevant_titles = find_relevant_sections(question, summaries)

    # Step 3: Load only the relevant content
    context = "\n\n".join(sections[t].content for t in relevant_titles)

    # Step 4: Answer with this limited context
    return call_llm(f"Context: {context}\n\nQuestion: {question}")
When to use: Multi-document analysis, contracts, research papers, technical manuals.
Hierarchical analysis
For massive documents, analyze in layers.
Example: Analyzing 500 pages of financial reports
Layer 1 - Document summaries:
- Summarize each document (500 pages → 50 summaries × 100 tokens = 5K tokens)
- User asks: "What were the main risks mentioned?"
- Identify which documents mention risks
Layer 2 - Section summaries:
- For relevant documents, load section summaries
- Pinpoint exact sections discussing risks
Layer 3 - Full content:
- Load only those specific sections (maybe 10 pages total)
- Generate detailed answer
Benefits: Handle unlimited content while staying under context limits. Zoom in progressively instead of loading everything upfront.
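A sketch of the layered drill-down, assuming document objects with title, text, and sections attributes plus the same summarize / find_relevant / call_llm style helpers used elsewhere in this guide (all illustrative):
def hierarchical_answer(documents, question):
    """Zoom in layer by layer instead of loading everything."""
    # Layer 1: one short summary per document
    doc_summaries = {doc.title: summarize(doc.text) for doc in documents}
    relevant_docs = find_relevant(question, doc_summaries)  # e.g., top 3 titles

    # Layer 2: section summaries for the shortlisted documents only
    section_summaries = {
        (doc.title, sec.title): summarize(sec.text)
        for doc in documents if doc.title in relevant_docs
        for sec in doc.sections
    }
    relevant_sections = find_relevant(question, section_summaries)

    # Layer 3: full text of only the pinpointed sections
    context = "\n\n".join(
        sec.text
        for doc in documents if doc.title in relevant_docs
        for sec in doc.sections
        if (doc.title, sec.title) in relevant_sections
    )
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")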
Strategy 3: Context compression
Sometimes you need to fit more information in less space. Compression techniques help.
Prompt compression
Remove unnecessary words from prompts without losing meaning.
Before (100 tokens):
Please analyze the following customer feedback very carefully and
provide me with a detailed summary of the main points that were
mentioned, including both positive and negative aspects, and any
suggestions for improvement that the customer may have provided.
After (30 tokens):
Summarize this customer feedback. Include:
- Positive points
- Negative points
- Suggestions
Tools: LLMLingua, LongLLMLingua (research projects for automatic prompt compression).
Embedding-based compression
Replace verbose text with vector embeddings, then reconstruct context.
Advanced technique (emerging):
- Convert long context to embeddings
- Store embeddings in model's "soft prompts"
- Model accesses compressed information without full text
Status: Experimental, not yet production-ready for most use cases.
Response streaming with context updates
For long responses, stream output and dynamically adjust context as you go.
Example:
- User asks to analyze 50 documents
- Start streaming response for document 1
- As model generates, load next document chunks
- Continue streaming, seamlessly including new context
When to use: Multi-document summarization, iterative analysis.
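Most APIs don't let you swap context mid-generation, so one practical approximation is a sequential pass per document that streams each partial answer while carrying a running summary forward. This sketch assumes a stream_llm() helper that yields text chunks and a summarize() text-to-text helper:
def stream_multi_document_analysis(documents, question, stream_llm, summarize):
    """Stream an answer per document, feeding earlier findings into later passes."""
    running_summary = ""
    for doc in documents:
        prompt = (
            f"Findings so far: {running_summary}\n\n"
            f"Document: {doc}\n\nQuestion: {question}"
        )
        answer_parts = []
        for chunk in stream_llm(prompt):  # yields text chunks as they arrive
            answer_parts.append(chunk)
            yield chunk                   # forward to the user immediately
        running_summary = summarize(running_summary + "".join(answer_parts))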
Strategy 4: RAG for conversations
Combine RAG (Retrieval-Augmented Generation) with conversation management.
Conversation + knowledge base
Architecture:
User question
↓
├─ Retrieve from knowledge base (RAG) → 5 chunks
├─ Retrieve from conversation history → 3 relevant past messages
↓
Combine into context → Send to LLM
Example: Technical support bot
- Knowledge base: Product documentation, FAQs, troubleshooting guides
- Conversation history: User's specific setup, previous issues
- Current question: "How do I fix the error I got earlier?"
Context construction:
# Retrieve from docs
doc_chunks = vector_db.search(question, filter="docs", limit=5)

# Retrieve from conversation
relevant_messages = conversation_memory.search(
    question,
    user_id=user_id,
    limit=3,
)

# Build context
context = f"""
Product documentation:
{doc_chunks}

Previous conversation context:
{relevant_messages}

Current question: {question}
"""
When to use: Support bots, internal tools, educational assistants.
Dual-index RAG
Separate indexes for different types of context.
Two vector databases:
- Long-term knowledge: Company docs, policies, reference material (rarely changes)
- Session context: Current conversation, temporary notes (changes constantly)
Benefits:
- Update session context frequently without re-indexing everything
- Different retrieval strategies for each (more from knowledge base, less from conversation)
- Easier to manage permissions (shared knowledge vs. user-specific context)
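A sketch of the split, assuming a generic vector_db client with named collections and search/upsert methods (not any specific product's API):
KNOWLEDGE = vector_db.collection("company_docs")        # rarely re-indexed

def session_collection(user_id):
    return vector_db.collection(f"session_{user_id}")   # updated every turn

def retrieve_context(question, user_id):
    doc_hits = KNOWLEDGE.search(question, limit=5)                     # more from knowledge
    chat_hits = session_collection(user_id).search(question, limit=2)  # less from conversation
    return doc_hits, chat_hits

def record_turn(user_id, message):
    # Only the session index takes writes on every message
    session_collection(user_id).upsert(text=message, metadata={"user_id": user_id})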
Strategy 5: Multi-turn optimization
Optimize how you structure multi-turn conversations to minimize context usage.
Stateless vs. stateful turns
Stateless (every turn is independent):
# Each request includes full history
call_llm(messages=[msg1, msg2, msg3, msg4, msg5]) # All 5 messages
call_llm(messages=[msg1, msg2, msg3, msg4, msg5, msg6]) # All 6 messages
Pros: Simple, works with any API
Cons: Redundant, expensive, hits limits quickly
Stateful (maintain conversation on server):
# Some providers support conversation IDs
session = create_conversation_session()
session.send(msg1) # Only sends msg1
session.send(msg2) # Only sends msg2, server remembers msg1
Pros: Efficient, less redundant data transfer
Cons: Provider-specific, less portable
When to use each:
- Stateless: Prototyping, any LLM provider, full control over context
- Stateful: Production systems with high volume, on providers that support it (e.g., OpenAI Assistants API threads, Amazon Bedrock Agents sessions)
Turn compression
Reduce context by compressing earlier turns without losing meaning.
Turn 1:
User: "I need help writing a Python function to parse CSV files"
Assistant: [300 tokens of detailed explanation with code examples]
Turn 2 (naive - keep full history):
Context: [Full Turn 1: 300 tokens]
User: "Can you add error handling?"
Turn 2 (optimized - compress Turn 1):
Context: "User requested Python CSV parser. Provided basic implementation."
User: "Can you add error handling?"
Savings: 300 tokens → 15 tokens. Scales dramatically over long conversations.
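A minimal sketch of this, assuming a summarize() helper that returns a one-line recap:
def compress_old_turns(messages, summarize, keep_recent=4):
    """Replace turns older than the last `keep_recent` with one-line summaries."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    compressed = [
        {"role": m["role"], "content": summarize(m["content"])}
        if len(m["content"]) > 200 else m  # short turns are cheap; keep them verbatim
        for m in old
    ]
    return compressed + recent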
Caching repeated context
Some providers offer prompt caching—reuse parts of context across requests without reprocessing.
Example (Anthropic prompt caching):
# Request 1: Process a 50K-token document
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # 50K tokens
        "cache_control": {"type": "ephemeral"}  # Cache this block
    }],
    messages=[{"role": "user", "content": "Summarize page 1"}]
)

# Request 2: Same document, different question
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Same 50K tokens
        "cache_control": {"type": "ephemeral"}  # Reuses the cache
    }],
    messages=[{"role": "user", "content": "Summarize page 2"}]
)
Benefits:
- 90% cost reduction for cached content
- Faster response times (no re-processing)
- Essential for analyzing long documents with multiple questions
When to use: Multiple questions about the same document, fixed system prompts with many users, repeated RAG contexts.
Cost vs. context trade-offs
Longer context = higher costs. Choose wisely.
Cost comparison
GPT-4 Turbo (128K context):
- Input: $0.01 per 1K tokens
- 50K token request = $0.50 per call
- 100 calls/day = $50/day = $1,500/month
GPT-3.5 Turbo (16K context):
- Input: $0.0005 per 1K tokens
- 10K token request = $0.005 per call
- 100 calls/day = $0.50/day = $15/month
Claude 3 Haiku (200K context):
- Input: $0.00025 per 1K tokens
- 50K token request = $0.0125 per call
- 100 calls/day = $1.25/day = $37.50/month
Lesson: Long context capabilities don't mean you should use them carelessly. Optimize context size to balance quality and cost.
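The arithmetic behind these comparisons is worth scripting for your own workload (the prices are the illustrative ones above; check current pricing):
def monthly_input_cost(tokens_per_call, price_per_1k, calls_per_day, days=30):
    """Back-of-the-envelope input cost per month."""
    return tokens_per_call / 1000 * price_per_1k * calls_per_day * days

print(monthly_input_cost(50_000, 0.01, 100))    # GPT-4 Turbo row: 1500.0
print(monthly_input_cost(10_000, 0.0005, 100))  # GPT-3.5 Turbo row: 15.0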
When to use long context models
Good use cases:
- Analyzing full documents where breaking into chunks loses coherence (contracts, narratives)
- Code repositories (need to see multiple files at once)
- Complex reasoning requiring extensive background
Bad use cases:
- Simple Q&A that could use RAG with 3-5 chunks
- Conversations where summarization would suffice
- Any task where smaller models + optimization work fine
Rule of thumb: Try RAG + smaller context first. Only reach for long context models when retrieval fails to capture necessary information.
Practical patterns
Real-world implementations combining these techniques.
Pattern 1: Customer support bot
def handle_support_query(user_id, question):
    # 1. Retrieve relevant docs (RAG)
    doc_context = search_docs(question, limit=5)  # ~2K tokens

    # 2. Load recent conversation
    recent_messages = get_recent_messages(user_id, limit=10)  # ~1K tokens

    # 3. Check for important past facts
    user_context = memory_db.get_user_context(user_id)  # ~200 tokens

    # 4. Build context (total: ~3.5K tokens)
    context = {
        "system": f"User context: {user_context}",
        "history": recent_messages,
        "docs": doc_context,
        "question": question
    }
    return call_llm(context)
Context budget: 3.5K / 8K available = 44% (plenty of room for response)
Pattern 2: Document analysis assistant
def analyze_document(document, questions):
    # 1. Generate section index
    sections = extract_sections(document)
    index = {s.title: summarize(s.content) for s in sections}  # ~2K tokens

    responses = []
    for question in questions:
        # 2. Identify relevant sections
        relevant = find_relevant_sections(question, index)  # e.g., 3 sections

        # 3. Load only those sections
        context = load_sections(relevant)  # ~5K tokens

        # 4. Answer with limited context
        response = call_llm(f"Context: {context}\n\nQ: {question}")
        responses.append(response)
    return responses
Handles unlimited document size by loading selectively
Pattern 3: Long-running project assistant
def project_assistant(user_id, message):
    # 1. Retrieve project context (RAG)
    project_docs = search_project_docs(message, limit=5)  # ~3K tokens

    # 2. Get relevant conversation snippets (semantic search)
    relevant_history = conversation_db.search(
        user_id=user_id,
        query=message,
        limit=5
    )  # ~1K tokens

    # 3. Get key facts from memory store
    project_memory = memory_db.get(
        user_id=user_id,
        type=["decisions", "deadlines", "preferences"]
    )  # ~500 tokens

    # 4. Recent messages (context continuity)
    recent = get_recent_messages(user_id, limit=5)  # ~800 tokens

    # Total: ~5.5K tokens in an 8K window (~68% usage)
    context = build_context(project_docs, relevant_history, project_memory, recent, message)
    return call_llm(context)
Combines multiple strategies for comprehensive context
Troubleshooting common issues
Problem: "Context length exceeded" errors
Symptoms: API rejects requests, error message about token limits
Fixes:
- Measure your context: Log token counts for each component (system, history, docs, etc.)
- Identify the culprit: What's consuming most tokens?
- Apply appropriate strategy:
- Long history → Summarization or rolling window
- Large documents → Better chunking or progressive loading
- Too many RAG chunks → Reduce retrieval limit or re-rank
Quick fix:
def ensure_context_fits(messages, max_tokens=8000):
    current_tokens = count_tokens(messages)
    while current_tokens > max_tokens and len(messages) > 2:
        # Remove the oldest message after the system prompt
        messages.pop(1)
        current_tokens = count_tokens(messages)
    return messages
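The count_tokens helper above isn't a built-in; a rough version with tiktoken (GPT-family tokenizer; other providers ship their own) could be:
import tiktoken

def count_tokens(messages, model="gpt-4"):
    """Approximate token count for a chat history (ignores per-message overhead)."""
    enc = tiktoken.encoding_for_model(model)
    return sum(len(enc.encode(m["content"])) for m in messages)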
Problem: Model "forgets" important information
Symptoms: User says "As I mentioned earlier..." and model doesn't recall
Fixes:
- Implement semantic memory retrieval (not just recent messages)
- Use memory store for critical facts
- Add explicit recap: "Based on your earlier mention of X..."
Problem: Responses are slow or expensive
Symptoms: High latency, unexpected API costs
Fixes:
- Reduce context size (less input = faster + cheaper)
- Use smaller models for simple questions
- Implement prompt caching for repeated context
- Batch similar questions together
Problem: Answers lack important context
Symptoms: Vague responses, missing details that are "somewhere" in the history
Fixes:
- Don't over-summarize—keep important details
- Use parent-child chunking (small chunks for search, large for context)
- Increase RAG retrieval limit (more chunks = more context)
Tools and libraries
Context management:
- LangChain Memory: Built-in conversation memory, buffers, summarization
- Mem0: Dedicated memory layer for AI applications
- Zep: Long-term conversation memory with auto-summarization
Token counting:
- tiktoken: OpenAI's tokenizer (for GPT models)
- transformers: Hugging Face tokenizers (any model)
Prompt caching:
- Anthropic Claude: Native prompt caching support
- Helicone: Caching proxy for multiple providers
Monitoring:
- Langfuse: Track context size, costs, latency
- LangSmith: Debug conversation flows, visualize context usage
Use responsibly
- Measure before optimizing: Log context usage to understand actual patterns
- Don't over-compress: Some detail is necessary for quality
- Test edge cases: Very long conversations, massive documents, rapid-fire questions
- Monitor costs: Context management directly impacts spend
- Privacy matters: Memory stores contain sensitive user data—secure them properly
What's next?
- Embeddings & RAG Explained: Foundation for retrieval-based context strategies
- Vector DBs 101: Storage systems for conversation and document memory
- Retrieval 201: Advanced chunking techniques for context optimization
- Cost & Latency Optimization: Balance context usage with performance and spend