TL;DR

AI models have context windows—a limit on how much text they can process at once. When conversations get long or documents get large, you hit this wall. Context management is the art of working within these limits. Rolling windows keep recent conversation history. Summarization condenses old context. Chunking breaks documents into processable pieces. RAG retrieves only relevant information instead of loading everything. Memory systems store and recall important facts across sessions. Master these techniques to build AI applications that handle unlimited conversations and massive documents without hitting limits or ballooning costs.

Why context management matters

You've built a chatbot. It works great for the first 5 messages. Then users hit a wall: "Why did you forget what I said 10 minutes ago?" or "This document is too large to analyze."

The problem is the context window—the amount of text a model can process in a single request. It's measured in tokens (roughly 3/4 of a word). Models have hard limits:

  • GPT-3.5: 4K-16K tokens (~3K-12K words)
  • GPT-4: 8K-128K tokens (~6K-96K words)
  • Claude 3.5 Sonnet: 200K tokens (~150K words)
  • Gemini 1.5 Pro: 2M tokens (~1.5M words)
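
If you want to check these numbers yourself, you can count tokens locally before sending a request. A minimal sketch using OpenAI's tiktoken tokenizer (other providers tokenize slightly differently, so treat the count as an estimate):

import tiktoken

# cl100k_base is the encoding used by GPT-3.5 and GPT-4
encoding = tiktoken.get_encoding("cl100k_base")

text = "Context management is the art of working within context limits."
tokens = encoding.encode(text)
print(f"{len(tokens)} tokens for {len(text.split())} words")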

What happens when you exceed the limit?

  • The API rejects your request
  • The model forgets the beginning of the conversation
  • Costs skyrocket (long context = expensive)
  • Response quality degrades (too much noise)

Context management solves these problems. It's essential for production AI systems.

Understanding context windows

Think of context as the model's "working memory." Everything you send—system instructions, conversation history, retrieved documents, your current question—must fit in this window.

Context structure

A typical LLM request contains:

Total context (must fit in window):
├─ System prompt (100-500 tokens)
├─ Conversation history (500-10K tokens)
├─ Retrieved documents/RAG chunks (1K-50K tokens)
├─ User's current message (50-500 tokens)
└─ Reserved for response (500-4K tokens)

Example (Claude 3.5 Sonnet, 200K limit):

  • System prompt: 200 tokens
  • Last 20 messages: 5,000 tokens
  • RAG context: 3,000 tokens
  • User question: 100 tokens
  • Response space: 4,000 tokens
  • Total used: 12,300 tokens (6% of limit)

Plenty of room! But what if the conversation reaches 100 messages? Or you need to include a 50-page document? That's when context management becomes critical.

The forgetting problem

When context exceeds the limit, models handle it in two ways:

1. Truncation (common in chat frameworks and managed assistant APIs): Drop the oldest messages until the request fits the limit. The model never sees the earlier parts of the conversation.

2. Sliding window: Keep the most recent context, discard the rest. Same result—earlier information is lost.

Neither is ideal. You need smarter strategies.

Strategy 1: Conversation memory management

For long conversations, you can't keep everything. Choose what to remember.

Rolling window with summarization

Keep recent messages in full detail, summarize older ones.

Example flow:

  1. Messages 1-10: Keep full history (under 2K tokens)
  2. Message 11 arrives: Summarize messages 1-5 into 200 tokens
  3. New context: Summary (200 tokens) + Messages 6-11 (full detail)
  4. Repeat as conversation grows

Implementation:

def manage_conversation_context(messages, recent_threshold=10):
    """Keep the most recent messages in full; summarize everything older."""
    if len(messages) <= recent_threshold:
        return messages

    # Split into old (to summarize) and recent (keep full)
    old_messages = messages[:-recent_threshold]
    recent_messages = messages[-recent_threshold:]

    # Summarize old messages (format them as readable text first)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary = call_llm(f"Summarize this conversation concisely:\n{transcript}")  # Returns ~200 tokens

    # Build new context: a summary note plus the recent messages in full
    return [
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]

When to use: Customer support bots, long-running assistance sessions, any chat that exceeds 50 messages.

Semantic memory selection

Instead of keeping the last N messages, keep the most relevant ones.

How it works:

  1. Embed all messages: Convert each message to a vector
  2. User asks new question: Embed the question
  3. Retrieve relevant past messages: Use vector similarity to find the 10 most relevant historical messages
  4. Build context: System prompt + relevant past messages + recent 5 messages + current question

When to use: Technical support (need to recall specific issues mentioned earlier), educational tutors (reference past lessons), complex problem-solving.
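
A minimal sketch of the selection step, assuming an in-memory message list and a sentence-transformers embedding model (the model name and helper are illustrative, not a required stack):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def select_relevant_messages(history, question, top_k=10, keep_recent=5):
    """Pick the past messages most similar to the current question."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    if not older:
        return history

    # Embed the older messages and the question
    texts = [m["content"] for m in older]
    msg_vecs = model.encode(texts, normalize_embeddings=True)
    q_vec = model.encode([question], normalize_embeddings=True)[0]

    # Cosine similarity (vectors are normalized, so a dot product suffices)
    scores = msg_vecs @ q_vec
    top_idx = np.argsort(scores)[::-1][:top_k]

    relevant = [older[i] for i in sorted(top_idx)]  # keep chronological order
    return relevant + recent

In production you would normally keep the message embeddings in a vector database so you are not re-encoding the whole history on every turn.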

Memory systems (advanced)

For truly long-term memory, use an external store.

Architecture:

User message
    ↓
Extract facts/entities → Store in memory DB
    ↓
Retrieve relevant memories → Add to context
    ↓
Send to LLM

What to store:

  • User preferences: "I prefer Python over JavaScript"
  • Important facts: "Project deadline is March 15"
  • Entities: "John is the project manager"
  • Decisions: "We agreed to use PostgreSQL"

Tools: LangChain Memory modules, Mem0, Zep (conversation memory stores).

Implementation example:

# Store important facts
memory_db.store({
    "user_id": "user123",
    "fact": "Prefers explanations with code examples",
    "timestamp": "2024-01-15"
})

# Retrieve before each request
relevant_memories = memory_db.retrieve(
    user_id="user123",
    query="How should I explain this concept?",
    limit=5
)

# Include in context
context = f"User preferences: {relevant_memories}\n\nCurrent question: ..."

When to use: Personal assistants, ongoing project collaboration, enterprise tools with multi-session continuity.

Strategy 2: Document processing

Stuffing a 200-page document into context is rarely the right move: even when a 200K window technically fits it, the request is slow, expensive, and noisy. Smart chunking and retrieval are essential.

Revisiting chunking for context limits

From our RAG guides, you know chunking splits documents into pieces. For context management, chunking serves two purposes:

1. Fit documents into retrieval systems (covered in RAG guides)
2. Ensure retrieved chunks fit in context windows

Context-aware chunking guidelines:

  • Small models (8K context): Chunks of 500-800 tokens, retrieve 3-5 chunks (total: ~3K tokens)
  • Medium models (32K context): Chunks of 1000-1500 tokens, retrieve 8-10 chunks (total: ~12K tokens)
  • Large models (200K+ context): Chunks of 2000-3000 tokens, retrieve 20-30 chunks (total: ~60K tokens)

Always leave 20-30% of context for conversation history, system prompts, and response generation.
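
A rough sketch of that budgeting, using a 25% reserve (within the 20-30% guideline above); the result is an upper bound on how many chunks fit, not a recommendation to fill the window:

def retrieval_budget(context_window, chunk_tokens, reserve_ratio=0.25):
    """Max number of chunks of a given size that fit once space is
    reserved for the system prompt, history, and response."""
    usable = int(context_window * (1 - reserve_ratio))
    return max(1, usable // chunk_tokens)

print(retrieval_budget(8_000, 600))      # small model: at most ~10 chunks
print(retrieval_budget(32_000, 1_200))   # medium model: at most ~20 chunks
print(retrieval_budget(200_000, 2_500))  # large model: at most ~60 chunks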

Progressive loading

Don't load everything at once. Load what's needed when needed.

Example: Legal document analysis

Naive approach:

  1. User uploads 100-page contract
  2. Chunk entire document, embed, store
  3. User asks: "What's the termination clause?"
  4. Retrieve 10 chunks about termination
  5. Generate answer

Progressive approach:

  1. User uploads contract
  2. Generate table of contents / section summaries (1-2K tokens total)
  3. User asks: "What's the termination clause?"
  4. Identify relevant section from TOC
  5. Load only that section (2-3 pages, ~1K tokens)
  6. Generate answer

Benefits: Faster, cheaper, more accurate (less noise).

Implementation:

def analyze_large_document(document, question):
    # Step 1: Split by headings and summarize each section
    sections = split_into_sections(document)  # Objects with .title and .content
    sections_by_title = {s.title: s for s in sections}
    summaries = {s.title: summarize(s.content) for s in sections}

    # Step 2: Identify the sections relevant to the question (returns titles)
    relevant_titles = find_relevant_sections(question, summaries)

    # Step 3: Load only the relevant content
    context = "\n\n".join(sections_by_title[t].content for t in relevant_titles)

    # Step 4: Answer with limited context
    return call_llm(f"Context: {context}\n\nQuestion: {question}")

When to use: Multi-document analysis, contracts, research papers, technical manuals.

Hierarchical analysis

For massive documents, analyze in layers.

Example: Analyzing 500 pages of financial reports

Layer 1 - Document summaries:

  • Summarize each document (500 pages → 50 summaries × 100 tokens = 5K tokens)
  • User asks: "What were the main risks mentioned?"
  • Identify which documents mention risks

Layer 2 - Section summaries:

  • For relevant documents, load section summaries
  • Pinpoint exact sections discussing risks

Layer 3 - Full content:

  • Load only those specific sections (maybe 10 pages total)
  • Generate detailed answer

Benefits: Handle unlimited content while staying under context limits. Zoom in progressively instead of loading everything upfront.
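
A sketch of the drill-down, reusing the same placeholder helpers (summarize, find_relevant_sections, call_llm) as the progressive-loading example and assuming each document object exposes .title, .content, and .sections:

def hierarchical_answer(documents, question):
    """Narrow from document summaries to sections to raw text."""
    # Layer 1: which documents are relevant at all?
    doc_summaries = {d.title: summarize(d.content) for d in documents}
    relevant_doc_titles = find_relevant_sections(question, doc_summaries)

    # Layer 2: within those documents, which sections matter?
    docs_by_title = {d.title: d for d in documents}
    relevant_sections = []
    for title in relevant_doc_titles:
        doc = docs_by_title[title]
        section_summaries = {s.title: summarize(s.content) for s in doc.sections}
        sections_by_title = {s.title: s for s in doc.sections}
        for section_title in find_relevant_sections(question, section_summaries):
            relevant_sections.append(sections_by_title[section_title])

    # Layer 3: answer from the raw text of only those sections
    context = "\n\n".join(s.content for s in relevant_sections)
    return call_llm(f"Context: {context}\n\nQuestion: {question}")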

Strategy 3: Context compression

Sometimes you need to fit more information in less space. Compression techniques help.

Prompt compression

Remove unnecessary words from prompts without losing meaning.

Before (100 tokens):

Please analyze the following customer feedback very carefully and
provide me with a detailed summary of the main points that were
mentioned, including both positive and negative aspects, and any
suggestions for improvement that the customer may have provided.

After (30 tokens):

Summarize this customer feedback. Include:
- Positive points
- Negative points
- Suggestions

Tools: LLMLingua, LongLLMLingua (research projects for automatic prompt compression).

Embedding-based compression

Replace verbose text with vector embeddings, then reconstruct context.

Advanced technique (emerging):

  1. Convert long context to embeddings
  2. Store embeddings in model's "soft prompts"
  3. Model accesses compressed information without full text

Status: Experimental, not yet production-ready for most use cases.

Response streaming with context updates

For long, multi-part outputs, run a sequence of calls and stream each one back to the user, updating the context between calls.

Example:

  • User asks to analyze 50 documents
  • Stream the analysis of document 1 from the first call
  • While it streams, load and prepare the chunks for document 2
  • Issue the next call with the updated context and keep streaming, so the user sees one continuous response

When to use: Multi-document summarization, iterative analysis.

Strategy 4: RAG for conversations

Combine RAG (Retrieval-Augmented Generation) with conversation management.

Conversation + knowledge base

Architecture:

User question
    ↓
├─ Retrieve from knowledge base (RAG) → 5 chunks
├─ Retrieve from conversation history → 3 relevant past messages
↓
Combine into context → Send to LLM

Example: Technical support bot

  • Knowledge base: Product documentation, FAQs, troubleshooting guides
  • Conversation history: User's specific setup, previous issues
  • Current question: "How do I fix the error I got earlier?"

Context construction:

# Retrieve from docs
doc_chunks = vector_db.search(question, filter="docs", limit=5)

# Retrieve from conversation
relevant_messages = conversation_memory.search(
    question,
    user_id=user_id,
    limit=3
)

# Build context
context = f"""
Product documentation:
{doc_chunks}

Previous conversation context:
{relevant_messages}

Current question: {question}
"""

When to use: Support bots, internal tools, educational assistants.

Dual-index RAG

Separate indexes for different types of context.

Two vector databases:

  1. Long-term knowledge: Company docs, policies, reference material (rarely changes)
  2. Session context: Current conversation, temporary notes (changes constantly)

Benefits:

  • Update session context frequently without re-indexing everything
  • Different retrieval strategies for each (more from knowledge base, less from conversation)
  • Easier to manage permissions (shared knowledge vs. user-specific context)
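
In code, a dual-index setup is just two lookups merged into one context. A sketch assuming two placeholder vector-store clients, knowledge_index and session_index, whose search calls return plain-text chunks (same style of API as the earlier examples):

def build_dual_context(question, user_id):
    # Long-term knowledge: large, slow-changing index
    knowledge_chunks = knowledge_index.search(question, limit=5)

    # Session context: small per-user index, updated as the conversation evolves
    session_chunks = session_index.search(question, filter={"user_id": user_id}, limit=3)

    return (
        "Reference material:\n" + "\n".join(knowledge_chunks)
        + "\n\nSession notes:\n" + "\n".join(session_chunks)
        + f"\n\nCurrent question: {question}"
    )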

Strategy 5: Multi-turn optimization

Optimize how you structure multi-turn conversations to minimize context usage.

Stateless vs. stateful turns

Stateless (every turn is independent):

# Each request includes full history
call_llm(messages=[msg1, msg2, msg3, msg4, msg5])  # All 5 messages
call_llm(messages=[msg1, msg2, msg3, msg4, msg5, msg6])  # All 6 messages

Pros: Simple, works with any API
Cons: Redundant, expensive, hits limits quickly

Stateful (maintain conversation on server):

# Some providers support conversation IDs
session = create_conversation_session()
session.send(msg1)  # Only sends msg1
session.send(msg2)  # Only sends msg2, server remembers msg1

Pros: Efficient, less redundant data transfer
Cons: Provider-specific, less portable

When to use each:

  • Stateless: Prototyping, any LLM provider, full control over context
  • Stateful: Production systems with high volume, where the provider supports it (for example, OpenAI's Assistants API threads)

Turn compression

Reduce context by compressing earlier turns without losing meaning.

Turn 1:

User: "I need help writing a Python function to parse CSV files"
Assistant: [300 tokens of detailed explanation with code examples]

Turn 2 (naive - keep full history):

Context: [Full Turn 1: 300 tokens]
User: "Can you add error handling?"

Turn 2 (optimized - compress Turn 1):

Context: "User requested Python CSV parser. Provided basic implementation."
User: "Can you add error handling?"

Savings: 300 tokens → 15 tokens. Scales dramatically over long conversations.
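
A sketch of this compression, using the same call_llm placeholder as earlier. It assumes the history alternates user/assistant turns (handle any system prompt separately); in practice you would cache each compressed turn so it is summarized only once:

def compress_turn(user_msg, assistant_msg):
    """Collapse a completed exchange into a one-line note."""
    prompt = (
        "Compress this exchange into one short sentence, keeping only facts "
        "a future turn might need:\n"
        f"User: {user_msg['content']}\nAssistant: {assistant_msg['content']}"
    )
    # One-line note; could also be folded into the system prompt instead
    return {"role": "system", "content": call_llm(prompt)}

def compress_history(messages, keep_recent=4):
    """Replace old user/assistant pairs with compressed notes."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    compressed = [
        compress_turn(old[i], old[i + 1])
        for i in range(0, len(old) - 1, 2)  # assumes alternating user/assistant turns
    ]
    return compressed + recent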

Caching repeated context

Some providers offer prompt caching—reuse parts of context across requests without reprocessing.

Example (Anthropic prompt caching):

import anthropic

client = anthropic.Anthropic()

# Request 1: Process 50K token document
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # 50K tokens
        "cache_control": {"type": "ephemeral"}  # Cache this block
    }],
    messages=[{"role": "user", "content": "Summarize page 1"}]
)

# Request 2: Same document, different question
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Same 50K tokens
        "cache_control": {"type": "ephemeral"}  # Reuses the cache
    }],
    messages=[{"role": "user", "content": "Summarize page 2"}]
)

Benefits:

  • 90% cost reduction for cached content
  • Faster response times (no re-processing)
  • Essential for analyzing long documents with multiple questions

When to use: Multiple questions about the same document, fixed system prompts with many users, repeated RAG contexts.

Cost vs. context trade-offs

Longer context = higher costs. Choose wisely.

Cost comparison

GPT-4 Turbo (128K context):

  • Input: $0.01 per 1K tokens
  • 50K token request = $0.50 per call
  • 100 calls/day = $50/day = $1,500/month

GPT-3.5 Turbo (16K context):

  • Input: $0.0005 per 1K tokens
  • 10K token request = $0.005 per call
  • 100 calls/day = $0.50/day = $15/month

Claude 3 Haiku (200K context):

  • Input: $0.00025 per 1K tokens
  • 50K token request = $0.0125 per call
  • 100 calls/day = $1.25/day = $37.50/month

Lesson: Long context capabilities don't mean you should use them carelessly. Optimize context size to balance quality and cost.

When to use long context models

Good use cases:

  • Analyzing full documents where breaking into chunks loses coherence (contracts, narratives)
  • Code repositories (need to see multiple files at once)
  • Complex reasoning requiring extensive background

Bad use cases:

  • Simple Q&A that could use RAG with 3-5 chunks
  • Conversations where summarization would suffice
  • Any task where smaller models + optimization work fine

Rule of thumb: Try RAG + smaller context first. Only reach for long context models when retrieval fails to capture necessary information.

Practical patterns

Real-world implementations combining these techniques.

Pattern 1: Customer support bot

def handle_support_query(user_id, question):
    # 1. Retrieve relevant docs (RAG)
    doc_context = search_docs(question, limit=5)  # ~2K tokens

    # 2. Load recent conversation
    recent_messages = get_recent_messages(user_id, limit=10)  # ~1K tokens

    # 3. Check for important past facts
    user_context = memory_db.get_user_context(user_id)  # ~200 tokens

    # 4. Build context (total: ~3.5K tokens)
    context = {
        "system": f"User context: {user_context}",
        "history": recent_messages,
        "docs": doc_context,
        "question": question
    }

    return call_llm(context)

Context budget: 3.5K / 8K available = 44% (plenty of room for response)

Pattern 2: Document analysis assistant

def analyze_document(document, questions):
    # 1. Generate section index
    sections = extract_sections(document)
    index = {s.title: summarize(s.content) for s in sections}  # 2K tokens

    responses = []
    for question in questions:
        # 2. Identify relevant sections
        relevant = find_relevant_sections(question, index)  # 3 sections

        # 3. Load only those sections
        context = load_sections(relevant)  # ~5K tokens

        # 4. Answer with limited context
        response = call_llm(f"Context: {context}\n\nQ: {question}")
        responses.append(response)

    return responses

Handles unlimited document size by loading selectively

Pattern 3: Long-running project assistant

def project_assistant(user_id, message):
    # 1. Retrieve project context (RAG)
    project_docs = search_project_docs(message, limit=5)  # 3K tokens

    # 2. Get relevant conversation snippets (semantic search)
    relevant_history = conversation_db.search(
        user_id=user_id,
        query=message,
        limit=5
    )  # 1K tokens

    # 3. Get key facts from memory store
    project_memory = memory_db.get(
        user_id=user_id,
        type=["decisions", "deadlines", "preferences"]
    )  # 500 tokens

    # 4. Recent messages (context continuity)
    recent = get_recent_messages(user_id, limit=5)  # 800 tokens

    # Total: ~5.5K tokens in an 8K window (68% usage)
    context = build_context(project_docs, relevant_history, project_memory, recent, message)
    return call_llm(context)

Combines multiple strategies for comprehensive context

Troubleshooting common issues

Problem: "Context length exceeded" errors

Symptoms: API rejects requests, error message about token limits

Fixes:

  1. Measure your context: Log token counts for each component (system, history, docs, etc.)
  2. Identify the culprit: What's consuming most tokens?
  3. Apply appropriate strategy:
    • Long history → Summarization or rolling window
    • Large documents → Better chunking or progressive loading
    • Too many RAG chunks → Reduce retrieval limit or re-rank

Quick fix:

def ensure_context_fits(messages, max_tokens=8000):
    current_tokens = count_tokens(messages)

    # Drop the oldest messages (after the system prompt) until we fit,
    # but always keep the system prompt and the latest user message
    while current_tokens > max_tokens and len(messages) > 2:
        messages.pop(1)
        current_tokens = count_tokens(messages)

    return messages
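
The count_tokens helper above is left undefined; a workable version for OpenAI-style message lists uses tiktoken (counts are approximate for other providers, and the small per-message overhead varies by model):

import tiktoken

def count_tokens(messages, model="gpt-4"):
    """Approximate token count for a list of chat messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for message in messages:
        total += 4  # rough per-message formatting overhead
        total += len(encoding.encode(message.get("content", "")))
    return total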

Problem: Model "forgets" important information

Symptoms: User says "As I mentioned earlier..." and model doesn't recall

Fixes:

  • Implement semantic memory retrieval (not just recent messages)
  • Use memory store for critical facts
  • Add explicit recap: "Based on your earlier mention of X..."

Problem: Responses are slow or expensive

Symptoms: High latency, unexpected API costs

Fixes:

  • Reduce context size (less input = faster + cheaper)
  • Use smaller models for simple questions
  • Implement prompt caching for repeated context
  • Batch similar questions together

Problem: Answers lack important context

Symptoms: Vague responses, missing details that are "somewhere" in the history

Fixes:

  • Don't over-summarize—keep important details
  • Use parent-child chunking (small chunks for search, large parent passages for context; see the sketch below)
  • Increase RAG retrieval limit (more chunks = more context)
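
The parent-child fix above can be sketched in a few lines, assuming a placeholder chunk_index whose search hits carry a parent_id and a parent_store keyed by that id:

def retrieve_with_parents(question, limit=5):
    # Search over small, precise chunks
    hits = chunk_index.search(question, limit=limit)

    # Hand the model the larger parent passages instead, de-duplicated
    seen, parents = set(), []
    for hit in hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            parents.append(parent_store.get(hit["parent_id"]))
    return parents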

Tools and libraries

Context management:

  • LangChain Memory: Built-in conversation memory, buffers, summarization
  • Mem0: Dedicated memory layer for AI applications
  • Zep: Long-term conversation memory with auto-summarization

Token counting:

  • tiktoken: OpenAI's tokenizer (for GPT models)
  • transformers: Hugging Face tokenizers (any model)

Prompt caching:

  • Anthropic Claude: Native prompt caching support
  • Helicone: Caching proxy for multiple providers

Monitoring:

  • Langfuse: Track context size, costs, latency
  • LangSmith: Debug conversation flows, visualize context usage

Use responsibly

  • Measure before optimizing: Log context usage to understand actual patterns
  • Don't over-compress: Some detail is necessary for quality
  • Test edge cases: Very long conversations, massive documents, rapid-fire questions
  • Monitor costs: Context management directly impacts spend
  • Privacy matters: Memory stores contain sensitive user data—secure them properly

What's next?

  • Embeddings & RAG Explained: Foundation for retrieval-based context strategies
  • Vector DBs 101: Storage systems for conversation and document memory
  • Retrieval 201: Advanced chunking techniques for context optimization
  • Cost & Latency Optimization: Balance context usage with performance and spend