TL;DR

AI models have context windows—a limit on how much text they can process at once. When conversations get long or documents get large, you hit this wall. Context management is the art of working within these limits. Rolling windows keep recent conversation history. Summarization condenses old context. Chunking breaks documents into processable pieces. RAG retrieves only relevant information instead of loading everything. Memory systems store and recall important facts across sessions. Master these techniques to build AI applications that handle unlimited conversations and massive documents without hitting limits or ballooning costs.

Why context management matters

You've built a chatbot. It works great for the first 5 messages. Then users hit a wall: "Why did you forget what I said 10 minutes ago?" or "This document is too large to analyze."

The problem is the context window—the amount of text a model can process in a single request. It's measured in tokens (roughly 3/4 of a word). Models have hard limits:

  • GPT-3.5: 4K-16K tokens (~3K-12K words)
  • GPT-4: 8K-128K tokens (~6K-96K words)
  • Claude 3.5 Sonnet: 200K tokens (~150K words)
  • Gemini 1.5 Pro: 2M tokens (~1.5M words)
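
If you want to check these numbers yourself, you can count tokens locally before sending a request. A minimal sketch using OpenAI's tiktoken tokenizer (other providers tokenize slightly differently, so treat the count as an estimate):

import tiktoken

# cl100k_base is the encoding used by GPT-3.5 and GPT-4
encoding = tiktoken.get_encoding("cl100k_base")

text = "Context management is the art of working within context limits."
tokens = encoding.encode(text)
print(f"{len(tokens)} tokens for {len(text.split())} words")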

What happens when you exceed the limit?

  • The API rejects your request
  • The model forgets the beginning of the conversation
  • Costs skyrocket (long context = expensive)
  • Response quality degrades (too much noise)

Context management solves these problems. It's essential for production AI systems.

Understanding context windows

Think of context as the model's "working memory." Everything you send—system instructions, conversation history, retrieved documents, your current question—must fit in this window.

Context structure

A typical LLM request contains:

Total context (must fit in window):
├─ System prompt (100-500 tokens)
├─ Conversation history (500-10K tokens)
├─ Retrieved documents/RAG chunks (1K-50K tokens)
├─ User's current message (50-500 tokens)
└─ Reserved for response (500-4K tokens)

Example (Claude 3.5 Sonnet, 200K limit):

  • System prompt: 200 tokens
  • Last 20 messages: 5,000 tokens
  • RAG context: 3,000 tokens
  • User question: 100 tokens
  • Response space: 4,000 tokens
  • Total used: 12,300 tokens (6% of limit)

Plenty of room! But what if the conversation reaches 100 messages? Or you need to include a 50-page document? That's when context management becomes critical.

The forgetting problem

When context exceeds the limit, models handle it in two ways:

1. Truncation (common in chat frameworks and managed assistant APIs): Drop the oldest messages until the request fits the limit. The model never sees the earlier parts of the conversation.

2. Sliding window: Keep the most recent context, discard the rest. Same result—earlier information is lost.

Neither is ideal. You need smarter strategies.

Strategy 1: Conversation memory management

For long conversations, you can't keep everything. Choose what to remember.

Rolling window with summarization

Keep recent messages in full detail, summarize older ones.

Example flow:

  1. Messages 1-10: Keep full history (under 2K tokens)
  2. Message 11 arrives: Summarize messages 1-5 into 200 tokens
  3. New context: Summary (200 tokens) + Messages 6-11 (full detail)
  4. Repeat as conversation grows

Implementation:

def manage_conversation_context(messages, recent_threshold=10):
    """Keep the most recent messages in full; summarize everything older."""
    if len(messages) <= recent_threshold:
        return messages

    # Split into old (to summarize) and recent (keep full)
    old_messages = messages[:-recent_threshold]
    recent_messages = messages[-recent_threshold:]

    # Summarize old messages (format them as readable text first)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary = call_llm(f"Summarize this conversation concisely:\n{transcript}")  # Returns ~200 tokens

    # Build new context: a summary note plus the recent messages in full
    return [
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]

When to use: Customer support bots, long-running assistance sessions, any chat that exceeds 50 messages.

Semantic memory selection

Instead of keeping the last N messages, keep the most relevant ones.

How it works:

  1. Embed all messages: Convert each message to a vector
  2. User asks new question: Embed the question
  3. Retrieve relevant past messages: Use vector similarity to find the 10 most relevant historical messages
  4. Build context: System prompt + relevant past messages + recent 5 messages + current question

When to use: Technical support (need to recall specific issues mentioned earlier), educational tutors (reference past lessons), complex problem-solving.
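
A minimal sketch of the selection step, assuming an in-memory message list and a sentence-transformers embedding model (the model name and helper are illustrative, not a required stack):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def select_relevant_messages(history, question, top_k=10, keep_recent=5):
    """Pick the past messages most similar to the current question."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    if not older:
        return history

    # Embed the older messages and the question
    texts = [m["content"] for m in older]
    msg_vecs = model.encode(texts, normalize_embeddings=True)
    q_vec = model.encode([question], normalize_embeddings=True)[0]

    # Cosine similarity (vectors are normalized, so a dot product suffices)
    scores = msg_vecs @ q_vec
    top_idx = np.argsort(scores)[::-1][:top_k]

    relevant = [older[i] for i in sorted(top_idx)]  # keep chronological order
    return relevant + recent

In production you would normally keep the message embeddings in a vector database so you are not re-encoding the whole history on every turn.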

Memory systems (advanced)

For truly long-term memory, use an external store.

Architecture:

User message
    ↓
Extract facts/entities → Store in memory DB
    ↓
Retrieve relevant memories → Add to context
    ↓
Send to LLM

What to store:

  • User preferences: "I prefer Python over JavaScript"
  • Important facts: "Project deadline is March 15"
  • Entities: "John is the project manager"
  • Decisions: "We agreed to use PostgreSQL"

Tools: LangChain Memory modules, Mem0, Zep (conversation memory stores).

Implementation example:

# Store important facts
memory_db.store({
    "user_id": "user123",
    "fact": "Prefers explanations with code examples",
    "timestamp": "2024-01-15"
})

# Retrieve before each request
relevant_memories = memory_db.retrieve(
    user_id="user123",
    query="How should I explain this concept?",
    limit=5
)

# Include in context
context = f"User preferences: {relevant_memories}\n\nCurrent question: ..."

When to use: Personal assistants, ongoing project collaboration, enterprise tools with multi-session continuity.

Strategy 2: Document processing

Stuffing a 200-page document into context is rarely the right move: even when a 200K window technically fits it, the request is slow, expensive, and noisy. Smart chunking and retrieval are essential.

Revisiting chunking for context limits

From our RAG guides, you know chunking splits documents into pieces. For context management, chunking serves two purposes:

1. Fit documents into retrieval systems (covered in RAG guides)
2. Ensure retrieved chunks fit in context windows

Context-aware chunking guidelines:

  • Small models (8K context): Chunks of 500-800 tokens, retrieve 3-5 chunks (total: ~3K tokens)
  • Medium models (32K context): Chunks of 1000-1500 tokens, retrieve 8-10 chunks (total: ~12K tokens)
  • Large models (200K+ context): Chunks of 2000-3000 tokens, retrieve 20-30 chunks (total: ~60K tokens)

Always leave 20-30% of context for conversation history, system prompts, and response generation.
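
A rough sketch of that budgeting, using a 25% reserve (within the 20-30% guideline above); the result is an upper bound on how many chunks fit, not a recommendation to fill the window:

def retrieval_budget(context_window, chunk_tokens, reserve_ratio=0.25):
    """Max number of chunks of a given size that fit once space is
    reserved for the system prompt, history, and response."""
    usable = int(context_window * (1 - reserve_ratio))
    return max(1, usable // chunk_tokens)

print(retrieval_budget(8_000, 600))      # small model: at most ~10 chunks
print(retrieval_budget(32_000, 1_200))   # medium model: at most ~20 chunks
print(retrieval_budget(200_000, 2_500))  # large model: at most ~60 chunks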

Progressive loading

Don't load everything at once. Load what's needed when needed.

Example: Legal document analysis

Naive approach:

  1. User uploads 100-page contract
  2. Chunk entire document, embed, store
  3. User asks: "What's the termination clause?"
  4. Retrieve 10 chunks about termination
  5. Generate answer

Progressive approach:

  1. User uploads contract
  2. Generate table of contents / section summaries (1-2K tokens total)
  3. User asks: "What's the termination clause?"
  4. Identify relevant section from TOC
  5. Load only that section (2-3 pages, ~1K tokens)
  6. Generate answer

Benefits: Faster, cheaper, more accurate (less noise).

Implementation:

def analyze_large_document(document, question):
    # Step 1: Split by headings and summarize each section
    sections = split_into_sections(document)  # Objects with .title and .content
    sections_by_title = {s.title: s for s in sections}
    summaries = {s.title: summarize(s.content) for s in sections}

    # Step 2: Identify the sections relevant to the question (returns titles)
    relevant_titles = find_relevant_sections(question, summaries)

    # Step 3: Load only the relevant content
    context = "\n\n".join(sections_by_title[t].content for t in relevant_titles)

    # Step 4: Answer with limited context
    return call_llm(f"Context: {context}\n\nQuestion: {question}")

When to use: Multi-document analysis, contracts, research papers, technical manuals.

Hierarchical analysis

For massive documents, analyze in layers.

Example: Analyzing 500 pages of financial reports

Layer 1 - Document summaries:

  • Summarize each document (500 pages → 50 summaries × 100 tokens = 5K tokens)
  • User asks: "What were the main risks mentioned?"
  • Identify which documents mention risks

Layer 2 - Section summaries:

  • For relevant documents, load section summaries
  • Pinpoint exact sections discussing risks

Layer 3 - Full content:

  • Load only those specific sections (maybe 10 pages total)
  • Generate detailed answer

Benefits: Handle unlimited content while staying under context limits. Zoom in progressively instead of loading everything upfront.
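
A sketch of the drill-down, reusing the same placeholder helpers (summarize, find_relevant_sections, call_llm) as the progressive-loading example and assuming each document object exposes .title, .content, and .sections:

def hierarchical_answer(documents, question):
    """Narrow from document summaries to sections to raw text."""
    # Layer 1: which documents are relevant at all?
    doc_summaries = {d.title: summarize(d.content) for d in documents}
    relevant_doc_titles = find_relevant_sections(question, doc_summaries)

    # Layer 2: within those documents, which sections matter?
    docs_by_title = {d.title: d for d in documents}
    relevant_sections = []
    for title in relevant_doc_titles:
        doc = docs_by_title[title]
        section_summaries = {s.title: summarize(s.content) for s in doc.sections}
        sections_by_title = {s.title: s for s in doc.sections}
        for section_title in find_relevant_sections(question, section_summaries):
            relevant_sections.append(sections_by_title[section_title])

    # Layer 3: answer from the raw text of only those sections
    context = "\n\n".join(s.content for s in relevant_sections)
    return call_llm(f"Context: {context}\n\nQuestion: {question}")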

Strategy 3: Context compression

Sometimes you need to fit more information in less space. Compression techniques help.

Prompt compression

Remove unnecessary words from prompts without losing meaning.

Before (100 tokens):

Please analyze the following customer feedback very carefully and
provide me with a detailed summary of the main points that were
mentioned, including both positive and negative aspects, and any
suggestions for improvement that the customer may have provided.

After (30 tokens):

Summarize this customer feedback. Include:
- Positive points
- Negative points
- Suggestions

Tools: LLMLingua, LongLLMLingua (research projects for automatic prompt compression).

Embedding-based compression

Replace verbose text with vector embeddings, then reconstruct context.

Advanced technique (emerging):

  1. Convert long context to embeddings
  2. Store embeddings in model's "soft prompts"
  3. Model accesses compressed information without full text

Status: Experimental, not yet production-ready for most use cases.

Response streaming with context updates

For long, multi-part outputs, run a sequence of calls and stream each one back to the user, updating the context between calls.

Example:

  • User asks to analyze 50 documents
  • Stream the analysis of document 1 from the first call
  • While it streams, load and prepare the chunks for document 2
  • Issue the next call with the updated context and keep streaming, so the user sees one continuous response

When to use: Multi-document summarization, iterative analysis.

Strategy 4: RAG for conversations

Combine RAG (Retrieval-Augmented Generation) with conversation management.

Conversation + knowledge base

Architecture:

User question
    ↓
├─ Retrieve from knowledge base (RAG) → 5 chunks
├─ Retrieve from conversation history → 3 relevant past messages
↓
Combine into context → Send to LLM

Example: Technical support bot

  • Knowledge base: Product documentation, FAQs, troubleshooting guides
  • Conversation history: User's specific setup, previous issues
  • Current question: "How do I fix the error I got earlier?"

Context construction:

# Retrieve from docs
doc_chunks = vector_db.search(question, filter="docs", limit=5)

# Retrieve from conversation
relevant_messages = conversation_memory.search(
    question,
    user_id=user_id,
    limit=3
)

# Build context
context = f"""
Product documentation:
{doc_chunks}

Previous conversation context:
{relevant_messages}

Current question: {question}
"""

When to use: Support bots, internal tools, educational assistants.

Dual-index RAG

Separate indexes for different types of context.

Two vector databases:

  1. Long-term knowledge: Company docs, policies, reference material (rarely changes)
  2. Session context: Current conversation, temporary notes (changes constantly)

Benefits:

  • Update session context frequently without re-indexing everything
  • Different retrieval strategies for each (more from knowledge base, less from conversation)
  • Easier to manage permissions (shared knowledge vs. user-specific context)
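
In code, a dual-index setup is just two lookups merged into one context. A sketch assuming two placeholder vector-store clients, knowledge_index and session_index, whose search calls return plain-text chunks (same style of API as the earlier examples):

def build_dual_context(question, user_id):
    # Long-term knowledge: large, slow-changing index
    knowledge_chunks = knowledge_index.search(question, limit=5)

    # Session context: small per-user index, updated as the conversation evolves
    session_chunks = session_index.search(question, filter={"user_id": user_id}, limit=3)

    return (
        "Reference material:\n" + "\n".join(knowledge_chunks)
        + "\n\nSession notes:\n" + "\n".join(session_chunks)
        + f"\n\nCurrent question: {question}"
    )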

Strategy 5: Multi-turn optimization

Optimize how you structure multi-turn conversations to minimize context usage.

Stateless vs. stateful turns

Stateless (every turn is independent):

# Each request includes full history
call_llm(messages=[msg1, msg2, msg3, msg4, msg5])  # All 5 messages
call_llm(messages=[msg1, msg2, msg3, msg4, msg5, msg6])  # All 6 messages

Pros: Simple, works with any API
Cons: Redundant, expensive, hits limits quickly

Stateful (maintain conversation on server):

# Some providers support conversation IDs
session = create_conversation_session()
session.send(msg1)  # Only sends msg1
session.send(msg2)  # Only sends msg2, server remembers msg1

Pros: Efficient, less redundant data transfer
Cons: Provider-specific, less portable

When to use each:

  • Stateless: Prototyping, any LLM provider, full control over context
  • Stateful: Production systems with high volume, where the provider supports it (for example, OpenAI's Assistants API threads)

Turn compression

Reduce context by compressing earlier turns without losing meaning.

Turn 1:

User: "I need help writing a Python function to parse CSV files"
Assistant: [300 tokens of detailed explanation with code examples]

Turn 2 (naive - keep full history):

Context: [Full Turn 1: 300 tokens]
User: "Can you add error handling?"

Turn 2 (optimized - compress Turn 1):

Context: "User requested Python CSV parser. Provided basic implementation."
User: "Can you add error handling?"

Savings: 300 tokens → 15 tokens. Scales dramatically over long conversations.
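
A sketch of this compression, using the same call_llm placeholder as earlier. It assumes the history alternates user/assistant turns (handle any system prompt separately); in practice you would cache each compressed turn so it is summarized only once:

def compress_turn(user_msg, assistant_msg):
    """Collapse a completed exchange into a one-line note."""
    prompt = (
        "Compress this exchange into one short sentence, keeping only facts "
        "a future turn might need:\n"
        f"User: {user_msg['content']}\nAssistant: {assistant_msg['content']}"
    )
    # One-line note; could also be folded into the system prompt instead
    return {"role": "system", "content": call_llm(prompt)}

def compress_history(messages, keep_recent=4):
    """Replace old user/assistant pairs with compressed notes."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    compressed = [
        compress_turn(old[i], old[i + 1])
        for i in range(0, len(old) - 1, 2)  # assumes alternating user/assistant turns
    ]
    return compressed + recent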

Caching repeated context

Some providers offer prompt caching—reuse parts of context across requests without reprocessing.

Example (Anthropic prompt caching):

import anthropic

client = anthropic.Anthropic()

# Request 1: Process 50K token document
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # 50K tokens
        "cache_control": {"type": "ephemeral"}  # Cache this block
    }],
    messages=[{"role": "user", "content": "Summarize page 1"}]
)

# Request 2: Same document, different question
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Same 50K tokens
        "cache_control": {"type": "ephemeral"}  # Reuses the cache
    }],
    messages=[{"role": "user", "content": "Summarize page 2"}]
)

Benefits:

  • 90% cost reduction for cached content
  • Faster response times (no re-processing)
  • Essential for analyzing long documents with multiple questions

When to use: Multiple questions about the same document, fixed system prompts with many users, repeated RAG contexts.

Cost vs. context trade-offs

Longer context = higher costs. Choose wisely.

Cost comparison

GPT-4 Turbo (128K context):

  • Input: $0.01 per 1K tokens
  • 50K token request = $0.50 per call
  • 100 calls/day = $50/day = $1,500/month

GPT-3.5 Turbo (16K context):

  • Input: $0.0005 per 1K tokens
  • 10K token request = $0.005 per call
  • 100 calls/day = $0.50/day = $15/month

Claude 3 Haiku (200K context):

  • Input: $0.00025 per 1K tokens
  • 50K token request = $0.0125 per call
  • 100 calls/day = $1.25/day = $37.50/month

Lesson: Long context capabilities don't mean you should use them carelessly. Optimize context size to balance quality and cost.

When to use long context models

Good use cases:

  • Analyzing full documents where breaking into chunks loses coherence (contracts, narratives)
  • Code repositories (need to see multiple files at once)
  • Complex reasoning requiring extensive background

Bad use cases:

  • Simple Q&A that could use RAG with 3-5 chunks
  • Conversations where summarization would suffice
  • Any task where smaller models + optimization work fine

Rule of thumb: Try RAG + smaller context first. Only reach for long context models when retrieval fails to capture necessary information.

Practical patterns

Real-world implementations combining these techniques.

Pattern 1: Customer support bot

def handle_support_query(user_id, question):
    # 1. Retrieve relevant docs (RAG)
    doc_context = search_docs(question, limit=5)  # ~2K tokens

    # 2. Load recent conversation
    recent_messages = get_recent_messages(user_id, limit=10)  # ~1K tokens

    # 3. Check for important past facts
    user_context = memory_db.get_user_context(user_id)  # ~200 tokens

    # 4. Build context (total: ~3.5K tokens)
    context = {
        "system": f"User context: {user_context}",
        "history": recent_messages,
        "docs": doc_context,
        "question": question
    }

    return call_llm(context)

Context budget: 3.5K / 8K available = 44% (plenty of room for response)

Pattern 2: Document analysis assistant

def analyze_document(document, questions):
    # 1. Generate section index
    sections = extract_sections(document)
    index = {s.title: summarize(s.content) for s in sections}  # 2K tokens

    responses = []
    for question in questions:
        # 2. Identify relevant sections
        relevant = find_relevant_sections(question, index)  # 3 sections

        # 3. Load only those sections
        context = load_sections(relevant)  # ~5K tokens

        # 4. Answer with limited context
        response = call_llm(f"Context: {context}\n\nQ: {question}")
        responses.append(response)

    return responses

Handles unlimited document size by loading selectively

Pattern 3: Long-running project assistant

def project_assistant(user_id, message):
    # 1. Retrieve project context (RAG)
    project_docs = search_project_docs(message, limit=5)  # 3K tokens

    # 2. Get relevant conversation snippets (semantic search)
    relevant_history = conversation_db.search(
        user_id=user_id,
        query=message,
        limit=5
    )  # 1K tokens

    # 3. Get key facts from memory store
    project_memory = memory_db.get(
        user_id=user_id,
        type=["decisions", "deadlines", "preferences"]
    )  # 500 tokens

    # 4. Recent messages (context continuity)
    recent = get_recent_messages(user_id, limit=5)  # 800 tokens

    # Total: ~5.5K tokens in an 8K window (68% usage)
    context = build_context(project_docs, relevant_history, project_memory, recent, message)
    return call_llm(context)

Combines multiple strategies for comprehensive context

Troubleshooting common issues

Problem: "Context length exceeded" errors

Symptoms: API rejects requests, error message about token limits

Fixes:

  1. Measure your context: Log token counts for each component (system, history, docs, etc.)
  2. Identify the culprit: What's consuming most tokens?
  3. Apply appropriate strategy:
    • Long history → Summarization or rolling window
    • Large documents → Better chunking or progressive loading
    • Too many RAG chunks → Reduce retrieval limit or re-rank

Quick fix:

def ensure_context_fits(messages, max_tokens=8000):
    current_tokens = count_tokens(messages)

    # Drop the oldest messages (after the system prompt) until we fit,
    # but always keep the system prompt and the latest user message
    while current_tokens > max_tokens and len(messages) > 2:
        messages.pop(1)
        current_tokens = count_tokens(messages)

    return messages
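
The count_tokens helper above is left undefined; a workable version for OpenAI-style message lists uses tiktoken (counts are approximate for other providers, and the small per-message overhead varies by model):

import tiktoken

def count_tokens(messages, model="gpt-4"):
    """Approximate token count for a list of chat messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for message in messages:
        total += 4  # rough per-message formatting overhead
        total += len(encoding.encode(message.get("content", "")))
    return total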

Problem: Model "forgets" important information

Symptoms: User says "As I mentioned earlier..." and model doesn't recall

Fixes:

  • Implement semantic memory retrieval (not just recent messages)
  • Use memory store for critical facts
  • Add explicit recap: "Based on your earlier mention of X..."

Problem: Responses are slow or expensive

Symptoms: High latency, unexpected API costs

Fixes:

  • Reduce context size (less input = faster + cheaper)
  • Use smaller models for simple questions
  • Implement prompt caching for repeated context
  • Batch similar questions together

Problem: Answers lack important context

Symptoms: Vague responses, missing details that are "somewhere" in the history

Fixes:

  • Don't over-summarize—keep important details
  • Use parent-child chunking (small chunks for search, large parent passages for context; see the sketch below)
  • Increase RAG retrieval limit (more chunks = more context)
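
The parent-child fix above can be sketched in a few lines, assuming a placeholder chunk_index whose search hits carry a parent_id and a parent_store keyed by that id:

def retrieve_with_parents(question, limit=5):
    # Search over small, precise chunks
    hits = chunk_index.search(question, limit=limit)

    # Hand the model the larger parent passages instead, de-duplicated
    seen, parents = set(), []
    for hit in hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            parents.append(parent_store.get(hit["parent_id"]))
    return parents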

Tools and libraries

Context management:

  • LangChain Memory: Built-in conversation memory, buffers, summarization
  • Mem0: Dedicated memory layer for AI applications
  • Zep: Long-term conversation memory with auto-summarization

Token counting:

  • tiktoken: OpenAI's tokenizer (for GPT models)
  • transformers: Hugging Face tokenizers (any model)

Prompt caching:

  • Anthropic Claude: Native prompt caching support
  • Helicone: Caching proxy for multiple providers

Monitoring:

  • Langfuse: Track context size, costs, latency
  • LangSmith: Debug conversation flows, visualize context usage

Use responsibly

  • Measure before optimizing: Log context usage to understand actual patterns
  • Don't over-compress: Some detail is necessary for quality
  • Test edge cases: Very long conversations, massive documents, rapid-fire questions
  • Monitor costs: Context management directly impacts spend
  • Privacy matters: Memory stores contain sensitive user data—secure them properly

What's next?

  • Embeddings & RAG Explained: Foundation for retrieval-based context strategies
  • Vector DBs 101: Storage systems for conversation and document memory
  • Retrieval 201: Advanced chunking techniques for context optimization
  • Cost & Latency Optimization: Balance context usage with performance and spend