TL;DR

A context window is the maximum amount of text (measured in tokens) that an AI model can process in a single interaction. It includes everything: your prompt, conversation history, system instructions, and the model's response. Understanding context windows helps you build better AI applications, avoid confusing errors, and manage costs effectively.

Why it matters

Every time you interact with an AI model, you are working within an invisible boundary. Go beyond it and the model either drops older parts of your conversation, returns an error, or produces degraded output. For casual chatting, this rarely matters. But if you are building an application that analyses documents, maintains long conversations, or processes codebases, understanding context windows is the difference between a product that works and one that breaks unpredictably.

Context windows also directly impact your costs. Larger contexts mean more tokens processed, which means higher API bills. Knowing how to work efficiently within context limits saves real money at scale.

What exactly is a context window?

Think of a context window as the AI's working memory. It is not long-term storage — it is the notepad the model has open right now. Everything the model can "see" during a single interaction must fit on this notepad.

The notepad holds a fixed number of tokens. A token is roughly three-quarters of a word in English. Short common words like "I" are a single token, while a longer word like "hamburger" may be split into several pieces (e.g. "ham", "bur", "ger"), depending on the tokeniser. A typical English sentence is about 15-20 tokens.
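For quick budgeting, many developers use the rough rule of thumb that English text averages about four characters per token. This is only a ballpark (real counts come from your provider's tokeniser), but a minimal sketch looks like:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the common ~4-characters-per-token
    heuristic for English text. Use the provider's real tokeniser for
    anything that matters; this is only for back-of-envelope budgeting."""
    return max(1, len(text) // 4)

sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sentence))  # ~11 tokens for this 44-character sentence
```

The heuristic overestimates for code and underestimates for languages with longer tokenised words, so treat it as a floor-plan, not a measurement.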

Critically, the context window is shared between input and output. If a model has a 128K token context window and you send 100K tokens of input, the model only has 28K tokens left for its response. This shared allocation catches many developers off guard.
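The arithmetic above is worth making explicit in your application code. A minimal sketch of the shared-budget calculation (the 128K figure is just an example window size):

```python
CONTEXT_WINDOW = 128_000  # example: a 128K-token model

def output_budget(input_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the model's response after the input is counted.
    Input and output share the same window, so this can reach zero."""
    return max(0, context_window - input_tokens)

print(output_budget(100_000))  # 100K of input leaves only 28K for the reply
```

Checking this number before every call is cheaper than discovering it from a truncated response.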

How context window sizes compare

Context windows have grown dramatically over the past few years:

Small windows (4K-8K tokens) were the standard in early 2023. GPT-3.5's 4K version could handle about 3,000 words — roughly a few pages of text. Enough for quick questions and short conversations, but not much more.

Medium windows (32K-64K tokens) expanded what was possible. GPT-4's 32K variant could process about 24,000 words, enough for medium-length documents or extended conversations.

Large windows (128K-200K tokens) are now common. Claude 4.5's 200K token window can handle roughly 150,000 words — an entire novel. GPT-4o offers 128K tokens. These sizes enable analysis of entire codebases, long legal contracts, or comprehensive research papers in a single pass.

Extra-large windows (1M+ tokens) are emerging. Google's Gemini models offer windows up to 1 million tokens, and the trend is clearly toward even larger capacities. However, bigger is not always better — cost and latency both increase with context size.

What happens when you exceed the limit

When your total tokens (input plus expected output) approach or exceed the context window, one of three things happens:

Truncation. In chat applications, the oldest messages are quietly dropped to make room for new ones. The model "forgets" what you discussed earlier in the conversation. This happens silently, which is why chatbots sometimes seem to lose track of what you told them.

Errors. When making API calls, you will receive an error (usually with a clear message about token limits) if your request exceeds the maximum. Your application needs to handle this gracefully.

Quality degradation. Even within the window, models sometimes struggle with very long contexts. Important information buried in the middle of a long document may receive less attention than information at the beginning or end. Researchers call this the "lost in the middle" problem.
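The silent truncation described above is usually implemented as a sliding window over the chat history. A minimal sketch (the `count_tokens` callable stands in for a real tokeniser, which is an assumption):

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the history fits the token budget.
    `count_tokens` is any callable returning a message's token count;
    a real implementation would use the provider's tokeniser."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the oldest message is silently forgotten
    return kept
```

This is exactly why chatbots "forget": nothing warns the user that `pop(0)` just happened.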

Practical strategies for working within limits

Be selective about what you include. Do not dump your entire database into the prompt. Include only the information the model needs to answer the current question. This is the single most impactful strategy.

Summarise conversation history. Instead of including every message from a long conversation, periodically summarise the key points and replace the full history with the summary. This compresses dozens of messages into a few sentences while preserving the important context.
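One way to structure this is to keep the most recent messages verbatim and collapse everything older into a single summary message. In the sketch below, the `summarise` parameter stands in for a real model call (an assumption; here it just joins the messages):

```python
def compact_history(messages, keep_last=4,
                    summarise=lambda msgs: "Summary: " + " / ".join(msgs)):
    """Replace all but the most recent messages with one summary entry.
    `summarise` is a placeholder for an actual summarisation call."""
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [summarise(older)] + recent
```

Keeping the last few messages verbatim matters: summaries preserve facts but lose tone and phrasing, which the model needs to continue the conversation naturally.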

Use chunking for long documents. If you need to process a 500-page document, split it into chunks (say, 10 pages each), process each chunk separately, and then combine the results. This is more work to implement but works reliably with any context window size.
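A common refinement is to make the chunks overlap slightly, so a sentence cut at a boundary still appears whole in at least one chunk. A minimal character-based sketch (production systems often chunk by tokens or by paragraphs instead):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks. The overlap preserves context
    across chunk boundaries at the cost of some duplicated processing."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then processed independently and the per-chunk results are merged in a final step.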

Implement RAG (Retrieval-Augmented Generation). Instead of including everything in the context, store your knowledge in a vector database and retrieve only the most relevant pieces for each query. This is how most production AI applications handle large knowledge bases.
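The core idea can be shown without any infrastructure. The toy retriever below ranks documents by word overlap with the query; a real RAG system would use embeddings and a vector database, so treat this purely as an illustration of the retrieve-then-prompt pattern:

```python
def retrieve(query, documents, k=2):
    """Toy retrieval: rank documents by shared words with the query and
    return the top k. Production systems use embedding similarity and a
    vector store instead of word overlap."""
    query_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]
```

Only the retrieved snippets go into the context window, so the knowledge base can be arbitrarily large while each prompt stays small.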

Choose the right model for the job. Do not pay for a 200K context window when your task only needs 4K. Conversely, do not try to cram a 50-page document into a 4K window. Match the model to the task.

Context window versus long-term memory

This distinction confuses many people. A context window is temporary — it exists only for the duration of a single conversation or API call. When the conversation ends, the context is gone. The model does not remember you or what you discussed.

Long-term memory, on the other hand, is built externally. You store information in databases, vector stores, or files, and retrieve it when needed. Some AI applications simulate "memory" by storing conversation summaries and loading them into the context window at the start of each new session. The model is not actually remembering — you are reminding it.

Some newer AI products (like ChatGPT's "memory" feature) automate this process, saving key facts about you and loading them into system prompts. But under the hood, it is still context window management, not true persistent memory.

How context windows affect cost and performance

Every token in your context window costs money and takes time to process. The relationship is roughly linear — twice the tokens means roughly twice the cost and twice the latency.

For a concrete example: processing a 200K token context on Claude Opus 4.5 costs significantly more than processing a 10K token context for the same query. If your application handles thousands of requests per day, the difference can be hundreds or thousands of dollars per month.

Cost optimisation tips:

  • Set appropriate max_tokens limits on responses to prevent the model from rambling.
  • Use smaller models for tasks that do not need large contexts.
  • Cache results for repeated queries instead of reprocessing them.
  • Measure your actual token usage and set up cost alerts in your provider's dashboard.
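Because pricing is per token, cost estimation is simple arithmetic. A sketch, using hypothetical per-million-token prices (check your provider's current price list; these numbers are placeholders):

```python
def estimate_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request, given per-million-token prices.
    Prices here are assumptions, not any provider's actual rates."""
    return (input_tokens / 1e6 * in_price_per_m
            + output_tokens / 1e6 * out_price_per_m)

# e.g. a hypothetical $3 input / $15 output per million tokens
print(estimate_cost(100_000, 2_000, 3.0, 15.0))  # $0.33 for one request
```

Multiply that per-request figure by your daily request volume and the scale argument above becomes concrete quickly.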

The "lost in the middle" problem

Research has shown that large language models pay the most attention to information at the beginning and end of the context window, and less attention to information in the middle. If you place a critical fact on page 30 of a 60-page document, the model may miss it even though it is technically within the context window.

How to work around this:

  • Place the most important information at the beginning or end of your prompt.
  • Use explicit instructions like "Pay special attention to the section about pricing."
  • For long documents, consider processing them in smaller chunks rather than all at once.
  • Test your application with important facts at different positions to check for this issue.
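The first two workarounds can be combined when assembling the prompt: put the instructions and the critical fact at the edges, and the bulk material in the middle. A minimal sketch of that layout (the field names are illustrative, not any library's API):

```python
def build_prompt(instructions, documents, key_fact):
    """Assemble a prompt with the instructions and critical fact at the
    start and end, where models attend most reliably, and the bulk
    documents in the middle."""
    body = "\n\n".join(documents)
    return (f"{instructions}\n\nKey fact: {key_fact}\n\n"
            f"{body}\n\n"
            f"Reminder: {key_fact}\n\n{instructions}")
```

Repeating the key fact costs a few tokens but measurably reduces the chance of it being overlooked in long contexts.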

Common mistakes

Assuming the model remembers previous conversations. Each API call starts fresh. If you need continuity, you must explicitly include the conversation history in your request.

Including everything "just in case." Stuffing the context window with every piece of data you have is wasteful, expensive, and can actually reduce quality. Be intentional about what you include.

Ignoring token limits until something breaks. Count your tokens proactively. Most provider SDKs include tokeniser functions that let you check the size of your input before sending it.

Forgetting that output tokens count too. If you are at 95% of the context window with your input, the model has almost no room to respond. Always leave headroom for the output.
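Both of these mistakes can be caught with a preflight check before the API call: count the input, add the output budget you intend to request, and fail early if the total cannot fit. A minimal sketch:

```python
def preflight(input_tokens, max_output_tokens, context_window):
    """Raise before making the API call if input plus the requested
    output budget cannot fit in the context window."""
    needed = input_tokens + max_output_tokens
    if needed > context_window:
        raise ValueError(
            f"Request needs {needed} tokens but the window is "
            f"{context_window}; trim the input or lower max_output_tokens."
        )
```

Failing in your own code with a clear message is far easier to debug than a provider-side error or a silently clipped response.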

Not testing with long contexts during development. Your application may work perfectly with short inputs during testing but fail in production when users send longer texts. Test with realistic input sizes.

What's next?

Deepen your understanding of related topics: