Context Windows: How Much AI Can Remember
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
A context window is the maximum amount of text (measured in tokens) that an AI model can process in a single interaction. It includes everything: your prompt, conversation history, system instructions, and the model's response. Understanding context windows helps you build better AI applications, avoid confusing errors, and manage costs effectively.
Why it matters
Every time you interact with an AI model, you are working within an invisible boundary. Go beyond it and the model either drops older parts of your conversation, returns an error, or produces degraded output. For casual chatting, this rarely matters. But if you are building an application that analyses documents, maintains long conversations, or processes codebases, understanding context windows is the difference between a product that works and one that breaks unpredictably.
Context windows also directly impact your costs. Larger contexts mean more tokens processed, which means higher API bills. Knowing how to work efficiently within context limits saves real money at scale.
What exactly is a context window?
Think of a context window as the AI's working memory. It is not long-term storage — it is the notepad the model has open right now. Everything the model can "see" during a single interaction must fit on this notepad.
The notepad holds a fixed number of tokens. A token is roughly three-quarters of a word in English. Short, common words like "I" are a single token, while a longer word such as "hamburger" may be split into two or three tokens depending on the tokeniser. A typical English sentence is about 15-20 tokens.
Critically, the context window is shared between input and output. If a model has a 128K token context window and you send 100K tokens of input, the model only has 28K tokens left for its response. This shared allocation catches many developers off guard.
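The arithmetic behind this shared allocation is simple but worth making explicit. A minimal sketch (the 128K and 100K figures are the illustrative numbers from the paragraph above):

```python
def remaining_output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's response after the input is counted."""
    return max(context_window - input_tokens, 0)

# A 128K-token window with 100K tokens of input leaves only 28K for output.
print(remaining_output_budget(128_000, 100_000))  # → 28000
```

Checking this budget before sending a request is how applications leave headroom for the response instead of discovering too late that the model has no room to answer.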
How context window sizes compare
Context windows have grown dramatically over the past few years:
Small windows (4K-8K tokens) were the standard in early 2023. GPT-3.5's 4K version could handle about 3,000 words — roughly a few pages of text. Enough for quick questions and short conversations, but not much more.
Medium windows (32K-64K tokens) expanded what was possible. GPT-4's 32K variant could process about 24,000 words, enough for medium-length documents or extended conversations.
Large windows (128K-200K tokens) are now common. Claude 4.5's 200K token window can handle roughly 150,000 words — an entire novel. GPT-4o offers 128K tokens. These sizes enable analysis of entire codebases, long legal contracts, or comprehensive research papers in a single pass.
Extra-large windows (1M+ tokens) are emerging. Google's Gemini models offer windows up to 1 million tokens, and the trend is clearly toward even larger capacities. However, bigger is not always better — cost and latency both increase with context size.
What happens when you exceed the limit
When your total tokens (input plus expected output) approach or exceed the context window, one of three things happens:
Truncation. In chat applications, the oldest messages are quietly dropped to make room for new ones. The model "forgets" what you discussed earlier in the conversation. This happens silently, which is why chatbots sometimes seem to lose track of what you told them.
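The truncation loop a chat application runs can be sketched as follows. The token-counting helper here is a crude stand-in (roughly 4 characters per token); a real application would use the provider's tokeniser:

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in: roughly 4 characters per token in English.
    return max(len(text) // 4, 1)

def truncate_history(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the remaining ones fit in the budget."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > budget:
        kept.pop(0)  # the oldest message is silently forgotten
    return kept
```

Because the drop happens at the front of the list, the user never sees an error, only a model that has stopped referring to earlier turns.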
Errors. When making API calls, you will receive an error (usually with a clear message about token limits) if your request exceeds the maximum. Your application needs to handle this gracefully.
Quality degradation. Even within the window, models sometimes struggle with very long contexts. Important information buried in the middle of a long document may receive less attention than information at the beginning or end. Researchers call this the "lost in the middle" problem.
Practical strategies for working within limits
Be selective about what you include. Do not dump your entire database into the prompt. Include only the information the model needs to answer the current question. This is the single most impactful strategy.
Summarise conversation history. Instead of including every message from a long conversation, periodically summarise the key points and replace the full history with the summary. This compresses dozens of messages into a few sentences while preserving the important context.
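The pattern can be sketched like this. The `summarise` function below is a placeholder that keeps the first sentence of each message; in practice you would ask the model itself to write the summary:

```python
def summarise(messages: list[str]) -> str:
    # Placeholder: keep the first sentence of each message.
    # A real application would ask the model to produce this summary.
    return " ".join(m.split(". ")[0].rstrip(".") + "." for m in messages)

def compress_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    """Replace all but the most recent messages with a single summary."""
    if len(messages) <= keep_recent:
        return messages
    summary = summarise(messages[:-keep_recent])
    return [f"Summary of earlier conversation: {summary}"] + messages[-keep_recent:]
```

The key design choice is keeping the last few messages verbatim: recent turns carry the immediate context the model needs word-for-word, while older turns usually only matter in outline.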
Use chunking for long documents. If you need to process a 500-page document, split it into chunks (say, 10 pages each), process each chunk separately, and then combine the results. This is more work to implement but works reliably with any context window size.
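A minimal chunker, splitting by characters for simplicity (production systems more often chunk by tokens, paragraphs, or semantic boundaries, and a small overlap helps avoid cutting a sentence in half at a boundary):

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks with optional overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk is then processed independently, and the per-chunk results are merged in a final pass.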
Implement RAG (Retrieval-Augmented Generation). Instead of including everything in the context, store your knowledge in a vector database and retrieve only the most relevant pieces for each query. This is how most production AI applications handle large knowledge bases.
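A toy illustration of the retrieval step, using bag-of-words cosine similarity in place of a real vector database (production RAG uses embedding vectors from a model, but the shape of the operation is the same: score every document against the query, keep the top few):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return scored[:k]
```

Only the retrieved documents enter the context window, so the knowledge base can be arbitrarily large while each request stays small.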
Choose the right model for the job. Do not pay for a 200K context window when your task only needs 4K. Conversely, do not try to cram a 50-page document into a 4K window. Match the model to the task.
Context window versus long-term memory
This distinction confuses many people. A context window is temporary — it exists only for the duration of a single conversation or API call. When the conversation ends, the context is gone. The model does not remember you or what you discussed.
Long-term memory, on the other hand, is built externally. You store information in databases, vector stores, or files, and retrieve it when needed. Some AI applications simulate "memory" by storing conversation summaries and loading them into the context window at the start of each new session. The model is not actually remembering — you are reminding it.
Some newer AI products (like ChatGPT's "memory" feature) automate this process, saving key facts about you and loading them into system prompts. But under the hood, it is still context window management, not true persistent memory.
How context windows affect cost and performance
Every token in your context window costs money and takes time to process. The relationship is roughly linear — twice the tokens means roughly twice the cost and twice the latency.
For a concrete example: processing a 200K token context on Claude Opus 4.5 costs significantly more than processing a 10K token context for the same query. If your application handles thousands of requests per day, the difference can be hundreds or thousands of dollars per month.
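That comparison can be sketched with a simple estimator. The prices below are placeholders, not actual rates; check your provider's current pricing page:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Rough request cost in dollars; prices are per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only ($15/M input, $75/M output).
large = estimate_cost(200_000, 2_000, 15.0, 75.0)   # 3.15
small = estimate_cost(10_000, 2_000, 15.0, 75.0)    # 0.30
print(round(large, 2), round(small, 2))
```

At these illustrative rates the 200K-context request costs roughly ten times the 10K one, and the gap compounds over thousands of daily requests.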
Cost optimisation tips:
- Set appropriate `max_tokens` limits on responses to prevent the model from rambling.
- Use smaller models for tasks that do not need large contexts.
- Cache results for repeated queries instead of reprocessing them.
- Measure your actual token usage and set up cost alerts in your provider's dashboard.
The "lost in the middle" problem
Research has shown that large language models pay the most attention to information at the beginning and end of the context window, and less attention to information in the middle. If you place a critical fact on page 30 of a 60-page document, the model may miss it even though it is technically within the context window.
How to work around this:
- Place the most important information at the beginning or end of your prompt.
- Use explicit instructions like "Pay special attention to the section about pricing."
- For long documents, consider processing them in smaller chunks rather than all at once.
- Test your application with important facts at different positions to check for this issue.
Common mistakes
Assuming the model remembers previous conversations. Each API call starts fresh. If you need continuity, you must explicitly include the conversation history in your request.
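The message-list pattern for carrying history explicitly looks roughly like this (the `role`/`content` shape mirrors common chat APIs, but treat the details as illustrative):

```python
history: list[dict] = []

def build_request(user_message: str, system_prompt: str) -> list[dict]:
    """Assemble the full message list that every API call must carry."""
    history.append({"role": "user", "content": user_message})
    return [{"role": "system", "content": system_prompt}] + history

def record_reply(reply: str) -> None:
    """Store the model's answer so the next request includes it."""
    history.append({"role": "assistant", "content": reply})
```

Nothing persists on the provider's side between calls; if a message is not in the list you send, the model has never seen it.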
Including everything "just in case." Stuffing the context window with every piece of data you have is wasteful, expensive, and can actually reduce quality. Be intentional about what you include.
Ignoring token limits until something breaks. Count your tokens proactively. Most provider SDKs include tokeniser functions that let you check the size of your input before sending it.
Forgetting that output tokens count too. If you are at 95% of the context window with your input, the model has almost no room to respond. Always leave headroom for the output.
Not testing with long contexts during development. Your application may work perfectly with short inputs during testing but fail in production when users send longer texts. Test with realistic input sizes.
What's next?
Deepen your understanding of related topics:
- Token Economics to understand how tokens translate to costs
- RAG: Retrieval-Augmented Generation for handling knowledge that exceeds context limits
- Prompt Engineering Basics for making the most of your context budget
- Embeddings Explained for the technology that powers efficient retrieval
Frequently Asked Questions
How do I check how many tokens my text uses?
Most AI providers offer tokeniser tools. OpenAI's tiktoken library (for Python) or their online tokeniser tool lets you paste text and see the exact token count. Anthropic and other providers have similar tools. As a rough estimate, 1 token is about 0.75 English words, or 4 characters. A 1,000-word document is roughly 1,300 tokens.
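The rough ratios above can be wrapped in a quick estimator for English text. This is only the heuristic from the answer, not a tokeniser; use tiktoken or your provider's tool when you need exact counts:

```python
def rough_token_count(text: str) -> int:
    """Heuristic only: ~4 characters or ~0.75 English words per token."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)
```

For a 1,000-word document this lands near the 1,300-token figure quoted above, which is close enough for budgeting but not for hard limit checks.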
Why does the AI forget what I said earlier in a long conversation?
When the conversation exceeds the context window, older messages are dropped (truncated) to make room for newer ones. The model is not choosing to forget — it simply cannot see those older messages anymore. To prevent this, use conversation summarisation or limit how long conversations run before starting a new one.
Does a larger context window always mean better results?
Not necessarily. Larger context windows allow you to include more information, but they also cost more, take longer to process, and can suffer from the 'lost in the middle' problem where the model pays less attention to information in the centre of the context. For many tasks, a smaller, well-curated context produces better results than a larger, unfocused one.
What is the difference between context length and context window?
They are usually the same thing. 'Context window' refers to the maximum capacity — the total number of tokens the model can handle. 'Context length' sometimes refers to how much of that window you are actually using in a given request. Think of the window as the size of the bucket and the length as how much water is in it.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Context Window
The maximum amount of text an AI model can process at once—including both what you send and what it generates. Once the window fills up, the AI loses access to earlier parts of the conversation.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Context Engineering
The discipline of designing everything an AI model sees — system prompts, retrieved documents, tool definitions, conversation history, and examples — to produce reliable, high-quality outputs.
Token
A chunk of text — usually a word or part of a word — that AI models process as a single unit. Most English words are one token, but longer or uncommon words get split into pieces.
Related Guides
- AI Model Architectures: A High-Level Overview (Intermediate, 7 min read): From transformers to CNNs to diffusion models—understand the different AI architectures and what they're good at.
- Embeddings: Turning Words into Math (Intermediate, 9 min read): Embeddings convert text into numbers that capture meaning. Essential for search, recommendations, and RAG systems.
- Natural Language Processing: How AI Understands Text (Intermediate, 8 min read): NLP is how AI reads, understands, and generates human language. Learn the techniques behind chatbots, translation, and text analysis.