TL;DR

Context engineering is the discipline of designing everything an AI model sees — system prompts, retrieved documents, tool outputs, conversation history, and examples — not just the individual prompt. It's why the same model can be brilliant in one product and useless in another. If prompt engineering is writing a good question, context engineering is setting up the entire classroom.

Why it matters

You've probably had this experience: you carefully craft a perfect prompt, get a great answer... and then the AI completely ignores it two messages later. Or a RAG system retrieves the right documents but the AI still gives a wrong answer.

The problem usually isn't your prompt. It's your context.

According to LangChain's 2025 State of Agent Engineering report, 57% of organisations now have AI agents in production — but 32% cite quality as their top barrier. Most failures trace back to poor context management, not model limitations. The models are capable enough. The context feeding them isn't good enough.

That's the shift from prompt engineering to context engineering: instead of asking "How do I write a better prompt?", you ask "How do I design the full informational environment that lets this AI reason reliably?"

What is context engineering?

Context engineering is the discipline of designing and managing all the information that reaches an AI model. Think of it as architecture rather than writing.

Prompt engineering is like crafting a single, perfect email.

Context engineering is like designing the entire briefing package — background docs, data tables, previous correspondence, and clear instructions — so that anyone reading it (human or AI) arrives at the right answer.

A well-engineered context includes five components working together:

1. System prompts and instructions

The foundation. These define the AI's role, rules, constraints, and output format. A customer service bot, a coding assistant, and a medical advisor might all use the same underlying model — the system prompt is what makes them behave differently.

Example: "You are a tax advisor for Australian small businesses. Only answer questions about Australian tax law. If asked about other jurisdictions, say 'I can only help with Australian tax — please consult a local advisor.'"

2. Retrieved documents (RAG context)

Dynamic information fetched at query time. Instead of relying on what the model memorised during training (which has a cutoff date), you supply current, relevant documents.

Example: When a user asks "What's the FBT rate?", your system retrieves the latest ATO bulletin rather than relying on the model's potentially outdated training data.
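Here's a rough sketch of that retrieval step. `retrieve` is a stand-in for whatever search your vector store or document index provides, and the chunk contents are placeholders.

```python
# Fetch current documents at query time and place them in the context,
# instead of relying on what the model memorised during training.

def retrieve(query: str, k: int = 3) -> list[dict]:
    """Placeholder retrieval: return the k most relevant chunks with metadata."""
    return [
        {"source": "ATO bulletin (latest)", "date": "2025-04-01",
         "text": "The FBT rate for the current year is ..."},
    ][:k]

def build_rag_context(query: str) -> str:
    chunks = retrieve(query)
    cited = [f"[{c['source']}, {c['date']}]\n{c['text']}" for c in chunks]
    return "Use only the documents below to answer.\n\n" + "\n\n".join(cited)

print(build_rag_context("What's the FBT rate?"))
```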

3. Tool definitions and outputs

Modern AI systems don't just generate text — they call functions. The tool schemas (what tools are available, what parameters they accept) and their outputs become part of the context.

Example: A financial assistant has access to get_stock_price(ticker), calculate_returns(portfolio, period), and search_sec_filings(company). The tool definitions tell the AI what it can do. The outputs feed back into the context for the next reasoning step.
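For illustration, here's roughly what those definitions might look like in the JSON-schema style most tool-calling APIs use. The exact wire format differs by provider, so treat this as a sketch rather than any specific API.

```python
# Tool schemas: name, purpose, and parameter descriptions the model reads
# before deciding whether and how to call each tool.

TOOLS = [
    {
        "name": "get_stock_price",
        "description": "Return the latest trade price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "e.g. 'AAPL'"},
            },
            "required": ["ticker"],
        },
    },
    {
        "name": "calculate_returns",
        "description": "Compute total and annualised returns for a portfolio over a period.",
        "parameters": {
            "type": "object",
            "properties": {
                "portfolio": {"type": "array", "items": {"type": "object"}},
                "period": {"type": "string", "description": "e.g. '1y', '5y'"},
            },
            "required": ["portfolio", "period"],
        },
    },
]
```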

4. Conversation history and memory

What's been said before — and a summary of what matters from earlier. Raw conversation history eats tokens fast, so production systems use strategies like summarisation, sliding windows, or explicit memory stores.

Example: Instead of keeping 50 messages of raw history, the system maintains a running summary: "User is planning a trip to Japan in April. Budget is $5,000. Prefers cultural experiences over nightlife. Already booked flights."
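A minimal sketch of one such strategy: keep the last few turns verbatim and fold older turns into a running summary. The `summarise` function is a stub; in practice it's usually a cheap model call.

```python
# Sliding window plus running summary: recent turns stay verbatim,
# older turns get compressed.

WINDOW = 6  # number of recent messages kept verbatim

def summarise(old_messages: list[dict], previous_summary: str) -> str:
    """Stub: replace with a model call that compresses old turns into key facts."""
    return previous_summary

def compact_history(messages: list[dict], summary: str) -> tuple[list[dict], str]:
    if len(messages) <= WINDOW:
        return messages, summary
    overflow, recent = messages[:-WINDOW], messages[-WINDOW:]
    return recent, summarise(overflow, summary)
```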

5. Examples and demonstrations

Few-shot examples that show the model what good output looks like. These calibrate tone, format, and reasoning style more effectively than instructions alone.

Example: Including 2–3 examples of well-formatted customer support responses teaches the model your company's style better than a paragraph describing it.
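A sketch of what that looks like in practice: two worked exchanges prepended as prior turns ahead of the real question. The example content here is invented; use approved responses from your own team.

```python
# Few-shot demonstrations placed before the live question, so the model
# imitates their tone and format.

FEW_SHOT = [
    {"role": "user", "content": "My invoice shows the wrong billing address."},
    {"role": "assistant", "content": (
        "Thanks for flagging that! I've updated the address on your account, "
        "and your next invoice will show the change. Anything else I can help with?"
    )},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": (
        "Sorry about that - I can see the duplicate charge. I've raised a refund, "
        "which usually lands within 3-5 business days, and I'll email a confirmation."
    )},
]

def build_messages(system_prompt: str, user_question: str) -> list[dict]:
    return [{"role": "system", "content": system_prompt}, *FEW_SHOT,
            {"role": "user", "content": user_question}]
```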

Context budgeting

Every AI model has a finite context window — the maximum amount of text it can process at once. Claude supports up to 200,000 tokens, GPT-4o up to 128,000. That sounds enormous, but it fills up fast when you're combining system prompts + RAG documents + tool schemas + conversation history + the user's actual question.

Context budgeting means allocating your window deliberately:

| Component | Typical allocation | Notes |
|---|---|---|
| System prompt | 500–2,000 tokens | Keep focused; bloated instructions get ignored |
| Retrieved documents | 2,000–10,000 tokens | Quality over quantity: 3 relevant chunks beat 10 vaguely related ones |
| Tool definitions | 500–3,000 tokens | Scales with number of tools |
| Conversation history | 1,000–5,000 tokens | Summarise aggressively |
| Examples | 500–2,000 tokens | 2–3 well-chosen examples are plenty |
| User input + response | Remainder | Leave enough headroom for the answer |

The golden rule: If you're using more than 50% of your context window on system instructions and tools, something needs trimming.
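A crude budget check, using the allocations from the table above, might look like the sketch below. Token counts are approximated as roughly 1.3 tokens per word; a real system would use the model's own tokenizer.

```python
# Rough per-component budget check against a 128K context window.

CONTEXT_WINDOW = 128_000
BUDGET = {            # upper bounds per component, in tokens
    "system": 2_000,
    "documents": 10_000,
    "tools": 3_000,
    "history": 5_000,
    "examples": 2_000,
}

def approx_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)

def check_budget(parts: dict[str, str]) -> None:
    used = 0
    for name, text in parts.items():
        tokens = approx_tokens(text)
        used += tokens
        if tokens > BUDGET.get(name, CONTEXT_WINDOW):
            print(f"warning: {name} is over budget ({tokens} tokens)")
    # Golden rule: static scaffolding shouldn't crowd out room to reason.
    static = approx_tokens(parts.get("system", "")) + approx_tokens(parts.get("tools", ""))
    if static > CONTEXT_WINDOW * 0.5:
        print("warning: instructions + tools exceed 50% of the window")
    print(f"{used} tokens used, {CONTEXT_WINDOW - used} left for input and response")
```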

The "lost in the middle" problem

Models pay more attention to the beginning and end of the context. Information buried in the middle gets less attention. Structure your context with the most important information first (system prompt, key constraints) and last (the user's actual question), with supporting documents in between.
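A small sketch of that ordering, assembling the pieces so the critical material sits at the edges:

```python
# Assemble context with instructions first, supporting documents in the
# middle, and the user's question last.

def assemble_context(system: str, constraints: str, documents: list[str],
                     question: str) -> str:
    return "\n\n".join([
        system,                   # start: role and rules get strong attention
        constraints,              # start: hard requirements
        *documents,               # middle: supporting material
        f"Question: {question}",  # end: the thing to answer gets strong attention
    ])
```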

Real-world example: building a support bot

Bad approach (prompt engineering only):
You write one very long prompt with the company FAQ, product details, and tone guidelines all jammed together. It works for simple questions but hallucinates on edge cases, forgets policy details, and gives inconsistent formatting.

Good approach (context engineering):

  1. System prompt (300 tokens): Role, tone, escalation rules, output format
  2. RAG retrieval (dynamic): When user asks a question, fetch the 3 most relevant FAQ entries and the specific product page
  3. Tool access: check_order_status(order_id), create_ticket(category, description) — so the bot can actually do things, not just talk
  4. Conversation summary (maintained per session): Running summary of the user's issue, updated after each turn
  5. Examples (2 turns): One simple question-answer, one escalation example

The system prompt stays small. The real intelligence comes from the right information arriving at the right time.
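Wired together, the whole pipeline might look something like this sketch. Every external piece (retrieval, the model call, the summary update) is a placeholder, and the few-shot examples are omitted for brevity; the point is the shape, not the specifics.

```python
# Support-bot turn: small static prompt plus dynamic parts assembled fresh
# each turn. All external calls are stubs to keep the sketch self-contained.

SYSTEM_PROMPT = ("You are a support agent for Acme. Be concise, cite the FAQ, "
                 "and escalate billing disputes.")

TOOLS = [{"name": "check_order_status",
          "description": "Look up the current status of an order by id.",
          "parameters": {"type": "object",
                         "properties": {"order_id": {"type": "string"}},
                         "required": ["order_id"]}}]

def retrieve_faq(question: str, k: int = 3) -> list[str]:
    """Stub: swap in your vector-store search."""
    return ["(relevant FAQ entry)"] * k

def call_model(messages: list[dict], tools: list[dict]) -> str:
    """Stub: swap in your model client; handle any tool calls here."""
    return "(model reply)"

def update_summary(summary: str, question: str, reply: str) -> str:
    """Stub: compress the new exchange into the running summary."""
    return summary

def handle_turn(question: str, session_summary: str) -> tuple[str, str]:
    docs = retrieve_faq(question)                        # 2. RAG, fetched per question
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},    # 1. small static prompt
        {"role": "system", "content": "Conversation so far: " + session_summary},  # 4. memory
        {"role": "system", "content": "Relevant FAQ entries:\n" + "\n".join(docs)},
        {"role": "user", "content": question},
    ]
    reply = call_model(messages, TOOLS)                  # 3. tools available to the model
    return reply, update_summary(session_summary, question, reply)
```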

Engineering each component well

System prompts: less is more

The most common mistake is writing a 2,000-word system prompt that tries to cover every edge case. Models start ignoring parts of overly long instructions. Instead (a minimal example follows the list):

  • State the role and primary objective in 1–2 sentences
  • List 3–5 non-negotiable rules as bullet points
  • Define the output format with a short example
  • Add a catch-all: "If unsure, ask the user to clarify"
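Put together, a system prompt following that structure might read like this. The product and rules are invented for illustration.

```python
# Role, a handful of hard rules, output format, and a catch-all.

SYSTEM_PROMPT = """You are a support agent for Acme's invoicing product. \
Your goal is to resolve billing questions in as few turns as possible.

Rules:
- Never quote prices that aren't in the provided documents.
- Escalate any refund request over $500.
- Do not discuss competitors.

Format: a short answer (max 3 sentences), then "Next step:" with one action.
Example: "Your March invoice was reissued yesterday. Next step: check your inbox."

If unsure, ask the user to clarify."""
```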

RAG: relevance beats volume

More retrieved documents ≠ better answers. A common pattern:

  • Retrieve 10 candidate chunks
  • Re-rank by relevance (using a cross-encoder or the model itself)
  • Include only the top 3–5 in the context

Tag your chunks with metadata (source, date, confidence score) so the model can assess reliability.
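A sketch of that retrieve, re-rank, and trim pattern, with metadata attached to each chunk. The search and scoring functions are placeholders for your vector store and a cross-encoder (or a small model call).

```python
# Retrieve 10 candidates, re-rank, keep the top 3, and tag each chunk with
# source and date so the model can assess reliability.

def vector_search(query: str, k: int = 10) -> list[dict]:
    """Stub: return candidate chunks with source and date metadata."""
    return [{"text": "...", "source": "faq.md", "date": "2025-06-01", "score": 0.0}] * k

def rerank_score(query: str, chunk: dict) -> float:
    """Stub for a cross-encoder relevance score."""
    return chunk["score"]

def top_chunks(query: str, keep: int = 3) -> str:
    candidates = vector_search(query, k=10)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return "\n\n".join(
        f"[source: {c['source']}, updated: {c['date']}]\n{c['text']}"
        for c in ranked[:keep]
    )
```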

Tools: clear schemas prevent errors

Write tool descriptions as if explaining to a new colleague. Include what the tool does, when to use it, what the parameters mean, and what the output looks like. Ambiguous tool definitions cause the model to call the wrong tool or pass wrong parameters.
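For example, here's the difference for a hypothetical order-lookup tool. The second definition spells out what the tool does, when to use it, and what the parameter means.

```python
# An ambiguous vs. a clear definition for the same (hypothetical) tool.

VAGUE = {
    "name": "lookup",
    "description": "Looks up data.",
    "parameters": {"type": "object", "properties": {"q": {"type": "string"}}},
}

CLEAR = {
    "name": "check_order_status",
    "description": (
        "Look up the current status of a customer order (processing, shipped, "
        "delivered, or refunded). Use this whenever the user asks where their "
        "order is. Do not use it for invoices or subscriptions."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order number from the confirmation email, e.g. 'ORD-10423'.",
            }
        },
        "required": ["order_id"],
    },
}
```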

History: summarise ruthlessly

Raw conversation history is the biggest context hog. A 20-turn conversation can easily be 5,000+ tokens. Use a running summary that captures decisions, preferences, and unresolved questions — not a transcript.
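A sketch of that update step. The model call is a stub; what matters is the instruction it's given: keep decisions, preferences, and open questions, not the transcript.

```python
# Running-summary update after each turn.

SUMMARY_INSTRUCTION = (
    "Update the summary below with the new exchange. Keep only decisions, "
    "user preferences, constraints, and unresolved questions. Max 150 words."
)

def call_model(prompt: str) -> str:
    """Stub so the sketch runs; swap in a cheap, fast model for this job."""
    return prompt[-600:]

def update_summary(summary: str, user_msg: str, assistant_msg: str) -> str:
    prompt = (
        f"{SUMMARY_INSTRUCTION}\n\nCurrent summary:\n{summary}\n\n"
        f"New exchange:\nUser: {user_msg}\nAssistant: {assistant_msg}"
    )
    return call_model(prompt)
```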

Common mistakes

  • Stuffing everything into the system prompt — put dynamic information in RAG, not in static instructions
  • Ignoring context budgets — filling the window means the model has no room to reason
  • Retrieving too many documents — 10 mediocre chunks confuse the model more than 3 excellent ones
  • Forgetting the "lost in the middle" effect — put critical information at the start and end of your context
  • No versioning — system prompts and retrieval strategies need version control just like code
  • Testing with short contexts, deploying with long ones — behaviour changes as the context fills up; test at realistic scale

Tools and frameworks

  • LangChain / LlamaIndex: Orchestration frameworks for building context pipelines
  • Anthropic's prompt engineering guides: Best practices for Claude-specific context design
  • PromptOps tools (Adaline, PromptHub): Version control and A/B testing for prompts
  • RAG evaluation frameworks (Ragas, TruLens): Measure retrieval quality and answer faithfulness

Context engineering vs prompt engineering

| Aspect | Prompt engineering | Context engineering |
|---|---|---|
| Focus | Wording of the instruction | Entire information environment |
| Scope | Single interaction | System-level architecture |
| Key skill | Writing clear instructions | Designing information pipelines |
| Failure mode | Bad answer | Inconsistent system behaviour |
| Analogy | Writing a good exam question | Designing the entire curriculum |

They're complementary. You still need to write good prompts (clear, specific, well-structured). But in production, the context around that prompt matters more than the prompt itself.

What's next?