# Context engineering beats prompt engineering for long-running agents

A team has an agent that works for the first twenty minutes of a task and then gets worse. Early on it is sharp: it reads a file, calls a tool, makes a clean decision. An hour in, on the same task, it starts repeating tool calls it already made, citing a value that was true forty steps ago and stale now, and writing answers that drift from what was actually asked. Nothing crashed. The model did not change. The prompt did not change.

The team's first move is to rewrite the system prompt. They make it sterner, add a "remember to check what you already tried" line, restructure the instructions. It helps for a few more minutes, then the drift comes back. They are tuning the wrong thing. The system prompt was fixed before the run started; the thing that degraded was everything that arrived _after_ it — forty steps of tool outputs, retrieved chunks, intermediate reasoning, all of it piling into one context window that nobody is managing.

The system prompt is a small, fixed input. The context window of a long-running agent is a large, growing one, and the discipline of managing it across a long run — deciding turn after turn what to keep, what to drop, what to compact, what to retrieve — is a different job with a different name.

That job is context engineering. This post is about doing it.

## The work moved, and context engineering is where it went

Prompt engineering is the craft of writing the instruction: word choice, structure, examples, the system prompt. It is real and it still matters. But it answers a narrow question. Anthropic, in its 2025 essay ["Effective context engineering for AI agents"](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents), frames the distinction precisely: prompt engineering is "methods for writing and organizing LLM instructions for optimal outcomes," while context engineering is "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts." They call context engineering the natural progression of prompt engineering — not a replacement, a superset.

The shift is about _where the work is_. For a single-shot task — classify this, summarize that — the input is the prompt, and getting the prompt right is most of the battle. For a long-running agent, the prompt is the opening frame and then forty, eighty, two hundred steps happen. Each step adds tokens: a tool result, a retrieved document, a chunk of the model's own reasoning. By step eighty the system prompt is a thin slice of a context window that is mostly run history. Tuning that slice while ignoring the other ninety percent is tuning the part that was already fine.

Context engineering is the discipline of governing the ninety percent.

## The context window is a budget that fills as the run goes

Think of the context window the way the [agent-budgets essay](/blog/agent-budgets/) thinks of cost, latency, and depth: a finite resource the agent spends. The difference is that the cost budget is spent on purpose and the context budget fills _by default_. Every step an agent takes deposits tokens into the window whether or not anyone decided they belonged there.

Anthropic puts it directly: context is "a critical but finite resource" with "diminishing marginal returns," and a model has an "attention budget" it draws down as tokens accumulate. The window has a hard ceiling — the model's maximum — but the useful ceiling is well below it, for reasons the next section covers. So a long-running agent's context window has a fill curve:

```
  tokens in
  window
    ▲
    │                                            ╭──── hard limit (model max)
    │                                       ╱╱╱╱╱
    │  · · · · · · · · · · · · · · · · ╱╱╱╱╱ · · · ◄ useful ceiling
    │                            ╱╱╱╱╱            (quality degrades
    │                      ╱╱╱╱╱╱                  well before the
    │                ╱╱╱╱╱╱                        hard limit)
    │          ╱╱╱╱╱╱
    │    ╱╱╱╱╱╱
    │╱╱╱╱  ◄── system prompt + task: small, fixed
    └────────────────────────────────────────────────────►  steps in the run
       step 1        step 40         step 120        step 250

  unmanaged: the window fills monotonically and crosses the useful
  ceiling long before the task ends. context engineering is the set
  of operations that bend this curve back down — turn after turn.
```

An unmanaged agent rides that curve straight up. Context engineering is the set of operations — compaction, pruning, selective retrieval, externalization — that keep the curve under the useful ceiling for the length of the run. It is budget management, applied to tokens, performed continuously.
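To make the fill curve concrete, here is a minimal simulation. The numbers (2,000 tokens of system prompt and task, 800 tokens added per step, a 60,000-token useful ceiling, a 3,000-token summary) are illustrative assumptions, not measurements, and `simulate_fill` is a hypothetical helper rather than part of any library.

```python
def simulate_fill(steps, base_tokens, tokens_per_step,
                  useful_ceiling=None, summary_tokens=0):
    """Return the window size after each step.

    If useful_ceiling is set, run history is compacted down to
    summary_tokens whenever the next step would cross the ceiling.
    """
    history = 0
    sizes = []
    for _ in range(steps):
        next_size = base_tokens + history + tokens_per_step
        if useful_ceiling is not None and next_size > useful_ceiling:
            history = summary_tokens  # compaction: history becomes a summary
        history += tokens_per_step
        sizes.append(base_tokens + history)
    return sizes


# Assumed, illustrative numbers for a 250-step run.
unmanaged = simulate_fill(steps=250, base_tokens=2_000, tokens_per_step=800)
managed = simulate_fill(steps=250, base_tokens=2_000, tokens_per_step=800,
                        useful_ceiling=60_000, summary_tokens=3_000)

print("unmanaged final window:", unmanaged[-1])
print("managed peak window:   ", max(managed))
```

The unmanaged run only ever grows; the managed run saws up and down under the ceiling, which is the "bend the curve back down" behavior in the diagram.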

## What an unmanaged window does to an agent

If you let the window fill unmanaged, four failure modes arrive, and they compound.

**Context degradation — "context rot."** This is the load-bearing finding. As the token count in the window rises, a model's ability to use what is in that window reliably _drops_ — and it drops well before the hard limit. Chroma's 2025 study ["Context Rot"](https://www.trychroma.com/research/context-rot) tested 18 models across the Claude, GPT, Gemini, and Qwen families and found that performance "varies significantly as input length changes, even on simple tasks" — the assumption that a model handles the 10,000th token as reliably as the 100th does not hold in practice. Anthropic describes the same effect in its own terms: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." A bigger window is not a free win; past a point, filling it makes the agent worse.

**The lost-in-the-middle effect.** A specific shape of that degradation. Liu et al.'s 2023 study ["Lost in the Middle"](https://arxiv.org/abs/2307.03172) (TACL 2024) showed that models use information at the _start_ and _end_ of a long context far more reliably than information in the _middle_ — a U-shaped curve. Their multi-document QA result is sharp enough to keep: with the answer buried mid-context among many documents, GPT-3.5-Turbo's accuracy fell below its no-context, closed-book baseline. Where a fact sits in the window changes whether the model can use it. A long unmanaged window buries earlier content exactly where the model reads it worst.

**Runaway cost and latency.** Every token in the window is paid for, on every step, for the rest of the run. Once the window has grown to 100k tokens, every subsequent step re-sends and re-bills at least 100k tokens of prompt. Self-attention compute grows quadratically with sequence length, so a window that grows without bound also makes each step slower without bound. This is the cost story from the [agent-budgets essay](/blog/agent-budgets/) and the latency story from the [agent-latency-math essay](/blog/agent-latency-math/), both driven by one unmanaged number.
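A rough back-of-the-envelope version of the cost claim, as a sketch: if every step re-sends the whole window, the total prompt tokens for a run is the sum of the window sizes, which grows roughly with the square of the step count when the window grows linearly. The growth numbers are assumptions for illustration, and the calculation ignores any provider-side prompt caching discounts.

```python
def window_sizes(steps, base_tokens, tokens_per_step):
    """Window size at each step of an unmanaged run with linear growth."""
    return [base_tokens + k * tokens_per_step for k in range(1, steps + 1)]


def total_prompt_tokens(sizes):
    """Each step re-sends the whole window, so the total is the sum of sizes."""
    return sum(sizes)


# Assumed numbers: 2,000 fixed tokens, 800 new tokens per step.
run_50 = total_prompt_tokens(window_sizes(50, 2_000, 800))
run_100 = total_prompt_tokens(window_sizes(100, 2_000, 800))

print(run_50)             # 1120000
print(run_100)            # 4240000
print(run_100 / run_50)   # about 3.79: twice the steps, almost four times the tokens
```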

**Distraction by stale content.** A window full of forty-step-old tool results, abandoned plans, and superseded values is not neutral ballast. Chroma's study found that distractors — content semantically near the task but not actually relevant — measurably degrade output, and that even a single distractor hurts. An agent reasoning over a window thick with stale state is an agent being actively misled by its own history.

The through-line: these are not separate bugs. An unmanaged window degrades recall, buries facts where the model reads them worst, inflates cost and latency, and feeds the model distractors — all at once, and worse the longer the run.

## The techniques

Context engineering is a set of operations applied to the window across the run. Five carry most of the load.

**Compaction and summarization.** When the window approaches the useful ceiling, compress the run history: replace a long stretch of raw tool outputs and reasoning with a dense summary of what was learned and decided, and continue from the smaller window. Anthropic describes exactly this — summarizing conversation history and reinitializing the window with a distilled summary to hold coherence over a long interaction. Compaction is lossy by design; the engineering is in summarizing what the rest of the task needs and dropping what it does not.
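A minimal sketch of a compaction step, under stated assumptions: tokens are estimated at roughly four characters each, `summarize` is a hypothetical callable standing in for a model call, and the first two messages (system prompt and task) plus the last few messages are kept verbatim. None of these names come from a real library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def window_tokens(messages: List[Message]) -> int:
    return sum(estimate_tokens(m.text) for m in messages)


def compact(messages: List[Message],
            summarize: Callable[[List[Message]], str],
            useful_ceiling: int,
            keep_head: int = 2,
            keep_tail: int = 4) -> List[Message]:
    """Replace the middle of the run history with one summary message
    once the window exceeds the useful ceiling."""
    if window_tokens(messages) <= useful_ceiling:
        return messages
    if len(messages) <= keep_head + keep_tail:
        return messages  # nothing in the middle to compress
    split = len(messages) - keep_tail
    head, middle, tail = messages[:keep_head], messages[keep_head:split], messages[split:]
    summary = Message("summary", summarize(middle))
    return head + [summary] + tail
```

What `summarize` keeps is where the real engineering sits: the sketch only decides when to compact and which span to replace.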

**Selective retrieval into the window.** Do not front-load everything the agent might need. Keep reference material outside the window and retrieve the relevant slice _just in time_, when a step actually needs it — Anthropic calls this loading data at runtime via lightweight identifiers rather than pre-processing everything upfront. The window holds what the current step needs, not the union of what every step might.
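One way to picture just-in-time retrieval: the window carries a cheap index of identifiers, and content is loaded only for the references the current step names. The store, its contents, and the function names below are all hypothetical; in practice the store might be files, a database, or a search index.

```python
from typing import Dict, List

# Hypothetical reference store with made-up contents.
REFERENCE_STORE: Dict[str, str] = {
    "docs/billing.md": "Invoices are issued on the first day of each month.",
    "docs/auth.md": "Sessions expire after thirty minutes of inactivity.",
    "docs/deploy.md": "Deploys go out through the staging environment first.",
}


def list_references() -> List[str]:
    """Lightweight identifiers: cheap to keep in the window."""
    return sorted(REFERENCE_STORE)


def load_reference(ref_id: str, max_chars: int = 2_000) -> str:
    """Load one reference at runtime, truncated to a size budget."""
    return REFERENCE_STORE[ref_id][:max_chars]


def build_step_context(task: str, needed: List[str]) -> str:
    """The window holds the index plus only what this step needs."""
    index = "\n".join(f"- {ref_id}" for ref_id in list_references())
    loaded = "\n\n".join(f"[{ref_id}]\n{load_reference(ref_id)}" for ref_id in needed)
    return (f"Task: {task}\n\nAvailable references:\n{index}"
            f"\n\nLoaded for this step:\n{loaded}")


context = build_step_context("Explain when invoices go out", ["docs/billing.md"])
print(context)
```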

**An external scratchpad: structured note-taking.** Give the agent a place to write outside the context window: a notes file, a structured store, a working document it can append to and read back. Durable conclusions go to the scratchpad instead of riding in the window. Anthropic's example is Claude playing Pokémon, maintaining strategic notes across thousands of steps: the notes persist, so the window does not have to. This is working-context hygiene. The scratchpad is part of running one agent well, distinct from the long-term memory service the next section draws a line to.
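A scratchpad can be as simple as an append-only file. The sketch below assumes a JSON Lines file and a hypothetical `Scratchpad` class; the key names and notes are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


class Scratchpad:
    """Append-only notes file that lives outside the context window."""

    def __init__(self, path):
        self.path = Path(path)

    def write(self, key: str, note: str) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "note": note}) + "\n")

    def read(self) -> dict:
        """Return the latest note per key; later writes win."""
        notes = {}
        if not self.path.exists():
            return notes
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    notes[record["key"]] = record["note"]
        return notes


with tempfile.TemporaryDirectory() as tmp:
    pad = Scratchpad(Path(tmp) / "notes.jsonl")
    pad.write("rate_limit", "limit observed at step 12")
    pad.write("rate_limit", "limit re-checked and updated at step 40")
    notes = pad.read()

print(notes)
```

Because later writes win on read, a superseded value is replaced in the notes the agent reads back, which is the opposite of what happens when the old value is left sitting in the window.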

**Pruning stale tool results.** Once a tool result has been used and its conclusion captured, the raw result is dead weight — and, per the distractor finding, worse than dead weight. Drop it. A forty-step-old API response that has already informed a decision should not still be sitting in the window forty steps later. Pruning is the cheapest operation here and the most often skipped.
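A pruning pass can be a single filter over the window. This sketch assumes each entry records the step it arrived on and a `captured` flag that the agent loop sets once the result's conclusion has been written down elsewhere; both fields and the `max_age` threshold are assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    step: int
    kind: str               # e.g. "tool_result" or "reasoning"
    text: str
    captured: bool = False  # True once its conclusion is recorded elsewhere


def prune_stale(entries: List[Entry], current_step: int, max_age: int = 10) -> List[Entry]:
    """Drop tool results that are both captured and older than max_age steps."""
    kept = []
    for entry in entries:
        stale = (entry.kind == "tool_result"
                 and entry.captured
                 and current_step - entry.step > max_age)
        if not stale:
            kept.append(entry)
    return kept
```

Requiring `captured` before dropping is the conservative choice: an old result whose conclusion was never recorded stays in the window until something has actually been learned from it.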

**Isolating sub-tasks so they don't pollute the main window.** When a self-contained sub-task would generate a lot of intermediate tokens — a deep search, a multi-step lookup — run it in its own context and return only the result to the main window. Anthropic uses sub-agents this way: a specialist works in a clean window and hands back a condensed summary, often one to two thousand tokens, rather than dumping its full trace into the caller. This is the context-management reason a sub-agent can be the right call — narrower and more disciplined than the orchestration decision the [multi-agent essay](/blog/multi-agent-one-too-many/) weighs, and worth keeping separate from it.
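The isolation pattern reduces to one rule: the sub-task gets a fresh context, and only its condensed result crosses back. In the sketch below, `worker` is a hypothetical callable standing in for a sub-agent loop, and `demo_worker` is a stub that fakes fifty intermediate results.

```python
from typing import Callable, List


def run_isolated(subtask: str,
                 worker: Callable[[List[str]], str],
                 max_summary_chars: int = 8_000) -> str:
    """Run a sub-task in its own context and return only a condensed result."""
    sub_context = [f"Sub-task: {subtask}"]  # fresh window, none of the caller's history
    summary = worker(sub_context)           # the worker may fill sub_context freely
    return summary[:max_summary_chars]      # only the summary reaches the caller


def demo_worker(context: List[str]) -> str:
    # Stub sub-agent: generates many intermediate tokens, returns a short result.
    for i in range(50):
        context.append(f"intermediate search result {i}")
    return "Config found in one file; two other candidates ruled out."


main_window = ["system prompt", "task"]
main_window.append(run_isolated("locate the config file", demo_worker))
print(len(main_window))
```

The fifty intermediate entries exist only in the sub-task's list, which is discarded when `run_isolated` returns.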

```
   UNMANAGED WINDOW                    CONTEXT-ENGINEERED WINDOW
   ┌────────────────────┐             ┌────────────────────┐
   │ system prompt       │             │ system prompt       │
   │ task                │             │ task                │
   │ tool result   1     │             │ ── compacted ────── │ ◄ 1–60 summarized
   │ tool result   2     │             │   summary of run    │
   │ reasoning     ...   │   ──────►   │ retrieved: just the │ ◄ pulled in for
   │ tool result   ...   │             │   chunk step 61     │   the current step
   │ stale result  37    │             │   needs             │
   │ abandoned plan      │             │ live working state  │ ◄ stale results
   │ tool result   58    │             └────────────────────┘   pruned; notes
   │ tool result   59    │             ┌────────────────────┐   on the scratchpad
   │ tool result   60    │             │  EXTERNAL SCRATCHPAD│
   │ ... and climbing    │             │  durable notes,     │
   └────────────────────┘             │  sub-task results   │
   crosses the useful ceiling;         └────────────────────┘
   model degrades, cost compounds      stays under the ceiling, turn after turn
```

None of these is exotic. They are the operations that bend the fill curve back down, and a long-running agent needs all of them running continuously, not one of them run once.

## Context engineering is not the memory service

A distinction worth drawing sharply, because the two get conflated. The working context window — the subject of this post — is not the tiered memory architecture the [agent-memory-microservice essay](/blog/agent-memory-microservice/) describes.

That essay's `working memory` tier — the current task's plan, intermediate results, what the agent has tried this run — _is_ what context engineering manages. But the essay's larger argument is about the tiers _beyond_ the task: episodic memory (timestamped records of past sessions) and semantic memory (consolidated, durable facts), running as a microservice the agent calls, with a write/consolidate/forget lifecycle.

The line is the run boundary. Context engineering manages the window _within_ a single long run — what this agent holds right now, this turn. The memory service manages what survives _across_ runs and sessions — what the agent should still know next week. They meet at one seam: the memory service's `recall` operation retrieves into the working window, and that retrieval is one of the selective-retrieval moves above. But the disciplines are distinct: one is the budget of a live context window, the other is a tiered store with a forgetting policy. Compaction is not consolidation; pruning a stale tool result is not a right-to-be-forgotten deletion. Build both — and do not let either stand in for the other.

## How to measure context health

You cannot manage a window you do not instrument. The [agent-observability essay](/blog/agent-observability/) argues every agent step should emit a structured record; context health is a few more fields on that record, and a few signals read across a run.

- **Window occupancy per step.** Tokens in the context window at each step, as a fraction of the useful ceiling — not the hard limit. The fill curve from earlier, made real. A run that rides above the ceiling for most of its length is a run whose later half is degraded.
- **Steps since last compaction.** How long the window has been growing unmanaged. A large and rising number is a compaction that should have already fired.
- **Stale-token ratio.** Roughly, the share of the window occupied by tool results and reasoning older than some recency horizon. High ratio means the agent is reasoning through a thick layer of distractors.
- **Retrieved-chunk utilization.** Of the chunks retrieved into the window, how many the agent actually used. Low utilization means retrieval is padding the window with near-misses — itself a distractor source.
- **Quality-versus-position checks.** In evaluation, plant a needed fact at different depths in the window — start, middle, end — and confirm the agent uses it regardless. A measurable gap is the lost-in-the-middle effect showing up in _your_ agent, on _your_ task, and a signal that compaction and retrieval need to keep the fact where the model reads it.
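The first four signals above can be computed from a per-step snapshot of the window. The sketch assumes each window entry carries its arrival step, token count, kind, and a `used` flag for retrieved chunks; how "used" is decided (for example, whether the chunk was cited in the step's output) is left as an assumption, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowEntry:
    step: int
    tokens: int
    kind: str           # "system", "tool_result", "reasoning", "retrieved", ...
    used: bool = False  # retrieved chunks only: did the agent actually use it?


def context_health(entries: List[WindowEntry], current_step: int,
                   useful_ceiling: int, last_compaction_step: int,
                   recency_horizon: int = 20) -> dict:
    total = sum(e.tokens for e in entries)
    stale = sum(e.tokens for e in entries
                if e.kind in ("tool_result", "reasoning")
                and current_step - e.step > recency_horizon)
    retrieved = [e for e in entries if e.kind == "retrieved"]
    used = sum(1 for e in retrieved if e.used)
    return {
        "occupancy": total / useful_ceiling,  # fraction of the useful ceiling
        "steps_since_compaction": current_step - last_compaction_step,
        "stale_token_ratio": stale / total if total else 0.0,
        "retrieval_utilization": used / len(retrieved) if retrieved else None,
    }
```

Emitting this dictionary as extra fields on each step's structured record is enough to plot the fill curve and alert when occupancy passes 1.0.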

The point of measuring is that context degradation is silent — no error, no crash, just an agent getting quietly worse. The signals above are how a degrading window becomes visible before a customer finds it for you — and when a signal fires on a run you cannot explain from the trace alone, a [deterministically replayable](/blog/deterministic-replay-agents/) recording lets you step back through the window state by state.
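The quality-versus-position check from the list above can be run as a small eval harness. In this sketch, `ask` is a hypothetical callable wrapping the agent under test, and `stub_ask` is a stand-in that only looks at the first and last two lines, to mimic a model that loses the middle; the planted fact is invented.

```python
from typing import Callable, Dict, List, Sequence


def plant_fact(filler: List[str], fact: str, depth: float) -> List[str]:
    """Insert fact at a relative depth: 0.0 = start, 0.5 = middle, 1.0 = end."""
    index = int(depth * len(filler))
    return filler[:index] + [fact] + filler[index:]


def position_check(filler: List[str], fact: str, question: str, expected: str,
                   ask: Callable[[str, str], str],
                   depths: Sequence[float] = (0.0, 0.5, 1.0)) -> Dict[float, bool]:
    """For each depth, record whether the answer contains the expected value."""
    results = {}
    for depth in depths:
        context = "\n".join(plant_fact(filler, fact, depth))
        answer = ask(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results


def stub_ask(context: str, question: str) -> str:
    # Stand-in for the agent: only "reads" the first and last two lines.
    lines = context.split("\n")
    return " ".join(lines[:2] + lines[-2:])


filler = [f"unrelated note {i}" for i in range(10)]
results = position_check(filler, "the deploy key is K-42",
                         "What is the deploy key?", "K-42", stub_ask)
print(results)
```

With a real agent behind `ask`, a depth that comes back `False` while the others are `True` is the gap the bullet describes.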

## The checklist

Before a long-running agent goes to production:

- [ ] The context window is treated as a budget that fills across the run — not an input set once at the start.
- [ ] Compaction fires before the window crosses the useful ceiling, summarizing run history into a distilled form.
- [ ] Reference material is retrieved just in time for the step that needs it, not front-loaded into the window.
- [ ] The agent has an external scratchpad for durable notes and conclusions, so they do not have to ride in the window.
- [ ] Stale tool results are pruned once their conclusions are captured.
- [ ] Token-heavy sub-tasks run in isolated contexts and return condensed summaries.
- [ ] Working-context discipline (this) and the tiered memory service are kept distinct, and both exist.
- [ ] Context health is instrumented — window occupancy, stale-token ratio, retrieval utilization — and quality-versus-position is checked in eval.

When you can check all eight, the agent's context is engineered. Until then it is just filling up.

## Reading list

- Anthropic's [Effective context engineering for AI agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — the clearest articulation of context engineering as the successor to prompt engineering, and the source for compaction, just-in-time retrieval, and structured note-taking.
- Chroma's [Context Rot](https://www.trychroma.com/research/context-rot) — 18 models tested; the evidence that performance degrades with input length even on simple tasks, and that distractors hurt.
- Liu et al., [Lost in the Middle](https://arxiv.org/abs/2307.03172) — the 2023 study (TACL 2024) behind the U-shaped curve: models use the start and end of a long context far better than the middle.

The system prompt is the first thing the agent reads and the smallest part of the problem. The context window is everything else — and for an agent that runs for an hour, everything else is the job.