# Your APM cannot see your agent failing.

An agent ran for nine minutes, made forty-one tool calls, spent two dollars, and returned a confident, well-formatted answer that was wrong. Every span in the APM trace was green. Every HTTP call returned 200. Latency was inside SLA. Every dashboard a normal service is judged on said the system was healthy.

The agent had retrieved a stale document, fed it into a planner that looped on a search tool eleven times, truncated half the context to fit a window, and synthesized a fluent paragraph from the wreckage. None of that is visible to a tool built to watch request/response services, because none of it _is_ a request/response failure. The requests succeeded, the responses came back, the system still failed. APM was built for a world where a failure is an exception, a 500, or a latency spike. An agent fails in none of those ways — it fails by being plausibly, expensively, silently wrong, and you need observability built for _that_.

## The four blind spots

Standard monitoring catches the failures it was designed to catch. Agents have at least four that it was not.

**Silent tool failure.** A tool call returns HTTP 200 and a body. The body is `{"results": []}`, or an error rendered as prose, or last quarter's data because a cache never invalidated. To the HTTP layer this is a success; to the agent it is poison — the model takes the garbage at face value and reasons forward from it. Your APM records a 200 and a 40ms latency, and calls it a perfect request.

**Context truncation.** The agent assembles a prompt — system instructions, tool outputs, history, retrieved chunks — and it overflows the window. Something silently drops: the framework trims the oldest messages, or a tool result gets clipped mid-JSON. The model now reasons over a partial picture and has no idea. No error, no log line — just an answer built on two-thirds of the inputs.
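One way to make truncation loud is to have the trimming code report what it dropped, so the record for that step can say so. The sketch below is a minimal illustration, not any framework's actual behavior: `fit_to_window`, `TruncationReport`, and the whitespace-based `count_tokens` are all hypothetical stand-ins, and a real implementation would use the model's own tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class TruncationReport:
    budget_tokens: int
    kept_tokens: int = 0
    dropped_indices: list = field(default_factory=list)

    @property
    def truncated(self) -> bool:
        return bool(self.dropped_indices)


def count_tokens(text: str) -> int:
    # Assumption: a whitespace split stands in for the model's real tokenizer.
    return len(text.split())


def fit_to_window(messages, budget_tokens):
    """Keep the newest messages that fit the budget and report what was dropped."""
    report = TruncationReport(budget_tokens=budget_tokens)
    kept = []
    window_full = False
    # Walk newest to oldest, so the oldest messages are the ones that get dropped.
    for index in reversed(range(len(messages))):
        cost = count_tokens(messages[index])
        if not window_full and report.kept_tokens + cost <= budget_tokens:
            kept.append(messages[index])
            report.kept_tokens += cost
        else:
            window_full = True               # keep the surviving history contiguous
            report.dropped_indices.append(index)
    kept.reverse()
    report.dropped_indices.sort()
    return kept, report
```

Attaching `report.truncated` and `report.dropped_indices` to the step's record turns "no error, no log line" into a field you can count and alert on.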

**Runaway loops.** The planner calls a search tool, doesn't like the result, calls it again with a reworded query, "reflects," and calls it a third time. We covered why this happens — and why "let it think longer" makes it worse — in [the agent-budgets essay](/blog/agent-budgets/). The observability point is narrower: a loop is not a crash. Each call is fast and green. The loop is only visible if something counts calls _per task_ and notices this one took forty-one when the median takes five.
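Counting calls per task is a few lines once each step emits a record. A rough sketch, assuming records shaped as dicts with `trace_id` and `tool` keys; the function names and the `factor=3.0` cutoff are illustrative choices, not a recommended threshold.

```python
from collections import Counter
from statistics import median


def call_counts(records):
    """Count tool calls per task; pure reasoning steps (tool == "none") are skipped."""
    counts = Counter()
    for record in records:
        if record["tool"] != "none":
            counts[record["trace_id"]] += 1
    return counts


def suspected_loops(counts, factor=3.0):
    """Return the trace_ids whose call count exceeds factor times the median."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(tid for tid, n in counts.items() if n > factor * typical)
```

Comparing against the median rather than a fixed number means the check adapts as the typical task changes, at the cost of needing enough tasks for the median to mean something.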

**Degraded-but-not-failed reasoning.** The hardest one. The agent did not crash, loop, or truncate — it just reasoned badly. It picked a defensible-but-wrong tool, misread a correct result, drew a conclusion the evidence didn't support. The output is fluent, the trace is clean, and the only signal lives in the _semantic_ content of the steps. No infrastructure metric can see semantics.

The through-line: every one of these produces green infrastructure metrics. Latency, error rate, throughput, CPU — all healthy. An observability stack that watches only those is not watching your agent. It is watching the building the agent runs in.

## What an agent trace actually needs

A request trace answers one question: where did the time go? An agent trace answers a harder one: _was each decision sound, given what the agent knew when it made it?_

The unit of an agent trace is not the HTTP call. It is the **step** — one iteration of the plan-act-observe loop. For every step, the trace has to carry:

| Field           | Why it has to be there                                                      |
| --------------- | --------------------------------------------------------------------------- |
| `tool`          | Which tool was invoked — or `none` for a pure reasoning step                |
| `arguments`     | The exact arguments the model chose. Wrong arguments are a top failure mode |
| `intent`        | The model's _stated reason_ for this step, captured before it runs          |
| `result`        | The full tool output, untruncated — including the empty and error cases     |
| `result_status` | Did this result actually help: `ok` / `empty` / `error` / `stale`           |
| `tokens`        | Prompt and completion tokens for the step                                   |
| `cost`          | Dollars for the step                                                        |
| `latency`       | Wall-clock for the step, tool time and model time separated                 |

`intent` and `result_status` are the two fields a request trace never has. Because `intent` is captured _before_ the step runs, a reviewer can judge the decision instead of reverse-engineering it from the arguments. `result_status` makes silent failures loud: a 200 with `{"results": []}` is `empty`, not `ok`; a CRM returning last quarter's numbers is `stale`. A trace that records this classification surfaces what a raw HTTP status buries.
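The table above can be pinned down as a typed record so a missing field fails at construction time rather than at review time. This is one possible shape, not a standard: the class name `StepRecord`, the split of `tokens` into prompt and completion counts, and the `to_json` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Literal

ResultStatus = Literal["ok", "empty", "error", "stale"]


@dataclass(frozen=True)
class StepRecord:
    trace_id: str
    step_index: int
    tool: str                  # "none" for a pure reasoning step
    arguments: dict
    intent: str                # the model's stated reason, captured pre-call
    result: Any                # full tool output, untruncated
    result_status: ResultStatus
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    tool_ms: float             # tool time and model time kept separate
    model_ms: float

    def to_json(self) -> str:
        # default=str keeps odd payload types (dates, bytes) from raising.
        return json.dumps(asdict(self), default=str)
```

Note that `Literal` is only checked by a type checker, not at runtime, so a validating layer is still needed if records arrive from untyped code.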

The latency split matters too. The [agent-latency-math essay](/blog/agent-latency-math/) is about the gap between demo time and production time; an agent trace is where you see that gap step by step — but only if model time and tool time are recorded separately. Nine minutes waiting on a slow vendor API is a different bug than nine minutes of a model thinking in circles, and one number cannot tell them apart.

## Three layers of agent eval

A trace tells you what one run did, not whether the agent is _good_. For that you need evaluation, and one kind is not enough. We run three layers, answering three different questions.

**Layer one — unit evals on steps.** Given a fixed input state, does the agent take the right step? Pin the state, assert on the tool chosen and the argument shape — "given a query with an order ID, the next step calls `order.lookup` with that ID." No LLM judge needed; these assert on structure, run in seconds, and catch the regression where a prompt tweak quietly breaks tool selection.
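A layer-one eval can look like an ordinary unit test. In the sketch below, `plan_next_step` is a hypothetical regex-based stand-in for the real planner (which would normally be a model call, ideally at temperature zero or replayed from a recording), and the `ORD-1042` ID format is invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class PlannedStep:
    tool: str
    arguments: dict


def plan_next_step(state: dict) -> PlannedStep:
    # Hypothetical stand-in for the real planner, so the test below can run.
    match = re.search(r"\bORD-\d+\b", state["query"])
    if match:
        return PlannedStep("order.lookup", {"order_id": match.group(0)})
    return PlannedStep("search", {"query": state["query"]})


def test_order_id_routes_to_order_lookup():
    state = {"query": "Where is my order ORD-1042?"}   # pinned input state
    step = plan_next_step(state)
    # Assert on structure only: the tool chosen and the argument shape.
    assert step.tool == "order.lookup"
    assert step.arguments == {"order_id": "ORD-1042"}
```

The test never inspects prose, which is what keeps it fast and deterministic enough to gate every commit.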

**Layer two — LLM-as-judge regression.** Step asserts cannot grade _reasoning quality_ or _final-answer correctness_. For that you keep a golden set of tasks with known-good outcomes, replay the agent on every meaningful change, and have a judge model score the trace: was the answer right, was the path sane, did it ground its conclusion in real tool results. The harness has the same shape as any other evaluation harness: a JSON file of tasks, a judge prompt, and a score that gates a merge.
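The gate itself is small; the judge prompt is where the work goes. A sketch of the replay-and-score loop, with the agent and the judge passed in as plain callables. `fake_agent` and `exact_match_judge` are stand-ins so the example runs, and the `0.9` threshold is an arbitrary placeholder.

```python
from statistics import mean


def run_golden_set(tasks, run_agent, judge, threshold=0.9):
    """Replay every golden task, score each trace, and decide whether to gate."""
    scores = {}
    for task in tasks:
        trace = run_agent(task["input"])          # the full trace, not just the answer
        scores[task["id"]] = judge(task, trace)   # expected to return 0.0 to 1.0
    average = mean(list(scores.values()))
    return {"average": average, "passed": average >= threshold, "scores": scores}


# Stand-ins so the sketch runs; in practice both are model calls.
def fake_agent(task_input):
    return {"answer": task_input.upper(), "steps": []}


def exact_match_judge(task, trace):
    return 1.0 if trace["answer"] == task["expected"] else 0.0


golden = [
    {"id": "t1", "input": "abc", "expected": "ABC"},
    {"id": "t2", "input": "def", "expected": "wrong"},
]
report = run_golden_set(golden, fake_agent, exact_match_judge)
```

Keeping per-task scores in the report, not only the average, is what lets a reviewer see which golden task regressed.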

**Layer three — continuous production trace sampling.** The golden set is finite; the world is not. So you sample live traces — a fixed percentage, plus every trace that tripped a heuristic (call-count over threshold, non-zero `error`/`stale` count, cost outlier) — and run the same judge over them. This is how the golden set _grows_: a failure caught in sampling becomes tomorrow's layer-two case.
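The sampling rule described here, a fixed percentage plus every trace that trips a heuristic, might be sketched as follows. The per-task `summary` dict, its key names, and every default threshold are assumptions for illustration. Hashing the trace ID instead of calling a random generator makes the decision repeatable for the same trace.

```python
import hashlib


def should_sample(summary, rate=0.05, max_calls=15, max_cost_usd=1.0):
    """Decide whether a finished task's trace goes to the judge."""
    # Heuristic triggers: always keep traces that look pathological.
    if summary["call_count"] > max_calls:
        return True
    if summary["error_count"] > 0 or summary["stale_count"] > 0:
        return True
    if summary["cost_usd"] > max_cost_usd:
        return True
    # Fixed-rate sample: map the trace id to a stable bucket in [0, 1).
    digest = hashlib.sha256(summary["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

A fixed cost cutoff is the crudest form of "cost outlier"; a multiple of the rolling median would track drift better.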

```
          fast / cheap / deterministic
   L1  unit evals on steps ......... every commit, CI gate
   L2  LLM-judge on golden tasks ... every change, merge gate
   L3  judge on sampled prod traces  continuous, feeds L2
          slow / costly / fuzzy
```

Skip layer one and every prompt edit is a gamble. Skip layer two and you cannot tell a refactor from a regression. Skip layer three and you learn about a failure class when a customer does. Most teams build only layer two — it feels like "real" evaluation — and run blind on the other two.

## Instrumenting the loop

You do not need a new framework for this — just one structured record emitted per step. If your agent loop looks like the budgeted loop from [the agent-budgets essay](/blog/agent-budgets/), the instrumentation is a wrapper around the step.

```python
import time

MAX_STALENESS = 3600                           # seconds; placeholder, set per tool


def trace_step(state, planned_step, executor):
    # emit() and price() are your own sink and pricing helpers, defined elsewhere.
    t0 = time.perf_counter()
    result = executor(planned_step)            # the actual tool / model call
    wall_ms = (time.perf_counter() - t0) * 1000
    emit({
        "trace_id":   state.trace_id,          # threads the task together
        "step_index": state.step_count,        # orders the steps
        "tool":       planned_step.tool or "none",
        "arguments":  planned_step.arguments,
        "intent":     planned_step.intent,     # model's stated reason, pre-call
        "result":     result.body,             # full, untruncated
        "result_status": classify(result),     # ok | empty | error | stale
        "wall_ms":    wall_ms,                 # end-to-end, as a cross-check
        "tool_ms":    result.tool_ms,          # tool time and model time
        "model_ms":   result.model_ms,         #   kept separate, on purpose
        "tokens":     result.tokens,
        "cost_usd":   price(result.tokens, planned_step.tool),
    })
    return result


def classify(result) -> str:                   # the rule you own, per tool
    if result.http_error or result.exception:
        return "error"
    if result.is_empty():                       # [], "", {} — tool-specific
        return "empty"
    if result.staleness and result.staleness > MAX_STALENESS:
        return "stale"
    return "ok"

Three things make or break this. `emit` writes per step, not per task — a looping task that never returns has still left thirty readable records instead of zero; the trace of a hung agent is the most valuable one you have, so do not buffer it away. `result` is stored _untruncated_: the clipped middle is where the stale row and the silent failure live, and storage is cheap. And `classify` is a real function you own — trusting the HTTP status is the bug, because an empty result set is not an error to the transport layer and _is_ a failure to the agent.

`trace_id` threads every step of a task together and `step_index` orders them — enough to reconstruct the full plan-act-observe sequence and read it like a transcript.
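Reconstructing the transcript from a flat stream of records is then a group-and-sort. A small sketch, assuming dict records with the field names used above; `transcripts` and `render` are hypothetical helper names and the column layout is arbitrary.

```python
from collections import defaultdict


def transcripts(records):
    """Group per-step records by trace_id and order each group by step_index."""
    by_trace = defaultdict(list)
    for record in records:
        by_trace[record["trace_id"]].append(record)
    return {
        trace_id: sorted(steps, key=lambda r: r["step_index"])
        for trace_id, steps in by_trace.items()
    }


def render(steps):
    """Format one task's steps as a readable plan-act-observe transcript."""
    lines = []
    for step in steps:
        lines.append(
            f'{step["step_index"]:>3}  {step["tool"]:<14} '
            f'[{step["result_status"]}]  {step["intent"]}'
        )
    return "\n".join(lines)
```

Because records are emitted per step, this works on a task that is still running or that hung: the transcript simply stops where the agent did.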

## The tooling landscape

You do not have to build the trace store, judge runner, and dashboards yourself. A real category of tools exists for LLM and agent observability — and they are not interchangeable. Each leans toward a different job.

| Tool           | What it leans toward                                                                           |
| -------------- | ---------------------------------------------------------------------------------------------- |
| **Braintrust** | Eval-first. Strong for running judge-based scoring suites and gating changes on them in CI     |
| **LangSmith**  | Tracing and debugging, tightest if your agent is built on LangChain / LangGraph                |
| **Langfuse**   | Open-source tracing and metrics; self-hostable, which matters when traces carry sensitive data |
| **Arize**      | ML-observability lineage — drift, embedding analysis, production monitoring at scale           |
| **Galileo**    | Evaluation and guardrails, with a focus on detecting bad generations and hallucination         |

The split that matters is **tracing-first versus eval-first**. Tracing-first tools (LangSmith, Langfuse) are where you read what one run did and debug it. Eval-first tools (Braintrust, Galileo) are where you score many runs and gate a release. Arize comes at it from classical ML monitoring — most relevant once you are at production scale and care about drift over time.

Two pieces of advice. Instrument against an open standard, such as OpenTelemetry's generative-AI semantic conventions, which cover LLM and agent spans, so your traces are not hostage to one vendor's SDK. And do not let a tool's spans define your trace schema: the fields above are what you need, and if a tool has no slot for them, put them in a structured attribute. The tool is the storage and the UI. The schema is yours.
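"Put them in a structured attribute" usually means flattening, because span attribute values in OpenTelemetry are limited to primitives and arrays of primitives. The sketch below shows that flattening in plain Python without importing any SDK. The `agent.step.` prefix is a made-up namespace for this example, not part of any official convention, so check the current semantic conventions before choosing names.

```python
import json

PRIMITIVES = (str, bool, int, float)


def to_span_attributes(record, prefix="agent.step."):
    """Flatten a step record into primitive-valued attributes for a span."""
    attributes = {}
    for key, value in record.items():
        if value is None:
            continue                              # unset fields are simply omitted
        if isinstance(value, PRIMITIVES):
            attributes[prefix + key] = value
        else:
            # Nested payloads (arguments, result) travel as JSON strings.
            attributes[prefix + key] = json.dumps(value, sort_keys=True, default=str)
    return attributes
```

Many backends cap attribute length, so an untruncated `result` may need to live in your own store with only a reference on the span.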

## What "good" looks like

The bar a team should clear before putting an agent in front of real traffic is not "we have a dashboard." It is these specific things.

**Dashboards that show agent-shaped failure, not service-shaped health.** Five panels, none of which an APM gives you for free:

- Call-count distribution per task — median, p95, p99. The p95 is where the loops hide.
- `result_status` breakdown — rate of `empty`, `error`, `stale` across all tool calls. Rising `stale` is a data-freshness bug no latency chart will show you.
- Cost-per-task distribution. A task at 5x the median cost did something pathological.
- Reasoning-step ratio — fraction of steps with `tool: none`. Climbing means the agent is talking to itself instead of acting.
- Golden-set score over time, plotted per release.
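Most of these panels reduce to a handful of aggregations over the per-step records. A sketch using only the standard library; the function names are illustrative, and a production system would compute these in the metrics store rather than in application code.

```python
from collections import Counter
from statistics import quantiles

STATUSES = ("ok", "empty", "error", "stale")


def percentile_summary(values):
    """Median, p95 and p99 of a per-task metric such as call count or cost."""
    if len(values) < 2:
        raise ValueError("need at least two tasks to compute percentiles")
    cuts = quantiles(values, n=100, method="inclusive")   # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def status_breakdown(records):
    """Share of tool calls per result_status; reasoning steps are excluded."""
    tool_calls = [r for r in records if r["tool"] != "none"]
    if not tool_calls:
        return {status: 0.0 for status in STATUSES}
    counts = Counter(r["result_status"] for r in tool_calls)
    return {status: counts[status] / len(tool_calls) for status in STATUSES}


def reasoning_step_ratio(records):
    """Fraction of all steps that invoked no tool."""
    if not records:
        return 0.0
    return sum(r["tool"] == "none" for r in records) / len(records)
```

Feeding `percentile_summary` the per-task call counts gives the first panel, and feeding it per-task cost gives the third.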

**Alerts on agent-native signals.** Page on the same metrics, not on CPU: p95 call-count crossing a threshold (loops emerging), a single tool's `stale`/`error` rate spiking (an integration degrading), the golden-set score dropping between releases (a regression shipped), cost-per-task p95 jumping (find out why before the invoice arrives).
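Those four alerts can be expressed as a comparison between the current metrics window and the previous one. This is a sketch only: the dict keys, the `limits` names, and the message wording are all assumptions, and real alerting would live in your monitoring system's rule language.

```python
def evaluate_alerts(current, previous, limits):
    """Compare this window's agent metrics with the last one; return alert messages."""
    alerts = []
    if current["call_count_p95"] > limits["max_call_count_p95"]:
        alerts.append("p95 call-count over threshold: loops may be emerging")
    # bad_result_rate maps each tool to its combined stale + error share.
    for tool, rate in sorted(current["bad_result_rate"].items()):
        if rate > limits["max_bad_result_rate"]:
            alerts.append(f"{tool}: stale/error rate at {rate:.0%}")
    if previous["golden_score"] - current["golden_score"] > limits["max_golden_drop"]:
        alerts.append("golden-set score dropped between releases")
    if current["cost_p95"] > limits["cost_jump_factor"] * previous["cost_p95"]:
        alerts.append("cost-per-task p95 jumped")
    return alerts
```

Evaluating the stale/error rate per tool rather than in aggregate is what points the page at the integration that degraded.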

**A review habit, not just automation.** Someone reads sampled production traces every week — actually reads the plan-act-observe transcript, the way [the agent-budgets essay](/blog/agent-budgets/) describes pulling the last 200 traces and eyeballing the shape. Dashboards tell you a number moved; reading a trace tells you _why_. There is no metric for "the agent's reasoning was subtly off," and there never will be. A human reading transcripts is not a stopgap until the tooling improves — it is part of the tooling.

MCP-backed agents calling real tools in production are common enough now that this is an operational problem, not a research one. The agent is going to fail silently, with every infrastructure metric green. Whether you find out from a dashboard on Tuesday or from a customer on Friday is entirely a function of whether you instrumented for the way agents actually break.

## The checklist

Before an agent touches production traffic:

- [ ] Every step emits a structured record — `tool`, `arguments`, `intent`, `result`, `result_status`, `tokens`, `cost`, `latency`.
- [ ] `intent` captured _before_ the step runs; `result_status` classified by a rule you own, not the HTTP status.
- [ ] Tool results stored untruncated; records emit per step, so a hung task still leaves a trace.
- [ ] Layer one — unit evals on step selection, gating CI.
- [ ] Layer two — an LLM-judge golden set, gating merges.
- [ ] Layer three — continuous trace sampling, feeding the golden set.
- [ ] Dashboards for call-count, `result_status`, and cost _distributions_ — not averages.
- [ ] Alerts on p95 call-count, per-tool `stale`/`error` rate, and golden-set score.
- [ ] A standing weekly habit of a human reading sampled traces.

When you can check all nine, your observability can see your agent. Until then, you are watching the building.

## Reading list

- [Braintrust](https://www.braintrust.dev/) — eval-first platform; useful if your priority is judge-based scoring suites that gate releases.
- [Langfuse](https://langfuse.com/) — open-source LLM tracing and metrics, self-hostable, which matters when traces carry sensitive data.
- [LangSmith](https://www.langchain.com/langsmith) — tracing and debugging, tightest fit for agents built on LangChain or LangGraph.
- [Arize](https://arize.com/) — ML observability with drift and embedding analysis, aimed at production scale.
- [OpenTelemetry](https://opentelemetry.io/) — the open standard to instrument against so your traces outlive any one vendor's SDK.

Your APM is not lying to you. It is answering the question it was built to answer. It was just never the question an agent makes you ask.