Your APM cannot see your agent failing.

An agent ran for nine minutes, made forty-one tool calls, spent two dollars, and returned a confident, well-formatted answer that was wrong. Every span in the APM trace was green. Every HTTP call returned 200. Latency was inside SLA. Every dashboard a normal service is judged on said the system was healthy.

The agent had retrieved a stale document, fed it into a planner that looped on a search tool eleven times, truncated half the context to fit a window, and synthesized a fluent paragraph from the wreckage. None of that is visible to a tool built to watch request/response services, because none of it is a request/response failure. The requests succeeded and the responses came back, yet the system still failed. APM was built for a world where a failure is an exception, a 500, or a latency spike. An agent fails in none of those ways. It fails by being plausibly, expensively, silently wrong, and you need observability built for that.

The four blind spots

Standard monitoring catches the failures it was designed to catch. Agents have at least four that it was not.

Silent tool failure. A tool call returns HTTP 200 and a body. The body is {"results": []}, or an error rendered as prose, or last quarter’s data because a cache never invalidated. To the HTTP layer this is a success; to the agent it is poison — the model takes the garbage at face value and reasons forward from it. Your APM records a 200 and a 40ms latency, and calls it a perfect request.

Context truncation. The agent assembles a prompt — system instructions, tool outputs, history, retrieved chunks — and it overflows the window. Something silently drops: the framework trims the oldest messages, or a tool result gets clipped mid-JSON. The model now reasons over a partial picture and has no idea. No error, no log line — just an answer built on two-thirds of the inputs.
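
One way to make truncation visible is to have the trimming code report what it dropped instead of dropping it silently. The sketch below assumes oldest-first trimming over a list of message dicts; `fit_to_window` and `TrimReport` are hypothetical names, and `count_tokens` is a crude whitespace stand-in for the model's real tokenizer.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Assumption: a whitespace split stands in for the model's real tokenizer.
    return len(text.split())


@dataclass
class TrimReport:
    budget: int
    tokens_before: int
    tokens_after: int
    dropped_indices: list[int] = field(default_factory=list)

    @property
    def truncated(self) -> bool:
        return bool(self.dropped_indices)


def fit_to_window(messages: list[dict], budget: int) -> tuple[list[dict], TrimReport]:
    """Drop the oldest non-system messages until the prompt fits the budget."""
    sizes = [count_tokens(m["content"]) for m in messages]
    total = sum(sizes)
    report = TrimReport(budget=budget, tokens_before=total, tokens_after=total)
    keep = [True] * len(messages)
    for i, msg in enumerate(messages):      # oldest first
        if total <= budget:
            break
        if msg["role"] == "system":         # never drop the instructions
            continue
        keep[i] = False
        total -= sizes[i]
        report.dropped_indices.append(i)
    report.tokens_after = total
    kept = [m for m, k in zip(messages, keep) if k]
    return kept, report
```

Attaching the report to the step record turns "an answer built on two-thirds of the inputs" into a field a reviewer can filter on.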

Runaway loops. The planner calls a search tool, doesn’t like the result, calls it again with a reworded query, “reflects,” and calls it a third time. We covered why this happens — and why “let it think longer” makes it worse — in the agent-budgets essay. The observability point is narrower: a loop is not a crash. Each call is fast and green. The loop is only visible if something counts calls per task and notices this one took forty-one when the median takes five.

Degraded-but-not-failed reasoning. The hardest one. The agent did not crash, loop, or truncate — it just reasoned badly. It picked a defensible-but-wrong tool, misread a correct result, drew a conclusion the evidence didn’t support. The output is fluent, the trace is clean, and the only signal lives in the semantic content of the steps. No infrastructure metric can see semantics.

The through-line: every one of these produces green infrastructure metrics. Latency, error rate, throughput, CPU — all healthy. An observability stack that watches only those is not watching your agent. It is watching the building the agent runs in.

What an agent trace actually needs

A request trace answers one question: where did the time go? An agent trace answers a harder one: was each decision sound, and what information was it based on?

The unit of an agent trace is not the HTTP call. It is the step — one iteration of the plan-act-observe loop. For every step, the trace has to carry:

Field	Why it has to be there
`tool`	Which tool was invoked — or `none` for a pure reasoning step
`arguments`	The exact arguments the model chose. Wrong arguments are a top failure mode
`intent`	The model’s stated reason for this step, captured before it runs
`result`	The full tool output, untruncated — including the empty and error cases
`result_status`	Did this result actually help: `ok` / `empty` / `error` / `stale`
`tokens`	Prompt and completion tokens for the step
`cost`	Dollars for the step
`latency`	Wall-clock for the step, tool time and model time separated

`intent` and `result_status` are the two fields a request trace never has. `intent`, captured before the step runs, lets a reviewer judge the decision instead of reverse-engineering it from the arguments. `result_status` makes silent failures loud: a 200 with `{"results": []}` is `empty`, not `ok`; a CRM returning last quarter’s numbers is `stale`. A trace that records this classification surfaces what a raw HTTP status buries.

The latency split matters too. The agent-latency-math essay is about the gap between demo time and production time; an agent trace is where you see that gap step by step — but only if model time and tool time are recorded separately. Nine minutes waiting on a slow vendor API is a different bug than nine minutes of a model thinking in circles, and one number cannot tell them apart.
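
The fields in the table map onto a single typed record. This is a minimal sketch, assuming Python dataclasses; the class name `StepRecord`, the split of `tokens` into prompt and completion counts, and the `dominant_latency` helper are illustrative choices rather than a fixed schema.

```python
from dataclasses import dataclass, asdict
from typing import Any, Literal

ResultStatus = Literal["ok", "empty", "error", "stale"]


@dataclass(frozen=True)
class StepRecord:
    trace_id: str
    step_index: int
    tool: str                     # "none" for a pure reasoning step
    arguments: dict[str, Any]     # exactly what the model chose
    intent: str                   # captured before the step runs
    result: Any                   # full tool output, untruncated
    result_status: ResultStatus
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    tool_ms: float                # time spent waiting on the tool
    model_ms: float               # time spent in the model

    @property
    def total_ms(self) -> float:
        return self.tool_ms + self.model_ms

    def dominant_latency(self) -> str:
        """Say which side of the split the step's time went to."""
        return "tool" if self.tool_ms >= self.model_ms else "model"
```

Keeping `tool_ms` and `model_ms` as separate fields is what lets a slow vendor API and a model thinking in circles show up as different shapes in the same trace.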

Three layers of agent eval

A trace tells you what one run did, not whether the agent is good. For that you need evaluation, and one kind is not enough. We run three layers, answering three different questions.

Layer one — unit evals on steps. Given a fixed input state, does the agent take the right step? Pin the state, assert on the tool chosen and the argument shape — “given a query with an order ID, the next step calls order.lookup with that ID.” No LLM judge needed; these assert on structure, run in seconds, and catch the regression where a prompt tweak quietly breaks tool selection.
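
A layer-one check can be an ordinary unit test. In the sketch below, `plan_next_step` is a hypothetical stand-in for the agent's planner so that the example runs on its own; a real version would call the model against the pinned state. The `ORD-` order ID format and the `search` fallback tool are also invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class PlannedStep:
    tool: str
    arguments: dict


def plan_next_step(state: dict) -> PlannedStep:
    # Stand-in planner so the example runs; swap in the real agent call.
    match = re.search(r"\bORD-\d+\b", state["query"])
    if match:
        return PlannedStep("order.lookup", {"order_id": match.group()})
    return PlannedStep("search", {"query": state["query"]})


def test_order_id_routes_to_order_lookup():
    state = {"query": "Where is ORD-1042? It was due Monday.", "history": []}
    step = plan_next_step(state)
    assert step.tool == "order.lookup"                  # right tool
    assert step.arguments == {"order_id": "ORD-1042"}   # right argument shape


def test_no_order_id_falls_back_to_search():
    step = plan_next_step({"query": "What is your refund policy?", "history": []})
    assert step.tool == "search"
    assert set(step.arguments) == {"query"}
```

The assertions look only at structure (tool name, argument keys and values), which is why no judge model is involved.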

Layer two — LLM-as-judge regression. Step asserts cannot grade reasoning quality or final-answer correctness. For that you keep a golden set of tasks with known-good outcomes, replay the agent on every meaningful change, and have a judge model score the trace: was the answer right, was the path sane, did it ground its conclusion in real tool results. This is the same harness shape evaluation generally calls for: a JSON file of tasks, a judge prompt, and a score that gates a merge.
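
The harness itself can be small. The sketch below assumes the golden set is a JSON list of tasks and that both the agent and the judge are plain callables; `fake_agent` and `exact_match_judge` are stubs standing in for model calls, and the task contents and the 0.9 threshold are placeholders.

```python
import json
from typing import Callable

GOLDEN_JSON = """
[
  {"id": "refund-01", "task": "Refund status for ORD-1042?", "expected": "refunded"},
  {"id": "stock-02",  "task": "Is SKU-7 in stock?",          "expected": "in stock"}
]
"""


def run_golden_set(
    tasks: list[dict],
    run_agent: Callable[[str], dict],
    judge: Callable[[dict, dict], float],
    pass_threshold: float = 0.9,
) -> dict:
    """Replay every golden task, score each trace, and decide the gate."""
    if not tasks:
        raise ValueError("golden set is empty")
    scores = {}
    for task in tasks:
        trace = run_agent(task["task"])          # answer plus recorded steps
        scores[task["id"]] = judge(task, trace)  # 0.0 .. 1.0
    mean = sum(scores.values()) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= pass_threshold}


# Stubs so the sketch runs; in practice both would call a model.
def fake_agent(task: str) -> dict:
    answer = "refunded" if "ORD-1042" in task else "unknown"
    return {"answer": answer, "steps": []}


def exact_match_judge(task: dict, trace: dict) -> float:
    return 1.0 if task["expected"] in trace["answer"] else 0.0


report = run_golden_set(json.loads(GOLDEN_JSON), fake_agent, exact_match_judge)
```

In CI, a `passed` value of `False` would fail the job; keeping the per-task scores makes it possible to see which golden case regressed.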

Layer three — continuous production trace sampling. The golden set is finite; the world is not. So you sample live traces — a fixed percentage, plus every trace that tripped a heuristic (call-count over threshold, non-zero error/stale count, cost outlier) — and run the same judge over them. This is how the golden set grows: a failure caught in sampling becomes tomorrow’s layer-two case.
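
The sampling decision can be sketched as a function over a per-trace summary. The summary keys, the thresholds, and the 2% rate below are placeholder assumptions; hashing the trace ID is one way to keep the fixed-percentage sample deterministic, so a given trace is always in or always out.

```python
import hashlib


def in_fixed_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same verdict."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


def sampling_reasons(summary: dict, *, rate: float = 0.02, max_calls: int = 15,
                     cost_ceiling_usd: float = 0.50) -> list[str]:
    """Return why a trace should be judged; an empty list means skip it."""
    reasons = []
    if summary["call_count"] > max_calls:
        reasons.append("call_count")
    if summary["error_count"] + summary["stale_count"] > 0:
        reasons.append("error_or_stale")
    if summary["cost_usd"] > cost_ceiling_usd:
        reasons.append("cost_outlier")
    if in_fixed_sample(summary["trace_id"], rate):
        reasons.append("random_sample")
    return reasons
```

Recording the reasons alongside the judge's score shows which heuristics actually find failures and which only add judging cost.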

          fast / cheap / deterministic
   L1  unit evals on steps ......... every commit, CI gate
   L2  LLM-judge on golden tasks ... every change, merge gate
   L3  judge on sampled prod traces  continuous, feeds L2
          slow / costly / fuzzy

Skip layer one and every prompt edit is a gamble. Skip layer two and you cannot tell a refactor from a regression. Skip layer three and you learn about a failure class when a customer does. Most teams build only layer two — it feels like “real” evaluation — and run blind on the other two.

Instrumenting the loop

You do not need a new framework for this — just one structured record emitted per step. If your agent loop looks like the budgeted loop from the agent-budgets essay, the instrumentation is a wrapper around the step.

import time


def trace_step(state, planned_step, executor):
    # emit() and price() are assumed to be defined elsewhere in your codebase.
    t0 = time.perf_counter()
    result = executor(planned_step)            # the actual tool / model call
    wall_ms = (time.perf_counter() - t0) * 1000
    emit({
        "trace_id":   state.trace_id,          # threads the task together
        "step_index": state.step_count,        # orders the steps
        "tool":       planned_step.tool or "none",
        "arguments":  planned_step.arguments,
        "intent":     planned_step.intent,     # model's stated reason, pre-call
        "result":     result.body,             # full, untruncated
        "result_status": classify(result),     # ok | empty | error | stale
        "tool_ms":    result.tool_ms,          # tool time and model time
        "model_ms":   result.model_ms,         #   kept separate, on purpose
        "wall_ms":    wall_ms,                 # end-to-end, cross-checks the split
        "tokens":     result.tokens,
        "cost_usd":   price(result.tokens, planned_step.tool),
    })
    return result


def classify(result) -> str:                   # the rule you own, per tool
    if result.http_error or result.exception:
        return "error"
    if result.is_empty():                       # [], "", {} — tool-specific
        return "empty"
    # MAX_STALENESS is a freshness limit you define, per tool.
    if result.staleness is not None and result.staleness > MAX_STALENESS:
        return "stale"
    return "ok"

Three things make or break this. `emit` writes per step, not per task: a looping task that never returns has still left thirty readable records instead of zero. The trace of a hung agent is the most valuable one you have, so do not buffer it away. `result` is stored untruncated: the clipped middle is where the stale row and the silent failure live, and storage is cheap. And `classify` is a real function you own. Trusting the HTTP status is the bug, because an empty result set is not an error to the transport layer but is a failure to the agent.
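
A per-step writer can be as simple as one JSON line per record, flushed immediately. This is a sketch under the assumption that records go to a local append-only stream; `make_emitter` is a hypothetical helper, and a real deployment might send the same record to a queue or a tracing backend instead.

```python
import io
import json
import os
from typing import Any, Callable, IO


def make_emitter(stream: IO[str], durable: bool = False) -> Callable[[dict[str, Any]], None]:
    """Return an emit(record) that writes one JSON line per step, flushed each time."""
    def emit(record: dict[str, Any]) -> None:
        # default=str keeps odd values (datetimes, exceptions) from crashing the trace.
        stream.write(json.dumps(record, default=str) + "\n")
        stream.flush()                       # hand the line to the OS right away
        if durable:
            os.fsync(stream.fileno())        # assumption: stream is a real file
    return emit


# Usage with an in-memory stream; in production this would be an append-mode file.
buf = io.StringIO()
emit = make_emitter(buf)
emit({"trace_id": "t-1", "step_index": 0, "tool": "search", "result_status": "empty"})
emit({"trace_id": "t-1", "step_index": 1, "tool": "none", "result_status": "ok"})
```

Because each line is written and flushed before the next step starts, a task that hangs on step thirty-one still leaves thirty complete lines behind.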

`trace_id` threads every step of a task together and `step_index` orders them, which is enough to reconstruct the full plan-act-observe sequence and read it like a transcript.
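
Reconstructing that transcript from stored records takes only a group-and-sort. The sketch assumes records are dicts carrying the fields emitted by the wrapper; the one-line rendering format is an arbitrary choice.

```python
from collections import defaultdict


def transcripts(records: list[dict]) -> dict[str, list[str]]:
    """Group step records by trace_id and render each task in step order."""
    by_trace = defaultdict(list)
    for rec in records:
        by_trace[rec["trace_id"]].append(rec)
    out = {}
    for trace_id, steps in by_trace.items():
        steps.sort(key=lambda r: r["step_index"])
        out[trace_id] = [
            f'{r["step_index"]:>2} [{r["result_status"]:<5}] {r["tool"]}: {r["intent"]}'
            for r in steps
        ]
    return out
```

Putting `intent` next to `result_status` on each line lets a reviewer spot a step whose stated reason does not match what came back.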

The tooling landscape

You do not have to build the trace store, judge runner, and dashboards yourself. A real category of tools exists for LLM and agent observability — and they are not interchangeable. Each leans toward a different job.

Tool	What it leans toward
Braintrust	Eval-first. Strong for running judge-based scoring suites and gating changes on them in CI
LangSmith	Tracing and debugging, tightest if your agent is built on LangChain / LangGraph
Langfuse	Open-source tracing and metrics; self-hostable, which matters when traces carry sensitive data
Arize	ML-observability lineage — drift, embedding analysis, production monitoring at scale
Galileo	Evaluation and guardrails, with a focus on detecting bad generations and hallucination

The split that matters is tracing-first versus eval-first. Tracing-first tools (LangSmith, Langfuse) are where you read what one run did and debug it. Eval-first tools (Braintrust, Galileo) are where you score many runs and gate a release. Arize comes at it from classical ML monitoring — most relevant once you are at production scale and care about drift over time.

Two pieces of advice. Instrument against an open standard so your traces are not hostage to one vendor’s SDK: OpenTelemetry publishes semantic conventions for generative-AI spans, covering LLM calls and agent operations, though they are still evolving. And do not let a tool’s spans define your trace schema: the fields above are what you need, and if a tool has no slot for them, put them in a structured attribute. The tool is the storage and the UI. The schema is yours.
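
Putting your own fields into structured attributes mostly means flattening the step record into primitive values, since span attributes generally accept strings, booleans, numbers, and lists of those rather than nested objects. The sketch below uses only the standard library; the `agent.step.` prefix is a made-up namespace, not part of any published convention.

```python
import json
from typing import Any, Union

AttrValue = Union[str, bool, int, float]


def to_span_attributes(record: dict[str, Any], prefix: str = "agent.step") -> dict[str, AttrValue]:
    """Flatten a step record into primitive-valued attributes.

    Nested values such as arguments or result are JSON-encoded into strings.
    """
    attrs: dict[str, AttrValue] = {}
    for key, value in record.items():
        if value is None:
            continue                                   # skip nulls rather than store them
        if isinstance(value, (str, bool, int, float)):
            attrs[f"{prefix}.{key}"] = value
        else:
            attrs[f"{prefix}.{key}"] = json.dumps(value, default=str)
    return attrs
```

With the OpenTelemetry Python API, the resulting dict could be passed to a span's `set_attributes` method. Some backends cap attribute size, so check whether an untruncated `result` fits or needs to be stored separately and referenced by ID.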

What “good” looks like

This is the bar a team should clear before putting an agent in front of real traffic: not “we have a dashboard,” but these specific things.

Dashboards that show agent-shaped failure, not service-shaped health. Five panels, none of which an APM gives you for free:

Call-count distribution per task — median, p95, p99. The p95 is where the loops hide.
result_status breakdown — rate of empty, error, stale across all tool calls. Rising stale is a data-freshness bug no latency chart will show you.
Cost-per-task distribution. A task at 5x the median cost did something pathological.
Reasoning-step ratio — fraction of steps with tool: none. Climbing means the agent is talking to itself instead of acting.
Golden-set score over time, plotted per release.
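
Several of these panels fall straight out of the per-step records. The sketch below assumes records are dicts with `trace_id`, `tool`, and `result_status`, and uses a nearest-rank percentile; tasks that made no tool calls at all are left out of the call-count distribution, which a real dashboard might want to handle differently.

```python
import math
from collections import Counter


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def panel_metrics(records: list[dict]) -> dict:
    """Compute call-count percentiles, status rates, and the reasoning-step ratio."""
    tool_records = [r for r in records if r["tool"] != "none"]
    calls_per_task = Counter(r["trace_id"] for r in tool_records)
    counts = list(calls_per_task.values())
    statuses = Counter(r["result_status"] for r in tool_records)
    return {
        "calls_p50": percentile(counts, 50),
        "calls_p95": percentile(counts, 95),
        "status_rate": {s: n / len(tool_records) for s, n in statuses.items()},
        "reasoning_ratio": (len(records) - len(tool_records)) / len(records),
    }
```

Cost-per-task follows the same pattern: sum `cost_usd` per `trace_id`, then take percentiles of the sums rather than an average.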

Alerts on agent-native signals. Page on the same metrics, not on CPU: p95 call-count crossing a threshold (loops emerging), a single tool’s stale/error rate spiking (an integration degrading), the golden-set score dropping between releases (a regression shipped), cost-per-task p95 jumping (find out why before the invoice arrives).
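
Those four alert conditions can be written as a comparison between the current metrics window and the previous one. Every key name and threshold in this sketch is a placeholder assumption to be tuned against your own baselines.

```python
def agent_alerts(current: dict, previous: dict, *, max_calls_p95: int = 15,
                 max_bad_rate: float = 0.05, max_score_drop: float = 0.03,
                 max_cost_jump: float = 1.5) -> list[str]:
    """Compare this window's agent metrics with the last one and list what to page on."""
    alerts = []
    if current["calls_p95"] > max_calls_p95:
        alerts.append("loops: p95 call-count over threshold")
    for tool, rate in current["bad_rate_by_tool"].items():   # stale + error share
        if rate > max_bad_rate:
            alerts.append(f"integration degrading: {tool}")
    if previous["golden_score"] - current["golden_score"] > max_score_drop:
        alerts.append("regression: golden-set score dropped")
    if current["cost_p95_usd"] > max_cost_jump * previous["cost_p95_usd"]:
        alerts.append("cost: p95 cost-per-task jumped")
    return alerts
```

Evaluating the bad-result rate per tool rather than globally keeps one degrading integration from being averaged away by the healthy ones.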

A review habit, not just automation. Someone reads sampled production traces every week — actually reads the plan-act-observe transcript, the way the agent-budgets essay describes pulling the last 200 traces and eyeballing the shape. Dashboards tell you a number moved; reading a trace tells you why. There is no metric for “the agent’s reasoning was subtly off,” and there never will be. A human reading transcripts is not a stopgap until the tooling improves — it is part of the tooling.

MCP-backed agents calling real tools in production are common enough now that this is an operational problem, not a research one. The agent is going to fail silently, with every infrastructure metric green. Whether you find out from a dashboard on Tuesday or from a customer on Friday is entirely a function of whether you instrumented for the way agents actually break.

The checklist

Before an agent touches production traffic:

Every step emits a structured record — tool, arguments, intent, result, result_status, tokens, cost, latency.
intent captured before the step runs; result_status classified by a rule you own, not the HTTP status.
Tool results stored untruncated; records emit per step, so a hung task still leaves a trace.
Layer one — unit evals on step selection, gating CI.
Layer two — an LLM-judge golden set, gating merges.
Layer three — continuous trace sampling, feeding the golden set.
Dashboards for call-count, result_status, and cost distributions — not averages.
Alerts on p95 call-count, per-tool stale/error rate, and golden-set score.
A standing weekly habit of a human reading sampled traces.

When you can check all nine, your observability can see your agent. Until then, you are watching the building.

Reading list

Braintrust — eval-first platform; useful if your priority is judge-based scoring suites that gate releases.
Langfuse — open-source LLM tracing and metrics, self-hostable, which matters when traces carry sensitive data.
LangSmith — tracing and debugging, tightest fit for agents built on LangChain or LangGraph.
Arize — ML observability with drift and embedding analysis, aimed at production scale.
OpenTelemetry — the open standard to instrument against so your traces outlive any one vendor’s SDK.

Your APM is not lying to you. It is answering the question it was built to answer. It was just never the question an agent makes you ask.
