◇ FIELD NOTES

PRODUCTION AI · ON-CHAIN INFERENCE

Engineering notes from the field. Numbers, code, and opinions from the team.

Production AI and verifiable on-chain inference, written by the engineers who ship them. Two essays a month. Unsubscribe with one click.

[Image: hand-folded paper polyhedron with sharp triangular facets, suspended in matte black space. Field Notes header.]

2026.07.16 SECURITY

Indirect prompt injection, by the numbers.

Our taxonomy argued prompt injection is a vulnerability class you contain, not a bug you fix. Here is the quantitative half — against the strongest published defenses, indirect-injection attack success stays high, and for agents that can act it stays alarming.

10 min →

2026.07.09 TRAINING

Your learning-rate schedule silently overrides your data-curation decisions.

A quality-ascending curriculum beats random shuffling — until a decaying schedule delivers your best data exactly when the learning rate is too small to absorb it. The same coupling flips proxy-model ablations: which dataset wins depends on the schedule, not the data alone.

13 min →

2026.07.08 SECURITY

Approving an agent's action is not authorizing it.

An agent workflow that pauses for human approval and then resumes looks safe — a person clicked approve. But the approval decision and the authority the resumed step runs under are two different objects, and most systems conflate them by carrying the approval as an in-band signal the resumed call trusts. That is a confused-deputy bug: a forged resume, a replayed request, or an approval granted at one gate authorizes an action it was never meant to. The fix is to stop transporting approval and start deriving authority from the trusted approval record at resume time, scoped to the exact suspended step — capabilities enforced at the tool boundary, fail-closed by default.

14 min →

2026.07.07 VOICE · BUILD NOTE

Turn-based voice RAG: the whole loop on one stack.

A mic button on a grounded-answer dialog: Whisper for transcription, the unchanged retrieval path, per-sentence TTS for playback. What broke, what it costs, and why the label says turn-based instead of realtime.

6 min →

2026.07.05 TRAINING

Your MoE model does not route evenly.

The load-balancing loss in a mixture-of-experts model is a training-time regularizer that stops the router from collapsing — not a promise of even routing at inference. A competently trained MoE ships with deliberately skewed routing, because even routing means the experts never specialized. Expert-parallel serving sized for uniform load under-provisions the hot GPU and mis-budgets tail latency.

12 min →

2026.07.04 TRAINING

Process reward models grade fluency, not reasoning.

A process reward model is a learned grader of reasoning steps — and what it learned to detect is confident, fluent-sounding presentation, not logical validity. Point reinforcement learning at that gap and the gap becomes the objective.

11 min →

2026.07.01 SECURITY

You scan the model you receive and deploy a different one.

A clean backdoor scan on a downloaded model certifies one artifact. Quantization, distillation, merging, and fine-tuning each produce a different artifact — and a quantization-activated backdoor is engineered to be invisible in the model you scan and live in the model you serve. For the other transforms the scan misleads you either way. The only sound scan is of the post-transform bytes you actually deploy.

11 min →

2026.07.01 SECURITY

Weights provenance: supply-chain security for the model itself.

A tampered model passes your benchmarks — that is what the tamper is built to do — and you cannot reliably scrub a backdoor out after the fact. The only defense is knowing, and proving, that the weights were never altered. Provenance for the model as an artifact.

10 min →

2026.06.30 EVAL

Constrained decoding doesn't cost you accuracy — your prompt does.

The widely-cited result that structured output hurts a model's reasoning measured something real — but not what it is quoted as measuring. Most of the drop is a non-equivalent prompt and regex-based scoring, plus a decoder that renormalizes probabilities naively. With an equivalent prompt, a distribution-aware decoder, and a schema roomy enough to hold the model's reasoning, the causal effect of the format constraint on accuracy is near zero.

12 min →

2026.06.29 VOICE

Voice agent evals: scoring a conversation, not a transcript.

A voice agent can pass every transcript-level metric — low word error, correct answers — and still be unbearable to talk to. The caller experiences timing, turn-taking, and recovery, none of which a transcript records. How to score the conversation.

10 min →

2026.06.28 TRAINING

Test-time compute moved the compute-optimal pre-training point.

Chinchilla tells you the model size that minimizes loss for a training-compute budget. But a model is not trained to sit in a checkpoint — it is served, increasingly with repeated sampling and long reasoning traces. Once inference is priced into the budget, the compute-optimal point moves hard into the overtraining regime: a smaller model trained far longer beats a larger Chinchilla-optimal one at equal total cost.

13 min →

2026.06.26 PRIVACY

Differential privacy for fine-tuning: when it earns its cost.

Fine-tune on data with PII in it and the model can be made to recite it back verbatim. DP-SGD bounds that — at a measured cost of a point or two of accuracy for a pretrained model, not the catastrophe the folklore claims. When the threat model justifies paying it.

11 min →

2026.06.23 AGENTS

Tool-calling reliability plateaus at your tool descriptions, not your model.

An agent that calls the wrong tool too often looks like a model problem, so teams reach for a bigger model or a fine-tune. But the wrong calls cluster where two tools look near-identical to the model, and rewriting or de-duplicating the ambiguous descriptions recovers double-digit accuracy that a bigger model does not. The tool interface is a primary ceiling on tool-calling reliability, and the model's own choice margin is a usable pre-execution alarm.

12 min →

2026.06.22 TRAINING

Your diffusion LLM is slower than the autoregressive model it replaced.

A diffusion language model decodes many tokens per step instead of one at a time, which reads as a promise of speed. It is not one: every open diffusion model you can deploy today is slower than an equal-size autoregressive model on the same GPU — because a parallel step is only safe on tokens that do not depend on each other, and how many of those a prompt contains is not something the model controls.

14 min →

2026.06.22 ZKML

zkML's real first customers: compliance and fraud proofs.

Pricing a zero-knowledge proof of an LLM gives an absurd number, and teams conclude zkML is not ready. They are pricing the wrong model. For small fixed models on high-stakes decisions, zkML already pays for itself.

11 min →

2026.06.19 TRAINING

Quantization for serving: the accuracy you actually lose.

FP8 serving is effectively lossless. INT8 costs a point or two. INT4 looks free on a standard benchmark — and quietly degrades reasoning by double digits. What low-bit serving really costs, measured per bit-width and per task.

11 min →

2026.06.15 TRAINING

Distillation: shrink the model, keep the eval.

Distillation trades parameters for latency and cost — and the average eval barely moves, which is exactly the trap. The mean can hold while the hard tail of cases regresses. How to shrink a model and actually keep the metric that matters.

11 min →

2026.06.12 PAYMENTS

Escrow and dispute resolution for agent commerce.

When an agent pays an agent, the deal can still go wrong — wrong result, partial result, no result. Two agents that met at runtime have no chargebacks and no courts. Something has to hold the funds and arbitrate. What that something is.

11 min →

2026.06.12 PAYMENTS

Stablecoin rails for agent payments.

Agents settle in stablecoins not for ideology but because the card rail structurally cannot do what an agent needs — sub-cent payments, no merchant account, no human cardholder. What that choice gives you, and what it leaves you to solve.

11 min →

2026.06.10 EVAL

Your agent's benchmark score is not its reliability.

A benchmark score is a point estimate on a curated, static, in-distribution task set. Production reliability is a distribution under drift and stress — and the measured gap between the two widens exactly as conditions get harder.

8 min →

2026.06.10 OPINION

"Agentic" has stopped meaning anything.

Two leading labs publish two different definitions of the word. In a single meeting it covers a lone tool call and a fully autonomous loop. A skeptic's taxonomy of what teams actually mean by agentic — and the four specific things to say instead.

10 min →

2026.06.10 EVAL

Your LLM router is probably losing to the baseline.

An LLM router is supposed to send each query to the cheapest model that can handle it. Under unified evaluation, many routers — including a commercial one — fail to beat the trivial baseline of always using the single best model. The dominant failure is routing collapse to the expensive model, and the cause is structural: routers are trained to predict scalar quality scores for what is really a discrete ranking decision.

11 min →

2026.06.10 TRAINING

Synthetic data that does not collapse the model.

Train a model on its own generations, recursively, and it collapses: the rare cases vanish first and the damage is irreversible. But the fix is not "avoid synthetic data." It is to accumulate it, verify it, and measure diversity.

11 min →

2026.06.08 STANDARDS

The agent identity stack: ERC-8004, DIDs, verifiable credentials.

There is no single "agent identity standard." There are three layers (an identifier, credentials, and authorization), each already standardized separately. Portable agent identity is an assembly job, and the research has converged on what binds the top layer.

11 min →

2026.06.08 EVAL

An agent benchmark a do-nothing agent can win.

On a popular agentic benchmark, an agent that returns nothing scores as successful, and passing the bundled unit tests need not mean the bug was fixed. Auditing the grader, not the agent, moves headline scores by tens of points — so a leaderboard rank is an artifact of its checker until proven otherwise.

14 min →

2026.06.05 TRAINING

Fine-tune or prompt? A 2026 decision tree.

Teams reach for fine-tuning to fix two different problems — a model that lacks facts and a model that lacks a behavior — and only one of them is a fine-tuning problem. When post-training beats prompt-and-retrieve, and when it is wasted spend.

11 min →

2026.06.03 AGENTS

Shipping an agent: canaries and rollback for prompts.

A one-word change to a system prompt can move accuracy by dozens of points, and a provider's model update can regress your app overnight. A prompt or model swap is a deploy. Give it a staged rollout and a one-action rollback path.

11 min →

2026.06.01 AGENTS

Cost observability for an agent fleet.

The monthly inference bill arrives as one number, and nobody can say which agent, which customer, or which tool spent it. Agent cost is too variable to estimate and has to be attributed after the fact — per run, per tool, per tenant. The layer most stacks skip.

11 min →

2026.05.29 AGENTS

Human-in-the-loop checkpoints without killing throughput.

An agent that asks permission for everything trains its reviewers to rubber-stamp, and the one dangerous action slips through in the noise. Approval gates belong on consequence and on uncertainty — not on every step. Where to put them.

12 min →

2026.05.27 EVAL

GraphRAG or vector RAG: a decision guide.

A knowledge graph earns its build-and-maintenance cost over plain vector retrieval only for specific query classes — multi-hop reasoning and global summarization. For fact lookup it loses, and it always costs more to index. A guide by query type.

10 min →

2026.05.25 EVAL

Hybrid search: when BM25 still beats your embeddings.

Dense embeddings win in-domain and on paraphrase. Lexical BM25 wins on rare exact terms and on corpora the embedding model never trained on. Production retrieval usually needs both — and a clear reason for the split.

10 min →

2026.05.22 EVAL

Reranking is the cheapest RAG win you are not using.

A cross-encoder reranker often beats retrieving more documents: it scores the query and passage together, whereas vector search only compares two embeddings computed separately. But its cost scales linearly with the number of candidates, and it does not always earn its latency. When to add one.

11 min →

2026.05.20 EVAL

Eval-driven development: write the eval before the feature.

An eval written after a feature ships can only ratify what the feature already does. Written first, the graded set becomes the specification — it forces the ambiguity out of the requirement before any code exists. Why the eval is the real spec.

10 min →

2026.05.20 OPERATIONS

When to call your AI consultancy back. A decision tree.

Most teams either over-call (the consultancy quietly becomes a line item) or under-call (the harness silently rots until a customer escalation forces the conversation). There are six triggers: three are routine ops you own, and three usually warrant a sprint. Here is the rubric.

12 min →

2026.05.19 OPERATIONS

Three months after handoff is where AI systems quietly die.

An AI system that passed every eval at handoff can be silently below its quality budget a quarter later. The decay is not dramatic — it is three measurable signals drifting slowly under thresholds that were never calibrated for steady state. This post is what month three actually looks like, and what to watch.

11 min →

2026.05.18 AGENTS

Tool design for agents: the schema is the prompt.

A tool's name, description, and JSON schema are all the model sees: it picks and fills tools from those words alone. Treating that surface as API plumbing instead of as a prompt is why agents call the wrong tool. How to design it.

10 min →

2026.05.16 AGENTS

Agent memory is a microservice, not a vector store.

"Add memory" usually means "dump everything into a vector DB." Real agent memory is tiered, has a write/forget/consolidate policy, and runs as a service the agent calls — not a library bolted into the loop.

9 min →

2026.05.16 AGENTS

Your APM cannot see your agent failing.

Request traces and dashboards were built for request/response services. The ways an agent fails — a tool returning 200 with garbage, a truncated context, a looping planner — trip none of them. What agent-native observability has to capture.

9 min →

2026.05.16 PAYMENTS

Spend rails for agents shipped. The safety layer didn't.

x402, ERC-8004, and AP2 gave agents the rails to hold and spend money. The controls that stop a prompt-injected agent from draining a wallet — spend ceilings, treasury isolation, circuit breakers — did not ship with them.

9 min →

2026.05.16 SECURITY

Auditing an agent that holds a wallet.

Agents now sign transactions. The attack surface — a prompt injection that ends in a signed transfer — is new, and almost no security auditor covers it. What an agent security audit actually checks.

10 min →

2026.05.16 EVAL

Stop OCR-ing your PDFs: retrieve on the page, not the transcript.

Most document RAG OCRs a PDF, rebuilds the layout, and chunks the text — losing every table, chart, and column it touches. ColPali-class visual retrieval embeds the page image directly. When that wins, when it doesn't.

10 min →

2026.05.16 TRAINING

Proving the work: verification in decentralized training.

Decentralized pretraining now reaches into the tens of billions of parameters — but you still cannot cryptographically prove the GPUs did the work they claim. How production networks check untrusted workers, and why ZK-proven training is years out.

11 min →

2026.05.16 ZKML

The first proven LLM: what DeepProve changes for zkML.

DeepProve, from Lagrange, produced the first zero-knowledge proof of a full LLM inference — GPT-2. It moves "prove a transformer" from impossible to merely expensive. What that unlocks, and what is still years away.

9 min →

2026.05.16 EVAL

Faithfulness is not groundedness. And "accuracy" is not a RAG metric.

Teams say their RAG is "accurate" and mean different things. Faithfulness is whether the answer is true; groundedness is whether every claim traces to a source. They fail differently, and a deploy depends on measuring both.

6 min →

2026.05.16 STANDARDS

MCP in production: the four gaps nobody demos.

MCP won as the tool-integration standard. But "works in a demo" and "works in production" are different claims, and four gaps bite at scale: sticky sessions, server fan-out, governance, and what happens when a session drops mid-task.

10 min →

2026.05.16 ZKML

opML or zkML: a decision tree for verifiable inference.

Two ways to make an off-chain model output trustworthy on-chain. zkML is cryptographic, expensive, and small-model-only. opML is optimistic, cheap, and runs Llama-2-scale models today. Choosing by stakes, model size, and latency.

10 min →

2026.05.15 VOICE

Cascaded or end-to-end: a 2026 voice-architecture trade study.

A 2026 voice agent forks at the first design decision — STT→LLM→TTS cascade, or a single speech-to-speech model. End-to-end wins on latency and naturalness; the cascade wins on everything you debug, audit, and control. Here is the trade study.

12 min →

2026.05.13 AGENTS

Deterministic replay: debugging agents that will not reproduce.

An agent run is non-deterministic — sampling, tool responses, and timing all vary — so a bug seen once may never recur. Deterministic replay records every non-deterministic input so the run can be replayed exactly.

12 min →

2026.05.11 PAYMENTS

Pricing an API when the customer is an agent.

When the buyer is an autonomous agent paying per call, human pricing breaks. Seats, signup-gated free tiers, and annual commitments stop making sense for a machine that reads the price and comparison-shops every call.

12 min →

2026.05.09 TRAINING

RL environments are the new dataset.

Post-training has shifted from supervised fine-tuning on static labeled data toward reinforcement learning, and that moves the unit of data work from a labeled file to an executable environment. Building good environments is the new data engineering — and the scarce input.

12 min →

2026.05.07 AGENTS

Context engineering beats prompt engineering for long-running agents.

For a long-running agent, the system prompt is a small part of the problem. The real discipline is managing the context window across the whole run as a budget — keep, drop, compact, retrieve.

11 min →

2026.05.05 STANDARDS

What ERC-8004 actually means for agent identity.

Agents need to prove who they are to each other without going through a central directory. ERC-8004 is the first standard that ships the three registries needed for that. Here is what it does and what it does not.

9 min →

2026.05.02 VOICE

Graceful failure for voice agents.

A voice call is real-time and unforgiving — there is no spinner to show, and dead air reads as a broken product. When STT, the LLM, or TTS fails mid-call, the system has to degrade, not drop.

11 min →

2026.04.29 EVAL

Your golden set is rotting.

A golden evaluation set is not a fixed asset — it decays. The world changes, the product shifts, the team overfits, and the pass rate quietly stops meaning anything. Eval data needs a maintenance protocol.

12 min →

2026.04.25 SECURITY

Signing-key custody for autonomous agents.

Assume the model gets injected — then ask where the signing key lives. MPC, HSMs, multisig, and session keys, judged on one question: can a fully compromised agent reach the key?

7 min →

2026.04.22 EVAL

You don't have a RAG problem. You have a chunking problem.

Most teams blame the retriever. The retriever is fine. Your chunks don't carry their context — and no amount of reranking saves them.

9 min →

2026.04.21 STANDARDS

A2A and MCP: two protocols, two jobs.

A2A and MCP get framed as rivals. They are not. MCP connects an agent to its tools; A2A connects agents to each other — different jobs at different layers, and a serious multi-agent system needs both.

12 min →

2026.04.17 PRIVACY

PII redaction that does not wreck retrieval.

Stripping PII before documents reach the embedding model is often necessary. But naive redaction destroys the semantic signal retrieval depends on. How to redact without wrecking retrieval.

10 min →

2026.04.14 AGENTS

Multi-agent systems are usually one agent too many.

Splitting a task across coordinating agents adds context-handoff loss, compounding latency and cost, and a wider failure surface — overhead that usually exceeds the benefit. Start with one agent.

12 min →

2026.04.10 TRAINING

Decentralized training in 2026: what works, what's still vapor.

A grounded look at distributed pretraining across untrusted GPUs. DiLoCo, DisTrO, INTELLECT-2, Bittensor's Templar, 0G's DiLoCoX — what each actually shipped, and what hasn't.

13 min →

2026.04.07 VOICE

Turn-taking is the hard part of voice agents.

Transcription is largely solved. Knowing when the caller has finished, when to stop for an interruption, and when an 'mm-hm' is not a turn — that is not. Endpointing, barge-in, and backchannels, measured.

10 min →

2026.04.03 SECURITY

Red-teaming an MCP server.

Everyone audits the agent. Almost nobody audits the servers it calls — and an MCP server writes straight into your model's context. This is the supply side of agent security.

8 min →

2026.03.30 EVAL

LLM-as-judge is a model you also have to evaluate.

Teams wire an LLM into the eval harness as the judge and treat its scores as ground truth. But the judge is a model — with measurable biases, shaky calibration, and silent drift. Evaluate it before you trust it to gate a deploy.

12 min →

2026.03.25 PRIVACY

Confidential RAG: keep the context secret, not just the query.

Most private RAG protects the user's query in transit and leaves the corpus exposed. But the corpus is the sensitive asset — the embeddings, the vector store, and the chunks the model sees all need protecting.

11 min →

2026.03.22 ZKML

Five zkML libraries, benchmarked. Only one ships today.

EZKL, Modulus, Giza, Ora, RISC Zero. Same model, same input, same target chain. Proof times, gas costs, gotchas — and the one we'd put in front of a customer.

12 min →

2026.03.17 ZKML

Folding schemes for zkML, explained without the cryptography.

zkML cannot scale to large models because proving a whole computation in one shot is ruinously expensive. Folding schemes — Nova and its lineage — prove a long, repetitive computation step by step instead. Explained without the cryptography.

11 min →

2026.03.13 SECURITY

Prompt injection is a vulnerability class, not a bug.

You do not patch prompt injection any more than you patched SQL injection. It is a vulnerability class with four members, and each one needs a different architectural defense.

13 min →

2026.03.08 AGENTS

Notes on agent budgets: why "let it think longer" is a bug.

An agent that hits a wall and asks for more compute is not reasoning. It is panicking. The budget is part of the spec, not a fallback.

6 min →

2026.02.28 PAYMENTS

The x402 micropayment economy: what 119M transactions reveal.

HTTP 402 is no longer reserved. A look at what an internet of paying-by-default APIs looks like once it's actually running, and what we learned building agents that consume it.

8 min →

2026.02.15 VOICE

A 280ms latency budget, broken down millisecond by millisecond.

Sub-300ms voice agents are a specific engineering problem. Here is every millisecond a packet spends between the user's mouth and the agent's reply — and where you actually claw the time back.

13 min →

2026.01.30 OPINION

Most agent demos are lying about the latency. Here is the math.

A 4-second agent looks great on stage and falls over in production. The demo has a few tricks. Once you see them, the latency claims of every other framework get a lot less impressive.

7 min →

2026.01.18 PRIVACY

FHE vs TEE for ML: when to use which.

Two ways to compute on data you can't see. One is cryptographically pure and 100,000x slower; the other is fast and depends on a chip vendor not being broken. A decision tree.

10 min →

2025.12.20 OPINION

On-chain agents are not the same as agents that touch chains.

Four levels of agent-chain integration, three of which are conflated in every pitch deck. A short field guide to what people actually mean when they say "on-chain agent."

7 min →