◇ FIELD NOTES · WRITING FROM THE TEAM

Engineering notes from the field. Numbers, code, and opinions from the team.

Production AI and verifiable on-chain inference, written by the engineers who ship them. Two essays a month. Unsubscribe with one click.

2026.06.03 AGENTS

Shipping an agent: canaries and rollback for prompts.

A one-word change to a system prompt can move accuracy by dozens of points, and a provider's model update can regress your app overnight. A prompt or model swap is a deploy. Give it a staged rollout and a one-action rollback path.

2026.06.01 AGENTS

Cost observability for an agent fleet.

The monthly inference bill arrives as one number, and nobody can say which agent, which customer, or which tool spent it. Agent cost is too variable to estimate and has to be attributed after the fact — per run, per tool, per tenant. The layer most stacks skip.

2026.05.29 AGENTS

Human-in-the-loop checkpoints without killing throughput.

An agent that asks permission for everything trains its reviewers to rubber-stamp, and the one dangerous action slips through in the noise. Approval gates belong on consequence and on uncertainty — not on every step. Where to put them.

2026.05.27 EVAL

GraphRAG or vector RAG: a decision guide.

A knowledge graph earns its build-and-maintenance cost over plain vector retrieval only for specific query classes — multi-hop reasoning and global summarization. For fact lookup it loses, and it always costs more to index. A guide by query type.

2026.05.25 EVAL

Hybrid search: when BM25 still beats your embeddings.

Dense embeddings win in-domain and on paraphrase. Lexical BM25 wins on rare exact terms and on corpora the embedding model never trained on. Production retrieval usually needs both — and a clear reason for the split.

2026.05.22 EVAL

Reranking is the cheapest RAG win you are not using.

A cross-encoder reranker often beats retrieving more documents: it scores the query and passage together, where vector search only compares two embeddings computed separately. But its cost scales linearly with the number of candidates, and it does not always earn its latency. When to add one.

2026.05.20 EVAL

Eval-driven development: write the eval before the feature.

An eval written after a feature ships can only ratify what the feature already does. Written first, the graded set becomes the specification — it forces the ambiguity out of the requirement before any code exists. Why the eval is the real spec.

2026.05.20 OPERATIONS

When to call your AI consultancy back. A decision tree.

Most teams either over-call (the consultancy quietly becomes a line item) or under-call (the harness silently rots until a customer escalation forces the conversation). There are six triggers: three are routine ops you own, and three usually warrant a sprint. Here is the rubric.

2026.05.19 OPERATIONS

Three months after handoff is where AI systems quietly die.

An AI system that passed every eval at handoff can be silently below its quality budget a quarter later. The decay is not dramatic: it is three measurable signals drifting slowly under thresholds that were never calibrated for steady state. This post covers what month three actually looks like, and what to watch.

2026.05.18 AGENTS

Tool design for agents: the schema is the prompt.

A tool's name, description, and JSON schema are the only thing the model sees — it picks and fills tools from those words alone. Treating that surface as API plumbing instead of as prompt is why agents call the wrong tool. How to design it.

2026.05.16 AGENTS

Agent memory is a microservice, not a vector store.

"Add memory" usually means "dump everything into a vector DB." Real agent memory is tiered, has a write/forget/consolidate policy, and runs as a service the agent calls — not a library bolted into the loop.

2026.05.16 AGENTS

Your APM cannot see your agent failing.

Request traces and dashboards were built for request/response services. The ways an agent fails — a tool returning 200 with garbage, a truncated context, a looping planner — trip none of them. What agent-native observability has to capture.

2026.05.16 PAYMENTS

Spend rails for agents shipped. The safety layer didn't.

x402, ERC-8004, and AP2 gave agents the rails to hold and spend money. The controls that stop a prompt-injected agent from draining a wallet — spend ceilings, treasury isolation, circuit breakers — did not ship with them.

2026.05.16 SECURITY

Auditing an agent that holds a wallet.

Agents now sign transactions. The attack surface — a prompt injection that ends in a signed transfer — is new, and almost no security auditor covers it. What an agent security audit actually checks.

2026.05.16 EVAL

Stop OCR-ing your PDFs: retrieve on the page, not the transcript.

Most document RAG OCRs a PDF, rebuilds the layout, and chunks the text — losing every table, chart, and column it touches. ColPali-class visual retrieval embeds the page image directly. When that wins, when it doesn't.

2026.05.16 TRAINING

Proving the work: verification in decentralized training.

Decentralized pretraining now reaches into the tens of billions of parameters — but you still cannot cryptographically prove the GPUs did the work they claim. How production networks check untrusted workers, and why ZK-proven training is years out.

2026.05.16 ZKML

The first proven LLM: what DeepProve changes for zkML.

DeepProve, from Lagrange, produced the first zero-knowledge proof of a full LLM inference — GPT-2. It moves "prove a transformer" from impossible to merely expensive. What that unlocks, and what is still years away.

2026.05.16 EVAL

Faithfulness is not groundedness. And "accuracy" is not a RAG metric.

Teams say their RAG is 'accurate' and mean different things. Faithfulness is whether the answer is true; groundedness is whether every claim traces to a source. They fail differently — and a deploy depends on measuring both.

2026.05.16 STANDARDS

MCP in production: the four gaps nobody demos.

MCP won the tool-integration standard. But "works in a demo" and "works in production" are different claims — and four gaps bite at scale: sticky sessions, server fan-out, governance, and what happens when a session drops mid-task.

2026.05.16 ZKML

opML or zkML: a decision tree for verifiable inference.

Two ways to make an off-chain model output trustworthy on-chain. zkML is cryptographic, expensive, and small-model-only. opML is optimistic, cheap, and runs Llama-2-scale models today. Choosing by stakes, model size, and latency.

2026.05.15 VOICE

Cascaded or end-to-end: a 2026 voice-architecture trade study.

A 2026 voice agent forks at the first design decision — STT→LLM→TTS cascade, or a single speech-to-speech model. End-to-end wins on latency and naturalness; the cascade wins on everything you debug, audit, and control. Here is the trade study.

2026.05.13 AGENTS

Deterministic replay: debugging agents that will not reproduce.

An agent run is non-deterministic — sampling, tool responses, and timing all vary — so a bug seen once may never recur. Deterministic replay records every non-deterministic input so the run can be replayed exactly.

2026.05.11 PAYMENTS

Pricing an API when the customer is an agent.

When the buyer is an autonomous agent paying per call, human pricing breaks. Seats, signup-gated free tiers, and annual commitments stop making sense for a machine that reads the price and comparison-shops every call.

2026.05.09 TRAINING

RL environments are the new dataset.

Post-training has shifted from supervised fine-tuning on static labeled data toward reinforcement learning, and that moves the unit of data work from a labeled file to an executable environment. Building good environments is the new data engineering — and the scarce input.

2026.05.07 AGENTS

Context engineering beats prompt engineering for long-running agents.

For a long-running agent, the system prompt is a small part of the problem. The real discipline is managing the context window across the whole run as a budget — keep, drop, compact, retrieve.

2026.05.05 STANDARDS

What ERC-8004 actually means for agent identity.

Agents need to prove who they are to each other without going through a central directory. ERC-8004 is the first standard that ships the three registries needed for that. Here is what it does and what it does not.

2026.05.02 VOICE

Graceful failure for voice agents.

A voice call is real-time and unforgiving — there is no spinner to show, and dead air reads as a broken product. When STT, the LLM, or TTS fails mid-call, the system has to degrade, not drop.

2026.04.29 EVAL

Your golden set is rotting.

A golden evaluation set is not a fixed asset — it decays. The world changes, the product shifts, the team overfits, and the pass rate quietly stops meaning anything. Eval data needs a maintenance protocol.

2026.04.25 SECURITY

Signing-key custody for autonomous agents.

Assume the model gets injected — then ask where the signing key lives. MPC, HSMs, multisig, and session keys, judged on one question: can a fully compromised agent reach the key?

2026.04.22 EVAL

You don't have a RAG problem. You have a chunking problem.

Most teams blame the retriever. The retriever is fine. Your chunks don't carry their context — and no amount of reranking saves them.

2026.04.21 STANDARDS

A2A and MCP: two protocols, two jobs.

A2A and MCP get framed as rivals. They are not. MCP connects an agent to its tools; A2A connects agents to each other — different jobs at different layers, and a serious multi-agent system needs both.

2026.04.17 PRIVACY

PII redaction that does not wreck retrieval.

Stripping PII before documents reach the embedding model is often necessary. But naive redaction destroys the semantic signal retrieval depends on. How to redact without wrecking retrieval.

2026.04.14 AGENTS

Multi-agent systems are usually one agent too many.

Splitting a task across coordinating agents adds context-handoff loss, compounding latency and cost, and a wider failure surface — overhead that usually exceeds the benefit. Start with one agent.

2026.04.10 TRAINING

Decentralized training in 2026: what works, what's still vapor.

A grounded look at distributed pretraining across untrusted GPUs. DiLoCo, DisTrO, INTELLECT-2, Bittensor's Templar, 0G's DiLoCoX — what each actually shipped, and what hasn't.

2026.04.07 VOICE

Turn-taking is the hard part of voice agents.

Transcription is largely solved. Knowing when the caller has finished, when to stop for an interruption, and when an 'mm-hm' is not a turn — that is not. Endpointing, barge-in, and backchannels, measured.

2026.04.03 SECURITY

Red-teaming an MCP server.

Everyone audits the agent. Almost nobody audits the servers it calls — and an MCP server writes straight into your model's context. This is the supply side of agent security.

2026.03.30 EVAL

LLM-as-judge is a model you also have to evaluate.

Teams wire an LLM into the eval harness as the judge and treat its scores as ground truth. But the judge is a model — with measurable biases, shaky calibration, and silent drift. Evaluate it before you trust it to gate a deploy.

2026.03.25 PRIVACY

Confidential RAG: keep the context secret, not just the query.

Most private RAG protects the user's query in transit and leaves the corpus exposed. But the corpus is the sensitive asset — the embeddings, the vector store, and the chunks the model sees all need protecting.

2026.03.22 ZKML

Five zkML libraries, benchmarked. Only one ships today.

EZKL, Modulus, Giza, Ora, RISC Zero. Same model, same input, same target chain. Proof times, gas costs, gotchas — and the one we'd put in front of a customer.

2026.03.17 ZKML

Folding schemes for zkML, explained without the cryptography.

zkML cannot scale to large models because proving a whole computation in one shot is ruinously expensive. Folding schemes — Nova and its lineage — prove a long, repetitive computation step by step instead. Explained without the cryptography.

2026.03.13 SECURITY

Prompt injection is a vulnerability class, not a bug.

You do not patch prompt injection any more than you patched SQL injection. It is a vulnerability class with four members, and each one needs a different architectural defense.

2026.03.08 AGENTS

Notes on agent budgets: why "let it think longer" is a bug.

An agent that hits a wall and asks for more compute is not reasoning. It is panicking. The budget is part of the spec, not a fallback.

2026.02.28 PAYMENTS

The x402 micropayment economy: what 119M transactions reveal.

HTTP 402 is no longer reserved. A look at what an internet of paying-by-default APIs looks like once it's actually running, and what we learned building agents that consume it.

2026.02.15 VOICE

A 280ms latency budget, broken down millisecond by millisecond.

Sub-300ms voice agents are a specific engineering problem. Here is every millisecond a packet spends between the user's mouth and the agent's reply — and where you actually claw the time back.

2026.01.30 OPINION

Most agent demos are lying about the latency. Here is the math.

A 4-second agent looks great on stage and falls over in production. The demo has a few tricks. Once you see them, the latency claims of every other framework get a lot less impressive.

2026.01.18 PRIVACY

FHE vs TEE for ML: when to use which.

Two ways to compute on data you can't see. One is cryptographically pure and 100,000x slower; the other is fast and depends on a chip vendor not being broken. A decision tree.

2025.12.20 OPINION

On-chain agents are not the same as agents that touch chains.

Four levels of agent-chain integration, three of which are conflated in every pitch deck. A short field guide to what people actually mean when they say "on-chain agent."
