◇ ARCHIVE · PAGE 4 / 7 · OLDEST → NEWEST

The field notes archive.

2026.05.16 AGENTS

Agent memory is a microservice, not a vector store.

"Add memory" usually means "dump everything into a vector DB." Real agent memory is tiered, has a write/forget/consolidate policy, and runs as a service the agent calls — not a library bolted into the loop.

9 min →

2026.05.18 AGENTS

Tool design for agents: the schema is the prompt.

A tool's name, description, and JSON schema are the only things the model sees; it picks and fills tools from those words alone. Treating that surface as API plumbing instead of as prompt is why agents call the wrong tool. How to design it.

10 min →

2026.05.19 OPERATIONS

Three months after handoff is where AI systems quietly die.

An AI system that passed every eval at handoff can be silently below its quality budget a quarter later. The decay is not dramatic — it is three measurable signals drifting slowly under thresholds that were never calibrated for steady state. This post is what month three actually looks like, and what to watch.

11 min →

2026.05.20 OPERATIONS

When to call your AI consultancy back. A decision tree.

Most teams either over-call (the consultancy quietly becomes a line item) or under-call (the harness silently rots until a customer escalation forces the conversation). There are six triggers: three are routine ops you own, and three usually warrant a sprint. Here is the rubric.

12 min →

2026.05.20 EVAL

Eval-driven development: write the eval before the feature.

An eval written after a feature ships can only ratify what the feature already does. Written first, the graded set becomes the specification — it forces the ambiguity out of the requirement before any code exists. Why the eval is the real spec.

10 min →

2026.05.22 EVAL

Reranking is the cheapest RAG win you are not using.

A cross-encoder reranker often beats retrieving more documents: it scores the query and passage together, whereas vector search only compares two embeddings computed separately. But its cost scales linearly with the number of candidates, and it does not always earn its latency. When to add one.

11 min →

2026.05.25 EVAL

Hybrid search: when BM25 still beats your embeddings.

Dense embeddings win in-domain and on paraphrase. Lexical BM25 wins on rare exact terms and on corpora the embedding model was never trained on. Production retrieval usually needs both, along with a clear reason for the split.

10 min →

2026.05.27 EVAL

GraphRAG or vector RAG: a decision guide.

A knowledge graph earns its build-and-maintenance cost over plain vector retrieval only for specific query classes — multi-hop reasoning and global summarization. For fact lookup it loses, and it always costs more to index. A guide by query type.

10 min →

2026.05.29 AGENTS

Human-in-the-loop checkpoints without killing throughput.

An agent that asks permission for everything trains its reviewers to rubber-stamp, and the one dangerous action slips through in the noise. Approval gates belong on consequence and on uncertainty — not on every step. Where to put them.

12 min →

2026.06.01 AGENTS

Cost observability for an agent fleet.

The monthly inference bill arrives as one number, and nobody can say which agent, which customer, or which tool spent it. Agent cost is too variable to estimate and has to be attributed after the fact — per run, per tool, per tenant. The layer most stacks skip.

11 min →

2026.06.03 AGENTS

Shipping an agent: canaries and rollback for prompts.

A one-word change to a system prompt can move accuracy by dozens of points, and a provider's model update can regress your app overnight. A prompt or model swap is a deploy. Give it a staged rollout and a one-action rollback path.

11 min →

2026.06.05 TRAINING

Fine-tune or prompt? A 2026 decision tree.

Teams reach for fine-tuning to fix two different problems — a model that lacks facts and a model that lacks a behavior — and only one of them is a fine-tuning problem. When post-training beats prompt-and-retrieve, and when it is wasted spend.

11 min →
