FIELD NOTES · AGENTS · 2026.05.16 · 9 min

Agent memory is a microservice, not a vector store.

"Add memory" usually means "dump everything into a vector DB." Real agent memory is tiered, has a write/forget/consolidate policy, and runs as a service the agent calls — not a library bolted into the loop.

A team ships an agent that holds a conversation. It works for a week. By the second month every reply is slower, the agent keeps “remembering” a preference the user changed in March, and a support ticket lands because it quoted a price that was right in a January transcript and is wrong today.

Ask how memory is implemented and the answer is always the same. Every turn gets embedded and written to a vector database; every turn, the agent retrieves the top-k most similar past turns and pastes them into the prompt. “We added memory” means “we added a vector store.”

That is not memory. It is an append-only log with cosine similarity bolted on the front, failing the way append-only logs fail. Here is what to build instead.

“Memory equals vector store” fails four ways

These failure modes are not bugs — they are what the architecture produces.

Unbounded growth. Every interaction is a write; nothing is ever a delete. A year-old agent has a year of turns in its index, and every retrieval pays for all of them in latency, noise, and storage — the system gets monotonically worse with use.

Retrieval noise. Top-k similarity returns the k most similar items whether or not any are relevant. Ask “what’s my deploy region” and the retriever surfaces three old turns that mention “region” in passing — the agent reasons over a prompt padded with near-misses.

No forgetting. A vector store has no opinion about what is stale. A correction — “use the EU region now” — is just another write, sitting next to the turn it supersedes; both come back, the agent picks one, and no policy says the newer fact wins.

Stale facts presented as current. The dangerous one. A retrieved chunk carries no “as of” date: the agent pulls a price or a role from an interaction six months old and states it in the present tense. It is the RAG stale-chunk problem, except it compounds, because the agent wrote the chunk itself and trusts it more.

A better embedder or a bigger k fixes none of it. What fixes it is treating memory as a system with its own structure, write policy, and lifecycle — not a bucket.
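To make the failure concrete, here is a minimal sketch of the append-only pattern. It assumes a toy word-overlap score in place of real embedding similarity, and `NaiveMemory` is a hypothetical name, not any library's API.

```python
from dataclasses import dataclass, field


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding cosine similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


@dataclass
class NaiveMemory:
    """Append-only store: every turn is a write, nothing is ever deleted."""
    turns: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.turns.append(text)

    def search(self, query: str, k: int = 3) -> list[str]:
        # Returns the k most similar turns, relevant or not, with no dates attached.
        ranked = sorted(self.turns, key=lambda t: similarity(query, t), reverse=True)
        return ranked[:k]


memory = NaiveMemory()
memory.add("deploy region is US")                    # the original fact
memory.add("which region has the cheapest storage")  # a near-miss
memory.add("thanks, that worked")                    # no signal at all
memory.add("use the EU region now for deploy")       # the correction

hits = memory.search("what is my deploy region", k=3)
```

In this toy run the superseded US turn ranks first, the correction second, and the near-miss about storage fills the third slot. Nothing in the result tells the agent which of the two region facts is current.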

Three tiers, because recall is not one problem

Human memory is not one store: a phone number you need for ten seconds and a fact about a colleague you’ll need for two years have nothing in common operationally. Agent memory wants the same split — three tiers, each with a different job, lifetime, and retrieval path.

Working memory is the current task: the plan, intermediate tool results, what the agent has tried this run. It lives in the agent loop as structured state and is discarded when the task ends. In the agent-budgets essay, this is the state object in the budgeted loop, which holds the best_effort accumulator that a partial result is built from.

Episodic memory is the record of past interactions — what happened, in what session, in what order: “on May 2nd the user asked to migrate the cluster and we ran the dry-run.” Specific, timestamped, session-tied, and the tier that most needs a forgetting policy.

Semantic memory is consolidated facts, stripped of the episode that produced them. Not “on May 2nd the user said they prefer the EU region” but “preferred region: EU” — the distillation of many episodes into the durable facts worth carrying forever. This is the tier most “add memory” implementations skip, and skipping it is why their agents drown in raw transcript.

                        ┌─────────────────────────────┐
   per task             │       WORKING MEMORY        │
   (seconds–minutes)    │  plan, tool results, tries  │  ← in the agent loop
                        └──────────────┬──────────────┘
                                       │ consolidate on session end
                        ┌──────────────▼──────────────┐
   per session          │      EPISODIC MEMORY        │
   (days–weeks)         │  timestamped interactions   │  ← memory service
                        └──────────────┬──────────────┘
                                       │ promote durable facts
                        ┌──────────────▼──────────────┐
   long-lived           │      SEMANTIC MEMORY        │
   (months–forever)     │  consolidated, dated facts  │  ← memory service
                        └─────────────────────────────┘

The flow runs in one direction, and each promotion is also a compression: a thousand turns become fifty episodes become five facts. A vector store has no such funnel. It has one tier, the widest one, and it never narrows.
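One way to keep the tiers distinct is to give each its own record type. The sketch below is illustrative only; the class and field names (`WorkingState`, `Episode`, `Fact`, `as_of`) are assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from enum import Enum


class Tier(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"


@dataclass
class WorkingState:
    """Per-task scratch state; lives in the agent loop and is dropped at task end."""
    plan: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)
    attempts: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Episode:
    """What happened, in which session, and when."""
    session_id: str
    occurred_at: datetime
    summary: str


@dataclass(frozen=True)
class Fact:
    """A consolidated fact, detached from its episode but still dated."""
    key: str
    value: str
    as_of: date


episode = Episode(
    session_id="s-42",
    occurred_at=datetime(2026, 5, 2, 14, 30),
    summary="User asked to migrate the cluster; ran the dry-run.",
)
# Promotion keeps the date but drops the narrative.
fact = Fact(key="preferred_region", value="EU", as_of=episode.occurred_at.date())
```

The types differ on purpose: an episode cannot exist without a session and a timestamp, and a fact cannot exist without a date.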

Memory is a service, not a library

Here is the architectural claim: memory belongs behind an API, as a microservice the agent calls, not as a library that is pip-installed and called inline. The library version works for exactly one agent for exactly one week; then the seams show, in three places.

It scales on its own curve. Consolidation and embedding jobs are bursty and CPU-heavy; the agent loop is latency-sensitive. In one process they contend for resources, and a consolidation pass stalls a live conversation.

It is shared across agents. Any real deployment runs more than one agent — support, research, triage — that need a shared picture of the same user. As a library, each gets its own copy and they silently diverge; a service is one source of truth, and the user becomes a first-class entity rather than a per-agent local variable.

It is independently testable — the property I care about most. A service has a contract — write, consolidate, query, get these facts back — testable in isolation by replaying a synthetic six-month history and asserting the right facts survived and the stale ones decayed. Memory tangled into the loop can only be tested by running the whole agent: slow, costly, and it conflates memory bugs with reasoning bugs, so nobody writes those tests.

The interface is small. Four operations carry almost everything:

# Postponed evaluation keeps the annotations below as plain strings at runtime.
from __future__ import annotations

from typing import Protocol

# Event, Tier, MemoryView, and ForgetCriteria are the service's own domain
# types, defined elsewhere.


class MemoryService(Protocol):
    async def write(self, user_id: str, session_id: str,
                    events: list[Event]) -> None:
        """Append raw interaction events. Cheap, non-blocking."""

    async def consolidate(self, user_id: str, session_id: str) -> None:
        """End-of-session pass: episodic summary, promote durable
        facts. Runs async, off the agent's latency path."""

    async def recall(self, user_id: str, query: str,
                     tiers: list[Tier], k: int) -> MemoryView:
        """Retrieve for the current task. Returns dated, tier-tagged
        items, never a bare blob of text."""

    async def forget(self, user_id: str, criteria: ForgetCriteria) -> None:
        """Explicit deletion: superseded facts, decayed episodes,
        right-to-be-forgotten requests."""

Two things matter. recall returns a MemoryView — items each tagged with tier and as_of date, so the agent can say “as of February, your region was EU; still right?” — not a bare string. And consolidate runs off the latency path: the expensive thinking happens between turns, not while the user waits.
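A minimal sketch of what a dated, tier-tagged view could look like follows. The interface above does not specify the shape of `MemoryView`, so `MemoryItem`, its fields, and the `render` method are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MemoryItem:
    tier: str    # "episodic" or "semantic"
    text: str
    as_of: date  # when this item was last known to be true


@dataclass(frozen=True)
class MemoryView:
    items: tuple[MemoryItem, ...]

    def render(self) -> str:
        """Format items for the prompt, newest first, each with tier and date."""
        ordered = sorted(self.items, key=lambda i: i.as_of, reverse=True)
        return "\n".join(
            f"[{i.tier}, as of {i.as_of.isoformat()}] {i.text}" for i in ordered
        )


view = MemoryView(items=(
    MemoryItem("semantic", "preferred region: EU", date(2026, 2, 10)),
    MemoryItem("episodic", "ran the cluster migration dry-run", date(2026, 5, 2)),
))
prompt_block = view.render()
```

Because every line in the prompt carries its date, the agent has what it needs to hedge an old fact or ask the user to confirm it.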

The write / consolidate / forget policy

A memory service without a lifecycle policy is just a vector store with extra latency. The policy is the product — three decisions.

When to write. Not every turn — writing every utterance rebuilds the unbounded-log problem inside your nice new service. Write only events that carry signal: a stated preference, a decision, a correction, a tool result that changed the world. A cheap classifier gates the write, on one bar: would a competent human assistant write this down?
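The gate can be sketched with a few regular expressions standing in for the cheap classifier. The patterns and the `changed_state` flag are hypothetical; a real gate would more likely be a small trained model or a short LLM call.

```python
import re

# Hypothetical signal patterns, one per kind of event worth keeping.
SIGNAL_PATTERNS = [
    re.compile(r"\b(i|we) (prefer|want|always|never)\b", re.I),         # stated preference
    re.compile(r"\b(let's go with|we decided|decision)\b", re.I),       # decision
    re.compile(r"\b(actually|instead|from now on|no longer)\b", re.I),  # correction
]


def should_write(text: str, changed_state: bool = False) -> bool:
    """Return True only for events that carry signal."""
    if changed_state:
        # A tool result that changed the world always passes the gate.
        return True
    return any(p.search(text) for p in SIGNAL_PATTERNS)


turns = [
    ("thanks, that looks good", False),
    ("I prefer the EU region", False),
    ("actually, use us-east from now on", False),
    ("migration applied to prod cluster", True),
]
kept = [text for text, changed in turns if should_write(text, changed)]
```

Here the acknowledgement is dropped and the preference, the correction, and the state-changing tool result are kept.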

How to consolidate. At session end, a consolidation pass summarizes the session into a dated, session-tagged episodic record, then scans for facts worth promoting to semantic memory. A fact is promotable when it is durable — a preference, an identity, a long-lived config, not “the user is in a hurry today.” Promotion is extraction: from “on May 2nd the user said to use EU,” emit “preferred region: EU (as of 2026-05-02).” It is also reconciliation: when a new fact contradicts an old one, the pass supersedes rather than appends — the older fact is marked stale, the newer wins, and the change is recorded so the history stays auditable.
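The reconciliation step can be sketched as a keyed store in which a contradicting fact replaces the current one and the old one moves to an audit history. `SemanticStore`, `promote`, and the rule that the later `as_of` wins are assumptions for illustration; the extraction step that produces the facts is out of scope here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Fact:
    key: str
    value: str
    as_of: date
    stale: bool = False


@dataclass
class SemanticStore:
    current: dict[str, Fact] = field(default_factory=dict)
    history: list[Fact] = field(default_factory=list)  # superseded facts, kept for audit

    def promote(self, new: Fact) -> None:
        """Insert a fact; on contradiction, the fact with the later as_of wins."""
        old = self.current.get(new.key)
        if old is None:
            self.current[new.key] = new
        elif old.value == new.value:
            # Same fact restated: just refresh its date.
            old.as_of = max(old.as_of, new.as_of)
        elif new.as_of >= old.as_of:
            # Supersede rather than append: mark the old fact stale and archive it.
            old.stale = True
            self.history.append(old)
            self.current[new.key] = new
        else:
            # A late-arriving older fact is recorded but never overrides.
            new.stale = True
            self.history.append(new)


store = SemanticStore()
store.promote(Fact("preferred_region", "US", date(2026, 1, 15)))
store.promote(Fact("preferred_region", "EU", date(2026, 5, 2)))
```

After the second call, recall reads only `store.current`, so the US fact can no longer come back, while `store.history` still shows that the region changed and when.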

How to forget. Three distinct mechanisms:

  • Supersession. A fact replaced by a newer fact, handled at consolidation — the old fact is dead the moment the new one lands.
  • Decay. Episodic memory ages out on a recency-and-usage score: every recall that touches an episode refreshes it; episodes untouched for months drop. This is the valve against the year-long log.
  • Deletion on request. A right-to-be-forgotten — a real operation, scoped by user_id, that purges all three tiers. Across five agent processes this is nearly impossible to prove; as a service it is one call.
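Deletion on request reduces to one function when every tier is keyed by user. A minimal sketch, assuming a hypothetical in-process `TieredStore` with plain dicts standing in for the real storage backends:

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Hypothetical stand-in for the service's storage, keyed by user_id."""
    working: dict[str, list[str]] = field(default_factory=dict)
    episodic: dict[str, list[str]] = field(default_factory=dict)
    semantic: dict[str, list[str]] = field(default_factory=dict)

    def forget_user(self, user_id: str) -> int:
        """Purge every tier for one user; return how many records were removed."""
        removed = 0
        for tier in (self.working, self.episodic, self.semantic):
            removed += len(tier.pop(user_id, []))
        return removed


store = TieredStore(
    working={"u1": ["plan: migrate cluster"]},
    episodic={"u1": ["2026-05-02 dry-run"], "u2": ["2026-05-03 billing question"]},
    semantic={"u1": ["preferred region: EU"], "u2": ["plan tier: team"]},
)
removed = store.forget_user("u1")
```

Returning the count gives the caller something to log as evidence that the purge ran and what it touched.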

Semantic facts mostly do not decay on a timer — they are superseded or deleted, not aged out. The tier you let time erode is episodic.
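Episodic decay can be sketched as a score that falls with time since the last touch and rises with use. The half-life, the usage boost, and the drop threshold below are made-up values to tune, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 30.0  # assumed: score halves for every 30 idle days
DROP_BELOW = 0.1       # assumed: episodes scoring under this are dropped


@dataclass
class Episode:
    summary: str
    last_touched: datetime
    recall_count: int = 0


def retention_score(ep: Episode, now: datetime) -> float:
    """Exponential decay since the last touch, boosted by how often it was recalled."""
    idle_days = (now - ep.last_touched).total_seconds() / 86400
    recency = 0.5 ** (idle_days / HALF_LIFE_DAYS)
    usage_boost = 1.0 + 0.1 * min(ep.recall_count, 10)  # capped at 2x
    return recency * usage_boost


def touch(ep: Episode, now: datetime) -> None:
    """Called on every recall that returns this episode: refreshes it."""
    ep.last_touched = now
    ep.recall_count += 1


def sweep(episodes: list[Episode], now: datetime) -> list[Episode]:
    """Keep only the episodes whose score is still above the threshold."""
    return [ep for ep in episodes if retention_score(ep, now) >= DROP_BELOW]


now = datetime(2026, 5, 16, 12, 0)
stale = Episode("asked about invoice format", now - timedelta(days=180))
recent = Episode("ran cluster migration dry-run", now - timedelta(days=10), recall_count=3)
refreshed = Episode("set up staging credentials", now - timedelta(days=200))
touch(refreshed, now)  # recalled today, so its clock restarts

survivors = sweep([stale, recent, refreshed], now)
```

The episode untouched for 180 days is dropped, while the 200-day-old one that was just recalled survives. Age alone does not erase an episode; disuse does.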

Tier      | Write trigger              | Lifetime   | Forget mechanism
Working   | Every step, in-loop        | The task   | Discarded at task end
Episodic  | Signal-bearing events only | Days–weeks | Recency/usage decay
Semantic  | Promoted at consolidation  | Long-lived | Supersession or delete

You cannot tune memory you do not measure

Memory has a specific evaluation trap. Recall accuracy — did the agent retrieve the fact — is necessary but not sufficient. What matters is the task outcome: did the agent, with memory, do the job better than without it.

So the harness needs two layers. Memory-level metrics run on a long synthetic conversation with planted facts: did the right fact come back later, did a superseded fact correctly stop coming back, did stale episodes decay on schedule? LOCOMO is a long-conversation memory benchmark built for exactly this — multi-session dialogues with question sets that probe what a system should still know hundreds of turns later. Task-level metrics run your real task suite twice — memory on, memory off — and diff the outcomes; if task success does not move, your memory layer is cost and latency with no payoff. Keep both; never collapse them into a single number.
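The task-level layer can be sketched as a function that runs the same suite twice and reports the difference. `run_task` is a hypothetical callable standing in for your agent harness, and the stub below fakes an agent that needs memory only for recall-style tasks.

```python
from typing import Callable


def memory_lift(
    tasks: list[str],
    run_task: Callable[[str, bool], bool],
) -> dict[str, float]:
    """Run every task with memory on and off; report both success rates and the lift."""
    with_memory = sum(run_task(t, True) for t in tasks) / len(tasks)
    without_memory = sum(run_task(t, False) for t in tasks) / len(tasks)
    return {
        "with_memory": with_memory,
        "without_memory": without_memory,
        "lift": with_memory - without_memory,
    }


def stub_run_task(task: str, memory_on: bool) -> bool:
    # Stand-in agent: recall tasks succeed only when memory is enabled.
    return memory_on or not task.startswith("recall")


report = memory_lift(
    ["recall_region", "recall_name", "add_numbers", "format_date"],
    stub_run_task,
)
```

A lift near zero on the real suite is the signal that the memory layer is adding cost and latency without changing outcomes.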

One subtler check is a stale-fact regression test: plant a fact, plant its correction a few sessions later, then query — the right answer must reflect the correction. Plain recall accuracy will never catch a failure here; it rewards returning the old fact. A memory harness has to test supersession the way a RAG harness tests refusals.
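The regression test fits in a few lines once memory sits behind a contract. This sketch assumes a simplified keyed interface (`KeyedMemory`) rather than the full service, and the two implementations are toy stand-ins that show one failing and one passing.

```python
from datetime import date
from typing import Protocol


class KeyedMemory(Protocol):
    def write(self, key: str, value: str, as_of: date) -> None: ...
    def recall(self, key: str) -> list[str]: ...


def passes_stale_fact_check(memory: KeyedMemory) -> bool:
    """Plant a fact, plant its correction later, then require only the correction back."""
    memory.write("deploy_region", "US", date(2026, 1, 15))
    memory.write("deploy_region", "EU", date(2026, 5, 2))
    return memory.recall("deploy_region") == ["EU"]


class AppendOnlyMemory:
    """Every write is kept; recall returns everything that matches."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, str, date]] = []

    def write(self, key: str, value: str, as_of: date) -> None:
        self.rows.append((key, value, as_of))

    def recall(self, key: str) -> list[str]:
        return [value for k, value, _ in self.rows if k == key]


class SupersedingMemory:
    """A newer write for the same key replaces the older one."""

    def __init__(self) -> None:
        self.facts: dict[str, tuple[str, date]] = {}

    def write(self, key: str, value: str, as_of: date) -> None:
        current = self.facts.get(key)
        if current is None or as_of >= current[1]:
            self.facts[key] = (value, as_of)

    def recall(self, key: str) -> list[str]:
        return [self.facts[key][0]] if key in self.facts else []


append_only_ok = passes_stale_fact_check(AppendOnlyMemory())
superseding_ok = passes_stale_fact_check(SupersedingMemory())
```

The append-only store returns both US and EU. A metric that only asks whether the planted fact came back would score that as a hit, but this check fails it.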

The tooling, and build-vs-buy

You do not have to build the service from scratch. Memory-as-a-service is an emerging architecture pattern with off-the-shelf tooling, and Mem0 is the most prominent — an agent-memory product behind an API that does extraction and consolidation rather than raw turn-dumping, exposing roughly the write / search / delete surface sketched above. In the Mem0 paper (arXiv 2504.19413), the authors report roughly a 26% relative improvement in LLM-as-judge answer quality on the LOCOMO benchmark over OpenAI’s built-in memory feature — note the precise comparison: against a specific built-in memory product, not against “a plain vector store.” Treat it as evidence that the tiered, extract-and-consolidate architecture beats naive approaches, not as a benchmark to chase.

The honest split: adopt when your needs are conventional — per-user preferences, conversational continuity, the standard tier structure — because a managed service hands you a tested consolidation pipeline, and consolidation is the hard part. Build when memory is your differentiator — domain-specific consolidation rules, an unusual retrieval path, a topology no off-the-shelf tool models. Even then, build the service, not a library: keep the API boundary, the four-operation contract, the independent testability. What stays off the menu either way is .add() and .search() against a raw vector store, in the loop, with no consolidation and no forgetting.

The checklist

Before you tell anyone your agent “has memory,” walk this:

  • Are working, episodic, and semantic memory actually distinct — different stores, different lifetimes — or is it all one index?
  • Does memory run behind an API the agent calls, not as a library inside the loop?
  • Is there a write policy that drops no-signal turns, or are you writing every utterance?
  • Is there a consolidation pass that summarizes sessions and promotes durable facts to semantic memory?
  • Does every stored fact carry an as_of date, and does the agent see it on recall?
  • When a fact is corrected, does the old one get superseded — not appended next to the new one?
  • Does episodic memory decay, so the store does not grow without bound?
  • Can you delete everything about one user in a single call?
  • Do you measure both recall accuracy and task outcome, and is there a stale-fact regression test?

Can’t check all nine? You do not have agent memory — you have a vector store, and it is already drifting.

Memory is not a feature you add. It is a service you run, with a lifecycle policy — and if it never forgets, it was never memory.
