When to call your AI consultancy back. A decision tree.

A team finishes an AI engagement. The system is deployed, the eval suite is wired into CI, the runbook is signed off, the consultancy is gone. Six months later the team has either called the consultancy back four times — for things their on-call could have resolved with the dashboard — or has not called at all, even though the eval-pass rate has been below the budget for three weeks. Both teams are below the operating discipline they should be at. The first team is paying a consultancy to do work the runbook explicitly hands them. The second team is treating “we own it now” as a posture rather than a practice.

The right cadence is rare, and the reason it is rare is that the question “is this a re-engagement?” has no default answer that survives the first time it is asked under pressure. Teams that have not written the rubric down in advance answer it ad hoc, and ad hoc answers cluster on one of the two failure modes.

This post is the explicit rubric. Six triggers, three you own, three that usually warrant a 2–3 week sprint. The rubric is not the contract; the engagement letter is the contract. But the rubric is the conversation the engagement letter is supposed to make easy, and it is easier when it is written down before anyone needs it.

The meta-problem is the failure mode, not the triggers

Before the triggers, the failure modes they avoid, because the triggers are useless to a team that is not honest about which failure mode it has.

The over-call failure is when a team forwards every incident, every model question, every cost spike to the consultancy. The consultancy becomes a permanent line item; the team’s operating muscle atrophies because the consultancy is the muscle. This is comfortable for both sides — the consultancy gets revenue, the team gets answers — and it is the slow path to a system the team does not actually own. The runbook said the team owns it; the practice says otherwise. The cost is invisible until a year passes and the team realizes it has been outsourcing its own production system.

The under-call failure is the opposite shape. The team treats the handoff as the end of the relationship and runs the system into a slow degradation that the dashboards would have caught and the runbook would have flagged, but neither was watched closely enough. The eval-pass rate falls below the budget; nobody escalates because the CI gate did not fire; the customer-support queue stretches; an executive asks why the AI feature is generating more tickets than it used to. The conversation that should have been a 30-minute call at week four is a crisis conversation at week sixteen.

The rubric below is built to make both failures unlikely. Three of the triggers are explicitly routine ops the team owns, with the runbook entry and the dashboard view named. Three are explicitly the kind of work that warrants a re-engagement, with the scope and approximate sprint length named. The point is not the categorization in the abstract; it is the writing-down. A team that has agreed in advance which side of the line each trigger sits on does not have an ad-hoc conversation under pressure.

Trigger 1 — Eval-pass rate drops below the budget

Routine ops. The harness already catches this; the dashboard already plots it; the runbook entry is already written. The drop is the signal, not the diagnosis. The triage path is the one the runbook names: open the per-category breakdown, identify which category is falling, look at the recent dataset changes, look at the corpus changes, look at the configuration changes. The mechanics are described in three months after handoff, and the per-category methodology is in LLM-as-judge evaluation.

The escalation rule: if a single debugging cycle does not produce a hypothesis the team can test, the second debugging cycle is the call. Not because the team cannot do it, but because the second cycle is the signal that something has changed structurally — the harness is not catching a new failure mode, the eval set has aged out of the production distribution, or a configuration drifted without anyone noticing. Those are diagnosable in an afternoon by someone who has the engagement file open; they can chew a week of the team’s senior engineering otherwise.

The conversation to have is a 30-minute one, not a sprint. The team is not asking us to build something; the team is asking for a second pair of eyes on the dashboard and the recent diffs. Charged hourly against the engagement file, no addendum needed.

Trigger 2 — A new model lands and you want to A/B against the harness

Routine ops. This is what the harness is for. The eval suite is the team’s; the new model is plumbed against the same API surface; the comparison is a pull request, a CI run, and a one-page memo from the team’s senior engineer. The decision to swap or not is a decision the team owns, and the harness is the evidence base it owns to make it.

The escalation rule: the new model wins on aggregate but loses on a category the team cares about. That is not a model decision; it is an eval-design decision. Either the eval set is missing a category that the new model is reading correctly and the old model was getting lucky on, or the category the new model is losing on is a structural weakness that disqualifies the swap. Distinguishing the two needs a closer look at the eval set and the failure traces. That is a conversation, not a project — a 60-minute call, charged hourly, no sprint.

The under-call risk on this trigger is teams who run the A/B, see the headline number improve, swap the model, and discover the regression in production a week later. The model-swap decision is a deploy; treat it like a deploy. The harness exists for this exact case.

Trigger 3 — Cost per request creeps above the ceiling

Routine ops. The cost dashboard shows where the cost lives — usually rerankers, tool calls, context bloat, or a new feature flag turned on without recosting. The triage path is in the post agent budgets and in the runbook: decompose, identify, decide whether to roll back or recost.

The escalation rule: the obvious moves are not enough, and the next move is a structural change. Examples: the cost lives in a tool that was acceptable at sample-size 100 and is unacceptable at sample-size 10,000; the agent harness has learned to retry in a way that compounds across steps; the corpus has grown to a size where the embedding-store costs dominate the inference. Structural changes are a sprint, not an afternoon.

The over-call risk on this trigger is teams who treat every cost spike as a call. Most cost spikes are configuration drift, and configuration drift is the team’s to roll back. The runbook names the diff to look for. A call when the team cannot identify the diff is fine; a call as the first move is a process failure.

Trigger 4 — A regulatory change moves the compliance perimeter

Worth a sprint. New data-handling rules, an attested-vendor requirement, an audit posture shift, a regulator naming a category of inference your system performs — these usually require an architecture change, not a configuration one. The architecture change is at the perimeter, not at the model, and perimeter changes are exactly the kind of work an engagement is built for.

The pattern of work: re-scope the threat model in writing, identify the architectural deltas (TEE attestation, on-chain verifier, audit logging, key custody), implement the deltas against the existing eval suite to confirm the system still passes, write the engagement-letter addendum that names the compliance posture. Two to three weeks, compressed because the harness already exists and the team that ran the engagement knows the system. The relevant design choices are mapped out in FHE vs. TEE for ML and confidential RAG; the perimeter conversation is the kickoff.

The over-call risk here is low — most teams do not casually call about regulation. The under-call risk is real: a regulator-driven change that the team tries to absorb in normal operations almost always slips and then turns into a project under deadline pressure. Call early.

Trigger 5 — A new data source is on the way and the corpus will drift

Worth a sprint. Adding a corpus is rarely a configuration change. The retrieval characteristics shift, the faithfulness profile shifts, the eval set needs new categories, the cost decomposition shifts because chunk counts move. Each of these is fine in isolation; the combination is a project.

The work: profile the new corpus, extend the eval set with categories that cover what the new corpus is asked to answer, retune the retriever (chunking strategy, embedding model, reranker thresholds), rebuild the cost budget against the new distribution. Two to three weeks, scoped against the corpus and the new categories rather than the model. The decision frame is structural — if the new corpus is structured in a way the existing retriever cannot handle (relational, graph-shaped, or strongly hierarchical), the right answer might be a different retriever entirely, and that is exactly the conversation to have before the sprint, not after the retune underperforms.

The over-call risk is teams who treat every new document upload as a project. Most new documents fit inside the existing distribution and need nothing more than a re-embed and a recall check. A sprint is warranted when the new corpus is structurally different — a new domain, a new format, a new size class — not when it is incremental.

Trigger 6 — You are standing up a second product surface

Worth a sprint. A second product surface — voice on top of an existing RAG, an agent harness on top of an existing eval-suite, a verifiable-inference layer in front of a model already in production — is a new engagement, compressed because the foundation is solid. The compression matters: the eval discipline, the runbook discipline, the cost discipline are all transferable from the existing system, and the second surface inherits them without re-teaching.

This is the highest-leverage call-back trigger, and the under-call risk is the most expensive. A team that stands up the second surface without the eval discipline that made the first one work will discover the same lessons the engagement taught the first time, on a longer timeline and a higher headcount cost. The right time to call is before the second surface is in design, not after it is in production.

The over-call risk is essentially nil. A team that calls about a second surface and turns out not to need one has had a one-hour conversation and lost nothing.

What the numbers do not say

A few honest qualifications, because the rubric above is a rubric, not a contract.

First, the boundary between “routine ops” and “worth a sprint” is fuzzy on triggers 1–3. A team that has run the system for two years has a different threshold than a team in month three; an engineer with deep RAG experience reads the cost dashboard differently than one who is new to the stack. The right behavior is to treat the labels as defaults, not as absolutes. A team that has called twice on trigger 1 and learned each time should keep calling; a team that has called twice on trigger 4 because it interpreted every contract change as a regulatory perimeter shift should pause and reread the rubric.

Second, the time estimates — “an afternoon” for routine ops, “two to three weeks” for sprints — are typical, not guaranteed. A trigger-5 corpus-drift sprint can be a one-week project on a small, well-bounded corpus and a five-week project on a sprawling, multi-format one. The estimates are anchors for the conversation, not commitments before the scoping memo.

Third, the rubric is built around the model where the engagement that handed off the system is the engagement that comes back to extend it. That is the default and the cheapest path. A team that has changed consultancies in between, or that lost the runbook discipline, or that has accreted custom logic on top of the original system in a way the original team would not recognize — those teams need a re-diagnostic, not a sprint extension. The re-diagnostic is its own two-week engagement, priced and scoped the same as a first-time diagnostic.

The checklist

Before the call:

The trigger is named explicitly, against the rubric above — “this is trigger 4” rather than “we have a problem.”
The dashboard view that surfaced the trigger is in the first message, with a screenshot, a link, or the structured data attached.
The runbook entry for the trigger has been read and the diagnostic path it names has been followed at least one cycle.
The team has a position — keep, change, or unclear — and the call is to validate or refine the position, not to outsource it.
If the trigger is a sprint candidate, the scope is sketched in one paragraph and the first-call agenda is a memo to write together, not a project to start.

Reading list

Three months after handoff — the field map of the quiet drift that produces trigger 1 most often, with the dashboard views and operating cadence that catches it before the customer-support queue does.
LLM-as-judge evaluation — the framework for the per-category eval-pass signal that distinguishes a single-category slide from a system-wide one, which is what trigger 1’s escalation rule turns on.
Agent budgets — the cost-decomposition lens for trigger 3: where the per-tenant, per-tool, and per-step caps live, and how the configuration drift that quietly bloats the bill surfaces on the dashboard.
Agent observability — the dashboard primitives that produce the diagnostic posture trigger 1, 2, and 3 turn on: span traces, spend attribution, and the steady-state monitors that read against the budget set at handoff.
FHE vs. TEE for ML — the perimeter design space for trigger 4, the right reading before a regulatory-driven re-engagement so the kickoff conversation is technical rather than exploratory.
Confidential RAG — the design palette for trigger 5 when the new corpus carries privacy or attestation constraints, and the retune is not just a chunking decision.
The handoff page — the engagement-level contract this rubric operationalizes, and the scope language each sprint trigger maps to.

The right cadence to call a consultancy back is not zero and it is not weekly. It is the cadence the rubric above produces — three triggers your team owns with the dashboards and the runbook, three triggers the team writes a one-paragraph scope on before the first call. A team that has the rubric written down before it needs it will not have an ad-hoc conversation under pressure, and that is the only version of this question that matters. The contract is on paper. The rubric is what makes the contract useful.

When to call your AI consultancy back. A decision tree.

The meta-problem is the failure mode, not the triggers

Trigger 1 — Eval-pass rate drops below the budget

Trigger 2 — A new model lands and you want to A/B against the harness

Trigger 3 — Cost per request creeps above the ceiling

Trigger 4 — A regulatory change moves the compliance perimeter

Trigger 5 — A new data source is on the way and the corpus will drift

Trigger 6 — You are standing up a second product surface

What the numbers do not say

The checklist

Reading list

Indirect prompt injection, by the numbers.

Your learning-rate schedule silently overrides your data-curation decisions.

Approving an agent's action is not authorizing it.

Tell us about it.

Got it.