- ◇ Incident triage with your team in a shared Slack channel.
- ◇ Runbook updates as your on-call finds gaps in the first 30 days.
- ◇ Eval-suite tuning when the harness flags something the rubric did not catch.
- ◇ One drift-detection review at day 14 and day 28, written up as a short memo.
After handoff. What you own, what we still owe you, when to call us back.
An engagement ends in week ten, eleven, or twelve. After it ends, your team runs the system. This page is what that actually looks like — the artifacts you walk away with, the 30-day support window that bridges the gap, and the explicit triggers that suggest a call back. Engagements don't auto-convert into retainers — we build for your team's independence from day one.
Five artifacts, all yours — no retained rights, no telemetry.
- ◇ ARTIFACT
Production deploy
Running in your infrastructure, on your accounts, behind your access controls. We do not host inference for you after handoff.
- ◇ ARTIFACT
Eval suite
Runner, golden dataset, scoring rubric, CI gate. Versioned in your repo. Your engineers extend it; we do not gatekeep the suite.
- ◇ ARTIFACT
Dashboard
Eval-pass rate, p95 latency, cost per request, drift signals. Wired to your observability stack — no Proof-of-Tech-hosted reporting.
- ◇ ARTIFACT
Runbooks
One per failure mode the production deploy actually surfaced. Written with your on-call in the room. Generic playbooks do not survive contact with real incidents.
- ◇ ARTIFACT
Training
Two half-day sessions, recorded, with your named owners. We cover the eval-suite extension protocol, the runbook contract, and the incident playbook end-to-end.
What is in the runbook, and why generic ones do not survive.
The runbook is not a document we write in week eleven and email you. It is a per-failure-mode operations contract, written with your on-call in the room, against the failure modes the production deploy actually surfaced during the build phase. Every entry has a trigger condition, a measurable signal, a procedure, and an owner.
The default runbook covers, at minimum:
- Incident playbook — one entry per failure mode, with rollback paths for prompts, models, retrievers, and tool configurations.
- Scaling triggers — when to add capacity, when to shed load, and where the non-linear cost cliffs live in your stack.
- Eval-drift signals — what to look at when the harness flags a regression, and how to distinguish a model issue from a data-distribution one.
- Cost ceilings — per-tenant, per-tool, per-step caps wired to circuit-breakers that fail closed.
- Disaster recovery — data loss, key compromise, vendor outage. We name the vendors and the contingencies for each.
Generic playbooks tell you to "check the logs" and "rollback if needed." A runbook earned against real incidents tells you which query the dashboard is asking, which version to roll back to, and which engineer owns the page.
The bridge between handoff and your team running it alone.
- ◇ New features beyond what shipped in the build phase.
- ◇ Model swaps and major architectural changes.
- ◇ New eval categories not covered by the suite at handoff.
Anything out of scope can be added by a written addendum — same engagement letter, scoped deliverables. We will not surface-creep without one.
Six triggers. Three are routine ops; three usually warrant a sprint.
Eval-pass rate drops below the budget set in week two.
Your harness already catches this — that's why we wrote the budget into the CI gate. The drop is the signal, not the diagnosis. If your team can read the per-category breakdown and the recent dataset changes, this is a normal incident. If it persists past one debugging cycle, it warrants a call.
A new model you want to A/B against your harness.
Routine ops. The eval suite is yours; point it at the new model, read the diff, decide. The reason to call is when the new model wins on aggregate but loses on a category you care about — that is an eval-design conversation, not a model-swap one.
Cost per request creeps above the ceiling.
The dashboard shows where the cost lives — usually rerankers, tool calls, or context bloat. Your team can read the breakdown and apply the obvious moves. Call us when the obvious moves are not enough and the next step is a structural change.
A regulatory change moves the compliance perimeter.
New data-handling rules, an attested-vendor requirement, an audit posture shift — these usually require an architecture change, not just a configuration one. Two- to three-week scoped sprint; the harness already exists, the work is at the perimeter.
You are adding a new data source and seeing corpus drift.
Adding a corpus is rarely a configuration change. Retrieval characteristics shift, faithfulness changes, the eval set needs new categories. Worth a scoped engagement to extend the suite, retune the retriever, and rebuild the budget.
You are standing up a second product surface.
Voice on top of the RAG; an agent on top of the eval harness; a verifiable-inference layer in front of a model already in production. The system you have is the foundation; the second surface is a new engagement, compressed because the foundation is solid.
Retainers are separate, and entirely optional.
A retainer is a separate contract, scoped explicitly, with a quarterly review. Things that are reasonable under a retainer: quarterly eval reviews, model-swap support against the harness, eval-set maintenance as your corpus shifts, and a named escalation path for incidents your team cannot resolve in the first hour.
A retainer should map to real, ongoing work — not exist to defend revenue. If a quarter passes without a real ticket, the retainer ends or the price drops to match the work — your call, in writing, at the review.
The first call to discuss any of this is the same as the first call to discuss any other engagement — hello@proofoftech.org , 30 minutes, no decks. If you have a runbook of ours and a working harness, mention it in the first message and we will pull the engagement file before the call.
Bring us a hard problem.
We'll show you what we'd build.
The first call is a free 30 minutes. You'll come away with a one-page memo on what we'd build and how we'd approach it.