Most AI demos die in production. We build the services that don't.

We're the team you bring in when the prototype impressed everyone and now has to survive real users, real queues, and real operating cost. We build RAG, agents, and voice systems with the evals, observability, and runbooks that tell you whether the system should scale before customers or operators pay for the miss. Every engagement opens with a clear go/no-go recommendation, in writing, before you commit to a build. The eval suite is yours at handoff, so you find out the system is slipping before your customers do.

View work

ENGAGEMENT
06–14 WK
Typical engagement window, from diagnostic to handoff.
SERVICES
06
Six ways to retire production risk: RAG, agents, voice, post-training, agent security, and payment rails.
TEAM SIZE
02 ENG
Senior engineers embedded inside your repo, with one accountable path to deploy.
SUPPORT
30 DAY
Post-handoff support window while ownership moves fully to your team.

PRODUCTION STACK

Claude · GPT-5 · Anthropic MCP · LangGraph · pgvector · bge-m3 · Cohere Rerank · Modal · Temporal · LiveKit · Deepgram · ElevenLabs · x402 · ERC-8004 · AP2 · Foundry · ElizaOS

01 / SERVICES

Six services. Every one built to hold up in production.

01 / SERVICE

RAG that holds up under eval

Retrieval-augmented generation with the eval harness built in. An index is the right call on layout-heavy, citation-bound, or latency-tight work, and the diagnostic makes that call before the build starts, so it is never an assumption baked in on day one. We pick the chunker, the embedding model, and the retriever, using contextual chunking and late-interaction retrieval where they earn their place, and benchmark every change against a golden set you'll keep using after we leave. The goal is fewer unsupported answers, less manual review, and degradation caught before customers see it. The same harness also ships on its own, for a model already in production with no way to know when it degrades.

── Hybrid + late-interaction retrieval, agentic re-query
── Contextual & late chunking per document class
── ColPali visual retrieval for layout-heavy docs
── Faithfulness + groundedness evals, CI-gated

pgvector · bge-m3 · cohere-rerank →

02 / SERVICE

Agentic harnesses

Multi-step agents with tool use, tiered memory, and budgets that don't blow up production costs. Built on MCP, gated by evals, traced end-to-end — and that observability and eval layer retrofits onto an agent you already run. The point is bounded spend, bounded blast radius, and failures your team can inspect instead of replaying from logs after the fact. When the agent has to spend money, the payment-rails service ships spend policy outside the model — ceilings, treasury isolation, and circuit breakers the agent cannot reason around.

── Tool orchestration over MCP servers
── Budgets, replanning, structured failure contracts
── Tiered memory — working, episodic, semantic
── Trace-level observability + per-step evals

LangGraph · MCP · Temporal · Langfuse →

03 / SERVICE

Voice agents

Streaming voice systems on phone trees, kiosks, and apps, architected to a sub-300ms p95 budget. STT → LLM → TTS with model-based barge-in, drift detection, and PII-safe transcripts. The system is built to cut handle time without hiding latency, failed turns, or policy-risk calls from the operators who own the queue. Every component on a millisecond budget.

── Sub-300ms p95 latency budget
── Model-based barge-in + back-channeling
── Call recording + drift detection
── PII-safe transcripts

LiveKit · Deepgram · ElevenLabs →

04 / SERVICE

Post-training & grounding

Fine-tuning and post-training on your data and your task — for when prompting and retrieval have hit their ceiling. SFT, preference tuning, and eval-gated checkpoint selection. We make the gain reproducible, not mystical: the training recipe is handed off, so you can rerun it, audit it, and keep improving the model after we leave.

── SFT and preference tuning on your data
── Eval-gated checkpoint selection
── Distillation for latency and cost
── A reproducible training recipe at handoff

TRL · vLLM · Modal →

05 / SERVICE

AI-agent security & audits

Security audits for agents that hold wallets and sign transactions. We red-team the prompt-injection-to-transaction attack surface that smart-contract auditors don't cover — because the contract can be fine while the agent is still the hole. The work gives finance, security, and engineering a shared view of what the agent can spend, sign, and refuse.

── Prompt-injection → transaction red-teaming
── Spend-limit and refusal-boundary review
── Signing-key isolation + MCP allowlist audit
── ERC-8004 identity hygiene

Foundry · ERC-8004 · custom injection suites →

06 / SERVICE

Agent treasury & payment rails

The standards behind agent payments and identity (x402, ERC-8004, AP2) shipped the rails to hold and spend money. They did not ship the controls that keep a prompt-injected agent from draining a wallet. We build that layer: tiered spend ceilings enforced outside the model, treasury isolation with no auto-top-up, drawdown circuit breakers, and immutable receipts your ops team can actually investigate. The agent gets frictionless per-call payments; finance and security get a bounded blast radius they can sign off on.

── Per-call, per-counterparty, and per-day ceilings enforced outside the model
── Treasury isolation: hot-wallet float only; no agent-initiated top-up
── Drawdown circuit breakers and immutable payment receipts
── x402 / ERC-8004 / AP2 integration wired into the agent harness

x402 · ERC-8004 · AP2 · Foundry →
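
As a rough illustration of ceilings enforced outside the model, the sketch below checks every payment request in ordinary code before anything is signed. It is a minimal sketch under stated assumptions: `SpendPolicy`, `SpendGuard`, the reason strings, and the limit values are hypothetical names chosen for the example and are not part of x402, ERC-8004, or AP2. The breaker rule (open when the daily ceiling would be breached, stay open until a human resets it) is one simple stand-in for a drawdown circuit breaker.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class SpendPolicy:
    """Hypothetical ceilings; the agent never sees or edits these."""
    per_call: Decimal
    per_counterparty_per_day: Decimal
    per_day: Decimal


@dataclass
class SpendGuard:
    """Runs outside the model: the agent can request a payment, never approve one."""
    policy: SpendPolicy
    spent_today: Decimal = Decimal("0")
    spent_by_counterparty: dict[str, Decimal] = field(default_factory=dict)
    breaker_open: bool = False

    def authorize(self, counterparty: str, amount: Decimal) -> tuple[bool, str]:
        """Return (allowed, reason). Spend is recorded only when allowed."""
        if self.breaker_open:
            return False, "circuit_open"
        if amount <= 0:
            return False, "invalid_amount"
        if amount > self.policy.per_call:
            return False, "per_call_ceiling"
        prior = self.spent_by_counterparty.get(counterparty, Decimal("0"))
        if prior + amount > self.policy.per_counterparty_per_day:
            return False, "per_counterparty_ceiling"
        if self.spent_today + amount > self.policy.per_day:
            # Assumed breaker rule: a daily-ceiling breach halts all spend
            # until an operator resets the guard.
            self.breaker_open = True
            return False, "per_day_ceiling"
        self.spent_today += amount
        self.spent_by_counterparty[counterparty] = prior + amount
        return True, "ok"
```

Because the guard is plain code holding plain state, a prompt-injected agent can change what it asks for but not what gets authorized. Daily resets, persistence, and receipts are left out of the sketch.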

02 / APPROACH

An engagement looks like this — predictable by design.

W1–2

STEP 01

Diagnostic

We sit with the team, read the data, and retire the riskiest assumption first. The output is a 12-page memo: what to build, what to skip, what eval to point it at.

◇ 2 weeks fixed

◇ 1 principal eng

◇ Memo + spike repo

W3–10

STEP 02

Build

We pair with your engineers in-repo. Eval gates from day one, weekly demos against the metrics defined in week 2, and tradeoffs made while the code is still cheap to change.

◇ 6–8 weeks

◇ 2 senior engs

◇ Production deploy

W11+

STEP 03

Handoff

Runbooks, eval suite, dashboards, on-call rotation, and a 30-day support window. The deliverable is not just code; it is your team's ability to run it without us.

◇ 2 weeks fixed

◇ Docs + runbooks

◇ 30-day support

03 / METHODOLOGY

What you'll actually have, week by week.

W1–2

Diagnostic

INPUTS WE NEED:
◇ A representative data sample.
◇ Your current eval suite, even if it's a sheet.
◇ One engineer, full attention, for week one.

WE SHIP:
◇ A 12-page memo: what to build, what to skip.
◇ A spike repo with the riskiest path proven out.
◇ An eval-suite skeleton, wired and runnable.

WE MEASURE:
◇ Open questions closed by end of week 2.
◇ The riskiest assumption retired or named.
◇ Decision clarity: yes / no / not yet.

W3–10

Build

INPUTS WE NEED:
◇ Repo access and a CI lane we can break.
◇ Authority to decide tradeoffs in real time.
◇ A 45-minute demo slot, weekly, no slides.

WE SHIP:
◇ Production deploy gated by the eval suite.
◇ A dashboard for eval-pass rate and p95.
◇ Runbook v1: incidents, rollback, scaling.

WE MEASURE:
◇ p95 latency against the budget set in week 2.
◇ Eval-pass rate, run-over-run.
◇ Deploy confidence: gated, observable, reversible.

W11+

Handoff

INPUTS WE NEED:
◇ Your on-call rotation and pager policy.
◇ The team that will own this, named, not TBD.
◇ Two half-day training sessions on the calendar.

WE SHIP:
◇ Runbooks, eval suite, dashboards, all yours.
◇ On-call rotation handover with shadow shifts.
◇ 30 days of on-tap support, no scope haggling.

WE MEASURE:
◇ Incidents resolved without us.
◇ MTTR, pre vs. post handoff.
◇ Eval-suite coverage your team can extend.

Two-sided matching · Project intake · Parent-level reporting

RAG · Agents

PROBLEM

A two-sided marketplace and its parent company each run project intake differently. Contractors submit bids through one channel; internal teams track progress through another.

RISK

Matching quality and parent-level reporting both suffer when intake data does not flow cleanly between the marketplace and the operating company.

SYSTEM

Unified intake agent with retrieval over project specs and contractor profiles, structured handoff between marketplace matching and parent-level reporting dashboards.

05 / WRITING

Engineering notes from the field.

Field notes are where we show the evaluation standard behind the client work: failure modes, latency budgets, agent spend, and the edge cases demos skip.

2026.07.16 SECURITY

Indirect prompt injection, by the numbers.

Our taxonomy argued prompt injection is a vulnerability class you contain, not a bug you fix. Here is the quantitative half — against the strongest published defenses, indirect-injection attack success stays high, and for agents that can act it stays alarming.

10 min →

2026.07.09 TRAINING

Your learning-rate schedule silently overrides your data-curation decisions.

A quality-ascending curriculum beats random shuffling — until a decaying schedule delivers your best data exactly when the learning rate is too small to absorb it. The same coupling flips proxy-model ablations: which dataset wins depends on the schedule, not the data alone.

13 min →

2026.07.08 SECURITY

Approving an agent's action is not authorizing it.

An agent workflow that pauses for human approval and then resumes looks safe — a person clicked approve. But the approval decision and the authority the resumed step runs under are two different objects, and most systems conflate them by carrying the approval as an in-band signal the resumed call trusts. That is a confused-deputy bug: a forged resume, a replayed request, or an approval granted at one gate authorizes an action it was never meant to. The fix is to stop transporting approval and start deriving authority from the trusted approval record at resume time, scoped to the exact suspended step — capabilities enforced at the tool boundary, fail-closed by default.
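
The difference between transporting approval and deriving authority can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the note: `ApprovalStore`, `action_digest`, and the in-memory dictionary are hypothetical stand-ins for a trusted, server-side approval record. The resume path ignores anything the caller claims about approval and looks the record up itself, bound to one step and one exact call, usable once.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


def action_digest(tool: str, args: dict) -> str:
    """Canonical hash of the exact suspended call (tool name plus arguments)."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ApprovalRecord:
    step_id: str
    digest: str
    consumed: bool = False


class ApprovalStore:
    """Trusted approval records; a hypothetical in-memory version."""

    def __init__(self) -> None:
        self._records: dict[str, ApprovalRecord] = {}

    def approve(self, step_id: str, tool: str, args: dict) -> None:
        """Called only by the approval gate, never by the resuming request."""
        self._records[step_id] = ApprovalRecord(step_id, action_digest(tool, args))

    def authorize_resume(self, step_id: str, tool: str, args: dict) -> bool:
        """Fail closed: no record, a used record, or a different call all deny."""
        record = self._records.get(step_id)
        if record is None or record.consumed:
            return False
        if not hmac.compare_digest(record.digest, action_digest(tool, args)):
            return False
        record.consumed = True  # single use, so a replayed resume is denied
        return True
```

A forged resume finds no record, a replay finds a consumed one, and an approval granted for one step or one set of arguments does not match any other.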

14 min →

ALL FIELD NOTES →

06 / FAQ

Questions we answer on the first call anyway.

How do you charge?

Fixed fee for the diagnostic — two weeks, paid up front. The build phase is a weekly retainer, scoped to the deliverables set in week two. We bill outcomes, not hours. No success fees, no equity, no kickers.

Do you sign NDAs?

Yes. Mutual NDA — our paper or yours — signed before the first technical conversation. Most teams send their own; we counter-sign within a business day.

What does “eval-gated” actually mean in practice?

Every commit runs against a versioned eval suite. If the eval-pass rate drops below the budget set in week two, the deploy doesn't ship. The suite is yours at handoff — runner, dataset, scoring rubric, and the dashboard that watches it.
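
In its simplest form, the gate is a pass-rate check whose result becomes the CI job's exit code. The sketch below is illustrative only: `CaseResult`, `pass_rate`, `gate`, and the threshold values are hypothetical names and numbers, and a real suite would add versioned datasets and per-metric budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool


def pass_rate(results: list[CaseResult]) -> float:
    """Fraction of eval cases that passed; an empty run is treated as an error."""
    if not results:
        raise ValueError("eval suite produced no results")
    return sum(r.passed for r in results) / len(results)


def gate(results: list[CaseResult], min_pass_rate: float) -> int:
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    rate = pass_rate(results)
    print(f"eval pass rate: {rate:.1%} (budget: {min_pass_rate:.1%})")
    return 0 if rate >= min_pass_rate else 1
```

A CI step would pass the return value of `gate(...)` to `sys.exit`, so a run below budget fails the pipeline and the deploy never starts.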

Will you push us off our existing stack?

No. We work with the models, vector stores, and frameworks you've already chosen — unless one of them is the reason the project is stuck. If so, the diagnostic memo says so, and you decide.

Who owns the IP and the eval suite after handoff?

You do. All of it. Code, evals, runbooks, dashboards. We retain no rights, no licenses, no required attribution. The only thing we keep is the right to reference the engagement publicly — with your written sign-off.

What if the diagnostic recommends not building?

You keep the diagnostic memo and the analysis behind it, and you make the call with clear eyes. A sound go/no-go decision is a real outcome of the engagement — getting that call right matters as much as shipping the system itself.

How do we know this is worth building?

The diagnostic starts there. We identify the workflow, the failure mode, the user who owns it, and the eval that would prove the system is improving the work rather than adding another tool to babysit.

What does our team need to own after handoff?

The repo, the eval suite, the dashboards, the runbooks, and the on-call path. We do not hand over a black box; we hand over the operating surface your engineers need to debug, extend, and retire parts of the system when the workflow changes.

Can you work with business stakeholders as well as engineering?

Yes. Engineering owns the system, but the workflow usually belongs to ops, risk, support, sales, or finance. We keep the technical interface precise and translate the build/no-build decision into the operational risk it retires.

How fast can you start?

The diagnostic phase usually starts two to four weeks after the first call. We run one diagnostic at a time, so the calendar is the constraint. Right now we're booking into Q4 2026.

◇ NEXT · A 30-MIN CALL · NO DECKS

Bring us a hard problem.
We'll show you what we'd build.

The first call is free and runs 30 minutes. You'll leave with the first cut of the build/no-build path, the riskiest assumption, and the eval we'd use to test it.

or hello@proofoftech.org
