◇ PROOF OF TECH · AI ENGINEERING · PRODUCTION-GRADE · Booking into Q3 2026

Most AI demos die in production. We build the ones that don't.

We're the team you bring in when the prototype impressed everyone and now has to survive real users. We build RAG, agents, and voice systems — gated on evals you keep and own, so you find out the system is slipping before your customers do. Every engagement opens with a clear go/no-go recommendation, in writing, before you commit to a build. For the systems that genuinely need it, we also build verifiable inference and on-chain agents.

Live eval ledger. Five-stage AI pipeline — prompt, retrieve, infer, eval gate, notarize — with each stage's result hashed onto a chronological on-chain ledger underneath.

Stage 4 of 5: Eval gate active, notarized as 0xe44d…7c08 on block 1027.

◇ LIVE EVAL LEDGER · DEMO RUN 04 / 12
  01 · Prompt "settle market 0x9c…"
  02 · Retrieve k=8 · 142ms
  03 · Infer p95 263ms
  04 · Eval gate PASS · 0.94
  05 · Notarize 0x4e7a…91c2
  BLOCK #1024 0x9c12…d4a3 EZKL · halo2
  BLOCK #1025 0x71fa…0c9e EZKL · halo2
  BLOCK #1026 0xb803…2d51 EZKL · halo2
  BLOCK #1027 0xe44d…7c08 EZKL · halo2
  BLOCK #1028 0x4e7a…91c2 EZKL · halo2
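
To make the ledger idea concrete, here is a minimal sketch of a hash chain over the five stage results. It is a plain SHA-256 chain kept in memory, not an EZKL/halo2 proof and not an on-chain contract; the field names, the `GENESIS` value, and the `notarize` payload are hypothetical choices for illustration.

```python
import hashlib
import json

# Stage names follow the five-stage pipeline in the demo.
STAGES = ["prompt", "retrieve", "infer", "eval_gate", "notarize"]
GENESIS = "0" * 64  # hypothetical starting value for the chain


def entry_hash(prev_hash, stage, result):
    """Hash one stage result together with the previous entry's hash."""
    payload = json.dumps(
        {"prev": prev_hash, "stage": stage, "result": result},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_ledger(results):
    """Chain the stage results in pipeline order."""
    ledger, prev = [], GENESIS
    for stage in STAGES:
        digest = entry_hash(prev, stage, results[stage])
        ledger.append(
            {"stage": stage, "result": results[stage], "prev": prev, "hash": digest}
        )
        prev = digest
    return ledger


def verify_ledger(ledger):
    """Recompute every hash; an edited entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        expected = entry_hash(prev, entry["stage"], entry["result"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


# Values mirror the demo run; the dictionary keys are assumptions.
demo_results = {
    "prompt": {"text": "settle market 0x9c…"},
    "retrieve": {"k": 8, "latency_ms": 142},
    "infer": {"p95_ms": 263},
    "eval_gate": {"verdict": "PASS", "score": 0.94},
    "notarize": {"status": "signed"},
}
ledger = build_ledger(demo_results)
```

Because each entry commits to the hash before it, changing any earlier stage result invalidates every later entry, which is the property the chronological ledger relies on.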
SIGNAL
  • ENGAGEMENT
    06–14 WK

    Typical engagement window, from diagnostic to handoff.

  • SERVICES
    08

    Eight, from production RAG and agents to verifiable, on-chain inference.

  • TEAM SIZE
    02 ENG

    Senior engineers embedded inside your repo, week one.

  • SUPPORT
    30 DAY

    Post-handoff support window. Then your team owns it.

STACK · PRODUCTION-GRADE
Claude · GPT-5 · Anthropic MCP · LangGraph · pgvector · bge-m3 · Cohere Rerank · Modal · Temporal · LiveKit · Deepgram · ElevenLabs · Bittensor · INTELLECT-2 · DiLoCo · EZKL · RISC Zero · Lagrange · Giza · Ritual · 0G · Allora · Olas · ElizaOS · ERC-8004 · x402 · Zama (FHE) · Phala (TEE) · Marlin · Akash · Foundry · OpenZeppelin
01 / SERVICES

Eight services. Every one built to hold up in production.

01 / SERVICE

RAG that holds up under eval

Retrieval-augmented generation with the eval harness built in. We pick the chunker, the embedding model, and the retriever — contextual chunking and late-interaction retrieval where they earn their place — and benchmark every change against a golden set you'll keep using after we leave. The same harness ships on its own, for a model already in production with no way to know when it degrades.

  • ── Hybrid + late-interaction retrieval, agentic re-query
  • ── Contextual & late chunking per document class
  • ── ColPali visual retrieval for layout-heavy docs
  • ── Faithfulness + groundedness evals, CI-gated
pgvector · bge-m3 · cohere-rerank
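
One common way to combine a dense ranking and a keyword ranking in hybrid retrieval is reciprocal rank fusion. The sketch below assumes that technique; the page does not say which fusion method is used, and the function name, the constant `k=60`, and the document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a keyword retriever.
dense_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

A fused list like this would typically feed a reranker, and each change to the fusion step can be benchmarked against the golden set.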
02 / SERVICE

Agentic harnesses

Multi-step agents with tool use, tiered memory, and budgets that don't blow up production costs. Built on MCP, gated by evals, traced end-to-end — and that observability and eval layer retrofits onto an agent you already run. Optional x402 / ERC-8004 rails when the agent has to spend money or prove who it is.

  • ── Tool orchestration over MCP servers
  • ── Budgets, replanning, structured failure contracts
  • ── Tiered memory — working, episodic, semantic
  • ── Trace-level observability + per-step evals
LangGraph · MCP · Temporal · Langfuse
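
As a rough illustration of budgets with a structured failure contract, the sketch below charges each tool call against a step and cost budget before it runs, and returns a machine-readable result instead of crashing when a limit would be crossed. All names (`Budget`, `run_plan`, the status strings) are hypothetical, and costs are integer cents to avoid floating-point drift.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a step runs if it would break a budget."""


@dataclass
class Budget:
    max_steps: int
    max_cost_cents: int
    steps: int = 0
    cost_cents: int = 0

    def charge(self, cost_cents):
        """Reserve one step and its cost, or raise without side effects."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("max_steps")
        if self.cost_cents + cost_cents > self.max_cost_cents:
            raise BudgetExceeded("max_cost_cents")
        self.steps += 1
        self.cost_cents += cost_cents


def run_plan(plan, budget):
    """Run (name, cost_cents, tool) steps until done or out of budget."""
    completed = []
    for name, cost_cents, tool in plan:
        try:
            budget.charge(cost_cents)
        except BudgetExceeded as exc:
            # Structured failure: the caller can replan from "pending".
            return {
                "status": "budget_exceeded",
                "limit": str(exc),
                "completed": completed,
                "pending": name,
            }
        tool()
        completed.append(name)
    return {"status": "ok", "completed": completed, "pending": None}


# Hypothetical three-step plan with stub tools.
plan = [
    ("search", 2, lambda: None),
    ("rerank", 3, lambda: None),
    ("draft", 10, lambda: None),
]
outcome = run_plan(plan, Budget(max_steps=5, max_cost_cents=10))
```

Here the third step is refused before it executes, so the caller sees which limit was hit and which step is still pending.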
03 / SERVICE

Voice agents

Streaming voice systems on phone trees, kiosks, and apps, architected to a sub-300ms p95 budget. STT → LLM → TTS with model-based barge-in, drift detection, and PII-safe transcripts. Every component on a millisecond budget.

  • ── Sub-300ms p95 latency budget
  • ── Model-based barge-in + back-channeling
  • ── Call recording + drift detection
  • ── PII-safe transcripts
LiveKit · Deepgram · ElevenLabs
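
A per-component latency budget can be checked with a few lines of code. The sketch below assumes a hypothetical split of the 300 ms target across STT, LLM, and TTS, and uses a nearest-rank p95; the split, the function names, and the sample data are illustrative only.

```python
# Hypothetical per-component budgets in milliseconds; they sum to 300.
COMPONENT_BUDGET_MS = {"stt": 80, "llm": 140, "tts": 80}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integers
    return ordered[rank - 1]


def latency_report(turns):
    """Compare observed p95 per component and end to end with the budget."""
    report = {}
    for component, budget in COMPONENT_BUDGET_MS.items():
        observed = p95([turn[component] for turn in turns])
        report[component] = {
            "p95_ms": observed,
            "budget_ms": budget,
            "ok": observed <= budget,
        }
    totals = [sum(turn[c] for c in COMPONENT_BUDGET_MS) for turn in turns]
    total_budget = sum(COMPONENT_BUDGET_MS.values())
    observed_total = p95(totals)
    report["end_to_end"] = {
        "p95_ms": observed_total,
        "budget_ms": total_budget,
        "ok": observed_total <= total_budget,
    }
    return report


# Hypothetical measurements: one slow LLM turn out of twenty.
fast = {"stt": 60, "llm": 120, "tts": 70}
slow = {"stt": 60, "llm": 400, "tts": 70}
report = latency_report([fast] * 19 + [slow])
```

With one slow turn in twenty the p95 still holds; a second slow turn would push the LLM component and the end-to-end figure over budget.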
04 / SERVICE

Post-training & grounding

Fine-tuning and post-training on your data and your task — for when prompting and retrieval have hit their ceiling. SFT, preference tuning, and eval-gated checkpoint selection. The training recipe is handed off, so you can reproduce every result after we leave.

  • ── SFT and preference tuning on your data
  • ── Eval-gated checkpoint selection
  • ── Distillation for latency and cost
  • ── A reproducible training recipe at handoff
TRL · vLLM · Modal
05 / SERVICE

Verifiable inference (zkML / opML)

Cryptographic proof that a model produced a specific output from a specific input — without revealing the weights or the data. zkML (EZKL) for small, fixed models; optimistic-ML and TEE attestation when the model is too big to prove outright. Built for auditable risk models, oracle feeds, and prediction-market settlement.

  • ── EZKL Halo2 proofs for small, fixed models
  • ── opML + hardware-attested TEE (H100/H200) for production-scale models
  • ── RISC Zero zkVM for general execution proofs
  • ── On-chain verifier contracts + governance hooks
EZKL · RISC Zero · Ora · Phala
06 / SERVICE

Decentralized training & sovereign compute

DiLoCo-style distributed pretraining — a technique that now genuinely reaches 40–72B parameters over the open internet — plus Bittensor subnet design and GPU-market cost modeling. We build the validator policy, the emissions curve, and the verification layer so untrusted workers can still produce trusted gradients.

  • ── Bittensor subnet design + validator policy
  • ── DiLoCo / DisTrO communication-efficient SGD
  • ── TOPLOC-style verification of rollouts
  • ── GPU market routing (Akash, io.net, Crusoe)
Bittensor · PRIME-RL · Psyche · Akash
07 / SERVICE

On-chain agents & autonomous economics

Agents that hold wallets, sign transactions, settle services with stablecoins, and prove who they are. We build agents for Polymarket, Base, and custom appchains — with hard refusal-on-edge and PnL ceilings.

  • ── ERC-8004 identity + reputation registries
  • ── x402 micropayments + AP2 mandates for tool calls
  • ── ElizaOS / Olas / Virtuals composition
  • ── Tiered spend ceilings, treasury isolation, circuit breakers
ElizaOS · ERC-8004 · x402 · Olas
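
The spend ceilings and circuit breaker can be pictured as a guard that sits between the agent and the signing key. This is a minimal in-memory sketch under assumed rules (a per-transaction tier, a daily tier, and a drawdown trip); the class name, the reason strings, and the thresholds are hypothetical, and amounts are integers in the smallest currency unit.

```python
from dataclasses import dataclass


@dataclass
class SpendGuard:
    per_tx_ceiling: int
    daily_ceiling: int
    max_drawdown: int
    spent_today: int = 0
    realized_pnl: int = 0
    tripped: bool = False

    def authorize(self, amount):
        """Return (allowed, reason); only allowed spends are recorded."""
        if self.tripped:
            return False, "circuit_open"
        if amount > self.per_tx_ceiling:
            return False, "per_tx_ceiling"
        if self.spent_today + amount > self.daily_ceiling:
            return False, "daily_ceiling"
        self.spent_today += amount
        return True, "ok"

    def record_pnl(self, delta):
        """Track realized PnL and trip the breaker at the drawdown limit."""
        self.realized_pnl += delta
        if self.realized_pnl <= -self.max_drawdown:
            self.tripped = True


guard = SpendGuard(per_tx_ceiling=500, daily_ceiling=1000, max_drawdown=300)
first = guard.authorize(400)
too_big = guard.authorize(600)
```

Once the breaker trips, every later request is refused until a human resets the guard, which keeps a compromised or confused agent from draining the wallet.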
08 / SERVICE

AI-agent security & audits

Security audits for agents that hold wallets and sign transactions. We red-team the prompt-injection-to-transaction attack surface that smart-contract auditors don't cover — because the contract is fine; the agent is the hole.

  • ── Prompt-injection → transaction red-teaming
  • ── Spend-limit and refusal-boundary review
  • ── Signing-key isolation + MCP allowlist audit
  • ── ERC-8004 identity hygiene
Foundry · ERC-8004 · custom injection suites
02 / APPROACH

An engagement looks like this — predictable by design.

W1–2
STEP 01

Diagnostic

We sit with the team, read the data, and write a 12-page memo: what to build, what not to build, what eval to point it at.

2 weeks fixed
1 principal eng
Memo + spike repo
W3–10
STEP 02

Build

We pair with your engineers in-repo. Eval gates from day one. Weekly demos against the metrics defined in week 2.

6–8 weeks
2 senior engs
Production deploy
W11+
STEP 03

Handoff

Runbooks, eval suite, on-call rotation, and a 30-day support window. We leave when your team can run it without us.

2 weeks fixed
Docs + runbooks
30 day support
03 / METHODOLOGY

What you'll actually have, week by week.

W1–2

Diagnostic

INPUTS WE NEED
  • A representative data sample.
  • Your current eval suite — even if it's a sheet.
  • One engineer, full attention, for week one.
WE SHIP
  • A 12-page memo: what to build, what to skip.
  • A spike repo with the riskiest path proven out.
  • An eval-suite skeleton, wired and runnable.
WE MEASURE
  • Open questions closed by end of week 2.
  • Decision clarity — yes / no / not yet.
  • Sign-off speed: memo → green-light.
W3–10

Build

INPUTS WE NEED
  • Repo access and a CI lane we can break.
  • Authority to decide tradeoffs in real time.
  • A 45-minute demo slot, weekly, no slides.
WE SHIP
  • Production deploy gated by the eval suite.
  • A dashboard for eval-pass rate and p95.
  • Runbook v1 — incidents, rollback, scaling.
WE MEASURE
  • p95 latency against the budget set in week 2.
  • Eval-pass rate, run-over-run.
  • Deploy frequency — should rise, not fall.
W11+

Handoff

INPUTS WE NEED
  • Your on-call rotation and pager policy.
  • The team that will own this, named, not TBD.
  • Two half-day training sessions on the calendar.
WE SHIP
  • Runbooks, eval suite, dashboards — yours.
  • On-call rotation handover with shadow shifts.
  • 30 days of on-tap support, no scope haggling.
WE MEASURE
  • Incidents resolved without us.
  • MTTR — pre vs. post handoff.
  • Eval-suite coverage your team can extend.
04 / WHAT WE BUILD

Six representative engagements.

For each one: the problem it starts from, and the system we build to solve it.

ENGAGEMENT
01
DeFi

Verifiable risk model gating governance

zkML · EZKL · Halo2
PROBLEM
Governance rotates a credit model nobody can audit on-chain. Parameter updates ship on trust rather than proof.
SYSTEM
PyTorch model exported to Halo2 circuits via EZKL. An on-chain verifier gates every parameter update, and execution is replay-deterministic against fixed input commitments.
ENGAGEMENT
02
Prediction Markets

Autonomous trading agent on Polymarket

Agents · ElizaOS · x402
PROBLEM
Manual market-making misses overnight repricing on news-driven markets. The edges are real; the staffing to capture them is not.
SYSTEM
ElizaOS agent holding its own wallet on Base. x402 pays per call for premium data feeds, with an Allora forecast feed as a prior and strict refusal on under-priced edges. Position-level spend ceilings and a drawdown circuit-breaker bound the downside.
ENGAGEMENT
03
Healthcare Imaging

Bittensor subnet for federated eval

Bittensor · Federated Eval
PROBLEM
Partner hospitals will not share scans, so evaluation runs on a small sliver of what is available — and the model cannot be trusted past it.
SYSTEM
Subnet design and validator policy so hospitals submit only labels for held-out scans they keep locally. TAO emissions are priced against agreement on a calibration head, and cryptographic commitments seal the eval set.
ENGAGEMENT
04
Insurance

Voice intake for first-notice-of-loss claims

Voice · RAG · Agents
PROBLEM
Long IVR handle times and low first-call resolution. Every triage minute is a customer thinking about switching.
SYSTEM
Voice agent architected to a tight real-time latency budget, with hybrid retrieval against policy documents and human-in-the-loop review on liability calls.
ENGAGEMENT
05
Logistics

Routing copilot for dispatch ops

Agents · MCP
PROBLEM
Dispatchers move between several disconnected systems to rebook a load. The bottleneck is the human stitching the tools together, not the model.
SYSTEM
Agentic harness with MCP servers for the TMS, ELD, and weather sources. Replanning under cost ceilings, delivered through a Slack-native interface.
ENGAGEMENT
06
Asset Management

Research copilot over a decade of memos

RAG · Evals
PROBLEM
Portfolio managers read deep internal-memo archives before every investment committee. The corpus is the moat, and none of it is searchable.
SYSTEM
Retrieval over years of memos and filings, with citation-first answers and strict refusal on unsourced claims.
05 / FAQ

Questions we answer on the first call anyway.

How do you charge?
Fixed fee for the diagnostic — two weeks, paid up front. The build phase is a weekly retainer, scoped to the deliverables set in week two. We bill outcomes, not hours. No success fees, no equity, no kickers.
Do you sign NDAs?
Yes. Mutual NDA — our paper or yours — signed before the first technical conversation. Most teams send their own; we counter-sign within a business day.
What does “eval-gated” actually mean in practice?
Every commit runs against a versioned eval suite. If the eval-pass rate drops below the budget set in week two, the deploy doesn't ship. The suite is yours at handoff — runner, dataset, scoring rubric, and the dashboard that watches it.
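
In a CI job, the gate can be as small as a pass-rate check that sets the exit code. This is a sketch under assumptions: the function names are hypothetical, each eval case is reduced to a pass/fail boolean, and the 0.90 budget is a placeholder for whatever is agreed in week two.

```python
def eval_pass_rate(case_results):
    """Fraction of eval cases that passed."""
    if not case_results:
        raise ValueError("eval suite produced no results")
    return sum(1 for passed in case_results if passed) / len(case_results)


def gate_exit_code(case_results, pass_rate_budget):
    """Return 0 to let the deploy proceed, 1 to block it."""
    rate = eval_pass_rate(case_results)
    return 0 if rate >= pass_rate_budget else 1


# Hypothetical run: 19 of 20 cases pass against a 0.90 budget.
exit_code = gate_exit_code([True] * 19 + [False], pass_rate_budget=0.90)
```

A CI step would pass this value to `sys.exit`, so a non-zero code fails the pipeline and the deploy never starts.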
Will you push us off our existing stack?
No. We work with the models, vector stores, and frameworks you've already chosen — unless one of them is the reason the project is stuck. If so, the diagnostic memo says so, and you decide.
Who owns the IP and the eval suite after handoff?
You do. All of it. Code, evals, runbooks, dashboards. We retain no rights, no licenses, no required attribution. The only thing we keep is the right to reference the engagement publicly — with your written sign-off.
What if the diagnostic recommends not building?
You keep the diagnostic memo and the analysis behind it, and you make the call with clear eyes. A sound go/no-go decision is a real outcome of the engagement — getting that call right matters as much as shipping the system itself.
Do you do AI engagements without the on-chain piece?
Yes — most engagements won't need it. We build verifiable, on-chain inference because some systems genuinely require it. If yours doesn't, we won't bolt it on.
How fast can you start?
Diagnostic phase usually starts two to four weeks after the first call. We run one diagnostic at a time, so the calendar is the constraint. Right now we're booking into Q3 2026.
◇ NEXT · A 30-MIN CALL · NO DECKS

Bring us a hard problem.
We'll show you what we'd build.

The first call is a free 30 minutes. You'll come away with a one-page memo on what we'd build and how we'd approach it.

Or email hello@proofoftech.org
NEW ENGAGEMENT · INTAKE

Tell us about it.

The more specific you are, the more useful our first reply.
