# Tool design for agents: the schema is the prompt

An agent has a tool called `search`. It also has a tool called `query`, a tool called `lookup`, and a tool called `get_records`. A developer reading that list can reconstruct the history — four people added four things over two years and nobody reconciled them. The model reading that list reconstructs nothing. It has four names, four one-line descriptions, a JSON schema for each, and a decision to make on this turn and every turn after it.

It picks `query`. The task needed `search`. The run fails six steps later, somewhere with no obvious connection to the wrong choice, and the trace shows a confident call to a real tool with plausible arguments. Nothing threw. The agent just did the wrong thing, fluently, because the words it was given did not separate the four tools — and the words are the entire interface.

This is the part of agent engineering that gets the least design attention and deserves much more of it. A tool's name, its description, and its input schema are not API plumbing. They are prompt — model-facing text the model reads to decide what to do — and most teams write them the way they write an internal API: generated from an OpenAPI document, named after the database table, described in the register of a code comment. This post is about treating the tool definition as the prompt surface it is: how the model actually uses it, the two ways a call goes wrong, and the design moves — naming, schema, errors, catalog size, evaluation — that change whether an agent calls the right tool with the right arguments.

## The tool definition is the only thing the model sees

Start with what the model actually has. In the Model Context Protocol — now the default tool-integration layer across the major labs — a tool is an object with a `name`, a `description`, an `inputSchema` expressed in JSON Schema, and an optional `outputSchema`. The [MCP specification](https://modelcontextprotocol.io/specification/2025-11-25/server/tools) is explicit that tools are _model-controlled_: "the language model can discover and invoke tools automatically based on its contextual understanding." Your code does not hand the model a tool. The model reads the catalog and chooses.
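To make that object concrete, here is a minimal sketch of a tool definition as a Python dictionary, plus a helper that serializes only the fields the model is shown. The tool `crm_contacts_search`, its parameter `text`, the `x_internal_notes` key, and the helper `model_view` are all hypothetical names chosen for illustration; only the four field names come from the MCP specification.

```python
import json

# Hypothetical tool in the shape the MCP spec describes:
# name, description, inputSchema (JSON Schema), optional outputSchema.
crm_contacts_search = {
    "name": "crm_contacts_search",
    "description": "Search CRM contacts by name or email fragment.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Name or email fragment to match.",
            },
        },
        "required": ["text"],
    },
    # Hypothetical internal field: lives in your code, never reaches the model.
    "x_internal_notes": "Differs from `query`; see the design thread.",
}

MODEL_VISIBLE_FIELDS = ("name", "description", "inputSchema", "outputSchema")


def model_view(tool: dict) -> str:
    """Serialize only the fields the model is shown."""
    visible = {key: tool[key] for key in MODEL_VISIBLE_FIELDS if key in tool}
    return json.dumps(visible, indent=2)


if __name__ == "__main__":
    print(model_view(crm_contacts_search))
```

Printing `model_view(...)` for each of your own tools is a cheap way to read the catalog the way the model does, with everything else stripped away.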

And it reads only the catalog. It does not see your implementation. It does not see the OpenAPI document the schema was generated from, the comment above the handler, the design thread where you agreed what the tool was for, or the ticket explaining why `query` and `search` are different things. It sees a name, a sentence or two, and a schema. Every gap between what you meant and what those words say is a gap the model closes with a guess.

That is the whole argument, and it has one consequence worth stating flatly: a tool definition is a prompt. It is written for a model, read by a model, and acted on by a model. That it is _also_ a machine-readable contract for your own code does not make it less of a prompt — it makes it a prompt carrying a second job. Design it for the harder reader.

## Two ways a tool call goes wrong

A tool call fails in one of two places, and the distinction drives every design decision below. The Gorilla work on connecting models to large API sets ([arXiv 2305.15334](https://arxiv.org/abs/2305.15334)) named both: a model's "tendency to hallucinate the wrong usage of an API call" — it picked the wrong tool — and its "inability to generate accurate input arguments" — it picked the right tool and filled it wrong.

```
   the model reads:                  failure 1: wrong tool
   ┌───────────────────────┐         "query" looked as plausible
   │ name                  │──┐      as "search" — nothing in the
   │ description           │  │      words ruled it out
   │ inputSchema (JSON)    │  │
   └───────────────────────┘  ├──► picks a tool ──► fills arguments
   it does NOT read:          │         │                │
   your code, the OpenAPI     │         │     failure 2: right tool,
   doc, the comment, the      │         │     wrong arguments — a
   ticket, your intent      ──┘         │     date in the wrong format,
                                        │     an ID it never had
                                        ▼
                                  executes ──► result or error
```

The reliability numbers say both failures are common even for frontier models. On τ-bench, a benchmark of realistic tool-agent-user tasks ([arXiv 2406.12045](https://arxiv.org/abs/2406.12045)), strong function-calling agents succeed on under half of tasks, and consistency is worse than the average suggests: pass^8, the rate at which an agent succeeds on all eight of eight attempts at the same task, falls below 25% in the retail domain. On NESTFUL ([arXiv 2409.03797](https://arxiv.org/abs/2409.03797)), which tests nested calls where one tool's output feeds the next, a strong model reached only 28% full-sequence accuracy, and a named failure mode was the model not using the output-parameter details the API specification already gave it.

Read those as design feedback, not as a verdict on the models. An agent that calls the wrong tool acted on the words it was given, and an agent that fills an argument wrong filled it from a schema you wrote. Neither number is a fixed property of the model — both move when the definition moves, which is the whole premise of this post. The wrong-tool failure is a naming-and-description problem: the model could not separate two tools because the words did not separate them. The wrong-argument failure is a schema-and-error problem: the model guessed at a parameter the schema left ambiguous, or could not recover because the error it got back told it nothing it could act on. Each section below takes one of those levers in turn. Most of the gap is recoverable on your side of the line — but only if you treat the line, the tool definition itself, as something you design rather than something you generate from a type and forget.

## Names and descriptions are prompt, not labels

The name is the model's first and cheapest signal. A name like `query` carries almost nothing; `crm_contacts_search` tells the model the domain, the object, and the verb before it has read a word of the description. Anthropic, writing about [building tools for agents](https://www.anthropic.com/engineering/writing-tools-for-agents), reports that the choice between prefix- and suffix-based namespacing had "non-trivial effects on tool-use evaluations" — measured on internal evals, so treat the magnitude as Anthropic's claim rather than an independent benchmark, but the direction is not in dispute. Consistent prefixes give the model a grammar for the catalog.

The description has to do the job the name cannot: say when to use the tool _and when not to_. The `query`-versus-`search` failure is a description failure: neither description drew the boundary, so the model had no boundary. OpenAI's [function-calling guidance](https://developers.openai.com/api/docs/guides/function-calling) frames the bar as the "intern test": a new hire should be able to use the tool correctly from its name and description alone, with no other context. If the intern would have to ask a question, the model will not ask; it will guess. Anthropic reports the converse holds too: "even small refinements to tool descriptions can yield dramatic improvements." The description is not documentation that happens to be machine-readable. It is an instruction, and it is read on every turn.
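One way to hold names and descriptions to that bar is a small lint that runs over the catalog. The sketch below assumes two house conventions that are not any standard: names follow a `domain_object_verb` pattern, and every description contains a "Use when" clause and a "Do not use" clause. The tools `crm_contacts_search` and `crm_contacts_get` and the function `lint_tool` are hypothetical.

```python
import re

# Hypothetical house convention: at least three lowercase segments,
# e.g. crm_contacts_search (domain, object, verb).
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+){2,}$")


def lint_tool(tool: dict) -> list[str]:
    """Return a list of problems with a tool's name and description."""
    problems = []
    name = tool["name"]
    if not NAME_PATTERN.match(name):
        problems.append(f"{name}: name is not a domain_object_verb phrase")
    description = tool.get("description", "").lower()
    if "use when" not in description:
        problems.append(f"{name}: description does not say when to use it")
    if "do not use" not in description:
        problems.append(f"{name}: description does not say when not to use it")
    return problems


vague = {"name": "query", "description": "Queries records."}

bounded = {
    "name": "crm_contacts_search",
    "description": (
        "Use when you need to find a contact by name or email fragment. "
        "Do not use to fetch a contact whose ID you already have; "
        "call crm_contacts_get for that."
    ),
}

if __name__ == "__main__":
    for tool in (vague, bounded):
        print(tool["name"], lint_tool(tool))
```

A string check like this cannot tell whether the boundary is the right one. It only catches descriptions that never drew a boundary at all, which is the failure the opening example describes.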

## The schema is where you prevent the wrong call

The input schema is the most under-used lever, because teams generate it from an existing type definition and never look at it as prompt. It is prompt. An `enum` of four allowed values is a stronger, less ambiguous instruction than a string parameter described by a sentence asking for one of four values: an out-of-set value is a schema violation that validation can reject before your handler runs, and where the platform enforces the schema during decoding the model cannot emit one at all. OpenAI's guidance puts it directly: "use enums and object structure to make invalid states unrepresentable."

Three rules carry most of the benefit:

- **Make the illegal call unrepresentable.** Enums over free strings; required fields marked required; nested objects that only permit valid combinations. Every constraint you move from prose into the schema is a constraint the model cannot quietly violate.
- **Do not ask the model to supply what you already know.** If the agent runs in one tenant, the tenant ID is not a parameter — it is bound by your code. Every argument the model fills is an argument it can fill wrong; the smallest correct schema is the safest one.
- **Name and describe every parameter.** `start` and `end` are guesses waiting to happen; `start_date` with a description that states the format and the timezone is an instruction. Parameter descriptions are read with the same weight as the tool description.
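The three rules above can be sketched in a few lines. Everything named here is hypothetical: the order-listing tool, its `status` and `start_date` parameters, the tenant value, and the hand-rolled `check_arguments`, which covers only `required`, `enum`, and unknown keys. A real system would more likely use a full JSON Schema validator; this is a stand-in to keep the example self-contained.

```python
from functools import partial

INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {
            "type": "string",
            "enum": ["open", "pending", "shipped", "cancelled"],
            "description": "Order status to filter by.",
        },
        "start_date": {
            "type": "string",
            "description": "Inclusive start date, ISO 8601 (YYYY-MM-DD), in UTC.",
        },
    },
    "required": ["status", "start_date"],
    "additionalProperties": False,  # note: no tenant_id parameter exists
}


def check_arguments(schema: dict, args: dict) -> list[str]:
    """Minimal check of required fields, unknown keys, and enum membership."""
    errors = []
    properties = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")
    for name, value in args.items():
        if name not in properties:
            errors.append(f"unknown argument '{name}'")
        elif "enum" in properties[name] and value not in properties[name]["enum"]:
            allowed = ", ".join(properties[name]["enum"])
            errors.append(f"'{name}' must be one of: {allowed}; got {value!r}")
    return errors


def list_orders(tenant_id: str, status: str, start_date: str) -> dict:
    """Stand-in handler; a real one would query a data store."""
    return {"tenant_id": tenant_id, "status": status, "start_date": start_date}


# The tenant is bound by the application, so the model never fills it.
handler = partial(list_orders, tenant_id="tenant-123")

if __name__ == "__main__":
    call = {"status": "open", "start_date": "2026-01-01"}
    print(check_arguments(INPUT_SCHEMA, call), handler(**call))
```

The design choice worth noticing is what is absent: the model-facing schema has no tenant field, so there is no argument for the model to get wrong.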

## What comes back is prompt too

The tool's _output_ re-enters the context window and becomes the model's next input — so it is prompt as much as the definition is. Two parts of it are routinely mishandled.

Errors first. The MCP specification distinguishes tool _execution_ errors, returned with `isError: true`, from protocol errors, and notes that execution errors "contain actionable feedback that language models can use to self-correct and retry." A stack trace is not actionable feedback — it teaches the model nothing it can use. "The `start_date` must be before `end_date`; you passed a start of 2026-06-01 and an end of 2026-01-01" teaches it exactly what to change. An error message is the model's next prompt; write it as one.
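A minimal sketch of what that looks like in a handler, assuming a hypothetical `run_report` tool with `start_date` and `end_date` parameters. The result shape (a `content` list of text items plus `isError`) follows the MCP tool-result format; the helper names `tool_error` and `tool_ok` are made up for this example.

```python
from datetime import date


def tool_error(message: str) -> dict:
    """Execution error in the MCP result shape: the text is for the model."""
    return {"content": [{"type": "text", "text": message}], "isError": True}


def tool_ok(message: str) -> dict:
    return {"content": [{"type": "text", "text": message}], "isError": False}


def run_report(start_date: str, end_date: str) -> dict:
    """Hypothetical tool handler that validates its own arguments."""
    try:
        start = date.fromisoformat(start_date)
        end = date.fromisoformat(end_date)
    except ValueError:
        return tool_error(
            "Dates must be ISO 8601 (YYYY-MM-DD); you passed "
            f"start_date={start_date!r} and end_date={end_date!r}."
        )
    if start >= end:
        return tool_error(
            "The `start_date` must be before `end_date`; you passed a start "
            f"of {start_date} and an end of {end_date}."
        )
    return tool_ok(f"Report covers {(end - start).days} days.")


if __name__ == "__main__":
    print(run_report("2026-06-01", "2026-01-01"))
```

Each error names the rule, echoes the values the model sent, and stops there. The stack trace can still go to your logs; it just does not belong in the model's next prompt.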

Then granularity. A tool that returns everything forces the model to read everything, and the response competes for the same context budget every other step needs — the failure mode we walked through in the [context-engineering essay](/blog/context-engineering-agents/). Return high-signal fields, not raw rows: Anthropic reports that replacing arbitrary UUIDs with semantically meaningful identifiers "significantly improves Claude's precision," and Claude Code caps tool responses at 25,000 tokens by default precisely because an unbounded response is a denial-of-service on the model's attention.
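As a sketch of that shaping step, the function below projects rows down to a named set of fields and stops once a size budget is spent. The field names, the `shape_rows` helper, and the budget are hypothetical, and a character count stands in for a token count, which is only a rough proxy.

```python
import json

MAX_CHARS = 2000  # hypothetical budget; characters as a rough proxy for tokens


def shape_rows(rows: list[dict], fields: list[str], max_chars: int = MAX_CHARS) -> dict:
    """Keep only high-signal fields and stop once the size budget is spent."""
    kept = []
    used = 0
    for row in rows:
        slim = {field: row[field] for field in fields if field in row}
        size = len(json.dumps(slim))
        if used + size > max_chars:
            break
        kept.append(slim)
        used += size
    return {
        "results": kept,
        "returned": len(kept),
        "total": len(rows),
        # Tell the model the list is partial so it can narrow the call.
        "truncated": len(kept) < len(rows),
    }


if __name__ == "__main__":
    raw = [
        {"order_ref": f"ORD-{i:04d}", "status": "open", "audit_blob": "x" * 500}
        for i in range(100)
    ]
    shaped = shape_rows(raw, ["order_ref", "status"], max_chars=200)
    print(shaped["returned"], shaped["total"], shaped["truncated"])
```

Reporting `total` and `truncated` alongside the rows matters as much as the cap itself: a silently shortened list reads to the model as the complete answer.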

## The catalog gets too big

Every tool you add is a line in a prompt the model reads on every turn, and the prompt has a working size. OpenAI's guidance suggests aiming for fewer than 20 tools available at the start of a turn, and calls that figure a soft suggestion. Past that, selection accuracy degrades. There is no sharp cliff at any round number, but the direction is well attested. ToolLLM ([arXiv 2307.16789](https://arxiv.org/abs/2307.16789)) had to build a dedicated neural retriever to operate over its 16,000-plus APIs; you cannot simply hand a model a large catalog and expect a clean pick.

The fix is the same move as retrieval-augmented context: select the tools relevant to the task and show the model only those. RAG-MCP ([arXiv 2505.03275](https://arxiv.org/abs/2505.03275), a preprint — treat its figures as such) reports retrieval-based tool selection lifting accuracy from 13.62% to 43.13% on an MCP stress test while cutting prompt tokens by more than half. The headline number is a preprint's; the shape is not controversial. A large tool catalog is a retrieval problem, not a list to paste. This is also where the operational layer the [MCP production-gaps essay](/blog/mcp-production-gaps/) covers — which servers an agent may reach, governed by policy — meets tool design: the catalog the model sees should be both small and scoped.
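The selection step can be sketched with nothing but word overlap. This is a deliberately crude stand-in, not the method RAG-MCP or ToolLLM use: a production retriever would more likely score with embeddings. The catalog entries and the `select_tools` function are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(task: str, catalog: list[dict], k: int = 5) -> list[dict]:
    """Return up to k tools whose name and description overlap the task text."""
    task_tokens = tokens(task)
    scored = []
    for tool in catalog:
        tool_tokens = tokens(tool["name"] + " " + tool["description"])
        scored.append((len(task_tokens & tool_tokens), tool["name"], tool))
    # Highest overlap first; ties broken by name so the order is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tool for overlap, _, tool in scored[:k] if overlap > 0]


CATALOG = [
    {"name": "crm_contacts_search",
     "description": "Search CRM contacts by name or email."},
    {"name": "billing_invoices_refund",
     "description": "Refund a paid invoice for a customer."},
    {"name": "calendar_events_create",
     "description": "Create a calendar event for one or more attendees."},
]

if __name__ == "__main__":
    chosen = select_tools("refund the customer's last invoice", CATALOG, k=2)
    print([tool["name"] for tool in chosen])
```

Only the tools returned by `select_tools` would be placed in the model's catalog for that turn. Note that the retriever reads the same names and descriptions the model does, so vague wording hurts selection twice.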

## You cannot eyeball tool-call accuracy

The opening failure — a confident call to a real tool with plausible arguments that happened to be wrong — has no exception, no error, and no log line that flags it. It is invisible to everything except an eval that knows what the right call was. Tool calls are model output, and they need the same treatment as any other model output: a graded set, run on every change.

τ-bench's pass^k metric is the right shape because it measures consistency, not a single lucky run. APIGen ([arXiv 2406.18518](https://arxiv.org/abs/2406.18518)) shows the grading can be concrete and three-staged — a format check that the call is well-formed, an execution check that it runs, and a semantic check that it did the right thing — and that a model trained against that signal can beat far larger ones. Build the eval before you tune the descriptions, the way the [eval-driven development essay](/blog/eval-driven-development/) argues, and route tool-call traces into the regression set the [agent-observability essay](/blog/agent-observability/) describes. A tool description you changed without an eval is a description you changed without knowing whether you helped.
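Both ideas fit in a short sketch. The `pass_hat_k` function uses the combinatorial estimate, the fraction of size-k subsets of n trials that are all successes, averaged over tasks; this is my reading of how pass^k is estimated, so check it against the paper before relying on it. The `grade_call` function is a hypothetical three-stage grader in which the semantic stage is reduced to comparing against a known expected result, a simplifying assumption rather than APIGen's own checker.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), with c successes in n trials."""
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


def grade_call(call, expected, tools: dict) -> str:
    """Return 'pass' or the name of the first stage that failed."""
    # Stage 1, format: is the call well-formed?
    if (
        not isinstance(call, dict)
        or not isinstance(call.get("name"), str)
        or not isinstance(call.get("arguments"), dict)
    ):
        return "format"
    # Stage 2, execution: does the named tool exist and run?
    handler = tools.get(call["name"])
    if handler is None:
        return "execution"
    try:
        result = handler(**call["arguments"])
    except Exception:
        return "execution"
    # Stage 3, semantic: did it do the right thing? (assumed: exact match)
    return "pass" if result == expected else "semantic"


if __name__ == "__main__":
    # Three tasks, eight trials each: always right, right half the time, never.
    print(pass_hat_k([8, 4, 0], n=8, k=1))
    print(pass_hat_k([8, 4, 0], n=8, k=8))
    tools = {"add": lambda a, b: a + b}
    print(grade_call({"name": "add", "arguments": {"a": 1, "b": 2}}, 3, tools))
```

In the usage example the task that succeeds half the time contributes 0.5 at k=1 and nothing at k=8, which is the gap between average accuracy and consistency that pass^k is meant to expose.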

## The tool-definition checklist

Before an agent's tools go in front of production traffic:

- [ ] Every tool name is namespaced and reads as a domain-object-verb phrase, not a database label.
- [ ] Every description states when to use the tool _and when not to_ — the boundary against its nearest neighbour is explicit.
- [ ] Each tool passes the intern test: usable correctly from name and description alone.
- [ ] The input schema uses enums and structure so an invalid call cannot be expressed.
- [ ] No parameter asks the model for a value your code already holds.
- [ ] Error returns are sentences that say what to change, not stack traces.
- [ ] Tool outputs return high-signal fields, are size-bounded, and use meaningful identifiers.
- [ ] The catalog shown on a turn is under ~20 tools, or tools are retrieved per task.
- [ ] A graded tool-call eval exists, measures consistency (pass^k), and runs on every definition change.

## Reading list

- Anthropic — writing effective tools for agents, the closest thing to a primary statement that tool descriptions are prompt: [anthropic.com](https://www.anthropic.com/engineering/writing-tools-for-agents)
- The Model Context Protocol specification on tools — the object the model actually sees, and the model-controlled invocation model: [modelcontextprotocol.io](https://modelcontextprotocol.io/specification/2025-11-25/server/tools)
- OpenAI's function-calling guide — the intern test, enums to make invalid states unrepresentable, the soft cap on tool count: [developers.openai.com](https://developers.openai.com/api/docs/guides/function-calling)
- Gorilla — names the two failure modes, wrong tool and wrong arguments: [arXiv 2305.15334](https://arxiv.org/abs/2305.15334)
- τ-bench — realistic tool-agent-user tasks and the pass^k consistency metric: [arXiv 2406.12045](https://arxiv.org/abs/2406.12045)
- NESTFUL — nested API calls, and the failure to use output-parameter details from the spec: [arXiv 2409.03797](https://arxiv.org/abs/2409.03797)
- ToolLLM — 16,000+ APIs and why a large catalog needs a retriever: [arXiv 2307.16789](https://arxiv.org/abs/2307.16789)
- APIGen — three-stage verification of tool calls, format then execution then semantics: [arXiv 2406.18518](https://arxiv.org/abs/2406.18518)
- RAG-MCP — retrieval-based tool selection as the answer to catalog bloat (preprint): [arXiv 2505.03275](https://arxiv.org/abs/2505.03275)

The model never sees your code. It sees the words around your code. Write those words like the prompt they are, or the agent will keep calling the wrong tool — fluently, confidently, and without ever throwing.