MCP Deep Dive — Thorin Prep Companion

Companion doc to the AI agents study material, focused specifically on MCP internals, how agents select and load tools, and the MCP-vs-CLI-vs-code-execution discourse that's been reshaping the field since September 2025. Built for the Thorin onsite where you were assigned to implement an MCP client, so this goes deeper on the client side than the other doc.

The thesis of this doc: the "MCP tool overload" problem has become the central architectural tension in agent design over the past six months, and the field has converged on three complementary solutions — Tool Search, Programmatic Tool Calling, and Code Execution with MCP. If Asanka asks you anything about MCP beyond surface-level protocol mechanics, these are the topics that will separate "read the docs" from "actually understands the current state of the field."


Table of Contents

  1. How an MCP client actually works (the flow end-to-end)
  2. How agents select which tool to call
  3. The too-many-tools problem: why it happens
  4. Solution 1: Tool Search Tool (Anthropic, Nov 2025)
  5. Solution 2: Programmatic Tool Calling
  6. Solution 3: Code Execution with MCP
  7. Tool Use Examples and the fourth piece of the puzzle
  8. MCP vs API vs CLI — the full discourse
  9. Namespacing, scoping, and tool groups
  10. Implementation checklist for the Thorin client
  11. Things you should be able to explain on the whiteboard
  12. Source index

1. How an MCP client actually works (the flow end-to-end)

Before getting into the advanced stuff, you need to be able to trace the full lifecycle of an MCP interaction on a whiteboard. This is where Asanka is most likely to probe if the client implementation task resurfaces in conversation.

The architecture

MCP is a JSON-RPC 2.0 protocol with a client-server architecture. There are three distinct roles:

Host: The application the user interacts with (Claude Desktop, Claude Code, Cursor, your custom agent). The host owns the LLM connection, the user interface, and the overall session state.

Client: A protocol-level component inside the host that maintains one-to-one communication with a single server. If the host connects to five MCP servers, there are five clients, one per server. Clients handle the actual message protocol — serialization, transport, capability negotiation, message routing. When you "implement an MCP client," you're building this layer.

Server: A separate process (or HTTP service) that exposes capabilities (tools, resources, prompts) to clients. Servers know nothing about the host application or the LLM. They just respond to protocol messages.

The separation matters because it enforces isolation. A buggy or malicious MCP server can't directly touch the LLM, the user, or other servers — all interaction flows through the client, which is the host's trust boundary.

The transports

MCP defines two transports:

stdio transport. The host spawns the server as a subprocess and communicates over stdin/stdout. Line-delimited JSON-RPC messages. No network stack. Zero overhead. This is the default for local MCP servers. When Claude Desktop launches your Anna's Archive server, it's just running node /path/to/dist/index.js and piping JSON back and forth. Simplicity is the point.

Streamable HTTP transport (introduced in the 2025-03-26 spec revision as the replacement for the older HTTP+SSE transport). The server runs as an HTTP service. The client POSTs JSON-RPC requests to a single endpoint and receives responses, with Server-Sent Events used for streaming responses and server-initiated messages such as notifications. This is what you use for remote MCP servers like the claude.ai connectors. Your Anna's Archive server supports this mode via Tailscale Funnel.

There's been some churn on the HTTP transport: the original spec used an "HTTP+SSE" design with two separate endpoints, which the 2025-03-26 revision replaced with a single streamable HTTP endpoint. If you're implementing a client, you need to handle both for compatibility with older servers, but new work targets streamable HTTP.
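For the stdio transport, the newline-delimited framing can be sketched in a few lines. This is only an illustration, assuming one JSON-RPC message per line; encode_message and decode_stream are hypothetical helper names, not functions from any MCP SDK, and an in-memory buffer stands in for the real subprocess pipe.

```python
import io
import json
from typing import Any, Iterator


def encode_message(message: dict[str, Any]) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the payload never
    # contains a raw "\n" that would break the line framing.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_stream(stream: io.BufferedIOBase) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON-RPC message per non-empty line."""
    for raw_line in stream:
        line = raw_line.strip()
        if line:
            yield json.loads(line)


# Simulate a pipe with an in-memory buffer (stand-in for a server's stdout).
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
notification = {"jsonrpc": "2.0", "method": "notifications/initialized"}
pipe = io.BytesIO(encode_message(request) + encode_message(notification))
messages = list(decode_stream(pipe))
```

A transport abstraction would put these two functions behind the same send/receive interface that the HTTP transport implements, so the rest of the client never sees the framing.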

The lifecycle

Here's the full sequence of a typical MCP session, which you should be able to draw on a whiteboard:

1. Initialize. The client sends an initialize request with the protocol version it supports, client info, and client capabilities (sampling, roots, elicitation: what the client can offer back to the server). The server responds with the protocol version it will use, server info, and server capabilities (tools, resources, prompts: what the server can expose). This is the capability negotiation handshake. If the server answers with a version the client doesn't support, the client disconnects.

2. Initialized notification. The client sends an initialized notification confirming the handshake is complete. From this point, normal operation begins.

3. List capabilities. The client calls tools/list, resources/list, and/or prompts/list to discover what the server offers. The server responds with arrays of descriptors — for tools, each one has a name, description, and inputSchema (a JSON Schema for the arguments).

4. Inject into LLM context. This is where the client becomes interesting. The client takes the tool descriptors and converts them into the format the LLM's API expects. For Claude, that means building the tools array in the messages API call with name, description, and input_schema. The LLM now knows these tools exist.

5. User message arrives. User types something. The host builds a messages request including system prompt, conversation history, the tools list, and the new user message. The LLM sees everything and decides what to do.

6. Tool call. If the LLM decides to call a tool, the response contains a tool_use block with the tool name and arguments. The host extracts this, identifies which MCP server owns that tool (via namespacing or internal routing tables), and asks the appropriate client to call tools/call on that server with the arguments.

7. Tool execution. The server executes the tool and returns a result — structured content that the MCP spec allows to be text, images, audio, or embedded resources.

8. Feed result back. The client receives the tool result, the host converts it into a tool_result message block with the matching tool_use_id, appends it to the message history, and sends another messages request to the LLM. The LLM now has the tool result and decides the next step — another tool call, or a final response.

9. Loop until done. Steps 6-8 repeat until the LLM emits a response without a tool_use block, signaling it's finished with the turn. The next user message starts the cycle again at step 5.

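10. Notifications. Throughout the session, servers can send notifications like notifications/tools/list_changed (the tool list has changed, so the client should re-fetch it) or notifications/resources/updated (a resource the client subscribed to has changed). Good clients handle these live.

11. Shutdown. MCP defines no dedicated shutdown message; the session ends when the transport closes. In stdio mode, the client closes the server's stdin and waits for the process to exit, escalating to SIGTERM and then SIGKILL if it doesn't. In HTTP mode, the client closes its HTTP connections.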
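Steps 1, 4, and 8 of the lifecycle are mostly message shaping, which can be sketched as three small functions. This is a minimal illustration rather than a reference client: the function names, the "example-client" info, and the choice of protocol version are assumptions made for the example.

```python
from typing import Any

# Assumption: the client targets this spec revision; use whichever one you support.
PROTOCOL_VERSION = "2025-03-26"


def build_initialize_request(request_id: int) -> dict[str, Any]:
    """Step 1: the JSON-RPC initialize request the client sends first."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {"sampling": {}},
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }


def mcp_tool_to_claude_tool(descriptor: dict[str, Any]) -> dict[str, Any]:
    """Step 4: map an MCP tool descriptor (inputSchema) to a tools-array entry (input_schema)."""
    return {
        "name": descriptor["name"],
        "description": descriptor.get("description", ""),
        "input_schema": descriptor["inputSchema"],
    }


def build_tool_result(tool_use_id: str, text: str, is_error: bool = False) -> dict[str, Any]:
    """Step 8: wrap a tool's text output in a tool_result block for the next request."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": text,
        "is_error": is_error,
    }


# Hypothetical descriptor, shaped like one element of a tools/list response.
descriptor = {
    "name": "search",
    "description": "Search the catalog.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
init_request = build_initialize_request(1)
claude_tool = mcp_tool_to_claude_tool(descriptor)
result_block = build_tool_result("toolu_example", "3 matches found")
```

The only real translation work is the key rename and carrying the tool_use_id through unchanged; everything else in the loop is routing.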

What your client implementation needs to handle

If Thorin tasked you with building an MCP client, the non-obvious parts are:

  • Transport abstraction. Your client code should not care whether it's talking over stdio or HTTP. Build a transport interface, implement both, and let the rest of the code be agnostic.
  • Capability negotiation. Don't assume all servers support all features. If a server doesn't advertise tools capability, don't call tools/list. Same for resources and prompts.
  • Message correlation. JSON-RPC uses request IDs to match responses to requests. You need to track in-flight requests, handle out-of-order responses, and time out hung requests.
  • Notifications as a separate channel. Notifications don't have request IDs and don't expect responses. Don't try to correlate them to requests.
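  • Error handling. JSON-RPC error objects carry numeric codes: -32700 (parse error) and -32600 through -32603 (invalid request, method not found, invalid params, internal error) are predefined by JSON-RPC 2.0, and -32000 through -32099 are reserved for implementation-defined server errors. Different errors need different responses: retry, fail, reconnect, or surface to the user. Tool execution failures are reported separately, inside a successful tools/call result with isError set to true, so the LLM can see and react to them.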
  • Tool name namespacing. Two servers may expose tools with the same name. Your client must disambiguate, usually by prefixing tool names with the server identifier when they enter the LLM's context.
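  • Cancellation. The user or the host might abandon a request while it is still in flight. You need to propagate cancellation to in-flight tool calls, which MCP does with a notifications/cancelled message that names the request ID.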
  • Reconnection. HTTP-transport clients need to handle connection drops gracefully, buffer unsent messages during reconnect, and resume cleanly.
  • Auth. OAuth 2.1 is the MCP-standard auth flow (more on this below). Your client needs to handle the OAuth dance, store tokens securely, and refresh them when they expire.
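The message-correlation and notification points in the list above can be sketched with a small bookkeeping class. This is a synchronous illustration under simplifying assumptions: Correlator is a hypothetical name, timeouts are omitted, and a real client would more likely resolve futures than fill a dictionary.

```python
import itertools
from typing import Any, Callable, Optional


class Correlator:
    """Matches JSON-RPC responses to in-flight requests by id."""

    def __init__(self, on_notification: Callable[[str, dict], None]) -> None:
        self._next_id = itertools.count(1)
        self.pending: dict[int, str] = {}    # request id -> method name
        self.completed: dict[int, Any] = {}  # request id -> result or error
        self._on_notification = on_notification

    def new_request(self, method: str, params: Optional[dict] = None) -> dict:
        """Build a request and remember its id until a response arrives."""
        request_id = next(self._next_id)
        self.pending[request_id] = method
        message: dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
        if params is not None:
            message["params"] = params
        return message

    def handle_incoming(self, message: dict) -> None:
        """Route one decoded message from the server."""
        if "id" not in message:
            # Notification: no id, no response expected, never correlated.
            self._on_notification(message["method"], message.get("params", {}))
            return
        if "method" in message:
            # Server-to-client request (e.g. sampling); out of scope for this sketch.
            return
        request_id = message["id"]
        if self.pending.pop(request_id, None) is None:
            return  # unknown id, or a request the client already gave up on
        if "error" in message:
            self.completed[request_id] = {"error": message["error"]}
        else:
            self.completed[request_id] = {"result": message.get("result")}


seen: list[str] = []
correlator = Correlator(on_notification=lambda method, params: seen.append(method))
first = correlator.new_request("tools/list")
second = correlator.new_request("tools/call", {"name": "search", "arguments": {"q": "mcp"}})

# Responses may arrive out of order, interleaved with notifications.
correlator.handle_incoming({"jsonrpc": "2.0", "id": second["id"], "result": {"content": []}})
correlator.handle_incoming({"jsonrpc": "2.0", "method": "notifications/tools/list_changed"})
correlator.handle_incoming(
    {"jsonrpc": "2.0", "id": first["id"], "error": {"code": -32601, "message": "Method not found"}}
)
```

Keeping notifications on their own callback path is what prevents the common bug of waiting for a response to something that never had an id.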

The OAuth story and why DCR matters

One thing Simon Willison highlighted (quoting Kenton Varda) is worth knowing: MCP's biggest practical unlock may actually be OAuth Dynamic Client Registration (DCR, RFC 7591). The problem is that OAuth traditionally assumes the client knows what API it's talking to in advance, so the developer can register the client with that API ahead of time to get a client_id and client_secret. Agents don't know what MCPs they'll talk to in advance — they discover them at runtime. So MCP requires dynamic client registration, which practically nobody implemented before MCP came along. DCR might as well have been invented by MCP, even though the RFC is from 2015.

This is a good piece of trivia to drop if Asanka asks about MCP auth. It shows you've read past the surface-level "MCP uses OAuth" explanation.


2. How agents select which tool to call

This is the core mechanical question you asked and it's worth understanding precisely because it's the question that determines how your client should shape its output.

The short answer

The agent doesn't have any special machinery for tool selection. It's just the LLM reading text. Tool descriptions enter the model's context window as part of the API request. The model produces its next token distribution conditioned on everything in context — system prompt, conversation, tool descriptions, the user's latest message. If the distribution is highest on a tool_use block pointing at a particular tool, that's the tool that gets called. Tool selection is an emergent behavior of next-token prediction on a context that happens to include tool descriptions.

This is why everyone says tool descriptions are prompts. They literally are: they're tokens the model reads to decide what to do. The "tool" abstraction is just a convention about how certain tokens get formatted in the model's output so the surrounding code can parse them as structured calls. The model doesn't "know" tools in any special sense. It has simply seen many examples during training where tools described this way were used in particular patterns, and it's pattern-matching against that training distribution.

What actually goes into the context

For a given turn, here's what the model sees:

[system prompt]
  - your system prompt text
  - the model's internal "you have tools available" framing
  - often a list of tool names/descriptions inline

[tool definitions, typically as a structured section]
  - for each tool:
    - name
    - description (prose, usually one to a few sentences)
    - input schema (JSON schema for arguments)
    - optionally, examples (see section 7)

[conversation history]
  - user messages
  - assistant messages (possibly including prior tool_use blocks)
  - tool_result blocks from prior calls

[current user message]

For Claude specifically, the tools array in the API is rendered into a special system prompt that sits alongside your own. You can't read that rendered text directly, but you can see its effect in the input token count, which rises with every tool definition you add.

The model then generates a response. If it decides to call a tool, the output includes a tool_use content block with a name and arguments. The arguments are JSON that must validate against the input_schema — but importantly, that validation happens client-side after generation. The model generates what it thinks is valid JSON, and if it's wrong, you get a validation error that you have to handle.
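Since validation happens on the client after generation, the client needs some check between receiving a tool_use block and calling tools/call. The sketch below is deliberately minimal and is not a full JSON Schema validator: it only checks required keys and primitive types, and it treats unknown arguments as problems (an assumption made here because they usually signal parameter hallucination, even though JSON Schema allows extra properties by default). A production client would more likely use a dedicated JSON Schema library.

```python
from typing import Any

# Maps JSON Schema primitive type names to the Python types json.loads produces.
_JSON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def check_arguments(arguments: dict[str, Any], input_schema: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the call looks valid."""
    problems: list[str] = []
    properties = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unknown argument: {name}")
            continue
        expected = properties[name].get("type")
        if not isinstance(expected, str) or expected not in _JSON_TYPES:
            continue  # untyped or union-typed property: skipped by this sketch
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if is_bool_as_number or not isinstance(value, _JSON_TYPES[expected]):
            problems.append(f"argument {name} should be {expected}")
    return problems


schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["query"],
}
ok = check_arguments({"query": "mcp", "limit": 5}, schema)
bad = check_arguments({"limit": "5", "page": 2}, schema)
```

Returning the problems as text, rather than raising, makes it easy to send them back to the model in a tool_result with is_error set so it can correct the call.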

What makes the model pick one tool over another

At inference time, the model's decision depends on:

The clarity of each tool's description relative to the current intent. If you have search_users and get_user_profile and the user asks "find Jane's profile," both tools sound plausible. The one with the description that more specifically matches the user's phrasing will win.

Name similarity to intent. This is pure pattern matching. If the user asks about "customer errors" and you have a search_logs tool, the model has to bridge from "customer errors" to "logs" conceptually. If you also have a get_customer_errors tool, that tool is almost guaranteed to win because the lexical overlap is direct.

Position in the context. Tools listed earlier tend to get higher attention in many models. This is the "lost in the middle" effect applied to tool lists — tools buried in the middle of a large list are less likely to be called even when they're the correct choice.

Training data biases. Models have seen certain tool names in certain contexts during training. grep, curl, ls, cat — these all have strong priors from RLHF and training data. Made-up names like github_mcp_create_pull_request have no training priors and rely entirely on the description plus name matching.

Schema complexity. Paradoxically, tools with simpler schemas are sometimes easier for models to call correctly but harder for models to decide between, because the schemas don't constrain the choice. Tools with rich, distinctive schemas are easier to differentiate but harder to use correctly without examples.

The failure modes

Microsoft Research has documented specific failure modes when tool selection goes wrong. Knowing these names is useful for interview conversations:

Wrong tool from similar names. If you have get_status, fetch_status, and query_status, the model will often pick based on partial token similarity rather than semantic understanding. Research from Microsoft has shown that common names like search appear in dozens of MCP servers, making disambiguation genuinely hard.

Tool paralysis. When faced with too many similar options, models sometimes take no action at all. Developers using LangGraph and CrewAI have reported agents "getting stuck" or timing out when tool selection becomes ambiguous. Microsoft confirmed LLMs can "decline to act at all when faced with ambiguous or excessive tool options" in overloaded contexts.

Hallucinated tool calls. Agents occasionally invent plausible-sounding tools like create_lead_entry when the actual tool is add_sales_contact. The model has the intent right but the name wrong, because it's pattern-matching against what a tool with that intent "should" be called based on training data.

Parameter hallucination. The model calls the right tool but invents parameters that don't exist in the schema, usually because the schema was ambiguous or an example suggested the wrong structure.

Salience crowding. When a verbose tool description is loaded into context, it crowds out the actual task instructions and user intent. The model ends up attending more to the tool descriptions than to what the user asked for, and over-calls tools unnecessarily.

Attention dilution in long lists. Transformer attention is not uniform across context. Research on "lost in the middle" effects shows accuracy drops significantly when relevant context is buried in a large window. A tool list of 93 items (the GitHub MCP) has tools in positions 40-60 that are systematically harder to select than tools at the top or bottom.

How to design to make selection easier

Given all of the above, the design advice is consistent:

  1. Keep tool names distinctive and semantically clear. search_customer_orders beats query_db_orders because it describes what the tool does in business terms.
  2. Write descriptions that explicitly rule tools in and out. "Use this tool when X. Do not use this tool for Y." Counterintuitively, telling the model when NOT to use a tool can be more effective than just describing when to use it.
  3. Consolidate. Instead of three tools that do related things, one tool with a mode parameter might be clearer — or vice versa, depending on where the disambiguation is easier for the model.
  4. Namespace by service when you have multiple servers. asana_search vs jira_search is unambiguous; two search tools is a coin flip.
  5. Keep tool lists small. The ceiling where selection quality starts degrading is around 30-50 tools for most models. Above that, you need Tool Search (see section 4).
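Point 4 (namespacing by service) is easy to get subtly wrong if the client tries to parse the server name back out of the tool name. A safer sketch keeps an explicit routing table. ToolRouter and the "__" separator are assumptions for illustration, not a convention required by MCP.

```python
class ToolRouter:
    """Maps namespaced tool names back to (server id, original tool name)."""

    def __init__(self, separator: str = "__") -> None:
        self._separator = separator
        self._routes: dict[str, tuple[str, str]] = {}

    def register(self, server_id: str, tool_names: list[str]) -> list[str]:
        """Record a server's tools and return the names to show the model."""
        namespaced = []
        for tool_name in tool_names:
            full_name = f"{server_id}{self._separator}{tool_name}"
            if full_name in self._routes:
                raise ValueError(f"duplicate tool name: {full_name}")
            self._routes[full_name] = (server_id, tool_name)
            namespaced.append(full_name)
        return namespaced

    def resolve(self, full_name: str) -> tuple[str, str]:
        """Look up which server owns a tool the model asked for."""
        # A KeyError here means the model hallucinated a tool name.
        return self._routes[full_name]


router = ToolRouter()
asana_names = router.register("asana", ["search", "create_task"])
jira_names = router.register("jira", ["search"])
owner = router.resolve("jira__search")
```

Because the table stores the original name, the client can send the unprefixed name in tools/call even when a server id itself contains the separator.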

How this connects to your Anna's Archive work. Your tool descriptions already demonstrate good practice here. The search tool's "Query Strategies" section is explicitly ruling in and out which parameter combinations to use for which intents. That's exactly the pattern Microsoft's research recommends — don't just describe what the tool does, guide the model's decision process. When Asanka asks about tool selection, you can walk through the Query Strategies section as a concrete example of prompt-engineering the decision layer.


2.5 Eager injection vs. lazy loading (the underlying tension)

Before getting into the too-many-tools problem and the three solutions, it's worth naming the spectrum explicitly because it's the frame that everything in sections 3-6 sits on.

Eager injection (the original model)

When you call the Anthropic API with tools=[...], every tool definition in that array gets serialized into the system prompt before the first token is generated. The model sees the full schema for every tool — name, description, input_schema, parameters — on every single API call. Tool definitions are tokens, and those tokens are in context whether or not the model will actually use the tool this turn.

This is how MCP originally worked in practice. The client calls tools/list on every connected server, dumps all results into the tools array, and ships the whole thing to the model. Simple. No discovery latency. No extra round trips. The model immediately knows everything available to it.

The problem is that it doesn't scale. If you have 200 tools across 10 MCP servers, you're paying 50K+ tokens per call for definitions even on turns where the model calls zero tools. And selection quality collapses above ~50 tools (section 3 goes into why).

Lazy / on-demand loading (the new model)

Don't put everything into the model's context up front. Keep tool definitions in a searchable catalog (Tool Search Tool) or a filesystem (Code Execution with MCP), and keep intermediate results in a sandbox (Programmatic Tool Calling). When the model needs a capability, it takes a discovery action (calls a search tool, reads a file, executes code) and only what it actually needs gets pulled into context.

The cost is an extra round trip: the model makes a search call, sees results, then makes the real tool call. Two API calls where there would have been one. But you save 50K+ tokens on every turn, and selection quality stays high because the model only sees relevant tools.

The spectrum in practice

Production MCP clients don't pick one extreme. They split tools into two buckets:

  • Core tools (eagerly loaded): The 3-5 tools the model will use most often. Always in context. Worth the token cost because they get called every session.
  • Deferred tools (lazily loaded): The long tail. Registered with the API but hidden behind defer_loading: true. The model discovers them via search when a specific need arises.

Anthropic's MCP toolset config makes this explicit:

json
{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": {"defer_loading": true},
  "configs": {
    "search_files": {"defer_loading": false}
  }
}

Everything from the google-drive server is deferred except search_files, which stays loaded because it's the natural entry point. The model can always find getDocument or createFile via Tool Search when it needs them.

Why this matters for client design

If you're implementing an MCP client, this is the key architectural decision. Naive clients just dump every tool they discover into the LLM's tool array — the eager-injection extreme. Sophisticated clients let users (or automatic heuristics) tag tools as core vs. deferred. The client then:

  1. Ships core tools directly in the tools array
  2. Registers deferred tools with defer_loading: true
  3. Adds the Tool Search Tool (or a custom equivalent) so the model can discover the deferred ones
  4. Optionally supports Programmatic Tool Calling or Code Execution with MCP for workflows with large intermediate results

The three solutions in sections 4-6 are all different flavors of lazy loading. Tool Search Tool defers definitions. Programmatic Tool Calling defers results. Code Execution with MCP defers both. Pick the one that matches where your context pressure is coming from.
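Steps 1-3 of the client flow above amount to one partitioning function. The sketch below assumes tools are already in tools-array form and that the caller supplies the set of core tool names. The search tool's type string is the regex variant named in section 4; the "name" value paired with it is an assumption to check against the current API documentation.

```python
from typing import Any

# Type string as named in section 4; the "name" value is an assumption.
TOOL_SEARCH_ENTRY = {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"}


def build_tools_array(tools: list[dict[str, Any]], core_names: set[str]) -> list[dict[str, Any]]:
    """Keep core tools loaded, defer the rest, and add a search tool to find them."""
    entries: list[dict[str, Any]] = [dict(TOOL_SEARCH_ENTRY)]
    for tool in tools:
        entry = dict(tool)  # shallow copy so the catalog itself is not mutated
        entry["defer_loading"] = tool["name"] not in core_names
        entries.append(entry)
    return entries


# Hypothetical catalog with minimal schemas.
catalog = [
    {"name": "search_files", "description": "Search files.", "input_schema": {"type": "object"}},
    {"name": "getDocument", "description": "Fetch a document.", "input_schema": {"type": "object"}},
    {"name": "createFile", "description": "Create a file.", "input_schema": {"type": "object"}},
]
tools_array = build_tools_array(catalog, core_names={"search_files"})
deferred = [t["name"] for t in tools_array if t.get("defer_loading")]
```

Making core_names a runtime argument is what allows the workflow-dependent reconfiguration of core tools: the same catalog can be partitioned differently per task.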


3. The too-many-tools problem: why it happens

This deserves its own section because it's the central problem that shaped 2025's MCP discourse and is directly relevant to implementing a client. If your client loads 200 tools from 10 MCP servers, here's what goes wrong.

The numbers

Concrete data from multiple 2025 writeups:

  • GitHub MCP server: 93 tools, ~55,000 tokens for the definitions alone (as of August 2025). They've since cut defaults to 52 tools, but it's still substantial. The full tool set now exposes 42,000+ tokens of definitions. Source: Geoffrey Huntley via Simon Willison.
  • Umbraco MCP: 345 tools, ~30,000 tokens just for definitions. More than many entire context windows.
  • Anthropic internal tools: Before optimization, internal tool definitions consumed 134,000 tokens. For context, that's about two-thirds of Claude's full 200K window.
  • Typical enterprise stack (5 servers with 30 tools each = 150 tools): 30,000-60,000 tokens just in tool metadata. That's 15-30% of a generous 200K window gone before the user has said anything.
  • Anthropic's own five-server example: 58 tools across GitHub/Slack/Sentry/Grafana/Splunk = ~55K tokens. Add Jira (~17K on its own) and you're at ~72K, quickly approaching 100K+ overhead as more servers join.

The three ways it hurts

1. Context budget exhaustion. Every token you spend on tool definitions is a token you can't spend on the task. If your 200K window has 60K in tool schemas, 24K in system prompt, and 30K in conversation history, you have 86K left for reasoning and outputs. For a complex task that's not much.

2. Attention dilution ("lost in the middle"). Transformer attention isn't uniform. When the model has to attend across 200K tokens, signal from the actual task gets diluted by noise from tool definitions it doesn't need for this specific request. Research has shown LLM accuracy drops significantly when relevant context is buried in a large window.

3. Tool selection accuracy collapse. This is the most damaging one. Anthropic's internal testing showed Opus 4 accuracy on MCP evaluations dropping to 49% with large tool libraries, and Opus 4.5 to 79.5%. With Tool Search Tool enabled, these went to 74% and 88.1% respectively. The degradation is real and dramatic. The paper "Less is More" documents this effect in detail.

Why you can't just use a bigger context window

Frontier models now support 200K-1M token contexts. A common misconception is that this solves the problem. It doesn't, for three reasons:

Context rot. As documented in Chroma's research on context rot and referenced by Anthropic's context engineering post, models' ability to accurately recall information from the context decreases as context length grows. Bigger windows make this worse, not better, because attention has to spread across more tokens.

Cost. Token costs scale linearly with context length. If every request includes 60K tokens of tool definitions, you're paying for those tokens on every request, even when none of those tools get called.

Latency. Prompt processing time scales with context length. A 100K-token prompt takes meaningfully longer to process than a 20K-token prompt. For an agent that makes 20 tool calls in a session, this latency compounds.

The hard truth: bigger context windows make the problem worse in some ways, not better. You need architectural solutions, not just more space.
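The budget and cost arguments above are simple arithmetic, and it helps to run the numbers for your own stack. The sketch below uses the 200K example from this section; the 20-request session is the figure from the latency paragraph, and the assumption that definitions are re-sent at full price on every request (no prompt caching) is a simplification.

```python
def remaining_context(window: int, reserved: dict[str, int]) -> int:
    """Tokens left for reasoning and output after fixed overheads."""
    return window - sum(reserved.values())


WINDOW = 200_000
reserved = {
    "tool_schemas": 60_000,
    "system_prompt": 24_000,
    "conversation_history": 30_000,
}

left = remaining_context(WINDOW, reserved)      # 86,000 tokens for the actual task
tool_share = reserved["tool_schemas"] / WINDOW  # fraction of the window spent on definitions

# Simplifying assumption: no prompt caching, so definitions are billed on every request.
requests_per_session = 20
definition_tokens_per_session = reserved["tool_schemas"] * requests_per_session
```

With these inputs the tool definitions alone account for 1.2 million input tokens over a 20-request session, which is why the per-request overhead matters more than the one-time figure suggests.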

The symptoms in practice

When your agent has too many tools, here's what you see:

  • The model appears to get dumber mid-task as important instructions get pushed out by tool definitions
  • Over-calling tools unnecessarily due to salient descriptions crowding out reasoning
  • Under-calling tools because a long list overwhelms selection quality
  • Increased retries and loops caused by mistakes
  • Higher latency from processing more tokens than necessary
  • Specifically: the agent freezes or times out when faced with ambiguous tool choices

If you ever see these patterns in your MCP client, the first thing to check is your tool count and total token footprint.


4. Solution 1: Tool Search Tool (Anthropic, Nov 2025)

Anthropic shipped Tool Search Tool as a beta feature on November 24, 2025, as part of a package of three "advanced tool use" features. This is the one most directly relevant to your MCP client implementation because it specifically addresses "how should a client expose tools to the model when there are too many."

The core idea

Don't load all tools upfront. Load a single search tool that can discover other tools on-demand. When the model decides it needs to do something, it first calls the search tool to find the relevant tools, then calls those tools.

The crucial detail: you still register all your tools with the API. You just mark them defer_loading: true. Deferred tools aren't put into Claude's context initially. Claude sees only the Tool Search Tool plus any tools marked defer_loading: false (your most critical, frequently-used tools).

When Claude needs specific capabilities, it calls the Tool Search Tool with a query. The search runs server-side (inside Anthropic's API infrastructure) against your registered tool catalog. Matching tools get expanded into Claude's context as full definitions, and Claude then calls them normally.

The numbers that matter

From Anthropic's own testing:

  • Traditional approach with 50+ MCP tools: ~72K tokens upfront, total context consumption ~77K before any work begins.
  • With Tool Search Tool: ~500 tokens for the Tool Search Tool itself, ~3K tokens for the 3-5 relevant tools discovered per query. Total: ~8.7K tokens, preserving 95% of the context window.
  • Accuracy gains: Opus 4 went from 49% to 74% on MCP evaluations. Opus 4.5 went from 79.5% to 88.1%. These are massive improvements.
  • Token reduction: ~85% for large tool libraries.

Three search modes

Anthropic provides two search tool variants you can use out of the box, plus the ability to implement custom search:

Regex-based (tool_search_tool_regex_20251119). Claude constructs regex patterns using Python re.search() syntax. Best for tools with strict, consistent naming conventions. Example patterns Claude might generate: "weather", "get_.*_data", "database.*query|query.*database", "(?i)slack". Pro: extremely precise for structured APIs like stripe_customer_get. Con: fails if tool names are inconsistent or ambiguous.

BM25-based (tool_search_tool_bm25_20251119). Claude writes natural-language queries, and BM25, a classic information retrieval ranking algorithm, scores them against tool names and descriptions. Better when tool names aren't strictly structured and descriptions are the primary signal. BM25 matches on shared terms rather than meaning, so it still depends on descriptions using the words a query is likely to contain.

Custom (embeddings, hybrid, whatever). You can implement your own search tool as a regular tool and have Claude call it. The Claude cookbooks have examples using embeddings for semantic search.
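For a client that targets a provider without a built-in search tool, the regex mode is the easiest one to approximate. The sketch below is a hypothetical stand-in, not Anthropic's implementation: search_tools and the catalog entries are invented for illustration, and matching against name plus description is a design choice made here.

```python
import re
from typing import Any


def search_tools(pattern: str, catalog: list[dict[str, Any]], limit: int = 5) -> list[str]:
    """Return names of tools whose name or description matches the regex."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return []  # the model wrote an invalid pattern; report no matches
    matches = []
    for tool in catalog:
        haystack = f"{tool['name']} {tool.get('description', '')}"
        if compiled.search(haystack):
            matches.append(tool["name"])
    return matches[:limit]


# Hypothetical catalog of deferred tools.
catalog = [
    {"name": "slack_post_message", "description": "Post a message to a Slack channel."},
    {"name": "get_weather_data", "description": "Fetch current weather for a city."},
    {"name": "stripe_customer_get", "description": "Look up a Stripe customer by id."},
]
slack_hits = search_tools("(?i)slack", catalog)
data_hits = search_tools("get_.*_data", catalog)
invalid_hits = search_tools("(", catalog)
```

The client would expose search_tools to the model as an ordinary tool and, on a hit, add the matching full definitions to the next request's tools array.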

Configuration for MCP

For MCP servers, you can defer entire servers while keeping specific high-use tools loaded:

json
{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": {"defer_loading": true},
  "configs": {
    "search_files": {
      "defer_loading": false
    }
  }
}

This defers every tool from the google-drive MCP server except search_files, which stays loaded because it's the most common entry point. Claude can always find the other tools via search, but the common case is fast.

Prompt caching preservation

This is a subtle but important detail. Tool Search Tool doesn't break prompt caching because deferred tools are excluded from the initial prompt entirely. They're only added to context after Claude searches for them. Your system prompt and core tool definitions remain cacheable, which matters for cost and latency in production.

When to use it

Anthropic's guidance:

Use it when:

  • Tool definitions consuming >10K tokens
  • Experiencing tool selection accuracy issues
  • Building MCP-powered systems with multiple servers
  • 10+ tools available

Less beneficial when:

  • Small tool library (<10 tools)
  • All tools used frequently in every session
  • Tool definitions are compact

The decision boundary is roughly: if you'd have trouble keeping all your tools in memory while reading a paragraph about the task, the model has the same problem.

Design implications for your client

If you're building an MCP client for Thorin, Tool Search Tool fundamentally changes the design:

  1. Your client shouldn't naively dump all discovered tools into the LLM's tool list. It should decide which tools are core (always loaded) vs. discoverable (deferred).
  2. The client needs a tool catalog that the Tool Search Tool can search against. This is a new data structure — not just "the list of tools I'll send to the API" but "the full set of tools I know about, with searchable metadata."
  3. You need a policy for choosing the "core 3-5" tools that stay loaded. In Thorin's case this might be workflow-dependent: the tools needed for the observation layer are different from the tools needed for the action layer. A good client supports dynamic reconfiguration of what's core.
  4. If you're writing a client that talks to the Anthropic API, Tool Search Tool is an API feature you opt into via a beta header (advanced-tool-use-2025-11-20). If you're writing a client for a different model provider, you need to implement equivalent functionality yourself. Cursor, for example, implements its own version.

How this connects to your Anna's Archive work

Your server currently exposes 4 tools (search, download, read, stats) at ~10-15K tokens total for definitions plus the Query Strategies section. That's below the threshold where Tool Search Tool is necessary. But if you imagine scaling — say, adding a dozen more tools for different source collections or analytical operations — you'd want to structure them so search stays core-loaded and everything else is deferred behind a search layer. This is worth mentioning as a design consideration if Asanka asks how your MCP would scale.


5. Solution 2: Programmatic Tool Calling

This is the second of Anthropic's November 2025 advanced tool use features. It addresses a different problem than Tool Search: not "how do we avoid loading all tools upfront" but "how do we avoid passing tool results through the model's context when we don't need to."

The core idea

Instead of the model calling tools one at a time with each result streaming back into its context, Claude writes Python code that calls multiple tools, processes their outputs, and controls what information actually enters its context window. The code runs in a sandboxed code execution environment. Intermediate results stay in that environment. Only the final result enters Claude's context.

This is a fundamentally different execution model. Traditional tool calling is: model → tool → model → tool → model → ... Programmatic tool calling is: model writes code → code orchestrates many tools internally → final result → model sees only the result.

The example that makes it click

Anthropic's example: "Which team members exceeded their Q3 travel budget?"

Traditional approach:

  • Fetch team members → 20 people
  • For each person, fetch their Q3 expenses → 20 tool calls, each returning 50-100 line items (flights, hotels, meals, receipts)
  • Fetch budget limits by employee level
  • All of this enters Claude's context: 2,000+ expense line items, roughly 50 KB
  • Claude manually sums each person's expenses, looks up their budget, compares
  • Many round-trips to the model, significant context consumption

With Programmatic Tool Calling:

Claude writes this code:

python
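import asyncio
import json

# get_team_members, get_budget_by_level, and get_expenses are the tools exposed to the sandbox
team = await get_team_members("engineering")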

# Fetch budgets for each unique level
levels = list(set(m["level"] for m in team))
budget_results = await asyncio.gather(*[
    get_budget_by_level(level) for level in levels
])
budgets = {level: budget for level, budget in zip(levels, budget_results)}

# Fetch all expenses in parallel
expenses = await asyncio.gather(*[
    get_expenses(m["id"], "Q3") for m in team
])

# Find employees who exceeded their travel budget
exceeded = []
for member, exp in zip(team, expenses):
    budget = budgets[member["level"]]
    total = sum(e["amount"] for e in exp)
    if total > budget["travel_limit"]:
        exceeded.append({
            "name": member["name"],
            "spent": total,
            "limit": budget["travel_limit"]
        })

print(json.dumps(exceeded))

The tools are called from inside the code. Results stay in the Python runtime. Claude's context receives only the final output: the two or three people who exceeded their budget (maybe 1 KB of JSON). The 2,000+ line items never enter Claude's context.
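
To see the effect locally, the same logic can be run against stub tools. This is only a simulation under stated assumptions: the three tool functions are in-memory fakes with made-up data, whereas in real Programmatic Tool Calling they are provided by the code execution environment and routed through your client.

```python
import asyncio
import json

# Stub tools with fabricated data; stand-ins for the real tool calls.
async def get_team_members(dept):
    return [{"id": i, "name": f"emp{i}", "level": "senior" if i % 2 else "junior"}
            for i in range(20)]

async def get_budget_by_level(level):
    return {"travel_limit": 5000 if level == "senior" else 3000}

async def get_expenses(user_id, quarter):
    # 50 line items of 50 each (2,500 total); employee 4 gets 80 each (4,000 total).
    amount = 80 if user_id == 4 else 50
    return [{"amount": amount, "memo": "line item"} for _ in range(50)]

async def find_over_budget():
    team = await get_team_members("engineering")
    levels = sorted({m["level"] for m in team})
    limits = await asyncio.gather(*(get_budget_by_level(lv) for lv in levels))
    budgets = dict(zip(levels, limits))
    expenses = await asyncio.gather(*(get_expenses(m["id"], "Q3") for m in team))

    # Size of the data that would have entered context with one-at-a-time calls.
    intermediate_bytes = len(json.dumps(expenses))
    exceeded = []
    for member, items in zip(team, expenses):
        total = sum(e["amount"] for e in items)
        limit = budgets[member["level"]]["travel_limit"]
        if total > limit:
            exceeded.append({"name": member["name"], "spent": total, "limit": limit})
    return exceeded, intermediate_bytes

exceeded, intermediate_bytes = asyncio.run(find_over_budget())
final_bytes = len(json.dumps(exceeded))
```

Comparing final_bytes with intermediate_bytes gives a rough feel for how much data stays in the runtime instead of reaching the model.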

The numbers

  • Token usage dropped from 43,588 to 27,297 on complex research tasks in Anthropic's internal testing — a 37% reduction.
  • Internal knowledge retrieval accuracy improved from 25.6% to 28.5%.
  • GIA benchmarks improved from 46.5% to 51.2%.
  • In the extreme case Anthropic describes in its code-execution-with-MCP post (Google Drive transcript → Salesforce), token usage fell from 150K to 2K, a 98.7% reduction. That figure comes from the broader code execution pattern rather than PTC alone, but even typical cases see 30-50% reductions.
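
The two headline percentages can be checked with a few lines of arithmetic. The helper name is just for illustration:

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)

research = pct_reduction(43_588, 27_297)   # complex research tasks
extreme = pct_reduction(150_000, 2_000)    # Google Drive -> Salesforce case
```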

How it works under the hood

  1. You mark tools as callable from code by setting allowed_callers on each tool definition (in Anthropic's example, allowed_callers: ["code_execution_20250825"]). You also add the code_execution tool to the tools array.
  2. Claude generates Python code that calls the tools. The code is wrapped in a server_tool_use block with name code_execution.
  3. When the code calls a tool like get_expenses(), the API intercepts the call and returns it to your client as a normal tool_use block — but with a caller field indicating it originated inside code execution.
  4. Your client executes the tool (hitting your backend, or in the MCP case, forwarding to the MCP server) and returns the result.
  5. The result is routed back into the Python runtime, not into Claude's context. The code continues executing.
  6. When the code finishes, only its final stdout/return value enters Claude's context, as a code_execution_tool_result block.

The magic is in steps 3-5. Tool results flow through the code execution environment and never touch the model unless the code explicitly outputs them.
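
From the client's side, steps 3 to 5 might look like the sketch below. It works on plain dicts rather than a real SDK. The shape of the caller field and the code_execution_20250825 identifier follow Anthropic's published example but are assumptions to verify against the current API reference, and TOOL_IMPLS is a hypothetical local registry:

```python
import json

CODE_EXEC = "code_execution_20250825"  # assumed caller type; verify against the API docs

# Hypothetical registry: tool name -> local implementation (or MCP forwarder).
TOOL_IMPLS = {
    "get_expenses": lambda args: [{"user_id": args["user_id"], "amount": 120.0}],
}

def handle_tool_use(block):
    """Run one tool_use block; return (tool_result, came_from_code)."""
    output = TOOL_IMPLS[block["name"]](block["input"])
    tool_result = {
        "type": "tool_result",
        "tool_use_id": block["id"],  # correlates the result with the call
        "content": json.dumps(output),
    }
    # A caller of the code-execution type means the API hands this result
    # to the running script instead of placing it in the model's context.
    came_from_code = (block.get("caller") or {}).get("type") == CODE_EXEC
    return tool_result, came_from_code

programmatic_call = {
    "type": "tool_use",
    "id": "toolu_1",
    "name": "get_expenses",
    "input": {"user_id": "emp_123", "quarter": "Q3"},
    "caller": {"type": CODE_EXEC, "tool_id": "srvtoolu_1"},
}
direct_call = {k: v for k, v in programmatic_call.items() if k != "caller"}

from_code = handle_tool_use(programmatic_call)
from_model = handle_tool_use(direct_call)
```

The client executes both calls identically. The flag only matters for logging and for reasoning about what the model will actually see.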

The concrete benefits in Anthropic's own words

Token savings: By keeping intermediate results out of Claude's context, PTC dramatically reduces token consumption.

Reduced latency: Each traditional API round-trip requires model inference (hundreds of milliseconds to seconds). When Claude orchestrates 20+ tool calls in a single code block, you eliminate 19+ inference passes. The API handles tool execution without returning to the model each time.

Improved accuracy: By writing explicit orchestration logic, Claude makes fewer errors than when juggling multiple tool results in natural language.

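Parallel execution: asyncio.gather() in the code lets you run many tool calls concurrently. Traditional tool calling is far more sequential: independent calls can sometimes be batched in a single turn, but every dependent step costs another inference pass while the model waits and decides what to do next.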

Privacy-preserving flows: Since intermediate data doesn't enter the model's context, sensitive data can pass between tools without the model ever seeing it. Anthropic describes a pattern where the code execution environment auto-tokenizes sensitive fields (emails, phone numbers) before they flow to the model. The MCP client maintains a mapping and substitutes real values when calling downstream tools.
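
A minimal sketch of that tokenization idea, assuming only email addresses are masked and placeholders take the form [EMAIL_n]. The class name and regex are illustrative and not taken from any Anthropic library:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

class Tokenizer:
    """Swap emails for placeholders on the way to the model, and back on the way out."""

    def __init__(self):
        self._by_value = {}   # real value -> placeholder
        self._by_token = {}   # placeholder -> real value

    def tokenize(self, text):
        def replace(match):
            value = match.group(0)
            if value not in self._by_value:
                token = f"[EMAIL_{len(self._by_value) + 1}]"
                self._by_value[value] = token
                self._by_token[token] = value
            return self._by_value[value]
        return EMAIL_RE.sub(replace, text)

    def detokenize(self, text):
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text

tok = Tokenizer()
masked = tok.tokenize("Contact jane@acme.com or bob@example.org; cc jane@acme.com")
restored = tok.detokenize("send_email(to='[EMAIL_2]')")
```

The model would only ever see the masked string, while the client restores real values just before forwarding a downstream tool call.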

When to use it

Most beneficial when:

  • Processing large datasets where you only need aggregates or summaries
  • Running multi-step workflows with 3+ dependent tool calls
  • Filtering, sorting, or transforming tool results before Claude sees them
  • Handling tasks where intermediate data shouldn't influence Claude's reasoning
  • Running parallel operations across many items

Less beneficial when:

  • Making simple single-tool invocations
  • Working on tasks where Claude should see and reason about intermediate results
  • Running quick lookups with small responses

How this connects to the Anna's Archive MCP

Picture this flow: "Find all books by author X published after 2010, and summarize the themes across them." With traditional tool calling, you'd search, see 15 results each with metadata, call read on each one (loading tens of thousands of tokens), then have the model synthesize. With PTC, Claude writes code: search → filter by year → for each result, call read with targeted page ranges → extract summaries → concatenate → return a synthesized list. The 15 full document contents never enter Claude's context. Only the final synthesis does.

This is exactly the kind of workflow your MCP was built to support, and understanding that it works better under PTC than under naive tool calling is a sophisticated point you can make.


6. Solution 3: Code Execution with MCP

This is the most radical of Anthropic's three solutions and the one doing the most to reshape how people think about MCP architecture. It was published on November 4, 2025, in a post called "Code execution with MCP: building more efficient AI agents."

The core idea

Instead of presenting MCP tools to the model as function definitions in the API call, present them as TypeScript (or Python) files on disk in a filesystem the agent can navigate. The agent writes code that imports these files and calls the tools as normal function calls. The code runs in a sandboxed execution environment. The agent never sees tool definitions unless it explicitly reads a file.

Anthropic's example filesystem structure:

./servers/
  google-drive/
    getDocument.ts
    searchFiles.ts
    createFile.ts
    listFolders.ts
    ...
  salesforce/
    updateRecord.ts
    queryRecord.ts
    ...
  slack/
    sendMessage.ts
    getChannel.ts
    ...

Each tool becomes a file:

typescript
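// ./servers/google-drive/getDocument.ts
// callMCPTool is a helper provided by the MCP client; adjust the import path to your layout
import { callMCPTool } from "../../../client.js";
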
interface GetDocumentInput {
  documentId: string;
}

interface GetDocumentResponse {
  content: string;
}

/* Read a document from Google Drive */
export async function getDocument(input: GetDocumentInput): Promise<GetDocumentResponse> {
  return callMCPTool<GetDocumentResponse>('google_drive__get_document', input);
}

This file consumes zero context tokens until the agent reads it. The agent discovers tools by exploring the filesystem — listing ./servers/, reading specific files it needs. It only loads what the current task requires.
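
The discovery pattern can be sketched from the agent's side as three small helpers. This is an illustration only: the tree is built in a temporary directory with made-up file contents, and the helper names are hypothetical.

```python
import tempfile
from pathlib import Path

def build_demo_tree(root):
    """Create a tiny ./servers tree with made-up wrapper files."""
    files = {
        "google-drive/getDocument.ts": "/* Read a document from Google Drive */",
        "google-drive/searchFiles.ts": "/* Search files in Google Drive */",
        "salesforce/updateRecord.ts": "/* Update a Salesforce record */",
    }
    for rel, body in files.items():
        path = root / "servers" / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)

def list_servers(root):
    return sorted(p.name for p in (root / "servers").iterdir() if p.is_dir())

def list_tools(root, server):
    return sorted(p.stem for p in (root / "servers" / server).glob("*.ts"))

def load_tool(root, server, tool):
    """Only this call pulls a definition into the agent's context."""
    return (root / "servers" / server / f"{tool}.ts").read_text()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_tree(root)
    servers = list_servers(root)
    drive_tools = list_tools(root, "google-drive")
    loaded = load_tool(root, "google-drive", "getDocument")
```

Listing directories costs a handful of tokens for names. The full definition is paid for only when load_tool is called.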

Why this is different from Tool Search Tool

Tool Search Tool defers loading of tool definitions but still keeps them in Anthropic's managed catalog. The agent searches, and matching tools are expanded into context.

Code Execution with MCP goes further: the tool definitions live entirely on the filesystem. Discovery is filesystem navigation. Invocation is code execution. Composition is just writing Python or TypeScript. The model's context is never polluted with tool schemas at all unless the agent explicitly chooses to load a specific file.

The three benefits stack:

  1. Discovery happens on the filesystem — zero tokens until a file is read
  2. Invocation happens through code — tools get called from a sandboxed runtime, not through the model's context
  3. Composition happens through code — loops, conditionals, parallelism are all native Python/TypeScript, not awkward JSON tool chains

The killer example: the Google Drive → Salesforce case

Anthropic's motivating example: an agent uses Google Drive MCP to fetch a long sales meeting transcript, then uses Salesforce MCP to update a record with that transcript. The full transcript flows through the model twice — once on the way out of Google Drive, once on the way into Salesforce. For a 2-hour meeting, that's 50,000+ extra tokens passing through the model that don't change the logic of the task.

With code execution, the agent writes a script that fetches the transcript, stores it in a variable, and passes it directly to the Salesforce call. The transcript never touches the model. The 98.7% reduction (150K to 2K tokens) Anthropic cites comes from this kind of case.
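
A toy version of that script, with in-memory fakes standing in for the two MCP servers. The function names, record type, and IDs are all made up for illustration:

```python
# In-memory fakes standing in for the two MCP servers.
SALESFORCE = {}

def get_document(document_id):
    return {"content": "transcript line\n" * 5000}  # 80,000 characters of made-up transcript

def update_record(object_type, record_id, data):
    SALESFORCE[(object_type, record_id)] = data
    return {"ok": True}

def sync_transcript(document_id, record_id):
    transcript = get_document(document_id)["content"]   # stays in the sandbox
    update_record("SalesMeeting", record_id, {"Notes": transcript})
    # Only this short summary is reported back to the model.
    return f"Updated {record_id} with {len(transcript)} characters of notes"

summary = sync_transcript("abc123", "00Q5f000001abcXYZ")
```

The transcript moves between the two calls as an ordinary variable. The model sees a one-line summary rather than the document itself.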

The filesystem-as-interface pattern

There's a deeper insight here that Daniel Miessler and Simon Willison both picked up on. Anthropic is essentially proposing that MCP tool calls become filesystem-based Skills. Instead of the MCP protocol being the main interface the agent uses, MCP becomes a directory of things the agent can call, and the agent writes code to actually call them.

This ties directly to Agent Skills (the other Anthropic standard from late 2025). Skills are also filesystem-based — they're folders with a SKILL.md file, scripts, and resources. Claude loads only the SKILL.md metadata upfront, and the agent can navigate into the folder for deeper content when needed. The similarity is not accidental.

Simon Willison put it this way: "Most of my MCP usage with coding agents like Claude Code has been replaced by custom shell scripts for it to execute, but there's still a useful role for MCP in helping the agent access secure resources in a controlled way." The separation is forming: MCP becomes the secure, auth-aware distribution layer. Code (Python scripts, TypeScript files, CLIs) becomes the invocation layer.

Additional benefits Anthropic highlights

Beyond token savings, the post calls out several secondary benefits:

Privacy-preserving operations. Sensitive fields can be tokenized inside the execution environment. The model sees placeholders, while the MCP client maintains a secure mapping and restores real values when calling downstream tools. Data moves between MCP servers without raw identifiers ever entering the model's context.

State and reusable skills. The filesystem lets agents store intermediate files and reusable scripts. A helper function that transforms a sheet into a report can be saved in a ./skills/ directory and imported in later sessions. This is how Code Execution with MCP naturally becomes a skill-building pattern.

Familiar control flow. Loops, conditionals, and error handling use normal Python/TypeScript constructs. Anthropic's framing: "Although many of the problems here feel novel — context management, tool composition, state persistence — they have known solutions from software engineering. Code execution applies these established patterns to agents."

The catch: you need a secure execution environment

The cost of this approach is real: running agent-generated code requires a secure execution environment with appropriate sandboxing, resource limits, and monitoring. These infrastructure requirements add operational overhead and security considerations that direct tool calls avoid. Anthropic is upfront about this — the benefits are substantial, but the implementation cost is nontrivial.

For Thorin's client implementation, this likely means: start with Tool Search Tool (simpler, Anthropic-managed), add Programmatic Tool Calling for specific workflows (still Anthropic-managed, just opting into code execution), and consider full Code Execution with MCP only if the context problem becomes severe enough to justify running your own sandbox.

The field's reaction

This post caused a real stir when it dropped. Simon Willison wrote "This makes a lot of sense to me" before making the shell-script point quoted above. Daniel Miessler wrote a piece called "Anthropic Changes MCP Calls Into Filesystem-based Skills" arguing that Anthropic had essentially "thrown massive shade at MCPs" by demoting them to "service directories." That's overstated, since MCP is still important for distribution, auth, and discovery, but it captures the vibe shift.

The October 2025 → November 2025 period was when the field collectively realized that "just load everything into context" was a scaling failure, and the solutions started converging on "make context a scarce resource again and use code to orchestrate."


7. Tool Use Examples and the fourth piece of the puzzle

Tool Use Examples is the third of Anthropic's November 2025 advanced tool use features. It's less flashy than Tool Search or Programmatic Tool Calling but addresses a distinct problem: the model calls the right tool with wrong parameters.

The problem

JSON Schema is good at defining structure — types, required fields, enums. It's bad at expressing usage patterns: when to include optional parameters, which combinations make sense, what conventions the API expects.

Anthropic's example is a support ticket API with a deeply nested structure:

json
{
  "name": "create_ticket",
  "input_schema": {
    "properties": {
      "title": {"type": "string"},
      "priority": {"enum": ["low", "medium", "high", "critical"]},
      "labels": {"type": "array", "items": {"type": "string"}},
      "reporter": {
        "type": "object",
        "properties": {
          "id": {"type": "string"},
          "name": {"type": "string"},
          "contact": {
            "type": "object",
            "properties": {
              "email": {"type": "string"},
              "phone": {"type": "string"}
            }
          }
        }
      },
      "due_date": {"type": "string"},
      "escalation": {
        "type": "object",
        "properties": {
          "level": {"type": "integer"},
          "notify_manager": {"type": "boolean"},
          "sla_hours": {"type": "integer"}
        }
      }
    },
    "required": ["title"]
  }
}

The schema answers "what's valid?" but leaves critical questions unanswered:

  • Should due_date be "2024-11-06", "Nov 6, 2024", or "2024-11-06T00:00:00Z"?
  • Is reporter.id a UUID, "USR-12345", or just "12345"?
  • When should the agent populate reporter.contact?
  • How do escalation.level and escalation.sla_hours relate to priority?

These ambiguities lead to malformed tool calls and inconsistent parameter usage.

The solution

Add an input_examples field to the tool definition with 1-5 concrete examples showing different usage patterns. The examples teach the model format conventions, nested structure patterns, and optional parameter correlations — things schemas can't express.

json
{
  "name": "create_ticket",
  "input_schema": { /* same schema as above */ },
  "input_examples": [
    {
      "title": "Login page returns 500 error",
      "priority": "critical",
      "labels": ["bug", "authentication", "production"],
      "reporter": {
        "id": "USR-12345",
        "name": "Jane Smith",
        "contact": {
          "email": "jane@acme.com",
          "phone": "+1-555-0123"
        }
      },
      "due_date": "2024-11-06",
      "escalation": {
        "level": 2,
        "notify_manager": true,
        "sla_hours": 4
      }
    },
    {
      "title": "Add dark mode support",
      "labels": ["feature-request", "ui"],
      "reporter": {
        "id": "USR-67890",
        "name": "Alex Chen"
      }
    },
    {
      "title": "Update API documentation"
    }
  ]
}

Three examples, three patterns: full spec with all optional fields (critical bug), partial spec (feature request), minimal spec (internal task). From these the model learns date format, user ID convention, when to include nested contact info, and the correlation between priority and escalation.

Anthropic's internal testing: tool use examples improved accuracy from 72% to 90% on complex parameter handling.

Best practices

  • Use realistic data (real city names, plausible prices, not "string" or "value")
  • Show variety: minimal, partial, and full specification patterns
  • 1-5 examples per tool — more isn't better
  • Focus on ambiguity: only add examples where correct usage isn't obvious from schema
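
Because examples are hand-written, it can be worth linting them against the schema before shipping. The checker below is a deliberately small sketch (required fields, unknown top-level fields, enum values, and the 1-5 count), not a full JSON Schema validator, and check_examples is a hypothetical helper name:

```python
def check_examples(tool):
    """Return a list of problems found in a tool's input_examples."""
    schema = tool["input_schema"]
    props = schema.get("properties", {})
    required = schema.get("required", [])
    examples = tool.get("input_examples", [])
    problems = []
    if not 1 <= len(examples) <= 5:
        problems.append(f"expected 1-5 examples, got {len(examples)}")
    for i, example in enumerate(examples):
        for key in required:
            if key not in example:
                problems.append(f"example {i}: missing required field {key!r}")
        for key, value in example.items():
            if key not in props:
                problems.append(f"example {i}: unknown field {key!r}")
            elif "enum" in props[key] and value not in props[key]["enum"]:
                problems.append(f"example {i}: {key!r}={value!r} not in enum")
    return problems

tool = {
    "name": "create_ticket",
    "input_schema": {
        "properties": {
            "title": {"type": "string"},
            "priority": {"enum": ["low", "medium", "high", "critical"]},
        },
        "required": ["title"],
    },
    "input_examples": [
        {"title": "Login page returns 500 error", "priority": "critical"},
        {"title": "Update API documentation"},
        {"priority": "urgent"},  # deliberately wrong: no title, bad enum value
    ],
}
problems = check_examples(tool)
```

Only the third example is flagged here, once for the missing title and once for the invalid priority.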

How the three advanced tool use features fit together

This is the key mental model:

  • Tool Search Tool ensures the right tools get found
  • Programmatic Tool Calling ensures efficient execution
  • Tool Use Examples ensure correct invocation

They're complementary, not alternatives. A sophisticated MCP client uses all three: defer-loads most tools behind Tool Search, enables Programmatic Tool Calling for workflows with large intermediate results, and ships Tool Use Examples for any tool with non-obvious parameter conventions.


8. MCP vs API vs CLI — the full discourse

This is the other thing you asked about, and it's a rich topic: it was debated across the agent engineering community throughout 2025, and there are real arguments on every side.

The setup

Three ways to give an agent access to external capabilities:

  1. Direct API calls: The agent makes HTTP requests directly, using API documentation or an SDK
  2. CLI tools: The agent runs shell commands (gh issue list, curl, jq)
  3. MCP servers: The agent calls structured tools via the MCP protocol

Each has different tradeoffs in context efficiency, composability, auth, security, and training alignment.

The CLI camp's argument

The CLI-first position was articulated most clearly by Simon Willison, Armin Ronacher, Peter Steinberger, and others throughout 2025. The core arguments:

1. LLMs already know CLIs. Models were trained on billions of lines of bash, curl, git, docker, etc. Every man page, Stack Overflow answer, and shell tutorial is in the training data. When you give an agent access to gh, it already knows how to use it, having seen commands like gh issue list --limit 5 --json title,state many times in training. MCP launched in late 2024, so models have seen comparatively little MCP usage in training, and every MCP interaction has to spell out its schemas at inference time.

Simon Willison's exact framing: "If your coding agent can run terminal commands and you give it access to GitHub's gh tool it gains all of that functionality for a token cost close to zero — because every frontier LLM knows how to use that tool already."

2. Context cost is dramatically lower. The GitHub MCP server exposes 93 tools, ~55,000 tokens of definitions. The gh CLI is already in the training data. You can give the agent gh with zero schema tokens and get the same functionality. At 10,000 operations per month, the cost difference can be an order of magnitude or more ($3.20 vs $55.20 in one ScaleKit benchmark).

3. The pipe architecture is efficient. CLI tools compose via pipes. gh issue list --json title,state | jq '.[] | .title' produces ~1,400 tokens. The equivalent MCP flow loads 93 tool schemas (~55K tokens), picks one, gets the full response in context, then reasons over it. The shell does the filtering, not the model. jq, grep, awk, head are context-window-efficient filters with no MCP equivalent.

4. CLI tools are reliable and don't need re-auth. The complaint from this camp is that MCP servers are flaky, need frequent re-auth, and add moving parts, whereas CLI binaries just run.

5. Model performance is actually better. ScaleKit benchmarked CLI vs MCP for the same GitHub operations across 75 runs using Claude Sonnet 4. CLI used 4-32x fewer tokens per operation and hit 100% success rate. MCP hit 72%.

6. The convergence is telling. Every major coding agent (Claude Code, Codex, Cursor, Aider, OpenHands) uses bash as its foundation. Models have been trained on billions of bash examples and almost no MCP schemas, so the tool format the field has converged on for coding agents isn't MCP; it's bash.

Peter Steinberger (who built ~10 custom CLIs for his OpenClaw workflow and got hired by OpenAI for it) put it bluntly: "MCPs were a mistake. Bash is better."

The MCP camp's counter

The pro-MCP position has weakened somewhat in 2025 but is still valid:

1. MCP's real value is distribution, not invocation. Simon Willison himself acknowledged this: "MCP's real value is distribution, not invocation. You can change your MCP server anytime and every connected agent picks up the new tools instantly. No SDK updates, no versioning. That's genuinely useful." Compare to CLI: if you update your CLI, users have to download the new binary. MCP servers update live.

2. MCP handles auth cleanly. The OAuth 2.1 flow in MCP is well-designed and, thanks to Dynamic Client Registration, handles the "agents don't know what services they'll talk to in advance" problem that traditional OAuth didn't. CLI tools typically require pre-existing credentials in the shell's environment, which is a security footgun — the agent has access to everything the shell has.

3. Programmatic SDK access. MCP works for non-shell contexts. OpenAI's Responses API lets you pass an MCP server as a tool provider in a few lines. Browser agents, chatbots, and sandboxed enterprise environments where there's no terminal at all can use MCP naturally.

4. Schemas enable static analysis and safety. MCP schemas let clients validate arguments before execution, enforce permissions, and provide good error messages. CLI tools are shell commands — anything can be passed as an argument, and validation happens (or doesn't) inside the binary. For high-stakes actions, schemas are a meaningful safety layer.

5. MCP can evolve. The November 2025 advanced tool use features (Tool Search, Programmatic Tool Calling) are partly about closing the context-efficiency gap with CLIs. They work. Claude Code now defers MCP tool schemas by default and loads them on demand. Cursor does the same, reporting a 46.9% reduction in total agent tokens. The gap is narrowing fast.

6. Security surface. This is the real hidden cost of the CLI-first approach. You're giving the agent a shell. The security surface is the entire operating system. MCP's auth is scoped per server with OAuth 2.1 — the standard that REST APIs also use. Giving an agent a shell for composability means accepting that the agent can do anything that user can do on that machine, including things you didn't anticipate.

The synthesis everyone's converging on

The debate has shifted. As of early 2026, the consensus looks something like this:

Use MCP for:

  • Distribution and discovery (publish your tool, any agent can find it)
  • SDK-level tool access (calling an LLM from your own code)
  • Environments without a shell (browser, chatbot, mobile)
  • Security-critical operations where schema validation matters
  • Cross-organization tool sharing where trust is managed via OAuth

Use CLIs for:

  • Developer agents with terminal access (Claude Code, Codex, Cursor, etc.)
  • Tasks where the CLI already exists and is training-data-familiar
  • Operations where pipe composition is natural (text filtering, data transformation)
  • Avoiding context tax when you have ~50+ MCP tools available

Use Code Execution with MCP (the new synthesis) for:

  • Workflows with large intermediate results that shouldn't flow through the model
  • Multi-step orchestration where loops and conditionals are natural in code
  • Complex data transformations that are painful in natural-language tool chains
  • Privacy-sensitive flows where intermediate data should be tokenized

The right framing is that MCP, CLI, and code execution are three different places to put the "intelligence" of tool use. MCP puts it in the protocol and lets the model invoke tools directly. CLI puts it in the training data prior and lets the model use familiar commands. Code execution puts it in the sandboxed runtime and lets the model orchestrate tools programmatically. For different tasks, different placements are right.

How to talk about this in the call

If Asanka asks about the MCP vs CLI debate (and he might, given that he's at an 8VC-incubated applied AI startup that's thinking hard about exactly this), the strong answer is the synthesis above. Don't pick a side. Say something like:

"The debate has shifted meaningfully in the last six months. The MCP vs CLI framing was the dominant tension through most of 2025, but Anthropic's November advanced tool use release and their code-execution-with-MCP post reframed the question. It's not MCP vs CLI anymore — it's where you put the tool invocation layer. MCP is still the right distribution and auth layer, CLIs are still the most token-efficient invocation layer for developer agents with terminal access, and Code Execution with MCP is the emerging pattern for workflows with heavy intermediate data. I'd expect the next year of agent engineering to be about combining these three rather than choosing between them. For a product like Thorin's — observing work across multiple services and taking action — the auth story is critical, which tips me toward MCP for the integration layer, but I'd absolutely want to use programmatic tool calling or code execution for any workflow that involves large intermediate results."

That's the answer that sounds like you've been paying attention to the field.

The failure mode to avoid

Don't be a CLI-maximalist. There's a particular flavor of "MCP is bad, bash is better" take that's popular in certain developer circles, and while it has merit for coding agents, it doesn't generalize to Thorin's product. Thorin needs to integrate with services where OAuth is non-negotiable (Gmail, Slack enterprise, Notion). CLI-only is a fantasy for that kind of integration surface. If you go in sounding like a Peter Steinberger maximalist, you'll sound like you haven't thought about the product Asanka is actually building.

Temper your take. Acknowledge the CLI case is strong for developer agents. Then explain why Thorin's case is different.


9. Namespacing, scoping, and tool groups

This is a more tactical section but directly relevant to MCP client implementation.

The namespacing problem

When you connect multiple MCP servers, tool name collisions become a real issue. Both GitHub and GitLab might expose a create_issue tool. Both Gmail and Outlook might expose send_email. If your client just exposes them as bare names, the model has no way to disambiguate.

The Anthropic recommendation from the tool-writing post: namespace by service, and optionally by resource. Prefix-based: github_create_issue, gitlab_create_issue. Hierarchical: github_issues_create, github_issues_list, github_pulls_create.

Anthropic explicitly tested prefix vs suffix namespacing and found non-trivial differences in tool-use evaluations. The exact winner depends on the LLM, so they recommend testing your own setup. For Claude specifically, prefix namespacing seems to win slightly in most cases.
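
A minimal sketch of prefix namespacing combined with the routing table a client needs to map an exposed name back to its server. The server and tool names are the hypothetical ones from this section, and the function names are illustrative:

```python
def build_routing_table(servers):
    """Map prefixed tool names to (server, original tool name)."""
    routes = {}
    for server, tool_names in servers.items():
        for name in tool_names:
            exposed = f"{server}_{name}"
            if exposed in routes:
                raise ValueError(f"duplicate exposed tool name: {exposed}")
            routes[exposed] = (server, name)
    return routes

def route_call(routes, exposed_name):
    """Resolve the name the model used back to the owning server and its tool."""
    return routes[exposed_name]

routes = build_routing_table({
    "github": ["create_issue", "list_issues"],
    "gitlab": ["create_issue"],
})
target = route_call(routes, "gitlab_create_issue")
```

Keeping the original name in the table means the client can strip the prefix again before forwarding the call to the server.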

Tool groups for workflow-based scoping

A related pattern: instead of namespacing by service, group tools by workflow. "Development" tools, "QA" tools, "Admin" tools. This is what Lunar MCPX and others have promoted as "Tool Groups" — named collections bundled across servers into meaningful task-level buckets.

The idea: a given agent session is usually about one workflow. If the user is doing customer support, they need read_ticket, reply_to_customer, escalate, lookup_order, refund_payment. They don't need deploy_code or update_schema. Load only the current workflow's tools.

This maps well onto Tool Search Tool's deferred loading pattern. You define tool groups, mark everything deferred by default, and when a workflow starts, you enable the relevant group. The model never sees the irrelevant tools.

Per-session scoping

The most granular pattern is per-session scoping: at the start of a session, determine (from the user's intent, their role, the channel, whatever signal) which subset of tools this session needs, and only expose that subset. This is effectively what Cursor and Claude Code do with their "Configure Tools" interfaces — the user manually curates a per-project tool set.

For a Thorin MCP client, you'd want to think about scoping as a first-class concern:

  • What subset of tools is available for this session?
  • What's the policy for expanding the set mid-session (e.g., user asks about Slack during what was meant to be a GitHub-focused session)?
  • How does the user override the default scope?
  • How do you enforce scope at the client level vs. trusting the model to only call in-scope tools?

Security and least-privilege

Scoping isn't just about context efficiency — it's a security posture. If your client only exposes read tools in a session that's meant to be read-only, the model literally cannot take a destructive action. You've enforced least privilege at the tool-exposure layer, not at the authorization layer.
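
One way to make that enforcement concrete is to filter what the model sees and then check again at call time. The class and tool names below are illustrative assumptions rather than part of any MCP SDK:

```python
class ScopeError(Exception):
    pass

class SessionScope:
    """Holds the set of tools a session may see and call."""

    def __init__(self, allowed):
        self.allowed = set(allowed)

    def visible_tools(self, all_tools):
        # Only in-scope definitions are ever sent to the model.
        return [t for t in all_tools if t["name"] in self.allowed]

    def check_call(self, tool_name):
        # Enforced again at call time, in case the model names an out-of-scope tool.
        if tool_name not in self.allowed:
            raise ScopeError(f"{tool_name} is not available in this session")

ALL_TOOLS = [{"name": "read_ticket"}, {"name": "lookup_order"}, {"name": "refund_payment"}]
read_only = SessionScope({"read_ticket", "lookup_order"})
visible = [t["name"] for t in read_only.visible_tools(ALL_TOOLS)]

try:
    read_only.check_call("refund_payment")
    blocked = False
except ScopeError:
    blocked = True
```

The second check matters because a model can still emit a tool name it was never shown, for example one remembered from an earlier session.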

This matters for Thorin because their product is about agents taking actions on users' work systems. You want a clear story for "the agent can only do X in context Y because we only gave it the X tools." Saying "the agent has access to everything but we trust it to behave" is a much weaker safety story.


10. Implementation checklist for the Thorin client

Concrete things to think about if Thorin's onsite task comes up or the role materializes:

Core protocol implementation

  • [ ] JSON-RPC 2.0 message serialization with request ID tracking
  • [ ] Both stdio and streamable HTTP transports, with a clean abstraction
  • [ ] Capability negotiation during initialize
  • [ ] Tool/resource/prompt list fetching with cache and refresh on *_list_changed notifications
  • [ ] Error handling with proper JSON-RPC error codes
  • [ ] Graceful reconnection for HTTP transport
  • [ ] Cancellation propagation
  • [ ] Clean shutdown sequence
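
As a starting point for the first item, request ID tracking can be sketched independently of any transport. The class name is hypothetical, and a real client would also need notifications, timeouts, and cancellation:

```python
import itertools
import json

class JsonRpcSession:
    """Builds JSON-RPC 2.0 requests and matches responses to them by id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request id -> method name

    def request(self, method, params=None):
        req_id = next(self._ids)
        message = {"jsonrpc": "2.0", "id": req_id, "method": method}
        if params is not None:
            message["params"] = params
        self._pending[req_id] = method
        return json.dumps(message)

    def handle_response(self, raw):
        message = json.loads(raw)
        method = self._pending.pop(message["id"])  # KeyError means an unknown id
        if "error" in message:
            err = message["error"]
            raise RuntimeError(f"{method} failed: {err['code']} {err['message']}")
        return method, message["result"]

session = JsonRpcSession()
wire = session.request("tools/list")
method, result = session.handle_response(
    json.dumps({"jsonrpc": "2.0", "id": 1, "result": {"tools": []}})
)
```

The same pending map is the natural place to hang timeouts and cancellation later, since both need to find an in-flight request by id.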

Tool routing and namespacing

  • [ ] Tool name disambiguation across multiple servers (prefix-based)
  • [ ] Internal routing table: which server owns which tool
  • [ ] Validation of tool arguments against input_schema before forwarding to server
  • [ ] Proper tool_result correlation back to tool_use_id

Context management

  • [ ] Tool deferral support (mark tools as defer_loading: true)
  • [ ] Tool Search Tool integration (or custom tool search if targeting non-Anthropic models)
  • [ ] Optional Programmatic Tool Calling support
  • [ ] Per-session scoping / tool groups
  • [ ] Mechanism to declare "core" tools vs "discoverable" tools

Auth

  • [ ] OAuth 2.1 flow with DCR for services that support it
  • [ ] API key storage (never in model context)
  • [ ] Token refresh
  • [ ] Per-server credential isolation
  • [ ] URL mode elicitation support (from Nov 2025 spec) for OAuth that needs a browser

Safety

  • [ ] Action logging with reasoning
  • [ ] Approval workflows for destructive actions
  • [ ] Dry-run mode
  • [ ] Tool scope enforcement at client level (not trusting model to self-restrict)
  • [ ] Rate limiting and timeout handling

Developer experience

  • [ ] Clean error messages when tool calls fail (prompt-engineered for the agent to recover)
  • [ ] Tracing/logging of every message in the protocol
  • [ ] Replay capability for debugging
  • [ ] Eval harness for testing tool descriptions against realistic tasks

Advanced

  • [ ] Code execution with MCP pattern (filesystem exposure of tools)
  • [ ] Tool Use Examples support
  • [ ] Sampling (client-side) for servers that need LLM access without their own API keys
  • [ ] Elicitation support for interactive workflows
  • [ ] Resource subscription for servers that push updates

Not all of these are necessary for a v1, but having thought about them means you can answer "how would you scale this to 100 servers" without flinching.


11. Things you should be able to explain on the whiteboard

If the onsite task reappears in conversation, these are the concrete artifacts to be ready to draw:

1. The MCP message flow. Host → client → server. Initialize handshake, tools/list, conversation loop, tools/call, tool_result back. Be able to draw this in under a minute.

2. The tool definition lifecycle. Where does a tool definition live (server), how does it get to the LLM (client fetches, converts, injects), what happens when the LLM calls it (tool_use → client → server → result → tool_result → back to LLM). The key insight: tool definitions are context-consuming tokens.

3. The too-many-tools math. Draw a context window. Show 60K tokens consumed by tool definitions out of 200K. Explain that this is 30% of context gone before the user has spoken. Explain attention dilution as the secondary cost.

4. Tool Search Tool in action. Draw the traditional flow (all tools loaded upfront, ~77K tokens consumed). Then draw the Tool Search flow (search tool only, ~500 tokens, discover tools on demand, ~8.7K total). The 85% reduction is the headline number.

5. Programmatic Tool Calling. Draw the traditional multi-round loop (model → tool → model → tool → model). Then draw the PTC flow (model writes code → code runs → tools called from code → intermediate results stay in sandbox → final result returned). The 37% token reduction is the concrete number.

6. Code Execution with MCP. Draw the filesystem tree (./servers/google-drive/getDocument.ts). Show the agent navigating it. Contrast with the traditional flow where every tool lives in context. The 150K → 2K example is the extreme case.

7. MCP vs CLI vs Code Execution. Draw three boxes: distribution/discovery (MCP), invocation (CLI for known tools), and orchestration (code execution). Explain that they compose rather than compete.

8. The Anna's Archive architecture. Your own system. Postgres metadata DB + MCP server (TypeScript) + stdio/HTTP transport + search/download/read/stats tools. Deployed as Docker on Olares, exposed via Tailscale Funnel. This is the concrete example you bring to every abstract discussion.
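
For items 1 and 2, it helps to have the actual message shapes in your head before drawing. The sketch below builds the client-to-server messages in order and converts a server's tool definition into the shape the LLM sees. The function names and the `example-client` identity are assumptions for illustration; the method names and the `inputSchema` to `input_schema` rename follow the MCP spec and the Anthropic tool format.

```python
import itertools
from typing import Any, Optional


def client_messages(tool_name: str, arguments: dict[str, Any]) -> list[dict[str, Any]]:
    """Client -> server JSON-RPC messages for one handshake plus one tool call."""
    ids = itertools.count(1)

    def request(method: str, params: Optional[dict[str, Any]] = None) -> dict[str, Any]:
        msg: dict[str, Any] = {"jsonrpc": "2.0", "id": next(ids), "method": method}
        if params is not None:
            msg["params"] = params
        return msg

    return [
        request("initialize", {
            "protocolVersion": "2025-11-25",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1.0"},  # assumed identity
        }),
        # Notifications carry no id because no response is expected.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        request("tools/list"),
        request("tools/call", {"name": tool_name, "arguments": arguments}),
    ]


def to_llm_tool(mcp_tool: dict[str, Any]) -> dict[str, Any]:
    """Convert an MCP tool definition into the form injected into the model's context."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        "input_schema": mcp_tool["inputSchema"],
    }
```

Everything `to_llm_tool` returns is serialized into the prompt, which is the concrete reason tool definitions count as context-consuming tokens.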


12. Source index

Most important (read first if time allows)

Anthropic — Introducing advanced tool use on the Claude Developer Platform (Nov 24, 2025)
https://www.anthropic.com/engineering/advanced-tool-use
The single most important read for your onsite task. Tool Search Tool, Programmatic Tool Calling, and Tool Use Examples all covered in technical depth. The 85% / 37% / 72→90% numbers all come from here. If you only read one thing for MCP client work, read this.

Anthropic — Code execution with MCP: Building more efficient AI agents (Nov 4, 2025)
https://www.anthropic.com/engineering/code-execution-with-mcp
The filesystem-as-tool-interface proposal. The Google Drive → Salesforce example. The 150K → 2K reduction claim. This is the post that shifted the conversation from "MCP vs CLI" to "where should the invocation layer live."

MCP spec — Latest version 2025-11-25
https://modelcontextprotocol.io/specification/2025-11-25
The authoritative protocol reference. Know the six primitives, the transport details, the initialization flow.

MCP blog — One Year of MCP: November 2025 Spec Release
https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/
The retrospective and changelog. Task-based workflows, URL mode elicitation, sampling with tools, agentic servers. Know these feature names if asked about recent changes.

For tool selection depth

Kaxil Naik — "MCP Sucks" (Until It Doesn't): When Each Wins
https://kaxil.substack.com/p/mcp-vs-cli-vs-rest
The most balanced take on the MCP vs CLI vs REST debate. Has the benchmark numbers (4-32x fewer tokens, 100% vs 72% success rate, $3.20 vs $55.20 at 10K ops/month). Good for the synthesis.

Redis — Solving the MCP Tool overload problem
https://redis.io/blog/from-reasoning-to-retrieval-solving-the-mcp-tool-overload-problem/
The "tool selection as retrieval problem" framing. 98% token reduction, 8x retrieval speedup, 2x accuracy improvement in their benchmarks. Good concrete example of an external team building what Anthropic later shipped as Tool Search Tool.

Simon Willison's MCP tag archive
https://simonwillison.net/tags/model-context-protocol/
The most thoughtful running commentary on MCP. Start with the most recent entries. The Geoffrey Huntley 55K-token quote is here, as is the "MCP's real value is distribution" insight and the OAuth/DCR point about Kenton Varda.

Lunar — How to Prevent MCP Tool Overload
https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents
Good catalog of the failure modes when too many tools are exposed. Cites Microsoft Research on tool name collisions. Describes the "freeze or fail to select" pattern.

For CLI-first perspective

Courier Blog — MCP vs CLI for AI agents: why the terminal wins today
https://www.courier.com/blog/your-ai-agent-already-knows-how-to-use-a-terminal-why-cli-beat-mcp-servers
The cleanest articulation of the CLI-first position. The training-data argument. The MCP-as-distribution concession. Worth knowing for the synthesis argument.

Jannik Reinhard — Why CLI Tools Are Beating MCP for AI Agents
https://jannikreinhard.com/2026/02/22/why-cli-tools-are-beating-mcp-for-ai-agents/
Concrete benchmark story: same task, MCP vs CLI, dramatic difference. The "95% of context window available" claim. Worth knowing the anecdote.

For the Tool Search / dynamic loading mechanics

Speakeasy — Dynamic tool discovery in MCP
https://www.speakeasy.com/mcp/tool-design/dynamic-tool-discovery
The notifications/tools/list_changed pattern. How servers can dynamically enable/disable tools based on auth state. Important detail for client implementation.

Candede — Solving MCP Context Bloat with Claude's Tool Search API
https://www.candede.com/articles/claude-tool-search
Good walkthrough of the Tool Search API with implementation details. Covers regex vs BM25 search modes. The defer_loading: true configuration.

GitHub — github-mcp-server dynamic toolsets issue
https://github.com/github/github-mcp-server/issues/275
Real-world example of a major MCP server implementing dynamic toolset discovery. Worth reading to see how it's actually done in production.

For multi-tool discourse and Skills

DEV Community — MCP vs Agent Skills: Why They're Different, Not Competing
https://dev.to/phil-whittaker/mcp-vs-agent-skills-why-theyre-different-not-competing-2bc1
The relationship between MCP and Agent Skills. Progressive discovery. How Skills influenced the Tool Search Tool design.

Simon Willison — Code execution with MCP analysis
https://simonwillison.net/2025/Nov/4/code-execution-with-mcp/
Simon's take on Anthropic's code execution post. "What if you could turn MCP tools into code functions instead, and then let the LLM wire them together with executable code?" Good framing.


Final reminder

The point of this doc is to give you MCP internals fluency for the onsite task and for any deep probe Asanka might do on tool architecture. Combine it with the main AI agents study doc for the full prep package.

The one-sentence version of each section, so you have it at the tip of your tongue:

  1. Client architecture: JSON-RPC over stdio or HTTP, host owns trust boundary, clients handle protocol mechanics per server.
  2. Tool selection: the model is just reading text; tool descriptions are prompts; selection emerges from next-token prediction.
  3. Too-many-tools problem: 55K+ tokens upfront for schemas, attention dilution, selection accuracy collapse above 30-50 tools.
  4. Tool Search Tool: defer tools, load on demand, 85% token reduction, Opus 4 accuracy 49% → 74%.
  5. Programmatic Tool Calling: model writes code, intermediate results stay in sandbox, 37% token reduction on complex workflows.
  6. Code Execution with MCP: tools as files on disk, discovery by filesystem navigation, up to 98.7% token reduction.
  7. Tool Use Examples: schemas can't express usage patterns, examples teach conventions, 72% → 90% accuracy.
  8. MCP vs CLI vs Code Execution: distribution vs invocation vs orchestration, use all three.
  9. Namespacing and scoping: prefix by service, group by workflow, enforce scope at client level for safety.
  10. Client implementation checklist: transport abstraction, capability negotiation, namespacing, deferral, OAuth, scoping.

And keep the primary rule: when you don't know something, say "I don't know, but here's how I'd find out." That overrides everything.