How I Cut My AI Agent's Tool Context by 97%

Algis Dumbris • 2026/04/24

TL;DR

With 14 MCP servers wired into my daily Claude setup, the serialized tool list cost 54,707 tokens before the agent read a single user message. After switching the gateway to a single retrieve_tools meta-tool, the same setup cost 818 tokens at conversation start — a 97% reduction on this specific workstation. The tools are still all there; the agent just asks for the ones it needs, when it needs them. This is one measurement on one laptop, not a benchmark claim — the public benchmark harness for the ecosystem-wide numbers is coming.

The Problem Nobody Warned Me About

The first time I connected a handful of MCP servers to Claude Desktop I got a polite out-of-memory error before the conversation started. The second time I paid closer attention: the JSON tool manifest had ballooned past 50,000 tokens. On a 200K context model that is a quarter of the window gone before any work happens, and the model’s tool-selection accuracy degrades as the list grows because it has to rank dozens of nearly identical candidates every turn. This is a documented scaling problem — Cursor caps tool count at 40, GitHub Copilot at 128, and the dev.to case study that kicked off a lot of this discourse measured 143K of 200K tokens eaten by tool definitions alone.

The structural issue is that every MCP tool gets injected as a JSON schema: name, description, parameter types, enums, constraints, oneOfs. A simple tool like slack_send_message runs 400–600 tokens. A schema-heavy one like github_create_pull_request with its reviewer lists, draft flags, and commit options climbs past 1,400. Multiply by 40+ tools across the servers a real developer actually uses and the bill comes due.

The Setup I Measured

I ran this on a single laptop, with a single user, against a single agent. It is not a benchmark. It is an honest reading from the machine I work on every day.

Agent: Claude Desktop, latest stable
Gateway: MCPProxy v0.24.x (local-first, Go)
Upstream servers: 14 — GitHub, GitLab, Slack, filesystem, git, sqlite, fetch, puppeteer, brave-search, memory, sentry, linear, google-drive, time
Total tools exposed: 187

The servers were imported in one line:

mcpproxy upstream import ~/Library/Application\ Support/Claude/claude_desktop_config.json

That command reads an existing Claude Desktop config, registers every upstream server with the proxy, and points the client at http://localhost:8080/mcp. No duplicate configuration, no per-server re-auth.

The Measurement

I instrumented two variants of the same Claude Desktop config and recorded the token count of the tools/list response immediately after session initialization — the payload the agent actually receives before your first prompt.

Configuration	tools/list tokens	Δ
Direct: 14 upstream servers, 187 tools	54,707	—
Via MCPProxy: `retrieve_tools` only	818	−97%

Token counts measured with Anthropic’s count_tokens endpoint against the raw JSON of the tools/list response. Both runs used identical upstream servers, identical tool sets, identical order. The only variable was whether Claude saw all 187 schemas up front or a single discovery meta-tool.

The 818 tokens is the cost of two things: the retrieve_tools function (a BM25 search over indexed tool metadata) and the three companion tools MCPProxy exposes for the actual invocation flow (call_tool_read, call_tool_write, call_tool_destructive). Everything else is pulled in on demand when the agent calls retrieve_tools("create github pull request") and gets back the 3–5 schemas that actually match.

Why This Works

MCPProxy indexes every tool’s name, description, and parameter schemas into an in-process Bleve BM25 index at startup. BM25 is lexical, not semantic, and that turns out to be the right floor for MCP because tool names and descriptions are written by humans who use the same vocabulary their users will. A query like “create a pull request” hits the word “pull” and “request” in github_create_pull_request with full-score overlap; no embedding model, no vector database, no API key, no round trip. Sub-millisecond lookup on a laptop.

BM25 is not a silver bullet — we’ve written at length about where it shines (small-to-medium tool sets with descriptive names) and where it needs help (cross-tool paraphrasing, intent bridging). For most personal setups under 300 tools, it is enough. For larger deployments we are building a hybrid retriever interface — more on that when the RFC ships.

What I Am Not Claiming

This is not the 99% number. The mcpproxy README currently cites ~99% token reduction with +43% tool-selection accuracy based on published research. Those numbers come from specific benchmark conditions, not from this laptop. The public benchmark harness for reproducing them across a fixed tool corpus is on our roadmap for the next release window — I will publish methodology, harness code, and the dashboard together.
This is not a lossless improvement. The tradeoff is one extra round-trip: the agent asks retrieve_tools, gets candidates back, then calls the one it wants. That adds latency per turn but gains it back by not wasting tokens on tools the agent was never going to use.
This is not a replacement for sensible tool curation. If you genuinely need 187 tools live, you need 187 tools live. Most of us don’t. Most of us have 14 we use weekly and 173 we installed once.

Reproduce This on Your Own Machine

# 1. Install — one command, static binary
brew install smart-mcp-proxy/tap/mcpproxy
# or: go install github.com/smart-mcp-proxy/mcpproxy-go/cmd/mcpproxy@latest

# 2. Import your existing Claude / Cursor / Codex config
mcpproxy upstream import ~/Library/Application\ Support/Claude/claude_desktop_config.json
#   or: mcpproxy upstream import ~/.cursor/mcp.json
#   or: mcpproxy upstream import ~/.codex/config.toml

# 3. Start the proxy
mcpproxy serve

# 4. Point your agent at http://localhost:8080/mcp
#    (or /mcp/<profile> for scoped tool groups — coming in v0.25)

That is the whole install path. If you run the direct config and the proxy config side by side and count tokens on the tools/list response, you should see the same order-of-magnitude gap I saw.

I will publish the benchmark harness — fixed tool corpus, reproducible numbers, the full methodology — when it lands. In the meantime, if you run this on your own setup, open an issue with your before/after numbers. The more honest readings we collect in the wild, the better the claims on the README get.

Further reading