Beyond BM25: The Future of MCP Tool Discovery

Algis Dumbris • 2026/03/15

TL;DR

In our earlier post, we made the case for BM25 as the right default for MCP tool discovery — and for small-to-medium tool sets, that case still holds. But new benchmarks from StackOne, Stacklok, and the RAG-MCP paper paint a more nuanced picture: BM25 alone delivers just 14% top-1 accuracy when tool counts climb past a few hundred. Hybrid approaches combining BM25 with semantic search hit 94%. This post lays out what the data actually shows, why BM25 degrades at scale, and how MCPProxy is evolving toward hybrid search while keeping the zero-dependency simplicity that makes it useful.

The Benchmarks Are In

Three independent evaluations have landed in the last few months, and they tell a consistent story.

StackOne’s benchmark tested 270 tools across 11 API categories (HRIS, ATS, CRM, Ticketing, IAM, Messaging, and others) with 2,700 natural-language queries. Their findings:

| Method | Top-1 Accuracy | Top-5 Accuracy | Latency |
|---|---|---|---|
| BM25 only | 14% | 87% | <1ms |
| TF-IDF/BM25 hybrid (alpha 0.1-0.2) | 21% | 90% | <1ms |
| Embedding search | 38% | 85% | 50-200ms |
| Reranker | 40%+ | 90%+ | 200-500ms |

Stacklok’s MCP Optimizer ran a head-to-head comparison against Anthropic’s built-in Tool Search across 2,792 tools. Their hybrid semantic+BM25 approach achieved 94% selection accuracy versus 34% for BM25-only and 30% for regex-based search. The hybrid system also ran faster — 5.75 seconds average versus 12-13.5 seconds for Anthropic’s remote service.

The RAG-MCP paper confirmed what we already knew about the retrieval-first approach: agents given every tool upfront achieve just 13.6% accuracy, while retrieval-first routing more than triples it to 43.1%. But the paper also demonstrated that the retrieval mechanism itself matters enormously — not just whether you retrieve, but how.

Tool discovery accuracy comparison across methods

Why BM25 Breaks Down at Scale

When we wrote the original BM25 post, we focused on a real and common scenario: 100-500 tools across 15-30 MCP servers. In that regime, BM25 genuinely works well. Tool names are descriptive, queries are keyword-heavy, and the term-frequency mechanics of BM25 reward exact matches effectively.

But the data shows a clear failure mode as tool counts grow. Here is what actually happens:

Common verbs saturate the index. When you have 2,000+ tools, verbs like “create,” “list,” “get,” “update,” and “delete” appear in hundreds of tool names and descriptions. BM25’s IDF (inverse document frequency) component is supposed to downweight common terms, but when 40% of your tools contain “create,” the discriminating power of that term drops to near zero. The query “create a pull request” matches github_create_pull_request, but it also matches jira_create_issue, slack_create_channel, notion_create_page, and dozens more — all with similar BM25 scores.
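The IDF collapse is easy to see numerically. A quick sketch using the standard BM25 IDF formula (the corpus size and document frequencies are hypothetical, chosen to mirror the 2,000-tool scenario above):

```python
import math

def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """Standard BM25 IDF: log((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

N = 2000  # hypothetical index of 2,000 tool definitions

# "create" appears in 40% of tools; "github" in only a handful.
idf_create = bm25_idf(N, doc_freq=800)
idf_github = bm25_idf(N, doc_freq=12)

# The rare term carries roughly 5x the discriminating weight.
print(f"IDF('create') = {idf_create:.2f}")  # 0.92
print(f"IDF('github') = {idf_github:.2f}")  # 5.08
```

With "create" worth barely a fifth of a rare term, the query "create a pull request" is effectively decided by "pull" and "request" alone, and any tool sharing those terms lands in the same score band.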

Short documents amplify the problem. BM25’s document-length normalization assumes documents vary meaningfully in length. Tool descriptions are uniformly short — typically 10-50 words. This collapses a dimension that BM25 normally uses for discrimination.

Semantic intent gets lost. A developer searching for “notify the team about a deployment” might need a Slack tool, a PagerDuty tool, or an email tool. BM25 cannot bridge the gap between “notify” and “send_message” or between “deployment” and the description of a CI/CD notification tool. It needs the exact keywords to appear in the tool definition.

None of this invalidates BM25 for smaller deployments. If you have 50 tools, the verb saturation problem barely exists — there might be three tools with “create” in the name, and BM25 easily distinguishes them by the other terms. The 87% top-5 accuracy from StackOne’s benchmark confirms that BM25 almost always gets the right tool somewhere in the results; it just struggles to rank it first when the candidate pool is large.

What Hybrid Search Actually Looks Like

The term “hybrid search” gets thrown around loosely. Here is the concrete architecture that the benchmarks show works:

Hybrid search architecture: BM25 + embedding with reciprocal rank fusion

Step 1: Parallel Retrieval

The query runs simultaneously through two retrieval paths:

- BM25 keyword search over tool names and descriptions: exact-term matching at sub-millisecond latency.
- Semantic search: the query is embedded and compared against precomputed tool-description embeddings by vector similarity.

Each path produces its own ranked candidate list.

Step 2: Reciprocal Rank Fusion

The two ranked lists are merged using Reciprocal Rank Fusion (RRF), a simple but effective algorithm. For each tool, RRF computes a combined score:

RRF_score(tool) = 1/(k + rank_bm25) + 1/(k + rank_semantic)

where k is a constant (typically 60) that controls how much weight goes to top-ranked items. Tools that appear high in both lists get boosted; tools that rank well in only one list still appear but with lower combined scores.

RRF is appealing because it is score-agnostic — it works on rank positions, not raw scores. This matters because BM25 scores and cosine similarity scores are on completely different scales. Trying to normalize and weight-average them introduces tuning headaches. RRF sidesteps the problem entirely.
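The merge step is only a few lines. A minimal implementation that follows the formula above (k = 60, ranks 1-based; the tool names and rankings are illustrative):

```python
def rrf_merge(bm25_ranking: list[str], semantic_ranking: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked lists of tool names with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, semantic_ranking):
        for rank, tool in enumerate(ranking, start=1):
            # Score is purely rank-based; raw BM25/cosine scores never appear.
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings: the right tool is #2 in one list and #1 in the other.
bm25 = ["jira_create_issue", "github_create_pull_request", "slack_create_channel"]
semantic = ["github_create_pull_request", "gitlab_create_mr", "jira_create_issue"]

merged = rrf_merge(bm25, semantic)
print(merged[0])  # github_create_pull_request
```

Note that github_create_pull_request wins despite topping only one list: appearing near the top of both beats appearing first in one and absent from the other.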

Step 3: Optional Reranking

For the highest accuracy, a lightweight reranker can rescore the top-N results from the fusion step. StackOne’s benchmarks show reranking pushes top-1 accuracy to 40%+, but at significant latency cost (200-500ms). For most MCP tool discovery workloads, the RRF output without reranking is sufficient.

Why This Works So Well

The two retrieval paths complement each other precisely where each is weak:

- BM25 nails exact identifiers and rare keywords ("github", "jira", specific tool names) that embedding models can blur together.
- Semantic search bridges the vocabulary gaps ("notify the team" to send_message) where BM25 has no matching terms at all.

Stacklok’s 94% accuracy versus BM25’s 34% on the same 2,792-tool dataset is the strongest evidence: the combination is not just incrementally better — it is categorically better at scale.

The Ecosystem Is Converging on This

MCPProxy is not the only project rethinking tool discovery. The ecosystem is converging on a set of related ideas:

Claude Code’s deferred tool loading already implements a form of hierarchical discovery. With approximately 200 tools across 8 MCP services, Claude Code presents only top-level service groups in the system prompt. The model drills down on demand, loading individual tool schemas only when needed. This reduces context footprint from approximately 200 entries to approximately 8 — a different axis of the same problem.

GitHub’s MCP Server ships with configurable toolsets — static allow-lists that let administrators expose only the tool groups relevant to a given agent. When no toolsets are specified, defaults are used. This is configuration-time filtering rather than runtime discovery, but it addresses the same root cause: too many tools in the context window.

The MCP specification itself is evolving. Discussion in the community around hierarchical tool management reflects growing consensus that flat tool lists do not scale. Estimates suggest 50 tools consume 20-25K tokens; at 200+ tools, you are spending the majority of your context window on tool definitions before the conversation starts.

Philipp Schmid at Hugging Face has recommended “vector retrieval to dynamically select and present only the most relevant tools” — essentially describing the architecture MCPProxy is moving toward. The argument is straightforward: as the tool ecosystem grows from dozens to thousands of available MCP servers, static approaches break down.

Where BM25 Still Wins

Honesty demands acknowledging that hybrid search is not always the right answer. BM25 retains clear advantages in several scenarios:

Small-to-medium tool sets (under 100 tools). If you are connecting 5-15 MCP servers with a total of 30-100 tools, BM25’s 87% top-5 accuracy is excellent. The verb saturation problem barely exists at this scale. The zero-dependency, sub-millisecond profile makes it the obvious choice.

Air-gapped and offline environments. BM25 runs entirely in-process with no network calls. Embedding models require either an API call or a local model with its own memory and compute footprint. For environments where network access is restricted or latency to an embedding service is high, BM25 remains the most practical option.

Determinism and debuggability. BM25 scoring is fully transparent. You can inspect exactly why a tool ranked where it did — which terms matched, what their IDF weights were, how document length normalization affected the score. Embedding similarity scores are opaque by comparison. When a user reports that the wrong tool was selected, debugging BM25 is straightforward.
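That debuggability is concrete. A toy BM25 scorer (simplified formula, hypothetical three-tool corpus; not MCPProxy's actual indexer) can report exactly how much each query term contributed to a document's score:

```python
import math
from collections import Counter

def bm25_explain(query: str, doc: str, corpus: list[str],
                 k1: float = 1.2, b: float = 0.75) -> dict[str, float]:
    """Return the per-term BM25 score contribution for one document."""
    docs = [d.lower().split() for d in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    terms = doc.lower().split()
    tf = Counter(terms)
    contributions = {}
    for term in query.lower().split():
        df = sum(1 for d in docs if term in d)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        # Term-frequency saturation plus document-length normalization.
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(terms) / avg_len))
        contributions[term] = idf * norm
    return contributions

corpus = [
    "github create pull request",
    "jira create issue",
    "slack create channel",
]
# "create" appears in every tool, so it contributes almost nothing;
# "pull" does the actual ranking work.
print(bm25_explain("create pull", corpus[0], corpus))
```

An embedding model offers nothing comparable: a cosine similarity of 0.83 does not decompose into inspectable parts.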

Cold start speed. BM25 indexes are built instantly from tool metadata. Embedding every tool description requires either API calls (adding latency and cost) or running a local model (adding memory and compute requirements). In MCPProxy’s dynamic environment — where servers connect and disconnect, and tool lists change in real time — fast re-indexing matters.

Where BM25 excels vs where hybrid search wins as tool count scales

MCPProxy’s Roadmap: Hybrid Without Compromise

Here is how we are thinking about evolving MCPProxy’s tool discovery without sacrificing the properties that make it useful today.

Phase 1: Smarter BM25 (Now)

Before adding embedding infrastructure, there is significant headroom in the BM25 pipeline itself, starting with field weighting (boosting matches in tool names over matches in descriptions) and the lightweight TF-IDF/BM25 score blending that StackOne benchmarked.

These improvements target the gap between 14% and 21% top-1 accuracy — the range where StackOne showed that lightweight hybrid approaches (TF-IDF + BM25 weighting) deliver meaningful gains with zero infrastructure changes.
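Field weighting is the simplest of these levers: score the tool name and the description as separate fields, then boost name matches. A minimal sketch of the idea (the weights and per-field scores are illustrative, not MCPProxy's actual configuration):

```python
def weighted_score(name_score: float, desc_score: float,
                   name_weight: float = 2.0, desc_weight: float = 1.0) -> float:
    """Combine per-field BM25 scores so a hit in the tool name
    outranks a comparable hit buried in a description."""
    return name_weight * name_score + desc_weight * desc_score

# A name match beats a somewhat stronger description-only match.
print(weighted_score(1.0, 0.0) > weighted_score(0.0, 1.5))  # True
```

The intuition: a user typing "pull request" almost certainly wants the tool with those words in its name, not a tool that merely mentions pull requests in passing.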

Phase 2: Optional Embedding Layer

For users who need higher accuracy at scale, MCPProxy will offer an optional embedding path fused with the existing BM25 ranking via reciprocal rank fusion. Local embedding models keep the layer opt-in, with no API key required.

Phase 3: Hierarchical Discovery

For very large deployments (500+ tools, 50+ servers), hierarchical discovery reduces the search space before retrieval even begins: the agent first narrows the query to a server or category, then searches only within that server's tools.
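The two-stage idea can be sketched in a few lines. Here a crude keyword-overlap score stands in for the real BM25 index, and the server/tool data is hypothetical:

```python
def overlap(query: str, text: str) -> int:
    """Crude keyword overlap as a stand-in for a real BM25 score."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Stage 1 ranks *servers* by their aggregate descriptions;
# Stage 2 searches only within the winning server's tools.
servers = {
    "github": {"desc": "git pull request issue repository code review",
               "tools": {"github_create_pull_request": "create pull request",
                         "github_list_issues": "list repository issues"}},
    "slack": {"desc": "message channel notify team chat",
              "tools": {"slack_post_message": "post message to channel"}},
}

def hierarchical_search(query: str) -> str:
    best_server = max(servers, key=lambda s: overlap(query, servers[s]["desc"]))
    tools = servers[best_server]["tools"]
    return max(tools, key=lambda t: overlap(query, tools[t]))

print(hierarchical_search("create a pull request"))  # github_create_pull_request
```

With 50 servers of 40 tools each, stage 1 cuts the candidate pool from 2,000 tools to roughly 40 before any per-tool scoring happens, which is the same context-window arithmetic behind Claude Code's deferred loading.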

The Guiding Principle

Every phase maintains MCPProxy’s core contract: it ships as a single binary with zero required external dependencies. The embedding layer is opt-in. Hierarchical discovery uses local state. Nothing phones home. Nothing requires an API key to function. The simplest deployment — download the binary, point it at your MCP servers, and go — continues to work exactly as it does today.

What This Means for You

If you are running MCPProxy today with fewer than 100 tools, nothing changes. BM25 is serving you well, and the improvements in Phase 1 will make it even better.

If you are scaling toward hundreds or thousands of tools, the hybrid roadmap means you will not need to swap out MCPProxy for a different solution. The same proxy that handles your 20-tool setup today will handle your 2,000-tool setup tomorrow, with the same operational simplicity.

If you are evaluating tool discovery approaches, here is the honest summary:

| Your Scale | Recommended Approach | Expected Top-1 Accuracy |
|---|---|---|
| 10-50 tools | BM25 (MCPProxy default) | ~80-85% |
| 50-200 tools | BM25 with field weighting | ~60-70% |
| 200-500 tools | Hybrid BM25 + embedding | ~85-90% |
| 500+ tools | Hybrid + hierarchical discovery | ~90-94% |

Conclusion

The earlier BM25 post was not wrong — it was incomplete. BM25 is the right starting point for tool discovery: zero infrastructure, sub-millisecond latency, deterministic results, and strong accuracy for the tool counts most users have today. But the data is clear that BM25 alone does not scale to the hundreds-or-thousands-of-tools future the MCP ecosystem is heading toward.

The good news is that hybrid search is not an all-or-nothing proposition. BM25 and semantic search are complementary, not competing. Reciprocal Rank Fusion combines them cleanly. Local embedding models eliminate the infrastructure concerns we raised in the original post. And progressive, hierarchical discovery addresses the context-window problem from yet another angle.

MCPProxy started with BM25 because it was the right engineering choice for the constraints at hand. It is evolving toward hybrid search because the constraints are changing. The tool ecosystem is growing, the benchmarks are telling us where the limits are, and we would rather share that data honestly than pretend a single algorithm solves everything forever.

That is the engineering approach: measure, acknowledge the data, and build the next thing.


MCPProxy is open source and available at github.com/smart-mcp-proxy/mcpproxy-go. Install with brew install smart-mcp-proxy/tap/mcpproxy. Full documentation at mcpproxy.app.