<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://vllm-semantic-router.com/blog</id>
    <title>vLLM Semantic Router Blog</title>
    <updated>2026-03-25T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://vllm-semantic-router.com/blog"/>
    <subtitle>vLLM Semantic Router Blog</subtitle>
    <icon>https://vllm-semantic-router.com/img/vllm.png</icon>
    <entry>
        <title type="html"><![CDATA[Deploying vLLM Semantic Router on AMD Developer Cloud]]></title>
        <id>https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud</id>
        <link href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud"/>
        <updated>2026-03-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AMD Developer Cloud and vLLM Semantic Router overview]]></summary>
        <content type="html"><![CDATA[<div align="center"><p><img decoding="async" loading="lazy" alt="AMD Developer Cloud and vLLM Semantic Router overview" src="https://vllm-semantic-router.com/assets/images/amd-deploy-0-3b65e3f819ac9fb78f8f2b9d42a91e59.png" width="1671" height="940" class="img_ev3q"></p></div>
<p>Running <a href="https://vllm-semantic-router.com/" target="_blank" rel="noopener noreferrer" class="">vLLM Semantic Router</a> on AMD Developer Cloud is not just about bringing up one more inference endpoint. It is about turning that endpoint into a routed multi-tier system that can classify requests, choose a semantic lane, and make replay and Insights immediately useful.</p>
<p>This post walks through the practical path: start the ROCm backend on an AMD Developer Cloud instance, install vLLM-SR, import the reference profile, and validate the deployment end to end.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-vllm-semantic-router">What Is vLLM Semantic Router?<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#what-is-vllm-semantic-router" class="hash-link" aria-label="Direct link to What Is vLLM Semantic Router?" title="Direct link to What Is vLLM Semantic Router?" translate="no">​</a></h2>
<p>vLLM Semantic Router is the system intelligence layer for LLMs. It sits in front of model endpoints, reads each request before generation begins, extracts semantic signals, and decides what should happen next.</p>
<p>That makes it more than a cost-saving router. It is also a control layer for safety, privacy, and policy. The same routing system that sends simple work to cheaper lanes can also detect sensitive traffic, keep private requests on local infrastructure, apply security-oriented plugin chains, and reserve stronger models for tasks that actually need deeper reasoning.</p>
<p>This is what makes Semantic Router especially relevant for AMD deployments. It supports intelligent multi-model routing, privacy-first enterprise AI, and local-first personal AI in the same architecture. In practice, one system can decide when to optimize for cost, when to prioritize security or privacy, and when to keep a personal or sensitive workflow close to the user instead of treating every query the same way.</p>
<blockquote>
<p>Note: in this reference profile, aliases such as <code>google/gemini-3.1-pro</code>, <code>openai/gpt5.4</code>, and <code>anthropic/claude-opus-4.6</code> are logical routing tiers backed by the same ROCm Qwen deployment. They are not outbound calls to those vendor APIs.</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-the-signal-driven-architecture-works">How the Signal-Driven Architecture Works<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#how-the-signal-driven-architecture-works" class="hash-link" aria-label="Direct link to How the Signal-Driven Architecture Works" title="Direct link to How the Signal-Driven Architecture Works" translate="no">​</a></h2>
<p>The easiest way to understand vLLM Semantic Router is as a four-layer architecture:</p>
<ul>
<li class=""><strong>Signals</strong> are the raw observations extracted from each request. In this repository, the AMD profile uses signals such as <code>keyword</code>, <code>embedding</code>, <code>structure</code>, <code>fact_check</code>, <code>user_feedback</code>, <code>reask</code>, <code>language</code>, <code>domain</code>, <code>context</code>, and <code>complexity</code>.</li>
<li class=""><strong>Projections</strong> are the coordination layer. They take raw signal evidence and turn it into reusable routing outputs such as <code>balance_simple</code>, <code>balance_complex</code>, <code>balance_reasoning</code>, <code>verification_required</code>, or <code>urgency_elevated</code>.</li>
<li class=""><strong>Decisions</strong> are the policy layer. They combine signals and projection outputs into named routing outcomes such as <code>medium_code_general</code>, <code>reasoning_deep</code>, or <code>premium_legal</code>.</li>
<li class=""><strong>Models</strong> are the target lanes. Decisions point to logical models or aliases through <code>modelRefs</code>, while endpoint wiring, pricing, and backend references live in the provider model catalog.</li>
</ul>
<p>In other words, the runtime flow is:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token plain">User Request -&gt; Signals -&gt; Projections -&gt; Decisions -&gt; Model Alias -&gt; Backend Response</span><br></span></code></pre></div></div>
<p>This is why the system is more expressive than a simple classifier. A query does not have to be “just math” or “just code.” It can simultaneously look urgent, evidence-sensitive, short-context, Chinese-language, and correction-oriented, and the routing policy can respond to that richer state.</p>
<p><img decoding="async" loading="lazy" alt="Signal-driven architecture overview for vLLM Semantic Router" src="https://vllm-semantic-router.com/assets/images/amd-deploy-1-3584a814cf6ee5d6d3b409000a5080a8.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-you-will-deploy">What You Will Deploy<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#what-you-will-deploy" class="hash-link" aria-label="Direct link to What You Will Deploy" title="Direct link to What You Will Deploy" translate="no">​</a></h2>
<p>At a high level, this deployment consists of:</p>
<ul>
<li class="">One ROCm vLLM backend running <code>Qwen/Qwen3.5-122B-A10B-FP8</code></li>
<li class="">One vLLM Semantic Router instance in front of that backend</li>
<li class="">One reference routing profile from <code>deploy/recipes/balance.yaml</code></li>
<li class="">One dashboard for onboarding, replay inspection, playground testing, and Insights</li>
</ul>
<p>The reference alias layout is:</p>
<ul>
<li class=""><code>qwen/qwen3.5-rocm</code> for the SIMPLE lane</li>
<li class=""><code>google/gemini-2.5-flash-lite</code> for lower-cost verified explanation and correction tasks</li>
<li class=""><code>google/gemini-3.1-pro</code> for complex technical, deep reasoning, STEM, or health-sensitive tasks</li>
<li class=""><code>openai/gpt5.4</code> for narrow formal-proof escalation</li>
<li class=""><code>anthropic/claude-opus-4.6</code> for the premium legal lane</li>
</ul>
<p>Pricing in the profile is intentionally exaggerated so Insights can make tier differences and savings easy to see. It is a demo-friendly routing profile, not a mirror of vendor billing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-amd">Why This Matters for AMD<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#why-this-matters-for-amd" class="hash-link" aria-label="Direct link to Why This Matters for AMD" title="Direct link to Why This Matters for AMD" translate="no">​</a></h2>
<p>This architecture opens up a particularly interesting opportunity for AMD, because AMD hardware does not have to be framed as “just another accelerator target.” With Semantic Router in front of it, an AMD deployment can become the control plane for system intelligence.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-intelligent-routing-on-amd">1. Intelligent Routing on AMD<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#1-intelligent-routing-on-amd" class="hash-link" aria-label="Direct link to 1. Intelligent Routing on AMD" title="Direct link to 1. Intelligent Routing on AMD" translate="no">​</a></h3>
<p>The most immediate opportunity is intelligent routing. A single ROCm backend on AMD Developer Cloud can serve as the physical execution layer for multiple logical lanes. That means teams can prototype a Mixture-of-Models experience, cost-aware routing, replay-driven debugging, and tiered product behavior without first standing up a large multi-backend fleet.</p>
<p>In the AMD reference profile, the cheapest, medium, complex, reasoning, and premium lanes all resolve onto the same ROCm backend. The router still gives you differentiated behavior because the policy lives in signals, projections, and decisions, not only in the number of containers you run.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-privacy-routing-and-local-first-governance">2. Privacy Routing and Local-First Governance<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#2-privacy-routing-and-local-first-governance" class="hash-link" aria-label="Direct link to 2. Privacy Routing and Local-First Governance" title="Direct link to 2. Privacy Routing and Local-First Governance" translate="no">​</a></h3>
<p>The second opportunity is privacy routing: keeping PII, private code, internal documents, and suspicious prompts on a local lane, and escalating clearly non-sensitive reasoning work only when policy allows it. That pattern is especially meaningful on AMD because it supports a local-first deployment story: keep sensitive traffic on infrastructure you control, audit every decision, and make cloud escalation a governed exception instead of the default.</p>
<p>For enterprises, that means AMD-backed deployments can become the trusted default lane for internal copilots, regulated workloads, or hybrid private AI systems. For developers, it means privacy is not just a hosting choice; it becomes a routing policy.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-personal-ai-and-local-personal-agents">3. Personal AI and Local Personal Agents<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#3-personal-ai-and-local-personal-agents" class="hash-link" aria-label="Direct link to 3. Personal AI and Local Personal Agents" title="Direct link to 3. Personal AI and Local Personal Agents" translate="no">​</a></h3>
<p>The third opportunity is personal AI, such as deploying a personal model on AMD AI MAX+ hardware and connecting to external models as needed. Once routing, privacy, and reasoning are expressed as policy, an AMD-hosted stack can support assistants that feel more personal and more controlled. A personal AI system can keep ordinary tasks, memory-aware follow-ups, and private context on a local lane, while only escalating special cases when explicitly permitted.</p>
<p>That makes AMD interesting not only for enterprise infrastructure, but also for self-hosted assistants, home-lab AI, and local-first personal workflows. The important point is that Semantic Router lets the system distinguish between “keep this local,” “this is cheap and routine,” and “this needs deeper reasoning,” instead of treating all personal AI traffic as one undifferentiated workload.</p>
<p><img decoding="async" loading="lazy" alt="AMD deployment opportunities for routing, privacy, and personal AI" src="https://vllm-semantic-router.com/assets/images/amd-deploy-2-4e0caf3662bb6f8b548f7d064dc552fb.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>Before you begin, make sure your AMD Developer Cloud instance is ready with:</p>
<ul>
<li class="">A ROCm-capable AMD GPU instance</li>
<li class="">Docker installed and running</li>
<li class="">Access to <code>/dev/kfd</code> and <code>/dev/dri</code></li>
<li class="">A persistent Hugging Face cache path, if you want to avoid repeated model downloads</li>
</ul>
<p>Once you can SSH into the machine, you are ready to launch the backend.</p>
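<p>Before running the steps below, it can help to script the prerequisite checks. The following is a minimal sketch, not part of the official install flow; the cache path default mirrors the one used in the backend launch command, so adjust both to your instance:</p>

```shell
# Preflight checks for the AMD Developer Cloud instance. Prints one PASS/FAIL
# line per prerequisite instead of stopping at the first problem.
check() {
  # check <label> <command...> - run the command quietly, report the result
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
  fi
}

check "docker available"   command -v docker
check "/dev/kfd present"   test -e /dev/kfd
check "/dev/dri present"   test -e /dev/dri
check "HF cache dir"       test -d "${VLLM_HF_CACHE:-/mnt/data/huggingface-cache}"
```

If any line reports FAIL, fix that prerequisite before moving on to Step 1.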
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-1-create-the-shared-docker-network">Step 1: Create the Shared Docker Network<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#step-1-create-the-shared-docker-network" class="hash-link" aria-label="Direct link to Step 1: Create the Shared Docker Network" title="Direct link to Step 1: Create the Shared Docker Network" translate="no">​</a></h3>
<p>Create the network used by the reference deployment:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token function" style="color:rgb(220, 220, 170)">sudo</span><span class="token plain"> </span><span class="token function" style="color:rgb(220, 220, 170)">docker</span><span class="token plain"> network create vllm-sr-network </span><span class="token operator file-descriptor important" style="color:rgb(212, 212, 212)">2</span><span class="token operator" style="color:rgb(212, 212, 212)">&gt;</span><span class="token plain">/dev/null </span><span class="token operator" style="color:rgb(212, 212, 212)">||</span><span class="token plain"> </span><span class="token boolean">true</span><br></span></code></pre></div></div>
<p>This keeps the backend naming consistent with the reference profile, which expects the vLLM service at <code>vllm:8000</code>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-2-start-the-amd-rocm-vllm-backend">Step 2: Start the AMD ROCm vLLM Backend<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#step-2-start-the-amd-rocm-vllm-backend" class="hash-link" aria-label="Direct link to Step 2: Start the AMD ROCm vLLM Backend" title="Direct link to Step 2: Start the AMD ROCm vLLM Backend" translate="no">​</a></h3>
<p>Run the following command on your AMD Developer Cloud instance:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token function" style="color:rgb(220, 220, 170)">sudo</span><span class="token plain"> </span><span class="token function" style="color:rgb(220, 220, 170)">docker</span><span class="token plain"> run </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-d</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--name</span><span class="token plain"> vllm </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--network</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">vllm-sr-network </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--restart</span><span class="token plain"> unless-stopped </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token 
parameter variable" style="color:rgb(156, 220, 254)">-p</span><span class="token plain"> </span><span class="token string" style="color:rgb(206, 145, 120)">"</span><span class="token string variable" style="color:rgb(156, 220, 254)">${VLLM_PORT_122B</span><span class="token string variable operator" style="color:rgb(212, 212, 212)">:-</span><span class="token string variable" style="color:rgb(156, 220, 254)">8090}</span><span class="token string" style="color:rgb(206, 145, 120)">:8000"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-v</span><span class="token plain"> </span><span class="token string" style="color:rgb(206, 145, 120)">"</span><span class="token string variable" style="color:rgb(156, 220, 254)">${VLLM_HF_CACHE</span><span class="token string variable operator" style="color:rgb(212, 212, 212)">:-</span><span class="token string variable operator" style="color:rgb(212, 212, 212)">/</span><span class="token string variable" style="color:rgb(156, 220, 254)">mnt</span><span class="token string variable operator" style="color:rgb(212, 212, 212)">/</span><span class="token string variable" style="color:rgb(156, 220, 254)">data</span><span class="token string variable operator" style="color:rgb(212, 212, 212)">/</span><span class="token string variable" style="color:rgb(156, 220, 254)">huggingface-cache}</span><span class="token string" style="color:rgb(206, 145, 120)">:/root/.cache/huggingface"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 
254)">--device</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">/dev/kfd </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--device</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">/dev/dri </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  --group-add</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">video </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--ipc</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">host </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  --cap-add</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">SYS_PTRACE </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  --security-opt </span><span class="token assign-left variable" style="color:rgb(156, 220, 254)">seccomp</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token plain">unconfined </span><span class="token 
punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  --shm-size 32G </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-v</span><span class="token plain"> /data:/data </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-v</span><span class="token plain"> </span><span class="token string" style="color:rgb(206, 145, 120)">"</span><span class="token string environment constant" style="color:rgb(100, 102, 149)">$HOME</span><span class="token string" style="color:rgb(206, 145, 120)">:/myhome"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-w</span><span class="token plain"> /myhome </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-e</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(156, 220, 254)">VLLM_ROCM_USE_AITER</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token number" style="color:rgb(181, 206, 168)">1</span><span class="token 
plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-e</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(156, 220, 254)">VLLM_USE_AITER_UNIFIED_ATTENTION</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token number" style="color:rgb(181, 206, 168)">1</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-e</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(156, 220, 254)">VLLM_ROCM_USE_AITER_MHA</span><span class="token operator" style="color:rgb(212, 212, 212)">=</span><span class="token number" style="color:rgb(181, 206, 168)">0</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--entrypoint</span><span class="token plain"> python3 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  vllm/vllm-openai-rocm:v0.17.0 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token parameter 
variable" style="color:rgb(156, 220, 254)">-m</span><span class="token plain"> vllm.entrypoints.openai.api_server </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--model</span><span class="token plain"> Qwen/Qwen3.5-122B-A10B-FP8 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--host</span><span class="token plain"> </span><span class="token number" style="color:rgb(181, 206, 168)">0.0</span><span class="token plain">.0.0 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">--port</span><span class="token plain"> </span><span class="token number" style="color:rgb(181, 206, 168)">8000</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --enable-auto-tool-choice </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --tool-call-parser qwen3_coder </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --served-model-name 
qwen/qwen3.5-rocm google/gemini-2.5-flash-lite google/gemini-3.1-pro openai/gpt5.4 anthropic/claude-opus-4.6 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --trust-remote-code </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --reasoning-parser qwen3 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --max-model-len </span><span class="token number" style="color:rgb(181, 206, 168)">262144</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --language-model-only </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --max-num-seqs </span><span class="token number" style="color:rgb(181, 206, 168)">128</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --kv-cache-dtype fp8 </span><span class="token punctuation" style="color:rgb(212, 212, 212)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">    --gpu-memory-utilization </span><span class="token number" style="color:rgb(181, 206, 168)">0.85</span><br></span></code></pre></div></div>
<p>This is the core of the deployment. The backend is still one Qwen model, but it now exposes multiple served-model aliases that the router can target semantically.</p>
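<p>Once the container reports healthy, you can confirm that all five aliases are visible on the backend's OpenAI-compatible <code>/v1/models</code> endpoint. The helper below only parses the JSON response; the live <code>curl</code> line is left commented, and should be run from a host that can reach <code>vllm:8000</code> on the shared network (or via the published host port):</p>

```shell
# Extract served model ids from an OpenAI-style /v1/models response.
list_ids() {
  # reads the JSON document on stdin, prints one model id per line
  python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
}

# Against the live backend (expects the five served-model aliases from Step 2):
# curl -fsS http://vllm:8000/v1/models | list_ids
```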
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="install-vllm-semantic-router">Install vLLM Semantic Router<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#install-vllm-semantic-router" class="hash-link" aria-label="Direct link to Install vLLM Semantic Router" title="Direct link to Install vLLM Semantic Router" translate="no">​</a></h2>
<p>With the backend up, install vLLM Semantic Router:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token function" style="color:rgb(220, 220, 170)">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(156, 220, 254)">-fsSL</span><span class="token plain"> https://vllm-semantic-router.com/install.sh </span><span class="token operator" style="color:rgb(212, 212, 212)">|</span><span class="token plain"> </span><span class="token function" style="color:rgb(220, 220, 170)">bash</span><br></span></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="vLLM Semantic Router installation step" src="https://vllm-semantic-router.com/assets/images/amd-deploy-3-a9e13466291039fd01d002d87e189f8a.png" width="2152" height="2488" class="img_ev3q"></p>
<p>The router dashboard should then be available at:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token plain">http://&lt;your-server-ip&gt;:8700</span><br></span></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="vLLM Semantic Router dashboard onboarding" src="https://vllm-semantic-router.com/assets/images/amd-deploy-4-5efffe2b7b8363958eb2423399cc1fbc.png" width="3714" height="1916" class="img_ev3q"></p>
<p>Open the dashboard and complete onboarding.</p>
<p>When prompted to load a routing profile (you can skip the manual model configuration step), import the reference YAML directly from:</p>
<blockquote>
<p><code>https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/recipes/balance.yaml</code></p>
</blockquote>
<p>The remote import path applies the full YAML directly during onboarding. If you later inspect the same profile in the DSL editor, the routing surfaces decompile from <code>routing.modelCards</code>, <code>routing.signals</code>, <code>routing.projections</code>, and <code>routing.decisions</code>, while <code>providers</code> remains YAML-native.</p>
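<p>If you want a quick look at the profile before importing it, listing its top-level sections offline is enough to orient yourself. This is a rough sketch; the key names it expects (<code>routing</code>, <code>providers</code>) come from the description above, not from a schema guarantee:</p>

```shell
# List the unindented top-level keys of a YAML document read from stdin.
top_level_keys() {
  grep -E '^[A-Za-z_]+:' | sed 's/:.*//'
}

PROFILE_URL="https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/recipes/balance.yaml"
# curl -fsSL "$PROFILE_URL" | top_level_keys   # expect entries such as routing and providers
```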
<p><img decoding="async" loading="lazy" alt="Reference routing profile import in the dashboard" src="https://vllm-semantic-router.com/assets/images/amd-deploy-5-7402f7b28de5f2550503d1dbbd075991.png" width="3704" height="1922" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-reference-profile-is-doing">What the Reference Profile Is Doing<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#what-the-reference-profile-is-doing" class="hash-link" aria-label="Direct link to What the Reference Profile Is Doing" title="Direct link to What the Reference Profile Is Doing" translate="no">​</a></h2>
<p>The imported profile expresses a complete AMD routing story with 13 active decisions across:</p>
<ul>
<li class="">one premium legal lane</li>
<li class="">one reasoning lane for proofs, philosophy, and deep general reasoning</li>
<li class="">one complex specialist lane for multi-step execution, systems design, and specialist STEM work</li>
<li class="">two feedback recovery lanes</li>
<li class="">two verified overlays</li>
<li class="">three medium-cost lanes</li>
<li class="">one fast factual lane</li>
<li class="">one simple general lane</li>
<li class="">one terminal casual fallback lane</li>
</ul>
<p>This is useful because replay and Insights stay signal-native. Instead of inventing a separate runtime dimension schema, the system shows what actually happened during routing: which signals matched, which projection outputs fired, which decision won, and which alias received the request.</p>
<p>Two intentionally conservative paths in the profile are worth calling out:</p>
<ul>
<li class=""><code>fact_check</code> overlays only escalate when verification pressure is strong and the prompt gives explicit confirmation cues</li>
<li class=""><code>user_feedback</code> recovery lanes require literal correction or clarification signals instead of broadly capturing all follow-up traffic</li>
</ul>
<p>That makes the profile easier to reason about when you are testing routing behavior on a single backend.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="validate-the-deployment-in-the-playground">Validate the Deployment in the Playground<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#validate-the-deployment-in-the-playground" class="hash-link" aria-label="Direct link to Validate the Deployment in the Playground" title="Direct link to Validate the Deployment in the Playground" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Playground view for validating routing behavior" src="https://vllm-semantic-router.com/assets/images/amd-deploy-6-553916c82485e6b2781cf7185f7c744b.png" width="3708" height="1924" class="img_ev3q"></p>
<p>Once onboarding is complete, the fastest way to validate the system is through the dashboard playground. Try a few prompts that represent different routing tiers:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="coding-help">Coding Help<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#coding-help" class="hash-link" aria-label="Direct link to Coding Help" title="Direct link to Coding Help" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token plain">Debug this Python stack trace and suggest the most likely fix.</span><br></span></code></pre></div></div>
<p>This should land on the cheaper coding lane backed by <code>qwen/qwen3.5-rocm</code>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="formal-math-proof">Formal Math Proof<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#formal-math-proof" class="hash-link" aria-label="Direct link to Formal Math Proof" title="Direct link to Formal Math Proof" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token plain">Prove rigorously that the square root of 2 is irrational.</span><br></span></code></pre></div></div>
<p>This should hit the narrow formal-proof overlay and map to the <code>openai/gpt5.4</code> alias.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="premium-legal-analysis">Premium Legal Analysis<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#premium-legal-analysis" class="hash-link" aria-label="Direct link to Premium Legal Analysis" title="Direct link to Premium Legal Analysis" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token plain">Provide a legal analysis of the indemnity clause, liability cap, and compliance obligations in this contract.</span><br></span></code></pre></div></div>
<p>This should match the premium legal lane and forward to <code>anthropic/claude-opus-4.6</code>.</p>
<p><img decoding="async" loading="lazy" alt="Prompt example for premium legal routing" src="https://vllm-semantic-router.com/assets/images/amd-deploy-7-01d34d6578c2f10dae100ebd8a211f93.png" width="1762" height="1988" class="img_ev3q"></p>
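<p>Beyond the playground UI, the same validation can be scripted against the router's OpenAI-compatible endpoint. The sketch below only constructs the request body; the endpoint path, port, and the <code>"model": "auto"</code> convention for automatic routing are assumptions to adjust for your deployment:</p>

```python
import json

# Build a chat-completions request body for one of the playground prompts.
# "auto" is assumed here to ask the router to pick the lane itself.
payload = {
    "model": "auto",
    "messages": [
        {
            "role": "user",
            "content": "Debug this Python stack trace and suggest the most likely fix.",
        }
    ],
}
body = json.dumps(payload)
print(body)

# Send it with, for example (adjust host and port to your deployment):
#   curl -X POST http://<your-server-ip>:<router-port>/v1/chat/completions \
#        -H "Content-Type: application/json" -d @- <<< "$BODY"
```

<p>If routing is working, the response's served model should match the lane you expect for the prompt, mirroring what the playground shows.</p>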
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observe-the-routing-behavior-in-insights">Observe the Routing Behavior in Insights<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#observe-the-routing-behavior-in-insights" class="hash-link" aria-label="Direct link to Observe the Routing Behavior in Insights" title="Direct link to Observe the Routing Behavior in Insights" translate="no">​</a></h2>
<p>You can also inspect the routing behavior in Insights. The reference profile includes replay, so you can see what actually happened during routing, as well as how much money you saved by routing to the cheaper lanes.</p>
<p><img decoding="async" loading="lazy" alt="Insights view showing routing behavior and savings" src="https://vllm-semantic-router.com/assets/images/amd-deploy-8-7440eb7a7d9a56562ecc291fdf4dc7c5.png" width="2954" height="1768" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="test-the-brain-topology">Test the Brain Topology<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#test-the-brain-topology" class="hash-link" aria-label="Direct link to Test the Brain Topology" title="Direct link to Test the Brain Topology" translate="no">​</a></h2>
<p>The router dashboard also includes a brain topology view that shows the high-level structure of the routing graph. This is useful for understanding the overall shape of the policy and how different decisions are connected. You can also test a prompt directly to see its activation path.</p>
<p><img decoding="async" loading="lazy" alt="Brain topology view of the routing graph" src="https://vllm-semantic-router.com/assets/images/amd-deploy-9-4c47d2c0c8b247420263793d5948811f.png" width="3424" height="1996" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="design-your-own-routing-dsl">Design Your Own Routing DSL<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#design-your-own-routing-dsl" class="hash-link" aria-label="Direct link to Design Your Own Routing DSL" title="Direct link to Design Your Own Routing DSL" translate="no">​</a></h2>
<p>The dashboard also includes a full DSL editor that lets you design your own routing policy. The reference profile is a good starting point, but you can also use the editor to try out different ideas.</p>
<p><img decoding="async" loading="lazy" alt="DSL editor for designing routing policy" src="https://vllm-semantic-router.com/assets/images/amd-deploy-10-0cc1b45c201578796472fe52b4b24337.png" width="3442" height="1996" class="img_ev3q"></p>
<p>Within a single route, you can also compose complex boolean expressions to express very precise routing policy.</p>
<p><img decoding="async" loading="lazy" alt="Complex boolean routing expression in the DSL editor" src="https://vllm-semantic-router.com/assets/images/amd-deploy-11-f881770c10464218500d570dff7a364d.png" width="3428" height="1990" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final Thoughts<a href="https://vllm-semantic-router.com/blog/vllm-sr-on-amd-developer-cloud#final-thoughts" class="hash-link" aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts" translate="no">​</a></h2>
<p>Deploying vLLM Semantic Router on AMD Developer Cloud gives you more than a working endpoint. It gives you a compact routed system: one or more ROCm-hosted backends, multiple semantic tiers, visible routing logic, and a dashboard experience that makes the behavior understandable instead of opaque.</p>
<p>That is what makes this reference profile useful. You can start with a single real AMD backend, import a complete routing policy, inspect how decisions are made, and then iterate from there without first building a large multi-backend fleet. For teams exploring cost-aware routing, replay-driven debugging, or AMD-based MoM patterns, it is a practical and reproducible starting point.</p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="amd" term="amd"/>
        <category label="rocm" term="rocm"/>
        <category label="deployment" term="deployment"/>
        <category label="hardware" term="hardware"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[v0.3 Themis Roadmap: Stability at Scale]]></title>
        <id>https://vllm-semantic-router.com/blog/v0-3-themis-roadmap</id>
        <link href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap"/>
        <updated>2026-03-12T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[v0.3, codename Themis, is our production-readiness release for Semantic Router. The theme is simple: Stability at Scale. After Athena expanded the system brain, Themis is the release where we make that intelligence dependable across real environments, clearer to operate, and safer to ship into production.]]></summary>
        <content type="html"><![CDATA[<p>v0.3, codename <strong>Themis</strong>, is our production-readiness release for Semantic Router. The theme is simple: <strong>Stability at Scale</strong>. After Athena expanded the system brain, Themis is the release where we make that intelligence dependable across real environments, clearer to operate, and safer to ship into production.</p>
<p>This roadmap is not just about adding more capability. It is about making the full system coherent: one stable contract across Docker and Kubernetes, one cleaner deployment path, one real version story for images and packages, stronger performance validation on both NVIDIA and AMD, and a research track that directly improves the product instead of sitting outside it.</p>
<p><img decoding="async" loading="lazy" alt="img" src="https://vllm-semantic-router.com/assets/images/themis-a75f76291ae109e0a847264062bc7343.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-themis">Why Themis<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#why-themis" class="hash-link" aria-label="Direct link to Why Themis" title="Direct link to Why Themis" translate="no">​</a></h2>
<p>Themis is the Greek figure of order, rules, and judgment. That fits this release better than a speed-oriented or purely routing-oriented codename. Themis is where Semantic Router starts acting less like a promising set of powerful building blocks and more like a platform with stable contracts, repeatable operations, and enforceable guardrails.</p>
<p>The current v0.3 milestone reflects that shift. It includes the new workstreams opened specifically for Themis, but it also folds in existing issues around protocol compatibility, session affinity, memory hardening, dashboard state, observability, security, and API standardization. This release is not a narrow feature sprint. It is a system-shaping release.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-stable-api-config-and-deployment-contracts">1. Stable API, config, and deployment contracts<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#1-stable-api-config-and-deployment-contracts" class="hash-link" aria-label="Direct link to 1. Stable API, config, and deployment contracts" title="Direct link to 1. Stable API, config, and deployment contracts" translate="no">​</a></h2>
<p>The highest-priority theme in Themis is eliminating contract drift across environments. Today, router behavior, Helm-facing config, dashboard flows, and the Python CLI still expose differences that create friction for operators. Themis is where we narrow those seams.</p>
<p><img decoding="async" loading="lazy" alt="img" src="https://vllm-semantic-router.com/assets/images/api-1a28f02e5bc33aa53f1c02d5066b1958.png" width="1536" height="1024" class="img_ev3q"></p>
<p>At the center of that work is a canonical API and config contract across router, CLI, dashboard, and Kubernetes. The goal is simple: after this release, a user should not have to mentally maintain one configuration model for local Docker workflows and another for Kubernetes deployment. This is the core of <a href="https://github.com/vllm-project/semantic-router/issues/1505" target="_blank" rel="noopener noreferrer" class="">#1505</a>.</p>
<p>That contract work also includes the deployment entry point itself. The <code>vllm-sr</code> CLI should become the normal path for standing up both Docker and Kubernetes environments, instead of being treated as a local-only helper while Helm and other deployment paths evolve separately. That is the focus of <a href="https://github.com/vllm-project/semantic-router/issues/1507" target="_blank" rel="noopener noreferrer" class="">#1507</a>.</p>
<p>We also want the runtime topology to become easier to reason about. Themis moves toward a router-focused <code>vllm-sr</code> image, with external services such as dashboard, Envoy, and persistence components split out more cleanly. This keeps the main runtime narrower and makes upgrades, debugging, and composition less fragile. That work is tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1508" target="_blank" rel="noopener noreferrer" class="">#1508</a>.</p>
<p>This same contract cleanup extends to protocol compatibility. Themis already includes work to support first-class OpenAI and Anthropic API entry points, align API definitions with official SDKs, and reduce homegrown JSON struct drift across the codebase. Those concerns now live in <a href="https://github.com/vllm-project/semantic-router/issues/1517" target="_blank" rel="noopener noreferrer" class="">#1517</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1404" target="_blank" rel="noopener noreferrer" class="">#1404</a>, and <a href="https://github.com/vllm-project/semantic-router/issues/1217" target="_blank" rel="noopener noreferrer" class="">#1217</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-stable-versions-upgrades-and-production-operations">2. Stable versions, upgrades, and production operations<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#2-stable-versions-upgrades-and-production-operations" class="hash-link" aria-label="Direct link to 2. Stable versions, upgrades, and production operations" title="Direct link to 2. Stable versions, upgrades, and production operations" translate="no">​</a></h2>
<p>Themis is also the release where we stop treating <code>latest</code> as a deployment strategy. Production users need to know what they are running, how they upgrade, how they roll back, and what guarantees exist between images, packages, and charts. That operational maturity is the purpose of <a href="https://github.com/vllm-project/semantic-router/issues/1506" target="_blank" rel="noopener noreferrer" class="">#1506</a>.</p>
<p>This means introducing explicit version channels such as nightly and tagged releases, carrying versioned images and packages through the stack, and documenting a full upgrade and rollback flow instead of assuming rebuild-and-redeploy. A stable version story is part of stability at scale, not an afterthought to it.</p>
<p>Operational stability also depends on where state lives. Dashboard behavior today still depends too heavily on in-memory state for workflows that should survive restarts, scale-outs, and multi-user operation. Themis moves those operationally important pieces into a database-backed control plane, tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1509" target="_blank" rel="noopener noreferrer" class="">#1509</a>.</p>
<p>As milestone triage has progressed, this operations theme has also pulled in related issues around docs and environment correctness, especially where deployment docs, API expectations, and runtime behavior need to converge before we can credibly call the surface stable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-performance-at-scale-on-real-hardware">3. Performance at scale on real hardware<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#3-performance-at-scale-on-real-hardware" class="hash-link" aria-label="Direct link to 3. Performance at scale on real hardware" title="Direct link to 3. Performance at scale on real hardware" translate="no">​</a></h2>
<p>Themis is not only about control-plane cleanup. It is also about making sure the router and its supporting model stack behave well under real load, across real backends, on real platforms. That is the purpose of <a href="https://github.com/vllm-project/semantic-router/issues/1510" target="_blank" rel="noopener noreferrer" class="">#1510</a>.</p>
<p>We want broader large-scale regression coverage across Candle, ONNX, and related runtime paths, with repeatable performance baselines for both NVIDIA and AMD. This matters because Semantic Router is increasingly expected to sit in front of more heterogeneous workloads: more model families, more protocol paths, more multi-component deployments, and more memory-heavy workflows.</p>
<p>This performance theme is also tied to product credibility. If we claim the platform is ready for production routing, then we need more than point optimizations. We need performance tests that survive release-to-release, platform-to-platform, and topology-to-topology changes.</p>
<p>That same bar increasingly applies to higher-level agent surfaces such as ClawOS. If model routing, memory, and tool execution are going to be orchestrated in room-based agent workflows, then performance and runtime visibility have to scale there too.</p>
<p><img decoding="async" loading="lazy" alt="img" src="https://vllm-semantic-router.com/assets/images/research-d86e254745cd263ab3e50eb677b9824e.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-research-that-feeds-the-product">4. Research that feeds the product<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#4-research-that-feeds-the-product" class="hash-link" aria-label="Direct link to 4. Research that feeds the product" title="Direct link to 4. Research that feeds the product" translate="no">​</a></h2>
<p>Themis still includes research-heavy work, and it should. But the research in this milestone is there because it improves the production system, not because we are parking speculative ideas in the roadmap.</p>
<p>The first track is <strong>NL-to-DSL authoring</strong> in the dashboard, tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1511" target="_blank" rel="noopener noreferrer" class="">#1511</a>. The goal is to let users express routing intent in natural language and generate a usable DSL scaffold instead of forcing every workflow through fully manual route authoring.</p>
<p>The second track is a <strong>feedback loop for generated DSL</strong>, tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1512" target="_blank" rel="noopener noreferrer" class="">#1512</a>. Generated routing logic becomes much more useful when it can learn from real request history, observed routing outcomes, and user feedback, instead of acting like a one-shot assistant.</p>
<p>The third track is <strong>multi-turn session affinity</strong>, tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1513" target="_blank" rel="noopener noreferrer" class="">#1513</a> and reinforced by the older conversation-stability issue <a href="https://github.com/vllm-project/semantic-router/issues/1439" target="_blank" rel="noopener noreferrer" class="">#1439</a>. This is one of the clearest examples of research feeding production directly: without stable session affinity, routed multi-turn conversations can bounce between models and degrade user experience even if each single-turn decision looks correct.</p>
<p>There is also research around <strong>model legitimacy and selection quality</strong>, including <a href="https://github.com/vllm-project/semantic-router/issues/1422" target="_blank" rel="noopener noreferrer" class="">#1422</a> and <a href="https://github.com/vllm-project/semantic-router/issues/1514" target="_blank" rel="noopener noreferrer" class="">#1514</a>. This line of work matters because model selection is only useful in production when it is trustworthy, inspectable, and not dependent on fragile external-only components. Themis should move that work closer to something operators can actually rely on.</p>
<p>ClawOS does have a genuine research component here, but it is specifically the context question. <a href="https://github.com/vllm-project/semantic-router/issues/1522" target="_blank" rel="noopener noreferrer" class="">#1522</a> is about studying context-management patterns and OpenClaw best practices so long-running, tool-rich, room-based workflows have a clearer operating model.</p>
<p>In that sense, the research section of Themis is really about system intelligence: generating better routing logic, improving it continuously, keeping conversations stable across turns, and making model-selection decisions more defensible.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-hardening-the-current-product-surface">5. Hardening the current product surface<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#5-hardening-the-current-product-surface" class="hash-link" aria-label="Direct link to 5. Hardening the current product surface" title="Direct link to 5. Hardening the current product surface" translate="no">​</a></h2>
<p>Themis also has a large body of work that is less glamorous than new intelligence features, but just as important for adoption.</p>
<p>Model selection needs to become more usable without external-service-only dependencies, which is the focus of <a href="https://github.com/vllm-project/semantic-router/issues/1514" target="_blank" rel="noopener noreferrer" class="">#1514</a>. Eval workflows need to be revisited so system eval and signal eval are first-class and stable inside the dashboard, tracked in <a href="https://github.com/vllm-project/semantic-router/issues/1515" target="_blank" rel="noopener noreferrer" class="">#1515</a>.</p>
<p>RAG and memory workflows also need to become more production-friendly. That includes the main hardening track in <a href="https://github.com/vllm-project/semantic-router/issues/1516" target="_blank" rel="noopener noreferrer" class="">#1516</a>, plus milestone issues already folded in around memory evolution such as <a href="https://github.com/vllm-project/semantic-router/issues/1293" target="_blank" rel="noopener noreferrer" class="">#1293</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1287" target="_blank" rel="noopener noreferrer" class="">#1287</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1289" target="_blank" rel="noopener noreferrer" class="">#1289</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1350" target="_blank" rel="noopener noreferrer" class="">#1350</a>, and <a href="https://github.com/vllm-project/semantic-router/issues/1353" target="_blank" rel="noopener noreferrer" class="">#1353</a>.</p>
<p>ClawOS also belongs in this product-hardening bucket. <a href="https://github.com/vllm-project/semantic-router/issues/1521" target="_blank" rel="noopener noreferrer" class="">#1521</a> is not a research item; it is about making collaborative rooms work as a first-class product surface through Matrix-style full WebSocket communication between rooms and participants.</p>
<p>This is also where protocol polish and dashboard usability meet. The goal is not just to have more capability on paper, but to make those capabilities easier to operate in the dashboard, easier to expose consistently through APIs, and easier to validate end to end.</p>
<p><img decoding="async" loading="lazy" alt="img" src="https://vllm-semantic-router.com/assets/images/clawos-bbe7d0f4720fd0f28d036dc51421d89d.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-security-and-quality-closure">6. Security and quality closure<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#6-security-and-quality-closure" class="hash-link" aria-label="Direct link to 6. Security and quality closure" title="Direct link to 6. Security and quality closure" translate="no">​</a></h2>
<p>Themis is also where we close the operational gaps that would block serious production adoption. That starts with the main security and RBAC workstream in <a href="https://github.com/vllm-project/semantic-router/issues/1518" target="_blank" rel="noopener noreferrer" class="">#1518</a>, but it is reinforced by several already-folded issues that expose concrete weaknesses in the current surface.</p>
<p>That includes security issues such as <a href="https://github.com/vllm-project/semantic-router/issues/1443" target="_blank" rel="noopener noreferrer" class="">#1443</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1445" target="_blank" rel="noopener noreferrer" class="">#1445</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1447" target="_blank" rel="noopener noreferrer" class="">#1447</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1448" target="_blank" rel="noopener noreferrer" class="">#1448</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1452" target="_blank" rel="noopener noreferrer" class="">#1452</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1454" target="_blank" rel="noopener noreferrer" class="">#1454</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1456" target="_blank" rel="noopener noreferrer" class="">#1456</a>, and <a href="https://github.com/vllm-project/semantic-router/issues/1458" target="_blank" rel="noopener noreferrer" class="">#1458</a>. These are exactly the kinds of issues that justify the Themis theme: if the platform is going to be production-ready, the security model has to be explicit and closed-loop.</p>
<p>Quality also means broader E2E coverage. The main expansion item is <a href="https://github.com/vllm-project/semantic-router/issues/1519" target="_blank" rel="noopener noreferrer" class="">#1519</a>, but related milestone issues such as <a href="https://github.com/vllm-project/semantic-router/issues/1295" target="_blank" rel="noopener noreferrer" class="">#1295</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1432" target="_blank" rel="noopener noreferrer" class="">#1432</a>, <a href="https://github.com/vllm-project/semantic-router/issues/1501" target="_blank" rel="noopener noreferrer" class="">#1501</a>, and <a href="https://github.com/vllm-project/semantic-router/issues/1083" target="_blank" rel="noopener noreferrer" class="">#1083</a> show the same pattern: production hardening requires better system-level tests, better observability, and fewer hidden assumptions.</p>
<p>That broader observability push now also includes ClawOS-specific visibility into model and tool behavior through <a href="https://github.com/vllm-project/semantic-router/issues/1523" target="_blank" rel="noopener noreferrer" class="">#1523</a>, so agentic workflows are not left outside the production-debugging story.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-success-looks-like">What success looks like<a href="https://vllm-semantic-router.com/blog/v0-3-themis-roadmap#what-success-looks-like" class="hash-link" aria-label="Direct link to What success looks like" title="Direct link to What success looks like" translate="no">​</a></h2>
<p>If Themis is successful, Semantic Router should feel materially different to deploy and operate:</p>
<ul>
<li class="">API and config behavior should be much more consistent across Docker, Kubernetes, CLI, and dashboard workflows</li>
<li class="">release channels, upgrades, and rollbacks should be explicit rather than implicit</li>
<li class="">performance claims should be backed by repeatable NVIDIA and AMD validation</li>
<li class="">research work should show up as product intelligence, especially in DSL generation, feedback loops, session affinity, ClawOS context management, and better model selection</li>
<li class="">memory, eval, protocol compatibility, and dashboard state should look more like stable platform features than experimental edges</li>
<li class="">security, RBAC, observability, and E2E coverage should be strong enough that production users can trust the platform boundary</li>
</ul>
<p>Themis is therefore less about one headline feature and more about making the whole system hold together under real use.</p>
<p>For the active implementation tracker, see <a href="https://github.com/vllm-project/semantic-router/milestone/4" target="_blank" rel="noopener noreferrer" class="">v0.3 - Themis: Stability at Scale milestone</a> and <a href="https://github.com/vllm-project/semantic-router/issues/1520" target="_blank" rel="noopener noreferrer" class="">issue #1520</a>.</p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <category label="roadmap" term="roadmap"/>
        <category label="themis" term="themis"/>
        <category label="v0.3" term="v0.3"/>
        <category label="stability" term="stability"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain]]></title>
        <id>https://vllm-semantic-router.com/blog/v0-2-athena-release</id>
        <link href="https://vllm-semantic-router.com/blog/v0-2-athena-release"/>
        <updated>2026-03-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[athena-release]]></summary>
        <content type="html"><![CDATA[<div align="center"><p><img decoding="async" loading="lazy" alt="athena-release" src="https://vllm-semantic-router.com/assets/images/athena-0-94cbe781b113b11a335dc64aaeea2c05.png" width="1536" height="1024" class="img_ev3q"></p></div>
<p>Athena is the first major hardening step after Iris. It refreshes the model stack, extends routing into safety and semantic control, and starts shaping the system brain needed to make Semantic Router easier to govern, operate, and scale in real deployments.</p>
<p>Synced from official vLLM Blog: <a href="https://vllm.ai/blog/v0.2-vllm-sr-athena-release" target="_blank" rel="noopener noreferrer" class="">vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain</a></p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <category label="release" term="release"/>
        <category label="athena" term="athena"/>
        <category label="v0.2" term="v0.2"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Building Mixture-of-Models on AMD GPUs with vLLM-SR]]></title>
        <id>https://vllm-semantic-router.com/blog/mom-on-amd-gpu</id>
        <link href="https://vllm-semantic-router.com/blog/mom-on-amd-gpu"/>
        <updated>2026-01-23T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[mom-on-amd]]></summary>
        <content type="html"><![CDATA[<div align="center"><p><img decoding="async" loading="lazy" alt="mom-on-amd" src="https://vllm-semantic-router.com/assets/images/mom-3-c9d43021866d493a9c0e9106043ebce8.png" width="1536" height="1024" class="img_ev3q"></p></div>
<p>Building Mixture-of-Models on AMD GPUs is not just about serving one more model on one more device. It is about turning routing, governance, and inference into a coordinated system so MoM workloads can run efficiently on AMD hardware at production scale.</p>
<p>Synced from official vLLM Blog: <a href="https://vllm.ai/blog/mom-on-amd-gpu" target="_blank" rel="noopener noreferrer" class="">Building Mixture-of-Models on AMD GPUs with vLLM-SR</a></p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="amd" term="amd"/>
        <category label="mom" term="mom"/>
        <category label="hardware" term="hardware"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[vLLM Semantic Router v0.1 Iris: The First Major Release]]></title>
        <id>https://vllm-semantic-router.com/blog/vllm-sr-iris</id>
        <link href="https://vllm-semantic-router.com/blog/vllm-sr-iris"/>
        <updated>2026-01-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide.]]></summary>
        <content type="html"><![CDATA[<p>We are thrilled to announce the release of <strong>vLLM Semantic Router v0.1, codename Iris</strong>—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide.</p>
<p>In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.</p>
<p>Synced from official vLLM Blog: <a href="https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html" target="_blank" rel="noopener noreferrer" class="">vLLM Semantic Router v0.1 Iris: The First Major Release</a></p>
<p><img decoding="async" loading="lazy" alt="banner" src="https://vllm-semantic-router.com/assets/images/iris-0-60d67a48d42e559b7d0b756062c33120.png" width="1536" height="1024" class="img_ev3q"></p>
<hr>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="release" term="release"/>
        <category label="v0.1" term="v0.1"/>
        <category label="iris" term="iris"/>
        <category label="announcement" term="announcement"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[AMD × vLLM Semantic Router: Building the System Intelligence Together]]></title>
        <id>https://vllm-semantic-router.com/blog/vllm-sr-amd</id>
        <link href="https://vllm-semantic-router.com/blog/vllm-sr-amd"/>
        <updated>2025-12-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.]]></summary>
        <content type="html"><![CDATA[<p>Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.</p>
<p>AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: intelligent routing and governance for Mixture-of-Models (MoM) systems.</p>
<p>Synced from official vLLM Blog: <a href="https://blog.vllm.ai/2025/12/16/vllm-sr-amd.html" target="_blank" rel="noopener noreferrer" class="">AMD × vLLM Semantic Router: Building the System Intelligence Together</a></p>
<div align="center"><p><img decoding="async" loading="lazy" alt="banner" src="https://vllm-semantic-router.com/assets/images/amd-0-a7a699a8bab0028d51464c9f8bad4eec.png" width="1024" height="528" class="img_ev3q"></p></div>
<hr>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="amd" term="amd"/>
        <category label="collaboration" term="collaboration"/>
        <category label="hardware" term="hardware"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Token-Level Truth: Real-Time Hallucination Detection for Production LLMs]]></title>
        <id>https://vllm-semantic-router.com/blog/halugate</id>
        <link href="https://vllm-semantic-router.com/blog/halugate"/>
        <updated>2025-12-14T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.]]></summary>
        <content type="html"><![CDATA[<p>Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.</p>
<p>Building on our Signal-Decision Architecture, we introduce <strong>HaluGate</strong>—a conditional, token-level hallucination detection pipeline that catches unsupported claims before they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery.</p>
<p>Synced from official vLLM Blog: <a href="https://blog.vllm.ai/2025/12/14/halugate.html" target="_blank" rel="noopener noreferrer" class="">Token-Level Truth: Real-Time Hallucination Detection for Production LLMs</a></p>
<p><img decoding="async" loading="lazy" alt="banner" src="https://vllm-semantic-router.com/assets/images/halugate-0-2b826462ba0cecb0536a823c8a44f842.png" width="4320" height="3005" class="img_ev3q"></p>
<hr>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <category label="hallucination" term="hallucination"/>
        <category label="halugate" term="halugate"/>
        <category label="safety" term="safety"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale]]></title>
        <id>https://vllm-semantic-router.com/blog/signal-decision</id>
        <link href="https://vllm-semantic-router.com/blog/signal-decision"/>
        <updated>2025-11-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then routed to corresponding models. While this worked for basic scenarios, we quickly discovered its limitations when building production AI systems for enterprises.]]></summary>
        <content type="html"><![CDATA[<p>The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then routed to corresponding models. While this worked for basic scenarios, we quickly discovered its limitations when building production AI systems for enterprises.</p>
<p>Synced from official vLLM Blog: <a href="https://blog.vllm.ai/2025/11/19/signal-decision.html" target="_blank" rel="noopener noreferrer" class="">Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale</a></p>
<p><img decoding="async" loading="lazy" alt="banner" src="https://vllm-semantic-router.com/assets/images/signal-0-845c47c09642289ee8658dfbe3254643.png" width="1536" height="1024" class="img_ev3q"></p>
<hr>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="architecture" term="architecture"/>
        <category label="signal-decision" term="signal-decision"/>
        <category label="routing" term="routing"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Semantic Tool Selection: Building Smarter AI Agents with Context-Aware Routing]]></title>
        <id>https://vllm-semantic-router.com/blog/semantic-tool-selection</id>
        <link href="https://vllm-semantic-router.com/blog/semantic-tool-selection"/>
        <updated>2025-11-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Anthropic recently published an insightful blog post on code execution with MCP, highlighting a critical challenge in modern AI systems: as agents connect to more tools, loading all tool definitions upfront becomes increasingly inefficient. Their solution—using code execution to load tools on-demand—demonstrates how established software engineering patterns can dramatically improve agent efficiency.]]></summary>
        <content type="html"><![CDATA[<p>Anthropic recently published an insightful <a href="https://www.anthropic.com/engineering/code-execution-with-mcp" target="_blank" rel="noopener noreferrer" class="">blog post on code execution with MCP</a>, highlighting a critical challenge in modern AI systems: <strong>as agents connect to more tools, loading all tool definitions upfront becomes increasingly inefficient</strong>. Their solution—using code execution to load tools on-demand—demonstrates how established software engineering patterns can dramatically improve agent efficiency.</p>
<p>This resonates deeply with our experience building the vLLM Semantic Router. We've observed the same problem from a different angle: when AI agents have access to hundreds or thousands of tools, <strong>how do they know which tools are relevant for a given task?</strong></p>
<p>Our solution: <strong>semantic tool selection</strong>—using semantic similarity to automatically select the most relevant tools for each user query before the request even reaches the LLM.</p>
<p><img decoding="async" loading="lazy" alt="tools" src="https://vllm-semantic-router.com/assets/images/tools-4f072423dcadcc0af2556bf31a25be4e.png" width="2808" height="1688" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-tool-overload-in-ai-agents">The Problem: Tool Overload in AI Agents<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#the-problem-tool-overload-in-ai-agents" class="hash-link" aria-label="Direct link to The Problem: Tool Overload in AI Agents" title="Direct link to The Problem: Tool Overload in AI Agents" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="context-window-bloat">Context Window Bloat<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#context-window-bloat" class="hash-link" aria-label="Direct link to Context Window Bloat" title="Direct link to Context Window Bloat" translate="no">​</a></h3>
<p>Consider an AI agent with access to hundreds of tools across multiple domains. Loading all tool definitions into the context window for every request:</p>
<ul>
<li class=""><strong>Consumes significant tokens</strong> for tool definitions (e.g., 741 tools require ~120K tokens)</li>
<li class=""><strong>Increases latency</strong> as the model processes a large number of tools</li>
<li class=""><strong>Raises costs</strong> due to increased token usage</li>
<li class=""><strong>May reduce accuracy</strong> as the model faces more complex selection decisions</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-relevance-problem">The Relevance Problem<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#the-relevance-problem" class="hash-link" aria-label="Direct link to The Relevance Problem" title="Direct link to The Relevance Problem" translate="no">​</a></h3>
<p>In many cases, most tools are not relevant for a given query:</p>
<ul>
<li class="">User asks: <em>"What's the weather in San Francisco?"</em></li>
<li class="">Agent receives: Hundreds of tool definitions (weather, finance, database, email, calendar, etc.)</li>
<li class="">Reality: Only a small subset of tools are actually relevant</li>
</ul>
<p>This creates inefficiency in terms of tokens, latency, cost, and model decision-making complexity.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-research-evidence">The Research Evidence<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#the-research-evidence" class="hash-link" aria-label="Direct link to The Research Evidence" title="Direct link to The Research Evidence" translate="no">​</a></h3>
<p>Recent academic studies have measured the impact of large tool catalogs on LLM performance:</p>
<p><strong>Accuracy Degradation:</strong> Research testing tool selection with growing catalogs found that:</p>
<ul>
<li class="">With ~50 tools (8K tokens): Most models maintain 84-95% accuracy</li>
<li class="">With ~200 tools (32K tokens): Accuracy ranges from 41-83% depending on model</li>
<li class="">With ~740 tools (120K tokens): Accuracy drops to 0-20% for most models</li>
</ul>
<p>Different models show varying degrees of degradation, with open-source models showing 79-100% degradation when scaling from small to large tool catalogs.</p>
<p><strong>The "Lost in the Middle" Effect:</strong> Research has documented position bias where tools in the middle of long lists are less likely to be selected correctly. For example, with 741 tools, middle positions (40-60%) showed 22-52% accuracy compared to 31-32% at the beginning/end positions for some models.</p>
<p><strong>Non-Linear Degradation:</strong> Performance degradation is not gradual. Research shows that accuracy can drop sharply as tool count increases, with the transition from 207 to 417 tools showing particularly steep declines (e.g., from 64% to 20% for one model tested).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-solution-semantic-tool-selection">Our Solution: Semantic Tool Selection<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#our-solution-semantic-tool-selection" class="hash-link" aria-label="Direct link to Our Solution: Semantic Tool Selection" title="Direct link to Our Solution: Semantic Tool Selection" translate="no">​</a></h2>
<p>The vLLM Semantic Router implements <strong>semantic tool selection</strong> as an intelligent filter that sits between the user and the LLM:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works">How It Works<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works" translate="no">​</a></h3>
<p><strong>Step 1: Tool Database with Embeddings</strong></p>
<p>Each tool in our database has:</p>
<ul>
<li class="">Tool definition (name, parameters, schema)</li>
<li class="">Rich description optimized for semantic matching</li>
<li class="">Pre-computed embedding vector</li>
<li class="">Optional metadata (category, tags)</li>
</ul>
<p><strong>Step 2: Query Embedding and Similarity Search</strong></p>
<p>When a user query arrives:</p>
<ol>
<li class="">Generate an embedding for the query text</li>
<li class="">Calculate cosine similarity with all tool embeddings</li>
<li class="">Select top-K tools above a similarity threshold</li>
<li class="">Inject only relevant tools into the request</li>
</ol>
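<p>A minimal sketch of Steps 1-2, using toy hand-written embeddings in place of a real embedding model (the tool names, vectors, top-K, and threshold values here are illustrative, not the actual vLLM-SR implementation):</p>

```python
import math

# Hypothetical tool records (Step 1): definition plus a pre-computed embedding.
# Real embeddings come from a sentence-embedding model; these 3-d vectors are toys.
TOOLS = [
    {"name": "get_weather", "embedding": [0.9, 0.1, 0.0]},
    {"name": "query_db",    "embedding": [0.1, 0.9, 0.1]},
    {"name": "send_email",  "embedding": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_tools(query_embedding, tools, top_k=3, threshold=0.6):
    """Step 2: rank tools by cosine similarity, keep top-K above the threshold."""
    scored = sorted(
        ((cosine(query_embedding, t["embedding"]), t) for t in tools),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [t for score, t in scored[:top_k] if score >= threshold]

# A weather-like query embedding selects only the weather tool.
selected = select_tools([0.95, 0.05, 0.0], TOOLS, top_k=2, threshold=0.6)
print([t["name"] for t in selected])  # ['get_weather']
```

<p>Production deployments vectorize this search (or back it with an ANN index), but the selection logic is the same: rank by similarity, keep the top-K above the threshold.</p>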
<p><strong>Step 3: Request Modification</strong></p>
<p>The router modifies the API request to include only selected tools, dramatically reducing token usage.</p>
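<p>Concretely, the rewrite can be as small as replacing the <code>tools</code> array in an OpenAI-style chat completion request with the selected subset (a hypothetical sketch; the field names follow the OpenAI tool-calling schema, not vLLM-SR internals):</p>

```python
def rewrite_request(request: dict, selected_tool_names: set) -> dict:
    """Return a copy of the request carrying only the selected tool definitions."""
    slim = dict(request)
    slim["tools"] = [
        t for t in request.get("tools", [])
        if t["function"]["name"] in selected_tool_names
    ]
    return slim

request = {
    "model": "demo-model",
    "messages": [{"role": "user", "content": "What's the weather in San Francisco?"}],
    "tools": [
        {"type": "function", "function": {"name": "get_weather"}},
        {"type": "function", "function": {"name": "send_email"}},
    ],
}

slim = rewrite_request(request, {"get_weather"})
print(len(slim["tools"]))  # 1
```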
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="experimental-results">Experimental Results<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#experimental-results" class="hash-link" aria-label="Direct link to Experimental Results" title="Direct link to Experimental Results" translate="no">​</a></h2>
<p>We conducted extensive experiments comparing traditional "load all tools" approaches with our semantic tool selection system across three real-world scenarios. Our findings align with recent research showing that LLMs struggle significantly with large tool catalogs and long contexts in tool-calling scenarios.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="research-context-the-tool-selection-challenge">Research Context: The Tool Selection Challenge<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#research-context-the-tool-selection-challenge" class="hash-link" aria-label="Direct link to Research Context: The Tool Selection Challenge" title="Direct link to Research Context: The Tool Selection Challenge" translate="no">​</a></h3>
<p>Recent academic research has quantified the severity of this problem. Studies show that as tool catalogs grow:</p>
<ul>
<li class=""><strong>Performance drops 7-85%</strong> when tool count increases from small to large catalogs</li>
<li class=""><strong>Token consumption explodes</strong> by 50-100x with naive "load all tools" approaches</li>
<li class=""><strong>Position bias emerges</strong>: tools buried in the middle of long lists are often missed ("lost in the middle")</li>
<li class=""><strong>Accuracy degrades non-linearly</strong>: even state-of-the-art models like GPT-4 struggle</li>
</ul>
<p>One study testing tool selection with increasing catalog sizes found that baseline accuracy dropped from 78% with 10 tools to just 13.62% with 100+ tools, a catastrophic 82% degradation. This "needle in a haystack" problem for tool selection motivated our semantic approach.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="large-tool-catalog-stress-test">Large Tool Catalog Stress Test<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#large-tool-catalog-stress-test" class="hash-link" aria-label="Direct link to Large Tool Catalog Stress Test" title="Direct link to Large Tool Catalog Stress Test" translate="no">​</a></h3>
<p><strong>Setup:</strong></p>
<p>Based on the Berkeley Function Calling Leaderboard (BFCL) dataset, we tested tool selection performance as catalog size grows:</p>
<ul>
<li class=""><strong>Dataset</strong>: 858 function calling samples (simple, live_simple, multiple subsets)</li>
<li class=""><strong>Tool catalog sizes</strong>: Varied from 49 tools (8K tokens) to 741 tools (120K tokens)</li>
<li class=""><strong>Models tested</strong>: Llama-3.1-70B, Mistral-Large, Granite-3.1-8B, ToolACE-8B, BitAgent-8B</li>
<li class=""><strong>Position testing</strong>: Tools placed at 10%, 30%, 50%, 70%, 90% positions in the catalog</li>
</ul>
<p><strong>Results - Accuracy Degradation by Catalog Size:</strong></p>
<table><thead><tr><th>Model</th><th>Max Accuracy (Small Catalog)</th><th>Min Accuracy (Large Catalog)</th><th>Degradation</th></tr></thead><tbody><tr><td>Llama-3.1-70B</td><td>95%</td><td>20%</td><td><strong>-79%</strong></td></tr><tr><td>Mistral-Large</td><td>94%</td><td>0%</td><td><strong>-100%</strong></td></tr><tr><td>Granite-3.1-8B</td><td>84%</td><td>7%</td><td><strong>-92%</strong></td></tr><tr><td>BitAgent-8B</td><td>95%</td><td>10%</td><td><strong>-89%</strong></td></tr></tbody></table>
<p><strong>Key Findings:</strong></p>
<ol>
<li class=""><strong>Catastrophic degradation</strong>: Most models experience 79-100% accuracy drops as tool count increases from ~50 to ~740 tools</li>
<li class=""><strong>Prompt bloat</strong>: At 120K tokens, average prompt size was 127,315 tokens with all tools loaded</li>
</ol>
<p><strong>Position Bias Analysis:</strong></p>
<p>Testing with 741 tools at different positions revealed severe "lost in the middle" effects:</p>
<table><thead><tr><th>Tool Position</th><th>Granite-3.1-8B</th><th>Llama-3.1-70B</th><th>BitAgent-8B</th></tr></thead><tbody><tr><td>Beginning (10%)</td><td>18%</td><td>32%</td><td>57%</td></tr><tr><td>Early (30%)</td><td>12%</td><td>28%</td><td>45%</td></tr><tr><td>Middle (50%)</td><td>8%</td><td>22%</td><td>24%</td></tr><tr><td>Late (70%)</td><td>14%</td><td>29%</td><td>41%</td></tr><tr><td>End (90%)</td><td>17%</td><td>31%</td><td>53%</td></tr></tbody></table>
<p><strong>Implications for vLLM Semantic Router:</strong></p>
<p>These findings reinforce why semantic selection is critical:</p>
<ol>
<li class=""><strong>Smaller contexts = better comprehension</strong>: By reducing tool catalog from 120K to 1K tokens, we leave 119K tokens for tool responses and conversation history</li>
<li class=""><strong>Focused selection = better recall</strong>: With only 3-5 relevant tools, models can focus on understanding responses rather than parsing hundreds of tool descriptions</li>
<li class=""><strong>Complementary to other optimizations</strong>: Semantic selection works alongside response parsing, context compression, and conversation management</li>
<li class=""><strong>Enables longer conversations</strong>: Saving 99.1% of context on tool definitions (127,315 → 1,084 tokens) allows significantly more room for conversation history or tool responses</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="benefits-of-semantic-tool-selection">Benefits of Semantic Tool Selection<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#benefits-of-semantic-tool-selection" class="hash-link" aria-label="Direct link to Benefits of Semantic Tool Selection" title="Direct link to Benefits of Semantic Tool Selection" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-restores-usability-at-scale">1. Restores Usability at Scale<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#1-restores-usability-at-scale" class="hash-link" aria-label="Direct link to 1. Restores Usability at Scale" title="Direct link to 1. Restores Usability at Scale" translate="no">​</a></h3>
<p>Research shows that without semantic selection, tool-calling systems become <strong>unusable</strong> beyond ~100 tools:</p>
<p><strong>Accuracy Recovery:</strong></p>
<table><thead><tr><th>Tool Count</th><th>Without Selection</th><th>With Semantic Selection</th><th>Recovery</th></tr></thead><tbody><tr><td>49 tools</td><td>94%</td><td>94%</td><td>Baseline</td></tr><tr><td>207 tools</td><td>64%</td><td>94%</td><td><strong>+47%</strong></td></tr><tr><td>417 tools</td><td>20%</td><td>94%</td><td><strong>+370%</strong></td></tr><tr><td>741 tools</td><td>13.62%</td><td>43.13%</td><td><strong>+217%</strong></td></tr></tbody></table>
<p><strong>Key Insight:</strong> Semantic selection doesn't just improve performance—it makes large-scale tool calling <strong>possible</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-dramatic-token--cost-reduction">2. Dramatic Token &amp; Cost Reduction<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#2-dramatic-token--cost-reduction" class="hash-link" aria-label="Direct link to 2. Dramatic Token &amp; Cost Reduction" title="Direct link to 2. Dramatic Token &amp; Cost Reduction" translate="no">​</a></h3>
<p><strong>Token Savings (741 tools):</strong></p>
<ul>
<li class=""><strong>Baseline</strong>: 127,315 tokens per request</li>
<li class=""><strong>Semantic Selection</strong>: 1,084 tokens per request</li>
<li class=""><strong>Reduction</strong>: 99.1% (117x fewer tokens)</li>
</ul>
<p><strong>Cost Impact (based on typical LLM pricing at $2.50/$10 per 1M input/output tokens):</strong></p>
<table><thead><tr><th>Volume</th><th>Without Selection</th><th>With Selection</th><th>Annual Savings</th></tr></thead><tbody><tr><td>1M requests/month</td><td>$318,288</td><td>$2,710</td><td><strong>$3.79M/year</strong></td></tr><tr><td>10M requests/month</td><td>$3.18M</td><td>$27,100</td><td><strong>$37.9M/year</strong></td></tr></tbody></table>
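<p>The table's figures follow directly from the per-request token counts and the stated $2.50 per 1M input-token price (a back-of-the-envelope sketch counting input tokens only; real bills add output tokens, which are the same in both setups):</p>

```python
PRICE_PER_M_INPUT = 2.50      # assumed $/1M input tokens, as stated above
REQUESTS_PER_MONTH = 1_000_000

BASELINE_TOKENS = 127_315     # all 741 tool definitions loaded per request
SELECTED_TOKENS = 1_084       # semantically selected tools only

def monthly_cost(tokens_per_request):
    return tokens_per_request * REQUESTS_PER_MONTH / 1_000_000 * PRICE_PER_M_INPUT

reduction = 1 - SELECTED_TOKENS / BASELINE_TOKENS
annual_savings = (monthly_cost(BASELINE_TOKENS) - monthly_cost(SELECTED_TOKENS)) * 12
print(f"{reduction:.1%}", f"${annual_savings:,.0f}/year")  # 99.1% $3,786,930/year
```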
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-eliminates-position-bias">3. Eliminates Position Bias<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#3-eliminates-position-bias" class="hash-link" aria-label="Direct link to 3. Eliminates Position Bias" title="Direct link to 3. Eliminates Position Bias" translate="no">​</a></h3>
<p>Research documents severe "lost in the middle" effects. Semantic selection eliminates this:</p>
<p><strong>Position Bias (741 tools, Llama-3.1-70B):</strong></p>
<ul>
<li class=""><strong>Beginning</strong>: 32% accuracy</li>
<li class=""><strong>Middle</strong>: 22% accuracy (31% worse)</li>
<li class=""><strong>End</strong>: 31% accuracy</li>
</ul>
<p><strong>With Semantic Selection</strong>: 94% accuracy regardless of original position</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-scalability-beyond-current-limits">4. Scalability Beyond Current Limits<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#5-scalability-beyond-current-limits" class="hash-link" aria-label="Direct link to 4. Scalability Beyond Current Limits" title="Direct link to 4. Scalability Beyond Current Limits" translate="no">​</a></h3>
<p>The MCP ecosystem already has 4,400+ servers. Research shows:</p>
<ul>
<li class=""><strong>At 100+ tools</strong>: Baseline accuracy drops to 13-15% (near-random)</li>
<li class=""><strong>With semantic selection</strong>: Maintains 43%+ accuracy even at scale</li>
<li class=""><strong>Future-proof</strong>: As tool ecosystems grow to 10,000+ tools, semantic selection becomes essential</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture-overview">Architecture Overview<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#architecture-overview" class="hash-link" aria-label="Direct link to Architecture Overview" title="Direct link to Architecture Overview" translate="no">​</a></h2>
<p>Here's how semantic tool selection integrates into the request flow:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="system-components">System Components<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#system-components" class="hash-link" aria-label="Direct link to System Components" title="Direct link to System Components" translate="no">​</a></h3>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="comparison-with-other-approaches">Comparison with Other Approaches<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#comparison-with-other-approaches" class="hash-link" aria-label="Direct link to Comparison with Other Approaches" title="Direct link to Comparison with Other Approaches" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vs-loading-all-tools">vs. Loading All Tools<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#vs-loading-all-tools" class="hash-link" aria-label="Direct link to vs. Loading All Tools" title="Direct link to vs. Loading All Tools" translate="no">​</a></h3>
<p>Research demonstrates clear advantages of semantic selection:</p>
<table><thead><tr><th>Metric</th><th>Observation</th></tr></thead><tbody><tr><td><strong>Token Usage</strong></td><td>99.1% reduction (127,315 → 1,084 tokens for 741 tools)</td></tr><tr><td><strong>Accuracy</strong></td><td>3.2x improvement (43.13% vs 13.62% baseline in RAG-MCP study)</td></tr><tr><td><strong>Scalability</strong></td><td>Maintains performance as tool count grows to 4,400+</td></tr><tr><td><strong>Position Bias</strong></td><td>Mitigates "lost in the middle" effects through relevance-based selection</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vs-manual-categorization">vs. Manual Categorization<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#vs-manual-categorization" class="hash-link" aria-label="Direct link to vs. Manual Categorization" title="Direct link to vs. Manual Categorization" translate="no">​</a></h3>
<p><strong>Manual Categories:</strong></p>
<ul>
<li class="">Requires maintaining tool taxonomies</li>
<li class="">Brittle when tools span multiple categories</li>
<li class="">Doesn't adapt to query nuances</li>
<li class="">Maintenance overhead: ~2 hours/week per 100 tools</li>
</ul>
<p><strong>Semantic Selection:</strong></p>
<ul>
<li class="">Automatic relevance based on embeddings</li>
<li class="">Handles cross-domain queries naturally</li>
<li class="">Adapts to new tools without reconfiguration</li>
<li class="">Maintenance overhead: ~5 minutes/week (add new tools)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vs-code-execution-mcp-approach">vs. Code Execution (MCP Approach)<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#vs-code-execution-mcp-approach" class="hash-link" aria-label="Direct link to vs. Code Execution (MCP Approach)" title="Direct link to vs. Code Execution (MCP Approach)" translate="no">​</a></h3>
<p>Anthropic's code execution and our semantic selection are <strong>complementary</strong>:</p>
<table><thead><tr><th>Aspect</th><th>Code Execution (MCP)</th><th>Semantic Selection (vLLM SR)</th></tr></thead><tbody><tr><td><strong>When</strong></td><td>During agent execution</td><td>Before LLM receives request</td></tr><tr><td><strong>How</strong></td><td>Filesystem exploration + code</td><td>Embedding similarity search</td></tr><tr><td><strong>Latency</strong></td><td>Variable (depends on exploration)</td><td>Fixed (~15ms)</td></tr><tr><td><strong>Best For</strong></td><td>Complex workflows, data filtering</td><td>Tool discovery, request optimization</td></tr></tbody></table>
<p><strong>Combined Approach:</strong></p>
<ol>
<li class=""><strong>Semantic Router</strong> selects relevant tools (500 → 3 tools)</li>
<li class=""><strong>LLM</strong> writes code to use those tools efficiently</li>
<li class=""><strong>Code execution</strong> handles data filtering and complex logic</li>
</ol>
<p>This gives you the best of both worlds: efficient tool discovery + powerful execution patterns.</p>
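<p>The first stage of that pipeline can be sketched in a few lines. The following is a minimal, illustrative sketch (not the router's actual implementation): rank toy tool embeddings by cosine similarity to the query embedding, drop anything below the threshold, and keep the top-K. The tool names and 3-dimensional vectors are invented for the example.</p>

```rust
// Sketch of embedding-based tool selection: score each tool's embedding
// against the query embedding and keep the top-K above a threshold.
// Embeddings here are toy 3-dimensional vectors; names are illustrative.

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the names of the top_k tools whose similarity to the query
/// exceeds `threshold`, most similar first.
fn select_tools<'a>(
    query: &[f32],
    tools: &'a [(&'a str, Vec<f32>)],
    top_k: usize,
    threshold: f32,
) -> Vec<&'a str> {
    let mut scored: Vec<(&str, f32)> = tools
        .iter()
        .map(|(name, emb)| (*name, cosine_similarity(query, emb)))
        .filter(|(_, s)| *s >= threshold)
        .collect();
    // Sort descending by similarity.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(top_k).map(|(n, _)| n).collect()
}

fn main() {
    let tools = vec![
        ("get_weather", vec![0.9, 0.1, 0.0]),
        ("send_email", vec![0.0, 1.0, 0.1]),
        ("query_db", vec![0.1, 0.0, 1.0]),
    ];
    let query = vec![1.0, 0.2, 0.0]; // embedding of e.g. "what's the weather?"
    let selected = select_tools(&query, &tools, 3, 0.8);
    println!("{:?}", selected); // only the weather tool clears the threshold
}
```

<p>Only the tools that pass this filter ever appear in the LLM's context, which is where the token reduction comes from: the LLM then writes code against three tools instead of five hundred.</p>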
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-directions-scaling-to-thousands-of-tools">Future Directions: Scaling to Thousands of Tools<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#future-directions-scaling-to-thousands-of-tools" class="hash-link" aria-label="Direct link to Future Directions: Scaling to Thousands of Tools" title="Direct link to Future Directions: Scaling to Thousands of Tools" translate="no">​</a></h2>
<p>While our current implementation handles hundreds of tools effectively, research points to new challenges as tool ecosystems grow to thousands of tools:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hierarchical-retrieval">Hierarchical Retrieval<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#hierarchical-retrieval" class="hash-link" aria-label="Direct link to Hierarchical Retrieval" title="Direct link to Hierarchical Retrieval" translate="no">​</a></h3>
<p>Recent studies show that flat similarity search begins to degrade beyond ~1,000 tools. Future work will explore:</p>
<ul>
<li class=""><strong>Two-stage retrieval</strong>: First select relevant categories, then tools within categories</li>
<li class=""><strong>Adaptive retrieval</strong>: Dynamically adjust top-K based on query complexity</li>
<li class=""><strong>Hybrid approaches</strong>: Combine semantic similarity with metadata filtering</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="tool-response-management">Tool Response Management<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#tool-response-management" class="hash-link" aria-label="Direct link to Tool Response Management" title="Direct link to Tool Response Management" translate="no">​</a></h3>
<p>Research has identified tool response processing as a critical bottleneck:</p>
<ul>
<li class=""><strong>Intelligent parsing</strong>: Extract only relevant fields from large JSON responses</li>
<li class=""><strong>Progressive disclosure</strong>: Stream tool responses incrementally</li>
<li class=""><strong>Response summarization</strong>: Use smaller models to compress responses before sending to main LLM</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="multi-turn-optimization">Multi-Turn Optimization<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#multi-turn-optimization" class="hash-link" aria-label="Direct link to Multi-Turn Optimization" title="Direct link to Multi-Turn Optimization" translate="no">​</a></h3>
<p>For long conversations with many tool calls:</p>
<ul>
<li class=""><strong>Context compression</strong>: Summarize earlier turns while preserving key information</li>
<li class=""><strong>Selective history</strong>: Include only relevant past tool calls in context</li>
<li class=""><strong>State management</strong>: Track conversation state separately from full history</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Anthropic's blog on code execution with MCP highlighted a fundamental challenge: <strong>agents need efficient ways to discover and use tools at scale</strong>. Their solution—progressive disclosure through code execution—is elegant and powerful.</p>
<p>Our semantic tool selection approach tackles the same problem from a complementary angle: <strong>use semantic similarity to automatically select relevant tools before the LLM even sees the request</strong>. Research demonstrates:</p>
<ul>
<li class=""><strong>99.1% token reduction</strong> (127,315 → 1,084 tokens for 741 tools)</li>
<li class=""><strong>3.2x accuracy improvement</strong> (43.13% vs 13.62% baseline in RAG-MCP benchmark)</li>
<li class=""><strong>Significant cost reduction</strong> through reduced token usage</li>
<li class=""><strong>Improved selection quality</strong> by focusing on relevant tools</li>
<li class=""><strong>Transparent and debuggable</strong> tool selection process</li>
</ul>
<p>The two approaches are not mutually exclusive—in fact, they work beautifully together:</p>
<ol>
<li class=""><strong>Semantic Router</strong> filters 500 tools down to 3 relevant ones</li>
<li class=""><strong>LLM</strong> writes code to use those tools efficiently</li>
<li class=""><strong>Code execution</strong> handles data processing and complex workflows</li>
</ol>
<p>As AI agents become more capable and connect to more tools, intelligent tool management becomes critical. Whether through semantic selection, code execution, or a combination of both, the future of AI agents lies in <strong>smart, context-aware tool discovery</strong> that scales efficiently.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="give-it-a-try">Give it a Try<a href="https://vllm-semantic-router.com/blog/semantic-tool-selection#give-it-a-try" class="hash-link" aria-label="Direct link to Give it a Try" title="Direct link to Give it a Try" translate="no">​</a></h2>
<p>The vLLM Semantic Router is open source:</p>
<ul>
<li class=""><strong>GitHub:</strong> <a href="https://github.com/vllm-project/semantic-router" target="_blank" rel="noopener noreferrer" class="">github.com/vllm-project/semantic-router</a></li>
<li class=""><strong>Documentation:</strong> <a href="https://vllm-semantic-router.com/" target="_blank" rel="noopener noreferrer" class="">vllm-semantic-router.com</a></li>
<li class=""><strong>Quick Start:</strong> Deploy in 5 minutes with Docker Compose or Kubernetes</li>
</ul>
<p>Example configuration to get started:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#9CDCFE;--prism-background-color:#1E1E1E"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#9CDCFE;background-color:#1E1E1E"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#9CDCFE"><span class="token comment" style="color:rgb(106, 153, 85)"># config.yaml</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain"></span><span class="token key atrule">tools</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token key atrule">enabled</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token key atrule">top_k</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(181, 206, 168)">3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token key atrule">similarity_threshold</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(181, 206, 168)">0.80</span><span class="token plain"></span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token key atrule">tools_db_path</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token 
plain"> config/tools_db.json</span><br></span><span class="token-line" style="color:#9CDCFE"><span class="token plain">  </span><span class="token key atrule">fallback_to_empty</span><span class="token punctuation" style="color:rgb(212, 212, 212)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><br></span></code></pre></div></div>
<p>Start with a small tool database (10-20 tools) and expand as you see the benefits. Monitor the metrics dashboard to tune thresholds and optimize performance.</p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <category label="tools" term="tools"/>
        <category label="semantic-routing" term="semantic-routing"/>
        <category label="mcp" term="mcp"/>
        <category label="performance" term="performance"/>
        <category label="agents" term="agents"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA]]></title>
        <id>https://vllm-semantic-router.com/blog/modular-lora</id>
        <link href="https://vllm-semantic-router.com/blog/modular-lora"/>
        <updated>2025-10-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.]]></summary>
        <content type="html"><![CDATA[<p>Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.</p>
<blockquote>
<p>Sync from <a href="https://blog.vllm.ai/2025/10/27/semantic-router-modular.html" target="_blank" rel="noopener noreferrer" class="">vLLM Official Blog</a>.</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="background-from-bert-to-a-modular-system">Background: From BERT to a Modular System<a href="https://vllm-semantic-router.com/blog/modular-lora#background-from-bert-to-a-modular-system" class="hash-link" aria-label="Direct link to Background: From BERT to a Modular System" title="Direct link to Background: From BERT to a Modular System" translate="no">​</a></h2>
<p>The previous implementation relied primarily on BERT and ModernBERT for intent and jailbreak classification. While ModernBERT performs well for English text classification tasks, it has the following limitations:</p>
<ul>
<li class="">Language Coverage: The original ModernBERT's multilingual support is limited compared to models trained on more diverse datasets. (Note: <a href="https://huggingface.co/blog/mmbert" target="_blank" rel="noopener noreferrer" class="">mmBERT</a>, a massively multilingual variant of ModernBERT supporting 1800+ languages, was released after this refactoring began and represents an alternative approach to the multilingual challenge)</li>
<li class="">Context Length: While ModernBERT extends context to 8,192 tokens using RoPE (<a href="https://huggingface.co/docs/transformers/v4.49.0/en/model_doc/modernbert" target="_blank" rel="noopener noreferrer" class="">source</a>), models like Qwen3-Embedding support up to 32,768 tokens, which is beneficial for very long document processing</li>
<li class="">Model Coupling: Classification logic was tightly coupled to specific model architectures, making it difficult to add new models</li>
</ul>
<p>These constraints motivated a broader refactoring that would enable the system to support multiple model types while maintaining performance. The modular architecture means that newer models like mmBERT can be integrated alongside Qwen3-Embedding and EmbeddingGemma, allowing the router to select the most appropriate model for each task.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="architectural-restructuring">Architectural Restructuring<a href="https://vllm-semantic-router.com/blog/modular-lora#architectural-restructuring" class="hash-link" aria-label="Direct link to Architectural Restructuring" title="Direct link to Architectural Restructuring" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="modular" src="https://vllm-semantic-router.com/assets/images/modular-c6c1bc21f9aab8491d3f862c3af1af04.png" width="1536" height="1024" class="img_ev3q"></p>
<p>The refactoring introduces a layered architecture in the candle-binding crate. This structure separates concerns: core functionality remains independent of specific models, while new model architectures can be added without modifying existing code. The DualPathUnifiedClassifier implements routing logic that selects between traditional fine-tuned models and LoRA-adapted models based on the task requirements.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="long-context-embedding-models">Long-Context Embedding Models<a href="https://vllm-semantic-router.com/blog/modular-lora#long-context-embedding-models" class="hash-link" aria-label="Direct link to Long-Context Embedding Models" title="Direct link to Long-Context Embedding Models" translate="no">​</a></h2>
<p>Two new embedding models address the context length limitation:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="qwen3-embedding">Qwen3-Embedding<a href="https://vllm-semantic-router.com/blog/modular-lora#qwen3-embedding" class="hash-link" aria-label="Direct link to Qwen3-Embedding" title="Direct link to Qwen3-Embedding" translate="no">​</a></h3>
<p>Qwen3-Embedding supports context lengths up to 32,768 tokens (<a href="https://huggingface.co/Qwen/Qwen3-Embedding-0.6B" target="_blank" rel="noopener noreferrer" class="">Hugging Face model card</a>). The implementation uses RoPE (Rotary Position Embedding), enabling this extended context handling through improved frequency resolution at longer distances.</p>
<p>Qwen3-Embedding was trained on text from over 100 languages (<a href="https://huggingface.co/Qwen/Qwen3-Embedding-0.6B" target="_blank" rel="noopener noreferrer" class="">Hugging Face model card</a>), making it suitable for multilingual routing scenarios where the previous ModernBERT-only approach would struggle.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="embeddinggemma-300m">EmbeddingGemma-300M<a href="https://vllm-semantic-router.com/blog/modular-lora#embeddinggemma-300m" class="hash-link" aria-label="Direct link to EmbeddingGemma-300M" title="Direct link to EmbeddingGemma-300M" translate="no">​</a></h3>
<p>Google's EmbeddingGemma-300M takes a different approach, focusing on smaller model size while maintaining quality. The model supports context lengths of 2,048 tokens and implements Matryoshka representation learning, which means embeddings can be truncated to 768, 512, 256, or 128 dimensions without retraining (<a href="https://huggingface.co/google/embeddinggemma-300m" target="_blank" rel="noopener noreferrer" class="">Hugging Face model card</a>).</p>
<p>The architecture uses Multi-Query Attention (MQA) with 3 query heads and 1 key-value head, reducing memory bandwidth requirements. A distinctive feature is the dense bottleneck layer (768 → 3072 → 768) applied after the transformer blocks, which improves embedding quality based on the Matryoshka training approach.</p>
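<p>The practical consequence of Matryoshka training is that a consumer can shorten an embedding after the fact. A minimal sketch of that operation, assuming the standard recipe of truncating the leading components and L2-renormalizing (the 4-dimensional vector stands in for a real 768-dimensional embedding):</p>

```rust
// Sketch of Matryoshka-style truncation: keep the first `dim` components
// of a full embedding and L2-renormalize the result.

fn truncate_embedding(full: &[f32], dim: usize) -> Vec<f32> {
    let mut v: Vec<f32> = full[..dim.min(full.len())].to_vec();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}

fn main() {
    let full = vec![0.6, 0.8, 0.0, 0.0]; // stand-in for a 768-dim embedding
    let small = truncate_embedding(&full, 2);
    // The truncated vector is unit-length again and ready for similarity search.
    println!("{:?}", small);
}
```

<p>This lets a deployment trade index size and search latency against retrieval quality without re-embedding the corpus.</p>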
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="low-rank-adaptation-for-multi-task-classification">Low-Rank Adaptation for Multi-Task Classification<a href="https://vllm-semantic-router.com/blog/modular-lora#low-rank-adaptation-for-multi-task-classification" class="hash-link" aria-label="Direct link to Low-Rank Adaptation for Multi-Task Classification" title="Direct link to Low-Rank Adaptation for Multi-Task Classification" translate="no">​</a></h2>
<p>LoRA addresses a fundamental inefficiency in the previous system. When a classification system needs to determine intent, detect PII, and check for security issues, the naive approach runs three separate fine-tuned models:</p>
<p><img decoding="async" loading="lazy" alt="full" src="https://vllm-semantic-router.com/assets/images/full-params-717e8350e125e3c0ee7954ce56a3fb0b.png" width="1536" height="1024" class="img_ev3q"></p>
<p>Each model processes the input through its entire network, including the expensive base transformer layers. This results in O(n) complexity where n is the number of classification tasks.</p>
<p>LoRA changes this by sharing the base model computation:</p>
<p><img decoding="async" loading="lazy" alt="lora" src="https://vllm-semantic-router.com/assets/images/lora-44e8145c559b7e9d0cb6019a4aa5bc0f.png" width="1536" height="1024" class="img_ev3q"></p>
<p>The base model runs once, producing intermediate representations. Each LoRA adapter then applies task-specific low-rank weight updates to specialize the output. Since LoRA adapters typically modify less than 1% of the model's parameters, this final step is much faster than running complete models.</p>
<p>The implementation in parallel_engine.rs uses <a href="https://github.com/rayon-rs/rayon" target="_blank" rel="noopener noreferrer" class="">Rayon</a> for data parallelism, processing multiple LoRA adapters concurrently. For a request requiring three classifications, this changes the workload from three full forward passes to one full pass plus three lightweight adapter applications.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="concurrency-through-oncelock">Concurrency Through OnceLock<a href="https://vllm-semantic-router.com/blog/modular-lora#concurrency-through-oncelock" class="hash-link" aria-label="Direct link to Concurrency Through OnceLock" title="Direct link to Concurrency Through OnceLock" translate="no">​</a></h2>
<p>The previous implementation used lazy_static for managing global classifier state, which introduced lock contention under concurrent load. The refactoring replaces this with <a href="https://doc.rust-lang.org/std/sync/struct.OnceLock.html" target="_blank" rel="noopener noreferrer" class="">OnceLock</a> from the Rust standard library.</p>
<p>OnceLock provides lock-free reads after initialization. After the first initialization, all subsequent accesses are simple pointer reads with no synchronization overhead. Tests in oncelock_concurrent_test.rs verify this with 10 concurrent threads performing 30 total classifications, confirming that throughput scales linearly with thread count.</p>
<p>This matters when the router processes multiple incoming requests. With lazy_static, concurrent requests would queue behind a mutex. With OnceLock, they execute in parallel without contention.</p>
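<p>The pattern itself is small enough to show in full. This sketch uses the real standard-library <code>OnceLock</code> API; the <code>Classifier</code> struct is a stand-in for the router's actual state, not its real type:</p>

```rust
use std::sync::OnceLock;
use std::thread;

// Stand-in for the classifier state the router initializes once.
struct Classifier {
    label: &'static str,
}

static CLASSIFIER: OnceLock<Classifier> = OnceLock::new();

fn classifier() -> &'static Classifier {
    // get_or_init runs the closure at most once, even under concurrent calls;
    // every later call is a plain pointer read with no mutex.
    CLASSIFIER.get_or_init(|| Classifier { label: "intent" })
}

fn main() {
    // Ten threads race to use the classifier; only one performs initialization.
    let handles: Vec<_> = (0..10)
        .map(|_| thread::spawn(|| classifier().label))
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), "intent");
    }
    println!("all threads saw the same initialized classifier");
}
```

<p>Contrast this with a <code>lazy_static</code> guarded by a mutex, where every read under load would contend on the same lock even though the state never changes after startup.</p>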
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="flash-attention-for-gpu-acceleration">Flash Attention for GPU Acceleration<a href="https://vllm-semantic-router.com/blog/modular-lora#flash-attention-for-gpu-acceleration" class="hash-link" aria-label="Direct link to Flash Attention for GPU Acceleration" title="Direct link to Flash Attention for GPU Acceleration" translate="no">​</a></h3>
<p>Flash Attention 2 support is available as an optional feature for CUDA builds, though it requires Ampere-generation or newer GPUs (compute capability ≥ 8.0). Flash Attention optimizes the attention mechanism by processing computations in blocks that fit in fast on-chip SRAM memory, avoiding repeated reads from slower GPU DRAM.</p>
<p>Both ModernBERT and Qwen3 benefit from Flash Attention integration:</p>
<ul>
<li class="">
<p>ModernBERT: Achieves up to 3× faster self-attention computations with significantly reduced memory usage (<a href="https://medium.com/@alpernebikanli/some-berts-and-modernbert-39b261b1ce83" target="_blank" rel="noopener noreferrer" class="">source</a>). The model also uses alternating attention patterns (global attention every third layer, local sliding-window attention otherwise) to balance efficiency with context retention (<a href="https://www.answer.ai/posts/2024-12-19-modernbert.html" target="_blank" rel="noopener noreferrer" class="">source</a>).</p>
</li>
<li class="">
<p>Qwen3: Integration of FlashAttention-2 provides up to 4× speedup in attention operations. For the 14B variant, this translates to 70-110 tokens/second during inference compared to 30-35 tokens/second without it—a performance improvement that becomes more pronounced with longer contexts (<a href="https://qwen3lm.com/qwen3-flashattention2-inference-guide/" target="_blank" rel="noopener noreferrer" class="">source</a>).</p>
</li>
</ul>
<p>The Rust implementation makes Flash Attention optional via Cargo features, allowing deployment on systems without compatible GPUs while enabling substantial performance gains when hardware supports it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cross-language-integration-for-cloud-native-ecosystems">Cross-Language Integration for Cloud-Native Ecosystems<a href="https://vllm-semantic-router.com/blog/modular-lora#cross-language-integration-for-cloud-native-ecosystems" class="hash-link" aria-label="Direct link to Cross-Language Integration for Cloud-Native Ecosystems" title="Direct link to Cross-Language Integration for Cloud-Native Ecosystems" translate="no">​</a></h2>
<p>The choice of Rust for the core classification engine combined with Go FFI (Foreign Function Interface) bindings addresses a practical deployment challenge in cloud-native environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-rust-for-ml-inference">Why Rust for ML Inference<a href="https://vllm-semantic-router.com/blog/modular-lora#why-rust-for-ml-inference" class="hash-link" aria-label="Direct link to Why Rust for ML Inference" title="Direct link to Why Rust for ML Inference" translate="no">​</a></h3>
<p>Rust provides several advantages for the classification layer:</p>
<ul>
<li class="">Performance: Near-C performance with zero-cost abstractions, critical for low-latency inference</li>
<li class="">Memory Safety: Compile-time guarantees prevent common bugs like buffer overflows and use-after-free errors</li>
<li class="">Concurrency: The ownership system prevents data races, enabling safe parallel processing with Rayon</li>
<li class="">No Garbage Collection: Predictable latency without GC pauses that affect request processing</li>
</ul>
<p>The Candle framework leverages these Rust strengths while providing a familiar API for ML model development.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-go-ffi-bindings-matter">Why Go FFI Bindings Matter<a href="https://vllm-semantic-router.com/blog/modular-lora#why-go-ffi-bindings-matter" class="hash-link" aria-label="Direct link to Why Go FFI Bindings Matter" title="Direct link to Why Go FFI Bindings Matter" translate="no">​</a></h3>
<p>While Rust excels at compute-intensive ML inference, Go dominates the cloud-native infrastructure ecosystem. The FFI layer bridges these worlds. This integration enables deployment in environments where Go is the primary language:</p>
<ul>
<li class="">Envoy Proxy Integration: The semantic router runs as an <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter" target="_blank" rel="noopener noreferrer" class="">Envoy external processing filter</a>, written in Go. The FFI allows the Go filter to leverage high-performance Rust classification without rewriting the entire Envoy integration layer.</li>
<li class="">Kubernetes Operators: Cloud-native operators are typically written in Go using controller-runtime. The FFI enables these operators to embed classification logic directly rather than making network calls to separate services.</li>
<li class="">Service Meshes: Projects like Istio, Linkerd, and Consul are Go-based. The FFI allows routing decisions to use ML-based classification while maintaining compatibility with existing mesh control planes.</li>
<li class="">API Gateways: Many API gateways (Kong, Tyk) have Go components. The FFI enables semantic routing at the gateway layer without introducing additional microservices.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="deployment-flexibility">Deployment Flexibility<a href="https://vllm-semantic-router.com/blog/modular-lora#deployment-flexibility" class="hash-link" aria-label="Direct link to Deployment Flexibility" title="Direct link to Deployment Flexibility" translate="no">​</a></h3>
<p>The dual-language architecture provides deployment options:</p>
<ul>
<li class="">Embedded Mode: The Go service links directly to the Rust library via CGO, minimizing latency and deployment complexity</li>
<li class="">Process Isolation: The classification layer can run as a separate process, communicating via gRPC or Unix sockets for additional fault isolation</li>
<li class="">Mixed Workloads: Services can combine Go's networking and orchestration strengths with Rust's ML inference performance</li>
</ul>
<p>The semantic router leverages this pattern extensively. The main routing logic, configuration management, and cache implementations are in Go, while the compute-intensive classification runs in Rust. This separation allows each component to use the most appropriate language while maintaining clean interfaces through the FFI layer.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="performance-characteristics">Performance Characteristics<a href="https://vllm-semantic-router.com/blog/modular-lora#performance-characteristics" class="hash-link" aria-label="Direct link to Performance Characteristics" title="Direct link to Performance Characteristics" translate="no">​</a></h2>
<p>The benefits of this architecture vary by workload:</p>
<ul>
<li class="">Single vs multi-task classification: LoRA provides minimal benefit since there's no base model sharing. Traditional fine-tuned models may be faster. LoRA shows clear advantages when performing multiple classifications on the same input. Since the base model runs once and only LoRA adapters execute for each task, the overhead is substantially reduced compared to running separate full models. The actual speedup depends on the ratio of base model computation to adapter computation.</li>
<li class="">Long-context inputs: Qwen3-Embedding enables routing decisions on documents up to 32K tokens without truncation, extending beyond ModernBERT's 8K limit for very long documents. With Flash Attention 2 enabled on compatible GPUs, the performance advantage becomes more substantial as context length increases.</li>
<li class="">Multilingual routing: Models can now handle routing decisions for languages where ModernBERT has limited training data.</li>
<li class="">High concurrency: OnceLock eliminates lock contention, allowing throughput to scale with CPU cores for classification operations.</li>
<li class="">GPU acceleration: When Flash Attention 2 is enabled, attention operations run 3-4× faster, with the speedup becoming more pronounced at longer sequence lengths. This makes GPU deployment particularly advantageous for high-throughput scenarios.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-directions">Future Directions<a href="https://vllm-semantic-router.com/blog/modular-lora#future-directions" class="hash-link" aria-label="Direct link to Future Directions" title="Direct link to Future Directions" translate="no">​</a></h2>
<p>The modular architecture enables several extensions:</p>
<ul>
<li class="">Additional embedding models can be added by implementing the CoreModel trait</li>
<li class="">Flash Attention 3 support when available in Candle</li>
<li class="">Quantization support (4-bit, 8-bit) for reduced memory footprint</li>
<li class="">Custom LoRA adapters for domain-specific routing</li>
<li class="">FFI bindings for additional languages (Python, Java, C++) to expand integration possibilities</li>
</ul>
<p>The system now has a foundation for incorporating new research advances without requiring architectural changes. The FFI layer provides a stable interface that allows the Rust implementation to evolve independently while maintaining compatibility with existing Go-based deployments.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="resources">Resources<a href="https://vllm-semantic-router.com/blog/modular-lora#resources" class="hash-link" aria-label="Direct link to Resources" title="Direct link to Resources" translate="no">​</a></h2>
<ul>
<li class="">Project Repository: <a href="https://github.com/vllm-project/semantic-router" target="_blank" rel="noopener noreferrer" class="">https://github.com/vllm-project/semantic-router</a></li>
<li class="">Candle Framework: <a href="https://github.com/huggingface/candle" target="_blank" rel="noopener noreferrer" class="">https://github.com/huggingface/candle</a></li>
<li class="">Qwen3-Embedding: <a href="https://huggingface.co/Qwen/Qwen3-Embedding-0.6B" target="_blank" rel="noopener noreferrer" class="">https://huggingface.co/Qwen/Qwen3-Embedding-0.6B</a></li>
<li class="">EmbeddingGemma: <a href="https://huggingface.co/google/embeddinggemma-300m" target="_blank" rel="noopener noreferrer" class="">https://huggingface.co/google/embeddinggemma-300m</a></li>
</ul>]]></content>
        <author>
            <name>Ivar Flakstad</name>
            <uri>https://github.com/ivarflakstad</uri>
        </author>
        <author>
            <name>OneZero-Y</name>
            <uri>https://github.com/OneZero-Y</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="LoRA" term="LoRA"/>
        <category label="Candle" term="Candle"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Semantic Router Q4 2025 Roadmap: Journey to Iris]]></title>
        <id>https://vllm-semantic-router.com/blog/q4-roadmap-iris</id>
        <link href="https://vllm-semantic-router.com/blog/q4-roadmap-iris"/>
        <updated>2025-10-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: v0.1, codename "Iris", expected in late 2025 to early 2026.]]></summary>
        <content type="html"><![CDATA[<p>As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: <strong>v0.1, codename "Iris"</strong>, expected in late 2025 to early 2026.</p>
<p><img decoding="async" loading="lazy" alt="iris" src="https://vllm-semantic-router.com/assets/images/q4-4fa6e5c486468b595e87354cd6e37f8b.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-our-release-naming-convention">About Our Release Naming Convention<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#about-our-release-naming-convention" class="hash-link" aria-label="Direct link to About Our Release Naming Convention" title="Direct link to About Our Release Naming Convention" translate="no">​</a></h2>
<p>Starting with v0.1, each major release of vLLM Semantic Router will carry a codename inspired by figures from Greek mythology. These names reflect the essence and purpose of each release, connecting ancient wisdom with modern AI infrastructure.</p>
<p>Our inaugural release is named <strong>Iris</strong> (Ἶρις), after the Greek goddess of the rainbow and divine messenger of the Olympian gods. In mythology, Iris served as the swift-footed messenger who bridged the gap between gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. She personified the connection between heaven and earth, ensuring that communication flowed seamlessly across different realms.</p>
<p>This symbolism perfectly captures the essence of vLLM Semantic Router: a system that bridges the gap between users and diverse AI models, intelligently routing requests across different LLM providers and architectures. Just as Iris connected different worlds through her rainbow bridge, our router connects applications to the right models through intelligent semantic understanding. The rainbow itself—a spectrum of colors working in harmony—mirrors our vision of orchestrating multiple models in a unified, efficient system.</p>
<p>With the Iris release, we're establishing the foundation for reliable, intelligent, and secure AI model routing that will serve as the bridge for modern AI applications.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="q4-2025-focus-areas">Q4 2025 Focus Areas<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#q4-2025-focus-areas" class="hash-link" aria-label="Direct link to Q4 2025 Focus Areas" title="Direct link to Q4 2025 Focus Areas" translate="no">​</a></h2>
<p>Our Q4 roadmap centers on seven critical pillars that will transform vLLM Semantic Router from an experimental project into a production-ready platform. These initiatives address the most pressing needs identified by our community and represent the essential groundwork for v0.1.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-semantic-chain-for-fusion-intelligent-routing">1. Semantic Chain for Fusion Intelligent Routing<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#1-semantic-chain-for-fusion-intelligent-routing" class="hash-link" aria-label="Direct link to 1. Semantic Chain for Fusion Intelligent Routing" title="Direct link to 1. Semantic Chain for Fusion Intelligent Routing" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>Current routing relies exclusively on ModernBERT classification for semantic understanding. While powerful, this approach has limitations: it cannot perform deterministic routing based on specific keywords, lacks pattern-based detection for safety and compliance, and misses opportunities for specialized domain classification that could enhance routing accuracy and flexibility.</p>
<p><strong>The Innovation</strong></p>
<p>We're introducing a <strong>unified content scanning and routing framework</strong> that extends semantic routing with four complementary signal sources, all integrated through a Signal Fusion Layer:</p>
<p><strong>1. Keyword-Based Routing</strong></p>
<ul>
<li class="">Deterministic, fast Boolean logic for exact term matching</li>
<li class="">Route queries containing "kubernetes" or "CVE-" patterns directly to specialized models</li>
<li class="">Eliminate unnecessary ML inference for technology-specific queries</li>
</ul>
<p><strong>2. Regex Content Scanning</strong></p>
<ul>
<li class="">Pattern-based detection for safety, compliance, and structured data</li>
<li class="">Guaranteed blocking of PII patterns (SSN, credit cards) with no ML false negatives</li>
<li class="">RE2 engine with ReDoS protection for security-critical applications</li>
</ul>
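<p>As a rough sketch, pattern-based PII scanning might look like the following. The patterns are simplified illustrations (real deployments would use vetted rule sets), and Python's <code>re</code> module stands in here for readability; the RE2 engine named above has different guarantees:</p>

```python
import re

# Illustrative (hypothetical) PII patterns; simplified for the sketch.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the name of every PII pattern found in the text."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
```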
<p><strong>3. Embedding Similarity Scanning</strong></p>
<ul>
<li class="">Semantic concept detection robust to paraphrasing</li>
<li class="">Detect "multi-step reasoning" intent even when phrased as "explain thoroughly"</li>
<li class="">Reuses existing BERT embedder for zero additional model overhead</li>
</ul>
<p><strong>4. Domain Classification</strong></p>
<ul>
<li class=""><strong>In-Tree BERT Classification</strong>: Lightweight BERT-based domain classifiers running directly in the router process for low-latency intent detection</li>
<li class=""><strong>Out-of-Tree MCP Classification</strong>: Advanced domain-specific classifiers deployed as MCP servers for specialized routing scenarios (legal, medical, financial domains)</li>
<li class="">Hierarchical classification with confidence scoring for multi-domain queries</li>
</ul>
<p><strong>Dual Execution Paths</strong></p>
<ul>
<li class=""><strong>In-Tree Path</strong>: Low-latency signal providers running directly in the router process</li>
<li class=""><strong>Out-of-Tree Path</strong>: MCP (Model Context Protocol) servers for massive rule sets, custom matching engines (Aho-Corasick, Hyperscan), and domain-specific algorithms</li>
</ul>
<p><strong>Signal Fusion Layer</strong></p>
<p>The decision-making engine that combines all signals into actionable routing decisions:</p>
<ul>
<li class=""><strong>Priority-based policy evaluation</strong>: Safety blocks (200) → Routing overrides (150) → Category boosting (100) → Consensus (50) → Default (0)</li>
<li class=""><strong>Boolean expressions</strong>: Combine multiple signals with AND, OR, NOT operators</li>
<li class=""><strong>Flexible actions</strong>: Block, route to specific models, boost category weights, or fallthrough to BERT</li>
</ul>
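<p>The priority-ordered evaluation described above can be sketched as follows; this is a hypothetical illustration, not the router's implementation. Each policy pairs a boolean expression over detected signals with an action, and the highest-priority match wins:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    priority: int                    # 200 safety > 150 override > ... > 0 default
    matches: Callable[[set], bool]   # boolean expression over signal names
    action: str

def fuse(policies: list, signals: set) -> str:
    """Evaluate policies from highest priority down; first match decides."""
    for p in sorted(policies, key=lambda p: p.priority, reverse=True):
        if p.matches(signals):
            return p.action
    return "fallthrough:bert"        # default: defer to BERT classification

# Illustrative policies combining signals with AND / NOT logic.
POLICIES = [
    Policy("pii-safety-block", 200, lambda s: "pii" in s, "block"),
    Policy("k8s-override", 150,
           lambda s: "kubernetes" in s and "pii" not in s,
           "route:infra-expert"),
]
```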
<p><strong>Impact</strong></p>
<p>This framework enables:</p>
<ul>
<li class="">Fast deterministic routing for technology-specific queries</li>
<li class="">Guaranteed compliance with safety and regulatory requirements</li>
<li class="">Semantic intent detection that complements BERT classification</li>
<li class="">Specialized domain classification for vertical-specific routing (legal, medical, financial)</li>
<li class="">Flexible deployment options with both in-tree and out-of-tree execution paths</li>
<li class="">Graceful degradation and backward compatibility with existing routing</li>
</ul>
<p>The Semantic Chain for Fusion Intelligent Routing represents a fundamental shift from pure ML-based routing to a hybrid approach that leverages the best of deterministic, pattern-based, semantic, and domain-specific classification methods.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-extensible-serving-architecture-modular-candle-binding-for-mom-family">2. Extensible Serving Architecture: Modular Candle-Binding for MoM Family<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#2-extensible-serving-architecture-modular-candle-binding-for-mom-family" class="hash-link" aria-label="Direct link to 2. Extensible Serving Architecture: Modular Candle-Binding for MoM Family" title="Direct link to 2. Extensible Serving Architecture: Modular Candle-Binding for MoM Family" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>Our Rust-based candle-binding codebase has grown organically into a 2,600+ line monolithic structure. This architecture was designed for a handful of models, but now faces a critical challenge: supporting the entire <strong>MoM (Mixture of Models) Family</strong> with its diverse model architectures, specialized classifiers, and LoRA-adapted variants. The current monolithic design makes it nearly impossible to efficiently serve multiple model types simultaneously.</p>
<p><strong>The Vision</strong></p>
<p>We're restructuring the candle-binding into an <strong>extensible serving architecture</strong> specifically designed to support the MoM Family's diverse model ecosystem. This modular design enables seamless addition of new MoM models without code changes, efficient multi-model serving, and clear separation between model architectures and serving logic.</p>
<p><strong>Layered Architecture for MoM Models</strong></p>
<ul>
<li class=""><strong>Core Layer</strong>: Unified error handling, configuration management, device initialization, and weight loading shared across all MoM models</li>
<li class=""><strong>Model Architectures Layer</strong>: Modular implementations of BERT (mom-similarity-flash, mom-pii-flash, mom-jailbreak-flash), ModernBERT, and Qwen3 (mom-brain-pro/max, mom-expert-* series) with extensible traits for future MoM additions</li>
<li class=""><strong>Classifiers Layer</strong>: Specialized implementations for sequence classification (intent routing), token classification (PII/jailbreak detection), and LoRA support (fine-tuned MoM experts)</li>
<li class=""><strong>FFI Layer</strong>: Centralized memory safety checks and C-compatible interfaces for Go integration</li>
</ul>
<p><strong>Impact</strong></p>
<p>This extensible architecture enables:</p>
<ul>
<li class=""><strong>Rapid MoM Model Deployment</strong>: Add new MoM models (mom-expert-math-flash, mom-brain-max) by implementing standard traits</li>
<li class=""><strong>Efficient Multi-Model Serving</strong>: Serve multiple MoM models simultaneously with shared infrastructure</li>
<li class=""><strong>LoRA Support</strong>: Native support for LoRA-adapted MoM experts with high-confidence routing</li>
<li class=""><strong>Backward Compatibility</strong>: Existing Go bindings continue to work without changes</li>
</ul>
<p>This transformation positions the serving layer as a scalable foundation for the entire MoM Family ecosystem, enabling us to rapidly expand our model offerings while maintaining performance and reliability.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-model-unification-the-mom-mixture-of-models-family">3. Model Unification: The MoM (Mixture of Models) Family<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#3-model-unification-the-mom-mixture-of-models-family" class="hash-link" aria-label="Direct link to 3. Model Unification: The MoM (Mixture of Models) Family" title="Direct link to 3. Model Unification: The MoM (Mixture of Models) Family" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>Despite developing a comprehensive family of specialized routing models, our codebase still references legacy models scattered across configuration files. This fragmentation creates confusion, inconsistent performance, and a steep learning curve for new users.</p>
<p><strong>The Solution</strong></p>
<p>We're migrating the entire system to use the <strong>MoM Family</strong> as the primary built-in models:</p>
<ul>
<li class=""><strong>🧠 Intelligent Routing</strong>: mom-brain-flash/pro/max for intent classification with clear latency-accuracy trade-offs</li>
<li class=""><strong>🔍 Similarity Search</strong>: mom-similarity-flash for semantic matching</li>
<li class=""><strong>🔒 Prompt Guardian</strong>: mom-jailbreak-flash and mom-pii-flash for security and privacy</li>
<li class=""><strong>🎯 SLM Experts</strong>: Specialized models for math, science, social sciences, humanities, law, and general tasks</li>
</ul>
<p><strong>Key Features</strong></p>
<ul>
<li class=""><strong>Centralized Registry</strong>: Single source of truth for all MoM models with metadata, capabilities, and recommended use cases</li>
<li class=""><strong>Simplified Configuration</strong>: Reference models by name (<code>mom-brain-flash</code>) instead of complex paths</li>
<li class=""><strong>Auto-Discovery</strong>: Intelligent model detection and validation</li>
<li class=""><strong>Performance Optimization</strong>: All MoM models are specifically trained and optimized for vLLM-SR routing tasks</li>
</ul>
<p>This unification provides users with a clear, consistent model selection experience while ensuring optimal performance for every routing scenario.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-architectural-evolution-model-based-routing-core">4. Architectural Evolution: Model-Based Routing Core<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#4-architectural-evolution-model-based-routing-core" class="hash-link" aria-label="Direct link to 4. Architectural Evolution: Model-Based Routing Core" title="Direct link to 4. Architectural Evolution: Model-Based Routing Core" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>Our current routing implementation, inherited from traditional cluster-based approaches, has reached its architectural limits. The tight coupling between routing logic and cluster management prevents us from supporting the diverse LLM deployment scenarios that modern AI applications demand—from hybrid cloud deployments to multi-provider orchestration.</p>
<p><strong>The Vision</strong></p>
<p>We're reimagining our routing architecture with a clean separation of concerns: semantic routing focuses purely on intelligent model selection, while traffic management is delegated to the AI Gateway layer. This modular approach transforms the semantic router into a global external processor that operates transparently within the gateway infrastructure.</p>
<p><strong>Key Capabilities</strong></p>
<ul>
<li class=""><strong>Universal Connectivity</strong>: Support for HTTPS, HTTP, IP-based, and DNS-based connections to any LLM provider</li>
<li class=""><strong>Hybrid Routing</strong>: Seamlessly route between in-cluster services and external providers (Claude, Gemini, DeepSeek, etc.)</li>
<li class=""><strong>Advanced Traffic Management</strong>: Model-level failover, weighted distribution, circuit breaking, and health checks</li>
<li class=""><strong>Enterprise Features</strong>: Built-in authentication, retry mechanisms, and token-based rate limiting</li>
</ul>
<p>This architectural shift enables vLLM Semantic Router to scale from single-cluster deployments to global, multi-cloud AI infrastructures while maintaining the simplicity and performance that users expect.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-next-generation-api-openai-responses-api-support">5. Next-Generation API: OpenAI Responses API Support<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#5-next-generation-api-openai-responses-api-support" class="hash-link" aria-label="Direct link to 5. Next-Generation API: OpenAI Responses API Support" title="Direct link to 5. Next-Generation API: OpenAI Responses API Support" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>The traditional Chat Completions API (<code>/v1/chat/completions</code>) is stateless and designed for single-turn interactions. Modern AI applications—especially agents, multi-turn conversations, and agentic workflows—require stateful interactions, advanced tool orchestration, and long-running background tasks. Without Responses API support, vLLM Semantic Router cannot intelligently route these next-generation workloads.</p>
<p><strong>The Vision</strong></p>
<p>Add comprehensive support for the OpenAI Responses API (<code>/v1/responses</code>), enabling intelligent routing for stateful, multi-turn, and agentic LLM workflows while preserving all advanced features of the API.</p>
<p><strong>Key Capabilities</strong></p>
<p><strong>Stateful Conversations</strong></p>
<ul>
<li class="">Built-in conversation state management with <code>previous_response_id</code> chaining</li>
<li class="">Automatic context preservation across multiple turns</li>
<li class="">Intelligent routing that maintains conversation context and intent classification history</li>
</ul>
<p><strong>Advanced Tool Orchestration</strong></p>
<ul>
<li class="">Native support for code interpreter with container management</li>
<li class="">Function calling and tool execution routing</li>
<li class="">Image generation and editing capabilities</li>
<li class="">MCP (Model Context Protocol) server integration for external tools</li>
<li class="">File uploads and processing (PDFs, images, structured data)</li>
</ul>
<p><strong>Agentic Workflows</strong></p>
<ul>
<li class="">Background task processing for long-running agent operations</li>
<li class="">Asynchronous execution with polling support for complex reasoning tasks</li>
<li class="">Resumable streaming with sequence tracking for dropped connections</li>
<li class="">Support for reasoning models (o1, o3, o4-mini) with encrypted reasoning items</li>
</ul>
<p><strong>Semantic Routing Integration</strong></p>
<ul>
<li class="">Extract and classify intent from Responses API <code>input</code> field (text, messages, or mixed content)</li>
<li class="">Apply intelligent model selection based on conversation history and tool requirements</li>
<li class="">Route multi-turn conversations to models optimized for stateful interactions</li>
<li class="">Preserve VSR (vLLM Semantic Router) headers for routing metadata across response chains</li>
</ul>
<p><strong>Impact</strong></p>
<p>Responses API support positions vLLM Semantic Router at the forefront of agentic AI infrastructure:</p>
<ul>
<li class="">Enable routing for modern agent frameworks and multi-turn applications</li>
<li class="">Support complex workflows requiring code execution, file processing, and external tool integration</li>
<li class="">Provide intelligent model selection for reasoning-heavy tasks and long-running operations</li>
<li class="">Maintain semantic router's value proposition (cost optimization, latency reduction) for next-generation LLM APIs</li>
</ul>
<p>This capability is essential for vLLM Semantic Router to remain relevant as the industry shifts from simple chat completions to sophisticated, stateful, tool-augmented AI agents.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-intelligent-mcp-gateway-smart-tool-management-and-selection">6. Intelligent MCP Gateway: Smart Tool Management and Selection<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#6-intelligent-mcp-gateway-smart-tool-management-and-selection" class="hash-link" aria-label="Direct link to 6. Intelligent MCP Gateway: Smart Tool Management and Selection" title="Direct link to 6. Intelligent MCP Gateway: Smart Tool Management and Selection" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>As AI agents increasingly rely on external tools and services through the Model Context Protocol (MCP), managing and selecting the right tools for each task becomes critical. Current approaches lack intelligent tool discovery, selection optimization, and centralized management, leading to inefficient tool usage and increased latency in agentic workflows.</p>
<p><strong>The Innovation</strong></p>
<p>We're introducing an <strong>Intelligent MCP Gateway</strong> that serves as a unified control plane for MCP tools with smart selection capabilities:</p>
<p><strong>MCP Tool Management</strong></p>
<ul>
<li class=""><strong>Centralized Registry</strong>: Unified catalog of available MCP servers and tools with metadata, capabilities, and performance characteristics</li>
<li class=""><strong>Dynamic Discovery</strong>: Automatic detection and registration of MCP servers in the cluster</li>
<li class=""><strong>Health Monitoring</strong>: Real-time health checks and availability tracking for all registered MCP tools</li>
<li class=""><strong>Version Management</strong>: Support for multiple versions of MCP tools with seamless upgrades and rollbacks</li>
</ul>
<p><strong>Intelligent Tool Selection</strong></p>
<ul>
<li class=""><strong>Semantic Matching</strong>: Analyze user intent and task requirements to automatically select the most appropriate tools</li>
<li class=""><strong>Context-Aware Routing</strong>: Consider conversation history, user preferences, and task complexity for tool selection</li>
<li class=""><strong>Performance Optimization</strong>: Route tool requests based on latency, cost, and success rate metrics</li>
<li class=""><strong>Fallback Strategies</strong>: Automatic failover to alternative tools when primary options are unavailable</li>
</ul>
<p><strong>Integration with Fusion Routing</strong></p>
<ul>
<li class="">Seamlessly integrate with the Semantic Chain for unified routing decisions</li>
<li class="">Combine tool selection with model selection for optimal agentic workflows</li>
<li class="">Support both in-tree (low-latency) and out-of-tree (MCP server) tool execution paths</li>
</ul>
<p><strong>Impact</strong></p>
<p>The Intelligent MCP Gateway enables:</p>
<ul>
<li class="">Simplified tool management for complex agentic applications</li>
<li class="">Reduced latency through intelligent tool selection and caching</li>
<li class="">Improved reliability with automatic failover and health monitoring</li>
<li class="">Enhanced developer experience with centralized tool discovery and configuration</li>
<li class="">Cost optimization by routing to the most efficient tools for each task</li>
</ul>
<p>This gateway positions vLLM Semantic Router as a comprehensive orchestration layer for modern AI agents, managing not just model selection but also the tools and services that agents rely on.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="7-enterprise-readiness-production-deployment-tools">7. Enterprise Readiness: Production Deployment Tools<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#7-enterprise-readiness-production-deployment-tools" class="hash-link" aria-label="Direct link to 7. Enterprise Readiness: Production Deployment Tools" title="Direct link to 7. Enterprise Readiness: Production Deployment Tools" translate="no">​</a></h3>
<p><strong>The Challenge</strong></p>
<p>While vLLM Semantic Router works well for experimental deployments, production adoption requires professional-grade deployment tools, comprehensive monitoring, and intuitive management interfaces.</p>
<p><strong>The Deliverables</strong></p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="helm-chart-support">Helm Chart Support<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#helm-chart-support" class="hash-link" aria-label="Direct link to Helm Chart Support" title="Direct link to Helm Chart Support" translate="no">​</a></h4>
<p>Professional Kubernetes deployment with:</p>
<ul>
<li class="">Templated manifests for all resources</li>
<li class="">Values-driven configuration for different environments</li>
<li class="">Built-in versioning and rollback capabilities</li>
<li class="">Best practices for security, scaling, and resource management</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="modern-management-dashboard">Modern Management Dashboard<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#modern-management-dashboard" class="hash-link" aria-label="Direct link to Modern Management Dashboard" title="Direct link to Modern Management Dashboard" translate="no">​</a></h4>
<p>A comprehensive web-based control plane featuring:</p>
<ul>
<li class=""><strong>Visual Route Builder</strong>: Drag-and-drop interface for creating SemanticRoute configurations</li>
<li class=""><strong>Interactive Playground</strong>: Test routing decisions, compare models, and visualize filter chains</li>
<li class=""><strong>Real-time Monitoring</strong>: Live metrics, request tracing, and health status</li>
<li class=""><strong>Analytics &amp; Insights</strong>: Cost analysis, performance benchmarks, and routing effectiveness</li>
<li class=""><strong>User Management</strong>: Role-based access control, API key management, and audit logs</li>
</ul>
<p>These enterprise features will dramatically lower the barrier to entry, improve operational efficiency, and make vLLM Semantic Router accessible to organizations of all sizes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ecosystem-integration">Ecosystem Integration<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#ecosystem-integration" class="hash-link" aria-label="Direct link to Ecosystem Integration" title="Direct link to Ecosystem Integration" translate="no">​</a></h2>
<p>Beyond the seven core pillars, we're actively exploring integrations with key platforms in the AI infrastructure ecosystem. These integrations are <strong>work-in-progress and good-to-have</strong> features that will expand vLLM Semantic Router's reach and interoperability:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-production-stack">vLLM Production Stack<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#vllm-production-stack" class="hash-link" aria-label="Direct link to vLLM Production Stack" title="Direct link to vLLM Production Stack" translate="no">​</a></h3>
<p><a href="https://docs.vllm.ai/projects/production-stack" target="_blank" rel="noopener noreferrer" class="">vLLM Production Stack</a> is vLLM's reference system for Kubernetes-native cluster-wide deployment with community-driven performance optimization. It provides a reference implementation on how to build an inference stack on top of vLLM with Helm chart-based deployment.</p>
<p>Deep integration with the vLLM Production Stack will enable seamless model serving, monitoring, and orchestration. This integration will provide native support for vLLM's advanced features like PagedAttention, continuous batching, and optimized CUDA kernels, ensuring maximum performance for production workloads.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d">llm-d<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#llm-d" class="hash-link" aria-label="Direct link to llm-d" title="Direct link to llm-d" translate="no">​</a></h3>
<p><a href="https://llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">llm-d</a> is a Kubernetes-native high-performance distributed LLM inference framework built on vLLM. Founded by Red Hat, Google Cloud, CoreWeave, and IBM Research, with contributions from NVIDIA, Hugging Face, Intel, Lambda, and Mistral AI, llm-d provides well-lit paths for anyone to serve large generative AI models at scale with distributed inference capabilities.</p>
<p>Integration with llm-d will bring intelligent semantic routing to Kubernetes-native distributed inference deployments. This partnership will enable llm-d users to leverage MoM models and fusion routing for efficient model selection across distributed inference clusters, optimizing resource utilization and performance in cloud-native environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="nvidia-dynamo">NVIDIA Dynamo<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#nvidia-dynamo" class="hash-link" aria-label="Direct link to NVIDIA Dynamo" title="Direct link to NVIDIA Dynamo" translate="no">​</a></h3>
<p><a href="https://developer.nvidia.com/dynamo" target="_blank" rel="noopener noreferrer" class="">NVIDIA Dynamo</a> is NVIDIA's high-performance, low-latency inference platform that supports all major frameworks including TensorRT-LLM. It delivers scalable, efficient inference with GPU optimization and includes intelligent inference optimizations that boost token generation performance by over 30x per GPU, with support for advanced features like disaggregated serving.</p>
<p>Integration with NVIDIA Dynamo will leverage cutting-edge GPU acceleration and optimization frameworks to deliver industry-leading latency and throughput for semantic routing operations. This partnership will enable seamless deployment of MoM models on NVIDIA-accelerated infrastructure with optimal performance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-aibrix">vLLM AIBrix<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#vllm-aibrix" class="hash-link" aria-label="Direct link to vLLM AIBrix" title="Direct link to vLLM AIBrix" translate="no">​</a></h3>
<p><a href="https://github.com/vllm-project/aibrix" target="_blank" rel="noopener noreferrer" class="">vLLM AIBrix</a> is an open-source initiative designed to provide essential building blocks to construct scalable GenAI inference infrastructure. As a cloud-native framework, AIBrix serves as an infrastructure orchestrator and workload control plane, offering cost-efficient and pluggable components for large-scale LLM serving with simplified deployment and management.</p>
<p>Collaboration with vLLM AIBrix will enable unified control planes, advanced observability, and streamlined deployment workflows across hybrid and multi-cloud environments. This integration will make it easier for enterprises to adopt and scale vLLM Semantic Router with production-ready infrastructure components.</p>
<hr>
<p>These ecosystem integrations represent our commitment to building an open, interoperable platform that works seamlessly with the broader AI infrastructure landscape. While not required for the v0.1 release, they demonstrate our vision for vLLM Semantic Router as a foundational component in modern AI stacks.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="timeline-and-release-plan">Timeline and Release Plan<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#timeline-and-release-plan" class="hash-link" aria-label="Direct link to Timeline and Release Plan" title="Direct link to Timeline and Release Plan" translate="no">​</a></h2>
<p><strong>v0.1 "Iris" Release (Late 2025 - Early 2026):</strong></p>
<ul>
<li class="">All P0 priority issues resolved</li>
<li class="">Seven foundational pillars fully implemented</li>
<li class="">Comprehensive documentation and migration guides</li>
<li class="">Production-ready deployment tools (Helm charts, dashboard)</li>
<li class="">Full Responses API, Intelligent MCP Gateway, and Semantic Chain for Fusion Intelligent Routing support</li>
<li class="">Community celebration and feedback collection</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-beyond-iris">Looking Beyond Iris<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#looking-beyond-iris" class="hash-link" aria-label="Direct link to Looking Beyond Iris" title="Direct link to Looking Beyond Iris" translate="no">​</a></h2>
<p>The Iris release establishes the foundation, but our vision extends far beyond v0.1. Future releases will introduce:</p>
<ul>
<li class="">Advanced multi-model orchestration strategies</li>
<li class="">Federated routing across distributed clusters</li>
<li class="">Enhanced reasoning capabilities and chain-of-thought routing</li>
<li class="">Deeper integration with the broader vLLM ecosystem</li>
</ul>
<p>Each release will carry its own mythological codename, reflecting the unique character and capabilities it brings to the project.</p>
<p><img decoding="async" loading="lazy" alt="iris" src="https://vllm-semantic-router.com/assets/images/code-a6785ec77ea30bfadcf616ef5e763191.png" width="2265" height="780" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://vllm-semantic-router.com/blog/q4-roadmap-iris#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h2>
<p>This roadmap represents our commitment to building production-ready AI infrastructure, but we can't do it alone. We invite the community to:</p>
<ul>
<li class=""><strong>Review and provide feedback</strong> on the P0 issues</li>
<li class=""><strong>Contribute code</strong> to any of the initiatives</li>
<li class=""><strong>Test early releases</strong> and share your experiences</li>
<li class=""><strong>Suggest improvements</strong> to the roadmap</li>
</ul>
<p>Together, we're building the bridge that will connect the next generation of AI applications to the models they need—just as Iris connected the realms of gods and mortals.</p>
<hr>
<p><strong>Follow our progress:</strong></p>
<ul>
<li class="">GitHub: <a href="https://github.com/vllm-project/semantic-router" target="_blank" rel="noopener noreferrer" class="">vllm-project/semantic-router</a></li>
<li class="">Issues: <a href="https://github.com/vllm-project/semantic-router/issues?q=is%3Aissue+state%3Aopen+label%3Apriority%2FP0" target="_blank" rel="noopener noreferrer" class="">P0 Priority Issues</a></li>
</ul>
<p><em>The rainbow bridge awaits. Let's build it together.</em> 🌈</p>]]></content>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <author>
            <name>Chen Wang</name>
            <uri>https://github.com/wangchen615</uri>
        </author>
        <author>
            <name>Yue Zhu</name>
            <uri>https://github.com/yuezhu1</uri>
        </author>
        <category label="roadmap" term="roadmap"/>
        <category label="release" term="release"/>
        <category label="iris" term="iris"/>
        <category label="v0.1" term="v0.1"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[vLLM Semantic Router: Next Phase in LLM inference]]></title>
        <id>https://vllm-semantic-router.com/blog/welcome</id>
        <link href="https://vllm-semantic-router.com/blog/welcome"/>
        <updated>2025-09-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[code]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="code" src="https://vllm-semantic-router.com/assets/images/code-a6785ec77ea30bfadcf616ef5e763191.png" width="2265" height="780" class="img_ev3q"></p>
<p>Synced from official vLLM Blog: <a href="https://blog.vllm.ai/2025/09/11/semantic-router.html" target="_blank" rel="noopener noreferrer" class="">vLLM Semantic Router: Next Phase in LLM inference</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="industry-status-inference--more-is-better">Industry Status: Inference ≠ More Is Better<a href="https://vllm-semantic-router.com/blog/welcome#industry-status-inference--more-is-better" class="hash-link" aria-label="Direct link to Industry Status: Inference ≠ More Is Better" title="Direct link to Industry Status: Inference ≠ More Is Better" translate="no">​</a></h2>
<p>Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure—shifting the debate from raw scale to per-token efficiency, latency control, and targeted compute use.</p>
<p>Take GPT-5 for example: its standout innovation lies not in sheer parameters, but in routing policies and quota-based reasoning:</p>
<ul>
<li class="">Light queries → lightweight paths: trivial prompts like “Why is the sky blue?” don’t trigger expensive reasoning.</li>
<li class="">Complex/high-value queries → reasoning-enabled models: multi-step tasks—like legal analysis or financial planning—are routed to Chain-of-Thought–enabled inference.</li>
</ul>
<p>This represents a broader principle of task-aware compute allocation, where every inference token must contribute meaningful value—not just be consumed.</p>
<p>Similar ideas are appearing in other systems:</p>
<ul>
<li class="">Anthropic Claude 3.7/4: differentiates “fast thinking” and “slow thinking” pathways.</li>
<li class="">Google Gemini 2.5: offers explicit <em>thinking budgets</em>, allowing enterprises to cap reasoning depth.</li>
<li class="">Alibaba Qwen3: supports instruction-driven switching between reasoning and non-reasoning modes.</li>
<li class="">DeepSeek v3.1: merges conversational and reasoning flows within a dual-mode single model.</li>
</ul>
<p>The trend is clear: future inference systems will be defined by selectivity and intelligence, not just model size.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="recent-research-vllm-semantic-router">Recent Research: vLLM Semantic Router<a href="https://vllm-semantic-router.com/blog/welcome#recent-research-vllm-semantic-router" class="hash-link" aria-label="Direct link to Recent Research: vLLM Semantic Router" title="Direct link to Recent Research: vLLM Semantic Router" translate="no">​</a></h2>
<p>Responding to this shift, the vLLM Semantic Router offers an open-source, intent-aware routing layer for the highly efficient vLLM inference engine.</p>
<p>vLLM enables scalable LLM serving—but lacks semantic decision-making around reasoning. Developers face a trade-off:</p>
<ul>
<li class="">Enable reasoning always → accuracy increases, but so does cost.</li>
<li class="">Disable reasoning → cost drops, but accuracy suffers on complex tasks.</li>
</ul>
<p>The Semantic Router fills this gap by classifying queries semantically and routing them appropriately, giving accurate results where needed and efficiency where reasoning is unnecessary.</p>
<p><img decoding="async" loading="lazy" alt="architecture" src="https://vllm-semantic-router.com/assets/images/architecture-8d985f8331c18394e6c8e220c1e2da3f.png" width="2700" height="1913" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture-design">Architecture Design<a href="https://vllm-semantic-router.com/blog/welcome#architecture-design" class="hash-link" aria-label="Direct link to Architecture Design" title="Direct link to Architecture Design" translate="no">​</a></h3>
<p>The system comprises four pillars:</p>
<ol>
<li class="">Semantic Classification: Uses ModernBERT—currently a lightweight, standalone classifier integrated into the router—to determine routing paths.</li>
<li class="">Smart Routing:<!-- -->
<ul>
<li class="">Simple queries → "fast path" inference.</li>
<li class="">Complex queries → "Chain-of-Thought" reasoning mode.</li>
</ul>
</li>
<li class="">High-Performance Engine: Written in Rust using Hugging Face Candle, it delivers high concurrency and zero-copy inference.</li>
<li class="">Cloud-Native Integration: Works out-of-the-box with Kubernetes and Envoy via the <code>ext_proc</code> plugin.</li>
</ol>
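<p>To make the fast-path vs. reasoning split concrete, here is a minimal, hypothetical sketch of the routing decision. The class names, model names, and the toy keyword classifier are illustrative stand-ins for the ModernBERT classifier, not the router's real API:</p>

```python
# Hypothetical sketch of the fast-path vs. reasoning routing decision.
# All names and thresholds here are illustrative, not the router's real API.
from dataclasses import dataclass


@dataclass
class RouteDecision:
    model: str
    reasoning: bool


def classify(query: str) -> dict[str, float]:
    # Toy keyword scorer standing in for the ModernBERT classifier output.
    complex_markers = ("prove", "analyze", "plan", "derive", "legal")
    score = sum(m in query.lower() for m in complex_markers) / len(complex_markers)
    return {"complex": score, "simple": 1.0 - score}


def route(query: str, threshold: float = 0.2) -> RouteDecision:
    scores = classify(query)
    if scores["complex"] >= threshold:
        # Complex, high-value queries go to a reasoning-enabled model.
        return RouteDecision(model="reasoning-model", reasoning=True)
    # Trivial prompts take the fast path with reasoning disabled.
    return RouteDecision(model="fast-model", reasoning=False)


print(route("Why is the sky blue?"))
print(route("Analyze the legal risk in this contract"))
```

<p>A trivial prompt such as “Why is the sky blue?” resolves to the fast path, while the legal-analysis prompt crosses the threshold and is routed with reasoning enabled.</p>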
<p>In trials, this design yielded:</p>
<ul>
<li class="">~10% higher accuracy</li>
<li class="">~50% lower latency</li>
<li class="">~50% fewer tokens</li>
</ul>
<p>In business and economics domains, accuracy gains exceeded 20%.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="challenges-in-execution-budgets-and-tool-calling">Challenges in Execution: Budgets and Tool Calling<a href="https://vllm-semantic-router.com/blog/welcome#challenges-in-execution-budgets-and-tool-calling" class="hash-link" aria-label="Direct link to Challenges in Execution: Budgets and Tool Calling" title="Direct link to Challenges in Execution: Budgets and Tool Calling" translate="no">​</a></h2>
<p>Two technical constraints are important to address:</p>
<ul>
<li class="">Reasoning Budget Costs<br>
<!-- -->Unlimited reasoning inflates cold-start latency and resource usage. Without dynamic control, simple queries may over-consume tokens while critical queries may not get deep reasoning when they need it. Enforcing SLOs such as time-to-first-token (TTFT) and p95 latency is therefore necessary, ideally with the ability to adapt budgets mid-inference.</li>
<li class="">Tool Calling Constraints<br>
<!-- -->Adding more tools (so-called “tool catalog bloat”) or longer tool outputs can drastically reduce accuracy. The router must pre-filter tools and keep catalogs tight.</li>
</ul>
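<p>The pre-filtering idea can be sketched as ranking the catalog by semantic similarity to the query and keeping only the top few entries. The bag-of-characters embedding below is a deliberately toy stand-in for a real embedding model, and the function and catalog names are hypothetical:</p>

```python
# Illustrative sketch of pre-filtering a tool catalog before a model call.
# embed() is a toy stand-in; a real deployment would use the router's
# classifier or an embedding service. All names here are hypothetical.
import math


def embed(text: str) -> list[float]:
    # Toy bag-of-characters embedding, purely for demonstration.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_tools(query: str, catalog: list[dict], top_k: int = 2) -> list[dict]:
    """Keep only the top_k tools most similar to the query,
    so the prompt stays small and accuracy does not degrade."""
    q = embed(query)
    ranked = sorted(catalog, key=lambda t: cosine(q, embed(t["description"])), reverse=True)
    return ranked[:top_k]


catalog = [
    {"name": "get_weather", "description": "current weather forecast for a city"},
    {"name": "search_flights", "description": "search airline flights and fares"},
    {"name": "stock_quote", "description": "latest stock market price quote"},
]
print([t["name"] for t in select_tools("What is the weather in Paris?", catalog)])
```

<p>Whatever the embedding used, the key design point is that the model only ever sees the filtered subset, keeping the catalog tight regardless of how many tools are registered.</p>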
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="project-background">Project Background<a href="https://vllm-semantic-router.com/blog/welcome#project-background" class="hash-link" aria-label="Direct link to Project Background" title="Direct link to Project Background" translate="no">​</a></h2>
<p>The Semantic Router evolved from contributions across the open-source community:</p>
<ul>
<li class="">Proposed in early 2025 by <a href="https://www.linkedin.com/in/huaminchen" target="_blank" rel="noopener noreferrer" class="">Dr. Huamin Chen</a> (Red Hat)</li>
<li class="">Further developed by <a href="https://www.linkedin.com/in/bitliu" target="_blank" rel="noopener noreferrer" class="">Xunzhuo Liu</a> (Tencent)</li>
<li class="">To be presented by <a href="https://www.linkedin.com/in/chenw615" target="_blank" rel="noopener noreferrer" class="">Dr. Chen Wang</a> (IBM Research) and Dr. Huamin Chen at <a href="https://kccncna2025.sched.com/event/27FaI/intelligent-llm-routing-a-new-paradigm-for-multi-model-ai-orchestration-in-kubernetes-chen-wang-ibm-research-huamin-chen-red-hat?iframe=no&amp;w=100%25&amp;sidebar=yes&amp;bg=no" target="_blank" rel="noopener noreferrer" class="">KubeCon North America 2025</a></li>
</ul>
<p>Our goal: provide inference acceleration for open-source LLMs through:</p>
<ul>
<li class="">Semantic-aware routing</li>
<li class="">Efficient model switching</li>
<li class="">Enterprise-friendly deployment (Kubernetes &amp; Envoy)</li>
</ul>
<p>Find the project on <a href="https://github.com/vllm-project/semantic-router" target="_blank" rel="noopener noreferrer" class="">GitHub</a>. Current work is coordinated through a <a href="https://vllm-semantic-router.com/community/work-groups" target="_blank" rel="noopener noreferrer" class="">Work Group</a> and the planned <a href="https://vllm-semantic-router.com/roadmap/v0.1" target="_blank" rel="noopener noreferrer" class="">v0.1 Roadmap</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="integration--future-work-embeddings-and-pluggability">Integration &amp; Future Work: Embeddings and Pluggability<a href="https://vllm-semantic-router.com/blog/welcome#integration--future-work-embeddings-and-pluggability" class="hash-link" aria-label="Direct link to Integration &amp; Future Work: Embeddings and Pluggability" title="Direct link to Integration &amp; Future Work: Embeddings and Pluggability" translate="no">​</a></h2>
<p>Currently, ModernBERT runs internally within the router for classification. It is not yet served by vLLM. However, future work aims to make the classifier—and potentially other embedding models—pluggable, allowing integration with vLLM-hosted models or external embedding services.</p>
<p>This capability will enhance the semantic cache and enable smoother inference customization.</p>
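<p>One way to picture such pluggability is a small interface that both the in-process classifier and a remote embedding service could satisfy, so the semantic cache depends only on the interface. None of the names below come from the project; they are a hypothetical sketch of the idea:</p>

```python
# Hypothetical interface for a pluggable embedding backend. None of these
# names come from the project; they only illustrate the pluggability idea.
from typing import Protocol


class EmbeddingBackend(Protocol):
    def embed(self, text: str) -> list[float]: ...


class InProcessBackend:
    """Stands in for the classifier embedded in the router today."""

    def embed(self, text: str) -> list[float]:
        # Toy features: length and word-gap count, for demonstration only.
        return [float(len(text)), float(text.count(" "))]


class RemoteBackend:
    """Stands in for a vLLM-hosted or external embedding service."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # e.g. an OpenAI-compatible embeddings URL

    def embed(self, text: str) -> list[float]:
        # A real implementation would call self.endpoint; stubbed out here.
        return [float(len(text)), 0.0]


def cache_key(backend: EmbeddingBackend, query: str) -> tuple[float, ...]:
    # The semantic cache only depends on the interface, not the backend.
    return tuple(backend.embed(query))


print(cache_key(InProcessBackend(), "hello world"))
```

<p>Because the cache keys off the interface rather than a concrete model, swapping the in-process classifier for a vLLM-hosted embedding model would not require changes to the caching layer.</p>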
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="roadmap-v01-milestone-highlights">Roadmap: v0.1 Milestone Highlights<a href="https://vllm-semantic-router.com/blog/welcome#roadmap-v01-milestone-highlights" class="hash-link" aria-label="Direct link to Roadmap: v0.1 Milestone Highlights" title="Direct link to Roadmap: v0.1 Milestone Highlights" translate="no">​</a></h2>
<p>The <a href="https://github.com/vllm-project/semantic-router/milestone/1" target="_blank" rel="noopener noreferrer" class="">v0.1 milestone</a> will expand the project’s technical capabilities:</p>
<ul>
<li class="">Core: ExtProc-based modularity, semantic caching across backends, multi-factor routing logic</li>
<li class="">Benchmarking: CLI tools, performance testing suite, reasoning-mode evaluation</li>
<li class="">Networking: Deeper integration with Envoy, GIE, and llm-d gateways</li>
<li class="">Observability &amp; UX: Admin dashboards, routing policy visualization, developer quickstarts, and policy cookbook</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-trends-just-in-time-inference">Future Trends: Just-in-Time Inference<a href="https://vllm-semantic-router.com/blog/welcome#future-trends-just-in-time-inference" class="hash-link" aria-label="Direct link to Future Trends: Just-in-Time Inference" title="Direct link to Future Trends: Just-in-Time Inference" translate="no">​</a></h2>
<p>The field is maturing from <em>“Can we run inference?”</em> to <em>“How can inference be smarter?”</em></p>
<ul>
<li class="">GPT-5 uses commercial value to guide reasoning depth.</li>
<li class="">vLLM Semantic Router delivers that capability to open source.</li>
</ul>
<p>Looking ahead, systems that adapt their inference strategy on the fly, without manual toggles, will lead in efficiency, latency, and sustainability.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="one-sentence-summary">One-Sentence Summary<a href="https://vllm-semantic-router.com/blog/welcome#one-sentence-summary" class="hash-link" aria-label="Direct link to One-Sentence Summary" title="Direct link to One-Sentence Summary" translate="no">​</a></h2>
<ul>
<li class="">GPT-5: enterprise routing for smarter inference</li>
<li class="">vLLM Semantic Router: technical-first routing for open-source LLMs</li>
<li class="">Edge future: context-aware, minimal-compute inference that works seamlessly</li>
</ul>]]></content>
        <author>
            <name>Huamin Chen</name>
            <uri>https://github.com/rootfs</uri>
        </author>
        <author>
            <name>Chen Wang</name>
            <uri>https://github.com/wangchen615</uri>
        </author>
        <author>
            <name>Yue Zhu</name>
            <uri>https://github.com/yuezhu1</uri>
        </author>
        <author>
            <name>Xunzhuo Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="welcome" term="welcome"/>
        <category label="announcement" term="announcement"/>
        <category label="vllm" term="vllm"/>
        <category label="semantic-router" term="semantic-router"/>
    </entry>
</feed>