LLM Routing Architecture: How to Diagram Model Routing Systems (2026)

How to design and diagram an LLM routing architecture. Covers cost-based, capability-based, latency-based, and semantic routing strategies — plus the router layer, model registry, fallback logic, and prompt templates for generating routing diagrams.

Ryan·Senior AI Engineer

LLM routing — also called model routing or AI routing — is the practice of intelligently directing individual prompts and requests to different language models based on factors like task complexity, cost, latency requirements, and capability fit. In 2026, virtually every company running LLMs in production uses more than one model: a frontier model for high-stakes reasoning, a faster mid-tier model for everyday tasks, and a small cheap model for high-volume classification. Routing is the layer that decides, at request time, which model wins.

Done well, LLM routing can reduce inference costs by 40–70% without any measurable drop in output quality for the majority of requests. Done poorly, it introduces latency, routing errors, and silent quality regressions that are hard to diagnose. This guide covers the core routing strategies, the architectural components that make up a routing layer, and ready-to-use prompt templates for generating accurate LLM routing architecture diagrams with AI.

Why LLM routing architecture matters

The economics of LLM usage have pushed routing to the top of many AI engineering teams' agendas. Frontier models like Claude Opus and GPT-4.5 are extraordinarily capable, but their per-token price makes them economically unviable for every request in a high-volume system. A simple autocomplete suggestion or intent classification does not need a frontier-scale model. It needs a fast, cheap model that returns an answer in under 100 ms for a fraction of a cent.

At the same time, not all requests are simple. Contract analysis, complex code generation, multi-step reasoning, and safety-critical decisions genuinely require the best available model. Routing lets teams use frontier models where they add value and cheaper models everywhere else — without baking that logic into application code.

Latency is the other key driver. Even when cost is not a constraint, a slow model can break a user experience that expects a near-instant response. Routing requests by expected latency budget — fast models for interactive use cases, slower models for background batch jobs — keeps p99 latency within SLA without compromising quality for tasks where users can wait.

Core routing strategies

Cost-based routing

Cost-based routing selects the cheapest model that is likely to produce an acceptable result for the request. The router estimates task difficulty — using heuristics like prompt token count, presence of code, or a fast classifier — and dispatches to a tiered model hierarchy. Simple requests go to a low-cost model (GPT-4o mini, Claude Haiku, Gemini Flash). Only requests that the classifier flags as complex or high-stakes are escalated to a more expensive model. Some systems implement cascading cost routing: they try the cheap model first, evaluate the response confidence, and re-route to a stronger model if confidence is below a threshold.
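The cascading variant described above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the tier names, the prices, the 0.8 confidence threshold, and the stub model functions are all assumptions made up for the example, and a real system would derive confidence from log-probabilities, a verifier model, or a similar signal.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str                  # hypothetical model label
    cost_per_1k_tokens: float  # illustrative price, not a real quote


# A model call returns (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: List[Tuple[Tier, ModelFn]],
            threshold: float = 0.8) -> Tuple[str, List[str]]:
    """Try tiers from cheapest to most expensive; stop at the first confident answer."""
    ordered = sorted(tiers, key=lambda pair: pair[0].cost_per_1k_tokens)
    answer, used = "", []
    for tier, call in ordered:
        answer, confidence = call(prompt)
        used.append(tier.name)
        if confidence >= threshold:
            break
    return answer, used


# Stub models standing in for real provider calls (assumption for the demo).
def cheap(prompt: str) -> Tuple[str, float]:
    return "short answer", 0.9 if len(prompt) < 40 else 0.4


def strong(prompt: str) -> Tuple[str, float]:
    return "careful answer", 0.95


TIERS = [(Tier("strong", 15.0), strong), (Tier("cheap", 0.25), cheap)]
```

Calling `cascade("What is 2+2?", TIERS)` stops at the cheap tier, while a longer prompt falls through to the strong one. Note the trade-off: an escalated request pays for both calls and both latencies, so cascading tends to pay off only when most traffic is answered at the first tier.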

Capability-based routing

Capability-based routing matches request types to models that have demonstrated strength in that domain. A code generation request might always route to a model fine-tuned for programming tasks. A structured data extraction request routes to a model with reliable JSON mode. A multilingual customer support query routes to a model with strong non-English performance. Capability routing is most commonly used when teams maintain a small fleet of specialized fine-tuned models alongside a general-purpose frontier model, and the router needs to choose between them based on what the task actually requires.
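One way to express capability matching is as a set-containment check: each model advertises a set of capabilities, each request declares what it needs, and the router picks the cheapest model that covers the requirement. The sketch below assumes hypothetical model names, capability tags, and relative costs.

```python
from typing import Dict, Set

# Hypothetical fleet: two specialized models plus a general-purpose frontier model.
MODELS: Dict[str, dict] = {
    "code-ft": {"capabilities": {"code"}, "cost": 2.0},
    "extract-ft": {"capabilities": {"json_mode", "extraction"}, "cost": 1.0},
    "frontier": {
        "capabilities": {"code", "json_mode", "extraction", "multilingual", "reasoning"},
        "cost": 15.0,
    },
}


def pick_by_capability(required: Set[str]) -> str:
    """Return the cheapest model whose capabilities cover everything required."""
    candidates = [
        (spec["cost"], name)
        for name, spec in MODELS.items()
        if required <= spec["capabilities"]  # subset test
    ]
    if not candidates:
        raise LookupError(f"no model supports {sorted(required)}")
    return min(candidates)[1]
```

With this table, a request tagged `{"code"}` lands on the specialized model, and anything needing `"multilingual"` falls through to the frontier model because nothing cheaper covers it.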

Latency-based routing

Latency-based routing selects the model based on the response time budget of the calling context. Interactive chat features with a 500 ms SLA route to the fastest available model. Document summarization jobs running in a background queue can afford to route to a slower but more capable model. The router must be aware of real-time provider health and tail latency — not just static benchmarks. Latency routing also handles provider degradation: if a primary provider's P95 latency spikes above a threshold, the router automatically shifts traffic to a faster alternative until the primary recovers.
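The provider-degradation behavior can be sketched with a sliding window of observed latencies per provider. The example below is a simplified illustration: the class and function names are invented for this sketch, the percentile uses the nearest-rank method, and a provider with no recent samples is optimistically treated as healthy, which is an assumption you may not want in production.

```python
import math
import time
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class LatencyWindow:
    """Keeps (timestamp, latency_ms) samples from the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples: Deque[Tuple[float, float]] = deque()

    def _evict(self, now: float) -> None:
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def record(self, latency_ms: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))
        self._evict(now)

    def p95(self, now: Optional[float] = None) -> Optional[float]:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.samples:
            return None
        values = sorted(latency for _, latency in self.samples)
        rank = max(1, math.ceil(0.95 * len(values)))  # nearest-rank percentile
        return values[rank - 1]


def choose_provider(windows: Dict[str, LatencyWindow], preference: List[str],
                    sla_ms: float, now: Optional[float] = None) -> str:
    """Pick the first preferred provider whose recent P95 is within the SLA."""
    for name in preference:
        p95 = windows[name].p95(now)
        if p95 is None or p95 <= sla_ms:
            return name
    return preference[-1]  # everyone is slow: fall back to the last choice
```

Because old samples age out of the window, traffic returns to the primary provider on its own once the slow samples expire, which matches the "until the primary recovers" behavior described above.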

Semantic / task-based routing

Semantic routing — sometimes called task-based routing — uses the content and intent of the request itself to choose a model. A lightweight classifier (often another small LLM or a trained embedding-based classifier) reads the incoming prompt and assigns it a task category: code, summarization, qa, creative_writing, structured_extraction, and so on. Each task category maps to a pre-configured model or model chain in a routing table. This approach is more expressive than cost-only routing because it can route to specialized models even when all models have similar cost, and it decouples the routing logic from both cost and latency concerns.
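A routing table of this kind is essentially a dictionary lookup behind a classifier. In the sketch below the classifier is a deliberately crude keyword heuristic standing in for the small LLM or embedding-based classifier described above, and the model names in the table are placeholders.

```python
from typing import Dict

# Task category -> model. Model names are placeholders, not real endpoints.
ROUTING_TABLE: Dict[str, str] = {
    "code": "coding-model",
    "summarization": "mid-tier-model",
    "structured_extraction": "json-mode-model",
    "qa": "small-model",
}
DEFAULT_MODEL = "frontier-model"  # used when the category is unknown


def classify(prompt: str) -> str:
    """Toy stand-in for a trained task classifier (keyword rules only)."""
    text = prompt.lower()
    if "def " in text or "function" in text:
        return "code"
    if "summarize" in text or "tl;dr" in text:
        return "summarization"
    if "extract" in text and "json" in text:
        return "structured_extraction"
    if text.rstrip().endswith("?"):
        return "qa"
    return "unknown"


def route(prompt: str) -> str:
    """Map the predicted task category to a model via the routing table."""
    return ROUTING_TABLE.get(classify(prompt), DEFAULT_MODEL)
```

Keeping the table as plain data is what makes it possible to change routing without redeploying code: only `classify` needs to be a model, and the mapping itself can live in a config service.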

LLM routing architecture components

The router layer

The router layer is the central decision-making component. It receives every request before it reaches any LLM, applies the routing policy (cost, capability, latency, semantic, or a combination), and forwards the request to the selected model. In most production architectures the router is implemented as a sidecar proxy or a dedicated service that exposes an OpenAI-compatible API endpoint — so application code does not need to change when routing rules evolve. The router also injects request metadata (tenant ID, user tier, feature flag state) that downstream policies can use in routing decisions.

Model registry

The model registry is the source of truth for which models are available, their capabilities, cost per token, expected latency, and current health status. The router consults the registry at request time to resolve routing decisions. The registry is also where operators define routing tiers: models are grouped by capability level so the router can escalate or de-escalate between tiers, or swap models within a tier, without needing to know individual model details. In teams managing self-hosted open-weight models alongside provider APIs, the registry stores serving endpoints, quantization levels, and GPU utilization thresholds alongside provider credentials.
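As a rough illustration, a registry can be modeled as a small in-memory structure that the router queries by tier. The field names, tier names, and numbers below are assumptions for the sketch; a real registry would typically be backed by a config service or database and refreshed by health checks.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

TIER_ORDER = ["small", "mid", "frontier"]  # hypothetical tiers, weakest first


@dataclass
class ModelEntry:
    name: str
    tier: str
    cost_per_1k_tokens: float  # illustrative numbers only
    expected_latency_ms: int
    healthy: bool = True


class ModelRegistry:
    def __init__(self, entries: Iterable[ModelEntry]):
        self._entries: Dict[str, ModelEntry] = {e.name: e for e in entries}

    def set_health(self, name: str, healthy: bool) -> None:
        self._entries[name].healthy = healthy

    def resolve(self, tier: str) -> List[str]:
        """Return the healthy models in a tier, cheapest first."""
        matches = [e for e in self._entries.values() if e.tier == tier and e.healthy]
        return [e.name for e in sorted(matches, key=lambda e: e.cost_per_1k_tokens)]

    def escalate(self, tier: str) -> str:
        """Return the next stronger tier, or the same tier if already at the top."""
        index = TIER_ORDER.index(tier)
        return TIER_ORDER[min(index + 1, len(TIER_ORDER) - 1)]


registry = ModelRegistry([
    ModelEntry("mid-a", "mid", 3.0, 900),
    ModelEntry("mid-b", "mid", 2.5, 1100),
    ModelEntry("frontier-a", "frontier", 15.0, 2500),
])
```

The router asks for a tier rather than a model name, so marking `mid-b` unhealthy changes what `resolve("mid")` returns without any change to routing rules.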

Fallback logic

Fallback chains define what happens when the primary model is unavailable, rate-limited, or returns an error. A well-designed fallback chain has at least two levels: a same-tier fallback (e.g., OpenAI GPT-4o if Claude Sonnet (claude-sonnet-4-6) is rate-limited) and a cross-tier fallback of last resort (e.g., a fast cheap model that returns a partial answer rather than a hard error). Fallbacks must be designed carefully: falling back to a weaker model for a task that requires the primary model's capability is worse than a graceful error, because it silently degrades output quality without the caller knowing.
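The two-level chain and the "no silent degradation" rule can be sketched as follows. Everything here is hypothetical: `ProviderError`, the model names, and the `allow_cross_tier` flag are invented for the example. The key idea is that cross-tier fallback is opt-in per request, and the response carries a `degraded` marker so the caller can tell when a weaker model answered.

```python
from typing import Callable, List, Set, Tuple


class ProviderError(Exception):
    """Raised by the (hypothetical) provider client on outage, 429, or 5xx."""


def call_with_fallback(prompt: str, chain: List[Tuple[str, str]],
                       call: Callable[[str, str], str],
                       allow_cross_tier: bool) -> dict:
    """Walk the chain in order; skip weaker tiers unless the caller allows them."""
    primary_tier = chain[0][1]
    errors = []
    for model, tier in chain:
        degraded = tier != primary_tier
        if degraded and not allow_cross_tier:
            continue
        try:
            text = call(model, prompt)
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")
            continue
        return {"model": model, "degraded": degraded, "text": text}
    raise ProviderError("all fallbacks failed: " + "; ".join(errors))


# (model, tier) pairs: primary, same-tier fallback, cross-tier last resort.
CHAIN = [("primary-mid", "mid"), ("backup-mid", "mid"), ("small-fast", "small")]


def make_stub(down: Set[str]) -> Callable[[str, str], str]:
    """Build a fake provider call that fails for the models listed in `down`."""
    def call(model: str, prompt: str) -> str:
        if model in down:
            raise ProviderError("rate limited")
        return f"{model} answered"
    return call
```

A capability-sensitive request would pass `allow_cross_tier=False` and receive an explicit error when both mid-tier models are down, while a best-effort feature can opt in and inspect `degraded` in the result.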

Rate limiting and budget enforcement

The router is the natural enforcement point for per-tenant and per-user rate limits. Rather than trusting individual services to stay within their quota, the router tracks token spend per tenant in a shared store (typically Redis) and rejects or downgrades requests that would exceed the limit. Downgrading — routing a request to a cheaper model when a tenant is near their budget — is often preferable to outright rejection for user-facing features. Budget enforcement at the router layer also prevents runaway costs from agentic loops that generate far more tokens than expected.
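The allow, downgrade, or reject decision can be sketched as a small policy object. This example keeps spend in an in-process dictionary purely for illustration; in a multi-instance router that state would live in the shared store mentioned above and be updated atomically. The cap, the 90% downgrade ratio, and the class name are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict


class BudgetEnforcer:
    """Tracks per-tenant daily spend and decides how to treat the next request."""

    def __init__(self, daily_cap_usd: float, downgrade_ratio: float = 0.9):
        self.daily_cap_usd = daily_cap_usd
        self.downgrade_ratio = downgrade_ratio  # start downgrading at 90% of cap
        self._spend: DefaultDict[str, float] = defaultdict(float)

    def decide(self, tenant: str, estimated_cost_usd: float) -> str:
        """Return 'allow', 'downgrade', or 'reject' for a request of this size."""
        projected = self._spend[tenant] + estimated_cost_usd
        if projected > self.daily_cap_usd:
            return "reject"
        if projected >= self.daily_cap_usd * self.downgrade_ratio:
            return "downgrade"
        return "allow"

    def record(self, tenant: str, actual_cost_usd: float) -> None:
        """Add the actual cost of a completed request to the tenant's total."""
        self._spend[tenant] += actual_cost_usd
```

Checking the projected spend before the call, and recording the actual spend after it, is what lets the router cut off an agentic loop partway through rather than discovering the overrun on the invoice.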

Load balancing

When multiple instances of the same model are available (multiple API keys, multiple self-hosted replicas, or multiple regions of a provider), the router performs load balancing across them. This is distinct from routing between different models — it is distribution of traffic across equivalent capacity. Load balancing at the LLM layer handles provider rate limits gracefully: if one API key is throttled, the balancer shifts traffic to another key before the caller ever sees a 429 error. Round-robin, least-connections, and weighted traffic splitting are all common balancing strategies in LLM proxies.
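A minimal round-robin balancer that steers around throttled keys might look like the sketch below. The class name, the 30-second cooldown, and the key labels are assumptions for illustration; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Iterable, List


class KeyBalancer:
    """Round-robin over equivalent API keys, skipping keys that are cooling down."""

    def __init__(self, keys: Iterable[str]):
        self.keys: List[str] = list(keys)
        self._next = 0
        self._cooldown_until: Dict[str, float] = {}

    def mark_throttled(self, key: str, now: float, cooldown_s: float = 30.0) -> None:
        """Call this when a provider returns 429 for `key`."""
        self._cooldown_until[key] = now + cooldown_s

    def pick(self, now: float) -> str:
        """Return the next usable key, visiting each key at most once."""
        for _ in range(len(self.keys)):
            key = self.keys[self._next]
            self._next = (self._next + 1) % len(self.keys)
            if self._cooldown_until.get(key, 0.0) <= now:
                return key
        raise RuntimeError("all keys are cooling down")
```

When every key is cooling down the balancer raises instead of picking one anyway, which gives the fallback logic a clear signal to try a different model or provider.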

Prompt templates for LLM routing diagrams

The following prompt templates are designed to generate accurate LLM routing architecture diagrams using ArchitectureDiagram.ai. Copy a prompt, paste it into the diagram generator, and adjust model names and provider details to match your stack.

“A cost-based LLM routing architecture. All client requests enter a central LLM router service that exposes an OpenAI-compatible API endpoint. The router runs a lightweight prompt classifier that scores each request on complexity (low / medium / high) using token count, presence of code blocks, and a fast embedding similarity check against labeled examples. Low-complexity requests (simple Q&A, short completions) route to Claude Haiku via the Anthropic API. Medium-complexity requests (summarization, structured extraction, multi-turn chat) route to Claude Sonnet (claude-sonnet-4-6). High-complexity requests (code generation, contract analysis, multi-step reasoning) route to Claude Opus. All three tiers share a Redis cache keyed by prompt hash that stores responses for 30 minutes, bypassing the model entirely on cache hits. Token spend per request is logged to a cost tracking database with tenant and feature tags. A budget enforcement service checks cumulative spend per tenant before each request and downgrades to a cheaper tier if the tenant's daily spend is within 10% of the cap.”
“A latency-aware LLM routing architecture for a real-time customer support chat product. Incoming chat messages arrive at an API gateway and are tagged with a latency SLA: interactive messages (user is waiting in chat) carry a 500 ms SLA tag; background tasks (email drafts, ticket summarization) carry a 5-second SLA tag. The LLM router reads the SLA tag and routes interactive messages to a fast model (GPT-4o mini or Gemini Flash) hosted in the same region as the user to minimize network latency. Background tasks route to a higher-quality model (GPT-4o or Claude Sonnet (claude-sonnet-4-6)) where higher latency is acceptable. The router also monitors real-time P95 latency from each provider using a sliding 60-second window. If the primary provider's P95 exceeds the SLA threshold, the router automatically shifts interactive traffic to the secondary provider until latency recovers. All routing decisions are logged with the observed latency outcome so the routing thresholds can be tuned weekly.”
“A semantic task-based LLM routing architecture. All requests pass through a task classifier service — a fine-tuned small model — that assigns each prompt to one of five task categories: code generation, structured data extraction, creative writing, question answering, and safety-critical analysis. Each category maps to a specific model in the routing table: code generation routes to a coding-specialized model; structured extraction routes to a model with reliable JSON mode; creative writing routes to a model tuned for fluency and style; question answering routes to a retrieval-augmented model chain with web search access; safety-critical analysis always routes to the strongest available frontier model and triggers a mandatory human review flag. The routing table is stored in a config service and can be updated without a code deployment. A/B experiments on routing table configurations run on a 5% traffic slice with automatic rollback if task success rate drops below a baseline.”

Open-source and commercial LLM routers

Several tools have emerged specifically to handle LLM routing and the broader model proxy layer:

  • LiteLLM: open-source Python library and proxy server that presents a unified OpenAI-compatible API across 100+ providers. Handles load balancing, fallbacks, cost tracking, Redis caching, and per-key spend limits. Among the most widely deployed open-source LLM gateways in 2026.
  • PortKey: commercial AI gateway with a strong focus on multi-provider routing, semantic caching, observability, and per-tenant budget enforcement. Offers a managed cloud version and a self-hosted enterprise option.
  • RouteLLM: open-source framework from LMSYS that implements learned routing: a trained classifier predicts whether a strong or weak model will produce a better response for a given prompt, based on historical preference data from LMSYS Chatbot Arena.
  • Not Diamond: commercial LLM router that uses a trained meta-model to predict the best model for each request based on task type and historical quality scores. Focuses specifically on the routing decision rather than the full gateway feature set.
  • Martian: commercial router that dynamically selects the optimal model per request by predicting which model will produce the highest-quality output at the lowest cost, using a continuously updated preference model trained on live traffic.
  • OpenRouter: managed API that provides access to dozens of models through a single endpoint with built-in routing, fallback, and cost optimization. Popular for rapid prototyping and for teams that want multi-model access without managing a self-hosted proxy.

LLM routing vs AI gateway

LLM routing and AI gateway are related but distinct concepts that are frequently conflated. LLM routing refers specifically to the decision logic that selects which model handles a given request. It answers the question: “Given this prompt, which model should I use?”

An AI gateway is a broader architectural layer that sits in front of all LLM traffic and handles a superset of concerns: authentication and authorization, rate limiting, request/response logging, semantic caching, PII redaction, cost attribution, and — as one capability among many — model routing. Routing is a subset of what a gateway does, not a synonym for it.

In practice, most production systems implement both: a gateway that handles cross-cutting concerns (auth, logging, caching, rate limits) and a routing layer inside that gateway that applies the model selection logic. Tools like LiteLLM and PortKey bundle both into a single service, while teams building custom infrastructure often separate them — a lightweight Nginx or Envoy sidecar as the gateway and a dedicated routing microservice for model selection logic. For a deeper look at the gateway layer, see the AI gateway architecture guide.

Frequently asked questions

How much latency does an LLM router add?

A well-implemented router adds 5–20 ms of overhead per request for simple rule-based or cost-based routing. Semantic routing that runs a classifier model adds more — typically 30–100 ms depending on the classifier size and whether it runs locally or calls an external service. For interactive use cases with tight latency budgets, the classifier should be a local lightweight model (a small BERT-class model or a quantized 1B-parameter LLM) that runs in under 20 ms on CPU. The latency overhead is almost always worth it because the router can direct the majority of requests to faster, cheaper models — reducing total end-to-end latency even after accounting for the classification cost.

What happens when the router misclassifies a request?

Misclassification is the primary failure mode in semantic routing. A request incorrectly routed to a weaker model produces a lower-quality response; a request incorrectly routed to a stronger model wastes money but still produces a correct result. Most production routers are deliberately biased toward over-routing to stronger models for ambiguous cases — the cost of a wrong answer is higher than the cost of an unnecessary expensive model call. Misclassification rates should be tracked via online evaluation: a sample of routed requests is re-scored by an LLM judge to check whether the routed model actually performed well. When misclassification rates exceed a threshold, the classifier is retrained on the accumulated failure cases.
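Both ideas in this answer, biasing ambiguous cases toward the stronger model and tracking a misclassification rate from judged samples, fit in a short sketch. The label names, the 0.75 confidence floor, and the shape of the audit records are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

LABEL_TO_MODEL = {"simple": "small-model", "complex": "frontier-model"}
STRONG_MODEL = "frontier-model"


def route_with_bias(scores: Dict[str, float], min_confidence: float = 0.75) -> str:
    """Route on the top label, but send low-confidence cases to the strong model."""
    label, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return STRONG_MODEL  # ambiguous: prefer overspending to a wrong answer
    return LABEL_TO_MODEL[label]


def misclassification_rate(judged: List[Tuple[str, bool]]) -> float:
    """judged: (routed_model, judge_says_acceptable) pairs from a sampled audit."""
    if not judged:
        return 0.0
    failures = sum(1 for _, acceptable in judged if not acceptable)
    return failures / len(judged)
```

Raising `min_confidence` trades cost for safety: more requests go to the strong model, and fewer weak-model failures show up in the audit sample.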

Should routing logic live in the application or in a centralized proxy?

Centralized routing — in a proxy or gateway service — is strongly preferred for anything beyond a single-service application. Application-level routing embeds model selection logic in every service that calls an LLM, making it difficult to update routing policies, enforce budget limits, and get a unified view of cross-service token spend. A centralized routing proxy externalizes that logic: all services call a single internal endpoint, and the routing rules, model registry, and spend tracking live in one place. The tradeoff is that the proxy becomes a critical dependency — it must be highly available, horizontally scalable, and deployable independently of the services that depend on it. Using a tool like LiteLLM or PortKey rather than building a custom proxy significantly reduces the operational burden of maintaining centralized routing.

Related guides: AI gateway architecture, LLM architecture diagrams, LLMOps architecture, and context engineering diagrams.
