
Intent Classification for Agent Routing: LLM-Based, Embedding-Based & Hybrid Approaches
Learn intent classification for agent routing in a detailed, easy-to-understand way. This guide explains LLM-based routing, embedding similarity, hybrid classifiers, confidence thresholds, fallback logic, and multi-intent detection with a practical example.
Intent classification is one of the most important building blocks in a multi-agent system. Before you can send a request to the right agent, you first need to understand what the user is trying to do. That is the job of intent classification.
This sounds simple at first. If a user says, "Reset my password", route to the authentication agent. If they say, "Where is my order?", route to the order-tracking agent. But real user requests are often messy, ambiguous, and multi-purpose. A single message may contain several intents, incomplete context, or wording the system has never seen before.
This guide covers LLM-based classification, embedding-based classification, hybrid routing, confidence thresholds, fallback logic, and multi-intent detection. A practical e-commerce example runs throughout so the concepts stay concrete.
What intent classification really does
In a multi-agent architecture, different agents are usually specialized. One agent may handle billing, another technical support, another account management, and another product recommendations. If every request goes to every agent, the system becomes slow, expensive, and noisy. Routing helps the system send each request only where it belongs.
Intent classification is the decision layer behind that routing. It helps answer questions like:
- Is this a billing issue or a technical issue?
- Does this request need one agent or multiple agents?
- How confident is the system in its routing decision?
- Should the system ask a clarifying question before routing?
- Should the request go to a fallback or human review path?
Simple mental model
Suppose we are building an e-commerce assistant with these specialized agents:
- A Billing Agent for refunds, charges, and invoices
- An Order Agent for shipping status, cancellations, and delivery issues
- An Account Agent for login, password reset, and profile changes
- A Product Agent for recommendations and product questions
- A Technical Support Agent for app or website problems
Now consider these user messages:
- "I was charged twice for my last order."
- "My package says delivered, but I never got it."
- "I can't log in and I also need to update my email address."
- "Which laptop is best for video editing under $1500?"
- "The app crashes when I try to check out."
A good router should send each request to the correct agent or agents. That routing decision depends on intent classification.
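The mapping from intents to agents can be sketched as a plain lookup table. This is a minimal illustration, not a prescribed design: the intent labels `product_question` and `technical_issue`, the `Support Triage` destination, and the helper `agents_for` are hypothetical names chosen for this example.

```python
from typing import Dict, List

# Hypothetical intent labels mapped to the agents described above.
INTENT_TO_AGENT: Dict[str, str] = {
    "billing_refund": "Billing Agent",
    "order_tracking": "Order Agent",
    "account_access": "Account Agent",
    "product_question": "Product Agent",
    "technical_issue": "Technical Support Agent",
}

# Assumed catch-all destination for labels the table does not know.
FALLBACK_AGENT = "Support Triage"


def agents_for(intents: List[str]) -> List[str]:
    """Return the agents for the given intents, de-duplicated, in input order."""
    agents: List[str] = []
    for intent in intents:
        agent = INTENT_TO_AGENT.get(intent, FALLBACK_AGENT)
        if agent not in agents:
            agents.append(agent)
    return agents


# "I can't log in and I also need to update my email address" maps to one
# agent; a login problem plus a duplicate charge maps to two.
print(agents_for(["account_access", "billing_refund"]))
# ['Account Agent', 'Billing Agent']
```

Keeping the table separate from the classifier means the classifier only has to produce labels, and agents can be added or renamed without retraining or re-prompting.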
Real-world requests are not always clean. Users may be vague, emotional, indirect, or combine multiple needs in one sentence. For example, "I can't log in and I think I was billed for the wrong plan" contains both an account issue and a billing issue.
Intent classification becomes difficult because of:
- Ambiguity: the wording could fit more than one intent
- Multi-intent queries: one message contains several tasks
- Domain overlap: similar language appears across categories
- Rare phrasing: users describe familiar problems in unfamiliar ways
- Low context: the message is too short to classify confidently
That is why production systems often combine several methods instead of relying on only one.
LLM-based classification uses a language model to read the user request and decide which intent best matches it. This approach is powerful because LLMs understand nuance, paraphrasing, and context better than simple keyword rules.
For example, a user might say "Why did you take money from my card twice?" Even if the exact phrase "duplicate charge" never appears, an LLM can still infer that this is likely a billing intent.
import json
from typing import List, Optional

from openai import AsyncOpenAI

# Async client; reads OPENAI_API_KEY from the environment.
client = AsyncOpenAI()


class LLMIntentClassifier:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        # Maps intent name -> {"description": str, "examples": list[str]}
        self.intent_definitions: dict[str, dict] = {}

    def register_intent(
        self,
        name: str,
        description: str,
        examples: Optional[List[str]] = None,
    ) -> None:
        self.intent_definitions[name] = {
            "description": description,
            "examples": examples or [],
        }

    async def classify(
        self,
        query: str,
        return_confidence: bool = True,
        allow_multiple: bool = False,
    ) -> dict:
        # One line per registered intent: "- name: description"
        intent_desc = "\n".join(
            f"- {name}: {info['description']}"
            for name, info in self.intent_definitions.items()
        )
        # Describe the expected JSON fields; confidence is requested only on demand.
        intent_type = "string or array" if allow_multiple else "string"
        fields = f"intent ({intent_type})"
        if return_confidence:
            fields += ", confidence (0.0-1.0)"
        fields += ", reasoning (brief explanation)"

        response = await client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"Classify user intent. Available intents:\n{intent_desc}\n\n"
                        f"Return JSON with: {fields}."
                    ),
                },
                {"role": "user", "content": query},
            ],
            # JSON mode keeps the reply machine-parseable.
            response_format={"type": "json_object"},
            temperature=0,
        )
        return json.loads(response.choices[0].message.content)

This approach has several strengths:
- It handles paraphrases and natural language variation well
- It can use richer intent descriptions instead of only examples
- It can explain its reasoning
- It can detect multiple intents in one request
- It adapts better when user wording is messy or indirect
This makes LLMs especially useful when your routing space is complex or when user requests are highly varied.
LLM-based routing is powerful, but it is not free. It is usually slower and more expensive than embedding-based methods. It may also produce unstable outputs if prompts are weak or if the model is not constrained to structured JSON.
That is why many systems use LLM classification selectively: for ambiguous cases, high-value requests, or as a fallback when faster methods are uncertain.
A typical classification result looks like this:

{
  "intent": "billing_refund",
  "confidence": 0.94,
  "reasoning": "The user describes being charged twice, which maps to a billing/refund issue."
}

This output is useful because it gives both the routing label and a confidence score. The router can use that confidence to decide whether to route immediately or trigger a fallback.
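Because the router acts on this output, it helps to validate it before trusting it. The sketch below is one possible guard, not part of the classifier above: `parse_classification` and the `unknown` label are assumptions made for illustration. It rejects malformed JSON, unregistered intent names, and non-numeric confidence values, and clamps confidence into the 0.0 to 1.0 range.

```python
import json
from typing import Any, Dict, Set


def parse_classification(raw: str, known_intents: Set[str]) -> Dict[str, Any]:
    """Parse a classifier reply; fall back to 'unknown' on any malformed field."""
    unknown = {"intent": "unknown", "confidence": 0.0, "reasoning": ""}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return unknown
    if not isinstance(data, dict):
        return unknown

    intent = data.get("intent")
    confidence = data.get("confidence")
    # Only accept labels the router actually knows how to dispatch.
    if not isinstance(intent, str) or intent not in known_intents:
        return unknown
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return unknown

    return {
        "intent": intent,
        "confidence": max(0.0, min(1.0, float(confidence))),  # clamp to [0, 1]
        "reasoning": str(data.get("reasoning", "")),
    }


known = {"billing_refund", "account_access"}
reply = '{"intent": "billing_refund", "confidence": 0.94, "reasoning": "charged twice"}'
print(parse_classification(reply, known)["intent"])  # billing_refund
```

Treating every malformed reply as `unknown` with zero confidence lets the same threshold and fallback logic handle both low-confidence and broken outputs.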
Embedding-based classification works differently. Instead of asking an LLM to reason directly, it converts text into vectors and compares the user query to stored examples for each intent. The most similar intent wins.
This approach is often much faster and cheaper than LLM classification. It works especially well when intents are clearly separated and you have good example phrases for each one.
from typing import List, Tuple

import numpy as np
from openai import OpenAI

# Synchronous client for embeddings. It has its own name so it does not
# shadow the async `client` used by the LLM classifier.
embedding_client = OpenAI()


class EmbeddingIntentClassifier:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        # One centroid vector per intent (mean of its example embeddings).
        self.intent_embeddings: dict[str, np.ndarray] = {}
        self.intent_examples: dict[str, List[str]] = {}

    def register_intent(self, name: str, examples: List[str]) -> None:
        embeddings = self._get_embeddings(examples)
        self.intent_embeddings[name] = np.mean(embeddings, axis=0)
        self.intent_examples[name] = examples

    def _get_embeddings(self, texts: List[str]) -> np.ndarray:
        response = embedding_client.embeddings.create(model=self.model, input=texts)
        return np.array([item.embedding for item in response.data])

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def classify(
        self,
        query: str,
        top_k: int = 1,
        threshold: float = 0.7,
    ) -> List[Tuple[str, float]]:
        query_embedding = self._get_embeddings([query])[0]
        similarities = [
            (intent, self._cosine_similarity(query_embedding, emb))
            for intent, emb in self.intent_embeddings.items()
        ]
        # Highest similarity first.
        similarities.sort(key=lambda x: x[1], reverse=True)
        # Keep the top_k candidates that clear the threshold.
        results = [
            (intent, score)
            for intent, score in similarities[:top_k]
            if score >= threshold
        ]
        # Nothing cleared the threshold: report a single "unknown" result.
        return results if results else [("unknown", 0.0)]

A useful mental model is this: embeddings place similar meanings near each other in vector space. If "I need a refund" and "I was charged twice" are close to your billing examples, the classifier will likely route them to the billing agent.
This method is efficient because you can precompute intent example embeddings ahead of time. Then, at runtime, you only embed the incoming query and compare it to stored vectors.
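The precompute-then-compare idea can be sketched without any embedding API. In the example below, toy three-dimensional vectors stand in for real embeddings, and `normalize`, `build_index`, and `rank` are hypothetical helper names. Each intent vector is normalized once at startup, so cosine similarity at query time reduces to a dot product.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def build_index(centroids: Dict[str, Vector]) -> Dict[str, List[float]]:
    """Normalize every intent centroid once, ahead of time."""
    return {name: normalize(vec) for name, vec in centroids.items()}


def rank(query: Vector, index: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score the query against each intent; unit vectors make dot product == cosine."""
    q = normalize(query)
    scores = [
        (name, sum(a * b for a, b in zip(q, vec)))
        for name, vec in index.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embedding centroids.
index = build_index({
    "billing_refund": [0.9, 0.1, 0.0],
    "account_access": [0.0, 0.2, 0.9],
})
ranked = rank([0.8, 0.2, 0.1], index)
print(ranked[0][0])  # billing_refund
```

With real embeddings the index would typically be stored as a matrix or in a vector store, but the runtime cost profile is the same: one embedding call for the query, then cheap arithmetic.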
Embedding-based classification works well when:
- Your intents are clearly distinct
- You have representative examples for each intent
- You need low latency and lower cost
- Most requests are routine and repetitive
It tends to struggle when:
- Two intents use very similar language
- The user request is long and contains multiple goals
- The request depends on subtle context or policy nuance
- Your example set is weak or incomplete
This is why embeddings are often excellent for the fast path, but not always enough for the final decision.
A ranked result might look like this:

[
  ["billing_refund", 0.88],
  ["order_tracking", 0.41],
  ["account_access", 0.22]
]

Here the top score is high enough that the router may confidently choose the billing agent without calling an LLM.
A hybrid classifier combines multiple methods so you get the strengths of each. The most common pattern is:
- Use embeddings first because they are fast and cheap.
- If confidence is high, route immediately.
- If confidence is low or the top intents are too close, call an LLM.
- If the LLM is still uncertain, ask a clarifying question or use fallback routing.
This design is popular because most requests are easy. You do not need expensive reasoning for every message. You only spend extra compute on the hard cases.
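# Builds on EmbeddingIntentClassifier and LLMIntentClassifier defined earlier.
# Register intents on both underlying classifiers before calling classify().
class HybridIntentClassifier:
    def __init__(self):
        self.embedding_classifier = EmbeddingIntentClassifier()
        self.llm_classifier = LLMIntentClassifier()
        # Counts how often each path is taken, for cost and latency analysis.
        self.metrics = {"embedding_only": 0, "llm_fallback": 0}

    async def classify(
        self,
        query: str,
        confidence_threshold: float = 0.85,
    ) -> dict:
        # Fast path: embedding similarity against registered intents.
        embedding_results = self.embedding_classifier.classify(query, top_k=3)

        # The embedding classifier reports [("unknown", 0.0)] when nothing
        # clears its threshold, so treat that the same as an empty result.
        if not embedding_results or embedding_results[0][0] == "unknown":
            self.metrics["llm_fallback"] += 1
            llm_result = await self.llm_classifier.classify(query)
            llm_result["method"] = "llm_fallback"
            return llm_result

        top_intent, top_score = embedding_results[0]
        if top_score >= confidence_threshold:
            self.metrics["embedding_only"] += 1
            return {
                "intent": top_intent,
                "confidence": top_score,
                "method": "embedding",
            }

        # Slow path: the embedding score is not decisive, so ask the LLM.
        self.metrics["llm_fallback"] += 1
        llm_result = await self.llm_classifier.classify(query)
        llm_result["method"] = "llm_fallback"
        llm_result["embedding_candidates"] = embedding_results
        return llm_result

Confidence thresholds help the router decide when a prediction is strong enough to trust. If the top embedding score is 0.92, that may be good enough. If it is 0.61 and the second-best score is 0.59, the request is probably ambiguous.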
Thresholds are not universal. A safe threshold depends on your domain, your intent set, and the cost of misrouting. In a low-risk FAQ bot, a lower threshold may be acceptable. In a financial or healthcare workflow, you may want stricter thresholds and more fallback checks.
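The 0.61 versus 0.59 case suggests checking the gap between the top two candidates as well as the top score itself. The following is a minimal sketch of that idea; `is_ambiguous` is a hypothetical helper, and the defaults `min_score=0.65` and `min_margin=0.10` are illustrative values that would need tuning on real evaluation data.

```python
from typing import List, Tuple


def is_ambiguous(
    ranked: List[Tuple[str, float]],
    min_score: float = 0.65,   # assumed floor for a usable top score
    min_margin: float = 0.10,  # assumed minimum gap between first and second
) -> bool:
    """Return True when the ranked candidates do not single out one intent.

    `ranked` must be sorted by score, highest first.
    """
    if not ranked:
        return True
    top_score = ranked[0][1]
    if top_score < min_score:
        return True
    # A strong top score is still ambiguous if the runner-up is nearly as strong.
    if len(ranked) > 1 and top_score - ranked[1][1] < min_margin:
        return True
    return False


print(is_ambiguous([("billing_refund", 0.61), ("account_access", 0.59)]))  # True
print(is_ambiguous([("billing_refund", 0.92), ("order_tracking", 0.41)]))  # False
```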
Important routing lesson
- If embedding score is above 0.85, route directly
- If embedding score is between 0.65 and 0.85, use LLM verification
- If embedding score is below 0.65, mark as uncertain
- If uncertain after LLM review, ask a clarifying question or send to fallback support
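The tiers above translate directly into a small decision function. This sketch assumes that scores exactly on a boundary fall into the higher tier (0.85 routes directly, 0.65 goes to LLM verification); `RouteAction` and `decide` are hypothetical names.

```python
from enum import Enum


class RouteAction(Enum):
    ROUTE_DIRECTLY = "route_directly"
    VERIFY_WITH_LLM = "verify_with_llm"
    UNCERTAIN = "uncertain"


def decide(score: float, high: float = 0.85, low: float = 0.65) -> RouteAction:
    """Map an embedding similarity score to a routing action."""
    if score >= high:
        return RouteAction.ROUTE_DIRECTLY
    if score >= low:
        return RouteAction.VERIFY_WITH_LLM
    # Below the lower bound: clarify with the user or hand off to fallback.
    return RouteAction.UNCERTAIN


for score in (0.92, 0.70, 0.50):
    print(score, decide(score).value)
# 0.92 route_directly
# 0.7 verify_with_llm
# 0.5 uncertain
```

Exposing `high` and `low` as parameters makes it easy to run the same policy with stricter values in higher-risk domains.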
Some user requests should not be routed to only one agent. For example: "I can't log in and I need a copy of my invoice." This contains both an account-access intent and a billing intent.
Multi-intent detection identifies all relevant intents in one message. That allows the system to either:
- run multiple agents in parallel
- split the request into sub-tasks
- prioritize one intent first and queue the others
- ask the user which issue they want to solve first
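# Reuses `client` (AsyncOpenAI), `json`, `List`, and LLMIntentClassifier
# from the LLM-based classifier example.
class MultiIntentClassifier:
    def __init__(self, llm_classifier: LLMIntentClassifier):
        self.classifier = llm_classifier

    async def classify_multi(
        self,
        query: str,
        max_intents: int = 3,
        min_confidence: float = 0.6,
    ) -> List[dict]:
        # Offer the registered intent names so the model picks from a fixed label set.
        names = ", ".join(self.classifier.intent_definitions)
        label_hint = f"Choose intent names from: {names}. " if names else ""

        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Identify ALL intents in the query. "
                        f"{label_hint}"
                        'Return a JSON object with a single key "intents" holding an array of objects '
                        "with: intent (name), confidence (0.0-1.0), relevant_part (which part of query)."
                    ),
                },
                {"role": "user", "content": query},
            ],
            # JSON mode returns an object, so the array is wrapped under "intents".
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        intents = result.get("intents", [])
        # Drop low-confidence intents, then cap the list at max_intents.
        filtered = [
            intent for intent in intents
            if intent.get("confidence", 0) >= min_confidence
        ][:max_intents]
        return filtered

For the request above, the raw model output might look like this:

{
  "intents": [
    {
      "intent": "account_access",
      "confidence": 0.93,
      "relevant_part": "I can't log in"
    },
    {
      "intent": "billing_invoice",
      "confidence": 0.87,
      "relevant_part": "I need a copy of my invoice"
    }
  ]
}

This is much better than forcing the whole request into one label. The router can now coordinate multiple agents more intelligently.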
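One of the options listed earlier is to run multiple agents in parallel. The sketch below shows that pattern with `asyncio.gather`, using stub agents in place of real ones. `account_agent`, `billing_agent`, `AGENTS`, and `dispatch` are hypothetical names, and the input mirrors the `intents` structure shown above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

AgentFn = Callable[[str], Awaitable[str]]


async def account_agent(text: str) -> str:
    # Stub: a real agent would call tools or an LLM here.
    return f"[account] handling: {text}"


async def billing_agent(text: str) -> str:
    return f"[billing] handling: {text}"


# Hypothetical registry mapping intent labels to agent coroutines.
AGENTS: Dict[str, AgentFn] = {
    "account_access": account_agent,
    "billing_invoice": billing_agent,
}


async def dispatch(intents: List[dict]) -> List[str]:
    """Run one agent per detected intent concurrently; skip unregistered intents."""
    tasks = [
        AGENTS[item["intent"]](item["relevant_part"])
        for item in intents
        if item["intent"] in AGENTS
    ]
    # gather() returns results in the same order as the tasks were passed.
    return list(await asyncio.gather(*tasks))


detected = [
    {"intent": "account_access", "confidence": 0.93, "relevant_part": "I can't log in"},
    {"intent": "billing_invoice", "confidence": 0.87, "relevant_part": "I need a copy of my invoice"},
]
for line in asyncio.run(dispatch(detected)):
    print(line)
# [account] handling: I can't log in
# [billing] handling: I need a copy of my invoice
```

Parallel dispatch suits intents that are independent. When one task depends on another, for example restoring login before an invoice can be shown in the account, sequential prioritization is the safer choice.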
Let us walk through a realistic request: "I can't log in, and I was also charged twice this month."
- The router receives the user message.
- The embedding classifier compares it against known intent examples.
- It finds strong similarity to both account_access and billing_refund.
- Because there are multiple strong candidates, the router triggers LLM verification.
- The LLM confirms that the request contains two intents.
- The router creates two sub-tasks: one for the Account Agent and one for the Billing Agent.
- The Account Agent handles login recovery.
- The Billing Agent investigates the duplicate charge.
- The orchestrator combines the results into one coordinated response.
This example shows why routing is not just classification. It is classification plus confidence handling, fallback logic, and workflow coordination.
No classifier is perfect. Good systems plan for uncertainty instead of pretending it does not exist.
Common fallback strategies include:
- Ask a clarifying question: "Is this about billing or account access?"
- Route to a generalist agent that can gather more context
- Escalate to a human for high-risk or high-value cases
- Use a safe default path such as support triage when confidence is too low
The right fallback depends on the cost of misrouting. If sending a request to the wrong agent is cheap, you can be more aggressive. If it creates risk, delay, or customer frustration, you should be more conservative.
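One way to encode this trade-off is a small policy function that picks a fallback strategy from the remaining candidates and a risk flag. The rules below are illustrative assumptions, not a recommended policy, and `FallbackDecision` and `choose_fallback` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FallbackDecision:
    strategy: str
    message: str = ""


def choose_fallback(candidates: List[str], high_risk: bool) -> FallbackDecision:
    """Pick a fallback path when the classifier is not confident enough to route."""
    if high_risk:
        # Misrouting is costly, so hand off to a person.
        return FallbackDecision("escalate_to_human")
    if len(candidates) == 2:
        # Two plausible intents: a single question can resolve it.
        first, second = candidates
        return FallbackDecision("clarify", f"Is this about {first} or {second}?")
    if candidates:
        # One or many weak candidates: let a generalist gather more context.
        return FallbackDecision("generalist_agent")
    # Nothing plausible at all: use the safe default path.
    return FallbackDecision("support_triage")


decision = choose_fallback(["billing", "account access"], high_risk=False)
print(decision.strategy)  # clarify
print(decision.message)   # Is this about billing or account access?
```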
To improve routing, you need to measure it. Useful evaluation questions include:
- How often does the top predicted intent match the correct one?
- How often does the system miss a second intent?
- How often does fallback trigger?
- Which intents are most often confused with each other?
- What is the latency and cost of each routing path?
These metrics help you decide whether to improve examples, adjust thresholds, rewrite prompts, or change the hybrid policy.
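Several of these questions can be answered from a labeled log of routing decisions. The sketch below computes top-1 accuracy, fallback rate, and the most frequently confused intent pairs. The `Record` structure and `evaluate` function are assumptions for illustration, and the four sample records are made-up test data, not measured results.

```python
from collections import Counter
from typing import List, NamedTuple


class Record(NamedTuple):
    expected: str        # human-labeled correct intent
    predicted: str       # intent the router chose
    used_fallback: bool  # whether the slow or fallback path was triggered


def evaluate(records: List[Record]) -> dict:
    """Summarize routing quality from labeled records."""
    if not records:
        raise ValueError("no records to evaluate")
    total = len(records)
    correct = sum(r.expected == r.predicted for r in records)
    fallbacks = sum(r.used_fallback for r in records)
    # Count (expected, predicted) pairs for every misrouted request.
    confusions = Counter(
        (r.expected, r.predicted) for r in records if r.expected != r.predicted
    )
    return {
        "accuracy": correct / total,
        "fallback_rate": fallbacks / total,
        "top_confusions": confusions.most_common(3),
    }


sample = [
    Record("billing_refund", "billing_refund", False),
    Record("account_access", "account_access", False),
    Record("billing_invoice", "billing_refund", True),
    Record("order_tracking", "order_tracking", False),
]
report = evaluate(sample)
print(report["accuracy"])        # 0.75
print(report["fallback_rate"])   # 0.25
print(report["top_confusions"])  # [(('billing_invoice', 'billing_refund'), 1)]
```

Missed second intents and per-path latency need extra fields in the log, but they follow the same pattern: record the decision, label a sample, and aggregate.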
Best practices for production routing:
- Define intents clearly and keep boundaries understandable
- Collect representative examples for each intent
- Use embeddings for fast first-pass routing
- Use LLMs for ambiguous or high-value cases
- Set confidence thresholds based on real evaluation data
- Support multi-intent detection when users often combine requests
- Add fallback logic for uncertain cases
- Log routing decisions and confidence scores for analysis
- Continuously review misrouted examples and improve the classifier