Intent Classification for Agent Routing: LLM-Based, Embedding-Based & Hybrid Approaches

Learn intent classification for agent routing in a detailed, easy-to-understand way. This guide explains LLM-based routing, embedding similarity, hybrid classifiers, confidence thresholds, fallback logic, and multi-intent detection with a practical example.

AI Educator, May 30, 2026

Intent classification is one of the most important building blocks in a multi-agent system. Before you can send a request to the right agent, you first need to understand what the user is trying to do. That is the job of intent classification.

This sounds simple at first. If a user says, "Reset my password", route to the authentication agent. If they say, "Where is my order?", route to the order-tracking agent. But real user requests are often messy, ambiguous, and multi-purpose. A single message may contain several intents, incomplete context, or wording the system has never seen before.

This guide explains intent classification for agent routing step by step. It covers LLM-based classification, embedding-based classification, hybrid routing, confidence thresholds, fallback logic, and multi-intent detection, and it uses one practical example throughout so the concepts stay concrete.

What intent classification really does

Intent classification converts a user request into a routing decision. In other words, it answers: which agent should handle this request, how confident are we, and do we need one agent or several?

Why routing needs intent classification

In a multi-agent architecture, different agents are usually specialized. One agent may handle billing, another technical support, another account management, and another product recommendations. If every request goes to every agent, the system becomes slow, expensive, and noisy. Routing helps the system send each request only where it belongs.

Intent classification is the decision layer behind that routing. It helps answer questions like:

Is this a billing issue or a technical issue?
Does this request need one agent or multiple agents?
How confident is the system in its routing decision?
Should the system ask a clarifying question before routing?
Should the request go to a fallback or human review path?

Simple mental model

Think of intent classification like the triage desk in a hospital or the front desk in a company. The goal is not to solve the problem immediately. The goal is to understand the request well enough to send it to the right specialist.

A running example: e-commerce support router

Suppose we are building an e-commerce assistant with these specialized agents:

A Billing Agent for refunds, charges, and invoices
An Order Agent for shipping status, cancellations, and delivery issues
An Account Agent for login, password reset, and profile changes
A Product Agent for recommendations and product questions
A Technical Support Agent for app or website problems

Now consider these user messages:

"I was charged twice for my last order."
"My package says delivered, but I never got it."
"I can't log in and I also need to update my email address."
"Which laptop is best for video editing under $1500?"
"The app crashes when I try to check out."

A good router should send each request to the correct agent or agents. That routing decision depends on intent classification.

What makes intent classification hard

Real-world requests are not always clean. Users may be vague, emotional, indirect, or combine multiple needs in one sentence. For example, "I can't log in and I think I was billed for the wrong plan" contains both an account issue and a billing issue.

Intent classification becomes difficult because of:

Ambiguity: the wording could fit more than one intent
Multi-intent queries: one message contains several tasks
Domain overlap: similar language appears across categories
Rare phrasing: users describe familiar problems in unfamiliar ways
Low context: the message is too short to classify confidently

That is why production systems often combine several methods instead of relying on only one.

Part 1: LLM-based classification

LLM-based classification uses a language model to read the user request and decide which intent best matches it. This approach is powerful because LLMs understand nuance, paraphrasing, and context better than simple keyword rules.

For example, a user might say "Why did you take money from my card twice?" Even if the exact phrase "duplicate charge" never appears, an LLM can still infer that this is likely a billing intent.

llm_classification.py

python

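import json
from typing import List, Optional

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class LLMIntentClassifier:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        # Maps intent name -> {"description": str, "examples": list of str}
        self.intent_definitions: dict[str, dict] = {}

    def register_intent(self, name: str, description: str, examples: Optional[List[str]] = None):
        self.intent_definitions[name] = {
            "description": description,
            "examples": examples or []
        }

    def _format_intents(self) -> str:
        # One line per intent; registered examples are appended so the model sees them.
        lines = []
        for name, info in self.intent_definitions.items():
            line = f"- {name}: {info['description']}"
            if info["examples"]:
                line += " (e.g. " + "; ".join(info["examples"]) + ")"
            lines.append(line)
        return "\n".join(lines)

    async def classify(
        self,
        query: str,
        return_confidence: bool = True,
        allow_multiple: bool = False
    ) -> dict:
        intent_desc = self._format_intents()
        # Describe the expected type of the "intent" field in the prompt.
        intent_type = "string or array of strings" if allow_multiple else "string"

        response = await client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"Classify user intent. Available intents:\n{intent_desc}\n\n"
                        f"Return JSON with: intent ({intent_type}), "
                        f"confidence (0.0-1.0), reasoning (brief explanation)."
                    )
                },
                {"role": "user", "content": query}
            ],
            response_format={"type": "json_object"},  # ask for a single JSON object
            temperature=0  # reduce run-to-run variation
        )

        # message.content is optional in the SDK, so fall back to an empty object.
        result = json.loads(response.choices[0].message.content or "{}")
        if not return_confidence:
            result.pop("confidence", None)
        return result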

Why LLM classification works well

It handles paraphrases and natural language variation well
It can use richer intent descriptions instead of only examples
It can explain its reasoning
It can detect multiple intents in one request
It adapts better when user wording is messy or indirect

This makes LLMs especially useful when your routing space is complex or when user requests are highly varied.

Limitations of LLM classification

LLM-based routing is powerful, but it is not free. It is usually slower and more expensive than embedding-based methods. It may also produce unstable outputs if prompts are weak or if the model is not constrained to structured JSON.

That is why many systems use LLM classification selectively: for ambiguous cases, high-value requests, or as a fallback when faster methods are uncertain.

Example LLM routing decision

llm_result.json

json

{
  "intent": "billing_refund",
  "confidence": 0.94,
  "reasoning": "The user describes being charged twice, which maps to a billing/refund issue."
}

This output is useful because it gives both the routing label and a confidence score. The router can use that confidence to decide whether to route immediately or trigger a fallback.
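Because the router acts directly on this JSON, it can help to validate the reply before trusting it. The following is a minimal sketch, not part of the classifier above: the function name `parse_llm_result`, the `known_intents` set, and the `"unknown"` fallback label are assumptions chosen for illustration.

```python
import json


def parse_llm_result(raw: str, known_intents: set[str]) -> dict:
    """Validate a classifier reply; return an 'unknown' result on any problem."""
    # Hypothetical fallback shape, mirroring the intent/confidence/reasoning fields.
    fallback = {"intent": "unknown", "confidence": 0.0,
                "reasoning": "invalid classifier output"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    intent = data.get("intent")
    confidence = data.get("confidence")

    # Reject labels the router has no agent for (including non-string values).
    if not isinstance(intent, str) or intent not in known_intents:
        return fallback
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback

    return {
        "intent": intent,
        # Clamp to the 0.0-1.0 range the prompt asks for.
        "confidence": min(1.0, max(0.0, float(confidence))),
        "reasoning": str(data.get("reasoning", "")),
    }
```

A result of `"unknown"` with confidence 0.0 can then flow into whatever fallback path the router already has, so a malformed reply never becomes a routing decision.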

Part 2: Embedding-based classification

Embedding-based classification works differently. Instead of asking an LLM to reason directly, it converts text into vectors and compares the user query to stored examples for each intent. The most similar intent wins.

This approach is often much faster and cheaper than LLM classification. It works especially well when intents are clearly separated and you have good example phrases for each one.

embedding_classification.py

python

import numpy as np
from openai import OpenAI
from typing import List, Tuple

client = OpenAI()

class EmbeddingIntentClassifier:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.intent_embeddings: dict[str, np.ndarray] = {}
        self.intent_examples: dict[str, List[str]] = {}

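    def register_intent(self, name: str, examples: List[str]):
        # Embed the examples once, at registration time, and store their mean
        # (the centroid) as the single vector that represents this intent.
        embeddings = self._get_embeddings(examples)
        self.intent_embeddings[name] = np.mean(embeddings, axis=0)
        self.intent_examples[name] = examples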

    def _get_embeddings(self, texts: List[str]) -> np.ndarray:
        response = client.embeddings.create(model=self.model, input=texts)
        return np.array([item.embedding for item in response.data])

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def classify(
        self,
        query: str,
        top_k: int = 1,
        threshold: float = 0.7
    ) -> List[Tuple[str, float]]:
        query_embedding = self._get_embeddings([query])[0]

        similarities = [
            (intent, self._cosine_similarity(query_embedding, emb))
            for intent, emb in self.intent_embeddings.items()
        ]

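        similarities.sort(key=lambda x: x[1], reverse=True)
        # Keep only the top_k candidates that clear the threshold.
        results = [(intent, score) for intent, score in similarities[:top_k] if score >= threshold]

        # Sentinel when nothing is similar enough; callers should treat it as "no match".
        return results if results else [("unknown", 0.0)]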

How to think about embeddings simply

A useful mental model is this: embeddings place similar meanings near each other in vector space. If "I need a refund" and "I was charged twice" are close to your billing examples, the classifier will likely route them to the billing agent.

This method is efficient because you can precompute intent example embeddings ahead of time. Then, at runtime, you only embed the incoming query and compare it to stored vectors.
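The precompute-then-compare idea can be sketched without any API calls. The example below uses toy 3-dimensional vectors invented for illustration in place of real embeddings, and plain Python instead of NumPy so it runs anywhere. The names `unit`, `dot`, `INTENT_VECTORS`, and `rank_intents` are hypothetical. Once every stored vector is normalized to length 1 ahead of time, cosine similarity at runtime reduces to a dot product.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Precomputed once, ahead of time. The numbers are made up for this sketch;
# in practice they would be the centroid embeddings of each intent's examples.
INTENT_VECTORS = {
    name: unit(vec)
    for name, vec in {
        "billing_refund": [0.9, 0.1, 0.0],
        "order_tracking": [0.1, 0.9, 0.1],
        "account_access": [0.0, 0.2, 0.9],
    }.items()
}


def rank_intents(query_vec):
    """Return (intent, cosine similarity) pairs, most similar first."""
    q = unit(query_vec)  # the only per-request normalization
    scored = [(name, dot(q, vec)) for name, vec in INTENT_VECTORS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A query vector that points mostly in the "billing" direction.
ranked = rank_intents([0.8, 0.2, 0.1])
```

With real embeddings the structure is the same: only the query is embedded per request, and the comparison against stored intent vectors is cheap arithmetic.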

When embeddings work well

Your intents are clearly distinct
You have representative examples for each intent
You need low latency and lower cost
Most requests are routine and repetitive

When embeddings struggle

Two intents use very similar language
The user request is long and contains multiple goals
The request depends on subtle context or policy nuance
Your example set is weak or incomplete

This is why embeddings are often excellent for the fast path, but not always enough for the final decision.

Example embedding routing result

embedding_result.json

json

[
  ["billing_refund", 0.88],
  ["order_tracking", 0.41],
  ["account_access", 0.22]
]

Here the top score is high enough that the router may confidently choose the billing agent without calling an LLM.

Part 3: Hybrid ensemble classification

A hybrid classifier combines multiple methods so you get the strengths of each. The most common pattern is:

Use embeddings first because they are fast and cheap.
If confidence is high, route immediately.
If confidence is low or the top intents are too close, call an LLM.
If the LLM is still uncertain, ask a clarifying question or use fallback routing.

This design is popular because most requests are easy. You do not need expensive reasoning for every message. You only spend extra compute on the hard cases.

hybrid_classification.py

python

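import asyncio

from embedding_classification import EmbeddingIntentClassifier
from llm_classification import LLMIntentClassifier

class HybridIntentClassifier:
    def __init__(self):
        self.embedding_classifier = EmbeddingIntentClassifier()
        self.llm_classifier = LLMIntentClassifier()
        # Counts how often each path is taken, for cost and latency analysis.
        self.metrics = {"embedding_only": 0, "llm_fallback": 0}

    def register_intent(self, name: str, description: str, examples: list[str]):
        # Register with both classifiers so they share the same label set.
        self.embedding_classifier.register_intent(name, examples)
        self.llm_classifier.register_intent(name, description, examples)

    async def classify(
        self,
        query: str,
        confidence_threshold: float = 0.85
    ) -> dict:
        # The embedding classifier is synchronous, so run it in a worker thread
        # to avoid blocking the event loop.
        embedding_results = await asyncio.to_thread(
            self.embedding_classifier.classify, query, top_k=3
        )

        # Defensive check: the embedding classifier currently returns
        # [("unknown", 0.0)] rather than an empty list when nothing matches.
        if not embedding_results:
            self.metrics["llm_fallback"] += 1
            return await self.llm_classifier.classify(query)

        top_intent, top_score = embedding_results[0]

        # Fast path: the embedding match is strong enough to route on its own.
        if top_score >= confidence_threshold:
            self.metrics["embedding_only"] += 1
            return {
                "intent": top_intent,
                "confidence": top_score,
                "method": "embedding"
            }

        # Slow path: low score (including the "unknown" sentinel at 0.0).
        self.metrics["llm_fallback"] += 1
        llm_result = await self.llm_classifier.classify(query)
        llm_result["method"] = "llm_fallback"
        llm_result["embedding_candidates"] = embedding_results
        return llm_result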

Why confidence thresholds matter

Confidence thresholds help the router decide when a prediction is strong enough to trust. A top embedding score of 0.92 with no close runner-up is usually safe to act on. A top score of 0.61 with a second-best score of 0.59 is not: the score itself is low, and the 0.02 gap means the classifier cannot really tell the two intents apart.

Thresholds are not universal. A safe threshold depends on your domain, your intent set, and the cost of misrouting. In a low-risk FAQ bot, a lower threshold may be acceptable. In a financial or healthcare workflow, you may want stricter thresholds and more fallback checks.
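One way to capture both signals, the absolute score and the gap to the runner-up, is a small check like the one below. It is a sketch under stated assumptions: the function name `is_confident` and the default `min_margin` of 0.10 are illustrative values, not recommendations, and should be tuned on your own evaluation data.

```python
def is_confident(ranked, min_score=0.85, min_margin=0.10):
    """Decide whether the top candidate can be trusted without further checks.

    ranked: list of (intent, score) pairs sorted by score, highest first.
    min_score and min_margin are assumed defaults for illustration only.
    """
    if not ranked:
        return False
    top_score = ranked[0][1]
    # With a single candidate there is no runner-up to be confused with.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return top_score >= min_score and (top_score - runner_up) >= min_margin
```

A stricter domain would raise `min_score`, `min_margin`, or both, at the cost of sending more requests down the slower path.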

Important routing lesson

A wrong high-confidence route is often worse than a low-confidence fallback. If the system is unsure, it is usually better to escalate, ask a clarifying question, or use a stronger classifier.

A practical hybrid routing policy

If the embedding score is 0.85 or higher, route directly
If the embedding score is at least 0.65 but below 0.85, use LLM verification
If the embedding score is below 0.65, mark the request as uncertain
If uncertain after LLM review, ask a clarifying question or send to fallback support
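The score bands in this policy map naturally onto a small pure function. This is a minimal sketch: the function name `routing_action` and the returned action labels are hypothetical, and treating a score of exactly 0.85 or 0.65 as belonging to the higher band is an assumption.

```python
def routing_action(top_score: float, direct: float = 0.85, verify: float = 0.65) -> str:
    """Map the top embedding score to the next routing step."""
    if top_score >= direct:
        return "route_directly"
    if top_score >= verify:
        return "llm_verification"
    # Below the lower band: the clarifying-question or fallback path applies.
    return "uncertain"
```

Keeping the thresholds as parameters makes it easy to test several settings against logged traffic before changing production behavior.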

Part 4: Multi-intent detection

Some user requests should not be routed to only one agent. For example: "I can't log in and I need a copy of my invoice." This contains both an account-access intent and a billing intent.

Multi-intent detection identifies all relevant intents in one message. That allows the system to either:

run multiple agents in parallel
split the request into sub-tasks
prioritize one intent first and queue the others
ask the user which issue they want to solve first

multi_intent.py

python

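import json
from typing import List

from llm_classification import LLMIntentClassifier, client

class MultiIntentClassifier:
    def __init__(self, llm_classifier: LLMIntentClassifier):
        self.classifier = llm_classifier

    async def classify_multi(
        self,
        query: str,
        max_intents: int = 3,
        min_confidence: float = 0.6
    ) -> List[dict]:
        # Reuse the intents registered on the wrapped classifier so the model
        # picks from known labels instead of inventing its own.
        intent_desc = "\n".join(
            f"- {name}: {info['description']}"
            for name, info in self.classifier.intent_definitions.items()
        )

        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"Identify ALL intents in the query. Available intents:\n{intent_desc}\n\n"
                        'Return a JSON object with one key, "intents": an array of objects '
                        "with: intent (name), confidence (0.0-1.0), relevant_part (which part of query)."
                    )
                },
                {"role": "user", "content": query}
            ],
            # JSON mode returns an object, so the array is wrapped under "intents".
            response_format={"type": "json_object"},
            temperature=0
        )

        result = json.loads(response.choices[0].message.content or "{}")
        intents = result.get("intents", [])

        # Keep confident intents, strongest first, capped at max_intents.
        filtered = sorted(
            (intent for intent in intents
             if intent.get("confidence", 0) >= min_confidence),
            key=lambda intent: intent.get("confidence", 0),
            reverse=True
        )[:max_intents]

        return filtered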

Example multi-intent output

multi_intent_result.json

json

{
  "intents": [
    {
      "intent": "account_access",
      "confidence": 0.93,
      "relevant_part": "I can't log in"
    },
    {
      "intent": "billing_invoice",
      "confidence": 0.87,
      "relevant_part": "I need a copy of my invoice"
    }
  ]
}

This is much better than forcing the whole request into one label. The router can now coordinate multiple agents more intelligently.
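To act on a result like this, the router needs to turn each detected intent into a task for a specific agent. The sketch below shows one way to do that. The mapping `AGENT_FOR_INTENT`, the agent identifiers, the `SubTask` type, and the `support_triage` default are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from intent labels to agent identifiers.
AGENT_FOR_INTENT = {
    "account_access": "account_agent",
    "billing_invoice": "billing_agent",
    "billing_refund": "billing_agent",
}


@dataclass
class SubTask:
    agent: str
    intent: str
    text: str


def plan_subtasks(result: dict, fallback_agent: str = "support_triage") -> list[SubTask]:
    """Turn a multi-intent result into one sub-task per detected intent."""
    # Highest-confidence intent first, so it can be prioritized if needed.
    items = sorted(result.get("intents", []),
                   key=lambda item: item["confidence"], reverse=True)
    return [
        SubTask(
            # Unmapped intents go to the fallback agent instead of failing.
            agent=AGENT_FOR_INTENT.get(item["intent"], fallback_agent),
            intent=item["intent"],
            text=item["relevant_part"],
        )
        for item in items
    ]
```

Passing only `relevant_part` to each agent keeps the Account Agent from seeing the invoice request and the Billing Agent from seeing the login problem.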

End-to-end walkthrough of the routing example

Let us walk through a realistic request: "I can't log in, and I was also charged twice this month."

The router receives the user message.
The embedding classifier compares it against known intent examples.
It finds strong similarity to both account_access and billing_refund.
Because there are multiple strong candidates, the router triggers LLM verification.
The LLM confirms that the request contains two intents.
The router creates two sub-tasks: one for the Account Agent and one for the Billing Agent.
The Account Agent handles login recovery.
The Billing Agent investigates the duplicate charge.
The orchestrator combines the results into one coordinated response.

This example shows why routing is not just classification. It is classification plus confidence handling, fallback logic, and workflow coordination.
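The last steps of the walkthrough, running both agents and combining their results, can be sketched with `asyncio.gather`. The two agent functions here are stand-ins that only format a string, and the names `AGENTS` and `orchestrate` are hypothetical. A real orchestrator would call actual agents and would also need to handle intents that have no registered agent.

```python
import asyncio


async def account_agent(text: str) -> str:
    await asyncio.sleep(0)  # stand-in for real work such as an API call
    return f"[account] handled: {text}"


async def billing_agent(text: str) -> str:
    await asyncio.sleep(0)
    return f"[billing] handled: {text}"


# Hypothetical registry: intent label -> agent coroutine function.
AGENTS = {"account_access": account_agent, "billing_refund": billing_agent}


async def orchestrate(subtasks: list[tuple[str, str]]) -> str:
    """Run one agent per (intent, text) sub-task concurrently and merge the replies."""
    coros = [AGENTS[intent](text) for intent, text in subtasks]
    # return_exceptions=True keeps one failing agent from cancelling the others.
    results = await asyncio.gather(*coros, return_exceptions=True)

    lines = []
    for (intent, _), result in zip(subtasks, results):
        if isinstance(result, Exception):
            lines.append(f"[{intent}] failed: {result}")
        else:
            lines.append(result)
    return "\n".join(lines)


reply = asyncio.run(orchestrate([
    ("account_access", "I can't log in"),
    ("billing_refund", "I was charged twice this month"),
]))
```

`asyncio.gather` returns results in the order the coroutines were passed in, so the combined response follows the order of the sub-tasks regardless of which agent finishes first.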

Fallback strategies when routing is uncertain

No classifier is perfect. Good systems plan for uncertainty instead of pretending it does not exist.

Common fallback strategies include:

Ask a clarifying question: "Is this about billing or account access?"
Route to a generalist agent that can gather more context
Escalate to a human for high-risk or high-value cases
Use a safe default path such as support triage when confidence is too low

The right fallback depends on the cost of misrouting. If sending a request to the wrong agent is cheap, you can be more aggressive. If it creates risk, delay, or customer frustration, you should be more conservative.
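As a rough illustration, the fallback choice can be written as a small decision function. Everything about the ordering here is an assumption made for this sketch: escalating high-risk requests first, asking a clarifying question only when exactly two candidate intents remain, and the action labels themselves. Your own policy should follow from the misrouting costs in your domain.

```python
def choose_fallback(candidates: list[str], high_risk: bool) -> dict:
    """Pick a fallback action for a request the router could not classify confidently.

    candidates: human-readable names of the intents still in contention.
    """
    if high_risk:
        # Assumed priority: risk outranks convenience.
        return {"action": "escalate_to_human"}
    if len(candidates) == 2:
        first, second = candidates
        return {
            "action": "ask_clarifying_question",
            "question": f"Is this about {first} or {second}?",
        }
    if candidates:
        # Several (or one weak) candidates: let a generalist gather more context.
        return {"action": "generalist_agent"}
    # Nothing plausible at all: use the safe default path.
    return {"action": "support_triage"}
```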

How to evaluate routing quality

To improve routing, you need to measure it. Useful evaluation questions include:

How often does the top predicted intent match the correct one?
How often does the system miss a second intent?
How often does fallback trigger?
Which intents are most often confused with each other?
What is the latency and cost of each routing path?

These metrics help you decide whether to improve examples, adjust thresholds, rewrite prompts, or change the hybrid policy.
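Several of these questions can be answered from a log of routing decisions. The sketch below assumes each logged record is a dict with `expected`, `predicted`, and `used_fallback` fields; that record shape and the function name `evaluate` are assumptions for illustration, not an established format.

```python
from collections import Counter


def evaluate(records: list[dict]) -> dict:
    """Summarize routing quality from labeled routing logs."""
    if not records:
        raise ValueError("need at least one record")
    total = len(records)
    correct = sum(r["expected"] == r["predicted"] for r in records)
    fallbacks = sum(bool(r["used_fallback"]) for r in records)
    # Count (expected, predicted) pairs for the misrouted requests only.
    confusions = Counter(
        (r["expected"], r["predicted"])
        for r in records
        if r["expected"] != r["predicted"]
    )
    return {
        "top1_accuracy": correct / total,
        "fallback_rate": fallbacks / total,
        "confusions": confusions.most_common(),  # most frequent mix-ups first
    }


# Made-up records for demonstration, not measured results.
sample = [
    {"expected": "billing_refund", "predicted": "billing_refund", "used_fallback": False},
    {"expected": "order_tracking", "predicted": "order_tracking", "used_fallback": False},
    {"expected": "billing_refund", "predicted": "order_tracking", "used_fallback": True},
    {"expected": "account_access", "predicted": "account_access", "used_fallback": False},
]
report = evaluate(sample)
```

The `confusions` list is often the most actionable output: a pair that shows up repeatedly usually points to overlapping intent definitions or missing examples.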

Best practices checklist

Define intents clearly and keep boundaries understandable
Collect representative examples for each intent
Use embeddings for fast first-pass routing
Use LLMs for ambiguous or high-value cases
Set confidence thresholds based on real evaluation data
Support multi-intent detection when users often combine requests
Add fallback logic for uncertain cases
Log routing decisions and confidence scores for analysis
Continuously review misrouted examples and improve the classifier

Performance optimization

Cache repeated classification results, precompute intent embeddings, use embeddings for the fast path, and reserve LLM calls for ambiguous cases. This usually gives a strong balance of speed, cost, and accuracy.
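Caching repeated results can be as simple as a bounded in-memory map in front of the classifier. The sketch below is one possible shape, assuming that lowercasing and collapsing whitespace is an acceptable way to treat two queries as the same. The class name `ClassificationCache` and its parameters are hypothetical.

```python
from collections import OrderedDict


class ClassificationCache:
    """Small LRU cache placed in front of any classify(query) callable."""

    def __init__(self, classify_fn, max_size: int = 1024):
        self._classify_fn = classify_fn
        self._max_size = max_size
        self._store = OrderedDict()  # normalized query -> cached result
        self.hits = 0

    @staticmethod
    def _key(query: str) -> str:
        # Assumed normalization: case-insensitive, whitespace collapsed.
        return " ".join(query.lower().split())

    def classify(self, query: str):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]

        result = self._classify_fn(query)
        self._store[key] = result
        if len(self._store) > self._max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result
```

An exact-match cache like this only helps with literally repeated queries, and cached entries should be cleared whenever intents, examples, or thresholds change.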

Key takeaway

Intent classification is the brain behind agent routing. The goal is not only to label a request, but to route it safely and efficiently. In practice, the strongest systems combine fast embedding search, careful confidence thresholds, LLM verification for hard cases, and multi-intent handling when one request needs several specialists.

