DevLifted - Tech Blog & Tutorials

Guardrails, Safety & Output Validation: Building LLM Applications That Don't Break

AI Educator — Sun, 31 May 2026 06:00:00 GMT

Your LLM will produce garbage output on 2% of requests, leak customer PII if you pass it through carelessly, hallucinate facts that sound plausible enough to ship, and get jailbroken by anyone who spends fifteen minutes reading prompt injection blogs. These are not edge cases — they are the default behavior of every language model in production today. Guardrails are the engineering discipline that prevents all four. Not alignment research, not RLHF tuning, not hoping the model behaves — actual input validation, output filtering, schema enforcement, and content moderation code that wraps every LLM call in your system.

The Guardrails Architecture

Every LLM call in production should pass through a pipeline of guards. Some run on input, some on output, some on both. The architecture below shows where each guard sits and what it catches.

Orange guards handle validation and filtering. Red guards handle content moderation and safety. Blue handles schema enforcement. Purple handles hallucination detection. Green is the actual LLM call — the only part most developers build. The rest of this post implements every other box in this diagram.

Input Validation & Filtering

Input validation is the first line of defense. It catches prompt injection attempts, enforces topic boundaries, validates input length, and rejects malformed requests before they ever reach your LLM. The strategy is layered: a fast regex pass catches obvious attacks in microseconds, then an LLM-based classifier catches sophisticated injection attempts that regex misses.

Structured Output Enforcement

LLMs produce strings. Your application needs structured data. The gap between those two facts is where half of production bugs live. There are three approaches to closing it, each with different tradeoffs: OpenAI JSON mode, the instructor library with Pydantic, and manual schema enforcement with a parse-validate-repair loop.

Approach 1: OpenAI JSON Mode

Approach 2: Instructor + Pydantic (Recommended)

Approach 3: Manual Parse-Validate-Repair Loop

Comparing the Three Approaches

Content Moderation Pipeline

Content moderation runs on both input and output. The OpenAI Moderation API catches standard safety categories (hate, violence, sexual content, self-harm). But production applications need custom moderation on top: blocking competitor mentions, filtering off-topic content, catching domain-specific profanity that the generic API misses. The pipeline below layers both.

PII Detection & Redaction

Sending customer PII to an LLM is a compliance and security risk. The solution: detect PII in the input, replace it with typed placeholders before the LLM call, and optionally restore it after for authorized consumers. The redaction must be reversible — the placeholder mapping lives in your system, never in the LLM's context.

Hallucination Detection

Hallucination is the hardest problem in LLM safety. The model generates confident, fluent text that contains fabricated facts. There is no single solution — you need multiple detection strategies layered together. The three most effective: claim extraction with entailment checking, self-consistency voting, and source attribution for RAG contexts.

Guardrails AI Integration

The guardrails-ai library provides a declarative framework for wrapping LLM calls with validators. Instead of writing custom validation logic, you define a Guard with a list of validators, and the library handles validation, re-asking on failure, and structured output parsing. It reduces boilerplate significantly for common validation patterns.

NeMo Guardrails Integration

NVIDIA NeMo Guardrails takes a different approach: instead of wrapping validators around outputs, it defines conversational rails using Colang, a domain-specific language for dialogue control. Rails intercept both input and output at the conversational level — blocking off-topic queries, jailbreak attempts, and unsafe responses before they reach the user. The examples below use Colang 1.0 syntax. NeMo Guardrails 0.9+ supports Colang 2.0 with a different syntax — check the official docs for migration if you're on a newer version.

Combining Everything: Production Guardrail Pipeline

Individual guards are useful. A composable pipeline that chains them together with configurable severity levels, logging, and metrics is what you actually deploy. The pipeline below runs every guard in sequence on input and output, tracks which guards trigger and how long each takes, and lets you enable or disable guards per environment.

Performance & Cost Considerations

Every guard adds latency and cost. The goal is minimizing overhead while maximizing coverage. The table below shows typical latency for each guard type so you can budget your latency budget.

Optimization Strategies

**Run independent guards in parallel.** Input validation, PII detection, and content moderation don't depend on each other. Use `asyncio.gather()` to run them simultaneously — cuts total input guard latency from ~700ms sequential to ~400ms parallel.
**Tier your guards.** Fast regex runs on every request. LLM-based injection detection runs only on user-facing inputs. Hallucination checking runs only when RAG context is available and the request is high-stakes.
**Cache moderation results.** Identical or near-identical inputs produce identical moderation results. Hash the input and cache moderation API results for 5-10 minutes — reduces redundant API calls by 30-60% in conversational contexts.
**Skip guards by context.** Internal admin tools don't need injection detection. Development environments can disable hallucination checks. Health check endpoints skip everything. Make guard configuration per-route, not global.
**Fail open on guard errors.** If the moderation API times out, allow the request with a logged warning — don't block users because a guard failed. The exception: PII redaction should fail closed (block if detection fails).
**Budget your latency.** Set a total guard latency budget (e.g., 500ms for input guards, 1000ms for output guards) and monitor it. Alert when individual guards start exceeding their allocation.

Anti-Patterns

Testing Your Guards

Guards are code. Code needs tests. Here's how to unit test each guard type without hitting external APIs on every run.

Key Takeaways

**Layer your defenses.** No single guard catches everything. Regex catches 80% of injection fast, LLM classifiers catch the remaining 20%, and topic boundaries prevent the attacks that look like legitimate queries.
**Use instructor + Pydantic for structured output.** JSON mode guarantees valid JSON, not valid schema. Instructor gives you full Pydantic validation with automatic retry for <0.1% schema failure rates in production.
**Redact PII before it reaches any LLM.** Not just the main LLM call — also guard LLM calls (injection classifier, hallucination checker, topic classifier). Every LLM call in your pipeline is a potential PII leak.
**Run guards in parallel where possible.** Input validation, PII detection, and content moderation are independent. Parallel execution cuts guard latency by 40-60% with no reduction in coverage.
**Hallucination detection is expensive — gate it.** Claim extraction + verification costs $0.01-0.05 per response and adds 500-2000ms. Run it only on RAG responses, high-stakes outputs, or sampled traffic, not every request.
**Fail open on non-critical guards, fail closed on PII.** A moderation API timeout shouldn't block all traffic. A PII detection failure should block the request — the downside of leaking customer data is asymmetric.
**Make everything configurable per route.** User-facing chat needs the full pipeline. Internal admin tools need PII redaction and not much else. Development environments can disable most guards. One global config doesn't fit.
**Track guard metrics.** Log which guards trigger, how often, and latency per guard. A guard that never triggers is either unnecessary or misconfigured. A guard that triggers on 30% of requests has a threshold problem.
**Guardrails AI and NeMo solve different problems.** Guardrails AI is validator-centric — Pydantic models with stacked output validators. NeMo is conversation-centric — Colang rails that control dialogue flow. Use the one that matches your architecture, or both.
**Budget your guard latency explicitly.** Set a target (e.g., 500ms input guards, 1000ms output guards), measure against it, and alert when guards exceed allocation. Guard latency creep is invisible until users complain.

LLM Evaluation & Benchmarking Beyond RAGAS: Production Eval Systems That Actually Work

AI Educator — Sun, 31 May 2026 00:00:00 GMT

RAGAS gives you four metrics for RAG: faithfulness, answer relevancy, context recall, and context precision. That covers whether your retriever fetched the right chunks and whether the generator stayed faithful to them. It does not cover whether your LLM is hallucinating on non-RAG tasks, whether prompt version B is statistically better than version A, whether your judge model is actually discriminating between good and bad outputs, or whether quality is degrading in production right now. This post builds every piece that RAGAS leaves out — a complete async evaluation system with shared infrastructure, position-bias correction, statistical rigor, human annotation with inter-annotator agreement, CI/CD integration, and live monitoring. Every class is production-grade Python you can drop into a real codebase.

What RAGAS Covers and Where It Stops

RAGAS (Retrieval Augmented Generation Assessment) evaluates RAG pipelines along four axes. If you haven't used it yet, start with [**Semantic Caching & RAGAS Evaluation**](/blog/semantic-caching-ragas-evaluation) for the implementation walkthrough.

These metrics are reference-free (mostly) and RAG-specific. They tell you nothing about: general LLM output quality on non-RAG tasks, comparative quality between two prompt versions, judge reliability and bias, human-AI alignment, regression detection across deployments, or real-time quality degradation. The rest of this post builds all of that.

The Evaluation Architecture

Every component in this diagram gets a full implementation below. The key design decision: a shared `JudgeClient` base class that handles API calls, retries, JSON parsing, and score normalization. Every eval method — LLM-as-judge, pairwise, rubric — is just a different prompt strategy plugged into the same async client.

JudgeClient: Shared Evaluation Infrastructure

Every eval method in this post shares one base class. It owns the API call, temperature, JSON parsing, retry logic, token counting, and score normalization. Build this once, never duplicate it.

LLM-as-Judge with Multi-Aspect Scoring

The most common automated eval pattern: use a strong model to score outputs on multiple dimensions. This implementation extends `JudgeClient` and normalizes all scores to 0-1 internally.

Pairwise Comparison with Position Bias Correction

Pairwise comparison asks: "Which of these two responses is better?" It's more reliable than absolute scoring because relative judgments are easier for LLMs. But there's a well-documented problem: **position bias**. LLMs systematically prefer the response shown first (or last, depending on the model). Research from Zheng et al. (2023) found that GPT-4 favored the first response up to 65% of the time when both were equal quality.

Rubric-Based Scoring with 0-1 Normalization

Rubrics encode domain expertise into structured evaluation criteria. A customer support rubric looks nothing like a code review rubric. This implementation defines rubrics as data, converts them to judge prompts, and normalizes all scores to the 0-1 range so they're comparable across different rubric scales.

Judge Calibration and Meta-Evaluation

A judge model is only useful if it actually differentiates good outputs from bad ones — and does so consistently. Most teams deploy an LLM judge and never verify that it works. The `JudgeCalibrator` tests sensitivity and consistency. The `MetaEvaluator` measures judge-human correlation, score distribution skew, and bias.

Human Evaluation Pipeline with SQLite and Cohen's Kappa

LLM judges are fast and cheap but imperfect. Human evaluation is the ground truth you calibrate everything against. This isn't a toy in-memory list — it's a SQLite-backed pipeline with task assignment, load balancing, multi-annotator overlap, inter-annotator agreement via Cohen's kappa, and conflict resolution through third-annotator tiebreak.

Regression Testing with pytest and CI/CD

Every prompt change is a potential regression. The eval system needs to plug directly into your test runner and CI pipeline. This means real `pytest` tests with `assert` statements, not a custom script you run manually.

The GitHub Actions workflow runs this suite on every PR that touches prompt files or the generation pipeline:

Eval Dataset Construction

Evaluation is only as good as the dataset. This builder generates synthetic examples using an LLM, enforces category balance, deduplicates via embedding similarity, stratifies by difficulty, and exports versioned JSON.

Async Batch Evaluation with Cost Optimization

Running 500 evaluations sequentially takes hours. Running them all at once hits rate limits. The solution: async evaluation with concurrency control via `asyncio.Semaphore`, token counting for cost estimation before execution, and a cost-tiered strategy that routes borderline cases to expensive models while using cheap models for clear-cut ones.

Online Evaluation and Production Monitoring

Offline eval catches problems before deployment. Online eval catches problems that only surface with real traffic: distribution shift, edge cases you didn't anticipate, quality degradation over time. This monitor tracks user signals, runs shadow evaluation on a sample of live traffic, detects input distribution shift, and alerts when scores drop.

Statistical Rigor: Bootstrap Confidence Intervals

"Prompt B scored 0.72 vs Prompt A's 0.68" means nothing without confidence intervals. The difference could be noise. Bootstrap resampling gives you confidence intervals without distributional assumptions, and lets you compute whether a score difference is statistically significant.

Key Takeaways

**RAGAS handles RAG-specific metrics** — faithfulness, relevancy, context recall, context precision. Everything else needs custom eval infrastructure.
**One base class, many strategies** — the `JudgeClient` centralizes API calls, retries, caching, and token counting. LLM-as-judge, pairwise, and rubric are prompt strategies, not separate systems.
**Position bias corrupts pairwise results** — always run comparisons twice with swapped order. Only count consistent wins.
**Normalize to 0-1** — different rubrics use different scales (1-5, 0-10, 1-3). Normalize everything internally so scores are comparable.
**Calibrate your judges** — test sensitivity (does it differentiate good from bad?) and consistency (same input, same score?). If Spearman correlation with human scores is below 0.6, rework the judge prompt.
**Human eval needs structure** — SQLite persistence, multi-annotator overlap, Cohen's kappa for agreement, tiebreak resolution for conflicts. Label Studio or Argilla for the UI.
**pytest, not scripts** — regression tests belong in your CI pipeline with `assert` statements, not a custom runner you invoke manually.
**Estimate cost before running** — token counting and cost-tiered evaluation (cheap model first, expensive model for borderline cases) cuts eval cost by 60-80%.
**Statistical significance, not vibes** — bootstrap confidence intervals tell you whether a 0.04 score difference is real or noise. Don't ship based on point estimates.
**Online eval catches what offline eval misses** — shadow evaluation, user signal tracking, and distribution shift detection close the feedback loop in production.

Prompt Engineering Patterns & Techniques: The Complete Production Toolkit

AI Educator — Sun, 31 May 2026 00:00:00 GMT

Prompt engineering is applied interface design for language models. Every pattern here solves a specific failure mode: inconsistent reasoning, unstructured output, fragile single-call architectures, untested prompts leaking into production. The code is runnable, the patterns are battle-tested, and every section ends with something you can ship.

Pattern 1: Chain-of-Thought (CoT)

CoT forces the model to externalize its reasoning before producing a final answer. Two variants: zero-shot CoT (append a trigger phrase) and manual CoT (spell out the reasoning structure). Zero-shot is fast to implement; manual CoT gives you control over the reasoning path.

Zero-Shot vs Manual CoT: Code Review Example

The direct version typically says "looks fine" or catches only the obvious bug. Zero-shot CoT catches the refund sign error. Manual CoT catches the refund bug, the missing key safety issue, and the edge case of an empty discount value being falsy vs zero.

Reusable CoT Wrapper

Pattern 2: Few-Shot Learning

Few-shot learning teaches format and behavior through examples, not instructions. The model pattern-matches against your examples rather than interpreting your description of the task. This is almost always more reliable than zero-shot for structured output tasks.

OpenAI Messages Format: Entity Extraction

Anthropic Messages Format: Same Task

Dynamic Few-Shot Builder

Pattern 3: Self-Consistency

Self-consistency samples multiple reasoning paths at temperature > 0 and takes the majority answer. It turns an unreliable 70% accuracy into a reliable 90%+ by letting variance work in your favor. The implementation is straightforward with async parallel calls.

Pattern 4: Prompt Chaining

Prompt chaining decomposes a complex task into a pipeline of focused steps. Each step gets a simple, well-defined job. The output of step N becomes the input of step N+1. Failures are isolated, intermediate results are inspectable, and individual steps can be swapped without rewriting the pipeline.

Pattern 5: Structured Output Prompting

Unstructured model output is the #1 source of production bugs. JSON parsing failures, missing fields, wrong types — structured output patterns eliminate these entirely. Three approaches: JSON mode, Pydantic + instructor, and schema enforcement with retry.

OpenAI JSON Mode

Pydantic + Instructor: Type-Safe Structured Output

Manual Schema Enforcement with Retry

Pattern 6: System Prompt Design

The system prompt is the behavioral contract between you and the model. A vague system prompt produces vague behavior. A precise one produces a reliable agent. Below: a bad system prompt, a good one, and the reasoning behind every change.

Bad vs Good: Customer Support Agent

Adapting System Prompts for OpenAI vs Anthropic

Pattern 7: Advanced Techniques

Negative Examples

Telling the model what NOT to do is sometimes more effective than describing what you want. Especially useful for eliminating specific failure modes you've observed in production.

Prompt Templates with Jinja2

Prompt Version Registry

Production Patterns

Prompt A/B Testing

Prompt Regression Testing

Token Counting and Cost Estimation

Anti-Patterns: Before and After

These are real production failures, not hypotheticals. Each one has burned engineering hours.

Provider Differences That Matter

Key Takeaways

**CoT** is a reasoning amplifier. Use manual CoT for control, zero-shot for convenience. Skip it for simple classification.
**Few-shot examples** beat instructions for format control. Build them dynamically from a dataset. Watch example ordering bias.
**Self-consistency** turns 70% accuracy into 90%+ at 5x cost. Gate it behind confidence thresholds and use it only for high-stakes calls.
**Prompt chaining** makes complex tasks debuggable. Each step is independently testable and swappable.
**Structured output** eliminates parsing bugs. Use instructor/Pydantic in production, not raw JSON mode.
**System prompts** need structure: role, capabilities, restrictions, format, tone, escalation triggers. 30 lines beats 3.
**Version your prompts** like code. A/B test variants. Run regression suites in CI. Estimate costs before calling.
**Adapt per provider.** XML for Claude, markdown for GPT, test everything for open-source models.

Next Steps

These patterns are the daily toolkit. They compose: CoT inside a chain step, few-shot inside a self-consistency loop, structured output at every stage. The next level is agent architecture — where prompts become tools that call other tools. See Phase 2: Agent Architecture Patterns for that.

State Management for Multi-Agent Systems: Redis, PostgreSQL, LangGraph & Checkpointing

AI Educator — Sat, 30 May 2026 03:00:00 GMT

State is what separates a multi-agent system from a collection of stateless function calls. Without it, agents can't remember what another agent already discovered, can't resume after a crash, and can't coordinate updates to shared data without overwriting each other. Every reliability problem in multi-agent systems — lost progress, inconsistent answers, duplicated work — traces back to state management.

This post covers four approaches: Redis for fast ephemeral state, PostgreSQL for durable transactional records, LangGraph for typed state graphs with conditional routing across agent nodes, and checkpoint/resume patterns that survive process crashes. We'll build each with production-grade code and wire them through a running example: a customer support workflow handling **"I was charged twice, and I can't log in"** — a request that spans billing and account recovery and forces real state coordination.

The five kinds of state

Not all state deserves the same storage. Picking the wrong backend for a given state category is the most common architectural mistake in multi-agent systems.

**Ephemeral state** — session context, in-progress variables, temporary caches. Lives in Redis or memory. Lost on restart is acceptable.
**Durable state** — transactions, approvals, audit logs. Must survive crashes. Lives in PostgreSQL or equivalent RDBMS.
**Shared state** — data multiple agents read and write during a workflow. Needs concurrency control regardless of backend.
**Private agent state** — scratchpad, reasoning traces, tool call history. Owned by one agent, never shared.
**Checkpoint state** — frozen snapshots of workflow progress at specific boundaries. Enables resume-from-failure.

The running example: refund + account recovery

A user says: **"I was charged twice, and I still can't log in."** The system routes to a BillingAgent and an AccountAgent in parallel. Both write findings to shared state. A PolicyAgent reads those findings and decides whether to approve the refund. A ResponseAgent reads everything and composes the final answer. The workflow ID is `case_8842`.

The state that accumulates across this workflow includes: the original request, conversation history, billing findings (duplicate charge confirmed, transaction IDs), account recovery status (password reset sent), policy decision (refund approved), retry counts, and the final resolution. Every piece needs to be stored somewhere, readable by the right agents, and recoverable if the process dies mid-workflow.

Part 1: Redis for ephemeral state

Redis is the right tool for state that needs to be fast, shared, and temporary — active workflow coordination, session context, rate limiting, and distributed locks. It's the wrong tool for anything that must survive a Redis restart without persistence configured.

Two design decisions worth noting. First, the Redis client is private (`_redis`) — no external code should reach into it directly. Second, section-level updates use Lua scripts for atomicity. Without Lua, a read-modify-write cycle between two agents can silently drop one agent's update.

Agents consuming Redis state

State infrastructure is useless if agents don't use it consistently. Here's a base class that standardizes how agents load and save state, with a concrete BillingAgent that demonstrates the pattern:

The key pattern: agents never touch the state manager's internals. They call `update_workflow_section` with their own section name, and the state manager handles atomicity. An AccountAgent would own the `account` section, a PolicyAgent the `policy` section. No agent can accidentally overwrite another's data.

Part 2: PostgreSQL for durable state

Redis handles the hot path. PostgreSQL handles everything that must survive: approved refunds, audit trails, compliance records, and state history for debugging. If your workflow involves money, approvals, or regulatory requirements, the final decisions must land in a durable store.

The critical addition here is the audit log. Every state mutation records who changed what, the before and after values, and the version number. When a refund goes wrong three weeks later, you can reconstruct exactly what each agent saw and decided.

When to use Redis vs PostgreSQL

Most production systems use both: Redis for the active workflow (fast reads during agent execution) and PostgreSQL for durable decisions (the refund was approved, the audit trail). Sync the final decision to PostgreSQL when the workflow completes or at critical state transitions.

Part 3: LangGraph state graphs

LangGraph models multi-agent workflows as directed graphs where **typed state flows through nodes**. Each node is an agent function that reads state, does work, and returns updates. Conditional edges route state to different agents based on the current values. This makes workflow logic explicit and inspectable — you can see exactly which agent runs next and why.

Here's the refund + account recovery workflow as a full LangGraph implementation with typed state, three agent nodes, conditional routing, and state accumulation:

Three things make this different from a toy LangGraph example. First, the state is **typed with domain-specific sub-structures** (BillingResult, AccountResult, PolicyResult) — not a generic dict. Second, `completed_steps` uses an **accumulator reducer** (`operator.add`) so each node appends to the list without overwriting previous entries. Third, the policy agent makes real decisions based on accumulated state from previous nodes, including a threshold rule that escalates high-value refunds to human review.

Part 4: Checkpointing and workflow resumption

LangGraph handles checkpointing automatically if you use its built-in savers. But many systems use custom orchestration where you need manual checkpoint control. The key design requirement: **O(1) lookup of the latest checkpoint** for any workflow. Scanning keys or iterating lists to find the latest checkpoint is a production bug waiting to happen.

The original version of this code had a broken `get_latest_checkpoint` that stored the checkpoint ID in a list, then did a `scan_iter` across all keys to find the matching one — O(n) where n is the number of checkpoint keys for that workflow. The fix is simple: store a `latest` pointer key that contains the direct Redis key of the most recent checkpoint. One GET, done.

Where to place checkpoints

Checkpoint at natural workflow boundaries — not after every line of code, and not only at the end.

**After expensive operations** — LLM calls, external API lookups, database queries. These cost time and money to repeat.
**Before side effects** — refund issuance, email sends, account changes. If the process dies after the side effect but before the checkpoint, resumption will repeat the side effect. Checkpoint *before* so you can detect and skip on resume.
**At human-in-the-loop boundaries** — before waiting for approval, after receiving it. Humans are slow; don't make them repeat themselves.
**After each agent completes** — in a multi-agent pipeline, checkpoint between agent handoffs.

Part 5: State size management

State grows. Conversation histories accumulate. Agent reasoning traces get verbose. Retrieved documents get attached. Without size management, your state eventually hits Redis key size limits (512MB), causes serialization timeouts, or makes checkpoint/resume painfully slow.

The most common state bloat sources are conversation message histories (which grow linearly with turns), agent reasoning traces (which can be thousands of tokens each), and retrieved document chunks. Prune messages to a rolling window, truncate reasoning after the decision is made, and store large retrievals by reference rather than inlining them in state.

Putting it together: dual-store pattern

In practice, most production multi-agent systems use Redis and PostgreSQL together. Redis handles the hot path — fast reads during agent execution, distributed locks, ephemeral coordination. PostgreSQL handles the cold path — durable decisions, audit trails, compliance. Here's how they connect:

Failure modes and mitigations

**Lost progress after crash** — Mitigated by checkpointing at agent boundaries. Cost: one Redis write per checkpoint.
**Stale reads** — Agent B reads state before Agent A's write lands. Mitigated by reading state *inside* the agent's run method, not before dispatch.
**Concurrent overwrites** — Two agents write to the same state field. Mitigated by section ownership (each agent owns its section) and Lua atomic updates.
**State bloat** — Messages and reasoning traces grow unbounded. Mitigated by StateSizeManager pruning and compression.
**Schema drift** — State structure changes between deploys, breaking in-flight workflows. Mitigated by version fields and migration functions on resume.
**Redis TTL expiry during long workflows** — State disappears mid-workflow if TTL is too short. Set TTLs based on worst-case workflow duration, and refresh TTL at each checkpoint.

Decision framework

**Prototyping a single-agent workflow?** — Use in-memory dicts. Don't add infrastructure until you need it.
**Multi-agent with <10s workflows?** — Redis only. Checkpoint at each agent boundary. Persist final result to PostgreSQL.
**Multi-agent with human-in-the-loop?** — Redis + PostgreSQL + checkpointing. Humans are slow; you need durable resume.
**Using LangGraph?** — Use its built-in checkpointer (SQLite for dev, PostgreSQL for prod). Don't build your own unless you need custom checkpoint logic.
**Compliance/audit requirements?** — PostgreSQL with audit log table. Every state mutation gets a row.

Intent Classification for Agent Routing: LLM-Based, Embedding-Based & Hybrid Approaches

AI Educator — Sat, 30 May 2026 02:00:00 GMT

Intent classification is one of the most important building blocks in a multi-agent system. Before you can send a request to the right agent, you first need to understand **what the user is trying to do**. That is the job of intent classification.

This sounds simple at first. If a user says, **"Reset my password"**, route to the authentication agent. If they say, **"Where is my order?"**, route to the order-tracking agent. But real user requests are often messy, ambiguous, and multi-purpose. A single message may contain several intents, incomplete context, or wording the system has never seen before.

This guide explains intent classification for agent routing in a very detailed and easy-to-understand way. We will cover LLM-based classification, embedding-based classification, hybrid routing, confidence thresholds, fallback logic, and multi-intent detection. We will also use a practical example throughout so the concepts stay concrete.

Why routing needs intent classification

In a multi-agent architecture, different agents are usually specialized. One agent may handle billing, another technical support, another account management, and another product recommendations. If every request goes to every agent, the system becomes slow, expensive, and noisy. Routing helps the system send each request only where it belongs.

Intent classification is the decision layer behind that routing. It helps answer questions like:

Is this a billing issue or a technical issue?
Does this request need one agent or multiple agents?
How confident is the system in its routing decision?
Should the system ask a clarifying question before routing?
Should the request go to a fallback or human review path?

A running example: e-commerce support router

Suppose we are building an e-commerce assistant with these specialized agents:

A **Billing Agent** for refunds, charges, and invoices
An **Order Agent** for shipping status, cancellations, and delivery issues
An **Account Agent** for login, password reset, and profile changes
A **Product Agent** for recommendations and product questions
A **Technical Support Agent** for app or website problems

Now consider these user messages:

**"I was charged twice for my last order."**
**"My package says delivered, but I never got it."**
**"I can't log in and I also need to update my email address."**
**"Which laptop is best for video editing under $1500?"**
**"The app crashes when I try to check out."**

A good router should send each request to the correct agent or agents. That routing decision depends on intent classification.

What makes intent classification hard

Real-world requests are not always clean. Users may be vague, emotional, indirect, or combine multiple needs in one sentence. For example, **"I can't log in and I think I was billed for the wrong plan"** contains both an account issue and a billing issue.

Intent classification becomes difficult because of:

**Ambiguity**: the wording could fit more than one intent
**Multi-intent queries**: one message contains several tasks
**Domain overlap**: similar language appears across categories
**Rare phrasing**: users describe familiar problems in unfamiliar ways
**Low context**: the message is too short to classify confidently

That is why production systems often combine several methods instead of relying on only one.

Part 1: LLM-based classification

LLM-based classification uses a language model to read the user request and decide which intent best matches it. This approach is powerful because LLMs understand nuance, paraphrasing, and context better than simple keyword rules.

For example, a user might say **"Why did you take money from my card twice?"** Even if the exact phrase **"duplicate charge"** never appears, an LLM can still infer that this is likely a billing intent.

Why LLM classification works well

It handles paraphrases and natural language variation well
It can use richer intent descriptions instead of only examples
It can explain its reasoning
It can detect multiple intents in one request
It adapts better when user wording is messy or indirect

This makes LLMs especially useful when your routing space is complex or when user requests are highly varied.

Limitations of LLM classification

LLM-based routing is powerful, but it is not free. It is usually slower and more expensive than embedding-based methods. It may also produce unstable outputs if prompts are weak or if the model is not constrained to structured JSON.

That is why many systems use LLM classification selectively: for ambiguous cases, high-value requests, or as a fallback when faster methods are uncertain.

Example LLM routing decision

This output is useful because it gives both the routing label and a confidence score. The router can use that confidence to decide whether to route immediately or trigger a fallback.

Part 2: Embedding-based classification

Embedding-based classification works differently. Instead of asking an LLM to reason directly, it converts text into vectors and compares the user query to stored examples for each intent. The most similar intent wins.

This approach is often much faster and cheaper than LLM classification. It works especially well when intents are clearly separated and you have good example phrases for each one.

How to think about embeddings simply

A useful mental model is this: embeddings place similar meanings near each other in vector space. If **"I need a refund"** and **"I was charged twice"** are close to your billing examples, the classifier will likely route them to the billing agent.

This method is efficient because you can precompute intent example embeddings ahead of time. Then, at runtime, you only embed the incoming query and compare it to stored vectors.

When embeddings work well

Your intents are clearly distinct
You have representative examples for each intent
You need low latency and lower cost
Most requests are routine and repetitive

When embeddings struggle

Two intents use very similar language
The user request is long and contains multiple goals
The request depends on subtle context or policy nuance
Your example set is weak or incomplete

This is why embeddings are often excellent for the fast path, but not always enough for the final decision.

Example embedding routing result

Here the top score is high enough that the router may confidently choose the billing agent without calling an LLM.

Part 3: Hybrid ensemble classification

A hybrid classifier combines multiple methods so you get the strengths of each. The most common pattern is:

Use embeddings first because they are fast and cheap.
If confidence is high, route immediately.
If confidence is low or the top intents are too close, call an LLM.
If the LLM is still uncertain, ask a clarifying question or use fallback routing.

This design is popular because most requests are easy. You do not need expensive reasoning for every message. You only spend extra compute on the hard cases.

Why confidence thresholds matter

Confidence thresholds help the router decide when a prediction is strong enough to trust. If the top embedding score is 0.92, maybe that is good enough. If it is 0.61 and the second-best score is 0.59, the request is probably ambiguous.

Thresholds are not universal. A safe threshold depends on your domain, your intent set, and the cost of misrouting. In a low-risk FAQ bot, a lower threshold may be acceptable. In a financial or healthcare workflow, you may want stricter thresholds and more fallback checks.

A practical hybrid routing policy

If embedding score is above 0.85, route directly
If embedding score is between 0.65 and 0.85, use LLM verification
If embedding score is below 0.65, mark as uncertain
If uncertain after LLM review, ask a clarifying question or send to fallback support

Part 4: Multi-intent detection

Some user requests should not be routed to only one agent. For example: **"I can't log in and I need a copy of my invoice."** This contains both an account-access intent and a billing intent.

Multi-intent detection identifies all relevant intents in one message. That allows the system to either:

run multiple agents in parallel
split the request into sub-tasks
prioritize one intent first and queue the others
ask the user which issue they want to solve first

Example multi-intent output

This is much better than forcing the whole request into one label. The router can now coordinate multiple agents more intelligently.

End-to-end walkthrough of the routing example

Let us walk through a realistic request: **"I can't log in, and I was also charged twice this month."**

The router receives the user message.
The embedding classifier compares it against known intent examples.
It finds strong similarity to both `account_access` and `billing_refund`.
Because there are multiple strong candidates, the router triggers LLM verification.
The LLM confirms that the request contains two intents.
The router creates two sub-tasks: one for the Account Agent and one for the Billing Agent.
The Account Agent handles login recovery.
The Billing Agent investigates the duplicate charge.
The orchestrator combines the results into one coordinated response.

This example shows why routing is not just classification. It is classification plus confidence handling, fallback logic, and workflow coordination.

Fallback strategies when routing is uncertain

No classifier is perfect. Good systems plan for uncertainty instead of pretending it does not exist.

Common fallback strategies include:

**Ask a clarifying question**: "Is this about billing or account access?"
**Route to a generalist agent** that can gather more context
**Escalate to a human** for high-risk or high-value cases
**Use a safe default path** such as support triage when confidence is too low

The right fallback depends on the cost of misrouting. If sending a request to the wrong agent is cheap, you can be more aggressive. If it creates risk, delay, or customer frustration, you should be more conservative.

How to evaluate routing quality

To improve routing, you need to measure it. Useful evaluation questions include:

How often does the top predicted intent match the correct one?
How often does the system miss a second intent?
How often does fallback trigger?
Which intents are most often confused with each other?
What is the latency and cost of each routing path?

These metrics help you decide whether to improve examples, adjust thresholds, rewrite prompts, or change the hybrid policy.

Best practices checklist

Define intents clearly and keep boundaries understandable
Collect representative examples for each intent
Use embeddings for fast first-pass routing
Use LLMs for ambiguous or high-value cases
Set confidence thresholds based on real evaluation data
Support multi-intent detection when users often combine requests
Add fallback logic for uncertain cases
Log routing decisions and confidence scores for analysis
Continuously review misrouted examples and improve the classifier

Orchestration Architectures: Supervisor, Router & Hierarchical Patterns for Multi-Agent Systems

AI Educator — Sat, 30 May 2026 00:00:00 GMT

Building one capable agent is useful. Building **multiple specialized agents that work together reliably** is a fundamentally different problem. Once you have a billing agent, a technical support agent, an account agent, and a fraud agent, the question stops being 'how does each agent work?' and becomes 'how does work move through the system?' That coordination layer is called **orchestration**, and getting it wrong means your multi-agent system is just a collection of isolated specialists with no teamwork.

This guide covers six orchestration patterns — supervisor, router, sequential pipeline, parallel fan-out, event-driven, and hierarchical — with production-grade code for each. We'll use one running example throughout: a customer service platform where a user says **"My API calls are failing with a 429 error, I was charged twice, and I can't log into my account."** This request spans three domains and forces the orchestration layer to make real decisions about routing, parallelism, and result aggregation.

The running example: a multi-domain customer request

Our customer service system has five specialized agents. **BillingAgent** handles invoices, refunds, and subscription changes. **TechnicalSupportAgent** handles API errors, bugs, and troubleshooting. **AccountAgent** handles login, password resets, and profile updates. **PolicyAgent** checks whether actions comply with company rules. **ResponseAgent** turns internal results into a user-facing answer.

When the user says "My API calls are failing with a 429 error, I was charged twice, and I can't log into my account," three of these agents need to be involved. The orchestration layer must decide: does it route to one agent at a time? Run all three in parallel? Use a supervisor to coordinate? The answer depends on the pattern you choose.

Pattern 1: The supervisor

The supervisor pattern uses one central orchestrator that receives every request, classifies the intent using an LLM, dispatches tasks to specialized workers, waits for results, and aggregates them into a final response. It has the big picture and makes all coordination decisions.

The critical part of a supervisor is the classification step. A naive implementation checks for keywords like `if "charged twice" in request` — that breaks on anything remotely creative the user might type. A production supervisor uses an LLM to classify intent, detect multi-intent requests, and reformulate queries for each specialist.

Notice what this supervisor handles that a naive implementation doesn't: multi-intent detection (the user's request spans three domains), confidence filtering (low-confidence intents get dropped), parallel vs sequential dispatch (the LLM decides whether tasks are independent), worker timeouts, partial failure in aggregation (if one worker dies, the others still contribute), and tool execution within each worker's own ReAct loop.

When supervisor fits vs when it doesn't

Use a supervisor when requests frequently span multiple domains and need coordination — the supervisor's big-picture view is essential for decomposing complex requests and merging results. Avoid it when most requests go to a single agent, because you're paying for an extra LLM call (the classification step) on every request even when routing is obvious. The supervisor also becomes a single point of failure and a latency bottleneck as traffic grows.

Pattern 2: The router

The router is the supervisor's lighter sibling. It classifies intent and sends the request to **one** specialist, then gets out of the way. No multi-agent coordination, no result merging — the selected agent handles the full request independently. This is the right pattern when 80% of requests map cleanly to a single domain.

The key design choice here: when the router detects a multi-domain request, it doesn't try to handle it — it escalates to a supervisor. This lets you compose patterns: router handles the common case fast, supervisor handles the complex case thoroughly.

Pattern 3: Sequential pipeline

A pipeline passes work through a fixed sequence of stages, each transforming the output of the previous stage. This works when tasks have a natural order — you need to verify identity before changing account settings, or validate policy before issuing a refund. The implementation needs to handle validation between stages and retry logic when a stage produces insufficient output.

This pipeline handles what the simple `for stage in stages` version doesn't: per-stage validation (did the extraction actually produce JSON?), retries with feedback (the retry tells the model its previous attempt was insufficient), per-stage timeouts, abort-on-failure with partial results, and timing metrics for observability.

Pattern 4: Parallel fan-out with error recovery

Fan-out sends independent tasks to multiple agents concurrently. The hard part isn't the parallelism — that's just `asyncio.gather`. The hard part is **what happens when some agents succeed and others fail**, and **how you merge heterogeneous results into one coherent response**.

Pattern 5: Event-driven orchestration

In event-driven orchestration, agents react to events instead of being told what to do by a central controller. One agent publishes `duplicate_charge_confirmed`, and other agents that subscribe to that event kick off their own work. This creates loose coupling — agents don't need to know about each other, only about the events they care about. The trade-off: control flow becomes implicit and harder to trace.

This event bus handles what the 11-line version doesn't: handler timeouts (a slow handler doesn't block the system), dead letter tracking (failed events are captured for debugging), named handlers (you can see which subscriber failed), full event history with correlation IDs (you can trace an entire workflow), and concurrent event propagation.

Pattern 6: Hierarchical orchestration

When your system has 15+ agents across multiple domains, a single supervisor becomes unwieldy. Hierarchical orchestration adds layers: a **meta-orchestrator** delegates to **domain supervisors**, each of which manages their own team of specialist workers. This mirrors how large organizations work — a CEO delegates to VPs, VPs delegate to managers.

Pattern comparison matrix

Choosing the right pattern depends on your specific constraints. This matrix compares all six across the dimensions that matter most in production.

Combining patterns: real-world architecture

Production systems rarely use one pattern in isolation. For our running example — "My API calls are failing, I was charged twice, and I can't log in" — a realistic architecture combines a **router** at the front door (fast path for simple requests), a **supervisor** for multi-domain requests (the router escalates to it), **parallel fan-out** within the supervisor for independent subtasks, a **pipeline** within each domain for ordered steps like policy-then-refund, and **event-driven** coordination for side effects like notifications and audit logging.

The key insight: patterns are composable building blocks, not mutually exclusive choices. Start with the simplest pattern that handles your most common case (usually a router), then add complexity only where the workload demands it.

Phase 2: Agent Architecture — ReAct, Planning, Memory & Frameworks

AI Educator — Fri, 29 May 2026 12:00:00 GMT

Phase 1 taught you how to call LLMs, craft prompts, wire up tools, and retrieve context with RAG. But those are all **single-turn** patterns — you send a request and get a response. Real-world AI applications need something fundamentally different: **agents** that can reason about problems, take actions, observe results, and adapt their strategy across multiple steps. Phase 2 is where you learn to build those agents — and the non-negotiable rule is that you build the core loop yourself first, with raw API calls, before touching any framework.

Part 1: The ReAct Pattern — Build From Scratch

The ReAct pattern (Reasoning + Acting) is the foundational architecture behind virtually every modern AI agent. Introduced by Yao et al. in their 2022 paper *'ReAct: Synergizing Reasoning and Acting in Language Models'*, it interleaves **thinking** (chain-of-thought reasoning) with **acting** (calling external tools) in a loop. The model generates a thought about what to do, takes an action, observes the result, then thinks again — repeating until it has enough information to produce a final answer.

The Thought → Action → Observation Loop

At its core, the ReAct loop is deceptively simple. The LLM receives a system prompt that defines available tools and the expected output format. On each iteration, it produces a **Thought** (its reasoning about what to do next), an **Action** (a tool call with specific arguments), and then you — the orchestrator — execute that action and feed back an **Observation** (the tool's response). The loop continues until the model emits a **Final Answer** instead of another action.

Let's build this from absolute zero. No LangChain, no LangGraph, no frameworks — just Python, the OpenAI SDK, and your own loop control logic. This is the single most important exercise in this entire curriculum.

Defining Tools as JSON Schema

Before writing the loop, you need tools for the agent to call. We'll define them both as executable Python functions and as JSON schemas that the LLM understands. This dual definition — the schema for the model, the implementation for the runtime — is a pattern you'll use in every agent you build.

The Core Agent Loop

Now the main event — the ReAct loop itself. This is roughly 60 lines of Python that replicate what frameworks like LangChain wrap in thousands of lines of abstraction. Read every line carefully. Understand what the loop does on each iteration, how it manages conversation history, and how it decides when to stop.

Handling Failure Modes

The naive loop above works for happy paths, but production agents face every failure mode imaginable. The model might call a tool that doesn't exist. It might get stuck in an infinite loop calling the same tool repeatedly. It might generate malformed JSON arguments. It might hallucinate tool names. You need to handle all of these before your agent touches production traffic.

Termination Conditions Deep Dive

A well-designed agent needs multiple termination conditions, not just 'the model stopped calling tools.' Here's the complete set you should implement in every production agent:

Building a ReAct Agent with Anthropic

Anthropic's tool use API differs from OpenAI's in important ways. Tool definitions use a different schema format, tool results are sent as `tool_result` content blocks, and the model uses a `stop_reason` field to signal when it wants to use tools. Let's build the same agent using Claude.

Part 2: Planning Patterns

The basic ReAct loop is reactive — the model decides what to do one step at a time. For complex tasks, this leads to inefficient wandering. Planning patterns solve this by separating the **strategy** from the **execution**. The agent first creates a plan, then executes each step, and optionally revises the plan based on what it learns along the way.

Plan-and-Execute

The Plan-and-Execute pattern uses two separate LLM calls with different roles. A **planner** agent (usually a stronger model like GPT-4o or Claude Opus) decomposes the user's request into a numbered list of subtasks. An **executor** agent (which can be a cheaper model) then works through each subtask sequentially, reporting results back. If a step fails or reveals new information, the planner can be invoked again to revise the remaining steps.

Tree of Thoughts

Tree of Thoughts (ToT) extends chain-of-thought reasoning by exploring **multiple reasoning paths** simultaneously instead of following a single chain. Think of it as breadth-first search over possible thought sequences. At each step, the model generates several candidate 'thoughts,' evaluates which ones are most promising, and expands only the best branches. This is especially powerful for problems with multiple valid approaches — mathematical proofs, creative writing, strategic planning, and puzzle solving.

Reflexion: Self-Critiquing Agents

Reflexion is a pattern where the agent generates an output, then **critiques its own output** and uses that critique to produce an improved version. It's inspired by how humans revise their work — write a draft, review it, identify weaknesses, and rewrite. The key insight is that LLMs are often better at **evaluating** outputs than **generating** perfect ones on the first try. By giving the model a chance to reflect, you get significantly better results on tasks like code generation, writing, and analysis.

Chain-of-Thought vs Program-of-Thought

These two reasoning strategies represent fundamentally different ways an agent can solve problems. **Chain-of-Thought (CoT)** asks the model to reason in natural language — step by step, in words. **Program-of-Thought (PoT)** asks the model to generate executable code that solves the problem, then runs that code. For anything involving math, data manipulation, or precise logic, PoT dramatically outperforms CoT because the code executes deterministically.

Part 3: Memory Systems

A single-turn agent is stateless — it handles one request and forgets everything. A useful agent needs **memory**: the ability to recall what happened earlier in the conversation, what it learned in previous sessions, and what knowledge it has accumulated over time. Memory is what turns a tool into a colleague. There are five distinct types of memory that production agents use, each serving a different purpose.

In-Context Memory: Conversation History Management

The simplest form of memory is just passing the entire conversation history in the messages array on every LLM call. This is what you've been doing in every agent so far. But context windows are finite (even 128K or 200K tokens fill up fast when agents make many tool calls), so you need strategies for managing this. The three main approaches are **sliding window**, **summarization**, and **smart truncation**.

External Short-Term Memory: Redis & DynamoDB

In-context memory disappears when the conversation ends. For agents that need to maintain state across multiple API calls or even across sessions (but not forever), you need external short-term storage. Redis and DynamoDB are the two most common choices. Redis is faster and simpler for session state. DynamoDB is better when you need durability and don't want to manage infrastructure.

Long-Term Episodic Memory

Episodic memory stores **past interactions** that might be relevant to future conversations. When a user says 'remember, I prefer Python over JavaScript' or 'like we discussed last week,' the agent needs to retrieve those past episodes. This is built on top of a vector store — you embed summaries of past interactions, and at the start of each new conversation, retrieve the most relevant ones.

Semantic Memory: Vector Stores as Agent Knowledge

Semantic memory is your agent's **knowledge base** — facts, documents, and reference material that the agent can consult. This is essentially the RAG pipeline from Phase 1, but integrated into the agent as a tool rather than a standalone retrieval step. The agent decides when it needs to look something up, queries the vector store, and uses the results in its reasoning.

Procedural Memory: Tool Libraries and Skill Stores

Procedural memory stores **how to do things** — it's a dynamic library of tools and skills that the agent can draw from. Instead of hardcoding a fixed set of tools, the agent has access to a skill store where it can discover, load, and use tools dynamically. This is how sophisticated agents like Claude Code's agent system or AutoGPT-style systems work — they select tools from a registry based on what the current task requires.

Part 4: Agent Frameworks — After the Scratch Build

Now that you've built every core pattern from scratch — the ReAct loop, planning, memory — you're ready to use frameworks. The key mindset shift: you're not learning these frameworks because you can't build agents without them. You're using them because they handle the boring parts (state persistence, streaming, deployment) while you focus on the interesting parts (agent logic, tool design, evaluation). You understand what's underneath, so you can debug anything.

LangGraph: Stateful Graph-Based Workflows

LangGraph models agent workflows as **state machines** — directed graphs where nodes are processing steps and edges are transitions. The state is an explicit object that flows through the graph, and you define exactly how each node reads and writes to that state. This makes complex agent behaviors predictable and debuggable because the state is always visible and the transitions are always explicit.

Notice the structure: the graph has explicit nodes (agent, tools), explicit edges (conditional routing based on whether tool calls exist), and explicit state (messages + step count). Compare this to the raw ReAct loop you built earlier — the logic is identical, but now the control flow is a visible, inspectable graph rather than a Python for-loop. This matters when your agent has 10+ nodes with complex branching.

LangGraph: Plan-and-Execute Pattern

Let's implement the Plan-and-Execute pattern from earlier using LangGraph's state machine. This shows the real power of graph-based workflows — you can model the planner and executor as separate nodes with state flowing between them.

LangChain Agents: AgentExecutor and Tools

LangChain's `AgentExecutor` is the original high-level agent abstraction. It wraps the ReAct loop into a single class that handles tool execution, memory, and output parsing. While LangGraph is now recommended for complex workflows, AgentExecutor is still the fastest way to spin up a simple agent. Here's the full pattern including custom tools and memory.

AutoGen: Multi-Agent Conversations

AutoGen (by Microsoft) models agent systems as **conversations between multiple agents**. Instead of a single agent with tools, you create specialized agents that talk to each other. An 'assistant' agent generates plans and code. A 'user proxy' agent executes code on behalf of the human and returns results. A 'critic' agent reviews outputs. This conversational architecture is surprisingly powerful for complex tasks because each agent can have different models, tools, and system prompts.

crewAI: Role-Based Agent Teams

crewAI takes a different approach to multi-agent orchestration. Instead of free-form conversations, it defines **roles**, **goals**, and **tasks** explicitly. Each agent has a specific role (like 'Senior Data Analyst' or 'Technical Writer'), and tasks are assigned to agents with defined expected outputs. This structure makes it easier to reason about what each agent does and makes the workflow more predictable.

IBM watsonx Orchestrate

IBM watsonx Orchestrate is an enterprise agent platform that takes a constraints-first approach. Rather than giving agents unlimited freedom, it defines strict **skill flows** — sequences of actions that agents can take, with guardrails at every step. This is the right model for enterprise environments where agents need to comply with regulations, audit requirements, and governance policies. Understanding Orchestrate's constraints as a design philosophy — not a limitation — makes you a better agent architect.

Framework Comparison Matrix

Putting It All Together: A Complete Agent System

Let's build a complete agent system that combines everything from Phase 2 — the ReAct loop, planning, multiple memory types, and robust error handling — into a single, production-ready architecture. This is the kind of system you'd actually deploy.

Key Takeaways

Phase 2 transforms you from someone who can call APIs into someone who can build autonomous systems. The patterns here — ReAct, planning, memory, self-critique — are the building blocks of every production agent. In Phase 3, we'll scale these patterns into multi-agent systems, add evaluation and observability, and tackle the hardest problem in agent engineering: making agents reliable enough to trust in production.

Phase 1: Core Foundations of LLM Engineering — APIs, Prompts, Tools & RAG

AI Educator — Fri, 29 May 2026 00:00:00 GMT

Building production-grade AI applications requires more than just calling an API. You need to understand how modern LLMs work under the hood, how to craft prompts that reliably produce structured output, how to extend models with tools and function calling, and how to ground their responses in your own data using retrieval-augmented generation. This guide covers all four pillars in depth — the complete Phase 1 foundation for any serious AI engineer.

Part 1: LLM APIs & SDKs

Every major LLM provider exposes a **chat completions** interface. You send a list of messages (system, user, assistant) and receive a generated response. The core pattern is the same across OpenAI, Anthropic, and IBM watsonx.ai, but each has its own SDK conventions, authentication, and feature set.

OpenAI Chat Completions

The OpenAI SDK is the most widely used. The `chat.completions.create` method accepts a model identifier, a list of messages, and optional parameters like `temperature`, `max_tokens`, and `response_format`.

Anthropic Messages API

Anthropic's API uses a **messages** endpoint with a slightly different structure. The system prompt is a top-level parameter rather than a message role, and the response includes `stop_reason` and detailed `usage` metrics.

IBM watsonx.ai

IBM watsonx.ai provides access to foundation models through the `ibm-watsonx-ai` SDK. It uses a project-based authentication model and supports models like Granite, Llama, and Mixtral.

Streaming Responses

For real-time UIs, you need **streaming**. Instead of waiting for the entire response, you receive tokens as they're generated. This dramatically improves perceived latency — users see output within 200ms instead of waiting 3-5 seconds.

Token Counting & Context Window Management

Every LLM has a **context window** — the maximum number of tokens it can process in a single request (input + output combined). Understanding tokenization is critical for cost optimization and avoiding truncation errors.

Model Selection Trade-offs

Choosing the right model is a balancing act between **quality**, **latency**, **cost**, and **context window**. Here's a decision framework:

**High-stakes reasoning** (legal analysis, code review, complex math) → GPT-4o or Claude Sonnet 4
**High-volume simple tasks** (classification, extraction, summarization) → GPT-4o-mini or Claude Haiku 3.5
**On-premise / data sovereignty requirements** → IBM Granite via watsonx.ai
**Long document processing** (200K+ tokens) → Claude Sonnet 4 with 200K context
**Real-time chatbots** (latency-sensitive) → GPT-4o-mini or Claude Haiku 3.5 with streaming

Async API Calls

When building production applications, you'll need to make **concurrent API calls** — processing multiple documents, running evaluations in parallel, or serving multiple users. Python's `asyncio` is essential here.

Part 2: Prompt Engineering

Prompt engineering is the art and science of communicating with LLMs to get reliable, structured, high-quality output. It's the single highest-leverage skill in AI engineering — a well-crafted prompt can turn a mediocre model into an excellent one.

Zero-Shot Prompting

Zero-shot means giving the model a task **without any examples**. You rely entirely on the model's pre-trained knowledge and your instructions. This works well for simple, well-defined tasks.

Few-Shot Prompting

Few-shot prompting provides **examples** of input-output pairs. This dramatically improves consistency, especially for tasks where the desired format or reasoning style isn't obvious from instructions alone.

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting asks the model to **show its reasoning** before giving a final answer. This significantly improves accuracy on complex tasks like math, logic puzzles, and multi-step reasoning.

Structured Output (JSON Mode & XML Tags)

Production applications need **predictable, parseable output**. Both OpenAI and Anthropic offer mechanisms to constrain model output to valid JSON or structured formats.

Anthropic's Claude works exceptionally well with **XML tags** to structure both input and output, since it was trained with XML-aware formatting:

Role Prompting & Instruction Following

The **system prompt** sets the model's persona, constraints, and behavioral guidelines. A well-crafted system prompt is the difference between a generic chatbot and a specialized domain expert.

Prompt Injection Awareness & Defense

**Prompt injection** is when user input manipulates the model into ignoring its system instructions. This is the #1 security concern in LLM applications. Understanding attack vectors is essential for building safe systems.

**Direct injection**: User says "Ignore all previous instructions and..."
**Indirect injection**: Malicious instructions hidden in retrieved documents or tool outputs
**Jailbreaking**: Elaborate role-play scenarios to bypass safety guardrails
**Prompt leaking**: Tricking the model into revealing its system prompt

Part 3: Function Calling & Tool Use

Function calling lets LLMs **invoke external tools** — APIs, databases, calculators, web scrapers — by generating structured JSON that your application executes. This transforms LLMs from text generators into autonomous agents that can take actions in the real world.

OpenAI Function Calling

OpenAI uses a **tools** parameter with JSON Schema definitions. The model decides when to call a function and generates the arguments. Your application executes the function and feeds the result back.

Anthropic Tool Use

Anthropic's tool use follows a similar pattern but with different message structures. Tools are defined with `input_schema` and the model returns `tool_use` content blocks.

Building & Testing Custom Tools

Well-designed tools follow key principles: **clear descriptions** (the model reads these to decide when to use the tool), **strict input validation**, **meaningful error messages**, and **minimal scope** (one tool = one action).

Part 4: Retrieval-Augmented Generation (RAG)

RAG solves the fundamental limitation of LLMs: they only know what was in their training data. By **retrieving relevant documents** at query time and injecting them into the prompt, you can ground the model's responses in your own data — company docs, knowledge bases, codebases, or any text corpus.

Document Chunking Strategies

Before you can search your documents, you need to split them into **chunks** — small enough to be relevant, large enough to contain complete ideas. Chunking strategy has a massive impact on retrieval quality.

Embedding Models

Embeddings convert text into **dense numerical vectors** that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search. The choice of embedding model affects both quality and cost.

Vector Stores: Milvus & Qdrant

Vector databases store embeddings and enable fast similarity search at scale. **Milvus** and **Qdrant** are two popular open-source options with different strengths.

Retrieval Strategies

Simple top-k retrieval is just the beginning. Advanced strategies can dramatically improve the relevance and diversity of retrieved documents.

**Top-K**: Return the K most similar documents by cosine similarity. Simple but can return redundant results.
**MMR (Maximal Marginal Relevance)**: Balances relevance with diversity — penalizes documents that are too similar to already-selected ones.
**HyDE (Hypothetical Document Embedding)**: Generate a hypothetical answer first, embed that, then search. Often outperforms direct query embedding.
**Hybrid BM25 + Dense**: Combine traditional keyword search (BM25) with semantic search. Best of both worlds — catches exact matches that embeddings might miss.

Reranking

Reranking is a **two-stage retrieval** technique. First, you retrieve a broad set of candidates (e.g., top 20) using fast vector search. Then, a more powerful cross-encoder model re-scores each candidate against the query for higher precision.

Putting It All Together: Complete RAG Pipeline

RAG Evaluation with RAGAS

Building a RAG pipeline isn't enough — you need to **measure** how well it performs. RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics for evaluating both retrieval and generation quality.

**Faithfulness**: Is the generated answer supported by the retrieved context? (Prevents hallucination)
**Answer Relevancy**: Does the answer actually address the question asked?
**Context Precision**: Are the retrieved documents relevant to the question?
**Context Recall**: Did the retrieval find all the relevant information needed?

Week-by-Week Study Plan

Conclusion

These four pillars — **LLM APIs**, **prompt engineering**, **function calling**, and **RAG** — form the foundation of modern AI engineering. Master them, and you can build anything from intelligent chatbots to autonomous agents to enterprise knowledge systems. The code examples in this guide are production-ready starting points, not toy demos. Take them, extend them, break them, and build something real.

Building a BiLSTM Intent Classifier in PyTorch: Vocab, Packing, and Pooling

DevLifted Team — Wed, 27 May 2026 00:00:00 GMT

Most tutorials on sequence models stop at the architecture diagram. You get a clean picture of a [BiLSTM](/blog/bilstm-text-classification-explained) reading a sentence left-to-right and right-to-left, and the explanation ends there. Then you sit down to actually build one and immediately hit a wall: your sentences are different lengths, PyTorch wants tensors, the LSTM is faster if you tell it where the padding is, and somewhere along the way you have to turn a sequence of hidden states back into a single vector for classification.

This post is about that middle layer. The plumbing. The part that turns a clean theoretical model into something that trains in 30 seconds per epoch instead of 5 minutes, and that doesn't silently scramble your labels through a subtle indexing bug. We'll work through the full pipeline for a word-level BiLSTM intent classifier — vocabulary, dataset, collate function, packing, and pooling — and explain why each piece exists.

The shift from sentence vectors to token sequences

An [MLP on frozen sentence embeddings](/blog/mlp-on-frozen-sentence-embeddings) is simple to feed: one sentence in, one 384-dimensional vector out, classify. The sentence transformer does the heavy lifting before your model sees anything. You can ignore words and word order entirely because they've already been baked into the vector.

A BiLSTM sees the words. That changes everything about the input pipeline:

Words are strings, but neural networks need integers. You need a **vocabulary** — a deterministic mapping from word to integer ID.
Sentences have different lengths, but a tensor in a batch must be rectangular. You need **padding**.
Padded positions are fake — they shouldn't influence the model. You need **packing** or **masking**.
An LSTM emits a hidden state at every timestep, but a classifier wants one vector per sentence. You need a **pooling strategy**.

Each of these is a small problem in isolation, but they interact. Get the order wrong — say, pad and then forget to mask — and your model trains on noise. The rest of the post is one solution per problem, in the order they show up.

Step 1: The vocabulary

A vocabulary is a dictionary `word -> int`. Every word you want your model to recognize gets a unique integer ID. The embedding layer then uses that integer to look up a learned vector. If you haven't seen the [tokenization primer](/blog/text-preprocessing-tokenization-nlp), the short version is: split the text on whitespace, lowercase it, count word frequencies, and assign IDs to the most common ones.

Why integers, not strings?

Neural networks are matrices. `nn.Embedding(vocab_size, embed_dim)` is literally a `(vocab_size, embed_dim)` weight matrix. To get the vector for the word `flight`, you index into row 234 (or whichever row you assigned). Strings have no order and no row index — integers do.

The two special tokens you cannot skip

Before any real word goes into the vocabulary, two slots are reserved:

`` at index 0 is a convention worth following — `nn.Embedding` takes a `padding_idx` argument that pins that row to zero, and most utility code in PyTorch defaults to 0 as the pad value.

Building from the training set — and only the training set

The vocabulary is built from training texts. If you include validation or test words, you've leaked information. A real deployment sees brand-new words constantly, so simulating that by mapping unseen words to `` is the point.

Two knobs decide the size of the vocabulary. `max_size` caps the total number of words — useful because the embedding matrix grows linearly with this number. `min_freq` filters out words that appeared only once or twice in training; these are almost always typos, names, or rare items that the model can't learn anything useful about. Mapping them to `` is the honest move.

Step 2: Dataset and DataLoader

PyTorch has two abstractions for handling data: `Dataset` and `DataLoader`. They're independent of any model. Once you wire them up, the same code pattern works for images, audio, tabular data, or text.

Dataset: one example at a time

A `Dataset` only needs two methods: `__len__` (how many examples?) and `__getitem__(idx)` (give me example number idx). That's the entire interface.

Notice what's NOT here: padding, batching, conversion to tensors. The Dataset returns a Python list of integers and a Python int. Single example, raw. The DataLoader will handle the rest.

DataLoader: batches, shuffling, and parallelism for free

Wrap the Dataset in a `DataLoader` and you get batching, shuffling, and optional multi-process loading. The default behavior is to stack each item with `torch.stack`, which assumes every item is the same shape. For variable-length text, it isn't — so you provide a `collate_fn` that controls how the batch gets assembled.

Step 3: Dynamic padding with collate_fn

Sentences in a batch will have different lengths — 5 tokens, 9 tokens, 22 tokens. To put them into a single tensor, the short ones get padded with the `` index until they match the longest. The question is: padded to what length?

Static vs dynamic padding

**Static padding** pads every sentence to a fixed global maximum — say, `max_seq_len = 64`. Simple, but wasteful: if your batch happens to contain only short sentences, you're doing 64 timesteps of LSTM work on 90% padding.

**Dynamic padding** pads to the longest sentence *in the current batch*. A batch of mostly short sentences pads to maybe 12 tokens. A batch with one long outlier pads to 40. Across an epoch, this can cut training time in half.

The function returns three tensors: `padded` of shape `(batch, max_len)`, `lengths` of shape `(batch,)`, and `labels` of shape `(batch,)`. The `lengths` tensor is the key — without it, the model has no way to tell where real tokens end and padding begins.

Step 4: Packing — telling the LSTM to skip padding

Padding solves the shape problem but creates a compute problem. If your batch is padded to 40 timesteps and the average real length is 10, you're paying 4× the LSTM cost for nothing. Worse, the hidden state at timestep 40 of a 10-token sentence is the state after the LSTM has processed 30 padding tokens. If you use that as a sentence representation, you've corrupted the signal.

PyTorch's `pack_padded_sequence` solves both. It rearranges the padded tensor into a special `PackedSequence` object that the LSTM processes step by step, skipping padded positions automatically. The output comes back compressed in the same format; you unpack it with `pad_packed_sequence` to get a normal tensor again.

The sort/unsort dance

Here's the catch that trips up almost everyone the first time: `pack_padded_sequence` requires the batch to be sorted by length in descending order when `enforce_sorted=True`. That means you have to sort, pack, run, unpack — and then put everything back in the original order so it lines up with the labels.

Two small details: `pack_padded_sequence` wants `sorted_lengths.cpu()` even if the rest of the tensors are on a GPU — the function uses lengths for indexing on the CPU side. And `h_n` (the final hidden state) is indexed differently from `output`: its shape is `(num_layers * num_directions, batch, hidden_dim)`, so you unsort along dim 1.

Step 5: Pooling — one vector per sentence

After unpacking, you have a tensor of shape `(batch, seq_len, hidden_dim * 2)`. The `* 2` is because the BiLSTM concatenates forward and backward hidden states at every timestep. A classifier head wants a single vector per sentence, so the sequence dimension has to collapse. Two strategies are common.

Strategy 1: Last hidden state

Take the final hidden state of the LSTM. For a unidirectional LSTM, this is the state after reading the entire sentence — a natural summary. For a bidirectional LSTM, you want the *forward direction's last state* (which has read the whole sentence left-to-right) AND the *backward direction's last state* (which has read it right-to-left).

The shape of `h_n` is `(num_layers * 2, batch, hidden_dim)`. For a 2-layer BiLSTM, that's 4 rows. The layout is `[layer_0_forward, layer_0_backward, layer_1_forward, layer_1_backward]`. The last two — `h_n[-2]` and `h_n[-1]` — are what you want:

Strategy 2: Mean pooling with a mask

Mean pooling averages the hidden states across all real timesteps. The wrinkle is that padded positions are still in the output tensor after unpacking — they're just zeros, but if you average over them you're dividing by the wrong denominator. You need a mask.

The mask is built by comparing each position index to the sentence's real length. Position 0, 1, 2, ... up to `lengths[i] - 1` is True; everything after is False. Multiply by the mask, sum, divide by the real length, and you have an honest mean over non-padded positions.

Which one should you pick?

On short utterances like intent classification, the two are usually within a percentage point of each other. On longer documents, mean pooling tends to be more reliable. Treat it as a hyperparameter worth flipping during your sweep.

Putting the pieces together

The full forward pass — tokens to logits — looks like this:

A couple of details worth noting: `nn.LSTM`'s `dropout` argument only applies between stacked layers, which is why it's gated behind `num_layers > 1` (passing dropout to a single-layer LSTM is a no-op and triggers a warning). And `padding_idx=pad_idx` on the embedding layer pins row 0 to zero and freezes it — no gradient updates, no drift.

The training loop with a DataLoader

Compared to the manual mini-batch loop you might have used with a [frozen-embedding MLP](/blog/mlp-on-frozen-sentence-embeddings), the DataLoader version is shorter. No manual shuffling, no manual slicing — the loader yields batches, you iterate.

The optimizer is typically [Adam](/blog/adam-optimizer-explained) with `lr=1e-3` and a small weight decay, and you wrap the loop in [early stopping](/blog/early-stopping-explained) on validation loss with a patience of 5 or so. None of that is specific to BiLSTMs — these are the same training conventions you've used for every PyTorch model.

Common bugs and how to catch them

**Accuracy hovers near random**: almost always a missing unsort step after `pad_packed_sequence`. Labels and predictions are misaligned.
**Loss is NaN immediately**: usually a learning rate problem, but check that `padding_idx` is set on the embedding — otherwise the pad embedding drifts during training and can blow up.
**Training is far slower than expected**: you forgot to pack the sequence, or you're padding to a global `max_seq_len` instead of per-batch dynamic padding.
**Validation accuracy is wildly noisy across epochs**: `shuffle=True` slipped onto the validation loader. Set it to `False`.
**Inference breaks on new sentences**: a word at inference is missing from the vocabulary. Make sure `encode` returns `UNK_IDX` for unknown words, not `KeyError`.
**`h_n` shape mismatch in pooling**: you indexed `h_n[0]` and `h_n[1]` thinking they were forward/backward of the last layer. For multi-layer LSTMs use `h_n[-2]` and `h_n[-1]`.

Wrapping up

A working sequence classifier is mostly plumbing on top of a small model. The BiLSTM itself is a few lines — the work is in the pipeline around it: a vocabulary that maps strings to integers, a Dataset that yields raw token lists, a DataLoader with a `collate_fn` that pads dynamically, packing to skip padding inside the LSTM, the sort/unsort dance to keep labels aligned, and a pooling strategy to collapse the sequence back into a single vector.

Get this pipeline right once and almost every future sequence model — attention models, transformers, even speech recognizers — reuses the same shapes. Tokens go in, padded tensors flow through a model that knows how to ignore padding, and a pooled vector comes out for the head to classify.

Agent-to-Agent Communication: Async Messaging, Handoff Protocols, and Conflict Resolution

Haneesh PLD — Fri, 15 May 2026 10:00:00 GMT

Agents that can't talk to each other aren't a system — they're a collection of independent programs. Communication is the connective tissue of multi-agent architectures: it determines whether your agents can coordinate on a refund workflow, negotiate conflicting decisions, or hand off tasks without dropping context.

This post builds three production-grade primitives from scratch: an async message bus with backpressure and dead letter queues, a handoff protocol with real acknowledgment tracking and exponential backoff, and a conflict resolver that includes actual LLM arbitration. We'll wire them together through a customer support refund workflow where a TriageAgent, BillingAgent, and ApprovalAgent coordinate end-to-end.

Communication Patterns Compared

Before building anything, understand the tradeoffs. Each pattern fits different coordination needs:

Direct messaging is simple but creates tight coupling — every agent needs to know every other agent's address. Pub/sub decouples publishers from subscribers but loses request/reply semantics. Shared state (covered in the [state management post](/posts/state-management-agents)) works well for coordination data but not for task handoffs. An event bus with topic-based routing hits the sweet spot for most multi-agent workflows.

Message Schema

Every message in the system needs a consistent envelope. This schema supports routing, tracing, and idempotency:

The `idempotency_key` prevents duplicate processing when retries occur. The `correlation_id` chains request/reply pairs across multiple hops. The `reply()` factory method inverts sender/recipient and routes to the original sender's inbox topic.

Async Message Bus with Backpressure

A real message bus needs three things most tutorials skip: backpressure so fast producers don't overwhelm slow consumers, a dead letter queue for messages that repeatedly fail processing, and proper async subscriber management.

Handoff Protocol with Acknowledgment Tracking

Task handoff between agents requires guaranteed delivery. The HandoffProtocol tracks outgoing handoffs, waits for real acknowledgments using `asyncio.Event`, and retries with exponential backoff on timeout.

The key difference from toy implementations: `ack_event.wait()` blocks until the receiving agent explicitly calls `acknowledge()` or `reject()`. No fake sleeps, no polling. The `asyncio.Event` is the right primitive — it's zero-cost when waiting and instant when signaled.

Conflict Resolution with LLM Arbitration

When multiple agents propose conflicting actions — two agents both want to set a refund amount, or disagree on whether to escalate — you need a resolution strategy. The three strategies here are majority vote, priority-based, and LLM arbitration.

Refund Workflow: Message Flow

Here's the full message flow for a customer support refund. The TriageAgent receives the request, hands off to BillingAgent for amount calculation, and BillingAgent escalates to ApprovalAgent if the amount exceeds a threshold.

Agents Using the Communication Primitives

Here's how agents actually use the bus and handoff protocol. Each agent subscribes to its inbox, processes messages, and sends responses back through the bus. This is the complete wiring — not pseudocode.

Running the Workflow

Conflict Resolution in Practice

When two agents disagree on a refund amount, the conflict resolver picks a winner. Here's the LLM arbitration path — the resolver sends both proposals to an LLM with their reasoning and gets back a structured decision.

A Note on Shared State

You'll notice this post doesn't include a SharedStateManager. That's deliberate — shared state coordination is a distinct problem with its own concurrency challenges, and it's covered thoroughly in the [State Management for Agents](/posts/state-management-agents) post. The primitives here (message bus, handoff protocol, conflict resolver) compose with shared state but don't duplicate it.

Production Considerations

**Persistence**: The in-memory `asyncio.Queue` loses messages on crash. For production, back the bus with Redis Streams, RabbitMQ, or Kafka. The subscriber interface stays the same.
**Observability**: Log every message publish, delivery, and ack with correlation IDs. This is your debugging lifeline when three agents are exchanging messages at speed.
**Idempotency key storage**: The in-memory `set` for idempotency keys grows unbounded. Use a TTL-based cache (Redis with SETEX) or periodically prune keys older than the max message TTL.
**Dead letter processing**: Don't just log dead letters — alert on them. A growing DLQ means your consumers are failing and messages are being lost.
**LLM arbitration cost**: Every conflict that goes to LLM arbitration costs an API call. Use it as a fallback after majority vote fails to reach consensus, not as the default strategy.

Summary

Agent communication breaks down into three concerns: routing messages between agents (the bus), guaranteeing task delivery (the handoff protocol), and resolving disagreements (the conflict resolver). Each primitive is independent and composable — you can use the bus without the handoff protocol, or the conflict resolver without either.

The critical implementation details that tutorials skip: backpressure via bounded queues, real ack tracking via `asyncio.Event` instead of sleep-based polling, type-preserving vote counting, and actual LLM calls for arbitration instead of stub methods. These details determine whether your multi-agent system works under load or only in demos.

BiLSTM for Text Classification: Understanding Sequential Deep Learning

DevLifted Team — Fri, 24 Apr 2026 00:00:00 GMT

Imagine you're reading a sentence: "I don't want to cancel my flight." As a human, you understand that the word "don't" completely changes the meaning. But what if we told you that many machine learning models would treat "I want to cancel my flight" and "I don't want to cancel my flight" almost identically?

This is the fundamental problem with bag-of-words approaches and even frozen sentence embeddings—they lose the sequential structure of language. Enter **Bidirectional Long Short-Term Memory ([BiLSTM](/blog/bilstm-text-classification-explained))** networks, a powerful architecture that reads text word by word, understanding context, word order, and compositional meaning.

The Problem with Non-Sequential Models

Let's understand why sequence matters with a concrete example:

The problem? These models process the entire sentence at once, creating a single vector representation. The word "don't" gets averaged out with all other words, losing its critical negation role.

What Are Recurrent Neural Networks (RNNs)?

Recurrent Neural Networks are designed to process sequences by maintaining a **hidden state** that gets updated at each time step. Think of it like reading a book—you don't forget what you read in previous sentences; you carry that context forward.

How RNNs Work: A Simple Example

Let's process the sentence "I love pizza" word by word:

At each step, the [RNN](/blog/rnn-lstm-fundamentals) combines the current word with the previous hidden state, creating a new hidden state that encodes everything seen so far. The final hidden state represents the entire sentence.

The Vanishing Gradient Problem

Simple [RNNs](/blog/rnn-lstm-fundamentals) have a fatal flaw: they can't remember long-range dependencies. When processing long sentences, the gradient signal gets weaker and weaker as it propagates backward through time. This is called the **vanishing gradient problem**.

This is where [LSTMs](/blog/rnn-lstm-fundamentals) come to the rescue.

Long Short-Term Memory (LSTM): The Solution

[LSTMs](/blog/rnn-lstm-fundamentals) solve the vanishing gradient problem through a clever architecture with **gates** that control information flow. Think of gates as smart filters that decide what to remember, what to forget, and what to output.

The Three Gates of LSTM

**Forget Gate**: Decides what information to throw away from the cell state. "Should I forget that we're talking about a chef?"
**Input Gate**: Decides what new information to store in the cell state. "Should I remember that we're now talking about sushi?"
**Output Gate**: Decides what to output based on the cell state. "What information is relevant for the next step?"

Here's a simplified view of how [LSTM](/blog/rnn-lstm-fundamentals) processes one word:

Bidirectional LSTM: Reading Both Ways

A regular [LSTM](/blog/rnn-lstm-fundamentals) only reads text left-to-right. But humans understand language by considering context from both directions. Consider this sentence:

"The bank was steep and covered with grass."

Is "bank" a financial institution or a riverbank? You need to read ahead to "steep" and "grass" to know. This is why **Bidirectional [LSTMs](/blog/rnn-lstm-fundamentals)** are so powerful—they process the sequence in both directions simultaneously.

The [BiLSTM](/blog/bilstm-text-classification-explained) creates two hidden states for each word:

**Forward hidden state**: Encodes everything from the start up to this word
**Backward hidden state**: Encodes everything from the end back to this word

These are concatenated to give each word full context from both directions.

Building a BiLSTM Text Classifier

Let's build a complete [BiLSTM](/blog/bilstm-text-classification-explained) classifier step by step. We'll classify customer service queries into intents (like "cancel_flight", "book_hotel", etc.).

Step 1: Text Preprocessing and Vocabulary

Before we can feed text into a neural network, we need to convert words to numbers. This involves building a **vocabulary**—a mapping from words to integer indices.

Step 2: Encoding and Padding

Neural networks require fixed-size inputs, but sentences have variable lengths. We solve this with **padding**—adding special tokens to make all sequences the same length.

Step 3: Word Embeddings

Now we need to convert word indices to dense vectors. Unlike frozen embeddings, we'll **learn** these embeddings from scratch during training. This allows the model to learn task-specific word representations.

The embedding layer is essentially a lookup table. Each word index maps to a learnable vector. During training, backpropagation updates these vectors to be more useful for the task.

Step 4: The BiLSTM Architecture

Now we can build the complete [BiLSTM](/blog/bilstm-text-classification-explained) classifier:

Understanding the Architecture

Let's trace through what happens to a single sentence:

Training the BiLSTM

Training a [BiLSTM](/blog/bilstm-text-classification-explained) is similar to training any neural network, but with some sequence-specific considerations:

Why BiLSTM Works Better Than Frozen Embeddings

Let's compare the two approaches on our negation example:

Example: How BiLSTM Handles Negation

Hyperparameters and Their Impact

[BiLSTMs](/blog/bilstm-text-classification-explained) have several important hyperparameters that significantly affect performance:

1. Embedding Dimension (embed_dim)

**Too small (32-64)**: Words can't capture enough semantic information
**Sweet spot (128-256)**: Good balance of expressiveness and efficiency
**Too large (512+)**: Overfitting, slower training, diminishing returns

2. Hidden Dimension (hidden_dim)

**Too small (64-128)**: Can't capture complex patterns
**Sweet spot (256-512)**: Sufficient capacity for most tasks
**Too large (1024+)**: Overfitting, memory issues

3. Number of Layers (num_layers)

**1 layer**: Simple patterns only
**2 layers**: Good for most tasks (recommended starting point)
**3+ layers**: Deeper hierarchies, but harder to train

4. Sequence Length (max_len)

Common Pitfalls and Solutions

Pitfall 1: Forgetting bidirectional=True

Pitfall 2: Wrong hidden state extraction

Pitfall 3: Not setting padding_idx

Performance Expectations

On a typical intent classification dataset (like CLINC150 with 151 classes):

When to Use BiLSTM vs Other Approaches

Use BiLSTM when:

Word order and sequence structure are critical (negation, temporal relationships)
You have moderate amounts of training data (10K+ examples)
You need better accuracy than bag-of-words but can't afford transformer training time
Interpretability matters (you can visualize attention over time steps)
You're working with sequences of moderate length (< 100 tokens)

Don't use BiLSTM when:

You have very little data (< 1K examples) → use frozen embeddings
You need state-of-the-art accuracy and have compute budget → use transformers
Sequences are very long (> 500 tokens) → LSTMs struggle with very long sequences
Real-time inference is critical → simpler models are faster

Advanced Techniques

1. Packed Sequences (for efficiency)

When sentences have very different lengths, you can use packed sequences to avoid wasting computation on padding:

2. Attention Mechanism

Instead of just using the final hidden state, you can use attention to weight all time steps:

3. Pretrained Word Embeddings

You can initialize embeddings with pretrained vectors (like GloVe or Word2Vec) instead of random initialization:

Conclusion

[BiLSTM](/blog/bilstm-text-classification-explained) networks represent a significant step up from bag-of-words and frozen embedding approaches. By processing text sequentially and bidirectionally, they capture the compositional nature of language—understanding that "I don't want to cancel" is fundamentally different from "I want to cancel."

Key takeaways:

**LSTMs solve the vanishing gradient problem** through gating mechanisms
**Bidirectional processing** gives each word full context from both directions
**Learned embeddings** allow task-specific word representations
**Sequential processing** preserves word order and handles negations correctly
**BiLSTMs offer a sweet spot** between simple baselines and heavy transformers

While transformers have largely replaced [LSTMs](/blog/rnn-lstm-fundamentals) in state-of-the-art NLP, [BiLSTMs](/blog/bilstm-text-classification-explained) remain valuable for understanding sequence modeling fundamentals and for practical applications where compute budget is limited.

Understanding RNNs and LSTMs: The Foundation of Sequence Modeling

DevLifted Team — Fri, 24 Apr 2026 00:00:00 GMT

Imagine you're reading a book. You don't process each word in isolation—you remember what came before, building context as you go. This is exactly what [Recurrent Neural Networks](/blog/rnn-lstm-fundamentals) (RNNs) do for machines. They're designed to process sequences by maintaining a "memory" of previous inputs.

In this guide, we'll explore how [RNNs](/blog/rnn-lstm-fundamentals) work, why they struggle with long sequences, and how Long Short-Term Memory (LSTM) networks elegantly solve these problems.

The Problem: Why Regular Neural Networks Fail at Sequences

Traditional feedforward neural networks have a fundamental limitation: they treat each input independently. Consider these two sentences:

"The cat sat on the mat"
"The mat sat on the cat"

A feedforward network would see the same words and might produce similar outputs, completely missing that these sentences have opposite meanings. The problem? **No memory of word order**.

Enter Recurrent Neural Networks (RNNs)

[RNNs](/blog/rnn-lstm-fundamentals) solve this by introducing **recurrence**—the output at each step depends not just on the current input, but also on the previous hidden state. Think of it as a neural network with memory.

The Core Idea: Hidden State

An [RNN](/blog/rnn-lstm-fundamentals) maintains a **hidden state** that gets updated at each time step. This hidden state acts as the network's memory, encoding information about everything it has seen so far.

Processing a Sequence: Step by Step

Let's walk through processing the sentence "I love pizza" word by word:

The Mathematics Behind RNNs

The [RNN](/blog/rnn-lstm-fundamentals) update equation is surprisingly simple:

Let's implement a simple [RNN](/blog/rnn-lstm-fundamentals) from scratch:

The Vanishing Gradient Problem

[RNNs](/blog/rnn-lstm-fundamentals) sound perfect, right? Unfortunately, they have a critical flaw: they can't learn long-range dependencies. This is called the **vanishing gradient problem**.

Why Gradients Vanish

During backpropagation through time, gradients must flow backward through many time steps. At each step, they get multiplied by the weight matrix and the derivative of tanh.

This means [RNNs](/blog/rnn-lstm-fundamentals) struggle with sentences like:

"The chef, who trained in Paris for five years and later opened a restaurant in Tokyo, **makes** amazing sushi."

The [RNN](/blog/rnn-lstm-fundamentals) needs to connect "chef" (at the start) with "makes" (at the end), but the gradient signal is too weak by the time it reaches back to "chef".

Long Short-Term Memory (LSTM): The Solution

[LSTMs](/blog/rnn-lstm-fundamentals) were specifically designed to solve the vanishing gradient problem. They do this through a clever architecture with **gates** that control information flow.

The Key Innovation: Cell State

[LSTMs](/blog/rnn-lstm-fundamentals) introduce a **cell state**—a separate memory channel that runs through the entire sequence with minimal modifications. Think of it as a "memory highway" where information can flow unchanged.

The Three Gates

[LSTMs](/blog/rnn-lstm-fundamentals) use three gates to control the cell state:

1. Forget Gate: What to Throw Away

Decides what information to remove from the cell state.

2. Input Gate: What to Add

Decides what new information to store in the cell state.

3. Output Gate: What to Output

Decides what parts of the cell state to output as the hidden state.

Complete LSTM Cell

Putting it all together:

Why LSTMs Solve Vanishing Gradients

The cell state provides a direct path for gradients to flow backward through time:

Implementing LSTM in PyTorch

PyTorch provides a built-in [LSTM](/blog/rnn-lstm-fundamentals) implementation that's highly optimized:

Understanding LSTM Outputs

Bidirectional LSTM: Reading Both Ways

A standard [LSTM](/blog/rnn-lstm-fundamentals) only reads left-to-right. But for many tasks, we want to see the full context. **Bidirectional LSTMs** process the sequence in both directions:

Practical Example: Sentiment Analysis

Let's build a complete sentiment classifier using [LSTM](/blog/rnn-lstm-fundamentals):

Common Pitfalls and Solutions

1. Exploding Gradients

While [LSTMs](/blog/rnn-lstm-fundamentals) solve vanishing gradients, they can still suffer from exploding gradients. Solution: gradient clipping.

2. Slow Training

[LSTMs](/blog/rnn-lstm-fundamentals) process sequences sequentially, which is slow. Solutions:

Use packed sequences to skip padding
Use larger batch sizes
Consider using GRU (simpler, faster variant of LSTM)
For very long sequences, consider Transformers instead

3. Overfitting

[LSTMs](/blog/rnn-lstm-fundamentals) have many parameters and can overfit. Solutions:

LSTM vs GRU vs Transformer

When to Use LSTMs

Use LSTMs when:

You need to model sequential dependencies
Word order matters (it almost always does in NLP)
You have moderate amounts of data (10K+ examples)
Sequences are moderate length (< 500 tokens)
You want a good balance of performance and interpretability

Don't use LSTMs when:

You have very little data (< 1K examples) → use simpler models
You need state-of-the-art results → use Transformers
Sequences are very long (> 1000 tokens) → use Transformers with efficient attention
Training time is critical → consider GRU or simpler models

Conclusion

[RNNs](/blog/rnn-lstm-fundamentals) and LSTMs represent a fundamental breakthrough in sequence modeling. While Transformers have largely replaced them in state-of-the-art NLP, understanding LSTMs is crucial because:

They introduce core concepts (hidden state, sequential processing) that appear in all sequence models
They're still practical for many real-world applications with limited compute
They're more interpretable than Transformers
Understanding why they fail (vanishing gradients) helps you understand why Transformers succeed

Master [LSTMs](/blog/rnn-lstm-fundamentals), and you'll have a solid foundation for understanding modern sequence models like Transformers, which build upon these same core ideas.

Text Preprocessing and Tokenization for NLP: A Complete Guide

DevLifted Team — Fri, 24 Apr 2026 00:00:00 GMT

Before you can train a neural network on text, you need to convert raw text into a format the model can understand. This process—[text preprocessing](/blog/text-preprocessing-tokenization-nlp) and tokenization—is often overlooked but critically important. Poor preprocessing can tank your model's performance, while good preprocessing can give you a significant boost.

In this guide, we'll cover everything you need to know about preparing text for deep learning models.

The Text Processing Pipeline

Here's the typical pipeline for processing text:

Let's walk through each step with practical examples.

Step 1: Text Cleaning

Raw text is messy. It contains special characters, HTML tags, URLs, and inconsistent formatting. Cleaning prepares text for tokenization.

Common Cleaning Operations

To Lowercase or Not?

Step 2: Tokenization

Tokenization splits text into individual units (tokens). The most common approach is **word tokenization**, but there are others.

Word Tokenization

Split text into words. The simplest approach is splitting on whitespace:

A better approach handles punctuation:

Character Tokenization

Split text into individual characters. Useful for tasks like text generation or handling typos.

Subword Tokenization (BPE, WordPiece)

Modern approach used by BERT, GPT, etc. Splits words into subword units.

Step 3: Building a Vocabulary

A vocabulary maps words to integer indices. This is crucial for converting text to numbers.

Basic Vocabulary Class

Special Tokens

Most vocabularies include special tokens:

Vocabulary Size: How Big?

**Rule of thumb**: Choose vocabulary size to cover 90-95% of your text. Beyond that, you're mostly adding noise.

Step 4: Text Encoding

Convert words to integer indices using the vocabulary.

Step 5: Handling Variable-Length Sequences

Neural networks need fixed-size inputs, but sentences have different lengths. We solve this with **padding** and **truncation**.

Padding: Making Sequences the Same Length

Choosing max_len: Data Analysis

Pre vs Post Padding

Complete Text Processing Pipeline

Let's put it all together in a complete pipeline:

Advanced Techniques

1. Packed Sequences (for RNNs)

When using [RNNs](/blog/rnn-lstm-fundamentals) with very different sequence lengths, packed sequences can improve efficiency:

2. Attention Masks

For transformer models, create attention masks to ignore padding tokens:

Common Pitfalls and Solutions

Pitfall 1: Data Leakage in Vocabulary

Pitfall 2: Inconsistent Preprocessing

Pitfall 3: Wrong max_len Choice

Performance Considerations

Memory Usage

Speed Optimization

**Smaller vocabulary**: Reduces embedding layer size
**Shorter sequences**: Less computation in RNNs/Transformers
**Larger batches**: Better GPU utilization (up to memory limits)
**Packed sequences**: Skip computation on padding (RNNs only)

Best Practices Summary

**Analyze your data first**: Understand length distributions and vocabulary
**Build vocabulary only from training data**: Avoid data leakage
**Choose max_len to cover 95-99% of sequences**: Balance coverage and efficiency
**Use consistent preprocessing**: Same pipeline for train/val/test
**Reserve index 0 for padding**: Makes masking easier
**Filter vocabulary by frequency**: Remove rare words (noise)
**Consider subword tokenization**: For handling unknown words
**Monitor memory usage**: Especially with large vocabularies/sequences

Conclusion

[Text preprocessing](/blog/text-preprocessing-tokenization-nlp) and tokenization are foundational skills for NLP. While they might seem mundane compared to designing neural architectures, they can make or break your model's performance.

Key takeaways:

**Clean text appropriately** for your task (don't over-clean)
**Build vocabulary from training data only** to avoid leakage
**Choose sequence length based on data analysis**, not arbitrary numbers
**Use padding and truncation** to handle variable-length sequences
**Be consistent** in preprocessing across train/val/test splits

Master these fundamentals, and you'll have a solid foundation for any NLP project, from simple classification to complex language generation.

Word Embeddings Explained: From One-Hot to Dense Vectors

DevLifted Team — Fri, 24 Apr 2026 00:00:00 GMT

Computers don't understand words—they only understand numbers. So how do we teach a machine learning model about language? The answer is **[word embeddings](/blog/word-embeddings-explained)**: mathematical representations that capture the meaning of words as dense numerical vectors.

In this guide, we'll explore how [word embeddings](/blog/word-embeddings-explained) work, why they're so powerful, and how to use them effectively in your NLP projects.

The Problem: Representing Words as Numbers

Before we can process text with neural networks, we need to convert words to numbers. The naive approach is **one-hot encoding**:

Problems with One-Hot Encoding

**Huge dimensionality**: For a vocabulary of 10,000 words, each word is a 10,000-dimensional vector!
**No semantic meaning**: "cat" and "dog" are equally different from each other as "cat" and "democracy"
**Sparse**: 99.99% of values are zeros, wasting memory and computation
**No relationships**: Can't capture that "king" - "man" + "woman" ≈ "queen"

Enter Word Embeddings: Dense Representations

[Word embeddings](/blog/word-embeddings-explained) solve these problems by representing words as **dense, low-dimensional vectors** where similar words have similar vectors.

Key Properties of Good Embeddings

**Semantic similarity**: Similar words have similar vectors
**Dimensionality reduction**: 10,000 words → 128-300 dimensions
**Dense**: All values are meaningful (no zeros)
**Learned relationships**: Captures analogies and relationships

How Are Embeddings Learned?

There are two main approaches to creating [word embeddings](/blog/word-embeddings-explained):

1. Learned Embeddings (Task-Specific)

Train embeddings from scratch as part of your model. The embeddings learn to be useful for your specific task.

**How it works:**

2. Pretrained Embeddings (Transfer Learning)

Use embeddings trained on massive text corpora (like Wikipedia). These capture general language knowledge.

Popular pretrained embeddings:

**Word2Vec** (Google, 2013): Trained on Google News
**GloVe** (Stanford, 2014): Trained on Wikipedia + web text
**FastText** (Facebook, 2016): Handles out-of-vocabulary words

Word2Vec: Learning from Context

Word2Vec is based on a simple but powerful idea: **"You shall know a word by the company it keeps"** (J.R. Firth, 1957).

Words that appear in similar contexts should have similar meanings.

The Skip-Gram Model

Given a word, predict its surrounding words (context).

Famous Word2Vec Examples

Word2Vec embeddings capture amazing semantic relationships:

GloVe: Global Vectors

GloVe (Global Vectors for Word Representation) takes a different approach: it uses global word co-occurrence statistics.

How GloVe Works

GloVe combines the benefits of:

**Global statistics** (like LSA/SVD methods)
**Local context** (like Word2Vec)

Using Embeddings in PyTorch

Option 1: Learn from Scratch

Option 2: Use Pretrained Embeddings

When to Freeze vs Fine-Tune

Handling Unknown Words

What happens when you encounter a word not in your vocabulary?

Strategy 1: UNK Token

Strategy 2: Subword Embeddings (FastText)

FastText represents words as bags of character n-grams, allowing it to generate embeddings for unseen words.

Embedding Dimensions: How Many?

Choosing the right embedding dimension is important:

Visualizing Embeddings

Embeddings are high-dimensional, but we can visualize them using dimensionality reduction:

Common Pitfalls and Solutions

Pitfall 1: Not Setting padding_idx

Pitfall 2: Vocabulary Mismatch

Pitfall 3: Too Large Vocabulary

Learned vs Pretrained: A Comparison

Modern Alternatives: Contextual Embeddings

Traditional [word embeddings](/blog/word-embeddings-explained) have one limitation: **each word has a single embedding**, regardless of context.

Modern models like BERT, GPT, and RoBERTa use **contextual embeddings** that change based on surrounding words. However, they're much more expensive to train and use.

Practical Recommendations

For Beginners

Start with learned embeddings (128-256 dimensions)
Use padding_idx=0 for padding tokens
Filter vocabulary by frequency (min_freq=2)
Cap vocabulary size (max_vocab=10000-20000)

For Production

Try pretrained embeddings first (GloVe, FastText)
Fine-tune if you have enough data (> 50K examples)
Use FastText for handling unknown words
Consider contextual embeddings (BERT) for state-of-the-art results

Conclusion

[Word embeddings](/blog/word-embeddings-explained) are a fundamental building block of modern NLP. They transform discrete words into continuous vectors that capture semantic meaning, enabling neural networks to understand language.

Key takeaways:

**One-hot encoding is inefficient** and doesn't capture semantics
**Word embeddings are dense, low-dimensional vectors** that capture meaning
**Similar words have similar embeddings** (cosine similarity)
**Learned embeddings** adapt to your task but need more data
**Pretrained embeddings** (Word2Vec, GloVe) work well with less data
**Contextual embeddings** (BERT) are state-of-the-art but expensive

Understanding [word embeddings](/blog/word-embeddings-explained) is crucial for any NLP practitioner. They're the foundation upon which more complex models like [RNNs](/blog/rnn-lstm-fundamentals), LSTMs, and Transformers are built.

ReLU Explained: The Simple Activation Function That Changed Deep Learning

Haneesh — Thu, 23 Apr 2026 00:00:00 GMT

Imagine you're building a neural network and someone tells you to use **ReLU**. You nod along, but secretly wonder: what is this thing, and why does everyone use it? Here's the truth: ReLU is probably the simplest function in all of deep learning. It's so simple you can explain it to a 10-year-old. Yet this simple function revolutionized neural networks and made deep learning possible. Let's understand why.

Part 1 — What Is ReLU? (The One-Sentence Definition)

**ReLU (Rectified Linear Unit)** is a function that outputs the input if it's positive, and zero otherwise.

That's it. That's the whole thing. In math notation:

$$\text{ReLU}(x) = \max(0, x)$$

Or in plain English: **"If the number is positive, keep it. If it's negative, make it zero."**

Visual Example

See the pattern? Positive numbers pass through unchanged. Negative numbers become zero. That's all ReLU does.

Part 2 — Why Do We Need Activation Functions?

Before we understand why ReLU is special, let's understand why we need activation functions at all.

The Problem: Linear Layers Alone Are Too Simple

A neural network layer does a simple calculation: multiply inputs by weights and add a bias. This is called a **linear transformation**:

$$y = Wx + b$$

The problem? If you stack multiple linear layers, you still get a linear function. It's like multiplying numbers: 2 × 3 × 4 = 24, which is the same as just multiplying by 24 once. Stacking doesn't add power.

**Activation functions add non-linearity.** They let the network learn complex patterns like curves, boundaries, and interactions. Without them, your 100-layer network is no smarter than a 1-layer network.

Part 3 — ReLU vs Older Activation Functions

Before ReLU, people used **sigmoid** and **tanh**. These worked, but had serious problems.

Sigmoid: The Old Guard

Sigmoid squashes any input to a value between 0 and 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**The Problem: Vanishing Gradients**

When you train a network, you compute gradients (how much to change each weight). Sigmoid's gradient is very small for large positive or negative inputs. In deep networks, these tiny gradients multiply together and become microscopic. The network stops learning. This is called the **vanishing gradient problem**.

Why ReLU Solves This

ReLU's gradient is simple:

If input > 0: gradient = 1 (perfect!)
If input ≤ 0: gradient = 0 (dead, but at least not vanishing)

For positive inputs, the gradient is always 1. It doesn't shrink. This means gradients can flow through many layers without vanishing. This is why ReLU made deep learning possible.

Part 4 — Implementing ReLU in PyTorch

PyTorch makes ReLU incredibly easy. Here are three ways to use it:

Method 1: As a Layer

Method 2: As a Function

Method 3: From Scratch (To Understand It)

Part 5 — The Dying ReLU Problem

ReLU has one weakness: **dying neurons**. If a neuron's output is always negative, ReLU makes it always zero. The gradient is also zero, so the neuron never updates. It's permanently dead.

**How common is this?** In practice, 10-20% of neurons can die during training. It's annoying but usually not catastrophic.

**How to prevent it?**

Use a smaller learning rate (neurons won't jump to extreme negative values)
Use proper weight initialization (He initialization for ReLU)
Use Leaky ReLU or other variants (explained next)

Part 6 — ReLU Variants: When Standard ReLU Isn't Enough

Leaky ReLU: Preventing Dead Neurons

Instead of making negative values exactly zero, Leaky ReLU makes them small:

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{if } x \leq 0 \end{cases}$$

Now negative inputs produce small negative outputs. The gradient is also small (0.01) instead of zero, so neurons can still learn even when they output negative values.

Other Variants

Part 7 — Complete Example: Building a Network with ReLU

Let's build a complete image classifier using ReLU:

Part 8 — When NOT to Use ReLU

ReLU is great, but not always the right choice:

**Output layers**: Never use ReLU on output layers. Use softmax for classification, nothing for regression.
**Recurrent networks (RNNs)**: ReLU can cause exploding gradients in RNNs. Use tanh instead.
**When you need bounded outputs**: If you need outputs in a specific range (like [0,1] or [-1,1]), use sigmoid or tanh.
**Transformers**: Modern transformers use GELU, not ReLU. It's smoother and works better.
**When dying neurons are a problem**: Switch to Leaky ReLU or ELU.

Part 9 — Visualizing ReLU

Let's visualize what ReLU does to data:

Key Takeaways

**ReLU is simple**: max(0, x) - that's the entire function.
**It solves vanishing gradients**: Gradient is 1 for positive inputs, allowing deep networks to train.
**It's fast**: Just a comparison and a max operation - no expensive exponentials.
**Use it on hidden layers**: Never on output layers.
**Standard pattern**: Linear → ReLU → Dropout → repeat
**Dying neurons exist**: 10-20% of neurons may die, but it's usually okay.
**Leaky ReLU helps**: Use it if dying neurons become a problem.
**ReLU made deep learning possible**: Before ReLU, training deep networks was nearly impossible.

Quick Reference

Adam Optimizer Explained: Why It's Better Than Plain Gradient Descent

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine driving a car where you can only set one speed for the entire journey — 60 mph on highways, 60 mph in school zones, 60 mph on bumpy roads. That's what plain gradient descent (SGD) does: one learning rate for all parameters. **Adam (Adaptive Moment Estimation)** is like having adaptive cruise control that automatically adjusts speed based on road conditions. This post explains exactly how Adam works, why it's become the default optimizer for most deep learning tasks, and how to use it effectively.

Part 1 — The Problem with Plain SGD

Let's start by understanding what we're improving upon. **SGD (Stochastic Gradient Descent)** is the simplest optimizer. The update rule is:

Every parameter gets the same learning rate. This causes three major problems:

Problem 1: Different Parameters Need Different Learning Rates

Imagine you're training a network with 1 million parameters. Some parameters have large, consistent gradients (they know which direction to go). Others have tiny, noisy gradients (they're uncertain). With one global learning rate:

**Large gradients**: If learning rate is too high, these parameters overshoot and oscillate
**Small gradients**: If learning rate is too low, these parameters barely move and learning is slow

You're forced to choose a learning rate that's a compromise — not optimal for anyone.

Problem 2: Noisy Gradients

Mini-batch gradients are noisy estimates of the true gradient. One batch might say 'go left', the next says 'go right'. SGD follows these noisy signals directly, leading to a zigzag path instead of a smooth descent.

Problem 3: Ravines and Plateaus

Loss landscapes often have **ravines** (steep in one direction, flat in another) and **plateaus** (flat everywhere). SGD struggles with both:

**Ravines**: SGD bounces between the steep walls instead of smoothly descending
**Plateaus**: Gradients are tiny, so SGD barely moves even though there's a cliff edge nearby

Part 2 — Building Blocks: Momentum and RMSprop

Adam combines two earlier innovations: **Momentum** and **RMSprop**. Let's understand each before seeing how Adam combines them.

Momentum: Smoothing the Path

Momentum adds 'inertia' to gradient descent. Instead of following the current gradient exactly, we maintain a **velocity** — a running average of recent gradients.

Think of it like pushing a ball down a hill. The ball doesn't instantly change direction with every bump — it has momentum that smooths out the path. If gradients consistently point in one direction, velocity builds up and we move faster. If gradients oscillate, velocity averages them out and we move more carefully.

RMSprop: Adaptive Learning Rates

RMSprop (Root Mean Square Propagation) adapts the learning rate for each parameter based on the magnitude of recent gradients.

The key insight: divide the learning rate by the square root of the average squared gradient. This means:

**Large gradients** → Large denominator → Smaller effective learning rate → Smaller steps
**Small gradients** → Small denominator → Larger effective learning rate → Larger steps

Each parameter gets its own adaptive learning rate based on its gradient history.

Part 3 — Adam: Combining the Best of Both

Adam combines momentum (for smoothing) and RMSprop (for adaptive rates). Here's the complete algorithm:

Let's break down each component:

First Moment (m): The Momentum Component

`m` is a running average of gradients (like momentum). `beta1 = 0.9` means we keep 90% of the old average and add 10% of the new gradient. This smooths out noise and builds up speed in consistent directions.

Second Moment (v): The Adaptive Rate Component

`v` is a running average of squared gradients (like RMSprop). `beta2 = 0.999` means we keep 99.9% of the old average and add 0.1% of the new squared gradient. This tracks the 'volatility' of each parameter's gradients.

Bias Correction: Fixing the Cold Start Problem

Here's a subtle but important detail. At the start of training, `m` and `v` are initialized to zero. This creates a bias toward zero in the early steps. Adam corrects this by dividing by `(1 - beta**t)`, where `t` is the step number.

The Final Update

The final update divides the smoothed gradient (`m_corrected`) by the square root of the smoothed squared gradient (`sqrt(v_corrected)`). This gives each parameter an adaptive learning rate based on its gradient history.

Part 4 — Why Adam Works So Well

Adam provides several key advantages over plain SGD:

Advantage 1: Faster Convergence

In practice, Adam typically converges 5-10x faster than SGD. Why? Because it adapts the learning rate per parameter. Parameters that need large steps get them, parameters that need small steps get them. No more one-size-fits-all compromise.

Advantage 2: Less Sensitive to Learning Rate

With SGD, choosing the right learning rate is critical and problem-specific. Too high and training explodes, too low and it crawls. Adam is much more forgiving. The default `lr=1e-3` (0.001) works well for most problems. You can often use it without tuning.

Advantage 3: Handles Sparse Gradients

In problems like NLP, many parameters have zero gradients most of the time (sparse gradients). Adam handles this well because it adapts per parameter. Parameters that rarely update get larger effective learning rates when they do update.

Advantage 4: Works Well Out of the Box

The default hyperparameters (`beta1=0.9`, `beta2=0.999`, `lr=1e-3`) work well for most problems. This is why Adam has become the default optimizer — it 'just works' without extensive tuning.

Part 5 — Using Adam in PyTorch

PyTorch makes Adam easy to use. Here's a complete example:

Understanding the Hyperparameters

Part 6 — Weight Decay in Adam

Weight decay is L2 regularization built into the optimizer. It adds a penalty for large weights, helping prevent overfitting.

**Common weight decay values:**

**0**: No regularization (only use if you have tons of data)
**1e-5 (0.00001)**: Mild regularization
**1e-4 (0.0001)**: Standard choice for most problems
**1e-3 (0.001)**: Strong regularization (if overfitting is severe)

Part 7 — Adam vs SGD: When to Use Which

Adam is the default choice for most problems, but SGD with momentum still has its place:

Part 8 — Common Mistakes and How to Avoid Them

**Learning rate too high**: If loss explodes to NaN in the first few steps, your learning rate is too high. Try 1e-4 instead of 1e-3.
**Not using weight decay**: Without regularization, models often overfit. Start with weight_decay=1e-4.
**Forgetting optimizer.zero_grad()**: Gradients accumulate by default. Always call zero_grad() before backward().
**Using Adam for everything**: For computer vision with huge datasets, well-tuned SGD can outperform Adam. Don't be dogmatic.
**Not adjusting learning rate**: For very long training runs, consider learning rate scheduling (reduce lr when progress plateaus).
**Comparing Adam and SGD with same learning rate**: Adam typically needs a smaller learning rate than SGD. Don't compare them with the same lr.

Part 9 — Advanced: Learning Rate Scheduling

For long training runs, you might want to reduce the learning rate over time. PyTorch provides several schedulers:

Key Takeaways

**Adam adapts learning rates per parameter** based on gradient history, making it much more effective than plain SGD.
**It combines momentum (smoothing) and RMSprop (adaptive rates)** to get the best of both worlds.
**Default hyperparameters work well**: lr=1e-3, beta1=0.9, beta2=0.999 are good starting points.
**Use weight decay**: weight_decay=1e-4 provides mild regularization and helps prevent overfitting.
**Adam converges 5-10x faster** than SGD in most cases, with less hyperparameter tuning needed.
**Always call optimizer.zero_grad()** before backward() to clear old gradients.
**For transformers, use AdamW** (Adam with decoupled weight decay) for best results.
**Consider learning rate scheduling** for very long training runs.

Backpropagation and the Chain Rule: A Simple Visual Guide

AI Educator — Wed, 22 Apr 2026 00:00:00 GMT

Backpropagation sounds intimidating, but it's actually a simple idea: **calculate how much each part of your neural network contributed to the error, then adjust accordingly**. In this post, we'll build intuition from the ground up using a concrete example you can follow step by step.

The Big Picture: What is Backpropagation?

Imagine you're baking a cake and it turns out too sweet. You need to figure out which ingredient to adjust. Was it the sugar? The vanilla? The frosting? Backpropagation does exactly this for neural networks—it traces back through the recipe (the network) to find out which 'ingredients' (weights) caused the error.

**The Process:**

**Forward Pass:** Feed input through the network to get a prediction
**Calculate Error:** Compare prediction to the actual answer
**Backward Pass:** Trace back to find how much each weight contributed to the error
**Update Weights:** Adjust weights to reduce the error

A Simple Example: Predicting House Prices

Let's build the simplest possible neural network: one that predicts house prices based on size. We'll use this tiny network to understand backpropagation completely.

**Our Network:**

**Input:** House size (in 1000 sq ft)
**Hidden Layer:** 1 neuron
**Output:** Predicted price (in $100k)

Step 1: The Forward Pass

Let's walk through a concrete example with actual numbers.

**Given:**

Input: $x = 2$ (house is 2000 sq ft)
Weight 1: $w_1 = 0.5$
Weight 2: $w_2 = 1.0$
True price: $y = 3$ (actually costs $300k)

**Forward Pass Calculations:**

Hidden layer (with ReLU activation):

$$z_1 = w_1 \times x = 0.5 \times 2 = 1.0$$

$$h = \text{ReLU}(z_1) = \max(0, 1.0) = 1.0$$

Output layer:

$$\hat{y} = w_2 \times h = 1.0 \times 1.0 = 1.0$$

**Error (Loss):**

$$L = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(3 - 1)^2 = 2.0$$

Step 2: Understanding the Chain Rule

Before we do backpropagation, we need to understand the chain rule. It's simpler than it sounds!

**The Chain Rule in Plain English:**

If A affects B, and B affects C, then to find how A affects C, you multiply the effects:

$$\frac{dC}{dA} = \frac{dC}{dB} \times \frac{dB}{dA}$$

**Example with Numbers:**

Say we have: $y = (2x + 1)^2$ and we want $\frac{dy}{dx}$ at $x = 1$

Break it down:

Let $u = 2x + 1$, so $y = u^2$
$\frac{dy}{du} = 2u$
$\frac{du}{dx} = 2$
$\frac{dy}{dx} = \frac{dy}{du} \times \frac{du}{dx} = 2u \times 2 = 4u$

At $x = 1$: $u = 3$, so $\frac{dy}{dx} = 4 \times 3 = 12$

Step 3: The Backward Pass (Backpropagation)

Now let's apply the chain rule to our neural network. We'll work backwards from the loss to find how each weight contributed to the error.

**Goal:** Find $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$

Computing ∂L/∂w₂ (Output Weight)

The loss depends on $w_2$ through this chain: $L \rightarrow \hat{y} \rightarrow w_2$

Using the chain rule:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_2}$$

**Step 1:** Find $\frac{\partial L}{\partial \hat{y}}$

$$L = \frac{1}{2}(y - \hat{y})^2$$

$$\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) = -(3 - 1) = -2$$

**Step 2:** Find $\frac{\partial \hat{y}}{\partial w_2}$

$$\hat{y} = w_2 \times h$$

$$\frac{\partial \hat{y}}{\partial w_2} = h = 1.0$$

**Step 3:** Multiply them (chain rule)

$$\frac{\partial L}{\partial w_2} = -2 \times 1.0 = -2.0$$

Computing ∂L/∂w₁ (Hidden Weight)

This is trickier because $w_1$ affects the loss through a longer chain: $L \rightarrow \hat{y} \rightarrow h \rightarrow z_1 \rightarrow w_1$

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial h} \times \frac{\partial h}{\partial z_1} \times \frac{\partial z_1}{\partial w_1}$$

**Step 1:** We already know $\frac{\partial L}{\partial \hat{y}} = -2$

**Step 2:** Find $\frac{\partial \hat{y}}{\partial h}$

$$\hat{y} = w_2 \times h$$

$$\frac{\partial \hat{y}}{\partial h} = w_2 = 1.0$$

**Step 3:** Find $\frac{\partial h}{\partial z_1}$ (ReLU derivative)

$$h = \text{ReLU}(z_1) = \max(0, z_1)$$

$$\frac{\partial h}{\partial z_1} = \begin{cases} 1 & \text{if } z_1 > 0 \\ 0 & \text{if } z_1 \leq 0 \end{cases}$$

Since $z_1 = 1.0 > 0$, we have $\frac{\partial h}{\partial z_1} = 1$

**Step 4:** Find $\frac{\partial z_1}{\partial w_1}$

$$z_1 = w_1 \times x$$

$$\frac{\partial z_1}{\partial w_1} = x = 2.0$$

**Step 5:** Multiply all together

$$\frac{\partial L}{\partial w_1} = -2 \times 1.0 \times 1 \times 2.0 = -4.0$$

Step 4: Updating the Weights

Now that we know the gradients, we can update our weights using gradient descent:

$$w_{\text{new}} = w_{\text{old}} - \alpha \times \frac{\partial L}{\partial w}$$

where $\alpha$ is the learning rate (let's use $\alpha = 0.1$)

**Update w₁:**

$$w_1^{\text{new}} = 0.5 - 0.1 \times (-4.0) = 0.5 + 0.4 = 0.9$$

**Update w₂:**

$$w_2^{\text{new}} = 1.0 - 0.1 \times (-2.0) = 1.0 + 0.2 = 1.2$$

Complete Implementation from Scratch

Let's put it all together in a complete training loop:

Key Takeaways

**Backpropagation is just the chain rule** applied systematically to find gradients
**Forward pass** computes predictions; **backward pass** computes gradients
**Gradients tell us direction**: negative gradient means increase weight, positive means decrease
**The chain rule multiplies local derivatives** as we trace back through the network
**Each layer only needs to know its local derivative**—this is what makes backprop scalable
**ReLU derivative is simple**: 1 if input > 0, else 0

Conclusion

Backpropagation isn't magic—it's a systematic application of the chain rule. By breaking the network into small pieces and computing local derivatives, we can efficiently find how every weight contributes to the error. This same principle scales from our tiny 2-weight network to massive models with billions of parameters.

The key insight: **you don't need to understand the entire network at once**. Each layer just needs to know its own derivative, and the chain rule connects everything together. That's the beauty of backpropagation!

Batch Normalization Explained: Why Your Neural Network Needs It

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine you're trying to bake a cake, but your oven temperature keeps changing randomly — sometimes 200°C, sometimes 400°C, sometimes 50°C. You'd never get consistent results. Neural networks face a similar problem: as data flows through multiple layers, the numbers can spiral out of control. **Batch Normalization** solves this by keeping the 'temperature' consistent at each layer. This post explains exactly how it works, why it's so important, and the critical mistake that causes mysteriously bad test results.

Part 1 — The Problem: Internal Covariate Shift

Let's start with the problem. When you train a neural network, each layer receives inputs from the previous layer. But as the previous layer's weights update during training, the distribution of its outputs changes. This means every layer is constantly trying to hit a moving target.

Here's a concrete example. Imagine Layer 2 is learning to recognize patterns in the outputs of Layer 1. But Layer 1's weights are also updating, so its outputs keep changing. Today Layer 1 outputs numbers between 0 and 1. Tomorrow, after some training, it outputs numbers between -100 and 100. Layer 2 has to constantly readjust to these changing inputs.

This phenomenon is called **internal covariate shift**. 'Internal' because it happens inside the network. 'Covariate' because the input distribution is changing. 'Shift' because it's moving around. The result? Training becomes slow and unstable. You need tiny learning rates to avoid exploding gradients, and even then, convergence is painful.

Part 2 — The Solution: Normalize Each Layer's Inputs

Batch Normalization's core idea is beautifully simple: after each layer, normalize the outputs so they have a consistent distribution. Specifically, make them have mean=0 and variance=1.

Here's the math (don't worry, we'll explain every symbol):

Let's break this down step by step with a real example.

Step 1: Compute Batch Statistics

Suppose you have a batch of 4 examples, each with 3 features (neurons):

Notice the huge differences in scale: Feature 1 has mean 125 and variance 3125. Feature 2 has mean 0.0025 and variance 0.00000125. Feature 3 is somewhere in between. This inconsistency makes training hard.

Step 2: Normalize

Step 3: Scale and Shift (The Learnable Part)

Here's a subtle but crucial point: forcing everything to mean=0 and variance=1 might be too restrictive. What if the optimal distribution for this layer is actually mean=5 and variance=2? Batch Norm handles this by adding two learnable parameters per feature:

**gamma (γ)**: A scale parameter (initially 1.0)
**beta (β)**: A shift parameter (initially 0.0)

This is brilliant: we normalize to a standard distribution, but give the network the flexibility to learn the optimal distribution for each layer. If the network decides that mean=0, variance=1 is actually best, it can learn gamma=1 and beta=0 (which is where they start). If it needs something else, it can learn different values.

Part 3 — Why Batch Normalization Works So Well

Batch Normalization provides three major benefits:

Benefit 1: Faster Training

With normalized inputs at each layer, you can use much higher learning rates without the training exploding. Why? Because the gradients stay in a reasonable range. Without batch norm, a large learning rate might cause some weights to get huge updates while others get tiny updates. With batch norm, the scale is consistent, so a single learning rate works well for all layers.

Benefit 2: More Stable Training

Without batch norm, training can be fragile. A slightly wrong learning rate, a slightly wrong initialization, and your loss explodes to infinity or gets stuck. With batch norm, training is much more forgiving. The normalization acts like a safety net, keeping activations in a reasonable range even when things go slightly wrong.

Benefit 3: Mild Regularization

Batch norm has a subtle regularization effect. Because it normalizes using the statistics of the current mini-batch, there's a bit of noise in the normalization (different batches have slightly different means and variances). This noise acts like a mild form of regularization, similar to dropout, helping prevent overfitting.

Part 4 — The Critical Difference: Training vs Evaluation Mode

This is where most beginners get tripped up. Batch Normalization behaves **completely differently** during training versus evaluation. Understanding this difference is absolutely critical.

During Training

During training, batch norm uses the statistics of the **current mini-batch**:

During Evaluation/Testing

During evaluation, batch norm uses **running averages** computed during training:

Why This Difference Matters

Imagine you're testing your model on a single example. If you used the current batch's statistics, you'd be normalizing based on just one example — the mean would be the example itself, and the variance would be zero! That's nonsense.

Instead, during evaluation, we use the running averages accumulated during training. These represent the 'typical' mean and variance across the entire training set, giving stable, consistent predictions regardless of batch size.

Part 5 — Implementing Batch Normalization in PyTorch

PyTorch makes batch norm easy with `nn.BatchNorm1d` (for fully-connected layers) and `nn.BatchNorm2d` (for convolutional layers). Here's a complete example:

Where to Place Batch Norm

The standard pattern is: **Linear → BatchNorm → Activation (ReLU)**. Some people put batch norm after the activation, but the original paper and most practitioners put it before. The reasoning: normalize the pre-activation values, then apply the nonlinearity.

Part 6 — Common Questions and Misconceptions

Q: Does batch size matter for batch norm?

Yes! Batch norm computes statistics over the batch, so very small batches (like 2-4 examples) give noisy estimates. The original paper used batches of 32 or larger. If you must use tiny batches, consider Layer Normalization or Group Normalization instead.

Q: Can I use batch norm with dropout?

Yes, they're complementary. A common pattern is: **Linear → BatchNorm → ReLU → Dropout**. Batch norm stabilizes training, dropout prevents overfitting. They work well together.

Q: What about batch norm for RNNs/LSTMs?

Batch norm is tricky for recurrent networks because the sequence length varies. Layer Normalization is usually preferred for RNNs. But for feed-forward networks (MLPs, CNNs), batch norm is the standard choice.

Part 7 — Debugging Batch Norm Issues

If your model with batch norm isn't working, check these common issues:

**Forgot model.eval()**: Your test accuracy will be wrong. Always call model.eval() before evaluation.
**Batch size too small**: With batches of 2-4, statistics are too noisy. Use at least 16-32.
**Batch norm on output layer**: Don't do this. Only use batch norm on hidden layers.
**Wrong order**: The standard is Linear → BatchNorm → Activation, not Activation → BatchNorm.
**Not loading running stats**: If you save/load a model, make sure to save the batch norm's running_mean and running_var.

Key Takeaways

**Batch Normalization normalizes layer inputs** to have consistent mean and variance, solving internal covariate shift.
**It enables faster training** by allowing higher learning rates and more stable gradients.
**Training mode uses current batch statistics**, evaluation mode uses running averages from training.
**Always call model.eval()** before testing — this is the most common batch norm bug.
**Standard pattern**: Linear → BatchNorm → ReLU → (optional Dropout)
**Don't use on output layers**, only on hidden layers.
**Requires reasonable batch sizes** (at least 16-32) for stable statistics.

Dropout Explained: The Surprisingly Simple Trick That Prevents Overfitting

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine training a sports team where, at every practice, you randomly send 30% of the players home. Sounds crazy, right? But this 'crazy' idea — called **Dropout** — is one of the most effective techniques in deep learning. By randomly turning off neurons during training, we force the network to learn more robust, generalizable patterns. This post explains exactly how dropout works, why it's so effective, and how to use it correctly.

Part 1 — The Problem: Overfitting and Co-Adaptation

Before we understand dropout, we need to understand the problem it solves: **overfitting**. Overfitting happens when a model learns the training data too well — including all its noise and quirks — and fails to generalize to new data.

What Is Overfitting?

Think of a student who memorizes answers to practice problems without understanding the concepts. They ace the practice test (100% on training data) but fail the real exam (poor performance on test data). That's overfitting.

Here's a concrete example. Suppose you're training a model to recognize cats. An overfit model might learn: 'If there's a red collar at pixel (45, 67), it's a cat.' This works for training images with red collars, but fails on new cats without red collars. A good model learns 'pointy ears + whiskers + fur texture = cat' — features that generalize.

The Co-Adaptation Problem

There's a subtler problem called **co-adaptation**. This happens when neurons become too dependent on each other. Neuron A learns to detect one specific pattern, Neuron B learns to detect another, and Neuron C only works when both A and B fire together.

This is fragile. If the input changes slightly and Neuron A doesn't fire, the whole chain breaks. The network has learned a brittle, overly-specific solution instead of robust, independent features.

Part 2 — The Solution: Dropout

Dropout's solution is brilliantly simple: during training, randomly set a fraction of neurons to zero. Typically, we drop 20-50% of neurons (30% is common). Which neurons? Different ones each time, chosen randomly.

How Dropout Works (Step by Step)

Let's walk through a concrete example. Suppose you have a layer with 10 neurons and dropout rate p=0.3 (30%):

Three things happened:

**Random selection**: 3 neurons (30%) were randomly chosen to be dropped
**Zeroing**: Those neurons were set to 0
**Scaling**: The remaining neurons were scaled up by 1/(1-0.3) ≈ 1.43

Why Scale Up the Remaining Neurons?

This is a subtle but important detail. If we just zeroed out 30% of neurons without scaling, the total activation would drop by 30%. The next layer would receive weaker signals than expected.

By scaling up the remaining neurons by 1/(1-p), we keep the **expected sum** the same. If 10 neurons each output 1, the sum is 10. If we drop 3 and scale the remaining 7 by 1.43, the sum is 7 × 1.43 ≈ 10. The next layer sees roughly the same total activation.

Part 3 — Why Dropout Works: The Ensemble Effect

Dropout works for two related reasons: it prevents co-adaptation and creates an ensemble effect.

Preventing Co-Adaptation

When neurons can't rely on specific other neurons always being present, they're forced to learn independently useful features. Each neuron must learn something valuable on its own, not just as part of a specific combination.

Back to the basketball analogy: if players are randomly absent at each practice, every player learns to be useful independently. Player A learns to shoot, pass, and defend — not just 'pass to Player B.' The team becomes more robust.

The Ensemble Effect

Here's a deeper insight: each time you apply dropout, you're effectively training a different sub-network. With 1000 neurons and 50% dropout, there are 2^1000 possible sub-networks (each neuron is either on or off).

During training, you're randomly sampling and training many of these sub-networks. At test time (with all neurons active), you're effectively averaging the predictions of all these sub-networks. This is similar to **ensemble learning**, where combining multiple models gives better results than any single model.

Part 4 — Training vs Testing: The Critical Difference

This is crucial: **Dropout only happens during training, never during testing**. Let's see why and how.

During Training

During Testing/Evaluation

Why disable dropout during testing? Because we want consistent, deterministic predictions. If dropout were active during testing, the same input would give different outputs each time (due to random neuron dropping). That's unacceptable for a production system.

Part 5 — Implementing Dropout in PyTorch

PyTorch makes dropout easy with `nn.Dropout`. Here's a complete example:

Where to Place Dropout

The standard pattern is: **Linear → Activation → Dropout**. Some variations:

**With BatchNorm**: Linear → BatchNorm → Activation → Dropout
**Without BatchNorm**: Linear → Activation → Dropout
**Never on output layer**: Dropout is for hidden layers only

Part 6 — Choosing the Right Dropout Rate

The dropout rate (p) is a hyperparameter you need to tune. Here are some guidelines:

**Common choices:**

**0.3 (30%)**: Good default for fully-connected layers
**0.5 (50%)**: Original dropout paper's recommendation
**0.1-0.2 (10-20%)**: For convolutional layers (they need less regularization)

Part 7 — Dropout vs Other Regularization Techniques

Dropout isn't the only regularization technique. Here's how it compares:

**Best practice**: Use multiple techniques together. A common combination is: **Dropout + L2 regularization + Early stopping**. They complement each other.

Part 8 — Common Mistakes and How to Avoid Them

**Forgetting model.eval()**: Dropout stays active during testing, giving random predictions. Always call model.eval() before inference.
**Dropout on output layer**: Never apply dropout to the final layer. It corrupts your predictions.
**Too high dropout rate**: p > 0.5 often hurts more than helps. Start with 0.3.
**Using dropout with very small networks**: If your network only has 10-20 neurons per layer, dropout might remove too much capacity. Use it with larger networks (100+ neurons per layer).
**Not training long enough**: Dropout slows convergence. You might need 2x more epochs compared to no dropout.
**Dropout with small batch sizes**: With batch size < 16, dropout adds too much noise. Use larger batches or reduce dropout rate.

Part 9 — Visualizing Dropout's Effect

Let's see dropout in action with a simple experiment:

Key Takeaways

**Dropout prevents overfitting** by randomly zeroing neurons during training, forcing the network to learn robust features.
**It prevents co-adaptation** — neurons can't rely on specific other neurons always being present.
**Training mode**: Dropout is active, randomly drops p% of neurons, scales remaining by 1/(1-p).
**Evaluation mode**: Dropout is disabled, all neurons active, predictions are deterministic.
**Always call model.eval()** before testing — this is the most common dropout bug.
**Standard pattern**: Linear → Activation → Dropout (never on output layer).
**Common dropout rates**: 0.3 for fully-connected layers, 0.1-0.2 for convolutional layers.
**Combine with other techniques**: Dropout + L2 + Early stopping works well together.

Early Stopping Explained: Knowing When to Stop Training

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine studying for an exam. If you stop too early, you haven't learned enough. If you study too long, you start overthinking and second-guessing yourself. There's a sweet spot — and finding it is crucial. **Early stopping** solves the same problem for neural networks: it automatically finds the optimal training duration, preventing both underfitting (stopping too early) and overfitting (training too long). This post explains exactly how early stopping works, why it's essential, and how to implement it correctly.

Part 1 — The Problem: When to Stop Training?

Training a neural network is an iterative process. Each epoch, the model sees the entire training dataset and updates its weights. But how many epochs should you train for? This is harder than it sounds.

Stop Too Early: Underfitting

If you stop training after 5 epochs when the model needs 50, you get **underfitting**. The model hasn't learned the patterns in your data yet. Both training and test accuracy are low.

Train Too Long: Overfitting

If you train for 500 epochs when the model only needed 50, you get **overfitting**. The model starts memorizing the training data instead of learning generalizable patterns. Training accuracy keeps improving, but test accuracy plateaus or even decreases.

Here's what happens during overfitting:

**Early epochs**: Model learns general patterns (good)
**Middle epochs**: Model refines understanding (still good)
**Late epochs**: Model starts memorizing training examples (bad)
**Very late epochs**: Model has memorized training data perfectly but fails on new data (very bad)

The Sweet Spot

There's an optimal number of epochs where the model has learned the patterns but hasn't started memorizing. This is where test accuracy is highest. The problem: you don't know this number in advance. It depends on:

Your dataset size
Model complexity
Learning rate
Regularization strength
Random initialization

You could guess ("let's try 100 epochs"), but that's wasteful. Early stopping finds the sweet spot automatically.

Part 2 — How Early Stopping Works

Early stopping monitors a validation metric (usually validation accuracy or validation loss) after each epoch. When the metric stops improving, training stops. Here's the algorithm:

Let's break down each component:

The Validation Set

You need three datasets:

**Training set**: Used to update weights
**Validation set**: Used to monitor progress and decide when to stop
**Test set**: Used only at the very end to report final performance

The validation set is crucial. You can't use training accuracy (it always improves, even during overfitting) or test accuracy (that would be cheating — you'd be peeking at the exam). The validation set is your honest progress check.

The Patience Parameter

**Patience** is how many epochs you wait without improvement before stopping. Why not stop immediately after the first epoch without improvement? Because validation metrics are noisy — they can fluctuate randomly.

Notice epoch 5 — validation accuracy improved after 2 epochs of decline. If patience was 1, we would have stopped too early. Patience gives the model a chance to recover from temporary dips.

Saving and Restoring Weights

This is the most critical part that beginners often get wrong. When early stopping fires, you must restore the **best** weights, not the **last** weights.

Why? Because the last few epochs were overfitting — that's why validation accuracy stopped improving! The best weights are from several epochs ago, when validation accuracy peaked.

Part 3 — Implementing Early Stopping in PyTorch

Here's a complete, production-ready implementation:

Part 4 — Choosing the Right Metric

What should you monitor? The most common choices:

Part 5 — Early Stopping vs Other Stopping Criteria

Early stopping isn't the only way to decide when to stop. Let's compare:

**Best practice**: Combine early stopping with a maximum epoch limit. This gives you automatic stopping with a safety net:

Part 6 — Visualizing Early Stopping

A picture is worth a thousand words. Here's what early stopping looks like in practice:

The key insight: training loss keeps decreasing (the model keeps improving on training data), but validation loss starts increasing after epoch 23 (the model is overfitting). Early stopping detects this and restores the weights from epoch 23.

Part 7 — Common Mistakes and How to Avoid Them

**Not restoring best weights**: The #1 bug. Always call restore_best_weights() after training stops.
**Using training metric instead of validation**: Training accuracy always improves, even during overfitting. Use validation metric.
**Patience too small**: patience=1 is too aggressive. Use at least 3-5.
**No validation set**: You need a separate validation set. Don't use test set for early stopping (that's cheating).
**Forgetting copy.deepcopy()**: model.state_dict() returns references, not copies. Use copy.deepcopy().
**Monitoring the wrong metric**: For classification, monitor accuracy (mode='max'). For regression, monitor loss (mode='min').
**Not setting max_epochs**: Always have a maximum epoch limit as a safety net.

Part 8 — Advanced: Min Delta

Sometimes validation metrics improve by tiny amounts (0.001%) due to noise. You might want to ignore these tiny improvements and only count "real" improvements. That's what **min_delta** does:

**When to use min_delta**: If your validation metric is very noisy and fluctuates by small amounts, set min_delta to filter out noise. For most problems, min_delta=0.0 (the default) works fine.

Part 9 — Early Stopping in Practice

Here's a complete training script with early stopping:

Key Takeaways

**Early stopping prevents overfitting** by monitoring validation metrics and stopping when they plateau.
**Patience controls sensitivity**: Higher patience is more conservative, lower patience stops faster.
**Always restore best weights**: The last weights are overfit; the best weights are from several epochs ago.
**Use copy.deepcopy()**: Make true copies of weights, not references.
**Monitor validation metrics**: Never use training metrics (they always improve) or test metrics (that's cheating).
**Combine with max_epochs**: Set a maximum epoch limit as a safety net.
**Default settings work well**: patience=5, min_delta=0.0, mode='max' for accuracy.
**Early stopping is free regularization**: No hyperparameters to tune, just works.

Linear Algebra for Machine Learning: A Complete Intuitive Guide

AI Educator — Wed, 22 Apr 2026 00:00:00 GMT

Linear algebra is the mathematical foundation of modern machine learning. Every neural network, from simple linear regression to GPT-4, relies fundamentally on vectors, matrices, and the operations that transform them. This comprehensive guide will take you from basic linear algebra concepts to understanding how automatic differentiation powers deep learning frameworks like PyTorch and TensorFlow.

Part 1: Vector Fundamentals

Understanding Vectors: Beyond Arrays

A vector isn't just a list of numbers—it's a geometric object with both magnitude and direction. In machine learning, vectors represent everything: input features, model weights, gradients, and embeddings.

Consider a 2D vector:

$$\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$

This represents an arrow from origin $(0, 0)$ to point $(3, 4)$. Its magnitude (length) is:

$$\|\mathbf{v}\| = \sqrt{3^2 + 4^2} = 5$$

Essential Vector Operations

**Dot Product** - The most important operation in ML:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_ib_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)$$

The dot product measures alignment. When $\theta = 0°$ (parallel), $\cos(\theta) = 1$ (maximum). When $\theta = 90°$ (perpendicular), $\cos(\theta) = 0$ (orthogonal).

Part 2: Matrix Operations

Matrices as Linear Transformations

A matrix isn't just a 2D array—it's a **linear transformation** that maps vectors from one space to another. Matrix-vector multiplication transforms the input vector.

Example scaling matrix:

$$\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$\mathbf{A}\mathbf{v} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

This scales x by 2 and y by 3.

Matrix Multiplication

For matrices $\mathbf{A}$ (size $m \times n$) and $\mathbf{B}$ (size $n \times p$), the product $\mathbf{C} = \mathbf{AB}$ has size $m \times p$ where:

$$C_{ij} = \sum_{k=1}^{n} A_{ik}B_{kj}$$

Each element is the dot product of row $i$ from $\mathbf{A}$ and column $j$ from $\mathbf{B}$.

Special Matrices

Part 3: Eigenvalues and Eigenvectors

An eigenvector of matrix $\mathbf{A}$ is a special vector that doesn't change direction when $\mathbf{A}$ is applied—it only gets scaled:

$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$

where $\mathbf{v}$ is the eigenvector and $\lambda$ is the eigenvalue.

Part 4: Calculus and Gradients

From Derivatives to Gradients

For multivariable functions, we need the **gradient**—a vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

**Geometric Meaning:** The gradient points in the direction of steepest ascent.

Example: For $f(x, y) = x^2 + 2y^2$

$$\nabla f = \begin{bmatrix} 2x \\ 4y \end{bmatrix}$$

The Jacobian Matrix

When a function outputs a vector, we need the **Jacobian matrix**:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

The Chain Rule

For composed functions $z = f(y)$ and $y = g(x)$:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

In vector form with Jacobians, this becomes matrix multiplication.

Part 5: Computation Graphs

Modern ML frameworks build **computation graphs**—directed acyclic graphs where nodes are operations and edges are data flow. Consider:

$$y = x^2 + 3x + 1$$

We break this into elementary operations:

Input: $x$
Operation A: $a = x^2$
Operation B: $b = 3x$
Operation C: $c = a + b$
Output: $y = c + 1$

Forward Pass

During forward pass, we compute outputs and store intermediate values. For $x = 2$:

Backward Pass: Backpropagation

During backward pass, we compute gradients by working backwards, multiplying local derivatives along each path.

Since $x$ affects $y$ through TWO paths, we sum contributions:

$$\frac{dy}{dx} = \frac{dy}{da} \cdot \frac{da}{dx} + \frac{dy}{db} \cdot \frac{db}{dx} = 1 \times 4 + 1 \times 3 = 7$$

This matches the analytical derivative: $\frac{d}{dx}(x^2 + 3x + 1) = 2x + 3 = 7$ at $x=2$.

PyTorch Autograd

Part 6: Gradient Descent

Gradient descent minimizes a function by moving opposite to the gradient:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t)$$

where $\alpha$ is the learning rate.

Common Issues

Part 7: Neural Network Example

A neural network is function composition. Each layer applies a linear transformation followed by nonlinear activation:

$$\mathbf{h}_1 = \sigma(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)$$

$$\mathbf{y} = \mathbf{W}_2\mathbf{h}_1 + \mathbf{b}_2$$

Key Takeaways

**Vectors and matrices** are geometric objects representing transformations
**Dot products** measure alignment and power neural networks
**Eigenvalues** reveal matrix properties and affect optimization
**Gradients** point uphill; we move opposite to minimize
**Computation graphs** enable automatic differentiation
**Backpropagation** is the chain rule applied systematically
**Gradients sum** when variables affect output through multiple paths

Final Thoughts

Understanding linear algebra transforms machine learning from magic to mathematics. When you see a neural network, you now understand it's matrix multiplications and element-wise operations composed together. When you call backward() in PyTorch, you know it's systematically applying the chain rule through a computation graph.

The beauty is scalability. The same principles that work for a simple polynomial also power GPT-4 with billions of parameters. The mathematics remains elegant and consistent.

Understanding Neural Networks: From Word Counting to Meaning Understanding

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine teaching a computer to understand what you mean when you say something. Not just matching keywords, but actually *understanding* that 'book a flight' and 'reserve a plane ticket' mean the same thing, even though they share almost no words. This post explains how modern AI does exactly that, using two powerful ideas: **sentence embeddings** (a way to capture meaning) and **multi-layer perceptrons** (a simple but effective neural network). We'll break down every concept into plain English with real examples.

Part 1 — The Problem: Why Counting Words Fails

Let's start with the old way of teaching computers to understand text: **counting words**. This approach is called TF-IDF (Term Frequency-Inverse Document Frequency), and it's like giving each word a score based on how often it appears.

Here's how it works: if you have the sentence 'book a flight to Tokyo', the computer creates a list of all possible words it knows (maybe 10,000 words), and marks which ones appear in your sentence. So 'book' gets a 1, 'flight' gets a 1, 'Tokyo' gets a 1, and the other 9,997 words get a 0.

Think about it like this: imagine trying to understand a recipe by only counting how many times each ingredient appears, without knowing the order or how they relate to each other. You'd know there's flour and eggs, but not whether you're making a cake or scrambled eggs!

Part 2 — The Solution: Sentence Embeddings (Meaning as Coordinates)

Now for the magic trick. Instead of counting words, what if we could capture the *meaning* of a sentence as a set of numbers? That's exactly what **sentence embeddings** do.

Think of it like GPS coordinates. Every location on Earth can be described by two numbers (latitude and longitude). Similarly, every sentence can be described by a list of numbers (typically 384 or 768 numbers) that capture its meaning. Sentences that mean similar things get similar numbers, like how nearby places have similar GPS coordinates.

Here's the best part: someone else already built this map for us! Companies like Google and Hugging Face trained models on billions of sentences to learn these coordinates. We can download their work for free and use it. This is called **transfer learning** — borrowing intelligence from someone else's hard work.

A Real Example

The model we use is called **all-MiniLM-L6-v2**. It's small (only 80MB), fast (works on regular computers), and produces 384-dimensional embeddings. Think of those 384 numbers as 384 different aspects of meaning — like 'is this about travel?', 'is this a question?', 'is this urgent?', and 381 other subtle aspects.

Part 3 — Why We 'Freeze' the Embedding Model

Here's an important concept: we **freeze** the embedding model, which means we never change it. We use it exactly as we downloaded it. Why?

Think of it like using a dictionary. The dictionary was created by experts who studied millions of words. When you look up a word, you don't rewrite the dictionary — you just use what's already there. Same idea here: the embedding model was trained on billions of sentences, and we only have thousands. If we tried to 'improve' it with our small dataset, we'd actually make it worse.

Part 4 — Caching: Don't Do the Same Work Twice

Converting 15,000 sentences into embeddings takes a few minutes. If we had to do this every time we train our model, we'd waste hours. So we do something smart: **caching**.

Caching means: compute the embeddings once, save them to a file, and then just load that file every time you need them. It's like meal prepping — cook once on Sunday, eat all week.

Part 5 — Neural Networks: Learning Patterns with Hidden Layers

Now we get to the heart of it: **neural networks**. Let's build up the intuition step by step.

The Limitation of Straight Lines

Imagine you're trying to separate apples from oranges on a table. If all the apples are on one side and all the oranges are on the other, you can draw a straight line between them. Easy!

But what if the apples are in the middle and the oranges are in a circle around them? No straight line can separate them. You need a curved boundary. That's exactly the problem with simple models — they can only draw straight lines (or flat surfaces in higher dimensions).

What Are Hidden Layers?

A **hidden layer** is a transformation step between input and output. Think of it like this:

**Input layer**: Your data comes in (the 384 embedding numbers)
**Hidden layer 1**: Transform those 384 numbers into 256 new numbers that capture useful patterns
**Hidden layer 2**: Transform those 256 numbers into 128 numbers that capture even more refined patterns
**Output layer**: Transform those 128 numbers into 151 final scores (one for each possible intent)

Each hidden layer is like a filter that extracts increasingly sophisticated patterns. The first layer might detect simple things like 'contains travel words' or 'sounds like a question'. The second layer might detect combinations like 'travel question about weather' or 'urgent booking request'.

ReLU: The Secret Ingredient

Here's a crucial detail: between each layer, we apply something called **ReLU** (Rectified Linear Unit). It's incredibly simple: if a number is positive, keep it; if it's negative, change it to zero.

Why is this important? Without ReLU (or some other nonlinear function), stacking multiple layers would be pointless — they'd collapse into a single layer mathematically. ReLU breaks this equivalence and lets the network learn curves instead of just straight lines.

Part 6 — Batch Normalization: Keeping Things Stable

As data flows through multiple layers, the numbers can get out of control — some might be in the thousands, others near zero. This makes training unstable. **[Batch Normalization](/blog/batch-normalization-explained)** fixes this.

Here's the idea: after each layer, normalize the numbers so they have a consistent scale (roughly mean=0, standard deviation=1). It's like adjusting the volume on different audio tracks so they're all at the same level before mixing them together.

One critical detail: Batch Normalization behaves differently during training versus testing. During training, it uses the current batch's statistics. During testing, it uses stable averages computed during training. This is why you must call `model.eval()` before testing — forgetting this is the #1 cause of mysteriously bad test results!

Part 7 — Dropout: Preventing Memorization

Here's a problem: if you train a model too long on the same data, it starts to **memorize** instead of **learn**. It's like a student who memorizes answers without understanding the concepts — they ace the practice test but fail the real exam.

**[Dropout](/blog/dropout-regularization-explained)** is a clever solution: during training, randomly turn off 30% of the neurons in each layer. Which 30%? Different ones each time, chosen randomly.

During testing, all neurons are active (no dropout). The model has learned to work with any subset of neurons, so when all are present, it performs even better.

Part 8 — The Adam Optimizer: Smart Learning

Training a neural network means adjusting millions of numbers (the weights) to minimize errors. The old way (called SGD - Stochastic Gradient Descent) adjusts every weight by the same amount. **[Adam](/blog/adam-optimizer-explained)** is smarter.

[Adam](/blog/adam-optimizer-explained) gives each weight its own personalized learning rate based on its history. If a weight's gradient has been consistently pointing in one direction, Adam lets it move faster. If a weight's gradient has been bouncing around randomly, Adam makes it move more cautiously.

Adam also includes **weight decay**, which is a fancy term for 'penalize weights that get too large'. This prevents the model from becoming too confident about any single pattern, which helps it generalize better to new data.

Part 9 — Early Stopping: Knowing When to Quit

How do you know when to stop training? If you stop too early, the model hasn't learned enough. If you train too long, it starts memorizing the training data and performs poorly on new data.

**[Early stopping](/blog/early-stopping-explained)** solves this automatically. Here's how it works:

After each training epoch, test the model on validation data (data it hasn't trained on)
If the validation accuracy improves, save the model weights and reset a counter
If the validation accuracy doesn't improve, increment the counter
If the counter reaches a threshold (say, 5 epochs without improvement), stop training and restore the best weights

Part 10 — Putting It All Together: The Complete Architecture

Let's see how all these pieces fit together in our MLP (Multi-Layer Perceptron) classifier:

Here's what happens step by step:

**Input**: A sentence like 'Book a flight to Tokyo'
**Sentence Transformer**: Converts it to 384 numbers capturing its meaning (frozen, never changes)
**First Hidden Layer**: 384 → 256 numbers, normalized, ReLU applied, 30% randomly dropped
**Second Hidden Layer**: 256 → 128 numbers, normalized, ReLU applied, 30% randomly dropped
**Output Layer**: 128 → 151 final scores (one for each possible intent)
**Prediction**: Pick the intent with the highest score

Part 11 — The Training Process: How Learning Happens

Training is an iterative process. Here's the cycle that repeats thousands of times:

**Forward Pass**: Feed a batch of examples through the network, get predictions
**Compute Loss**: Measure how wrong the predictions are (using CrossEntropyLoss)
**Backward Pass**: Calculate how to adjust each weight to reduce the error (using backpropagation)
**Update Weights**: Adjust the weights using Adam optimizer
**Repeat**: Do this for thousands of batches across many epochs

Part 12 — Why This Works So Well

When you combine all these techniques, something magical happens. On a dataset with 151 different intents (like 'book_flight', 'weather_query', 'play_music', etc.), this approach achieves over 90% accuracy. Compare that to the old word-counting approach which maxes out around 78%.

Why such a big jump? The sentence embeddings do most of the heavy lifting. They already understand that 'book', 'reserve', and 'schedule' are related. They already know that 'flight' and 'plane ticket' mean similar things. The MLP just needs to learn simple decision boundaries in this well-organized space.

Part 13 — Common Mistakes and How to Avoid Them

Here are the most common mistakes beginners make, and how to avoid them:

**Forgetting model.eval()**: Always call this before testing. If you don't, Dropout stays active and BatchNorm uses wrong statistics. Your test accuracy will be mysteriously bad.
**Not restoring best weights**: Early stopping finds the best epoch, but you must load those weights back. Otherwise you're using the last epoch's weights, which are often overfit.
**Learning rate too high or too low**: For Adam, 1e-3 (0.001) is a good starting point. Too high and training explodes, too low and nothing happens.
**Hidden layers too small**: With 151 output classes, you need enough capacity. [256, 128] works well. [32] is too small.
**Not caching embeddings**: Computing embeddings takes minutes. Cache them to disk so you only do it once.

Part 14 — Key Concepts Summary

Conclusion: The Big Picture

Let's zoom out and see the forest, not just the trees. Modern NLP works by dividing labor: a pretrained model (the sentence transformer) does the hard work of understanding language, and a small neural network (the MLP) does the easier work of classification.

This pattern — frozen pretrained encoder + small trainable head — is everywhere in modern AI. It's how Google, Meta, and virtually every company doing serious NLP work builds their systems. You've just learned the foundation.

The beautiful thing is that once you understand these building blocks, you can understand much more complex systems. Transformers, BERT, GPT — they all use these same fundamental ideas, just arranged in more sophisticated ways. You've taken the first step into a much larger world.

PyTorch Autograd: Automatic Differentiation from the Ground Up

AI Educator — Wed, 22 Apr 2026 00:00:00 GMT

Every time a neural network learns something — recognising a cat, translating a sentence, beating you at chess — it does so by computing **gradients** and nudging its parameters in the right direction. This process is called *backpropagation*, and in PyTorch it is handled entirely automatically by a subsystem called **autograd**. You never have to derive a single derivative by hand. In this post we'll build a mental model of how autograd works, play with real examples, and by the end you'll feel completely comfortable using it in your own projects.

1. What Is a Gradient (in Plain English)?

Imagine you are standing on a hilly landscape in thick fog. You can't see the valley, but you *can* feel the slope under your feet. The gradient tells you: **"how steeply is the ground rising, and in which direction?"** If you always step in the *opposite* direction of the slope, you'll eventually reach the lowest point — the valley.

In machine learning, the landscape is a **loss function** — a number that measures how wrong our model's predictions are. The 'ground' is all the model's parameters (weights). The gradient tells us: *"if I change each weight by a tiny amount, how much does the loss go up or down?"* We then nudge every weight slightly *downhill* — this is **gradient descent**.

2. Tensors and requires_grad

In PyTorch, data lives in **[tensors](/blog/what-is-a-tensor)** — think of them as supercharged NumPy arrays. By default, PyTorch doesn't track gradients for a tensor. You have to opt in by setting `requires_grad=True`. This tells PyTorch: *"watch this tensor — I want to know how the final output changes when this value changes."*

3. Your First Gradient Computation

Let's compute the gradient of a simple polynomial: **y = x² + 3x + 1**. We know from calculus that dy/dx = 2x + 3, so at x = 2 the gradient should be 2(2) + 3 = **7**. Let's verify that PyTorch agrees:

4. The Computational Graph — How PyTorch Sees Your Code

Every time you perform an operation on a `requires_grad` tensor, PyTorch silently builds a **computational graph** — a record of exactly what operations were performed and in what order. This graph is what makes automatic differentiation possible.

Think of it as a recipe card that PyTorch writes while you cook. When you call `.backward()`, PyTorch reads that recipe card *backwards* — from the final result all the way back to the inputs — applying the chain rule at each step.

Each node in the graph stores a `grad_fn` — the backward function that knows how to propagate gradients through that specific operation. Let's inspect it:

5. Gradients With Multiple Inputs

A neural network has *millions* of parameters — let's see how autograd handles multiple inputs at once. Consider **z = 2x² + y³**, where we want both ∂z/∂x and ∂z/∂y:

One single `.backward()` call populated the gradients for *all* participating leaf tensors simultaneously. In a real network, this means one backward pass computes gradients for every single weight — no matter how many there are.

6. The Chain Rule — The Heart of Backpropagation

Autograd works by applying the **chain rule** from calculus. The chain rule says: if `z` depends on `y`, and `y` depends on `x`, then the gradient of `z` with respect to `x` is:

dz/dx = (dz/dy) × (dy/dx)

PyTorch applies this rule at every node in the computational graph, chaining all the local gradients together as it works its way backward from the output to the inputs. Let's trace this manually for a two-step function:

7. Gradients With Vectors and Matrices

So far we've used scalar (single number) tensors. Real networks deal with vectors and matrices. When the output is a vector, `.backward()` needs a **gradient argument** — called the *vector-Jacobian product* — to know how to weight each output dimension. The most common case is passing a tensor of ones, which is equivalent to summing the outputs first.

8. Zeroing Gradients — A Critical Step

Here's one of the most common bugs for PyTorch beginners: **gradients accumulate**. Every time you call `.backward()`, PyTorch *adds* the new gradients to whatever is already stored in `.grad`. It does NOT overwrite them. This is useful for some advanced techniques, but in a standard training loop you must **zero the gradients manually** before every backward pass.

9. Turning Off Gradient Tracking

During inference (when you're just making predictions, not training), you don't need gradients at all. Disabling gradient tracking saves memory and speeds up computation. There are two main ways to do this:

9a. torch.no_grad() Context Manager

9b. tensor.detach()

`.detach()` creates a new tensor that **shares the same data** but is completely disconnected from the computational graph. It's like making a copy that has no memory of how it was created.

10. The retain_graph Flag

By default, PyTorch **destroys the computational graph** after calling `.backward()` to free memory. If you need to call `.backward()` more than once on the same graph (rare, but it happens in techniques like computing higher-order gradients), you must tell PyTorch to keep it:

11. Higher-Order Gradients (Gradient of a Gradient)

Because autograd builds a regular computation graph, you can differentiate *through* the backward pass itself to get second derivatives (and beyond). This is used in techniques like MAML (Model-Agnostic Meta-Learning). Use `create_graph=True` to make the backward pass itself differentiable:

12. torch.autograd.grad — More Surgical Control

While `.backward()` populates `.grad` for *all* leaf tensors in the graph, `torch.autograd.grad()` lets you request the gradient of a specific output with respect to specific inputs. It returns a tuple of gradients and doesn't touch `.grad` at all — great for custom training logic.

13. Custom Autograd Functions

Sometimes you need an operation whose gradient is not natively defined in PyTorch — perhaps a custom activation function, or an operation that wraps a CUDA kernel. You can teach PyTorch how to differentiate it by subclassing `torch.autograd.Function` and defining both a `forward` and `backward` method.

14. Putting It All Together — Linear Regression From Scratch

Let's write a complete, minimal training loop using *only* autograd — no `nn.Module`, no optimizer — to really understand what is happening under the hood. We'll fit a line **y = 2x + 1** from noisy data.

Running this produces output like the following, showing the weights converging toward the true values of w=2 and b=1:

15. Connecting to nn.Module and Optimizers

In practice you use `nn.Module` and `torch.optim` instead of managing `requires_grad` and `zero_()` manually. Here's the same linear regression rewritten the "PyTorch way" — notice the loop is structurally identical, just with more convenient abstractions:

The only difference is ergonomics. Under the hood, `model.parameters()` returns tensors with `requires_grad=True`, `optimizer.zero_grad()` calls `.zero_()` on each one, and `optimizer.step()` applies the weight update. Autograd is doing the same work it always was.

16. Common Pitfalls and How to Avoid Them

17. Quick Reference Cheatsheet

Conclusion

Autograd is one of the most elegant pieces of engineering in modern deep learning. By silently recording every operation in a computational graph, PyTorch can differentiate through arbitrarily complex functions — from a two-parameter line to a billion-parameter language model — using nothing but the chain rule applied node by node. Now that you understand the machinery under the hood, you'll find debugging training loops far more intuitive and the jump to advanced topics like custom layers, meta-learning, and physics-informed networks much less steep.

Transfer Learning in NLP: Standing on the Shoulders of Giants

Haneesh — Wed, 22 Apr 2026 00:00:00 GMT

Imagine you want to become a chef. You could start from scratch — learning what fire is, how heat works, basic chemistry. Or you could start with knowledge that master chefs have already figured out, and focus on your specific recipes. **Transfer learning** is the second approach: borrowing intelligence from models trained on massive datasets, and adapting it to your specific problem. This post explains how transfer learning revolutionized NLP, why it works so well, and how to use it effectively.

Part 1 — The Old Way: Training from Scratch

Before transfer learning, every NLP project started from zero. Want to classify movie reviews? Train a model from scratch on your 10,000 reviews. Want to detect spam? Train from scratch on your emails. Want to answer questions? Train from scratch on your Q&A pairs.

This had three major problems:

Problem 1: You Need Massive Datasets

Deep learning models have millions of parameters. To train them well, you need millions of examples. But most real-world projects have thousands, not millions. Training from scratch on small datasets leads to severe overfitting — the model memorizes the training data but fails on new examples.

Problem 2: You Waste Compute

Training a language model from scratch takes weeks on expensive GPUs. Every project repeats this expensive process, even though they're all learning the same basic things: what words mean, how grammar works, how sentences relate to each other. It's like every chef learning from scratch that water boils at 100°C — wasteful duplication of effort.

Problem 3: You Learn Shallow Patterns

With limited data, models learn superficial patterns. A spam classifier might learn 'if email contains "free money", it's spam' — but miss deeper patterns like writing style, urgency markers, or social engineering tactics. These deeper patterns require massive datasets to learn.

Part 2 — The New Way: Transfer Learning

Transfer learning flips the script. Instead of starting from zero, you start with a model that's already been trained on billions of words. This model has already learned:

What words mean and how they relate to each other
Grammar and syntax patterns
Common phrases and idioms
Semantic relationships (synonyms, antonyms, analogies)
Context and how meaning changes based on surrounding words

You take this pretrained model and adapt it to your specific task. This is called **transfer learning** — transferring knowledge from one task (general language understanding) to another (your specific problem).

Part 3 — How Pretrained Models Work

Let's understand what happens when a model is 'pretrained'. The most common approach is called **masked language modeling**:

Masked Language Modeling

The model is shown billions of sentences with random words masked out, and learns to predict the missing words:

To predict the masked word, the model must understand:

**Context**: What words appear before and after
**Grammar**: What part of speech fits here (noun, verb, adjective)
**Semantics**: What meaning makes sense in this context
**World knowledge**: Common patterns and relationships

After training on billions of sentences, the model develops a rich internal representation of language. It hasn't just memorized words — it's learned the deep structure of how language works.

Sentence Transformers: Specialized for Similarity

**Sentence transformers** are pretrained models specifically trained to produce good sentence embeddings. They're trained using **contrastive learning**:

This training creates embeddings where semantic similarity = geometric proximity. Sentences that mean similar things end up close together in the 384-dimensional space.

Part 4 — Two Approaches: Feature Extraction vs Fine-Tuning

There are two main ways to use a pretrained model:

Approach 1: Feature Extraction (Frozen Encoder)

Use the pretrained model as a **frozen feature extractor**. You never update its weights — you just use it to convert text into embeddings, then train a small classifier on top.

**Pros:**

**Fast**: Only training a small classifier, not the entire encoder
**Low memory**: Don't need to store gradients for the encoder
**Works on CPU**: No need for expensive GPUs
**Can't overfit the encoder**: The pretrained weights stay perfect

**Cons:**

**Can't adapt encoder**: If your domain is very different from the pretraining data, you're stuck
**Slightly lower accuracy**: Fine-tuning usually gives 2-5% better accuracy

Approach 2: Fine-Tuning

Update the pretrained model's weights on your specific task. You start with the pretrained weights and continue training, but with a very small learning rate.

**Pros:**

**Best accuracy**: Usually 2-5% better than frozen features
**Adapts to your domain**: Can learn domain-specific patterns

**Cons:**

**Slow**: Training the entire encoder takes much longer
**Needs GPU**: Too slow on CPU
**High memory**: Need to store gradients for millions of parameters
**Can overfit**: With small datasets, you might make the encoder worse

Part 5 — Why Freezing Makes Sense

Let's dig deeper into why freezing the encoder is often the right choice, especially for small datasets.

Reason 1: The Encoder Is Already Excellent

The pretrained encoder was trained on billions of sentences. Your dataset has thousands. If you try to 'improve' it with your tiny dataset, you'll almost certainly make it worse. This is called **catastrophic forgetting** — the model forgets its general knowledge while memorizing your specific examples.

Reason 2: Computational Efficiency

Computing gradients through a transformer encoder is expensive. By freezing it, you:

**Compute embeddings once**: Convert all text to embeddings before training starts
**No backprop through encoder**: Only compute gradients for the small classifier
**Train on CPU**: The classifier is small enough to train without a GPU
**Iterate faster**: Training takes minutes instead of hours

Reason 3: Caching

With a frozen encoder, embeddings never change. You can compute them once and save to disk:

This is a huge time saver. Computing 15,000 embeddings takes 2-3 minutes. If you're experimenting with different classifier architectures, you'd waste hours recomputing the same embeddings. With caching, subsequent runs start instantly.

Part 6 — Using Sentence Transformers in Practice

Let's see a complete example of using sentence transformers for classification:

Choosing a Sentence Transformer Model

There are many pretrained sentence transformers. Here are the most popular:

Part 7 — Common Mistakes and How to Avoid Them

**Fine-tuning on tiny datasets**: With <5,000 examples, stick to frozen features. Fine-tuning will overfit.
**Not caching embeddings**: Computing embeddings takes minutes. Cache them to disk and reuse.
**Using the wrong model**: all-MiniLM-L6-v2 is for general text. For code, use code-specific models. For scientific text, use scientific models.
**Forgetting to normalize**: Some models require L2 normalization of embeddings. Check the model card.
**Comparing embeddings with wrong metric**: Use cosine similarity, not Euclidean distance, for sentence embeddings.
**Training the encoder on small data**: If you have <10,000 examples, don't fine-tune. You'll make it worse.

Part 8 — The Impact of Transfer Learning

Transfer learning has revolutionized NLP. Here's what changed:

Key Takeaways

**Transfer learning means starting with pretrained knowledge** instead of training from scratch.
**Pretrained models learned from billions of sentences** and understand deep language patterns.
**Two approaches**: Feature extraction (frozen encoder) and fine-tuning (update encoder).
**Start with frozen features**: Faster, works on CPU, can't overfit the encoder.
**Only fine-tune if**: You have 10,000+ examples, a GPU, and need that extra 5% accuracy.
**Cache embeddings**: Compute once, save to disk, reuse forever.
**all-MiniLM-L6-v2 is a great default**: Small, fast, high-quality embeddings.
**Transfer learning democratized NLP**: Anyone can now build state-of-the-art systems.

What Is a Tensor? A Beginner's Guide with Real Examples

AI Educator — Wed, 22 Apr 2026 00:00:00 GMT

If you've ever opened a PyTorch tutorial and immediately hit the word **tensor**, you're not alone. It sounds intimidating — like something from a physics textbook. But here's the truth: a tensor is just a container for numbers, organised in a grid. That's it. Once that clicks, everything else in deep learning becomes a lot less scary.

1. Start With What You Already Know

Before we define a tensor, let's look at things you already know — because a tensor is just a generalisation of all of them.

A single number — a Scalar

A **scalar** is just one number. No grid, no list. Things like your age, the temperature outside, or the price of a coffee. Examples: `42`, `3.14`, `-7`.

A list of numbers — a Vector

A **vector** is a row (or column) of numbers. Think of a week's worth of temperatures: `[22, 24, 19, 17, 25, 28, 23]`. Or the three RGB colour values of a pixel: `[255, 128, 0]`.

A table of numbers — a Matrix

A **matrix** is a 2-D grid of numbers — rows and columns. A spreadsheet is a matrix. A grayscale photo is a matrix (each cell holds the brightness of one pixel).

Multiple tables stacked — a Tensor

A **tensor** is the general term for *any* of the above, plus the idea that you can keep stacking dimensions. A colour photo is three matrices stacked (one for Red, one for Green, one for Blue). A batch of 32 colour photos is 32 of those stacked. That's a tensor.

2. The Shape — The Most Important Property

Every tensor has a **shape** — a tuple that tells you how many elements exist along each dimension. Learning to read shapes fluently is the single most useful skill when debugging deep learning code.

3. Real-World Examples of Each Dimension

Abstract shapes become much easier to understand when you map them to something concrete. Here's how tensors appear in real machine learning tasks:

4. Creating Tensors in PyTorch

There are several ways to create tensors. The right choice depends on whether you already have data or just need a tensor of a certain size to start with.

5. Data Types (dtype)

All elements in a tensor must be the same **data type** (`dtype`). The most common ones you'll encounter are:

6. Basic Operations

Tensors support all the arithmetic you'd expect. Most operations work **element-wise** — meaning they're applied to each number independently, in the same position.

7. Broadcasting — When Shapes Don't Match

What happens when you try to add a shape `(3,)` tensor to a shape `(2, 3)` tensor? PyTorch uses a rule called **broadcasting** to "stretch" the smaller tensor to match the larger one — without actually copying data. It sounds confusing but follows a simple rule: *dimensions are aligned from the right, and any dimension of size 1 (or missing) gets repeated to match.*

8. Reshaping Tensors

The *data* in a tensor is stored as a flat list of numbers in memory. The **shape** is just a description of how to interpret that flat list as an N-dimensional grid. Reshaping changes the interpretation without moving any data — it's essentially free.

9. Indexing and Slicing

Indexing a tensor works just like indexing a NumPy array or a nested Python list. You can grab a single element, a row, a column, or any sub-region you like.

10. Moving to the GPU

One of the biggest reasons to use tensors instead of plain NumPy arrays is that tensors can live on a **GPU**, where thousands of cores can process them in parallel. Moving a tensor to the GPU is a single line.

11. Tensors and NumPy — Two Sides of the Same Coin

If you come from a data science background you've probably used **NumPy** arrays. PyTorch tensors and NumPy arrays are closely related — they can share the same underlying memory block, so converting between them is essentially free (as long as the tensor is on the CPU).

12. Everything Together — A Quick Mental Model

Here's a worked example that ties everything together. We'll represent a tiny batch of two colour images, poke around its shape, do some operations, and see how it would flow into a neural network:

Quick Reference Cheatsheet

Conclusion

A tensor is nothing more than an N-dimensional grid of numbers. Scalars, vectors, and matrices are all just tensors with fewer dimensions. Once you're comfortable reading shapes and thinking about dimensions, you'll find that most PyTorch code is just shuffling tensors into the right shape and multiplying them together. Every image, every word, every prediction, every loss value — it's all tensors all the way down.

From Words to Intelligence: Building an MLP Classifier on Pretrained Sentence Embeddings

AI Educator — Mon, 20 Apr 2026 00:00:00 GMT

Imagine you want to teach a computer to understand what someone means when they type a sentence — not just match keywords, but actually *understand*. A phrase like *'book me a flight'* and *'reserve a plane ticket'* mean exactly the same thing, yet share almost no words. Classic approaches like [TF-IDF](/blog/tfidf-logistic-regression-baseline) fail here completely. In this post, we'll build a system that genuinely handles this, by combining **pretrained sentence embeddings** with a **multi-layer perceptron (MLP)** built in PyTorch. Along the way, we'll unpack every building block from scratch: what embeddings are, why hidden layers matter, how BatchNorm and Dropout prevent failure modes, why Adam beats plain SGD, and how early stopping keeps your model honest.

Part 1 — Why TF-IDF Has a Ceiling

Before we talk about what we're building, let's understand what we're replacing and *why*. TF-IDF (Term Frequency–Inverse Document Frequency) represents a sentence as a sparse vector where each dimension corresponds to a word in the vocabulary. A sentence with 10,000 possible vocabulary words becomes a vector of 10,000 numbers, most of which are zero.

The problems are fundamental, not incidental. First, **TF-IDF is completely blind to meaning**. The words 'book', 'reserve', and 'schedule' all have completely different TF-IDF dimensions, so the model has no idea they're related. Second, **word order is lost entirely**. 'Dog bites man' and 'Man bites dog' produce identical TF-IDF vectors. Third, **unseen words are invisible**. If a user types a word not in your training vocabulary, it vanishes. A logistic regression on top of TF-IDF can only draw straight lines through this broken space — its accuracy ceiling is around 78% on hard intent datasets.

Part 2 — Pretrained Sentence Embeddings: Borrowed Intelligence

A sentence embedding is a dense vector — typically 384 or 768 numbers — that captures the *meaning* of a sentence, not just its words. Think of it as a GPS coordinate in meaning-space: sentences that mean similar things end up close together, regardless of the exact words used.

The model we'll use is **all-MiniLM-L6-v2**, a small but highly capable sentence transformer from the `sentence-transformers` library. 'MiniLM' means it's a distilled (compressed) version of a larger model. 'L6' means it has 6 transformer layers. 'v2' is the second version. It produces **384-dimensional embeddings**, weighs only ~80MB, and runs fast even on CPU. It was trained using a technique called *contrastive learning* — pushed to make semantically similar sentences close together in the 384-dimensional space.

Why We Freeze the Encoder

Freezing means we don't compute gradients through the sentence transformer — its weights stay exactly as they were when we downloaded it. There are two strong reasons for this. First, **the encoder is already excellent**: it was trained on billions of sentences; we have ~15,000. Fine-tuning it on such a small dataset would make it *worse*, not better (a phenomenon called catastrophic forgetting). Second, **it's dramatically cheaper**: computing gradients through 6 transformer layers is expensive. By freezing it, our training loop only needs to update the small MLP weights — which is fast even on CPU.

Part 3 — Caching Embeddings: Pay Once, Use Forever

Encoding 15,000 sentences through a transformer takes a few minutes. If you recompute embeddings on every training run, you're wasting that time every single time. Since the encoder is frozen, the embeddings *never change* — so we compute them once and save them to disk.

Part 4 — nn.Module: The Blueprint for Every PyTorch Model

Every neural network in PyTorch is a subclass of `nn.Module`. Think of `nn.Module` as a smart container that does several important things automatically: it tracks all the learnable parameters in your model so the optimizer can find them, it handles switching between training and evaluation modes, and it lets you save and load the entire model state with a single call.

The basic pattern has two parts: `__init__` (where you define your layers) and `forward` (where you describe how data flows through them). Here's the minimal skeleton:

Part 5 — Why We Need Hidden Layers (The Limits of Linearity)

A logistic regression — or a neural network with no hidden layers — can only draw **straight lines** (or flat hyperplanes in high dimensions) to separate classes. This works fine when classes are linearly separable, but real data almost never is.

Think about the classic XOR problem. Four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single straight line can separate the 0s from the 1s. But a hidden layer can draw *two* lines and combine them — suddenly XOR is solvable. The same principle applies to intent classification: the boundaries between 151 intent classes in 384-dimensional embedding space are curved and complex. Hidden layers give the model the power to learn those curves.

ReLU: The Nonlinearity That Makes It All Work

Here's the catch: stacking multiple `nn.Linear` layers *without anything in between* is mathematically equivalent to a single linear layer. No matter how many layers you stack, a linear-of-linear is still linear. You need **nonlinear activations** between layers to break this equivalence.

The most popular activation today is **ReLU** (Rectified Linear Unit): `f(x) = max(0, x)`. It's dead simple — if the input is positive, pass it through unchanged; if negative, output zero. Despite its simplicity, ReLU has several advantages over older activations like sigmoid or tanh: it doesn't saturate for positive inputs (avoiding the vanishing gradient problem), it's computationally trivial, and empirically it trains faster.

Part 6 — Batch Normalization: Keeping Activations Well-Behaved

As data flows through many layers of a neural network, the distribution of activations (the numbers at each layer) can drift wildly — some layers might produce values in the thousands, others near zero. This makes training unstable: the gradients become tiny (vanishing) or enormous (exploding), and the model struggles to learn.

**Batch Normalization** solves this by normalizing the activations *within each mini-batch* — it subtracts the batch mean and divides by the batch standard deviation, so the activations have approximately mean=0 and variance=1. Then it applies learned scale (γ) and shift (β) parameters to let the model restore any distribution it needs.

Part 7 — Dropout: Regularization Through Controlled Chaos

A neural network trained long enough on a fixed dataset will start to **overfit** — it memorizes the training examples rather than learning generalizable patterns. Its training accuracy climbs toward 100%, while validation accuracy stagnates or falls.

**Dropout** is a remarkably simple fix: during each training step, randomly zero out a fraction of the neurons in a layer. With `dropout=0.3`, each neuron has a 30% chance of being silenced on any given forward pass. The remaining active neurons are scaled up to compensate, so the expected sum stays constant.

Part 8 — Building the MLP: Putting Layers Together

Now we combine all the pieces. Our MLP takes a 384-dimensional embedding as input and outputs logits for 151 classes. Between input and output, we stack blocks of `[Linear → BatchNorm → ReLU → Dropout]`. The key design decision is making the number of hidden layers **dynamic** — driven by a config list like `[256, 128]` — so we can experiment without rewriting code.

Part 9 — The Adam Optimizer: Smarter than SGD

Training a neural network means finding the weights that minimize the loss. We do this by **gradient descent**: compute the gradient of the loss with respect to every weight, then nudge each weight in the opposite direction of its gradient. The plain version of this is **SGD (Stochastic Gradient Descent)** — every parameter gets the same learning rate, every update.

**Adam (Adaptive Moment Estimation)** is a smarter optimizer that gives each parameter its own effective learning rate, adapted based on the history of its gradients. It tracks two things for each parameter: the **first moment** (running average of the gradient — like a velocity in that direction) and the **second moment** (running average of the squared gradient — how volatile has this parameter's gradient been?). Parameters with large, consistent gradients get smaller effective learning rates; parameters with small or noisy gradients get larger effective learning rates.

Weight Decay: L2 Regularization Built Into Adam

The `weight_decay` parameter in Adam adds a small penalty proportional to the magnitude of each weight. Mathematically, it adds `λ * ||w||²` to the loss, where λ is the weight_decay value. This penalizes very large weights, discouraging the model from over-relying on any single feature. Think of it as encouraging the model to spread its 'bets' across many features rather than putting all its weight on a few. A value of `1e-4` (0.0001) is a gentle nudge — large enough to matter, small enough not to overwhelm the signal.

Part 10 — CrossEntropyLoss: The Right Loss for Classification

For a classification problem with multiple classes, we use **CrossEntropyLoss**. It measures how well the model's predicted probability distribution matches the true label. Internally, PyTorch's `nn.CrossEntropyLoss` does three things in one: applies log-softmax to convert raw logits into log-probabilities, selects the log-probability for the correct class, and negates it (so minimizing the loss = maximizing confidence in the correct class).

Part 11 — Early Stopping: Knowing When to Quit

Training a model longer doesn't always make it better. After a certain point, the training loss keeps decreasing (the model is memorizing training data) but the validation accuracy plateaus or falls. This is **overfitting**. If you stop training at the wrong epoch, you get a model that performs great on training data and poorly on new data.

**Early stopping** monitors validation accuracy after each epoch. If it improves, we save the model weights and reset a counter. If it doesn't improve for `patience` consecutive epochs, we stop training and restore the best weights we ever saw. This way we automatically find the sweet spot without having to guess the right number of epochs.

Part 12 — The Complete Training Loop

Now let's put everything together into a complete `MLPTrainer` class. A training loop has a repeating structure: for each epoch, shuffle the training data, slice it into mini-batches, do forward→loss→backward→step for each batch, then evaluate on the validation set.

Part 13 — The Full Pipeline: From Raw Text to Predictions

Now let's wire everything together in a runnable script. The pipeline is: load data → encode to embeddings (with caching) → train MLP → evaluate → plot curves → report results.

Part 14 — Writing Tests That Actually Catch Bugs

Good tests for neural networks don't just check that the code runs — they check that the *math is right*. Here are the key things worth testing and why each one matters:

Part 15 — Understanding the Results: Why >90% Accuracy?

When you run this pipeline on CLINC150 (a 150-class intent dataset with ~15,000 sentences), you should expect test accuracy above 90%. This is a dramatic jump from the ~78% of TF-IDF + logistic regression. The gain comes almost entirely from the embeddings, not the MLP architecture. The sentence transformer has already done the hard work of mapping synonymous phrases to nearby points in 384-dimensional space — the MLP just needs to draw decision boundaries between well-separated clusters.

Part 16 — Debugging Checklist: When Accuracy Is Below 90%

If your accuracy is unexpectedly low, work through these in order — most issues reduce to one of these five causes:

**Forgot model.eval() during validation** — This is the #1 culprit. Dropout stays active in train mode and randomly zeros neurons during your validation forward pass. BatchNorm uses noisy batch statistics instead of stable running stats. Your 'validation accuracy' becomes meaningless. Fix: always call model.eval() before any evaluation code.
**Not restoring best weights** — If early stopping fires but you forget load_state_dict(best_state), you evaluate the *last* epoch's weights, not the best epoch's. The last epoch is often overfit. Fix: always restore best_state after the training loop.
**Learning rate too high or too low** — Too high (>1e-2 for Adam): loss oscillates wildly and never converges. Too low (<1e-5): training effectively doesn't happen. The default 1e-3 is well-tested for Adam on this type of problem.
**Hidden dimensions too small** — With 151 output classes, you need enough representational capacity in the hidden layers. [256, 128] is appropriate. [32] is too small to separate 151 classes reliably.
**Embeddings not cached correctly** — If the cache isn't working and you're accidentally re-encoding with a different random seed or batch size (which shouldn't matter for this model, but can cause subtle bugs), verify the cache file is being loaded with print statements.

Key Concepts: Quick Reference

Conclusion

We've covered a lot of ground. Starting from the limitations of TF-IDF, we built a complete pipeline that uses a pretrained sentence transformer to encode meaning into dense vectors, then trains a multi-layer perceptron to classify intents with over 90% accuracy. Every piece — BatchNorm for training stability, Dropout for generalization, Adam for fast convergence, early stopping for finding the optimal epoch — plays a specific role in making the system robust.

The most important insight is about the division of labour: the sentence transformer does the heavy lifting of understanding language (trained on billions of sentences, frozen, never updated), and the MLP does the lighter work of learning decision boundaries in that well-structured representation space. This separation is the foundation of the modern practice of **transfer learning** in NLP — and the same pattern (frozen pretrained encoder + small trained head) is used in production systems at Google, Meta, and virtually every company doing serious NLP work.

Logistic Regression from Scratch in PyTorch: Every Line Explained

Haneesh — Sun, 19 Apr 2026 00:00:00 GMT

In the last post we looked at **TF-IDF + Logistic Regression** using sklearn — a single `fit()` call and you're done. That's great for shipping, terrible for learning. You end up with a model that works, and no idea *why*. This post builds the same classifier from scratch in PyTorch — no `nn.Linear`, no `nn.CrossEntropyLoss`, no `optim.SGD`. Every weight, every gradient, every update is spelled out by hand.

We'll keep the same running example: **CLINC150 intent classification**. A user types *"book me a flight to Tokyo"* and we need to pick one of 151 intent labels (150 real intents plus an out-of-scope bucket). Features are a ~10,000-dim TF-IDF vector, so the numbers we'll be quoting are real.

The big picture

Strip away the ceremony and logistic regression does exactly this: take a feature vector `x`, compute a score for each class (that's the `W @ x + b` you've seen a thousand times), turn scores into probabilities via softmax, and pick the argmax. Training is the process of nudging `W` and `b` until the correct class usually has the highest score.

Three shapes to keep in your head as we go:

Step 1 — The config

A `@dataclass` is Python's shortcut for classes that are really just bundles of values — it auto-generates `__init__`, `__repr__`, and equality checks. Think of it as a named tuple with type hints.

The hyperparameters, one at a time. If you want the beginner-friendly version of these ideas first, read [**ML Hyperparameters Explained for Beginners**](/blog/ml-hyperparameters-explained-beginners):

**`lr` (learning rate)** — how big a step to take when updating weights. `0.1` is aggressive, `0.001` is gentle. Too big and you overshoot the minimum; too small and training crawls.
**`epochs`** — one epoch is one full pass through the training data. `epochs=100` means every training example is seen 100 times.
**`batch_size`** — how many examples to process before each weight update. Bigger batches give smoother, more accurate gradients; smaller batches update faster and add useful noise.
**`l2_lambda`** — penalty on large weights, to prevent overfitting. More on this below.
**`seed`** — freezes randomness. Same seed = same run, every time. Absolutely critical for debugging.

Step 2 — The weights (and why initialization matters)

`W` has shape `(n_features, n_classes)`. For CLINC150 that's roughly 10,000 × 151 ≈ **1.5 million parameters**. Each *column* of `W` is conceptually the "prototype" for one class. When a new input `x` comes in, `x @ W` computes a dot product between `x` and every class prototype — 151 similarity scores in one matrix multiply.

Three more details worth noting. The **bias starts at zero** — we have no prior reason to prefer any class, so flat bias is the honest default. The **`generator=gen`** bit wires in our seeded RNG so the initialization is reproducible. And **`requires_grad_(True)`** is the flag that says "PyTorch, please track every operation touching this [tensor](/blog/what-is-a-tensor) so you can compute gradients later." Without it, `loss.backward()` silently does nothing. (Learn more about [how autograd tracks operations](/blog/pytorch-autograd-deep-dive).)

Step 3 — The forward pass

The `@` operator between [tensors](/blog/what-is-a-tensor) is matrix multiplication. If `X` is `(256, 10000)` and `W` is `(10000, 151)`, then `X @ W` is `(256, 151)` — one row per input example, 151 class scores per row.

Adding `b` (shape `(151,)`) to a `(256, 151)` matrix uses **broadcasting**: PyTorch virtually replicates `b` across all 256 rows without copying memory. The output is called **logits** — raw, unnormalized scores. Logits can be any real number. A big positive logit for class 5 means "this input strongly looks like class 5." A very negative logit means "this input really doesn't look like class 5."

Step 4 — Softmax: turning scores into probabilities

To turn logits into actual probabilities (positive, summing to 1), apply softmax:

softmax(x_i) = exp(x_i) / Σ exp(x_j)

Two things happen: `exp` makes everything positive (since e^x > 0 for any real x), and dividing by the sum normalizes to 1. Softmax also preserves ordering — the biggest logit becomes the biggest probability.

`torch.log_softmax` computes `log(softmax(x))` directly using the **log-sum-exp trick**: subtract `max(x)` before exponentiating. Mathematically the constant cancels out; computationally, the largest `exp` term becomes `exp(0) = 1` and everything else is between 0 and 1. No overflow, ever.

log_softmax(x_i) = x_i − max(x) − log(Σ exp(x_j − max(x)))

Step 5 — Cross-entropy loss

Cross-entropy is the standard loss for classification, and the intuition is simple: if the model assigns probability 0.9 to the correct class, you're happy; if it assigns 0.001, you're sad. The loss function `-log(p)` has exactly this shape:

So for each training example, we want to compute `-log(p_correct_class)` and average over the batch.

Why go via `log_softmax` and then index, instead of computing `softmax`, indexing, then taking `log`? **Numerical stability.** Staying in log-space means tiny probabilities like `1e-30` don't underflow to zero.

Step 6 — L2 regularization

Left unchecked, the model will learn huge weights to memorize the training set, then fail miserably on new data. This is **overfitting**. L2 regularization prevents it by adding a penalty proportional to the sum of squared weights:

The total loss becomes `cross_entropy + λ · Σ W²`. The optimizer now has two pressures: reduce the cross-entropy (fit the data) *and* keep weights small (stay simple). Lambda controls the tradeoff — too small and regularization does nothing, too big and the model underfits because every weight is squeezed toward zero.

Step 7 — The training loop

Now the heart of it. Every epoch we shuffle, then iterate over mini-batches. For each batch we run the five-step cycle that is the beating heart of essentially all deep learning:

Why shuffle every epoch?

If the data is sorted by class (all class 0 first, then class 1, etc.), the model would train on one class for ages, forget the previous one, and oscillate forever. Shuffling guarantees each batch sees a random mix. `torch.randperm(N)` gives a random permutation of `[0 .. N-1]` and `X_train[perm]` reorders the rows accordingly.

Why mini-batches?

Mini-batches hit the sweet spot — stable enough to converge, small enough to step often, small enough to fit on a GPU. `min(start + batch_size, N)` handles the final batch cleanly when `N` doesn't divide evenly.

Step 8 — What `loss.backward()` actually does

This is the part that feels like magic until you know. When you called `X @ W`, PyTorch silently recorded "matmul, with these inputs" on a hidden **computation graph**. Same for `log_softmax`, the indexing, the `.mean()`. Every operation on a `requires_grad=True` [tensor](/blog/what-is-a-tensor) adds a node to this graph. (Deep dive: [Understanding PyTorch's Autograd](/blog/pytorch-autograd-deep-dive).)

`loss.backward()` walks that graph in reverse, applying the **chain rule** from calculus at each node, all the way back to the tensors with `requires_grad=True`. The final gradients land in `W.grad` and `b.grad`, which have the same shapes as `W` and `b`.

You never write a derivative yourself — PyTorch ships with the derivative of every [tensor](/blog/what-is-a-tensor) operation built in. That's why this framework took over. [Explore autograd internals](/blog/pytorch-autograd-deep-dive).

Step 9 — Gradient descent: the update

The gradient points in the direction of **steepest increase** of the loss. We want to *decrease* the loss, so we step in the opposite direction. That's literally the entire idea of gradient descent:

W_new = W_old − lr · (∂loss / ∂W)

Two tricky details in the code worth pausing on. The `with torch.no_grad():` block tells PyTorch "don't track these operations" — otherwise the update itself becomes part of the graph, creating a recursive mess. And `.data` modifies the underlying tensor values directly, without breaking autograd's bookkeeping.

Step 10 — Why you MUST zero the gradients

Why does PyTorch accumulate instead of replacing? Because sometimes you *want* accumulation — for example, gradient accumulation across several small batches to simulate a larger effective batch size when memory is tight. The framework gives you flexibility and demands you handle the bookkeeping.

Step 11 — Prediction

Two optimizations here that matter in production. First, `torch.no_grad()` skips building the computation graph — no autograd bookkeeping, less memory, faster inference. Second, **we don't compute softmax at all**. Since softmax is monotonic (bigger logit ⇒ bigger probability), `argmax(logits) == argmax(softmax(logits))`. Save yourself the exp, the sum, and the division.

See it for yourself: a 20-line debug script

The best way to solidify any of this is to run training on a tiny toy dataset and watch the numbers move. Weights change, gradients shrink, loss drops. You can't unsee it.

Run it and you'll see the loss drop from around 1.1 (random guessing for 3 classes ≈ `-log(1/3)` = 1.1) toward something small. You'll also see `|grad|` shrinking — as the model approaches a good solution, there's less and less to correct.

Putting it all together

Zoom out and the entire arc of training is this: **start with random weights. Predict. Measure wrongness. Use autograd to find which direction to nudge each weight. Take a small step. Repeat thousands of times.** Slowly, the columns of `W` fill in useful patterns — one column comes to represent "flight-booking vocabulary," another "weather-query vocabulary," and so on — and the loss drops.

The beautiful thing about this implementation is that every step is visible. There's no `nn.Module` hiding the parameters, no `optim.SGD` hiding the update rule, no `CrossEntropyLoss` hiding the log-softmax. Once you've written this, you know exactly what every library shortcut is doing underneath — and when something breaks in a bigger model, you'll have the vocabulary to debug it.

Takeaways

**Logistic regression = linear scores + softmax + argmax.** Training = nudging the linear scores until the argmax matches the label.
**Logits are unnormalized scores.** Softmax only exists to make them into probabilities; for prediction, argmax on logits is equivalent and cheaper.
**Use `log_softmax`, never raw softmax.** The log-sum-exp trick is the difference between training that works and training that silently explodes.
**Cross-entropy punishes confident wrong answers.** `-log(p_correct)` is huge when p is tiny, zero when p is one.
**L2 regularization shrinks weights** to prevent overfitting. Don't regularize the bias.
**The five-step cycle is universal.** Forward, loss, backward, update, zero — every neural network you ever train follows it.
**`loss.backward()` is not magic** — it's the chain rule replayed over a recorded computation graph. Autograd does the bookkeeping; you do the modeling.
**Zero your gradients.** You will forget this once. Then never again.

ML Hyperparameters Explained for Beginners: Learning Rate, Epochs, Batch Size, L2, and Seed

Haneesh — Sun, 19 Apr 2026 00:00:00 GMT

If you are just starting machine learning, words like **learning rate**, **epochs**, **batch size**, **regularization**, and **seed** can feel technical very quickly. But the ideas behind them are actually simple. They are just settings you choose before training starts, and they control *how* the model learns.

This post explains five common ML hyperparameters in the simplest possible way: **`lr`**, **`epochs`**, **`batch_size`**, **`l2_lambda`**, and **`seed`**. We will also explain every related term we use, so nothing feels like hidden jargon.

Before the hyperparameters: a few words you must know

A **machine learning model** is a system that learns patterns from data and then uses those patterns to make predictions. A prediction could be something like "spam or not spam," "house price," or "which category this text belongs to."

A **dataset** is a collection of examples. Each example usually has an **input** and a correct **output**. For example, if we are predicting exam results, an input might be `hours studied = 5`, and the output might be `passed = yes`.

A model learns by adjusting internal numbers called **parameters**. In many models, the most important parameters are called **weights**. A weight is just a number inside the model that controls how strongly the model reacts to some pattern in the input.

There is an important difference between **parameters** and **hyperparameters**. Parameters are learned by the model during training. Hyperparameters are chosen by you before training begins.

Think of cooking. The food changing while it cooks is like the model's parameters changing during training. The oven temperature and cooking time are like hyperparameters: you choose them before the cooking starts.

1) Learning rate (`lr`)

The **learning rate** tells the model how big a step to take when it updates its weights.

To understand that sentence, we need two more words: **error** and **update**. Error means the difference between the model's prediction and the correct answer. An update means changing the model's weights to try to reduce that error.

Suppose the correct answer is `10`, but the model predicts `7`. The model is wrong. Training tries to reduce that wrongness by changing the weights a little. The learning rate decides whether that change should be big or small.

Learning rate = step size while learning

A large learning rate means the model takes bigger jumps. A small learning rate means the model takes smaller, gentler steps.

A simple analogy: imagine you are trying to stand exactly on a line painted on the floor. If you take huge jumps, you may keep crossing past the line. If you take tiny steps, you move safely but slowly. The learning rate controls that step size.

If the learning rate is too high, training can become unstable. If it is too low, training can become painfully slow.

2) Epochs

An **epoch** is one full pass through the entire training dataset.

Suppose your training dataset has 100 examples. If the model sees all 100 examples once, that is 1 epoch. If it sees all 100 examples again, that is 2 epochs. So `epochs = 100` means the model goes through the full training data 100 times.

A good beginner analogy is flashcards. If you have 20 flashcards and review all 20 once, that is one pass. Review all 20 again, that is another pass. In ML, each full pass is called an epoch.

Epoch = one complete pass through all training examples

Why do we need multiple epochs? Because the model usually does not learn everything from one pass. It often needs to see the same data many times to slowly improve its weights.

When the model has not learned enough, that is called **underfitting**. When it memorizes the training data too much and performs poorly on new data, that is called **overfitting**.

3) Batch size (`batch_size`)

The **batch size** is how many training examples the model processes before it updates the weights.

Suppose you have 100 training examples and `batch_size = 10`. That means the model looks at 10 examples, computes how wrong it was on those 10, updates the weights, then moves to the next 10.

Each small group of examples is called a **batch**. So if you have 100 examples and a batch size of 10, you will have 10 batches in one epoch.

Why not use all examples at once every time? Sometimes you can, but smaller groups are often more practical. They use less memory and allow the model to update more often.

You will also hear the word **gradient** here. A gradient is information that tells the model which direction to change the weights, and roughly how strongly to change them.

A simple analogy: asking 2 people for feedback on a product gives a noisy opinion. Asking 200 people gives a more stable average. Small batches are like asking a few people. Large batches are like asking many people.

4) L2 regularization (`l2_lambda`)

L2 regularization adds a penalty when the model's weights become too large.

To understand why that matters, remember overfitting: sometimes a model becomes too eager to match the training data exactly. One sign of this can be very large weights. Large weights can make the model too sensitive, so tiny input changes produce very large output changes.

Regularization means adding a rule that says: "fit the data, but also try to stay simple." In L2 regularization, staying simple usually means preferring smaller weights.

L2 regularization = penalty on large weights

The value `l2_lambda` controls how strong that penalty is. A small `l2_lambda` means a weak penalty. A large `l2_lambda` means a strong penalty.

A beginner analogy: imagine packing a bag for school. You want enough things to do the job, but not so many that the bag becomes heavy and messy. L2 regularization is like a rule that discourages carrying too much weight unless it is truly needed.

5) Seed

A **seed** is a starting number used to control randomness in a program.

Machine learning often involves randomness. For example, the model's starting weights may be random. The training examples may be shuffled randomly. Some algorithms may randomly sample data during training.

If you do not fix the seed, two runs of the same code can produce slightly different results. If you do fix the seed, the results become much more repeatable.

Same seed = same randomness pattern

This matters a lot for **debugging** and **reproducibility**. Debugging means finding out why something is wrong. Reproducibility means being able to run the same experiment again and get the same result.

A simple analogy is shuffling a deck of cards. Without a seed, every shuffle is different. With a fixed seed, you can make the shuffle happen in the same way every time.

One tiny example putting all five together

Imagine we are training a model to predict whether a student will pass an exam.

**`lr = 0.01`** means the model changes its weights with moderately small steps
**`epochs = 100`** means the model sees the full training dataset 100 times
**`batch_size = 32`** means it processes 32 examples before each weight update
**`l2_lambda = 0.001`** means it applies a small penalty to very large weights
**`seed = 42`** means the random parts of training are made repeatable

None of these numbers are magic by themselves. They are settings you tune based on the problem, the dataset, and the model. But understanding what each one *does* is the first step toward making good choices.

Quick summary table

Final intuition

Think of training like practicing basketball shots.

**Learning rate** = how much you change your shooting style after each miss
**Epochs** = how many full practice rounds you do
**Batch size** = how many shots you watch before deciding what to adjust
**L2 regularization** = avoiding wild, extreme movements that only work for a few cases
**Seed** = making the practice setup repeatable so you can compare sessions fairly

These are some of the most common machine learning basics you will see in tutorials, research code, and production systems. Once these ideas click, many training loops stop looking mysterious.

Takeaways

**Hyperparameters are settings chosen before training.** The model then learns within those rules.
**Learning rate controls step size.** Big steps can be unstable; tiny steps can be slow.
**Epochs tell you how many times the model sees the full training data.**
**Batch size controls how many examples are used before each update.**
**L2 regularization helps prevent overfitting by discouraging very large weights.**
**Seed helps make experiments repeatable, which is critical for debugging and fair comparison.**

TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First

Haneesh — Sun, 19 Apr 2026 00:00:00 GMT

Before you reach for a big neural network or an LLM for text classification, try the boring thing first. In my intent-routing project, an 8B parameter LLM (granite3.3:8b) landed at **72.19% accuracy** on the CLINC150 benchmark — respectable, but slow. The next question is almost rude: *can a model from 1995 beat it?*

This post walks through the classical baseline — **TF-IDF + Logistic Regression** — the way I built it. No PyTorch, no GPU, no transformers. Just sklearn, a few hundred lines of code, and an answer in under a second per query.

The problem: intent classification

Given a short user utterance like *"cancel my flight to Paris"*, predict which of 150 intents it belongs to (book_flight, cancel_reservation, weather, etc.). CLINC150 has 150 intents spread across 10 domains — banking, travel, small talk, work, and so on — plus an **out-of-scope (OOS)** bucket for things the system shouldn't try to answer.

Idea 1: TF-IDF — turning text into numbers

Machine learning models don't eat text. They eat numbers. **TF-IDF** is one of the oldest ways to turn text into numbers, and it's built on two simple intuitions:

**TF (Term Frequency)** — how often does a word appear in *this* document? A word that shows up four times is probably more important than a word that shows up once.
**IDF (Inverse Document Frequency)** — how *rare* is the word across *all* documents? Words like 'the' and 'my' appear everywhere, so they're useless for distinguishing documents. Words like 'refund' appear in a specific context, so they're valuable signal.

Multiply them together and you get a score that is high when a word is *frequent here but rare elsewhere* — exactly the words that make a document distinctive.

TF-IDF(t, d) = TF(t, d) × log(N / df(t))

Idea 2: Logistic Regression — drawing lines in high dimensions

Despite the name, logistic regression is a **classifier**, not a regressor. Given a vector of features (our TF-IDF vector), it learns a set of weights for each class and produces a probability distribution over classes. For 150 intents, it learns 150 weight vectors — one per class — and picks the class with the highest score.

Why logistic regression and not something fancier? Three reasons: it trains in seconds, it handles high-dimensional sparse inputs (like TF-IDF) beautifully, and its predictions are essentially free at inference time — a dot product per class.

Putting it together with sklearn Pipeline

The sklearn `Pipeline` lets you glue preprocessing and modeling into a single object. This matters for one reason above all: **you can't accidentally train on test data**, because the whole thing trains and predicts as one unit.

The knobs that matter

Don't guess at these. Let `GridSearchCV` search the space for you. It runs cross-validation across every combination and reports the winner.

Measuring latency honestly

A common trap: benchmarking the batched prediction (`predict` on 1000 items at once) and calling that your latency number. Real inference is often one query at a time. Measure p50 *and* p95, and warm up the pipeline first so you don't measure JIT overhead.

LLM vs TF-IDF: the surprising scoreboard

Where TF-IDF breaks (and why you'll still want embeddings)

TF-IDF is a bag of words. It has no idea that 'cancel' and 'terminate' mean the same thing, or that 'what time is it' and 'do you have the time' are paraphrases. The model has to see the *exact words* during training to learn them. Three concrete failure modes:

**Synonyms** — 'cancel my flight' and 'terminate my booking' share almost no vocabulary but are the same intent. TF-IDF can't bridge that gap.
**Paraphrases** — 'how cold is it outside' vs 'current temperature please' have no content words in common. A human gets it instantly; TF-IDF doesn't.
**Word order** — 'transfer from checking to savings' vs 'transfer from savings to checking' are the *opposite* operation but produce identical bag-of-words vectors.

Error analysis: learn from your confusions

After training, don't just stare at the accuracy number. Look at the **confusion matrix** and find the most-confused class pairs. Print a few misclassified examples from the worst pair and *read them*. You'll discover patterns — maybe two intents genuinely overlap, maybe the labels are noisy, maybe one class needs more training data. This is where intuition is built, not on dashboards. For a deeper dive into evaluation metrics like F1 scores, precision, recall, and latency percentiles, check out [**Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness).

Compute the confusion matrix from your predictions.
Find the top 10 off-diagonal cells with the highest counts.
For the worst pair, print 5 misclassified examples side-by-side.
Ask: is the model wrong, or are the labels wrong?

When to use this baseline

Takeaways

**Always build the classical baseline first.** It tells you what 'good' looks like before you burn GPU hours on neural models.
**TF-IDF + LogReg is a bag-of-words model.** It can't handle synonyms, paraphrases, or word order — but for short utterances with enough training data, it's shockingly strong.
**Measure latency honestly** — p50 and p95, one query at a time, with warmup.
**Error analysis beats metrics.** Read the misclassifications. That's where intuition lives.
**The next step up is embeddings** — dense vectors that capture meaning, not just word identity. That's where bag-of-words' limitations get fixed.

The Impartial Judge: Inside a Production ML Evaluation Harness

AI Educator — Thu, 16 Apr 2026 00:00:00 GMT

Every ML project eventually runs into the same uncomfortable question: *is version B actually better than version A?* You can squint at loss curves, trust your gut, or cherry-pick examples — but until both models pass through the **same scoring system**, you're guessing. This post cracks open a real evaluation harness — the kind you'd find in a production ML repo — and unpacks every design decision inside it, one piece at a time.

What the File Contains

The harness is a single Python module with **two dataclasses** (the report cards), **three functions** (score, time, format), and a small percentile helper. That's it. The whole point of a harness is to be small, stable, and boring — you don't want surprises in your ruler.

The Two Report Cards

The file opens with two `@dataclass` definitions. They're lightweight containers — *structs with benefits*. Instead of returning a confusing tuple like `(0.89, 0.87, 0.92, 234.1)` where you have to remember which number is which, the function returns an object with **named fields**.

Notice the **separation of concerns**: `ClassificationMetrics` is about *quality* (did the model get it right?), `LatencyStats` is about *speed* (how long did it take?). They're orthogonal — a slow-but-accurate model is useful for some contexts, a fast-but-mediocre one for others. Keeping them separate lets each be evolved independently.

Scoring Predictions: compute_metrics

This function takes two equal-length lists — `y_true` (the correct labels) and `y_pred` (what the model guessed) — and returns **four kinds of numbers**. Let's unpack each one.

Accuracy — the obvious one

Accuracy is the fraction of predictions that are correct. 89 right out of 100 = 0.89. Simple. Intuitive. **Often misleading.**

F1 — when accuracy lies

F1 fixes the imbalance problem by measuring two things together for each class:

**Precision**: of the times we predicted class X, how often were we right?
**Recall**: of the actual class-X examples, how many did we catch?
**F1**: the *harmonic mean* of precision and recall — high only when BOTH are high

The harmonic mean is the secret sauce. A regular average would let you cheat — score 1.0 on precision and 0.1 on recall, average is 0.55. But the harmonic mean punishes imbalance: it pulls the score toward the *worse* of the two numbers.

Macro F1 and the labels= trick

**Macro F1** is the *unweighted average* of per-class F1 scores. It treats every class as equally important regardless of how common it is. A model that nails the common classes but bombs the rare ones will score high on accuracy but low on macro F1. That's usually what you want to know.

Out-of-scope: the abstain case

Real-world classifiers need to say *'I don't know'* sometimes. A support-ticket router shouldn't confidently shove a random gibberish message into 'billing' — it should abstain. That's what **OOS (out-of-scope)** detection measures.

**OOS recall**: of all *truly* out-of-scope messages, how many did we correctly flag? (Catching the abstentions)
**OOS precision**: of all messages we flagged OOS, how many actually were? (Not over-abstaining)

Why manual counting instead of sklearn? Because the concept is clearer as arithmetic, and the explicit `if true_oos else 0.0` makes the zero-division behavior obvious. No hidden library magic.

Measuring Speed: measure_latency

You now know how often the model is *right*. The other half of the story is how *fast* it is. This is where `measure_latency` steps in — and it's packed with benchmarking wisdom.

The warmup ritual

The first call to a model is almost always the slowest. On a GPU with PyTorch, the first forward pass triggers CUDA graph compilation, MPS kernel compilation, memory allocation, caching. On CPU, it triggers import caching and branch prediction warmup. Including those first-call timings in your measurement **pollutes your numbers**.

Why percentiles, not averages

This is one of the most important ideas in production monitoring, so let's slow down. **Averages lie.** Especially with latency.

99 calls at 10ms + 1 call at 5 seconds = ~60ms mean. One user in a hundred waited five seconds. The mean doesn't tell you that.

In production, a small fraction of slow requests create most of the bad user experiences. That's why you want percentiles — they describe the *distribution*, not just the center.

The percentile calculation

The function computes percentiles by hand using **linear interpolation** — the standard approach when the target rank falls between two sorted samples.

Walking through p95 on 100 samples: `rank = 0.95 * 99 = 94.05`. That means take the value at index 94 and blend it 5% of the way toward index 95. If the rank lands on an integer, no interpolation is needed.

No batching — and why it matters

The docstring is emphatic: *Do NOT batch*. Production serving typically handles one query at a time — user sends a message, model replies. The number that matters is **per-query** latency. Batched throughput is a completely different metric: higher, but it doesn't reflect the user's wait time.

The Reporter: format_metrics_row

Once you've scored and timed the model, you need to put the numbers somewhere humans will read. `format_metrics_row` produces a single markdown table row — destined for an append-only `RESULTS.md` log that tracks every model you've ever tried.

The inner `fmt()` helper handles `None` gracefully — a metric that wasn't computed renders as `N/A` rather than crashing. Small detail, big resilience.

How the Pieces Fit Together

Every future model you build plugs into this exact flow. Same API, same report shape, instantly comparable to every previous run. That's the whole point — a harness is an **investment in comparability**.

Key Takeaways

**Separate quality from speed** — they're orthogonal concerns, so use two dataclasses.
**Don't trust accuracy alone** — under class imbalance, it rewards laziness. Reach for macro F1.
**Pass `labels=` explicitly to sklearn** — otherwise your F1 shifts when a rare class is absent from a split.
**Measure OOS precision AND recall** — catching abstentions (recall) and not over-abstaining (precision) are both important.
**Always warm up before timing** — first-call latency is not representative of steady state.
**Report percentiles, not means** — p50, p95, p99 describe the distribution; the mean hides tail pain.
**Use `time.perf_counter()`** — monotonic, high-resolution, benchmarking-appropriate.
**Never batch when measuring per-query latency** — it gives you throughput, not user wait time.
**Format output consistently** — one row per run, append-only, lives in git.

Conclusion

A good evaluation harness isn't clever. It's *disciplined*. It makes the same choices every time, surfaces the numbers that matter, and hides the ones that mislead. Every model that passes through it gets the same treatment — the impartial judge that your project deserves.

Semantic Caching & RAGAS Evaluation: Make Your RAG Pipeline Faster and Measurable

AI Educator — Tue, 14 Apr 2026 00:00:00 GMT

You've built a RAG bot. It retrieves context, generates answers, and mostly works. But two questions keep nagging: **how do I make it faster?** and **how do I know if it's actually good?** This post tackles both. We'll wire up a semantic cache that intercepts repeated queries before they ever touch the LLM, then plug in RAGAS — a reference-free evaluation framework — to put hard numbers on retrieval and generation quality. If you're still figuring out how to break your documents into chunks, start with [**Chunking in RAG — Breaking Text the Right Way**](/blog/chunking-in-rag).

Why Exact-Match Caching Falls Short

A traditional cache matches queries by their exact string. That's fine for database lookups, but terrible for LLM traffic. Users ask *"What is Python?"*, *"Tell me about Python"*, and *"Explain the Python language"* — three strings, one intent. An exact-match cache misses all of them after the first.

Semantic caching solves this by comparing **meaning** instead of characters. Every query gets embedded into a vector, and we search the cache using cosine similarity. If a stored query is close enough, we skip the LLM entirely and return the cached response.

How Semantic Caching Works

The flow is straightforward: embed the incoming query, search a vector store for the nearest cached embedding, and compare the similarity score against a configurable threshold. Above the threshold means a hit — below means a miss, and the full retrieval-generation pipeline runs as normal. The new response is then stored for future queries.

Setting Up GPTCache

GPTCache is an open-source library from Zilliz with pluggable components for embedding, storage, similarity evaluation, and eviction. It wraps the OpenAI API so you can drop it into an existing project with minimal changes.

Tuning the Similarity Threshold

The threshold is the single most important knob. Set it too low and you'll serve cached answers to the wrong questions. Set it too high and the cache barely fires. The right value depends on your use case.

To find the sweet spot, build a small test harness: create pairs of queries that *should* match and pairs that *should not*, then sweep the threshold and observe where false hits start appearing.

Wrapping Your RAG Bot with a Cache Layer

Rather than modifying your existing RAG pipeline, wrap it. The `CachedRAGBot` class below sits in front of your Day 1 bot, checks the cache first, and only falls through to full retrieval + generation on a miss.

Measuring Quality with RAGAS

Speed without quality is useless. RAGAS (*Retrieval-Augmented Generation Assessment*) is a framework that evaluates your RAG pipeline across multiple dimensions **without needing human-annotated ground truth**. It uses an LLM as a judge to score each sample automatically.

The Five Core Metrics

Running Your First RAGAS Evaluation

Crafting Good Test Pairs

Your evaluation is only as good as your test data. Aim for 20+ diverse question-answer pairs that cover several categories of difficulty and intent.

**Simple factual** — *"What is X?"* Tests basic single-chunk retrieval.
**Multi-hop** — *"How does X relate to Y?"* Tests context aggregation across chunks.
**Paraphrased** — Same question in 3 different wordings. Tests semantic cache hit rate.
**Adversarial** — Questions with no answer in the documents. Tests faithfulness (the bot should say it doesn't know).
**Specific** — *"What was the revenue in Q3?"* Tests precision and whether the right chunk surfaces.

Building the Comparison Table

The final deliverable ties everything together: a table comparing latency (cached vs. uncached) and RAGAS scores across different configurations — for example, different chunk sizes.

The numbers above are illustrative, but the pattern is consistent: caching cuts average latency by 3–4× once the cache warms up, and the benefit compounds as query volume grows.

Notebook Structure for Reproducibility

Organize your evaluation notebook so anyone can re-run it end to end. A clean structure also makes it easier to add new configurations or metrics later.

**Setup & Imports** — Install dependencies, import your RAG bot from Day 1.
**Add Semantic Cache** — Wrap the bot with `CachedRAGBot`, set threshold.
**Define Q&A Pairs** — 20+ diverse test cases across all categories.
**Run Queries** — Execute cached and uncached runs, record latency and hit/miss.
**RAGAS Evaluation** — Score both configurations on all five metrics.
**Comparison Table** — Aggregate latency stats and RAGAS scores per config.
**Visualization** — Box plots for latency distribution, bar charts for RAGAS scores.
**Analysis** — Which chunking strategy scored highest on faithfulness? How much latency did caching save? Any false cache hits?

Key Takeaways

Semantic caching and RAGAS evaluation address two sides of the same coin: **performance** and **quality**. Caching makes your pipeline cheaper and faster without changing the underlying retrieval or generation logic. RAGAS gives you a quantitative signal on whether that logic is working well in the first place.

Checklist Before You Ship

Semantic cache integrated with a configurable threshold
Cache hit/miss logging with latency timestamps
20+ diverse Q&A test pairs created
RAGAS metrics computed: faithfulness, answer relevancy, context precision, context recall, factual correctness
Comparison table: cached vs. uncached latency
Comparison table: RAGAS scores per chunking strategy
Evaluation notebook runs end-to-end without errors
README updated with Day 2 results

Chunking in RAG — Breaking Text the Right Way

RAG Engineering — Sun, 12 Apr 2026 00:00:00 GMT

So, What is Chunking?

Let's keep it simple. You've got a big document — maybe a 200-page PDF, a long article, or an entire codebase. You can't just shove the whole thing into an LLM and hope for the best. That's where chunking comes in.

**Chunking is the process of breaking down large text into smaller, manageable pieces** (called "chunks") that can be embedded, stored in a vector database, and retrieved when needed. Think of it like slicing a pizza — you need the right size slices so people can actually eat them.

Why Do We Even Need to Chunk?

Two big reasons:

**Embedding models have token limits.** Most embedding models work best with a certain input size. Feed them too much text and the quality of the embeddings drops. Feed them too little and you lose context. You need that sweet spot.

**LLMs have context windows.** Even though context windows are getting bigger (we're talking 100K+ tokens now), that doesn't mean you should dump everything in there. More context doesn't mean better answers — it often means worse ones. Which brings us to...

The Lost-in-the-Middle Problem

This one's a big deal and a lot of people overlook it.

Research has shown that when you give an LLM a long context, it pays the most attention to the **beginning** and the **end**. The stuff in the middle? It kinda gets ignored. The model literally "loses" information that sits in the middle of a long context window.

So chunking isn't just about breaking text apart. It's about making sure each piece is meaningful enough to stand on its own when retrieved, so you don't need to dump 30 chunks into context and hope for the best.

How to Pick a Chunking Strategy

Before you start splitting text like a madman, ask yourself these four questions:

**What kind of data am I working with?** A 500-page legal document is very different from a bunch of short FAQ answers. Long, structured docs need smart splitting. Short docs might not need chunking at all — if they fit comfortably in context, just use them as-is.
**Which embedding model am I using?** Different models are optimized for different input lengths. Some work great with 256 tokens, others prefer 512 or more. Check your model's docs and match your chunk size accordingly.
**What do user queries look like?** Short keyword searches? Full-sentence questions? Multi-paragraph prompts? Your chunk size should roughly mirror the granularity of the queries. Short queries → shorter chunks tend to match better. Detailed questions → slightly larger chunks with more context.
**How are retrieved chunks being used?** Are they being fed directly into an LLM prompt? Displayed to a user? Used for citation? Each use case has different requirements for chunk size, overlap, and structure.

Chunking Methods

Alright, let's get into the actual techniques. We'll go from simple to sophisticated.

Fixed-Size Chunking

The most basic approach. You pick a number — say 500 characters — and just split the text every 500 characters. Maybe you add some overlap (like 50 characters) so chunks share a little context at the edges.

It's fast, it's simple, it works in a pinch. But here's the problem — it's dumb. It doesn't care about sentence boundaries, paragraphs, or meaning. You'll end up with chunks that start mid-sentence and end mid-thought. The embeddings for those chunks will be noisy and retrieval quality suffers.

This is exactly why we need **"Content-aware" chunking**. Instead of blindly chopping text at arbitrary character counts, content-aware methods understand the structure of what they're splitting — sentences, paragraphs, headings, code blocks. The result? Chunks that actually make semantic sense, which means better embeddings and better retrieval.

Now let's look at the content-aware methods that actually respect your text's structure.

Sentence & Paragraph Splitting

The idea is straightforward — split text along natural language boundaries like sentences and paragraphs. Each chunk is a complete thought, not a fragment.

There are a few ways to do this:

Naive Splitting (Full Stops)

Just split on periods. It works... until it doesn't. Think about "Dr. Smith went to Washington D.C. on Jan. 5th." — that's one sentence, but naive splitting sees four. Not great.

NLTK — Natural Language Toolkit

NLTK's sentence tokenizer is way smarter. It uses a pre-trained model (Punkt) that understands abbreviations, decimals, and other tricky edge cases.

spaCy

spaCy takes it up another notch. It doesn't just find sentence boundaries — it builds a full linguistic model of your text. Slightly heavier, but the sentence detection is rock solid.

2. Recursive Character Chunking

This is probably the most popular method in the RAG world right now, and for good reason. LangChain's `RecursiveCharacterTextSplitter` is the go-to implementation.

The core idea is clever: try to split on the most meaningful boundary first, then fall back to less meaningful ones. The default separator hierarchy is:

So you set a target chunk size (say 1000 characters), and the splitter works its way down the separator list until each chunk fits. This means paragraphs stay intact when possible, sentences stay together when paragraphs are too long, and you only break words as an absolute last resort.

The beauty of this approach is balance — you get roughly consistent chunk sizes (which embedding models love) while still respecting text structure (which retrieval quality loves).

Document Structure-Based Chunking

Now we're getting smart. Instead of just looking at characters and sentences, this approach actually understands the structure of your document.

Think about it — real documents aren't just walls of text. They have:

**PDFs** — headers, sub-headers, tables, figures, footers, page numbers
**HTML pages** — `
` through `
` tags, `
` paragraphs, `
` elements, `
- **Markdown** — # headings, code blocks, bullet lists
- **Code files** — functions, classes, imports, comments
Structure-based chunking uses these natural boundaries to create chunks. A section under an `
` header becomes one chunk. A table stays together. A code function isn't split in half.

The real power here is that each chunk comes with **metadata**. You don't just get the text — you know which section it came from, what heading it was under, what page it was on. This makes retrieval way more precise.

Quick Recap
Here's the deal — there's no single "best" chunking strategy. It depends on your data, your embedding model, your queries, and your use case. But here's a rough mental model:
- **Just prototyping?** → Fixed-size or recursive character splitting. Get something working fast.
- **Building for production with unstructured text?** → Recursive character splitting with tuned parameters. It's the sweet spot for most use cases.
- **Working with structured documents (PDFs, HTML, docs)?** → Document structure-based chunking. Preserve that structure — it's free metadata.
- **Need maximum retrieval quality?** → Combine structure-based chunking with sentence-level splitting inside each section. Best of both worlds.
Happy chunking. Go build something cool. 🚀
Understanding Transformers: The Architecture Behind Modern AI

AI Educator — Sun, 12 Apr 2026 00:00:00 GMT

The **Transformer architecture**, introduced in the groundbreaking paper *Attention Is All You Need* (2017), revolutionized the field of natural language processing and became the foundation for modern AI systems like GPT, BERT, and Claude.

What Makes Transformers Special?

Unlike previous architectures that processed sequences sequentially, Transformers can process entire sequences in parallel, making them significantly faster and more efficient. This parallel processing capability is what enabled the training of massive language models.

Architecture Overview

The Transformer consists of an encoder and decoder, each made up of stacked layers. Each layer contains multi-head attention mechanisms and feed-forward neural networks, connected by residual connections and layer normalization.

Self-Attention Mechanism

The self-attention mechanism computes three vectors for each input token: **Query (Q)**, **Key (K)**, and **Value (V)**. Here's a simplified implementation:

Key Components
1. **Multi-Head Attention**: Allows the model to attend to different aspects of the input simultaneously
2. **Positional Encoding**: Injects information about token positions since Transformers don't inherently understand sequence order
3. **Feed-Forward Networks**: Applied to each position independently for non-linear transformations
4. **Layer Normalization**: Stabilizes training and improves convergence
5. **Residual Connections**: Helps with gradient flow in deep networks
Comparison with RNNs

Mathematical Foundation

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

Where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the key vectors. The scaling factor prevents the dot products from growing too large.

Training Considerations

Conclusion

Transformers have fundamentally changed how we approach sequence modeling tasks. Their ability to capture long-range dependencies and process sequences in parallel has made them the architecture of choice for modern AI systems, from language models to image generators.