Adam Optimizer Explained: Why It's Better Than Plain Gradient Descent

A complete beginner's guide to the Adam optimizer - how it adapts learning rates per parameter, why it converges faster than SGD, and how to use it effectively in PyTorch.

Imagine driving a car where you can only set one speed for the entire journey — 60 mph on highways, 60 mph in school zones, 60 mph on bumpy roads. That's what plain stochastic gradient descent (SGD) does: one learning rate for every parameter. Adam (Adaptive Moment Estimation) is like adaptive cruise control that automatically adjusts speed to road conditions. This post explains how Adam works, why it has become the default optimizer for most deep learning tasks, and how to use it effectively.

What You'll Learn

By the end of this post, you'll understand: why plain SGD struggles with modern neural networks, how Adam adapts learning rates per parameter, what momentum and RMSprop are (Adam's building blocks), how weight decay works in Adam, and practical tips for choosing hyperparameters.

Let's start by understanding what we're improving upon. SGD (Stochastic Gradient Descent) is the simplest optimizer. The update rule is:

sgd_update.py
python
# Plain SGD update rule
weight = weight - learning_rate * gradient

# Example:
# If gradient = 0.5 and learning_rate = 0.1
# weight = weight - 0.1 * 0.5 = weight - 0.05

Every parameter gets the same learning rate. This causes three major problems:

Problem 1: One Global Learning Rate

Imagine you're training a network with 1 million parameters. Some parameters have large, consistent gradients (they know which direction to go). Others have tiny, noisy gradients (they're uncertain). With one global learning rate:

  • Large gradients: If learning rate is too high, these parameters overshoot and oscillate
  • Small gradients: If learning rate is too low, these parameters barely move and learning is slow

You're forced to choose a learning rate that's a compromise — not optimal for anyone.

Problem 2: Noisy Gradients

Mini-batch gradients are noisy estimates of the true gradient. One batch might say 'go left', the next says 'go right'. SGD follows these noisy signals directly, leading to a zigzag path instead of a smooth descent.

The Zigzag Problem

Imagine hiking down a mountain in thick fog. You can only see a few feet ahead (one mini-batch). You take a step based on the local slope, but the fog makes it hard to see the overall direction. You end up zigzagging down the mountain instead of taking a smooth path. That's SGD with noisy gradients.

Problem 3: Ravines and Plateaus

Loss landscapes often have ravines (steep in one direction, flat in another) and plateaus (flat everywhere). SGD struggles with both:

  • Ravines: SGD bounces between the steep walls instead of smoothly descending
  • Plateaus: Gradients are tiny, so SGD barely moves even though there's a cliff edge nearby
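
To make the ravine problem concrete, here's a toy sketch (plain Python, illustrative numbers only) of SGD on a quadratic bowl that is 100x steeper along x than along y:

```python
# Toy ravine: loss = 0.5 * (100 * x**2 + y**2)
# Curvature is 100 along x (the steep walls) and 1 along y (the floor).
def grad(x, y):
    return 100.0 * x, 1.0 * y

lr = 0.019              # just under the stability limit 2/100 for x
x, y = 1.0, 1.0
xs = []
for _ in range(10):
    gx, gy = grad(x, y)
    x -= lr * gx        # multiplies x by (1 - 1.9) = -0.9: sign flips every step
    y -= lr * gy        # multiplies y by 0.981: y crawls along the floor
    xs.append(x)

# x bounces between the walls (alternating sign) while y has only
# shrunk from 1.0 to about 0.83 after 10 steps.
```

The steep direction forces a small learning rate, and that same small rate starves the flat direction. That compromise is exactly what per-parameter adaptive rates remove.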

Adam combines two earlier innovations: Momentum and RMSprop. Let's understand each before seeing how Adam combines them.

Momentum adds 'inertia' to gradient descent. Instead of following the current gradient exactly, we maintain a velocity — a running average of recent gradients.

momentum.py
python
# Momentum update rule
velocity = beta * velocity + (1 - beta) * gradient
weight = weight - learning_rate * velocity

# beta is typically 0.9 (90% old velocity, 10% new gradient)
# This smooths out noise and builds up speed in consistent directions

Think of it like pushing a ball down a hill. The ball doesn't instantly change direction with every bump — it has momentum that smooths out the path. If gradients consistently point in one direction, velocity builds up and we move faster. If gradients oscillate, velocity averages them out and we move more carefully.

The Bowling Ball Analogy

Imagine rolling a bowling ball down a bumpy hill. The ball doesn't stop at every bump — its momentum carries it over small obstacles. It accelerates down consistent slopes and slows down when the terrain changes direction. That's exactly what momentum does for gradient descent.

RMSprop (Root Mean Square Propagation) adapts the learning rate for each parameter based on the magnitude of recent gradients.

rmsprop.py
python
# RMSprop update rule
squared_gradient_avg = beta * squared_gradient_avg + (1 - beta) * gradient**2
weight = weight - learning_rate / sqrt(squared_gradient_avg + epsilon) * gradient

# Parameters with large gradients get smaller effective learning rates
# Parameters with small gradients get larger effective learning rates

The key insight: divide the learning rate by the square root of the average squared gradient. This means:

  • Large gradients → Large denominator → Smaller effective learning rate → Smaller steps
  • Small gradients → Small denominator → Larger effective learning rate → Larger steps

Each parameter gets its own adaptive learning rate based on its gradient history.
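
Here's a small numeric sketch (plain Python, made-up gradient magnitudes) of that normalization. Each parameter sees a constant gradient, and we compare the effective step sizes RMSprop produces:

```python
import math

# Two parameters with very different gradient scales, run long enough
# for the exponential moving average to reach steady state.
lr, beta, eps = 0.01, 0.9, 1e-8
v_big = v_small = 0.0
for _ in range(200):
    v_big   = beta * v_big   + (1 - beta) * 10.0 ** 2    # |gradient| = 10
    v_small = beta * v_small + (1 - beta) * 0.01 ** 2    # |gradient| = 0.01

step_big   = lr / math.sqrt(v_big + eps) * 10.0
step_small = lr / math.sqrt(v_small + eps) * 0.01
# Both effective steps come out near lr = 0.01: the 1000x gap in raw
# gradient magnitude is normalized away by the division.
```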

Adam combines momentum (for smoothing) and RMSprop (for adaptive rates). Here's the complete algorithm:

adam_algorithm.py
python
# Adam update rule (simplified)

# 1. Compute first moment (momentum-like)
m = beta1 * m + (1 - beta1) * gradient

# 2. Compute second moment (RMSprop-like)
v = beta2 * v + (1 - beta2) * gradient**2

# 3. Bias correction (important for early steps)
m_corrected = m / (1 - beta1**t)  # t = current step number
v_corrected = v / (1 - beta2**t)

# 4. Update weight
weight = weight - learning_rate * m_corrected / (sqrt(v_corrected) + epsilon)

# Typical hyperparameters:
# beta1 = 0.9   (momentum decay)
# beta2 = 0.999 (RMSprop decay)
# epsilon = 1e-8 (numerical stability)

Let's break down each component:

m is a running average of gradients (like momentum). beta1 = 0.9 means we keep 90% of the old average and add 10% of the new gradient. This smooths out noise and builds up speed in consistent directions.

v is a running average of squared gradients (like RMSprop). beta2 = 0.999 means we keep 99.9% of the old average and add 0.1% of the new squared gradient. This tracks the 'volatility' of each parameter's gradients.

Here's a subtle but important detail. At the start of training, m and v are initialized to zero. This creates a bias toward zero in the early steps. Adam corrects this by dividing by (1 - beta**t), where t is the step number.

bias_correction_example.py
python
# Why bias correction matters: suppose beta1 = 0.9 and we're at step 1

# Without correction:
#   m = 0.9 * 0 + 0.1 * gradient = 0.1 * gradient   (only 10% of the true scale)
#
# With correction:
#   m_corrected = 0.1 * gradient / (1 - 0.9**1)
#               = 0.1 * gradient / 0.1
#               = gradient                           (correct scale)

# As t increases, (1 - beta**t) approaches 1, so the correction fades out

The final update divides the smoothed gradient (m_corrected) by the square root of the smoothed squared gradient (sqrt(v_corrected)). This gives each parameter an adaptive learning rate based on its gradient history.
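
Putting the four steps together, here's a minimal, illustrative single-parameter Adam loop (plain Python, not a library implementation) that minimizes the toy function (w - 3)**2:

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=200):
    """Single-parameter Adam, mirroring the update rule above."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # 1. first moment
        v = beta2 * v + (1 - beta2) * g * g      # 2. second moment
        m_hat = m / (1 - beta1 ** t)             # 3. bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # 4. update
    return w

# Minimize (w - 3)**2; the gradient is 2 * (w - 3).
# lr=0.1 is larger than the 1e-3 default, just to converge fast on this toy.
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
```

Note how early steps have magnitude close to lr regardless of the raw gradient size: with consistent gradients, m_hat / sqrt(v_hat) is roughly the sign of the gradient.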

The Adaptive Cruise Control Analogy

Adam is like adaptive cruise control for gradient descent. It maintains momentum (smooth acceleration), adapts speed based on road conditions (gradient magnitude), and corrects for initial bias (cold start). Each parameter gets its own personalized learning rate, automatically adjusted based on its gradient history.

Adam provides several key advantages over plain SGD:

In practice, Adam often converges substantially faster than plain SGD, especially early in training. Why? Because it adapts the learning rate per parameter: parameters that need large steps get them, and parameters that need small steps get them. No more one-size-fits-all compromise.

With SGD, choosing the right learning rate is critical and problem-specific. Too high and training explodes, too low and it crawls. Adam is much more forgiving. The default lr=1e-3 (0.001) works well for most problems. You can often use it without tuning.

In problems like NLP, many parameters have zero gradients most of the time (sparse gradients). Adam handles this well because it adapts per parameter. Parameters that rarely update get larger effective learning rates when they do update.
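
A toy sketch (plain Python, invented numbers) of why sparse gradients suit Adam: a parameter receives zero gradient for 99 steps, then one small gradient arrives, and we compare Adam's step with plain SGD's:

```python
import math

# A 'rare' parameter: gradient is zero for 99 steps, then 0.001 on step 100.
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 101):
    g = 0.001 if t == 100 else 0.0
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

m_hat = m / (1 - beta1 ** 100)
v_hat = v / (1 - beta2 ** 100)
adam_step = lr * m_hat / (math.sqrt(v_hat) + eps)
sgd_step  = lr * 0.001      # what plain SGD would do with the same gradient

# Because v stayed tiny, adam_step is hundreds of times larger than
# sgd_step: the rare update still moves the parameter by nearly a full
# lr-sized step.
```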

The default hyperparameters (beta1=0.9, beta2=0.999, lr=1e-3) work well for most problems. This is why Adam has become the default optimizer — it 'just works' without extensive tuning.

PyTorch makes Adam easy to use. Here's a complete example:

adam_pytorch.py
python
import torch
import torch.nn as nn
import torch.optim as optim

# Define your model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Create Adam optimizer
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,              # Learning rate (default: 1e-3)
    betas=(0.9, 0.999),   # (beta1, beta2) - momentum and RMSprop decay
    eps=1e-8,             # Epsilon for numerical stability
    weight_decay=0        # L2 regularization (more on this below)
)

# Training loop (train_loader assumed defined elsewhere)
criterion = nn.CrossEntropyLoss()   # loss for the 10-class output above
num_epochs = 10
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        # 1. Zero gradients from previous step
        optimizer.zero_grad()
        
        # 2. Forward pass
        output = model(batch_x)
        loss = criterion(output, batch_y)
        
        # 3. Backward pass (compute gradients)
        loss.backward()
        
        # 4. Update weights using Adam
        optimizer.step()

# That's it! Adam handles all the complexity internally

Adam's hyperparameters at a glance:

| Parameter | Default | What It Does | When to Change |
| --- | --- | --- | --- |
| lr | 1e-3 | Base learning rate | Increase if training is too slow; decrease if loss explodes |
| beta1 | 0.9 | Momentum decay (first moment) | Rarely changed; 0.9 works well |
| beta2 | 0.999 | RMSprop decay (second moment) | Rarely changed; 0.999 works well |
| eps | 1e-8 | Numerical stability constant | Rarely changed; sometimes increased for stability in low-precision training |
| weight_decay | 0 | L2 regularization strength | Set to 1e-4 or 1e-5 to prevent overfitting |

Weight decay is L2 regularization built into the optimizer. It adds a penalty for large weights, helping prevent overfitting.

weight_decay.py
python
# Without weight decay
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# With weight decay (recommended for most problems)
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4  # Penalize large weights
)

# What weight_decay does in optim.Adam: it adds an L2 term to the
# gradient before the adaptive update,
#   gradient = gradient + weight_decay * weight
# so the penalty is then rescaled by Adam's per-parameter denominator.
# (AdamW instead subtracts lr * weight_decay * weight directly from the
# weight; this is the 'decoupled' decay discussed below.)

Common weight decay values:

  • 0: No regularization (only use if you have tons of data)
  • 1e-5 (0.00001): Mild regularization
  • 1e-4 (0.0001): Standard choice for most problems
  • 1e-3 (0.001): Strong regularization (if overfitting is severe)

Start with 1e-4

For most problems, weight_decay=1e-4 is a good default. It provides mild regularization without hurting performance. If you're still overfitting, increase it. If you're underfitting, decrease it or remove it.
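
A one-step numeric sketch makes the difference between Adam's coupled L2 penalty and AdamW's decoupled decay (covered below) visible. This is illustrative plain Python using the bias-corrected first step (t = 1), where m_corrected = gradient and v_corrected = gradient**2, so the update reduces to roughly lr * sign(gradient):

```python
import math

# One bias-corrected Adam step at t = 1; all values are illustrative.
w0, g = 2.0, 0.5
lr, wd, eps = 0.001, 0.1, 1e-8

# Coupled L2 (optim.Adam): decay is folded into the gradient, then
# largely normalized away by the adaptive denominator.
g_l2 = g + wd * w0                                     # 0.5 + 0.2 = 0.7
w_adam = w0 - lr * g_l2 / (math.sqrt(g_l2 ** 2) + eps)

# Decoupled decay (AdamW-style): the weight is shrunk directly, outside
# the adaptive update, so the penalty always bites.
w_adamw = w0 - lr * g / (math.sqrt(g ** 2) + eps) - lr * wd * w0

# w_adam  ≈ 1.9990  (the L2 term barely changed the step)
# w_adamw ≈ 1.9988  (the decay visibly shrank the weight)
```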

Adam is the default choice for most problems, but SGD with momentum still has its place:

| Optimizer | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Adam | Most problems, especially NLP and small datasets | Fast convergence, works out of the box, less tuning needed | Sometimes worse final performance than well-tuned SGD |
| SGD + Momentum | Computer vision, very large datasets, when you have time to tune | Can achieve slightly better final accuracy | Requires careful learning rate tuning; slower convergence |
| AdamW | Transformers, modern NLP | Decoupled weight decay (better than Adam's L2); state-of-the-art results | Slightly more complex |

The Practical Rule

Start with Adam. It will get you most of the way there with minimal tuning. If you need the last few percent of performance and have time to experiment, try SGD with momentum and carefully tune the learning rate schedule. For transformers and modern NLP, use AdamW (Adam with decoupled weight decay).
Common Mistakes

  1. Learning rate too high: If the loss explodes to NaN in the first few steps, your learning rate is too high. Try 1e-4 instead of 1e-3.
  2. Not using weight decay: Without regularization, models often overfit. Start with weight_decay=1e-4.
  3. Forgetting optimizer.zero_grad(): Gradients accumulate by default. Always call zero_grad() before backward().
  4. Using Adam for everything: For computer vision with huge datasets, well-tuned SGD can outperform Adam. Don't be dogmatic.
  5. Not adjusting learning rate: For very long training runs, consider learning rate scheduling (reduce lr when progress plateaus).
  6. Comparing Adam and SGD with same learning rate: Adam typically needs a smaller learning rate than SGD. Don't compare them with the same lr.

For long training runs, you might want to reduce the learning rate over time. PyTorch provides several schedulers:

lr_scheduling.py
python
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR

optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Option 1: Reduce LR when validation loss plateaus
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',           # Minimize the monitored quantity (validation loss)
    factor=0.5,           # Multiply LR by 0.5 when a plateau is detected
    patience=5            # Wait 5 epochs without improvement before reducing
)

# Training loop
for epoch in range(num_epochs):
    train_loss = train_one_epoch()
    val_loss = validate()
    
    # Update learning rate based on validation loss
    scheduler.step(val_loss)

# Option 2: Reduce LR every N epochs
scheduler = StepLR(
    optimizer,
    step_size=30,         # Reduce every 30 epochs
    gamma=0.1             # Multiply LR by 0.1
)

for epoch in range(num_epochs):
    train_one_epoch()
    scheduler.step()      # Update LR after each epoch

Key Takeaways

  1. Adam adapts learning rates per parameter based on gradient history, making it much more effective than plain SGD.
  2. It combines momentum (smoothing) and RMSprop (adaptive rates) to get the best of both worlds.
  3. Default hyperparameters work well: lr=1e-3, beta1=0.9, beta2=0.999 are good starting points.
  4. Use weight decay: weight_decay=1e-4 provides mild regularization and helps prevent overfitting.
  5. Adam typically converges much faster than plain SGD, with less hyperparameter tuning needed.
  6. Always call optimizer.zero_grad() before backward() to clear old gradients.
  7. For transformers, use AdamW (Adam with decoupled weight decay) for best results.
  8. Consider learning rate scheduling for very long training runs.

The Bottom Line

Adam is the default optimizer for deep learning because it 'just works' with minimal tuning. It adapts learning rates per parameter, handles sparse gradients well, and converges much faster than plain SGD. Start with Adam (lr=1e-3, weight_decay=1e-4) and you'll be right 90% of the time. Only switch to SGD if you have a specific reason and time to tune carefully.
