
PyTorch Autograd: Automatic Differentiation from the Ground Up
A complete, beginner-friendly guide to PyTorch's autograd engine — from what a gradient is to building a neural network by hand.
Every time a neural network learns something — recognising a cat, translating a sentence, beating you at chess — it does so by computing gradients and nudging its parameters in the right direction. This process is called backpropagation, and in PyTorch it is handled entirely automatically by a subsystem called autograd. You never have to derive a single derivative by hand. In this post we'll build a mental model of how autograd works, play with real examples, and by the end you'll feel completely comfortable using it in your own projects.
What is a gradient? The intuition
Imagine you are standing on a hilly landscape in thick fog. You can't see the valley, but you can feel the slope under your feet. The gradient tells you: "how steeply is the ground rising, and in which direction?" If you always step in the opposite direction of the slope, you'll eventually reach the lowest point — the valley.
In machine learning, the landscape is a loss function — a number that measures how wrong our model's predictions are. The 'ground' is all the model's parameters (weights). The gradient tells us: "if I change each weight by a tiny amount, how much does the loss go up or down?" We then nudge every weight slightly downhill — this is gradient descent.
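That downhill walk can be sketched in a few lines of plain Python, with no PyTorch at all. This is a toy example with a made-up function f(w) = (w - 3)², whose slope we can write down by hand; real training replaces the hand-derived gradient with autograd.

```python
# Minimal gradient descent in one dimension.
# We minimise f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
def grad(w):
    return 2 * (w - 3)

w = 0.0    # start far from the minimum at w = 3
lr = 0.1   # learning rate: how big each downhill step is

for step in range(100):
    w -= lr * grad(w)  # step OPPOSITE to the slope

print(round(w, 4))  # converges to ~3.0
```

Each step shrinks the distance to the minimum by a constant factor (here 0.8), which is why the loop converges so quickly.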
Formal definition (don't be scared)
Formally, for a loss L(w₁, …, wₙ), the gradient ∇L is the vector of partial derivatives (∂L/∂w₁, …, ∂L/∂wₙ). Each component answers one question: holding everything else fixed, how fast does L change as this one parameter varies?
Tensors and requires_grad
In PyTorch, data lives in tensors — think of them as supercharged NumPy arrays. By default, PyTorch doesn't track gradients for a tensor. You have to opt in by setting requires_grad=True. This tells PyTorch: "watch this tensor — I want to know how the final output changes when this value changes."
```python
import torch

# A regular tensor — gradients NOT tracked
x_plain = torch.tensor(3.0)
print(x_plain.requires_grad)  # False

# A tensor WITH gradient tracking turned on
x = torch.tensor(3.0, requires_grad=True)
print(x.requires_grad)  # True

# You can also enable it after creation
y = torch.tensor(5.0)
y.requires_grad_(True)  # in-place toggle
print(y.requires_grad)  # True
```

Why not track everything by default?
Tracking has a cost: every tracked operation records a node in the computational graph, which takes memory and time. Most tensors — raw data, labels, intermediate buffers — never need gradients, so PyTorch makes tracking opt-in.
Let's compute the gradient of a simple polynomial: y = x² + 3x + 1. We know from calculus that dy/dx = 2x + 3, so at x = 2 the gradient should be 2(2) + 3 = 7. Let's verify that PyTorch agrees:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Define the function y = x² + 3x + 1
y = x**2 + 3*x + 1
print(f"y = {y.item()}")  # y = 11.0

# Ask PyTorch to compute dy/dx
y.backward()

# The gradient is stored in x.grad
print(f"dy/dx at x=2: {x.grad.item()}")  # 7.0

# Manual check
expected = 2*2 + 3  # = 7
print(f"Manual check: {expected}")  # 7
```

It works!
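We can also sanity-check autograd against a method that needs no calculus at all: the central finite difference, which approximates a derivative by evaluating the function at two nearby points. This is a quick illustration, not something you'd use in training:

```python
import torch

# Central finite difference: f'(x) ≈ (f(x+h) - f(x-h)) / (2h)
def f(x):
    return x**2 + 3*x + 1

x0, h = 2.0, 1e-4
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Autograd's exact answer for comparison
x = torch.tensor(x0, requires_grad=True)
f(x).backward()

print(numeric, x.grad.item())  # both ≈ 7.0
```

The numerical estimate agrees with autograd to several decimal places; this same idea underlies PyTorch's torch.autograd.gradcheck utility.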
Every time you perform an operation on a requires_grad tensor, PyTorch silently builds a computational graph — a record of exactly what operations were performed and in what order. This graph is what makes automatic differentiation possible.
Think of it as a recipe card that PyTorch writes while you cook. When you call .backward(), PyTorch reads that recipe card backwards — from the final result all the way back to the inputs — applying the chain rule at each step.
Each node in the graph stores a grad_fn — the backward function that knows how to propagate gradients through that specific operation. Let's inspect it:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1

print(y.grad_fn)                 # <AddBackward0 object at 0x...>
print(y.grad_fn.next_functions)  # shows the upstream nodes

# Leaf node: x was created directly, not via an operation
print(x.is_leaf)  # True
print(y.is_leaf)  # False — y was produced by an operation

# Leaf nodes accumulate gradients; non-leaf nodes don't by default
print(x.grad)  # None (before backward)
print(y.grad)  # None (and stays None after backward — y is non-leaf)
```

Leaf vs. non-leaf tensors
A leaf tensor is one you created directly, like x above; anything produced by an operation is non-leaf. By default, only leaves have their .grad populated by .backward().
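If you genuinely want the gradient of an intermediate (non-leaf) tensor — say, for debugging — you can opt in with .retain_grad() before calling backward. A small sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1   # non-leaf: produced by operations
y.retain_grad()      # opt in to storing y's gradient
z = 3 * y
z.backward()

print(y.grad)  # tensor(3.)  — dz/dy = 3
print(x.grad)  # tensor(21.) — dz/dx = 3 * (2x + 3) = 21 at x=2
```

Without the retain_grad() call, y.grad would stay None and PyTorch would warn you for accessing it.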
A neural network has millions of parameters — let's see how autograd handles multiple inputs at once. Consider z = 2x² + y³, where we want both ∂z/∂x and ∂z/∂y:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = 2 * x**2 + y**3
print(f"z = {z.item()}")  # 2*4 + 27 = 35.0

z.backward()

# dz/dx = 4x => at x=2: 4*2 = 8
print(f"dz/dx = {x.grad.item()}")  # 8.0

# dz/dy = 3y² => at y=3: 3*9 = 27
print(f"dz/dy = {y.grad.item()}")  # 27.0
```

One single .backward() call populated the gradients for all participating leaf tensors simultaneously. In a real network, this means one backward pass computes gradients for every weight — no matter how many there are.
Autograd works by applying the chain rule from calculus. The chain rule says: if z depends on y, and y depends on x, then the gradient of z with respect to x is:
dz/dx = (dz/dy) × (dy/dx)
— Leibniz's Chain Rule
PyTorch applies this rule at every node in the computational graph, chaining all the local gradients together as it works its way backward from the output to the inputs. Let's trace this manually for a two-step function:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Two-step function:
#   Step 1: y = x² + 1    =>  dy/dx = 2x
#   Step 2: z = (y - 4)²  =>  dz/dy = 2(y - 4)
# Chain rule: dz/dx = dz/dy * dy/dx
y = x**2 + 1    # y = 10
z = (y - 4)**2  # z = (10 - 4)² = 36

z.backward()

# PyTorch's answer
print(f"dz/dx (autograd): {x.grad.item()}")

# Manual chain rule:
#   dy/dx = 2*3 = 6
#   dz/dy = 2*(10 - 4) = 12
#   dz/dx = 12 * 6 = 72
print(f"dz/dx (manual): {12 * 6}")
```

So far we've used scalar (single-number) tensors. Real networks deal with vectors and matrices. When the output is a vector, .backward() needs a gradient argument — the vector v in the vector-Jacobian product vᵀJ — to tell autograd how to weight each output dimension. The most common case is passing a tensor of ones, which is equivalent to summing the outputs first.
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Element-wise operation: y = x²
y = x ** 2  # y = [1, 4, 9]

# y is a vector, so we must tell backward() how to aggregate.
# Passing torch.ones_like(y) means: compute d(sum(y))/dx
y.backward(torch.ones_like(y))

# d(x²)/dx = 2x, so gradients are [2, 4, 6]
print(x.grad)  # tensor([2., 4., 6.])

# -------------------------------------------
# Shortcut: call .sum() to get a scalar, then .backward() with no args
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = (x2 ** 2).sum()
y2.backward()
print(x2.grad)  # tensor([2., 4., 6.]) — same result
```

Why does the loss need to be a scalar?
A gradient is defined for a scalar-valued function: one number whose sensitivity to each input we measure. For a vector output the derivative is a whole Jacobian matrix, so instead of materialising it, autograd asks you for a vector to multiply it with.
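The vector you pass doesn't have to be all ones. Any weighting works — each entry scales that output's contribution to the gradient. A small sketch with a made-up weight vector v:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # element-wise, so the Jacobian is diag(2x) = diag([2, 4, 6])

v = torch.tensor([1.0, 0.0, 0.5])  # per-output weights (arbitrary choice)
y.backward(v)                      # computes vᵀJ

print(x.grad)  # tensor([2., 0., 3.]) — each 2·x_i scaled by v_i
```

Setting a weight to zero simply switches that output off; this is the same mechanism frameworks use internally to backpropagate from a weighted sum of losses.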
Here's one of the most common bugs for PyTorch beginners: gradients accumulate. Every time you call .backward(), PyTorch adds the new gradients to whatever is already stored in .grad. It does NOT overwrite them. This is useful for some advanced techniques, but in a standard training loop you must zero the gradients manually before every backward pass.
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# --- BUG: calling backward multiple times without zeroing ---
for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Step {i+1}: x.grad = {x.grad.item()}")

# Output:
# Step 1: x.grad = 4.0
# Step 2: x.grad = 8.0   <-- WRONG! Accumulated!
# Step 3: x.grad = 12.0  <-- WRONG!

print("\n--- FIXED VERSION ---")
x.grad = None  # reset

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Step {i+1}: x.grad = {x.grad.item()}")
    x.grad.zero_()  # ← zero AFTER reading the gradient

# Output:
# Step 1: x.grad = 4.0
# Step 2: x.grad = 4.0 ✓
# Step 3: x.grad = 4.0 ✓
```

Always zero_() in your training loop
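Accumulation is a deliberate design choice, not a quirk: it lets you split one large batch into smaller "micro-batches" whose gradients add up to the full-batch gradient (useful when the full batch doesn't fit in memory). A minimal sketch with made-up data:

```python
import torch

torch.manual_seed(0)
w = torch.tensor(1.0, requires_grad=True)
data = torch.randn(8)

# Full-batch gradient in one shot
loss = ((w * data) ** 2).mean()
loss.backward()
full_grad = w.grad.item()
w.grad = None  # reset before the accumulated version

# Same gradient, accumulated over two half-batches
for chunk in data.chunk(2):
    part = ((w * chunk) ** 2).mean() / 2  # divide so the two parts sum to the full mean
    part.backward()                       # gradients ADD into w.grad

print(f"full: {full_grad:.6f}  accumulated: {w.grad.item():.6f}")  # the two match
```

The key detail is the division by the number of chunks, so that the accumulated sum reproduces the full-batch mean exactly.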
During inference (when you're just making predictions, not training), you don't need gradients at all. Disabling gradient tracking saves memory and speeds up computation. There are two main ways to do this:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Inside a no_grad block, operations don't build a graph
with torch.no_grad():
    y = x ** 2 + 1
print(y.requires_grad)  # False
print(y.grad_fn)        # None — no graph was built

# Outside the block, tracking resumes normally
z = x ** 2 + 1
print(z.requires_grad)  # True
```

.detach() creates a new tensor that shares the same data but is completely disconnected from the computational graph. It's like making a copy that has no memory of how it was created.
```python
import torch
import numpy as np

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 1           # y is connected to x in the graph
y_detached = y.detach()  # disconnected copy — same value, no grad_fn

print(y.grad_fn)           # <AddBackward0> — still connected
print(y_detached.grad_fn)  # None — disconnected
print(y_detached.item())   # 10.0 — same value

# Common use case: converting to numpy for plotting
arr = y_detached.numpy()  # works! (detach required when requires_grad=True)
print(arr)  # 10.0 — a 0-dimensional numpy array
```

| Method | How It Works | Use When |
|---|---|---|
| torch.no_grad() | Context manager; ops inside don't track anything | Inference loops, evaluation |
| tensor.detach() | Returns new tensor with same data, no grad history | Extracting values for plotting, logging, or stopping gradient flow mid-graph |
| requires_grad_(False) | Disables tracking on the tensor itself | Permanently marking a tensor (e.g., frozen layer weights) |
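To make the table's last row concrete, here's a sketch of freezing a layer. The model architecture is made up for illustration; the point is that parameters with requires_grad set to False simply drop out of the backward pass:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: its parameters are skipped by backward (and by optimizer.step)
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.randn(3, 4)).sum()
out.backward()

print(model[0].weight.grad)              # None — frozen layer got no gradient
print(model[2].weight.grad is not None)  # True — still trainable
```

This is the standard pattern for fine-tuning: freeze the early layers of a pretrained network and train only the head.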
By default, PyTorch destroys the computational graph after calling .backward() to free memory. If you need to call .backward() more than once on the same graph (rare, but it happens in techniques like computing higher-order gradients), you must tell PyTorch to keep it:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First backward — works fine
y.backward(retain_graph=True)  # keep the graph!
print(f"First backward: x.grad = {x.grad.item()}")  # 12.0

x.grad.zero_()

# Second backward — only works because we retained the graph
y.backward(retain_graph=True)
print(f"Second backward: x.grad = {x.grad.item()}")  # 12.0

# Without retain_graph=True, the second call would raise:
# RuntimeError: Trying to backward through the graph a second time
```

retain_graph=True is expensive
Retaining the graph keeps every intermediate activation alive in memory, so use it only when you genuinely need a second pass.
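One situation where retaining the graph is legitimately needed: two losses that share part of the computation and are backpropagated separately. A minimal sketch (in practice, summing the losses and calling backward once is usually simpler and cheaper):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
shared = x ** 2       # shared subgraph feeding two losses
loss_a = 3 * shared
loss_b = shared + 1

loss_a.backward(retain_graph=True)  # keep the graph: loss_b still needs it
loss_b.backward()                   # graph is freed after this call

# Gradients accumulate: d(loss_a)/dx + d(loss_b)/dx = 12 + 4
print(x.grad)  # tensor(16.)
```

Note how the two backward calls deliberately exploit gradient accumulation — the same behaviour that is a bug in an ordinary training loop is the feature here.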
Because autograd builds a regular computation graph, you can differentiate through the backward pass itself to get second derivatives (and beyond). This is used in techniques like MAML (Model-Agnostic Meta-Learning). Use create_graph=True to make the backward pass itself differentiable:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# f(x) = x⁴  =>  f'(x) = 4x³  =>  f''(x) = 12x²
f = x ** 4

# First derivative — create_graph=True makes the gradient itself differentiable
df_dx = torch.autograd.grad(f, x, create_graph=True)[0]
print(f"f'(2) = {df_dx.item()}")  # 4 * 8 = 32.0

# Second derivative: differentiate the first derivative
d2f_dx2 = torch.autograd.grad(df_dx, x)[0]
print(f"f''(2) = {d2f_dx2.item()}")  # 12 * 4 = 48.0
```

While .backward() populates .grad for all leaf tensors in the graph, torch.autograd.grad() lets you request the gradient of a specific output with respect to specific inputs. It returns a tuple of gradients and doesn't touch .grad at all — great for custom training logic.
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
loss = x**2 + y**2  # squared Euclidean distance from the origin

# Only get the gradient w.r.t. x — y.grad is not touched.
# (retain_graph=True so we can query the same graph again below)
grad_x, = torch.autograd.grad(loss, x, retain_graph=True)
print(f"dloss/dx = {grad_x.item()}")  # 2*3 = 6.0
print(f"y.grad = {y.grad}")           # None — untouched

# Get both at once
grad_x, grad_y = torch.autograd.grad(loss, [x, y])
print(f"dloss/dx = {grad_x.item()}")  # 6.0
print(f"dloss/dy = {grad_y.item()}")  # 2*4 = 8.0
```

Sometimes you need an operation whose gradient is not natively defined in PyTorch — perhaps a custom activation function, or an operation that wraps a CUDA kernel. You can teach PyTorch how to differentiate it by subclassing torch.autograd.Function and defining both a forward and backward method.
```python
import torch

# Let's implement a "scaled sigmoid": f(x) = 2 * sigmoid(x)
# We'll do it the hard way (custom Function) for illustration
class ScaledSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx (context) stores values needed for backward
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig)  # save sigmoid(x) for reuse
        return 2 * sig

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is the gradient flowing in FROM the layer above
        sig, = ctx.saved_tensors
        # d/dx [2 * sigmoid(x)] = 2 * sigmoid(x) * (1 - sigmoid(x))
        return grad_output * 2 * sig * (1 - sig)

# Usage — same as any built-in op
x = torch.tensor(1.0, requires_grad=True)
y = ScaledSigmoid.apply(x)
y.backward()
print(f"y = {y.item():.4f}")           # 1.4621
print(f"dy/dx = {x.grad.item():.4f}")  # 0.3932

# Verify with PyTorch's built-in autograd
x2 = torch.tensor(1.0, requires_grad=True)
y2 = 2 * torch.sigmoid(x2)
y2.backward()
print(f"dy/dx (builtin) = {x2.grad.item():.4f}")  # 0.3932 — matches!
```

When should you use custom Function?
Only when autograd can't help you: wrapping a non-PyTorch kernel, handling an operation with no differentiable implementation, or hand-writing a backward that saves memory. For ordinary compositions of PyTorch ops — like this one — plain code is simpler and gives the same gradients.
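Whenever you write a custom backward, it's worth verifying it mechanically. PyTorch ships torch.autograd.gradcheck, which compares your analytic backward against finite differences; it expects double-precision inputs with requires_grad=True. A sketch, redefining the same ScaledSigmoid so the snippet stands alone:

```python
import torch

class ScaledSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig)
        return 2 * sig

    @staticmethod
    def backward(ctx, grad_output):
        sig, = ctx.saved_tensors
        return grad_output * 2 * sig * (1 - sig)

# gradcheck needs float64 for the finite-difference comparison to be tight enough
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(ScaledSigmoid.apply, (x,)))  # True
```

If the backward formula were wrong, gradcheck would raise a detailed error showing the analytic and numerical Jacobians side by side.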
Let's write a complete, minimal training loop using only autograd — no nn.Module, no optimizer — to really understand what is happening under the hood. We'll fit a line y = 2x + 1 from noisy data.
```python
import torch
torch.manual_seed(42)

# ── 1. Generate synthetic data: y = 2x + 1 + noise ──
x_data = torch.linspace(0, 1, 100)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(100)

# ── 2. Initialise parameters ──
w = torch.tensor(0.0, requires_grad=True)  # true value should be ~2
b = torch.tensor(0.0, requires_grad=True)  # true value should be ~1
lr = 0.1  # learning rate

# ── 3. Training loop ──
for epoch in range(200):
    # Forward pass: compute predictions
    y_pred = w * x_data + b

    # Compute loss (Mean Squared Error)
    loss = ((y_pred - y_data) ** 2).mean()

    # Backward pass: compute gradients
    loss.backward()

    # Update parameters (gradient descent step)
    with torch.no_grad():  # don't track these assignments
        w -= lr * w.grad
        b -= lr * b.grad

    # Zero gradients BEFORE the next iteration
    w.grad.zero_()
    b.grad.zero_()

    if epoch % 40 == 0:
        print(f"Epoch {epoch:3d} | loss={loss.item():.4f} | w={w.item():.3f} | b={b.item():.3f}")

print(f"\nFinal: w={w.item():.3f} (true: 2.0), b={b.item():.3f} (true: 1.0)")
```

Running this produces output like the following, showing the weights converging toward the true values of w=2 and b=1:
```
Epoch   0 | loss=2.1378 | w=0.274 | b=0.148
Epoch  40 | loss=0.0200 | w=1.706 | b=0.807
Epoch  80 | loss=0.0107 | w=1.927 | b=0.942
Epoch 120 | loss=0.0101 | w=1.977 | b=0.983
Epoch 160 | loss=0.0100 | w=1.993 | b=0.995

Final: w=1.998 (true: 2.0), b=0.999 (true: 1.0)
```

What just happened?
That loop is the essence of all training: forward pass, loss, backward pass, parameter update, zero gradients. Every abstraction PyTorch offers is built on those five steps.
In practice you use nn.Module and torch.optim instead of managing requires_grad and zero_() manually. Here's the same linear regression rewritten the "PyTorch way" — notice the loop is structurally identical, just with more convenient abstractions:
```python
import torch
import torch.nn as nn
torch.manual_seed(42)

# Same synthetic data
x_data = torch.linspace(0, 1, 100).unsqueeze(1)  # shape (100, 1)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(100, 1)

# Model: a single linear layer (manages w and b internally, requires_grad=True by default)
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()           # ← replaces w.grad.zero_() and b.grad.zero_()
    y_pred = model(x_data)          # forward pass
    loss = loss_fn(y_pred, y_data)
    loss.backward()                 # backward pass
    optimizer.step()                # ← replaces w -= lr * w.grad etc.

    if epoch % 40 == 0:
        w, b = model.weight.item(), model.bias.item()
        print(f"Epoch {epoch:3d} | loss={loss.item():.4f} | w={w:.3f} | b={b:.3f}")
```

The only difference is ergonomics. Under the hood, model.parameters() returns tensors with requires_grad=True, optimizer.zero_grad() clears each parameter's .grad, and optimizer.step() applies the weight update. Autograd is doing the same work it always was.
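For completeness, here's the matching inference pattern for a model like this — a small sketch combining model.eval() (which switches layers such as dropout and batch norm to evaluation behaviour) with torch.no_grad() (which skips graph building entirely):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in for a trained model

model.eval()              # evaluation mode for mode-dependent layers
with torch.no_grad():     # no graph, no gradient bookkeeping
    preds = model(torch.linspace(0, 1, 5).unsqueeze(1))

print(preds.requires_grad)  # False — safe to log, plot, or convert to numpy
```

The two calls do different jobs and you typically want both: eval() changes layer behaviour, no_grad() saves memory and compute.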
| Pitfall | Symptom | Fix |
|---|---|---|
| Forgetting zero_grad() | Loss oscillates wildly or diverges; gradients grow each step | Call optimizer.zero_grad() at the start of every iteration |
| In-place ops on leaf tensors | RuntimeError about in-place operation on a leaf variable | Use out-of-place ops: x = x + 1 instead of x += 1 |
| Calling .backward() twice | RuntimeError: graph freed | Use retain_graph=True if you genuinely need two passes |
| Converting to NumPy without detach | RuntimeError: tensor requires grad | Use tensor.detach().numpy() or wrap in torch.no_grad() |
| Accumulating loss in a list | Memory leak — full graph kept alive by Python list | Store loss.item() (a plain float), not the tensor itself |
| Updating weights inside autograd scope | Gradients contaminated by weight-update operations | Always update weights inside torch.no_grad() or use optimizer.step() |
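The "accumulating loss in a list" pitfall deserves a quick demonstration, since it causes silent memory growth rather than an error. A minimal sketch:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
history = []

for _ in range(3):
    loss = (w * 2) ** 2
    # BAD:  history.append(loss)   — each appended tensor keeps its whole graph alive
    history.append(loss.item())    # GOOD: a plain Python float, no graph attached
    loss.backward()
    w.grad = None

print(history)  # [4.0, 4.0, 4.0] — just floats, nothing retained
```

The same applies to logging accuracy or any other metric: call .item() (or .detach()) before storing, so Python's references don't pin the computational graph in memory.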
Autograd cheat sheet

```python
import torch

# ─── Creating tracked tensors ──────────────────────────────────
x = torch.randn(3, requires_grad=True)     # vector
W = torch.randn(3, 3, requires_grad=True)  # matrix

# ─── Triggering backprop ───────────────────────────────────────
loss = (W @ x).pow(2).sum()
loss.backward(retain_graph=True)  # populates .grad for all leaves
                                  # (graph retained so the queries below still work)

# ─── Accessing gradients ───────────────────────────────────────
print(x.grad)  # tensor of same shape as x
print(W.grad)  # tensor of same shape as W

# ─── Zeroing gradients ─────────────────────────────────────────
x.grad.zero_()           # in-place zero
# optimizer.zero_grad()  # via optimizer (preferred in a real training loop)

# ─── Disabling gradient tracking ───────────────────────────────
with torch.no_grad():   # block-level
    y = W @ x
y = W @ x               # tracked version, so we can detach it
y_no_grad = y.detach()  # detach a specific tensor

# ─── Surgical gradient queries ─────────────────────────────────
grad_W, = torch.autograd.grad(loss, W, retain_graph=True)

# ─── Higher-order gradients ────────────────────────────────────
grad1, = torch.autograd.grad(loss, x, create_graph=True)
grad2, = torch.autograd.grad(grad1.sum(), x)

# ─── Inspecting the graph ──────────────────────────────────────
print(loss.grad_fn)                 # the last operation
print(loss.grad_fn.next_functions)  # one step back
print(x.is_leaf)                    # True if created directly

# ─── Retaining graph for multiple backward calls ───────────────
loss.backward(retain_graph=True)    # second pass: grads accumulate

# ─── Custom Function ───────────────────────────────────────────
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, g):
        x, = ctx.saved_tensors
        return g * (x > 0).float()

result = MyOp.apply(x)
```

Autograd is one of the most elegant pieces of engineering in modern deep learning. By silently recording every operation in a computational graph, PyTorch can differentiate through arbitrarily complex functions — from a two-parameter line to a billion-parameter language model — using nothing but the chain rule applied node by node. Now that you understand the machinery under the hood, you'll find debugging training loops far more intuitive and the jump to advanced topics like custom layers, meta-learning, and physics-informed networks much less steep.
What to explore next
1. Implement a two-layer MLP using only autograd (no nn.Module) to cement your understanding.
2. Read the official torch.autograd documentation — especially the section on numerical gradient checking with torch.autograd.gradcheck.
3. Explore higher-order gradients in the context of MAML for meta-learning.
Related Articles
Backpropagation and the Chain Rule: A Simple Visual Guide
Learn how backpropagation works through a simple, step-by-step example. Understand the chain rule intuitively with clear visualizations and working code.
What Is a Tensor? A Beginner's Guide with Real Examples
Tensors explained from scratch — no math degree required. Learn what tensors are, why PyTorch uses them, and how to work with them confidently.
Logistic Regression from Scratch in PyTorch: Every Line Explained
Build a multi-class classifier in PyTorch without nn.Linear, without optim.SGD, without CrossEntropyLoss. Just [tensors](/blog/what-is-a-tensor), [autograd](/blog/pytorch-autograd-deep-dive), and arithmetic — so you finally see what those helpers actually do.