
PyTorch Autograd: Automatic Differentiation from the Ground Up
A complete, beginner-friendly guide to PyTorch's autograd engine — from what a gradient is to building a neural network by hand.
Every time a neural network learns something — recognising a cat, translating a sentence, beating you at chess — it does so by computing gradients and nudging its parameters in the right direction. This process is called backpropagation, and in PyTorch it is handled entirely automatically by a subsystem called autograd. You never have to derive a single derivative by hand. In this post we'll build a mental model of how autograd works, play with real examples, and by the end you'll feel completely comfortable using it in your own projects.
What is a gradient? The intuition
Imagine you are standing on a hilly landscape in thick fog. You can't see the valley, but you can feel the slope under your feet. The gradient tells you: "how steeply is the ground rising, and in which direction?" If you always step in the opposite direction of the slope, you'll eventually reach the lowest point — the valley.
In machine learning, the landscape is a loss function — a number that measures how wrong our model's predictions are. The 'ground' is all the model's parameters (weights). The gradient tells us: "if I change each weight by a tiny amount, how much does the loss go up or down?" We then nudge every weight slightly downhill — this is gradient descent.
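That downhill walk can be sketched in a few lines of plain Python, with no PyTorch at all. This is a toy example with a made-up function f(w) = (w - 3)², whose slope we can write down by hand; real training replaces the hand-derived gradient with autograd.

```python
# Minimal gradient descent in one dimension.
# We minimise f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
def grad(w):
    return 2 * (w - 3)

w = 0.0    # start far from the minimum at w = 3
lr = 0.1   # learning rate: how big each downhill step is

for step in range(100):
    w -= lr * grad(w)  # step OPPOSITE to the slope

print(round(w, 4))  # converges to ~3.0
```

Each step shrinks the distance to the minimum by a constant factor (here 0.8), which is why the loop converges so quickly.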
Formal definition (don't be scared)
Formally, for a loss L(w₁, …, wₙ), the gradient ∇L is the vector of partial derivatives (∂L/∂w₁, …, ∂L/∂wₙ). Each component answers one question: holding everything else fixed, how fast does L change as this one parameter varies?
Tensors and requires_grad
In PyTorch, data lives in tensors — think of them as supercharged NumPy arrays. By default, PyTorch doesn't track gradients for a tensor. You have to opt in by setting requires_grad=True. This tells PyTorch: "watch this tensor — I want to know how the final output changes when this value changes."
```python
import torch

# A regular tensor — gradients NOT tracked
x_plain = torch.tensor(3.0)
print(x_plain.requires_grad)  # False

# A tensor WITH gradient tracking turned on
x = torch.tensor(3.0, requires_grad=True)
print(x.requires_grad)  # True

# You can also enable it after creation
y = torch.tensor(5.0)
y.requires_grad_(True)  # in-place toggle
print(y.requires_grad)  # True
```

Why not track everything by default?
Tracking has a cost: every tracked operation records a node in the computational graph, which takes memory and time. Most tensors — raw data, labels, intermediate buffers — never need gradients, so PyTorch makes tracking opt-in.
Let's compute the gradient of a simple polynomial: y = x² + 3x + 1. We know from calculus that dy/dx = 2x + 3, so at x = 2 the gradient should be 2(2) + 3 = 7. Let's verify that PyTorch agrees:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Define the function y = x² + 3x + 1
y = x**2 + 3*x + 1
print(f"y = {y.item()}")  # y = 11.0

# Ask PyTorch to compute dy/dx
y.backward()

# The gradient is stored in x.grad
print(f"dy/dx at x=2: {x.grad.item()}")  # 7.0

# Manual check
expected = 2*2 + 3  # = 7
print(f"Manual check: {expected}")  # 7
```

It works!
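We can also sanity-check autograd against a method that needs no calculus at all: the central finite difference, which approximates a derivative by evaluating the function at two nearby points. This is a quick illustration, not something you'd use in training:

```python
import torch

# Central finite difference: f'(x) ≈ (f(x+h) - f(x-h)) / (2h)
def f(x):
    return x**2 + 3*x + 1

x0, h = 2.0, 1e-4
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Autograd's exact answer for comparison
x = torch.tensor(x0, requires_grad=True)
f(x).backward()

print(numeric, x.grad.item())  # both ≈ 7.0
```

The numerical estimate agrees with autograd to several decimal places; this same idea underlies PyTorch's torch.autograd.gradcheck utility.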
Every time you perform an operation on a requires_grad tensor, PyTorch silently builds a computational graph — a record of exactly what operations were performed and in what order. This graph is what makes automatic differentiation possible.
Think of it as a recipe card that PyTorch writes while you cook. When you call .backward(), PyTorch reads that recipe card backwards — from the final result all the way back to the inputs — applying the chain rule at each step.
Each node in the graph stores a grad_fn — the backward function that knows how to propagate gradients through that specific operation. Let's inspect it:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1

print(y.grad_fn)                 # <AddBackward0 object at 0x...>
print(y.grad_fn.next_functions)  # shows the upstream nodes

# Leaf node: x was created directly, not via an operation
print(x.is_leaf)  # True
print(y.is_leaf)  # False — y was produced by an operation

# Leaf nodes accumulate gradients; non-leaf nodes don't by default
print(x.grad)  # None (before backward)
print(y.grad)  # None (and stays None after backward — y is non-leaf)
```

Leaf vs. non-leaf tensors
A leaf tensor is one you created directly, like x above; anything produced by an operation is non-leaf. By default, only leaves have their .grad populated by .backward().
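If you genuinely want the gradient of an intermediate (non-leaf) tensor — say, for debugging — you can opt in with .retain_grad() before calling backward. A small sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1   # non-leaf: produced by operations
y.retain_grad()      # opt in to storing y's gradient
z = 3 * y
z.backward()

print(y.grad)  # tensor(3.)  — dz/dy = 3
print(x.grad)  # tensor(21.) — dz/dx = 3 * (2x + 3) = 21 at x=2
```

Without the retain_grad() call, y.grad would stay None and PyTorch would warn you for accessing it.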
A neural network has millions of parameters — let's see how autograd handles multiple inputs at once. Consider z = 2x² + y³, where we want both ∂z/∂x and ∂z/∂y:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = 2 * x**2 + y**3
print(f"z = {z.item()}")  # 2*4 + 27 = 35.0

z.backward()

# dz/dx = 4x => at x=2: 4*2 = 8
print(f"dz/dx = {x.grad.item()}")  # 8.0

# dz/dy = 3y² => at y=3: 3*9 = 27
print(f"dz/dy = {y.grad.item()}")  # 27.0
```

One single .backward() call populated the gradients for all participating leaf tensors simultaneously. In a real network, this means one backward pass computes gradients for every weight — no matter how many there are.
Autograd works by applying the chain rule from calculus. The chain rule says: if z depends on y, and y depends on x, then the gradient of z with respect to x is:
dz/dx = (dz/dy) × (dy/dx)
— Leibniz's Chain Rule
PyTorch applies this rule at every node in the computational graph, chaining all the local gradients together as it works its way backward from the output to the inputs. Let's trace this manually for a two-step function:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Two-step function:
#   Step 1: y = x² + 1    =>  dy/dx = 2x
#   Step 2: z = (y - 4)²  =>  dz/dy = 2(y - 4)
# Chain rule: dz/dx = dz/dy * dy/dx
y = x**2 + 1    # y = 10
z = (y - 4)**2  # z = (10 - 4)² = 36

z.backward()

# PyTorch's answer
print(f"dz/dx (autograd): {x.grad.item()}")

# Manual chain rule:
#   dy/dx = 2*3 = 6
#   dz/dy = 2*(10 - 4) = 12
#   dz/dx = 12 * 6 = 72
print(f"dz/dx (manual): {12 * 6}")
```

So far we've used scalar (single-number) tensors. Real networks deal with vectors and matrices. When the output is a vector, .backward() needs a gradient argument — the vector v in the vector-Jacobian product vᵀJ — to tell autograd how to weight each output dimension. The most common case is passing a tensor of ones, which is equivalent to summing the outputs first.
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Element-wise operation: y = x²
y = x ** 2  # y = [1, 4, 9]

# y is a vector, so we must tell backward() how to aggregate.
# Passing torch.ones_like(y) means: compute d(sum(y))/dx
y.backward(torch.ones_like(y))

# d(x²)/dx = 2x, so gradients are [2, 4, 6]
print(x.grad)  # tensor([2., 4., 6.])

# -------------------------------------------
# Shortcut: call .sum() to get a scalar, then .backward() with no args
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = (x2 ** 2).sum()
y2.backward()
print(x2.grad)  # tensor([2., 4., 6.]) — same result
```

Why does the loss need to be a scalar?
A gradient is defined for a scalar-valued function: one number whose sensitivity to each input we measure. For a vector output the derivative is a whole Jacobian matrix, so instead of materialising it, autograd asks you for a vector to multiply it with.
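The vector you pass doesn't have to be all ones. Any weighting works — each entry scales that output's contribution to the gradient. A small sketch with a made-up weight vector v:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # element-wise, so the Jacobian is diag(2x) = diag([2, 4, 6])

v = torch.tensor([1.0, 0.0, 0.5])  # per-output weights (arbitrary choice)
y.backward(v)                      # computes vᵀJ

print(x.grad)  # tensor([2., 0., 3.]) — each 2·x_i scaled by v_i
```

Setting a weight to zero simply switches that output off; this is the same mechanism frameworks use internally to backpropagate from a weighted sum of losses.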
Here's one of the most common bugs for PyTorch beginners: gradients accumulate. Every time you call .backward(), PyTorch adds the new gradients to whatever is already stored in .grad. It does NOT overwrite them. This is useful for some advanced techniques, but in a standard training loop you must zero the gradients manually before every backward pass.
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# --- BUG: calling backward multiple times without zeroing ---
for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Step {i+1}: x.grad = {x.grad.item()}")

# Output:
# Step 1: x.grad = 4.0
# Step 2: x.grad = 8.0   <-- WRONG! Accumulated!
# Step 3: x.grad = 12.0  <-- WRONG!

print("\n--- FIXED VERSION ---")
x.grad = None  # reset

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Step {i+1}: x.grad = {x.grad.item()}")
    x.grad.zero_()  # ← zero AFTER reading the gradient

# Output:
# Step 1: x.grad = 4.0
# Step 2: x.grad = 4.0 ✓
# Step 3: x.grad = 4.0 ✓
```

Always zero_() in your training loop
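Accumulation is a deliberate design choice, not a quirk: it lets you split one large batch into smaller "micro-batches" whose gradients add up to the full-batch gradient (useful when the full batch doesn't fit in memory). A minimal sketch with made-up data:

```python
import torch

torch.manual_seed(0)
w = torch.tensor(1.0, requires_grad=True)
data = torch.randn(8)

# Full-batch gradient in one shot
loss = ((w * data) ** 2).mean()
loss.backward()
full_grad = w.grad.item()
w.grad = None  # reset before the accumulated version

# Same gradient, accumulated over two half-batches
for chunk in data.chunk(2):
    part = ((w * chunk) ** 2).mean() / 2  # divide so the two parts sum to the full mean
    part.backward()                       # gradients ADD into w.grad

print(f"full: {full_grad:.6f}  accumulated: {w.grad.item():.6f}")  # the two match
```

The key detail is the division by the number of chunks, so that the accumulated sum reproduces the full-batch mean exactly.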
During inference (when you're just making predictions, not training), you don't need gradients at all. Disabling gradient tracking saves memory and speeds up computation. There are two main ways to do this:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Inside a no_grad block, operations don't build a graph
with torch.no_grad():
    y = x ** 2 + 1
print(y.requires_grad)  # False
print(y.grad_fn)        # None — no graph was built

# Outside the block, tracking resumes normally
z = x ** 2 + 1
print(z.requires_grad)  # True
```

.detach() creates a new tensor that shares the same data but is completely disconnected from the computational graph. It's like making a copy that has no memory of how it was created.
```python
import torch
import numpy as np

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 1           # y is connected to x in the graph
y_detached = y.detach()  # disconnected copy — same value, no grad_fn

print(y.grad_fn)           # <AddBackward0> — still connected
print(y_detached.grad_fn)  # None — disconnected
print(y_detached.item())   # 10.0 — same value

# Common use case: converting to numpy for plotting
arr = y_detached.numpy()  # works! (detach required when requires_grad=True)
print(arr)  # 10.0 — a 0-dimensional numpy array
```

| Method | How It Works | Use When |
|---|---|---|
| torch.no_grad() | Context manager; ops inside don't track anything | Inference loops, evaluation |
| tensor.detach() | Returns new tensor with same data, no grad history | Extracting values for plotting, logging, or stopping gradient flow mid-graph |
| requires_grad_(False) | Disables tracking on the tensor itself | Permanently marking a tensor (e.g., frozen layer weights) |
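To make the table's last row concrete, here's a sketch of freezing a layer. The model architecture is made up for illustration; the point is that parameters with requires_grad set to False simply drop out of the backward pass:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: its parameters are skipped by backward (and by optimizer.step)
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.randn(3, 4)).sum()
out.backward()

print(model[0].weight.grad)              # None — frozen layer got no gradient
print(model[2].weight.grad is not None)  # True — still trainable
```

This is the standard pattern for fine-tuning: freeze the early layers of a pretrained network and train only the head.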
By default, PyTorch destroys the computational graph after calling .backward() to free memory. If you need to call .backward() more than once on the same graph (rare, but it happens in techniques like computing higher-order gradients), you must tell PyTorch to keep it:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First backward — works fine
y.backward(retain_graph=True)  # keep the graph!
print(f"First backward: x.grad = {x.grad.item()}")  # 12.0

x.grad.zero_()

# Second backward — only works because we retained the graph
y.backward(retain_graph=True)
print(f"Second backward: x.grad = {x.grad.item()}")  # 12.0

# Without retain_graph=True, the second call would raise:
# RuntimeError: Trying to backward through the graph a second time
```

retain_graph=True is expensive
Retaining the graph keeps every intermediate activation alive in memory, so use it only when you genuinely need a second pass.
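One situation where retaining the graph is legitimately needed: two losses that share part of the computation and are backpropagated separately. A minimal sketch (in practice, summing the losses and calling backward once is usually simpler and cheaper):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
shared = x ** 2       # shared subgraph feeding two losses
loss_a = 3 * shared
loss_b = shared + 1

loss_a.backward(retain_graph=True)  # keep the graph: loss_b still needs it
loss_b.backward()                   # graph is freed after this call

# Gradients accumulate: d(loss_a)/dx + d(loss_b)/dx = 12 + 4
print(x.grad)  # tensor(16.)
```

Note how the two backward calls deliberately exploit gradient accumulation — the same behaviour that is a bug in an ordinary training loop is the feature here.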
Because autograd builds a regular computation graph, you can differentiate through the backward pass itself to get second derivatives (and beyond). This is used in techniques like MAML (Model-Agnostic Meta-Learning). Use create_graph=True to make the backward pass itself differentiable:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# f(x) = x⁴  =>  f'(x) = 4x³  =>  f''(x) = 12x²
f = x ** 4

# First derivative — create_graph=True makes the gradient itself differentiable
df_dx = torch.autograd.grad(f, x, create_graph=True)[0]
print(f"f'(2) = {df_dx.item()}")  # 4 * 8 = 32.0

# Second derivative: differentiate the first derivative
d2f_dx2 = torch.autograd.grad(df_dx, x)[0]
print(f"f''(2) = {d2f_dx2.item()}")  # 12 * 4 = 48.0
```

While .backward() populates .grad for all leaf tensors in the graph, torch.autograd.grad() lets you request the gradient of a specific output with respect to specific inputs. It returns a tuple of gradients and doesn't touch .grad at all — great for custom training logic.
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
loss = x**2 + y**2  # squared Euclidean distance from the origin

# Only get the gradient w.r.t. x — y.grad is not touched.
# (retain_graph=True so we can query the same graph again below)
grad_x, = torch.autograd.grad(loss, x, retain_graph=True)
print(f"dloss/dx = {grad_x.item()}")  # 2*3 = 6.0
print(f"y.grad = {y.grad}")           # None — untouched

# Get both at once
grad_x, grad_y = torch.autograd.grad(loss, [x, y])
print(f"dloss/dx = {grad_x.item()}")  # 6.0
print(f"dloss/dy = {grad_y.item()}")  # 2*4 = 8.0
```

Sometimes you need an operation whose gradient is not natively defined in PyTorch — perhaps a custom activation function, or an operation that wraps a CUDA kernel. You can teach PyTorch how to differentiate it by subclassing torch.autograd.Function and defining both a forward and backward method.
```python
import torch

# Let's implement a "scaled sigmoid": f(x) = 2 * sigmoid(x)
# We'll do it the hard way (custom Function) for illustration
class ScaledSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx (context) stores values needed for backward
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig)  # save sigmoid(x) for reuse
        return 2 * sig

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is the gradient flowing in FROM the layer above
        sig, = ctx.saved_tensors
        # d/dx [2 * sigmoid(x)] = 2 * sigmoid(x) * (1 - sigmoid(x))
        return grad_output * 2 * sig * (1 - sig)

# Usage — same as any built-in op
x = torch.tensor(1.0, requires_grad=True)
y = ScaledSigmoid.apply(x)
y.backward()
print(f"y = {y.item():.4f}")           # 1.4621
print(f"dy/dx = {x.grad.item():.4f}")  # 0.3932

# Verify with PyTorch's built-in autograd
x2 = torch.tensor(1.0, requires_grad=True)
y2 = 2 * torch.sigmoid(x2)
y2.backward()
print(f"dy/dx (builtin) = {x2.grad.item():.4f}")  # 0.3932 — matches!
```

When should you use custom Function?
Only when autograd can't help you: wrapping a non-PyTorch kernel, handling an operation with no differentiable implementation, or hand-writing a backward that saves memory. For ordinary compositions of PyTorch ops — like this one — plain code is simpler and gives the same gradients.
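Whenever you write a custom backward, it's worth verifying it mechanically. PyTorch ships torch.autograd.gradcheck, which compares your analytic backward against finite differences; it expects double-precision inputs with requires_grad=True. A sketch, redefining the same ScaledSigmoid so the snippet stands alone:

```python
import torch

class ScaledSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig)
        return 2 * sig

    @staticmethod
    def backward(ctx, grad_output):
        sig, = ctx.saved_tensors
        return grad_output * 2 * sig * (1 - sig)

# gradcheck needs float64 for the finite-difference comparison to be tight enough
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(ScaledSigmoid.apply, (x,)))  # True
```

If the backward formula were wrong, gradcheck would raise a detailed error showing the analytic and numerical Jacobians side by side.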
Let's write a complete, minimal training loop using only autograd — no nn.Module, no optimizer — to really understand what is happening under the hood. We'll fit a line y = 2x + 1 from noisy data.
```python
import torch
torch.manual_seed(42)

# ── 1. Generate synthetic data: y = 2x + 1 + noise ──
x_data = torch.linspace(0, 1, 100)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(100)

# ── 2. Initialise parameters ──
w = torch.tensor(0.0, requires_grad=True)  # true value should be ~2
b = torch.tensor(0.0, requires_grad=True)  # true value should be ~1
lr = 0.1  # learning rate

# ── 3. Training loop ──
for epoch in range(200):
    # Forward pass: compute predictions
    y_pred = w * x_data + b

    # Compute loss (Mean Squared Error)
    loss = ((y_pred - y_data) ** 2).mean()

    # Backward pass: compute gradients
    loss.backward()

    # Update parameters (gradient descent step)
    with torch.no_grad():  # don't track these assignments
        w -= lr * w.grad
        b -= lr * b.grad

    # Zero gradients BEFORE the next iteration
    w.grad.zero_()
    b.grad.zero_()

    if epoch % 40 == 0:
        print(f"Epoch {epoch:3d} | loss={loss.item():.4f} | w={w.item():.3f} | b={b.item():.3f}")

print(f"\nFinal: w={w.item():.3f} (true: 2.0), b={b.item():.3f} (true: 1.0)")
```

Running this produces output like the following, showing the weights converging toward the true values of w=2 and b=1:
```
Epoch   0 | loss=2.1378 | w=0.274 | b=0.148
Epoch  40 | loss=0.0200 | w=1.706 | b=0.807
Epoch  80 | loss=0.0107 | w=1.927 | b=0.942
Epoch 120 | loss=0.0101 | w=1.977 | b=0.983
Epoch 160 | loss=0.0100 | w=1.993 | b=0.995

Final: w=1.998 (true: 2.0), b=0.999 (true: 1.0)
```

What just happened?
That loop is the essence of all training: forward pass, loss, backward pass, parameter update, zero gradients. Every abstraction PyTorch offers is built on those five steps.
In practice you use nn.Module and torch.optim instead of managing requires_grad and zero_() manually. Here's the same linear regression rewritten the "PyTorch way" — notice the loop is structurally identical, just with more convenient abstractions:
```python
import torch
import torch.nn as nn
torch.manual_seed(42)

# Same synthetic data
x_data = torch.linspace(0, 1, 100).unsqueeze(1)  # shape (100, 1)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(100, 1)

# Model: a single linear layer (manages w and b internally, requires_grad=True by default)
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()           # ← replaces w.grad.zero_() and b.grad.zero_()
    y_pred = model(x_data)          # forward pass
    loss = loss_fn(y_pred, y_data)
    loss.backward()                 # backward pass
    optimizer.step()                # ← replaces w -= lr * w.grad etc.

    if epoch % 40 == 0:
        w, b = model.weight.item(), model.bias.item()
        print(f"Epoch {epoch:3d} | loss={loss.item():.4f} | w={w:.3f} | b={b:.3f}")
```

The only difference is ergonomics. Under the hood, model.parameters() returns tensors with requires_grad=True, optimizer.zero_grad() clears each parameter's .grad, and optimizer.step() applies the weight update. Autograd is doing the same work it always was.
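For completeness, here's the matching inference pattern for a model like this — a small sketch combining model.eval() (which switches layers such as dropout and batch norm to evaluation behaviour) with torch.no_grad() (which skips graph building entirely):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in for a trained model

model.eval()              # evaluation mode for mode-dependent layers
with torch.no_grad():     # no graph, no gradient bookkeeping
    preds = model(torch.linspace(0, 1, 5).unsqueeze(1))

print(preds.requires_grad)  # False — safe to log, plot, or convert to numpy
```

The two calls do different jobs and you typically want both: eval() changes layer behaviour, no_grad() saves memory and compute.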
| Pitfall | Symptom | Fix |
|---|---|---|
| Forgetting zero_grad() | Loss oscillates wildly or diverges; gradients grow each step | Call optimizer.zero_grad() at the start of every iteration |
| In-place ops on leaf tensors | RuntimeError about in-place operation on a leaf variable | Use out-of-place ops: x = x + 1 instead of x += 1 |
| Calling .backward() twice | RuntimeError: graph freed | Use retain_graph=True if you genuinely need two passes |
| Converting to NumPy without detach | RuntimeError: tensor requires grad | Use tensor.detach().numpy() or wrap in torch.no_grad() |
| Accumulating loss in a list | Memory leak — full graph kept alive by Python list | Store loss.item() (a plain float), not the tensor itself |
| Updating weights inside autograd scope | Gradients contaminated by weight-update operations | Always update weights inside torch.no_grad() or use optimizer.step() |
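The "accumulating loss in a list" pitfall deserves a quick demonstration, since it causes silent memory growth rather than an error. A minimal sketch:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
history = []

for _ in range(3):
    loss = (w * 2) ** 2
    # BAD:  history.append(loss)   — each appended tensor keeps its whole graph alive
    history.append(loss.item())    # GOOD: a plain Python float, no graph attached
    loss.backward()
    w.grad = None

print(history)  # [4.0, 4.0, 4.0] — just floats, nothing retained
```

The same applies to logging accuracy or any other metric: call .item() (or .detach()) before storing, so Python's references don't pin the computational graph in memory.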
Autograd cheat sheet

```python
import torch

# ─── Creating tracked tensors ──────────────────────────────────
x = torch.randn(3, requires_grad=True)     # vector
W = torch.randn(3, 3, requires_grad=True)  # matrix

# ─── Triggering backprop ───────────────────────────────────────
loss = (W @ x).pow(2).sum()
loss.backward(retain_graph=True)  # populates .grad for all leaves
                                  # (graph retained so the queries below still work)

# ─── Accessing gradients ───────────────────────────────────────
print(x.grad)  # tensor of same shape as x
print(W.grad)  # tensor of same shape as W

# ─── Zeroing gradients ─────────────────────────────────────────
x.grad.zero_()           # in-place zero
# optimizer.zero_grad()  # via optimizer (preferred in a real training loop)

# ─── Disabling gradient tracking ───────────────────────────────
with torch.no_grad():   # block-level
    y = W @ x
y = W @ x               # tracked version, so we can detach it
y_no_grad = y.detach()  # detach a specific tensor

# ─── Surgical gradient queries ─────────────────────────────────
grad_W, = torch.autograd.grad(loss, W, retain_graph=True)

# ─── Higher-order gradients ────────────────────────────────────
grad1, = torch.autograd.grad(loss, x, create_graph=True)
grad2, = torch.autograd.grad(grad1.sum(), x)

# ─── Inspecting the graph ──────────────────────────────────────
print(loss.grad_fn)                 # the last operation
print(loss.grad_fn.next_functions)  # one step back
print(x.is_leaf)                    # True if created directly

# ─── Retaining graph for multiple backward calls ───────────────
loss.backward(retain_graph=True)    # second pass: grads accumulate

# ─── Custom Function ───────────────────────────────────────────
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, g):
        x, = ctx.saved_tensors
        return g * (x > 0).float()

result = MyOp.apply(x)
```

Autograd is one of the most elegant pieces of engineering in modern deep learning. By silently recording every operation in a computational graph, PyTorch can differentiate through arbitrarily complex functions — from a two-parameter line to a billion-parameter language model — using nothing but the chain rule applied node by node. Now that you understand the machinery under the hood, you'll find debugging training loops far more intuitive and the jump to advanced topics like custom layers, meta-learning, and physics-informed networks much less steep.
What to explore next
1. Implement a two-layer MLP using only autograd (no nn.Module) to cement your understanding.
2. Read the official torch.autograd documentation — especially the section on numerical gradient checking with torch.autograd.gradcheck.
3. Explore higher-order gradients in the context of MAML for meta-learning.
Related Articles
Backpropagation and the Chain Rule: A Simple Visual Guide
Learn how backpropagation works through a simple, step-by-step example. Understand the chain rule intuitively with clear visualizations and working code.
What Is a Tensor? A Beginner's Guide with Real Examples
Tensors explained from scratch — no math degree required. Learn what tensors are, why PyTorch uses them, and how to work with them confidently.
Logistic Regression from Scratch in PyTorch: Every Line Explained
Build a multi-class classifier in PyTorch without nn.Linear, without optim.SGD, without CrossEntropyLoss. Just [tensors](/blog/what-is-a-tensor), [autograd](/blog/pytorch-autograd-deep-dive), and arithmetic — so you finally see what those helpers actually do.