
Backpropagation and the Chain Rule: A Simple Visual Guide
Learn how backpropagation works through a simple, step-by-step example. Understand the chain rule intuitively with clear visualizations and working code.
Backpropagation sounds intimidating, but it's actually a simple idea: calculate how much each part of your neural network contributed to the error, then adjust accordingly. In this post, we'll build intuition from the ground up using a concrete example you can follow step by step.
What You'll Learn
Imagine you're baking a cake and it turns out too sweet. You need to figure out which ingredient to adjust. Was it the sugar? The vanilla? The frosting? Backpropagation does exactly this for neural networks—it traces back through the recipe (the network) to find out which 'ingredients' (weights) caused the error.
The Process:
- Forward Pass: Feed input through the network to get a prediction
- Calculate Error: Compare prediction to the actual answer
- Backward Pass: Trace back to find how much each weight contributed to the error
- Update Weights: Adjust weights to reduce the error
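The four steps above can be sketched in a few lines of Python for a toy one-weight model (the numbers here are illustrative, not yet the house example):

```python
# Toy model: y = w * x, squared loss. One full training step.
w = 0.5          # a single weight
x, y_true = 2.0, 3.0

# 1. Forward pass: feed input through to get a prediction
y_pred = w * x

# 2. Calculate error: compare prediction to the actual answer
loss = 0.5 * (y_true - y_pred) ** 2

# 3. Backward pass: how much did w contribute to the error?
#    dL/dw = -(y_true - y_pred) * x
grad = -(y_true - y_pred) * x

# 4. Update the weight to reduce the error (learning rate 0.1)
w = w - 0.1 * grad
print(f"loss = {loss}, gradient = {grad}, new w = {w}")
```

Every neural network training loop, no matter how large, repeats exactly these four steps.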
Let's build the simplest possible neural network: one that predicts house prices based on size. We'll use this tiny network to understand backpropagation completely.
Our Network:
- Input: House size (in 1000 sq ft)
- Hidden Layer: 1 neuron
- Output: Predicted price (in $100k)
Let's walk through a concrete example with actual numbers.
Given:
- Input: x = 2.0 (house is 2000 sq ft)
- Weight 1: w₁ = 0.5
- Weight 2: w₂ = 1.0
- True price: y = 3.0 (actually costs $300k)
Forward Pass Calculations:
Hidden layer (with ReLU activation):
z₁ = w₁ × x = 0.5 × 2.0 = 1.0
h = ReLU(z₁) = max(0, 1.0) = 1.0
Output layer:
ŷ = w₂ × h = 1.0 × 1.0 = 1.0
Error (Loss):
L = 0.5 × (y − ŷ)² = 0.5 × (3.0 − 1.0)² = 2.0
Our Prediction is Wrong!
We predicted $100k, but the house actually costs $300k, giving a loss of 2.0. Backpropagation will tell us how to fix the weights.
import numpy as np
# Network parameters
x = 2.0 # Input: 2000 sq ft
w1 = 0.5 # Weight 1
w2 = 1.0 # Weight 2
y_true = 3.0 # True price: $300k
# Forward pass
z1 = w1 * x
print(f"z1 = w1 * x = {w1} * {x} = {z1}")
# ReLU activation
h = max(0, z1)
print(f"h = ReLU(z1) = max(0, {z1}) = {h}")
# Output
y_pred = w2 * h
print(f"ŷ = w2 * h = {w2} * {h} = {y_pred}")
# Loss
loss = 0.5 * (y_true - y_pred)**2
print(f"\nLoss = 0.5 * (y - ŷ)² = 0.5 * ({y_true} - {y_pred})² = {loss}")
print(f"\nWe predicted ${y_pred * 100}k but actual is ${y_true * 100}k")
print(f"Error: ${abs(y_true - y_pred) * 100}k")

Before we do backpropagation, we need to understand the chain rule. It's simpler than it sounds!
The Chain Rule in Plain English:
If A affects B, and B affects C, then to find how A affects C, you multiply the effects:
dC/dA = (dC/dB) × (dB/dA)
Real-World Analogy
Press the gas pedal harder and the engine spins faster; spin the engine faster and the car speeds up. The pedal's effect on speed is the product of the two effects: (effect of pedal on engine) × (effect of engine on speed).
Example with Numbers:
Say we have y = (2x + 1)² and we want dy/dx at x = 1.
Break it down:
- Let u = 2x + 1, so y = u²
- dy/du = 2u and du/dx = 2, so dy/dx = (dy/du) × (du/dx) = 2u × 2 = 4u
At x = 1: u = 2(1) + 1 = 3, so dy/dx = 4 × 3 = 12
# Chain rule example: y = (2x + 1)²
def f(x):
"""Forward pass"""
u = 2*x + 1
y = u**2
return y, u
def df_dx(x):
"""Derivative using chain rule"""
_, u = f(x)
# dy/du = 2u
dy_du = 2 * u
# du/dx = 2
du_dx = 2
# Chain rule: dy/dx = dy/du * du/dx
dy_dx = dy_du * du_dx
return dy_dx
# Test at x = 1
x = 1
y, u = f(x)
derivative = df_dx(x)
print(f"At x = {x}:")
print(f"u = 2x + 1 = {u}")
print(f"y = u² = {y}")
print(f"\ndy/du = 2u = {2*u}")
print(f"du/dx = 2")
print(f"dy/dx = dy/du * du/dx = {derivative}")

Now let's apply the chain rule to our neural network. We'll work backwards from the loss to find how each weight contributed to the error.
Goal: Find ∂L/∂w₁ and ∂L/∂w₂
The loss depends on w₂ through this chain:
w₂ → ŷ → L
Using the chain rule:
∂L/∂w₂ = (∂L/∂ŷ) × (∂ŷ/∂w₂)
Step 1: Find ∂L/∂ŷ
∂L/∂ŷ = −(y − ŷ) = −(3.0 − 1.0) = −2.0
Step 2: Find ∂ŷ/∂w₂
∂ŷ/∂w₂ = h = 1.0
Step 3: Multiply them (chain rule)
∂L/∂w₂ = −2.0 × 1.0 = −2.0
What Does This Mean?
The gradient ∂L/∂w₂ = −2.0 is negative: increasing w₂ would decrease the loss. Gradient descent will therefore push w₂ up.
This is trickier because w₁ affects the loss through a longer chain:
w₁ → z₁ → h → ŷ → L
∂L/∂w₁ = (∂L/∂ŷ) × (∂ŷ/∂h) × (∂h/∂z₁) × (∂z₁/∂w₁)
Step 1: We already know ∂L/∂ŷ = −2.0
Step 2: Find ∂ŷ/∂h
∂ŷ/∂h = w₂ = 1.0
Step 3: Find ∂h/∂z₁ (ReLU derivative)
Since z₁ = 1.0 > 0, we have ∂h/∂z₁ = 1
Step 4: Find ∂z₁/∂w₁
∂z₁/∂w₁ = x = 2.0
Step 5: Multiply all together
∂L/∂w₁ = −2.0 × 1.0 × 1 × 2.0 = −4.0
Interpretation
∂L/∂w₁ = −4.0 has twice the magnitude of ∂L/∂w₂ = −2.0, so a small change in w₁ moves the loss twice as much. And because both gradients are negative, both weights should increase.
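It's worth sanity-checking hand-derived gradients numerically. The sketch below (an addition to the post, not part of the original derivation) approximates each gradient with a central finite difference, (L(w + ε) − L(w − ε)) / 2ε, and compares it to the chain-rule results:

```python
# Finite-difference check of the chain-rule gradients.
def loss_fn(w1, w2, x=2.0, y_true=3.0):
    """Forward pass straight to the loss."""
    z1 = w1 * x
    h = max(0.0, z1)          # ReLU
    y_pred = w2 * h
    return 0.5 * (y_true - y_pred) ** 2

eps = 1e-6
w1, w2 = 0.5, 1.0

# Central differences: nudge one weight at a time
num_dw1 = (loss_fn(w1 + eps, w2) - loss_fn(w1 - eps, w2)) / (2 * eps)
num_dw2 = (loss_fn(w1, w2 + eps) - loss_fn(w1, w2 - eps)) / (2 * eps)

print(f"numerical dL/dw1 = {num_dw1:.4f}  (chain rule: -4)")
print(f"numerical dL/dw2 = {num_dw2:.4f}  (chain rule: -2)")
```

If the two disagree, the analytical derivation has a bug; this trick works for networks of any size.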
import numpy as np
# Forward pass values (from before)
x = 2.0
w1 = 0.5
w2 = 1.0
y_true = 3.0
z1 = w1 * x # 1.0
h = max(0, z1) # 1.0 (ReLU)
y_pred = w2 * h # 1.0
loss = 0.5 * (y_true - y_pred)**2 # 2.0
print("=" * 50)
print("BACKWARD PASS (Backpropagation)")
print("=" * 50)
# Backward pass
print("\n1. Gradient of loss w.r.t. prediction:")
dL_dy_pred = -(y_true - y_pred)
print(f" ∂L/∂ŷ = -(y - ŷ) = -({y_true} - {y_pred}) = {dL_dy_pred}")
# Gradient for w2
print("\n2. Gradient for w2:")
dy_pred_dw2 = h
print(f" ∂ŷ/∂w2 = h = {dy_pred_dw2}")
dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f" ∂L/∂w2 = ∂L/∂ŷ × ∂ŷ/∂w2 = {dL_dy_pred} × {dy_pred_dw2} = {dL_dw2}")
# Gradient for w1 (through the chain)
print("\n3. Gradient for w1 (longer chain):")
dy_pred_dh = w2
print(f" ∂ŷ/∂h = w2 = {dy_pred_dh}")
# ReLU derivative
dh_dz1 = 1 if z1 > 0 else 0
print(f" ∂h/∂z1 = {dh_dz1} (since z1 = {z1} > 0)")
dz1_dw1 = x
print(f" ∂z1/∂w1 = x = {dz1_dw1}")
dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
print(f" ∂L/∂w1 = {dL_dy_pred} × {dy_pred_dh} × {dh_dz1} × {dz1_dw1} = {dL_dw1}")
print("\n" + "=" * 50)
print("SUMMARY")
print("=" * 50)
print(f"Gradient for w1: {dL_dw1}")
print(f"Gradient for w2: {dL_dw2}")
print(f"\nw1 has a bigger effect on the loss!")

Now that we know the gradients, we can update our weights using gradient descent:
w ← w − α × ∂L/∂w
where α is the learning rate (let's use α = 0.1)
Update w₁:
w₁ = 0.5 − 0.1 × (−4.0) = 0.5 + 0.4 = 0.9
Update w₂:
w₂ = 1.0 − 0.1 × (−2.0) = 1.0 + 0.2 = 1.2
Both Weights Increased!
Because both gradients were negative, subtracting α × gradient adds to each weight: w₁ went from 0.5 to 0.9 and w₂ from 1.0 to 1.2.
# Forward-pass values from before
x = 2.0
y_true = 3.0
y_pred = 1.0  # old prediction
loss = 2.0    # old loss
# Gradients from backprop
dL_dw1 = -4.0
dL_dw2 = -2.0
# Current weights
w1_old = 0.5
w2_old = 1.0
# Learning rate
alpha = 0.1
# Update weights
w1_new = w1_old - alpha * dL_dw1
w2_new = w2_old - alpha * dL_dw2
print("Weight Updates:")
print(f"w1: {w1_old} → {w1_new} (change: +{w1_new - w1_old})")
print(f"w2: {w2_old} → {w2_new} (change: +{w2_new - w2_old})")
# Test new prediction
z1_new = w1_new * x
h_new = max(0, z1_new)
y_pred_new = w2_new * h_new
loss_new = 0.5 * (y_true - y_pred_new)**2
print(f"\nOld prediction: ${y_pred * 100}k (loss: {loss})")
print(f"New prediction: ${y_pred_new * 100}k (loss: {loss_new})")
print(f"\nLoss decreased by: {loss - loss_new:.4f}")
print(f"We're getting closer to the true price of ${y_true * 100}k!")

Let's put it all together in a complete training loop:
import numpy as np
import matplotlib.pyplot as plt
class TinyNetwork:
def __init__(self):
# Initialize weights (fixed values so results match the walkthrough above)
self.w1 = 0.5
self.w2 = 1.0
self.learning_rate = 0.1
def relu(self, x):
return max(0, x)
def relu_derivative(self, x):
return 1 if x > 0 else 0
def forward(self, x):
"""Forward pass"""
self.x = x
self.z1 = self.w1 * x
self.h = self.relu(self.z1)
self.y_pred = self.w2 * self.h
return self.y_pred
def backward(self, y_true):
"""Backward pass (backpropagation)"""
# Gradient of loss w.r.t. prediction
dL_dy_pred = -(y_true - self.y_pred)
# Gradient for w2
dy_pred_dw2 = self.h
dL_dw2 = dL_dy_pred * dy_pred_dw2
# Gradient for w1 (through the chain)
dy_pred_dh = self.w2
dh_dz1 = self.relu_derivative(self.z1)
dz1_dw1 = self.x
dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
return dL_dw1, dL_dw2
def update_weights(self, dL_dw1, dL_dw2):
"""Update weights using gradient descent"""
self.w1 -= self.learning_rate * dL_dw1
self.w2 -= self.learning_rate * dL_dw2
def train_step(self, x, y_true):
"""One complete training step"""
# Forward pass
y_pred = self.forward(x)
# Calculate loss
loss = 0.5 * (y_true - y_pred)**2
# Backward pass
dL_dw1, dL_dw2 = self.backward(y_true)
# Update weights
self.update_weights(dL_dw1, dL_dw2)
return loss, y_pred
# Training
network = TinyNetwork()
x = 2.0 # 2000 sq ft
y_true = 3.0 # $300k
losses = []
predictions = []
print("Training Progress:")
print("=" * 60)
for epoch in range(20):
loss, y_pred = network.train_step(x, y_true)
losses.append(loss)
predictions.append(y_pred)
if epoch % 5 == 0:
print(f"Epoch {epoch:2d}: Loss = {loss:.4f}, "
f"Prediction = ${y_pred*100:.1f}k, "
f"w1 = {network.w1:.3f}, w2 = {network.w2:.3f}")
print("=" * 60)
print(f"\nFinal prediction: ${predictions[-1]*100:.1f}k")
print(f"True price: ${y_true*100:.1f}k")
print(f"Final error: ${abs(y_true - predictions[-1])*100:.1f}k")
# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Loss Over Time')
ax1.grid(True, alpha=0.3)
ax2.plot(predictions, label='Prediction')
ax2.axhline(y=y_true, color='r', linestyle='--', label='True Value')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Price ($100k)')
ax2.set_title('Predictions Over Time')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

- Backpropagation is just the chain rule applied systematically to find gradients
- Forward pass computes predictions; backward pass computes gradients
- Gradients tell us direction: negative gradient means increase weight, positive means decrease
- The chain rule multiplies local derivatives as we trace back through the network
- Each layer only needs to know its local derivative—this is what makes backprop scalable
- ReLU derivative is simple: 1 if input > 0, else 0
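To hint at how this scales, here is a sketch (my own addition, with made-up batch numbers) of the same forward and backward pass vectorized over several houses at once with NumPy. Each local derivative is identical; we just compute it elementwise and average:

```python
import numpy as np

# The same two-weight network, vectorized over a batch of houses.
x = np.array([2.0, 1.5, 3.0])       # sizes in 1000 sq ft
y_true = np.array([3.0, 2.2, 4.1])  # true prices in $100k
w1, w2 = 0.5, 1.0

# Forward pass (elementwise over the batch)
z1 = w1 * x
h = np.maximum(0.0, z1)             # ReLU
y_pred = w2 * h
loss = 0.5 * np.mean((y_true - y_pred) ** 2)

# Backward pass: same local derivatives, averaged over the batch
dL_dy = -(y_true - y_pred)
dL_dw2 = np.mean(dL_dy * h)
dL_dw1 = np.mean(dL_dy * w2 * (z1 > 0) * x)

print(f"batch loss = {loss:.4f}")
print(f"dL/dw1 = {dL_dw1:.4f}, dL/dw2 = {dL_dw2:.4f}")
```

Note that `(z1 > 0)` is just the ReLU derivative applied to every example at once; this is exactly what deep learning frameworks do under the hood.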
Practice Exercise
Try it yourself: change the input to x = 1.5 (a 1500 sq ft house), redo the forward pass and both gradient calculations by hand, then check your numbers against the code above.
Backpropagation isn't magic—it's a systematic application of the chain rule. By breaking the network into small pieces and computing local derivatives, we can efficiently find how every weight contributes to the error. This same principle scales from our tiny 2-weight network to massive models with billions of parameters.
The key insight: you don't need to understand the entire network at once. Each layer just needs to know its own derivative, and the chain rule connects everything together. That's the beauty of backpropagation!
Next Steps
To see how frameworks automate all of this bookkeeping for you, check out the related articles below.
Related Articles
PyTorch Autograd: Automatic Differentiation from the Ground Up
A complete, beginner-friendly guide to PyTorch's autograd engine — from what a gradient is to building a neural network by hand.
What Is a Tensor? A Beginner's Guide with Real Examples
Tensors explained from scratch — no math degree required. Learn what tensors are, why PyTorch uses them, and how to work with them confidently.
Logistic Regression from Scratch in PyTorch: Every Line Explained
Build a multi-class classifier in PyTorch without nn.Linear, without optim.SGD, without CrossEntropyLoss. Just [tensors](/blog/what-is-a-tensor), [autograd](/blog/pytorch-autograd-deep-dive), and arithmetic — so you finally see what those helpers actually do.