Backpropagation and the Chain Rule: A Simple Visual Guide


Learn how backpropagation works through a simple, step-by-step example. Understand the chain rule intuitively with clear visualizations and working code.

12 min read

Backpropagation sounds intimidating, but it's actually a simple idea: calculate how much each part of your neural network contributed to the error, then adjust accordingly. In this post, we'll build intuition from the ground up using a concrete example you can follow step by step.

What You'll Learn

By the end of this post, you'll understand: what backpropagation really does, how the chain rule makes it possible, how to trace gradients through a simple network, and how to implement it from scratch in Python.

Imagine you're baking a cake and it turns out too sweet. You need to figure out which ingredient to adjust. Was it the sugar? The vanilla? The frosting? Backpropagation does exactly this for neural networks—it traces back through the recipe (the network) to find out which 'ingredients' (weights) caused the error.

The Process:

  1. Forward Pass: Feed input through the network to get a prediction
  2. Calculate Error: Compare prediction to the actual answer
  3. Backward Pass: Trace back to find how much each weight contributed to the error
  4. Update Weights: Adjust weights to reduce the error
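
The four steps above can be sketched as one generic training loop. Here is a minimal, self-contained sketch using a toy one-weight model (the function names and the model are illustrative, not the network we build below):

```python
def forward(w, x):
    """Step 1: forward pass for a toy one-weight model, y = w * x."""
    return w * x

def backward(w, x, y_true):
    """Step 3: gradient of L = 0.5 * (y_true - w*x)^2 with respect to w."""
    return -(y_true - forward(w, x)) * x

def train_step(w, x, y_true, lr=0.1):
    y_pred = forward(w, x)               # 1. Forward pass
    loss = 0.5 * (y_true - y_pred) ** 2  # 2. Calculate error
    grad = backward(w, x, y_true)        # 3. Backward pass
    w = w - lr * grad                    # 4. Update weight
    return w, loss

w, loss = train_step(0.5, 2.0, 3.0)
print(w, loss)  # w ≈ 0.9, loss = 2.0 (loss is computed before the update)
```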

Let's build the simplest possible neural network: one that predicts house prices based on size. We'll use this tiny network to understand backpropagation completely.

Our Network:

  • Input: House size (in 1000 sq ft)
  • Hidden Layer: 1 neuron
  • Output: Predicted price (in $100k)

Let's walk through a concrete example with actual numbers.

Given:

  • Input: x = 2 (the house is 2000 sq ft)
  • Weight 1: w₁ = 0.5
  • Weight 2: w₂ = 1.0
  • True price: y = 3 (the house actually costs $300k)

Forward Pass Calculations:

Hidden layer (with ReLU activation):

z₁ = w₁ × x = 0.5 × 2 = 1.0

h = ReLU(z₁) = max(0, 1.0) = 1.0

Output layer:

ŷ = w₂ × h = 1.0 × 1.0 = 1.0

Error (Loss):

L = ½(y - ŷ)² = ½(3 - 1)² = 2.0

Our Prediction is Wrong!

We predicted $100k but the house actually costs $300k. We're off by $200k! Now we need to figure out how to adjust our weights to fix this.
forward_pass.py
import numpy as np

# Network parameters
x = 2.0  # Input: 2000 sq ft
w1 = 0.5  # Weight 1
w2 = 1.0  # Weight 2
y_true = 3.0  # True price: $300k

# Forward pass
z1 = w1 * x
print(f"z1 = w1 * x = {w1} * {x} = {z1}")

# ReLU activation
h = max(0, z1)
print(f"h = ReLU(z1) = max(0, {z1}) = {h}")

# Output
y_pred = w2 * h
print(f"ŷ = w2 * h = {w2} * {h} = {y_pred}")

# Loss
loss = 0.5 * (y_true - y_pred)**2
print(f"\nLoss = 0.5 * (y - ŷ)² = 0.5 * ({y_true} - {y_pred})² = {loss}")
print(f"\nWe predicted ${y_pred * 100}k but actual is ${y_true * 100}k")
print(f"Error: ${abs(y_true - y_pred) * 100}k")

Before we do backpropagation, we need to understand the chain rule. It's simpler than it sounds!

The Chain Rule in Plain English:

If A affects B, and B affects C, then to find how A affects C, you multiply the effects:

dC/dA = dC/dB × dB/dA

Real-World Analogy

Think of a car: pressing the gas pedal (A) increases engine RPM (B), which increases speed (C). To know how the pedal affects speed, you multiply: (speed change per RPM) × (RPM change per pedal press).

Example with Numbers:

Say we have y = (2x + 1)² and we want dy/dx at x = 1.

Break it down:

  1. Let u = 2x + 1, so y = u²
  2. dy/du = 2u
  3. du/dx = 2
  4. dy/dx = dy/du × du/dx = 2u × 2 = 4u

At x = 1: u = 3, so dy/dx = 4 × 3 = 12

chain_rule_example.py
# Chain rule example: y = (2x + 1)²

def f(x):
    """Forward pass"""
    u = 2*x + 1
    y = u**2
    return y, u

def df_dx(x):
    """Derivative using chain rule"""
    _, u = f(x)
    
    # dy/du = 2u
    dy_du = 2 * u
    
    # du/dx = 2
    du_dx = 2
    
    # Chain rule: dy/dx = dy/du * du/dx
    dy_dx = dy_du * du_dx
    
    return dy_dx

# Test at x = 1
x = 1
y, u = f(x)
derivative = df_dx(x)

print(f"At x = {x}:")
print(f"u = 2x + 1 = {u}")
print(f"y = u² = {y}")
print(f"\ndy/du = 2u = {2*u}")
print(f"du/dx = 2")
print(f"dy/dx = dy/du * du/dx = {derivative}")
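
A quick way to sanity-check the analytic answer is a central finite difference (a standard trick, not part of the original walkthrough): nudge x a little and measure how y moves.

```python
def f(x):
    """Same function as above: y = (2x + 1)^2."""
    return (2 * x + 1) ** 2

# Central difference: (f(x+eps) - f(x-eps)) / (2*eps) approximates dy/dx
eps = 1e-6
x = 1.0
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(numeric)  # ≈ 12, matching the chain-rule result
```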

Now let's apply the chain rule to our neural network. We'll work backwards from the loss to find how each weight contributed to the error.

Goal: Find ∂L/∂w₁ and ∂L/∂w₂

The loss depends on w₂ through this chain: L → ŷ → w₂

Using the chain rule:

∂L/∂w₂ = ∂L/∂ŷ × ∂ŷ/∂w₂

Step 1: Find ∂L/∂ŷ

L = ½(y - ŷ)²

∂L/∂ŷ = -(y - ŷ) = -(3 - 1) = -2

Step 2: Find ∂ŷ/∂w₂

ŷ = w₂ × h

∂ŷ/∂w₂ = h = 1.0

Step 3: Multiply them (chain rule)

∂L/∂w₂ = -2 × 1.0 = -2.0

What Does This Mean?

The gradient of -2.0 means: if we increase w₂ by a tiny amount, the loss will decrease by 2× that amount. So we should increase w₂ to reduce our error!

This is trickier because w₁ affects the loss through a longer chain: L → ŷ → h → z₁ → w₁

∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂z₁ × ∂z₁/∂w₁

Step 1: We already know ∂L/∂ŷ = -2

Step 2: Find ∂ŷ/∂h

ŷ = w₂ × h

∂ŷ/∂h = w₂ = 1.0

Step 3: Find ∂h/∂z₁ (the ReLU derivative)

h = ReLU(z₁) = max(0, z₁)

∂h/∂z₁ = 1 if z₁ > 0, else 0

Since z₁ = 1.0 > 0, we have ∂h/∂z₁ = 1

Step 4: Find ∂z₁/∂w₁

z₁ = w₁ × x

∂z₁/∂w₁ = x = 2.0

Step 5: Multiply all together

∂L/∂w₁ = -2 × 1.0 × 1 × 2.0 = -4.0

Interpretation

The gradient of -4.0 means w₁ has an even bigger effect on the loss than w₂: increasing w₁ by a tiny amount will decrease the loss by 4× that amount.
backpropagation.py
import numpy as np

# Forward pass values (from before)
x = 2.0
w1 = 0.5
w2 = 1.0
y_true = 3.0

z1 = w1 * x  # 1.0
h = max(0, z1)  # 1.0 (ReLU)
y_pred = w2 * h  # 1.0
loss = 0.5 * (y_true - y_pred)**2  # 2.0

print("=" * 50)
print("BACKWARD PASS (Backpropagation)")
print("=" * 50)

# Backward pass
print("\n1. Gradient of loss w.r.t. prediction:")
dL_dy_pred = -(y_true - y_pred)
print(f"   ∂L/∂ŷ = -(y - ŷ) = -({y_true} - {y_pred}) = {dL_dy_pred}")

# Gradient for w2
print("\n2. Gradient for w2:")
dy_pred_dw2 = h
print(f"   ∂ŷ/∂w2 = h = {dy_pred_dw2}")

dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"   ∂L/∂w2 = ∂L/∂ŷ × ∂ŷ/∂w2 = {dL_dy_pred} × {dy_pred_dw2} = {dL_dw2}")

# Gradient for w1 (through the chain)
print("\n3. Gradient for w1 (longer chain):")

dy_pred_dh = w2
print(f"   ∂ŷ/∂h = w2 = {dy_pred_dh}")

# ReLU derivative
dh_dz1 = 1 if z1 > 0 else 0
print(f"   ∂h/∂z1 = {dh_dz1} (since z1 = {z1} > 0)")

dz1_dw1 = x
print(f"   ∂z1/∂w1 = x = {dz1_dw1}")

dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
print(f"   ∂L/∂w1 = {dL_dy_pred} × {dy_pred_dh} × {dh_dz1} × {dz1_dw1} = {dL_dw1}")

print("\n" + "=" * 50)
print("SUMMARY")
print("=" * 50)
print(f"Gradient for w1: {dL_dw1}")
print(f"Gradient for w2: {dL_dw2}")
print(f"\nw1 has a bigger effect on the loss!")
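
The same finite-difference check works on this network's gradients; here is a small sketch (the helper loss_fn below is my own wrapper, not code from the post):

```python
def loss_fn(w1, w2, x=2.0, y_true=3.0):
    """Full forward pass ending in the loss, as a function of the two weights."""
    h = max(0.0, w1 * x)               # hidden layer with ReLU
    return 0.5 * (y_true - w2 * h) ** 2

# Nudge each weight and measure how the loss changes
eps = 1e-6
g1 = (loss_fn(0.5 + eps, 1.0) - loss_fn(0.5 - eps, 1.0)) / (2 * eps)
g2 = (loss_fn(0.5, 1.0 + eps) - loss_fn(0.5, 1.0 - eps)) / (2 * eps)
print(g1, g2)  # ≈ -4.0 and -2.0, matching the hand-derived gradients
```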

Now that we know the gradients, we can update our weights using gradient descent:

w_new = w_old - α × ∂L/∂w

where α is the learning rate (let's use α = 0.1)

Update w₁:

w₁_new = 0.5 - 0.1 × (-4.0) = 0.5 + 0.4 = 0.9

Update w₂:

w₂_new = 1.0 - 0.1 × (-2.0) = 1.0 + 0.2 = 1.2

Both Weights Increased!

Notice both gradients were negative, so we increased both weights. This makes sense—our prediction was too low, so we need to amplify the signal through the network.
weight_update.py
# Values from the forward pass
x = 2.0
y_true = 3.0
y_pred = 1.0  # old prediction
loss = 2.0    # old loss

# Gradients from backprop
dL_dw1 = -4.0
dL_dw2 = -2.0

# Current weights
w1_old = 0.5
w2_old = 1.0

# Learning rate
alpha = 0.1

# Update weights
w1_new = w1_old - alpha * dL_dw1
w2_new = w2_old - alpha * dL_dw2

print("Weight Updates:")
print(f"w1: {w1_old} → {w1_new} (change: +{w1_new - w1_old})")
print(f"w2: {w2_old} → {w2_new} (change: +{w2_new - w2_old})")

# Test new prediction
z1_new = w1_new * x
h_new = max(0, z1_new)
y_pred_new = w2_new * h_new
loss_new = 0.5 * (y_true - y_pred_new)**2

print(f"\nOld prediction: ${y_pred * 100}k (loss: {loss})")
print(f"New prediction: ${y_pred_new * 100}k (loss: {loss_new})")
print(f"\nLoss decreased by: {loss - loss_new:.4f}")
print(f"We're getting closer to the true price of ${y_true * 100}k!")

Let's put it all together in a complete training loop:

complete_backprop.py
import numpy as np
import matplotlib.pyplot as plt

class TinyNetwork:
    def __init__(self):
        # Start from the same weights as in the worked example
        self.w1 = 0.5
        self.w2 = 1.0
        self.learning_rate = 0.1
        
    def relu(self, x):
        return max(0, x)
    
    def relu_derivative(self, x):
        return 1 if x > 0 else 0
    
    def forward(self, x):
        """Forward pass"""
        self.x = x
        self.z1 = self.w1 * x
        self.h = self.relu(self.z1)
        self.y_pred = self.w2 * self.h
        return self.y_pred
    
    def backward(self, y_true):
        """Backward pass (backpropagation)"""
        # Gradient of loss w.r.t. prediction
        dL_dy_pred = -(y_true - self.y_pred)
        
        # Gradient for w2
        dy_pred_dw2 = self.h
        dL_dw2 = dL_dy_pred * dy_pred_dw2
        
        # Gradient for w1 (through the chain)
        dy_pred_dh = self.w2
        dh_dz1 = self.relu_derivative(self.z1)
        dz1_dw1 = self.x
        dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
        
        return dL_dw1, dL_dw2
    
    def update_weights(self, dL_dw1, dL_dw2):
        """Update weights using gradient descent"""
        self.w1 -= self.learning_rate * dL_dw1
        self.w2 -= self.learning_rate * dL_dw2
    
    def train_step(self, x, y_true):
        """One complete training step"""
        # Forward pass
        y_pred = self.forward(x)
        
        # Calculate loss
        loss = 0.5 * (y_true - y_pred)**2
        
        # Backward pass
        dL_dw1, dL_dw2 = self.backward(y_true)
        
        # Update weights
        self.update_weights(dL_dw1, dL_dw2)
        
        return loss, y_pred

# Training
network = TinyNetwork()
x = 2.0  # 2000 sq ft
y_true = 3.0  # $300k

losses = []
predictions = []

print("Training Progress:")
print("=" * 60)
for epoch in range(20):
    loss, y_pred = network.train_step(x, y_true)
    losses.append(loss)
    predictions.append(y_pred)
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: Loss = {loss:.4f}, "
              f"Prediction = ${y_pred*100:.1f}k, "
              f"w1 = {network.w1:.3f}, w2 = {network.w2:.3f}")

print("=" * 60)
print(f"\nFinal prediction: ${predictions[-1]*100:.1f}k")
print(f"True price: ${y_true*100:.1f}k")
print(f"Final error: ${abs(y_true - predictions[-1])*100:.1f}k")

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Loss Over Time')
ax1.grid(True, alpha=0.3)

ax2.plot(predictions, label='Prediction')
ax2.axhline(y=y_true, color='r', linestyle='--', label='True Value')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Price ($100k)')
ax2.set_title('Predictions Over Time')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key Takeaways

  • Backpropagation is just the chain rule applied systematically to find gradients
  • Forward pass computes predictions; backward pass computes gradients
  • Gradients tell us direction: negative gradient means increase weight, positive means decrease
  • The chain rule multiplies local derivatives as we trace back through the network
  • Each layer only needs to know its local derivative—this is what makes backprop scalable
  • ReLU derivative is simple: 1 if input > 0, else 0
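
The modularity point above can be sketched with tiny layer objects, each knowing only its local derivative (the class names and the two-value backward convention are my own):

```python
class Multiply:
    """Computes y = w * x; remembers x so it can form local derivatives."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x
        return self.w * x
    def backward(self, upstream):
        # Returns (gradient for this layer's weight, gradient passed downstream)
        return upstream * self.x, upstream * self.w

class ReLU:
    def forward(self, x):
        self.x = x
        return max(0.0, x)
    def backward(self, upstream):
        return upstream * (1.0 if self.x > 0 else 0.0)

# Wire up the post's network: x -> (* w1) -> ReLU -> (* w2) -> ŷ
m1, act, m2 = Multiply(0.5), ReLU(), Multiply(1.0)
y_pred = m2.forward(act.forward(m1.forward(2.0)))
dL_dy = -(3.0 - y_pred)             # ∂L/∂ŷ = -2
dL_dw2, dL_dh = m2.backward(dL_dy)
dL_dz1 = act.backward(dL_dh)
dL_dw1, _ = m1.backward(dL_dz1)
print(dL_dw1, dL_dw2)  # -4.0 -2.0, the same gradients as before
```

Each object only stores what it needs for its own derivative; the chain rule is just the act of passing `upstream` along.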

Practice Exercise

Try modifying the code to add a second input feature (e.g., number of bedrooms). You'll need to add another weight and trace the gradients through. This will solidify your understanding!
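
As a hedged starting point for the exercise, a two-feature forward pass might look like the sketch below (the names x1, x2, w1a, w1b are my own; deriving and checking the gradients is the exercise):

```python
def forward_two_features(x1, x2, w1a, w1b, w2):
    """Forward pass with two input features feeding one hidden ReLU neuron."""
    z1 = w1a * x1 + w1b * x2  # weighted sum of both features
    h = max(0.0, z1)          # ReLU activation
    return w2 * h             # output layer

# e.g. 2000 sq ft and 3 bedrooms, with made-up weights
print(forward_two_features(2.0, 3.0, 0.5, 0.1, 1.0))  # ≈ 1.3
```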

Backpropagation isn't magic—it's a systematic application of the chain rule. By breaking the network into small pieces and computing local derivatives, we can efficiently find how every weight contributes to the error. This same principle scales from our tiny 2-weight network to massive models with billions of parameters.

The key insight: you don't need to understand the entire network at once. Each layer just needs to know its own derivative, and the chain rule connects everything together. That's the beauty of backpropagation!

Next Steps

Now that you understand backpropagation, explore: matrix-form backprop for handling batches, different activation functions and their derivatives, momentum and adaptive learning rates, and automatic differentiation in PyTorch/TensorFlow.
