Backpropagation and the Chain Rule: A Simple Visual Guide


Learn how backpropagation works through a simple, step-by-step example. Understand the chain rule intuitively with clear visualizations and working code.

12 min read

Backpropagation sounds intimidating, but it's actually a simple idea: calculate how much each part of your neural network contributed to the error, then adjust accordingly. In this post, we'll build intuition from the ground up using a concrete example you can follow step by step.

What You'll Learn

By the end of this post, you'll understand: what backpropagation really does, how the chain rule makes it possible, how to trace gradients through a simple network, and how to implement it from scratch in Python.

Imagine you're baking a cake and it turns out too sweet. You need to figure out which ingredient to adjust. Was it the sugar? The vanilla? The frosting? Backpropagation does exactly this for neural networks—it traces back through the recipe (the network) to find out which 'ingredients' (weights) caused the error.

The Process:

  1. Forward Pass: Feed input through the network to get a prediction
  2. Calculate Error: Compare prediction to the actual answer
  3. Backward Pass: Trace back to find how much each weight contributed to the error
  4. Update Weights: Adjust weights to reduce the error
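
The four steps above can be sketched as one generic training loop. Here is a minimal, self-contained sketch using a toy one-weight model (the function names and the model are illustrative, not the network we build below):

```python
def forward(w, x):
    """Step 1: forward pass for a toy one-weight model, y = w * x."""
    return w * x

def backward(w, x, y_true):
    """Step 3: gradient of L = 0.5 * (y_true - w*x)^2 with respect to w."""
    return -(y_true - forward(w, x)) * x

def train_step(w, x, y_true, lr=0.1):
    y_pred = forward(w, x)               # 1. Forward pass
    loss = 0.5 * (y_true - y_pred) ** 2  # 2. Calculate error
    grad = backward(w, x, y_true)        # 3. Backward pass
    w = w - lr * grad                    # 4. Update weight
    return w, loss

w, loss = train_step(0.5, 2.0, 3.0)
print(w, loss)  # w ≈ 0.9, loss = 2.0 (loss is computed before the update)
```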

Let's build the simplest possible neural network: one that predicts house prices based on size. We'll use this tiny network to understand backpropagation completely.

Our Network:

  • Input: House size (in 1000 sq ft)
  • Hidden Layer: 1 neuron
  • Output: Predicted price (in $100k)

Let's walk through a concrete example with actual numbers.

Given:

  • Input: x = 2 (the house is 2000 sq ft)
  • Weight 1: w₁ = 0.5
  • Weight 2: w₂ = 1.0
  • True price: y = 3 (the house actually costs $300k)

Forward Pass Calculations:

Hidden layer (with ReLU activation):

z₁ = w₁ × x = 0.5 × 2 = 1.0

h = ReLU(z₁) = max(0, 1.0) = 1.0

Output layer:

ŷ = w₂ × h = 1.0 × 1.0 = 1.0

Error (Loss):

L = ½(y - ŷ)² = ½(3 - 1)² = 2.0

Our Prediction is Wrong!

We predicted $100k but the house actually costs $300k. We're off by $200k! Now we need to figure out how to adjust our weights to fix this.
forward_pass.py
import numpy as np

# Network parameters
x = 2.0  # Input: 2000 sq ft
w1 = 0.5  # Weight 1
w2 = 1.0  # Weight 2
y_true = 3.0  # True price: $300k

# Forward pass
z1 = w1 * x
print(f"z1 = w1 * x = {w1} * {x} = {z1}")

# ReLU activation
h = max(0, z1)
print(f"h = ReLU(z1) = max(0, {z1}) = {h}")

# Output
y_pred = w2 * h
print(f"ŷ = w2 * h = {w2} * {h} = {y_pred}")

# Loss
loss = 0.5 * (y_true - y_pred)**2
print(f"\nLoss = 0.5 * (y - ŷ)² = 0.5 * ({y_true} - {y_pred})² = {loss}")
print(f"\nWe predicted ${y_pred * 100}k but actual is ${y_true * 100}k")
print(f"Error: ${abs(y_true - y_pred) * 100}k")

Before we do backpropagation, we need to understand the chain rule. It's simpler than it sounds!

The Chain Rule in Plain English:

If A affects B, and B affects C, then to find how A affects C, you multiply the effects:

dC/dA = dC/dB × dB/dA

Real-World Analogy

Think of a car: pressing the gas pedal (A) increases engine RPM (B), which increases speed (C). To know how the pedal affects speed, you multiply: (speed change per RPM) × (RPM change per pedal press).

Example with Numbers:

Say we have y = (2x + 1)² and we want dy/dx at x = 1.

Break it down:

  1. Let u = 2x + 1, so y = u²
  2. dy/du = 2u
  3. du/dx = 2
  4. dy/dx = dy/du × du/dx = 2u × 2 = 4u

At x = 1: u = 3, so dy/dx = 4 × 3 = 12

chain_rule_example.py
# Chain rule example: y = (2x + 1)²

def f(x):
    """Forward pass"""
    u = 2*x + 1
    y = u**2
    return y, u

def df_dx(x):
    """Derivative using chain rule"""
    _, u = f(x)
    
    # dy/du = 2u
    dy_du = 2 * u
    
    # du/dx = 2
    du_dx = 2
    
    # Chain rule: dy/dx = dy/du * du/dx
    dy_dx = dy_du * du_dx
    
    return dy_dx

# Test at x = 1
x = 1
y, u = f(x)
derivative = df_dx(x)

print(f"At x = {x}:")
print(f"u = 2x + 1 = {u}")
print(f"y = u² = {y}")
print(f"\ndy/du = 2u = {2*u}")
print(f"du/dx = 2")
print(f"dy/dx = dy/du * du/dx = {derivative}")
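
A quick way to sanity-check the analytic answer is a central finite difference (a standard trick, not part of the original walkthrough): nudge x a little and measure how y moves.

```python
def f(x):
    """Same function as above: y = (2x + 1)^2."""
    return (2 * x + 1) ** 2

# Central difference: (f(x+eps) - f(x-eps)) / (2*eps) approximates dy/dx
eps = 1e-6
x = 1.0
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(numeric)  # ≈ 12, matching the chain-rule result
```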

Now let's apply the chain rule to our neural network. We'll work backwards from the loss to find how each weight contributed to the error.

Goal: Find ∂L/∂w₁ and ∂L/∂w₂

The loss depends on w₂ through this chain: L → ŷ → w₂

Using the chain rule:

∂L/∂w₂ = ∂L/∂ŷ × ∂ŷ/∂w₂

Step 1: Find ∂L/∂ŷ

L = ½(y - ŷ)²

∂L/∂ŷ = -(y - ŷ) = -(3 - 1) = -2

Step 2: Find ∂ŷ/∂w₂

ŷ = w₂ × h

∂ŷ/∂w₂ = h = 1.0

Step 3: Multiply them (chain rule)

∂L/∂w₂ = -2 × 1.0 = -2.0

What Does This Mean?

The gradient of -2.0 means: if we increase w₂ by a tiny amount, the loss will decrease by 2× that amount. So we should increase w₂ to reduce our error!

This is trickier because w₁ affects the loss through a longer chain: L → ŷ → h → z₁ → w₁

∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂z₁ × ∂z₁/∂w₁

Step 1: We already know ∂L/∂ŷ = -2

Step 2: Find ∂ŷ/∂h

ŷ = w₂ × h

∂ŷ/∂h = w₂ = 1.0

Step 3: Find ∂h/∂z₁ (the ReLU derivative)

h = ReLU(z₁) = max(0, z₁)

∂h/∂z₁ = 1 if z₁ > 0, else 0

Since z₁ = 1.0 > 0, we have ∂h/∂z₁ = 1

Step 4: Find ∂z₁/∂w₁

z₁ = w₁ × x

∂z₁/∂w₁ = x = 2.0

Step 5: Multiply all together

∂L/∂w₁ = -2 × 1.0 × 1 × 2.0 = -4.0

Interpretation

The gradient of -4.0 means w₁ has an even bigger effect on the loss than w₂: increasing w₁ by a tiny amount will decrease the loss by 4× that amount.
backpropagation.py
import numpy as np

# Forward pass values (from before)
x = 2.0
w1 = 0.5
w2 = 1.0
y_true = 3.0

z1 = w1 * x  # 1.0
h = max(0, z1)  # 1.0 (ReLU)
y_pred = w2 * h  # 1.0
loss = 0.5 * (y_true - y_pred)**2  # 2.0

print("=" * 50)
print("BACKWARD PASS (Backpropagation)")
print("=" * 50)

# Backward pass
print("\n1. Gradient of loss w.r.t. prediction:")
dL_dy_pred = -(y_true - y_pred)
print(f"   ∂L/∂ŷ = -(y - ŷ) = -({y_true} - {y_pred}) = {dL_dy_pred}")

# Gradient for w2
print("\n2. Gradient for w2:")
dy_pred_dw2 = h
print(f"   ∂ŷ/∂w2 = h = {dy_pred_dw2}")

dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"   ∂L/∂w2 = ∂L/∂ŷ × ∂ŷ/∂w2 = {dL_dy_pred} × {dy_pred_dw2} = {dL_dw2}")

# Gradient for w1 (through the chain)
print("\n3. Gradient for w1 (longer chain):")

dy_pred_dh = w2
print(f"   ∂ŷ/∂h = w2 = {dy_pred_dh}")

# ReLU derivative
dh_dz1 = 1 if z1 > 0 else 0
print(f"   ∂h/∂z1 = {dh_dz1} (since z1 = {z1} > 0)")

dz1_dw1 = x
print(f"   ∂z1/∂w1 = x = {dz1_dw1}")

dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
print(f"   ∂L/∂w1 = {dL_dy_pred} × {dy_pred_dh} × {dh_dz1} × {dz1_dw1} = {dL_dw1}")

print("\n" + "=" * 50)
print("SUMMARY")
print("=" * 50)
print(f"Gradient for w1: {dL_dw1}")
print(f"Gradient for w2: {dL_dw2}")
print(f"\nw1 has a bigger effect on the loss!")
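
The same finite-difference check works on this network's gradients; here is a small sketch (the helper loss_fn below is my own wrapper, not code from the post):

```python
def loss_fn(w1, w2, x=2.0, y_true=3.0):
    """Full forward pass ending in the loss, as a function of the two weights."""
    h = max(0.0, w1 * x)               # hidden layer with ReLU
    return 0.5 * (y_true - w2 * h) ** 2

# Nudge each weight and measure how the loss changes
eps = 1e-6
g1 = (loss_fn(0.5 + eps, 1.0) - loss_fn(0.5 - eps, 1.0)) / (2 * eps)
g2 = (loss_fn(0.5, 1.0 + eps) - loss_fn(0.5, 1.0 - eps)) / (2 * eps)
print(g1, g2)  # ≈ -4.0 and -2.0, matching the hand-derived gradients
```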

Now that we know the gradients, we can update our weights using gradient descent:

w_new = w_old - α × ∂L/∂w

where α is the learning rate (let's use α = 0.1)

Update w₁:

w₁_new = 0.5 - 0.1 × (-4.0) = 0.5 + 0.4 = 0.9

Update w₂:

w₂_new = 1.0 - 0.1 × (-2.0) = 1.0 + 0.2 = 1.2

Both Weights Increased!

Notice both gradients were negative, so we increased both weights. This makes sense—our prediction was too low, so we need to amplify the signal through the network.
weight_update.py
# Values from the forward pass
x = 2.0
y_true = 3.0
y_pred = 1.0  # old prediction
loss = 2.0    # old loss

# Gradients from backprop
dL_dw1 = -4.0
dL_dw2 = -2.0

# Current weights
w1_old = 0.5
w2_old = 1.0

# Learning rate
alpha = 0.1

# Update weights
w1_new = w1_old - alpha * dL_dw1
w2_new = w2_old - alpha * dL_dw2

print("Weight Updates:")
print(f"w1: {w1_old} → {w1_new} (change: +{w1_new - w1_old})")
print(f"w2: {w2_old} → {w2_new} (change: +{w2_new - w2_old})")

# Test new prediction
z1_new = w1_new * x
h_new = max(0, z1_new)
y_pred_new = w2_new * h_new
loss_new = 0.5 * (y_true - y_pred_new)**2

print(f"\nOld prediction: ${y_pred * 100}k (loss: {loss})")
print(f"New prediction: ${y_pred_new * 100}k (loss: {loss_new})")
print(f"\nLoss decreased by: {loss - loss_new:.4f}")
print(f"We're getting closer to the true price of ${y_true * 100}k!")

Let's put it all together in a complete training loop:

complete_backprop.py
import numpy as np
import matplotlib.pyplot as plt

class TinyNetwork:
    def __init__(self):
        # Start from the same weights as in the worked example
        self.w1 = 0.5
        self.w2 = 1.0
        self.learning_rate = 0.1
        
    def relu(self, x):
        return max(0, x)
    
    def relu_derivative(self, x):
        return 1 if x > 0 else 0
    
    def forward(self, x):
        """Forward pass"""
        self.x = x
        self.z1 = self.w1 * x
        self.h = self.relu(self.z1)
        self.y_pred = self.w2 * self.h
        return self.y_pred
    
    def backward(self, y_true):
        """Backward pass (backpropagation)"""
        # Gradient of loss w.r.t. prediction
        dL_dy_pred = -(y_true - self.y_pred)
        
        # Gradient for w2
        dy_pred_dw2 = self.h
        dL_dw2 = dL_dy_pred * dy_pred_dw2
        
        # Gradient for w1 (through the chain)
        dy_pred_dh = self.w2
        dh_dz1 = self.relu_derivative(self.z1)
        dz1_dw1 = self.x
        dL_dw1 = dL_dy_pred * dy_pred_dh * dh_dz1 * dz1_dw1
        
        return dL_dw1, dL_dw2
    
    def update_weights(self, dL_dw1, dL_dw2):
        """Update weights using gradient descent"""
        self.w1 -= self.learning_rate * dL_dw1
        self.w2 -= self.learning_rate * dL_dw2
    
    def train_step(self, x, y_true):
        """One complete training step"""
        # Forward pass
        y_pred = self.forward(x)
        
        # Calculate loss
        loss = 0.5 * (y_true - y_pred)**2
        
        # Backward pass
        dL_dw1, dL_dw2 = self.backward(y_true)
        
        # Update weights
        self.update_weights(dL_dw1, dL_dw2)
        
        return loss, y_pred

# Training
network = TinyNetwork()
x = 2.0  # 2000 sq ft
y_true = 3.0  # $300k

losses = []
predictions = []

print("Training Progress:")
print("=" * 60)
for epoch in range(20):
    loss, y_pred = network.train_step(x, y_true)
    losses.append(loss)
    predictions.append(y_pred)
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: Loss = {loss:.4f}, "
              f"Prediction = ${y_pred*100:.1f}k, "
              f"w1 = {network.w1:.3f}, w2 = {network.w2:.3f}")

print("=" * 60)
print(f"\nFinal prediction: ${predictions[-1]*100:.1f}k")
print(f"True price: ${y_true*100:.1f}k")
print(f"Final error: ${abs(y_true - predictions[-1])*100:.1f}k")

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Loss Over Time')
ax1.grid(True, alpha=0.3)

ax2.plot(predictions, label='Prediction')
ax2.axhline(y=y_true, color='r', linestyle='--', label='True Value')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Price ($100k)')
ax2.set_title('Predictions Over Time')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key Takeaways

  • Backpropagation is just the chain rule applied systematically to find gradients
  • Forward pass computes predictions; backward pass computes gradients
  • Gradients tell us direction: negative gradient means increase weight, positive means decrease
  • The chain rule multiplies local derivatives as we trace back through the network
  • Each layer only needs to know its local derivative—this is what makes backprop scalable
  • ReLU derivative is simple: 1 if input > 0, else 0
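
The modularity point above can be sketched with tiny layer objects, each knowing only its local derivative (the class names and the two-value backward convention are my own):

```python
class Multiply:
    """Computes y = w * x; remembers x so it can form local derivatives."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x
        return self.w * x
    def backward(self, upstream):
        # Returns (gradient for this layer's weight, gradient passed downstream)
        return upstream * self.x, upstream * self.w

class ReLU:
    def forward(self, x):
        self.x = x
        return max(0.0, x)
    def backward(self, upstream):
        return upstream * (1.0 if self.x > 0 else 0.0)

# Wire up the post's network: x -> (* w1) -> ReLU -> (* w2) -> ŷ
m1, act, m2 = Multiply(0.5), ReLU(), Multiply(1.0)
y_pred = m2.forward(act.forward(m1.forward(2.0)))
dL_dy = -(3.0 - y_pred)             # ∂L/∂ŷ = -2
dL_dw2, dL_dh = m2.backward(dL_dy)
dL_dz1 = act.backward(dL_dh)
dL_dw1, _ = m1.backward(dL_dz1)
print(dL_dw1, dL_dw2)  # -4.0 -2.0, the same gradients as before
```

Each object only stores what it needs for its own derivative; the chain rule is just the act of passing `upstream` along.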

Practice Exercise

Try modifying the code to add a second input feature (e.g., number of bedrooms). You'll need to add another weight and trace the gradients through. This will solidify your understanding!
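
As a hedged starting point for the exercise, a two-feature forward pass might look like the sketch below (the names x1, x2, w1a, w1b are my own; deriving and checking the gradients is the exercise):

```python
def forward_two_features(x1, x2, w1a, w1b, w2):
    """Forward pass with two input features feeding one hidden ReLU neuron."""
    z1 = w1a * x1 + w1b * x2  # weighted sum of both features
    h = max(0.0, z1)          # ReLU activation
    return w2 * h             # output layer

# e.g. 2000 sq ft and 3 bedrooms, with made-up weights
print(forward_two_features(2.0, 3.0, 0.5, 0.1, 1.0))  # ≈ 1.3
```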

Backpropagation isn't magic—it's a systematic application of the chain rule. By breaking the network into small pieces and computing local derivatives, we can efficiently find how every weight contributes to the error. This same principle scales from our tiny 2-weight network to massive models with billions of parameters.

The key insight: you don't need to understand the entire network at once. Each layer just needs to know its own derivative, and the chain rule connects everything together. That's the beauty of backpropagation!

Next Steps

Now that you understand backpropagation, explore: matrix-form backprop for handling batches, different activation functions and their derivatives, momentum and adaptive learning rates, and automatic differentiation in PyTorch/TensorFlow.
