ReLU Explained: The Simple Activation Function That Changed Deep Learning

A complete beginner's guide to ReLU (Rectified Linear Unit) - what it is, why it works so well, and how to use it in neural networks with clear examples.

10 min read

Imagine you're building a neural network and someone tells you to use ReLU. You nod along, but secretly wonder: what is this thing, and why does everyone use it? Here's the truth: ReLU is probably the simplest function in all of deep learning. It's so simple you can explain it to a 10-year-old. Yet this simple function revolutionized neural networks and made deep learning possible. Let's understand why.

What You'll Learn

By the end of this post, you'll understand: what ReLU is (in one sentence), why it's better than older activation functions, how to implement it in PyTorch, when to use it (and when not to), and common variants like Leaky ReLU.

ReLU (Rectified Linear Unit) is a function that outputs the input if it's positive, and zero otherwise.

That's it. That's the whole thing. In math notation:

ReLU(x) = max(0, x)

Or in plain English: "If the number is positive, keep it. If it's negative, make it zero."

relu_examples.txt
text
Input  →  ReLU  →  Output
─────────────────────────
  5    →  max(0, 5)   →   5
  2    →  max(0, 2)   →   2
  0    →  max(0, 0)   →   0
 -3    →  max(0, -3)  →   0
 -10   →  max(0, -10) →   0

See the pattern? Positive numbers pass through unchanged. Negative numbers become zero. That's all ReLU does.

The Simplest Possible Analogy

Think of ReLU as a one-way door. Positive numbers can walk through freely. Negative numbers hit a wall and stop at zero. That's the entire function.

Before we understand why ReLU is special, let's understand why we need activation functions at all.

A neural network layer does a simple calculation: multiply inputs by weights and add a bias. This is called a linear transformation:

y = Wx + b

The problem? If you stack multiple linear layers, you still get a linear function. It's like multiplying numbers: 2 × 3 × 4 = 24, which is the same as just multiplying by 24 once. Stacking doesn't add power.

linear_problem.py
python
import torch
import torch.nn as nn

# Two linear layers WITHOUT activation
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.Linear(20, 5)
)

# This is mathematically equivalent to:
model_equivalent = nn.Linear(10, 5)

# Stacking linear layers without activation = waste of layers!

Activation functions add non-linearity. They let the network learn complex patterns like curves, boundaries, and interactions. Without them, your 100-layer network is no smarter than a 1-layer network.

The Key Insight

Linear functions can only draw straight lines. Real-world patterns (faces, speech, text) are not straight lines. Activation functions let neural networks draw curves, which is what makes them powerful.
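To make the "stacking adds nothing" point concrete, here's a small sketch that collapses the two linear layers from above into a single one and checks that the outputs match. The collapse formula follows from substitution: plugging y = W₁x + b₁ into z = W₂y + b₂ gives z = (W₂W₁)x + (W₂b₁ + b₂), which is just one linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
first = nn.Linear(10, 20)
second = nn.Linear(20, 5)

# Collapse the stack: W = W2 @ W1, b = W2 @ b1 + b2
W = second.weight @ first.weight
b = second.weight @ first.bias + second.bias

x = torch.randn(3, 10)
stacked = second(first(x))
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True
```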

Before ReLU, people used sigmoid and tanh. These worked, but had serious problems.

Sigmoid squashes any input to a value between 0 and 1:

σ(x) = 1 / (1 + e^(-x))

sigmoid_examples.txt
text
Input  →  Sigmoid  →  Output
────────────────────────────
  10   →  σ(10)    →  0.9999  (almost 1)
   5   →  σ(5)     →  0.993
   0   →  σ(0)     →  0.5
  -5   →  σ(-5)    →  0.007
 -10   →  σ(-10)   →  0.0001  (almost 0)

The Problem: Vanishing Gradients

When you train a network, you compute gradients (how much to change each weight). Sigmoid's gradient is very small for large positive or negative inputs. In deep networks, these tiny gradients multiply together and become microscopic. The network stops learning. This is called the vanishing gradient problem.

The Vanishing Gradient Problem

Imagine passing a message through 100 people, but each person only passes 10% of what they heard. By the time it reaches the end, the message is gone. That's what happens to gradients with sigmoid in deep networks.
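This shrinking-message effect is easy to observe numerically. The sketch below (the helper name first_layer_grad is ours) builds the same deep stack with sigmoid and with ReLU activations, backpropagates once, and compares the size of the gradient that reaches the first layer:

```python
import torch
import torch.nn as nn

def first_layer_grad(activation, depth=16, width=32):
    """Gradient norm at the first layer of a deep stack of Linear + activation."""
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    net = nn.Sequential(*layers)
    net(torch.randn(8, width)).sum().backward()
    return net[0].weight.grad.norm().item()

print(f"sigmoid: {first_layer_grad(nn.Sigmoid):.2e}")  # tiny
print(f"relu:    {first_layer_grad(nn.ReLU):.2e}")     # much larger
```

With sigmoid, each layer multiplies the gradient by at most 0.25, so the first layer barely learns; with ReLU the gradient survives many more layers.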

ReLU's gradient is simple:

  • If input > 0: gradient = 1 (perfect!)
  • If input ≤ 0: gradient = 0 (dead, but at least not vanishing)

For positive inputs, the gradient is always 1. It doesn't shrink. This means gradients can flow through many layers without vanishing. This is why ReLU made deep learning possible.
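You can verify this gradient behavior directly with autograd:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()

# Gradient is exactly 1 where the input was positive, 0 elsewhere
print(x.grad)  # tensor([0., 0., 1., 1.])
```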

Feature                       │ Sigmoid                  │ ReLU
──────────────────────────────┼──────────────────────────┼─────────────────────────
Gradient for positive inputs  │ Small (< 0.25)           │ Always 1
Computation                   │ Expensive (exponential)  │ Trivial (max operation)
Output range                  │ [0, 1]                   │ [0, ∞)
Vanishing gradients?          │ Yes, severe              │ No
Dead neurons?                 │ No                       │ Yes (but manageable)

PyTorch makes ReLU incredibly easy. Here are three ways to use it:

relu_layer.py
python
import torch
import torch.nn as nn

# Define a simple network with ReLU
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),              # ← ReLU as a layer
    nn.Linear(256, 128),
    nn.ReLU(),              # ← Another ReLU
    nn.Linear(128, 10)
)

# Test it
x = torch.randn(1, 784)
output = model(x)
print(output.shape)  # torch.Size([1, 10])
relu_function.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)        # ← ReLU as a function
        x = self.fc2(x)
        return x
relu_from_scratch.py
python
import torch

def my_relu(x):
    """ReLU from scratch - just for learning!"""
    return torch.maximum(x, torch.zeros_like(x))

# Or even simpler:
def my_relu_v2(x):
    """Even simpler version"""
    return x * (x > 0)

# Test them
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Input:        ", x)
print("my_relu:      ", my_relu(x))
print("my_relu_v2:   ", my_relu_v2(x))
print("torch.relu:   ", torch.relu(x))

# Output:
# Input:         tensor([-2., -1.,  0.,  1.,  2.])
# my_relu:       tensor([0., 0., 0., 1., 2.])
# my_relu_v2:    tensor([0., 0., 0., 1., 2.])
# torch.relu:    tensor([0., 0., 0., 1., 2.])

Which Method to Use?

Use nn.ReLU() when building models with nn.Sequential. Use F.relu() when writing custom forward() methods. Both are equally fast - it's just a style preference.

ReLU has one weakness: dying neurons. If a neuron's output is always negative, ReLU makes it always zero. The gradient is also zero, so the neuron never updates. It's permanently dead.

dying_relu_example.py
python
import torch
import torch.nn as nn

# Imagine a neuron that always outputs negative values
weights = torch.tensor([-5.0, -3.0, -2.0])
inputs = torch.tensor([1.0, 1.0, 1.0])

# Linear output
z = torch.dot(weights, inputs)  # -5 + -3 + -2 = -10
print(f"Before ReLU: {z}")         # -10

# After ReLU
output = torch.relu(z)
print(f"After ReLU: {output}")      # 0

# Gradient is also 0, so this neuron never learns!
# It's stuck outputting 0 forever = DEAD NEURON

How common is this? In practice, 10-20% of neurons can die during training. It's annoying but usually not catastrophic.

How to prevent it?

  • Use a smaller learning rate (neurons won't jump to extreme negative values)
  • Use proper weight initialization (He initialization for ReLU)
  • Use Leaky ReLU or other variants (explained next)
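On the initialization point: PyTorch ships He initialization as nn.init.kaiming_normal_. A minimal sketch:

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# He initialization draws weights with std = sqrt(2 / fan_in).
# The factor of 2 compensates for ReLU zeroing roughly half the units,
# keeping activation variance stable across layers.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)

print(f"weight std: {layer.weight.std().item():.4f}")  # ≈ sqrt(2/256) ≈ 0.0884
```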

Instead of making negative values exactly zero, Leaky ReLU makes them small:

LeakyReLU(x) = x      if x > 0
               0.01x  if x ≤ 0

leaky_relu.py
python
import torch
import torch.nn as nn

# Standard ReLU
relu = nn.ReLU()

# Leaky ReLU (negative slope = 0.01)
leaky_relu = nn.LeakyReLU(negative_slope=0.01)

# Test
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Input:       ", x)
print("ReLU:        ", relu(x))
print("Leaky ReLU:  ", leaky_relu(x))

# Output:
# Input:        tensor([-2., -1.,  0.,  1.,  2.])
# ReLU:         tensor([0., 0., 0., 1., 2.])
# Leaky ReLU:   tensor([-0.02, -0.01,  0.00,  1.00,  2.00])
#                       ↑ Small negative values instead of 0!

Now negative inputs produce small negative outputs. The gradient is also small (0.01) instead of zero, so neurons can still learn even when they output negative values.
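The gradient claim is again easy to check with autograd:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0], requires_grad=True)
F.leaky_relu(x, negative_slope=0.01).sum().backward()

# Slope 0.01 on the negative side, 1 on the positive side
print(x.grad)  # tensor([0.0100, 1.0000])
```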

Variant     │ Formula                       │ When to Use
────────────┼───────────────────────────────┼───────────────────────────────────────────
ReLU        │ max(0, x)                     │ Default choice - works 90% of the time
Leaky ReLU  │ max(0.01x, x)                 │ When you have dying neuron problems
PReLU       │ max(αx, x) where α is learned │ When you want the network to learn the slope
ELU         │ x if x > 0, else α(e^x − 1)   │ When you want smooth negative values
GELU        │ x·Φ(x) (Gaussian CDF)         │ Transformers and modern NLP models
Swish       │ x·σ(x)                        │ Some computer vision tasks

Which Variant to Choose?

Start with standard ReLU. It works for 90% of problems. Only switch to Leaky ReLU if you notice many dead neurons (check with hooks or by monitoring activations). For transformers, use GELU (it's the standard). For everything else, ReLU is fine.
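"Check with hooks" can look like the following sketch. It counts units that output zero for an entire batch (the helper save_activation and the per-batch "dead" criterion are ours: a simple proxy for truly dead neurons, which you'd confirm over many batches):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))

captured = {}
def save_activation(module, inputs, output):
    # Forward hook: called after the ReLU runs, records its output
    captured['relu'] = output.detach()

model[1].register_forward_hook(save_activation)

model(torch.randn(256, 100))

act = captured['relu']
dead = (act == 0).all(dim=0).sum().item()  # units at 0 for every sample
print(f"dead units this batch: {dead} / {act.shape[1]}")
```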

Let's build a complete image classifier using ReLU:

complete_example.py
python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# 1. Define the model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),              # ← ReLU after first layer
            nn.Dropout(0.2),
            
            nn.Linear(512, 256),
            nn.ReLU(),              # ← ReLU after second layer
            nn.Dropout(0.2),
            
            nn.Linear(256, 128),
            nn.ReLU(),              # ← ReLU after third layer
            nn.Dropout(0.2),
            
            nn.Linear(128, 10)      # No ReLU on output layer!
        )
    
    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten images
        return self.network(x)

# 2. Create model, loss, optimizer
model = SimpleClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Load MNIST data
transform = transforms.ToTensor()
train_set = datasets.MNIST('.', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

# 4. Training loop (simplified)
for epoch in range(10):
    for batch_x, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# Key points:
# - ReLU after every hidden layer
# - NO ReLU after the output layer (we want raw scores)
# - Dropout after ReLU (standard pattern)

Common Mistake: ReLU on Output Layer

Never put ReLU on the output layer! The output layer produces raw scores (logits) that can be negative. ReLU would force them to be positive, breaking the loss function. Only use ReLU on hidden layers.

ReLU is great, but not always the right choice:

  • Output layers: Never use ReLU on output layers. Use softmax for classification, nothing for regression.
  • Recurrent networks (RNNs): ReLU can cause exploding gradients in RNNs. Use tanh instead.
  • When you need bounded outputs: If you need outputs in a specific range (like [0,1] or [-1,1]), use sigmoid or tanh.
  • Transformers: Modern transformers use GELU, not ReLU. It's smoother and works better.
  • When dying neurons are a problem: Switch to Leaky ReLU or ELU.

Let's visualize what ReLU does to data:

visualize_relu.py
python
import torch
import matplotlib.pyplot as plt
import numpy as np

# Create input range
x = torch.linspace(-5, 5, 100)

# Apply different activations
relu = torch.relu(x)
leaky_relu = torch.nn.functional.leaky_relu(x, negative_slope=0.1)
sigmoid = torch.sigmoid(x)
tanh = torch.tanh(x)

# Plot
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.plot(x, relu, 'b-', linewidth=2)
plt.title('ReLU', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.subplot(2, 2, 2)
plt.plot(x, leaky_relu, 'g-', linewidth=2)
plt.title('Leaky ReLU', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.subplot(2, 2, 3)
plt.plot(x, sigmoid, 'r-', linewidth=2)
plt.title('Sigmoid', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.subplot(2, 2, 4)
plt.plot(x, tanh, 'm-', linewidth=2)
plt.title('Tanh', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150, bbox_inches='tight')
plt.show()

print("Notice how ReLU is just a straight line for x>0 and flat at 0 for x<0!")

Key Takeaways

  1. ReLU is simple: max(0, x) - that's the entire function.
  2. It solves vanishing gradients: Gradient is 1 for positive inputs, allowing deep networks to train.
  3. It's fast: Just a comparison and a max operation - no expensive exponentials.
  4. Use it on hidden layers: Never on output layers.
  5. Standard pattern: Linear → ReLU → Dropout → repeat
  6. Dying neurons exist: 10-20% of neurons may die, but it's usually okay.
  7. Leaky ReLU helps: Use it if dying neurons become a problem.
  8. ReLU made deep learning possible: Before ReLU, training deep networks was nearly impossible.

The Bottom Line

ReLU is the default activation function for hidden layers in neural networks. It's simple, fast, and solves the vanishing gradient problem that plagued older activation functions. Start with ReLU, and only switch to variants if you have a specific reason. In 90% of cases, plain ReLU is the right choice.

relu_cheatsheet.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ─── Three ways to use ReLU ────────────────────────────────────

# 1. As a layer (in Sequential)
model1 = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# 2. As a function (in forward())
class Model2(nn.Module):
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return x

# 3. Direct torch function
x = torch.tensor([-1.0, 0.0, 1.0])
output = torch.relu(x)  # tensor([0., 0., 1.])

# ─── ReLU variants ─────────────────────────────────────────────

nn.ReLU()                           # Standard
nn.LeakyReLU(negative_slope=0.01)   # Leaky
nn.PReLU()                          # Parametric (learned slope)
nn.ELU()                            # Exponential Linear Unit
nn.GELU()                           # Gaussian (for transformers)

# ─── Common pattern ────────────────────────────────────────────

nn.Sequential(
    nn.Linear(in_features, out_features),
    nn.BatchNorm1d(out_features),    # Optional
    nn.ReLU(),
    nn.Dropout(0.2)                   # Optional
)
