
Linear Algebra for Machine Learning: A Complete Intuitive Guide
Master linear algebra concepts essential for machine learning - from vectors and matrices to backpropagation and computation graphs, with practical examples and code.
Linear algebra is the mathematical foundation of modern machine learning. Every neural network, from simple linear regression to GPT-4, relies fundamentally on vectors, matrices, and the operations that transform them. This comprehensive guide will take you from basic linear algebra concepts to understanding how automatic differentiation powers deep learning frameworks like PyTorch and TensorFlow.
What You'll Learn
- Vectors, dot products, and why they power neural networks
- Matrices as linear transformations, plus eigenvalues and eigenvectors
- Gradients, Jacobians, and the chain rule
- Computation graphs, backpropagation, and automatic differentiation
- Gradient descent and common optimization pitfalls
A vector isn't just a list of numbers—it's a geometric object with both magnitude and direction. In machine learning, vectors represent everything: input features, model weights, gradients, and embeddings.
Consider a 2D vector: **v** = (3, 4).
This represents an arrow from the origin to the point (3, 4). Its magnitude (length) is ‖**v**‖ = √(3² + 4²) = 5.
ML Intuition
Dot Product - The most important operation in ML:
**a** · **b** = Σᵢ aᵢbᵢ = ‖**a**‖ ‖**b**‖ cos θ
The dot product measures alignment. When θ = 0° (parallel), cos θ = 1 (maximum). When θ = 90° (perpendicular), **a** · **b** = 0 (orthogonal).
Why Dot Products Power Neural Networks
```python
import numpy as np

# Define vectors
a = np.array([3, 4])
b = np.array([1, 2])

# Dot product
dot = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"Dot product: {dot}")

# Magnitude
mag_a = np.linalg.norm(a)  # sqrt(3^2 + 4^2) = 5
print(f"Magnitude of a: {mag_a}")

# Angle between vectors
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
angle = np.arccos(cos_theta)
print(f"Angle: {np.degrees(angle):.2f}°")
```

A matrix isn't just a 2D array—it's a linear transformation that maps vectors from one space to another. Matrix-vector multiplication transforms the input vector.
Example scaling matrix:
S = [[2, 0], [0, 3]]
This scales x by 2 and y by 3.
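As a quick sketch, here is that scaling matrix S = [[2, 0], [0, 3]] applied to a vector with NumPy:

```python
import numpy as np

# Scaling matrix: stretches x by 2 and y by 3
S = np.array([[2, 0],
              [0, 3]])

v = np.array([1, 1])
print(S @ v)  # [2 3]
```

The same pattern works for any linear transformation: build the matrix, multiply it by the vector.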
For matrices A (size m × n) and B (size n × p), the product C = AB has size m × p, where:
Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ
Each element Cᵢⱼ is the dot product of row i from A and column j from B.
Order Matters! Matrix multiplication is not commutative: in general, AB ≠ BA.
```python
import numpy as np

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B
print("A @ B:")
print(C)

# Verify AB != BA
D = B @ A
print("\nB @ A:")
print(D)
print(f"\nEqual? {np.array_equal(C, D)}")
```

| Matrix Type | Property | ML Use Case |
|---|---|---|
| Identity | Iv = v | Skip connections |
| Diagonal | Non-zero on diagonal only | Scaling operations |
| Symmetric | A = A^T | Covariance matrices |
| Orthogonal | Q^T Q = I | Rotations |
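As a sketch of the last row of the table, a 2D rotation matrix is orthogonal: it satisfies QᵀQ = I and preserves vector lengths.

```python
import numpy as np

theta = np.pi / 4  # 45-degree rotation
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Orthogonal: Q^T Q = I
print(np.allclose(Q.T @ Q, np.eye(2)))  # True

# Rotations preserve vector length
v = np.array([3.0, 4.0])
print(np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True
```

Length preservation is why orthogonal initializations and rotations are numerically well-behaved.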
An eigenvector of matrix A is a special vector that doesn't change direction when A is applied—it only gets scaled:
Av = λv
where v is the eigenvector and λ is the eigenvalue.
Eigenvalues in ML
```python
import numpy as np

A = np.array([[3, 1], [0, 2]])

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)

# Verify Av = λv for each pair
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    Av = A @ v
    lam_v = lam * v
    print(f"\nEigenvector {i+1}:")
    print(f"Match: {np.allclose(Av, lam_v)}")
```

For multivariable functions, we need the gradient—a vector of partial derivatives:
∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
Geometric Meaning: The gradient points in the direction of steepest ascent.
Example: For f(x, y) = x² + 2y², the gradient is ∇f = (2x, 4y). At the point (3, 2), ∇f = (6, 8), pointing away from the minimum at the origin.
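As a sanity check, we can compare an analytical gradient against central finite differences. This sketch uses f(x, y) = x² + 2y², the same function that appears in the gradient-descent example later on:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 2*y**2

def grad_f(p):
    x, y = p
    return np.array([2*x, 4*y])

# Central finite differences approximate each partial derivative
def numerical_grad(f, p, h=1e-5):
    g = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        step = np.zeros_like(p, dtype=float)
        step[i] = h
        g[i] = (f(p + step) - f(p - step)) / (2 * h)
    return g

p = np.array([3.0, 2.0])
print(grad_f(p))             # [6. 8.]
print(numerical_grad(f, p))  # ≈ [6. 8.]
```

Finite-difference checks like this are a standard way to debug hand-written gradients.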
When a function f: ℝⁿ → ℝᵐ outputs a vector, we need the Jacobian matrix: an m × n matrix with entries Jᵢⱼ = ∂fᵢ/∂xⱼ.
Jacobians in Neural Networks
For composed functions y = f(u) and u = g(x):
dy/dx = (dy/du) · (du/dx)
In vector form with Jacobians, this becomes matrix multiplication: the Jacobian of f ∘ g at x is J_f(g(x)) · J_g(x).
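A small numerical sketch (the functions below are chosen purely for illustration) showing that the Jacobian of a composition equals the product of the individual Jacobians:

```python
import numpy as np

# g: R^2 -> R^2 and f: R^2 -> R, illustrative choices
def g(x):
    return np.array([x[0]**2, x[0] + x[1]])

def f(u):
    return u[0] * u[1]

def J_g(x):                 # 2x2 Jacobian of g
    return np.array([[2*x[0], 0.0],
                     [1.0,    1.0]])

def J_f(u):                 # 1x2 Jacobian (gradient) of f
    return np.array([u[1], u[0]])

x = np.array([2.0, 1.0])
chain = J_f(g(x)) @ J_g(x)  # chain rule: J_f(g(x)) · J_g(x)

# Direct gradient of f(g(x)) = x0^3 + x0^2 * x1
direct = np.array([3*x[0]**2 + 2*x[0]*x[1], x[0]**2])

print(chain)   # [16.  4.]
print(direct)  # [16.  4.]
```

This matrix-product view of the chain rule is exactly what backpropagation exploits.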
Modern ML frameworks build computation graphs—directed acyclic graphs where nodes are operations and edges are data flow. Consider:
y = x² + 3x + 1
We break this into elementary operations:
- Input: x
- Operation A: a = x²
- Operation B: b = 3x
- Operation C: c = a + b
- Output: y = c + 1
During forward pass, we compute outputs and store intermediate values. For x = 2:
| Node | Operation | Value | Local Derivative |
|---|---|---|---|
| a | x² | 4 | da/dx = 2x = 4 |
| b | 3x | 6 | db/dx = 3 |
| c | a + b | 10 | ∂c/∂a = 1, ∂c/∂b = 1 |
| y | c + 1 | 11 | dy/dc = 1 |
During backward pass, we compute gradients by working backwards, multiplying local derivatives along each path.
Since x affects y through TWO paths (via a and via b), we sum contributions:
dy/dx = (∂y/∂a)(da/dx) + (∂y/∂b)(db/dx) = 1·4 + 1·3 = 7
This matches the analytical derivative: dy/dx = 2x + 3 = 7 at x = 2.
The Summation Rule
```python
import torch

# Enable gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# Forward pass - PyTorch builds graph
a = x**2
b = 3*x
c = a + b
y = c + 1
print(f"Forward: y = {y.item()}")

# Backward pass
y.backward()
print(f"Gradient: dy/dx = {x.grad.item()}")
print(f"Expected: 2*{x.item()} + 3 = {2*x.item() + 3}")

# Vector example
print("\n--- Vector Example ---")
x_vec = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Compute y = ||x||²
y_vec = torch.sum(x_vec**2)
print(f"y = ||x||² = {y_vec.item()}")
y_vec.backward()
print(f"Gradient: {x_vec.grad}")
print(f"Expected: 2*x = {2*x_vec.data}")
```

Gradient descent minimizes a function by moving opposite to the gradient:
θ ← θ − η ∇f(θ)
where η is the learning rate.
Why the Negative Sign? The gradient points toward steepest ascent, so stepping against it decreases the function as quickly as possible.
```python
import numpy as np

# Function: f(x,y) = x² + 2y²
def f(x, y):
    return x**2 + 2*y**2

# Gradient
def grad_f(x, y):
    return np.array([2*x, 4*y])

# Gradient descent
def gradient_descent(start, lr, iterations):
    path = [start]
    point = start.copy()
    for _ in range(iterations):
        gradient = grad_f(point[0], point[1])
        point = point - lr * gradient
        path.append(point.copy())
    return np.array(path)

# Run
start = np.array([3.0, 2.0])
path = gradient_descent(start, 0.1, 20)
final = path[-1]
print(f"Start: {start}")
print(f"Final: {final}")
print(f"f(final) = {f(final[0], final[1]):.6f}")
```

| Problem | Cause | Solution |
|---|---|---|
| Exploding gradients | Large eigenvalues | Gradient clipping |
| Vanishing gradients | Small eigenvalues | Skip connections |
| Slow convergence | Poor conditioning | Adaptive optimizers |
| Numerical instability | Near-singular matrices | Batch normalization |
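Taking one row of the table as an example: gradient clipping in PyTorch is a single call inserted between `backward()` and the optimizer step. The toy model and data below are arbitrary, chosen only to show where the call goes:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 1)

loss = ((model(x) - target)**2).mean()
loss.backward()

# Rescale all gradients so their combined norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

total_norm = torch.norm(torch.stack(
    [p.grad.norm() for p in model.parameters()]))
print(f"Post-clip gradient norm: {total_norm:.4f}")  # at most ~1.0
```

`clip_grad_norm_` scales the whole gradient vector, preserving its direction while bounding its magnitude.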
A neural network is function composition. Each layer applies a linear transformation followed by a nonlinear activation: h = σ(Wx + b).
```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 3)
        self.layer2 = nn.Linear(3, 1)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        y = self.layer2(h)
        return y

# Create network
model = SimpleNet()
x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[3.0]])

# Forward
output = model(x)
loss = (output - target)**2
print(f"Output: {output.item():.4f}")
print(f"Loss: {loss.item():.4f}")

# Backward
loss.backward()
print("\nGradients computed!")
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.shape}")
```

- Vectors and matrices are geometric objects representing transformations
- Dot products measure alignment and power neural networks
- Eigenvalues reveal matrix properties and affect optimization
- Gradients point uphill; we move opposite to minimize
- Computation graphs enable automatic differentiation
- Backpropagation is the chain rule applied systematically
- Gradients sum when variables affect output through multiple paths
Practice Exercises
1. Implement matrix multiplication from scratch.
2. Visualize how 2x2 matrices transform a unit circle.
3. Build a simple autograd system.
4. Implement gradient descent and plot the optimization path.
Understanding linear algebra transforms machine learning from magic to mathematics. When you see a neural network, you now understand it's matrix multiplications and element-wise operations composed together. When you call backward() in PyTorch, you know it's systematically applying the chain rule through a computation graph.
The beauty is scalability. The same principles that work for a simple polynomial also power GPT-4 with billions of parameters. The mathematics remains elegant and consistent.