Linear Algebra for Machine Learning: A Complete Intuitive Guide

Master linear algebra concepts essential for machine learning - from vectors and matrices to backpropagation and computation graphs, with practical examples and code.

20 min read

Linear algebra is the mathematical foundation of modern machine learning. Every neural network, from simple linear regression to GPT-4, relies fundamentally on vectors, matrices, and the operations that transform them. This comprehensive guide will take you from basic linear algebra concepts to understanding how automatic differentiation powers deep learning frameworks like PyTorch and TensorFlow.

What You'll Learn

This guide covers vectors and their geometric meaning, matrix operations and transformations, eigenvalues and eigenvectors, gradients and the chain rule, Jacobian matrices, computation graphs, backpropagation mechanics, gradient descent, and practical PyTorch implementations with working code examples.

A vector isn't just a list of numbers—it's a geometric object with both magnitude and direction. In machine learning, vectors represent everything: input features, model weights, gradients, and embeddings.

Consider a 2D vector:

$$\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$

This represents an arrow from the origin $(0, 0)$ to the point $(3, 4)$. Its magnitude (length) is:

$$\|\mathbf{v}\| = \sqrt{3^2 + 4^2} = 5$$

ML Intuition

In ML, each data point is a vector. A house might be represented as a vector [2000, 3, 2] (square feet, bedrooms, bathrooms). Similar houses have vectors pointing in similar directions in feature space.
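As a quick sketch of "pointing in similar directions" (using the house vector above plus a second, made-up house), cosine similarity measures how closely two feature vectors align:

```python
import numpy as np

# Two houses as feature vectors: [square feet, bedrooms, bathrooms]
# house_b is a hypothetical second house, not from the text
house_a = np.array([2000.0, 3.0, 2.0])
house_b = np.array([2100.0, 3.0, 2.0])

# Cosine similarity: 1.0 means the vectors point in exactly the same direction
cos_sim = np.dot(house_a, house_b) / (np.linalg.norm(house_a) * np.linalg.norm(house_b))
print(f"Cosine similarity: {cos_sim:.4f}")
```

Two similar houses produce a cosine similarity very close to 1, even though their magnitudes differ.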

Dot Product - The most important operation in ML:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos(\theta)$$

The dot product measures alignment. When $\theta = 0^\circ$ (parallel), $\cos(\theta) = 1$ (maximum). When $\theta = 90^\circ$ (perpendicular), $\cos(\theta) = 0$ (orthogonal).

Why Dot Products Power Neural Networks

Every neuron computes the dot product of weights and inputs plus a bias. This measures how much the input aligns with what the neuron is looking for. Attention mechanisms use dot products to compute similarity between queries and keys.
vector_operations.py

```python
import numpy as np

# Define vectors
a = np.array([3, 4])
b = np.array([1, 2])

# Dot product
dot = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"Dot product: {dot}")

# Magnitude
mag_a = np.linalg.norm(a)  # sqrt(3^2 + 4^2) = 5
print(f"Magnitude of a: {mag_a}")

# Angle between vectors
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
angle = np.arccos(cos_theta)
print(f"Angle: {np.degrees(angle):.2f}°")
```

A matrix isn't just a 2D array—it's a linear transformation that maps vectors from one space to another. Matrix-vector multiplication transforms the input vector.

Example scaling matrix:

$$\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$\mathbf{A}\mathbf{v} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

This scales x by 2 and y by 3.
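A minimal NumPy check of this scaling transformation:

```python
import numpy as np

# The scaling matrix and vector from the example above
A = np.array([[2, 0], [0, 3]])
v = np.array([1, 1])

# Applying A scales x by 2 and y by 3
result = A @ v
print(result)  # [2 3]
```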

For matrices $\mathbf{A}$ (size $m \times n$) and $\mathbf{B}$ (size $n \times p$), the product $\mathbf{C} = \mathbf{A}\mathbf{B}$ has size $m \times p$, where:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

Each element is the dot product of row $i$ from $\mathbf{A}$ and column $j$ from $\mathbf{B}$.

Order Matters!

Matrix multiplication is NOT commutative. In general, AB ≠ BA. Applying transformation A then B is different from B then A.
matrix_operations.py

```python
import numpy as np

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B
print("A @ B:")
print(C)

# Verify AB != BA
D = B @ A
print("\nB @ A:")
print(D)
print(f"\nEqual? {np.array_equal(C, D)}")
```
| Matrix Type | Property | ML Use Case |
| --- | --- | --- |
| Identity | $\mathbf{I}\mathbf{v} = \mathbf{v}$ | Skip connections |
| Diagonal | Non-zero entries on the diagonal only | Scaling operations |
| Symmetric | $\mathbf{A} = \mathbf{A}^T$ | Covariance matrices |
| Orthogonal | $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ | Rotations |
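These properties are easy to verify numerically. The sketch below checks the identity and orthogonal rows; the 45° rotation matrix is an assumed example, not taken from the text above:

```python
import numpy as np

# Identity: Iv = v
I = np.eye(2)
v = np.array([3.0, 4.0])
assert np.allclose(I @ v, v)

# Orthogonal matrix (rotation by 45°): Q^T Q = I, and lengths are preserved
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))

print("Identity and orthogonality properties verified")
```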

An eigenvector of a matrix $\mathbf{A}$ is a special vector that doesn't change direction when $\mathbf{A}$ is applied; it only gets scaled:

$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$

where $\mathbf{v}$ is the eigenvector and $\lambda$ is the eigenvalue.

Eigenvalues in ML

PCA finds eigenvectors of the covariance matrix to identify directions of maximum variance. Eigenvalues of the Hessian determine optimization landscape curvature. Large eigenvalues indicate steep directions that can cause training instability.
eigenvalues.py

```python
import numpy as np

A = np.array([[3, 1], [0, 2]])

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)

# Verify Av = λv for each pair
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]

    Av = A @ v
    lam_v = lam * v

    print(f"\nEigenvector {i+1}:")
    print(f"Match: {np.allclose(Av, lam_v)}")
```
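To make the PCA connection concrete, here is a small sketch (on made-up toy data) that recovers the direction of maximum variance from the eigenvectors of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2D data, deliberately stretched along the x-axis (std 3 vs 0.5)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# PCA core step: eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: covariance is symmetric

# Sort eigenvalues in descending order: largest = direction of maximum variance
order = np.argsort(eigenvalues)[::-1]
print("Variance captured per component:", eigenvalues[order])
print("Principal direction:", eigenvectors[:, order[0]])
```

The principal eigenvector comes out (up to sign) close to $[1, 0]$, the axis along which the toy data was stretched.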

For multivariable functions, we need the gradient—a vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

Geometric Meaning: The gradient points in the direction of steepest ascent.

Example: for $f(x, y) = x^2 + 2y^2$:

$$\nabla f = \begin{bmatrix} 2x \\ 4y \end{bmatrix}$$
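A standard way to sanity-check an analytic gradient is to compare it against central finite differences; here is a sketch for this $f$:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 2*y**2

def analytic_grad(p):
    x, y = p
    return np.array([2*x, 4*y])

def numeric_grad(f, p, h=1e-5):
    """Central finite differences: (f(p+h) - f(p-h)) / 2h per coordinate."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0])
print("Analytic:", analytic_grad(p))   # [2. 8.]
print("Numeric: ", numeric_grad(f, p))
```

The two agree to many decimal places, which is exactly the check used to debug hand-derived gradients.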

When a function outputs a vector, we need the Jacobian matrix:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Jacobians in Neural Networks

Every layer has a Jacobian. During backpropagation, we multiply these Jacobians using the chain rule. This is why backprop is efficient—it's just matrix multiplication!

For composed functions $z = f(y)$ and $y = g(x)$:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

In vector form with Jacobians, this becomes matrix multiplication.
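A small worked example (the functions $f$ and $g$ below are made up for illustration) shows the vector chain rule as a product of Jacobians:

```python
import numpy as np

# g: R^2 -> R^2,  y = g(x) = [x0*x1, x0 + x1]
# f: R^2 -> R,    z = f(y) = y0^2 + y1
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

x = np.array([2.0, 3.0])
y = g(x)

# Jacobian of g at x: rows are outputs, columns are inputs
J_g = np.array([[x[1], x[0]],
                [1.0,  1.0]])

# Jacobian (gradient row) of f at y
J_f = np.array([2 * y[0], 1.0])

# Chain rule in vector form: dz/dx = J_f @ J_g
dz_dx = J_f @ J_g
print(dz_dx)  # [37. 25.]
```

Expanding the composition by hand, $z = x_0^2 x_1^2 + x_0 + x_1$, so $\partial z/\partial x_0 = 2x_0x_1^2 + 1 = 37$ and $\partial z/\partial x_1 = 2x_0^2x_1 + 1 = 25$ at $(2, 3)$, matching the matrix product.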

Modern ML frameworks build computation graphs—directed acyclic graphs where nodes are operations and edges are data flow. Consider:

$$y = x^2 + 3x + 1$$

We break this into elementary operations:

  1. Input: $x$
  2. Operation A: $a = x^2$
  3. Operation B: $b = 3x$
  4. Operation C: $c = a + b$
  5. Output: $y = c + 1$

During the forward pass, we compute outputs and store intermediate values. For $x = 2$:

| Node | Operation | Value | Local Derivative |
| --- | --- | --- | --- |
| $a$ | $x^2$ | 4 | $da/dx = 2x = 4$ |
| $b$ | $3x$ | 6 | $db/dx = 3$ |
| $c$ | $a + b$ | 10 | $\partial c/\partial a = 1$, $\partial c/\partial b = 1$ |
| $y$ | $c + 1$ | 11 | $dy/dc = 1$ |

During backward pass, we compute gradients by working backwards, multiplying local derivatives along each path.

Since $x$ affects $y$ through TWO paths, we sum the contributions:

$$\frac{dy}{dx} = \frac{dy}{da} \cdot \frac{da}{dx} + \frac{dy}{db} \cdot \frac{db}{dx} = 1 \times 4 + 1 \times 3 = 7$$

This matches the analytical derivative: $\frac{d}{dx}(x^2 + 3x + 1) = 2x + 3 = 7$ at $x = 2$.

The Summation Rule

When a variable influences output through multiple paths, the total gradient is the SUM of gradients from all paths.
pytorch_autograd.py

```python
import torch

# Enable gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# Forward pass - PyTorch builds the graph
a = x**2
b = 3*x
c = a + b
y = c + 1

print(f"Forward: y = {y.item()}")

# Backward pass
y.backward()

print(f"Gradient: dy/dx = {x.grad.item()}")
print(f"Expected: 2*{x.item()} + 3 = {2*x.item() + 3}")

# Vector example
print("\n--- Vector Example ---")
x_vec = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Compute y = ||x||²
y_vec = torch.sum(x_vec**2)
print(f"y = ||x||² = {y_vec.item()}")

y_vec.backward()
print(f"Gradient: {x_vec.grad}")
print(f"Expected: 2*x = {2*x_vec.data}")
```

Gradient descent minimizes a function by moving opposite to the gradient:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t)$$

where $\alpha$ is the learning rate.

Why the Negative Sign?

The gradient points uphill. To minimize, we move downhill, hence the negative sign.
gradient_descent.py

```python
import numpy as np

# Function: f(x, y) = x² + 2y²
def f(x, y):
    return x**2 + 2*y**2

# Gradient
def grad_f(x, y):
    return np.array([2*x, 4*y])

# Gradient descent
def gradient_descent(start, lr, iterations):
    path = [start]
    point = start.copy()

    for _ in range(iterations):
        gradient = grad_f(point[0], point[1])
        point = point - lr * gradient
        path.append(point.copy())

    return np.array(path)

# Run
start = np.array([3.0, 2.0])
path = gradient_descent(start, 0.1, 20)
final = path[-1]

print(f"Start: {start}")
print(f"Final: {final}")
print(f"f(final) = {f(final[0], final[1]):.6f}")
```
| Problem | Cause | Solution |
| --- | --- | --- |
| Exploding gradients | Large eigenvalues | Gradient clipping |
| Vanishing gradients | Small eigenvalues | Skip connections |
| Slow convergence | Poor conditioning | Adaptive optimizers |
| Numerical instability | Near-singular matrices | Batch normalization |
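As one example from the table, gradient clipping in PyTorch is a one-liner. The sketch below (with an arbitrary tiny model and a deliberately large input, chosen only for illustration) rescales gradients so their global norm is at most 1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny model; a large input produces large gradients
model = nn.Linear(2, 1)
x = torch.tensor([[100.0, 100.0]])
loss = (model(x) ** 2).sum()
loss.backward()

# clip_grad_norm_ rescales gradients in place so their global norm <= max_norm;
# it returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))

print(f"Norm before clipping: {float(norm_before):.2f}")
print(f"Norm after clipping:  {float(norm_after):.2f}")
```

Clipping bounds the size of the update step without changing the gradient's direction, which is why it tames exploding gradients.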

A neural network is function composition. Each layer applies a linear transformation followed by nonlinear activation:

$$\mathbf{h}_1 = \sigma(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)$$

$$\mathbf{y} = \mathbf{W}_2\mathbf{h}_1 + \mathbf{b}_2$$

simple_neural_network.py

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 3)
        self.layer2 = nn.Linear(3, 1)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        y = self.layer2(h)
        return y

# Create network
model = SimpleNet()
x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[3.0]])

# Forward
output = model(x)
loss = (output - target)**2

print(f"Output: {output.item():.4f}")
print(f"Loss: {loss.item():.4f}")

# Backward
loss.backward()

print("\nGradients computed!")
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.shape}")
```
Key Takeaways

  • Vectors and matrices are geometric objects representing transformations
  • Dot products measure alignment and power neural networks
  • Eigenvalues reveal matrix properties and affect optimization
  • Gradients point uphill; we move opposite to minimize
  • Computation graphs enable automatic differentiation
  • Backpropagation is the chain rule applied systematically
  • Gradients sum when variables affect output through multiple paths

Practice Exercises

  1. Implement matrix multiplication from scratch.
  2. Visualize how 2×2 matrices transform a unit circle.
  3. Build a simple autograd system.
  4. Implement gradient descent and plot the optimization path.

Understanding linear algebra transforms machine learning from magic to mathematics. When you see a neural network, you now understand it's matrix multiplications and element-wise operations composed together. When you call backward() in PyTorch, you know it's systematically applying the chain rule through a computation graph.

The beauty is scalability. The same principles that work for a simple polynomial also power GPT-4 with billions of parameters. The mathematics remains elegant and consistent.

Continue Learning

Next steps: Study the Hessian matrix for second-order optimization, explore matrix decompositions (SVD, QR), learn about numerical stability, and implement a neural network from scratch using only NumPy.
