Logistic Regression from Scratch in PyTorch: Every Line Explained

Build a multi-class classifier in PyTorch without nn.Linear, without optim.SGD, without CrossEntropyLoss. Just tensors, autograd, and arithmetic — so you finally see what those helpers actually do.

15 min read

In the last post we looked at TF-IDF + Logistic Regression using sklearn — a single fit() call and you're done. That's great for shipping, terrible for learning. You end up with a model that works, and no idea why. This post builds the same classifier from scratch in PyTorch — no nn.Linear, no nn.CrossEntropyLoss, no optim.SGD. Every weight, every gradient, every update is spelled out by hand.

We'll keep the same running example: CLINC150 intent classification. A user types "book me a flight to Tokyo" and we need to pick one of 151 intent labels (150 real intents plus an out-of-scope bucket). Features are a ~10,000-dim TF-IDF vector, so the numbers we'll be quoting are real.

Why bother building it by hand?

Because once you've read and written every line, `nn.Linear` and `loss.backward()` stop being magic. You'll know exactly what each helper is doing underneath, and when something breaks in a larger model, you'll have the vocabulary to debug it.

Strip away the ceremony and logistic regression does exactly this: take a feature vector x, compute a score for each class (that's the W @ x + b you've seen a thousand times), turn scores into probabilities via softmax, and pick the argmax. Training is the process of nudging W and b until the correct class usually has the highest score.

Three shapes to keep in your head as we go:

| Symbol | Meaning | Shape (CLINC150) |
| --- | --- | --- |
| N | Batch size — examples processed together | 256 |
| d | Number of features per example | ~10,000 |
| C | Number of classes | 151 |
| X | Input batch | (N, d) |
| W | Weight matrix — one column per class | (d, C) |
| b | Bias vector — one scalar per class | (C,) |
| logits | Raw scores before softmax | (N, C) |
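The shapes in the table can be sanity-checked in a few lines (a quick sketch; random tensors stand in for real TF-IDF features):

```python
import torch

N, d, C = 256, 10_000, 151     # batch, features, classes (from the table)
X = torch.randn(N, d)          # stand-in for a batch of TF-IDF vectors
W = torch.randn(d, C)
b = torch.zeros(C)

logits = X @ W + b             # (N, d) @ (d, C) -> (N, C); b broadcasts
assert logits.shape == (N, C)
```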

config.py

```python
from dataclasses import dataclass

@dataclass
class LogRegConfig:
    lr: float = 0.1           # learning rate — step size
    epochs: int = 100         # passes through the full dataset
    batch_size: int = 256     # examples per update
    l2_lambda: float = 1e-4   # regularization strength
    log_every: int = 10       # print loss every N epochs
    seed: int = 42            # fix randomness for reproducibility
```

A `@dataclass` is Python's shortcut for classes that are really just bundles of values — it auto-generates `__init__`, `__repr__`, and equality checks. Think of it as a named tuple with type hints.
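A tiny illustration of what the decorator generates (the `Point` class here is a made-up example, not part of the training code):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float = 0.0
    y: float = 0.0

p = Point(y=2.0)              # auto-generated __init__, keyword args included
print(p)                      # Point(x=0.0, y=2.0) — auto-generated __repr__
assert p == Point(0.0, 2.0)   # auto-generated __eq__, field-by-field
```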

The hyperparameters, one at a time. If you want the beginner-friendly version of these ideas first, read ML Hyperparameters Explained for Beginners:

  • lr (learning rate) — how big a step to take when updating weights. 0.1 is aggressive, 0.001 is gentle. Too big and you overshoot the minimum; too small and training crawls.
  • epochs — one epoch is one full pass through the training data. epochs=100 means every training example is seen 100 times.
  • batch_size — how many examples to process before each weight update. Bigger batches give smoother, more accurate gradients; smaller batches update faster and add useful noise.
  • l2_lambda — penalty on large weights, to prevent overfitting. More on this below.
  • seed — freezes randomness. Same seed = same run, every time. Absolutely critical for debugging.

init.py

```python
import torch

gen = torch.Generator().manual_seed(42)

# W has shape (n_features, n_classes)
# For CLINC150 with 10k features: ~10,000 x 151 = ~1.5M parameters
self.W = torch.randn(n_features, n_classes, generator=gen) * 0.01
self.W.requires_grad_(True)

# Bias — one per class, starts at zero
self.b = torch.zeros(n_classes)
self.b.requires_grad_(True)
```

W has shape (n_features, n_classes). For CLINC150 that's roughly 10,000 × 151 ≈ 1.5 million parameters. Each column of W is conceptually the "prototype" for one class. When a new input x comes in, x @ W computes a dot product between x and every class prototype — 151 similarity scores in one matrix multiply.

Why multiply by 0.01?

`torch.randn` samples from a standard normal (mean 0, variance 1). If we left the weights at that scale, the logits `X @ W` would be huge, softmax would saturate, and gradients would either vanish or explode on the very first step. Shrinking by 0.01 keeps logits small and gradients healthy at the start of training.
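You can see the effect directly. A rough sketch with CLINC150-ish dimensions (a random vector stands in for a real TF-IDF input, and exact numbers vary by seed):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)                  # stand-in input, d = 10,000

W_raw = torch.randn(10_000, 151)         # variance 1 — logits blow up
W_scaled = W_raw * 0.01                  # what the init actually does

print((x @ W_raw).abs().max())           # logits on the order of hundreds
print((x @ W_scaled).abs().max())        # logits on the order of 1
```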

Three more details worth noting. The bias starts at zero — we have no prior reason to prefer any class, so flat bias is the honest default. The generator=gen bit wires in our seeded RNG so the initialization is reproducible. And requires_grad_(True) is the flag that says "PyTorch, please track every operation touching this tensor so you can compute gradients later." Without it, loss.backward() silently does nothing.

forward.py

```python
def forward(self, X: Tensor) -> Tensor:
    # X shape: (N, n_features)   — N examples, each d-dimensional
    # W shape: (n_features, n_classes)
    # Result:  (N, n_classes)    — N rows of logits, one per class
    return X @ self.W + self.b
```

The @ operator between tensors is matrix multiplication. If X is (256, 10000) and W is (10000, 151), then X @ W is (256, 151) — one row per input example, 151 class scores per row.

Adding b (shape (151,)) to a (256, 151) matrix uses broadcasting: PyTorch virtually replicates b across all 256 rows without copying memory. The output is called logits — raw, unnormalized scores. Logits can be any real number. A big positive logit for class 5 means "this input strongly looks like class 5." A very negative logit means "this input really doesn't look like class 5."
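Broadcasting in miniature, with 4 rows and 3 classes instead of 256 and 151:

```python
import torch

logits = torch.zeros(4, 3)               # pretend (N, C) scores
b = torch.tensor([0.5, -0.5, 0.0])       # one bias per class, shape (3,)

out = logits + b                         # b is virtually replicated across rows
print(out.shape)                         # torch.Size([4, 3])
print(out[0])                            # tensor([ 0.5000, -0.5000,  0.0000])
```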

To turn logits into actual probabilities (positive, summing to 1), apply softmax:

softmax(x_i) = exp(x_i) / Σ exp(x_j)

for all classes j

Two things happen: exp makes everything positive (since e^x > 0 for any real x), and dividing by the sum normalizes to 1. Softmax also preserves ordering — the biggest logit becomes the biggest probability.

softmax_intuition.py

```python
import torch

# Given logits [2.0, 1.0, -1.0] for 3 classes:
logits = torch.tensor([2.0, 1.0, -1.0])

exp = logits.exp()            # tensor([7.389, 2.718, 0.368])
probs = exp / exp.sum()       # sum of exps is 10.475
print(probs)                  # ≈ [0.705, 0.259, 0.035] — sums to 1.0

# The biggest logit -> biggest probability.
# Softmax preserves the ranking, just turns scores into a distribution.
```

The numerical overflow trap

If any logit is, say, 1000, then `exp(1000) = inf` and your training dies instantly. If logits are very negative, `exp` underflows to zero and you take `log(0) = -inf` later. Either way, inf and NaN silently poison every downstream number. This is why nobody computes raw softmax in practice — they use `log_softmax`.

torch.log_softmax computes log(softmax(x)) directly using the log-sum-exp trick: subtract max(x) before exponentiating. Mathematically the constant cancels out; computationally, the largest exp term becomes exp(0) = 1 and everything else is between 0 and 1. No overflow, ever.

log_softmax(x_i) = x_i − max(x) − log(Σ exp(x_j − max(x)))

the log-sum-exp identity
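A minimal demonstration of the failure and the fix (the extreme logit values are invented for illustration):

```python
import torch

big = torch.tensor([1000.0, 999.0, 998.0])

naive = big.exp() / big.exp().sum()       # exp(1000) overflows to inf -> nan
print(naive)                              # tensor([nan, nan, nan])

stable = torch.log_softmax(big, dim=0)    # log-sum-exp trick, no overflow
print(stable.exp())                       # ≈ [0.665, 0.245, 0.090], sums to 1
```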

Cross-entropy is the standard loss for classification, and the intuition is simple: if the model assigns probability 0.9 to the correct class, you're happy; if it assigns 0.001, you're sad. The loss function -log(p) has exactly this shape — near zero when p is close to 1, and exploding toward infinity as p approaches 0.

So for each training example, we want to compute -log(p_correct_class) and average over the batch.

cross_entropy.py

```python
def compute_loss(self, logits: Tensor, targets: Tensor) -> Tensor:
    N = logits.shape[0]
    log_probs = torch.log_softmax(logits, dim=1)

    # Pluck out log P(correct_class) for each example in the batch
    # torch.arange(N)  = [0, 1, 2, ..., N-1]   (row indices)
    # targets          = [5, 12, 0, 37, ...]   (column indices = correct class)
    # -> one log-prob per example, shape (N,)
    nll = -log_probs[torch.arange(N), targets]

    return nll.mean()
```

Fancy indexing unpacked

`log_probs[torch.arange(N), targets]` is the worth-memorizing trick. `torch.arange(N)` is `[0, 1, 2, ..., N-1]` (row indices). `targets` is the correct class for each example, e.g. `[5, 12, 0, 37, ...]`. Read together: 'from row 0 take column 5, from row 1 take column 12, ...'. One line, vectorized across the whole batch.

Why go via log_softmax and then index, instead of computing softmax, indexing, then taking log? Numerical stability. Staying in log-space means tiny probabilities like 1e-30 don't underflow to zero.
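The indexing pattern on a two-example toy batch (the log-prob values are invented for illustration):

```python
import torch

log_probs = torch.tensor([[-0.4, -1.6, -2.3],    # example 0's log-probs
                          [-2.0, -0.2, -3.1]])   # example 1's log-probs
targets = torch.tensor([0, 1])                   # correct class per example

picked = log_probs[torch.arange(2), targets]     # row 0 col 0, row 1 col 1
print(picked)                                    # tensor([-0.4000, -0.2000])
print(-picked.mean())                            # tensor(0.3000) — the NLL
```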

Left unchecked, the model will learn huge weights to memorize the training set, then fail miserably on new data. This is overfitting. L2 regularization prevents it by adding a penalty proportional to the sum of squared weights:

l2_penalty.py

```python
def _l2_penalty(self) -> Tensor:
    # Sum of squared weights. Does NOT include bias — standard practice.
    return self.config.l2_lambda * (self.W ** 2).sum()

# Total loss used for backprop:
#     loss = cross_entropy + l2_lambda * sum(W^2)
```

The total loss becomes cross_entropy + λ · Σ W². The optimizer now has two pressures: reduce the cross-entropy (fit the data) and keep weights small (stay simple). Lambda controls the tradeoff — too small and regularization does nothing, too big and the model underfits because every weight is squeezed toward zero.

Why don't we regularize the bias?

Biases just shift the decision boundary; they don't cause overfitting the way weights do. Penalizing them would force the model to assume all classes are equally likely before seeing any input, which is a constraint we don't want.
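One way to see the "keep weights small" pressure concretely: the gradient of λ · ΣW² is 2λW, so each update subtracts a slice proportional to the weight itself, and bigger weights get pulled harder. A quick standalone check (plain tensors, not the class code):

```python
import torch

lam = 1e-4
W = torch.randn(5, 3, requires_grad=True)

penalty = lam * (W ** 2).sum()
penalty.backward()

# Autograd agrees with the calculus: d/dW of λ·ΣW² is 2λW.
assert torch.allclose(W.grad, 2 * lam * W)
```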

Now the heart of it. Every epoch we shuffle, then iterate over mini-batches. For each batch we run the five-step cycle that is the beating heart of essentially all deep learning:

train_loop.py

```python
for epoch in range(self.config.epochs):

    # Shuffle each epoch so batches see a random mix of classes
    perm = torch.randperm(N, generator=gen)
    X_shuffled = X_train[perm]
    y_shuffled = y_train[perm]

    for start in range(0, N, self.config.batch_size):
        end = min(start + self.config.batch_size, N)
        X_batch = X_shuffled[start:end]
        y_batch = y_shuffled[start:end]

        # 1. Forward: compute predictions
        logits = self.forward(X_batch)

        # 2. Loss: how wrong are we?
        ce_loss = self.compute_loss(logits, y_batch)
        l2_loss = self._l2_penalty()
        loss = ce_loss + l2_loss

        # 3. Backward: let autograd compute d(loss)/d(W) and d(loss)/d(b)
        loss.backward()

        # 4. Update: step opposite to the gradient
        with torch.no_grad():
            self.W.data -= self.config.lr * self.W.grad
            self.b.data -= self.config.lr * self.b.grad

            # 5. Zero the gradients — otherwise they accumulate
            self.W.grad.zero_()
            self.b.grad.zero_()
```

If the data were sorted by class (all class 0 first, then class 1, and so on), the model would train on one class for ages, forget the previous one, and oscillate forever. Shuffling guarantees each batch sees a random mix. `torch.randperm(N)` gives a random permutation of `[0 .. N-1]`, and `X_train[perm]` reorders the rows accordingly.

| Approach | Gradient quality | Speed | Memory |
| --- | --- | --- | --- |
| Full batch (all N at once) | Exact | Slow per step | High |
| Single example (SGD) | Very noisy | Fast per step, jittery overall | Low |
| Mini-batch (e.g. 256) | Good estimate | Fast, stable | Moderate |

Mini-batches hit the sweet spot — stable enough to converge, frequent enough to make fast progress, and small enough to fit in GPU memory. `min(start + batch_size, N)` handles the final batch cleanly when N doesn't divide evenly.

This is the part that feels like magic until you know. When you called X @ W, PyTorch silently recorded "matmul, with these inputs" on a hidden computation graph. Same for log_softmax, the indexing, the .mean(). Every operation on a requires_grad=True tensor adds a node to this graph.

loss.backward() walks that graph in reverse, applying the chain rule from calculus at each node, all the way back to the tensors with requires_grad=True. The final gradients land in W.grad and b.grad, which have the same shapes as W and b.

What the gradient means, concretely

`W.grad[i, j]` tells you: 'if I nudged `W[i, j]` up by a tiny amount ε, the loss would change by approximately `W.grad[i, j] · ε`.' That's the number we need to know which direction to move each weight.
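You can verify that interpretation numerically with a finite-difference check (a standalone sketch with a made-up scalar loss, not the classifier's):

```python
import torch

torch.manual_seed(1)
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(3)

loss = (x @ W).sum()          # stand-in scalar loss
loss.backward()

# Nudge W[0, 0] by eps and re-measure the loss by hand
eps = 1e-2
with torch.no_grad():
    W_nudged = W.clone()
    W_nudged[0, 0] += eps
    loss_nudged = (x @ W_nudged).sum()

approx = (loss_nudged - loss) / eps
print(approx.item(), W.grad[0, 0].item())   # nearly identical: both equal x[0]
```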

You never write a derivative yourself — PyTorch ships with the derivative of every tensor operation built in. That's why this framework took over.

The gradient points in the direction of steepest increase of the loss. We want to decrease the loss, so we step in the opposite direction. That's literally the entire idea of gradient descent:

W_new = W_old − lr · (∂loss / ∂W)

gradient descent, in one line

Two tricky details in the code worth pausing on. The with torch.no_grad(): block tells PyTorch "don't track these operations" — otherwise the update itself becomes part of the graph, creating a recursive mess. And .data modifies the underlying tensor values directly, without breaking autograd's bookkeeping.

The classic PyTorch beginner bug

PyTorch **accumulates** gradients by default. If you call `backward()` twice without zeroing, the gradients sum up. If you forget `zero_()` in your training loop, gradients keep growing epoch over epoch, your updates explode, and the loss diverges into NaN. Everyone hits this once. Only once, if they're lucky.

Why does PyTorch accumulate instead of replacing? Because sometimes you want accumulation — for example, gradient accumulation across several small batches to simulate a larger effective batch size when memory is tight. The framework gives you flexibility and demands you handle the bookkeeping.
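The accumulation behavior, in a standalone toy:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
print(w.grad)          # tensor(2.)

(2 * w).backward()     # second backward ADDS to the existing gradient
print(w.grad)          # tensor(4.) — accumulated, not replaced

w.grad.zero_()
print(w.grad)          # tensor(0.)
```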

predict.py

```python
def predict(self, X: Tensor) -> Tensor:
    with torch.no_grad():                  # no gradients needed at inference
        logits = self.forward(X)
        return logits.argmax(dim=1)        # no softmax needed — argmax is monotonic
```

Two optimizations here that matter in production. First, torch.no_grad() skips building the computation graph — no autograd bookkeeping, less memory, faster inference. Second, we don't compute softmax at all. Since softmax is monotonic (bigger logit ⇒ bigger probability), argmax(logits) == argmax(softmax(logits)). Save yourself the exp, the sum, and the division.
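Worth a one-off check if you've never convinced yourself of the monotonicity claim:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, 151)     # 8 fake examples, 151 classes

probs = torch.softmax(logits, dim=1)
# Same winner per row, with or without softmax
assert torch.equal(logits.argmax(dim=1), probs.argmax(dim=1))
```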

The best way to solidify any of this is to run training on a tiny toy dataset and watch the numbers move. Weights change, gradients shrink, loss drops. You can't unsee it.

toy_training.py

```python
# Tiny sanity check — run this to *see* training happen.
import torch

torch.manual_seed(0)
X = torch.randn(50, 10)                    # 50 examples, 10 features
y = torch.randint(0, 3, (50,))             # 3 classes

W = (torch.randn(10, 3) * 0.01).requires_grad_()
b = torch.zeros(3, requires_grad=True)

for step in range(20):
    logits = X @ W + b
    log_probs = torch.log_softmax(logits, dim=1)
    loss = -log_probs[torch.arange(50), y].mean()
    loss.backward()

    grad_mag = W.grad.abs().mean().item()  # capture before zeroing
    with torch.no_grad():
        W.data -= 0.1 * W.grad
        b.data -= 0.1 * b.grad
        W.grad.zero_()
        b.grad.zero_()

    print(f"step {step:2d}  loss={loss.item():.4f}  "
          f"|W|={W.abs().mean().item():.4f}  "
          f"|grad|={grad_mag:.4f}")
```

Run it and you'll see the loss start around 1.1 (random guessing over 3 classes gives -log(1/3) ≈ 1.10) and tick steadily downward. You'll also see |grad| shrink — as the model approaches a good solution, there's less and less to correct.

Zoom out and the entire arc of training is this: start with random weights. Predict. Measure wrongness. Use autograd to find which direction to nudge each weight. Take a small step. Repeat thousands of times. Slowly, the columns of W fill in useful patterns — one column comes to represent "flight-booking vocabulary," another "weather-query vocabulary," and so on — and the loss drops.

The beautiful thing about this implementation is that every step is visible. There's no nn.Module hiding the parameters, no optim.SGD hiding the update rule, no CrossEntropyLoss hiding the log-softmax. Once you've written this, you know exactly what every library shortcut is doing underneath — and when something breaks in a bigger model, you'll have the vocabulary to debug it.

Key takeaways

  1. Logistic regression = linear scores + softmax + argmax. Training = nudging the linear scores until the argmax matches the label.
  2. Logits are unnormalized scores. Softmax only exists to make them into probabilities; for prediction, argmax on logits is equivalent and cheaper.
  3. Use log_softmax, never raw softmax. The log-sum-exp trick is the difference between training that works and training that silently explodes.
  4. Cross-entropy punishes confident wrong answers. -log(p_correct) is huge when p is tiny, zero when p is one.
  5. L2 regularization shrinks weights to prevent overfitting. Don't regularize the bias.
  6. The five-step cycle is universal. Forward, loss, backward, update, zero — every neural network you ever train follows it.
  7. loss.backward() is not magic — it's the chain rule replayed over a recorded computation graph. Autograd does the bookkeeping; you do the modeling.
  8. Zero your gradients. You will forget this once. Then never again.

What's next

This exact structure — forward, loss, backward, update, zero — is the skeleton of every deep learning model. Swap `X @ W + b` for a stack of `nn.Linear` layers and you have a feed-forward network. Swap it for attention blocks and you have a [**transformer**](/blog/introduction-to-transformers). The ceremony changes; the cycle does not. Once you have a working model, learn how to evaluate it properly with [**The Impartial Judge: Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness).
