Logistic Regression from Scratch in PyTorch: Every Line Explained

Build a multi-class classifier in PyTorch without nn.Linear, without optim.SGD, without CrossEntropyLoss. Just tensors, autograd, and arithmetic — so you finally see what those helpers actually do.

15 min read

In the last post we looked at TF-IDF + Logistic Regression using sklearn — a single fit() call and you're done. That's great for shipping, terrible for learning. You end up with a model that works, and no idea why. This post builds the same classifier from scratch in PyTorch — no nn.Linear, no nn.CrossEntropyLoss, no optim.SGD. Every weight, every gradient, every update is spelled out by hand.

We'll keep the same running example: CLINC150 intent classification. A user types "book me a flight to Tokyo" and we need to pick one of 151 intent labels (150 real intents plus an out-of-scope bucket). Features are a ~10,000-dim TF-IDF vector, so the numbers we'll be quoting are real.

Why bother building it by hand?

Because once you've read and written every line, `nn.Linear` and `loss.backward()` stop being magic. You'll know exactly what each helper is doing underneath, and when something breaks in a larger model, you'll have the vocabulary to debug it.

Strip away the ceremony and logistic regression does exactly this: take a feature vector x, compute a score for each class (that's the W @ x + b you've seen a thousand times), turn scores into probabilities via softmax, and pick the argmax. Training is the process of nudging W and b until the correct class usually has the highest score.

Three shapes to keep in your head as we go:

| Symbol | Meaning | Shape (CLINC150) |
| --- | --- | --- |
| N | Batch size — examples processed together | 256 |
| d | Number of features per example | ~10,000 |
| C | Number of classes | 151 |
| X | Input batch | (N, d) |
| W | Weight matrix — one column per class | (d, C) |
| b | Bias vector — one scalar per class | (C,) |
| logits | Raw scores before softmax | (N, C) |
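The shapes in the table can be sanity-checked in a few lines (a quick sketch; random tensors stand in for real TF-IDF features):

```python
import torch

N, d, C = 256, 10_000, 151     # batch, features, classes (from the table)
X = torch.randn(N, d)          # stand-in for a batch of TF-IDF vectors
W = torch.randn(d, C)
b = torch.zeros(C)

logits = X @ W + b             # (N, d) @ (d, C) -> (N, C); b broadcasts
assert logits.shape == (N, C)
```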

config.py

```python
from dataclasses import dataclass

@dataclass
class LogRegConfig:
    lr: float = 0.1           # learning rate — step size
    epochs: int = 100         # passes through the full dataset
    batch_size: int = 256     # examples per update
    l2_lambda: float = 1e-4   # regularization strength
    log_every: int = 10       # print loss every N epochs
    seed: int = 42            # fix randomness for reproducibility
```

A `@dataclass` is Python's shortcut for classes that are really just bundles of values — it auto-generates `__init__`, `__repr__`, and equality checks. Think of it as a named tuple with type hints.
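A tiny illustration of what the decorator generates (the `Point` class here is a made-up example, not part of the training code):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float = 0.0
    y: float = 0.0

p = Point(y=2.0)              # auto-generated __init__, keyword args included
print(p)                      # Point(x=0.0, y=2.0) — auto-generated __repr__
assert p == Point(0.0, 2.0)   # auto-generated __eq__, field-by-field
```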

The hyperparameters, one at a time. If you want the beginner-friendly version of these ideas first, read ML Hyperparameters Explained for Beginners:

  • lr (learning rate) — how big a step to take when updating weights. 0.1 is aggressive, 0.001 is gentle. Too big and you overshoot the minimum; too small and training crawls.
  • epochs — one epoch is one full pass through the training data. epochs=100 means every training example is seen 100 times.
  • batch_size — how many examples to process before each weight update. Bigger batches give smoother, more accurate gradients; smaller batches update faster and add useful noise.
  • l2_lambda — penalty on large weights, to prevent overfitting. More on this below.
  • seed — freezes randomness. Same seed = same run, every time. Absolutely critical for debugging.

init.py

```python
import torch

gen = torch.Generator().manual_seed(42)

# W has shape (n_features, n_classes)
# For CLINC150 with 10k features: ~10,000 x 151 = ~1.5M parameters
self.W = torch.randn(n_features, n_classes, generator=gen) * 0.01
self.W.requires_grad_(True)

# Bias — one per class, starts at zero
self.b = torch.zeros(n_classes)
self.b.requires_grad_(True)
```

W has shape (n_features, n_classes). For CLINC150 that's roughly 10,000 × 151 ≈ 1.5 million parameters. Each column of W is conceptually the "prototype" for one class. When a new input x comes in, x @ W computes a dot product between x and every class prototype — 151 similarity scores in one matrix multiply.

Why multiply by 0.01?

`torch.randn` samples from a standard normal (mean 0, variance 1). If we left the weights at that scale, the logits `X @ W` would be huge, softmax would saturate, and gradients would either vanish or explode on the very first step. Shrinking by 0.01 keeps logits small and gradients healthy at the start of training.
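You can see the effect directly. A rough sketch with CLINC150-ish dimensions (a random vector stands in for a real TF-IDF input, and exact numbers vary by seed):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)                  # stand-in input, d = 10,000

W_raw = torch.randn(10_000, 151)         # variance 1 — logits blow up
W_scaled = W_raw * 0.01                  # what the init actually does

print((x @ W_raw).abs().max())           # logits on the order of hundreds
print((x @ W_scaled).abs().max())        # logits on the order of 1
```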

Three more details worth noting. The bias starts at zero — we have no prior reason to prefer any class, so flat bias is the honest default. The generator=gen bit wires in our seeded RNG so the initialization is reproducible. And requires_grad_(True) is the flag that says "PyTorch, please track every operation touching this tensor so you can compute gradients later." Without it, loss.backward() silently does nothing.

forward.py

```python
def forward(self, X: Tensor) -> Tensor:
    # X shape: (N, n_features)   — N examples, each d-dimensional
    # W shape: (n_features, n_classes)
    # Result:  (N, n_classes)    — N rows of logits, one per class
    return X @ self.W + self.b
```

The @ operator between tensors is matrix multiplication. If X is (256, 10000) and W is (10000, 151), then X @ W is (256, 151) — one row per input example, 151 class scores per row.

Adding b (shape (151,)) to a (256, 151) matrix uses broadcasting: PyTorch virtually replicates b across all 256 rows without copying memory. The output is called logits — raw, unnormalized scores. Logits can be any real number. A big positive logit for class 5 means "this input strongly looks like class 5." A very negative logit means "this input really doesn't look like class 5."
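Broadcasting in miniature, with 4 rows and 3 classes instead of 256 and 151:

```python
import torch

logits = torch.zeros(4, 3)               # pretend (N, C) scores
b = torch.tensor([0.5, -0.5, 0.0])       # one bias per class, shape (3,)

out = logits + b                         # b is virtually replicated across rows
print(out.shape)                         # torch.Size([4, 3])
print(out[0])                            # tensor([ 0.5000, -0.5000,  0.0000])
```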

To turn logits into actual probabilities (positive, summing to 1), apply softmax:

softmax(x_i) = exp(x_i) / Σ exp(x_j)

for all classes j

Two things happen: exp makes everything positive (since e^x > 0 for any real x), and dividing by the sum normalizes to 1. Softmax also preserves ordering — the biggest logit becomes the biggest probability.

softmax_intuition.py

```python
import torch

# Given logits [2.0, 1.0, -1.0] for 3 classes:
logits = torch.tensor([2.0, 1.0, -1.0])

exp = logits.exp()            # tensor([7.389, 2.718, 0.368])
probs = exp / exp.sum()       # sum of exps is 10.475
print(probs)                  # ≈ [0.705, 0.259, 0.035] — sums to 1.0

# The biggest logit -> biggest probability.
# Softmax preserves the ranking, just turns scores into a distribution.
```

The numerical overflow trap

If any logit is, say, 1000, then `exp(1000) = inf` and your training dies instantly. If logits are very negative, `exp` underflows to zero and you take `log(0) = -inf` later. Either way, inf and NaN silently poison every downstream number. This is why nobody computes raw softmax in practice — they use `log_softmax`.

torch.log_softmax computes log(softmax(x)) directly using the log-sum-exp trick: subtract max(x) before exponentiating. Mathematically the constant cancels out; computationally, the largest exp term becomes exp(0) = 1 and everything else is between 0 and 1. No overflow, ever.

log_softmax(x_i) = x_i − max(x) − log(Σ exp(x_j − max(x)))

the log-sum-exp identity
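A minimal demonstration of the failure and the fix (the extreme logit values are invented for illustration):

```python
import torch

big = torch.tensor([1000.0, 999.0, 998.0])

naive = big.exp() / big.exp().sum()       # exp(1000) overflows to inf -> nan
print(naive)                              # tensor([nan, nan, nan])

stable = torch.log_softmax(big, dim=0)    # log-sum-exp trick, no overflow
print(stable.exp())                       # ≈ [0.665, 0.245, 0.090], sums to 1
```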

Cross-entropy is the standard loss for classification, and the intuition is simple: if the model assigns probability 0.9 to the correct class, you're happy; if it assigns 0.001, you're sad. The loss function -log(p) has exactly this shape — near zero when p is close to 1, and exploding toward infinity as p approaches 0.

So for each training example, we want to compute -log(p_correct_class) and average over the batch.

cross_entropy.py

```python
def compute_loss(self, logits: Tensor, targets: Tensor) -> Tensor:
    N = logits.shape[0]
    log_probs = torch.log_softmax(logits, dim=1)

    # Pluck out log P(correct_class) for each example in the batch
    # torch.arange(N)  = [0, 1, 2, ..., N-1]   (row indices)
    # targets          = [5, 12, 0, 37, ...]   (column indices = correct class)
    # -> one log-prob per example, shape (N,)
    nll = -log_probs[torch.arange(N), targets]

    return nll.mean()
```

Fancy indexing unpacked

`log_probs[torch.arange(N), targets]` is the worth-memorizing trick. `torch.arange(N)` is `[0, 1, 2, ..., N-1]` (row indices). `targets` is the correct class for each example, e.g. `[5, 12, 0, 37, ...]`. Read together: 'from row 0 take column 5, from row 1 take column 12, ...'. One line, vectorized across the whole batch.

Why go via log_softmax and then index, instead of computing softmax, indexing, then taking log? Numerical stability. Staying in log-space means tiny probabilities like 1e-30 don't underflow to zero.
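The indexing pattern on a two-example toy batch (the log-prob values are invented for illustration):

```python
import torch

log_probs = torch.tensor([[-0.4, -1.6, -2.3],    # example 0's log-probs
                          [-2.0, -0.2, -3.1]])   # example 1's log-probs
targets = torch.tensor([0, 1])                   # correct class per example

picked = log_probs[torch.arange(2), targets]     # row 0 col 0, row 1 col 1
print(picked)                                    # tensor([-0.4000, -0.2000])
print(-picked.mean())                            # tensor(0.3000) — the NLL
```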

Left unchecked, the model will learn huge weights to memorize the training set, then fail miserably on new data. This is overfitting. L2 regularization prevents it by adding a penalty proportional to the sum of squared weights:

l2_penalty.py

```python
def _l2_penalty(self) -> Tensor:
    # Sum of squared weights. Does NOT include bias — standard practice.
    return self.config.l2_lambda * (self.W ** 2).sum()

# Total loss used for backprop:
#     loss = cross_entropy + l2_lambda * sum(W^2)
```

The total loss becomes cross_entropy + λ · Σ W². The optimizer now has two pressures: reduce the cross-entropy (fit the data) and keep weights small (stay simple). Lambda controls the tradeoff — too small and regularization does nothing, too big and the model underfits because every weight is squeezed toward zero.

Why don't we regularize the bias?

Biases just shift the decision boundary; they don't cause overfitting the way weights do. Penalizing them would force the model to assume all classes are equally likely before seeing any input, which is a constraint we don't want.
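One way to see the "keep weights small" pressure concretely: the gradient of λ · ΣW² is 2λW, so each update subtracts a slice proportional to the weight itself, and bigger weights get pulled harder. A quick standalone check (plain tensors, not the class code):

```python
import torch

lam = 1e-4
W = torch.randn(5, 3, requires_grad=True)

penalty = lam * (W ** 2).sum()
penalty.backward()

# Autograd agrees with the calculus: d/dW of λ·ΣW² is 2λW.
assert torch.allclose(W.grad, 2 * lam * W)
```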

Now the heart of it. Every epoch we shuffle, then iterate over mini-batches. For each batch we run the five-step cycle that is the beating heart of essentially all deep learning:

train_loop.py

```python
for epoch in range(self.config.epochs):

    # Shuffle each epoch so batches see a random mix of classes
    perm = torch.randperm(N, generator=gen)
    X_shuffled = X_train[perm]
    y_shuffled = y_train[perm]

    for start in range(0, N, self.config.batch_size):
        end = min(start + self.config.batch_size, N)
        X_batch = X_shuffled[start:end]
        y_batch = y_shuffled[start:end]

        # 1. Forward: compute predictions
        logits = self.forward(X_batch)

        # 2. Loss: how wrong are we?
        ce_loss = self.compute_loss(logits, y_batch)
        l2_loss = self._l2_penalty()
        loss = ce_loss + l2_loss

        # 3. Backward: let autograd compute d(loss)/d(W) and d(loss)/d(b)
        loss.backward()

        # 4. Update: step opposite to the gradient
        with torch.no_grad():
            self.W.data -= self.config.lr * self.W.grad
            self.b.data -= self.config.lr * self.b.grad

            # 5. Zero the gradients — otherwise they accumulate
            self.W.grad.zero_()
            self.b.grad.zero_()
```

If the data were sorted by class (all class 0 first, then class 1, and so on), the model would train on one class for ages, forget the previous one, and oscillate forever. Shuffling guarantees each batch sees a random mix. `torch.randperm(N)` gives a random permutation of `[0 .. N-1]`, and `X_train[perm]` reorders the rows accordingly.

| Approach | Gradient quality | Speed | Memory |
| --- | --- | --- | --- |
| Full batch (all N at once) | Exact | Slow per step | High |
| Single example (SGD) | Very noisy | Fast per step, jittery overall | Low |
| Mini-batch (e.g. 256) | Good estimate | Fast, stable | Moderate |

Mini-batches hit the sweet spot — stable enough to converge, frequent enough to make fast progress, and small enough to fit in GPU memory. `min(start + batch_size, N)` handles the final batch cleanly when N doesn't divide evenly.

This is the part that feels like magic until you know. When you called X @ W, PyTorch silently recorded "matmul, with these inputs" on a hidden computation graph. Same for log_softmax, the indexing, the .mean(). Every operation on a requires_grad=True tensor adds a node to this graph.

loss.backward() walks that graph in reverse, applying the chain rule from calculus at each node, all the way back to the tensors with requires_grad=True. The final gradients land in W.grad and b.grad, which have the same shapes as W and b.

What the gradient means, concretely

`W.grad[i, j]` tells you: 'if I nudged `W[i, j]` up by a tiny amount ε, the loss would change by approximately `W.grad[i, j] · ε`.' That's the number we need to know which direction to move each weight.
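You can verify that interpretation numerically with a finite-difference check (a standalone sketch with a made-up scalar loss, not the classifier's):

```python
import torch

torch.manual_seed(1)
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(3)

loss = (x @ W).sum()          # stand-in scalar loss
loss.backward()

# Nudge W[0, 0] by eps and re-measure the loss by hand
eps = 1e-2
with torch.no_grad():
    W_nudged = W.clone()
    W_nudged[0, 0] += eps
    loss_nudged = (x @ W_nudged).sum()

approx = (loss_nudged - loss) / eps
print(approx.item(), W.grad[0, 0].item())   # nearly identical: both equal x[0]
```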

You never write a derivative yourself — PyTorch ships with the derivative of every tensor operation built in. That's why this framework took over.

The gradient points in the direction of steepest increase of the loss. We want to decrease the loss, so we step in the opposite direction. That's literally the entire idea of gradient descent:

W_new = W_old − lr · (∂loss / ∂W)

gradient descent, in one line

Two tricky details in the code worth pausing on. The with torch.no_grad(): block tells PyTorch "don't track these operations" — otherwise the update itself becomes part of the graph, creating a recursive mess. And .data modifies the underlying tensor values directly, without breaking autograd's bookkeeping.

The classic PyTorch beginner bug

PyTorch **accumulates** gradients by default. If you call `backward()` twice without zeroing, the gradients sum up. If you forget `zero_()` in your training loop, gradients keep growing epoch over epoch, your updates explode, and the loss diverges into NaN. Everyone hits this once. Only once, if they're lucky.

Why does PyTorch accumulate instead of replacing? Because sometimes you want accumulation — for example, gradient accumulation across several small batches to simulate a larger effective batch size when memory is tight. The framework gives you flexibility and demands you handle the bookkeeping.
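The accumulation behavior, in a standalone toy:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
print(w.grad)          # tensor(2.)

(2 * w).backward()     # second backward ADDS to the existing gradient
print(w.grad)          # tensor(4.) — accumulated, not replaced

w.grad.zero_()
print(w.grad)          # tensor(0.)
```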

predict.py

```python
def predict(self, X: Tensor) -> Tensor:
    with torch.no_grad():                  # no gradients needed at inference
        logits = self.forward(X)
        return logits.argmax(dim=1)        # no softmax needed — argmax is monotonic
```

Two optimizations here that matter in production. First, torch.no_grad() skips building the computation graph — no autograd bookkeeping, less memory, faster inference. Second, we don't compute softmax at all. Since softmax is monotonic (bigger logit ⇒ bigger probability), argmax(logits) == argmax(softmax(logits)). Save yourself the exp, the sum, and the division.
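Worth a one-off check if you've never convinced yourself of the monotonicity claim:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, 151)     # 8 fake examples, 151 classes

probs = torch.softmax(logits, dim=1)
# Same winner per row, with or without softmax
assert torch.equal(logits.argmax(dim=1), probs.argmax(dim=1))
```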

The best way to solidify any of this is to run training on a tiny toy dataset and watch the numbers move. Weights change, gradients shrink, loss drops. You can't unsee it.

toy_training.py

```python
# Tiny sanity check — run this to *see* training happen.
import torch

torch.manual_seed(0)
X = torch.randn(50, 10)                    # 50 examples, 10 features
y = torch.randint(0, 3, (50,))             # 3 classes

W = (torch.randn(10, 3) * 0.01).requires_grad_()
b = torch.zeros(3, requires_grad=True)

for step in range(20):
    logits = X @ W + b
    log_probs = torch.log_softmax(logits, dim=1)
    loss = -log_probs[torch.arange(50), y].mean()
    loss.backward()

    grad_mag = W.grad.abs().mean().item()  # capture before zeroing
    with torch.no_grad():
        W.data -= 0.1 * W.grad
        b.data -= 0.1 * b.grad
        W.grad.zero_()
        b.grad.zero_()

    print(f"step {step:2d}  loss={loss.item():.4f}  "
          f"|W|={W.abs().mean().item():.4f}  "
          f"|grad|={grad_mag:.4f}")
```

Run it and you'll see the loss start around 1.1 (random guessing over 3 classes gives -log(1/3) ≈ 1.10) and tick steadily downward. You'll also see |grad| shrink — as the model approaches a good solution, there's less and less to correct.

Zoom out and the entire arc of training is this: start with random weights. Predict. Measure wrongness. Use autograd to find which direction to nudge each weight. Take a small step. Repeat thousands of times. Slowly, the columns of W fill in useful patterns — one column comes to represent "flight-booking vocabulary," another "weather-query vocabulary," and so on — and the loss drops.

The beautiful thing about this implementation is that every step is visible. There's no nn.Module hiding the parameters, no optim.SGD hiding the update rule, no CrossEntropyLoss hiding the log-softmax. Once you've written this, you know exactly what every library shortcut is doing underneath — and when something breaks in a bigger model, you'll have the vocabulary to debug it.

Key takeaways

  1. Logistic regression = linear scores + softmax + argmax. Training = nudging the linear scores until the argmax matches the label.
  2. Logits are unnormalized scores. Softmax only exists to make them into probabilities; for prediction, argmax on logits is equivalent and cheaper.
  3. Use log_softmax, never raw softmax. The log-sum-exp trick is the difference between training that works and training that silently explodes.
  4. Cross-entropy punishes confident wrong answers. -log(p_correct) is huge when p is tiny, zero when p is one.
  5. L2 regularization shrinks weights to prevent overfitting. Don't regularize the bias.
  6. The five-step cycle is universal. Forward, loss, backward, update, zero — every neural network you ever train follows it.
  7. loss.backward() is not magic — it's the chain rule replayed over a recorded computation graph. Autograd does the bookkeeping; you do the modeling.
  8. Zero your gradients. You will forget this once. Then never again.

What's next

This exact structure — forward, loss, backward, update, zero — is the skeleton of every deep learning model. Swap `X @ W + b` for a stack of `nn.Linear` layers and you have a feed-forward network. Swap it for attention blocks and you have a [**transformer**](/blog/introduction-to-transformers). The ceremony changes; the cycle does not. Once you have a working model, learn how to evaluate it properly with [**The Impartial Judge: Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness).
