TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First

Before reaching for LLMs or neural networks for text classification, try the boring thing. Here's how TF-IDF + Logistic Regression works, why it's often embarrassingly competitive, and where it breaks.


Before you reach for a big neural network or an LLM for text classification, try the boring thing first. In my intent-routing project, an 8B parameter LLM (granite3.3:8b) landed at 72.19% accuracy on the CLINC150 benchmark — respectable, but slow. The next question is almost rude: can a model from 1995 beat it?

This post walks through the classical baseline — TF-IDF + Logistic Regression — the way I built it. No PyTorch, no GPU, no transformers. Just sklearn, a few hundred lines of code, and an answer in under a second per query.

Why bother with classical ML in 2026?

Because every neural model you ever build needs something to beat. If a model that trains in 10 seconds gets 90% accuracy, the bar for your 8B LLM is not 'working' — it's 'meaningfully better than the 10-second model'. Without baselines, you're flying blind.

The task

Given a short user utterance like "cancel my flight to Paris", predict which of 150 intents it belongs to (book_flight, cancel_reservation, weather, etc.). CLINC150 has 150 intents spread across 10 domains — banking, travel, small talk, work, and so on — plus an out-of-scope (OOS) bucket for things the system shouldn't try to answer.

TF-IDF: turning text into numbers

Machine learning models don't eat text. They eat numbers. TF-IDF is one of the oldest ways to turn text into numbers, and it's built on two simple intuitions:

  1. TF (Term Frequency) — how often does a word appear in this document? A word that shows up four times is probably more important than a word that shows up once.
  2. IDF (Inverse Document Frequency) — how rare is the word across all documents? Words like 'the' and 'my' appear everywhere, so they're useless for distinguishing documents. Words like 'refund' appear in a specific context, so they're valuable signal.

Multiply them together and you get a score that is high when a word is frequent here but rare elsewhere — exactly the words that make a document distinctive.

TF-IDF(t, d) = TF(t, d) × log(N / df(t))

where N = total docs, df(t) = number of docs containing term t
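To make the formula concrete, here's the arithmetic for "cancel" in the first document of the three-document example used in this post. This is the plain, unsmoothed textbook formula; sklearn's implementation adds smoothing, covered below.

```python
import math

# "cancel" appears once in this document, and in 2 of the 3 documents overall.
tf = 1          # raw count of "cancel" in this document
N = 3           # total documents
df = 2          # documents containing "cancel"

tfidf = tf * math.log(N / df)
print(round(tfidf, 4))  # log(3/2) ≈ 0.4055
```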
tfidf_intuition.py
# Three tiny "documents"
docs = [
    "cancel my flight to paris",
    "cancel my subscription",
    "book a flight to paris",
]

# After TF-IDF, each doc becomes a sparse vector of term weights:
# - "subscription" and "book" appear in only 1 of 3 docs, so they get
#   the highest IDF — they're the most distinctive terms.
# - "cancel", "flight", "paris", "to" each appear in 2 of 3 docs, so
#   their IDF is lower. (In a real corpus, a stopword like "to" shows up
#   almost everywhere and its weight collapses toward zero.)
#
# The classifier then learns: "cancel" => cancel_intent, "book" => book_intent.

The catch: bag of words

TF-IDF treats a sentence as an unordered bag of words. 'Dog bites man' and 'Man bites dog' produce identical vectors. This is the fundamental limitation we'll come back to at the end.
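You can verify this in a couple of lines:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vec = TfidfVectorizer()
X = vec.fit_transform(["dog bites man", "man bites dog"])

# Same words, same counts — identical vectors. Word order is gone.
print(np.allclose(X[0].toarray(), X[1].toarray()))  # True
```

Bigrams soften this a little — with ngram_range=(1, 2), "dog bites" and "man bites" become distinct features — but they only capture adjacent pairs, not meaning.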

Logistic regression

Despite the name, logistic regression is a classifier, not a regressor. Given a vector of features (our TF-IDF vector), it learns a set of weights for each class and produces a probability distribution over classes. For 150 intents, it learns 150 weight vectors — one per class — and picks the class with the highest score.

Why logistic regression and not something fancier? Three reasons: it trains in seconds, it handles high-dimensional sparse inputs (like TF-IDF) beautifully, and its predictions are essentially free at inference time — a dot product per class.
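To see why prediction is essentially free, here's a minimal numpy sketch of the scoring step with made-up weights — the shapes match this setup (30,000 TF-IDF features, 150 intents), but the parameters are random, not learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 30_000, 150

# Hypothetical "learned" parameters: one weight vector and bias per class.
W = rng.normal(size=(n_classes, n_features))
b = rng.normal(size=n_classes)

# A TF-IDF query vector: mostly zeros, a handful of nonzero term weights.
x = np.zeros(n_features)
x[rng.choice(n_features, size=8, replace=False)] = rng.random(8)

scores = W @ x + b                  # one dot product per class
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax over the 150 classes

print(int(np.argmax(probs)))        # predicted intent index
```

In practice x is sparse, so each dot product only touches the handful of nonzero entries — that's why single-query latency lands well under a millisecond.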

Gluing it together with a Pipeline

The sklearn Pipeline lets you glue preprocessing and modeling into a single object. This matters for one reason above all: the vectorizer is fit only inside fit(), so cross-validation and grid search refit it on each training fold — there's no separate transform step where test data can quietly leak into training.

tfidf_baseline.py
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One object that holds both the vectorizer and the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=30_000,
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore words seen in < 2 docs
    )),
    ("clf", LogisticRegression(
        C=1.0,                # regularization strength (inverse)
        solver="lbfgs",
        max_iter=1000,
    )),
])

# Train on raw text — the pipeline vectorizes internally.
pipeline.fit(X_train, y_train)

# Predict on raw text — same story.
predictions = pipeline.predict(X_test)

Why the double underscore?

In sklearn, `tfidf__ngram_range` means 'the `ngram_range` parameter of the step named `tfidf`'. This naming lets `GridSearchCV` tune nested components. It's ugly but it's how you talk to a Pipeline.
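A quick way to see the scheme in action — the step names `tfidf` and `clf` match the pipeline defined above:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# "step name" + "__" + "parameter name" addresses a nested parameter.
pipeline.set_params(tfidf__ngram_range=(1, 2), clf__C=10.0)
print(pipeline.get_params()["clf__C"])              # 10.0
print(pipeline.get_params()["tfidf__ngram_range"])  # (1, 2)
```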

Hyperparameters worth tuning

| Hyperparameter | What it controls | Typical range |
| --- | --- | --- |
| `max_features` | Vocabulary cap — how many unique words/ngrams to keep | 10k – 50k |
| `ngram_range` | Whether to include word pairs, triples, etc. | (1,1), (1,2), (1,3) |
| `min_df` | Drop terms seen in fewer than N documents (kills typos/noise) | 1 – 5 |
| `C` | Regularization strength (inverse) — lower = simpler model | 0.01 – 100 |
| `solver` | Optimization algorithm for the logistic regression | lbfgs, liblinear |

Don't guess at these. Let GridSearchCV search the space for you. It runs cross-validation across every combination and reports the winner.

grid_search.py
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__max_features": [10_000, 30_000, 50_000],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1,
    verbose=1,
)

search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)

Measuring latency honestly

A common trap: benchmarking batched prediction (predicting 1,000 items at once) and calling that your latency number. Real inference is often one query at a time. Measure p50 and p95 on single queries, and warm up the pipeline first so one-time setup cost doesn't pollute the numbers.

measure_latency.py
import time

# Warm up so the first-call setup cost isn't measured
_ = pipeline.predict(["hello"])

# Measure single-query latency
latencies = []
for text in test_texts[:1000]:
    start = time.perf_counter()
    pipeline.predict([text])
    latencies.append((time.perf_counter() - start) * 1000)  # ms

latencies.sort()
p50 = latencies[int(len(latencies) * 0.50)]
p95 = latencies[int(len(latencies) * 0.95)]
print(f"p50: {p50:.3f} ms")
print(f"p95: {p95:.3f} ms")

LLM vs. classical baseline

| Metric | granite3.3:8b (LLM) | TF-IDF + LogReg |
| --- | --- | --- |
| Accuracy | 72.19% | ~85–92% (expected) |
| Macro F1 | 0.7065 | typically higher |
| p95 latency | hundreds of ms | under 1 ms |
| Training cost | n/a (zero-shot) | ~10 seconds on CPU |
| Interpretability | opaque | inspect the weights directly |
| Parameters | 8 billion | ~150 thousand |

The uncomfortable takeaway

For well-defined classification tasks with enough labeled data, classical ML often crushes zero-shot LLMs on both accuracy and latency. The LLM-as-router pattern is a nice demo, not always a production win.

Where it breaks

TF-IDF is a bag of words. It has no idea that 'cancel' and 'terminate' mean the same thing, or that 'what time is it' and 'do you have the time' are paraphrases. The model has to see the exact words during training to learn them. Three concrete failure modes:

  • Synonyms — 'cancel my flight' and 'terminate my booking' share almost no vocabulary but are the same intent. TF-IDF can't bridge that gap.
  • Paraphrases — 'how cold is it outside' vs 'current temperature please' have no content words in common. A human gets it instantly; TF-IDF doesn't.
  • Word order — 'transfer from checking to savings' vs 'transfer from savings to checking' are the opposite operation but produce identical bag-of-words vectors.

This is why embeddings exist

Sentence embeddings (from models like sentence-transformers) map 'cancel' and 'terminate' to nearby points in vector space. That's the next rung on the ladder.

Error analysis

After training, don't just stare at the accuracy number. Look at the confusion matrix and find the most-confused class pairs. Print a few misclassified examples from the worst pair and read them. You'll discover patterns — maybe two intents genuinely overlap, maybe the labels are noisy, maybe one class needs more training data. This is where intuition is built, not on dashboards. For a deeper dive into evaluation metrics like F1 scores, precision, recall, and latency percentiles, check out Inside a Production ML Evaluation Harness.

  1. Compute the confusion matrix from your predictions.
  2. Find the top 10 off-diagonal cells with the highest counts.
  3. For the worst pair, print 5 misclassified examples side-by-side.
  4. Ask: is the model wrong, or are the labels wrong?
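The four steps above can be sketched as a small helper — a sketch, assuming string labels and parallel lists of texts, true labels, and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def worst_confusions(y_true, y_pred, texts, top_k=10, n_examples=5):
    """Steps 1-3 above: confusion matrix -> worst off-diagonal cells -> examples."""
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)                      # keep only the mistakes

    order = np.argsort(cm, axis=None)[::-1]      # biggest off-diagonal cells first
    results = []
    for flat_idx in order[:top_k]:
        if cm.flat[flat_idx] == 0:
            break
        i, j = np.unravel_index(flat_idx, cm.shape)
        true_lbl, pred_lbl = labels[i], labels[j]
        examples = [t for t, yt, yp in zip(texts, y_true, y_pred)
                    if yt == true_lbl and yp == pred_lbl][:n_examples]
        results.append((true_lbl, pred_lbl, int(cm[i, j]), examples))
        print(f"{true_lbl} -> {pred_lbl} ({cm[i, j]}x): {examples}")
    return results
```

Step 4 — deciding whether the model or the labels are wrong — is the part you have to do by reading.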
When to use it

| Scenario | Use TF-IDF + LogReg? |
| --- | --- |
| Short utterances, fixed vocabulary, lots of labels | Yes — often the right answer |
| Latency-critical production routing | Yes — microseconds per call |
| Long documents with nuanced meaning | Probably not — reach for embeddings |
| Need to handle unseen paraphrases | No — bag of words can't help you |
| Prototyping / establishing a baseline | Always |

Takeaways

  1. Always build the classical baseline first. It tells you what 'good' looks like before you burn GPU hours on neural models.
  2. TF-IDF + LogReg is a bag-of-words model. It can't handle synonyms, paraphrases, or word order — but for short utterances with enough training data, it's shockingly strong.
  3. Measure latency honestly — p50 and p95, one query at a time, with warmup.
  4. Error analysis beats metrics. Read the misclassifications. That's where intuition lives.
  5. The next step up is embeddings — dense vectors that capture meaning, not just word identity. That's where bag-of-words' limitations get fixed.

What's next

Want to understand what's happening inside sklearn's LogisticRegression? Check out [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch) where we build the same classifier by hand — every weight, every gradient, every update spelled out. Then swap TF-IDF for sentence embeddings (e.g., all-MiniLM-L6-v2) feeding into the same logistic regression head to see if the synonym problem goes away.
