
TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First
Before reaching for LLMs or neural networks for text classification, try the boring thing. Here's how TF-IDF + Logistic Regression works, why it's often embarrassingly competitive, and where it breaks.
In my intent-routing project, an 8B-parameter LLM (granite3.3:8b) landed at 72.19% accuracy on the CLINC150 benchmark — respectable, but slow. The next question is almost rude: can a model from 1995 beat it?
This post walks through the classical baseline — TF-IDF + Logistic Regression — the way I built it. No PyTorch, no GPU, no transformers. Just sklearn, a few hundred lines of code, and an answer in under a second per query.
Why bother with classical ML in 2026?
Because every neural model you ever build needs something to beat. If a model that trains in 10 seconds gets 90% accuracy, the bar for your 8B LLM is not 'working' — it's 'meaningfully better than the 10-second model'. Without baselines, you're flying blind.
The task: given a short user utterance like "cancel my flight to Paris", predict which of 150 intents it belongs to (book_flight, cancel_reservation, weather, etc.). CLINC150 spreads those 150 intents across 10 domains — banking, travel, small talk, work, and so on — plus an out-of-scope (OOS) bucket for queries the system shouldn't try to answer.
Machine learning models don't eat text. They eat numbers. TF-IDF is one of the oldest ways to turn text into numbers, and it's built on two simple intuitions:
- TF (Term Frequency) — how often does a word appear in this document? A word that shows up four times is probably more important than a word that shows up once.
- IDF (Inverse Document Frequency) — how rare is the word across all documents? Words like 'the' and 'my' appear everywhere, so they're useless for distinguishing documents. Words like 'refund' appear in a specific context, so they're valuable signal.
Multiply them together and you get a score that is high when a word is frequent here but rare elsewhere — exactly the words that make a document distinctive.
TF-IDF(t, d) = TF(t, d) × log(N / df(t))
— where N = total docs, df(t) = number of docs containing term t
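To make the formula concrete, here's the arithmetic with made-up numbers (the corpus size and document frequency are illustrative, not taken from CLINC150):

```python
import math

# Toy numbers: a corpus of 1,000 documents; "refund" appears in 20 of
# them and twice in the document we're scoring.
tf = 2                      # TF(t, d)
idf = math.log(1000 / 20)   # log(N / df(t)) = log(50) ≈ 3.91
score = tf * idf
print(round(score, 2))      # 7.82 — frequent here, rare elsewhere
```

One caveat: sklearn's `TfidfVectorizer` uses a smoothed variant (ln((1 + N) / (1 + df)) + 1) plus L2 normalization, so its numbers differ slightly from this textbook formula.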
# Three tiny "documents"
docs = [
    "cancel my flight to paris",
    "cancel my subscription",
    "book a flight to paris",
]
# After TF-IDF, each doc becomes a sparse vector of per-term weights:
# - "book" and "subscription" get the HIGHEST idf (each appears in 1/3 docs)
# - "cancel", "flight", "paris", "to", "my" share a lower idf (each in 2/3 docs)
#
# A corpus this small can't tell "to" apart from "flight" — both appear in
# two of three docs. In a realistic corpus, filler words like "to" and "my"
# show up in nearly every document, so their idf (and their weight) collapses
# toward zero, while content words like "cancel" and "book" stay high.
#
# The classifier learns: "cancel" => cancel_intent, "book" => book_intent.
The catch: bag of words
TF-IDF treats a sentence as an unordered bag of words. 'Dog bites man' and 'Man bites dog' produce identical vectors. This is the fundamental limitation we'll come back to at the end.
Despite the name, logistic regression is a classifier, not a regressor. Given a vector of features (our TF-IDF vector), it learns a set of weights for each class and produces a probability distribution over classes. For 150 intents, it learns 150 weight vectors — one per class — and picks the class with the highest score.
Why logistic regression and not something fancier? Three reasons: it trains in seconds, it handles high-dimensional sparse inputs (like TF-IDF) beautifully, and its predictions are essentially free at inference time — a dot product per class.
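To sketch what "a dot product per class" means, here's the scoring step with toy sizes — 5 features, 3 classes, random weights, all hypothetical; sklearn does the same thing with 30k features and 150 classes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # one learned weight vector per class
b = np.zeros(3)               # one bias per class
x = rng.normal(size=5)        # a TF-IDF feature vector for one utterance

scores = W @ x + b                     # one dot product per class
probs = np.exp(scores - scores.max())  # softmax (numerically stabilized)
probs /= probs.sum()
print(probs.argmax())                  # index of the winning class
```

Training is just finding the `W` and `b` that make these probabilities match the labels; inference is the three lines after the data setup.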
The sklearn Pipeline lets you glue preprocessing and modeling into a single object. This matters for one reason above all: you can't accidentally train on test data, because the whole thing trains and predicts as one unit.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# One object that holds both the vectorizer and the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=30_000,
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore terms seen in fewer than 2 docs
    )),
    ("clf", LogisticRegression(
        C=1.0,                # inverse regularization strength
        solver="lbfgs",
        max_iter=1000,
    )),
])
# Train on raw text — the pipeline vectorizes internally.
pipeline.fit(X_train, y_train)
# Predict on raw text — same story.
predictions = pipeline.predict(X_test)
Why the double underscore?
In sklearn, `tfidf__ngram_range` means 'the `ngram_range` parameter of the step named `tfidf`'. This naming lets `GridSearchCV` tune nested components. It's ugly but it's how you talk to a Pipeline.
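You can see the naming in action without a grid search — `set_params` and `get_params` use the same `step__parameter` addressing (the values here are arbitrary):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Step name + "__" + parameter name routes to the nested component.
pipeline.set_params(tfidf__ngram_range=(1, 3), clf__C=0.5)
print(pipeline.get_params()["tfidf__ngram_range"])  # (1, 3)
```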
| Hyperparameter | What it controls | Typical range |
|---|---|---|
| max_features | Vocabulary cap — how many unique words/ngrams to keep | 10k – 50k |
| ngram_range | Whether to include word pairs, triples, etc. | (1,1), (1,2), (1,3) |
| min_df | Drop terms seen in fewer than N documents (kills typos/noise) | 1 – 5 |
| C | Inverse regularization strength — lower C = stronger regularization, simpler model | 0.01 – 100 |
| solver | Optimization algorithm for the logistic regression | lbfgs, liblinear |
Don't guess at these. Let GridSearchCV search the space for you. It runs cross-validation across every combination and reports the winner.
from sklearn.model_selection import GridSearchCV
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__max_features": [10_000, 30_000, 50_000],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)
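Once the search finishes, the tuned pipeline is available as `best_estimator_`, refit on all of the training data. Here's a minimal end-to-end sketch — the six-example dataset and the deliberately tiny grid are synthetic stand-ins so it runs in a second:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for X_train / y_train.
X_train = ["cancel my flight", "cancel the booking", "book a flight",
           "book me a hotel", "weather today", "weather in paris"]
y_train = ["cancel", "cancel", "book", "book", "weather", "weather"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
search = GridSearchCV(pipeline, {"clf__C": [0.5, 1.0]}, cv=2)
search.fit(X_train, y_train)

best = search.best_estimator_   # the winning pipeline, refit on all data
print(search.best_params_)
print(best.predict(["please cancel my flight"]))
```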
A common trap: benchmarking batched prediction (predict on 1000 items at once) and calling that your latency number. Real inference is often one query at a time. Measure p50 and p95 per single query, and warm up the pipeline first so one-time first-call overhead (imports, lazy allocations) doesn't skew the numbers.
import time
# Warm up
_ = pipeline.predict(["hello"])
# Measure single-query latency
latencies = []
for text in test_texts[:1000]:
    start = time.perf_counter()
    pipeline.predict([text])
    latencies.append((time.perf_counter() - start) * 1000)  # ms
latencies.sort()
print(f"p50: {latencies[500]:.3f} ms")
print(f"p95: {latencies[950]:.3f} ms")
| Metric | granite3.3:8b (LLM) | TF-IDF + LogReg |
|---|---|---|
| Accuracy | 72.19% | ~85–92% (expected) |
| Macro F1 | 0.7065 | typically higher |
| p95 latency | hundreds of ms | under 1 ms |
| Training cost | n/a (zero-shot) | ~10 seconds on CPU |
| Interpretability | opaque | inspect the weights directly |
| Parameters | 8 billion | ~4.5 million (30k features × 150 classes) |
The uncomfortable takeaway
For well-defined classification tasks with enough labeled data, classical ML often crushes zero-shot LLMs on both accuracy and latency. The LLM-as-router pattern is a nice demo, not always a production win.
TF-IDF is a bag of words. It has no idea that 'cancel' and 'terminate' mean the same thing, or that 'what time is it' and 'do you have the time' are paraphrases. The model has to see the exact words during training to learn them. Three concrete failure modes:
- Synonyms — 'cancel my flight' and 'terminate my booking' share almost no vocabulary but are the same intent. TF-IDF can't bridge that gap.
- Paraphrases — 'how cold is it outside' vs 'current temperature please' have no content words in common. A human gets it instantly; TF-IDF doesn't.
- Word order — 'transfer from checking to savings' vs 'transfer from savings to checking' are the opposite operation but produce identical bag-of-words vectors.
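The word-order failure is easy to verify for yourself — the two transfer sentences really do vectorize identically:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform([
    "transfer from checking to savings",
    "transfer from savings to checking",
])

# Same words, same counts => same bag-of-words vector.
print(np.allclose(X[0].toarray(), X[1].toarray()))  # True
```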
This is why embeddings exist
Sentence embeddings (from models like sentence-transformers) map 'cancel' and 'terminate' to nearby points in vector space. That's the next rung on the ladder.
After training, don't just stare at the accuracy number. Look at the confusion matrix and find the most-confused class pairs. Print a few misclassified examples from the worst pair and read them. You'll discover patterns — maybe two intents genuinely overlap, maybe the labels are noisy, maybe one class needs more training data. This is where intuition is built, not on dashboards. For a deeper dive into evaluation metrics like F1 scores, precision, recall, and latency percentiles, check out Inside a Production ML Evaluation Harness.
- Compute the confusion matrix from your predictions.
- Find the top 10 off-diagonal cells with the highest counts.
- For the worst pair, print 5 misclassified examples side-by-side.
- Ask: is the model wrong, or are the labels wrong?
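The first two steps of that checklist fit in a few lines. The labels and predictions below are synthetic stand-ins for `y_test` and the pipeline's output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for y_test and the pipeline's predictions.
y_true = ["book", "book", "cancel", "cancel", "weather", "weather"]
y_pred = ["book", "cancel", "cancel", "book", "weather", "weather"]
labels = ["book", "cancel", "weather"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)  # keep only the mistakes

# Rank (true, predicted) pairs by confusion count, worst first.
for flat_idx in np.argsort(off_diag.ravel())[::-1][:10]:
    i, j = np.unravel_index(flat_idx, off_diag.shape)
    if off_diag[i, j] == 0:
        break
    print(f"true={labels[i]!r} predicted={labels[j]!r} count={off_diag[i, j]}")
```

From there, filter the test set down to the worst pair and read the actual utterances side by side.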
| Scenario | Use TF-IDF + LogReg? |
|---|---|
| Short utterances, fixed vocabulary, lots of labels | Yes — often the right answer |
| Latency-critical production routing | Yes — microseconds per call |
| Long documents with nuanced meaning | Probably not — reach for embeddings |
| Need to handle unseen paraphrases | No — bag of words can't help you |
| Prototyping / establishing a baseline | Always |
- Always build the classical baseline first. It tells you what 'good' looks like before you burn GPU hours on neural models.
- TF-IDF + LogReg is a bag-of-words model. It can't handle synonyms, paraphrases, or word order — but for short utterances with enough training data, it's shockingly strong.
- Measure latency honestly — p50 and p95, one query at a time, with warmup.
- Error analysis beats metrics. Read the misclassifications. That's where intuition lives.
- The next step up is embeddings — dense vectors that capture meaning, not just word identity. That's where bag-of-words' limitations get fixed.
What's next
Want to understand what's happening inside sklearn's LogisticRegression? Check out [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch) where we build the same classifier by hand — every weight, every gradient, every update spelled out. Then swap TF-IDF for sentence embeddings (e.g., all-MiniLM-L6-v2) feeding into the same logistic regression head to see if the synonym problem goes away.