TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First

Before reaching for LLMs or neural networks for text classification, try the boring thing. Here's how TF-IDF + Logistic Regression works, why it's often embarrassingly competitive, and where it breaks.


Before you reach for a big neural network or an LLM for text classification, try the boring thing first. In my intent-routing project, an 8B parameter LLM (granite3.3:8b) landed at 72.19% accuracy on the CLINC150 benchmark — respectable, but slow. The next question is almost rude: can a model from 1995 beat it?

This post walks through the classical baseline — TF-IDF + Logistic Regression — the way I built it. No PyTorch, no GPU, no transformers. Just sklearn, a few hundred lines of code, and an answer in under a second per query.

Why bother with classical ML in 2026?

Because every neural model you ever build needs something to beat. If a model that trains in 10 seconds gets 90% accuracy, the bar for your 8B LLM is not 'working' — it's 'meaningfully better than the 10-second model'. Without baselines, you're flying blind.

The task

Given a short user utterance like "cancel my flight to Paris", predict which of 150 intents it belongs to (book_flight, cancel_reservation, weather, etc.). CLINC150 has 150 intents spread across 10 domains — banking, travel, small talk, work, and so on — plus an out-of-scope (OOS) bucket for things the system shouldn't try to answer.

TF-IDF: turning text into numbers

Machine learning models don't eat text. They eat numbers. TF-IDF is one of the oldest ways to turn text into numbers, and it's built on two simple intuitions:

  1. TF (Term Frequency) — how often does a word appear in this document? A word that shows up four times is probably more important than a word that shows up once.
  2. IDF (Inverse Document Frequency) — how rare is the word across all documents? Words like 'the' and 'my' appear everywhere, so they're useless for distinguishing documents. Words like 'refund' appear in a specific context, so they're valuable signal.

Multiply them together and you get a score that is high when a word is frequent here but rare elsewhere — exactly the words that make a document distinctive.

TF-IDF(t, d) = TF(t, d) × log(N / df(t))

where N = total docs, df(t) = number of docs containing term t
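To make the formula concrete, here's the arithmetic for "cancel" in the first document of the three-document example used in this post. This is the plain, unsmoothed textbook formula; sklearn's implementation adds smoothing, covered below.

```python
import math

# "cancel" appears once in this document, and in 2 of the 3 documents overall.
tf = 1          # raw count of "cancel" in this document
N = 3           # total documents
df = 2          # documents containing "cancel"

tfidf = tf * math.log(N / df)
print(round(tfidf, 4))  # log(3/2) ≈ 0.4055
```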
tfidf_intuition.py
# Three tiny "documents"
docs = [
    "cancel my flight to paris",
    "cancel my subscription",
    "book a flight to paris",
]

# After TF-IDF, each doc becomes a sparse vector of term weights:
# - "subscription" and "book" appear in only 1 of 3 docs, so they get
#   the highest IDF — they're the most distinctive terms.
# - "cancel", "flight", "paris", "to" each appear in 2 of 3 docs, so
#   their IDF is lower. (In a real corpus, a stopword like "to" shows up
#   almost everywhere and its weight collapses toward zero.)
#
# The classifier then learns: "cancel" => cancel_intent, "book" => book_intent.

The catch: bag of words

TF-IDF treats a sentence as an unordered bag of words. 'Dog bites man' and 'Man bites dog' produce identical vectors. This is the fundamental limitation we'll come back to at the end.
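You can verify this in a couple of lines:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vec = TfidfVectorizer()
X = vec.fit_transform(["dog bites man", "man bites dog"])

# Same words, same counts — identical vectors. Word order is gone.
print(np.allclose(X[0].toarray(), X[1].toarray()))  # True
```

Bigrams soften this a little — with ngram_range=(1, 2), "dog bites" and "man bites" become distinct features — but they only capture adjacent pairs, not meaning.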

Logistic regression

Despite the name, logistic regression is a classifier, not a regressor. Given a vector of features (our TF-IDF vector), it learns a set of weights for each class and produces a probability distribution over classes. For 150 intents, it learns 150 weight vectors — one per class — and picks the class with the highest score.

Why logistic regression and not something fancier? Three reasons: it trains in seconds, it handles high-dimensional sparse inputs (like TF-IDF) beautifully, and its predictions are essentially free at inference time — a dot product per class.
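To see why prediction is essentially free, here's a minimal numpy sketch of the scoring step with made-up weights — the shapes match this setup (30,000 TF-IDF features, 150 intents), but the parameters are random, not learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 30_000, 150

# Hypothetical "learned" parameters: one weight vector and bias per class.
W = rng.normal(size=(n_classes, n_features))
b = rng.normal(size=n_classes)

# A TF-IDF query vector: mostly zeros, a handful of nonzero term weights.
x = np.zeros(n_features)
x[rng.choice(n_features, size=8, replace=False)] = rng.random(8)

scores = W @ x + b                  # one dot product per class
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax over the 150 classes

print(int(np.argmax(probs)))        # predicted intent index
```

In practice x is sparse, so each dot product only touches the handful of nonzero entries — that's why single-query latency lands well under a millisecond.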

Gluing it together with a Pipeline

The sklearn Pipeline lets you glue preprocessing and modeling into a single object. This matters for one reason above all: the vectorizer is fit only inside fit(), so cross-validation and grid search refit it on each training fold — there's no separate transform step where test data can quietly leak into training.

tfidf_baseline.py
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One object that holds both the vectorizer and the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=30_000,
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore words seen in < 2 docs
    )),
    ("clf", LogisticRegression(
        C=1.0,                # regularization strength (inverse)
        solver="lbfgs",
        max_iter=1000,
    )),
])

# Train on raw text — the pipeline vectorizes internally.
pipeline.fit(X_train, y_train)

# Predict on raw text — same story.
predictions = pipeline.predict(X_test)

Why the double underscore?

In sklearn, `tfidf__ngram_range` means 'the `ngram_range` parameter of the step named `tfidf`'. This naming lets `GridSearchCV` tune nested components. It's ugly but it's how you talk to a Pipeline.
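A quick way to see the scheme in action — the step names `tfidf` and `clf` match the pipeline defined above:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# "step name" + "__" + "parameter name" addresses a nested parameter.
pipeline.set_params(tfidf__ngram_range=(1, 2), clf__C=10.0)
print(pipeline.get_params()["clf__C"])              # 10.0
print(pipeline.get_params()["tfidf__ngram_range"])  # (1, 2)
```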

Hyperparameters worth tuning

| Hyperparameter | What it controls | Typical range |
| --- | --- | --- |
| `max_features` | Vocabulary cap — how many unique words/ngrams to keep | 10k – 50k |
| `ngram_range` | Whether to include word pairs, triples, etc. | (1,1), (1,2), (1,3) |
| `min_df` | Drop terms seen in fewer than N documents (kills typos/noise) | 1 – 5 |
| `C` | Regularization strength (inverse) — lower = simpler model | 0.01 – 100 |
| `solver` | Optimization algorithm for the logistic regression | lbfgs, liblinear |

Don't guess at these. Let GridSearchCV search the space for you. It runs cross-validation across every combination and reports the winner.

grid_search.py
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__max_features": [10_000, 30_000, 50_000],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1,
    verbose=1,
)

search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)

Measuring latency honestly

A common trap: benchmarking batched prediction (predicting 1,000 items at once) and calling that your latency number. Real inference is often one query at a time. Measure p50 and p95 on single queries, and warm up the pipeline first so one-time setup cost doesn't pollute the numbers.

measure_latency.py
import time

# Warm up so the first-call setup cost isn't measured
_ = pipeline.predict(["hello"])

# Measure single-query latency
latencies = []
for text in test_texts[:1000]:
    start = time.perf_counter()
    pipeline.predict([text])
    latencies.append((time.perf_counter() - start) * 1000)  # ms

latencies.sort()
p50 = latencies[int(len(latencies) * 0.50)]
p95 = latencies[int(len(latencies) * 0.95)]
print(f"p50: {p50:.3f} ms")
print(f"p95: {p95:.3f} ms")

LLM vs. classical baseline

| Metric | granite3.3:8b (LLM) | TF-IDF + LogReg |
| --- | --- | --- |
| Accuracy | 72.19% | ~85–92% (expected) |
| Macro F1 | 0.7065 | typically higher |
| p95 latency | hundreds of ms | under 1 ms |
| Training cost | n/a (zero-shot) | ~10 seconds on CPU |
| Interpretability | opaque | inspect the weights directly |
| Parameters | 8 billion | ~150 thousand |

The uncomfortable takeaway

For well-defined classification tasks with enough labeled data, classical ML often crushes zero-shot LLMs on both accuracy and latency. The LLM-as-router pattern is a nice demo, not always a production win.

Where it breaks

TF-IDF is a bag of words. It has no idea that 'cancel' and 'terminate' mean the same thing, or that 'what time is it' and 'do you have the time' are paraphrases. The model has to see the exact words during training to learn them. Three concrete failure modes:

  • Synonyms — 'cancel my flight' and 'terminate my booking' share almost no vocabulary but are the same intent. TF-IDF can't bridge that gap.
  • Paraphrases — 'how cold is it outside' vs 'current temperature please' have no content words in common. A human gets it instantly; TF-IDF doesn't.
  • Word order — 'transfer from checking to savings' vs 'transfer from savings to checking' are the opposite operation but produce identical bag-of-words vectors.

This is why embeddings exist

Sentence embeddings (from models like sentence-transformers) map 'cancel' and 'terminate' to nearby points in vector space. That's the next rung on the ladder.

Error analysis

After training, don't just stare at the accuracy number. Look at the confusion matrix and find the most-confused class pairs. Print a few misclassified examples from the worst pair and read them. You'll discover patterns — maybe two intents genuinely overlap, maybe the labels are noisy, maybe one class needs more training data. This is where intuition is built, not on dashboards. For a deeper dive into evaluation metrics like F1 scores, precision, recall, and latency percentiles, check out Inside a Production ML Evaluation Harness.

  1. Compute the confusion matrix from your predictions.
  2. Find the top 10 off-diagonal cells with the highest counts.
  3. For the worst pair, print 5 misclassified examples side-by-side.
  4. Ask: is the model wrong, or are the labels wrong?
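The four steps above can be sketched as a small helper — a sketch, assuming string labels and parallel lists of texts, true labels, and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def worst_confusions(y_true, y_pred, texts, top_k=10, n_examples=5):
    """Steps 1-3 above: confusion matrix -> worst off-diagonal cells -> examples."""
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)                      # keep only the mistakes

    order = np.argsort(cm, axis=None)[::-1]      # biggest off-diagonal cells first
    results = []
    for flat_idx in order[:top_k]:
        if cm.flat[flat_idx] == 0:
            break
        i, j = np.unravel_index(flat_idx, cm.shape)
        true_lbl, pred_lbl = labels[i], labels[j]
        examples = [t for t, yt, yp in zip(texts, y_true, y_pred)
                    if yt == true_lbl and yp == pred_lbl][:n_examples]
        results.append((true_lbl, pred_lbl, int(cm[i, j]), examples))
        print(f"{true_lbl} -> {pred_lbl} ({cm[i, j]}x): {examples}")
    return results
```

Step 4 — deciding whether the model or the labels are wrong — is the part you have to do by reading.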
When to use it

| Scenario | Use TF-IDF + LogReg? |
| --- | --- |
| Short utterances, fixed vocabulary, lots of labels | Yes — often the right answer |
| Latency-critical production routing | Yes — microseconds per call |
| Long documents with nuanced meaning | Probably not — reach for embeddings |
| Need to handle unseen paraphrases | No — bag of words can't help you |
| Prototyping / establishing a baseline | Always |

Takeaways

  1. Always build the classical baseline first. It tells you what 'good' looks like before you burn GPU hours on neural models.
  2. TF-IDF + LogReg is a bag-of-words model. It can't handle synonyms, paraphrases, or word order — but for short utterances with enough training data, it's shockingly strong.
  3. Measure latency honestly — p50 and p95, one query at a time, with warmup.
  4. Error analysis beats metrics. Read the misclassifications. That's where intuition lives.
  5. The next step up is embeddings — dense vectors that capture meaning, not just word identity. That's where bag-of-words' limitations get fixed.

What's next

Want to understand what's happening inside sklearn's LogisticRegression? Check out [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch) where we build the same classifier by hand — every weight, every gradient, every update spelled out. Then swap TF-IDF for sentence embeddings (e.g., all-MiniLM-L6-v2) feeding into the same logistic regression head to see if the synonym problem goes away.
