
The Impartial Judge: Inside a Production ML Evaluation Harness
A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.
Every ML project eventually runs into the same uncomfortable question: is version B actually better than version A? You can squint at loss curves, trust your gut, or cherry-pick examples — but until both models pass through the same scoring system, you're guessing. This post cracks open a real evaluation harness — the kind you'd find in a production ML repo — and unpacks every design decision inside it, one piece at a time.
Mental Model
An evaluation harness is to ML what a unit test runner is to software. Same code path, same inputs, same judge — every time. Without it, every experiment reports its own idiosyncratic numbers and nothing is comparable.
The harness is a single Python module with two dataclasses (the report cards), three functions (score, time, format), and a small percentile helper. That's it. The whole point of a harness is to be small, stable, and boring — you don't want surprises in your ruler.
The file opens with two @dataclass definitions. They're lightweight containers — structs with benefits. Instead of returning a confusing tuple like (0.89, 0.87, 0.92, 234.1) where you have to remember which number is which, the function returns an object with named fields.
```python
from dataclasses import dataclass, field

@dataclass
class ClassificationMetrics:
    accuracy: float
    macro_f1: float
    per_class_f1: dict[int, float] = field(default_factory=dict)
    oos_recall: float | None = None
    oos_precision: float | None = None
    n_examples: int = 0

@dataclass
class LatencyStats:
    p50_ms: float
    p95_ms: float
    p99_ms: float
    mean_ms: float
    n_iters: int
```

Notice the separation of concerns: ClassificationMetrics is about quality (did the model get it right?), LatencyStats is about speed (how long did it take?). They're orthogonal — a slow-but-accurate model is useful in some contexts, a fast-but-mediocre one in others. Keeping them separate lets each evolve independently.
Why the as_row() method?
`ClassificationMetrics.as_row()` flattens the dataclass into a rounded dict. Tiny function, big win — CSV export, results tables, and dashboards all consume the same shape. One source of truth for reporting.
The scoring function takes two equal-length lists — y_true (the correct labels) and y_pred (what the model guessed) — and returns four kinds of numbers. Let's unpack each one.
Accuracy is the fraction of predictions that are correct. 89 right out of 100 = 0.89. Simple. Intuitive. Often misleading.
Accuracy lies under class imbalance
If 95% of your data is 'not spam', a model that predicts 'not spam' for everything gets 95% accuracy while being completely useless. Real-world data is almost always imbalanced. Trust accuracy only when your classes are roughly equal in size.
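To make the callout concrete, here is the degenerate majority-class baseline in a few lines (the labels and split sizes are invented for illustration):

```python
# 95 'not spam' (0) and 5 'spam' (1); the model always predicts 'not spam'.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)     # 0.95 -- looks impressive
print(spam_caught)  # 0   -- catches no spam at all
```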
F1 fixes the imbalance problem by measuring two things together for each class:
- Precision: of the times we predicted class X, how often were we right?
- Recall: of the actual class-X examples, how many did we catch?
- F1: the harmonic mean of precision and recall — high only when BOTH are high
The harmonic mean is the secret sauce. A regular average would let you cheat — score 1.0 on precision and 0.1 on recall, average is 0.55. But the harmonic mean punishes imbalance: it pulls the score toward the worse of the two numbers.
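A quick check of that claim, using the standard F1 formula and the numbers from the paragraph above:

```python
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)  # this is F1

print(round(arithmetic_mean, 2))  # 0.55
print(round(harmonic_mean, 2))    # 0.18 -- dragged toward the weaker number
```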
| Scenario | Precision | Recall | F1 |
|---|---|---|---|
| Never predicts X | N/A | 0.0 | 0.0 |
| Predicts X for everything | low | 1.0 | low |
| Conservative but accurate | high | moderate | decent |
| Well-balanced | high | high | high |
Macro F1 is the unweighted average of per-class F1 scores. It treats every class as equally important regardless of how common it is. A model that nails the common classes but bombs the rare ones will score high on accuracy but low on macro F1. That's usually what you want to know.
```python
labels = sorted(set(y_true) | set(y_pred))
per_class_scores = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class_f1 = {label: float(score) for label, score in zip(labels, per_class_scores, strict=True)}
macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
```

The subtle bug the `labels=` kwarg prevents
If your test split happens to exclude a rare class, sklearn will silently average macro-F1 over fewer classes — and your score will *look* better than it should. Passing labels= explicitly pins the class vocabulary so the metric is stable across runs. This one line prevents a whole class of reporting bugs.
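To see the effect without depending on sklearn, here is a pure-Python stand-in that mimics f1_score with zero_division=0. With class 2 missing from a hypothetical split, averaging only over the labels that happen to appear inflates the score:

```python
def f1_for_label(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        return 0.0  # mirrors sklearn's zero_division=0 behavior
    return 2 * tp / (2 * tp + fp + fn)

y_true = [0, 0, 1, 1]   # class 2 is absent from this split
y_pred = [0, 0, 1, 0]

seen = sorted(set(y_true) | set(y_pred))  # [0, 1] -- what you get without labels=
pinned = [0, 1, 2]                        # the full class vocabulary

macro_seen = sum(f1_for_label(y_true, y_pred, lab) for lab in seen) / len(seen)
macro_pinned = sum(f1_for_label(y_true, y_pred, lab) for lab in pinned) / len(pinned)

print(round(macro_seen, 4))    # 0.7333 -- class 2 silently dropped
print(round(macro_pinned, 4))  # 0.4889 -- class 2 pinned at F1 = 0
```

Same predictions, two different "macro F1" values; pinning the vocabulary is what makes runs comparable.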
Real-world classifiers need to say 'I don't know' sometimes. A support-ticket router shouldn't confidently shove a random gibberish message into 'billing' — it should abstain. That's what OOS (out-of-scope) detection measures.
```python
if oos_label is not None:
    true_oos = sum(1 for label in y_true if label == oos_label)
    pred_oos = sum(1 for label in y_pred if label == oos_label)
    true_positive_oos = sum(
        1
        for true_label, pred_label in zip(y_true, y_pred, strict=True)
        if true_label == oos_label and pred_label == oos_label
    )
    oos_recall = true_positive_oos / true_oos if true_oos else 0.0
    oos_precision = true_positive_oos / pred_oos if pred_oos else 0.0
```

- OOS recall: of all truly out-of-scope messages, how many did we correctly flag? (Catching the abstentions)
- OOS precision: of all messages we flagged OOS, how many actually were? (Not over-abstaining)
Why manual counting instead of sklearn? Because the concept is clearer as arithmetic, and the explicit if true_oos else 0.0 makes the zero-division behavior obvious. No hidden library magic.
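A tiny worked example of that arithmetic (the labels and predictions are invented, and OOS is an assumed sentinel value):

```python
OOS = -1  # assumed sentinel value for the out-of-scope label

y_true = [0, 1, OOS, OOS, 2, OOS]
y_pred = [0, OOS, OOS, 1, 2, OOS]

true_oos = sum(1 for t in y_true if t == OOS)  # 3 truly out-of-scope
pred_oos = sum(1 for p in y_pred if p == OOS)  # 3 flagged as out-of-scope
tp_oos = sum(
    1 for t, p in zip(y_true, y_pred) if t == OOS and p == OOS
)  # 2 correctly flagged

oos_recall = tp_oos / true_oos if true_oos else 0.0
oos_precision = tp_oos / pred_oos if pred_oos else 0.0

print(round(oos_recall, 3))     # 0.667 -- caught 2 of the 3 true OOS messages
print(round(oos_precision, 3))  # 0.667 -- 2 of the 3 OOS flags were right
```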
You now know how often the model is right. The other half of the story is how fast it is. This is where measure_latency steps in — and it's packed with benchmarking wisdom.
The first call to a model is almost always the slowest. With PyTorch on a GPU, the first forward pass triggers kernel compilation (CUDA on NVIDIA, MPS on Apple silicon), memory allocation, and cache population. On CPU, lazy initialization and cold caches have a similar, if smaller, effect. Including those first-call timings in your measurement pollutes your numbers.
First-call latency can be 10-100x worse than steady state
A model that does inference in 50ms steady-state might take 3 seconds on its first call. Without warmup, your p50 would be 50ms and your p99 would be 3000ms — entirely because of a one-time compile. Always warm up.
This is one of the most important ideas in production monitoring, so let's slow down. Averages lie. Especially with latency.
99 calls at 10ms + 1 call at 5 seconds = ~60ms mean. One user in a hundred waited five seconds. The mean doesn't tell you that.
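Those numbers check out in a few lines (the latency values are the hypothetical ones from the sentence above):

```python
latencies_ms = [10.0] * 99 + [5000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = sorted(latencies_ms)[len(latencies_ms) // 2]  # middle value; the median

print(mean_ms)  # 59.9 -- 'about 60ms', hiding a 5-second outlier
print(p50_ms)   # 10.0 -- what the typical user actually experienced
```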
The Case for Percentiles
In production, a small fraction of slow requests create most of the bad user experiences. That's why you want percentiles — they describe the distribution, not just the center.
| Metric | What It Represents | Who Cares |
|---|---|---|
| p50 (median) | Typical experience — half of users see this or better | Everyone |
| p95 | 1-in-20 tail — the slow experiences | Product teams |
| p99 | 1-in-100 worst — the really painful ones | SREs & on-call |
| mean | Average — vulnerable to outliers | Don't trust alone |
The function computes percentiles by hand using linear interpolation — the standard approach when the target rank falls between two sorted samples.
```python
import math

def percentile(sorted_values: list[float], q: float) -> float:
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = q * (len(sorted_values) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction
```

Walking through p95 on 100 samples: rank = 0.95 * 99 = 94.05. That means take the value at index 94 and blend it 5% of the way toward index 95. If the rank lands exactly on an integer, no interpolation is needed.
The docstring is emphatic: Do NOT batch. Production serving typically handles one query at a time — user sends a message, model replies. The number that matters is per-query latency. Batched throughput is a completely different metric: higher, but it doesn't reflect the user's wait time.
time.perf_counter() over time.time()
`time.perf_counter()` is a monotonic, high-resolution clock — it won't jump backward if the system clock adjusts, and it has nanosecond-ish precision. Always use it for benchmarking; `time.time()` is for wall-clock timestamps, not timing.
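The post describes measure_latency but never shows the full body, so here is a sketch of how warmup, per-query timing, and the percentile logic might fit together. The signature, defaults, and return shape are assumptions, not the original code:

```python
import math
import time

def measure_latency(fn, inputs, n_warmup=5, n_iters=100):
    """Time fn one input at a time (no batching); warm up before measuring."""
    # Warmup: absorb one-time compilation and cache-fill costs, untimed.
    for x in inputs[:n_warmup]:
        fn(x)

    timings_ms = []
    for i in range(n_iters):
        x = inputs[i % len(inputs)]
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(x)
        timings_ms.append((time.perf_counter() - start) * 1000)

    timings_ms.sort()

    def pct(q):
        # Linear interpolation between the two nearest sorted samples.
        rank = q * (len(timings_ms) - 1)
        lo, hi = math.floor(rank), math.ceil(rank)
        return timings_ms[lo] + (timings_ms[hi] - timings_ms[lo]) * (rank - lo)

    return {
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),
        "p99_ms": pct(0.99),
        "mean_ms": sum(timings_ms) / len(timings_ms),
    }

stats = measure_latency(lambda x: x * 2, [1, 2, 3], n_warmup=2, n_iters=50)
print(stats.keys())
```

Whatever the real implementation looks like, the invariants it must uphold are the ones in this section: warm up first, time one query at a time, and report percentiles from the sorted timings.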
Once you've scored and timed the model, you need to put the numbers somewhere humans will read. format_metrics_row produces a single markdown table row — destined for an append-only RESULTS.md log that tracks every model you've ever tried.
```python
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        if v is None:
            return "N/A"
        return f"{v:.{digits}f}"

    return (
        f"| {name} "
        f"| {fmt(metrics.accuracy)} "
        f"| {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} "
        f"|"
    )
```

The inner fmt() helper handles None gracefully — a metric that wasn't computed renders as N/A rather than crashing. Small detail, big resilience.
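Here is what a call might produce. The function is repeated so the snippet runs standalone, and the model name and scores are invented:

```python
from types import SimpleNamespace

def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        if v is None:
            return "N/A"
        return f"{v:.{digits}f}"

    return (
        f"| {name} "
        f"| {fmt(metrics.accuracy)} "
        f"| {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} "
        f"|"
    )

# Stand-ins for the dataclasses; values are made up for illustration.
metrics = SimpleNamespace(accuracy=0.891, macro_f1=0.853, oos_recall=None)
latency = SimpleNamespace(p50_ms=42.317, p95_ms=97.602)

print(format_metrics_row("distilbert-v2", metrics, latency))
# | distilbert-v2 | 0.8910 | 0.8530 | N/A | 42.3 | 97.6 |
```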
Append-only results files are git gold
Every PR that touches a model adds a new row. Git blame tells you which commit produced which score. No spreadsheet, no external dashboard, no drift — the scorecard lives with the code.
Every future model you build plugs into this exact flow. Same API, same report shape, instantly comparable to every previous run. That's the whole point — a harness is an investment in comparability.
- Separate quality from speed — they're orthogonal concerns, so use two dataclasses.
- Don't trust accuracy alone — under class imbalance, it rewards laziness. Reach for macro F1.
- Pass `labels=` explicitly to sklearn — otherwise your F1 shifts when a rare class is absent from a split.
- Measure OOS precision AND recall — catching abstentions (recall) and not over-abstaining (precision) are both important.
- Always warm up before timing — first-call latency is not representative of steady state.
- Report percentiles, not means — p50, p95, p99 describe the distribution; the mean hides tail pain.
- Use `time.perf_counter()` — monotonic, high-resolution, benchmarking-appropriate.
- Never batch when measuring per-query latency — it gives you throughput, not user wait time.
- Format output consistently — one row per run, append-only, lives in git.
A good evaluation harness isn't clever. It's disciplined. It makes the same choices every time, surfaces the numbers that matter, and hides the ones that mislead. Every model that passes through it gets the same treatment — the impartial judge that your project deserves.
Next Steps
Try adapting this harness to your own project. Add GPU utilization tracking, cost accounting per 1K tokens, or per-class confusion matrices. The shape scales with you — just keep the interfaces clean and let every future model plug in the same way. For practical examples of evaluation in action, see [**TF-IDF + Logistic Regression: The Classical ML Baseline**](/blog/tfidf-logistic-regression-baseline) and [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch).