
The Impartial Judge: Inside a Production ML Evaluation Harness
A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.
Every ML project eventually runs into the same uncomfortable question: is version B actually better than version A? You can squint at loss curves, trust your gut, or cherry-pick examples — but until both models pass through the same scoring system, you're guessing. This post cracks open a real evaluation harness — the kind you'd find in a production ML repo — and unpacks every design decision inside it, one piece at a time.
Mental Model
The harness is a single Python module with two dataclasses (the report cards), three functions (score, time, format), and a small percentile helper. That's it. The whole point of a harness is to be small, stable, and boring — you don't want surprises in your ruler.
The file opens with two @dataclass definitions. They're lightweight containers — structs with benefits. Instead of returning a confusing tuple like (0.89, 0.87, 0.92, 234.1) where you have to remember which number is which, the function returns an object with named fields.
```python
from dataclasses import dataclass, field


@dataclass
class ClassificationMetrics:
    accuracy: float
    macro_f1: float
    per_class_f1: dict[int, float] = field(default_factory=dict)
    oos_recall: float | None = None
    oos_precision: float | None = None
    n_examples: int = 0


@dataclass
class LatencyStats:
    p50_ms: float
    p95_ms: float
    p99_ms: float
    mean_ms: float
    n_iters: int
```

Notice the separation of concerns: ClassificationMetrics is about quality (did the model get it right?), LatencyStats is about speed (how long did it take?). They're orthogonal: a slow-but-accurate model is useful for some contexts, a fast-but-mediocre one for others. Keeping them separate lets each evolve independently.
Why the as_row() method?
ClassificationMetrics.as_row() flattens the dataclass into a rounded dict. Tiny function, big win: CSV export, results tables, and dashboards all consume the same shape. One source of truth for reporting.
The scoring function takes two equal-length lists, y_true (the correct labels) and y_pred (what the model guessed), and returns four kinds of numbers. Let's unpack each one.
Accuracy is the fraction of predictions that are correct. 89 right out of 100 = 0.89. Simple. Intuitive. Often misleading.
Accuracy lies under class imbalance
F1 fixes the imbalance problem by measuring two things together for each class:
- Precision: of the times we predicted class X, how often were we right?
- Recall: of the actual class-X examples, how many did we catch?
- F1: the harmonic mean of precision and recall — high only when BOTH are high
The harmonic mean is the secret sauce. A regular average would let you cheat — score 1.0 on precision and 0.1 on recall, average is 0.55. But the harmonic mean punishes imbalance: it pulls the score toward the worse of the two numbers.
| Scenario | Precision | Recall | F1 |
|---|---|---|---|
| Never predicts X | N/A | 0.0 | 0.0 |
| Predicts X for everything | low | 1.0 | low |
| Conservative but accurate | high | moderate | decent |
| Well-balanced | high | high | high |
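The arithmetic behind the harmonic mean is short enough to check by hand. This minimal sketch uses a hypothetical helper, f1_from, and the precision 1.0 / recall 0.1 case mentioned earlier to compare the two kinds of average.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.0, 0.1
arithmetic = (precision + recall) / 2
harmonic = f1_from(precision, recall)

print(f"arithmetic mean={arithmetic:.2f}, F1={harmonic:.2f}")
# prints: arithmetic mean=0.55, F1=0.18
```

The arithmetic mean lands in the middle, while F1 stays close to the weaker of the two numbers.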
Macro F1 is the unweighted average of per-class F1 scores. It treats every class as equally important regardless of how common it is. A model that nails the common classes but bombs the rare ones will score high on accuracy but low on macro F1. That's usually what you want to know.
```python
from sklearn.metrics import f1_score

# Score every label that appears in either the ground truth or the predictions.
labels = sorted(set(y_true) | set(y_pred))
per_class_scores = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class_f1 = {label: float(score) for label, score in zip(labels, per_class_scores, strict=True)}
macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
```

The subtle bug the labels= kwarg prevents
Real-world classifiers need to say 'I don't know' sometimes. A support-ticket router shouldn't confidently shove a random gibberish message into 'billing' — it should abstain. That's what OOS (out-of-scope) detection measures.
```python
if oos_label is not None:
    true_oos = sum(1 for label in y_true if label == oos_label)
    pred_oos = sum(1 for label in y_pred if label == oos_label)
    true_positive_oos = sum(
        1
        for true_label, pred_label in zip(y_true, y_pred, strict=True)
        if true_label == oos_label and pred_label == oos_label
    )
    oos_recall = true_positive_oos / true_oos if true_oos else 0.0
    oos_precision = true_positive_oos / pred_oos if pred_oos else 0.0
```

- OOS recall: of all truly out-of-scope messages, how many did we correctly flag? (Catching the abstentions)
- OOS precision: of all messages we flagged OOS, how many actually were? (Not over-abstaining)
Why manual counting instead of sklearn? Because the concept is clearer as arithmetic, and the explicit if true_oos else 0.0 makes the zero-division behavior obvious. No hidden library magic.
You now know how often the model is right. The other half of the story is how fast it is. This is where measure_latency steps in — and it's packed with benchmarking wisdom.
The first call to a model is almost always the slowest. With PyTorch on a GPU, the first forward pass typically pays for one-time setup such as CUDA context initialization and kernel loading (or shader compilation on Apple's MPS backend), memory allocation, and cache population. On CPU, it pays for lazy imports and cold caches. Including those first-call timings in your measurement pollutes your numbers.
First-call latency can be 10-100x worse than steady state
This is one of the most important ideas in production monitoring, so let's slow down. Averages lie. Especially with latency.
99 calls at 10ms + 1 call at 5 seconds = ~60ms mean. One user in a hundred waited five seconds. The mean doesn't tell you that.
— The Case for Percentiles
In production, a small fraction of slow requests create most of the bad user experiences. That's why you want percentiles — they describe the distribution, not just the center.
| Metric | What It Represents | Who Cares |
|---|---|---|
| p50 (median) | Typical experience — half of users see this or better | Everyone |
| p95 | 1-in-20 tail — the slow experiences | Product teams |
| p99 | 1-in-100 worst — the really painful ones | SREs & on-call |
| mean | Average — vulnerable to outliers | Don't trust alone |
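The "99 fast calls plus one slow one" scenario from the quote above is easy to reproduce with the standard library. The latencies here are synthetic, chosen only to match that example.

```python
import statistics

# Synthetic latencies in milliseconds: 99 fast calls and one 5-second outlier.
latencies_ms = [10.0] * 99 + [5000.0]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
slowest_ms = max(latencies_ms)

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, max={slowest_ms:.1f} ms")
# prints: mean=59.9 ms, median=10.0 ms, max=5000.0 ms
```

No request actually took about 60 ms: the mean describes neither the typical call nor the painful one, which is why the harness reports percentiles alongside it.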
The function computes percentiles by hand using linear interpolation — the standard approach when the target rank falls between two sorted samples.
```python
import math


def percentile(sorted_values: list[float], q: float) -> float:
    """Return the q-th quantile (0 <= q <= 1) of an ascending-sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    # Fractional index into the sorted list, e.g. 94.05 for p95 of 100 samples.
    rank = q * (len(sorted_values) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    # Interpolate linearly between the two neighbouring samples.
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction
```

Walking through p95 on 100 samples: rank = 0.95 * 99 = 94.05. That means take the value at index 94 and blend it 5% of the way toward index 95. If the rank lands on an integer, no interpolation is needed.
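The p95 walkthrough can be checked numerically. The sketch below restates the interpolation rule in compact form so it runs on its own, applies it to 100 synthetic samples (0.0 through 99.0), and compares the result with the standard library's statistics.quantiles using method="inclusive", which should follow the same rank convention.

```python
import math
import statistics


def percentile(sorted_values: list[float], q: float) -> float:
    # Same linear-interpolation rule as the helper above, restated to run standalone.
    rank = q * (len(sorted_values) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


samples = [float(i) for i in range(100)]  # synthetic: 0.0, 1.0, ..., 99.0
p95 = percentile(samples, 0.95)

# quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
stdlib_p95 = statistics.quantiles(samples, n=100, method="inclusive")[94]

print(round(p95, 2), round(stdlib_p95, 2))
# prints: 94.05 94.05
```

Because the sample values equal their indices, the result reads directly as "index 94, plus 5% of the way to index 95".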
The docstring is emphatic: Do NOT batch. Production serving typically handles one query at a time — user sends a message, model replies. The number that matters is per-query latency. Batched throughput is a completely different metric: higher, but it doesn't reflect the user's wait time.
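Warmup and one-query-at-a-time timing fit together in a few lines. This is a minimal sketch under stated assumptions: the function name time_single_queries, its parameters, and the default of three warmup calls are all hypothetical, and it is not the harness's actual measure_latency, whose signature is not shown here.

```python
import time
from collections.abc import Callable, Sequence


def time_single_queries(
    predict: Callable[[str], object],
    queries: Sequence[str],
    warmup: int = 3,
) -> list[float]:
    """Return one latency in milliseconds per query, timed one call at a time."""
    if not queries:
        raise ValueError("need at least one query")
    # Warmup calls are executed but never recorded.
    for i in range(warmup):
        predict(queries[i % len(queries)])
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        predict(query)  # a single query per call, never a batch
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


# Usage with a stand-in "model" that just records what it was asked.
calls: list[str] = []


def fake_predict(text: str) -> int:
    calls.append(text)
    return len(text)


timings = time_single_queries(fake_predict, ["a", "bb", "ccc"], warmup=2)
print(len(timings), len(calls))
# prints: 3 5
```

Three timings come back even though the model was called five times; the two warmup calls are deliberately thrown away.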
time.perf_counter() over time.time()
time.perf_counter() is a monotonic, high-resolution clock: it won't jump backward if the system clock adjusts, and it has nanosecond-ish precision. Always use it for benchmarking; time.time() is for wall-clock timestamps, not timing.
Once you've scored and timed the model, you need to put the numbers somewhere humans will read. format_metrics_row produces a single markdown table row, destined for an append-only RESULTS.md log that tracks every model you've ever tried.
```python
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        if v is None:
            return "N/A"
        return f"{v:.{digits}f}"

    return (
        f"| {name} "
        f"| {fmt(metrics.accuracy)} "
        f"| {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} "
        f"|"
    )
```

The inner fmt() helper handles None gracefully: a metric that wasn't computed renders as N/A rather than crashing. Small detail, big resilience.
Append-only results files are git gold
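One possible shape for the append-only log is sketched below. The helper name append_result_row and the header columns are assumptions chosen to match the row fields discussed above; the article does not show how the harness actually writes RESULTS.md.

```python
from pathlib import Path

# Hypothetical header; column names are assumed to mirror the formatted row.
HEADER = (
    "| model | accuracy | macro_f1 | oos_recall | p50_ms | p95_ms |\n"
    "|---|---|---|---|---|---|\n"
)


def append_result_row(path: Path, row: str) -> None:
    """Append one markdown table row, writing the header only for a new file."""
    is_new = not path.exists()
    # Mode "a" only ever adds to the end, so earlier results are never rewritten.
    with path.open("a", encoding="utf-8") as f:
        if is_new:
            f.write(HEADER)
        f.write(row.rstrip("\n") + "\n")
```

Calling append_result_row(Path("RESULTS.md"), row) once per run keeps every earlier row untouched, so each new model shows up in version control as a one-line diff.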
Every future model you build plugs into this exact flow. Same API, same report shape, instantly comparable to every previous run. That's the whole point — a harness is an investment in comparability.
- Separate quality from speed: they're orthogonal concerns, so use two dataclasses.
- Don't trust accuracy alone: under class imbalance, it rewards laziness. Reach for macro F1.
- Pass labels= explicitly to sklearn: otherwise your F1 shifts when a rare class is absent from a split.
- Measure OOS precision AND recall: catching abstentions (recall) and not over-abstaining (precision) are both important.
- Always warm up before timing: first-call latency is not representative of steady state.
- Report percentiles, not means: p50, p95, p99 describe the distribution; the mean hides tail pain.
- Use time.perf_counter(): monotonic, high-resolution, benchmarking-appropriate.
- Never batch when measuring per-query latency: it gives you throughput, not user wait time.
- Format output consistently: one row per run, append-only, lives in git.
A good evaluation harness isn't clever. It's disciplined. It makes the same choices every time, surfaces the numbers that matter, and hides the ones that mislead. Every model that passes through it gets the same treatment — the impartial judge that your project deserves.