The Impartial Judge: Inside a Production ML Evaluation Harness

A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.

Every ML project eventually runs into the same uncomfortable question: is version B actually better than version A? You can squint at loss curves, trust your gut, or cherry-pick examples — but until both models pass through the same scoring system, you're guessing. This post cracks open a real evaluation harness — the kind you'd find in a production ML repo — and unpacks every design decision inside it, one piece at a time.

Mental Model

An evaluation harness is to ML what a unit test runner is to software. Same code path, same inputs, same judge — every time. Without it, every experiment reports its own idiosyncratic numbers and nothing is comparable.

The harness is a single Python module with two dataclasses (the report cards), three functions (score, time, format), and a small percentile helper. That's it. The whole point of a harness is to be small, stable, and boring — you don't want surprises in your ruler.

The file opens with two @dataclass definitions. They're lightweight containers — structs with benefits. Instead of returning a confusing tuple like (0.89, 0.87, 0.92, 234.1) where you have to remember which number is which, the function returns an object with named fields.

`eval.py`

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationMetrics:
    accuracy: float
    macro_f1: float
    per_class_f1: dict[int, float] = field(default_factory=dict)
    oos_recall: float | None = None
    oos_precision: float | None = None
    n_examples: int = 0


@dataclass
class LatencyStats:
    p50_ms: float
    p95_ms: float
    p99_ms: float
    mean_ms: float
    n_iters: int
```

Notice the separation of concerns: ClassificationMetrics is about quality (did the model get it right?), LatencyStats is about speed (how long did it take?). They're orthogonal — a slow-but-accurate model is useful for some contexts, a fast-but-mediocre one for others. Keeping them separate lets each evolve independently.

Why the as_row() method?

`ClassificationMetrics.as_row()` flattens the dataclass into a rounded dict. Tiny function, big win — CSV export, results tables, and dashboards all consume the same shape. One source of truth for reporting.

The scoring function takes two equal-length lists — y_true (the correct labels) and y_pred (what the model guessed) — and returns four kinds of numbers. Let's unpack each one.

Accuracy is the fraction of predictions that are correct. 89 right out of 100 = 0.89. Simple. Intuitive. Often misleading.

Accuracy lies under class imbalance

If 95% of your data is 'not spam', a model that predicts 'not spam' for everything gets 95% accuracy while being completely useless. Real-world data is almost always imbalanced. Trust accuracy only when your classes are roughly equal in size.
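The degenerate case is easy to reproduce in a few lines of plain Python:

```python
# Toy demonstration: a model that always predicts the majority class
# still scores 95% accuracy while catching zero spam.
y_true = ["not_spam"] * 95 + ["spam"] * 5
y_pred = ["not_spam"] * 100          # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95
```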

F1 fixes the imbalance problem by measuring two things together for each class:

  • Precision: of the times we predicted class X, how often were we right?
  • Recall: of the actual class-X examples, how many did we catch?
  • F1: the harmonic mean of precision and recall — high only when BOTH are high

The harmonic mean is the secret sauce. A regular average would let you cheat — score 1.0 on precision and 0.1 on recall, average is 0.55. But the harmonic mean punishes imbalance: it pulls the score toward the worse of the two numbers.
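The arithmetic is worth seeing once. With precision 1.0 and recall 0.1:

```python
# Arithmetic vs harmonic mean for a lopsided precision/recall pair.
precision, recall = 1.0, 0.1

arithmetic = (precision + recall) / 2                 # 0.55 -- far too forgiving
f1 = 2 * precision * recall / (precision + recall)    # ~0.18 -- pulled toward the weak side
```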

| Scenario | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Never predicts X | N/A | 0.0 | 0.0 |
| Predicts X for everything | low | 1.0 | low |
| Conservative but accurate | high | moderate | decent |
| Well-balanced | high | high | high |

Macro F1 is the unweighted average of per-class F1 scores. It treats every class as equally important regardless of how common it is. A model that nails the common classes but bombs the rare ones will score high on accuracy but low on macro F1. That's usually what you want to know.
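Conceptually, macro F1 is nothing more than an unweighted mean. A toy illustration with made-up per-class scores:

```python
# Macro F1 by hand: unweighted mean of per-class F1, so rare classes count fully.
per_class_f1 = {"billing": 0.92, "shipping": 0.88, "rare_class": 0.10}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.6333 -- the rare class drags it down
```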

`eval.py`

```python
from sklearn.metrics import f1_score

labels = sorted(set(y_true) | set(y_pred))
per_class_scores = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class_f1 = {label: float(score) for label, score in zip(labels, per_class_scores, strict=True)}

macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
```

The subtle bug the labels= kwarg prevents

If your test split happens to exclude a rare class, sklearn will silently average macro-F1 over fewer classes — and your score will *look* better than it should. Passing labels= explicitly pins the class vocabulary so the metric is stable across runs. This one line prevents a whole class of reporting bugs.

Real-world classifiers need to say 'I don't know' sometimes. A support-ticket router shouldn't confidently shove a random gibberish message into 'billing' — it should abstain. That's what OOS (out-of-scope) detection measures.

`eval.py`

```python
if oos_label is not None:
    true_oos = sum(1 for label in y_true if label == oos_label)
    pred_oos = sum(1 for label in y_pred if label == oos_label)
    true_positive_oos = sum(
        1
        for true_label, pred_label in zip(y_true, y_pred, strict=True)
        if true_label == oos_label and pred_label == oos_label
    )
    oos_recall = true_positive_oos / true_oos if true_oos else 0.0
    oos_precision = true_positive_oos / pred_oos if pred_oos else 0.0
```

  • OOS recall: of all truly out-of-scope messages, how many did we correctly flag? (Catching the abstentions)
  • OOS precision: of all messages we flagged OOS, how many actually were? (Not over-abstaining)

Why manual counting instead of sklearn? Because the concept is clearer as arithmetic, and the explicit `if true_oos else 0.0` makes the zero-division behavior obvious. No hidden library magic.

You now know how often the model is right. The other half of the story is how fast it is. This is where measure_latency steps in — and it's packed with benchmarking wisdom.

The first call to a model is almost always the slowest. On a GPU with PyTorch, the first forward pass triggers CUDA graph compilation, MPS kernel compilation, memory allocation, caching. On CPU, it triggers import caching and branch prediction warmup. Including those first-call timings in your measurement pollutes your numbers.

First-call latency can be 10-100x worse than steady state

A model that does inference in 50ms steady-state might take 3 seconds on its first call. Without warmup, your p50 would be 50ms and your p99 would be 3000ms — entirely because of a one-time compile. Always warm up.
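The harness's measure_latency isn't reproduced in full here, but a minimal sketch of the warmup pattern (assumed signature and defaults, not the actual code) looks like this:

```python
import time

# Minimal sketch of a warmup-aware, unbatched timer. The real
# measure_latency's signature and defaults may differ.
def measure_latency(predict_fn, example, n_warmup=5, n_iters=100):
    for _ in range(n_warmup):                    # absorb one-time compile/cache costs
        predict_fn(example)
    timings_ms = []
    for _ in range(n_iters):                     # one query at a time, no batching
        start = time.perf_counter()
        predict_fn(example)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return sorted(timings_ms)                    # sorted, ready for percentile()
```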

This is one of the most important ideas in production monitoring, so let's slow down. Averages lie. Especially with latency.

99 calls at 10ms + 1 call at 5 seconds = ~60ms mean. One user in a hundred waited five seconds. The mean doesn't tell you that.
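You can verify that arithmetic directly:

```python
# 99 fast calls plus one 5-second outlier: the mean hides the tail.
latencies_ms = [10.0] * 99 + [5000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)        # 59.9 -- looks fine
p50_ms = sorted(latencies_ms)[len(latencies_ms) // 2]  # 10.0 -- the typical call
```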

The Case for Percentiles

In production, a small fraction of slow requests create most of the bad user experiences. That's why you want percentiles — they describe the distribution, not just the center.

| Metric | What It Represents | Who Cares |
| --- | --- | --- |
| p50 (median) | Typical experience — half of users see this or better | Everyone |
| p95 | 1-in-20 tail — the slow experiences | Product teams |
| p99 | 1-in-100 worst — the really painful ones | SREs & on-call |
| mean | Average — vulnerable to outliers | Don't trust alone |

The function computes percentiles by hand using linear interpolation — the standard approach when the target rank falls between two sorted samples.

`eval.py`

```python
import math

def percentile(sorted_values: list[float], q: float) -> float:
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = q * (len(sorted_values) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction
```

Walking through p95 on 100 samples: rank = 0.95 * 99 = 94.05. That means take the value at index 94 and blend it 5% of the way toward index 95. If the rank lands on an integer, no interpolation is needed.
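A quick sanity check of that walkthrough, reusing a copy of the percentile helper above:

```python
import math

# Copy of the percentile helper above, so the p95 walkthrough can be checked.
def percentile(sorted_values, q):
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = q * (len(sorted_values) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

samples = sorted(float(i) for i in range(100))   # 0.0, 1.0, ..., 99.0
print(percentile(samples, 0.95))   # ~94.05: index 94 blended 5% toward index 95
```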

The docstring is emphatic: Do NOT batch. Production serving typically handles one query at a time — user sends a message, model replies. The number that matters is per-query latency. Batched throughput is a completely different metric: higher, but it doesn't reflect the user's wait time.

time.perf_counter() over time.time()

`time.perf_counter()` is a monotonic, high-resolution clock — it won't jump backward if the system clock adjusts, and it has nanosecond-ish precision. Always use it for benchmarking; `time.time()` is for wall-clock timestamps, not timing.

Once you've scored and timed the model, you need to put the numbers somewhere humans will read. format_metrics_row produces a single markdown table row — destined for an append-only RESULTS.md log that tracks every model you've ever tried.

`eval.py`

```python
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        if v is None:
            return "N/A"
        return f"{v:.{digits}f}"

    return (
        f"| {name} "
        f"| {fmt(metrics.accuracy)} "
        f"| {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} "
        f"|"
    )
```

The inner fmt() helper handles None gracefully — a metric that wasn't computed renders as N/A rather than crashing. Small detail, big resilience.
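As a quick demo of that behavior (a copy of the function above, fed made-up scores through a SimpleNamespace stand-in, since the function only reads attributes):

```python
from types import SimpleNamespace

# Copy of format_metrics_row from above, for a runnable demo.
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        return "N/A" if v is None else f"{v:.{digits}f}"

    return (
        f"| {name} | {fmt(metrics.accuracy)} | {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} |"
    )

# Hypothetical run with no latency stats and no OOS score computed.
metrics = SimpleNamespace(accuracy=0.891, macro_f1=0.874, oos_recall=None)
print(format_metrics_row("tfidf-lr", metrics))
# | tfidf-lr | 0.8910 | 0.8740 | N/A | N/A | N/A |
```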

Append-only results files are git gold

Every PR that touches a model adds a new row. Git blame tells you which commit produced which score. No spreadsheet, no external dashboard, no drift — the scorecard lives with the code.

Every future model you build plugs into this exact flow. Same API, same report shape, instantly comparable to every previous run. That's the whole point — a harness is an investment in comparability.

  1. Separate quality from speed — they're orthogonal concerns, so use two dataclasses.
  2. Don't trust accuracy alone — under class imbalance, it rewards laziness. Reach for macro F1.
  3. Pass labels= explicitly to sklearn — otherwise your F1 shifts when a rare class is absent from a split.
  4. Measure OOS precision AND recall — catching abstentions (recall) and not over-abstaining (precision) are both important.
  5. Always warm up before timing — first-call latency is not representative of steady state.
  6. Report percentiles, not means — p50, p95, p99 describe the distribution; the mean hides tail pain.
  7. Use time.perf_counter() — monotonic, high-resolution, benchmarking-appropriate.
  8. Never batch when measuring per-query latency — it gives you throughput, not user wait time.
  9. Format output consistently — one row per run, append-only, lives in git.

A good evaluation harness isn't clever. It's disciplined. It makes the same choices every time, surfaces the numbers that matter, and hides the ones that mislead. Every model that passes through it gets the same treatment — the impartial judge that your project deserves.

Next Steps

Try adapting this harness to your own project. Add GPU utilization tracking, cost accounting per 1K tokens, or per-class confusion matrices. The shape scales with you — just keep the interfaces clean and let every future model plug in the same way. For practical examples of evaluation in action, see [**TF-IDF + Logistic Regression: The Classical ML Baseline**](/blog/tfidf-logistic-regression-baseline) and [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch).
