The Impartial Judge: Inside a Production ML Evaluation Harness

A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.

Every ML project eventually runs into the same uncomfortable question: is version B actually better than version A? You can squint at loss curves, trust your gut, or cherry-pick examples — but until both models pass through the same scoring system, you're guessing. This post cracks open a real evaluation harness — the kind you'd find in a production ML repo — and unpacks every design decision inside it, one piece at a time.

Mental Model

An evaluation harness is to ML what a unit test runner is to software. Same code path, same inputs, same judge — every time. Without it, every experiment reports its own idiosyncratic numbers and nothing is comparable.

The harness is a single Python module with two dataclasses (the report cards), three functions (score, time, format), and a small percentile helper. That's it. The whole point of a harness is to be small, stable, and boring — you don't want surprises in your ruler.

The file opens with two @dataclass definitions. They're lightweight containers — structs with benefits. Instead of returning a confusing tuple like (0.89, 0.87, 0.92, 234.1) where you have to remember which number is which, the function returns an object with named fields.

`eval.py`

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationMetrics:
    accuracy: float
    macro_f1: float
    per_class_f1: dict[int, float] = field(default_factory=dict)
    oos_recall: float | None = None
    oos_precision: float | None = None
    n_examples: int = 0


@dataclass
class LatencyStats:
    p50_ms: float
    p95_ms: float
    p99_ms: float
    mean_ms: float
    n_iters: int
```

Notice the separation of concerns: ClassificationMetrics is about quality (did the model get it right?), LatencyStats is about speed (how long did it take?). They're orthogonal — a slow-but-accurate model is useful for some contexts, a fast-but-mediocre one for others. Keeping them separate lets each evolve independently.

Why the as_row() method?

`ClassificationMetrics.as_row()` flattens the dataclass into a rounded dict. Tiny function, big win — CSV export, results tables, and dashboards all consume the same shape. One source of truth for reporting.

The scoring function takes two equal-length lists — y_true (the correct labels) and y_pred (what the model guessed) — and returns four kinds of numbers. Let's unpack each one.

Accuracy is the fraction of predictions that are correct. 89 right out of 100 = 0.89. Simple. Intuitive. Often misleading.

Accuracy lies under class imbalance

If 95% of your data is 'not spam', a model that predicts 'not spam' for everything gets 95% accuracy while being completely useless. Real-world data is almost always imbalanced. Trust accuracy only when your classes are roughly equal in size.
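The degenerate case is easy to reproduce in a few lines of plain Python:

```python
# Toy demonstration: a model that always predicts the majority class
# still scores 95% accuracy while catching zero spam.
y_true = ["not_spam"] * 95 + ["spam"] * 5
y_pred = ["not_spam"] * 100          # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95
```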

F1 fixes the imbalance problem by measuring two things together for each class:

  • Precision: of the times we predicted class X, how often were we right?
  • Recall: of the actual class-X examples, how many did we catch?
  • F1: the harmonic mean of precision and recall — high only when BOTH are high

The harmonic mean is the secret sauce. A regular average would let you cheat — score 1.0 on precision and 0.1 on recall, average is 0.55. But the harmonic mean punishes imbalance: it pulls the score toward the worse of the two numbers.
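The arithmetic is worth seeing once. With precision 1.0 and recall 0.1:

```python
# Arithmetic vs harmonic mean for a lopsided precision/recall pair.
precision, recall = 1.0, 0.1

arithmetic = (precision + recall) / 2                 # 0.55 -- far too forgiving
f1 = 2 * precision * recall / (precision + recall)    # ~0.18 -- pulled toward the weak side
```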

| Scenario | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Never predicts X | N/A | 0.0 | 0.0 |
| Predicts X for everything | low | 1.0 | low |
| Conservative but accurate | high | moderate | decent |
| Well-balanced | high | high | high |

Macro F1 is the unweighted average of per-class F1 scores. It treats every class as equally important regardless of how common it is. A model that nails the common classes but bombs the rare ones will score high on accuracy but low on macro F1. That's usually what you want to know.
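Conceptually, macro F1 is nothing more than an unweighted mean. A toy illustration with made-up per-class scores:

```python
# Macro F1 by hand: unweighted mean of per-class F1, so rare classes count fully.
per_class_f1 = {"billing": 0.92, "shipping": 0.88, "rare_class": 0.10}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.6333 -- the rare class drags it down
```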

`eval.py`

```python
from sklearn.metrics import f1_score

labels = sorted(set(y_true) | set(y_pred))
per_class_scores = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class_f1 = {label: float(score) for label, score in zip(labels, per_class_scores, strict=True)}

macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
```

The subtle bug the labels= kwarg prevents

If your test split happens to exclude a rare class, sklearn will silently average macro-F1 over fewer classes — and your score will *look* better than it should. Passing labels= explicitly pins the class vocabulary so the metric is stable across runs. This one line prevents a whole class of reporting bugs.

Real-world classifiers need to say 'I don't know' sometimes. A support-ticket router shouldn't confidently shove a random gibberish message into 'billing' — it should abstain. That's what OOS (out-of-scope) detection measures.

`eval.py`

```python
if oos_label is not None:
    true_oos = sum(1 for label in y_true if label == oos_label)
    pred_oos = sum(1 for label in y_pred if label == oos_label)
    true_positive_oos = sum(
        1
        for true_label, pred_label in zip(y_true, y_pred, strict=True)
        if true_label == oos_label and pred_label == oos_label
    )
    oos_recall = true_positive_oos / true_oos if true_oos else 0.0
    oos_precision = true_positive_oos / pred_oos if pred_oos else 0.0
```

  • OOS recall: of all truly out-of-scope messages, how many did we correctly flag? (Catching the abstentions)
  • OOS precision: of all messages we flagged OOS, how many actually were? (Not over-abstaining)

Why manual counting instead of sklearn? Because the concept is clearer as arithmetic, and the explicit `if true_oos else 0.0` makes the zero-division behavior obvious. No hidden library magic.

You now know how often the model is right. The other half of the story is how fast it is. This is where measure_latency steps in — and it's packed with benchmarking wisdom.

The first call to a model is almost always the slowest. On a GPU with PyTorch, the first forward pass triggers CUDA graph compilation, MPS kernel compilation, memory allocation, caching. On CPU, it triggers import caching and branch prediction warmup. Including those first-call timings in your measurement pollutes your numbers.

First-call latency can be 10-100x worse than steady state

A model that does inference in 50ms steady-state might take 3 seconds on its first call. Without warmup, your p50 would be 50ms and your p99 would be 3000ms — entirely because of a one-time compile. Always warm up.
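The harness's measure_latency isn't reproduced in full here, but a minimal sketch of the warmup pattern (assumed signature and defaults, not the actual code) looks like this:

```python
import time

# Minimal sketch of a warmup-aware, unbatched timer. The real
# measure_latency's signature and defaults may differ.
def measure_latency(predict_fn, example, n_warmup=5, n_iters=100):
    for _ in range(n_warmup):                    # absorb one-time compile/cache costs
        predict_fn(example)
    timings_ms = []
    for _ in range(n_iters):                     # one query at a time, no batching
        start = time.perf_counter()
        predict_fn(example)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return sorted(timings_ms)                    # sorted, ready for percentile()
```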

This is one of the most important ideas in production monitoring, so let's slow down. Averages lie. Especially with latency.

99 calls at 10ms + 1 call at 5 seconds = ~60ms mean. One user in a hundred waited five seconds. The mean doesn't tell you that.
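You can verify that arithmetic directly:

```python
# 99 fast calls plus one 5-second outlier: the mean hides the tail.
latencies_ms = [10.0] * 99 + [5000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)        # 59.9 -- looks fine
p50_ms = sorted(latencies_ms)[len(latencies_ms) // 2]  # 10.0 -- the typical call
```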

The Case for Percentiles

In production, a small fraction of slow requests create most of the bad user experiences. That's why you want percentiles — they describe the distribution, not just the center.

| Metric | What It Represents | Who Cares |
| --- | --- | --- |
| p50 (median) | Typical experience — half of users see this or better | Everyone |
| p95 | 1-in-20 tail — the slow experiences | Product teams |
| p99 | 1-in-100 worst — the really painful ones | SREs & on-call |
| mean | Average — vulnerable to outliers | Don't trust alone |

The function computes percentiles by hand using linear interpolation — the standard approach when the target rank falls between two sorted samples.

`eval.py`

```python
import math

def percentile(sorted_values: list[float], q: float) -> float:
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = q * (len(sorted_values) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction
```

Walking through p95 on 100 samples: rank = 0.95 * 99 = 94.05. That means take the value at index 94 and blend it 5% of the way toward index 95. If the rank lands on an integer, no interpolation is needed.
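A quick sanity check of that walkthrough, reusing a copy of the percentile helper above:

```python
import math

# Copy of the percentile helper above, so the p95 walkthrough can be checked.
def percentile(sorted_values, q):
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = q * (len(sorted_values) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

samples = sorted(float(i) for i in range(100))   # 0.0, 1.0, ..., 99.0
print(percentile(samples, 0.95))   # ~94.05: index 94 blended 5% toward index 95
```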

The docstring is emphatic: Do NOT batch. Production serving typically handles one query at a time — user sends a message, model replies. The number that matters is per-query latency. Batched throughput is a completely different metric: higher, but it doesn't reflect the user's wait time.

time.perf_counter() over time.time()

`time.perf_counter()` is a monotonic, high-resolution clock — it won't jump backward if the system clock adjusts, and it has nanosecond-ish precision. Always use it for benchmarking; `time.time()` is for wall-clock timestamps, not timing.

Once you've scored and timed the model, you need to put the numbers somewhere humans will read. format_metrics_row produces a single markdown table row — destined for an append-only RESULTS.md log that tracks every model you've ever tried.

`eval.py`

```python
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        if v is None:
            return "N/A"
        return f"{v:.{digits}f}"

    return (
        f"| {name} "
        f"| {fmt(metrics.accuracy)} "
        f"| {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} "
        f"|"
    )
```

The inner fmt() helper handles None gracefully — a metric that wasn't computed renders as N/A rather than crashing. Small detail, big resilience.
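As a quick demo of that behavior (a copy of the function above, fed made-up scores through a SimpleNamespace stand-in, since the function only reads attributes):

```python
from types import SimpleNamespace

# Copy of format_metrics_row from above, for a runnable demo.
def format_metrics_row(name, metrics, latency=None, cost_per_1k=None, params=None):
    def fmt(v, digits=4):
        return "N/A" if v is None else f"{v:.{digits}f}"

    return (
        f"| {name} | {fmt(metrics.accuracy)} | {fmt(metrics.macro_f1)} "
        f"| {fmt(metrics.oos_recall)} "
        f"| {fmt(latency.p50_ms, 1) if latency else 'N/A'} "
        f"| {fmt(latency.p95_ms, 1) if latency else 'N/A'} |"
    )

# Hypothetical run with no latency stats and no OOS score computed.
metrics = SimpleNamespace(accuracy=0.891, macro_f1=0.874, oos_recall=None)
print(format_metrics_row("tfidf-lr", metrics))
# | tfidf-lr | 0.8910 | 0.8740 | N/A | N/A | N/A |
```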

Append-only results files are git gold

Every PR that touches a model adds a new row. Git blame tells you which commit produced which score. No spreadsheet, no external dashboard, no drift — the scorecard lives with the code.

Every future model you build plugs into this exact flow. Same API, same report shape, instantly comparable to every previous run. That's the whole point — a harness is an investment in comparability.

  1. Separate quality from speed — they're orthogonal concerns, so use two dataclasses.
  2. Don't trust accuracy alone — under class imbalance, it rewards laziness. Reach for macro F1.
  3. Pass labels= explicitly to sklearn — otherwise your F1 shifts when a rare class is absent from a split.
  4. Measure OOS precision AND recall — catching abstentions (recall) and not over-abstaining (precision) are both important.
  5. Always warm up before timing — first-call latency is not representative of steady state.
  6. Report percentiles, not means — p50, p95, p99 describe the distribution; the mean hides tail pain.
  7. Use time.perf_counter() — monotonic, high-resolution, benchmarking-appropriate.
  8. Never batch when measuring per-query latency — it gives you throughput, not user wait time.
  9. Format output consistently — one row per run, append-only, lives in git.

A good evaluation harness isn't clever. It's disciplined. It makes the same choices every time, surfaces the numbers that matter, and hides the ones that mislead. Every model that passes through it gets the same treatment — the impartial judge that your project deserves.

Next Steps

Try adapting this harness to your own project. Add GPU utilization tracking, cost accounting per 1K tokens, or per-class confusion matrices. The shape scales with you — just keep the interfaces clean and let every future model plug in the same way. For practical examples of evaluation in action, see [**TF-IDF + Logistic Regression: The Classical ML Baseline**](/blog/tfidf-logistic-regression-baseline) and [**Logistic Regression from Scratch in PyTorch**](/blog/logistic-regression-from-scratch-pytorch).
