The Impartial Judge: Inside a Production ML Evaluation Harness
Intermediate · Machine Learning Basics
A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.
April 16, 2026 · 12 min read
#evaluation #metrics #f1-score