2026-04-15 · Evals · By Wash Candido

Stop benchmarking models on public benchmarks.

A founder asked me last month which model they should use for their support automation feature. They'd built a comparison spreadsheet from public benchmarks: MMLU scores, HumanEval, MT-Bench. The cell with the highest average had been highlighted in green. That was the model they were planning to ship.

I told them to delete the spreadsheet.

Public benchmarks tell you almost nothing about how a model will behave on your problem. They are the SAT scores of LLMs: useful at the top of the funnel, useless once you're trying to decide who can do the actual job. Here's why.

1. The distribution doesn't match. Public benchmarks are designed to be hard, broad, and adversarial. Your problem is narrow, specific, and structurally repetitive. A model can score below the median on MMLU and still be the best model for your support flow because your support flow rewards a different shape of reasoning than MMLU does.

2. Benchmark contamination is real. The big eval sets have been on the internet long enough that they're inside training data. Some models have effectively memorized them. You can't tell from the leaderboard.

3. Cost and latency aren't on the leaderboard. A model that's 1.2 percentage points better at your task and 4x more expensive is probably the wrong model to ship. The decision is multivariate; the leaderboard is one-variable.

What to do instead: build your own eval suite. It doesn't have to be elegant. It has to be specific to your problem and runnable on every model you're considering.

A useful starter shape:

1. Collect 50–200 real examples from your product. Real inputs, real expected outputs. If you can't get to 50, your problem is probably under-specified — fix that first.

2. Define what "right" means for each example. Sometimes that's a deterministic rule ("must include the order number"). Sometimes it's an LLM-as-judge with a clear rubric. Sometimes it's a human grader. Be honest about which one applies.

3. Run every candidate model through the suite. Record accuracy, cost per call, p95 latency.

4. Plot it. A scatter of accuracy vs. cost will show you the Pareto front. The model you ship should be on the front, not on the leaderboard. (A minimal harness sketch for these four steps follows below.)
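To make those four steps concrete, here is a minimal harness sketch in Python. It's a sketch under assumptions, not a definitive implementation: `call_model` is a placeholder you'd replace with a real call to each candidate's API or runtime, the order-number rule stands in for whatever "right" means on your problem, and the example data, model names, and `eval_set.jsonl` filename are invented for illustration.

```python
import json
import re
import time
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str        # real input from your product
    order_number: str  # ground truth used by the deterministic check


def load_examples(path: str) -> list[Example]:
    """Load 50-200 real input/expected-output pairs, one JSON object per line."""
    with open(path) as f:
        return [Example(**json.loads(line)) for line in f]


def passes(example: Example, output: str) -> bool:
    """Deterministic rule from step 2: the reply must include the order number.
    Swap in an LLM-as-judge or a human grade where a rule won't do."""
    return re.search(re.escape(example.order_number), output) is not None


def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder so the harness runs end to end. Replace with a real call to
    whichever API or local runtime the candidate uses; return (text, cost_usd)."""
    return f"[{model} reply to: {prompt}]", 0.0


def evaluate(model: str, examples: list[Example]) -> dict:
    """Run one candidate over the suite; record accuracy, cost per call, p95 latency."""
    correct, costs, latencies = 0, [], []
    for ex in examples:
        start = time.perf_counter()
        output, cost = call_model(model, ex.prompt)
        latencies.append(time.perf_counter() - start)
        costs.append(cost)
        correct += passes(ex, output)
    latencies.sort()
    return {
        "model": model,
        "accuracy": correct / len(examples),
        "cost_per_call": sum(costs) / len(costs),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
    }


def pareto_front(results: list[dict]) -> list[dict]:
    """Keep any model not dominated by one that is cheaper and at least as accurate."""
    return [
        r for r in results
        if not any(
            o["accuracy"] >= r["accuracy"] and o["cost_per_call"] < r["cost_per_call"]
            for o in results
        )
    ]


if __name__ == "__main__":
    # Inline examples for illustration; in practice, load_examples("eval_set.jsonl").
    examples = [
        Example(prompt="Where is my order #48213?", order_number="48213"),
        Example(prompt="Refund status for order 77401 please", order_number="77401"),
    ]
    results = [evaluate(m, examples) for m in ["candidate-flagship", "candidate-7b"]]
    for r in pareto_front(results):
        print(r)
```

Nothing in the sketch is clever, which is the point: once the suite exists, the Pareto check makes the selection argument for you, and any model dominated on both accuracy and cost never reaches the shipping conversation.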

We've shipped engagements where the eval suite alone changed the model selection from a flagship API to a 7B open-source model that was 30x cheaper and 2x faster. Not because the open-source model was "better" in any abstract sense — because it was sufficient for the task and cheaper to run.

The boring answer is the right one: build evals that match your actual problem, then let the evals pick the model.

— Wash Candido

