The Skill Soup Move of the Year: Learning Evals Properly

Every AI engineer talks about evals. Almost none of them run proper ones. I committed to making evals the dense ingredient of my skill soup this year because the market underprices the skill. Here is what I am learning, along with the small eval system I have already built to practice the skill publicly.

What Proper Evals Look Like

A frozen test set that does not change between model runs
Metrics that correlate with business outcomes, not just accuracy
Scoring that can be human, model-graded, or both
CI integration that blocks regressions
Version tracking so you can compare model X vs model Y on the same inputs
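
To make the last two points concrete, here is a minimal sketch of a regression gate. It assumes you already have per-case scores for a baseline run and a candidate run, keyed by case id. The names `find_regressions` and `gate` and the `tolerance` parameter are hypothetical, not part of any framework.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return the ids of cases where the candidate scores below the baseline."""
    # Both runs must cover the same frozen inputs, or the comparison means nothing.
    if baseline.keys() != candidate.keys():
        raise ValueError("runs were scored on different test cases")
    return sorted(
        case_id for case_id in baseline
        if candidate[case_id] < baseline[case_id] - tolerance
    )


def gate(baseline: dict[str, float],
         candidate: dict[str, float],
         tolerance: float = 0.0) -> int:
    """Return a CI exit code: 0 if no case regressed, 1 otherwise."""
    regressed = find_regressions(baseline, candidate, tolerance)
    for case_id in regressed:
        print(f"regression on {case_id}: {baseline[case_id]} -> {candidate[case_id]}")
    return 1 if regressed else 0
```

In a CI job, a script could pass the return value of `gate` to `sys.exit` so that a nonzero code fails the build. Checking per case rather than only the average keeps one large regression from hiding behind several small gains.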

Most teams call their test set an eval set but they keep editing it when the model gets something wrong. That is not an eval. That is a wishlist.
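
One way to keep yourself honest is to fingerprint the test set and refuse to run if it has changed. The sketch below assumes the test set is a list of JSON-serializable dicts. `fingerprint` and `assert_frozen` are hypothetical helper names.

```python
import hashlib
import json


def fingerprint(test_set: list[dict]) -> str:
    """Return a SHA-256 hex digest of the test set's canonical JSON form."""
    # sort_keys makes the digest independent of key order inside each case.
    # Case order in the list still matters.
    canonical = json.dumps(test_set, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(test_set: list[dict], expected_digest: str) -> None:
    """Raise if the test set no longer matches the recorded digest."""
    actual = fingerprint(test_set)
    if actual != expected_digest:
        raise RuntimeError(
            f"test set changed: expected {expected_digest[:12]}, got {actual[:12]}"
        )
```

Record the digest once, store it next to the results, and call `assert_frozen` at the start of every run. If the test set really does need to change, treat that as a new version with a new digest instead of an in-place edit.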

My Starter Eval System

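```python
def run_eval(prompt_version: str, test_set: list[dict]):
    """Score one prompt version against every case in the test set."""
    # call_model, score_output, and aggregate are project-specific helpers defined elsewhere.
    results = []
    for case in test_set:
        output = call_model(prompt_version, case['input'])
        score = score_output(case['expected'], output)
        results.append({'case_id': case['id'], 'score': score})
    return aggregate(results)
```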

Simple. The harness itself is deterministic, and results go to a flat file. The simplicity is the point. A fancier framework does not make the evals better. Good test cases make evals better.
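
The helpers that `run_eval` calls are not shown. Here is one possible shape for the two that do not touch a model, plus the flat-file log. It assumes exact-match scoring, a mean as the aggregate, and JSON Lines as the file format. All three are illustrative choices, and `log_results` is a hypothetical name.

```python
import json
from statistics import mean


def score_output(expected: str, output: str) -> float:
    """Exact-match scoring after trimming whitespace: 1.0 for a match, else 0.0."""
    return 1.0 if expected.strip() == output.strip() else 0.0


def aggregate(results: list[dict]) -> dict:
    """Summarize per-case results as a case count and a mean score."""
    scores = [r["score"] for r in results]
    return {"n": len(scores), "mean_score": mean(scores) if scores else 0.0}


def log_results(path: str, prompt_version: str, results: list[dict]) -> None:
    """Append one JSON object per case to a flat JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps({"prompt_version": prompt_version, **r}) + "\n")
```

Exact match only suits tasks with a single correct string. Anything open-ended needs a human or model-graded scorer behind the same `score_output` signature.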

Why It Is Underpriced

Every shop building with AI needs evals, and almost none do them well. Few people sell eval work as a service. The first consultants to specialize here will capture disproportionate revenue. As far as I can tell, the gap between demand and supply is larger here than in any other AI consulting niche right now.

What I Am Studying

Academic eval papers for methodology
Anthropic evals cookbook for practical patterns
The Promptfoo documentation for tooling
The OpenAI evals repo for test case structure ideas

I read one eval paper a week and take notes. The habit is slow but compounds.

The Commitment

Two hours a week. Public notes. One client engagement explicitly scoped to evals by July. That last part forces learning under economic pressure, which is where skills compound fastest. Skills without economic pressure tend to stall at medium.

The Early Signal

Three clients this month asked specifically about eval services. Two years ago nobody asked. The demand is arriving. Supply is not ready yet. That gap is the opportunity.