The Skill Soup Move of the Year: Learning Evals Properly
Every AI engineer talks about evals. Almost none of them run proper ones. I committed to making evals the dense ingredient of my skill soup this year because the market underprices the skill. Here is what I am learning, along with the small eval system I have already built to practice the skill publicly.
What Proper Evals Look Like
- A frozen test set that does not change between model runs
- Metrics that correlate with business outcomes, not just accuracy
- Scoring that can be human, model-graded, or both
- CI integration that blocks regressions
- Version tracking so you can compare model X vs model Y on the same inputs
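The last two items in the list can be sketched together: compare two versions case by case on the same inputs, then turn the comparison into a pass or fail signal for CI. This is a minimal sketch under assumptions, not a definitive gate. `compare_versions`, `gate`, the per-case score dictionaries, and the `max_drop` tolerance are all hypothetical names chosen for illustration.

```python
from statistics import mean


def compare_versions(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare per-case scores for two versions scored on the same case ids."""
    if not baseline:
        raise ValueError("no cases to compare")
    if baseline.keys() != candidate.keys():
        raise ValueError("both versions must be scored on the same cases")
    # A case counts as regressed when the candidate scores strictly lower.
    regressed = sorted(cid for cid in baseline if candidate[cid] < baseline[cid])
    return {
        "baseline_mean": mean(baseline.values()),
        "candidate_mean": mean(candidate.values()),
        "regressed_cases": regressed,
    }


def gate(report: dict, max_drop: float = 0.0) -> int:
    """Return an exit code: 0 lets the change through, 1 blocks it.

    max_drop is a hypothetical tolerance for how far the mean may fall.
    """
    drop = report["baseline_mean"] - report["candidate_mean"]
    return 1 if drop > max_drop else 0
```

In a CI script, `sys.exit(gate(report))` would turn the result into the job's status. Listing the regressed case ids alongside the means matters because an unchanged average can hide individual cases that got worse.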
Most teams call their test set an eval set but they keep editing it when the model gets something wrong. That is not an eval. That is a wishlist.
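One way to keep a test set honest is to record a fingerprint of it once and refuse to run when the contents drift. The sketch below assumes the test set is a list of JSON-serializable dicts. `fingerprint` and `assert_frozen` are hypothetical helper names, not part of any framework.

```python
import hashlib
import json


def fingerprint(test_set: list[dict]) -> str:
    """Return a SHA-256 hex digest of the test set in a canonical JSON form."""
    # sort_keys makes the digest independent of key order inside each case;
    # the order of the cases themselves still counts.
    canonical = json.dumps(test_set, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(test_set: list[dict], expected_digest: str) -> None:
    """Raise if the test set no longer matches the digest recorded at freeze time."""
    actual = fingerprint(test_set)
    if actual != expected_digest:
        raise RuntimeError(
            f"test set changed: expected {expected_digest[:12]}, got {actual[:12]}"
        )
```

Storing the digest next to every logged run makes silent edits visible: two runs with different digests were not scored on the same eval.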
My Starter Eval System
    def run_eval(prompt_version: str, test_set: list[dict]):
        results = []
        for case in test_set:
            output = call_model(prompt_version, case['input'])
            score = score_output(case['expected'], output)
            results.append({'case_id': case['id'], 'score': score})
        return aggregate(results)

Simple. Deterministic. Logged to a flat file. The simplicity is the point. A fancier framework does not make the evals better. Good test cases make evals better.
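The listing does not show the three helpers that `run_eval` calls or the flat-file log. Here is a minimal sketch of what they might look like. Every body below is an assumption: the post does not say how scoring or logging works, so this uses exact-match scoring, a mean as the aggregate, and one JSON line per case. `log_results` is a hypothetical name, and `call_model` is a stub that stands in for a real model client.

```python
import json
from statistics import mean


def call_model(prompt_version: str, model_input: str) -> str:
    """Placeholder for a real model call; echoes the input so the sketch runs."""
    return model_input


def score_output(expected: str, output: str) -> float:
    """Exact-match scoring after trimming whitespace: 1.0 for a match, else 0.0."""
    return 1.0 if expected.strip() == output.strip() else 0.0


def aggregate(results: list[dict]) -> dict:
    """Summarize per-case scores as a count and a mean."""
    scores = [r["score"] for r in results]
    return {"n": len(scores), "mean_score": mean(scores) if scores else 0.0}


def log_results(path: str, prompt_version: str, results: list[dict]) -> None:
    """Append one JSON line per case to a flat file."""
    with open(path, "a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps({"prompt_version": prompt_version, **r}) + "\n")
```

Appending JSON lines keeps the log greppable and lets later runs sit next to earlier ones without a database.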
Why It Is Underpriced
Every shop building with AI needs evals, and almost none do them well. Few consultants offer evals as a service. The first to specialize here will capture disproportionate revenue. From where I sit, the gap between demand and supply looks larger than in any other AI consulting niche right now.
What I Am Studying
- Academic eval papers for methodology
- Anthropic evals cookbook for practical patterns
- The Promptfoo documentation for tooling
- The OpenAI evals repo for test case structure ideas
I read one eval paper a week and take notes. The habit is slow, but it compounds.
The Commitment
Two hours a week. Public notes. One client engagement explicitly scoped to evals by July. That last part forces learning under economic pressure, which is where skills compound fastest. Skills without economic pressure tend to stall at medium.
The Early Signal
Three clients this month asked specifically about eval services. Two years ago nobody asked. The demand is arriving. Supply is not ready yet. That gap is the opportunity.