How I Evaluate an AI Agent Before Letting It Touch Production

By month eight of running agent-assisted development, I have seen three agents silently introduce bugs that looked correct at a glance. A missing null check here. An off-by-one in a date range there. The tests passed. The diff looked clean. The bug surfaced in production a week later. That experience forced me to build a real evaluation process before trusting any agent with meaningful code.

The Three-Gate Evaluation

Every new agent I consider now runs through three gates before I let it anywhere near a client repo.

Gate one: known-failure regression. I keep a folder of five real bugs I have shipped in the past. I strip the fix, hand the buggy file to the agent with the failing test, and see if it reaches the same fix I did. An agent that cannot find the bug cannot be trusted to avoid creating one.

Gate two: refactor without behavior change. I hand it a working module with passing tests and ask it to restructure for readability. Then I diff the test output. An agent that changes behavior during a refactor is an agent that does not understand the contract.

Gate three: explanation under pressure. After a task, I ask: why did you choose this approach over X? An agent that cannot articulate tradeoffs is pattern-matching, not reasoning. That is fine for boilerplate. It is not fine for architecture.

A Simple Evaluation Script

python

EVAL_TASKS = [
    {'id': 'bug-01', 'input_file': 'eval/date_range_bug.py',
     'test': 'eval/test_date_range.py',
     'expected_fix_line': 42},
    {'id': 'refactor-01', 'input_file': 'eval/handler.py',
     'test': 'eval/test_handler.py',
     'tolerance': 'zero behavior change'},
]
 
def score_agent(agent, tasks):
    passed = 0
    for t in tasks:
        result = agent.run(t)
        if result.tests_pass and result.matches_expectation(t):
            passed += 1
    return passed / len(tasks)

What Most Reviews Miss

Public benchmarks test isolated tasks. Real production work is layered. An agent that scores 90 on HumanEval can still fail at 'update this function without breaking the seventeen other places that import it'. I care about the second number, not the first.

The Anthropic evaluation cookbook has honest templates for agent evaluation. I use it as my starting point and add my own known-failure set on top.

The Honest Conclusion

No agent earns unsupervised production access. But a passing score on my three gates earns the right to be an accelerator on supervised work. That is the bar. Everything else is hype.

What Changes Once You Trust the Bar

Evaluation is one-time work per model version. I run the gates when a new agent or model lands, record the score in a plain markdown file, and reuse the result for months. The cost of the process is small. The cost of skipping it, an untrusted agent introducing a subtle bug into a client repo, is large. Once a model passes, I trust it at the scope the gates verified. New scope requires a new evaluation. The meta-lesson: evaluation is a practice, not a phase. Agents change monthly; the gates do not.