Technical
Testing Agents Like You Test Code
Agents are software. Software without tests breaks silently. This seems obvious until you actually try to write tests for a system whose output is non-deterministic and whose behavior changes when the model updates. I've worked out an approach that gives me enough confidence to ship.
Three Layers of Tests
My agent test suite has three layers, from cheap to expensive:
- Tool tests. Each tool is a plain function. Test it with unit tests. No agent involved. This is 80% of the surface area.
- Prompt regression tests. A fixed set of prompts plus expected tool-call sequences. When the prompt or model changes, I re-run and eyeball the diff.
- End-to-end smoke tests. A handful of realistic scenarios that run in CI against the actual model. These catch behavior drift.
The Cheap Layer Carries Most of the Weight
def test_summarize_url_handles_404():
result = tools.summarize_url('https://example.com/does-not-exist')
assert result['status'] == 'error'
assert '404' in result['message']The tool is deterministic. The test is fast. I don't need the agent to verify that my tools work.
Accept Non-Determinism at the Top
For the end-to-end layer, I stopped trying to assert exact outputs. Instead I assert properties: did the agent call the retrieval tool at least once, did it return a non-empty summary, did it finish within the step budget. Property-based assertions survive model updates.
Pytest works fine for all of this; see the pytest docs.
RELATED READING
The Consulting Shift I Am Making In Year Two
After a year of writing and building, my consulting practice is changing shape. Shorter engagements. Sharper outcomes.
ReadThe Frontend Shift: Shipping Less JavaScript In Year Two
A year ago I reached for Next.js for everything. This year I often reach for nothing.
ReadThe Serverless Lesson I Would Write On A Sticky Note
After a year of shipping serverless projects, one rule explains most of the wins and all of the losses.
Read