Testing Agents Like You Test Code

Agents are software. Software without tests breaks silently. This seems obvious until you actually try to write tests for a system whose output is non-deterministic and whose behavior changes when the model updates. I've worked out an approach that gives me enough confidence to ship.

Three Layers of Tests

My agent test suite has three layers, from cheap to expensive:

Tool tests. Each tool is a plain function. Test it with unit tests. No agent involved. This is 80% of the surface area.
Prompt regression tests. A fixed set of prompts plus expected tool-call sequences. When the prompt or model changes, I re-run and eyeball the diff.
End-to-end smoke tests. A handful of realistic scenarios that run in CI against the actual model. These catch behavior drift.
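
A minimal sketch of what the prompt regression layer could look like, assuming each agent run can be reduced to a list of tool names. The prompts, the tool names, and the `run_agent` callable are all hypothetical stand-ins, not part of any real harness. The only library call is `difflib.unified_diff` from the standard library, which produces the diff to eyeball.

```python
import difflib

# Hypothetical golden data: prompt -> expected sequence of tool names.
EXPECTED_TOOL_CALLS = {
    "Summarize https://example.com/post": ["fetch_url", "summarize_text"],
    "What changed in the latest release?": ["search_docs", "summarize_text"],
}


def diff_tool_calls(expected, actual):
    """Return a unified diff of two tool-call sequences ('' if they match)."""
    return "\n".join(
        difflib.unified_diff(expected, actual, "expected", "actual", lineterm="")
    )


def run_regression(run_agent, cases=EXPECTED_TOOL_CALLS):
    """Run every prompt and collect a diff for each one whose calls changed.

    `run_agent` is an assumed callable that takes a prompt and returns the
    list of tool names the agent called, in order.
    """
    report = {}
    for prompt, expected in cases.items():
        diff = diff_tool_calls(expected, run_agent(prompt))
        if diff:
            report[prompt] = diff
    return report
```

An empty report means nothing moved. A non-empty one is a review queue rather than an automatic failure, since a changed tool sequence after a model update is not always a regression.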

The Cheap Layer Carries Most of the Weight

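```python
import tools  # the module that defines the agent's tools


def test_summarize_url_handles_404():
    # Call the tool directly; no agent or model is involved.
    result = tools.summarize_url('https://example.com/does-not-exist')
    assert result['status'] == 'error'
    assert '404' in result['message']
```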

The tool is deterministic. The test is fast. I don't need the agent to verify that my tools work.
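
One way to keep a tool test deterministic and fast when the tool touches the network is to pass the fetcher in as a parameter and stub it in the test. The sketch below is an assumption about how such a tool might be shaped, not the real `tools.summarize_url`: the `fetch` parameter, the `(status_code, body)` return shape, and the placeholder summary are all hypothetical.

```python
def summarize_url(url, fetch):
    """Hypothetical tool: `fetch(url)` is assumed to return (status_code, body)."""
    status_code, body = fetch(url)
    if status_code != 200:
        return {"status": "error", "message": f"HTTP {status_code} for {url}"}
    # Placeholder "summary": the first 80 characters of the body.
    return {"status": "ok", "summary": body[:80]}


def fake_fetch_404(url):
    """Stub fetcher that always reports a missing page, with no network access."""
    return 404, ""


def test_summarize_url_handles_404_offline():
    result = summarize_url("https://example.com/does-not-exist", fetch=fake_fetch_404)
    assert result["status"] == "error"
    assert "404" in result["message"]
```

With the fetcher injected, the test never leaves the process, so it gives the same result on a laptop, in CI, and offline.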

Accept Non-Determinism at the Top

For the end-to-end layer, I stopped trying to assert exact outputs. Instead I assert properties: did the agent call the retrieval tool at least once, did it return a non-empty summary, did it finish within the step budget. Property-based assertions survive model updates.
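
The three properties above can be sketched as one reusable check. The `AgentRun` record, the `retrieve` tool name, and the step budget of 10 are assumptions made for illustration; a real harness will expose its own run object and limits.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; a real harness will differ."""
    tool_calls: list = field(default_factory=list)  # tool names, in call order
    summary: str = ""
    steps: int = 0


STEP_BUDGET = 10  # assumed limit, chosen only for illustration


def check_run_properties(run, step_budget=STEP_BUDGET):
    """Assert properties of a run instead of its exact output."""
    assert "retrieve" in run.tool_calls, "agent never called the retrieval tool"
    assert run.summary.strip(), "agent returned an empty summary"
    assert run.steps <= step_budget, f"agent used {run.steps} steps"
```

A smoke test would run the agent on a scenario, build an `AgentRun` from the trace, and call `check_run_properties(run)`. None of the assertions mention the wording of the summary, which is what lets them survive a model update.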

Pytest works fine for all of this; see the pytest docs.
