Testing in the Agent Era: What Actually Changed

Two years in, testing feels different than it used to. Not because the tools changed, but because agents generate code at a rate that used to require a team. The bottleneck is no longer writing tests. It is deciding which tests matter. Here is how my approach shifted.

The Old Bottleneck

Pre-agent era, writing tests took as long as writing the code. Most consultants skipped tests because budgets did not cover double the time. Tests lived on a wishlist nobody got to.

The New Bottleneck

Agents write tests for free. Generating fifty test cases takes a prompt. The question flipped: with unlimited tests, which tests actually matter?

The Test Tiering I Use Now

Tier 1: contract tests for public APIs (always)
Tier 2: integration tests for critical paths (always)
Tier 3: unit tests for pure logic (when free to generate)
Tier 4: exhaustive edge cases (only on modules touched by finance, auth, or billing)

Agents generate Tier 3 and 4 on demand. I write Tier 1 and 2 myself or review them carefully.

What Changed in My Testing Discipline

I read tests more than I write them (agents write, I verify)
I delete tests that do not encode intent, even if they pass
I value coverage less and critical-path verification more
I expect red first always, regardless of who writes the test

The Red-First Rule

Every test must fail at least once before passing. Non-negotiable, whether I or an agent wrote it. Tests that have never been red are not tests. They are assertions that happen to be true right now.

Example Failure Mode

I caught an agent generating a test like this:

python

def test_get_user():
    user = get_user('x')
    assert user is not None or user is None

Passes. Tests nothing. The agent had filled a slot without thinking. Red-first would have caught it immediately because the assertion cannot fail.

The Coverage Myth

Ninety percent coverage used to be a goal. Now it is a trivially achievable metric that tells you almost nothing. I have seen agent-generated test suites with 95 percent coverage that test almost no intent. Coverage is a floor, not a ceiling, and the ceiling is the harder question.

What I Tell Teams

Spend your test budget on the tests that encode business rules. Let agents generate everything else. The old ratio of time-on-tests to time-on-code is obsolete. The new ratio is time-on-intent to time-on-generation, and intent is where the thinking lives.

See Kent Beck's writing on test desiderata for the underlying frame I still rely on.