Testing for AI coding agents

AI coding agents can move fast. The constraint on their productivity is correctness. They are always confident, but they need a clear, automatic signal about correctness; otherwise a human ends up providing all of that signal (slowly). A robust test suite keeps that confidence aligned with correctness. The shape of the test suite matters as much as its existence.

Interface tests vs internal tests

Tests that cover publicly exposed interfaces without depending on internal implementation details are a force multiplier for AI agents (and humans). These tests define what correct behavior looks like without dictating how that behavior is achieved. An agent can refactor freely, restructure internals, or rewrite implementations entirely, and as long as the tests stay green, the changes are probably safe. A “perfect” test suite would cover the entire set of visible behavior, so a green run would mean correct software. Don’t let the difficulty of a perfect suite prevent you from building a good one, and consider that with lower development costs and a higher return on every test, it may be worth aiming closer to “perfect” than you once would have.
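
As a concrete sketch of what that looks like, here is an interface-level test in Python. The ShoppingCart class and its add/total API are hypothetical, purely for illustration:

```python
# Interface-level test: exercises only the public API of a hypothetical
# ShoppingCart class. It asserts observable behavior (the total), not how
# the cart stores its items, so the implementation can be rewritten freely.
from decimal import Decimal

from cart import ShoppingCart  # hypothetical module under test


def test_total_reflects_added_items():
    cart = ShoppingCart()
    cart.add(sku="apple", unit_price=Decimal("0.50"), quantity=4)
    cart.add(sku="bread", unit_price=Decimal("2.00"), quantity=1)

    assert cart.total() == Decimal("4.00")
```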

Tests that depend on internal details create friction. When tests assert specific internal state, duplicate dependencies extensively in mocks, call private methods, or depend on implementation artifacts, the agent has to update the tests alongside the code even when making “internal” changes. The tests become part of the implementation rather than a specification of the interface.
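
Contrast that with a test like this sketch against the same hypothetical cart, which pins down internal representation rather than behavior:

```python
# Implementation-coupled test: reaches into private state and pins down how
# the cart happens to be built today. A refactor that keeps behavior identical
# (say, storing line items in a dict instead of a list) breaks it anyway.
from decimal import Decimal

from cart import ShoppingCart  # same hypothetical module


def test_cart_stores_items_in_internal_list():
    cart = ShoppingCart()
    cart.add(sku="apple", unit_price=Decimal("0.50"), quantity=4)

    # Asserts internal representation, not observable behavior.
    assert isinstance(cart._items, list)
    assert cart._items[0]["sku"] == "apple"
```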

This isn’t a new observation. The benefits of testing at interface boundaries have been known for decades. But AI agents amplify both the benefits and the costs. The benefits are amplified because agents can iterate faster when unconstrained by internal test dependencies. The costs are amplified because agents are particularly prone to the “make the tests pass” failure mode—they optimize for the feedback signal they’re given.

Prefer integration tests

The traditional tradeoff between unit tests and integration tests was speed versus fidelity. Unit tests with mocks run fast but test less of the real system. Integration tests exercise more code paths but take longer. When a human is waiting for feedback, fast matters.

In my earlier post about mocking service dependencies I argued for mocking at the right boundary: at the network level rather than the library level, to get higher fidelity without giving up too much speed. That was the right tradeoff when human attention was the bottleneck.
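
A minimal sketch of the network-level approach, assuming the code under test accepts a base URL and speaks HTTP (the rates module, fetch_exchange_rate function, and response payload are all hypothetical):

```python
# Network-level fake: stand up a tiny local HTTP server that speaks the same
# protocol as the real dependency, then point the code under test at it.
# Compared with patching the client library, this exercises real URLs,
# serialization, and status-code handling.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

from rates import fetch_exchange_rate  # hypothetical code under test


class FakeRatesAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"base": "USD", "rates": {"EUR": 0.92}}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def test_fetch_exchange_rate_parses_real_http_response():
    server = HTTPServer(("127.0.0.1", 0), FakeRatesAPI)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base_url = f"http://127.0.0.1:{server.server_port}"
        assert fetch_exchange_rate(base_url, "EUR") == 0.92
    finally:
        server.shutdown()
```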

With AI agents, the calculus shifts further. The agent doesn’t mind waiting a few extra seconds. What it needs is accurate signal. A fast test that passes when the real integration would fail sends the agent down the wrong path—and you won’t catch the error until you review the output or run a fuller test suite later.

Integration tests provide the highest-fidelity feedback. They catch problems at real boundaries rather than mocked ones. They don’t require maintaining mock implementations that can drift from reality. When an agent is iterating quickly, accurate feedback on each iteration is worth more than shaving seconds off the cycle.

This doesn’t mean mocks are never useful. External services with rate limits, slow or flaky dependencies, or genuinely expensive operations still benefit from mocking. But the default should shift toward integration tests where practical. The agent’s time is cheap; misleading feedback is expensive.
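
As a sketch of that default, here is an integration-style test that uses a real local SQLite database rather than a mocked storage layer; the OrderStore module is hypothetical, and SQLite stands in for whatever real dependency applies in your system:

```python
# Integration-style test: uses a real SQLite database file instead of mocking
# the storage layer, so schema mistakes and SQL errors surface in the test run.
# Only a genuinely external, rate-limited service (e.g. a payment provider)
# would still be faked.
from orders import OrderStore  # hypothetical module under test


def test_orders_survive_a_round_trip(tmp_path):
    store = OrderStore(db_path=tmp_path / "orders.db")  # real database on disk
    store.create(order_id="o-1", total_cents=1250)

    reloaded = OrderStore(db_path=tmp_path / "orders.db")  # fresh connection
    assert reloaded.get("o-1").total_cents == 1250
```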

API design becomes more critical

Getting the public interface right has always been the hard part. AI agents don’t change this; if anything, they raise the stakes.

When an agent can generate code quickly against a published interface, more code gets written against that interface faster. The interface accumulates dependents more rapidly. This makes interface mistakes more costly to fix. The window for catching design problems before they become entrenched gets shorter.

Invest time in API design before unleashing the agents. Once code is being generated against an interface at scale, changing that interface becomes expensive—whether the code was written by humans or machines.

AI agents are surprisingly good at taking advantage of good API design, but not very good at coming up with it. For example, on a recent project I had the agent introduce a tracing mechanism to log interactions during a functional test run. The agent didn’t add it on its own initiative, but once it was there it immediately used the traces to help debug the next problem it ran into.
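
The mechanism itself was nothing elaborate. A hypothetical reconstruction, not the actual code, might look like this:

```python
# Hypothetical sketch of a test-run tracer: records each interaction during a
# functional test so the log can be dumped when something fails. Once a tool
# like this exists, an agent will happily read its output to debug.
import json
import time


class Tracer:
    def __init__(self):
        self.events = []

    def record(self, name, **details):
        self.events.append({"t": time.time(), "event": name, **details})

    def dump(self, path):
        with open(path, "w") as f:
            json.dump(self.events, f, indent=2)
```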

Hyrum’s Law doesn’t care who wrote the code

Internal implementation details, even undocumented ones, will eventually have dependents. This is Hyrum’s Law. It applies whether the code depending on those details was written by a person or generated by an agent.

In my robustness principle post I argued that internal APIs should fail fast on invalid input to avoid accumulating “bug compatible” behavior. The reasoning was that within an organization, integration testing should catch errors before they become entrenched.

With AI agents generating code, the same principle applies but the time horizons compress. An agent might generate code that depends on an undocumented behavior, and that code might be deployed before anyone notices. The “eventually someone depends on it” timeline gets shorter.

This argues for stricter internal APIs that reject misuse immediately, and for integration tests that catch these problems before they propagate. It also argues for being thoughtful about what internal details are observable at all—if the agent can see it, the agent might depend on it.
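
In practice that means validating at the boundary and raising immediately. A minimal sketch (the schedule_retry function is hypothetical):

```python
# Fail-fast internal API: reject misuse immediately rather than coercing bad
# input into something that "works". Anything that silently tolerates the bad
# call becomes behavior that generated code can start to depend on.
def schedule_retry(delay_seconds: float, max_attempts: int) -> None:
    if delay_seconds < 0:
        raise ValueError(f"delay_seconds must be >= 0, got {delay_seconds}")
    if max_attempts < 1:
        raise ValueError(f"max_attempts must be >= 1, got {max_attempts}")
    # ... real scheduling logic would go here ...
```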

Testing as specification

The most useful tests for AI agent workflows are those that serve as executable specifications. They define what the system should do, not how it does it. They cover the contracts that matter—the interfaces between components, the boundaries between services, the behaviors that users and other systems depend on.

Tests that serve as specifications enable faster iteration. Tests that verify implementation details create drag. This has always been true, but AI agents make the difference more pronounced.
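
A specification-style test names the required behavior and asserts only the contract visible to callers, as in this sketch (the auth module and its API are hypothetical):

```python
# Test-as-specification: the name states the required behavior, and the body
# asserts only the contract visible to callers of the hypothetical auth module.
import pytest

from auth import issue_token, verify_token, ExpiredTokenError  # hypothetical


def test_expired_tokens_are_rejected():
    token = issue_token(user_id="u-1", ttl_seconds=0)  # expires immediately
    with pytest.raises(ExpiredTokenError):
        verify_token(token)
```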

Write tests as if they’re the specification for what you want built. Because increasingly, that’s exactly what they are.

Agents can help here too. Ask them to review tests and explain what behavior each test actually verifies—then check whether that matches what you intended. Ask them to defend whether a test actually tests what its name claims. They’re good at this kind of analysis, and it surfaces gaps between what you think you’re specifying and what you’re actually specifying. The same pattern-matching that makes agents good at following tests makes them good at auditing them.

A recurring theme here is that the good practices for test design all remain true, but AI agents amplify their effects and multiply the value of good tests.