Matthew Boston

The Agentic Test Pyramid

June 3, 2026

One Axis Isn’t Enough Anymore

Martin Fowler’s test pyramid — and Ham Vocke’s practical write-up of it on Fowler’s site — sorts tests along a single axis: integration scope. Unit at the bottom, integration in the middle, end-to-end (E2E) at the top. Lots of fast, deterministic unit tests; fewer slow, end-to-end ones. None of that is wrong, and none of it has expired. For the deterministic parts of your system — which is still most of any real system — it remains the right model, and you should keep following it exactly as written.

It works because it quietly assumes every test is deterministic: the same input always yields the same pass or fail. That assumption dies the moment part of your system is a large language model. The system becomes non-deterministic: the same input no longer guarantees the same output. Run the same prompt twice and you might get different words, a different number of findings, a different tool call. You can’t assertEqual your way through that, and you can’t block a merge on a check that’s red 8% of the time for no reason.

So this isn’t a replacement for Fowler’s pyramid — it’s an extension of it. You keep the scope axis intact and add a second one: determinism and cost. Every test now lives at a coordinate of (scope, determinism). The deterministic layers are the original pyramid, unchanged. The new work is bolting two more layers on top for the non-deterministic parts — the model-driven pieces that won’t answer the same way twice. The whole discipline becomes pushing each check as far down and as far toward “deterministic” as it will go — back into Fowler’s pyramid wherever possible — because the cheapest, most reliable place to catch a bug is almost never an expensive model eval.

The Six Layers

The two axes give you six layers. The bottom four are Fowler’s pyramid intact (free, deterministic, run on every commit), just renamed for the agentic context. The top two are the part most test suites don’t have: graded checks on behavior that changes from one run to the next.

  1. Pure unit: f(input) -> output. Parsing, formatting, validation, version math. No I/O, no network, no model. This is the wide base, and almost anything that can be refactored into this shape should be, precisely so it can live here.
  2. Static-invariant tripwires — the highest-leverage layer, and the one most suites lack. More on these below.
  3. In-process integration — real components wired together, no external surface. Fast and free because nothing leaves the box.
  4. Real-dependency E2E — launch the real browser, daemon, or child process, but only when the integration is the thing under test. Mock the world, never the subject.
  5. Behavioral E2E against the live model — here the unit under test is a prompt or a policy, and you have to call the real model. “Given a planted vulnerability, the review must flag it.” “Given read-only mode, the agent must never call the write tool.”
  6. Quality evals (model-as-judge) — when correctness is genuinely subjective, use a second model to grade the first. Grade the output against a rubric and pass it on a range, never on an exact match.
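To make layer five concrete, here is a minimal sketch of the read-only example. It assumes the agent harness records each tool call as a simple record; the `ToolCall` type, the tool names, and the trace format are all hypothetical, not any particular framework's API. The point is that the assertion targets an invariant of the trace, not the model's exact wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    tool: str
    args: dict


def write_calls(trace: list[ToolCall], write_tools: frozenset[str]) -> list[ToolCall]:
    """Return every call in the trace that used a write-capable tool."""
    return [call for call in trace if call.tool in write_tools]


# In a real layer-five test the trace would come from running the live model
# in read-only mode; here it is hand-written to show the shape of the check.
example_trace = [
    ToolCall("read_file", {"path": "README.md"}),
    ToolCall("write_file", {"path": "README.md", "text": "oops"}),
]
violations = write_calls(example_trace, frozenset({"write_file"}))
```

The test then asserts that `violations` is empty. However the model phrases its answer, a single write call in read-only mode is a failure.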

Tripwires Are the Multiplier

Layer two earns its own section. A static-invariant tripwire doesn’t run behavior at all. It reads your own source or config and asserts a contract with a pattern match. It’s executable architecture documentation that fails the build in milliseconds.

Every load-bearing rule in your codebase — “never import X from Y,” “all writes route through this helper,” “importing this module must have no side effects” — gets a test that fails when someone violates it, without executing anything. The discipline is simple: when you write a comment that says “NEVER do X here,” write the tripwire in the same change. A constraint without a test is just a suggestion.

This is what lets a product with a model in it stay correct cheaply. A huge fraction of “regressions” are really contract violations, and contracts stay deterministic even when behavior doesn’t.
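As an illustration, a tripwire for a rule like “never import X from Y” might look like the sketch below. It assumes a Python codebase and uses the standard `ast` module to read source without executing it; the function name and the banned module (`requests`) are made up for the example.

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> list[str]:
    """Return banned top-level modules imported by `source`, without running it."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in banned:
                found.add(root)
    return sorted(found)
```

A real tripwire would loop over the files in the protected package (for example with `pathlib.Path.rglob("*.py")`) and assert that the result is empty for each one. Because it parses rather than pattern-matches raw text, a commented-out import does not trip it.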

Determinism Decides the Gate, Cost Decides the Cadence

Here’s the rule that keeps the whole thing sane. Classify every paid test — one that spends real money calling the model — by asking one question:

Can this test be red for a legitimate, non-bug reason — model variance, a flaky external service, a subjective threshold?

If no, it’s a gate: it blocks merge and runs on every PR. If yes, it’s periodic: it never blocks merge and runs on a schedule. You can never gate CI on a check that’s sometimes-red-for-no-reason, because that trains everyone to ignore red. Gate on what’s stable, monitor what’s fuzzy.
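The rule is small enough to write down as code. The sketch below is one way to encode it, assuming each paid test carries a single hand-set flag answering the question above; the `PaidTest` and `Cadence` names are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    GATE = "gate"          # blocks merge, runs on every PR
    PERIODIC = "periodic"  # never blocks merge, runs on a schedule


@dataclass(frozen=True)
class PaidTest:
    name: str
    # True if the test can go red for model variance, a flaky external
    # service, or a subjective threshold, i.e. for a non-bug reason.
    can_be_red_without_bug: bool


def classify(test: PaidTest) -> Cadence:
    """Determinism decides the gate: only stable checks may block a merge."""
    return Cadence.PERIODIC if test.can_be_red_without_bug else Cadence.GATE
```

Keeping the flag explicit on every paid test forces the question to be answered once, at authoring time, instead of being rediscovered the first time CI goes red for no reason.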

For the fuzzy checks, the trick is to test against a range, not an exact answer. An exact-match assertion such as “the agent found exactly 10 issues” will fail at random whenever the model merges two findings or reports one extra, and that failure is noise, not a bug. So you plant a known set of problems as your answer key and check that the result lands in an acceptable range: it must catch at least 8 of the 10 you planted (a minimum it has to clear) and raise no more than 3 false alarms (a cap it has to stay under). Be exact wherever the system really is predictable; allow a tolerance only where the model’s natural variation makes exactness impossible.
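Here is a minimal sketch of that banded check, using the numbers from the example above. It assumes each finding can be mapped to a stable identifier so planted and reported issues can be compared as sets; in practice that matching step is usually fuzzier. The function names are hypothetical.

```python
def score_findings(planted: set[str], reported: set[str]) -> tuple[int, int]:
    """Return (planted issues caught, false alarms raised)."""
    caught = len(planted & reported)
    false_alarms = len(reported - planted)
    return caught, false_alarms


def within_band(
    planted: set[str],
    reported: set[str],
    min_caught: int = 8,
    max_false_alarms: int = 3,
) -> bool:
    """Pass if the run clears the floor on catches and stays under the cap on noise."""
    caught, false_alarms = score_findings(planted, reported)
    return caught >= min_caught and false_alarms <= max_false_alarms
```

A run that catches 9 of the 10 planted issues and adds 2 spurious ones passes; a run that catches all 10 but raises 4 false alarms does not.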

Cost Is a Design Constraint

A paid suite that runs every test on every change is a suite nobody can afford. So you engineer the cost down the same way you’d engineer down latency: diff-based selection so you only run what a change could have broken, judge cascades so you only pay for a model call after a cheap regex says it’s warranted, a cheaper judge than the generator where viable, and fixture replay — recording a costly but repeatable model response once and reusing it instead of paying for it on every run. The goal is a paid suite that costs a few dollars per change set, not hundreds.
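Of these, fixture replay is the easiest to sketch. The version below assumes the model call can be wrapped as a plain function from prompt to response and that keying the cache on a hash of the prompt is good enough; the `replay_or_record` name and the JSON layout are made up for illustration. A real setup would likely fold the model name and parameters into the key as well.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def replay_or_record(
    prompt: str,
    call_model: Callable[[str], str],
    fixture_dir: Path,
) -> str:
    """Return a recorded response for `prompt`, paying for a model call only once."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = fixture_dir / f"{key}.json"
    if path.exists():
        # Replay: no model call, no cost.
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    # Record: pay once, then store the response next to the prompt.
    response = call_model(prompt)
    fixture_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"prompt": prompt, "response": response}),
        encoding="utf-8",
    )
    return response
```

Replay only makes sense for responses you are willing to treat as repeatable. A check whose whole purpose is to measure run-to-run variance has to keep calling the live model.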

This is also where evals stop being a chore and start being a moat. Garry Tan has been making the case that evals are the real defensibility for AI startups — the hard-won, customer-specific judgment that off-the-shelf benchmarks can’t replicate. Hamel Husain makes the engineering version of the same point: you can’t ship an AI product you can’t measure, and generic evals don’t work — they have to be built around your product and your data.

What Actually Changes

You don’t need all six layers on day one. You need three things: the free/paid line drawn hard so your default test command never spends money, a wide pure-unit base, and the tripwire habit. The upper layers earn their place as more of your product runs on a model.
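One way to draw the free/paid line is an explicit opt-in, sketched here with the standard `unittest` skip decorator. The `RUN_PAID_TESTS` environment variable and the test names are assumptions made for the example; the point is only that a plain test run skips anything that would call the model.

```python
import os
import unittest


def paid_tests_enabled(env=None) -> bool:
    """Paid tests run only when explicitly opted in (hypothetical env var)."""
    env = os.environ if env is None else env
    return env.get("RUN_PAID_TESTS") == "1"


# Decorator for any test that spends money on a model call.
paid = unittest.skipUnless(
    paid_tests_enabled(),
    "paid test: set RUN_PAID_TESTS=1 to run",
)


class ReviewPromptTest(unittest.TestCase):
    @paid
    def test_flags_planted_vulnerability(self):
        # Placeholder body: a real test would call the live model here.
        self.skipTest("wire up the real model call here")
```

With this in place the default test command stays free, and the scheduled or pre-merge job sets the variable deliberately.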

The mindset shift is the real deliverable. With deterministic software, a test gives a yes-or-no answer: pass or fail. With a model in the loop, some of your tests become measurements — graded, banded, tracked over time — and the engineering discipline is knowing which is which and refusing to confuse them. AI agents already depend on a fast, trustworthy feedback loop to do their work. The agentic test pyramid is how you give them one when the thing being tested can no longer be trusted to answer the same way twice.