Why we test AI agents against real APIs by default, and only mock the exceptions

Almost every guide on testing AI agents tells you to mock by default. Mock the LLM call. Mock the external APIs. Mock the database. Reserve real integration tests for a narrow lane that runs once per CI build and again before deploy. The reasoning is consistent: real APIs are slow, expensive, rate-limited, and have side effects. Mocks are fast, free, and safe. We agree with every part of that reasoning. We still default to real APIs.

The reason is simple. The shape of the failures we have seen in production agentic systems is not the shape that mocks catch. Mocks catch failures in the logic between calls. Production-shape failures are in the calls themselves: the upstream API returned a slightly different field than the mock did, the rate limit changed last week, the webhook body we verify differs by one whitespace character from the body Stripe signed. None of those land in mocked tests. All of them land in real-API tests.

What “production-shape” actually means

Reproducing in production-shape rather than toy-shape is one of the methodology rules in the agentic engineering method (principle 05). Concretely, for an agentic system:

Real third-party APIs in their test/sandbox mode (Stripe TEST mode, Polar test, Resend testing key, Anthropic API with a low-cost model)
Real database schema against ephemeral instances (Postgres in Docker, seeded fresh per test or per test file)
Real auth surface against test users created in the actual identity provider (Supabase Auth test project)
Real network conditions: rate limits, retries, partial failures, the cold-start latency a real API has on the first request after idle
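Running against real APIs in test mode only stays safe if the suite cannot be pointed at live credentials by accident. A minimal sketch of such a guard follows. The function names are hypothetical and not from the PickNDeal codebase; the only external fact relied on is that Stripe test-mode secret keys start with `sk_test_` (and restricted test keys with `rk_test_`).

```typescript
// Hypothetical guard: fail fast if the integration lane is given a non-test key.
// Stripe test-mode secret keys start with "sk_test_", restricted test keys with "rk_test_".
function isStripeTestKey(key: string): boolean {
  return key.startsWith("sk_test_") || key.startsWith("rk_test_");
}

// Returns the key unchanged when it is a test-mode key, otherwise throws.
function assertTestModeKey(key: string | undefined): string {
  if (key === undefined || key.length === 0) {
    throw new Error("Stripe key is not set; refusing to run integration tests");
  }
  if (!isStripeTestKey(key)) {
    throw new Error("Stripe key is not a test-mode key; refusing to run integration tests");
  }
  return key;
}
```

Calling a guard like this once in test setup, before any client is constructed, keeps a misconfigured environment from turning an integration run into real charges.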

The cost is real: the test suite that exercises the AI offer agent end-to-end takes 90 seconds per run because it actually creates Stripe test accounts, fires webhooks, waits for the order row to flip to paid, and asserts the audit trail. The same suite with mocks would run in 8 seconds. We accept the roughly 11× slowdown because the failure modes we care about (Stripe API change, webhook dedupe race, schema column drift) only surface in the slow version.

When we DO mock: the four conditions

Mocks are not banned. They are reserved for the cases where the real-API default genuinely fails. Four conditions justify a mock:

1. The upstream service has no test environment

Some services don't offer a sandbox. Push notification providers (APNs, FCM) accept real payloads against real device tokens; there is no “test mode” that returns a deterministic response without actually trying to deliver. For these, we mock at the transport layer (the HTTP client returns a fixed response) and rely on a single nightly real-API smoke test against a dedicated test device to catch upstream changes.
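Mocking at the transport layer can be sketched as follows. Every name here (`PushTransport`, `notifyOrderPaid`, `fixedTransport`) is hypothetical and is not the APNs or FCM client API; the point is only that the transport function is the single seam a test replaces.

```typescript
// Hypothetical types; not the APNs or FCM request format.
interface PushRequest { deviceToken: string; title: string; body: string; }
interface PushResult { status: number; }

// The transport is the only seam: production passes an HTTP-backed function,
// the unit lane passes one that returns a fixed response.
type PushTransport = (req: PushRequest) => Promise<PushResult>;

async function notifyOrderPaid(
  transport: PushTransport,
  deviceToken: string,
  orderId: string,
): Promise<boolean> {
  const result = await transport({
    deviceToken,
    title: "Order paid",
    body: `Order ${orderId} has been paid.`,
  });
  // Treat any 2xx status as "accepted by the provider".
  return result.status >= 200 && result.status < 300;
}

// Fixed-response transport for tests: always answers with the given status.
function fixedTransport(status: number): PushTransport {
  return async () => ({ status });
}
```

A test can then call `notifyOrderPaid(fixedTransport(200), token, orderId)` and the same call with an error status to cover both branches without contacting a provider.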

2. The operation costs more than the test is worth

Anthropic API calls cost real money. We run the agent loop against the real API on a small set of scenarios (golden tests, ~20 of them) but mock the LLM response for the 200+ unit tests on the dispatcher, the tool definitions, and the audit-trail logic. The mocked LLM returns a hand-crafted tool-use block so we can exercise the dispatcher without paying per test run.
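A hand-crafted tool-use block driving a dispatcher might look like the sketch below. The block and result shapes follow the `tool_use` and `tool_result` content blocks of the Anthropic Messages API; the dispatcher, the handler type, and the `create_offer` tool name are assumptions made for illustration, not the real PickNDeal code.

```typescript
// Shape of a tool-use content block as returned by the Anthropic Messages API.
interface ToolUseBlock {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
}

// Shape of the tool_result block sent back to the model.
interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

// Hypothetical handler signature and dispatcher.
type ToolHandler = (input: Record<string, unknown>) => string;

function dispatch(block: ToolUseBlock, handlers: Map<string, ToolHandler>): ToolResultBlock {
  const handler = handlers.get(block.name);
  if (handler === undefined) {
    // Unknown tools are reported back as an error result instead of throwing.
    return { type: "tool_result", tool_use_id: block.id, content: `unknown tool: ${block.name}`, is_error: true };
  }
  return { type: "tool_result", tool_use_id: block.id, content: handler(block.input), is_error: false };
}

// Hand-crafted block standing in for a real model response (tool name is hypothetical).
const cannedBlock: ToolUseBlock = {
  type: "tool_use",
  id: "toolu_test_1",
  name: "create_offer",
  input: { supplierId: "sup_1", quantity: 3 },
};
```

Because the dispatcher only sees a block, unit tests can cover unknown tools, malformed input, and handler errors without a single paid API call.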

3. The operation is time-based and we need determinism

A “reorder this template if no order exists for the next delivery window” test can't use real wall-clock time without becoming flaky. We mock the clock (inject a time-now function the test controls) but leave everything else real: real Drizzle queries against the test Postgres, real orderTemplates rows, real audit-trail writes.
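Injecting the clock can be as small as one extra parameter. In this sketch the `DeliveryWindow` type and `shouldReorder` function are hypothetical (the real orderTemplates schema is not shown in this post), and the database lookup is reduced to a boolean so the example stays self-contained.

```typescript
// Hypothetical type; stands in for whatever the real delivery-window row looks like.
interface DeliveryWindow { start: Date; }

// A clock is just a function returning the current time.
type Clock = () => Date;

// True when no order exists for the window and the window has not started yet.
// `now` defaults to the wall clock; tests pass a function that returns a fixed Date.
function shouldReorder(
  deliveryWindow: DeliveryWindow,
  hasOrderForWindow: boolean,
  now: Clock = () => new Date(),
): boolean {
  const current = now();
  return !hasOrderForWindow && current.getTime() < deliveryWindow.start.getTime();
}
```

A test pins time with something like `() => new Date("2030-01-01T00:00:00Z")`, while production code omits the argument and gets the real clock.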

4. The operation has irreversible real-world side effects we cannot undo

Sending an actual email to a customer's production address. Charging a real card. Posting to a public social account. For these we mock the final transport call and let everything upstream of it run against the real API. The mock is the smallest possible surface: just the one HTTP call that has the side effect.
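Keeping the mock down to the one call with the side effect can be sketched with a recording transport. The names below (`EmailTransport`, `sendReceipt`, `recordingTransport`) are hypothetical and not the Resend SDK; the code above the transport call is meant to stand for real, unmocked logic.

```typescript
// Hypothetical types; not the Resend SDK.
interface OutboundEmail { to: string; subject: string; html: string; }
type EmailTransport = (email: OutboundEmail) => Promise<void>;

// Everything above the transport call (recipient, subject, rendering) runs for real.
async function sendReceipt(
  transport: EmailTransport,
  customerEmail: string,
  orderId: string,
  totalCents: number,
): Promise<void> {
  const total = (totalCents / 100).toFixed(2);
  await transport({
    to: customerEmail,
    subject: `Receipt for order ${orderId}`,
    html: `<p>Thanks. Your total was ${total}.</p>`,
  });
}

// The mock replaces only the final call and records what would have been sent.
function recordingTransport(): { transport: EmailTransport; sent: OutboundEmail[] } {
  const sent: OutboundEmail[] = [];
  return {
    transport: async (email) => {
      sent.push(email);
    },
    sent,
  };
}
```

Recording the outbound message rather than returning a canned success lets the test assert on exactly what the customer would have received.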

The cost of mocking by default that nobody calculates

The benefit of mocking by default is obvious: speed, cost, determinism. The cost is invisible until production: every divergence between your mock and the real API becomes a bug your tests never caught. Three real examples from PickNDeal:

Stripe Connect destination-charge metadata round-trip: our mocked response included the metadata field; the real API stripped it on certain account types. Mock-only tests passed; the live integration broke for accounts in DK.
Resend webhook payload: we mocked the “sent” webhook with a key the docs listed; the real API delivered the same payload with the key cased differently after a Resend update. Mock-only tests passed; production stopped marking emails as sent.
Supabase Auth token refresh: our mocked refresh returned a token immediately; the real endpoint added 200-300 ms of latency for accounts in EU regions. Mock-only tests could not show that this would shift our LCP. The real-API test surfaced it during dev, not after deploy.

Each of these would have been a production incident in the mock-by-default workflow. Each was caught in dev by a real-API integration test that took an extra 60 seconds. The math on which is cheaper changed the moment we counted the cost of the production incident, not just the cost of the test run.

What this looks like in CI

Our CI has three lanes. The first two run in parallel; the third runs on deploy:

Unit (mocks allowed): dispatcher logic, tool schemas, validators, formatters. ~3 seconds, ~200 tests.
Integration (real APIs in test mode): end-to-end Stripe Connect flow, AI offer agent generating + supplier confirming, webhook idempotency under retry. ~90 seconds, ~30 tests.
Smoke (real production, read-only): one curl against pickndeal.app health endpoint, one read-only Stripe API call, one auth ping. Runs on deploy, ~5 seconds.
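One of the integration-lane checks listed above, webhook idempotency under retry, comes down to a property that is easy to state in code: delivering the same event twice must change state once. The sketch below is an assumption-laden illustration. The handler class is hypothetical, and the in-memory `Set` stands in for a unique constraint on an event-id column, so it shows the property but not the database race the real test exercises.

```typescript
// Hypothetical event shape; real Stripe events carry an `id` and a `type`.
interface WebhookEvent { id: string; type: string; orderId: string; }

class PaidOrderHandler {
  // Stand-in for a unique constraint on stored event ids.
  private readonly seenEventIds = new Set<string>();
  readonly paidOrders: string[] = [];

  // Returns true if the event was applied, false if it was a duplicate delivery.
  handle(event: WebhookEvent): boolean {
    if (this.seenEventIds.has(event.id)) {
      return false;
    }
    this.seenEventIds.add(event.id);
    if (event.type === "payment_intent.succeeded") {
      this.paidOrders.push(event.orderId);
    }
    return true;
  }
}
```

In the integration lane the same assertion (one paid order after two deliveries) would run against the test database and real retried webhooks rather than an in-memory set.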

The unit lane is the only one that uses mocks, and even there mocks are for cost/speed reasons, not for “we mock everything by default.” The integration lane is what catches the failures that matter. The smoke lane catches the failures that the deploy itself introduced. Three lanes, one philosophy: real by default, mock only by exception.

The rule this codifies

The methodology rule (in our codebase at docs/context/feedback_reproduce_in_production_shape.md): reproduce production shape from day one. Real schemas, real concurrency, real error rates, real APIs in test mode where they offer one. If the agent can't handle a 1% failure rate from an upstream service in dev, it will cascade in production. Mocks hide the 1%. Real APIs in test mode return the 1%.
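Handling the 1% usually means some form of bounded retry around upstream calls. The sketch below is one hedged way to express that; `withRetry` and `failsFirst` are hypothetical helpers, there is no backoff or error classification, and the deterministic flaky function exists only so the behavior can be checked without a real upstream.

```typescript
// Hypothetical helper: run `op` up to `maxAttempts` times, rethrowing the last error.
async function withRetry<T>(op: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Deterministic stand-in for a flaky upstream: fails the first `failures` calls,
// then resolves with `value`.
function failsFirst<T>(failures: number, value: T): () => Promise<T> {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) {
      throw new Error(`transient failure ${calls}`);
    }
    return value;
  };
}
```

Against a real API in test mode the failures arrive on their own schedule, which is the point of the rule; the stand-in only makes the retry logic itself testable in the unit lane.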

The reverse of this rule has a name in the methodology too: toy-shape testing. Toy shape means the system passes its tests against a version of reality that is structurally cleaner than the version production hits. The cost of toy-shape testing is paid in the postmortem, not in the test run.

The principle this post comes from: the agentic engineering method, principle 05. The end-to-end Stripe TEST verification flow that this principle drives: extracting PayoutKit from PickNDeal. The rule library that codifies this and 50+ other rules: the engineering rule library.