The method

The agentic delivery method.

Most teams treat “AI in the loop” as a creativity tool — useful for prototypes, suspect in production. We treat it as a delivery capability with the same governance you'd apply to a senior engineer: audit trail, human review at risky steps, invariants that codify what we've learned the hard way. What follows is the method we've refined across multiple production codebases.

01

Define the surface, not the prompt

The first thing we do on any new system is define the surface area an agent operates on — the read tools, the write tools, the human-approval gates — not the prompt that drives it. Prompts are throwaway; the tool surface compounds. A well-designed surface makes a mediocre prompt produce reliable results; a great prompt with a leaky surface produces undefined behaviour at scale.

Concretely: every action the agent can take goes through a typed tool. Every state mutation goes through a human-review gate by default. We earn the right to remove gates by demonstrating a clean audit history.
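A minimal sketch of what that surface can look like — the names and shapes here are illustrative, not the actual implementation. Write tools are gated by default; a read tool runs on rails:

```typescript
// Illustrative sketch: every agent action is a typed tool, and
// state-mutating tools require human approval by default.
type ToolKind = "read" | "write";

interface Tool<I, O> {
  name: string;
  kind: ToolKind;
  requiresApproval: boolean;
  run: (input: I) => O;
}

function defineTool<I, O>(
  name: string,
  kind: ToolKind,
  run: (input: I) => O,
  opts: { requiresApproval?: boolean } = {}
): Tool<I, O> {
  return {
    name,
    kind,
    // Writes are gated unless a clean audit history has earned removal.
    requiresApproval: opts.requiresApproval ?? kind === "write",
    run,
  };
}

// Hypothetical tools for a marketplace-shaped system.
const lookupOrder = defineTool("lookupOrder", "read", (id: string) => ({
  id,
  status: "shipped",
}));
const cancelOrder = defineTool("cancelOrder", "write", (id: string) => ({
  id,
  cancelled: true,
}));
```

The point of the shape: the gate lives on the tool definition, not in the prompt, so swapping prompts never silently removes a gate.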

02

Invariants are the moat

We maintain a project-scoped invariant library — a set of rules of the form “don't do X because here's the production incident that taught us why.” Each invariant has a name, a one-paragraph explanation of the failure mode that produced it, and code-level enforcement when possible.

The PickNDeal codebase has 20+ invariants. The first 10 were written after watching agents (and humans) rediscover the same production failures. Each invariant represents a class of incidents that no longer recurs. The list compounds. The list is the moat.
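As a sketch of the structure (the invariant and incident below are invented for illustration — the real library is project-specific), each entry pairs a name and the lesson behind it with an executable check where one is possible:

```typescript
// Illustrative: an invariant couples a named rule and the failure
// mode that produced it with code-level enforcement where possible.
interface Invariant<T> {
  name: string;
  lesson: string; // the production incident that taught us the rule
  check: (state: T) => boolean;
}

// Hypothetical domain type for the example.
interface Order {
  total: number;
  refunded: number;
}

const invariants: Invariant<Order>[] = [
  {
    name: "refund-never-exceeds-total",
    lesson: "A retried webhook once refunded the same order twice.",
    check: (o) => o.refunded <= o.total,
  },
];

// Returns the names of every invariant the given state violates.
function violated<T>(state: T, invs: Invariant<T>[]): string[] {
  return invs.filter((i) => !i.check(state)).map((i) => i.name);
}
```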

03

The audit trail is the product

Every agent run writes to a trail: which tool calls, which inputs, which outputs, which human approvals, which auto-rollbacks. Stored as structured data, queryable, exportable. When something goes wrong — and it will — the trail is how we diagnose without re-running the agent. When something goes right, the trail is how we prove it to a stakeholder who wasn't in the room.
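In sketch form — field names here are illustrative, not the production schema — each tool call appends one structured entry, and queries run over the data rather than over a re-run of the agent:

```typescript
// Illustrative: one structured trail entry per tool call, stored as
// data so incidents are diagnosed without re-running the agent.
interface TrailEntry {
  runId: string;
  tool: string;
  input: unknown;
  output: unknown;
  approvedBy: string | null; // null = ran on rails, no human gate
  rolledBack: boolean;
  at: string; // ISO timestamp
}

const trail: TrailEntry[] = [];

function record(entry: Omit<TrailEntry, "at">): void {
  trail.push({ ...entry, at: new Date().toISOString() });
}

// Example query: every action a given human approved.
function approvedActions(who: string): TrailEntry[] {
  return trail.filter((e) => e.approvedBy === who);
}
```

In practice this lives in a database table, not an in-memory array; the shape of the entry is what matters.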

For client engagements, the audit trail is what we hand over alongside the working code. It's the evidence layer that lets the client's board sign off on agentic systems running in production.

04

Human-in-the-loop on the fault line, not everywhere

Reviewing every agent action manually defeats the point. Not reviewing any action defeats the point in a different way. We design the human-review surface around the fault line — the small set of actions that, if wrong, produce real damage (financial transactions, customer-visible mutations, irreversible deletes). Everything else runs on rails with audit-after-the-fact.

The dashboard pattern we use is exactly this: a queue of pending actions the agent has flagged, with one-tap approve/reject and a structured-diff view of what changed. We'll publish a standalone version of this UI as the first artefact under /open-source.
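The core of that queue fits in a few lines. This is a simplified sketch of the pattern, not the dashboard code itself:

```typescript
// Illustrative: flagged actions wait for a one-tap human decision;
// each carries a structured diff of what would change.
interface PendingAction {
  id: string;
  diff: { field: string; before: unknown; after: unknown }[];
  status: "pending" | "approved" | "rejected";
}

const queue = new Map<string, PendingAction>();

// The agent flags a fault-line action instead of executing it.
function flag(id: string, diff: PendingAction["diff"]): void {
  queue.set(id, { id, diff, status: "pending" });
}

// The human approves or rejects; a settled action can't be re-decided.
function decide(id: string, approve: boolean): PendingAction {
  const action = queue.get(id);
  if (!action || action.status !== "pending") {
    throw new Error("action is not pending");
  }
  action.status = approve ? "approved" : "rejected";
  return action;
}
```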

05

Reproduce in production-shape, not toy-shape

The standard mistake is testing agentic systems on toy data and trusting that production will look the same. It never does. We reproduce production shape from day one: real schemas, real concurrency, real error rates. If the agent can't handle a 1% failure rate from an upstream service in dev, it will cascade in production.
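One concrete way to get that in dev — a hypothetical harness, not tied to any particular codebase — is to wrap every upstream stub so it fails at a configured rate:

```typescript
// Illustrative: a dev harness that makes an upstream stub fail at a
// configurable rate, so a 1% error rate is exercised before production.
function withFailureRate<T>(
  call: () => T,
  rate: number, // e.g. 0.01 for a 1% failure rate
  random: () => number = Math.random
): () => T {
  return () => {
    if (random() < rate) {
      throw new Error("injected upstream failure");
    }
    return call();
  };
}

// Every dev call site sees realistic flakiness from day one.
const flakyUpstream = withFailureRate(() => "ok", 0.01);
```

The injectable `random` parameter is there so tests can force the failing and succeeding paths deterministically.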

06

Cron + queue + webhook is the holy trinity

Most agent failures we've seen are timing failures, not logic failures: cron jobs that run on a slightly wrong cadence; webhooks that retry into duplicate state; queues that lose messages on restart. We start every agentic system with the same trio: durable cron (with retries and alerting), idempotent webhook handlers (with HMAC verification and replay protection), and at-least-once queue processing (with deduplication on the consumer side).
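The consumer-side deduplication leg is the simplest to show. A minimal sketch, assuming message IDs are stable across redeliveries (in production the processed-ID set would live in durable storage, not memory):

```typescript
// Illustrative: at-least-once delivery means the consumer must
// deduplicate — a processed-ID set makes redelivery a no-op.
const processed = new Set<string>();

// Returns true if the message was applied, false if it was a duplicate.
function handleMessage(id: string, apply: () => void): boolean {
  if (processed.has(id)) return false; // already applied: skip
  apply();
  processed.add(id);
  return true;
}
```

The same idempotency idea underlies the webhook leg: a retried delivery with the same ID must not mutate state twice.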

07

Ship the boring infra first

Authentication, role-scoped permissions, secrets management, deployment hardening, error tracking, structured logging — none of this is “agentic.” All of it must work before any agent does anything interesting. The work most teams skip in their excitement is exactly what determines whether the interesting work survives the first month in production.
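To pick one item from that list, structured logging in sketch form — one JSON object per event so logs are queryable rather than grep-only (field names illustrative):

```typescript
// Illustrative: structured logging emits one JSON object per event,
// so the log stream can be queried like data.
function logEvent(
  level: "info" | "error",
  msg: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({
    level,
    msg,
    at: new Date().toISOString(),
    ...fields,
  });
  console.log(line);
  return line;
}
```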


Where the method came from

Refined across production codebases including PickNDeal (B2B + D2C food marketplace, 14 phases shipped), PayoutKit (Stripe Connect commercial boilerplate), and client engagements going back to 2018. Each phase surfaced a class of failure that produced an invariant. The invariants travel between projects — the methodology is the same whether we're building a marketplace or replacing a legacy procurement system.

We're publishing the engineering stories behind each invariant on the journal, and the human-review-loop UI is being extracted as the first open-source artefact. The full methodology lives inside client engagements.

Field guide

Get the method as a 12-page PDF.

Seven principles + a representative slice of the invariant library + a one-page “how to apply this to your codebase” guide. Yours to share with your team or your CTO.

Also subscribes you to the engineering journal (~1 email/month, unsubscribe at any time).

Want this method applied to your system?

We take on a small number of engagements per quarter. Discovery, build, or transformation tracks.

Submit a project brief