The method

The agentic delivery method.

Most teams treat “AI in the loop” as a creativity tool: useful for prototypes, suspect in production. We treat it as a delivery capability with the same governance you'd apply to a senior engineer: an audit trail, human review at risky steps, and rules that codify the failure modes we've encountered the hard way. What follows is the method we've refined across multiple production codebases: agentic engineering treated as the engineering practice it actually is, not the demo it usually gets pitched as.

01

Define the surface, not the prompt

The first thing we do on any new system is define the surface area an agent operates on: the read tools, the write tools, and the human-approval gates, not the prompt that drives it. Prompts are throwaway; the tool surface compounds. A well-designed surface lets a mediocre prompt produce reliable results; a great prompt on a leaky surface produces undefined behaviour at scale.

Concretely: every action the agent can take goes through a typed tool. Every state mutation goes through a human-review gate by default. We earn the right to remove a gate by demonstrating a clean audit history.
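A minimal sketch of what “typed tools, gated by default” can look like. The names here (`Tool`, `ToolSurface`, `requires_approval`) are illustrative assumptions, not our production API; the point is only that a mutating tool is gated unless someone explicitly opts it out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Tool:
    """One typed entry on the agent's surface (hypothetical shape)."""
    name: str
    handler: Callable[[dict], dict]
    mutates_state: bool
    requires_approval: bool


class ToolSurface:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}
        # (tool name, args) pairs waiting for a human decision.
        self.pending: List[Tuple[str, dict]] = []

    def register(
        self,
        name: str,
        handler: Callable[[dict], dict],
        mutates_state: bool,
        requires_approval: Optional[bool] = None,
    ) -> None:
        # Default: every state mutation is gated. Removing the gate is an
        # explicit decision (requires_approval=False), never an accident.
        if requires_approval is None:
            requires_approval = mutates_state
        self._tools[name] = Tool(name, handler, mutates_state, requires_approval)

    def call(self, name: str, args: dict, approved: bool = False) -> dict:
        tool = self._tools.get(name)
        if tool is None:
            # The agent cannot act outside the registered surface.
            raise KeyError(f"unknown tool: {name}")
        if tool.requires_approval and not approved:
            self.pending.append((name, args))
            return {"status": "pending_approval"}
        return {"status": "ok", "result": tool.handler(args)}
```

In this sketch a read tool such as a hypothetical `get_order` runs immediately, while a mutating `refund_order` returns `pending_approval` until a reviewer re-issues the call with `approved=True`.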

02

Codify what production teaches

We maintain a project-scoped rule library: a set of rules of the form “don't do X because here's the production failure mode that taught us why.” Each rule has a name, a one-paragraph explanation of the failure mode that produced it, and code-level enforcement when possible.
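As a rough illustration of that shape, a rule can be modelled as a name, a failure-mode paragraph, and an optional check. Everything below is a hypothetical sketch: the rule names, the stories, and the regex check are invented for the example and are not entries from our actual library.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    failure_mode: str  # one-paragraph story of the failure that produced the rule
    # Returns True when the source text violates the rule.
    # None means the rule is documentation-only (no code-level enforcement yet).
    check: Optional[Callable[[str], bool]] = None


def violations(rules: List[Rule], source: str) -> List[str]:
    """Names of every enforceable rule that the given source text violates."""
    return [r.name for r in rules if r.check is not None and r.check(source)]


# Hypothetical rules, for illustration only.
RULES = [
    Rule(
        name="no-float-money",
        failure_mode=(
            "Illustrative story: amounts parsed as floats drifted by a cent "
            "after repeated arithmetic, so totals stopped reconciling."
        ),
        check=lambda src: re.search(r"float\(\s*(amount|price)", src) is not None,
    ),
    Rule(
        name="document-retry-budget",
        failure_mode="Illustrative story: unbounded retries amplified an outage.",
    ),
]
```

A checker like `violations(RULES, diff_text)` could run in CI or as a pre-commit step; rules without a `check` still travel with the codebase as written guidance.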

The PickNDeal codebase has 50+ rules, and there are 75+ across our libraries, including the cross-project layer that travels between projects. The first 10 were written after watching agents (and humans) re-discover the same production failures. Each rule closes off a class of failure so that it doesn't recur. The list compounds. The engineering stories behind individual rules get published as they land: see the journal, where every post ends with the rules the failure mode produced.

Read the production story this principle came from →

03

The audit trail is the product

Every agent run writes to a trail: which tool calls, which inputs, which outputs, which human approvals, which auto-rollbacks. The trail is stored as structured data, queryable and exportable. When something goes wrong (and it will), the trail is how we diagnose without re-running the agent. When something goes right, the trail is how we prove it to a stakeholder who wasn't in the room.

For client engagements, the audit trail is what we hand over alongside the working code. It's the evidence layer that lets the client's board sign off on agentic systems running in production.
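One way to picture “structured, queryable, exportable” is the sketch below. The field names (`run_id`, `kind`, `actor`) and the in-memory list are assumptions made for the example; a real trail would sit in a durable store.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class AuditEvent:
    run_id: str
    kind: str  # e.g. "tool_call", "approval", "rollback" (assumed vocabulary)
    tool: Optional[str] = None
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    actor: str = "agent"
    ts: float = field(default_factory=time.time)


class AuditTrail:
    def __init__(self) -> None:
        # In-memory for the sketch only; assume append-only durable storage.
        self._events: List[AuditEvent] = []

    def record(self, event: AuditEvent) -> None:
        self._events.append(event)

    def query(self, **filters) -> List[AuditEvent]:
        """Return events whose fields equal every given filter value."""
        return [
            e for e in self._events
            if all(getattr(e, key) == value for key, value in filters.items())
        ]

    def export_jsonl(self) -> str:
        """One JSON object per line, suitable for handing over with the code."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)
```

With this shape, a question like “what did run r1 do?” becomes `trail.query(run_id="r1")` rather than a re-run of the agent.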

04

Human-in-the-loop on the fault line, not everywhere

Reviewing every agent action manually defeats the point. Not reviewing any action defeats the point in a different way. We design the human-review surface around the fault line: the small set of actions that, if wrong, produce real damage (financial transactions, customer-visible mutations, irreversible deletes). Everything else runs on rails and is audited after the fact.

The dashboard pattern we use is exactly this: a queue of pending actions the agent has flagged, with one-tap approve/reject and a structured-diff view of what changed. The same pattern ships into every client engagement as part of the deliverable.
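The routing behind such a queue can be sketched in a few lines. The category names and the `ReviewQueue` class are assumptions for illustration; the dashboard itself (one-tap UI, diff rendering) is out of scope here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed category labels for the fault line named in the text.
FAULT_LINE = {
    "financial_transaction",
    "customer_visible_mutation",
    "irreversible_delete",
}


@dataclass
class Action:
    id: str
    category: str
    diff: Dict[str, Tuple[object, object]]  # field -> (before, after)
    status: str = "new"


@dataclass
class ReviewQueue:
    pending: Dict[str, Action] = field(default_factory=dict)
    executed: List[Action] = field(default_factory=list)  # audited after the fact

    def submit(self, action: Action) -> str:
        if action.category in FAULT_LINE:
            # On the fault line: hold for a human decision.
            action.status = "pending"
            self.pending[action.id] = action
        else:
            # Off the fault line: run on rails, review later via the trail.
            action.status = "executed"
            self.executed.append(action)
        return action.status

    def decide(self, action_id: str, approve: bool) -> str:
        action = self.pending.pop(action_id)
        action.status = "executed" if approve else "rejected"
        if approve:
            self.executed.append(action)
        return action.status
```

The `diff` field is what a structured-diff view would render; approve and reject map onto `decide(action_id, approve=True)` and `decide(action_id, approve=False)`.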

05

Reproduce in production-shape, not toy-shape

The standard mistake is testing agentic systems on toy data and trusting that production will look the same. It never does. We reproduce production shape from day one: real schemas, real concurrency, real error rates. If the agent can't handle a 1% failure rate from an upstream service in dev, it will cascade in production.
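A small fault-injection wrapper is one way to bring that 1% failure rate into dev. This is a sketch under assumptions: `flaky`, `call_with_retry`, and `UpstreamError` are hypothetical helpers, and a real setup would add backoff and inject latency as well as errors.

```python
import random
from typing import Callable


class UpstreamError(Exception):
    """Raised when the (simulated) upstream service fails."""


def flaky(fn: Callable, failure_rate: float, roll: Callable[[], float] = random.random):
    """Wrap fn so that roughly failure_rate of calls raise UpstreamError.

    roll is injectable so tests can script the outcomes deterministically.
    """
    def wrapper(*args, **kwargs):
        if roll() < failure_rate:
            raise UpstreamError("injected upstream failure")
        return fn(*args, **kwargs)
    return wrapper


def call_with_retry(fn: Callable, attempts: int = 3):
    """Call fn, retrying on UpstreamError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except UpstreamError as exc:
            last_error = exc
    # Every attempt failed: surface the last error instead of hiding it.
    raise last_error
```

Wrapping a dev client as `flaky(client_call, failure_rate=0.01)` makes the agent meet upstream failures long before production does.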

06

Cron + queue + webhook is the holy trinity

Most agent failures we've seen are timing failures, not logic failures: cron jobs that run on a slightly wrong cadence, webhooks that retry into duplicate state, queues that lose messages on restart. We start every agentic system with the same trio: durable cron (with retries and alerting), idempotent webhook handlers (with HMAC verification and replay protection), and at-least-once queue processing (with deduplication on the consumer side).
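The webhook leg of the trio can be sketched with the standard library alone. The signing scheme below (HMAC-SHA256 over `timestamp.body`), the five-minute tolerance, and the in-memory set of seen event IDs are assumptions for the example; real providers each define their own header and signature format.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>" (an assumed scheme)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class WebhookHandler:
    def __init__(self, secret: bytes, tolerance_s: int = 300) -> None:
        self.secret = secret
        self.tolerance_s = tolerance_s
        # In-memory for the sketch; assume a durable store in a real system,
        # ideally updated in the same transaction as the side effect.
        self.seen_event_ids = set()
        self.processed = []

    def handle(self, event_id: str, timestamp: int, body: bytes,
               signature: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        expected = sign(self.secret, timestamp, body)
        # Constant-time comparison avoids leaking the signature via timing.
        if not hmac.compare_digest(expected, signature):
            return "rejected: bad signature"
        # Replay protection: signed but old deliveries are refused.
        if abs(now - timestamp) > self.tolerance_s:
            return "rejected: stale timestamp"
        # Idempotency: a retried delivery must not apply its effect twice.
        if event_id in self.seen_event_ids:
            return "duplicate: already processed"
        self.seen_event_ids.add(event_id)
        self.processed.append(body)
        return "processed"
```

The same seen-ID check is what consumer-side deduplication looks like for an at-least-once queue: the message ID takes the place of the event ID.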

Read the production story this principle came from →

07

Ship the boring infra first

Authentication, role-scoped permissions, secrets management, deployment hardening, error tracking, structured logging: none of this is “agentic.” All of it must work before any agent does anything interesting. The work most teams skip in their excitement is exactly what determines whether the interesting work survives the first month in production.

Read the production story this principle came from →


Where the method came from

The method was refined across production codebases including PickNDeal (B2B + D2C food marketplace, 14 phases shipped), PayoutKit (Stripe Connect module extracted from PickNDeal), and client engagements going back to 2018. Each phase surfaced a class of failure that produced a rule. The rules travel between projects: the methodology is the same whether we're building a marketplace or replacing a legacy procurement system.

We're publishing the engineering stories behind each rule on the journal. The full agentic engineering methodology lives inside client engagements.

Field guide

Get the method as a 5-page PDF.

Seven principles + a representative slice of the rule library + a one-page “how to apply this to your codebase” guide. Yours to share with your team or your CTO.

Want this method applied to your system?

We take on a small number of engagements per quarter. Discovery, build, or transformation tracks.
