8 weeks of an AI agent in production, what the engineering team actually does each week

The brochure for an AI agent in production usually shows the agent doing the work. The reality is that the engineering team around the agent works at least as hard as the agent does, just on different things. This is a week-by-week account of what the engineers actually changed, tightened, and codified over the first 8 weeks of a real AI agent running in a real product. The worked example is the AI offer agent in PickNDeal: it watches new restaurant orders, drafts a price-and-availability offer on behalf of a supplier, and submits it for the supplier to review and send. Names of files and rules are real; the lessons generalize.

Week 1. The agent works on the happy path, and nothing else

The first version ships in three days. Anthropic SDK wired into a tRPC mutation, a single tool (draftOffer), and the supplier hits a button labelled “Generate draft”. The first ten orders generate sensible drafts. The team feels good.

What the team is doing this week is mostly watching: tailing the audit log of every tool call, eyeballing the JSON of every draft, comparing it to what the supplier would have written by hand. No automation around the agent yet, no rate limits, no kill-switch, just a button and attention.

Week 2. The first time it generates 31 offers in one minute

A supplier opens the dashboard, the dashboard refetches the order list, the order list mounts a component that mounts the “Suggest offer” CTA, the CTA effect runs because a state value flipped, and the agent fires on every visible order. 31 draft offers materialize in the supplier's inbox before anyone can blink.

This is the week the team learns that “the agent did the right thing” is not the same as “the agent did the thing once.” Two rules get written:

Idempotency keys on every mutation tool: the dispatcher refuses to invoke draftOffer for the same (supplierId, orderId) pair twice within a 60-minute window.
User intent must precede agent action: the “Suggest offer” CTA is moved out of an effect and into an explicit click handler. No agent invocation runs as a side effect of a re-render.

The week ends with a one-page postmortem and a new entry in the rule library. The agent code itself barely changed; the surface around it did.

Week 3. The dispatcher gets a typed allowlist

A junior engineer adds a new tool, cancelOrder, to the dispatcher, intending it for an internal admin use case. The supplier agent now has access to it. Nobody catches it in review because the tool list is just an array of strings.

The fix lands the same day. Tools are now declared as a typed map keyed by tool name, with a per-role allowlist. The supplier agent's allowlist is explicit: draftOffer, listMyOrders, getOrderDetail. The dispatcher refuses to invoke anything not on the allowlist, at the type level (TypeScript narrows the call signature) and at the runtime check (throws before the tool ever reaches Anthropic).

The pattern is documented in the dispatcher post. The team stops thinking of “add a tool” as a one-line change.

Week 4. The first real production incident is a token-count regression

A change to the system prompt, a one-paragraph addition explaining a new supplier tier, pushes the agent over a token threshold that triples the per-call cost overnight. Finance notices the next morning because the Anthropic invoice projection jumps by a factor of three.

This is the week the team adds the boring infrastructure that nobody talks about: a per-call token-count log, a daily cost rollup, a Slack alert that fires if the rolling 24h cost crosses €5. The agent itself is unchanged; the operating discipline around it gets the same treatment a paid API would get from any sober team.

Week 5. The supplier who clicks “Send” without reading

A supplier sends an obviously wrong draft to a restaurant: the offer has the wrong unit (kg instead of pieces) on one line item, because the order text was ambiguous and the agent guessed. The restaurant is confused; the supplier blames the platform.

The team has two choices: improve the agent so it stops guessing, or improve the workflow so the supplier can't mindlessly accept a wrong draft. Both happen. The agent gets a rule: if the unit is ambiguous, the draft includes a flagged note rather than a guess. The UI gets a rule: the “Send” button is disabled until the supplier has clicked into each line item at least once. Tiny change, durable effect.

The lesson the team writes down is harder than either fix: an AI agent in front of a human is a co-pilot, and the co-pilot pattern only works if the human is actually flying. The product surface has to make the human fly.

Week 6. The agent is now part of the regression test suite

The team has accumulated a small library of orders that previously broke the agent: the unit-ambiguous one, the multi-line-with-Danish-decimals one, the order with a strikethrough in the description, the one with a comma-separated list that looked like a single product. These become a golden test set.

Every PR that touches the agent, system prompt, tool definitions, model version, even tool descriptions, runs against the golden set with the real Anthropic API in test mode. The assertion isn't “the agent produces text X” (LLM output is stochastic). The assertion is structural: draftOffer is called with the right number of line items, the unit field is non-null, the price is within a sane range, the flagged-note field is set when the input is ambiguous. Anything else and CI fails.

This is the week the agent stops being a thing the team is nervous about deploying. The golden set is the difference.

Week 7. The cold-start latency conversation

A new supplier signs up. The first time they click “Generate draft” the agent takes 11 seconds because the Anthropic connection is cold and the system prompt has grown to 1,800 tokens. The supplier closes the tab.

The team profiles the call: 700ms network handshake, 200ms prompt construction (the prompt is built from a tRPC query that joins three tables), 9.1s actual model latency. Two changes ship: the supplier's recent order context is pre-warmed into a Redis hash so the prompt builder is a single read, and the model call moves from claude-sonnet-4-5 to claude-haiku-4-5 for the first-pass draft, falling back to sonnet only if the supplier clicks “Refine”.

The first-click latency drops to 2.3s. The team realizes the model-choice decision is not a single global call but a per-call decision that depends on the conversation state. The dispatcher learns to accept a model hint per invocation.

Week 8. The audit trail is the product

The team takes a step back at the end of week 8 and looks at what was actually shipped over two months. The AI agent feature description, “suggests an offer for the supplier”, is a single tRPC mutation. The supporting infrastructure is:

A typed dispatcher with role-scoped tool allowlists (week 3)
Idempotency keys on every mutation tool, enforced at the dispatcher level (week 2)
An audit-trail table that records every tool call, its arguments, its result, the model used, the token counts, and the supplier-acting-on-behalf-of (every week)
A per-call cost log + daily rollup + threshold alert (week 4)
A “suggest, don't send” UI pattern where the draft is always reviewable before it leaves the platform (week 5)
A golden test set of 23 orders that runs against the real Anthropic API in test mode on every PR (week 6)
A per-call model-hint mechanism so cheap calls use haiku and expensive ones use sonnet (week 7)

The audit trail is the surface the team trusts. When a supplier asks “why did the agent suggest this price?”, the answer is a query against the audit table that returns the exact prompt, the exact tool calls, the exact response, and the exact draft that was presented. The agent is opaque; the trail around it is not. That asymmetry is what makes the agent shippable.

What a CTO actually buys with 8 weeks

Not “a working AI feature”. That was working in week 1. What 8 weeks buys is the surface area the team needs to deploy without flinching: the dispatcher, the audit trail, the golden set, the cost guards, the human-in-the-loop UI. The agent itself stayed roughly the same size. The infrastructure around it grew by an order of magnitude, and that growth is the actual feature.

The mistake we see other teams make is shipping the week-1 version and calling it done. It works. It also burns down quietly: one wrong offer to a key customer, one cost surprise on the invoice, one new tool that the wrong role can call, and the trust the agent built up over three good months evaporates in an afternoon. The 8 weeks of unglamorous surface-area work are what keep the trust intact.

The methodology rules that emerged from this 8-week stretch are now part of the engineering rule library. The dispatcher pattern is documented in its own post. The testing posture is in the real-API testing post. Each of those started life as a one-paragraph entry written at the end of a week, on the way home from a near-miss the team didn't want to repeat.

The principles these weeks codify live in the agentic engineering method. The system this account is taken from: the PickNDeal case study. The companion engineering posts: the dispatcher, real-API testing, the approval queue pattern.