Streaming Claude tool use in Next.js 16 without breaking the agent loop

You have a Claude-powered agent. The user types something. You want the text to start appearing on the screen the moment Claude starts generating it. You also want Claude to call tools mid-turn (read a row, write a row, hit an external API) and have those tool calls actually execute on your server, not arrive at the browser as garbled JSON. The dominant Next.js streaming tutorials handle the text case. The dominant Anthropic tool-use tutorials handle the non-streaming case. The intersection, streaming + tool use, in a Next.js 16 route handler, with a real production surface around it, is where almost every team we have seen tries the obvious thing and breaks their agent loop. This is the pattern that works.

The dominant pattern that breaks

The shape most teams reach for first:

// DON'T. This is the pattern that breaks
export async function POST(req: Request) {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-5',
    tools: [...],
    messages: [...],
  });

  return new Response(
    stream.toReadableStream(), // <-- forwards every chunk to the client
    { headers: { 'Content-Type': 'text/event-stream' } }
  );
}

The client receives the stream. Text deltas show up. The user is happy until the model decides to call a tool. The tool-use block arrives as a sequence of input_json_delta events that look exactly like text deltas to the forwarding code. The fragments arrive at the browser as `{"q`, then `uery":"`, then `select * from`. The browser renders them as partial text. The tool never fires. The agent loop never closes. The session hangs.

The fix is not “buffer everything until the end.” That works but it makes the stream pointless. The user sees nothing for 15 seconds, then a wall of text. The fix is to split the stream at the right boundary: forward text events to the client, intercept tool events server-side, execute them, feed the results back to Anthropic as the next user turn.

The pattern that works

One handler. One ReadableStream to the client. One Anthropic stream per turn (multiple if the model calls tools), with a dispatcher in between that knows what to do with each event type.

// src/app/api/agent/chat/route.ts
import { Anthropic } from '@anthropic-ai/sdk';
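import { dispatchTool } from '@/lib/agent/dispatcher';
// Assumed locations for the helpers used below; adjust to wherever they live in your app.
import { checkRateLimit } from '@/lib/agent/rate-limit';
import { getToolsForUser } from '@/lib/agent/tools';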

const anthropic = new Anthropic();

export async function POST(req: Request) {
  const { messages, userId } = await req.json();

  // Rate-limit check runs BEFORE the stream opens.
  // A stream that has started is never killed by rate limiting.
  const limit = await checkRateLimit(userId);
  if (!limit.ok) {
    return new Response('Rate limit exceeded', {
      status: 429,
      headers: { 'Retry-After': String(limit.retryAfter) },
    });
  }

  const encoder = new TextEncoder();
  const abortController = new AbortController();

  // If the client disconnects, abort the Anthropic stream too.
  // Without this, the server keeps paying for tokens that arrive at a dead socket.
  req.signal.addEventListener('abort', () => abortController.abort());

  const readable = new ReadableStream({
    async start(controller) {
      try {
        // Conversation state is mutable across multiple Anthropic turns
        // within the same client request (model -> tool -> model -> ...).
        let conversation = [...messages];
        let keepLooping = true;

        while (keepLooping) {
          // Accumulator for tool_use blocks: maps block-index to partial JSON string.
          const toolBlocks = new Map<number, { name: string; id: string; input: string }>();
          const assistantBlocks: Anthropic.ContentBlock[] = [];

          const stream = anthropic.messages.stream(
            {
              model: 'claude-sonnet-4-5',
              max_tokens: 4096,
              tools: getToolsForUser(userId),
              messages: conversation,
            },
            { signal: abortController.signal },
          );

          for await (const event of stream) {
            if (event.type === 'content_block_start') {
              if (event.content_block.type === 'tool_use') {
                // Tool-use block opened -- remember it, don't forward to client.
                toolBlocks.set(event.index, {
                  name: event.content_block.name,
                  id: event.content_block.id,
                  input: '',
                });
              }
              // Text blocks: nothing to do at start, deltas come next.
            }

            if (event.type === 'content_block_delta') {
              if (event.delta.type === 'input_json_delta') {
                // Accumulate tool input -- never forward.
                const tb = toolBlocks.get(event.index);
                if (tb) tb.input += event.delta.partial_json;
              } else if (event.delta.type === 'text_delta') {
                // Text delta -- forward immediately as SSE.
                controller.enqueue(encoder.encode(
                  `data: ${JSON.stringify({ type: 'text', text: event.delta.text })}\n\n`
                ));
              }
            }

            if (event.type === 'content_block_stop') {
              // Note: assistant blocks are reconstructed from the final message below.
            }

            if (event.type === 'message_stop') {
              keepLooping = false; // reset here; set back to true below if tool calls were made
            }
          }

          // Collect the final assistant message (tools + text together).
          const finalMessage = await stream.finalMessage();
          assistantBlocks.push(...finalMessage.content);

          if (toolBlocks.size === 0) {
            // No tool calls -- turn is over, exit the outer loop.
            break;
          }

          // Tool calls were made. Execute them in PARALLEL (this is where
          // serial execution kills the loop's perceived latency).
          const toolResults = await Promise.all(
            Array.from(toolBlocks.values()).map(async (tb) => {
              try {
                // Accumulated input can be empty (e.g. a tool with no parameters); treat that as {}.
                const parsed = JSON.parse(tb.input || '{}');
                const result = await dispatchTool({ name: tb.name, input: parsed, userId });
                return { type: 'tool_result' as const, tool_use_id: tb.id, content: JSON.stringify(result) };
              } catch (err) {
                return {
                  type: 'tool_result' as const,
                  tool_use_id: tb.id,
                  content: `Error: ${(err as Error).message}`,
                  is_error: true,
                };
              }
            })
          );

          // Push the assistant turn + the tool_result back into the conversation,
          // then loop to let Claude continue (it may call more tools, or finish).
          conversation.push({ role: 'assistant', content: assistantBlocks });
          conversation.push({ role: 'user', content: toolResults });
          keepLooping = true;
        }

        controller.close();
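      } catch (err) {
        // If the client already disconnected, the stream may be cancelled and
        // enqueue() would throw, so only report errors to a live client.
        if (!abortController.signal.aborted) {
          controller.enqueue(encoder.encode(
            `data: ${JSON.stringify({ type: 'error', message: (err as Error).message })}\n\n`
          ));
          controller.close();
        }
      }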
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
      'X-Accel-Buffering': 'no', // disable nginx buffering of the stream
    },
  });
}

What this fixes that the obvious version doesn't

Tool input never reaches the client. The dispatcher intercepts every input_json_delta at the route handler and accumulates server-side. The browser only ever sees text_delta events, neatly formatted as SSE.
Tools run in parallel. Promise.all on the tool dispatch is non-negotiable. A turn that calls 3 tools serially feels slow even when each tool is fast.
The agent loop closes itself. When the model emits no tool calls in a turn, the outer while exits. When tools were called, the loop runs again with the conversation history extended by the assistant turn + the tool results.
Client disconnect cancels the model call. The AbortController is wired to the request signal. The Anthropic SDK aborts the in-flight request when the signal fires, so generation is not left running into a dead socket.
Rate limit runs before the stream opens. Once the stream has started, we never kill it for rate-limit reasons. The audit trail would have a half-finished turn with no resolution, and the user would not know whether their action succeeded.
nginx buffering is disabled per-stream. Without X-Accel-Buffering: no, nginx's proxy buffering can hold chunks back instead of passing them through as they arrive, which defeats streaming.
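On the browser side, the frames written by the route can be decoded with a small buffer-based parser. The post does not show client code, so the following is only a sketch: parseSseFrames and the AgentEvent type are hypothetical names, and it assumes every frame is a single `data: <json>` line terminated by a blank line, which is how the handler above writes them.

```typescript
// Hypothetical client-side helper: splits a growing text buffer into complete SSE frames.
type AgentEvent =
  | { type: 'text'; text: string }
  | { type: 'error'; message: string };

function parseSseFrames(buffer: string): { events: AgentEvent[]; rest: string } {
  const events: AgentEvent[] = [];
  const frames = buffer.split('\n\n');
  // The last element is '' (buffer ended on a frame boundary) or an incomplete frame.
  const rest = frames.pop() ?? '';
  for (const frame of frames) {
    if (!frame.startsWith('data: ')) continue; // skip anything that is not a data line
    events.push(JSON.parse(frame.slice('data: '.length)) as AgentEvent);
  }
  return { events, rest };
}
```

A consumer would append each decoded network chunk to the buffer, call parseSseFrames, render the returned events, and carry rest over to the next chunk, so a frame split across two network reads is never parsed half-finished.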

The dispatcher does the actual work

The dispatchTool function referenced above is the typed dispatcher pattern documented in its own post. Three things matter for the streaming case:

Every tool call writes to the audit trail before it executes and after it returns, with the same conversation turn id as the streamed text. Streaming does not get to skip audit writes.
Tool errors get serialized as a string and sent back to Claude as a tool_result with is_error: true. The model knows how to recover from tool errors mid-turn, but only if you tell it the tool actually errored, not by silently dropping the failed call.
Long-running tools (anything over ~5s) should be pre-checked at dispatch time. Return a structured “not available right now” instead of letting the route handler hang. The agent will plan around it.
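The first two points can be sketched as a wrapper around a single tool implementation. This is not the dispatcher from the referenced post, which is not shown here; runToolWithAudit, AuditSink, and the record fields are hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes: the real dispatcher and audit schema are not shown in this post.
type AuditRecord = {
  turnId: string;
  toolUseId: string;
  tool: string;
  phase: 'before' | 'after';
  input?: unknown;
  output?: unknown;
  isError?: boolean;
};
type AuditSink = (record: AuditRecord) => Promise<void>;
type ToolResult = { type: 'tool_result'; tool_use_id: string; content: string; is_error?: boolean };

async function runToolWithAudit(
  call: { turnId: string; toolUseId: string; tool: string; input: unknown },
  impl: (input: unknown) => Promise<unknown>,
  audit: AuditSink,
): Promise<ToolResult> {
  const base = { turnId: call.turnId, toolUseId: call.toolUseId, tool: call.tool };
  // Write the "before" row first so a crash mid-tool still leaves a trace.
  await audit({ ...base, phase: 'before', input: call.input });
  try {
    const output = await impl(call.input);
    await audit({ ...base, phase: 'after', output });
    // `?? null` keeps content a string even when the tool returns undefined.
    return { type: 'tool_result', tool_use_id: call.toolUseId, content: JSON.stringify(output ?? null) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    await audit({ ...base, phase: 'after', output: message, isError: true });
    // Report the failure to the model instead of dropping the call.
    return { type: 'tool_result', tool_use_id: call.toolUseId, content: `Error: ${message}`, is_error: true };
  }
}
```

Because the wrapper never throws, it can be used inside Promise.all without one failing tool rejecting the whole batch.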

What gets tested

The streaming path is the easiest place to ship a regression. Our golden test set (covered in the real-API testing post) exercises the SSE route against the real Anthropic API in test mode with three structural assertions per scenario:

The text content arriving at the client equals the text content in the final assistant message (proves no text was dropped or duplicated by the SSE writer).
The dispatcher was invoked exactly once per emitted tool_use block (proves the accumulator handled multi-tool turns without double-firing).
The audit trail row for the turn has every tool call recorded with its input + output (proves nothing was streamed past the trail).

The third assertion is the one that catches subtle bugs. An SSE handler that streams text correctly but skips the audit write for tool calls passes the first two and fails the third, and that's exactly the bug that would let an agent take destructive actions without leaving a trace, which is the entire methodology violation we are protecting against.
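As a rough illustration, the three assertions could be expressed as one check over whatever the test harness captures for a scenario. The CapturedTurn shape and checkTurnInvariants are hypothetical; the actual golden test set lives in the testing post and may be structured quite differently.

```typescript
// Hypothetical shape for what a test harness might capture for one scenario.
type CapturedTurn = {
  clientText: string;      // concatenated text events received over SSE
  finalText: string;       // concatenated text blocks of the final assistant message(s)
  toolUseIds: string[];    // ids of every tool_use block the model emitted
  dispatchedIds: string[]; // ids the dispatcher was invoked with
  auditedIds: string[];    // ids with input + output recorded in the audit trail
};

// Returns a list of failed invariants; an empty list means the scenario passed.
function checkTurnInvariants(t: CapturedTurn): string[] {
  const failures: string[] = [];
  // 1. No text dropped or duplicated by the SSE writer.
  if (t.clientText !== t.finalText) failures.push('text mismatch');
  // 2. Exactly one dispatch per tool_use block (compared as multisets).
  const sorted = (xs: string[]) => [...xs].sort().join(',');
  if (sorted(t.dispatchedIds) !== sorted(t.toolUseIds)) failures.push('dispatch mismatch');
  // 3. Every tool call has an audit row.
  const audited = new Set(t.auditedIds);
  if (!t.toolUseIds.every((id) => audited.has(id))) failures.push('missing audit rows');
  return failures;
}
```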

Backpressure, disconnects, and the non-streaming fallback

Three production realities the obvious tutorials don't cover:

Backpressure

A slow client (mobile network, packet loss) can't consume SSE chunks as fast as Claude produces them. When that happens, the ReadableStream controller's enqueue buffers the unread chunks in memory. For very long responses on slow connections this can pin a megabyte or more of memory per request. The mitigation: cap concurrent streaming connections per server instance, and reject new requests with 503 + Retry-After when the cap is hit. We use 50 concurrent streams per Next.js process; over that, the route returns 503 and the client retries.
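A per-process cap like this can be as simple as a counter with an idempotent release function. The sketch below is one possible shape, not the production code: the limit of 50 comes from the text, while tryAcquireStreamSlot and the module-level counter are hypothetical.

```typescript
// Per-process counter. The limit of 50 is from the text; the helper name is hypothetical.
const MAX_CONCURRENT_STREAMS = 50;
let activeStreams = 0;

// Returns a release function when a slot is free, or null when the cap is hit.
function tryAcquireStreamSlot(): (() => void) | null {
  if (activeStreams >= MAX_CONCURRENT_STREAMS) return null;
  activeStreams += 1;
  let released = false;
  return () => {
    if (released) return; // guard against double release
    released = true;
    activeStreams -= 1;
  };
}
```

In the route handler this would run before the ReadableStream is constructed: a null result maps to the 503 + Retry-After response, and the release function is called once the stream finishes, errors, or is cancelled. Making release idempotent means calling it from more than one of those paths is harmless.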

Disconnects

The hook is one line:

req.signal.addEventListener('abort', () => abortController.abort());

That makes the Anthropic SDK stop the model call when the client tab closes. Without it, Claude continues generating into a dead socket for up to max_tokens worth of generation, which is real money on long turns. Also: the audit-trail write must still happen on abort, with whatever was generated up to that point, marked with a disconnected_at timestamp. Otherwise disconnected sessions silently vanish from the trail.

Non-streaming fallback

The dispatcher logic is the same; the only difference is the response shape. Three kinds of client should always hit the non-streaming variant: cron-driven agents (no human watching), MCP clients that don't handle SSE, and any client that sends Accept: application/json rather than text/event-stream. Same handler, content negotiation at the top, exact same dispatcher in the middle, JSON response with the assembled assistant turn + tool calls at the end.
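The content-negotiation step at the top of the handler might look like the following. wantsEventStream is a hypothetical helper, and treating a missing or wildcard Accept header as "send JSON" is an assumption of this sketch, not something the post specifies.

```typescript
// Hypothetical helper: decide the response shape from the Accept header value.
function wantsEventStream(accept: string | null): boolean {
  // Assumption: with no Accept header, fall back to the non-streaming JSON shape.
  if (!accept) return false;
  return accept
    .split(',')
    .map((part) => part.split(';')[0].trim().toLowerCase()) // drop q-values and whitespace
    .includes('text/event-stream');
}
```

In the route this would be called as wantsEventStream(req.headers.get('accept')). Under this sketch a browser client has to opt in by sending Accept: text/event-stream explicitly, since a wildcard such as */* falls through to JSON.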

Why this all matters

Streaming + tool use is one of those features that looks like a small tweak to the non-streaming version and turns out to be a different architecture. Most teams that try the obvious thing ship something that works for text-only turns, breaks silently the first time the model decides to call a tool, and ends up with users who think the agent is “flaky” because their second prompt hung. The fix is not exotic: split the event stream at the tool boundary, accumulate tool input server-side, dispatch in parallel, never let rate limiting interrupt a stream that has already started, hook the abort signal. The methodology rules this codifies are: principle 03 (audit trail by default, even mid-stream), principle 04 (human in the loop on the fault line, and a mid-stream disconnect is part of the fault line), and principle 07 (ship the boring infra first: disconnect handling, nginx buffering, per-instance concurrency caps).

FAQ

Can you stream a Claude response while a tool call is being assembled?

Yes, but the standard streaming loop has to be aware that tool-use blocks arrive incrementally. The Anthropic SDK emits a content_block_start event with type "tool_use", followed by input_json_delta events that you must accumulate into a single JSON string before you can parse and execute the tool. If you treat tool_use deltas like text deltas and forward them to the client, the client sees garbage and the tool never fires.

Why does streaming + tool use commonly break in production?

Three failure modes recur. (1) The route handler forwards every chunk to the SSE response without distinguishing tool-use deltas from text deltas, so the tool input arrives at the client as text and is never executed. (2) The handler buffers everything and only invokes the tool at message_stop, which works but defeats the point of streaming. (3) The handler invokes the tool inline mid-stream and blocks the SSE writer until the tool returns, which freezes the visible response and times out long tools. The pattern in this post avoids all three.

How do you handle multiple tool calls in one streamed turn?

The model can emit multiple tool_use content blocks in a single turn. Each gets its own content_block_start and stream of input_json_delta events, ending in a content_block_stop. The dispatcher must keep a map of block-index to accumulated-JSON-string, parse each one at its content_block_stop, execute the tool calls in parallel via Promise.all, then send the tool_result blocks back to Anthropic as the next user turn. Running tools serially in this loop is the most common cause of agent loops feeling sluggish.
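Isolated from the route handler, the block-index map can be sketched as a pure function over the event sequence. The StreamEvent type below is a trimmed-down stand-in that mirrors only the fields used in this post, not the SDK's full event types, and collectToolCalls is a hypothetical name.

```typescript
// Minimal event shapes mirroring the fields used in this post; not the SDK's full types.
type StreamEvent =
  | { type: 'content_block_start'; index: number;
      content_block: { type: 'tool_use'; id: string; name: string } | { type: 'text' } }
  | { type: 'content_block_delta'; index: number;
      delta: { type: 'input_json_delta'; partial_json: string } | { type: 'text_delta'; text: string } }
  | { type: 'content_block_stop'; index: number };

type ToolCall = { id: string; name: string; input: unknown };

function collectToolCalls(events: Iterable<StreamEvent>): ToolCall[] {
  const partial = new Map<number, { id: string; name: string; json: string }>();
  const done: ToolCall[] = [];
  for (const ev of events) {
    if (ev.type === 'content_block_start' && ev.content_block.type === 'tool_use') {
      partial.set(ev.index, { id: ev.content_block.id, name: ev.content_block.name, json: '' });
    } else if (ev.type === 'content_block_delta' && ev.delta.type === 'input_json_delta') {
      const p = partial.get(ev.index);
      if (p) p.json += ev.delta.partial_json;
    } else if (ev.type === 'content_block_stop') {
      const p = partial.get(ev.index);
      if (p) {
        // Accumulated JSON can be empty (e.g. a tool with no parameters), so fall back to '{}'.
        done.push({ id: p.id, name: p.name, input: JSON.parse(p.json || '{}') });
        partial.delete(ev.index);
      }
    }
  }
  return done;
}
```

Text blocks pass through untouched because their index is never added to the map, so a content_block_stop for a text block is a no-op here.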

What happens when the client disconnects mid-stream?

Two things have to be true. (1) The Anthropic stream must be aborted server-side via the AbortController passed into the messages.stream call. Otherwise you pay for the rest of the generation and the model keeps streaming into a dead socket. (2) The audit-trail write must still happen with whatever was generated up to the disconnect, marked with a disconnected_at timestamp. We hook the route handler's signal abort event to fire both. Without this, disconnected sessions silently disappear from the audit trail and the cost log understates by 5-15%.

How do you rate-limit a streaming Claude endpoint without dropping mid-stream connections?

The rate-limit check runs before the stream starts, never during. We check three things in the route handler before calling messages.stream: (1) per-user requests-per-minute counter in Redis with a 60-second sliding window, (2) per-user token budget for the current day, (3) global concurrency cap to protect the Anthropic rate ceiling. If any of the three fails, we return a 429 with a Retry-After header before the stream opens. A stream that has already started is never killed by rate limiting. That would corrupt the audit trail and confuse the user about whether their turn happened.
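The first of the three checks, the per-user sliding window, can be illustrated with an in-memory stand-in. The production version described above uses Redis; SlidingWindowLimiter and LimitResult are hypothetical names, and the token-budget and global-concurrency checks are not covered by this sketch.

```typescript
// In-memory stand-in for the Redis-backed per-user counter; all names are hypothetical.
type LimitResult = { ok: true } | { ok: false; retryAfter: number };

class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private maxPerWindow: number;
  private windowMs: number;

  constructor(maxPerWindow: number, windowMs = 60_000) {
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  check(userId: string, now = Date.now()): LimitResult {
    const cutoff = now - this.windowMs;
    // Keep only the requests that are still inside the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxPerWindow) {
      this.hits.set(userId, recent);
      // Seconds until the oldest request leaves the window.
      const retryAfter = Math.ceil((recent[0] + this.windowMs - now) / 1000);
      return { ok: false, retryAfter };
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return { ok: true };
  }
}
```

The result has the same ok / retryAfter shape the route handler reads from checkRateLimit, so a rejected check maps directly onto the 429 response and its Retry-After header before any stream is opened.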

When should you fall back to non-streaming?

When the client signals it via an Accept header (some browsers, some webview wrappers, some MCP clients), when the turn is part of a batch job (cron-driven agents do not need streaming), and when the tool-call chain is known to be long enough that streaming the text-only portion is misleading (e.g. an agent that always calls 3+ tools before any user-visible text). The non-streaming fallback uses the exact same dispatcher, just awaits the full response before writing it.

The principles this codifies live in the agentic engineering method. The dispatcher this leans on: the typed tool dispatcher. The testing posture that catches regressions in this code: real APIs by default. The system this comes from: the PickNDeal case study.