
Diffusion LLMs in Practice: Where Mercury 2 Fits Your Agent Stack

webhani

Most of the language models we deploy in production share one structural trait: they write left to right, one token at a time. Each token waits for the one before it. That ordering is the source of both the fluency we depend on and the latency we fight. Inception Labs' Mercury 2 takes a different route. Instead of predicting the next token in sequence, it drafts a whole passage at once and refines it over a small number of denoising passes, the same iterative cleanup idea that powers image diffusion models, applied to text. The published throughput figure is north of 1,000 tokens per second, which Inception Labs puts at roughly five times faster than the quickest speed-tuned autoregressive systems, without a reported loss in reasoning quality.

That number is interesting, but the throughput alone is not the story. The story is what it does to system design. Below is how we at webhani actually evaluate a model like this for client architectures, what we would hand the inner loop of an agent, and where we would still escalate to a frontier model.

How parallel refinement differs from next-token decoding

A standard transformer decoder is sequential by construction. To produce token N it needs tokens 1 through N-1 already settled. You can batch requests, use speculative decoding, or quantize the weights, but you cannot fully escape the dependency chain inside a single response. The wall-clock cost of one generation is fundamentally tied to its length.

A diffusion text model, or dLLM, breaks that chain. It starts from a noisy or masked draft of the full output and runs a fixed number of refinement steps that sharpen every position roughly in parallel. The cost is dominated by the step count, not the token count, which is why long outputs can land so much faster. Mercury 2 keeps the practical surface engineers expect from a modern model: tool calling, JSON-schema-constrained structured output, and multimodal extensions. According to Inception Labs' published pricing it sits around $0.25 per million input tokens and $0.75 per million output tokens, available through early access at the time of writing.
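To make the "step count, not token count" point concrete, here is a toy sketch. It is not Mercury 2's algorithm: the function names, the null-as-mask convention, and the even unmasking schedule are all assumptions made for illustration, and the stand-in "model" simply copies from a known target. The only thing it demonstrates is how the number of passes scales in each scheme.

```typescript
// Toy illustration only. The "model" already knows the target; real systems
// predict tokens. What matters here is how many passes each scheme needs.
interface DecodeTrace {
  draft: (string | null)[];
  passes: number;
}

// Sequential decoding: one pass per token, each waiting on the previous one.
function decodeSequentially(target: string[]): DecodeTrace {
  const draft: (string | null)[] = [];
  let passes = 0;
  for (const token of target) {
    passes += 1;
    draft.push(token);
  }
  return { draft, passes };
}

// Parallel refinement: start fully masked (null) and run a fixed number of
// passes; each pass looks at every position and commits a share of them.
function refineInParallel(target: string[], steps: number): DecodeTrace {
  const draft: (string | null)[] = target.map(() => null);
  let passes = 0;
  for (let step = 0; step < steps; step++) {
    passes += 1;
    const masked: number[] = [];
    draft.forEach((t, i) => {
      if (t === null) masked.push(i);
    });
    // Spread the remaining masked positions evenly over the remaining steps.
    const toCommit = Math.ceil(masked.length / (steps - step));
    for (const i of masked.slice(0, toCommit)) {
      draft[i] = target[i];
    }
  }
  return { draft, passes };
}

const tokens = "the quick brown fox jumps over the lazy dog near the river".split(" ");
console.log(decodeSequentially(tokens).passes);  // 12
console.log(refineInParallel(tokens, 4).passes); // 4
```

Doubling the length of `tokens` doubles the sequential pass count but leaves the refinement pass count at 4. Each refinement pass does more work than a single-token pass, so this is a picture of the dependency structure rather than a performance prediction.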

There are honest tradeoffs. The autoregressive ecosystem is mature, with years of tooling, prompt patterns, and fine-tuning recipes built around it. Diffusion text generation trades some of the determinism and step-by-step controllability of sequential decoding for its parallelism, and for the hardest reasoning tasks the frontier autoregressive models still hold an edge. None of that disqualifies a dLLM. It just tells you where to point it.

Why latency is an architectural concern, not a vanity metric

In a single chatbot turn, a few hundred milliseconds of extra latency is a UX detail. In an agentic system it compounds. A typical agent loop plans, calls a tool, observes the result, and repeats, and each pass through that loop contains at least one model generation. End-to-end latency is the sum across every step:

total_latency ≈ Σ (tokens_generated_at_step_i / throughput) + tool_and_network_overhead

When an agent takes eight or twelve steps to finish a task, per-step generation time is multiplied by the number of steps. This is exactly the regime where a 5x throughput improvement stops being a benchmark line and becomes the difference between a two-second response and a ten-second one. The same logic applies to real-time voice and other interactive UX, where the budget between user input and audible response is unforgiving.

So before reaching for any fast model, measure whether your workload is actually latency-bound. A quick back-of-envelope: suppose each agent step generates about 400 tokens and a task averages 10 steps. At 200 tokens/second that is roughly 20 seconds of pure generation. At 1,000 tokens/second it is about 4 seconds. If your tool calls and network overhead add another 3 seconds total either way, you go from 23 seconds to 7. If instead your steps are short, generating 30 tokens each, the generation portion is already small and the speedup barely moves your wall-clock time. Run that arithmetic for your real traffic before changing anything.
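The back-of-envelope arithmetic above can be wrapped in a small helper so you can plug in numbers from your own traces. This is a minimal sketch: `WorkloadProfile` and `estimateTaskLatency` are hypothetical names, and it assumes every step generates the same number of tokens and that overhead does not depend on the model.

```typescript
// Hypothetical helper; the names and fields are illustrative, not from any SDK.
interface WorkloadProfile {
  tokensPerStep: number;   // average tokens generated per agent step
  stepsPerTask: number;    // average agent steps per task
  overheadSeconds: number; // total tool + network overhead per task
}

// total_latency ≈ Σ (tokens_at_step_i / throughput) + overhead,
// simplified by assuming identical token counts at every step.
function estimateTaskLatency(profile: WorkloadProfile, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  const generationSeconds =
    (profile.tokensPerStep * profile.stepsPerTask) / tokensPerSecond;
  return generationSeconds + profile.overheadSeconds;
}

const longSteps: WorkloadProfile = { tokensPerStep: 400, stepsPerTask: 10, overheadSeconds: 3 };
const shortSteps: WorkloadProfile = { tokensPerStep: 30, stepsPerTask: 10, overheadSeconds: 3 };

console.log(estimateTaskLatency(longSteps, 200).toFixed(1));   // "23.0"
console.log(estimateTaskLatency(longSteps, 1000).toFixed(1));  // "7.0"
console.log(estimateTaskLatency(shortSteps, 200).toFixed(1));  // "4.5"
console.log(estimateTaskLatency(shortSteps, 1000).toFixed(1)); // "3.3"
```

The long-step workload drops from 23 seconds to 7, while the short-step workload only moves from 4.5 to 3.3 because overhead dominates. Feeding the helper measured averages, rather than these illustrative figures, is the point of the exercise.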

A two-tier router: fast model for the loop, frontier model for the hard call

The pattern we recommend most often is not "replace your model" but "tier your models." Use a fast dLLM for the high-frequency, lower-stakes work inside the loop, such as routing, classification, draft generation, and tool-argument assembly. Reserve a frontier autoregressive model for the steps that genuinely need depth, such as the final synthesis or a difficult multi-constraint decision. The router decides per turn which tier to use.

Here is a provider-agnostic sketch. The generate interface is intentionally generic so you can map it onto whatever SDK you actually run.

type Role = "system" | "user" | "assistant" | "tool";
 
interface Message {
  role: Role;
  content: string;
}
 
interface ToolSpec {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}
 
interface GenerateRequest {
  model: string;
  messages: Message[];
  tools?: ToolSpec[];
}
 
interface GenerateResult {
  text: string;
  usage: { inputTokens: number; outputTokens: number };
}
 
// Assume a provider-agnostic client exposing generate(req).
declare function generate(req: GenerateRequest): Promise<GenerateResult>;
 
const FAST_MODEL = "mercury-2";        // diffusion, optimized for throughput
const FRONTIER_MODEL = "frontier-ar";  // autoregressive, optimized for depth
 
interface Turn {
  messages: Message[];
  tools?: ToolSpec[];
  // Signals the orchestrator already knows about this turn.
  isFinalAnswer: boolean;
  estimatedComplexity: number; // 0..1, from a cheap heuristic or classifier
}
 
function pickModel(turn: Turn): string {
  // The final synthesis and clearly hard turns go to the frontier model.
  if (turn.isFinalAnswer || turn.estimatedComplexity > 0.7) {
    return FRONTIER_MODEL;
  }
  // Everything else (routing, classification, drafts) takes the fast path.
  return FAST_MODEL;
}
 
async function runTurn(turn: Turn): Promise<GenerateResult> {
  const model = pickModel(turn);
  const result = await generate({
    model,
    messages: turn.messages,
    tools: turn.tools,
  });
 
  // Optional escalation: if the fast model's draft fails a cheap quality
  // check, retry once on the frontier model before giving up.
  if (model === FAST_MODEL && !passesQualityGate(result.text)) {
    return generate({
      model: FRONTIER_MODEL,
      messages: turn.messages,
      tools: turn.tools,
    });
  }
 
  return result;
}
 
function passesQualityGate(text: string): boolean {
  // Domain-specific: schema validation, required fields present,
  // no obvious refusal or truncation. Keep it cheap and deterministic.
  return text.trim().length > 0;
}

The shape that matters is pickModel. Most production agents have a handful of obviously cheap turns and a few genuinely hard ones; routing on a simple signal captures most of the benefit. The escalation fallback in runTurn is the safety valve: if the fast model produces something that fails a deterministic gate, you spend one extra call on the frontier model rather than shipping a bad result. That keeps the common case fast and the rare hard case correct.
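For turns that are expected to return structured output, the deterministic gate can be a little stricter than a non-empty check. The sketch below is one possible shape, assuming the fast model was asked for a JSON object; `passesJsonGate` and `EscalationStats` are hypothetical names, not part of any SDK, and a real gate would likely validate against your full schema.

```typescript
// Hypothetical gate for turns that should return a JSON object.
// Rejects truncated or non-JSON text and objects missing required fields.
function passesJsonGate(text: string, requiredFields: string[]): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return false; // not valid JSON, often a sign of truncation
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false; // valid JSON, but not an object
  }
  const obj = parsed as Record<string, unknown>;
  return requiredFields.every((f) => obj[f] !== undefined && obj[f] !== null);
}

// Hypothetical counter for how often fast-tier output fails the gate.
class EscalationStats {
  private fastTurns = 0;
  private escalated = 0;

  record(passedGate: boolean): void {
    this.fastTurns += 1;
    if (!passedGate) this.escalated += 1;
  }

  get rate(): number {
    return this.fastTurns === 0 ? 0 : this.escalated / this.fastTurns;
  }
}

const stats = new EscalationStats();
stats.record(passesJsonGate('{"tool":"search","args":{"q":"x"}}', ["tool", "args"]));
stats.record(passesJsonGate('{"tool":"search","args":', ["tool", "args"]));
console.log(stats.rate); // 0.5
```

A rising escalation rate means the fast tier is being handed work it cannot finish, so you pay for two calls on those turns; that is usually a prompt to tighten the routing signal rather than loosen the gate.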

estimatedComplexity does not need to be sophisticated. Input length, presence of certain keywords, the depth of the current reasoning chain, or a tiny dedicated classifier all work. Start with a crude heuristic, measure the escalation rate, and tune from there.
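As a starting point, a crude heuristic along those lines might look like the following. Everything here is an assumption to tune against your own traffic: the `ComplexitySignals` shape, the keyword list, the 8,000-character and 10-step normalizers, and the weights are placeholders, not recommended values.

```typescript
// Hypothetical signals the orchestrator can compute cheaply per turn.
interface ComplexitySignals {
  inputChars: number;      // total characters across the turn's messages
  lastUserMessage: string; // text scanned for "hard task" keywords
  reasoningDepth: number;  // agent steps already taken in this task
}

// Placeholder keyword list; replace with terms from your own domain.
const HARD_KEYWORDS = ["prove", "trade-off", "reconcile", "optimize", "constraint"];

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Returns a score in 0..1, suitable for Turn.estimatedComplexity.
function estimateComplexity(s: ComplexitySignals): number {
  const lengthScore = clamp01(s.inputChars / 8000);
  const text = s.lastUserMessage.toLowerCase();
  const keywordHits = HARD_KEYWORDS.filter((k) => text.includes(k)).length;
  const keywordScore = clamp01(keywordHits / 2);
  const depthScore = clamp01(s.reasoningDepth / 10);
  return clamp01(0.4 * lengthScore + 0.4 * keywordScore + 0.2 * depthScore);
}

const easyTurn = estimateComplexity({
  inputChars: 200,
  lastUserMessage: "Classify this ticket as billing or technical.",
  reasoningDepth: 1,
}); // well below the 0.7 threshold, stays on the fast tier

const hardTurn = estimateComplexity({
  inputChars: 9000,
  lastUserMessage: "Reconcile these constraints and optimize the rollout plan.",
  reasoningDepth: 6,
}); // above the 0.7 threshold, goes to the frontier tier

console.log(easyTurn < 0.7, hardTurn > 0.7); // true true
```

Keeping the heuristic a pure function of cheap signals makes it easy to replay against logged turns and see how a change in weights would have shifted the escalation rate.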

Where we would not use a dLLM yet

Speed only pays off in the right context. If your task is a single short request where correctness on edge cases dominates and latency is already acceptable, the tiering adds complexity you do not need. If you depend on tooling, evaluation harnesses, or fine-tuning pipelines built tightly around a specific autoregressive provider, the migration cost may outweigh the throughput gain. And for the hardest frontier reasoning, the kind where a wrong answer is expensive and the problem resists decomposition, we still default to a top-tier autoregressive model. The point of tiering is precisely that you do not have to choose globally; you choose per turn.

Takeaways

  • Diffusion LLMs like Mercury 2 attack the one cost autoregressive models cannot escape: the sequential dependency inside a single response. The payoff shows up in multi-step agents and interactive UX, where latency is the sum of many generations.
  • Measure before you migrate. Multiply tokens-per-step by steps-per-task. If that product is small, a faster model barely moves your wall-clock time.
  • Tier, do not replace. Route cheap, high-frequency turns to a fast dLLM and escalate hard or final turns to a frontier model, with a deterministic quality gate as the fallback.
  • Keep the tradeoffs honest. Ecosystem maturity, controllability, and top-end reasoning still favor frontier autoregressive models for the hardest work. A dLLM earns its place in the loop, not necessarily at the summit.