Cutting Agent Costs with Claude Sonnet 5: When to Use It Over Opus

Anthropic released Claude Sonnet 5 on June 30, 2026, and the headline isn't raw capability — it's economics. Work that recently demanded a top-tier model now runs at Sonnet pricing. That shift matters less for one-off hard problems and more for workloads where request volume stacks up. This post covers where Sonnet 5 fits, how to split work with Opus, and how to keep agent costs predictable.

What actually changed

Per Anthropic's announcement, Sonnet 5 is built for agentic use: planning, using tools like browsers and terminals, and running multi-step tasks autonomously with better reliability than the previous Sonnet 4.6. On Anthropic's agentic coding benchmark, Sonnet 5 scores 63.2% against Opus 4.8 at 69.2% and Sonnet 4.6 at 58.1% — closing much of the gap while staying in the Sonnet price tier.

Introductory pricing runs at $2 per million input tokens and $10 per million output tokens through August 31, 2026, moving to $3 / $15 afterward. Compared to running the same workload on Opus, the difference is often an order of magnitude.
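To see what those rates mean for a concrete workload, the arithmetic can be sketched as below. The prices are the ones quoted above; the Workload shape and the example volumes are made-up assumptions, so substitute figures from your own traces.

```typescript
// Prices in USD per million tokens, taken from the figures quoted above.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

const SONNET_5_INTRO: Pricing = { inputPerMTok: 2, outputPerMTok: 10 };
const SONNET_5_STANDARD: Pricing = { inputPerMTok: 3, outputPerMTok: 15 };

// Hypothetical workload shape; replace with numbers from your own traces.
interface Workload {
  requestsPerMonth: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
}

function monthlyCostUsd(w: Workload, p: Pricing): number {
  const inputTokens = w.requestsPerMonth * w.inputTokensPerRequest;
  const outputTokens = w.requestsPerMonth * w.outputTokensPerRequest;
  return (inputTokens / 1_000_000) * p.inputPerMTok + (outputTokens / 1_000_000) * p.outputPerMTok;
}

// Illustrative volumes only: 100k requests, 4k input and 500 output tokens each.
const example: Workload = {
  requestsPerMonth: 100_000,
  inputTokensPerRequest: 4_000,
  outputTokensPerRequest: 500,
};

console.log(monthlyCostUsd(example, SONNET_5_INTRO));    // 1300
console.log(monthlyCostUsd(example, SONNET_5_STANDARD)); // 1950
```

For this assumed workload the bill moves from $1,300 to $1,950 per month when introductory pricing ends, which is worth budgeting for before the switch.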

The point worth internalizing isn't "it got smarter." It's that keeping an agent running continuously is now financially reasonable. The payoff shows up in internal automation, first-line support, and any workflow where invocation counts accumulate — not in isolated, high-difficulty tasks.

Splitting work between Opus and Sonnet 5

Defaulting to the top model for everything is rarely the right call in production. Cost and latency both become operational burdens. We recommend routing along these lines:

Default to Sonnet 5 for high-volume work: CRUD operations, routine code generation, log triage, document summarization, first-line support replies.
Escalate to Opus for design judgment, complex root-cause analysis, and one-shot tasks where the cost of being wrong is high.

In practice, a two-tier setup works well: handle the request with Sonnet 5 first, and escalate to Opus only when confidence drops below a threshold or a specific condition is met.

function pickModel(task: { tokensEstimate: number; needsDesignJudgment: boolean }): string {
  // Send design-judgment tasks and very large requests (over 40,000 estimated tokens) to Opus.
  if (task.needsDesignJudgment || task.tokensEstimate > 40_000) {
    return "claude-opus-4-8";
  }
  return "claude-sonnet-5";
}
 
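// callClaude is a placeholder for your own API wrapper. It is assumed to resolve to an
// object that carries a selfConfidence score between 0 and 1.
declare function callClaude(model: string, prompt: string): Promise<{ text: string; selfConfidence: number }>;

async function runWithEscalation(prompt: string, task: Parameters<typeof pickModel>[0]) {
  // Pick the model once so the escalation check compares against the model actually used.
  const model = pickModel(task);
  const first = await callClaude(model, prompt);
  // Retry on the stronger model when self-reported confidence is low.
  if (first.selfConfidence < 0.6 && model !== "claude-opus-4-8") {
    return callClaude("claude-opus-4-8", prompt);
  }
  return first;
}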

This is illustrative pseudocode. How you obtain selfConfidence depends on the task — returning a confidence score via structured output, or adding a separate verification step, are both workable approaches.
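As one possible take on the structured-output route, the sketch below parses a reply that is expected to be JSON with an answer and a confidence field. That reply contract, the field names, and the parseScoredAnswer helper are all assumptions for illustration, not part of any SDK. Anything unreadable is scored 0 so the caller escalates rather than trusting it.

```typescript
interface ScoredAnswer {
  answer: string;
  selfConfidence: number;
}

// Hypothetical reply contract: {"answer": string, "confidence": number in [0, 1]}.
function parseScoredAnswer(raw: string): ScoredAnswer {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const { answer, confidence } = parsed as Record<string, unknown>;
      if (typeof answer === "string" && typeof confidence === "number" && Number.isFinite(confidence)) {
        // Clamp so a stray value like 1.2 or -3 stays inside the expected range.
        return { answer, selfConfidence: Math.min(1, Math.max(0, confidence)) };
      }
    }
  } catch {
    // Fall through: malformed JSON is handled like a missing score.
  }
  // Confidence 0 makes the caller escalate instead of trusting an unreadable reply.
  return { answer: raw, selfConfidence: 0 };
}
```

A self-reported score is only a signal, so the 0.6 threshold is worth calibrating against tasks where you already know the right answer.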

Trimming cost with prompt caching and batching

Sonnet 5 supports up to 90% savings with prompt caching and 50% with batch processing. For anything agentic, whether you use these two levers is the difference between a manageable bill and a surprising one.

Prompt caching stores the long, unchanging preamble — the system prompt, tool definitions, reference docs — so you don't pay full price for it on every call. Agents resend the same system prompt constantly, which makes cache hit rates naturally high.

// Mark the long, fixed portion of the prompt as cacheable.
const system = [
  {
    type: "text",
    text: LONG_SYSTEM_PROMPT_WITH_TOOL_SPECS, // thousands of stable tokens
    cache_control: { type: "ephemeral" },
  },
];
 
// On subsequent calls that arrive before the cache entry expires, this fixed block is read from cache.

Batch processing suits workloads that don't need an immediate response: overnight log classification, bulk document summarization, mass test-case generation. Anything that can return "within a few hours" belongs in a batch, at roughly half the real-time API cost.

A clean sequence: first decide what genuinely needs to be real-time, push everything else to batch, then use caching to cut the cost of the fixed prefix on the remaining real-time calls.
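That sequence can be turned into a rough per-call estimate. The sketch below is a back-of-the-envelope model, not a billing formula. It assumes the standard $3 / $15 rates, reads "up to 90%" as the cached prefix costing 10% of the input rate in the best case, assumes every real-time call is a cache hit, ignores any cost of writing to the cache, and applies caching only to real-time calls. The Call shape and the function name are hypothetical.

```typescript
// USD per million tokens; standard Sonnet 5 rates quoted earlier in this post.
const INPUT_PER_MTOK = 3;
const OUTPUT_PER_MTOK = 15;

// Assumed discount factors, derived from the "up to 90%" and "50%" figures above.
const CACHED_INPUT_FACTOR = 0.1; // best case: cached prefix billed at 10% of the input rate
const BATCH_FACTOR = 0.5;        // batch billed at half the real-time rate

interface Call {
  needsRealtime: boolean;
  prefixTokens: number;  // stable system prompt and tool definitions
  dynamicTokens: number; // per-request input
  outputTokens: number;
}

function estimateCallCostUsd(c: Call): number {
  const outputCost = (c.outputTokens / 1e6) * OUTPUT_PER_MTOK;
  if (!c.needsRealtime) {
    // Step 1 and 2: anything that can wait goes to batch at half price.
    const inputCost = ((c.prefixTokens + c.dynamicTokens) / 1e6) * INPUT_PER_MTOK;
    return (inputCost + outputCost) * BATCH_FACTOR;
  }
  // Step 3: real-time calls pay the reduced rate on the cached prefix only.
  const prefixCost = (c.prefixTokens / 1e6) * INPUT_PER_MTOK * CACHED_INPUT_FACTOR;
  const dynamicCost = (c.dynamicTokens / 1e6) * INPUT_PER_MTOK;
  return prefixCost + dynamicCost + outputCost;
}
```

Under these assumptions, a call with an 8,000-token prefix, 1,000 dynamic input tokens, and 500 output tokens costs about $0.0345 with no discounts, about $0.0173 in batch, and about $0.0129 in real time with a cache hit.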

Practical notes on agent design

A cheaper model doesn't rescue sloppy design. The things we watch on real projects:

Cap the step count. Autonomy is useful, but runaway loops and excessive tool calls translate directly into cost. Set an explicit maxSteps and fall back when it's exceeded.
Keep tools single-purpose. Overloaded tools invite wrong arguments and retries. Splitting into small, self-explanatory tools usually reduces token spend overall.
Structure the output. Free-form text needs downstream parsing and re-runs on failure. Constraining output with a JSON Schema from the start is more stable.
Instrument everything. Without per-step token accounting, you can't see where to optimize. Treat tracing as a prerequisite for running agents, not an afterthought.
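The step cap and per-step accounting from the notes above can be sketched as a small loop. The runStep callback, StepResult, and RunReport are hypothetical names for this illustration. In a real agent, runStep would wrap one model call plus any tool execution and report the token counts from the API response.

```typescript
interface StepResult {
  done: boolean;
  inputTokens: number;
  outputTokens: number;
}

interface RunReport {
  status: "completed" | "step_limit_reached";
  steps: StepResult[];
  totalInputTokens: number;
  totalOutputTokens: number;
}

// runStep is a hypothetical callback that performs one model call plus any tool use.
async function runAgent(
  runStep: (stepIndex: number) => Promise<StepResult>,
  maxSteps: number,
): Promise<RunReport> {
  const steps: StepResult[] = [];
  let status: RunReport["status"] = "step_limit_reached";
  for (let i = 0; i < maxSteps; i++) {
    const result = await runStep(i);
    steps.push(result); // per-step record, so traces show where tokens go
    if (result.done) {
      status = "completed";
      break;
    }
  }
  return {
    status,
    steps,
    totalInputTokens: steps.reduce((sum, s) => sum + s.inputTokens, 0),
    totalOutputTokens: steps.reduce((sum, s) => sum + s.outputTokens, 0),
  };
}
```

Returning a status instead of throwing leaves the fallback decision to the caller: on "step_limit_reached" it can hand the task to a human, retry with a narrower prompt, or escalate to a stronger model.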

Takeaways

The significance of Claude Sonnet 5 is that it drops agents — previously viable only on a top-tier model — into a price range you can put into everyday operation. In practice, make Sonnet 5 the default for high-volume work and escalate to Opus only for design judgment and high-stakes tasks. Layer prompt caching and batching on top, and agent operating costs stay within a range you can actually manage.

At webhani, we help teams design, build, and cost-optimize LLM-based automation. If you want to adopt agents but can't yet predict the cost, that's exactly the kind of conversation we're glad to have.