If you're spending more than AED 30,000 per month on LLM APIs and you don't have a FinOps practice yet, you're almost certainly leaving 40–70% on the table. This isn't speculation — we've audited enough production stacks across our Fractional AI Team and Agentic Pilot engagements to triangulate the median.
This post is the LLM FinOps playbook we use internally and ship to clients. It assumes basic familiarity with LLM APIs and pricing — we're going straight to the patterns that compound.
The five sources of waste
Every audit surfaces the same five patterns in roughly the same proportions:
| Waste source | Typical impact | Difficulty to fix |
|---|---|---|
| Wrong-tier model usage | 30–40% | Medium — requires routing logic |
| Missing prompt caching | 15–25% | Low — config change |
| Naive retry loops | 10–20% | Low — exponential backoff |
| No semantic caching | 5–15% | Medium — embedding + cache layer |
| Verbose outputs | 5–10% | Low — set maxTokens |
The bottom three are quick wins — a single sprint of focused work can capture them. The top two are architectural and benefit from being designed in from the start.
Pattern 1: Model-tier routing
The single biggest waste pattern. Teams pick a model at the start of a project ("we'll use Claude Sonnet") and route every request through it, regardless of whether the task is "summarise this paragraph" or "write a complex multi-step analysis."
Three tiers, decided per-request:
Tier 1 (cheap). Classification, structured extraction, format conversion, simple Q&A on retrieved context. Models: Haiku 4.5, Gemini Flash, GPT-4.1-nano, or open-weight Llama 3.3 70B routed through self-hosted inference. Cost: ~10–30× cheaper than tier 3.
Tier 2 (balanced). Drafting, summarisation, retrieval-augmented response, normal reasoning, multi-turn conversation. Models: Sonnet 4.6, GPT-4.1, Gemini 2.5 Pro. Cost: ~3–5× cheaper than tier 3.
Tier 3 (frontier). Complex multi-step reasoning, ambiguous decisions, high-stakes outputs, code generation requiring large context. Models: Opus 4.7, GPT-5, Gemini Ultra. Cost: the baseline the other tiers are measured against.
The routing logic lives at the application layer. Common implementations:
- By task type. A configuration table mapping task names to model tiers. "intent_classification" → tier 1, "draft_email" → tier 2, "regulatory_analysis" → tier 3.
- By confidence cascade. Tier 1 first; if the result has low confidence (judged by a classifier), retry with tier 2. Saves cost on the easy cases.
- By output structure. If the expected output is a constrained schema, tier 1 or 2 almost always suffices. If the output is open-ended reasoning, tier 3 may be warranted.
Done well, 60–70% of requests stay in tier 1, 25–30% in tier 2, and only 5–10% reach tier 3. The blended cost per request drops by 50–70%.
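A minimal sketch of the task-type mapping in TypeScript. The task names, tier labels, and gateway-style model identifiers below are illustrative assumptions, not a prescribed taxonomy:

```typescript
// Illustrative task-to-tier routing table; adjust task names and models to your stack.
type Tier = "tier1" | "tier2" | "tier3";

const TASK_TIER: Record<string, Tier> = {
  intent_classification: "tier1",
  structured_extraction: "tier1",
  draft_email: "tier2",
  summarise_document: "tier2",
  regulatory_analysis: "tier3",
};

// Gateway-style model slugs (assumed names; use whatever your gateway exposes).
const TIER_MODEL: Record<Tier, string> = {
  tier1: "anthropic/claude-haiku-4.5",
  tier2: "anthropic/claude-sonnet-4.6",
  tier3: "anthropic/claude-opus-4.7",
};

// Unknown tasks fall back to the balanced tier rather than the frontier tier,
// so a new code path degrades to "slightly worse answers", not "silent overspend".
export function modelForTask(task: string): string {
  return TIER_MODEL[TASK_TIER[task] ?? "tier2"];
}
```

The confidence-cascade variant wraps this with a cheap judge: call the tier-1 model first, score the result, and only escalate the low-confidence minority to tier 2.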
Pattern 2: Prompt caching
Anthropic's prompt caching, OpenAI's automatic prompt caching, and other provider equivalents cut the input cost of repeated context sharply; Anthropic, for example, bills cache reads at roughly a tenth of the base input rate. For RAG systems and agents this is enormous, because most input tokens are repeated context: the same RAG chunks, the same system prompt, the same tool descriptions.
The implementation is straightforward — pass cache_control markers on the parts of the prompt that should be cached — but it requires architecting prompts so the stable parts come first. The cacheable region has to be a prefix; you can't have a stable system prompt, then user-specific content, then more cacheable tool definitions.
The standard ordering that works:
1. System prompt (stable)
2. Tool definitions (stable)
3. RAG retrieved context (stable per-conversation)
4. Conversation history (grows over the conversation)
5. Latest user message (variable)
Markers go on (1)–(3); (4)–(5) are uncached. This pattern alone typically captures 50–65% input cost reduction for agent workloads.
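A sketch of that ordering using Anthropic's TypeScript SDK. The model id, example tool, and placeholder content are assumptions for illustration; check marker placement and limits against the provider's current caching documentation:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Placeholder content that a real prompt-assembly layer would provide.
const systemPrompt = "You are a support assistant for ...";
const retrievedContext = "<RAG chunks retrieved for this conversation>";
const history: { role: "user" | "assistant"; content: string }[] = [];
const latestUserMessage = "How do I reset my password?";

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder id; use your tier-2 model
  max_tokens: 1024,
  // (1) Stable system prompt: part of the cacheable prefix.
  system: [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ],
  // (2) Stable tool definitions: also cached.
  tools: [
    {
      name: "search_documents", // example tool, not a real integration
      description: "Search the knowledge base for relevant passages.",
      input_schema: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    // (3) Retrieved context: stable for the life of the conversation.
    {
      role: "user",
      content: [
        { type: "text", text: retrievedContext, cache_control: { type: "ephemeral" } },
      ],
    },
    // (4)–(5) Conversation history and the latest user message: left uncached.
    ...history,
    { role: "user", content: latestUserMessage },
  ],
});

// Comparing cache_read_input_tokens to cache_creation_input_tokens in usage shows the hit rate.
console.log(response.usage);
```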
Pattern 3: Naive retry loops
Production code that retries failed LLM calls with a fixed delay (or worse, with no delay at all) burns through the rate limit and amplifies cost spikes during provider degradation events.
The fix is exponential backoff with jitter. Off-the-shelf implementations: tenacity (Python), p-retry (Node), or the retry policies built into mature gateway clients. Configure max retries (typically 3–5), base delay (1–2 seconds), and jitter (full jitter is the safe default).
Additionally: distinguish retryable from non-retryable errors. A 4xx response means your code is wrong; retrying changes nothing. A 5xx or 429 means provider-side issue; retry. A timeout means network issue; retry with caution. The library defaults are usually wrong on this distinction — review them.
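A sketch of the Node side with p-retry, assuming a hypothetical gateway endpoint; the status-code split mirrors the distinction above:

```typescript
import pRetry, { AbortError } from "p-retry";

// Stand-in for whatever function actually calls your provider or gateway.
// The URL is a placeholder.
async function callModel(payload: unknown): Promise<unknown> {
  const res = await fetch("https://gateway.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });

  // 429 and 5xx are provider-side: worth retrying.
  if (res.status === 429 || res.status >= 500) {
    throw new Error(`retryable: HTTP ${res.status}`);
  }
  // Any other 4xx means the request itself is wrong; retrying changes nothing,
  // so abort instead of burning retries (and rate limit) on it.
  if (!res.ok) {
    throw new AbortError(`non-retryable: HTTP ${res.status}`);
  }
  return res.json();
}

export function callModelWithBackoff(payload: unknown) {
  return pRetry(() => callModel(payload), {
    retries: 4,         // 5 attempts total
    minTimeout: 1_000,  // ~1 s base delay
    factor: 2,          // exponential growth between attempts
    randomize: true,    // add jitter so clients don't retry in lockstep
  });
}
```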
Pattern 4: Semantic caching
For workloads with repeated or near-repeated queries, semantic caching — caching responses by query embedding similarity — can capture 5–25% of total cost depending on traffic patterns.
The pattern: embed the user query, search a vector cache for previous queries within a similarity threshold, and return the cached response on a hit. The cache lives in Redis or any low-latency vector store; embeddings can be generated with cheap open-weight models (e5-mistral, BGE) or Cohere's cheaper tiers.
This works best for high-volume, low-personalisation workloads — customer support FAQ-style queries, content classification, structured extraction. It works less well for personalised conversations or workloads where each query depends on user-specific context.
The infrastructure cost of running a semantic cache (the embedding model + vector store) is typically 5–10% of the LLM cost it saves. Net positive for most workloads above ~1M queries per month.
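A minimal sketch of the lookup path, with the embedding call and LLM call injected as functions and an in-memory array standing in for the real vector store; the similarity threshold is an assumption to tune against your own traffic:

```typescript
type CacheEntry = { vector: number[]; response: string };

const cache: CacheEntry[] = [];      // stand-in for Redis or another vector store
const SIMILARITY_THRESHOLD = 0.92;   // illustrative; tune on real queries

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function answerWithSemanticCache(
  query: string,
  embed: (text: string) => Promise<number[]>,   // e.g. a cheap open-weight embedder
  callLLM: (query: string) => Promise<string>,  // the expensive call being avoided
): Promise<string> {
  const vector = await embed(query);

  // Hit: a previous query is close enough to reuse its response verbatim.
  const hit = cache.find((entry) => cosine(entry.vector, vector) >= SIMILARITY_THRESHOLD);
  if (hit) return hit.response;

  // Miss: pay for the LLM call once, then store it for future near-duplicates.
  const response = await callLLM(query);
  cache.push({ vector, response });
  return response;
}
```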
Pattern 5: Output controls
Every LLM call should set maxTokens (or equivalent) to a sensible ceiling for the expected response. Without it, the model can produce a 4,000-token response where 200 tokens would have served. Output tokens are typically priced at 3–5× input tokens, so over-long outputs are pure waste.
Set the ceiling per-task:
- Classification / structured extraction: 100–300 tokens
- Drafting short content: 500–1,500 tokens
- Drafting long content: 2,000–4,000 tokens
- Open-ended reasoning: 2,000–8,000 tokens
If the model truncates against the ceiling, that surfaces in evals — a good thing. Better to truncate visibly than to silently overspend.
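In code this is usually just a per-task ceiling sitting next to the routing table; the task names and numbers below are illustrative:

```typescript
// Per-task output ceilings, mirroring the ranges above. Names are examples.
const MAX_OUTPUT_TOKENS: Record<string, number> = {
  intent_classification: 200,
  structured_extraction: 300,
  draft_email: 1_000,
  draft_long_report: 4_000,
  open_ended_analysis: 8_000,
};

// Default low rather than high: an unknown task should truncate visibly in evals
// instead of silently producing (and billing for) a 4,000-token answer.
export function maxTokensForTask(task: string): number {
  return MAX_OUTPUT_TOKENS[task] ?? 500;
}
```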
The gateway pattern
All of the above is materially easier with an AI Gateway in front of your provider calls. Vercel AI Gateway, OpenRouter, Portkey, or your own — the gateway provides:
- Cost telemetry per request, per model, per route. You can't optimise what you don't measure.
- Provider failover. When Anthropic has an incident, the gateway routes to OpenAI or Google with no application changes.
- Daily and monthly spend caps. Hard ceilings that throttle gracefully rather than letting a runaway loop burn AED 50,000 over a weekend.
- Rate limit handling. Centralised retry and backoff policies, applied consistently across providers.
- Model abstraction. Your code calls "anthropic/claude-sonnet-4.6" as a string; switching providers is a config change, not a refactor.
We default to Vercel AI Gateway on every engagement we ship, including Hisabi.ai in production. The marginal latency (typically 20–80ms in the active region) is imperceptible for human-facing workloads and acceptable for most agent workloads.
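What the model-abstraction point looks like in practice, assuming a gateway that exposes an OpenAI-compatible endpoint (OpenRouter does, for example; check your gateway's docs). The environment variable names are placeholders:

```typescript
import OpenAI from "openai";

// The gateway speaks the OpenAI-compatible API, so provider choice lives in
// config. GATEWAY_BASE_URL / GATEWAY_API_KEY are placeholder variable names.
const client = new OpenAI({
  baseURL: process.env.GATEWAY_BASE_URL,
  apiKey: process.env.GATEWAY_API_KEY,
});

// The model is just a routing string the gateway understands; failover and
// spend caps are enforced gateway-side, not in application code.
const completion = await client.chat.completions.create({
  model: process.env.DEFAULT_MODEL ?? "anthropic/claude-sonnet-4.6",
  max_tokens: 300,
  messages: [{ role: "user", content: "Summarise this paragraph: ..." }],
});

console.log(completion.choices[0].message.content);
```

Switching that request to another provider is a change to DEFAULT_MODEL, not a refactor.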
What to do this week
If you have an LLM workload in production:
- Pull last month's bill broken down by model. Most providers expose this in their dashboards or via API. Most teams have never looked.
- Identify the top 3 cost lines. Usually one workload accounts for 60–80% of spend.
- Audit each top-3 line against the five patterns above. Which apply? Which are quick wins?
- Pick one pattern to ship in two weeks. Prompt caching is usually the highest leverage and lowest difficulty.
- Set up a gateway if you don't have one. This is a one-week effort that pays back for years.
Where Codenovai fits
Every Agentic Pilot we ship runs through Vercel AI Gateway with all five patterns implemented from day one. For existing workloads we audit and optimise as part of Fractional AI Team retainers — the typical first-quarter outcome is 40–60% cost reduction without behaviour change.
Book a scoping call — we usually have a back-of-envelope estimate of your savings within 30 minutes of the first call.
