If you're spending more than AED 30,000 per month on LLM APIs and you don't have a FinOps practice yet, you're almost certainly leaving 40–70% on the table. This isn't speculation — we've audited enough production stacks across our Fractional AI Team and Agentic Pilot engagements to triangulate the median.
This post is the LLM FinOps playbook we use internally and ship to clients. It assumes basic familiarity with LLM APIs and pricing — we're going straight to the patterns that compound.
The five sources of waste
Every audit surfaces the same five patterns in roughly the same proportions:
| Waste source | Typical impact | Difficulty to fix |
|---|---|---|
| Wrong-tier model usage | 30–40% | Medium — requires routing logic |
| Missing prompt caching | 15–25% | Low — config change |
| Naive retry loops | 10–20% | Low — exponential backoff |
| No semantic caching | 5–15% | Medium — embedding + cache layer |
| Verbose outputs | 5–10% | Low — set maxTokens |
The bottom three are quick wins — a single sprint of focused work can capture them. The top two are architectural and benefit from being designed in from the start.
Pattern 1: Model-tier routing
The single biggest waste pattern. Teams pick a model at the start of a project ("we'll use Claude Sonnet") and route every request through it, regardless of whether the task is "summarise this paragraph" or "write a complex multi-step analysis."
Three tiers, decided per-request:
Tier 1 (cheap). Classification, structured extraction, format conversion, simple Q&A on retrieved context. Models: Haiku 4.5, Gemini Flash, GPT-4.1-nano, or open-weight Llama 3.3 70B routed through self-hosted inference. Cost: ~10–30× cheaper than tier 3.
Tier 2 (balanced). Drafting, summarisation, retrieval-augmented response, normal reasoning, multi-turn conversation. Models: Sonnet 4.6, GPT-4.1, Gemini 2.5 Pro. Cost: ~3–5× cheaper than tier 3.
Tier 3 (frontier). Complex multi-step reasoning, ambiguous decisions, high-stakes outputs, code generation requiring large context. Models: Opus 4.7, GPT-5, Gemini Ultra. Cost: the baseline the other tiers are measured against.
The routing logic lives at the application layer. Common implementations:
- By task type. A configuration table mapping task names to model tiers. "intent_classification" → tier 1, "draft_email" → tier 2, "regulatory_analysis" → tier 3.
- By confidence cascade. Tier 1 first; if the result has low confidence (judged by a classifier), retry with tier 2. Saves cost on the easy cases.
- By output structure. If the expected output is a constrained schema, tier 1 or 2 almost always suffices. If the output is open-ended reasoning, tier 3 may be warranted.
Done well, 60–70% of requests stay in tier 1, 25–30% in tier 2, and only 5–10% reach tier 3. The blended cost per request drops by 50–70%.
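A minimal sketch of the task-type mapping in TypeScript. The task names, tier labels, and gateway-style model identifiers below are illustrative assumptions, not a prescribed taxonomy:

```typescript
// Illustrative task-to-tier routing table; adjust task names and models to your stack.
type Tier = "tier1" | "tier2" | "tier3";

const TASK_TIER: Record<string, Tier> = {
  intent_classification: "tier1",
  structured_extraction: "tier1",
  draft_email: "tier2",
  summarise_document: "tier2",
  regulatory_analysis: "tier3",
};

// Gateway-style model slugs (assumed names; use whatever your gateway exposes).
const TIER_MODEL: Record<Tier, string> = {
  tier1: "anthropic/claude-haiku-4.5",
  tier2: "anthropic/claude-sonnet-4.6",
  tier3: "anthropic/claude-opus-4.7",
};

// Unknown tasks fall back to the balanced tier rather than the frontier tier,
// so a new code path degrades to "slightly worse answers", not "silent overspend".
export function modelForTask(task: string): string {
  return TIER_MODEL[TASK_TIER[task] ?? "tier2"];
}
```

The confidence-cascade variant wraps this with a cheap judge: call the tier-1 model first, score the result, and only escalate the low-confidence minority to tier 2.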
Pattern 2: Prompt caching
Anthropic's prompt caching, OpenAI's automatic prompt caching, and other provider equivalents cut the input cost of repeated context sharply; Anthropic, for example, bills cache reads at roughly a tenth of the base input rate. For RAG systems and agents this is enormous, because most input tokens are repeated context: the same RAG chunks, the same system prompt, the same tool descriptions.
The implementation is straightforward — pass cache_control markers on the parts of the prompt that should be cached — but it requires architecting prompts so the stable parts come first. The cacheable region has to be a prefix; you can't have a stable system prompt, then user-specific content, then more cacheable tool definitions.
The standard ordering that works:
1. System prompt (stable)
2. Tool definitions (stable)
3. RAG retrieved context (stable per-conversation)
4. Conversation history (grows over the conversation)
5. Latest user message (variable)
Markers go on (1)–(3); (4)–(5) are uncached. This pattern alone typically captures 50–65% input cost reduction for agent workloads.
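A sketch of that ordering using Anthropic's TypeScript SDK. The model id, example tool, and placeholder content are assumptions for illustration; check marker placement and limits against the provider's current caching documentation:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Placeholder content that a real prompt-assembly layer would provide.
const systemPrompt = "You are a support assistant for ...";
const retrievedContext = "<RAG chunks retrieved for this conversation>";
const history: { role: "user" | "assistant"; content: string }[] = [];
const latestUserMessage = "How do I reset my password?";

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder id; use your tier-2 model
  max_tokens: 1024,
  // (1) Stable system prompt: part of the cacheable prefix.
  system: [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ],
  // (2) Stable tool definitions: also cached.
  tools: [
    {
      name: "search_documents", // example tool, not a real integration
      description: "Search the knowledge base for relevant passages.",
      input_schema: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    // (3) Retrieved context: stable for the life of the conversation.
    {
      role: "user",
      content: [
        { type: "text", text: retrievedContext, cache_control: { type: "ephemeral" } },
      ],
    },
    // (4)–(5) Conversation history and the latest user message: left uncached.
    ...history,
    { role: "user", content: latestUserMessage },
  ],
});

// Comparing cache_read_input_tokens to cache_creation_input_tokens in usage shows the hit rate.
console.log(response.usage);
```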
Pattern 3: Naive retry loops
Production code that retries failed LLM calls with a fixed delay (or worse, with no delay at all) burns through the rate limit and amplifies cost spikes during provider degradation events.
The fix is exponential backoff with jitter. Off-the-shelf implementations: tenacity (Python), p-retry (Node), or the retry policies built into mature gateway clients. Configure max retries (typically 3–5), base delay (1–2 seconds), and jitter (full jitter is the safe default).
Additionally: distinguish retryable from non-retryable errors. A 4xx response means your code is wrong; retrying changes nothing. A 5xx or 429 means provider-side issue; retry. A timeout means network issue; retry with caution. The library defaults are usually wrong on this distinction — review them.
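A sketch of the Node side with p-retry, assuming a hypothetical gateway endpoint; the status-code split mirrors the distinction above:

```typescript
import pRetry, { AbortError } from "p-retry";

// Stand-in for whatever function actually calls your provider or gateway.
// The URL is a placeholder.
async function callModel(payload: unknown): Promise<unknown> {
  const res = await fetch("https://gateway.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });

  // 429 and 5xx are provider-side: worth retrying.
  if (res.status === 429 || res.status >= 500) {
    throw new Error(`retryable: HTTP ${res.status}`);
  }
  // Any other 4xx means the request itself is wrong; retrying changes nothing,
  // so abort instead of burning retries (and rate limit) on it.
  if (!res.ok) {
    throw new AbortError(`non-retryable: HTTP ${res.status}`);
  }
  return res.json();
}

export function callModelWithBackoff(payload: unknown) {
  return pRetry(() => callModel(payload), {
    retries: 4,         // 5 attempts total
    minTimeout: 1_000,  // ~1 s base delay
    factor: 2,          // exponential growth between attempts
    randomize: true,    // add jitter so clients don't retry in lockstep
  });
}
```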
Pattern 4: Semantic caching
For workloads with repeated or near-repeated queries, semantic caching — caching responses by query embedding similarity — can capture 5–25% of total cost depending on traffic patterns.
The pattern: embed the user query, search a vector cache for previous queries within a similarity threshold, and return the cached response on a hit. The cache lives in Redis or any low-latency vector store; embeddings can be generated with cheap open-weight models (e5-mistral, BGE) or Cohere's cheaper tiers.
This works best for high-volume, low-personalisation workloads — customer support FAQ-style queries, content classification, structured extraction. It works less well for personalised conversations or workloads where each query depends on user-specific context.
The infrastructure cost of running a semantic cache (the embedding model + vector store) is typically 5–10% of the LLM cost it saves. Net positive for most workloads above ~1M queries per month.
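A minimal sketch of the lookup path, with the embedding call and LLM call injected as functions and an in-memory array standing in for the real vector store; the similarity threshold is an assumption to tune against your own traffic:

```typescript
type CacheEntry = { vector: number[]; response: string };

const cache: CacheEntry[] = [];      // stand-in for Redis or another vector store
const SIMILARITY_THRESHOLD = 0.92;   // illustrative; tune on real queries

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function answerWithSemanticCache(
  query: string,
  embed: (text: string) => Promise<number[]>,   // e.g. a cheap open-weight embedder
  callLLM: (query: string) => Promise<string>,  // the expensive call being avoided
): Promise<string> {
  const vector = await embed(query);

  // Hit: a previous query is close enough to reuse its response verbatim.
  const hit = cache.find((entry) => cosine(entry.vector, vector) >= SIMILARITY_THRESHOLD);
  if (hit) return hit.response;

  // Miss: pay for the LLM call once, then store it for future near-duplicates.
  const response = await callLLM(query);
  cache.push({ vector, response });
  return response;
}
```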
Pattern 5: Output controls
Every LLM call should set maxTokens (or equivalent) to a sensible ceiling for the expected response. Without it, the model can produce a 4,000-token response where 200 tokens would have served. Output tokens are typically priced at 3–5× input tokens, so over-long outputs are pure waste.
Set the ceiling per-task:
- Classification / structured extraction: 100–300 tokens
- Drafting short content: 500–1,500 tokens
- Drafting long content: 2,000–4,000 tokens
- Open-ended reasoning: 2,000–8,000 tokens
If the model truncates against the ceiling, that surfaces in evals — a good thing. Better to truncate visibly than to silently overspend.
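In code this is usually just a per-task ceiling sitting next to the routing table; the task names and numbers below are illustrative:

```typescript
// Per-task output ceilings, mirroring the ranges above. Names are examples.
const MAX_OUTPUT_TOKENS: Record<string, number> = {
  intent_classification: 200,
  structured_extraction: 300,
  draft_email: 1_000,
  draft_long_report: 4_000,
  open_ended_analysis: 8_000,
};

// Default low rather than high: an unknown task should truncate visibly in evals
// instead of silently producing (and billing for) a 4,000-token answer.
export function maxTokensForTask(task: string): number {
  return MAX_OUTPUT_TOKENS[task] ?? 500;
}
```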
The gateway pattern
All of the above is materially easier with an AI Gateway in front of your provider calls. Vercel AI Gateway, OpenRouter, Portkey, or your own — the gateway provides:
- Cost telemetry per request, per model, per route. You can't optimise what you don't measure.
- Provider failover. When Anthropic has an incident, the gateway routes to OpenAI or Google with no application changes.
- Daily and monthly spend caps. Hard ceilings that throttle gracefully rather than letting a runaway loop burn AED 50,000 over a weekend.
- Rate limit handling. Centralised retry and backoff policies, applied consistently across providers.
- Model abstraction. Your code calls "anthropic/claude-sonnet-4.6" as a string; switching providers is a config change, not a refactor.
We default to Vercel AI Gateway on every engagement we ship, including Hisabi.ai in production. The marginal latency (typically 20–80ms in the active region) is imperceptible for human-facing workloads and acceptable for most agent workloads.
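What the model-abstraction point looks like in practice, assuming a gateway that exposes an OpenAI-compatible endpoint (OpenRouter does, for example; check your gateway's docs). The environment variable names are placeholders:

```typescript
import OpenAI from "openai";

// The gateway speaks the OpenAI-compatible API, so provider choice lives in
// config. GATEWAY_BASE_URL / GATEWAY_API_KEY are placeholder variable names.
const client = new OpenAI({
  baseURL: process.env.GATEWAY_BASE_URL,
  apiKey: process.env.GATEWAY_API_KEY,
});

// The model is just a routing string the gateway understands; failover and
// spend caps are enforced gateway-side, not in application code.
const completion = await client.chat.completions.create({
  model: process.env.DEFAULT_MODEL ?? "anthropic/claude-sonnet-4.6",
  max_tokens: 300,
  messages: [{ role: "user", content: "Summarise this paragraph: ..." }],
});

console.log(completion.choices[0].message.content);
```

Switching that request to another provider is a change to DEFAULT_MODEL, not a refactor.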
What to do this week
If you have an LLM workload in production:
- Pull last month's bill broken down by model. Most providers expose this in their dashboards or via API. Most teams have never looked.
- Identify the top 3 cost lines. Usually one workload accounts for 60–80% of spend.
- Audit each top-3 line against the five patterns above. Which apply? Which are quick wins?
- Pick one pattern to ship in two weeks. Prompt caching is usually the highest leverage and lowest difficulty.
- Set up a gateway if you don't have one. This is a one-week effort that pays back for years.
Where Codenovai fits
Every Agentic Pilot we ship runs through Vercel AI Gateway with all five patterns implemented from day one. For existing workloads we audit and optimise as part of Fractional AI Team retainers — the typical first-quarter outcome is 40–60% cost reduction without behaviour change.
Book a scoping call — we usually have a back-of-envelope estimate of your savings within 30 minutes of the first call.
