Cost Optimization for LLM-Powered Products: What to Measure
Your LLM bill grew faster than your usage. A practical cost-optimization framework — what to measure and what to actually do about it.
Why your LLM bill is growing faster than usage
You added one feature. Token costs tripled. This is not a coincidence — it is a pattern.
The most common culprits are not expensive models. They are invisible inefficiencies that compound across every call. Retries are not deduplicated, so a single user action that hits a network hiccup calls the API three times. Embeddings are re-computed on every page load instead of cached, so you pay the ingestion cost again on every retrieval. Agent loops lack hard termination conditions: you built a "plan, execute, verify" cycle and forgot that "verify" can call "plan" again, indefinitely, on a single user click. "Let's just give it more context" creep sets in, where every sprint adds another document to the system prompt because someone saw a hallucination and thought more text was the fix. And dev and staging traffic runs on production API keys, adding real spend that nobody monitors.
None of these show up as a spike. They show up as a slope — a bill that grows at 3x the rate of your active users, with no obvious event to blame. By the time finance flags it, the pattern is baked into every feature.
The fix starts with measurement. You cannot optimize what you have not instrumented.
The 4 costs to measure per feature
Stop looking at your total monthly invoice. It tells you nothing actionable. Instead, instrument at the feature level, and track four numbers for each one.
Input tokens per call. This is your prompt: system message, retrieved context, conversation history, tool definitions. It is the number you have the most control over, and it is the one teams pad the most aggressively. Measure it. Set a budget per feature. If your search assistant averages 6,000 input tokens per query, you should know that and have a view on whether it is justified.
Output tokens per call. LLMs are verbose by default. If you ask for a structured answer, you often get prose around it. Output tokens cost more per unit than input tokens on most providers, and they are easier to control: tighter instructions, JSON output mode, max-token limits. Measure the distribution, not just the mean — a long-tail of 4,000-token outputs on a feature that "usually" returns 200 tokens is a real problem waiting for a bad prompt.
Retry multiplier. Your real cost is not the nominal cost per call. It is the nominal cost multiplied by (1 + retry rate). A feature that costs $0.004 per call and retries 30% of its calls once actually costs $0.0052 per user action. At scale that gap matters. Log retries separately from primary calls, and surface the multiplier. A high retry rate is also a signal that something upstream is wrong (timeouts, malformed JSON from the model, context window overruns), and fixing the root cause often cuts cost more than any prompt optimization.
Embedding overhead. Track ingestion cost (embeddings computed when new content arrives) and query cost (embeddings computed at search time) separately. Most teams instrument neither. If your embedding model is running on every search request rather than being cached, you are paying per user interaction for work that could be paid once.
These four numbers, per feature, give you an actual cost model. Without them you are guessing.
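A minimal sketch of what that per-feature instrumentation could look like. The `FeatureCost` class, its field names, and the prices in the usage example are all assumptions made for illustration, not part of any provider SDK. Embedding overhead is left out for brevity; it would be two more counters (ingestion and query) following the same pattern.

```python
from dataclasses import dataclass, field
from statistics import quantiles


@dataclass
class FeatureCost:
    """Accumulates usage for one feature (hypothetical helper)."""
    input_price_per_mtok: float   # assumed USD per million input tokens
    output_price_per_mtok: float  # assumed USD per million output tokens
    primary_calls: int = 0
    retry_calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    output_sizes: list = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int, is_retry: bool = False) -> None:
        # Retries are counted separately but their tokens are still billed.
        if is_retry:
            self.retry_calls += 1
        else:
            self.primary_calls += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.output_sizes.append(output_tokens)

    @property
    def retry_multiplier(self) -> float:
        # Total calls divided by primary calls; 1.0 means no retries.
        if self.primary_calls == 0:
            return 1.0
        return (self.primary_calls + self.retry_calls) / self.primary_calls

    @property
    def total_cost(self) -> float:
        return (self.input_tokens * self.input_price_per_mtok
                + self.output_tokens * self.output_price_per_mtok) / 1_000_000

    def output_p95(self) -> float:
        # quantiles(n=20) returns 19 cut points; the last one is the 95th
        # percentile. Needs at least two recorded calls.
        return quantiles(self.output_sizes, n=20)[-1]


# Usage with made-up prices: 10 primary calls, 3 of which were retried once.
search = FeatureCost(input_price_per_mtok=3.0, output_price_per_mtok=15.0)
for _ in range(10):
    search.record(input_tokens=6000, output_tokens=200)
for _ in range(3):
    search.record(input_tokens=6000, output_tokens=200, is_retry=True)

print(f"retry multiplier: {search.retry_multiplier:.2f}")
print(f"cost per user action: ${search.total_cost / search.primary_calls:.4f}")
```

Dividing total spend by primary calls rather than by all calls is the design choice that matters here: it charges retries to the user action that caused them.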
Prompt caching
Anthropic and OpenAI both offer prompt caching. On Anthropic's API it is opt-in per request; OpenAI applies it automatically to sufficiently long prompts on supported models. If your prompts are not structured to hit the cache, you are likely leaving 50–90% of your input token cost on the table for any feature with a stable system prompt.
The mechanism: when you send a request, the provider caches the processed prefix of your prompt. Subsequent requests that share that prefix hit the cache instead of re-processing it. Anthropic charges roughly 10% of the normal input token price for cache hits, and a premium over the normal price for the request that writes the cache. OpenAI's cached-input discount varies by model, and on some models it is closer to 50% than 90%.
What should you cache? Three things: the system prompt, your tool definitions, and any retrieved context that is stable across a conversation turn. For a typical product assistant (system prompt plus tool definitions plus a few retrieved documents), the cacheable prefix might be 80% of every input. That means 80% of input tokens billed at 10% of the price once the cache is warm, which brings input cost down to roughly 28% of the uncached figure.
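To see what an 80% cacheable prefix at a 10% cache-hit price does to the input bill, here is the arithmetic as a small function. It is a simplification: it assumes every cacheable token is a cache hit and ignores the cost of writing the cache, so treat the result as a best case. The function name is illustrative.

```python
def blended_input_cost(cacheable_fraction: float, cache_read_multiplier: float = 0.10) -> float:
    """Input cost as a fraction of the uncached price.

    Assumes every cacheable token is served from cache; cache writes and
    cache misses are ignored.
    """
    if not 0.0 <= cacheable_fraction <= 1.0:
        raise ValueError("cacheable_fraction must be between 0 and 1")
    uncached_part = 1.0 - cacheable_fraction
    cached_part = cacheable_fraction * cache_read_multiplier
    return uncached_part + cached_part


relative_cost = blended_input_cost(0.80)
print(f"input cost vs. no caching: {relative_cost:.0%}")  # prints 28%
```

In other words, the 20% of tokens that change on every request end up dominating the input bill, which is a good reason to keep the dynamic part of the prompt small.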
The API call looks like this in Python:
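```python
import anthropic

# Placeholders: substitute your real system prompt and user input.
# A real prompt must exceed the provider's minimum cacheable length.
YOUR_SYSTEM_PROMPT = "You are a helpful product assistant."
user_message = "How do I reset my password?"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": YOUR_SYSTEM_PROMPT,
            # Marks the prompt prefix up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
```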
The cache_control field on the system prompt content block is the entire change. The default cache lifetime is 5 minutes, refreshed each time the cache is hit, and a longer TTL is available at a higher cache-write price. The main requirement is that the cached prefix must be stable: if you interpolate dynamic values like timestamps into your system prompt on every request, caching will not help because the prefix never matches.
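One way to keep the prefix stable is to move per-request values out of the system prompt and into the user turn. The sketch below only builds the keyword arguments for `client.messages.create()` so it can run without an API key; `build_request` and `STATIC_SYSTEM_PROMPT` are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

# Never interpolated, so the cached prefix is byte-identical across requests.
STATIC_SYSTEM_PROMPT = "You are a support assistant for an example store."


def build_request(user_message: str, now: datetime) -> dict:
    """Build kwargs for client.messages.create() with a cache-friendly layout."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic values live after the cached prefix, in the user turn.
        "messages": [
            {
                "role": "user",
                "content": f"Current time: {now.isoformat()}\n\n{user_message}",
            }
        ],
    }


first = build_request("Where is my refund?", datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc))
second = build_request("Where is my refund?", datetime(2026, 1, 5, 9, 1, tzinfo=timezone.utc))
print(first["system"] == second["system"])  # True: the cacheable part did not change
```

The same rule applies to tool definitions and retrieved documents: anything that changes per request belongs after the last cached block.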
Many teams have not enabled this because, on Anthropic's API, it is not on by default and the documentation is easy to miss. It is often the single highest-leverage change for any product that already has a working LLM feature.
Smart routing
Not every query needs your most capable model. The mistake is using a flat configuration — one model for all requests — because it is simple. The cost of that simplicity is paying premium rates for questions that a much cheaper model could handle correctly.
The pattern is a two-tier classifier-plus-specialist setup. A lightweight model (Claude Haiku, Gemini Flash, or a cheap GPT variant) sees the incoming query and classifies it: easy, medium, or hard. Easy queries go to the same lightweight model. Hard queries route to the specialist (Claude Sonnet, Claude Opus, GPT-4, or o1, depending on your use case). Medium queries are your judgment call — usually routed to a mid-tier model or handled with a simplified version of the hard-query prompt.
The classifier needs to be fast and cheap — often a single-turn prompt with no retrieval, just the raw query and a small rubric. "Is this a factual lookup, a reasoning task, or something requiring multi-step planning?" is a question a small model answers well. Getting it wrong 10% of the time is acceptable; the cost differential more than covers the misroutes.
Teams doing this in production report 60–70% cost reduction on mixed-difficulty workloads. The distribution matters. If 70% of your queries are genuinely simple (status checks, factual lookups, formatting requests) and you were routing all of them through your most expensive model, the saving scales with that 70%, less whatever the cheaper model and the classifier call cost.
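A back-of-the-envelope version of that calculation, under stated assumptions: the cheap model costs one tenth of the expensive one per query, and the classifier adds overhead equal to 2% of an expensive call. Both ratios are invented for illustration; plug in your own.

```python
def routed_cost(easy_share: float, cheap_price_ratio: float, classifier_overhead: float = 0.0) -> float:
    """Cost after routing, relative to sending every query to the expensive model.

    easy_share: fraction of queries sent to the cheap model.
    cheap_price_ratio: cheap model's cost per query divided by the expensive model's.
    classifier_overhead: classifier cost per query, as a fraction of an expensive call.
    """
    hard_share = 1.0 - easy_share
    return hard_share + easy_share * cheap_price_ratio + classifier_overhead


relative = routed_cost(easy_share=0.70, cheap_price_ratio=0.10, classifier_overhead=0.02)
print(f"cost vs. flat configuration: {relative:.0%}")  # prints 39%
print(f"saving: {1 - relative:.0%}")                   # prints 61%
```

With these inputs the saving lands at the low end of the 60–70% range quoted above, and most of the remaining spend is the 30% of queries that still need the expensive model.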
The non-obvious implementation detail: the classifier prompt is the thing to get right. Define your tiers concretely. "A hard query requires synthesizing conflicting information or multi-step reasoning" is actionable. "A hard query is complex" is not. Spend time on the rubric, run it against a sample of real queries, and adjust. The rest is plumbing.
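The routing logic itself can be sketched in a few lines. Everything here is an assumption for illustration: the rubric wording, the tier-to-model names, and the `classify` callable, which stands in for a call to whichever lightweight model you use. Unrecognized classifier output escalates to the hard tier, so a misbehaving classifier costs money rather than quality.

```python
from typing import Callable

RUBRIC = """Classify the user query as exactly one of: easy, medium, hard.
easy: a factual lookup, status check, or formatting request.
medium: needs a short chain of reasoning over a single source.
hard: requires synthesizing conflicting information or multi-step reasoning.
Answer with the single word only."""

# Hypothetical model names; substitute the models you actually run.
MODEL_FOR_TIER = {
    "easy": "small-model",
    "medium": "mid-model",
    "hard": "large-model",
}


def route(query: str, classify: Callable[[str, str], str]) -> str:
    """Return the model name that should handle `query`.

    `classify(rubric, query)` is expected to call the lightweight model and
    return its raw text answer.
    """
    label = classify(RUBRIC, query).strip().lower()
    tier = label if label in MODEL_FOR_TIER else "hard"
    return MODEL_FOR_TIER[tier]


# Stub classifier standing in for a real model call.
def fake_classifier(rubric: str, query: str) -> str:
    return "easy" if "status" in query.lower() else "hard"


print(route("What is the status of ticket 4821?", fake_classifier))  # small-model
print(route("Reconcile these two contradictory reports.", fake_classifier))  # large-model
```

Passing the classifier in as a function keeps the router testable against a labelled sample of real queries before any model is wired in.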
Batching strategies
Real-time is not always a product requirement. When it is not, you are overpaying.
Both Anthropic and OpenAI offer batch inference endpoints with a 50% discount on input and output tokens. The trade-off is latency: batch jobs are processed asynchronously and complete within a 24-hour window, often much sooner, rather than in seconds. For the right workloads, this is a straightforward cost cut with no quality trade-off at all.
Where batching works: overnight document processing, async ingestion pipelines, embedding pre-computation for a new content set, weekly report generation, re-evaluation of existing summaries when a model is updated, nightly re-ranking of a search index. Any workload where the user is not waiting on a response in a browser tab is a candidate.
Where batching does not work: any chat UX, any feature where a user submitted something and is looking at a spinner, any workflow where the next step depends on the LLM output in under a second. The 50% discount is not worth a ruined user experience.
The implementation is simple. Anthropic's batch API accepts a list of requests, returns a job ID, and provides an endpoint to poll for completion. OpenAI's Files and Batch APIs work similarly. If you have a nightly pipeline that currently calls the API synchronously in a loop, switching to batch is usually a one-afternoon change that immediately cuts that pipeline's cost in half.
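A rough sketch of that submit-and-poll flow against Anthropic's Message Batches API, as I understand the Python SDK; verify the method names and status values against the current documentation before relying on them. The helper names, the summarization prompt, and the polling interval are assumptions. The request builder is kept separate from the network call so it can be tested offline.

```python
def build_batch_requests(documents: dict[str, str], model: str = "claude-sonnet-4-5") -> list[dict]:
    """Turn {doc_id: text} into one batch entry per document.

    Each custom_id is how you match results back to inputs later.
    """
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": 512,
                "messages": [
                    {"role": "user", "content": f"Summarize this document:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in documents.items()
    ]


def submit_and_wait(requests: list[dict], poll_seconds: int = 60) -> list:
    """Submit a batch and block until it ends. Needs the SDK and an API key."""
    import time
    import anthropic  # imported lazily so the builder above works without the SDK

    client = anthropic.Anthropic()
    batch = client.messages.batches.create(requests=requests)
    # Poll until the provider reports the batch has finished processing.
    while client.messages.batches.retrieve(batch.id).processing_status != "ended":
        time.sleep(poll_seconds)
    return list(client.messages.batches.results(batch.id))


requests = build_batch_requests({"doc-1": "First report text.", "doc-2": "Second report text."})
print(len(requests), requests[0]["custom_id"])  # 2 doc-1
```

In a nightly pipeline, the synchronous loop is replaced by one `submit_and_wait(requests)` call, or by a submit step and a separate job that collects results later.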
One thing teams miss: batching also reduces the risk of rate-limit errors on large ingestion jobs, because the provider manages the queue for you. That is a reliability benefit on top of the cost saving.
The "do you need an LLM at all?" check
Before you optimize an LLM call, ask whether the LLM call should exist.
This sounds obvious. It is not practiced. Teams reach for LLMs because they solve ambiguous problems without requiring you to write rules. That is genuinely useful. It is also a habit that spreads to cases where a much simpler solution would work just as well — and cost orders of magnitude less.
The 80/20 analysis: for most product features, 80% of the inputs follow predictable patterns that could be handled without a model call. A customer support flow that routes "where is my order" to a status lookup does not need GPT-4 to decide the intent. A regex handles it. A document classification pipeline where 90% of documents fall into five well-defined categories does not need an LLM for those cases. A small fine-tuned classifier or even keyword matching does it at a fraction of the cost.
The check to run: sample 100 real inputs to your LLM-powered feature. Tag each one: did this genuinely require reasoning, or could it have been handled by a simpler system? If more than 40% fall into the "simpler system" bucket, you have a routing opportunity — and unlike smart model routing, this one routes away from the LLM entirely.
The practical approach is not to remove the LLM but to add a pre-filter. Simple cases get handled by the cheap path. Complex cases fall through to the model. The LLM becomes a fallback for genuine ambiguity rather than the default for all inputs. That shift — treating the model as the expensive escalation path rather than the universal handler — often delivers the largest cost reduction of any optimization on this list.
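The pre-filter pattern can be sketched as follows. The regex, the `order_status` intent name, and the `llm_fallback` callable are illustrative assumptions; a real pre-filter would be built from the patterns you find when tagging your own sample of inputs.

```python
import re
from typing import Callable, Optional

# Matches phrasings like "where is my order", "track my order", "status of my order".
ORDER_STATUS = re.compile(
    r"\b(where\s+is|track|status\s+of)\b.*\border\b", re.IGNORECASE
)


def cheap_intent(message: str) -> Optional[str]:
    """Return an intent if a rule matches, otherwise None."""
    if ORDER_STATUS.search(message):
        return "order_status"
    return None


def handle(message: str, llm_fallback: Callable[[str], str]) -> str:
    """Try the cheap path first; only call the model when no rule matches."""
    intent = cheap_intent(message)
    if intent is not None:
        return intent
    return llm_fallback(message)


# Stub standing in for a real model call, so we can count how often it runs.
calls = []

def fake_llm(message: str) -> str:
    calls.append(message)
    return "needs_review"


print(handle("Where is my order?", fake_llm))  # order_status
print(handle("My blender arrived cracked and I was charged twice.", fake_llm))  # needs_review
print(len(calls))  # 1
```

Counting fallback calls, as the stub does here, gives you the number that matters: the share of traffic that still reaches the model after the pre-filter.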
In 2026, LLM cost is a real product constraint. Not a future concern, not something to revisit at Series B — a current line item that finance notices and that compounds as you scale. Teams that instrument it, route intelligently, cache aggressively, and challenge whether every call needs to exist tend to build AI products that remain economically viable as usage grows. Teams that treat it as someone else's problem get a wake-up call around month six, usually delivered as a spreadsheet with an awkward meeting attached.
At Reveronix, cost architecture is part of how we scope and build LLM features from the start — not retrofitted later. If your AI costs are growing faster than your users, that is a solvable engineering problem.
Written by the Reveronix team.