LLM Cost Optimization: Tokens, Routing, Batching

A practical guide to estimating and reducing LLM costs through token control, caching, model routing, and batching.

LLM cost optimization is no longer a niche concern for large AI teams. If you run support assistants, internal copilots, document pipelines, or agentic workflows, model spend can grow quietly through long prompts, repeated context, overpowered model selection, and inefficient request patterns. This guide gives you a repeatable way to estimate and reduce inference cost using four durable levers: tokens, caching, routing, and batching. The goal is not to chase the lowest possible bill at the expense of quality. It is to build an operating model that helps you decide where savings are safe, measurable, and worth revisiting as providers, models, and workloads change.

Overview

The most useful way to think about LLM cost optimization is as a systems problem, not a single prompt problem. Teams often start by trimming wording in prompts, which can help, but the larger savings usually come from design choices made one layer above the prompt itself.

In practice, spend is driven by a combination of:

Input tokens: system prompts, user messages, retrieved context, tool schemas, examples, and conversation history.
Output tokens: the model’s completion length, including structured JSON and verbose explanations.
Request frequency: how often users, automations, or agents call the model.
Model choice: whether every task is sent to the same large model or routed by difficulty.
Workflow shape: whether tasks are processed one by one, in batches, or through multi-step agent chains.

For most teams, the best LLM cost optimization strategy follows this order:

Measure the current token and request baseline.
Cut waste before changing model behavior.
Route easy tasks to cheaper models.
Cache repeated work.
Batch predictable jobs.
Re-test quality after each change.

This sequence matters. Reducing token waste improves every provider and every model. Caching helps only where repetition exists. Routing helps only if you can separate easy tasks from hard ones. Batching helps only where latency is flexible. If you try everything at once, you may save money but lose the ability to explain why.

Cost optimization also needs to stay connected to quality. A support assistant that becomes cheaper but less accurate may increase human review time and erase the savings. A coding tool that uses a smaller model but requires more retries can end up costing more in both tokens and developer time. Cost should be evaluated alongside latency, task success, fallback rate, and review burden. That is why this article is structured like a calculator and decision guide, not a list of tricks.

How to estimate

You can estimate inference cost with a simple model that works across providers. The exact pricing inputs will change over time, but the structure remains stable.

Base formula per request:

Total cost per request = input token cost + output token cost + tool or retrieval overhead + retry/fallback overhead

Monthly cost estimate:

Monthly cost = average total cost per request × total monthly requests
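As a minimal sketch, the two formulas above can be written as a pair of functions. The class name, field names, and prices are illustrative assumptions, not provider values. Prices are expressed per million tokens because that is a common way to quote them, and each retry is assumed to cost the same as the original request.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    input_tokens: int
    output_tokens: int
    overhead_usd: float = 0.0   # tool or retrieval overhead per request
    retry_rate: float = 0.0     # fraction of requests that are sent again


def cost_per_request(p: RequestProfile, input_price_per_m: float, output_price_per_m: float) -> float:
    """Input token cost + output token cost + overhead, scaled by the retry rate."""
    token_cost = (
        p.input_tokens / 1_000_000 * input_price_per_m
        + p.output_tokens / 1_000_000 * output_price_per_m
    )
    base = token_cost + p.overhead_usd
    # Simplifying assumption: a retry costs the same as the original request.
    return base * (1 + p.retry_rate)


def monthly_cost(p: RequestProfile, input_price_per_m: float, output_price_per_m: float,
                 monthly_requests: int) -> float:
    return cost_per_request(p, input_price_per_m, output_price_per_m) * monthly_requests


# Placeholder prices (USD per million tokens), not real provider rates.
profile = RequestProfile(input_tokens=2000, output_tokens=500, overhead_usd=0.001, retry_rate=0.10)
per_request = cost_per_request(profile, input_price_per_m=1.00, output_price_per_m=4.00)
monthly = monthly_cost(profile, 1.00, 4.00, monthly_requests=100_000)
print(f"per request: ${per_request:.4f}, per month: ${monthly:.2f}")
```

With these placeholder numbers the sketch reports roughly $0.0055 per request and $550 per month. The useful part is the structure: when a price changes, only the two price arguments move.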

To make this useful, break one workflow into stages rather than treating the entire application as one average number. For example, an AI support flow might include:

Intent classification
Knowledge retrieval
Answer generation
Escalation summary

Each stage can have a different model, token profile, and frequency. The classification step may be cheap and short. The answer generation step may include long retrieval context. The escalation summary may only run on a fraction of conversations. Estimating at the stage level makes it easier to see where to reduce token costs without guessing.
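A stage-level estimate might look like the following sketch. The run fractions and per-run costs are made-up placeholders. The point is only that weighting each stage by how often it runs shows where the spend concentrates.

```python
# Each stage: (name, fraction of conversations that run it, cost per run in USD).
# All numbers are placeholders for illustration.
stages = [
    ("intent_classification", 1.00, 0.0002),
    ("knowledge_retrieval",   1.00, 0.0005),
    ("answer_generation",     0.90, 0.0060),
    ("escalation_summary",    0.15, 0.0030),
]


def cost_per_conversation(stages):
    """Expected cost of one conversation: sum of run fraction x cost per run."""
    return sum(rate * cost for _, rate, cost in stages)


def stage_shares(stages):
    """Share of the expected conversation cost attributable to each stage."""
    total = cost_per_conversation(stages)
    return {name: rate * cost / total for name, rate, cost in stages}


total = cost_per_conversation(stages)
shares = stage_shares(stages)
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {share:6.1%}")
```

In this made-up profile, answer generation accounts for most of the expected cost, so that is where token trimming would be tried first.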

A practical estimation workflow

Pick one production use case. Start with the workflow that has the highest volume or the fastest-growing spend.
Log median and p95 token counts. Average values can hide expensive outliers. Long context windows often show up in the tail.
Separate input and output tokens. These are optimized differently. Input bloat is usually a prompt and context problem. Output bloat is usually an instruction and format problem.
Add retries and fallbacks. If 10 percent of requests retry or escalate to a stronger model, include that multiplier.
Estimate by request class. Group tasks such as short FAQ, long document answer, structured extraction, or agent workflow.
Model best case, normal case, and high-load case. This helps with budgeting and capacity planning.
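The logging steps above can be sketched with the standard library alone. The log record shape (request class, input tokens, output tokens) is an assumption for illustration, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
from collections import defaultdict
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(logs):
    """Group token counts by request class and report median and p95 for each."""
    by_class = defaultdict(lambda: {"input": [], "output": []})
    for request_class, input_tokens, output_tokens in logs:
        by_class[request_class]["input"].append(input_tokens)
        by_class[request_class]["output"].append(output_tokens)
    return {
        cls: {kind: {"median": median(vals), "p95": p95(vals)} for kind, vals in kinds.items()}
        for cls, kinds in by_class.items()
    }


# Hypothetical log records: (request_class, input_tokens, output_tokens)
logs = [
    ("short_faq", 300, 80),
    ("short_faq", 320, 90),
    ("short_faq", 310, 85),
    ("short_faq", 2900, 100),      # one long-context outlier
    ("long_document", 4000, 400),
    ("long_document", 4200, 450),
]

summary = summarize(logs)
for cls, stats in summary.items():
    print(cls, stats)
```

In the sample data the short FAQ class has a median input near 315 tokens but a p95 of 2900, which is the kind of tail an average would blur.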

Example estimation template

Requests per month
Input tokens per request (median and p95)
Output tokens per request (median and p95)
Percent of requests eligible for caching
Percent of requests routed to a smaller model
Percent of requests retried
Percent of requests escalated to a larger model
Batchable vs real-time share

Once you have those inputs, you can model individual savings levers:

Token reduction savings
Savings = requests × token reduction per request × token price

Caching savings
Savings = repeated requests × avoided recomputation cost

Routing savings
Savings = easy-task volume × (large-model cost − small-model cost)

Batching savings
Savings = batchable volume × per-request overhead reduction

The important point is that you do not need exact provider prices to begin. You need reliable workload ratios. Once you understand your token shape and workflow distribution, updating the numbers when pricing changes becomes straightforward.
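The four savings formulas translate directly into code. The sketch below assumes a hypothetical workload of 200,000 monthly requests with illustrative ratios and placeholder prices. Note that the levers overlap in practice (a cached request cannot also be routed), so the individual figures are upper bounds and should not simply be summed.

```python
def token_reduction_savings(requests, tokens_saved_per_request, price_per_token):
    return requests * tokens_saved_per_request * price_per_token


def caching_savings(repeated_requests, avoided_cost_per_request):
    return repeated_requests * avoided_cost_per_request


def routing_savings(easy_task_volume, large_model_cost, small_model_cost):
    return easy_task_volume * (large_model_cost - small_model_cost)


def batching_savings(batchable_volume, overhead_reduction_per_request):
    return batchable_volume * overhead_reduction_per_request


# Placeholder workload ratios and prices, for illustration only.
monthly_requests = 200_000
levers = {
    "token_reduction": token_reduction_savings(monthly_requests, 400, 1.00 / 1_000_000),
    "caching": caching_savings(int(monthly_requests * 0.30), 0.004),
    "routing": routing_savings(int(monthly_requests * 0.50), 0.005, 0.001),
    "batching": batching_savings(int(monthly_requests * 0.20), 0.0005),
}

# Levers interact, so treat each number as a separate upper bound.
for name, usd in sorted(levers.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} ${usd:8.2f} per month")
```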

Inputs and assumptions

This section explains the variables that most affect cost and the assumptions that often distort estimates.

1. Input tokens are often the largest hidden expense

Teams usually notice long outputs first because they are visible to users. But many expensive applications are actually dominated by input tokens: system instructions, few-shot examples, tool definitions, long chat history, and retrieval payloads.

Common sources of unnecessary input cost include:

Large system prompts that repeat policy text on every turn
Too many few-shot examples for simple classification tasks
Passing full tool schemas when only one tool is relevant
Sending the entire conversation when a short state summary would do
Retrieving too many chunks in RAG pipelines

If your goal is to reduce token costs, input cleanup usually delivers the fastest gains. This connects directly to prompt engineering best practices: concise instructions, minimal examples, and well-scoped context. If you need a broader framework, see Prompt Optimization Workflow: Diagnose, Iterate, and Measure Improvements.
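One way to replace full conversation replay is to keep a short state summary plus only the most recent turns that fit a token budget. The sketch below is illustrative: `estimate_tokens` is a crude stand-in (about four characters per token) and should be swapped for your provider's tokenizer, and the function and parameter names are assumptions.

```python
def estimate_tokens(text: str) -> int:
    # Crude approximation: roughly 4 characters per token.
    # Replace with a real tokenizer for production estimates.
    return max(1, len(text) // 4)


def build_context(state_summary: str, turns: list[str], budget_tokens: int) -> list[str]:
    """Keep the summary, then add the newest turns that still fit the budget."""
    remaining = budget_tokens - estimate_tokens(state_summary)
    kept = []
    for turn in reversed(turns):           # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break                          # stop at the first turn that does not fit
        kept.append(turn)
        remaining -= cost
    return [state_summary] + list(reversed(kept))   # restore chronological order


summary = "Summary: user is asking about a delayed refund for an annual plan."
turns = [
    "User: I cancelled last month and still have not received my refund. " * 5,
    "Assistant: I can check that. Which plan were you on?",
    "User: The annual plan.",
]
context = build_context(summary, turns, budget_tokens=40)
print(len(context), "messages kept")
```

The oldest, longest turn is dropped here because its content is assumed to be captured by the summary. Whether that holds depends on how the summary is produced, so it is worth checking against an eval set.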

2. Output control matters more than many teams expect

Output tokens can become expensive in summarization, coding, reporting, and multi-agent settings. If the prompt allows the model to be expansive, it often will be.

Ways to control output cost without damaging usefulness:

Specify the desired length range
Request bullet points instead of prose where appropriate
Use structured outputs for predictable tasks
Limit step-by-step reasoning explanations in production user flows
Return excerpts or fields rather than full rewrites

Structured responses can also reduce downstream processing cost. Compare approaches in Function Calling vs JSON Prompting: Structured Output Methods Compared.

3. Caching works best when repetition is real

Caching is one of the cleanest cost levers because it avoids paying twice for the same or nearly the same work. But it only helps when your application contains repeated prompts, repeated context, repeated retrieval results, or common answer patterns.

Good caching candidates include:

FAQ-style support prompts
Repeated system prompts or prompt prefixes
Stable retrieval results for common queries
Deterministic classification and extraction tasks
Summaries of unchanged source documents

Poor caching candidates include highly personalized prompts, rapidly changing data, or tasks where freshness matters more than savings.

When designing caching, define the cache key carefully. It may include prompt version, model version, tenant ID, document revision, retrieval source hash, and user locale. Without those controls, cached outputs can become stale or misleading.
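A cache key built from those fields could be sketched as follows. The field names mirror the list in the paragraph above, and the whitespace normalization of the user input is an added assumption that you may not want for every workload.

```python
import hashlib
import json


def cache_key(*, prompt_version, model_version, tenant_id, document_revision,
              retrieval_hash, locale, user_input):
    """Derive a stable key from everything that should invalidate a cached answer."""
    payload = {
        "prompt_version": prompt_version,
        "model_version": model_version,
        "tenant_id": tenant_id,
        "document_revision": document_revision,
        "retrieval_hash": retrieval_hash,
        "locale": locale,
        # Assumption: collapsing whitespace is safe for this workload.
        "user_input": " ".join(user_input.split()),
    }
    # sort_keys makes the serialization deterministic regardless of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = dict(prompt_version="v3", model_version="model-a", tenant_id="acme",
            document_revision="rev-12", retrieval_hash="abc123", locale="en-US",
            user_input="How do I reset my password?")
key_a = cache_key(**base)
key_b = cache_key(**{**base, "document_revision": "rev-13"})
print(key_a != key_b)  # a new document revision produces a different key
```

Including the tenant ID in the key also keeps one tenant's cached answers from being served to another.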

4. Routing creates savings only when task difficulty varies

Model routing savings come from refusing to treat every request as equally hard. A short intent classifier, a refund policy lookup, and a complex contract analysis should not always hit the same model tier.

A simple routing design can use:

Rules for known task categories
Confidence thresholds from a cheap classifier
Length or complexity heuristics
Escalation paths when confidence is low

Routing is often the biggest structural opportunity in AI agent development. If your system already has multiple steps or tools, review your architecture with AI Agent Architecture Patterns: Single-Agent, Multi-Agent, and Tool-Using Systems.
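A simple routing design of the kind listed above can be sketched in a few lines. The tier names, task categories, threshold, and length limit are all hypothetical and would need tuning against your own evaluation data.

```python
SMALL, LARGE = "small-model", "large-model"   # hypothetical tier names

# Rules for known task categories (illustrative).
RULES = {
    "intent_classification": SMALL,
    "refund_policy_lookup": SMALL,
    "contract_analysis": LARGE,
}


def route(task_type: str, text: str, classifier_confidence: float,
          confidence_threshold: float = 0.8, max_small_chars: int = 2000) -> str:
    """Pick a model tier using rules, then confidence, then a length heuristic."""
    # 1. Known categories are routed by rule.
    if task_type in RULES:
        return RULES[task_type]
    # 2. Escalate when the cheap classifier is unsure.
    if classifier_confidence < confidence_threshold:
        return LARGE
    # 3. Long inputs are treated as harder (a rough heuristic).
    if len(text) > max_small_chars:
        return LARGE
    return SMALL


print(route("refund_policy_lookup", "Can I get a refund after 30 days?", 0.95))
print(route("unknown", "Short question", 0.55))
```

The first call resolves by rule to the small tier, and the second escalates because the classifier confidence falls below the threshold.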

5. Batching saves overhead, but usually trades against latency

Batching is ideal for offline or nearline workloads: classification jobs, enrichment pipelines, nightly summarization, or backfill tasks. It is less useful for interactive chat where users expect immediate responses.

Batching can improve efficiency by:

Reducing connection and orchestration overhead
Grouping homogeneous tasks
Smoothing traffic spikes
Making fallback policies easier to manage at queue level

Still, batching can create operational complexity. You may need queue controls, timeout logic, and failure isolation so one bad item does not hold up the rest.
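One way to get failure isolation is to send items in batches and fall back to per-item calls only when a batch fails. This is a sketch under stated assumptions: `batch_handler` is a hypothetical callable that takes a list of items and returns a list of results, and `fake_tag_batch` stands in for a real model call.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_in_batches(items, batch_handler, batch_size):
    """Call the handler once per batch; on failure, retry that batch item by item."""
    results, failures = [], []
    for batch in chunked(items, batch_size):
        try:
            results.extend(batch_handler(batch))
        except Exception:
            # Isolate the bad item so it does not hold up the rest of the batch.
            for item in batch:
                try:
                    results.extend(batch_handler([item]))
                except Exception as exc:
                    failures.append((item, str(exc)))
    return results, failures


def fake_tag_batch(batch):
    # Stand-in for a model call: fails if any document in the batch is empty.
    if any(not doc for doc in batch):
        raise ValueError("empty document")
    return [doc.upper() for doc in batch]


results, failures = process_in_batches(["a", "b", "", "d", "e"], fake_tag_batch, batch_size=2)
print(results, failures)
```

The fallback costs extra calls for the failing batch only, which is usually a small price when failures are rare. Timeouts and queue controls are left out of this sketch.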

6. Assumptions that break estimates

Three bad assumptions show up often:

Assuming average token counts are enough. Outliers matter, especially in RAG and agent chains.
Assuming a cheaper model always lowers total cost. If retries rise, total spend can increase.
Assuming quality loss is acceptable if token spend falls. Human review, escalations, or user abandonment can cost more than inference.

That is why LLM evaluation belongs in cost work. Pair savings experiments with eval sets and rubric-based checks. Helpful references: Prompt Testing Workflow: How to Build Eval Sets Before You Ship and How to Evaluate LLM Output Quality: Metrics, Rubrics, and Test Sets.
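The second and third assumptions can be checked with simple arithmetic. The sketch below uses invented per-task costs and rates to show how a small-first setup can come out more expensive than a large-model baseline once retries, escalations, or human review are counted.

```python
def small_first_cost(small_cost, large_cost, extra_attempts, escalation_rate):
    """Expected inference cost per task when a small model is tried first.

    extra_attempts: average number of additional small-model attempts per task.
    escalation_rate: fraction of tasks that end up on the large model as well.
    """
    return small_cost * (1 + extra_attempts) + escalation_rate * large_cost


def total_cost(inference_cost, review_rate, review_cost):
    """Inference cost plus the expected cost of human review per task."""
    return inference_cost + review_rate * review_cost


# Placeholder numbers for illustration only.
SMALL_COST, LARGE_COST, REVIEW_COST = 0.001, 0.005, 2.00

healthy = small_first_cost(SMALL_COST, LARGE_COST, extra_attempts=0.2, escalation_rate=0.1)
struggling = small_first_cost(SMALL_COST, LARGE_COST, extra_attempts=1.0, escalation_rate=0.7)
print(f"large only: {LARGE_COST:.4f}  healthy: {healthy:.4f}  struggling: {struggling:.4f}")

baseline_total = total_cost(LARGE_COST, review_rate=0.05, review_cost=REVIEW_COST)
degraded_total = total_cost(healthy, review_rate=0.08, review_cost=REVIEW_COST)
print(f"with review, baseline: {baseline_total:.4f}  cheaper model: {degraded_total:.4f}")
```

In the struggling case the small-first path costs more than sending everything to the large model. In the review case, a three-point rise in review rate outweighs the entire inference saving, because one human review costs far more than one request.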

Worked examples

These examples use relative reasoning rather than provider-specific prices, so you can adapt them to your current stack.

Example 1: Internal knowledge assistant

Baseline: An internal support bot answers employee questions using a large model and a RAG pipeline. Every request includes a long system prompt, full conversation history, and multiple retrieved chunks.

Observed cost drivers:

Large repeated prompt prefix
Too many retrieved chunks
Verbose answers
All requests routed to one premium model

Optimization plan:

Shorten and version the system prompt.
Replace full chat replay with a rolling state summary.
Reduce retrieval count and tighten chunk selection.
Set answer length guidance for common employee questions.
Route simple policy lookups to a smaller model; escalate only ambiguous or multi-document questions.

Expected result: This kind of workflow often responds well to token trimming first, then routing. It is also a strong fit for caching retrieval results when internal documents do not change often. For more on grounded assistants, see Best Practices for Grounding AI Responses with Internal Knowledge Bases and How to Build an AI Agent with RAG and Tool Use.

Example 2: Support triage and reply drafting

Baseline: A customer support platform uses an LLM to classify tickets, draft replies, and generate escalation summaries.

Observed cost drivers:

Each stage uses the same model
Classification is over-provisioned
Reply drafts are longer than agents need
Escalation summaries run even on low-priority tickets

Optimization plan:

Move classification to a cheaper model.
Trigger draft generation only after routing and priority checks.
Constrain draft format to a short, editable response.
Generate escalation summaries only when a handoff actually occurs.
Batch low-urgency ticket enrichment jobs.

Expected result: Stage-level routing and conditional execution usually produce more savings than prompt edits alone. This is a good example of model routing savings because task difficulty differs sharply across the workflow.
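Conditional execution for this kind of flow might be sketched as below. The three stage callables and the ticket fields are hypothetical, and sending low-priority tickets straight to a batch queue is one possible reading of the plan, not the only one.

```python
def handle_ticket(ticket, classify, draft_reply, summarize_for_handoff):
    """Run only the stages a ticket actually needs and record which ones ran."""
    label = classify(ticket)                      # intended for a cheaper model
    result = {"priority": label["priority"], "stages_run": ["classify"]}

    if label["priority"] == "low":
        # Assumption: low-urgency tickets are enriched later in a batch job.
        result["queued_for_batch"] = True
        return result

    result["draft"] = draft_reply(ticket)         # short, editable draft
    result["stages_run"].append("draft_reply")

    if label["needs_handoff"]:
        # The summary is generated only when a handoff actually occurs.
        result["handoff_summary"] = summarize_for_handoff(ticket)
        result["stages_run"].append("handoff_summary")
    return result


# Stub stages standing in for model calls.
def fake_classify(ticket):
    return {"priority": ticket["priority"], "needs_handoff": ticket.get("handoff", False)}

def fake_draft(ticket):
    return f"Draft reply for: {ticket['subject']}"

def fake_summary(ticket):
    return f"Handoff summary for: {ticket['subject']}"


ticket = {"subject": "Billing error", "priority": "high", "handoff": True}
print(handle_ticket(ticket, fake_classify, fake_draft, fake_summary)["stages_run"])
```

Recording which stages ran per ticket also gives you the run fractions needed for a stage-level cost estimate.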

Example 3: Coding assistant for internal developers

Baseline: A development tool sends large code context, repository instructions, and long output requests to a strong reasoning model for every task.

Observed cost drivers:

Large context payloads
Repeated repository instructions
High-output code generation with commentary
Retries after unclear formatting

Optimization plan:

Retrieve only files relevant to the current task.
Store static coding rules separately and include them selectively.
Use structured patch or diff outputs where possible.
Route simple refactors, lint fixes, or test generation to a lower-cost model.
Reserve the strongest model for architecture changes or debugging across multiple files.

Expected result: In coding tools, context discipline matters as much as model choice. If the assistant supports many developer tasks, compare model strengths by use case rather than assuming one universal winner: Best AI Models for Coding, Reasoning, and Support Tasks Compared and OpenAI vs Anthropic vs Google Models: API Features and Tradeoffs.

Example 4: Batch document enrichment pipeline

Baseline: A back-office system summarizes documents, extracts entities, and tags records as they arrive. Jobs run one request at a time, even when latency is not user-facing.

Observed cost drivers:

No batching
Same prompt prefix repeated per item
Long summaries when short abstracts would suffice
No cache for unchanged documents

Optimization plan:

Queue jobs and batch similar tasks together.
Skip unchanged documents by hashing content revisions.
Use extraction-first workflows so only selected records need full summaries.
Shorten outputs to fit the actual downstream need.

Expected result: This is where batching is most useful. Savings come from lower overhead, reduced duplicate work, and better control of processing schedules.
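Skipping unchanged documents by hashing content could look like this sketch. The in-memory `seen_hashes` dictionary is a stand-in for whatever store the pipeline uses, and in a real pipeline the hash would more safely be recorded only after the job succeeds.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(documents: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content differs from the last recorded hash."""
    changed = []
    for doc_id, text in documents.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            # Simplification: recorded immediately rather than after processing succeeds.
            seen_hashes[doc_id] = digest
    return changed


seen_hashes = {}
first_run = select_changed({"doc-1": "Quarterly report", "doc-2": "Invoice 42"}, seen_hashes)
second_run = select_changed({"doc-1": "Quarterly report", "doc-2": "Invoice 42 (revised)"}, seen_hashes)
print(first_run, second_run)
```

On the second run only the revised document is selected, so the unchanged one incurs no model cost.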

When to recalculate

Cost models become stale faster than prompt docs. Revisit your estimates whenever one of the underlying inputs moves. A lightweight recalculation habit is often more valuable than a perfectly detailed spreadsheet built once and ignored.

Recalculate when:

Provider pricing changes
You switch models or add a fallback model
Your prompt or tool schema changes materially
RAG settings change, such as chunk count or chunk size
Traffic volume rises or request mix shifts
You add a new agent step, tool call, or review loop
Latency requirements change and batching becomes more or less viable
Evaluation results show more retries, escalations, or human edits

A practical operating cadence

Weekly: Review token trends, top expensive routes, and retry rates.
Monthly: Recompute per-workflow cost estimates and compare against quality metrics.
Per release: Run a prompt and routing diff review before rollout. This is where prompt versioning helps maintain traceability: Prompt Versioning and Change Tracking for Production Teams.
Quarterly: Revisit model selection, cache policy, and architecture assumptions.
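The weekly review can start from a small aggregation over request logs. The record fields (`route`, `cost_usd`, `retried`) are assumptions about what your logging captures, and the sample values are invented.

```python
from collections import defaultdict


def weekly_report(records):
    """Total cost and retry rate per route, most expensive route first."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "retries": 0})
    for r in records:
        t = totals[r["route"]]
        t["requests"] += 1
        t["cost_usd"] += r["cost_usd"]
        t["retries"] += int(r["retried"])
    report = [
        {
            "route": route,
            "requests": t["requests"],
            "cost_usd": t["cost_usd"],
            "retry_rate": t["retries"] / t["requests"],
        }
        for route, t in totals.items()
    ]
    return sorted(report, key=lambda row: row["cost_usd"], reverse=True)


# Hypothetical log records.
records = [
    {"route": "intent_classification", "cost_usd": 0.0002, "retried": False},
    {"route": "answer_generation", "cost_usd": 0.0060, "retried": False},
    {"route": "answer_generation", "cost_usd": 0.0070, "retried": True},
    {"route": "intent_classification", "cost_usd": 0.0002, "retried": False},
    {"route": "escalation_summary", "cost_usd": 0.0030, "retried": False},
]

report = weekly_report(records)
for row in report:
    print(f"{row['route']:24s} ${row['cost_usd']:.4f}  retry rate {row['retry_rate']:.0%}")
```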

A simple optimization checklist

Do we know our highest-cost workflow by request and by total monthly spend?
Can we remove repeated prompt content safely?
Are we passing more context than the task requires?
Can we shorten outputs without hurting usefulness?
Which tasks truly require the strongest model?
Where can caching avoid duplicate work?
Which jobs can move from real-time to batch?
Did we measure quality before and after the change?

The durable lesson is that LLM cost optimization is not a one-time cleanup project. It is part of operating an AI product responsibly. Tokens, caching, routing, and batching are not isolated tactics; they are recurring decisions about workload design. If you log the right inputs, estimate costs at the workflow level, and tie savings to evaluation, you can reduce spend without turning your system into a brittle maze of shortcuts. That makes this a useful document to revisit whenever prices change, usage grows, or your AI architecture becomes more ambitious.