Cost & Performance · Advanced · 12 min read

How to Optimize LLM Tokens (and Cut Costs 40–75%)

Tokens are the unit you pay for, in both money and milliseconds. Here's the open-source toolbox for serving fewer of them, serving them faster, and reusing work across requests — without retraining.

Every LLM request has a cost shaped like input_tokens + output_tokens, multiplied by how hard your hardware has to work per token. Optimization attacks all three: shrink the token count, shrink the cost per token, and avoid recomputing tokens you've already paid for. None of these require a better model — just better infrastructure.
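To make the cost shape concrete, here is a minimal sketch of a per-request cost model. The `Pricing` class, its per-token prices, and the simplification that cached input tokens cost nothing are illustrative assumptions, not any provider's real billing rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-token prices in dollars (not real provider rates)."""
    input_per_token: float
    output_per_token: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing,
                 cached_input_tokens: int = 0) -> float:
    """Cost of one request under the simplified model above.

    Assumption: input tokens served from a cache are free. Real systems
    usually still charge something for them, just less.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    billable_input = input_tokens - cached_input_tokens
    return (billable_input * pricing.input_per_token
            + output_tokens * pricing.output_per_token)
```

The three levers map onto the three arguments: fewer tokens lowers `input_tokens` and `output_tokens`, cheaper serving lowers the prices in `Pricing`, and reuse raises `cached_input_tokens`.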

1. Reuse work with KV-cache

When a model reads a prompt, it computes attention key/value (KV) tensors for every token. The key insight: if two requests share a prefix (a system prompt, a few-shot template, a retrieved document), the KV tensors for that prefix are identical and can be cached and reused.

Prefix caching turns a repeated 2,000-token system prompt from something you pay for on every call into something you pay for once. For chat and RAG workloads, this alone is often a 30–50% win.

# vLLM: enable automatic prefix caching
from vllm import LLM

llm = LLM(model="your-model", enable_prefix_caching=True)
# Shared system prompts / retrieved context are now cached
# across requests: no recompute, lower latency.
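To see where the savings come from, here is a toy sketch of prefix reuse that counts how many tokens actually need computing. The `PrefixCache` class and its `block_size` are hypothetical illustrations; this is not how vLLM implements its cache, only the same idea reduced to token counting.

```python
class PrefixCache:
    """Toy stand-in for a KV cache keyed by token prefix, in fixed-size blocks."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self._cached_prefixes = set()  # each entry is a full prefix, as a tuple

    def process(self, tokens):
        """Return how many tokens had to be computed (cache misses)."""
        computed = 0
        # Only whole blocks are cacheable; a trailing partial block is not.
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            prefix = tuple(tokens[:end])
            if prefix in self._cached_prefixes:
                continue  # KV for this block already exists: no recompute
            self._cached_prefixes.add(prefix)
            computed += self.block_size
        computed += len(tokens) - full
        return computed
```

With an 8-token shared system prompt, a first request of 11 tokens computes all 11, while a second request of 13 tokens that starts with the same prompt computes only the 5 tokens after it.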

2. Serve more tokens per pass with speculative decoding

Generating tokens one at a time underuses the GPU. Speculative decoding uses a small, fast "draft" model to propose several tokens, then the large model verifies them all in a single forward pass. When the draft is right — which it often is for easy tokens — you get 2–3× throughput for identical output.

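Why it's free quality: The big model still verifies every token. With greedy decoding the output is identical to what the big model would produce on its own, and with sampling the standard acceptance rule preserves the big model's output distribution. You only change how fast you arrive at it.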
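The draft-then-verify loop can be sketched with toy models. Everything here is an illustrative assumption: `toy_target` and `toy_draft` are made-up deterministic functions standing in for real models, decoding is greedy, and the verification loop calls the target once per position where a real engine would score all proposed positions in one batched forward pass.

```python
def toy_target(ctx):
    """Stand-in for the large model: deterministic next token from context."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx):
    """Stand-in for the draft model: agrees with the target except on some inputs."""
    return 0 if ctx[-1] % 4 == 0 else toy_target(ctx)


def greedy_decode(model, prompt, n_new):
    """Plain one-token-at-a-time greedy decoding."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (new_tokens, verification_passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. Draft proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. Target checks the proposal; this loop models one verification pass.
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if tok != expected:
                break  # first mismatch: discard the rest of the proposal
        seq.extend(accepted)
    return seq[len(prompt):], passes
```

In this toy setup, generating 10 tokens from the prompt `[1]` takes 4 verification passes instead of 10 single-token steps, and the tokens match `greedy_decode(toy_target, [1], 10)` exactly, because every kept token is the target's own choice.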

3. Send fewer tokens with prompt compression

Most prompts are padded with low-information text. Three reliable techniques:

4. Lower cost-per-token with quantization

Quantization stores weights in fewer bits (8-, 4-, even 2-bit). A 4-bit model is roughly 4× smaller than its 16-bit original, is typically faster to decode, and fits on cheaper hardware. Modern methods keep the accuracy hit small:
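As a rough check on the size claim in this section, the sketch below computes weight memory from parameter count and bit width, then round-trips a few weights through 4-bit integers. It uses plain symmetric round-to-nearest quantization with a single scale, which is an assumption chosen for brevity and much simpler than production methods; the function names are hypothetical.

```python
def weight_memory_gb(n_params, bits):
    """Approximate weight storage in GB, ignoring scales and other overhead."""
    return n_params * bits / 8 / 1e9


def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]
```

For a 7-billion-parameter model, `weight_memory_gb(7e9, 16)` gives 14 GB and `weight_memory_gb(7e9, 4)` gives 3.5 GB, which is the 4× reduction. The round trip loses at most half a quantization step per weight, which is the accuracy cost that better methods work to reduce.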

5. Keep the accelerator full with continuous batching

Static batching waits for a whole batch to finish before starting the next — so one slow request stalls everyone. Continuous (in-flight) batching swaps finished sequences out and new ones in every step, keeping the GPU near 100% utilization. It's the default in engines like vLLM and is one of the largest throughput multipliers available.
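The difference is easy to see in a toy simulation that counts decode steps. This sketch makes simplifying assumptions: each running sequence produces one token per step, every length is at least 1, and prefill, memory limits, and scheduling overhead are ignored. Both function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next waiting request."""
    waiting = list(lengths)  # FIFO queue of remaining output lengths
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots before every step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each running sequence emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps
```

For output lengths `[100, 5, 5, 5, 100, 5, 5, 5]` with 4 slots, static batching needs 200 steps because each batch waits on its 100-token request, while continuous batching needs 105 because the second long request starts as soon as a short one finishes.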

Putting it together

These compound. A representative before/after we see often:

Baseline:   1.00× cost, 100% latency
+ quantization (AWQ 4-bit)     → 0.55× cost
+ prefix KV-cache reuse        → 0.38× cost
+ continuous batching          → 0.30× cost (3.3× throughput)
+ prompt compression           → 0.22× cost
# ~75% cheaper, faster, same model.

The order matters less than measuring. Wire up observability and cost dashboards first, then turn on one technique at a time and watch the numbers; otherwise you're optimizing blind.
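The compounding in the before/after figures can be checked with a few lines of arithmetic. This sketch takes the cumulative cost multipliers from the example above and derives each technique's individual factor; the helper names are hypothetical, and the numbers are the article's representative figures, not measurements.

```python
# Cumulative cost relative to baseline after each technique is switched on.
STAGES = [
    ("baseline", 1.00),
    ("quantization (AWQ 4-bit)", 0.55),
    ("prefix KV-cache reuse", 0.38),
    ("continuous batching", 0.30),
    ("prompt compression", 0.22),
]


def step_factors(stages):
    """Cost multiplier contributed by each stage on its own."""
    return [(name, cost / prev)
            for (_, prev), (name, cost) in zip(stages, stages[1:])]


def total_savings(stages):
    """Fraction of the baseline cost removed once every stage is applied."""
    return 1.0 - stages[-1][1] / stages[0][1]
```

The individual factors multiply back to the final 0.22× figure, which is why the techniques compound rather than add. Tracking each factor separately is also a direct way to measure one change at a time.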

Key takeaways

  • Attack token count, cost-per-token, and recomputation independently.
  • KV-cache reuse and continuous batching are the biggest, lowest-risk wins.
  • Speculative decoding gives free throughput with identical output.
  • Always measure with observability before and after each change.

