How to Optimize LLM Tokens (and Cut Costs 40–75%)
Tokens are the unit you pay for, in both money and milliseconds. Here's the open-source toolbox for serving fewer of them, serving them faster, and reusing work across requests — without retraining.
Every LLM request has a cost shaped like (input_tokens + output_tokens) × cost per token, where cost per token reflects how hard your hardware has to work. Optimization attacks three things: shrink the token count, shrink the cost per token, and avoid recomputing tokens you've already paid for. None of these requires a better model, just better infrastructure.
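The cost shape above can be made concrete with a small calculator. This is a minimal sketch: the function name, the per-million-token prices, and the cached-token discount are illustrative assumptions, not any provider's actual pricing.

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 price_in=1.00, price_out=3.00, cached_discount=1.0):
    """Estimate the cost of one request in dollars.

    Prices are hypothetical dollars per million tokens. cached_tokens is the
    part of the input served from a prefix cache, and cached_discount is the
    fraction of the input price saved on those tokens (1.0 means free).
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cached_cost = cached_tokens * price_in * (1.0 - cached_discount)
    total = fresh_tokens * price_in + cached_cost + output_tokens * price_out
    return total / 1_000_000


# One call per lever, all compared against the same baseline request.
baseline = request_cost(3000, 500)
fewer_tokens = request_cost(1500, 500)                       # shrink token count
cheaper_tokens = request_cost(3000, 500, price_in=0.50, price_out=1.50)
reused_prefix = request_cost(3000, 500, cached_tokens=2000)  # skip recomputation
```

Each lever changes a different argument, which is why the three can be tuned and measured separately.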
1. Reuse work with KV-cache
When a model reads a prompt, it computes attention key/value tensors for every token. The key insight: if two requests share a prefix (a system prompt, a few-shot template, a retrieved document placed at the start), the KV tensors for that prefix are identical and can be cached and reused.
Prefix caching turns a repeated 2,000-token system prompt from something you pay for on every call into something you pay for once, for as long as it stays in the cache. For chat and RAG workloads, this alone is often a 30–50% win, depending on how much of each prompt is shared.
# vLLM: enable automatic prefix caching
from vllm import LLM

llm = LLM(model="your-model", enable_prefix_caching=True)
# Shared system prompts / retrieved context are now cached
# across requests: no recompute, lower latency.
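To see why a shared prefix saves work, here is a toy accounting model in plain Python. It is a sketch under stated assumptions, not vLLM's implementation: `ToyPrefixCache`, its block size, and the integer "tokens" are all hypothetical, and it only counts tokens rather than storing real KV tensors.

```python
class ToyPrefixCache:
    """Counts how many tokens need fresh computation when prefixes repeat.

    Tokens are grouped into fixed-size blocks. Each block is keyed by the
    full token sequence up to and including that block, so a block can only
    match when everything before it matches too.
    """

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.seen = set()

    def tokens_to_compute(self, tokens):
        """Return the number of tokens not covered by cached full blocks."""
        reused = 0
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.seen:
                reused = end          # this whole prefix is already cached
            else:
                self.seen.add(key)    # cache it for later requests
        return len(tokens) - reused


system = list(range(100, 108))        # 8 shared "system prompt" tokens
cache = ToyPrefixCache(block_size=4)
first = cache.tokens_to_compute(system + [1, 2, 3])          # cold cache
second = cache.tokens_to_compute(system + [7, 8, 9, 10, 11])  # warm prefix
```

The second request only pays for the tokens after the shared prefix. Note that reuse happens in whole blocks, so a prefix shorter than one block gets no benefit in this model.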
2. Serve more tokens per pass with speculative decoding
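Generating tokens one at a time underuses the GPU. Speculative decoding uses a small, fast "draft" model to propose several tokens, then the large model verifies them all in a single forward pass. When the draft is right, which it often is for easy tokens, you can get 2–3× faster generation, and because the large model checks every token, the output matches what it would have produced on its own. The gain is largest at small batch sizes, where decoding is limited by memory bandwidth rather than compute.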
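The propose-then-verify loop can be sketched with toy models. This is a minimal greedy version under stated assumptions: `target` and `draft` are hypothetical stand-in functions that map a context to the next token, and the verification loop stands in for what a real model would do in one batched forward pass. Sampling-based variants use a rejection step that is not shown here.

```python
def greedy_decode(target, prompt, n_new):
    """Plain decoding: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):], n_new


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding with k drafted tokens per round."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. One target "pass" checks every proposed position.
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's correction
                break
            accepted.append(tok)
        else:
            # All k accepted: the same pass also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):goal], passes


def target(ctx):
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft(ctx):
    # Hypothetical draft: agrees with the target except at every third position.
    guess = target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


plain_tokens, plain_passes = greedy_decode(target, [1], 20)
spec_tokens, spec_passes = speculative_decode(target, draft, [1], 20, k=4)
```

Every token that ends up in the output is one the target model would have chosen itself, so the two token lists match while the speculative run needs fewer target passes.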
3. Send fewer tokens with prompt compression
Most prompts are padded with low-information text. Three reliable techniques:
- Retrieve, don't dump. Instead of stuffing whole documents, retrieve only the passages that matter (good retrieval is the biggest lever here).
- Summarize history. In long chats, replace old turns with a running summary rather than resending everything.
- Prune tokens. Tools like LLMLingua drop low-perplexity tokens the model can infer anyway, often cutting prompts 2–5× with little quality loss.
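The "summarize history" technique from the list above might look like the following. It is a sketch, assuming chat turns are stored as `(role, text)` tuples; `compact_history` and the `summarize` callable are hypothetical names, and the default summarizer simply truncates each old turn where a real system would call a small model.

```python
def _truncate_summary(old_turns, max_chars=40):
    """Placeholder summarizer: keeps the first characters of each old turn."""
    return " | ".join(text[:max_chars] for _, text in old_turns)


def compact_history(turns, keep_last=4, summarize=_truncate_summary):
    """Replace all but the last keep_last turns with a single summary turn.

    turns is a list of (role, text) tuples. summarize maps a list of turns
    to a short string.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = ("system", "Summary of earlier conversation: " + summarize(old))
    return [summary] + list(recent)


turns = [("user", f"message {i}") for i in range(10)]
compacted = compact_history(turns, keep_last=3)
```

The recent turns are kept verbatim because they usually carry the details the next reply depends on; only older context is compressed.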
4. Lower cost-per-token with quantization
Quantization stores weights in fewer bits (8-, 4-, even 2-bit). A 4-bit model is roughly 4× smaller than its 16-bit original, fits on cheaper hardware, and usually decodes faster because less data moves through memory per token. Modern methods keep the accuracy hit small:
- AWQ / GPTQ — post-training 4-bit quantization, no retraining needed.
- QLoRA — fine-tune a quantized model cheaply when you also need to adapt it.
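To make the idea of "fewer bits" concrete, here is plain round-to-nearest quantization of one group of weights, plus the memory arithmetic. Note the assumption: this is the simplest possible scheme, not AWQ or GPTQ, which use calibration data to choose scales and rounding more carefully. All function names are illustrative.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns (codes, scale), where each code is an integer in
    [-(2**(bits-1) - 1), 2**(bits-1) - 1] and weight ~= code * scale.
    """
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, ignoring scales, activations, KV-cache."""
    return n_params * bits / 8 / 1e9


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp16_gb = weight_memory_gb(7e9, 16)   # a hypothetical 7B-parameter model
int4_gb = weight_memory_gb(7e9, 4)
```

The rounding error per weight is bounded by half the scale, which is why small groups with their own scale tend to lose less accuracy than one scale for a whole matrix.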
5. Keep the accelerator full with continuous batching
Static batching waits for a whole batch to finish before starting the next, so one slow request stalls everyone. Continuous (in-flight) batching swaps finished sequences out and new ones in at every step, which keeps GPU utilization high instead of leaving slots idle. It's the default in engines like vLLM and is one of the largest throughput multipliers available.
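A small simulation shows the difference in decode steps. This is a sketch with simplifying assumptions: each request is just a number of output tokens (at least 1), every step generates one token per running sequence, and prefill time and memory limits are ignored. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Sequences with one token left finish on this step and free a slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long requests mixed with short ones, four slots on the accelerator.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
```

With static batching, the short requests in the second batch wait for the first long request to finish; with continuous batching they slot in as soon as a sequence completes.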
Putting it together
These compound. A representative before/after we see often, with each line showing cumulative cost relative to the baseline:
Baseline: 1.00× cost, 100% latency
+ quantization (AWQ 4-bit) → 0.55× cost
+ prefix KV-cache reuse → 0.38× cost
+ continuous batching → 0.30× cost (3.3× throughput)
+ prompt compression → 0.22× cost
# ~75% cheaper, faster, same model.
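The figures above are cumulative, so it can help to back out what each step contributes on its own. The sketch below does that arithmetic using only the numbers listed; the throughput line assumes cost per token is inversely proportional to throughput on fixed hardware.

```python
# Cumulative cost figures copied from the before/after listing above.
cumulative = [
    ("baseline", 1.00),
    ("quantization (AWQ 4-bit)", 0.55),
    ("prefix KV-cache reuse", 0.38),
    ("continuous batching", 0.30),
    ("prompt compression", 0.22),
]


def step_multipliers(stages):
    """Cost multiplier each stage contributes on top of the previous one."""
    return [
        (name, cost / prev_cost)
        for (_, prev_cost), (name, cost) in zip(stages, stages[1:])
    ]


multipliers = step_multipliers(cumulative)

# Multiplying the per-step factors recovers the final cumulative cost.
combined = 1.0
for _, factor in multipliers:
    combined *= factor

total_saving = 1.0 - cumulative[-1][1]   # fraction saved versus baseline
throughput_gain = 1.0 / 0.30             # 0.30x cost is about 3.3x throughput
```

Because the factors multiply rather than add, each later technique saves a smaller absolute amount even when its own multiplier is similar.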
The order matters less than measuring. Wire up observability and cost dashboards first, then turn on one technique at a time and watch the numbers — otherwise you're optimizing blind.
Key takeaways
- Attack token count, cost-per-token, and recomputation independently.
- KV-cache reuse and continuous batching are the biggest, lowest-risk wins.
- Speculative decoding speeds up generation without changing the output, at the price of running a small draft model.
- Always measure with observability before and after each change.