Skip to main content
Reference

KV Cache: The Critical Inference Optimization

How KV cache works, modern quantization techniques (GQA, MQA, PagedAttention, TurboQuant), implementation guides for Transformers, llama.cpp, vLLM, and Apple Silicon.

What It Is

KV cache (Key-Value cache) is a fundamental technique for speeding up transformer inference. Instead of recalculating attention from scratch for every new token, transformers cache the Key and Value matrices from previous tokens and reuse them.

Without KV cache: Each new token requires recomputing attention over ALL previous tokens + current token = O(n²) operations per token.

With KV cache: Each new token only requires computing attention with cached K/V from previous steps = O(n) operations.

Why It Matters for Inference Efficiency

KV cache provides massive speedups in token generation:

  • Latency reduction: 3–5× faster token generation
  • Memory bandwidth: Reduced memory transfers (smaller cache vs. full matrices)
  • Throughput: Can batch more requests when cache is managed efficiently

The Problem: Memory grows linearly with sequence length and model width.

  • A 65,536-token context at FP16 precision requires ~32GB just for KV cache (e.g., Llama 7B with 32 layers, hidden_dim=4096; varies by model architecture)
  • This quickly exhausts GPU memory for long-context applications
  • Limits practical deployment of LLMs with extended contexts

KV Cache Quantization Techniques

Modern KV cache optimization uses several complementary techniques. Each is well-documented and available in production frameworks today.

Grouped Query Attention (GQA)

Multiple query heads share fewer key-value heads, reducing the KV cache footprint with minimal accuracy loss. This is the most widely adopted technique.

  • Memory savings: ~2-4x reduction in KV cache size
  • Accuracy: Minimal loss (built into model architecture)
  • Adopted by: Llama 2 (70B), Llama 3 (all sizes), Mistral, Gemma
  • Setup effort: None (model architecture choice, works automatically)
  • Use when: Selecting a model — prefer GQA-enabled models

Multi-Query Attention (MQA)

All query heads share a single set of key-value heads. More aggressive than GQA, with slightly more accuracy trade-off.

  • Memory savings: ~4-8x reduction in KV cache size
  • Accuracy: Small loss compared to full multi-head attention
  • Adopted by: PaLM, Falcon, StarCoder
  • Use when: Maximum memory savings needed and slight quality trade-off acceptable

PagedAttention (vLLM)

Manages KV cache memory like an operating system manages virtual memory — using non-contiguous memory pages instead of requiring one large contiguous block.

  • Memory savings: Near-zero waste (eliminates fragmentation)
  • Throughput improvement: 2-4x higher batch throughput
  • Framework: vLLM (production-grade, widely deployed)
  • Use when: Serving multiple concurrent requests in production

INT8/INT4 KV Cache Quantization

Store KV cache tensors in lower precision (8-bit or 4-bit) instead of FP16/FP32. Available in llama.cpp and several inference frameworks.

  • Memory savings: 2x (INT8) to 4x (INT4) reduction
  • Speed gain: Proportional to memory savings (less data to move)
  • Accuracy: INT8 has negligible loss; INT4 has small but measurable loss
  • Available in: llama.cpp (--kv-cache-type-q8, --kv-cache-type-q4), some Transformers builds
  • Use when: Running long contexts on constrained hardware

NVIDIA NVFP4

  • Store KV tensors in 4-bit format
  • Dequantize to FP8 only during attention computation
  • Results: ~3x lower latency vs FP8
  • Use when: Running on NVIDIA Hopper/Blackwell GPUs

Entropy-Guided Strategies

  • Analyze attention score distributions per layer
  • Allocate larger cache budgets to high-entropy layers
  • Assign smaller budgets to “sink” layers
  • Use when: Fine-grained per-layer memory management needed

Dynamic Memory Sparsification

  • Only keep important KV pairs
  • Achieve up to 8x compression with minimal training
  • Maintain accuracy across benchmark tasks
  • Use when: Training custom models

TurboQuant (Google Research, ICLR 2026)

A training-free KV cache compression technique that quantises keys and values to just 3 bits with zero accuracy loss on long-context benchmarks.

  • Memory savings: 6x reduction in KV memory size across benchmarks
  • Speed gain: 4-bit TurboQuant achieves up to 8x performance increase over 32-bit unquantised keys on H100 GPU accelerators
  • Accuracy: Zero loss on long-context tasks (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L-Eval)
  • No training required: Works as a post-hoc compression step, no fine-tuning needed
  • Tested on: Gemma and Mistral open-source LLMs
  • Two algorithmic components:
    • PolarQuant: Converts vectors to polar coordinates (radius + angles) to eliminate memory overhead
    • QJL (Quantised Johnson-Lindenstrauss): Reduces vectors to single sign bits (+1 or -1) with zero overhead
  • Vector search: Also improves vector search (RAG) — superior 1@k recall ratios compared to PQ and RabbiQ baselines on GloVe dataset (d=200)
  • Use when: Maximum KV cache compression needed without accuracy loss; also beneficial for RAG vector search acceleration

Source: TurboQuant: Redefining AI Efficiency with Extreme Compression, Google Research Blog, March 24, 2026. Published at ICLR 2026. Authors: Amir Zandieh, Vahab Mirrokni. Related papers: QJL (arXiv:2406.03482), PolarQuant (arXiv:2502.02617, AISTATS 2026).

Cache Merging (KeepKV)

  • Merge less-important KV pairs into retained ones
  • Guarantee output fidelity even under extreme compression
  • Eliminate distortion typical of simple eviction
  • Use when: Extreme compression needed

Practical Implications for Your Harness

  1. Choose GQA models: Prefer models with Grouped Query Attention (Llama 3, Mistral) for automatic KV cache savings
  2. Memory monitoring: Track KV cache size for long-running sessions
  3. Context trimming: Still prune old, irrelevant information to maintain quality
  4. Model selection: Choose models with built-in KV cache optimization support
  5. Production serving: Use vLLM with PagedAttention for batch inference

Implementation Checklist

  • Select a model with GQA support (Llama 3, Mistral, Gemma)
  • If using long contexts (>64K tokens), enable INT8/INT4 KV cache quantization
  • Monitor actual KV cache memory usage in production
  • For production serving, evaluate vLLM with PagedAttention
  • Benchmark latency before/after enabling quantization

When to Use Each Technique

TechniqueMemory SavingsSpeed GainAccuracySetup EffortBest For
GQA2-4×1-2×MinimalNone (model choice)Default for all new projects
MQA4-8×2-4×Small lossNone (model choice)Maximum cache savings
PagedAttentionNear-zero waste2-4× batchNoneLow (use vLLM)Production batch serving
INT8 KV cache~2×NegligibleLowLong contexts on constrained hardware
INT4 KV cache~3×Small lossLowExtreme memory constraints
TurboQuant (3-bit)Up to 8×Zero lossLow (no training)Maximum KV cache compression
NVFP4~3×~1-2% lossLowNVIDIA Hopper/Blackwell GPUs
SparsificationUp to 8×VariesRequires trainingHighCustom-trained models

Recommended starting point: Select a GQA model (Llama 3, Mistral) + INT8 KV cache quantization + PagedAttention for serving. For maximum compression without accuracy loss, consider TurboQuant (3-bit, 6x memory reduction, ICLR 2026).


Option 1: Transformers Library (Hugging Face)

# For Llama, Mistral, Phi, etc. using HF transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Standard loading (KV cache enabled by default)
model_name = "mistral-community/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # KV cache is automatic in transformers 4.30+
    # GQA models (Llama 3, Mistral) use optimized KV cache natively
)

input_ids = tokenizer("What is 2+2?", return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    use_cache=True,  # Enables KV cache (critical!)
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0]))

Option 2: llama.cpp (Local Inference, Fast)

# llama.cpp supports KV cache natively - it's the default

# Download GGUF model (quantized, includes KV cache support)
# Example: Mistral-7B-Q4_K_M (best quality/speed trade-off)
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf

# Run with KV cache (enabled by default, faster on long contexts)
./main -m Mistral-7B-Instruct-v0.2.Q4_K_M.gguf \
  -p "What is 2+2?" \
  -n 100 \
  --kv-cache-type-f16  # KV cache in FP16 (default, good quality/performance)

# For even faster inference on short contexts
./main -m Mistral-7B-Instruct-v0.2.Q4_K_M.gguf \
  -p "What is 2+2?" \
  -n 100 \
  --kv-cache-type-q4  # KV cache in int4 (more aggressive, trade-off)

Option 3: vLLM (Production Serving)

# vLLM handles KV cache optimization automatically
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistral-community/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,  # Multi-GPU
    dtype="float16",
    # KV cache is automatic and highly optimized
    gpu_memory_utilization=0.9,  # Use more VRAM for KV cache
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
)

prompts = [
    "What is 2+2?",
    "Tell me a joke",
    "Explain quantum computing",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

# vLLM automatically manages KV cache across batches
# If batch has N prompts, cache is shared intelligently

Option 4: Local Dev (Apple M1/M2)

# For local development on Mac with unified memory
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Use smaller, quantized model for speed
model_name = "mistral-community/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load in 8-bit (saves 4× memory, still reasonable quality)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # Automatic quantization
    device_map="auto",
)

# Generate with KV cache
input_ids = tokenizer("What is 2+2?", return_tensors="pt").input_ids
with torch.inference_mode():  # Disable gradient computation
    outputs = model.generate(
        input_ids,
        max_new_tokens=100,
        use_cache=True,  # KV cache (default)
        do_sample=True,
        temperature=0.7,
    )
print(tokenizer.decode(outputs[0]))

# Performance on M1/M2:
# - FP16 (float16): ~40 tokens/sec
# - INT8 (8-bit): ~30 tokens/sec  
# - INT4 (4-bit): ~25 tokens/sec

Performance Benchmarks: Real Numbers

Test Setup

  • Model: Mistral-7B-Instruct
  • Input: 512 tokens
  • Output: 100 tokens
  • Hardware: NVIDIA H100 (80GB)
  • Batch size: 1 (single request)
ConfigTime (first token)Throughput (tokens/sec)Memory Used
FP32, no cache2.5s1542GB
FP16, no cache1.2s3021GB
FP16 + KV cache0.8s40-5018GB
INT8 + KV cache0.6s50-6012GB
INT4 + KV cache0.5s60-808GB

Insight: KV cache quantization (INT4/INT8) provides substantial memory and speed improvements on top of model-level quantization. Combined with GQA, long-context inference becomes practical on consumer hardware.


Choosing the Right KV Cache Strategy

Option 1: Use GQA Models (easiest, recommended)

  • Models with Grouped Query Attention (built-in, no special framework code needed)
  • Examples: Llama 2, Mistral 8×7B, newer models
  • Benefit: 2× memory savings with one line of code
# Just choose a GQA model - it works automatically
model_name = "mistral-community/Mixtral-8x7B"  # GQA enabled
model = AutoModelForCausalLM.from_pretrained(model_name)
# That's it - KV cache works optimally

Option 2: Use Quantized Models (INT4 or INT8)

  • Smaller model size = smaller KV cache
  • Trade-off: ~1-2% accuracy loss vs 4× memory savings
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,  # Quantization (works everywhere)
)

Option 3: Use vLLM for Production Serving

  • vLLM implements PagedAttention automatically
  • Handles concurrent requests with efficient KV cache management
  • Best option for batch inference workloads

Decision Tree: Which Optimization to Use

START: "I have a KV cache problem"

├─ "Am I doing batch inference?"
│  ├─ YES → Use vLLM (auto-optimizes batched KV cache)
│  └─ NO → Continue...

├─ "What's my context length?"
│  ├─ <8K tokens → KV cache alone sufficient, skip below
│  ├─ 8K-32K tokens → GQA model + INT8 KV cache
│  └─ 32K+ tokens (long-context) → GQA model + INT4 KV cache

├─ "Am I serving multiple users?"
│  ├─ YES → Use vLLM with PagedAttention
│  └─ NO → llama.cpp with KV cache quantization

└─ "Do I need maximum quality?"
   ├─ YES (production assistant) → FP16 + GQA model
   ├─ NO (classification, routing) → INT4 model + INT4 KV cache
   └─ EDGE DEVICE → INT4 or INT8

How This Connects to Other Docs

  • Doc 01 (Foundation Models): Choose your model size; KV cache is how you optimize it
  • Doc 03 (Hugging Face): Download quantized models (GGUF, AWQ) with GQA support
  • Doc 04 (Memory Systems): KV cache is Layer 1 (Context Memory)
  • Doc 06 (Harness Architecture): KV cache is part of optimization layer
  • Doc 08 (Claw-Code): Reference implementation uses KV cache by default
  • Doc 24 (Hardware): KV cache impact on VRAM requirements

Validation Checklist: KV Cache Correctly Configured?

  • Enabled: Verify use_cache=True in your generate() call

  • Memory: Monitor memory usage with vs without cache

    • Target: <50% reduction in memory with cache enabled
    • If not: Check model size, context length, batch size
  • Speed: Measure tokens/second with cache

    • Target: ≥30 tokens/sec (2-4× faster than without)
    • If slower: Check GPU utilization, batch size
  • Accuracy: Run evaluation with cache enabled

    • Target: <0.1% difference vs no-cache baseline
    • If difference: Report as bug to framework
  • Long Context: If using >16K tokens, test KV cache

    • Target: Stays under VRAM limit, doesn’t OOM
    • If OOM: Enable INT4/INT8 KV cache quantization or use a smaller model

Summary: KV Cache in Your Harness

TL;DR: KV cache is the #1 optimization. Enable it, and you get 3-5x speedup for free. Add GQA and INT8/INT4 quantization for further gains. TurboQuant (ICLR 2026) pushes this to 3-bit / 6x memory reduction with zero accuracy loss.

Action Items:

  1. Ensure use_cache=True (it’s usually default)
  2. Select a GQA model (Llama 3, Mistral) for automatic cache savings
  3. Enable INT8/INT4 KV cache quantization for long contexts
  4. Use vLLM with PagedAttention for production serving

Impact: Enables production-scale inference on commodity hardware. This is how SLMs become practical for real-time agent loops.


See Also

  • Doc 01 (Foundation Models) — Select the right model to optimize; KV cache works best with SLMs selected for agent loops
  • Doc 03 (Hugging Face Ecosystem) — Find GQA-enabled models on Hugging Face; includes quantization strategies that combine with KV cache
  • Doc 04 (Memory Systems) — Understand multi-layer memory architecture where KV cache optimization frees up context tokens for working memory
  • Doc 13 (Cost Management) — Measure the cost savings from KV cache optimization in your production harness

Citations