SGLang Prefix Cache Optimization

Your KV cache is
wasting 71×
the memory.

SGLang's eviction policy was built for standard MHA models. For MLA models like DeepSeek V2/V3/R1, each token stores a 576-dim compressed latent — not 40,960-dim full K/V. We fix the eviction logic to match reality.

71×
compression ratio
14×
more tokens cached
35/35
tests passing

MHA-era eviction in
an MLA world

SGLang stores latent vectors, but evicts like it's still storing full K/V heads.

PER-TOKEN MEMORY FOOTPRINT (DeepSeek V3, bf16)
MHA equiv.
40,960 floats
80 KB
MLA actual
576 floats
1.1 KB
The eviction policy treats both rows identically.
Only the bottom row is real.
⚖️

False equivalence

RadixCache assigns identical eviction weight to every token regardless of actual memory cost. MLA tokens are up to 71× cheaper to store.

🪣

Over-aggressive eviction

The 20% free-space threshold was tuned for MHA models. For MLA, it evicts hundreds of valuable prefix tokens unnecessarily.

📉

Cache hit rate degradation

Excessive eviction destroys prefix reuse — the main source of TTFT speedup in long-context serving workloads.

See the difference

Explore memory capacity, eviction behavior, and cache hit rates across workloads.

Memory Capacity
Eviction Behavior
Cache Hit Simulator
80 GB
40 GB
Standard MHA eviction
prefix tokens cacheable
MLA-aware (this project)
prefix tokens cacheable
Cache capacity
MHA
+MLA extra
KV Pool (300 slots visualised) Idle
Evicted: 0
Retained: 0
Effective hit rate:
CACHE HIT RATE
0%
Tokens saved: 0

Minimal surface,
maximum impact

Three targeted changes. Backward-compatible. Non-MLA models are completely untouched.

Before

SGLang Scheduler
_check_memory_and_evict()
existing
RadixCache.evict(N)
Always evicts N tokens · fixed 20% free threshold
unchanged
MLATokenToKVPool
576-dim latent vectors (already compact)
existing

After

SGLang Scheduler
_check_memory_and_evict()
existing
MLAEvictionBudget.adjust(N)
Reduces eviction count by compression ratio
new
RadixCache.evict(adjusted)
target_free_ratio = 0.20 / compression_ratio
patched
MLATokenToKVPool
Unchanged — already stores latent vectors
existing

Three files changed.

The optimization is a targeted eviction adjustment. No storage format changes, no correctness risk.

radix_cache.py · threshold
python
# Before: fixed threshold
self._target_free_ratio = 0.20

# After: MLA-aware
self._target_free_ratio = max(
    0.05,
    0.20 / mla_compression_ratio
)
# V3: 0.20 / 71 ≈ 0.003
# → 99.7% utilization is now safe
scheduler.py · eviction count
python
# Before: always evict N tokens
tree_cache.evict(
    EvictParams(num_tokens=N)
)

# After: compression-adjusted
adjusted = budget.adjust_eviction_count(
    requested_eviction=N,
    current_cached=cached,
    current_free=free,
)
tree_cache.evict(EvictParams(adjusted))
Quick start
bash
# Run all unit tests
pip install pytest torch
python -m pytest test_mla_radix_cache.py -v

# CPU benchmark
python bench_mla_radix_cache.py

# GPU validation (A100+)
python gpu_validation.py \
  --mode validate \
  --model deepseek-ai/DeepSeek-V2-Lite
Standalone usage
python
from mla_radix_cache import (
    MLARadixCache, MLAModelConfig
)
import torch

config = MLAModelConfig.deepseek_v3()
cache = MLARadixCache(config, pool_size=100000)

cache.insert(list(range(100)), torch.arange(100))
r = cache.match_prefix([0,1,2,...,50,999])
print(r.matched_len)  # 50

CPU workload benchmarks

Measured on DeepSeek-V3 config. All numbers from bench_mla_radix_cache.py.

Baseline (MHA eviction)
MLA-aware (this project)
Cache hit rate
Token reuse rate