SGLang's eviction policy was built for standard MHA models. For MLA models like DeepSeek V2/V3/R1, each token stores a 576-dim compressed latent — not 40,960-dim full K/V. We fix the eviction logic to match reality.
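The 71× figure quoted below is just the ratio of those two per-token footprints. A quick sanity check using DeepSeek-V3's published attention dimensions (128 heads, 192-dim keys including RoPE, 128-dim values, 512-dim KV latent plus a shared 64-dim RoPE component):

```python
# Per-token KV-cache footprint, counted in stored dimensions (DeepSeek-V3)
num_heads = 128
k_head_dim = 192   # 128 "nope" dims + 64 RoPE dims per key head
v_head_dim = 128

mha_dims = num_heads * (k_head_dim + v_head_dim)  # 40,960 dims/token
mla_dims = 512 + 64                               # 576-dim compressed latent

print(mha_dims / mla_dims)  # ≈ 71.1
```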
SGLang stores latent vectors, but evicts like it's still storing full K/V heads.
RadixCache assigns identical eviction weight to every token regardless of actual memory cost. MLA tokens are up to 71× cheaper to store.
The 20% free-space threshold was tuned for MHA models. For MLA, it evicts hundreds of valuable prefix tokens unnecessarily.
Excessive eviction destroys prefix reuse — the main source of TTFT speedup in long-context serving workloads.
[Interactive demo: explore memory capacity, eviction behavior, and cache hit rates across workloads.]
Three targeted changes. Backward-compatible. Non-MLA models are completely untouched.
The optimization is a targeted eviction adjustment. No storage format changes, no correctness risk.
```python
# Before: fixed threshold
self._target_free_ratio = 0.20

# After: MLA-aware
self._target_free_ratio = max(
    0.05,  # safety floor
    0.20 / mla_compression_ratio,
)
# V3: 0.20 / 71 ≈ 0.003, clamped to the 0.05 floor
# → 95% utilization is now safe
```
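Spot-checking the formula for both regimes (plain Python, just evaluating the expression above):

```python
for name, ratio in [("MHA", 1.0), ("DeepSeek-V3 MLA", 40960 / 576)]:
    target = max(0.05, 0.20 / ratio)
    print(f"{name}: target_free_ratio = {target:.3f}")
# MHA: target_free_ratio = 0.200            (unchanged)
# DeepSeek-V3 MLA: target_free_ratio = 0.050  (the floor binds)
```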
```python
# Before: always evict N tokens
tree_cache.evict(
    EvictParams(num_tokens=N)
)

# After: compression-adjusted
adjusted = budget.adjust_eviction_count(
    requested_eviction=N,
    current_cached=cached,
    current_free=free,
)
tree_cache.evict(EvictParams(num_tokens=adjusted))
```
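For intuition, here is a minimal sketch of what a budget helper like the one called above could look like. The class name `MLAEvictionBudget` and the exact rule are illustrative assumptions, not the shipped implementation: the idea is to cap a legacy MHA-sized request by what is actually needed to restore the compression-adjusted free-space target.

```python
class MLAEvictionBudget:
    """Illustrative sketch (hypothetical): cap an eviction request by the
    number of slots actually needed to reach the free-space target."""

    def __init__(self, target_free_ratio: float):
        # Already compression-adjusted, e.g. max(0.05, 0.20 / 71) for V3.
        self.target_free_ratio = target_free_ratio

    def adjust_eviction_count(
        self,
        requested_eviction: int,
        current_cached: int,
        current_free: int,
    ) -> int:
        total = current_cached + current_free
        # Slots needed to reach the target amount of free space.
        needed = max(0, int(total * self.target_free_ratio) - current_free)
        # Never evict more than requested, needed, or actually cached.
        return min(requested_eviction, needed, current_cached)
```

With a 100K-slot pool at 96% utilization (4,000 free) and a 5% target, a legacy request to evict 16,000 tokens shrinks to 1,000: the target is 5,000 free slots, and only 1,000 more are needed.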
```bash
# Run all unit tests
pip install pytest torch
python -m pytest test_mla_radix_cache.py -v

# CPU benchmark
python bench_mla_radix_cache.py

# GPU validation (A100+)
python gpu_validation.py \
    --mode validate \
    --model deepseek-ai/DeepSeek-V2-Lite
```
```python
from mla_radix_cache import (
    MLARadixCache,
    MLAModelConfig,
)
import torch

config = MLAModelConfig.deepseek_v3()
cache = MLARadixCache(config, pool_size=100000)

cache.insert(list(range(100)), torch.arange(100))
r = cache.match_prefix(list(range(50)) + [999])
print(r.matched_len)  # 50
```
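The query shares its first 50 tokens with the cached sequence; the trailing 999 breaks the match, so `matched_len` comes back as 50 and those tokens' latents are served from cache instead of being recomputed.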
Measured on DeepSeek-V3 config. All numbers from bench_mla_radix_cache.py.