Characterizing LLM KV Cache Workloads in Production

01/10/2025 16 min

Listen "Characterizing LLM KV Cache Workloads in Production"

Episode Synopsis

The June 2025 paper characterizes and optimizes the **Key-Value Cache (KV$)** workload patterns associated with serving large language models (LLMs) at a major cloud provider. Using **real-world production traces** from customer-facing (to-C) and business-facing (to-B) workloads, the authors analyze KV$ reuse behavior, finding that reuses are heavily skewed and that single-turn requests matter as much as multi-turn requests, especially in **API-dominated workloads**. Crucially, the analysis shows that **KV$ lifespans are ephemeral** and that reuse probability follows predictable exponential distributions within specific request categories. Based on these findings, the researchers propose a **workload-aware cache eviction policy** that significantly improves the cache hit ratio and reduces the query time to first token (TTFT) compared to standard policies such as LRU and LFU.

Source: https://arxiv.org/pdf/2506.02634v1
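To make the idea concrete, here is a minimal sketch (not the paper's actual implementation) of what a workload-aware eviction policy could look like: each cached KV entry is tagged with a request category, its reuse probability is assumed to decay exponentially with age using a per-category decay constant, and the entry with the lowest estimated reuse probability is evicted instead of simply the least recently used one. All names below (`WorkloadAwareKVCache`, `category_tau`, the default decay constant) are illustrative assumptions.

```python
import math
import time


class WorkloadAwareKVCache:
    """Illustrative sketch of a workload-aware KV$ eviction policy.

    Assumption (not taken from the paper's text): each request category has a
    fitted exponential reuse curve P(reuse | age) ~ exp(-age / tau), and the
    cache evicts the entry with the lowest estimated reuse probability.
    """

    def __init__(self, capacity, category_tau):
        self.capacity = capacity          # max number of cached KV entries
        self.category_tau = category_tau  # per-category decay constant (seconds)
        self.entries = {}                 # key -> (category, last_access_time)

    def _reuse_score(self, category, last_access, now):
        # Reuse probability decays exponentially with the entry's age,
        # with a category-specific time constant.
        tau = self.category_tau.get(category, 60.0)  # hypothetical default
        age = now - last_access
        return math.exp(-age / tau)

    def access(self, key, category):
        """Return True on a cache hit, False on a miss (KV would be recomputed)."""
        now = time.monotonic()
        if key in self.entries:
            self.entries[key] = (category, now)
            return True
        if len(self.entries) >= self.capacity:
            self._evict(now)
        self.entries[key] = (category, now)
        return False

    def _evict(self, now):
        # Evict the entry least likely to be reused, not simply the oldest (LRU).
        victim = min(
            self.entries,
            key=lambda k: self._reuse_score(*self.entries[k], now),
        )
        del self.entries[victim]


# Hypothetical usage: multi-turn chat entries get a longer reuse horizon
# than single-shot API entries, so they survive eviction longer.
cache = WorkloadAwareKVCache(
    capacity=1024,
    category_tau={"to_C_multi_turn": 300.0, "to_B_api": 30.0},
)
cache.access("conv-1-prefix", "to_C_multi_turn")
```

The design choice this sketch illustrates is the core contrast with LRU/LFU: eviction order is driven by category-conditioned reuse statistics learned from the workload rather than by recency or raw frequency alone.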
