KVQuant: LLM Inference with KV Cache Quantization

08/08/2025 16 min

Listen "KVQuant: LLM Inference with KV Cache Quantization"

Episode Synopsis

Three research papers are reviewed:

1) https://arxiv.org/pdf/2401.18079 - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
2) https://arxiv.org/pdf/2402.02750 - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
3) https://arxiv.org/pdf/2502.04420 - KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

These sources collectively discuss methods for quantizing Key-Value (KV) caches in large language models (LLMs) to reduce memory consumption and improve inference efficiency, especially at long context lengths. They explore various quantization strategies, highlighting the importance of per-channel quantization for Keys and per-token quantization for Values, which follows from their distinct data distributions. Key advancements include pre-RoPE quantization, non-uniform quantization, and dense-and-sparse techniques that maintain accuracy at low precisions such as 2-bit and 3-bit. The papers also detail custom kernel implementations and offline calibration methods that minimize computational overhead, demonstrating significant throughput gains and larger batch sizes while preserving model performance across diverse benchmarks and LLM architectures.
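To make the per-channel-vs-per-token distinction and the dense-and-sparse idea concrete, here is a minimal NumPy sketch. It is not code from any of the papers (which use non-uniform codes, pre-RoPE quantization, and custom CUDA kernels); the tensor shapes, the `dequantized_uniform` and `dense_and_sparse` helpers, and the 1% outlier fraction are illustrative assumptions. It only shows why grouping Keys by channel and Values by token, and isolating a few outliers in full precision, reduces quantization error when channel magnitudes vary widely.

```python
import numpy as np

def dequantized_uniform(x, num_bits, axis):
    """Round-trip uniform asymmetric quantization along `axis` and return
    the dequantized tensor, so the quantization error can be inspected."""
    qmax = 2 ** num_bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.where(x_max > x_min, (x_max - x_min) / qmax, 1.0)
    q = np.clip(np.round((x - x_min) / scale), 0, qmax)
    return q * scale + x_min

def dense_and_sparse(x, num_bits, axis, outlier_frac=0.01):
    """Dense-and-sparse sketch: keep the largest-magnitude fraction of
    entries in full precision (sparse part) and quantize the rest (dense)."""
    threshold = np.quantile(np.abs(x), 1.0 - outlier_frac)
    outliers = np.abs(x) >= threshold
    dense = np.where(outliers, 0.0, x)          # remove outliers before picking scales
    approx = dequantized_uniform(dense, num_bits, axis)
    return np.where(outliers, x, approx)        # splice outliers back in at full precision

# Toy per-head KV cache slices of shape [num_tokens, head_dim] (illustrative).
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 128)) * np.linspace(0.1, 4.0, 128)  # strong channel-wise spread, as reported for Keys
values = rng.normal(size=(64, 128))                              # more uniform spread, as reported for Values

# Keys: per-channel (reduce over the token axis, one scale per channel).
# Values: per-token (reduce over the channel axis, one scale per token).
k_err_channel = np.abs(keys - dequantized_uniform(keys, 3, axis=0)).mean()
k_err_token   = np.abs(keys - dequantized_uniform(keys, 3, axis=1)).mean()
v_err_token   = np.abs(values - dequantized_uniform(values, 3, axis=1)).mean()
k_err_ds      = np.abs(keys - dense_and_sparse(keys, 3, axis=0)).mean()

print(f"Key error,   3-bit per-channel        : {k_err_channel:.4f}")
print(f"Key error,   3-bit per-token (mismatch): {k_err_token:.4f}")
print(f"Value error, 3-bit per-token          : {v_err_token:.4f}")
print(f"Key error,   3-bit dense-and-sparse   : {k_err_ds:.4f}")
```

Running this shows per-channel grouping giving a noticeably lower Key error than per-token grouping, because each channel's scale adapts to that channel's magnitude instead of being dominated by a few large channels; keeping ~1% of outliers in full precision shrinks the remaining quantization range and lowers the error further, which is the intuition behind the dense-and-sparse technique discussed in the episode.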
