vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory

08/08/2025 12 min

Listen "vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory"

Episode Synopsis

This episode introduces PagedAttention, an innovative attention algorithm, and vLLM, a high-throughput serving system for large language models (LLMs). The core problem addressed is the inefficient memory management of the Key-Value (KV) cache in existing LLM serving systems, which leads to significant memory waste and limits batch sizes. Inspired by operating system virtual memory and paging techniques, PagedAttention enables the KV cache to be stored in non-contiguous memory blocks, significantly reducing fragmentation and allowing flexible memory sharing. The paper highlights how vLLM, built upon PagedAttention, achieves 2-4 times higher throughput compared to state-of-the-art systems by optimizing KV cache utilization and supporting complex decoding scenarios such as parallel sampling and beam search.
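To make the paging analogy concrete, here is a minimal, hypothetical Python sketch (not vLLM's actual implementation or API) of the core idea: a per-sequence block table maps logical KV-cache blocks to non-contiguous physical blocks, and reference counting lets multiple sequences (e.g., parallel samples of the same prompt) share blocks.

```python
# Sketch of PagedAttention-style KV-cache block management.
# Names and structure are illustrative assumptions, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)


class PhysicalBlockPool:
    """Fixed pool of physical KV-cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of free physical blocks
        self.ref_count = [0] * num_blocks     # >1 means the block is shared

    def allocate(self) -> int:
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1            # another sequence reuses this block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


class Sequence:
    """Per-sequence block table: logical block i -> physical block index."""

    def __init__(self, pool: PhysicalBlockPool):
        self.pool = pool
        self.block_table: list[int] = []      # physical blocks need not be contiguous
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the previous one is full,
        # so at most one partially filled block per sequence is wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Sharing for parallel sampling: the forked sequence reuses the
        # parent's prompt blocks (a real system would copy-on-write the
        # last, partially filled block before either sequence writes to it).
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.pool.share(block)
        return child


if __name__ == "__main__":
    pool = PhysicalBlockPool(num_blocks=64)
    seq = Sequence(pool)
    for _ in range(40):                       # 40-token prompt -> 3 blocks
        seq.append_token()
    sample = seq.fork()                       # parallel sample shares prompt blocks
    print(seq.block_table, sample.block_table, pool.ref_count[:4])
    # -> [0, 1, 2] [0, 1, 2] [2, 2, 2, 0]
```

Because blocks are fixed-size and allocated on demand, the only waste is the unfilled tail of each sequence's last block, and shared prefixes are stored once, which is where the throughput gains over contiguous, per-sequence KV-cache allocation come from.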
