vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory

08/08/2025 12 min

Listen "vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory"

Episode Synopsis

This episode introduces PagedAttention, an innovative attention algorithm, and vLLM, a high-throughput serving system for large language models (LLMs). The core problem addressed is the inefficient memory management of the Key-Value (KV) cache in existing LLM serving systems, which leads to significant memory waste and limits batch sizes. Inspired by operating system virtual memory and paging techniques, PagedAttention enables the KV cache to be stored in non-contiguous memory blocks, significantly reducing fragmentation and allowing flexible memory sharing. The paper highlights how vLLM, built upon PagedAttention, achieves 2-4 times higher throughput compared to state-of-the-art systems by optimizing KV cache utilization and supporting complex decoding scenarios such as parallel sampling and beam search.
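To make the paging analogy concrete, here is a minimal, hypothetical Python sketch (not vLLM's actual implementation or API) of the core idea: a per-sequence block table maps logical KV-cache blocks to non-contiguous physical blocks, and reference counting lets multiple sequences (e.g., parallel samples of the same prompt) share blocks.

```python
# Sketch of PagedAttention-style KV-cache block management.
# Names and structure are illustrative assumptions, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)


class PhysicalBlockPool:
    """Fixed pool of physical KV-cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of free physical blocks
        self.ref_count = [0] * num_blocks     # >1 means the block is shared

    def allocate(self) -> int:
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1            # another sequence reuses this block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


class Sequence:
    """Per-sequence block table: logical block i -> physical block index."""

    def __init__(self, pool: PhysicalBlockPool):
        self.pool = pool
        self.block_table: list[int] = []      # physical blocks need not be contiguous
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the previous one is full,
        # so at most one partially filled block per sequence is wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Sharing for parallel sampling: the forked sequence reuses the
        # parent's prompt blocks (a real system would copy-on-write the
        # last, partially filled block before either sequence writes to it).
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.pool.share(block)
        return child


if __name__ == "__main__":
    pool = PhysicalBlockPool(num_blocks=64)
    seq = Sequence(pool)
    for _ in range(40):                       # 40-token prompt -> 3 blocks
        seq.append_token()
    sample = seq.fork()                       # parallel sample shares prompt blocks
    print(seq.block_table, sample.block_table, pool.ref_count[:4])
    # -> [0, 1, 2] [0, 1, 2] [2, 2, 2, 0]
```

Because blocks are fixed-size and allocated on demand, the only waste is the unfilled tail of each sequence's last block, and shared prefixes are stored once, which is where the throughput gains over contiguous, per-sequence KV-cache allocation come from.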
