LLM Query Scheduling with Prefix Reuse and Latency Constraints

11/02/2025 13 min Season 1 Episode 3

Episode Synopsis

Research paper: https://arxiv.org/pdf/2502.04677

Authors: Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, and Aman Gupta

Introduction

In this episode, we explore the challenge of efficiently deploying large language models (LLMs) in online settings, where strict latency constraints, such as time-to-first-token (TTFT) and time-per-output-token (TPOT), must be met. As demand for AI-generated content grows, optimizing inference performance becomes a critical bottleneck.

Key Topics Covered

The Challenge of Query Scheduling: Existing scheduling strategies like First-Come-First-Serve (FCFS) and Longest-Prefix-Match (LPM) struggle to balance efficiency and latency.

Prefix Reuse with RadixAttention: A technique that stores and reuses shared prefixes across queries using a radix tree structure, reducing computational overhead.

The NP-Hard Nature of Scheduling: The paper establishes that optimizing scheduling under TTFT constraints is computationally challenging.

Introducing k-LPM: A novel scheduling algorithm that balances prefix reuse and fairness, outperforming existing methods in reducing TTFT.

Empirical Validation: Real-world evaluations show that k-LPM significantly reduces P99 TTFT, making it a promising solution for large-scale LLM inference.

Conclusion

This research highlights the need for advanced scheduling strategies to improve LLM efficiency in real-world applications. Tune in to learn how k-LPM is pushing the boundaries of AI inference optimization!
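To make the trade-off between prefix reuse and fairness in the synopsis concrete, here is a minimal Python sketch. It rests on several assumptions that are not stated in the synopsis. The cache is a plain token-level trie standing in for a radix tree (no edge compression, no eviction). The hybrid rule, k longest-prefix-match picks followed by one first-come-first-serve pick, is one plausible reading of how k-LPM might balance the two goals, and the paper's exact rule may differ. The names PrefixCache and schedule are hypothetical and do not come from the paper or from any library.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Token-level trie; a simplified stand-in for a radix tree."""

    def __init__(self):
        self.root = TrieNode()

    def match_len(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a processed query so later queries can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())


def schedule(queries, k):
    """Order queries with k prefix-match picks, then one FCFS pick (assumed rule).

    queries: list of token lists, indexed by arrival order.
    Returns a list of (query_index, reused_prefix_length) in processing order.
    k = 0 degenerates to pure FCFS; a very large k behaves like pure LPM.
    """
    cache = PrefixCache()
    pending = list(range(len(queries)))  # kept in arrival order
    order = []
    step = 0
    while pending:
        step += 1
        if step % (k + 1) == 0:
            # Fairness step: serve the oldest waiting query.
            pick = pending[0]
        else:
            # Reuse step: serve the query with the longest cached prefix.
            # max() returns the first maximum, so ties go to the oldest query.
            pick = max(pending, key=lambda i: cache.match_len(queries[i]))
        pending.remove(pick)
        order.append((pick, cache.match_len(queries[pick])))
        cache.insert(queries[pick])
    return order


# Two interleaved prompt families, "s..." and "t...", in arrival order.
example = [["s", "a"], ["t", "b"], ["s", "c"], ["t", "d"], ["s", "e"]]
for k in (0, 2, 100):
    print(k, schedule(example, k))
```

In this toy run, a large k keeps serving the "s" family while its prefix is cached, so the second arrival waits longer than it does with k = 2. That is the kind of tail-latency effect the synopsis attributes to pure prefix matching. The sketch ignores batching, cache capacity, and actual TTFT measurement.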

More episodes of the podcast Paper Bytes

TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest 06/03/2025

Action Speaks Louder Than Words Trillion-Parameter Sequential Transducers for Generative Recommendations 20/02/2025

Modern Recommender Systems Using Generative Models (Gen-RecSys) 16/02/2025

Mutation-Guided LLM-based Test Generation at Meta 10/02/2025

360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation 03/02/2025
