Tempo: SLO-Aware LLM Serving Maximizing Service Gain

10/11/2025 14 min


Episode Synopsis

This April 24, 2025 academic paper introduces **Tempo**, a novel scheduling system designed to optimize Large Language Model (LLM) serving by addressing the wide variety of Service Level Objectives (**SLOs**) in modern LLM applications. The authors categorize requests into three types—**latency-sensitive**, **throughput-intensive**, and **collective**—each with distinct performance requirements that existing schedulers fail to manage effectively. Tempo maximizes "service gain" by allocating just enough serving bandwidth to meet each request's SLO, using a **hybrid scheduling strategy** that relies on lightweight prediction models for conservative initial estimates of response length and on **dependency-graph matching** for complex workflows. Evaluations demonstrate that Tempo significantly outperforms state-of-the-art systems in both service gain and **SLO goodput** across diverse workloads and models.

Source: April 24, 2025. "Tempo: Application-aware LLM Serving with Mixed SLO Requirements." https://arxiv.org/pdf/2504.20068
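To make the "service gain" idea concrete, here is a minimal illustrative sketch—not the paper's actual algorithm—of an SLO-aware scheduler that prioritizes requests by gain earned per unit of serving bandwidth. The three request classes, the gain weights, and the `service_gain_density` formula are all hypothetical assumptions for illustration; Tempo's real formulation is in the paper.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    priority: float                               # negated gain density (min-heap)
    name: str = field(compare=False)
    slo_type: str = field(compare=False)          # "latency", "throughput", or "collective"
    predicted_tokens: int = field(compare=False)  # conservative response-length estimate
    deadline_s: float = field(compare=False)      # SLO deadline relative to arrival

def service_gain_density(slo_type: str, predicted_tokens: int, deadline_s: float) -> float:
    # Hypothetical gain model: each SLO class earns a fixed gain when its SLO is met;
    # density = gain per predicted token of bandwidth, weighted by deadline urgency.
    gain = {"latency": 3.0, "throughput": 1.0, "collective": 2.0}[slo_type]
    return gain / (predicted_tokens * max(deadline_s, 1e-3))

def schedule(requests):
    """Order requests so scarce serving bandwidth goes where it buys the most gain."""
    heap = []
    for name, slo_type, tokens, deadline in requests:
        density = service_gain_density(slo_type, tokens, deadline)
        # heapq is a min-heap, so negate the density for max-gain-first ordering.
        heapq.heappush(heap, Request(-density, name, slo_type, tokens, deadline))
    return [heapq.heappop(heap).name for _ in range(len(heap))]

order = schedule([
    ("chatbot",   "latency",    64,  0.5),   # tight deadline, short response
    ("batch-job", "throughput", 512, 30.0),  # deadline-tolerant bulk request
    ("agent-dag", "collective", 256, 5.0),   # workflow of dependent LLM calls
])
print(order)  # latency-sensitive work first, bulk throughput work last
```

In this toy model the short, urgent chat request jumps ahead of the bulk batch job, mirroring the episode's point that a scheduler should spend just enough bandwidth on each request class rather than serving them first-come, first-served.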