LLM Query Scheduling with Prefix Reuse and Latency Constraints

Paper Bytes by Sunil & Jiten

S01 E03

13:21

Episode notes

Research paper: https://arxiv.org/pdf/2502.04677

Authors: Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, and Aman Gupta

Introduction

In this episode, we explore the challenge of efficiently deploying large language models (LLMs) in online settings, where strict latency constraints, such as time-to-first-token (TTFT) and time-per-output-token (TPOT), must be met. As demand for AI-generated content grows, inference performance becomes a critical bottleneck.

Key Topics Covered

The Challenge of Query Scheduling: Existing scheduling strategies like First-Come-First-Serve (FCFS) and Longest-Prefix-Match (LPM) struggle to bala ...
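
To make the two baseline policies named above concrete, here is a minimal Python sketch of how FCFS and LPM might order the same waiting queue. It is an illustration only, not the method from the paper: the `Query` class, the helper names, and the toy token sequences are all hypothetical, and the prefix cache is modeled as a plain list of cached token sequences rather than a real KV cache.

```python
from dataclasses import dataclass


@dataclass
class Query:
    qid: str          # hypothetical identifier for the query
    arrival: float    # arrival time in seconds
    tokens: tuple     # prompt tokens (toy strings here)


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cached_prefix_len(query, cached_sequences):
    """Longest prefix of the query that matches any cached sequence."""
    return max(
        (shared_prefix_len(query.tokens, c) for c in cached_sequences),
        default=0,
    )


def fcfs_order(queue):
    """First-Come-First-Serve: earliest arrival first."""
    return sorted(queue, key=lambda q: q.arrival)


def lpm_order(queue, cached_sequences):
    """Longest-Prefix-Match: most cache reuse first, arrival breaks ties."""
    return sorted(
        queue,
        key=lambda q: (-cached_prefix_len(q, cached_sequences), q.arrival),
    )


# Assumed toy state: one cached system prompt and three waiting queries.
cache = [("sys", "You", "are", "helpful")]
queue = [
    Query("a", 0.0, ("hello", "world")),
    Query("b", 1.0, ("sys", "You", "are", "helpful", "Q1")),
    Query("c", 2.0, ("sys", "You", "are", "terse")),
]

fcfs_ids = [q.qid for q in fcfs_order(queue)]
lpm_ids = [q.qid for q in lpm_order(queue, cache)]
print("FCFS:", fcfs_ids)
print("LPM: ", lpm_ids)
```

In this toy queue, query "a" arrives first but shares no prefix with the cache, so the LPM ordering serves it last, while FCFS ignores the cache entirely. That is the kind of tension between prefix reuse and latency that the episode title refers to.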

Keywords

research, LLM
