Cloud & AI Infrastructure

Faster Together: Train and Deploy a Speculative Decoding Model for Low-Latency LLM Inference

with Amit Kushwaha

Thursday 9 July 17:30 – 19:30 Room M2 (40 Seats)

About This Session

Dive deep into the theory and practice of low-latency inference by deploying NVIDIA TensorRT-LLM with advanced speculative decoding techniques. You'll train an Eagle-3 draft head to propose candidate tokens efficiently, serve it, and benchmark it with AIPerf to quantify how much these strategies reduce latency.

Topics

  • AI Models
  • Tokenomics