About This Session
Dive deep into the theory and practice of low-latency inference by deploying NVIDIA TensorRT-LLM with advanced speculative decoding techniques. You'll train an Eagle-3 draft head to propose candidate tokens efficiently, serve it, and benchmark it with AIPerf to quantify how much speculative decoding reduces latency.
Topics
- AI Models
- Tokenomics