Starts
Thu 9 Jul, 15:30
Ends
Thu 9 Jul, 17:30
About This Workshop
Dive deep into the theory and practice of low-latency inference by deploying NVIDIA TensorRT-LLM with advanced speculative decoding techniques. You'll train an Eagle-3 draft head to propose candidate tokens efficiently, serve it alongside the target model, and benchmark the deployment with AIPerf to quantify how much these strategies reduce latency.
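For a feel of the core idea before the session, here is a minimal toy sketch of greedy speculative decoding (draft, then verify). It is not TensorRT-LLM or Eagle-3 code: the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain Python functions over integer tokens, and the draft here is a separate function rather than a head attached to the target model. A real server would verify all proposed tokens in one batched forward pass of the target model; the sketch counts each such pass as one round.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next greedy token


def greedy_decode(target: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call (one 'forward pass') per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop. Returns (tokens, verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1) The cheap draft proposes k candidate tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) The target verifies the proposal. In a real system this is a
        #    single batched pass; here we count it as one round.
        rounds += 1
        accepted: List[Token] = []
        for candidate in proposal:
            expected = target(tokens + accepted)
            if candidate == expected:
                accepted.append(candidate)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every candidate matched, so the target contributes a bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal], rounds


def toy_target(context: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Hypothetical draft: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


if __name__ == "__main__":
    baseline = greedy_decode(toy_target, [0], 12)
    output, rounds = speculative_decode(toy_target, toy_draft, [0], 12, k=4)
    print("same output as baseline:", output == baseline)
    print("target rounds:", rounds, "vs baseline target calls:", 12)
```

Because a candidate is kept only when it matches what the target would have produced, the output is identical to plain greedy decoding with the target; the potential latency win comes from needing fewer sequential target passes when the draft guesses well.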