Tour de Force: LLM Inference Optimization from Simple to Sophisticated

About This Session

Azure OpenAI and similar managed APIs are the right default for serving language models. But they don't cover every case. Maybe you need to deploy in a region where your model isn't available yet, you want to run a Qwen or Mistral variant that no provider hosts, or you've fine-tuned a model and there's simply no API to call. At that point, you're self-hosting on GPUs.

Making your GPUs go brrr is complex. Efficient LLM inference requires navigating a maze of optimization techniques, each with different trade-offs. This session provides a practical journey through inference optimizations, clearly categorized by implementation effort. We'll explore techniques across three levels:

- Model choices (start here): model selection, quantization, smart routing
- Library-level improvements (using PyTorch-based frameworks like vLLM, SGLang, TensorRT-LLM): continuous batching, KV-cache management
- Custom implementations: speculative decoding with custom draft heads, disaggregated inference, fine-tuning smaller models

The session covers practical trade-offs and key metrics: time to first token, inter-token latency, and cost per token. Whether you are deploying your first model or optimizing at scale, this talk delivers actionable insights into which techniques to prioritize for deeper investigation.
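As a rough illustration of the three metrics named above, here is a minimal Python sketch of how they might be computed from per-token arrival timestamps. The function names, the `LatencyMetrics` type, and the simple cost model (GPU hourly price divided by sustained throughput) are assumptions made for illustration, not definitions taken from the session.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft_s: float      # time to first token, in seconds
    mean_itl_s: float  # mean inter-token latency, in seconds


def latency_metrics(request_start: float, token_times: Sequence[float]) -> LatencyMetrics:
    """Compute TTFT and mean inter-token latency for one request.

    request_start and token_times are timestamps in seconds on the same clock
    (hypothetical inputs; a real server would record them per streamed token).
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return LatencyMetrics(ttft_s=ttft, mean_itl_s=mean_itl)


def cost_per_token(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Assumed cost model: hourly GPU price spread over sustained token throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return gpu_cost_per_hour / (tokens_per_second * 3600.0)


if __name__ == "__main__":
    m = latency_metrics(request_start=0.0, token_times=[0.5, 0.6, 0.7])
    print(f"TTFT: {m.ttft_s:.3f} s, mean ITL: {m.mean_itl_s:.3f} s")
    print(f"Cost per token: {cost_per_token(3.6, 1000.0):.2e}")
```

Time to first token and inter-token latency are measured per request, while cost per token in this sketch depends on aggregate throughput, which is one reason the three can pull in different directions.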