About This Session
Azure OpenAI and similar managed APIs are the right default for serving language models. But they don't cover every case. Maybe you need to deploy in a region where your model isn't available yet, you want to run a Qwen or Mistral variant that no provider hosts, or you've fine-tuned a model and there's simply no API to call. At that point, you're self-hosting on GPUs.

Making your GPUs go brrr is complex. Efficient LLM inference requires navigating a maze of optimization techniques, each with different trade-offs. This session provides a practical journey through inference optimizations, clearly categorized by implementation effort.

We'll explore techniques across three levels:
- Model choices (start here): Model selection, quantization, smart routing
- Library-level improvements (using PyTorch-based frameworks like vLLM, SGLang, TensorRT-LLM): Continuous batching, KV-cache management
- Custom implementations: Speculative decoding with custom draft heads, disaggregated inference, fine-tuning smaller models

The session covers practical trade-offs and key metrics: time to first token, inter-token latency, and cost per token. Whether you're deploying your first model or optimizing at scale, this talk delivers actionable insights into which techniques to prioritize for deeper investigation.
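As a rough illustration of the three metrics named above, here is a minimal Python sketch. It is not taken from the session: the names `RequestTrace`, `time_to_first_token`, `inter_token_latency`, and `cost_per_token` are hypothetical, the timestamps are invented, and the cost formula assumes a single GPU billed by the hour and running at a steady aggregate throughput.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timing of one streamed request (hypothetical structure, for illustration)."""
    sent_at: float             # seconds: when the request was sent
    token_times: list[float]   # seconds: arrival time of each output token


def time_to_first_token(trace: RequestTrace) -> float:
    """Seconds between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap in seconds between consecutive output tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def cost_per_token(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per output token, assuming one GPU at a steady aggregate throughput."""
    return gpu_hourly_usd / 3600 / tokens_per_second


# Invented numbers: first token after 250 ms, then one token every 50 ms.
trace = RequestTrace(sent_at=0.0, token_times=[0.25, 0.30, 0.35, 0.40])
print(f"TTFT: {time_to_first_token(trace) * 1000:.0f} ms")
print(f"ITL:  {inter_token_latency(trace) * 1000:.0f} ms")
print(f"Cost: ${cost_per_token(3.60, 1000.0) * 1e6:.2f} per million tokens")
```

In practice these numbers are measured across many concurrent requests and reported as percentiles, since batching trades per-request latency for aggregate throughput.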
Topics
- Azure
- Generative AI (GenAI)
- Infrastructure
- LLMOps
- Python
- Software Architecture