About This Session
As foundation models move toward deeper test-time computation, inference becomes the dominant scaling constraint. Latency, throughput, and cost are governed by a small set of forces: autoregressive decoding, KV-cache growth, memory bandwidth, and scheduling under contention. This workshop frames large-scale inference through these emerging laws of inference, starting from first principles and building toward real systems. Learners deploy NVIDIA Dynamo on Kubernetes to operate aggregated and disaggregated inference architectures using built-in KV-aware routing and scheduling. The outcome is a principled understanding of where inference time and money go — and how architectural choices bend those curves in production. Participants will deploy both aggregated and disaggregated inference on a 4xA100 node and compare the performance of the two.
Topics
- AI Models
- Agentic AI
- Distributed Systems
- Docker
- Infrastructure
- NVIDIA