Owning the Inference Layer: When and How to Run Your Own Models

About This Session

Hosted model APIs are the fastest way to start building with generative AI models, but they are not always the best long-term fit. As workloads grow, teams run into harder questions about latency, cost, deployment flexibility, data boundaries, and performance tuning. This talk looks at what changes when you move from calling a hosted model API to running inference yourself. We’ll break down the practical trade-offs between hosted and self-hosted approaches, then examine how modern open inference technologies such as vLLM and llm-d are making self-hosted AI more realistic for production systems. Rather than treating this as a debate between two camps, the session focuses on decision-making: when self-hosting is worth the added operational complexity, when hosted APIs still win, and what teams take on when they choose to own the inference layer. You’ll leave with a practical framework for evaluating cost, control, performance, and architectural fit in your own environment.