Workshop

Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs

with Duan Lightfoot

  • AI Models
  • Agents
  • Generative AI (GenAI)
  • Infrastructure
  • Large Language Models (LLMs)
  • Llama
  • LLMOps
  • Ollama
  • Small Language Models (SLMs)

Free for All Attendees · Seats Limited

Workshops are included with your event ticket at no extra cost. Seats fill up fast — registration opens through the official event app approximately one week before the event. Follow app notifications to know the moment sign-ups go live.

Starts: Fri 10 Jul, 14:45

Ends: Fri 10 Jul, 16:45

About This Workshop

Every production agent today is renting its intelligence. You're paying per token, sending your customers' data to someone else's servers, and hoping the provider doesn't rate-limit you during your launch. For most teams, that's fine. But for a growing number of teams in regulated industries, with high-volume products, latency-sensitive workloads, or rising token bills, it's starting to look like a liability.

In this 120-minute hands-on workshop you'll get a dedicated GPU and build an agent that runs on infrastructure you control. You'll stand up vLLM, point your agent at it, and drive concurrent load through the stack until you can see batching, KV cache pressure, and throughput limits in the metrics. Then you'll tune the deployment to raise throughput while keeping per-request latency in line.

The focus isn't agent frameworks. It's the inference layer underneath them. You'll leave with working code and a practical understanding of continuous batching under real concurrency, KV cache tradeoffs, vLLM's metrics, and the bottlenecks that only show up when you operate the inference server yourself.
