Patrick Koss
Unveiling the Magic: Scaling Large Language Models to Serve Millions
#1 · about 3 minutes
Understanding the benefits of self-hosting large language models
Self-hosting LLMs gives you greater control over data privacy, compliance, and cost, and reduces vendor lock-in compared with third-party services.
#2 · about 4 minutes
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
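To make that separation of concerns concrete, here is a minimal Python sketch of the components as interfaces. All names and signatures are invented for this example, not taken from the talk.

```python
# Illustrative interfaces for the platform's building blocks.
# Every name here is an assumption for the sake of the sketch.
from typing import Protocol


class ModelStore(Protocol):
    def fetch(self, model_id: str) -> str:
        """Ensure model weights are available locally; return their path."""


class InferenceEngine(Protocol):
    def generate(self, model_id: str, prompt: str, max_tokens: int) -> str:
        """Run inference and return the generated text."""


class BillingSink(Protocol):
    def record_usage(self, customer_id: str, input_tokens: int,
                     output_tokens: int) -> None:
        """Record token usage for later invoicing."""


class RateLimiter(Protocol):
    def allow(self, customer_id: str, estimated_tokens: int) -> bool:
        """Decide whether a request may proceed."""


class Authorizer(Protocol):
    def can_use(self, bearer_token: str, model_id: str) -> bool:
        """Check that the caller may access the requested model."""
```

Keeping these boundaries explicit lets each concern (storage, billing, limiting, auth) scale and evolve independently.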
#3 · about 7 minutes
Choosing an inference engine and model storage strategy
Storing model weights on a shared network file system (NFS) is crucial for reducing startup times and enabling fast horizontal scaling when deploying new model instances.
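A minimal sketch of the idea, assuming the NFS volume is mounted at /models (overridable via a MODEL_ROOT environment variable) and that download_model is a stand-in for whatever registry client you use (e.g., huggingface_hub.snapshot_download):

```python
# Sketch: resolve model weights from a shared NFS mount before downloading.
# The mount path and the download helper are assumptions for illustration.
import os
from pathlib import Path

NFS_MODEL_ROOT = Path(os.environ.get("MODEL_ROOT", "/models"))  # shared NFS mount


def download_model(model_id: str, target: Path) -> None:
    """Placeholder for a registry client, e.g. huggingface_hub.snapshot_download."""
    raise NotImplementedError


def resolve_model_path(model_id: str) -> Path:
    """Return the local path of the model, downloading only on a cold start.

    Because every replica mounts the same NFS volume, a model downloaded
    once is immediately visible to new pods, so scale-out replicas skip
    the multi-gigabyte download and start serving much faster.
    """
    local = NFS_MODEL_ROOT / model_id.replace("/", "--")
    if not local.exists():  # first replica to request this model pays the cost
        tmp = local.parent / (local.name + ".partial")
        download_model(model_id, tmp)
        tmp.rename(local)  # rename last so readers never see partial weights
    return local
```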
#4 · about 5 minutes
Building an efficient token-based billing system
Aggregate token usage with tools like Redis before sending data to a payment provider to manage rate limits and improve system efficiency.
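Here is a minimal sketch of that aggregation pattern using redis-py; the key layout and the report_usage callback are assumptions for illustration, not the talk's exact design:

```python
# Sketch: aggregate per-customer token counts in Redis, then flush them
# to the payment provider in batches to stay under its rate limits.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def record_tokens(customer_id: str, input_tokens: int, output_tokens: int) -> None:
    """Called on every request: O(1) counter increments, no external API call."""
    key = f"usage:{customer_id}"
    pipe = r.pipeline()
    pipe.hincrby(key, "input_tokens", input_tokens)
    pipe.hincrby(key, "output_tokens", output_tokens)
    pipe.sadd("usage:dirty", customer_id)  # track customers with pending usage
    pipe.execute()


def flush_usage(report_usage) -> None:
    """Run periodically (e.g., once a minute): one provider call per customer."""
    for customer_id in r.smembers("usage:dirty"):
        key = f"usage:{customer_id}"
        # Read and reset in one MULTI/EXEC so concurrent increments are
        # either counted now or kept intact for the next flush.
        pipe = r.pipeline(transaction=True)
        pipe.hgetall(key)
        pipe.delete(key)
        pipe.srem("usage:dirty", customer_id)
        usage, _, _ = pipe.execute()
        if usage:
            report_usage(customer_id,
                         int(usage.get("input_tokens", 0)),
                         int(usage.get("output_tokens", 0)))
```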
#5 · about 3 minutes
Implementing robust rate limiting for shared LLM systems
Prevent system abuse by implementing both request-based and token-based rate limiting, using estimations for output tokens to protect shared resources.
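A minimal single-process sketch of combining both limits; the limit values and the output-token estimate are illustrative assumptions, and a shared deployment would keep these counters in Redis rather than process memory:

```python
# Sketch: combined request-based and token-based limiting, fixed one-minute
# windows. Limits and the worst-case token estimate are assumptions.
import time
from collections import defaultdict

REQUESTS_PER_MINUTE = 60
TOKENS_PER_MINUTE = 100_000

_windows: dict[str, dict] = defaultdict(
    lambda: {"start": 0.0, "requests": 0, "tokens": 0})


def estimate_total_tokens(prompt_tokens: int, max_tokens: int) -> int:
    """Output length is unknown up front, so budget the worst case:
    the caller's max_tokens cap on top of the measured prompt tokens."""
    return prompt_tokens + max_tokens


def allow_request(customer_id: str, prompt_tokens: int, max_tokens: int) -> bool:
    now = time.monotonic()
    w = _windows[customer_id]
    if now - w["start"] >= 60:  # window expired: reset counters
        w.update(start=now, requests=0, tokens=0)
    estimated = estimate_total_tokens(prompt_tokens, max_tokens)
    if w["requests"] + 1 > REQUESTS_PER_MINUTE:
        return False
    if w["tokens"] + estimated > TOKENS_PER_MINUTE:
        return False
    w["requests"] += 1
    w["tokens"] += estimated
    return True
```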
#6 · about 3 minutes
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
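One way this can look, sketched with PyJWT; the "models" claim, the signing secret, and the error handling are assumptions for illustration, and a real deployment might instead use opaque API keys looked up in a database:

```python
# Sketch: validate a bearer token and enforce per-model access.
import jwt  # pip install PyJWT

SECRET = "replace-with-a-real-signing-key"  # assumption: HS256 shared secret


def authorize(authorization_header: str, requested_model: str) -> str:
    """Return the customer id if the token is valid and allows the model."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise PermissionError("missing bearer token")
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError as exc:
        raise PermissionError("invalid token") from exc
    # Fine-grained authorization: the token itself lists permitted models.
    allowed_models = claims.get("models", [])
    if requested_model not in allowed_models:
        raise PermissionError(f"token not authorized for {requested_model}")
    return claims["sub"]
```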
#7 · about 2 minutes
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
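The autoscaling policy itself lives in KServe/Knative manifests, but the serving process has to expose the custom metric. A minimal sketch using prometheus_client; the metric name and queue wiring are assumptions:

```python
# Sketch: expose inference queue depth as a Prometheus gauge that a
# custom-metrics autoscaler (e.g., via KServe, Knative, or KEDA) can act on.
import queue

from prometheus_client import Gauge, start_http_server

inference_queue: queue.Queue = queue.Queue()
queue_depth = Gauge("llm_inference_queue_depth",
                    "Requests waiting for an inference slot")


def enqueue(request) -> None:
    inference_queue.put(request)
    queue_depth.set(inference_queue.qsize())


def dequeue():
    request = inference_queue.get()
    queue_depth.set(inference_queue.qsize())
    return request


# In a real server this runs alongside the request loop, which keeps
# the process alive; here it just opens the scrape endpoint.
start_http_server(9100)
```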
#8 · about 3 minutes
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.