Patrick Koss
Unveiling the Magic: Scaling Large Language Models to Serve Millions
#1 about 3 minutes
Understanding the benefits of self-hosting large language models
Self-hosting LLMs provides greater control over data privacy, compliance, and cost than third-party services, and it reduces vendor lock-in.
#2 about 4 minutes
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
#3 about 7 minutes
Choosing an inference engine and model storage strategy
Storing models on a network file system (NFS) reduces startup times and enables fast horizontal scaling when deploying new model instances.
#4 about 5 minutes
Building an efficient token-based billing system
Aggregate token usage with tools like Redis before sending it to a payment provider, so that the billing path stays within the provider's rate limits and issues fewer requests.
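A minimal sketch of the aggregation idea, not the speaker's implementation: usage is buffered per customer and reported in one batch per flush. An in-memory dictionary stands in for Redis so the example runs on its own, and `UsageAggregator`, `report`, and the customer names are hypothetical.

```python
from collections import defaultdict


class UsageAggregator:
    """Buffers per-customer token counts and reports them in batches.

    Assumption: a plain dict stands in for Redis. With Redis, add() would
    typically become an atomic increment on a per-customer key.
    """

    def __init__(self, report):
        # Hypothetical callable (customer_id, tokens) -> None that would
        # forward one aggregated record to the payment provider.
        self._report = report
        self._pending = defaultdict(int)

    def add(self, customer_id, prompt_tokens, completion_tokens):
        """Record the tokens consumed by a single request."""
        self._pending[customer_id] += prompt_tokens + completion_tokens

    def flush(self):
        """Send one record per customer, reset the buffer, return the record count."""
        batch, self._pending = self._pending, defaultdict(int)
        for customer_id, tokens in batch.items():
            self._report(customer_id, tokens)
        return len(batch)


sent = []
aggregator = UsageAggregator(report=lambda cid, tokens: sent.append((cid, tokens)))
aggregator.add("acme", 120, 30)
aggregator.add("acme", 80, 20)
aggregator.add("globex", 10, 5)
calls = aggregator.flush()  # three requests collapse into two provider calls
```

In a real deployment the flush would likely run on a timer, and the flush interval sets the trade-off between billing freshness and the number of calls made to the provider.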
#5 about 3 minutes
Implementing robust rate limiting for shared LLM systems
Protect shared resources from abuse by combining request-based and token-based rate limiting. Because the output length is unknown when a request arrives, the token limit relies on an estimate of output tokens.
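One way to combine both limits can be sketched as follows, under stated assumptions: a fixed time window per client, an output estimate supplied by the caller (for example the request's maximum output length), and a correction step once the real output length is known. `TokenRateLimiter` and its method names are hypothetical, and state is kept in process memory rather than in a shared store.

```python
import time


class TokenRateLimiter:
    """Fixed-window limit on requests and tokens per client (illustrative only)."""

    def __init__(self, max_requests, max_tokens, window_seconds=60, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.clock = clock
        self._state = {}  # client_id -> [window_start, requests, tokens]

    def _current(self, client_id):
        """Return the client's counters, starting a new window if the old one expired."""
        now = self.clock()
        state = self._state.get(client_id)
        if state is None or now - state[0] >= self.window:
            state = [now, 0, 0]
            self._state[client_id] = state
        return state

    def try_acquire(self, client_id, prompt_tokens, estimated_output_tokens):
        """Admit the request only if both the request and token budgets allow it."""
        state = self._current(client_id)
        reserved = prompt_tokens + estimated_output_tokens
        if state[1] + 1 > self.max_requests or state[2] + reserved > self.max_tokens:
            return False
        state[1] += 1
        state[2] += reserved
        return True

    def settle(self, client_id, estimated_output_tokens, actual_output_tokens):
        """Replace the estimated output tokens with the actual count."""
        state = self._current(client_id)
        state[2] = max(0, state[2] - estimated_output_tokens + actual_output_tokens)


now = [0.0]  # fake clock so the example is deterministic
limiter = TokenRateLimiter(max_requests=2, max_tokens=1000, clock=lambda: now[0])
first = limiter.try_acquire("alice", 100, 500)   # reserves 600 tokens
second = limiter.try_acquire("alice", 100, 500)  # rejected: would reserve 1200
limiter.settle("alice", 500, 50)                 # actual usage was 150 tokens
third = limiter.try_acquire("alice", 100, 500)   # admitted: 150 + 600 = 750
fourth = limiter.try_acquire("alice", 1, 1)      # rejected: request cap reached
now[0] = 61.0
fifth = limiter.try_acquire("alice", 100, 500)   # admitted: new window
```

Reserving the estimate up front is deliberately pessimistic: a client cannot exceed the budget with long generations, and the settle step returns the unused part.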
#6 about 3 minutes
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
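The model-level restriction could look roughly like this. It is a sketch under assumptions: the token values, model names, and the `TOKEN_SCOPES` store are hypothetical, and a real service would keep the hashed tokens in a database or secret store rather than in a module-level dictionary.

```python
import hashlib


def _digest(token):
    """Hash a token so that the store never holds the raw secret."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Hypothetical token store: hashed bearer token -> models it may call.
TOKEN_SCOPES = {
    _digest("demo-token-a"): {"small-model"},
    _digest("demo-token-b"): {"small-model", "large-model"},
}


def authorize(authorization_header, model):
    """Return an HTTP-style status: 200 allowed, 401 unauthenticated, 403 forbidden."""
    scheme, _, token = (authorization_header or "").partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return 401
    allowed_models = TOKEN_SCOPES.get(_digest(token))
    if allowed_models is None:
        return 401
    return 200 if model in allowed_models else 403


status_ok = authorize("Bearer demo-token-a", "small-model")
status_forbidden = authorize("Bearer demo-token-a", "large-model")
status_unknown = authorize("Bearer not-a-real-token", "small-model")
```

Separating 401 from 403 keeps authentication (is the token valid?) apart from authorization (may this token use this model?), which is what makes per-model restrictions straightforward to add.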
#7 about 2 minutes
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
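To make scaling on queue size concrete, here is a simplified illustration of a proportional rule: pick enough replicas that each one holds at most a target number of queued requests. This is not how KServe or Knative compute replica counts internally; `desired_replicas` and its parameters are hypothetical.

```python
import math


def desired_replicas(queued_requests, target_queue_per_replica,
                     min_replicas=1, max_replicas=10):
    """Return a replica count that keeps the per-replica queue at or below the target."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queued_requests / target_queue_per_replica)
    # Clamp to the configured bounds so the service neither vanishes nor grows unbounded.
    return max(min_replicas, min(max_replicas, wanted))


idle = desired_replicas(0, 8)
busy = desired_replicas(17, 8)
overloaded = desired_replicas(1000, 8)
```

Queue size tends to be a more direct signal than CPU utilization for LLM inference, since a GPU-bound replica can be saturated while requests are still piling up in front of it.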
#8 about 3 minutes
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.