Patrick Koss
Unveiling the Magic: Scaling Large Language Models to Serve Millions
#1 · about 3 minutes
Understanding the benefits of self-hosting large language models
Self-hosting LLMs provides greater control over data privacy, compliance, cost, and vendor lock-in compared to using third-party services.
#2 · about 4 minutes
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
#3 · about 7 minutes
Choosing an inference engine and model storage strategy
Storing model weights on a shared network file system (NFS) is crucial for reducing startup times and enabling fast horizontal scaling, since new model instances can skip the download step.
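As a rough illustration, a serving replica can resolve weights on a shared NFS mount instead of re-downloading them; the mount path and model name below are hypothetical:

```python
# Minimal sketch: load model weights from a shared NFS mount so new
# replicas skip the per-replica download. Paths/names are hypothetical.
import os
from pathlib import Path

NFS_ROOT = Path(os.environ.get("MODEL_STORE", "/mnt/models"))  # NFS mount point

def resolve_model_path(model_name: str) -> Path:
    """Return the local path of a model on the shared volume, if present."""
    path = NFS_ROOT / model_name
    if not path.exists():
        # Only the first replica ever pays the download cost; later
        # replicas find the weights already materialized on the volume.
        raise FileNotFoundError(f"{model_name} not yet synced to {NFS_ROOT}")
    return path

if __name__ == "__main__":
    try:
        weights = resolve_model_path("llama-3-8b-instruct")
        print(f"Loading weights from {weights} (no per-replica download)")
    except FileNotFoundError as err:
        print(err)
```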
#4 · about 5 minutes
Building an efficient token-based billing system
Aggregate token usage in a fast store like Redis before reporting it to the payment provider; batching keeps you within the provider's rate limits and cuts per-request overhead.
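A minimal sketch of this aggregation, assuming redis-py; the key names, flush logic, and payment helper are illustrative, not from the talk:

```python
# Accumulate per-user token counts in Redis, then drain them in bulk
# instead of calling the payment provider once per request.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_usage(user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Cheap per-request path: two atomic counter increments."""
    r.incrby(f"usage:{user_id}:prompt", prompt_tokens)
    r.incrby(f"usage:{user_id}:completion", completion_tokens)

def flush_usage(user_id: str) -> None:
    """Run periodically: swap the counters to zero and report one
    aggregated charge, staying under the payment provider's rate limits."""
    pipe = r.pipeline()
    pipe.getset(f"usage:{user_id}:prompt", 0)
    pipe.getset(f"usage:{user_id}:completion", 0)
    prompt, completion = (int(v or 0) for v in pipe.execute())
    if prompt or completion:
        report_to_payment_provider(user_id, prompt, completion)  # hypothetical helper

def report_to_payment_provider(user_id: str, prompt: int, completion: int) -> None:
    print(f"bill {user_id}: {prompt} prompt + {completion} completion tokens")
```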
#5 · about 3 minutes
Implementing robust rate limiting for shared LLM systems
Prevent abuse of shared resources by enforcing both request-based and token-based rate limits, using an estimate for output tokens since the true completion length is unknown until generation finishes.
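One way to combine both limits is a pair of token buckets, one counting requests and one counting tokens, reserving the caller's max_output_tokens as a conservative estimate; all limits below are made-up examples:

```python
# Sketch of combined request- and token-based rate limiting.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float        # burst size
    refill_rate: float     # units replenished per second
    tokens: float = field(default=0.0)
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full

    def try_consume(self, amount: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

requests_bucket = TokenBucket(capacity=10, refill_rate=1)      # ~1 request/s, burst 10
tokens_bucket = TokenBucket(capacity=20_000, refill_rate=500)  # ~500 tokens/s

def admit(prompt_tokens: int, max_output_tokens: int) -> bool:
    # Output length is unknown up front, so reserve max_output_tokens
    # as a worst-case estimate. (This sketch skips refunding the request
    # slot when only the token check fails.)
    estimated = prompt_tokens + max_output_tokens
    return requests_bucket.try_consume(1) and tokens_bucket.try_consume(estimated)
```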
#6 · about 3 minutes
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
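A sketch of such a check, assuming PyJWT and an invented claim layout in which each token lists the models it may call:

```python
# Bearer-token authorization with per-model scopes (claim names invented).
import jwt  # PyJWT

SECRET = "replace-me"  # shared signing secret; use real key management in practice

def authorize(authorization_header: str, requested_model: str) -> str:
    """Validate the bearer token and check the requested model against its claims."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise PermissionError("missing bearer token")
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    allowed = claims.get("models", [])  # e.g. ["llama-3-8b", "mistral-7b"]
    if requested_model not in allowed:
        raise PermissionError(f"token not authorized for {requested_model}")
    return claims["sub"]  # caller identity, reusable for billing and rate limiting

# Example: mint a token restricted to one model, then authorize a request.
token = jwt.encode({"sub": "team-a", "models": ["llama-3-8b"]}, SECRET, algorithm="HS256")
print(authorize(f"Bearer {token}", "llama-3-8b"))  # -> team-a
```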
#7 · about 2 minutes
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
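For flavor, a Knative Service manifest that scales on in-flight requests (a proxy for queue depth) rather than CPU might look like the following, built as a Python dict and dumped as YAML (assumes PyYAML); the image and targets are placeholders, and the exact annotations depend on your Knative/KServe versions:

```python
# Build a Knative Service spec tuned for concurrency-based autoscaling.
import yaml

service = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "llm-inference"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # Scale on in-flight requests per replica, which tracks
                    # the inference queue far more closely than CPU does.
                    "autoscaling.knative.dev/metric": "concurrency",
                    "autoscaling.knative.dev/target": "4",
                    "autoscaling.knative.dev/max-scale": "20",
                }
            },
            "spec": {
                "containers": [
                    {"image": "registry.example.com/llm-server:canary"}  # hypothetical image
                ]
            },
        }
    },
}

print(yaml.safe_dump(service, sort_keys=False))  # pipe into `kubectl apply -f -`
```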
#8 · about 3 minutes
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.