Adolf Hohl
Efficient deployment and inference of GPU-accelerated LLMs
#1about 2 minutes
The evolution of generative AI from experimentation to production
Generative AI has rapidly moved from experimentation with models like Llama and Mistral to production-ready applications in 2024.
#2about 3 minutes
Comparing managed AI services with the DIY approach
Managed services offer ease of use but limited control, while a do-it-yourself approach provides full control but introduces significant complexity.
#3about 4 minutes
Introducing NVIDIA NIM for simplified LLM deployment
NVIDIA Inference Microservices (NIM) provide a containerized, OpenAI-compatible solution for deploying models anywhere with enterprise support.
#4about 2 minutes
Boosting inference throughput with lower precision quantization
Using lower precision formats like FP8 dramatically increases model inference throughput, providing more performance for the same hardware investment.
#5about 2 minutes
Overview of the NVIDIA AI Enterprise software platform
The NVIDIA AI Enterprise platform is a cloud-native software stack that abstracts away low-level complexities like CUDA to streamline AI pipeline development.
#6about 2 minutes
A look inside the NIM container architecture
NIM containers bundle optimized inference tools like TensorRT-LLM and Triton Inference Server to accelerate models on specific GPU hardware.
#7about 3 minutes
How to run and interact with a NIM container
A NIM container can be launched with a simple Docker command, automatically discovering hardware and exposing OpenAI-compatible API endpoints for interaction.
#8about 2 minutes
Efficiently serving custom models with LoRA adapters
NIM enables serving multiple customized LoRA adapters on a single base model simultaneously, saving memory while providing distinct model endpoints.
#9about 3 minutes
How NIM automatically handles hardware and model optimization
NIM simplifies deployment by automatically selecting the best pre-compiled model based on the detected GPU architecture and user preference for latency or throughput.
Related jobs
Jobs that call for the skills explored in this talk.
Featured Partners
Related Videos
Self-Hosted LLMs: From Zero to Inference
Roberto Carratalá, Cedric Clyburn
Your Next AI Needs 10,000 GPUs. Now What?
Anshul Jindal, Martin Piercy
LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices
Anshul Jindal
DevOps for AI: running LLMs in production with Kubernetes and KubeFlow
Aarno Aukia
WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA
Ankit Patel
Unveiling the Magic: Scaling Large Language Models to Serve Millions
Patrick Koss
Exploring LLMs across clouds
Tomislav Tipurić
Unlocking the Power of AI: Accessible Language Model Tuning for All
Cedric Clyburn, Legare Kerrison
From learning to earning
Jobs that call for the skills explored in this talk.


Senior Backend Engineer – AI Integration (m/w/x)
chatlyn GmbH
Vienna, Austria
Senior
JavaScript
AI-assisted coding tools
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of San Sebastian, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of Murcia, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of Las Palmas, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of Valencia, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of Granada, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2
Senior Machine Learning Engineer (LLM & GPU Architecture)
European Tech Recruit
Municipality of Huelva, Spain
Intermediate
Python
Docker
PyTorch
Kubernetes
Computer Vision
+2





