Adolf Hohl

Aug 20, 2024 • World Congress 2024

Efficient deployment and inference of GPU-accelerated LLMs

What if you could deploy a fully optimized LLM with a single command? See how NVIDIA NIM abstracts away the complexity of self-hosting for massive performance gains.

#1about 2 minutes

The evolution of generative AI from experimentation to production

Generative AI has rapidly moved from experimentation with models like Llama and Mistral to production-ready applications in 2024.

#2about 3 minutes

Comparing managed AI services with the DIY approach

Managed services offer ease of use but limited control, while a do-it-yourself approach provides full control but introduces significant complexity.

#3about 4 minutes

Introducing NVIDIA NIM for simplified LLM deployment

NVIDIA Inference Microservices (NIM) provide a containerized, OpenAI-compatible solution for deploying models anywhere with enterprise support.

#4about 2 minutes

Boosting inference throughput with lower precision quantization

Using lower precision formats like FP8 dramatically increases model inference throughput, providing more performance for the same hardware investment.

#5about 2 minutes

Overview of the NVIDIA AI Enterprise software platform

The NVIDIA AI Enterprise platform is a cloud-native software stack that abstracts away low-level complexities like CUDA to streamline AI pipeline development.

#6about 2 minutes

A look inside the NIM container architecture

NIM containers bundle optimized inference tools like TensorRT-LLM and Triton Inference Server to accelerate models on specific GPU hardware.

#7about 3 minutes

How to run and interact with a NIM container

A NIM container can be launched with a simple Docker command, automatically discovering hardware and exposing OpenAI-compatible API endpoints for interaction.

#8about 2 minutes

Efficiently serving custom models with LoRA adapters

NIM enables serving multiple customized LoRA adapters on a single base model simultaneously, saving memory while providing distinct model endpoints.

#9about 3 minutes

How NIM automatically handles hardware and model optimization

NIM simplifies deployment by automatically selecting the best pre-compiled model based on the detected GPU architecture and user preference for latency or throughput.

14 days ago

Senior Machine Learning Engineer (f/m/d)

MARKT-PILOT GmbH
Stuttgart, Germany

Remote

Senior

2 days ago

AI Software Engineer (m/f/d)

Sunhat
Köln, Germany

Remote

Senior

8 days ago

Senior Researcher for Generative AI

Dynatrace
Linz, Austria

Senior

Featured Partners

Self-Hosted LLMs: From Zero to Inference

Self-Hosted LLMs: From Zero to Inference

Roberto Carratalá, Cedric Clyburn

about 2 months ago • World Congress 2025

Your Next AI Needs 10,000 GPUs. Now What?

Your Next AI Needs 10,000 GPUs. Now What?

Anshul Jindal, Martin Piercy

about 2 months ago • World Congress 2025

LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices

LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices

Anshul Jindal

about 2 months ago • World Congress 2025

DevOps for AI: running LLMs in production with Kubernetes and KubeFlow

DevOps for AI: running LLMs in production with Kubernetes and KubeFlow

Aarno Aukia

about a year ago • WeAreDevelopers LIVE

WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA

WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA

Ankit Patel

about a year ago • World Congress 2024

Unveiling the Magic: Scaling Large Language Models to Serve Millions

Unveiling the Magic: Scaling Large Language Models to Serve Millions

Patrick Koss

about 2 months ago • World Congress 2025

Exploring LLMs across clouds

Exploring LLMs across clouds

Tomislav Tipurić

about 2 months ago • World Congress 2025

Unlocking the Power of AI: Accessible Language Model Tuning for All

Unlocking the Power of AI: Accessible Language Model Tuning for All

Cedric Clyburn, Legare Kerrison

about a year ago • World Congress 2024

From learning to earning

Jobs that call for the skills explored in this talk.

Senior Backend Engineer – AI Integration (m/w/x)

1 month ago

Senior Backend Engineer – AI Integration (m/w/x)

chatlyn GmbH
Vienna, Austria

Senior

JavaScript

AI-assisted coding tools

5 days ago

AI Systems Engineer - LLM Execution

OpenNebula Systems
Municipality of Madrid, Spain

Python

yesterday

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of San Sebastian, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2

yesterday

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of Murcia, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2

yesterday

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of Las Palmas, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2

yesterday

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of Valencia, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2

yesterday

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of Granada, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2

2 days ago

Senior Machine Learning Engineer (LLM & GPU Architecture)

European Tech Recruit
Municipality of Huelva, Spain

Intermediate

Python

Docker

PyTorch

Kubernetes

Computer Vision

+2