Kevin Klues
A Deep Dive on How To Leverage the NVIDIA GB200 for Ultra-Fast Training and Inference on Kubernetes
#1 about 2 minutes
Understanding the NVIDIA GB200 supercomputer architecture
The GB200 uses multi-node NVLink and NVSwitches to connect up to 72 GPUs across multiple nodes, so that they operate as a single system.
#2 about 2 minutes
Enabling secure multi-node GPU communication on Kubernetes
The GPU Operator already runs on GB200 nodes, but using multi-node NVLink connections securely requires support for a new construct called IMEX.
#3 about 2 minutes
How the IMEX CUDA APIs enable remote memory access
Applications use a sequence of CUDA API calls such as `cuMemCreate` and `cuMemExportToShareableHandle` to securely map and access remote GPU memory over NVLink.
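As a rough orientation, the call order can be written down as plain data. This is only a sketch: the summary names `cuMemCreate` and the export call, while the importer-side calls listed here are an assumption taken from the CUDA driver's virtual memory management API, and the helper `describe` is hypothetical. How the exported handle travels between processes (for example over MPI or a socket) is outside this sketch.

```python
# Call order for sharing GPU memory between processes on different nodes.
# Only cuMemCreate and the export call come from the talk summary; the
# importer-side calls are an assumption based on the CUDA driver API.
EXPORTER_STEPS = [
    ("cuMemCreate", "allocate physical GPU memory with a shareable handle type"),
    ("cuMemExportToShareableHandle", "produce a handle that can be sent to another process"),
]

IMPORTER_STEPS = [
    ("cuMemImportFromShareableHandle", "turn the received handle back into an allocation handle"),
    ("cuMemAddressReserve", "reserve a virtual address range"),
    ("cuMemMap", "map the imported allocation into that range"),
    ("cuMemSetAccess", "grant the local GPU access to the mapping"),
]


def describe(role, steps):
    """Return one numbered, human-readable line per step (hypothetical helper)."""
    return [f"{role} {i}. {name}: {what}" for i, (name, what) in enumerate(steps, start=1)]


if __name__ == "__main__":
    for line in describe("exporter", EXPORTER_STEPS) + describe("importer", IMPORTER_STEPS):
        print(line)
```

The listing does not call into CUDA; it only records which side of the exchange issues which call.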
#4 about 4 minutes
Exploring the four levels of IMEX resource partitioning
IMEX security is managed through a four-level hierarchy, from the physical NVLink Domain down to the workload-specific IMEX Channel allocated within an IMEX Domain.
#5 about 6 minutes
Abstracting IMEX complexity with the compute domain concept
The complex manual setup of IMEX daemons and channels is hidden behind a user-friendly "Compute Domain" abstraction that uses Dynamic Resource Allocation (DRA).
#6 about 2 minutes
How to migrate a multi-node workload to compute domains
Migrating a workload involves creating a `ComputeDomain` object and updating the pod spec to reference its `resourceClaimTemplate` in the new `resourceClaims` section.
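The migration described here can be sketched by building the two manifests as Python dictionaries. The `apiVersion` value, the `numNodes` and `channel` field names, and every object name below are assumptions for illustration, not confirmed by this summary; check them against the driver's documentation before use. The helper functions are hypothetical.

```python
import json


def compute_domain(name, num_nodes, template_name):
    """Build a ComputeDomain manifest (field names are assumptions)."""
    return {
        "apiVersion": "resource.nvidia.com/v1beta1",  # assumed API group/version
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            "numNodes": num_nodes,
            "channel": {"resourceClaimTemplate": {"name": template_name}},
        },
    }


def pod_with_claim(pod_name, image, template_name, claim_name="imex-channel"):
    """Build a Pod whose container consumes a claim created from the template."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [{
                "name": "worker",
                "image": image,
                # The container opts in to the claim by name.
                "resources": {"claims": [{"name": claim_name}]},
            }],
            # Pod-level section that ties the claim to the ComputeDomain's template.
            "resourceClaims": [{
                "name": claim_name,
                "resourceClaimTemplateName": template_name,
            }],
        },
    }


if __name__ == "__main__":
    domain = compute_domain("demo-domain", 2, "demo-domain-channel")
    pod = pod_with_claim("mpi-worker", "example.invalid/mpi:latest", "demo-domain-channel")
    print(json.dumps([domain, pod], indent=2))
```

The point of the sketch is the name matching: the template name declared in the `ComputeDomain` is the same one the pod references under `resourceClaims`, and the container refers to that pod-level claim by its local name.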
#7 about 5 minutes
Understanding the compute domain DRA driver's architecture
The driver uses a central controller and a Kubelet plugin to orchestrate the lifecycle of IMEX daemons and channels, ensuring they are ready before workloads start.
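The "ready before workloads start" guarantee can be pictured as a simple gate. The following is a toy model only, not the driver's actual code: the class `ComputeDomainStatus`, its fields, and the rule "ready once every expected node reports its IMEX daemon" are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ComputeDomainStatus:
    """Toy bookkeeping for one compute domain (hypothetical, not the real driver)."""
    num_nodes: int                                   # nodes expected in the domain
    ready_nodes: set[str] = field(default_factory=set)

    def mark_daemon_ready(self, node: str) -> None:
        # In this model, a per-node plugin reports that its IMEX daemon is up.
        self.ready_nodes.add(node)

    @property
    def ready(self) -> bool:
        # Assumed rule: the domain is ready once every expected node has reported.
        return len(self.ready_nodes) >= self.num_nodes


def can_start_workload(status: ComputeDomainStatus) -> bool:
    """Gate that holds workload containers back until the domain is ready."""
    return status.ready


if __name__ == "__main__":
    status = ComputeDomainStatus(num_nodes=2)
    for node in ("node-a", "node-b"):
        status.mark_daemon_ready(node)
        print(node, "reported; workload may start:", can_start_workload(status))
```

Using a set for the reporting nodes makes repeated reports from the same node harmless in this model.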
#8 about 6 minutes
Demonstrating a multi-node MPI job on a GB200 cluster
A live demo shows how to deploy the DRA driver and run an MPI job that automatically gets IMEX daemons and achieves full NVLink bandwidth across nodes.
#9 about 2 minutes
Prerequisites and resources for using the DRA driver
To use the driver, you must enable DRA and CDI feature flags in Kubernetes and ensure the GPU driver includes the necessary IMEX binaries.