Kevin Klues

Aug 20, 2025 • World Congress 2025

A Deep Dive on How To Leverage the NVIDIA GB200 for Ultra-Fast Training and Inference on Kubernetes

What if you could treat 72 GPUs across 18 nodes as a single system on Kubernetes? Learn how Dynamic Resource Allocation unlocks this for ultra-fast training.

#1 about 2 minutes

Understanding the NVIDIA GB200 supercomputer architecture

The GB200 uses multi-node NVLink and NVSwitches to connect up to 72 GPUs across multiple nodes so that they can be used as a single system.

#2 about 2 minutes

Enabling secure multi-node GPU communication on Kubernetes

The GPU Operator already runs on GB200 nodes, but it needs support for a new construct called IMEX before workloads can securely use multi-node NVLink connections.

#3 about 2 minutes

How the IMEX CUDA APIs enable remote memory access

Applications use a sequence of CUDA API calls such as `cuMemCreate` and `cuMemExportToShareableHandle` to securely map and access remote GPU memory over NVLink.
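The order of those calls can be illustrated with a pure-Python model. It does not call CUDA: `FakeFabric`, `Exporter`, and `Importer` are hypothetical stand-ins, and the only things taken from the CUDA driver API are the call names recorded in `calls` (`cuMemCreate` and `cuMemExportToShareableHandle` on the exporting side; `cuMemImportFromShareableHandle`, `cuMemAddressReserve`, `cuMemMap`, and `cuMemSetAccess` on the importing side). How the handle travels between processes is left to the application and is not modelled here.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class FakeFabric:
    """Toy stand-in for memory reachable over NVLink: handle -> buffer."""
    exported: dict = field(default_factory=dict)


class Exporter:
    def __init__(self, fabric):
        self.fabric = fabric
        self.calls = []

    def share(self, size):
        # cuMemCreate: allocate physical memory that is allowed to be exported.
        buf = bytearray(size)
        self.calls.append("cuMemCreate")
        # cuMemExportToShareableHandle: obtain an opaque handle for it.
        handle = secrets.token_bytes(8)
        self.fabric.exported[handle] = buf
        self.calls.append("cuMemExportToShareableHandle")
        return handle, buf


class Importer:
    def __init__(self, fabric):
        self.fabric = fabric
        self.calls = []

    def map_remote(self, handle):
        # cuMemImportFromShareableHandle: turn the handle into an allocation.
        buf = self.fabric.exported[handle]
        self.calls.append("cuMemImportFromShareableHandle")
        # Reserve a virtual range, map the allocation, grant access to it.
        self.calls += ["cuMemAddressReserve", "cuMemMap", "cuMemSetAccess"]
        return memoryview(buf)


fabric = FakeFabric()
exporter, importer = Exporter(fabric), Importer(fabric)
handle, local = exporter.share(16)
remote = importer.map_remote(handle)
remote[0] = 0x2A   # a write through the imported mapping...
print(local[0])    # ...is visible in the exporter's buffer
```

In the toy model both sides end up pointing at the same buffer, which is the property the real API sequence provides for GPU memory on another node.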

#4 about 4 minutes

Exploring the four levels of IMEX resource partitioning

IMEX security is managed through a four-level hierarchy, from the physical NVLink Domain down to the workload-specific IMEX Channel allocated within an IMEX Domain.
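A small data model can make the nesting concrete. The summary above names the NVLink Domain, the IMEX Domain, and the IMEX Channel; the second level used here (an NVLink Partition inside the NVLink Domain) is an assumption, as are all class names, field names, and node names. It is a simplified sketch, not a description of how IMEX stores its state.

```python
from dataclasses import dataclass, field


@dataclass
class ImexChannel:
    """Level 4: handed to a single workload."""
    channel_id: int
    workload: str


@dataclass
class ImexDomain:
    """Level 3: the nodes that cooperate to share memory for a workload."""
    nodes: list
    channels: list = field(default_factory=list)

    def allocate_channel(self, workload):
        # Channels are numbered in allocation order within this domain.
        channel = ImexChannel(channel_id=len(self.channels), workload=workload)
        self.channels.append(channel)
        return channel


@dataclass
class NvlinkPartition:
    """Level 2 (assumed name): a subset of one NVLink Domain."""
    imex_domains: list = field(default_factory=list)


@dataclass
class NvlinkDomain:
    """Level 1: the physically connected nodes."""
    nodes: list
    partitions: list = field(default_factory=list)


rack = NvlinkDomain(nodes=[f"node-{i}" for i in range(18)])
partition = NvlinkPartition()
rack.partitions.append(partition)

domain = ImexDomain(nodes=rack.nodes[:4])
partition.imex_domains.append(domain)

a = domain.allocate_channel("training-job")
b = domain.allocate_channel("inference-job")
print(a.channel_id, b.channel_id)
```

Each workload gets its own channel inside the same IMEX Domain, which is the level at which the summary says workloads are separated.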

#5 about 6 minutes

Abstracting IMEX complexity with the compute domain concept

The complex manual setup of IMEX daemons and channels is hidden behind a user-friendly "Compute Domain" abstraction that uses Dynamic Resource Allocation (DRA).

#6 about 2 minutes

How to migrate a multi-node workload to compute domains

Migrating a workload involves creating a `ComputeDomain` object and updating the pod spec to reference its `resourceClaimTemplate` in the new `resourceClaims` section.
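The two steps can be sketched as plain Python dictionaries that mirror the manifests. The `ComputeDomain` API group, version, and `spec` layout shown here are assumptions and should be checked against the driver's documentation; `resourceClaims`, `resourceClaimTemplateName`, and `resources.claims` are standard pod fields. All object names and the image are placeholders.

```python
import json


def compute_domain(name, num_nodes, claim_template):
    """Build a ComputeDomain manifest (API version and spec fields assumed)."""
    return {
        "apiVersion": "resource.nvidia.com/v1beta1",
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            "numNodes": num_nodes,
            "channel": {"resourceClaimTemplate": {"name": claim_template}},
        },
    }


def add_channel_claim(pod, claim_template, claim_name="imex-channel"):
    """Return a copy of `pod` that requests a channel from the template."""
    pod = json.loads(json.dumps(pod))  # cheap deep copy of plain JSON data
    spec = pod["spec"]
    # Pod-level claim that points at the ComputeDomain's template.
    spec.setdefault("resourceClaims", []).append(
        {"name": claim_name, "resourceClaimTemplateName": claim_template}
    )
    # Every container that needs the channel references the claim by name.
    for container in spec["containers"]:
        claims = container.setdefault("resources", {}).setdefault("claims", [])
        claims.append({"name": claim_name})
    return pod


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "worker"},
    "spec": {"containers": [{"name": "ctr", "image": "example.invalid/app"}]},
}
domain = compute_domain("demo-domain", 2, "demo-domain-channel")
migrated = add_channel_claim(pod, "demo-domain-channel")
print(json.dumps(domain, indent=2))
print(json.dumps(migrated, indent=2))
```

The original `pod` dictionary is left untouched, so the before and after specs can be compared side by side.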

#7 about 5 minutes

Understanding the compute domain DRA driver's architecture

The driver uses a central controller and a Kubelet plugin to orchestrate the lifecycle of IMEX daemons and channels, ensuring they are ready before workloads start.
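The readiness gating described above can be mimicked with a toy model. Nothing here is the driver's real code or API: `ToyComputeDomain`, `daemon_ready`, and `try_start_workload` are hypothetical names, and the rule "ready once every expected node has reported" is an assumption about what "ready" means.

```python
from dataclasses import dataclass, field


@dataclass
class ToyComputeDomain:
    num_nodes: int
    ready_nodes: set = field(default_factory=set)

    def daemon_ready(self, node):
        """Record that the IMEX daemon on `node` has reported ready."""
        self.ready_nodes.add(node)

    @property
    def ready(self):
        return len(self.ready_nodes) >= self.num_nodes


def try_start_workload(domain, pod):
    """Hold the pod back until the whole domain is ready."""
    if not domain.ready:
        done, total = len(domain.ready_nodes), domain.num_nodes
        return f"{pod}: waiting ({done}/{total} daemons ready)"
    return f"{pod}: started"


domain = ToyComputeDomain(num_nodes=2)
first = try_start_workload(domain, "worker-0")
domain.daemon_ready("node-a")
domain.daemon_ready("node-b")
second = try_start_workload(domain, "worker-0")
print(first)
print(second)
```

The first attempt is held back and the second succeeds, which is the ordering guarantee the summary attributes to the controller and Kubelet plugin.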

#8 about 6 minutes

Demonstrating a multi-node MPI job on a GB200 cluster

A live demo shows how to deploy the DRA driver and run an MPI job whose IMEX daemons are started automatically and which achieves full NVLink bandwidth across nodes.

#9 about 2 minutes

Prerequisites and resources for using the DRA driver

To use the driver, you must enable DRA and CDI feature flags in Kubernetes and ensure the GPU driver includes the necessary IMEX binaries.
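A rough pre-flight check for a node might look like the sketch below. It assumes that DRA is controlled by the `DynamicResourceAllocation` feature gate and that the IMEX binaries are named `nvidia-imex` and `nvidia-imex-ctl`; both should be verified for the Kubernetes and GPU driver versions in use. Newer Kubernetes releases may enable DRA by default, in which case an absent gate is not a problem. CDI is left out because where it is enabled depends on the container runtime. The function names are hypothetical.

```python
import shutil


def parse_feature_gates(value):
    """Parse a `--feature-gates` style value such as 'A=true,B=false'."""
    gates = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, state = item.partition("=")
        gates[name.strip()] = state.strip().lower() == "true"
    return gates


def missing_prerequisites(
    gate_string,
    binaries=("nvidia-imex", "nvidia-imex-ctl"),  # assumed binary names
    which=shutil.which,
):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    gates = parse_feature_gates(gate_string)
    if not gates.get("DynamicResourceAllocation", False):
        problems.append("feature gate DynamicResourceAllocation is not enabled")
    for binary in binaries:
        if which(binary) is None:
            problems.append(f"binary {binary} not found on PATH")
    return problems


for problem in missing_prerequisites("DynamicResourceAllocation=true"):
    print(problem)
```

The `which` parameter exists only so the lookup can be replaced in tests; on a real node the default `shutil.which` searches `PATH`.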

Your Next AI Needs 10,000 GPUs. Now What?

Anshul Jindal, Martin Piercy

about 2 months ago • World Congress 2025

WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA

Ankit Patel

about a year ago • World Congress 2024

From foundation model to hosted AI solution in minutes

Kevin Klues

about a year ago

Accelerating Python on GPUs

Paul Graham

about 2 months ago • World Congress 2025

Efficient deployment and inference of GPU-accelerated LLMs

Adolf Hohl

about a year ago • World Congress 2024

Accelerating Python on GPUs

Paul Graham

about a year ago • World Congress 2024

The Future of Computing: AI Technologies in the Exascale Era

Stephan Gillich, Tomislav Tipurić, Christian Wiebus, Alan Southall

about a year ago • World Congress 2024

AI Factories at Scale

Thomas Schmidt

about a year ago • World Congress 2024
