A Deep Dive on How To Leverage the NVIDIA GB200 for Ultra-Fast Training and Inference on Kubernetes

What if you could treat 72 GPUs across 18 nodes as a single system on Kubernetes? Learn how Dynamic Resource Allocation unlocks this for ultra-fast training.

#1 about 2 minutes

Understanding the NVIDIA GB200 supercomputer architecture

The GB200 uses multi-node NVLink and NVSwitch chips to connect up to 72 GPUs across multiple nodes, so that they behave as a single system.

#2 about 2 minutes

Enabling secure multi-node GPU communication on Kubernetes

The GPU Operator already runs on GB200 nodes, but using multi-node NVLink connections securely requires support for a new construct called IMEX.

#3 about 2 minutes

How the IMEX CUDA APIs enable remote memory access

Applications use a sequence of CUDA API calls like `cuMemCreate` and `cuMemExportToShareableHandle` to securely map and access remote GPU memory over NVLink.
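
As a rough outline of that sequence, the sketch below records which side makes which call and in what order. It does not call CUDA at all. The two exporter calls come from the summary above; the importer calls are the matching functions from the CUDA driver's virtual memory management API, and their inclusion here is an assumption about what the talk covers. The step descriptions are paraphrases, not official documentation.

```python
# Ordered outline of the driver API calls, split by which process makes them.
# This is documentation-as-data only: nothing here touches a GPU.

EXPORTER_STEPS = [
    ("cuMemCreate", "allocate physical GPU memory with a shareable handle type"),
    ("cuMemExportToShareableHandle", "produce a handle that can be sent to another process"),
]

# Assumption: the importing side uses the standard virtual memory management calls.
IMPORTER_STEPS = [
    ("cuMemImportFromShareableHandle", "turn the received handle back into an allocation handle"),
    ("cuMemAddressReserve", "reserve a virtual address range"),
    ("cuMemMap", "map the imported allocation into that range"),
    ("cuMemSetAccess", "grant the local device read/write access to the mapping"),
]


def call_order():
    """Return the API names in the order they are invoked across both sides."""
    return [name for name, _ in EXPORTER_STEPS + IMPORTER_STEPS]


if __name__ == "__main__":
    for side, steps in (("exporter", EXPORTER_STEPS), ("importer", IMPORTER_STEPS)):
        for name, purpose in steps:
            print(f"{side:9s} {name}: {purpose}")
```

Between the two halves, the exported handle has to travel from the exporting process to the importing one through some channel the application chooses, such as MPI or a socket.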

#4 about 4 minutes

Exploring the four levels of IMEX resource partitioning

IMEX security is managed through a four-level hierarchy, from the physical NVLink Domain down to the workload-specific IMEX Channel allocated within an IMEX Domain.
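
One way to picture the nesting is as a small data model. The sketch below is illustrative only: the summary names the NVLink Domain, the IMEX Domain, and the IMEX Channel, while the `NvlinkPartition` level is an assumption filling in the fourth tier. All class names, fields, and the subset check are hypothetical and do not mirror any NVIDIA API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ImexChannel:
    """Lowest level: a channel handed to one workload."""
    channel_id: int
    workload: str


@dataclass
class ImexDomain:
    """A set of nodes that may share memory with each other."""
    name: str
    nodes: frozenset[str]
    channels: dict[int, ImexChannel] = field(default_factory=dict)

    def allocate_channel(self, workload: str) -> ImexChannel:
        # Hand out the next unused channel number within this domain.
        channel = ImexChannel(channel_id=len(self.channels), workload=workload)
        self.channels[channel.channel_id] = channel
        return channel


@dataclass
class NvlinkPartition:
    """Assumed intermediate level: a subset of nodes inside an NVLink Domain."""
    name: str
    nodes: frozenset[str]
    imex_domains: list[ImexDomain] = field(default_factory=list)

    def create_imex_domain(self, name: str, nodes: set[str]) -> ImexDomain:
        # An IMEX Domain can only span nodes that belong to this partition.
        if not nodes <= self.nodes:
            raise ValueError(f"nodes outside partition: {sorted(nodes - self.nodes)}")
        domain = ImexDomain(name=name, nodes=frozenset(nodes))
        self.imex_domains.append(domain)
        return domain


@dataclass
class NvlinkDomain:
    """Top level: the physical set of NVLink-connected nodes."""
    name: str
    partitions: list[NvlinkPartition] = field(default_factory=list)


if __name__ == "__main__":
    rack = NvlinkDomain("rack-0")
    partition = NvlinkPartition("partition-0", frozenset({"node-0", "node-1"}))
    rack.partitions.append(partition)
    domain = partition.create_imex_domain("job-a", {"node-0", "node-1"})
    print(domain.allocate_channel("mpi-job"))
```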

#5 about 6 minutes

Abstracting IMEX complexity with the compute domain concept

The complex manual setup of IMEX daemons and channels is hidden behind a user-friendly "Compute Domain" abstraction that uses Dynamic Resource Allocation (DRA).

#6 about 2 minutes

How to migrate a multi-node workload to compute domains

Migrating a workload involves creating a `ComputeDomain` object and updating the pod spec to reference its `resourceClaimTemplate` in the new `resourceClaims` section.
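
The two objects involved can be sketched as plain Python dictionaries and printed as JSON, which `kubectl apply -f` accepts alongside YAML. The pod-level `resourceClaims` entry and the container-level `resources.claims` entry are standard Kubernetes DRA fields. The `ComputeDomain` `apiVersion` and the `numNodes` and `channel.resourceClaimTemplate.name` fields are assumptions about the driver's CRD and should be checked against the installed version. All object names and the image are placeholders.

```python
import json


def compute_domain(name, num_nodes, claim_template):
    """Build a ComputeDomain manifest (field names are assumptions, see above)."""
    return {
        "apiVersion": "resource.nvidia.com/v1beta1",  # assumed API group/version
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            "numNodes": num_nodes,
            # Name of the ResourceClaimTemplate that pods will reference.
            "channel": {"resourceClaimTemplate": {"name": claim_template}},
        },
    }


def pod_with_claim(name, image, claim_template, claim_name="imex-channel"):
    """Build a Pod manifest whose container consumes a claim from the template."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "worker",
                "image": image,
                # The container opts in to the claim by its pod-local name.
                "resources": {"claims": [{"name": claim_name}]},
            }],
            # The pod-level section ties the local name to the template.
            "resourceClaims": [{
                "name": claim_name,
                "resourceClaimTemplateName": claim_template,
            }],
        },
    }


if __name__ == "__main__":
    template = "example-compute-domain-channel"  # placeholder name
    domain = compute_domain("example-compute-domain", 2, template)
    pod = pod_with_claim("example-worker", "example.invalid/mpi-app:latest", template)
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [domain, pod]}, indent=2))
```

The point of the sketch is the single shared string: the template name declared on the `ComputeDomain` is the same one the pod names in `resourceClaimTemplateName`.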

#7 about 5 minutes

Understanding the compute domain DRA driver's architecture

The driver uses a central controller and a Kubelet plugin to orchestrate the lifecycle of IMEX daemons and channels, ensuring they are ready before workloads start.

#8 about 6 minutes

Demonstrating a multi-node MPI job on a GB200 cluster

A live demo shows how to deploy the DRA driver and run an MPI job that automatically gets IMEX daemons and achieves full NVLink bandwidth across nodes.

#9 about 2 minutes