Your Next AI Needs 10,000 GPUs. Now What?
Anshul Jindal, Martin Piercy
#1 (about 2 minutes)
Introduction to large-scale AI infrastructure challenges
An overview of the topics to be covered, from the progress of generative AI to the compute requirements for training and inference.
#2 (about 4 minutes)
Understanding the fundamental shift to generative AI
Generative AI creates novel content, moving beyond prediction to unlock new use cases in coding, content creation, and customer experience.
#3 (about 6 minutes)
Using NVIDIA NIMs and blueprints to deploy models
NVIDIA Inference Microservices (NIMs) and blueprints provide pre-packaged, optimized containers to quickly deploy models for tasks like retrieval-augmented generation (RAG).
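Once a NIM container is running, it exposes an OpenAI-compatible REST API. Below is a minimal sketch of querying one from Python; the port and model id are assumptions drawn from common NIM examples, so adjust them to match your deployment.

```python
import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default NIM port

payload = {
    "model": "meta/llama3-8b-instruct",  # model id is an assumption; match your NIM
    "messages": [{"role": "user",
                  "content": "Summarize retrieval-augmented generation in one sentence."}],
    "max_tokens": 128,
}

resp = requests.post(NIM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI chat-completions schema, existing client libraries and RAG frameworks can usually point at a NIM with only a base-URL change.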
#4 (about 4 minutes)
An overview of the AI model development lifecycle
Building a production-ready model involves a multi-stage process including data curation, distributed training, alignment, optimized inference, and implementing guardrails.
#5 (about 6 minutes)
Understanding parallelism techniques for distributed AI training
Training massive models requires splitting them across thousands of GPUs using tensor, pipeline, and data parallelism to manage compute and communication.
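As a concrete illustration of the data-parallel dimension, here is a minimal PyTorch DistributedDataParallel sketch, launched with `torchrun --nproc_per_node=<gpus>`. Tensor and pipeline parallelism are typically layered on top by frameworks such as Megatron-LM or DeepSpeed and are not shown here.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real network
    model = DDP(model, device_ids=[local_rank])  # all-reduces grads across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees its own batch shard
        loss = model(x).pow(2).mean()              # dummy loss for illustration
        opt.zero_grad()
        loss.backward()   # DDP overlaps the gradient all-reduce with backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```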
#6 (about 2 minutes)
The scale of GPU compute for training and inference
Training large models like Llama requires millions of GPU hours, while inference for a single large model can demand a full multi-GPU server.
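A back-of-envelope calculation shows what "millions of GPU hours" means in wall-clock terms. The 6.4M figure below is the approximate GPU-hour count Meta reported for Llama 3 70B pre-training; the cluster size and utilization are illustrative assumptions.

```python
# Back-of-envelope: turning millions of GPU hours into wall-clock time.
gpu_hours = 6.4e6      # ~GPU hours reported for Llama 3 70B (illustrative)
cluster_gpus = 10_000  # the "10,000 GPUs" of the talk title
utilization = 0.9      # assumed fraction of the cluster doing useful work

wall_clock_days = gpu_hours / (cluster_gpus * utilization) / 24
print(f"~{wall_clock_days:.0f} days of wall-clock training time")  # ~30 days
```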
#7 (about 3 minutes)
Key hardware and network design for AI infrastructure
Effective multi-node training depends on high-speed interconnects like NVLink and network architectures designed to minimize communication latency between GPUs.
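To see why interconnect bandwidth dominates the design, consider the per-step gradient all-reduce in data-parallel training. Under a ring algorithm each GPU moves roughly 2*(N-1)/N times the gradient size; the sketch below plugs in assumed bandwidth numbers (and ignores latency and compute overlap) to compare an NVLink-class fabric with a slower network.

```python
def ring_allreduce_seconds(n_params: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    """Rough per-step time to all-reduce gradients with a ring algorithm."""
    grad_bytes = n_params * 2                        # assume bf16 gradients
    volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # bytes moved per GPU
    return volume / bw_bytes_per_s

params = 70e9  # a 70B-parameter model
for name, bw in [("NVLink-class fabric (~900 GB/s, assumed)", 900e9),
                 ("400 Gb/s network (~50 GB/s)", 50e9)]:
    print(f"{name}: ~{ring_allreduce_seconds(params, 8, bw):.2f} s per step")
```

The order-of-magnitude gap between the two results is the practical argument for NVLink within a node and latency-optimized fabrics between nodes.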
#8 (about 3 minutes)
Accessing global GPU capacity with DGX Cloud Lepton
NVIDIA's DGX Cloud Lepton is a marketplace connecting developers to a global network of cloud partners for scalable, on-demand GPU compute.
Related Videos
WWC24 - Unlocking the Future: Breakthrough Application Performance and Capabilities with NVIDIA
Ankit Patel
A Deep Dive on How To Leverage the NVIDIA GB200 for Ultra-Fast Training and Inference on Kubernetes
Kevin Klues
Efficient deployment and inference of GPU-accelerated LLMs
Adolf Hohl
Unveiling the Magic: Scaling Large Language Models to Serve Millions
Patrick Koss
How AI Models Get Smarter
Ankit Patel
AI Factories at Scale
Thomas Schmidt
Exploring LLMs across clouds
Tomislav Tipurić
Generative AI power on the web: making web apps smarter with WebGPU and WebNN
Christian Liebel
From learning to earning
Jobs that call for the skills explored in this talk.
Software Engineer - DGX Cloud API Services
NVIDIA · Bramley, United Kingdom · Senior
Skills: API, Terraform, Kubernetes, Amazon Web Services (AWS)

Senior Software Engineer - DGX Cloud API Services
NVIDIA · München, Germany · Senior
Skills: API, Terraform, Kubernetes, Amazon Web Services (AWS)

Software Architect - Deep Learning and HPC Communications
NVIDIA · Bramley, United Kingdom · Senior
Skills: C++, Linux, Node.js, PyTorch, TensorFlow

Multimodal Deep Learning Solution Architect - Vision Language and Action Models
NVIDIA · Canton de Plaisir, France · Senior
Skills: C++, Python, PyTorch