This course explores how to use Numba, the just-in-time, type-specializing Python function compiler, to accelerate Python programs on massively parallel NVIDIA GPUs. You’ll learn how to use Numba to compile CUDA kernels from NumPy universal functions (ufuncs), create and launch custom CUDA kernels, and apply key GPU memory management techniques. Upon completion, you’ll be able to use Numba to compile and launch CUDA kernels that accelerate your Python applications on NVIDIA GPUs.
08:00 - 09:00 | Registration
09:00 | Start of Masterclass
10:30 - 11:00 | Break I
12:30 - 13:30 | Lunch Break
15:00 - 15:30 | Break II
17:00 | End of Masterclass
Learning Objectives
At the conclusion of the workshop, you’ll understand the fundamental tools and techniques for GPU-accelerating Python applications with CUDA and Numba, and be able to:
- GPU-accelerate NumPy ufuncs with a few lines of code.
- Configure code parallelization using the CUDA thread hierarchy.
- Write custom CUDA device kernels for maximum performance and flexibility.
- Use memory coalescing and on-device shared memory to increase CUDA kernel bandwidth.
Topics Covered
The following topics and technologies are covered in this course:
- CUDA Python with Numba
- CUDA programming general practices
Course Outline
Introduction
- Meet the instructor.
- Create an account at https://learn.nvidia.com/join
Introduction to CUDA Python with Numba
- Begin working with the Numba compiler and CUDA programming in Python.
- Use Numba decorators to GPU-accelerate numerical Python functions.
- Optimize host-to-device and device-to-host memory transfers.
Break (60 mins)
Custom CUDA Kernels in Python with Numba
- Learn CUDA’s parallel thread hierarchy and how it expands the range of problems you can parallelize.
- Launch massively parallel custom CUDA kernels on the GPU.
- Utilize CUDA atomic operations to avoid race conditions during parallel execution.
Break (15 mins)
Multidimensional Grids and Shared Memory for CUDA Python with Numba
- Learn multidimensional grid creation and how to work in parallel on 2D matrices.
- Leverage on-device shared memory to promote memory coalescing while reshaping 2D matrices.
Final Review
- Review key learnings and wrap up questions.
- Complete the assessment to earn a certificate.
- Take the workshop survey.
Duration: 8 hours
Subject: Accelerated Computing
Language: English
Course Prerequisites:
- Basic Python competency, including familiarity with variable types, loops, conditional statements, functions, and array manipulation.
- NumPy competency, including the use of ndarrays and ufuncs.
Tools, libraries, frameworks used: Python, Numba, NumPy, CUDA