Anirudh Koul
30 Golden Rules of Deep Learning Performance
#1 · about 5 minutes
The high cost of waiting for deep learning models to train
Long training times are a major bottleneck for developers, wasting both time and hardware resources.
#2 · about 2 minutes
Fine-tune your existing hardware instead of buying more GPUs
Instead of simply buying more expensive hardware, you can achieve significant performance gains by optimizing your existing setup.
#3 · about 3 minutes
Using transfer learning to accelerate model development
Transfer learning provides a powerful baseline by fine-tuning pre-trained models for specific tasks, drastically reducing training time.
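As an illustration, here is a minimal Keras transfer-learning sketch: a pre-trained MobileNetV2 backbone is frozen and only a small new classification head is trained (the 10-class head is a placeholder for your own task):

```python
import tensorflow as tf

# Load a pre-trained backbone without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the backbone; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes: placeholder
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```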
#4 · about 4 minutes
Diagnose GPU starvation using profiling tools
Use tools like the TensorBoard Profiler and nvidia-smi to identify when your GPU is idle and waiting for data from the CPU.
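One way to capture such a trace in recent TensorFlow versions is the profiler built into the TensorBoard callback; running `nvidia-smi` in a second terminal gives a live view of GPU utilization. A sketch, with `model` and `train_ds` standing in for your own model and dataset:

```python
import tensorflow as tf

# Profile batches 10-20 so the trace skips start-up overhead; long idle
# gaps on the GPU timeline indicate it is starved for input data.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile", profile_batch=(10, 20))

# model.fit(train_ds, epochs=1, callbacks=[tb_callback])
```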
#5 · about 3 minutes
Prepare your data efficiently before training begins
Optimize data preparation by serializing data into moderately sized files, pre-computing transformations, and leveraging TensorFlow Datasets for high-performance pipelines.
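A minimal sketch of serializing pre-processed samples into a single TFRecord shard (`samples` is a placeholder iterable of already-encoded image bytes and integer labels):

```python
import tensorflow as tf

def serialize_example(image_bytes, label):
    # Pack one pre-processed sample as a tf.train.Example protobuf.
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Write many small samples into one moderately sized shard file.
with tf.io.TFRecordWriter("train-shard-0001.tfrecord") as writer:
    for image_bytes, label in samples:  # `samples`: placeholder iterable
        writer.write(serialize_example(image_bytes, label))
```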
#6 · about 5 minutes
Construct a high-performance input pipeline with tf.data
Use the tf.data API to build an efficient data reading pipeline by implementing prefetching, parallelization, caching, and autotuning.
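A sketch of such a pipeline, assuming a `parse_example` decoder of your own and TF 2.4+ for `tf.data.AUTOTUNE`:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let the runtime pick parallelism levels

ds = (tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))
      .map(parse_example, num_parallel_calls=AUTOTUNE)  # your decoder fn
      .cache()              # keep decoded samples in memory after epoch 1
      .shuffle(10_000)
      .batch(64)
      .prefetch(AUTOTUNE))  # overlap CPU data prep with GPU compute
```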
#7 · about 3 minutes
Move data augmentation from the CPU to the GPU
Avoid CPU bottlenecks by performing data augmentation directly on the GPU using either TensorFlow's built-in functions or the NVIDIA DALI library.
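With recent TensorFlow versions, one way to do this is to place Keras preprocessing layers inside the model itself, so the augmentation ops run on the same device as the forward pass:

```python
import tensorflow as tf

# Augmentation layers inside the model execute on the GPU and are
# active only when the model is called with training=True.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

inputs = tf.keras.Input(shape=(224, 224, 3))
x = augment(inputs)
# x = backbone(x) ...  # the rest of the model follows (placeholder)
```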
#8 · about 5 minutes
Key optimizations for the model training loop
Speed up the training loop by enabling mixed-precision training, maximizing the batch size, and sizing batches and layer dimensions in multiples of eight to leverage specialized hardware such as Tensor Cores.
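Mixed precision in Keras is a one-line policy change; the batch size below is only an example and should be tuned to nearly fill your GPU memory:

```python
import tensorflow as tf

# float16 compute with float32 master weights; on Tensor Core GPUs
# this often roughly doubles training throughput.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

BATCH_SIZE = 256  # a multiple of 8, sized to the available GPU memory
```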
#9 · about 2 minutes
Automatically find the optimal learning rate for faster convergence
Use a learning rate finder library to systematically identify the optimal learning rate, preventing slow convergence or overshooting the solution.
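Libraries such as keras-lr-finder wrap this idea; a hand-rolled sketch of the underlying range test (assuming a model compiled with a loss and no extra metrics, so `train_on_batch` returns a scalar) looks like:

```python
import numpy as np
import tensorflow as tf

def lr_range_test(model, dataset, min_lr=1e-6, max_lr=1.0, steps=300):
    # Sweep the learning rate exponentially over a few hundred batches
    # and record the loss; a good LR sits just before the loss diverges.
    lrs = np.geomspace(min_lr, max_lr, steps)
    losses = []
    for lr, (x, y) in zip(lrs, dataset.take(steps)):
        tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
        losses.append(model.train_on_batch(x, y))
    return lrs, losses
```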
#10 · about 2 minutes
Compile Python code into a graph with the tf.function decorator
Gain a significant performance boost by using the @tf.function decorator to compile eager-mode TensorFlow code into an optimized computation graph.
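A typical custom training step carrying the decorator; `model`, `loss_fn`, and `optimizer` are placeholders for your own objects:

```python
import tensorflow as tf

@tf.function  # traces the Python function into a reusable TF graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```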
#11 · about 2 minutes
Use progressive sizing and curriculum learning strategies
Accelerate training by starting with smaller image resolutions and simpler tasks, then progressively increasing complexity as the model learns.
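A rough sketch of progressive resizing, assuming a fully convolutional model that accepts variable input sizes and a `raw_ds` dataset of (image, label) pairs:

```python
import tensorflow as tf

# Short warm-up phase at low resolution, then continue training the
# same weights at full resolution.
for size, epochs in [(128, 5), (224, 10)]:
    ds = raw_ds.map(
        lambda img, lbl, s=size: (tf.image.resize(img, (s, s)), lbl))
    model.fit(ds.batch(64).prefetch(tf.data.AUTOTUNE), epochs=epochs)
```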
#12 · about 3 minutes
Optimize your environment and scale up your hardware
Install hardware-specific binaries and leverage distributed training strategies to scale your jobs across multiple GPUs on-premises or in the cloud.
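For a single machine with several GPUs, `tf.distribute.MirroredStrategy` is the simplest option; `build_model` below is a placeholder for your own model factory:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all local GPUs and
# all-reduces gradients each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()  # placeholder: your model factory
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")

# Scale the global batch size with the number of replicas:
# model.fit(train_ds.batch(64 * strategy.num_replicas_in_sync), epochs=10)
```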
#13 · about 3 minutes
Learn from cost-effective and high-speed training benchmarks
Analyze benchmarks like DawnBench and MLPerf to adopt strategies for training models faster and more cost-effectively by leveraging optimized cloud resources.
#14 · about 3 minutes
Select efficient model architectures for fast inference
For production deployment, choose lightweight yet accurate model architectures like MobileNet, EfficientDet, or DistilBERT to ensure fast inference on end-user devices.
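For instance, Keras ships MobileNetV2 at reduced width multipliers; `alpha=0.75` here is just one illustrative accuracy-versus-speed trade-off point:

```python
import tensorflow as tf

# A narrower MobileNetV2 trades a little accuracy for a much smaller,
# faster model suited to end-user devices.
model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), alpha=0.75, weights="imagenet")
```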
#15 · about 2 minutes
Shrink model size and improve speed with quantization
Use model quantization to convert 32-bit floating-point weights to 8-bit integers, significantly reducing the model's size and memory footprint for faster inference.
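A sketch of post-training dynamic-range quantization with the TFLite converter (`model` stands in for any trained Keras model):

```python
import tensorflow as tf

# Weights are stored as 8-bit integers, typically shrinking the model
# to about a quarter of its float32 size.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```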