Anirudh Koul

30 Golden Rules of Deep Learning Performance

Is your GPU starving for data? Learn 30 rules to eliminate bottlenecks and slash your deep learning training times.

#1 (about 5 minutes)

The high cost of waiting for deep learning models to train

Long training times are a major bottleneck for developers, wasting both time and hardware resources.

#2 (about 2 minutes)

Tune your existing hardware instead of buying more GPUs

Instead of simply buying more expensive hardware, you can achieve significant performance gains by optimizing your existing setup.

#3 (about 3 minutes)

Using transfer learning to accelerate model development

Transfer learning provides a powerful baseline by fine-tuning pre-trained models for specific tasks, drastically reducing training time.
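
As a minimal Keras sketch, assuming a 10-class image task (the class count, input size, and the choice of MobileNetV2 are placeholders):

```python
import tensorflow as tf

# Load a backbone pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # Freeze the backbone; train only the new head first.

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes: placeholder
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Once the new head converges, unfreezing the top of the backbone at a low learning rate is the usual second step.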

#4 (about 4 minutes)

Diagnose GPU starvation using profiling tools

Use tools like the TensorBoard Profiler and nvidia-smi to identify when your GPU is idle and waiting for data from the CPU.
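
A quick way to try this, assuming a compiled `model` and a `train_ds` dataset (the log directory and batch range are arbitrary):

```python
import tensorflow as tf

# Profile training batches 10-20; open TensorBoard's Profile tab afterwards
# to see how long the GPU sits idle waiting for input. For a live view from
# a shell, `nvidia-smi -l 1` prints GPU utilization every second.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile", profile_batch=(10, 20))

# model.fit(train_ds, epochs=1, callbacks=[tb_callback])
```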

#5 (about 3 minutes)

Prepare your data efficiently before training begins

Optimize data preparation by serializing data into moderately sized files, pre-computing transformations, and leveraging TensorFlow Datasets for high-performance pipelines.
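
For example, a dataset loaded through TensorFlow Datasets already arrives as pre-serialized TFRecord shards (the dataset name here is just an example):

```python
import tensorflow_datasets as tfds

# TFDS datasets are stored as sharded TFRecord files of a sensible size,
# and as_supervised=True yields ready-made (image, label) pairs.
train_ds = tfds.load("tf_flowers", split="train", as_supervised=True)
```

If your data starts as loose image files, converting it once into a modest number of TFRecord shards gives the same benefit.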

#6 (about 5 minutes)

Construct a high-performance input pipeline with tf.data

Use the tf.data API to build an efficient data reading pipeline by implementing prefetching, parallelization, caching, and autotuning.
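
A sketch of such a pipeline, assuming `train_ds` yields (image, label) pairs as in the earlier snippet (the target size and batch size are illustrative):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # Let tf.data pick parallelism/buffer values

def preprocess(image, label):
    image = tf.image.resize(image, (224, 224))
    return tf.cast(image, tf.float32) / 255.0, label

train_ds = (train_ds
            .cache()                                       # Reuse decoded data after epoch 1
            .shuffle(1_000)
            .map(preprocess, num_parallel_calls=AUTOTUNE)  # Parallel CPU preprocessing
            .batch(64)
            .prefetch(AUTOTUNE))                           # Overlap CPU prep with GPU compute
```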

#7 (about 3 minutes)

Move data augmentation from the CPU to the GPU

Avoid CPU bottlenecks by performing data augmentation directly on the GPU using either TensorFlow's built-in functions or the NVIDIA DALI library.
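
With TensorFlow's built-in preprocessing layers, augmentation can live inside the model itself, so it runs on the GPU as part of the forward pass (a sketch; the layer choices and backbone are placeholders, and DALI is the alternative for heavier pipelines):

```python
import tensorflow as tf

# Keras preprocessing layers are only active when training=True.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

inputs = tf.keras.Input(shape=(224, 224, 3))
x = augment(inputs)  # Executes on the GPU with the rest of the graph
outputs = tf.keras.applications.MobileNetV2(
    include_top=False, weights=None, pooling="avg")(x)
model = tf.keras.Model(inputs, outputs)
```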

#8 (about 5 minutes)

Key optimizations for the model training loop

Speed up the training loop by enabling mixed-precision training, maximizing the batch size, and keeping batch sizes and layer dimensions at multiples of eight to exploit specialized hardware such as Tensor Cores.
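
In TensorFlow, enabling mixed precision is essentially a one-line change (the batch size shown is illustrative; pick the largest multiple of eight that fits in memory):

```python
import tensorflow as tf

# Compute in float16 on Tensor Cores while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

BATCH_SIZE = 256  # Illustrative: as large as memory allows, multiple of 8

# Keep the final activation in float32 for numerical stability, e.g.:
# outputs = tf.keras.layers.Activation("softmax", dtype="float32")(x)
```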

#9 (about 2 minutes)

Automatically find the optimal learning rate for faster convergence

Use a learning-rate finder to systematically identify a good learning rate, avoiding both the slow convergence of a rate set too low and the overshooting of one set too high.
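
The exact API varies by library, but the underlying range test is simple to hand-roll; a minimal sketch (the start/end rates and step count are typical defaults, not prescriptions):

```python
import numpy as np
import tensorflow as tf

# Exponentially ramp the learning rate over a few hundred batches while
# recording the loss; pick a rate from the steepest-descent region, well
# before the loss blows up. Training stops once the sweep finishes.
class LRFinder(tf.keras.callbacks.Callback):
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=200):
        super().__init__()
        self.lrs = np.geomspace(start_lr, end_lr, num_steps)
        self.losses, self.step = [], 0

    def on_train_batch_begin(self, batch, logs=None):
        self.model.optimizer.learning_rate.assign(self.lrs[self.step])

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        self.step += 1
        if self.step >= len(self.lrs):
            self.model.stop_training = True
```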

#10 (about 2 minutes)

Compile Python code into a graph with the tf.function decorator

Gain a significant performance boost by using the @tf.function decorator to compile eager-mode TensorFlow code into an optimized computation graph.
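
For example, wrapping a custom training step (a generic sketch; `model` and `optimizer` are assumed to exist):

```python
import tensorflow as tf

@tf.function  # Traces the Python function once into an optimized graph
def train_step(model, optimizer, x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, preds))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```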

#11 (about 2 minutes)

Use progressive sizing and curriculum learning strategies

Accelerate training by starting with smaller image resolutions and simpler tasks, then progressively increasing complexity as the model learns.
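
A sketch of progressive resizing (the resolutions and epoch counts are illustrative; it assumes a compiled, fully convolutional `model` that accepts variable input sizes and a `train_ds` of (image, label) pairs):

```python
import tensorflow as tf

# Warm up on small, cheap images, then fine-tune at full resolution.
for size, epochs in [(128, 5), (224, 5)]:
    ds = (train_ds
          .map(lambda img, lbl, s=size: (tf.image.resize(img, (s, s)), lbl),
               num_parallel_calls=tf.data.AUTOTUNE)
          .batch(64)
          .prefetch(tf.data.AUTOTUNE))
    model.fit(ds, epochs=epochs)
```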

#12 (about 3 minutes)

Optimize your environment and scale up your hardware

Install hardware-specific binaries and leverage distributed training strategies to scale your jobs across multiple GPUs, whether on-premises or in the cloud.
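
On a single machine with several GPUs, `tf.distribute.MirroredStrategy` is the usual starting point (the model and class count are placeholders):

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all local GPUs and
# all-reduces gradients; scale the batch size with the replica count.
strategy = tf.distribute.MirroredStrategy()
print("Replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.applications.MobileNetV2(weights=None, classes=10)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")
# model.fit(train_ds, epochs=10)
```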

#13 (about 3 minutes)

Learn from cost-effective and high-speed training benchmarks

Analyze benchmarks like DawnBench and MLPerf to adopt strategies for training models faster and more cost-effectively by leveraging optimized cloud resources.

#14 (about 3 minutes)

Select efficient model architectures for fast inference

For production deployment, choose lightweight yet accurate model architectures like MobileNet, EfficientDet, or DistilBERT to ensure fast inference on end-user devices.
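
The size gap is easy to see directly in Keras (the two backbones here are just representatives; parameter counts are printed at runtime):

```python
import tensorflow as tf

# Rough size comparison between a heavyweight and a lightweight backbone.
big = tf.keras.applications.ResNet152(weights=None)
small = tf.keras.applications.MobileNetV2(weights=None)
print(f"ResNet152:   {big.count_params() / 1e6:.1f}M parameters")
print(f"MobileNetV2: {small.count_params() / 1e6:.1f}M parameters")
```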

#15 (about 2 minutes)

Shrink model size and improve speed with quantization

Use model quantization to convert 32-bit floating-point weights to 8-bit integers, significantly reducing the model's size and memory footprint for faster inference.
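
With TensorFlow Lite, post-training dynamic-range quantization is a few lines (assumes a trained Keras `model`; the output filename is arbitrary):

```python
import tensorflow as tf

# Post-training dynamic-range quantization: weights are stored as 8-bit
# integers, shrinking the model roughly 4x.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```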
