In this series of articles, we focus on how to efficiently train residual networks on the CIFAR10 image classification dataset using a single GPU.
To measure progress, we use the time it takes a network to train from scratch to 94% test accuracy. This benchmark comes from the recent DAWNBench competition. At the close of the competition, the best time was 341 seconds on a single GPU and 174 seconds on eight GPUs.
Baseline
In this part, we replicate a baseline that trains CIFAR10 to 94% accuracy in about 6 minutes, and then speed it up slightly. We find that there is still a lot of room for improvement before the GPU's FLOP limit is reached.
For the past few months, I've been researching how to train deep neural networks faster. The idea came about earlier this year when I was working on a project with Myrtle's Sam Davis. We were compressing large recurrent neural networks for automatic speech recognition so they could be deployed on FPGAs, which meant retraining the models. The baseline model from Mozilla had been trained on 16 GPUs for a week. Later, thanks to Sam's efforts moving training to mixed precision on NVIDIA's Volta GPUs, we were able to reduce training time by a factor of about 100, bringing iteration times under a day on a single GPU.
This result got me thinking: what else could be accelerated? Around the same time, researchers at Stanford University launched the DAWNBench challenge to compare training speeds across several deep learning benchmarks. The tasks that attracted the most attention were training an image classification model to 94% test accuracy on CIFAR10 and to 93% top-5 accuracy on ImageNet. Image classification is a hot area of deep learning research, yet training still took hours.
By April, as the challenge drew to a close, the fastest single-GPU training time on CIFAR10 came from Ben Johnson, a fast.ai student, who reached 94% accuracy in under 6 minutes (341 seconds). The main innovations were mixed-precision training, choosing a smaller network with just enough capacity for the task, and using higher learning rates to accelerate stochastic gradient descent.
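For readers unfamiliar with mixed-precision training, here is a minimal sketch of the idea using PyTorch's `torch.cuda.amp` API. This API postdates the work described here, so this is not the setup used in the DAWNBench entry; the model, learning rate, and optimizer below are placeholders chosen only for illustration.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model and optimizer -- not the actual DAWNBench training code.
model = torch.nn.Linear(32 * 32 * 3, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
scaler = GradScaler()  # scales the loss so fp16 gradients don't underflow

def train_step(inputs, targets):
    optimizer.zero_grad()
    with autocast():                       # run the forward pass in fp16 where safe
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # unscale gradients, then update weights
    scaler.update()
    return loss.item()
```

The point of the loss scaling is simply to keep small fp16 gradients from rounding to zero; the optimizer still updates full-precision master weights.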
At this point we can't help but ask: how good is 341 seconds to 94% test accuracy on CIFAR10, really? The network in question is an 18-layer residual network, shown below. Here the layer count refers to the sequential depth of the convolutional (purple) and fully connected (blue) layers:
The network was trained for 35 epochs with stochastic gradient descent, using the learning rate schedule shown below:
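As a rough sketch, a piecewise-linear schedule of this kind can be implemented in PyTorch with a `LambdaLR` whose multiplier is interpolated between a few knot points. The knot epochs and learning rates below are illustrative placeholders, not the values from the actual schedule.

```python
import numpy as np
import torch

model = torch.nn.Linear(10, 10)                      # placeholder model
base_lr = 0.4                                        # illustrative peak learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

epochs = 35
# Illustrative knots: ramp up over the first few epochs, then decay back to zero.
knot_epochs = [0, 5, 35]
knot_lrs    = [0.0, base_lr, 0.0]

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda epoch: np.interp(epoch, knot_epochs, knot_lrs) / base_lr,
)

for epoch in range(epochs):
    # ... one epoch of training here ...
    scheduler.step()                                 # move to the next point on the schedule
```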
Now let's estimate how long training would take on an NVIDIA Volta V100 GPU running at 100% of its compute. The network requires about 2.8×10⁹ FLOPs for a forward and backward pass on a single 32×32×3 CIFAR10 image. Assuming that parameter updates are computationally cheap, training for 35 epochs on 50,000 images should complete within about 5×10¹⁵ FLOPs.
Tesla V100 has 640 Tensor Cores and can support 125 TeraFLOPS of deep learning performance.
If we could use 100% of that compute, training would finish in about 40 seconds, so there seems to be plenty of room to improve on the 341-second result.
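The back-of-the-envelope arithmetic behind those numbers is simple enough to write down:

```python
# Back-of-the-envelope estimate of the compute-bound training time.
flops_per_image = 2.8e9        # forward + backward pass on one 32x32x3 image
images = 50_000                # CIFAR10 training set size
epochs = 35

total_flops = flops_per_image * images * epochs
print(f"total training FLOPs ~ {total_flops:.1e}")               # ~4.9e15, i.e. about 5e15

peak_flops_per_s = 125e12      # V100 Tensor Core peak: 125 TeraFLOPS
print(f"ideal training time ~ {total_flops / peak_flops_per_s:.0f} s")   # ~39 s
```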
With a target of 40 seconds in mind, we started our own training. The first step was to reproduce the baseline CIFAR10 result with the residual network above. I built the network in PyTorch, reproducing the learning rate schedule and hyperparameters. Training on a single V100 GPU on an AWS p3.2xlarge instance, 3 out of 5 runs reached 94% accuracy in 356 seconds.
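The full 18-layer network is defined by the diagram above, which isn't reproduced here. Purely as a sketch of the building block involved, a basic residual block (convolution, batch norm, ReLU, plus an identity shortcut) looks like this in PyTorch; the channel width and the arrangement of blocks are assumptions following the common ResNet pattern, not a faithful copy of the diagram.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 conv-BN-ReLU units plus an identity shortcut.
    Illustrative only; widths and ordering follow the usual ResNet convention."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)        # identity shortcut

x = torch.randn(1, 64, 32, 32)            # dummy CIFAR10-sized feature map
print(ResidualBlock(64)(x).shape)         # torch.Size([1, 64, 32, 32])
```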
Once the baseline was established, the next step was to look for simple improvements that could be applied immediately. The first observation: the network starts with two consecutive (batch norm, ReLU) groups, in yellow and red, right after the first purple convolution. This duplication is presumably unintentional, so we remove the repeat; the same pattern appears again at layer 15 and we remove it there too. With these adjustments the architecture became a little simpler, and 4 out of 5 runs reached 94% accuracy in 323 seconds. A new record!
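To illustrate the kind of duplication being removed (the exact module is an assumption based on the description above, not code from the repository), the fix amounts to dropping the repeated (batch norm, ReLU) pair:

```python
from torch import nn

# Before: a convolution followed by two consecutive (batch norm, ReLU) pairs.
stem_before = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.BatchNorm2d(64), nn.ReLU(),        # redundant repeat
)

# After: the duplicate pair is removed.
stem_after = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(),
)
```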
We also observed that some of the image preprocessing steps (padding, normalization, transposition, etc.) are repeated on every pass through the training set, which wastes time. Although this work can be overlapped with GPU computation by using multiple CPU data-loader processes, PyTorch's data loaders launch fresh processes on each pass through the dataset, and the setup time is not negligible, especially on a small dataset like CIFAR10. Doing the common preprocessing once before training removes the redundant work and lets us cut down the number of data-loader processes. For heavier tasks that need more preprocessing, or for multi-GPU training, it would make more sense to keep data-loader processes alive between epochs. After removing the repeated work and trimming the data loaders, training time came down to 308 seconds.
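A minimal sketch of the idea, assuming torchvision is used to fetch the raw data: do the deterministic steps (normalization, transposition to NCHW, padding) once up front, leaving only the cheap random augmentations inside the training loop. The normalization statistics and padding choice below are the commonly used CIFAR10 values, not necessarily those from the original run.

```python
import numpy as np
import torchvision

# Download the raw CIFAR10 training set once (uint8 arrays in HWC layout).
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)
data = train_set.data.astype(np.float32) / 255.0              # (50000, 32, 32, 3)

# Deterministic preprocessing, done a single time before training:
mean = np.array([0.4914, 0.4822, 0.4465], dtype=np.float32)   # usual CIFAR10 statistics
std  = np.array([0.2471, 0.2435, 0.2616], dtype=np.float32)
data = (data - mean) / std                                    # normalize
data = data.transpose(0, 3, 1, 2)                             # HWC -> CHW
data = np.pad(data, [(0, 0), (0, 0), (4, 4), (4, 4)], mode='reflect')  # pad once for random crops

# Only the random crop / flip choices need to be applied per epoch during training.
```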
Digging further, we found that most of the remaining preprocessing time was spent calling the random number generator to select data augmentations, rather than performing the augmentations themselves. Over a full training run we were making several million individual calls to the random number generator; combining these into a small number of bulk calls at the start of each epoch saved a further 7 seconds of training time. The final training time came down to 297 seconds. The code for this process can be found here: github.com/davidcpage/cifar10-fast/blob/master/experiments.ipynb
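A hedged sketch of the trick: instead of asking the generator for one crop offset and one flip decision per image as each image is used, draw them all in a few vectorized calls at the start of the epoch. The augmentation parameters below (crop offsets into a 40×40 padded image and a 50% flip rate) are illustrative assumptions.

```python
import numpy as np

n_images = 50_000
rng = np.random.default_rng(0)

# One bulk call per augmentation choice at the start of the epoch,
# instead of millions of tiny per-image calls to the generator.
crop_x = rng.integers(0, 9, size=n_images)    # horizontal offset into the padded image
crop_y = rng.integers(0, 9, size=n_images)    # vertical offset
flip   = rng.random(n_images) < 0.5           # whether to mirror each image

def augment(image, i):
    """Apply the pre-drawn augmentation choices to padded image i (CHW layout)."""
    out = image[:, crop_y[i]:crop_y[i] + 32, crop_x[i]:crop_x[i] + 32]
    return out[:, :, ::-1] if flip[i] else out
```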