Interpreting the key to deep learning execution from memory bandwidth and computing power

For memory, identifying the memory bandwidth is fairly simple, because the three types of memory, SDRAM, DDR, and RDRAM, look very different. The only thing that takes care is telling apart DDR memory running at different frequencies.

As deep learning continues to develop, computing power has received more and more attention from the deep learning community. Any deep learning model ultimately has to run on a device, and the lower the model's demands on device performance, the more applications it can reach. Never let the hardware become the bottleneck of the model!

When it comes to a model's hardware requirements, the first thing that comes to mind is the amount of computation, that is, how many operations a deep learning model needs to complete one forward pass. However, besides the amount of computation, the model's demand for memory bandwidth is also an important parameter that affects the actual computation time. As we will see below, when memory bandwidth is limited, simply reducing the amount of computation does not reduce the computation time proportionally!


The effect of memory bandwidth on the performance of a hardware system is shown in the figure above. If memory is a bottle and the arithmetic unit is a cup, then the data are the particles in the bottle and the memory interface is the bottle mouth: data can only be consumed (processed) after passing through the mouth into the cup. Memory bandwidth is the width of the bottle mouth. The narrower the mouth, the longer it takes for data to get into the cup (the processing unit). As the saying goes, "even the cleverest cook cannot make a meal without rice": if bandwidth is limited, then even an infinitely fast processing unit will spend most of its time idle, waiting for data, and its computing power is wasted.

Deep Learning Network and Roofline Model

For engineers, qualitative analysis is not enough. We also need to be able to quantitatively analyze the algorithm's memory bandwidth requirements and the impact on computing performance.

An algorithm's demand for memory bandwidth is usually expressed as its operational intensity (also called arithmetic intensity), in units of OPs/byte. This quantity tells us how many arithmetic operations the algorithm can perform per unit of data it reads. The higher the operational intensity, the more operations each unit of data supports, which means the algorithm demands less memory bandwidth. So high operational intensity is a good thing!

Let us give an example. Take a 3x3 convolution with a stride of 1 and an input data plane of 64x64. For simplicity, assume that the number of input feature maps and the number of output feature maps are both 1. A total of 62x62 convolution windows must then be evaluated, and each one requires 3x3=9 multiply-add operations, so the total number of operations is 62x62x9=34596. Assuming that both the data and the convolution kernel are stored as 2-byte floating-point values, the amount of data is 64x64x2 (input data) + 3x3x2 (convolution kernel data) = 8210 bytes, so the operational intensity is 34596/8210=4.21 OPs/byte. If we switch to a 1x1 convolution, the total number of operations becomes 64x64=4096, and the amount of data required is 64x64x2 + 1x1x2 = 8194 bytes. Switching to a 1x1 convolution clearly reduces the amount of computation by nearly 9 times, but the operational intensity also drops to 0.5 OPs/byte, which means the demand for memory bandwidth per operation rises by nearly 9 times. Therefore, if the memory bandwidth cannot keep up with the 1x1 convolution, switching to it reduces the amount of computation by nearly 9 times but does not make the computation 9 times faster.
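The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the function name `conv_intensity` and its parameters are our own, and it follows the example's simplifications (one input and one output feature map, no padding, 2 bytes per value, and only the input and kernel counted as data read).

```python
def conv_intensity(input_size, kernel_size, bytes_per_value=2):
    """Return (ops, bytes_read, ops_per_byte) for a single-channel,
    stride-1 convolution without padding.

    Follows the article's simplification: only the input plane and the
    kernel are counted as data read; the output is not counted.
    """
    output_size = input_size - kernel_size + 1           # 62 for 3x3, 64 for 1x1
    ops = output_size * output_size * kernel_size ** 2   # one multiply-add per kernel tap
    bytes_read = (input_size ** 2 + kernel_size ** 2) * bytes_per_value
    return ops, bytes_read, ops / bytes_read


ops_3x3, bytes_3x3, intensity_3x3 = conv_intensity(64, 3)
ops_1x1, bytes_1x1, intensity_1x1 = conv_intensity(64, 1)

print(f"3x3: {ops_3x3} ops, {bytes_3x3} bytes, {intensity_3x3:.2f} OPs/byte")
print(f"1x1: {ops_1x1} ops, {bytes_1x1} bytes, {intensity_1x1:.2f} OPs/byte")
```

Running it gives 34596 operations over 8210 bytes (4.21 OPs/byte) for the 3x3 case and 4096 operations over 8194 bytes (0.50 OPs/byte) for the 1x1 case, matching the figures above.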

Here we can see that a deep learning computing device has two bottlenecks: one is the computing power of the processor, and the other is the memory bandwidth. How do we determine which one limits computing performance? We can use the Roofline model.

The typical Roofline curve model is shown in the figure above. The axes are computational performance (vertical axis) and the operational intensity of the algorithm (horizontal axis). The Roofline curve has two parts: the rising region on the left and the saturation region on the right. When the operational intensity of the algorithm is low, the curve is in the rising region, that is, computing performance is actually limited by memory bandwidth and many compute units sit idle. As the operational intensity increases, the algorithm performs more operations on the same amount of data, so fewer and fewer compute units are idle and computing performance rises. Eventually all the compute units are in use and the Roofline curve enters the saturation region. From this point on, higher operational intensity brings no further gain because there are no more compute units available, so computing performance stops rising: it has hit a "roof" determined by computing power rather than by memory bandwidth.

In the earlier example of the 3x3 and 1x1 convolutions, the 3x3 convolution may lie in the saturation region on the right of the roofline curve, while the 1x1 convolution, with its lower operational intensity, may fall into the rising region on the left. In that case the performance achieved when computing the 1x1 convolution drops below the peak. Although the 1x1 convolution needs nearly 9 times less computation, its actual computation time is not one-ninth that of the 3x3 convolution, because the achieved performance is lower.
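The roofline can be written as one line: attainable performance is the smaller of the compute peak and bandwidth times ops-per-byte. The sketch below applies it to the two convolutions from the earlier example. The hardware numbers (`PEAK` of 40 GOPS, `BANDWIDTH` of 10 GB/s) are hypothetical, chosen only so that the 3x3 case lands in the saturation region and the 1x1 case in the rising region; the function names are our own.

```python
def attainable_performance(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def estimated_time(ops, bytes_read, peak_ops_per_s, bandwidth_bytes_per_s):
    """Estimated run time in seconds under the roofline model."""
    intensity = ops / bytes_read
    perf = attainable_performance(intensity, peak_ops_per_s, bandwidth_bytes_per_s)
    return ops / perf


# Hypothetical device, not taken from the article.
PEAK = 40e9        # operations per second
BANDWIDTH = 10e9   # bytes per second

# Operation and byte counts from the 3x3 / 1x1 convolution example.
time_3x3 = estimated_time(34596, 8210, PEAK, BANDWIDTH)
time_1x1 = estimated_time(4096, 8194, PEAK, BANDWIDTH)
speedup = time_3x3 / time_1x1

print(f"3x3: {time_3x3 * 1e6:.4f} us, 1x1: {time_1x1 * 1e6:.4f} us")
print(f"speedup from switching to 1x1: {speedup:.2f}x")
```

On this assumed device the 1x1 convolution needs about 8.4 times fewer operations, yet the estimated speedup is only about 1.06x, because the 1x1 case is limited by memory bandwidth.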


Obviously, if a computing system has very high memory bandwidth, an algorithm does not need high operational intensity to hit the "roof" set by the upper limit of computing power. In the figure below, computing power stays the same, and as memory bandwidth increases, the operational intensity required to reach the compute roof gets lower.
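The point where the rising region meets the roof is at peak computing power divided by memory bandwidth. A minimal sketch, with a hypothetical fixed peak of 40 GOPS and made-up bandwidth values, shows how that threshold drops as bandwidth grows:

```python
def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Ops-per-byte needed to reach the compute roof."""
    return peak_ops_per_s / bandwidth_bytes_per_s


PEAK = 40e9                            # hypothetical, operations per second
bandwidths = [5e9, 10e9, 20e9, 40e9]   # hypothetical, bytes per second

ridge_points = [ridge_point(PEAK, bw) for bw in bandwidths]

for bw, ridge in zip(bandwidths, ridge_points):
    print(f"bandwidth {bw / 1e9:.0f} GB/s -> roof reached at {ridge:.1f} OPs/byte")
```

With these assumed numbers the required ops-per-byte falls from 8 to 1 as bandwidth goes from 5 GB/s to 40 GB/s.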


The Roofline model is very useful in algorithm-hardware co-design, because it tells us which direction to optimize in: should we increase memory bandwidth or reduce the demand for it, or should we increase computing power or reduce the amount of computation? If the algorithm sits in the rising region of the roofline curve, we should increase memory bandwidth or reduce the memory bandwidth demand; increasing computing power or reducing the amount of computation does not help in this case. The reverse holds in the saturation region.
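This decision rule is simple enough to state as code. The sketch below is illustrative only: `optimization_direction` is a name we made up, and the device numbers are hypothetical.

```python
def optimization_direction(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Say which side of the roofline an algorithm is on and what to optimize."""
    ridge = peak_ops_per_s / bandwidth_bytes_per_s
    if intensity < ridge:
        # Rising region: extra compute would just sit idle.
        return "memory-bound: raise memory bandwidth or cut memory traffic"
    # Saturation region: extra bandwidth would go unused.
    return "compute-bound: raise computing power or cut the amount of computation"


PEAK = 40e9        # hypothetical, operations per second
BANDWIDTH = 10e9   # hypothetical, bytes per second

advice_1x1 = optimization_direction(0.5, PEAK, BANDWIDTH)    # 1x1 convolution example
advice_3x3 = optimization_direction(4.21, PEAK, BANDWIDTH)   # 3x3 convolution example

print("1x1:", advice_1x1)
print("3x3:", advice_3x3)
```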

Let's look at a practical example and compare where various machine learning algorithms sit on the roofline model. The image below is taken from Google's TPU paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit." As the figure shows, the LSTM algorithm has the lowest operational intensity, so it is stuck in the middle of the rising region of the roofline. That is, when the TPU runs LSTM, memory bandwidth limits performance to only about 3 TOPS, roughly one-thirtieth of the peak performance (90 TOPS). Multi-layer perceptrons (MLPs) have slightly higher operational intensity than LSTM but are also stuck in the rising region, with actual performance of about 10 TOPS. The convolutional neural network models can reuse convolution kernels, so their operational intensity is very high, and CNN0 in particular comes very close to the roof of the TPU roofline (86 TOPS). CNN1 also has high operational intensity, but for other reasons it cannot reach the roof (the paper points out that CNN1's shallow feature depth keeps it from fully using the TPU's compute units). This example shows another point about hardware-algorithm co-design: besides memory bandwidth, there are "other factors" that can keep an algorithm from reaching the roof, and we should try to minimize them!
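To put these numbers side by side, the short sketch below computes what fraction of the peak each model reaches. It uses only the approximate TOPS values quoted in this paragraph; CNN1 is left out because no figure is given for it here.

```python
PEAK_TOPS = 90  # peak performance as quoted in this article

# Approximate achieved performance quoted above, in TOPS.
measured_tops = {"LSTM": 3, "MLP": 10, "CNN0": 86}

utilization = {name: tops / PEAK_TOPS for name, tops in measured_tops.items()}

for name, fraction in utilization.items():
    print(f"{name}: {fraction:.1%} of peak")
```

LSTM reaches about 3% of peak, MLP about 11%, and CNN0 about 96%.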

