A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Abstract

Recent literature has found that convolutional neural networks (CNNs) with large filters perform well in some applications such as image semantic segmentation. The Winograd transformation helps to reduce the number of multiplications in a convolution but suffers from numerical instability when the convolution filter size gets large. This work proposes a nested Winograd algorithm that iteratively decomposes a large filter into a sequence of 3×3 tiles, which can then be accelerated with a 3×3 Winograd algorithm. Compared with the state-of-the-art OLA-Winograd algorithm, the proposed algorithm reduces the multiplications by 1.41 to 3.29 times for computing 5×5 to 9×9 convolutions.



Jingbo Jiang, Xizi Chen and Chi-Ying Tsui
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
Email: {jjiangan, xchenbn}@connect.ust.hk, [email protected]

I. INTRODUCTION

Convolutional neural networks (CNNs) have been widely applied to different computer vision tasks in recent years. While many popular neural networks adopt a small kernel size of 3×3 [14], recent literature [18,20,23,24] found that for applications requiring per-pixel prediction, such as semantic segmentation [20] or image super-resolution [24], convolution with large filters outperforms deep 3×3 networks. For example, [20] demonstrated that increasing the convolution filter size to 15×1 and 1×15 improved the image segmentation result on the VOC12 dataset by 4.4% in terms of IoU compared with a deeper 3×3 network with the same number of model parameters. The result might be further improved if convolutions with larger filters could be trained more efficiently. This phenomenon is further explained by [18], which proposed the concept of the effective receptive field: the central pixels of a filter contribute more to the receptive field in the image because they can propagate more information to the outputs in a deep neural network. Thus, being able to explore the effectiveness of large filters with acceptable training time can be important for these applications.

Fast algorithms such as the fast Fourier transform (FFT) [6] or the Winograd transformation [1] are commonly used to reduce the computational complexity of convolution. They replace some of the expensive multiplications with cheap operations such as additions to improve computation throughput. However, both algorithms face challenges when applied to convolution with large filters. FFT requires complex multiplications, each of which is composed of three real multiplications. This makes FFT consume more multiplications than Winograd when the convolution filter size is small or moderate. For example, [8] compared the computation throughput of FFT-based and Winograd-based convolutions on GPU and found that Winograd is faster when the filter size is smaller than 16.
For the Winograd algorithm, directly using it to accelerate convolution with large filters is found to be numerically unstable [9,10]. Winograd requires transforming the feature maps and the filters into a fractional number field, which is done by multiplying them with fixed transformation matrices. These matrices are derived from a Vandermonde matrix, whose entries grow exponentially with the matrix size. Thus, multiplying the data by a large number may make the computation overflow, and dividing the data by a large number makes the computation suffer from quantization error. This is why current Winograd-based CNN accelerators [2-4] rarely use large Winograd transformation matrices.

Compared with FFT, the Winograd algorithm appears to be more popular in recent CNN accelerators since it normally performs better in accelerating 3×3 convolutions. Some literature [5,11,12] proposed to use the overlap-and-add (OLA) algorithm to accelerate convolution with large filters using the Winograd algorithm without incurring the numerical stability problem. The OLA-Winograd algorithm decomposes the convolution with large filters into a sequence of 3×3 convolutions: it slices the input feature maps and the filters into multiple 3×3 tiles and then performs a 3×3 Winograd convolution on each pair of 3×3 tiles. The OLA-Winograd algorithm is also combined with the stride Winograd algorithm [13] to decompose a convolution with arbitrary stride and filter size [12] into 3×3 convolutions. However, OLA-Winograd does not fully utilize the data dependency between different 3×3 tiles and thus does not yield the best multiplication efficiency.

In this work, we propose a nested Winograd algorithm that exploits the data dependencies between the decomposed 3×3 tiles better than the OLA-Winograd algorithm, and we prove that it uses fewer multiplications.
This is possible because each 3×3 Winograd transformation removes a certain number of multiplications, and nested Winograd applies the 3×3 Winograd transformation to the data more times than OLA-Winograd does. We also propose an algorithm to decompose a convolution with arbitrary stride and filter size into 3×3 convolutions at runtime by combining the nested Winograd with the stride Winograd algorithm. To demonstrate the effectiveness of the proposed algorithm, we adapt the architecture of an OLA-Winograd accelerator to the nested Winograd and implement it on an FPGA. We observe that nested Winograd achieves a 1.41 to 3.29 times throughput improvement compared with OLA-Winograd when running convolutions with filter sizes ranging from 5×5 to 9×9. We also show that, compared with a previous OLA-Winograd accelerator [19] running FSRCNN-s for image super-resolution, adapting to nested Winograd results in an overall 1.27 times throughput improvement.

In summary, the contributions of this paper are as follows:

• A nested Winograd algorithm is proposed to accelerate the execution of convolutions with large filter sizes, which outperforms the state-of-the-art OLA-Winograd algorithm.
• A decomposition algorithm combining the nested Winograd with stride Winograd is proposed to decompose a convolution with arbitrary stride and filter size into 3×3 convolutions.

II. BACKGROUND

A. Winograd Algorithm

A 2D native convolution correlates $M$ channels of input feature maps $x$ of size $H \times W$ with $N$ groups of $M$-channel filters $w$ of size $R \times C$ with stride $S$, to produce $N$ channels of output feature maps $y$, which is given by

$$y_{n,i,j} = \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{c=1}^{C} x_{m,\, i \cdot S + r,\, j \cdot S + c} \cdot w_{n,m,r,c} \qquad (2.1)$$

For each channel of an $R \times C$ filter, a stride-1 native convolution takes an $R \times C$ tile from the input feature map, performs multiply-and-accumulate (MAC) operations on it and produces one output pixel. After that, the input feature map window is slid by 1 to take the next input tile. In contrast, the Winograd algorithm takes a larger $l \times l$ input tile $x$ from the input feature map, transforms it and the filter into two $l \times l$ tiles, performs element-wise matrix multiplication (EWMM) between them to create an $l \times l$ output tile, and transforms that tile to produce an $m \times m$ output tile. Adjacent input tiles are taken from the input feature map with a sliding-window stride of $m$. This procedure is denoted as $F(m \times m, r \times r)$ by [1], where $r \times r$ is the filter size (assuming $R = r$, $C = r$) and $l = m + r - 1$. This process is illustrated in Fig. 1. Let $B$, $G$, $A$ denote the input, filter, and output transformation matrices, respectively, and let $\odot$ represent the EWMM. The Winograd algorithm is then formulated as

$$y = A^T \left[ (B^T x B) \odot (G w G^T) \right] A \qquad (2.2)$$

The above case is illustrated for 2D convolution. For 1D convolution, Eq. 2.2 degenerates to $y = A^T (B^T x \odot G w)$ and is denoted by $F(m, r)$. From Eq. 2.2 we can see that the multi-dimensional Winograd algorithm can be constructed by applying the 1D Winograd transformation to each dimension of the input tile, the filter and the output tile, respectively. Therefore, $F(m, r)$ is called the Winograd kernel to reflect that it is the base operation for constructing the multi-dimensional Winograd algorithm.

Here we give a brief explanation of why the Winograd algorithm reduces the number of multiplications; readers can refer to [13] for more details. Native convolution does all the computations in the field of real numbers ℝ. Instead, the Winograd algorithm first constructs a finite field of rational numbers Ϝ_l, then transforms the input tiles and the filters into Ϝ_l to do the computation. In the native convolution, there are data overlaps between adjacent input tiles. The field Ϝ_l helps to exploit these data overlaps and uses additions and fixed (constant) multiplications to replace some of the overlapped general multiplications. Thus, if there are no data overlaps between adjacent input tiles (consider $F(1, r)$), the Winograd convolution reduces to the native convolution.

B. OLA-Winograd and Nesting Decomposition
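The examples in this subsection all build on the $F(2,2)$ kernel, so a concrete instance of the 1D form of Eq. 2.2, $y = A^T (B^T x \odot G w)$, may be useful as a reference. The sketch below is plain Python; the three matrices are one valid choice for $F(2,2)$ that we derived by hand for illustration (an assumption, not necessarily the matrices used in the paper), and the helper names `matvec`, `winograd_f22` and `native_conv` are hypothetical.

```python
# One valid set of F(2,2) transform matrices. They are derived by hand for
# this sketch (an assumption), not copied from the paper.
BT = [[1, -1, 0], [0, 1, 0], [0, -1, 1]]   # input transform B^T, l x l
G = [[1, 0], [1, 1], [0, 1]]               # filter transform G, l x r
AT = [[1, 1, 0], [0, 1, 1]]                # output transform A^T, m x l


def matvec(mat, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def winograd_f22(x, w):
    """1D Winograd kernel F(2,2): y = A^T (B^T x ⊙ G w)."""
    tx = matvec(BT, x)                       # entries are 0 or ±1: additions only
    tw = matvec(G, w)                        # entries are 0 or 1: additions only
    prod = [a * b for a, b in zip(tx, tw)]   # the l = 3 general multiplications
    return matvec(AT, prod)                  # m = 2 outputs


def native_conv(x, w):
    """Stride-1 native convolution (correlation form), used as a reference."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


x = [1, 2, 3]   # one input tile of length l = 3
w = [4, 5]      # filter of length r = 2
print(winograd_f22(x, w))   # [14, 23]
print(native_conv(x, w))    # [14, 23]
```

With these matrices the kernel spends three general multiplications on two outputs, whereas the native computation spends four.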

The structure of $F(m, r)$ shows that the Winograd transformation matrices are different for convolutions with different filter lengths $r$. OLA-Winograd provides a method to compute a convolution with filter length $R > r$ on a fixed $F(m, r)$. Fig. 2 illustrates the convolution of a length-7 input vector $x$ and a length-4 filter $w$ based on $F(2,2)$. The procedure of OLA-Winograd is straightforward. It first slices the filter $w$ into two sub-vectors $\mathbf{w}_0 = [w_0, w_1]$ and $\mathbf{w}_1 = [w_2, w_3]$, and then slides each of them through the input vector $x$ to perform $F(2,2)$. If we denote the $F(2,2)$ operation by $F$, the procedure of OLA-Winograd in this example can be written as

$$y = \begin{cases} F(\mathbf{x}_0, \mathbf{w}_0) + F(\mathbf{x}_1, \mathbf{w}_1) \\ F(\mathbf{x}_1, \mathbf{w}_0) + F(\mathbf{x}_2, \mathbf{w}_1) \end{cases} \qquad (2.3)$$

where $\mathbf{x}_0 = [x_0, x_1, x_2]$, $\mathbf{x}_1 = [x_2, x_3, x_4]$ and $\mathbf{x}_2 = [x_4, x_5, x_6]$, labeled in Fig. 2, are the sub-vectors of $x$; the first row produces the outputs $y_0, y_1$ and the second row produces $y_2, y_3$.

If we take a closer look at Eq. 2.3, we find that it can be further accelerated by another Winograd kernel. Eq. 2.3 can be viewed as a general convolution between $[x'_0, x'_1, x'_2]$ and $[w'_0, w'_1]$, in which the MAC operation is replaced by an $F$-and-add operation labeled $MAC'$. Thus, this general convolution can be accelerated by a general Winograd kernel labeled $F'$. Since $F$ applied to a scalar convolution saves one multiplication (three instead of four), $F'$ applied to this vector convolution likewise saves one $F$ operation, which itself contains three multiplications. This process is also shown in Fig. 2.

Nesting decomposition [15] is a general algorithm with a systematic procedure to formulate the above idea. It has been applied to many fast algorithms in the signal processing domain. For example, when applied to the fast Fourier transform it becomes the Cooley-Tukey [21] and Good-Thomas [22] algorithms; when applied to cyclic convolution accelerated by the Winograd algorithm it becomes the Agarwal-Cooley algorithm [15]. Applying nesting decomposition to different types of fast algorithms and convolutions requires handling the data alignment differently. In this work, we adapt the nesting decomposition to CNN computation and propose a method to solve the data alignment problem.

A rigorous formulation of the nesting decomposition algorithm is given in [15]. Here we briefly reintroduce its computation procedure based on the same example given by Eq. 2.3. From the previous discussion, we know that Eq. 2.3 is equivalent to

$$y = F'(x', w') = A'^T (B'^T x' \odot G' w') \qquad (2.4)$$

where $A'^T$, $B'^T$, $G'$ are computed by applying the tensor product ($\otimes$) on $A^T$, $B^T$, $G$ with themselves, respectively (proved in [15]). So Eq. 2.4 becomes

$$y = (A^T \otimes A^T) \left[ (B^T \otimes B^T) \begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \end{bmatrix} \odot (G \otimes G) \begin{bmatrix} w'_0 \\ w'_1 \end{bmatrix} \right] \qquad (2.5)$$

where $x'_0$ to $w'_1$ are all column vectors. An example of computing the tensor product, with $a_{ij}$ denoting the entries of $A^T$, is

$$A^T \otimes A^T = \begin{bmatrix} a_{00} \cdot A^T & a_{01} \cdot A^T & a_{02} \cdot A^T \\ a_{10} \cdot A^T & a_{11} \cdot A^T & a_{12} \cdot A^T \end{bmatrix}$$

From the definition of the tensor product, Eq. 2.5 can be rewritten as

$$y = A^T \left[ (B^T [x'_0 \; x'_1 \; x'_2] B) \odot (G [w'_0 \; w'_1] G^T) \right] A \qquad (2.6)$$

and it can be seen that nesting decomposition can be computed as a higher-dimensional Winograd algorithm with the corresponding data rearrangement at the input and the filter.

Fig. 1. The computation flow of the native convolution and the Winograd convolution $F(m, r)$.

C. Stride Winograd Algorithm

OLA-Winograd and nesting decomposition cannot handle convolution with non-1 stride directly. The stride Winograd algorithm [12] was proposed to decompose a convolution with non-1 stride into stride-1 convolutions. [13] proved that the computation of a Winograd convolution with stride $s$, denoted by $F(m, r; s)$, can be decomposed into the computation of two kinds of stride-1 convolutions whose filter lengths differ by 1. The formulation is given as

$$F(m, r; s) = p \cdot F(m, r' + 1) + (s - p) \cdot F(m, r'), \qquad r = r' \cdot s + p \qquad (2.7)$$

where $p$ and $r'$ are design parameters that can be tuned. If we only want to use a single Winograd kernel for FPGA implementation, we can replace $F(m, r')$ by $F(m, r' + 1)$ with zero padding. A detailed illustration of the stride Winograd algorithm is given in [12].

III. THE NESTED WINOGRAD ALGORITHM

A. The Nested Winograd Algorithm

We propose a nested Winograd algorithm which applies the nesting decomposition algorithm to CNN computation and at the same time solves the data alignment issue. The procedure of the algorithm is illustrated using the same 1D example formulated by Eq. 2.6, which convolves a length-$R$ filter ($R = 4$) with a very long input vector using $F(m = 2, r = 2)$. The procedure of computing Eq. 2.6 is illustrated in Fig. 3.

The first step is to perform the nested filter transformation. $F(2,2)$ can only deal with filters whose length is not greater than $r = 2$. As in this case we have $r < R \le r^2$, we can reshape the filter into an $r \times r$ matrix and use 2D Winograd (Eq. 2.6) to transform it. This reshaped matrix is called the filter reordered matrix. In other cases, if $R > r^2$, the filter can be reshaped into a tensor and a multi-dimensional Winograd algorithm [16] replaces Eq. 2.6. Also, if the filter has fewer entries than the filter reordered matrix, we zero-pad the end of the filter to fill the void space. The filter reordered matrix is then transformed into the rational field Ϝ using a 2D Winograd filter transformation.

The second step is to perform the nested input transformation. Since the input vector is very long, we first apply a sliding window (outer slide) to cut it into slices called input slices. Then we apply another sliding window (inner slide) to cut the input slice into what we call reordered columns and place them column by column to form the input reordered matrix. The window size of the inner slide equals $l = m + r - 1$ because the input transformation matrix $B$ has a size of $l \times l$. The stride of the inner slide equals $r$ because, to align the data, the entries of adjacent columns in the input reordered matrix and in the filter reordered matrix should have the same index distance. For example, as highlighted by the yellow circle in Fig. 3, horizontally adjacent elements in the filter reordered matrix have distance $r = 2$, so the inner slide should have a stride of $r = 2$ and a window size of $l = 3$. This also dictates that the outer slide should have a window size of $L = r \cdot (l - 1) + l$. After the input reordered matrix is constructed, $B^T (\cdot) B$ is applied to transform it into the field Ϝ.
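The two reordering steps above can be sketched in plain Python for the length-7 input slice and length-4 filter of the running example. The EWMM and output transformation described in the next paragraph are included only so that the result can be checked against a native convolution. The $F(2,2)$ matrices are one hand-derived valid choice (an assumption, not necessarily the paper's), and all function names are hypothetical.

```python
# F(2,2) matrices: one valid hand-derived choice (an assumption).
BT = [[1, -1, 0], [0, 1, 0], [0, -1, 1]]
G = [[1, 0], [1, 1], [0, 1]]
AT = [[1, 1, 0], [0, 1, 1]]
M_OUT, R_LEN = 2, 2
L_TILE = M_OUT + R_LEN - 1                   # l = m + r - 1 = 3
SLICE_LEN = R_LEN * (L_TILE - 1) + L_TILE    # L = r*(l-1) + l = 7


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def reorder_filter(w):
    """Filter reordered matrix, filled column by column: W[a][b] = w[a + r*b]."""
    return [[w[a + R_LEN * b] for b in range(R_LEN)] for a in range(R_LEN)]


def reorder_input(x_slice):
    """Inner slide with window l and stride r: X[i][j] = x[i + r*j]."""
    return [[x_slice[i + R_LEN * j] for j in range(L_TILE)]
            for i in range(L_TILE)]


def nested_winograd(x_slice, w):
    """Eq. 2.6 on the reordered matrices; returns m^2 = 4 outputs."""
    tx = matmul(matmul(BT, reorder_input(x_slice)), transpose(BT))   # B^T X B
    tw = matmul(matmul(G, reorder_filter(w)), transpose(G))          # G W G^T
    prod = [[tx[i][j] * tw[i][j] for j in range(L_TILE)]
            for i in range(L_TILE)]                                  # l^2 = 9 mults
    y_mat = matmul(matmul(AT, prod), transpose(AT))                  # A^T (.) A
    # Output reordered matrix holds y[p + r*q] at row p, column q.
    return [y_mat[p][q] for q in range(M_OUT) for p in range(M_OUT)]


def native_conv(x, w):
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


x_slice = [1, 2, 3, 4, 5, 6, 7]
w = [1, 2, 3, 4]
print(nested_winograd(x_slice, w))   # [30, 40, 50, 60]
print(native_conv(x_slice, w))       # [30, 40, 50, 60]
```

In this sketch the nested form spends 9 general multiplications on 4 outputs, while evaluating Eq. 2.3 with four separate $F(2,2)$ calls would spend 12.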
The final step is to perform EWMM between the two transformed reordered matrices, reshape the result to get the output reordered matrix, and perform the nested output transformation on it to get the output vector (ℝ). The output vector has $m^2$ entries since we are using 2D Winograd. If instead an $n$-dimensional Winograd is used to compute the nested Winograd, the algorithm consumes $l^n$ multiplications and produces $m^n$ outputs for each input vector slice.

The above description also shows the general procedure to align the data between the input reordered matrix and the filter reordered matrix for all $F(m, r)$. However, if $F(m, r)$ with $m \ne r$ is used, the output vector may include some redundant terms, which should be discarded. Therefore, normally although not always, using $F(m, r)$ with $m = r$ reduces the multiplication-per-output ratio and hence the computation complexity.

As illustrated in Fig. 4, to extend the above 1D computation procedure to a 2D CNN computation, we only need to apply the nested input/filter transformation first to all the rows and then to all the columns of the input tile/filter, and perform EWMM to get the output tile (Ϝ). After that, the nested output transformation is applied to each of its rows and then to each of its columns to get the final output tile (ℝ). Overall, since we have already seen that Eq. 2.6 uses fewer multiplications than Eq. 2.3 because it applies an extra Winograd transformation on top of Eq. 2.3, nested Winograd for CNNs built on Eq. 2.6 should also use fewer multiplications than OLA-Winograd because it applies more Winograd transformations to the input feature maps and filters.

Fig. 2. Convolving a length-7 input and a length-4 filter based on $F(2,2)$ with native convolution, OLA-Winograd and nesting decomposition.

Fig. 3. Example of convolving an input vector with a length-4 filter based on $F(2,2)$ with nested Winograd. (Labels in the figure: $F(2,2)$ with $l = 3$, $m = 2$, $r = 2$; input slice length $L = r \cdot (l - 1) + l$; inner slide stride $r$; outer slide stride $m^2$; output length $M = m^2$.)

B. Multiplication Complexity Analysis

In the following, an analysis is provided to show the advantage of using nested Winograd. The multiplication complexity is defined as the number of multiplications needed to produce one output element in a convolution. The additions and fixed (constant) multiplications used in the Winograd transformation stages are not included in this analysis since they are normally implemented in LUTs instead of the scarce DSP resources. To derive the multiplication complexity, consider a convolution of a length-$R$ filter with an infinite-length input vector using $F(m, r)$. OLA-Winograd slices the length-$R$ filter into $R/r$ sub-vectors and applies $F(m, r)$ to each of them individually to produce a length-$m$ output vector. Since each $F(m, r)$ transformation requires $l = m + r - 1$ multiplications, the overall multiplication complexity is

$$\mathcal{O}\left(\frac{l}{m} \cdot \frac{R}{r}\right) \qquad (3.2)$$
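As a quick sanity check of Eq. 3.2, the count can be evaluated for a few kernels. The sketch below is illustrative only: the function name `ola_mults_per_output` is hypothetical, and rounding $R/r$ up to the next integer (zero-padding the last filter slice) is our assumption for the case where $r$ does not divide $R$.

```python
from fractions import Fraction


def ola_mults_per_output(m, r, R):
    """General multiplications per output for 1D OLA-Winograd, per Eq. 3.2.

    Assumption: when r does not divide R, the last filter slice is
    zero-padded, so the number of slices is ceil(R / r).
    """
    l = m + r - 1                 # multiplications per F(m, r) call
    n_slices = -(-R // r)         # integer ceil(R / r)
    return Fraction(l * n_slices, m)


print(ola_mults_per_output(2, 2, 4))   # 3
print(ola_mults_per_output(3, 3, 9))   # 5
```

For $F(2,2)$ and $R = 4$ this gives 3 multiplications per output, which matches Eq. 2.3: four $F(2,2)$ calls of three multiplications each produce four outputs.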

For nested Winograd, we only analyze Winograd kernels with $m = r$ to simplify the discussion; the $m \ne r$ case can be derived in a similar manner. Assume that the reordered tensor of nested Winograd has $b = \log_r R$ dimensions. Since applying $F(m, r)$ to one length-$r$ filter consumes $l$ multiplications and produces $m$ output elements, the overall complexity of nested Winograd is the product of the multiplication complexities in all dimensions, given by

$$\mathcal{O}\left(\left(\frac{l}{m}\right)^{\log_r R}\right) \qquad (3.3)$$

It is hard to directly compare the complexity of Eq. 3.3 and Eq. 3.2 at this stage. Thus, we further approximate Eq. 3.3 with its first-order Taylor expansion at the point $R = r$ and get

$$\mathcal{O}\left(\frac{l}{m \cdot r} \cdot \log_r\left(\frac{l}{m}\right) \cdot R + \alpha\right) \qquad (3.4)$$

where $\alpha$ represents the constant term. If we apply $l = m + r - 1$, divide the slope of Eq. 3.4 by that of Eq. 3.2 and require the result to be smaller than or equal to 1, we get $\log_r(l/m) \le 1$, that is, $m + r - 1 \le m \cdot r$. This is equivalent to $(m - 1)(r - 1) \ge 0$, which holds for any positive integers $m$ and $r$. It means that nested Winograd almost always has a gentler slope than OLA-Winograd. However, as Eq. 3.4 also has a constant term, which may make nested Winograd less efficient than OLA-Winograd, we perform a simulation on some common cases and summarize the performance difference between the two algorithms in Section V.A.

IV. RECONFIGURABLE ACCELERATOR DESIGN

A. Accelerator Design

We designed a reconfigurable accelerator based on the Xilinx Zynq FPGA architecture to measure the effectiveness of the nested Winograd. The accelerator is composed of two parts: an on-chip ARM processor (processing system, PS) running a decomposition algorithm, and a fabric logic array (programmable logic, PL) implementing a reconfigurable convolution engine that executes 3×3 Winograd convolutions. Fig. 5 shows an overview of this accelerator. To execute a convolution layer, the PS first decomposes the layer into a sequence of 3×3 stride-1 convolutions and reconfigures the PL by writing to the registers inside its controller (without reprogramming the FPGA). The PS then performs data marshaling, including interleaving, reshaping, and transposing, on the input data and the filters stored in DRAM. Finally, the PS streams the data into the PL through Xilinx's high-performance (HP) port and finishes the computation of this layer. The output data is then streamed back to the DRAM, and this process is repeated until all the layers of the CNN are processed.

The convolution engine is modified from an OLA-Winograd accelerator [2]. It mainly contains the following modules:

• A controller responsible for configuring the PL before the execution of every convolution layer.
• Three AXI buffers exchanging data between the PS and the PL.
• Pipelined input, filter and output Winograd transformation units running $F(m \times m, 3 \times 3)$ with 1-cycle latency, where $m$ is a design parameter that can be tuned.
• Matrix transpose blocks (.T), implemented as shift-register-based buffers [25], to perform Winograd transformations in a loop for the nested Winograd.
• A processing element (PE) instantiated with DSP48 slices to perform MAC operations, which contains $G_{ch}$ channels of $G_w \times G_{in}$ output-stationary arrays for general matrix multiplication (GEMM).

The computation engine processes data with a 3-stage pipeline (same as [2]): input/filter transform, GEMM, and output transform. The reconfiguration switches the execution mode of the convolution engine between OLA-Winograd, nested Winograd, stride Winograd, direct Winograd, or GEMM. Nested Winograd is executed by looping the data through the .T and Trans2D blocks, and the other Winograd algorithms are executed by disabling the .T block. GEMM is executed by disabling all Trans2D blocks.

Fig. 4. Computing a stride-1 2D convolution which has a 3×3 filter with $F(2,2)$ using nested Winograd.

Fig. 5. Block design of the reconfigurable accelerator. The transpose buffer marked in green is only required by the nested Winograd.

B. The Decomposition Algorithm

The decomposition algorithm decomposes a convolution with arbitrary filter size and stride into a sequence of 3×3 stride-1 convolutions. The process includes both decomposing the network structure and marshaling the data. The procedure has four steps, which are illustrated in Fig. 6. We discuss each step using the example of decomposing and accelerating a 5×5 stride-2 convolution by $F(m = 2, r = 2)$.

The first step is to determine the decomposed network structure. The procedure of using the Winograd algorithm to accelerate a 5×5 stride-2 convolution can be written as $F(M, R = 5; S = 2)$, where $M$ represents the overall output length, an unknown parameter that is determined during the procedure of the algorithm. The decomposition algorithm iteratively checks whether the given $F(M, R; S)$ has $S > 1$ or $R > r$. If so, it decomposes the Winograd kernel with the stride Winograd or the nested Winograd algorithm, respectively. After step 1, the decomposition result is

$$F(M, 5; 2) = [F(M_1, 2) \otimes F(M_2, 2)] + F(M_3, 2) \qquad (4.1)$$

where $M_1$, $M_2$, $M_3$ are the output lengths of the decomposed kernels and $\otimes$ represents the nested Winograd operation, to be consistent with Eq. 2.5. Eq. 4.1 is represented as an expression tree, which can be read out in reverse Polish order by the processor.

The second step is to determine the value of $M$. Since the kernels $F(M_i, 2)$ in Eq. 4.1 cannot be further decomposed, we have $M_i = m$. Then we compute backward until reaching $F(M, R; S)$. Nested Winograd outputs $m^2$ elements, as analyzed in Section III.A. Also, the two Winograd kernels decomposed by the stride Winograd should have the same number of output elements. One small problem is that adding two Winograd kernels requires them to have the same output length. For example, $F(m_1, r_1) + F(m_2, r_2)$ requires $m_1 = m_2$. If we find $m_1 \ne m_2$ during the decomposition, $F(n \cdot m, r) = n \cdot F(m, r)$ can be used to adjust the output length. This equation simply means executing $F(m, r)$ on the input vector $n$ times to get $n \cdot m$ output elements. In this example, $F(2,2)$ is executed twice to match the output length $2^2$ of the nested kernel, which gives $M = 4$.

The third and fourth steps perform the data marshaling, as illustrated in Fig. 6.

V. EXPERIMENTS

A. Algorithm Performance Simulation

We carry out simulations to compare the multiplication complexity of the nested Winograd and the OLA-Winograd. We also include the multiplication complexity of the direct Winograd, which directly uses $F(m, r)$ with a varying $r$ to compute the target convolution. We restrict the direct Winograd to Winograd kernels no larger than those of the nested Winograd to avoid the numerical instability mentioned in the introduction. We simulated 2D convolutions with filter sizes ranging from 3×3 to 12×12 with four different Winograd kernels: $F(2,2)$, $F(3,3)$, $F(4,4)$ and $F(6,3)$. Winograd kernels larger than $F(6,3)$ are rarely used, as stated in [2].

The simulation results are summarized in Fig. 7, which demonstrates that the nested Winograd uses fewer multiplications than OLA-Winograd in most cases, and the gap increases as the filter size gets larger. However, nested Winograd may be less efficient than OLA-Winograd when a large Winograd kernel is used to accelerate a convolution with a small filter size. For example, nested Winograd uses 2.2% more multiplications than OLA-Winograd when $F(6,3)$ is used to accelerate a convolution with a 5×5 filter. However, it turns out that the direct Winograd algorithm has the highest efficiency in such a case. This phenomenon is also reported in [12], which states that OLA-Winograd is less efficient than direct Winograd in processing convolutions with small filters.

B. Layer-wise Performance Evaluation

We compare the throughput of running nested Winograd and OLA-Winograd on the accelerator designed in Section IV with convolution layers that have filter sizes ranging from 5×5 to 9×9. We choose the 5×5 and 9×9 convolution layers from the SRCNN [7] network together with a 7×7 depthwise convolution layer from PNasNet [26]. SRCNN is used in image super-resolution, and PNasNet is a network created by neural architecture search (NAS). The accelerator is implemented on the Xilinx ZCU102 board, which contains an ARM dual-core A53 processor with DDR4-2666 memory providing sufficient bandwidth for off-chip data access. The PL is implemented to run at 200 MHz, which is consistent with [2]. The convolution engine is implemented with $F(3,3)$, and $G_{ch}$, $G_{in}$, $G_w$ are set to 25, 6, 6. The experimental results are summarized in Tab. I, which shows a 1.41 to 3.29 times improvement in GOPS when executing 5×5 to 9×9 convolution layers.

Fig. 6. Decomposing a 5×5 stride-2 convolution with $F(2,2)$. (Panels in the figure: Step 1, determine the structure; Step 2, determine $M$; Step 3, determine the filter; Step 4, determine the input slice.)

Fig. 7. Algorithm complexity comparison between OLA-Winograd, nested Winograd and direct Winograd.

C. Performance Evaluation on other CNN accelerators

To compare with a previous OLA-Winograd accelerator, we chose [19], which designs an FPGA accelerator running the FSRCNN-s network to upscale an image from 1920×1080 to 3840×2160 in real time. They propose an FTConv algorithm to decompose all the layers in FSRCNN-s into 5×5 convolutions, then use OLA-Winograd to accelerate them. In the decomposed FSRCNN-s, 86.1% of the MACs come from the 5×5 convolutions. We reimplement the convolver in their architecture and adapt it to the nested Winograd as described in Section IV. The results summarized in Tab. II show that using nested Winograd requires more LUT and BRAM resources to implement the matrix transpose buffer, but it achieves a 1.26 times overall speed-up compared with using OLA-Winograd with the same number of DSPs. We did not observe a PSNR drop when changing to nested Winograd in this experiment.

VI. CONCLUSION

In this work, a nested Winograd algorithm is proposed for accelerating convolution with arbitrary size and stride. The algorithm complexity is reduced compared with the existing OLA-Winograd for processing large filters. Implementation results show a 1.41 to 3.29 times speed-up in executing convolutions with filter sizes from 5×5 to 9×9.

REFERENCES

[1] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2016-Decem, pp. 4013–4021, 2016, doi: 10.1109/CVPR.2016.435.
[2] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on FPGAs,” Proc. - IEEE 25th Annu. Int. Symp. Field-Programmable Cust. Comput. Mach. FCCM 2017, pp. 101–108, 2017, doi: 10.1109/FCCM.2017.64.
[3] L. Lu and Y. Liang, “SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs,” Proc. - Des. Autom. Conf., vol. Part F1377, 2018, doi: 10.1145/3195970.3196120.
[4] F. Shi, H. Li, Y. Gao, B. Kuschner, and S.-C. Zhu, “Sparse Winograd Convolutional Neural Networks on Small-scale Systolic Arrays,” no. 1, pp. 118–118, 2019, doi: 10.1145/3289602.3293939.
[5] A. Zlateski, Z. Jia, K. Li, and F. Durand, “FFT Convolutions are Faster than Winograd on Modern CPUs, Here is Why,” 2018, [Online]. Available: http://arxiv.org/abs/1809.07851.
[6] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance evaluation,” 2015.
[7] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8692 LNCS, no. PART 4, pp. 184–199, 2014, doi: 10.1007/978-3-319-10593-2_13.
[8] “Comparison of convolution methods for GPUs.” [Online]. Available: http://ska-sdp.org/sites/default/files/attachments/nvidia-sdp-directconvolution.pdf [Accessed: 07-Aug-2020].
[9] B. Barabasz, A. Anderson, K. M. Soodhalter, and D. Gregg, “Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks,” vol. 2, pp. 1–37, 2018, [Online]. Available: http://arxiv.org/abs/1803.10986.
[10] K. Vincent, K. Stephano, M. Frumkin, B. Ginsburg, and J. Demouth, “On improving the numerical stability of winograd convolutions,” 5th Int. Conf. Learn. Represent. ICLR 2017 - Work. Track Proc., no. 1, pp. 1–4, 2019.
[11] D. Huang, X. Zhang, and Y. Chen, “DWM: A decomposable winograd method for convolution acceleration,” 2020.
[12] C. Yang, Y. Wang, X. Wang, and L. Geng, “WRA: A 2.2-to-6.3 TOPS highly unified dynamically reconfigurable accelerator using a novel winograd decomposition algorithm for convolutional neural networks,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 66, no. 9, pp. 3480–3493, 2019, doi: 10.1109/TCSI.2019.2928682.
[13] S. Winograd, Arithmetic Complexity of Computations. 1980.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
[15] R. E. Blahut, Fast Algorithms for Signal Processing. 2010.
[16] D. Budden, A. Matveev, S. Santurkar, S. R. Chaudhuri, and N. Shavit, “Deep tensor convolution on multicores,” 34th Int. Conf. Mach. Learn. ICML 2017, vol. 2, pp. 1007–1017, 2017.
[17] S. Yao, K. Guo, and S. Han, “Hardware-Friendly Convolutional Neural Network With Even-Number Filter Size,” pp. 2–5, 2016.
[18] “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks.”
[19] B. Shi, Z. Tang, G. Luo, and M. Jiang, “Winograd-based real-time super-resolution system on FPGA,” Proc. - 2019 Int. Conf. Field-Programmable Technol. ICFPT 2019, vol. 2019-December, pp. 423–426, 2019, doi: 10.1109/ICFPT47387.2019.00083.
[20] “Large Kernel Matters - Improve Semantic Segmentation by Global Convolutional Network.”
[21] “Cooley–Tukey FFT algorithm,” Wikipedia, 09-Jul-2020. [Online]. Available: https://en.wikipedia.org/wiki/Cooley–Tukey_FFT_algorithm. [Accessed: 07-Aug-2020].
[22] “Prime-factor FFT algorithm,” Wikipedia, 07-Nov-2019. [Online]. Available: https://en.wikipedia.org/wiki/Prime-factor_FFT_algorithm. [Accessed: 07-Aug-2020].
[23] L. Zhang, M. Halber, and S. Rusinkiewicz, “Accelerating large-kernel convolution using summed-area tables,” 2019.
[24] G. Seif and D. Androutsos, “Large receptive field networks for high-scale image super-resolution,” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., vol. 2018-June, pp. 876–885, 2018, doi: 10.1109/CVPRW.2018.00120.
[25] S. Hsia and S. Wang, “Shift-Register-Based Data Transposition for Cost-Effective Discrete Cosine Transform,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 6, pp. 725–728, June 2007, doi: 10.1109/TVLSI.2007.898780.
[26] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” Lecture Notes in Computer Science, page 1935, 2018.

TABLE I. GOPS SPEED UP ON DIFFERENT CONVOLUTIONS

| Conv. Type | From | Conv. Shape in (Cout, Cin, H, W) | GOPs (b), Nested Win. | GOPs (b), OLA-Win. | Speed up |
|---|---|---|---|---|---|
| Conv2D 9×9 (a) | SRCNN Layer-1 | (64,1,256,256) | 3503 | 1063 | 3.29 |
| Depth-wise 7×7 | PNasNet | (54,54,83,83) | 540 | 384 | 1.41 |
| Conv2D 5×5 (a) | SRCNN Layer-2 | (32,64,256,256) | 991 | 692 | 1.43 |
| Conv2D 5×5 (a) | SRCNN Layer-3 | (1,32,256,256) | 186 | 130 | 1.43 |

a. These three layers constitute all the layers of the SRCNN.
b. GOPs = (
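The speed-up column of Table I can be cross-checked against the two GOPs columns. The sketch below assumes that the speed-up is simply the ratio of the nested-Winograd GOPs to the OLA-Winograd GOPs; the definition in footnote b is truncated in this copy, so that reading is an assumption. The names `TABLE_I`, `speedup`, and `check_table` are illustrative only, and the tolerance of 0.01 allows for two-decimal rounding or truncation of the reported values.

```python
# Rows copied from Table I:
# (layer, GOPs with nested Winograd, GOPs with OLA-Winograd, reported speed-up)
TABLE_I = [
    ("Conv2D 9x9, SRCNN Layer-1", 3503, 1063, 3.29),
    ("Depth-wise 7x7, PNasNet", 540, 384, 1.41),
    ("Conv2D 5x5, SRCNN Layer-2", 991, 692, 1.43),
    ("Conv2D 5x5, SRCNN Layer-3", 186, 130, 1.43),
]


def speedup(nested_gops, ola_gops):
    """Assumed definition: ratio of the two GOPs columns."""
    return nested_gops / ola_gops


def check_table(rows, tol=0.01):
    """Return (layer, computed ratio, reported value, agrees) for each row."""
    results = []
    for name, nested, ola, reported in rows:
        ratio = speedup(nested, ola)
        results.append((name, ratio, reported, abs(ratio - reported) < tol))
    return results


if __name__ == "__main__":
    for name, ratio, reported, ok in check_table(TABLE_I):
        print(f"{name}: computed {ratio:.3f}, reported {reported:.2f}, consistent={ok}")
```

Under this assumed definition, the smallest and largest ratios correspond to the 1.41 and 3.29 figures quoted in the abstract and conclusion.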

TABLE II. THROUGHPUT COMPARISON ON FSRCNN-S NETWORK [19]

[Table: columns are “Convolver of [19] (Our Imp.)” and “Our Results”; rows are “BRAM (Mb)” and “Frame Rates”.]

Precision: 16-bit fixed; Device: ZCU102; PL Frequency: 200 MHz; Winograd Kernel: