Convolution with even-sized kernels and symmetric padding
Shuang Wu, Guanrui Wang, Pei Tang, Feng Chen, Luping Shi
Department of Precision Instrument, Department of Automation
Center for Brain Inspired Computing Research
Beijing Innovation Center for Future Chip
Tsinghua University
{lpshi,chenfeng}@mail.tsinghua.edu.cn
Abstract
Compact convolutional neural networks gain efficiency mainly through depthwise convolutions, expanded channels, and complex topologies, which contrarily aggravate the training process. Besides, 3 × 3 kernels dominate the spatial representation in these models, whereas even-sized kernels (2 × 2, 4 × 4) are rarely adopted. In this work, we quantify the shift problem that occurs in even-sized kernel convolutions by an information erosion hypothesis, and eliminate it by proposing symmetric padding on the four sides of the feature maps (C2sp, C4sp). Symmetric padding releases the generalization capabilities of even-sized kernels at little computational cost, making them outperform 3 × 3 kernels.

Preprint. Under review.

Introduction

Deep convolutional neural networks (CNNs) have achieved significant successes in numerous computer vision tasks such as image classification [37], semantic segmentation [43], image generation [8], and game playing [29]. Other than domain-specific applications, various architectures have been designed to improve the performance of CNNs [3, 12, 15], wherein the feature extraction and representation capabilities are mostly enhanced by deeper and wider models containing ever-growing numbers of parameters and operations. Thus, the memory overhead and computational complexity greatly impede their deployment in embedded AI systems. This motivates the deep learning community to design compact CNNs with reduced resources, while still retaining satisfactory performance.

Compact CNNs mostly derive generalization capabilities from architecture engineering. Shortcut connections [12] and dense concatenations [15] alleviate the degradation problem as the network deepens. Feature maps (FMs) are expanded by pointwise convolution (C1) and bottleneck architectures [35, 40]. Multi-branch topology [38], group convolution [42, 47], and the channel shuffle operation [48] recover accuracy at the cost of network fragmentation [26]. More recently, there is a trend towards mobile models with <10M parameters and <1G FLOPs [14, 24, 26], wherein the depthwise convolution (DWConv) [4] plays a crucial role as it decouples cross-channel correlations and spatial correlations. Aside from human priors and handcrafted designs, emerging neural architecture search (NAS) methods optimize structures by reinforcement learning [49], evolution algorithms [32], etc.

Despite the progress, the fundamental spatial representation is dominated by 3 × 3 kernels, while even-sized kernels (2 × 2, 4 × 4) are rarely adopted as basic building blocks for deep CNN models [37, 38]. Besides, most of the compact models concentrate on the inference efforts (parameters and FLOPs), whereas the training efforts (memory and speed) are neglected or even become more intractable due to complex topologies [24], expanded channels [35], and additional transformations [17, 40, 48]. With the growing demands for online and continual learning applications, the training efforts should be jointly addressed and further emphasized. Furthermore, recent advances in data augmentation [6, 46] have shown more powerful and universal benefits. A simpler structure combined with enhanced augmentations easily eclipses the progress made by intricate architecture engineering, inspiring us to rethink basic convolution kernels and the mathematical principles behind them.

In this work, we explore the generalization capabilities of even-sized kernels (2 × 2, 4 × 4) and quantify the shift problem by an information erosion hypothesis: even-sized kernels have asymmetric receptive fields (RFs) that produce pixel shifts in the resulting FMs. This location offset accumulates when stacking multiple convolutions, thus severely eroding the spatial information. To address the issue, we propose convolution with even-sized kernels and symmetric padding on each side of the feature maps (C2sp, C4sp).

Symmetric padding not merely eliminates the shift problem, but also extends the RFs of even-sized kernels. Various classification results demonstrate that C2sp is an effective decomposition of C3, offering a 30%-50% saving of parameters and FLOPs. Moreover, compared with compact CNN blocks such as DWConv, the inverted-bottleneck [35], and ShiftNet [40], C2sp achieves competitive accuracy with >20% speedup and >35% memory saving during training. In generative adversarial networks (GANs) [8], C2sp and C4sp both obtain improved image quality and stabilized convergence. This work stimulates a new perspective full of optional units for architecture engineering, as well as provides basic but effective alternatives that balance both the training and inference efforts.
Related work

Model compression

Our method belongs to compact CNNs that design new architectures and then train them from scratch, whereas most network compression methods in the literature attempt to prune weights from a pre-trained reference network [9], or quantize weights and activations [16], in terms of inference efforts. Some recent advances also prune networks at the initialization stage [7] or quantize models during training [41]. These compression methods are orthogonal to compact architecture engineering and can be jointly implemented to further reduce memory consumption and computational complexity.
Even-sized kernel
Even-sized kernels are mostly applied together with stride 2 to resize images. For example, GAN models in [28] apply 4 × 4 kernels with stride 2 to down-sample and up-sample feature maps.

Atrous convolution
Dilated convolution [43] supports exponential expansion of RFs without loss of resolution or coverage, which is specifically suitable for dense prediction tasks such as semantic segmentation. Deformable convolution [5] augments the spatial sampling locations of kernels with additional 2D offsets, learning the offsets directly from target datasets; deformable kernels therefore shift at the pixel level and focus on geometric transformations. ShiftNet [40] sidesteps spatial convolutions entirely with shift kernels that contain no parameters or FLOPs. However, it requires large channel expansions to reach satisfactory performance.
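For intuition on the RF expansion that dilation buys, a minimal sketch (our own illustration with hypothetical helper names, not from the paper) of the standard effective-RF arithmetic:

```python
def dilated_rf(k, d):
    """Effective receptive field of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def stacked_rf(kernels_dilations):
    """RF of a stack of stride-1 layers: each layer adds (r_l - 1) pixels."""
    rf = 1
    for k, d in kernels_dilations:
        rf += dilated_rf(k, d) - 1
    return rf

print(dilated_rf(3, 2))                       # 5
print(stacked_rf([(3, 1), (3, 2), (3, 4)]))   # 1 + 2 + 4 + 8 = 15
```

Each layer still touches only 9 taps, yet the stacked RF grows roughly exponentially with the dilation schedule, which is the "expansion without loss of resolution" property cited above.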
Shift problem

We start with the spatial correlation in basic convolution kernels. Intuitively, replacing a C3 with two C2s should provide performance gains aside from an 11% reduction of overheads, inspired by the factorization of C5 into two C3s [37]. However, the experiments in Figure 3 indicate the opposite.
Figure 1: Normalized FMs derived from well-trained ResNet-56 models, at three spatial sizes: 32 × 32, 16 × 16, and 8 × 8.

We first formulate the shift problem observed in even-sized kernels. A conventional convolution between c_i input FMs \mathcal{F}_i and c_o output FMs \mathcal{F}_o, with square kernels w of size k × k, can be given as

\mathcal{F}_o(p) = \sum_{i=1}^{c_i} \sum_{\delta \in \mathcal{R}} w_i(\delta) \cdot \mathcal{F}_i(p + \delta),    (1)

where δ and p enumerate locations in the RF \mathcal{R} and in the FMs of size h × w, respectively. When k is an odd number, e.g., 3, we define the central point of \mathcal{R} as the origin:

\mathcal{R} = \{(-\kappa, -\kappa), (-\kappa, 1-\kappa), \ldots, (\kappa, \kappa)\}, \quad \kappa = \lceil (k-1)/2 \rceil,    (2)

where κ denotes the maximum pixel distance from the four sides to the origin, and \lceil\cdot\rceil is the ceiling function. Since \mathcal{R} is symmetric, we have \sum_{\delta \in \mathcal{R}} \delta = (0, 0).

When k is an even number, e.g., 2 or 4, implementing the convolution between \mathcal{F}_i and the kernels w_i becomes inevitably asymmetric, since there is no central point to align. In most deep learning frameworks this draws little attention and is obscured by pre-defined offsets. For example, TensorFlow [1] picks the nearest pixel in the left-top direction as the origin, which gives an asymmetric \mathcal{R}:

\mathcal{R} = \{(1-\kappa, 1-\kappa), (1-\kappa, 2-\kappa), \ldots, (\kappa, \kappa)\}, \quad \sum_{\delta \in \mathcal{R}} \delta = (k\kappa, k\kappa).    (3)

This shift occurs at all spatial locations p and is equivalent to padding one more zero on the bottom and right sides of the FMs before convolution. On the contrary, Caffe [18] pads one more zero on the left and top sides. PyTorch [31] only supports symmetric padding by default; users need to manually define the padding policy if another one is desired.

According to the above, even-sized kernels make zero-padding asymmetric by 1 pixel, which on average (between two opposite directions) leads to 0.5-pixel shifts in the resulting FMs. This position offset accumulates when stacking multiple layers of even-sized convolutions, and eventually squeezes and distorts features toward a certain corner of the spatial extent.
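The offset sums in Equations 2 and 3 can be checked numerically. The sketch below (hypothetical helper names of our own; the left-top origin convention follows the TensorFlow behaviour described above) enumerates the offsets δ and sums them per axis:

```python
import math

def receptive_field(k):
    """Offsets delta of a k x k kernel relative to the origin pixel.

    For odd k the central point is the origin; for even k there is no
    centre, so (following the TensorFlow convention described in the text)
    the nearest pixel in the left-top direction is used, which makes the
    offset range asymmetric."""
    kappa = math.ceil((k - 1) / 2)
    lo = -kappa if k % 2 else 1 - kappa
    return [(dy, dx)
            for dy in range(lo, kappa + 1)
            for dx in range(lo, kappa + 1)]

def offset_sum(k):
    deltas = receptive_field(k)
    return tuple(sum(c) for c in zip(*deltas))

print(offset_sum(3))  # (0, 0): symmetric, no shift
print(offset_sum(2))  # (2, 2): mean shift of 0.5 pixel per axis
print(offset_sum(4))  # (8, 8): mean shift of 0.5 pixel per axis
```

Dividing by the k² kernel elements recovers the average 0.5-pixel shift per layer stated in the text.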
Ideally, in case such asymmetric padding is performed n times in the TensorFlow style with convolutions in between, the resulting pixel-to-pixel correspondence of FMs will be

\mathcal{F}^{n}\left(p - \left(\frac{n}{2}, \frac{n}{2}\right)\right) \leftarrow \mathcal{F}^{0}(p).    (4)

Since FMs have a finite size h × w and are usually down-sampled to force high-level feature representations, the edge effect [2, 27] cannot be ignored, because zero-padding at the edges distorts the effective values of the FM, especially in deep networks and small FMs. We hypothesize that the quantity of information Q equals the mean L1-norm of the FM; then successive convolutions with zero-padding to preserve the FM size will gradually erode the information:

Q^{n} = \frac{1}{hw} \sum_{p \in h \times w} |\mathcal{F}^{n}(p)|, \quad Q^{n} < Q^{n-1}.    (5)

The information erosion happens recursively and is too complex to formulate in closed form, so we directly derive FMs from deep networks that contain various kernel sizes. In Figure 2, 10k images of size 32 × 32 are fed into the networks: Q decreases progressively, and faster for larger kernel sizes and smaller FMs. Besides, asymmetric padding in even-sized kernels (C2, C4) speeds up the erosion dramatically, which is consistent with the well-trained networks in Figure 1. An analogy is that an FM can be seen as a rectangular ice chip melting in water, except that it can only exchange heat on its four edges. The smaller the ice, the faster the melting happens. Symmetric padding equally distributes the thermal gradients so as to slow down the exchange, whereas asymmetric padding produces larger thermal gradients at a certain corner, thus accelerating it.

Our hypothesis also provides explanations for some experimental observations in the literature. (1) The degradation problem happens in very deep networks [12]: although vanishing/exploding forward activations and backward gradients have been addressed by initialization [11] and intermediate normalization [17], the spatial information is still eroded and blurred by the edge effect after multiple convolution layers.
(2) It is reported [3] that in GANs, doubling the depth of networks hampers training, and increasing the kernel size to 7 or 5 leads to degradation or only minor improvement. These observations indicate that GANs require information augmentation and are more sensitive to progressive erosion.
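The erosion hypothesis (Eq. 5) can be simulated directly. Below is a toy numpy sketch of our own (not the paper's code): a 2 × 2 mean filter stands in for a convolution, 'lt' pads one zero on the left-top only (mimicking the asymmetric framework convention), and 'sym' averages the four padding directions as a stand-in for symmetric grouped padding:

```python
import numpy as np

def conv2x2(f, padding):
    """2x2 mean filtering of a 2D map with 1 pixel of zero-padding.

    padding='lt'  pads left-top only (asymmetric, framework-style origin);
    padding='sym' averages the results of the four corner paddings, a
    stand-in for the symmetric grouped-padding idea."""
    def one(pad):  # pad = (top, bottom, left, right)
        g = np.pad(f, ((pad[0], pad[1]), (pad[2], pad[3])))
        # valid 2x2 mean filter via slicing
        return (g[:-1, :-1] + g[:-1, 1:] + g[1:, :-1] + g[1:, 1:]) / 4.0
    if padding == "lt":
        return one((1, 0, 1, 0))
    corners = [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]
    return sum(one(c) for c in corners) / 4.0

f = np.ones((16, 16))
qa = qs = f
for _ in range(12):                       # 12 stacked layers
    qa, qs = conv2x2(qa, "lt"), conv2x2(qs, "sym")

Q = lambda x: np.abs(x).mean()            # Eq. 5: mean L1-norm
print(Q(qa) < Q(qs) < Q(f))               # True: asymmetric erodes fastest
```

Both variants lose information through the zero-padded borders, but the asymmetric one also drifts the content toward an edge, so its Q decays faster, matching the C2-versus-C2sp gap in Figure 2.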
Figure 2: Left: layerwise quantity of information Q and colormaps derived from the last convolution layers; FMs are down-sampled after the 18th and 36th layers. Right: implementation of convolution with 2 × 2 kernels and symmetric padding (C2sp).

Symmetric padding

Since \mathcal{R} is inevitably asymmetric for even kernels in Equation 3, it is difficult to introduce symmetry within a single FM. Instead, we aim at the final output \mathcal{F}_o, summed over multiple inputs \mathcal{F}_i and kernels. For clarity, let \mathcal{R} be the shifted RF in Equation 3 that picks the nearest pixel in the left-top direction as the origin. We then explicitly introduce a shifted collection \mathcal{R}^+,

\mathcal{R}^+ = \{\mathcal{R}_{lt}, \mathcal{R}_{rt}, \mathcal{R}_{lb}, \mathcal{R}_{rb}\},    (6)

that includes all four directions: left-top, right-top, left-bottom, and right-bottom.

Let \pi: \mathcal{I} \to \mathcal{R}^+ be the surjective-only mapping from input channel indexes i \in \mathcal{I} = \{1, 2, \ldots, c_i\} to certain shifted RFs. By adjusting the proportion of the four shifted RFs, we can ensure that

\sum_{i=1}^{c_i} \sum_{\delta \in \pi(i)} \delta = (0, 0).    (7)

When mixing the four shifted RFs within a single convolution, the RFs of even-sized kernels are partially extended, e.g., 2 × 2 → 3 × 3 and 4 × 4 → 5 × 5. If c_i is an integer multiple of 4 (usually satisfied), the symmetry is strictly obeyed within a single convolution layer by distributing the RFs in sequence:

\pi(i) = \mathcal{R}_{\lfloor 4i/c_i \rfloor}.    (8)

As mentioned above, a shifted RF is equivalent to padding one more zero at a certain corner of the FMs. Thus, the symmetry can be neatly realized by a grouped padding strategy; an example for C2sp is illustrated in Figure 2. In summary, 2D convolution with even-sized kernels and symmetric padding consists of three steps: (1) divide the input FMs equally into four groups; (2) pad each FM according to the direction defined for its group; (3) calculate the convolution without any padding. We have also done ablation studies on other methods of dealing with the shift problem; see the Discussion.

Experiments

In this section, the efficacy of symmetric padding is validated on the CIFAR10/100 [21] and ImageNet [33] classification tasks, as well as the CIFAR10, LSUN bedroom [44], and CelebA-HQ [19] generation tasks. First of all, we intuitively demonstrate that the shift problem has been eliminated by symmetric padding. In the symmetric case of Figure 1, FMs return to the central position, exhibiting healthy magnitudes and reasonable geometries. In Figure 2, C2sp and C4sp have much lower attenuation rates than C2 and C4 regarding the information quantity Q. Besides, C2sp has a larger Q than C3, suggesting performance improvements in the following evaluations.
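The three steps above can be sketched in plain numpy (channels-first layout; `pad_c2sp` and `conv2x2_valid` are our hypothetical helper names, not the authors' implementation):

```python
import numpy as np

# Grouped symmetric padding for a 2x2 convolution (C2sp): split the input
# channels into four groups, pad each group toward a different corner, then
# convolve without padding. Padding tuples are (top, bottom, left, right).
CORNERS = [(1, 0, 1, 0),  # left-top
           (1, 0, 0, 1),  # right-top
           (0, 1, 1, 0),  # left-bottom
           (0, 1, 0, 1)]  # right-bottom

def pad_c2sp(x):
    """x: (c_i, h, w) with c_i a multiple of 4 -> (c_i, h+1, w+1)."""
    assert x.shape[0] % 4 == 0
    groups = np.split(x, 4, axis=0)
    padded = [np.pad(g, ((0, 0), (t, b), (l, r)))
              for g, (t, b, l, r) in zip(groups, CORNERS)]
    return np.concatenate(padded, axis=0)

def conv2x2_valid(x, w):
    """Valid 2x2 convolution. x: (c_i, h+1, w+1), w: (c_o, c_i, 2, 2)."""
    out = (w[:, :, 0, 0, None, None] * x[None, :, :-1, :-1]
         + w[:, :, 0, 1, None, None] * x[None, :, :-1, 1:]
         + w[:, :, 1, 0, None, None] * x[None, :, 1:, :-1]
         + w[:, :, 1, 1, None, None] * x[None, :, 1:, 1:])
    return out.sum(axis=1)  # sum over input channels -> (c_o, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w = rng.standard_normal((4, 8, 2, 2))
y = conv2x2_valid(pad_c2sp(x), w)
print(y.shape)  # (4, 16, 16): spatial size is preserved
```

Because the four groups are padded in opposite directions, the per-layer half-pixel shifts cancel across channels while the spatial size is preserved, which is exactly the constraint expressed by Equation 7.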
Figure 3: Left: parameter-accuracy curves of ResNets with multiple depths and various convolution kernels. Middle: parameter-accuracy curves of DenseNets with multiple depths, with C3 and C2sp. Right: training and testing curves of DenseNet-112 with C3 and C2sp.
To explore the generalization capabilities of various convolution kernels, ResNet series without bottleneck architectures [12] are chosen as the backbones. We keep all other components and training hyperparameters the same, and only replace each C3 with a C4, C2, or C2sp. The networks are trained on CIFAR10 with depths 6n + 2 for a range of n, and the parameter-accuracy curves are shown in Figure 3. The original even-sized kernels (4 × 4, 2 × 2) perform worse than C3, whereas symmetric padding closes the gap. Then DenseNets with depths 3n + 4 are used as the backbones, and the results are shown in Figure 3. At the same depth, C2sp achieves accuracy comparable to C3 as the network gets deeper. The training losses indicate that C2sp has better generalization and less overfitting than C3. Under the criterion of similar accuracy, a C2sp model saves 30%-50% of parameters and FLOPs in the CIFAR evaluations. Therefore, we recommend C2sp as a better alternative to C3 in classification tasks.

To facilitate fair comparisons of C2sp with compact CNN blocks that contain C1s, DWConvs, or shift kernels, we use ResNets as backbones and adjust the width and depth to maintain the same number of parameters and FLOPs (overheads). In case there are n input channels for a basic residual block, two C2sp layers consume about 8n^2 overheads; the expansion is marked as 1-1-1 since no channel expands. For ShiftNet blocks [40], we choose expansion rate 3 and 3 × 3 shift kernels, which consume about 6n^2 overheads; therefore, the value of n is slightly increased. For the inverted-bottleneck [35], the suggested expansion rate 6 results in 12n^2 + O(n) overheads, so the number of blocks is reduced by 1/3. For depthwise-separable convolutions [4], the overheads are about 2n^2 + O(n), so the channels are doubled, forming 2-2-2 expansions.

Table 1: Comparison of various compact CNN blocks on CIFAR100. Shift, Invert, and Sep denote the ShiftNet block, inverted-bottleneck, and depthwise-separable convolution, respectively. mixup denotes training with mixup augmentation. Exp denotes the expansion rates of channels in the block. SPS refers to the speed during training: samples per second.
The results are summarized in Table 1. Since most models easily overfit the CIFAR100 training set with standard augmentation, we also train the models with mixup [46] augmentation to make the differences more significant. In addition to error rates, the memory consumption and speed during training are reported. C2sp reaches better accuracy than ShiftNets, which indicates that sidestepping spatial convolutions entirely by shift operations may not be an efficient solution. Compared with blocks that contain DWConv, C2sp achieves competitive results in the 56- and 110-layer networks with fewer channels and simpler architectures, which reduces memory consumption (>35%) and speeds up training (>20%). We also train a C2sp DenseNet (K = 48, L = 50) to have approximately 3.3M parameters. C2sp suffers less than 0.2% accuracy loss compared with state-of-the-art auto-generated models, and achieves better accuracy with cutout + mixup augmentation.

Table 2: Comparison with NAS-generated models on CIFAR10. cutout + mixup denotes cutout [6] and mixup [46] data augmentation.

Model              Error (%)   Params (M)
NASNet-A [49]      3.41        3.3
PNASNet-5 [24]     3.41        3.2
AmoebaNet-A [32]

We start with the widely-used ResNet-50 and DenseNet-121 architectures. Since both of them contain bottlenecks and C1s to scale down the number of channels, C3 only consumes about 53% and 32% of their total overheads, respectively. Changing C3s to C2sp thus results in about 25% and 17% reductions of parameters and FLOPs, respectively. The top-1 classification accuracies are shown in Table 3: C2sp has a minor loss (0.2%) in ResNet, and a slightly larger degradation (0.5%) in DenseNet. After all, there are only 0.9M parameters for spatial convolution in DenseNet-121 C2sp.

We further scale the channels of ResNet-50 down to 0.5× as a mobile setting. At this stage, a C2 model (asymmetric), as well as reproductions of MobileNet-v2 [35] and ShuffleNet-v2 [26], are evaluated. Symmetric padding reduces the error rate of ResNet-50 0.5× C2 by 2.5%, making ResNet a comparable solution to compact CNNs.
Although MobileNet-v2 models achieve the best accuracy, they use inverted-bottlenecks (the same structure as in Table 1) to expand too many FMs, which significantly increases the memory consumption and slows down the training process (about 400 SPS), while the other models easily reach 1000 SPS.

Table 3: Top-1 error rates on ImageNet. Results are obtained by our reproductions using the same training hyperparameters.

Model                  Error (%)   Params (M)   FLOPs (M)
ResNet-50 C3 [12]
ResNet-50 0.5× C3      27.9        6.9          1127
ResNet-50 0.5× C2      31.0        5.3          870
ResNet-50 0.5× C2sp    28.4
MobileNet-v2 [35]
ShuffleNet-v2 [26]     27.5        7.4          591

The efficacy of symmetric padding is further validated in image generation tasks with GANs. For CIFAR10 32 × 32 image generation, we follow the same architecture described in [28], which has about 6M parameters in the generator and 1.5M parameters in the discriminator. For LSUN bedroom and CelebA-HQ 128 × 128 image generation, ResNet19 [22] is adopted, with five residual blocks in the generator and six residual blocks in the discriminator, containing about 14M parameters each. Since the training of a GAN is a zero-sum game between two neural networks, we keep all discriminators the same (C3) to mitigate their influence, and replace each C3 in the generators with a C4, C2, C4sp, or C2sp. Besides, the number of channels is reduced to 0.75× in C4 and C4sp, or expanded to 1.5× in C2 and C2sp, to approximate the same number of parameters.

The inception scores [34] and FIDs [13] for quantitatively evaluating the generated images are shown in Table 4, and examples from the best-FID runs are visualized in Figure 4. Symmetric padding is crucial for the convergence of C2 generators, and remarkably improves the quality of C4 generators.

Table 4: Scores for different kernels. A higher inception score (IS) and a lower FID are better.

Model   C10 (IS)         C10 (FID)   LSUN (FID)   CelebA (FID)
C5      7.64
C2      non-convergence
C2sp    7.77

The results confirm that symmetric padding stabilizes the training of GANs. On CIFAR10, C2sp performs the best scores, while in LSUN bedroom and CelebA-HQ generation, C4sp is slightly better than C2sp. The diverse results can be explained by the information erosion hypothesis: in CIFAR10 generation, the network depth is relatively deep in terms of the image size 32 × 32, so a smaller kernel has a lower attenuation rate and more channels; whereas the network depth is relatively shallow in terms of the image size 128 × 128.

Figure 4: Generated images on CIFAR10 (32 × 32, C2sp, IS=8.27, FID=19.49) and LSUN-bedroom (128 × 128).

Results reported as mean ± std in tables, or as error bars in figures, are averaged over 5 runs with different random seeds. The default settings for CIFAR classification are as follows: we train models for 300 epochs with mini-batch size 64, except for the results in Table 2, which run for 600 epochs as in [49]. We use a cosine learning rate decay [25] starting from 0.1, except for the DenseNet tests, where piecewise constant decay performs better. The weight decay factor is 1e-4, except for the parameters in depthwise convolutions. The standard augmentation [23] is applied, and α equals 1 in mixup augmentation.

For ImageNet classification, all models are trained for 100 epochs with mini-batch size 256. The learning rate is set to 0.1 initially and annealed according to the cosine decay schedule. We follow the data augmentation in [36]. Weight decay is 1e-4 in the ResNet-50 and DenseNet-121 models, and decreases to 4e-5 in the other compact models. Some results are worse than reported in the original papers, likely due to inconsistencies in mini-batch size, learning rate decay, or total training epochs, e.g., about 420 epochs in [35].

For generation tasks with GANs, we follow the models and hyperparameters recommended in [22]. The learning rate is 0.2, and β1 is 0.5 and β2 is 0.999 for the Adam optimizer [20]. The mini-batch size is 64, and the ratio of discriminator to generator updates is 5:1 (n_critic = 5). The results in Table 4 and Figure 4 are trained for 200k and 500k discriminator update steps, respectively. We use the non-saturating loss [8] without gradient norm penalty. Spectral normalization [28] is applied in the discriminators; no normalization is applied in the generators.

Discussion
Ablation study
We have tested other methods of dealing with the shift problem and divide them into two categories: (1) replacing asymmetric padding with an additional non-convolution layer, e.g., interpolation or pooling; (2) achieving symmetry across multiple convolution layers, e.g., padding 1 pixel on each side before/within two padding-free convolutions. Their implementations are restricted to certain architectures, and the accuracy is no better than symmetric padding. Our main consideration is to propose a basic but elegant building element that achieves symmetry within a single layer, so that most existing compact models can be neatly transferred to even-sized kernels, providing universal benefits to the compact CNN and GAN communities.
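Category (2) above can be illustrated with a small sketch of our own (a 2 × 2 mean filter stands in for the convolution): pad one pixel on every side once, then apply two padding-free 2 × 2 layers, so the two opposite half-pixel shifts cancel over the pair and the spatial size is preserved.

```python
import numpy as np

def valid2x2_mean(f):
    """Padding-free 2x2 mean filter: (h, w) -> (h-1, w-1)."""
    return (f[:-1, :-1] + f[:-1, 1:] + f[1:, :-1] + f[1:, 1:]) / 4.0

f = np.ones((16, 16))
g = np.pad(f, 1)          # 18 x 18: one zero pixel on every side
g = valid2x2_mean(g)      # 17 x 17, shifted half a pixel one way
g = valid2x2_mean(g)      # 16 x 16, shifted back the other way
print(g.shape)            # (16, 16)
```

The combined RF of the pair is a centered 3 × 3 window, so symmetry holds only across the pair of layers, which is why this variant ties the design to specific architectures, unlike C2sp's single-layer symmetry.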
Network fragmentation
From the evaluations above, C2sp achieves comparable accuracy with less training memory and time. Although fragmented operators distributed across many groups [26] have fewer parameters and FLOPs, the operational intensity [39] decreases as the group number increases. This negatively impacts the efficiency of computation, energy, and bandwidth on hardware with strong parallel computing capabilities. In situations where memory access dominates the computation, e.g., training, the reduction in FLOPs is less meaningful. We conclude that when the training efforts are emphasized, it remains controversial to (1) increase network fragmentation by grouping strategies and complex topologies, and (2) decompose spatial and channel correlations by DWConvs, shift operations, and C1s.
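The operational-intensity argument can be made concrete with a back-of-the-envelope roofline estimate (our own hypothetical accounting, assuming 4-byte floats, equal input/output channels, and no cache reuse):

```python
def operational_intensity(h, w, c, k, g):
    """Rough FLOPs-per-byte of a k x k conv with g groups on an h x w map
    with c input and c output channels."""
    macs = h * w * k * k * c * (c // g)    # multiply-accumulates
    weight_bytes = 4 * k * k * c * (c // g)
    io_bytes = 4 * 2 * h * w * c           # read input + write output once
    return 2 * macs / (weight_bytes + io_bytes)

for g in (1, 4, 32):
    print(g, round(operational_intensity(32, 32, 64, 3, g), 1))
```

FLOPs and weight traffic both shrink with 1/g while the activation traffic stays fixed, so the intensity (FLOPs/byte) falls monotonically with the group count: grouped layers become memory-bound even though their nominal FLOPs shrink, which is the effect described above.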
Naive implementation
Meanwhile, most deep learning frameworks and hardware are mainly optimized for C3, which restrains the efficiency of C4sp and C2sp to a large extent. For example, in our high-level Python implementation in TensorFlow of models with C2sp, C2, and C3, although the parameter and FLOP ratio is 4:4:9, the training speed (SPS) ratio is about 1:1.14:1.2 and the memory consumption ratio is about 1:0.7:0.7. The speed and memory overheads can clearly be further optimized in future computation libraries and software engineering, once even-sized kernels are adopted by the deep learning community.
Conclusion

In this work, we explore the generalization capabilities of even-sized kernels (2 × 2, 4 × 4) and quantify the shift problem by an information erosion hypothesis. We then introduce symmetric padding to elegantly achieve symmetry within a single convolution layer. In classification, C2sp achieves a 30%-50% saving of parameters and FLOPs compared to C3 on CIFAR10/100, and improves accuracy by 2.5% over C2 on ImageNet. Compared to existing compact convolution blocks, C2sp achieves competitive results with fewer channels and simpler architectures, which reduce memory consumption (>35%) and speed up training (>20%).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Farzin Aghdasi and Rabab K Ward. Reduction of boundary artifacts in image restoration. IEEE Transactions on Image Processing, 5(4):611–618, 1996.
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[4] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[6] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[9] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
[10] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5353–5360, 2015.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[14] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[18] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[22] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
[23] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[24] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[25] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
[26] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[27] G McGibney, MR Smith, ST Nichols, and A Crawley. Quantitative evaluation of several partial Fourier reconstruction algorithms used in MRI. Magnetic Resonance in Medicine, 30(1):51–59, 1993.
[28] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[30] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[32] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[36] Nathan Silberman and Sergio Guadarrama. TensorFlow-Slim image classification model library, 2017.
[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[39] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
[40] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127–9135, 2018.
[41] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.
[42] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
[43] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.
[44] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[45] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.
[46] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[47] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4373–4382, 2017.
[48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[49] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.