Precision Highway for Ultra Low-Precision Quantization
Eunhyeok Park*, Dongyoung Kim*, Sungjoo Yoo* and Peter Vajda†
*Department of Computer Science and Engineering, Seoul National University
{eunhyeok.park, dongyoung.kim42, sungjoo.yoo}@gmail.com
†Mobile Vision Team, AI Camera
[email protected]

ABSTRACT
Neural network quantization has an inherent problem called accumulated quantization error, which is the key obstacle towards ultra-low precision, e.g., 2- or 3-bit precision. To resolve this problem, we propose precision highway, which forms an end-to-end high-precision information flow while performing ultra-low-precision computation. First, we describe how the precision highway reduces the accumulated quantization error in both convolutional and recurrent neural networks. We also provide a quantitative analysis of the benefit of precision highway and evaluate its overhead on a state-of-the-art hardware accelerator. In our experiments, the proposed method outperforms the best existing quantization methods, offering 3-bit weight/activation quantization with no accuracy loss and 2-bit quantization with only a 2.45 % top-1 accuracy loss in ResNet-50. We also report that the proposed method significantly outperforms the existing method in the 2-bit quantization of an LSTM for language modeling.
1 INTRODUCTION
Energy-efficient inference of neural networks is becoming increasingly important in both servers and mobile devices (e.g., smartphones, AR/VR devices, and drones). Recently, there have been active studies on ultra-low-precision inference using 1 to 4 bits (Rastegari et al., 2016; Hubara et al., 2016; Zhu et al., 2016; Zhou et al., 2016; 2017; Zhuang et al., 2018; Choi et al., 2018b; Liu et al., 2018; Choi et al., 2018a) and their implementations on CPUs and GPUs (Tulloch & Jia, 2017) and dedicated hardware (Park et al., 2018; Sharma et al., 2017). However, as will be explained in Section 5.2, the existing quantization methods suffer from a problem called accumulated quantization error, where large quantization errors get accumulated across layers, making it difficult to enable ultra-low precision in deep neural networks.

In order to address this problem, we propose a novel concept called precision highway, where an end-to-end path of high-precision information reduces the accumulated quantization error, thereby enabling ultra-low-precision computation. Our proposed work is similar to recent studies (Liu et al., 2018; Choi et al., 2018b) which propose utilizing pre-activation residual networks, where skip connections are kept in full precision while the residual path performs low-precision computation. Compared with these works, our proposed method offers a generalized concept of high-precision information flow, namely, precision highway, which can be applied not only to pre-activation convolutional networks but also to post-activation convolutional and recurrent neural networks.

Our contributions are as follows.

• We propose a novel network-level approach to quantization, called precision highway, and quantitatively analyze its benefits in terms of the propagation of quantization errors and the difficulty of convergence in training based on the shape of the loss surface.

• We provide a detailed analysis of the energy and memory overhead of precision highway based on a state-of-the-art hardware accelerator model. According to our experiments, the overhead is negligible while offering significant improvements in accuracy.

• We apply precision highway to both convolutional and recurrent networks. We report a 3-bit quantization of ResNet-50 without accuracy loss and a 2-bit quantization with a very small accuracy loss. We also provide sub-4-bit quantization results of a long short-term memory (LSTM) network for language modeling.
2 RELATED WORK
Migacz (2017) presented an int8 quantization method that selects an activation truncation threshold to minimize the Kullback-Leibler divergence between the distributions of the original and quantized data. Jacob et al. (2017) proposed a quantization scheme that enables integer-arithmetic-only matrix multiplications (practically, 8-bit quantization for neural networks). These methods are implemented on existing CPUs or GPUs (Jacob et al., 2015; ACL).

Hubara et al. (2016) presented a binarization method and demonstrated its performance benefit on a GPU. Rastegari et al. (2016) proposed a binary network called XNOR-Net, in which a weight-binarized AlexNet gives the same accuracy as a full-precision one. Zhou et al. (2016) presented DoReFa-Net, which applies tanh-based weight quantization and bounded activation. Zhou et al. (2017) proposed a balanced quantization that attempts to balance the population of values across quantization levels. He et al. (2016b) proposed utilizing full precision for internal cell states in the LSTM because of their wide value distributions. This work is similar to ours in that high-precision data are selectively utilized to improve the quantized network. Our difference is proposing a network-level end-to-end flow of high-precision activation. Recently, Zhuang et al. (2018) presented 4-bit quantization of ResNet-50. They adopt DoReFa-Net-style weight quantization with static bounded activation, and improve accuracy by adopting multi-step quantization and knowledge distillation during fine-tuning. Mishra et al. (2017) proposed a trade-off between the number of channels and precision. Clustering-based methods have the potential to further reduce the precision (Han et al., 2015). However, they require a lookup table and full-precision computation, which makes them less hardware-friendly.

Recently, Choi et al. (2018b;a) and Liu et al. (2018) proposed utilizing full precision on the skip connections in pre-activation residual networks. Compared with those works, our proposed idea has a salient difference in that it offers a network-level solution and demonstrates that the end-to-end flow of high-precision information is crucial. In addition, our method is not limited to pre-activation residual networks, but is general enough to be applied to both post-activation convolutional and recurrent neural networks.

3 PROPOSED METHOD

In precision highway, we build a path from the input to the output of a network to enable an end-to-end flow of high-precision activation, while performing low-precision computation. Our proposed method was motivated (1) by residual networks, where the signal, i.e., the activation/gradient in a forward/backward pass, can be directly propagated from one block to another (He et al., 2016a), and (2) by the LSTM, which provides an uninterrupted gradient flow across time steps via the inter-cell state path (Gers et al., 2000). Our proposed method focuses instead on improving the accuracy of a quantized network by providing an end-to-end high-precision information flow.

In this section, we first describe the precision highway in the cases of a residual network (Section 3.1) and a recurrent neural network (Section 3.2). Then, we discuss practical issues to be addressed before application to other networks in Section 3.3.

3.1 PRECISION HIGHWAY ON RESIDUAL NETWORK
In the case of a residual network, we can form a precision highway by making high-precision skip connections. In this subsection, we explain how high-precision skip connections can be constructed to reduce the accumulated quantization error. (This work was conducted in parallel with Choi et al. (2018b;a) and Liu et al. (2018).)

Figure 1: Comparison of conventional quantization and our proposed idea on a residual network.

In the conventional residual block shown in Figure 1 (a), quantization (denoted as Q_k^[0,1], k-bit linear quantization in the range from 0 to 1) is applied to all of the activations after the activation function. In the figure, thick (thin) arrows represent high-precision (low-precision) activations. As the figure shows, the input of a residual block is first quantized, and the quantized input (x + e in the figure), which contains the quantization error e, enters both the skip connection and the residual path. The output of the residual block, y, is calculated as follows:

    y = F(x + e) + x + e = F(x) + x + e_r + e,     (1)

where F(.) represents a residual function (typically, 2 or 3 consecutive convolutional layers). For simplicity of explanation, we assume that F(x + e) can be decomposed into F(x) + e_r, where e_r represents the resulting quantization error of the residual path incurred by the quantization operations on the residual path as well as by the quantization error in the input, e. As the equation shows, the output y has two quantization error terms: that of the residual path, e_r, and that of the skip connection, e.

Figure 1 (b) shows our idea of a high-precision skip connection. Compared with Figure 1 (a), the difference is the location of the first quantization operation in the residual block. In Figure 1 (b), quantization is applied only to the residual path, after the bifurcation into the residual path and the skip connection. As shown in the figure, the skip connection now becomes a thick arrow, i.e., a high-precision path. The proposed idea gives the output of the residual block as follows:

    y = F(x + e) + x = F(x) + x + e_r.     (2)

As Equation 2 shows, the proposed idea eliminates the quantization error of the skip connection, e. Thus, only the quantization error of the residual path, e_r, remains in the output of the residual block. Note that all of the input activations of the residual path are kept in low precision. This enables us to perform low-precision convolution operations on the residual path. We keep high-precision activation only on the skip connection and utilize it only for the element-wise addition. As will be shown in our experiments, the overhead in computation and memory access cost is small since the element-wise addition is much less expensive than the convolution on the residual path, and only the low-precision activation is accessed for the computation on the residual path. A minimal sketch of this structure is given below.
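To make the dataflow concrete, the following is a minimal PyTorch sketch of the block in Figure 1 (b). The quantizer and module names are ours for illustration, training-time gradient handling (e.g., a straight-through estimator) is omitted, and this should be read as a sketch of the structure rather than the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_act(x, k=3):
    # k-bit linear quantization of activations clipped to [0, 1],
    # i.e., Q_k^[0,1] in the notation of Figure 1.
    n = 2 ** k - 1
    return torch.round(torch.clamp(x, 0.0, 1.0) * n) / n

class HighwayResidualBlock(nn.Module):
    """Post-activation residual block with a high-precision skip connection.

    Quantization is applied only after the bifurcation, so only the
    residual path F() carries the quantization error e_r (Equation 2),
    while the skip connection stays in high precision.
    """
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # x itself stays high precision on the skip connection.
        r = quantize_act(x, self.k)           # low-precision input to F()
        r = F.relu(self.bn1(self.conv1(r)))
        r = quantize_act(r, self.k)           # low-precision input to conv2
        r = self.bn2(self.conv2(r))
        # The element-wise addition uses the high-precision skip connection.
        return F.relu(r + x)
```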
As will be shown later, our method gives a smaller quantization error, and the gap between the quantization error of the existing method and that of ours widens across layers. Because of this reduction of the accumulated quantization error, the proposed method offers much better accuracy than the state-of-the-art methods at an ultra-low precision of 2 and 3 bits.

Note also that, as shown in Figure 1 (c), our idea can be applied to other types of residual blocks, including the full pre-activation residual block (He et al., 2016a), as proposed in some recent works (Choi et al., 2018a; Liu et al., 2018). However, our idea is more general in that it is applicable to recurrent networks as well as post-activation convolutional networks. In particular, our proposed idea is advantageous over the existing ones since hardware accelerators tend to be designed assuming non-negative input activations enabled by ReLU activation functions (Park et al., 2018; Kim et al., 2018). Contrary to the existing works (Choi et al., 2018a; Liu et al., 2018), we also provide a detailed analysis of the effect of the precision highway.

3.2 PRECISION HIGHWAY ON RECURRENT NEURAL NETWORK
Figure 2 illustrates how the precision highway can be constructed on the LSTM (Gers et al., 2000). In time step t, the LSTM cell takes, as input, the new input x_t along with the results of the previous time step, output h_{t-1} and cell state c_{t-1}. First, it calculates four intermediate signals: i (input gate), f (forget gate), g (gate gate), and o (output gate). Then, it produces two results, c_t and h_t, as follows:

    i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi),      (3a)
    f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf),      (3b)
    g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg),   (3c)
    o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho),      (3d)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t,                     (3e)
    h_t = o_t ⊙ tanh(c_t),                               (3f)

where σ represents a sigmoid function, ⊙ the element-wise multiplication, W a weight matrix, and b a bias.
Figure 2: Comparison of conventional quantization and our proposed idea on the LSTM.

In the conventional LSTM operation, as Figure 2 (a) shows, quantization (a gray box denoted by Q_k with the output value range as the superscript) is applied to all of the activations before computation. The results of a time step, c_t and h_t, are calculated based on such inputs with quantization errors. More specifically, cell state c_t is calculated with the quantized, i.e., low-precision, inputs c_{t-1}, f, i, and g. Thus, cell state c_t accumulates the quantization errors of those inputs. In addition, output h_t also accumulates the quantization errors from its inputs, c_t and o. These errors are then propagated to the next time steps. Thus, we have the problem of accumulated quantization error across time steps. Such an accumulation of quantization error will prevent us from achieving ultra-low precision.

Figure 2 (b) shows how we can build the precision highway in the LSTM cell. The figure shows that the quantization operation is applied only to the inputs of the matrix multiplication (the circle denoted with × in the figure). Thus, all of the other operations and their input activations are in high precision. Specifically, when calculating c_t, the inputs are not quantized, which reduces the accumulation of quantization error on c_t. The computation of h_t can also reduce the accumulation of quantization error by utilizing high-precision inputs. The construction of such a precision highway allows us to propagate high-precision information, i.e., cell states c_t and outputs h_t, across time steps. A minimal sketch of such a cell is given below.

Note that we benefit from low-precision computation by performing low-precision matrix multiplications (Equations 3a-3d), which dominate the total computation cost. In our proposed method, all of the element-wise multiplications in Equations 3e and 3f are performed in high precision. However, the overhead of these high-precision element-wise multiplications is negligible compared with the matrix multiplications in Equations 3a-3d. In addition, this method can be applied to other types of recurrent neural networks. For instance, the GRU (Chung et al., 2014) can be equipped with a precision highway, in a way similar to that shown in Figure 2 (b), by keeping high-precision output h_t while performing low-precision matrix multiplications and high-precision element-wise multiplications.
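The following is a minimal PyTorch sketch of one time step of the cell in Figure 2 (b). The function names are ours for illustration, and weight quantization and training-time gradient handling are omitted for brevity.

```python
import torch

def quantize(x, k, lo=-1.0, hi=1.0):
    # k-bit linear quantization into [lo, hi], i.e., Q_k^[lo,hi] in Figure 2.
    n = 2 ** k - 1
    x = torch.clamp(x, lo, hi)
    return torch.round((x - lo) / (hi - lo) * n) / n * (hi - lo) + lo

def highway_lstm_cell(x_t, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh, k=2):
    """One LSTM time step with a precision highway on c_t and h_t.

    w_ih: (4*hidden, input), w_hh: (4*hidden, hidden); gate order i, f, g, o.
    Only the matrix-multiplication inputs are quantized (Figure 2 (b));
    Equations 3e-3f stay in high precision, so c_t and h_t cross time
    steps unquantized.
    """
    x_q = quantize(x_t, k)      # quantize only the matmul inputs
    h_q = quantize(h_prev, k)
    gates = x_q @ w_ih.t() + b_ih + h_q @ w_hh.t() + b_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g        # high-precision element-wise ops (3e)
    h_t = o * torch.tanh(c_t)       # high-precision element-wise ops (3f)
    return h_t, c_t
```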
3.3 PRACTICAL ISSUES WITH PRECISION HIGHWAY

In order to generalize our proposed idea to other networks in real applications, we need to address the following issues. First, in the case of feed-forward networks with an identity path, our precision highway idea is applicable regardless of the pre-activation or post-activation structure. We can exploit the benefit of reduced precision by applying quantization in front of matrix multiplications, while maintaining accuracy by handling the identity path in high precision. Second, in the case of non-residual feed-forward networks, the precision highway can be constructed by equipping them with additional skip connections. In the case of networks with multiple candidates for the precision highway, e.g., DenseNet, which has multiple parallel skip connections (Huang et al., 2017), we need to address a new problem of selecting skip connections to form a precision highway, which is left for future work.

4 TRAINING
In this section, we describe weight quantization and fine-tuning for weight/activation quantization.

4.1 LINEAR WEIGHT QUANTIZATION BASED ON LAPLACE DISTRIBUTION MODEL
Figure 3 illustrates that a Laplace distribution can fit the distributions of weights in full-precision trained networks well. Thus, we propose modeling the weight distribution with a Laplace distribution and selecting quantization levels for weights based on it.

Figure 3: Weight histogram and Laplace approximation (dashed line) of a convolutional layer of a trained full-precision ResNet-50.
Figure 4: Quantization error accumulation across residual blocks in ResNet-50 (SOTA vs. proposed).

Given a distribution of weights and a target precision of k bits, e.g., 2 bits, the quantization levels are determined as follows. First, the quantization levels for k bits are pre-computed for the normalized Laplace distribution; we determine the quantization levels that minimize the L2 error on the normalized Laplace distribution. For instance, in the case of 2-bit quantization, the error is minimized when four quantization levels are placed evenly with a spacing of 1.53µ, where µ is the mean of the absolute value of the weights. Given the distribution of weights and the pre-calculated quantization levels on the normalized Laplace distribution for the given k bits, we determine the real quantization levels by multiplying the pre-computed quantization levels by the mean of the absolute value of the weights. A sketch of this procedure is given below.

Our proposed weight quantization is similar to the one in Choi et al. (2018a). Compared to it, ours is simpler in that only the Laplace distribution model is utilized, and our experiments show that the precision highway together with this simple weight quantization gives outstanding results.
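The following sketch illustrates the procedure for the 2-bit case. The spacing of 1.53µ is stated above; the symmetric placement of the four levels at {-1.5, -0.5, 0.5, 1.5} × 1.53µ and the nearest-level assignment are one natural reading of "placed evenly with a spacing of 1.53µ", not a verbatim transcription of our implementation.

```python
import torch

def laplace_quantize_weights(w, k=2):
    """Quantize weights to k bits using levels pre-computed on a
    normalized Laplace distribution, scaled by the mean absolute weight.

    For k = 2, four levels are evenly spaced 1.53 * mu apart; here we
    place them symmetrically at {-1.5, -0.5, 0.5, 1.5} * 1.53 * mu,
    which is our reading of the even placement described in the text.
    """
    mu = w.abs().mean()
    if k == 2:
        spacing = 1.53  # pre-computed L2-optimal spacing for 2 bits
    else:
        raise NotImplementedError("look up the pre-computed spacing for other k")
    n = 2 ** k
    # Symmetric, evenly spaced level indices around zero: e.g., +-0.5, +-1.5.
    idx = torch.arange(n, dtype=w.dtype, device=w.device) - (n - 1) / 2
    levels = idx * spacing * mu
    # Map each weight to its nearest quantization level.
    dist = (w.reshape(-1, 1) - levels.reshape(1, -1)).abs()
    return levels[dist.argmin(dim=1)].reshape(w.shape)
```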
4.2 FINE-TUNING FOR WEIGHT/ACTIVATION QUANTIZATION

Our quantization is applied during fine-tuning after training a full-precision network. As the baseline, we adopt the fine-tuning procedure of Zhuang et al. (2018), where we perform incremental/progressive quantization. In contrast to Zhuang et al. (2018), we first quantize activations and then weights in the incremental quantization. In addition, for each precision configuration, we perform teacher-student training to improve the quantized network (Zhuang et al., 2018; Mishra & Marr, 2017). As the teacher network, we utilize a deeper full-precision network, e.g., ResNet-101, compared to the student network, e.g., quantized ResNet-50. Note that, during fine-tuning, we apply quantization in the forward pass while updating the full-precision weights during the backward pass.
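Applying quantization in the forward pass while updating full-precision weights in the backward pass is commonly implemented with a straight-through estimator; the following is a minimal sketch of that mechanism, assuming the nearest-level mapping of Section 4.1 (the text does not spell out the exact estimator used).

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Straight-through estimator: quantize in the forward pass but pass
    gradients through unchanged, so the full-precision master weights keep
    receiving updates during fine-tuning."""

    @staticmethod
    def forward(ctx, w, levels):
        # Map each weight to its nearest quantization level.
        dist = (w.reshape(-1, 1) - levels.reshape(1, -1)).abs()
        return levels[dist.argmin(dim=1)].reshape(w.shape)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity gradient w.r.t. w; no gradient for the levels.
        return grad_output, None
```

During fine-tuning, each convolution would then use w_q = QuantizeSTE.apply(w, levels) in its forward computation while the optimizer continues to update the full-precision w.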
5 EXPERIMENTS
5.1 EXPERIMENTAL SETUP
We implemented the proposed method in PyTorch and Caffe2. We use two types of trained neural networks: ResNet-18/50 for ImageNet and an LSTM for language modeling (Zaremba et al., 2014; Press & Wolf, 2016; Inan et al., 2016). We evaluate 4-, 3-, and 2-bit quantizations for these networks. For ResNet, we test with a single center crop of the 256x256 resized image. We compare our proposed method with the state-of-the-art methods (Zhou et al., 2016; Zhuang et al., 2018; Choi et al., 2018b;a; Liu et al., 2018). Note that, for the teacher-student training, we use the same teacher network for both the baseline method (our implementation of Zhuang et al. (2018)) and ours. We also evaluate the effect of increasing the number of channels (Mishra et al., 2017) to recover from the accuracy loss due to quantization. As in previous works (Hubara et al., 2016; Zhou et al., 2017; Zhuang et al., 2018; Choi et al., 2018b;a; Liu et al., 2018; Mishra et al., 2017), we do not apply quantization to the first and last layers.

The LSTM has 2 layers and 300 cells in each layer. We used the Penn Treebank dataset and evaluated the perplexity per word. We compared the state-of-the-art method in He et al. (2016b) with our proposed method.
5.2 ANALYSIS OF ACCUMULATED QUANTIZATION ERROR
Figure 4 shows the quantization errors across layers in ResNet-50 when applying state-of-the-art 4-bit quantization to activations. We prepared, from the same initial condition, two activation-quantized networks (one with the precision highway and the other with low-precision skip connections), where the weights are not modified and only the activations are quantized to 4 bits. As the measure of quantization error, we utilize a metric based on the cosine similarity between the activation tensors of corresponding layers in the full-precision and quantized networks; a sketch is given below.

As the figure shows, in the existing method the quantization errors become larger in deeper layers. This is because the quantization error generated in each layer is propagated and accumulated across layers. We call this the accumulated quantization error. The accumulated errors become larger with more aggressive quantization, e.g., 2 bits, and cause poor performance, i.e., a 4.8 % drop (Zhuang et al., 2018) from the top-1 accuracy of the full-precision ResNet-50 for ImageNet classification.

The accumulation of quantization errors is an inherent characteristic of a quantized network in both feed-forward and feed-back networks. In the case of a recurrent neural network, the quantization errors are propagated across time steps. As shown in Figure 4, our proposed precision highway significantly reduces the accumulated quantization errors, which enables 3-bit quantization without an accuracy drop and much better accuracy in 2-bit quantization than the existing methods.
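A minimal sketch of such a metric follows; the exact form (here, one minus the cosine similarity of the flattened activation tensors) is our assumption, as the text only states that the metric is based on cosine similarity.

```python
import torch
import torch.nn.functional as F

def quantization_error(act_fp, act_q):
    """Per-layer quantization error between the activation tensor of the
    full-precision network (act_fp) and that of the quantized network
    (act_q), measured as 1 - cosine similarity of the flattened tensors."""
    a = act_fp.flatten()
    b = act_q.flatten()
    return 1.0 - F.cosine_similarity(a, b, dim=0).item()
```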
5.3 LOSS SURFACE ANALYSIS OF QUANTIZED MODEL TRAINING
Figure 5: Loss surface of ResNet-18 on CIFAR-10: (a) full-precision model (FP), (b) 1-bit activation and 2-bit weight quantized model (1A2W) without precision highway (PH), (c) 1A2W with precision highway, and (d) cross-section of the loss surface.

Figure 5 visualizes the complexity of the loss surface depending on the existence of the precision highway. We obtained the figures by applying the method proposed by Li et al. (2017); a sketch of the procedure is given below. Each figure represents the loss surface seen from the local minimum we obtained from training, i.e., the weight vector of the final trained model. The origin of the figure at (0, 0) corresponds to the weight vector of the local minimum. As shown in Figure 5 (d), the precision highway gives a better loss surface (lower and smoother near the minimum point, and steep and simple elsewhere) than the existing quantization method. This characteristic helps the stochastic gradient descent (SGD) method converge quickly to a good local minimum, offering better accuracy than the existing method.
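The following sketch outlines the procedure of Li et al. (2017): perturb the trained weights along two random, filter-normalized directions and evaluate the loss on a 2-D grid. The function is our simplified rendering of that method, not the code used to produce Figure 5.

```python
import torch

def loss_surface_2d(model, loss_fn, data, n=21, span=1.0):
    """Scan the loss around the trained weights along two random,
    filter-normalized directions (Li et al., 2017). The origin (0, 0)
    is the trained model; returns an n x n grid of loss values."""
    theta = [p.detach().clone() for p in model.parameters()]

    def random_direction():
        d = [torch.randn_like(p) for p in theta]
        for di, pi in zip(d, theta):
            if pi.dim() > 1:
                # Filter-wise normalization: match each filter's norm.
                d_flat = di.view(di.size(0), -1)
                p_flat = pi.view(pi.size(0), -1)
                d_flat.mul_(p_flat.norm(dim=1, keepdim=True)
                            / (d_flat.norm(dim=1, keepdim=True) + 1e-10))
            else:
                di.mul_(pi.norm() / (di.norm() + 1e-10))
        return d

    d1, d2 = random_direction(), random_direction()
    coords = torch.linspace(-span, span, n)
    grid = torch.zeros(n, n)
    with torch.no_grad():
        for i, a in enumerate(coords):
            for j, b in enumerate(coords):
                # Move to theta + a*d1 + b*d2 and evaluate the loss there.
                for p, t, u, v in zip(model.parameters(), theta, d1, d2):
                    p.copy_(t + a * u + b * v)
                grid[i, j] = loss_fn(model, data)
        for p, t in zip(model.parameters(), theta):  # restore the weights
            p.copy_(t)
    return grid
```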
5.4 EVALUATING THE ACCURACY OF QUANTIZED MODEL
Table 1: 2-bit quantization results. Top-1/Top-5 accuracy [%].

Method                       | ResNet-18 | ResNet-50
Full-precision               | 70.15 / - | 76.00 / -
Zhuang's (ours)              |     -     | 69.04 / -
Zhuang's (ours) + Teacher    |     -     |     -
Laplace                      |     -     | 70.50 / -
Laplace + Teacher            |     -     | 71.70 / -
Laplace + Teacher + Highway  |     -     | 73.55 / -

Table 2: Comparison of accuracy loss in 2-bit activation/weight quantization. Bi-Real applies 1-bit activation/weight quantization.

Method                         | ResNet-18 | ResNet-50
Ours                           |     -     |   2.45
DoReFa (Zhou et al., 2016)     |    7.6    |   9.8
Zhuang's (Zhuang et al., 2018) |     -     |   4.8
PACT (Choi et al., 2018b)      |    5.8    |   4.7
PACT new (Choi et al., 2018a)  |    3.4    |   2.7
Bi-Real (Liu et al., 2018)     |   12.9    |    -
Table 1 shows the accuracy of 2-bit quantization for ResNet-18/50. We evaluate each of our proposed methods, Laplace, teacher, and highway, as shown in the table. When highway is not applied, the skip connection is branched after the quantization; when teacher is not applied, we use the conventional cross-entropy loss. Compared with the full-precision accuracy, our 2-bit quantization (with all methods applied) gives a top-1 accuracy of 73.55 %, which is within 2.45 % of the full-precision accuracy and much better than the state-of-the-art method (Zhuang's, 70.8 %), which has a top-1 accuracy loss of 4.8 %. Note that Zhuang's implemented all of the methods, incremental/progressive quantization and teacher-student training, in Zhuang et al. (2018). We present the accuracy results of our own implementations of Zhuang's method under the same amount of training time. Zhuang's (ours) implemented only the incremental and progressive methods, while Zhuang's (ours) + Teacher also utilized our teacher network.

Table 1 shows the effects of the precision highway. Compared with our solution supporting only Laplace and teacher, the highway provides an additional gain of 1.85 % (71.70 % to 73.55 %) in the top-1 accuracy of ResNet-50. The effect of the Laplace method can be evaluated by comparing the result of our implementation of Zhuang's (69.04 % top-1 accuracy in ResNet-50) with that of our solution adopting the Laplace model (70.50 %), because these are identical except for the weight quantization method, i.e., tanh-based vs. Laplace-based. The Laplace method gives 1.46 % better accuracy. The table indicates that ResNet-18 also benefits from our proposed methods, like ResNet-50.

Table 2 compares the additional accuracy loss of quantization methods with respect to full-precision accuracy. The table shows that ours significantly outperforms the methods without a precision highway (DoReFa, Zhuang's, and PACT). PACT new and Bi-Real utilize high-precision skip connections on pre-activation residual networks; thus, they show results comparable to ours.¹ Note that our results in the table are obtained from the conventional post-activation residual network, which demonstrates the generality of our proposed precision highway. As will be shown below for the LSTM, our proposed method also applies to recurrent networks as well as feed-forward ones.

¹ We performed 1-bit activation/weight quantization for the post-activation-style ResNet-18. For a fair comparison, we did not apply the teacher-student and progressive quantization methods and instead adopted the BN retraining proposed in Bi-Real Net. Our 1-bit activation/weight ResNet-18 gives 56.73 / 80.11 % Top-1/Top-5 accuracy, which is 0.33 / 0.61 % higher than the result of Bi-Real Net, respectively.

Table 3: Impact of highway precision (rows: low precision; columns: highway precision). Top-1/Top-5 accuracy [%].

                | Full      | 8-bit     | 6-bit | 4-bit
ResNet-18 3-bit |     -     |     -     |   -   |   -
ResNet-18 2-bit |     -     |     -     |   -   |   -
ResNet-50 3-bit | - / 93.03 | 76.08 / - |   -   |   -
ResNet-50 2-bit |     -     |     -     |   -   |   -
Table 3 shows the impact of the precision of the precision highway. We obtained the results by varying the highway precision (without retraining) after obtaining the results with the full-precision highway. The table shows that 2-bit quantization with the 8-bit highway gives only 0.09 % and 0.40 % drops in top-1 accuracy for ResNet-18 and ResNet-50, respectively, from that of the 2-bit quantization with the full-precision highway. Most importantly, our 3-bit quantization (with the 8-bit highway) gives the same accuracy as the full-precision network, i.e., 76.08 % in ResNet-50, which means that our proposed method reduces the precision of ResNet-50 from 4 bits with Zhuang et al. (2018) down to 3 bits, even with the 8-bit highway.
Table 4: Quantization results of ResNets with two times wider channels (wResNet). Top-1/Top-5 accuracy [%].

           | Full | 3-bit | 2-bit         | Zhuang's (ours) 2-bit
wResNet-18 |  -   |   -   | 73.80 / 91.56 | 70.81 / 90.02
wResNet-50 |  -   |   -   | 77.35 / 93.69 | 75.54 / 92.71
Table 5: Perplexity of the quantized LSTM. (x, y) means x-bit weight / y-bit activation.

                | (4,4) | (3,4) | (3,3) | (2,3) | (2,2)
Without Highway |   -   |   -   |   -   |   -   | 133.25
With Highway    | 95.94 |   -   |   -   |   -   | 114.44
Table 4 shows the effect of two times wider channels under 2-bit quantization. We first doubled the number of channels in ResNet-18 and ResNet-50 and then quantized them with our methods. As the table shows, the wide ResNets give better accuracy than the full-precision originals even under 2-bit quantization, i.e., 73.80 % (77.35 %) in Table 4 vs. 70.15 % (76.00 %) of the full precision in Table 1 for ResNet-18 (ResNet-50). It would be worth investigating how to minimize the channel size while meeting the full-precision accuracy with ultra-low precision, which is left for future work.

Table 5 lists the quantization results for the LSTM. We varied the bit configuration (weight, activation) and obtained the perplexity results (the lower, the better). The table shows that our proposed method significantly reduces the perplexity. Compared with the perplexity of the full-precision model (92.84), our 4-bit quantization gives a very small increase of 3.3 % (92.84 to 95.94). The precision highway provides more gain for more aggressive quantization. Specifically, it reduces the perplexity by 14.1 % (from 133.25 to 114.44) in the 2-bit quantization, (2,2). Compared with the state-of-the-art quantization of a similar LSTM (He et al., 2016b), ours offers much better results, i.e., a much smaller increase in perplexity: a 23.3 % increase (92.84 to 114.44 in Table 5) vs. a 39.4 % increase (109 to 152 in He et al. (2016b)) for 2-bit quantization.²

² Note that we compared the relative change from the full-precision perplexity because the full-precision perplexity of the state-of-the-art method (109) is different from that of ours (92.84).

5.5 HARDWARE COST EVALUATION OF QUANTIZED MODEL
Figure 6: Comparison of chip area and energy consumption on the hardware accelerator.

Table 6: Number of operations. * denotes a high-precision operation.

                   | LSTM (300) | ResNet-18 | ResNet-50
Low-precision MAC  |   720 K    |  6.89 G   |  15.1 G
High-precision Add |     -      |     -     |     -
Non-linear Op*     |     -      |     -     |     -
Elt-wise Multi*    |    900     |     -     |     -
Figure 6 shows the chip area cost and energy consumption of ResNet-18 at different levels of precision on a state-of-the-art hardware accelerator (Chen et al., 2017). The accelerator is synthesized at 65 nm, 250 MHz, and 1.0 V. Each processing element (PE) consists of a multiply-accumulate (MAC) unit and local buffers. The PEs share a global on-chip 2 MB static random access memory (SRAM) at 16-bit precision, the size of which is adjusted in proportion to the precision. As the figure shows, the reduced precision offers a significant reduction in chip area, e.g., an 82.3 % reduction from 16 to 3 bits, and in energy consumption, e.g., 73.1 % from 16 to 3 bits. In the 2-bit case, where the overhead of the precision highway is the largest, the precision highway incurs only 3.9 % additional energy consumption due to the high-precision data while offering 4.1 % better accuracy than the case without the precision highway. The accelerator is already equipped with a large internal buffer for partial-sum accumulation; thus, the precision highway incurs additional energy consumption mainly on the accesses to the on-chip SRAM and the main memory (dynamic random access memory, DRAM).

Table 6 compares the number of operations in the three neural networks used in our experiments. The table explains why the high-precision operations incur such a small overhead in energy consumption: the frequency of high-precision operations is much smaller than that of low-precision operations. For instance, the 2-bit LSTM network has one high-precision (32-bit) element-wise multiplication for every 800 2-bit multiplications, as the worked count below shows.
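The following worked count reproduces this ratio for one time step of the 300-cell LSTM; the per-equation breakdown is our derivation, assuming the input size equals the hidden size.

```python
# Worked operation count for one time step of the LSTM (300 cells), our
# derivation of the "1 high-precision multiply per 800 low-precision
# multiplies" figure discussed with Table 6.
hidden = 300
inputs = 300  # assumption: input size equals the hidden size

# Equations 3a-3d: four gates, each multiplying W_i* (300x300) and
# W_h* (300x300) -> low-precision MACs.
low_precision_mac = 4 * hidden * (inputs + hidden)   # 720,000 = 720 K

# Equations 3e-3f: f*c, i*g, o*tanh(c) -> high-precision multiplies.
high_precision_mul = 3 * hidden                      # 900

print(low_precision_mac // high_precision_mul)       # 800
```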
6 CONCLUSION

In this paper, we proposed the concept of an end-to-end precision highway, which can be applied to both feed-forward and feed-back networks and enables ultra-low precision in deep neural networks. The proposed precision highway reduces quantization errors by keeping high-precision activation from the input to the output of the network at a small computation cost. We described how it reduces the accumulated quantization error and presented quantitative analyses in terms of accuracy and hardware cost as well as training characteristics. Our experiments showed that the proposed method outperforms the state-of-the-art methods in the 3- and 2-bit quantizations of ResNet-18/50 and the 2-bit quantization of an LSTM model. We believe that our work will serve as a step toward mixed-precision networks for computational efficiency.

REFERENCES
ACL. Arm Compute Library. https://developer.arm.com/technologies/compute-library, 2017. Accessed: 2018-9-26.

Yu-Hsin Chen et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2017.

Jungwook Choi et al. Bridging the accuracy gap for 2-bit quantized neural networks. arXiv:1807.06964, 2018a.

Jungwook Choi et al. PACT: Parameterized clipping activation for quantized neural networks. arXiv:1805.06085, 2018b.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.

Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. European Conference on Computer Vision (ECCV), 2016a.

Qinyao He et al. Effective quantization methods for recurrent neural networks. arXiv:1611.10176, 2016b.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. Computer Vision and Pattern Recognition (CVPR), 2017.

Itay Hubara et al. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv:1609.07061, 2016.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv:1611.01462, 2016.

Benoit Jacob et al. gemmlowp: a small self-contained low-precision GEMM library. https://github.com/google/gemmlowp, 2015. Accessed: 2018-9-26.

Benoit Jacob et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv:1712.05877, 2017.

Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo. ZeNA: Zero-aware neural network accelerator. IEEE Design & Test, 2018.

Hao Li et al. Visualizing the loss landscape of neural nets. arXiv:1712.09913, 2017.

Zechun Liu et al. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. European Conference on Computer Vision (ECCV), 2018.

Szymon Migacz. NVIDIA 8-bit inference with TensorRT. GPU Technology Conference, 2017.

Asit K. Mishra and Debbie Marr. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. arXiv:1711.05852, 2017.

Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. WRPN: Wide reduced-precision networks. arXiv:1709.01134, 2017.

Eunhyeok Park, Dongyoung Kim, and Sungjoo Yoo. Energy-efficient neural network accelerator based on outlier-aware low-precision computation. International Symposium on Computer Architecture (ISCA), 2018.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv:1608.05859, 2016.

Mohammad Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. European Conference on Computer Vision (ECCV), 2016.

Hardik Sharma et al. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. arXiv:1712.01507, 2017.

Andrew Tulloch and Yangqing Jia. High performance ultra-low-precision convolutions on mobile devices. arXiv:1712.02427, 2017.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.

Shuchang Zhou et al. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, 2016.

Shuchang Zhou et al. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 2017.

Chenzhuo Zhu et al. Trained ternary quantization. arXiv:1612.01064, 2016.

Bohan Zhuang et al. Towards effective low-bitwidth convolutional neural networks. Computer Vision and Pattern Recognition (CVPR), 2018.