VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference
Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, Brucek Khailany
NVIDIA

Abstract
Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of (≈ 16-64) elements within a single dimension of a tensor.

1 Introduction
Deep neural networks (DNNs) continue to achieve ground-breaking accuracy on a range of tasks, including image classification, object detection, machine translation, and natural language processing (NLP) (LeCun et al., 2015). In parallel, hardware designers have been racing to achieve the best performance per watt for running DNN inference on devices ranging from the edge to the datacenter (Sze et al., 2020). While most DNN models are trained in single-precision floating-point, they can be deployed for inference in lower-precision formats such as half-precision floating-point, fixed-point, and low-bitwidth integer depending on the target device and application specifications. Quantizing DNN models to lower precisions allows us to accelerate compute-bound operations such as convolution on high-throughput low-cost math units, conserve memory bandwidth for memory-bound computations, and reduce storage requirements in on-chip buffers and caches. For example, NVIDIA's Ampere Graphics Processing Unit (GPU) architecture supports INT8 and INT4 data types for these purposes (NVIDIA Corporation, 2020).

One way to quantize a DNN model is through quantization-aware training (QAT). QAT either trains the model from scratch or fine-tunes the trained full-precision model, with quantization operations included in the model. Alternatively, post-training quantization (PTQ) directly quantizes the values of the full-precision model before and during inference without any retraining (Wu et al., 2020). Often, PTQ is more desirable because it does not require access to the complete set of possibly confidential training data, eliminates lengthy fine-tuning, requires little hyperparameter tuning, and provides a turnkey solution for quantizing any DNN. However, PTQ usually results in more accuracy loss than QAT because of the lack of training with quantizers in the loop. With both QAT and PTQ, accuracy loss from quantization varies by precision, model, and quantization algorithm.

Quantization scales high-precision values of a particular range to lower-precision values of a different range. A high-precision number x is mapped to a lower-precision number x_q with x_q = Q(x/s, N), where s is the scale factor and Q(a, b) is the function that quantizes a to a b-bit integer. Scale factors play an important role in determining the quantization error, which affects the ultimate accuracy of the quantized model. To avoid overloading the quantized model with too many scale factors and nullifying the compute and memory benefits of quantization, scale factors must be shared among multiple tensor elements. Typically, scale factors are shared at a coarse granularity by elements of an entire tensor or a large sub-tensor. For example, a typical quantized convolution layer as shown in Figure 1 employs a single scale factor for the entire input activation tensor (C × H × W) and each kernel (C × R × S) of the weight tensor. While coarse-grained scaling amortizes the cost of scaling across many elements, it likely requires mapping a broader range of values to the specified low-precision representation. The resulting increase in quantization error introduces significant accuracy loss for low-precision representations. The problem is exacerbated for DNNs whose input and/or weight values span a wide dynamic range.

Figure 1. Convolution — Comparison between per-layer/per-output-channel scaling and per-vector scaling.

We propose fine-grained per-vector scaled quantization (VS-Quant) to mitigate quantization-related accuracy loss. In contrast to coarse-grained per-layer/per-output-channel scaling, VS-Quant employs a scale factor for each vector of elements (V × 1 × 1) in the activation and/or weight tensor, as shown in Figure 1. The range that must be represented within each vector is smaller than the range that must be represented across the entire layer, so many vectors can be represented at much higher precision and quantization error is only encountered in a small subset of vectors that contain wide ranges of values. Moreover, the unit of a vector matches the unit of vector multiply-and-accumulate (MAC) hardware ubiquitous in DNN accelerators (Sijstermans, 2018; Venkatesan et al., 2019; NVIDIA Corporation, 2020). This hardware-software synergy leads to an elegant extension of current accelerator architectures for implementing per-vector scaling with low overhead. The major contributions of our work are as follows:

• We propose VS-Quant, a novel per-vector scaled quantization technique to mitigate accuracy loss typical in existing quantized DNN models.
• We propose a two-level scaling scheme and algorithm that combine a set of fine-grained scale factors with each coarse-grained scale factor to enable efficient VS-Quant hardware implementations.
• We evaluate VS-Quant on popular DNN models and demonstrate significantly higher PTQ accuracy than conventionally scaled quantization on computer vision and NLP tasks.
• We extend the vector MAC unit of a DNN accelerator to support VS-Quant in hardware and analyze the area and power impact.
• We explore tradeoffs between accuracy and hardware efficiency across a range of hardware implementations and DNN models to identify Pareto-optimal designs for low-precision inference with PTQ.

The remainder of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the fundamentals of quantization; Section 4 presents and evaluates our per-vector scaling technique and the associated two-level scaling scheme; Section 5 describes the hardware implementation; Section 6 explores the accuracy and hardware efficiency tradeoff; Section 7 discusses quantization-aware retraining in the context of per-vector scaling, followed by conclusions in Section 8.
2 Related Work
Krishnamoorthi evaluates per-channel scaled quantization at various precisions for a set of convolutional neural networks (CNNs) (Krishnamoorthi, 2018). The paper finds that although PTQ can achieve good accuracy at 8 bits for these networks, QAT is required for getting good accuracy at lower precisions or for matching floating-point accuracy at 8 bits. McKinstry et al. show that CNNs require only a small number of epochs of finetuning after carefully setting the learning rate schedule and fixing the quantization range (McKinstry et al., 2018). Instead of fixing the quantization range before QAT, PACT proposes to learn the range of weights and activations as part of training (Choi et al., 2018). Both papers achieve full-precision accuracy with only 4-bit precision. Other research has explored very low precision ternary (Zhu et al., 2016) and binary (Courbariaux et al., 2015; Hubara et al., 2016) weights and activations. These models require significant retraining to recover accuracy loss and do not reach full-precision accuracy for more difficult tasks. In addition to CNNs, recent work has proposed quantized transformer models for NLP (Zafrir et al., 2019; Shen et al., 2020) and for machine translation (Bhandare et al., 2019; Prato et al., 2019; Wu et al., 2016b). Wu et al. establish a single 8-bit quantization workflow for maintaining less than 1% accuracy drop for many different types of networks (Wu et al., 2020).

Prior work has proposed schemes for uniform quantization (Courbariaux et al., 2014; Zhou et al., 2016) and non-uniform quantization (Han et al., 2015; Zhu et al., 2016). Uniform quantization uses an integer or fixed-point format which can be accelerated with specialized math pipelines and is the focus of this paper. Non-uniform quantization leverages codebook look-ups to enable model compression and memory bandwidth reduction. To reduce quantization error, vector quantization (Gray, 1984) based techniques take advantage of redundancy within a subspace of a weight or activation tensor. In particular, product quantization splits each subspace into vectors and optimizes a codebook for the vectors in each subspace (Wu et al., 2016a; Gong et al., 2014). Stock et al. extend this technique to preserve the reconstruction quality of actual network outputs instead of just weights (Stock et al., 2020). On a separate note, knowledge distillation can also improve the accuracy of a quantized model (Mishra & Marr, 2017). By training the quantized model to mimic a high-precision model in a student-teacher setting, the paper obtains higher accuracy for a ternary ResNet-variant architecture.

Since the full set of training data may not be available at inference time, there is increasing interest in PTQ techniques that directly quantize full-precision values before and during inference (Krishnamoorthi, 2018; Lee et al., 2018; Nagel et al., 2019). More recently, Zhao et al. propose the outlier channel splitting technique to exactly represent outliers (Zhao et al., 2019). By duplicating channels that contain outliers and halving the values of those channels, this technique effectively shrinks the quantization range without modifying the network. Also focusing on the distribution of tensor values, Fang et al. propose a piecewise linear quantization scheme that optimally splits the quantization range into non-overlapping but differently-sized regions (Fang et al., 2020). With this, more precision can be assigned to the range where a majority of values lie. To model long-tail effects in the data distribution, BiScaled-DNN uses two scale factors for quantization (Jain et al., 2019). One scale factor is dedicated to increasing precision, and the other to increasing range. ZeroQ sidesteps the need for a training dataset by engineering a synthetic one that matches the statistics of the batch normalization operation of each layer of the network (Cai et al., 2020). This technique is considered another form of knowledge distillation.

Besides integer quantization, previous work proposes low-cost fixed-point and floating-point inspired data types for energy efficiency. For example, Moons et al. propose adaptive fixed-point quantization that trains a network for arbitrary fixed-point precision while minimizing energy (Moons et al., 2017). Flexpoint replaces 32-bit floating-point values with a block floating-point format that leverages shared exponents that can be dynamically adjusted to minimize overflow and maximize dynamic range (Köster et al., 2017). To avoid data loss from exponent sharing while improving energy efficiency, AdaptivFloat leverages a floating-point exponent bias based on the absolute maximum value of the tensor to optimize the clipping of the tensor's dynamic range (Tambe et al., 2020). Rouhani et al. explore the accuracy-cost tradeoffs of different variants of a block floating-point format in production-level cloud-scale inference (Rouhani et al., 2020). Other work performs mixed-precision quantization to minimize bitwidth on a per-layer basis to adapt to each layer's sensitivity to precision (Wu et al., 2018; Khoram & Li, 2018).
3 Quantization Fundamentals
Integer quantization maps high-precision floating-point weights and activations in a DNN to low-precision integer representations, typically with 8 or fewer bits. For simplicity, in this paper we refer to the floating-point weights and activations collectively as real values, and the quantized low-precision weights and activations collectively as integer values. We also focus on uniform integer quantization where the values are evenly distributed within the range of the integer format. While non-uniform quantization such as logarithmic quantization (Miyashita et al., 2016) is also possible, the techniques proposed in this paper are orthogonal and can be applied to either form of quantization.

There are several considerations when deciding how to quantize real values into integers. First, we must choose a range of real values to be represented so that any out-of-range value will be clipped. We may not necessarily want to choose the full range of presented real values, but rather clip outliers to improve the precision of quantized values within the range where most values reside. Second, we need to select the number of bits available for our integer values. With more integer bits, more discrete levels (integer values) are available to represent the same range of real values, resulting in smaller quantization error.

An N-bit signed two's complement integer quantization maps real values x ∈ [x_min, x_max] to values x_q ∈ [−2^(N−1), 2^(N−1) − 1]. In general, a positive real scale factor s is used to scale the value from the real range to the integer range, and a zero point z represents the integer value corresponding to a real zero value. Since the zero point complicates integer computation pipelines, efficient DNN accelerators typically apply symmetric scale-only quantization assuming z = 0, x_min = −x_max, and x_q ∈ [−2^(N−1) + 1, 2^(N−1) − 1] (Wu, 2019). If α denotes the absolute maximum real value that needs to be represented,

$$s = \frac{\alpha}{2^{N-1} - 1} \qquad (1)$$

$$x_q = \mathrm{clip}\!\left(\left\lfloor \frac{x}{s} \right\rceil,\; -2^{N-1} + 1,\; 2^{N-1} - 1\right) \qquad (2)$$

where $\lfloor x/s \rceil$ denotes rounding the scaled value to the nearest integer. If x is unsigned, x_min = 0 and x_q will be clipped to the integer range [0, 2^N − 1]. To avoid issues with vanishing gradients, quantized integer values x_q are avoided during training. Instead, simulated quantization using discrete real values is applied to simulate the effect of integer quantization (Krishnamoorthi, 2018). Equation 3 defines the simulated-quantized value x_sq as a real value obtained by rescaling the integer value by the original scale factor:

$$x_{sq} = s \cdot x_q \qquad (3)$$

Typically in a convolutional layer, a scale factor for weights or activations is determined for every layer of the network. Known as per-layer scaling, a single scale factor is used for each weight tensor (i.e., K × C × R × S), and another scale factor is used for each activation tensor (i.e., C × H × W). To improve accuracy, multiple scale factors can be determined for the weights of each layer. Known as per-channel scaling, a different scale factor is used for each output channel of a layer (i.e., C × R × S). We collectively refer to per-layer and per-channel scaling as coarse-grained scaling.

Scale factors must be chosen carefully to best approximate a real distribution with a limited number of discrete values. A calibration process is used to select the α used in Equation 1 for quantizing weights and activations.
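To make Equations 1-3 concrete, the following is a minimal NumPy sketch of symmetric scale-only quantization with max calibration; the function and variable names are ours, not from any released implementation.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric scale-only quantization (Equations 1-3) with max calibration."""
    qmax = 2 ** (num_bits - 1) - 1               # largest representable signed integer
    alpha = np.abs(x).max()                      # absolute maximum real value to represent
    s = alpha / qmax                             # Equation 1: real-valued scale factor
    x_q = np.clip(np.round(x / s), -qmax, qmax)  # Equation 2: scale, round to nearest, clip
    x_sq = s * x_q                               # Equation 3: simulated-quantized value
    return x_q.astype(np.int32), s, x_sq

# Example: 4-bit quantization of a random weight tensor.
w = np.random.randn(64, 64).astype(np.float32)
w_q, s, w_sq = quantize_symmetric(w, num_bits=4)
print("max quantization error:", np.abs(w - w_sq).max())
```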
Model         | Task                 | Accuracy | Metric | Dataset
ResNet50 v1.5 | Image classification | 76.16    | Top1   | ImageNet 2012
BERT-base     | Language model       | 86.88    | F1     | SQuAD v1.1
BERT-large    | Language model       | 90.93    | F1     | SQuAD v1.1

Table 1. Overview of DNN models in this study
Model      | Bitwidths   | Max  | Entropy | 99.9% | 99.99% | 99.999% | 99.9999% | MSE
ResNet50   | Wt=3 Act=3U | 0.18 | 6.61    |       |        |         |          |
BERT-large | Wt=4 Act=4  | 1.92 |         |       |        |         |          |
Table 2. DNN accuracy with per-channel scaling and static calibration: Weight and activation bitwidths are specified under Bitwidths. U indicates unsigned values; values are otherwise signed. Max, Entropy, and MSE denote calibration using maximum absolute value, KL-divergence, and mean squared error, respectively. Percentages indicate the use of percentile calibration.

While α can be set to the maximum absolute value of the distribution (called max calibration), it is often beneficial to omit outlier values in favor of additional precision for inlier values. For example, percentile calibration sets α to a specific fraction of |x_max|. Entropy calibration, on the other hand, determines the α that minimizes the information loss between the real and quantized distributions. For weights, scale factors are determined using static calibration prior to inference. For activations, scale factors can be determined using static calibration prior to inference or through dynamic calibration during inference. Note that static calibration for activations requires samples of representative data to model the distribution of inputs that the network is likely to encounter during inference (Wu et al., 2018).

While per-channel scaling achieves better accuracy than per-layer scaling, coarse-grained scaling methods generally lead to significant accuracy degradation for a range of quantized models. With PTQ but without QAT, we observe accuracy degradation in popular image recognition and language models after quantization, as indicated in Table 2. Even for models where coarse-grained scaling can be competitive, careful calibration of the scale factor with the right calibration technique is required for good accuracy. As shown in Table 2, the quality of calibration varies among different versions of the same network and across different networks. We first focus on enabling state-of-the-art inference accuracy with PTQ before discussing VS-Quant for QAT.

4 Per-Vector Scaled Quantization
We propose VS-Quant, per-vector scaled quantization, to mitigate the accuracy loss from quantization. Rather than computing a single scale factor over multiple dimensions of a tensor, VS-Quant applies a scale factor for each vector of elements within a single dimension of a tensor. For a convolutional layer shown in Figure 1, per-vector scaling subdivides the input channel (C) dimension of the weight or activation tensor into ⌈C/V⌉ vectors, each with V elements. The number of vectors contained within a tensor depends on its shape and the designated vector size V.

Model      | Bitwidths   | Per-vector | Best Per-channel
ResNet50   | Wt=3 Act=3U | 69.78      | 7.97
           | Wt=4 Act=4U | 75.28      | 70.76
           | Wt=6 Act=6U | 76.00      | 75.80
           | Wt=8 Act=8U | 76.15      | 76.16
BERT-base  | Wt=3 Act=8  | 82.84      | 11.03
           | Wt=4 Act=8  | 86.24      | 73.61
           | Wt=6 Act=8  | 86.66      | 80.18
           | Wt=8 Act=8  | 86.60      | 81.25
BERT-large | Wt=3 Act=8  | 89.56      | 8.71
           | Wt=4 Act=8  | 90.64      | 83.18
           | Wt=6 Act=8  | 90.77      | 88.90
           | Wt=8 Act=8  | 90.80      | 89.41

Table 3. PTQ accuracy of different DNN models with floating-point per-vector scale factors – Best Per-Channel indicates the best calibrated per-channel scaled quantized accuracy among all calibration methods in Table 2.

In Table 3, we show that VS-Quant with static max calibration for weights and dynamic max calibration for activations has the potential to achieve significantly better accuracy at low bitwidths. Compared to the floating-point baseline, per-vector scaled quantization achieves a negligible accuracy drop at 6 bits and less than 1% drop at 4 bits for ResNet50. In comparison, per-channel scaled quantization requires at least 6-bit weights for less than 1% drop. Both BERT-base and BERT-large achieve close to full-precision accuracy with 4-bit weights, whereas per-channel scaled quantization has difficulty reaching the same level even with 8 bits. Note that results are reported for PTQ, where retraining is not required.
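To illustrate the vector partitioning described above, the sketch below applies single-level per-vector scaling (one floating-point scale factor per V-element vector along axis 0, treated as the C dimension); padding the channel count to a multiple of V and the guard against all-zero vectors are our simplifications, not details from the paper.

```python
import numpy as np

def per_vector_quantize(x, v=16, num_bits=4):
    """Single-level VS-Quant: one fp32 scale per V-element vector along axis 0 (the C dimension)."""
    c = x.shape[0]
    pad = (-c) % v                                         # pad C to a multiple of V (our simplification)
    x_pad = np.pad(x, [(0, pad)] + [(0, 0)] * (x.ndim - 1))
    vec = x_pad.reshape(-1, v, *x_pad.shape[1:])           # (ceil(C/V), V, ...) vectors
    qmax = 2 ** (num_bits - 1) - 1
    s = np.abs(vec).max(axis=1, keepdims=True) / qmax      # one scale factor per vector
    s = np.where(s == 0, 1.0, s)                           # guard for all-zero vectors (our choice)
    x_q = np.clip(np.round(vec / s), -qmax, qmax)          # low-bitwidth integer values
    x_sq = x_q * s                                         # simulated-quantized values
    return x_sq.reshape(x_pad.shape)[:c]

# Example: one output channel's weights, shaped (C, R, S) = (64, 3, 3), with V = 16.
w = np.random.randn(64, 3, 3).astype(np.float32)
w_sq = per_vector_quantize(w, v=16, num_bits=4)
print("max error:", np.abs(w - w_sq).max())
```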
The quality of per-vector scaling depends on the vector size parameter. At one extreme with V = 1, each element would be individually quantized with its own scale factor and thus experience no loss in precision. At the other extreme with V = C, elements at each (R, S) position in the weights and each (H, W) position in the activations would share the same scale factor. Table 4 compares the accuracy of a 6-bit quantized ResNet50 with per-vector scaling for different vector sizes. Accuracy decreases with increasing vector size because larger vectors have a higher probability of needing to represent a wider range of values. The goal is to carefully select V to minimize the required number of scale factors (maximize vector size) while maximizing the precision of the vector-scaled approximation and the resulting network accuracy.

V=1   | V=2   | V=4   | V=8   | V=16  | V=32  | V=64
76.13 | 76.08 | 76.05 | 76.05 | 76.00 | 75.96 | 75.96

Table 4. Accuracy of 6-bit ResNet50 on ImageNet with VS-Quant for different vector sizes

In addition to better precision, the vector granularity also maps naturally to the vector unit of compute in typical DNN accelerators. Because convolution and linear layers can be conveniently expressed as a collection of dot-products between an unrolled region of weights and an unrolled region of activations, vector-MAC units are the ubiquitous building blocks of many DNN processing architectures. Equation 4 shows the dot-product y(j) between the j-th vector region of weights w(j)(i), i ∈ [0, V−1] and the j-th vector region of activations a(j)(i), i ∈ [0, V−1]:

$$y(j) = \sum_{i=0}^{V-1} w(j)(i) \cdot a(j)(i) \qquad (4)$$

With VS-Quant, we compute a scale factor s_w(j) for the j-th weight vector and a scale factor s_a(j) for the j-th activation vector to scale the quantized integer weights w_q(j)(i), i ∈ [0, V−1] and integer activations a_q(j)(i), i ∈ [0, V−1]. Therefore, the dot-product in Equation 4 becomes the scaled dot-product in Equation 5:

$$y_{sq}(j) = \left( \sum_{i=0}^{V-1} w_q(j)(i)\, a_q(j)(i) \right) s_w(j)\, s_a(j) \qquad (5)$$

Note that the scale factors are factored out of each vector MAC, leading to a simple VS-Quant hardware implementation, as discussed in Section 5.
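A minimal sketch of the scaled dot-product in Equation 5: the V-wide dot-product is computed on integers, and the product of the two per-vector scale factors is applied once per vector rather than per element. Names and the choice of NumPy are ours.

```python
import numpy as np

def vs_quant_dot(w_q, a_q, s_w, s_a):
    """Per-vector scaled dot product (Equation 5) for one pair of V-element vectors."""
    acc = int(np.dot(w_q.astype(np.int64), a_q.astype(np.int64)))  # integer V-wide dot product
    return acc * s_w * s_a                                          # scales applied once per vector

# Example: V = 16 with 4-bit signed operands and per-vector scale factors.
rng = np.random.default_rng(0)
w_q = rng.integers(-7, 8, size=16)
a_q = rng.integers(-7, 8, size=16)
print(vs_quant_dot(w_q, a_q, s_w=0.05, s_a=0.12))
```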
While it is orthogonal to per-vector scaling, calibration is still needed to determine the range of real values to be represented, which is parameterized by α. As with conventional scaling techniques, weight scale factors s_w(j) can be determined statically based on the trained model. Activation scale factors s_a(j) can be computed statically with representative input samples or dynamically during inference. Likewise, calibration methods including maximum absolute value, percentile, and entropy can still be applied. However, because each vector only has a small number of elements, the distribution of a vector may lack enough samples to support more sophisticated calibration methods like percentile and entropy to determine a statistically useful α.

The results in Table 3 rely on floating-point scale factors per vector, which would lead to an inefficient hardware implementation. To improve area and energy efficiency, we introduce a two-level scaling scheme that further applies integer quantization on the per-vector scale factors. With this scheme, the per-vector scale factor s in Equation 3 is factored into the product of an integer per-vector scale factor s_q and a floating-point coarse-grained scale factor γ, as shown in Equation 6:

$$x_{sq} = s_q \cdot \gamma \cdot x_q \qquad (6)$$

Here x_sq denotes the simulated-quantized value with two levels of scale factors. With an integer scale factor per vector, we need to store only a low-bitwidth integer alongside each vector of tensor elements, and we can complete all vector-wise computations with integer arithmetic. With the two-level scaling technique, we push the more expensive floating-point scale factors to the coarser granularity by introducing the less expensive integer scale factors at the finer granularity, achieving a balance between accuracy and hardware efficiency. Given N-bit integer weights or activations and M-bit integer per-vector scale factors, adding the M-bit scale factor alongside each V-element vector leads to a memory overhead of M/(VN). To give a perspective, with N = M = 4 and V = 16, the storage overhead is 6.25%, which equates to an effective bitwidth of 4.25 bits. Compared to coarse-grained scaling, two-level per-vector scaling requires scaling the dot-product by the product of the integer scale factors, which represents an extra (2N + log2(V)) × M multiplication for each vector dot-product.

Equations 7a-7j detail the algorithm for determining the scale factors when quantizing a real-valued tensor x to an N-bit signed integer in the two-level quantization scheme. Index i indicates each vector; index j represents each element of a vector; and k is the index along the coarse-grained dimension with different coarse-grained scale factors. Assuming per-channel scale factors for the weight tensor of a convolutional layer, k ∈ [0, K−1] while i ∈ [0, ⌈C/V⌉−1] and j ∈ [0, V−1]. The algorithm first computes floating-point scale factors at a per-vector granularity. Then it quantizes the per-vector scale factors by separating them into integer per-vector components and a floating-point per-channel component. We specify the datatype of each tensor in Equation 7 as fp for floating-point and int for integer.

$$x_{max}(k,i)_{fp} = \max_j |x(k,j,i)| \qquad (7a)$$
$$s(k,i)_{fp} = \frac{x_{max}(k,i)}{2^{N-1} - 1} \qquad (7b)$$
$$x_q(k,j,i)_{int} = \left\lfloor \frac{x(k,j,i)}{s(k,i)} \right\rceil \qquad (7c)$$
$$x_{sq}(k,j,i)_{fp} = x_q(k,j,i)\, s(k,i) \qquad (7d)$$
$$s_{max}(k)_{fp} = \max_i s(k,i) \qquad (7e)$$
$$\gamma(k)_{fp} = \frac{s_{max}(k)}{2^{M} - 1} \qquad (7f)$$
$$s_q(k,i)_{int} = \left\lfloor \frac{s(k,i)}{\gamma(k)} \right\rceil \qquad (7g)$$
$$s_q(k,i)_{fp} = s_q(k,i)\, \gamma(k) \qquad (7h)$$
$$x_{sq}(k,j,i)_{fp} = x_q(k,j,i)\, s_q(k,i)_{fp} \qquad (7i)$$
$$x_{sq}(k,j,i)_{fp} = x_q(k,j,i)\, s_q(k,i)\, \gamma(k) \qquad (7j)$$

To determine the per-vector scale factors, the algorithm computes the absolute maximum over the elements j ∈ [0, V−1] of each vector (k, i) in Equation 7a and then determines the floating-point per-vector scale factor that would scale the absolute maximum to the maximum representable N-bit signed integer. This step is analogous to Equation 1 but at a per-vector granularity. Equation 7c performs the actual per-vector scaling and rounds the resulting tensor values to integers, which will be used in our integer dot-product unit. Note that the scale factor here is per-vector for each (k, i) but broadcast correspondingly to each element (k, j, i) of the tensor. At this point, we have everything we need if we were doing a single-level quantization with floating-point scale factors per vector. The single-level simulated-quantized value is expressed in Equation 7d.

To further quantize the scale factor, we repeat the quantization process of taking the absolute maximum, computing the ratio of the real-valued maximum to the integer maximum, and scaling and rounding to integer, applied to the single-level scale factor as shown in Equations 7e to 7g. Equation 7h shows the two-level scale factor as a composition of an integer per-vector scale factor and a floating-point per-channel scale factor. The two-level simulated-quantized value is therefore represented as the product of the integer tensor values and the two levels of scale factors, as shown in Equation 7j.
Using two-level quantization for calibrating scale factors, DNN inference accuracy with PTQ across a range of weight, activation, and scale factor bitwidths is shown in Tables 5, 6, and 7. We compare the accuracy of VS-Quant with two-level scaling using low-bitwidth integer and fp16 scale factors to VS-Quant with fp32 scale factors and to per-channel scaling (similar to Table 3). Compared to per-channel scaling, we consistently observe significantly lower accuracy loss with VS-Quant across all three DNNs, particularly at low weight and activation bitwidths. For example, VS-Quant with 3-bit weights and 8-bit activations achieves over 89% accuracy for BERT-large on SQuAD, while the best per-channel calibrated quantization only achieves 8.7% accuracy.

The two-level quantization algorithm in Equation 7 is merely one of several ways to determine the two levels of scale factors. For example, instead of first computing the single-level per-vector scale factor and then breaking it down into the product of two levels of scale factors, we can do it one level at a time by first computing the per-channel scale factor and then back-calculating the per-vector scale factor. While this approach provides a larger space to explore for the integer values and integer scale factors, it requires computing the absolute maximum over a larger tensor as opposed to just a vector. This is much more expensive to implement in hardware if scaling activations dynamically during inference. However, it could be acceptable for scaling weights statically before inference.
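The sketch below walks through Equations 7a-7j for a single coarse-grained group k, assuming the data is already arranged as (⌈C/V⌉, V); the clipping of the integer scale factor to at least 1 is our guard against degenerate vectors, not part of the algorithm as stated.

```python
import numpy as np

def two_level_quantize(x, num_bits=4, scale_bits=4):
    """Two-level VS-Quant (Equations 7a-7j) for one coarse-grained group k.

    x: real-valued array of shape (num_vectors, V). Returns integer values x_q,
    integer per-vector scales s_q, and the fp per-channel scale gamma, so that
    x is approximated by x_q * s_q * gamma.
    """
    qmax = 2 ** (num_bits - 1) - 1                    # max N-bit signed integer
    smax = 2 ** scale_bits - 1                        # max M-bit unsigned scale
    x_max = np.abs(x).max(axis=1, keepdims=True)      # (7a) per-vector absolute maximum
    s = np.where(x_max == 0, 1.0, x_max / qmax)       # (7b) fp per-vector scale factor
    x_q = np.clip(np.round(x / s), -qmax, qmax)       # (7c) integer tensor values
    gamma = s.max() / smax                            # (7e)-(7f) fp per-channel scale factor
    s_q = np.clip(np.round(s / gamma), 1, smax)       # (7g); lower clip to 1 is our guard
    x_sq = x_q * s_q * gamma                          # (7j) two-level simulated quantization
    return x_q.astype(np.int32), s_q.astype(np.int32), gamma, x_sq

# Example: ceil(C/V) = 4 vectors of V = 16 elements, 4-bit values and 4-bit scale factors.
x = np.random.randn(4, 16).astype(np.float32)
x_q, s_q, gamma, x_sq = two_level_quantize(x)
print("max error:", np.abs(x - x_sq).max())
```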
5 Hardware Implementation
To evaluate the hardware efficiency of VS-Quant, we extended a previously optimized DNN accelerator (Venkatesan et al., 2019) by adding per-vector scaling support. Figure 2(a) shows the micro-architecture of a processing element (PE) in the accelerator, which is responsible for the dot-product computation listed in Equations 4 and 5. The PE consists of a set of VS-Quant vector MAC units, a weight buffer, an input activation buffer, an accumulation collector, a VS-Quant post-processing unit, and control logic.

Each VS-Quant vector MAC unit, shown in Figure 2(b), performs a V-element dot-product between the corresponding weight and activation data. In parallel, the product of the per-vector weight scale factor s_w and activation scale factor s_a is computed and rounded to the desired precision. The two outputs are then multiplied to compute a scaled partial sum output. Each entry of the weight buffer stores a weight vector along with its per-vector scale factor. Similarly, the input activation buffer stores an activation vector and a per-vector scale factor in each row. The accumulation collector stores partial sum values from all the vector MAC computations and temporally accumulates them across multiple cycles in an integer format. For N-bit weights and activations along with M-bit weight and activation scale factors, we have N × N → 2N-bit products that are accumulated over the vector size V, resulting in (2N + log2 V)-bit dot-product outputs. The dot-product results are multiplied with the product of the M-bit weight and activation scale factors to produce (2N + log2 V + 2M)-bit partial sums. For improved energy efficiency, the vector MAC unit can optionally round the product of the scale factors to fewer than 2M bits before multiplying with the dot-product result. Finally, the accumulation collectors are designed with appropriate widths to avoid overflow. Taken together, the PE achieves efficient data reuse across all three data types: (i) each input activation vector is shared spatially across multiple vector MAC units; (ii) weight vectors are reused temporally across multiple cycles using a weight collector; (iii) partial sums are reused spatially inside the vector MAC unit and temporally in the accumulation collector.

For post-processing, the output of the accumulation collector is fed to a post-processing unit (PPU).
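The bit-width bookkeeping in the preceding paragraph can be written out as simple arithmetic; this helper only reproduces the widths quoted in the text and is not a description of the actual RTL.

```python
import math

def vs_quant_mac_widths(n_bits, m_bits, v):
    """Datapath widths for a V-wide VS-Quant vector MAC, following the text above."""
    product_bits = 2 * n_bits                          # N x N multiply -> 2N-bit product
    dot_bits = product_bits + math.ceil(math.log2(v))  # accumulating V products adds log2(V) bits
    scale_bits = 2 * m_bits                            # M-bit weight scale x M-bit activation scale
    return dot_bits, dot_bits + scale_bits             # dot-product width, scaled partial-sum width

# Example: 4-bit weights/activations, 4-bit per-vector scales, V = 16
# -> 12-bit dot products and 20-bit scaled partial sums.
print(vs_quant_mac_widths(n_bits=4, m_bits=4, v=16))
```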
Bitwidths   | S=3/4 | S=3/6 | S=4/4 | S=4/6 | S=6/4 | S=6/6 | S=fp32 | Best Per-channel
Wt=4 Act=3U | 72.64 | 73.51 | 73.53 | 74.33 | 73.82 | 74.69 | 74.71  | 67.20
Wt=4 Act=4U | 73.39 | 74.20 | 74.36 | 75.04 | 74.58 | 75.35 | 75.28  | 70.76
Wt=4 Act=6U | 73.68 | 74.45 | 74.64 | 75.25 | 74.89 | 75.40 | 75.40  | 72.20
Wt=4 Act=8U | 73.65 | 74.42 | 74.66 | 75.21 | 74.83 | 75.42 | 75.42  | 72.30
Wt=6 Act=3U | 73.57 | 74.36 | 74.22 | 74.99 | 74.35 | 75.32 | 75.23  | 71.52
Wt=6 Act=4U | 74.26 | 75.12 | 74.95 | 75.59 | 75.14 | 75.80 | 75.83  | 74.77
Wt=6 Act=6U | 74.69 | 75.13 | 75.13 | 75.74 | 75.40 | 75.96 | 76.00  | 75.80
Wt=6 Act=8U | 74.55 | 75.19 | 75.19 | 75.73 | 75.41 | 76.02 | 76.03  | 75.89
Wt=8 Act=3U | 73.65 | 74.47 | 74.24 | 75.13 | 74.67 | 75.35 | 75.56  | 71.98
Wt=8 Act=4U | 74.48 | 75.16 | 75.08 | 75.71 | 75.21 | 75.96 | 75.91  | 75.11
Wt=8 Act=6U | 74.77 | 75.32 | 75.26 | 75.86 | 75.46 | 76.01 | 76.17  | 76.01
Wt=8 Act=8U | 74.61 | 75.33 | 75.15 | 75.85 | 75.47 | 76.10 | 76.15  | 76.16

Table 5. ResNet50 on ImageNet with integer per-vector scale factors
Bitwidths  | S=4/8 | S=4/10 | S=6/8 | S=6/10 | S=fp16 | S=fp32 | Best Per-channel
Wt=3 Act=8 | 76.28 | 81.84  | 78.44 | 82.81  | 82.90  | 82.93  | 11.03
Wt=4 Act=8 | 82.87 | 85.91  | 83.39 | 86.35  | 86.34  | 86.33  | 73.61
Wt=6 Act=8 | 83.47 | 86.16  | 83.63 | 86.54  | 86.57  | 86.61  | 80.18
Wt=8 Act=8 | 83.53 | 86.33  | 83.75 | 86.59  | 86.74  | 86.74  | 81.25

Table 6. BERT-base on SQuAD with integer per-vector scale factors
Bitwidths  | S=4/8 | S=4/10 | S=6/8 | S=6/10 | S=fp16 | S=fp32 | Best Per-channel
Wt=3 Act=6 | 83.17 | 86.24  | 84.3  | 86.92  | 87.03  | 87.13  | 6.88
Wt=3 Act=8 | 86.82 | 88.86  | 87.65 | 89.5   | 89.63  | 89.63  | 8.71
Wt=4 Act=6 | 87.08 | 88.08  | 87.66 | 88.52  | 88.84  | 88.84  | 10.06
Wt=4 Act=8 | 89.54 | 90.31  | 89.76 | 90.54  | 90.78  | 90.78  | 83.18
Wt=6 Act=6 | 87.34 | 88.39  | 87.76 | 89.01  | 89.15  | 89.2   | 32.29
Wt=6 Act=8 | 89.65 | 90.47  | 89.82 | 90.63  | 90.82  | 90.81  | 88.9
Wt=8 Act=6 | 87.4  | 88.68  | 87.72 | 89.02  | 89.15  | 89.13  | 40.5
Wt=8 Act=8 | 89.84 | 90.47  | 90.12 | 90.66  | 90.78  | 90.78  | 89.41

Table 7. BERT-large on SQuAD with integer per-vector scale factors
Tables 5-7. Accuracy of different networks when applying integer per-vector scale factors – Accuracy numbers are color-coded from highest (dark blue) to lowest acceptable (dark red). S=Sw/Sa indicates Sw-bit unsigned per-vector weight scale factors and Sa-bit unsigned per-vector activation scale factors. S=fp16 and S=fp32 indicate single-level fp16 and fp32 per-vector scale factors.

To implement dynamic calibration for the scale factors of the activations, we perform the required calibration operations in the VS-Quant PPU and convert the higher-precision output activations back to N-bit vector elements with per-vector scale factors for the next layer. Figure 2(c) shows the block diagram of the VS-Quant PPU that performs the calibrate-and-quantize operations. As a post-processing step following the completion of a layer of computation, we leverage a vector max unit to implement Equation 7a and compute the absolute maximum of each vector of elements. Then a reciprocal unit and shifter implement Equation 7b to compute the ratio of each vector's absolute maximum to the maximum representable value of an N-bit integer. The computed ratios are the scale factors used to quantize the output activations and convert them to VS-Quant format for computation of the next layer.

To quantify the area and energy impact of supporting VS-Quant in hardware, we also consider a baseline PE architecture for comparison, without the scale factor related multipliers in the vector MAC unit and without the scale factor overheads in the weight and activation buffers. In this case, each vector MAC unit simply performs a V-wide dot-product and produces a partial sum of width 2N + log2(V) for N-bit weights and activations. Per-channel scaling is performed in the baseline design PPU.

We evaluate the impact on energy per operation of VS-Quant compared to the baseline design using the MAGNet DNN generator and exploration infrastructure (Venkatesan et al., 2019). MAGNet's published 8-bit configuration achieved 2.1 tera-operations/sec/mm² (TOPS/mm²) and 69 fJ/operation (14.5 TOPS/Watt) in a 16nm FinFET technology node. We normalize all subsequent energy and area numbers in this paper to a similar baseline design with 8-bit weights and activations. The design tools shown in Table 8 are used to implement the hardware and measure area and power in a sub-16nm process technology.

Figure 3 shows the average energy per operation across a range of hardware configurations. In this and all subsequent plots, we use W/A/ws/as to denote each configuration, where W stands for weight bitwidth, A for activation bitwidth, ws for weight scale bitwidth, and as for activation scale bitwidth. A dash (-) indicates use of per-channel scaling. Energy is normalized to that of the 8/8/-/- configuration.
Figure 2. Hardware diagram — DNN accelerator with per-vector scaling support: (a) processing element, (b) VS-Quant vector MAC unit, (c) VS-Quant PPU.

Design tools
HLS Compiler      | Mentor Graphics Catapult HLS
Verilog simulator | Synopsys VCS
Logic synthesis   | Synopsys Design Compiler Graphical
Place-and-route   | Synopsys ICC2
Power analysis    | Synopsys PT-PX

Design space
Vector size                       | 16
Weight/activation precision       | 3-bit, 4-bit, 6-bit, 8-bit
Weight/activation scale precision | 3-bit, 4-bit, 6-bit, 8-bit, 10-bit
Scaling granularity               | POC, PVAO, PVWO, PVAW

Table 8. Experimental setup – POC = per-channel, PVAO = per-vector on activations only, PVWO = per-vector on weights only, and PVAW = per-vector on both weights and activations.

Figure 3. Effect of scale product bitwidth on energy – For this and subsequent figures, W/A/ws/as indicates weight, activation, weight scale, and activation scale bitwidths. Dashes indicate per-channel/per-layer scaling for weights/activations, respectively.

The blue bars for the per-channel scaled configurations (4/4/-/-, 6/6/-/-, 6/8/-/-, 8/8/-/-) show that quantization can achieve up to 2x energy savings over an 8-bit baseline. When the VS-Quant hardware is introduced and the scale factor product (s_w × s_a) in Figure 2(b) is kept at full-bitwidth precision (i.e., no rounding), the yellow bars for the 4/4/4/4 and 6/6/4/4 configurations show modest energy overheads at 4-bit and 6-bit weight and activation precisions over the corresponding per-channel scaled configurations, due to additional multipliers for scaling and wider accumulation widths. When the scale factor product is rounded to an intermediate size of 4 bits or 6 bits, the energy overheads of adding VS-Quant support to the hardware can be substantially reduced, as demonstrated by the orange and grey bars. In fact, scale factor rounding truncates many small values and converts them to zero, thereby providing opportunities for data gating of costly accumulation operations. As a result, the configurations with scale product rounding can achieve lower energy consumption than even the per-channel scaled configurations. The 8/8/6/- configuration shows the same energy for 6-bit scale and full-bitwidth scale product because full-bitwidth is exactly 6 bits in this case, owing to its 6-bit per-vector weight scale factor and lack of a per-vector activation scale factor.
6 Design Space Exploration
To better understand the accuracy, energy, and area tradeoffs enabled by VS-Quant, we combine the energy and area results from our DNN inference accelerator with accuracy results from real networks using a PyTorch-based PTQ library (Wu et al., 2020). Table 8 details the design tools used and the parameters explored in our full evaluation. In this section, we limit DNN accelerator configurations to those without intermediate rounding and use full-bitwidth scale factor products (yellow bars from Figure 3).
Figure 4. ResNet50 design space
Figures 4, 5, and 6 present the design spaces of ResNet50, BERT-base, and BERT-large, respectively, for various bitwidth configurations of our DNN accelerator hardware. Results are shown as a tradeoff among energy efficiency (x-axis), area efficiency as performance per unit area (y-axis), and inference accuracy (color/shape). Since all configurations run with the same throughput (operations per cycle), performance is identical and only the VLSI energy and area costs vary. Each point in the plot reports metrics for a synthesized hardware instance selected from the set of precision parameter options in Table 8, normalized to our baseline design (the 8/8/-/- configuration). Energy results are averaged over the layers of the networks, weighted by the number of operations in each layer. For each network, we decide the acceptable amount of accuracy loss against the full-precision baseline and only visualize those design points that are within the acceptable accuracy range. As indicated in the legends of Figures 4, 5, and 6, we plot only ResNet50, BERT-base, and BERT-large design points that have an accuracy above 74.0%, 80.0%, and 84.5%, respectively. We then subdivide the acceptable range into finer accuracy ranges (four colors/shapes) to help visualize the achieved accuracy on top of the area-energy space. For design points of the same color/shape (within the same accuracy range), the upper left of the plot is optimal, with the lowest energy per operation and highest performance per area. Solid points indicate Pareto-optimal area or energy efficiency for their color/shape (accuracy range) whereas hollow points are not optimal. Overall, VS-Quant provides a much more expansive space of design tradeoffs than baseline 4-bit, 6-bit, and 8-bit datapaths, which we discuss in detail for each network below.

For ResNet50 results (Figure 4), the baseline 8/8/-/- already has minimal accuracy loss compared to the floating-point reference, so limited accuracy gains are available from VS-Quant. However, the green/circle 6/8/6/- VS-Quant point (6-bit weights, 8-bit activations, and 6-bit per-vector scale factors for weights) provides 12% smaller area at similar accuracy and similar energy. When moving to 4-bit and 6-bit representations, VS-Quant provides even more energy and area reductions in the 74.5% to 75.5% accuracy range.

For the BERT models, VS-Quant is observed to be the most competitive across multiple accuracy targets, requiring very few bits for representing weights. In particular, a 4/8/6/10 configuration (4-bit weights, 8-bit activations, 6-bit per-vector weight scale factors, and 10-bit per-vector activation scale factors) for either model can achieve an accuracy target close to that of the full-precision baseline. This accuracy is not attainable even with our baseline design (8-bit per-channel scaled quantization), according to Table 2. Alternatively, we can also save some energy at the cost of a slight area increase with a 6/8/-/10 configuration while maintaining close to full-precision accuracy. Although this configuration requires a higher weight bitwidth than the previous configuration, it reduces energy by avoiding per-vector scaling on the weights. This tradeoff suggests a combined effect between the value bitwidth and the scale factor bitwidth, which together present an effective bitwidth for the particular configuration. If we relax our accuracy requirement to at least 82.0% for BERT-base and 86.5% for BERT-large, we can further decrease area and energy by dropping the weight precision to only 3 bits. Based on the design points, the only BERT-large configuration where it makes sense to implement per-channel scaled quantization is the 6/8/-/- configuration targeting around 1% accuracy loss, although this configuration trades off significant area to attain the lowest energy in that accuracy range. Furthermore, if the same 6/8/-/- hardware configuration were chosen for BERT-base, it would lead to a large 6% accuracy loss. In comparison, optimal VS-Quant hardware configurations such as 4/8/6/10 achieve high accuracy on both BERT-base and BERT-large.

We further study how the size of a network affects its accuracy, energy, and area tradeoff by comparing the design points of BERT-base against those of BERT-large. As shown in Figure 7, for example, BERT-large is the only choice if the accuracy target is beyond the best that BERT-base is able to achieve. Below that threshold, we should always select BERT-base because it is consistently more area-efficient than BERT-large. This suggests that one should configure the size of the model based on the desired accuracy target to realize the best hardware efficiency.
7 Quantization-Aware Retraining
While we are able to leverage per-vector scaled PTQ to maintain reasonable accuracy down to 3 bits in some cases, accuracy loss is inevitable at low precisions when compared to a full-precision baseline when QAT is not applied. The loss can be substantial if an inferior combination of weight and activation precisions is used. For example, BERT generally requires 8-bit precision for activations to get reasonable accuracy even with VS-Quant. In addition, many practical deployment scenarios may not have QAT as an option due to lack of access to full training datasets or limits on compute time and tuning effort. However, there are cases in which we can finetune a pretrained model with quantization for only a limited number of iterations to adapt the weights and activations to the quantized model (McKinstry et al., 2018).

Figure 5. BERT-base design space

Figure 6. BERT-large design space

Figure 7. Accuracy and area tradeoff for BERT models of different sizes – Per-channel design points are outlined.

VS-Quant is not limited to PTQ and can also be applied to QAT to achieve even higher accuracy for a given set of bitwidths. We apply per-vector scaled QAT using a conventional QAT framework that leverages a straight-through estimator (STE) in the backward pass to propagate the gradient through any quantizer. While the framework trains the weights that get fed into the quantizers in the model, the quantization scale factors are not parameters and are not explicitly trained. Table 9 evaluates the best accuracy achieved with QAT-based finetuning for both per-vector scaled quantization and per-channel scaled quantization. The number of retraining epochs taken to recover the specified accuracy is shown in parentheses. Based on the presented cases in Table 9, per-vector scaled QAT gives significantly better accuracy than per-channel scaled QAT and requires much less effort to recover accuracy loss from quantization.
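As an illustration of the conventional STE-based QAT flow described above, the following PyTorch sketch defines a fake-quantization function whose backward pass passes gradients straight through the round and clamp; it is a generic pattern under our own naming, not the authors' training code.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated quantization whose gradient uses a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        x_q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return x_q * scale                     # simulated-quantized value used in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as identity so the gradient reaches the fp32 weights.
        return grad_output, None, None

# Usage inside a layer's forward pass; here the scale is a per-row max calibration for 4 bits.
w = torch.randn(64, 64, requires_grad=True)
scale = w.detach().abs().amax(dim=1, keepdim=True) / 7
w_sq = FakeQuantSTE.apply(w, scale, 4)
w_sq.sum().backward()                          # gradients flow straight through to w
```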
Model      | Bitwidths   | PVAW       | POC
ResNet50   | Wt=3 Act=3U | 75.53 (20) | 72.02 (20)
BERT-base  | Wt=4 Act=4  | 86.93 (10) | 41.45 (20)
           | Wt=4 Act=8  | 87.80 (2)  | 87.01 (2)
BERT-large | Wt=3 Act=4  | 89.26 (2)  | 21.61 (10)
           | Wt=3 Act=8  | 90.59 (1)  | 88.34 (1)

Table 9. QAT study – Compares the best accuracy achieved after QAT-based finetuning with per-vector (PVAW) and per-channel (POC) scaling. The number of retraining epochs taken to recover the accuracy is shown in parentheses.
8 Conclusions
In this paper, we introduced VS-Quant, a novel per-vector scaled quantization technique that employs per-vector scale factors to mitigate accuracy loss typical in existing quantized DNN models. To support efficient per-vector scaling in hardware, we implemented a two-level scaling scheme and associated algorithm that combine a set of fine-grained scale factors with each coarse-grained scale factor. We evaluated VS-Quant on a set of popular DNN models and tasks and demonstrated that it achieves significant improvement in post-training quantization accuracy when compared to conventional per-channel scaled quantization techniques. By extending the vector MAC unit of a DNN accelerator to dynamically support per-vector scaling at inference time, we analyzed the area and power implications of per-vector scaling on the hardware. Experiments demonstrate that VS-Quant with 4-bit weights and activations achieves a 37% area saving and a 24% energy saving while maintaining over 75% accuracy for ResNet50 on ImageNet. Furthermore, VS-Quant with 4-bit weights and 8-bit activations achieves near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to a non-VS-Quant baseline design. In future work, we will further optimize the VS-Quant hardware and study the effect of scale factor and other intermediate rounding. We will also extend QAT to explicitly learn per-vector scale factors and co-optimize model architectures themselves with the VS-Quant hardware.
References
Bhandare, A., Sripathi, V., Karkada, D., Menon, V., Choi, S., Datta, K., and Saletore, V. Efficient 8-bit Quantization of Transformer Neural Machine Language Translation Model. arXiv preprint arXiv:1906.00532, 2019.
Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. ZeroQ: A Novel Zero Shot Quantization Framework. Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv preprint arXiv:1805.06085, 2018.
Courbariaux, M., Bengio, Y., and David, J.-P. Training Deep Neural Networks with Low Precision Multiplications. arXiv preprint arXiv:1412.7024, 2014.
Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. Conf. on Neural Information Processing Systems (NeurIPS), 2015.
Fang, J., Shafiee, A., Abdel-Aziz, H., Thorsley, D., Georgiadis, G., and Hassoun, J. Near-lossless Post-training Quantization of Deep Neural Networks via a Piecewise Linear Approximation. arXiv preprint arXiv:2002.00104, 2020.
Gong, Y., Liu, L., Yang, M., and Bourdev, L. Compressing Deep Convolutional Networks using Vector Quantization. arXiv preprint arXiv:1412.6115, 2014.
Gray, R. Vector Quantization. IEEE ASSP Magazine, 1984.
Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149, 2015.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized Neural Networks. Conf. on Neural Information Processing Systems (NeurIPS), 2016.
Jain, S., Venkataramani, S., Srinivasan, V., Choi, J., Gopalakrishnan, K., and Chang, L. BiScaled-DNN: Quantizing Long-tailed Datastructures with Two Scale Factors for Deep Neural Networks. Design Automation Conf. (DAC), 2019.
Khoram, S. and Li, J. Adaptive Quantization of Neural Networks. Int'l Conf. on Learning Representations (ICLR), 2018.
Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K., Constable, W., Elibol, O., Gray, S., Hall, S., Hornof, L., et al. Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. Conf. on Neural Information Processing Systems (NeurIPS), 2017.
Krishnamoorthi, R. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv preprint arXiv:1806.08342, 2018.
LeCun, Y., Bengio, Y., and Hinton, G. Deep Learning. Nature, 521(7553):436-444, 2015.
Lee, J. H., Ha, S., Choi, S., Lee, W.-J., and Lee, S. Quantization for Rapid Deployment of Deep Neural Networks. arXiv preprint arXiv:1810.05488, 2018.
McKinstry, J. L., Esser, S. K., Appuswamy, R., Bablani, D., Arthur, J. V., Yildiz, I. B., and Modha, D. S. Discovering Low-precision Networks Close to Full-precision Networks for Efficient Embedded Inference. arXiv preprint arXiv:1809.04191, 2018.
Mishra, A. and Marr, D. Apprentice: Using Knowledge Distillation Techniques to Improve Low-precision Network Accuracy. arXiv preprint arXiv:1711.05852, 2017.
Miyashita, D., Lee, E. H., and Murmann, B. Convolutional Neural Networks using Logarithmic Data Representation. arXiv preprint arXiv:1603.01025, 2016.
Moons, B., Goetschalckx, K., Van Berckelaer, N., and Verhelst, M. Minimum Energy Quantized Neural Networks. Asilomar Conference on Signals, Systems, and Computers, 2017.
Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M. Data-free Quantization through Weight Equalization and Bias Correction. Int'l Conf. on Computer Vision (ICCV), 2019.
NVIDIA Corporation. NVIDIA A100 Tensor Core GPU Architecture. NVIDIA, 2020.
Prato, G., Charlaix, E., and Rezagholizadeh, M. Fully Quantized Transformer for Improved Translation. arXiv preprint arXiv:1910.10485, 2019.
Rouhani, B., Lo, D., Zhao, R., Liu, M., Fowers, J., Ovtcharov, K., Vinogradsky, A., Massengill, S., Yang, L., Bittner, R., et al. Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point. Conf. on Neural Information Processing Systems (NeurIPS), 2020.
Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. AAAI Conf. on Artificial Intelligence (AAAI), 2020.
Sijstermans, F. The NVIDIA Deep Learning Accelerator. Symp. on High Performance Chips (Hot Chips), 2018.
Stock, P., Joulin, A., Gribonval, R., Graham, B., and Jégou, H. And the Bit Goes Down: Revisiting the Quantization of Neural Networks. Int'l Conf. on Learning Representations (ICLR), 2020.
Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. Efficient Processing of Deep Neural Networks. Synthesis Lectures on Computer Architecture, 2020.
Tambe, T., Yang, E.-Y., Wan, Z., Deng, Y., Reddi, V. J., Rush, A., Brooks, D., and Wei, G.-Y. Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference. Design Automation Conference (DAC), 2020.
Venkatesan, R., Shao, Y. S., Wang, M., Clemons, J., Dai, S., Fojtik, M., Keller, B., Klinefelter, A., Pinckney, N. R., Raina, P., et al. MAGNet: A Modular Accelerator Generator for Neural Networks. Int'l Conf. on Computer Aided Design (ICCAD), 2019.
Wu, B., Wang, Y., Zhang, P., Tian, Y., Vajda, P., and Keutzer, K. Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search. arXiv preprint arXiv:1812.00090, 2018.
Wu, H. Low Precision Inference on GPUs. GPU Technology Conference (GTC), 2019.
Wu, H., Judd, P., Zhang, X., Isaev, M., and Micikevicius, P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv preprint arXiv:2004.09602, 2020.
Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. Quantized Convolutional Neural Networks for Mobile Devices. Conf. on Computer Vision and Pattern Recognition (CVPR), 2016a.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016b.
Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. Q8BERT: Quantized 8Bit BERT. arXiv preprint arXiv:1910.06188, 2019.
Zhao, R., Hu, Y., Dotzel, J., De Sa, C., and Zhang, Z. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. Int'l Conf. on Machine Learning (ICML), 2019.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv preprint arXiv:1606.06160, 2016.
Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained Ternary Quantization. arXiv preprint arXiv:1612.01064, 2016.