Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators
Hamzah Abdel-Aziz, Ali Shafiee, Jong Hoon Shin, Ardavan Pedram, Joseph H. Hassoun
Abstract
In this paper, we propose a mixed-precision convolution unit architecture which supports different integer and floating point (FP) precisions. The proposed architecture is based on low-bit inner product units and realizes higher precision based on temporal decomposition. We illustrate how to integrate FP computations on an integer-based architecture and evaluate the overheads incurred by FP arithmetic support. We argue that the alignment and addition overhead for FP inner products can be significant, since the maximum exponent difference could be up to 58 bits, which results in large alignment logic. To address this issue, we illustrate empirically that no more than 26 product bits are required, and up to 8 bits of alignment is sufficient in most inference cases. We present novel optimizations based on the above observations to reduce the FP arithmetic hardware overheads. Our empirical results, based on simulation and hardware implementation, show a significant reduction in FP16 overhead. Over a typical mixed precision implementation, the proposed architecture achieves area improvements of up to 25% in TFLOPS/mm² and up to 46% in TOPS/mm², with power efficiency improvements of up to 40% in TFLOPS/W and up to 63% in TOPS/W.

1 Introduction
Deep Neural Networks (DNNs) have shown tremendous success in modern AI tasks such as computer vision, natural language processing, and recommender systems (LeCun et al., 2015). Unfortunately, DNNs' success comes at the cost of significant computational complexity (e.g., energy, execution time, etc.). Therefore, DNNs are accelerated on specialized hardware units (DNN accelerators) to improve both performance and energy efficiency (Jouppi et al., 2017; ten, 2017; Reuther et al., 2019). DNN accelerators may utilize quantization schemes to reduce DNNs' memory footprint and computation time (Deng et al., 2020). A typical quantization scheme compresses all DNN layers into the same low-bit integer format, which can be sub-optimal, as different layers have different redundancy and feature distributions (Wang et al., 2019; Wu et al., 2018a). On the other hand, a mixed precision quantization scheme assigns different precisions (i.e., bit widths) to different layers, and it shows remarkable improvement over uniform quantization (Song et al., 2020; Wang et al., 2019; Chu et al., 2019; Cai et al., 2020). Therefore, mixed-precision quantization schemes (Song et al., 2020; Wang et al., 2019; Chu et al., 2019; Cai et al., 2020) or hybrid approaches, where a few layers are kept in FP and the rest are quantized to integer, are considered to maintain FP32-level accuracy (Zhu et al., 2016; Venkatesh et al., 2017).

(Samsung Semiconductor, Inc., San Jose, CA. Correspondence to: Hamzah Abdel-Aziz <[email protected]>.)

Half precision floating point (FP16) and custom floating point data types (e.g., bfloat16 (Abadi et al., 2016)) are adopted for inference and training in several cases when quantization is not feasible (online learning, private datasets, supporting legacy code, etc.). They can reduce memory footprint and computation by a factor of two without significant loss of accuracy, and they are often obtained by simply downcasting the tensors.
FP16 shows remarkable benefits in numerous DNN training applications, where FP16 is typically used as the weight and activation data type and FP32 is used for accumulation and gradient update (Micikevicius et al., 2017; Jia et al., 2018; Ott et al., 2019).

Data precision varies significantly, from low-bit integer to FP data types (e.g., INT4, INT8, FP16, etc.), within and across different DNN applications. Therefore, mixed-precision DNN accelerators that support versatility in data types are crucial and sometimes mandatory to exploit the benefit of different software optimizations (e.g., low-bit quantization). Moreover, supporting versatility in data types can be leveraged to trade off accuracy for efficiency based on the available resources (Shen et al., 2020). Typically, mixed-precision accelerators are designed based on low precision arithmetic units, and higher precision operations can be supported by fusing the low precision arithmetic units temporally or spatially.

The computation of DNNs boils down to the dot product as the basic operation. Typically, the inner product is implemented either temporally, by exploiting a multiply-accumulate (MAC) unit in time, or spatially, using an inner product (IP) unit with multipliers followed by an adder tree. The multiplier and adder bit widths are the main architectural decisions in implementing the arithmetic unit for the dot product operation. The multiplier precision is a key factor for the final performance and efficiency of both IP- and MAC-based arithmetic units. For example, a higher multiplier precision limits the benefit of lower-bit (e.g., INT4) quantization. On the other hand, while lower precision multipliers are efficient for low-bit quantization, they incur excessive overhead for the addition units. Therefore, the multiplier bit width is decided based on the common-case quantization bit width.
The adder bit width in an integer IP based architecture matches the multiplier output bit width. Thus, IP units can improve energy efficiency by using smaller adders and sharing the accumulation logic. However, in multiply-and-accumulate (MAC) based architectures (Chen et al., 2016), adders are larger, since they serve as accumulators as well. This overhead is more pronounced in low-power accelerators with low-precision multipliers optimized for low-bit quantized DNNs.

Implementing a floating point IP (FP-IP) operation requires alignment of the products before summation, which requires large shift units and adders. Theoretically, the maximum range of alignment between FP16 products requires shifting the products by up to 58 bits. Thus, an adder tree able to align any two FP16 products would impose an additional 58 bits in its input precision. Such alignments are only needed for FP operations and appear as significant power and area overhead for INT operations, especially when IP units are based on low-precision multipliers.

In this paper, we explore the design space trade-offs of IP units that support both FP and INT based convolution. We make a case for a dense low-power convolution unit that intrinsically supports INT4 operations. Furthermore, we go over the inherent overheads of supporting larger INT and FP operations. We consider INT4 for two main reasons. First, this data type is the smallest type supported in several modern architectures that are optimized for deep learning (e.g., AMD MI50 (amd), the Nvidia Turing architecture (Kilgariff et al., 2018) and Intel Spring Hill (Wechsler et al., 2019)). Second, recent research on quantization reports promising results for 4-bit quantization schemes (Fang et al., 2020; Jung et al., 2019; Nagel et al., 2020; Choukroun et al., 2019; Banner et al., 2019b; Wang et al., 2019; Choi et al., 2018; Zhuang et al., 2020).
In spite of this, the proposed optimization is not limited to the INT4 case and can be applied to other cases (e.g., INT8), as we discuss in Section 4.

The contributions of the paper are as follows:

1. We investigate approximated versions of the FP-IP operation with limited alignment capabilities. We derive a mathematical bound on the absolute error and conduct numerical analysis based on DNN models and synthetic values. We postulate that approximate FP-IP can maintain GPU-based accuracy if it can align the products by at least 16 bits and 27 bits, for FP16 and FP32 accumulators, respectively.

2. We demonstrate how to implement large alignments using smaller shift units and adders in multiple cycles. This approach decouples software accuracy requirements from the underlying IP unit implementation. It also enables more compact circuits at the cost of FP task performance.

3. Instead of running many IP units synchronously in one tile, we decompose them into smaller clusters. This can isolate FP-IP operations that need a large alignment and limits the performance degradation to one cluster.

4. We study the design trade-offs of our architecture. The proposed architecture, implemented in a standard 7nm technology, can achieve up to 25% in TFLOPS/mm² and up to 46% in TOPS/mm² in area efficiency, and up to 40% in TFLOPS/W and up to 63% in TOPS/W in power efficiency.

The rest of this paper is organized as follows. In Section 2, we present the proposed architecture of the mixed-precision inner product unit (IPU) and explain how it can support different data types including FP16. In Section 3, we first review the alignment requirements for FP16 operations and offer architecture optimizations to reduce FP16 overheads. Section 4 goes over our methodology and discusses the empirical results. In Section 5, we review related work, and we conclude the paper in Section 6.

2 Mixed-Precision Inner Product Unit
To support different data types and precisions, we use a fine-grain convolution unit that can run INT4 intrinsically and realize larger sizes temporally. We consider INT4 as the default common case, since several recent research efforts are promoting INT4 quantization schemes for efficient inference (Jung et al., 2019; Nagel et al., 2020). However, the proposed architecture can be applied to other cases, such as INT8 as the baseline.

Figure 1 shows the building blocks of the proposed mixed-precision n-input IPU, which is based on 5b×5b signed multipliers. The proposed IPU computes INT4 multiplications, both signed and unsigned, in a single cycle. In addition, larger precision operations can be computed in multiple nibble iterations. The total number of nibble iterations is the product of the numbers of nibbles of the two multiplier operands. Products are passed to a local right shift unit, which is used in FP mode for alignment, and the shifted outputs are connected to an adder tree. The adder tree results are fed to the accumulator. In the next two subsections, we illustrate the microarchitecture in detail for the INT and FP modes, respectively.

Figure 1. Microarchitecture of the proposed mixed-precision IPU data path with n inputs and w-bit IPU precision.

The IPU is based on INT4, and the computation of higher INT precision is based on nibble iterations. For example, if the multiplier operands are INT8 and INT12, six nibble iterations are required to complete an INT8 by INT12 multiplication for a single IP operation. The local shift amount is always 0, since no alignment is required in INT mode. The result of the adder tree is concatenated with (33 − w) bits of zeros on the right side and always fed to the accumulator shift unit through the swap unit. The amount of shift depends on the significance of the nibble operands. For instance, if N_k refers to the k-th nibble of a number (i.e., N_0 is the least significant nibble), the amount of shift for the result of the IPU operation on nibbles N_i and N_j of the first and second operands is 4 × ((K_a − i − 1) + (K_b − j − 1)), where K_a and K_b are the total numbers of nibbles of operands a and b, respectively. The accumulator can add up to n × d multiplications, where n is the number of IPU inputs and d is the maximum number of times the IPU can accumulate without overflow. In this scenario, the accumulator size should be at least 33 + t + l bits, where l = ⌈log₂ d⌉. In INT mode, we assume exp = max_exponent = 0.

In FP mode, the mantissa multiplication is computed similarly to an INT12 IPU operation, but with the following additional operations.
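Before turning to those FP-specific steps, the INT-mode nibble iteration scheme can be sketched behaviorally in a few lines of Python. This is our own model, not the RTL: it weights each partial product with a left shift by nibble significance, whereas the hardware achieves the same weighting with a left-aligned adder-tree result followed by the accumulator right shift described above.

```python
def nibble_mul(a, b, k_a, k_b):
    """Multiply two signed integers of k_a and k_b nibbles (4-bit digits)
    using only 5b x 5b signed multiplies, one per nibble iteration."""
    def to_nibbles(x, k):
        u = x & ((1 << (4 * k)) - 1)              # two's-complement bit image
        ns = [(u >> (4 * i)) & 0xF for i in range(k)]
        if ns[-1] >= 8:                           # the most significant nibble
            ns[-1] -= 16                          # carries the sign
        return ns

    na, nb = to_nibbles(a, k_a), to_nibbles(b, k_b)
    acc = 0
    for i in range(k_a):                          # k_a * k_b nibble iterations
        for j in range(k_b):
            p = na[i] * nb[j]                     # fits a 5b x 5b signed multiplier
            acc += p << (4 * (i + j))             # weight by nibble significance
    return acc
```

For example, an INT8 by INT12 multiplication (`k_a = 2`, `k_b = 3`) takes the six nibble iterations mentioned above, and the accumulated result equals the exact product.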
Converting numbers: Let us define the magnitude of an FP number as 0.mantissa for subnormal and 1.mantissa for normal FP numbers. We call it the signed magnitude when the sign bit is included. Suppose M[11:0] is the 12-bit signed magnitude of an FP16 number. It is converted into the following three 5-bit nibble operands: N2 = {M11−M7}, N1 = {M11, M6−M3}, and N0 = {M11, M2−M0, 0}. This decomposition introduces a zero in the least significant position of N0. Since the FP-IP operation relies on right shifting and truncation to align the products, the implicit left shift of the operands preserves more accuracy.

Local alignment: The product results should be aligned with respect to the maximum exponent of all products (see Appendix A for more details). Therefore, each multiplier output is passed to a local right shift unit that receives the shift amount from the exponent handling unit (EHU). The EHU computes the product exponents by performing the following steps, in order: (1) element-wise summation of the operands' unbiased exponents, (2) computing the maximum of the product exponents, and (3) computing the alignment shift amounts as the differences between the product exponents and the maximum exponent. A single EHU can be shared between multiple IPUs to amortize its overhead (i.e., multiplexed in time between IPUs), since a single FP-IP operation consists of multiple nibble iterations with the same exponent computation.

The range of the exponent of FP16 products is [−28, 30]; thus the exponent difference (i.e., the right shift amount) between two FP16 products can be up to 58 bits. In general, the bit width of a product increases with the amount of right shift (i.e., alignment with the max exponent). However, due to the limited precision of the accumulator, an approximate computation is sufficient, where the aligned product can be bounded and truncated to a smaller bit width. We define this width as the IPU precision and use it to parametrize IPUs. The IPU precision is also the maximum amount of local right shift as well as the bit width of the adder tree. We quantify the impact of precision on the computation accuracy in Section 3.1.

The accumulator operations:
During the computation of one pixel, FP accumulators keep two values: the accumulator's exponent and its non-normalized signed magnitude. Once all the input vector pairs are computed and accumulated, the result in the accumulator is normalized and reformatted to the standard representation (i.e., FP16 or FP32).

The details of the accumulation logic are depicted on the right side of Figure 1. The accumulator has a (33 + t + l)-bit register and a right shift unit (see Figure 1 for the definitions of t and l). Therefore, the register size allows up to 33 bits of right shift, which is sufficient to preserve accuracy, as discussed in Section 3.1.

In contrast to the INT-mode accumulator, where the right shift logic only shifts by multiples of 4, the FP-IP accumulator can right shift by any amount in [0 : 33 + t + l]. The shift amount is computed in the exponent logic and is equal to 4 × ((3 − i − 1) + (3 − j − 1)) + |max_exp − exp|, where i and j are the input nibble indices, exp is the accumulator's exponent value, and max_exp is the adder tree exponent (i.e., the max exponent). A swap operation followed by a right shift is applied whenever a left shift is needed; hence, a separate left shift unit is not required. In other words, the swap operation is triggered only when max_exp > exp. With respect to exp, the accumulator value is a fixed point number with 33 + t + l bits: a sign, (3 + t + l) bits in integer positions and 30 bits in fraction positions. Note that the accumulator holds an approximate value, since the least significant bits are discarded and its bit width is provisioned for the practical size of IPUs. Before the result is written back to memory, it is rounded to its standard format (i.e., FP16 or FP32).

For the rest of this paper, we define an IPU(w) as an inner product unit with 5-bit signed multipliers, a w-bit adder tree, and a local right shifter that can shift and truncate the multipliers' outputs by up to w bits. We refer to w as the IPU's adder tree precision, or IPU precision for brevity. In general, the result of an IPU(w) computation might be inaccurate, as only the w most significant bits of the local shifter results are considered. However, there are exceptions:

Proposition 1. For IPU(w), truncation is not needed and the adder tree result is accurate if the alignment amounts, given by the EHU, of all the products are smaller than w − 9. We refer to w − 9 as the safe precision of the IPU.

It is clear that the area and power overhead increase as the IPU precision increases (see Section 4.2). The maximum required precision is determined by the software requirements and the accumulator precision (see Section 3.1).
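To make the FP-mode data path concrete, the magnitude decomposition and the three EHU steps described above can be modeled as follows. This is a behavioral sketch with names of our choosing (`decompose_magnitude`, `ehu_shift_amounts`); FP16 field extraction is omitted and exponents are assumed already unbiased.

```python
def decompose_magnitude(sign, mag):
    """Split an 11-bit FP16 magnitude (1.mantissa or 0.mantissa) plus its
    sign into the three signed 5-bit nibble operands [N0, N1, N2], least
    significant first; N0 is pre-shifted left by one bit."""
    assert 0 <= mag < (1 << 11)
    n0 = (mag & 0x7) << 1          # {M2-M0, 0}
    n1 = (mag >> 3) & 0xF          # {M6-M3}
    n2 = (mag >> 7) & 0xF          # {M10-M7}
    s = -1 if sign else 1
    return [s * n0, s * n1, s * n2]

def ehu_shift_amounts(act_exps, wgt_exps):
    """The EHU steps: (1) element-wise product exponents, (2) their
    maximum, (3) per-product alignment right-shift amounts."""
    prod_exps = [ea + eb for ea, eb in zip(act_exps, wgt_exps)]
    max_exp = max(prod_exps)
    return max_exp, [max_exp - e for e in prod_exps]
```

Undoing N0's implicit left shift reconstructs the magnitude as n2·2⁷ + n1·2³ + n0/2, which confirms the decomposition is lossless.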
3 Optimizing Floating Point Logic
In this section, we tackle the overhead of large shifters and adder trees by first evaluating the minimum shift and adder sizes required to preserve accuracy (Section 3.1) for both FP16 and FP32 accumulators. Based on this evaluation, we propose optimization methods to implement FP IPUs with relatively smaller shift units and adders (Section 3.2 and Section 3.3).
As mentioned in Section 2, an FP-IP operation is decomposed into multiple nibble iterations. In a typical implementation, the multiplier outputs of each iteration require large alignment shifts, and the adder tree has high precision inputs. However, this high precision would be discarded due to the limited precision of the accumulator (FP16 or FP32); hence, an approximated version of FP-IP alignment can be used without significant loss of accuracy. Figure 2 shows the pseudocode for the approximate FP-IP operation customized for our nibble-based IPU architecture. The approximate FP-IP computes only the most significant precision bits of the products (lines 5-7). The precision parameter allows us to quantify the absolute error.

Figure 2. Pseudocode for the approximate version of a nibble iteration (top) and the FP-IP operation with the approximate nibble iteration method (bottom). Precision is the IPU precision.
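The truncating alignment can be modeled in a few lines. This is our behavioral sketch of the idea, not the paper's Figure 2 pseudocode itself: each product is aligned to the maximum exponent while keeping only `precision` fractional bits, so bits shifted past that width are simply lost.

```python
from fractions import Fraction

def approx_aligned_sum(products, shifts, precision):
    """Sum integer products aligned to the max exponent, keeping only
    `precision` fractional bits (modeling the truncating local shifter
    plus adder tree). The returned integer is scaled by 2**precision."""
    total = 0
    for p, s in zip(products, shifts):
        total += (p << precision) >> s    # floor-shift discards low bits
    return total

def exact_aligned_sum(products, shifts):
    """Reference: the same aligned sum with no truncation."""
    return sum(Fraction(p, 1 << s) for p, s in zip(products, shifts))
```

Each term loses less than one unit in the last place, so the total error stays below n · 2^(−precision), in the spirit of the bound quantified by Theorem 1 below.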
Theorem 1. For an FP-IP with n pairs of FP16 inputs, the absolute error due to approx_nibble_iteration(i, j, precision), called abs_error(i, j), is no larger than 225 × 2^(4×(i+j)−22) × 2^(max−precision) × (n − 1), where max is the maximum exponent of all the products in the FP operation.

Proof: Due to space limitations, we only provide an outline of the proof. The highest error occurs when all n − 1 products but one are shifted precision positions to the right, and thus appear as errors. For the maximum absolute error, these products should all have the same sign and the maximum operands (i.e., 15); hence each product would be 15 × 15 = 225. The term 2^(4×(i+j)) accounts for proper alignment based on nibble significance. The term 2^(−22) is needed, since each FP number has 3 bits in integer and 22 bits in fraction positions, with respect to its own exponent.
Remark 1. Iterations of the most significant nibbles (i.e., largest i + j) have the most significant contributions to the absolute error.

The FP-IP operation is the result of nine approximate nibble iterations added into the accumulator. However, only the 11 or 24 most significant bits of the accumulated result are needed for FP16 or FP32 accumulators, respectively. Unfortunately, the accumulator is non-normalized and its leading non-zero position depends on the input values. As a result, it is not possible to determine a fixed precision for each approximate nibble iteration that guarantees no loss of significance. Therefore, we use numerical analysis to find the proper shift parameters. In our analysis, we consider both synthetic input values and input values sampled from tensors found in ResNet-18 and ResNet-50 convolution layers. We consider Laplace and Normal distributions to generate synthetic input vectors, as they resemble the distribution of DNN tensors (Park et al., 2018), and uniform distributions for the case where the tensor is re-scaled, as suggested for FP16-based training (Micikevicius et al., 2017). In our analysis, we consider 1M samples generated from our three distributions and 5% data samples of the ResNet-18 and ResNet-50 convolution layers. For different IPU precisions, we measure the median of three metrics: the absolute computation error, the absolute relative error (in percentage) compared with FP32 CPU results, and the number of contaminated bits. The number of contaminated bits refers to the number of differing bits between the result of the approximated FP-IP and the FP32 CPU computation.

Figure 3. Left to right: Absolute error, percentage of absolute relative error (ARE), and the number of contaminated bits for different distributions and different accumulators: FP16 (top) and FP32 (bottom). The first two error graphs in each row use a log-scale Y-axis.

Figure 3 includes the error analysis plots for both the FP16 and FP32 accumulator cases. Based on this analysis, we find that both the relative and absolute errors are negligible for 16-bit IPU precision in the FP16 case. Moreover, the median number of contaminated bits is zero (mean = 0.5). For the FP32 accumulator case, both errors drop to negligible levels as the IPU precision grows; however, the minimum median value of the number of contaminated bits starts at 27b IPU precision. We conclude that, in order to maintain FP32 CPU accuracy, FP16 FP-IP operations require at least 16b and 27b IPU precision for accumulating into FP16 and FP32, respectively.

Figure 4. Walk-through example for sp = 5 with (A, B, C, D) as magnitudes and (10, 2, 3, 8) as exponents. The exponents can be written as (0, −8, −7, −2) with respect to max_exp = 10. (a) First cycle: the MC-IPU only executes products A and D, since their right shifts are in P0 = [0, 5). (b) Second cycle: the MC-IPU computes products B and C, as their right shifts are in P1 = [5, 10).

We also evaluate the impact of IPU precision on the Top-1 accuracy of ResNet-18 and ResNet-50 for the ImageNet data set (He et al., 2016). We observe that, when FP16 uses an IPU precision of 12 or more, it maintains the same accuracy (i.e., Top-1 and Top-5) as the FP32 CPU for all batches. An IPU precision of 8 bits also shows no significant difference in the final average accuracy compared to CPU computation. However, we observe accuracy drops of up to 17% for some batches, and accuracy improvements of up to 17% for other batches. We are not sure whether this improvement is just random behavior, or whether lower precisions may have a regularization effect, as suggested by (Courbariaux et al., 2015b). At any rate, and despite these results, 8-bit IPU precision is not enough for all CNN inference, due to the fluctuation in the accuracy of individual batches compared to the FP32 model.

As mentioned in Section 3.1, approximate nibble iteration requires 27-bit addition and alignment to maintain the same accuracy as CPU implementations for FP32 accumulation. As we illustrate in Section 4, the large shifter and adder take a big portion of the area breakdown of an IPU and are an overhead when running in INT mode. In order to maintain both high accuracy and low area overhead, we propose using multiple cycles when a DNN requires large alignment, using a multi-cycle IPU(w) (MC-IPU(w)), where w refers to the adder tree bit width.
Hence, designers can consider a lower MC-IPU precision in cases where the convolution tile is used more often in INT mode than in FP mode.

The MC-IPU relies on Proposition 1: if all the alignments are smaller than the safe precision (sp), the summation is accurate. Otherwise, the MC-IPU performs the following steps to maintain accurate computation. First, it decomposes the products into multiple partitions, such that products whose required shift amounts belong to [k × sp, (k + 1) × sp) are in partition k (P_k). Second, all products in partition k are added in the same cycle, and all other products are masked. Notice that all the products in P_k require at least k × sp shifting. Thus, the MC-IPU decomposes the shift amount into two parts: (1) k × sp, which is applied after the adder tree, and (2) the remaining part, which is applied locally. Since the remaining parts are all smaller than sp, they can be handled by the local shift units without any loss of accuracy (Proposition 1).

Figure 4 illustrates a walk-through example for MC-IPU(14), where sp = 5. In this example, we denote the products in the summation as A, B, C, and D, with exponents 10, 2, 3, and 8, respectively. Thus, the maximum exponent is max_exp = 10. Before the summation, each product should be aligned (w.r.t. max_exp) by a right shift amount of 0, 8, 7, and 2, respectively. The alignment and summation happen in two cycles as follows. In the first cycle, A and D are added after zero- and two-bit right shifts, respectively. Notice that the circuit has extra bitwise AND logic to mask out inputs B and C in this cycle. In the second cycle, B and C are added; they need eight- and seven-bit right shifts, respectively. While the local shifter can only shift up to five bits accurately, we perform the right shift in two steps: locally shifting by (8 − 5) and (7 − 5) bits, followed by a five-bit shift of the adder tree result.

In general, the multi-cycle IPU imposes three new overheads on IPUs: (1) bitwise AND logic per multiplier; (2) updated shifting logic, where the shared shift amount is given to the accumulation logic (extra sh_mnt in Figure 4) in each cycle; and (3) modifications to the EHU. The EHU for the MC-IPU is depicted in Figure 5. It consists of five stages. The first stage receives the activation exponents and weight exponents and adds them together to calculate the product exponents. In the second and third stages, the maximum exponent and its differences from each product exponent are computed. In the fourth stage, the differences that exceed the software precision are masked (see Section 3.1). The first four stages are common to both IPUs and MC-IPUs. However, the last stage is only needed for the MC-IPU and might be invoked multiple times, depending on the required number of cycles for the MC-IPU. This stage keeps a single bit per product to indicate whether that product has been aligned or not (serv_i in Figure 5). For the non-aligned ones, this stage compares the exponent difference with a threshold. The threshold value equals (k + 1) × sp in cycle k (see the code in Figure 5). The EHU finishes for an FP-IP once all products are aligned (i.e., serv_i = 1). Notice that one EHU is shared between multiple MC-IPUs, as it is needed once for all nine nibble iterations.

Figure 5. EHU data path for MC-IPUs.
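The partition-and-mask schedule above can be modeled compactly. This is a behavioral sketch with function names of our own (`partition_products`, `mc_ipu_sum`); the real data path masks products with AND gates, reuses one adder tree per cycle, and truncates the local shifts, whereas the shifts here are exact for clarity.

```python
def partition_products(shifts, sp):
    """Group product indices into partitions P_k: products whose right-shift
    amount lies in [k*sp, (k+1)*sp) are served together in cycle k."""
    parts = {}
    for idx, s in enumerate(shifts):
        parts.setdefault(s // sp, []).append(idx)
    return parts

def mc_ipu_sum(products, shifts, sp):
    """Multi-cycle sum: per cycle, apply the sub-sp remainder locally,
    add the unmasked products, then apply the shared k*sp shift to the
    adder-tree result."""
    total = 0.0
    for k, idxs in sorted(partition_products(shifts, sp).items()):
        cycle_sum = sum(products[i] / (1 << (shifts[i] - k * sp)) for i in idxs)
        total += cycle_sum / (1 << (k * sp))
    return total
```

On the Figure 4 walk-through (shift amounts 0, 8, 7, 2 with sp = 5), products A and D land in P0 and products B and C in P1, so the operation takes two cycles.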
In the previous section, we showed how the MC-IPU can run the FP inner product by decomposing it into nibble iterations and computing each iteration in one or multiple cycles. In a convolution tile that leverages MC-IPUs, the number of cycles per iteration depends on two factors: (1) the precision of the MC-IPU (i.e., the adder tree bit width), and (2) the maximum alignment needed across all the MC-IPUs in the convolution tile. When one MC-IPU in the convolution tile requires a large alignment, it stalls the others.

When architecting such an IPU, the first consideration is the percentage split between INT and FP operations. The second factor, however, can be handled by grouping MC-IPUs into smaller clusters and running them independently. This way, if one MC-IPU requires multiple cycles, it stalls only the MC-IPUs in its own cluster. To run clusters independently, each cluster should have its own local input and output buffers. The output buffer is used to synchronize the results of different clusters before writing them back into the activation banks. Notice that the activation buffer broadcasts inputs to each local input buffer and stops broadcasting if even one of the buffers is full, which stalls the entire tile.

Figure 6. Convolution tile architecture.
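To see why smaller clusters help, consider a toy stall model. The model is entirely ours and far simpler than the cycle-accurate simulator used in Section 4: it assumes ideal buffering, so independent clusters never block each other and the tile finishes with its slowest cluster.

```python
def tile_cycles(iter_cycles, cluster_size):
    """iter_cycles: one list per nibble iteration, giving the alignment
    cycles each MC-IPU needs. IPUs in a cluster run in lockstep, so a
    cluster pays the max over its members in each iteration; with ideal
    buffers the tile time is the slowest cluster's total."""
    n = len(iter_cycles[0])
    clusters = [range(c, min(c + cluster_size, n))
                for c in range(0, n, cluster_size)]
    per_cluster = [sum(max(it[i] for i in cl) for it in iter_cycles)
                   for cl in clusters]
    return max(per_cluster)
```

Since the per-iteration maximum over a sub-cluster can never exceed the maximum over the whole tile, shrinking the cluster size can only reduce (or preserve) the cycle count under this model.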
4 Methodology and Results
In this section, we illustrate the top-level architecture and the experimental setup. Then, we evaluate the hardware overhead and performance impact of our proposed architecture. We also discuss a comparison with related work.
We consider a family of high-level architectures designed around IP-based tiles. IP-based tiles are crucial for energy efficiency, especially when low-precision multipliers are used. An IP-based convolution tile consists of multiple IPUs, and each IPU is assigned to one output feature map (OFM) (i.e., unrolling in the output channel dimension (K)). All IPUs share the same input vectors, which come from different channels of the input feature maps (IFMs) (i.e., unrolling in the input channel dimension (C)). As depicted in Figure 6(a), the data path of a convolution tile consists of the following components: (1) Inner Product Unit: an array of multipliers that feeds an adder tree. The adder tree's result is accumulated into the partial sum accumulator. (2) Weight Bank: contains all the filters for the OFMs that are assigned to the tile. (3) Weight Buffer: contains the subset of filters that are used for the current OFMs. Each multiplier has a fixed number of weights, which is called the depth of the weight buffer. Weight buffers are only needed for weight stationary (WS) (Chen et al., 2016) architectures and are implemented with flip-flops, register files, or small SRAMs. The number of elements per weight buffer determines the output/partial-sum bandwidth requirements. (4) Activation Bank: contains the current activation inputs, partial sums, and output tensors. (5) Activation Buffer: serves as a software cache for the activation bank.

We consider two types of tiles, big and small, based on INT4 multipliers. Both tiles are weight stationary with a weight buffer depth of 9B. The big and small tiles are unrolled (16, 16, 2, 2) and (8, 8, 2, 2) in the (C, K, H_o, W_o) dimensions, respectively. We consider these two tiles because they offer different characteristics while achieving high utilization. The IPUs in the big tile have twice as many multipliers as in the small tile (16 vs. 8). The 16-input IPUs have smaller accumulator overhead but a larger likelihood of multi-cycle alignment compared to 8-input IPUs.
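As a sanity check on the tile sizes, the multiplier counts line up with the quoted 1 TOPS and 4 TOPS figures. The sketch below assumes a (16, 16, 2, 2) big-tile unrolling, four tiles, two operations per MAC, and a clock of roughly 0.5 GHz; the tile count follows the baselines described next, while the clock frequency is purely our back-of-the-envelope assumption.

```python
def tile_multipliers(C, K, Ho, Wo):
    """Number of INT4 multipliers in a tile unrolled (C, K, Ho, Wo)."""
    return C * K * Ho * Wo

def peak_tops(C, K, Ho, Wo, tiles=4, freq_ghz=0.5, ops_per_mac=2):
    """Peak throughput in TOPS; tiles and frequency are our assumptions,
    not figures stated in the text."""
    return tile_multipliers(C, K, Ho, Wo) * tiles * freq_ghz * ops_per_mac / 1e3
```

With these assumptions the small tile configuration lands near 1 TOPS and the big one near 4 TOPS, matching the 4x ratio implied by the two unrollings.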
For comparison, we consider two baselines, Baseline1 and Baseline2, for the small and the big tiles, respectively. Each baseline has four tiles with a 38b-wide adder tree per IPU. Hence, these baselines do not need the MC-IPU (Section 3.2) or IPU clustering (Section 3.3), and they achieve (1 TOPS, 113 GFLOPS) and (4 TOPS, 455 GFLOPS), respectively (an OP is a 4b×4b MAC).

The performance impact of the proposed designs (i.e., MC-IPUs and clustering of the IPUs) depends on the distribution of the inputs. We developed a cycle-accurate simulator that models the number of cycles for each convolution layer. The simulation parameters include the input and weight tensors. The simulator receives the number of tiles, the tile dimensions (e.g., (8, 8, 2, 2) for the small tiles), and the number of clusters per tile. We simulate convolution layers, as our tiles are customized to accelerate them. In addition, we assume ideal behavior for the memory hierarchy, to single out the impact of our designs. In reality, non-CNN layers and system-level overheads can impact the overall result. Moreover, the area and power efficiency improvements might decline due to the limitations of DRAM bandwidth and SRAM capacity (Pedram et al., 2017). Such scenarios are beyond the scope of our analysis.

In the simulation analysis, we use data tensors from ResNet (He et al., 2016) and InceptionV3 (Szegedy et al., 2016). We consider four study cases: (1) the ResNet-18 forward path, (2) the ResNet-50 forward path, (3) the InceptionV3 forward path, and (4) the ResNet-18 backward path of training. In our benchmarks, we consider at least the 16b and 28b software precision (Section 3.1) required for FP16 and FP32 accumulation to incur no accuracy loss.
In order to evaluate the impact of FP overheads, we implemented our designs in SystemVerilog and synthesized them using Synopsys Design Compiler with 7 nm technology libraries (DC). We consider a 25% margin and a 0.71 V voltage for our synthesis process. Figure 7 illustrates the breakdown of area and power for a small and a big tile. We also include a design point without FP support, shown as INT in Figure 7. In addition, we consider one design with a 38-bit adder tree, similar to NVDLA (NVD), for our baseline configuration. We highlight the following points in Figure 7: (1) By just dropping the adder tree precision from 38 to 28 bits, which is the minimum precision to maintain CPU-level accuracy for FP32 accumulation (see Section 3.1), the area and power are reduced by 17% and 15% for the 16-input and 8-input MC-IPU tiles, respectively. (2) We can reduce the adder tree precision even further at the cost of running alignment in multiple cycles. The tile area can be reduced by up to 39% when reducing the adder tree precision to 12 bits. (3) In comparison with an INT-only IPU, MC-IPU(12) can support FP16 with an increase in area.
Figure 7.
Breakdown of (a) area and (b) power for different MC-IPU based tiles. The components are accumulators (FAcc), weight buffers (WBuf), EHUs (ShCNT), multipliers (MULT), local shifters (Shft), and adder trees (AT).
Figure 8. (a) Impact of different precisions on the performance of MC-IPUs. Backward refers to the back-propagated error in ResNet-18. (b) Impact of cluster size on the performance of MC-IPU(16).

FP16 operations with FP16 accumulation: As shown in Section 3.1, no more than 16-bit precision is needed for FP16 accumulation. Therefore, IPUs with a 16b or larger adder tree take exactly one cycle per nibble iteration. However, MC-IPU(12) may require multi-cycle alignment execution, which causes a performance loss. Compared to Baseline1 (Baseline2), when MC-IPU(12)s are used, the performance drops by 47% (50%) on average when no IPU clustering is applied (Section 3.3). If we choose a cluster of size one (i.e., MC-IPUs perform independently), the performance drop is 26% (38%) compared to Baseline1 (Baseline2).
FP16 operations with FP32 accumulation: As mentioned in Section 3.1, FP32 accumulation requires 28-bit IPU precision. Thus, an MC-IPU with less than 28-bit precision might require multiple cycles, causing a performance loss. Figure 8 shows the normalized execution time for different precision values for the forward paths of ResNet-18, ResNet-50, and InceptionV3, as well as the backward path of ResNet-18. We observe that all epochs have almost similar trends, so we only report data for Epoch 11. In this figure, we present two sets of numbers: one for the tiles with 8-input MC-IPUs, normalized to Baseline1, and one for the tiles with 16-input MC-IPUs, normalized to Baseline2. According to Figure 8(a), the execution time can increase dramatically when small adder trees are used and 28-bit IPU precision is required. The increase in latency can be more than × for a 12b adder tree in the case of back-propagation (backprop) computation. Intuitively, increasing the adder bit width reduces the execution time. In addition, since 8-input MC-IPUs sum fewer products, they are less likely to need multiple cycles. Thus, 8-input MC-IPUs (Baseline1) outperform 16-input MC-IPUs (Baseline2). We also observe that backprop computations have a larger dynamic range and more variance in the exponents. To evaluate the effect of clustering, we fix the adder tree bit width to 16 bits and vary the number of MC-IPUs per cluster. Figure 8(b) shows the efficiency of MC-IPU clustering, where the x-axis is the cluster size and the y-axis is the execution time of 8-input (16-input) MC-IPUs normalized to Baseline1 (Baseline2). According to this figure, smaller clusters can significantly reduce the performance degradation due to multi-cycling in the case of forward computation using 8-input MC-IPUs. However, in the 16-input case, there is at least a 12% loss even for a cluster of size 1.
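The multi-cycle behavior above can be approximated with a simple model: products whose exponents lag the maximum by more than the adder tree's headroom cannot be summed in the same pass. The window-based grouping below is our illustrative assumption, not the exact MC-IPU control logic:

```python
def ipu_cycles(exponents, adder_bits, product_bits=8):
    # Headroom: how many bits of alignment shift the adder tree absorbs.
    headroom = adder_bits - product_bits
    max_exp = max(exponents)
    # Group products into exponent windows of (headroom + 1) bits;
    # each distinct window costs one accumulation cycle.
    windows = {(max_exp - e) // (headroom + 1) for e in exponents}
    return len(windows)

exps = [20, 19, 18, 17, 20, 19, 11, 5]   # one 8-input inner product
print(ipu_cycles(exps, adder_bits=12))   # narrow tree -> 3
print(ipu_cycles(exps, adder_bits=28))   # wide tree -> 1
```

With forward-path data, where exponent differences cluster near zero, almost every inner product falls into a single window; backprop's wider exponent spread is what triggers the extra cycles.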
Backward data has more variation and, even for one MC-IPU per cluster, there is at least a 60% increase in execution time. The reason for this behavior can be explained using the histogram of exponent differences of 8-input MC-IPUs for ResNet-18 in the forward and backward paths, illustrated in Figure 9. As shown in this figure, the forward-path exponent differences are clustered around zero and only 1% of them are larger than eight. On the other hand, the products of backward computations have a wider distribution than those of forward computations.

Figure 9. The distribution of exponent differences (Max.exp − exp, or alignment size) of ResNet-18 training computations: (a) forward propagation, (b) back propagation.

Figure 10. Trade-off between (a) area efficiency and (b) power efficiency. Each design point (p, c) represents tiles with p-bit adder tree MC-IPUs and c MC-IPUs per cluster (only labeled for 16-input MC-IPUs). NO-OPT is Baseline2.

Figure 10(a,b) visualizes the power and area efficiency design spaces for the INT and FP modes, respectively. In these figures, we consider the average effective throughput, based on our simulation results, for the FP throughput values. The numbers associated with some of the design points refer to the ordered pair of MC-IPU precision and cluster size. For designs with 8 inputs (16 inputs), approximation can boost the power efficiency of the INT and FP modes by 14% (17%) and improve the area efficiency by 17.8% (20%). The overall improvement is the combination of all the optimizations. The two design points (12,1) and (16,1) are on the power-efficiency Pareto-optimal curve. For example, with our 8-input (16-input) IPU architectures, the design points with one MC-IPU per cluster and 12-bit (16-bit) IPU precision achieve area-efficiency improvements over a typical mixed-precision implementation of 14% (25%) in TFLOPS/mm² and up to 46% (46%) in TOPS/mm², and power-efficiency improvements of up to 63% (40%) in TFLOPS/W and up to 74% (63%) in TOPS/W.

In this paper, we mainly consider INT4 as the common case; however, it is still possible to consider a different precision as the baseline for different targeted quantization schemes, data types, application domains (i.e., edge vs. cloud), and DNNs. Therefore, we evaluate the performance of the proposed approach using four designs with different multiplier precisions. The first design (MC-SER) is based on serial multipliers (i.e., × ), similar to Stripes (Judd et al., 2016), but MC-SER supports FP16 using the proposed optimizations. Note that an FP16 operation requires at least 12 cycles per inner product in the case of a × multiplier. The second design (MC-IPU4) is optimized for INT4 as discussed earlier and is based on × multipliers.
The third design (MC-IPU84) is optimized for INT8 activations and INT4 weights and is based on × multipliers. The fourth design (MC-IPU8) is optimized for INT8 activations and weights and is based on × multipliers. We also compare against other mixed-precision designs, including NVDLA, a typical FP16 implementation, and mixed-precision INT-based designs that do not support FP16. We compare these designs in terms of TOPS/mm² and TOPS/W for different types of operations, as shown in Table 1. The results show that the MC-IPU mitigates the overhead of the local shift units and adder trees when FP16 is required. This overhead becomes relatively more significant as the precision of the multiplier decreases, and the optimization benefit shrinks as the baseline multiplier precision increases. However, designs with a high multiplier baseline (e.g., × ) limit the benefits of low-bit (e.g., INT4) software optimizations.

RELATED WORK
Previous studies on CNN accelerators exploit two major approaches in their ALU/FPU datapaths: MAC-based (Jouppi et al., 2017; Chen et al., 2016; Gao et al., 2017; Lu et al., 2017; Kim et al., 2016; Venkataramani et al., 2017; Yazdanbakhsh et al., 2018) and inner-product-based (Chen et al., 2014; NVD; Eth; Venkatesan et al., 2019; Shao et al., 2019; Liu et al., 2016; Kwon et al., 2018). Most of these approaches use INT-based arithmetic units and rely on quantization to convert DNNs from FP to INT. An INT-based arithmetic unit can also support different bit widths. Multi-precision operands for INT-based architectures have already been addressed with both spatial and temporal decomposition. In the spatial decomposition approach, a large arithmetic unit is decomposed into multiple finer-grain units (Sharma et al., 2018; Camus et al., 2019; Mei et al., 2019; Moons et al., 2017). Since the Pascal architecture, Nvidia GPUs have implemented spatial decomposition via the DP4A and DP2A instructions, where INT32 units are decomposed into 4-input INT8 or 2-input INT16 inner products. This approach differs from ours, as we support FP16 and use inner-product rather than MAC units. On the other hand, the temporal decomposition approach performs sequences of fine-grain operations over time to mimic a
Table 1. TOPS/W and TOPS/mm² for different multiplier (MUL) and adder tree (ADT) precisions. A and W denote the activation and weight precisions. Columns: MC-SER, MC-IPU4, MC-IPU84, MC-IPU8, NVDLA, FP16, INT8, INT4; ADT precisions: 16b, 16b, 20b, 23b, 36b, 36b, 16b, 9b.

coarse-grain operation. Our approach resembles this approach, with 4-bit operations as the finest granularity. Other works that use this approach prefer lower precision (Judd et al., 2016; Lee et al., 2019; Eckert et al., 2018; Sharify et al., 2018). Temporal decomposition has also been used to avoid ineffectual operations by dynamically detecting fine-grain zero operands and discarding the operation (Delmas et al., 2018; Albericio et al., 2017; Sharify et al., 2019). In contrast to ours, these approaches do not support FP16 operands. In addition, we only discuss dense architectures; however, the fine-grain building block can also be used in sparse approaches. We leave this for future work.

The approaches listed above rely on quantization schemes to convert FP32 DNNs to integer-based ones (Krishnamoorthi, 2018; Lee et al., 2018; Nagel et al., 2019; Zhuang et al., 2018; Wang et al., 2018; Choi et al., 2018; Hubara et al., 2017). These schemes are included in DNN software frameworks such as TensorFlow Lite. Recent advancements show that 8-bit post-training quantization (Jacob et al., 2018) and 4-bit retraining-based quantization can achieve almost the same accuracy as FP32 (Jung et al., 2019). However, achieving high accuracy is less trivial for shallow networks with 2D convolution operations (Howard et al., 2017; Sheng et al., 2018). There is also work on achieving high accuracy at lower precision (Zhu et al., 2016; Zhuang et al., 2019; Banner et al., 2019a; Choukroun et al., 2019; Courbariaux et al., 2015a; Zhou et al., 2016; Zhang et al., 2018; Rastegari et al., 2016). A systematic approach to finding the correct precision for each layer has been shown in (Wang et al., 2019; Dong et al., 2019; Cai et al., 2020). Dynamic multi-granularity for tensors is also considered as a way of saving computation (Shen et al., 2020).
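As a concrete (toy) illustration of temporal decomposition at 4-bit granularity, the sketch below multiplies two unsigned 16-bit operands using only 4-bit × 4-bit products accumulated over several passes. Real datapaths also handle signed operands and pipeline the iterations:

```python
def nibbles(x, n=4):
    # Split an unsigned integer into n 4-bit digits, LSB first.
    return [(x >> (4 * i)) & 0xF for i in range(n)]

def mul_temporal(a, b, n=4):
    # Compute a*b for unsigned (n*4)-bit operands using only 4x4-bit
    # multiplies, accumulating shifted partial products over n*n passes.
    acc = 0
    for i, an in enumerate(nibbles(a, n)):
        for j, bn in enumerate(nibbles(b, n)):
            acc += (an * bn) << (4 * (i + j))
    return acc

assert mul_temporal(51234, 60001) == 51234 * 60001
```

The spatial approach computes the same partial products on parallel fine-grain units in one pass instead of iterating in time.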
Several quantization schemes have been proposed for training (Wu et al., 2018b; Banner et al., 2018; Das et al., 2018; De Sa et al., 2018; Park et al., 2018). Recent industrial products support mixed-precision arithmetic, including Intel's Spring Hill (Wechsler et al., 2019), Huawei's DaVinci (Liao et al., 2019), Nvidia's TensorCore (ten, 2017), Google's TPU (Jouppi et al., 2017), and Nvidia's NVDLA (NVD). While most of these architectures use FP16, BFloat16 and TF32 are selected in some products for their larger range (Abadi et al., 2016; tf3). Using the current structure, our approach can support both BFloat16 and TF32 by modifying the EHU to support 8-bit exponents and using only four nibble iterations. Similar to our approach, NVDLA supports FP-IP operations. In contrast, it decomposes an FP16 unit into two INT8 units spatially. Additionally, NVDLA does not allow FP-IP operations with different-type operands. It also does not support INT4, and it provides a large precision (38 bits) for its adder tree, which we demonstrate is inefficient. Our proposed architecture optimizations can also be applied to spatial decomposition and are orthogonal to the decomposition scheme (i.e., temporal, serial, spatial).

There are also proposals to optimize the microarchitecture of FP MACs or IPUs. LMA is a modified FP unit that leverages Kulisch accumulation to improve FMA energy efficiency (Johnson, 2018). An FMA unit with fixed-point accumulation and lazy rounding is proposed in (Brunie, 2017). A 4-input inner product for FP32 is proposed in (Sohn & Swartzlander, 2016). Spatial fusion for FMA is presented in (Zhang et al., 2019). Finally, a mixed-precision FMA that supports INT MAC operations is presented in (Zhang et al., 2020). As opposed to the proposed architecture, most of these efforts either do not support INT-based operations or are optimized for FP operation with a high overhead that hinders the performance of INT operations.

CONCLUSION
In this paper, we explored the design space of inner-product-based convolution tiles and identified the challenges and overheads of supporting floating-point computation. Further, from the software perspective, we investigated the minimum requirements for achieving the targeted accuracy. We proposed novel architectural optimizations that mitigate the floating-point logic overheads in favor of boosting computation per area for INT-based operations. We showed that for an IPU based on low-precision multipliers, the adder and alignment logic overhead due to supporting FP operations is substantial. We conclude that the differences between product exponents are typically smaller than eight bits, allowing the use of smaller shift units in FPUs.

REFERENCES
Synopsys Design Compiler. URL .
Arm Ethos-N77. URL https://developer.arm.com/ip-products/processors/machine-learning/arm-ethos-n/ethos-n77.
Nvidia Deep Learning Accelerator (NVDLA). URL http://nvdla.org/.
AMD Radeon Instinct™ MI50 accelerator. URL .
Nvidia A100 Tensor Core GPU architecture. URL .
NVIDIA Tesla V100 GPU architecture. Technical report, Nvidia, 08 2017.
IEEE standard for floating-point arithmetic.
IEEE Std 754-2019 (Revision of IEEE 754-2008) , pp. 1–84, 2019.Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,M., et al. Tensorflow: Large-scale machine learningon heterogeneous distributed systems. arXiv preprintarXiv:1603.04467 , 2016.Albericio, J., Delm´as, A., Judd, P., Sharify, S., O’Leary, G.,Genov, R., and Moshovos, A. Bit-pragmatic deep neuralnetwork computing. In ,pp. 382–394, 2017.Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scal-able methods for 8-bit training of neural networks. In
Advances in neural information processing systems , pp.5145–5153, 2018.Banner, R., Nahshan, Y., and Soudry, D. Post training4-bit quantization of convolutional networks for rapid-deployment. In Wallach, H., Larochelle, H., Beygelzimer,A., d'Alch´e-Buc, F., Fox, E., and Garnett, R. (eds.),
Ad-vances in Neural Information Processing Systems 32 , pp.7950–7958. Curran Associates, Inc., 2019a.Banner, R., Nahshan, Y., and Soudry, D. Post training4-bit quantization of convolutional networks for rapid-deployment. In
Advances in Neural Information Process-ing Systems , pp. 7950–7958, 2019b. Brunie, N. Modified fused multiply and add for exact lowprecision product accumulation. In , pp. 106–113,2017.Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M. W.,and Keutzer, K. Zeroq: A novel zero shot quantizationframework. In
Proceedings of the IEEE/CVF Conferenceon Computer Vision and Pattern Recognition , pp. 13169–13178, 2020.Cambier, L., Bhiwandiwalla, A., Gong, T., Nekuii, M., Eli-bol, O. H., and Tang, H. Shifted and squeezed 8-bitfloating point format for low-precision training of deepneural networks, 2020.Camus, V., Enz, C., and Verhelst, M. Survey of precision-scalable multiply-accumulate units for neural-networkprocessing. In , pp.57–61, 2019.Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L.,Chen, T., Xu, Z., Sun, N., et al. Dadiannao: A machine-learning supercomputer. In , pp. 609–622. IEEE, 2014.Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial ar-chitecture for energy-efficient dataflow for convolutionalneural networks.
ACM SIGARCH Computer ArchitectureNews , 44(3):367–379, 2016.Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srini-vasan, V., and Gopalakrishnan, K. Pact: Parameterizedclipping activation for quantized neural networks. arXivpreprint arXiv:1805.06085 , 2018.Choukroun, Y., Kravchik, E., Yang, F., and Kisilev, P. Low-bit quantization of neural networks for efficient inference.In , pp. 3009–3018, 2019.Chu, T., Luo, Q., Yang, J., and Huang, X. Mixed-precisionquantized neural network with progressively decreasingbitwidth for image classification and object detection. arXiv preprint arXiv:1912.12656 , 2019.Courbariaux, M., Bengio, Y., and David, J.-P. Binarycon-nect: Training deep neural networks with binary weightsduring propagations. In Cortes, C., Lawrence, N. D.,Lee, D. D., Sugiyama, M., and Garnett, R. (eds.),
Ad-vances in Neural Information Processing Systems 28 , pp.3123–3131. Curran Associates, Inc., 2015a.Courbariaux, M., Bengio, Y., and David, J.-P. Binarycon-nect: Training deep neural networks with binary weightsduring propagations. In
Advances in neural information processing systems, pp. 3123–3131, 2015b.
Das, D., Mellempudi, N., Mudigere, D., Kalamkar, D.,Avancha, S., Banerjee, K., Sridharan, S., Vaidyanathan,K., Kaul, B., Georganas, E., et al. Mixed precision train-ing of convolutional neural networks using integer opera-tions. arXiv preprint arXiv:1802.00930 , 2018.De Sa, C., Leszczynski, M., Zhang, J., Marzoev, A.,Aberger, C. R., Olukotun, K., and R´e, C. High-accuracylow-precision training. arXiv preprint arXiv:1803.03383 ,2018.Delmas, A., Judd, P., Stuart, D. M., Poulos, Z., Mahmoud,M., Sharify, S., Nikolic, M., and Moshovos, A. Bit-tactical: Exploiting ineffectual computations in convo-lutional neural networks: Which, why, and how. arXivpreprint arXiv:1803.03688 , 2018.Deng, L., Li, G., Han, S., Shi, L., and Xie, Y. Model com-pression and hardware acceleration for neural networks:A comprehensive survey.
Proceedings of the IEEE , 108(4):485–532, 2020.Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., andKeutzer, K. Hawq: Hessian aware quantization of neuralnetworks with mixed-precision. In
The IEEE Interna-tional Conference on Computer Vision (ICCV) , October2019.Drumond, M., Tao, L., Jaggi, M., and Falsafi, B. Trainingdnns with hybrid block floating point. In
Advances inNeural Information Processing Systems , pp. 453–463,2018.Dumoulin, V. and Visin, F. A guide to convolution arith-metic for deep learning. arXiv preprint arXiv:1603.07285 ,2016.Eckert, C., Wang, X., Wang, J., Subramaniyan, A., Iyer, R.,Sylvester, D., Blaaauw, D., and Das, R. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In , pp. 383–396. IEEE,2018.Fang, J., Shafiee, A., Abdel-Aziz, H., Thorsley, D., Geor-giadis, G., and Hassoun, J. H. Post-training piecewiselinear quantization for deep neural networks. In
EuropeanConference on Computer Vision , pp. 69–86. Springer,2020.Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis,C. Tetris: Scalable and efficient neural network accel-eration with 3d memory. In
Proceedings of the Twenty-Second International Conference on Architectural Sup-port for Programming Languages and Operating Systems ,pp. 751–764, 2017. Gustafson, J. L. and Yonemoto, I. T. Beating floating point atits own game: Posit arithmetic.
Supercomputing Frontiersand Innovations , 4(2):71–86, 2017.He, K., Zhang, X., Ren, S., and Sun, J. Deep residuallearning for image recognition. In
The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR) ,June 2016.Hill, P., Jain, A., Hill, M., Zamirai, B., Hsu, C., Lauren-zano, M. A., Mahlke, S., Tang, L., and Mars, J. Deftnn:Addressing bottlenecks for dnn execution on gpus viasynapse vector elimination and near-compute data fission.In , pp. 786–799, 2017.Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:Efficient convolutional neural networks for mobile visionapplications.
CoRR , abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861 .Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., andBengio, Y. Quantized neural networks: Training neuralnetworks with low precision weights and activations.
TheJournal of Machine Learning Research , 18(1):6869–6898,2017.Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard,A., Adam, H., and Kalenichenko, D. Quantizationand training of neural networks for efficient integer-arithmetic-only inference. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.
Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., Xie, L., Guo, Z., Yang, Y., Yu, L., et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
Johnson, J. Rethinking floating point for deep learning, 2018.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M.,
Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson,G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter,R., Wang, W., Wilcox, E., and Yoon, D. H. In-datacenterperformance analysis of a tensor processing unit. In , pp. 1–12, 2017.Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., andMoshovos, A. Stripes: Bit-serial deep neural networkcomputing. In , pp.1–12. IEEE, 2016.Jung, S., Son, C., Lee, S., Son, J., Han, J.-J., Kwak, Y.,Hwang, S. J., and Choi, C. Learning to quantize deepnetworks by optimizing quantization intervals with taskloss. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pp. 4350–4359, 2019.Kilgariff, E., Moreton, H., Stam, N., and Bell, B. Nvidiaturing architecture in-depth. 2018.Kim, D., Kung, J., Chai, S., Yalamanchili, S., andMukhopadhyay, S. Neurocube: A programmable dig-ital neuromorphic architecture with high-density 3dmemory.
SIGARCH Comput. Archit. News , 44(3):380–392, June 2016. ISSN 0163-5964. doi: 10.1145/3007787.3001178. URL https://doi.org/10.1145/3007787.3001178 .K¨oster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K.,Constable, W., Elibol, O., Gray, S., Hall, S., Hornof,L., et al. Flexpoint: An adaptive numerical format forefficient training of deep neural networks. In
Advances inneural information processing systems , pp. 1742–1752,2017.Krishnamoorthi, R. Quantizing deep convolutional networksfor efficient inference: A whitepaper. arXiv preprintarXiv:1806.08342 , 2018.Kwon, H., Samajdar, A., and Krishna, T. Maeri: Enablingflexible dataflow mapping over dnn accelerators via re-configurable interconnects.
ACM SIGPLAN Notices , 53(2):461–475, 2018.LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature ,521(7553):436–444, 2015.Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., and Yoo, H.Unpu: An energy-efficient deep neural network accelera-tor with fully variable weight bit precision.
IEEE Journalof Solid-State Circuits , 54(1):173–185, 2019.Lee, J. H., Ha, S., Choi, S., Lee, W.-J., and Lee, S. Quan-tization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488 , 2018. Liao, H., Tu, J., Xia, J., and Zhou, X. Davinci: A scal-able architecture for neural network computing. In , pp. 1–44. IEEEComputer Society, 2019.Liu, S., Du, Z., Tao, J., Han, D., Luo, T., Xie, Y., Chen, Y.,and Chen, T. Cambricon: An instruction set architecturefor neural networks. In ,pp. 393–405. IEEE, 2016.Lu, J., Lu, S., Wang, Z., Fang, C., Lin, J., Wang, Z., andDu, L. Training deep neural networks using posit numbersystem. arXiv preprint arXiv:1909.03831 , 2019.Lu, W., Yan, G., Li, J., Gong, S., Han, Y., and Li, X.Flexflow: A flexible dataflow accelerator architecturefor convolutional neural networks. In , pp. 553–564. IEEE, 2017.Mei, L., Dandekar, M., Rodopoulos, D., Constantin, J.,Debacker, P., Lauwereins, R., and Verhelst, M. Sub-word parallel precision-scalable mac engines for efficientembedded dnn inference. In , pp. 6–10, 2019.Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen,E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O.,Venkatesh, G., et al. Mixed precision training. arXivpreprint arXiv:1710.03740 , 2017.Moons, B., Uytterhoeven, R., Dehaene, W., and Verhelst,M. 14.5 envision: A 0.26-to-10tops/w subword-paralleldynamic-voltage-accuracy-frequency-scalable convolu-tional neural network processor in 28nm fdsoi. In , pp. 246–247, 2017.Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M.Data-free quantization through weight equalization andbias correction. In
Proceedings of the IEEE InternationalConference on Computer Vision , pp. 1325–1334, 2019.Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., andBlankevoort, T. Up or down? adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568 ,2020.Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng,N., Grangier, D., and Auli, M. fairseq: A fast, ex-tensible toolkit for sequence modeling. arXiv preprintarXiv:1904.01038 , 2019.Park, E., Yoo, S., and Vajda, P. Value-aware quantization fortraining and inference of neural networks. In
Proceedings of the European Conference on Computer Vision (ECCV), pp. 580–595, 2018.
Pedram, A., Richardson, S., Galal, S., Kvatinsky, S., andHorowitz, M. Dark memory and accelerator-rich systemoptimization in the dark silicon era. volume 34, pp. 39–50,2017. doi: 10.1109/MDAT.2016.2573586.Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A.Xnor-net: Imagenet classification using binary convo-lutional neural networks. In
European conference oncomputer vision , pp. 525–542. Springer, 2016.Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi,S., and Kepner, J. Survey and benchmarking of machinelearning accelerators. arXiv preprint arXiv:1908.11348 ,2019.Shao, Y. S., Clemons, J., Venkatesan, R., Zimmer, B., Fo-jtik, M., Jiang, N., Keller, B., Klinefelter, A., Pinckney,N., Raina, P., et al. Simba: Scaling deep-learning in-ference with multi-chip-module-based architecture. In
Proceedings of the 52nd Annual IEEE/ACM InternationalSymposium on Microarchitecture , pp. 14–27, 2019.Sharify, S., Lascorz, A. D., Siu, K., Judd, P., and Moshovos,A. Loom: Exploiting weight and activation precisions toaccelerate convolutional neural networks. In ,pp. 1–6. IEEE, 2018.Sharify, S., Lascorz, A. D., Mahmoud, M., Nikolic, M., Siu,K., Stuart, D. M., Poulos, Z., and Moshovos, A. Laconicdeep learning inference acceleration. In
Proceedings ofthe 46th International Symposium on Computer Architec-ture , pp. 304–317, 2019.Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chan-dra, V., and Esmaeilzadeh, H. Bit fusion: Bit-level dy-namically composable architecture for accelerating deepneural network. In ,pp. 764–775, 2018.Shen, J., Fu, Y., Wang, Y., Xu, P., Wang, Z., and Lin, Y.Fractional skipping: Towards finer-grained dynamic cnninference. arXiv preprint arXiv:2001.00705 , 2020.Sheng, T., Feng, C., Zhuo, S., Zhang, X., Shen, L., andAleksic, M. A quantization-friendly separable convolu-tion for mobilenets. In , pp. 14–18, 2018.Sohn, J. and Swartzlander, E. E. A fused floating-point four-term dot product unit.
IEEE Transactions on Circuits andSystems I: Regular Papers , 63(3):370–378, 2016.Song, Z., Fu, B., Wu, F., Jiang, Z., Jiang, L., Jing, N., andLiang, X. Drq: Dynamic region-based quantization fordeep neural network acceleration. In , pp. 1010–1021. IEEE, 2020.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,Z. Rethinking the inception architecture for computervision. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , June 2016.Tambe, T., Yang, E.-Y., Wan, Z., Deng, Y., Reddi, V. J.,Rush, A., Brooks, D., and Wei, G.-Y. Adaptivfloat: Afloating-point based data type for resilient deep learninginference. arXiv preprint arXiv:1909.13271 , 2019.Venkataramani, S., Ranjan, A., Banerjee, S., Das, D., Avan-cha, S., Jagannathan, A., Durg, A., Nagaraj, D., Kaul,B., Dubey, P., and Raghunathan, A. Scaledeep: A scal-able compute architecture for learning and evaluatingdeep networks. In
Proceedings of the 44th Annual In-ternational Symposium on Computer Architecture , ISCA’17, pp. 13–26, New York, NY, USA, 2017. Associa-tion for Computing Machinery. ISBN 9781450348928.doi: 10.1145/3079856.3080244. URL https://doi.org/10.1145/3079856.3080244 .Venkatesan, R., Shao, Y. S., Zimmer, B., Clemons, J., Fojtik,M., Jiang, N., Keller, B., Klinefelter, A., Pinckney, N.,Raina, P., Tell, S. G., Zhang, Y., Dally, W. J., Emer, J. S.,Gray, C. T., Keckler, S. W., and Khailany, B. A 0.11pj/op, 0.32-128 tops, scalable multi-chip-module-baseddeep neural network accelerator designed with a high-productivity vlsi methodology. In , pp. 1–24, 2019.Venkatesh, G., Nurvitadhi, E., and Marr, D. Accelerat-ing deep convolutional networks using low-precision andsparsity. In , pp.2861–2865. IEEE, 2017.Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. Haq:Hardware-aware automated quantization with mixed pre-cision. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , June 2019.Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., and Cheng,J. Two-step quantization for low-bit neural networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4376–4384, 2018.
Wechsler, O., Behar, M., and Daga, B. Spring Hill (NNP-I 1000) Intel's data center inference chip. In, pp. 1–12, 2019.
Wu, B., Wang, Y., Zhang, P., Tian, Y., Vajda, P., and Keutzer, K. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018a.
Wu, S., Li, G., Chen, F., and Shi, L. Training and inferencewith integers in deep neural networks. arXiv preprintarXiv:1802.04680 , 2018b.Yazdanbakhsh, A., Samadi, K., Kim, N. S., and Es-maeilzadeh, H. Ganax: A unified mimd-simd accelerationfor generative adversarial networks. In , pp. 650–661, 2018.Zhang, D., Yang, J., Ye, D., and Hua, G. Lq-nets: Learnedquantization for highly accurate and compact deep neuralnetworks. In
Proceedings of the European conference oncomputer vision (ECCV) , pp. 365–382, 2018.Zhang, H., Chen, D., and Ko, S. Efficient multiple-precisionfloating-point fused multiply-add with mixed-precisionsupport.
IEEE Transactions on Computers , 68(7):1035–1048, 2019.Zhang, H., Chen, D., and Ko, S. New flexible multiple-precision multiply-accumulate unit for deep neural net-work training and inference.
IEEE Transactions on Com-puters , 69(1):26–38, 2020.Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou,Y. Dorefa-net: Training low bitwidth convolutionalneural networks with low bitwidth gradients.
CoRR ,abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160 .Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternaryquantization.
CoRR , abs/1612.01064, 2016. URL http://arxiv.org/abs/1612.01064 .Zhuang, B., Shen, C., Tan, M., Liu, L., and Reid, I. Towardseffective low-bitwidth convolutional neural networks. In
Proceedings of the IEEE conference on computer visionand pattern recognition , pp. 7920–7928, 2018.Zhuang, B., Shen, C., Tan, M., Liu, L., and Reid, I. Struc-tured binary neural networks for accurate image classifica-tion and semantic segmentation. In
The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR) ,June 2019.Zhuang, B., Liu, L., Tan, M., Shen, C., and Reid, I. Trainingquantized neural networks with a full-precision auxiliarymodule. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1488–1497, 2020.
A BACKGROUND
A.1 Convolution Layer Operation
A typical Convolution Layer (CL) operates on two 4D tensors as inputs, an Input Feature Map (IFM) tensor and a Kernel tensor, and produces a 4D Output Feature Map (OFM) tensor. The elements of IFMs and OFMs are called pixels or activations, while the elements of the Kernel are known as weights. Figure 11 shows simplified pseudocode for a CL. The height and width of an OFM are typically determined by the height and width of the IFMs, the padding, and the strides. The three innermost loops (Lines 5-7) compute one output pixel, and they can be realized as one or multiple inner-product operations. The other four loops are independent, hence they can be implemented to boost parallelism. More details are presented in (Dumoulin & Visin, 2016). A fully connected layer can be considered a special case of convolution where the height and width of the IFM, OFM, and Kernel are all equal to 1. Fully connected layers are used frequently in natural language processing and in the final layers of Convolutional Neural Networks (CNNs). Figure 11.
Pseudocode of a convolution layer.
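As a concrete rendering of the loop nest described above, a minimal Python sketch of a convolution layer follows. The tensor layouts ([N][C][H][W] for the IFM, [K][C][R][S] for the kernel), the name conv_layer, and the omission of padding are assumptions for illustration, not taken from the figure:

```python
def conv_layer(ifm, kernel, stride=1):
    """Sketch of a convolution layer over nested lists.
    ifm: [N][C][H][W], kernel: [K][C][R][S] -> ofm: [N][K][OH][OW].
    Padding is omitted for brevity."""
    N, C, H, W = len(ifm), len(ifm[0]), len(ifm[0][0]), len(ifm[0][0][0])
    K, R, S = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    OH = (H - R) // stride + 1
    OW = (W - S) // stride + 1
    ofm = [[[[0.0] * OW for _ in range(OH)] for _ in range(K)]
           for _ in range(N)]
    for n in range(N):                          # loop 1: batch
        for k in range(K):                      # loop 2: output channels
            for oh in range(OH):                # loop 3: output height
                for ow in range(OW):            # loop 4: output width
                    acc = 0.0
                    for c in range(C):          # loop 5: input channels
                        for r in range(R):      # loop 6: kernel height
                            for s in range(S):  # loop 7: kernel width
                                acc += (ifm[n][c][oh * stride + r]
                                               [ow * stride + s]
                                        * kernel[k][c][r][s])
                    ofm[n][k][oh][ow] = acc     # loops 5-7: one inner product
    return ofm
```

The three innermost loops accumulate one inner product per output pixel, which is exactly the part that maps onto the inner product units discussed in the paper.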
A.2 Floating-Point Representation
Typical floating-point (FP) numbers follow an IEEE standard for representing real numbers (IEEE 754, 2019). DNNs take advantage of floating point arithmetic for training and for highly accurate inference tasks. In general, an FP number is represented with three parts (sign, exponent, and mantissa), which have (1, 5, 10), (1, 8, 23), (1, 8, 7), and (1, 8, 10) bits for FP16, FP32, Google's BFloat16 (Abadi et al., 2016), and Nvidia's TensorFloat32 (TF32) (tf3), respectively.

For IEEE standard FP, the (sign, exponent, mantissa) parts are used to decode five types of FP numbers, as shown in Table 2. We define the magnitude as 0.mantissa for subnormal numbers and 1.mantissa for normal numbers; we call it the signed magnitude when the sign is included.

Table 2. Different types of FP numbers (exp ≠ 0...0 and exp ≠ 1...1 denote an exponent field that is not all zeros and not all ones, respectively). bias = 15 (127) for FP16 (FP32).

type      | (sgn, exp, man)                     | value
zero      | (sgn, 0...0, 0...0)                 | ± zero
INF       | (sgn, 1...1, 0...0)                 | ± infinity
NaN       | (sgn, 1...1, man ≠ 0)               | Not-a-Number
normal    | (sgn, exp ≠ 0...0 and ≠ 1...1, man) | (−1)^sgn × 2^(exp − bias) × 1.man
subnormal | (sgn, 0...0, man ≠ 0)               | (−1)^sgn × 2^(−bias + 1) × 0.man

For deep learning applications, the inner product operation can be realized in two ways: (1) by iteratively using fused multiply-add (FMA) units, i.e., performing A × B + C, or (2) by computing multiple products in parallel. In the latter case, the inputs are two vectors ⟨a_0, ..., a_{n−1}⟩ and ⟨b_0, ..., b_{n−1}⟩, and the operation results in one scalar output. In order to keep the most significant part of the result and guarantee an absolute bound on the computation error, the products are summed after aligning them all relative to the product with the maximum exponent. Figure 12 shows the required steps, assuming there is neither INF nor NaN in the inputs. The result has two parts: an exponent, which equals the maximum exponent of the products, and a signed magnitude, which is the sum of the aligned products.

The range of the exponent for FP16 numbers is [−14, 15]; hence, the range of the exponent for the product of two FP16 numbers is [−28, 30]. The product also has up to 22 bits of mantissa before normalization.
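The five decoding cases of Table 2 can be written out directly. The sketch below assumes the FP16 field widths (1, 5, 10) with bias = 15; the function name is ours, and the result can be cross-checked against Python's built-in half-precision codec (struct format code 'e'):

```python
def decode_fp16(bits):
    """Decode a 16-bit pattern into a float, following the five cases of
    Table 2 for FP16: 1 sign, 5 exponent, 10 mantissa bits, bias = 15."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exp = (bits >> 10) & 0x1F
    man = bits & 0x3FF
    if exp == 0x1F:                    # exponent all ones: INF or NaN
        return float("nan") if man else sign * float("inf")
    if exp == 0:                       # zero or subnormal: 0.man, 2^(-bias+1)
        return sign * 2.0 ** (1 - 15) * (man / 1024.0)
    # normal: hidden leading one, 1.man, 2^(exp - bias)
    return sign * 2.0 ** (exp - 15) * (1.0 + man / 1024.0)
```

For example, decode_fp16(0x3C00) recovers 1.0, and for any pattern the result should match struct.unpack('<e', bits.to_bytes(2, 'little'))[0].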
This means that the accurate summation of such numbers requires 80-bit wide adders (58 + 22 = 80). However, smaller adders might be enough, depending on the accuracy of the accumulators. For example, FP32 accumulators may keep only 24 bits of the result's signed magnitude. Therefore, it is highly unlikely that the least significant bits in the 80-bit addition contribute to the 24-bit magnitude of the accumulator, and an approximate version of this operation would be sufficient. We discuss the level of approximation in Section 3.1.

B HYBRID DNNS AND CUSTOMIZED FP

The temporal INT4-based decomposition allows the proposed architecture to support different data types and precisions.
Figure 12.
Pseudocode for the FP-IP operation (FP16). In a hardware realization, the loops would be parallel. Note that exp(x) = x's exponent − bias + 1 for subnormal numbers, but we omit it for simplicity.
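The align-and-add steps that Figure 12 describes can be modeled in a few lines. This is a behavioral Python sketch (the function names and the fixed-point bookkeeping are our assumptions): each 22-bit product is aligned to the maximum product exponent by a right shift, so low-order bits are dropped, which corresponds to the approximate scheme discussed above:

```python
def fp16_parts(bits):
    """Split an FP16 pattern into (sign, exponent field, 11-bit significand):
    1.man for normal inputs, 0.man (with exponent -bias + 1) for subnormal."""
    sign = (bits >> 15) & 0x1
    exp = (bits >> 10) & 0x1F
    man = bits & 0x3FF
    if exp == 0:
        return sign, 1, man            # subnormal: effective exponent -bias + 1
    return sign, exp, man | 0x400      # normal: restore the hidden leading 1

def fp_inner_product(a_bits, b_bits):
    """FP16 inner product via align-and-add (inputs free of INF/NaN).
    Returns (e_max, acc); the value is acc * 2**(e_max - 20), since each
    22-bit product magnitude carries 20 fractional bits."""
    prods = []
    for a, b in zip(a_bits, b_bits):
        sa, ea, ma = fp16_parts(a)
        sb, eb, mb = fp16_parts(b)
        # product exponent in [-28, 30] (bias = 15); magnitude up to 22 bits
        prods.append((sa ^ sb, (ea - 15) + (eb - 15), ma * mb))
    e_max = max(e for _, e, _ in prods)
    acc = 0
    for sign, e, m in prods:
        aligned = m >> (e_max - e)     # align to e_max; drops low-order bits
        acc += -aligned if sign else aligned
    return e_max, acc
```

For example, fp_inner_product([0x3C00, 0x3800], [0x3C00, 0x3C00]) models 1.0 × 1.0 + 0.5 × 1.0, and acc * 2**(e_max - 20) evaluates to 1.5. An exact version would instead keep all bits after alignment, requiring the 80-bit-wide addition discussed in the text.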