A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference
Gianmarco Ottavi, Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi
Gianmarco Ottavi†, Angelo Garofalo†, Giuseppe Tagliavini†, Francesco Conti†∗, Luca Benini†∗ and Davide Rossi†
†DEI, University of Bologna, Italy   ∗IIS lab, ETH Zurich, Switzerland
{gianmarco.ottavi2, davide.rossi, angelo.garofalo, giuseppe.tagliavini}@unibo.it   {fconti, lbenini}@iis.ee.ethz.ch

Abstract—Low bit-width Quantized Neural Networks (QNNs) enable deployment of complex machine learning models on constrained devices such as microcontrollers (MCUs) by reducing their memory footprint. Fine-grained asymmetric quantization (i.e., different bit-widths assigned to weights and activations on a tensor-by-tensor basis) is a particularly interesting scheme to maximize accuracy under a tight memory constraint [1]. However, the lack of sub-byte instruction set architecture (ISA) support in SoA microprocessors makes it hard to fully exploit this extreme quantization paradigm in embedded MCUs. Support for sub-byte and asymmetric QNNs would require many precision formats and an exorbitant amount of opcode space. In this work, we attack this problem with status-based SIMD instructions: rather than encoding precision explicitly, each operand's precision is set dynamically in a core status register. We propose a novel RISC-V ISA core, MPIC (Mixed Precision Inference Core), based on the open-source RI5CY core. Our approach enables full support for mixed-precision QNN inference with different combinations of operands at 16-, 8-, 4- and 2-bit precision, without adding any extra opcode or increasing the complexity of the decode stage. Our results show that MPIC improves both performance and energy efficiency by a factor of 1.1–4.9× when compared to software-based mixed-precision on RI5CY; with respect to commercially available Cortex-M4 and M7 microcontrollers, it delivers 3.6–11.7× better performance and 41–155× higher efficiency.

Index Terms—PULP Platform, Embedded Systems, Deep Neural Networks, Mixed-precision, Microcontroller
I. INTRODUCTION
Running complex applications on embedded systems like microcontrollers (MCUs) requires optimization of both software and hardware due to severe constraints in terms of memory size, power consumption, and computing power. In an Internet-of-Things (IoT) environment, wireless communication to higher-level nodes often dominates the power budget. Algorithms such as Deep Neural Networks (DNNs), more specifically Convolutional Neural Networks (CNNs), which are state-of-the-art for computer vision and speech recognition, are used in computing at the edge of the IoT to reduce the amount of data to transmit by communicating only classes or high-level features instead of the raw sensor data. The complexity of these algorithms typically requires millions of Multiply-Accumulate (MAC) operations and a significant memory footprint, where memory is a valuable resource due to its cost in terms of area and power.

An effective way to reduce the memory footprint of DNNs is quantization, a technique that reduces inputs and weights to fixed-point formats such as 8 bits, and even sub-byte formats such as 4 and 2 bits [1]–[3]. Banner et al. proposed a methodology to quantize both weights and activations to 4 bits with an accuracy drop of only a few percent, without modifying the training and without requiring a full dataset. Rusci et al. [1] show how, using mixed-precision quantization for each layer, it is possible to reduce the memory footprint of DNNs by up to 7×, incurring only a 4% accuracy loss. However, although quantization provides a clear reduction of memory bandwidth, visible also in general-purpose processors [4], much of the inference-time benefit is accessible only through customized hardware accelerators [5] or with an FPGA implementation of quantized arithmetic units [6]. To the best of the authors' knowledge, the only recent work taking advantage of quantized formats in software processors is the one presented by Anderson et al. [7]. It proposes a software technique exploiting arbitrary bit-precise signed and unsigned integer operations, embedding a vector architecture with custom bit-width lanes in fixed-width scalar arithmetic [7]. However, this comes with significant effort in application porting.

From the hardware perspective, the only relevant research work in this field is the reconfigurable Parallel Balanced-Bit-Serial (PBBS) vector processing tile presented by Wu et al. [8], which is suitable for improving the efficiency of sub-byte single instruction multiple data (SIMD) computations of heavily leakage-dominated ULP designs. However, code serialization significantly degrades performance and efficiency in near- and super-threshold operating points. On the other hand, all commercial MCUs operate at the finest granularity of 1-byte data [9], [10]. The new ARM [11] ISA specialized for machine learning, implemented by the Cortex M55 processor, enhances ARMv8 with extensions similar to the ones presented in [12], such as 8-bit SIMD instructions, loops, and conditional-execution extensions. In addition, it provides pipelined execution of load and MAC instructions [11] that allows maximizing utilization of the MAC units during the execution of regular patterns (e.g., convolutions).

However, similarly to all other commercial cores, the ARM Cortex M55 does not natively support SIMD instructions smaller than 8 bits. Hence, data have to be presented as bytes for computation, even if they are "packed" in a more compact representation. First, this means that there is no way to exploit the additional parallelism, because the datapath is hardwired to 8 bits. Second, in the tight inner loops of the quantized DNN kernels, the cost of unpacking and packing data can be extremely high, leading to substantially worse performance than directly using 8-bit data, as shown in the results. In our experiments, sub-byte and mixed-precision quantization by itself improves only the implementation feasibility of a network on MCUs (in terms of squeezing the network memory footprint), but not its performance and efficiency in computation [13].

Supporting many different precision formats to avoid data unpacking can be challenging in a general-purpose MCU, because it leads to a proliferation in the number of instructions, saturating the encoding space. Variable-length instructions offer a potential solution to this problem, but only at the cost of code bloat and increased complexity of the decoding stage, which would result in a significant penalty in the power consumed by the MCU [14]. In this work, we propose a lightweight processor specialization for quantized DNNs leveraging a status-based approach to counter the proliferation of SIMD instructions necessary to support mixed-precision computations. Such instructions do not encode precision explicitly; rather, they encode "virtual" SIMD instructions, which contain no precision information: the precision is specialized at run-time by setting the precision of the operands in a core status register. In this way, the same virtual instructions can encode a range of operand precisions, enabling much higher code efficiency.

The main contributions of this paper are the following: first, we introduce XMPI, a RISC-V ISA extension introducing mixed-precision and heavily quantized SIMD instructions to boost the execution of Quantized Neural Network workloads from 16 down to 2 bits; moreover, we extend the functionality of the RI5CY core [12], a state-of-the-art open-source RISC-V core, to support status-based operation. We call this new core MPIC (Mixed Precision Inference Core). We then integrated XMPI and added new execution-stage functional units to operate at the granularity of 2 and 4 bits.

To validate our design, we deployed the new core into PULPissimo [15], a single-core open-source MCU of the PULP family [16]. We implemented the full layout in a commercially available 22 nm FDX technology from GlobalFoundries to evaluate overheads in terms of power, area, and frequency with respect to the baseline RI5CY core. We benchmarked the mixed-precision extended core against RI5CY, ARM Cortex M7, and ARM Cortex M4 cores on a QNN layer with different quantization configurations. The new approach of MPIC avoids the encoding of 200 new instructions while keeping power consumption at the level of the baseline RI5CY core. Our results show that the new ISA brings 1.1–4.9× better performance and energy efficiency when compared to software-based mixed-precision on the RI5CY core; moreover, we also compare with commercially available MCUs based on Cortex-M7 and M4 cores, showing that our solution provides a boost of 3.6–11.7× in performance and 41–155× in energy efficiency.

This work was supported in part by OPRECOMP (Open trans-PREcision COMPuting), Grant Agreement No. 732631, and WiPLASH (Wireless Plasticity for Heterogeneous Massive Computer Architectures), Grant Agreement No. 863337. Both projects are funded by the European Union's Horizon 2020 research and innovation program.

II. BACKGROUND
A. RI5CY Core
RI5CY, used as a baseline for the proposed work, is an open-source core featuring a 4-stage in-order single-issue pipeline based on the RISC-V ISA [10]. It supports the standard RISC-V extensions (I, M, C, and F) but also includes a non-standard extension, called XpulpV2, that introduces several features such as hardware loops, bit-manipulation instructions, load/store post-increment instructions, and SIMD support for the 16- and 8-bit formats (more information can be found in [12]). As later described in Section IV-B, these features provide 4.4× the performance of SoA cores such as the ARM Cortex M7 on 8-bit kernels.

https://pulp-platform.org/

B. Quantized Neural Networks
The QNN layers used for the experimental assessment of our approach adopt layer-wise linear quantization. The quantization process maps each tensor to integer values. We have three categories of tensors to quantize: input feature maps, output feature maps, and weights ($x$, $y$, and $w$, respectively). A generic real-valued tensor $t$ that lies inside the range $[\alpha_t, \beta_t)$ can be expressed as:

$$t = \alpha_t + \varepsilon_t \cdot \mathrm{INT}(t) \quad (1)$$

where $\mathrm{INT}(t)$ is the value of $t$ mapped to an $N$-bit integer, $\varepsilon_t = (\beta_t - \alpha_t)/(2^N - 1)$, and $\alpha_t$ is the bias that shifts the value back to its original range. Imposing $\alpha_t = 0$ for both input and output feature maps (but not weights) gives a QNN that can be trained efficiently by means of linear quantization-aware training [17]. It is possible to work directly on quantized integer values and apply convolution, normalization, and activation:

$$\mathrm{INT}(y) = \mathrm{quant}\big(\mathrm{conv}(\mathrm{INT}(w), \mathrm{INT}(x))\big) \quad (2)$$

The result of the convolution $\phi = \mathrm{conv}(\mathrm{INT}(w), \mathrm{INT}(x))$ is still an integer tensor but has to be represented with a larger number of bits than its inputs ($\varepsilon_\phi$ is smaller than both $\varepsilon_x$ and $\varepsilon_w$). The function $\mathrm{quant}(\cdot)$ first applies batch normalization (if any) to $\phi$ and then scales the result to the proper number of output bits.

Mixed-precision QNNs do not impose the same number of bits for activations and weights, opening the possibility of representing more sensitive layers and/or tensors at higher precision while strongly quantizing the rest [1].
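To make the mapping of Eq. (1) concrete, the following C sketch implements a quantize/dequantize pair for an N-bit code, assuming $\varepsilon_t = (\beta_t - \alpha_t)/(2^N - 1)$ as above; the struct and function names are illustrative and not part of the PULP-NN API.

```c
#include <stdint.h>
#include <math.h>

/* Illustrative model of the layer-wise linear quantization of Eq. (1):
 * t = alpha_t + eps_t * INT(t). Hypothetical names, not PULP-NN code. */
typedef struct {
    float   alpha;    /* lower bound of the tensor range [alpha, beta) */
    float   eps;      /* quantization step eps_t */
    int32_t max_code; /* 2^N - 1 for an N-bit integer code */
} quant_t;

static quant_t quant_init(float alpha, float beta, int n_bits) {
    quant_t q;
    q.max_code = (1 << n_bits) - 1;
    q.alpha    = alpha;
    q.eps      = (beta - alpha) / (float)q.max_code;
    return q;
}

/* INT(t): map a real value onto its N-bit integer code. */
static int32_t quantize(const quant_t *q, float t) {
    int32_t code = (int32_t)lroundf((t - q->alpha) / q->eps);
    if (code < 0) code = 0;                     /* clamp into [0, 2^N - 1] */
    if (code > q->max_code) code = q->max_code;
    return code;
}

/* Inverse of Eq. (1): recover the real value from its code. */
static float dequantize(const quant_t *q, int32_t code) {
    return q->alpha + q->eps * (float)code;
}
```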
C. QNN Execution Model

The execution model for QNNs adopted in this work is based on the PULP-NN library [13], a library composed of QNN kernels optimized to run on PULP systems. These libraries are inspired by ARM CMSIS-NN [18], but they include additional support for sub-byte quantization with INT-4, -2, and -1 integer types. For efficient execution on PULP cores, convolutional layer inference is split into three phases. The im2col phase takes the 3D input features and maps them into a 1D vector. The MatMul phase organizes the innermost dot products of the convolution operator as a set of 4×2 sum-of-dot-product operations. The final phase (QntPack) discretizes the 32-bit outputs of the MatMul to their target precision and packs them into 32-bit variables. For this purpose, different discretization techniques are employed depending on the case: the 8-bit output uses scaling and clamping [18], while the 4- and 2-bit configurations use thresholds [1], [3]. This operation compares the result of the matrix multiplication with a set of thresholds computed at training time, which directly implement the quant function of Eq. (2).
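As a concrete illustration of the threshold-based QntPack phase, the sketch below discretizes four 32-bit MatMul accumulators to 2-bit codes and packs them into one byte; the linear threshold scan and all names are assumptions for illustration, not the actual PULP-NN implementation.

```c
#include <stdint.h>

/* Illustrative QntPack for 2-bit outputs: each 32-bit accumulator is
 * binned against 3 thresholds computed at training time (implementing
 * quant() of Eq. (2)), and four 2-bit codes are packed into one byte. */
static uint8_t qntpack_2bit(const int32_t acc[4], const int32_t thr[3]) {
    uint8_t packed = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t code = 0;
        while (code < 3 && acc[i] >= thr[code]) /* find the bin of acc[i] */
            code++;
        packed |= (uint8_t)(code << (2 * i));   /* pack 4 codes per byte */
    }
    return packed;
}
```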
III. ISA EXTENSION

A. Computational Model

The XpulpV2 extension of the RI5CY core ISA supports 16- and 8-bit SIMD operations.
Supporting formats from 16- down to 2-bit, and all the permutations of mixed-precision operations, would require 10 different encodings per supported MAC-SIMD instruction, for a total of 292 instructions versus the 92 of the baseline core. Encoding these formats would require 4 bits, while only one bit is available for this purpose in the current RISC-V encoding. In the MPIC core, we eliminate the problem of encoding space by using virtual instructions. As depicted in Figure 1, for regular scalar instructions the decoder directly produces control signals towards the ex_stage (e.g., in the case of a MAC with 32-bit operands); virtual SIMD instructions require additional information from the status registers (CSR) to be specialized (e.g., in the case of an 8-bit by 4-bit sum-of-dot-product).

Fig. 1. Control signals for SIMD and scalar instructions. The SIMD instruction is a sum-of-dot-product and the format is a mixed-precision 8x4. The bottom part of the figure contains the encoding of the formats contained inside the CSR.

Application code requires explicit modifications to use virtual SIMD instructions. In Figure 2 a), we have an example of a QNN with multiple layers using different precisions. Note how, before the function calls, we set the precision with the SIMD_FMT macro, which writes the appropriate format encodings into the CSR. If we "zoom" inside the functions and get to the inner loop of the mixed-precision kernel, we can see how supporting this new format directly in hardware benefits the computation. In Figure 2 b), we show the normal instruction flow using RI5CY: first, we load the activations and weights from memory; then, we unpack four of the eight operands in the 32-bit register containing the current weights. Once unpacked, they have to be packed again into 8-bit operands to take advantage of RI5CY's 8-bit vector MAC instruction. On the other hand, MPIC only requires loading the operands and executing the vectorial MAC (Figure 2 c)). Thus, it saves two-thirds of the instructions in the inner loop when running on data smaller than a byte (or mixed-precision).

Fig. 2. a) MPIC function calls changing precision before executing operations; b) RI5CY inner loop with data packing/unpacking overhead; c) MPIC inner loop with MAC instructions executed directly.

  a)  SIMD_FMT(M8x4)
      convolution(A, W, Res);
      // ...
      SIMD_FMT(M8x2)
      convolution(A, W, Res);

  b)  p.lw x10,4(x4!)
      p.lw x11,4(x5!)
      p.extract x5,x11,4,0
      p.extract x6,x11,4,4
      p.extract x7,x11,4,8
      p.extract x8,x11,4,12
      pv.packlo.b x15,x5,x6
      pv.packhi.b x15,x7,x8
      pv.sdotsp.b x20,x15,x10

  c)  p.lw x10,4(x4!)
      p.lw x11,4(x5!)
      pv.sdotsp.b x20,x15,x10

In Figure 3, we illustrate how a matrix multiplication kernel works in the case of mixed-precision operands. Having the same instructions deal with both mixed-precision and uniform-precision operations requires added logic for the management of the input with smaller operands; by construction, we always map it to input B. As shown in Figure 3, input B can remain stationary for multiple MAC instructions before new data needs to be fetched. In the 8x2-bit example of the figure, choosing the correct group of 4 operands out of the 16 is crucial for the correctness of the result. To this end, we designed a controller to deal with this problem, which is explained in more detail in Section III-C.
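For illustration, the SIMD_FMT mechanism of Figure 2 a) can be sketched as a single CSR write; the CSR address (0x7C1 here) and the format encodings are hypothetical placeholders, not the actual MPIC definitions.

```c
#include <stdint.h>

/* Hypothetical CSR address and format encodings, for illustration only. */
#define CSR_SIMD_FMT "0x7C1"
#define FMT_M8x4 0x6u   /* 8-bit activations x 4-bit weights */
#define FMT_M8x2 0x7u   /* 8-bit activations x 2-bit weights */

/* Program the operand-precision CSR; every following virtual SIMD
 * instruction is specialized to this format by the hardware. */
#define SIMD_FMT(fmt) \
    asm volatile ("csrw " CSR_SIMD_FMT ", %0" :: "r"(fmt))

extern void convolution(const uint8_t *A, const uint8_t *W, int32_t *Res);

void run_two_layers(const uint8_t *A, const uint8_t *W, int32_t *Res) {
    SIMD_FMT(FMT_M8x4);   /* first layer runs 8x4 MACs */
    convolution(A, W, Res);
    SIMD_FMT(FMT_M8x2);   /* same binary code, now specialized to 8x2 */
    convolution(A, W, Res);
}
```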
B. Virtual Instructions

Table I lists the instructions in the XMPI extension. The instructions are derived from a subset of XpulpV2, where they are available in 16- and 8-bit precision only. We extended them with additional support for symmetric 4- and 2-bit precision and, for the dot-product instructions, also with mixed-precision support. Different versions of these instructions exist: the sc variant is an operation between a scalar and a vector; the i variant uses the value from the immediate field instead of a register; finally, we support signed and unsigned variants (dotp instructions also have a hybrid unsigned-signed variant).

TABLE I. LIST OF INSTRUCTIONS EXTENDED BY XMPI

Instruction             Description
ALU SIMD Instr.
  pv.add[.sc(i)]        rD[i] = rs1[i] + rs2[i]
  pv.sub[.sc(i)]        rD[i] = rs1[i] - rs2[i]
  pv.avg(u)[.sc(i)]     rD[i] = (rs1[i] + rs2[i]) >> 1
Vector Comparison Instr.
  pv.max(u)[.sc(i)]     rD[i] = rs1[i] > rs2[i] ? rs1[i] : rs2[i]
  pv.min(u)[.sc(i)]     rD[i] = rs1[i] < rs2[i] ? rs1[i] : rs2[i]
Vector Shift Instr.
  pv.srl[.sc(i)]        rD[i] = rs1[i] >> rs2[i] (logical shift)
  pv.sra[.sc(i)]        rD[i] = rs1[i] >> rs2[i] (arithmetic shift)
  pv.sll[.sc(i)]        rD[i] = rs1[i] << rs2[i]
Vector ABS Instr.
  pv.abs                rD[i] = rs1[i] < 0 ? -rs1[i] : rs1[i]
Dot Product Instr.
  pv.dotup[.sc(i)]      rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7]
  pv.dotusp[.sc(i)]     rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7]
  pv.dotsp[.sc(i)]      rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7]
  pv.sdotup[.sc(i)]     rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7] + rs3
  pv.sdotusp[.sc(i)]    rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7] + rs3
  pv.sdotsp[.sc(i)]     rD = rs1[0]*rs2[0] + ... + rs1[7]*rs2[7] + rs3
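To pin down the semantics of the virtual dot-product instructions in Table I, the following C reference model computes a signed sum-of-dot-product (as in pv.sdotsp) whose element width comes from a format variable rather than from the opcode. It is a behavioral sketch under our own naming, not the RTL.

```c
#include <stdint.h>

/* Sign-extend a `bits`-wide field held in the low bits of v. */
static int32_t sext(uint32_t v, int bits) {
    int s = 32 - bits;
    return ((int32_t)(v << s)) >> s;
}

/* Behavioral model of a virtual sdotsp: the element width (16/8/4/2)
 * is read from the format state, so one "instruction" covers all
 * precisions. rs3 plays the role of the accumulation register. */
static int32_t sdotsp(uint32_t rs1, uint32_t rs2, int32_t rs3, int bits) {
    int lanes = 32 / bits;                 /* 2/4/8/16 lanes */
    uint32_t mask = (1u << bits) - 1;
    int32_t acc = rs3;
    for (int i = 0; i < lanes; i++) {
        acc += sext((rs1 >> (i * bits)) & mask, bits) *
               sext((rs2 >> (i * bits)) & mask, bits);
    }
    return acc;
}
```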
C. Microarchitecture

Figure 4 shows the diagram of the MPIC pipeline, highlighting the IPs modified to implement the new extension. The logic required to decode the format of SIMD instructions has been removed from the decoder and moved to the CSR, which now has a register dedicated to this task (the orange signal feeds the precision to the functional units). Moreover, two additional registers have been added for managing mixed-precision operations.

Fig. 4. Pipeline of the MPIC core.
Dot-product Unit:
The baseline RI5CY core already supported the cumulative dot-product operation for 16- and 8-bit MAC operations; it consists of two distinct sets of multipliers, one for each data size [12]. The intermediate multiplications are fed to an adder tree that sums all the contributions. It accepts two 16-bit or four 8-bit operands packed in one 32-bit register, and an optional third input register used as an accumulation register. To extend its support to 4 and 2 bits, we followed the same principle, adding another set of multipliers and an adder tree for each supported format. This configuration enables the execution of 8 and 16 operations per cycle for 4- and 2-bit, respectively, paying the cost of more area but with no impact on the critical path of the design. On the other hand, we apply a power management policy to the unused SIMD units by means of clock gating of the input registers of the units not involved in the current computation, as shown in Figure 5.

Fig. 3. Matrix multiplication between operands of size 8- and 2-bit. Vector B contains four times the operands of Vector A, requiring the fetch of 3 more vectors to "exhaust" Vector B. In each step, a group of operands is unpacked (in hardware) from Vector B and extended to match the size of Vector A; finally, the dot-product between the vectors is executed and the partial result is added to the accumulator to obtain the final result. On the bottom right is the kernel assembly: p.lw are post-increment loads that increment the pointer by 4 after the load, and pv.sdotsp is a signed sum-of-dot-product.

  Loop_start: p.lw x5, 4(x10)!   (load A)
              p.lw x6, 4(x10)!   (load A+4)
              p.lw x7, 4(x10)!   (load A+8)
              p.lw x8, 4(x10)!   (load A+12)
              p.lw x9, 4(x11)!   (load B)
              pv.sdotsp x15, x5, x9
              pv.sdotsp x15, x6, x9
              pv.sdotsp x15, x7, x9
  Loop_end:   pv.sdotsp x15, x8, x9

  Steps: a) x5 (A) · x9 (B); b) x6 (A+4) · x9 (B); c) x7 (A+8) · x9 (B); d) x8 (A+12) · x9 (B); e) x5 (A+16) · x9 (B+4), with each dot-product accumulated into the result.

For what concerns mixed-precision operation, we have operands of different sizes multiplied together; this implies that one of the input registers contains a higher number of operands than the other, so when we execute a dot-product, one of the input registers is fully utilized while only a part of the second one is needed. This requires two actions: first, we need to select a sub-group of the second input register (we call it input B); second, that sub-group has to be fed to the multiplier of the correct size; e.g., for a MAC whose larger operands are 8-bit, the correct sub-group of operands from input B has to be routed to the 8-bit multipliers. In Figure 5, we depict the whole dot-product module. The slicer and router block selects and directs the correct sub-group of operands to the various dotp multipliers; it also sign-extends the smaller operands to match the size of the larger ones. This block is controlled by two signals: MPC_CNT is used to select the sub-group of operands (discussed later), and SIMD_FMT specifies which type of operands to select (taken directly from the CSR). A mechanism to correctly choose the sub-group of inputs for register B is needed, so we designed a small controller dedicated to this task.

Fig. 5. Extended dot-product block.
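As a behavioral illustration of the slicer and router, the sketch below extracts the MPC_CNT-selected sub-group of four 2-bit operands from input B in an 8x2 configuration and sign-extends each to an 8-bit lane; the names and the 8x2 specialization are illustrative assumptions, not the RTL.

```c
#include <stdint.h>

/* Illustrative model of the "slicer & router" of Fig. 5 for an 8x2-bit
 * mixed-precision MAC: reg_b holds 16 2-bit operands; the sub-group
 * selected by mpc_cnt (0..3) is extracted, sign-extended to 8 bits,
 * and repacked so it can feed the 8-bit dotp multipliers with input A. */
static uint32_t slice_b_8x2(uint32_t reg_b, int mpc_cnt) {
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {                     /* 4 lanes of the 8-bit dotp */
        uint32_t field = (reg_b >> (8 * mpc_cnt + 2 * i)) & 0x3;
        int32_t ext = ((int32_t)(field << 30)) >> 30; /* sign-extend 2 -> 32 bits */
        out |= ((uint32_t)ext & 0xFFu) << (8 * i);    /* place as an 8-bit lane */
    }
    return out;
}
```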
Mixed-Precision Controller:
To behave as shown in Fig. 3, the mixed-precision controller (MPC) contains a counter that selects which sub-group of operands to use. The counter is incremented only if the following requisites are met: the ID_STAGE is decoding, which prevents the counter from advancing in case of stalls; the instruction is a MAC; and the format set by the CSR is mixed-precision. The counter wraps around by itself depending on the current format (for example, 8x4 counts up to 2, while 8x2 counts up to 4). However, to implement the execution model explained in Sec. II-C, a single counter is not enough due to data reuse, which causes each sub-group of operands to be used multiple times before switching to the next one. To work around this problem, we added a second counter that can be programmed with the number of MACs to execute before changing the sub-group of operands (e.g., in the 4x2 kernel we execute 8 MACs before switching to the next sub-group). The value of the sub-group is also written inside the CSR; it can be changed by writing directly to it, making it possible to choose the group of operands manually.
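The two-counter scheme just described can be summarized with a small behavioral model; the field names and calling convention are illustrative, not the RTL.

```c
/* Behavioral sketch of the mixed-precision controller (MPC). */
typedef struct {
    int subgrp;   /* current sub-group of input B (mirrored in the CSR) */
    int ratio;    /* wrap value per format: 2 for 8x4, 4 for 8x2 */
    int reuse;    /* programmable MACs per sub-group, e.g. 8 in 4x2 */
    int mac_cnt;  /* MACs issued with the current sub-group */
} mpc_t;

/* Called once per decoded MAC when the CSR format is mixed-precision
 * and the ID stage is not stalled. */
static void mpc_step(mpc_t *m) {
    if (++m->mac_cnt == m->reuse) {              /* sub-group fully reused */
        m->mac_cnt = 0;
        m->subgrp  = (m->subgrp + 1) % m->ratio; /* wraps by itself */
    }
}
```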
D. Compiler Support
We integrated support for the mixed-precision instruction semantics into the PULP GNU toolchain. The GCC front-end does not require any modification, since programmers use a single integral type (i.e., int) and possibly modify the precision of the operations by setting the status register; this approach is suitable for both homogeneous and mixed-precision operations. At the middle-end level, we disabled automatic loop unrolling for the loops that include mixed-precision instructions, to avoid inconsistencies with the internal counter of the mixed-precision controller. For the same reason, we inhibited the reordering of the mixed-precision instructions in the compiler backend.

https://github.com/pulp-platform/pulp-riscv-gnu-toolchain

TABLE II. IMPLEMENTATION RESULTS IN AREA AND POWER. Power consumption [mW] @ 250 MHz and area [µm²] for RI5CY and MPIC with relative overheads (power: GP App and QNN-kernel scenarios; area: SoC, core, ex_stage, id_stage).

IV. RESULTS
The evaluation of MPIC has been performed on two fronts. The first is the physical implementation, where we extracted power, area, and frequency figures used to compare our approach to the baseline core. The second is the performance assessment with benchmarks, where we executed a QNN layer from 8 bits down to 2, using uniform and the various mixed-precision variants, also providing a comparison with commercially available MCUs sporting Cortex M7 and Cortex M4 cores.
A. Implementation Results
The experimental results presented in this section are based on both the RI5CY and MPIC cores integrated into the PULPissimo SoC, which features a full set of peripherals, a DMA subsystem, and 512 kB of SRAM memory. We synthesized the SoC with Synopsys Design Compiler 2016.03 using GlobalFoundries 22 nm FDX technology; place & route was performed with Cadence Innovus 15.20.100 in the worst-case corner (SSG, 0.59 V, −40°C/125°C); power analysis was done at 250 MHz in the typical corner (0.65 V, 25°C) using Synopsys PrimeTime. We performed different runs: one to test the maximum frequency of the SoC, while in the other we set the constraint to 250 MHz, aiming at maximizing energy efficiency for power analysis purposes.

From the results of the max-frequency synthesis run, we observed a negligible reduction in the maximum operating frequency, from 511 to 505 MHz (1% slowdown). For what concerns power analysis, four different scenarios have been profiled: one to evaluate the impact of the introduced extensions on general-purpose code, and the other three to evaluate the modified dot-product unit while executing 8-, 4- and 2-bit QNN kernels. The results are reported in the power consumption section of Tab. II. Surprisingly, MPIC consumes slightly less power in general-purpose applications, thanks to the addition of clock gating for unused dotp modules (Sec. III-C), which was shared among all the dotp units in RI5CY. Overall, the table shows that all power results are within 2% of each other, well inside the margin of error, telling us that the changes made did not significantly impact the overall efficiency of the core.

The second section of Table II reports area results. The core has an 11% overhead given by its extension for status-based operation. The ex_stage has a 16% overhead for the added logic supporting the new precision operations, while the id_stage is larger by 7.5%; this effect is due to several factors: the main contribution comes from the registers added for operand isolation (the id_stage contains the ID/EX pipeline stage) for MAC operations, while a secondary contribution is that of the mixed-precision controller. Overall, the SoC overhead is around 0.2%, since the 512 kB of SRAM occupies most of the area.

Fig. 6. Performance expressed in MAC/cycle.
B. Benchmarking
To show the performance benefits of supporting these new precision formats, we chose a QNN layer with different configurations for input/output and weights. The input tensor size is 16x16x32, while the filter is 64x3x3x32; this configuration is among the ones featuring the best performance on the targeted architectures. The devices used for this comparison are the baseline RI5CY core, MPIC, an STM32H7 equipped with a Cortex M7 (40 nm technology) [19], and an STM32L4 with a Cortex M4 (90 nm); results consider the STM32H7 running at 480 MHz and the STM32L4 at 80 MHz [20].

Figure 6 shows the results in terms of MACs per cycle. The different configurations in the chart are denoted on the X-axis by the size of the activations and weights. Analyzing the charts, we can see different trends. The Cortex M7 and M4 have lower performance than MPIC in all the configurations, and even than the RI5CY core in the case of 8-bit uniform quantization or 8-bit weights. This is due to the high overhead introduced by unpacking data before the execution of the MACs, but also to the fact that both ARM Cortex cores support at most two 16-bit MACs per cycle.

The RI5CY core achieves about 2.1 MAC/cycle in the first case, well above the Cortex M7/M4 and on par with MPIC, because both support 8-bit MACs. When going to sub-byte configurations, it suffers the same fate as the ARM cores (except for 8-bit weights) due to the additional overhead introduced by unpacking data in the MatMul phase (Sec. II-C). Compared to the Cortex M7 or M4, the better performance of RI5CY is due to its more efficient ISA: load/store post-increment, hardware loops, and the possibility to execute 4 MACs per cycle at 8-bit precision. In contrast, MPIC does not require unpacking data before execution; data can be fed directly to the dot-product unit, resulting in a peak of 6.5 MACs per cycle in the 2-bit uniform layer. When looking at 8-bit weights, we can see that the performance is close to 8-bit uniform quantization; this is because unpacking is done in the im2col execution phase, which is far less computationally intensive and does not impact execution as much as the inner loop of the kernel [13].

Fig. 7. Energy efficiency expressed in GMAC/s/W.
Significantly, mixed-precision QNN kernels do not suffer any performance hit either, thanks to the unpacking being done in hardware. The performance of the 8x4 and 8x2 kernels is close to the 8-bit uniform kernel, and likewise for the 4-bit one. This is because the selection of the dotp module (Fig. 5) is tied to the size of the larger operand (e.g., 8x4 uses the 8-bit multipliers). However, we can see that the performance is slightly better than the equivalent uniform case, thanks to the higher operational intensity: in a mixed-precision 8x4 operation, operand B requires fewer fetches from memory, since its register can hold twice as many operands as the register containing operand A. Another factor that impacts mixed-precision operation is the quantization process (QntPack): focusing on the chart for activations of 4 and 2 bits, the performance is marginally worse than with 8-bit activations.

In contrast with performance in MAC/cycle, energy efficiency (expressed in GMAC/s/W) also takes into account physical design parameters such as the fabrication technology and the operating voltage and frequency. For the Cortex M7 and M4, we used implementations from STMicroelectronics consuming ∼234 mW at 480 MHz [19] and 10 mW at 80 MHz [20], respectively, while we used the power consumption figures reported in Table II for the RISC-V SoCs. In Figure 7, we can see that the lower performance of the Cortex M7 is emphasized even more by the technology factor: it peaks at 1.27 GMAC/s/W and is 74× to 255× less efficient than MPIC in these workloads. The Cortex M4 is far more efficient than the Cortex M7 but still falls short of the RISC-V cores, being 35× to 113× less efficient. For the RI5CY core, we have a slight disadvantage of 1% only in the 8-bit case, while in all other scenarios the results are qualitatively similar to the performance ones.

V. CONCLUSION
In this work, we presented an alternative way to deal with a saturated encoding space. We extended the ISA to support sub-byte and mixed-precision formats, aiming at improving the performance of QNNs by removing the overhead caused by unpacking data before computation. The MPIC-based SoC implementation resulted in an area overhead of 11% with respect to the baseline core, while having a negligible impact on frequency and power, thus not compromising the general-purpose nature of the RI5CY core. The performance gain ranges from 1.1× to 7.7× when compared to the baseline during the execution of a QNN layer, and from 3.6× up to 19.3× with regard to the Cortex M7 and M4. The energy efficiency peaks at 303 GMAC/s/W for the 2-bit convolution and is one to two orders of magnitude higher than the ARM counterparts, providing a solution that is considerably more efficient than commercially available MCU solutions for QNN inference.

REFERENCES

[1] M. Rusci, A. Capotondi, and L. Benini, "Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers," arXiv preprint arXiv:1905.13082, 2019.
[2] B. Moons, K. Goetschalckx, N. Van Berckelaer, and M. Verhelst, "Minimum energy quantized neural networks," IEEE, 2017, pp. 1921–1925.
[3] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[4] A. Stojanov, T. M. Smith, D. Alistarh, and M. Püschel, "Fast quantized arithmetic on x86: Trading compute for data movement," Oct 2018, pp. 349–354.
[5] B. Moons and M. Verhelst, "An energy-efficient precision-scalable convnet processor in 40-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, April 2017.
[6] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 26–35. [Online]. Available: https://doi.org/10.1145/2847263.2847265
[7] A. Anderson, M. Doyle, and D. Gregg, "Scalar arithmetic multiple data: Customizable precision for deep neural networks," June 2019, pp. 61–68.
[8] B. Wu and I. Wey, "Parallel balanced-bit-serial design technique for ultra-low-voltage circuits with energy saving and area efficiency enhancement," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 1, pp. 141–153, Jan 2018.
[9] ARM, "Arm architecture reference manual Armv8," 2013–2020, https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
[10] A. Waterman and K. Asanović, "The RISC-V instruction set manual, volume I: User-level ISA," EECS Department, University of California, Berkeley; SiFive Inc., April 2019.
[11] J. Yiu, "Introduction to the Arm Cortex-M55 processor," February 2020. Available online: https://pages.arm.com/cortex-m55-introduction.html
[12] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gürkaynak, and L. Benini, "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, Oct 2017.
[13] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Philosophical Transactions of the Royal Society A, vol. 378, no. 2164, p. 20190155, 2020.
[14] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and M. Horowitz, "Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 26–36, 2010.
[15] P. D. Schiavone, D. Rossi, A. Pullini, A. Di Mauro, F. Conti, and L. Benini, "Quentin: an ultra-low-power PULPissimo SoC in 22nm FDX," IEEE, 2018, pp. 1–3.
[16] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, A. Capotondi, P. Flatresse, and L. Benini, "PULP: A parallel ultra low power platform for next generation IoT applications," Aug 2015, pp. 1–39.
[17] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," arXiv preprint arXiv:1805.06085, 2018.
[18] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," arXiv preprint arXiv:1801.06601, 2018.