Always-On 674 µW @ 4 GOP/s Error-Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22 nm IoT End-Node
Alfio Di Mauro, Francesco Conti, Pasquale Davide Schiavone, Davide Rossi, Luca Benini
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

A. Di Mauro, F. Conti, P. D. Schiavone and L. Benini are with the Integrated Systems Laboratory, D-ITET, ETH Zürich, 8092 Zürich, Switzerland. F. Conti, D. Rossi and L. Benini are also with the Energy-Efficient Embedded Systems Laboratory, DEI, University of Bologna, 40126 Bologna, Italy.
Abstract—Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme in which error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22 nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5 V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2× w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42 V (50% of nominal) while keeping more than 99% of the nominal accuracy, despite the resulting SRAM bit error rate.

Index Terms—SRAM Voltage Scaling, Binary Neural Networks, Ultra-Low Power, IoT, Near-Threshold Computing.
I. INTRODUCTION

THE latest advances in the Internet-of-Things (IoT) are changing the nature of edge-computing devices. End-nodes have to support, in place, an increasing range of functionality, for example video and audio sensory data processing and complex system-control strategies. These new capabilities will enable applications such as an entirely new class of biomedical devices [1], autonomous insect-sized drones [2], and cheap smart sensors [3] that continuously check the stability of bridges, tunnels, and other structures. Machine learning algorithms, and specifically deep neural networks (DNNs), have shown outstanding performance on these tasks. However, while DNNs fit well within the performance and power budgets of embedded GPUs and FPGAs, deploying such compute-intensive algorithms on battery-powered IoT end-node platforms, characterized by heavily constrained power budgets (typically µW to 100 mW), still constitutes a huge challenge, as these platforms are expected to achieve lifetimes in the order of months, years, or even decades. As such, recent
research efforts from both industry and academia have focused on enabling the deployment of deep inference on devices operating in the sub-100 mW power range [4]–[11].

The most common approach to reducing average power consumption, widely used in commercial microcontrollers and IoT end-nodes, is duty cycling (a.k.a. sleep-walking). Under this paradigm, the system stays in deep-sleep mode most of the time, with a power consumption in the range of 100 nW to a few µW, and wakes up to perform the acquisition and classification task (e.g. with a CNN) only when triggered, either by an externally generated event or by an internal timer. In the first case, this approach requires fine-tuning of the trigger to reduce the number of false-positive activations, which can be quite difficult to achieve in a real scenario and which generalizes poorly to other contexts. Alternatively, if a time-based trigger is employed, thereby exploiting a simple triggering mechanism, the computing cost of an accurate CNN is so high that the active energy rapidly becomes dominant even at low duty-cycling rates: the average power P_avg = D · P_active + (1 − D) · P_sleep is dominated by the first term even for small duty factors D. As a result, the latter approach is inefficient whenever a fast reaction time is required at the sensor edge.

A common technique to reduce the active power of deeply embedded computing platforms is near-threshold computing [12]. Scaling voltage together with frequency significantly improves the energy efficiency of computation by exploiting the quadratic dependency of dynamic power on supply voltage. However, aggressive voltage scaling has a significant impact on the operating frequency of the logic and on the reliability of the memory elements of the system, especially those based on SRAMs. While the frequency degradation at low voltage can be recovered by exploiting powerful and efficient hardware accelerators, the SRAM reliability issue remains an unsolved problem. Therefore, in most cases 6T-SRAMs have to be replaced by more resilient, custom solutions such as SRAMs composed of 8T or 10T bitcells supported by read and write assist circuits [13], [14]. Among the approaches adopted to improve the resiliency of memory elements at low voltage, the use of standard cell memories is particularly convenient, since they are built on top of standard library cells such as flip-flops or latches, which are much more resilient than SRAMs when operating close to the threshold voltage of transistors [15], [16]. This comes at a significant cost in terms of area; on the other hand, a relatively large on-chip memory is necessary to enable complex algorithms based on DNNs [9], [17].

In recent years, BNNs [18]–[20] have become popular in the embedded computing domain for achieving remarkable accuracy on many complex classification tasks, narrowing the gap that separates them from state-of-the-art fixed-point or floating-point CNNs. Compared to fixed- or floating-point CNN implementations, which rely on convolutions, BNNs are characterized by a very lightweight hardware implementation of the datapath. Binary convolution can be implemented with simple logic elements such as XNOR gates, requiring only a very limited number of area-hungry adders for partial-sum accumulation. Moreover, BNNs also feature lower memory footprints than CNNs, reducing the energy consumed on the memory side for storing weights and intermediate results.

These features make BNNs a good candidate for all scenarios where power consumption is a major concern but, at the same time, high responsiveness to sensor stimuli needs to be ensured (e.g. pico-sized autonomous navigation robots or surveillance nodes). The low power envelope of BNNs allows using them as data pre-filtering algorithms, specifically to extract semantically meaningful information in an always-on operating mode [21].
In this context, BNNs can be used as the first stage of a staged inference pipeline, composed of low-power, less accurate early inference stages and computationally powerful fixed- or floating-point CNN implementations as later stages [22]. As the memory footprint of BNNs is significantly lower than that of CNNs, bigger topologies can be supported within the same power/performance budget, enhancing the generalization capability of the early stages and thereby lowering the occurrence of false-positive triggering. Additionally, the employment of BNNs as a preliminary filtering stage does not prevent the adoption of conventional power-saving strategies like duty cycling or sleep-walking at run time, whenever the application latency requirements allow it.

One of the advantages of DNNs, and of BNNs in particular, is their high robustness to noise [23]. The high resiliency of BNNs to random errors stems from the fact that, as opposed to traditional neural networks, where activations and weights are represented by integer numbers, no bit in their activations and weights is inherently more significant than any other. As a consequence, no bit is more vulnerable than any other: information processing is spread equally among all bits, and only a very high error rate can bring a dramatic loss in quality-of-results. This noise robustness is a very powerful feature, since it enables very aggressive power reduction techniques to be applied to the memories as well.

In this work, we advance the state-of-the-art in ultra-low power deep inference with BNNs with four key contributions:

i) We propose a strategy to execute noisy BNNs on microcontrollers. To the best of our knowledge, we propose the first complete and fully programmable end-node SoC architecture, together with a BNN inference data and code allocation strategy, enabling the execution of hardware-accelerated BNNs at ultra-low voltage.

ii) We describe and demonstrate on silicon a hybrid memory architecture composed of big SRAMs for error-resilient data and smaller Standard Cell Memories (SCMs) to hold vulnerable data such as microcontroller instructions and stacks. We also provide a methodology to efficiently exploit such a memory architecture. The hybrid memory architecture template presented in this work is easily applicable to other platforms.

iii) We present a self-test strategy for Bit Error Rate measurement performed on large SRAMs. This approach allows characterizing SRAM memories at ultra-low voltages, thereby estimating the amount of noise injected into the BNN.

iv) We demonstrate the validity of this architectural concept on an advanced prototype manufactured in GlobalFoundries 22nm FDX technology, using the safe SCMs to hold a microcontroller program that tests SRAM bit error rates with millions of random reads/writes, operating down to 420 mV (50% of the nominal supply voltage) for both logic and memories. Finally, we show that using the embedded hardware accelerator for BNNs, our prototype can be operated at 18 MHz, down-scaling the voltage to 420 mV for both logic and memories. In this operating point, the prototype achieves up to 99% of the nominal accuracy on a BNN trained for the CIFAR-10 dataset, while operating with an energy efficiency of 170 fJ/op and within a power envelope of 674 µW, enabling the embedding of advanced BNN-based cognitive functionality in ultra-low power "TinyML" devices such as biomedical sensors, long-lifetime environmental sensors, and insect-sized pico-UAVs.

The rest of the paper is organized as follows: Section II discusses related work in the state-of-the-art. Section III introduces the proposed SoC architecture. Section IV discusses the simulations we performed to evaluate the resilience of BNNs against SRAM errors. Section V details the experimental methodology used to evaluate the SoC and the results of the evaluation in terms of Bit Error Rate (BER), power, and energy efficiency. Section VI draws conclusions.

II. RELATED WORK
Recently, there has been a strong push towards the deployment of sophisticated artificial intelligence (in particular DNNs) on tiny end-node architectures dedicated to the extreme edge of the IoT, fostering a fast-growing TinyML research community [24], which has explored the field from two converging directions. On one hand, DNN topologies are being shrunk [25], reducing the number [26] and numerical precision of network parameters [27], moving from floating point down to highly quantized numerical representations, e.g. 8 or 4 bits, and ultimately to BNNs [18]. On the other hand, edge computing platforms are supporting this trend by becoming more and more specialized for efficiently executing machine learning workloads [17], [28]. In this section, we focus on the latter research direction and describe research works related to the SoC proposed in this paper in a top-down fashion. We start from software-programmable architectures targeting the end-nodes of the IoT, go through specialized heterogeneous and error-resilient hardware architectures, and end with dedicated architectures for CNN inference exploiting extreme quantization and error resiliency.
A. IoT End-Node Architectures
A fundamental element of all IoT end-node architectures is software programmability, typically based on tiny microcontrollers with ARM Cortex-M class processors. Significant commercial examples of such micro-systems have been proposed by all major embedded systems vendors, such as TI [29], STMicroelectronics [30], NXP [31], and Ambiq [32]. These systems feature aggressive sleep-walking capabilities thanks to sub-µW deep-sleep modes, leading to an extremely small average power. On the other hand, current research in IoT end-nodes is moving towards optimizing both active and sleep states by exploiting near-threshold and sub-threshold operation. These techniques further improve energy efficiency and reduce power consumption during computation [33]–[37]. Mr. Wolf [38] couples aggressive deep-sleep capabilities with an energy-proportional architecture, exceeding the computational capabilities of ULP microcontrollers by two orders of magnitude while offering competitive energy efficiency also at low and sporadic workloads. This is achieved thanks to a heterogeneous parallel architecture composed of an always-on autonomous I/O subsystem coupled with a parallel accelerator with 8 floating-point-capable RISC-V cores.

To target specific computation domains such as CNNs, some commercial architectures leverage lightweight SW acceleration and optimized DSP libraries to improve performance. A well-known example is CMSIS, developed by ARM, a set of libraries to optimize DSP applications on Cortex-M architectures, and CMSIS-NN [39], tuned to the deployment of embedded neural networks. An extension to these libraries has been proposed by Rusci et al. [40], targeting highly quantized networks such as 4-bit, 2-bit, and binarized networks [41]. However, due to their 32-bit nature, fully programmable solutions can only partially exploit the benefits of quantized NNs. While this approach significantly reduces the memory footprint of CNNs, several additional operations are required to pack/unpack activations and weights into arithmetic formats suitable for software processing (e.g. 16-bit or 8-bit) [40], degrading the performance and energy efficiency of inference. Modern microcontrollers have introduced dedicated ISA extensions to efficiently perform sub-word, sub-byte and SIMD operations [42], [43], mitigating this performance degradation.

To improve the overall efficiency of systems dedicated to NN acceleration, recent SoCs couple programmable processors with hardwired accelerators, in some cases exploiting low-precision functional units to leverage the resiliency of CNNs to quantization. Intel presented an IoT edge mote integrating an x86 processor accelerated by dedicated functional units for CNN and cryptography workloads [44]. Conti et al. proposed Fulmine [28], a heterogeneous SoC coupling four general-purpose processors with a convolutional accelerator: while the convolutional layers of CNNs run on the accelerator, other functions such as activations and pooling execute on the software processing cluster. GAP-8 [45] includes a specialized accelerator for convolutional neural networks supporting 16-bit precision for activations and 16-bit, 8-bit and 4-bit precision for weights, achieving up to 600 GMAC/s/W within a 75 mW power envelope. Another notable device is the low-power vision sensor node presented by Qualcomm Technologies [46], which performs end-to-end always-on visual detection tasks thanks to an ultra-low power QVGA CMOS sensor and a fully digital processor subsystem integrated as a single device. This architecture performs video processing at the sensor edge within an ultra-low power envelope by exploiting low-resolution sensing, data sparsity and event-driven computing, ultimately outputting only meta-information when meaningful events are detected.

In this work, we propose a near-threshold SoC joining the flexibility of a software-programmable 32-bit RISC-V processor, integrated into a state-of-the-art microcontroller featuring a rich set of peripherals, with the performance boost of a dedicated accelerator for BNN workloads, pushing quantization to the limit. On top of the flexibility and performance of this heterogeneous architecture, we propose a heterogeneous memory architecture exploiting the error resiliency of BNNs with respect to random errors in the memory system. To our best knowledge, the SoC described in this paper reports the lowest full-system power for active operation and always-on BNN inference presented in industry or academia.
B. Heterogeneous and Error Resilient Memory Architectures
Optimizing the memory hierarchy is one of the main concerns in IoT end-nodes operating at near-threshold, since memory can be the dominant source of power consumption, potentially jeopardizing their energy efficiency [34], [47], [48]. While many approaches rely on the custom design of low-voltage memories [14], [49], which come with associated area and power overheads (e.g. 8T or 10T bitcells, read and write assist circuits) [13], an emerging trend relies on approximate SRAMs, often joined with precision/performance tunability or heterogeneous memory architectures. Frustaci et al. [50] proposed approximate SRAMs for error-tolerant applications, in which energy is saved at the cost of read/write errors by exploiting voltage scaling, selective error correction codes (SECC), and selective write-assist techniques (SNBB). Compared to voltage scaling at iso-quality, the joint adoption of these techniques provides a substantial energy reduction at a negligible area penalty. Other works propose the adoption of emerging technologies to realize approximate memory cells, such as RRAM [51] and memristors [52].

Although all the aforementioned approaches are effective, they all require the design of custom SRAM banks (either approximate or not), and they feature deep circuit-level optimizations that cannot be easily integrated into automatic memory generators. Other approaches exploit heterogeneous memory architectures mixing standard SRAMs and latch-based Standard Cell Memories (SCMs). While SRAMs cannot be considered reliable below relatively high voltages (e.g. 0.8 V in the technology considered in this work), SCMs can operate over the same, typically much wider, voltage range as the rest of the logic [15]. Tagliavini et al. [53] proposed a HW/SW methodology to design energy-efficient ULP systems that combines a hybrid memory design, where part of the memory system is approximate and part is precise, with an error-aware allocation strategy. Similarly to this work, our approach leverages standard 6T-SRAM cells that can be realized with memory generators provided by silicon vendors, and SCMs that can be implemented with standard semi-custom design flows relying on industrially qualified standard cells. On the other hand, our work exploits the resiliency of binarized neural networks, where the position of a bit-flip error within a word is irrelevant to the quality of the final result, making them a much more suitable candidate for approximate computing.

C. Dedicated Hardware Accelerators for DNNs and BNNs
Many dedicated hardware accelerators specifically designed to bring deep learning to an ultra-low power budget have been proposed. Most designs employ fixed-point representations for weights and activations (e.g. Orlando [54], achieving up to 2.9 Top/s/W). Pruning and compression are popular techniques to further reduce the power budget [55]–[59]. Binary Neural Networks [18], [19] constitute a particularly interesting niche due to their properties, as they can be trained to achieve results similar to their full-precision counterparts [60] while keeping a smaller footprint, a more scalable structure and a higher resilience to errors, as further explored in Section IV. FINN [61] was the first architecture capable of reaching more than 200 Gop/s/W on an FPGA. Many of the most recent efforts towards the deployment of BNNs on silicon, such as BRein [62], XNOR-POP [63], Conv-RAM [6], as well as the BTNN accelerator proposed by Yin et al. [64], the BNN accelerator presented by Wang et al. [65], and the work by Khwa et al. [66], have achieved energy efficiencies in the range of 10-50 TOP/s/W using in-memory computing. Similar results have been claimed by more "traditional" ASICs such as UNPU [4] and XNORBIN [67]. Mixed-signal [5] and in-memory mixed-signal approaches [68]–[71] are able to achieve up to 10-100× higher efficiency, but pay a very significant cost in terms of design time, verification, and scalability to real systems. Yang et al. [23] exploit one such system in their work, where, similarly to what we propose, SRAM is aggressively voltage-scaled to achieve a power benefit. Our own system exploits a similar technique to Yang's, with the important distinction that their work is an extremely specialized ASIC, capable of executing a single BNN topology. By contrast, our design is a complete and fully programmable IoT end-node along the lines of those discussed in Section II-A, augmented with a very small hardware accelerator [72].

III. QUENTIN SOC

This section introduces the system architecture of the Quentin SoC, focusing on the micro-architecture of the binarized neural network accelerator (XNE) and on the heterogeneous system architecture and its implementation strategy, which enable the power/performance/precision tunable capabilities of the system. The architecture of Quentin is reported in Fig. 1.
A. System Architecture
The system under examination consists of an advanced microcontroller based on the open-source PULPissimo system (https://github.com/pulp-platform/pulpissimo), part of the Parallel Ultra-Low-Power (PULP) platform (https://pulp-platform.org). The SoC is built around a RISC-V processor (RI5CY) [42] optimized for energy-efficient digital signal processing. The core's pipeline
features 4 stages, includes a floating-point unit, and is fully compliant with the RV32IMFC ISA [73]. On top of the standard RISC-V ISA, the processor features digital signal processing extensions targeting energy-efficient near-sensor data analytics. These extensions include hardware loops, automatic pointer post-increment on memory accesses, bit-manipulation instructions, fixed-point and packed single-instruction-multiple-data (SIMD) operations, and unaligned memory accesses.

The system features a full set of peripherals, which include Quad-SPI (QSPI) supporting up to two external devices, I2C, 2x I2S, a parallel camera interface, UART, GPIOs, JTAG, a DDR HyperBus interface to connect up to 64 MB of external Dynamic RAM (DRAM) or FLASH memory off-chip, and a small ROM used to store the boot code. An I/O DMA (µDMA [74]) autonomously manages data transfers through the peripherals to minimize the workload of the processor. To improve the efficiency of the system and the flexibility of transfers from/to the peripherals, each peripheral has a dedicated clock domain. Two Frequency-Locked Loops (FLLs) adjust the frequencies of the peripheral subsystem and the core subsystem. Moreover, peripherals are equipped with clock dividers that allow fine-tuning their frequency according to the desired bandwidth. This architecture allows tuning the performance of computation and I/O transfers, minimizing the system-level power consumption for the desired performance target.

Fig. 1: Quentin SoC architecture. The core subsystem is highlighted in violet, the peripheral subsystem in blue, and the memory subsystem in green.
B. Hybrid Memory Architecture
The L2 memory of the proposed SoC, shown in the bottom part of Fig. 4, consists of a heterogeneous memory architecture designed to operate over a wide voltage range and to optimize access to the different regions of the memory depending on their purpose. From an architectural point of view, the memory is composed of two regions. The first one is a 64 kB private memory that can be used by the Fabric Controller (FC) for storing its program, the stack, and other private data. This portion of the memory, connected to the interconnect through two ports (one for instructions and one for data), is typically not shared with other initiators; hence it does not incur any kind of conflict, guaranteeing full bandwidth. The second portion, called L2 interleaved memory (Fig. 4), is composed of four 114 kB banks that can be accessed in parallel by the masters (i.e., µDMA [74], instruction, and data ports) while minimizing the banking conflict probability thanks to the interleaved addressing scheme implemented by the interconnect. From a performance viewpoint, this memory organization enables transparent sharing of the L2, increasing the system memory bandwidth by 4× compared to the traditional single-port memory architecture typical of AHB-based MCUs [30], without the use of power-hungry dual-port memories.

Both memory regions described above are heterogeneous also from the memory technology point of view, being implemented as a hybrid mix of SRAM and standard-cell-based memory cuts (SCM). The SCMs are based on the architecture described in [75]. Each of the interleaved banks has 112 kB of SRAM and 2 kB of SCM, while the private bank has 8 kB of SCM, as shown in Figure 1, with the rest implemented as SRAM. The SCM portion of the private bank is implemented as a 3-read, 2-write-port register file: two of the read ports and one of the write ports are dedicated to the data and instruction interfaces of the RISC-V core, while one read and one write port are used by the interconnect arbiter for any other master node of the system. Besides the intrinsic flexibility of synthesizable IPs, which makes them well suited to implement multi-port cuts, one of the main advantages of latch-based memories is their capability, empirically proven in this work, to operate reliably in a much wider supply voltage range than SRAMs. Moreover, they feature significantly smaller read and write energy with respect to traditional SRAMs, up to 4× depending on the configuration (i.e. leakage-dominated vs. dynamic-dominated) [75]. On the other hand, they pay a significant area overhead with respect to SRAMs, which makes them suitable only for implementing very small memory regions, usually in the order of a few kB [75].
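To make the error-aware allocation strategy concrete, the following C fragment sketches how data could be pinned to the two memory types at compile time. It is a minimal illustration only: the section names (.scm_private, .sram_interleaved), the array sizes, and the GCC-style attributes backed by a linker script mapping those sections onto the SCM and SRAM regions are our assumptions, not the actual Quentin SDK.

    #include <stdint.h>

    #define WEIGHT_WORDS (64 * 1024)  /* illustrative sizes, not Quentin's */
    #define ACT_WORDS    (16 * 1024)

    /* Vulnerable data goes to the always-reliable SCM region: program code
       and stack are placed there by the linker script; error-critical
       constants such as the 8-bit activation thresholds can be pinned
       explicitly. */
    __attribute__((section(".scm_private")))
    uint8_t bnn_thresholds[128];

    /* Error-tolerant BNN payload goes to the voltage-scaled SRAM banks:
       a bit flip here perturbs one binary weight or activation, which the
       network tolerates up to the BERs characterized in Section IV. */
    __attribute__((section(".sram_interleaved")))
    uint32_t bnn_weights[WEIGHT_WORDS];

    __attribute__((section(".sram_interleaved")))
    uint32_t bnn_activations[ACT_WORDS];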
C. XNOR Neural Engine and BNN Execution Model
To execute binary neural networks with high performance and energy efficiency, the Quentin SoC also contains a dedicated hardware accelerator called the XNOR Neural Engine (XNE) [72]. The XNE is connected as a master to the interleaved L2 memory. It has four ports, for an overall memory bandwidth of 128 bits per cycle. All the configuration registers are memory-mapped and accessible by the core. The XNE can execute both convolutional and fully connected layers autonomously from the core, once all data reside in L2. Figure 2a schematizes the internal architecture of the XNE. It is divided into a control submodule, responsible for receiving jobs from the core; a streamer submodule, translating internal data streams into actual memory transfers on the memory interconnect towards L2; and a datapath that performs binary matrix-vector products. The controller includes a memory-mapped slave interface to a configuration register file, a controller finite-state machine, and a small microcoded loop that implements the BNN layer execution pattern described below.
Fig. 2: a) XNE internal architecture, showing the streamer (green shades), control (orange) and datapath (blue) submodules; b) BNN layer execution pseudo-code highlighting microcoded loops (orange) and datapath execution (blue). The pseudo-code of Fig. 2b is reproduced below:

    for i in range(output_height):
      for j in range(output_width):
        for ko_major in range(nb_out_chan/128):
          for ko_minor in range(128):
            acc[ko_minor] = 0
          for ui in range(filter_height):
            for uj in range(filter_width):
              for ki_major in range(nb_in_chan/128):
                for ko_minor in range(128):
                  ko = ko_major*128 + ko_minor
                  popcount = 0
                  for ki_minor in range(128):
                    ki = ki_major*128 + ki_minor
                    binary_prod = W[ko,ki,ui,uj] * x[ki,i+ui,j+uj]
                    popcount += binary_prod
                  acc[ko_minor] += popcount
          for ko_minor in range(128):
            y[i,j,ko_major*128+ko_minor] = 0 if acc[ko_minor] < threshold[ko_minor] else 1

An INPUT BUFFER is loaded with a stationary set of 128 input feature bits, with each bit representing a different input channel (0 represents a '-1' value, 1 a '+1' value). The stationary input is multiplied with 128 bits of weights that are dynamically streamed in each cycle, using 128 parallel XNOR gates. The XNOR gates are followed by a 128-way reduction tree that performs a POPCOUNT operation, reducing the output to a 7-bit number. Overall, the XNOR & POPCOUNT unit performs a full 128x128 binary matrix-vector product in 128 cycles, which is used to implement the innermost loops of a convolutional or linear BNN layer. To implement the outer loops, popcount outputs are accumulated in a set of 128 registers (one per output channel) of 16 bits each. After the accumulation is completed, the accumulated values are activated and binarized by comparing them with a set of 8-bit activation thresholds that are streamed in from memory and left-shifted by a configurable amount to be comparable with the 16-bit accumulators. The execution is iterated as specified by the microcoded loop to implement a full BNN layer; if the granularity of the layer is smaller than 128 input or output channels, the datapath can be configured accordingly. Figure 2b describes the full execution schedule as pseudo-Python code; we refer to Conti et al. [72] for further detail.

Since the XNE operates at the granularity of a single BNN layer, the execution of a full network relies also on the operation of two other modules in the SoC: the RI5CY core, operating as a lightweight controller, and the µDMA engine, which is used to load inputs from I/O. While the Quentin SoC is designed with the capability to access an external IoT DRAM if necessary, in this work we focus on the execution of relatively small, fully on-chip BNNs, which can be run within an ultra-low power budget by means of aggressive voltage scaling, and which access I/O exclusively to fetch input frames. Figure 3 shows the execution profile of an example four-layer BNN, along with the C runtime code that is used to run it. Runtime API calls wrap the memory-mapped control interfaces that both the µDMA and the XNE expose; therefore, control is realized with fully compliant ANSI C code using regular load/store operations and requires no extension to the RISC-V ISA. In the runtime code, udma_get_input API calls are synchronous and xne_run is asynchronous, with an explicit xne_wait bringing RI5CY to sleep.

Fig. 3: a) Execution profile for an example 4-layer fully on-chip BNN; b) ANSI C runtime code executed on RI5CY for the same 4-layer BNN. The runtime code of Fig. 3b is reproduced below:

    // x0, x1, x2, x3, y are statically allocated in L2 memory
    // W0, W1, W2, W3 are statically allocated in L2 memory and filled with weight values
    uint8_t *x[4] = { x0, x1, x2, x3 };
    uint8_t *W[4] = { W0, W1, W2, W3 };
    int xne_job_id;

    // first execution
    udma_get_input(x0); // fill x0 with next input data frame
    while (1) {
      for (int i=0; i<3; i++) {
        xne_program(x[i], W[i], x[i+1], CH_OUT[i], CH_IN[i],
                    HEIGHT[i], WIDTH[i], FILTER_SIZE[i]);
        xne_job_id = xne_run(); // start execution of layer
        xne_wait(xne_job_id);   // RI5CY sleeps and waits for XNE end of computation
      }
      xne_program(x[3], W[3], y, CH_OUT[3], CH_IN[3],
                  HEIGHT[3], WIDTH[3], FILTER_SIZE[3]);
      xne_job_id = xne_run();
      udma_get_input(x0);   // fill x0 with next input data frame
      xne_wait(xne_job_id); // RI5CY sleeps and waits for XNE end of computation
    }

Fig. 4: Quentin SoC floorplan.

TABLE I: Quentin SoC features.

  Technology                   CMOS 22 nm FD-SOI
  Chip Area                    2.3 mm²
  Memory (SRAM + SCM)          520 kB
  Equivalent Gates (NAND2)     1.8 Mgates
  Voltage Range                0.42 V – 0.8 V
  Body Bias Range              0.0 V – 1.4 V
  Frequency Range              32 kHz – 670 MHz
  Frequency Range (with FBB)   32 kHz – 938 MHz
  Power Range                  300 µW – 10.4 mW
  Power Range (with FBB)       300 µW – 66.2 mW
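To clarify the arithmetic implemented by the XNE datapath, note that for bipolar values a, b ∈ {−1, +1} encoded as single bits (1 encodes +1, 0 encodes −1), the product a·b is +1 exactly when the two bits are equal, i.e. when XNOR(a, b) = 1; an N-element dot product therefore equals 2·popcount(XNOR(w, x)) − N. The C sketch below illustrates this identity in software; it is a functional illustration under our naming, not the XNE's hardware description.

    #include <stdint.h>

    /* Binary dot product over 32*words bipolar elements packed one per bit,
       followed by threshold binarization, mirroring the XNE's
       XNOR + POPCOUNT + accumulate + threshold pipeline in plain C. */
    static int binary_dot(const uint32_t *w, const uint32_t *x, int words)
    {
        int popcount = 0;
        for (int i = 0; i < words; i++)
            popcount += __builtin_popcount(~(w[i] ^ x[i])); /* XNOR, then count */
        return 2 * popcount - 32 * words; /* map bit count back to bipolar sum */
    }

    /* Binarized activation: in the XNE, 8-bit thresholds are streamed from
       memory and left-shifted to match the 16-bit accumulators; a plain
       integer threshold stands in for that here. */
    static inline uint32_t binarize(int acc, int threshold)
    {
        return (acc < threshold) ? 0u : 1u;
    }

This is also why a flipped bit in a weight or activation perturbs the dot product by exactly ±2, regardless of the bit's position within the word.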
D. Chip Implementation
Figure 4 shows the floorplan of the Quentin SoC, while Table I summarizes its main features. The SoC was implemented in 22 nm FD-SOI technology using a flip-well (LVT) standard cell library. The design was synthesized with Synopsys Design Compiler 2016.12, while Place & Route was performed with Cadence Innovus 16.10. Fig. 6 shows a micrograph of the Quentin SoC (the area of the micrograph that is not annotated contains independent designs fabricated on the same chip). The floorplan area of the SoC is 2.31 mm² and its effective area is 1.22 mm² (6154 kGE). Its main modules are highlighted in Figure 4. The two largest components of the SoC are the SRAM banks of the L2 memory subsystem (504 kB) and the 16 kB of SCM banks. Although the latch-based implementation features approximately a 10× area overhead compared to approaches based exclusively on SRAMs (Table II, Fig. 5), it allows major energy savings [75], and it enables more flexible power management strategies that can be played at the system level. For example, SRAMs and SCMs can be independently power-gated. Additionally, on SCMs, it is possible to scale the operating voltage more aggressively than for SRAMs: our tests reported no errors when the supply voltage of the SCMs is scaled down to 0.42 V; on the contrary, errors on SRAMs become appreciable already at 0.575 V (Section V), limiting the voltage scaling capability of the system.

To exploit both the energy advantage of SCMs and the area density advantage of SRAMs, and to enhance the power/performance/precision tuning capabilities of Quentin, the chip was implemented as a multi-power-domain system. The SRAM cuts have separate power connections from the rest of the logic for both periphery and array, as shown in Figure 7. This configuration allows us to independently tune the supply voltage of logic circuits, memory arrays, and memory periphery. Moreover, it allows the system to operate in an ultra-low-power, highly voltage-scaled mode using only the 16 kB SCM memories, and to shut down the SRAM via an off-chip power switch.

Fig. 5: Quentin SoC area breakdown.

TABLE II: Quentin area breakdown in mm².

  CPU subsystem    0.020
  SRAM (504 kB)    0.817
  SCM (16 kB)      0.292
  ROM              0.009
  I/O subsystem    0.056
  XNE              0.014
  Interconnect     0.009

Fig. 6: Quentin SoC chip micrograph.

Fig. 7: Quentin SoC Power Domains.

IV. BNN ERROR RESILIENCE ANALYSIS
As argued in Section I, BNNs have been shown to be partially resilient to high error rates. For example, Yang et al. [23] use a statistical model to quantify the accuracy drop of a BNN in an application-specific architecture, reporting only a limited accuracy drop even at relatively high BERs. In this section, we evaluate the final classification accuracy of different pre-trained BNN topologies under multiple SRAM BER conditions. The goal of our analysis is to exploit the error resilience of BNNs to enable major energy efficiency gains on SoC architectures featuring heterogeneous memory subsystems. Our results are silicon-proven on the Quentin chip. We performed our analysis on the CIFAR-10 classification data set.

The BER reported in our experiments refers to data being fetched from the SRAM. In Quentin, the source from which the XNE fetches and stores binary data (i.e. weights, activations and partial results of internal BNN layers) is not fixed at design time. As described in Section III, this data can reside in either the interleaved SCM or SRAM memories; the XNE accelerator always holds partial sums inside its accumulation buffer, and only fully binarized outputs are stored back to the shared memory.

In this scenario, we identified three potential sources of errors affecting the final BNN classification accuracy: i) weight reads; ii) input feature reads; iii) activation stores. In our experiments, the XNE data-flow partial results are not affected by errors, as they are held inside the local buffer of the accelerator. Additionally, output activations are binarized by comparing the final accumulation value y with a safe 8-bit threshold value τ, which is stored in the error-free SCM memory. Input features, weights, and activations reside in the SRAM, potentially corrupted by errors.

To evaluate the accuracy loss when data are corrupted by a certain BER, we performed a set of simulations using the PyTorch 1.0.1 framework. We targeted a set of networks pre-trained on CIFAR-10, based on Hubara's implementation (https://github.com/itayhubara/BinaryNet.pytorch). We added uniformly distributed errors to the inputs, weights, and activations of all layers of the network according to the target BER values, and tested the noisy BNNs on the test set of the CIFAR-10 data set. The networks were not re-trained to compensate for the additional noise injected during the simulation. This exploration ultimately allows evaluating the overall effect, in terms of final classification accuracy, of aggressive voltage scaling performed on the SRAM. In other words, the SRAM supply voltage can be changed dynamically, depending on the tolerable quality-of-results drop of a target application. In our experiments, we also explored the case where errors occur with recurring patterns; in this scenario, we did not observe any significant difference in the final classification accuracy with respect to the case where errors were uniformly distributed.

Fig. 9 shows the results in terms of classification accuracy versus the BER. The classification accuracy of each network is reported as an average over 100 randomized experiments over the CIFAR-10 test set; the standard deviation of the results over this sample is always a small fraction of the reported value. We report results on Hubara's topology, as well as on a network inspired by the one proposed by Yang et al. We also report results on our own topology, similar to Hubara's [19] but sized to be deployable on the Quentin SoC. Figure 8 shows the topology of this latter network, which we call micro-VGG (uVGG). Table III reports the salient characteristics of these networks. For what concerns the network proposed by Yang et al. [23], we were not able to reproduce the exact training setup: our PyTorch implementation, on which our experimental results are based, achieves significantly lower accuracy than what Yang et al. [23] report (78.6% instead of 85%).

In the remainder of the paper, we focus on our proposed network, which keeps its accuracy drop below 7‰ across the BER range of interest, while fitting perfectly in the SRAM of the Quentin SoC.

Fig. 8: uVGG BNN topology (3x3 convolutional layers with 2x2 pooling, followed by fully connected layers).

TABLE III: Parameters of BNNs used in the resilience experiment.

  BNN topology                 Nominal accuracy    Mem. footprint^a
  Based on Yang et al. [23]    78.6%               319 kB
  Hubara et al. [19]           90.9%               4545 kB
  uVGG                         —                   —

  ^a Including both activations and weights.

Fig. 9: BNN resilience versus bit-error rate (classification accuracy of uVGG, the network based on Yang et al. [23], and Hubara et al. [19]).
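Although our resilience analysis was carried out in PyTorch, the injection mechanism itself is simple enough to restate in C on bit-packed binary tensors: every stored bit is flipped independently with probability equal to the target BER. The sketch below is illustrative only; the function name and the rand()-based Bernoulli draw are ours, not the actual simulation code.

    #include <stdint.h>
    #include <stdlib.h>

    /* Flip each bit of a packed binary buffer independently with probability
       `ber`, emulating uniformly distributed SRAM read errors. */
    static void inject_bit_errors(uint32_t *buf, size_t words, double ber)
    {
        for (size_t i = 0; i < words; i++) {
            uint32_t flips = 0;
            for (int b = 0; b < 32; b++)
                if ((double)rand() / (double)RAND_MAX < ber)
                    flips |= 1u << b;
            buf[i] ^= flips; /* corrupted value, as if read from noisy SRAM */
        }
    }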
V. EXPERIMENTAL RESULTS

In this section, we describe the results of the experiments that allowed us to correlate SRAM supply voltage scaling with the classification accuracy of the uVGG BNN presented in Section IV. As a first step, we evaluated the level of noise that could potentially corrupt the data stored in the SRAM by measuring how the Bit Error Rate (BER) correlates with the memory array and periphery voltage supplies. As a second experiment, we measured the current drained by each power domain of the SoC to extract the power consumption; the power contributions reported in this section refer to the independent power rails described in Fig. 7. Finally, we computed the energy efficiency of the SoC and evaluated the power saving when the supply voltage of the SRAM is scaled and the quality-of-results (i.e. top-1 network accuracy) is degraded by less than 1%.
A. Experimental setup
All the measurements related to SRAM BER evaluation and per-domain power consumption have been performed using an Advantest hp93000 SoC integrated circuit tester. Supply voltages have been precisely regulated using dedicated hp93000 power supply channels. Power measurements have been performed using current measurement channels integrated into the hp93000 device and connected in series with the voltage supply channels.

BER experimental data have been obtained by running a self-test C application on the RI5CY microprocessor, executing from the private SCM of the core, which is error-free in all operating points tested. We loaded the test program into the SCM through the SoC internal debug interface, via a standard JTAG interface driven by the hp93000 digital channels. Fig. 10 shows the block diagram of the experimental setup.

Fig. 10: High-level block diagram of the experimental setup.
B. Bit Error Rate analysis
Measuring bit error rates from outside, i.e. directly from the tester equipment, requires a very large testing time. In our tests, we observed that the number of bits to observe in order to detect a single-bit memory failure is on the order of the reciprocal of the BER or higher, which becomes extremely large for SRAM operating at nominal supply voltage conditions. Additionally, to acquire relevant statistics on memory errors, tests have to be repeated many times. To reduce the number of pads dedicated to SoC debug subsystems, modern microcontrollers often employ serial debug interfaces connected to a shared bus; accessing the memory locations through a serial JTAG debug interface designed for reliability rather than for speed is therefore a severe limitation for BER measurement. In our tests, we estimated that a single BER measurement point would be acquired in several tens of minutes, assuming 448 kB of memory tested for 1800 iterations at a JTAG frequency of 1 MHz, with each measurement repeated 10 times. To overcome the serial debug interface bottleneck, we designed an on-chip BER test application executed by the microcontroller core, which reduces the time to test a single BER point by a factor of approximately 100×.

To issue memory transactions to the SRAM and observe errors on the bits, our self-test application runs directly on the RI5CY core of the SoC, which operates at the highest reliable frequency for each condition. Pseudo-random test patterns are generated by the core using a lightweight 32-bit Linear Feedback Shift Register (LFSR) implemented in C code. The test application sequentially covers the entire SRAM shared address space. Errors are counted by comparing, bit-wise, the data read at each memory location with the ground-truth value generated by the LFSR using the same initial seed. At each supply voltage point, the test is repeated in a loop to obtain a reliable measurement of the BER. Note that this approach could generate artifacts in the error statistics when a memory location is filled in successive iterations with the same test vector; to avoid this problem, and to make our measurement more robust, the software LFSR uses a different seed to generate test data at each new iteration.

In our tests, we measured only the BER of the SRAM banks. The SCM, which is hosted by the same power domain as the circuit logic, was reserved for storing the core instructions of the self-test application and the test results (i.e. the number of errors). Note that storing the software instructions in an error-free memory is mandatory for the application to run at all. In SoCs featuring single-power-domain memory subsystems (i.e. without the possibility of storing core instructions in a separate error-free memory), SRAM errors could also affect core instructions, making aggressive voltage scaling infeasible: a single corrupted bit in a core instruction could derail the core's control flow, bring the entire SoC into unpredictable states, and ultimately make the system fail. For each operating point in our experiments, we performed 1800 on-chip test runs, writing 448 kB at each iteration.
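A minimal sketch of one self-test iteration is given below. It is our reconstruction of the described procedure, not the actual test firmware: the SRAM base address, the word count, and the LFSR tap mask (here a common maximal-length 32-bit polynomial) are illustrative assumptions.

    #include <stdint.h>

    #define SRAM_BASE  ((volatile uint32_t *)0x1C010000u) /* hypothetical map */
    #define SRAM_WORDS (448u * 1024u / 4u)                /* 448 kB under test */

    /* 32-bit Galois LFSR step (right-shift form); tap mask is illustrative. */
    static inline uint32_t lfsr_next(uint32_t s)
    {
        return (s & 1u) ? ((s >> 1) ^ 0x80200003u) : (s >> 1);
    }

    /* One iteration: write a pseudo-random pattern over the whole shared SRAM,
       then read it back and count mismatching bits. The caller passes a fresh
       seed on every iteration to avoid rewriting locations with the value
       they already hold. */
    static uint32_t sram_bit_errors(uint32_t seed)
    {
        uint32_t s = seed, errors = 0;
        for (uint32_t i = 0; i < SRAM_WORDS; i++) {  /* write phase */
            s = lfsr_next(s);
            SRAM_BASE[i] = s;
        }
        s = seed;
        for (uint32_t i = 0; i < SRAM_WORDS; i++) {  /* read & compare phase */
            s = lfsr_next(s);
            errors += (uint32_t)__builtin_popcount(SRAM_BASE[i] ^ s);
        }
        return errors;
    }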
Fig. 11 reports the BER at each SoC operating voltage. By construction, our test observed ≈6.6 × 10⁹ bits in total (448 kB × 8 bits × 1800 iterations); the reciprocal of this value, ≈1.5 × 10⁻¹⁰, therefore represents the lower bound of the on-chip test application. When the supply voltage is above 0.575 V, no BER is observable by our tests. Below 0.575 V, as expected, we observed a BER that increases as the memory supply voltage decreases, reaching its maximum measured value at the lowest supply voltage point where the memory was still accessible. The BER measurements confirm that the SRAM supply voltage can be scaled aggressively while keeping the BER within the range tolerated by the uVGG BNN of Section IV.

Fig. 11: Bit error rate versus SRAM supply voltage (measured points and exponential fit).
TABLE IV: Supply voltage range of the Memory Array (MA), Memory Periphery (MP) and Quentin power domains at Nominal, High Efficiency (HEFF) and Ultra-Low Power (ULP) modes.

  OP mode    Vdd (MA/MP/Quentin)    Freq.
  Nominal    0.8 V                  565 MHz
  HEFF       0.5 V                  145 MHz
  ULP        0.42 V                 18 MHz

Fig. 12: Maximum operating frequency Fmax versus supply voltage VDD (measured points and linear fit).