Always-On 674 µW @ 4 GOP/s Error-Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22 nm IoT End-Node
Alfio Di Mauro, Francesco Conti, Pasquale Davide Schiavone, Davide Rossi, Luca Benini
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

A. Di Mauro, F. Conti, P. D. Schiavone and L. Benini are with the Integrated Systems Laboratory, D-ITET, ETH Zürich, 8092 Zürich, Switzerland. F. Conti, D. Rossi and L. Benini are also with the Energy-Efficient Embedded Systems Laboratory, DEI, University of Bologna, 40126 Bologna, Italy.
Abstract—Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme in which error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22 nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5 V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2× w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42 V (50% of nominal) while keeping more than 99% of the nominal accuracy, despite the resulting SRAM bit error rate.

Index Terms—SRAM Voltage Scaling, Binary Neural Networks, Ultra-Low Power, IoT, Near-Threshold Computing.
I. INTRODUCTION

THE latest advances in the Internet-of-Things (IoT) are changing the nature of edge-computing devices. End-nodes have to support, in place, an increasing range of functionality, for example video and audio sensory data processing and complex system-control strategies. These new capabilities will enable applications such as an entirely new class of biomedical devices [1], autonomous insect-sized drones [2], and cheap smart sensors [3] that continuously check the stability of bridges, tunnels, and other structures. Machine learning algorithms, and specifically deep neural networks (DNNs), have shown outstanding performance on these tasks. However, while DNNs fit well within the performance and power budgets of embedded GPUs and FPGAs, deploying such compute-intensive algorithms on battery-powered IoT end-node platforms, characterized by heavily constrained power budgets (typically µW to 100 mW), still constitutes a huge challenge, as these platforms are expected to achieve lifetimes in the order of months, years, or even decades. As such, recent
research efforts from both industry and academia have focused on enabling the deployment of deep inference on devices operating in the sub-100 mW power range [4]–[11].

The most common approach to reducing average power consumption, widely used in commercial microcontrollers and IoT end-nodes, is duty cycling (a.k.a. sleep-walking). Under this paradigm, the system stays in deep-sleep mode most of the time, with a power consumption in the range of 100 nW to a few µW, and wakes up to perform the acquisition and classification task (e.g. with a CNN) only when triggered, either by an externally generated event or by an internal timer. In the first case, this approach requires fine-tuning of the trigger to reduce the number of false-positive activations, which can be quite difficult to achieve in a real scenario and which generalizes poorly to other contexts. Alternatively, if a time-based trigger is employed, thereby exploiting a simple triggering mechanism, the computing cost of an accurate CNN is so high that the active energy rapidly becomes dominant even at low duty-cycling rates: the average power P_avg = D · P_active + (1 − D) · P_sleep is dominated by the first term even for small duty factors D. As a result, the latter approach is inefficient whenever a fast reaction time is required at the sensor edge.

A common technique to reduce the active power of deeply embedded computing platforms is near-threshold computing [12]. Scaling voltage together with frequency significantly improves the energy efficiency of computation by exploiting the quadratic dependency of dynamic power on supply voltage. However, aggressive voltage scaling has a significant impact on the operating frequency of the logic and on the reliability of the memory elements of the system, especially those based on SRAMs. While the frequency degradation at low voltage can be recovered by exploiting powerful and efficient hardware accelerators, the SRAM reliability issue remains an unsolved problem. Therefore, in most cases 6T-SRAMs have to be replaced by more resilient, custom solutions such as SRAMs composed of 8T or 10T bitcells supported by read and write assist circuits [13], [14]. Among the approaches adopted to improve the resiliency of memory elements at low voltage, the use of standard cell memories is particularly convenient, since they are built on top of standard library cells such as flip-flops or latches, which are much more resilient than SRAMs when operating close to the threshold voltage of transistors [15], [16]. This comes at a significant cost in terms of area; on the other hand, a relatively large on-chip memory is necessary to enable complex algorithms based on DNNs [9], [17].

In recent years, BNNs [18]–[20] have become popular in the embedded computing domain for achieving remarkable accuracy on many complex classification tasks, narrowing the gap that separates them from state-of-the-art fixed-point or floating-point CNNs. Compared to fixed- or floating-point CNN implementations, which rely on convolutions, BNNs are characterized by a very lightweight hardware implementation of the datapath. Binary convolution can be implemented with simple logic elements such as XNOR gates, requiring only a very limited number of area-hungry adders for partial-sum accumulation. Moreover, BNNs also feature lower memory footprints than CNNs, reducing the energy consumed on the memory side for storing weights and intermediate results.

These features make BNNs a good candidate for all scenarios where power consumption is a major concern but, at the same time, high responsiveness to sensor stimuli needs to be ensured (e.g. pico-sized autonomous navigation robots or surveillance nodes). The low power envelope of BNNs allows using them as data pre-filtering algorithms, specifically to extract semantically meaningful information in an always-on operating mode [21].
In this context, BNNs can be used as the first stage of a staged inference pipeline, composed of low-power, less accurate early inference stages and computationally powerful fixed- or floating-point CNN implementations as later stages [22]. As the memory footprint of BNNs is significantly lower than that of CNNs, bigger topologies can be supported within the same power/performance budget, enhancing the generalization capability of the early stages and thereby lowering the occurrence of false-positive triggering. Additionally, the employment of BNNs as a preliminary filtering stage does not prevent the adoption of conventional power-saving strategies like duty cycling or sleep-walking at run time, whenever the application latency requirements allow it.

One of the advantages of DNNs, and of BNNs in particular, is their high robustness to noise [23]. The high resiliency of BNNs to random errors stems from the fact that, as opposed to traditional neural networks, where activations and weights are represented by integer numbers, no bit in their activations and weights is inherently more significant than any other. As a consequence, no bit is more vulnerable than any other: information processing is spread equally among all bits, and only a very high error rate can bring a dramatic loss in quality-of-results. This noise robustness is a very powerful feature, since it enables very aggressive power reduction techniques to be applied to the memories as well.

In this work, we advance the state-of-the-art in ultra-low power deep inference with BNNs with four key contributions:

i) We propose a strategy to execute noisy BNNs on microcontrollers. To the best of our knowledge, we propose the first complete and fully programmable end-node SoC architecture, together with a BNN inference data and code allocation strategy, enabling the execution of hardware-accelerated BNNs at ultra-low voltage.

ii) We describe and demonstrate on silicon a hybrid memory architecture composed of big SRAMs for error-resilient data and smaller Standard Cell Memories (SCMs) to hold vulnerable data such as microcontroller instructions and stacks. We also provide a methodology to efficiently exploit such a memory architecture. The hybrid memory architecture template presented in this work is easily applicable to other platforms.

iii) We present a self-test strategy for Bit Error Rate measurement performed on large SRAMs. This approach allows characterizing SRAM memories at ultra-low voltages, thereby estimating the amount of noise injected into the BNN.

iv) We demonstrate the validity of this architectural concept on an advanced prototype manufactured in GlobalFoundries 22nm FDX technology, using the safe SCMs to hold a microcontroller program that tests SRAM bit error rates with millions of random reads/writes, operating down to 420 mV (50% of the nominal supply voltage) for both logic and memories. Finally, we show that using the embedded hardware accelerator for BNNs, our prototype can be operated at 18 MHz, down-scaling the voltage to 420 mV for both logic and memories. In this operating point, the prototype achieves up to 99% of the nominal accuracy on a BNN trained for the CIFAR-10 dataset, while operating with an energy efficiency of 170 fJ/op and within a power envelope of 674 µW, enabling the embedding of advanced BNN-based cognitive functionality in ultra-low power "TinyML" devices such as biomedical sensors, long-lifetime environmental sensors, and insect-sized pico-UAVs.

The rest of the paper is organized as follows: Section II discusses related work in the state-of-the-art. Section III introduces the proposed SoC architecture. Section IV discusses the simulations we performed to evaluate the resilience of BNNs against SRAM errors. Section V details the experimental methodology used to evaluate the SoC and the results of the evaluation in terms of Bit Error Rate (BER), power, and energy efficiency. Section VI draws conclusions.

II. RELATED WORK
Recently, there has been a strong push towards the deployment of sophisticated artificial intelligence (in particular DNNs) on tiny end-node architectures dedicated to the extreme edge of the IoT, fostering a fast-growing TinyML research community [24], which has explored the field from two converging directions. On one hand, DNN topologies are being shrunk [25], reducing the number [26] and numerical precision of network parameters [27], moving from floating point down to highly quantized numerical representations, e.g. 8 or 4 bits, and ultimately to BNNs [18]. On the other hand, edge computing platforms are supporting this trend by becoming more and more specialized for efficiently executing machine learning workloads [17], [28]. In this section, we focus on the latter research direction and describe research works related to the SoC proposed in this paper in a top-down fashion. We start from software-programmable architectures targeting the end-nodes of the IoT, go through specialized heterogeneous and error-resilient hardware architectures, and end with dedicated architectures for CNN inference exploiting extreme quantization and error resiliency.
A. IoT End-Node Architectures
A fundamental element of all IoT end-node architectures is software programmability, typically based on tiny microcontrollers with ARM Cortex-M class processors. Significant commercial examples of such micro-systems have been proposed by all major embedded systems vendors, such as TI [29], STMicroelectronics [30], NXP [31], and Ambiq [32]. These systems feature aggressive sleep-walking capabilities thanks to sub-µW deep-sleep modes, leading to an extremely small average power. On the other hand, current research in IoT end-nodes is moving towards optimizing both active and sleep states by exploiting near-threshold and sub-threshold operation. These techniques further improve energy efficiency and reduce power consumption during computation [33]–[37]. Mr. Wolf [38] couples aggressive deep-sleep capabilities with an energy-proportional architecture, exceeding the computational capabilities of ULP microcontrollers by two orders of magnitude while offering competitive energy efficiency also at low and sporadic workloads. This is achieved thanks to a heterogeneous parallel architecture composed of an always-on autonomous I/O subsystem coupled with a parallel accelerator with 8 floating-point-capable RISC-V cores.

To target specific computation domains such as CNNs, some commercial architectures leverage lightweight SW acceleration and optimized DSP libraries to improve performance. A well-known example is CMSIS, developed by ARM, a set of libraries to optimize DSP applications on Cortex-M architectures, and CMSIS-NN [39], tuned to the deployment of embedded neural networks. An extension to these libraries has been proposed by Rusci et al. [40], targeting highly quantized networks such as 4-bit, 2-bit, and binarized networks [41]. However, due to their 32-bit nature, fully programmable solutions can only partially exploit the benefits of quantized NNs. While this approach significantly reduces the memory footprint of CNNs, several additional operations are required to pack/unpack activations and weights into arithmetic formats suitable for software processing (e.g. 16-bit or 8-bit) [40], degrading the performance and energy efficiency of inference. Modern microcontrollers have introduced dedicated ISA extensions to efficiently perform sub-word, sub-byte and SIMD operations [42], [43], mitigating this performance degradation.

To improve the overall efficiency of systems dedicated to NN acceleration, recent SoCs couple programmable processors with hardwired accelerators, in some cases exploiting low-precision functional units to leverage the resiliency of CNNs to quantization. Intel presented an IoT edge mote integrating an x86 processor accelerated by dedicated functional units for CNN and cryptography workloads [44]. Conti et al. proposed Fulmine [28], a heterogeneous SoC coupling four general-purpose processors with a convolutional accelerator: while the convolutional layers of CNNs run on the accelerator, other functions such as activations and pooling execute on the software processing cluster. GAP-8 [45] includes a specialized accelerator for convolutional neural networks supporting 16-bit precision for activations and 16-bit, 8-bit and 4-bit precision for weights, achieving up to 600 GMAC/s/W within a 75 mW power envelope. Another notable device is the low-power vision sensor node presented by Qualcomm Technologies [46], which performs end-to-end always-on visual detection tasks thanks to an ultra-low power QVGA CMOS sensor and a fully digital processor subsystem integrated as a single device. This architecture performs video processing at the sensor edge within an ultra-low power envelope by exploiting low-resolution sensing, data sparsity and event-driven computing, ultimately outputting only meta-information when meaningful events are detected.

In this work, we propose a near-threshold SoC joining the flexibility of a software-programmable 32-bit RISC-V processor, integrated into a state-of-the-art microcontroller featuring a rich set of peripherals, with the performance boost of a dedicated accelerator for BNN workloads, pushing quantization to the limit. On top of the flexibility and performance of this heterogeneous architecture, we propose a heterogeneous memory architecture exploiting the error resiliency of BNNs with respect to random errors in the memory system. To our best knowledge, the SoC described in this paper reports the lowest full-system power for active operation and always-on BNN inference presented in industry or academia.
B. Heterogeneous and Error Resilient Memory Architectures
Optimizing the memory hierarchy is one of the main concerns in IoT end-nodes operating at near-threshold, since memory can be the dominant source of power consumption, potentially jeopardizing their energy efficiency [34], [47], [48]. While many approaches rely on the custom design of low-voltage memories [14], [49], which come with associated area and power overheads (e.g. 8T or 10T bitcells, read and write assist circuits) [13], an emerging trend relies on approximate SRAMs, often joined with precision/performance tunability or heterogeneous memory architectures. Frustaci et al. [50] proposed approximate SRAMs for error-tolerant applications, in which energy is saved at the cost of read/write errors by exploiting voltage scaling, selective error correction codes (SECC), and selective write-assist techniques (SNBB). Compared to voltage scaling at iso-quality, the joint adoption of these techniques provides a substantial energy reduction at a negligible area penalty. Other works propose the adoption of emerging technologies to realize approximate memory cells, such as RRAM [51] and memristors [52].

Although all the aforementioned approaches are effective, they all require the design of custom SRAM banks (either approximate or not), and they feature deep circuit-level optimizations that cannot be easily integrated into automatic memory generators. Other approaches exploit heterogeneous memory architectures mixing standard SRAMs and latch-based Standard Cell Memories (SCMs). While SRAMs cannot be considered reliable below relatively high voltages (e.g. 0.8 V in the technology considered in this work), SCMs can operate over the same, typically much wider, voltage range as the rest of the logic [15]. Tagliavini et al. [53] proposed a HW/SW methodology to design energy-efficient ULP systems that combines a hybrid memory design, where part of the memory system is approximate and part is precise, with an error-aware allocation strategy. Similarly to this work, our approach leverages standard 6T-SRAM cells that can be realized with memory generators provided by silicon vendors, and SCMs that can be implemented with standard semi-custom design flows relying on industrially qualified standard cells. On the other hand, our work exploits the resiliency of binarized neural networks, where the position of a bit-flip error within a word is irrelevant to the quality of the final result, making them a much more suitable candidate for approximate computing.

C. Dedicated Hardware Accelerators for DNNs and BNNs
Many dedicated hardware accelerators specifically designed to bring deep learning to an ultra-low power budget have been proposed. Most designs employ fixed-point representations for weights and activations (e.g. Orlando [54], achieving up to 2.9 Top/s/W). Pruning and compression are popular techniques to further reduce the power budget [55]–[59]. Binary Neural Networks [18], [19] constitute a particularly interesting niche due to their properties, as they can be trained to achieve results similar to their full-precision counterparts [60] while keeping a smaller footprint, a more scalable structure and a higher resilience to errors, as further explored in Section IV. FINN [61] was the first architecture capable of reaching more than 200 Gop/s/W on an FPGA. Many of the most recent efforts towards the deployment of BNNs on silicon, such as BRein [62], XNOR-POP [63], Conv-RAM [6], as well as the BTNN accelerator proposed by Yin et al. [64], the BNN accelerator presented by Wang et al. [65], and the work by Khwa et al. [66], have achieved energy efficiencies in the range of 10-50 TOP/s/W using in-memory computing. Similar results have been claimed by more "traditional" ASICs such as UNPU [4] and XNORBIN [67]. Mixed-signal [5] and in-memory mixed-signal approaches [68]–[71] are able to achieve up to 10-100× higher efficiency, but pay a very significant cost in terms of design time, verification, and scalability to real systems. Yang et al. [23] exploit one such system in their work, where, similarly to what we propose, SRAM is aggressively voltage-scaled to achieve a power benefit. Our own system exploits a similar technique to Yang's, with the important distinction that their work is an extremely specialized ASIC, capable of executing a single BNN topology. By contrast, our design is a complete and fully programmable IoT end-node along the lines of those discussed in Section II-A, augmented with a very small hardware accelerator [72].

III. QUENTIN SOC

This section introduces the system architecture of the Quentin SoC, focusing on the micro-architecture of the binarized neural network accelerator (XNE) and on the heterogeneous system architecture and its implementation strategy, which enable the power/performance/precision tunable capabilities of the system. The architecture of Quentin is reported in Fig. 1.
A. System Architecture
The system under examination consists of an advanced microcontroller based on the open-source PULPissimo system (https://github.com/pulp-platform/pulpissimo), part of the Parallel Ultra-Low-Power (PULP) platform (https://pulp-platform.org). The SoC is built around a RISC-V processor (RI5CY) [42] optimized for energy-efficient digital signal processing. The core's pipeline
features 4 stages, includes a floating-point unit, and is fully compliant with the RV32IMFC ISA [73]. On top of the standard RISC-V ISA, the processor features digital signal processing extensions targeting energy-efficient near-sensor data analytics. These extensions include hardware loops, automatic pointer post-increment on memory accesses, bit-manipulation instructions, fixed-point and packed single-instruction-multiple-data (SIMD) operations, and unaligned memory accesses.

The system features a full set of peripherals, which include Quad-SPI (QSPI) supporting up to two external devices, I2C, 2x I2S, a parallel camera interface, UART, GPIOs, JTAG, a DDR HyperBus interface to connect up to 64 MB of external Dynamic RAM (DRAM) or FLASH memory off-chip, and a small ROM used to store the boot code. An I/O DMA (µDMA [74]) autonomously manages data transfers through the peripherals to minimize the workload of the processor. To improve the efficiency of the system and the flexibility of transfers from/to the peripherals, each peripheral has a dedicated clock domain. Two Frequency-Locked Loops (FLLs) adjust the frequencies of the peripheral subsystem and the core subsystem. Moreover, peripherals are equipped with clock dividers that allow fine-tuning their frequency according to the desired bandwidth. This architecture allows tuning the performance of computation and I/O transfers, minimizing the system-level power consumption for the desired performance target.

Fig. 1: Quentin SoC architecture. The core subsystem is highlighted in violet, the peripheral subsystem in blue, and the memory subsystem in green.
B. Hybrid Memory Architecture
The L2 memory of the proposed SoC, shown in the bottom part of Fig. 4, consists of a heterogeneous memory architecture designed to operate over a wide voltage range and to optimize access to the different regions of the memory depending on their purpose. From an architectural point of view, the memory is composed of two regions. The first one is a 64 kB private memory that can be used by the Fabric Controller (FC) for storing its program, the stack, and other private data. This portion of the memory, connected to the interconnect through two ports (one for instructions and one for data), is typically not shared with other initiators; hence it does not incur any kind of conflict, guaranteeing full bandwidth. The second portion, called L2 interleaved memory (Fig. 4), is composed of four 114 kB banks that can be accessed in parallel by the masters (i.e., µDMA [74], instruction, and data ports) while minimizing the banking conflict probability thanks to the interleaved addressing scheme implemented by the interconnect. From a performance viewpoint, this memory organization enables transparent sharing of the L2, increasing the system memory bandwidth by 4× compared to the traditional single-port memory architecture typical of AHB-based MCUs [30], without the use of power-hungry dual-port memories.

Both memory regions described above are heterogeneous also from the memory technology point of view, being implemented as a hybrid mix of SRAM and standard-cell-based memory cuts (SCM). The SCMs are based on the architecture described in [75]. Each of the interleaved banks has 112 kB of SRAM and 2 kB of SCM, while the private bank has 8 kB of SCM, as shown in Figure 1, with the rest implemented as SRAM. The SCM portion of the private bank is implemented as a 3-read, 2-write-port register file: two of the read ports and one of the write ports are dedicated to the data and instruction interfaces of the RISC-V core, while one read and one write port are used by the interconnect arbiter for any other master node of the system. Besides the intrinsic flexibility of synthesizable IPs, which makes them well suited to implement multi-port cuts, one of the main advantages of latch-based memories is their capability, empirically proven in this work, to operate reliably in a much wider supply voltage range than SRAMs. Moreover, they feature significantly smaller read and write energy with respect to traditional SRAMs, up to 4× depending on the configuration (i.e. leakage-dominated vs. dynamic-dominated) [75]. On the other hand, they pay a significant area overhead with respect to SRAMs, which makes them suitable only for implementing very small memory regions, usually in the order of a few kB [75].
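To make the error-aware allocation strategy concrete, the following C fragment sketches how data could be pinned to the two memory types at compile time. It is a minimal illustration only: the section names (.scm_private, .sram_interleaved), the array sizes, and the GCC-style attributes backed by a linker script mapping those sections onto the SCM and SRAM regions are our assumptions, not the actual Quentin SDK.

    #include <stdint.h>

    #define WEIGHT_WORDS (64 * 1024)  /* illustrative sizes, not Quentin's */
    #define ACT_WORDS    (16 * 1024)

    /* Vulnerable data goes to the always-reliable SCM region: program code
       and stack are placed there by the linker script; error-critical
       constants such as the 8-bit activation thresholds can be pinned
       explicitly. */
    __attribute__((section(".scm_private")))
    uint8_t bnn_thresholds[128];

    /* Error-tolerant BNN payload goes to the voltage-scaled SRAM banks:
       a bit flip here perturbs one binary weight or activation, which the
       network tolerates up to the BERs characterized in Section IV. */
    __attribute__((section(".sram_interleaved")))
    uint32_t bnn_weights[WEIGHT_WORDS];

    __attribute__((section(".sram_interleaved")))
    uint32_t bnn_activations[ACT_WORDS];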
C. XNOR Neural Engine and BNN Execution Model
To execute binary neural networks with high performance and energy efficiency, the Quentin SoC also contains a dedicated hardware accelerator called the XNOR Neural Engine (XNE) [72]. The XNE is connected as a master to the interleaved L2 memory. It has four ports, for an overall memory bandwidth of 128 bits per cycle. All the configuration registers are memory-mapped and accessible by the core. The XNE can execute both convolutional and fully connected layers autonomously from the core, once all data reside in L2. Figure 2a schematizes the internal architecture of the XNE. It is divided into a control submodule, responsible for receiving jobs from the core; a streamer submodule, translating internal data streams into actual memory transfers on the memory interconnect towards L2; and a datapath that performs binary matrix-vector products. The controller includes a memory-mapped slave interface to a configuration register file, a controller finite-state machine, and a small microcoded loop that implements the BNN layer execution pattern described below.
Fig. 2: a) XNE internal architecture, showing the streamer (green shades), control (orange) and datapath (blue) submodules; b) BNN layer execution pseudo-code highlighting microcoded loops (orange) and datapath execution (blue). The pseudo-code of Fig. 2b is reproduced below:

    for i in range(output_height):
      for j in range(output_width):
        for ko_major in range(nb_out_chan/128):
          for ko_minor in range(128):
            acc[ko_minor] = 0
          for ui in range(filter_height):
            for uj in range(filter_width):
              for ki_major in range(nb_in_chan/128):
                for ko_minor in range(128):
                  ko = ko_major*128 + ko_minor
                  popcount = 0
                  for ki_minor in range(128):
                    ki = ki_major*128 + ki_minor
                    binary_prod = W[ko,ki,ui,uj] * x[ki,i+ui,j+uj]
                    popcount += binary_prod
                  acc[ko_minor] += popcount
          for ko_minor in range(128):
            y[i,j,ko_major*128+ko_minor] = 0 if acc[ko_minor] < threshold[ko_minor] else 1

An INPUT BUFFER is loaded with a stationary set of 128 input feature bits, with each bit representing a different input channel (0 represents a '-1' value, 1 a '+1' value). The stationary input is multiplied with 128 bits of weights that are dynamically streamed in each cycle, using 128 parallel XNOR gates. The XNOR gates are followed by a 128-way reduction tree that performs a POPCOUNT operation, reducing the output to a 7-bit number. Overall, the XNOR & POPCOUNT unit performs a full 128x128 binary matrix-vector product in 128 cycles, which is used to implement the innermost loops of a convolutional or linear BNN layer. To implement the outer loops, popcount outputs are accumulated in a set of 128 registers (one per output channel) of 16 bits each. After the accumulation is completed, the accumulated values are activated and binarized by comparing them with a set of 8-bit activation thresholds that are streamed in from memory and left-shifted by a configurable amount to be comparable with the 16-bit accumulators. The execution is iterated as specified by the microcoded loop to implement a full BNN layer; if the granularity of the layer is smaller than 128 input or output channels, the datapath can be configured accordingly. Figure 2b describes the full execution schedule as pseudo-Python code; we refer to Conti et al. [72] for further detail.

Since the XNE operates at the granularity of a single BNN layer, the execution of a full network relies also on the operation of two other modules in the SoC: the RI5CY core, operating as a lightweight controller, and the µDMA engine, which is used to load inputs from I/O. While the Quentin SoC is designed with the capability to access an external IoT DRAM if necessary, in this work we focus on the execution of relatively small, fully on-chip BNNs, which can be run within an ultra-low power budget by means of aggressive voltage scaling, and which access I/O exclusively to fetch input frames. Figure 3 shows the execution profile of an example four-layer BNN, along with the C runtime code that is used to run it. Runtime API calls wrap the memory-mapped control interfaces that both the µDMA and the XNE expose; therefore, control is realized with fully compliant ANSI C code using regular load/store operations and requires no extension to the RISC-V ISA. In the runtime code, udma_get_input API calls are synchronous and xne_run is asynchronous, with an explicit xne_wait bringing RI5CY to sleep.

Fig. 3: a) Execution profile for an example 4-layer fully on-chip BNN; b) ANSI C runtime code executed on RI5CY for the same 4-layer BNN. The runtime code of Fig. 3b is reproduced below:

    // x0, x1, x2, x3, y are statically allocated in L2 memory
    // W0, W1, W2, W3 are statically allocated in L2 memory and filled with weight values
    uint8_t *x[4] = { x0, x1, x2, x3 };
    uint8_t *W[4] = { W0, W1, W2, W3 };
    int xne_job_id;

    // first execution
    udma_get_input(x0); // fill x0 with next input data frame
    while (1) {
      for (int i=0; i<3; i++) {
        xne_program(x[i], W[i], x[i+1], CH_OUT[i], CH_IN[i],
                    HEIGHT[i], WIDTH[i], FILTER_SIZE[i]);
        xne_job_id = xne_run(); // start execution of layer
        xne_wait(xne_job_id);   // RI5CY sleeps and waits for XNE end of computation
      }
      xne_program(x[3], W[3], y, CH_OUT[3], CH_IN[3],
                  HEIGHT[3], WIDTH[3], FILTER_SIZE[3]);
      xne_job_id = xne_run();
      udma_get_input(x0);   // fill x0 with next input data frame
      xne_wait(xne_job_id); // RI5CY sleeps and waits for XNE end of computation
    }

Fig. 4: Quentin SoC floorplan.

TABLE I: Quentin SoC features.

  Technology                   CMOS 22 nm FD-SOI
  Chip Area                    2.3 mm²
  Memory (SRAM + SCM)          520 kB
  Equivalent Gates (NAND2)     1.8 Mgates
  Voltage Range                0.42 V – 0.8 V
  Body Bias Range              0.0 V – 1.4 V
  Frequency Range              32 kHz – 670 MHz
  Frequency Range (with FBB)   32 kHz – 938 MHz
  Power Range                  300 µW – 10.4 mW
  Power Range (with FBB)       300 µW – 66.2 mW
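To clarify the arithmetic implemented by the XNE datapath, note that for bipolar values a, b ∈ {−1, +1} encoded as single bits (1 encodes +1, 0 encodes −1), the product a·b is +1 exactly when the two bits are equal, i.e. when XNOR(a, b) = 1; an N-element dot product therefore equals 2·popcount(XNOR(w, x)) − N. The C sketch below illustrates this identity in software; it is a functional illustration under our naming, not the XNE's hardware description.

    #include <stdint.h>

    /* Binary dot product over 32*words bipolar elements packed one per bit,
       followed by threshold binarization, mirroring the XNE's
       XNOR + POPCOUNT + accumulate + threshold pipeline in plain C. */
    static int binary_dot(const uint32_t *w, const uint32_t *x, int words)
    {
        int popcount = 0;
        for (int i = 0; i < words; i++)
            popcount += __builtin_popcount(~(w[i] ^ x[i])); /* XNOR, then count */
        return 2 * popcount - 32 * words; /* map bit count back to bipolar sum */
    }

    /* Binarized activation: in the XNE, 8-bit thresholds are streamed from
       memory and left-shifted to match the 16-bit accumulators; a plain
       integer threshold stands in for that here. */
    static inline uint32_t binarize(int acc, int threshold)
    {
        return (acc < threshold) ? 0u : 1u;
    }

This is also why a flipped bit in a weight or activation perturbs the dot product by exactly ±2, regardless of the bit's position within the word.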
D. Chip Implementation
Figure 4 shows the floorplan of the Quentin SoC, while Table I summarizes its main features. The SoC was implemented in 22 nm FD-SOI technology using a flip-well (LVT) standard cell library. The design was synthesized with Synopsys Design Compiler 2016.12, while Place & Route was performed with Cadence Innovus 16.10. Fig. 6 shows a micrograph of the Quentin SoC (the area of the micrograph that is not annotated contains independent designs fabricated on the same chip). The floorplan area of the SoC is 2.31 mm² and its effective area is 1.22 mm² (6154 kGE). Its main modules are highlighted in Figure 4. The two largest components of the SoC are the SRAM banks of the L2 memory subsystem (504 kB) and the 16 kB of SCM banks. Although the latch-based implementation features approximately a 10× area overhead compared to approaches based exclusively on SRAMs (Table II, Fig. 5), it allows major energy savings [75], and it enables more flexible power management strategies that can be played at the system level. For example, SRAMs and SCMs can be independently power-gated. Additionally, on SCMs, it is possible to scale the operating voltage more aggressively than for SRAMs: our tests reported no errors when the supply voltage of the SCMs is scaled down to 0.42 V; on the contrary, errors on SRAMs become appreciable already at 0.575 V (Section V), limiting the voltage scaling capability of the system.

To exploit both the energy advantage of SCMs and the area density advantage of SRAMs, and to enhance the power/performance/precision tuning capabilities of Quentin, the chip was implemented as a multi-power-domain system. The SRAM cuts have separate power connections from the rest of the logic for both periphery and array, as shown in Figure 7. This configuration allows us to independently tune the supply voltage of logic circuits, memory arrays, and memory periphery. Moreover, it allows the system to operate in an ultra-low-power, highly voltage-scaled mode using only the 16 kB SCM memories, and to shut down the SRAM via an off-chip power switch.

Fig. 5: Quentin SoC area breakdown.

TABLE II: Quentin area breakdown in mm².

  CPU subsystem    0.020
  SRAM (504 kB)    0.817
  SCM (16 kB)      0.292
  ROM              0.009
  I/O subsystem    0.056
  XNE              0.014
  Interconnect     0.009

Fig. 6: Quentin SoC chip micrograph.

Fig. 7: Quentin SoC Power Domains.

IV. BNN ERROR RESILIENCE ANALYSIS
As argued in Section I, BNNs have been shown to be partially resilient to high error rates. For example, Yang et al. [23] use a statistical model to quantify the accuracy drop of a BNN in an application-specific architecture, reporting only a limited accuracy drop even at relatively high BERs. In this section, we evaluate the final classification accuracy of different pre-trained BNN topologies under multiple SRAM BER conditions. The goal of our analysis is to exploit the error resilience of BNNs to enable major energy efficiency gains on SoC architectures featuring heterogeneous memory subsystems. Our results are silicon-proven on the Quentin chip. We performed our analysis on the CIFAR-10 classification data set.

The BER reported in our experiments refers to data being fetched from the SRAM. In Quentin, the source from which the XNE fetches and stores binary data (i.e. weights, activations and partial results of internal BNN layers) is not fixed at design time. As described in Section III, this data can reside in either the interleaved SCM or SRAM memories; the XNE accelerator always holds partial sums inside its accumulation buffer, and only fully binarized outputs are stored back to the shared memory.

In this scenario, we identified three potential sources of errors affecting the final BNN classification accuracy: i) weight reads; ii) input feature reads; iii) activation stores. In our experiments, the XNE data-flow partial results are not affected by errors, as they are held inside the local buffer of the accelerator. Additionally, output activations are binarized by comparing the final accumulation value y with a safe 8-bit threshold value τ, which is stored in the error-free SCM memory. Input features, weights, and activations reside in the SRAM, potentially corrupted by errors.

To evaluate the accuracy loss when data are corrupted by a certain BER, we performed a set of simulations using the PyTorch 1.0.1 framework. We targeted a set of networks pre-trained on CIFAR-10, based on Hubara's implementation (https://github.com/itayhubara/BinaryNet.pytorch). We added uniformly distributed errors to the inputs, weights, and activations of all layers of the network according to the target BER values, and tested the noisy BNNs on the test set of the CIFAR-10 data set. The networks were not re-trained to compensate for the additional noise injected during the simulation. This exploration ultimately allows evaluating the overall effect, in terms of final classification accuracy, of aggressive voltage scaling performed on the SRAM. In other words, the SRAM supply voltage can be changed dynamically, depending on the tolerable quality-of-results drop of a target application. In our experiments, we also explored the case where errors occur with recurring patterns; in this scenario, we did not observe any significant difference in the final classification accuracy with respect to the case where errors were uniformly distributed.

Fig. 9 shows the results in terms of classification accuracy versus the BER. The classification accuracy of each network is reported as an average over 100 randomized experiments over the CIFAR-10 test set; the standard deviation of the results over this sample is always a small fraction of the reported value. We report results on Hubara's topology, as well as on a network inspired by the one proposed by Yang et al. We also report results on our own topology, similar to Hubara's [19] but sized to be deployable on the Quentin SoC. Figure 8 shows the topology of this latter network, which we call micro-VGG (uVGG). Table III reports the salient characteristics of these networks. For what concerns the network proposed by Yang et al. [23], we were not able to reproduce the exact training setup: our PyTorch implementation, on which our experimental results are based, achieves significantly lower accuracy than what Yang et al. [23] report (78.6% instead of 85%).

In the remainder of the paper, we focus on our proposed network, which keeps its accuracy drop below 7‰ across the BER range of interest, while fitting perfectly in the SRAM of the Quentin SoC.

Fig. 8: uVGG BNN topology (3x3 convolutional layers with 2x2 pooling, followed by fully connected layers).

TABLE III: Parameters of BNNs used in the resilience experiment.

  BNN topology                 Nominal accuracy    Mem. footprint^a
  Based on Yang et al. [23]    78.6%               319 kB
  Hubara et al. [19]           90.9%               4545 kB
  uVGG                         —                   —

  ^a Including both activations and weights.

Fig. 9: BNN resilience versus bit-error rate (classification accuracy of uVGG, the network based on Yang et al. [23], and Hubara et al. [19]).
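Although our resilience analysis was carried out in PyTorch, the injection mechanism itself is simple enough to restate in C on bit-packed binary tensors: every stored bit is flipped independently with probability equal to the target BER. The sketch below is illustrative only; the function name and the rand()-based Bernoulli draw are ours, not the actual simulation code.

    #include <stdint.h>
    #include <stdlib.h>

    /* Flip each bit of a packed binary buffer independently with probability
       `ber`, emulating uniformly distributed SRAM read errors. */
    static void inject_bit_errors(uint32_t *buf, size_t words, double ber)
    {
        for (size_t i = 0; i < words; i++) {
            uint32_t flips = 0;
            for (int b = 0; b < 32; b++)
                if ((double)rand() / (double)RAND_MAX < ber)
                    flips |= 1u << b;
            buf[i] ^= flips; /* corrupted value, as if read from noisy SRAM */
        }
    }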
V. EXPERIMENTAL RESULTS

In this section, we describe the results of the experiments that allowed us to correlate SRAM supply voltage scaling with the classification accuracy of the uVGG BNN presented in Section IV. As a first step, we evaluated the level of noise that could potentially corrupt the data stored in the SRAM by measuring how the Bit Error Rate (BER) correlates with the memory array and periphery voltage supplies. As a second experiment, we measured the current drained by each power domain of the SoC to extract the power consumption; the power contributions reported in this section refer to the independent power rails described in Fig. 7. Finally, we computed the energy efficiency of the SoC and evaluated the power saving when the supply voltage of the SRAM is scaled and the quality-of-results (i.e. top-1 network accuracy) is degraded by less than 1%.
A. Experimental setup
All the measurements related to SRAM BER evaluation and per-domain power consumption have been performed using an Advantest hp93000 SoC integrated circuit tester. Supply voltages have been precisely regulated using dedicated hp93000 power supply channels. Power measurements have been performed using current measurement channels integrated into the hp93000 device and connected in series with the voltage supply channels.

BER experimental data have been obtained by running a self-test C application on the RI5CY microprocessor, executing from the private SCM of the core, which is error-free in all operating points tested. We loaded the test program into the SCM through the SoC internal debug interface, via a standard JTAG interface driven by the hp93000 digital channels. Fig. 10 shows the block diagram of the experimental setup.

Fig. 10: High-level block diagram of the experimental setup.
B. Bit Error Rate analysis
Measuring bit error rates from outside, i.e. directly from the tester equipment, requires a very large testing time. In our tests, we observed that the number of bits to observe in order to detect a single-bit memory failure is on the order of the reciprocal of the BER or higher, which becomes extremely large for SRAM operating at nominal supply voltage conditions. Additionally, to acquire relevant statistics on memory errors, tests have to be repeated many times. To reduce the number of pads dedicated to SoC debug subsystems, modern microcontrollers often employ serial debug interfaces connected to a shared bus; accessing the memory locations through a serial JTAG debug interface designed for reliability rather than for speed is therefore a severe limitation for BER measurement. In our tests, we estimated that a single BER measurement point would be acquired in several tens of minutes, assuming 448 kB of memory tested for 1800 iterations at a JTAG frequency of 1 MHz, with each measurement repeated 10 times. To overcome the serial debug interface bottleneck, we designed an on-chip BER test application executed by the microcontroller core, which reduces the time to test a single BER point by a factor of approximately 100×.

To issue memory transactions to the SRAM and observe errors on the bits, our self-test application runs directly on the RI5CY core of the SoC, which operates at the highest reliable frequency for each condition. Pseudo-random test patterns are generated by the core using a lightweight 32-bit Linear Feedback Shift Register (LFSR) implemented in C code. The test application sequentially covers the entire SRAM shared address space. Errors are counted by comparing, bit-wise, the data read at each memory location with the ground-truth value generated by the LFSR using the same initial seed. At each supply voltage point, the test is repeated in a loop to obtain a reliable measurement of the BER. Note that this approach could generate artifacts in the error statistics when a memory location is filled in successive iterations with the same test vector; to avoid this problem, and to make our measurement more robust, the software LFSR uses a different seed to generate test data at each new iteration.

In our tests, we measured only the BER of the SRAM banks. The SCM, which is hosted by the same power domain as the circuit logic, was reserved for storing the core instructions of the self-test application and the test results (i.e. the number of errors). Note that storing the software instructions in an error-free memory is mandatory for the application to run at all. In SoCs featuring single-power-domain memory subsystems (i.e. without the possibility of storing core instructions in a separate error-free memory), SRAM errors could also affect core instructions, making aggressive voltage scaling infeasible: a single corrupted bit in a core instruction could derail the core's control flow, bring the entire SoC into unpredictable states, and ultimately make the system fail. For each operating point in our experiments, we performed 1800 on-chip test runs, writing 448 kB at each iteration.
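A minimal sketch of one self-test iteration is given below. It is our reconstruction of the described procedure, not the actual test firmware: the SRAM base address, the word count, and the LFSR tap mask (here a common maximal-length 32-bit polynomial) are illustrative assumptions.

    #include <stdint.h>

    #define SRAM_BASE  ((volatile uint32_t *)0x1C010000u) /* hypothetical map */
    #define SRAM_WORDS (448u * 1024u / 4u)                /* 448 kB under test */

    /* 32-bit Galois LFSR step (right-shift form); tap mask is illustrative. */
    static inline uint32_t lfsr_next(uint32_t s)
    {
        return (s & 1u) ? ((s >> 1) ^ 0x80200003u) : (s >> 1);
    }

    /* One iteration: write a pseudo-random pattern over the whole shared SRAM,
       then read it back and count mismatching bits. The caller passes a fresh
       seed on every iteration to avoid rewriting locations with the value
       they already hold. */
    static uint32_t sram_bit_errors(uint32_t seed)
    {
        uint32_t s = seed, errors = 0;
        for (uint32_t i = 0; i < SRAM_WORDS; i++) {  /* write phase */
            s = lfsr_next(s);
            SRAM_BASE[i] = s;
        }
        s = seed;
        for (uint32_t i = 0; i < SRAM_WORDS; i++) {  /* read & compare phase */
            s = lfsr_next(s);
            errors += (uint32_t)__builtin_popcount(SRAM_BASE[i] ^ s);
        }
        return errors;
    }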
Fig. 11 reports the BER at each SoC operating voltage. By construction, our test observed ≈6.6 × 10⁹ bits in total (448 kB × 8 bits × 1800 iterations); the reciprocal of this value, ≈1.5 × 10⁻¹⁰, therefore represents the lower bound of the on-chip test application. When the supply voltage is above 0.575 V, no BER is observable by our tests. Below 0.575 V, as expected, we observed a BER that increases as the memory supply voltage decreases, reaching its maximum measured value at the lowest supply voltage point where the memory was still accessible. The BER measurements confirm that the SRAM supply voltage can be scaled aggressively while keeping the BER within the range tolerated by the uVGG BNN of Section IV.

Fig. 11: Bit error rate versus SRAM supply voltage (measured points and exponential fit).
TABLE IV: Supply voltage range of the Memory Array (MA), Memory Periphery (MP) and Quentin power domains at Nominal, High Efficiency (HEFF) and Ultra-Low Power (ULP) modes.

  OP mode    Vdd (MA/MP/Quentin)    Freq.
  Nominal    0.8 V                  565 MHz
  HEFF       0.5 V                  145 MHz
  ULP        0.42 V                 18 MHz

Fig. 12: Maximum operating frequency Fmax versus supply voltage VDD (measured points and linear fit).