Linear Delay-cell Design for Low-energy Delay Multiplication and Accumulation

Abstract

A practical deep neural network's (DNN) evaluation involves thousands of multiply-and-accumulate (MAC) operations. To extend DNN's superior inference capabilities to energy constrained devices, architectures and circuits that minimize energy-per-MAC must be developed. In this respect, analog delay-based MAC is advantageous due to reasons both extrinsic and intrinsic to the MAC implementation - (1) lower fixed-point precision requirement for a DNN's evaluation, (2) better dynamic range than charge-based accumulation, for smaller technology nodes, and (3) simpler analog-digital interfacing. Implementing DNNs using delay-based MAC requires mixed-signal delay multipliers that accept digitally stored weights and analog voltages as arguments. To this end, a novel, linearly tune-able delay-cell is proposed, wherein, the delay is realized using an inverted MOS capacitor's (C*) steady discharge from a linearly input-voltage dependent initial charge. The cell is analytically modeled, constraints for its functional validity are determined, and jitter-models are developed. Multiple cells with scaled delays, corresponding to each bit of the digital argument, must be cascaded to form the multiplier. To realize such bit-wise delay-scaling of the cells, a biasing circuit is proposed that generates sub-threshold gate-voltages to scale C*'s discharging rate, and thus area-expensive transistor width-scaling is avoided. For 130nm CMOS technology, the theoretical constraints and limits on jitter are used to find the optimal design-point and quantify the jitter versus bits-per-multiplier trade-off. Schematic-level simulations show a worst-case energy-consumption close to the state-of-art, and thus, feasibility of the cell.


arXiv preprint, [cs.ET], Aug

Aditya Shukla,

Student Member, IEEE

Index Terms: Analog-computing, delay-cell, mixed-signal delay multiplier, multiply-and-accumulate

The author is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI - 48104, USA (e-mail: [email protected]).

I. INTRODUCTION

Recent advances in machine learning algorithms and, particularly, deep neural networks (DNNs) have equipped portable computing devices with human-like inferring, classifying and planning capabilities. The enormous sizes of these networks, with the number of operations per evaluation often running into millions, make remote computing servers indispensable. Reliance on servers increases inference latency, communication energy, risk of privacy loss and traffic, and requires a perpetual connection to the server. Some of these metrics are critical in applications like self-driven cars, which cannot afford delays while making decisions. Delocalizing the computational effort of evaluating an ML model, away from the server and towards the leaf nodes, requires ML-specific energy-efficient computing architectures [1]. Many such architectures have been proposed to greatly accelerate the training and inference speed of DNNs [2]–[4], but much work is needed to efficiently run these networks under the severe energy restrictions that many portable devices operate under.

The computing-energy problem [5] is tackled by: (1) using simpler data-types: these algorithms do not require a large precision and continue to provide similar accuracy with simpler data-types and restricted widths [6]–[11]; (2) minimizing data-transfer: the number of data-fetches shoots up for human-level, large-scale applications of these algorithms, causing significant non-compute (latent) energy losses [2], [4], [12].

Relative robustness of DNNs to precision-loss, together with a limitation on energy, motivates the use of analog computing systems, wherein the loss of information due to noise and process-variability can effectively be modeled as a loss in precision. To maintain energy-efficiency without an excessive (counter-productive) precision-loss, these systems comprise both analog and digital computing units. The computational roles are distributed such that the multiply-and-accumulate (MAC) operations, which form the bulk of a DNN's evaluation, are executed in the analog domain, while other operations (e.g. control-flow, data-communication and storage) are done using binary voltages. Superposable electrical variables like charge [13], [14] and current [15]–[17] physically represent partial sums of a MAC, with a capacitor as an accumulator to store the sum of physical variables.

Recently, time was proposed as an accumulation variable, as it is better than charge and current in the following regards: (1) time-to-voltage/digital converters (TDC, and vice versa DTC) are more area- and power-efficient than voltage-based converters [18], [19].
For instance, both DTC and TDC can be realized out of clocked counters, while voltage ADC/DACs require area- and energy-expensive operational amplifiers; (2) while the noise-floor is relatively constant, the supply voltage, $V_{DD}$, drops with technology nodes. Thus, the dynamic range of accumulation of voltage, current or charge gets increasingly limited; (3) the transition frequency of the FETs, which dictates the temporal resolution of a TDC, increases with technology nodes.

Within the purview of time-based accumulation, pulse-width [20], [21] and pulse-delay [18], [22], [23] are the two modulation schemes that have been demonstrated on-chip. Of these, pulse- (or event-) delay is more promising for MAC applications due to (1) free addition/subtraction in the case of delay, and (2) the requirement of peripheral pulse re-routing circuitry in the former.

Fig. 1. Sub-processes of the delay-cell

A practical delay-MAC must meet the following specifications: firstly, it must accept mixed-signal arguments, one analog and the other digital, for locally stored weights; secondly, it should possess linear voltage-delay (transfer) characteristics to accept externally sensed analog voltages and allow cascading of multiple layers of MACs. To implement a low-energy mixed-signal delay-multiplier, the major challenge is the design of a tune-able delay-cell having linear transfer characteristics. Miyashita et al. [18] first proposed the use of an analog-digital mixed-signal delay-MAC. Common mathematical operations like addition, subtraction, multiplication, and max-/minimization were demonstrated in a clocked time-domain. However, multiplication using clocked time-to-digital converters negated the power-savings expected from an analog processor. Clock-less tune-able delay-cells for MACs were later proposed in [22], where the accumulation after each dot-product in a binary convolutional neural network was carried implicitly by the delays of a series of nMOS resistor-based delay-cells.
However, the use of resistors for enabling scaling of delay led to an area-expensive solution.

Delay-modulation via programming voltages, for both low-power front-end analog processing and approximate-computing acceleration, was demonstrated in [23]. Applying a small-signal analog input to the back-gate of the transistor modulated the threshold voltage and hence the delay. Since the threshold voltage varies with the input in a square-root fashion, the delay is inherently non-linear. Also, variation in the threshold voltage across a chip can introduce non-homogeneity in the multiplier.

In this work, a novel CMOS referential delay-cell, based on a steady discharge of a MOSCAP ($C^*$) via a constant current ($I^*$), is proposed. The block diagram in Fig. 1 depicts three sequential processes that $C^*$ undergoes, from $t = 0$:
1) instantaneous pre-charge to $V^*_0$ (colored red)
2) steady discharge, through a constant current $I^*$ (blue)
3) thresholding of $V^*$ at $V^*_{th}$, using a threshold detector (green).

With these three processes, the time taken for $V^*$ to reach $V^*_{th}$ is:

$$T_d = \frac{C^*}{I^*}\left(V^*_0 - V^*_{th}\right). \tag{1}$$

If $V^*_0$ is a linear function of $V_A$, then the time taken to discharge to $V^*_{th}$ (or simply, the delay) becomes a linear function of $V_A$. This forms the basis of the proposed delay-cell. For use within a multiplier, its delay is exponentially scaled through the gate-voltages of the source of $I^*$, rather than transistor widths. Next, analytical models for all the sub-processes in the delay-cell are developed, key sources of jitter are identified and a model for the net jitter is formed. From these models, constraints on $C^*$ and $I^*$ for the usability of delay-cells in a multiplier are found, and it is shown that the multiplier cannot accommodate more than 5 bits of (signed) digital input. Biasing circuits that generate the gate-voltages to scale $I^*$ and the delay exponentially are then proposed and validated.

Fig. 2. Referential delay

The paper is organized as follows: in Sec. II, a necessary but brief background on mixed-signal delay multipliers, delay-MACs and how multiple delay-cells together constitute a multiplier is presented. In Sec. III, the concept behind the proposed delay cell is presented in more detail. For each of the three sub-processes, CMOS implementation details are presented and constraints for a linear delay transfer characteristic are developed. Jitter-models are then developed for the two sub-processes that contribute most to the jitter. In Sec. IV, the constraints and jitter model developed are employed to find the minimum latency and maximum number of bits that can be accommodated within the multiplier. Also, a biasing circuit that enables an accurate exponential scaling of delays of cells within a multiplier is presented. In Sec. V, we discuss the chosen noise-floor and its connection with the maximum number of bits, and the input dependence of energy consumption. Next, the proposed delay-cell is compared with the state-of-art, before concluding in Sec. VI.

II. BACKGROUND

A delay-MAC comprises several delay multipliers, whose delays serially accumulate before undergoing further non-linear processing. The multiplier accepts a time-referenced event signal, which it propagates forward as-is, but after a delay in proportion to the product of its arguments. When several such multipliers are placed in series and a reference event is applied to the first, the reference event is propagated forward, and the net (accumulation of) delay models the dot-product of inputs. By using delays, the need for adders is eliminated, because the delays are summed up naturally. Since negative numbers cannot be represented using individual events, a pair of events is used, where the time of one event's occurrence, referred to the other's, is called the referential delay (Fig. 2).

A delay multiplier, besides two input arguments, has a pair of a variable and a reference event-signal (henceforth called referential events) at its input and output. For a mixed-signal multiplier, a signed, fixed-point weight ($S$ and an n-bit wide vector $W$) and an analog scalar ($V_A$) form the arguments, and a pair of rising (or falling) edges of voltages form the referential event-signals (Fig. 2). To accommodate negative weights, symmetric 2:2 multiplexers (or relays), realized using transmission-gates, are placed within each multiplier. The relay ensures that for each negative weight, the referential events are swapped before multiplication (Fig. 3a,b).

Fig. 3. Signed mixed-signal delay multiplier: (a) 1-bit, (b) 3-bit

Each multiplier consists of smaller referential delay-cells that correspond to each bit of $W$ and create a referential delay equaling $2^i D$, where $D$ is the common delay-factor and $i \in \{0, 1, \ldots, n-1\}$. The common delay-factor, $D$, is a linear function of $V_A$. For reasons explained in Sec. III-C, two $V_{DD} \to 0$ falling edges, as referential event signals, are propagated through the multiplier.
Weight bits, $w_i$, individually determine whether the reference-event signals are delayed by $2^i D$ or not, by making the falling-edge pass or skip a delay-cell.

Each referential delay-cell has a pair of identical and parallel, linearly tunable delay-cells. One delay-cell inputs $V_A$ and outputs the falling-edge after a delay linearly dependent on $V_A$. The second cell inputs a constant reference voltage, $V_{A0}$, and outputs the event after a fixed time. If $V_A > V_{A0}$, then the variable event gets more delayed compared to the reference, which represents a positive partial sum. A negative partial sum is produced if $V_A < V_{A0}$, and zero if $V_A = V_{A0}$. Thus, using a pair of delay-cells homogenises the multiplier with respect to the multiplicand and allows negative weights and referential delays.

For illustration, a 1-bit multiplier is shown in Fig. 3a. The multiplier comprises a 2:2 relay and a referential delay cell. $S = 1$ implies a negative weight, which causes the falling-edges to get swapped. The referential delay-cell further contains two delay-cells, one with a variable input ($V_A$) and the other with a reference input ($V_{A0}$). When $W = w_0 = 0$, the cell-bypassing MUX is enabled, leading to both a negligible delay and a negligible referential delay. Next, a 3-bit multiplier is shown in Fig. 3b. The referential delays are scaled in the ratio 1, 2 and 4, by scaling the absolute delays in the ratio 1, 2 and 4.

One may also use a differential mode of operation, where the referential delay-cell is replaced with a differential delay-cell. In this mode, the analog inputs are changed from $V_A$ and $V_{A0}$ to $V_{A0} + V_A$ and $V_{A0} - V_A$. This may remove second-order distortion terms of the multiplier without changing the multiplier's circuitry.

Fig. 4. Three sub-processes with an n-FET for the initial discharge's linearity

Fig. 5. Expected transient response: (a) varying $V_A$, (b) varying $I^*$
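The behaviour of the signed multiplier described in this section can be summarized with a minimal behavioural sketch (it models the arithmetic only, not the circuit). The function names, the slope parameter `k` (delay per volt of the common delay-factor $D$), the name `v_a0` for the reference voltage and all numeric values are assumptions introduced purely for illustration.

```python
def referential_delay(sign, weight_bits, v_a, v_a0, k):
    """Ideal referential delay (s) of an n-bit signed delay multiplier.

    sign        : S, 1 for a negative weight (the relay swaps the two events)
    weight_bits : [w_0, ..., w_(n-1)], least-significant bit first
    v_a, v_a0   : analog input and constant reference voltage (V)
    k           : assumed slope of the common delay-factor D, in s/V
    """
    d = k * (v_a - v_a0)  # common delay-factor D, linear in V_A
    # A cell with w_i = 1 adds 2**i * D; a cell with w_i = 0 is bypassed.
    magnitude = sum((2 ** i) * w for i, w in enumerate(weight_bits))
    delay = magnitude * d
    return -delay if sign else delay


def delay_mac(terms, k):
    """Accumulated referential delay of multipliers placed in series."""
    return sum(referential_delay(s, w, va, va0, k) for s, w, va, va0 in terms)


# Hypothetical two-term dot product: weight +5 and weight -2.
example_terms = [(0, [1, 0, 1], 0.9, 0.6), (1, [0, 1, 0], 0.8, 0.6)]
print(delay_mac(example_terms, k=1e-9))
```

Because the delays of cascaded multipliers simply add, `delay_mac` needs no explicit adder, which mirrors the argument made above for delay-based accumulation.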

III. DELAY-CELL DESIGN IN CMOS

A. Steady discharge-based delay-cells

An idealized circuit implementing an equivalent of this process is shown in Fig. 4, where the components responsible for each of the three sub-processes have been boxed and colored correspondingly.

In branch 1, the key component is a $V_A$-accepting nFET, $M_A$, that has a net source-capacitance $C_S$. Initially, both $S_1$ and $S_2$ are open and $C^*$ is charged to $V_{DD}$. Once $S_1$ is closed, the nFET initializes $C^*$ by sinking its charge into $C_S$, until its source-voltage reaches approximately $V_{thn}$ below the gate voltage, i.e. $V_A - V_{thn}$. From charge conservation, $V^*$ lowers by:

$$\Delta V^*_0 \approx \frac{C_S}{C^*}\left(V_A - V_{thn}\right). \tag{2}$$

Thus, the n-FET conducts until the source voltage rises enough to cut off the channel, establishing a linear relationship between $V_A$ and $\Delta V^*_0$. The approximation in Eq. 2 comes from the fact that a real sub-micron FET doesn't have a well-defined threshold-voltage. However, as shown later in Sec. III-C, the linear relationship still holds well if $V_A > V_{thn}$.

Fig. 6. Delay-cell schematic. All transistors have minimum widths
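The charge-sharing step of Eq. 2 can be illustrated with a short sketch. The capacitor and threshold values are hypothetical, and treating the nFET as fully off for $V_A \le V_{thn}$ is a simplifying assumption that the text itself flags as only approximate for sub-micron devices.

```python
def initial_drop(v_a, c_s, c_star, v_thn):
    """Eq. 2: drop in V* (V) once C_S has charged to V_A - V_thn."""
    if v_a <= v_thn:
        # Assumption: below threshold the nFET is treated as fully off.
        return 0.0
    return (c_s / c_star) * (v_a - v_thn)


# Hypothetical values: C_S = 0.5 fF, C* = 5 fF, V_thn = 0.4 V.
for v_a in (0.6, 0.8, 1.0):
    print(v_a, initial_drop(v_a, c_s=0.5e-15, c_star=5e-15, v_thn=0.4))
```

Equal steps in `v_a` produce equal steps in the returned drop, which is the linearity the delay-cell relies on.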

Once the voltage across $C_S$ is set, $S_1$ is opened and $S_2$ is closed, causing $C^*$ to spontaneously discharge via $I^*$ at a constant rate (Fig. 4). The steady discharge process can be described as:

$$\Delta V^*(t) = \Delta V^*_0 + \frac{I^*}{C^*}\,(t - t_0), \tag{3}$$

where $\Delta V^*$ is the drop in $V^*$ below $V_{DD}$. Next, a threshold-detector, with a threshold $V^*_{th}$, outputs a falling-edge once $V^*$ drops below $V^*_{th}$. The time taken for $\Delta V^*$ ($= V_{DD} - V^*$) to reach a given threshold $\Delta V^*_{th}$ ($= V_{DD} - V^*_{th}$), called the absolute delay ($T_d$), is given by:

$$T_d = \frac{C^*}{I^*}\left[\Delta V^*_{th} - \Delta V^*_0\right] \approx \frac{C^*}{I^*}\left[\Delta V^*_{th} - \frac{C_S}{C^*}\left(V_A - V_{thn}\right)\right]. \tag{4}$$

This is a linear function of $V_A$, so the delay can be adjusted linearly with the input (Fig. 5). As discussed in Sec. II, to homogenize the delay-input relationship, the referential delay is used. For a pair of steady-discharge-based delay-cells, the referential delay $\Delta t_D$ is:

$$\Delta t_D = T_d - T_{d,REF} = -\frac{C_S}{I^*}\left(V_A - V_{A0}\right), \tag{5}$$

which is independent of $C^*$. In case $C^*$ varies with $V^*$, the referential delay may be written as:

$$\Delta t_D = \frac{1}{I^*}\int_{V^*_0(V_{A0})}^{V^*_0(V_A)} C(v)\,dv. \tag{6}$$

For $\Delta t_D$ to be a linear function of $V_A$, $C(V^*)$ needs to be maximally constant for the range $V^* \in [V^*_0, V^*_{th}]$. If $I^*$ is scaled (exponentially) by a factor of 2 (Fig. 5), then a delay multiplier with a digital input vector $\bar{W} = \{w_i, \forall i = 1, 2, \ldots, n\}$ will yield the following referential delay:

$$\Delta t_D = \sum_{i=1}^{n} w_i \left[\Delta t_{D,i}(V_A)\right], \tag{7}$$

where $\Delta t_{D,i}$ is the referential delay from delay cell $i$. This forms the basis of our mixed-signal delay multiplier.

Fig. 7. $C^*$ vs. FET width, $W^*$ (axes: $W^*$ (m), $C^*$ (F); curves: N-type, P-type)

The schematic of the delay-cell in CMOS is given in Fig. 6, the detailed design methodology of which is discussed next.
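As a quick numerical check of Eqs. 4 and 5, the sketch below evaluates the absolute delay of a variable cell and a reference cell and subtracts them. All component values are hypothetical, and `v_a0` denotes the reference input of the second cell; the point is only that the difference depends on $C_S/I^*$ and not on $C^*$.

```python
def absolute_delay(v_a, c_star, i_star, c_s, v_thn, dv_th):
    """Eq. 4: T_d = (C*/I*) * [dV*_th - (C_S/C*) * (V_A - V_thn)]."""
    dv0 = (c_s / c_star) * (v_a - v_thn)  # initial drop from Eq. 2
    return (c_star / i_star) * (dv_th - dv0)


def cell_referential_delay(v_a, v_a0, c_star, i_star, c_s, v_thn, dv_th):
    """Eq. 5: T_d(V_A) - T_d(V_A0) for two otherwise identical cells."""
    args = (c_star, i_star, c_s, v_thn, dv_th)
    return absolute_delay(v_a, *args) - absolute_delay(v_a0, *args)


# Hypothetical cell: C_S = 0.5 fF, V_thn = 0.4 V, threshold drop 0.5 V.
base = dict(c_s=0.5e-15, v_thn=0.4, dv_th=0.5)
print(cell_referential_delay(1.0, 0.7, c_star=5e-15, i_star=1e-6, **base))
print(cell_referential_delay(1.0, 0.7, c_star=10e-15, i_star=1e-6, **base))
print(cell_referential_delay(1.0, 0.7, c_star=5e-15, i_star=0.5e-6, **base))
```

Doubling `c_star` leaves the referential delay unchanged, while halving `i_star` doubles it, which is the mechanism used for bit-wise delay scaling.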

B. CMOS implementation: $C^*$

Later, in Sec. IV-A, it is shown that a $C^*$ of approximately … fF is optimal for minimizing energy, latency and jitter. An inverted MOSCAP, steadily discharging towards depletion, reliably provides capacitance in this range. Since an n-type MOSCAP has a larger inversion capacitance-density than a p-type one (Fig. 7), the former is used.

C. CMOS implementation: Voltage initialization

This stage comprises the minimum-sized transistors M…–M… and the pMOSCAP $C_S$ in Fig. 6, the key design considerations of which are discussed next:
1) Input nFET $M_A$ (M…): To nullify the effect of process variations (PV), the $V_A$-accepting nFET is unique to a multiplier, i.e. it is shared by all delay-cells within a multiplier. Specifically, random dopant-fluctuation, oxide-thickness variations and other process-related non-idealities may offset $V_{thn}$, which may in turn offset the output referential delay by:

$$\Delta t_D = -\frac{C_S}{I^*}\,\Delta V_{thn}, \tag{8}$$

where $\Delta V_{thn}$ models the effect of PV.
2) $C_S$: This capacitor is unique to a delay-cell, and a minimum-sized pFET is assigned to each cell. The pFET stays in the inversion regime regardless of $V_A$, because $V_S$ saturates to a value that is at least $V_{thn}$ less than $V_{A,max}$ ($= V_{DD}$).
3) Switch S-1 (M…): In the relevant regime of operation, $V^*$ remains close to $V_{DD}$, necessitating a p-type FET. The switch is unique to each delay-cell, as it isolates the $C^*$ of each cell from the shared M… of the multiplier.
4) Reset switch (M…): A switch to reset $V_S$ to zero before each computation is kept common to all cells within a multiplier.

Fig. 8. $\Delta V^*_0$ plots vs. $V_A$: (a) absolute value, (b) derivative

If the initial charge on $C_S$ is zero, then $\Delta V^*_0$ can be expressed as a linear function of $V_A$ and an offset term:

$$\Delta V^*_0(V_A) = \frac{C_S + C_{pS}}{C^*}\left(V_A - V_{thn}\right) + \frac{\Delta Q_{of}}{C^*} \approx \frac{C_S}{C^*}\left(V_A - V_{thn}\right) + \frac{\Delta Q_{of}}{C^*}. \tag{9}$$

Here, $C_{pS}$ is the parasitic capacitance arising from M… and M…; $\Delta Q_{of}$ models the zero-offset at $V_A = V_{thn}$, dependent on several parameters: $C^*$, the width of M… and other parasitic effects like feed-forward of the input falling-edge into $C^*$. Note that $\Delta Q_{of}$ has two distinct values: the first is defined within the discharge-pulse application (MD) and contains a feed-forward component of the falling-edge. The second is defined after the discharge-pulse application (PD), and is slightly less than MD.

To quantify linearity, $\Delta V^*_0$ is plotted in Fig. 8a against $V_A$, and its derivative w.r.t. $V_A$ in Fig. 8b, for $V_A$ ranging between … V and … V, with $W^*$ (or $C^*$) as the design parameter. For this range, less than 10% variation is seen. The figure shows that larger capacitors can provide better linearity.

Fig. 9 plots $\Delta V^*_0$ and its average derivative versus $W^*$ ($\propto C^*$), in a log-log fashion. Both plots have a constant slope of $-1$ for sufficiently large $W^*$, validating Eq. 9 as a model for the discharge process. $C_S + C_{pS}$ and $\Delta Q_{of}$ are then empirically determined by fitting the model of Eq. 9, yielding $C_S + C_{pS}$ = 0.… fF and $\Delta Q_{of}$ = 0.… fC (MD).

Eq. 9 is only valid when the voltage on $C^*$ is large enough to charge up $C_S$. Mathematically,

$$V_{DD} - \Delta V^*_0(V_A) > V_A - V_{thn}. \tag{10}$$

For $V_A = V_{DD}$, we get:

$$C^* > \frac{(C_S + C_{pS})(V_{DD} - V_{thn}) + \Delta Q_{of}}{V_{thn}}. \tag{11}$$

This sets the lower limit on $C^*$, which is employed later in Sec. IV-A.

Fig. 9. Variation of $\Delta V^*_0$ with $W^*$ and linearization models: (a) Deriv. $\Delta V^*_0$ vs. $W^*$, (b) $\Delta V^*_0$ vs. $W^*$ (curves: Mid., Post; reference slope $-1$)

Fig. 10. Steady discharge's characterization: (a) $V^*$ transient (simulated), (b) $T_d$ at $V_{DD}/\ldots$ vs. $V_A$

To slightly enhance the linearity without adding to the area, one in every 5 pMOSCAPs of $C_S$ is replaced by an nMOSCAP. For $V_S < V_{DD} - V_{thp}$, the pFET is inverted and provides a close-to-constant capacitance. For $V_S > V_{DD} - V_{thp}$, the pFET's capacitance diminishes but the nFET offsets the loss. Since the NMOS is smaller, it does so only to a small extent.

D. CMOS implementation: Steady discharge

This part of the delay-cell consists of transistors M…–M… in Fig. 6, the key design considerations of which are discussed next:
1) $I^*$ (M…–M…): This is realized using a bi-cascoded nFET current source, with the FETs at their minimum widths. The exponential current scaling is done via an external biasing circuit that generates the gate-voltages of both current-source FETs (M… and M…). The biasing circuits are discussed in Sec. IV-B.
2) Switch S-2 (M…): An NMOS switch is placed in series with the current-source. Unlike S-1, the switch is placed away from $C^*$, preventing feed-forward through the parasitic capacitors.

For $W_S$ ($\propto C_S$) at its minimum value (… nm) and $W^*$ ($\propto C^*$) = 640 nm, the discharge transient of $C^*$ is shown in Fig. 10a. Referring back to Fig. 6, the switch (M…) is turned ON at $t$ = 25 ns by ramping up IFE′, the input to M…. As expected, $I^*$ discharges $C^*$ at a near-constant rate (Fig. 10a). Its constancy depends solely on the output resistance of the current source. Kinks observable in the transients are caused by feedback from the half-latch and do not practically affect the performance. The absolute delay for various $R$ ($= I^*/C^*$), spaced exponentially with a factor of 2, is plotted in Fig. 10b.

E. CMOS Implementation: Threshold detector

It comprises M…–M… as the falling-edge, uni-polar threshold detector, half-latch and other switches for resetting. Details and design considerations are discussed next:
1) Falling-edge inverter (M…): To minimize the area requirement, the width of M… is kept minimum. As discussed below, under certain constraints on $C^*$, this inverter contributes a $V_A$-independent delay, thus keeping distortion negligible.
2) Leak-prevention switch (M…): It prevents the sub-threshold M… from leaking and setting up the latch prematurely. It inputs the falling-edge of the previous delay cell.
3) Half-latching inverter (M…): These transistors latch $V_{RE}$ to $V_{DD}$ and the OFE node to 0, once $V_{RE}$ reaches $V_{thn}$.
4) Latch-en-/disable switches (M…, M…): These switches enable the half-latch operation when closed and otherwise disable it, reducing the energy required to reset the half-latch.
5) Reset-FETs (M…): M… resets node $V_{RE}$ to 0. M… sets the OFE node to $V_{DD}$ before the start of computation. These transistors are shared within the multiplier.

A CMOS inverter can serve as a low-energy threshold detector, whose switching-voltage can be set by designing the ratio of sizes of the pMOS and nMOS. However, it consumes short-circuit energy ($E_{SC}$) given by:

$$E_{SC} = \frac{V_{DD}}{6R}\,\mu_{WL}\,C_{ox}\left(V_{DD} - V_{thp} - V_{thn}\right)^3, \tag{12}$$

where $R = I^*/C^*$ and

$$\mu_{WL} = \frac{\mu_p \mu_n (W/L)_p (W/L)_n}{\left(\left(\mu_p (W/L)_p\right)^{1/2} + \left(\mu_n (W/L)_n\right)^{1/2}\right)^2}.$$

Since $R$ decreases exponentially, $E_{SC}$ increases exponentially. Hence, a delay-cell implementing the n-th exponent will expend $2^n \times$ the $E_{SC}$ of the cell implementing the first. For an n-bit multiplier, the total short-circuit energy lost is $\left(2^{n+1} - 1\right)E_{SC}$. Thus, an exponential requirement in energy consumption motivates an alternative inverting mechanism.

The low-energy alternative to the CMOS inverter is a standalone pFET (M… in Fig. 6), due to its switch-like I-V relationship.
If it were an ideal switch, with a switching voltage $V_S$ ($> \Delta V^*_{0,max}$), $V_{RE}$ would jump to $V_{DD}$ after a fixed delay following $\Delta V^*(t) = V_{thp}$. This would conserve the linearity of Eq. 5 with respect to $V_A$, as it only adds a constant delay. However, a real pFET has a close-to-exponential I-V relationship, and the conservation of linearity needs to be established, or at least the constraints for maximal linearity determined.

With the assumption of exponential I-V characteristics and a large output resistance ($g_{DS}^{-1}$), the sub-threshold current can be expressed as a function of the gate-source voltage ($= \Delta V^*$) using the following equation:

$$I = I_0 \exp\left(\frac{\Delta V^*}{V_T}\right), \tag{13}$$

where $V_T$ is the thermal voltage. Eq. 13 is valid only for $\Delta V^* < V_{thp}$; for $V_{GS} > V_{thp}$, the I-V relationship is usually a polynomial of degree 2 or less, moving the switch away from ideal behavior.

For a $V^*$ that decreases linearly at a steady rate $R$, starting from an initial drop of $\Delta V^*_0$, $V_{RE}$ (Fig. 6) can be expressed as a function of time using:

$$V_{RE}(t) = \frac{I_0}{C}\,\frac{V_T}{R}\,\exp\left(\frac{\Delta V^*_0}{V_T}\right)\left(\exp\left(\frac{Rt}{V_T}\right) - 1\right), \tag{14}$$

where $R = I^*/C^*$ is the rate of change of $V^*$ with time, and $C$ is the net capacitance at the drain of M8. When $V_{RE}(t) = V_{thn}$, the half-latch is set up and the voltage at the OFE node (Fig. 6) falls to 0. Thus, the time taken from the start of discharge ($t = 0$) to the drop in the OFE-node voltage to zero ($t = T_d$) is:

$$T_d = \frac{V_T}{R}\ln\left(\frac{RCV_{thn}}{I_0 V_T}\exp\left(-\frac{\Delta V^*_0}{V_T}\right) + 1\right). \tag{15}$$

$T_d$ becomes a linear function of $\Delta V^*_0$ under the constraint:

$$\frac{RCV_{thn}}{I_0 V_T}\exp\left(-\frac{\Delta V^*_0}{V_T}\right) \gg 1. \tag{16}$$

Putting $R = I^*/C^*$, this inequality may alternatively be written as:

$$\frac{I^*}{C^*}\,\frac{C}{I_0\exp\left(\Delta V^*_0/V_T\right)}\,\frac{V_{thn}}{V_T} \gg 1. \tag{17}$$

Since $\Delta V^*_0$ varies inversely with $C^*$ (from Eq. 9), the denominator in Eq. 17 is a monotonically decreasing function of $C^*$. Then, as per this inequality, $C^*$ should be greater than a critical capacitance $C^*_{min}$.
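To see how the constraint of Eq. 16 controls linearity, the following sketch compares Eq. 15 with its large-argument limit, obtained by dropping the "+1" inside the logarithm. The discharge rate, node capacitance, leakage prefactor `i0` and threshold used here are hypothetical values chosen only so that both regimes are visible.

```python
import math

V_T = 0.0259  # thermal voltage near room temperature (V)


def latch_argument(dv0, rate, c_node, i0, v_thn, v_t=V_T):
    """Left-hand side of Eq. 16 (dimensionless)."""
    return (rate * c_node * v_thn) / (i0 * v_t) * math.exp(-dv0 / v_t)


def detector_delay(dv0, rate, c_node, i0, v_thn, v_t=V_T):
    """Eq. 15: time for V_RE to reach V_thn."""
    x = latch_argument(dv0, rate, c_node, i0, v_thn, v_t)
    return (v_t / rate) * math.log1p(x)


def linear_delay(dv0, rate, c_node, i0, v_thn, v_t=V_T):
    """Limit of Eq. 15 when Eq. 16 holds: linear in the initial drop dv0."""
    return (v_t / rate) * math.log(rate * c_node * v_thn / (i0 * v_t)) - dv0 / rate


# Hypothetical operating point: R = 0.1 V/ns, C = 1 fF, I_0 = 1 pA, V_thn = 0.4 V.
params = dict(rate=1e8, c_node=1e-15, i0=1e-12, v_thn=0.4)
for dv0 in (0.05, 0.10, 0.40):
    print(dv0, latch_argument(dv0, **params),
          detector_delay(dv0, **params), linear_delay(dv0, **params))
```

For small initial drops the two delays agree closely; once the argument of Eq. 16 falls towards 1, the linear form departs from Eq. 15, which is the distortion the constraint is meant to exclude.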
This inequality is used in Sec. IV-A for establishing constraints on $C^*$ and $n$.

Under the validity of this inequality, $T_d$ can be expressed as:

$$T_d = \frac{V_T}{R}\ln\left(\frac{RCV_{thn}}{I_0 V_T}\right) - \frac{\Delta V^*_0}{R}, \tag{18}$$

matching the expectation of $T_d$'s linearity with $\Delta V^*_0$, or $V_A$. Note that the latch-point, or the value of $V^*$ when $V_{RE} = V_{thn}$, is a constant, independent of $V^*_0$ (or $V_A$), and expressible as:

$$\Delta V^*_{th} = \Delta V^*_0 + RT_d = V_T\ln\left(\frac{RCV_{thn}}{I_0 V_T}\right). \tag{19}$$

Though exponential I-V characteristics are assumed for M…, in reality they are exponential only for sub-threshold gate voltages. For devices with power-law I-V relations, Eq. 14 is re-derived with the modified I-V, and the constraints of Eq. 11 re-determined. For a $p$-power current-voltage relationship,

$$\frac{I^*}{C^*}\,\frac{C}{I(V_A, C^*)}\,\frac{V_{RE}}{V_G/p} \gg 1. \tag{20}$$

For an ideal switch ($p \to \infty$), the constraint is trivially satisfied and the TD stage doesn't contribute to distortion. As the I-V relationship of the pFET moves away from step-like behaviour towards linearity ($p \to 1$), ensuring linearity of the delay-$V_A$ characteristics becomes harder.

F. Noise-modelling

The delay-cell essentially consists of two current-integrators that accumulate the accompanying noise-current, from the arrival of the falling-edge ($t = 0$) to the latch-up ($t = T_d$). This leads to a net temporal shift, or jitter, in the output falling-edge. To enable design of the delay-cell and multiplier, the two jitter components are modeled as functions of $C^*$ and $I^*$ (the design variables) and an upper limit on jitter is set, yielding constraints on the design variables and $n$. To simplify the jitter-modeling, it is assumed that:
1) Out of the three processes, only two contribute to the jitter: steady discharge and threshold-detection. (Initial discharge occurs much faster than $T_d$, so it contributes negligibly to the net jitter.)
2) The net jitter is much smaller than $T_d$

1) Jitter from steady discharge:

For this stage, the primary contributor of jitter is the channel noise-current from M…, accumulating in $C^*$. To simplify the model, it is assumed that the noise-current out of M… circulates within itself, and hence contributes negligibly to the jitter. With this assumption, the stage reduces to a noisy FET discharging a fixed capacitor, jitter modelling for which was done for ring oscillators in [24]. For an inverter-type ring-oscillator, the jitter-per-stage is modeled as:

$$\Delta t_{Dn}^2 = \frac{4kT\gamma g_d}{I^{*2}}\,T_d, \tag{21}$$

where $\gamma$ is the excess noise factor and $g_d$ is the drain-source conductance at $V_{DS} = 0$. This naturally extends to the proposed delay-cell, with the exception that $T_d$ is variable, dependent on the rate of discharge and $V_A$. Using Eq. 4, the expression for jitter becomes:

$$\Delta t_{Dn}^2 = \frac{4kT\gamma g_d}{I^{*3}}\,C^*\left(\Delta V^*_{th} - \Delta V^*_0\right). \tag{22}$$

To simplify further, the dependence of jitter on $\Delta V^*_{th}$ and $\Delta V^*_0$ is neglected, and a constant jitter, for a $V_{DD}/\ldots$ drop in $V^*$, is defined and used. Owing to the fact that $g_d$ has a linear dependence on current, the jitter from this stage is compactly expressible as:

$$\Delta t_{dn}^2 = K\,\frac{C^*}{I^{*2}}, \tag{23}$$

where $K$ is a temperature- and technology-dependent constant.

For model validation, the jitter is simulated in software for IBM's 130 nm technology. The resulting $\Delta t_{dn}^2$, with only the noise of M…–M… turned on, versus $C^*$ and $I^*$ is plotted in Fig. 11. Instead of Eq. 23, the following model is used, as it fits the experimental data better (R-sq. of 0.982, from 10 iterations):

$$\Delta t_{dn}^2 = K\,\frac{C^*}{I^{*p}}, \tag{24}$$

where $K = 2.\ldots \times 10^{-\ldots}$ and $p = 2.\ldots$.
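The text does not describe how the exponent $p$ of Eq. 24 was fitted. One common approach, shown here only as an assumed illustration, is a least-squares line fit in log-log space at fixed $C^*$. The data below are synthetic and hypothetical (generated from an exact power law), not the paper's simulation results.

```python
import math


def fit_power_law(currents, variances):
    """Fit variance = scale / I**p by least squares on log-log data.

    Returns (scale, p). At a fixed C*, scale plays the role of K * C* in Eq. 24.
    """
    xs = [math.log(i) for i in currents]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # slope of log(variance) versus log(I)
    scale = math.exp(mean_y - slope * mean_x)
    return scale, -slope


# Synthetic, hypothetical jitter variances (s^2) following 1e-36 / I**2.2.
currents = [1e-7, 2e-7, 4e-7, 8e-7]
variances = [1e-36 / i ** 2.2 for i in currents]
scale, p = fit_power_law(currents, variances)
print(scale, p)
```

With noisy simulation data the same fit would return an exponent near, but not exactly at, the ideal value of 2 predicted by Eq. 23.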

2) Jitter from threshold detector:

Since the input gate-source voltage ($\Delta V^*$) of M… increases linearly with time and the drain-current exponentially, it is assumed that the RMS channel noise-current ($i_n$) out of M… increases exponentially. Thus, at any given instant of time after the falling-edge's arrival, noise current from only the past 3-4 $V_T$-drops in $\Delta V^*$ contributes to this stage's jitter.

Fig. 11. Steady discharge-stage's jitter: (a) iso-$I^*$, $\Delta t_{Dn}^2$ vs. $C^*$ (F); (b) iso-$C^*$, $\Delta t_{Dn}^2$ vs. $I^*$ (A)

Fig. 12. Variance in $V_{RE}$ and jitter due to the noise-current of M…: (a) variance in $V_{RE}$ vs. $R$ (V s$^{-1}$), (b) jitter vs. $R$ (V s$^{-1}$)

If $\Delta v_n$ is the deviation in $V_{RE}$ at $t \to T_d^-$, then for a constant $i_n$, we have:

$$\Delta v_n^2 \propto \frac{i_n^2}{\Delta f}\,\frac{T_{dn}}{C^2}. \tag{25}$$

Since $i_n$ varies exponentially over the duration $t_{dn}$, this equation cannot be applied without adjustments. Thus, the following equation is used:

$$\Delta v_n^2 \propto \frac{\int_0^{T_d} 4kT\gamma\, g_d(t)\,dt}{C^2}. \tag{26}$$

Letting $g_d = G_0\exp\left(\frac{\Delta V^*_0 + Rt}{V_T}\right)$, we get:

$$\Delta v_n^2 \propto \int_0^{T_d} 4kT\gamma\, G_0\exp\left(\frac{\Delta V^*_0 + Rt}{V_T}\right)dt \approx \beta\,\frac{V_T}{R}\exp\left(\frac{\Delta V^*_{th}}{V_T}\right) = \beta\,\frac{CV_{thn}}{I_0}, \tag{27}$$

where $\beta = 4kT\gamma G_0$, the lower-limit term of the integral is neglected, and Eq. 19 is used in the last step. This equation establishes the independence of $\Delta v_n^2$ from $R$, which is confirmed by Fig. 12. In the figure, $R$ is varied by a factor of more than …×, but less than a …× rise is seen in $\Delta v_n^2$.

Next, the relationship between $\Delta t_{Dn}$ and $R$ is determined. Similar to the approach adopted in [24], $\Delta t_{Dn}$ can be found by extrapolating the noisy $V_{RE}$ along the noise-less $V_{RE}$, to the point of latch-up:

$$\Delta t_{Dn}^2 = \left(\frac{dV_{RE}}{dt}\right)^{-2}\Delta v_n^2 \propto \frac{\Delta v_n^2}{R^2}, \tag{28}$$

where Eq. 14 was used for $dV_{RE}/dt \propto R$. The simulated jitter, with only the noise of M… turned on, is plotted in Fig. 12.

For minimally sized M…–M…, the fitted model from 10 iterations of (noisy) simulations is:

$$\Delta t_{dn}^2 = \frac{K}{R^{\ldots}}, \tag{29}$$

where $K = 1.\ldots \times 10^{-\ldots}$. Thus, the actual exponent is less than predicted.

Fig. 13. Constraints 1, 2 and 3 for 4-, 5- and 6-bit multipliers: (a) n=4, (b) n=5, (c) n=6 (axes: $I^*_{fast}$ (A), $C^*$ (F); region marked "No solution")

IV. MIXED-SIGNAL DELAY MULTIPLIER

With the delay-cell design considerations discussed, the steps necessary to employ the cells within a multiplier are presented next: (1) use of the constraints to find the valid region of design and operation, and (2) bias-circuit design for the exponentiation of $I^*$. Lastly, through simulations, the functionality of cascaded delay-cells as a multiplier is validated and the key energy components of each multiplication operation are identified.

A. Optimizing $C^*$ and No. of Bits

Using the inequalities involving $C^*$ developed in Sec. III-C and Sec. III-E, and the noise models of Sec. III-F, the constraints on $C^*$ and $I^*$ are determined. Note that these are valid only for IBM's 130 nm technology, but may similarly be determined for other CMOS technology nodes.

1) Linearity of voltage-initialization:

Substituting the model parameters extracted from the data of Fig. 8-9 into the inequality of Eq. 11 gives constraint 1, $C^*$ > . f, (30) which corresponds to an inverted nMOSCAP single-finger width of . µm. This constraint is marked by '1' in Fig. 13a-c.

2) Linearity of threshold detection:

Since $I^*$ of the slowest cell is $2^n$ times smaller than that of the fastest cell ($I^*_f$), constraint 2 from Eq. 17 becomes:

$$\frac{2^{-n} I^*_f}{C^*}\;\frac{C}{I \exp\left(\frac{\Delta V^*(V_A, C^*)}{V_T}\right)}\;\frac{V_{thn}}{V_T} > \tag{31}$$

Only $C^*$ and $I^*_f$ are designable; the rest, namely $C$, $V_{thn}$, $V_T$ and $I$, are constant. To simplify the analysis, the denominator is maximized over $V_A$ and the uni-variate $\Delta V^*(V_A = 1 . , C^*)$ is used. Though the argument of the exponential in Eq. 31, $\Delta V^*(V_A, C^*)$, was modeled in Sec. III-C, the actual data of Fig. 8 is used. This constraint is marked by '2' in Fig. 13a-c.

3) Upper limit on jitter:

The referential delay of the fastest cell, from Eq. 5, is:

$$\Delta t_D = -\frac{C_S}{I^*_f}\left(V_A - V_{A0}\right)$$

If the jitter from steady discharge is denoted by $\Delta t_{Dn,1}$ and that from the TD by $\Delta t_{Dn,2}$, the constraint on the net jitter is that it must be smaller than the maximum ref. delay of the fastest cell. For a $V_{A0}$ = 0 . V,

$$\sqrt{\Delta t_{Dn,1}^2 + \Delta t_{Dn,2}^2} \le \frac{C_S}{I^*_f}\left|V_A - V_{A0}\right|_{\max} \tag{32a}$$

$$\sqrt{\frac{K_1 C^*}{\left(2^{-n} I^*_f\right)^{p}} + K_2\left(\frac{C^*}{2^{-n} I^*_f}\right)^{q}} \le \frac{C_S}{I^*_f}\left|V_A - V_{A0}\right|_{\max}, \tag{32b}$$

where $K_1$ and $p$ are the fitted constants of Eq. 24, and $K_2$ and $q$ are the fitted constant and exponent of Eq. 29. With its LHS being a monotonic function of $C^*$, Eq. 32 gives an upper limit on $C^*$ for a given $n$. This constraint is marked by '3' in Fig. 13a-c.

Fig. 13 plots the constraints for 4-, 5- and 6-bit multipliers. For 4 and 5 bits, the valid region of operation, marked by double-sided arrows, lies between the curves corresponding to constraints 1, 2 and 3. For 6 bits, no solution exists for the chosen constraints. Thus, the 5-bit multiplier with $C^*$ = 2 . fF and $I^*_f$ = 1 µA emerges as the point of design, as it works for all multipliers with fewer than 6 bits and minimizes the multiplication latency and energy consumption.

B. Biasing circuit

Accurate biasing for the current sources is required to ensure low output distortion. As discussed below, its behavior must meet two specifications:
1) As discussed in Sec. IV-A, $I^*_f$ is achievable only for sub-threshold transistors with the employed VLSI node. Hence, the biasing circuit is designed only for sub-threshold currents and works well in this regime only.
2) The current source within each cell consists of a pair of series NFETs ( M and M in Fig. 6). M 's gate-voltage (primary bias) is such that it sinks $2^{-n} I^*$ and its drain-voltage is fixed close to mV ( ≈ $V_T$). The drain voltage is maintained by M , gated with a secondary bias approximately mV above M .

The circuit (Fig. 14) has two branches: source and scaling. The source branch ( M − ) generates biasing voltages dependent on a programmable voltage, $V_{REF}$. The scaling branch ( M − ) first uses those biasing voltages to generate currents in the exponents of 2 (using transistor widths) and then generates the bias for the cells' current-source using self-biasing. Here, the primary bias out of M is denoted as V B and the secondary out of M as V B .

In the source branch, M converts the reference voltage $V_{REF}$ into the current $I_{BIAS} = I^*_f$. M and M , being self-biased in the saturation regime, push up the gate voltages of the tri-cascode enough to keep the mirror transistors ( M − ) of the scaling branch saturated. M − produce the multiplier-cascode's bias.

In the scaling branch, the widths of M − are down-scaled by $2^{-n}$ w.r.t. the source cascode's width, which down-scales the current in the same proportion. M is self-biased to accept the current and generates the primary bias. The gate voltage of the FET with the largest current exponent (or, the smallest delay exponent) is:

$$V_{B,f} \approx V_{REF} \tag{33}$$

Its drain-voltage is maintained at mV using a fixed-biased M . After down-scaling M 's size (by $2^{-i}$, $i$ being the exponent), its source voltage is maintained at a constant value. M , also a down-scaled transistor, is used to generate the secondary bias. M 's width is adjusted using parametric analysis to keep its self-bias above the primary bias by mV. An additional exponent-dependent scaling of M 's width, given by W M ,i ≈ (1 . )^{−i} W M ,max , (34) is required to counter the lower turn-on voltages of the scaling branches.

The response of the bias circuit to a varying $V_{REF}$ is plotted in Fig. 15: the bias current input to the multiplier's cascode, and the primary bias, the secondary bias and their difference for the down-scaled multiplier branches (up to 8 bits).

To minimize distortion, it is essential that the same drain-source voltage is applied to the M 's corresponding to all exponents. Besides using 3-level cascoding, the length of all transistors within the cascode is increased by × over the minimum, to minimize the CLM and short-channel effects that lower $r_{DS}$. The sizes of all transistors are summarized in Table I.

Fig. 14. Biasing circuit. M − constitute the source branch and M − constitute the scaling branch.

Fig. 15. Biasing circuit outputs versus $V_{REF}$ (V): (a) source branch's current $I_{BIAS}$ (A); (b), (c) the two bias outputs (V), for the fastest and slowest branches; (d) the difference between the two bias outputs (V).
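The reason gate-voltage scaling can replace width scaling can be illustrated with the ideal sub-threshold current law, $I \propto \exp\left(V_{GS} / (m V_T)\right)$: each halving of the current corresponds to a fixed gate-voltage step of $m V_T \ln 2$. The sketch below is a minimal illustration under that idealized law and is not a model of the circuit of Fig. 14. The slope factor `slope_factor` (written $m$ here to avoid a clash with the bit count $n$), the temperature, and the 1 µA reference current are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return K_B * temp_k / Q_E


def gate_bias_offsets(n_bits, slope_factor=1.3, temp_k=300.0):
    """Gate-voltage reduction, relative to the fastest cell, that makes
    cell i sink 2**-i times the fastest cell's current (ideal law)."""
    step = slope_factor * thermal_voltage(temp_k) * math.log(2.0)
    return [i * step for i in range(n_bits)]


def subthreshold_current(i_ref, dv_gate, slope_factor=1.3, temp_k=300.0):
    """Ideal sub-threshold current after lowering the gate by dv_gate volts."""
    return i_ref * math.exp(-dv_gate / (slope_factor * thermal_voltage(temp_k)))


# Assumed values: 5 cells, 1 uA reference current, slope factor 1.3, 300 K.
offsets = gate_bias_offsets(5)
currents = [subthreshold_current(1e-6, dv) for dv in offsets]
for i, (dv, cur) in enumerate(zip(offsets, currents)):
    print(f"cell {i}: gate offset = {dv * 1e3:6.2f} mV, current = {cur:.3e} A")
```

Under these assumptions the step between adjacent cells is a few tens of millivolts, so the full binary range fits within a small gate-voltage span. A real design would also have to account for mismatch and drain-voltage dependence, which this sketch ignores.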

C. Multiplier Simulation

Using the peripheral elements described in Sec. II, the multiplier is simulated using transient simulators, for IBM's 130 nm technology.

1) Functionality test:

A 5-bit multiplier, composed of delay-cells cascaded as described in Sec. II, was simulated. With the reference value of $V_A$ set to 0 . V, Fig. 16a plots the ref. delay of the multiplier as it varies with $V_A$, with the weight ($W$) as a parameter. Conversely, the ref. delay with $W$ as the independent variable and $V_A$ as the parameter is plotted in Fig. 16b. Since the output ref. delay is distorted for $V_A$ < mV, the valid range of inputs for the multiplier is mV to . V.
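The ideal behavior that this functionality test checks against can be sketched as follows. This is a hedged, idealized model and not the simulated circuit: it assumes that each cell follows the linear referential-delay relation of Eq. 5, that cell $i$ carries a current of $2^{-i} I^*_f$, that the least-significant weight bit maps to the fastest cell, and that cells whose weight bit is 0 are bypassed and add no delay. The function names and the numerical defaults (`v_a_ref`, `c_s`, `i_fast`) are hypothetical placeholders, not values from this work, and only the weight magnitude is modeled.

```python
def cell_ref_delay(v_a, v_a_ref, c_s, i_fast, exponent):
    """Referential delay of one cell whose current is i_fast / 2**exponent."""
    cell_current = i_fast / 2 ** exponent
    return -(c_s / cell_current) * (v_a - v_a_ref)


def multiplier_ref_delay(v_a, weight, n_bits=5,
                         v_a_ref=0.5, c_s=2e-15, i_fast=1e-6):
    """Total referential delay of a cascade of n_bits cells.

    Cells whose weight bit is 1 contribute their delay; cells whose
    bit is 0 are assumed to be bypassed. Only |weight| is modeled.
    """
    magnitude = abs(weight)
    if magnitude >= 2 ** n_bits:
        raise ValueError("weight does not fit in n_bits")
    total = 0.0
    for bit in range(n_bits):
        if (magnitude >> bit) & 1:
            total += cell_ref_delay(v_a, v_a_ref, c_s, i_fast, bit)
    return total


# Sweep the weight at a fixed input (iso-V_A view, as in Fig. 16b).
for w in (1, 2, 4, 8, 16, 31):
    print(f"|W| = {w:2d}: ref. delay = {multiplier_ref_delay(1.0, w):.3e} s")
```

In this idealized model the total delay is proportional to the product of $|W|$ and the input deviation from its reference, which is the multiplication that the cascade is meant to realize; the distortion observed at low $V_A$ in simulation is outside its scope.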

2) Energy analysis and simulation results:

Within a delay cell, the key components of energy are:
1) $E_{C^*}$: Energy used up in charging $C^*$ for each computation. It is given by:

$$E_{C^*} = C^* V_{DD}^2 \tag{35}$$

The actual value may vary due to parasitic capacitance and the dependence of $C^*$ on $V^*$.
2) $E_{TD}$: Energy stored in the node corresponding to $V_{RE}$ (6), once $\Delta V^*$ crosses the threshold. It is given by:

$$E_{TD} = C_{RE} V_{DD}^2 \tag{36}$$

where $C_{RE}$ is the net capacitance at the node. Part of it comes from the thresholding-pFET M ( E T D ) and the other part comes from the latching-pFET, M ( E T D ).
3) $E_{PU}$: Energy used in pull-up of the event-propagating wires of the delay cell (OFE node in Fig. 6).
4) $E_{INV}$: Energy used up in inverting the input falling-edge to a rising edge (IFE′, input to M in Fig. 6).

Next, the values of these metrics are determined by simulating the 5-bit multiplier (schematic) for one cycle of computation, with arguments $V_A$ = 1 . V and $|W| = 31$. $E_{TD}$, $E_{PU}$ and $E_{INV}$ are determined during the computation phase, and $E_{C^*}$ and $E_{PU}$ are determined during the pre-charge phase. The energy components and their simulated values are listed in Table II. Comparing their sum with the simulated total, it is concluded that the listed components account for almost all the expended energy.

TABLE I: FET SIZES FOR THE BIASING CIRCUIT
FET Label | Width ( / nm ) | Length ( / nm )
M M − n M − × n M − × i M . i M i M

Fig. 16. 5-bit multiplier transfer characteristics, with the valid $V_A$ range marked: (a) iso-$|W|$, ref. delay $t_D$ (s) versus $V_A$ (V); (b) iso-$V_A$, ref. delay $t_D$ (s) versus $|W|$.

V. DISCUSSION AND BENCH-MARKING

In Sec. IV-A3, the constraint on jitter was chosen such that the peak-jitter ($t_{dN}$) from the slowest delay-cell is less than the maximum referential delay of the fastest delay-cell. However, for certain inputs, the net output referential delay of a multiplier can be zero, which makes it impossible for the noise to always be smaller than the output signal. Thus, the chosen constraint is a practical one, as it grants the benefit of lower energy consumption by delay-based analog computing and simultaneously prevents excessive signal corruption by the noise. Depending on the signal-to-noise specification of an application, a much tighter constraint on noise may be placed, which, in effect, reduces the maximum number of bits that can be accommodated. Fig. 17 plots the number of bits possible within a multiplier as a function of the excess jitter margin ($\epsilon$), where $\epsilon$ is the ratio of the maximum ref. delay of the fastest cell to the peak-jitter. At $\epsilon = 1$, the number of bits is the highest, and it decreases to 1 at $\epsilon$ ≈ .

TABLE II: DELAY-CELL ENERGY COMPONENTS
Component | Energy/MAC (fJ) | Energy/MAC/bit (fJ)
E C ∗ E TD E TD E PU E INV

Fig. 17. Number of bits ($n$) vs. excess jitter margin (JM).

Note that the multiplier's $E_{C^*}$, given in Table II, is computed for the case when all weight bits are set to 1 (W=31). Otherwise, this component of energy depends on (1) the input weight and (2) the number of computations being done by the MAC per second. If the multiplier is used in a sense, or one-time-use, mode, then the listed $E_{C^*}$ is accurate, as all the charged-up energy leaks out eventually. Any new MAC cycle would require the same energy to charge up $C^*$ from the point of no charge. However, in acceleration mode, where the same weights are used with variable $V_A$, the $C^*$ of the cells with $w = 0$ never gets an opportunity to discharge completely, since all falling-edges bypass the cell. Before it fully discharges, a new MAC cycle's pre-charge step would charge up $C^*$ to $V_{DD}$ from an intermediate voltage.

In Table III, two simulated performance metrics are compared with the state-of-art: (1) energy consumption per multiply-accumulate, reported above, and (2) multiplication latency (from the absolute delay of the delay-cell). We also compare whether the multiplier allows negative weights and the maximum number of bits that can be accommodated for various mixed-signal MACs.

TABLE III: COMPARISON WITH STATE-OF-ART
Metric | This work | Gopal et al. [23] | Miyashita et al. [18] | Sayal et al. [20] | Lee et al. [14] | Skrzyniarz et al. [16]
Domain | Time: delay | Time: delay | Time: Clocked-delay | Time: Pulse-width | Analog-charge | Analog-current
Demo. node | 130nm | 65nm | 65nm | 40nm | 40nm | 65nm
Input-width | Analog-5b | Analog-3b | 1b | 8b | Analog-3b | 2b/1b
Energy (fJ/MAC/bit) | 23 | 7 | 20 | - | 15 | 13
Latency | 1b: 1.2ns and 5b: 50ns | 250ps | 50ps | - | - | -
Negative weights | Yes | No | Yes | Yes | No | No
Linearity mechanism | Discharge-till-pinch-off | Body-gate biasing | Binary | Binary | NA | NA

From the table, it is seen that:
1) The energy that the delay cell consumes per bit of digital argument/input is more than the lowest-reported state-of-art energy consumption. Our energy metric is at 130 nm, and the lowest state-of-art metric is at 65 nm. Assuming that the energy scales by L , the scaled energy consumption approaches that of the state-of-art.
2) In [23], linearity of the delay-cell is based on back-body biasing, which is theoretically non-linear. In the proposed cell, the output-input characteristics are linear, due to the linear voltage-initialization step.
3) Despite noise limitations, the number of bits that can be accommodated in the mixed-signal multiplier is higher than the state-of-art. All reported mixed-signal MACs use an exponential scaling of transistor widths as a way to convert digital signals to analog. In this work, we proposed a biasing circuit that exponentially scales the currents via gate-voltages and avoids area-expensive width-scaling.

VI. CONCLUSION

In this work, a linearly tunable delay-cell is proposed that realizes the analog input-dependent delay using three sequential sub-processes: (1) an input-dependent charge-up of $C^*$, (2) its steady discharge via a current $I^*$, and (3) thresholding of its voltage. Each of the sub-processes is analytically modeled, and from these models the constraints on $C^*$ and $I^*$ for linearity are found. Jitter models, based on prior ones developed for CMOS inverter ring-oscillators, were modified and validated for the proposed cell. To form a multiplier, delay-cells with the same analog input and with $I^*$ scaled in the exponents of 2 must be cascaded. Since $I^*$ is scaled using gate-source voltages, a biasing circuit that accepts a ref. voltage and generates gate biases for the delay cells corresponding to all exponents is proposed and validated. From the constraints on $C^*$ for linearity and noise, the minimum $C^*$ was found to be around fF and the maximum number of bits supportable to be five. Lastly, we also identify the key energy components of the multiplier, which sum up to 20 fJ/MAC/bit for IBM's 130nm technology.

REFERENCES

[1] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, “Scaling for edge inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, pp. 216–222, 2018.
[2] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 6 2017, pp. 1–12. [Online]. Available: https://dl.acm.org/doi/10.1145/3079856.3080246
[3] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, “A configurable cloud-Scale DNN processor for real-Time AI,” Proceedings - International Symposium on Computer Architecture, pp. 1–14, 2018.
[4] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[5] M. Horowitz, “Computing’s energy problem (and what we can do about it),” Digest of Technical Papers - IEEE International Solid-State Circuits Conference, vol. 57, pp. 10–14, 2014.
[6] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in . IEEE, 10 2016, pp. 1–12. [Online]. Available: http://ieeexplore.ieee.org/document/7783722/
[7] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G. Y. Wei, and D. Brooks, “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators,” Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016, pp. 267–278, 2016.
[8] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” pp. 1–17, 3 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[9] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, “Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks,” 6 2017. [Online]. Available: http://arxiv.org/abs/1706.07853
[10] M. Courbariaux and I. Hubara, “Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1,” 2014.
[11] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “BinarEye: An Always-On Energy-Accuracy-Scalable Binary CNN Processor With All Memory On Chip In 28nm CMOS,” no. Ld, pp. 2–5.
[12] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao,” in Proceedings of the 19th international conference on Architectural support for programming languages and operating systems - ASPLOS ’14. New York, New York, USA: ACM Press, 2014, pp. 269–284. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2541940.2541967
[13] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, “A Multi-Functional In-Memory Inference Processor Using a Standard 6T SRAM Array,” IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642–655, 2018.
[14] E. H. Lee and S. S. Wong, “Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 261–271, 1 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7579580/
[15] H. Li, T. F. Wu, S. Mitra, and H. S. Wong, “Resistive RAM-Centric Computing: Design and Modeling Methodology,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 9, pp. 2263–2273, 2017.
[16] S. Skrzyniarz, L. Fick, J. Shah, Y. Kim, D. Sylvester, D. Blaauw, D. Fick, and M. B. Henry, “24.3 A 36.8 2b-TOPS/W self-calibrating GPS accelerator implemented using analog calculation in 65nm LP CMOS,” in . IEEE, 1 2016, pp. 420–422. [Online]. Available: http://ieeexplore.ieee.org/document/7418086/
[17] Z. Wang and N. Verma, “A Low-Energy Machine-Learning Classifier Based on Clocked Comparators for Direct Inference on Analog Sensors,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 11, pp. 2954–2965, 2017.
[18] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai, Y. Oowaki, and Y. Unekawa, “An LDPC Decoder With Time-Domain Analog and Digital Mixed-Signal Processing,” IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 73–83, 1 2014. [Online]. Available: http://ieeexplore.ieee.org/document/6630119/
[19] G. Li, Y. M. Tousi, A. Hassibi, and E. Afshari, “Delay-Line-Based Analog-to-Digital Converters,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 6, pp. 464–468, 6 2009. [Online]. Available: http://ieeexplore.ieee.org/document/5075832/
[20] A. Sayal, S. Fathima, S. S. Nibhanupudi, and J. P. Kulkarni, “14.4 All-Digital Time-Domain CNN Engine Using Bidirectional Memory Delay Lines for Energy-Efficient Edge Computing,” Digest of Technical Papers - IEEE International Solid-State Circuits Conference, vol. 2019-Febru, no. 4, pp. 228–230, 2019.
[21] A. Sayal, S. S. Nibhanupudi, S. Fathima, and J. P. Kulkarni, “A 12.08-TOPS/W All-Digital Time-Domain CNN Engine Using Bi-Directional Memory Delay Lines for Energy Efficient Edge Computing,” IEEE Journal of Solid-State Circuits, vol. 55, no. 1, pp. 60–75, 2020.
[22] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, “A Neuromorphic Chip Optimized for Deep Learning and CMOS Technology With Time-Domain Analog and Digital Mixed-Signal Processing,” IEEE Journal of Solid-State Circuits, vol. 52, no. 10, pp. 2679–2689, 2017.
[23] S. Gopal, P. Agarwal, J. Baylon, L. Renaud, S. N. Ali, P. P. Pande, and D. Heo, “A Spatial Multi-Bit Sub-1-V Time-Domain Matrix Multiplier Interface for Approximate Computing in 65-nm CMOS,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 506–518, 2018.
[24] A. Abidi, “Phase Noise and Jitter in CMOS Ring Oscillators,”