A Low-Power Domino Logic Architecture for Memristor-Based Neuromorphic Computing
Cory Merkel and Animesh Nikam
Brain Lab, Rochester Institute of Technology, Rochester, New York
ABSTRACT
We propose a domino logic architecture for memristor-based neuromorphic computing. The design uses the delay of memristor RC circuits to represent synaptic computations and a simple binary neuron activation function. Synchronization schemes are proposed for communicating information between neural network layers, and a simple linear power model is developed to estimate the design's energy efficiency for a particular network size. Results indicate that the proposed architecture can achieve 0.61 fJ per classification per component (neurons and synapses) and outperforms other designs in terms of energy per % accuracy.
KEYWORDS
Memristor, neuromorphic, low-power
ACM Reference Format:
Cory Merkel and Animesh Nikam. 2019. A Low-Power Domino Logic Architecture for Memristor-Based Neuromorphic Computing. In ICONS '19: ACM International Conference on Neuromorphic Systems, Knoxville, TN. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/1122445.1122456
Custom neuromorphic hardware platforms are gaining popularity for the acceleration of neural network algorithms, owing to their ability to perform complex tasks in ways analogous to the physical processes underlying biological nervous systems [4]. A key feature of these systems is that they overcome the limitations caused by the von Neumann bottleneck by collocating computation and memory [6]. While modern digital complementary metal-oxide-semiconductor (CMOS) technology can replicate the behavior of neurons, the absence of a device that can efficiently perform synaptic operations stunted progress for several years. However, recent advances in nanoscale materials and the realization of devices such as memristors have opened possibilities for developing compact memory device arrays that are potentially transformative for the design of ultra energy-efficient neuromorphic systems.
ICONS '19, July 23–25, 2019, Knoxville, TN. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-9999-9/18/06...$15.00. https://doi.org/10.1145/1122445.1122456
Previous work has studied several aspects of memristor-based neuromorphic systems, including device properties, reliability, crossbar implementation, on-chip training, quantization, and much more [7, 8]. One of the most power-efficient design approaches combines memristor synapses with an integrate-and-fire (IF) neuron design. The energy efficiency of the IF neuron comes from (i) its all-or-nothing representation of information and (ii) little-to-no short-circuit current between the neuron's input and the synapses driving it (since they only drive the membrane capacitor). In this work, we explore a similar idea applied to networks of binary neurons inspired by domino logic. Domino logic, a type of dynamic logic, separates a circuit into pre-charge and evaluation phases to avoid short-circuit current and reduce power consumption. Here, we propose a domino-logic-style neuron that uses memristor-based RC delays for evaluation and offers good power efficiency. The building blocks of the proposed design are outlined in Section 2. Section 3 then discusses scaling the design up to a multi-layer neural network. In Sections 4 and 5, we detail the power consumption model and the quantization approach. Section 6 discusses results on the MNIST dataset and concludes this work.
The core building block of our design is shown in Figure 1(a). When the clock signal ϕ is low, the dynamic node (the input to the inverter) is pre-charged to V_dd. Then, during the evaluation phase, ϕ is high, and the dynamic node discharges at a rate dependent on the pull-down network's RC time constant. Once the dynamic node falls below the inverter threshold, the output goes high.

During the evaluation phase, the voltage on the dynamic node evolves as

v_{d_i}^{l}(t) = V_{dd}\,\exp\!\left(-\int_{0}^{t}\frac{G_{i}^{l}(\xi)}{C}\,d\xi\right) \quad (1)

where G_{i}^{l} is the equivalent pull-down conductance. A memristor only contributes to the pull-down conductance when its select transistor is on. Assuming that memristor conductance values are constant during the evaluation phase and that the input voltages are digital, i.e. v_{x_j}^{l-1} \in \{0, V_{dd}\}, then G_{i}^{l} is a piecewise-constant function written as

G_{i}^{l}(t) = \frac{1}{V_{dd}}\sum_{j=1}^{N^{l-1}} v_{x_j}^{l-1}(t)\,G_{ij}^{l} \quad (2)
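To make the timing behavior of Eqs. (1) and (2) concrete, the sketch below computes a neuron's binary output from the crossing time of its dynamic node, assuming the pull-down conductance is constant for the whole evaluation phase. The device values (V_dd, C, inverter threshold, evaluation period) are illustrative assumptions, not values from this paper.

```python
import math

def dynamic_node_voltage(t, g_pd, c, vdd):
    """Eq. (1) for a constant pull-down conductance g_pd:
    v_d(t) = Vdd * exp(-g_pd * t / C)."""
    return vdd * math.exp(-g_pd * t / c)

def neuron_output(g_pd, c, vdd, v_th, t_eval):
    """'1' if the dynamic node crosses the inverter threshold v_th
    within the evaluation period t_eval, else '0'."""
    if g_pd <= 0.0:  # no conducting pull-down path: node stays at Vdd
        return 0
    # Solve Eq. (1) for the time at which v_d(t) = v_th.
    t_cross = (c / g_pd) * math.log(vdd / v_th)
    return 1 if t_cross <= t_eval else 0

# Illustrative values: Vdd = 1.2 V, C = 10 fF, threshold at Vdd/2,
# 50 ns evaluation period.
VDD, C, VTH, TEVAL = 1.2, 10e-15, 0.6, 50e-9
print(neuron_output(10e-6, C, VDD, VTH, TEVAL))  # strong pull-down -> 1
print(neuron_output(1e-9, C, VDD, VTH, TEVAL))   # weak pull-down -> 0
```

Note that with constant conductance the integral in Eq. (1) collapses to a simple exponential, which is why a closed-form crossing time exists here.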
Figure 1: (a) Domino logic-style neuron circuit schematic. (b) The neuron's dynamic node does not cross the inverter threshold before the end of the evaluation period, indicating a '0'. (c) The neuron's dynamic node reaches the inverter threshold within the evaluation period, indicating a '1'. (d) Combining two domino circuits to create synapses that can be positive or negative.
In this work, the inputs to a neuron are constant during each evaluation period. In addition, each neuron's output is considered a '1' if it switches from '0' to '1' at any point during the evaluation period; otherwise, it is '0'. This is illustrated in Figures 1(b) and 1(c). In Figure 1(b), the neuron's dynamic node does not discharge to the inverter threshold before the end of the evaluation period, so the output is '0'. In contrast, the dynamic node in Figure 1(c) discharges quickly, well before the end of the evaluation period, so its output is '1'.

To achieve an inhibitory effect on the post-synaptic neuron, we introduce a second domino circuit, as shown in Figure 1(d), where the two boxes represent the circuit in Figure 1(a). The top domino circuit is an excitatory neuron, where large memristor conductances tend to drive the excitatory output to '1' and the inhibitory output to '0'. The bottom domino circuit is an inhibitory neuron, where large memristor conductances tend to drive the inhibitory output to '1' and the excitatory output to '0'. The circuit uses a built-in arbiter (cross-coupled NOR gates) to decide which neuron reached its inverter threshold first and then re-charges the dynamic node of the other neuron so that its output will be '0'. One design issue that should be considered in future work is the detection and cancellation of metastability in the feedback loop, as it could cause unstable behavior and larger power consumption. On the other hand, it may be a useful tool for implementing stochastic neuron behavior.
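The arbiter's behavior can be modeled as a race between the two dynamic nodes' crossing times. The sketch below is a behavioral abstraction only (it does not model metastability, and the tie-breaking rule in favor of the excitatory circuit is our assumption); device values are illustrative.

```python
import math

def crossing_time(g_pd, c, vdd, v_th):
    """Time for a dynamic node to discharge to v_th (from Eq. (1));
    infinity if there is no conducting pull-down path."""
    if g_pd <= 0.0:
        return math.inf
    return (c / g_pd) * math.log(vdd / v_th)

def signed_neuron(g_ex, g_in, c, vdd, v_th, t_eval):
    """(excitatory, inhibitory) outputs of the paired domino circuits:
    the first circuit to reach the threshold within t_eval wins, and the
    arbiter re-charges the loser so its output stays '0'."""
    t_ex = crossing_time(g_ex, c, vdd, v_th)
    t_in = crossing_time(g_in, c, vdd, v_th)
    if min(t_ex, t_in) > t_eval:
        return 0, 0  # neither circuit fired during this evaluation
    return (1, 0) if t_ex <= t_in else (0, 1)

# Larger excitatory conductance -> excitatory circuit wins the race.
print(signed_neuron(20e-6, 5e-6, 10e-15, 1.2, 0.6, 50e-9))  # (1, 0)
```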
However, this is outside the scope of the present work.

We can now define a linear mapping between a weight in [-1, 1] and the two conductance values associated with it:

G_{i,j,\mathrm{ex}}^{l} = \max\!\left(G_{\min},\; w_{i,j}^{l} G_{\max} + (1 - w_{i,j}^{l}) G_{\min}\right) \quad (3)

G_{i,j,\mathrm{in}}^{l} = \max\!\left(G_{\min},\; -w_{i,j}^{l} G_{\max} + (1 + w_{i,j}^{l}) G_{\min}\right) \quad (4)

The above two equations set the inhibitory conductance to G_min when the weight is positive and the excitatory conductance to G_min when the weight is negative. The excitatory conductance then ranges from G_min for a weight of 0 to G_max for a weight of +1, and the inhibitory conductance ranges from G_min for a weight of 0 to G_max for a weight of -1. Note that G_min and G_max are the minimum and maximum conductance values of the memristors and vary considerably with the type of device (i.e. material properties, fabrication process, etc.) [3]. In this work, we have chosen fixed values of G_min and G_max; however, our design will work for other conductance ranges.

Figure 2: Implementation of the synaptic weight matrix between two neural network layers using the proposed neuron design and a 1T1R memristor crossbar.

For multilayer neural networks, we propose a 1T1R memristor crossbar, as shown in Figure 2. Here, the word lines are connected to the pre-synaptic neuron outputs from the previous layer. The two terminals of each 1T1R synapse are connected to the crossbar columns, and each neuron uses two crossbar columns to implement its excitatory and inhibitory synapses. Footer transistors are used to eliminate short-circuit power consumption during pre-charge. Note that secondary pre-charge transistors may be needed to avoid charge sharing between each domino circuit's dynamic node and the drains of the footer transistors.
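The weight-to-conductance mapping of Eqs. (3) and (4) is straightforward to express directly. A minimal sketch, with an illustrative (device-dependent, not paper-specified) conductance range:

```python
def weight_to_conductances(w, g_min, g_max):
    """Eqs. (3)-(4): map a weight w in [-1, 1] to the pair (G_ex, G_in).
    The max() clamps whichever conductance would fall below G_min."""
    g_ex = max(g_min, w * g_max + (1.0 - w) * g_min)
    g_in = max(g_min, -w * g_max + (1.0 + w) * g_min)
    return g_ex, g_in

# Illustrative conductance range, in siemens.
G_MIN, G_MAX = 1e-6, 1e-4
print(weight_to_conductances(1.0, G_MIN, G_MAX))   # (G_MAX, G_MIN)
print(weight_to_conductances(-1.0, G_MIN, G_MAX))  # (G_MIN, G_MAX)
print(weight_to_conductances(0.0, G_MIN, G_MAX))   # (G_MIN, G_MIN)
```

The three printed cases confirm the text's description: a zero weight leaves both conductances at G_min, and a weight of ±1 pushes exactly one side to G_max.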
For fully-connected neural networks, the simplest design would employ one N^{l-1} x N^{l} crossbar for each layer l. However, more advanced methods will likely be needed for sharing crossbars across layers and for efficiently mapping sparsely-connected networks onto dense crossbar structures.
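At the layer level, the crossbar organization described above (two columns per post-synaptic neuron) amounts to expanding an N^{l-1} x N^l weight matrix into an N^{l-1} x 2N^l conductance array via Eqs. (3) and (4). The sketch below assumes an interleaved [ex, in, ex, in, ...] column order, which is our choice, not one stated in the paper.

```python
def layer_to_crossbar(weights, g_min, g_max):
    """Map an N_{l-1} x N_l weight matrix onto 1T1R crossbar conductances
    using Eqs. (3)-(4), with two columns (excitatory, inhibitory) per
    post-synaptic neuron. Returns an N_{l-1} x 2*N_l nested list."""
    crossbar = []
    for row in weights:
        cb_row = []
        for w in row:
            cb_row.append(max(g_min, w * g_max + (1.0 - w) * g_min))   # G_ex
            cb_row.append(max(g_min, -w * g_max + (1.0 + w) * g_min))  # G_in
        crossbar.append(cb_row)
    return crossbar

# A 2x2 weight layer occupies a 2x4 region of the crossbar.
cb = layer_to_crossbar([[1.0, -1.0], [0.5, 0.0]], 1e-6, 1e-4)
print(len(cb), len(cb[0]))  # 2 4
```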
The proposed design is based on the timing of RC delays in each domino circuit. Since each neuron's output is binary, it is important that the domino circuits do not perform an evaluation until all of their inputs are ready (i.e., the evaluation period of the inputs has completed). For this reason, it is critical to have some form of synchronization across layers. We propose three different methods. The first method uses non-overlapping clocks with varying duty cycles for each network layer in the following manner: First, all of the clocks are '1' to pre-charge all of the domino circuits. Next, the clock for the first layer becomes '0' for evaluation of the network inputs. After enough time has passed for evaluation of the first layer (this will depend on the size of the network, weight values, etc.), the clock for the second layer becomes '0', and so on, until the clock for the final layer becomes '0'. Then, the process starts over. The advantage of this approach is that no circuitry has to be added to the neuron circuits. The disadvantage is that each layer has to wait for all of the previous layers to finish before it can perform any computation. The second synchronization method is to add flip-flops to the output of each neuron. This way, the entire network can be pipelined across layers, and each neuron can perform computations on every clock cycle. Of course, the disadvantage of this approach is that it adds overhead to the neuron design. A final method is to use asynchronous handshaking across layers. In this case, a global reset signal would be asserted every time a new input arrives at the network, causing all domino circuits to be pre-charged. Then, an OR gate would be connected to each neuron's excitatory and inhibitory outputs. Once the OR gate's output becomes '1', we know that the neuron has finished evaluation.
When all such signals for a whole layer become '1' (which could be detected with an AND tree), that layer has finished evaluating, and the next layer can continue evaluation. The main advantage of this approach is that a global clock is not needed, which may significantly reduce power consumption. In this work, we performed simulations using the first method. Figure 3 shows the simulation results for a 2-input network with 2 hidden-layer neurons and 1 output. The network was trained to perform the XOR function of its inputs. The top subplot shows the clock signal, while the next two subplots show the clocks distributed to layers 2 and 3, respectively. During the first clock cycle, all of the neurons in the network pre-charge. During the second clock cycle, the second layer's clock goes low, and then the third layer's clock goes low during the third clock cycle. Therefore, the output of the network has a valid result three cycles after the input changes.
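The first synchronization scheme can be viewed as a simple per-cycle schedule: one pre-charge cycle for every layer, followed by one evaluation cycle per layer, repeating. The sketch below is a behavioral abstraction of that schedule (phase labels rather than clock polarities), matching the three-cycle latency of the XOR example with its two clocked layers.

```python
def layer_clocks(num_layers, num_cycles):
    """Per-cycle phase of each clocked layer under the staggered-clock
    scheme: cycle 0 pre-charges every layer, then layer l evaluates in
    cycle l + 1, and the pattern repeats every num_layers + 1 cycles."""
    schedule = []
    period = num_layers + 1
    for cyc in range(num_cycles):
        step = cyc % period
        if step == 0:
            schedule.append(['precharge'] * num_layers)
        else:
            schedule.append(['evaluate' if l == step - 1 else 'hold'
                             for l in range(num_layers)])
    return schedule

# XOR network: two clocked layers -> output valid after three cycles.
for cyc, phases in enumerate(layer_clocks(2, 3)):
    print(cyc, phases)
```

As in the text, a deeper network pays for this simplicity with latency: each extra layer adds one cycle to the schedule period.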
The power consumption of the proposed design was modeled by assuming that most of the power is consumed when a neuron pre-charges. The justification is that, especially for neurons with high fan-in, the switching capacitance of the neuron's dynamic node is much larger than the capacitance at other nodes in the circuit. Therefore, the power can be formulated as

P \approx 3(1 + \beta)\sum_{l=1}^{L} \alpha\, C_{l}\, V_{dd}^{2}\, f \quad (5)

where β is a fitting parameter that accounts for the extra power associated with the inverter, arbiter, etc., α is the switching activity factor, and C_l is the total switching capacitance of layer l. The factor of 3 comes from the fact that each synapse has approximately 3 units of capacitance associated with it, from the access transistor's source, its drain, and the memristor itself. Note that a unit of capacitance is calculated as C_unit = A_fet x 3.9 ε_0 / t_ox, where A_fet is the transistor channel area, ε_0 is the permittivity of free space, 3.9 is the relative permittivity of the gate oxide, and t_ox is the transistor gate oxide thickness. For an L-layer network with the chosen synchronization scheme, α = 1/L, since each neuron circuit only pre-charges once every L clock cycles. In addition, the value of C_l is C_unit times the sum of the number of synapses and neurons in each layer (both excitatory and inhibitory). We have empirically found β ≈ 0.15.

Figure 3: MLP simulation of XOR with sequential evaluation across layers.

Figure 4: Power consumption vs. network size (synapses plus neurons) for the proposed design.

In Figure 4, we show the power consumption of 100 randomly-sized 3-layer networks vs. the number of synapses and neurons in the network. For each network, both the inputs and weights were generated randomly, and the network used a clock frequency of 10 MHz. The results are based on a 130 nm bulk CMOS process [1], and all simulations were performed using Synopsys HSPICE. From this data, we estimate the energy efficiency of our design to be approximately 0.61 fJ per classification per component, where a component is either a neuron or a synapse.
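The linear power model in Eq. (5) can be evaluated in a few lines. In the sketch below, the transistor channel area, oxide thickness, and supply voltage are illustrative assumptions rather than the paper's process parameters, so the printed figure is only a rough order-of-magnitude estimate.

```python
# Sketch of the power model in Eq. (5).
EPS0 = 8.854e-12  # permittivity of free space, F/m

def unit_capacitance(a_fet, t_ox, k_ox=3.9):
    """C_unit = A_fet * k_ox * eps0 / t_ox (k_ox = 3.9 assumes SiO2)."""
    return a_fet * k_ox * EPS0 / t_ox

def layer_capacitance(n_pre, n_post, c_unit):
    """C_l: C_unit times synapses plus neurons in the layer, counting
    both excitatory and inhibitory halves."""
    synapses = 2 * n_pre * n_post
    neurons = 2 * n_post
    return c_unit * (synapses + neurons)

def network_power(layer_sizes, c_unit, vdd, f, beta=0.15):
    """Eq. (5): P ~ 3 * (1 + beta) * sum_l alpha * C_l * Vdd^2 * f,
    with alpha = 1/L for the staggered-clock synchronization scheme."""
    num_layers = len(layer_sizes) - 1  # number of weight layers, L
    alpha = 1.0 / num_layers
    c_total = sum(layer_capacitance(layer_sizes[i], layer_sizes[i + 1],
                                    c_unit)
                  for i in range(num_layers))
    return 3.0 * (1.0 + beta) * alpha * c_total * vdd ** 2 * f

# 784-64-64-10 MNIST network at a 10 MHz clock, assumed 130 nm-ish device.
c_unit = unit_capacitance(a_fet=(130e-9) ** 2, t_ox=2e-9)
print(network_power([784, 64, 64, 10], c_unit, vdd=1.2, f=10e6))
```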
Quantization methods for deep learning are becoming popular for accelerating training, reducing model size, and mapping neural networks to specialized hardware. The simplest quantization methods use rounding to reduce activation and weight precision after training, which usually results in large accuracy drops between the full-precision and quantized models. Other methods quantize weights, activations, and sometimes gradients during training, resulting in better performance [10]. In this work, we quantize only weights and activations. The core idea is to use quantized values during forward propagation and full-precision gradient estimates during backward propagation. For activations, we use a simple threshold model on the forward pass:

x = \frac{1}{2}\,\mathrm{sign}(s) + \frac{1}{2} \quad (6)

where sign(·) is 1 if the argument is non-negative and -1 otherwise. Since sign has a gradient that is zero everywhere (except at zero, where it is undefined), it stalls the backpropagation algorithm and nothing is learned. To fix this, we approximate the gradient as

\frac{\partial J}{\partial s} \approx \frac{1}{1 + \exp(-ks)}\left(1 - \frac{1}{1 + \exp(-ks)}\right) \quad (7)

where k was empirically chosen as 2. In other words, on the backward pass, the gradient is calculated as if the activation had been a logistic sigmoid function. Of course, we note that the threshold activation function is indeed a logistic sigmoid with a k value of +∞.

For weights, we use the following quantization technique:

w_q = \frac{2}{Q - 1}\,\mathrm{round}\!\left((Q - 1)\,\frac{\mathrm{clip}(w, -1, 1) + 1}{2}\right) - 1 \quad (8)

where Q is the desired number of quantization steps, round(·) rounds to the nearest integer, and clip(w, a, b) = max(a, min(b, w)), where a ≤ b. For backpropagation, we estimate the gradient as ∂J/∂w ≈ ∂J/∂w_q.

We tested our design using the MNIST dataset of handwritten digits [2], which contains 60,000 training samples and 10,000 test samples of 28 x 28 grayscale images. Our network is parameterized with 784, 64, 64, and 10 neurons for the input, first hidden, second hidden, and output layers, respectively. We used TensorFlow with Keras to perform all training and testing. We have not considered any process variations in this work, so we assume that the results of the TensorFlow simulations can be directly mapped to our circuit. In the future, we plan to explore techniques for mitigating the effects of process variations using hardware-in-the-loop training. Figure 5 shows the test accuracy vs. weight precision. We observe a large increase in accuracy from 1-bit to 2-bit precision, after which it levels off. Note that we have not used any regularization (dropout, etc.) in this work.

Table 1: Comparison of memristor-based neuromorphic designs on MNIST classification.

Ref       | Tech. Node | Accuracy | Power    | Latency | Energy/% Accuracy
[5]       | 130 nm     | 86%      | 53 mW    | 80 ns   | 4.93 x 10^-11 J/%
[9]       | 45 nm      | 92%      | 1.79 mW  | 40 ns   | 7.78 x 10^-13 J/%
This work | 130 nm     | 96%      | 0.168 mW | 400 ns  | 7.0 x 10^-13 J/%

Figure 5: Accuracy of the proposed design on the MNIST dataset with different weight precisions.

Table 1 compares this work to other memristor-based neuromorphic systems that studied MNIST classification with low-bit weight precision. The proposed design outperforms [5] by 2 orders of magnitude and is comparable to [9] in terms of energy per percent accuracy. Power results for our design are estimated from the model presented in (5). Note that [9] was simulated at a 45 nm technology node, so its dynamic power would increase at 130 nm. While these initial results are encouraging, a number of avenues for future work should be pursued to better determine the robustness of the proposed architecture, including studies of device variability and clock skew. Also of interest for future work is the exploration of pipelined and asynchronous handshaking schemes for coordination across layers.
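The quantization scheme of Eqs. (6)-(8) is compact enough to sketch directly: a hard threshold on the forward pass, a sigmoid-shaped surrogate gradient on the backward pass, and uniform weight quantization with a straight-through gradient estimate. This is our minimal restatement, not the paper's training code.

```python
import math

def binary_activation(s):
    """Forward pass, Eq. (6): x = sign(s)/2 + 1/2, with sign(0) = 1."""
    return 1.0 if s >= 0.0 else 0.0

def surrogate_grad(s, k=2.0):
    """Backward surrogate, Eq. (7): sigma(ks) * (1 - sigma(ks)),
    i.e. the gradient of a logistic sigmoid with slope k."""
    sig = 1.0 / (1.0 + math.exp(-k * s))
    return sig * (1.0 - sig)

def quantize_weight(w, q):
    """Eq. (8): quantize w to q uniform levels on [-1, 1]. On the
    backward pass the gradient is passed straight through:
    dJ/dw ~ dJ/dw_q."""
    w = max(-1.0, min(1.0, w))  # clip(w, -1, 1)
    return 2.0 * round((q - 1) * (w + 1.0) / 2.0) / (q - 1) - 1.0

print(binary_activation(-0.3), binary_activation(0.3))  # 0.0 1.0
print(quantize_weight(0.4, 3))  # 3 levels {-1, 0, 1} -> 0.0
```

Note that Python's built-in `round` uses banker's rounding at exact half-steps; a hardware implementation would fix its own tie-breaking convention.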
REFERENCES
[1] [n. d.]. Predictive Technology Models. http://ptm.asu.edu/.
[2] [n. d.]. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/index.html.
[3] Geoffrey W. Burr, Robert M. Shelby, Abu Sebastian, Sangbum Kim, Seyoung Kim, Severin Sidler, Kumar Virwani, Masatoshi Ishii, Pritish Narayanan, Alessandro Fumarola, et al. 2017. Neuromorphic computing using non-volatile memory. Advances in Physics: X 2, 1 (2017), 89–124.
[4] Rodney Douglas, Misha Mahowald, and Carver Mead. 1995. Neuromorphic analogue VLSI. Annual Review of Neuroscience 18, 1 (1995), 255–281.
[5] Hao Jiang, Kevin Yamada, Zizhe Ren, Thomas Kwok, Fu Luo, Qing Yang, Xiaorong Zhang, J. Joshua Yang, Qiangfei Xia, Yiran Chen, et al. 2018. Pulse-width modulation based dot-product engine for neuromorphic computing system using memristor crossbar array. IEEE, 1–4.
[6] S. R. Nandakumar, Shruti R. Kulkarni, Anakha V. Babu, and Bipin Rajendran. 2018. Building brain-inspired computing systems: Examining the role of nanoscale devices. IEEE Nanotechnology Magazine 12, 3 (2018), 19–35.
[7] Catherine D. Schuman, Thomas E. Potok, Robert M. Patton, J. Douglas Birdwell, Mark E. Dean, Garrett S. Rose, and James S. Plank. 2017. A survey of neuromorphic computing and neural networks in hardware. arXiv preprint arXiv:1705.06963 (2017).
[8] Changhyuck Sung, Hyunsang Hwang, and In Kyeong Yoo. 2018. Perspective: A review on memristive hardware for neuromorphic computation. Journal of Applied Physics (2018).
[10] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).