An Energy-efficient Time-domain Analog VLSI Neural Network Processor Based on a Pulse-width Modulation Approach
Masatoshi Yamaguchi, Goki Iwamoto, Hakaru Tamukoh, Takashi Morie
Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu 808-0196, Japan
Abstract.
A time-domain analog weighted-sum calculation model based on a pulse-width modulation (PWM) approach is proposed. The proposed calculation model can be applied to any type of network structure, including multi-layer feedforward networks. We also propose very-large-scale integrated (VLSI) circuits to implement the proposed model. Unlike the conventional analog voltage- or current-mode circuits used in computing-in-memory circuits, our time-domain analog circuits use transient operation in charging/discharging processes to capacitors. Since the circuits can be designed without operational amplifiers, they can operate with extremely low power consumption. However, they have to use very high-resistance devices, on the order of giga-ohms. We designed a CMOS VLSI chip to verify weighted-sum operation based on the proposed model with binary weights, which realizes the BinaryConnect model. In the chip, memory cells of static random-access memory (SRAM) are used for synaptic connection weights. Unlike in ordinary computing-in-memory circuits, high-resistance operation was realized by using the subthreshold operation region of MOS transistors. The chip was designed and fabricated using a 250-nm fabrication technology. Measurement results showed that the energy efficiency for the weighted-sum calculation was 300 TOPS/W (tera-operations per second per watt), which is more than one order of magnitude higher than that of state-of-the-art digital AI processors, even though the minimum width of interconnection used in this chip was several times larger than that in such digital processors. If state-of-the-art VLSI technology is used to implement the proposed model, an energy efficiency of more than 1,000 TOPS/W will be possible. For practical applications, the development of emerging analog memory devices such as ferroelectric-gate field-effect transistors (FeFETs) is necessary.
Keywords: time-domain analog computing, weighted sum, multiply-and-accumulate, pulse-width modulation, deep neural networks, multi-layer perceptron, artificial intelligence hardware, AI processor
Artificial neural networks (ANNs), such as convolutional deep neural networks (CNNs) [12] and multi-layer perceptrons (MLPs) [3], have shown excellent performance on various tasks including image recognition [3,11,5,27,13]. However, computation in ANNs is very heavy, which leads to high power consumption in current digital computers and even in highly parallel coprocessors such as graphics processing units (GPUs). In order to implement ANNs in edge devices such as mobile phones and personal service robots, operation with very low power consumption is required.

In ANN models, weighted summation, or the multiply-and-accumulate (MAC) operation, is an essential and heavy calculation task, and dedicated complementary metal-oxide-semiconductor (CMOS) very-large-scale integration (VLSI) processors have been developed to accomplish it [26,20,25,10,2]. As an implementation approach other than digital processors, the use of analog operation in CMOS VLSI circuits is a promising method for achieving extremely low power consumption for such calculation tasks [6,14,19,17]. In particular, computing-in-memory approaches, which achieve weighted-sum calculation utilizing the circuit of static random-access memory (SRAM), have been popular since around 2016 [18].

Although the calculation precision is limited due to non-idealities of analog operation such as noise and device mismatches, neural network models and circuits can be designed to be robust to such non-idealities [21,9,7]. On the other hand, ANN models with binarized weights or even with binarized inputs have been proposed and their comparable performance has been demonstrated, mainly in applications of image recognition [4,8]. These models facilitate the development of energy-efficient hardware implementations [19].

The time-domain analog weighted-sum calculation model was originally proposed based on mathematical spiking neuron models inspired by biological neuron behavior [15,16].
We have simplified this calculation model under the assumption of operation in analog circuits with transient states, and call its VLSI implementation approach "Time-domain Analog Computing with Transient states (TACT)." In contrast to conventional weighted-sum operation in analog voltage or current modes, the TACT approach is suitable for operation with much lower power consumption in the CMOS VLSI implementation of ANNs.

We have already proposed a device and circuit that performs time-domain weighted-sum calculation [23,28,22]. The proposed circuit consists of plural input resistive elements and a capacitor (RC circuit), which can achieve extremely low-power operation. The energy consumption could be lowered to the order of 1 fJ per operation, which is almost comparable to the calculation efficiency of the brain, as long as weighted-sum operation is considered. We also proposed a circuit architecture to implement a weighted-sum calculation with different-signed weights using two sets of RC circuits, one of which calculates positively weighted sums while the other calculates negatively weighted sums [29,30]. Using a similar time-domain approach, a vector-by-matrix multiplier using flash memory technology was proposed [1].

Fig. 1.
Weighted-sum calculation using current sources switched with PWM signals.
Weighted-sum calculation circuits using pulse-width modulation (PWM) signals have previously been proposed [24]. In this paper, we reformulate the weighted-sum calculation model based on the time-domain analog computing approach using PWM signals, called the TACT-PWM approach, and propose its application to ANNs such as MLPs and CNNs with extremely high computing energy efficiency. We also present the design and measurement results of an ANN VLSI chip fabricated using a 250-nm CMOS VLSI technology, in which the calculation results of the proposed model are compared with ordinary numerical calculation results, and we verify its very high computing efficiency.
The basic circuit configuration based on the TACT-PWM approach is shown in Fig. 1. Corresponding to input signals S_i ∈ {0, 1} in the voltage domain, each switched current source (SCS) outputs current I_i when S_i = 1. An SCS can be replaced by a resistor and a diode if the nonlinearity in the charging characteristics can be ignored. The total charge amount Q stored at the node of capacitor C, charged by N SCSs with inputs S_i, each of which has a pulse width of W_i, is expressed by

Q = \sum_{i=1}^{N} W_i I_i,   (1)

where Q can be considered as the weighted-sum calculation result with weight I_i and input W_i. The node voltage of C, V_c, is given by V_c = Q/C. The energy consumption E of this charging and discharging process is given by E = C V_c V_dd (V_dd is the supply voltage of the SCSs), where the energy for charging the input capacitance of the SCSs is not included.

The weighted-sum calculation circuit and a timing diagram of its operation are shown in Fig. 2. Here, we consider this operation as a weighted-sum calculation with same-signed weighting. The circuit consists of a weighted-sum calculation (MAC) part and a voltage-pulse conversion (VPC) part. The MAC part consists of SCSs corresponding to the inputs, and is accompanied by the parasitic wiring capacitance C_d. The VPC part consists of an SCS, two switches, and a comparator with an input capacitance C_n. Since the parasitic capacitances C_d and C_n are inevitably included in the circuit, to minimize the energy consumption of the operation, the charged capacitance C, which is equal to C_d + C_n, should be as small as possible.

The PWM inputs are given in the input period T_in; ∀i, W_i ≤ T_in, which is arbitrarily determined. If the node voltage V_c at the end of this input period is denoted by V_mac,

V_mac = \frac{Q}{C_d + C_n} = \frac{1}{C_d + C_n} \sum_{i=1}^{N} W_i I_i.   (2)

In the VPC part, the output PWM signal S_out with pulse width W_out is generated during the output period T_out. In this operation, the capacitance C is charged up by the SCS with current I_n. To minimize the energy consumption of this operation, the VPC part can be separated from the MAC part by S_n, and only C_n is charged up to the threshold voltage V_θ of the comparator. In this case, to meet the condition that 0 ≤ W_out ≤ T_out, the current I_n is given by

I_n = \frac{C_n V_θ}{T_out},   (3)

which means that the node voltage V_n increases with a slope of V_θ/T_out. When V_n > V_θ, the comparator output S_out = 1, and after the end of the output period, V_n is reset by S_rst to the resting state, which is usually zero. Thus, the pulse width of the output signal as the result of the weighted-sum calculation is given by

W_out = \frac{V_mac}{V_θ} T_out   (4)
      = \frac{T_out}{(C_d + C_n) V_θ} \sum_{i=1}^{N} W_i I_i,   (5)

where it is assumed that 0 ≤ Q ≤ (C_d + C_n) V_θ.

If the same input line structures are used for the positive and negative weights, the denominator of Eq. (5) is common. Thus, positively and negatively weighted calculations are performed separately on different lines, and by subtracting W_out for the negative weighting from that for the positive one, the total calculation result is obtained as follows:

W^+_out − W^−_out = \frac{T_out}{(C_d + C_n) V_θ} \left[ \sum_{i=1}^{N^+} W^+_i I^+_i − \sum_{i=1}^{N^−} W^−_i I^−_i \right],   (6)

N = N^+ + N^−,   (7)

where W^±_out are the pulse widths of the output signals with positive and negative weighting, respectively. Since the obtained result can be fed into the next circuit corresponding to the next layer of the network via a nonlinear transform operation, calculations for ANNs can be achieved.
Fig. 2.
Weighted-sum calculation circuit model with same-signed weighting: (a) circuit diagram and (b) timing diagram.
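The weighted-sum operation of Eqs. (1)–(7) can be checked with a short behavioral simulation. All component values below (currents, capacitances, periods) are illustrative assumptions, not the fabricated chip's parameters.

```python
# Behavioral sketch of the TACT-PWM weighted-sum model (Eqs. (1)-(7)).
# All component values are illustrative assumptions, not chip parameters.

def weighted_sum_pwm(pulse_widths, currents, C, V_theta, T_out):
    """Output pulse width W_out for one dendrite line.

    pulse_widths -- input pulse widths W_i [s]
    currents     -- SCS currents I_i [A] (the weights)
    C            -- total charged capacitance C_d + C_n [F]
    V_theta      -- comparator threshold voltage [V]
    T_out        -- output period [s]
    """
    Q = sum(W * I for W, I in zip(pulse_widths, currents))  # Eq. (1)
    assert 0.0 <= Q <= C * V_theta, "operating range assumed in Eq. (5)"
    V_mac = Q / C                                           # Eq. (2)
    return (V_mac / V_theta) * T_out                        # Eqs. (4)-(5)

# Positive and negative weights use separate lines (Eq. (6)):
C, V_theta, T_out = 1e-12, 1.0, 1e-6
W_plus = weighted_sum_pwm([0.2e-6, 0.5e-6], [1e-9, 1e-9], C, V_theta, T_out)
W_minus = weighted_sum_pwm([0.3e-6], [1e-9], C, V_theta, T_out)
W_total = W_plus - W_minus   # signed weighted-sum result as a pulse width
```

Note that the result is carried as a pulse width, so the subtraction of the two lines is itself a time-domain operation in the actual circuit.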
The total energy consumption for the MAC calculation is expressed as follows:

E_cal = E_mac + E_vpc,   (8)

E_mac = C_d V_mac V_dd + \sum_{i=1}^{N} E_i,   (9)

E_vpc = C_n (V_mac + V_θ) V_dd + E_n + \int_{0}^{T_in + T_out} P_cmp(t)\, dt,   (10)

where E_mac and E_vpc are the energy consumptions of the MAC and VPC parts, E_i and E_n are those for the switching of the SCS at each MAC part i and for the switching of the SCS at the VPC part, respectively, and P_cmp(t) is the power consumption of the comparator.
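A rough back-of-the-envelope estimate per Eqs. (8)–(10) shows why femtojoule-scale operation is plausible. The switching terms E_i, E_n and the comparator integral are dropped, and every numerical value below is an assumption for illustration, not a measured chip parameter.

```python
# Simplified energy estimate per Eqs. (8)-(10); the switching terms E_i,
# E_n and the comparator power integral are neglected. All values are
# illustrative assumptions, not measured chip parameters.

C_d = 100e-15      # parasitic dendrite-line capacitance [F] (assumed)
C_n = 10e-15       # comparator input capacitance [F] (assumed)
V_dd = 2.5         # supply voltage [V] (typical for 250-nm CMOS)
V_mac = 0.5        # MAC result voltage [V] (assumed)
V_theta = 1.0      # comparator threshold voltage [V] (assumed)
N = 100            # number of synapses per neuron (as in the chip)

E_mac = C_d * V_mac * V_dd                 # Eq. (9), switching terms dropped
E_vpc = C_n * (V_mac + V_theta) * V_dd     # Eq. (10), likewise simplified
E_cal = E_mac + E_vpc                      # Eq. (8)

# One weighted sum over N synapses counts as 2N operations
# (N multiplies + N accumulates), the usual TOPS/W convention.
tops_per_watt = (2 * N) / E_cal / 1e12
```

With these assumed capacitances the estimate lands in the region of 10^2–10^3 TOPS/W, the same order as the measured and projected efficiencies reported later in the paper.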
On the basis of our TACT-PWM circuit approach, a CMOS circuit using an SRAM cell array structure is shown in Fig. 3(a). This circuit implements a BinaryConnect neural network, which uses analog input values while the weights are binary [4].

This circuit consists of a synapse part and a neuron part. The synapse part consists of an SRAM cell array, and each synapse circuit operates as two MAC circuits. Unlike the ordinary SRAM circuits proposed in the concept of computing-in-memory, our SRAM cell circuit outputs a very low current, on the order of nano-amperes, to guarantee the time constant required in the TACT approach [29,30]; therefore, the p-type MOS field-effect transistors (pMOSFETs) M^± supply subthreshold currents to the dendrite lines D^± based on the input from the axon lines A_i, where "axon" and "dendrite" are neuroscientific terms for parts of the biological neuron. In the neuron part, two VPC circuits perform the positive and negative weighting calculations, respectively, and the subtraction result is fed into a rectified-linear-unit (ReLU) function circuit. A detailed explanation follows.

In the synapse part, each SRAM cell, shown in Fig. 3(b) and called here a binary synapse unit (BSU), performs binary weighting when receiving an input pulse S_i as the gate voltage of the pMOSFET M^± to make it operate in the subthreshold region. To perform this operation, it is necessary that the SRAM cell be set to a 0 or 1 state based on the training result of a BinaryConnect network.

The BSU has three functions: a one-bit memory, a switched current source, and a selector. The one-bit memory function is achieved by the flip-flop, which stores the binary weight w_i ∈ {+1, −1} by setting the voltages V^+_P and V^−_P as follows:

w_i = +1 if (V^+_P, V^−_P) = (V_dd, 0);  w_i = −1 if (V^+_P, V^−_P) = (0, V_dd),   (11)

where V_dd is the supply voltage. The switched current source with a selector is realized by the pMOSFETs M^± that are connected to the dendrite lines D^±, respectively.
Since the pMOSFETs M^± operate in the subthreshold region, their drain currents I^±_i are expressed as follows:

I^±_i ≈ I_0 \exp(V^±_P − V_Ai),   (12)

V_Ai = V_dd if S_i = 0;  V_Ai = V_w if S_i = 1,   (13)

where I_0 is a constant, V_Ai is the voltage of the axon line A_i, and V_w is the constant gate voltage for subthreshold operation. For example, if synapse i has a positive weight (w_i = 1) and S_i = 1, then (V^+_P, V^−_P) = (V_dd, 0), I^+_i ≈ I_0 \exp(V_dd − V_w), and I^−_i ≈ I_0 \exp(−V_w), which is negligibly small.

Fig. 3.
BinaryConnect neural network circuit based on the TACT-PWM approach: (a) schematic diagram, (b) binary synapse unit (BSU) circuit, (c) ReLU function circuit, and (d) timing diagram of the ReLU function circuit.
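The weight-dependent current steering of the BSU, Eqs. (11)–(13), can be sketched as follows. The values of I_0, V_dd, and V_w are illustrative assumptions, and the exponential follows the normalized form written in Eq. (12).

```python
import math

def bsu_currents(w_i, S_i, V_dd=2.5, V_w=1.8, I0=1e-12):
    """Drain currents (I_plus, I_minus) of one binary synapse unit.

    w_i in {+1, -1}: stored binary weight (Eq. (11))
    S_i in {0, 1}:   input pulse level
    V_dd, V_w, I0:   illustrative assumptions, not chip parameters
    """
    # Flip-flop state per Eq. (11): the weight bit sets the gate biases.
    V_P_plus, V_P_minus = (V_dd, 0.0) if w_i == +1 else (0.0, V_dd)
    V_Ai = V_dd if S_i == 0 else V_w                 # axon voltage, Eq. (13)
    I_plus = I0 * math.exp(V_P_plus - V_Ai)          # Eq. (12), normalized
    I_minus = I0 * math.exp(V_P_minus - V_Ai)
    return I_plus, I_minus

# A +1 weight with the input pulse active steers current to the D+ line:
Ip, Im = bsu_currents(w_i=+1, S_i=1)
```

The point of the sketch is the selector behavior: for w_i = +1 the D+ line receives the large subthreshold current while the D− line receives only a negligible leakage-scale current, and vice versa for w_i = −1.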
In the neuron circuit, the dendrite lines are initialized and reset to ground level by S_rst before the signals S_i are input to the synapse part. Next, the input PWM signals are given during the input time period T_in, and the capacitances C_di and C_n are charged. Then, the dendrite lines are separated from the neuron parts by S_n. At the same time, the current source I_n is connected to the capacitance C_n, and thus C_n is charged. When the node voltage of C_n, V^±_n, reaches the threshold voltage of the comparator, the output signal S^±_out is generated. The pair of output signals S^±_out is fed into the ReLU function circuit, which simply consists of logic circuits, as shown in Fig. 3(c), and the output PWM signal is generated only when W^+_out > W^−_out, as shown in Fig. 3(d).

Using TSMC 250-nm CMOS technology, we designed and fabricated a CMOS VLSI chip of our neural network circuit with ten neurons, each of which has 100 synapses. The layout results and microphotographs are shown in Fig. 4.
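The ReLU circuit of Fig. 3(c)–(d) amounts to taking the positive part of the pulse-width difference between the two VPC outputs. A minimal functional sketch (the gate-level realization used in the chip is not reproduced here):

```python
def relu_pulse_width(W_out_plus, W_out_minus):
    """Output pulse width of the ReLU function circuit.

    The output PWM signal is high only while S_out+ is high and S_out-
    is low, i.e. the output width is max(W+ - W-, 0).
    """
    return max(W_out_plus - W_out_minus, 0.0)

w1 = relu_pulse_width(0.8e-6, 0.3e-6)  # positive line wins: ~0.5 us out
w2 = relu_pulse_width(0.2e-6, 0.6e-6)  # negative line wins: no output
```

Because both operands are already pulse widths, this nonlinearity costs only a few logic gates, which is one reason the time-domain representation composes cheaply across layers.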
Table 1.
Measurement conditions and results for the power efficiency of the fabricated VLSI chip (number of synapses: 100 × 10).

Measurement results of the input-output relationship in weighted-sum calculation operations at one neuron with 100 synapses are shown in Fig. 5. As shown in Fig. 5(a), weighted-sum operation was approximately achieved and sufficient linearity was obtained. From Fig. 5(b), the deviations in the time domain are about ±20 ns, which sets the precision of the calculation relative to the microsecond-order pulse widths. However, an offset and scattering of the weighting are clearly observed in Fig. 5(a). These non-idealities are due to variations in the threshold voltages of the MOSFETs operating in the subthreshold region in the BSUs. Such variations can be compensated for by adjusting the threshold voltages if analog memory devices such as ferroelectric-gate FETs are used in the BSUs.

Measurement results of the output pulse width as a function of the weighted-sum calculation results followed by the ReLU function in one neuron with 100 synapses are shown in Fig. 6. The average error was 1.5 %, and the maximum error was about 8 %. This error can be decreased by adjusting the deviations of the threshold voltages of the MOSFETs operating in the subthreshold region.

The measurement conditions and results for the power efficiency of the fabricated VLSI chip are shown in Table 1. The power efficiency obtained from the measurement was 300 TOPS/W (tera-operations per second per watt), which is about 30 times higher than that of state-of-the-art digital AI processors, while the minimum feature size of the VLSI fabrication technology used was around 10 times larger than that in the digital AI processors. Therefore, if we used the same VLSI fabrication technology as in the digital AI processors, we could obtain a power efficiency of more than 1,000 TOPS/W, or 1 POPS/W (peta-OPS/W).

Fig. 4. VLSI layout results of the 100 × 10 BinaryConnect neural network: (a) layout result, (b) microphotograph of the circuit, and (c) chip microphotograph. A: switch and buffer array for axon lines, B: BSU array, C: neuron array, and D: buffer array for dendrite lines.

In this paper, we proposed a time-domain weighted-sum calculation model based on the TACT-PWM approach with a ReLU activation function. We also proposed VLSI circuits based on the TACT approach to implement the calculation model with extremely low energy consumption. A high energy efficiency of 300 TOPS/W was achieved by the fabricated CMOS VLSI circuit with binary weights using 250-nm CMOS VLSI technology. If we use a more advanced VLSI fabrication technology, which achieves lower parasitic capacitance, the energy efficiency will be further improved to over 1,000 TOPS/W.

However, the fabricated circuit had insufficient calculation precision, which is mainly due to the characteristic variations of subthreshold operation in MOSFETs. To improve the calculation precision and compensate for such variations, it is necessary to introduce analog memory devices.

As for the neuron parts, the measurement results of the fabricated VLSI chip suggest that the energy consumption of this part is comparable to that of the whole synapse part with 100 inputs. Therefore, it is also necessary to redesign the comparator circuit with much lower power consumption to improve the energy efficiency of the whole calculation circuit.
Acknowledgments.
This work was supported by JSPS KAKENHI Grant Nos. 22240022 and 15H01706. Part of the work was carried out under a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and under the Collaborative Research Project of the Institute of Fluid Science, Tohoku University. The circuit design was supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Cadence Design Systems, Inc., Mentor Graphics, Inc., and Synopsys, Inc.
Fig. 5.
Measurement results of input-output characteristics: (a) averaged output pulse width and (b) deviation.
References
1. Bavandpour, M., Mahmoodi, M.R., Strukov, D.B.: Energy-efficient time-domain vector-by-matrix multiplier for neurocomputing and beyond. CoRR abs/1711.10673 (2017), http://arxiv.org/abs/1711.10673
2. Biswas, A., Chandrakasan, A.P.: Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 488–489 (2018)
3. Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, big, simple neural nets for handwritten digit recognition. Neural Comp. 22(12), 3207–3220 (2010)
4. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems. pp. 3123–3131 (2015)
Fig. 6.
Measurement results of output pulse widths for the combination of random weights and inputs. Timing jitters were decreased by averaging output signals over 50 measurement results. The horizontal axis shows numerical calculation values of \sum_{i=1}^{N=50} w_i · W_i / T_in, where w_i ∈ {+1, −1} and 0 ≤ W_i/T_in ≤ 1.

5. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Analysis and Machine Intelligence 35(8), 1915–1929 (2013)
6. Fick, L., Blaauw, D., Sylvester, D., Skrzyniarz, S., Parikh, M., Fick, D.: Analog in-memory subthreshold deep neural network accelerator. In: Proc. of IEEE Custom Integrated Circuits Conf. (CICC). pp. 1–4 (2017)
7. Guo, X., Bayat, F.M., Prezioso, M., Chen, Y., Nguyen, B., Do, N., Strukov, D.B.: Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells. In: Proc. of IEEE Custom Integrated Circuits Conf. (CICC). pp. 1–4 (2017)
8. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(1), 6869–6898 (2017)
9. Indiveri, G.: Computation in neuromorphic analog VLSI systems. In: Proc. of Italian Workshop on Neural Nets (WIRN). pp. 3–19 (2001)
10. Khwa, W.S., Chen, J.J., Li, J.F., Si, X., Yang, E.Y., Sun, X., Liu, R., Chen, P.Y., Li, Q., Yu, S., Chang, M.F.: A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 496–498 (2018)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
12.
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
13. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
14. Lee, E.H., Wong, S.S.: A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 418–419 (2016)
15. Maass, W.: Fast sigmoidal networks via spiking neurons. Neural Comp. 9, 279–304 (1997)
16. Maass, W.: Computing with spiking neurons. In: Maass, W., Bishop, C.M. (eds.) Pulsed Neural Networks. pp. 55–85. MIT Press (1999)
17. Mahmoodi, M.R., Strukov, D.: An ultra-low energy internally analog, externally digital vector-matrix multiplier based on NOR flash memory technology. In: Proc. of Design Automation Conf. (DAC). p. 22 (2018)
18. Milojicic, D., Bresniker, K., Campbell, G., Faraboschi, P., Strachan, J.P., Williams, S.: Computing in-memory, revisited. In: IEEE 38th International Conference on Distributed Computing Systems (ICDCS). pp. 1300–1309 (2018)
19. Miyashita, D., Kousai, S., Suzuki, T., Deguchi, J.: A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE J. Solid-State Circuits 52(10), 2679–2689 (2017)
20. Moons, B., Uytterhoeven, R., Dehaene, W., Verhelst, M.: ENVISION: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 246–247 (2017)
21. Morie, T., Amemiya, Y.: An all-analog expandable neural network LSI with on-chip backpropagation learning. IEEE J. Solid-State Circuits 29(9), 1086–1093 (1994)
22. Morie, T., Liang, H., Tohara, T., Tanaka, H., Igarashi, M., Samukawa, S., Endo, K., Takahashi, Y.: Spike-based time-domain weighted-sum calculation using nanodevices for low power operation. In: 16th Int. Conf. on Nanotechnology (IEEE NANO). pp.
390–392 (2016)
23. Morie, T., Sun, Y., Liang, H., Igarashi, M., Huang, C., Samukawa, S.: A 2-dimensional Si nanodisk array structure for spiking neuron models. In: IEEE Proc. of Int. Symp. Circuits and Systems (ISCAS). pp. 781–784 (2010)
24. Nagata, M., Funakoshi, J., Iwata, A.: A PWM signal processing core circuit based on a switched current integration technique. IEEE J. Solid-State Circuits 33(1), 53–60 (1998)
25. Shin, D., Lee, J., Lee, J., Yoo, H.: DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 240–241 (2017)
26. Sim, J., Park, J.S., Kim, M., Bae, D., Choi, Y., Kim, L.S.: A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems. In: IEEE Int. Solid-State Circuits Conf. (ISSCC). pp. 264–265 (2016)
27. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 1–9 (2015)
28. Tohara, T., Liang, H., Tanaka, H., Igarashi, M., Samukawa, S., Endo, K., Takahashi, Y., Morie, T.: Silicon nanodisk array with a fin field-effect transistor for time-domain weighted sum calculation toward massively parallel spiking neural networks. Appl. Phys. Express 9, 034201-1-4 (2016)
29. Wang, Q., Tamukoh, H., Morie, T.: Time-domain weighted-sum calculation for ultimately low power VLSI neural networks. In: Proc. Int. Conf. on Neural Information Processing (ICONIP). pp. 240–247 (2016)
30. Wang, Q., Tamukoh, H., Morie, T.: A time-domain analog weighted-sum calculation model for extremely low power VLSI implementation of multi-layer neural networks. CoRR abs/1810.06819 (2018)