An Ultra Fast Low Power Convolutional Neural Network Image Sensor with Pixel-level Computing
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. X, NO. X, DECEMBER 2020
Ruibing Song, Student Member, IEEE, Kejie Huang, Senior Member, IEEE, Zongsheng Wang, Student Member, IEEE, and Haibin Shen
Abstract—The separation of data capture and analysis in modern vision systems has led to a massive amount of data transfer between the end devices and cloud computers, resulting in long latency, slow response, and high power consumption. Efficient hardware architectures are under focused development to enable Artificial Intelligence (AI) at the resource-limited end sensing devices. This paper proposes a Processing-In-Pixel (PIP) CMOS sensor architecture, which allows convolution operation before the column readout circuit to significantly improve the image reading speed with much lower power consumption. The simulation results show that the proposed architecture enables convolution operation (kernel size = 3 × 3, stride = 2, input channels = 3, output channels = 64) in a 1080P image sensor array with only 22.62 mW power consumption. In other words, the computational efficiency is 4.75 TOPS/w, which is about 3.6 times higher than the state-of-the-art.
Index Terms—processing-in-pixel, visual perception, convolutional neural network, CMOS image sensor.
I. INTRODUCTION

COMPUTER vision, which trains computers to interpret and understand the visual world, is one of the research hotspots in computer science and Artificial Intelligence (AI). With the rapid development of machine learning technologies, Convolutional Neural Networks (CNNs) have outperformed previous state-of-the-art techniques in computer vision tasks such as object detection [1], face recognition [2], video compression [3], motion transfer [4], etc.

Although CNNs have significantly improved visual systems' performance, they require many operations and much storage, making it difficult for end devices to complete the computation independently. Therefore, in modern visual systems, data capture and analysis are carried out separately by sensing devices and cloud computers. The separation of data capture and analysis has led to a tremendous amount of data transfer between the end devices and the cloud computers, resulting in long delay, slow response, and high power consumption [5]. What's more, in many vision applications, such as surveillance cameras, the systems have to work continuously for monitoring or anomaly detection. The low information density seriously wastes communication bandwidth, data storage, and computing resources in such applications.
K. Huang and H. Shen are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, and also with Zhejiang Lab, Building 10, China Artificial Intelligence Town, 1818 Wenyi West Road, Hangzhou City, Zhejiang Province, China, email: [email protected]; [email protected]. R. Song and Z. Wang are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, email: [email protected]; [email protected].
To improve the efficiency of modern vision systems, researchers are focusing on reducing the readout power consumption or the data density of sensors [6]–[11]. One of the most promising methods is to move the processing units much closer to the sensing units. Equipping a CMOS Image Sensor (CIS) with a neural network processor can be divided into three categories: (1) Processing-Near-Sensor (PNS) with Deep Learning Accelerators (DLA); (2) Processing-In-Sensor (PIS); and (3) Processing-In-Pixel (PIP). The PNS architecture utilizes an on-chip DLA to shorten the physical distance between the processor and the image sensor [12]–[14]. The PIS architecture is proposed to reduce the data transfer distance, read operations, and analog-to-digital conversions. For example, RedEye performs several layers of CNN calculation in the CIS with additional analog arithmetic circuits before readout, saving 85% energy due to the reduced read operations [15]. However, it needs many analog capacitors for data storage, leading to a large area overhead and low computational efficiency. PIP is a fully integrated architecture that enables sensing and computing simultaneously. However, existing PIP designs may only support low-level processing [16] or need complicated pixel circuits, which lead to excessive area and power consumption [17], [18].

We propose a novel PIP architecture that enables high-precision convolutional neural network computation in pixels to address the limitations mentioned above. The multiplication is achieved by pulse modulation during the exposure period. The accumulation is done by charge redistribution at the pixel level. The whole pixel array is organized to support 3 × 3 convolution kernels. Our proposed architecture supports 1080P computation at 60 frames per second when the output channel size is 64. It consumes only 22.62 mW and achieves a computational efficiency of up to 4.75 TOPS/w, which is about 2.6 times higher than the state-of-the-art. Our proposed splitting technique enables the realization of other kernel sizes.

This paper is organized as follows: Section II presents the related works. Section III introduces the detailed design of our proposed scheme, including the overall architecture, the pixel circuit, the MAC operation, the array convolution, and the implementation of other convolution kernel sizes. Section IV analyzes the simulation results, and finally the conclusion is drawn in Section V.
Fig. 1. Four types of CIS pixel circuit: (a) PPS, (b) APS-3T, (c) APS-4T, and (d) APS-1.75T (shared readout circuit).
II. BACKGROUND AND RELATED WORK
A. CMOS Image Sensor
A pixel is the primary component in a CIS; it converts optical signals into electrical signals by a photodiode. Fig. 1 shows four types of pixel circuits according to [19].

As shown in Fig. 1(a), the Passive Pixel Sensor (PPS) was the early mainstream CIS technology, consisting of a photodiode and a row-selection transistor. The output of a PPS is a current signal, which is converted to a voltage signal by the column charge-to-voltage amplifier and finally quantized by an Analog-to-Digital Converter (ADC). The main advantage of the PPS is its small pixel area. However, it suffers from a low Signal-to-Noise Ratio (SNR) and low readout speed.

In Active Pixel Sensors (APS), a reset transistor is used to periodically reset the photodiode, and a source-follower transistor is employed to buffer and separate the photodiode from the bit line to reduce noise. There are mainly three types of APS: APS-3T, APS-4T, and APS-1.75T. The APS-3T shown in Fig. 1(b) cannot remove the kTC noise caused by its reset. As shown in Fig. 1(c), the APS-4T (Pinned Photodiode (PPD)) includes a transfer transistor TX and a floating diffusion (FD) node to further reduce the noise by decoupling the reset and the discharge of the photodiode. Besides, the dark current of the P+NP structure is also smaller than that of the PN junction.

However, the PPD structure has four transistors, which significantly reduces the Fill Factor (FF). As a result, the photoelectric conversion efficiency and SNR are reduced. The APS-1.75T was then proposed to share the readout and reset transistors, as shown in Fig. 1(d). A total of 7 transistors are shared by four pixels, which greatly reduces the area occupied by the readout circuit in each pixel and thus dramatically improves the fill factor.
B. PNS, PIS, and PIP Architectures
To reduce the distance between the data capture and analysis, in-sensor or near-sensor computing has been widely proposed. Fig. 2 shows the block diagrams of the different architectures, including the traditional architecture, PNS, PIS, and PIP.
Fig. 2. Different architectures of visual systems: (a) traditional architecture, (b) PNS architecture, (c) PIS architecture, and (d) PIP architecture. Blue boxes represent the pixels, grey boxes the sensors, and green boxes where the calculation is conducted.
PNS architecture (Fig. 2(b)). [12] utilized 3D-stacked column-parallel ADCs and Processing Elements (PEs) to perform spatio-temporal image processing. In [20], the signals are quantized by the ramp ADCs and then computed by the on-chip stochastic-binary convolutional neural network processor. Compared with the traditional architecture shown in Fig. 2(a), PNS architectures reduce the energy consumption of data movement, but the energy consumed by the data readout and quantization is still not optimized.
PIS architecture (Fig. 2(c)). In PIS architectures, the computing units are moved before the ADC to reduce the quantization frequency. Unlike PNS, the computing in PIS is usually done in the analog domain. In [21] and [22], the proposed CIS can realize a maximum 5 × 5 convolution, with the pixel signals transferred to the in-sensor analog calculation circuit. However, both schemes only support binary neural networks.

Fig. 3. The overview of the PIP architecture. (a) The pixel circuits, (b) the structure diagram of the pixel array, and (c) the column readout circuit diagram.
PIP architecture (Fig. 2(d)). In PIP architectures, the computing units are integrated with the pixel array. [23] adopted a linear-response Pulse Width Modulation (PWM) pixel to provide a PWM signal for analog-domain convolution. The weighting for the multiplication is achieved by adjusting the current level and the integration time based on the pixel-signal pulse width, while the accumulation is implemented by current integration. However, the current level is generated by Digital-to-Analog Converters (DACs) according to the weights, which leads to extra power consumption. [17] adopted a pixel processor array-based vision sensor called SCAMP-5. Each pixel contains 13 digital registers and seven analog memory registers to achieve various operations. However, it costs too much pixel area, leading to wiring problems and a low fill factor. [24] proposed a dual-mode PIS architecture called MACSen, which has many SRAM cells and computation cells in each unit of the array, resulting in a large area and a low fill factor.

New materials and devices have also been developed for PIP architectures to improve the fill factor. [25] proposed a WSe₂ two-dimensional (2D) material neural network image sensor, which uses a 2D semiconductor photodiode array to store the network's synaptic weights. However, changing the photodiode's photosensitivity may need additional complicated digital-to-analog circuits for each pixel to enable massively parallel computing.

Mixed architecture. It is usually difficult to conduct all calculation tasks with PIS or PIP architectures. Mixed schemes are thus proposed to achieve whole neural network computing. In [26], an analog calculation circuit is always-on to achieve face detection before the ADCs. When faces are detected, the on-chip DLA performs the calculation for face recognition in the digital domain, which can be described as a PIS + PNS scheme. [27] fabricates a sensor based on a WSe₂/h-BN/Al₂O₃ van der Waals heterostructure to emulate the retinal function of simultaneously sensing and processing an image. An in-memory computing unit is added after the sensor to make up the PIP + PNS scheme.

III. PROPOSED ARCHITECTURE
In this section, our proposed PIP architecture is introduced in the following order: (A) the pixel-level circuit design that enables the MAC operation, (B) the implementation of the convolution operation in the pixel array, (C) the methods to support different kernel sizes, and (D) the workflow in the traditional mode.
A. Pixel Circuit and MAC Operation
Fig. 3 is the overview block diagram of the proposed PIP architecture. The convolution operation is realized by an array of W × H pixel units. Fig. 3(a) shows the circuit of a pixel unit. Two reset transistors, RST_x and RST_y, are shared by four adjacent pixels representing the RGGB channels. Each pixel contains an exposure control transistor, a storage capacitor, and a read control transistor. Fig. 3(b) shows the array's structure diagram, in which Convlink connects adjacent pixel units through split transistors in both the row and column directions. Signal RST_y is controlled in the column direction, while RST_x and rd are controlled in the row direction. Weights are loaded by rows. Fig. 3(c) shows the column readout circuit diagram. Each pixel in a column is connected to the same readout circuit by Convlink with a select transistor used in the convolution operation.

Fig. 4. The calculation flow diagram of the proposed architecture: the PWM-controlled exposure converts photocurrent × exposure time into charge on the pixel capacitor, charge redistribution over n capacitors performs the summation (ΣQ/(nC) = ΔU), and the result is then read out.

Fig. 5. The convolution sequence diagram of the pixel circuit.
Fig. 4 shows the calculation flow of the MAC operation under the proposed PIP architecture. The multiplication of the photocurrent and the weights is realized in the pixel unit by controlling the exposure time of the photodiodes. The exposure time is modulated by the 8-bit weights in the convolution kernel. The multiplication results are stored on the capacitors, which can be connected between different pixel units to realize summation by charge redistribution.

The timing diagram is shown in Fig. 5, which only contains four pixels for simplicity. When the signal RST is high, both RST_x and RST_y are asserted to reset the capacitors' potential to Vdd. The exposure stage starts after the reset stage, when both RST and rd are de-asserted. In this stage, the control pulses of the exposure signals w1–w4 are modulated by the convolution kernel weights, so the exposure time T is proportional to the weight value w. Since the photocurrent I_ph is unchanged over a short period, the charge Q stored on capacitor C can be expressed as

    Q = C·U_rst − I·t = C·U_rst − I·k·w,    (1)

where k is the exposure constant, adjusted by the software according to the external light intensity. The charge Q on the capacitor therefore represents the product of the photocurrent I and the corresponding weight value w in the convolution kernel.

After the exposure comes the charge redistribution and readout stage, when rd is asserted. The Convlink line redistributes the charges Q1–Q4 stored in the capacitors. According to the principle of charge redistribution, the voltage reaches a uniform value U_conv. Considering only the four pixels shown in Fig. 3(a), the value of U_conv can be expressed as

    U_conv = (Q1 + Q2 + Q3 + Q4)/(C1 + C2 + C3 + C4) = U_rst − (k/(4C))·(I1·w1 + I2·w2 + I3·w3 + I4·w4),    (2)

where k/(4C) is a known constant, so the voltage U_conv on the Convlink line represents the sum of the four multiplication results, achieving the MAC operation at the pixel level. Assuming the convolution kernel size is r × r, one of the output results of the first-layer convolution can be obtained by connecting r² such adjacent pixels by the Convlink lines, which can be expressed as

    U_conv = (Σ_{i=1..r²} Q_i)/(r²·C) = U_rst − (k/(r²·C))·Σ_{i=1..r²} I_i·w_i.    (3)

The weight precision of the convolution kernel used in the system is 8 bits; that is, the weights range from -128 to +127. The positive and negative weights of the convolution kernel can be handled by subtracting two consecutive exposures, as shown in Fig. 5.
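The pixel-level MAC of Eqs. (1)–(3) can be sketched as a behavioral model. All component values below (capacitance, exposure constant, photocurrents) are illustrative, not taken from the paper:

```python
# Behavioral model of the in-pixel MAC (Eqs. (1)-(3)). Each pixel integrates
# its photocurrent for a time proportional to the weight (Q = C*U_rst - I*k*w),
# and shorting the n capacitors together averages the stored voltages, so the
# Convlink voltage encodes U_rst - (k/(n*C)) * sum(I_i * w_i).
U_RST = 1.0      # reset voltage (V)
C = 10e-15       # per-pixel storage capacitance (F), illustrative value
K = 1e-9         # exposure constant: seconds of exposure per weight unit

def pixel_charge(i_ph, w):
    """Charge left on one capacitor after an exposure of length K*w (Eq. 1)."""
    return C * U_RST - i_ph * K * w

def conv_readout(currents, weights):
    """Convlink voltage after charge redistribution over all pixels (Eq. 3)."""
    q_total = sum(pixel_charge(i, w) for i, w in zip(currents, weights))
    return q_total / (len(currents) * C)

currents = [2.0e-9, 1.2e-9, 1.2e-9, 1.6e-9]   # photocurrents (A), illustrative
weights = [20, 45, 70, 70]                     # positive 8-bit weights

u_conv = conv_readout(currents, weights)
# closed form of Eq. (2): U_rst - k/(4C) * sum(I_i * w_i)
u_closed = U_RST - K / (4 * C) * sum(i * w for i, w in zip(currents, weights))
```

The model confirms that the averaged readout voltage matches the closed-form expression term by term; the summation itself costs no active circuitry, only the shorting of the capacitors.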
As w1 and w2 are positive, they are enabled in the first exposure period; the negative w3 and w4 are enabled in the second exposure period. The readout operation is performed after the redistribution. The digital circuits subtract the two readout results in Fig. 5 after the ADCs, which is expressed as

    U = U⁻ − U⁺ = (k/(r²·C))·(Σ I_i·w_i⁺ − Σ I_i·w_i⁻).    (4)

Eq. (4) also illustrates that Correlated Double Sampling (CDS) is realized, because the subtraction eliminates the influence of the dark current.

B. Convolution Operation in Array
After introducing the basic idea of the convolution operation, this section gives a detailed introduction to the system's overall architecture and the sliding convolution on the entire pixel array.

As can be seen from Fig. 3, the most fundamental component of the pixel array is a pixel unit containing four photocells. Split transistors separate the Convlink wires of adjacent pixel units. Each column of pixel units is served by a column readout circuit and a column ADC outside the array, which read the convolution results and convert them into digital signals. The adopted ADC is taken from [28]; it consumes 4.04 uW at a 12.5 MS/s sampling rate.

The flow of the convolution operation in the array is shown in Fig. 6. In the following example, we assume that the convolution kernel size is 3 × 3 and the stride is 2, and that the active pixel units are connected when the split transistors are closed.

Fig. 6. The flow diagram of the array convolution operation. The convolution kernel size is 3 × 3, and the stride is 2. A group contains 3 rows of capsules and is read out at once (the readout paths of the other capsules are not drawn).

The 3 × 3 connected active pixel units are defined as capsules; the whole array can thus be divided into several independent capsules. The Convlink wires connect the pixel units in each capsule. The capsules' exposure and charge redistribution (MAC operations) are enabled simultaneously in each step. We define the three rows of capsules that are read out simultaneously as a group.

As stated in the previous section, the MAC operation can be achieved by connecting the Convlink wires of all pixel units corresponding to a convolution kernel during computation. More MAC operations should be carried out simultaneously to maximize parallelism and computing throughput. Because the charge redistribution is a destructive read of the pixel values, the regions of multiple simultaneous MAC operations must be non-overlapping. The non-overlap is achieved by dividing the convolution procedure of the entire array into four steps, as shown in Fig. 6. In each step, the colored squares represent the active pixel units, and the uncolored squares represent pixel units not involved in the computation of this step. To minimize the power consumed by the photodiode reset,
RST_y and RST_x disconnect the unpainted pixel units from the adjacent units in the row and column directions, respectively. In such a scenario, all the convolution areas in one step can be calculated and read out with only one exposure. In each step, the active pixel units perform the MAC operations with the convolution kernel and calculate a quarter of the convolution results. After four steps of calculation, a complete convolution operation is finished. As we expose twice in each step for the positive and negative weights, eight exposure cycles are needed for each convolution operation.

The above convolution operation needs carefully planned hardware wiring. As shown in Fig. 7(a), when the convolution kernel size is 3 × 3 and the stride is 2, the weight wires are connected in the periodic order W1, W2, W3, W2, W1, W2, W3, W2, ... In this way, each capsule in a step sees the same wire order: "W1, W2, W3" in the first and second steps and "W3, W2, W1" in the third and fourth steps. As the minimum period of the wire order is 2, only even strides can be supported.

As each column of pixel units is connected to a column readout circuit, each capsule spans 3 column readout circuits, so the calculation results of every three rows of capsules can be read by the three readout circuits simultaneously. To achieve this readout method, pixels with row number x (x = 4n + 3, n = 0, 1, 2, 3, ...) are connected to three independent row enabling signals C1, C2, C3, as shown in Fig. 7(b). As shown in Fig. 6, signal C1 is active in the 3rd row and signal C2 in the 7th row, so that the first row of capsules can be read from the first column readout circuit while the second row of capsules is read from the second column readout circuit.

The processing sequence of the convolution operation is shown in Fig. 8. The subscript n represents the n-th group. As shown in Fig. 8(a), after a readout operation of the n-th group is finished, the signal rd_n is de-asserted, rd_{n+1} is asserted for the readout of the (n+1)-th group, and RST_n is activated to reset the n-th group.
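The partition of the array into four non-overlapping steps can be sketched as follows. The offset pattern used to assign capsules to steps is an assumption consistent with Fig. 6 (origins grouped by their row/column offsets modulo 2·stride), not a statement of the exact hardware sequencing:

```python
# Assign each 3x3 capsule origin (stride 2) to one of four steps so that
# capsules active in the same step never share a pixel unit, as required by
# the destructive charge-redistribution read.
def capsule_steps(h, w, r=3, s=2):
    steps = {}
    for i in range(0, h - r + 1, s):
        for j in range(0, w - r + 1, s):
            step_id = ((i // s) % 2, (j // s) % 2)   # one of four steps
            steps.setdefault(step_id, []).append((i, j))
    return steps

def overlap(a, b, r=3):
    """True if two r x r capsules at origins a and b share any pixel unit."""
    return abs(a[0] - b[0]) < r and abs(a[1] - b[1]) < r

steps = capsule_steps(11, 11)
```

Within one step, the origins differ by multiples of 2s = 4, which exceeds the kernel extent of 3, so no two simultaneously active capsules touch the same pixel unit.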
As shown in Fig. 8(b), a capsule is reset immediately after each readout operation and then begins the exposure for the next readout. Assuming the resolution is 1080P, the convolution kernel size is 3 × 3, and the stride is 2, each step contains 270 rows of convolution kernel results, so each step needs 90 readout operations. As shown in Fig. 6, in different steps the active capsules correspond to different pixel units. Since the next readout of the n-th group in the next step needs an extra readout cycle's delay to avoid overlapping, the next readout of a group is separated from its reset by (90−1) readout cycles. Assume the readout interval is T_rd, the number of readout operations in each step is n_rd, the time interval between two readouts of the same capsule is (n_rd − 1)·T_rd, the reset interval is T_rst, and the maximum exposure time is T_expo. As a capsule's reset and exposure stages need to finish before the next readout operation, there should be

    (n_rd − 1)·T_rd > T_rst + T_expo.    (5)

C. Universal Implementation of Convolution Kernels with Different Sizes
To support other kernel sizes with the same wires, we propose a method called "kernel splitting" to split the convolution kernel. As shown in Fig. 9(a), two sub-kernels k1 and k2 are used to form a 5 × 5 kernel: k1 includes the first 3 columns of the 5 × 5 kernel, and k2 includes the 4th–5th columns. A 5 × 5 convolution is thus computed as a 5 × 3 and a 5 × 2 convolution whose partial results are summed.
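The column-wise splitting can be checked numerically with a plain convolution model. The image, kernel values, and the helper `conv2d_valid` are illustrative test scaffolding, not part of the hardware:

```python
import random

# Check that a 5x5 convolution equals the sum of a 5x3 convolution (columns
# 0-2 of the kernel) and a 5x2 convolution (columns 3-4) applied 3 columns to
# the right -- the digital recombination assumed by the splitting scheme.
random.seed(0)
img = [[random.randint(0, 255) for _ in range(12)] for _ in range(12)]
kernel = [[random.randint(-128, 127) for _ in range(5)] for _ in range(5)]

def conv2d_valid(x, k):
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

full = conv2d_valid(img, kernel)                     # plain 5x5 convolution
k1 = [row[:3] for row in kernel]                     # 5x3 sub-kernel
k2 = [row[3:] for row in kernel]                     # 5x2 sub-kernel
part1 = conv2d_valid(img, k1)
part2 = conv2d_valid([row[3:] for row in img], k2)   # shifted by 3 columns
split = [[part1[i][j] + part2[i][j] for j in range(len(full[0]))]
         for i in range(len(full))]
```

Because convolution is linear in the kernel, splitting it by columns and re-aligning the sub-windows reproduces the full result exactly, which is why the digital summation after readout loses no precision.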
Fig. 7. (a) The wiring method for loading weights. (b) The wiring method for readout. The convolution kernel has a size of 3 × 3, and the stride is 2. Rows x with x = 4n + 3 (n ∈ N) are connected to C1, C2, and C3 (used in both the computing and traditional modes); the other rows are connected to C0 (used only in the traditional mode).

Fig. 8. The convolution operation sequence diagram of (a) the array and (b) one capsule. U represents the potential of a capacitor in the chosen capsules, RST represents the reset stage, expo represents the exposure stage, and rd represents the charge redistribution and readout stage.

After the splitting operation, each group still has 3 rows of capsules. Another two examples, for 7 × 7 and 9 × 9 kernels, are shown in Fig. 9(b) and (c). Assuming the kernel size is r × r and the stride is s, the total number of steps is ((r+1)/s)², where the ratio (r+1)/s needs to be rounded up to an integer if necessary. For a fixed height of the pixel array H (1080 in our case), the total number of output rows in each step is H/(r+1). Since each readout operation contains three output rows, the minimum ADC conversion rate can be calculated by

    f_ADC(min) = 2·n·f·H·(r+1)/(3·s²),    (6)

where f is the frame rate and n is the number of channels.
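As a numeric sanity check for the 1080P, 3 × 3, stride-2 configuration described in the text (4 steps, 270 output rows per step, 3 output rows per readout, two exposures per step), the required per-column ADC conversion rate can be counted directly. The assumption that each column ADC performs one conversion per readout of its group is ours:

```python
# Readout-count check for the 1080P, 3x3, stride-2 case quoted in the text.
H = 1080
r, s = 3, 2
frame_rate = 60        # frames per second
channels = 64          # output channels

steps = ((r + 1) // s) ** 2              # 4 non-overlapping steps
rows_per_step = H // (r + 1)             # 270 output rows per step
readouts_per_step = rows_per_step // 3   # 90: one readout per group of 3 rows
conversions_per_frame = 2 * steps * readouts_per_step  # per output channel
f_adc_min = conversions_per_frame * frame_rate * channels  # per column ADC
```

This works out to about 2.76 MS/s per column ADC, comfortably below the 12.5 MS/s sampling rate of the ADC adopted from [28].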
Fig. 9. The convolution implementations of kernel splitting. The convolution kernel size is (a) 5 × 5, (b) 7 × 7, and (c) 9 × 9.

Each step requires two exposures, for the positive and the negative weights. The real frame rate f_real is defined as the product of the frame rate and the output channel number, f × n. With a fixed maximum exposure time T_expo, the maximum real frame rate can be calculated by
    f_real(max) = s²/(2·(r+1)²·T_expo).    (7)

TABLE I. COMPARISON OF THE DIRECT AND SPLITTING CONVOLUTION OPERATIONS (operation condition, minimum ADC conversion rate, and maximum real frame rate).

As shown in Eq. (6), the minimum conversion rate of the ADC is proportional to the frame rate f, the channel number n, and the kernel size r, and it is inversely proportional to the stride s. As shown in Eq. (7), when the kernel size increases, the maximum real frame rate decreases. Assuming the resolution is 1080P, the stride is 2, and the maximum exposure time is 32.56 us (calculated for a kernel size of 3 × 3, a stride of 2, a frame rate of 60, and an output channel number of 64), the maximum real frame rate and the minimum ADC conversion rate are listed in Table I for kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9. f_ADC(min) is based on the calculated f_real(max) in each condition. The results show that when the kernel size increases, the conversion rate of the ADC decreases, because the real frame rate, and hence the readout operation frequency, decreases.

D. Traditional Mode and Mode Switch
In the preceding three subsections, we have introduced the realization of the convolution operations. In the convolution mode, the sensor does not output the raw image. However, the original image is vital for some applications. With appropriate control signals, the proposed CIS can also work in the traditional mode and output the raw image.

To achieve this, we set the opening time of the transistors W1–W4 to a unified length according to the external light intensity. During a read, the RD transistors of the four pixels can be selected in turn to read out the RGB data. In the pixel array, as each pixel in a column shares the same column readout circuit, each row of pixels is selected by enabling C0 or C1–C3 in turn for readout. It takes a total of 4H readout operations to read the entire pixel array and obtain the RGB three-channel image with a size of H × W.

The switch between the computing mode and the traditional mode can adopt an event-driven mechanism. When the target object is identified in the subsequent computing module's results, the CIS control mode can be switched to output the complete raw image in the traditional mode. The light intensity can also determine the exposure time to avoid overexposure or underexposure.

IV. SIMULATION RESULTS

Our proposed architecture was implemented with a generic 45 nm CMOS process. To simulate the response of the photodiode, an analytic model taken from [29] is used in the simulation. The model can be expressed as

    J_np = [q·G_L(0)·L_p/(1 − (α·L_p)²)]·[−α·L_p·e^(−α·x_j) + sinh(x_j/L_p) + A(x_j, L_p)·cosh(x_j/L_p)]
         + [q·G_L(0)/α]·[1 − e^(−α·x_dr)]·e^(−α·x_j)
         + [q·G_L(0)·L_n/((α·L_n)² − 1)]·[−A(L − x_d, L_n) + α·L_n]·e^(−α·(x_j + x_dr)),    (8)

where

    A(x, y) = (e^(−α·x) − cosh(x/y))/sinh(x/y),    (9)

    G_L(0) = α·P_in·λ·η·(1 − R)/(h·c).    (10)

Fig. 10. Simulation results of a single photodiode based on the model given in Eqs. (8)–(10), for input powers P_in of 50, 100, 500, 1000, 5000, and 10000 W/m².

Fig. 10 shows the simulation results of the photodiode model in our proposed pixel circuit. First, all the capacitors and the diode are reset to Vdd (1 V). After exposures with different light intensities, the voltages decline at different speeds. The results show that the potential should be held above 0.5 V to ensure linearity.

A. Circuit Function Verification
The transient simulation result to verify the correctness of our proposed pixel circuits' MAC operation is shown in Fig. 11. Signal RST represents both RST_x and RST_y. Different values of P_in and weights are set to verify the multiplication and accumulation.

Fig. 11. Simulation result of our proposed pixel circuit.

At 0 us, the reset stage begins after RST, rd, and w1–w4 are asserted to reset the capacitors to 1 V. At t1, the exposure stage begins after RST and rd are de-asserted. According to the weights, w1–w4 stay open for the corresponding exposure times, and the voltages on the capacitors U_C1–U_C4 decrease at different speeds determined by the input light power. This stage achieves the multiplication operation between the input light power and the weights. At t2, the charge redistribution and readout stage begins when rd is asserted to redistribute the four capacitors' charge. U_C1–U_C4 reach a unified voltage level in a short time. This stage achieves the average operation. Therefore, the MAC operation is achieved after the multiplication in the exposure stage and the averaging in the charge redistribution and readout stage.

In this case, the input light powers of the four photodiodes are set to 1000, 600, 600, and 800 W/m², while the weights are set to 20, 45, 70, and 70, respectively. With a fitting slope of -0.00465955, the theoretical values of U_C1–U_C4 after the multiplication operation are 906.8, 874.2, 804.3, and 739.1 mV, respectively. The simulation results shown in Fig. 11 are 906.2, 874.0, 805.2, and 738.7 mV, which are consistent with the theoretical values.

B. MAC Operation Linearity Simulation
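Before turning to the sweep results, the theoretical voltages quoted in the previous subsection can be reproduced from the fitted linear model. Interpreting the fitting slope's units as mV per (W/m² · weight unit) is our assumption; it matches the quoted numbers:

```python
# Reproduce the quoted theoretical voltages from the fitted linear model
# U_C = U_rst + slope * P_in * w, with the fitting slope -0.00465955
# reported in Sec. IV-A.
SLOPE = -0.00465955      # mV per (W/m^2 * weight unit), assumed unit reading
U_RST_MV = 1000.0        # reset level (mV)

def exposure_voltage_mv(p_in, weight):
    return U_RST_MV + SLOPE * p_in * weight

powers = [1000, 600, 600, 800]   # input light powers (W/m^2)
weights = [20, 45, 70, 70]
u_mv = [exposure_voltage_mv(p, w) for p, w in zip(powers, weights)]
```

The four computed voltages agree with the 906.8, 874.2, 804.3, and 739.1 mV quoted above to within rounding, which is the same linearity that the sweeps below quantify.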
Fig. 12 shows the relationship between U_out and (a) the weight and (b) the input optical power, where U_out is the readout voltage after the MAC operation. Linear fittings of both figures show that the R² values are all above 0.999, indicating that the proposed CIS architecture achieves high linearity and accuracy.

Fig. 13 shows the Differential Nonlinearity (DNL) and Integral Nonlinearity (INL) simulation results obtained by the code density measurement. The simulated DNLs (INLs) in terms of the weight and the input light power are +0.0755/−0.0206 LSB (+0.2334/−0.7242 LSB) and +0.0210/−0.0061 LSB (+0.3560/−1.1947 LSB), respectively. The DNLs are all below 1 LSB, which means there are no missing codes.

C. Performance Analysis
The power consumption and performance comparison under different conditions are shown in Table II. The array size in all cases is 1920 × 1080.

D. Analysis of the Robustness
Operations in the analog domain are affected by undesirable factors such as noise and variations. In this section, we analyze the effects of these factors in detail.
1) Device Variation:
As shown in Fig. 14, the schematic of the CIS computation parts can be simplified as a photodiode with a capacitor and two switches, $w_i$ and $rd$, in each pixel. For an $r \times r$ convolution kernel, $r \times r$ pixels are connected to the same readout circuit, which includes a source follower transistor. After reset and exposure, $V_{Ci}$ is stored on $C_i$. When the signal $rd$ is set high, the voltages $V_{Ci}$ connected in a kernel are averaged due to charge sharing. In the ideal case, the readout voltage can be formulated as

V_{out} = \frac{\sum_i C_i V_{Ci}}{\sum_i C_i} - (V_{thi} + V_{od})   (11)

where $V_{thi}$ is the threshold voltage of the source follower transistor in the readout circuit, $V_{od}$ is the overdrive voltage, and $C_i$ is the capacitor in each pixel. $V^+_{out}$ and $V^-_{out}$ are the output voltages after charge sharing for positive and negative weights, respectively. As described in Section III, the final output is obtained by the digital circuit subtracting the two voltages. The nominal capacitance of $C_i$ is $C$. We can now describe the noise, variation, and mismatch factors considered in our analysis.

Firstly, noise in integrated circuits, such as thermal noise, flicker noise, and environmental noise, can be treated together as additive Gaussian noise on the dynamic capacitance [11], as depicted in red in Fig. 14. Therefore,

V'_{Ci} = V_{Ci} + n_{Ci} = V_{PDi} + n_{pdi} + n_{ci}   (12)

where $V_{PDi}$ and $n_{pdi}$ are the photodiode component of $V_{Ci}$ and its random noise, respectively, and $n_{ci}$ is the random noise on the capacitor. All noises follow the normal distribution $N(0, \sigma^2_{noise})$.

Mismatch refers to the different deviations between devices. It affects the threshold voltage of the source follower transistor and the capacitance of $C_i$. We can formulate it as

V_{thi} = V_{th}(1 + \beta_{ti}), \qquad C_i = C(1 + \beta_{ci})   (13)

where $\beta_{ti}$ and $\beta_{ci}$ refer to the deviations of the devices, both of which follow the normal distribution $N(0, \sigma^2_{mismatch})$.

Fig. 12. Simulation of MAC operations. (a) Readout voltage for ADCs versus weight for different values of input optical power. (b) Readout voltage for ADCs versus input optical power for different values of weight. Linear fits in both panels have R values between 0.9993 and 0.9999.

Fig. 13. The simulated DNL and INL in terms of (a) the weights (average input optical power set to 500 W/m²) and (b) the input optical power in W/m² (average weight set to 100).

TABLE II
POWER CONSUMPTION ANALYSIS. The amount of MAC operations is calculated according to the general calculation principle of the convolution operation, that is, the product of the convolution kernel size $r^2$, the output feature size $H_s W_s$, the number of input channels, and the number of output channels $n$.

Condition           Power (mW)   Efficiency (TOPS/W)   FoM (pJ/pixel/frame)
60 FPS, 3×3, s=2      22.62            4.75                  2.85
120 FPS, 3×3, s=2     37.48            5.73                  2.36

TABLE III
PERFORMANCE COMPARISON (the rightmost column is this work)

Technology:   90 nm/60 nm | 180 nm | 180 nm | 180 nm | 45 nm
Array Size:   ×976 | 128×128 | 32×32 | 32×32 | 1920×1080
Feature:      Spatial-temporal processing | 1st-layer CNN, ED, Blur, Sharpen | 1st-layer BNN | 1st-layer BNN | 1st-layer CNN
Processing:   Digital | Analog | Analog | Analog | Analog
Memory:       Yes (Digital) | No | No | Yes (Analog) | No
Power:        363 mW | 91 µW | 12.16 µW | 1.8 mW | 22.62 mW

Fig. 14. Effects of variation, mismatch, and noise. Red represents the voltage nodes influenced by additive Gaussian noise. Blue represents the thresholds and capacitances influenced by process variation and mismatch.
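The charge-sharing readout of Eqs. (11)-(13) can be sketched behaviorally as follows. This is a minimal model of ours, not the authors' simulation setup; all parameter values (nominal capacitance, $V_{th}$, $V_{od}$, mismatch sigma) are illustrative.

```python
import random

# Behavioral sketch of Eqs. (11)-(13): each capacitor C_i = C*(1 + beta_ci)
# holds a pixel voltage V_Ci; charge sharing produces their
# capacitance-weighted average, and the source follower output sits
# (V_th + V_od) below that average.

def charge_sharing_readout(v_ci, c_nominal=1.0, sigma_mismatch=0.0,
                           v_th=0.5, v_od=0.1, rng=random):
    # Capacitor mismatch as a multiplicative Gaussian factor, Eq. (13).
    caps = [c_nominal * (1.0 + rng.gauss(0.0, sigma_mismatch)) for _ in v_ci]
    # Capacitance-weighted average after charge sharing, Eq. (11).
    shared = sum(c * v for c, v in zip(caps, v_ci)) / sum(caps)
    return shared - (v_th + v_od)

# With no mismatch, the readout is the plain average minus the drop:
ideal = charge_sharing_readout([0.6, 0.8, 1.0, 1.2])   # mean 0.9 - 0.6 = 0.3
```

Setting `sigma_mismatch` > 0 perturbs each weight multiplicatively, matching the $(1 + \beta_{ci})$ factor analyzed in the next subsection.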
2) Computation Error Analysis:
Given Eqs. (11)-(13), the output V_out can be formulated as

V_{out} = V^+_{out} - V^-_{out} = \frac{\sum_i (1+\beta_{ci})(w_i x_i + n_{all})}{\sum_i (1+\beta_{ci})}   (14)

where $n_{all} = n^+_{pdi} + n^+_{ci} - n^-_{pdi} - n^-_{ci}$. Because the four terms are independently distributed, $n_{all}$ follows the normal distribution $N(0, 4\sigma^2_{noise})$. $\beta_{ci}$ follows a normal distribution with zero mean, so the denominator approaches the number of connected photodiodes, $4r^2$ (four photodiodes per pixel and $r^2$ pixels), and Eq. (14) can be simplified as

V_{out} = \frac{\sum_i (1+\beta_{ci})(w_i x_i + n_{all})}{4r^2}   (15)

Eq. (15) shows that (1) the impacts of all noise sources can be considered together as one random noise value $n_{all}$ added to each pixel, which follows the normal distribution $N(0, 4\sigma^2_{noise})$; (2) the mismatch across the capacitors in each pixel acts as a multiplicative factor $(1 + \beta_{ci})$ on the output data; (3) the impact of the devices' global process variation can be ignored because of the charge sharing and subtraction operations.

Compared with the traditional design, the sharing operations have the extra benefit of increasing the SNR. The effect of the random additive noise in Eq. (15) can be expanded as

V_{out} = \frac{\sum_i w_i x_i}{4r^2} + \frac{\sum_i n_{all}}{4r^2}   (16)

As $\sum_i w_i x_i / (4r^2)$ is the desired output, $\sum_i n_{all} / (4r^2)$ is the additive noise. When the convolution kernel size is 3 × 3 ($r = 3$), 9 pixels are connected and each pixel has four photodiodes, so $4r^2 = 36$, which means $\sum n_{all}$ follows the normal distribution $N(0, 36\sigma^2)$, where $\sigma^2 = 4\sigma^2_{noise}$ is the variance of $n_{all}$ for a single photodiode. The noise power and SNR can then be calculated as

noise = E\left[\left(\frac{\sum n_{all}}{36}\right)^2\right] = \frac{1}{36^2} D\left(\sum n_{all}\right) = \frac{\sigma^2}{36}   (17)

SNR = \frac{power}{noise} = \frac{36 \cdot power}{\sigma^2}   (18)

As shown in Eq. (18), the SNR is 36 times as high as that of the traditional design, a 15.6 dB increase. This means smaller capacitors are acceptable in our design, so the exposure time can be decreased, which contributes to a large increase in frame rate, up to 3840 FPS.

Fig. 15. Relationship between CNN accuracy and three types of disturbance: accuracy versus SNR for capacitances distributed within (a) a 5% deviation, (b) a 10% deviation, and (c) a 20% deviation.
3) Algorithm Robustness:
Since CNNs are neural network algorithms, they are highly robust and can tolerate errors within a certain range in the input data. Through network simulation with the CIFAR-10 [30] dataset and ResNet-18, the accuracy of the CNN changes with the SNR or mismatch as shown in Fig. 15. As the proposed CIS only supports the first CNN layer, the remaining computation is performed in software. Three different distributions of the capacitors are used, as shown in Fig. 15. The results show little accuracy loss when the SNR is above 40 dB, and the typical SNR of a CIS is 40 dB - 60 dB [5].

Our proposed CIS circuit only supports the first layer of the CNN, but this layer is very important for the whole architecture's computation. Quantization or pruning of the first CNN layer usually causes a large accuracy loss, which makes it difficult to improve performance. Moreover, due to the small number of input channels, the Processing Elements (PEs) of Deep Learning Accelerators (DLAs) are often underutilized in the first layer. Therefore, this design can greatly improve the computational efficiency of the subsequent DLAs, leading to much higher performance of the whole machine vision system.

V. CONCLUSION
In this work, a PIP architecture has been proposed to perform the first convolution layer of a CNN. It supports a variety of convolution kernel sizes and parameters. The simulation results have shown that the proposed scheme functions correctly with good linearity. With a 3 × 3 convolution kernel, a stride of 2, and 64 output channels at 60 FPS and 1080P resolution, the proposed architecture consumes 22.62 mW and achieves a computational efficiency of up to 4.75 TOPS/W, which is about 3.6 times higher than the state-of-the-art. It is well suited to application scenarios with tight power budgets, such as daily monitoring and Internet of Things (IoT) terminal devices.

REFERENCES

[1] S. Mane and S. Mangale, "Moving object detection and tracking using convolutional neural networks," 2018, pp. 1809–1813.
[2] C. Ding and D. Tao, "Trunk-branch ensemble convolutional neural networks for video-based face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1002–1014, 2018.
[3] L. Wu, K. Huang, H. Shen, and L. Gao, "Foreground-background parallel compression with residual encoding for surveillance video," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2020.
[4] D. Wei, X. Xu, H. Shen, and K. Huang, "GAC-GAN: A general method for appearance-controllable human video motion transfer," IEEE Transactions on Multimedia, pp. 1–1, 2020.
[5] F. Zhou and Y. Chai, "Near-sensor and in-sensor computing," Nature Electronics, vol. 3, pp. 664–671, 2020.
[6] M. T. Chung, C. L. Lee, C. Yin, and C. C. Hsieh, "A 0.5 V PWM CMOS imager with 82 dB dynamic range and 0.055% fixed-pattern-noise," IEEE Journal of Solid-State Circuits, vol. 48, no. 10, pp. 2522–2530, 2013.
[7] X. Liu, M. Zhang, and J. Van der Spiegel, "A low-power multifunctional CMOS sensor node for an electronic facade," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 9, pp. 2550–2559, 2014.
[8] O. Kumagai, A. Niwa, K. Hanzawa, H. Kato, and Y. Nitta, "A 1/4-inch 3.9Mpixel low-power event-driven back-illuminated stacked CMOS image sensor," 2018.
[9] A. Y. Chiou and C. Hsieh, "An ULV PWM CMOS imager with adaptive-multiple-sampling linear response, HDR imaging, and energy harvesting," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 298–306, 2019.
[10] J. Choi, J. Shin, D. Kang, and D. Park, "Always-on CMOS image sensor for mobile and wearable devices," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 130–140, 2016.
[11] A. Y. Chiou and C. Hsieh, "A 137 dB dynamic range and 0.32 V self-powered CMOS imager with energy harvesting pixels," IEEE Journal of Solid-State Circuits, vol. 51, no. 11, pp. 2769–2776, 2016.
[12] T. Yamazaki, H. Katayama, S. Uehara, A. Nose, M. Kobayashi, S. Shida, M. Odahara, K. Takamiya, Y. Hisamatsu, S. Matsumoto, L. Miyashita, Y. Watanabe, T. Izawa, Y. Muramatsu, and M. Ishikawa, "4.9 A 1ms high-speed vision chip with 3D-stacked 140GOPS column-parallel PEs for spatio-temporal image processing," 2017, pp. 82–83.
[13] M. F. Amir, D. Kim, J. Kung, D. Lie, S. Yalamanchili, and S. Mukhopadhyay, "NeuroSensor: A 3D image sensor with integrated neural accelerator," 2016, pp. 1–2.
[14] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," 2015, pp. 92–104.
[15] Y. Hou, R. LiKamWa, L. Zhong, M. Polansky, and J. Gao, "RedEye: Analog ConvNet image sensor architecture for continuous mobile vision," Computer Architecture News, 2016.
[16] C. Xu, Y. Mo, G. Ren, W. Ma, X. Wang, W. Shi, J. Hou, K. Shao, H. Wang, P. Xiao, Z. Shao, X. Xie, X. Wang, and C. Yiu, "5.1 A stacked global-shutter CMOS imager with SC-type hybrid-GS pixel and self-knee point calibration single frame HDR and on-chip binarization algorithm for smart vision applications," 2019, pp. 94–96.
[17] L. Bose, J. Chen, S. J. Carey, P. Dudek, and W. Mayol-Cuevas, "Fully embedding fast convolutional networks on pixel processor arrays," 2020.
[18] T. Hsu, Y. Chiu, W. Wei, Y. Lo, C. Lo, R. Liu, K. Tang, M. Chang, and C. Hsieh, "AI edge devices using computing-in-memory and processing-in-sensor: From system to device," 2019, pp. 22.5.1–22.5.4.
[19] A. El Gamal and H. Eltoukhy, "CMOS image sensors," IEEE Circuits and Devices Magazine, vol. 21, no. 3, pp. 6–20, 2005.
[20] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 13–18.
[21] Z. Chen, H. Zhu, E. Ren, Z. Liu, K. Jia, L. Luo, X. Zhang, Q. Wei, F. Qiao, X. Liu, and H. Yang, "Processing near sensor architecture in mixed-signal domain with CMOS image sensor of convolutional-kernel-readout method," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 2, pp. 389–400, 2020.
[22] T. Ma, K. Jia, X. Zhu, F. Qiao, Q. Wei, H. Zhao, X. Liu, and H. Yang, "An analog-memoryless near sensor computing architecture for always-on intelligent perception applications," 2019, pp. 150–155.
[23] T. Hsu, Y. Chen, T. Wen, W. Wei, Y. Chen, F. Chang, H. Kim, Q. Chen, B. Kim, R. Liu, C. Lo, K. Tang, M. Chang, and C. Hsieh, "A 0.5V real-time computational CMOS image sensor with programmable kernel for always-on feature extraction," 2019, pp. 33–34.
[24] H. Xu, Z. Li, N. Lin, Q. Wei, F. Qiao, X. Yin, and H. Yang, "MACSEN: A processing-in-sensor architecture integrating MAC operations into image sensor for ultra-low-power BNN-based intelligent visual perception," IEEE Transactions on Circuits and Systems II: Express Briefs, pp. 1–1, 2020.
[25] L. Mennel, J. Symonowicz, S. Wachter, D. K. Polyushkin, and T. Mueller, "Ultrafast machine vision with 2D material neural network image sensors," Nature, vol. 579, no. 7797, pp. 62–66, 2020.
[26] K. Bong, S. Choi, C. Kim, D. Han, and H. Yoo, "A low-power convolutional neural network face recognition processor and a CIS integrated with always-on face detector," IEEE Journal of Solid-State Circuits, vol. 53, no. 1, pp. 115–123, 2018.
[27] S. Wang, C.-Y. Wang, P. Wang, C. Wang, Z.-A. Li, C. Pan, Y. Dai, A. Gao, C. Liu, J. Liu, H. Yang, X. Liu, B. Cheng, K. Chen, Z. Wang, K. Watanabe, T. Taniguchi, S.-J. Liang, and F. Miao, "Networking retinomorphic sensor with memristive crossbar for brain-inspired visual perception," National Science Review, Jul. 2020, nwaa172. [Online]. Available: https://doi.org/10.1093/nsr/nwaa172
[28] S. Zhang, K. Huang, and H. Shen, "A robust 8-bit non-volatile computing-in-memory core for low-power parallel MAC operations," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 6, pp. 1867–1880, 2020.
[29] R. J. Perry and K. Arora, "Using PSpice to simulate the photoresponse of ideal CMOS integrated circuit photodiodes," Proceedings of SOUTHEASTCON '96, 1996, pp. 374–380.
[30] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., University of Toronto, 2009.
Ruibing Song (Student Member, IEEE) received a bachelor's degree from the College of Electrical Engineering, Zhejiang University, in 2020. He is currently pursuing a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-sensor computing and in-memory computing.
Kejie Huang (Senior Member, IEEE) received the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore (NUS), Singapore, in 2014. He has been a Principal Investigator with the College of Information Science & Electronic Engineering, Zhejiang University (ZJU), since 2016. Before joining ZJU, he spent five years in the IC design industry, including Samsung and Xilinx, two years at the Data Storage Institute, Agency for Science, Technology and Research (A*STAR), and another three years at the Singapore University of Technology and Design (SUTD), Singapore. He has authored or coauthored more than 40 scientific articles in international peer-reviewed journals and conference proceedings. He holds four granted international patents and another eight pending. His research interests include low-power circuits and systems design using emerging non-volatile memories, architecture and circuit optimization for reconfigurable computing systems and neuromorphic systems, machine learning, and deep learning chip design. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.
Zongsheng Wang (Student Member, IEEE) received a bachelor's degree from the College of Electrical Engineering, Zhejiang University, in 2020. He is currently pursuing a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-sensor computing, low-power digital circuit design, and deep learning accelerators.