An Ultra Fast Low Power Convolutional Neural Network Image Sensor with Pixel-level Computing
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. X, NO. X, DECEMBER 2020
Ruibing Song, Student Member, IEEE, Kejie Huang, Senior Member, IEEE, Zongsheng Wang, Student Member, IEEE, and Haibin Shen
Abstract—The separation of data capture and analysis in modern vision systems has led to a massive amount of data transfer between the end devices and cloud computers, resulting in long latency, slow response, and high power consumption. Efficient hardware architectures are under focused development to enable Artificial Intelligence (AI) at the resource-limited end sensing devices. This paper proposes a Processing-In-Pixel (PIP) CMOS sensor architecture, which allows convolution operation before the column readout circuit to significantly improve the image reading speed with much lower power consumption. The simulation results show that the proposed architecture enables convolution operation (kernel size = 3 × 3, stride = 2, input channels = 3, output channels = 64) in a 1080P image sensor array with only 22.62 mW power consumption. In other words, the computational efficiency is 4.75 TOPS/w, which is about 3.6 times higher than the state-of-the-art.
Index Terms—processing-in-pixel, visual perception, convolutional neural network, CMOS image sensor.
I. INTRODUCTION

COMPUTER vision, which trains computers to interpret and understand the visual world, is one of the research hotspots in computer science and Artificial Intelligence (AI). With the rapid development of machine learning technologies, Convolutional Neural Networks (CNNs) have outperformed previous state-of-the-art techniques in computer vision tasks such as object detection [1], face recognition [2], video compression [3], motion transfer [4], etc.

Although CNNs have significantly improved visual systems' performance, they require many operations and much storage, making it difficult for end devices to complete the computation independently. Therefore, in modern visual systems, data capture and analysis are carried out separately by sensing devices and cloud computers. The separation of data capture and analysis has led to a tremendous amount of data transfer between the end devices and the cloud computers, resulting in long delay, slow response, and high power consumption [5]. What's more, in many vision applications, such as surveillance cameras, the systems have to work continuously for monitoring or anomaly detection. The low information density seriously wastes communication bandwidth, data storage, and computing resources in such applications.
K. Huang and H. Shen are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, and also with Zhejiang Lab, Building 10, China Artificial Intelligence Town, 1818 Wenyi West Road, Hangzhou City, Zhejiang Province, China, email: [email protected]; [email protected]. R. Song and Z. Wang are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, email: [email protected]; [email protected].
To improve the efficiency of modern vision systems, researchers are focusing on reducing the readout power consumption or the data density of sensors [6]–[11]. One of the most promising methods is to move the processing units much closer to the sensing units. Equipping a CMOS Image Sensor (CIS) with a neural network processor can be divided into three categories: (1) Processing-Near-Sensor (PNS) with Deep Learning Accelerators (DLA); (2) Processing-In-Sensor (PIS); and (3) Processing-In-Pixel (PIP). The PNS architecture utilizes an on-chip DLA to shorten the physical distance between the processor and the image sensor [12]–[14]. The PIS architecture is proposed to reduce the data transfer distance, read operations, and analog-to-digital conversions. For example, RedEye performs several layers of CNN calculation in the CIS with additional analog arithmetic circuits before readout, saving 85% energy due to the reduced read operations [15]. However, it needs many analog capacitors for data storage, leading to a large area overhead and low computational efficiency. PIP is a fully integrated architecture that enables sensing and computing simultaneously. However, existing PIP designs may only support low-level processing [16] or need complicated pixel circuits, which lead to excessive area and power consumption [17], [18].

We propose a novel PIP architecture that enables high-precision convolutional neural network computation in pixels to address the limitations mentioned above. The multiplication is achieved by pulse modulation during the exposure period. The accumulation is done by charge redistribution at the pixel level. The whole pixel array is organized to support 3 × 3 convolution kernels. Our proposed architecture supports 1080P computation at 60 frames per second when the output channel size is 64. It consumes only 22.62 mW and achieves a computational efficiency of up to 4.75 TOPS/w, which is about 2.6 times higher than the state-of-the-art. Our proposed splitting technique enables the realization of other kernel sizes.

This paper is organized as follows: Section II presents the related works. Section III introduces the detailed design of our proposed scheme, including the overall architecture, the pixel circuit, the MAC operation, the array convolution, and the implementation of other convolution kernel sizes. Section IV analyzes the simulation results, and finally the conclusion is drawn in Section V.
Fig. 1. Four types of CIS pixel circuit: (a) PPS, (b) APS-3T, (c) APS-4T, and (d) APS-1.75T (shared readout circuit).
II. BACKGROUND AND RELATED WORK
A. CMOS Image Sensor
A pixel is the primary component in a CIS; it converts optical signals into electrical signals by a photodiode. Fig. 1 shows four types of pixel circuits according to [19].

As shown in Fig. 1(a), the Passive Pixel Sensor (PPS) was the early mainstream CIS technology, consisting of a photodiode and a row-selection transistor. The output of a PPS is a current signal, which is converted to a voltage signal by the column charge-to-voltage amplifier and finally quantized by an Analog-to-Digital Converter (ADC). The main advantage of the PPS is its small pixel area. However, it suffers from a low Signal-to-Noise Ratio (SNR) and low readout speed.

In Active Pixel Sensors (APS), a reset transistor is used to periodically reset the photodiode, and a source-follower transistor is employed to buffer and separate the photodiode from the bit line to reduce noise. There are mainly three types of APS: APS-3T, APS-4T, and APS-1.75T. The APS-3T shown in Fig. 1(b) cannot remove the kTC noise caused by its reset. As shown in Fig. 1(c), the APS-4T (Pinned Photodiode (PPD)) includes a transfer transistor TX and a floating diffusion (FD) node to further reduce the noise by decoupling the reset and the discharge of the photodiode. Besides, the dark current of the P+NP structure is also smaller than that of the PN junction.

However, the PPD structure has four transistors, which significantly reduces the Fill Factor (FF). As a result, the photoelectric conversion efficiency and SNR are reduced. The APS-1.75T was then proposed to share the readout and reset transistors, as shown in Fig. 1(d). A total of 7 transistors are shared by four pixels, which greatly reduces the area occupied by the readout circuit in each pixel and thus dramatically improves the fill factor.
B. PNS, PIS, and PIP Architectures
To reduce the distance between the data capture and analysis, in-sensor or near-sensor computing has been widely proposed. Fig. 2 shows the block diagrams of the different architectures, including the traditional architecture, PNS, PIS, and PIP.
Fig. 2. Different architectures of visual systems: (a) traditional architecture, (b) PNS architecture, (c) PIS architecture, and (d) PIP architecture. Blue boxes represent the pixels, grey boxes the sensors, and green boxes where the calculation is conducted.
PNS architecture (Fig. 2(b)). [12] utilized 3D-stacked column-parallel ADCs and Processing Elements (PEs) to perform spatio-temporal image processing. In [20], the signals are quantized by the ramp ADCs and then computed by the on-chip stochastic-binary convolutional neural network processor. Compared with the traditional architecture shown in Fig. 2(a), PNS architectures reduce the energy consumption of data movement, but the energy consumed by the data readout and quantization is still not optimized.
PIS architecture (Fig. 2(c)). In PIS architectures, the computing units are moved before the ADC to reduce the quantization frequency. Unlike PNS, the computing in PIS is usually done in the analog domain. In [21] and [22], the proposed CIS can realize a maximum 5 × 5 convolution, with the pixel signals transferred to the in-sensor analog calculation circuit. However, both schemes only support binary neural networks.

Fig. 3. The overview of the PIP architecture. (a) The pixel circuits, (b) the structure diagram of the pixel array, and (c) the column readout circuit diagram.
PIP architecture (Fig. 2(d)). In PIP architectures, the computing units are integrated with the pixel array. [23] adopted a linear-response Pulse Width Modulation (PWM) pixel to provide a PWM signal for analog-domain convolution. The weighting for the multiplication is achieved by adjusting the current level and the integration time based on the pixel-signal pulse width, while the accumulation is implemented by current integration. However, the current level is generated by Digital-to-Analog Converters (DACs) according to the weights, which leads to extra power consumption. [17] adopted a pixel processor array-based vision sensor called SCAMP-5. Each pixel contains 13 digital registers and seven analog memory registers to achieve various operations. However, it costs too much pixel area, leading to wiring problems and a low fill factor. [24] proposed a dual-mode PIS architecture called MACSen, which has many SRAM cells and computation cells in each unit of the array, resulting in a large area and a low fill factor.

New materials and devices have also been developed for PIP architectures to improve the fill factor. [25] proposed a WSe₂ two-dimensional (2D) material neural network image sensor, which uses a 2D semiconductor photodiode array to store the network's synaptic weights. However, changing the photodiode's photosensitivity may need additional complicated digital-to-analog circuits for each pixel to enable massively parallel computing.

Mixed architecture. It is usually difficult to conduct all calculation tasks with PIS or PIP architectures. Mixed schemes are thus proposed to achieve whole neural network computing. In [26], an analog calculation circuit is always-on to achieve face detection before the ADCs. When faces are detected, the on-chip DLA performs the calculation for face recognition in the digital domain, which can be described as a PIS + PNS scheme. [27] fabricates a sensor based on a WSe₂/h-BN/Al₂O₃ van der Waals heterostructure to emulate the retinal function of simultaneously sensing and processing an image. An in-memory computing unit is added after the sensor to make up the PIP + PNS scheme.

III. PROPOSED ARCHITECTURE
In this section, our proposed PIP architecture is introduced in the following order: (A) the pixel-level circuit design that enables the MAC operation, (B) the implementation of the convolution operation in the pixel array, (C) the methods to support different kernel sizes, and (D) the workflow in the traditional mode.
A. Pixel Circuit and MAC Operation
Fig. 3 is the overview block diagram of the proposed PIP architecture. The convolution operation is realized by an array of W × H pixel units. Fig. 3(a) shows the circuit of a pixel unit. Two reset transistors, RST_x and RST_y, are shared by four adjacent pixels representing the RGGB channels. Each pixel contains an exposure control transistor, a storage capacitor, and a read control transistor. Fig. 3(b) shows the array's structure diagram, in which Convlink connects adjacent pixel units through split transistors in both the row and column directions. Signal RST_y is controlled in the column direction, while RST_x and rd are controlled in the row direction. Weights are loaded by rows. Fig. 3(c) shows the column readout circuit diagram. Each pixel in a column is connected to the same readout circuit by Convlink with a select transistor used in the convolution operation.

Fig. 4. The calculation flow diagram of the proposed architecture: the PWM-controlled exposure converts photocurrent × exposure time into charge on the pixel capacitor, charge redistribution over n capacitors performs the summation (ΣQ/(nC) = ΔU), and the result is then read out.

Fig. 5. The convolution sequence diagram of the pixel circuit.
Fig. 4 shows the calculation flow of the MAC operation under the proposed PIP architecture. The multiplication of the photocurrent and the weights is realized in the pixel unit by controlling the exposure time of the photodiodes. The exposure time is modulated by the 8-bit weights in the convolution kernel. The multiplication results are stored on the capacitors, which can be connected between different pixel units to realize summation by charge redistribution.

The timing diagram is shown in Fig. 5, which only contains four pixels for simplicity. When the signal RST is high, both RST_x and RST_y are asserted to reset the capacitors' potential to Vdd. The exposure stage starts after the reset stage, when both RST and rd are de-asserted. In this stage, the control pulses of the exposure signals w1–w4 are modulated by the convolution kernel weights, so the exposure time T is proportional to the weight value w. Since the photocurrent I_ph is unchanged over a short period, the charge Q stored on capacitor C can be expressed as

    Q = C·U_rst − I·t = C·U_rst − I·k·w,    (1)

where k is the exposure constant, adjusted by the software according to the external light intensity. The charge Q on the capacitor therefore represents the product of the photocurrent I and the corresponding weight value w in the convolution kernel.

After the exposure comes the charge redistribution and readout stage, when rd is asserted. The Convlink line redistributes the charges Q1–Q4 stored in the capacitors. According to the principle of charge redistribution, the voltage reaches a uniform value U_conv. Considering only the four pixels shown in Fig. 3(a), the value of U_conv can be expressed as

    U_conv = (Q1 + Q2 + Q3 + Q4)/(C1 + C2 + C3 + C4) = U_rst − (k/(4C))·(I1·w1 + I2·w2 + I3·w3 + I4·w4),    (2)

where k/(4C) is a known constant, so the voltage U_conv on the Convlink line represents the sum of the four multiplication results, achieving the MAC operation at the pixel level. Assuming the convolution kernel size is r × r, one of the output results of the first-layer convolution can be obtained by connecting r² such adjacent pixels by the Convlink lines, which can be expressed as

    U_conv = (Σ_{i=1..r²} Q_i)/(r²·C) = U_rst − (k/(r²·C))·Σ_{i=1..r²} I_i·w_i.    (3)

The weight precision of the convolution kernel used in the system is 8 bits; that is, the weights range from -128 to +127. The positive and negative weights of the convolution kernel can be handled by subtracting two consecutive exposures, as shown in Fig. 5.
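The pixel-level MAC of Eqs. (1)–(3) can be sketched as a behavioral model. All component values below (capacitance, exposure constant, photocurrents) are illustrative, not taken from the paper:

```python
# Behavioral model of the in-pixel MAC (Eqs. (1)-(3)). Each pixel integrates
# its photocurrent for a time proportional to the weight (Q = C*U_rst - I*k*w),
# and shorting the n capacitors together averages the stored voltages, so the
# Convlink voltage encodes U_rst - (k/(n*C)) * sum(I_i * w_i).
U_RST = 1.0      # reset voltage (V)
C = 10e-15       # per-pixel storage capacitance (F), illustrative value
K = 1e-9         # exposure constant: seconds of exposure per weight unit

def pixel_charge(i_ph, w):
    """Charge left on one capacitor after an exposure of length K*w (Eq. 1)."""
    return C * U_RST - i_ph * K * w

def conv_readout(currents, weights):
    """Convlink voltage after charge redistribution over all pixels (Eq. 3)."""
    q_total = sum(pixel_charge(i, w) for i, w in zip(currents, weights))
    return q_total / (len(currents) * C)

currents = [2.0e-9, 1.2e-9, 1.2e-9, 1.6e-9]   # photocurrents (A), illustrative
weights = [20, 45, 70, 70]                     # positive 8-bit weights

u_conv = conv_readout(currents, weights)
# closed form of Eq. (2): U_rst - k/(4C) * sum(I_i * w_i)
u_closed = U_RST - K / (4 * C) * sum(i * w for i, w in zip(currents, weights))
```

The model confirms that the averaged readout voltage matches the closed-form expression term by term; the summation itself costs no active circuitry, only the shorting of the capacitors.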
As w1 and w2 are positive, they are enabled in the first exposure period; the negative w3 and w4 are enabled in the second exposure period. The readout operation is performed after the redistribution. The digital circuits subtract the two readout results in Fig. 5 after the ADCs, which is expressed as

    U = U⁻ − U⁺ = (k/(r²·C))·(Σ I_i·w_i⁺ − Σ I_i·w_i⁻).    (4)

Eq. (4) also illustrates that Correlated Double Sampling (CDS) is realized, because the subtraction eliminates the influence of the dark current.

B. Convolution Operation in Array
After introducing the basic idea of the convolution operation, this section gives a detailed introduction to the system's overall architecture and the sliding convolution on the entire pixel array.

As can be seen from Fig. 3, the most fundamental component of the pixel array is a pixel unit containing four photocells. Split transistors separate the Convlink wires of adjacent pixel units. Each column of pixel units is served by a column readout circuit and a column ADC outside the array, which read the convolution results and convert them into digital signals. The adopted ADC is taken from [28]; it consumes 4.04 uW at a 12.5 MS/s sampling rate.

The flow of the convolution operation in the array is shown in Fig. 6. In the following example, we assume that the convolution kernel size is 3 × 3 and the stride is 2, and that the active pixel units are connected when the split transistors are closed.

Fig. 6. The flow diagram of the array convolution operation. The convolution kernel size is 3 × 3, and the stride is 2. A group contains 3 rows of capsules and is read out at once (the readout paths of the other capsules are not drawn).

The 3 × 3 connected active pixel units are defined as capsules; the whole array can thus be divided into several independent capsules. The Convlink wires connect the pixel units in each capsule. The capsules' exposure and charge redistribution (MAC operations) are enabled simultaneously in each step. We define the three rows of capsules that are read out simultaneously as a group.

As stated in the previous section, the MAC operation can be achieved by connecting the Convlink wires of all pixel units corresponding to a convolution kernel during computation. More MAC operations should be carried out simultaneously to maximize parallelism and computing throughput. Because the charge redistribution is a destructive read of the pixel values, the regions of multiple simultaneous MAC operations must be non-overlapping. The non-overlap is achieved by dividing the convolution procedure of the entire array into four steps, as shown in Fig. 6. In each step, the colored squares represent the active pixel units, and the uncolored squares represent pixel units not involved in the computation of this step. To minimize the power consumed by the photodiode reset,
RST_y and RST_x disconnect the unpainted pixel units from the adjacent units in the row and column directions, respectively. In such a scenario, all the convolution areas in one step can be calculated and read out with only one exposure. In each step, the active pixel units perform the MAC operations with the convolution kernel and calculate a quarter of the convolution results. After four steps of calculation, a complete convolution operation is finished. As we expose twice in each step for the positive and negative weights, eight exposure cycles are needed for each convolution operation.

The above convolution operation needs carefully planned hardware wiring. As shown in Fig. 7(a), when the convolution kernel size is 3 × 3 and the stride is 2, the weight wires are connected in the periodic order W1, W2, W3, W2, W1, W2, W3, W2, ... In this way, each capsule in a step sees the same wire order: "W1, W2, W3" in the first and second steps and "W3, W2, W1" in the third and fourth steps. As the minimum period of the wire order is 2, only even strides can be supported.

As each column of pixel units is connected to a column readout circuit, each capsule spans 3 column readout circuits, so the calculation results of every three rows of capsules can be read by the three readout circuits simultaneously. To achieve this readout method, pixels with row number x (x = 4n + 3, n = 0, 1, 2, 3, ...) are connected to three independent row enabling signals C1, C2, C3, as shown in Fig. 7(b). As shown in Fig. 6, signal C1 is active in the 3rd row and signal C2 in the 7th row, so that the first row of capsules can be read from the first column readout circuit while the second row of capsules is read from the second column readout circuit.

The processing sequence of the convolution operation is shown in Fig. 8. The subscript n represents the n-th group. As shown in Fig. 8(a), after a readout operation of the n-th group is finished, the signal rd_n is de-asserted, rd_{n+1} is asserted for the readout of the (n+1)-th group, and RST_n is activated to reset the n-th group.
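The partition of the array into four non-overlapping steps can be sketched as follows. The offset pattern used to assign capsules to steps is an assumption consistent with Fig. 6 (origins grouped by their row/column offsets modulo 2·stride), not a statement of the exact hardware sequencing:

```python
# Assign each 3x3 capsule origin (stride 2) to one of four steps so that
# capsules active in the same step never share a pixel unit, as required by
# the destructive charge-redistribution read.
def capsule_steps(h, w, r=3, s=2):
    steps = {}
    for i in range(0, h - r + 1, s):
        for j in range(0, w - r + 1, s):
            step_id = ((i // s) % 2, (j // s) % 2)   # one of four steps
            steps.setdefault(step_id, []).append((i, j))
    return steps

def overlap(a, b, r=3):
    """True if two r x r capsules at origins a and b share any pixel unit."""
    return abs(a[0] - b[0]) < r and abs(a[1] - b[1]) < r

steps = capsule_steps(11, 11)
```

Within one step, the origins differ by multiples of 2s = 4, which exceeds the kernel extent of 3, so no two simultaneously active capsules touch the same pixel unit.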
As shown in Fig. 8(b), a capsule is reset immediately after each readout operation and then begins the exposure for the next readout. Assuming the resolution is 1080P, the convolution kernel size is 3 × 3, and the stride is 2, each step contains 270 rows of convolution kernel results, so each step needs 90 readout operations. As shown in Fig. 6, in different steps the active capsules correspond to different pixel units. Since the next readout of the n-th group in the next step needs an extra readout cycle's delay to avoid overlapping, the next readout of a group is separated from its reset by (90−1) readout cycles. Assume the readout interval is T_rd, the number of readout operations in each step is n_rd, the time interval between two readouts of the same capsule is (n_rd − 1)·T_rd, the reset interval is T_rst, and the maximum exposure time is T_expo. As a capsule's reset and exposure stages need to finish before the next readout operation, there should be

    (n_rd − 1)·T_rd > T_rst + T_expo.    (5)

C. Universal Implementation of Convolution Kernels with Different Sizes
To support other kernel sizes with the same wires, we propose a method called "kernel splitting" to split the convolution kernel. As shown in Fig. 9(a), two sub-kernels k1 and k2 are used to form a 5 × 5 kernel: k1 includes the first 3 columns of the 5 × 5 kernel, and k2 includes the 4th–5th columns. A 5 × 5 convolution is thus computed as a 5 × 3 and a 5 × 2 convolution whose partial results are summed.
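The column-wise splitting can be checked numerically with a plain convolution model. The image, kernel values, and the helper `conv2d_valid` are illustrative test scaffolding, not part of the hardware:

```python
import random

# Check that a 5x5 convolution equals the sum of a 5x3 convolution (columns
# 0-2 of the kernel) and a 5x2 convolution (columns 3-4) applied 3 columns to
# the right -- the digital recombination assumed by the splitting scheme.
random.seed(0)
img = [[random.randint(0, 255) for _ in range(12)] for _ in range(12)]
kernel = [[random.randint(-128, 127) for _ in range(5)] for _ in range(5)]

def conv2d_valid(x, k):
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

full = conv2d_valid(img, kernel)                     # plain 5x5 convolution
k1 = [row[:3] for row in kernel]                     # 5x3 sub-kernel
k2 = [row[3:] for row in kernel]                     # 5x2 sub-kernel
part1 = conv2d_valid(img, k1)
part2 = conv2d_valid([row[3:] for row in img], k2)   # shifted by 3 columns
split = [[part1[i][j] + part2[i][j] for j in range(len(full[0]))]
         for i in range(len(full))]
```

Because convolution is linear in the kernel, splitting it by columns and re-aligning the sub-windows reproduces the full result exactly, which is why the digital summation after readout loses no precision.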
Fig. 7. (a) The wiring method for loading weights. (b) The wiring method for readout. The convolution kernel has a size of 3 × 3, and the stride is 2. Rows x with x = 4n + 3 (n ∈ N) are connected to C1, C2, and C3 (used in both the computing and traditional modes); the other rows are connected to C0 (used only in the traditional mode).

Fig. 8. The convolution operation sequence diagram of (a) the array and (b) one capsule. U represents the potential of a capacitor in the chosen capsules, RST represents the reset stage, expo represents the exposure stage, and rd represents the charge redistribution and readout stage.

After the splitting operation, each group still has 3 rows of capsules. Another two examples, for 7 × 7 and 9 × 9 kernels, are shown in Fig. 9(b) and (c). Assuming the kernel size is r × r and the stride is s, the total number of steps is ((r+1)/s)², where the ratio (r+1)/s needs to be rounded up to an integer if necessary. For a fixed height of the pixel array H (1080 in our case), the total number of output rows in each step is H/(r+1). Since each readout operation contains three output rows, the minimum ADC conversion rate can be calculated by

    f_ADC(min) = 2·n·f·H·(r+1)/(3·s²),    (6)

where f is the frame rate and n is the number of channels.
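As a numeric sanity check for the 1080P, 3 × 3, stride-2 configuration described in the text (4 steps, 270 output rows per step, 3 output rows per readout, two exposures per step), the required per-column ADC conversion rate can be counted directly. The assumption that each column ADC performs one conversion per readout of its group is ours:

```python
# Readout-count check for the 1080P, 3x3, stride-2 case quoted in the text.
H = 1080
r, s = 3, 2
frame_rate = 60        # frames per second
channels = 64          # output channels

steps = ((r + 1) // s) ** 2              # 4 non-overlapping steps
rows_per_step = H // (r + 1)             # 270 output rows per step
readouts_per_step = rows_per_step // 3   # 90: one readout per group of 3 rows
conversions_per_frame = 2 * steps * readouts_per_step  # per output channel
f_adc_min = conversions_per_frame * frame_rate * channels  # per column ADC
```

This works out to about 2.76 MS/s per column ADC, comfortably below the 12.5 MS/s sampling rate of the ADC adopted from [28].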
Fig. 9. The convolution implementations of kernel splitting. The convolution kernel size is (a) 5 × 5, (b) 7 × 7, and (c) 9 × 9.

Each step requires two exposures, for the positive and the negative weights. The real frame rate f_real is defined as the product of the frame rate and the output channel number, f × n. With a fixed maximum exposure time T_expo, the maximum real frame rate can be calculated by
    f_real(max) = s²/(2·(r+1)²·T_expo).    (7)

TABLE I. COMPARISON OF THE DIRECT AND SPLITTING CONVOLUTION OPERATIONS (operation condition, minimum ADC conversion rate, and maximum real frame rate).

As shown in Eq. (6), the minimum conversion rate of the ADC is proportional to the frame rate f, the channel number n, and the kernel size r, and it is inversely proportional to the stride s. As shown in Eq. (7), when the kernel size increases, the maximum real frame rate decreases. Assuming the resolution is 1080P, the stride is 2, and the maximum exposure time is 32.56 us (calculated for a kernel size of 3 × 3, a stride of 2, a frame rate of 60, and an output channel number of 64), the maximum real frame rate and the minimum ADC conversion rate are listed in Table I for kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9. f_ADC(min) is based on the calculated f_real(max) in each condition. The results show that when the kernel size increases, the conversion rate of the ADC decreases, because the real frame rate, and hence the readout operation frequency, decreases.

D. Traditional Mode and Mode Switch
In the preceding three subsections, we have introduced the realization of the convolution operations. In the convolution mode, the sensor does not output the raw image. However, the original image is vital for some applications. With appropriate control signals, the proposed CIS can also work in the traditional mode and output the raw image.

To achieve this, we set the opening time of the transistors W1–W4 to a unified length according to the external light intensity. During a read, the RD transistors of the four pixels can be selected in turn to read out the RGB data. In the pixel array, as each pixel in a column shares the same column readout circuit, each row of pixels is selected by enabling C0 or C1–C3 in turn for readout. It takes a total of 4H readout operations to read the entire pixel array and obtain the RGB three-channel image with a size of H × W.

The switch between the computing mode and the traditional mode can adopt an event-driven mechanism. When the target object is identified in the subsequent computing module's results, the CIS control mode can be switched to output the complete raw image in the traditional mode. The light intensity can also determine the exposure time to avoid overexposure or underexposure.

IV. SIMULATION RESULTS

Our proposed architecture was implemented with a generic 45 nm CMOS process. To simulate the response of the photodiode, an analytic model taken from [29] is used in the simulation. The model can be expressed as

    J_np = [q·G_L(0)·L_p/(1 − (α·L_p)²)]·[−α·L_p·e^(−α·x_j) + sinh(x_j/L_p) + A(x_j, L_p)·cosh(x_j/L_p)]
         + [q·G_L(0)/α]·[1 − e^(−α·x_dr)]·e^(−α·x_j)
         + [q·G_L(0)·L_n/((α·L_n)² − 1)]·[−A(L − x_d, L_n) + α·L_n]·e^(−α·(x_j + x_dr)),    (8)

where

    A(x, y) = (e^(−α·x) − cosh(x/y))/sinh(x/y),    (9)

    G_L(0) = α·P_in·λ·η·(1 − R)/(h·c).    (10)

Fig. 10. Simulation results of a single photodiode based on the model given in Eqs. (8)–(10), for input powers P_in of 50, 100, 500, 1000, 5000, and 10000 W/m².

Fig. 10 shows the simulation results of the photodiode model in our proposed pixel circuit. First, all the capacitors and the diode are reset to Vdd (1 V). After exposures with different light intensities, the voltages decline at different speeds. The results show that the potential should be held above 0.5 V to ensure linearity.

A. Circuit Function Verification
The transient simulation result to verify the correctness of our proposed pixel circuits' MAC operation is shown in Fig. 11. Signal RST represents both RST_x and RST_y. Different values of P_in and weights are set to verify the multiplication and accumulation.

Fig. 11. Simulation result of our proposed pixel circuit.

At 0 us, the reset stage begins after RST, rd, and w1–w4 are asserted to reset the capacitors to 1 V. At t1, the exposure stage begins after RST and rd are de-asserted. According to the weights, w1–w4 stay open for the corresponding exposure times, and the voltages on the capacitors U_C1–U_C4 decrease at different speeds determined by the input light power. This stage achieves the multiplication operation between the input light power and the weights. At t2, the charge redistribution and readout stage begins when rd is asserted to redistribute the four capacitors' charge. U_C1–U_C4 reach a unified voltage level in a short time. This stage achieves the average operation. Therefore, the MAC operation is achieved after the multiplication in the exposure stage and the averaging in the charge redistribution and readout stage.

In this case, the input light powers of the four photodiodes are set to 1000, 600, 600, and 800 W/m², while the weights are set to 20, 45, 70, and 70, respectively. With a fitting slope of -0.00465955, the theoretical values of U_C1–U_C4 after the multiplication operation are 906.8, 874.2, 804.3, and 739.1 mV, respectively. The simulation results shown in Fig. 11 are 906.2, 874.0, 805.2, and 738.7 mV, which are consistent with the theoretical values.

B. MAC Operation Linearity Simulation
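Before turning to the sweep results, the theoretical voltages quoted in the previous subsection can be reproduced from the fitted linear model. Interpreting the fitting slope's units as mV per (W/m² · weight unit) is our assumption; it matches the quoted numbers:

```python
# Reproduce the quoted theoretical voltages from the fitted linear model
# U_C = U_rst + slope * P_in * w, with the fitting slope -0.00465955
# reported in Sec. IV-A.
SLOPE = -0.00465955      # mV per (W/m^2 * weight unit), assumed unit reading
U_RST_MV = 1000.0        # reset level (mV)

def exposure_voltage_mv(p_in, weight):
    return U_RST_MV + SLOPE * p_in * weight

powers = [1000, 600, 600, 800]   # input light powers (W/m^2)
weights = [20, 45, 70, 70]
u_mv = [exposure_voltage_mv(p, w) for p, w in zip(powers, weights)]
```

The four computed voltages agree with the 906.8, 874.2, 804.3, and 739.1 mV quoted above to within rounding, which is the same linearity that the sweeps below quantify.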
Fig. 12 shows the relationship between U_out and (a) the weight and (b) the input optical power, where U_out is the readout voltage after the MAC operation. Linear fittings of both figures show that the R² values are all above 0.999, indicating that the proposed CIS architecture achieves high linearity and accuracy.

Fig. 13 shows the Differential Nonlinearity (DNL) and Integral Nonlinearity (INL) simulation results obtained by the code density measurement. The simulated DNLs (INLs) in terms of the weight and the input light power are +0.0755/−0.0206 LSB (+0.2334/−0.7242 LSB) and +0.0210/−0.0061 LSB (+0.3560/−1.1947 LSB), respectively. The DNLs are all below 1 LSB, which means there are no missing codes.

C. Performance Analysis
The power consumption and performance comparison under different conditions are shown in Table II. The array size in all cases is 1920 × 1080.

D. Analysis of the Robustness
Operations in the analog domain are affected by undesirable factors such as noise and variations. In this section, we analyze the effects of these factors in detail.
1) Device Variation:
As shown in Fig. 14, the schematic of the CIS computation parts can be simplified as a photodiode with a capacitor and two switches, $w_i$ and $rd$, in each pixel. For an $r \times r$ convolution kernel, $r \times r$ pixels are connected to the same readout circuit, which includes a source follower transistor. After reset and exposure, $V_{Ci}$ is stored on $C_i$. When the signal $rd$ is set high, the voltages $V_{Ci}$ connected in a kernel are averaged due to charge sharing. In the ideal case, the readout voltage can be formulated as

V_{out} = \frac{\sum_i C_i V_{Ci}}{\sum_i C_i} - (V_{thi} + V_{od})   (11)

where $V_{thi}$ is the threshold voltage of the source follower transistor in the readout circuit, $V_{od}$ is the overdrive voltage, and $C_i$ is the capacitor in each pixel. $V^+_{out}$ and $V^-_{out}$ are the output voltages after charge sharing for positive and negative weights, respectively. As described in Section III, the final output is obtained by the digital circuit subtracting the two voltages. The nominal capacitance of $C_i$ is $C$. We can now describe the noise, variation, and mismatch factors considered in our analysis.

Firstly, noise in integrated circuits, such as thermal noise, flicker noise, and environmental noise, can be treated together as additive Gaussian noise on the dynamic capacitance [11], as depicted in red in Fig. 14. Therefore,

V'_{Ci} = V_{Ci} + n_{Ci} = V_{PDi} + n_{pdi} + n_{ci}   (12)

where $V_{PDi}$ and $n_{pdi}$ are the photodiode component of $V_{Ci}$ and its random noise, respectively, and $n_{ci}$ is the random noise on the capacitor. All noises follow the normal distribution $N(0, \sigma^2_{noise})$.

Mismatch refers to the different deviations between devices. It affects the threshold voltage of the source follower transistor and the capacitance of $C_i$. We can formulate it as

V_{thi} = V_{th}(1 + \beta_{ti}), \qquad C_i = C(1 + \beta_{ci})   (13)

where $\beta_{ti}$ and $\beta_{ci}$ refer to the deviations of the devices, both of which follow the normal distribution $N(0, \sigma^2_{mismatch})$.

Fig. 12. Simulation of MAC operations. (a) Readout voltage for ADCs versus weight for different values of input optical power. (b) Readout voltage for ADCs versus input optical power for different values of weight. Linear fits in both panels have R values between 0.9993 and 0.9999.

Fig. 13. The simulated DNL and INL in terms of (a) the weights (average input optical power set to 500 W/m²) and (b) the input optical power in W/m² (average weight set to 100).

TABLE II
POWER CONSUMPTION ANALYSIS. The amount of MAC operations is calculated according to the general calculation principle of the convolution operation, that is, the product of the convolution kernel size $r^2$, the output feature size $H_s W_s$, the number of input channels, and the number of output channels $n$.

Condition           Power (mW)   Efficiency (TOPS/W)   FoM (pJ/pixel/frame)
60 FPS, 3×3, s=2      22.62            4.75                  2.85
120 FPS, 3×3, s=2     37.48            5.73                  2.36

TABLE III
PERFORMANCE COMPARISON (the rightmost column is this work)

Technology:   90 nm/60 nm | 180 nm | 180 nm | 180 nm | 45 nm
Array Size:   ×976 | 128×128 | 32×32 | 32×32 | 1920×1080
Feature:      Spatial-temporal processing | 1st-layer CNN, ED, Blur, Sharpen | 1st-layer BNN | 1st-layer BNN | 1st-layer CNN
Processing:   Digital | Analog | Analog | Analog | Analog
Memory:       Yes (Digital) | No | No | Yes (Analog) | No
Power:        363 mW | 91 µW | 12.16 µW | 1.8 mW | 22.62 mW

Fig. 14. Effects of variation, mismatch, and noise. Red represents the voltage nodes influenced by additive Gaussian noise. Blue represents the thresholds and capacitances influenced by process variation and mismatch.
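The charge-sharing readout of Eqs. (11)-(13) can be sketched behaviorally as follows. This is a minimal model of ours, not the authors' simulation setup; all parameter values (nominal capacitance, $V_{th}$, $V_{od}$, mismatch sigma) are illustrative.

```python
import random

# Behavioral sketch of Eqs. (11)-(13): each capacitor C_i = C*(1 + beta_ci)
# holds a pixel voltage V_Ci; charge sharing produces their
# capacitance-weighted average, and the source follower output sits
# (V_th + V_od) below that average.

def charge_sharing_readout(v_ci, c_nominal=1.0, sigma_mismatch=0.0,
                           v_th=0.5, v_od=0.1, rng=random):
    # Capacitor mismatch as a multiplicative Gaussian factor, Eq. (13).
    caps = [c_nominal * (1.0 + rng.gauss(0.0, sigma_mismatch)) for _ in v_ci]
    # Capacitance-weighted average after charge sharing, Eq. (11).
    shared = sum(c * v for c, v in zip(caps, v_ci)) / sum(caps)
    return shared - (v_th + v_od)

# With no mismatch, the readout is the plain average minus the drop:
ideal = charge_sharing_readout([0.6, 0.8, 1.0, 1.2])   # mean 0.9 - 0.6 = 0.3
```

Setting `sigma_mismatch` > 0 perturbs each weight multiplicatively, matching the $(1 + \beta_{ci})$ factor analyzed in the next subsection.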
2) Computation Error Analysis:
Given Eqs. (11)-(13), the output V_out can be formulated as

V_{out} = V^+_{out} - V^-_{out} = \frac{\sum_i (1+\beta_{ci})(w_i x_i + n_{all})}{\sum_i (1+\beta_{ci})}   (14)

where $n_{all} = n^+_{pdi} + n^+_{ci} - n^-_{pdi} - n^-_{ci}$. Because the four terms are independently distributed, $n_{all}$ follows the normal distribution $N(0, 4\sigma^2_{noise})$. $\beta_{ci}$ follows a normal distribution with zero mean, so the denominator approaches the number of connected photodiodes, $4r^2$ (four photodiodes per pixel and $r^2$ pixels), and Eq. (14) can be simplified as

V_{out} = \frac{\sum_i (1+\beta_{ci})(w_i x_i + n_{all})}{4r^2}   (15)

Eq. (15) shows that (1) the impacts of all noise sources can be considered together as one random noise value $n_{all}$ added to each pixel, which follows the normal distribution $N(0, 4\sigma^2_{noise})$; (2) the mismatch across the capacitors in each pixel acts as a multiplicative factor $(1 + \beta_{ci})$ on the output data; (3) the impact of the devices' global process variation can be ignored because of the charge sharing and subtraction operations.

Compared with the traditional design, the sharing operations have the extra benefit of increasing the SNR. The effect of the random additive noise in Eq. (15) can be expanded as

V_{out} = \frac{\sum_i w_i x_i}{4r^2} + \frac{\sum_i n_{all}}{4r^2}   (16)

As $\sum_i w_i x_i / (4r^2)$ is the desired output, $\sum_i n_{all} / (4r^2)$ is the additive noise. When the convolution kernel size is 3 × 3 ($r = 3$), 9 pixels are connected and each pixel has four photodiodes, so $4r^2 = 36$, which means $\sum n_{all}$ follows the normal distribution $N(0, 36\sigma^2)$, where $\sigma^2 = 4\sigma^2_{noise}$ is the variance of $n_{all}$ for a single photodiode. The noise power and SNR can then be calculated as

noise = E\left[\left(\frac{\sum n_{all}}{36}\right)^2\right] = \frac{1}{36^2} D\left(\sum n_{all}\right) = \frac{\sigma^2}{36}   (17)

SNR = \frac{power}{noise} = \frac{36 \cdot power}{\sigma^2}   (18)

As shown in Eq. (18), the SNR is 36 times as high as that of the traditional design, a 15.6 dB increase. This means smaller capacitors are acceptable in our design, so the exposure time can be decreased, which contributes to a large increase in frame rate, up to 3840 FPS.

Fig. 15. Relationship between CNN accuracy and three types of disturbance: accuracy versus SNR for capacitances distributed within (a) a 5% deviation, (b) a 10% deviation, and (c) a 20% deviation.
3) Algorithm Robustness:
Since CNNs are neural network algorithms, they are highly robust and can tolerate errors within a certain range in the input data. Through network simulation with the CIFAR-10 [30] dataset and ResNet-18, the accuracy of the CNN changes with the SNR or mismatch as shown in Fig. 15. As the proposed CIS only supports the first CNN layer, the remaining computation is performed in software. Three different distributions of the capacitors are used, as shown in Fig. 15. The results show little accuracy loss when the SNR is above 40 dB, and the typical SNR of a CIS is 40 dB - 60 dB [5].

Our proposed CIS circuit only supports the first layer of the CNN, but this layer is very important for the whole architecture's computation. Quantization or pruning of the first CNN layer usually causes a large accuracy loss, which makes it difficult to improve performance. Moreover, due to the small number of input channels, the Processing Elements (PEs) of Deep Learning Accelerators (DLAs) are often underutilized in the first layer. Therefore, this design can greatly improve the computational efficiency of the subsequent DLAs, leading to much higher performance of the whole machine vision system.

V. CONCLUSION
In this work, a PIP architecture has been proposed to perform the first convolution layer of a CNN. It supports a variety of convolution kernel sizes and parameters. The simulation results have shown that the proposed scheme functions correctly with good linearity. With a 3 × 3 convolution kernel, a stride of 2, and 64 output channels at 60 FPS and 1080P resolution, the proposed architecture consumes 22.62 mW and achieves a computational efficiency of up to 4.75 TOPS/W, which is about 3.6 times higher than the state-of-the-art. It is well suited to application scenarios with tight power budgets, such as daily monitoring and Internet of Things (IoT) terminal devices.

REFERENCES

[1] S. Mane and S. Mangale, "Moving object detection and tracking using convolutional neural networks," 2018, pp. 1809–1813.
[2] C. Ding and D. Tao, "Trunk-branch ensemble convolutional neural networks for video-based face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1002–1014, 2018.
[3] L. Wu, K. Huang, H. Shen, and L. Gao, "Foreground-background parallel compression with residual encoding for surveillance video," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2020.
[4] D. Wei, X. Xu, H. Shen, and K. Huang, "GAC-GAN: A general method for appearance-controllable human video motion transfer," IEEE Transactions on Multimedia, pp. 1–1, 2020.
[5] F. Zhou and Y. Chai, "Near-sensor and in-sensor computing," Nature Electronics, vol. 3, pp. 664–671, 2020.
[6] M. T. Chung, C. L. Lee, C. Yin, and C. C. Hsieh, "A 0.5 V PWM CMOS imager with 82 dB dynamic range and 0.055% fixed-pattern-noise," IEEE Journal of Solid-State Circuits, vol. 48, no. 10, pp. 2522–2530, 2013.
[7] X. Liu, M. Zhang, and J. Van der Spiegel, "A low-power multifunctional CMOS sensor node for an electronic facade," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 9, pp. 2550–2559, 2014.
[8] O. Kumagai, A. Niwa, K. Hanzawa, H. Kato, and Y. Nitta, "A 1/4-inch 3.9Mpixel low-power event-driven back-illuminated stacked CMOS image sensor," 2018.
[9] A. Y. Chiou and C. Hsieh, "An ULV PWM CMOS imager with adaptive-multiple-sampling linear response, HDR imaging, and energy harvesting," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 298–306, 2019.
[10] J. Choi, J. Shin, D. Kang, and D. Park, "Always-on CMOS image sensor for mobile and wearable devices," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 130–140, 2016.
[11] A. Y. Chiou and C. Hsieh, "A 137 dB dynamic range and 0.32 V self-powered CMOS imager with energy harvesting pixels," IEEE Journal of Solid-State Circuits, vol. 51, no. 11, pp. 2769–2776, 2016.
[12] T. Yamazaki, H. Katayama, S. Uehara, A. Nose, M. Kobayashi, S. Shida, M. Odahara, K. Takamiya, Y. Hisamatsu, S. Matsumoto, L. Miyashita, Y. Watanabe, T. Izawa, Y. Muramatsu, and M. Ishikawa, "4.9 A 1ms high-speed vision chip with 3D-stacked 140GOPS column-parallel PEs for spatio-temporal image processing," 2017, pp. 82–83.
[13] M. F. Amir, D. Kim, J. Kung, D. Lie, S. Yalamanchili, and S. Mukhopadhyay, "NeuroSensor: A 3D image sensor with integrated neural accelerator," 2016, pp. 1–2.
[14] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," 2015, pp. 92–104.
[15] Y. Hou, R. LiKamWa, L. Zhong, M. Polansky, and J. Gao, "RedEye: Analog ConvNet image sensor architecture for continuous mobile vision," Computer Architecture News, 2016.
[16] C. Xu, Y. Mo, G. Ren, W. Ma, X. Wang, W. Shi, J. Hou, K. Shao, H. Wang, P. Xiao, Z. Shao, X. Xie, X. Wang, and C. Yiu, "5.1 A stacked global-shutter CMOS imager with SC-type hybrid-GS pixel and self-knee point calibration single frame HDR and on-chip binarization algorithm for smart vision applications," 2019, pp. 94–96.
[17] L. Bose, J. Chen, S. J. Carey, P. Dudek, and W. Mayol-Cuevas, "Fully embedding fast convolutional networks on pixel processor arrays," 2020.
[18] T. Hsu, Y. Chiu, W. Wei, Y. Lo, C. Lo, R. Liu, K. Tang, M. Chang, and C. Hsieh, "AI edge devices using computing-in-memory and processing-in-sensor: From system to device," 2019, pp. 22.5.1–22.5.4.
[19] A. El Gamal and H. Eltoukhy, "CMOS image sensors," IEEE Circuits and Devices Magazine, vol. 21, no. 3, pp. 6–20, 2005.
[20] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 13–18.
[21] Z. Chen, H. Zhu, E. Ren, Z. Liu, K. Jia, L. Luo, X. Zhang, Q. Wei, F. Qiao, X. Liu, and H. Yang, "Processing near sensor architecture in mixed-signal domain with CMOS image sensor of convolutional-kernel-readout method," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 2, pp. 389–400, 2020.
[22] T. Ma, K. Jia, X. Zhu, F. Qiao, Q. Wei, H. Zhao, X. Liu, and H. Yang, "An analog-memoryless near sensor computing architecture for always-on intelligent perception applications," 2019, pp. 150–155.
[23] T. Hsu, Y. Chen, T. Wen, W. Wei, Y. Chen, F. Chang, H. Kim, Q. Chen, B. Kim, R. Liu, C. Lo, K. Tang, M. Chang, and C. Hsieh, "A 0.5V real-time computational CMOS image sensor with programmable kernel for always-on feature extraction," 2019, pp. 33–34.
[24] H. Xu, Z. Li, N. Lin, Q. Wei, F. Qiao, X. Yin, and H. Yang, "MACSEN: A processing-in-sensor architecture integrating MAC operations into image sensor for ultra-low-power BNN-based intelligent visual perception," IEEE Transactions on Circuits and Systems II: Express Briefs, pp. 1–1, 2020.
[25] L. Mennel, J. Symonowicz, S. Wachter, D. K. Polyushkin, and T. Mueller, "Ultrafast machine vision with 2D material neural network image sensors," Nature, vol. 579, no. 7797, pp. 62–66, 2020.
[26] K. Bong, S. Choi, C. Kim, D. Han, and H. Yoo, "A low-power convolutional neural network face recognition processor and a CIS integrated with always-on face detector," IEEE Journal of Solid-State Circuits, vol. 53, no. 1, pp. 115–123, 2018.
[27] S. Wang, C.-Y. Wang, P. Wang, C. Wang, Z.-A. Li, C. Pan, Y. Dai, A. Gao, C. Liu, J. Liu, H. Yang, X. Liu, B. Cheng, K. Chen, Z. Wang, K. Watanabe, T. Taniguchi, S.-J. Liang, and F. Miao, "Networking retinomorphic sensor with memristive crossbar for brain-inspired visual perception," National Science Review, Jul. 2020, nwaa172. [Online]. Available: https://doi.org/10.1093/nsr/nwaa172
[28] S. Zhang, K. Huang, and H. Shen, "A robust 8-bit non-volatile computing-in-memory core for low-power parallel MAC operations," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 6, pp. 1867–1880, 2020.
[29] R. J. Perry and K. Arora, "Using PSpice to simulate the photoresponse of ideal CMOS integrated circuit photodiodes," Proceedings of SOUTHEASTCON '96, 1996, pp. 374–380.
[30] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., University of Toronto, 2009.
Ruibing Song (Student Member, IEEE) received a bachelor's degree from the College of Electrical Engineering, Zhejiang University, in 2020. He is currently pursuing a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-sensor computing and in-memory computing.
Kejie Huang (Senior Member, IEEE) received the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore (NUS), Singapore, in 2014. He has been a Principal Investigator with the College of Information Science & Electronic Engineering, Zhejiang University (ZJU), since 2016. Before joining ZJU, he spent five years in the IC design industry, including Samsung and Xilinx, two years at the Data Storage Institute, Agency for Science, Technology and Research (A*STAR), and another three years at the Singapore University of Technology and Design (SUTD), Singapore. He has authored or coauthored more than 40 scientific articles in international peer-reviewed journals and conference proceedings. He holds four granted international patents and another eight pending. His research interests include low-power circuits and systems design using emerging non-volatile memories, architecture and circuit optimization for reconfigurable computing systems and neuromorphic systems, machine learning, and deep learning chip design. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.
Zongsheng Wang (Student Member, IEEE) received a bachelor's degree from the College of Electrical Engineering, Zhejiang University, in 2020. He is currently pursuing a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-sensor computing, low-power digital circuit design, and deep learning accelerators.