An Energy-Efficient VCO-Based Matrix Multiplier Block to Support On-Chip Image Analysis
Imon Banerjee and Arindam Sanyal
Abstract—Images are typically represented as uniformly sampled data in the form of a matrix of pixels/voxels. Therefore, matrix multiply-and-accumulate (MAC) forms the core of most state-of-the-art image analysis algorithms. While a digital implementation of MAC has generally been the preferred approach, its high power consumption is an impediment to adopting it for medical image analysis. In this work, we present a time-domain signal processing architecture which performs MAC operations with 7-bit accuracy while consuming 400X lower energy than a digital implementation. The proposed architecture performs analog computation using mostly digital circuits and is suitable for scaled CMOS technologies. The proposed time-domain MAC architecture is expected to play a central role in advancing various on-chip image analysis operations.
Index Terms—On-chip image analysis, voltage-controlled oscillator, time-domain, matrix multiplication
I. INTRODUCTION
In the digital data era, 2D/3D image analysis operations (e.g. registration, feature calculation, interpolation, fusion) are the core processing block of a wide range of automated systems, including computer aided diagnosis (CAD) [1], [2]. For example, in modern CAD systems, an important image analysis operation is the co-registration of positron emission tomography (PET) images and magnetic resonance images (MRI), which combines functional information from PET images with anatomical information in MR images. Efficient co-registration of PET and MRI images can pave the way for a better understanding of physiological and disease mechanisms in pre-clinical and clinical settings.
The co-registration algorithm applies rigid registration [3] to map each voxel (x, y, z) in the MR image into the coordinate space of the PET image (x, y, z), provided that the images of both modalities are acquired from the same subject and the scanning processes have not introduced nonrigid spatial transformations. To illustrate the algorithmic facets, we present a workflow diagram in Fig. 1 that takes MRI and PET images as inputs and uses rigid registration to compute the co-registered MRI/PET image. The rigid registration is done through multiply-and-accumulate (MAC) operations performed on the original voxel locations in 3D space (x, y, z) with a 4 × 4 translation matrix (U), which represents translation (t) and scaling (S), and 4 × 4 rotation matrices (R), which represent rotation (see (1)).
I. Banerjee is with the Laboratory of Quantitative Imaging, Stanford University School of Medicine, Stanford 94305, CA, USA (e-mail: [email protected]). A. Sanyal is with the Department of Electrical Engineering, The State University of New York at Buffalo, Buffalo 14260, NY, USA (e-mail: [email protected]).
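As an illustration of how rigid registration reduces to MAC operations, the following is a minimal Python sketch, not the authors' implementation. It assumes homogeneous row vectors [x, y, z, 1] multiplied on the left of a 4 × 4 matrix, with the translation terms in the last row of U. The function names (`rotation_z`, `scale_translate`, `transform_voxel`, `macs_for_volume`) and all numeric values are hypothetical and chosen only for the example.

```python
import math

def rotation_z(theta):
    """4x4 homogeneous rotation about the z axis (layout is an assumption)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def scale_translate(scale, shift):
    """4x4 matrix with scaling on the diagonal and translation in the last row."""
    sx, sy, sz = scale
    tx, ty, tz = shift
    return [[sx, 0.0, 0.0, 0.0],
            [0.0, sy, 0.0, 0.0],
            [0.0, 0.0, sz, 0.0],
            [tx, ty, tz, 1.0]]

def mac(weights, inputs):
    """One multiply-and-accumulate: sum over j of W_j * X_j."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc

def transform_voxel(voxel, matrix):
    """Map a voxel with the row-vector product [x y z 1] * M.

    Each output coordinate is one 4-term MAC over a matrix column.
    """
    p = [voxel[0], voxel[1], voxel[2], 1.0]
    columns = list(zip(*matrix))
    out = [mac(col, p) for col in columns]
    return out[:3]

def macs_for_volume(nx, ny, nz, matrices_per_voxel=1):
    """Scalar multiply-adds needed: 16 per voxel per 4x4 matrix applied."""
    return nx * ny * nz * 16 * matrices_per_voxel

# Rotate one voxel by 90 degrees, then scale by 2 and shift it.
rotated = transform_voxel([1.0, 0.0, 0.0], rotation_z(math.pi / 2))
moved = transform_voxel(rotated, scale_translate((2.0, 2.0, 2.0), (10.0, 20.0, 30.0)))
```

Every output coordinate here comes from the same `mac` primitive, so the cost of registering a volume grows linearly with the voxel count; `macs_for_volume` makes that count explicit for a given image size.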
The transformation matrix (M = translation + rotation) for each voxel is derived automatically by analyzing the difference between the source and target data of the registration, which can again be described as a sequence of MAC operations. Note that MRI and PET images with reasonable accuracy have a number of voxels more than × × . Therefore, a standard registration operation requires > , , MAC operations, which can consume a significant amount of power and computation time. Similarly, other image analysis operations, such as object segmentation and feature extraction, can be decomposed into a set of sequential MAC operations. Hence, reducing the power consumed by MAC operations is a significant challenge that has to be addressed in order to develop on-chip image analysis blocks.
Spurred on by Moore's law and CMOS technology scaling, the general approach towards CAD has been to perform all mathematical operations digitally using computers. While digital computation can provide high accuracy, its power consumption has been high enough to rule out portable solutions. With CMOS technology scaling slowing down, there is an interest in analog signal processing (ASP) to perform mathematical computations in a low power fashion [4]. A key enabler for the ASP methodology is that most image processing algorithms require only 6-8 bits of precision. ASP excels at approximate computing while consuming lower power than digital computing. Recent approaches to analog computation have used switches and capacitors as signal processing elements [5]–[8] in advanced CMOS technologies. The matrix multiplier reported in [8] performs calculations with 6b linearity while having an energy consumption of only 13fJ/operation at 1GHz. The work in [5] presents switched capacitor multipliers and dividers. However, it uses amplifiers, which are not power efficient, especially in advanced CMOS technologies.
The works in [6]–[8] present power efficient multipliers by doing away with amplifiers and relying only on switched capacitors for signal processing. These multipliers work in the charge domain and are good solutions for high speed operations, usually with bandwidths in the GHz range. However, charge leakage presents a significant challenge to using switched-capacitor multipliers for medical signal processing, in which computations can require only a few MHz of bandwidth. Thus, switched capacitor multipliers are not very suitable for medical image processing.
In this letter, we present an alternate solution by performing multiplication and addition operations in the time domain, which satisfies medical image processing requirements. The proposed technique has significantly higher immunity to charge leakage than switched capacitors and can trade off speed for power
without sacrificing computational accuracy. Time domain circuits are uniquely suitable for scaled CMOS technology. They are highly digital and hence can operate at low supply voltages in advanced CMOS technologies. In addition, the quantization noise in time domain circuits is essentially transistor delay, which reduces with technology scaling. Thus, the proposed architecture can be used for a power efficient hardware implementation of the rigid registration block (see Fig. 1), which is a core element in the PET/MRI modality fusion process. In addition, the proposed time domain architecture holds the promise of providing low power computational ability for a wide range of portable CAD systems.
Fig. 1. A workflow diagram of MRI and PET brain image co-registration.
$$
U = \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ t_x & t_y & t_z & 1 \end{bmatrix};\quad
R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix};\quad
R_y = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix};\quad
R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{1}
$$
The rest of this letter is organized as follows: Section II presents a brief review of existing analog-to-digital matrix multipliers, Section III discusses the key idea behind the proposed time-domain matrix multiply-and-add operator, and Section IV presents a CMOS circuit implementation and simulation results, while the conclusion and future research directions are given in Section V.
II. REVIEW OF ANALOG-TO-DIGITAL MATRIX MULTIPLICATION AND ADDITION
A MAC operation is defined as
$$
Y = \sum_{j=1}^{N} W_j X_j \tag{2}
$$
where $X_j$ is the input and $W_j$ is the weight.
Several analog signal processing techniques have been reported which accept an input $X_j$, perform the multiplication in (2) in the analog domain and return a digital output $Y_j = W_j X_j$. A straightforward method is to use an operational amplifier to perform voltage domain multiplication and addition as described by (2). The operational amplifier needs to have a high gain to perform the MAC operation accurately. However, high gain operational amplifiers are power hungry, and four-quadrant CMOS multipliers [9] are often used instead for approximate multiplication. To reach higher energy efficiency, [10] uses charge coupling to implement approximate analog matrix multipliers. [11] uses low performance thin-film transistors (TFT) to implement an approximate analog multiplier for processing data from sensors.
More recently, there have been efforts to integrate analog-digital matrix multiplication (AD-MM) inside analog-to-digital converters (ADC) to lower power consumption. Capacitors and switches are used for passive multiplication in [7], [8], [12]. Charge domain passive multipliers achieve very low power consumption in advanced CMOS technology nodes as they do not use any active amplifiers, which are power hungry. Addition can be done in the current or charge domain in a low power fashion.
III. TIME DOMAIN MATRIX OPERATIONS
While passive switched capacitor techniques are a good solution for low energy, approximate multipliers, they suffer from non-idealities due to charge leakage, especially in advanced CMOS technologies, which suffer from increased leakage. This problem is exacerbated at low speeds of operation, which implies that further lowering of energy consumption by reducing speed is challenging for these techniques. In addition, switched capacitor multipliers are often used in conjunction with voltage domain ADCs, which suffer from reduced dynamic range as the supply voltage is scaled down in advanced CMOS technologies.
To counter these limitations of switched capacitor multipliers, we propose a time domain multiplier. By shifting to the time domain, sensitivity to charge leakage is greatly diminished and very low energy multipliers can be designed which are suitable for low bandwidth medical image processing. An added advantage is that the quantization noise of time-domain multipliers comes from transistor delay, which reduces with technology scaling. Thus, in advanced CMOS technologies, time-domain multipliers are better suited for low-bandwidth operations than switched capacitors.
A voltage-controlled oscillator (VCO) is a major building block of the time domain multiplier. The quantized phase of a VCO can be written as
$$
\phi[k] = 2\pi K_v \int_{T_{int}} V_{in}(t)\,dt \tag{3}
$$
where $V_{in}$ is the input to the VCO, $K_v$ is the VCO gain and $T_{int}$ is the time over which the VCO phase is integrated. If $V_{in}$ changes slowly, $\phi[k]$ can be written as
$$
\phi[k] = 2\pi K_v \sum_{j=1}^{N} (t_j - t_{j-1})\, V_{in}[j] = 2\pi K_v \sum_{j=1}^{N} (\Delta t_j)\, V_{in}[j] \tag{4}
$$
where $T_{int} = \sum_{j=1}^{N} \Delta t_j$ and $V_{in}[j] = V_{in}(t = t_j)$. Equation (4) is analogous to (2) with $W_j \equiv 2\pi K_v (\Delta t_j)$ and $X_j \equiv V_{in}[j]$. Thus, a VCO can be used to perform a MAC operation in the phase domain. The equivalent digital output of the MAC operation can be readily obtained by simply sampling the output of the VCO, without requiring a separate ADC as in the charge-domain matrix multipliers.
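The correspondence between the phase integral in (3) and the MAC form in (2) can be checked numerically. The following is a minimal Python sketch under stated assumptions: the VCO is ideal and perfectly linear, the input is held constant within each interval, and the gain `K_V`, the input samples `V_IN` and the durations `DT` are hypothetical values chosen only for illustration.

```python
import math

def vco_phase(v_in, dt, k_v, steps_per_interval=1000):
    """Accumulate phase 2*pi*K_v*V_in(t) over piecewise-constant input segments.

    v_in[j] is held for dt[j] seconds. A fine rectangular rule stands in
    for the continuous phase accumulation of an ideal, linear VCO.
    """
    phase = 0.0
    for v, d in zip(v_in, dt):
        h = d / steps_per_interval
        for _ in range(steps_per_interval):
            phase += 2.0 * math.pi * k_v * v * h
    return phase

def mac(weights, inputs):
    """Reference multiply-and-accumulate: sum over j of W_j * X_j."""
    return sum(w * x for w, x in zip(weights, inputs))

K_V = 1.0e6                             # hypothetical VCO gain in Hz/V
V_IN = [0.10, -0.05, 0.20, 0.00]        # hypothetical inputs X_j in volts
DT = [10e-9, 20e-9, 5e-9, 15e-9]        # hypothetical durations Delta t_j in seconds

# Weights follow W_j = 2*pi*K_v*Delta t_j.
WEIGHTS = [2.0 * math.pi * K_V * d for d in DT]

phase_from_vco = vco_phase(V_IN, DT, K_V)
phase_from_mac = mac(WEIGHTS, V_IN)
```

In this model the weight of each input is set purely by how long the oscillator integrates it, which is why a digitally controlled duration can act as the multiplier coefficient.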
The accumulation operation is done in the phase domain and is highly linear. Unlike charge-domain architectures, accumulation in the phase domain comes without any additional power consumption. Nonlinearity in the phase domain MAC operation is primarily due to nonlinearity in voltage-to-phase conversion. Increasing the integration time, $T_{int}$, allows reduction of the VCO gain, $K_v$, which in turn increases VCO linearity. This is particularly suitable for medical image processing, which does not require a high bandwidth, and hence, the linearity of the VCO can be increased by lowering $K_v$ and increasing $T_{int}$. As long as the VCO is oscillating, there is no leakage error in the phase value held by the VCO, which is of importance for low bandwidth medical signal processing.
IV. MATRIX MULTIPLIER ARCHITECTURE
Fig. 2 shows the conceptual block diagram of the proposed time-domain matrix multiplier architecture along with its timing diagram. A voltage-to-current (V/I) converter drives a current-controlled oscillator (CCO) and the quantized phase output of the CCO holds the result of multiplication of $V_{in}$ and $2\pi K_v \phi_1$. The duration of $\phi_1$ in the $j$-th sampling period is $\Delta t_j$, which is digitally controllable. Addition comes without any hardware cost as the CCO holds on to its phase, which keeps accumulating with time. Fig. 2 illustrates how the proposed architecture performs the MAC operation.
Fig. 2. Architecture and timing diagram of proposed time domain matrix multiplier.
Fig. 3 shows the circuit implementation of the proposed architecture. The architecture is implemented in a differential fashion to suppress common-mode noise on the inputs as well as noise from supply and ground. The input signal is applied differentially to a V/I converter. The V/I converter drives two pseudo-differential CCOs during the phase $\phi_1$, and the two CCOs are run with a low current supply $I_L$ during the phase $\phi_2$. The two CCOs are not stopped during $\phi_2$ to ensure that the accumulated phase held by them is not corrupted by leakage. By running the two CCOs at the same frequency during $\phi_2$, no phase is accumulated differentially during $\phi_2$. The output of each CCO stage is latched by the sampling clock into a flip-flop (FF). At any given time, only one of the CCO stages is in a state of either a positive or a negative transition. Thus, for an $N$-stage CCO, the instantaneous phase can be quantized into $2N$ levels between $(0, 2\pi)$, corresponding to $N$ positive transitions and $N$ negative transitions. The matrix multiplier is highly digital and makes use of simple digital circuits to perform time domain analog signal processing.
Fig. 3. Circuit schematic of proposed time domain matrix multiplier.
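The phase readout of an N-stage ring can be sketched in a few lines. N rising plus N falling transitions give 2N distinguishable states per 2π cycle, so one quantization step is π/N. The following Python sketch is an idealized behavioral model, not the circuit itself: it assumes equal stage delays, models the wrap count as the number of full 2π rotations, and uses hypothetical function names (`quantize_phase`, `total_code`).

```python
import math

def quantize_phase(phase, n_stages):
    """Split an accumulated phase (radians, non-negative) into a wrap count
    and a fine code.

    The fine code is one of 2*N levels spanning [0, 2*pi), so one LSB
    corresponds to pi/N radians.
    """
    levels = 2 * n_stages
    lsb = math.pi / n_stages
    code = math.floor(phase / lsb)      # accumulated phase in LSBs, truncated
    count, fine = divmod(code, levels)  # full wraps and position within a cycle
    return count, fine

def total_code(count, fine, n_stages):
    """Recombine wrap count and fine code into a single digital word."""
    return 2 * n_stages * count + fine

# Example: 16 stages, three full cycles plus 5.5 LSBs of residual phase.
N_STAGES = 16
example_phase = 3 * 2.0 * math.pi + 5.5 * (math.pi / N_STAGES)
example_count, example_fine = quantize_phase(example_phase, N_STAGES)
example_word = total_code(example_count, example_fine, N_STAGES)
```

Because the fine code resolves only log2(2N) bits, most of the output word comes from the wrap count, so longer integration windows raise the effective resolution without adding stages.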
As shown in Fig. 3, the CCO phase increases monotonically with time and wraps over after it crosses $2\pi$. A counter is used to store the number of times the VCO phase overflows over the period of integration. The total phase at any time is given by $(2N \cdot \mathrm{Count} + \hat{\phi})$, where $\hat{\phi}$ is the instantaneous quantized phase.
Since VCO gain varies with process, voltage and temperature (PVT), the result of multiply-and-add will vary with PVT. Hence, background tracking is needed to correct for VCO gain variation. The V/I transconductance is given by $1/R$ for $g_m R \gg 1$, where $g_m$ is the transconductance of the V/I input transistors and $R$ is their source degeneration resistance. Thus, the V/I converter is relatively insensitive to PVT variations, and hence, resistance trimming is not required for multipliers with 6-8 bits accuracy. To track the CCO gain, the width of the tail current source can be changed depending on the output of a counter which is clocked by the CCO, as shown in Fig. 4. The counter output is monitored by a comparator which is clocked by a divided-down version of the sampling clock. The counter is reset after every comparison. If the CCO is running too fast, the comparator will reduce the tail current source width, and vice versa. This ensures that the counter output is held equal to a preset value $F_{in}$ which sets the CCO free-running frequency.
The background tracking technique can be applied to a reference multiplier and the comparator digital output word can be applied to the tail current source of all the CCOs. Process related mismatch between the tail current sources of the different CCOs will limit the accuracy to which the PVT sensitivity can be corrected. Fortunately, the matching accuracy requirement is not very stringent as only 6-8 bits accuracy is required. In addition, CCO tail current devices are usually made large to reduce flicker noise. Large device size also reduces mismatches.
For a large design with many multiplier cells, a few local copies of the reference multiplier can be distributed across the chip to account for gradient mismatches.
Fig. 4. VCO gain tracking architecture.
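The tracking loop described above behaves like a slow bang-bang controller. The following is a minimal behavioral sketch in Python, not the authors' circuit. It assumes a linear CCO model in which each tail-current code step adds a fixed frequency increment, a comparison window of `divide_ratio` sampling-clock periods, and a comparator that holds the code when the count equals the target. All names and numeric values are hypothetical.

```python
def track_cco_gain(target_count, sample_rate_hz, divide_ratio,
                   offset_hz, hz_per_code, code=0, code_max=255,
                   iterations=300):
    """Step a tail-current code until the CCO cycle count matches the target.

    One comparison is made per window of divide_ratio sampling-clock periods,
    and the cycle counter is restarted for every window.
    """
    history = []
    for _ in range(iterations):
        # Assumed linear CCO model: frequency grows with the tail-current code.
        freq_hz = offset_hz + hz_per_code * code
        # CCO cycles counted during one window (integer arithmetic).
        count = (freq_hz * divide_ratio) // sample_rate_hz
        if count > target_count and code > 0:
            code -= 1          # running too fast: reduce tail current width
        elif count < target_count and code < code_max:
            code += 1          # running too slow: increase tail current width
        history.append((code, count))
    return code, history

# Hypothetical numbers: 100 MHz sampling clock divided by 100 (1 us window),
# CCO spans 50 MHz + 1 MHz per code, target of 100 cycles per window.
final_code, trace = track_cco_gain(
    target_count=100,
    sample_rate_hz=100_000_000,
    divide_ratio=100,
    offset_hz=50_000_000,
    hz_per_code=1_000_000,
)
```

With one code step per comparison, the loop settles in at most as many windows as there are codes, which is acceptable here because PVT drift is slow compared with the sampling clock.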
A MAC cell was designed in a 40nm CMOS process. With a mV differential input running at 0.6MHz, the MAC cell operates with 7-bit linearity. To test the accuracy of the proposed MAC cell, a row vector (1 × 512) is multiplied with a column vector (512 × 1). The row vector $[x_0\ x_1\ x_2\ \cdots\ x_{511}]$ is set to $[0\ \ \sin(\omega T_s)\ \ \sin(2\omega T_s)\ \cdots\ \sin(511\,\omega T_s)]$ and all the elements in the column vector are set to $T_s$, where $T_s = 10$ns. Fig. 5 shows the simulated result for multiplication of the row vector and the column vector as well as the error between the output of the MAC operation and the desired output. It can be seen from Fig. 5 that the proposed MAC cell has a low quantization error and has 7-bit linearity.
The MAC cell consumes µW from a 1.1V supply while operating at 100 MHz. The energy efficiency of the proposed time-domain MAC cell is 2fJ/operation, compared to the 968 fJ/operation energy efficiency of highly optimized digital matrix multipliers [13]. Thus, the proposed MAC cell has more than 400X better energy efficiency than digital matrix multipliers.
Fig. 5. Matrix multiply-and-add transient simulation. [Plot: signal voltage (V) and error voltage (V) versus time (µs); traces: desired output, MAC output, error signal.]
V. CONCLUSION
A time-domain architecture for performing multiplication-and-addition operations is presented in this letter. The proposed architecture exploits the low bandwidth and not-too-stringent accuracy requirements of medical image processing algorithms to achieve a drastic increase in energy efficiency compared to digital matrix multipliers. The architecture is highly digital and suitable for advanced CMOS technologies. The proposed architecture can act as an enabler for developing portable hardware solutions for 2D/3D image analysis.
REFERENCES
[1] Banerjee, I., Agibetov, A., Catalano, C.E., Patanè, G., and Spagnuolo, M.: ‘Semantics-driven Annotation of Patient-Specific 3D Data: A Step to Assist Diagnosis and Treatment of Rheumatoid Arthritis’, The Visual Computer Journal, 2016, pp. 1–13.
[2] Slomka, P. J.: ‘Software approach to merging molecular with anatomic information’, Journal of Nuclear Medicine, 2004, vol. 45, pp. 36S–45S.
[3] Goshtasby, A. Ardeshir: ‘2-D and 3-D image registration: for medical, remote sensing, and industrial applications’, John Wiley & Sons, 2005.
[4] St. Amant, R., Yazdanbakhsh, A., Park, J., Thwaites, B., Esmaeilzadeh, H., Hassibi, A., Ceze, L., and Burger, D.: ‘General-purpose code acceleration with limited-precision analog computation’, ACM SIGARCH Comput. Archit. News, 2014, vol. 42, no. 3, pp. 505–516.
[5] Watanabe, K., and Temes, G.: ‘A switched-capacitor multiplier/divider with digital and analog output’, IEEE Trans. Circuits Syst., 1984, vol. 31, no. 9, pp. 796–800.
[6] Sadhu, B., Sturm, M., Sadler, B.M., and Harjani, R.: ‘Analysis and design of a 5GS/s analog charge-domain FFT for an SDR front-end in 65nm CMOS’, IEEE Journal of Solid-State Circuits, 2013, vol. 48, no. 5, pp. 1199–1211.
[7] Bankman, D., and Murmann, B.: ‘Passive charge redistribution digital-to-analogue multiplier’, Elec. Lett., 2015, vol. 51, no. 5, pp. 386–388.
[8] Lee, E.H., and Wong, S.S.: ‘A 2.5GHz 7.7TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm’, IEEE Journal of Solid-State Circuits, 2016, pp. 418–420.
[9] Bult, K., and Wallinga, H.: ‘A CMOS four-quadrant analog multiplier’, IEEE Journal of Solid-State Circuits, 1986, SSC-21, no. 3, pp. 430–445.
[10] Genov, R., and Cauwenberghs, G.: ‘Kerneltron: support vector machine in silicon’, IEEE Trans. Neural. Netw., 2003, vol. 14, no. 5, pp. 1426–1434.
[11] Rieutort-Louis, W.R., Moy, T., Wang, Z., Wagner, S., Sturm, J.C., and Verma, N.: ‘A large-area image sensing and detection system based on embedded thin-film classifiers’, IEEE Journal of Solid-State Circuits, 2016, vol. 51, no. 1, pp. 281–290.
[12] Wang, Z., Zhang, J., and Verma, N.: ‘Realizing low-energy classification systems by implementing matrix multiplication directly within an ADC’, IEEE Trans. Biomed. Circ. and Syst., 2015, vol. 9, no. 6, pp. 825–837.
[13] Saha, P., Banerjee, A., Bhattacharyya, P., and Dandapat, A.: ‘Improved matrix multiplier design for high-speed digital signal processing applications’,