A 128 channel Extreme Learning Machine based Neural Decoder for Brain Machine Interfaces
Yi Chen, Student Member, IEEE, Enyi Yao, Student Member, IEEE, Arindam Basu, Member, IEEE
Abstract—Currently, state-of-the-art motor intention decoding algorithms in brain-machine interfaces are mostly implemented on a PC and consume a significant amount of power. A machine learning co-processor in 0.35-µm CMOS for motor intention decoding in brain-machine interfaces is presented in this paper. Using the Extreme Learning Machine algorithm and low-power analog processing, it achieves a very low energy per MAC at the required classification rate. The learning in the second stage and the corresponding digitally stored coefficients are used to increase the robustness of the core analog processor. The chip is verified with neural data recorded in a monkey finger movement experiment, achieving high decoding accuracy for movement type. The same co-processor is also used to decode time of movement from asynchronous neural spikes. With time-delayed feature dimension enhancement, the classification accuracy can be increased with a limited number of input channels. Further, a sparsity promoting training scheme enables a reduction of the number of programmable weights.

Index Terms—Neural Decoding, Motor Intention, Brain-Machine Interfaces, VLSI, Extreme Learning Machine, Machine Learning, Neural Network, Portable, Implant
I. INTRODUCTION
Brain-machine interfaces (BMI) have become increasingly popular over the last decade and open up the possibility of neural prosthetic devices for patients with paralysis or in a locked-in state. As depicted in Fig. 1, a typical implanted BMI consists of a neural recording IC to amplify, digitize and transmit neural action potentials (AP) recorded by the micro-electrode array (MEA). Significant effort has been dedicated in recent years to developing energy efficient neural recording channels for long-term operation of implanted devices [1] [2] [3] [4]. Some recent solutions have also integrated AP detection [5] [6] [7] [8] and spike sorting features [9] [10] [11]. However, in order to produce an actuation command (e.g. for a prosthetic arm), the subsequent step of motor intention decoding is required to map spike train patterns acquired in the neural recording to the motor intention of the subjects. Though various elaborate models and methods of motor intention decoding have been developed in past decades with the goal of achieving high decoding performance [12] [13] [14], state-of-the-art neural signal decoding is mainly conducted on a PC, consuming a considerable amount of power and making it impractical for long-term use. With on-chip real-time motor intention decoding, the size and the power consumption of the computing device can be reduced effectively
Yi Chen, Enyi Yao, and Arindam Basu are with the Centre of Excellence in IC Design (VIRTUS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected], [email protected]).
and the solution becomes truly portable. Furthermore, integrating the neural decoding algorithm with the neural recording device is also desired to reduce the wireless data transmission rate and make the implanted BMI solution scalable as required in the future [15]. Until now, very few attempts have been made to provide a solution to this problem. A low-power motor intention decoding architecture using analog computing is proposed in [16], featuring active filtering with massively parallel computing through low-power analog filters and memories. However, no measurement results are published to support the silicon viability of the architecture. A more recent work proposes a universal computing architecture for neural signal decoding [17]. That architecture is implemented on an FPGA with a power consumption in the µW range.

In this paper, we present a machine learning co-processor (MLCP) achieving low-power operation through massive parallelism, sub-threshold analog processing and careful choice of algorithm. Figure 1 contrasts our approach with traditional approaches: our MLCP acts in conjunction with the digital signal processor (DSP) already present in implants (for spike sorting, detection and packetizing) to provide the decoded outputs. The bulk of the processing is done on the MLCP while simple digital functions are performed on the DSP. Compared to traditional designs that perform the decoding outside the implant, our envisioned system provides the opportunity for huge data compression by integrating the decoder in the implant. The MLCP is characterized by measurement, and the decoding performance of the proposed design is verified with data acquired in an individuated finger movements experiment of monkeys.

Fig. 1: Comparison of envisioned and traditional implanted BMI: The envisioned system uses a machine learning co-processor (MLCP) along with the DSP used in traditional neural implants to estimate motor intentions from neural recordings, thus providing data compression. Traditional systems perform such decoding outside the implant and use bulky computers.
Some initial results of this work were presented in [18]. Here, we present more detailed theory and experimental results, including decoding time of movement and a new sparsity promoting training method, and also discuss the scalability of this architecture.

Fig. 2: Algorithm: (a) The architecture of the Extreme Learning Machine (ELM) with one nonlinear hidden layer and a linear output layer. (b) Use of the ELM in neural decoding for classifying movement type and onset time of movement.

II. PROPOSED DESIGN: ALGORITHM
A. Extreme Learning Machine

1) Network Architecture:
The machine learning algorithm used in this design is the Extreme Learning Machine (ELM) proposed in [19]. As depicted in Fig. 2 (a), the ELM is essentially a single hidden-layer feed-forward network (SLFN). The k-th output of the network (1 ≤ k ≤ C) can be expressed as follows:

o_k = Σ_{i=1}^{L} β_{ki} g(w_i, x, b_i) = Σ_{i=1}^{L} β_{ki} h_i = h^T β_k,
w_i, x ∈ R^D; β_{ki}, b_i ∈ R; h, β_k ∈ R^L   (1)

where x denotes the input feature vector, L is the number of hidden neurons, h is the output of the hidden layer, b_i is the bias for each hidden layer node, and w_i and β_{ki} are input and output weights respectively. A non-linear activation function g() is needed for non-linear classification. A special case of the nonlinear function is the additive node defined by h_i = g(w_i^T x + b_i). The above equation can be compactly written for all classes as o = hβ, o = [o_1, o_2, ..., o_C], where β = [β_1, β_2, ..., β_C] denotes the L × C matrix of output weights. While the outputs o_k can be directly used in regression, for classification tasks the input is categorized as the k-th class if o_k is the largest output. Formally, we can define the classification output as an integer class label s given by s = argmax_k o_k, 1 ≤ k ≤ C. Intuitively, we can think of the first layer as creating a set of random basis functions while the second layer chooses how to combine these functions to match a desired target. Of course, if we could choose the basis functions through training as well, we would need fewer such functions; but the penalty to be paid is a longer training time. More details about the algorithm can be found in [19], [20].
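As a concrete illustration, the network above and its one-shot output-weight solution can be sketched in NumPy. The dimensions (D = 30, L = 60, C = 12) follow the experiments reported later; the log-normal weight statistics, the saturation level and the synthetic data are illustrative assumptions, not chip parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

D, L, C = 30, 60, 12                 # input dim., hidden neurons, classes
W = rng.lognormal(0.0, 0.5, (L, D))  # fixed random input weights (log-normal, as on the chip)
b = np.zeros(L)                      # per-neuron bias b_i (kept at 0 here)

def g(a, f_max=63.0):
    """Saturating nonlinearity: linear up to a programmable stop value f_max."""
    return np.minimum(a, f_max)

def hidden(X):
    """Additive node: h_i = g(w_i^T x + b_i) for every sample."""
    return g(X @ W.T + b)

# Training (method T1): beta_k = H^dagger T_k via the Moore-Penrose pseudo-inverse
p = 200
X = rng.random((p, D))                # p synthetic feature vectors (firing rates)
T = np.eye(C)[rng.integers(0, C, p)]  # one-hot targets, shape (p, C)
H = hidden(X)                         # p x L hidden layer output matrix
beta = np.linalg.pinv(H) @ T          # L x C output weights, solved in one step

# Inference: class label s = argmax_k o_k
o = hidden(X) @ beta
s = o.argmax(axis=1)
```

Only β is trained; W stays fixed after initialization, which is what allows the first layer to be realized by mismatched analog current mirrors later in the paper.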
2) Training Methods:
The special property of the ELM is that w can be random numbers drawn from any continuous probability distribution and remains unchanged after initiation [19], while only β needs to be trained and stored with high resolution. Therefore the training of this SLFN reduces to finding a least-square solution for β given the desired target values in a training set. We will next show two methods of training: the conventional one (T1) for improved generalization, as well as a second method (T2) that promotes sparsity. For simplicity, we show the solution of the weights for one output o_k; the same method can be extended to the other output weights as well and can be represented in a compact matrix equation [19].

Suppose there are p training samples. Then we can create a p × L hidden layer output matrix H, where each row of H holds the hidden layer neuron outputs for one training sample. Let T_k ∈ R^p be the vector of target values for the p samples. With these inputs, the two training methods are shown in Fig. 3. The L2 norm minimization step can be solved directly, with the solution given by β_k = H† T_k, where H† is the Moore-Penrose generalized inverse of the matrix H. Hence, training can happen quickly in this case. The L1 norm minimization step in T2, however, has to be performed using standard optimization algorithms like LARS [21]. Thus T2 provides reduced hardware complexity, due to the reduction in the number of hidden neurons, at the cost of increased training time.

B. Neural Decoding
The neural decoding algorithm we use is inspired by the method in [22]. We replace the committee of ANNs in their work with the ELM in our case. Three specific advantages of the ELM for this application are: (1) the fixed random input weights can be realized by a current mirror array exploiting fabrication mismatch of the CMOS process; (2) one-step training, which is necessary for quick weight updates to address changes in input statistics; and (3) the hidden layer outputs h can be reused for multiple operations on the same input data x. In this case, we have reused h to classify both the onset time and the type of movement. One disadvantage of the ELM algorithm is the use of more hidden neurons compared to fully tuned architectures (e.g. SVM, AdaBoost), since the hidden nodes in ELM only create random projections that are not fine tuned [23]. However, implementing random weights results in large savings over fully tunable weights, making this architecture more lucrative overall. Next, we give an overview of the decoding algorithm, while the reader is pointed to [22] for more details.
1) Movement type and Onset time Decoding:
Figure 2 (b) depicts how the ELM is used in neural decoding. Even though the input is an asynchronous spike train, the ELM produces classification outputs at a fixed rate of once every T_s seconds. The input x is created from the firing rate of the spike trains p(t) = Σ_{t_s} δ(t − t_s) of biological neurons by finding the moving average over a duration T_w. Hence, we can define the firing rate r_i of the i-th neuron at time instant t_k as r_i(t_k) = ∫_{t_k − T_w}^{t_k} p(t) dt, where T_s = 20 ms and T_w = 100 ms following [22]. Finally, x(t_k) = [r_1(t_k), r_2(t_k), ..., r_q(t_k)], where there are q biological neurons in the recording (D = q). As shown in Fig. 2 (b), we have C = M + 1 output neurons in this case, where there are M movement types to be decoded. The (M+1)-th neuron is used to decode the onset time of movement. For decoding the type of movement, we can directly use the method described in the earlier Section II-A for an M-class classifier to get the predicted output class at time t_k as s(t_k) = argmax_p o_p(t_k), 1 ≤ p ≤ M.

For decoding movement onset time, we further create a binary classifier that reuses the same hidden layer but adds an extra output neuron. Similar to [22], this output is trained for regression: the target is a trapezoidal fuzzy membership function which gradually rises from 0 to 1, representing the gradual evolution of biological neural activity. This output o_{M+1} is thresholded to produce the output G(t_k) at time t_k as:

G(t_k) = 1, if o_{M+1}(t_k) > θ; 0, otherwise   (2)

where θ is a threshold optimized as a hyperparameter. Moreover, to reduce spurious classifications and produce a continuous output, the primary output G(t_k) is processed to create G_track(t_k) that is high only if G is high at least λ times over the last τ time points. Further, to reduce false positives, another detection is prohibited for T_r ms after a valid one. The final decoded output F(t_k) is obtained by a simple combination of the two classifiers as F(t_k) = G_track(t_k) × s(t_k).

Fig. 3: Training methods for ELM: T1 is the conventionally used training method, improving generalization by minimizing the norm of the weights as well as the training error (S1: create the hidden layer matrix H ∈ R^{p×L} for all p training samples; S2: least-square optimization with L2 regularization). T2 uses an additional step of sparsifying the output weights to reduce the required hardware (S2: least-square optimization with L1 regularization; S3: prune the hidden layer neurons with zero output weights; S4: least-square optimization with L2 regularization on the pruned network).
2) Time delay based dimension increase (TDBDI):
A common problem in long-term neural recording is the loss of information from electrodes over time due to tissue reactions such as gliosis, meningitis or mechanical failures [24]. Hence, initially functional electrodes may not provide information later on. To retain the quality of decoding, we propose a method commonly used in time series analysis: the use of information from earlier time points [25]. In the context of neural decoding, it means that we use more information from the dynamics of neural activity in functional electrodes in place of the lost information from the instantaneous values of activity in previously functional electrodes. So if we use p − 1 previous values from the n functional electrodes, the new feature vector is given by:

x(t_k) = [r_1(t_k), r_1(t_{k−1}), ..., r_1(t_{k−p+1}), r_2(t_k), r_2(t_{k−1}), ..., r_n(t_{k−p+1})]   (3)

where the input dimension of the ELM is given by D = n × p. This is a novel algorithmic feature of our work compared to [22].

III. PROPOSED DESIGN: HARDWARE IMPLEMENTATION
Fig. 1 shows a typical usage scenario for our MLCP, where it works in tandem with the DSP and performs the intention decoding. The DSP only needs to send very simple control signals to the MLCP and performs the calculation of the second stage of the ELM (multiplication by the learned weights β). The input to the MLCP comes from spike sorting that can be performed on the DSP [10]. In some cases, spike sorting may not be needed and spike detection may be the only required pre-processing [24].

A. Architecture
Details and timing of the MLCP are shown in Fig. 4. We map the input and hidden layers of the ELM onto the MLCP, fabricated in an AMS 0.35-µm CMOS process, where high computation efficiency is achieved by exploiting the fabrication mismatch abundantly found in analog devices, while the output layer that requires precision and tunability (tough to attain in analog designs) can be implemented on the DSP. Since the number of computations in the first stage far outnumbers those in the second (as long as D >> C), such a system partition still retains the power efficiency of the analog design.

Fig. 4: The diagram (a) and the timing (b) of the MLCP based neural decoder.

Up to 128 input channels and 128 hidden layer nodes are supported by the MLCP, with each input channel embedding an input processing circuit that extracts the input feature from the incoming spike trains. As mentioned in the earlier section, we extract a moving average of the spike count as the input feature of interest. On receiving a spike from the neural amplifier array (after spike detection and/or spike sorting), the DSP sends a pulse via
SPK and a 7-bit channel address (A<6:0>) to the DEMUX in the MLCP for row-decoding. Each row of the MLCP has a 6-bit window counter (WinCNT) to count the total number of input spikes in a moving window of length 5·t_s with a moving step of t_s. The length of t_s, normally set to 20 ms, is determined by the period of CLK_in. The counter value in the j-th row is converted into an input feature current I_DACj for the ELM, corresponding to the input x_j in Fig. 2. Furthermore, a 1-bit control signal (S_ext<j>) stored in each row determines whether the j-th row's input to the moving window circuit is an external spike count or a delayed spike count from the previous channel. The delay length can be selected from among 8 delay steps based on SDL<2:0>. This is how the TDBDI feature described earlier is implemented in the MLCP.

The input feature current from each row is further mirrored into all hidden-layer nodes by a current mirror array. Hence, the ratios of the current mirrors are essentially the input weights, and are inherently random due to fabrication mismatch of the transistors even when identical values are used in the design. We use sub-threshold current mirrors to achieve very low power consumption, resulting in w_ij = e^{ΔV_t,ij / U_T}, with U_T denoting the thermal voltage and ΔV_t,ij denoting the threshold voltage mismatch between the input transistor on the j-th row and the mirror transistor on the i-th column of that row. This is similar to the concepts described in [26] [27]. The input weights are log-normal distributed since ΔV_t,ij follows a normal distribution. We therefore realize random input weights in a very low 'cost' way that requires only one transistor per weight. It is the fixed random input weights of the ELM that make this unique design possible. A capacitance C_M on each row sets the SNR of the mirroring.

Fig. 5: Sub-block circuit diagrams: (a) Input processing circuit to take the moving average of incoming spikes; (b) Current-mode DAC to convert the average value to input x for the current mirror; (c) Neuron-based CCO to implement the hidden node non-linearity and convert to digital.

The hidden layer node is implemented by a current controlled oscillator (CCO) driving a counter with a programmable stop value f_max to implement a saturating nonlinearity in the activation function g(). The advantage of choosing this nonlinearity is that it can be digitally set, and some neurons can also be configured to be linear to achieve good performance in linearly separable problems [28]. The computation of the hidden layer nodes is activated by setting NEU high. The output of the CCO is a pulse frequency modulated signal with frequency proportional to the total input current. The counter outputs are latched and serially read out using the CLK_OUT signal when NEU is low, with the CCO disabled to save power. The output weights β are stored on the DSP, where the final output o_k is calculated. Thus the MLCP performs the bulk of the MACs (D × L) while the DSP only
performs the C × L MACs of the output layer. It should be noted here that the output of the hidden layer neurons changes with the power supply voltage due to the sensitivity of the CCO frequency to supply variation, leading to degradation of the decoding accuracy. However, since power supply variation is a common-mode component to all CCOs, normalization methods can be applied in post-processing (see Section IV-E4) to the hidden layer outputs to reduce the effect introduced by power supply variation.

Fig. 6: Die photo and test board: The die photo of the MLCP fabricated in 0.35-µm CMOS process and the portable external unit (PEU) integrating the MLCP with an MCU and battery.
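Because a supply droop scales all CCO frequencies by roughly the same factor, the common-mode component can be divided out of the hidden-layer counts in post-processing. The divide-by-median scheme below is only an illustrative sketch of such normalization, not the exact method of Section IV-E4:

```python
import numpy as np

def normalize_hidden(h):
    """Divide the hidden-layer counts by a robust common-mode estimate.

    A supply variation multiplies every CCO count by approximately the same
    factor; dividing by the median leaves the relative activation pattern
    that the output layer actually uses.
    """
    h = np.asarray(h, dtype=float)
    cm = np.median(h)
    return h / cm if cm > 0 else h

h_nominal = np.array([10.0, 40.0, 80.0, 120.0])  # counts at nominal supply
h_droop = 0.8 * h_nominal                        # droop scales all counts equally
assert np.allclose(normalize_hidden(h_nominal), normalize_hidden(h_droop))
```

After this normalization, the learned β operates on supply-independent features, which is why the common-mode sensitivity of the CCOs does not have to be corrected in the analog domain.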
B. Sub-circuit: Input processing
Fig. 5 shows diagrams of the circuit blocks in the MLCP. Fig. 5 (a) shows two adjacent input processing circuits, with WinCNT1 configured to receive an external spike train by setting S_ext<1> = 0 and WinCNT2 configured as a time delay based channel by setting S_ext<2> = 1. The corresponding signal flows are also depicted in the figure by red dashed lines. The moving window counter is realized by (1) counting spikes in a sub-window of length t_s; (2) storing the sub-window counter value in a delay chain made of shift registers; and (3) adding and subtracting the previous 6-b output value with the corresponding sub-window counter values in the delay chain to get the new 6-b output value of WinCNT. This calculation can be represented as:

Q_n<5:0> = Q_{n−1}<5:0> + D_n<3:0> − D_{n−5}<3:0>,   (4)

where Q_n<5:0> and D_n<3:0> are the 6-b output value and the 4-b sub-window counter value at time instance n respectively. All registers in the input processing circuits toggle at the rising edge of CLK_in. The advantage of this structure is that the delay chain for the sub-window counter values is reused in the proposed TDBDI feature, leading to a compact design.

A compact, 6-bit MOS ladder based current mode DAC, as shown in Fig. 5 (b), splits a reference current I_ref (programmable in the nA range) according to the WinCNT output value to generate the input feature current I_DAC to the current mirrors.
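Equation (4) and the delay-chain reuse can be sketched in software as follows; depth = 5 assumes the 100 ms window is built from five 20 ms sub-windows (T_w/T_s), and the spike counts are illustrative:

```python
from collections import deque

def window_counter(subwindow_counts, depth=5):
    """Sliding spike count over `depth` sub-windows, eq. (4):
    Q_n = Q_{n-1} + D_n - D_{n-depth}.

    depth = 5 corresponds to a 100 ms window made of 20 ms sub-windows,
    matching T_w = 100 ms and T_s = 20 ms used in the decoding algorithm.
    """
    chain = deque([0] * depth, maxlen=depth)  # shift-register delay chain
    Q = 0
    out = []
    for D_n in subwindow_counts:
        D_old = chain[0]          # D_{n-depth}, about to be shifted out
        chain.append(D_n)
        Q = Q + D_n - D_old       # incremental update instead of re-summing
        out.append(Q)
    return out

# The same delay chain supports TDBDI: a "delayed" channel simply reads the
# stored sub-window counts a few steps late instead of the live spike input.
counts = [1, 2, 0, 3, 1, 4, 0, 2]
Q = window_counter(counts)
# brute-force check: Q[n] equals the sum of the last 5 sub-window counts
assert all(Q[n] == sum(counts[max(0, n - 4):n + 1]) for n in range(len(counts)))
```

The incremental form needs only one addition and one subtraction per step, which is exactly what makes the hardware counter compact.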
C. Sub-circuit: Current Controlled Oscillator
The diagram of the current controlled oscillator (CCO) is depicted in Fig. 5 (c). The capacitance C_int = 400 fF sets the oscillation frequency of this relaxation oscillator based on the summed input current, while C_f = 100 fF provides hysteresis through positive feedback. When NEU is pulled high, the charging pFET is turned off. A separate transistor sets the leakage term b_i in equation (1) (I_leak in Fig. 5 (c)) and can be set to 0 for most cases. I_in from the current mirrors starts to discharge v_mem until it crosses the threshold voltage of the first inverter, leading to a transition of all the inverters. Then, v_mem is pulled down very quickly through a positive feedback loop formed by C_f. At the same time, the charging pFET turns on, charging v_mem towards DVDD until it crosses the threshold voltage of the first inverter from low to high, and the cycle repeats.

Fig. 7: MLCP circuit blocks measurement results: (a) TDBDI feature; (b) waveform of CCO oscillation; (c) transfer curves of all 128 CCOs.

Fig. 8: Jitter performance: The variation in the counter output for a fixed value of input current is observed over repeated trials and plotted as a histogram for (a) low (µ = 174, σ = 0), (b) medium (µ = 712.1, σ = 0.48) and (c) high (µ = 2153.6, σ = 1.39) input currents.

Neglecting higher order effects, the time for each cycle
of the CCO operation is determined by the sum of the charging and discharging time constants of v_mem, and can be expressed as:

T_CCO = (C_f × DVDD)/I_in + (C_f × DVDD)/(I_rst − I_in),   (5)

where I_rst is the charging current when the charging pFET is on. Normally I_rst >> I_in, reducing equation (5) to:

f_CCO = 1/T_CCO ≈ I_in/(C_f × DVDD).   (6)

Fig. 9: DNL performance: Measured DNL of randomly selected input DAC channels.

IV. MEASUREMENT RESULTS
A. MLCP Characterization
This section presents the measurement results from the MLCP fabricated in 0.35-µm CMOS process. To test the circuit, we have integrated it with a microcontroller unit or MCU (TI MSP430F5529) to act as the DSP. Though we have not integrated it with an implant yet, this setup does allow us to realistically assess the performance of the MLCP with pre-recorded neural data, as shown later. Moreover, the designed board is entirely portable with its own power supply and wireless TX-RX module (TI CC2500). Hence, it can be used as a portable external unit (PEU) for neural implant systems as well. The die area of the MLCP and the dimensions of the PEU are shown in Fig. 6.

Fig. 10: The random input weights: (a) Measured mismatch map of the CCO frequencies; (b) Distribution of input weights and (c) of ΔV_t,ij. These values are measured by reading the output counter values when a fixed input value is given one row at a time.

TABLE I: Mean and standard deviation of ΔV_t,ij

Chip No. | µ (mV) | σ (mV)
1 | 0.188 | 16.2
2 | 0.132 | 16.9
3 | -0.019 | 16.8
4 | -0.105 | 17.2
5 | 0.004 | 16.5
6 | 0.535 | 16.4
7 | 0.276 | 17.6
8 | -0.012 | 16.6

For the characterization results shown next, we use
AVDD = 2.1 V powering the reference circuits to generate bias currents and DVDD = 0.6 V for the rest. Figure 7 (a) verifies the operation of the input processing by probing the output of the window counter for a given CLK_in frequency and input spike train; the output, labeled Q<5:0>, increases accordingly in the left half of Fig. 7 (a). The TDBDI feature is shown in the right half of Fig. 7 (a), based on a setting of SDL<2:0> = 001 with S_ext = 1; it adds a delay of 40 ms to Q<5:0>, as can be seen by comparing with the waveforms in the left half. The measured charging and discharging dynamics of the CCO based neuron are shown in Fig. 7 (b) by probing a buffered version of the membrane voltage v_mem. The measured transfer curves of the 128 CCOs in a chip are plotted in Fig. 7 (c) by varying the input spike frequency. Here, the saturation of the count is not shown; when implemented, it stops the count at the preset value.

The noise of the whole circuit is also characterized in terms of jitter at the output of the CCO. The variance in the counter value is measured for the same input current over repeated trials. This experiment is repeated for three different current values spanning most of the counting range. The results of this experiment, shown in Fig. 8, demonstrate a percentage jitter of less than 0.1% over the entire counting range.

Next, we show characterization results for the input DAC channels. Since it is not possible to separately measure the output current of the DAC, we measure the output of the CCO to infer the linearity of the DAC. This is reasonable since the linearity and the noise performance of the CCO are better than the bit resolution of the DAC. Figure 9 plots the measured differential non-linearity (DNL) of randomly selected input DACs; the worst-case DNL is within a few LSB. While this DNL can be part of the non-linearity g(w_i, x, b_i) in the general case, it makes the implementation of the additive node less accurate.

Variation in the transfer curves of the CCO array is a result of random mismatch from various aspects of the circuits, mainly the current mirror array, which is expected and desired in this design.
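The log-normal weight statistics implied by this mismatch can be reproduced with a short simulation; U_T ≈ 26 mV is the room-temperature thermal voltage, and the σ of ΔV_t is taken from Table I (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

U_T = 25.6e-3        # thermal voltage at room temperature, ~26 mV
sigma_vt = 16.8e-3   # sigma of dVt, ~16-17 mV across chips (Table I)

dVt = rng.normal(0.0, sigma_vt, 100_000)  # normally distributed Vt mismatch
w = np.exp(dVt / U_T)                     # w_ij = exp(dVt_ij / U_T): log-normal

# log(w) is normal, so w itself is log-normal with median 1 (zero-mean dVt)
assert abs(np.median(w) - 1.0) < 0.02
assert abs(np.std(np.log(w)) - sigma_vt / U_T) < 0.02
```

With σ(ΔV_t)/U_T ≈ 0.65, the mirror ratios spread over roughly a 2x range around unity, which is the diversity of random projections the ELM relies on.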
By applying the same input spike frequency to each row individually, a mismatch map of the CCO frequencies is generated with I_ref = 32 nA, as presented in Fig. 10 (a), by reading out the quantized frequency values in the output counters. These frequencies are normalized to the median frequency and plotted in Fig. 10 (b) and (c) to show conformance to the log-normal distribution, as expected. The underlying random variable ΔV_t,ij has a normal distribution with mean ≈ 0 and standard deviation ≈ 17 mV in all chips, as listed in Table I.

B. Experiment
The neural data used to verify the decoding performance of the proposed design was acquired in a monkey finger movement experiment described in detail in [22]. In the experiment, the monkey puts its right hand into a pistol-grip manipulandum with one finger placed in one slot of the manipulandum. The monkey is trained to perform flexion or extension of the individual fingers and wrist according to a given visual instruction. A single-unit recording device is implanted into the M1 cortex of the monkey, enabling real-time recording of single unit spike trains during the experiment. The entire data set includes neural data recorded from three different monkeys (Monkeys C, G and K) performing 12 types of individuated movements, labeled by the moving finger number and by the first letter of the moving direction. Furthermore, all the trials are aligned such that the onset of the movement happens at the same time instant. Therefore the ELM can be trained according to the given label and the onset moment.

C. Neural Decoding Performance
We have tested the MLCP based PEU using the data setmentioned above. A multiple-output ELM with number ofclasses C = 12 is trained to identify the movement type of thetrial. An additional output is used to decode the onset time ofmovement. During training, the pre-recorded input spikes frombiological neurons in M1 are sent to the MLCP the counter D ec od i ng A cc u r ac y Chip 1Chip 2Chip 3Chip 4Chip 5Chip 6Chip 7Chip 8 T e s ti ng A cc u r ac y D ec o d i n g A cc u r a c y Number Of HIdden Layer Nodes (L) D ec od i ng A cc u r ac y Monkey K (indiv)Monkey K (comb)Monkey CMonkey G
Number Of M1 Neurons (n) D ec o d i n g A cc u r a c y (a)(c) Number Of M1 Neurons (n) D ec o d i n g A cc u r a c y (b)(d) D ec od i ng A cc u r ac y without delayed sampleswith delayed samples Number Of M1 Neurons (n) D ec o d i n g A cc u r a c y Fig. 11:
Measured movement types decoding performance: (a) Decoding accuracy versus number of hidden layer nodes; (b) Decodingaccuracy versus number of M1 neurons (with/without TDBDI); (c) Decoding accuracy across monkeys; (d) Decoding accuracy across 8 dies; values of H are wirelessly transmitted to a PC where f max and β are calculated and communicated back. This processalready includes non-idealities in the analog processor suchas DNL of input DAC, non-linearity in CCO and early effectinduced current addition errors–hence, the learning takes theseeffects into account and corrects for them appropriately. Then,the MLCP can run autonomously during testing phase.We present decoding results in a format similar to [22]for easy comparison wherever possible. For the first set ofexperiments, we use the normal training method T1 describedin section II-A2. As shown in Fig. 11 (a) with n = D = 30 , thedecoding accuracy of the types of movements (the flexionand extension of the fingers and wrist in one hand) increasesas L is increased, with a mean accuracy of . at L = 60 .This trend is expected [19] since more number of randomprojections should allow better separation of the classes tillthe point when the amount of extra information for a newprojection is negligible. Based on this result, we fix L = 60 for the rest of the experiments unless stated otherwise.Next, we explore the variation in performance as numberof available neural channels ( n ) (or equivalently M1 neuronsin this case) reduces while keeping L fixed at . Fig. 11(b) shows that an increase in accuracy from . to . can be obtained at n = 15 , by using delayed samples asadded features (TDBDI). Here, we have used only one earliersample–hence, p = 2 and the effective input dimension of theELM is D = 2 × n . With n = 40 , L = 60 and p = 2 , adecoding accuracy of . can be achieved. 
Next, to check the robustness of the earlier result, the same experiment is performed using several different datasets, including individuated finger movement data from Monkeys K, C and G and combined finger movement from Monkey K (12 individuated movements and 6 types of simultaneous movement of two fingers).

Fig. 12: Flow chart describing the finite state machine on the DSP to calculate G_track from G: the primary output o_M+1(t_k) is thresholded (θ), checked for l sequential positives, and gated by a refractory period (T_r) to give the post-processed output.

The results of the MLCP with increasing M1 neurons, shown in Fig. 11 (c), are consistent with the software results in [22]. The trend of increasing performance with more M1 neurons is expected, since they provide more information. The performance of the proposed MLCP is also robust across eight sample chips, as presented in Fig. 11 (d) for the same experiment as in the last two cases.

Fig. 13: Measured movement onset decoding results: (a) A segment of 40-channel input spike trains is shown with the real-time decoding output deciding when a movement onset happens and which type of onset it is. (b) ROC curves of onset decoding for combinations of l ∈ {2, 6}, T_r ∈ {0, 140 ms} and θ ∈ {0.3, 0.6, 1.0}.

Fig. 14: Advantage of sparsity promoting training T2: the sparsity promoting method chooses the best random projections and can reduce the required number of hidden neurons by around .

The hidden layer output matrix H is reused to decode the onset time of finger movement using the regression capacity of the ELM. As mentioned earlier, only one more output node is added to the ELM. The trapezoidal membership function described in Section II-B and shown in Fig. 13 (a) takes its maximum value around the time of onset and its minimum where there is definitely no movement. Figure 12 illustrates the finite state machine in the MCU that implements the post-processing described in Section II-B to obtain G_track from the primary output G. Optimal values of l = 6 and T_r = 140 ms can be found from the ROC curve shown in Fig. 13 (b). The nature
of the ROC curves is again very similar to the ones in [22]. With H reused, we achieve real-time combined decoding by detecting when there is a movement in the trial and labeling the predicted movement type when a movement onset is detected. This is illustrated by a snapshot of the developed GUI in Fig. 13 (a), where three trials are shown with the 40-channel input spike trains recorded from the M1 region plotted at the bottom of the figure. The primary output, post-processed output and predicted movement type are shown in the top half of the figure.

Lastly, we show the benefits of the sparsity promoting training method T2 described in Section II-A2. To show the benefit of this method, we compare with the first experiment shown earlier in Fig. 11 (a), where n = D = 30 and the number of hidden layer neurons L is varied to see its effect on performance. It can be seen that for method T2, the decoding accuracy increases to approximately the maximum value of . attained by method T1 for a much smaller number of hidden layer neurons (L ≈ ). This is possible because the sparsity promoting step of minimizing the L1 norm of the output weights chooses the most relevant random projections in the hidden layer. Thus, the new method T2 can reduce power dissipation by approximately  due to the reduction in the number of hidden layer neurons.

Fig. 15: Power breakup: power dissipation in the MLCP is dominated by the fixed analog power consumption ( nW) compared to the power ( nW) dissipated from DVDD in the CCO and counter. Measured breakdown: current reference 88.4 nW, 128 input DACs 271.6 nW, CCOs and current mirrors 54 nW.

D. Power Dissipation
Finally, we report the power consumption of the proposed MLCP for the configuration of input channels, hidden layer nodes and classes used above. The current drawn from the analog and digital power supply pins was measured using a Keithley picoammeter. The power breakup is shown in Fig. 15. At the lowest values of AVDD = . V and DVDD = . V needed for robust operation, the total power dissipated is  nW, with  nW from DVDD and  nW from AVDD. Performing D × L MACs in the current mirror array at a 50 Hz rate of classification, the MLCP provides . pJ/MAC and . nJ/classify performance. It is clear that the efficiency is limited by the fixed analog power that is amortized across the L hidden layer neurons and D × L current mirror multipliers. The fundamental limit of this architecture is the power dissipation of the CCO and current mirror array, which corresponds to . pJ/MAC.

TABLE II: Comparison Table
                      JSSC 2013 [29]    JSSC 2007 [30]     JSSC 2013 [31]    ISSCC 2014 [32]     This Work
Technology            0.13 µm           0.5 µm             0.13 µm           0.13 µm             0.35 µm
Supply voltage        0.85 V            4 V                1.2 V (digital)   3 V                 0.6 V (digital)
                                                           1 V (analog)                          1.2 V (analog)
Design style          Digital           Analog floating    Mixed mode        Analog floating     Mixed mode
                                        gate                                 gate
Algorithm             SVM               SVM                Fuzzy logic       Deep learning       ELM feature
                                                                             feature             with TDBDI
Application           EEG/ECG analysis  Speech recognition Image processing  Autonomous sensing  Neural implant
Power dissipation     136.5 µW          0.84 µW            57 mW             11.4 µW             0.4 µW
Max input dimension   400               14                 14                8                   128
Energy efficiency     631 pJ/MAC
Resolution            16 b              4.5 b              6 b               7.5 b               7/14 b
Classification rate   0.5-2 Hz          40 Hz              5 MHz             8.3 kHz             50 Hz

Notes: Can be further extended by reusing input channels at the expense of classification rate. Assuming 1000 support vectors; the operations are much simpler than a MAC. For this work, D = 40, L = 60 and C = 12; in reality, analog power is amortized across all multiplies and the peak efficiency of . pJ/MAC is attainable for D = L = 128 for the same value of C (see Section IV-D for details). Each multiply is 7-bit accurate due to SNR limitation, while the output quantization in the CCO-ADC has 14 bits for dynamic range.
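The table's efficiency figures for this work follow from a simple amortization argument: a fixed analog power is spread over the D × L multiplies performed per classification, on top of the intrinsic cost of each current-mirror multiply. A minimal sketch with illustrative values: the 414 nW fixed power is the sum of the Fig. 15 analog components, while the 0.1 pJ per multiply is an assumed placeholder, not a measured number.

```python
def energy_per_mac(p_fixed_w, e_mult_j, D, L, f_class_hz):
    """Fixed analog power amortized over the D*L multiplies done per
    classification, plus the intrinsic energy of one current-mirror multiply."""
    macs_per_second = D * L * f_class_hz
    return p_fixed_w / macs_per_second + e_mult_j

# Illustrative values only: 414 nW fixed analog power (88.4 + 271.6 + 54 nW
# from the Fig. 15 breakdown); 0.1 pJ/multiply is an assumption.
e_used = energy_per_mac(414e-9, 0.1e-12, D=40, L=60, f_class_hz=50)
e_peak = energy_per_mac(414e-9, 0.1e-12, D=128, L=128, f_class_hz=50)
```

Growing D and L toward the full 128 × 128 array improves the energy per MAC several-fold under these assumptions, mirroring the observation that peak efficiency is attained at D = L = 128.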
In contrast, recently reported -bit digital multipliers consume 16-70 pJ/MAC [33] [34] [35] [36], where we ignore the power consumed by the adder for simplicity. We have also implemented near-threshold digital array multipliers in  nm CMOS operating at . V that resulted in an energy efficiency of  pJ/MAC, confirming the much lower energy attainable by analog solutions over digital ones. Moreover, implementing the MLCP computations in the digital domain would incur further energy cost due to memory access (for fetching the weight values) and clocking, which are ignored here.

Since we implement the second-stage operation in the digital domain, we need C × L multiplications per classification. For the case of L = 60 and C = 12 described above and an energy cost of  pJ/MAC for digital multiplies, the total energy cost of the second-stage operation is . nJ/classify. Hence, the total energy/classification becomes . nJ and the combined energy/operation increases to . pJ/MAC. For peak energy efficiency, we consider D = 128, L = 128 and C = 12, resulting in a net energy/computation of . pJ/MAC including both stages.

E. Discussion

1) Comparison:
Our MLCP is compared with other recently reported machine learning systems in Table II. Compared to the digital implementation of SVM in [29], our implementation achieves far less energy per MAC due to the analog implementation. [30], [31] and [32] achieve good energy efficiency similar to our method by using analog computing. [31] uses a multiplying DAC (MDAC) to perform the multiplication by weights; however, they have only -bit resolution in the multiply, and the MDAC also occupies much larger area than the single transistor we use for multiplications. [30] and [32] use analog floating-gate transistors for the multiplication. Compared to these, our single-transistor multiplier takes less area (no capacitors as needed in floating gates), does not require high voltages for programming charge, and allows digital correction of errors because of the digital output.

Fig. 16: Array layout: the area of the current IC is limited by the pitch of the CCO and WINCNT circuits, even though the actual area of the current mirrors (0.4 × 0.35 µm) is very small.
2) Area Limits:

Using a single transistor for multiplication in the first layer should provide area benefits over other schemes. The current layout (Fig. 16) was chosen for its simple connection patterns and is not optimized. It can be seen that the actual area of a unit transistor in the array (0.4 × 0.35 µm) is much less than the area of a unit cell in the layout, which is limited by the pitch of the CCO and the window counter circuits. Moving to a highly scaled process, or folding the placement of the output CCO layer to be parallel to the input window counter circuits, would enable a large reduction (≈ X) in the area of the current mirror array. The ultimate limit in terms of area for this architecture stems from the area of capacitors; for this 128-input, 128-output architecture, the total capacitor area is . mm².
3) Data rate requirements:

When used in an implant with offline training, the MLCP can reduce the transmission data rate drastically. Firstly, for direct transmission of raw channel data sampled at  kHz with -bit resolution, the required data rate is  Mbps. This massive data rate can be reduced partially by including spike sorting [11]. In this case, assuming 8-bit address encoding of a maximum of 256 biological neurons, each firing at a rate f_bio, the data rate to be transmitted for a conventional implant without a neural decoder is given by R_conv = 8 × 256 × f_bio. As an example, with f_bio = 100 Hz, R_conv = 204 kbps. This can be reduced even further by integrating the decoder as proposed here. For the proposed case, the output of the decoder is obtained at a rate f_deco. During regular operation after training, the data rate for C classes is given by R_prop,test = f_deco × ⌈log2(C)⌉. As an example, for the case described in Section IV-C with f_deco = 50 Hz and C = 13, R_prop,test = 200 bps. This example shows the potential for thousand-fold data rate reductions over spike sorting by integrating the decoder in the implant.

From the viewpoint of power dissipation, the analog front end and spike detection can be accomplished within a power budget of  µW per channel [37] [38] [5]. Assuming a transmission energy of ≈  pJ/bit from recently reported wireless transmitters for implants [39]–[41], the power dissipation for raw data rates of  kbps/channel and compressed data rates of  kbps/channel after spike sorting is  µW and .  µW respectively. Hence, the power for wireless transmission is a bottleneck for systems transmitting raw data. For systems with spike sorting in the implant, this power dissipation is not a bottleneck; however, the power/channel needed for the spike sorter is about  µW. In comparison, if our decoder operates directly on the spike detector output, it can provide compression at a power budget of < .  µW/channel. This would result in a total power dissipation/channel of ≈  µW in our case, compared to ≈  µW in the case of spike sorting: a 6X reduction. There is much evidence that decoding algorithms can work on the spike detector output [24]; in fact, it is believed that this will make the system more robust for long-term use. This will be a subject of our future studies.

Even if the decoder is explanted, an MCU cannot provide sufficient throughput to support advanced decoding algorithms, while FPGA-based systems consume a large baseline power. A custom MLCP-based solution provides an attractive combination of low-power and high-throughput operation when paired with a flexible MCU for control flow.
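The data-rate arithmetic above is easy to check numerically. The sketch below assumes 8-bit addresses for up to 256 sorted neurons, a value consistent with R_conv ≈ 204 kbps at f_bio = 100 Hz; treat the neuron count as an assumption rather than a quoted specification.

```python
from math import ceil, log2

def r_conventional(addr_bits, n_neurons, f_bio_hz):
    """Spike-sorted transmission: one neuron address per spike."""
    return addr_bits * n_neurons * f_bio_hz          # bits per second

def r_decoder(f_deco_hz, n_classes):
    """Decoder-in-implant transmission: one class label per decision."""
    return f_deco_hz * ceil(log2(n_classes))         # bits per second

r_conv = r_conventional(addr_bits=8, n_neurons=256, f_bio_hz=100)  # 204800 b/s
r_prop = r_decoder(f_deco_hz=50, n_classes=13)                     # 200 b/s
```

The ratio is roughly three orders of magnitude, matching the thousand-fold reduction claimed in the text.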
4) Normalization for Increased Robustness:
The variation of temperature is not a big concern in the case of implantable electronics, since body temperature is well regulated. However, variation of the power supply voltage can be a concern. A normalization method can be applied to the hidden layer output to reduce its variation due to power supply fluctuation, at the cost of additional computation. The normalization proposed here can be expressed as:

h_j,norm = h_j / ( Σ_{j=0}^{L} h_j / Σ_{i=0}^{D} x_i ).   (7)

The rationale behind the proposed normalization is that the effect of power supply fluctuation on the hidden layer output can be modelled as a multiplicative factor in the hidden layer output equation. As analyzed before, the output of the j-th hidden layer node can be formulated as h_j = (I_in,j / (C_f × V_DD)) × t_cnt, where I_in,j is the input current of the j-th hidden layer node and t_cnt is the counting window length. Since I_in,j is proportional to the strength of the input vector x = [x_1, x_2, ..., x_D], we can model the relation between the input vector and the hidden layer output as h_j = K_j α(T, V_DD) Σ_{i=0}^{D} x_i, where the variation is the multiplicative term α(T, V_DD), and K_j lumps the constant part of the path gain from the input to the j-th hidden layer output. It is reasonable to assume that α(T, V_DD) is the same across different nodes, since fluctuation of the power supply is a global effect at the chip scale. Hence, it can be cancelled by the proposed normalization:

h_j,norm = h_j / ( Σ_{j=0}^{L} h_j / Σ_{i=0}^{D} x_i )
         = K_j α(T, V_DD) Σ_{i=0}^{D} x_i / [ Σ_{j=0}^{L} ( K_j α(T, V_DD) Σ_{i=0}^{D} x_i ) / Σ_{i=0}^{D} x_i ]
         = ( K_j / Σ_{j=0}^{L} K_j ) Σ_{i=0}^{D} x_i.   (8)

Fig. 17: Normalization to reduce variation: blue solid lines are the original hidden layer output from SPICE simulation, while green dashed lines are the normalized output in both (a) and (b). The input x in (a) and (b) is 8 and 10 respectively.

Simulation results are presented here to verify the proposed method of normalization. The original hidden layer outputs (L = 3) are obtained from SPICE simulations where DVDD is swept from . V to . V and the input x (D = 1) changes from 8 to 10. The original and normalized values of one of the hidden layer outputs are compared in Fig. 17. As can be observed, the normalized output (green dashed lines) varies significantly less with DVDD than the original output (blue solid lines). The hardware cost for this normalization is D + L additions and L divisions. Assuming similar costs for division and multiplication, the normalization does not incur much overhead if C >> 1, since L × C multiplications are required by the second stage anyway.
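The cancellation in Eq. (8) can be verified numerically. In the sketch below (illustrative NumPy, with made-up gains K_j and input x), a global supply-induced gain α scales every hidden output, and the normalization of Eq. (7) removes it exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.uniform(0.5, 2.0, size=60)     # per-node path gains K_j (device mismatch)
x = rng.uniform(0.0, 10.0, size=30)    # input vector

def hidden(alpha):
    """h_j = K_j * alpha(T, VDD) * sum_i x_i, the model behind Eq. (8)."""
    return K * alpha * x.sum()

def normalize(h):
    """Eq. (7): h_j / (sum_j h_j / sum_i x_i); the global alpha cancels."""
    return h / (h.sum() / x.sum())

h_nominal = normalize(hidden(alpha=1.00))
h_drooped = normalize(hidden(alpha=0.85))   # 15% supply-induced gain change
```

Both normalized vectors are identical and equal to (K_j / Σ_j K_j) Σ_i x_i, so the second stage sees supply-independent features.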
5) Considerations for Long Term Implants:
When using this MLCP-based decoder in long-term implants, we have to consider issues of parameter drift over several time scales. Over the long term of days, aging of the circuits in the MLCP or probe impedance changes due to gliosis and scarring may change performance. This is typically countered by retraining the decoder every day [24]. Such retraining has allowed decoders to operate at a similar level of performance over years. Over shorter time scales, any variation not sufficiently quenched by the normalization method described earlier can be explicitly calibrated by digital multiplication of coefficients for every input and output channel. These can be determined periodically by injecting calibration inputs and observing the output of the CCO.

Another type of training, referred to as decoder retraining [42], [43], is needed to take into account changes in neural statistics during closed-loop experiments. The training done here may be thought of as open-loop training for initialization of the coefficients of the second stage of the ELM. Next, the experiment has to be redone with closed-loop feedback and a new training data set has to be generated for retraining the second-layer weights. After several such iterations, the final set of second-layer weights is obtained.

V. CONCLUSION
We presented an MLCP in 0.35-µm CMOS with a die area of 4.95 ×  mm², consuming .  µW at a  Hz classification rate and resulting in an energy efficiency of . pJ/MAC. Learning in the second stage also compensates for non-idealities in the analog processor. Furthermore, it includes a time-delayed sample based dimension increase feature (TDBDI) for enhancing decoding performance when the number of recorded neurons is limited. A sparsity promoting training method is shown to reduce the number of hidden layer neurons and output weights by ≈ . We demonstrated the operation of the IC for decoding individuated finger movements using recordings of M1 neurons. However, the ELM algorithm used in the decoder is quite general and has been shown to be a universal approximator and equivalent to SVMs or multi-layer perceptrons [20]. Hence, our MLCP can also be used for other decoding applications requiring regression or classification computations. Higher dimensions of inputs and hidden layers can be handled by making a larger IC and also by reusing the same hidden layer several times. In either case, power dissipation increases but not energy/compute. Higher input dimensions can be accommodated at the same power by reducing the bias current of the splitter DACs in the input channels [27]. An increase in hidden layer neurons, however, does incur a proportional power increase. Given that the power requirement of the current decoder is > X lower than that of the AFE, we can easily extend it to handle many more input and output channels.

VI. ACKNOWLEDGEMENT
The authors would like to thank Dr. Nitish Thakor for providing neural recording data.

REFERENCES

[1] R. R. Harrison, P. T. Watkins, R. J. Kier, R. O. Lovejoy, D. J. Black, B. Greger, and F. Solzbacher, “A low-power integrated circuit for a wireless 100-electrode neural recording system,”
IEEE Journal of Solid-State Circuits , vol. 42, no. 1, pp. 123–133, Jan. 2007.[2] R. Sarpeshkar, W. Wattanapanitch, S. K. Arfin, B. I. Rapoport, S. Man-dal, Michael W. Baker, M. S. Fee, S. Musallam, and R. A. Adersen,“Low-power circuits for brain-machine interfaces,”
IEEE Transactions on Biomedical Circuits and Systems, vol. 2, no. 3, pp. 173–183, Sept. 2008.
[3] F. Shahrokhi, K. Abdelhalim, D. Serletis, P. L. Carlen, and R. Genov, “The 128-channel fully differential digital integrated neural recording and stimulation interface,”
IEEE Transactions on Biomedical Circuitsand Systems , vol. 4, no. 3, pp. 149–161, Jun. 2010.[4] Y. Chen, A. Basu, L. Liu, X. Zou, R. Rajkumar, G. Dawe, and M. Je, “ADigitally Assisted, Signal Folding Neural Recording Amplifier,”
IEEETransactions on Biomedical Circuits and Systems , vol. 8, no. 8, pp.528–542, August 2014. [5] E. Yao and A. Basu, “A 1 V, Compact, Current-Mode Neural SpikeDetector with Detection Probability Estimator in 65 nm CMOS,” in
IEEE ISCAS , May 2015.[6] J. Holleman, A. Mishra, C. Diorio, and B. Otis, “A micro-power neuralspike detector and feature extractor in .13um CMOS,” in
IEEE CustomIntegrated Circuits Conference (CICC) , 2008.[7] L. Hoang, Y. Zhi, and W. Liu, “VLSI architecture of NEO spike detec-tion with noise shaping filter and feature extraction using informativesamples,” in
IEEE EMBC , Sept. 2009, pp. 978–981.[8] B. Gosselin and M. Sawan, “An Ultra Low-Power CMOS AutomaticAction Potential Detector,”
IEEE Transactions on Neural Systems andRehabilitation Engineering , vol. 17, no. 4, pp. 346–353, Aug. 2009.[9] T. Chen, K. Chen, Z. Yang, K. Cockerham, and W. Liu, “A biomedicalmultiprocessor SoC for closed-loop neuroprosthetic applications,” in
Solid-State Circuits Conference - Digest of Technical Papers, 2009.ISSCC 2009. IEEE International , Feb 2009, pp. 434–435,435a.[10] V. Karkare, S. Gibson, and D. Markovic, “A 130-uW, 64-Channel NeuralSpike-Sorting DSP chip,”
IEEE Journal of Solid-State Circuits , vol. 46,no. 5, pp. 1214–22, May 2011.[11] V. Karkare, S. Gibson, and D. Markovic, “A 75-uW, 16-Channel NeuralSpike-Sorting Processor With Unsupervised Clustering,”
IEEE Journalof Solid-State Circuits , vol. 48, no. 9, pp. 2230–8, Sept 2013.[12] S. Acharya, F. Tenore, V. Aggarwal, R. Etienne-Cummings, M. Schieber,and N. Thakor, “Decoding individuated finger movements using volume-constrained neuronal ensembles in the M1 hand area,”
IEEE Transac-tions on Neural Systems and Rehabilitation Engineering , vol. 16, pp.15–23, 2008.[13] P. Ifft, S. Shokur, Z. Li, M. Lebedev, and M. Nicolelis, “A brain-machineinterface enables bimanual arm movements in monkeys,”
Science:Translational Medicine , vol. 5, pp. 1–13, 2013.[14] L. Hochberg, D. Bacher, B. Jarosiewicz, N. Masse, J. Simeral, J. Vogel,S. Haddain, J. Liu, S. Cash, P. der Smagt, and J. Donoghue, “Reachand grasp by people with tetraplegia using a neurally controlled roboticarm,”
Nature , vol. 485, pp. 372–375, 2012.[15] I. H. Stevenson and K. P. Kording, “How advances in neural recordingaffect data analysis,”
Nature Neuroscience , vol. 14, pp. 139–142, 2011.[16] B. Rapoport, W. Wattanapanitch, H. Penagos, S. Musallam, R. Andersen,and R. Sarpeshkar, “A biomimetic adaptive algorithm and low-powerarchitecture for implantable neural decoders,” in , 2009.[17] B. Rapoport, L. Turicchian, W. Wattanapanitch, T. Davidson, andR. Sarpeshkar, “Efficient universal computing architectures for decodingneural activity,”
PLoS ONE , vol. 7, pp. e42492, 2012.[18] Y. Chen, Y. Enyi, and A. Basu, “A 128 Channel 290 GMACs/W MachineLearning Based Co-Processor for Intention Decoding in Brain MachineInterfaces,” in
IEEE ISCAS , May 2015.[19] G. B. Huang, Q. Y. Zhu, and C. K. Siew, “Extreme Learning Machines:Theory and Applications,”
Neurocomputing , vol. 70, pp. 489–501, 2006.[20] G. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learningmachine for regression and multiclass classification,”
Systems, Man,and Cybernetics, Part B: Cybernetics, IEEE Transactions on , vol. 42,no. 2, pp. 513–529, April 2012.[21] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani,“Least angle regression,”
The Annals of Statistics , vol. 32, no. 2, pp.407–499, 2004.[22] V. Aggarwal, S. Acharya, F. Tenore, H. Shin, R. Etienne-Cummings,M. Schieber, and N. Thakor, “Asynchronous decoding of dexterousfinger movements using M1 neurons,”
IEEE Transactions on NeuralSystems and Rehabilitation Engineering , vol. 16, pp. 3–14, 2008.[23] A. Rahimi and B. Recht, “Weighted Sums of Random Kitchen Sinks:Replacing minimization with randomization in learning,” in
Proceedings of Neural Information Processing Systems (NIPS), 2009.
[24] J. C. Kao, S. D. Stavisky, D. Sussillo, P. Nuyujukian, and K. V. Shenoy, “Information Systems Opportunities in Brain-Machine Interface Decoders,”
Proceedings of the IEEE, vol. 102, no. 5, pp. 666–682, May 2014.
[25] A. Grigorievskiy, Y. Miche, A. Ventela, E. Severin, and A. Lendasse, “Long-term time series prediction using OP-ELM,”
Neural Networks ,vol. 51, pp. 50–56, March 2014.[26] A. Basu, S. Shuo, H. Zhou, M. Lim, and G. Huang, “Silicon SpikingNeurons for Hardware Implementation of Extreme Learning Machines,”
Neurocomputing , vol. 102, pp. 125–134, 2013.[27] Y. Enyi, S. Hussain, and A. Basu, “Computation using Mismatch:Neuromorphic Extreme Learning Machines,” in
IEEE BiomedicalCircuits and Systems Conference (BioCAS) , Rotterdam, 2013, pp. 294–7. [28] Yoan Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse,“Op-elm: Optimally pruned extreme learning machine,” Neural Net-works, IEEE Transactions on , vol. 21, no. 1, pp. 158–162, Jan 2010.[29] Kyong Ho Lee and N. Verma, “A low-power processor with configurableembedded machine-learning accelerators for high-order and adaptiveanalysis of medical-sensor signals,”
Solid-State Circuits, IEEE Journalof , vol. 48, no. 7, pp. 1625–1637, July 2013.[30] S. Chakrabartty and G. Cauwenberghs, “Sub-microwatt analog vlsitrainable pattern classifier,”
Solid-State Circuits, IEEE Journal of , vol.42, no. 5, pp. 1169–1179, May 2007.[31] Jinwook Oh, Gyeonghoon Kim, Byeong-Gyu Nam, and Hoi-Jun Yoo, “A57 mw 12.5 uj/epoch embedded mixed-mode neuro-fuzzy processor formobile real-time object recognition,”
Solid-State Circuits, IEEE Journalof , vol. 48, no. 11, pp. 2894–2907, Nov 2013.[32] J. Lu, S. Young, I. Arel, and J. Holleman, “1 TOPS/W AnalogDeep Machine-Learning Engine with Floating-Gate Storage in 0.13umCMOS,” in
ISSCC Dig. Tech. Papers , 2014, pp. 504–5.[33] Y. He and C-H. Chang, “A New Redundant Binary Booth Encodingfor Fast 2n-Bit Multiplier Design,”
IEEE Transactions on Circuits andSystems-I , vol. 56, no. 6, pp. 1192–1201, June 2009.[34] K-S Chong, B-H Gwee, and J. S. Chang, “A Micropower Low-VoltageMultiplier With Reduced Spurious Switching,”
IEEE Transactions onVLSI , vol. 13, no. 2, pp. 255–65, 2005.[35] M. La Guia de Solaz and R. Conway, “Razor Based ProgrammableTruncated Multiply and Accumulate, Energy-Reduction for EfficientDigital Signal Processing,”
IEEE Transactions on VLSI , vol. 23, no.1, pp. 189–93, Jan 2015.[36] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A.G.M. Strollo,“Truncated Binary Multipliers With Variable Correction and MinimumMean Square Error,”
IEEE Transactions on Circuits and Systems-I , vol.57, no. 6, pp. 1312–25, Jun 2010.[37] D. Han, Y. Zheng, R. Rajkumar, G. Dawe, and M. Je, “A 0.45V100-channel neural-recording IC with sub-uW/channel consumption in0.18um CMOS,” in
IEEE International Solid-State Circuits Conference ,2013, pp. 290–291.[38] Y. Enyi, Y. Chen, and Arindam Basu, “A 0.7 V, 40 nW Compact,Current-Mode Neural Spike Detector in 65 nm CMOS,”
IEEE Trans-actions on Biomedical Circuits and Systems , Early Access 2015.[39] J. Tan, W. S. Liu, C. H. Heng, and Y. Lian, “A 2.4 GHz ULPreconfigurable asymmetric transceiver for single-chip wireless neuralrecording IC,”
IEEE Transactions on Biomedical Circuits and Systems ,vol. 8, no. 4, pp. 497–509, Aug 2014.[40] S. X. Diao, Y. J. Zheng, Y. Gao, S. J. Cheng, X. J. Yuan, and M. Y.Je, “A 50-Mb/s CMOS QPSK/O-QPSK transmitter employing injectionlocking for direct modulation,”
IEEE Transactions on Microwave Theoryand Techniques , vol. 60, no. 1, pp. 120–130, Jan 2012.[41] M. Chae, Z. Yang, M. Yuce, L. Hong, and W. Liu, “A 128-Channel6 mW Wireless Neural Recording IC With Spike Feature Extractionand UWB Transmitter,”
IEEE Transactions on Neural Systems andRehabilitation Engineering , vol. 17, no. 4, pp. 312–321, 2009.[42] J. M. Fan, P. Nuyujukian, and J. C. Kao et. al., “Intention estimation inbrain machine interfaces,”
Journal of Neuroengg. , vol. 11, no. 1, 2014.[43] A. L. Orsborn, S. Dangi, H. G. Moorman, and J. M. Camena, “Closed-loop decoder adaptation on intermediate time-scales facilitates rapidBMI performance improvements independent of decoder initializationconditions,”