End-to-End Memristive HTM System for Pattern Recognition and Sequence Prediction
by Abdullah M. Zyarah, Kevin Gomez, and Dhireesha Kudithipudi
Note: The post-production title of the paper was changed to "Neuromorphic System for Spatial and Temporal Information Processing".
Bibtex citation:

@ARTICLE{Zyarah2020,
  author    = {Zyarah, Abdullah M and Gomez, Kevin and Kudithipudi, Dhireesha},
  journal   = {IEEE Transactions on Computers},
  title     = {Neuromorphic System for Spatial and Temporal Information Processing},
  year      = {2020},
  doi       = {10.1109/TC.2020.3000183},
  publisher = {IEEE Computer Society}
}

Plain-text citation:
A. M. Zyarah, K. Gomez and D. Kudithipudi, "Neuromorphic System for Spatial and Temporal Information Processing," in IEEE Transactions on Computers, 2020, doi: 10.1109/TC.2020.3000183.
End-to-End Memristive HTM System for Pattern Recognition and Sequence Prediction
Abdullah M. Zyarah, Kevin Gomez, Dhireesha Kudithipudi, Senior Member, IEEE
Abstract—Neuromorphic systems that learn and predict from streaming inputs hold significant promise in pervasive edge computing and its applications. In this paper, a neuromorphic system that processes spatio-temporal information on the edge is proposed. Algorithmically, the system is based on hierarchical temporal memory, which inherently offers online learning, resiliency, and fault tolerance. Architecturally, it is a full custom mixed-signal design with an underlying digital communication scheme and analog computational modules. Therefore, the proposed system features reconfigurability, real-time processing, low power consumption, and low-latency processing. The proposed architecture is benchmarked to predict on real-world streaming data. The network's mean absolute percentage error on the mixed-signal system is 1.129X lower compared to its baseline algorithm model. This reduction can be attributed to device non-idealities and the probabilistic formation of synaptic connections. We demonstrate that the combined effect of Hebbian learning and network sparsity also plays a major role in extending the overall network lifespan. We also illustrate that the system offers a 3.46X reduction in latency and a 77.02X reduction in power consumption when compared to a custom CMOS digital design implemented at the same technology node. By employing specific low-power techniques, such as clock gating, we observe a 161.37X reduction in power consumption.
Index Terms—Neuromorphic computing, Hierarchical Temporal Memory, Synthetic Synapses Representation, Plasticity, Neocortex.
• This work is sponsored in part by Seagate Technology.
• Abdullah M. Zyarah is with the Neuromorphic AI Lab, Department of Computer Engineering, Rochester Institute of Technology, NY 14623, USA (E-mail: [email protected]).
• Kevin Gomez is with Seagate Research Group, Seagate Technology, MN 55379, USA (E-mail: [email protected]).
• Dhireesha Kudithipudi is with the Neuromorphic AI Lab, Department of Electrical and Computer Engineering, University of Texas at San Antonio, TX 78249, USA (E-mail: [email protected]).

1 INTRODUCTION

Over the course of the last decade, there has been a profound shift in artificial intelligence (AI) research, where biologically inspired computing systems are being actively studied to address the demand for energy-efficient intelligent devices. Biologically inspired systems, such as hierarchical temporal memory (HTM) [1], [2], have demonstrated strong capability in processing spatial and temporal information with a high degree of plasticity while learning models of the world. HTM also exhibits natural compatibility with continuous online learning [3], noise and fault tolerance [4], and low power consumption achieved through sparse neuronal activity [5], [6]. These properties make the algorithm attractive for a wide range of applications such as visual object recognition and classification [7], [8], prediction of data streams [9], natural language processing, and anomaly detection [10]. Although HTM is an attractive algorithm, it demands high computational power that cannot be fulfilled by conventional von Neumann architectures. This is because the innate HTM architecture, which is composed of thousands of neuronal circuits, requires high parallelism in information processing. One may map the HTM algorithm onto a GPU; a GPU can provide the necessary parallelism, but it fails to deliver satisfactory performance and demands a large power budget [11]. To this end, several research groups have attempted to develop specialized custom hardware designs to run the HTM algorithm efficiently and affordably [12]. While some of the previous designs focused only on the spatial aspect of the HTM [13]–[15], other endeavors incorporated both the spatial and temporal models in the same design. For instance, in 2015, Zyarah et al. implemented the HTM algorithm including both the spatial and temporal aspects [16]. The implemented network incorporates 100 mini-columns with 3 cells each, and is verified for image classification and sequence prediction. Furthermore, it supports synthetic synapses, realized with distributed memory blocks, to enable synaptic pathway dynamics. The authors further optimized their design in [17]. Weifu Li et al. [18] proposed a full architecture of the HTM algorithm in 2016. The proposed architecture is composed of 400 mini-columns (2 cells in each mini-column) connected in point-to-point format to the HTM input space, which eventually causes the mini-columns to be in an active mode even when there is insignificant activity (noise) in the input space. When it comes to HTM memristor-based analog and mixed-signal designs, in 2016, Fan et al. implemented the first generation of HTM, HTM-Zeta. The authors proposed resistive-crossbar network (RCN) pattern matching modules with a core processing unit named spin-neurons [19]. The network operation is verified for image classification in an offline fashion, as the proposed design does not support online learning. In 2018, Krestinskaya et al.
presented a full analog design of the HTM, but the temporal aspect of the implementation does not match that of the HTM sequence memory, as it depends on the class map concept, which matches the stored patterns with the test ones (unseen input samples) [20]. To the best of our knowledge, there is no full custom mixed-signal design of the HTM algorithm in the literature with an underlying digital communication scheme and analog computational modules. Such a design should include the necessary reconfigurability, low energy-delay product, and a robust communication
scheme, in one platform. It is important to mention here that such architectures have been explored in the context of spiking neural networks (SNNs) [21], where the communication scheme is realized with address event representation (AER), developed by Mahowald in 1992 [22]. AER takes advantage of sparse neuronal activity and high-bandwidth VLSI to enable time-multiplexed communication. Hence, it reduces the number of connections between sending and receiving neuronal arrays from n to log n [23]. AER is considered an effective approach for point-to-point connections, but not for complex networks with sparse connections. Complex network connectivity is addressed by the enhanced AER proposed by Goldberg et al. [24]. The enhanced AER uses look-up tables (LUTs) to describe the connectivity network between two sets of neuronal arrays. The LUT contains the sender address, the destination address, and the probability of connectivity. Thus, complex networks, even sparse ones, can easily be implemented. However, the enhanced AER demands a large amount of memory, which makes it unsuitable for power- and area-constrained devices. Therefore, this paper also proposes a synthetic synapses representation (SSR) communication scheme, which leverages linear feedback shift registers (LFSRs) to describe the sparse connections among neurons. Using LFSRs eliminates the need for memory-based address description, as the addresses between neurons are generated rather than stored. This results in a considerable reduction in the network area and power consumption. Specific key contributions of this paper are as follows:
• Developing a memristor-based mixed-signal neuromorphic system of the HTM network including both the spatial pooler and temporal memory.
• A synthetic synapses representation (SSR) communication scheme is proposed to virtually formulate and prune the physical synaptic connections in the HTM network.
• System-level analysis of the performance, lifespan, area, and power consumption with respect to a CMOS-only implementation is performed.
2 HIERARCHICAL TEMPORAL MEMORY
HTM is a biomimetic algorithm that aims to develop a computational framework capturing the structure and the algorithmic properties of the human neocortex. Structurally, the algorithm is composed of hierarchically ascending layers of cellular regions that enable the network to capture spatial and temporal information, as shown in Fig. 1. Each region in the HTM is composed of building blocks, namely cells, which are arranged in columns to model biological mini-columns. The cell in HTM is an abstract model of the excitatory pyramidal neuron. Like pyramidal neurons, each cell has hundreds of synaptic connections grouped into three integration zones (or segments): proximal, distal, and apical [4], [25]. (A cell in HTM typically has one proximal segment, shared with the other cells of the same mini-column, and multiple distal and apical segments.) The proximal segment is dedicated to receiving the feed-forward input, i.e., observing the cellular activities in the lower layers of the hierarchy, or the sensory input. Typically, activities detected on proximal segments lead to the generation of a neuronal action potential. The distal and apical segments, on the other hand, are dedicated to observing the cellular activities of the neighboring cells in the same region (contextual input) and higher levels in the hierarchy (feedback input), respectively. Unlike the proximal segment, the cellular activities detected by distal and apical segments lead to NMDA spikes [26]. The NMDA spikes slightly depolarize the cell without generating an action potential, giving the cell a competitive advantage in responding to future input representations [27].

Fig. 1 shows a high-level diagram of the HTM network equipped with a data encoder and multiple classifiers. The encoder transforms sensory information into binary representations, while the classifiers map the HTM output to the corresponding class labels (SDR classifier) and identify anomalies (anomaly classifier). The mixed-signal design of the SDR classifier has been developed in our previous work [6]. Thus, this work emphasizes the design and implementation of a single HTM region, which is equivalent to realizing the primary sensory region in the supra-granular layers of the neocortex. (The hierarchical structure of the HTM network has not been thoroughly studied yet.) Given an HTM region, there are two core operations that capture the spatial and temporal information of a given input, namely the spatial pooler and the temporal memory, which are discussed in the following subsections.
Figure 1. High-level architecture of the HTM system with three core units: data encoder, HTM network, and classifiers. The encoder transforms the input data into binary representations. The HTM algorithm learns spatial information and captures temporal transitions, while the classifiers map the HTM output to the corresponding class labels and identify anomalies.
2.1 Spatial Pooler

The spatial pooler in the HTM is responsible for extracting and learning the spatial patterns of the sequential data. Typically, the spatial pooler models an encoded sensory input, generated by the encoder, using a population of active and inactive mini-columns chosen through a combination of competitive Hebbian learning rules and homeostasis [27]. Typically, the number of active mini-columns is limited to (2-4)% of the total mini-columns in a given HTM region, resulting in the so-called sparse distributed representation (SDR). The SDR in HTM defines the underlying data structure and enables the crucial features of the algorithm, such as distinguishing the common features between inputs [28], learning sequences, and making simultaneous predictions [29]. Following the k-winner-take-all (k-WTA) computation principle, the top (2-4)% of mini-columns with the highest overlap scores are activated (become winners) and inhibit their neighbors. The output of the spatial pooler is a binary vector, which represents the joint activity of all mini-columns in the HTM region in response to the current input. The spatial pooler operation can be divided into three distinct phases: initialization, overlap and inhibition, and learning, discussed in our previous work [6] and briefly described below.

In the initialization phase (Algorithm 1, lines 2-5), which occurs only once, the mini-columns' connections to the input space, the synapses' permanences, and the boosting factors are initialized. Let S_p be an n_c × n_x array holding the proximal synaptic connections between n_c mini-columns and the n_x-dimensional input space. Similarly, let ρ_p be an n_c × n_x array that defines the permanences of the corresponding potential synapses in S_p. Given the j-th mini-column, its maximum number of potential synapses (n_sp) is defined by the non-zero elements in s_p (a row vector in S_p), whose indices are generated by a pseudo-random number generator and whose permanence values are uniformly initialized at random between '0' and '1'. Initializing the synapses is followed by setting the boosting factor of each individual mini-column to '1'. After the initialization, the overlap and inhibition phase (lines 8-11) starts, in which the feed-forward input is collectively represented by a subset of active mini-columns (winning mini-columns). Selecting the active mini-columns is done by counting each mini-column's active synapses that are associated with active bits in the input space, i.e., its overlap score (α). Mathematically, this is achieved by performing a dot product between the feed-forward input vector x^t at time t and the active-synapses array, as in line 9, where the active-synapses array is the result of an element-wise multiplication (denoted as ⊙) between S_p and ρ̄_p. Here, b^t denotes the boosting factor that regulates mini-column activities, and ρ̄_p is a binary permanence array indicating the status of each potential synapse, where '1' indicates a connected synapse and '0' an unconnected one. Upon completion of computing the overlap scores, each mini-column's overlap score is evaluated by comparing it to a threshold, α_th (line 10). The resulting vector (ᾱ^t) is an indicator vector representing the nominated mini-columns with high overlap scores.
Given an inhibition radius defined by ξ, and based on the mini-column overlap scores and the desired level of sparsity (η), n_w mini-columns will be selected to represent the input, as shown in line 11. After determining the winning mini-columns in Λ^t, the learning phase (lines 13-16) starts to update the permanence values of the winning mini-columns' synapses. The synapses' permanences are updated according to the Hebbian rule [30]. The rule implies that the synapses connected to active bits must be strengthened, increasing their permanence by P⁺_p, while those connected to inactive bits must be weakened, decreasing their permanence by P⁻_p, as in line 14, where Δρ_p is the change in the permanence array for all mini-columns given an input x^t, and λ denotes the sum of P⁺_p and P⁻_p. After adjusting the synapses' permanences, the boosting factor of each mini-column is updated according to the mini-column's time-averaged activity level (ā^t) and its activity level with respect to its neighbors (⟨ā^t⟩) [27].
ALGORITHM 1: HTM-Spatial Pooling

Input:  x^t ∈ R^{n_x} {0,1}, where x^t ⊂ X^t and X^t ∈ R^{n_x × n_n} {0,1}
Output: Λ^t ∈ R^{n_c} {0,1}                    /* n_c: number of mini-columns */
                                               /* n_x: input vector length    */
1:  // Initialization:
2:  S_ind ~ rand.pseudo, where S_ind ∈ N^{n_c × n_sp} {0, n_x}
3:  S_p[S_ind] ← 1, where S_p and ρ_p ∈ R^{n_c × n_x}
4:  ρ_p[S_ind] ~ rand.uniform[0, 1]
5:  b^t ∈ R^{n_c}, where ∀j: b^t[j] = 1
6:  repeat
7:    // Overlap and Inhibition:
8:    ρ̄_p ← I(ρ_p ≥ P_th)
9:    α^t ← b^t ⊙ [(S_p ⊙ ρ̄_p) · x^t]
10:   ᾱ^t ← I(α^t ≥ α_th)
11:   Λ^t ← kmax(ᾱ^t, η, ξ)                    /* kmax: k-WTA function */
12:   // Learning:
13:   if Learning == 'Enable' then
14:     Δρ_p ← (Λ^t)ᵀ ⊙ (S_p ⊙ ρ̄_p) ⊙ (λ x^t − P⁻_p)
15:     b^t ← e^{−γ(ā^t − ⟨ā^t⟩)}
16:   end
17: until t > n_n
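To make the three phases concrete, the following is a minimal NumPy sketch of Algorithm 1. It is an illustrative software rendering, not the hardware implementation: the inhibition radius ξ is simplified to a global k-WTA, and the potential-pool size, thresholds, learning rates, and activity-averaging window are assumed values rather than those used in the fabricated design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_c, n_sp = 512, 961, 64       # input width, mini-columns, potential synapses (assumed)
P_th, alpha_th, n_w = 0.5, 2.0, 40  # permanence/overlap thresholds, winners per step (assumed)
P_plus, P_minus, gamma = 0.05, 0.03, 0.3

# Initialization (lines 2-5): random potential pool S_p, permanences rho_p, unity boosting
S_p = np.zeros((n_c, n_x)); rho_p = np.zeros((n_c, n_x))
for j in range(n_c):
    idx = rng.choice(n_x, n_sp, replace=False)   # pseudo-random proximal pool
    S_p[j, idx] = 1.0
    rho_p[j, idx] = rng.uniform(0.0, 1.0, n_sp)
b = np.ones(n_c)
a_avg = np.zeros(n_c)                            # time-averaged activity

def spatial_pooler(x, learn=True):
    rho_bar = (rho_p >= P_th).astype(float)      # connected synapses (line 8)
    alpha = b * ((S_p * rho_bar) @ x)            # boosted overlap scores (line 9)
    nominated = alpha * (alpha >= alpha_th)      # overlap threshold (line 10)
    Lam = np.zeros(n_c)                          # global k-WTA (line 11, xi ignored)
    top = np.argsort(nominated)[-n_w:]
    Lam[top[nominated[top] > 0]] = 1.0
    if learn:                                    # Hebbian update over the potential pool (line 14)
        rho_p[:] = np.clip(rho_p + Lam[:, None] * S_p *
                           ((P_plus + P_minus) * x - P_minus), 0.0, 1.0)
        a_avg[:] = 0.99 * a_avg + 0.01 * Lam     # assumed averaging window
        b[:] = np.exp(-gamma * (a_avg - a_avg.mean()))  # homeostatic boosting (line 15)
    return Lam

Lam = spatial_pooler(rng.integers(0, 2, n_x).astype(float))
```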
2.2 Temporal Memory

The temporal memory in the HTM is mainly dedicated to learning time-based sequences and making predictions. The temporal memory operates at the cell level, specifically on the cells of the winning mini-columns. When a mini-column becomes active, at least one of its cells is selected to become active to represent the input contextually. This usually depends on whether the cells within the winning mini-columns are predicting the incoming input. If a winning mini-column has a predictive cell, that cell becomes active and inhibits the other cells within the same mini-column from being active. Otherwise, the joint activation of all cells within the mini-column represents the input; this is known as massive neuron firing, or bursting. Once a cell is activated, it forms lateral connections with the cells that were active in the previous time step. Patterns recognized by lateral connections lead to a slight depolarization of the cell soma (predictive state), subsequently predicting the upcoming events. Typically, the lateral connections are grouped into distal segments. A cell in HTM can have more than one distal segment, and this grants the cells the capability to predict more unique temporal patterns. The operation of the temporal memory can be divided into three phases: the mini-column evaluation, prediction, and learning phases, described in Algorithm 2. (Forming and pruning lateral connections are not shown in the algorithm to avoid complexity.)

During the mini-column evaluation phase (Algorithm 2, lines 4-15), the active cells within the winning mini-columns are determined. Let n_m be the number of cells in each mini-column, and let A^t ∈ R^{n_m × n_c} {0,1} be a binary array that represents the region's cell activity, where '1' indicates an active cell and '0' an inactive one. Similarly, let π^t be a binary array of the same size as A^t, in which the active bits refer to the predictive cells. The i-th cell within the j-th mini-column is set to be active if Λ^t[j] = 1 and the cell was in the predictive state in the previous time step, i.e., π^{t−1}[i,j] = 1.
Otherwise, bursting (all cells within the j-th mini-column are set to be active) takes place. In the second phase of the temporal memory, prediction (lines 17-35), the status of the cells for the next time step is evaluated. This is done by observing the distal segments' activation level (α). Let D_ij represent the group of distal segments that belong to the i-th cell within the j-th mini-column, where a segment in D_ij is indexed by d. If ρ̄^d_ij holds the active distal synaptic connections within the d-th segment, and S̄^d_ij holds its distal connections that are connected to active bits in A^t, the d-th distal segment is set to be an active segment if ||ρ̄^d_ij · S̄^d_ij|| is greater than the segment activation threshold, D_th. Otherwise, the segment is set to a matching state if it has at least one synapse connected to an active cell in A^t. Once the status of the distal segments is determined, the cells with active distal segments are set to the predictive state. It is important to mention here that occasionally cells in HTM may incorrectly predict patterns. In such a scenario, these cells need to have their synaptic strength reduced to lower the likelihood of incorrect predictions (as in lines 19-23). After evaluating the cells' segments, their synaptic connections are updated, which occurs during the learning phase (lines 38-47).

As aforementioned, learning in HTM follows the Hebbian rule and is applied solely to active cells. Given a^t_ij ∈ A^t, where a^t_ij = 1 and the cell has an active segment, all the synaptic connections that are laterally connected to previously active cells are potentiated, while those that are connected to inactive cells are depressed. This implies that the permanences of the distal synaptic connections, ρ^d_ij, are increased by P⁺ when they are connected to active cells; otherwise, they are decreased by P⁻.
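The evaluation and prediction phases can likewise be sketched in a few lines of NumPy. This sketch assumes one merged ("universal") distal segment per cell, as adopted in the hardware design described later; the learning phase and matching-segment bookkeeping are omitted, and all sizes and thresholds are illustrative assumptions.

```python
import numpy as np

def temporal_step(Lam, pi_prev, D_perm, D_syn, D_th=10, P_th=0.5):
    """One temporal-memory step over an n_m x n_c cell array.
    D_perm, D_syn: (n_m, n_c, n_m*n_c) distal permanences and potential-synapse
    masks, flattened over presynaptic cells (one merged segment per cell)."""
    n_m, n_c = pi_prev.shape
    A = np.zeros((n_m, n_c))
    for j in np.flatnonzero(Lam):          # winning mini-columns only
        if pi_prev[:, j].any():
            A[:, j] = pi_prev[:, j]        # predicted cells become active
        else:
            A[:, j] = 1.0                  # bursting: all cells fire
    # Prediction: overlap between connected distal synapses and active cells
    connected = (D_perm >= P_th) * D_syn
    overlap = connected.reshape(n_m * n_c, -1) @ A.flatten()
    pi = (overlap.reshape(n_m, n_c) >= D_th).astype(float)
    return A, pi

rng = np.random.default_rng(0)
n_m, n_c = 4, 100                          # illustrative region size
Lam = (rng.random(n_c) < 0.04).astype(float)
pi_prev = np.zeros((n_m, n_c))
D_syn = (rng.random((n_m, n_c, n_m * n_c)) < 0.05).astype(float)
D_perm = rng.random((n_m, n_c, n_m * n_c)) * D_syn
A, pi = temporal_step(Lam, pi_prev, D_perm, D_syn)
```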
3 SYSTEM DESIGN AND METHODOLOGY

Fig. 2 shows the high-level architecture of the developed HTM network, including the core units of the SSR communication scheme. Essentially, there are √n_c × √n_c mini-columns with n_m cells each constituting the HTM region. (The number of mini-columns assumed in this work is always a power of two, 2^k, where k is an integer; the feasibility of scaling the HTM network beyond 1024 mini-columns can be enabled by adopting the slicing approach proposed in our previous work [17].) Unlike the mathematical description of the HTM, which assumed a 2D representation of the region for simplicity, in the hardware design we consider a 3D architecture of the region to cut down the resources and to simplify the communication scheme considerably. The HTM region is integrated with a main control unit (MCU), and an arbiter and selector. The MCU is dedicated to controlling the data flow, generating the necessary control signals, and bridging the region to the input data encoder or other regions in the hierarchy, while the arbiter and selector are responsible for regulating data sharing among the cells within the region.
ALGORITHM 2: HTM-Temporal Memory

Input:  Λ^t ∈ R^{n_c} {0,1}                    /* n_c: number of mini-columns */
Output: A^t ∈ R^{n_m × n_c} {0,1}              /* n_m: number of cells        */
1:  zeros_cnt = 0
2:  repeat
3:    // Mini-columns evaluation:
4:    for j ← 1 to n_c do
5:      if Λ^t[j] == 1 then
6:        for i ← 1 to n_m do
7:          if π^{t−1}[i, j] == 1 then
8:            A^t[i, j] ← 1
9:          else
10:           zeros_cnt ← zeros_cnt + 1
11:         end
12:       if zeros_cnt == n_m then
13:         A^t[i, j] ← 1, ∀i                   // bursting
14:       zeros_cnt = 0
15:   end
16:   // Prediction:
17:   for j ← 1 to n_c do
18:     for i ← 1 to n_m do
19:       if Λ^t[j] == 0 and π^{t−1}[i, j] == 1 then
20:         for d ← 1 to n_d do
21:           if D[i, j][d].MatchingSegment then
22:             Δρ[i, j][d] ← −(A^{t−1} ⊙ S[i, j][d]) × P⁺   // punish wrong predictions
23:         end
24:       else
25:         for d ← 1 to n_d do
26:           ρ̄[i, j][d] ← I(ρ[i, j][d] ≥ P_th)
27:           S̄[i, j][d] ← A^t ⊙ S[i, j][d]
28:           α^t ← ||S̄[i, j][d] · ρ̄[i, j][d]||
29:           if α^t ≥ D_th then
30:             D[i, j][d].ActiveSegment ← 1
31:             π^t[i, j] ← 1
32:           else if ||A^t · S[i, j][d]|| > 0 then
33:             D[i, j][d].MatchingSegment ← 1
34:         end
35:   end
36:   // Learning:
37:   for j ← 1 to n_c do
38:     if Λ^t[j] == 1 then
39:       for i ← 1 to n_m do
40:         if A^t[i, j] == 1 then
41:           for d ← 1 to n_d do
42:             if D[i, j][d].ActiveSegment == 1 then
43:               Δρ[i, j][d] ← λ(A^{t−1} ⊙ S[i, j][d]) − P⁻
44:             end
45:         end
46:       end
47:   end
48: until t > n_n
Here, the interaction among cells is based on the SSR, as the cells' activity is sparse in nature (approximately 4.2%). At a high level, the system works as follows: when the MCU establishes a connection with the data encoder, which is done through a handshake protocol, it commences receiving the encoded packets. The received packets are routed through the H-Tree to all the region's mini-columns. Here, we used the H-Tree structure to reduce the parasitic capacitance and to minimize the power consumption [31] of the developed system.
Figure 2. High-level architecture of the HTM network, including the HTM region with √n_c × √n_c mini-columns of n_m cells each, a main control unit (MCU), and an arbiter and selector.

However, there are two H-Trees. One is a digital bus (34-bit width; n lines are used by the cells, where n = n_c × n_m) driven by the MCU and the cells to share data. The other (not shown in Fig. 2) is an analog line that enables the mini-columns to compete against each other for input representation. When the winning mini-columns, and then the cells, are selected, the arbiter and selector are used to broadcast information about the current/previous active cells and their locations in the region so that lateral connections are formed and future predictions are made. In the following subsections, more details about each core unit of the HTM network are provided, while the communication scheme is discussed in a separate section.

3.1 HTM Mini-column

The mini-columns in HTM are responsible for capturing the spatial patterns of the feed-forward inputs. The HTM mini-column circuit, developed in our previous work [6], is depicted in Fig. 3-(left). Succinctly, the circuit comprises a peripheral unit, a proximal unit, and a WTA cell. In the peripheral unit, the proximal connections are generated and connected to the input space. The proximal unit and the WTA cell hold, respectively, the proximal connections' permanences and a contesting unit that enables each mini-column to compete with its neighbors for the input representation. In this work, the input to the mini-column is generated by the HTM random scalar encoder [32], which encodes every scalar value of the time-series data into a high-dimensional binary vector sorted into small 31-bit packets, in order to minimize data movement and the required storage units. Sequentially, each packet is fetched to the mini-columns and stored in the Addr_Reg.
When the input packet is stored in the Addr_Reg and the LFSR generates an address for a location in the received packet, a matching score is stored in the synapses' registers, which are modeled by an n_sp × 1 serial-in-parallel-out shift register. Once all inputs are received, the outputs of the synapses' registers are presented to the memristive crossbar word-lines, where the proximal synapse permanences are stored. The input voltages to the crossbar are converted into currents through the memristors, and the output is collected at the crossbar bit-line. The output of the crossbar, which modulates the mini-column overlap score as a current, is then boosted. Boosting is done via the use of a sense memristor (M_s). Upon completion of computing the overlap score (V_αj ≡ α_j), its value, which is sampled by the sense memristor, is presented to a WTA circuit (a detailed description of the WTA is provided in [6]). The WTA performs a kmax operation on V_αj, ∀j, followed by thresholding, to generate the final j-th mini-column output (Λ_j), as given in (1) and (2):

$$V_{\alpha_j} = \frac{\sum_{i=1}^{n_s} g_i V_i}{g_s + \sum_{i=1}^{n_s} g_i} \quad (1)$$

$$\Lambda_j = \begin{cases} 1, & V_{x_j} > V_{th}, \text{ where } V_{x_j} = f(V_{\alpha_j}) \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where V_i denotes the i-th input voltage, g_i refers to the conductance of the i-th memristor, and g_s is the conductance of the sense memristor. Once the final output of each mini-column is generated, the learning phase of the spatial pooler starts. As alluded to earlier, the learning in HTM follows the Hebbian rule [30], which is implemented using Ziksa [33], as discussed in [6]. Then, the mini-columns' status is relayed to their associated cells to start the next phase, the temporal memory. Although the cells are encapsulated within the mini-columns and are considered a part of them, for the sake of clarity and simplicity we treat them separately.

Fig. 3-(right) demonstrates the process of computing the overlap score and tuning the proximal synaptic connections for a given mini-column while receiving the feed-forward input shown in Fig. 3-(a). Since the mini-column has a large number of proximal connections, for the purpose of demonstration we randomly picked only two. The changes in proximal connection permanence for the HTM-SW and HTM-HW models are shown in Fig. 3-(b) and 3-(c), respectively. Here, it can be observed that changes in a synapse's permanence below the permanence threshold, P_th, have no impact on the overlap score in the HTM-SW model, unlike the HW model, where there is no explicit threshold blocking the memristors from contributing to the overlap score. Furthermore, the change in the HTM-HW model's synaptic permanences (memristor conductances) tends to be non-linear compared to the HTM-SW counterpart. However, selecting a memristor device with a high conductance range and switching dynamics, as required by the HTM theory, makes the synapses with high conductance states dominate the changes in the overlap level (see Fig. 3-(d)). This eventually results in almost analogous overlap score variation for both the SW and HW models. (The overlap scores for the HTM-HW and HTM-SW models are not reported to scale, for the purpose of comparison.)
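As a sanity check of (1) and (2), the crossbar read-out can be emulated in a few lines. The conductance bounds, sense conductance, and threshold below are illustrative assumptions (loosely based on the 150 kΩ - 10 MΩ range in Table 1), not the calibrated circuit values.

```python
import numpy as np

def wta(V_alpha, k=40, V_th=0.5):
    """Idealized k-WTA with thresholding, per (2)."""
    Lam = np.zeros_like(V_alpha)
    top = np.argsort(V_alpha)[-k:]           # k highest overlap voltages
    Lam[top[V_alpha[top] > V_th]] = 1.0      # keep only those above V_th
    return Lam

rng = np.random.default_rng(1)
g = rng.uniform(1e-7, 6.7e-6, (961, 64))     # 10 MOhm .. 150 kOhm devices
V_in = rng.integers(0, 2, (961, 64)) * 1.2   # 1.2 V on matched inputs
g_s = 1.25e-5                                # assumed sense-memristor conductance
# Crossbar read-out of (1), evaluated for every mini-column at once
V_alpha = (g * V_in).sum(axis=1) / (g_s + g.sum(axis=1))
print(int(wta(V_alpha).sum()))               # number of winning mini-columns
```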
Figure 3. (left) The circuit diagram of the HTM mini-column [6]. The circuit is composed of a peripheral unit to generate the proximal synaptic connections, a proximal unit to hold the permanences of the synaptic connections, and a WTA cell to enable mini-columns to compete with each other for input representation. (right) The impact of synaptic permanence modulation (denoted as Perman. for HTM-SW) on the mini-column overlap score as the proximal synapses receive feed-forward input.

3.2 HTM Cell

The cells in HTM enable the network to capture temporal patterns, modeling the input representations within their context and predicting upcoming events. The HTM cell circuit developed in this work is composed of a synaptogenesis unit, a distal segments unit, and current comparators, as shown in Fig. 4. The synaptogenesis unit is responsible for forming and pruning distal synaptic pathways with the previously active cells. (One may share the synaptogenesis unit between multiple cells of the same mini-column to cut down resources and reduce power consumption, but at the expense of increased latency.) The distal segments unit holds the permanence values, which describe the growth level of the individual distal synaptic pathways, while the current comparators are utilized to evaluate the segments' activation levels and to determine their states (active or matching) accordingly.

Initially, the cells start with no distal synapses. Once the HTM begins processing the incoming patterns, the distal synapses start forming in the synaptogenesis unit. Consider an HTM region arranged into a 3D space, where the x and y axes index the mini-columns in the region and the z axis indexes the cells. When the region receives an input, this causes the activation of a population of cells within the region, referred to in this context as A^t_D. If a^t_xyz ∈ A^t_D, where a^t_xyz is an active cell located at (x, y, z), then a^t_xyz will form connections with the active cells in A^{t−1}_D. Let's assume that the number of active cells in A^{t−1}_D is 4.2% of n_c.
Then, if n_c = 961, ≈40 cells will be active at each time step, assuming no bursting takes place. The active cell at time t establishes connections with the 40 cells that were active at t−1 by forming a distal segment. A cell in HTM can have around 10 or more distal segments, and this enables the network to learn the temporal transitions in sequences. Recall that forming and pruning distal connections in hardware platforms requires high interconnect dynamics, which are lacking in most existing platforms, especially ASIC designs; hence, the virtual description of the synapse became a common approach [22], [34]. However, describing the synapses virtually, in most cases, demands high memory usage to store the sender/receiver addresses. For instance, in HTM's context (assuming there are 961 mini-columns in the region with 4 cells each), if we assume that the address of each cell is represented with 12 bits and the distal connection permanence is represented with 16 bits, having 10 segments with 60 distal connections each in every cell costs 16.8 kb of memory per cell and more than 64.57 Mb for the entire network. Let's assume that the addresses and the permanences are stored in a DRAM implemented in a 45 nm process. If the energy cost per 32 bits of off-chip memory access is 640 pJ [35], having 40 active cells at each time step leads to a total energy consumption of 15.36 µJ (first-order approximation). Running the system at 8 MHz can result in a power consumption of 122.88 W just to access the memory, which is a prohibitive amount of power, especially for edge devices with a limited power budget.
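The figures above follow from simple arithmetic; a quick check (assuming each 28-bit synapse record is fetched as one padded 32-bit DRAM word) is:

```python
addr_bits, perm_bits = 12, 16
segments, syn_per_seg = 10, 60
cells = 961 * 4                              # mini-columns x cells per mini-column

bits_per_cell = segments * syn_per_seg * (addr_bits + perm_bits)
print(bits_per_cell / 1e3)                   # 16.8 (kb per cell)
print(cells * bits_per_cell / 1e6)           # ~64.58 (Mb for the network)

words_per_cell = segments * syn_per_seg      # one 32-bit word per synapse record
energy = 40 * words_per_cell * 640e-12       # 40 active cells, 640 pJ per access
print(energy * 1e6)                          # 15.36 (uJ per time step)
print(energy * 8e6)                          # 122.88 (W at an 8 MHz update rate)
```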
One possible solution to overcome the above challenge is to reduce the memory usage in each cell. This can be done by modeling the synaptic permanences using analog memristors and leveraging the randomness in forming the distal synaptic connections to generate the addresses rather than storing them. A possible approach is to generate the distal segment addresses through the use of LFSRs. To demonstrate this, let's assume that cell c1 is currently active and trying to establish a connection with another cell, c2, which was active in the previous time step. Cell c1 will receive a packet that holds c2's location in the 3D space. Upon receiving the address, c1 begins the matching process, in which the cell identifies whether there is a possibility to establish a distal connection with c2. The matching process starts by enabling the X-LFSR to generate 16 addresses within one clock cycle (the cells' LFSRs are clocked at 128 MHz, while the system clock is 8 MHz). The same is applied for the Y-LFSR. While the LFSRs generate their random values, the cell translates any matches between the generated random numbers and the received Cartesian locations into flags stored in 4-bit registers, which are later decoded by the X-DMUX and Y-DMUX. Here, a match means a distal connection is established between the two cells. It is important to mention that following such an approach makes the process of forming distal connections probabilistic, whereas in the HTM network it is deterministic. However, in HTM, the cells that are currently active form connections with a subset (typically 50%) of the cells that were active in the previous time step, and in our design this is achieved naturally through the adopted probabilistic approach. Now, in order to estimate the likelihood of matching between the distal segment addresses (randomly generated) and the addresses of the active cells, (3) can be used, where n_sd is the maximum number of synapses in a distal segment.
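A behavioral sketch of the matching process described above is shown below. The 5-bit register width conveniently spans the 31 x/y coordinates of a 961-mini-column region, but the width and taps are illustrative assumptions, not those of the fabricated LFSRs.

```python
def lfsr(seed, taps=(5, 3), width=5):
    """Fibonacci LFSR over `width` bits; yields one pseudo-random address per
    shift. Taps (5, 3) give a maximal-length 5-bit sequence."""
    state = seed
    while True:
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)
        yield state

def matches(seed_x, seed_y, target_xy, n_addr=16):
    """True if the (x, y) location of a previously active cell appears among
    the first n_addr addresses of the cell's X- and Y-LFSRs, i.e., a synthetic
    distal synapse is formed."""
    gx, gy = lfsr(seed_x), lfsr(seed_y)
    xs = {next(gx) for _ in range(n_addr)}
    ys = {next(gy) for _ in range(n_addr)}
    return target_xy[0] in xs and target_xy[1] in ys

print(matches(0b10011, 0b01101, (7, 21)))
```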
Distal SegmentsUnit V DD Match Seg
Current Comparators
LearningThreshold ActiveThreshold
SynaptogenesisUnit
Distal Segemts CNTXYZ LU Active Seg.MFAct PredWin dis_Ov2 5 5 C u rr en t C o m pa r a t o r Distal connectionspermanence16 16 Synapse CircuitSynapse CircuitSynapsecircuitY-DMuxX-DMux FEn8 4Z-Buffer RLPFCF BF I burst
Figure 4. The circuit diagram depicting the HTM cell with a synaptogenesis unit, which can generate or prune distal segments; a distal dendritic segment to hold the permanence values of the distal connections; and current comparators to evaluate the distal segment activation level and, consequently, the cell status (predictive or unpredictive).

Let the distal segment size for a given cell be 256. Given 961 mini-columns with 40 active at each time step, there is a 0.847 likelihood that at least 20% of the generated random addresses match those of the previously active cells, as estimated by (3). This likelihood can be increased significantly beyond 0.95 when the segment size is increased, as shown in Fig. 5. (Increasing the distal segment size costs more cycles to generate more random addresses and additional memristor devices for each newly added synapse.)

$$P_{match} = \sum_{i=10}^{n_w} \frac{\binom{n_w}{i} \binom{n_c - n_w}{n_{sd} - i}}{\binom{n_c}{n_{sd}}} \quad (3)$$
Figure 5. The matching probability between a distal segment's addresses generated by LFSRs and the addresses of the active cells in the previous time step, for various segment sizes.
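Equation (3) can be evaluated directly with exact integer arithmetic; the sketch below assumes the lower summation limit of i = 10 matched addresses that appears in (3).

```python
from math import comb

def p_match(n_c=961, n_w=40, n_sd=256, i_min=10):
    """Hypergeometric tail of (3): probability that at least i_min of the n_sd
    generated addresses land on the n_w previously active cells."""
    return sum(comb(n_w, i) * comb(n_c - n_w, n_sd - i)
               for i in range(i_min, n_w + 1)) / comb(n_c, n_sd)

print(p_match())             # with a 256-synapse segment
print(p_match(n_sd=400))     # larger segments push the probability higher
```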
After finishing the matching process and activating the X-DMUX and Y-DMUX, all the possible combinations of 16 X-addresses and 16 Y-addresses are achieved through the AND gate array. A logic '1' at the output of an AND gate, say gate number 5, may indicate an active cell at the location (x = 1 and y = 5). The output of the AND gate enables the corresponding 'green' 2-bit register to load the Z-address, and this represents the cell's distal synapse that is currently connected to an active cell at time t−1, whereas the previously formed distal synapses are stored in the 'blue' 2-bit registers. Once the registers are loaded, they are compared, and the results are relayed to the distal segment memristors (only when evaluating the cellular activities detected by the distal segment). For the distal segments unit, this cell architecture leverages the union property of the SDR representation to considerably reduce the cell architecture complexity. The main concept behind the union property is storing several patterns using one representation. This can be translated into having one universal distal segment for each cell rather than multiple segments. The universal segment grows as the cell learns more temporal information. It is important to mention here that merging the segments can increase the possibility of false triggering of cell segments and incorrect predictions. However, this is less likely to happen if we limit the number of patterns (M) a segment can learn, while setting the number of mini-columns and cells to be large enough. For instance, in this work we used 961 mini-columns with 4 cells each; if we store 30 patterns in a segment and set the matching threshold for any two given patterns to 5, the probability of a false match, calculated using (4) [36], is negligible.

$$P_{fp} = \left[ 1 - \left( 1 - \frac{n_w}{n_c} \right)^M \right]^{n_w} \quad (4)$$
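Under the network sizes used here, (4) evaluates to a vanishingly small false-match probability; a one-line check:

```python
def p_fp(n_c=961, n_w=40, M=30):
    """Union false-match probability of (4) for M patterns per merged segment."""
    return (1.0 - (1.0 - n_w / n_c) ** M) ** n_w

print(p_fp())   # on the order of 1e-6 for M = 30
```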
Figure 6. (left) Competitive circuit that enables the cells within one mini-column to interact with each other when a massive firing activity takes place in the mini-column. (right) Waveform diagram illustrating the competitive circuit selecting the best matching cell.

The output current collected at the distal segment bitline is received by the current comparator unit. The current then gets mirrored to be compared with two reference currents: the active threshold and the learning threshold. (The bursting mini-columns' cells generate additional current that should add up to the segment current during the evaluation step; however, the bursting mini-columns are evaluated globally, at the region level, and their contribution to the segment activation is made through I_burst.) If the segment current is more than the active threshold, the segment is set to an active state, and consequently the cell state changes to predictive for the next time step. On the contrary, a current less than the active threshold and more than the learning threshold marks the cell as a matching cell. A matching cell has a high probability of being selected to represent the input when bursting takes place. It is important to mention here that the previously discussed operations are carried out within the cells, but running the temporal memory successfully also requires the cells within the mini-columns to interact with each other to identify whether bursting is necessary. If bursting takes place in a mini-column, all the cells within the mini-column are set to be active and one cell is selected to learn the current input pattern. Typically, this is done either by selecting the best matching cell or the least used cell. The former occurs only when a cell has a sufficient number of potential synapses that are connected to active cells in the previous time step, i.e., the cell has a matching segment. Choosing the best matching segment involves selecting the cell with the highest matching level (distal current). This implies mirroring all cells' output currents to another unit, namely the competitive circuit (a modified current-based winner-take-all circuit originally proposed in [37]), so that the cell with the highest output current is chosen (see Fig. 6-(left)). In the case when there are no matching segments, the least used cell is chosen as the winning cell. Selecting the least used cell is done by picking the cell with the least number of distal segments. Since this implementation deals with one universal merged segment, a counter in the cell is used to monitor the flags of added segments and consequently the number of merged segments in each universal one. Fig. 6-(right) demonstrates the operation of the cells' competitive circuit. Here, three cells are competing to select the best matching cell, and two scenarios are considered. In the first (interval 0-10 µs), all cells have high overlapping currents (all MFlags = '0'), so they are all in competition, and the cell with the highest overlapping current is selected as the winner. In the second scenario (interval 10-40 µs), one cell's current is less than the 'Active Threshold', and for this reason it is excluded from the competition. This is accomplished by switching its series transistor T to an ON state, which eventually blocks that cell's current from being mirrored to the WTA circuit.

4 SYNTHETIC SYNAPSES REPRESENTATION
The cells in the HTM network interact with each other during the temporal memory phase. This interaction is essential to enable the network to predict upcoming events. As alluded to earlier, the cells' interaction is enabled through the distal segments, which are established and evolve while learning temporal information. In hardware, this translates into thousands of interconnects that continuously change in their conductivity level and locations. Because interconnects in VLSI systems are rigid in nature and do not support this level of reconfigurability, memory units can be used to virtually formulate these connections and to describe their strength, as in [24], [34]. Although such an approach is effective, as it endows the network with the necessary dynamics to learn spatial and temporal information, it does not suit edge devices, which have stringent area and energy constraints. Thus, we present the SSR communication scheme, which relies heavily on random generators and memristor devices rather than conventional memory units to form synaptic connections and to define their growth levels. This results in significant savings in terms of resources and energy consumption. Two aspects associated with the SSR are addressed in this work: forming synaptic connections using LFSRs (discussed earlier in Section 3.2) and controlling the data transfer among cells through regulating access to the H-Tree bus. Considering the same HTM system, with A^{t−1} active cells in the previous time step and A^t active cells in the current time step, during the temporal memory phase every cell in the network with a strong enough connection to A^{t−1} cells can be depolarized for the next time step and become predictive. The challenge here is how to transfer the A^{t−1} cells' addresses to all other cells in the network efficiently. Let all the mini-columns with active cells at time t−1 place a request at the input of the outgoing tri-state gates (see Fig. 2). Then, each set of tri-states belonging to the same row is activated simultaneously through the selector. When a row is selected, all its tri-state buffers associated with the mini-columns are activated, allowing the mini-columns to send requests to the arbiter and to receive acknowledgements. The arbiter circuit is shown in Fig. 7-(left). It comprises buffers, a series of nMOS pass transistors, and a feedback circuit. The buffers are used to store the simultaneous requests from the selected mini-columns. The series of pass transistors is used to monitor the status of the individual mini-column requests, whereas the feedback circuit is used to acknowledge the mini-columns after their requests are served. In Fig. 7-(right), a waveform diagram illustrates the operation of the arbiter, selector, and other units in the developed system while processing information sent from a row with 5 mini-columns. Initially, all the winning mini-columns' (in this example: 2, 3, and 5) requests are directed toward the arbiter and stored in the buffers (DFFs). When DFF-3, for instance, receives its request, it waits in a queue until the preceding request is served. Once that request is served, the voltage at the drain of the corresponding pass transistor will be high.
Figure 7. (left) A synthetic synapses representation (SSR) arbiter circuit, consisting of buffers to store the simultaneous requests from the winning mini-columns, a series of nMOS pass transistors to monitor the status of the individual mini-columns' requests, and a feedback circuit to clear mini-column requests once served. (right) Waveform diagram demonstrating part of the SSR operation while processing several concurrent requests sent from mini-columns located within the same row of the HTM region (request signals that remain low throughout are not shown).

This will trigger the feedback circuit to send an ack signal to mini-column 3, which in turn clears its request and broadcasts the address of its active cell(s). Serving the requests of all the active cells in the HTM region leads to a latency given by:

$$t_{cc} = \sum_{i=1}^{\sqrt{n_c}} \Big( \sum_{j=1}^{\sqrt{n_c}} \Lambda[i][j] + 1 \Big) \quad (5)$$

Recall that the SSR conveys the same concepts as the AER and the enhanced AER, but it is designed to serve intra-chip communication while offering the following advantages:
• In AER, the neuron potential duration must be ≈ …
• The enhanced AER demands memory units on both sides, sender and receiver, to hold the neuron addresses that are virtually connected (connecting 32×32 cells requires 20 Mb of RAM [24]). For a sparse network like the HTM, this is overwhelming in terms of memory usage. In the SSR, however, the addresses are generated rather than stored. This provides two advantages: smaller storage units are used and random selection is achieved.
• The SSR is synchronous, and its capacity, the maximum rate of sample transmission (considering the worst-case scenario and the adopted network architecture), is 4 MSamples/sec. For AER, the capacity for an SNN of approximately the same network size is 2.5 MSamples/sec [38].
• The SSR uses a priority arbiter, which applies a queuing mechanism to access the H-Tree (or channel) bus, whereas AER utilizes an arbitration mechanism to access the channel. The latter is known to lengthen the communication cycle period and reduce channel capacity [38].
• AER is deemed an effective approach for inter-chip communication, where neuronal information is communicated by means of encoded events. At the targeted destination, the encoded events are typically decoded and routed to the proper accessible neurons. The encoder size here is highly dependent on the number of neurons, whereas in the SSR the decoding process complexity is defined by the number of synapses associated with the targeted neurons. This property is extremely beneficial for sparse networks like the HTM.
• The enhanced AER offers better flexibility in updating the synaptic connections individually. The opposite is true for the SSR, in which changing the seeds of the LFSRs causes the cell to form an entirely new set of synaptic connections.
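The row-serial service latency of (5) is straightforward to evaluate; the sketch below assumes a 31 × 31 winner map with ~4.2% activity.

```python
import numpy as np

def t_cc(Lam):
    """Cycles to serve all requests per (5): one cycle per row plus one cycle
    per winning mini-column in that row."""
    Lam = np.asarray(Lam, dtype=int)        # (sqrt(n_c) x sqrt(n_c)) winner map
    return int((Lam.sum(axis=1) + 1).sum())

rng = np.random.default_rng(2)
winners = rng.random((31, 31)) < 0.042      # ~4.2% sparse activity
print(t_cc(winners))
```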
5 EXPERIMENTAL METHODOLOGY
In order to assess the performance of the proposed mixed-signal HTM system, two models are created. The first is a golden model (HTM-SW) that runs the HTM system without any constraints; this model is used to find the optimal network performance for a given task. The second model (HTM-HW) is an emulation of the hardware design under predefined circuit constraints. Here, the circuit constraints are obtained after the individual components of the design are simulated and verified within the Cadence Virtuoso environment. Prior to that, all the digital units are verified for functionality in Cadence SimVision. During simulation, the supply voltage is set to 1.2 V and the system clock runs at 8 MHz. The system also has a 128 MHz high-speed clock to drive the LFSRs of the cells. When it comes to emulating the strength of the HTM synaptic connections, a representative non-linear Verilog-A memristor model [39] with a modified Z-window function [6] is utilized. (The memristor device non-idealities considered during the simulation are 10% cycle-to-cycle variability in memristor resistance and device-to-device write variation.) The device conductance changes as a function of the state variable, w, as described in (6) and (7), where D is the device thickness, G_on and G_off define the memristor conductance limits, k_off, k_on, α_on, and α_off are constants, and v_off and v_on are the memristor threshold voltages. In (8), τ, δ, k, and p are constants that control the window function shape; the nominal values used in this work are τ = 200, δ = 0.5, k = 1, and p = 4. Emulating the synaptic behavior of HTM using memristors turns out to be challenging. This is because the synapses in HTM are binary in nature, i.e., they exhibit the same properties if they are above the permanence threshold, regardless of the synapse's growth level, and vice versa. In 2017, Jiang et al. proposed a memristor device, to implement the k-nearest neighbor algorithm, that exhibits the properties required for HTM [40]. Fig. 8 illustrates the experimental behavior of the physical device as a function of the applied pulses, fitted to the memristor model. Here, it can be observed that the memristor has minor changes in conductance level on either side of the permanence threshold (highlighted in green), while the changes are extreme in the middle. To some extent, this captures the binary nature of the ideal synapse in HTM. It is important to mention here that, in order to optimize the HTM system performance and maintain low power consumption, the following assumptions were made: 1) the memristor device exhibits semi-symmetrical behavior when switching from low/high conductance to high/low; 2) the memristor device offers fast switching speed and a high conductance range. Table 1 shows the device parameters used for the proximal and distal synaptic connections.

$$G_{mem} = \frac{w}{D} \, G_{on} + \left( 1 - \frac{w}{D} \right) G_{off} \quad (6)$$
Table 1. The memristor device parameters used in the mini-column and cell designs.

Parameter                 | Value [mini-column] | Value [cell]
Proximal memristor range  | 150 kΩ - 10 MΩ      | … - 10 MΩ
Memristor threshold       | ± …                 | ± …
…                         | … Ω - 80 kΩ         | …
Figure 8. Fitting the memristor model to the physical device behavior while modulating the device conductance with a train of pulses.

$$\frac{\Delta w}{\Delta t} = \begin{cases} k_{off} \left( \frac{v(t)}{v_{off}} - 1 \right)^{\alpha_{off}} f_z(w), & 0 < v_{off} < v \\ 0, & v_{on} < v < v_{off} \\ k_{on} \left( \frac{v(t)}{v_{on}} - 1 \right)^{\alpha_{on}} f_z(w), & v < v_{on} < 0 \end{cases} \quad (7)$$

$$f_z(w) = k \left[ 1 - \left( \frac{w}{D} - \delta \right) \right]^p e^{\tau \left( \frac{w}{D} - \delta \right)^p} \quad (8)$$
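For reference, (6)-(8) can be exercised with a simple forward-Euler update. Only τ, δ, k, and p follow the stated nominal window constants; the rate constants, thresholds, and pulse parameters below are illustrative assumptions.

```python
import numpy as np

def f_z(w, D=1.0, tau=200.0, delta=0.5, k=1.0, p=4):
    """Modified Z-window of (8)."""
    x = w / D - delta
    return k * (1.0 - x) ** p * np.exp(tau * x ** p)

def dw_dt(v, w, D=1.0, k_off=100.0, k_on=-100.0, a_off=3, a_on=3,
          v_off=0.3, v_on=-0.3):
    """State-variable dynamics of (7); no change below threshold."""
    if v > v_off > 0:
        return k_off * (v / v_off - 1.0) ** a_off * f_z(w, D)
    if v < v_on < 0:
        return k_on * (v / v_on - 1.0) ** a_on * f_z(w, D)
    return 0.0

def g_mem(w, D=1.0, G_on=1 / 150e3, G_off=1 / 10e6):
    """Conductance map of (6)."""
    return (w / D) * G_on + (1.0 - w / D) * G_off

w = 0.5
for _ in range(100):                 # a train of 1 V, 1 us potentiating pulses
    w = float(np.clip(w + dw_dt(1.0, w) * 1e-6, 0.0, 1.0))
print(g_mem(w))
```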
6 RESULTS AND DISCUSSION

The prediction accuracy of the proposed HTM system is evaluated using real-world streaming data. Given an input dataset of length n_n, where each data point presented to the HTM system at time t is represented by y_t and the corresponding predicted value is given by ŷ_t, the mean absolute percentage error (MAPE) can be computed as in (9):

$$MAPE = \frac{\sum_{t=1}^{n_n} |y_t - \hat{y}_t|}{\sum_{t=1}^{n_n} |y_t|} \quad (9)$$

Fig. 9 illustrates a snapshot of the Hot-Gym dataset [41], the power consumption in a gym, over a small period. The power consumption is recorded every hour for 4 months (total sample count = 4390). Here, the HTM system is used to predict the power consumption for the next 2 and 5 hours. Initially, the golden software model, HTM-SW, is used for the prediction; then, the same prediction is made using the HTM-HW model. (The HTM-HW model is also benchmarked using other datasets, such as NYC-Taxi [42]; the achieved MAPE for the 2nd and 5th order predictions is 0.0996 ± … and …, respectively.) Fig. 10-(a) shows the accumulated MAPE recorded every 250 samples. It can be seen that the initial value of the MAPE is quite high, but over time it decreases as the network learns patterns and uses the acquired knowledge to make valid predictions in the future. The overall MAPE of the software model, assuming the first 500 samples presented to the network are dedicated to learning, is calculated to be 0.154 ± … .

Figure 9. A snapshot of the power consumption of the Hot-Gym dataset [41], recorded every hour over approximately 4 days.
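A direct implementation of (9); note that it normalizes the summed absolute error by the summed signal magnitude, rather than averaging per-sample percentages:

```python
import numpy as np

def mape(y, y_hat):
    """Prediction error per (9)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.abs(y - y_hat).sum() / np.abs(y).sum()

print(mape([10.0, 12.0, 9.0], [11.0, 12.5, 8.0]))  # ~0.0806
```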
The latency is measured as the time required for the HTM network to process an SDR input generated by the encoder. In this context, HTM processes SDR inputs of the Hot-Gym dataset, where each input is encoded as a 512-bit binary vector. The spatial pooler and temporal memory phases here are performed simultaneously and in a pipelined fashion to minimize the latency, which is estimated to be 11.64 µs. (Spatial pooler and temporal memory operate simultaneously when the H-Tree bus is exploited by either of them.) Fig. 10-(b) shows the latency of the CMOS digital HTM (system clock = 100 MHz) and the proposed mixed-signal HTM (system clock = 8 MHz) as a function of the network size, given by the number of mini-columns. One can notice that the latency of the digital HTM is always higher than that of the mixed-signal counterpart. This can be attributed to several reasons. The first is the need for an initialization phase in the digital HTM design to set the synaptic connections' permanences, particularly the proximal synapses, prior to receiving any input. The initialization of the synaptic connections' permanences is achieved for free in the mixed-signal design, as the memristors after the formation process have random conductances with a Gaussian distribution [43]. Second, tuning the synaptic connections, proximal or distal, is performed simultaneously at the cell and mini-column levels, but within them it is sequential, because the permanence values are stored in distributed SRAMs, where the read/write operations take several clock cycles. In the mixed-signal design, on the contrary, the tuning process is performed concurrently even within the mini-columns or cells, and it usually takes two clock cycles. Finally, in the digital HTM, the winning mini-columns that represent the input are decided in a sequential fashion to cut down the resource cost and power consumption. This in turn translates to a longer latency that is proportional to the number of mini-columns. In the mixed-signal design, a WTA circuit [6] is used, which processes all the inputs concurrently.
Figure 10. (a) MAPE for predicting the power consumption in a gym for the next 2 and 5 hours using the HTM software (HTM-SW) and HTM hardware (HTM-HW) models. (b) Latency of the digital and mixed-signal HTM as a function of the network size, given by the number of mini-columns. (c) Elasticity (lifespan) of the overall HTM mini-columns in the ideal and real-world scenarios.
The memristor device write endurance, which is the number of times a memory cell can be overwritten successfully, turns out to be a crucial factor in determining the network's sustainability for learning. Memristor devices, particularly oxide-based devices, have a limited typical endurance range [44]. This low endurance reduces the network's reliability for online learning and continuous adaptation, especially when the network is densely connected and all neurons need to be updated continuously. For the HTM network, this is not the case, as cell/mini-column activities are sparse in nature and the learning is confined only to the active ones. This feature endows the network with a longer elasticity (lifespan) in comparison to other networks. In order to estimate the elasticity of mini-columns in the HTM network, we need to estimate their successful training rounds (L_r) and likelihood of activation, as given by (10), where E_d is the memristor device endurance:

$$L_r = E_d \times \frac{n_c}{n_w} \quad (10)$$

In the ideal scenario, mini-columns in the HTM network are activated with equal likelihood by patterns detected at the proximal segments. Thus, with n_c = 961 and n_w = 40, the number of successful learning rounds that can be made is E_d × (961/40), which is equivalent to ≈
24 times more than a conventional network with no sparse activity, and X times more than SNNs (X is not specified here because it is highly affected by the input and the encoding approach). In spite of the fact that SNNs are asynchronous and sparse in nature, their neurons usually fire, and their synaptic connections are tuned, multiple times while processing a single input, because each input is stochastically encoded as a stream of spikes. (SNNs are usually trained with spike-time-dependent plasticity (STDP) rules; STDP requires neurons to be tuned based on the time difference between the pre- and post-synaptic neurons' spikes.) However, the previous comparison hypothesizes that the HTM mini-columns' activations are perfectly regularized by incorporating the homeostatic plasticity mechanism (or boosting). In real-world scenarios, this is not the case, because the mini-columns' activations are highly affected by the input space statistics. Fig. 10-(c) is an example demonstrating an estimation of the developed system's elasticity (lifespan) for the Hot-Gym dataset. Here, we see that after year 4, a gradual loss in the mini-columns' elasticity starts to occur. Even after 8 years of operation, ≈309 mini-columns are still elastic and have the capability to acquire new information. However, the overall network performance at that time would be limited.
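As a quick sanity check of (10), the following minimal Python sketch computes the sparsity-driven gain in learning rounds. The endurance value is a placeholder assumption; only n_c and n_w come from the text:

```python
def learning_rounds(endurance: float, n_c: int, n_w: int) -> float:
    """Successful learning rounds per Eq. (10): L_r = E_d * n_c / n_w.

    Only the n_w winning mini-columns out of n_c are updated per input,
    so each device is written on roughly n_w / n_c of the inputs.
    """
    return endurance * n_c / n_w

E_D = 1e6   # placeholder device endurance (assumption, not from the paper)
N_C = 961   # total mini-columns (from the text)
N_W = 40    # winning mini-columns per input (from the text)

l_r = learning_rounds(E_D, N_C, N_W)
# A dense network updates every device on every input, so it wears out
# after exactly E_D inputs; HTM's sparsity buys n_c / n_w ~ 24x more.
print(f"L_r = {l_r:.3g} rounds ({l_r / E_D:.1f}x a dense network)")
```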
There are various types of memristor defects that may affect network performance, and they usually occur due to process variation [45], [46]. Examples of device defects are ageing faults, endurance degradation faults, switching delay faults, and stuck-at faults [47], [48]. Here, we focus on the stuck-at fault as it is ubiquitous and has a high impact on network performance [48]. Two types of stuck-at faults are studied. The first investigates the impact of stuck-on (high-conductance state) devices on HTM system performance while making a two-step-ahead prediction for the Hot-Gym dataset. The second focuses on the effect of stuck-off (low-conductance state) devices. Fig. 11-(a) illustrates the MAPE, averaged over 5 runs, of the HTM-HW prediction as a function of the faulty device percentage for the aforementioned cases. It can be seen that the stuck-off fault has a marginally positive impact on network performance, as it increases the network sparsity level. In contrast, the stuck-on fault increases the MAPE by 1.7%, and this can go up to 4.9% when the fault percentage is 30%. This degradation in performance arises from the fact that the SDR classifier is implemented as a softmax classifier with weighted synapses realized using a memristive crossbar. Having 10% stuck-on faults in the crossbar means that, on average, every row and column in the crossbar has 55 and 344 defective devices, respectively. This eventually makes the softmax classifier output nodes unable to distinguish various pattern activities, so they fire excessively. During the fault analysis, it is also found that applying the fault solely to the spatial pooler results in a marginal change in the system performance. This is because each input sample presented to the HTM is spatially represented by a small population of active mini-columns, and a slight change in the representation pattern, which may result from the fault, has very low impact. Furthermore, using the k-winner mechanism mitigates the changes that may occur in spatial patterns. (The fault is applied to the mini-columns' proximal connections and the SDR classifier weights.)

Figure 11. (a) The MAPE of the HTM-HW predicting two steps ahead in time for the Hot-Gym dataset while experiencing various types of stuck-at faults. (b) The total power consumption of the developed HTM system as it processes and predicts time-series data from the Hot-Gym dataset. (c) Contour of the energy-delay product for the developed HTM system as a function of network size.
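The stuck-on failure mode is easy to reproduce in simulation. The sketch below is a minimal NumPy illustration; the array sizes, conductance range, and fault rates are assumptions for demonstration, not the paper's device parameters. It injects stuck-at faults into a crossbar weight matrix and shows how stuck-on devices flatten the outputs of a softmax read-out:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_stuck_faults(g: np.ndarray, rate: float, stuck_value: float):
    """Force a random fraction `rate` of crossbar conductances to a stuck value."""
    g = g.copy()
    g[rng.random(g.shape) < rate] = stuck_value
    return g

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical classifier crossbar: 512 input lines x 10 output classes,
# conductances normalized to [0, 1] (illustrative, not device parameters).
G = rng.uniform(0.0, 1.0, size=(512, 10))
x = (rng.random(512) < 0.04).astype(float)   # sparse, SDR-like input

for rate in (0.0, 0.1, 0.3):
    g_on = inject_stuck_faults(G, rate, stuck_value=1.0)   # stuck-on faults
    probs = softmax(x @ g_on)
    # As more devices stick at high conductance, all output columns draw
    # similar currents and the class probabilities flatten out.
    print(f"stuck-on rate {rate:.0%}: top class probability = {probs.max():.3f}")
```

Setting stuck_value=0.0 instead models the stuck-off case, which merely sparsifies the matrix and leaves the read-out largely separable, consistent with the marginal effect reported above.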
The average total power consumption of the developed HTM system while predicting time-series data from the Hot-Gym dataset is estimated to be 28.94 mW, rising to 29.38 mW when online learning is enabled. The higher power consumption during training is due to the high voltage required to train the memristors. Fig. 11-(b) shows how the power evolves over one processing window. The operation starts with routing the input to the mini-columns and establishing the proximal synaptic connections, which take place in a simultaneous fashion. The power then abruptly increases due to the activation of the proximal segments to compute the mini-column overlap scores. Once the winning mini-columns are selected, the spatial pooler learning phase starts, in which the memristors associated with the proximal synapses are modulated. Meanwhile, the prior active cells' addresses are routed to each cell in the winning mini-columns to compute their distal segments' overlap scores. Computing the distal segments' overlap scores gives rise to another abrupt increase in the power consumption. However, this increase is much smaller than the one that occurred while computing the overlap scores of the mini-columns, because computing the cells' overlap scores is confined only to the cells within the winning mini-columns, while the other cells are disabled through clock gating. After computing the cells' overlap scores, the cells of the winning mini-columns locally compete to represent the input contextually. The selected active cells form lateral connections with the neighboring cells and tune their distal connections accordingly. One may observe from the previous discussion that tuning and computing the overlap scores turn out to be the most power-hungry operations, as there are more than 45.15k synapses involved in the network computations. One possible way to minimize the power is to modify the network size or to segregate the above operations into multiple stages at the mini-column or cell levels, but this comes at the expense of increasing the overall network latency. Fig. 11-(c) illustrates the contour of the energy-delay product, measured in pJ.s, which can be used to pick the optimal network architecture for a given power consumption and latency requirement. (The approach used to estimate the power consumption is described in our previous work [6]. The H-Tree structure might be buffered with the full-swing and reduced-swing buffers proposed in [31] to minimize the power consumption further.)

Fig. 12 shows the distribution of the power consumption among the different entities of the proposed HTM system during the training and testing modes. It implies that in HTM-Test, most of the power is devoted to the HTM cells, as they are more complex and have a large number of synaptic connections. During the training mode, HTM-Train, the cells and mini-columns pull further power to modulate their synaptic connections.
In contrast, the MCU and the other units (arbiter, selector, etc., excluding the H-Tree) consume a small fraction of the total power, as they are less complex and have limited memory usage.
Figure 12. The distribution of the power consumption among the building blocks of the proposed HTM system during (a) training (HTM-Train, total power = 29.38 mW) and (b) testing (HTM-Test, total power = 28.94 mW) modes.
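To show how an energy-delay-product contour like Fig. 11-(c) is used to choose an operating point, the following Python sketch sweeps hypothetical network sizes and picks the configuration with the lowest EDP under a power cap. The linear power and latency models are stand-in assumptions, not the paper's measurements:

```python
# Hypothetical sweep over network sizes to pick the lowest energy-delay
# product (EDP) under a power cap, in the spirit of Fig. 11-(c). The
# linear power/latency models are stand-in assumptions, not measurements.

def power_mw(minicols: int, cells: int) -> float:
    return 5.0 + 0.02 * minicols + 0.004 * minicols * cells

def latency_us(minicols: int, cells: int) -> float:
    return 8.0 + 0.001 * minicols + 0.0005 * minicols * cells

POWER_BUDGET_MW = 30.0
best = None
for mc in range(250, 2001, 250):
    for cells in (2, 3, 4, 5):
        p, t = power_mw(mc, cells), latency_us(mc, cells)
        energy_j = (p * 1e-3) * (t * 1e-6)     # energy per input, joules
        edp = energy_j * (t * 1e-6)            # energy x delay, J.s
        if p <= POWER_BUDGET_MW and (best is None or edp < best[0]):
            best = (edp, mc, cells, p, t)

edp, mc, cells, p, t = best
print(f"best under {POWER_BUDGET_MW} mW: {mc} mini-columns x {cells} cells "
      f"({p:.1f} mW, {t:.2f} us, EDP = {edp:.3e} J.s)")
```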
In an endeavour to compare our work with previous HTM implementations in the literature, we found that performing relative comparisons is a challenging process due to the lack of similarity in network architectures, technology nodes, operating frequencies, etc. Thus, we attempt to bring all the networks to the same size in terms of the mini-column and cell counts. We also hypothesize that the size of each network can be scaled linearly, and that the same applies to its power consumption. Starting with Krestinskaya et al. [20], we scaled only the number of mini-columns and the single-pixel processing elements (total = 961x1), as detailed information about the distal segments and their sizes is not reported; this results in a 17.45X improvement. (The authors of [20] also consider linear scaling for the network size and the power consumption.) In the case of the fully CMOS digital designs, a 77.02X improvement is achieved when compared to our previous work in [17], and 31.75X and 22.29X when compared to the work done by Weifu Li et al. [11], [18]. In contrast to the other previous works, the power consumption reported in [18] does not consider the register files, which are usually the most power-hungry components in the design. In the case of [11], it is unclear whether the register files' power consumption is included. It is important to mention here that, in most cases, the overall networks' synaptic connections have not been included in the aforementioned scaling process, as there is no clear approach to estimate the power consumption of the individual synaptic primitives. However, since our design uses more synaptic connections, equating our design with previous works in terms of synaptic connection count may result in a further improvement in power consumption.

Table 2. A comparison of the proposed HTM system with previous work. One may note that these implementations are on different substrates; thereby this table offers a high-level reference template for HTM hardware rather than an absolute comparison.

Algorithm              | Memristive HTM [20] | PIM HTM [17]                | Digital HTM [18] | PE HTM [11]       | This work
Task                   | Classification      | Classification & Prediction | Prediction       | Image recognition | Classification & Prediction
Operating Frequency    | -                   | 100 MHz                     | 100 MHz          | 100 MHz           | Dual 8-128 MHz
Proximal Segment Size  | 9                   | 16                          | 1                | 40                | 31
Distal Segments x Size | -                   | 5x10                        | -                | 12x16             | Shared 256
Total Power Consumed   | 13.34 mW (b)        | -                           | - (a)            | -                 | -
Dataset                | -                   | -                           | -                | -                 | MNIST (d) & Hot-Gym
Mini-columns x Cells   | 25xX (c)            | -                           | 400x2            | -                 | 961 x -

(a) In [18], the power consumption is reported for a single processing element (PE) without considering the register files; thus, we linearly scaled the power to an HTM network of size 400x2.
(b) In this reference, the temporal memory power is reported for single-pixel processing. This value is multiplied by the total number of mini-columns to estimate the total power of an HTM region with 25 mini-columns of one cell each.
(c) X denotes an unknown number of cells.
(d) Further details about the MNIST results are provided in [6].
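The normalization used for Table 2 is simple linear scaling of the reported power to a common network size. The sketch below illustrates the procedure in Python; the 400x2 and 961x1 sizes come from the discussion above, while the reported power figure is a made-up placeholder:

```python
def scale_power_mw(reported_mw: float, reported_units: int,
                   target_units: int) -> float:
    """Linearly scale a reported power figure to a common network size.

    Assumes power grows in proportion to the number of scaled units
    (mini-columns x cells), as hypothesized for Table 2.
    """
    return reported_mw * target_units / reported_units

# Example: a design reported at 400x2 processing elements, rescaled to the
# 961x1 configuration used above (the 10 mW figure is a made-up placeholder).
print(f"{scale_power_mw(10.0, reported_units=400 * 2, target_units=961):.2f} mW")
```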
CONCLUSIONS
This paper proposes a memristor-based mixed-signal architecture for the HTM network, including the spatial and temporal aspects of the algorithm. The proposed architecture incorporates several plasticity mechanisms, such as synaptogenesis and neurogenesis, that endow the network with a high degree of plasticity, lifelong learning, and minimal energy dissipation. The high-level behavioral model of the architecture is verified for time-series data prediction. It is found that the MAPE of the hardware model is 1.129X higher than that of the software counterpart. This degradation is mainly attributed to the memristor devices' non-idealities and the use of the synthetic synapses representation. The proposed architecture is also evaluated for latency and lifespan. We found that the mixed-signal implementation offers ≈3.46X lower latency and 77.02X lower power consumption than a custom CMOS digital design implemented at the same technology node.

REFERENCES
[1] J. Hawkins and S. Blakeslee, On Intelligence: How a New Understanding of the Brain Will Lead to the Creation of Truly Intelligent Machines. Macmillan, 2005.
[2] J. Hawkins, D. George, and J. Niemasik, "Sequence memory for prediction, inference and behaviour," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1521, pp. 1203–1209, 2009.
[3] D. George and J. Hawkins, "Towards a mathematical theory of cortical micro-circuits," PLoS Computational Biology, vol. 5, no. 10, p. e1000532, 2009.
[4] J. Hawkins and S. Ahmad, "Why neurons have thousands of synapses, a theory of sequence memory in neocortex," Frontiers in Neural Circuits, vol. 10, p. 23, 2016.
[5] D. E. Padilla-Baez, "Analysis and spiking implementation of the hierarchical temporal memory model for pattern and sequence recognition," Ph.D. dissertation, University of South Australia, 2015.
[6] A. M. Zyarah and D. Kudithipudi, "Neuromemristive architecture of HTM with on-device learning and neurogenesis," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 15, no. 3, p. 24, 2019.
[7] K. L. Rice, T. M. Taha, and C. N. Vutsinas, "Hardware acceleration of image recognition through a visual cortex model," Optics & Laser Technology, vol. 40, no. 6, pp. 795–802, 2008.
[8] J. Xing, T. Wang, Y. Leng, and J. Fu, "A bio-inspired olfactory model using hierarchical temporal memory," in Biomedical Engineering and Informatics (BMEI), 2012 5th International Conference on. IEEE, 2012, pp. 923–927.
[9] N. O. El-Ganainy, I. Balasingham, P. S. Halvorsen, and L. A. Rosseland, "On the performance of hierarchical temporal memory predictions of medical streams in real time," IEEE, 2019, pp. 1–6.
[10] A. Lavin and S. Ahmad, "Evaluating real-time anomaly detection algorithms – the Numenta anomaly benchmark," in Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 38–44.
[11] W. Li, "Design of hardware accelerators for hierarchical temporal memory and convolutional neural network," Ph.D. dissertation, North Carolina State University, 2019.
[12] O. Krestinskaya, I. Dolzhikova, and A. P. James, "Hierarchical temporal memory using memristor networks: A survey," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 5, pp. 380–395, 2018.
[13] L. Streat, D. Kudithipudi, and K. Gomez, "Non-volatile hierarchical temporal memory: Hardware for spatial pooling," arXiv preprint arXiv:1611.02792, 2016.
[14] M. Kerner and K. Tammemäe, "Hierarchical temporal memory implementation on FPGA using LFSR based spatial pooler address space generator," IEEE, 2017, pp. 92–95.
[15] T. Ibrayev, A. P. James, C. Merkel, and D. Kudithipudi, "A design of HTM spatial pooler for face recognition using memristor-CMOS hybrid circuits," IEEE, 2016, pp. 1254–1257.
[16] A. M. Zyarah, "Design and analysis of a reconfigurable hierarchical temporal memory architecture," Master's thesis, Rochester Institute of Technology, 2015.
[17] A. M. Zyarah and D. Kudithipudi, "Neuromorphic architecture for the hierarchical temporal memory," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 1, pp. 4–14, 2019.
[18] W. Li and P. Franzon, "Hardware implementation of hierarchical temporal memory algorithm," IEEE, 2016, pp. 133–138.
[19] D. Fan, M. Sharad, A. Sengupta, and K. Roy, "Hierarchical temporal memory based on spin-neurons and resistive memory for energy-efficient brain-inspired computing," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 9, pp. 1907–1919, 2015.
[20] O. Krestinskaya, T. Ibrayev, and A. P. James, "Hierarchical temporal memory features with memristor logic circuits for pattern recognition," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 6, pp. 1143–1156, 2018.
[21] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen, "Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations," Proceedings of the IEEE, vol. 102, no. 5, pp. 699–716, 2014.
[22] M. Mahowald, "VLSI analogs of neuronal visual processing: a synthesis of form and function," Ph.D. dissertation, California Institute of Technology, 1992.
[23] R. J. Vogelstein, F. Tenore, R. Philipp, M. S. Adlerstein, D. H. Goldberg, and G. Cauwenberghs, "Spike timing-dependent plasticity in the address domain," in Advances in Neural Information Processing Systems, 2003, pp. 1171–1178.
[24] D. H. Goldberg, G. Cauwenberghs, and A. G. Andreou, "Probabilistic synaptic weighting in a reconfigurable network of VLSI integrate-and-fire neurons," Neural Networks, vol. 14, no. 6-7, pp. 781–793, 2001.
[25] Y. Cui, C. Surpur, S. Ahmad, and J. Hawkins, "A comparative study of HTM and other neural network models for online sequence learning with streaming data," IEEE, 2016, pp. 1530–1538.
[26] Y. Cui, S. Ahmad, and J. Hawkins, "Continuous online sequence learning with an unsupervised neural network model," Neural Computation, vol. 28, no. 11, pp. 2474–2504, 2016.
[27] Y. Cui, A. Subutai, and J. Hawkins, "The HTM spatial pooler – a neocortical algorithm for online sparse distributed coding," Frontiers in Computational Neuroscience, vol. 11, 2017.
[28] P. Földiák, "Forming sparse representations by local anti-Hebbian learning," Biological Cybernetics, vol. 64, no. 2, pp. 165–170, 1990.
[29] S. Ahmad and J. Hawkins, "Properties of sparse distributed representations and their application to hierarchical temporal memory," arXiv preprint arXiv:1503.07469, 2015.
[30] D. Hebb, The Organization of Behavior: A Neuropsychological Theory, 1949.
[31] F. H. A. Asgari and M. Sachdev, "A low-power reduced swing global clocking methodology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 5, pp. 538–545, 2004.
[32] S. Purdy, "Encoding data for HTM systems," arXiv preprint arXiv:1602.05925, 2016.
[33] A. M. Zyarah, N. Soures, L. Hays, R. B. Jacobs-Gedrim, S. Agarwal, M. Marinella, and D. Kudithipudi, "Ziksa: On-chip learning accelerator with memristor crossbars for multilevel neural networks," IEEE, 2017, pp. 1–4.
[34] A. M. Zyarah and D. Kudithipudi, "Reconfigurable hardware architecture of the spatial pooler for hierarchical temporal memory," IEEE, 2015, pp. 143–153.
[35] S. Han, "Efficient methods and hardware for deep learning," Ph.D. dissertation, Stanford University, 2017.
[36] S. Ahmad and J. Hawkins, "How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites," arXiv preprint arXiv:1601.00720, 2016.
[37] J. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. A. Mead, "Winner-take-all networks of O(n) complexity," in Advances in Neural Information Processing Systems, 1989, pp. 703–711.
[38] K. A. Boahen, "Communicating neuronal ensembles between neuromorphic chips," in Neuromorphic Systems Engineering. Springer, 1998, pp. 229–259.
[39] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786–790, 2015.
[40] Y. Jiang, J. Kang, and X. Wang, "RRAM-based parallel computing architecture using k-nearest neighbor classification for pattern recognition," Scientific Reports.
[43] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, p. 61, 2015.
[44] M. Coll, J. Fontcuberta, M. Althammer, M. Bibes, H. Boschker, A. Calleja, G. Cheng, M. Cuoco, R. Dittmann, B. Dkhil et al., "Towards oxide electronics: a roadmap," Applied Surface Science, vol. 482, pp. 1–93, 2019.
[45] S. Kannan, N. Karimi, R. Karri, and O. Sinanoglu, "Detection, diagnosis, and repair of faults in memristor-based memories," IEEE, 2014, pp. 1–6.
[46] L. P. Romero, S. Ambrogio, M. Giordano, G. Cristiano, M. Bodini, P. Narayanan, H. Tsai, R. M. Shelby, and G. W. Burr, "Training fully connected networks with resistive memories: impact of device failures," Faraday Discussions, vol. 213, pp. 371–391, 2019.
[47] S. Kumar, Z. Wang, X. Huang, N. Kumari, N. Davila, J. P. Strachan, D. Vine, A. D. Kilcoyne, Y. Nishi, and R. S. Williams, "Oxygen migration during resistance switching and failure of hafnium oxide memristors," Applied Physics Letters, vol. 110, no. 10, p. 103503, 2017.
[48] V. Ravi and S. Prabaharan, "Fault tolerant adaptive write schemes for improving endurance and reliability of memristor memories," AEU-International Journal of Electronics and Communications, vol. 94, pp. 392–406, 2018.
Abdullah M. Zyarah is a lecturer at the Department of Electrical Engineering, University of Baghdad. He specializes in digital and mixed-signal designs, and his current research interests include neuromorphic architectures for energy-constrained platforms and biologically inspired algorithms. Mr. Zyarah received his B.Sc. degree in Electrical Engineering from the University of Baghdad, Iraq, in 2009, and his M.Sc. degree in the same discipline from Rochester Institute of Technology, USA, in 2015. Currently, he is a Ph.D. candidate with the Neuromorphic AI Lab research group in the Department of Computer Engineering, Rochester Institute of Technology.
Kevin Gomez is a Technologist in the Research Group at Seagate. His research interests include computer architecture post Dennard scaling and human-brain-inspired computing.

Dr. Dhireesha Kudithipudi [M'06, SM'16] is a professor and founding Director of the AI Consortium at the University of Texas, San Antonio, and the Robert F. McDermott Chair in Engineering. Her research interests are in neuromorphic AI, low-power machine intelligence, brain-inspired accelerators, and use-inspired research. Her team has developed comprehensive neocortex- and cerebellum-based architectures with nonvolatile memory, hybrid plasticity models, and ultra-low-precision architectures. She is passionate about transdisciplinary and inclusive research training in AI fields. She is the recipient of the Clare Boothe Luce Scholarship in STEM for women in higher education (2018) and the 2018 Technology Women of the Year in Rochester.