Temporal State Machines: Using temporal memory to stitch time-based graph computations
ADVAIT MADHAVAN ∗ , University of Maryland and National Institute of Standards and Technology
MATTHEW W. DANIELS ∗ and MARK D. STILES, National Institute of Standards and Technology
Race logic, an arrival-time-coded logic family, has demonstrated energy and performance improvements for applications ranging from dynamic programming to machine learning. However, the ad hoc mappings of algorithms into hardware result in custom architectures, making them difficult to generalize. We systematize the development of race logic by associating it with the mathematical field called tropical algebra. This association between the mathematical primitives of tropical algebra and generalized race logic computations guides the design of temporally coded tropical circuits. It also serves as a framework for expressing high-level timing-based algorithms. This abstraction, when combined with temporal memory, allows for the systematic generalization of race logic by making it possible to partition feed-forward computations into stages and organize them into a state machine. We leverage analog memristor-based temporal memories to design such a state machine that operates purely on time-coded wavefronts. We implement a version of Dijkstra's algorithm to evaluate this temporal state machine. This demonstration shows the promise of expanding the expressibility of temporal computing to enable it to deliver significant energy and throughput advantages.

Additional Key Words and Phrases: Temporal computing, temporal state machines, graph algorithms
ACM Reference Format:
Advait Madhavan, Matthew W. Daniels, and Mark D. Stiles. 2020. Temporal State Machines: Using temporal memory to stitch time-based graph computations. 1, 1 (October 2020), 25 pages. https://doi.org/10.1145/1122445.1122456
∗Both authors contributed equally to this research.

Energy efficiency is a key constraint when designing modern computers. The performance and efficiency of modern computers, which largely rely on Boolean encoding, can be attributed to developments across the computational stack, from transistors through circuits, architectures, and other mid- to high-level abstractions. The recent stagnation of progress at the transistor level [29] is leading designers to make improvements at the lowest levels of the stack. These include re-imagining how data is encoded in physical states and introducing novel devices. The rationale is simple: making the fundamental mathematical operations required for computation more efficient can have a cascading effect on the whole architecture. However, novel encoding schemes and devices come with new trade-offs that differ from those of conventional Boolean computing schemes and which are not yet well understood.

In this paper, we focus on an arrival-time encoding known as race logic [42]. Since digital transitions (edges) account for much of the energy consumption in traditional computation, race logic encodes multi-bit information in a single edge per wire. The arrival time t of this single edge is the value encoded by the signal. Encoding multiple bits on a single wire makes some operations very simple to implement. Standard and and or gates naturally implement the max and min functions; a unit delay element acts as an increment gate. A fourth logic gate, inhibit, allows its first (inhibiting) input to block the second input signal, if the inhibiting signal arrives earlier.

The development of race-logic-based architectures has been largely ad hoc. Race logic was first developed to accelerate dynamic programming algorithms [42], and its application space has expanded to include machine learning [64] and sorting networks [49], demonstrating energy and performance advantages. Parallel development of logical frameworks [44, 58, 66], novel device technologies [43, 67], and fabricated chips [41] have contributed to a cross-stack effort to make this encoding scheme technologically viable. Here, we offer two important developments.

The first development is a systematized method of building computing architectures. An important step is to identify a suitable mathematical foundation that can express problems uniquely suited to a race logic approach. Formal logic, computation, and verification frameworks have been developed [58, 65, 66].
Continued progress requires identifying a mathematical algebra in which race logic algorithms and state machines are naturally expressed in the highly parallel dataflow contexts typical of temporal computing accelerators. We propose tropical algebra to be used in this context. The second development is a compositional framework for programmatically linking low-level subroutines into higher-order functions. This development expands race logic beyond one-shot application-specific circuits, a limitation that has stemmed from accelerator-based architectures coupled with the absence of temporal memory technologies. Recent work has started to explore several device concepts for efficiently reading and writing time-coded signals [43, 67]. The advantage of such memories is that they can directly interface with the temporal domain; read and write operations in such a memory can be performed without conversion to digital encoding.

The introduction of temporal memory technologies allows race logic to serve as an efficient computational fabric for two distinct but compatible advances. First, because memory breaks symmetries related to translations of the time coordinate, a temporal computer equipped with a memory is no longer subject to the invariance constraint on time-coded functions outlined in Refs. [44, 58]. Lifting this restriction allows tropical algebra to serve as a coherent algebraic context for designing and interfacing race logic circuits. Second, memories allow us to reach beyond specialized one-shot temporal computations. Primitive race logic operations can be composed and iterated upon by saving outputs of one circuit and rerunning that circuit on the saved state, the temporal equivalent of a classical state machine. In this paper, we develop the temporal state machine as a tool for accelerating tropical algebra in a generalizable computational fabric of high-efficiency race logic circuits.

Our contributions are:

• A description of a temporal state machine that solves temporal problems in systematized parts, providing a clear computational abstraction for stitching larger computations out of primitive race logic elements.
• An exposition of tropical algebra as a mathematical framework for working with temporal vectors. We explain the mapping into tropical algebra from race logic and how it provides a convenient mathematical setting for working with temporal computations.
• Augmentations to conventional 1T1R (1 transistor, 1 resistor) arrays that make the crossbar architecture natively perform fundamental tropical operations. We use this, and other temporal operations, to create a more general feed-forward temporal computation unit.
• Demonstration and evaluation of a temporal state machine which uses Dijkstra's shortest path algorithm to find the minimal spanning tree on directed acyclic graphs.
The paper is organized as follows. Section 2 briefly describes race logic and tropical algebra, showing the mapping between them. Based on that mapping, we describe circuit implementations of important tropical operations as the basic generators of higher-order temporal functions. Section 3 introduces temporal state machines, explaining time-coded states and transition functions. We represent simple problems tropically and demonstrate how such a state machine can solve them in discrete steps. Section 4 presents a case study implementing Dijkstra's algorithm on a temporal state machine, and proposes a purely temporal version of the algorithm. Performance and energy numbers follow in Section 5, followed by a comparison with previous work and discussion in Section 6.
Computing with time traces back to two communities: one bio-inspired, the other purely efficiency oriented. The biological interest in precise timing relationships between spikes grew after the seminal works by Thorpe on the processing speed of the human visual system [26, 63] and on spike timing dependent plasticity by Bi and Poo [7]. From then, temporal wavefront computation [20, 30, 54, 68] in the biological community expanded to the machine learning and neuromorphic computing communities [37, 51, 52]. References [8, 48, 53, 62, 71] show state-of-the-art performance and learning strategies in temporal neural networks, while the neuromorphic computing community in Refs. [3, 16, 18, 23, 35, 51, 55] developed hardware to emulate precise timing relationships in spiking neural activity. More recently, precise timing-based codes in spiking neural networks perform a variety of applications such as graph processing [28, 33], median filtering [69], image processing [69], and dynamic programming [2]. For several decades, the circuit community has independently been using time-domain mixed-signal analog techniques in analog-to-digital and time-to-digital converters [50, 74], clock recovery circuits, phase and delay locked loops, phase detectors, and arbiters. With shrinking voltage levels and diminishing headroom, the temporal domain becomes attractive for analog processing. With the interest in emerging computing paradigms, this community has contributed temporally coded complementary metal-oxide-semiconductor (CMOS) only computational approaches [14, 19, 45, 46, 56].

Race logic sits between the aforementioned approaches in that it uses biologically inspired wavefronts as the fundamental data structure, while using conventional digital CMOS circuits to compute. Race logic encodes information in the timing of rising digital edges and computes by manipulating delays between racing events. In the conventional Boolean domain, the electrical behaviour of wires changing voltage from ground to V_dd is interpreted as changing from logic level 0 to logic level 1 at time t. In race logic, these wires are understood to encode each t as their value, since the rising edge arrives at t with respect to a temporal origin at t = 0.
In some cases, a voltage edge can fail to appear on a wire within the allotted operational time of a race logic computation. In these cases, we assign the value temporal infinity, represented by the ∞ symbol.

We define race logic without memory elements as pure race logic, which accounts for most of the extant literature. We call race logic that uses dynamic memory elements stateful or impure race logic. Our goal here is to describe stateful race logic, but first we review issues that arise in pure race logic. The class of functions that can be implemented in pure race logic is constrained by physics [44, 58] through causality and invariance. The causal constraint, also called non-prescience, requires that the output of a race logic function be greater than or equal to at least one of the function's inputs. Any output must be caused by an input that arrives either earlier than or simultaneously with that output. The invariance constraint arises because the circuit is indifferent to the choice of temporal origin. It is satisfied by race logic functions f for which

    f(t_1 + δ, t_2 + δ, ⋯, t_N + δ) = f(t_1, t_2, ⋯, t_N) + δ;

all operations in pure race logic must obey this equality.
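As a concrete illustration (ours, not from the original text), the four race logic primitives can be modeled as ordinary functions on arrival times, and the invariance property, along with its failure for raw addition, can be checked numerically:

```python
import math

INF = math.inf  # temporal infinity: an edge that never arrives

def gate_or(*t):       # first arrival: an OR gate implements min
    return min(t)

def gate_and(*t):      # last arrival: an AND gate implements max
    return max(t)

def delay(t, d):       # static delay element: increment by a constant d
    return t + d

def inhibit(t_inh, t_sig):   # the signal passes unless the inhibitor arrives first
    return t_sig if t_sig <= t_inh else INF

print(inhibit(2.0, 1.0))   # 1.0: signal precedes the inhibitor, so it passes
print(inhibit(1.0, 2.0))   # inf: inhibiting edge arrived earlier, signal blocked

# Invariance: shifting every input by delta shifts the output by delta.
f = lambda t1, t2: gate_or(delay(t1, 3), t2)
t1, t2, d = 5.0, 7.0, 2.5
assert f(t1 + d, t2 + d) == f(t1, t2) + d       # holds for pure race logic

# Raw addition of two arrival times violates invariance:
g = lambda t1, t2: t1 + t2
assert g(t1 + d, t2 + d) == g(t1, t2) + 2 * d   # shifted by 2*delta, not delta
```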
Table 1. List of symbols related to race logic and tropical algebra and their meanings
Symbol   Meaning              Description
∞        infinity             additive identity in tropical algebra; edge that never arrives in race logic
⊗        add                  multiplicative operation in tropical algebra; temporal delay in race logic
⊕        min                  additive operation in tropical algebra; first arrival in race logic
⊕′       max                  alternate additive operation in tropical algebra; last arrival in race logic
⊣        inhibit              ramp function in tropical algebra; signal blocking in race logic
=        equivalence          expressing equality between two statements
:=       storage              storing a signal in memory
:≅       normalized storage   storing a signal in memory by first performing a normalizing operation

Invariance need not apply to impure circuits, which contain a memory or state element: such circuits perform differently at different times, depending on whether a memory element has been modified. From a programming perspective, a pure function is akin to a function in mathematics, which always gives the same output when presented with the same input; an impure function is closer to a subroutine that can access and modify global variables.

Named in honor of Brazilian mathematician Imre Simon, tropical algebra concerns the tropical semiring T. In T, the operations of addition and multiplication obey the familiar rules of commutativity, distributivity, and so on, but are replaced by different functions. The tropical multiplicative operation is conventional addition, and the tropical additive operation is either min or max; the choice of additive operation distinguishes two isomorphic semirings. Depending on the choice of min or max as the additive operation, the semiring is given by T = (ℝ ∪ {∞}, ⊕, ⊗) or T′ = (ℝ ∪ {−∞}, ⊕′, ⊗); ±∞ are included to serve as additive identities. These symbols, and others used in this paper, are collected for reference in Table 1. That some of the generating operations of tropical algebra correspond directly to the primitive operations of race logic suggests that it is an ideal setting for the development of time-coded algorithms.

Tropical algebra has found numerous applications in the computing literature, particularly in a variety of graph algorithms, such as shortest path finding, graph matching and alignment, and minimal spanning trees. It is used as the basis of GraphBLAS (Graph Basic Linear Algebra Subprograms) [34]. In mathematics, it is being used to explore problems in combinatorial optimization [5], control theory, machine learning [78], symplectic geometry [6], and computational biology [75].

There are some fundamental similarities between tropical algebra and race logic. Both of them have an ∞ element. In race logic, it physically corresponds to a signal that never arrives, while in tropical algebra it corresponds to the additive identity, since

    α ⊕ ∞ = min(α, ∞) = α.    (1)

By contrast, the ring of real arithmetic is (ℝ, +, ×). While tropical algebra is defined over the real numbers with infinity, a race logic circuit can practically represent only a finite discrete set of signal timings. Race logic and tropical algebra are therefore not isomorphic, per se. Note that the same is true of a traditional computer with respect to conventional real arithmetic. However, just as traditional computers can operate over a large enough subring of the reals to produce useful calculations, there exists an embedding of min-based race logic as a subsemiring of tropical algebra. Regardless of whether we work in a subset of natural numbers (clocked race logic) or the reals (analog race logic), the fact that this mapping is well-behaved ensures that tropical algebra is a useful mathematical landscape for understanding race logic operations.
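The correspondence in Table 1 is compact enough to state directly in code. The sketch below is our illustration of the min-plus operations, with `INF` playing the role of the tropical additive identity:

```python
import math

INF = math.inf   # tropical zero (additive identity); a never-arriving edge

def t_add(a, b):      # tropical addition ⊕: min, i.e. first arrival (OR gate)
    return min(a, b)

def t_mul(a, b):      # tropical multiplication ⊗: ordinary +, i.e. a delay
    return a + b

def t_inhibit(a, b):  # a ⊣ b: b passes unless a strictly precedes it
    return b if b <= a else INF

# The semiring identities: ∞ is the additive identity, 0 the multiplicative one.
assert t_add(3.0, INF) == 3.0   # α ⊕ ∞ = α, Eq. (1)
assert t_mul(3.0, 0.0) == 3.0   # α ⊗ 0 = α
```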
Fig. 1. Tropical matrices for graph exploration: (a) shows an example directed graph; (b) shows the equivalent weighted adjacency matrix. Panel (c) shows the propagation of a signal originating at node b through a delay network corresponding to the edges of the example graph. Panel (d) shows the tropical vector-matrix multiplication corresponding to panel (c). Panels (e) and (f) repeat these representations for the case where signals are injected at both b and c.

Such an addition doesn't have an inverse, since there is no value of β in min(α, β) that would give ∞. The non-invertibility of addition means that this algebra is a semiring and fundamentally winner-take-all in nature. Every time the additive operation is performed, the smallest number (the first arriving signal in race logic) "wins" and propagates further through the computation. The multiplicative identity in tropical algebra is zero, rather than one, since α ⊗ 0 = α + 0 = α.

Tropical algebra can be especially useful for graph analytics, where it provides a simple mathematical language for graph traversal. A fundamental concept in graph traversal is the graph's weighted adjacency matrix A. Figs. 1(a) and (b) show a directed graph and its weighted adjacency matrix, respectively. The i-th column of the weighted adjacency matrix represents the distances of the outward connections from node i to all other nodes in the graph, so that A_ji is the weight for the edge i → j. Where there is no edge to node j from i, we assign the value A_ji = ∞.

The usefulness of tropical algebra for graph traversal is seen when using A in a tropical vector-matrix multiplication. Tropical vector-matrix multiplication (VMM) proceeds like conventional VMM, but with (⊕, ⊗) instead of (+, ×). As shown in Fig. 1, each vector element is scaled (tropical multiplication) before they are all accumulated (tropical addition). Extracting any single column from a matrix can be done by multiplying by a one-hot vector, as shown in Fig. 1. The tropical one-hot vector has a single zero element with all other entries set to ∞ (Sec. 2.2). During scaling, the columns of the adjacency matrix that correspond to the infinities of the one-hot vector get scaled to infinity (tropically multiplied by ∞) while the remaining column, scaled by the multiplicative identity 0, is the output. The values stored in the output vector represent the distances from the one-hot node in the input vector. This operation represents a search from the node in question (decided by the one-hot vector) to all the connected nodes in the graph, and reports the distances along all edges of this parallel search.

If we had chosen ⊕′ instead of ⊕, the additive identity would be −∞, though we generally prefer the min-plus version of tropical algebra. In pure race logic, ∞ corresponds to an edge that never arrived, whereas −∞ would correspond to an edge that had always been present, not to be confused with an edge that arrived at t = 0. No nontrivial function in pure race logic can output −∞ due to the causality constraint, so the min-plus algebra has considerably more practical utility in race logic.
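As an illustration (ours), the listing below performs a tropical VMM over a weighted adjacency matrix, with a one-hot input extracting the out-distances of a chosen node. The 4-node graph and its weights are hypothetical, chosen in the spirit of Fig. 1:

```python
import math
INF = math.inf

def tropical_vmm(A, x):
    """y_j = ⊕_i (A[j][i] ⊗ x[i]) = min_i (A[j][i] + x[i])."""
    n = len(A)
    return [min(A[j][i] + x[i] for i in range(n)) for j in range(n)]

# A[j][i] is the weight of edge i -> j; INF marks a missing edge.
A = [[INF, INF, INF, INF],   # no edges point at node a
     [1.0, INF, INF, INF],   # a -> b, weight 1
     [2.0, 1.0, INF, INF],   # a -> c (2), b -> c (1)
     [INF, 4.0, 2.0, INF]]   # b -> d (4), c -> d (2)

one_hot_b = [INF, 0.0, INF, INF]    # tropical one-hot selecting node b
print(tropical_vmm(A, one_hot_b))   # column b: [inf, inf, 1.0, 4.0]
```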
Using a "two-hot" vector for input, as shown in Fig. 1(d), outputs a tropical linear combination of two vectors, corresponding to the "hot" columns of the adjacency matrix. The accumulation phase of the tropical VMM is nontrivial; the ⊕ operation selects the smallest computed distance to each node for the output. The tropical VMM reports the shortest distance to each node in the graph after a single edge traversal from either of the initial nodes specified by the two-hot vector. Both steps, the exploration of a node's neighbors and the elementwise minimum over possible parent nodes associated with each output, are performed in parallel by a single matrix operation.

Representing a collective hop through the graph as a single matrix operation allows a series of matrix operations to represent extended graph traversal. The shortest traversed distance to each node in a graph from an initial node x is

    y = x ⊕ (x ⊗ A) ⊕ (x ⊗ A ⊗ A) ⊕ (x ⊗ A ⊗ A ⊗ A) ⊕ ⋯ .    (2)

The first term represents all single-hop shortest paths starting out from x, while the second term accounts for all the two-hop shortest paths, and so on. Hence the tropical summation across all the terms in y allows it to encode the shortest distances from the input node as specified by x, independent of the number of hops. Performing N such hops and calculating the minimum distance across all of them is the key operation in various dynamic-programming-based shortest path algorithms. This process makes tropical algebra the natural semiring for Dijkstra's algorithm [47]. We use these ideas to implement Dijkstra's single-source shortest path algorithm in a stateful race logic system in Sec. 4.

Since tropically linear functions ⨁_j (a_j ⊗ t_j) with the a_j values constant satisfy the invariance condition, tropically linear transformations may be carried out in pure race logic. In Section 2.1, we describe how single rising edges can be used to encode information in their arrival time. Interpreting the edges as tropical scalars, we can see how or gates and delay elements are modeled by tropical addition and multiplication. This understanding can also be extended to tropical vectors. Section 2.3 describes how tropical vectors can be interpreted as distance vectors in graph operations. These distance vectors can be interpreted temporally as a wavefront or a volley of edges measured with respect to a temporal origin. Other researchers have proposed using such vectors as the primary data structure underlying temporal computations [58, 62].

Just as conventional vectors can be normalized, so can tropical vectors. In tropical algebra, the vector norm is the minimum element of the vector. Tropical division of a vector by its norm then corresponds to subtracting out this minimum value from all the components. This ensures that at least one element of the tropical vector is zero. It is common to regard a tropical vector as equivalent to its normalized version, implying that it only encodes information about the shape of its temporal wavefront, and not about an arbitrarily chosen temporal origin. To accept this equivalence is to work in the projective tropical vector space, and we refer to tropical vectors as being projectively equivalent if their normalized versions are elementwise equal. In this paper we frequently make use of normalized tropical vectors and describe a method to store vectors projectively. Not only are they commonly the naturally correct data structure, but frequent renormalization of the temporal origin helps mitigate the limited dynamic range of our temporal encoding. Keeping track of the relative normalization constants allows us to encode information that would nominally extend beyond our dynamic range in a principled way.

In the max-plus tropical semiring, the vector norm would be the maximum element of the vector. This vector magnitude operation is sometimes called the L_∞ norm.
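Continuing the sketch (ours, reusing `tropical_vmm`, `A`, and `INF` from the previous listing), the iterated traversal of Eq. (2) and tropical normalization look like this:

```python
def shortest_distances(A, x, hops):
    """Accumulate y = x ⊕ (x⊗A) ⊕ (x⊗A⊗A) ⊕ ... for a fixed number of hops."""
    y = list(x)
    for _ in range(hops):
        x = tropical_vmm(A, x)                     # one more edge traversal
        y = [min(yi, xi) for yi, xi in zip(y, x)]  # elementwise ⊕ accumulation
    return y

def normalize(v):
    """Tropical normalization: subtract the vector norm (the min element)."""
    m = min(v)                       # assumes at least one finite component
    return [vi - m for vi in v], m   # normalized wavefront and its origin shift

one_hot_a = [0.0, INF, INF, INF]
d = shortest_distances(A, one_hot_a, hops=3)   # [0, 1, 2, 4] for the example graph
print(normalize(d))                            # ([0.0, 1.0, 2.0, 4.0], 0.0)
```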
Fig. 2. Construction of race logic circuits for tropical algebra: Panel (a) shows a composite circuit for the tropical dot product operation. A simple array of delay elements takes an incoming wavefront and delays its elements by the values stored in each of the delay elements. This represents the tropical elementwise multiply-by-constant operation. The output is then connected to a p-type metal-oxide-semiconductor (PMOS) pre-charge pullup coupled with a nor-style pulldown network, which behaves like a first-arrival detector and performs the tropical addition operation. Panel (b) combines multiple elements of panel (a) and scales this up to a 2D array such that it performs tropical vector-matrix multiplication, the critical operation for graph traversal as described in Section 2.3. Panel (c) shows a detailed circuit implementation of the tropical VMM cell. Each cell consists of two transistors, one for programming and the other for operation, and a level shifter. In the programming mode, the array is used like a conventional 1T1R array and the memristors are written to the appropriate resistance values. In the operation mode, the programming transistor is turned off, while the gate capacitor (shown in the figure) is charged through the memristor. The level shifter is used to make sure that the discharge time constant is determined by the memristor charging process and not the pulldown of the transistor, by applying full-swing inputs to the pulldown transistor.

Once a wavefront of rising voltage edges is interpreted as a tropical vector, the techniques shown in Fig. 2 can be used to implement tropical vector operations. Panel (a) shows the vectorized version of the tropical dot product operation. First the column of delay elements delays each line in the incoming wavefront by a different amount. This implements tropical multiplication by constants, and can be seen as superimposing the delay wavefront onto the incoming wavefront. The outputs of such a circuit are then connected to the inputs of a pre-charge-based pullup with an or-type pulldown network followed by an inverter. The circuit operation is divided into two phases, the pre-charge phase followed by the evaluation phase. In the pre-charge phase, the PMOS transistor has its input connected to ground, causing the critical node to be pulled up (connected to V_dd). When the pre-charge phase ends, the PMOS transistor is turned off, which maintains the potential at the critical node at V_dd. During the evaluation phase, the first arriving rising edge at the input of one of the NMOS transistors causes the critical node to discharge to ground, hence being pulled down to a potential of zero volts. This behaves as a first-arrival detection circuit that outputs a rising edge at the minimum of the input arrival times, performing the min operation. It implements tropical vector addition. Combining the delay (multiplication) with the min (summation), we get the tropical dot product operation. By replicating this behavior across multiple stored vectors, as in panel (b), we get the tropical VMM operation, where the input vector tropically multiplies a matrix.

To be specific, we consider versions of the tunable delay elements described in the previous paragraph that are based on memristor or resistive random access memory (ReRAM) technology. In the tropical VMM such a device is used as a programmable resistor with a known capacitance to generate an RC delay [43]. The details of these tropical vector algebra cells are shown in Fig. 2(c).
The main element of this circuit is a 2T1R array comprised of a pulldown transistor and a programming transistor. During the programming phase, the programming transistor coupled with the
programming lines can be used to apply the necessary read and write voltages across the memristor, thus changing the resistance and therefore the RC delay time stored in the device. During the operation of the circuit, the programming transistor is turned off to decouple the programming lines from the active circuitry. In the pre-charge phase, the output lines are pulled up to V_dd through the pullup transistor. In the evaluation phase, the input lines receive temporally coded rising edges which charge up the gate capacitors as shown in Fig. 2. This causes the pulldown transistor to be turned on at a time proportional to the input arrival time plus the RC time constant of the coupled memristor-capacitor in each cell, faithfully performing the tropical VMM operation.

The largest read voltage that can be applied across the device without disturbing its state is approximately 600 mV. In a 180 nm process, this value is only a few hundred millivolts above the transistor threshold voltage and would cause a slow and delayed leak. This leak allows multiple inputs to affect the pulldown simultaneously, influencing the functional correctness of the circuit. We propose two solutions to this problem. Figure 2(c) shows a level shifter added between the memristor and the pulldown transistor, the full swing of which causes the pulldown transistor to switch much faster. In an alternate approach (not shown here), a medium-V_th device is used for the pulldown. Such devices ensure a small footprint as well as correct operation, provided the fabrication process allows them.

In addition to tropical VMM based linear transformations, other primitive vector operations are crucial in many real applications. Elementwise min and max can be performed with arrays of or and and gates, respectively. Vectors can also be reduced to scalars by computing min or max amongst all elements using multi-input or and and gates.

Apart from circuits that allow race logic to implement tropical linear algebra, additional built-in functions, such as elementwise inhibit, argmin, and binarization, are required to perform general purpose tropical computations. Elementwise inhibit, shown in Fig. 3(a), is particularly powerful, as it allows us to implement piecewise functions. Its technical operation follows directly from the scalar inhibit operation discussed in Sec. 2.1.

The argmin function, shown in Fig. 3(b), converts its vector input to a tropical one-hot vector that labels a minimal input component. An or gate is used to select a first arriving signal, which then inhibits every vector component. Only one first arriving edge survives its self-inhibition; no other signals in the wavefront are allowed to pass, effectively sending these other values to infinity. The resulting vector is projectively equivalent to a tropical one-hot, and achieves the canonical form, with a single zero among infinities, after normalization.

The binarization operation, shown in Fig. 3(c), is similar; it converts all finite components to 0 while preserving infinite components at ∞. This operation utilizes a pre-stored vector which has the maximum finite (non-∞) value t_max of the computational dynamic range on each component. We define binarize(x⃗) = t_max ⊕′ x⃗. Computing the elementwise max of such a vector with any incoming vector, values that are ∞ remain so while the other values are converted to the maximal finite input value. Normalizing the result via projective storage saves a many-hot vector labeling the finite components of the original input.
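The following behavioral sketch (ours) mirrors these three vector primitives in ordinary Python; `t_max` stands for the top of the computational dynamic range, and the binarizer folds in the subsequent projective normalization described above:

```python
INF = float("inf")

def elementwise_inhibit(inh, sig):
    """sig_i passes unless inh_i arrives strictly earlier."""
    return [s if s <= i else INF for i, s in zip(inh, sig)]

def argmin_onehot(v):
    """The first arrival inhibits all components; only a minimal one survives."""
    first = min(v)                 # multi-input OR gate selects the first edge
    return [x if x <= first else INF for x in v]

def binarize(v, t_max):
    """binarize(x) = t_max ⊕' x: finite values -> t_max, ∞ stays ∞.
    Projective storage then subtracts the min, mapping t_max -> 0."""
    raw = [max(x, t_max) for x in v]
    m = min(raw)                   # assumes at least one finite component
    return [x - m for x in raw]

w = [3.0, 1.0, INF, 2.0]
print(argmin_onehot(w))       # [inf, 1.0, inf, inf]: projectively a one-hot
print(binarize(w, t_max=7))   # [0, 0, inf, 0]: many-hot labeling finite entries
```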
A variety of conventions could be taken for the case where more than one signal arrives at the same, earliest time. Note that such a multi-hot vector can be generated by the circuit shown in Fig. 3(b). When such a situation occurs, a sorted delay vector can be used to select one of the hot input elements and convert the vector to a one-hot vector.

Fig. 3. Tropically nonlinear race logic functions: Panel (a) shows the conceptual and circuit diagram for an elementwise inhibit operator. The inhibiting input is buffered before being fed into the gate terminal of a PMOS. As the inhibiting input turns high, the PMOS turns off, inhibiting the secondary input. Panel (b) shows the argmin operation that takes an input vector and returns a one-hot vector at the location of the element with the minimum value. This is done by taking the first arriving signal and inhibiting everything else but that signal. Panel (c) shows a binarizer. An input wavefront is maxed with a wavefront whose components all sit at the top of the dynamic range. This takes all values to this max value, except ∞, which remains as is, performing binarization.

The finite state machine or finite state automaton is a central concept in computing and lies at the heart of most modern computing systems. Such a machine is in one of some finite set of states S at any particular instant in time; inputs x ∈ Σ to the machine both produce outputs y ∈ Γ and induce transitions between these internal states. A state transition function δ : S × Σ → S determines the next state based on the current state and current input, and an output function ω : S × Σ → Γ gives the output based on the state and inputs of the machine.

The presence of state means that there is not a one-to-one correspondence between input and output of the machine; in the language we have developed above, a state machine is an impure function. This impurity is due entirely to the state variable; δ and ω are pure mathematical functions. The finite state machine thus provides a template for how we might compose pure race logic functions together across stateful interfaces to create general purpose temporal automata. In fact, the temporal state machines we introduce below fit into the mathematical framework given above. The temporal state machine we introduce differs from conventional automata in that the signals use temporal rather than Boolean encoding. State is made possible by recent proposals for temporal memories. These memories use temporal wavefronts as their primary data structure. By coupling nanodevice parameters to pulse duration, they are able to freeze temporal data in device properties such as resistance. Together with the pure race logic primitives described in previous sections, we can now build finite state automata in an end-to-end temporal encoding.

Designing such a machine requires addressing several problems intrinsic to the temporal nature of these logic and memory primitives. We start this section with a brief background on temporal memories, based on hybrid CMOS and emerging technologies, and explain their benefits and drawbacks. Then we describe the impure tropical multiplication of two signals in race logic as a first example of composing pure race logic across stateful interfaces in order to break through the invariance restriction. Finally, we return to the general state machine formulation and argue for the extensibility of our simple example to more complex systems.

This specification of a state machine is called a Mealy machine; if ω depends only on the state and not the current input, it is called a Moore machine. The two models are equally powerful in principle.
Fig. 4. Memristive wavefront memory: Panel (a) shows a memristive temporal memory, complete with read and write peripheral circuits as described in Ref. [43]. Note that bit line 4 has been replaced by a dummy line where the resistance values are fixed. The time constant of this line is governed by the parasitics of the circuit and determines the temporal origin of the outgoing wavefront. Panel (b) shows the functioning of a 4-bit temporal memory as simulated in our 180 nm Silterra process. Strip (i) shows the capture and playback of a linearly varying digital wavefront, with each color representing one of the sixteen lines involved. These edges have been collapsed into a single strip for clarity. Note that small timing mismatches cause small changes in the shape of the wavefront that is played back. Strip (ii) shows the digital read input applied to a captured column. Strips (iii, iv) show the source lines and bit lines, which are internal to the memory and hence operate at different voltages; these are shifted to V_dd with level shifters as shown in panel (a). Strip (v) shows the state change behaviour of the memristors as given by the memristor model in [13]. Note that the state change is almost linear. Careful inspection reveals a slight convexity by virtue of higher order terms in the exponential dependence.

Temporal memories natively operate in the time domain. They operate on wavefronts of rising edges rather than on Boolean values. Such memories can be implemented with emerging technologies such as memristors (as shown in Fig. 4) and magnetic race tracks [67] because the physics includes a direct coupling between the time variable and some analog device property. In either case, the memory structures are similar: memory cells are arranged in a crossbar. For the read operation, a single rising edge represented by the tropical one-hot vector is applied to the input address line of the memory, creating a wavefront at the output data line. For a write operation, the column of the crossbar where the memory has to be stored is activated and an incoming wavefront is captured.

The way temporal information is encoded in the devices depends on the technology. For memristors, the dependence of the RC charging time constant on the resistance R of the memristor and the row capacitance C is used to encode the temporal information, leading to a linear relationship between timing and resistance. Utilizing a 1T1R-like structure, the shared row capacitances are the output capacitances that have to be charged. In the write operation, the temporal dependence of the state change of the device creates a relative change in resistive state based on the arrival time of the edges. This enables a set of devices to correctly encode the shape of an incoming wavefront.

An alternative proposal in the literature uses magnetic racetracks to store temporal information in the position of a magnetic domain within each track [67]. Magnetic tracks have the property that a current passing through a metal layer below the magnet causes motion of the magnetic domain in the direction of the current flow. The speed v of magnetic domain motion is constant for constant current amplitude, and so the simple equation x = vt determines the final location of the domain. This linear proportionality provides a straightforward mapping between the timing of a write signal and the position of the stored magnetic domain.

Regardless of the technology, the behaviour of these memories is qualitatively different from that of registers or flip-flops. In the case of registers, a single clock tick performs two functions. Not only does it capture the next-state information from the calculation performed in the previous cycle, it also initiates the next cycle's computation. Combinational logic is thus "stitched" together by register interfaces. This feature of conventional memory does not exist in the temporal memories proposed to date because wavefront playback and capture use the same address and data lines, and cannot be used at the same time. Addressing this deficiency requires memories that can be used both upstream and downstream for the same operation, as shown in Fig. 4.

Some limitations of the temporal memories discussed above arise because they are analog: they possess limited dynamic range, and a dead time is incurred in their use, as shown in Fig. 4(b). The dead time is a result of the charging of the parasitics of the array, which, with growing array size, can become comparable to the delays stored. As measured from the temporal origin of the calculation, the dead times introduce artificial delays in each component that result in incorrectly encoded values at a memory write input. To deal with this issue, we introduce an extra dummy line, which we call the clock line, that always has the minimum R_on value for the resistor. This line serves as a temporal reference to the origin and hence behaves like a clock. This ensures that the parasitics of the lines are accounted for and only the relative changes in the resistance values are translated to the output wavefront.

The dynamic range is determined by the relative changes in the stored resistances, which manifest themselves as changes in the shape of the wavefront with respect to the clock line. Even optimistically, the range is limited to 6 to 7 bits with present technologies. Given our constrained dynamic range, we often restrict ourselves to the storage of normalized tropical vectors. In Sec. 2.2, we describe how this can be achieved by subtracting from each component the minimal value among all components. This guarantees that at least one element is zero in the normalized result. This alters the reference time of the calculation by the normalization constant which was subtracted away. Some algorithms are insensitive to this shift; otherwise, the normalization constant can be stored in an additional memory element for later recovery. In order to store the normalized version of a tropical vector, the min value of the vector (without the clock line) is used to replace the clock line for the storage operation, re-assigning it as the temporal origin. This can be performed by pulling the clock line input in Fig. 4(a) to V_dd.

The computing scheme discussed here can be either analog or digital. Though our evaluation (Sec. 5) is done assuming analog behavior, noise and other non-idealities will in practice determine the information capacity afforded to such a computing scheme. We discuss this issue further in Sec. 6.
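The listing below is a behavioral sketch (ours) of such a wavefront memory: capture freezes a wavefront's shape relative to a reference, playback re-emits it with the array's dead time added uniformly to every line, and projective capture replaces the reference with the vector's own minimum, as described above. The class name and interface are our own invention:

```python
INF = float("inf")

class WavefrontMemory:
    """Behavioral model of one column of a memristive temporal memory."""
    def __init__(self, dead_time=0.0):
        self.dead_time = dead_time   # parasitic charging delay of the array
        self.stored = None           # relative delays frozen into resistances

    def capture(self, wavefront, projective=False):
        # The clock line (minimum R_on) defines the temporal origin; projective
        # storage uses the wavefront's own min instead, normalizing the vector.
        origin = min(wavefront) if projective else 0.0
        self.stored = [t - origin for t in wavefront]

    def playback(self):
        # Every line, including the clock line, incurs the same dead time, so
        # downstream circuits referenced to the clock line see it cancel out.
        return [t + self.dead_time for t in self.stored]

mem = WavefrontMemory(dead_time=5.0)
mem.capture([12.0, 9.0, INF, 10.0], projective=True)  # stores [3, 0, ∞, 1]
print(mem.playback())                                 # [8.0, 5.0, inf, 6.0]
```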
In Section 2.1, we describe the invariance restriction on pure race logic. It constrains stateless race logic circuits to the computation of tropically linear functions. An immediate consequence is that pure race logic cannot tropically multiply two temporal signals. Static delay elements can be used to increment the value of a temporal signal by some fixed amount, but the raw addition of two time codes t_1 + t_2 is physically forbidden in the presence of time-translational symmetry. We can break this symmetry in stateful race logic through the introduction of memory.

Under the invariance constraint (Sec. 2.1), if two signals t_A and t_B are both shifted by a constant time δ, the output of a function of those signals must also be shifted by the same amount, so that f(t_A + δ, t_B + δ) = δ + f(t_A, t_B). This doesn't work for addition: if f(t_A, t_B) = t_A + t_B, shifting the inputs would result in t_A + t_B + 2δ at the output. Therefore addition is not temporally invariant, and two temporal signals cannot be added together using pure race logic.

With temporal memory, tropical multiplication of two wavefronts proceeds by breaking the operation into two phases, as shown in Fig. 5(a,b). The first panel shows the first phase, which stores the incoming wavefront in a local temporal memory using wavefront capture circuits. This stored vector can be temporally added to a new incoming wavefront, as shown in the next panel. Commutativity ensures that the order of storage and playback does not matter. Though the state transition and output functions within each phase are pure race logic functions, the state breaks invariance across the phase boundaries. Using memory for tropical multiplication thus allows us to construct tropical multinomial functions of arbitrary order.

Fig. 5. State machine operations: The state machine is partitioned into two main units: the temporal wavefront memory and the arithmetic unit. These are shown in panel (a). The multiplexers and the read/write modes of the memory allow the operations to be performed sequentially. Depending on the operations, individual memory units can behave as either upstream or downstream memories. Panels (a) and (b) show tropical multiplication in a temporal state machine split into two state transitions. In panel (a), storage of the incoming wavefront manifests as a one-argument operation; the vector is stored in the additive memory bank. The next phase in panel (b) is another one-argument operation, where the incoming wavefront is delayed by the wavefront stored in the previous phase. Panel (c) shows the tropical VMM operation. Panel (d) shows other elementwise operations that can be performed in a temporal state machine. Note that all operations, aside from the first phase of the tropical multiplication, store an output back in the temporal memory. The elementwise operations are the only two-argument operations and involve all three memories: the read memories are the upstream memories, while the write memory is the downstream memory.
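In code (ours, reusing the `WavefrontMemory` sketch from above), the two transitions of a tropical multiplication of two wavefronts look as follows:

```python
mem = WavefrontMemory()           # dead_time defaults to 0 for clarity

# Phase 1 (Fig. 5(a)): capture the first wavefront; its shape is frozen
# into the memory as per-line delays.
mem.capture([2.0, 0.0, 1.0])

# Phase 2 (Fig. 5(b)): play a second wavefront through the stored delays.
# Line i emerges at d_i + e_i, the elementwise tropical product (d ⊗ e)_i,
# which no pure (memoryless) race logic circuit could produce.
e = [1.0, 3.0, 0.0]
print([s + t for s, t in zip(mem.playback(), e)])   # [3.0, 3.0, 1.0]
```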
The invariant race logic circuits and temporal wavefront memory described above are sufficient to build a simple temporal state machine, as shown in Fig. 5(a). It consists of three banks of temporal memory, which can receive address inputs from external sources as well as data inputs from the output of the machine. The data outputs of the wavefront memory are multiplexed into the computation unit. This unit consists of a variety of the invariant race logic functions from Sec. 2.1 as well as a temporal memory unit for tropical VMM described in Sec. 2.4. The structure allows operations with a maximum of two operands to be executed at once.
Algorithm 1:
Pseudocode for procedural computation of Eq. (3)
Input: temporal vectors b⃗, c⃗, d⃗, and e⃗
    c⃗′ := d⃗ ⊗ e⃗ ;    // temporal vector addition (requires two transitions), Figs. 5(a,b)
    b⃗′ := c⃗ ⊣ c⃗′ ;   // elementwise inhibit, Fig. 5(d)
    a⃗ := b⃗ ⊕ b⃗′ ;    // elementwise min, Fig. 5(d)
    return a⃗ ;

This state machine allows arbitrary expressions such as

    a⃗ = b⃗ ⊕ (c⃗ ⊣ (d⃗ ⊗ e⃗))    (3)

to be calculated. The computation is performed by partitioning it into phases, with each phase implemented serially on the state machine of Fig. 5. By breaking the computation into discrete read-compute-store transitions of a state machine, we can represent the computation using a procedural algorithm, Algorithm 1.

We follow the regular order of arithmetic and perform the tropical multiplication first. Assume vectors d⃗ and e⃗ reside in memories one and two. The ⊗ operation is shown in Fig. 5(a),(b). The first phase selects the memory in the computation unit and applies a one-hot vector at the input of wavefront memory 1, initiating the computation. The memory places the vector d⃗ on the output data bus, which then passes it to the accumulator of the computation unit. The next step is shown in Fig. 5(b), where memory 3 is set up to receive the output of the operation while being activated in write mode. A one-hot vector is applied to the input of memory 2, playing the wavefront through the stored vector, and storing the resulting output in memory 3. This storage operation is indicated by the assignment operator := in the pseudocode. Tropical vector-matrix multiplication is a similar one-input operation and can be performed in a similar way, as shown in Fig. 5(c).

Two-operand operations such as elementwise inhibit and tropical vector addition are all performed in the same way. Synchronized one-hot vectors are presented to the address inputs, causing output wavefronts to be triggered. These wavefronts enter the computational unit, where circuits for the requested operations are multiplexed in, and the output is written to wavefront memory 3. This is all illustrated in Fig. 5(d). In this way, one- and two-operand operations can be performed in a single state machine. Note that the computation is set up by control circuits not shown in the figure. These control circuits are the only circuits that are not temporal in nature, and are used to direct the flow of the computation in the system. These control circuits can be understood as the machine-level subroutines called by a "tropical interpreter" stepping through the lines of pseudocode in Algorithm 1.
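A software rendering (ours) of Algorithm 1's read-compute-store flow, with each machine transition marked; the function name and test values are hypothetical:

```python
INF = float("inf")

def t_state_machine_eq3(b, c, d, e):
    """a = b ⊕ (c ⊣ (d ⊗ e)), scheduled as read-compute-store transitions."""
    stored = list(d)                                  # transition 1: store d
    c_prime = [s + t for s, t in zip(stored, e)]      # transition 2: play e, d ⊗ e
    # transition 3: two-operand ops read two memories and write a third
    b_prime = [cp if cp <= ci else INF                # c ⊣ c' (c inhibits c')
               for ci, cp in zip(c, c_prime)]
    return [min(bi, bpi) for bi, bpi in zip(b, b_prime)]   # b ⊕ b'

print(t_state_machine_eq3(b=[4.0, 1.0, 5.0], c=[2.0, 1.0, 3.0],
                          d=[1.0, 2.0, 0.0], e=[0.0, 1.0, 2.0]))  # [1.0, 1.0, 2.0]
```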
DNA alignment using a temporal instantiation of the Needleman-Wunsch algorithm was one of the first applications of race logic [41, 42]. In that work, the alignment matrix of the Needleman-Wunsch algorithm is physically laid out as a planar graph, and pure race logic operations define the scoring information at each node. Though the implementation in [41] is extremely fast and energy efficient, it suffers the disadvantage of requiring a dedicated ASIC. In this section we briefly sketch how Needleman-Wunsch might instead be implemented in a general-purpose tropical state machine like the one we describe in the previous section.
Algorithm 2:
Pseudocode for Needleman-Wunsch (forward pass only; computes optimal alignment cost)
Input: gene sequences x⃗, y⃗ ∈ {0, 1, 2, 3}^n, indel cost σ, mismatch cost m
    µ⃗(0) := [0];  µ⃗(1) := [σ, σ];
    // Upper-left triangular part [dim(µ⃗(k)) increasing]:
    for k ← 2 to n do
        c⃗′ := δ(x⃗_{0,…,k−1}, y⃗_{k−1,…,0}) ;    // mismatches → ∞, matches → {0, 1, 2, 3}
        c⃗ :≅ binarize(c⃗′) ;                   // mismatches ⇝ ∞, matches ⇝ 0
        a⃗ := σ ⊗ µ⃗(k−1) ;                     // apply insertion/deletion (indel) cost σ
        b⃗ := (m ⊕ c⃗) ⊗ µ⃗(k−2) ;               // apply mutation cost m for mismatches
        r⃗ := a⃗_{0,…,k−2} ⊕ b⃗ ⊕ a⃗_{1,…,k−1} ;  // find least-cost local path (Eq. (4))
        µ⃗(k) := [a_0, r⃗, a_{k−1}] ;            // append boundary conditions
    end
    // Lower-right triangular part [dim(µ⃗(k)) decreasing]:
    for k ← n + 1 to 2n do
        c⃗′ := δ(x⃗_{k−n,…,n}, y⃗_{n,…,k−n}) ;    // mismatches → ∞, matches → {0, 1, 2, 3}
        c⃗ :≅ binarize(c⃗′) ;                   // mismatches ⇝ ∞, matches ⇝ 0
        a⃗ := σ ⊗ µ⃗(k−1) ;                     // apply insertion/deletion (indel) cost σ
        b⃗ := (m ⊕ c⃗) ⊗ µ⃗(k−2) ;               // apply mutation cost m for mismatches
        µ⃗(k) := a⃗_{0,…,2n−k} ⊕ b⃗ ⊕ a⃗_{1,…,2n−k+1} ;  // find least-cost local path (Eq. (4))
    end
    return µ⃗(2n) ;   // this is actually just a scalar: the lowest possible alignment cost

The Needleman-Wunsch algorithm finds the shortest path through a dynamically constructed score matrix. Each element of the score matrix M_ij is constructed recursively as

    M_ij = min{ M_{i,j−1} + σ, M_{i−1,j} + σ, M_{i−1,j−1} + m(1 − δ_{x_i, y_j}) },

where σ is the cost of a genetic insertion or deletion (an "indel") and m is the cost of a single gene mutation. This naturally has the structure of a tropical inner product, but the Kronecker delta function breaks the causality condition and so cannot be implemented in pure race logic.

To compute the Kronecker delta, we encode the set of four possible genes {G, A, T, C} as temporal values {0, 1, 2, 3}. We then use the coincidence function [44, 58] to determine equality of the temporally encoded gene values. Tropically, the coincidence function is described as δ(t_1, t_2) = (t_1 ⊕ t_2) ⊣ (t_1 ⊕′ t_2), which is equal to the inputs when they are the same, and ∞ otherwise [44]. The coincidence function could either be a primitive operation of the state machine or could be accomplished over multiple state transitions using ⊕, ⊕′, and ⊣; we assume the former circumstance. Binarization followed by projective storage of δ(x_i, y_j) would then save zero (tropical one) to memory when x_i = y_j and ∞ (tropical zero) otherwise, resulting in a many-hot vector that indexes genewise equality.

To frame the Needleman-Wunsch algorithm as a tropical vector problem, we exploit the independence of the skew-diagonals [40]. We define µ⃗(k) as the k-th skew-diagonal vector of M, so that µ⃗(0) = [M_00], µ⃗(1) = [M_01, M_10], and so on.

The Kronecker delta δ_ij is defined as one when i = j and zero otherwise.

The simple version presented in the text applies only to an idealized coincidence: the exact point where t_1 = t_2.
In practice [44, 58] a nonzero coincidence window can be introduced via a tolerance ϵ, by computing [ϵ ⊗ (t_1 ⊕ t_2)] ⊣ (t_1 ⊕′ t_2).

The first and last elements of µ⃗(k) are kσ by construction for k ≤ n, that is, until we hit the main skew-diagonal. The defining equation for M_ij is then given through µ⃗(k) by

    µ(k)_j = (σ ⊗ µ(k−1)_j) ⊕ ([m ⊕ δ_{x_j, y_{k−j}}] ⊗ µ(k−2)_j) ⊕ (σ ⊗ µ(k−1)_{j+1}).    (4)

The vectorized computation of this recursion relation is presented programmatically in Algorithm 2. The right-hand side of each assignment is a pure race logic computation; the left-hand side represents a register address. As in Algorithm 1, the assignment operator x⃗ := y⃗ indicates storage of y⃗ to a temporal memory register represented by x⃗. The projective storage operator x⃗ :≅ y⃗ assigns the tropical normalization y⃗ − min y⃗ to the vector register x⃗.

The interpreter required here is more complex than in Algorithm 1. Though we could in principle implement the for-loops tropically by assigning k := 1 ⊗ k and monitoring n ⊣ k and 2n ⊣ k, we are not aware of a way to elegantly perform subarray extraction using temporal signals as indices. We therefore imagine that k, as well as the array slicing operations, are managed digitally by the interpreter.
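For reference, a plain-Python rendering (ours) of the forward pass, following the skew-diagonal recursion of Eq. (4) and Algorithm 2 in conventional min-plus arithmetic rather than on temporal hardware; the digital bookkeeping the text assigns to the interpreter (the loop counter and array slicing) is explicit here:

```python
INF = float("inf")

def nw_cost(x, y, sigma, m):
    """Forward pass of Needleman-Wunsch over skew-diagonals µ(k)."""
    n = len(x)
    assert len(y) == n
    def rows(k):                # row indices present on skew-diagonal k
        return range(max(0, k - n), min(k, n) + 1)
    mu = {0: [0.0], 1: [sigma, sigma]}          # µ(0) and µ(1)
    for k in range(2, 2 * n + 1):
        cur = []
        off1 = max(0, k - 1 - n)                # first row held in µ(k−1)
        off2 = max(0, k - 2 - n)                # first row held in µ(k−2)
        for i in rows(k):
            j = k - i
            if i == 0 or j == 0:
                cur.append(k * sigma)           # boundary: all indels
                continue
            up   = mu[k - 1][i - 1 - off1] + sigma          # from M[i−1][j]
            left = mu[k - 1][i - off1] + sigma              # from M[i][j−1]
            diag = mu[k - 2][i - 1 - off2] \
                   + (0.0 if x[i - 1] == y[j - 1] else m)   # from M[i−1][j−1]
            cur.append(min(up, left, diag))     # tropical sum ⊕ of three paths
        mu[k] = cur
    return mu[2 * n][0]                         # µ(2n) is the scalar M[n][n]

# genes {G, A, T, C} encoded as temporal values {0, 1, 2, 3}
G, A, T, C = 0, 1, 2, 3
print(nw_cost([G, A, T, C], [G, T, T, C], sigma=2.0, m=1.0))  # 1.0: one mismatch
```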
In Sec. 3 we demonstrate a simple model state machine, but it is too simple to utilize the graph traversal logic of tropical linear algebra that we describe in Sec. 2.3. Though the Needleman-Wunsch machine in Sec. 3.4 does perform graph traversal, it is restricted to a known, uniform progression through a highly regular planar graph. From the discussion of Sec. 2.3, however, we know that general graph traversal should be accessible to a tropical state machine. In the present section, we discuss an implementation of Dijkstra's algorithm in a temporal state machine using the concepts developed in this paper. We will see that the core neighbor-search operation of Dijkstra's algorithm is naturally parallelized by the tropical VMM, leading to very high throughput in terms of graph edges traversed per unit time, and that the inhibit operation together with projective storage allows the embedding of important Boolean logic structures within the temporal framework.

We assume that the reader is familiar with the classical implementation of Dijkstra's algorithm. In Algorithm 3, we map the operations of Dijkstra's algorithm into race logic, with each step as a single transition of a temporal state machine. Two trivial modifications simplify the race logic implementation. First, instead of tracking the known distances to each node, we mask out the distances of visited nodes with the value ∞. This vector of distances to unvisited nodes is d⃗ in the algorithm listing, and a tropically binarized record of which nodes have been visited is recorded in a vector v⃗. Second, instead of storing a parent vector directly, we define a parent matrix P̂ as a collection of tropical column vectors where a finite entry P_ij holds the distance from node i to node j along the current optimal path to j from the source node s. We assume that the memristors in the VMM are already programmed to their correct values, meaning that the graph is already stored in the arithmetic unit.

There are several apparent differences in how operations of the algorithm are performed in this (tropical) linear algebra engine compared to a traditional programming language. There are, loosely speaking, two "modes" in which we use tropical vectors. First, there are true temporal wavefronts, such as e⃗ and d⃗, that represent variable distances measured throughout the graph. These flow through the data path of the algorithm. Second, there are indicator wavefronts, such as v⃗ and d⃗*, with elements restricted to 0 or ∞. These are used along the control paths of the algorithm to perform element lookup from data-carrying temporal wavefronts, modification of tropically binary records such as v⃗, and index selection of the parent matrix. Projective storage plays a key role in these processes via binarization of one-hot vectors. Sometimes, quantities like n⃗ can play either of the above roles depending on context.
Algorithm 3: Pseudocode for Temporal Dijkstra's Algorithm
Input: graph G, source node s
// Variable initializations
d⃗ := s⃗;                       // distances to unvisited nodes (tropical one-hot labeling the source)
v⃗ := ∞;                       // visited nodes (tropical zero vector)
P̂ := ∞;                       // parent matrix (tropical zero matrix)
Â := adjacency-matrix(G);     // adjacency matrix of the graph
while ⊕_j d_j < ∞ do
    n⃗ := argmin(d⃗);           // choose node to visit
    // Examine neighbors
    e⃗ := Â ⊗ n⃗;               // VMM examines neighbors of current node
    f⃗ := d⃗ ⊣ e⃗;               // keep only newly found shortest paths
    // Update records for the next iteration
    v⃗ := v⃗ ⊕ n⃗;               // record the current node as visited
    d⃗′ := d⃗ ⊕ f⃗;              // construct new record of shortest paths
    d⃗ :∝ v⃗ ⊣ d⃗′;              // update global unvisited-distance vector
    // Parent matrix update process
    f⃗* :∝ binarize(f⃗);        // vector indices of found nodes
    P̂ := f⃗* ⊣ P̂;              // delete row data of previously recorded parents for found nodes
    P⃗_n⃗ := f⃗;                 // record in column n⃗ the distances f⃗ from n⃗ to the found nodes
end
return P̂;                     // adjacency matrix of the minimal spanning (from s) subgraph of G

There are two primary constraints on this algorithm's application. First, because directed edge weights are encoded as temporal delays, negative edge weights are physically forbidden. Second, temporal vectors are limited to a finite dynamic range and resolution constrained by the technology in which they are implemented, and consecutive tropical multiplications could lead to dynamic range issues. To mitigate this, we arrange the computation such that no more than one tropical multiplication occurs along a single datapath per state machine transition. Normalization of d⃗ at the end of each cycle shrinks the dynamic range as much as possible between VMMs.

The algorithm initializes by setting the vector d⃗ of known distances to unvisited nodes to a tropical one-hot s⃗ labeling the source node s. The vector v⃗ labeling visited nodes, as well as the parent matrix P̂ keeping track of the minimal spanning tree through the graph, have all elements set to ∞. We assume the weighted adjacency matrix Â of the desired graph has been programmed into a VMM unit before the algorithm begins; this is a one-time cost that can be amortized over frequent reuse of the array. The algorithm then begins by cycling the state machine through the main loop.

In each iteration, we check whether any unvisited nodes are available for exploration by evaluating the minimum element of d⃗. The algorithm terminates if this operation returns ∞, which indicates that all nodes have either been visited or are unreachable. Taking the argmin (Sec. 2.5) of d⃗ nominally gives us a vector d_j ⊗ j⃗, where j is the index of a node along a shortest path (of those so far explored) from the source and j⃗ is the tropical one-hot labeling index j. This can be thought of as a one-hot vector with a magnitude d_j.
However, we will see that d_j is always zero by construction, so argmin(d⃗) is just a one-hot. We store this one-hot to the vector register n⃗.

The next step is to examine the directed edges to the neighbors of node n⃗. We use n⃗ as the input to a temporal VMM operation with Â, which performs a parallel traversal to all neighbors. The result of this exploration is stored in e⃗. This vector may contain shorter paths to the neighbor nodes, via node n⃗, than had previously been found. Such shorter paths manifest as elements of e⃗ with smaller values than their corresponding elements in d⃗. Those specific nodes can be extracted by taking an elementwise inhibit of e⃗ by d⃗; the resulting updated distance vector is stored as d⃗′. We also note that n⃗ has been visited, and should not be visited again, by imposing the zero of n⃗ onto v⃗ and saving it in memory.

If the dynamic range of our memory were boundless, we could perform this operation repeatedly and determine the final distance vector of the algorithm. But because we are dynamic-range-limited, we have to ensure that accumulation in the distance vector is minimized. We do this via projective storage of d⃗′ into d⃗. We also inhibit d⃗′ by v⃗ before storage to ensure that no nodes we have already visited are candidates for exploration in the next iteration; this also ensures that argmin(d⃗) will be a magnitude-free one-hot on the next cycle. This shifts the temporal origin for the entirety of the next iteration into the perspective of argmin(v⃗ ⊣ d⃗′); all temporal values in the new d⃗ are now expressed relative to the stopwatch of an observer at the argmin node.

After completing neighbor exploration, we update the parent matrix. The newly found nodes in f⃗ are the ones whose parents need to be updated. A binarized version f⃗* of f⃗ is used to inhibit the rows of the parent matrix corresponding to the new paths in f⃗, erasing these nodes' now-outdated parent data. This operation is performed row-by-row, requiring N state machine transitions to complete. The new parent is then added to the parent matrix: n⃗ is used to enable the appropriate column of P̂ for writing, and f⃗ is then written to this column.

Throughout this algorithm, we require dynamic indexing of memory addresses based on past results of the temporal computation. Recall that the Needleman-Wunsch algorithm required significantly nontrivial subarray selection operations in Algorithm 2. We claimed in Sec. 3.4 that those would likely need to be handled digitally. Those index selections can be statically determined at compile time, so they can merely be part of the elaborated bytecode controlling the state machine: there is no need for data to translate back and forth between temporal and digital domains in order to execute Algorithm 2. In Algorithm 3, by contrast, index selections of the parent matrix are dynamically determined at runtime and cannot be statically embedded in the digital controller around the state machine. But the one-hot nature of the indexing operations offers natural interfaces to the crossbar architecture, so, again, no digital intermediary is required to perform address lookup.
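The complete data flow can be checked against a conventional software model. The following NumPy sketch (our construction; it mirrors the listing above but is in no way a circuit simulation) executes Algorithm 3, with projective storage realized as an explicit subtract-the-minimum:

```python
import numpy as np

INF = np.inf  # tropical zero: a signal that never arrives

def trop_vmm(A, x):
    """Tropical VMM: (A ⊗ x)_i = min_j (A[i, j] + x[j])."""
    return np.min(A + x[np.newaxis, :], axis=1)

def inhibit(a, b):
    """a ⊣ b: b passes only where it arrives strictly before a."""
    return np.where(b < a, b, INF)

def temporal_dijkstra(A, s):
    """Reference model of Algorithm 3. A[i, j] is the delay of the
    directed edge j -> i, with INF marking an absent edge."""
    N = len(A)
    d = np.full(N, INF); d[s] = 0.0            # d⃗ := s⃗ (tropical one-hot)
    v = np.full(N, INF)                        # v⃗ := ∞ (nothing visited yet)
    P = np.full((N, N), INF)                   # P̂ := ∞
    while np.min(d) < INF:                     # while ⊕_j d_j < ∞
        k = int(np.argmin(d))                  # n⃗ := argmin(d⃗); d[k] == 0 here
        n = np.full(N, INF); n[k] = d[k]
        e = trop_vmm(A, n)                     # e⃗ := Â ⊗ n⃗ (visit all neighbors)
        f = inhibit(d, e)                      # f⃗ := d⃗ ⊣ e⃗ (newly found paths)
        v[k] = 0.0                             # v⃗ := v⃗ ⊕ n⃗ (mark visited)
        d_new = np.minimum(d, f)               # d⃗′ := d⃗ ⊕ f⃗
        g = inhibit(v, d_new)                  # v⃗ ⊣ d⃗′ (mask visited nodes)
        m = np.min(g)                          # projective storage d⃗ :∝ v⃗ ⊣ d⃗′
        d = g - m if np.isfinite(m) else g     # (tropical normalization)
        fstar = np.where(np.isfinite(f), 0.0, INF)  # f⃗* := binarize(f⃗)
        P = inhibit(fstar[:, None], P)         # f⃗* ⊣ P̂: clear rows of found nodes
        P[:, k] = f                            # write column n⃗ of the parent matrix
    return P

# Four-node example: edges 0->1 (1), 0->2 (4), 1->2 (2), 2->3 (1)
A = np.full((4, 4), INF)
A[1, 0], A[2, 0], A[2, 1], A[3, 2] = 1, 4, 2, 1
print(temporal_dijkstra(A, s=0))
```

On this example, the finite entries of the returned matrix trace the shortest-path tree (P[1,0] = 1, P[2,1] = 2, P[3,2] = 1), and d is indeed a magnitude-free one-hot at every argmin, as claimed above.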
We start the evaluation of this temporal state machine by describing the assumptions in its design and the simulation framework we use. To understand the scaling of this architecture, we create models for the temporal memories and the associated arithmetic circuits.

[Footnote: Note that even if our memory were not range-limited, we must still choose a dynamic-range cutoff at which we assign finite time values to ∞; otherwise, a circuit that outputs ∞ as a valid return value could never halt. In practice, though, memory is the limiting factor. However the finite delay representing ∞ is chosen, it informs the effective clock frequency of the state machine.]

Although memristors can be read at low voltages (≈ 200 mV), the voltages needed to write them can be as high as 2 V to 3 V, which puts a lower limit on the technology node we can use. To secure enough voltage headroom for changing device states, we use the 180 nm Silterra process with a V_dd of 1.8 V. The write path of the memory includes circuits for two different write modes, the conventional and normalized forms described previously. Both operations require similar circuits, with an input first-arrival detector charging the source line and level-shifting circuits providing the appropriate write voltages, producing the quasilinear state write described in [43]. The read and write energy costs for N × N array sizes ranging from N = 4 to N = 32 are shown in Fig. 6.
The energy scales superlinearly with array size due to growth in the support circuitry, whose size scales with N: the input driver needs to be scaled up for larger arrays, and, in the write case, larger arrays require first-arrival circuitry with more inputs. From the figures, we see that the read cost is approximately 2 pJ per line while the write cost is around 10 pJ per line. This 5× factor between read and write energies drives many of the tradeoff considerations in our designs.

The most computationally intensive pure race logic function is the tropical VMM, which implements a single-step all-to-all graph traversal. Such an operation naturally scales as N², which can be seen in Fig. 6(b). On average, this system costs ≈ 700 fJ per cell, so a 32 × 32 grid consumes ≈ 700 pJ of energy. The large energy cost of this operation arises from the conservative design strategy we employed. To make sure that the or pulldown network functions properly, we have to ensure that the time constants of the pulldown dynamics are not determined by the CMOS; that is, the network must switch more quickly than the resolution of our temporal code. The low read voltage causes the pulldown transistor to discharge too slowly, leading to multiple nodes pulling down the same source line and to functional incorrectness of tropical addition. To overcome this issue, we add level-shifters to each cell to boost the input voltage from V_read to V_dd; these provide the necessary overdrive for correct operation.

Other pure race logic functions, such as ⊕ = min, ⊕′ = max, and ⊣ = inhibit, as well as compound functions such as argmin and binarize, are implemented with conventional CMOS gates and hence have a minimal energy cost in this process node. For example, a 32-channel elementwise ⊕ costs approximately 1 pJ, which is negligible compared to temporal read, write, and VMM operations. The argmin operation has the largest energy cost among the combinatorial gates, since the first-arriving input has to turn off all of the other channels and must therefore drive circuits with a larger output capacitance.

[Footnote: Certain commercial processes are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the processes identified are necessarily the best available for the purpose.]
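As a quick consistency check on those figures (our arithmetic; the only input taken from the text is the ≈ 700 fJ/cell value), the N² cell count of the tropical VMM reproduces the quoted grid-level energy:

```python
CELL_FJ = 700.0  # approximate energy per VMM cell quoted above

for n in (4, 8, 16, 32):
    total_pj = CELL_FJ * n * n / 1000.0  # a tropical VMM activates N^2 cells
    print(f"{n:>2} x {n:<2} VMM ≈ {total_pj:6.1f} pJ")
# 32 x 32 -> ≈ 716.8 pJ, consistent with the ≈ 700 pJ figure for the full grid.
```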
Fig. 6. Energy results for various operations: Panel (a) shows the energy costs for vector operations of various array sizes. Panel (b) shows the energy costs for read and write operations using the memristor temporal memory, while panel (c) shows the energy costs of the VMM operations. Panel (d) compiles these energies and presents the energy cost of single- and multi-operand operations in a temporal state machine of size 32 × 32. Panel (e) compares the energy cost of such a 32 × 32 kernel with that of state-of-the-art application-specific integrated circuit (ASIC) and graphical processing unit (GPU) designs. It shows simulation results from our 180 nm process as well as scaling to more advanced nodes following the procedure described in [60].

The energy cost of each of these operations with respect to the problem size is shown in Fig. 6(a).

The energy numbers for the key phases of Algorithm 3 for a problem size of 32 are shown in Fig. 6(d). Every operation incurs a single write cost, by virtue of the output that has to be written into the memory cells, except for ⊗, which incurs two writes [Fig. 5(a,b)]. Read costs, on the other hand, depend on the number of operands: single-operand operations such as argmin require a single read, while binary operations such as ⊕, ⊕′, and ⊣ incur twice the read cost. Energy costs for all operations except the tropical VMM are dominated by reads and writes, a result of the simplicity of the primitive operations, which are essentially built from simple Boolean primitives.
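That bookkeeping lends itself to a rough cost model. The sketch below (ours; the ≈ 2 pJ read, ≈ 10 pJ write, and ≈ 700 pJ VMM values are taken from Fig. 6, while the ≈ 1 pJ combinational-logic term is an assumption) tallies the energy of representative Algorithm 3 operations per state machine transition:

```python
READ_PJ, WRITE_PJ, VMM_PJ, GATE_PJ = 2.0, 10.0, 700.0, 1.0

def op_pj(reads, writes, logic_pj=GATE_PJ):
    """One write per result (two for ⊗), one read per operand."""
    return reads * READ_PJ + writes * WRITE_PJ + logic_pj

costs = {
    "argmin (one operand)":         op_pj(reads=1, writes=1),
    "min / inhibit (two operands)": op_pj(reads=2, writes=1),
    "tropical VMM (⊗)":             op_pj(reads=1, writes=2, logic_pj=VMM_PJ),
}
for op, pj in costs.items():
    print(f"{op}: {pj:.0f} pJ")  # reads and writes dominate everything but the VMM
```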
Graph processing is a well-studied problem in computing, and a variety of solutions have been proposed for it at various scales [25]. Processing of real-world graphs, which can contain hundreds of thousands of nodes and millions of edges, combines both software and hardware frameworks, employing everything from central processing units (CPUs), field-programmable gate arrays (FPGAs) [77, 80], and graphics processing units (GPUs) [21, 70] to application-specific integrated circuits (ASICs).
Our 32 × 32 kernel in a 180 nm technology node has an edge traversal rate of 10 ns⁻¹ (10 GETS), and its energy efficiency is about 1 nJ⁻¹ (1 GETJ), which compares favorably with the state of the art. Using scaling projections from [60], we estimate that a single kernel can theoretically surpass state-of-the-art kernel performance. When scaled up to larger N × N array sizes, such as N = 128 or N = 256 (not an uncommon core size for memristor crossbars), we can expect massive performance improvements. Note that the state of the art for graph processing engines when energy is of no concern is on the order of hundreds of GETS, which our analysis indicates is feasible for temporal designs.

Independent but parallel work on graph problems is being undertaken by the neuromorphic computing community. Dijkstra's algorithm has been studied by researchers in neuromorphic computing as a benchmark application for the field [2, 17, 28]. State-of-the-art industrial research spiking neural network platforms [18] use Dijkstra's algorithm to establish performance metrics for their systems. The Tennessee neuromorphic project uses single-source shortest-path computation to demonstrate their spiking neuromorphic chips [57]. The energy-per-spike costs are detailed in Ref. [57]; when an operation equivalent to the tropical VMM is implemented, it costs approximately 2.5 nJ in a 65 nm process. By comparison, combining both the memory and VMM primitives, race logic performs the same operation for 1 nJ in a 180 nm process.
Previous work with race logic has demonstrated that most of the energy expended in race logic architectures is spent in the distribution of timing information, such as in clock trees or analog voltages [41]. To get an energy advantage over those approaches, the present work relies on novel technologies such as memristors to locally generate a programmable delay, which has the advantage that the energy cost is limited by the capacitor. Hence we are limited by today's memristor technology, which requires large write voltages (1.2 V to 6 V); this forces us to use a relatively old technology node. The development of memristor technology is being driven toward CMOS compatibility at advanced technology nodes, which require lower read and write voltages [15]. Companies are exploring low-write-voltage resistive random access memory (ReRAM) and embedding it into 22 nm fin field-effect transistor (FinFET) stacks [24, 31].

Maturing technology holds great promise for the designs proposed in this work. As CMOS transistors become smaller, the area, energy, and speed all improve. For example, when moving from 180 nm CMOS to a 14 nm FinFET process, using a fan-out-of-4 inverter as a benchmark, the area, energy, and latency improve by 100×, 190×, and 19×, respectively [60]. As memristor technologies become compatible with lower voltages, the energy of the read and write operations is expected to decrease. The write energy, determined by the voltage and current needed to alter the memristive state, changes less than the read energy, which follows the inverter characteristics. The scaling performance of race logic systems is easy to estimate, since the spatial nature of the information flow ensures that architectures in various technology nodes have similar designs and activity factors. We expect the dynamic energy cost to follow the energy trend of the inverter as described in Ref. [60]. Though latency and area are determined by other factors, such as memory dynamic range and functional correctness, the overall advantage in energy-delay product from scaling to a lower node could be as high as three orders of magnitude.

There are a variety of device non-idealities that affect the design. First, the dynamic range of the technology is limited. Prior work with memristors has demonstrated precision as high as 6 bits to 7 bits [4, 72], but this was accomplished with a carefully programmed write-verify process. With the memristor model used in this paper, we can extract up to 5 bits of precision; practical implementations have even lower precision. One way to increase precision is to use extra wires to encode higher-precision bits, as is done for Boolean logic. A similar idea has been proposed in Refs. [14, 36].

Another major impediment to smooth operation in our circuits is the linearity requirement of the memristor write process. A truly linear write would increase the dynamic range of our operations and ensure a clear mapping between the read and write processes. This linearity requirement has been a major topic of research for the neuromorphic VMM community, with large implications for the hardware training of large-scale neural networks [9, 61, 72], and considerable effort has been dedicated to it. Various groups have shown highly linear behaviour by operating in the high-conductance regime with proper compliance control [38], exploring new materials [12], and using three-terminal lithium devices [22]. An alternate approach utilizes highly linear trench-capacitor-based storage [39]. Recently, a temporal magnetic memory has been proposed that exhibits linear dynamics [67].
This proposal re-purposes magnetic configurations in racetracks, such as domain walls or skyrmions, to encode temporal information spatially within the racetrack.

Finally, issues such as noise, variability, drift, and mismatch will ultimately determine the actual dynamic range that can be extracted from such nanodevices [72]. The advantage of using a charge-based approach to computation is that the memristors can be used in their high-conductance, low-variability regime while still maintaining low read energy costs. Building a noise model that describes how variability accumulates in such systems is beyond the scope of this text, but it will be important future work. The energy benefits of such an analog approach come at the cost of error; effectively bounding this tradeoff is a crucial theoretical problem.
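As an illustration of how limited precision interacts with projective storage (a sketch under our assumptions: the 5-bit figure is read as 32 distinguishable delay levels, and anything past the cutoff maps to ∞, as in the earlier footnote), a normalize-and-clip store can be modeled as:

```python
import numpy as np

INF = np.inf

def projective_store(y, levels=32):
    """Model of projective storage into a 5-bit temporal memory:
    tropically normalize to the first arrival, then clip anything
    past the dynamic-range cutoff to the tropical zero, ∞."""
    m = np.min(y)
    z = y - m if np.isfinite(m) else y
    return np.where(z < levels, z, INF)

print(projective_store(np.array([7.0, 12.0, 45.0, INF])))
# [ 0.  5. inf inf]: renormalization keeps near wavefronts representable,
# while entries beyond the 32-level window saturate to ∞.
```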
Temporal computation leads to unconventional architectures that come with their own design constraints. The cost of primitive operations (aside from the VMM) in temporal computing is cheap compared to memory access operations. This points to strategies that amortize the cost of memory accesses over multiple feed-forward operations; future systems would benefit greatly from performing many such operations in a single phase. In Algorithm 3, for instance, neither e⃗ nor d⃗′ needs to be stored in memory. A sophisticated compiler could detect optimally long compositions of pure race logic functions and use memory only where invariance or causality must be broken. Though such a state machine would need additional control logic with separate clock and dummy lines, the energy savings accrued by this sort of optimization would be significant.

As higher-level algorithms become more clearly expressible, an important question is: what kind of complexity of operations do we want in our designs? As in the discussion of reduced versus complex instruction set computers (RISC vs. CISC), a design with simpler fundamental primitives could be more flexible but might sacrifice performance. An example can be seen in the parent matrix update of Algorithm 3. A 2D update array similar to the VMM could amortize the cost of N extra operations, and hence save N memory reads and writes, in a single operation. Such a more complex operation would have smaller energy and delay, which would be very favorable, at the cost of specialized circuitry. The sensibility of such tradeoffs is an open question that needs to be addressed.

Finally, we must consider the question of dynamic range. Approaches to extend the dynamic range of such memories by using a binary-like encoding have been proposed by previous authors [36]. These techniques may be required to expand dynamic ranges to be compatible with future designs.

The utility of temporal computation in solving problems expressible by dynamic programming has been widely noted. Though the first race logic work was proposed as a hardware accelerator for dynamic programming algorithms, it was constrained in its design: a limited topology and a feed-forward, memoryless structure. Only the length of the shortest path was reported, with extra circuitry nominally required to report the path itself. Since then, other designs with state-of-the-art performance have been proposed, but they similarly suffer from an ad hoc design approach.

In this work we take first steps toward the generalizability of temporal computing. We provide a generalizable datapath and a mathematical algebra, expanding the logical framework of race logic. This leads to novel circuit designs that are informed by higher-level algorithmic requirements. The properties of abstraction and composability offered by the mathematical framework, coupled with native storage from the temporal memory, lend themselves to generalization. We design a state machine that can carry out both specialized and general graph algorithms, such as Needleman-Wunsch and Dijkstra's algorithm, respectively. The potential for general-purpose graph accelerators built on temporal computing motivates further exploration of temporal state machines.
The authors thank Brian Hoskins, Mark Anders, Jabez McClelland, Melika Payvand, George Tzimpragos, and Tim Sherwood for helpful discussions. Advait Madhavan acknowledges support under the Cooperative Research Agreement Award No. 70NANB14H209, through the University of Maryland.
REFERENCES
[1] Gina C Adam, Brian D Hoskins, Mirko Prezioso, Farnood Merrikh-Bayat, Bhaswar Chakrabarti, and Dmitri B Strukov. 2016. 3-D memristor crossbars for analog and neuromorphic computing applications. IEEE Transactions on Electron Devices 64, 1 (2016), 312–318.
[2] James B Aimone, Ojas Parekh, Cynthia A Phillips, Ali Pinar, William Severa, and Helen Xu. 2019. Dynamic Programming with Spiking Neural Computing. In Proceedings of the International Conference on Neuromorphic Systems. 1–9.
[3] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, et al. 2015. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 10 (2015), 1537–1557.
[4] Fabien Alibart, Ligang Gao, Brian D Hoskins, and Dmitri B Strukov. 2012. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23, 7 (2012), 075201.
[5] Xavier Allamigeon, Pascal Benchimol, Stéphane Gaubert, and Michael Joswig. 2014. Combinatorial simplex algorithms can solve mean payoff games. SIAM Journal on Optimization 24, 4 (2014), 2096–2117.
[6] Xavier Allamigeon, Pascal Benchimol, Stéphane Gaubert, and Michael Joswig. 2015. Tropicalizing the simplex algorithm. SIAM Journal on Discrete Mathematics 29, 2 (2015), 751–795.
[7] Guo-qiang Bi and Mu-ming Poo. 1998. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience 18, 24 (1998), 10464–10472.
[8] Sander M Bohte, Joost N Kok, and Han La Poutré. 2000. Spike-prop: backpropagation for networks of spiking neurons. In Proc. ESANN 2000. Citeseer.
[9] Geoffrey W Burr, Robert M Shelby, Severin Sidler, Carmelo Di Nolfo, Junwoo Jang, Irem Boybat, Rohit S Shenoy, Pritish Narayanan, Kumar Virwani, Emanuele U Giacometti, et al. 2015. Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element. IEEE Transactions on Electron Devices 62, 11 (2015), 3498–3507.
[10] Bhaswar Chakrabarti, Miguel Angel Lastras-Montaño, Gina Adam, Mirko Prezioso, Brian Hoskins, M Payvand, A Madhavan, A Ghofrani, L Theogarajan, K-T Cheng, et al. 2017. A multiply-add engine with monolithically integrated 3D memristor crossbar/CMOS hybrid circuit. Scientific Reports.
[11] IEEE, 433–445.
[12] Sridhar Chandrasekaran, Firman Mangasa Simanjuntak, R Saminathan, Debashis Panda, and Tseung-Yuen Tseng. 2019. Improving linearity by introducing Al in HfO2 as a memristor synapse device. Nanotechnology 30, 44 (2019), 445205.
[13] Pai-Yu Chen and Shimeng Yu. 2015. Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design. IEEE Transactions on Electron Devices 62, 12 (2015), 4022–4028.
[14] Zhengyu Chen and Jie Gu. 2019. 19.7 A Scalable Pipelined Time-Domain DTW Engine for Time-Series Classification Using Multibit Time Flip-Flops With 140 Giga-Cell-Updates/s Throughput. IEEE, 324–326.
[15] Sumit Choudhary, Mahesh Soni, and Satinder K Sharma. 2019. Low voltage & controlled switching of MoS2-GO resistive layers based ReRAM for non-volatile memory applications. Semiconductor Science and Technology 34, 8 (2019), 085009.
[16] Simon Davidson, Stephen B Furber, and Oliver Rhodes. 2020. Spiking Associative Memory for Spatio-Temporal Patterns. arXiv preprint arXiv:2006.16684 (2020).
[17] Mike Davies. 2019. Benchmarks for progress in neuromorphic computing. Nature Machine Intelligence 1, 9 (2019), 386–388.
[18] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.
[19] Piotr Dudek. 2006. An asynchronous cellular logic network for trigger-wave image processing on fine-grain massively parallel arrays. IEEE Transactions on Circuits and Systems II: Express Briefs 53, 5 (2006), 354–358.
[20] Tommas J Ellender, Wiebke Nissen, Laura L Colgin, Edward O Mann, and Ole Paulsen. 2010. Priming of hippocampal population bursts by individual perisomatic-targeting interneurons. Journal of Neuroscience 30, 17 (2010), 5979–5991.
[21] Zhisong Fu, Michael Personick, and Bryan Thompson. 2014. MapGraph: A high level API for fast development of high performance graph analytics on GPUs. In Proceedings of Workshop on GRAph Data management Experiences and Systems. 1–6.
[22] Elliot J Fuller, Farid El Gabaly, François Léonard, Sapan Agarwal, Steven J Plimpton, Robin B Jacobs-Gedrim, Conrad D James, Matthew J Marinella, and A Alec Talin. 2017. Li-ion synaptic transistor for low power analog computing. Advanced Materials 29, 4 (2017), 1604310.
[23] Steve B Furber, Francesco Galluppi, Steve Temple, and Luis A Plana. 2014. The SpiNNaker project. Proc. IEEE.
[24] IEEE, T230–T231.
[25] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A survey on graph processing accelerators: Challenges and opportunities. Journal of Computer Science and Technology 34, 2 (2019), 339–371.
[26] Rudy Guyonneau, Rufin VanRullen, and Simon J Thorpe. 2005. Neurons tune to the earliest spikes through STDP. Neural Computation 17, 4 (2005), 859–879.
[27] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. IEEE, 1–13.
[28] Kathleen E Hamilton, Tiffany M Mintz, and Catherine D Schuman. 2019. Spike-based primitives for graph algorithms. arXiv preprint arXiv:1903.10574 (2019).
[29] John L Hennessy and David A Patterson. 2019. A new golden age for computer architecture. Commun. ACM 62, 2 (2019), 48–60.
[30] Eugene M Izhikevich. 2006. Polychronization: computation with spikes. Neural Computation 18, 2 (2006), 245–282.
[31] Pulkit Jain, Umut Arslan, Meenakshi Sekhar, Blake C Lin, Liqiong Wei, Tanaya Sahu, Juan Alzate-Vinasco, Ajay Vangapaty, Mesut Meterelliyoz, Nathan Strutt, et al. 2019. 13.2 A 3.6 Mb 10.1 Mb/mm² Embedded Non-Volatile ReRAM Macro in 22nm FinFET Technology with Adaptive Forming/Set/Reset Schemes Yielding Down to 0.5 V with Sensing Time of 5 ns at 0.7 V. IEEE, 212–214.
[32] Zizhen Jiang, Shimeng Yu, Yi Wu, Jesse H Engel, Ximeng Guan, and H-S Philip Wong. 2014. Verilog-A compact model for oxide-based resistive random access memory (RRAM). IEEE, 41–44.
[33] Bill Kay, Prasanna Date, and Catherine Schuman. 2020. Neuromorphic Graph Algorithms: Extracting Longest Shortest Paths and Minimum Spanning Trees. In Proceedings of the Neuro-inspired Computational Elements Workshop. 1–6.
[34] Jeremy Kepner, Peter Aaltonen, David Bader, Aydin Buluç, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, et al. 2016. Mathematical foundations of the GraphBLAS. IEEE, Waltham, MA, 1–9.
[35] Dion Khodagholy, Jennifer N Gelinas, Thomas Thesen, Werner Doyle, Orrin Devinsky, George G Malliaras, and György Buzsáki. 2015. NeuroGrid: recording action potentials from the surface of the brain. Nature Neuroscience 18, 2 (2015), 310–315.
[36] KwangSeok Kim, Wonsik Yu, and SeongHwan Cho. 2014. A 9 bit, 1.12 ps resolution 2.5 b/stage pipelined time-to-digital converter in 65 nm CMOS using time-register. IEEE Journal of Solid-State Circuits 49, 4 (2014), 1007–1016.
[37] Xavier Lagorce and Ryad Benosman. 2015. Stick: spike time interval computational kernel, a framework for general purpose computation using neurons, precise timing, delays, and synchrony. Neural Computation 27, 11 (2015), 2261–2317.
[38] Can Li, Miao Hu, Yunning Li, Hao Jiang, Ning Ge, Eric Montgomery, Jiaming Zhang, Wenhao Song, Noraica Dávila, Catherine E Graves, et al. 2018. Analogue signal and image processing with large memristor crossbars. Nature Electronics 1, 1 (2018), 52.
[39] Y Li, S Kim, X Sun, P Solomon, T Gokmen, H Tsai, S Koswatta, Z Ren, R Mo, CC Yeh, et al. 2018. Capacitor-based cross-point array for analog neural network with record symmetry and linearity. IEEE, 25–26.
[40] R J Lipton. 1985. A systolic array for rapid string comparison. In Proc. of the Chapel Hill Conf. on VLSI, 1985. 363–376.
[41] Advait Madhavan, Timothy Sherwood, and D Strukov. 2017. A 4-mm² 180-nm-CMOS 15-Giga-cell-updates-per-second DNA sequence alignment engine based on asynchronous race conditions. IEEE, 1–4.
[42] Advait Madhavan, Timothy Sherwood, and Dmitri Strukov. 2014. Race logic: A hardware acceleration for dynamic programming algorithms. IEEE, 517–528.
[43] Advait Madhavan and Mark D Stiles. 2020. Storing and retrieving wavefronts with resistive temporal memory. Accepted at IEEE International Symposium for Circuits and Systems (ISCAS) 2020 (2020).
[44] Advait Madhavan, Georgios Tzimpragos, Mark Stiles, and Timothy Sherwood. 2019. A Truth-Matrix view into Unary Computing. First Unary Computing Workshop, ISCA 2019 (2019).
[45] Daisuke Miyashita, Shouhei Kousai, Tomoya Suzuki, and Jun Deguchi. 2017. A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE Journal of Solid-State Circuits 52, 10 (2017), 2679–2689.
[46] Daisuke Miyashita, Ryo Yamaki, Kazunori Hashiyoshi, Hiroyuki Kobayashi, Shouhei Kousai, Yukihito Oowaki, and Yasuo Unekawa. 2013. An LDPC decoder with time-domain analog and digital mixed-signal processing. IEEE Journal of Solid-State Circuits 49, 1 (2013), 73–83.
[47] Mehryar Mohri. 2002. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics 7, 3 (2002), 321–350.
[48] Harideep Nair, John Paul Shen, and James E. Smith. 2020. Direct CMOS Implementation of Neuromorphic Temporal Neural Networks for Sensory Processing. arXiv:cs.AR/2009.00457.
[49] M Hassan Najafi, David J Lilja, Marc D Riedel, and Kia Bazargan. 2018. Low-cost sorting network circuits using unary processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26, 8 (2018), 1471–1480.
[50] Taehwan Oh, Hariprasath Venkatram, and Un-Ku Moon. 2013. A time-based pipelined ADC using both voltage and time domain information. IEEE Journal of Solid-State Circuits 49, 4 (2013), 961–971.
[51] Garrick Orchard, Cedric Meyer, Ralph Etienne-Cummings, Christoph Posch, Nitish Thakor, and Ryad Benosman. 2015. HFirst: a temporal approach to object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 10 (2015), 2028–2040.
[52] Marc Osswald, Sio-Hoi Ieng, Ryad Benosman, and Giacomo Indiveri. 2017. A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems. Scientific Reports.
[53] Neural Computation 22, 2 (2010), 467–510.
[54] Filip Jan Ponulak and John J Hopfield. 2013. Rapid, parallel path planning by propagating wavefronts of spiking neural activity. Frontiers in Computational Neuroscience.
[55] IEEE, 1–4.
[56] Aseem Sayal, Shirin Fathima, SS Teja Nibhanupudi, and Jaydeep P Kulkarni. 2019. 14.4 All-Digital Time-Domain CNN Engine Using Bidirectional Memory Delay Lines for Energy-Efficient Edge Computing. IEEE, 228–230.
[57] Catherine D Schuman, Kathleen Hamilton, Tiffany Mintz, Md Musabbir Adnan, Bon Woong Ku, Sung-Kyu Lim, and Garrett S Rose. 2019. Shortest path and neighborhood subgraph extraction on a spiking memristive neuromorphic implementation. In Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop. 1–6.
[58] James E Smith. 2018. Space-time algebra: a model for neocortical computation. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 289–300.
[59] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2018. GraphR: Accelerating graph processing using ReRAM. IEEE, 531–543.
[60] Aaron Stillmaker and Bevan Baas. 2017. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 58 (2017), 74–81.
[61] Xiaoyu Sun and Shimeng Yu. 2019. Impact of non-ideal characteristics of resistive synaptic devices on implementing convolutional neural networks. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 3 (2019), 570–579.
[62] Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothee Masquelier, and Anthony Maida. 2018. Deep learning in spiking neural networks. Neural Networks (2018).
[63] Simon J Thorpe. 1990. Spike arrival times: A highly efficient coding scheme for neural networks. Parallel Processing in Neural Systems (1990), 91–94.
[64] Georgios Tzimpragos, Advait Madhavan, Dilip Vasudevan, Dmitri Strukov, and Timothy Sherwood. 2019. Boosted Race Trees for Low Energy Classification. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19).
[65] Georgios Tzimpragos, Nestan Tsiskaridze, Kylie Huch, Advait Madhavan, and Timothy Sherwood. 2019. From Arbitrary Functions to Space-Time Implementations. First Unary Computing Workshop, ISCA 2019 (2019).
[66] Georgios Tzimpragos, Dilip Vasudevan, Nestan Tsiskaridze, George Michelogiannakis, Advait Madhavan, Jennifer Volk, John Shalf, and Timothy Sherwood. 2020. A Computational Temporal Logic for Superconducting Accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 435–448.
[67] Hamed Vakili, Mohammad Nazmus Sakib, Samiran Ganguly, Mircea Stan, Matthew W Daniels, Advait Madhavan, Mark D Stiles, and Avik W Ghosh. 2020. Temporal Memory with Magnetic Racetracks. arXiv preprint arXiv:2005.10704 (2020).
[68] Rufin VanRullen, Rudy Guyonneau, and Simon J Thorpe. 2005. Spike times make sense. Trends in Neurosciences 28, 1 (2005), 1–4.
[69] Stephen J Verzi, Fredrick Rothganger, Ojas D Parekh, Tu-Thach Quach, Nadine E Miner, Craig M Vineyard, Conrad D James, and James B Aimone. 2018. Computing with spikes: The advantage of fine-grained timing. Neural Computation 30, 10 (2018), 2660–2690.
[70] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D Owens. 2016. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12.
[71] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. 2018. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience 12 (2018), 331.
[72] Qiangfei Xia and J Joshua Yang. 2019. Memristive crossbar arrays for brain-inspired computing. Nature Materials 18, 4 (2019), 309–323.
[73] Mingyu Yan, Xing Hu, Shuangchen Li, Abanti Basak, Han Li, Xin Ma, Itir Akgun, Yujing Feng, Peng Gu, Lei Deng, et al. 2019. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 615–628.
[74] Heemin Y Yang and Rahul Sarpeshkar. 2005. A time-based energy-efficient analog-to-digital converter. IEEE Journal of Solid-State Circuits 40, 8 (2005), 1590–1601.
[75] Ruriko Yoshida, Leon Zhang, and Xu Zhang. 2019. Tropical principal component analysis and its application to phylogenetics. Bulletin of Mathematical Biology 81, 2 (2019), 568–597.
[76] Shimeng Yu. 2018. Neuro-inspired computing with emerging nonvolatile memories. Proc. IEEE.
[77] Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 229–238.
[78] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. 2018. Tropical Geometry of Deep Neural Networks. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm, Sweden, 5824–5832. http://proceedings.mlr.press/v80/zhang18i.html
[79] Minxuan Zhou, Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana Rosing. 2019. GRAM: graph processing in a ReRAM-based computational memory. In Proceedings of the 24th Asia and South Pacific Design Automation Conference. 591–596.
[80] Shijie Zhou and Viktor K Prasanna. 2017. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD).