Variable Binding for Sparse Distributed Representations: Theory and Applications
E. Paxon Frady, Denis Kleyko, and Friedrich T. Sommer
Abstract—Symbolic reasoning and neural networks are often considered incompatible approaches in artificial intelligence. Connectionist models known as Vector Symbolic Architectures (VSAs) can potentially bridge this gap by enabling symbolic reasoning with distributed representations (Plate, 1994; Gayler, 1998; Kanerva, 1996). However, classical VSAs and neural networks are still incompatible because they represent information differently. VSAs encode symbols by dense pseudo-random vectors, where information is distributed throughout the entire neuron population. Neural networks encode features locally, by the activity of single neurons or small groups of neurons, often forming sparse vectors of neural activation (Hinton et al., 1986). Following Rachkovskij (2001); Laiho et al. (2015), we explore symbolic reasoning with sparse distributed representations.

The core operations in VSAs are dyadic operations between vectors to express variable binding and the representation of sets. Thus, algebraic manipulations enable VSAs to represent and process data structures of varying depth in a vector space of fixed dimensionality. Using techniques from compressed sensing, we first show that variable binding between dense vectors in classical VSAs (Gayler, 1998) is mathematically equivalent to tensor product binding (Smolensky, 1990) between sparse vectors, an operation which increases dimensionality. This theoretical result implies that dimensionality-preserving binding for general sparse vectors must include a reduction of the tensor matrix into a single sparse vector. Two options for sparsity-preserving variable binding are investigated. One binding method for general sparse vectors extends earlier proposals to reduce the tensor product into a vector, such as circular convolution (Plate, 1994). The other variable binding method is only defined for sparse block-codes (Gripon and Berrou, 2011): block-wise circular convolution (Laiho et al., 2015). Our experiments reveal that variable binding for block-codes has ideal properties, whereas binding for general sparse vectors also works, but is lossy, similar to previous proposals (Rachkovskij, 2001). We demonstrate a VSA with sparse block-codes in example applications, cognitive reasoning and classification, and discuss its relevance for neuroscience and neural networks.
Index Terms—vector symbolic architectures, compressed sensing, tensor product variable binding, sparse distributed representations, sparse block-codes, cognitive reasoning, classification
I. INTRODUCTION
(Author affiliations: E. P. Frady and F. T. Sommer are with the Neuromorphic Computing Lab, Intel Labs, and also with the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley, CA 94720, USA. D. Kleyko is with the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley, CA 94720, USA, and also with the Intelligent Systems Lab at Research Institutes of Sweden, 164 40 Kista, Sweden.)

In a traditional computer, the internal representation of data is organized by data structures. A data structure is a collection of data values with their relationships. For example, a simple data structure is a key-value pair, relating a variable name to its assigned value. Particular variables within data structures can be individually accessed for computations. Data structures are the backbones for computation, and are needed for organizing, storing, managing and manipulating information in computers.

For many tasks that brains have to solve, for instance analogical inference in cognitive reasoning tasks and invariant pattern recognition, it is essential to represent knowledge in data structures and to query the components of data structures on the fly. It has been a long-standing debate if, and if so how, brains can represent data structures with neural activity and implement algorithms for their manipulation (Fodor et al., 1988).

Here, we revisit classical connectionist models (Plate, 1994; Kanerva, 1996; Gayler, 1998) that propose encodings of data structures with distributed representations. Following Gayler (2003), we will refer to these models as
Vector Symbolic Architectures (VSAs); synonymously, their working principles are sometimes summarized as hyperdimensional computing (Kanerva, 2009). Typically, VSA models use dense random vectors to represent atomic symbols, such as variable names and feature values. Atomic symbols can be combined into compound symbols that are represented by vectors that have the same dimension. The computations with pseudo-random vectors in VSAs rest on the concentration of measure phenomenon (Ledoux, 2001), namely that random vectors become almost orthogonal in large vector spaces (Frady et al., 2018).

In neural networks, features are encoded locally by the activity of a single or of a few neurons. Also, patterns of neural activity are often sparse, i.e. there are only few nonzero elements (Willshaw et al., 1969; Olshausen and Field, 1996). Connectionists attempted to use such local feature representations in models describing computations in the brain. However, a critical issue emerged with these representations, known as the binding problem in neuroscience. This problem occurs when a representation requires the encoding of sets of feature conjunctions, for example when representing a red triangle and a blue square (Treisman, 1998). Just representing the color and shape features would lose the binding information that the triangle is red, not the square. One solution proposed for the binding problem is the tensor product representation (TPR) (Smolensky, 1990), where a neuron is assigned to each combination of feature conjunctions. However, when expressing hierarchical data structures, the dimensionality of TPRs grows exponentially with hierarchical depth. One proposal to remedy this issue is to form reduced representations of TPRs, so that the resulting representations have the same dimensions as the atomic vectors (Hinton et al., 1990; Plate, 1993). This has been the inspiration of VSAs, which have proposed various algebraic operations for binding that preserve dimensionality. Building on earlier work on sparse VSAs (Rachkovskij, 2001; Laiho et al., 2015), we investigate the possibility to build binding operations for sparse patterns that preserve dimensionality and sparsity.

The paper is organized as follows. In Section II, the background for our study is introduced, which covers different flavors of symbolic reasoning, sparse distributed representations, and the basics of compressed sensing. In Section III-A, compressed sensing is employed to establish the equivalence between the dense representations in classical VSA models and sparse representations. This treatment reveals the operations between sparse vectors that are induced by VSA operations defined on dense vectors. Interestingly, we find that the classical dimensionality-preserving operations in VSAs induce equivalent operations between sparse vectors that do not preserve dimensionality. Section III-B introduces and investigates concrete methods for dimensionality- and sparsity-preserving variable binding. Known binding methods, such as circular convolution (Plate, 1994) and vector-derived transformation binding (Gosmann and Eliasmith, 2019), lead to binding operations that are dimensionality- but not sparsity-preserving. We investigate two solutions for sparsity-preserving binding, one for general sparse vectors, and one for the subset of sparse block vectors (block-codes). Section III-C demonstrates the most promising solution, a VSA with sparse block-codes, in two applications. In Section IV, we summarize our results and discuss their implications.

II. BACKGROUND
A. Models for symbolic reasoning
Many connectionist models for symbolic reasoning with vectors use vector addition (or a thresholded form of it) to express sets of symbols. But the models characteristically deviate in encoding strategies and in their operation for binding. TPRs (Smolensky, 1990) use real-valued localist feature vectors $x, y \in \mathbb{R}^N$ and the outer product $x y^\top \in \mathbb{R}^{N \times N}$ as the binding operation. This form of tensor product binding encodes compound data structures by representations that have higher dimensions than those of atomic symbols. The deeper a hierarchical data structure, the higher the order of the tensor.

Building on Hinton's concept of reduced representations (Hinton, 1990), several VSA models were proposed (Plate, 1994; Kanerva, 1996; Gayler, 1998) in which atomic and composed data structures have the same dimension. These models encode atomic symbols by pseudo-random vectors, and the operations for set formation and binding are designed in a way that representations of compound symbols still resemble random vectors. The operations for addition ($+$) and binding ($\circ$) are dyadic operations that form a ring-like structure. The desired properties for a binding operation are:

i) Associative, i.e., $(a \circ b) \circ c = a \circ (b \circ c) = (a \circ c) \circ b$.
ii) Distributes over addition, i.e., $\sum_i^D a_i \circ \sum_j^D b_j = \sum_{i,j}^{D,D} c_{ij}$ with $c_{ij} = a_i \circ b_j$.
iii) Has an inverse operation to perform unbinding.

Holographic Reduced Representation (HRR) (Plate, 1991, 1995) was probably the earliest formalized VSA; it uses real-valued Gaussian random vectors and circular convolution as the binding operation. Circular convolution is the standard convolution operation used in the discrete finite Fourier transform, which produces a vector from two input vectors $x$ and $y$:

$$(x \circ y)_k := (x \ast y)_k = \sum_{i=1}^{N} x_{(i-k) \bmod N} \, y_i \qquad (1)$$

Other VSA models use binding operations based on projections of the tensor product matrix that only sample the matrix diagonal. For example, the Binary Spatter Code (BSC) (Kanerva, 1996) uses binary random vectors, and binding is the XOR operation between components with the same index.
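As a concrete illustration of the HRR-style binding in (1), the following minimal sketch binds two Gaussian random vectors by circular convolution via the FFT and unbinds with the usual approximate (involutive) inverse. The vector dimension, seed and normalization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def circular_convolution(x, y):
    """Circular convolution binding (eq. 1), computed via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

N = 1000
rng = np.random.default_rng(0)
x = rng.normal(0, 1 / np.sqrt(N), N)
y = rng.normal(0, 1 / np.sqrt(N), N)

b = circular_convolution(x, y)                    # bound vector
# Approximate inverse (involution): keep the first element, reverse the rest.
y_inv = np.concatenate(([y[0]], y[1:][::-1]))
x_hat = circular_convolution(b, y_inv)            # noisy reconstruction of x
print(np.corrcoef(x, x_hat)[0, 1])                # substantially positive; unbinding is approximate
```

An exact inverse also exists (division in the Fourier domain) but amplifies noise; the involution is the standard approximate unbinding in HRR.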
In the following, we focus on the Multiply-Add-Permute (MAP) model (Gayler, 1998), which uses bipolar atomic vectors whose components are -1 and 1. Atomic features or symbols are represented by random vectors of a matrix $\Phi$, called the codebook. The columns of $\Phi$ are normalized i.i.d. random code vectors, $\Phi_i \in \{\pm 1\}^N$. The binding operation is the Hadamard product between the two vectors:

$$x \circ y := x \odot y = (x_1 y_1, x_2 y_2, ..., x_N y_N)^\top \qquad (2)$$

When the binding involves just a scalar value, the multiplication operation (2) relaxes to ordinary vector-scalar multiplication. A feature with a particular value is simply represented by the vector representing the feature, $\Phi_i$ (which acts like a "key"), multiplied with the scalar representing the "value" $a_i$: $x = \Phi_i a_i$.

For representing a set of features, the generic vector addition is used, and the vector representing a set of features with specific values is then given by:

$$x = \Phi a \qquad (3)$$

Here, the nonzero components of $a$ represent the values of features contained in the set; the zero components label the features that are absent from the set.

Although the representation $x$ of this set is lossy, a particular feature value can be approximately decoded by forming the inner product with the corresponding "key" vector:

$$a_i \approx \Phi_i^\top x / N, \qquad (4)$$

where $N$ is the dimension of the vectors. The cross-talk noise in the decoding (4) decreases with the square root of the dimension of the vectors or by increasing the sparseness in $a$; for an analysis of this decoding procedure, see (Frady et al., 2018).

To represent a set of sets, one cannot simply form a sum of the compound vectors. This is because a feature binding problem occurs, and the set information on the first level is lost. VSAs can solve this issue by combining addition and binding to form a representation of a set of compound objects in which the integrity of individual objects is preserved. This is sometimes called the protected sum of $L$ objects:

$$s = \sum_{j}^{L} \Psi_j \odot x_j \qquad (5)$$

where $\Psi_j$ are dense bipolar random vectors that label the different compound objects. Another method for representing protected sums uses powers of a single random permutation matrix $P$ (Laiho et al., 2015; Frady et al., 2018):

$$s = \sum_{j}^{L} P^{(j-1)} x_j \qquad (6)$$

In general, algebraic manipulation in VSAs yields a noisy representation of the result of a symbolic reasoning procedure. To filter out the result, a so-called cleanup memory is required, which is typically a nearest-neighbor search in a content-addressable memory or associative memory (Willshaw et al., 1969; Palm, 1980; Hopfield, 1982) storing the codebook(s).
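The MAP operations (2)-(5) are easy to reproduce numerically. The sketch below builds a bipolar codebook, encodes a small feature set, reads one value back with the inner-product decoder (4), and forms a protected sum of two compound vectors. The dimensions, indices and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10_000, 50                       # vector dimension, number of atomic symbols

Phi = rng.choice([-1, 1], size=(N, M))  # bipolar codebook, columns are atomic vectors

def bind(x, y):
    """MAP binding: component-wise (Hadamard) product, eq. (2)."""
    return x * y

# Represent a set of features with values via x = Phi a, eq. (3).
a = np.zeros(M)
a[[3, 17, 42]] = [1.0, -2.0, 0.5]       # three active features (hypothetical values)
x = Phi @ a

# Decode a single feature value with the inner-product readout, eq. (4).
a3_hat = Phi[:, 3] @ x / N              # approx. 1.0 plus cross-talk noise
print(a3_hat)

# Protected sum of two compound vectors using random "label" vectors, eq. (5).
Psi = rng.choice([-1, 1], size=(N, 2))
x1, x2 = Phi @ a, Phi @ np.roll(a, 1)
s = bind(Psi[:, 0], x1) + bind(Psi[:, 1], x2)
# Unprotect the first object (Hadamard binding of bipolar vectors is its own inverse).
x1_hat = bind(Psi[:, 0], s)             # x1 plus cross-talk from the second term
print(np.corrcoef(x1, x1_hat)[0, 1])
```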
B. Sparse distributed representations

The classical VSAs described in the previous section use dense representations, that is, vectors in which most components are nonzero. In the context of neuroscience and neural networks for unsupervised learning and synaptic memory, another type of representation has been suggested: sparse representations. In sparse representations, a large fraction of components are zero, e.g. most neurons are silent. Sparse representations capture essential aspects of receptive field properties seen in neuroscience when encoding sensory inputs, such as natural images or natural sound (Olshausen and Field, 1996; Bell and Sejnowski, 1997).

Here, we will investigate how sparse representations can be used in VSAs. For the cleanup required in VSAs, sparse representations have the advantage that they can be stored more efficiently than dense representations in Hebbian synapses (Willshaw et al., 1969; Palm, 1980; Tsodyks and Feigel'man, 1988; Palm and Sommer, 1992; Frady and Sommer, 2019). However, how the algebraic operations in VSAs can be performed with sparse vectors has only been addressed in a few previous studies (Rachkovskij, 2001; Laiho et al., 2015).

A particular type of sparse representation with additional structure has been proposed for symbolic reasoning before: sparse block-codes (Laiho et al., 2015). In a $K$-sparse block-code, the ratio between active components and total number of components is $K/N$, as usual. But the index set is partitioned into $K$ blocks, each block of size $N/K$, with one active element in each block. Thus, the activity in each block is maximally sparse: it contains only a single nonzero component.

The constraint of a sparse block-code reduces the entropy in a code vector significantly, from $\log\binom{N}{K}$ to $K \log(N/K)$ bits (Gritsenko et al., 2017). At the same time, the block constraint can also be exploited to improve retrieval in Hebbian associative memory. As a result, the information capacity of associative memories with Hebbian synapses for block-coded sparse vectors is almost the same as for unconstrained sparse vectors (Gripon and Berrou, 2011; Knoblauch and Palm, 2020). Sparse block-codes also may reflect coding principles observed in the brain, such as competitive mechanisms between sensory neurons representing different features (Heeger, 1992), as well as orientation hypercolumns seen in the visual system of certain species (Hubel and Wiesel, 1977).

(Footnote: Note that sparse block-codes differ from sparse block signals (Eldar et al., 2010); in the latter, the activity within blocks can be non-sparse, but the set of nonzero blocks is $K'$-sparse, with $K' \ll K$. The resulting $N$-dimensional vectors have a ratio between active components and total number of components of $K' L / N = K'/K$.)

Recent proposals also include sparse phasor-codes for representing information, where the active elements in the population are not binary, but complex-valued with binary magnitudes and arbitrary phases (Frady and Sommer, 2019). Such a coding scheme may be relevant for neuroscience, as it can be represented with spikes and spike timing. VSA architectures have also been demonstrated in the complex domain (Plate, 2003), which use dense vectors of unit-magnitude phasors as atomic symbols. Here, we also propose and analyze a variation of the block-code where active entries are phasors.

C. Compressed sensing
Under certain conditions, there is a unique equivalence between sparse and dense vectors that has been investigated under the name compressed sensing (CS) (Candes et al., 2006; Donoho et al., 2006). Many types of measurement data, such as images or sounds, have a sparse underlying structure, and CS can be used as a compression method, in applications or even for modeling communication in biological brains (Hillar and Sommer, 2015). For example, one assumes that the data vectors are $K$-sparse, that is:

$$a \in A_K := \{ a \in \mathbb{R}^M : \|a\|_0 \leq K \} \qquad (7)$$

with $\|\cdot\|_0$ the L0-norm. In CS, the following linear transformation creates a dimensionality-compressed dense vector from the sparse data vector:

$$x = \Xi a \qquad (8)$$

where $\Xi$ is an $N \times M$ random sampling matrix, with $N < M$. Due to the distribution of sparse random vectors $a$, the statistics of the dimensionality-compressed dense vectors $x$ becomes somewhat non-Gaussian. The data vector can be recovered from the compressed vector $x$ by solving the following sparse inference problem:

$$\hat{a} = \mathrm{argmin}_{a} \, \|x - \Xi a\|_2^2 + \lambda |a|_1 \qquad (9)$$

The conditions on $K$, $N$, $M$ and $\Xi$ under which the recovery (9) is possible form the cornerstones of compressed sensing (Donoho et al., 2006; Candes et al., 2006).

For CS to work, a necessary condition is that the sampling matrix is injective for the sparse data vectors, i.e. that the intersection between the kernel of the sampling matrix, $\mathrm{Ker}(\Xi) = \{a : \Xi a = 0\}$, and the set of sparse data vectors, $A_K$, is empty: $\mathrm{Ker}(\Xi) \cap A_K = \emptyset$. But this condition does not guarantee that each data vector has a unique dense representation. In other words, the mapping between data vectors and dense representations must also be bijective. To guarantee uniqueness of the dense representation of $K$-sparse vectors, the kernel of the sampling matrix must not contain any $(2K+1)$-sparse vector:

$$\mathrm{Ker}(\Xi) \cap A_{2K+1} = \emptyset \qquad (10)$$

with $A_{2K+1}$ being the set of $(2K+1)$-sparse vectors. Intuitively, condition (10) excludes that any two $K$-sparse data vectors can have the same dense representation: $a_1 \neq a_2 : \Xi a_1 - \Xi a_2 = 0$. Even with condition (10), it still might not be possible to infer the sparse data vectors from the dense representations (9) in the presence of noise. Another common criterion for CS to work is the $s$-restricted isometry property (RIP):

$$(1 - \delta_s) \|a_s\|_2^2 \leq \|\Xi a_s\|_2^2 \leq (1 + \delta_s) \|a_s\|_2^2 \qquad (11)$$

with the vector $a_s$ being $s$-sparse, and the RIP constant $\delta_s \in (0, 1)$. The choice $\delta_{2K+1} = 1$ is equivalent to condition (10). With a choice $\delta_{2K+1} = \delta^* < 1$, one can impose a more stringent condition that enables the inference, even in the presence of noise. The minimal dimension of the compressed vector that guarantees (11) is typically linear in $K$ but increases only logarithmically with $M$:

$$N \geq C\, K \log\left(\frac{M}{K}\right) \qquad (12)$$

where $C$ is a constant of order $O(1)$ that depends on $\delta_{2K+1}$. Here, we will use the uniqueness conditions (10) and (11) to assess the equivalence between different models of symbolic reasoning.
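To make the compression (8) and recovery (9) concrete, the sketch below compresses a K-sparse vector with a random bipolar sampling matrix and recovers it with a simple iterative soft-thresholding (ISTA) solver for the LASSO objective. The solver choice, sizes and penalty are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 500, 150, 10                        # sparse dim, dense dim, sparsity (illustrative)

a = np.zeros(M)
a[rng.choice(M, K, replace=False)] = rng.normal(0, 1, K)    # K-sparse data vector, eq. (7)
Xi = rng.choice([-1.0, 1.0], size=(N, M)) / np.sqrt(N)      # bipolar sampling matrix, eq. (8)
x = Xi @ a                                                   # compressed dense vector

def ista(x, Xi, lam=1e-3, n_iter=2000):
    """Iterative soft-thresholding for the sparse inference problem, eq. (9)."""
    L = np.linalg.norm(Xi, 2) ** 2            # Lipschitz constant of the quadratic term
    a_hat = np.zeros(Xi.shape[1])
    for _ in range(n_iter):
        grad = Xi.T @ (Xi @ a_hat - x)
        z = a_hat - grad / L
        a_hat = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a_hat

a_hat = ista(x, Xi)
print(np.max(np.abs(a - a_hat)))              # small reconstruction error when (12) holds
```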
III. RESULTS

A. Equivalent representations with sparse vs. dense vectors
In this section, we consider a setting where sparse and dense symbolic representations can be directly compared. Specifically, we ask what operations between $K$-sparse vectors are induced by the operations in the MAP VSA. To address this question, we map $K$-sparse feature vectors to corresponding dense vectors via (3). The column vectors of the codebook in (3) correspond to the atomic dense vectors in the VSA. We choose the dimension $N$ and the properties of the codebook(s) and sparse random vectors so that the CS condition (10) is fulfilled. Thus, each sparse vector has a unique dense representation and vice versa.
1) Improved VSA decoding based on CS:
In our setting, the coefficient vector $a$ is sparse. The standard decoding method in VSAs (4) provides a noisy estimate of the sparse vector (Fig. 1) from the dense representation. However, if the sparse vector and the codebook $\Phi$ in (3) satisfy the compressed sensing conditions, one can do better: decoding à la CS (9) achieves near-perfect accuracy (Fig. 1). Note that sparse inference requires that the entire coefficient vector $a$ is decoded at once, similar to Ganguli and Sompolinsky (2010), while with (4) individual values $a_i$ can be decoded separately. If the CS condition is violated, sparse inference (9) abruptly ceases to work, while the VSA decoding with (4) gradually degrades, see Frady et al. (2018).

(Footnote: In compressed sensing, the choice of sampling matrices with binary or bipolar random entries is common, e.g., (Amini and Marvasti, 2011).)
2) Variable binding operation:
The Hadamard product between dense vectors turns out to be a function of the tensor product, i.e. the TPR, of the corresponding sparse vectors:

$$(x \odot y)_i = (\Phi a)_i (\Psi b)_i = \sum_{l} \Phi_{il} a_l \sum_{k} \Psi_{ik} b_k = \sum_{lk} \Phi_{il} \Psi_{ik} a_l b_k = \big( (\Phi \circledast \Psi)\, \mathrm{vec}(a b^\top) \big)_i \qquad (13)$$

This linear relationship between the Hadamard product of two vectors and the TPR can be seen as a generalization of the Fourier convolution theorem, see Appendix A.

The reshaping of the structure on the RHS of (13) also shows that there is a relationship to the matrix-vector multiplication in CS sampling (8): the ravelled tensor product matrix of the sparse vectors becomes an $M^2$-dimensional vector $\mathrm{vec}(a b^\top)$ with $K^2$ nonzero elements. Further, $(\Phi \circledast \Psi)$ is an $N \times M^2$ sampling matrix, formed by the pair-wise Hadamard products of vectors in the individual dictionaries $\Phi$ and $\Psi$:

$$(\Phi \circledast \Psi) := (\Phi_1 \odot \Psi_1, \Phi_1 \odot \Psi_2, ..., \Phi_M \odot \Psi_M) \qquad (14)$$

One can now ask under what conditions the Hadamard product and the tensor product become mathematically equivalent, that is, whether any sparse tensor product in (13) can be uniquely inferred from the Hadamard product using a CS inference procedure (9). The following two lemmas consider a worst-case scenario in which there is equivalence between the atomic sparse and dense vectors, which requires that the sparks of the individual codebooks are at least $2K+1$.
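The identity (13) is straightforward to verify numerically. The sketch below builds the pairwise-product sampling matrix of (14) for two small bipolar codebooks and checks that it maps the ravelled outer product of the sparse coefficient vectors onto the Hadamard product of the dense vectors. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 64, 12, 3                          # illustrative sizes

Phi = rng.choice([-1, 1], size=(N, M))       # codebook for x
Psi = rng.choice([-1, 1], size=(N, M))       # codebook for y

def k_sparse(M, K):
    v = np.zeros(M)
    v[rng.choice(M, K, replace=False)] = rng.normal(0, 1, K)
    return v

a, b = k_sparse(M, K), k_sparse(M, K)
x, y = Phi @ a, Psi @ b                      # dense representations, eq. (3)

# Sampling matrix of eq. (14): columns are Hadamard products of all codevector pairs.
PhiPsi = np.column_stack([Phi[:, l] * Psi[:, k] for l in range(M) for k in range(M)])

lhs = x * y                                  # Hadamard product of the dense vectors
rhs = PhiPsi @ np.outer(a, b).ravel()        # sampled tensor product, eq. (13)
print(np.allclose(lhs, rhs))                 # True
```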
Lemma 1: Let $\mathrm{Spark}(\Phi) = \mathrm{Spark}(\Psi) = 2K+1$. Then the spark of the sampling matrix in (13) is $\mathrm{Spark}((\Phi \circledast \Psi)) \leq 2K+1$.

Proof:
Choose a $(2K+1)$-sparse vector $c$ in the kernel of $\Phi$, and choose any cardinal vector $b_j := (0, ..., 0, 1, 0, ..., 0)^\top$ with the nonzero component at index $j$. Then we have: $0 = (\Phi c) \odot \Psi_j = \sum_{i \in \alpha} c_i\, \Phi_i \odot \Psi_j = (\Phi \circledast \Psi)\, \mathrm{vec}(c \otimes b_j)$, where $\alpha$ denotes the support of $c$. Thus the $(2K+1)$-sparse vector $\mathrm{vec}(c \otimes b_j)$ lies in the kernel of the sampling matrix in (13). There is also a small probability that the construction of $(\Phi \circledast \Psi)$ produces a set of fewer than $2K+1$ columns that are linearly dependent. $\square$

Lemma 1 reveals that the sampling matrix (14) certainly does not allow the recovery of $K^2$-sparse patterns in general. However, this is not required, since the reshaped outer products of $K$-sparse vectors form a subset of the $K^2$-sparse patterns. The following lemma shows that for this subset recovery can still be possible.

Lemma 2:
The difference between the outer products of pairs of $K$-sparse vectors cannot fully coincide in support with the $(2K+1)$-sparse vectors in the kernel of the sampling matrix of (13) as identified by Lemma 1. Thus, although $\mathrm{Spark}((\Phi \circledast \Psi)) \leq 2K+1$, the recovery of reshaped tensor products from the Hadamard product can still be possible.

Fig. 1. Readout of sparse coefficients from dense distributed representation.
A. The sparse coefficients (left) are stored as a dense representation using a random codebook. The coefficients are recovered with standard VSA readout (middle) and with sparse inference (right), which reduces the crosstalk noise. B. Two sparse coefficients are stored as a protected set (left). Readout with sparse inference reduces crosstalk noise, but some noise can remain depending on the sparsity penalty.
Proof:
The $(2K+1)$-sparse vectors in the kernel of the sampling matrix $(\Phi \circledast \Psi)$ identified in Lemma 1 correspond to an outer product of a $(2K+1)$-sparse vector with a $1$-sparse vector. The resulting matrix has $2K+1$ nonzero components in one single column.

The difference of two outer products of $K$-sparse vectors yields a matrix which can have maximally $2K$ nonzero components in one column. Thus, the sampling matrix should enable the unique inference of the tensor product from the Hadamard product of the dense vectors. $\square$

Lemmas 1 and 2 investigate the equivalence of Hadamard and tensor product binding in the worst case, that is, when the codebooks have the minimum spark that still guarantees the unique equivalence between the sparse and dense atomic vectors. To explore the equivalence in the case of random codebooks, we performed simulation experiments with a large ensemble of randomly generated codebook pairs $(\Phi, \Psi)$. Fig. 2 shows the averaged worst (i.e., highest) RIP constant among the ensembles for inferring the tensor product from the Hadamard product (solid red line).

Compared to the RIP constant for inferring the sparse representations of atomic vectors (black line), the RIP constant for inferring the tensor product (red line) is significantly higher. Thus, tensor product and Hadamard product are not always equivalent even if the atomic sparse and dense vectors are equivalent; in the example, this occurs when the dimension of the dense vectors is between N = 40 and N = 140. However, with the dimension of the dense vectors large enough (N > 140), the equivalence holds. Further, the controls in Fig. 2 help to explain the reasons for the gap in equivalence for small dense vectors.
Fig. 2. Worst-case RIP constant for inferring sparse tensor products in an ensemble of random codebooks. The largest empirical RIP constant ($\delta_s$) in an ensemble of 10 pairs of pseudo-random dictionaries $\Phi$, $\Psi$. For each pair, the maximum RIP was determined by compressing 10000 sparse vectors. For successful inference of the sparse representations, the RIP constant has to be below the $\delta_s = 1$ level (yellow line). The black solid line represents the RIP for inferring atomic sparse vectors from dense vectors formed according to (8). The red solid line represents the RIP for inferring tensor products from dense vectors formed according to (13). Other lines in the diagram are controls. The blue solid line represents the RIP for an $(N \times M^2)$ random dictionary in which all elements are independently sampled rather than constructed as $\Phi \circledast \Psi$ from the smaller dictionaries. Dashed and dotted red lines represent the RIP using the $\Phi \circledast \Psi$ sampling matrix with sparse vectors with independent random components, rather than formed by a tensor product $\mathrm{vec}(a b^\top)$ of two random vectors. The red dashed line is for real-valued vectors with elements sampled from a Chi-squared distribution, the red dotted line for binary random vectors. Dashed and dotted blue lines represent the RIP for the same type of independent random vectors with the independent random sampling matrix.

The RIP constants are significantly reduced if the tensor product is subsampled with a fully randomized matrix (solid blue line), rather than with the sampling matrix resulting from (13). In contrast, the requirement to infer outer products of continuous-valued random vectors (solid red line) does not much increase the RIP values over the RIP requirement for the inference of outer products of binary vectors (dotted red line). Thus, we conclude that sampling with the matrix $\Phi \circledast \Psi$ (14), which is not i.i.d. random but formed by a deterministic function from the smaller atomic random sampling matrices, requires a somewhat bigger dimension of the dense vectors to be invertible.

Here we have shown that under certain circumstances the binding operation between dense vectors in the MAP VSA is mathematically equivalent to the tensor product between the corresponding sparse vectors. This equivalence reveals a natural link between two prominent proposals for symbolic binding in the literature: the dimensionality-preserving binding operations in VSA models and the tensor product in the TPR model (Smolensky, 1990; Smolensky et al., 2016). In other VSA models, such as HRR (Plate, 2003), atomic symbols are represented by dense Gaussian vectors and the binding operation is circular convolution. Our treatment can be extended to these models by simply noting that, by the Fourier convolution theorem (28), circular convolution is equivalent to the Hadamard product in the Fourier domain, i.e. $x \ast y = \mathcal{F}^{-1}(\mathcal{F}(x) \odot \mathcal{F}(y))$.
3) Set operations:
Summing dense vectors corresponds to summing the sparse vectors:

$$x + y = \Phi(a + b) \qquad (15)$$

Thus, the sum operation represents a bag of features from all objects, but the grouping information on how these features were configured in the individual objects is lost. The inability to recover the individual compound objects from the sum representation has been referred to as the binding problem in neuroscience (Treisman, 1998).

The protected sum of set vectors (5) can resolve the binding problem. This relies on binding the dense representations of the individual objects to a set of random vectors that act as keys, stored in the codebook $\Psi$ (5):

$$\sum_{j}^{L} \Psi_j \odot x_j = \sum_{j}^{L} \Psi_j \odot \sum_{i}^{M} \Phi_i a_{ji} = (\Phi \circledast \Psi)(a_1, a_2, ..., a_L) \qquad (16)$$

This shows that the protected sum can be computed from the concatenation of sparse vectors. The concatenation of sparse vectors is a representation that fully contains the binding information, but again leads to an increase in dimensionality. Similar to (13), (16) describes linear sampling of a sparse vector as in compressed sensing. The sampling matrix $(\Phi \circledast \Psi)$ is an $N \times ML$ sampling matrix formed by each pair of vectors in $\Phi$ and $\Psi$, as in (14), and the sparse vector is the $ML$-dimensional concatenation vector.

We again ask under what conditions the sparse concatenation vector can be uniquely inferred given the dense representation of the protected sum, which makes the dense and sparse representations equivalent. As in Section III-A2, we first look at the worst-case scenario, and then perform an experiment with codebooks composed of random vectors. The worst-case scenario assumes the spark of $\Phi$ to be $2K+1$, just big enough that atomic vectors can be inferred uniquely. By Lemma 1, the spark of the sampling matrix is smaller than or equal to $2K+1$, smaller than the sparsity $KL$ of the vectors to be inferred. Again, the vectors to be inferred are a subset of the $KL$-sparse vectors, namely the vectors that have $K$ nonzero components in each of the $L$ compartments of size $M$. Thus, as in Lemma 2 for the Hadamard product, the difference formed by two of these vectors can maximally produce $2K$ nonzero components in each compartment, and therefore never coincide with a kernel vector of the sampling matrix.

Fig. 3. Worst-case RIP constant for inferring sparse representations of protected sums in an ensemble of random codebooks. The largest empirical RIP constant ($\delta_s$) in an ensemble of 10 pairs of pseudo-random dictionaries $\Phi$, $\Psi$. For each pair, the maximum RIP was determined by compressing 10000 sparse vectors. For successful inference of the sparse representations, the RIP constant has to be below the $\delta_s = 1$ level (yellow line). The red solid line represents RIP values for inferring atomic sparse vectors from dense vectors formed according to (8). The blue dashed line represents RIP values for inferring the protected sum from dense vectors formed according to (16). For comparison, the black dotted line represents RIP values for inferring the protected sum when, instead of $\Phi \circledast \Psi$, the dictionary is random.

Fig. 3 shows the results of simulation experiments with an ensemble of random codebooks. For the protected sum, the worst RIP values of the inference of individual sparse vectors versus the list of sparse vectors composing the protected sum do coincide. Thus, the dense protected sum vector and the list of sparse feature vectors are equivalent.

The alternative method of forming a protected sum (6), using powers of a permutation matrix $P$, corresponds equally to a sampling of the concatenation of the sparse vectors. As long as the sampling matrices $(\Phi, P\Phi, P^2\Phi, ..., P^{(L-1)}\Phi)$ and $(\Phi \circledast \Psi)$ have similar properties, the conditions for equivalence between the protected sum and the concatenated sparse vectors hold.

B. Dimension- and sparsity-preserving VSA operations
The results from Sect. III-A reveal that variable binding and the protected set representation in classical VSA models induce equivalent operations between sparse vectors that are not dimensionality preserving. Thus, dimensionality-preserving operations for binding and protected sum involve potentially lossy transformations of the higher-dimensional data structure into a single vector. However, dimensionality-preserving binding operations have only been defined for dense VSA representations. In the following, we investigate binding operations on sparse VSA representations that are both dimensionality- and sparsity-preserving, one for general sparse vectors and one for sparse vectors with block structure.
1) Sparsity-preserving binding for general K-sparse vectors: Binding operations in VSAs can all be described as a projection of the tensor product to a vector, including the Hadamard product, circular convolution binding (Plate, 2003) and vector-derived transformation binding (VDTB) (Gosmann and Eliasmith, 2019), see Appendix (26). However, when applied to sparse atomic vectors, these operations do not preserve sparsity: circular convolution produces a vector with reduced sparsity, while the Hadamard product increases sparsity.

Fig. 4. Circuits for sparsity-preserving binding: Three pools of neurons (blue: two inputs, red: output) represent the sparse neural activity patterns $a$, $b$ and $c$. The dendritic tree of the output neurons contains coincidence detectors that detect pairs of co-active axons (red circles), and the soma (red triangles) sums up several coincidence detectors based on the required fan-in. Each neuron samples only a subset of the outer product depending on the desired sparsity and threshold settings. The subsampling pattern of neurons is described by a binary tensor $W_{lij} \in \{0, 1\}$, where $i, j$ index the coincidence point and $l$ the postsynaptic neuron. We examine three different sampling strategies: random sampling, structured sampling, and the block-code.

Ideally, a sparsity-preserving VSA binding operation operates on two atomic vectors that are K-sparse and produces a K-sparse vector that has the correct algebraic properties. To preserve sparsity, we developed a binding operation that is a projection from a sub-sampling of the tensor product. We refer to this operation as sparsity-preserving tensor projection (SPTP). Given two K-sparse binary vectors $a$ and $b$, SPTP variable binding is given by:

$$(a \circ b)_l = H\Big(\sum_{ij} W_{lij}\, a_i b_j - \theta\Big) \qquad (17)$$

Here $H(x)$ is the Heaviside function and $\theta$ is a threshold. For a pair of K-sparse complex phasor vectors, SPTP binding is defined as:

$$(a \circ b)_l = \frac{z_l}{|z_l|}\, H(|z_l| - \theta), \qquad z_l = \sum_{ij} W_{lij}\, a_i b_j \qquad (18)$$

The computation of (17) resembles a circuit of threshold neurons with coincidence detectors in their dendritic trees, see Fig. 4. The synaptic tensor $W \in \{0, 1\}^{M \times M \times M}$ is a binary third-order tensor that indicates how each output neuron samples from the outer product. We examined two types of sampling tensors, one with the 1-entries chosen i.i.d. (without repetition), and one with 1-entries aligned along truncated diagonals of the tensor (left and middle panels in Fig. 4).

The sparsity of the output in (17) is controlled by the threshold and by the density of this sampling tensor. To achieve a target sparsity of $K/N$ for a threshold $\theta = 1$, the fan-in to each neuron has to be $N/K$ (see analysis in Appendix B-A). Thus, the minimal fan-in of the sampling tensor $W$ increases with sparsity. If the pattern activity is linear in the dimension, $K = \beta N$ with $\beta \ll 1$, the minimal fan-in is $\alpha^* = 1/\beta$. In this case, the computational cost of SPTP binding is of order $N$. If the pattern activity goes with the square root of the dimension, $K = \beta \sqrt{N}$, the minimal fan-in is $\alpha^* = \sqrt{N}/\beta$. If the pattern activity goes with the logarithm of the dimension, $K = \beta \ln(N)$, the minimal fan-in is $\alpha^* = N/(\beta \ln(N))$. Further, for optimizing the unbinding performance, the sampling tensor should fulfill the symmetry condition $W_{ijl} = W_{lij}$ (see analysis in Appendix B-B).
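A minimal sketch of SPTP binding for binary vectors, following (17). The sampling "tensor" is stored, for each output unit, as a list of randomly chosen index pairs; the fan-in, threshold and sizes follow the analysis above but are illustrative, and the tensor here is not symmetrized.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 256, 16                      # dimension and number of active components (illustrative)
fan_in = N // K                     # fan-in per output neuron, as in the analysis above
theta = 1                           # threshold in eq. (17)

# Random sampling tensor W: for each output l, a list of (i, j) coincidence points.
W = [rng.integers(0, N, size=(fan_in, 2)) for _ in range(N)]

def random_k_sparse(N, K):
    v = np.zeros(N)
    v[rng.choice(N, K, replace=False)] = 1.0
    return v

def sptp_bind(a, b):
    """Sparsity-preserving tensor projection, eq. (17), for binary vectors."""
    c = np.zeros(N)
    for l, pairs in enumerate(W):
        s = np.sum(a[pairs[:, 0]] * b[pairs[:, 1]])   # sampled coincidences of the outer product
        c[l] = 1.0 if s >= theta else 0.0             # Heaviside threshold
    return c

a, b = random_k_sparse(N, K), random_k_sparse(N, K)
c = sptp_bind(a, b)
print(int(c.sum()))                 # close to K on average, with some random variance
```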
2) Sparsity-preserving binding for sparse block-codes:
We next consider sparse vector representations that are constrained to be block-codes (Gripon and Berrou, 2011), which have been proposed for VSAs before (Laiho et al., 2015). Our model extends this previous work with a block-code in the complex domain. In a sparse block-code, a vector of length $N$ is divided into $K$ equally-sized blocks, each with a one-hot component. In the complex domain, the hot component is a phasor with unit amplitude and arbitrary phase.

The binding operation Laiho et al. (2015) proposed operates on each block individually. For each block, the indices of the two active elements of the input are summed modulo block size to produce the index of the active element of the output. This is the same as circular convolution (Plate, 1994) performed locally between individual blocks. This binding operation, local circular convolution (LCC), denoted by $\ast_b$, produces a sparse block-code when the input vectors are sparse block-codes (Fig. 4). LCC variable binding can be implemented by forming the outer product and sampling as in Fig. 4, with a circuitry in which each neuron has a fan-in of $\alpha = N/K$ and samples along truncated diagonals of the tensor product. LCC has a computational complexity of $\alpha N$, which is of order $N$ if $K$ is proportional to $N$. An alternative implementation (that is more efficient on a CPU) uses the Fourier convolution theorem (28) to replace convolution by the Hadamard product:

$$(a \ast_b b)_{\mathrm{block}\,i} = a_{\mathrm{block}\,i} \ast b_{\mathrm{block}\,i} = \mathcal{F}^{-1}\big(\mathcal{F}(a_{\mathrm{block}\,i}) \odot \mathcal{F}(b_{\mathrm{block}\,i})\big) \qquad (19)$$

where $\mathcal{F}$ is the Fourier transform.

Fig. 5. Comparison of binding operations. The unbinding performance was measured as the correlation between ground truth and the output of unbinding. Different levels of sparsity (x-axis) and superposition were examined (colored lines: [0, 1, 2, 4, 8, 16] items in superposition).

Fig. 6. Preservation of sparsity with a binding operation. A. The output sparsity $K_{\mathrm{bind}}$ is compared to the sparsity of the base vectors $K$. Binding with SPTP results in an output vector that has the correct expected sparsity, but there is some random variance. This variance reduces with more active components ($K = [20, \ldots]$, black to orange lines). This result is similar for both random and structured SPTP. B. The output sparsity of binding sparse block-codes with LCC deterministically results in a vector which maintains the sparsity of the inputs.

The LCC unbinding of a block can be performed by computing the inverse of the input vector to unbind. This is the inverse with respect to circular convolution, which is computed for each block:

$$a^{-1}_{\mathrm{block}\,i} = \mathcal{F}^{-1}\big(\mathcal{F}(a_{\mathrm{block}\,i})^{*}\big) \qquad (20)$$

where $*$ denotes the complex conjugate. The inverse is used when unbinding; for instance, if $c = a \ast_b b$, then $a = b^{-1} \ast_b c$.
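A compact sketch of LCC binding and unbinding via (19)-(20), using the FFT on each block of a binary block-code. The block layout and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
K, B = 8, 32                       # number of blocks and block size; N = K * B
N = K * B

def random_block_code():
    """Binary sparse block-code: one hot component per block."""
    v = np.zeros((K, B))
    v[np.arange(K), rng.integers(0, B, K)] = 1.0
    return v.ravel()

def lcc_bind(a, b):
    """Block-wise circular convolution, eq. (19)."""
    A = np.fft.fft(a.reshape(K, B), axis=1)
    Bf = np.fft.fft(b.reshape(K, B), axis=1)
    return np.real(np.fft.ifft(A * Bf, axis=1)).ravel()

def lcc_inverse(a):
    """Block-wise inverse with respect to circular convolution, eq. (20)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a.reshape(K, B), axis=1)), axis=1)).ravel()

a, b = random_block_code(), random_block_code()
c = lcc_bind(a, b)                 # also a sparse block-code
a_hat = lcc_bind(lcc_inverse(b), c)
print(np.allclose(a, a_hat))       # True: unbinding is lossless for block-codes
```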
3) Experiments with sparsity-preserving binding:
The binding operations are evaluated based on whether they maintain sparsity and how much information is retained when unbinding. Circular convolution and the Hadamard product can be ruled out because they do not preserve sparsity: the Hadamard product increases sparsity and circular convolution reduces sparsity. But we still evaluated these operations for comparison.

We investigated how well sparsity is preserved with LCC and SPTP binding (Fig. 6). We find that LCC binding preserves sparsity perfectly, and SPTP binding preserves sparsity on average (statistically), but with some variance.

We next measured how much information is retained when first binding and then unbinding a vector using the proposed binding operations (Fig. 5). The Hadamard product binding achieves the highest correlation values for dense vectors, but performs very poorly for sparse vectors. The other three binding methods perform equally across sparsity levels. Circular convolution and SPTP binding are somewhat lossy for all sparsity levels. The LCC variable binding between block-codes achieves the highest correlation values, outperforming circular convolution and SPTP binding. Each diagram in Fig. 5 contains 6 curves, corresponding to different levels of additive superposition in the bound vectors.

SPTP binding works for general K-sparse vectors. It has decent properties but is somewhat lossy. The information loss is due to the fact that not all active input components contribute to the generation of active outputs, which means that some active input components cannot be inferred during unbinding and information is lost. The loss can be kept at a minimum by using a synaptic weight tensor that fulfills the symmetry condition $W_{ijl} = W_{lij}$. This information loss persisted regardless of SPTP being structured or random, or the threshold and fan-in settings.

These experiments identify LCC binding as an ideal sparsity-preserving binding operation. With sparse block-codes and local circular convolution applied separately to each block, the unbinding is lossless. The block structure guarantees that each active input component participates in the formation of an active output component, which cannot be guaranteed for general K-sparse vectors. Of course, there is a price to pay: LCC binding requires the atomic vectors to be sparse block-codes. The coding entropy of block-codes is significantly smaller than that of general K-sparse patterns.

C. Applications of VSAs with sparse block-codes

1) Solving symbolic reasoning problems:
As a basic illustration of symbolic reasoning with sparse block-codes, we implement the solution to the cognitive reasoning problem (Kanerva, 2010) "What is the dollar of Mexico?" in the supplemental Jupyter notebook. To answer such queries, data structures are encoded into vectors that represent trivia information about different countries. A data record of a country is a table of key-value pairs. For example, to answer the specific query, the relevant records are:

ustates = nam ∗_b usa + cap ∗_b wdc + cur ∗_b dol
mexico = nam ∗_b mex + cap ∗_b mxc + cur ∗_b pes

The keys of the fields country name, capital and currency are represented by random sparse block-code vectors nam, cap and cur. The corresponding values USA, Washington D.C., Dollar, Mexico, Mexico City, and Peso are also represented by sparse block-code vectors usa, wdc, dol, mex, mxc, pes. All the vectors are stored in the codebook Φ. The vectors ustates and mexico represent the complete data records; they are a representation of key-value pairs that can be manipulated to answer queries. These record vectors have several terms added together, which reduces the sparsity.

To perform the reasoning operations required to answer the query, first the two relevant records have to be retrieved from the database. While mexico can be found by simple pattern matching between terms in the query and stored data record vectors, the retrieval of ustates is not trivial. The original work does not deal with the language challenge of inferring that the ustates record is needed. Rather, the problem is formally expressed as analogical reasoning, where the query is given as: Dollar : USA :: ? : Mexico. Thus, the pair of records needed for reasoning is given by the query.

Once the pair of records is identified, the following transformation vector is created: t_UM = mexico ∗_b ustates^{-1}. Note that unbinding with LCC means binding with the inverse vector (20), whereas in the MAP VSA used in the original work (Kanerva, 2010) the binding and unbinding operations are the same. The transformation vector will also contain many summed terms, leading to less sparsity. The transformation vector then contains the relationships between the different concepts:

t_UM = mex ∗_b usa^{-1} + mxc ∗_b wdc^{-1} + pes ∗_b dol^{-1} + noise

where all of the cross-terms can be ignored and act as small amounts of crosstalk noise.

The correspondence to dollar can be computed by binding dol to the transformation vector:

ans = dol ∗_b t_UM = pes + noise

The vector ans is then compared to each vector in the codebook Φ. The codebook entry with the highest similarity represents the answer to the query. This will be Peso with high probability for large N. The probability of the correct answer can be understood through the capacity theory of distributed representations described in Frady et al. (2018), which we next apply to this context.

In general, a vector like t_UM can be considered as a mapping between the fields in the two tables. The number of entries determines the amount of crosstalk noise, but all of the entries that are non-sensible are also considered crosstalk noise.

Specifically, we consider general data records of key-value pairs, similar in form to ustates and mexico. These data records contain R "role" vectors that act as keys. Each one has M_r potential "filler" values. The role vectors are stored in a codebook $\Psi \in \mathbb{C}^{N \times R}$. For simplicity, we assume that all R roles are present in a data record, each with one of the M_r fillers attached. The fillers for each role are stored in the codebook $\Phi^{(r)} \in \mathbb{C}^{N \times M_r}$. This yields a generic key-value data record:

$$\mathrm{rec} = \sum_{r}^{R} \Psi_r \ast_b \Phi^{(r)}_{i^*} \qquad (21)$$

where the index $i^*$ indicates one filler vector from the codebook for a particular role.

Fig. 7.
Performance of analogic reasoning tasks with sparse block-codes.
We empirically simulated analogic reasoning tasks with data records containing R key-value pairs, and measured the performance (dashed lines). This performance can be predicted based on the VSA capacity theory reported in Frady et al. (2018) (solid lines).

Next, we form the transformation vector, which is used to map one data record to another. This is done generically by binding two record vectors: $t_{ij} = \mathrm{rec}_j \ast_b \mathrm{rec}_i^{-1}$. As discussed, the terms in each record will distribute, and the values that share the same roles will be associated with each other. But there are many cross-terms that are also present in the transformation vector that are not useful for any analogical reasoning query. The crosstalk noise depends on how many terms are present in the sum, and this includes the cross-terms. Thus, the total number of terms in the transformation vector $t_{ij}$ will be $R^2$.

In the next step, a particular filler is queried and the result is decoded by comparison to the codebook $\Phi^{(r)}$, which contains the sparse block-code of each possible filler:

$$a_r = \Phi^{(r)\dagger} \big( t_{ij} \ast_b \Phi^{(r)}_{j^*} \big) \qquad (22)$$

where $j^*$ indicates the index of the filler in the query (e.g. the index of Dollar). The entry with the largest amplitude in the vector $a_r$ is considered the output.

The probability that this inference finds the correct relationship can be predicted by the VSA capacity analysis (Frady et al., 2018) (Fig. 7). The probability is a function of the signal-to-noise ratio, given in this case by $s = N/R^2$.
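The sketch below is a compact, end-to-end version of the "dollar of Mexico" example above, using binary block-codes and block-wise circular convolution. It is not the paper's notebook code; the symbol names, sizes and cleanup step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
K, B = 50, 40                                   # blocks and block size (illustrative); N = 2000
N = K * B

def block_code():
    v = np.zeros((K, B)); v[np.arange(K), rng.integers(0, B, K)] = 1.0
    return v.ravel()

def bind(a, b):     # LCC, eq. (19)
    return np.real(np.fft.ifft(np.fft.fft(a.reshape(K, B), axis=1) *
                               np.fft.fft(b.reshape(K, B), axis=1), axis=1)).ravel()

def inv(a):         # block-wise inverse, eq. (20)
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a.reshape(K, B), axis=1)), axis=1)).ravel()

# Atomic symbols: keys (roles) and values (fillers).
symbols = {s: block_code() for s in
           ["nam", "cap", "cur", "usa", "wdc", "dol", "mex", "mxc", "pes"]}
ustates = bind(symbols["nam"], symbols["usa"]) + bind(symbols["cap"], symbols["wdc"]) \
        + bind(symbols["cur"], symbols["dol"])
mexico  = bind(symbols["nam"], symbols["mex"]) + bind(symbols["cap"], symbols["mxc"]) \
        + bind(symbols["cur"], symbols["pes"])

t_UM = bind(mexico, inv(ustates))               # transformation vector
ans = bind(symbols["dol"], t_UM)                # query: what is the dollar of Mexico?

# Cleanup: compare against the codebook and report the best match ("pes" expected).
best = max(symbols, key=lambda s: symbols[s] @ ans)
print(best)
```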
2) Solving classification problems:
Although VSAs originated as models for symbolic reasoning, Kleyko et al. (2019) have recently described the similarities between VSAs and randomly connected feed-forward neural networks (Scardapane, S. and Wang, D., 2017) for classification, known as Random Vector Functional Link (RVFL) (Igelnik and Pao, 1995) or Extreme Learning Machines (ELM) (G. Huang and Q. Zhu and C. Siew, 2006). Specifically, RVFL/ELM can be expressed by VSA operations in the MAP VSA model (Kleyko et al., 2019). Leveraging these insights, we implemented a classification model using a VSA with sparse block-codes.

The model proposed in (Kleyko et al., 2019) forms a dense distributed representation $x$ of a set of features $a$. Each feature is assigned a random "key" vector $\Phi_i \in \{\pm 1\}^N$. The collection of "key" vectors constitutes the codebook $\Phi$. However, in contrast to (3), the set of features is represented differently. The proposed approach requires the mapping of a feature value $a_i$ to a distributed representation $F_i$ (the "value") which preserves the similarity between nearby scalars. Kleyko et al. (2019) used thermometric encoding (Rachkovskij et al., 2005) to create such similarity-preserving distributed representations. The feature set is represented as the sum of "key"-"value" pairs using the binding operation:

$$x = f_\kappa\Big( \sum_{i}^{M} \Phi_i \odot F_i \Big), \qquad (23)$$

where $f_\kappa$ denotes the clipping function which is used as a nonlinear activation function:

$$f_\kappa(x_i) = \begin{cases} -\kappa & x_i \leq -\kappa \\ x_i & -\kappa < x_i < \kappa \\ \kappa & x_i \geq \kappa \end{cases} \qquad (24)$$

The clipping function is characterized by the configurable threshold parameter $\kappa$, regulating the nonlinear behavior of the neurons and limiting the range of activation values.

The predicted class $\hat{y}$ is read out from $x$ using the trainable readout matrix as:

$$\hat{y} = \mathrm{argmax}\, W^{\mathrm{out}} x, \qquad (25)$$

where $W^{\mathrm{out}}$ is obtained via ridge regression applied to a training dataset.

Fig. 8. Solving classification problems with sparse block-codes. A. Similarity-preserving representation of scalars with sparse block-codes: $K = 16$, $N = 128$. Similarity (overlap) between the representations of the levels and the vectors representing other signal levels. B. Cross-validation accuracy of the VSA with dense distributed representations against the VSA with sparse distributed representations. A point corresponds to a dataset.

For the purposes of using sparse block-codes, however, thermometric codes are non-sparse and their mean activity is variable across different values. Building on earlier efforts in the design of similarity-preserving sparse coding (Palm et al., 1994; Palm, 2013), we design a similarity-preserving encoding scheme with sparse block-codes. In this scheme, the lowest signal level has all hot components in the first positions of each block. The second signal level is encoded by the same pattern except that the hot component of the first block is shifted to the second position. The third signal level is encoded by the code of the second level with the hot component of the second block shifted to the second position, and so on. This feature encoding scheme can represent $N - K + 1$ signal levels uniquely. The similarity between vectors drops off gradually as a function of distance (Fig. 8A). Each pattern has the highest similarity with itself (overlap $= K$). For the range of distances between $0$ and $K$, the overlap decreases linearly until it reaches $0$ and then stays at this level for larger distances.

The data vectors in a classification problem are encoded by the following steps. First, the labels of the different features (data dimensions) are encoded by random sparse block-code vectors. Key-value pairs are then formed by binding feature labels with corresponding values, using the similarity-preserving sparse block-code scheme described above. A data vector is then represented by the sum of all the key-value pairs. In essence, such a representation is a protected sum (5). In addition, we apply a clipping function to the resulting input.

The described representation of the data can be computed in a sparse block-code VSA; the last step can be represented by the activation of a hidden layer with nonlinear neurons. To perform classification, the hidden representation is pattern-matched to prototypes of the different classes.
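A small sketch of the similarity-preserving scalar encoding described above, under the interpretation that each unit step of the signal level shifts the hot component of one more block: the overlap between two levels then decreases linearly with their distance, as in Fig. 8A. The sizes match the illustrative values of Fig. 8A; this is not the authors' implementation.

```python
import numpy as np

K, N = 16, 128                    # blocks and total dimension, as in Fig. 8A
B = N // K                        # block size

def encode_level(level):
    """Similarity-preserving block-code for an integer level in [0, N - K]."""
    v = np.zeros((K, B))
    for b in range(K):
        # Block b has had its hot component shifted once per K level-steps,
        # starting at level b + 1.
        shift = max(0, -(-(level - b) // K))   # ceil((level - b) / K), clipped at 0
        v[b, shift] = 1.0
    return v.ravel()

levels = [encode_level(l) for l in range(N - K + 1)]
print(levels[0] @ levels[0])      # K: overlap of a level with itself
print(levels[0] @ levels[8])      # K - 8: overlap decreases linearly with distance
print(levels[0] @ levels[40])     # 0: no overlap beyond distance K
```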
To optimize this pattern matching, in cases where the prototypes are correlated, we train a perceptron network with ridge regression, similar to what was previously proposed for a sequence memory with VSAs (Frady et al., 2018).

Interestingly, the cross-validated accuracies for VSAs with sparse block-codes and with dense representations (Kleyko et al., 2019) on real-world classification datasets are quite similar (Fig. 8B), with a correlation coefficient of [...] and both reaching an average accuracy of [...]. The datasets from the UCI Machine Learning Repository (Dua and Graff, 2019) have been initially analyzed in a large-scale comparison study of different classifiers (Fernandez-Delgado et al., 2014). The only preprocessing step we introduced was to normalize features into the range [0, 1] and quantize the values into N − K + 1 levels. The hyperparameters for both dense and sparse models were optimized through grid search over N (for dense representations, N varied in the range [50, ...] with step [...]), λ (ridge regression regularization parameter; varied in the range [...] with step [...]), and κ (varied between {...}). The search additionally considered K for sparse block-codes (K/N varied in the range [2, ...] with step [...], while K varied in the range [4, ...] with step [...]).

Importantly, the average number of neurons used by both approaches was also comparable: about [...] for sparse block-codes and about [...] for dense representations. Thus, we conclude that sparse block-codes can be used as substitutes for dense representations in practical problems such as classification tasks.

IV. DISCUSSION
In this paper we investigated methods of variable binding for symbolic reasoning with sparse distributed representations. The motivation for this study was two-fold. First, we believe that such methods of variable binding could be key for combining the universal reasoning properties of vector symbolic computing (Gayler, 2003) with the advantages of neural networks. Second, these methods will enable implementations of symbolic reasoning that can leverage efficient sparse Hebbian associative memories (Willshaw et al., 1969; Palm, 1980; Knoblauch and Palm, 2020) and low-power neuromorphic hardware (Davies et al., 2018).
A. Theoretical Results
Using the framework of compressed sensing, we investigated a setting in which there is a unique equivalence between sparse feature vectors and dense random vectors. We find that:

i) With this setting, CS inference outperforms the classical VSA readout of set representations.
ii) Classical vector symbolic binding between dense vectors with the Hadamard product (Plate, 2003; Gayler, 1998; Kanerva, 2009) is under certain conditions mathematically equivalent to tensor product binding (Smolensky, 1990) of the corresponding sparse vectors.
iii) For representing sets of objects, vector addition of dense vectors (15) is equivalent to addition of the corresponding sparse vectors.
iv) The protected sum of dense vectors (16) is equivalent to the concatenation of the sparse vectors.
v) The dimensionality-preserving operations between dense vectors for variable binding and protected set representations mathematically correspond to operations between sparse vectors, the tensor product and vector concatenation, which are not dimensionality preserving.
B. Experimental Results
Our theory result v) implies that in order to construct dimensionality- and sparsity-preserving variable binding between sparse vectors, an additional reduction step is required for mapping the outer product to a sparse vector. Existing reduction schemes of the outer product proposed in the literature, circular convolution (Plate, 2003) and vector-derived transformation binding (Gosmann and Eliasmith, 2019), are not sparsity-preserving when applied to sparse vectors.

For binding pairs of general K-sparse vectors, we designed a strategy of sub-sampling from the outer product with additional thresholding to maintain sparsity. Such a computation can be implemented in neural circuitry where dendrites of neurons detect firing coincidences between pairs of input neurons. The necessary connection density increases with the sparsity of the code vectors. Still, the computational complexity is of order $N$ when $K = \beta N$, which compares favorably to other binding operations, which can have order $N^2$ or $N \log N$. However, the sampling in the circuit always misses components of the tensor product, making the unbinding operation lossy.

Another direction we investigated extends previous work (Laiho et al., 2015) developing VSAs for sparse representations of a restricted type, sparse block-codes. We propose block-wise circular convolution as a variable binding method which is sparsity- and dimensionality-preserving. Interestingly, for sparse block-codes, the unbinding given the reduced tensor and one of the factors is lossless. As our experiments show, it has the desired properties required for VSA manipulations, outperforming the other methods. Independent other work has proposed efficient Hebbian associative memory models (Kanter, 1988; Gripon and Berrou, 2011; Knoblauch and Palm, 2020) that could be applied for the cleanup steps required in VSAs with block-codes.

VSAs with block-codes are demonstrated in two applications. In a symbolic reasoning application we show that the accuracy as a function of the dimension of sparse block-codes reaches the full performance of dense VSAs and can be described by the same theory (Frady et al., 2018). On classification datasets from the UCI Machine Learning Repository we show that the block-code VSA reaches the same performance as dense VSAs (Kleyko et al., 2019). Moreover, the average accuracy of [...] of the VSA models is comparable to the state-of-the-art performance of [...] achieved by Random Forest (Fernandez-Delgado et al., 2014).

C. Relationship to earlier work
Rachkovskij (2001) and Rachkovskij and Kussul (2001) were, to our knowledge, the first to propose similarity- and sparsity-preserving variable binding. For binary representations they proposed methods that involve component-wise Boolean operations and deletion (thinning) based on random permutations. These methods of variable binding are also lossy, similar to our method of SPTP.

The variable binding with block-codes, which our experiments identify as the best, can be done with binary (real-valued) or phasor (complex-valued) block-codes. For binary block-codes our binding method is the same as in (Laiho et al., 2015), who demonstrated it in a task processing symbolic sequences. For protecting individual elements in a sum representation, they use random permutations between blocks, rather than variable binding as we do in Section III-A3.
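As an illustration of binding for sparse block-codes, the sketch below (assuming binary block-codes with exactly one active unit per block; all function names are ours) implements block-wise circular convolution and shows that unbinding by block-wise circular correlation recovers the factor exactly, in line with the lossless unbinding reported above.

```python
import numpy as np

def random_block_code(num_blocks, block_size, rng):
    """Random sparse block-code: one active unit per block (K = num_blocks)."""
    v = np.zeros((num_blocks, block_size))
    v[np.arange(num_blocks), rng.integers(0, block_size, num_blocks)] = 1.0
    return v

def block_bind(a, b):
    """Block-wise circular convolution: convolve each block of a with the
    corresponding block of b (dimensionality- and sparsity-preserving)."""
    out = np.zeros_like(a)
    for k in range(a.shape[0]):
        out[k] = np.real(np.fft.ifft(np.fft.fft(a[k]) * np.fft.fft(b[k])))
    return np.round(out)

def block_unbind(c, b):
    """Unbind by block-wise circular correlation (inverse for block-codes)."""
    out = np.zeros_like(c)
    for k in range(c.shape[0]):
        out[k] = np.real(np.fft.ifft(np.fft.fft(c[k]) * np.conj(np.fft.fft(b[k]))))
    return np.round(out)

rng = np.random.default_rng(0)
a = random_block_code(num_blocks=10, block_size=50, rng=rng)
b = random_block_code(num_blocks=10, block_size=50, rng=rng)
c = block_bind(a, b)            # still one active unit per block
a_rec = block_unbind(c, b)      # exact recovery for binary block-codes
assert np.array_equal(a_rec, a)
```

For phasor block-codes the same scheme applies with complex-valued blocks, with conjugation in the unbinding step.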
D. Implications for neural networks and machine learning
In the deep network literature, concatenation is often used in neural network models as a variable binding operation (Soll et al., 2019). However, our result iv) suggests that concatenation is fundamentally different from a binding operation. This might be a reason why deep learning methods have limited capabilities to represent and manipulate data structures (Marcus, 2020).

Several recent studies have applied VSAs to classification problems (Ge and Parhi, 2020; Rahimi et al., 2019). Here we demonstrated classification in a block-code VSA. The block-code VSA exhibited the same average classification accuracy as earlier VSA solutions with dense codes. This result suggests that sparse block-code VSAs can be a promising basis for developing classification algorithms for low-power neuromorphic hardware platforms (Davies et al., 2018).
E. Implications for neuroscience
We have investigated variable binding operations between sparse patterns with regard to their computational properties in symbolic reasoning. It is interesting that this form of variable binding requires multiplication or coincidence detection, computations which can be implemented by active dendritic mechanisms of biological neurons (Larkum and Nevian, 2008). Although this computation is beyond the capabilities of standard neural networks, it can be implemented with formal models of neurons, such as sigma-pi neurons (Mel and Koch, 1990).

We found that the most efficient form of variable binding with sparse vectors relies on block-code structure. Although block-codes were engineered independently of neurobiology, they are compatible with some experimental observations, such as divisive normalization (Heeger, 1992) and functional modularity. Specifically, in sensory cortices of carnivores, neurons within small cortical columns (Mountcastle, 1957) respond to the same stimulus features, such as the orientation of local edges in the image (Hubel and Wiesel, 1963, 1977). Further, groups of nearby orientation columns form so-called macro columns, tiling all possible edge orientations at a specific image location (Hubel and Wiesel, 1974; Swindale, 1990). A macro column may correspond to a block in a block-code.

While binary block-codes are not biologically plausible, complex-valued block-codes, in which active elements are complex phasors with unit magnitude, can be represented as timing patterns in networks of spiking neurons (Frady and Sommer, 2019). Further, it seems possible to extend LCC binding to soft block-codes, in which localized bumps of graded neural activity are represented by spike rates, e.g. (Ben-Yishai et al., 1995).
F. Future directions
One important future direction is to investigate how to combine the advantages of VSAs and traditional neural networks to build more powerful tools for artificial intelligence. The challenge is how to design neural networks for learning sparse representations that can be processed in sparse VSAs. Such combined systems could potentially overcome some of the severe limitations of current neural networks, such as the demand for large amounts of data, limited abilities to generalize learned knowledge, etc.

Another interesting research direction is to design VSAs operating with spatio-temporal spike patterns that can be implemented in neuromorphic hardware, potentially also making use of spike timing and efficient associative memories for spike timing patterns (Frady and Sommer, 2019).

Further, it will be interesting to study how binding in sparse VSAs can be used to form similarity-preserving sparse codes (Palm et al., 1994; Palm, 2013) for continuous manifolds. For example, binding can be used to create index patterns for representing locations in space, which could be useful for navigation in normative modeling of hippocampus (Frady and Sommer, 2020).
ACKNOWLEDGEMENT
The authors thank Charles Garfinkle and other members of the Redwood Center for Theoretical Neuroscience for stimulating discussions. DK is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Individual Fellowship grant agreement No. 839179, and by DARPA's VIP program under the Super-HD project. FTS is supported by NIH R01-EB026955.
APPENDIX A
RELATIONS BETWEEN DIFFERENT VARIABLE BINDING OPERATIONS
A. VSA binding, a subsampling of TPR
The dimensionality-preserving binding operations in VSAs can be expressed as a sampling of the tensor product matrix $\mathbf{x}\mathbf{y}^{\top}$ into a vector:
$$(\mathbf{x} \circ \mathbf{y})_l = \sum_{ij} W_{lij}\, x_i y_j \qquad (26)$$
where the binary third-order tensor $W \in \{0,1\}^{N \times N \times N}$ determines which elements of the outer product are sampled. For the Hadamard product in the MAP VSA, the sampling tensor just copies the diagonal elements of the tensor matrix into a vector, using $W_{lij} = \delta(i,j)\,\delta(i,l)$. Here $\delta(i,j)$ is the Kronecker symbol. Conversely, in circular convolution the sampling involves summing the (wrapped) diagonals of the outer-product matrix:
$$W_{lij} = \delta\big((l - i) \bmod N,\; j\big) \qquad (27)$$
For neurally implementing a binding operation like (17), a low fan-in is essential. The fan-in is the number of nonzero elements in the tensor feeding the coincidences between the input vectors to an output neuron, $\alpha = \alpha(l) = \sum_{ij} W_{lij}$. For circular convolution the fan-in is $\alpha_{CCB} = N$. For VDTB binding the fan-in is $\alpha_{VDTB} = \sqrt{N}$. When applied to a pair of sparse vectors, circular convolution and VDTB binding are not sparsity-preserving. In the next section we analyze the properties of the sampling tensor (26) required to make (17) a sparsity-preserving binding operation for general $K$-sparse vectors with optimal properties.
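The following small numerical check (our own, with $N$ kept small since the full tensor has $N^3$ entries) constructs the two sampling tensors described above and verifies that (26) reproduces the Hadamard product and circular convolution; it also reports the fan-in $\alpha(l)$ per output neuron. The index convention $W_{lij} = \delta((l-i) \bmod N, j)$ is our reading of (27).

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x, y = rng.standard_normal(N), rng.standard_normal(N)

# Hadamard-product binding (MAP VSA): W[l, i, j] = delta(i, j) * delta(i, l)
W_had = np.zeros((N, N, N))
for l in range(N):
    W_had[l, l, l] = 1.0

# Circular-convolution binding: W[l, i, j] = delta((l - i) mod N, j),
# i.e. each output sums one wrapped diagonal of the outer product.
W_cc = np.zeros((N, N, N))
for l in range(N):
    for i in range(N):
        W_cc[l, i, (l - i) % N] = 1.0

bind = lambda W, x, y: np.einsum('lij,i,j->l', W, x, y)   # Eq. (26)

assert np.allclose(bind(W_had, x, y), x * y)
assert np.allclose(bind(W_cc, x, y),
                   np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y))))

# Fan-in per output neuron: alpha(l) = sum_ij W[l, i, j]
print(W_had.sum(axis=(1, 2)))   # all ones -> fan-in 1 for the Hadamard product
print(W_cc.sum(axis=(1, 2)))    # all N    -> fan-in N for circular convolution
```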
B. Generalizing the Fourier Convolution Theorem

There is a direct relation between circular convolution binding and Hadamard product binding through the Fourier convolution theorem. The Fourier transform is a (non-random) linear transform that has previously been proposed in holography for generating (dense) distributed representations from data features.
The Fourier convolution theorem states:
$$\mathcal{F}(\mathbf{a}) \odot \mathcal{F}(\mathbf{b}) = \mathcal{F}(\mathbf{a} \ast \mathbf{b}) \qquad (28)$$
where $\mathcal{F}(\mathbf{z})_k := (1/n) \sum_{m=0}^{n-1} \Phi^{F}_{-km}\, z_m$ with $\Phi^{F}_{km} := e^{j 2\pi k m / n}$ is the discrete Fourier transform. With (26) and (27), the Fourier convolution theorem establishes a relationship between the outer product of two vectors and the Hadamard product of their Fourier transforms. Replacing the Fourier transform by two CS random sampling matrices, the Fourier convolution theorem generalizes to:
$$\Phi \mathbf{a} \odot \Psi \mathbf{b} = J(\mathbf{a}\mathbf{b}^{\top}) \qquad (29)$$
This equation coincides with (13), describing the relationship between the Hadamard product of dense vectors and the outer product of the corresponding sparse vectors. The linear projection $J$ is formed from the CS sampling matrices $\Phi$ and $\Psi$. Note that, unlike in the Fourier convolution theorem, $J$ is in general not invertible. The outer product of the sparse vectors can be uniquely inferred from the Hadamard product of the dense vectors under certain conditions on sparsity, dimensionality, and the properties of the sampling matrices, as discussed in Sect. III-A2.
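A quick numerical sanity check of (28) and (29) is sketched below. It uses numpy's unnormalized DFT convention, and the explicit row-wise construction of $J$ from $\Phi$ and $\Psi$ is our reading of (13)/(29), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fourier convolution theorem (28), with numpy's unnormalized DFT convention.
n = 16
a, b = rng.standard_normal(n), rng.standard_normal(n)
circ = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))          # a * b
assert np.allclose(np.fft.fft(a) * np.fft.fft(b), np.fft.fft(circ))

# Generalization (29): replace the Fourier matrix by two random sampling
# matrices Phi and Psi. Row k of J is then the outer product of row k of Phi
# with row k of Psi (our construction for this check).
M, N = 32, 16
Phi, Psi = rng.standard_normal((M, N)), rng.standard_normal((M, N))
J = np.einsum('ki,kj->kij', Phi, Psi).reshape(M, N * N)
u, v = rng.standard_normal(N), rng.standard_normal(N)
# The identity holds for arbitrary u, v; sparsity only matters for inverting J.
assert np.allclose((Phi @ u) * (Psi @ v), J @ np.outer(u, v).ravel())
```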
APPENDIX B
SPARSITY-PRESERVING BINDING FOR K-SPARSE VECTORS

A. Analysis of required fan-in
We examine the required fan-in in (17) for two types of random sampling tensors for sparsity-preserving binding (Fig. 4). In one type, the tensor product of the input vectors is sampled entirely randomly with a fixed constant fan-in per downstream neuron. In the other type, the tensor is sampled along its diagonals, similar to circular convolution, but with the diagonals truncated to a fixed fan-in of the downstream neurons.

First, we determine the minimal fan-in that still provides signals at downstream neurons so that a threshold operation can reliably produce a $K$-sparse vector. The dendritic sums in (17) are approximately distributed according to the binomial distribution:
$$p(d_l = r) = \binom{\alpha}{r} \left(\frac{K^2}{N^2}\right)^{r} \left(1 - \frac{K^2}{N^2}\right)^{\alpha - r} \qquad (30)$$
We require that the fan-in is large enough so that the expected number of downstream neurons with $d_l = 0$ is smaller than the number of silent neurons in the $K$-sparse result vector: $N\, p(d_l = 0) < N - K$. To satisfy this condition, a lower bound $\alpha^{*}$ to the minimal fan-in is computed, which enforces equality between the two numbers and translates into the condition that, for a threshold of $\theta = 1$, the pattern sparsity is preserved:
$$p(d_l = 0) \overset{!}{=} 1 - \frac{K}{N} \qquad (31)$$
Inserting (30) into this condition, we obtain after some algebra:
$$\alpha > \alpha^{*} = \frac{\ln\!\left(1 - \frac{K}{N}\right)}{\ln\!\left(1 - \frac{K^2}{N^2}\right)} \qquad (32)$$
For sparse patterns, (32) becomes simply $\alpha^{*} = N/K$. Note that setting the fan-in exactly to the lower bound $\alpha^{*}$ should result in patterns of dendritic activity in the population of downstream neurons which are approximately $K$-sparse, without any thresholding necessary. One can generalize condition (31) to an arbitrary threshold $\theta$:
$$\sum_{i=0}^{\theta-1} p(d_l = i) = (\alpha - \theta + 1)\binom{\alpha}{\theta - 1}\int_{0}^{1 - K^2/N^2} t^{\alpha - \theta}(1 - t)^{\theta - 1}\, dt \overset{!}{=} 1 - \frac{K}{N} \qquad (33)$$
Unfortunately, the variable $\alpha$ cannot be analytically resolved from the exact condition (33). However, it is straightforward to compute $\alpha$ numerically (Fig. 9A).
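A small numerical sketch of this computation is given below (our own code; it uses the binomial CDF directly rather than the equivalent incomplete-beta form in (33)). It returns the smallest fan-in $\alpha$ for which the probability of a subthreshold dendritic sum does not exceed the fraction $1 - K/N$ of silent neurons.

```python
import numpy as np
from scipy.stats import binom

def minimal_fanin(N, K, theta):
    """Smallest fan-in alpha with P(d_l < theta) <= 1 - K/N, i.e. the expected
    number of subthreshold neurons does not exceed the N - K silent ones."""
    p = (K / N) ** 2        # probability that a sampled coincidence is active
    alpha = theta           # need at least theta coincidences to cross threshold
    while binom.cdf(theta - 1, alpha, p) > 1 - K / N:
        alpha += 1
    return alpha

N, K = 1000, 100
print(minimal_fanin(N, K, theta=1))   # close to N/K = 10, cf. (32)
print(minimal_fanin(N, K, theta=3))   # higher thresholds require more fan-in
```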
Fig. 9. Fan-in requirements of SPTP. The fan-in for each neuron can be determined from the binomial distribution. Higher thresholds require more fan-in. The fan-in is determined where each colored line crosses the sparsity level (black line shows 10% sparsity).
B. Symmetry for optimizing unbinding performance

Another crucial question is which symmetry of the sampling tensor $W$ best enables the invertibility of the sparsity-preserving binding operation (17). Chaining a binding and an unbinding step, one obtains the following self-consistency condition:
$$a_i = H\!\left(\sum_{j,l} W_{ijl}\, b_j\, H\!\left(\sum_{i',j'} W_{li'j'}\, a_{i'} b_{j'} - \theta\right) - \theta\right) \qquad (34)$$
which should hold for arbitrary $K$-sparse vectors $\mathbf{a}$ and $\mathbf{b}$. The self-consistency condition can be approximately substituted by maximizing the objective function $L := \sum_i a_i (d_i - \theta)$. Replacing also the inner nonlinearity by its argument, we obtain:
$$L(W_{ijl}; \mathbf{a}, \mathbf{b}, \theta) = \sum_{i,j,l,i',j'} W_{ijl} W_{li'j'}\, a_i a_{i'} b_j b_{j'} \;-\; \theta \sum_{i,j,l} W_{ijl}\, a_i b_j \;-\; \theta K \qquad (35)$$
The quantity (35) should be high for any vectors $\mathbf{a}$, $\mathbf{b}$. Thus it is only the expectation (over all vectors $\mathbf{a}$ and $\mathbf{b}$) of the first term that can be consistently increased by changing the structure of the sampling tensor. The biggest increase is achieved by making sure that the terms with $(a_i)^2 (b_j)^2$, i.e., those with $i = i'$ and $j = j'$, survive the sampling, which can be accomplished by introducing the following symmetry into the sampling tensor:
$$W_{ijl} = W_{lij} \qquad (36)$$
For sparse complex vectors one can define a binding operation with the nonlinearity $f_{\Theta}(x) = \frac{x}{|x|}\, H(|x| - \Theta)$ taken from (Frady and Sommer, 2019):
$$(\mathbf{a} \circ \mathbf{b})_l = f_{\Theta}\!\left(\sum_{ij} W_{lij}\, a_i b_j\right) \qquad (37)$$
where the sampling tensor is a binary random or a random phasor tensor. The corresponding self-consistency condition is then:
$$a_i = f_{\Theta}\!\left(\sum_{j,l} W_{ijl}\, \bar{b}_j\, f_{\Theta}\!\left(\sum_{i',j'} W_{li'j'}\, a_{i'} b_{j'}\right)\right) \simeq f_{\Theta}\!\left(\sum_{j,l,i',j'} W_{ijl} W_{li'j'}\, a_{i'} b_{j'} \bar{b}_j\right) \qquad (38)$$
In (38) one can see that the signal is maximized if the tensor fulfills the following symmetry:
$$W_{ijl} = \bar{W}_{lij} \qquad (39)$$
Even with condition (39) the unbinding will be noisy, and cleanup through an additional associative network will be required.
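To make the construction concrete, the sketch below (our own minimal implementation, without a cleanup memory) draws a random binary sampling tensor with fan-in $\alpha$ per output neuron, binds two $K$-sparse vectors by thresholded coincidence counting as in (17), and unbinds by reusing the same tensor entries, which realizes the symmetry (36). The recall overlap printed at the end is typically below 1, illustrating that SPTP unbinding is lossy.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, theta = 100, 10, 1                 # dimension, sparsity, threshold
alpha = int(np.ceil(np.log1p(-K / N) / np.log1p(-(K / N) ** 2)))   # bound (32)

def k_sparse(n, k, rng):
    v = np.zeros(n)
    v[rng.choice(n, size=k, replace=False)] = 1.0
    return v

# Random binary sampling tensor: every output neuron l receives alpha random
# coincidences (i, j) from the outer product, i.e. fan-in alpha.
W = np.zeros((N, N, N))
for l in range(N):
    idx = rng.choice(N * N, size=alpha, replace=False)
    W[l, idx // N, idx % N] = 1.0

a, b = k_sparse(N, K, rng), k_sparse(N, K, rng)

c = (np.einsum('lij,i,j->l', W, a, b) >= theta).astype(float)       # binding, (17)
# Unbinding reuses the same tensor entries, realizing the symmetry (36).
a_hat = (np.einsum('lij,j,l->i', W, b, c) >= theta).astype(float)   # cf. (34)

print('sparsity of bound vector:', c.mean())              # roughly K/N
print('recall of true entries  :', (a_hat * a).sum() / K)  # < 1: unbinding is lossy
print('false positives         :', int((a_hat * (1 - a)).sum()))
```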
REFERENCES

Amini, A. and Marvasti, F. (2011). Deterministic construction of binary, bipolar, and ternary compressed sensing matrices. IEEE Transactions on Information Theory, 57(4):2360–2370.

Bell, A. J. and Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37(23):3327–3338.

Ben-Yishai, R., Bar-Or, R. L., and Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proceedings of the National Academy of Sciences, 92(9):3844–3848.

Candes, E. J., Romberg, J. K., and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223.

Davies, M., Srinivasa, N., Lin, T. H., Chinya, G., Cao, Y., Choday, S. H., Dimou, G., Joshi, P., Imam, N., Jain, S., Liao, Y., Lin, C. K., Lines, A., Liu, R., Mathaikutty, D., McCoy, S., Paul, A., Tse, J., Venkataramanan, G., Weng, Y. H., Wild, A., Yang, Y., and Wang, H. (2018). Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.

Dua, D. and Graff, C. (2019). UCI machine learning repository.

Eldar, Y. C., Kuppinger, P., and Bolcskei, H. (2010). Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Transactions on Signal Processing, 58(6):3042–3054.

Fernandez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181.

Fodor, J. A., Pylyshyn, Z. W., et al. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.

Frady, E. P., Kleyko, D., and Sommer, F. T. (2018). A theory of sequence indexing and working memory in recurrent neural networks. Neural Computation, 30(6):1449–1513.

Frady, E. P. and Sommer, F. T. (2019). Robust computation with rhythmic spike patterns. PNAS, 116(36):18050–18059.

Frady, E. P. and Sommer, F. T. (2020). A normative hippocampus model: Robustly encoding variables on smooth manifolds using spiking neurons. In CoSyNe, page 221.

Huang, G., Zhu, Q., and Siew, C. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70(1-3):489–501.

Ganguli, S. and Sompolinsky, H. (2010). Statistical mechanics of compressed sensing. Physical Review Letters, 104(18):1–4.

Gayler, R. W. (1998). Multiplicative binding, representation operators & analogy. In Gentner, D., Holyoak, K. J., and Kokinov, B. N. (Eds.), Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences, pages 1–4, New Bulgarian University, Sofia, Bulgaria.

Gayler, R. W. (2003). Vector Symbolic Architectures answer Jackendoff's challenges for cognitive neuroscience. Proceedings of the ICCS/ASCS International Conference on Cognitive Science, (2002):6.

Ge, L. and Parhi, K. K. (2020). Classification using hyperdimensional computing: A review. arXiv:2004.11204, pages 1–16.

Gosmann, J. and Eliasmith, C. (2019). Vector-derived transformation binding: An improved binding operation for deep symbol-like processing in neural networks. Neural Computation, 31(5):849–869.

Gripon, V. and Berrou, C. (2011). Sparse neural networks with large learning diversity. IEEE Transactions on Neural Networks, 22(7):1087–1096.

Gritsenko, V., Rachkovskij, D., Frolov, A., Gayler, R., Kleyko, D., and Osipov, E. (2017). Neural distributed autoassociative memories: A survey. Cybernetics and Computer Engineering, 2(188):5–35.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2):181–197.

Hillar, C. J. and Sommer, F. T. (2015). When can dictionary learning uniquely recover sparse data from subsamples? IEEE Transactions on Information Theory, 61(11):6290–6297.

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, (46):47–76.

Hinton, G. E. et al. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12, Amherst, MA.

Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1990). Distributed representations. The Philosophy of Artificial Intelligence, pages 248–280.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554–2558.

Hubel, D. H. and Wiesel, T. (1963). Shape and arrangement of columns in cat's striate cortex. The Journal of Physiology, 165(3):559–568.

Hubel, D. H. and Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. Journal of Comparative Neurology, 158(3):267–293.

Hubel, D. H. and Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London. Series B. Biological Sciences, 198(1130):1–59.

Igelnik, B. and Pao, Y. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6:1320–1329.

Kanerva, P. (1996). Binary spatter-coding of ordered K-tuples. Lecture Notes in Computer Science, 1112:869–873.

Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1:139–159.

Kanerva, P. (2010). What we mean when we say "What's the dollar of Mexico?": Prototypes and mapping in concept space. In AAAI Fall Symposium: Quantum Informatics for Cognitive, Social, and Semantic Processes, pages 2–6.

Kanter, I. (1988). Potts-glass models of neural networks. Physical Review A, 37(7):2739–2742.

Kleyko, D., Kheffache, M., Frady, E. P., Wiklund, U., and Osipov, E. (2019). Density encoding enables resource-efficient randomly connected neural networks. arXiv:1909.09153, pages 1–7.

Knoblauch, G. E. and Palm, G. (2020). Iterative retrieval and block coding in auto- and hetero-associative memory. Neural Computation, 32(1):205–260.

Laiho, M., Poikonen, J. H., Kanerva, P., and Lehtonen, E. (2015). High-dimensional computing with sparse vectors. pages 1–4.

Larkum, M. E. and Nevian, T. (2008). Synaptic clustering by dendritic signalling mechanisms. Current Opinion in Neurobiology, 18(3):321–331.

Ledoux, M. (2001). The Concentration of Measure Phenomenon. Number 89. American Mathematical Society.

Marcus, G. (2020). The next decade in AI: Four steps towards robust artificial intelligence. arXiv:2002.06177, pages 1–59.

Mel, B. W. and Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems, pages 474–481.

Mountcastle, V. B. (1957). Modality and topographic properties of single neurons of cat's somatic sensory cortex. Journal of Neurophysiology, 20(4):408–434.

Olshausen, B. A. and Field, D. J. (1996). Natural image statistics and efficient coding. Network (Bristol, England), 7(2):333–339.

Palm, G. (1980). On associative memory. Biological Cybernetics, 36(1):19–31.

Palm, G. (2013). Neural associative memories and sparse coding. Neural Networks, 37:165–171.

Palm, G., Schwenker, F., and Sommer, F. T. (1994). Associative memory networks and sparse similarity preserving codes. In From Statistics to Neural Networks, pages 282–302. Springer.

Palm, G. and Sommer, F. T. (1992). Information capacity in recurrent McCulloch–Pitts networks with sparsely coded memory states. Network: Computation in Neural Systems, 3(2):177–186.

Plate, T. A. (1991). Holographic Reduced Representations: Convolution algebra for compositional distributed representations. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 30–35.

Plate, T. A. (1993). Holographic recurrent networks. Advances in Neural Information Processing Systems, 5(1):34–41.

Plate, T. A. (1994). Distributed Representations and Nested Compositional Structure. PhD thesis.

Plate, T. A. (2003). Holographic Reduced Representation: Distributed Representation for Cognitive Structures. Stanford: CSLI Publications.

Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641.

Rachkovskij, D. A. (2001). Representation and processing of structures with binary sparse distributed codes. IEEE Transactions on Knowledge and Data Engineering, 13(2):261–276.

Rachkovskij, D. A. and Kussul, E. M. (2001). Binding and normalization of binary sparse distributed representations by context-dependent thinning. Neural Computation, 13(2):411–452.

Rachkovskij, D. A., Slipchenko, S. V., Kussul, E. M., and Baidyk, T. N. (2005). Sparse binary distributed encoding of scalars. Journal of Automation and Information Sciences, 37(6):12–23.

Rahimi, A., Kanerva, P., Benini, L., and Rabaey, J. M. (2019). Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals. Proceedings of the IEEE, 107(1):123–143.

Scardapane, S. and Wang, D. (2017). Randomness in neural networks: An overview. Data Mining and Knowledge Discovery, 7:1–18.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216.

Smolensky, P., Lee, M., He, X., Yih, W.-t., Gao, J., and Deng, L. (2016). Basic reasoning with tensor product representations. arXiv preprint arXiv:1601.02745.

Soll, M., Hinz, T., Magg, S., and Wermter, S. (2019). Evaluating defensive distillation for defending text processing neural networks against adversarial examples. In International Conference on Artificial Neural Networks (ICANN), pages 685–696. Springer.

Swindale, N. V. (1990). Is the cerebral cortex modular? Trends in Neurosciences, 13(12):487–492.

Treisman, A. (1998). Feature binding, attention and object perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 353(1373):1295–1306.

Tsodyks, M. V. and Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Letters, 6(2):101–105.

Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. (1969). Non-holographic associative memory.