Modeling Sequences with Quantum States: A Look Under the Hood
TAI-DANAE BRADLEY, MILES STOUDENMIRE, AND JOHN TERILLA
Abstract.
Classical probability distributions on sets of sequences can be modeled using quantum states. Here, we do so with a quantum state that is pure and entangled. Because it is entangled, the reduced densities that describe subsystems also carry information about the complementary subsystem. This is in contrast to the classical marginal distributions on a subsystem, in which information about the complementary system has been integrated out and lost. A training algorithm based on the density matrix renormalization group (DMRG) procedure uses the extra information contained in the reduced densities and organizes it into a tensor network model. An understanding of the extra information contained in the reduced densities allows us to examine the mechanics of this DMRG algorithm and to study the generalization error of the resulting model. As an illustration, we work with the even-parity dataset and produce an estimate for the generalization error as a function of the fraction of the dataset used in training.
Contents
1. Introduction
Acknowledgments
2. Densities and reduced densities
2.1. Reconstructing a pure state from its reduced densities
3. Reduced densities of classical probability distributions
3.1. Learning from samples
4. The Training Algorithm
5. Under the hood
5.1. High-level summary
5.2. Combinatorics of reduced densities
6. Experiments
7. Conclusion
References

1. Introduction
In this paper, we present a deterministic algorithm for unsupervised generative modeling on strings using tensor networks. The algorithm is deterministic with a fixed number of steps, and the resulting model has a perfect sampling algorithm that allows efficient sampling from marginal distributions, or sampling conditioned on a substring. The algorithm is inspired by the density matrix renormalization group (DMRG) procedure [1, 2, 3]. This approach, at its heart, involves only simple linear algebra, which allows us to give a detailed "under the hood" look at the algorithm in action. Our analysis illustrates how to interpret the trained model and how to go beyond worst-case bounds on generalization errors. We work through the algorithm with an exemplar dataset to produce a prediction for the generalization error as a function of the fraction of data used in training, which well approximates the generalization error observed in experiments.

The machine learning problem of interest is to learn a probability distribution on a set of sequences from a finite training set of samples. For us, an important technical and conceptual first step is to pass from
Finite Sets to Functions on Finite Sets. Functions on sets have more structure than sets themselves, and we find that the extra structure is meaningful. Furthermore, well-understood concepts and techniques in quantum physics give us powerful tools to exploit this extra structure without incurring significant algorithmic costs [4]. We emphasize that it is not necessary that the datasets being modeled have any inherently quantum properties or interpretation. The inductive bias of the model can be understood as a kind of low-rank factorization hypothesis, a point we expand upon in this paper.

Reduced density operators play a central role in our model. In a happy coincidence, they play the central role in both the model's theoretical inspiration and the training algorithm. There is structure in reduced densities that inspires us to model classical probability distributions using a quantum model. The training algorithm amounts to successively matching reduced densities, a process which leads inevitably to a tensor network model, which may be thought of as a sequence of compatible autoencoders. We refer readers unfamiliar with tensor diagram notation to references such as [5, 6, 7].

This paper also builds on investigations of tensor networks as models for machine learning tasks. Tensor networks have been demonstrated to give good results for supervised learning and regression tasks [8, 3, 9, 10, 11, 12, 13, 14]. They have also been applied successfully to unsupervised, generative modeling [15, 16, 17, 18], including a study based on the parity dataset we use here [17]. This work focuses on the latter task, proposing and studying an alternative algorithm for optimizing MPS for generative modeling. The expressivity of models like the one considered in this paper has been studied in [19]. In this paper, we focus on understanding how our training algorithm learns to generalize.
Acknowledgments.
The authors thank Gabriel Drummond-Cole, Glen Evenbly, James Stokes, and Yiannis Vlassopoulos for helpful discussions, and are happy to acknowledge KITP Santa Barbara, the Flatiron Institute, and Tunnel for support and excellent working conditions.
2. Densities and reduced densities
For our purposes, the passage from classical to quantum can be thought of as the passage from
Finite Sets to Functions on Finite Sets, which have a natural Hilbert space structure. We are interested in probability distributions on finite sets. The quantum version of a probability distribution is a density operator on a Hilbert space. The quantum version of a marginal probability distribution is a reduced density operator. The operation that plays the role of marginalization is the partial trace. In our setup, the reduced densities contain more information than the marginal distributions associated to them, and much of our work concerns this extra information.

Given a finite set S, one has the free vector space V = C^S consisting of complex-valued functions on S, which is a Hilbert space with inner product

⟨f|g⟩ = Σ_{s∈S} \overline{f(s)} g(s).

The free vector space comes with a natural map S → C^S, which we recall in a moment. To avoid confusion, it is helpful to use notation to distinguish between an element s ∈ S and its image in C^S, which is a vector. Commonly, the vector image of s is denoted with a boldface font or an overset arrow. We like the bra and ket notation, which is better when inner products are involved. For any s ∈ S, let |s⟩ denote the function S → C that sends s ↦ 1 and s′ ↦ 0 for s′ ≠ s. The set {|s⟩} is an independent, orthonormal spanning set for V. If one chooses an ordering on the set S, say S = {s_1, ..., s_d}, then |s_j⟩ is identified with the j-th standard basis vector in C^d, thus defining an isometric isomorphism V ≅ C^d and a "one-hot" encoding S ↪ C^d. More generally, we denote elements in V by ket notation |ψ⟩ ∈ V.

For any |ψ⟩ ∈ V, there is a linear functional in V* whose value on |φ⟩ ∈ V is the inner product ⟨ψ|φ⟩. We denote this linear functional by the succinct bra notation ⟨ψ| ∈ V*. Every linear functional in V* is of the form ⟨ψ| for some |ψ⟩ ∈ V. We have vectors |ψ⟩ ∈ V and covectors ⟨ψ| ∈ V*, and the map |ψ⟩ ↔ ⟨ψ| defines a natural isomorphism between V and V*. We have chosen to distinguish between vectors and covectors with bra and ket notation; we will not imbue upper and lower indices with any special meaning.

When several spaces V, W, ... are in play, some tensor product symbols are suppressed. So, for instance, if |ψ⟩ ∈ V and |φ⟩ ∈ W, we will write |ψ⟩|φ⟩, or even |ψφ⟩, instead of |ψ⟩ ⊗ |φ⟩ ∈ V ⊗ W. An expression like |φ⟩⟨ψ| is an element of W ⊗ V*, naturally identified with an operator V → W. The expression |ψ⟩⟨ψ| is an element of End(V). Here, End(V) denotes the space of all linear operators on V and, in the presence of a basis, is identified with the dim(V) × dim(V) matrices. If |ψ⟩ is a unit vector, then the operator |ψ⟩⟨ψ| is orthogonal projection onto |ψ⟩: it maps |ψ⟩ ↦ |ψ⟩ and maps every vector perpendicular to |ψ⟩ to zero.

A density operator, or just density for short, is a unit-trace, positive semi-definite linear operator on a Hilbert space. Sometimes a density is called a quantum state.
If S is a finite set and V = C^S, then a density ρ : V → V defines a probability distribution π_ρ : S → R on S by the Born rule

(1)  π_ρ(s) = ⟨s|ρ|s⟩.

Going the other way, there are multiple ways to define a density ρ : V → V from a classical probability distribution π on S so that π_ρ = π. One way is as a diagonal operator: ρ_diag := Σ_{s∈S} π(s) |s⟩⟨s|. Another way is to define

(2)  ρ_π = |ψ⟩⟨ψ|  where  |ψ⟩ := Σ_{s∈S} √(π(s)) |s⟩.

There exist other densities that realize π via the Born rule, but think of the diagonal density and the projection onto |ψ⟩ as two extremes. The density ρ_π has minimal rank and ρ_diag has maximal rank. In the language of quantum mechanics, a state is pure if it has rank one and is mixed otherwise. The degree to which a state is mixed is measured by its von Neumann entropy, −tr(ρ ln(ρ)), which ranges from zero in the case of ρ_π up to the Shannon entropy of the classical distribution π in the case of ρ_diag. In this paper, we always use the pure state ρ := ρ_π. To summarize, we associate to any probability distribution π : S → R the density ρ_π : V → V defined by Equation (2), which has the property that π_{ρ_π} = π.

If a set S is a Cartesian product S = A × B, then the Hilbert space C^S decomposes as a tensor product C^S ≅ C^A ⊗ C^B. In this case, a density ρ : C^A ⊗ C^B → C^A ⊗ C^B is the quantum version of a joint probability distribution π : A × B → R. By an operation that is analogous to marginalization, ρ gives rise to two densities ρ_A : C^A → C^A and ρ_B : C^B → C^B, which we refer to as reduced densities. We now describe this operation, which is called the partial trace.

If X and Y are finite-dimensional vector spaces, then End(X ⊗ Y) is isomorphic to End(X) ⊗ End(Y). Using this isomorphism, there are maps tr_Y : End(X ⊗ Y) → End(X) and tr_X : End(X ⊗ Y) → End(Y) defined by

tr_Y(f ⊗ g) := f tr(g)  and  tr_X(f ⊗ g) := g tr(f)

for f ∈ End(X) and g ∈ End(Y). The maps tr_Y and tr_X are called partial traces. The partial trace preserves both trace and positive semi-definiteness, and so the image of any density ρ ∈ End(X ⊗ Y) under partial trace defines reduced densities tr_Y ρ ∈ End(X) and tr_X ρ ∈ End(Y).

It is worth noting that while we have maps End(X) ⊗ End(Y) → End(X) and End(X) ⊗ End(Y) → End(Y), there do not exist natural maps V ⊗ W → V or V ⊗ W → W for arbitrary vector spaces V and W; the partial trace is special in that it is defined on endomorphism spaces.
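To make these definitions concrete, here is a minimal NumPy sketch of our own (the joint distribution and set sizes below are illustrative, not from the paper). It builds both the diagonal density and the pure density ρ_π from a small distribution on A × B, checks the Born rule, and computes partial traces by reshaping.

import numpy as np

# Toy joint distribution pi(a, b) on S = A x B with |A| = 2, |B| = 3 (illustrative values).
pi = np.array([[0.1, 0.2, 0.0],
               [0.3, 0.0, 0.4]])          # sums to 1
dA, dB = pi.shape

# Pure state |psi> = sum_s sqrt(pi(s)) |s>, flattened into C^(dA*dB); rho_pi = |psi><psi|.
psi = np.sqrt(pi).reshape(-1)
rho_pure = np.outer(psi, psi)

# Diagonal density rho_diag = sum_s pi(s) |s><s|.
rho_diag = np.diag(pi.reshape(-1))

# Born rule: the diagonal of either density recovers pi.
assert np.allclose(np.diag(rho_pure), pi.reshape(-1))
assert np.allclose(np.diag(rho_diag), pi.reshape(-1))

# Partial traces: view End(C^A ⊗ C^B) as a 4-index array and sum over the traced pair.
rho4 = rho_pure.reshape(dA, dB, dA, dB)
rho_A = np.einsum('abcb->ac', rho4)        # tr_B rho, acts on C^A
rho_B = np.einsum('abad->bd', rho4)        # tr_A rho, acts on C^B
assert np.isclose(np.trace(rho_A), 1.0) and np.isclose(np.trace(rho_B), 1.0)
assert np.allclose(np.diag(rho_A), pi.sum(axis=1))   # marginal pi_A sits on the diagonal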
2.1. Reconstructing a pure state from its reduced densities.

We now discuss the problem of reconstructing a pure quantum state ρ on a product X ⊗ Y from its reduced densities ρ_X and ρ_Y.

Using the isomorphism X ≅ X* that is available in any finite-dimensional Hilbert space, one can view any vector |ψ⟩ in a product of Hilbert spaces X ⊗ Y as an element of X* ⊗ Y, hence as a linear map M : X → Y. Computationally, if |ψ⟩ is expressed using bases {|a⟩} of X and {|b⟩} of Y as

|ψ⟩ = Σ_{a,b} m_{ab} |a⟩ ⊗ |b⟩,

then the coefficients {m_{ab}} of that sum can be reshaped into a dim(Y) × dim(X) matrix M. A singular value decomposition (SVD) of M gives a factorization M = V D U* with V and U unitary and D diagonal, as in Figure 1.
Figure 1. A tensor network diagram following |ψ⟩ ∈ X ⊗ Y through the isomorphisms X ⊗ Y ≅ X* ⊗ Y ≅ hom(X, Y), leading to the singular value decomposition M = V D U* with V and U unitary.

The columns {|f_i⟩} of the matrix V are the left singular vectors of M. They are the eigenvectors of M M* and comprise an orthonormal basis for the image of M. The columns {|e_i⟩} of the matrix U are the right singular vectors of M. They are the eigenvectors of M* M, an orthonormal set of vectors spanning a subspace of X isomorphic to the image of M.
The nonnegative real numbers {σ_i} on the diagonal of D are the singular values of the matrix M. The matrices M* M and M M* have the same eigenvalues {λ_i}, which are the squares of the singular values, λ_i := σ_i². The map M defines a bijection between the {|e_i⟩} and the {|f_i⟩}. Specifically, M acts as

(3)  |e_i⟩ ↦ σ_i |f_i⟩

and maps the orthogonal complement of the span of the {|e_i⟩} to zero.

Now, given a unit vector |ψ⟩ ∈ X ⊗ Y, we have the density ρ = |ψ⟩⟨ψ| ∈ X ⊗ Y ⊗ Y* ⊗ X* and the reduced densities ρ_X : X → X and ρ_Y : Y → Y. The reduced densities of ρ are related to the operator M : X → Y fashioned from |ψ⟩ as follows:

(4)  ρ_X = M* M  and  ρ_Y = M M*,

as illustrated in Figure 2.

Figure 2. A tensor network diagram showing that ρ_X = M* M and ρ_Y = M M*.

The singular vectors {|e_i⟩} and {|f_i⟩} of M are precisely the eigenvectors of the reduced densities. Therefore, the density ρ can be completely reconstructed from its reduced densities ρ_X and ρ_Y. One obtains |ψ⟩ by gluing the eigenvectors of the reduced densities along their shared eigenvalues (Figure 3). In the nondegenerate case that the eigenvalues are distinct, there is a unique way to glue the {|e_i⟩} and the {|f_i⟩}, and |ψ⟩ is recovered perfectly.

Figure 3. Reconstructing |ψ⟩ from the eigenvectors of ρ_X and ρ_Y and their shared eigenvalues.
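The following NumPy sketch (an illustration of ours, not from the paper) reshapes a random pure state into the matrix M, verifies Equation (4), and rebuilds |ψ⟩ by gluing the paired eigenvectors of ρ_X and ρ_Y along their shared eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
dX, dY = 3, 4

# A random unit vector |psi> in X ⊗ Y, stored as coefficients m_ab.
m = rng.normal(size=(dX, dY))
m /= np.linalg.norm(m)

# View |psi> as an operator M : X -> Y; M is the dim(Y) x dim(X) matrix m.T.
M = m.T

# Reduced densities via Equation (4).
rho_X = M.conj().T @ M          # acts on X
rho_Y = M @ M.conj().T          # acts on Y

# SVD of M: columns of V are |f_i> (eigenvectors of rho_Y),
# columns of U are |e_i> (eigenvectors of rho_X), paired by the SVD.
V, sigma, U_dag = np.linalg.svd(M, full_matrices=False)
U = U_dag.conj().T

# Eigenvalues of the reduced densities are the squared singular values.
assert np.allclose(np.sort(np.linalg.eigvalsh(rho_X))[::-1], sigma**2)

# Glue the paired eigenvectors along their shared eigenvalues (Equation (3)):
# psi_ab = sum_i sigma_i <a|e_i> <b|f_i>.
m_rebuilt = (U * sigma) @ V.T
assert np.allclose(m_rebuilt, m)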
3. Reduced densities of classical probability distributions

Let π : S → R be a probability distribution and consider the density ρ_π as in Equation (2). Suppose S ⊂ A × B and let ρ_A = tr_Y ρ and ρ_B = tr_X ρ denote the reduced densities where, as above, X = C^A, Y = C^B, and V = X ⊗ Y. Let us now interpret the matrix representation of these reduced densities.
We compute:

ρ = |ψ⟩⟨ψ| = ( Σ_{(a,b)∈S} √(π(a,b)) |a⟩ ⊗ |b⟩ ) ( Σ_{(a′,b′)∈S} √(π(a′,b′)) ⟨a′| ⊗ ⟨b′| )
  = Σ_{(a,b)∈S, (a′,b′)∈S} √(π(a,b)) √(π(a′,b′)) |a⟩⟨a′| ⊗ |b⟩⟨b′|.

We compute the partial trace tr_Y(|a⟩⟨a′| ⊗ |b⟩⟨b′|) = ⟨b|b′⟩ |a⟩⟨a′|. Since ⟨b|b′⟩ = 1 if b = b′ and zero otherwise, we can understand the (a, a′) entry of the reduced density ρ_A as

(5)  (ρ_A)_{a′a} = Σ_{b∈B} √(π(a,b) π(a′,b)).

In particular, the diagonal entry (ρ_A)_{aa} is Σ_{b∈B} π(a,b), and we see the marginal distribution π_A : A → R along the diagonal of the reduced density ρ_A. We make the consistent observation that ρ_A has unit trace. The off-diagonal entries of ρ_A are determined by the extent to which a, a′ ∈ A have the same continuations in B. Note that ρ_A is symmetric. The reduced density on B is similarly given:

(6)  (ρ_B)_{b′b} = Σ_{a∈A} √(π(a,b) π(a,b′)).

So the reduced densities of ρ contain all the information of the marginal distributions π_A and π_B, and more. Now let's take a look at the extra information carried by the reduced densities, which is entirely contained in the off-diagonal entries. Since the entire state, and therefore π itself, can be reconstructed from the eigenvectors and eigenvalues of ρ_A and ρ_B, we know that from a high level this spectral information encodes the conditional probabilities that are lost by the classical process of marginalization. En route to decoding this spectral information, let us describe how an arbitrary density τ is a classical mixture of pure quantum states. If |e_1⟩, ..., |e_k⟩ is a basis for the image of a density τ consisting of orthonormal eigenvectors, then the corresponding eigenvalues λ_1, ..., λ_k are nonnegative real numbers whose sum is one. One has

τ = Σ_{i=1}^{k} λ_i |e_i⟩⟨e_i|.

The density τ defines a probability distribution on pure states: the probability of the pure state |e_i⟩⟨e_i| is λ_i. Then |e_i⟩⟨e_i| defines a probability distribution on the computational basis {|s⟩} via the Born rule: the probability of s is ⟨s|e_i⟩⟨e_i|s⟩ = |⟨e_i|s⟩|².

We're interested in the reduced densities of ρ = |ψ⟩⟨ψ|, and in this case there exists a one-to-one correspondence |e_i⟩ ↔ |f_i⟩ between eigenvectors of the reduced densities ρ_A := tr_Y(ρ) and ρ_B := tr_X(ρ) spanning their respective images,

ρ_A = Σ_{i=1}^{k} λ_i |e_i⟩⟨e_i|  and  ρ_B = Σ_{i=1}^{k} λ_i |f_i⟩⟨f_i|,

as outlined in Section 2.1. Putting together the general picture of a density as a mixture of pure states with the reduced densities of a pure state leads one to the following paradigm.
With probability λ_i, the prefix subsystem will be in a state determined by the corresponding eigenvector |e_i⟩ of ρ_A, and the corresponding suffix subsystem will be in a state determined by the eigenvector |f_i⟩. The vector |e_i⟩ = Σ_a γ_{ai} |a⟩ determines a probability distribution on the set of prefixes A: the probability of the prefix a is |γ_{ai}|². The vector |f_i⟩ = Σ_b β_{bi} |b⟩ determines a probability distribution on the set of suffixes B: the probability of b is |β_{bi}|².
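As a concrete illustration (a toy example of ours, not from the paper), the following sketch builds ρ_A from a small joint distribution via Equation (5), checks that its diagonal is the marginal π_A, and reads off the Born-rule prefix distributions |γ_{ai}|² from its eigenvectors.

import numpy as np

# Toy joint distribution pi(a, b) on A x B, |A| = 2, |B| = 3 (illustrative values).
pi = np.array([[0.25, 0.25, 0.0],
               [0.25, 0.0,  0.25]])

# Equation (5): (rho_A)_{a a'} = sum_b sqrt(pi(a, b) pi(a', b)).
sq = np.sqrt(pi)
rho_A = sq @ sq.T

# The diagonal of rho_A is the classical marginal pi_A.
assert np.allclose(np.diag(rho_A), pi.sum(axis=1))

# Eigendecomposition: rho_A = sum_i lambda_i |e_i><e_i|.
lam, E = np.linalg.eigh(rho_A)                 # ascending eigenvalues, columns are |e_i>
for i in range(len(lam) - 1, -1, -1):
    if lam[i] > 1e-12:
        prefix_dist = np.abs(E[:, i]) ** 2     # Born-rule distribution on prefixes
        print(f"lambda = {lam[i]:.3f}, distribution on A = {prefix_dist}")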
As a final remark, if we had begun with the diagonal density

ρ_diag = Σ_{(a,b)∈A×B} π(a,b) (|a⟩ ⊗ |b⟩)(⟨a| ⊗ ⟨b|),

whose Born distribution is also π, then the matrices representing ρ_A and ρ_B would be diagonal matrices with the marginal distributions on A and B along the diagonals and all off-diagonal elements zero. The eigenvectors of ρ_A and ρ_B are simply the prefixes |a⟩ and suffixes |b⟩ and carry no further information. The process of computing reduced densities of ρ_diag is nothing more than the process of marginalization. We always use the pure state ρ = |ψ⟩⟨ψ|, ensuring that the reduced densities carry information about subsystem interactions. The eigenvectors of the reduced densities, which are linear combinations of prefixes and linear combinations of suffixes, interact through their eigenvalues and capture rich information about the prefix-suffix system.

Let us summarize. Begin with a classical probability distribution π on a product set S = A × B. Form a density ρ_π on C^{A×B} by the formula in Equation (2). The reduced densities ρ_A and ρ_B on C^A and C^B contain the marginal distributions π_A and π_B on their diagonals, but they are not diagonal operators. The eigenvectors of these reduced densities encode information about prefix-suffix interactions. The prefix-suffix interactions are tantamount to conditional probabilities and carry sufficient information to reconstruct the density ρ.
3.1. Learning from samples.

In the machine learning applications to come, the goal is to learn ρ_π defined in Equation (2) from a set {s_1, ..., s_{N_T}} of samples drawn from a probability distribution π. Each sample s_i will be a sequence (x_1, ..., x_N) of a fixed length N. The algorithm to learn the density ρ_π on the full set of sequences S is an inductive procedure. One only works with a density ρ defined using the sample set, since the density ρ_π for the entire distribution π is unavailable. The procedure begins by computing the reduced density ρ_A and its eigenvectors for a subsystem A consisting of short prefixes. Step by step, the size of the subsystem A is increased until one reaches a point where the suffix subsystem B is small. In a final step, ρ is recombined from the collected eigenvectors of ρ_A for all the prefix systems A and the eigenvectors and eigenvalues of ρ_B. This procedure leads naturally to a tensor network approximation for ρ.

An important point is that the reduced density ρ_A operates in a space whose dimension grows exponentially with the length of the prefix system A. So, instead of computing ρ_A exactly, it is computed by a sequence of approximations that keep its rank small. The modeling hypothesis is that π is a distribution whose corresponding quantum state ρ_π has low rank, in the sense that the reduced densities ρ_A and ρ_B are low-rank operators for all prefix-suffix subsystems A and B. The large rank of the density ρ witnessed from the empirical distribution drawn from π is regarded as sampling error. Therefore, under the modeling hypothesis, the process of replacing the empirically computed reduced densities with low-rank approximations should be thought of as repairing a state damaged by sampling errors. The low-rank modeling hypothesis can lead to excellent generalization properties for the model.

Let us continue our analysis of the reduced densities as in the previous sections, using notation appropriate for the machine learning algorithm. Let T be a training set of samples T = {s_1, ..., s_{N_T}}. We use N_T for the number of training examples. Each sample s_i will be a sequence of symbols from a fixed alphabet Σ of a fixed length N. We will designate a cut to obtain a prefix a_i and suffix b_i whose concatenation is the sample s_i = (a_i, b_i) ∈ Σ^N. This provides a decomposition of T as T ⊂ A × B, where A = {a_1, a_2, ..., a_{N_T}} and B = {b_1, b_2, ..., b_{N_T}} are the sampled prefixes and suffixes. For the applications we have in mind, samples in T will be distinct. That is, (a_i, b_i) ≠ (a_j, b_j) if i ≠ j, though crucially it may happen that a_i = a_j or b_i = b_j for i ≠ j. Let π̂ be the resulting empirical distribution on T, so that

(7)  π̂(a,b) = 1/N_T if (a,b) ∈ T, and 0 otherwise.

Following Equation (2), we have

(8)  |ψ⟩ = (1/√N_T) Σ_{i=1}^{N_T} |s_i⟩,

the empirical density ρ = |ψ⟩⟨ψ|, and its partial trace

(9)  ρ_A = (1/N_T) Σ_{i,j=1}^{N_T} s(a_i, a_j) |a_i⟩⟨a_j|.

Here the sum is expressed in terms of the indices i, j, which range over the number of samples. The coefficient s(a_i, a_j) of |a_i⟩⟨a_j| is a nonnegative integer, namely the number of times that a_i and a_j have the same continuation b_i = b_j. It may be convenient to have some notation for shared continuations.
For any pair a, a′ of elements of A, let T_{a,a′} be the subset of B consisting of shared continuations of a and a′:

(10)  T_{a,a′} = {b ∈ B : (a,b) ∈ T and (a′,b) ∈ T}.

So the (a, a′) entry of the matrix representing ρ_A is the cardinality of the set T_{a,a′} multiplied by an overall factor of 1/N_T. A similar combinatorial description holds for the reduced density on B,

ρ_B = (1/N_T) Σ_{i,j} s(b_i, b_j) |b_i⟩⟨b_j|,

where s(b_i, b_j) is the number of common prefixes that b_i and b_j share.
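A minimal sketch (our own illustration; the training strings are hypothetical) of Equations (9) and (10): build the empirical reduced density on prefixes by counting shared continuations in a small training set.

from collections import defaultdict
import numpy as np

# Hypothetical training set of length-4 strings, cut after the first 2 symbols.
train = ["0011", "0000", "1100", "0110", "1010", "0101"]
cut = 2
pairs = [(s[:cut], s[cut:]) for s in train]
N_T = len(pairs)

prefixes = sorted({a for a, _ in pairs})
idx = {a: i for i, a in enumerate(prefixes)}

# Group prefixes by their continuations: suffix b -> list of prefixes seen with b.
by_suffix = defaultdict(list)
for a, b in pairs:
    by_suffix[b].append(a)

# Equation (9): (rho_A)_{a a'} = |T_{a,a'}| / N_T, counting shared continuations.
rho_A = np.zeros((len(prefixes), len(prefixes)))
for b, alist in by_suffix.items():
    for a in alist:
        for a2 in alist:
            rho_A[idx[a], idx[a2]] += 1.0 / N_T

print(prefixes)
print(rho_A)          # diagonal: vertex degrees / N_T; off-diagonal: shared-suffix counts / N_T
assert np.isclose(np.trace(rho_A), 1.0)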
The counting involved can be visualized with graphs. Every probability distribution π̂ on a Cartesian product A × B uniquely defines a weighted bipartite graph: the two vertex sets are A and B, and the edge joining a and b is labeled by π̂(a,b). Here, because we assume the samples in T are distinct, the graph can be simplified, since π̂(a,b) is either 0 or 1/N_T. We draw an edge from a to b if (a,b) ∈ T, we omit the edge if (a,b) ∉ T, and we understand the probabilities to be obtained by dividing by N_T, which is the total number of edges in the graph.

[bipartite graph omitted: two prefix vertices a_1, a_2 joined to four suffix vertices b_1, ..., b_4 by six edges]

In the example above, the total number of edges is the sample size N_T = 6. The probability of any pair joined by an edge is 1/6, and the probability of any pair not joined by an edge is 0. Now we illustrate how to read off the entries of the reduced density ρ_A from the graph. There will be an overall factor of 1/N_T multiplying a matrix of nonnegative integers. The diagonal entries are d(a), the degree of vertex a. The (a, a′) entry is the number of shared suffixes, which equals the number of paths of length 2 between a and a′.

Given any graph with |A| = 2, such as the one above, the reduced density on the prefix subsystem is equal to

(11)  ρ_A = (1/N_T) [ d_1  s ; s  d_2 ],

where the diagonal entries are the degrees of the vertices and s is the number of paths of length two, which equals the number of degree-two vertices of B. The denominator of the coefficient, N_T = d_1 + d_2, is the total number of edges in the graph. The eigenvalues λ_+ and λ_− and (unnormalized) eigenvectors |e_+⟩ and |e_−⟩ of this matrix have simple, explicit expressions in terms of the gap G = d_1 − d_2 in the diagonal entries and the off-diagonal entry s. Namely,

(12)  λ_+ = (N_T + √(G² + 4s²)) / (2 N_T)  and  λ_− = (N_T − √(G² + 4s²)) / (2 N_T)

and

(13)  |e_+⟩ = (√(G² + 4s²) + G, 2s)ᵀ  and  |e_−⟩ = (√(G² + 4s²) − G, −2s)ᵀ.
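A quick numerical check of Equations (11) through (13) (our own sketch; the degrees and shared-suffix count below are arbitrary, subject to s ≤ min(d_1, d_2)):

import numpy as np

d1, d2, s = 4, 2, 2            # vertex degrees and shared-suffix count (s <= min(d1, d2))
N_T = d1 + d2                  # total number of edges
G = d1 - d2                    # gap

rho_A = np.array([[d1, s], [s, d2]]) / N_T

# Closed forms from Equations (12) and (13).
root = np.sqrt(G**2 + 4 * s**2)
lam_plus, lam_minus = (N_T + root) / (2 * N_T), (N_T - root) / (2 * N_T)
e_plus = np.array([root + G, 2 * s])
e_minus = np.array([root - G, -2 * s])

# Compare against a direct eigendecomposition.
assert np.allclose(np.sort(np.linalg.eigvalsh(rho_A)), sorted([lam_plus, lam_minus]))
assert np.allclose(rho_A @ e_plus, lam_plus * e_plus)
assert np.allclose(rho_A @ e_minus, lam_minus * e_minus)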
4. The Training Algorithm

Suppose that |ψ⟩ ∈ V_1 ⊗ ⋯ ⊗ V_N, which we depict as a tensor with one dangling leg for each of the spaces V_1, ..., V_N. There are various sorts of decompositions of such a tensor that are akin to an iterated SVD. We will describe one decomposition that results in a factorization of |ψ⟩ into what is called a matrix product state (MPS) or, synonymously, a tensor train decomposition. The process defines a sequence of "bond" spaces {B_k} and operators U_k : B_k → B_{k−1} ⊗ V_k whose composition U_1 U_2 ⋯ U_{N−1} U_N reproduces |ψ⟩. The initial operator has the form U_1 : B_1 → V_1 and the final tensor has the form U_N ∈ B_{N−1} ⊗ V_N. We begin with B_1 = V_1 and set U_1 : B_1 → V_1 to be the identity. For k = 2, ..., N − 1 we define U_k inductively.

To describe the inductive process, first notice that for any k = 1, ..., N − 1,

V_1 ⊗ ⋯ ⊗ V_N ≅ (V_1 ⊗ ⋯ ⊗ V_k) ⊗ (V_{k+1} ⊗ ⋯ ⊗ V_N).

The operator

(14)  α_k : V_1 ⊗ ⋯ ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N

fashioned from |ψ⟩ is the corresponding reshaping of |ψ⟩. The operators U_1, ..., U_k compose to give an operator U_1 U_2 ⋯ U_k : B_k → V_1 ⊗ ⋯ ⊗ V_k. One then has the composition

β_k := α_k U_1 U_2 ⋯ U_k : B_k → V_{k+1} ⊗ ⋯ ⊗ V_N.
The inductive hypothesis is that

α_k U_1 U_2 ⋯ U_k U*_k ⋯ U*_2 U*_1 = α_k.

In the penultimate step, one has the operator α_{N−1} U_1 U_2 ⋯ U_{N−1} : B_{N−1} → V_N. The final step is to define U_N as the adjoint of this operator: U_N = (α_{N−1} U_1 U_2 ⋯ U_{N−1})*. Therefore, the entire composition reduces nicely:

U_1 U_2 ⋯ U_{N−1} U_N = U_1 U_2 ⋯ U_{N−1} U*_{N−1} ⋯ U*_2 U*_1 α*_{N−1} = α*_{N−1}.

The final equality follows from the adjoint of the inductive hypothesis. The outcome α*_{N−1} : V*_N → V_1 ⊗ ⋯ ⊗ V_{N−1}, after a minor reshaping, is the same as |ψ⟩.

To define the inductive step, assume the spaces B_1, ..., B_{k−1} and operators U_1, ..., U_{k−1} have been defined and satisfy the inductive hypothesis. Reshape the operator β_{k−1} : B_{k−1} → V_k ⊗ V_{k+1} ⊗ ⋯ ⊗ V_N as a map B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N. An SVD of this map yields

α_{k−1} U_1 ⋯ U_{k−1} = W_k D_k U*_k.

The adjoint of the map U*_k : B_{k−1} ⊗ V_k → B_k is then defined to be U_k : B_k → B_{k−1} ⊗ V_k and becomes the next tensor in the MPS decomposition. To check that the inductive hypothesis is satisfied, note that

α_k U_1 ⋯ U_{k−1} U_k U*_k = α_{k−1} U_1 ⋯ U_{k−1},

since α_{k−1} U_1 ⋯ U_{k−1} = W_k D_k U*_k and U*_k U_k = 1.

In our application, the vector |ψ⟩ and the operators β_{k−1} : B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N operate in spaces of such high dimension that neither they, nor a direct SVD of them, is feasible. Nonetheless, the U_k operators can be obtained from an SVD of a reduced density operating in the effective space B_{k−1} ⊗ V_k:

β*_{k−1} β_{k−1} : B_{k−1} ⊗ V_k → B_{k−1} ⊗ V_k.

In our application, this effective reduced density can be computed as a double sum over the training examples, so we can efficiently compute the tensors required for the inductive steps. Then, in the final step, the complementary space is small, so the final map U_N D_N : B_{N−1} → V_N completes the reconstruction. More specifically, to define the U_k we only need an eigenvector decomposition of β*_{k−1} β_{k−1}, which is given by a formula like the one in Equation (9).

In general, when factoring an arbitrary vector as an MPS, the bond spaces B_k grow exponentially fast. Therefore, we may characterize datasets for which the MPS model is a good model by saying that |ψ⟩ as defined in Equation (2) has an MPS factorization whose bond spaces B_k remain small. Alternatively, one can truncate or restrict the dimensions of the spaces B_k, resulting in a low-rank MPS approximation of |ψ⟩. As a criterion for this truncation, one can inspect the singular values at each inductive step and discard those which are small according to a pre-determined cutoff, along with the corresponding columns of U and W. In the even-parity dataset that we investigate as an example, we truncate B_k to two dimensions throughout.
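The following NumPy sketch (ours, not the paper's implementation; it works directly with the full state vector, which is only feasible for small N) factors |ψ⟩ into an MPS by the iterated SVD just described, optionally truncating each bond to a fixed dimension.

import numpy as np

def mps_factorize(psi, dims, max_bond=None):
    """Factor a state vector psi in V_1 ⊗ ... ⊗ V_N into MPS tensors by sweeping
    left to right with SVDs, truncating each bond to max_bond if given."""
    tensors = []
    bond = 1
    rest = psi.reshape(bond, -1)
    for d in dims[:-1]:
        # Reshape as a map from (bond ⊗ V_k) to the remaining sites and SVD it.
        m = rest.reshape(bond * d, -1)
        u, s, vh = np.linalg.svd(m, full_matrices=False)
        if max_bond is not None:
            u, s, vh = u[:, :max_bond], s[:max_bond], vh[:max_bond]
        tensors.append(u.reshape(bond, d, -1))    # left isometry at this site
        bond = u.shape[1]
        rest = np.diag(s) @ vh                    # carry the remainder to the right
    tensors.append(rest.reshape(bond, dims[-1], 1))
    return tensors

def mps_contract(tensors):
    """Rebuild the full state vector from the MPS tensors (for testing)."""
    out = tensors[0]
    for t in tensors[1:]:
        out = np.einsum('...a,abc->...bc', out, t)
    return out.reshape(-1)

# Example: the uniform superposition over even-parity bitstrings of length N.
N = 6
dims = [2] * N
psi = np.zeros(2 ** N)
for x in range(2 ** N):
    if bin(x).count('1') % 2 == 0:
        psi[x] = 1.0
psi /= np.linalg.norm(psi)

mps = mps_factorize(psi, dims, max_bond=2)   # bond dimension 2 suffices for parity
assert np.allclose(mps_contract(mps), psi)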
To understand whether this kind of low-rank approximation is useful, recall that the eigenvectors and eigenvalues of the reduced densities carry the essential prefix-suffix interactions. By having a training algorithm that emphasizes these eigenvalues and eigenvectors as the most important features of the data throughout training, the resulting model should be interpreted as capturing the most important prefix-suffix interactions. We view these prefix-suffix interactions as a proxy for the meaning of substrings within a language of larger strings.

5. Under the hood
With an in-depth understanding of the training algorithm, we aim to predict experimental results, given only the fraction 0 < f ≤ 1 of the dataset used in training.
As an example, we perform an analysis of how well the algorithm learns on the even-parity dataset. Let Σ = {0, 1} and consider the set Σ^N of bitstrings of a fixed length N. Define the parity of a bitstring (b_1, ..., b_N) to be

(15)  parity(b_1, ..., b_N) := Σ_{i=1}^{N} b_i mod 2.

The set Σ^N is partitioned into even and odd bitstrings:

E_N = {s ∈ Σ^N : parity(s) = 0}  and  O_N = {s ∈ Σ^N : parity(s) = 1}.

Consider the probability distribution π : Σ^N → R uniformly concentrated on E_N:

π(x) = 2^{−(N−1)} if x ∈ E_N, and 0 if x ∈ O_N.

This distribution defines a density ρ_π = |E_N⟩⟨E_N| where

(16)  |E_N⟩ = (1/√(2^{N−1})) Σ_{s∈E_N} |s⟩ ∈ V_1 ⊗ V_2 ⊗ ⋯ ⊗ V_N,

where V_j ≅ C² is the site space spanned by the bits in the j-th position. Choose a subset T = {s_1, ..., s_{N_T}} ⊂ E_N of even-parity bitstrings and let f = N_T / 2^{N−1} be the fraction selected. The empirical distribution on this set defines the vector |ψ⟩ = (1/√N_T) Σ_{i=1}^{N_T} |s_i⟩ as in Equation (8). To begin our analysis of |ψ⟩, let us closely inspect the algorithm's second step. The ideas therein will generalize to subsequent steps.

In step 2, we view each sample s as a prefix-suffix pair (a, b) where a ∈ Σ² and b ∈ Σ^{N−2}. We visualize the training set T as a bipartite graph. Vertices represent prefixes a and suffixes b, and there is an edge joining a and b if and only if (a, b) ∈ T.

[bipartite graphs omitted: the left graph joins the even prefixes 00 and 11 to their suffixes; the right graph joins the odd prefixes 01 and 10 to their suffixes]

Notice that samples in the left graph are concatenations of even-parity bitstrings; samples in the right graph are concatenations of odd-parity bitstrings. Let |ψ_2⟩ ∈ C^{Σ²} ⊗ C^{Σ^{N−2}} denote the sum of the samples after having completed step 1 (pictured as a tensor diagram in Equation (17)), and consider the reduced density ρ_2 = tr_{Σ^{N−2}} |ψ_2⟩⟨ψ_2|. The entries of its matrix representation are understood from the data in the graph. Choosing an ordering on the set Σ², we write ρ_2 as

(18)  ρ_2 = (1/N_T) ( [ d_1  s_e ; s_e  d_2 ] ⊕ [ d_3  s_o ; s_o  d_4 ] ).

The number of training samples N_T is the total number of edges in the graph. The diagonal entries are the degrees of the vertices associated to prefixes: d_1 is the degree of 00, d_2 is the degree of 11, d_3 is the degree of 01, d_4 is the degree of 10. The off-diagonal entries are the numbers of paths of length 2 in each component of the graph. That is, s_e is the number of suffixes that 00 and 11 have in common; s_o is the number of suffixes that 01 and 10 have in common. If T contains all samples, then both graphs are complete bipartite and the nonzero entries of ρ_2 are all equal (to 2^{N−3} in this case). In this case, ρ_2 is a rank-2 operator. It has two eigenvectors, one from each block. This is the idealized scenario: every sequence is present in the training set, and the eigendecomposition ρ_2 = U_2 D_2 U*_2 is then

ρ_2 = ½ ( |E_2⟩⟨E_2| ⊕ |O_2⟩⟨O_2| ),

where |E_2⟩ = (1/√2)(|00⟩ + |11⟩) denotes the normalized sum of even prefixes of length 2, and |O_2⟩ = (1/√2)(|01⟩ + |10⟩) denotes the normalized sum of odd prefixes of length 2. As a matrix, U_2 has |E_2⟩ and |O_2⟩ along its rows. We think of it as a "summarizer": it projects a prefix onto an axis that can be identified with either |E_2⟩ or |O_2⟩ according to its parity, perfectly summarizing the information of that prefix required to understand which suffixes it is paired with.

More generally, however, if T ≠ E_N then the reduced density ρ_2 may be full rank.
In this case we choose the eigenvectors |E′_2⟩, |O′_2⟩ that correspond to the two largest eigenvalues of ρ_2. We assume these eigenvectors come from distinct blocks. This defines the tensor U_2, which as a matrix has |E′_2⟩ and |O′_2⟩ along its rows, where

|E′_2⟩ = cos θ_2 |00⟩ + sin θ_2 |11⟩
|O′_2⟩ = cos φ_2 |01⟩ + sin φ_2 |10⟩

for some angles θ_2 and φ_2. These angles can be computed following the expression in (13) for the eigenvectors:

θ_2 = arctan( 2s_e / (√(G_e² + 4s_e²) + G_e) )  and  φ_2 = arctan( 2s_o / (√(G_o² + 4s_o²) + G_o) ).

Here, G_e = d_1 − d_2 and G_o = d_3 − d_4 denote the gaps between the diagonal entries in each block. The angles should be thought of as measuring the deviation from perfect learning in step 2: if f = 1 then G_e = G_o = 0 and so θ_2 = φ_2 = π/4, giving |E′_2⟩ = |E_2⟩ and |O′_2⟩ = |O_2⟩. In this case, step 2 has worked perfectly. Note that this is not an if-and-only-if scenario. Even if f < 1, the reduced density ρ_2 may still have |E_2⟩ and |O_2⟩ as its eigenvectors. Indeed, this occurs whenever G_e = G_o = 0 and s_e, s_o ≠ 0. In that case, the eigenvectors of ρ_2 are the desired parity vectors |E_2⟩, |O_2⟩, and the summarizer U_2 obtained is a true summarization tensor. But if G_e or G_o is nonzero, then step 2 induces a summarization error, which we measure as the deviation of θ_2 and φ_2 from the desired π/4. A small numerical sketch of this step-2 computation follows.
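Here is a small numerical sketch (ours; the training subset is a hypothetical random draw) of the step-2 computation just described: build the block entries of ρ_2 from a sampled fraction of E_N and compute the angles θ_2, φ_2.

import itertools
import numpy as np

N = 8
even = ["".join(b) for b in itertools.product("01", repeat=N)
        if b.count("1") % 2 == 0]

rng = np.random.default_rng(1)
f = 0.3                                            # fraction of E_N used for training
T = list(rng.choice(even, size=int(f * len(even)), replace=False))

def block(prefixes, samples):
    """Degrees and shared-suffix count for a pair of prefixes (one 2x2 block of rho_2)."""
    suf = {p: {s[2:] for s in samples if s.startswith(p)} for p in prefixes}
    d1, d2 = len(suf[prefixes[0]]), len(suf[prefixes[1]])
    shared = len(suf[prefixes[0]] & suf[prefixes[1]])
    return d1, d2, shared

def angle(d1, d2, s):
    G = d1 - d2
    return np.arctan2(2 * s, np.sqrt(G**2 + 4 * s**2) + G)

d1, d2, s_e = block(["00", "11"], T)               # even block
d3, d4, s_o = block(["01", "10"], T)               # odd block
theta, phi = angle(d1, d2, s_e), angle(d3, d4, s_o)
print(f"theta = {theta:.4f}, phi = {phi:.4f}, target = {np.pi/4:.4f}")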
The same analysis applies to the subsequent steps k = 3, ..., N, with minor adjustments to the combinatorics. So let us now describe the general schema. In the k-th step of the training algorithm, each sample is cut after the k-th bit and viewed as a prefix-suffix pair s = (a, b) where a ∈ Σ^k and b ∈ Σ^{N−k}. Let |ψ_k⟩ ∈ C^{Σ^k} ⊗ C^{Σ^{N−k}} denote the sum of the samples after having completed step k − 1, and let ρ_k := tr_{Σ^{N−k}} |ψ_k⟩⟨ψ_k| denote the reduced density on the prefix subsystem at step k. It is an operator on B_{k−1} ⊗ V_k, where B_{k−1} is a 2-dimensional space which may be identified with the span of the eigenvectors associated to the two largest eigenvalues of ρ_{k−1}. As a matrix, ρ_k is a direct sum of 2 × 2 blocks,

(19)  ρ_k = (1/(e_1 + o_1 + e_2 + o_2)) ( [ e_1  s_e ; s_e  o_1 ] ⊕ [ e_2  s_o ; s_o  o_2 ] ).

We postpone a description of the entries until Section 5.2. But note that, as in the case k = 2, the upper and lower blocks contain combinatorial information about prefixes of even and odd parity, respectively. As before, we are interested in the largest eigenvectors |E′_k⟩, |O′_k⟩ contributed by each block. They define the tensor U_k, which as a matrix has |E′_k⟩ and |O′_k⟩ along its rows, and can be understood inductively. The eigenvectors contain combinatorial information from step k along with data from step k − 1. Let |E′_1⟩ := |0⟩ and |O′_1⟩ := |1⟩. Then for k ≥ 2,

|E′_k⟩ = cos θ_k |E′_{k−1}⟩ ⊗ |0⟩ + sin θ_k |O′_{k−1}⟩ ⊗ |1⟩
|O′_k⟩ = cos φ_k |E′_{k−1}⟩ ⊗ |1⟩ + sin φ_k |O′_{k−1}⟩ ⊗ |0⟩

where

(20)  θ_k = arctan( 2s_e / (√(G_e² + 4s_e²) + G_e) )  and  φ_k = arctan( 2s_o / (√(G_o² + 4s_o²) + G_o) ).

Again, the angles are a measurement of the error accrued in step k. Significantly, no error is accrued when the gaps G_e := e_1 − o_1 and G_o := e_2 − o_2 vanish and the off-diagonal entries s_e, s_o are non-zero, for then θ_k = φ_k = π/4. As a matrix,

U_k = [ cos θ_k  sin θ_k ; cos φ_k  sin φ_k ],

and so U_k is akin to a map B_{k−1} ⊗ V_k → B_k that combines previously summarized information from B_{k−1} with new information from V_k. It then summarizes the resulting data by projecting onto one of two orthogonal vectors, which may be identified with |E′_k⟩ or |O′_k⟩, in the new bond space B_k. (The true orientation of the arrows on U_k in the tensor diagrams is down-left rather than up-right. But the vector spaces in question are finite-dimensional, and our standard bases provide an isomorphism between a space and its dual, so no information is lost by momentarily adjusting the arrows for the purposes of sharing intuition.)

In summary, this template provides a concrete handle on the tensors U_k that comprise the MPS factorization of |ψ⟩.

5.1. High-level summary.
We close by summarizing the high-level ideas present in this under-the-hood analysis. At the k-th step of the training algorithm one obtains a 4 × 4 matrix representing the reduced density ρ_k. It is given in Equation (18) in the case k = 2 and as in Equation (19) when k > 2. These matrices are obtained by tracing out the suffix subsystem from the projection |ψ_k⟩⟨ψ_k|, where |ψ_k⟩ is the sum of the samples in the training set after having completed step k − 1. Since |ψ_k⟩ depends on the error obtained in step k − 1, so does ρ_k. This error is defined by the angles θ_{k−1} and φ_{k−1}. As shown in Equation (20), these angles, and hence the error, are functions of the entries of the matrix representing ρ_{k−1}. So the k-th level density takes into account the errors accrued at each of the previous steps as well as combinatorial information in the present step. A partial trace computation thus directly leads to the matrix representation for ρ_k given in Equation (19). Explicitly, the non-zero entries of the matrix are computed by Equations (23) and (24). With this, one has full knowledge of the matrix ρ_k and therefore of its eigenvectors |E′_k⟩, |O′_k⟩. Written in the computational basis, they are of the form shown in Equation (13). These two eigenvectors then assemble to form the rows of the tensor U_k, viewed as a matrix with |E′_k⟩ and |O′_k⟩ as its two rows. The tensors obtained in this way contract to form the trained MPS |ψ_MPS⟩.

To measure the algorithm's performance, we begin by evaluating the inner product of this vector with an MPS decomposition of the target vector |E_N⟩, namely ⟨E_N|ψ_MPS⟩. The k-th tensor comprising the decomposition of |E_N⟩ is equal to U_k with θ_k and φ_k evaluated at π/4. The contraction thus results in a sum of products of cos θ_k, sin θ_k, cos φ_k, sin φ_k for k = 2, ..., N. More concretely, for each even bitstring s ∈ E_N the inner product ⟨s|ψ_MPS⟩ is the square root of the probability of s. For now, we'll refer to it as the weight w(s) := ⟨s|ψ_MPS⟩ associated to the sample s. For each s, its weight w(s) is a product of various cos θ_k, sin θ_k, cos φ_k, sin φ_k, the details of which are given in Section 5.2. The final overlap is then the sum

(21)  ⟨E_N|ψ_MPS⟩ = (1/√(2^{N−1})) Σ_{s∈E_N} w(s).

Now suppose the training set consists of a fraction f of the entire population. The entries of the reduced densities in (19) are described combinatorially, as detailed in the next section. This makes it possible to make statistical estimates for the gaps G_e and G_o and the off-diagonal entries s_e and s_o in (20). Therefore, we can make statistical predictions for the angles θ_k and φ_k and hence for the tensors U_k comprising the trained MPS and the resulting generalization error. The results are plotted in Figure 4, where we use the Bhattacharyya distance

(22)  −ln( (1/√(2^{N−1})) Σ_{s∈E_N} w(s) )

between the true population distribution and the one defined by either an experimentally trained MPS or the theoretically predicted one as a proxy for generalization error. The theoretical curve could, in principle, be improved by making more accurate statistical estimates for the combinatorics involved.

Figure 4. The experimental average (orange) and theoretical prediction (blue) of the weighted Bhattacharyya distance between the probability distribution learned experimentally and the true distribution, for bit strings of length N = 16 and training-set fractions 0 < f ≤ 1. Panel (a) shows the full range; panel (b) gives a closer look at small f.

5.2. Combinatorics of reduced densities.
We now describe the entries of the k-th level reduced density in Equation (19). They depend on certain combinatorics in step k as well as on error accumulated in the previous steps. The latter has an inductive description. To start, observe that the parity of a prefix a ∈ Σ^k is determined by its last bit together with the parity of its first k − 1 bits. The set Σ^k thus partitions into four sets:

E_1 = {a ∈ Σ^k : a = (e_{k−1}, 0) where e_{k−1} ∈ E_{k−1}}
O_1 = {a ∈ Σ^k : a = (o_{k−1}, 1) where o_{k−1} ∈ O_{k−1}}
E_2 = {a ∈ Σ^k : a = (e_{k−1}, 1) where e_{k−1} ∈ E_{k−1}}
O_2 = {a ∈ Σ^k : a = (o_{k−1}, 0) where o_{k−1} ∈ O_{k−1}}

By viewing the training set as a bipartite graph, one has a visual understanding of these sets. For k = 3, we use color to distinguish each set.

[bipartite graphs for k = 3 omitted: length-3 prefixes, colored according to the four sets E_1, O_1, E_2, O_2, joined to their suffixes; each prefix carries one of the weights cos θ_2, sin θ_2, cos φ_2, sin φ_2]

As shown, each prefix also has a weight that records its contribution to the error accumulated in previous steps. Concretely, we assign to each prefix a ∈ Σ^k a weight w(a), which is a product of k − 2 factors. For 2 ≤ i ≤ k − 1, the i-th factor of w(a) is defined to be

• cos θ_i if the parity of the first i − 1 bits is 0 and the i-th bit is 0,
• sin θ_i if the parity of the first i − 1 bits is 1 and the i-th bit is 1,
• cos φ_i if the parity of the first i − 1 bits is 0 and the i-th bit is 1,
• sin φ_i if the parity of the first i − 1 bits is 1 and the i-th bit is 0.

For example, if k = 3 then w(011) = cos φ_2. If k = 5 then w(01101) = cos θ_4 sin θ_3 cos φ_2. These weights are naturally associated to each tensor. For instance, recalling that each tensor U_k is akin to a summarizer, one sees w(01101) accumulate as U_2, U_3, and U_4 are applied in turn: first a factor cos φ_2, then sin θ_3, then cos θ_4.

We can now describe the entries of the reduced density defined in Equation (19). The first diagonal entry is

(23)  e_1 = Σ_{suffixes b} ( Σ_{a ∈ E_1 : (a,b) ∈ T} w(a) )²
0. For example, in the graph below e + 2 + 1 = 9.00 011 001 110 1 000110101011 E θ sin θ cos φ sin φ In general, though, the summands will not be integers but rather productsof weights. The off-diagonal entry in the even block of the reduced densityis(24) s e = (cid:88) suffixes b (cid:88) a ∈ E , a (cid:48) ∈ O a,b ) , ( a (cid:48) ,b ) ∈ T w ( a ) · w ( a (cid:48) ) When perfect learning occurs, s e counts the number of paths of length 2,where now a path is comprised of one edge from E O s e = 3 . = + + E O s e will be a sum of products of weights. The expressionfor the off-diagonal s o in the odd block is similar to that in Equation (24).In summary, the theory behind the reduced densities and their eigenvec-tors gives us an exact understanding of the error propagated through eachstep of the training algorithm. We may then predict the Bhattacharya dis-tance in (22) using statistical estimates of the expected combinatorics. Thisprovides an accurate prediction based solely on the fraction f of trainingsamples used and the length N of the sequences.6. Experiments
6. Experiments

The training algorithm was written using the ITensor library [20]; the code is available on GitHub. For a fixed fraction 0 < f ≤ 1, we train on a randomly chosen subset of N_T = f · 2^{N−1} even-parity bitstrings of length N = 16. We then compare the average Bhattacharyya distance in Equation (22) to the theoretical prediction. To handle the angles θ_k and φ_k in the theoretical model, we make a few simplifying assumptions about the expected behavior of the combinatorics.
First, we assume θ_k = φ_k for all k, since the combinatorics of both blocks of the reduced densities ρ_k in (19) have similar behavior. We further assume that the average angle θ_k is a function of the average off-diagonal entry s_e and the average diagonal gap G_e at the k-th step, that is, E[θ_k(s_e, G_e)] = θ_k(E[s_e], E[G_e]) for all k. The expectation for s_e is experimentally determined to be independent of k and dependent on the fraction f and bitstring length N alone: E[s_e] = f · N_T/4. We approximate the expected gap G_e at the k-th step by an experimentally determined function of f and the expected gap G = |d_1 − d_2| of the diagonal entries of the reduced density defined at step 2 of the algorithm. Understanding the expected behavior of G is similar to understanding the statistics of a coin toss. On average, one expects to flip the same number of heads and tails, and yet the expectation for their difference is non-zero. The distribution for G is similar, but a little different:

E[G] = Σ_d |2d − r| (n choose d)(n choose r − d) / (2n choose r),

where r = d_1 + d_2 = N_T/2 and n = 2^{N−3} is the number of even-parity bitstrings of length N − 2.
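As a check, here is a small sketch of ours evaluating this expectation exactly from the hypergeometric probabilities:

from math import comb

def expected_gap(N, f):
    """E[G] = sum_d |2d - r| * C(n, d) C(n, r - d) / C(2n, r),
    with n = 2**(N - 3) and r = N_T / 2, N_T = f * 2**(N - 1)."""
    n = 2 ** (N - 3)
    r = round(f * 2 ** (N - 1) / 2)
    total = comb(2 * n, r)
    return sum(abs(2 * d - r) * comb(n, d) * comb(n, r - d)
               for d in range(max(0, r - n), min(n, r) + 1)) / total

print(expected_gap(N=16, f=0.1))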
The plots in Figure 4 compare the theoretical estimate against the experimental average.
7. Conclusion
Models based on tensor networks open interesting directions for machine learning research. Tensor networks can be viewed as a sequence of related linear maps, which, by acting together on a very high-dimensional space, allow the model to be arbitrarily expressive. The underlying linearity and powerful techniques from linear algebra allow us to pursue a training algorithm where we can look "under the hood" to understand each step and its consequences for the ability of our model to reconstruct a particular dataset, the even-parity dataset.

Our work also highlights the advantages of working in a probability formalism based on the 2-norm. This is the same formalism used to interpret the wavefunction in quantum mechanics; here we use it as a framework to treat classical data. Density matrices naturally arise as the 2-norm analogue of the marginal probability distributions familiar from conventional 1-norm probability. Marginals still appear as the diagonal of the density matrix. Unlike marginals, the density matrices we use hold sufficient information to reconstruct the entire joint distribution. Our training algorithm can be summarized as estimating the density matrix from the training data, then reconstructing the joint distribution step by step from these density matrix estimates.

The theoretical predictions we obtained for the generative performance of the model agree well with the experimental results. Note that care is needed to compare these results, since the theoretical approach involves averaging over all possible training sets to produce a single typical weight MPS, whereas the experiments produce a different weight MPS for each training-set sample. In the near future, we look forward to extending our approach to other measures of model performance and behavior, and certainly other datasets as well.

More ambitiously, we hope this work points the way to theoretically sound and robust predictions of machine learning model performance based on empirical summaries of real-world data. If such predictions can be obtained for training algorithms that also produce state-of-the-art results, as tensor networks are starting to do, we anticipate this will continue to be an exciting program of research.
References

[1] Ulrich Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96–192, 2011.
[2] Steven R. White. Density matrix formulation for quantum renormalization groups. Physical Review Letters, 69(19):2863–2866, 1992.
[3] E. Miles Stoudenmire and D. J. Schwab. Supervised learning with quantum-inspired tensor networks. Advances in Neural Information Processing Systems (NIPS), 29:4799–4807, 2016.
[4] D. Perez-Garcia, F. Verstraete, M. M. Wolf, and J. I. Cirac. Matrix product state representations. Quantum Information and Computation, 7:401–430, 2007.
[5] E. Miles Stoudenmire. The tensor network, 2019. http://tensornetwork.org.
[6] Glen Evenbly. Tensors.net, 2019.
[7] Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.
[8] Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arXiv:1605.03795, 2016.
[9] E. Miles Stoudenmire. Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology, 3(3):034003, 2018.
[10] Ivan Glasser, Nicola Pancotti, and J. Ignacio Cirac. Supervised learning with generalized tensor networks. arXiv:1806.05964, 2018.
[11] Chu Guo, Zhanming Jie, Wei Lu, and Dario Poletti. Matrix product operators for sequence-to-sequence learning. Phys. Rev. E, 98:042114, 2018.
[12] Glen Evenbly. Number-state preserving tensor networks as classifiers for supervised learning. arXiv:1905.06352, 2019.
[13] Ding Liu, Shi-Ju Ran, Peter Wittek, Cheng Peng, Raul Blázquez García, Gang Su, and Maciej Lewenstein. Machine learning by unitary tensor network of hierarchical tree structure. New Journal of Physics, 21(7):073059, 2019.
[14] Stavros Efthymiou, Jack Hidary, and Stefan Leichenauer. TensorNetwork for machine learning. arXiv:1906.06329, 2019.
[15] Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Phys. Rev. X, 8:031012, 2018.
[16] Zhuan Li and Pan Zhang. Shortcut matrix product states and its applications. arXiv:1812.05248, 2018.
[17] James Stokes and John Terilla. Probabilistic modeling with matrix product states. arXiv:1902.06888, 2019.
[18] Song Cheng, Lei Wang, Tao Xiang, and Pan Zhang. Tree tensor networks for generative modeling. Phys. Rev. B, 99:155131, 2019.
[19] Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and J. Ignacio Cirac. Expressive power of tensor-network factorizations for probabilistic modeling, with applications from hidden Markov models to quantum machine learning. arXiv:1907.03741, 2019.
[20] ITensor Library (version 3.0.0). https://itensor.org.

CUNY Graduate Center, New York, NY
E-mail address: [email protected]

Flatiron Institute, New York, NY, A Division of the Simons Foundation
E-mail address: [email protected]

Tunnel, New York, NY
E-mail address: