Modeling Sequences with Quantum States: A Look Under the Hood
TAI-DANAE BRADLEY, MILES STOUDENMIRE, AND JOHN TERILLA
Abstract.
Classical probability distributions on sets of sequences can be modeled using quantum states. Here, we do so with a quantum state that is pure and entangled. Because it is entangled, the reduced densities that describe subsystems also carry information about the complementary subsystem. This is in contrast to the classical marginal distributions on a subsystem, in which information about the complementary system has been integrated out and lost. A training algorithm based on the density matrix renormalization group (DMRG) procedure uses the extra information contained in the reduced densities and organizes it into a tensor network model. An understanding of the extra information contained in the reduced densities allows us to examine the mechanics of this DMRG algorithm and to study the generalization error of the resulting model. As an illustration, we work with the even-parity dataset and produce an estimate for the generalization error as a function of the fraction of the dataset used in training.
Contents
1. Introduction
Acknowledgments
2. Densities and reduced densities
2.1. Reconstructing a pure state from its reduced densities
3. Reduced densities of classical probability distributions
3.1. Learning from samples
4. The Training Algorithm
5. Under the hood
5.1. High-level summary
5.2. Combinatorics of reduced densities
6. Experiments
7. Conclusion
References

1. Introduction
In this paper, we present a deterministic algorithm for unsupervised generative modeling on strings using tensor networks. The algorithm is deterministic with a fixed number of steps, and the resulting model has a perfect sampling algorithm that allows efficient sampling from marginal distributions, or sampling conditioned on a substring. The algorithm is inspired by the density matrix renormalization group (DMRG) procedure [1, 2, 3]. This approach, at its heart, involves only simple linear algebra, which allows us to give a detailed "under the hood" look at the algorithm in action. Our analysis illustrates how to interpret the trained model and how to go beyond worst-case bounds on generalization errors. We work through the algorithm with an exemplar dataset to produce a prediction for the generalization error as a function of the fraction of data used in training, which well approximates the generalization error observed in experiments.

The machine learning problem of interest is to learn a probability distribution on a set of sequences from a finite training set of samples. For us, an important technical and conceptual first step is to pass from
Finite Sets to Functions on Finite Sets. Functions on sets have more structure than sets themselves, and we find that the extra structure is meaningful. Furthermore, well-understood concepts and techniques in quantum physics give us powerful tools to exploit this extra structure without incurring significant algorithmic costs [4]. We emphasize that it is not necessary that the datasets being modeled have any inherently quantum properties or interpretation. The inductive bias of the model can be understood as a kind of low-rank factorization hypothesis, a point we expand upon in this paper.

Reduced density operators play a central role in our model. In a happy coincidence, they play the central role in both the model's theoretical inspiration and the training algorithm. There is structure in reduced densities that inspires us to model classical probability distributions using a quantum model. The training algorithm amounts to successively matching reduced densities, a process which leads inevitably to a tensor network model, which may be thought of as a sequence of compatible autoencoders. We refer readers unfamiliar with tensor diagram notation to references such as [5, 6, 7].

This paper also builds on investigations of tensor networks as models for machine learning tasks. Tensor networks have been demonstrated to give good results for supervised learning and regression tasks [8, 3, 9, 10, 11, 12, 13, 14]. They have also been applied successfully to unsupervised, generative modeling [15, 16, 17, 18], including a study based on the parity dataset we use here [17]. This work focuses on the latter task, proposing and studying an alternative algorithm for optimizing MPS for generative modeling. The expressivity of models like the one considered in this paper has been studied in [19]. In this paper, we focus on understanding how our training algorithm learns to generalize.
Acknowledgments.
The authors thank Gabriel Drummond-Cole, Glen Evenbly, James Stokes, and Yiannis Vlassopoulos for helpful discussions, and are happy to acknowledge KITP Santa Barbara, the Flatiron Institute, and Tunnel for support and excellent working conditions.
2. Densities and reduced densities
For our purposes, the passage from classical to quantum can be thought of as the passage from
Finite Sets to Functions on Finite Sets, which have a natural Hilbert space structure. We are interested in probability distributions on finite sets. The quantum version of a probability distribution is a density operator on a Hilbert space. The quantum version of a marginal probability distribution is a reduced density operator. The operation that plays the role of marginalization is the partial trace. In our setup, the reduced densities contain more information than the marginal distributions associated to them, and much of our work concerns this extra information.

Given a finite set S, one has the free vector space V = C^S consisting of complex-valued functions on S, which is a Hilbert space with inner product

⟨f|g⟩ = Σ_{s∈S} \overline{f(s)} g(s).

The free vector space comes with a natural map S → C^S, which we recall in a moment. To avoid confusion, it is helpful to use notation to distinguish between an element s ∈ S and its image in C^S, which is a vector. Commonly, the vector image of s is denoted with a boldface font or an overset arrow. We like the bra and ket notation, which is better when inner products are involved. For any s ∈ S, let |s⟩ denote the function S → C that sends s ↦ 1 and s′ ↦ 0 for s′ ≠ s. The set {|s⟩} is an independent, orthonormal spanning set for V. If one chooses an ordering on the set S, say S = {s_1, ..., s_d}, then |s_j⟩ is identified with the j-th standard basis vector in C^d, thus defining an isometric isomorphism V ≅ C^d and a "one-hot" encoding S ↪ C^d. More generally, we denote elements in V by ket notation |ψ⟩ ∈ V.

For any |ψ⟩ ∈ V, there is a linear functional in V* whose value on |φ⟩ ∈ V is the inner product ⟨ψ|φ⟩. We denote this linear functional by the succinct bra notation ⟨ψ| ∈ V*. Every linear functional in V* is of the form ⟨ψ| for some |ψ⟩ ∈ V. We have vectors |ψ⟩ ∈ V and covectors ⟨ψ| ∈ V*, and the map |ψ⟩ ↔ ⟨ψ| defines a natural isomorphism between V and V*. We have chosen to distinguish between vectors and covectors with bra and ket notation; we will not imbue upper and lower indices with any special meaning.

When several spaces V, W, ... are in play, some tensor product symbols are suppressed. So, for instance, if |ψ⟩ ∈ V and |φ⟩ ∈ W, we will write |ψ⟩|φ⟩, or even |ψφ⟩, instead of |ψ⟩ ⊗ |φ⟩ ∈ V ⊗ W. An expression like |φ⟩⟨ψ| is an element of W ⊗ V*, naturally identified with an operator V → W. The expression |ψ⟩⟨ψ| is an element of End(V). Here, End(V) denotes the space of all linear operators on V and, in the presence of a basis, is identified with the dim(V) × dim(V) matrices. If |ψ⟩ is a unit vector, then the operator |ψ⟩⟨ψ| is orthogonal projection onto |ψ⟩: it maps |ψ⟩ ↦ |ψ⟩ and maps every vector perpendicular to |ψ⟩ to zero.

A density operator, or just density for short, is a unit-trace, positive semi-definite linear operator on a Hilbert space. Sometimes a density is called a quantum state.
If S is a finite set and V = C^S, then a density ρ : V → V defines a probability distribution π_ρ : S → R on S by the Born rule

(1)  π_ρ(s) = ⟨s|ρ|s⟩.

Going the other way, there are multiple ways to define a density ρ : V → V from a classical probability distribution π on S so that π_ρ = π. One way is as a diagonal operator: ρ_diag := Σ_{s∈S} π(s) |s⟩⟨s|. Another way is to define

(2)  ρ_π = |ψ⟩⟨ψ|  where  |ψ⟩ := Σ_{s∈S} √(π(s)) |s⟩.

There exist other densities that realize π via the Born rule, but think of the diagonal density and the projection onto |ψ⟩ as two extremes. The density ρ_π has minimal rank and ρ_diag has maximal rank. In the language of quantum mechanics, a state is pure if it has rank one and is mixed otherwise. The degree to which a state is mixed is measured by its von Neumann entropy, −tr(ρ ln(ρ)), which ranges from zero in the case of ρ_π up to the Shannon entropy of the classical distribution π in the case of ρ_diag. In this paper, we always use the pure state ρ := ρ_π. To summarize, we associate to any probability distribution π : S → R the density ρ_π : V → V defined by Equation (2), which has the property that π_{ρ_π} = π.

If a set S is a Cartesian product S = A × B, then the Hilbert space C^S decomposes as a tensor product C^S ≅ C^A ⊗ C^B. In this case, a density ρ : C^A ⊗ C^B → C^A ⊗ C^B is the quantum version of a joint probability distribution π : A × B → R. By an operation that is analogous to marginalization, ρ gives rise to two densities ρ_A : C^A → C^A and ρ_B : C^B → C^B, which we refer to as reduced densities. We now describe this operation, which is called the partial trace.

If X and Y are finite-dimensional vector spaces, then End(X ⊗ Y) is isomorphic to End(X) ⊗ End(Y). Using this isomorphism, there are maps tr_Y : End(X ⊗ Y) → End(X) and tr_X : End(X ⊗ Y) → End(Y) defined by

tr_Y(f ⊗ g) := f tr(g)  and  tr_X(f ⊗ g) := g tr(f)

for f ∈ End(X) and g ∈ End(Y). The maps tr_Y and tr_X are called partial traces. The partial trace preserves both trace and positive semi-definiteness, and so the image of any density ρ ∈ End(X ⊗ Y) under partial trace defines reduced densities tr_Y ρ ∈ End(X) and tr_X ρ ∈ End(Y).

It is worth noting that while we have maps End(X) ⊗ End(Y) → End(X) and End(X) ⊗ End(Y) → End(Y), there do not exist natural maps V ⊗ W → V or V ⊗ W → W for arbitrary vector spaces V and W; the partial trace is special in that it is defined on endomorphism spaces.
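To make these definitions concrete, here is a minimal NumPy sketch of our own (the joint distribution and set sizes below are illustrative, not from the paper). It builds both the diagonal density and the pure density ρ_π from a small distribution on A × B, checks the Born rule, and computes partial traces by reshaping.

import numpy as np

# Toy joint distribution pi(a, b) on S = A x B with |A| = 2, |B| = 3 (illustrative values).
pi = np.array([[0.1, 0.2, 0.0],
               [0.3, 0.0, 0.4]])          # sums to 1
dA, dB = pi.shape

# Pure state |psi> = sum_s sqrt(pi(s)) |s>, flattened into C^(dA*dB); rho_pi = |psi><psi|.
psi = np.sqrt(pi).reshape(-1)
rho_pure = np.outer(psi, psi)

# Diagonal density rho_diag = sum_s pi(s) |s><s|.
rho_diag = np.diag(pi.reshape(-1))

# Born rule: the diagonal of either density recovers pi.
assert np.allclose(np.diag(rho_pure), pi.reshape(-1))
assert np.allclose(np.diag(rho_diag), pi.reshape(-1))

# Partial traces: view End(C^A ⊗ C^B) as a 4-index array and sum over the traced pair.
rho4 = rho_pure.reshape(dA, dB, dA, dB)
rho_A = np.einsum('abcb->ac', rho4)        # tr_B rho, acts on C^A
rho_B = np.einsum('abad->bd', rho4)        # tr_A rho, acts on C^B
assert np.isclose(np.trace(rho_A), 1.0) and np.isclose(np.trace(rho_B), 1.0)
assert np.allclose(np.diag(rho_A), pi.sum(axis=1))   # marginal pi_A sits on the diagonal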
2.1. Reconstructing a pure state from its reduced densities.

We now discuss the problem of reconstructing a pure quantum state ρ on a product X ⊗ Y from its reduced densities ρ_X and ρ_Y.

Using the isomorphism X ≅ X* that is available in any finite-dimensional Hilbert space, one can view any vector |ψ⟩ in a product of Hilbert spaces X ⊗ Y as an element of X* ⊗ Y, hence as a linear map M : X → Y. Computationally, if |ψ⟩ is expressed using bases {|a⟩} of X and {|b⟩} of Y as

|ψ⟩ = Σ_{a,b} m_{ab} |a⟩ ⊗ |b⟩,

then the coefficients {m_{ab}} of that sum can be reshaped into a dim(Y) × dim(X) matrix M. A singular value decomposition (SVD) of M gives a factorization M = V D U* with V and U unitary and D diagonal, as in Figure 1.
Figure 1. A tensor network diagram following |ψ⟩ ∈ X ⊗ Y through the isomorphisms X ⊗ Y ≅ X* ⊗ Y ≅ hom(X, Y), leading to the singular value decomposition M = V D U* with V and U unitary.

The columns {|f_i⟩} of the matrix V are the left singular vectors of M. They are the eigenvectors of M M* and comprise an orthonormal basis for the image of M. The columns {|e_i⟩} of the matrix U are the right singular vectors of M. They are the eigenvectors of M* M, an orthonormal set of vectors spanning a subspace of X isomorphic to the image of M.
The nonnegative real numbers {σ_i} on the diagonal of D are the singular values of the matrix M. The matrices M* M and M M* have the same eigenvalues {λ_i}, which are the squares of the singular values, λ_i := σ_i². The map M defines a bijection between the {|e_i⟩} and the {|f_i⟩}. Specifically, M acts as

(3)  |e_i⟩ ↦ σ_i |f_i⟩

and maps the orthogonal complement of the span of the {|e_i⟩} to zero.

Now, given a unit vector |ψ⟩ ∈ X ⊗ Y, we have the density ρ = |ψ⟩⟨ψ| ∈ X ⊗ Y ⊗ Y* ⊗ X* and the reduced densities ρ_X : X → X and ρ_Y : Y → Y. The reduced densities of ρ are related to the operator M : X → Y fashioned from |ψ⟩ as follows:

(4)  ρ_X = M* M  and  ρ_Y = M M*,

as illustrated in Figure 2.

Figure 2. A tensor network diagram showing that ρ_X = M* M and ρ_Y = M M*.

The singular vectors {|e_i⟩} and {|f_i⟩} of M are precisely the eigenvectors of the reduced densities. Therefore, the density ρ can be completely reconstructed from its reduced densities ρ_X and ρ_Y. One obtains |ψ⟩ by gluing the eigenvectors of the reduced densities along their shared eigenvalues (Figure 3). In the nondegenerate case that the eigenvalues are distinct, there is a unique way to glue the {|e_i⟩} and the {|f_i⟩}, and |ψ⟩ is recovered perfectly.

Figure 3. Reconstructing |ψ⟩ from the eigenvectors of ρ_X and ρ_Y and their shared eigenvalues.
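The following NumPy sketch (an illustration of ours, not from the paper) reshapes a random pure state into the matrix M, verifies Equation (4), and rebuilds |ψ⟩ by gluing the paired eigenvectors of ρ_X and ρ_Y along their shared eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
dX, dY = 3, 4

# A random unit vector |psi> in X ⊗ Y, stored as coefficients m_ab.
m = rng.normal(size=(dX, dY))
m /= np.linalg.norm(m)

# View |psi> as an operator M : X -> Y; M is the dim(Y) x dim(X) matrix m.T.
M = m.T

# Reduced densities via Equation (4).
rho_X = M.conj().T @ M          # acts on X
rho_Y = M @ M.conj().T          # acts on Y

# SVD of M: columns of V are |f_i> (eigenvectors of rho_Y),
# columns of U are |e_i> (eigenvectors of rho_X), paired by the SVD.
V, sigma, U_dag = np.linalg.svd(M, full_matrices=False)
U = U_dag.conj().T

# Eigenvalues of the reduced densities are the squared singular values.
assert np.allclose(np.sort(np.linalg.eigvalsh(rho_X))[::-1], sigma**2)

# Glue the paired eigenvectors along their shared eigenvalues (Equation (3)):
# psi_ab = sum_i sigma_i <a|e_i> <b|f_i>.
m_rebuilt = (U * sigma) @ V.T
assert np.allclose(m_rebuilt, m)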
3. Reduced densities of classical probability distributions

Let π : S → R be a probability distribution and consider the density ρ_π as in Equation (2). Suppose S ⊂ A × B and let ρ_A = tr_Y ρ and ρ_B = tr_X ρ denote the reduced densities where, as above, X = C^A, Y = C^B, and V = X ⊗ Y. Let us now interpret the matrix representation of these reduced densities.
We compute:

ρ = |ψ⟩⟨ψ| = ( Σ_{(a,b)∈S} √(π(a,b)) |a⟩ ⊗ |b⟩ ) ( Σ_{(a′,b′)∈S} √(π(a′,b′)) ⟨a′| ⊗ ⟨b′| )
  = Σ_{(a,b)∈S, (a′,b′)∈S} √(π(a,b)) √(π(a′,b′)) |a⟩⟨a′| ⊗ |b⟩⟨b′|.

We compute the partial trace tr_Y(|a⟩⟨a′| ⊗ |b⟩⟨b′|) = ⟨b|b′⟩ |a⟩⟨a′|. Since ⟨b|b′⟩ = 1 if b = b′ and zero otherwise, we can understand the (a, a′) entry of the reduced density ρ_A as

(5)  (ρ_A)_{a′a} = Σ_{b∈B} √(π(a,b) π(a′,b)).

In particular, the diagonal entry (ρ_A)_{aa} is Σ_{b∈B} π(a,b), and we see the marginal distribution π_A : A → R along the diagonal of the reduced density ρ_A. We make the consistent observation that ρ_A has unit trace. The off-diagonal entries of ρ_A are determined by the extent to which a, a′ ∈ A have the same continuations in B. Note that ρ_A is symmetric. The reduced density on B is similarly given:

(6)  (ρ_B)_{b′b} = Σ_{a∈A} √(π(a,b) π(a,b′)).

So the reduced densities of ρ contain all the information of the marginal distributions π_A and π_B, and more. Now let's take a look at the extra information carried by the reduced densities, which is entirely contained in the off-diagonal entries. Since the entire state, and therefore π itself, can be reconstructed from the eigenvectors and eigenvalues of ρ_A and ρ_B, we know that from a high level this spectral information encodes the conditional probabilities that are lost by the classical process of marginalization. En route to decoding this spectral information, let us describe how an arbitrary density τ is a classical mixture of pure quantum states. If |e_1⟩, ..., |e_k⟩ is a basis for the image of a density τ consisting of orthonormal eigenvectors, then the corresponding eigenvalues λ_1, ..., λ_k are nonnegative real numbers whose sum is one. One has

τ = Σ_{i=1}^{k} λ_i |e_i⟩⟨e_i|.

The density τ defines a probability distribution on pure states: the probability of the pure state |e_i⟩⟨e_i| is λ_i. Then |e_i⟩⟨e_i| defines a probability distribution on the computational basis {|s⟩} via the Born rule: the probability of s is ⟨s|e_i⟩⟨e_i|s⟩ = |⟨e_i|s⟩|².

We're interested in the reduced densities of ρ = |ψ⟩⟨ψ|, and in this case there exists a one-to-one correspondence |e_i⟩ ↔ |f_i⟩ between eigenvectors of the reduced densities ρ_A := tr_Y(ρ) and ρ_B := tr_X(ρ) spanning their respective images,

ρ_A = Σ_{i=1}^{k} λ_i |e_i⟩⟨e_i|  and  ρ_B = Σ_{i=1}^{k} λ_i |f_i⟩⟨f_i|,

as outlined in Section 2.1. Putting together the general picture of a density as a mixture of pure states with the reduced densities of a pure state leads one to the following paradigm.
With probability λ_i, the prefix subsystem will be in a state determined by the corresponding eigenvector |e_i⟩ of ρ_A, and the corresponding suffix subsystem will be in a state determined by the eigenvector |f_i⟩. The vector |e_i⟩ = Σ_a γ_{ai} |a⟩ determines a probability distribution on the set of prefixes A: the probability of the prefix a is |γ_{ai}|². The vector |f_i⟩ = Σ_b β_{bi} |b⟩ determines a probability distribution on the set of suffixes B: the probability of b is |β_{bi}|².
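As a concrete illustration (a toy example of ours, not from the paper), the following sketch builds ρ_A from a small joint distribution via Equation (5), checks that its diagonal is the marginal π_A, and reads off the Born-rule prefix distributions |γ_{ai}|² from its eigenvectors.

import numpy as np

# Toy joint distribution pi(a, b) on A x B, |A| = 2, |B| = 3 (illustrative values).
pi = np.array([[0.25, 0.25, 0.0],
               [0.25, 0.0,  0.25]])

# Equation (5): (rho_A)_{a a'} = sum_b sqrt(pi(a, b) pi(a', b)).
sq = np.sqrt(pi)
rho_A = sq @ sq.T

# The diagonal of rho_A is the classical marginal pi_A.
assert np.allclose(np.diag(rho_A), pi.sum(axis=1))

# Eigendecomposition: rho_A = sum_i lambda_i |e_i><e_i|.
lam, E = np.linalg.eigh(rho_A)                 # ascending eigenvalues, columns are |e_i>
for i in range(len(lam) - 1, -1, -1):
    if lam[i] > 1e-12:
        prefix_dist = np.abs(E[:, i]) ** 2     # Born-rule distribution on prefixes
        print(f"lambda = {lam[i]:.3f}, distribution on A = {prefix_dist}")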
As a final remark, if we had begun with the diagonal density

ρ_diag = Σ_{(a,b)∈A×B} π(a,b) (|a⟩ ⊗ |b⟩)(⟨a| ⊗ ⟨b|),

whose Born distribution is also π, then the matrices representing ρ_A and ρ_B would be diagonal matrices with the marginal distributions on A and B along the diagonals and all off-diagonal elements zero. The eigenvectors of ρ_A and ρ_B are simply the prefixes |a⟩ and suffixes |b⟩ and carry no further information. The process of computing reduced densities of ρ_diag is nothing more than the process of marginalization. We always use the pure state ρ = |ψ⟩⟨ψ|, ensuring that the reduced densities carry information about subsystem interactions. The eigenvectors of the reduced densities, which are linear combinations of prefixes and linear combinations of suffixes, interact through their eigenvalues and capture rich information about the prefix-suffix system.

Let us summarize. Begin with a classical probability distribution π on a product set S = A × B. Form a density ρ_π on C^{A×B} by the formula in Equation (2). The reduced densities ρ_A and ρ_B on C^A and C^B contain the marginal distributions π_A and π_B on their diagonals, but they are not diagonal operators. The eigenvectors of these reduced densities encode information about prefix-suffix interactions. The prefix-suffix interactions are tantamount to conditional probabilities and carry sufficient information to reconstruct the density ρ.
3.1. Learning from samples.

In the machine learning applications to come, the goal is to learn ρ_π defined in Equation (2) from a set {s_1, ..., s_{N_T}} of samples drawn from a probability distribution π. Each sample s_i will be a sequence (x_1, ..., x_N) of a fixed length N. The algorithm to learn the density ρ_π on the full set of sequences S is an inductive procedure. One only works with a density ρ defined using the sample set, since the density ρ_π for the entire distribution π is unavailable. The procedure begins by computing the reduced density ρ_A and its eigenvectors for a subsystem A consisting of short prefixes. Step by step, the size of the subsystem A is increased until one reaches a point where the suffix subsystem B is small. In a final step, ρ is recombined from the collected eigenvectors of ρ_A for all the prefix systems A and the eigenvectors and eigenvalues of ρ_B. This procedure leads naturally to a tensor network approximation for ρ.

An important point is that the reduced density ρ_A operates in a space whose dimension grows exponentially with the length of the prefix system A. So, instead of computing ρ_A exactly, it is computed by a sequence of approximations that keep its rank small. The modeling hypothesis is that π is a distribution whose corresponding quantum state ρ_π has low rank, in the sense that the reduced densities ρ_A and ρ_B are low-rank operators for all prefix-suffix subsystems A and B. The large rank of the density ρ witnessed from the empirical distribution drawn from π is regarded as sampling error. Therefore, under the modeling hypothesis, the process of replacing the empirically computed reduced densities with low-rank approximations should be thought of as repairing a state damaged by sampling errors. The low-rank modeling hypothesis can lead to excellent generalization properties for the model.

Let us continue our analysis of the reduced densities as in the previous sections, using notation appropriate for the machine learning algorithm. Let T be a training set of samples T = {s_1, ..., s_{N_T}}. We use N_T for the number of training examples. Each sample s_i will be a sequence of symbols from a fixed alphabet Σ of a fixed length N. We will designate a cut to obtain a prefix a_i and suffix b_i whose concatenation is the sample s_i = (a_i, b_i) ∈ Σ^N. This provides a decomposition of T as T ⊂ A × B, where A = {a_1, a_2, ..., a_{N_T}} and B = {b_1, b_2, ..., b_{N_T}} are the sampled prefixes and suffixes. For the applications we have in mind, samples in T will be distinct. That is, (a_i, b_i) ≠ (a_j, b_j) if i ≠ j, though crucially it may happen that a_i = a_j or b_i = b_j for i ≠ j. Let π̂ be the resulting empirical distribution on T, so that

(7)  π̂(a,b) = 1/N_T if (a,b) ∈ T, and 0 otherwise.

Following Equation (2), we have

(8)  |ψ⟩ = (1/√N_T) Σ_{i=1}^{N_T} |s_i⟩,

the empirical density ρ = |ψ⟩⟨ψ|, and its partial trace

(9)  ρ_A = (1/N_T) Σ_{i,j=1}^{N_T} s(a_i, a_j) |a_i⟩⟨a_j|.

Here the sum is expressed in terms of the indices i, j, which range over the number of samples. The coefficient s(a_i, a_j) of |a_i⟩⟨a_j| is a nonnegative integer, namely the number of times that a_i and a_j have the same continuation b_i = b_j. It may be convenient to have some notation for shared continuations.
For any pair a, a′ of elements of A, let T_{a,a′} be the subset of B consisting of shared continuations of a and a′:

(10)  T_{a,a′} = {b ∈ B : (a,b) ∈ T and (a′,b) ∈ T}.

So the (a, a′) entry of the matrix representing ρ_A is the cardinality of the set T_{a,a′} multiplied by an overall factor of 1/N_T. A similar combinatorial description holds for the reduced density on B,

ρ_B = (1/N_T) Σ_{i,j} s(b_i, b_j) |b_i⟩⟨b_j|,

where s(b_i, b_j) is the number of common prefixes that b_i and b_j share.
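A minimal sketch (our own illustration; the training strings are hypothetical) of Equations (9) and (10): build the empirical reduced density on prefixes by counting shared continuations in a small training set.

from collections import defaultdict
import numpy as np

# Hypothetical training set of length-4 strings, cut after the first 2 symbols.
train = ["0011", "0000", "1100", "0110", "1010", "0101"]
cut = 2
pairs = [(s[:cut], s[cut:]) for s in train]
N_T = len(pairs)

prefixes = sorted({a for a, _ in pairs})
idx = {a: i for i, a in enumerate(prefixes)}

# Group prefixes by their continuations: suffix b -> list of prefixes seen with b.
by_suffix = defaultdict(list)
for a, b in pairs:
    by_suffix[b].append(a)

# Equation (9): (rho_A)_{a a'} = |T_{a,a'}| / N_T, counting shared continuations.
rho_A = np.zeros((len(prefixes), len(prefixes)))
for b, alist in by_suffix.items():
    for a in alist:
        for a2 in alist:
            rho_A[idx[a], idx[a2]] += 1.0 / N_T

print(prefixes)
print(rho_A)          # diagonal: vertex degrees / N_T; off-diagonal: shared-suffix counts / N_T
assert np.isclose(np.trace(rho_A), 1.0)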
The counting involved can be visualized with graphs. Every probability distribution π̂ on a Cartesian product A × B uniquely defines a weighted bipartite graph: the two vertex sets are A and B, and the edge joining a and b is labeled by π̂(a,b). Here, because we assume the samples in T are distinct, the graph can be simplified, since π̂(a,b) is either 0 or 1/N_T. We draw an edge from a to b if (a,b) ∈ T, we omit the edge if (a,b) ∉ T, and we understand the probabilities to be obtained by dividing by N_T, which is the total number of edges in the graph.

[bipartite graph omitted: two prefix vertices a_1, a_2 joined to four suffix vertices b_1, ..., b_4 by six edges]

In the example above, the total number of edges is the sample size N_T = 6. The probability of any pair joined by an edge is 1/6, and the probability of any pair not joined by an edge is 0. Now we illustrate how to read off the entries of the reduced density ρ_A from the graph. There will be an overall factor of 1/N_T multiplying a matrix of nonnegative integers. The diagonal entries are d(a), the degree of vertex a. The (a, a′) entry is the number of shared suffixes, which equals the number of paths of length 2 between a and a′.

Given any graph with |A| = 2, such as the one above, the reduced density on the prefix subsystem is equal to

(11)  ρ_A = (1/N_T) [ d_1  s ; s  d_2 ],

where the diagonal entries are the degrees of the vertices and s is the number of paths of length two, which equals the number of degree-two vertices of B. The denominator of the coefficient, N_T = d_1 + d_2, is the total number of edges in the graph. The eigenvalues λ_+ and λ_− and (unnormalized) eigenvectors |e_+⟩ and |e_−⟩ of this matrix have simple, explicit expressions in terms of the gap G = d_1 − d_2 in the diagonal entries and the off-diagonal entry s. Namely,

(12)  λ_+ = (N_T + √(G² + 4s²)) / (2 N_T)  and  λ_− = (N_T − √(G² + 4s²)) / (2 N_T)

and

(13)  |e_+⟩ = (√(G² + 4s²) + G, 2s)ᵀ  and  |e_−⟩ = (√(G² + 4s²) − G, −2s)ᵀ.
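A quick numerical check of Equations (11) through (13) (our own sketch; the degrees and shared-suffix count below are arbitrary, subject to s ≤ min(d_1, d_2)):

import numpy as np

d1, d2, s = 4, 2, 2            # vertex degrees and shared-suffix count (s <= min(d1, d2))
N_T = d1 + d2                  # total number of edges
G = d1 - d2                    # gap

rho_A = np.array([[d1, s], [s, d2]]) / N_T

# Closed forms from Equations (12) and (13).
root = np.sqrt(G**2 + 4 * s**2)
lam_plus, lam_minus = (N_T + root) / (2 * N_T), (N_T - root) / (2 * N_T)
e_plus = np.array([root + G, 2 * s])
e_minus = np.array([root - G, -2 * s])

# Compare against a direct eigendecomposition.
assert np.allclose(np.sort(np.linalg.eigvalsh(rho_A)), sorted([lam_plus, lam_minus]))
assert np.allclose(rho_A @ e_plus, lam_plus * e_plus)
assert np.allclose(rho_A @ e_minus, lam_minus * e_minus)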
4. The Training Algorithm

Suppose that |ψ⟩ ∈ V_1 ⊗ ⋯ ⊗ V_N, which we depict as a tensor with one dangling leg for each of the spaces V_1, ..., V_N. There are various sorts of decompositions of such a tensor that are akin to an iterated SVD. We will describe one decomposition that results in a factorization of |ψ⟩ into what is called a matrix product state (MPS) or, synonymously, a tensor train decomposition. The process defines a sequence of "bond" spaces {B_k} and operators U_k : B_k → B_{k−1} ⊗ V_k whose composition U_1 U_2 ⋯ U_{N−1} U_N reproduces |ψ⟩. The initial operator has the form U_1 : B_1 → V_1 and the final tensor has the form U_N ∈ B_{N−1} ⊗ V_N. We begin with B_1 = V_1 and set U_1 : B_1 → V_1 to be the identity. For k = 2, ..., N − 1 we define U_k inductively.

To describe the inductive process, first notice that for any k = 1, ..., N − 1,

V_1 ⊗ ⋯ ⊗ V_N ≅ (V_1 ⊗ ⋯ ⊗ V_k) ⊗ (V_{k+1} ⊗ ⋯ ⊗ V_N).

The operator

(14)  α_k : V_1 ⊗ ⋯ ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N

fashioned from |ψ⟩ is the corresponding reshaping of |ψ⟩. The operators U_1, ..., U_k compose to give an operator U_1 U_2 ⋯ U_k : B_k → V_1 ⊗ ⋯ ⊗ V_k. One then has the composition

β_k := α_k U_1 U_2 ⋯ U_k : B_k → V_{k+1} ⊗ ⋯ ⊗ V_N.
The inductive hypothesis is that

α_k U_1 U_2 ⋯ U_k U*_k ⋯ U*_2 U*_1 = α_k.

In the penultimate step, one has the operator α_{N−1} U_1 U_2 ⋯ U_{N−1} : B_{N−1} → V_N. The final step is to define U_N as the adjoint of this operator: U_N = (α_{N−1} U_1 U_2 ⋯ U_{N−1})*. Therefore, the entire composition reduces nicely:

U_1 U_2 ⋯ U_{N−1} U_N = U_1 U_2 ⋯ U_{N−1} U*_{N−1} ⋯ U*_2 U*_1 α*_{N−1} = α*_{N−1}.

The final equality follows from the adjoint of the inductive hypothesis. The outcome α*_{N−1} : V*_N → V_1 ⊗ ⋯ ⊗ V_{N−1}, after a minor reshaping, is the same as |ψ⟩.

To define the inductive step, assume the spaces B_1, ..., B_{k−1} and operators U_1, ..., U_{k−1} have been defined and satisfy the inductive hypothesis. Reshape the operator β_{k−1} : B_{k−1} → V_k ⊗ V_{k+1} ⊗ ⋯ ⊗ V_N as a map B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N. An SVD of this map yields

α_{k−1} U_1 ⋯ U_{k−1} = W_k D_k U*_k.

The adjoint of the map U*_k : B_{k−1} ⊗ V_k → B_k is then defined to be U_k : B_k → B_{k−1} ⊗ V_k and becomes the next tensor in the MPS decomposition. To check that the inductive hypothesis is satisfied, note that

α_k U_1 ⋯ U_{k−1} U_k U*_k = α_{k−1} U_1 ⋯ U_{k−1},

since α_{k−1} U_1 ⋯ U_{k−1} = W_k D_k U*_k and U*_k U_k = 1.

In our application, the vector |ψ⟩ and the operators β_{k−1} : B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N operate in spaces of such high dimension that neither they, nor a direct SVD of them, is feasible. Nonetheless, the U_k operators can be obtained from an SVD of a reduced density operating in the effective space B_{k−1} ⊗ V_k:

β*_{k−1} β_{k−1} : B_{k−1} ⊗ V_k → B_{k−1} ⊗ V_k.

In our application, this effective reduced density can be computed as a double sum over the training examples, so we can efficiently compute the tensors required for the inductive steps. Then, in the final step, the complementary space is small, so the final map U_N D_N : B_{N−1} → V_N completes the reconstruction. More specifically, to define the U_k we only need an eigenvector decomposition of β*_{k−1} β_{k−1}, which is given by a formula like the one in Equation (9).

In general, when factoring an arbitrary vector as an MPS, the bond spaces B_k grow exponentially fast. Therefore, we may characterize datasets for which the MPS model is a good model by saying that |ψ⟩ as defined in Equation (2) has an MPS factorization whose bond spaces B_k remain small. Alternatively, one can truncate or restrict the dimensions of the spaces B_k, resulting in a low-rank MPS approximation of |ψ⟩. As a criterion for this truncation, one can inspect the singular values at each inductive step and discard those which are small according to a pre-determined cutoff, along with the corresponding columns of U and W. In the even-parity dataset that we investigate as an example, we truncate B_k to two dimensions throughout.
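The following NumPy sketch (ours, not the paper's implementation; it works directly with the full state vector, which is only feasible for small N) factors |ψ⟩ into an MPS by the iterated SVD just described, optionally truncating each bond to a fixed dimension.

import numpy as np

def mps_factorize(psi, dims, max_bond=None):
    """Factor a state vector psi in V_1 ⊗ ... ⊗ V_N into MPS tensors by sweeping
    left to right with SVDs, truncating each bond to max_bond if given."""
    tensors = []
    bond = 1
    rest = psi.reshape(bond, -1)
    for d in dims[:-1]:
        # Reshape as a map from (bond ⊗ V_k) to the remaining sites and SVD it.
        m = rest.reshape(bond * d, -1)
        u, s, vh = np.linalg.svd(m, full_matrices=False)
        if max_bond is not None:
            u, s, vh = u[:, :max_bond], s[:max_bond], vh[:max_bond]
        tensors.append(u.reshape(bond, d, -1))    # left isometry at this site
        bond = u.shape[1]
        rest = np.diag(s) @ vh                    # carry the remainder to the right
    tensors.append(rest.reshape(bond, dims[-1], 1))
    return tensors

def mps_contract(tensors):
    """Rebuild the full state vector from the MPS tensors (for testing)."""
    out = tensors[0]
    for t in tensors[1:]:
        out = np.einsum('...a,abc->...bc', out, t)
    return out.reshape(-1)

# Example: the uniform superposition over even-parity bitstrings of length N.
N = 6
dims = [2] * N
psi = np.zeros(2 ** N)
for x in range(2 ** N):
    if bin(x).count('1') % 2 == 0:
        psi[x] = 1.0
psi /= np.linalg.norm(psi)

mps = mps_factorize(psi, dims, max_bond=2)   # bond dimension 2 suffices for parity
assert np.allclose(mps_contract(mps), psi)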
To understand whether this kind of low-rank approximation is useful, recall that the eigenvectors and eigenvalues of the reduced densities carry the essential prefix-suffix interactions. By having a training algorithm that emphasizes these eigenvalues and eigenvectors as the most important features of the data throughout training, the resulting model should be interpreted as capturing the most important prefix-suffix interactions. We view these prefix-suffix interactions as a proxy for the meaning of substrings within a language of larger strings.

5. Under the hood
With an in-depth understanding of the training algorithm, we aim to predict experimental results, given only the fraction 0 < f ≤ 1 of the dataset used in training.
As an example, we perform an analysis of how well the algorithm learns on the even-parity dataset. Let Σ = {0, 1} and consider the set Σ^N of bitstrings of a fixed length N. Define the parity of a bitstring (b_1, ..., b_N) to be

(15)  parity(b_1, ..., b_N) := Σ_{i=1}^{N} b_i mod 2.

The set Σ^N is partitioned into even and odd bitstrings:

E_N = {s ∈ Σ^N : parity(s) = 0}  and  O_N = {s ∈ Σ^N : parity(s) = 1}.

Consider the probability distribution π : Σ^N → R uniformly concentrated on E_N:

π(x) = 2^{−(N−1)} if x ∈ E_N, and 0 if x ∈ O_N.

This distribution defines a density ρ_π = |E_N⟩⟨E_N| where

(16)  |E_N⟩ = (1/√(2^{N−1})) Σ_{s∈E_N} |s⟩ ∈ V_1 ⊗ V_2 ⊗ ⋯ ⊗ V_N,

where V_j ≅ C² is the site space spanned by the bits in the j-th position. Choose a subset T = {s_1, ..., s_{N_T}} ⊂ E_N of even-parity bitstrings and let f = N_T / 2^{N−1} be the fraction selected. The empirical distribution on this set defines the vector |ψ⟩ = (1/√N_T) Σ_{i=1}^{N_T} |s_i⟩ as in Equation (8). To begin our analysis of |ψ⟩, let us closely inspect the algorithm's second step. The ideas therein will generalize to subsequent steps.

In step 2, we view each sample s as a prefix-suffix pair (a, b) where a ∈ Σ² and b ∈ Σ^{N−2}. We visualize the training set T as a bipartite graph. Vertices represent prefixes a and suffixes b, and there is an edge joining a and b if and only if (a, b) ∈ T.

[bipartite graphs omitted: the left graph joins the even prefixes 00 and 11 to their suffixes; the right graph joins the odd prefixes 01 and 10 to their suffixes]

Notice that samples in the left graph are concatenations of even-parity bitstrings; samples in the right graph are concatenations of odd-parity bitstrings. Let |ψ_2⟩ ∈ C^{Σ²} ⊗ C^{Σ^{N−2}} denote the sum of the samples after having completed step 1 (pictured as a tensor diagram in Equation (17)), and consider the reduced density ρ_2 = tr_{Σ^{N−2}} |ψ_2⟩⟨ψ_2|. The entries of its matrix representation are understood from the data in the graph. Choosing an ordering on the set Σ², we write ρ_2 as

(18)  ρ_2 = (1/N_T) ( [ d_1  s_e ; s_e  d_2 ] ⊕ [ d_3  s_o ; s_o  d_4 ] ).

The number of training samples N_T is the total number of edges in the graph. The diagonal entries are the degrees of the vertices associated to prefixes: d_1 is the degree of 00, d_2 is the degree of 11, d_3 is the degree of 01, d_4 is the degree of 10. The off-diagonal entries are the numbers of paths of length 2 in each component of the graph. That is, s_e is the number of suffixes that 00 and 11 have in common; s_o is the number of suffixes that 01 and 10 have in common. If T contains all samples, then both graphs are complete bipartite and the nonzero entries of ρ_2 are all equal (to 2^{N−3} in this case). In this case, ρ_2 is a rank-2 operator. It has two eigenvectors, one from each block. This is the idealized scenario: every sequence is present in the training set, and the eigendecomposition ρ_2 = U_2 D_2 U*_2 is then

ρ_2 = ½ ( |E_2⟩⟨E_2| ⊕ |O_2⟩⟨O_2| ),

where |E_2⟩ = (1/√2)(|00⟩ + |11⟩) denotes the normalized sum of even prefixes of length 2, and |O_2⟩ = (1/√2)(|01⟩ + |10⟩) denotes the normalized sum of odd prefixes of length 2. As a matrix, U_2 has |E_2⟩ and |O_2⟩ along its rows. We think of it as a "summarizer": it projects a prefix onto an axis that can be identified with either |E_2⟩ or |O_2⟩ according to its parity, perfectly summarizing the information of that prefix required to understand which suffixes it is paired with.

More generally, however, if T ≠ E_N then the reduced density ρ_2 may be full rank.
In this case we choose the eigenvectors |E′_2⟩, |O′_2⟩ that correspond to the two largest eigenvalues of ρ_2. We assume these eigenvectors come from distinct blocks. This defines the tensor U_2, which as a matrix has |E′_2⟩ and |O′_2⟩ along its rows, where

|E′_2⟩ = cos θ_2 |00⟩ + sin θ_2 |11⟩
|O′_2⟩ = cos φ_2 |01⟩ + sin φ_2 |10⟩

for some angles θ_2 and φ_2. These angles can be computed following the expression in (13) for the eigenvectors:

θ_2 = arctan( 2s_e / (√(G_e² + 4s_e²) + G_e) )  and  φ_2 = arctan( 2s_o / (√(G_o² + 4s_o²) + G_o) ).

Here, G_e = d_1 − d_2 and G_o = d_3 − d_4 denote the gaps between the diagonal entries in each block. The angles should be thought of as measuring the deviation from perfect learning in step 2: if f = 1 then G_e = G_o = 0 and so θ_2 = φ_2 = π/4, giving |E′_2⟩ = |E_2⟩ and |O′_2⟩ = |O_2⟩. In this case, step 2 has worked perfectly. Note that this is not an if-and-only-if scenario. Even if f < 1, the reduced density ρ_2 may still have |E_2⟩ and |O_2⟩ as its eigenvectors. Indeed, this occurs whenever G_e = G_o = 0 and s_e, s_o ≠ 0. In that case, the eigenvectors of ρ_2 are the desired parity vectors |E_2⟩, |O_2⟩, and the summarizer U_2 obtained is a true summarization tensor. But if G_e or G_o is nonzero, then step 2 induces a summarization error, which we measure as the deviation of θ_2 and φ_2 from the desired π/4. A small numerical sketch of this step-2 computation follows.
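Here is a small numerical sketch (ours; the training subset is a hypothetical random draw) of the step-2 computation just described: build the block entries of ρ_2 from a sampled fraction of E_N and compute the angles θ_2, φ_2.

import itertools
import numpy as np

N = 8
even = ["".join(b) for b in itertools.product("01", repeat=N)
        if b.count("1") % 2 == 0]

rng = np.random.default_rng(1)
f = 0.3                                            # fraction of E_N used for training
T = list(rng.choice(even, size=int(f * len(even)), replace=False))

def block(prefixes, samples):
    """Degrees and shared-suffix count for a pair of prefixes (one 2x2 block of rho_2)."""
    suf = {p: {s[2:] for s in samples if s.startswith(p)} for p in prefixes}
    d1, d2 = len(suf[prefixes[0]]), len(suf[prefixes[1]])
    shared = len(suf[prefixes[0]] & suf[prefixes[1]])
    return d1, d2, shared

def angle(d1, d2, s):
    G = d1 - d2
    return np.arctan2(2 * s, np.sqrt(G**2 + 4 * s**2) + G)

d1, d2, s_e = block(["00", "11"], T)               # even block
d3, d4, s_o = block(["01", "10"], T)               # odd block
theta, phi = angle(d1, d2, s_e), angle(d3, d4, s_o)
print(f"theta = {theta:.4f}, phi = {phi:.4f}, target = {np.pi/4:.4f}")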
The same analysis applies to the subsequent steps k = 3, ..., N, with minor adjustments to the combinatorics. So let us now describe the general schema. In the k-th step of the training algorithm, each sample is cut after the k-th bit and viewed as a prefix-suffix pair s = (a, b) where a ∈ Σ^k and b ∈ Σ^{N−k}. Let |ψ_k⟩ ∈ C^{Σ^k} ⊗ C^{Σ^{N−k}} denote the sum of the samples after having completed step k − 1, and let ρ_k := tr_{Σ^{N−k}} |ψ_k⟩⟨ψ_k| denote the reduced density on the prefix subsystem at step k. It is an operator on B_{k−1} ⊗ V_k, where B_{k−1} is a 2-dimensional space which may be identified with the span of the eigenvectors associated to the two largest eigenvalues of ρ_{k−1}. As a matrix, ρ_k is a direct sum of 2 × 2 blocks,

(19)  ρ_k = (1/(e_1 + o_1 + e_2 + o_2)) ( [ e_1  s_e ; s_e  o_1 ] ⊕ [ e_2  s_o ; s_o  o_2 ] ).

We postpone a description of the entries until Section 5.2. But note that, as in the case k = 2, the upper and lower blocks contain combinatorial information about prefixes of even and odd parity, respectively. As before, we are interested in the largest eigenvectors |E′_k⟩, |O′_k⟩ contributed by each block. They define the tensor U_k, which as a matrix has |E′_k⟩ and |O′_k⟩ along its rows, and can be understood inductively. The eigenvectors contain combinatorial information from step k along with data from step k − 1. Let |E′_1⟩ := |0⟩ and |O′_1⟩ := |1⟩. Then for k ≥ 2,

|E′_k⟩ = cos θ_k |E′_{k−1}⟩ ⊗ |0⟩ + sin θ_k |O′_{k−1}⟩ ⊗ |1⟩
|O′_k⟩ = cos φ_k |E′_{k−1}⟩ ⊗ |1⟩ + sin φ_k |O′_{k−1}⟩ ⊗ |0⟩

where

(20)  θ_k = arctan( 2s_e / (√(G_e² + 4s_e²) + G_e) )  and  φ_k = arctan( 2s_o / (√(G_o² + 4s_o²) + G_o) ).

Again, the angles are a measurement of the error accrued in step k. Significantly, no error is accrued when the gaps G_e := e_1 − o_1 and G_o := e_2 − o_2 vanish and the off-diagonal entries s_e, s_o are non-zero, for then θ_k = φ_k = π/4. As a matrix,

U_k = [ cos θ_k  sin θ_k ; cos φ_k  sin φ_k ],

and so U_k is akin to a map B_{k−1} ⊗ V_k → B_k that combines previously summarized information from B_{k−1} with new information from V_k. It then summarizes the resulting data by projecting onto one of two orthogonal vectors, which may be identified with |E′_k⟩ or |O′_k⟩, in the new bond space B_k. (The true orientation of the arrows on U_k in the tensor diagrams is down-left rather than up-right. But the vector spaces in question are finite-dimensional, and our standard bases provide an isomorphism between a space and its dual, so no information is lost by momentarily adjusting the arrows for the purposes of sharing intuition.)

In summary, this template provides a concrete handle on the tensors U_k that comprise the MPS factorization of |ψ⟩.

5.1. High-level summary.
We close by summarizing the high-level ideas present in this under-the-hood analysis. At the k-th step of the training algorithm one obtains a 4 × 4 matrix representing the reduced density ρ_k. It is given in Equation (18) in the case k = 2 and as in Equation (19) when k > 2. These matrices are obtained by tracing out the suffix subsystem from the projection |ψ_k⟩⟨ψ_k|, where |ψ_k⟩ is the sum of the samples in the training set after having completed step k − 1. Since |ψ_k⟩ depends on the error obtained in step k − 1, so does ρ_k. This error is defined by the angles θ_{k−1} and φ_{k−1}. As shown in Equation (20), these angles, and hence the error, are functions of the entries of the matrix representing ρ_{k−1}. So the k-th level density takes into account the errors accrued at each of the previous steps as well as combinatorial information in the present step. A partial trace computation thus directly leads to the matrix representation for ρ_k given in Equation (19). Explicitly, the non-zero entries of the matrix are computed by Equations (23) and (24). With this, one has full knowledge of the matrix ρ_k and therefore of its eigenvectors |E′_k⟩, |O′_k⟩. Written in the computational basis, they are of the form shown in Equation (13). These two eigenvectors then assemble to form the rows of the tensor U_k, viewed as a matrix with |E′_k⟩ and |O′_k⟩ as its two rows. The tensors obtained in this way contract to form the trained MPS |ψ_MPS⟩.

To measure the algorithm's performance, we begin by evaluating the inner product of this vector with an MPS decomposition of the target vector |E_N⟩, namely ⟨E_N|ψ_MPS⟩. The k-th tensor comprising the decomposition of |E_N⟩ is equal to U_k with θ_k and φ_k evaluated at π/4. The contraction thus results in a sum of products of cos θ_k, sin θ_k, cos φ_k, sin φ_k for k = 2, ..., N. More concretely, for each even bitstring s ∈ E_N the inner product ⟨s|ψ_MPS⟩ is the square root of the probability of s. For now, we'll refer to it as the weight w(s) := ⟨s|ψ_MPS⟩ associated to the sample s. For each s, its weight w(s) is a product of various cos θ_k, sin θ_k, cos φ_k, sin φ_k, the details of which are given in Section 5.2. The final overlap is then the sum

(21)  ⟨E_N|ψ_MPS⟩ = (1/√(2^{N−1})) Σ_{s∈E_N} w(s).

Now suppose the training set consists of a fraction f of the entire population. The entries of the reduced densities in (19) are described combinatorially, as detailed in the next section. This makes it possible to make statistical estimates for the gaps G_e and G_o and the off-diagonal entries s_e and s_o in (20). Therefore, we can make statistical predictions for the angles θ_k and φ_k and hence for the tensors U_k comprising the trained MPS and the resulting generalization error. The results are plotted in Figure 4, where we use the Bhattacharyya distance

(22)  −ln( (1/√(2^{N−1})) Σ_{s∈E_N} w(s) )

between the true population distribution and the one defined by either an experimentally trained MPS or the theoretically predicted one as a proxy for generalization error. The theoretical curve could, in principle, be improved by making more accurate statistical estimates for the combinatorics involved.

Figure 4. The experimental average (orange) and theoretical prediction (blue) of the weighted Bhattacharyya distance between the probability distribution learned experimentally and the true distribution, for bit strings of length N = 16 and training-set fractions 0 < f ≤ 1. Panel (a) shows the full range; panel (b) gives a closer look at small f.

5.2. Combinatorics of reduced densities.
We now describe the entries of the k-th level reduced density in Equation (19). They depend on certain combinatorics in step k as well as on error accumulated in the previous steps. The latter has an inductive description. To start, observe that the parity of a prefix a ∈ Σ^k is determined by its last bit together with the parity of its first k − 1 bits. The set Σ^k thus partitions into four sets:

E_1 = {a ∈ Σ^k : a = (e_{k−1}, 0) where e_{k−1} ∈ E_{k−1}}
O_1 = {a ∈ Σ^k : a = (o_{k−1}, 1) where o_{k−1} ∈ O_{k−1}}
E_2 = {a ∈ Σ^k : a = (e_{k−1}, 1) where e_{k−1} ∈ E_{k−1}}
O_2 = {a ∈ Σ^k : a = (o_{k−1}, 0) where o_{k−1} ∈ O_{k−1}}

By viewing the training set as a bipartite graph, one has a visual understanding of these sets. For k = 3, we use color to distinguish each set.

[bipartite graphs for k = 3 omitted: length-3 prefixes, colored according to the four sets E_1, O_1, E_2, O_2, joined to their suffixes; each prefix carries one of the weights cos θ_2, sin θ_2, cos φ_2, sin φ_2]

As shown, each prefix also has a weight that records its contribution to the error accumulated in previous steps. Concretely, we assign to each prefix a ∈ Σ^k a weight w(a), which is a product of k − 2 factors. For 2 ≤ i ≤ k − 1, the i-th factor of w(a) is defined to be

• cos θ_i if the parity of the first i − 1 bits is 0 and the i-th bit is 0,
• sin θ_i if the parity of the first i − 1 bits is 1 and the i-th bit is 1,
• cos φ_i if the parity of the first i − 1 bits is 0 and the i-th bit is 1,
• sin φ_i if the parity of the first i − 1 bits is 1 and the i-th bit is 0.

For example, if k = 3 then w(011) = cos φ_2. If k = 5 then w(01101) = cos θ_4 sin θ_3 cos φ_2. These weights are naturally associated to each tensor. For instance, recalling that each tensor U_k is akin to a summarizer, one sees w(01101) accumulate as U_2, U_3, and U_4 are applied in turn: first a factor cos φ_2, then sin θ_3, then cos θ_4.

We can now describe the entries of the reduced density defined in Equation (19). The first diagonal entry is

(23)  e_1 = Σ_{suffixes b} ( Σ_{a ∈ E_1 : (a,b) ∈ T} w(a) )²
0. For example, in the graph below e + 2 + 1 = 9.00 011 001 110 1 000110101011 E θ sin θ cos φ sin φ In general, though, the summands will not be integers but rather productsof weights. The off-diagonal entry in the even block of the reduced densityis(24) s e = (cid:88) suffixes b (cid:88) a ∈ E , a (cid:48) ∈ O a,b ) , ( a (cid:48) ,b ) ∈ T w ( a ) · w ( a (cid:48) ) When perfect learning occurs, s e counts the number of paths of length 2,where now a path is comprised of one edge from E O s e = 3 . = + + E O s e will be a sum of products of weights. The expressionfor the off-diagonal s o in the odd block is similar to that in Equation (24).In summary, the theory behind the reduced densities and their eigenvec-tors gives us an exact understanding of the error propagated through eachstep of the training algorithm. We may then predict the Bhattacharya dis-tance in (22) using statistical estimates of the expected combinatorics. Thisprovides an accurate prediction based solely on the fraction f of trainingsamples used and the length N of the sequences.6. Experiments
6. Experiments

The training algorithm was written using the ITensor library [20]; the code is available on GitHub. For a fixed fraction 0 < f ≤ 1, we train on a randomly chosen subset of N_T = f · 2^{N−1} even-parity bitstrings of length N = 16. We then compare the average Bhattacharyya distance in Equation (22) to the theoretical prediction. To handle the angles θ_k and φ_k in the theoretical model, we make a few simplifying assumptions about the expected behavior of the combinatorics.
First, we assume θ_k = φ_k for all k, since the combinatorics of both blocks of the reduced densities ρ_k in (19) have similar behavior. We further assume that the average angle θ_k is a function of the average off-diagonal entry s_e and the average diagonal gap G_e at the k-th step, that is, E[θ_k(s_e, G_e)] = θ_k(E[s_e], E[G_e]) for all k. The expectation for s_e is experimentally determined to be independent of k and dependent on the fraction f and bitstring length N alone: E[s_e] = f · N_T/4. We approximate the expected gap G_e at the k-th step by an experimentally determined function of f and the expected gap G = |d_1 − d_2| of the diagonal entries of the reduced density defined at step 2 of the algorithm. Understanding the expected behavior of G is similar to understanding the statistics of a coin toss. On average, one expects to flip the same number of heads and tails, and yet the expectation for their difference is non-zero. The distribution for G is similar, but a little different:

E[G] = Σ_d |2d − r| (n choose d)(n choose r − d) / (2n choose r),

where r = d_1 + d_2 = N_T/2 and n = 2^{N−3} is the number of even-parity bitstrings of length N − 2.
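As a check, here is a small sketch of ours evaluating this expectation exactly from the hypergeometric probabilities:

from math import comb

def expected_gap(N, f):
    """E[G] = sum_d |2d - r| * C(n, d) C(n, r - d) / C(2n, r),
    with n = 2**(N - 3) and r = N_T / 2, N_T = f * 2**(N - 1)."""
    n = 2 ** (N - 3)
    r = round(f * 2 ** (N - 1) / 2)
    total = comb(2 * n, r)
    return sum(abs(2 * d - r) * comb(n, d) * comb(n, r - d)
               for d in range(max(0, r - n), min(n, r) + 1)) / total

print(expected_gap(N=16, f=0.1))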
The plots in Figure 4 compare the theoretical estimate against the experimental average.
7. Conclusion
Models based on tensor networks open interesting directions for machine learning research. Tensor networks can be viewed as a sequence of related linear maps, which, by acting together on a very high-dimensional space, allow the model to be arbitrarily expressive. The underlying linearity and powerful techniques from linear algebra allow us to pursue a training algorithm where we can look "under the hood" to understand each step and its consequences for the ability of our model to reconstruct a particular dataset, the even-parity dataset.

Our work also highlights the advantages of working in a probability formalism based on the 2-norm. This is the same formalism used to interpret the wavefunction in quantum mechanics; here we use it as a framework to treat classical data. Density matrices naturally arise as the 2-norm analogue of the marginal probability distributions familiar from conventional 1-norm probability. Marginals still appear as the diagonal of the density matrix. Unlike marginals, the density matrices we use hold sufficient information to reconstruct the entire joint distribution. Our training algorithm can be summarized as estimating the density matrix from the training data, then reconstructing the joint distribution step by step from these density matrix estimates.

The theoretical predictions we obtained for the generative performance of the model agree well with the experimental results. Note that care is needed to compare these results, since the theoretical approach involves averaging over all possible training sets to produce a single typical weight MPS, whereas the experiments produce a different weight MPS for each training-set sample. In the near future, we look forward to extending our approach to other measures of model performance and behavior, and certainly other datasets as well.

More ambitiously, we hope this work points the way to theoretically sound and robust predictions of machine learning model performance based on empirical summaries of real-world data. If such predictions can be obtained for training algorithms that also produce state-of-the-art results, as tensor networks are starting to do, we anticipate this will continue to be an exciting program of research.
References

[1] Ulrich Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96–192, 2011.
[2] Steven R. White. Density matrix formulation for quantum renormalization groups. Physical Review Letters, 69(19):2863–2866, 1992.
[3] E. Miles Stoudenmire and D. J. Schwab. Supervised learning with quantum-inspired tensor networks. Advances in Neural Information Processing Systems (NIPS), 29:4799–4807, 2016.
[4] D. Perez-Garcia, F. Verstraete, M. M. Wolf, and J. I. Cirac. Matrix product state representations. Quantum Information and Computation, 7:401–430, 2007.
[5] E. Miles Stoudenmire. The tensor network, 2019. http://tensornetwork.org.
[6] Glen Evenbly. Tensors.net, 2019.
[7] Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.
[8] Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arXiv:1605.03795, 2016.
[9] E. Miles Stoudenmire. Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology, 3(3):034003, 2018.
[10] Ivan Glasser, Nicola Pancotti, and J. Ignacio Cirac. Supervised learning with generalized tensor networks. arXiv:1806.05964, 2018.
[11] Chu Guo, Zhanming Jie, Wei Lu, and Dario Poletti. Matrix product operators for sequence-to-sequence learning. Phys. Rev. E, 98:042114, 2018.
[12] Glen Evenbly. Number-state preserving tensor networks as classifiers for supervised learning. arXiv:1905.06352, 2019.
[13] Ding Liu, Shi-Ju Ran, Peter Wittek, Cheng Peng, Raul Blázquez García, Gang Su, and Maciej Lewenstein. Machine learning by unitary tensor network of hierarchical tree structure. New Journal of Physics, 21(7):073059, 2019.
[14] Stavros Efthymiou, Jack Hidary, and Stefan Leichenauer. TensorNetwork for machine learning. arXiv:1906.06329, 2019.
[15] Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Phys. Rev. X, 8:031012, 2018.
[16] Zhuan Li and Pan Zhang. Shortcut matrix product states and its applications. arXiv:1812.05248, 2018.
[17] James Stokes and John Terilla. Probabilistic modeling with matrix product states. arXiv:1902.06888, 2019.
[18] Song Cheng, Lei Wang, Tao Xiang, and Pan Zhang. Tree tensor networks for generative modeling. Phys. Rev. B, 99:155131, 2019.
[19] Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and J. Ignacio Cirac. Expressive power of tensor-network factorizations for probabilistic modeling, with applications from hidden Markov models to quantum machine learning. arXiv:1907.03741, 2019.
[20] ITensor Library (version 3.0.0). https://itensor.org.

CUNY Graduate Center, New York, NY
E-mail address: [email protected]

Flatiron Institute, New York, NY, A Division of the Simons Foundation
E-mail address: [email protected]

Tunnel, New York, NY
E-mail address: