Learning Hidden Quantum Markov Models
Siddarth Srinivasan, Geoff Gordon, and Byron Boots
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
October 26, 2017
Abstract
Hidden Quantum Markov Models (HQMMs) can be thought of as quantum probabilistic graphical models that can model sequential data. We extend previous work on HQMMs with three contributions: (1) we show how classical hidden Markov models (HMMs) can be simulated on a quantum circuit, (2) we reformulate HQMMs by relaxing the constraints for modeling HMMs on quantum circuits, and (3) we present a learning algorithm to estimate the parameters of an HQMM from data. While our algorithm requires further optimization to handle larger datasets, we are able to evaluate our algorithm using several synthetic datasets. We show that on HQMM-generated data, our algorithm learns HQMMs with the same number of hidden states and predictive accuracy as the true HQMMs, while HMMs learned with the Baum-Welch algorithm require more states to match the predictive accuracy.
We extend previous work on Hidden Quantum Markov Models (HQMMs), and propose a novel approach to learning these models from data. HQMMs can be thought of as a new, expressive class of graphical models that have adopted the mathematical formalism for reasoning about uncertainty from quantum mechanics. We stress that while HQMMs could naturally be implemented on quantum computers, we do not need such a machine for these models to be of value. Instead, HQMMs can be viewed as novel models inspired by quantum mechanics that can be run on classical computers. In considering these models, we are interested in answering three questions: (1) how can we construct quantum circuits to simulate classical Hidden Markov Models (HMMs); (2) what happens if we take full advantage of this quantum circuit instead of enforcing the classical probabilistic constraints; and (3) how do we learn the parameters for quantum models from data?

The paper is structured as follows: first we describe related work and provide background on quantum information theory as it relates to our work. Next, we describe the hidden quantum Markov model, compare our approach to previous work in detail, and give a scheme for writing any hidden Markov model as an HQMM. Finally, our main contribution is the introduction of a maximum-likelihood-based unsupervised learning algorithm that can estimate the parameters of an HQMM from data. Our implementation is slow to train HQMMs on large datasets, and will require further optimization. Instead, we evaluate our learning algorithm for HQMMs on several simple synthetic datasets by learning a quantum model from data and filtering and predicting with the learned model. We also compare our model and learning algorithm to maximum likelihood for learning hidden Markov models, and show that the more expressive HQMM can match HMMs' predictive capability with fewer hidden states on data generated by HQMMs.

Background
Hidden Quantum Markov Models were introduced by Monras et al. [2010], who discussed their relationship to classical HMMs, and parameterized these HQMMs using a set of Kraus operators. Clark et al. [2015] further investigated HQMMs, and showed that they could be viewed as open quantum systems with instantaneous feedback. We arrive at the same Kraus operator representation by building a quantum circuit to simulate a classical HMM and then relaxing some constraints.

Our work can be viewed as extending previous work by Zhao and Jaeger [2010] on norm-observable operator models (NOOMs) and Jaeger [2000] on observable operator models (OOMs). We show that HQMMs can be viewed as complex-valued extensions of NOOMs, formulated in the language of quantum mechanics. We use this connection to adapt the learning algorithm for NOOMs in M. Zhao [2007] into the first known learning algorithm for HQMMs, and demonstrate that the theoretical advantages of HQMMs also hold in practice.

Schuld et al. [2015a] and Biamonte et al. [2016] provide general overviews of quantum machine learning, and describe relevant work on HQMMs. They suggest that developing algorithms that can learn HQMMs from data is an important open problem. We provide just such a learning algorithm in Section 4. Other work at the intersection of machine learning and quantum mechanics includes Wiebe et al. [2016] on quantum perceptron models and learning algorithms. Schuld et al. [2015b] discuss simulating a perceptron on a quantum computer.
Classical discrete latent variable models represent uncertainty with a probability distribution using a vector x⃗ whose entries describe the probability of being in the corresponding system state. Each entry is real and non-negative, and the entries sum to 1. In general, we refer to the run-time system component that maintains a state estimate of the latent variable as an 'observer', and we refer to the observer's state as a 'belief state.' A common example is the belief state that results from conditioning on observations in an HMM.

In quantum mechanics, the quantum state of a particle A can be written using Dirac notation as |ψ⟩_A, a column vector in some orthonormal basis (the row vector is the complex-conjugate transpose ⟨ψ| = (|ψ⟩)†), with each entry being the 'probability amplitude' corresponding to that system state. The squared norm of the probability amplitude for a system state is the probability of observing that state, so the sum of squared norms of the probability amplitudes over all system states must be 1 to conserve probability. For example, |ψ⟩ = [1/√2  −i/√2]† is a valid quantum state, with basis states 0 and 1 having equal probability: ‖1/√2‖² = ‖−i/√2‖² = 1/2. However, unlike classical belief states such as x⃗ = [1/2  1/2]^T, where the probability of different states reflects an ignorance of the underlying system, a pure quantum state like the one described above is the true description of the system; the system is in both states simultaneously.

But how can we describe classical mixtures of quantum systems ('mixed states'), where we maintain classical uncertainty about the underlying quantum states? Such information can be captured by a 'density matrix.' Given a mixture of N quantum systems, each with probability p_i, the density matrix for this ensemble is defined as follows:

    ρ̂ = Σ_{i=1}^{N} p_i |ψ_i⟩⟨ψ_i|    (1)

The density matrix is the general quantum equivalent of the classical belief state x⃗ and has diagonal elements representing the probabilities of being in each system state. Consequently, the normalization condition is tr(ρ̂) = 1. The off-diagonal elements represent quantum coherences and entanglement, which have no classical interpretation. The density matrix ρ̂ can be used to describe the state of any quantum system.

The density matrix can also be extended to represent the joint state of multiple variables, or that of 'multi-particle' systems, to use the physical interpretation. If we have density matrices ρ̂_A and ρ̂_B for two qudits (a d-state quantum system, akin to qubits or 'quantum bits', which are 2-state quantum systems) A and B, we can take the tensor product to arrive at the density matrix for the joint state of the particles: ρ̂_AB = ρ̂_A ⊗ ρ̂_B. As a valid density matrix, the diagonal elements of this joint density matrix represent probabilities; tr(ρ̂_AB) = 1, and the probabilities correspond to the states in the Cartesian product of the basis states of the composite particles.
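As a concrete illustration (ours, not part of the original paper), the following NumPy sketch builds a density matrix from an ensemble of pure states as in Equation 1, and contrasts the pure state from the text with a classical 50/50 mixture: both share the same diagonal, but only the pure state carries off-diagonal coherences.

```python
import numpy as np

def density_matrix(states, probs):
    """Mix pure states |psi_i> with weights p_i: rho = sum_i p_i |psi_i><psi_i| (Eq. 1)."""
    d = len(states[0])
    rho = np.zeros((d, d), dtype=complex)
    for p, psi in zip(probs, states):
        psi = np.asarray(psi, dtype=complex).reshape(-1, 1)
        rho += p * (psi @ psi.conj().T)
    return rho

# The pure state from the text: amplitudes 1/sqrt(2) and -i/sqrt(2)
psi = np.array([1 / np.sqrt(2), -1j / np.sqrt(2)])
rho_pure = density_matrix([psi], [1.0])

# A classical 50/50 mixture of the basis states |0> and |1>
rho_mixed = density_matrix([np.array([1, 0]), np.array([0, 1])], [0.5, 0.5])

# Same diagonal (measurement probabilities) ...
assert np.allclose(np.diag(rho_pure).real, [0.5, 0.5])
assert np.allclose(np.diag(rho_mixed), np.diag(rho_pure))
# ... but only the pure state has off-diagonal coherences
assert not np.allclose(rho_pure, rho_mixed)
assert np.isclose(np.trace(rho_pure).real, 1.0)
```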
In this paper, the joint density matrix will serve as the analogue to the classical joint probability distribution, with the off-diagonal terms encoding extra 'quantum' information.

Given the joint state of a multi-particle system, we can examine the state of just one or a few of the particles using the 'partial trace' operation, where we trace over the diagonal elements of the particles we wish to disregard. This lets us recover a 'reduced density matrix' for a subsystem of interest. The partial trace for a two-particle system ρ̂_AB, where we trace over the second particle to obtain the state of the first particle, is:

    ρ̂_A = tr_B(ρ̂_AB) = Σ_j ⟨j|_B ρ̂_AB |j⟩_B    (2)

For our purposes, this operation will serve as the quantum analogue of classical marginalization.

Finally, we discuss the quantum analogue of 'conditioning' on an observation. In quantum mechanics, the act of measuring a quantum system can change the underlying distribution, i.e., collapse it to the observed state in the measurement basis; this is represented mathematically by applying von Neumann projection operators (denoted P̂_y in this paper) to the density matrices describing the system. One can think of the projection operator as a matrix of zeros with ones in the diagonal entries corresponding to observed system states. If we are only observing one part of a larger joint system, the system collapses to the states where that subsystem had the observed result. For example, suppose we have the following density matrix for a two-state two-particle system with basis {|0⟩_A|0⟩_B, |0⟩_A|1⟩_B, |1⟩_A|0⟩_B, |1⟩_A|1⟩_B}:

    ρ̂_AB = [ 0.25    0       0      0
              0       0.25   −0.25   0
              0      −0.25    0.25   0
              0       0       0      0.25 ]    (3)

Suppose we measure the state of particle B, and find it to be in state |1⟩_B. The corresponding projection operator is P̂_B = diag(0, 1, 0, 1), and the collapsed state, after normalization, is:

    ρ̂_AB = P̂_B ρ̂_AB P̂_B† / tr(P̂_B ρ̂_AB P̂_B†) = [ 0   0     0   0
                                                      0   0.5   0   0
                                                      0   0     0   0
                                                      0   0     0   0.5 ]

When we trace over particle A to get the state of particle B, the result is ρ̂_B = [0 0; 0 1], reflecting the fact that particle B is now in state |1⟩_B with certainty. Tracing over particle B, we find ρ̂_A = [0.5 0; 0 0.5], indicating that particle A still has an equal probability of being in either state. Note that measuring |1⟩_B has changed the underlying distribution of the system ρ̂_AB; the probability of measuring the state of particle B to be |0⟩_B is now 0, whereas before measurement we had a 0.25 + 0.25 = 0.5 chance of measuring |0⟩_B. This is unlike classical probability, where measuring a variable doesn't change the joint distribution. We will use this fact when we construct our quantum circuit to simulate HMMs.

Thus, if we have an n-state quantum system that tracks a particle's evolution, and an s-state quantum system that tracks the likelihood of observing various outputs as they depend (probabilistically) on the n-state system, then upon observing an output y, we apply the projection operator P̂_y on the joint system and trace over the second particle to obtain the n-state system conditioned on observation y.

Table 1: Comparison between classical and quantum representations

                             Classical probability            Quantum analogue
    Belief state             x⃗                                ρ̂ (density matrix)
    Joint distribution       x⃗ ⊗ x⃗                            ρ̂_X ⊗ ρ̂_X (multi-particle density matrix)
    Marginalization          x⃗ = Σ_y x⃗ ⊗ y⃗                    ρ̂_X = tr_Y(ρ̂_XY) (partial trace)
    Conditional probability  P(x⃗ | y) = P(y, x⃗) / P(y)        P(states | y) = tr_Y(P̂_y ρ̂_XY P̂_y†) (projection + partial trace)
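The worked measurement example above can be checked numerically. The following NumPy sketch (our illustration) reproduces the projection and partial-trace computations, with the partial trace implemented via index contraction:

```python
import numpy as np

# Joint density matrix from Eq. (3), basis {|0>|0>, |0>|1>, |1>|0>, |1>|1>}
rho_AB = np.array([[0.25,  0.0,   0.0,  0.0],
                   [0.0,   0.25, -0.25, 0.0],
                   [0.0,  -0.25,  0.25, 0.0],
                   [0.0,   0.0,   0.0,  0.25]])

def ptrace(rho, keep):
    """Partial trace of a 2-qubit density matrix; keep='A' or 'B' (Eq. 2)."""
    R = rho.reshape(2, 2, 2, 2)  # indices (a, b, a', b')
    return np.einsum('ijkj->ik', R) if keep == 'A' else np.einsum('ijik->jk', R)

# Project onto outcomes where particle B is |1>_B, then renormalize
P_1B = np.diag([0.0, 1.0, 0.0, 1.0])
collapsed = P_1B @ rho_AB @ P_1B.conj().T
collapsed /= np.trace(collapsed)

assert np.allclose(ptrace(collapsed, 'B'), [[0, 0], [0, 1]])      # B is |1>_B with certainty
assert np.allclose(ptrace(collapsed, 'A'), [[0.5, 0], [0, 0.5]])  # A is still 50/50
# Before measurement, P(|0>_B) was 0.25 + 0.25 = 0.5
assert np.isclose(np.diag(rho_AB)[0] + np.diag(rho_AB)[2], 0.5)
```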
Classical Hidden Markov Models (HMMs) are graphical models used to model dynamic processes that exhibit Markovian state evolution. Figure 1 depicts a classical HMM, where the transition matrix A and emission matrix C are column-stochastic matrices that determine the Markovian hidden-state evolution and observation probabilities, respectively. Bayesian inference can be used to track the evolution of the hidden variable.

Figure 1: Hidden Markov Model

The belief state at time t is a probability distribution over states, and prior to any observation is written as:

    x⃗'_t = A x⃗_{t−1}    (4)

The probability of observing each output at time t is given by the vector s⃗_t:

    s⃗_t = C x⃗'_t = C A x⃗_{t−1}    (5)

We can use Bayesian inference to write the belief state vector after conditioning on observation y:

    x⃗_t = diag(C(y,:)) A x⃗_{t−1} / (1⃗^T diag(C(y,:)) A x⃗_{t−1})    (6)

where diag(C(y,:)) is a diagonal matrix with the entries of the y-th row of C along the diagonal, and the denominator renormalizes the vector x⃗_t.

An alternate representation of the hidden Markov model uses 'observable' operators (Jaeger [2000]). Instead of using the matrices A and C, we can write T_y = diag(C(y,:)) A. There is a different operator T_y for each possible observable output y, and [T_y]_{ij} = P(y, i_t | j_{t−1}). We can then rewrite Equation 6 as:

    x⃗_t = T_y x⃗_{t−1} / (1⃗^T T_y x⃗_{t−1})

If we observe outputs y_1, ..., y_n, we apply T_{y_n} · · · T_{y_1} x⃗_0 and take the sum of the entries of the resulting vector to find the probability of observing the sequence, or renormalize to find the belief state after the final observation.
Hidden Quantum Markov Models

Let us now contrast state evolution in quantum systems with state evolution in HMMs. The quantum analogue of observable operators is a set of non-trace-increasing Kraus operators {K̂_i}, which are completely positive (CP) linear maps. A trace-preserving set of Kraus operators, with Σ_{i=1}^{N} K̂_i† K̂_i = I, maps a density operator to another density operator. A trace-decreasing set, with Σ_{i=1}^{N} K̂_i† K̂_i < I, represents operations on a smaller part of a quantum system that can allow probability to 'leak' to other states that aren't being considered. This paper will formulate problems such that all sets of Kraus operators are trace-preserving. When there is only one operator in the set, i.e., Û such that Û†Û = I, then Û is a unitary matrix. Unitary operators generally model the evolution of the 'whole' system, which may be high-dimensional. But if we care only about tracking the evolution of a smaller sub-system, which may interact with its environment, we can use Kraus operators. The most general quantum operation that can be performed on a density matrix is

    ρ̂' = Σ_{i=1}^{M} K̂_i ρ̂ K̂_i† / tr(Σ_{i=1}^{M} K̂_i ρ̂ K̂_i†),

where the denominator re-normalizes the density matrix.

Now, how do we simulate classical HMMs on quantum circuits with qudits, where computation is done using unitary operations? There is no general way to convert column-stochastic transition and emission matrices to unitary matrices, so we prepare 'ancilla' particles and construct unitary matrices (see Algorithm 1) to act on the joint state. We then trace over one particle to obtain the state of the other.
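As a quick sanity check of the trace-preservation condition, the following NumPy sketch (our illustration, not part of the paper) builds a valid Kraus set by partitioning the first block-column of a random unitary, then applies the general quantum operation above; the trace of the density matrix is conserved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Any unitary, partitioned along its first block-column, yields a
# trace-preserving Kraus set: sum_i K_i^dagger K_i = (U^dagger U)[:2, :2] = I
U, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
K = [U[0:2, 0:2], U[2:4, 0:2]]
assert np.allclose(sum(k.conj().T @ k for k in K), np.eye(2))

# The general quantum operation: rho' = sum_i K_i rho K_i^dagger / tr(...)
rho = np.diag([0.7, 0.3]).astype(complex)
num = sum(k @ rho @ k.conj().T for k in K)
rho_new = num / np.trace(num)
assert np.isclose(np.trace(rho_new).real, 1.0)
```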
Algorithm 1: s × n Column-Stochastic Matrix to ns × ns Unitary Matrix

Input: s × n column-stochastic matrix A
Output: ns × ns block-diagonal unitary matrix Û with n blocks of s × s unitary matrices, and zeros everywhere else

1. Construct an s × s unitary matrix from each column of A: Let c_i denote the i-th column of A. First create an s × s matrix each of whose rows is the element-wise square root √c_i^T. Find the null space of this matrix, which yields s − 1 vectors orthogonal to √c_i. Make √c_i the first column, and the remaining s − 1 vectors the other columns, of an s × s matrix.
2. Stack each s × s matrix on the diagonal: Follow step 1 for each column of A to obtain n unitary matrices of dimension s × s. Create a block-diagonal matrix with each of these smaller unitary matrices along the diagonal, yielding an ns × ns dimensional unitary matrix Û.

Note: The unitary operator constructed here is designed to be applied to a density matrix tensored with an environment density matrix prepared with zeros everywhere except ρ̂_{1,1} = 1.

Figure 2a illustrates a quantum circuit constructed with these unitary matrices. By preparing the 'ancilla' states ρ̂_{X_t} and ρ̂_{Y_t} appropriately (i.e., entirely in system state 1, represented by a density matrix of zeros except ρ̂_{1,1} = 1), we construct Û_1 and Û_2 from the transition matrix A and emission matrix C, respectively. Û_1 evolves (ρ̂_{t−1} ⊗ ρ̂_{X_t}) to perform the Markovian transition, while Û_2 updates ρ̂_{Y_t} to contain the probabilities of measuring each observable output. At runtime, we measure ρ̂_{Y_t}, which changes the joint distribution of ρ̂_{X_t} ⊗ ρ̂_{Y_t} to give the updated conditioned state ρ̂_t. Mathematically, this is equivalent to applying a projection operator on the joint state and tracing over ρ̂_{Y_t}.
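Algorithm 1 can be sketched in a few lines of NumPy (our illustration; the orthonormal completion of √c_i is obtained here from the SVD rather than an explicit null-space routine):

```python
import numpy as np

def column_to_unitary(c):
    """Embed a probability column c into an s x s unitary whose first column is sqrt(c)."""
    v = np.sqrt(c)                        # unit vector, since the entries of c sum to 1
    _, _, Vt = np.linalg.svd(v.reshape(1, -1))
    # Rows 1..s-1 of Vt span the null space of v^T, i.e., an orthonormal completion
    return np.column_stack([v] + [Vt[k] for k in range(1, len(c))])

def stochastic_to_unitary(A):
    """Algorithm 1: s x n column-stochastic A -> ns x ns block-diagonal unitary."""
    s, n = A.shape
    U = np.zeros((n * s, n * s))
    for j in range(n):
        U[j*s:(j+1)*s, j*s:(j+1)*s] = column_to_unitary(A[:, j])
    return U

A = np.array([[0.7, 0.1],
              [0.3, 0.9]])
U = stochastic_to_unitary(A)
assert np.allclose(U @ U.T, np.eye(4))          # block-diagonal U is unitary
assert np.allclose(U[:2, 0] ** 2, A[:, 0])      # squared amplitudes recover column 0 of A
```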
Thus, the forward algorithm corresponding to Figure 2a, which explicitly models a hidden Markov model on a quantum circuit, can be written as:

    ρ̂_t ∝ tr_{ρ̂_{Y_t}} ( P̂_y Û_2 ( tr_{ρ̂_{t−1}} ( Û_1 (ρ̂_{t−1} ⊗ ρ̂_{X_t}) Û_1† ) ⊗ ρ̂_{Y_t} ) Û_2† P̂_y† )    (7)

We can simplify this circuit to use Kraus operators acting on the lower-dimensional state space of ρ̂_{X_t}. Since we always prepare ρ̂_{Y_t} in the same state, the operation Û_2 on the joint state ρ̂_{X_t} ⊗ ρ̂_{Y_t} followed by the application of the projection operator P̂_y can be more concisely written as a Kraus operator on just ρ̂_{X_t}, so that we need only be concerned with representing how the particle ρ̂_{X_t} evolves. We would need to construct a set of Kraus operators {K̂_y}, one for each observable output y, such that Σ_y K̂_y† K̂_y = I.

Tensoring with an ancilla qudit and tracing over a qudit can be achieved with an ns × n matrix W and an n × ns matrix V_y, respectively, since we always prepare our ancilla qudits in the same state (details on constructing these matrices can be found in the Appendix), so that:

    ρ̂_{X_t} ⊗ ρ̂_{Y_t} → W ρ̂_{X_t} W†
    tr_{ρ̂_{Y_t}} ( P̂_y Û_2 W ρ̂_{X_t} W† Û_2† P̂_y† ) → V_y P̂_y Û_2 W ρ̂_{X_t} W† Û_2† P̂_y† V_y†    (8)

Figure 2: HMM implementation on quantum circuits. (a) Full quantum circuit to implement an HMM. (b) Simplified scheme to implement an HMM.

Figure 3: Quantum schemes implementing classical HMMs. (a) HQMM scheme with separate transition and emission, K_w = Σ_w K̂_w (·) K̂_w†. (b) A generalized scheme for HQMMs, K_{w,y_{t−1}} = Σ_w K̂_{w,y_{t−1}} (·) K̂†_{w,y_{t−1}}.

We can then construct Kraus operators such that K̂_y = V_y P̂_y Û_2 W.
Figure 2b shows this updated circuit, where Û_1 is still the quantum implementation of the transition matrix and K̂_{y_t} is the quantum implementation of the Bayesian update after observation. This scheme to model a classical HMM can be written as:

    ρ̂_t = K̂_{y_{t−1}} ( tr_{ρ̂_{t−1}} ( Û_1 (ρ̂_{t−1} ⊗ ρ̂_{X_t}) Û_1† ) ) K̂†_{y_{t−1}} / tr ( K̂_{y_{t−1}} ( tr_{ρ̂_{t−1}} ( Û_1 (ρ̂_{t−1} ⊗ ρ̂_{X_t}) Û_1† ) ) K̂†_{y_{t−1}} )    (9)

We can similarly simplify Û_1 to a set of Kraus operators. We write the unitary operation Û_1 in terms of a set of n Kraus operators {K̂_w}, as if we were to measure ρ̂_{t−1} immediately after the operation Û_1. However, instead of applying the one Kraus operator associated with a measurement, as we do with Figure 2b, we sum over all n possible 'observations', as if to 'ignore' the observation on ρ̂_{t−1}. Post-multiplying each Kraus operator in {K̂_w} with each operator in {K̂_y}, we have a set of Kraus operators {K̂_{w_y,y}} that can be used to model a classical HMM as follows (the full procedure is described in Algorithm 2):

    ρ̂_t = Σ_{w_y} K̂_{w_y,y_{t−1}} ρ̂_{t−1} K̂†_{w_y,y_{t−1}} / tr ( Σ_{w_y} K̂_{w_y,y_{t−1}} ρ̂_{t−1} K̂†_{w_y,y_{t−1}} )    (10)

We believe this procedure to be a useful illustration of performing classical operations on graphical models using quantum circuits. In practice, we needn't construct the Kraus operators in this peculiar fashion to simulate HMMs; an equivalent but simpler approach is to construct the observable operators {T_y} from the transition and emission matrices as described in Section 2.3, and set the w-th column of K̂_{w_y,y} to the element-wise square root of the w-th column of T_y, i.e., K̂_{w_y,y}^{(:,w)} = √(T_y^{(:,w)}), with all other entries being zero. This ensures Σ_{w_y,y} K̂†_{w_y,y} K̂_{w_y,y} = I.
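The simpler square-root construction above can be checked numerically. This NumPy sketch (our illustration, with a hypothetical 2-state, 2-output HMM) builds the Kraus operators from the observable operators T_y, verifies the completeness relation, and confirms that the HQMM update on a diagonal density matrix reproduces the classical HMM filter:

```python
import numpy as np

def hmm_to_hqmm_kraus(A, C):
    """One Kraus operator per (w, y): column w of K_{w,y} is sqrt of column w of T_y."""
    n = A.shape[0]
    ops = {}
    for y in range(C.shape[0]):
        Ty = np.diag(C[y]) @ A
        for w in range(n):
            K = np.zeros((n, n))
            K[:, w] = np.sqrt(Ty[:, w])
            ops[(w, y)] = K
    return ops

A = np.array([[0.9, 0.2], [0.1, 0.8]])   # hypothetical column-stochastic transition
C = np.array([[0.7, 0.1], [0.3, 0.9]])   # hypothetical column-stochastic emission
ops = hmm_to_hqmm_kraus(A, C)

# Completeness: sum_{w,y} K^dagger K = I
S = sum(K.conj().T @ K for K in ops.values())
assert np.allclose(S, np.eye(2))

# The HQMM update (Eq. 10) on a diagonal rho reproduces the HMM filter
x = np.array([0.5, 0.5]); rho = np.diag(x); y = 0
num = sum(ops[(w, y)] @ rho @ ops[(w, y)].conj().T for w in range(2))
Ty = np.diag(C[y]) @ A
assert np.allclose(np.diag(num), Ty @ x)   # un-normalized diagonal equals T_y x
```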
Algorithm 2: Simulating Hidden Markov Models with HQMMs

Input: Transition matrix A and emission matrix C
Output: Belief state as diag(ρ̂_t), or P(y_1, ..., y_n | D), where D is the HMM

Initialization: Let s = number of outputs, n = number of hidden states, and y_t = the observed output at time t. Prepare the density matrix ρ̂_0 in some initial state; ρ̂_0 = diag(π⃗) if the priors π⃗ are known.
1. Construct unitary matrices Û_1 and Û_2 from A and C respectively, using Algorithm 1 (in appendix).
2. Using Û_1 and Û_2, construct a set of n Kraus operators {K̂_w} and s Kraus operators {K̂_y}, with K̂_w = V_w Û_1 W and K̂_y = V_y P̂_y Û_2 W, and combine them into a set {K̂_{w_y,y}} with K̂_{w_y,y} = K̂_y K̂_w. (The matrix W tensors with an ancilla, the matrix V_y carries out a trivial partial trace operation, and summing over V_w for all w carries out the proper partial trace operation. Details in appendix.)
3. for t = 1 : T do
       ρ̂_t ← Σ_{w_y} K̂_{w_y,y_t} ρ̂_{t−1} K̂†_{w_y,y_t}
   end for
4. tr(ρ̂_T) gives the probability of the sequence; renormalizing ρ̂_T gives the belief state on the diagonal.

Monras et al. [2010] formulate hidden quantum Markov models by defining a set of Kraus operators {K̂_{w_y,y}}, where each observable y has w_y associated Kraus operators acting on a state with hidden dimension n, and they form a complete set such that Σ_{w,y} K̂†_{w,y} K̂_{w,y} = I. The update rule for a quantum operation is exactly the same as Equation 10, which we arrived at by first constructing a quantum circuit to simulate HMMs with known parameters and then constructing the operators {K̂_{w,y}} in a very peculiar way. The process outlined in the previous section is thus a particular parameterization of HQMMs that models HMMs. If we let the operators Û_1 and Û_2 be any unitary matrices, or the Kraus operators be any set of complex-valued matrices that satisfy Σ_{w_y,y} K̂†_{w_y,y} K̂_{w_y,y} = I, then we have a general and fully quantum HQMM.

Indeed, Equation 10 gives the forward algorithm for HQMMs.
To find the probability of emitting an output y given the previous state ρ̂_{t−1}, we simply take the trace of the numerator in Equation 10, i.e., p(y_t | ρ̂_{t−1}) = tr ( Σ_{w_y} K̂_{w_y,y_t} ρ̂_{t−1} K̂†_{w_y,y_t} ).

The number of parameters for an HQMM is determined by the number of latent states n, outputs s, and Kraus operators w associated with each output. To exactly simulate HMM dynamics with an HQMM, we need w = n, as per the derivation above. However, this constraint need not hold for a general HQMM, which can have any number of Kraus operators that we apply and sum for a given output. w can also be thought of as the dimension of the ancilla ρ̂_{X_t} that we tensor with in Figure 2a before the unitary operation Û_1. Consequently, if we set w = 1, we do not tensor with an additional particle, but model the evolution of the original particle as unitary. In all, an HQMM requires learning n²sw parameters, which is a factor of w more than an HMM in the observable-operator representation, which has n²s parameters. The canonical representation of HMMs, with an n × n transition matrix and an s × n emission matrix, has n² + ns parameters.

HQMMs can also be seen as a complex-valued extension of the norm-observable operator models defined by Zhao and Jaeger [2010]. Indeed, the HQMM we get by applying Algorithm 2 to an HMM is also a valid NOOM (allowing for multiple operators per output), implying that HMMs can be simulated by NOOMs. We can also state that both HMMs and NOOMs can be simulated by HQMMs (the latter is trivially true). While Zhao and Jaeger [2010] show that any NOOM can be written as an OOM, the exact relationship between HQMMs and OOMs is not straightforward owing to the complex entries in HQMMs, and requires further investigation.
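The HQMM forward algorithm of Equation 10, together with the emission probability above, can be sketched as follows (our illustration; the toy 2-state, 2-output model with one Kraus operator per observable is hypothetical, built from a scaled identity and a scaled rotation so that the completeness relation holds):

```python
import numpy as np

def hqmm_step(rho, kraus_ops_y):
    """One step of the forward algorithm (Eq. 10) after observing output y."""
    num = sum(K @ rho @ K.conj().T for K in kraus_ops_y)
    p_y = np.trace(num).real          # p(y | rho): the trace of the numerator
    return num / p_y, p_y

def log_likelihood(rho0, kraus, ys):
    """Log-likelihood of a sequence; renormalizing each step avoids underflow."""
    rho, ll = rho0, 0.0
    for y in ys:
        rho, p = hqmm_step(rho, kraus[y])
        ll += np.log(p)
    return ll

th = 0.6
K0 = np.eye(2) / np.sqrt(2)
K1 = np.array([[np.cos(th), -np.sin(th)],
               [np.sin(th),  np.cos(th)]]) / np.sqrt(2)
kraus = {0: [K0], 1: [K1]}
assert np.allclose(sum(K.conj().T @ K for Ks in kraus.values() for K in Ks), np.eye(2))

rho0 = np.diag([1.0, 0.0]).astype(complex)
ll = log_likelihood(rho0, kraus, [0, 1, 0, 1])
assert np.isclose(ll, 4 * np.log(0.5))   # in this toy model every output has probability 1/2
```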
An Iterative Algorithm for Learning HQMMs

We present an iterative maximum-likelihood algorithm to learn Kraus operators to model sequential data using an HQMM. Our algorithm is general enough that it can be applied to any quantum version of a classical machine learning algorithm for which the loss is defined in terms of the Kraus operators to be learned.

We begin by writing the likelihood of observing some sequence y_1, ..., y_T. Recall that for a given output y, we apply the w Kraus operators associated with that observable in the 'forward' algorithm, as Σ_{w_y} K̂_{w_y,y} (·) K̂†_{w_y,y}. If we do not renormalize the density matrix after applying these operators, the diagonal entries contain the joint probability of the corresponding system states and observing the associated sequence of outputs. The trace of this un-normalized density matrix gives the probability of observing the sequence, since we have summed over (i.e., marginalized) all the 'hidden' states. Thus, the general log-likelihood of a sequence of length n under an HQMM where each observable y has w_y associated Kraus operators is:

    L = ln tr ( Σ_{w_{y_n}} K̂_{w_{y_n},y_n} · · · ( Σ_{w_{y_1}} K̂_{w_{y_1},y_1} ρ̂_0 K̂†_{w_{y_1},y_1} ) · · · K̂†_{w_{y_n},y_n} )    (11)

It is not straightforward to directly maximize this log-likelihood using gradient descent; we must preserve the Kraus operator constraints, and long sequences can quickly lead to underflow issues. Our approach is to learn an nsw × n matrix κ*, which is essentially the set of ws Kraus operators {K̂_{w,y}} of dimension n × n, stacked vertically. The Kraus operator constraint requires Σ_{w,y} K̂†_{w,y} K̂_{w,y} = I, which implies κ†κ = I, i.e., the columns of κ are orthonormal.
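The equivalence between the Kraus constraint and orthonormal columns of κ is easy to verify numerically. In this sketch (our illustration), a random κ with orthonormal columns is generated via a QR decomposition and unstacked into Kraus operators:

```python
import numpy as np

n, s, w = 2, 2, 1
rng = np.random.default_rng(0)

# A random nsw x n matrix with orthonormal columns (QR of a complex Gaussian matrix)
M = rng.normal(size=(n * s * w, n)) + 1j * rng.normal(size=(n * s * w, n))
kappa, _ = np.linalg.qr(M)
assert np.allclose(kappa.conj().T @ kappa, np.eye(n))

# Unstack into the s*w Kraus operators; kappa^dagger kappa = I is
# exactly sum_k K_k^dagger K_k = I
kraus = [kappa[k*n:(k+1)*n, :] for k in range(s * w)]
assert np.allclose(sum(K.conj().T @ K for K in kraus), np.eye(n))
```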
Algorithm 3: Iterative Learning Algorithm for Hidden Quantum Markov Models

Input: An M × ℓ matrix Y, where M is the number of data points and ℓ is the length of the stochastic sequences to be modeled.
Output: A set of ws Kraus operators {K̂_{w,s}} of dimension n × n that maximize the log-likelihood of the data, where n is the dimension of the hidden state, s is the number of outputs, and w is the number of operators per output.

Initialization: Randomly generate a set of ws Kraus operators {K̂_{w,s}} of dimension n × n, and stack them vertically to obtain a matrix κ of dimension nsw × n. Let b be the batch size, B the total number of batches to process, and Y_b a b × ℓ matrix of randomly chosen data samples. Let num_iterations be the number of iterations spent modifying κ to maximize the likelihood of observing Y_b.

for batch = 1 : B do
    Randomly select b sequences to process, and construct the matrix Y_b
    for it = 1 : num_iterations do
        Randomly select rows i and j of κ to modify, i < j
        Find w⃗ = (φ, ψ, δ, θ) that maximizes the log-likelihood of Y_b under the following update, and update both rows simultaneously:
            κ_i ← e^{iφ/2} e^{iψ} cos(θ) κ_i + e^{iφ/2} e^{iδ} sin(θ) κ_j
            κ_j ← −e^{iφ/2} e^{−iδ} sin(θ) κ_i + e^{iφ/2} e^{−iψ} cos(θ) κ_j
    end for
end for

Let κ be our guess and κ* be the true matrix of stacked Kraus operators that maximizes the likelihood of the observed data. Then there must exist some unitary operator Û that maps κ to κ*, i.e., κ* = Ûκ. Our goal is now to find the matrix Û. To do this, we use the fact that the matrix Û can be written as a product of simpler matrices H(i, j, θ, φ, ψ, δ) (see appendix for proof), where H(i, j, θ, φ, ψ, δ) is the identity matrix except for the four entries in rows and columns i and j:

    H_{ii} = e^{iφ/2} e^{iψ} cos θ       H_{ij} = e^{iφ/2} e^{iδ} sin θ
    H_{ji} = −e^{iφ/2} e^{−iδ} sin θ     H_{jj} = e^{iφ/2} e^{−iψ} cos θ    (12)

Here i and j specify the two rows (and columns) of the matrix with the non-trivial entries, and the other parameters θ, φ, ψ, δ are angles that parameterize those entries.
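The rotation H(i, j, θ, φ, ψ, δ) of Equation 12 can be sketched as follows (our illustration); it is unitary for any parameter choice, so applying it to a stacked-Kraus matrix κ preserves the constraint κ†κ = I:

```python
import numpy as np

def givens_H(dim, i, j, theta, phi, psi, delta):
    """The matrix H(i, j, theta, phi, psi, delta) of Eq. (12): identity
    except for the four entries in rows/columns i and j."""
    M = np.eye(dim, dtype=complex)
    g = np.exp(1j * phi / 2)
    M[i, i] = g * np.exp(1j * psi) * np.cos(theta)
    M[i, j] = g * np.exp(1j * delta) * np.sin(theta)
    M[j, i] = -g * np.exp(-1j * delta) * np.sin(theta)
    M[j, j] = g * np.exp(-1j * psi) * np.cos(theta)
    return M

U = givens_H(4, 1, 3, theta=0.7, phi=0.3, psi=0.2, delta=0.1)
assert np.allclose(U.conj().T @ U, np.eye(4))   # H is unitary for any parameters

# Applying H to kappa preserves kappa^dagger kappa = I
rng = np.random.default_rng(1)
kappa, _ = np.linalg.qr(rng.normal(size=(4, 2)) + 1j * rng.normal(size=(4, 2)))
kappa2 = U @ kappa
assert np.allclose(kappa2.conj().T @ kappa2, np.eye(2))
```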
The H matrices can be thought of as Givens rotations generalized for complex-valued unitary matrices. Applying such a matrix H(i, j, θ, φ, ψ, δ) to κ has the effect of combining rows i and j (i < j) of κ like so:

    κ_i ← e^{iφ/2} e^{iψ} cos(θ) κ_i + e^{iφ/2} e^{iδ} sin(θ) κ_j
    κ_j ← −e^{iφ/2} e^{−iδ} sin(θ) κ_i + e^{iφ/2} e^{−iψ} cos(θ) κ_j    (13)

Now the problem becomes one of identifying the sequence of H matrices that can take κ to κ*. Since the optimization is non-convex and the H matrices need not commute, we are not guaranteed to find the global maximum. Instead, we look for a local maximum κ* that is reachable by multiplying only H matrices that increase the log-likelihood. To find this sequence, we iteratively find the parameters (i, j, θ, φ, ψ, δ) that, if used in Equation 13, would increase the log-likelihood. To perform this optimization, we use the fmincon function in MATLAB, which uses interior-point optimization. It can also be computationally expensive to find the best rows i, j to combine at a given step, so in our implementation we randomly pick the rows (i, j). See Algorithm 3 for a summary. We believe more efficient implementations are possible, but we leave this to future work.

In this section, we evaluate the performance of our learning algorithm on simple synthetic datasets, and compare it to the performance of Expectation Maximization for HMMs (Rabiner [1989]). We judge the quality of the learned model using its Description Accuracy (DA) (M. Zhao [2007]), defined as:

    DA = f ( 1 + log_s P(Y | D) / ℓ )    (14)

where ℓ is the length of the sequence, s is the number of output symbols in the sequence, Y is the data, and D is the model. Finally, the function f(·) is a non-linear function that takes its argument from (−∞, 1] to (−1, 1]:

    f(x) = x                                       if x ≥ 0
    f(x) = (1 − e^{−0.25x}) / (1 + e^{−0.25x})     if x < 0    (15)

If DA = 1, the model perfectly predicted the stochastic sequence, while DA > 0 means that the model predicted the sequence better than random.

In each experiment, we generate 20 training sequences of length 3000 and 10 validation sequences of length 3000, with a 'burn-in' of 1000 steps to disregard the influence of the starting distribution. We use QETLAB (a MATLAB toolbox developed by Johnston [2016]) to generate random HQMMs. We apply our learning algorithm once to learn HQMMs from data and report the DA. We use the Baum-Welch algorithm implemented in the hmmtrain function from MATLAB's Statistics and Machine Learning Toolbox to learn HMM parameters. When training HMMs, we train 10 models and report the best DA.

We found that starting with a batch size of 1 with 5-6 iterations to get close to the local maximum, and then increasing the batch size to 3-4 with a smaller num_iterations, was a good way to reach convergence. We also find that training models with w > 1 becomes very slow; when w = 1, to compute the log-likelihood we can simply take the product of all the Kraus operators corresponding to the observed sequence and apply it on either side of the density matrix, but with w > 1 we have to perform a sum over the w Kraus operators corresponding to a given observation before we can apply the next set of Kraus operators.

The first experiment compares learned models on data generated by a valid 'probability clock' NOOM/HQMM model (M. Zhao [2007]) that theoretically cannot be modeled by a finite-dimensional HMM. The second experiment considers data generated by the 2-state, 4-output HQMM proposed in Monras et al. [2010], which requires at least 3 hidden states to be modeled with an HMM. The third experiment is performed on data generated by a physically motivated, fully quantum 2-state, 6-output HQMM requiring at least 4 classical states for HMMs to model, which can be seen as an extension of the Monras et al. [2010] model.
Finally, we compare the performance of our algorithm with EM for HMMs on data that was generated by a hand-written HMM. These experiments are meant to showcase the greater expressiveness of HQMMs compared with HMMs. While we see mixed performance on HMM-generated data, we are able to empirically demonstrate that on the HQMM-generated datasets, our algorithm is able to learn an HQMM that can better predict the generated data than EM for classical HMMs with fewer hidden states.
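The DA metric defined above can be computed in a few lines; the sketch below is our illustration of Equations 14 and 15 (using natural logs, with log_s x = ln x / ln s):

```python
import numpy as np

def f(x):
    """Squashing function mapping (-inf, 1] to (-1, 1] (Eq. 15)."""
    return x if x >= 0 else (1 - np.exp(-0.25 * x)) / (1 + np.exp(-0.25 * x))

def description_accuracy(log_prob, s, ell):
    """DA = f(1 + log_s P(Y|D) / ell): 1 = perfect prediction, > 0 = better than random."""
    return f(1 + log_prob / (ell * np.log(s)))

# A model that assigns a random-guess probability (1/s)^ell to the sequence scores DA = 0
assert np.isclose(description_accuracy(np.log((1 / 4) ** 10), s=4, ell=10), 0.0)
# A worse-than-random model scores a negative DA, bounded below by -1
assert -1 < description_accuracy(np.log((1 / 8) ** 10), s=4, ell=10) < 0
```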
Zhao and Jaeger [2010] describe a 2-hidden-state, 2-observable NOOM 'probability clock', where the probability of generating an observable a changes periodically with the length of the sequence of a's preceding it, and which cannot be modeled with a finite-dimensional HMM:

K̂_{1,1} = [[0.6 cos(0.6), -0.6 sin(0.6)], [0.6 sin(0.6), 0.6 cos(0.6)]],   K̂_{2,1} = [[0.8, 0], [0, 0.8]]   (16)

This is a valid HQMM since Σ_{y=1}^{2} K̂_{y,1}† K̂_{y,1} = I. Observe that this HQMM has only 1 Kraus operator per observable, which means it models the state evolution as unitary. Our results in Table 2 demonstrate that a probability clock generates data that is hard for HMMs to model, and that our iterative algorithm yields a simple HQMM that matches the predictive power of the original model.

Table 2: Performance of various HQMMs and HMMs learned from data generated by the probability clock model. HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable. (T) indicates the true model, (L) indicates learned models. P is the number of parameters. Both the mean and STD of the DA are indicated for training and test data.

Model                P     Train DA    Test DA
(2, 2, 1)-HQMM (T)   8
(2, 2, 1)-HQMM (L)   8
(2, 2)-HMM (L)       8
(4, 2)-HMM (L)       24
(8, 2)-HMM (L)       80

Monras et al. [2010] 2-state HQMM

Monras et al. [2010] present a 4-state, 4-output HMM, with a loose lower bound requirement of 3 classical latent states, that can be modeled by the following 2-state, 4-output HQMM:

K̂_{1,1} = [[1/√2, 0], [0, 0]]   K̂_{2,1} = [[0, 0], [0, 1/√2]]   (17)

K̂_{3,1} = [[1/(2√2), 1/(2√2)], [1/(2√2), 1/(2√2)]]   K̂_{4,1} = [[1/(2√2), -1/(2√2)], [-1/(2√2), 1/(2√2)]]   (18)

This model also treats state evolution as unitary since there is only 1 Kraus operator per observable. We generate data using this model, and our results in Table 3 show that our algorithm is capable of learning an HQMM that can match the DA of the original model, while the HMM needs more states to match the DA.

Table 3: Performance of various HQMMs and HMMs on data generated by the Monras et al. [2010] model. HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable.

Model                P     Train DA    Test DA
(2, 4, 1)-HQMM (T)   16
(2, 4, 1)-HQMM (L)   16
(2, 4, 2)-HQMM (L)   32
(2, 4)-HMM (L)       12
(3, 4)-HMM (L)       21
(4, 4)-HMM (L)       32

A fully quantum HQMM

In the previous two experiments, the HQMMs we used to generate data were also valid NOOMs, since they used only real-valued entries. Here, we present the results of our algorithm on a fully quantum HQMM. Since we use complex-valued entries, there is no known way of writing our model as an equivalent-sized HMM, NOOM, or OOM.

We motivate this model with a physical system. Consider electron spin: quantized angular momentum that can be either 'up' or 'down' along whichever spatial axis the measurement is made, but not in between. There is no well-defined 3D vector describing electron spin along the 3 spatial dimensions, only 'up' or 'down' along a chosen axis of measurement (i.e., measurement basis). This is unlike classical angular momentum, which can be represented by a vector with well-defined components in three spatial dimensions. Picking an arbitrary direction as the z-axis, we can write the electron's spin state in the {+z, -z} basis, so that [1 0]^T is |+z⟩ and [0 1]^T is |-z⟩.
But electron spin constitutes a two-state quantum system, so it can be in superpositions of the orthogonal 'up' and 'down' quantum states, which can be parameterized with (θ, φ) and written as |ψ⟩ = cos(θ/2)|+z⟩ + e^{iφ} sin(θ/2)|-z⟩, where 0 ≤ θ ≤ π and 0 ≤ φ < 2π. The Bloch sphere (a sphere with radius 1) is a useful tool to visualize qubits, since it can map any two-state system to a point on the surface of the sphere using (θ, φ) as polar and azimuthal angles. We could also have chosen the {+x, -x} or {+y, -y} basis, whose states can be written in our original basis:

|+x⟩ = (1/√2)|+z⟩ + (1/√2)|-z⟩   (θ = π/2, φ = 0)      (19)
|-x⟩ = (1/√2)|+z⟩ - (1/√2)|-z⟩   (θ = π/2, φ = π)      (20)
|+y⟩ = (1/√2)|+z⟩ + (i/√2)|-z⟩   (θ = π/2, φ = π/2)    (21)
|-y⟩ = (1/√2)|+z⟩ - (i/√2)|-z⟩   (θ = π/2, φ = 3π/2)   (22)

Now consider the following process, inspired by the Stern-Gerlach experiment (Gerlach and Stern [1922]) from quantum mechanics. We begin with an electron whose spin we represent in the {+z, -z} basis. At each time step, we pick one of the x, y, or z directions uniformly at random, and apply an inhomogeneous magnetic field along that axis. This is an act of measurement that collapses the electron spin to either 'up' or 'down' along that axis, which will deflect the electron in that direction. Let us use the following encoding scheme for the results of the measurement: 1: +z, 2: -z, 3: +x, 4: -x, 5: +y, 6: -y. Consequently, at each time step, the observation tells us which axis we measured along, and whether the spin of the particle is now 'up' or 'down' along that axis.
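As an illustration, the following sketch simulates this measurement process with projectors written in the {+z, -z} basis. The function and variable names are ours; the observation encoding 1-6 is the one defined above.

```python
import numpy as np

def _proj(v):
    """Projector onto the (normalized) state vector v."""
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

# 'Up'/'down' projectors along each axis, in the {+z, -z} basis,
# keyed by the observation encoding 1:+z, 2:-z, 3:+x, 4:-x, 5:+y, 6:-y.
PROJECTORS = {
    1: _proj(np.array([1, 0], dtype=complex)),
    2: _proj(np.array([0, 1], dtype=complex)),
    3: _proj(np.array([1, 1], dtype=complex)),
    4: _proj(np.array([1, -1], dtype=complex)),
    5: _proj(np.array([1, 1j], dtype=complex)),
    6: _proj(np.array([1, -1j], dtype=complex)),
}

def measure_sequence(steps, rng):
    """At each step pick the x, y, or z axis uniformly at random, collapse
    the spin along that axis, and record which of the six outcomes occurred."""
    rho = PROJECTORS[1].copy()  # start with spin 'up' along z
    observations = []
    for _ in range(steps):
        up, down = [(1, 2), (3, 4), (5, 6)][rng.integers(3)]
        p_up = np.trace(PROJECTORS[up] @ rho).real
        outcome = up if rng.random() < p_up else down
        P = PROJECTORS[outcome]
        rho = P @ rho @ P / np.trace(P @ rho @ P).real  # collapse and renormalize
        observations.append(outcome)
    return observations
```

A sequence generated this way never contains the transitions 1 ↔ 2, 3 ↔ 4, or 5 ↔ 6: measuring the same axis twice in a row always repeats the previous outcome, since the state has already collapsed along that axis.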
As an example, if we prepare an electron spin 'up' along the z-axis and observe the sequence 1, 3, 2, 6, it means that we applied the inhomogeneous magnetic field in the z-direction, then the x-direction, then the z-direction, and finally the y-direction, causing the electron spin state to evolve as +z, +x, -z, -y.

Note that the transitions 1 ↔ 2, 3 ↔ 4, and 5 ↔ 6 are not allowed, since there are no spin-flip operations in our process. Admittedly, this is a slightly contrived example, since normally we think of a hidden state that evolves according to some rules, producing noisy observations. Here, we select the axis of observation (i.e., the pair (1, 2), (3, 4), or (5, 6)) that we wish to observe, and the outcome tells us how the 'hidden state' evolves in the chosen basis.

This model is related to the 2-state HQMM requiring 3 classical states described in Monras et al. [2010]. It is still a 2-state system, but we add two new Kraus operators with complex entries and renormalize:

K̂_{1,1} = [[1/√3, 0], [0, 0]]   K̂_{2,1} = [[0, 0], [0, 1/√3]]   (23)

K̂_{3,1} = [[1/(2√3), 1/(2√3)], [1/(2√3), 1/(2√3)]]   K̂_{4,1} = [[1/(2√3), -1/(2√3)], [-1/(2√3), 1/(2√3)]]   (24)

K̂_{5,1} = [[1/(2√3), -i/(2√3)], [i/(2√3), 1/(2√3)]]   K̂_{6,1} = [[1/(2√3), i/(2√3)], [-i/(2√3), 1/(2√3)]]   (25)

Physically, Kraus operators K̂_{1,1} and K̂_{2,1} keep the spin along the z-axis, Kraus operators K̂_{3,1} and K̂_{4,1} rotate the spin to lie along the x-axis, while Kraus operators K̂_{5,1} and K̂_{6,1} rotate the spin to lie along the y-axis. Following the approach of Monras et al. [2010], we write down an equivalent 6-state HMM and compute the rank of a Hankel matrix with the statistics of this process, yielding a requirement of 4 classical states as a weak lower bound.

We present the results of our learning algorithm applied to data generated by this model in Table 4. We find that our algorithm can learn a 2-state HQMM (the same size as the model that generated the data) with predictive power matched only by a 6-state HMM.

Table 4: Performance of various HQMMs and HMMs on the fully quantum HQMM. HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable.

Model                P     Train DA    Test DA
(2, 6, 1)-HQMM (T)   24
(2, 6, 1)-HQMM (L)   24
(2, 6)-HMM (L)       16
(3, 6)-HMM (L)       27
(4, 6)-HMM (L)       40
(5, 6)-HMM (L)       55
(6, 6)-HMM (L)       72

HMM-generated data

We have shown that we can generate data using HQMMs that classical HMMs with the same number of hidden states struggle to model. In this section, we explore how well HQMMs can model data generated by a classical HMM. In general, randomly generated HMMs produce data that is hard to predict (i.e., DA close to 0), so we hand-author an arbitrary, well-behaved HMM with a full-rank transition matrix A and a full-rank emission matrix C to compare HQMM learning with EM for HMMs:

A = [a hand-chosen full-rank 6 × 6 transition matrix],   C = [a hand-chosen full-rank 6 × 6 emission matrix]   (26)

Our results are presented in Table 5. We find that small HQMMs outperform HMMs with the same number of hidden states, although the parameter count ends up being larger. However, as model size increases, training becomes quite slow, and our HQMMs are over-parameterized and prone to local optima, so EM for HMMs may work better in practice on HMM-generated data. Interestingly, even though our scheme in Section 3.1 requires w = n to simulate HMMs with HQMMs, we find empirically that we are able to learn reasonable models with w < n.

Conclusion

We formulated and parameterized hidden quantum Markov models by first finding quantum circuits to implement HMMs, reducing them to their Kraus operator representation, and then relaxing some constraints. We showed how quantum analogues of classical conditioning and marginalization can be implemented, and indeed these methods are general enough to allow us to construct quantum versions of any probabilistic graphical model. We also proposed an iterative maximum-likelihood algorithm to learn the Kraus operators for HQMMs. We demonstrated that our algorithm can successfully learn HQMMs that were shown (theoretically) to better model certain sequences in the literature. While our HQMMs cannot model data any better than a sufficiently large HMM, we find that HQMMs can often model the same data better with fewer hidden states. Future work could look at optimizing our algorithm to scale to larger datasets, at the performance of HQMMs in areas like natural language processing or finance, and at quantum versions of existing graphical models. We speculate that quantum models could lead to improvements in areas where 'quantum' effects may better capture the underlying dynamic processes.

Table 5: Performance of various HQMMs and HMMs on synthetic data generated by an HMM.
HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable.

Model                P     Train DA    Test DA
(6, 6)-HMM (T)       72
(2, 6, 1)-HQMM (L)   24
(3, 6, 1)-HQMM (L)   54
(4, 6, 1)-HQMM (L)   96
(5, 6, 1)-HQMM (L)   150
(5, 6, 2)-HQMM (L)   300
(5, 6, 3)-HQMM (L)   450
(5, 6, 5)-HQMM (L)   750
(6, 6, 1)-HQMM (L)   216
(6, 6, 2)-HQMM (L)   432
(2, 6)-HMM (L)       16
(3, 6)-HMM (L)       27
(4, 6)-HMM (L)       40
(5, 6)-HMM (L)       55
(6, 6)-HMM (L)       72

Acknowledgements
We would like to thank Theresa W. Lynn at Harvey Mudd College for her inputs and feedback on this work.
References
Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. arXiv preprint arXiv:1611.09347, 2016.

Lewis A Clark, Wei Huang, Thomas M Barlow, and Almut Beige. Hidden quantum Markov models and open quantum systems with instantaneous feedback. In ISCS 2014: Interdisciplinary Symposium on Complex Systems, pages 143-151. Springer, 2015.

Walther Gerlach and Otto Stern. Der experimentelle Nachweis der Richtungsquantelung im Magnetfeld. Zeitschrift für Physik, 9(1):349-352, 1922.

Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398, 2000.

Nathaniel Johnston. QETLAB: A MATLAB toolbox for quantum entanglement, version 0.9. http://qetlab.com, January 2016.

M. Zhao and H. Jaeger. Norm observable operator models. Technical report, Jacobs University, 2007.

Alex Monras, Almut Beige, and Karoline Wiesner. Hidden quantum Markov models and non-adaptive read-out of many-body states. arXiv preprint arXiv:1002.2337, 2010.

Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. An introduction to quantum machine learning. Contemporary Physics, 56(2):172-185, 2015a.

Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Simulating a perceptron on a quantum computer. Physics Letters A, 379(7):660-663, 2015b.

Nathan Wiebe, Ashish Kapoor, and Krysta Svore. Quantum perceptron models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3999-4007. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6401-quantum-perceptron-models.pdf.

Ming-Jie Zhao and Herbert Jaeger. Norm-observable operator models. Neural Computation, 22(7):1927-1959, 2010.

APPENDIX
A.1 Tensor Product and Partial Trace as Matrix Operations
Here we go into more depth on how we construct the matrices W, V_y, and V_w that perform the tensor product and partial trace operations used in Algorithm 2.

A.1.1 Tensor Product
We construct a matrix W that performs the tensor product with an s × s density matrix ρ̂_B that is all zeros except ρ̂_{1,1} = 1, i.e.,

ρ̂_B = [[1, 0, ⋯, 0], [0, 0, ⋯, 0], ⋯, [0, 0, ⋯, 0]]_{s × s}.

Observe that for an n × n density matrix ρ̂_A, the tensor product yields an ns × ns matrix ρ̂_AB = ρ̂_A ⊗ ρ̂_B. Thus, our matrix W will be an ns × n matrix such that ρ̂_A ⊗ ρ̂_B = W ρ̂_A W†.

To construct W, take n matrices of zeros, each of size s × n; in the i-th of these n matrices, place a 1 in the first row and i-th column. Then stack these matrices vertically to obtain the ns × n matrix W.

Example. If we have a 2 × 2 density matrix that we wish to tensor with a 2 × 2 density matrix ρ̂_B as above, we construct W such that:

W = [[1, 0], [0, 0], [0, 1], [0, 0]]_{4 × 2}   (27)

Then, we find that:

ρ̂_A ⊗ ρ̂_B = W ρ̂_A W†   (28)

A.1.2 Partial Trace
The partial trace cannot ordinarily be implemented with a single matrix operation. However, if a projection operator has just been applied, the operation becomes trivial and can be performed with a matrix multiplication, i.e., tr_B(P̂_y ρ̂_AB P̂_y†) = V_y P̂_y ρ̂_AB P̂_y† V_y†. On the other hand, if we wish to take the partial trace without applying a projection operator, i.e., without a measurement of one of the two subsystems, we must take a sum over such matrices: tr_A(ρ̂_AB) = Σ_w V_w ρ̂_AB V_w†. The subscript of 'tr' tells us which particle we are tracing over.

Partial Trace after Projection. Here, we assume that a projection operator P̂_y, corresponding to an observation of the second particle in the same basis, was applied to the joint state of the system prior to the partial trace. If this is not the case, we simply construct the matrices V_y for each observation and take a sum, as previously described.

The construction of the matrix V_y is straightforward. We take n matrices of zeros, each of size n × s; in the i-th of these n matrices, place a 1 in the i-th row and y-th column. Then, concatenate these matrices horizontally to obtain the n × ns matrix V_y.

Example. If we have a 6 × 6 density matrix ρ̂_AB describing the joint state of a 3-state particle and a 2-state particle, we can construct V_1, which traces over the second particle after the projection operator P̂_1 has been applied, as:

V_1 = [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0]]_{3 × 6}   (29)

Then, we find that if we have applied the projection operator:

tr_B(P̂_1 ρ̂_AB P̂_1†) = V_1 P̂_1 ρ̂_AB P̂_1† V_1†   (30)

Partial Trace without Projection. Here, we assume that no measurement/projection has been made, since this is how the operation is used in our algorithm. If a projection operator was applied, forgo the sum and simply apply the V_w corresponding to the measurement.

To perform the partial trace when there has been no observation, we construct a set of matrices V_w, which we apply and then sum over. The construction of each matrix V_w is as follows. We take n matrices, each of size s × s, all zeros except the w-th of these n matrices, which is an identity matrix. Then concatenate these matrices horizontally to obtain the s × ns matrix V_w.

Example. For the same 6 × 6 density matrix ρ̂_AB describing the joint state of a 3-state particle and a 2-state particle, we can construct the V_w that trace over the first particle as:

V_1 = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]_{2 × 6}
V_2 = [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]_{2 × 6}
V_3 = [[0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]_{2 × 6}   (31)

Then, we find that:

ρ̂_B = tr_A(ρ̂_AB) = Σ_{w=1}^{3} V_w ρ̂_AB V_w†   (32)

B.2 Factorizing Unitary Matrices into H Matrices
The proof of this theorem is a generalization of the proof found in M. Zhao [2007].
Lemma 1.
For any vector x⃗ ∈ C^n where n ≥ 2, there exists a matrix A that is a product of H matrices, such that A x⃗ = ‖x⃗‖ e⃗_1, where e⃗_1 = [1 0 ⋯ 0]^T (the first standard unit vector in R^n). Proof.
Consider an arbitrary vector x⃗ ∈ C^n, written as x⃗ = [x_1 x_2 ⋯ x_n]^T. Let us define y_1 = √(‖x_1‖² + ‖x_2‖²) and parameterize the entries x_1 and x_2 of x⃗ with α_1, β_1, and β_2 so as to write:

x_1 = y_1 e^{iβ_1} cos(α_1),   x_2 = y_1 e^{iβ_2} sin(α_1)   (33)

Now consider the action of H_1 = H(1, 2, α_1, 0, -β_1, -β_2) on x⃗:

H_1 x⃗ = [[e^{-iβ_1} cos α_1,  e^{-iβ_2} sin α_1,  0, ⋯, 0],
         [-e^{iβ_2} sin α_1,  e^{iβ_1} cos α_1,  0, ⋯, 0],
         [0, 0, 1, ⋯, 0],
         ⋯,
         [0, 0, 0, ⋯, 1]] [y_1 e^{iβ_1} cos(α_1), y_1 e^{iβ_2} sin(α_1), x_3, ⋯, x_n]^T
      = [y_1, 0, x_3, ⋯, x_n]^T   (34)

Next, we can define y_2 = √(‖y_1‖² + ‖x_3‖²) and parameterize y_1 and x_3 using α_2, β_3, and β_4, just as we did previously. We can then apply H_2 = H(1, 3, α_2, 0, -β_3, -β_4), and we find that:

H_2 H_1 x⃗ = [y_2, 0, 0, x_4, ⋯, x_n]^T   (35)

Following this pattern, we can construct a sequence of H matrices such that H_{n-1} ⋯ H_2 H_1 x⃗ = [y_{n-1}, 0, ⋯, 0]^T. Observe that y_{n-1} = √(‖y_{n-2}‖² + ‖x_n‖²) = √(‖y_{n-3}‖² + ‖x_{n-1}‖² + ‖x_n‖²) = ⋯ = √(‖x_1‖² + ⋯ + ‖x_n‖²) = ‖x⃗‖. Thus, with A = H_{n-1} ⋯ H_2 H_1, we have shown that there exists a matrix A that is a product of H matrices, such that A x⃗ = ‖x⃗‖ e⃗_1. Lemma 2.
Any 2 × 2 unitary matrix A can be written as H(1, 2, θ, φ, ψ, δ). Proof.
A general 2 × 2 unitary matrix can be written as:

[[e^{iφ/2} e^{iψ} cos θ,   e^{iφ/2} e^{iδ} sin θ],
 [-e^{iφ/2} e^{-iδ} sin θ,  e^{iφ/2} e^{-iψ} cos θ]]   (36)

which is exactly H(1, 2, θ, φ, ψ, δ). Theorem 3.
A matrix Û is unitary if and only if it can be written as a product of H(i, j, θ, φ, ψ, δ) matrices of the following form, where i, j denote the rows and columns containing the special entries:

H(i, j, θ, φ, ψ, δ) = [[1, ⋯, 0, ⋯, 0, ⋯, 0],
                       [⋮, ⋱, ⋮,    ⋮,    ⋮],
                       [0, ⋯, e^{iφ/2} e^{iψ} cos θ, ⋯, e^{iφ/2} e^{iδ} sin θ, ⋯, 0],
                       [⋮,    ⋮, ⋱, ⋮,    ⋮],
                       [0, ⋯, -e^{iφ/2} e^{-iδ} sin θ, ⋯, e^{iφ/2} e^{-iψ} cos θ, ⋯, 0],
                       [⋮,    ⋮,    ⋮, ⋱, ⋮],
                       [0, ⋯, 0, ⋯, 0, ⋯, 1]]   (37)

Proof.
We will prove both the forward and reverse directions:

1. A matrix Û is unitary if it can be written as a product of H matrices. Observe that the matrix H is unitary, since H† H = I. A product of unitary matrices is itself unitary; hence a matrix Û that is a product of these H matrices is unitary.

2. If a matrix Û is unitary, it can be written as a product of H matrices. We give a proof by induction, showing that any n × n unitary matrix Û can be written as a product of H matrices.

Base Case
When n = 2, i.e., for a 2 × 2 unitary matrix, we know that it can be written as a product of H matrices from Lemma 2. Inductive Hypothesis
Assume that the claim holds for n = k, i.e., any k × k unitary matrix can be written as a product of H matrices. With n = k + 1, consider an arbitrary (k + 1) × (k + 1) unitary matrix Û = [u⃗_1 u⃗_2 ⋯ u⃗_{k+1}], where u⃗_i is the i-th column. Since Û is unitary, ‖u⃗_i‖ = 1 for 1 ≤ i ≤ k + 1. Then, by Lemma 1, we have a matrix A that is a product of H matrices such that A u⃗_1 = ‖u⃗_1‖ e⃗_1 = e⃗_1. Using this matrix, we find that

Û′ = A Û = [[1, C⃗], [0⃗, V]]

where 0⃗ is a k × 1 column vector of zeros, C⃗ is a 1 × k row vector, and V is a k × k matrix. But Û′ is unitary, so Û′ (Û′)† = I, which means V V† = I_{k×k} and C⃗ = 0⃗^T. Inductive Step
From the inductive hypothesis, we know that V can be written as a product of H matrices, so let us write V = H_m ⋯ H_1. Next, we take each of these k × k H matrices and pad them to obtain (k + 1) × (k + 1) matrices H′_i = [[1, 0⃗^T], [0⃗, H_i]]. Then, we see that H′_m ⋯ H′_1 = [[1, 0⃗^T], [0⃗, V]] = Û′. Finally, we can write our arbitrary unitary matrix as Û = A^{-1} H′_m ⋯ H′_1, which is indeed a product of H matrices. Hence, we have shown that any unitary matrix can be written as a product of H matrices.
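The constructive proofs above translate directly into code. The sketch below is our own illustrative implementation: it absorbs the phase conventions into a generic two-level rotation rather than the explicit (θ, φ, ψ, δ) parameterization of H, zeroes out the below-diagonal entries of a unitary column by column exactly as in the induction, and returns the two-level factors whose ordered product reconstructs Û.

```python
import numpy as np

def two_level_factors(U, tol=1e-12):
    """Factor a unitary U into two-level unitaries, following the induction
    in Theorem 3: reduce the first column to e_1 with two-level rotations
    (as in Lemma 1), then proceed to the trailing block. Returns a list of
    matrices whose ordered product equals U; all but the last differ from
    the identity in at most four entries, and the last is diagonal."""
    M = np.array(U, dtype=complex)
    n = M.shape[0]
    factors = []
    for c in range(n - 1):
        for r in range(c + 1, n):
            if abs(M[r, c]) < tol:
                continue
            rad = np.hypot(abs(M[c, c]), abs(M[r, c]))
            a, b = M[c, c] / rad, M[r, c] / rad
            H = np.eye(n, dtype=complex)   # two-level rotation on rows c, r
            H[c, c], H[c, r] = a.conjugate(), b.conjugate()
            H[r, c], H[r, r] = -b, a
            M = H @ M                      # zeroes M[r, c], makes M[c, c] real
            factors.append(H.conj().T)     # so that U = H_1^† ... H_k^† M
    factors.append(M)                      # leftover diagonal phase matrix
    return factors
```

Multiplying the returned factors in order reproduces U, and applying the same inner loop to a vector instead of a matrix is exactly the reduction used in Lemma 1.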