Efficient Bayesian phase estimation using mixed priors
Ewout van den Berg
IBM Quantum, IBM T.J. Watson Research Center
Yorktown Heights, NY, USA
July 24, 2020
Abstract
We describe an efficient implementation of Bayesian quantum phase estimation in the presence of noise and multiple eigenstates. The main contribution of this work is the dynamic switching between different representations of the phase distributions, namely truncated Fourier series and normal distributions. The Fourier-series representation has the advantage of being exact in many cases, but suffers from increasing complexity with each update of the prior. This necessitates truncation of the series, which eventually causes the distribution to become unstable. We derive bounds on the error in representing normal distributions with a truncated Fourier series, and use these to decide when to switch to the normal-distribution representation. This representation is much simpler, and was proposed in conjunction with rejection filtering for approximate Bayesian updates. We show that, in many cases, the update can be done exactly using analytic expressions, thereby greatly reducing the time complexity of the updates. Finally, when dealing with a superposition of several eigenstates, we need to estimate the relative weights. This can be formulated as a convex optimization problem, which we solve using a gradient-projection algorithm. By updating the weights at exponentially scaled iterations we greatly reduce the computational complexity without affecting the overall accuracy.
1 Introduction

Phase estimation is an important building block in quantum computing, with applications ranging from ground-state determination in quantum chemistry, to prime factorization and quantum sampling [2, 13, 15]. In the ideal setting we assume that the quantum system can be initialized to an eigenstate $|\phi\rangle$ of a known unitary $U$, such that $U|\phi\rangle = e^{i\phi}|\phi\rangle$. The goal of quantum phase estimation (QPE) is then to estimate the phase $\phi$. Some of the existing approaches include quantum Fourier based phase estimation [1, 6], iterative phase estimation [10], and other methods [14].

In practice, there are several factors that complicate the problem. First, it may not be possible to initialize the state exactly to a single eigenstate. This could be because the state is perturbed by noise, or simply because the eigenstate is unknown. The latter case arises, for instance, in the ground-state determination of molecules in quantum chemistry, where the desired eigenstate can only be approximated. Regardless of the cause, phase estimation may need to deal with an initial state that is a superposition of eigenstates:

$$|\Psi\rangle = \sum_j \alpha_j |\phi_j\rangle, \quad \text{with} \quad \sum_j |\alpha_j|^2 = 1. \tag{1}$$

Second, practical phase-estimation algorithms may also need to deal with different sources of noise present in current and near-term quantum devices. Bayesian phase estimation [14] has been shown to be particularly well suited for dealing with noise [16] and the presence of multiple eigenstates [12].

In this paper we describe an efficient implementation of Bayesian phase estimation. In Section 2 we describe the Bayesian approach to quantum phase estimation, review existing work, and point out some of the shortcomings in existing approaches. In Section 3 we describe the proposed algorithm in detail, followed by numerical experiments in Section 4, and conclusions in Section 5.
Figure 1: The multi-round quantum circuit used in the experiments. (Circuit diagram: in each round an ancilla undergoes a Hadamard and an $R_z(\beta)$ rotation, controls an application of $U^{k}$ on $|\Psi\rangle$, undergoes a second Hadamard, and is measured to give outcome $m$.)
2 Bayesian phase estimation

We consider Bayesian phase estimation with measurements obtained using the quantum circuit depicted in Figure 1, as proposed in [12]. The circuit consists of a series of rounds, each parameterized by an integer exponent $k_r$ and phase shift $\beta_r$. The different rounds in the circuit make up a single experiment and result in a binary measurement vector $m$. Suppose we are given a quantum circuit of $R$ rounds parameterized by vectors $\beta \in [0, 2\pi)^R$ and $k \in \mathbb{N}^R$; then we can denote by $P_{k,\beta}(m \mid \phi, w)$ the probability of observing measurements $m$ for given phases $\phi$ and weight vector $w$ with entries $|\alpha_j|^2$. We can define a prior distribution $P(\phi, w)$ over the phases $\phi \in \Phi^R = [0, 2\pi)^R$ and weights $w \in \Delta$, where $\Delta$ denotes the unit simplex given by the set $\{w \in \mathbb{R}^R \mid w \ge 0,\ \|w\|_1 = 1\}$. The posterior distribution then follows from Bayes' theorem, and is given by

$$P_{k,\beta}(\phi, w \mid m) = \frac{P_{k,\beta}(m \mid \phi, w) \cdot P(\phi, w)}{P_{k,\beta}(m)}, \quad \text{where} \quad P_{k,\beta}(m) = \int_{\Phi^R} \int_{\Delta} P_{k,\beta}(m \mid \phi, w) \cdot P(\phi, w)\, dw\, d\phi. \tag{2}$$

Note that throughout the text, the probability density functions are distinguished by parameter names to keep the notation uncluttered. By updating the prior after each set of measurements, we can express the posterior distribution resulting from the $\ell$-th experiment, with parameters $k = k^{(\ell)}$ and $\beta = \beta^{(\ell)}$, and measurements $m = m^{(\ell)}$, as

$$P^{(\ell+1)}(\phi, w) = \frac{P_{k,\beta}(m \mid \phi, w) \cdot P^{(\ell)}(\phi, w)}{P^{(\ell)}_{k,\beta}(m)}, \tag{3}$$

which implicitly depends on the parameters and measurements from previous experiments. In the special case where $|\Psi\rangle = |\varphi\rangle$ is a single eigenstate with eigenphase $\varphi$, the probability of observing a certain measurement vector $m$ is given by

$$P_{k,\beta}(m \mid \varphi) = \prod_{r=1}^{R} \cos^2\left(\frac{k_r \varphi + \beta_r - m_r \pi}{2}\right) = \prod_{r=1}^{R} \frac{1 + \cos(k_r \varphi + \beta_r - m_r \pi)}{2}. \tag{4}$$
In the general case, where $|\Psi\rangle = \sum_j \alpha_j |\phi_j\rangle$ is a superposition of eigenstates, this changes to

$$P_{k,\beta}(m \mid \phi, w) = \sum_j w_j\, P_{k,\beta}(m \mid \varphi = \phi_j).$$

It is reasonable to assume, as done in [12], that the joint prior $P(\phi, w)$ can be written in the form

$$P^{(0)}(\phi, w) = P^{(0)}(w) \prod_j P^{(0)}_j(\phi_j).$$

Updates to the prior for each of the phases $\phi_j$ can then be obtained from (3) by integrating out the weights and all phases except $\phi_j$ (denoted by $\phi_{\setminus j}$), which yields

$$P^{(\ell+1)}_j(\phi_j) = \int_{\Phi^{R-1}} \int_{\Delta} P^{(\ell+1)}(\phi, w)\, dw\, d\phi_{\setminus j} = \frac{P^{(\ell)}_j(\phi_j)}{P^{(\ell)}_{k,\beta}(m)} \left( \sum_{i \ne j} C^{(\ell)}_i W^{(\ell)}_i + P_{k,\beta}(m \mid \phi_j)\, W^{(\ell)}_j \right), \tag{5}$$

where

$$C^{(\ell)}_j = \int_0^{2\pi} P^{(\ell)}_j(\phi_j)\, P_{k,\beta}(m \mid \phi_j)\, d\phi_j, \quad \text{and} \quad W^{(\ell)}_j = \int_{\Delta} w_j\, P^{(\ell)}(w)\, dw. \tag{6}$$

Similarly, integrating over all phases $\phi$ gives the updated weight distribution

$$P^{(\ell+1)}(w) = \int_{\Phi^R} P^{(\ell+1)}(\phi, w)\, d\phi = \frac{P^{(\ell)}(w)}{P^{(\ell)}_{k,\beta}(m)} \sum_j w_j\, C^{(\ell)}_j. \tag{7}$$

Finally, it can be verified that the marginal probability (2) can be expressed in terms of the scalars $C^{(\ell)}_j$ and $W^{(\ell)}_j$ from equation (6) as

$$P^{(\ell)}_{k,\beta}(m) = \sum_j C^{(\ell)}_j W^{(\ell)}_j. \tag{8}$$

In order to implement Bayesian phase estimation we need a suitable representation for the probability density functions $P^{(\ell)}_j(\phi_j)$ for the phases. We look at these next and discuss the treatment of weights in more detail in Section 2.3.

2.1 Normal representation

One of the simplest ways to represent the prior $P(\phi)$ is based on the probability density function of the univariate normal distribution $\mathcal{N}(\mu, \sigma^2)$:

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$
As phases that differ by integer multiples of $2\pi$ are equivalent, we can obtain the desired representation by wrapping the normal distribution around the interval $[0, 2\pi)$ to get

$$f^{\circ}_{\mu,\sigma}(\phi) = f_{\mu,\sigma}(\phi) + \sum_{k=1}^{\infty} \left( f_{\mu,\sigma}(\phi + 2\pi k) + f_{\mu,\sigma}(\phi - 2\pi k) \right). \tag{9}$$

After acquiring the measurements from an experiment, we would like to update the prior distributions to $P^{(\ell+1)}_j(\phi_j)$, as given in (5). However, the resulting candidate distribution will not be in the form of a wrapped normal distribution, and we must therefore find the best fit. This amounts to finding the mean $\mu^{(\ell+1)}$ and standard deviation $\sigma^{(\ell+1)}$ of the candidate distribution and using these to define the updated normal prior. In order to find these parameters, we have to evaluate the expected value of $e^{i\phi}$ over the distribution $Q_j = P^{(\ell+1)}_j$, namely

$$\langle e^{i\phi} \rangle_{Q_j} := \int_0^{2\pi} e^{i\phi}\, Q_j(\phi)\, d\phi. \tag{10}$$

Given this expectation we can obtain the mean and Holevo variance using (see for example [9]):

$$\mu_j = \arg\left( \langle e^{i\phi} \rangle_{Q_j} \right), \quad \text{and} \quad \sigma_j^2 = \frac{1}{\left| \langle e^{i\phi} \rangle_{Q_j} \right|^2} - 1. \tag{11}$$

These values are then used to define the updated prior $P^{(\ell+1)}_j := f^{\circ}_{\mu_j,\sigma_j}$. Priors based on the normal distribution were used in the context of (noisy) single-round experiments with a single eigenstate by [16]. In that work, the integral appearing in (10) is evaluated approximately using rejection sampling, which requires a potentially large number of samples to evaluate accurately. In Section 3.3 we show how this expectation can be computed much more efficiently.

2.2 Fourier representation

A second way to represent distributions is based on the Fourier series

$$P(\phi) = c_0 + \sum_{k=1}^{\infty} c_k \cos(k\phi) + s_k \sin(k\phi),$$

where $c_k$ and $s_k$ are scalar coefficients. The mean and variance of the resulting distribution over $[0, 2\pi)$ are conveniently expressed in terms of the coefficients as $\arg(c_1 + i s_1)$ and $(\pi^2(c_1^2 + s_1^2))^{-1} - 1$, respectively [12].
It is well known that products of sine and cosine terms can be rewritten as sums of individual sine and cosine terms by, possibly repeated, application of the product-sum formulas

$$\begin{aligned}
\cos(\theta)\cos(\varphi) &= \tfrac{1}{2}\left[\cos(\theta - \varphi) + \cos(\theta + \varphi)\right] \\
\sin(\theta)\sin(\varphi) &= \tfrac{1}{2}\left[\cos(\theta - \varphi) - \cos(\theta + \varphi)\right] \\
\sin(\theta)\cos(\varphi) &= \tfrac{1}{2}\left[\sin(\theta + \varphi) + \sin(\theta - \varphi)\right].
\end{aligned} \tag{12}$$

Together with the identities $\cos(-\phi) = \cos(\phi)$ and $\sin(-\phi) = -\sin(\phi)$ it follows that sums and products of Fourier series can also be written as Fourier series. By appropriately adding or subtracting the product-sum formulas in (12) it follows that

$$\begin{aligned}
\cos(\theta + \varphi) &= \cos(\theta)\cos(\varphi) - \sin(\theta)\sin(\varphi) \\
\sin(\theta + \varphi) &= \sin(\theta)\cos(\varphi) + \cos(\theta)\sin(\varphi).
\end{aligned} \tag{13}$$

With these identities we can write the measurement probability (4) in the form of a Fourier series:

$$P_{k,\beta}(m \mid \phi) \overset{(12)}{=} \sum_{k=0}^{K} \left( \alpha^{\cos}_k \cos(k\phi + \theta^{\cos}_k) + \alpha^{\sin}_k \sin(k\phi + \theta^{\sin}_k) \right) \overset{(13)}{=} \sum_{k=0}^{K} \left( \bar{\alpha}^{\cos}_k \cos(k\phi) + \bar{\alpha}^{\sin}_k \sin(k\phi) \right), \tag{14}$$

with maximum index $K = \sum_r k_r$. The uniform prior $P(\phi) = 1/(2\pi)$ can be written as a Fourier series with $c_0 = 1/(2\pi)$ and all other coefficients zero. Since the update rule in (5) only multiplies and adds distributions, the resulting phase distributions $P_j$ can be written exactly as a Fourier series [3, 9, 12]. More generally, it follows that whenever the initial prior can be expressed as a Fourier series, all subsequent phase distributions can be expressed as Fourier series. In order to evaluate the products of distributions in (5) we can use the following convenient update rules:

$$\begin{aligned}
(1 + \cos(k\phi + \gamma))\cos(j\phi) &= \cos(j\phi) + \tfrac{1}{2}\cos(\gamma)\left[\cos((k+j)\phi) + \cos((k-j)\phi)\right] \\
&\quad - \tfrac{1}{2}\sin(\gamma)\left[\sin((k+j)\phi) + \sin((k-j)\phi)\right] \\
(1 + \cos(k\phi + \gamma))\sin(j\phi) &= \sin(j\phi) + \tfrac{1}{2}\cos(\gamma)\left[\sin((k+j)\phi) - \sin((k-j)\phi)\right] \\
&\quad + \tfrac{1}{2}\sin(\gamma)\left[\cos((k+j)\phi) - \cos((k-j)\phi)\right].
\end{aligned} \tag{15}$$

Suppose we are given two Fourier series with a finite number of terms $n_1$ and $n_2$.
Then the sum of the two series will have no more than $\max\{n_1, n_2\}$ terms, whereas the product will have a non-zero term at the maximum index $n_1 + n_2$. When updating the distribution, this means that each round in the experiment increases the maximum index of the Fourier-series representation by $k_r$. The number of coefficients that need to be stored and manipulated therefore keeps growing with each update, leading to a computational complexity that grows at least quadratically in the number of rounds. In order to keep the Bayesian approach practical, we therefore need to limit the number of terms in the representation [12]. However, as the algorithm progresses, the probability distributions tend to become increasingly localized, and ringing effects, such as those shown in Figure 2, start to appear as a result of the truncation. This eventually leads to instability in the distribution, causing the algorithm to fail. Increasing the number of terms in the representation can delay but not eliminate this effect. Moreover, from a computational point of view, we would of course like to keep the number of terms in the representation as small as possible.

Figure 2: Representation of the wrapped normal distribution with $\mu = \pi$ and a small value of $\sigma$, using (a) $n_{\max} = 20$ and (b) $n_{\max} = 50$ coefficients. The dotted and dashed lines respectively indicate the error in the representation and our error bound.

2.3 Weights

Given an initial uniform prior for the weights, we see that equation (7) evolves to a weighted sum of Dirichlet distributions, which are of the form

$$P_{\alpha}(x) = \frac{1}{B(\alpha)} \prod_i x_i^{\alpha_i - 1}, \quad \text{with} \quad B(\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left(\sum_i \alpha_i\right)}.$$

The parameters $\alpha$ for that distribution are initially zero, but change after each update such that each Dirichlet distribution has integer parameters $\alpha_i \ge 0$ with $\|\alpha\|_1 = n$, where $n$ is the update iteration.
Assuming there are $k$ eigenvalues, this gives a total of $\binom{n+k-1}{k-1} = O(n^{k-1})$ terms to represent the distribution exactly, which becomes impractical for even modest values of $n$ and $k$. An alternative approach, taken in [12], is to choose the weights $w$ such that they maximize the log-likelihood of the measurements. Assuming a uniform prior, this leads to a convex optimization problem:

$$w^{(n)} := \operatorname*{argmin}_{x \in \Delta}\ -\frac{1}{n} \sum_{\ell=1}^{n} \log\left( \langle C^{(\ell)}, x \rangle \right), \tag{16}$$

where $C^{(\ell)}$ denotes the vector of coefficients $C^{(\ell)}_j$ from (6) at iteration $\ell$. The normalization by $1/n$ is included to keep the problem scale independent of the number of iterations, but otherwise does not affect the solution. In [12], this subproblem is solved using a sequential least-squares approach for the first thousand iterations, after which a local optimization method is applied.

3 The proposed algorithm

In the previous section we have seen that, given initial uniform priors, the phase distributions have exact Fourier representations up to the point where the number of coefficients exceeds a chosen maximum. Truncation of the series eventually leads to instability as the phase distributions become increasingly localized. In order to prevent the algorithm from breaking down, we propose to monitor the standard deviation of each phase distribution. Once the standard deviation falls below a certain threshold we convert the Fourier-series representation of the distribution to a normal-based representation. From then on we only update the parameters of the normal distribution.

At the beginning of the algorithm we need to choose the priors and initialize the weights. Starting with identical priors and weights for each distribution leads to a symmetry in which the updates to the distributions are identical, which means that we effectively only use a single distribution. To break this symmetry we can initialize the priors such that they are all distinct.
In Section 3.1 we show how a wrapped normal distribution can be approximated by a finite Fourier series. This allows us to initialize the phase distributions with relatively flat normal-like distributions with different mean values. In Section 3.2 we bound the pointwise error in representing a normal distribution by a truncated Fourier series. For a given maximum number of coefficients in the Fourier representation we can then find the standard deviation for which the error bound exceeds a given maximum. This standard deviation is then used as the threshold that triggers conversion from the Fourier to the normal representation of the phase distribution. In Section 3.3 we show that the parameter updates for the normal distribution can be computed analytically. This means that we no longer need the rejection-sampling procedure used in [16] to approximate the parameters. Finally, in Section 3.4 we propose an efficient approach for solving the weight optimization problem (16).
3.1 Fourier representation of the wrapped normal distribution

In order to represent a wrapped normal distribution $f^{\circ}_{\mu,\sigma}$ in Fourier form we need to determine the sine and cosine coefficients. For a term such as $\sin(k\phi)$, this can be done by evaluating the inner product

$$\int_0^{2\pi} f^{\circ}_{\mu,\sigma}(\phi)\sin(k\phi)\, d\phi = \int_{-\infty}^{\infty} f_{\mu,\sigma}(\phi)\sin(k\phi)\, d\phi = \langle f_{\mu,\sigma}, \sin(k\cdot) \rangle, \tag{17}$$

where the first equality follows from the definition of $f^{\circ}_{\mu,\sigma}$ in (9) and the periodicity of the sine function. A similar expression holds for the inner product with cosine terms $\cos(k\phi)$. We can evaluate these expressions using the Fourier transform of the normal distribution [5, 8], which is given by

$$\hat{f}(t) = \int_{-\infty}^{\infty} f_{\mu,\sigma}(\phi)\, e^{it\phi}\, d\phi = e^{i\mu t}\, e^{-(\sigma t)^2/2}.$$

Choosing $t = k$ and taking the real and imaginary parts respectively we obtain:

$$\langle f_{\mu,\sigma}, \sin(k\cdot) \rangle = \sin(\mu k) \cdot e^{-(\sigma k)^2/2}, \qquad \langle f_{\mu,\sigma}, \cos(k\cdot) \rangle = \cos(\mu k) \cdot e^{-(\sigma k)^2/2}. \tag{18}$$

For the final coefficients we need to make sure that the basis functions are orthonormal, which is equivalent to scaling the coefficients by $1/\pi$ for $k \ne 0$ and by $1/(2\pi)$ for $k = 0$, since

$$\int_0^{2\pi} \cos^2(kx)\, dx = \begin{cases} 2\pi & k = 0 \\ \pi & \text{otherwise,} \end{cases} \qquad \int_0^{2\pi} \sin^2(kx)\, dx = \begin{cases} 0 & k = 0 \\ \pi & \text{otherwise.} \end{cases}$$

3.2 Truncation error and representation switching

After each experiment we update the Fourier representation of each of the eigenphases using the rules in (15). Truncation of the coefficients during the update may introduce intermediate errors, and is therefore best done after the update by discarding the excess coefficients. Over the course of successive updates, the distributions tend to become increasingly peaked. As a result of the limited number of coefficients, this causes the distributions to exhibit the ringing effects seen in Figure 2, which get more and more severe until the distributions blow up and become meaningless.
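Looking back at the coefficient formulas (18), the initialization of a truncated Fourier prior from a wrapped normal can be sketched as follows (function and variable names are ours; the evaluation helper is included only to make the sketch checkable against the direct wrapped sum (9)).

```python
import math

def wrapped_normal_coeffs(mu, sigma, n_max):
    # Cosine and sine coefficients c[k], s[k] (k = 0..n_max) of the wrapped
    # normal density on [0, 2*pi), using Eq. (18) with the normalization
    # 1/(2*pi) for k = 0 and 1/pi for k != 0.
    c = [1.0 / (2.0 * math.pi)]
    s = [0.0]
    for k in range(1, n_max + 1):
        damp = math.exp(-0.5 * (sigma * k) ** 2) / math.pi
        c.append(math.cos(mu * k) * damp)
        s.append(math.sin(mu * k) * damp)
    return c, s

def series_eval(c, s, phi):
    # Evaluate the truncated series at phi.
    return c[0] + sum(c[k] * math.cos(k * phi) + s[k] * math.sin(k * phi)
                      for k in range(1, len(c)))
```

For moderate $\sigma$ the coefficients decay as $e^{-(\sigma k)^2/2}$, so a modest $n_{\max}$ already reproduces the density to near machine precision.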
We can avoid this by monitoring the standard deviation of each of the distributions and converting a distribution from Fourier representation to normal form once the standard deviation becomes too small. We determine the critical standard deviation by considering the maximum pointwise error of approximating a normal distribution with a truncated Fourier series. After normalizing the coefficients found in (18), the pointwise error at any point $\phi$ for $n_{\max} \ge 1$ is

$$\mathrm{err}(\phi) = \sum_{k=n_{\max}+1}^{\infty} \frac{\left(\sin(\mu k)\sin(k\phi) + \cos(\mu k)\cos(k\phi)\right) e^{-(\sigma k)^2/2}}{\pi} = \sum_{k=n_{\max}+1}^{\infty} \frac{\cos(k(\mu - \phi))\, e^{-(\sigma k)^2/2}}{\pi}.$$

The error is maximal at $\phi = \mu$, where $\cos(k(\mu - \phi)) = 1$ for all $k$. Defining $\gamma := e^{-\sigma^2/2} < 1$, it follows that for all $\phi$

$$|\mathrm{err}(\phi)| \le \frac{1}{\pi} \sum_{k=n_{\max}+1}^{\infty} e^{-(\sigma k)^2/2} = \frac{1}{\pi} \sum_{k=n_{\max}+1}^{\infty} \gamma^{k^2} \tag{19a}$$

$$\le \frac{1}{\pi} \int_{n_{\max}}^{\infty} \gamma^{x^2}\, dx = \frac{1}{\pi} \left[ \frac{\sqrt{\pi}\,\mathrm{erf}\!\left(\sqrt{-\ln\gamma}\,x\right)}{2\sqrt{-\ln\gamma}} \right]_{n_{\max}}^{\infty} = \frac{1}{\pi} \left[ \frac{\sqrt{\pi}}{\sqrt{2}\,\sigma}\, \mathrm{erf}\!\left(x\sigma/\sqrt{2}\right) \right]_{n_{\max}}^{\infty} = \frac{1}{\sigma\sqrt{2\pi}} \left( 1 - \mathrm{erf}\!\left(n_{\max}\sigma/\sqrt{2}\right) \right) = \frac{1}{\sigma\sqrt{2\pi}}\, \mathrm{erfc}\!\left(n_{\max}\sigma/\sqrt{2}\right). \tag{19b}$$

We illustrate the error bound in Figure 2. Given a fixed limit $n_{\max}$ on the number of coefficients and a maximum permitted error $\varepsilon$, we can define the critical value $\sigma_{\varepsilon}(n_{\max})$ as the minimum $\sigma$ that satisfies

$$\mathrm{erfc}\!\left(n_{\max}\sigma/\sqrt{2}\right) \le \varepsilon\, \sigma\sqrt{2\pi}.$$

A good approximation can be determined efficiently using bisection search. Once the standard deviation of the Fourier representation falls below this critical value, we can no longer guarantee that the representation is accurate, and we therefore switch to a normal representation of the distribution based on the values in (11). Note that the conversion is done independently for the different distributions.
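The threshold computation can be sketched directly: the bound (19b) is monotonically decreasing in $\sigma$, so a simple bisection brackets the critical value. The parameter names and the initial bracketing interval below are our own choices.

```python
import math

def truncation_bound(sigma, n_max):
    # Pointwise error bound (19b) for a truncated wrapped-normal series.
    return math.erfc(n_max * sigma / math.sqrt(2.0)) / (sigma * math.sqrt(2.0 * math.pi))

def critical_sigma(n_max, eps, lo=1e-8, hi=10.0, iters=200):
    # Smallest sigma (up to bisection tolerance) with bound <= eps; below
    # this value the Fourier representation can no longer be trusted.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncation_bound(mid, n_max) <= eps:
            hi = mid
        else:
            lo = mid
    return hi
```

Since the bound diverges as $\sigma \to 0$ and vanishes for large $\sigma$, the bracket is always valid, and larger $n_{\max}$ yields a smaller critical $\sigma$.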
3.3 Analytic parameter updates

In [16], single-round updates to the normal distribution are done using rejection sampling. It turns out that the integrals appearing in the update have closed-form solutions that are easily evaluated. We first consider the computation of the $C^{(\ell)}_j$ coefficients appearing in (6):

$$C^{(\ell)}_j = \int_0^{2\pi} P^{(\ell)}_j(\phi_j)\, P_{k,\beta}(m \mid \phi_j)\, d\phi_j.$$

It follows from (4) that we can write $P_{k,\beta}(m \mid \phi_j)$ as a weighted sum of sine and cosine terms with certain coefficients $c_k$ and $s_k$. Given a wrapped normal prior $P^{(\ell)}_j = f^{\circ}_{\mu,\sigma}$, it then follows that

$$C^{(\ell)}_j = \int_{-\infty}^{\infty} f_{\mu,\sigma}(\phi) \left[ \sum_k c_k \cos(k\phi) + s_k \sin(k\phi) \right] d\phi.$$

Using the integrals in (18), this simplifies to

$$C^{(\ell)}_j = \sum_k \left( c_k \cos(\mu k) + s_k \sin(\mu k) \right) e^{-(\sigma k)^2/2}.$$

Once the $C^{(\ell)}_j$ values have been computed, we can evaluate the probability $P^{(\ell)}_{k,\beta}(m)$ using (8). The posterior distribution used to define the updated prior is then given by equation (5):

$$\tilde{P}^{(\ell+1)}_j(\phi_j) = \frac{P^{(\ell)}_j(\phi_j)}{P^{(\ell)}_{k,\beta}(m)} \left( \sum_{i \ne j} C^{(\ell)}_i W^{(\ell)}_i + P_{k,\beta}(m \mid \phi_j)\, W^{(\ell)}_j \right),$$

where the tilde indicates that this is an intermediate distribution. The expression within brackets, along with the normalization factor $1/P^{(\ell)}_{k,\beta}(m)$, can be expressed in Fourier representation with certain coefficients $c'_k$ and $s'_k$. Together with the normal prior distribution, we thus have

$$\tilde{P}^{(\ell+1)}_j(\phi_j) = f_{\mu,\sigma}(\phi_j) \sum_k \left( c'_k \cos(k\phi_j) + s'_k \sin(k\phi_j) \right).$$
To obtain an updated normal distribution we use (11), which requires the evaluation of $\langle e^{i\phi} \rangle_Q$ with $Q = \tilde{P}^{(\ell+1)}_j$:

$$\langle e^{i\phi} \rangle_Q = \int_{-\infty}^{\infty} f_{\mu,\sigma}(\phi) \sum_k \left( c'_k \cos(k\phi) + s'_k \sin(k\phi) \right) e^{i\phi}\, d\phi.$$

Writing $e^{i\phi} = \cos(\phi) + i\sin(\phi)$ and defining coefficients $c^{(\cos)}_k$, $s^{(\cos)}_k$, $c^{(\sin)}_k$, and $s^{(\sin)}_k$ such that

$$\begin{aligned}
\sum_k c^{(\cos)}_k \cos(k\phi) + s^{(\cos)}_k \sin(k\phi) &= \cos(\phi) \sum_k \left( c'_k \cos(k\phi) + s'_k \sin(k\phi) \right) \\
\sum_k c^{(\sin)}_k \cos(k\phi) + s^{(\sin)}_k \sin(k\phi) &= \sin(\phi) \sum_k \left( c'_k \cos(k\phi) + s'_k \sin(k\phi) \right),
\end{aligned}$$

we have

$$\langle e^{i\phi} \rangle_Q = \left( \sum_k c^{(\cos)}_k \langle f_{\mu,\sigma}, \cos(k\cdot) \rangle + s^{(\cos)}_k \langle f_{\mu,\sigma}, \sin(k\cdot) \rangle \right) + i \left( \sum_k c^{(\sin)}_k \langle f_{\mu,\sigma}, \cos(k\cdot) \rangle + s^{(\sin)}_k \langle f_{\mu,\sigma}, \sin(k\cdot) \rangle \right).$$

This expression is easily evaluated using (18).
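The first of these closed-form evaluations can be sketched as follows: integrating a Fourier series against a normal density reduces, via (18), to a damped sum over the coefficients, so no rejection sampling is needed. The function below and its comparison against numerical quadrature are our own illustration, not code from the paper.

```python
import math

def gaussian_fourier_integral(c, s, mu, sigma):
    # Closed form for int f_{mu,sigma}(phi) * sum_k (c[k] cos(k phi)
    # + s[k] sin(k phi)) dphi, using Eq. (18); this is how the C_j
    # coefficients are evaluated analytically.
    return sum((c[k] * math.cos(mu * k) + s[k] * math.sin(mu * k))
               * math.exp(-0.5 * (sigma * k) ** 2) for k in range(len(c)))
```

A midpoint-rule quadrature over a wide interval around $\mu$ reproduces the same value, confirming the closed form.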
3.4 Weight optimization

Updates of the weights are done by minimizing the negative log-likelihood function (16):

$$w^{(n)} := \operatorname*{argmin}_{x \in \Delta} f_n(x) \quad \text{with} \quad f_n(x) := -\frac{1}{n} \sum_{\ell=1}^{n} \log\left( \langle C^{(\ell)}, x \rangle \right). \tag{20}$$

A single evaluation of the objective function and gradient at iteration $n$ has complexity $O(mn)$, where $m$ is the total number of eigenphases considered. If we were to solve this problem for weight updates in every iteration, we would obtain a complexity that is quadratic in the number of iterations $n$. Although polynomial, this nevertheless imposes a practical limit on the number of iterations the algorithm can take. Adding a single vector $C^{(\ell)}$ to the summation after a large number of iterations does little to change the objective function or gradient. It therefore makes sense to solve for the weights only when a fixed percentage of new vectors has been added since the previous solve. In other words, we can use an exponentially spaced grid of iterations at which to compute the weights. Assuming a fixed maximum number of iterations to solve each problem, we then obtain a complexity of $O(mn)$ for all weight updates combined, which is linear in the total number of iterations. An alternative approach, taken in [12], is to solve (20) using sequential least-squares programming for the first hundred experiments, and switch to optimization of a quadratic model of the objective function after that. Given the constant size of the model, this means that the time complexity also remains linear in $n$.

For the minimization of (20) we use a gradient-projection algorithm [4]. This iterative method requires an operator for Euclidean projection onto the unit simplex:

$$\mathcal{P}(x) := \operatorname*{argmin}_{v \in \Delta} \|v - x\|_2.$$

Application of the projection operator can be done using a simple $O(m \log m)$ algorithm, which works well for small $m$. More advanced $O(m)$ algorithms exist [7], but may have a large constant factor.
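The $O(m \log m)$ projection mentioned here is the standard sort-and-threshold method from the literature on simplex projections; the sketch below is that well-known algorithm, not code from the paper.

```python
def project_simplex(x):
    # Euclidean projection of x onto {v : v >= 0, sum(v) = 1}. Sort in
    # decreasing order, keep the last running threshold that leaves the
    # corresponding sorted entry positive, then shift and clip.
    u = sorted(x, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0.0:
            theta = t  # last threshold with a positive entry wins
    return [max(xi - theta, 0.0) for xi in x]
```

Correctness of the threshold follows from the KKT conditions of the projection problem: entries below the threshold are clipped to zero, and the rest are shifted by a common amount so the result sums to one.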
We implemented the gradient-projection algorithm with curvilinear line search, in which updates are of the form

$$x^{(i+1)} = x^{(i)} + d^{(i)}(\alpha), \quad \text{with} \quad d^{(i)}(\alpha) = \mathcal{P}\left( x^{(i)} - \alpha \nabla f_n(x^{(i)}) \right) - x^{(i)}.$$

The line-search parameter $\alpha$ is chosen to satisfy the Armijo condition [11]

$$f_n(x^{(i)} + d^{(i)}(\alpha)) \le f_n(x^{(i)}) + \gamma \langle d^{(i)}(\alpha), \nabla f_n(x^{(i)}) \rangle, \tag{21}$$

for some small value of $\gamma$. The step parameter is initialized to $\alpha = 1$, and halved as often as needed to satisfy (21). The main stopping criterion is that $\|d^{(i)}(1)\|$ is sufficiently small, as this indicates that either the norm of the gradient is small, or the negative gradient lies close to the normal cone of the simplicial constraint. In addition we impose a maximum number of iterations, as well as a maximum on the number of halving steps in the line-search procedure.

3.5 Noise modeling

One of the advantages of Bayesian phase estimation is that common types of noise can easily be included in the model. One such type is depolarizing noise, which results in an ideal noise-free measurement with probability $p$, and a uniformly random measurement with probability $1 - p$. For a single-round experiment, the probability $p$ can be written as $e^{-k/k_{\mathrm{err}}}$, where $k_{\mathrm{err}}$ is a system-dependent parameter [12]. This can be incorporated in the conditional measurement distribution by defining

$$\bar{P}_{k,\beta}(m \mid \phi_j) = p \cdot P_{k,\beta}(m \mid \phi_j) + \frac{1-p}{2}, \tag{22}$$

and computing $C_j$ in (6) using the updated measurement probability. For experiments with $R$ rounds such that each round uses a different ancilla qubit and all measurements are taken at the very end, we can generalize this by replacing $k$ with the sum of all $k_i$ values in $p$, and replacing the last term in (22) by $(1-p)2^{-R}$.

Another example of noise that can be included in the probability model is measurement noise.
Here, each of the measurement outcomes $m_i$ is flipped with some fixed probability $p$ that can be estimated experimentally. For multi-round experiments this means that the current state differs from the one expected based on the measurements so far. When the number of rounds is limited we can evaluate the measurement probability as

$$\bar{P}_{k,\beta}(m \mid \phi_j) = \sum_{m' \in \{0,1\}^R} P(m \mid m')\, P_{k,\beta}(m' \mid \phi_j), \tag{23}$$

where $m'$ represents the measurements if there were no noise, and $P(m \mid m') = (1-p)^{R - d(m,m')}\, p^{d(m,m')}$ is the probability of observing the noisy measurement $m$ with Hamming distance $d(m, m')$ to $m'$. Both (22) and (23) can be written as finite Fourier series and therefore easily fit into the proposed framework.

4 Numerical experiments

In this section we compare the performance of three different approaches: the normal approach, in which all priors take the form of a normal distribution; the Fourier approach, in which the priors are represented as a truncated Fourier series; and the proposed mixed approach, in which each of the priors is initially represented as a truncated Fourier series and converted to normal-distribution form once the standard deviation falls below a given threshold. For all experiments we choose phase-shift values $\beta$ uniformly at random on the interval $[0, 2\pi)$, and initialize the prior distributions as equally spaced normal distributions with a standard deviation of 3, either directly, or approximately using the techniques described in Section 3.1. By choosing different mean values we break the model symmetry, which would otherwise lead to identical updates to all distributions. All experiments were run on a 2.6 GHz Intel Core i7 processor with 16 GB of memory.

4.1 Fourier representation

As a first experiment we consider the application of the Fourier approach on a noiseless test problem in which two eigenstates with eigenphases $\phi_1 = 2$ and $\phi_2 = 4$ are superimposed with weights $w_1 = 0.7$ and $w_2 = 0.3$.
We model the phase distributions using a truncated Fourier series and consider truncation after 200, 1000, and 5000 terms, respectively. For the experiments we use a fixed three-round setup with $k = [1, \ldots]$ and run several trials of several thousand iterations each. As the distributions evolve they may converge to eigenphases corresponding to any of the actual eigenstates. In order to evaluate the accuracy of the results, we therefore need to associate each distribution with one of the known eigenphases. There are several ways in which this matching can be done. The simple approach we found to work very well, and which we therefore use throughout, consists of matching the distribution to the eigenphase that is closest to its mean at a given reference iteration. We typically perform the matching based on the distribution at the final iteration, but intermediate distributions are sometimes used to better illustrate the performance. Applying this procedure to our current setup with matching done at iteration 100 gives median phase errors as shown in Figure 3(a). For each of the three truncation limits we see that the median phase error initially converges steadily before suddenly diverging rapidly. For the Fourier representation with 200 terms this divergence happens after some 2,500 iterations. Using representations with a larger number of terms helps postpone the point
at which divergence sets in, but does not prevent it. Matching at different iterations beyond 1,000 confirms that the divergence is inherent in the representation and not caused by the distributions suddenly converging to different eigenphases. In Figure 3(b) we plot the median standard deviation as a function of the iteration, along with the critical $\sigma$ values for the chosen $\epsilon$, as calculated using the procedure described in Section 3.2. For plotting purposes we set the standard deviation to 20 when the weight of the distribution is zero.

Figure 3: Plots of (a) the median phase error for a two-eigenphase problem using Fourier representations with three different values for the maximum number of coefficients; (b) the corresponding median standard deviation along with the critical $\sigma$ values; (c) the Fourier truncation error as a function of $\sigma$ for several maximum coefficient counts; (d) the median phase error for three phases using the normal-based approach and the mixed approach, as well as the corresponding (e) median standard deviation in the phases, and (f) median absolute error in the weights. The curve bundles in plot (f) illustrate the use of different schemes to update the weights of the distributions. For each approach the different update schemes generally perform similarly, and we therefore omit explicit labels.
For the case where the maximum number of coefficients is 5,000, we see that the breakdown occurs very quickly after the critical standard deviation is reached. To get a better understanding of the sensitivity of this critical value with respect to $\epsilon$, we plot the Fourier truncation error (19b) as a function of $\sigma$ in Figure 3(c). The dashed line in the plot gives a more accurate bound on the error, obtained by explicitly summing the terms in (19a) up to 100,000 and using the complementary error-function (erfc) bound beyond that. The fact that the dashed line is barely visible shows that the bound in (19b) is reasonably tight, certainly for our purposes. In addition, it can be seen that the truncation error exhibits a sharp increase over a narrow range of $\sigma$ values. This means that the critical value is not very sensitive to the choice of $\epsilon$; slightly larger or smaller values of $\epsilon$ will give very similar critical values. On the other hand, the sharp increase also means that any further decrease in the standard deviation beyond this point results in rapidly increasing errors. The critical $\sigma$ values can be seen to decrease with the number of coefficients in the representation, and it may therefore appear that choosing a large number of coefficients solves the problem. However, the computational complexity of updating the representation increases with the number of terms. For instance, for the three maximum values used in our examples we recorded average runtimes of 68, 323, and 1552 seconds per trial. Further increasing the number of terms would become computationally expensive, and we therefore conclude that a pure Fourier-based implementation is not practical in general.

4.2 Normal and mixed representations

We now move on to another experiment in which we compare the normal and mixed approaches. For the mixed approach we use a Fourier representation with 200 coefficients and a critical standard deviation corresponding to the chosen value of $\epsilon$.
We maintain the non-adaptive measurement scheme described above, but change to a new problem with three phases φ = [2, …, 5] and weight values w = [0.…, 0.…, 0.…]. Since the number of eigenstates in |Ψ⟩ is generally not known, we maintain five distributions for the phases. For the performance evaluation we match each distribution to one of the three ground-truth values. The weight values contributing to each distribution are added up, and the phases are averaged using the relative weights. The resulting median phase errors for the collated distributions are plotted in Figure 3(d). The normal-based approach has a slower start than the mixed approach, but eventually catches up after around 10^… iterations. The standard deviation of the distributions and the convergence of the weights in terms of the median absolute distance are plotted in Figures 3(e–f). A closer look at the data indicates that the mixed approach successfully converged to the correct weights in all twenty trials. The normal-based approach does so in the majority of trials, but has some notable deviations, which lead to a deteriorated mean absolute error. The average runtime of the mixed approach was 70.43 seconds, which is larger than the average of 15.65 seconds for the normal-based approach. The difference in runtime is due to the updates of the Fourier representation in the initial iterations, which are more expensive. In most of the trials of the mixed approach we found exactly three distributions with non-negligible weights. For these distributions, the transition from Fourier representation to normal distribution occurred on average at iterations 282, 533, and 1,111, in order of decreasing weight of the distribution.

In Section 3.4 we proposed a way of reducing the number of weight updates to ensure that the computational complexity scales linearly instead of quadratically in the number of iterations. To validate the approach we evaluate the performance of the mixed approach with five different weight-update settings. For each of these we update the weights at every iteration for the first T iterations.
After that, the weights are updated only at iterations that are power-of-two multiples of T, namely 2T, 4T, 8T, and so on. In the first setting we update the weights at every iteration for the first T = 10^… iterations. This setting therefore closely resembles the naive approach, with a complexity that scales quadratically in the number of iterations. In the next three settings we respectively choose T = 2, T = 64, and T = 512. In the fifth setting we again choose T = 512, but modify the line search as described in Section 3.4. Rather than projecting x + αd for each line-search parameter α, we project once with α = 1 and perform a backtracking line search using only that point. We evaluate these settings on the problem described in Section 4.2 and plot the resulting mean absolute weight errors in Figure 3(f). The different shades of the colors used for the normal and mixed approaches correspond to the different update schemes. The only setting that failed to converge was the normal approach combined with update parameter T = 2. Aside from this, all update schemes performed comparably for each of the approaches. The only real difference, not visible in the plot, lies in the runtime: in the mixed approach, the first setting, in which the weights are updated at every iteration for the first 10^… iterations, takes 1,040 seconds to complete. Settings two through four take 472, 587, and 610 seconds, and the final setting takes 604 seconds. For the normal approach these runtimes are 366, 117, 122, 121, and 126 seconds, respectively. The experiments in Section 4.2 were done using T = 512, and we will continue to use this value for the remainder of the paper.

As an alternative to the three-round setup with fixed exponents k = [1, …,
5] used so far, we now consider the effect of using single-round experiments in which the single k value is either chosen according to some fixed scheme or chosen adaptively based on the estimated phase distributions. Using a fixed k ≠ 1 was found not to work, and we therefore follow [12] and consider choosing k cyclically over the values {1, 2, …, c_max} for different values of c_max. The resulting median phase error over twenty problem instances with a single eigenphase is shown in Figure 4(a). The performance of the normal and mixed approaches is quite similar for c_max = 1, which coincides with keeping k = 1 fixed. (In practice we do not know which, if any, of the distributions should be merged. One possible approach is to compute the likelihood of the separate versus the combined distributions and find the grouping that maximizes the likelihood; we did not pursue any such approach in this work.) For c_max values equal to 2, 3, or 5, the normal-based approach consistently fails to converge to the desired phase. The mixed approach, by contrast, shows rapid convergence during the initial iterations and steady convergence in subsequent iterations, even after switching to the normal distribution. The fact that the performance remains good must be due to the initialization of the normal distribution with the partially converged Fourier representation. To verify this, we study the convergence of the normal approach when initialized with different normal distributions. The standard deviation of all distributions is set to the critical value σ_ε(200) ≈ …, and the mean is offset from the ideal phase by varying amounts; Figure 4(b) shows the resulting probability that the normal-based approach fails to converge. As c_max increases, or the critical σ value decreases, this transition becomes sharper and starts at smaller bias values. For the transition from a Fourier representation to a normal distribution to succeed it is therefore important that sufficient convergence has been achieved at the time of the switch.
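A natural way to initialize the normal distribution at the switch is to read the circular mean and standard deviation off the first Fourier coefficient of the (by then nearly Gaussian) prior. The sketch below assumes the prior is stored as trigonometric moments c_k = E[e^{ikφ}]; the exact conventions of Section 3 may differ, and the function name is ours:

```python
import cmath
import math

def fourier_to_normal(c1):
    """Initialize a normal distribution from the first trigonometric
    moment c1 = E[exp(i*phi)] of a nearly Gaussian circular prior.
    For a wrapped normal distribution, c1 = exp(i*mu - sigma^2 / 2)."""
    mu = cmath.phase(c1)                          # circular mean
    sigma = math.sqrt(-2.0 * math.log(abs(c1)))   # circular standard deviation
    return mu, sigma

# Round trip on a wrapped normal with known parameters:
mu0, sigma0 = 1.3, 0.02
c1 = cmath.exp(1j * mu0 - sigma0**2 / 2)
mu, sigma = fourier_to_normal(c1)   # recovers (1.3, 0.02)
```

Because only the first coefficient is needed, this conversion is cheap regardless of how many Fourier terms the truncated representation carries.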
If needed, this can be achieved by adjusting the scheme or by increasing the maximum number of Fourier coefficients. Next, we apply the same schemes to the three-eigenstate problem described in Section 4.2. When combined with the normal-based approach, none of these schemes showed successful convergence to the desired phases. Moreover, for c_max = 1 both the normal and mixed approaches failed to converge. For other small values up to c_max = 5, as seen in Figure 4(c), the mixed approach initially converged to the desired phases, but quickly diverged after switching to the normal distribution. The plot also illustrates that we can improve convergence by increasing c_max or by increasing the number of terms used in the Fourier representation. Results obtained when using five distributions to approximate the desired three are similar to those shown and are therefore omitted from the plot.

It is also possible to use an adaptive scheme in which the value of k is determined based on the current standard deviation. The choice k = ⌈1.25/σ⌉ was proposed in [16] for single-eigenstate experiments, and combined with cyclic schemes for multiple-eigenstate experiments in [12]. To evaluate the use of adaptive schemes in both the normal-based and mixed approaches, we consider the quantity

    k̄ := min{ ⌈ Σ_j 1.25 w_j / σ_j ⌉, k_max }.

Based on this quantity we derive two schemes: a purely adaptive scheme in which we simply set k = k̄, and an adaptive cyclic scheme in which we choose k cyclically with an adaptive maximum value c_max = k̄. In Figure 4(d) we show the performance of these schemes, along with fixed (k = 1) and purely cyclic schemes, on a single-eigenstate problem. The adaptive and cyclic schemes with c_max = 50 have similar convergence and all outperform the scheme with fixed k. Choosing c_max = 4,
096 gives even faster convergence, especially so for the purely adaptive scheme. The median phase error of the adaptive and purely cyclic schemes steadily decreases until it comes to a halt around iteration 7,….

In Section 3.5 we showed how different types of noise can be included in the probability models. Here we study the performance of Bayesian phase estimation under decoherence and read-out errors. Decoherence in single-round experiments has the effect of producing a completely random measurement with probability 1 - e^{-k/k_err}, where k is the number of controlled applications of the unitary U, and k_err is a system-dependent parameter, which we set to 100 in our experiments. It can be seen that the larger the value of k, the larger the probability of obtaining a completely random measurement. Indeed, even for k = 70 this already occurs with a probability slightly exceeding 0.
5, which clearly shows that we can no longer choose arbitrarily large values for k. We therefore run adaptive and cyclic experiments on a single eigenstate with c_max and k_max conservatively bounded by 50. The resulting phase estimates, along with those obtained for a fixed k = 1, are plotted in Figure 5(a).

Figure 4: Plot (a) shows the convergence of the normal and mixed approaches in single-round experiments with cyclic k in the range 1 through c_max. The three dots indicate the average iteration at which the mixed approach switches to the normal distribution. Plot (b) gives the probability of failure of the normal-based approach to converge as a function of the deviation from the ideal phase for different initial standard deviations. Plot (d) shows the convergence of adaptive schemes in single-stage experiments with a single eigenphase using the normal (lighter shades) and mixed (darker shades) approaches; plot (e) shows the convergence of cyclic and adaptive cyclic schemes when applied to a problem with three eigenphases. The median c_max values resulting from the adaptive cyclic schemes from plots (d) and (e) are shown in plot (f). The curve with the lighter shade of blue corresponds to the setting where five distributions are used instead of three.

The cyclic and adaptive schemes have faster initial convergence than the fixed scheme, and likewise for the mixed versus the normal approach. The only configuration that failed was the normal approach combined with a purely cyclic scheme. When applied to the problem with three eigenphases, we found that the purely adaptive approach fails, even with k_max reduced to 10 or 20. The results obtained using the mixed approach and the purely cyclic scheme are shown in Figure 5(b). The results for the adaptive cyclic scheme are similar, aside from slightly faster initial convergence and somewhat slower convergence later on. For the cyclic scheme we observe that the convergence of the different phases begins at later iterations when c_max is larger. In the case of c_max = 100 we see an abrupt degradation of the performance around iteration 10^…. As discussed in Section 4.1, this happens when the deviation from the correct phase is still too large at the time of switching to the normal distribution. In this case we can avoid the degradation by increasing the number of terms n_max in the Fourier representation to 1,000, as illustrated in Figure 5(b).

For our experiments with read-out errors, we apply Bayesian phase estimation in combination with the quantum Fourier transform (QFT). The QFT circuit, illustrated in Figure 5(c) for three measurement qubits, has long been used for non-iterative estimation of a single eigenphase [1, 6]. Here we show that the same circuit can be used to recover multiple eigenphases even when the measurements are noisy.
The circuit in Figure 5(c) is similar to the one given in Figure 1, although there are some important differences. The first difference is that the exponents k are fixed to successive powers of 2, starting at 1. The second difference is that the phase-shift values β are no longer predetermined, but instead appear as conditional rotations. In our example we use quantum-controlled gates, which means that we do not know the effective value of the rotation until after measuring the qubit, provided there are no read-out errors. An alternative implementation, in which the rotations are classically controlled after taking the measurements, is also possible, but we do not consider it here.

Figure 5: Convergence of the different Bayesian approaches with decoherence on problems with (a) a single eigenphase, and (b) three eigenphases. Plot (c) illustrates the circuit for the quantum Fourier transform with measurements on three qubits. The performance of Bayesian phase estimation subject to read-out errors is shown for (d) problems with a single eigenphase, and (e) problems with three eigenphases.

Unlike our earlier experimental setup, we are now in a situation where the effective β values depend on the quantum state of the system. In order to deal with read-out errors we must therefore distinguish between the actual state of the system and the observed measurements. As described in Section 3.5, we determine the measurement probability by first computing the probability of ending up in each of the 2^n possible states, and multiplying this by the probability of obtaining the observed measurement from the given states. Although the computational complexity of this approach grows exponentially in the number of qubits, it remains tractable when the number of qubits is small. This is certainly the case for our experiments, where we use a QFT circuit with measurements on five qubits (note that the illustration in Figure 5(c) uses only three). The results obtained in this way on a single-eigenphase problem with varying levels of read-out error are shown in Figure 5(d). The performance of the normal approach gradually deteriorates as the noise level is increased, and the method only just manages to converge when the read-out error reaches 26%. Beyond that point the method does not converge to the desired value within the given number of iterations. The mixed approach also converges more slowly with increasing read-out errors, but performs well up to a 48% error rate. As seen from Figure 5(e), the normal and mixed approaches both converge for the problem with three eigenphases with read-out error rates up to 8%.
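The measurement-probability computation described above, weighting each of the 2^n possible true states by the probability of the observed read-out, can be sketched as follows. We assume independent, symmetric read-out flips with probability q per qubit; the function and variable names are ours:

```python
from itertools import product

def measurement_probability(state_probs, observed, q):
    """Probability of the observed bit string given a distribution over
    the 2^n true states and a per-qubit read-out flip rate q.

    state_probs: dict mapping bit tuples to probabilities (sums to 1).
    observed:    measured bit tuple.
    q:           probability that a single bit is read out flipped.
    """
    n = len(observed)
    total = 0.0
    for state, p in state_probs.items():
        flips = sum(s != m for s, m in zip(state, observed))
        total += p * (q ** flips) * ((1 - q) ** (n - flips))
    return total

# Two qubits, system certainly in state (0, 1), 10% read-out error:
probs = {s: 0.0 for s in product((0, 1), repeat=2)}
probs[(0, 1)] = 1.0
p = measurement_probability(probs, (0, 1), 0.10)   # 0.9 * 0.9 = 0.81
```

The loop over all states is what makes the cost grow exponentially in the number of qubits, which is tolerable at the five-qubit scale used in the experiments.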
When the noise reaches 10% or above, the normal-based approach fails to find the desired phases. For the mixed approach this happens at noise levels of around 30%.

5 Conclusions

In this work we have analyzed the performance of Bayesian phase estimation for different representations of the prior distributions. As a first representation we use a normal distribution, which was used earlier in [16]. We show that updates to the distributions can be evaluated analytically in noiseless settings, as well as in settings with several types of commonly encountered noise. The normal-based approach is fast, but performs poorly when multiple eigenphases are present, certainly in experiments with only a single measurement round. In these cases the distributions tend to collapse into a single distribution, which is then impossible to untangle, since the updates will be essentially identical. As a second representation, we consider the truncated Fourier-series representation used in [12]. For noiseless problems, as well as for problems with many different types of noise, the Fourier series is ideal in that it captures the desired distributions exactly. However, the number of coefficients required to maintain an exact representation keeps growing with each additional round of measurement. For the method to remain computationally attractive, it is therefore necessary to truncate the Fourier series. We show that this truncation eventually causes the distribution to become unstable, thereby limiting the accuracy that can be achieved using a fixed number of terms. To combine the advantages of both approaches, we propose a mixed approach in which the distributions are initially represented as truncated Fourier series. When successive updates cause the standard deviation of a distribution to fall below a suitably chosen threshold, we change the representation to a normal distribution.
We show that the proposed mixed approach performs well with both static and adaptively chosen values of k, and that the performance remains stable when decoherence or read-out errors are present. Finally, we show how the Bayesian approach can be combined with the quantum Fourier transform, which is traditionally used for phase estimation. Measurement errors can be dealt with successfully in this setting, but the current method requires the evaluation of the probability distributions for all possible states. Reducing this complexity is an interesting topic for future work.

Acknowledgments
The author would like to thank Antonio Córcoles and Maika Takita for useful discussions.
References

[1] Daniel S. Abrams and Seth Lloyd. Quantum algorithm providing exponential speed increase for finding eigenvalues and eigenvectors. Physical Review Letters, 83(24):5162–5165, 1999.
[2] Alán Aspuru-Guzik, Anthony D. Dutoi, Peter J. Love, and Martin Head-Gordon. Simulated quantum computation of molecular energies. Science, 309(5741):1704–1707, 2005.
[3] Dominic W. Berry, Brendon L. Higgins, Stephen D. Bartlett, Morgan W. Mitchell, Geoff J. Pryde, and Howard M. Wiseman. How to perform the most accurate possible phase measurements. Physical Review A, 80(5):052114, 2009.
[4] Dimitri P. Bertsekas. Nonlinear Programming. Athena, 2nd edition, 1999.
[5] Wlodzimierz Bryc. The Normal Distribution: Characterizations with Applications. Springer-Verlag, 1995.
[6] Richard Cleve, Arthur Ekert, Chiara Macchiavello, and Michele Mosca. Quantum algorithms revisited. Proceedings of the Royal Society A, 454(1969):339–354, 1998.
[7] Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, Series A, 158(1–2):575–585, 2016.
[8] Loukas Grafakos. Classical Fourier Analysis. Springer, 3rd edition, 2008.
[9] Alexander J. F. Hayes and Dominic W. Berry. Swarm optimization for adaptive phase measurement with low visibility. Physical Review A, 89(1):013838, 2014.
[10] Alexei Yu. Kitaev. Quantum measurements and the Abelian stabilizer problem. arXiv preprint quant-ph/9511026, 1995. (See also Electronic Colloquium on Computational Complexity, TR96-003, 1996.)
[11] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition, 2006.
[12] Thomas E. O'Brien, Brian Tarasinski, and Barbara M. Terhal. Quantum phase estimation of multiple eigenvalues for small-scale (noisy) experiments. New Journal of Physics, 21:023022, 2019.
[13] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing, 26(5):1484–1509, 1997.
[14] Krysta M. Svore, Matthew B. Hastings, and Michael Freedman. Faster phase estimation. Quantum Information & Computation, 14(3–4):306–328, March 2014.
[15] Kristan Temme, Tobias J. Osborne, Karl G. Vollbrecht, David Poulin, and Frank Verstraete. Quantum Metropolis sampling. Nature, 471:87–90, 2011.
[16] Nathan Wiebe and Chris Granade. Efficient Bayesian phase estimation. Physical Review Letters, 117(1):010503, 2016.