Quantum Tomographic Reconstruction with Error Bars: a Kalman Filter Approach
Koenraad M.R. Audenaert ∗ Dept. of Mathematics, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
Stefan Scheel† Quantum Optics and Laser Science, Blackett Laboratory, Imperial College London, Prince Consort Road, London SW7 2AZ, UK
(Dated: October 25, 2018)

We present a novel quantum tomographic reconstruction method based on Bayesian inference via the Kalman filter update equations. The method not only yields the maximum likelihood/optimal Bayesian reconstruction, but also a covariance matrix expressing the measurement uncertainties in a complete way. From this covariance matrix the error bars on any derived quantity can be easily calculated. This is a first step towards the broader goal of devising an omnibus reconstruction method that could be adapted to any tomographic setup with little effort and that treats measurement uncertainties in a statistically well-founded way. In this first part we restrict ourselves to the important subclass of tomography based on measurements with discrete outcomes (as opposed to continuous ones), and we also ignore any measurement imperfections (dark counts, less than unit detector efficiency, etc.), which will be treated in a follow-up paper. We illustrate our general theory on real tomography experiments of quantum optical information processing elements.
PACS numbers: 03.65.Wj,06.20.Dk,42.50.-p
I. INTRODUCTION
Since the first proposal of optical quantum tomography by Vogel and Risken [1], and the first practical tomographic experiments by Smithey et al. [2], quantum tomography has come a long way. It is now being used in a variety of physical setups, not restricted to optical systems, and many improvements have been made to the original reconstruction methods [3, 4]. While this is certainly a desirable evolution, on the negative side it has resulted in a proliferation of tomography methods. While it is unavoidable that every physical system has its own peculiarities, and each particular setup calls for its own tomographic measurement method, it does not follow that for every type of system and for every tomography method there should also be a different tomographic reconstruction method. Furthermore, when the reconstruction software is each time written from scratch, reliability suffers. After 20 years of tomographic experience it is not unreasonable to ask for a unification of these methods, taking the best out of each and devising a small set of “omnibus” reconstruction methods that need only some small adaptations to the particular tomographic setup.

An even more important remark concerning reconstruction methods is the fact that error bars are seldom given. Measurements are worthless without error bars. If tomography is just a measurement, albeit a complicated one, why then treat tomography differently? Error bars are dearly needed here as well, since the whole purpose of tomography is to come up with a description of the quantum state that is sufficient to derive further properties, and for these properties error bars would certainly be needed.
∗Electronic address: [email protected]
†Electronic address: [email protected]

As Řeháček, Mogilevtsev and Hradil [5] stated: “The quantification of relevant statistical errors is an indispensable but often neglected part of any tomographic scheme used for quantum diagnostic purposes.” Some theoretical papers mention error bars, but they are calculated from simulated data using Monte-Carlo methods and are only meant to give an indication of how well the method performs. As measurement errors depend on the actual system and its state, this is clearly not enough. What we are after is error bars produced straight from the experimental data and the underlying statistical model.

A widely used error criterion is the state fidelity (for quantum states) or the process fidelity (for quantum processes), which compares the reconstructed state to a predefined “desired state” (e.g. [6, 7, 8]). While it is easy to use, it clearly gives no indication about the reconstruction alone but is the sum of reconstruction and construction errors. As such, it cannot answer the following two questions separately: “Are we seeing the correct state?”, and “Are we seeing the state correctly?” Furthermore, as pointed out in [5], fidelity is just a single number and one cannot expect it to describe the complete error structure of the reconstruction.

One could argue, however, that error bars are not explicitly needed if one just subsumes all measurement noise into the estimated quantum state via statistical mixing. We disagree with this point of view, and we claim that this throws away useful information. There is a difference between, on one hand, preparing a pure state and assuming it is mixed because the measurements do not allow one to conclude otherwise, and, on the other hand, not being able to prepare a pure state, obtaining a mixed state instead, and knowing that state perfectly.
In both cases the outcome is the same, a mixed state, but in the former case the real state is actually pure.

To make sense of the concept of error bars in the context of state (process) estimation, we have to clearly distinguish between the roles of the state preparer and the observer measuring the state, both of which involve noise. Tomography is based on the assumption that every time the observer measures the state, it is actually the same state. Because of measurement noise, the observer cannot obtain perfect knowledge of which state has been prepared in finite time. However, nothing prevents him from doing so in principle: measuring for an infinite time, using a sufficient set of measurements, will yield perfect knowledge. If the same state is being prepared, it is ultimately knowable. On the other hand, the preparation of the state will also involve noise. Every time the preparer attempts to prepare the nominal state, noise will affect this and some slightly different state will result. This kind of noise is impossible to overcome, even in principle. The only way one can deal with it is by absorbing the preparation noise into the quantum state that is being prepared, through statistical mixing. If the preparation noise is ergodic, the observer will recover the average state of the quantum state ensemble.

In short: the individual states are not knowable, but their average is; and if the number of measurements is finite, the observer will not obtain this average preparation state perfectly. Then error bars, or more precisely a density function over state space, are needed to express this lack of complete knowledge.

One of the first papers calculating error bars from the measurement data is Ref. [9], for a specific reconstruction method of optical homodyning tomography (OHT) using so-called pattern functions [10, 11].
Using this method, the density matrix ρ can be derived directly from the detection probabilities pr(x, θ) sampled over phase space, where x and θ are parameters representing the settings of the OHT apparatus. To reconstruct ρ from the measurement data, these probabilities are replaced by the relative frequencies f(x, θ)/N. To obtain error bars, the fluctuations on the detection frequencies are modelled by a Poisson process, by which the variance σ² of the relative frequency equals the mean pr divided by the number of runs N. The first drawback of this method is that only the marginal variances of the density matrix elements are treated, disregarding correlations between errors, and therefore overestimating them. A second drawback of this approach is that, pr not being known, it is approximated by the relative frequency, giving σ² = pr/N ≈ f/N², where f is the absolute detection frequency. This approximation actually underestimates the variance, especially for low-probability events. Indeed, for events with f = 0 this approach assigns zero probability to the event, with zero variance, which corresponds to an absolute certainty. That certainty is not warranted given that only a finite number N of experiments were done. This problem is known more generally as the “zero-eigenvalue problem” and occurs in different guises in many other reconstruction methods.

The present work is a first step towards overcoming the two deficiencies described above: we propose to use Bayesian inference as the unifying principle to calculate a probability density function over state space from the measurement data, and from that density derive the first and second order moments: the mean state and the state covariance matrix.
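The zero-eigenvalue problem described above is easy to exhibit numerically. The following sketch (the values of pr and N are invented for illustration) compares the true binomial variance pr(1 − pr)/N of a relative frequency with the plug-in estimate σ² ≈ f/N²:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100      # runs per setting (illustrative)
pr = 0.01    # true (unknown) outcome probability (illustrative)

# True variance of the relative frequency f/N under a binomial model
true_var = pr * (1 - pr) / N

# Plug-in estimate sigma^2 ~ f/N^2 from a single batch of N runs
f = rng.binomial(N, pr)
plugin_var = f / N**2

print(f"true variance   : {true_var:.2e}")
print(f"plug-in estimate: {plugin_var:.2e}  (f = {f})")

# With pr = 0.01 and N = 100, the outcome f = 0 occurs in a large
# fraction of batches; the plug-in estimate then assigns zero variance,
# i.e. absolute certainty, to an event that was simply not observed.
prob_f0 = (1 - pr) ** N
print(f"P(f = 0) = {prob_f0:.2f}")
```

For a low-probability event, P(f = 0) here exceeds one third, so the plug-in recipe routinely reports zero uncertainty where the true variance is strictly positive.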
The goal of the present paper is to outline a practical method for calculating this mean state and state covariance directly from a set of tomographic data, in a completely general and statistically well-founded way.

During the course of this work, the paper [5] appeared, in which the same goals were aimed for and a method quite similar in spirit to ours was proposed. We refer the reader to Section VI for a discussion of the main differences between our method and the one in [5].

The input required by our method is a statistical model of the quantum tomographic setup: given a state and the measurement settings, how do the statistical properties of the measurement outcomes depend on them? For a measurement with a finite number K of outcomes, a measurement setting corresponds to a choice of operator elements A⁽ᵏ⁾, one element for each outcome. When the outcome is a continuous variable (e.g. position x), the measurement operator is parameterised by the continuous outcome value x. In either case, the laws of quantum mechanics dictate that the outcome k (or x) occurs in an experiment with a probability (or probability density) given by Born’s rule p_k = Tr[ρ A⁽ᵏ⁾] (or p(x) dx = Tr[ρ A(x)] dx). The measurements taken in a tomographic experiment relate to this probability density in one way or another. For any given setting a number of runs N would typically be performed and, in the case of a finite-outcome experiment, the frequencies f_k of the various outcomes would be recorded, or the values of the measurement in the case of a continuous outcome. The relation between these frequencies and the underlying probability distribution is governed by the laws of statistics.

Alternatively, in some experiments the outcome could be the combined effect of many small measurement events.
For example, in tomography of atomic/molecular clouds, the fluorescence response of the cloud to an impinging probe beam could be observed, in which case the experimental outcome is a fluorescence spectrum [12, 13, 14]. The recorded spectrum is a random variate with expected value proportional to the relevant probability density (a marginal of the quasi-probability density describing the state) and variance depending on the signal-to-noise ratio. Another example is an optical homodyning tomography (OHT) experiment where the probe beam intensity is so high that individual photons impinging on the detectors can no longer be resolved and the measurement results are light intensities, measured as voltages.

Once the statistical model of the tomographic setup has been supplied in suitable form, a general-purpose Bayesian inference engine could in principle take it from there, converting the measurement data into a posterior probability density over quantum state space. However, the actual calculations are typically too demanding to be at all practical. The second ingredient of our proposed solution is to make the calculations involved in Bayesian inference feasible by approximating the statistical tomographic model by a so-called linear-Gaussian model (explained below). For such models, the Bayesian inference simplifies to a set of simple equations known as the Kalman filter update equations.

Kalman filtering is a technique for dynamical state estimation that allows one to estimate a dynamical state from a sequence of noisy data [15]. Kalman filtering has already been applied in the context of continuous quantum measurement and quantum control [16, 17, 18]. For tomographic reconstruction, applying Kalman filtering appears to be a novel idea.

The goals we have set out for this work are quite challenging.
Rather than trying to solve all the problems involved in one go, we focus here on a particular, but important, class of tomography experiments, namely those where the measurements have (few) discrete outcomes, as opposed to measurements of continuous variables. This class still covers a wide variety of quantum systems, including single-photon optical systems (e.g. optical quantum computing) [6, 19, 20, 21], spin systems based on ions [7], atoms [22], or electrons (spin echo tomography) [23], superconducting [8] and solid state qubits [24], and tomography of atomic and molecular states based on fluorescence spectra [12, 13, 14]. For the purposes of exposition, we will restrict our attention here to optical systems and illustrate our reconstruction method on them. Important examples of tomography experiments not falling in this class are optical homodyning and heterodyning [1, 2, 11], where the outcome is the continuous variable x. We leave these for future work.

The second restriction we have imposed here is the assumption that the apparatus performing the tomography is ideal. In reality any measurement exhibits imperfections; e.g. photon detectors have less than unit efficiency and may exhibit dark counts, and input states for process tomography may be imperfectly prepared. Although these imperfections pose no deep fundamental problems, they would obscure the presentation, which is why we have chosen to treat them in a follow-up paper [25].
In the present paper we will cover the main principles of our proposal and illustrate them using a simple real application (based on actual data) as proof of principle.

A third (minor) restriction is that the number of experimental runs per measurement setting is not too small, so that the distributions governing the measurement outcomes can be approximated by normal distributions without too great an error.

It goes without saying that our reconstruction method is suitable both for state tomography and process tomography, because state tomography lies at the basis of process tomography. Either one sends various input states through the quantum process and measures the output states (as in Ref. [6]), or one half of a maximally entangled state is sent through the process and the global output state (including the other half) is measured (as in Ref. [20]). Both methods are formally equivalent to state tomography of the state representative of the process, under the Choi-Jamiolkowski isomorphism. The presentation of our method will therefore be mainly state based, for definiteness and simplicity.

As we bring together a number of concepts from statistics, engineering mathematics and quantum mechanics, we begin our presentation with a rather lengthy section (Sec. II) on background material, with an extensive list of notations and brief overviews of Bayesian inference and Kalman filtering. In Sec. III we present the basic theory of our proposal, and show how the problem of tomographic reconstruction can be made to fit the mould of Kalman filtering, thereby proposing solutions to several problems that we encounter along the way. The theory is then illustrated on two real tomography applications in Secs. IV and V, based on actual experimental tomography data. Finally, in Sec. VI we highlight the main benefits of our method, its performance and the costs associated with it, and compare it to existing methods, in particular to its sister-method proposed in [5].

II. THEORETICAL BACKGROUND

A. Notations
Let us start by introducing the main notations and typographical conventions that we will use in the paper, along with some fundamental notions from applied probability theory.

We denote vectors and matrices by symbols in boldface, F, f, to distinguish them from scalar quantities, which we denote in roman, including vector and matrix components, f_i. Exceptions to this convention are quantum-mechanical quantities like states ρ, POVM elements Π and maps Φ. The vector whose entries are all 1 is denoted by 𝟙 := (1, …, 1)*, and the identity matrix by 1. Random variables are denoted with uppercase symbols, F, and the values they can take with lowercase ones, f. For example, the probability density function (PDF) of a random variable F is denoted f_F(f). The first f is the general symbol for a PDF, the second F is the random variable, and the third f is the argument of the PDF and represents the values the random variable F can take. The mean and variance of a scalar random variable X are denoted by µ(X), σ²(X), and the mean and the covariance matrix of a d-dimensional random variable X by µ(X) and Σ(X).

In this paper, a number of distributions are prominent. Here we recall definitions and notations for the multinomial and normal distributions. Other distributions (chi-squared, Dirichlet and beta) will be described in subsequent sections.

When a random d-dimensional variable F is distributed according to a multinomial distribution with parameters N (where Σ_{k=1}^d f_k = N) and p, this is denoted by F ∼ Mtn(N; p). The PDF of this distribution is given by

f_F(f) = C(N; f) p_1^{f_1} ⋯ p_d^{f_d}.   (1)

Here we denote the multinomial coefficient by

C(N; f) = N! / (f_1! ⋯ f_d!).

The mean of the multinomial distribution is given by µ_k = N p_k, and the entries of its covariance matrix are

σ_{k,l} = N p_k (1 − p_k) for k = l,   σ_{k,l} = −N p_k p_l for k ≠ l.
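The multinomial moments quoted above can be checked by Monte-Carlo sampling. A minimal numpy sketch (the values of N and p are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 50
p = np.array([0.5, 0.3, 0.2])   # outcome probabilities (illustrative)

# Exact moments of F ~ Mtn(N; p):
mu = N * p                                  # mean:  mu_k = N p_k
Sigma = N * (np.diag(p) - np.outer(p, p))   # cov:   N p_k (1-p_k) on the
                                            #        diagonal, -N p_k p_l off it

# Monte-Carlo check against sample moments
samples = rng.multinomial(N, p, size=200_000)
mu_mc = samples.mean(axis=0)
Sigma_mc = np.cov(samples, rowvar=False)

assert np.allclose(mu_mc, mu, atol=0.1)
assert np.allclose(Sigma_mc, Sigma, atol=0.2)
```

Note that the rows (and columns) of Σ sum to zero, reflecting the constraint Σ_k f_k = N; this singularity reappears below for the Dirichlet covariance.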
When a random variable X is distributed according to a univariate normal distribution with mean µ and variance σ², we write X ∼ N(µ, σ²). Similarly, for a multivariate normal distribution with mean µ and covariance matrix Σ we write X ∼ N(µ, Σ).

We will reserve the symbol φ for the PDF of a normal distribution, while using f for general PDFs. The PDF of the univariate normal distribution will thus be denoted by

φ(x; µ, σ²) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

Similarly, we will denote the PDF of an n-dimensional multivariate normal distribution by

φ(x; µ, Σ) := exp[−½ (x − µ)* Σ⁻¹ (x − µ)] / √((2π)ⁿ |det Σ|).   (2)

The quadratic form appearing in the exponent is a proper distance measure between the vectors x and µ and is called the squared Mahalanobis distance, which we denote by M²:

M² := (x − µ)* Σ⁻¹ (x − µ).   (3)

One of the main statistical tools used in this paper is the approximation of distributions by normal distributions, using the technique of moment matching, whereby a distribution is approximated by a normal distribution with the same first and second order moments as the original.
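Equation (2) is straightforward to implement directly via the squared Mahalanobis distance (3). A small sketch, cross-checked against scipy's multivariate normal PDF (the mean, covariance and test point below are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def phi(x, mu, Sigma):
    """PDF of N(mu, Sigma), written as in Eq. (2) via the squared
    Mahalanobis distance M^2 = (x - mu)* Sigma^{-1} (x - mu)."""
    n = len(mu)
    diff = x - mu
    M2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    norm = np.sqrt((2 * np.pi) ** n * abs(np.linalg.det(Sigma)))
    return np.exp(-0.5 * M2) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, -1.0])

assert np.isclose(phi(x, mu, Sigma),
                  multivariate_normal.pdf(x, mean=mu, cov=Sigma))
```

Using `np.linalg.solve` instead of forming Σ⁻¹ explicitly is the usual numerically safer choice when Σ is close to singular.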
While conceptually simple and easy to use in practice, this method is also statistically well-founded because it gives the approximation that minimises the Kullback-Leibler distance D_KL(p ∥ q) := ∫ dx p(x) log(p(x)/q(x)) between a given PDF p and the approximating normal PDF q.

On the matrix analysis side, we will follow the mathematical convention (and not the physical one) of denoting Hermitian conjugates with an asterisk, A*, instead of a dagger, and reserve the dagger for the Moore-Penrose (MP) inverse: A†. Complex conjugation is denoted by an overline: A̅.

We will have occasion to apply the following formula for the matrix inverse of a rank-k correction to a non-singular matrix:

(A + U C V*)⁻¹ = A⁻¹ − A⁻¹ U (C⁻¹ + V* A⁻¹ U)⁻¹ V* A⁻¹.   (4)

Here, A is n × n and non-singular, C is k × k and non-singular, and U and V are general n × k matrices. This formula is alternately known as the matrix inversion lemma or the Woodbury matrix identity [26].

While the main topic of this paper is the tomographic reconstruction of quantum states, maps and POVMs, objects that are typically represented by matrices (density matrices, Choi matrices, POVM elements), the reconstruction technique we use is based on vector representations of the state of the system. Therefore, more often than not, we will need to convert the usual matrix representation of the quantum objects to a vector representation. For quantum states that means we will be employing a Hilbert space representation. The space M_d(ℂ) of d × d matrices will be considered as a Hilbert space H_{d²} equipped with the Hilbert-Schmidt inner product ⟨A, B⟩ = Tr AB*. To distinguish between the two representations, we write ρ for a density matrix, and ρ (in boldface) for its Hilbert space representative.

While many different bases could be used for H_{d²}, by far the easiest one for the purposes of this paper is the basis of matrix units {e_ij}, i, j = 1, …, d; in quantum physical notation e_ij = |i⟩⟨j|.
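Before moving on, the Woodbury identity (4) quoted above is easy to verify numerically; the sizes and the random matrices in this sketch are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2   # illustrative dimensions

A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # non-singular n x n
C = np.eye(k)                                       # non-singular k x k
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

Ainv = np.linalg.inv(A)
Cinv = np.linalg.inv(C)

# Left-hand side of Eq. (4): direct inversion of the rank-k correction
lhs = np.linalg.inv(A + U @ C @ V.conj().T)

# Right-hand side of Eq. (4): only a k x k inversion beyond A^{-1}
rhs = Ainv - Ainv @ U @ np.linalg.inv(Cinv + V.conj().T @ Ainv @ U) @ V.conj().T @ Ainv

assert np.allclose(lhs, rhs)
```

The practical value of the identity is visible in the shapes: the correction term on the right only requires inverting a k × k matrix, which is why it reappears in the Kalman update equations below.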
Converting a density matrix ρ to its Hilbert space representative ρ amounts to the so-called Vec operation, which works by simply stacking the columns of the density matrix into a single d²-dimensional column vector: ρ = Vec(ρ) := Σ_{i,j} ρ_ij |i⟩|j⟩. The reverse operation is the Mat operation, which reshapes a d²-dimensional vector into a d × d matrix. The vector Vec(1) is denoted |𝟙⟩; that is, |𝟙⟩ = Σ_{i=1}^d |i⟩|i⟩.

In the same vein, the Hilbert representation of a linear map L (be it completely positive (CP) or not) acting on d × d density matrices, expressed in the basis of matrix units, is a d² × d² matrix whose columns are the Hilbert space representations of the matrices L(e_ij). More specifically, if the map is a CP map and has the Kraus representation ρ ↦ L(ρ) = Σ_{k=1}^K A⁽ᵏ⁾ ρ A⁽ᵏ⁾*, then the Hilbert space representation of L is the matrix L given by

L = Σ_k A̅⁽ᵏ⁾ ⊗ A⁽ᵏ⁾.

B. The Dirichlet Distribution
The Dirichlet distribution is the higher-dimensional generalisation of the beta distribution. Its importance stems from the fact that it is the conjugate distribution of the multinomial distribution. That is, if F ∼ Mtn(N; p) is the distribution of F conditional on P = p, then Bayesian inversion yields that P conditional on F = f is Dirichlet with parameter f. Formally, the two distributions only differ by their normalisation: the multinomial is normalised by summing over all non-negative integer f summing up to N, while the Dirichlet is normalised by integrating over the simplex of non-negative p summing to 1.

The general form of the PDF of a d-dimensional Dirichlet distribution with parameters α_i is ([27], Chapter 49)

f_P(p) = Γ(α) Π_{i=1}^d p_i^{α_i − 1} / Γ(α_i),

where α is defined as

α := Σ_{i=1}^d α_i.   (5)

The range of P is the simplex p_i ≥ 0, Σ_i p_i = 1. In our case α_i = f_i + 1, and as Σ_i f_i = N, the PDF is

(N + d − 1)! Π_{i=1}^d p_i^{f_i} / f_i!.

The mean values of the Dirichlet distribution are

µ_i = α_i / α = (f_i + 1)/(N + d),   (6)

and the elements of its covariance matrix are

σ_ij = α_i (α − α_i) / (α² (α + 1)),  i = j,
σ_ij = −α_i α_j / (α² (α + 1)),  i ≠ j;

that is,

σ_ij = (f_i + 1)(N + d − f_i − 1) / ((N + d)² (N + d + 1)),  i = j,
σ_ij = −(f_i + 1)(f_j + 1) / ((N + d)² (N + d + 1)),  i ≠ j.   (7)

For further reference, we note a few properties of the covariance matrix Σ of the Dirichlet distribution. First of all, Σ is singular; it has a zero eigenvalue pertaining to the eigenvector 𝟙 := (1, …, 1)*. As the inverse of Σ does not exist we will need its Moore-Penrose inverse Σ†. Because Σ can be written as a diagonal matrix minus a symmetric rank-1 matrix,

Σ = [ (N + d) Diag(f + 𝟙) − (f + 𝟙)(f + 𝟙)* ] / ((N + d)² (N + d + 1)),

its MP-inverse can be calculated analytically. For non-singular differences of a matrix D and a rank-1 matrix x x*, the matrix inversion lemma (4) provides the formula

(D − x x*)⁻¹ = D⁻¹ + (1 − x* D⁻¹ x)⁻¹ D⁻¹ x x* D⁻¹.
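As a sanity check, the Dirichlet moments (6)-(7), the compact diagonal-minus-rank-1 form of Σ, and the singularity along 𝟙 can all be verified in a few lines of numpy (the frequency vector f below is invented for illustration):

```python
import numpy as np

N, d = 20, 3
f = np.array([12, 5, 3])     # observed frequencies (illustrative), sum = N
alpha = f + 1                # Dirichlet parameters alpha_i = f_i + 1
a = alpha.sum()              # alpha = N + d

# Mean (Eq. 6) and covariance (Eq. 7)
mu = alpha / a
Sigma = (a * np.diag(alpha) - np.outer(alpha, alpha)) / (a**2 * (a + 1))

# Compact form: Sigma = [(N+d) Diag(f+1) - (f+1)(f+1)*] / ((N+d)^2 (N+d+1))
Sigma2 = ((N + d) * np.diag(f + 1) - np.outer(f + 1, f + 1)) \
         / ((N + d)**2 * (N + d + 1))
assert np.allclose(Sigma, Sigma2)

# Sigma is singular, with null vector 1 = (1, ..., 1)
ones = np.ones(d)
assert np.allclose(Sigma @ ones, 0.0)
assert np.isclose(mu.sum(), 1.0)
```

The zero row sums of Σ encode the normalisation constraint Σ_i p_i = 1, which is exactly why the Moore-Penrose inverse is needed in what follows.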
Even for invertible D the difference can still be singular, namely when x* D⁻¹ x = 1. In that case the MP-inverse is given by

(D − x x*)† = G D⁻¹ G,
G = 1 − (x* D⁻² x)⁻¹ D⁻¹ x x* D⁻¹.

Here, G is the orthogonal projector on the support of D − x x*. In our case this gives

Σ† = (N + d)(N + d + 1) G Diag(f + 𝟙)⁻¹ G,
G = 1 − d⁻¹ 𝟙 𝟙*.   (8)

Thus G is the projector on the subspace of vectors x for which Σ_i x_i = 0. In other words, on the subspace of differences of probability vectors, Σ† reduces to the diagonal matrix

(N + d)(N + d + 1) Diag(f + 𝟙)⁻¹.

C. Bayesian Inference
Our reconstruction procedure essentially amounts to performing Bayesian inference, in conjunction with an approximation scheme for the statistical properties of the measurement process. More precisely, the measurement process is approximated by a so-called linear-Gaussian model. Within the confines of this model, the actual calculations for performing the Bayesian inference turn out to be very simple and are given by the update equations of a Kalman filter; this will be explained below. In this section we briefly describe the elements of Bayesian inference. For an in-depth treatment we refer to the excellent introductory work [28].

We describe the state of the system under investigation by a vector and denote it by x. For the time being, we ignore the fact that the system is a quantum system. As our knowledge of x is obtained from (quantum) measurements and is statistical in nature, we describe it by a random variable, X. Since in our setup measurements are independent (each measurement operates on a private copy of the quantum state under investigation), the reconstruction procedure can be decomposed as an iterative scheme in which each measurement is incorporated sequentially. We assume that in each iteration any prior knowledge about X, as well as any knowledge obtained through the outcomes of the first m − 1 measurement settings, has been incorporated into the probability density function f_X. In Bayesian terminology the PDF of X is called the prior PDF. The goal of the inference procedure is to come up with a posterior PDF that incorporates the knowledge obtained from the measurement outcomes in the m-th measurement setting. We use the random variable X′ to describe the updated knowledge; its PDF f_X′ is called the posterior PDF.

We denote the “knowledge obtained through a measurement” by a vector z describing the measurement outcome. This vector z is a sample of the corresponding random variable Z.
It can give us additional information about the system via the statistical relation between Z and X, which we express as the conditional PDF f_{Z|X}. This conditional PDF is the statistical description of the measurement model linking X to Z; it will be specified further in Section III B. The posterior f_X′ is then nothing but the conditional PDF f_{X|Z}.

At the heart of any Bayesian inference procedure lies Bayes’ rule, which expresses the relation between f_{X|Z} and f_{Z|X}:

f_{X|Z}(x|z) = f_{Z|X}(z|x) f_X(x) / f_Z(z).

While f_Z is defined as the marginal of f_{X,Z} = f_{Z|X} f_X, it is much more convenient to interpret f_Z as a normalisation constant, ensuring that f_{X|Z} integrates to 1 over the probability space of X. Second, as the main random variable in Bayes’ rule is X, while z plays the role of a parameter and is given by the observation, we have to consider f_{Z|X} as a function of x too. As a function of its second argument, f_{Z|X} is no longer a conditional probability density but a likelihood function. We will denote this function by L_{X|Z}. Because of the explicit normalisation in Bayes’ rule, L is defined up to proportionality. Note the reversal of the arguments, which is customary in the Bayesian literature and reflects the change of focus from Z, being the measurement outcome causally related to the underlying state X, to X, our guess of what the state should be, given the measurement outcome Z = z as evidence:

L_{X|Z}(x|z) ∝ f_{Z|X}(z|x).   (9)

We will henceforth rewrite Bayes’ rule as f_{X|Z} = C L_{X|Z} f_X. The Bayesian inference step, incorporating z as new knowledge, is then expressed as

f_{X′}(x) = C L_{X|Z}(x|z) f_X(x).   (10)
For a sequence of measurement settings (Z_i), i = 1, …, n, this step has to be iterated n times, leading through a sequence of n + 1 PDFs that describe the state X in a way that is consistent with the additional knowledge obtained through the measurements. If we describe the prior information by the PDF of X₀, the i-th measurement by the likelihood function with parameter z_i, and the updated information after i iterations by the PDF of X_i, we get

f_{X_n}(x) = C L_{X_{n−1}|Z_n}(x|z_n) L_{X_{n−2}|Z_{n−1}}(x|z_{n−1}) ⋯ L_{X₀|Z₁}(x|z₁) f_{X₀}(x).   (11)

As one can see, this is merely a product of the n likelihood functions and the prior PDF, and can therefore be calculated in any order. This will turn out to be important later on.

Despite the apparent simplicity of the Bayesian update formula, actual calculations based on it are in general very complicated, because the variable X appears both as the main variable of a continuous PDF (the prior) and as a continuous parameter of the (discrete) likelihood functions. The resulting product is in general an extremely complicated continuous function of x. In the context of reconstruction of tomographic data, for example, the likelihood functions are polynomials of very high degree.

The calculations do become feasible in the specific case of so-called linear-Gaussian models. In such models the dynamics of the system can be described by a time-discrete Markov chain with a linear evolution operator perturbed by zero-mean Gaussian noise. Similarly, the measurement also depends linearly on the system state, and any perturbation must also be zero-mean Gaussian noise. In other words, all variables (X₀, and all Z_i) are normally distributed, and the parameter x enters only in the value of the mean of f_{Z_i|X_{i−1}}; moreover, it does so in a linear way only.
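The order-independence of the product (11) can be made concrete in the simplest linear-Gaussian setting: a scalar state with a Gaussian prior and measurements corrupted by Gaussian noise. A minimal sketch (the measurement values and noise variances are invented for illustration):

```python
import numpy as np

def gauss_update(mu, var, z, theta):
    """One Bayesian update of a Gaussian prior N(mu, var) with a
    measurement z = x + noise, noise ~ N(0, theta): the scalar
    linear-Gaussian instance of the update step (10)."""
    k = var / (var + theta)               # scalar gain
    return mu + k * (z - mu), (1 - k) * var

mu0, var0 = 0.0, 10.0                     # prior (illustrative)
data = [(0.9, 0.5), (1.4, 1.0), (1.1, 0.2)]   # pairs (z_i, theta_i)

# Apply the same updates in two different orders
mu_a, var_a = mu0, var0
for z, th in data:
    mu_a, var_a = gauss_update(mu_a, var_a, z, th)

mu_b, var_b = mu0, var0
for z, th in reversed(data):
    mu_b, var_b = gauss_update(mu_b, var_b, z, th)

assert np.isclose(mu_a, mu_b) and np.isclose(var_a, var_b)
```

Both orders yield the same posterior, whose precision is simply the sum of the prior precision and all measurement precisions, 1/var0 + Σ_i 1/θ_i, as the product form (11) dictates.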
A typical example of such a linear-Gaussian model is a classical measurement whose output depends linearly on the state and is perturbed by additive white Gaussian noise.

For these models, and for more complicated dynamical models where the state X varies over time, the Bayesian update formula reduces to a set of simple equations called the Kalman filter equations. A Kalman filter is the optimal state estimator when the system and the measurement can be modelled by linear-Gaussian models. The Kalman filter equations consist of two sets of equations; one set (the predictor equations) predicts the evolution of the Markov chain, while the other set (the update equations) updates the state of the system based on the measurements taken. For the purposes of the present paper only the update equations will be used, because the basic assumption of tomography is that the system is static. A very good account of Kalman filtering is given in Ref. [15].

The reason for the feasibility of the linear-Gaussian model is that all distributions occurring in the calculations are normal, including the intermediate products of the factors of (11). This will be explained in the next section, where we describe the Kalman update equations in detail.
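In matrix form, the update equations take the standard gain/mean/covariance shape K = ΣH*(HΣH* + Θ)⁻¹, µ′ = µ + K(z − Hµ), Σ′ = Σ − KHΣ. A minimal numpy sketch of one such update for a static system (all dimensions and matrices below are invented for illustration), checked against the equivalent information-form posterior precision H*Θ⁻¹H + Σ⁻¹:

```python
import numpy as np

def kalman_update(mu, Sigma, z, H, Theta):
    """Measurement update for a static linear-Gaussian model
    z = H x + noise, noise ~ N(0, Theta): returns the posterior
    mean and covariance of x given the prior N(mu, Sigma)."""
    S = H @ Sigma @ H.conj().T + Theta           # innovation covariance
    K = Sigma @ H.conj().T @ np.linalg.inv(S)    # Kalman gain
    mu_post = mu + K @ (z - H @ mu)
    Sigma_post = Sigma - K @ H @ Sigma
    return mu_post, Sigma_post

rng = np.random.default_rng(3)
n, m = 4, 2                       # state and measurement dimensions
Sigma = np.eye(n)                 # prior covariance (illustrative)
H = rng.standard_normal((m, n))   # measurement matrix (illustrative)
Theta = 0.1 * np.eye(m)           # measurement noise covariance
z = rng.standard_normal(m)        # an observed outcome

mu_post, Sigma_post = kalman_update(np.zeros(n), Sigma, z, H, Theta)

# Information-form cross-check of the posterior covariance
info = H.conj().T @ np.linalg.inv(Theta) @ H + np.linalg.inv(Sigma)
assert np.allclose(np.linalg.inv(Sigma_post), info)
```

Note that only the m × m innovation covariance needs to be inverted, a direct payoff of the matrix inversion lemma (4); the n × n posterior covariance never has to be inverted at all.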
D. Kalman Filtering for Static Systems
Let us return again to the Bayesian update formula (10),

f_{X′}(x) = C L_{X|Z}(x|z) f_X(x).

A linear-Gaussian model can be represented pictorially as follows:

X —H→ Y —+N(0, Θ)→ Z.

For the purposes of this paper it turns out to be beneficial to split the model into two parts: a linear mapping, represented by a matrix H, followed by the noise process, which consists simply of adding zero-mean Gaussian white noise with given covariance matrix Θ. A further model assumption is that the prior PDF f_X is Gaussian, so that the posterior PDF will be Gaussian, too.

Suppose now that an observation of Z is made, giving the value z. Bayesian inference of the noise process then yields that the distribution of Y conditional on this observation will be N(z, Θ). In other words, the moments of Y are given by

µ(Y) = z,   (12a)
Σ(Y) = Θ.   (12b)

We are now left with performing Bayesian inference on the linear part, with the variables distributed as

X′ ∼ N(µ′, Σ′),   (13)
X ∼ N(µ, Σ),   (14)
Y ∼ N(µ(Y), Σ(Y)).   (15)

Here, µ and Σ are the already known first and second order moments of X (mean and covariance matrix), and µ′ and Σ′ are the unknown first and second order moments of X′.

Taking into account the explicit formula (2) for the PDF of a Gaussian distribution, the logarithm of the likelihood function is given by

−½ [y − µ(Y)]* Σ(Y)⁻¹ [y − µ(Y)]

plus a constant. We get similar expressions for the logarithm of f_{X′} and the logarithm of f_X. Combining everything, dropping the factors −1/2, and using the relation y = H x, the Bayesian update formula (10) for the linear mapping becomes

(x − µ′)* Σ′⁻¹ (x − µ′) = c + [H x − µ(Y)]* Σ(Y)⁻¹ [H x − µ(Y)] + (x − µ)* Σ⁻¹ (x − µ).   (16)

Here, all additive constants have been absorbed in the constant c. This constant is actually irrelevant because it is ultimately absorbed in the Bayesian normalisation factor C of (10). Both sides of the equation are therefore degree-2 polynomials in x, which confirms our earlier statement that all distributions occurring in the calculations for linear-Gaussian models are Gaussian. Eliminating x from this equation gives us the two update equations we need, one for the mean and one for the covariance.

Remark.
Although the probability space of X is a real vector space, the vector entries of x need not be real. This will give us more freedom in choosing a basis when X is a Hilbert space representation of density matrices; to get real vector entries one is forced to choose Pauli matrices (or generalisations thereof) as basis vectors. This is the reason why we have applied the Hermitian conjugate * rather than the transpose.

Combining the terms that are quadratic in x yields the equation

Σ′^{-1} = H* Σ(Y)^{-1} H + Σ^{-1}.

Inverting gives the solution for Σ′ [using the matrix inversion lemma, Eq. (4)]:

Σ′ = [H* Σ(Y)^{-1} H + Σ^{-1}]^{-1}            (17)
   = Σ − Σ H* [H Σ H* + Σ(Y)]^{-1} H Σ.        (18)

The factor Σ H* (H Σ H* + Σ(Y))^{-1} is customarily called the Kalman gain factor and is denoted by K.

Combining the terms linear in x yields the equation

(Σ′)^{-1} µ′ = H* Σ(Y)^{-1} µ(Y) + Σ^{-1} µ,

with the solution for µ′:

µ′ = Σ′ [H* Σ(Y)^{-1} µ(Y) + Σ^{-1} µ]
   = µ + K [µ(Y) − Hµ].                        (19)

In the last line we have used equation (18) and the easily verified relation (Σ − KHΣ) H* Σ(Y)^{-1} = K.

The three relevant equations are commonly known as the Kalman update equations, and they form the backbone of the method of this paper. Combined with the equations (12) for the moments of Y they read

K  = Σ H* [H Σ H* + Θ]^{-1}    (20a)
µ′ = µ + K (z − Hµ)            (20b)
Σ′ = Σ − K H Σ.                (20c)

It is instructive to consider the special case where the measurement mean y is exactly the state x, i.e. H = 1. This gives the simplified equations

K  = Σ (Σ + Θ)^{-1}    (21a)
µ′ = µ + K (z − µ)     (21b)
Σ′ = Σ − K Σ.          (21c)

After some algebra one finds that Σ′ is given by the parallel sum of Σ and Θ:

Σ′ = (Σ^{-1} + Θ^{-1})^{-1}.

In other words, in this special case, the inverses of the covariance matrices add up.
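The update equations (20) can be sketched directly in a few lines of numpy (a minimal sketch; the function name `kalman_update` is ours). The final assertion checks the parallel-sum identity for the special case H = 1 stated above:

```python
import numpy as np

def kalman_update(mu, Sigma, H, z, Theta):
    """One Kalman measurement update, Eqs. (20a)-(20c).

    mu, Sigma : prior mean and covariance of the state X
    H         : measurement matrix
    z, Theta  : observed measurement mean and covariance
    Returns the posterior mean mu' and covariance Sigma'.
    """
    S = H @ Sigma @ H.conj().T + Theta            # innovation covariance
    K = Sigma @ H.conj().T @ np.linalg.inv(S)     # Kalman gain (20a)
    mu_new = mu + K @ (z - H @ mu)                # (20b)
    # Note: for numerical robustness the additive form (17) is preferable;
    # here we use the textbook subtractive form (20c).
    Sigma_new = Sigma - K @ H @ Sigma
    return mu_new, Sigma_new

# Sanity check of the special case H = 1: Sigma' is the "parallel sum"
# (Sigma^{-1} + Theta^{-1})^{-1}, i.e. the inverse covariances add up.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); Sigma = A @ A.T + np.eye(3)
B = rng.normal(size=(3, 3)); Theta = B @ B.T + np.eye(3)
mu = rng.normal(size=3); z = rng.normal(size=3)
mu1, Sigma1 = kalman_update(mu, Sigma, np.eye(3), z, Theta)
parallel = np.linalg.inv(np.linalg.inv(Sigma) + np.linalg.inv(Theta))
assert np.allclose(Sigma1, parallel)
```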
Remarks.
1. In the expression for the Kalman gain we use a matrix inversion, and that requires invertibility of the argument. In our setting (quantum mechanics), it turns out that the argument is actually never invertible. The reason for that is a certain set of exact constraints that the state has to obey. For example, when the state is a quantum state, its trace must be equal to 1. In addition, the measurement vector also has to satisfy certain exact constraints. For example, in an N-run experiment, the number of clicks must add up to N. This eventually has an impact on Σ, H and Θ, causing H Σ H* + Θ to be non-invertible. How to deal with this will be described later on, in Section III G.

2. From an analytic viewpoint it is not readily clear that the expression (20c) for Σ′ always yields a positive semi-definite matrix. More importantly, the expression is not the best one from a numerical viewpoint; since a subtraction is made, numerical roundoff may produce a non-positive semi-definite matrix. This is more likely to happen for precise measurements, yielding small variances. For that reason, the alternative formula (17), involving addition rather than subtraction, is preferable. In that form it is also obvious that the obtained Σ′ is always positive semi-definite.

III. KALMAN FILTER RECONSTRUCTION OF QUANTUM TOMOGRAPHIC DATA
In this Section we present the basic theory of our reconstruction method, based on the concepts of Bayesian inference and Kalman filtering, which were described in the previous Section. This Section contains the bulk of the material, and includes the mathematical underpinnings of our method. To assist the readers who are more interested in the method itself and how to apply it, we have included Sec. III A, containing a self-contained explanation of the method. Those readers are advised to read that Section only and then fast-forward to the applications and discussions in Sections IV, V and VI.

We start in Sec. III B with a characterisation of the likelihood function for quantum measurements (characterised themselves by a POVM {Π^{(k)}}) and obtain a normal approximation that allows us to approximately fit a quantum measurement into the mould of a linear-Gaussian model. We find that incorporating the information obtained from the quantum measurements into the likelihood function amounts to applying the Kalman filter update equations (20), where the entries of the measurement mean z are given by formula (6), those of the measurement covariance Θ by (7), and the measurement matrix H by the matrix representing the linear mapping ρ ↦ p = (Tr[ρ Π^{(k)}])_{k=1..K}.

Apart from calculating the likelihood, Bayesian inference requires a choice of a prior, and this is partially treated in Sec. III C. In a full treatment one would have to incorporate the structure of the physical set into the prior, i.e. the prior should be zero outside of the set of physical states. This, however, causes the prior to be non-Gaussian, and it therefore does not fit the requirements of Kalman filtering. For that reason, the restriction to the physical set will be done in a post-processing step, as described in Sections III E and III F, and instead a Gaussian dummy prior is chosen. Directly after the Kalman updates, a simple correction step removes the effects of this dummy prior again [Sec. III C, Eqs. (27) and (28)].
At this point, the eigenvalues of the corrected Σ, before restricting to the physical set, already give useful information, as they are variances of certain linear combinations of state components and give an indication of how many more measurements are needed to reduce the measurement error. For example, to bound the maximal error on any state component from above, the largest eigenvalue of the covariance matrix should be bounded.

Once the posterior PDF, restricted to the physical set, is calculated, via its first and second order moments, one can calculate the confidence region for the reconstruction, i.e. the set within which the actual state should lie with high probability (say, 95%). The basic quantity expressing this confidence region is the Mahalanobis distance. This is explained in Sec. III D.

Next, the problem of restricting the posterior PDF to the physical set is treated. This is actually a very difficult numerical problem, especially in high dimensions, and turns out to be a challenge even for current state-of-the-art Bayesian integration methods. In Sections III E and III F we give two simple algorithms that perform the task reasonably well, if one is willing to give up exactness of the solution.

For most quantum tomography problems, the physical set is partially defined by exact constraints. For example, quantum states have trace equal to 1. In Sec. III G we show how such exact constraints are best dealt with, in order to avoid numerical problems. The Kalman update equations have to be slightly modified.

Finally, Sec. III H deals with the issue of graphically representing the calculated results in a meaningful way, based on mean values and error bars. The first moment of the posterior PDF roughly corresponds to the maximum likelihood solution and, as frequently happens with this kind of solution, exhibits reconstruction artifacts, which are not features of the actual state but are not ruled out by the measurements either.
A number of methods have been developed to remove these artifacts, all based on maximisation of entropy or related functionals, and we discuss these in our context.

A. Overview
In this Section we present a self-contained overview of our method for the purpose of implementation. Mathematical details, as well as the underlying rationale, are explained in subsequent sections, which can be skipped on first reading. For the sake of clarity we restrict ourselves to the setting of state reconstruction.

The first step in implementing the method is to gather all relevant information about the tomography process and cast it in an appropriate mathematical form. The following are needed:

• the dimension D of the state (or of its representation);

• the various sets of POVM elements {Π^{(j)}}_j, one set per measurement setting;

• the number of runs N per measurement setting, if applicable; in continuous wave (CW) experiments, there is no such N;

• the measurement data: the frequencies f_j (the "number of clicks") of each of the outcomes;

• any exact linear constraints on the state; by default, Tr ρ = 1 is the only such constraint;

• any exact linear constraints on the measurement outcomes; for example, ∑_j f_j = N;

• a statistical model of the measurement process in the form of a PDF of the frequencies f_j in terms of the POVM elements, N, or any other parameter; typically, this PDF is multinomial or Poissonian;

• a reference state ρ₀ satisfying the linear constraints on the state; typically, this would be the maximally mixed state, ρ₀ = 1_D / D.

The reconstruction algorithm consists of a number of phases, and we describe each in the following. It is important to note that we represent states by vectors. The most convenient basis is the standard basis, in which ρ is represented by the D²-dimensional vector Vec(ρ).
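As a minimal illustration of this vector representation (a sketch with a hypothetical qubit state and a projective POVM; the column-stacking convention and the helper name `vec` are our own, and `H` is built here with conjugated vectorised POVM elements as rows, which for Hermitian elements is equivalent to the column convention used later in the text):

```python
import numpy as np

def vec(M):
    """Column-stacking vectorisation of a matrix (one possible convention)."""
    return M.reshape(-1, order='F')

# Hypothetical qubit example: a generic state and the computational-basis POVM.
rho  = np.array([[0.7, 0.1 - 0.2j], [0.1 + 0.2j, 0.3]])
povm = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]

# With Vec(Pi_j)^dagger as j-th row, H @ Vec(rho) reproduces the
# Born-rule probabilities p_j = Tr[rho Pi_j].
H = np.array([vec(P).conj() for P in povm])
p = H @ vec(rho)
assert np.allclose(p, [np.trace(rho @ P) for P in povm])
```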
1. Setup phase
The exact linear constraints are enforced through the use of two projectors, T_X and T_Z. These have to be calculated first. The projector T_X is a projector on a subspace S_X in state space (that is, the vector space representation thereof), while T_Z is a projector on a subspace S_Z in measurement space (the space of measurement vectors z). In principle, the latter could differ per measurement setting, but we will assume here that it does not. The interpretation of the subspace S_X is that a state ρ satisfies the linear state constraints if and only if Vec(ρ − ρ₀) is in the subspace S_X. For example, if the only linear constraint is Tr ρ = 1, then S_X is the space of vector representations of Hermitian matrices of trace zero; that is, the space of D²-dimensional vectors orthogonal to x₀ = Vec(ρ₀) = Vec(1_D / D). We will henceforth represent a state ρ by the vector Vec(ρ − ρ₀) in S_X. The reference state ρ₀ is thus represented by the null vector.

The interpretation of S_Z is similar. If, for example, the only linear constraint on the measurements is ∑_j f_j = N, resulting in the constraint ∑_j z_j = 1, then S_Z is the space of real vectors (with dimension equal to the number of measurement outcomes) whose components sum up to zero.

The corresponding projectors can be calculated by constructing orthonormal bases (onb) supporting each of these subspaces. If {x_i} is an onb for S_X, we construct the matrix X whose columns are the vectors x_i. This matrix is a so-called partial isometry. The projector T_X is then given as T_X = X X*. Similarly we have T_Z = Z Z*, where the columns of Z form an onb for the subspace S_Z. In the actual algorithm we don't need the projectors but only these partial isometries X and Z. Conversely, if the projectors are given, we can calculate X and Z from them via the singular value decomposition (SVD).
Let T_X = U S U* be the singular value decomposition (SVD) of T_X, where U is unitary and S is diagonal, the diagonal elements of which are the singular values of T_X. For a projector T_X of rank k, the first k singular values are 1, while the others are zero. Then X consists of the first k columns of U. The partial isometry Z is calculated in the same way from the SVD of T_Z.

For the example where the linear state constraint is Tr ρ = 1, the projector T_X is easily constructed as T_X = 1_{D²} − |1_D⟩⟨1_D| / D. Using an SVD, this gives X. Similarly, when the measurement constraint is ∑_{j=1}^d z_j = 1, with d the number of outcomes, T_Z = 1_d − |1⟩⟨1| / d, and Z is also calculated using an SVD.

The exact constraints can be dealt with by expressing the relevant quantities in terms of the bases X and Z. A tilde will be used to indicate this. For example, a state ρ satisfying the constraint can be represented by the tilde vector x̃ = X*(x − x₀) = X* Vec(ρ − ρ₀).

Initial mean value and covariance:
Let k be the rank of T_X; thus k is D² minus the number of independent constraints on ρ. The initial mean value is set to the k-dimensional null vector, µ̃₀ = 0, representing the reference state ρ₀. The initial covariance matrix is Σ̃₀ = b 1_k, where b is a scalar with a "large" chosen value.
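The setup-phase construction of a partial isometry from a projector can be sketched as follows (assuming numpy; the function name `isometry_from_projector` is ours). The example builds T_X for the trace constraint on a qubit and recovers X from its SVD:

```python
import numpy as np

def isometry_from_projector(T):
    """Partial isometry X whose columns span the range of projector T.

    Computed via the SVD, as in the text: keep the columns of U whose
    singular values are (numerically) 1.
    """
    U, s, _ = np.linalg.svd(T)
    return U[:, s > 0.5]   # for a rank-k projector, k singular values equal 1

# Trace constraint on a qubit (D = 2): S_X is the orthogonal complement
# of Vec(identity) inside the D^2-dimensional vectorised state space.
D = 2
v = np.eye(D).reshape(-1) / np.sqrt(D)     # normalised Vec(1_D)/sqrt(D)
T_X = np.eye(D * D) - np.outer(v, v)
X = isometry_from_projector(T_X)
assert X.shape == (D * D, D * D - 1)                     # one constraint removed
assert np.allclose(X.conj().T @ X, np.eye(D * D - 1))    # partial isometry
assert np.allclose(X @ X.conj().T, T_X)                  # recovers T_X
```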
2. Kalman Filter phase
The following is applied iteratively, for each measurement setting.

The measurement matrix H represents the POVM elements of the current measurement setting. The j-th column of H is given by Vec(Π^{(j)}). From H we calculate H̃ = Z* H X, to incorporate the exact state and measurement constraints.

The reference measurement vector is then z₀ = H x₀. The measurement vectors will be represented by the tilde vectors z̃ = Z*(z − z₀). The original measurement vector z is derived from the actual measurements f_j, along with the measurement covariance matrix Θ, as follows.

For a measurement with multinomial statistics (as for N runs of single-photon tomographic measurements), z is given by z_j = (f_j + 1)/(N + d) (formula (6)), and Θ by formula (7). The tilde quantities then follow using z̃ = Z*(z − z₀) and Θ̃ = Z* Θ Z.

For measurements with Poissonian statistics, in the case that the POVM elements add up to either the identity matrix or a scalar multiple thereof, the same formulas hold, but with N given by N = ∑_j f_j. When the POVM elements do not add up to a scalar matrix (a situation that is better avoided), the modified formulas (23) and (24) have to be used for z and Θ.

We now apply the Kalman filter update equations (45), written entirely in terms of tilde quantities:

K̃  = Σ̃ H̃* (H̃ Σ̃ H̃* + Θ̃)^{-1}
µ̃′ = µ̃ + K̃ (z̃ − H̃ µ̃)
Σ̃′ = Σ̃ − K̃ H̃ Σ̃.

For the first iteration, we set µ̃ = µ̃₀ and Σ̃ = Σ̃₀. The primed quantities are the updated values and have to be used as µ̃ and Σ̃ in the next iteration.

After the final iteration, the "untilded" quantities can be calculated, if one so wishes, using the formulas µ′ = µ₀ + X µ̃′ and Σ′ = X Σ̃′ X*.
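The conversion from raw counts to the measurement moments z and Θ can be sketched as follows (a minimal numpy sketch; formula (6) for the mean is quoted from the text, while the covariance is assumed here to be the standard Dirichlet covariance with parameters f_j + 1, which is what formula (7) amounts to):

```python
import numpy as np

def dirichlet_moments(f):
    """Mean z and covariance Theta of the Dirichlet posterior over p.

    Mean: z_j = (f_j + 1)/(N + d) as in formula (6); covariance assumed
    to be the standard Dirichlet covariance with one pseudocount per
    outcome: Theta_jk = (delta_jk z_j - z_j z_k)/(N + d + 1).
    """
    f = np.asarray(f, dtype=float)
    N, d = f.sum(), f.size
    z = (f + 1.0) / (N + d)
    Theta = (np.diag(z) - np.outer(z, z)) / (N + d + 1.0)
    return z, Theta

# Even an unseen outcome keeps nonzero probability and variance,
# avoiding the zero-variance pathology of the naive approach below.
z, Theta = dirichlet_moments([60, 40, 0])
assert z[2] > 0 and Theta[2, 2] > 0
assert np.isclose(z.sum(), 1.0)
assert np.allclose(Theta.sum(axis=1), 0.0)   # rows sum to zero since sum(z) = 1
```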
3. First Interpretation
The Kalman filter reconstruction procedure yields a Gaussian distribution over the linear space containing the physical state space as a subset. In most cases, this distribution will have a non-negligible probability outside this physical set. In fact, more often than not the mean of the distribution will be non-physical. At this point, the reconstructed mean and covariance only summarise what the measurements tell us, regardless of the physical significance of the outcome. In the next phase, the physicality constraint has to be combined with the reconstruction, as a kind of prior knowledge.

However, the unphysical covariance matrix already gives us interesting diagnostic information about the tomography itself, namely about the inherent accuracy of the measurements. To that purpose one can investigate the eigenvalues of the covariance matrix Σ. If some (or all) of these are large, that means the tomography is not able to give accurate information about the state in the direction of the corresponding eigenvectors. Application 2 (Section V) gives a particularly nice example of this.
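This eigenvalue diagnostic can be sketched in a few lines (a toy example with a hypothetical 3-dimensional posterior covariance; the variable names are ours):

```python
import numpy as np

# The eigen-decomposition of the (possibly unphysical) posterior covariance
# reveals which directions in state space the tomography pins down well
# (small eigenvalues) and which it hardly constrains (large eigenvalues).
Sigma = np.diag([1e-6, 1e-6, 4.0])     # hypothetical posterior covariance
evals, evecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

sigmas = np.sqrt(evals)                # std. deviations along eigenvectors
worst = evecs[:, np.argmax(evals)]     # least-constrained direction
assert np.isclose(sigmas.max(), 2.0)
assert np.allclose(np.abs(worst), [0.0, 0.0, 1.0])
```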
4. Restriction phase
Restricting the reconstruction to physical state space can be done in a number of ways, but is always more complicated than in the MaxLik case, because the covariance matrix also has to be treated. The easiest way is to first calculate the maximum likelihood physical state. This is the state ρ_ML satisfying Tr ρ_ML = 1 and ρ_ML ≥ 0 and minimising the squared Mahalanobis distance M² = (ρ − µ)* Σ^{-1} (ρ − µ) to the reconstructed unphysical mean state. In the presence of exact constraints, these calculations are best done using tilde quantities. This is a constrained minimisation problem that can be reformulated as a semi-definite program (SDP) (see Section III E) and can be solved efficiently using SDP solvers.

The obtained minimal Mahalanobis distance M_ML has diagnostic value. If nothing has gone wrong, the MaxLik solution should be well within the confidence region of the reconstructed likelihood distribution. Taking a 95% confidence value, the confidence region is given by the inequality M² ≤ γ_ν := (√(2ν − 1) + 1.645)²/2, where ν is the dimension of the subspace S_X supporting the reconstructed state ρ (the number of degrees of freedom). If the value of M_ML is much larger than that, this indicates that something has gone wrong either with the tomographic measurements or with the reconstruction, in the sense that the underlying assumptions are violated, for example if certain additional noise sources haven't been accounted for. Application 1 illustrates this aspect very clearly.

If M_ML falls within the confidence interval related to the unphysical reconstruction, we can calculate the confidence region for the physical restriction. A good approximation for that region is given by the intersection of the ellipsoid (ρ − ρ_ML)* Σ^{-1} (ρ − ρ_ML) ≤ (√γ_ν + M_ML)² =: γ′_ν with the physical set. A drawback of this approach is that to get error bars the intersection has to be calculated explicitly, which is a non-trivial task. In Sec.
III F we present an alternative algorithm for performing the restriction to the physical set. This algorithm does not rely on an SDP solver and, furthermore, yields an explicit confidence region, allowing error bars to be calculated in a straightforward way.

Finally, it is possible to calculate a regularised solution. This is a physical state within the physical confidence region that optimises a certain regularisation functional. Possible choices are the entropy, in which case we get the so-called maximum entropy (MaxEnt) solution, or a functional expressing the smoothness of a solution. An example of the latter is given in Application 2. The calculations for this again require solving a semi-definite program (see Sec. III H 4).

B. Approximation of the Measurement Process in Quantum Tomography by a Linear-Gaussian Model
1. The Likelihood Function in Quantum Tomography
Any quantum measurement, be it in state tomography or process tomography, can be characterised by the application of a POVM {Π^{(k)}}_{k=1..K} to a certain state ρ; this state could be the state under investigation, or the output of a quantum process given a certain applied input state. From the point of view of the experimenter, the state ρ is initially unknown, even though the experimenter may have certain preconceptions about it. Because the tomographic experiments reveal information about the state of a statistical nature, the state has to be treated as a random variable. Henceforth, ρ will denote an observed quantum state corresponding to a random variable denoted by R.

Quantum mechanics predicts the probabilities of each of the K POVM outcomes on a state ρ to be p_k = Tr[ρ Π^{(k)}]. We define the vector of probabilities as p = (p_k)_{k=1..K}. When the state is described by a random variable R, the vector p is an observation of an underlying random variable P, with P_k = Tr[R Π^{(k)}].

In reality, P is never observed directly. We will consider two types of optical tomography experiments in this paper. In pulsed mode tomography experiments, N individual light pulses are sent into the system, each pulse prepared in a state ρ. The POVM measurement is repeated N times, presumably on a sequence of N independent identically prepared states ρ. For every pulse, a detector either clicks or does not click. The results of these N runs can then be combined into a vector of frequencies f = (f_k)_{k=1..K} of the respective outcomes. This vector is an observation of a random variable, F. As is well known, for fixed N and p, F has a multinomial distribution with parameters N and p: F ∼ Mtn(N; p).

In continuous wave (CW) mode optical experiments, the incoming laser beam is turned on for a relatively long but fixed time, and the number of times the detectors click in that time span is taken as the frequencies f_k.
The elements F_k are Poisson distributed with mean value µ_k = A p_k, where A is a proportionality factor called the brightness factor. This incorporates the intensity of the laser beam, the duration of the experiment, detector losses, etc. Obviously, the sum of frequencies N = ∑_k f_k is not fixed but is a Poissonian random variable as well.

Combining all this with the relation P_k = Tr[R Π^{(k)}], we obtain the PDF f_{F|R}(f|ρ) of F conditional on R, or the likelihood function when considered as a function of ρ. Pictorially, we have the following (for pulsed mode experiments):

R --Π^{(k)}--> P --Mtn(N, p)--> F.

The first step is a linear mapping, and the second step is the quantum noise model. In comparison, recall that Kalman filtering is based on the linear-Gaussian model:

X --H--> Y --+N(0, Σ)--> Z.

The first step is again a linear mapping, but the second step is an additive Gaussian white noise (AGWN) model.

As mentioned, the basic idea explored in this paper is to approximate the quantum model by a linear-Gaussian model in order to open the door for Kalman filtering. To do that, the following incompatibilities have to be overcome. First, in linear-Gaussian models there are typically no restrictions on the state vector X, while in quantum mechanics R is confined to quantum state space (positive semi-definite and trace equal to 1). We postpone the solution to this problem until Section III E and just pretend for the time being that the variable R is unrestricted.

The second difference is of course the different noise model. While in both measurement models the first step is a linear mapping, the quantum noise model is non-additive and non-Gaussian. In spite of this apparently rather large difference, the simplicity of the Kalman filter equations is so appealing that one is enticed to try and approximate f_{F|R} by a linear-Gaussian model anyhow.
Indeed, many distributions, including the multinomial and Poisson distributions, can be approximated by a normal distribution, and according to the law of large numbers the approximation gets better as the number of observations increases. Incidentally, this is why we have imposed the requirement that the number of experimental runs per measurement setting should not be too small.

So the main problem we are faced with is to reconcile the two models in a statistically sound way, but without losing sight of the practical issues. In the next subsection we first present a deceptively simple "solution", one that comes to mind almost automatically, but which suffers from a number of serious drawbacks. An observation that is more than 200 years old will then provide a way out of this conundrum, paving the way to a more satisfactory solution.
2. A naïve approach
If we make the straightforward identifications Z = F, Y = P and X = R, then L_{R|F}(ρ|f) provides the likelihood function L_{X|Z} required for the Bayesian update formula (10). There are two problems with this, however, preventing a direct mapping to a linear-Gaussian model:

1. P enters in the moments of F of all orders, and not just the mean.

2. P enters in these moments in a highly non-linear way.

A first naïve approach could be to simply approximate the multinomial distribution by a multivariate normal with mean N p_k = N Tr[ρ Π^{(k)}] (which is linear in ρ, as required) and with covariance matrix the one obtained by taking the covariance matrix of the multinomial distribution and replacing every occurrence of p_k by f_k/N (which is independent of ρ, as required).

Although this superficially seems to solve the above problems, a serious drawback of this approach is that the assignment of the covariance matrix is very ad hoc; for example, p_k is replaced by its estimator f_k/N in the covariance but not in the mean. Even more importantly, this approach is statistically ill-founded and, in fact, underestimates the actual variance of F.

This is most apparent when some of the components of f are zero. Indeed, consider an N-trial 2-outcome measurement, where f = (f₁, f₂) = (f₁, N − f₁), and suppose f₁ = 0. In the naïve approach, the variance assigned to f_{F|R} would be N p₁(1 − p₁), with p₁ replaced by f₁/N, hence giving zero. This is clearly a mistake, because a variance of zero amounts to perfect knowledge, and a confidence interval of zero width. However, never having seen outcome '0' is no guarantee that '0' will not occur in the future, no matter how high the value of N may be. This has also been noted in Ref.
[5].

The first documented encounter of this phenomenon appeared in a 1774 paper by Laplace, as the so-called "sunrise problem": calculate the probability of a sunrise, solely based on the information that it has risen on the N days before [29]. The answer is not 1. Instead, the correct value of this conditional probability is given by a formula that is now known as Laplace's rule of succession (see, e.g., Ref. [30]). In a wider context we can consider the "visible sunrise problem" and calculate the probability p that we can see the sun rise (unhindered by clouds), given that we have done so on f of the N days before. In the modern interpretation of Laplace's rule, p is a random variable with a uniform prior, and conditional on the N observations it has a posterior PDF that, according to Bayes' rule, is a beta distribution with parameters f + 1 and N − f + 1, whose mean is (f + 1)/(N + 2). While useful for predicting sunrises, beta distributions will also offer the solution to our reconstruction problem.
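Laplace's rule can be sketched in a couple of lines (the function name is ours; exact fractions are used to keep the arithmetic transparent):

```python
from fractions import Fraction

def rule_of_succession(f, N):
    """Posterior mean of p after f successes in N trials, uniform prior.

    Laplace's rule: the Beta(f + 1, N - f + 1) posterior has mean
    (f + 1)/(N + 2) -- "each outcome gets one click for free".
    """
    return Fraction(f + 1, N + 2)

# Never having seen the sun fail to rise does NOT give probability 1:
assert rule_of_succession(100, 100) == Fraction(101, 102)
# And an outcome never observed still gets a nonzero probability:
assert rule_of_succession(0, 100) == Fraction(1, 102)
```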
3. Bayesian Solution
Essentially, Kalman filtering can be seen as Bayesian inference, simplified to the case of linear-Gaussian models. When the noise is no longer Gaussian, as in our case, but we still want to reap the benefits of the simplicity of the Kalman filter equations, we really should be looking at the Bayesian inference equations and suitably approximate these, rather than approximate the model and apply Kalman filtering to that. In this way we can avoid the problems of the naïve approach.

More precisely, what we will do is match the two models after Bayesian inversion of their noise processes. Recall that for the linear-Gaussian model this gave

X --H--> Y,   Y ∼ N(µ(Y), Σ(Y)),

with the moments of Y determined by the observation z, Eqs. (12). For the quantum measurement model we have

R --Π^{(k)}--> P --Mtn(N, p)--> F.

Bayesian inversion yields the PDF of P conditional on the observation f of F. As explained in Section II B, this is a Dirichlet distribution with parameter f, P ∼ Dirichlet(f), with moments given by (6) and (7).

The solution to the matching problem has now become very simple. We match the partially inverted quantum measurement model to the partially inverted linear-Gaussian model, and to do so we approximate the Dirichlet distribution by a Gaussian distribution with the same first and second order moments (moment matching). The upshot of all this is the following rule: in the Kalman filter update equations (20), replace z by formula (6), and Θ by (7).

Remarks.
1. In our context, the formula for the mean of a Dirichlet random variate (Laplace's rule of succession) could be paraphrased as "each outcome gets one click for free". In statistics this extra count is called a pseudocount [31]. In comparison, the mode (the position of the maximum of the PDF) is given by p_i = f_i/N.

2. We would like to point out that in maximum likelihood reconstruction one takes the mode as the basic quantity (the relative frequencies of the outcomes), as that is the point of maximum likelihood, while in our approach we use the confidence region, which is approximately centered around the mean (the modified relative frequencies, with pseudocounts included). To counter potential objections against this approach we already mention here that, for any not unreasonably small confidence value, the mode is well within this confidence region. This will be shown in the appendix, Section B 1.

3. In fact, even if one is not going to calculate the confidence region, Laplace's rule tells us that one should really use the modified relative frequencies, because "the mean of a posterior can be thought of as being more representative [than its mode] as it takes into account the skewness of the PDF" ([28], p. 25).

4. In this whole discussion we have quietly disregarded the fact that in the quantum setting the probabilities P do not necessarily range over the whole probability space. The range is essentially determined by the relations p_j = Tr[ρ Π^{(j)}]. Thus, to be completely correct, all Bayesian integrations should be carried out over this range. However, in many cases, integrating over the exact range complicates the calculations too much. On the other hand, in the quantum tomography setting, performing exact integrations does not guarantee that the final solution, in terms of the state, belongs to the physical set anyway.
Therefore, we integrate over the full probability space (the d-dimensional probability simplex) and restrict the solution to the exact physical set afterwards (see Sections III E and III F). More about this is discussed in Section III B 5.

We have illustrated our approach here for the case of pulsed experiments, where the distribution of clicks is governed by the multinomial distribution. More generally, the approach can be described as follows. Let f be the PDF of the distribution of the outcome frequencies, conditional on the probabilities p. From this, derive the conjugate PDF, i.e. the PDF of P conditional on the observed frequencies. Then approximate this conjugate PDF by a Gaussian using moment matching. This amounts to replacing z and Θ in the Kalman update equations by the first and second order moments (which are functions of the observed outcomes) of the conjugate PDF, respectively.
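The recipe just described can be sketched end-to-end for the simplest case H = 1, where the "state" is the probability vector itself (a numpy sketch; formula (6) is quoted from the text and the covariance is assumed to be the standard Dirichlet covariance with parameters f_j + 1, as in formula (7)):

```python
import numpy as np

# Observed counts are converted into the Dirichlet moments z and Theta,
# which then replace the measurement moments in the Kalman update
# equations (here in the simplified form (21), H = 1).
f = np.array([48.0, 32.0, 20.0])
N, d = f.sum(), f.size
z = (f + 1.0) / (N + d)                            # Eq. (6)
Theta = (np.diag(z) - np.outer(z, z)) / (N + d + 1.0)

mu = np.full(d, 1.0 / d)     # flat dummy prior mean
Sigma = 100.0 * np.eye(d)    # "large" prior covariance (b = 100)

K = Sigma @ np.linalg.inv(Sigma + Theta)   # (21a)
mu_post = mu + K @ (z - mu)                # (21b)
Sigma_post = Sigma - K @ Sigma             # (21c)

# With a weak prior the posterior mean essentially reproduces z,
# and the posterior covariance stays positive semi-definite:
assert np.allclose(mu_post, z, atol=1e-3)
assert np.all(np.linalg.eigvalsh(Sigma_post) >= -1e-12)
```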
4. Poissonian counts
Consider, as a second example, CW experiments, where the outcome frequencies are governed by a Poisson distribution. More precisely, the frequencies f_j of the outcomes are independent Poisson variates with parameters µ_j = A p_j, where A is the brightness factor of the experimental setup.

The conjugate distribution of the Poissonian f(k|p) = e^{−µ} µ^k / k!, with µ = Ap, is the PDF

f(p|k) = A e^{−Ap} (Ap)^k / (k! [1 − Q(k + 1, A)]),    (22)

where Q(k + 1, A) is the regularised incomplete Gamma function [32].

We can apply this to find the PDF of P conditional on the outcome frequencies f_j. While the latter are independently Poisson distributed, the p_j have to add up to 1 and are therefore correlated. Thus, the PDF of P equals the product ∏_j f(p_j|f_j) but renormalised to 1 over the probability simplex of p. A short calculation yields the surprising result that, again, the conjugate PDF is given by the Dirichlet distribution, with N = ∑_{j=1}^d f_j. Maybe even more surprising is the fact that the brightness factor A cancels out completely. This is rather convenient, since, in general, A is not known, or at least not with great precision. We can therefore carry over the formulas for the pulsed mode case to the CW case wholesale, with the one addition that N has to be explicitly defined as N = ∑_{j=1}^d f_j.
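The cancellation of the brightness factor can be verified numerically for d = 2 outcomes, where the simplex is one-dimensional (a sketch on a grid; discrete normalised weights stand in for the continuous density):

```python
import numpy as np

# On the simplex p1 + p2 = 1, the product of Poisson likelihoods with
# means A*p_j, renormalised over the simplex, should not depend on A and
# should equal the Dirichlet (here: Beta) weights with parameters f_j + 1.
f1, f2 = 7, 3
p1 = np.linspace(1e-6, 1.0 - 1e-6, 2001)
p2 = 1.0 - p1

def posterior(A):
    w = np.exp(-A * (p1 + p2)) * (A * p1) ** f1 * (A * p2) ** f2
    return w / w.sum()          # renormalise over the (discretised) simplex

post_small_A, post_big_A = posterior(0.5), posterior(50.0)
assert np.allclose(post_small_A, post_big_A)   # A cancels out completely

dirichlet = p1 ** f1 * p2 ** f2                # Dirichlet(f1 + 1, f2 + 1)
dirichlet = dirichlet / dirichlet.sum()
assert np.allclose(post_small_A, dirichlet)
```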
5. Non-POVM Measurements
As is well known, the most general measurement one can perform is a POVM measurement, described by POVM elements: positive semi-definite operators that add up to the identity operator. In practical experiments, however, one is not bound to implement the full POVM. For example, one could just implement one element of the POVM at a time, make N measurements with it, and leave the other elements for subsequent runs. Under the assumption that the measurements are always made on identically prepared states, this makes no difference to the end result (the vector of outcome frequencies). Because of this, one can simulate measurements that cannot be performed in a single shot, namely POVM measurements where the elements do not add up to the identity, as long as the elements Π^{(j)} themselves obey the condition 0 ≤ Π^{(j)} ≤ 1. If the elements add up to a multiple of the identity, A drops out of the calculations, just as in the case of proper POVMs. When the elements do not add up to a scalar matrix, this is no longer the case, and the calculations become more complicated. The brightness factor A is now a so-called nuisance parameter and has to be removed from the likelihood function by integrating it out, as shown below.

This situation has already been considered in [33, 34] for maximum likelihood reconstruction. It was noted there that the sum of the POVM elements, the matrix Π^{(0)} := ∑_j Π^{(j)}, determines the field of view of the tomography experiment. That is, if Π^{(0)} is supported only on a restricted subspace, the reconstructed state will also be supported on that subspace only. The tomography will be "blind" to state components outside that subspace. Moreover, the eigenvalues of Π^{(0)} determine the sensitivity of the tomography along the corresponding eigenvectors. The larger the eigenvalue, the more accurate the tomography in that direction will be.

When the POVM elements Π^{(j)} no longer add up to the identity, the corresponding "probabilities" p_j = Tr[ρ Π^{(j)}] do not add up to 1 either. Let us then define p₀ = ∑_{j=1}^d p_j.
Similarly, define $f_0 = \sum_{j=1}^d f_j$. The maximum likelihood reconstruction method can be extended to cover this situation by renormalising the $p_j$: one just replaces $p_j$ by $p_j/p_0$ in the original MaxLik formulas. This so-called extended maximum likelihood principle was first suggested by Fermi (see [35], p. 90). For our reconstruction method, however, we need the mean and variance of the likelihood function, and not just its mode, and these depend on $\Pi^{(0)}$ in a more complicated way.

If the POVM elements add up to a multiple of the identity, $\Pi^{(0)} = M\mathbb{1}$, then $p_0 = M$, a constant. Otherwise, $p_0$ is not a constant, but depends on the state $\rho$. The PDF of the corresponding measurement outcomes (CW case) is the product of Poissonians
\[
f_{F|P}(f|p) = \prod_{j=1}^d \exp(-Ap_j)\,\frac{(Ap_j)^{f_j}}{f_j!} = \exp(-Ap_0)\,A^{f_0} \prod_{j=1}^d \frac{p_j^{f_j}}{f_j!}.
\]
The corresponding likelihood function is proportional to this, and can be converted to a PDF by normalising over the set of allowed values of $P$. Now this is exactly the problem: what should the probability space $\mathcal{P}$ of $P$ look like when the $p_j$ no longer add up to 1? When $p_0$ is a constant, $p_0 = M$, it is reasonable to take the set of non-negative vectors adding up to $M$ as probability space. Then the normalisation constant becomes
\[
\exp(-AM)\,A^{f_0} \int_{\mathcal{P}} \mathrm{d}p_1 \cdots \mathrm{d}p_d \prod_{j=1}^d \frac{p_j^{f_j}}{f_j!},
\]
and all references to $A$ indeed cancel. The end result is then a Dirichlet distribution in terms of the probability vector $p/M$. The mean and covariance matrix of $P$ are thus given by the formulas for the Dirichlet moments, multiplied by $M$ and $M^2$, respectively.

The remainder of this subsection can be skipped on first reading, and can be skipped altogether if one always makes sure that the POVM elements used add up to a multiple of the identity; this design rule is recommended.
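For the recommended case just described, where the POVM elements sum to $M\mathbb{1}$, the posterior moments are available in closed form from the Dirichlet distribution with parameters $f_j + 1$. A minimal numerical sketch (the function name is ours, not from the paper):

```python
import numpy as np

def dirichlet_posterior_moments(f, M=1.0):
    """Posterior mean and covariance of the outcome-probability vector P
    when the POVM elements sum to M*identity, so that p_0 = M is constant.
    P/M is then Dirichlet with parameters alpha_j = f_j + 1; means scale
    with M, covariances with M**2."""
    f = np.asarray(f, dtype=float)
    alpha = f + 1.0                      # Dirichlet parameters
    a0 = alpha.sum()                     # = N + d
    mean = M * alpha / a0
    cov = (M**2) * (np.diag(alpha) * a0 - np.outer(alpha, alpha)) \
          / (a0**2 * (a0 + 1.0))
    return mean, cov

# Example: d = 3 outcomes with counts f = (10, 5, 0)
mean, cov = dirichlet_posterior_moments([10, 5, 0], M=1.0)
```

Since $\sum_j P_j = M$ is fixed, every row of the covariance matrix sums to zero; this is a useful sanity check on any implementation.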
When $p_0$ is not a constant, this magical cancellation of $A$ no longer takes place. The standard way to deal with this in Bayesian inference is to consider $A$ as a random variable, too, and calculate the joint distribution of $A$ and $P$:
\[
f_{A,P|F}(a,p|f) \propto \exp(-ap_0)\,a^{f_0} \prod_{j=1}^d \frac{p_j^{f_j}}{f_j!}.
\]
Since we are not really interested in $A$ (it is a nuisance parameter), we then take the marginal distribution of $P$ by integrating out $a$. Using the integral $\int_0^\infty \mathrm{d}a\, \exp(-p_0 a)\, a^{f_0} = f_0!/p_0^{f_0+1}$, this gives
\[
f_{P|F}(p|f) \propto \frac{f_0!}{p_0^{f_0+1}} \prod_{j=1}^d \frac{p_j^{f_j}}{f_j!}.
\]
A second problem that occurs when $p_0$ is not a constant is the geometrical shape of the probability space $\mathcal{P}$. In principle, this shape can be derived from the relations $p_j = \mathrm{Tr}[\rho\,\Pi^{(j)}]$, but this nearly always yields a complicated set, and integrating over it is extremely difficult (as has been remarked before). For example, let $\rho$ be a qubit state
\[
\rho = \begin{pmatrix} x & z \\ \bar{z} & 1-x \end{pmatrix},
\]
and take the POVM elements
\[
\Pi^{(1)} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad
\Pi^{(2)} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\Pi^{(3)} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
\Pi^{(4)} = \frac{1}{2}\begin{pmatrix} 1 & i \\ -i & 1 \end{pmatrix}.
\]
Then $\mathcal{P}$ is defined by $p_1 + p_2 = 1$ and $(p_1 - 1/2)^2 + (p_3 - 1/2)^2 + (p_4 - 1/2)^2 \le 1/4$ (essentially a Bloch sphere). The reader is invited to try and integrate the function $p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} / p_0^{f_0+1}$ over this set.

As before, we propose to integrate over the smallest set containing $\mathcal{P}$ and giving easy integrals, and to restrict to the physical set in a later phase. The easiest way to do this is to first fix $p_0$ and include all points in the simplex $\mathcal{P}(p_0) := \{p : p_j \ge 0,\ \sum_{j=1}^d p_j = p_0\}$, perform all the calculations conditional on this assumption, and then average over the range of $p_0$. This range can be easily determined from the extremal eigenvalues of $\Pi^{(0)} := \sum_j \Pi^{(j)}$.
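The qubit example above can be checked numerically. A sketch, assuming the standard Pauli-direction elements shown in the example (the element matrices as reconstructed here, and the function names, are ours):

```python
import numpy as np

# The four single-qubit POVM-like elements from the example above
# (projectors onto the z directions plus two Pauli half-projectors;
# each obeys 0 <= Pi <= 1, but they do NOT sum to a multiple of 1).
Pi = [np.array([[1, 0], [0, 0]], dtype=complex),
      np.array([[0, 0], [0, 1]], dtype=complex),
      0.5 * np.array([[1, 1], [1, 1]], dtype=complex),
      0.5 * np.array([[1, 1j], [-1j, 1]], dtype=complex)]

def outcome_probs(rho):
    """p_j = Tr[rho Pi_j] for a given density matrix rho."""
    return np.array([np.trace(rho @ P).real for P in Pi])

# The range of p0 = sum_j p_j follows from the extremal eigenvalues of Pi0:
Pi0 = sum(Pi)
m, M = np.linalg.eigvalsh(Pi0)[[0, -1]]

# Example state with x = 0.7, z = 0.1 + 0.2i
x, z = 0.7, 0.1 + 0.2j
rho = np.array([[x, z], [np.conj(z), 1 - x]])
p = outcome_probs(rho)
```

For any valid state, `p` satisfies both defining conditions of $\mathcal{P}$ (the sum rule $p_1 + p_2 = 1$ and the Bloch-sphere inequality), and $p_0$ falls inside the eigenvalue range $[m, M]$ of $\Pi^{(0)}$.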
Let $m$ and $M$ be the smallest and largest eigenvalues of $\Pi^{(0)}$, respectively; then $m \le p_0 \le M$. To average over $p_0$ we need a measure; to get $\mathcal{P} = \bigcup_{m \le p_0 \le M} \mathcal{P}(p_0)$, with all points in the set equally weighted, this measure has to be proportional to the volume of the simplex $\mathcal{P}(p_0)$. As this volume is proportional to $p_0^{d-1}$, the measure is $p_0^{d-1}\,\mathrm{d}p_0/K$, with
\[
K = \int_m^M p_0^{d-1}\,\mathrm{d}p_0 = (M^d - m^d)/d.
\]
For fixed $p_0$, the calculations show that $P/p_0$ follows a Dirichlet distribution, with $N = \sum_j f_j$. Thus the moments of $P$ are:
\begin{align*}
\mu(P_i \mid P_0 = p_0) &= p_0\,\frac{f_i + 1}{N + d} \\
\mu(P_i P_j \mid P_0 = p_0) &= p_0^2\,\frac{(f_i + 1)(f_j + 1)}{(N + d)(N + d + 1)} \quad (i \ne j) \\
\mu(P_i^2 \mid P_0 = p_0) &= p_0^2\,\frac{(f_i + 1)(f_i + 2)}{(N + d)(N + d + 1)}.
\end{align*}
Now we average over $P_0$. The average of $P_0^k$ is
\[
\mu(P_0^k) = \frac{\int_m^M \mathrm{d}p_0\, p_0^{d-1}\, p_0^k}{(M^d - m^d)/d} = \frac{d}{d+k}\,\frac{M^{d+k} - m^{d+k}}{M^d - m^d} =: \phi_k M^k,
\]
where we have defined
\[
\phi_k := \frac{d}{d+k}\,\frac{1 - (m/M)^{d+k}}{1 - (m/M)^d}.
\]
$\phi_k$ lies between $d/(d+k)$ and $1$, obtained for $m = 0$ and $m = M$, respectively. Hence we get
\begin{align}
\mu(P_i) &= M\phi_1\,\frac{f_i + 1}{N + d} \tag{23a}\\
\mu(P_i P_j) &= M^2\phi_2\,\frac{(f_i + 1)(f_j + 1)}{(N + d)(N + d + 1)} \tag{23b}\\
\mu(P_i^2) &= M^2\phi_2\,\frac{(f_i + 1)(f_i + 2)}{(N + d)(N + d + 1)}. \tag{23c}
\end{align}
This yields the covariance matrix elements via the relations
\begin{align}
\sigma_{i,j}(P) &= \mu(P_i P_j) - \mu(P_i)\mu(P_j) \tag{24a}\\
\sigma_{i,i}(P) &= \mu(P_i^2) - \mu(P_i)^2. \tag{24b}
\end{align}
For the extreme case $m = 0$, this gives
\begin{align*}
\sigma_{i,j}(P) &= M^2\,\frac{d\,a\,(f_i + 1)(f_j + 1)}{(d+1)^2(d+2)(N+d)^2(N+d+1)} \\
\sigma_{i,i}(P) &= M^2\,\frac{d\,(f_i + 1)(a f_i + b)}{(d+1)^2(d+2)(N+d)^2(N+d+1)} \\
a &:= N - d^2 - d \\
b &:= (d^2 + 2d + 2)N + d^3 + d^2.
\end{align*}
As a small check, for $m = M$ (i.e. $\Pi^{(0)} = M\mathbb{1}$), $\phi_1 = \phi_2 = 1$, and the mean and covariance matrix reduce to the Dirichlet moments multiplied by $M$ and $M^2$, respectively.

C. Choosing a Prior
In this Section, we tackle the problem of choosing an appropriate prior distribution, as required for starting the Kalman filter process. We also have to solve the problem of restricting the solution of the Kalman filter to physical state space. Both problems could have been solved in one go by choosing the prior to be a uniform distribution over state space, setting it equal to 0 outside of it. Unfortunately, Kalman filtering requires a Gaussian prior, and we leave the solution of the restriction problem to the next Section. In this Section, therefore, we ignore the restriction to physical state space.

When we have no prior information about the quantum state apart from the tomography data, we have to construct a prior that reflects this total lack of knowledge. Moreover, to allow for the application of Kalman filtering, this prior has to be Gaussian. One such prior could be a Gaussian with an infinite covariance (the mean would then be irrelevant): $\Sigma_0 = \infty\,\mathbb{1}$. In practice, the infinite width is replaced by a finite value $b$, giving $\Sigma_0 = b\,\mathbb{1}$, where $b$ should be big enough, but also not too big.

But what do big enough and not too big mean? Fortunately, as we are dealing with quantum state estimation, we know that the state belongs to a bounded set: its eigenvalues are positive numbers summing up to 1. To illustrate this, let us restrict to diagonal $d$-dimensional states, i.e. probability distributions. With the choice $\Sigma_0 = b\,\mathbb{1}$, the squared 2-norm distance between two such distributions $p$ and $q$ is given by $\|p - q\|^2/b$. (Why we take the 2-norm distance will become clear in the next Section.) The maximum value is therefore $2/b$. To minimise the influence of the prior, this distance should be small enough, and certainly much smaller than $d$. Hence, we need $b \gg 2/d$. In our numerical experiments, we have chosen the value $b = 1$.

For the mean of the prior, the best choice is to take a state "in the middle" of state space. For distributions this would be the uniform distribution $(1/d, \ldots, 1/d)$; for quantum states, the maximally mixed state $\rho = \mathbb{1}/d$. More generally, one could take the state that has the largest entropy within the physical set.

An alternative solution to the problem of choosing a prior is based on the observation that the Bayesian update equation (10) is basically a multiplication, and therefore all Bayesian updates commute. We can therefore start with any suitable prior and divide it out again after all Kalman filter updates have been performed. This amounts to the same thing as starting off with the infinite-width prior. This division is easy when the covariance matrix of the chosen prior is a scalar matrix, $\Sigma_0 = b\,\mathbb{1}$, with some finite choice of $b$.

Let $\mu_0$ represent some fixed state, and let us consider measurement parameters of a very specific form, $z = H\mu_0$ and $\Theta = b\,HH^*$. In that case the Kalman update equations simplify to
\begin{align}
\mu' &= b\,(b\mathbb{1} + \Sigma)^{-1}\mu + \Sigma\,(b\mathbb{1} + \Sigma)^{-1}\mu_0 \tag{25}\\
\Sigma' &= b\,(b\mathbb{1} + \Sigma)^{-1}\Sigma. \tag{26}
\end{align}
Using this, it is easy to calculate that starting off the Kalman filtering sequence with the "infinite width" Gaussian prior ($\mu = 0$ and $\Sigma = \infty\,\mathbb{1}$) and applying the Kalman update step ($z = H\mu_0$ and $\Theta = b\,HH^*$) yields as updated state a Gaussian with $\mu' = \mu_0$ and $\Sigma' = b\,\mathbb{1}$. Starting off with this Gaussian as prior ($\mu = \mu_0$ and $\Sigma = b\,\mathbb{1}$) is therefore equivalent to starting off with an infinitely wide prior and applying this particular Kalman update step once, anywhere during the sequence (anywhere, because of commutativity). In particular, this update can be done at the end of the sequence.

Undoing the narrow prior can therefore be done after the final Kalman update by applying the inverses of equations (25) and (26). Denoting by $\mu$ and $\Sigma$ the quantities obtained at the end of the Kalman filter sequence, and by $\mu_{\mathrm{corr}}$ and $\Sigma_{\mathrm{corr}}$ the corrected ones, with infinite prior, we have the correction equations
\begin{align}
\Sigma_{\mathrm{corr}} &= (\Sigma^{-1} - \mathbb{1}/b)^{-1} \tag{27}\\
\mu_{\mathrm{corr}} &= \mu + (\Sigma_{\mathrm{corr}}/b)(\mu - \mu_0). \tag{28}
\end{align}
In practice, one could for example choose $\mu_0$ to be a representation of the maximal entropy (maximally mixed) quantum state ($\mathbb{1}/d$).

The problem with these correction equations is the extreme sensitivity of $\mu_{\mathrm{corr}}$ to even the tiniest variations in $\mu$ when $\Sigma_{\mathrm{corr}}$ has very large components. While this is not necessarily a numerical artifact (large uncertainties on certain components of the covariance should go hand in hand with equally large uncertainties on the mean), it may cause numerical problems further down the line. For that reason, we try to avoid this situation by choosing a slightly larger value for $b$ in the correction equations than was used in the construction of the prior. To obtain a corrected covariance matrix $\Sigma_{\mathrm{corr}}$ with an upper bound $\sigma$ on the variances, we can choose a value $b'$ satisfying $1/b - 1/b' = 1/\sigma$.

D. Calculation of the Confidence Region
The mean value and covariance matrix that we have been able to calculate using the Kalman filter method are not ends in themselves. One possible use is to calculate the mean and variance of certain operators when applied to the state. In Ref. [5] it is shown that the mean value and variance of an operator $Z$ depend on the mean state $\mu$ and state covariance $\Sigma$ (the inverse of the Fisher information matrix $F$) via the relations
\[
\langle z \rangle = \mathrm{Tr}\,\mu Z, \qquad \langle (\Delta z)^2 \rangle = z^* \Sigma z,
\]
where $z$ is a vector representation of the operator $Z$. The error bars on $Z$ can then be derived by setting appropriate confidence levels.

In this subsection, we derive more generally an expression for the confidence region for the complete state, corresponding to the reconstructed mean value and covariance. The confidence region is the region around the mean value obtained from the Kalman filter procedure within which the actual state can be found "with high probability". The value of this probability is called the confidence value and is typically chosen to be 95%. Stated otherwise, the probability that the actual state is outside the confidence region should be "low", e.g. 5%.

For the multivariate normal distribution, with mean $\mu$ and covariance matrix $\Sigma$, the confidence region is an ellipsoid centered around the mean. This is quite clear, as the surfaces where the PDF assumes a constant value are governed by the quadratic equation $(x - \mu)^* \Sigma^\dagger (x - \mu) = c$ (note the Moore-Penrose inverse, as required when there are exact linear constraints; see Section III G). The quantity on the left-hand side is the squared Mahalanobis distance $M^2$ between the points $x$ and $\mu$, as defined by (3). The surface of the confidence region is thus the set of points at a certain Mahalanobis distance from the mean.
To find which value $M^2$ should take for which confidence value, we have to consider the distribution of $M^2$. It is well known that the squared Mahalanobis distance has a chi-squared distribution with $\nu$ degrees of freedom (DoF): $M^2 \sim \chi^2_\nu$, where $\nu$ is the rank of $\Sigma$, equalling the dimension of $x$ minus the number of independent exact constraints (zero-variance components). A proof of this basic fact goes as follows:

Proof.
Suppose $\Sigma$ has rank $\nu$ and is bounded (all eigenvalues are finite); then its MP inverse has rank $\nu$ as well and can therefore be written as $\Sigma^\dagger = Q^* Q$, where $Q$ is a $\nu \times d$ matrix. Introduce the $\nu$-dimensional random vector $U = Q(X - \mu)$. Then the entries of this vector are independent and distributed as $U_i \sim_{\mathrm{ind}} N(0, 1)$. The sum of squares of these entries is then, by definition, $\chi^2_\nu$ distributed [36]. Since the squared Mahalanobis distance $M^2$ is equal to
\[
M^2 = (x - \mu)^* \Sigma^\dagger (x - \mu) = (x - \mu)^* Q^* Q (x - \mu) = u^* u = \sum_i u_i^2,
\]
it follows that, indeed, $M^2 \sim \chi^2_\nu$. $\square$

We summarise the main properties of the chi-squared distribution [36]. The PDF as a function of $x$, with $x \ge 0$, is given by
\[
\frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp(-x/2),
\]
and the cumulative distribution function (CDF) by
\[
1 - \frac{\Gamma(\nu/2,\, x/2)}{\Gamma(\nu/2)},
\]
where $\Gamma(a, y)$ is the incomplete gamma function. The mean of a variable $X \sim \chi^2_\nu$ is $\nu$ and its variance is $2\nu$. The variable $\sqrt{X}$ itself is distributed as $\sqrt{X} \sim \chi_\nu$. For not too small values of $\nu$, $\sqrt{X}$ is approximately normal with mean $\sqrt{\nu - 1/2}$ and variance $1/2$. The 95% confidence interval of $\sqrt{X}$ is therefore approximately given by $0 \le \sqrt{x} \le \sqrt{\nu - 1/2} + x_0$, where $x_0 = \mathrm{InvErf}(0.9) \approx 1.1631$, the root of $[1 + \mathrm{Erf}(x)]/2 = 0.95$. Even for the smallest $\nu$ that we will encounter, this approximation turns out to be very good; for $\nu = 3$ (a single qubit state, for example) the value $\sqrt{x} \le \sqrt{3 - 1/2} + 1.1631$ yields the only slightly smaller confidence value of 94.3%.

One immediately obtains that 95% of the probability mass of a multivariate normal is contained in the ellipsoid consisting of points whose $M^2$ lies in the 95% $\chi^2_\nu$ confidence interval. In other words, the 95% confidence region is the ellipsoid
\[
M^2 := (x - \mu)^* \Sigma^\dagger (x - \mu) \le \left(\sqrt{\nu - 1/2} + 1.1631\right)^2 =: \gamma_\nu. \tag{29}
\]
This formula lies at the heart of many statistical procedures, for example outlier detection.

Now, since we are approximating the actual posterior PDF by a Gaussian, the true confidence region will be different from the one just obtained.
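The quality of the closed-form threshold (29) is easy to check numerically against the exact chi-squared CDF; a sketch for the $\nu = 3$ (single qubit) case quoted above, using plain trapezoidal integration of the $\chi^2_\nu$ density (helper names are ours):

```python
import math

def chi2_pdf(x, nu):
    """PDF of the chi-squared distribution with nu degrees of freedom."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

def chi2_cdf(x, nu, steps=100000):
    """CDF by simple trapezoidal integration; adequate for a sanity check."""
    h = x / steps
    s = 0.5 * (chi2_pdf(1e-12, nu) + chi2_pdf(x, nu))
    for k in range(1, steps):
        s += chi2_pdf(k * h, nu)
    return s * h

nu = 3
gamma_nu = (math.sqrt(nu - 0.5) + 1.1631) ** 2  # approximate 95% threshold, Eq. (29)
coverage = chi2_cdf(gamma_nu, nu)               # close to 0.943, slightly below 0.95
```

The approximate threshold lands just below the exact 95% quantile of $\chi^2_3$ (about 7.81), reproducing the 94.3% coverage stated in the text.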
However, we show in Appendix B 2 that the difference will not be dramatic. Even in the worst case, the actual distribution of $M^2$ is very close to chi-squared, but has a variance that is larger by a factor of about 30% (unless $N$, the number of measurements per run, is large). (In Matlab, the chi-squared CDF can be calculated using the built-in function gammainc(x/2,nu/2).) Inflating the variance in the confidence limit (29) by this factor gives the more conservative value
\[
\gamma_\nu = \left(\sqrt{\nu - 1/2} + \sqrt{1.3} \times 1.1631\right)^2. \tag{30}
\]

E. Restricting to Physical State Space
In this Section and the next, we treat the problem of restricting the solution of the Kalman filter to the physical region $S$; when the state is a quantum state, this means restricting solutions to state space, the set of positive semidefinite matrices of trace 1. We will perform this restriction as a post-processing step after the Kalman filtering calculations.

At the level of the PDF, the restriction involves setting the values of the obtained PDF equal to zero outside $S$, and then renormalising the PDF (as it should now integrate to 1 over $S$ instead of the whole space). The whole art is to determine the value of the new normalisation factor. This requires the integration of the posterior PDF over $S$, which is a very complicated problem, especially in high dimensions; this problem is known as the Bayesian integration problem. Likewise, similar integrals are necessary to obtain the moments of the restricted PDF.

Various numerical methods have been proposed to approximate such integrals; for an overview see [37, 38]. The particular problem faced here is that the intersection of the unconstrained confidence region with the physical set has extremely low volume, both within the confidence region and within the physical set, in part due to the high dimensionality of the problem. This turns out to be a very challenging situation for all existing integration methods.

In the following, we present two approximate methods. The first method is very simple but rather crude, and it actually circumvents the Bayesian integration problem. It is not a very powerful method, because it only allows one to check whether a state is in the restricted confidence region. For some situations this might already be enough. If one desires to know the shape of the restricted confidence region, e.g. via its moments, then this method does not suffice.
Nevertheless, the method is extremely simple to apply. A second method, discussed in the next Section, is more powerful and yields an approximation of the restricted confidence region in explicit form. Perhaps not surprisingly, it is again based on Kalman filtering and yields an approximate confidence region expressed by a mean value and a covariance matrix. This method, while still in its experimental stages, seems to work amazingly well in practice.

The simple method consists of keeping the first and second moments of the unphysical posterior PDF but modifying the $M^2_{CR}$ value to take the renormalisation over the physical set into account. Furthermore, rather than calculating the exact value of the new $M^2_{CR}$, a conservative upper bound is taken that is easy to calculate.

The method consists of two parts, of which the first one is optional. First, the maximum likelihood (MaxLik) solution is calculated. That is, the physical state for which the unphysical posterior PDF is maximal is calculated. This corresponds to the following semidefinite program: find the minimal value of $t$ for which a state $\rho$ exists satisfying the mixed semidefinite/quadratic constraints
\[
\rho \ge 0, \quad \mathrm{Tr}\,\rho = 1, \quad (\rho - \mu)^* \Sigma^\dagger (\rho - \mu) \le t.
\]
For this minimal $t$, the state $\rho$ in question is the MaxLik solution.

Even though very efficient semidefinite program solvers exist [39], this part can still be very time consuming when the dimension of the state is high. Nevertheless, finding the MaxLik solution is interesting enough in its own right to warrant inclusion of this part. After all, this solution is what most reconstruction algorithms try to find. In our context, the MaxLik solution also allows one to check the validity of the tomography data. Indeed, given that the MaxLik solution corresponds to the best physical "guess" of the actual state, the former should lie within the "raw" (i.e. unconstrained) confidence region allowed by the measurements.
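This validity check is a one-liner once $\mu$ and $\Sigma$ are known. A sketch (the example numbers and names are invented), using the 95% threshold $\gamma_\nu$ of Eq. (29) with $\nu = \mathrm{rank}\,\Sigma$:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance, using the MP inverse to allow
    exactly constrained (zero-variance) directions."""
    d = x - mu
    return float(d @ np.linalg.pinv(Sigma) @ d)

# Toy 3-outcome example (diagonal states): the raw mean has a slightly
# negative first component; a clipped, renormalised candidate plays the
# role of the MaxLik solution.
mu = np.array([-0.02, 0.52, 0.50])
Sigma = 0.01 * np.eye(3)
rho_ML = np.array([0.0, 0.52, 0.48])

nu = 3
gamma_nu = (np.sqrt(nu - 0.5) + 1.1631) ** 2
M2_ML = mahalanobis_sq(rho_ML, mu, Sigma)
data_ok = M2_ML <= gamma_nu    # is the MaxLik solution inside the raw region?
```

If `data_ok` came out `False`, this would flag a problem with the data or the noise model, as described in the text.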
Thus, the squared Mahalanobis distance between the MaxLik solution $\rho_{ML}$ and the mean of the unconstrained posterior PDF should be below $M^2_{CR}$. If it is not, this could be an indication that something is wrong, either with the data or with the underlying assumptions (e.g. the noise model).

For the second step, we consider the squared Mahalanobis distance just calculated,
\[
M^2_{ML} := (\rho_{ML} - \mu)^* \Sigma^\dagger (\rho_{ML} - \mu). \tag{31}
\]
The confidence region for the constrained PDF is the intersection of the physical set with the ellipsoid $(\rho - \mu)^* \Sigma^\dagger (\rho - \mu) \le M^2_{CR,\mathrm{phys}}$, where $M^2_{CR,\mathrm{phys}}$ is the confidence value for the constrained posterior PDF. Note that $\mu$ and $\Sigma$ are still the moments of the unconstrained PDF. Because the constrained posterior has to be normalised over the physical set only, $M^2_{CR,\mathrm{phys}}$ will be larger than $M^2_{CR}$. Calculating the exact value of $M^2_{CR,\mathrm{phys}}$ is an extremely difficult problem, but we can prove the validity of a very simple upper bound:
\[
M^2_{CR,\mathrm{phys}} \le M^2_{CR,\mathrm{unphys}} + M^2_{ML}, \tag{32}
\]
with $M^2_{ML}$ given by Eq. (31). The proof of this bound is given in Appendix A.

In case one does not even want to calculate the MaxLik solution, and one is willing to believe that this solution is in the unconstrained confidence region, $M^2_{ML}$ can be replaced by its (presumed) upper bound $M^2_{CR,\mathrm{unphys}}$, giving the simple result that the constrained confidence limit is at most twice the unconstrained one:
\[
M^2_{CR,\mathrm{phys}} \le 2\,M^2_{CR,\mathrm{unphys}}. \tag{33}
\]
In conclusion, the simple method consists of the following steps:

1. Depending on resources and taste, choose between steps 2 or 3, then proceed to step 4.

2. Calculate the physical MaxLik solution, i.e. the solution in $P$ that minimises the Mahalanobis distance to the (unphysical) $\mu$. Record $M^2_{ML}$, the minimal squared Mahalanobis distance just found.

3. Or, just take $M^2_{ML} = M^2_{CR,\mathrm{unphys}}$.

4. A physical state $\hat\rho$ is in the physical confidence region if its squared Mahalanobis distance to $\mu$ is not (much) above $M^2_{CR,\mathrm{unphys}} + M^2_{ML}$.

F. Restricting to Physical State Space; Kalman Filter Method
Let us now move on to our second method for restricting the posterior PDF to physical state space. It is a more involved method, but it gives more information. To simplify the discussion, consider an example where $X$ is a $d$-dimensional real variable, and the physical region consists of the positive orthant $x_i \ge 0$, $i = 1, \ldots, d$. We assume that, after incorporating the measurement data, the unrestricted PDF of $X$ is (approximately) given by its mean $\mu$ and its covariance matrix $\Sigma$. We assume that the corresponding confidence region is not completely contained in the physical region; otherwise, there would be nothing to do here.

Consider now the marginal distributions of each of the components $X_i$ of $X$. The marginal distribution of $X_i$ is of course normal, with mean $\mu_i$ and variance $\Sigma_{ii}$. We can then easily calculate the confidence interval of each $X_i$ for given confidence levels. There will be at least one such $X_i$ whose confidence interval is not completely contained in the physical interval $x_i \ge 0$. Broadly speaking, this is the one-dimensional marginal version of our restriction problem.

The first key idea of our proposal is to consider the marginal distribution of this $X_i$. If more than a fixed amount $\alpha$ of the probability mass of this marginal lies outside the physical interval $X_i \ge 0$, we truncate the marginal to that interval. That means we set the density equal to 0 outside the physical interval and renormalise to 1. We then approximate this truncated normal distribution by an ordinary normal distribution, with appropriate mean $\tilde\mu$ and variance $\tilde\sigma^2$. How we do this is described in the next subsection.
1. The marginal restriction problem
The obvious idea of using moment matching to approximate the truncated normal is of no use here, because we need a procedure that gives a stable result when applied twice or more. Approximating a truncated normal with moment matching does not yield a distribution with controlled tails (the tail being that part of the distribution outside the physical interval). Truncating it again and approximating that truncated normal with moment matching for a second time will typically yield yet another set of values for mean and variance. There is no guarantee at all that this process would converge. For that reason, we seek an approximation procedure that controls the tail probability explicitly.
FIG. 1: Illustration of the approximation process of Sec. III F 1. Normal distribution with moments $\mu = -1$ and $\sigma = 1$ (full line). The corresponding truncated normal distribution, truncated to $x \ge 0$ (dashed line). The best approximating normal distribution with 5% of probability mass outside the interval $x \ge 0$ (dotted line).

We will require that the approximating normal distribution has exactly an amount $\alpha$ of probability mass outside the physical interval $x \ge 0$, where $\alpha$ is small, say 5%. For an illustration of this approximation process, see Fig. 1.

As a quality measure of the approximation, we will again use the Kullback-Leibler distance, $D_{KL}(p\|q)$, with the truncated distribution as first argument (thus, with integration interval restricted to $x \ge 0$) and the approximating distribution as second argument. Without the restriction on the approximating distribution, this would result in the moment-matched approximation.

Minimising this distance over all choices of approximating distributions amounts to maximising the following function over the parameters $\tilde\mu$ and $\tilde\sigma$:
\[
\int_0^\infty \mathrm{d}x\ \phi(x; \mu, \sigma) \log \phi(x; \tilde\mu, \tilde\sigma).
\]
The requirement we imposed on the probability mass in the left tail of the approximating normal translates to the equality
\[
\tilde\mu = x_\alpha \tilde\sigma, \tag{34}
\]
where $x_\alpha = \sqrt{2}\,\mathrm{InvErf}(1 - 2\alpha)$. For $\alpha = 5\%$ we find $x_\alpha \approx 1.6449$. After some algebraic manipulations, one finds the following complicated set of formulas for the optimal $\tilde\sigma$:
\begin{align}
\tilde\sigma &= \sigma\left(-x_\alpha g + \sqrt{x_\alpha^2 g^2 - c}\right) \tag{35a}\\
c &= t\tau - (1 + t^2) \tag{35b}\\
g &= (\tau - t)/2 \tag{35c}\\
\tau &= 2\exp(-t^2/2)\Big/\left[\sqrt{2\pi}\,\mathrm{Erfc}(t/\sqrt{2})\right] \tag{35d}\\
t &= -\mu/\sigma. \tag{35e}
\end{align}
In Fig. 2, we plot the ratio $\tilde\sigma/\sigma$ as a function of $t$.

FIG. 2: The ratio $\tilde\sigma/\sigma$ plotted as a function of $t$ [Eq. (35)], for $x_\alpha \approx 1.645$ and $t \ge -x_\alpha$.

This one-dimensional solution now has to be translated back to the original setting of restricting the state vector $X$ to the physical set. As this translation is basically an inverse problem involving normal distributions only, we will again use a Kalman filter to solve it, as explained in the next subsection.
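Equations (34)-(35) are straightforward to implement. A sketch (function name ours), checking that the resulting normal indeed leaves exactly $\alpha = 5\%$ of its mass below zero:

```python
import math

X_ALPHA = 1.6448536269514722  # sqrt(2)*InvErf(0.9): 5% lower-tail mass

def approx_truncated_normal(mu, sigma, x_alpha=X_ALPHA):
    """Best normal approximation (in the KL sense) to N(mu, sigma^2)
    truncated to x >= 0, subject to the tail condition (34).
    Implements Eqs. (34)-(35)."""
    t = -mu / sigma                                              # (35e)
    tau = 2 * math.exp(-t * t / 2) / (math.sqrt(2 * math.pi)
                                      * math.erfc(t / math.sqrt(2)))  # (35d)
    g = (tau - t) / 2                                            # (35c)
    c = t * tau - (1 + t * t)                                    # (35b)
    sigma_t = sigma * (-x_alpha * g + math.sqrt(x_alpha**2 * g**2 - c))  # (35a)
    mu_t = x_alpha * sigma_t                                     # (34)
    return mu_t, sigma_t

# The Fig. 1 situation: mu = -1, sigma = 1, truncated to x >= 0
mu_t, sigma_t = approx_truncated_normal(-1.0, 1.0)

# Probability mass of the approximating normal below zero:
tail = 0.5 * (1 + math.erf((0 - mu_t) / (sigma_t * math.sqrt(2))))
```

By construction the tail mass is exactly $\alpha$, so applying the procedure a second time leaves the result essentially unchanged, which is the stability property that plain moment matching lacks.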
2. Backprojecting the marginal restriction using Kalman filtering
Our second key idea is to enforce the original unrestricted distribution to have a marginal distribution (for component $X_i$) given by the one-dimensional approximating normal, with parameters $\tilde\mu$ and $\tilde\sigma$. Formally, this is equivalent to performing a linear one-dimensional measurement, and we can incorporate its effects on the state $X$ by a Kalman filter update step. The parameters of the measurement (measurement matrix $H$, mean $z$ and covariance $\Theta$) will have to be such that the effect of the measurement is the required enforcement of the marginal mentioned above.

As we only want to enforce the marginal of the $X_i$ component, we set $H$ equal to the row vector $\langle i|$. Thus, $Hx = x_i$. This is a one-dimensional measurement, so that its mean $z$ and covariance $\Theta$ are one-dimensional too. Correspondingly, the Kalman gain is a column vector. Inserting this into the Kalman filter update equations (20) gives
\begin{align*}
K &= \Sigma H^* (H\Sigma H^* + \Theta)^{-1} = \Sigma |i\rangle (\Sigma_{ii} + \Theta)^{-1} \\
\mu' &= \mu + K(z - H\mu) = \mu + (z - \mu_i) K \\
\Sigma' &= (\mathbb{1} - KH)\Sigma = \Sigma - K \langle i| \Sigma.
\end{align*}
The $X_i$ marginal of the updated distribution will then have moments $\mu'_i$ and $\Sigma'_{ii}$ given by
\begin{align*}
\mu'_i &= \mu_i + (z - \mu_i)\langle i|K = \mu_i + (z - \mu_i)\,\Sigma_{ii}(\Sigma_{ii} + \Theta)^{-1} \\
\Sigma'_{ii} &= \Sigma_{ii} - \langle i|K\,\Sigma_{ii} = \left[1 - \Sigma_{ii}(\Sigma_{ii} + \Theta)^{-1}\right]\Sigma_{ii}.
\end{align*}
We find the required values for the measurement parameters $z$ and $\Theta$ by solving the equations $\mu'_i = \tilde\mu$ and $\Sigma'_{ii} = \tilde\sigma^2$. The solution is
\begin{align}
\Theta &= (1/\kappa - 1)\,\Sigma_{ii} \tag{36}\\
z &= [\tilde\mu - (1 - \kappa)\mu_i]/\kappa \tag{37}\\
\kappa &= 1 - \tilde\sigma^2/\Sigma_{ii}. \tag{38}
\end{align}
Here, $\tilde\mu$ and $\tilde\sigma$ are given by equations (34) and (35), with $\mu = \mu_i$ and $\sigma = \sqrt{\Sigma_{ii}}$, the marginal moments of $X_i$ in the original distribution.
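Equations (36)-(38) can be verified directly: build $\Theta$ and $z$ from the targets, apply the rank-one Kalman update, and check that the $i$-th marginal of the result has exactly the prescribed moments. A sketch with invented example numbers:

```python
import numpy as np

def enforce_marginal(mu, Sigma, i, mu_t, var_t):
    """One Kalman update step forcing the X_i marginal of N(mu, Sigma)
    to become N(mu_t, var_t), using Eqs. (36)-(38). Returns (mu', Sigma').
    Requires var_t < Sigma[i, i] (the marginal is being narrowed)."""
    kappa = 1.0 - var_t / Sigma[i, i]           # (38)
    Theta = (1.0 / kappa - 1.0) * Sigma[i, i]   # (36)
    z = (mu_t - (1.0 - kappa) * mu[i]) / kappa  # (37)
    K = Sigma[:, i] / (Sigma[i, i] + Theta)     # Kalman gain (column vector)
    mu_new = mu + (z - mu[i]) * K
    Sigma_new = Sigma - np.outer(K, Sigma[i, :])
    return mu_new, Sigma_new

mu = np.array([-1.0, 2.0])
Sigma = np.array([[1.0, -0.7], [-0.7, 1.0]])
# Target marginal for X_1: the approximating normal of the Fig. 1 example
mu2, Sigma2 = enforce_marginal(mu, Sigma, 0, 0.6274, 0.3814**2)
```

Note that, because of the off-diagonal correlation, the update also shifts the other components; this is exactly the cross-talk that makes several passes of the procedure necessary, as discussed next.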
3. The Restriction Procedure
In the previous subsections, we have shown how a single variable $X_i$ can be restricted to its physical interval $X_i \ge 0$. In general, if the physical set is convex, it is defined by a number of such inequalities, possibly an infinite number. For simplicity, we first treat the case that the physical set is defined by a finite set of inequalities $X_i \ge 0$, $\forall i$, and treat the more general case below. This case corresponds, for example, to diagonal quantum states, and also to the optical POVM of Section V.

To restrict the complete state vector $X$ to the physical set, we repeat the above-mentioned procedure for every component of $X$, or at least for those components for which $\mu_i/\sigma_i < x_\alpha$. In general, however, because of correlations, multiple components of $X$ will be affected by a single step of the procedure, and it could very well be that the work of previous steps is partially undone by the current step. For example, forcing one component to be positive could bring another component back into the non-physical region.

Therefore, a number of runs of the algorithm will be necessary, stopping when all marginals have their confidence intervals approximately within the physical interval. As the quantity $\min_i \mu_i/\sigma_i$ converges to $x_\alpha$ from below, a good stopping criterion is $\min_i \mu_i/\sigma_i > (1 - \epsilon)x_\alpha$, where $\epsilon$ is a small positive number. In practice, $\epsilon$ should not be chosen too small, so that the algorithm terminates in reasonable time.

The order of the iterations, namely which marginal $X_i$ to treat first, does not seem to influence the end result very much. In one set of experiments we treated the marginals in fixed order, and in another we always chose the marginal with smallest $\mu_i/\sigma_i$ first. It is not clear that the latter order should converge faster, because of the correlations between the $X_i$; in our experiments it did so only marginally.
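Putting the pieces together, the whole restriction procedure for a positive orthant is a short loop. A sketch (the stopping slack and the example numbers are our own choices; the truncated-normal helper repeats Eqs. (34)-(35)):

```python
import math
import numpy as np

X_ALPHA = 1.6448536269514722   # 5% tail mass
EPS = 0.1                      # slack in the stopping criterion (our choice)

def approx_truncated_normal(mu, sigma, x_alpha=X_ALPHA):
    """KL-best normal approximation to N(mu, sigma^2) truncated to x >= 0,
    with tail condition mu~ = x_alpha * sigma~ (Eqs. (34)-(35))."""
    t = -mu / sigma
    tau = 2 * math.exp(-t * t / 2) / (math.sqrt(2 * math.pi)
                                      * math.erfc(t / math.sqrt(2)))
    g = (tau - t) / 2
    c = t * tau - (1 + t * t)
    sigma_t = sigma * (-x_alpha * g + math.sqrt(x_alpha**2 * g**2 - c))
    return x_alpha * sigma_t, sigma_t

def restrict_to_orthant(mu, Sigma, max_iter=100):
    """Repeat the marginal-restriction Kalman updates (Eqs. (36)-(38)),
    always treating the smallest mu_i/sigma_i first, until
    min_i mu_i/sigma_i > (1 - EPS)*X_ALPHA."""
    mu, Sigma = mu.copy(), Sigma.copy()
    for _ in range(max_iter):
        sig = np.sqrt(np.diag(Sigma))
        i = int(np.argmin(mu / sig))
        if mu[i] / sig[i] > (1 - EPS) * X_ALPHA:
            break
        mu_t, sigma_t = approx_truncated_normal(mu[i], sig[i])
        kappa = 1.0 - sigma_t**2 / Sigma[i, i]
        Theta = (1.0 / kappa - 1.0) * Sigma[i, i]
        z = (mu_t - (1.0 - kappa) * mu[i]) / kappa
        K = Sigma[:, i] / (Sigma[i, i] + Theta)
        mu = mu + (z - mu[i]) * K
        Sigma = Sigma - np.outer(K, Sigma[i, :])
    return mu, Sigma

# Two-dimensional example in the spirit of Fig. 3
mu0 = np.array([-1.0, 2.0])
Sigma0 = np.array([[1.0, -0.7], [-0.7, 1.0]])
mu_r, Sigma_r = restrict_to_orthant(mu0, Sigma0)
```

On this example the loop terminates after a handful of updates, with both marginals pushed inside the orthant.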
While this and other convergence issues are still under investigation, they appear not to be of major importance.

An illustration on a small example is shown in Fig. 3. The left graph shows the result of the restriction of $X_1$ to the physical interval $X_1 \ge 0$. One sees that the new distribution again crosses the border of the physical set, but now component $X_2$ is involved, although it was not before the update. A second Kalman filter update step will therefore be necessary, on $X_2$. The result of that second step is shown in the right graph of Fig. 3.

FIG. 3: Illustration of the Kalman filter update step of Section III F 2. The physical set consists of the positive orthant. We start with an unrestricted normal distribution whose PDF contour lines are plotted in full lines in the upper left of both graphs. The $X_1$ marginal of this distribution is the same one as plotted in Fig. 1. The normal distribution approximating the truncated marginal is reproduced here as the full-line curve at the bottom of the left diagram. After applying one step of the Kalman filter, with parameters given by Eqs. (36)-(38), we obtain a normal distribution whose PDF is plotted in the left diagram in dashed lines. In the graph on the right, the result of a second Kalman filter update step is shown, now for component $X_2$.

In the case that the number of inequalities defining the physical set is infinite, for example for the set of quantum states, where the inequalities are $\mathrm{Tr}\,\rho X \ge 0$, $\forall X \ge 0$, the fixed-order rule is obviously infeasible. The smallest-first rule, on the other hand, requires the complicated minimisation of $X^* \rho X / \sqrt{X^* P X}$ over all $X \ge 0$. A third, and much simpler, possibility is to choose $X$ at random. However, that method exhibits slow convergence, especially in the final stages, when the number of unsatisfied constraints becomes small.

A better option is to consider a combination of two rules in the following double iteration: the inner iteration consists of, given a unitary $U$, performing the restriction on the diagonal of $U\rho U^*$, that is, with $H = \mathrm{Vec}(U^* e_{ii} U)^*$, and varying $i$. This inner iteration can be performed either using the fixed-order rule $i = 1, \ldots, d$ or the smallest-first rule. The outer iteration consists of choosing a new random unitary $U$ each time, until a suitable stopping criterion is satisfied, e.g. until the inner iterations achieve no further reduction of $\mu/\sigma$. Although the smallest $\mu/\sigma$ component does not necessarily occur for $U$ diagonalising $\rho$, it is a good idea to choose the unitary $U$ diagonalising the current $\rho$ every now and then.

G. Incorporating Exact Linear Constraints
In many cases, the description of the physical state is subject to one or more exact constraints. For quantum states, the trace of the density matrix is 1; for trace-preserving quantum processes, the partial trace of the state representative over the output Hilbert space is the maximally mixed state $\mathbb{1}/d$; and for POVMs, the sum of all the elements must be the identity matrix. Depending on the physical system, there may be further constraints like this. These exact linear constraints can be incorporated into the reconstruction process in a number of ways.

A first approach is to incorporate exact constraints via dummy measurements with zero measurement covariance, and to replace inverses by Moore-Penrose (MP) inverses. The benefit of this method is that virtually no changes to the Kalman filter implementation are needed. A serious drawback is that the state covariance matrix becomes ill-conditioned, since exact constraints correspond to zero-variance components. In reality, numerical round-off causes these components to have non-zero variance, which makes it hard to discriminate between variances that are nominally zero and those that are not. This is a notorious problem for actual implementations of Kalman filters and may cause serious numerical instabilities. Later on in the calculations, the Mahalanobis distance $(x - \mu)^* \Sigma^{-1} (x - \mu)$ has to be calculated (see Sec. III D), and even the smallest deviation of $x - \mu$ from the exact constraints is blown up by the inverse of $\Sigma$.

A second approach is to store exact constraints in two additional matrices, along with the state mean and state covariance. In general, exact constraints may have an impact on the state but also on the measurement. For example, when the state is a quantum state, we have the exact constraint on the state $\mathrm{Tr}\,\rho = 1$, and a corresponding exact constraint on the measurement probabilities $\sum_i p_i = 1$. This implies that the difference between any two states, e.g.
the mean X and its update X′, should lie in a subspace, namely the one for which the trace is zero. Similarly, the difference between two measurements, e.g. the actual measurement outcome z and the expected outcome Hµ, should also lie in a subspace, namely the one for which the sum of all components is zero.

Both subspaces can be represented in calculations by two projectors, T_X and T_Z. The projector T_X projects on the subspace in state space, and T_Z on the subspace in measurement space. The Kalman filter update equations can be made more resistant to numerical inaccuracies using these projectors, ensuring that the exact constraints are obeyed in any iteration of the update process, as follows:

y = T_Z (z − Hµ) (39a)
S = T_Z (H Σ H∗ + Θ) T_Z (39b)
K = Σ H∗ S† (39c)
µ′ = T_X (µ − µ₀ + K y) + µ₀ (39d)
Σ′ = T_X (Σ − K H Σ) T_X (39e)

Here, µ₀ is a fixed reference state, e.g. the maximally mixed state 𝟙/d. Note that the ordinary inverse in the formula for the Kalman gain K has been replaced by the MP inverse. Likewise, the inverse of Σ appearing in the formula for the posterior PDF corresponding to the Kalman filter solution has to be replaced by an MP inverse too.

A third approach is to parameterise the state such that the exact constraints are inherently satisfied. The obvious benefit is that the exact constraints do not have to be explicitly imposed. A second benefit is higher numerical stability, and straightforward invertibility of all matrices that have to be inverted.

We start again from the projectors T_X and T_Z.
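To make the projected update concrete, here is a minimal numpy sketch of Eqs. (39a)–(39e); the function name and the toy dimensions in the check below are ours, not the paper's:

```python
import numpy as np

def kalman_update_projected(mu, Sigma, z, H, Theta, T_X, T_Z, mu0):
    """One constrained Kalman update, Eqs. (39a)-(39e): the projectors
    T_X (state space) and T_Z (measurement space) keep the exact linear
    constraints satisfied; mu0 is a fixed reference state."""
    y = T_Z @ (z - H @ mu)                               # (39a)
    S = T_Z @ (H @ Sigma @ H.conj().T + Theta) @ T_Z     # (39b)
    K = Sigma @ H.conj().T @ np.linalg.pinv(S)           # (39c), MP inverse
    mu_new = T_X @ (mu - mu0 + K @ y) + mu0              # (39d)
    Sigma_new = T_X @ (Sigma - K @ H @ Sigma) @ T_X      # (39e)
    return mu_new, Sigma_new
```

For a two-component toy model with the trace-like constraint ∑_i µ_i = 1, one can check that the updated mean still sums to one and that the updated covariance remains supported on T_X.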
From these projectors we can derive two partial isometries, X and Z, such that the following holds: the number of columns of X and Z must be equal to the ranks of T_X and T_Z, respectively; X X∗ = T_X, Z Z∗ = T_Z; and X∗ X = Z∗ Z = 𝟙. Such partial isometries can be obtained numerically from a singular value decomposition (SVD), T_X = U S U∗; the partial isometry X is then obtained from the unitary U by dropping those columns that correspond to the zero-valued singular values.

Roughly speaking, using these partial isometries, the matrices Θ, Σ and H can be “cut down” to their invertible parts, which we will denote by a tilde. Define

Θ̃ := Z∗ Θ Z. (40)

Since the support of Θ is exactly the support of T_Z, we also have the reversed equality Θ = Z Θ̃ Z∗. Furthermore, Θ̃ is full rank and therefore invertible. In a similar way we define

Σ̃ := X∗ Σ X, (41)

which is also full rank and invertible and satisfies Σ = X Σ̃ X∗.

Furthermore, µ and z live in certain affine subspaces. If µ₀ and z₀ are fixed reference vectors in these affine subspaces, we find that µ − µ₀ is a vector in the support of T_X, and z − z₀ a vector in the support of T_Z. Then we can define

µ̃ := X∗ (µ − µ₀) (42)
z̃ := Z∗ (z − z₀), (43)

and these again obey µ = µ₀ + X µ̃ and z = z₀ + Z z̃. In addition, it is possible, and best, to choose z₀ such that z₀ = Hµ₀. Finally, we define

H̃ := Z∗ H X. (44)

Using these definitions (and a little work), the Kalman filter update equations can be rewritten as follows:

K = X K̃ Z∗ (45a)
K̃ := Σ̃ H̃∗ (H̃ Σ̃ H̃∗ + Θ̃)⁻¹ (45b)
µ′ = µ₀ + X [µ̃ + K̃ (z̃ − H̃ µ̃)] (45c)
Σ′ = X (Σ̃ − K̃ H̃ Σ̃) X∗ (45d)
   = X (Σ̃⁻¹ + H̃∗ Θ̃⁻¹ H̃)⁻¹ X∗ (45e)

It has to be stressed again that all inversions here are ordinary ones, not MP inverses. One sees that the equations reduce to the original Kalman filter update equations provided one always works with the “tilde” quantities.
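In the reduced coordinates the update (45b)–(45d) is an ordinary, unconstrained Kalman update; a short numpy sketch (our own rendering, with names of our choosing) also lets one verify the equivalence of (45d) and (45e) numerically:

```python
import numpy as np

def kalman_update_tilde(mu_t, Sigma_t, z_t, H_t, Theta_t):
    """Kalman update in the "tilde" coordinates, Eqs. (45b)-(45d).
    All inverses are ordinary inverses; lifting back to the full space
    is mu' = mu0 + X @ mu_new and Sigma' = X @ Sigma_new @ X*."""
    S_t = H_t @ Sigma_t @ H_t.conj().T + Theta_t
    K_t = Sigma_t @ H_t.conj().T @ np.linalg.inv(S_t)    # (45b)
    mu_new = mu_t + K_t @ (z_t - H_t @ mu_t)             # bracket of (45c)
    Sigma_new = Sigma_t - K_t @ H_t @ Sigma_t            # (45d)
    return mu_new, Sigma_new
```

By the matrix inversion lemma, the returned covariance agrees with the information form (45e), (Σ̃⁻¹ + H̃∗ Θ̃⁻¹ H̃)⁻¹.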
For the sake of reference, we combine all definitions here again:

Θ̃ := Z∗ Θ Z,  Θ = Z Θ̃ Z∗ (46a)
Σ̃ := X∗ Σ X,  Σ = X Σ̃ X∗ (46b)
µ̃ := X∗ (µ − µ₀) (46c)
z̃ := Z∗ (z − z₀) (46d)
H̃ := Z∗ H X. (46e)

All required calculations can be expressed directly in terms of tilde quantities. For the initial (prior) Σ, we choose b T_X (rather than b 𝟙), i.e. Σ̃ = b 𝟙.
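The SVD construction of the partial isometries can be sketched as follows (our own numpy illustration; `partial_isometry` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def partial_isometry(T, tol=1e-10):
    """From a projector T, build a partial isometry V with V V* = T and
    V* V = 1 by writing T = U S U* (an SVD) and keeping only the columns
    of U whose singular values are nonzero (numerically, above tol)."""
    U, s, _ = np.linalg.svd(T)
    return U[:, s > tol]

# Example: the projector onto zero-sum measurement vectors,
# T_Z = 1 - |1><1|/d, which encodes the constraint sum_i z_i = 1.
d = 5
ones = np.ones(d)
T_Z = np.eye(d) - np.outer(ones, ones) / d
Z = partial_isometry(T_Z)   # shape (d, d-1)
```

With Z in hand, the tilde quantities of Eq. (46) are plain congruences, e.g. Θ̃ = Z∗ Θ Z.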
Concerning the Mahalanobis distance, if we define x̃ := X∗(x − µ₀), we have for any vector x (as long as x − µ₀ is in the support of T_X; if not, the Mahalanobis distance will be infinite)

(x − µ)∗ Σ† (x − µ) = (x̃ − µ̃)∗ Σ̃⁻¹ (x̃ − µ̃).

As an example, we consider the case of N-run pulsed-mode state tomography. The constraint on the state is then Tr ρ = 1. Denoting the dimension of the underlying Hilbert space by D, this translates to µ₀ = |𝟙_D⟩/D and T_X = 𝟙 − |𝟙_D⟩⟨𝟙_D|/D, where |𝟙_D⟩ denotes the vectorisation of the D-dimensional identity matrix. The measurement vector z must in turn satisfy the constraint ∑_{i=1}^{d} z_i = 1, with d the number of elements of the measurement POVM, i.e. ⟨𝟙|z⟩ = 1. Hence T_Z = 𝟙 − |𝟙⟩⟨𝟙|/d.

The corresponding partial isometries can be found numerically using an SVD, as indicated, but for this particular case analytical formulas can be found. Let U be the d-dimensional discrete Fourier transform kernel

U_{j,k} = (1/√d) exp[2πi (j − 1)(k − 1)/d],  1 ≤ j, k ≤ d.

Let U′ be the matrix obtained from this by dropping the first column (which has constant entries). This U′ is a good choice for Z, as can be readily checked. Similarly, for X we can choose the matrix obtained from

X_{j,k} = δ_{j,k} if (j − 1) is not a multiple of D + 1, and X_{j,k} = U_{j′,k′} for j = (j′ − 1)(D + 1) + 1 and k = (k′ − 1)(D + 1) + 1

(with U now the D-dimensional Fourier kernel) by dropping the first column.

H. Graphical Representations of the Reconstruction
In the previous sections we have presented a methodology for state reconstruction from tomographic data by which a Kalman filter is used to obtain a normal approximation to the likelihood function f_{X|F}, where X is the state and F is the measurement data. The normal approximation is defined by its two moments: the mean state vector µ, and the covariance matrix Σ. These two moments should in principle suffice as a complete statistical description of the reconstructed state (within the limits of the normal approximation).

When it comes to presenting the reconstruction, however, there are a number of problems with the use of mean and covariance matrix alone. Consider, for example, the reconstruction of an optical POVM using our method, as discussed in Section V below. The reconstruction of the diagonal elements of the first element is shown in Fig. 6. The reconstructed mean is plotted as the centerline in the figure. On top of that, we would like a depiction of the covariance matrix, because this matrix essentially describes the reconstruction uncertainties.
1. Depicting the covariance matrix
The first problem one is faced with is that the covariance matrix Σ, being a matrix, cannot really be depicted in a very meaningful way. Nevertheless, as the whole purpose of calculating it is to provide some kind of error bars on the reconstruction, it is desirable to have some means of representation. One can do this by plotting the square roots √Σ_ii of its diagonal elements as error bars on the mean value. This is meaningful because the diagonal element Σ_ii is exactly the variance of the marginal distribution of X_i. Of course, such a plot has to be accompanied by the proviso that it can only be indicative, because the variations of the elements are in general correlated.
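In code, these indicative error bars are simply the square roots of the covariance diagonal (a trivial helper of our own devising):

```python
import numpy as np

def marginal_error_bars(Sigma):
    """Standard deviations of the marginal distributions of each X_i:
    sqrt(Sigma_ii).  Indicative only, since the correlations between
    the components are discarded."""
    return np.sqrt(np.real(np.diag(Sigma)))
```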
2. Avoiding reconstruction artifacts
The second, and more important, problem we want to address in this Section is the appearance of reconstruction artifacts in the reconstructed mean. Closer inspection of Fig. 6 reveals the presence of a wave-like pattern in the centerline, while from theoretical considerations of the underlying POVM model there really is no reason why that pattern should be there. Such artifacts are typical for maximum-likelihood reconstruction methods and are well-known in image restoration [28]. Even though the wave pattern in the POVM reconstruction stays well within the error bars, which is already a clear counter-indication to its statistical significance, it would be better to have a reconstruction not showing such artifacts at all. Two methods for obtaining artifact-free solutions (or at least for suppressing the artifacts) are described below.
3. MaxEnt reconstruction
A widely used method for suppressing reconstruction artifacts is the MaxEnt method, first proposed by Skilling in the context of image reconstruction [28]. Originally, the method was formulated as choosing a special prior PDF based on the entropy S(x) of the states (provided such an entropy exists). In many cases a state can be formally identified with a probability distribution, after suitable normalisation. This is possible whenever the state consists of a set of positive numbers. For digital images, the PDF is the list of intensities of each pixel. For quantum states, it could be the list of eigenvalues of the density matrix. In those cases one can assign a meaningful entropy functional to the state space. For quantum states, the von Neumann entropy is the obvious choice.

The MaxEnt method then consists of choosing the function exp(αS(x)) (properly normalised), where α is a fixed parameter, as prior PDF. Inference then proceeds in the normal way, by calculating the posterior PDF and finding the maximum likelihood solution. The upshot of this choice of prior is that in the absence of other information, preference is given to states with higher entropy. The parameter α characterises the amount of preference. Jaynes' principle of maximum entropy [40] could be seen as a legitimisation of this approach.

In the context of quantum tomography, Hradil and Řeháček [41] advocated a combination of the maximum entropy method with the maximum likelihood (MaxLik) reconstruction method, which they called MaxEnt-assisted MaxLik (MEML) tomography. This method can be seen as a special case of Skilling's MaxEnt method. In their paper, they considered the situation of incomplete measurements. This corresponds to a likelihood function whose covariance matrix has a certain number of eigenvalues that are (almost) zero, while the others are infinitely large.
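For quantum states, the entropy entering the MaxEnt prior exp(αS(x)) is the von Neumann entropy; a small sketch of the (unnormalised) log-prior, in our own numpy notation:

```python
import numpy as np

def maxent_log_prior(rho, alpha):
    """Log of the unnormalised MaxEnt prior exp(alpha * S(rho)), where
    S(rho) = -Tr rho log rho is the von Neumann entropy."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]          # use the convention 0 log 0 = 0
    return alpha * float(-np.sum(lam * np.log(lam)))
```

Pure states receive log-weight 0, while the maximally mixed state receives the largest log-weight, α log d, so preference is indeed given to higher-entropy states.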
The MaxLik reconstruction is thus known with certainty to lie in a certain subspace, but its position within that subspace is completely unknown. In other words, there exists not a single state maximising the likelihood function, but a whole plateau of states. The proposal of [41] consists of finding the point on that plateau (i.e. in the MaxLik subspace) for which the entropy S(ρ) = −Tr ρ log ρ is maximised, and taking that point as the reconstruction.

From an experimental viewpoint, the situation considered in Ref. [41], of variances that are either zero or infinite, is an idealised one. First, in practical experiments the number of measurements is finite, so that even the most precisely known state components have a non-negligible variance. Second, there may be practical and/or technical limitations on the kind of measurements that can be performed, so that some variances may be very large, but still finite. In Sec. V we will see a clear example of this. In that section the reconstruction of an optical POVM is described. While the elements of this POVM are diagonal in the Fock basis, its tomography is based on coherent states rather than Fock states, because the latter are extremely hard to produce. This causes large variances on the reconstructed elements without a clear-cut distinction between perfectly known and completely unknown components. When dealing with such realistic experiments, the full-blown MaxEnt method is much preferable.

In its original formulation as a choice of prior, the MaxEnt method has a number of shortcomings. One is that there appears to be no satisfactory and rigorous way of choosing the parameter α. Secondly, the principle of maximum entropy does not necessarily apply to the entropy of the states.
In quantum tomography we are dealing with a controlled system; the system is being prepared in a predefined quantum state, to the best of the preparer's abilities, and the tomography acts on a sequence of independent identically prepared systems. In thermodynamical terms, this corresponds to a system that could be as far from equilibrium as the preparer wants it to be. This has to be contrasted with Jaynes' MaxEnt principle, which was inspired by the statistical mechanics of systems near equilibrium, and which is based on the argument that the probability of a macro-state should be proportional to the number of microstates consistent with it, i.e. is proportional to its thermodynamic entropy. For systems close to equilibrium, we agree that it makes sense to choose a prior distribution that assigns more weight to states with higher entropy. For controlled systems, and for those systems lacking a fundamental notion of entropy, we are more tempted to opt for a uniform distribution, as we have done in this work, and to incorporate the maximum entropy idea as a regularisation, as explained below.
4. Regularisation
Rather than apply the MaxEnt principle, which we deem not always appropriate, one can adopt a more pragmatic approach in which the entropy functional is no longer fundamental and can be replaced by other functionals. And rather than replace the prior PDF with the chosen functional, which implicitly changes the final posterior PDF, and choose the maximum likelihood solution for that changed posterior, the regularisation method does the following (Ref. [28], Sec. 6.2): the prior PDF is unchanged, and within the confidence region of the resulting posterior PDF (unchanged as well),

(x − µ)∗ Σ† (x − µ) ≤ M_CR,

it finds the solution that maximises the chosen functional. When expressing the functional as a cost, or penalty, function, this would be a minimisation.

Since the entropy is a concave functional, maximising it over a convex set (such as the confidence region) is a convex problem and can be efficiently solved numerically. Likewise, minimising a cost function again gives a convex problem provided the cost function is convex. Proper distance measures, for example, would therefore be good cost functions.

Which cost function to use really depends on the problem setting. In the example of the optical POVM mentioned above, theoretical considerations suggested [19] that the smoothness of the POVM elements, defined as

Q({Π^(k)}_k) = ∑_{k=1}^{K} ∑_{i=1}^{d−1} (Π^(k)_{i+1} − Π^(k)_i)², (47)

could be appropriate. In fact, this smoothness is a commonly used regularisation functional in image reconstruction methods [28]. It is immediately clear that this Q is a convex functional, as required. The appropriateness of this cost function comes from the fact that it penalises the “waviness” of the centerline, as exemplified in Fig. 6.

When the cost function is quadratic, like this smoothness term, the minimisation problem is a quadratically constrained quadratic programming (QCQP) problem. Such problems can be efficiently solved using semi-definite programming (SDP) solvers [42].
For the sake of definiteness, let us consider the case where the states are quantum states (ρ ≥ 0 and Tr ρ = 1). The general form of a quadratic cost function can then be written in terms of a matrix A and a vector b as (Aρ − b)∗(Aρ − b). The SDP form of the QCQP problem is then: minimise the (slack) variable t over all t and ρ under the combined quadratic and semi-definite constraints

ρ ≥ 0
Tr ρ = 1
(Aρ − b)∗ (Aρ − b) ≤ t
(ρ − µ)∗ Σ† (ρ − µ) ≤ M_CR.

This problem can be solved in a straightforward way by SDP solvers like Sedumi [39].
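As a rough illustration of the regularisation step (not the authors' Sedumi implementation), the same constrained problem can be handed to a generic optimiser; here a scipy sketch, of our own construction, for a single diagonal (hence componentwise nonnegative) element, with the confidence region entering as a quadratic constraint:

```python
import numpy as np
from scipy.optimize import minimize

def smoothest_in_confidence_region(mu, Sigma_inv, M_CR):
    """Minimise the smoothness penalty Q(x) = sum_i (x_{i+1} - x_i)^2
    subject to x >= 0 and the confidence-region constraint
    (x - mu)* Sigma^{-1} (x - mu) <= M_CR."""
    d = len(mu)
    D = np.diff(np.eye(d), axis=0)              # first-difference operator
    Q = lambda x: float(np.sum((D @ x) ** 2))
    cons = [{"type": "ineq",
             "fun": lambda x: M_CR - (x - mu) @ Sigma_inv @ (x - mu)}]
    res = minimize(Q, mu, method="SLSQP", bounds=[(0, None)] * d,
                   constraints=cons)
    return res.x
```

With a wide confidence region the optimum flattens a wavy mean almost completely (Q → 0); as M_CR shrinks, the solution is forced to stay closer to µ.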
IV. APPLICATION 1: STATE RECONSTRUCTION OF AN ENTANGLED 2-QUBIT STATE
The methods introduced in this paper have all been tested on real sets of tomographic data. In this Section and the next we report on two such applications, one in state tomography and one in POVM tomography.

In the present Section, we consider the reconstruction of tomography data of a source of polarisation-entangled photon pairs, obtained by Langford et al [43], and compare our results to their reconstruction. The source is a BBO-crystal down-conversion source operating in CW mode, pumped by an Argon laser. Two sets of tomography data were taken, one directly on the crystal, and one on the single-mode fibres (SMF) attached to the crystal. In both cases, the sequence of measurements is as given in Tab. I. This measurement basis is over-complete because not all measurements are needed to obtain a full state reconstruction. Nevertheless, it was argued that by taking an over-complete basis a more accurate reconstruction could be obtained.

A nice consequence of this choice for our reconstruction method is that the projectors of the 36 basis states add up to a multiple of the identity, ∑_{k=1}^{36} |ψ^(k)⟩⟨ψ^(k)| = 9·𝟙. As has been discussed in Sec. III B 5, this allows us to consider these projectors as if they were POVM elements of one big over-complete POVM with normalisation factor M = 9. We can thus take all click frequencies and put them in one 36-dimensional vector f. Similarly, we have a 36-dimensional vector of probabilities p such that p/M is a genuine (normalised) probability vector. As the measurements are obtained in CW mode, the frequencies are Poissonian, and after Bayesian inversion (f → p) we find that p/M is Dirichlet distributed with parameters f and N = ∑_{k=1}^{36} f_k.

HH DH RH HV DV RV
VH AH LH VV AV LV
HD DD RD HA DA RA
VD AD LD VA AA LA
HR DR RR HL DL RL
VR AR LR VL AL LL

TABLE I: Sequence of measurements in the state tomography of the BBO source of Application 1 (Section IV). The labels H, V, D, A, R, L refer to the polarisation basis states: H = (1, 0), V = (0, 1), D = (1, 1)/√2, A = (1, −1)/√2, R = (1, i)/√2 and L = (1, −i)/√2.

The upshot of all this is that the Kalman update equations have to be executed exactly once, with z given by z = M µ(Dirichlet(f)) and Θ by M² times the covariance matrix of Dirichlet(f). This is particularly convenient, because the issue of setting an initial prior and removing it again after the Kalman updates (see Section III C) can be resolved analytically, which allows us to choose an infinitely wide initial prior b = ∞ without getting into numerical trouble. With such a prior, the Kalman update yields the following posterior, as can be checked with a modest amount of work:

µ̃′ = (H̃∗ Θ̃⁻¹ H̃)⁻¹ H̃∗ Θ̃⁻¹ z̃ (48)
Σ̃′ = (H̃∗ Θ̃⁻¹ H̃)⁻¹. (49)

Note that these formulas are stated in terms of the “tilde quantities” [see Sec. III G, Eq. (46)].
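Equations (48)–(49) amount to a generalised (weighted) least-squares fit; a compact numpy sketch of this single-update posterior (ours, with a hypothetical function name):

```python
import numpy as np

def flat_prior_posterior(H_t, Theta_t, z_t):
    """Posterior mean and covariance after one Kalman update with an
    infinitely wide prior (b = infinity), Eqs. (48)-(49): a weighted
    least-squares estimate with weight matrix inv(Theta)."""
    W = np.linalg.inv(Theta_t)
    Sigma_post = np.linalg.inv(H_t.conj().T @ W @ H_t)   # (49)
    mu_post = Sigma_post @ H_t.conj().T @ W @ z_t        # (48)
    return mu_post, Sigma_post
```

For instance, two unit-variance observations of a single scalar parameter yield their average with halved variance, as expected.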
Both the state and the frequencies satisfy exact constraints, Tr ρ = 1 and ∑_{k=1}^{36} f_k = N, and we have chosen to deal with these constraints in the numerically most stable way, by “cutting off” the kernels (zero eigenvalues) of the respective operators. In the derivation of the above formulas care has to be taken because the product H̃ H̃∗ is not full rank.

We show the results of the tomographic reconstructions of the measurements at the crystal and at the SMF in Figs. 4 and 5, respectively. Obviously, we cannot show the confidence regions in full 16-dimensional space, and we have chosen a 2D subspace spanned by two pure state projectors. We take the two eigenvectors ψ and φ of the reconstructed mean state µ that correspond to the 2 smallest eigenvalues (one of them being negative). The parameters x₁ and x₂ are then given by the mappings x₁ = ⟨ψ|ρ|ψ⟩ and x₂ = ⟨φ|ρ|φ⟩.

For both cases we calculate the least-squares solution and the MaxLik solution. The least-squares solution ρ_N is the state ρ_N = ∑_j c_j |ψ^(j)⟩⟨ψ^(j)| / ∑_j c_j, where the coefficients c_j are the least-squares solutions of the system f_i = ⟨ψ^(i)| (∑_j c_j |ψ^(j)⟩⟨ψ^(j)|) |ψ^(i)⟩ = ∑_j c_j |⟨ψ^(i)|ψ^(j)⟩|². The MaxLik solution is the physical state ρ_ML for which the Mahalanobis distance from the reconstructed mean state is minimal. We have implemented this in Sedumi, as indicated in Sec. III E.

FIG. 4: A display of the reconstruction results for Application 1, Section IV, showing a slice through state space illustrating the position of the two-photon state, reconstructed from the data obtained by measuring directly at the BBO crystal. What is shown is the projection of the state on the 2D subspace spanned by two well-chosen pure state projectors. The two concentric ellipses centered about the reconstruction mean µ are the projections of the 50% (inner ellipse) and 95% (outer ellipse) confidence regions, respectively; these ellipses are quite close together due to the rather high dimension of the system (d = 15). The intersection of the physical set with the subspace is the triangle x₁ ≥ 0, x₂ ≥ 0, x₁ + x₂ ≤ 1, of which the lower left corner is shown. The projection of the MaxLik solution ρ_ML is also shown. This solution is well within the confidence region, as it should be. For comparison purposes we have also plotted the projection of the “naïve” least-squares reconstruction ρ_N.

In [43], the MaxLik solution was calculated in a different way, through the minimisation of a penalty function

Π(ρ) := ∑_k [f_k − A p_k(ρ)]² / [A p_k(ρ)].

Here A is the unknown brightness factor of the experiment. This MaxLik solution closely matches the MaxLik solution obtained through our KF method. To obtain a quantification of the accuracy of the MaxLik solution, Langford used a Monte Carlo calculation to estimate the mean value of Π(ρ_ML) when f_k is considered as a Poissonian random variable with mean A p_k(ρ_ML). From this mean value, a fit quality parameter Q is obtained by dividing the mean value by the total number of measurements and taking the square root. Ideally, the mean value of Q should be 1.

Compared to the full error bars of the KF method, the Q quantity conveys little information about the statistical errors, and it is not clear what the acceptable values of Q should be. Moreover, the Monte Carlo calculation needed to find Q is several orders of magnitude slower than the KF algorithm. Langford reports MC running times of about 150 seconds for 200 MC iterations. In contrast, our KF algorithm runs in 0.12 sec (about 1000 times faster), while at the same time offering much more error information, with a clear statistical interpretation.

FIG. 5: As Fig. 4, but with the state measured at the SMF outputs. Because the duration of the tomography run was twice as long as in the case of Fig. 4, the confidence region is much smaller. The main difference, however, is that the confidence region lies deep in non-physical space, meaning that the MaxLik solution is far outside the confidence region. This is not a deficiency of the KF reconstruction method, nor of its implementation, but is actually a feature. It is a diagnostic feature that indicates that something is wrong with the assumptions about the underlying noise model. A likely possibility is that the measurements are subject to additional fluctuations. According to [43] the most likely source is temperature dependence of the spatial alignment of the SMFs, which caused the measurements to drift. To get a proper reconstruction this drift should be taken into account in the noise model as an additional term.
V. APPLICATION 2: RECONSTRUCTION OF AN OPTICAL POVM
Following a proposal of Ref. [44], in Ref. [45] an experimental realisation was reported of an optical detector with photon-number resolving capabilities. The basic idea is to carve up an optical pulse into 8 portions and detect the presence of photons in each of these portions. More precisely, this setup simulates a cascade of beam splitters and eight avalanche photo-detectors (APDs), with the probability of a photon arriving at a certain APD being roughly 1 in 8. The number of detectors clicking therefore gives an indication of the photon number in the pulse. The detector is implemented using two Franson interferometers, an additional balanced beam splitter, two avalanche photo-detectors, and two identical circuits for performing time binning.

The behaviour of this composite detector can be described by a 9-element POVM, where each of the outcomes corresponds to the number of APDs clicking (from 0 to 8). We denote the POVM elements by {Π^(k)}_{k=0,...,8}, where the elements Π^(k) are positive semi-definite and add up to the identity matrix. In principle, the elements are infinite-dimensional (corresponding to photon numbers being unbounded), but we will truncate them at a certain dimension d (in our calculations we have chosen values of d of up to 170). Since this detector has no phase reference, it is insensitive to phase, which means that the POVM elements have to be diagonal in the Fock basis.

To obtain a precise characterisation of the POVM elements, a tomography experiment has been performed [19] by which a large number of pulses consisting of coherent states |α⟩ of ever increasing power (∝ |α|²) were sent to the composite detector and the resulting numbers of detectors clicking were recorded. The parameter α was swept from 0.4 to 11, in steps of about 0.01, and for each value of α, N = 38084 measurements were taken. Per value of α the measurement record consisted of the number of pulses f_k that caused k detectors to click, for k = 0, ..., 8; obviously, ∑_{k=0}^{8} f_k = N.

Using these data, a reconstruction of the POVM elements (without error bars) was obtained and presented in [19]. Here we take the same data and perform a reconstruction based on the KF method, yielding a maximally likely solution with, in addition, a definite confidence region. To avoid any confusion, we stress that the object under scrutiny is a POVM and the measurement is made using prepared quantum states. In other words: the POVM plays the role of the state, and the probe states play the role of the POVM.

We have calculated the (unphysical) mean value µ and covariance matrix Σ using Kalman filtering, including the T projectors for incorporating the exact constraint that the POVM elements must add up to the identity matrix. Then we applied the KF method for restricting to the physical set, giving a physical mean value and covariance matrix. Finally, we calculated the maximally smooth solution within the physical confidence region.

In Fig. 6 we depict the final results for each of the POVM elements, showing the physical mean value solution, the error bars, and the smoothed solution. The smoothed solution of all POVM elements together is depicted separately in Fig. 7. The results are in very good correspondence with both the reconstruction of [19] and the theoretical model of the POVM (based on independent measurements of the reflectivities of the beam splitters and the overall photon loss).

To illustrate how the mean values and error bars change after each KF iteration, we have created a movie, where each frame consists of a plot similar to the one of Fig. 6, generated after each iteration. We refer the reader to [48] for this animation, the MatLab routines used, and other related material.

In order to infer how many measurements are needed to reduce the errors, one has to look at the unconstrained confidence region. We have plotted the spectrum of the unphysical covariance matrix in Fig. 8.
This graph allows one to estimate the number of experimental runs N necessary to achieve a certain final precision. It is evident from the graph that only 110 of the 800 free components have a standard deviation below the target precision. Since variances scale as 1/N, to double that number to 220, say, N should be increased by a factor of no less than about 100,000, i.e. from 38,000 to the rather impractical 3,800,000,000. Hence, to really achieve higher precision with this kind of experiment, another setup should be considered.

FIG. 6: (Colour online) Reconstruction of the POVM elements of the optical POVM of Sec. V. The graphs depict the mean value solution µ_i of the diagonal entries (central wavy line (blue)) and their marginal standard deviations √Σ_ii (±σ error bars). As the actual variations of the diagonal entries are correlated, this plot can only give an indication. Along with the mean value solution, the regularised solution is plotted (central smooth line (red)).

FIG. 7: Reconstruction of the POVM elements Π^(k)_ii of the optical POVM of Sec. V (up to photon number 30 only). This is the regularised solution, i.e. the solution with maximal smoothness that is still within the confidence region obtained from the Kalman filter method. The solution obtained is in complete agreement with the solution from [19], which was obtained in a completely different way.

FIG. 8: Spectrum of the corrected covariance matrix Σ_corr for the optical POVM, before restricting to the physical set; the eigenvalues are plotted on a logarithmic scale. The values saturate at a floor due to numerical imprecision in the calculation of Σ_corr.
Eigenvalues 800 and higher correspond to the exact constraints imposed by the projector T_X and are nominally 0.

VI. DISCUSSION

A. Comparison to other Methods
The reconstruction method that matches ours most closely is the one reported in [5], which is also based on the likelihood function and also yields a covariance matrix. Hence, this method allows one to calculate confidence regions in the same way as ours. The main differences are that in [5] the point of maximum likelihood is calculated first, the covariance matrix is calculated as the inverse of the Hessian (the matrix of second derivatives ∂²/∂x_i∂x_j) of the logarithm of the likelihood function, taken in the mode of that function, and the restriction to the physical set is imposed beforehand. In contrast, our method amounts to calculating the mean of the log-likelihood function and its Hessian in that mean, and the restriction to the physical set is made afterwards.

First of all, we believe that our approach yields results that better match the exact confidence region. The likelihood function is highly skewed whenever there are many low-probability measurement outcomes; this appears to be the rule rather than the exception. In those cases the mean is statistically more meaningful than the mode, especially when one restricts to the mode over the physical set from the outset. Second, restricting to the physical set only in a post-processing phase yields valuable diagnostic information about the correctness of the assumed noise model and also about the ultimate accuracy allowed by the particular tomographic data. As illustrated in Application 1, the mode over the physical set can be really far from the mean, due to unforeseen noise/error contributions, and the mean has to be calculated in order to see that. Also, to infer how many more measurements would be needed to improve the reconstruction accuracy, one needs to look at the covariance matrix before restricting to the physical set, as illustrated in Application 2.

Other reconstruction methods also calculate the MaxLik solution and derive error measures from Monte Carlo simulations.
As such, they suffer from the same drawbacks as the method of [5] in that the restriction to physical space is made from the outset. Moreover, the time required for the Monte Carlo calculations rapidly becomes prohibitive with increasing system dimension. Even for two-qubit systems, our method is orders of magnitude faster than Monte Carlo methods. A further problem with the Monte Carlo method is the difficulty of obtaining a reliable stopping criterion.

B. Computational Resources
The memory requirements of our method are easily calculated. They are essentially governed by the dimension of the subspace S_X on which the state ρ is supported. If this dimension is D then storage for µ̃ consists of D complex numbers, while for the (tilde) covariance matrix it is the square of that, D². This means that for the full reconstruction of n-qubit states, 4ⁿ elements are needed for the state, and 16ⁿ for the covariance matrix.

The computation time for the Kalman filter update (executed once per measurement setting) is dominated by a fixed number of matrix multiplications (of D × D matrices) and one matrix inversion (of a K × K matrix, where K is the number of outcomes per measurement setting and therefore is typically much smaller than D). As the computational complexity of a matrix multiplication for two k × k matrices is O(k³) (or somewhat less), we get a computational complexity of O(D³).

The optional post-processing steps of calculating the MaxLik and/or MaxEnt solution require solving a semi-definite program. In all reported applications this turned out to be the most time-consuming step.

VII. CONCLUSION
In this work we have introduced a novel Bayesian tomographic reconstruction method based on Kalman filtering that does not just give a maximum likelihood solution, but also produces error bars, in the form of a confidence region around a mean-value solution. It must be stressed that the error bars are directly derived from the measurement data, unlike in Monte Carlo methods, where they are produced from simulations. We have shown that, to properly deal with low-probability events (e.g. measurement outcomes with very few clicks), one has to consider the conjugate distribution of the noise model, in the spirit of Laplace's rule of succession. That is, if click frequencies are multinomially or Poisson distributed, this yields a distribution of the underlying click probabilities that is Dirichlet distributed. This avoids the incorrect assignment of zero probability to an outcome that has not been observed. Furthermore, we have introduced a novel method of ensuring that the reconstruction is physical. This method is again based on Kalman filtering, and has the benefit that it is very fast and again produces appropriate error bars.

Finally, we have applied the method to two real-world applications. In the first example, the state reconstruction of an entangled two-qubit state, the reconstruction process reduces to a single application of the Kalman update equations which, apart from its numerical stability, reduces the computational effort. Compared to Monte Carlo methods for calculating error bars, the computational effort is reduced by several orders of magnitude. The Kalman filter method also revealed the necessity of adjusting the underlying noise model by taking into account additional error sources. The second example concerned the reconstruction of an optical POVM. There the advantages of Kalman filtering also became evident, in one's capability to estimate the number of experimental runs necessary to achieve a certain final precision.
Both examples indicate that our KF method can be an invaluable diagnostic tool.

In future work we will consider how to deal with measurement imperfections, including drift in the tomographic and system components. We will investigate how the present method can be applied to tomography with continuous-variable outcomes. A further topic of study will be the integration of the Kalman filter method within adaptive tomographic setups, as the method is very much an online method, updating the covariance matrix as it goes. Among the more technical issues, we will study the convergence properties of our proposed Kalman filter method for restricting the reconstruction to physical space.

We are confident that our reconstruction method, due to its statistically well-founded nature, can be the basis of a dependable, easily adaptable, and universal reconstruction algorithm.
Acknowledgments
SS thanks the UK Engineering and Physical Sciences Research Council (EPSRC) for support. We thank the following people, who have kindly provided us with ample tomographic data for developing and testing the Kalman filter method: I. Walmsley, A.G. White, J. O'Brien and N.K. Langford. It is fair to say that without their assistance this paper would have been very theoretic and equally useless. We also thank M. Plenio, A. Feito, R. Schack, T. Osborne and T. Sharia for illuminating discussions. Last but not least, we thank Z. Hradil for sharing his thoughts on the issues considered in our work.
APPENDIX A: PROOF OF THE BOUND (32) ON THE PHYSICAL CONFIDENCE VALUE
For definitions we refer back to Sec. III D. We start with a Lemma.
Lemma 1
Define the function
$$g(r,a) := \int_0^r \mathrm{d}x\, x^{d-1}\, e^{-(x+a)^2/2}.$$
Then, for $a \ge 0$ and $r \le R$, the relation
$$\frac{g(r,a)}{g(R,a)} \ge \frac{g(r,0)}{g(R,0)}$$
holds.

Proof. Consider three integrable functions on the interval $[0,R]$: $f$, $g$, and $h$. Let $f$ be non-negative, and $g$ and $h$ non-increasing. It is easily shown that these functions satisfy the inequality
$$\int_0^R \mathrm{d}x\, f(x)g(x)h(x) \int_0^R \mathrm{d}x\, f(x) \;\ge\; \int_0^R \mathrm{d}x\, f(x)g(x) \int_0^R \mathrm{d}x\, f(x)h(x). \quad\quad (A1)$$
To see this, subtract the right-hand side from the left-hand side, rewrite the integrals as double integrals over the square $[0,R]\times[0,R]$, split up this square into two equal parts along the diagonal $x=y$, and enjoy the benefits of the integrand's symmetry, giving:
$$\int_0^R \mathrm{d}x \int_0^R \mathrm{d}y\; f(x)f(y)\,g(x)\big(h(x)-h(y)\big) = \int_0^R \mathrm{d}x \int_x^R \mathrm{d}y\; f(x)f(y)\Big(g(x)\big(h(x)-h(y)\big) + g(y)\big(h(y)-h(x)\big)\Big)$$
$$= \int_0^R \mathrm{d}x \int_x^R \mathrm{d}y\; f(x)f(y)\big(g(y)-g(x)\big)\big(h(y)-h(x)\big) \;\ge\; 0.$$
Setting $f(x) = x^{d-1}e^{-x^2/2}$, $g(x) = e^{-ax}$ and $h(x) = \Phi(0 \le x \le r)$ (the indicator function of the interval $[0,r]$) gives the inequality of the lemma; with these choices, and up to a common factor $e^{a^2/2}$, (A1) reads $g(r,a)\,g(R,0) \ge g(R,a)\,g(r,0)$. $\Box$

We start from the unphysical reconstruction, that is, the mean µ and the covariance matrix Σ. Let P be the physical set, and let ρ be the maximum likelihood solution, i.e. the state that is closest to the mean µ in the Mahalanobis distance. In what follows we will use the Hilbert space representation of states, i.e. a representation as vectors. As before, we will denote this by math boldface. The discussion becomes easier by going over to a new, "standardised" coordinate system, in which the mean µ is the origin and the covariance matrix is the identity matrix. The Mahalanobis distance is then just the Euclidean distance, and the confidence regions are spheres centered around the origin.

In quantum mechanics, the physical set P is convex. By definition, ρ is on the boundary of P.
Therefore, P can be decomposed into infinitesimal cones with center ρ, each pointing in a different direction Ω, having cross-section dΩ, and cut to a certain length $R(\Omega)$, where the latter function determines the overall shape of P.

In standardised coordinates, the unphysical posterior f is given by $f(\mathbf{x}) = C \exp(-x^2/2)$, with C the normalisation constant and $x = \|\mathbf{x}\|$. We now want to calculate the cumulative distribution function (CDF) of the physical posterior, which is the normalised integral of f over the intersection of P with the ball of radius x, $g(x)/g(\infty)$, with
$$g(x) := \int_\Omega \mathrm{d}\Omega \int_0^{R(\Omega)} \mathrm{d}r\; r^{d-1}\, \Phi\big(\|r\Omega + \boldsymbol{\rho}\| \le x\big)\, \exp\big(-\|r\Omega + \boldsymbol{\rho}\|^2/2\big).$$
Let us also define the non-negative function
$$g(x,\Omega) := \int_0^{R(\Omega)} \mathrm{d}r\; r^{d-1}\, \Phi\big(\|r\Omega + \boldsymbol{\rho}\| \le x\big)\, \exp\big(-\|r\Omega + \boldsymbol{\rho}\|^2/2\big).$$
Then we have
$$\frac{g(x)}{g(\infty)} = \frac{\int_\Omega \mathrm{d}\Omega\; g(x,\Omega)}{\int_\Omega \mathrm{d}\Omega\; g(\infty,\Omega)} = \int_\Omega \mathrm{d}\Omega\; \frac{g(\infty,\Omega)}{\int_\Omega \mathrm{d}\Omega'\; g(\infty,\Omega')}\; \frac{g(x,\Omega)}{g(\infty,\Omega)}.$$
The first factor of the integrand, which we will denote by $w(\Omega)$, is a PDF, in that it is a non-negative function integrating to 1 over Ω. We have thus shown the following statement:

Statement C:
The function $g(x)/g(\infty)$ is a weighted average of the functions $g(x,\Omega)/g(\infty,\Omega)$ over Ω.

Let us now fix Ω. The value of $g(x,\Omega)$ no longer changes for x beyond $R(\Omega)$. We define $R_x$ as that value of r for which $\|r\Omega + \boldsymbol{\rho}\| = x$. Thus, for $R_x \ge R(\Omega)$, we have $g(x,\Omega) = g(\infty,\Omega)$.

Consider now the case that x is small enough so that $R_x \le R(\Omega)$. Let $\rho = \|\boldsymbol{\rho}\|$ (the 2-norm of the vector representation of ρ). In fact, $\rho = M_{\mathrm{ML}}$ as used in the bound (32). Let θ be the angle between a normal to $\boldsymbol{\rho} - \boldsymbol{\mu}$ and Ω. Because ρ is the nearest point in P to µ, this angle is between 0 and $\pi/2$. In this case we have
$$g(x,\Omega) = \int_0^{R_x} \mathrm{d}r\; r^{d-1} \exp[-\xi(r)/2],$$
with
$$\xi(r) = \|r\Omega + \boldsymbol{\rho}\|^2 = \rho^2 + r^2 + 2\rho r \sin\theta = (r + \rho\sin\theta)^2 + \rho^2\cos^2\theta.$$
This gives us
$$g(x,\Omega) = \exp\big(-(\rho\cos\theta)^2/2\big) \int_0^{R_x} \mathrm{d}r\; r^{d-1} \exp\big(-(r + \rho\sin\theta)^2/2\big).$$
The factor in front of the integral is independent of x and cancels out in the quantity of interest $g(x,\Omega)/g(\infty,\Omega)$. Applying the Lemma (with $a = \rho\sin\theta \ge 0$) we now get
$$\frac{g(x,\Omega)}{g(\infty,\Omega)} = \frac{\int_0^{R_x} \mathrm{d}r\; r^{d-1} \exp\big(-(r+\rho\sin\theta)^2/2\big)}{\int_0^{R(\Omega)} \mathrm{d}r\; r^{d-1} \exp\big(-(r+\rho\sin\theta)^2/2\big)} \;\ge\; \frac{\int_0^{R_x} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}{\int_0^{R(\Omega)} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)} \;\ge\; \frac{\int_0^{R_x} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}{\int_0^{\infty} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}.$$
Now $R_x$ satisfies the triangle inequality: $x \le \rho + R_x$. Thus, if we replace $R_x$ as upper integration limit by its lower bound $x - \rho$ (or 0 if the difference is negative), we get a lower bound on the integral too, giving
$$\frac{g(x,\Omega)}{g(\infty,\Omega)} \ge \frac{\int_0^{x-\rho} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}{\int_0^{\infty} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}.$$
The upshot of this step is that the right-hand side is now completely independent of Ω, which allows us to invoke Statement C and conclude that $g(x)/g(\infty)$ satisfies the same inequality:
$$\frac{g(x)}{g(\infty)} \ge \frac{\int_0^{x-\rho} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}{\int_0^{\infty} \mathrm{d}r\; r^{d-1} \exp(-r^2/2)}.$$
The right-hand side is the CDF of the chi distribution (with d degrees of freedom) evaluated at $x - \rho$, i.e. the CDF shifted to the right by an amount $\rho = M_{\mathrm{ML}}$.
Its confidence region is therefore the interval $[0,\, M_{\mathrm{CR,unphys}} + M_{\mathrm{ML}}]$. The left-hand side is the CDF of the restricted posterior, with confidence region $[0,\, M_{\mathrm{CR,phys}}]$. Because of the inequality, the latter confidence region is contained in the former. That proves the bound (32). $\Box$

APPENDIX B: PROPERTIES OF THE DIRICHLET ESTIMATOR

1. Mode vs Confidence Region
Here we give the promised proof that the mode of the Dirichlet distribution lies within the confidence region as defined in (29), with µ and Σ given by (6) and (7).

Proof.
Let $\mathbf{x}$ be the mode of the Dirichlet distribution, $\mathbf{x} = \mathbf{f}/N$, and $\boldsymbol{\mu}$ be its mean, $\boldsymbol{\mu} = (\mathbf{f}+1)/(N+d)$. Then $(x - \mu)_i = (d f_i - N)/(N(N+d))$; as the sum of the entries of $\mathbf{x} - \boldsymbol{\mu}$ is 0, $\mathbf{x} - \boldsymbol{\mu}$ lies in the subspace on which G of (8) projects. Thus we have
$$M^2 = (\mathbf{x}-\boldsymbol{\mu})^* \Sigma^\dagger (\mathbf{x}-\boldsymbol{\mu}) = \frac{N+d+1}{N^2(N+d)} \sum_{i=1}^d \frac{(d f_i - N)^2}{f_i + 1} = \frac{(N+d+1)(N+d)}{N^2} \left( \sum_{i=1}^d (f_i+1)^{-1} - \frac{d^2}{N+d} \right).$$
If r is the number of non-zero components of $\mathbf{f}$ (thus $1 \le r \le d$) and if we put $f_i = N p_i$, fixing the $p_i$, then this expression can be expanded as $d - r + O(1/N)$. The term $\sum_{i=1}^d (f_i+1)^{-1}$ is maximal for $\mathbf{f} = (N, 0, \ldots, 0)$, giving the sum $(N+1)^{-1} + d - 1$. In this way we get the upper bound
$$M^2 \le \frac{N+d+1}{N+1}\,(d-1) = d - 1 + O(1/N).$$
For not too small values of N, this bound is approximately equal to $d-1$, which is also the number of degrees of freedom ν in this case. As $d-1$ is the mean value of the $\chi^2_{d-1}$ distribution, the value $d-1$ lies within any reasonable confidence interval. Therefore, the mode of the Dirichlet distribution lies within the confidence region of its normal approximation. $\Box$
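The closed form for $M^2$ and the bound above are easy to check numerically. The following sketch (our illustration, not part of the paper) assumes the Dirichlet moments $\boldsymbol{\mu} = (\mathbf{f}+1)/(N+d)$ and $\Sigma = (\mathrm{diag}(\boldsymbol{\mu}) - \boldsymbol{\mu}\boldsymbol{\mu}^T)/(N+d+1)$, consistent with the derivation above, and compares the closed form against a direct pseudoinverse evaluation of $(\mathbf{x}-\boldsymbol{\mu})^*\Sigma^\dagger(\mathbf{x}-\boldsymbol{\mu})$ over all frequency vectors for small N and d:

```python
import itertools
import numpy as np

def m2_closed(f):
    """Closed form M^2 = (N+d+1)(N+d)/N^2 (sum_i 1/(f_i+1) - d^2/(N+d))."""
    f = np.asarray(f, dtype=float)
    N, d = f.sum(), len(f)
    return (N + d + 1) * (N + d) / N**2 * ((1.0 / (f + 1)).sum() - d**2 / (N + d))

def m2_direct(f):
    """M^2 = (x - mu)^T Sigma^+ (x - mu), with x = f/N the Dirichlet mode,
    mu = (f+1)/(N+d) the mean and Sigma the (singular) Dirichlet covariance."""
    f = np.asarray(f, dtype=float)
    N, d = f.sum(), len(f)
    x, mu = f / N, (f + 1) / (N + d)
    Sigma = (np.diag(mu) - np.outer(mu, mu)) / (N + d + 1)
    v = x - mu
    return float(v @ np.linalg.pinv(Sigma) @ v)

N, d = 20, 3
bound = (N + d + 1) * (d - 1) / (N + 1)   # attained at f = (N, 0, ..., 0)
max_err, max_m2 = 0.0, 0.0
for f in itertools.product(range(N + 1), repeat=d):
    if sum(f) != N:
        continue
    max_err = max(max_err, abs(m2_closed(f) - m2_direct(f)))
    max_m2 = max(max_m2, m2_closed(f))

print(max_err, max_m2, bound)
```

On this grid the two evaluations agree to machine precision, and the maximum of $M^2$ is attained at the corner $\mathbf{f} = (N,0,\ldots,0)$, where it equals the bound $(N+d+1)(d-1)/(N+1)$ exactly, in line with the argument above.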
2. Wald statistic
Suppose the actual state under consideration is ρ, and a measurement is made using a d-outcome POVM, so that the probabilities of the outcomes are given by the probability vector $\mathbf{p}$. In an experiment this gives rise to certain outcome frequencies $\mathbf{f}$, drawn from a multinomial distribution with parameters N and $\mathbf{p}$. From these frequencies $\mathbf{f}$ one can derive an estimate $\hat{P}$ of $\mathbf{p}$, Dirichlet distributed with parameter $\mathbf{f}$ according to the prescription of Sec. III B. Let µ and Σ be the moments of this Dirichlet estimate.

We want to study how well the actual $\mathbf{p}$ fits within the confidence region obtained from this estimate. To do so, we construct the Wald statistic
$$z := (\mathbf{p}-\boldsymbol{\mu})^* \Sigma^\dagger (\mathbf{p}-\boldsymbol{\mu}) = \frac{N+d+1}{N+d} \sum_{i=1}^d \frac{\big((N+d)p_i - (f_i+1)\big)^2}{f_i+1} = (N+d+1)\left( (N+d) \sum_i \frac{p_i^2}{f_i+1} - 1 \right).$$
If the distribution involved were Gaussian, this statistic would be $\chi^2_{d-1}$ distributed. In reality, the distribution only tends to a Gaussian, and the Wald statistic is only asymptotically $\chi^2_{d-1}$ [46].

An exact calculation yields the first two moments of z, given that F is distributed as $F \sim \mathrm{Mtn}(N, \mathbf{p})$, in terms of $\mathbf{p}$:
$$\mu(Z) = (N+d+1)\left( \frac{N+d}{N+1}\Big(1 - \sum_i p_i(1-p_i)^{N+1}\Big) - 1 \right) \quad\quad (B1)$$
$$\sigma^2(Z) = \frac{(N+d+1)^2(N+d)^2}{(N+1)^2} \Bigg\{ \frac{N+1}{N+2} \sum_{i \neq j} p_i p_j \Big[ 1 + (1-p_i-p_j)^{N+2} - (1-p_i)^{N+2} - (1-p_j)^{N+2} \Big]$$
$$\qquad\qquad + (N+1) \sum_i g(p_i, N) - \Big(1 - \sum_i p_i(1-p_i)^{N+1}\Big)^2 \Bigg\}, \quad\quad (B2)$$
where the function $g(p,N)$ is defined as
$$g(p,N) := p^3 \sum_{k=0}^{N} \binom{N+1}{k} \frac{p^{N+1-k}(1-p)^k}{N+1-k} = p^3 \sum_{k=1}^{N+1} \binom{N+1}{k} \frac{p^k(1-p)^{N+1-k}}{k}.$$
The sum $g(p,N)$ is related to the first inverse moment of the positive (i.e. non-zero) binomial distribution,
$$\mu_{-1}(p,N) = \sum_{k=1}^{N} \frac{1}{k}\binom{N}{k} p^k(1-p)^{N-k},$$
by $g(p,N) = p^3\, \mu_{-1}(p, N+1)$.

FIG. 9: Graph of µ(Z) (lower curve) and σ(Z) (upper curve), from Eq. (B1) and the square root of (B2), as a function of p, for d = 2 and N = 100.
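The algebraic identity for z and the mean formula (B1) lend themselves to a quick simulation check. The sketch below (ours, for illustration; NumPy assumed) draws multinomial frequencies, evaluates z both via the quadratic form with the Dirichlet moments and via the simplified sum, and compares the sample mean with (B1):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
p = np.array([0.5, 0.3, 0.2])

# Sample outcome frequencies F ~ Multinomial(N, p) and evaluate the
# simplified form z = (N+d+1) ((N+d) sum_i p_i^2/(f_i+1) - 1).
F = rng.multinomial(N, p, size=400_000)
z = (N + d + 1) * ((N + d) * (p**2 / (F + 1)).sum(axis=1) - 1)

# Cross-check the quadratic form (p - mu)^T Sigma^+ (p - mu) on one sample,
# using the Dirichlet moments mu = (f+1)/(N+d), Sigma = (diag(mu) - mu mu^T)/(N+d+1).
f0 = F[0]
mu = (f0 + 1) / (N + d)
Sigma = (np.diag(mu) - np.outer(mu, mu)) / (N + d + 1)
z0 = float((p - mu) @ np.linalg.pinv(Sigma) @ (p - mu))

# Closed-form mean (B1).
mu_z = (N + d + 1) * ((N + d) / (N + 1) * (1 - (p * (1 - p)**(N + 1)).sum()) - 1)

print(z.mean(), mu_z, z0, z[0])
```

For these parameters µ(Z) comes out close to, but slightly above, the naive $\chi^2$ value $d-1 = 2$; the deviation grows as one of the $p_i$ approaches $1/N$, which is the regime examined around Fig. 9.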
No closed form for inverse moments exists, but several expansions are known (see, e.g., Ref. [47] and references therein). For large N, one can approximate the binomial distribution by a Poisson distribution with mean $\mu = Np$. For the first inverse moment, this gives an approximation by the known first inverse moment of the Poisson distribution, with a relative error of order $1/N$:
$$\mu_{-1}(p,N) \approx f(Np), \qquad f(\mu) = e^{-\mu}\big(\mathrm{Ei}(\mu) - \log\mu - \gamma\big);$$
here, $\mathrm{Ei}(x)$ is the exponential integral function and γ is the Euler–Mascheroni constant. Thus, we get
$$g(p,N) \approx (N+1)^{-3}\, \mu^3 e^{-\mu}\big(\mathrm{Ei}(\mu) - \log\mu - \gamma\big)$$
with $\mu = (N+1)p$. To obtain $\sigma^2(Z)$, however, g has to be multiplied by a constant of order $O(N^3)$, and as $\sigma^2(Z)$ turns out to be of order $O(1)$, we need to know g with a relative precision of order $O(1/N^2)$. This requires correction terms of $\mu_{-1}$ of up to second order. According to the recipe described in [47], the required approximate formula for $\mu_{-1}$ is given by
$$\mu_{-1}(p,N) \approx \frac{A(\mu)f(\mu) + B(\mu)e^{-\mu} + C(\mu)}{24(N+1)^2}$$
with $\mu = (N+1)p$ and
$$f(\mu) = e^{-\mu}\big(\mathrm{Ei}(\mu) - \log\mu - \gamma\big),$$
$$A(\mu) = 3\mu^3 - \mu^2 - 12(N+1)\mu + 24(N+1)^2,$$
$$B(\mu) = 12\mu^2 - \mu^3 - 12(N+1)\mu - (12N+10),$$
$$C(\mu) = -\mu^3 + 5\mu^2 + (12N+14)\mu + (12N+10).$$
For the 2-dimensional case, with N = 100, the resulting values for µ(Z) and σ(Z) are plotted as a function of p in Fig. 9. When p is sufficiently far removed from the endpoints 0 or 1, one sees that µ and σ converge to their $\chi^2$ values ($\mu_{\chi^2} = d-1 = 1$ and $\sigma_{\chi^2} = \sqrt{2(d-1)} = \sqrt{2}$). More generally, good convergence occurs when the smallest $p_i$ is still larger than about $20/N$, i.e. when every outcome has at least 20 clicks. Numerical studies reveal that the highest value of σ occurs roughly when the smallest $p_i$ is of the order of $1/N$. In turn, this highest value of σ is maximal when all $p_i$ bar one are of that same order. This worst-case value is approximately given by an empirical formula proportional to $\sigma_{\chi^2}$, with a prefactor depending on d and N. This gives us the following conservative approach: take the chi-square value for $\sigma(Z)$ whenever the smallest $p_i$ is larger than $20/N$, and the empirically corrected multiple of the chi-square value otherwise.

[1] K. Vogel and H. Risken, Phys. Rev. A 40, 2847 (1989).
[2] D.T. Smithey, M. Beck, M.G. Raymer and A. Faridani, Phys. Rev. Lett. 70, 1244 (1993).
[3] U. Leonhardt, Measuring the Quantum State of Light (Cambridge University Press, Cambridge, 1997).
[4] D.-G. Welsch, W. Vogel and T. Opatrný, "Homodyne detection and quantum state reconstruction", in:
Progress in Optics XXXIX, ch. II (1999).
[5] J. Řeháček, D. Mogilevtsev and Z. Hradil, New J. Phys. 10, 043022 (2008).
[6] J.L. O'Brien, G.J. Pryde, A. Gilchrist, D.F.V. James, N.K. Langford, T.C. Ralph and A.G. White, Phys. Rev. Lett. 93, 080502 (2004).
[7] M. Riebe, M. Chwalla, J. Benhelm, H. Haeffner, W. Haensel, C.F. Roos and R. Blatt, New J. Phys. 9, 211 (2007).
[8] M. Steffen et al., Science 313, 1423 (2006).
[9] U. Leonhardt, M. Munroe, T. Kiss, Th. Richter and M.G. Raymer, Opt. Commun. 127, 144 (1996).
[10] G.M. D'Ariano, C. Macchiavello and M.G.A. Paris, Phys. Rev. A 50, 4298 (1994).
[11] U. Leonhardt, H. Paul and G.M. D'Ariano, Phys. Rev. A 52, 4899 (1995).
[12] T. Dunn, I.A. Walmsley and S. Mukamel, Phys. Rev. Lett. 74, 884 (1995).
[13] L.J. Waxer, I.A. Walmsley and W. Vogel, Phys. Rev. A, R2491 (1997).
[14] A. Zucchetti, W. Vogel, D.-G. Welsch and I.A. Walmsley, Phys. Rev. A, 2716 (1999).
[15] D.J. Simon, Optimal State Estimation, First Edition (Wiley, New York, 2006).
[16] V.P. Belavkin, Rep. Math. Phys., 405 (1999).
[17] J.M. Geremia, J.K. Stockton, A.C. Doherty and H. Mabuchi, Phys. Rev. Lett. 91, 250801 (2003).
[18] F. Verstraete, A.C. Doherty and H. Mabuchi, Phys. Rev. A 64, 032111 (2001).
[19] J.S. Lundeen, A. Feito, H. Coldenstrodt-Ronge, K.L. Pregnell, Ch. Silberhorn, T.C. Ralph, J. Eisert, M.B. Plenio and I.A. Walmsley, "Tomography of Quantum Detectors", Nature Physics, published online: 16 Nov. 2008, DOI: 10.1038/NPHYS1133 (2008).
[20] M. Karpinski, C. Radzewicz and K. Banaszek, J. Opt. Soc. Am. B 25, 668 (2008).
[21] G. Molina-Terriza, A. Vaziri, J. Řeháček, Z. Hradil and A. Zeilinger, Phys. Rev. Lett. 92, 167903 (2004).
[22] J. Sherson, H. Krauter, R.K. Olsson, B. Julsgaard, K. Hammerer, I. Cirac and E.S. Polzik, Nature 443, 557 (2006).
[23] L. Childress et al., Science 314, 281 (2006).
[24] L. Rippe, B. Julsgaard, A. Walther, Y. Ying and S. Kröll, Phys. Rev. A 77, 022307 (2008).
[25] K. Audenaert and S. Scheel, in preparation.
[26] W.W. Hager, SIAM Review 31, 221 (1989).
[27] S. Kotz, N. Balakrishnan and N.L. Johnson, Continuous Multivariate Distributions, Volume 1: Models and Applications, Second Edition (Wiley, New York, 2000).
[28] D.S. Sivia, with J. Skilling, Data Analysis: a Bayesian Tutorial, second edn. (Clarendon Press, Oxford, 2006).
[29] P.S. Laplace, Mémoires de l'Académie Royale des Sciences, 621 (1774).
[30] A.R. Thatcher, J. Roy. Statist. Soc. Series B (Methodological) 26, 176 (1964).
[31] S.E. Fienberg and P.W. Holland, J. Am. Statist. Assoc. 68, 683 (1973).
[32] M. Abramowitz and I.A. Stegun (eds.), Handbook of Mathematical Functions (Dover, New York, 1972).
[33] Z. Hradil, D. Mogilevtsev and J. Řeháček, Phys. Rev. Lett. 96, 230401 (2006).
[34] D. Mogilevtsev, J. Řeháček and Z. Hradil, Phys. Rev. A 75, 012112 (2007).
[35] R. Barlow, Statistics (Wiley, New York, 1989).
[36] N.L. Johnson, S. Kotz and N. Balakrishnan, Continuous Univariate Distributions, Volume 1, Second Edition (Wiley, New York, 1994).
[37] M. Evans and T. Swartz, Statistical Science 10, 254 (1995).
[38] D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge University Press, Cambridge, 2003).
[39] J.F. Sturm, Optimization Methods and Software 11, 625–653 (1999). Free software (running under Matlab) available at http://sedumi.mcmaster.ca.
[40] E.T. Jaynes, Phys. Rev. 106, 620 (1957).
[41] J. Řeháček and Z. Hradil, "MaxEnt assisted MaxLik tomography", Proc. MaxEnt 2003 (2003 August 3–8, Jackson Hole, WY, USA); see also arXiv:physics/0404121 (2004).
[42] L. Vandenberghe and S. Boyd, SIAM Review 38, 49 (1996).
[43] N.K. Langford, PhD Thesis, University of Queensland (2007).
[44] K. Banaszek and I.A. Walmsley, Opt. Lett. 28, 52 (2003).
[45] D. Achilles, Ch. Silberhorn, C. Sliwa, K. Banaszek and I.A. Walmsley, Opt. Lett. 28, 2387 (2003).
[46] A. Wald, Trans. Amer. Math. Soc. 54, 426 (1943).