Bayesian parameter estimation of mis-specified models
Johannes Oberpriller and Torsten Enßlin
Max-Planck-Institut für Astrophysik, Karl-Schwarzschildstr. 1, 85748 Garching, Germany
Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 Munich, Germany
Fitting a simplifying model with several parameters to real data of complex objects is a highly nontrivial task, but it opens the possibility to gain insights into the objects' physics. Here, we present a method to infer the parameters of the model, the model error, as well as the statistics of the model error. This method relies on analyzing many data sets simultaneously in order to overcome the problems caused by the degeneracy between model parameters and model error. Errors in the modeling of the measurement instrument can be absorbed into the model error, allowing for applications with complex instruments.
Keywords: Information theory, Bayesian methods, Data analysis, Model Fitting
I. INTRODUCTION

A. General problem
Whenever phenomena of nature are not directly accessible, it is common to build a model of them. These models usually have parameters, for which one has to find the best fitting values for each phenomenon. Unfortunately, this process of fitting and modeling faces two main problems. On the one hand, models are often derived from theoretical analysis of the underlying processes. However, theoretical models describing complex real-world systems are necessarily idealizations. The implied simplifications allow us to focus on the essential physics, to keep the model computationally feasible, and to investigate systems for which not all components are perfectly known. On the other hand, there is always measurement noise in the observation process.

Judged on a goodness-of-fit basis, theoretical models often perform insufficiently when confronted with real data of the actual system. The difference between real and modeled data can easily exceed the error budget of the measurement. The reason is that the idealized models do not capture all aspects of the real systems, but aspects that are not modeled imprint on the data as well. To discriminate these from measurement errors or noise, we use the term model errors for these imperfections of our theoretical description. A common approach to determine the model parameters from the data is a likelihood-based methodology (χ²-fit [1, 2], maximum likelihood [3], or Bayesian parameter estimation [4]) that uses the measurement uncertainties as a metric in data space. The resulting parameter estimates can be strongly distorted by the desire of the method to minimize all apparent differences between predicted and real data, irrespective of whether these differences occur due to measurement or model errors.

Thus, the model error should be included in the error budget of the parameter estimation as well.
However, this requires a model for the not yet captured aspects of the system, or at least for the model errors produced by the missing aspects. In a rigorous treatment, we would need a case-by-case analysis of the missing physics. However, this would be completely impractical in cases like the one we have in mind: the spectroscopic investigation of complex astrophysical objects. Instead, this paper aims to construct a plausible, but by no means perfect, effective description of the model errors and their uncertainty. We want to show how the estimation of the model parameters is improved by taking a model error into account.

B. Example of the general problem in astrophysics: Spectroscopic investigation of complex astrophysical objects

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." [5]. This quote of George E. P. Box, a pioneer of Bayesian inference, applies well to astrophysical models. An example of the process of modeling and fitting is the task of fitting stellar models to observed stellar spectra. Many different physical conditions determine the spectrum of a star, e.g. stellar parameters like the effective temperature T_eff, the logarithm of the gravitational surface acceleration log g, the metallicity [Fe/H], and abundances for almost the full periodic table. A usual approach to determine stellar conditions is fitting the data to synthetic model spectra.

The calculation of synthetic spectral models is normally done via radiative transfer calculations. However, if one accounted for all physical conditions in these calculations, the models could not be kept computationally feasible. Thus, some simplifications have to be made, e.g. local thermodynamic equilibrium (LTE) or plane-parallel geometry.
In the resulting star models only the most important physical parameters are included. We will denote physical conditions included in the modeling process with p, while not modeled physical conditions will be denoted with c. Additionally, it is still very hard to produce high-quality synthetic data due to inaccuracies in atomic and molecular constants. Thus, it is almost impossible to ensure that spectral lines fit perfectly over all stellar parameters and spectral ranges. Due to the inaccuracy of the modeling process, stellar spectra can only be described approximately. As an illustrative, simple model for an absorption spectrum, accounting for a gray-body spectrum with absorption lines, one could use the following:

t_ν(p) = A (ν/ν₀)^α e^{−ν/T_eff} e^{−τ_ν} , with τ_ν = Σ_{i=1}^{3} t_i G(ν − ν_i, σ_i) ,   (1)

where G(ν − ν_i, σ_i) denotes a Gaussian in ν with mean ν_i and standard deviation σ_i. The exponent of the power law α, the effective temperature T_eff, and the strength parameters of the absorption lines t₁, t₂, t₃ are assumed to be the unknown parameters, while the amplitude A, the reference frequency ν₀, the centers of the absorption lines ν₁, ν₂, ν₃, and the line widths σ₁, σ₂, σ₃ are known. This simplifying model does not account for the gravitational surface acceleration log g, the metallicity [Fe/H], and all abundances of the periodic table except for three.

Assuming one determines the correct parameter values p of the star models, the not modeled conditions will present themselves as a model error. However, we have observations of many stars. If some of them are fitted with the same model, this gives us a chance to at least identify and characterize spectral structures that vary from star to star due to model errors. Then, the model errors can be statistically added to the error budget.
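As an illustration, the toy spectrum of eq. (1) can be sketched in a few lines of Python; all numerical values below are invented for demonstration and are not taken from the paper.

```python
import numpy as np

def gaussian(nu, mu, sigma):
    """Gaussian line profile G(nu - mu, sigma), normalized to unit area."""
    return np.exp(-0.5 * ((nu - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def spectrum(nu, A, nu0, alpha, T_eff, lines):
    """Gray-body spectrum with Gaussian absorption lines, cf. eq. (1).

    lines: list of (strength t_i, center nu_i, width sigma_i) tuples.
    """
    tau = sum(t * gaussian(nu, nu_i, s) for t, nu_i, s in lines)
    return A * (nu / nu0) ** alpha * np.exp(-nu / T_eff) * np.exp(-tau)

nu = np.linspace(1.0, 10.0, 128)
t = spectrum(nu, A=1.0, nu0=1.0, alpha=1.5, T_eff=5.0,
             lines=[(0.5, 3.0, 0.2), (0.8, 5.0, 0.3), (0.3, 7.0, 0.2)])
```

The spectrum stays strictly positive because the absorption enters through the factor e^{−τ_ν}, a property the measurement model in sec. IV will rely on.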
However, parameter determination and determination of the statistics of the model errors are coupled, which is why they must be performed together. Here we use a generic, hierarchical Bayesian model based on Gaussian processes to describe the model errors, shown schematically in fig. 1.

The structure of this work is the following. In sec. II we explain the general problem of parameter estimation with mis-specified models and give an overview of the state-of-the-art methods tackling this problem. A brief discussion of the Wiener filter is given in sec. III, since understanding it is crucial for our algorithms. In sec. IV we state and explain our model for the model error and its uncertainty, and furthermore discuss our prior assumptions. In sec. V we derive and present the inference algorithm and show its capability on mock data. Finally, we compare the inferred results with a least-squares parameter estimation.

II. PROBLEM SETTING AND STATE OF THE ART
Computer models simulating star spectra are invaluable tools in astrophysics. Unfortunately, there is typically uncertainty in the predictions of synthetic star spectra models. The arising errors can be categorized into two main sources. The first one results from the use of an inadequate model with the correct set of parameter values. This means one has found the true parameter values for the wrong model; this is called the model error. The second one is the uncertainty in the parameters given the true model, which is called the parameter error. When modeling and computing stellar spectra, both errors are present simultaneously.

FIG. 1. Schematic representation of the connection between the physical conditions on the star and a model error due to not modeled conditions. Modeled conditions enter via the model parameters p, other (unmodeled) conditions c via the model error u(c) ∼ G(u, U); the stellar spectrum is s = t(p) + u(c), and the observed data are d = Rs + n.

Let us denote by t* the true forward operator of the model M and by t another forward operator, e.g. the forward operator used for the parameter estimation. Let p* be the true set of parameter values and p another set of parameter values. Then the parameter error is the difference between t*(p) and t*(p*),

u_par = t*(p*) − t*(p) ,   (2)

which is independent of t, but its quantification would require knowledge of t*. However, t* is not available to us, as it represents the true physics. This raises the question of a meaningful parameter error evaluation, especially in the absence of a model error model. The direct evaluation of model errors, defined as

u_model = t*(p*) − t(p*) ,   (3)

is subject to large uncertainty and has only little use in practice. The importance of model errors in parameter estimation is widely recognized, but there exists only little research about it. Thus, no general methodology for quantifying the impact of model errors on parameter estimation and their influence on parameter error analysis is available yet.

In the literature, the forward problem used to estimate the unknown parameters is commonly described in the following way:

d = t(p) + e ,   (4)

where d is the observed data, t(p) the forward operator belonging to the mathematical model M, and e is the error. The error e is the sum of a measurement error n and a model error u,

e = n + u .   (5)

The inverse problem is formulated as finding the best fitting parameters p.
Usually this is achieved by minimizing the residual e via a classical or weighted least-squares method. In this method, p̂ is obtained by minimizing the sum of squared differences,

p̂ = argmin_p [d − t(p)]† E^{−1} [d − t(p)] ,   (6)

where E represents the covariance of the error e. The underlying assumption of this formulation of least squares is that the measurement error and the model error have the same statistics. However, there are two main problems. On the one hand, the model error is not random but systematic, and it will certainly not have the same statistics as the measurement error. For different measurements of objects that are described by the same model and the same parameters, the model error will have a comparable strength at different positions and a certain correlation structure. On the other hand, the covariance of the model error is not known a priori and can only be estimated. Thus, this approach is almost useless if the model error u dominates the overall error e.

Another main problem in estimating parameters and model error is that several approaches do not give a unique solution for the parameters p, due to the degeneracy between the model error and the parameters. One way to tackle this degeneracy is to add prior information on the parameters. The objective function minimized to find the best fitting parameters then reads

H = [d − t(p)]† E^{−1} [d − t(p)] + (p − p_e)† P^{−1} (p − p_e) ,   (7)

where P is the covariance matrix of the parameters and p_e is an estimated set of parameter values. The underlying assumption is that the parameters are Gaussian distributed. This approach allows one to derive a posterior uncertainty of the parameters p, in which all plausible sets of parameter values should be represented.
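For concreteness, the objective of eq. (7) can be minimized numerically for a toy linear model; the model, data, and covariances below are invented for illustration and are not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linear model t(p) = p[0]*x + p[1]; diagonal covariances E and P.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
p_true = np.array([2.0, -1.0])
d = p_true[0] * x + p_true[1] + 0.1 * rng.standard_normal(50)

E_inv = np.full(50, 1.0 / 0.1**2)   # inverse error covariance (diagonal)
p_e = np.zeros(2)                   # estimated parameter values p_e
P_inv = np.full(2, 1.0)             # inverse prior covariance (diagonal)

def objective(p):
    """Objective of eq. (7); an overall factor 1/2 does not change the minimizer."""
    r = d - (p[0] * x + p[1])
    return 0.5 * r @ (E_inv * r) + 0.5 * (p - p_e) @ (P_inv * (p - p_e))

p_hat = minimize(objective, np.zeros(2)).x
```

With a strong likelihood (many data points, small noise) the prior term barely shifts the estimate; the text's point is that this balance breaks down once the residual is dominated by a correlated model error instead of white noise.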
Unfortunately, the model error is again not directly incorporated in the model of the data observation, and this approach has the same drawbacks as least squares.

Assuming we know the measurement error distribution, we can write the solutions of the inverse problem for d = t(p) + u + n as a set

M_s = {(p, u) : s = t(p) + u} ,   (8)

where s is the quantity of interest. This shows that the degenerate solutions for the model error and the parameters lie on a manifold, on which the solutions are equally distributed in the absence of prior knowledge.

The authors of [6] investigated the role of the priors on the model parameters and model errors in the inference. For one data set, all points of M_s have equal likelihood, and the posterior distribution on this manifold is completely determined by the prior. The probability to obtain the true values of p and u will be high if the prior has a high density at the correct solutions relative to the other solutions. In general, one can distinguish two extreme cases.

For the first case, the authors of this study considered a weak prior on p and a diffuse, zero-mean prior distribution for u. It was found by [7] that the posterior will be flat over the set of (p, u) that result in the same s. As the authors of [7] assumed a zero-mean prior on u, the probability in regions where u is near zero will be higher. Thus, the estimated model parameters p will be similar to a maximum a posteriori estimate, which may not cover the true values of p.

The second case is given if realistic prior information for u and/or p is available, which leads to a tighter posterior distribution. Unfortunately, if the prior information is wrong, a low probability is attached to the true values of p, and the posterior uncertainty may fail to cover them.
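The degeneracy expressed by eq. (8) is easy to demonstrate numerically: for a single data set, arbitrarily many pairs (p, u) reproduce the same signal exactly. The toy model below is hypothetical and only serves to make this point.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 32)

def t(p):
    """Toy forward model: a straight line with slope p."""
    return p * x

s = t(2.0)  # "true" signal, generated with p* = 2 and u* = 0

# Any parameter value becomes an exact fit once the discrepancy is
# absorbed into the model error: u = s - t(p). All such pairs lie on M_s.
candidates = [(p, s - t(p)) for p in (0.5, 1.0, 2.0, 3.0)]
residuals = [np.max(np.abs(s - (t(p) + u))) for p, u in candidates]
```

Every candidate reproduces s to machine precision, so without a prior the likelihood alone cannot distinguish them; this is exactly why the paper resorts to priors and to many data sets sharing the same model error statistics.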
If a strong prior for the parameters is available from other observations that do not rely on the modeled process, the quality of the model error and model parameter estimation is higher.

As we have seen, there are many subtleties in these methods. Thus, we will now discuss the main problems of the approaches in the literature tackling the issue of a mis-specified model and give a conceptual overview of our ideas. Trying to isolate the effect of the model errors on the parameter estimation with the help of eq. 7 leads to the following problems.

The first problem is that the model error is not random and certainly does not have the same statistics as the measurement error. To address this issue, we model the model error as its own quantity with its own statistics.

The second problem is that although the inclusion of prior information on the parameters in the objective function reduces the ill-posedness of the inverse problem, it does not guarantee the uniqueness of the solution. As we are in the comfortable situation of having more than one data set with approximately the same parameter values for our analysis, the ill-posedness can be reduced further.

The third problem is that each data set can be fitted in different ways. Thus, one parameter set can be the best fit for the first data set, while another set of parameter values can be the most appropriate for the next data set. We will analyze all available data sets at the same time, but do not assume that all data sets have the same true parameters. Note that this is not the same as combining the results of separate analyses of each data set, since the different data sets share the same model error statistics.

III. WIENER FILTER THEORY
In the following, we give a short overview of the Wiener filter. Stellar spectra, which we will also call the signal s in this work, are physical fields. Thus they have infinitely many degrees of freedom, which motivates the usage of the framework of information field theory (IFT) [8], in which the Wiener filter appears as the solution of the free field case. We observe a single stellar spectrum with the help of a measurement instrument that somehow discretizes a field, e.g. by integrating and averaging. In our mathematical description we represent the measurement instrument as a linear operator R. Further, we assume the instrument to be well known, such that errors arising from the modeling of the measurement instrument can be ignored. The measured data d can be expressed by applying R to the signal s, corrupted with noise n. The measurement equation describing this reads

d = Rs + n .   (9)

In particular, we assume the noise n to be Gaussian, i.e.

n ∼ G(n, N) ,   (10)
G(n, N) = |2πN|^{−1/2} e^{−(1/2) n† N^{−1} n} ,   (11)

where n† denotes the complex conjugate and adjoint of n, and N represents the covariance of the noise distribution. Mathematical conventions in this formula, e.g. how an operator acts on a field and how the scalar product is defined, are explained in appendix A 1. Using the measurement eq. 9, the likelihood of the data given the signal and noise realization can be derived. The conditional probability P(d | s, n) tells us how probable the data is given the noise and the signal:

P(d | s, n) = δ[d − (Rs + n)] .   (12)

Since we do not know the noise, we marginalize it out:

P(d | s) = ∫ Dn P(d, n | s) = ∫ Dn P(d | s, n) P(n) = G(d − Rs, N) .   (13)

For the task of inferring the signal s from the data d, we have to inspect the posterior distribution. This is provided by Bayes' theorem [9],

P(s | d) = P(d | s) P(s) / P(d) .   (14)

In analogy to statistical physics, Bayes' theorem is reformulated by introducing the information Hamiltonian H(s, d) and the partition function Z(d):

P(s | d) = P(d | s) P(s) / P(d) = (1/Z(d)) e^{−H(d, s)} , with   (15)
H(s, d) = −ln[P(s, d)] ,   (16)
Z(d) = ∫ Ds e^{−H(s, d)} = P(d) .   (17)

We will assume a Gaussian prior for the signal s, i.e.

P(s) = G(s, S) ,   (18)

where for the moment the prior covariance S of the signal s is assumed to be known. Together with the data model and Bayes' equation, this allows us to calculate the information Hamiltonian H(s, d), which is given by

H(s, d) = H(d | s) + H(s) = (1/2)(d − Rs)† N^{−1} (d − Rs) + (1/2) s† S^{−1} s + (1/2) ln|2πN| + (1/2) ln|2πS| .   (19)

A quadratic completion leads to a more conventional form of the information Hamiltonian, where we drop terms not depending on s, as denoted by the equality up to constants "≙":

H(s, d) ≙ (1/2)(s − m)† D^{−1} (s − m) ,   (20)
D = (S^{−1} + R† N^{−1} R)^{−1} ,   (21)
j = R† N^{−1} d ,   (22)
m = Dj .   (23)

As mentioned before, the posterior is the quantity of interest in the inference, and a short calculation gives

P(s | d) ∝ e^{−H(s, d)} ∝ e^{−(1/2)(s − m)† D^{−1}(s − m)} ∝ G(s − m, D) .   (24)

Since the l.h.s. and the r.h.s. of eq. 24 are normalized,

P(s | d) = G(s − m, D)   (25)

holds exactly. Here we introduced the mean m = D R† N^{−1} d, which can be obtained by applying the information propagator D = (S^{−1} + R† N^{−1} R)^{−1} to the information source j = R† N^{−1} d. The operation D R† N^{−1} is known as the Wiener filter [10]. In many real-world cases the prior uncertainty structure of the signal s and of additional fields is not known a priori, and one would like to obtain posterior uncertainties. One way to counteract these problems is proposed in [11, 12], which reconstructs the uncertainty structure alongside the signal simultaneously. We will adopt this improved formulation in the following.

IV. MODEL ERROR MODEL
In this section we motivate our model for the model error and its uncertainty. Furthermore, we derive the information Hamiltonian for this specific choice of model error uncertainty. As our goal is to identify the model parameters of a theoretical model, we have to build a model for the model error. Additionally, we have to connect the model error to the observed data.

Assuming we have spectral observations of n stars, we denote the spectral data of star i with d_i, and the data vector d = (d_1, ..., d_n) describes the ordered collection of all data sets. The measurement equation is given in eq. 9, where the signal can be described as a parametric model t(p) corrupted with a model error u. However, in the case of a spectroscopic analysis of an absorption spectrum, we want to ensure that the signal is positive. Therefore, we adopt the following measurement model:

s = t(p) + u ,   (26)
d = R e^{t(p)+u} + n .   (27)

This change to a positive definite signal has two main consequences. On the one hand, the model error e^u is now multiplicative to the model e^{t(p)}, which is often a more accurate description than an additive model error, as it allows a better separation of systematic model error and measurement error [13]. On the other hand, to compare the model to the data, we have to take the logarithm of the model. In the case of our spectroscopic model in eq. 1, this is given by

t(p) → ln[t_ν(p)] = ln(A) + α ln(ν/ν₀) − ν/T_eff − Σ_{i=1}^{3} t_i G(ν − ν_i, σ_i) .   (28)

The consequence is that the model is now linear in every parameter except for T_eff.

For a rigorous statistical analysis we need a prior P(p), which is also necessary to reduce the degeneracy between the parameters and the model error. However, we also have to build a model for the statistics of the model errors u. We assume the model error to be Gaussian with a model error uncertainty covariance U.
By analyzing data sets with the same model forward operator t and the same model-generating process, we can assume that the model errors u_i follow the same statistics. For the model error uncertainty covariance U we try to be as general as possible, for the reasons explained in sec. II. On the one hand, we have to take care of locally varying errors, e.g. peaks in stellar spectra which are not modeled. This can be handled with a diagonal matrix (e^θ)∧ in position space (here ν, parameterized by the field θ). On the other hand, we think that for some part of the model error budget no position should a priori be singled out. Thus, statistical homogeneity is assumed, which is, according to the Wiener-Khinchin theorem [10, 14], represented by a diagonal matrix (e^τ)∧ in Fourier space (here the Fourier domain conjugate to ν). For signals with more than one dimension, statistical isotropy might be assumed as well. This means assuming all directions to be statistically equal, so that this part of the model error correlation depends only on the relative distance of the compared locations. Thus, statistical isotropy leads to a collapse of the Fourier-diagonal part of the correlation structure to a one-dimensional power spectrum function e^τ = e^{τ(κ)} of the harmonic modes κ. This is incorporated into the uncertainty through the power projection operator P, which is explained in appendix A 2, and the Fourier transform operator F. One simple way to combine both parts of the uncertainty structure into the uncertainty covariance is a sandwich product of both, which is given by

u ∼ G(u, U) ,   (29)
U = (e^{θ_x})∧ F† (P† e^{τ_y})∧ F (e^{θ_x})∧ .   (30)

The index y = ln(k) indicates that we assume a logarithmic scale for the Fourier space coordinate of τ, and the index x denotes a linear position space coordinate for θ. We indicate the transformation of a field to a diagonal operator in the domain of the field by "∧".
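A one-dimensional sketch of how the covariance of eq. (30) acts on a field (mask, convolve, mask). The fields θ and τ below are arbitrary illustrative choices, and the power projection P is omitted since the example is one-dimensional anyway.

```python
import numpy as np

npix = 128
x = np.linspace(0.0, 1.0, npix)
theta = -2.0 + 1.5 * np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)  # log-mask, peaked at x = 0.5
k = np.abs(np.fft.fftfreq(npix) * npix)
tau = -2.0 * np.log1p(k)                                    # falling log power spectrum

def apply_U(f):
    """Apply U = diag(e^theta) F^dag diag(e^tau) F diag(e^theta) to a field f."""
    g = np.exp(theta) * f                          # first mask
    g = np.fft.ifft(np.exp(tau) * np.fft.fft(g))   # convolution, diagonal in Fourier space
    return (np.exp(theta) * g).real                # second mask

f = np.zeros(npix)
f[64] = 1.0
Uf = apply_U(f)  # one column of U: the correlation structure around pixel 64
```

Because the convolution kernel is real and even in k and the masks are diagonal, this operator is symmetric and positive definite, as a covariance must be.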
In this parameterization, the model error uncertainty is a multiplicative combination of a mask in position space, (e^{θ_x})∧, and a convolution. Any field processed by U (and also by its inverse) first goes through the (inverted) mask, then gets convolved (deconvolved), and at last the (inverted) mask imprints on the field again.

Again, for the sake of a statistically rigorous analysis, we need priors for τ and θ. To account for the smoothness of the power spectrum on a logarithmic scale, a smoothness prior is adopted for τ. Similarly, a smoothness prior on a linear scale in position space is adopted for θ:

H(τ) = (1/(2σ_τ²)) ∫ dy (∂²τ/∂y²)² = (1/2) τ† T τ ,   (31)
H(θ) = (1/(2σ_θ²)) ∫ dx (∂²θ/∂x²)² = (1/2) θ† Θ θ .   (32)

The strength parameters of the curvature, σ_θ and σ_τ, represent the tolerated deviation by weighting the L²-norm of the second derivative. In the limit σ_{θ/τ} → ∞ the expected roughness is very high, and therefore rough realizations are not suppressed. In the opposite limit of small σ_{θ/τ}, rough realizations are strongly penalized and the smoothness priors become the driving forces of the inference.

Since the priors on θ and τ only punish curvature, there is a degeneracy in U. We can add a constant a ∈ R to θ if we subtract at the same time 2a from τ. This operation leaves the functional form of U invariant. As one is not interested in the individual values of τ or θ, this does not affect the Hamiltonian. One can use the degeneracy to control the numerical stability of the algorithm by keeping θ and τ in numerically stable regions.

In the following, the model error parameters (τ, θ) will be merged into the vector q = (τ, θ). Thus, if we talk about parameters without further specification, both p and q are meant. The hierarchical Bayesian network summarizing our assumptions is shown in fig. 2.

Similar to the Wiener filter theory in sec.
III, we introduce the posterior information Hamiltonian H(p, τ, θ, u | d) and the partition function Z(d):

P(p, τ, θ, u | d) = (1/Z(d)) e^{−H(d, p, τ, θ, u)} , with   (33)
H(d, p, τ, θ, u) = −ln[P(d, p, τ, θ, u)] ,   (34)
Z(d) = ∫ Dp Dτ Dθ Du e^{−H(d, p, τ, θ, u)} = P(d) .   (35)

The resulting full Hamiltonian can be calculated as before, such that we get

H(p, τ, θ, u, d) = (1/2) {d − R e^{t(p)+u}}† N^{−1} {d − R e^{t(p)+u}} + (1/2) ln|2πN| + H(p) + (1/2) u† U^{−1} u + (1/2) ln|2πU| + (1/2) τ† T τ + (1/2) θ† Θ θ + (1/2) ln|2πT| + (1/2) ln|2πΘ| .   (36)

The model error u is assumed to be generated by drawing from a Gaussian distribution which is diagonal neither in position nor in Fourier space. However, one can draw from a Gaussian distribution in its eigenbasis by weighting an independent, white excitation field ξ with unit variance by the square root of the eigenvalues. We adopt this idea and reformulate our problem in the following way:

u = Aξ , with A = (e^{θ_x})∧ F† (P† e^{τ_y/2})∧ , such that AA† = U .   (37)

Here we introduced the amplitude operator A. The change in causality can be seen in fig. 3, where one observes that now all unknown fields directly interact with the data. This reparametrization follows the ideas of [15, 16]. Inserting the change of variables from u to ξ into the Hamiltonian gives

H(p, τ, θ, ξ, d) = (1/2) {d − R e^{t(p)+Aξ}}† N^{−1} {d − R e^{t(p)+Aξ}} + H(p) + (1/2) ξ† ξ + (1/2) θ† Θ θ + (1/2) τ† T τ + H₀ .   (38)

FIG. 2. Graphical model of the hierarchical Bayesian network introduced in sec. IV. The parameters in the top row, σ_τ and σ_θ, as well as P(p), have to be specified by the user.
The parameters characterizing the model error uncertainty, θ and τ, as well as the model parameters p, will be inferred by the algorithm.

All terms that are unimportant for the later procedure are collected in H₀:

H₀ = (1/2) ln|2πT| + (1/2) ln|2πΘ| + (1/2) ln|2πN| + (1/2) ln|2π𝟙| .   (39)

These terms will not affect the further inference. For the inference we will need the posterior Hamiltonian

H(p, τ, θ, ξ | d) = H(p, τ, θ, ξ, d) + ln[Z(d)] ,   (40)

with Z(d) defined in eq. 35. Since the partition sum does not depend on ξ, q, and p, we can ignore it in the following.

FIG. 3. Causality structure of the classical formulation of our problem (left) and of its reformulation (right).
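Eq. (37) suggests a simple recipe for drawing model error samples: generate a white excitation field ξ and apply the amplitude operator A. A one-dimensional sketch with invented θ and τ (again omitting the power projection P):

```python
import numpy as np

npix = 128
x = np.linspace(0.0, 1.0, npix)
theta = -2.0 + np.sin(2.0 * np.pi * x)   # illustrative log-mask field
k = np.abs(np.fft.fftfreq(npix) * npix)
tau = -2.0 * np.log1p(k)                 # illustrative log power spectrum

def A_apply(xi):
    """Amplitude operator A: convolve with the kernel e^(tau/2), then mask,
    so that A A^dag reproduces the sandwich covariance U up to FFT conventions."""
    return (np.exp(theta) * np.fft.ifft(np.exp(0.5 * tau) * np.fft.fft(xi))).real

rng = np.random.default_rng(1)
u = A_apply(rng.standard_normal(npix))   # one model error sample u = A xi
```

The pointwise variance of many such samples traces e^{2θ}, i.e. the mask controls where the model error is allowed to be large, while τ controls its correlation length.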
V. METHOD
In this section we present the approximation of the posterior distribution via variational Bayes. We adapt the methods presented in [11, 12], which perform a minimization over all unknown fields at the same time and provide a posterior uncertainty estimate via samples.
A. Approximation of the posterior distribution
To use the implementation in the Numerical Information Field Theory package (NIFTy) [17, 18], we have to ensure that all priors used, except the noise prior, are white priors. Due to this fact, we have to reformulate our parameterization of U in eq. 30. As already mentioned, this reformulation is possible by weighting a white excitation field with the square root of the covariance. This can be done for the smoothness priors on θ and τ:

θ → A_θ ξ_θ , τ → A_τ ξ_τ , such that (e^θ)∧ → (e^{A_θ ξ_θ})∧ and (P† e^τ)∧ → (P† e^{A_τ ξ_τ})∧ .   (41)

The exact transformations and details on the chosen coordinates for the excitation fields are given in appendix A 3. Thus, with a slight abuse of notation, letting A_θ and A_τ denote the resulting diagonal operators, the covariance operator U takes the reformulated form

U = A_θ F† A_τ† A_τ F A_θ .   (42)

Now we can again use a variable transformation to write

u = Aξ , with A = A_θ F† A_τ .   (43)

The coordinate transformation of the parameters p into a coordinate system ξ_p, where the prior is white, is a little more subtle, since the prior for the parameters can have almost any form. However, as long as the cumulative distribution function of the prior can be calculated analytically, this coordinate transformation can always be applied [15, 16]. As a next step, one collects all unknown fields, namely ξ, ξ_θ, ξ_τ, ξ_p, in a multi-field φ. Now we can continue as before and calculate the full Hamiltonian

H(ξ_p, ξ_τ, ξ_θ, ξ, d) = Σ_{i=1}^{N_data} [ (1/2) {d_i − R e^{t(ξ_{p_i})+Aξ_i}}† N^{−1} {d_i − R e^{t(ξ_{p_i})+Aξ_i}} + (1/2) ξ_i† ξ_i + (1/2) ξ_{p_i}† ξ_{p_i} ] + (1/2) ξ_θ† ξ_θ + (1/2) ξ_τ† ξ_τ + H₀ .   (44)

Usually one would need to calculate the derivatives of this expression in order to apply a Newton scheme. However, since the model for the model error is constructed from NIFTy model objects, NIFTy is able to compute all derivatives with respect to ξ, ξ_θ, ξ_τ on its own.
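The whitening of a non-Gaussian parameter prior via its cumulative distribution function, as mentioned above, can be sketched as follows; the Gamma prior is an arbitrary example, not a prior used in the paper.

```python
import numpy as np
from scipy import stats

# Map a white excitation xi_p ~ N(0, 1) to a sample of an arbitrary prior
# via the inverse CDF: p = F_prior^{-1}(Phi(xi_p)).
rng = np.random.default_rng(3)
xi_p = rng.standard_normal(100_000)
u01 = stats.norm.cdf(xi_p)                   # Phi maps N(0,1) to Uniform(0,1)
p = stats.gamma(a=2.0, scale=1.5).ppf(u01)   # inverse CDF of the target prior
```

After this change of coordinates the inference operates on the standard-normal ξ_p, while the model internally evaluates t at the transformed, Gamma-distributed parameter values.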
Only the gradient with respect to the model parameters ξ_p has to be calculated and implemented, which is individual for each model t(ξ_p).

In order to apply a variational inference method, the posterior distribution has to be approximated. The advantage of the method implemented in NIFTy is that we can approximate the posterior as a Gaussian in all unknown fields,

P̃(φ | d) = G(φ − φ̄, Φ) ,   (45)

where φ̄ represents the mean of the multi-field φ and Φ is its full covariance matrix, which contains the cross-correlations of the different fields within the multi-field. Using this approximation, one minimizes the variational Kullback-Leibler divergence (D_KL), which is given by

D_KL = ∫ Dφ P̃(φ | d) ln[P̃(φ | d) / P(φ | d)] = ⟨H(φ | d)⟩_{G(φ − φ̄, Φ)} − (1/2) ln|2πeΦ| .   (46)

B. Numerical example
We demonstrate the applicability of the derived algorithm on one-dimensional synthetic data. The algorithm is implemented in Python using the NIFTy package. By comparing our algorithm with a least-squares algorithm, we show its capability. In our example the signal has a resolution of 128 pixels. As a prior for the parameters we used a Gaussian, such that

H(p_i) = (1/2)(p_i − p*)† P^{−1} (p_i − p*) ,   (47)

where we assumed the prior matrix P^{−1} to be diagonal. As discussed before, we have to transform the parameters such that they have a white prior. For the chosen prior this is straightforward, and the transformation is given by

p_i → p*_i + σ_i ξ_{p_i} ,   (48)

where p*_i is the prior mean of the i-th parameter and σ_i, the square root of the corresponding diagonal element of P, represents the expected deviation from the mean. We chose fixed values for the σ_i, and the mean of each data set was drawn uniformly from a fixed range for each parameter, with ξ_p drawn from the white prior. The corresponding models used in the example, with 32 different data sets and models, can be seen in fig. 4.

We produce the data sets according to eq. 27, where the signal field s is the sum of the model and the model error. By drawing a white excitation field ξ_θ and fixing the power spectrum to

P_{ξ_τ}(k) = 1/(k + 1)² ,   (49)

we construct the operator A. The noise covariance is a multiple of the identity, and the measurement instrument is a unit response. The gradient of the model with respect to the transformed parameters ξ_{p_i} = (ξ_{α,i}, ξ_{T_eff,i}, ξ_{t_1,i}, ξ_{t_2,i}, ξ_{t_3,i}) reads

t′(ξ_{p_i}) = ( σ_α ln(ν/ν₀) , ν σ_{T_eff}/(μ_{T_eff,i} + σ_{T_eff} ξ_{T_eff,i})² , −σ_{t_1} G(ν − ν_1, σ_1) , −σ_{t_2} G(ν − ν_2, σ_2) , −σ_{t_3} G(ν − ν_3, σ_3) ) .   (50)

Fig. 5 shows one of the spectra with the corresponding model. The model with its free parameters is not able to capture all aspects of the spectrum. Conventional algorithms would try to explain all differences through the model, whether they are due to the model error or to the model itself. In fig. 6 the 32 different model errors and the square root of the covariance U are shown, which illustrates the correlation structure and the difficulty of reconstructing a spatially varying uncertainty covariance. One can observe that for this particular U the model error mainly imprints on the spectra in their ascending parts and where the absorption lines are present.

Fig. 7 shows a comparison between our method and a least-squares method (LS) for the estimation of the model parameters. Using multiple records can cause the optimization scheme to go to a local minimum.
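The mock-data generation described above (eq. 27 with a unit response and model error statistics shared by all data sets) can be sketched as follows; the resolution matches the example, but all other numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
npix = 128
nu = np.linspace(1.0, 10.0, npix)

def log_model(alpha, T_eff):
    """Simplified log-model: power law plus gray-body factor, no lines."""
    return alpha * np.log(nu / nu[0]) - nu / T_eff

# Correlated model error with a correlation structure shared by all data sets.
k = np.abs(np.fft.fftfreq(npix) * npix)
amplitude = 0.1 / (1.0 + k)

def draw_u():
    """One model error sample with the fixed correlation structure."""
    return np.fft.ifft(amplitude * np.fft.fft(rng.standard_normal(npix))).real

# d = exp(t(p) + u) + n with unit response R; one parameter draw per data set.
sigma_n = 0.01
datasets = []
for _ in range(8):
    p = (rng.uniform(1.0, 2.0), rng.uniform(4.0, 6.0))  # (alpha, T_eff)
    s = log_model(*p) + draw_u()
    datasets.append(np.exp(s) + sigma_n * rng.standard_normal(npix))
```

Each data set has its own parameter values and its own model error realization, but all realizations are drawn from the same covariance, which is exactly the structure the joint inference exploits.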
After some tests, we decided to divide the likelihood Hamiltonian by the number of data sets used in the particular inference, similar to the solution of this problem proposed by [19]. We can observe that our method is able to learn from an increasing number of data sets, while the LS method estimates each data set on its own and thus does not improve with more data sets. However, our method saturates at an average parameter error of about 6% for 512 data sets, and the addition of more data sets does not further reduce the error, as for each model the parameters p_i still have to be found individually in the presence of the unknown noise realizations n_i and u_i.

FIG. 4. 32 different models with individual parameter values, which are used in the inference for the model errors and parameters.

The square root of the diagonal of the model error uncertainty covariance, \sqrt{\widehat{U}}, and its reconstruction, together with the root mean square error u_{\mathrm{RMS}}, are shown in Fig. 8. One would expect that the recovered uncertainty, as a conservative estimator, is in general smaller than the correct uncertainty. However, in our reconstruction of the model error uncertainty the peaks are modeled well, while the tails are overestimated.
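The white-prior reparametrization of Eqs. (47)–(48) that underlies this experiment amounts to an affine standardization of the parameters. The following NumPy sketch illustrates it; the prior means and standard deviations are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def to_physical(xi, p_star, sigma):
    """Map white excitations xi ~ N(0, 1) to physical parameters
    p = p* + sigma * xi, as in Eq. (48)."""
    return p_star + sigma * xi

def prior_hamiltonian(p, p_star, P_diag):
    """Gaussian prior Hamiltonian of Eq. (47) for a diagonal prior
    covariance P, given by its diagonal P_diag."""
    r = p - p_star
    return 0.5 * np.sum(r ** 2 / P_diag)

rng = np.random.default_rng(42)
p_star = np.array([1.0, 20.0, 1.5, 0.5, 0.3])  # illustrative prior means
sigma = np.array([0.2, 2.0, 0.2, 0.2, 0.3])    # illustrative prior std devs

xi = rng.standard_normal(5)
p = to_physical(xi, p_star, sigma)

# In the white coordinates the prior Hamiltonian reduces to 0.5 * xi^T xi:
h_white = 0.5 * np.sum(xi ** 2)
h_phys = prior_hamiltonian(p, p_star, sigma ** 2)
assert np.isclose(h_white, h_phys)
```

The assertion checks the defining property of the transformation: the Gaussian prior becomes a standard normal in the ξ coordinates.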
VI. CONCLUSION
Estimating the parameters of models is essential for gaining insight into physics. However, many physical objects are not understood in their full complexity. Thus, the model errors need to be included in the error budget, but this would require a case-by-case analysis of the missing physics and is therefore often impractical. To address these problems, this paper constructs a plausible and effective description of model errors and their uncertainty. To constrain the additional degrees of freedom, it relies on using several data sets at the same time in a joint inference.

We derive an algorithm to reconstruct the parameters of a model as well as the model error and its statistics from noisy data. By using more than one data set of the same type of stars, we obtained a more sophisticated estimate of the model error uncertainty. It was important to build a very general approximating description of the model error uncertainty in order to cope with most sorts of model errors. One can see that a few general assumptions about the nature of model errors, i.e. locality and smoothness, can be sufficient to reconstruct the model error uncertainty when one has access to a large set of data. The algorithm can be graphically visualized through a Bayesian hierarchical model. It was able to reconstruct the model errors as well as the model parameters.

Additionally, expectation values can be calculated through samples drawn from the inverse metric. By using approximate posterior samples, dealing with more complex uncertainty structures becomes computationally feasible. We conclude that this paper enables parameter estimation for mis-specified models by accounting for the model error and its properties within a Bayesian approach.

FIG. 5. One of the spectra with the corresponding model.

FIG. 6. 32 different model error realizations drawn from U.

FIG. 7. Comparison of the errors of a least-squares parameter estimation and our method for different numbers of data sets.

FIG. 8. The correct square root of the diagonal of U and its reconstruction together with the positive average error are shown.

[1] M. Nakanishi and L. G. Cooper, “Parameter estimation for a multiplicative competitive interaction model: Least squares approach,”
Journal of Marketing Research, vol. 11, no. 3, pp. 303–311, 1974.
[2] A. Legendre, Nouvelles Méthodes pour la Détermination des Orbites des Comètes (Classic Reprint). Fb&c Limited, 2018.
[3] H. White, “Maximum likelihood estimation of misspecified models,” Econometrica, vol. 50, no. 1, pp. 1–25, 1982.
[4] T. S. Jaakkola and M. I. Jordan, “Bayesian parameter estimation via variational methods,” Statistics and Computing, vol. 10, pp. 25–37, Jan. 2000.
[5] G. Box and N. Draper, Empirical Model-Building and Response Surfaces. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley, 1987.
[6] J. Brynjarsdottir and A. O’Hagan, “Learning about physical parameters: the importance of model discrepancy,” Inverse Problems, vol. 30, no. 11, p. 114007, 2014.
[7] M. C. Kennedy and A. O’Hagan, “Bayesian calibration of computer models,” Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 63, no. 3, pp. 425–464, 2001.
[8] T. A. Enßlin, M. Frommert, and F. S. Kitaura, “Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis,” Physical Review D, vol. 80, no. 10, p. 105005, 2009.
[9] T. Bayes, “An essay towards solving a problem in the doctrine of chances,” Phil. Trans. of the Royal Soc. of London, vol. 53, pp. 370–418, 1763.
[10] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, vol. 2. MIT Press, Cambridge, 1949.
[11] J. Knollmüller, T. Steininger, and T. A. Enßlin, “Inference of signals with unknown correlation structure from nonlinear measurements,” ArXiv e-prints, Nov. 2017.
[12] J. Knollmüller et al., (in preparation).
[13] Y. Tian, G. J. Huffman, R. F. Adler, L. Tang, M. Sapiano, V. Maggioni, and H. Wu, “Modeling errors in daily precipitation measurements: Additive or multiplicative?,” Geophysical Research Letters, vol. 40, no. 10, pp. 2060–2065.
[14] A. Khintchin, “Korrelationstheorie der stationären stochastischen Prozesse,” Mathematische Annalen, vol. 109, no. 1, pp. 604–615, 1934.
[15] M. J. Betancourt and M. Girolami, “Hamiltonian Monte Carlo for hierarchical models,” Dec. 2013.
[16] J. Knollmüller and T. A. Enßlin, “Encoding prior knowledge in the structure of the likelihood,” 2018.
[17] M. Selig, M. R. Bell, H. Junklewitz, N. Oppermann, M. Reinecke, M. Greiner, C. Pachajoa, and T. A. Enßlin, “NIFTY – Numerical Information Field Theory – A versatile Python library for signal inference,” Astronomy & Astrophysics, vol. 554, p. A26, 2013.
[18] T. Steininger, J. Dixit, P. Frank, M. Greiner, S. Hutschenreuter, J. Knollmüller, R. Leike, N. Porqueres, D. Pumpe, M. Reinecke, et al., “NIFTy 3 – Numerical Information Field Theory – A Python framework for multicomponent signal inference on HPC clusters,” arXiv preprint arXiv:1708.01073, 2017.
[19] R. Boudjemaa and A. Forbes, Parameter Estimation Methods for Data Fusion: Report to the National Measurement System Policy Unit, Department of Trade and Industry; from the Software Support for Metrology Programme. NPL Report CMSC, National Physical Laboratory, Centre for Mathematics and Scientific Computing, 2004.
Appendix A: Definitions

1. Symbolic conventions
The scalar product between a field a^\dagger and another field b is defined in the following way:

a^\dagger b = \int \mathrm{d}x\; a^*(x)\, b(x). \quad (A1)

An operator A acting on a field b is defined as

(A\, b)(x') = \int \mathrm{d}x\; A(x', x)\, b(x). \quad (A2)
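On a regular grid with pixel volume dx, the continuous definitions (A1) and (A2) reduce to volume-weighted sums and matrix products. A purely illustrative NumPy discretization:

```python
import numpy as np

def scalar_product(a, b, dx):
    """Discretized a^dagger b = integral dx a*(x) b(x), cf. Eq. (A1)."""
    return np.sum(np.conj(a) * b) * dx

def apply_operator(A, b, dx):
    """Discretized (A b)(x') = integral dx A(x', x) b(x), cf. Eq. (A2)."""
    return A @ b * dx

n, dx = 128, 0.1
x = np.arange(n) * dx
a = np.sin(x)
b = np.cos(x)

# The delta kernel discretizes to the identity matrix divided by dx,
# so applying it must return the field unchanged:
identity = np.eye(n) / dx
assert np.allclose(apply_operator(identity, b, dx), b)
```

The identity check makes the role of the volume factor dx explicit: it is what turns the matrix product back into an approximation of the continuous integral.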
2. Power projector
The power projection operator averages over all positions in Fourier space belonging to one spectral bin. Let f_\kappa be a component of a field over Fourier space. Then the corresponding component p_k of the field over the power space can be calculated by

p_k = P_{k\kappa} f_\kappa = \frac{1}{\rho_k} \int_{\kappa \in k} \frac{\mathrm{d}\kappa}{(2\pi)^{\mathrm{dim}}}\, f_\kappa, \quad (A3)

where \rho_k is the bin volume and dim indicates the dimensionality of f.
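A discrete version of the power projector in Eq. (A3) simply averages the Fourier-space field over the modes falling into each spectral bin. An illustrative one-dimensional NumPy sketch (the bin edges and test spectrum below are made up for demonstration):

```python
import numpy as np

def power_project(f, k, bin_edges):
    """Discrete analogue of Eq. (A3): average the Fourier-space field f
    over all modes kappa belonging to each spectral bin."""
    idx = np.digitize(k, bin_edges) - 1        # bin index of every mode
    nbins = len(bin_edges) - 1
    p = np.zeros(nbins, dtype=f.dtype)
    for b in range(nbins):
        sel = idx == b
        if sel.any():
            p[b] = f[sel].mean()               # (1/rho_k) * sum over the bin
    return p

npix = 128
k = np.abs(np.fft.fftfreq(npix) * npix)        # integer mode magnitudes
f = 1.0 / (k + 1.0)                            # example Fourier-space field
edges = np.arange(0, npix // 2 + 2)            # linear bin bounds ("lin")
p = power_project(f, k, edges)

# The k = 0 bin contains only the zero mode, whose value is 1:
assert np.isclose(p[0], 1.0)
```

Because f here depends only on |k|, each bin average reproduces the spectrum value of that bin, which is the behavior the projector is meant to have.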
3. Transformation to a white prior
Since a smoothness prior is best represented in Fourier space, where it is just a multiplication with a factor depending on k, \theta is transformed to a field \xi_\theta, which lives in Fourier space. Thus, the transformation of \theta to \xi_\theta is given by

\theta \to F^\dagger \sqrt{P^\dagger_{\mathrm{lin}} P_{\xi_\theta}(k)}\; \xi_\theta, \quad (A4)

where P_{\mathrm{lin}} is a power projector as defined in Eq. (A3) and P_{\xi_\theta}(k) = \frac{\sigma_\theta}{k+1} is the power spectrum representing the spectral smoothness. The subscript lin indicates that the power space has linear bin bounds. We will denote the exponential of \theta as

A = \widehat{\exp[\theta]} = \widehat{\exp\left[ F^\dagger \sqrt{P^\dagger_{\mathrm{lin}} P_{\xi_\theta}(k)}\; \xi_\theta \right]}, \quad (A5)

where the hat again denotes a diagonal operator. A reformulation for the convolutional part of the model error uncertainty U already exists in NIFTy. This reformulation splits the \tau field into two different fields, one describing the slope \xi_\phi and the other one describing the deviation from a linear slope \xi_\tau.
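The coloring in Eqs. (A4)–(A5) can likewise be sketched numerically. The toy version below assumes a one-dimensional grid with unit volume and skips the P_lin binning for brevity; the function names are hypothetical. It multiplies white Fourier excitations by \sqrt{P(k)}, transforms back to position space, and builds the diagonal operator with \exp(\theta) on its diagonal:

```python
import numpy as np

def theta_from_white(xi_theta, sigma_theta=1.0):
    """Toy version of Eq. (A4): color white Fourier excitations with
    sqrt(P(k)), P(k) = sigma_theta / (k + 1), then go back to position
    space (the P_lin binning of the paper is omitted here)."""
    n = xi_theta.size
    k = np.abs(np.fft.fftfreq(n) * n)
    amplitude = np.sqrt(sigma_theta / (k + 1.0))
    return np.fft.ifft(amplitude * xi_theta).real

def diagonal_exp_operator(theta):
    """Eq. (A5): the diagonal operator with exp(theta) on its diagonal."""
    return np.diag(np.exp(theta))

rng = np.random.default_rng(7)
xi = rng.standard_normal(64) + 1j * rng.standard_normal(64)
theta = theta_from_white(xi)
A = diagonal_exp_operator(theta)

# The exponential guarantees a strictly positive diagonal, i.e. a
# positive definite diagonal operator, whatever theta looks like.
assert np.all(np.diag(A) > 0)
```

The falling spectrum suppresses high-k excitations, which is exactly what encodes the smoothness assumption on \theta, and the exponentiation enforces positivity of the resulting amplitude operator.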