Inference with minimal Gibbs free energy in information field theory
Torsten A. Enßlin and Cornelius Weig
Max-Planck-Institut für Astrophysik, Karl-Schwarzschild-Str. 1, 85741 Garching, Germany
(Dated: October 23, 2018)

Non-linear and non-Gaussian signal inference problems are difficult to tackle. Renormalization techniques permit us to construct good estimators for the posterior signal mean within information field theory (IFT), but the approximations and assumptions made are not very obvious. Here we introduce the simple concept of minimal Gibbs free energy to IFT, and show that previous renormalization results emerge naturally. They can be understood as being the Gaussian approximation to the full posterior probability which has maximal cross information with it. We derive optimized estimators for three applications, to illustrate the usage of the framework: (i) reconstruction of a log-normal signal from Poissonian data with background counts and point spread function, as it is needed for gamma-ray astronomy and for cosmography using photometric galaxy redshifts, (ii) inference of a Gaussian signal with unknown spectrum, and (iii) inference of a Poissonian log-normal signal with unknown spectrum, the combination of (i) and (ii). Finally we explain how Gaussian knowledge states constructed by the minimal Gibbs free energy principle at different temperatures can be combined into a more accurate surrogate of the non-Gaussian posterior.
I. INTRODUCTION

A. Abstract inference problem
Measurements provide information on the signals we are interested in, encoded in the delivered data. How can this information be best retrieved? Is there a generic and simple principle from which optimal data analysis strategies derive? Can an information energy be constructed which – if minimized – provides us with the correct knowledge state given the data and prior information? And if this exists, how can this information ground state be found, at least approximately?

An information energy, to be minimized, would be very useful to have, since many of the existing minimization techniques, analytical and numerical, could then be applied to it. A number of such functions to be extremized in order to solve inference problems were proposed in the literature, like the likelihood, the posterior, or the entropy. The likelihood is the probability that the data has resulted from some signal. The posterior is the reverse: it is the probability that, given the data, some signal was the origin of it. Extremizing either of them certainly makes sense, but often ignores the presence of slightly less probable, but much more numerous possibilities in the signal phase space. Those have a much larger entropy and are therefore favored by maximum entropy methods. However, maximum entropy alone cannot be the inference-determining criterion, since it favors states of complete lack of knowledge, irrespective of the data. Thus some counteracting energy is required which provides the right amount of force to the inference solution. Here, we argue that the ideal information energy is provided by the Gibbs free energy, which combines both maximum entropy and maximum a posteriori (MAP) principles.

The Gibbs free energy has to be regarded as a functional over the space of possible probability density functions (PDF) of the signal given the data. The result of the minimization is therefore a PDF itself, and not a single signal estimate. Minimizing the Gibbs free energy maximizes the entropy within the constraints given by the internal energy. The latter is understood as the average of the negative logarithm of the joint probability function of signal and data, weighted with the PDF.

The usage of thermodynamical concepts for inference problems is not new, see e.g. [1, 2]. What is new here is that we develop this for signals which are fields, spatially distributed quantities with an infinite number of degrees of freedom, while using an approximate Gaussian ansatz for the PDF to be inferred. We thereby connect information field theory (IFT) [3–10], as a statistical field theory dealing with a huge number of microscopic degrees of freedom, to thermodynamics, as a means to generate simplified, but macroscopic descriptions of our knowledge. Thereby we find that former IFT results obtained with complex renormalization schemes in [9, 11] can easily be reproduced, and even be extended to more complicated measurement situations.

In the remainder of Sect. I we briefly introduce IFT, MAP, and maximum entropy. This motivates the minimal Gibbs free energy principle, which we formally derive in Sect. II, where we also show its equivalence to maximal cross information. The application of this principle to optimize approximations of the posterior of concrete inference problems is provided in Sect. III. There, the log-normal Poisson problem (Sect. III A) and the problem of reconstruction without known signal power spectrum (Sect. III B), as well as their combination (Sect. III C), are addressed.
Finally, we show how approximate posteriors obtained at different temperatures can be combined into a better posterior surrogate in Sect. IV, before we conclude in Sect. V.
B. Information field theory
Information theory describes knowledge states with probabilities. If $\Omega$ is the complete set of possibilities, and $A \subset \Omega$ is a subset, then $P(A) \in [0,1]$ describes the plausibility of $A$ being the case, with $P(A) = 1$ denoting $A$ being assumed to be sure, $P(A) = 0$ denoting $A$ being (assumed to be) impossible, and $0 < P(A) < 1$ expressing intermediate degrees of plausibility. Obviously $P(\Omega) = 1$ and $P(\emptyset) = 0$. The usual rules of probability theory apply, and generalize the binary logic of Aristotle to different degrees of certainty or uncertainty [12, 13]. In case the set of possibilities is a continuum, it makes sense to introduce a PDF $P(\psi)$ over $\Omega$, so that $P(A) = \int_A d\psi\, P(\psi)$. Each possible state $\psi$ can be a multi-component vector, containing all aspects of reality which are in the focus of our inference problem.

We might be interested in a sub-aspect of $\psi$, which we call our signal $s = s(\psi)$. The induced signal PDF is retrieved from a functional or path integral over the phase space of the possibilities of $\psi$ via $P(s) = \int \mathcal{D}\psi\, P(\psi)\, \delta(s - s(\psi))$. If $s$ is a field, a function over a physical space $V$, then $s = (s_x)_{x\in V}$ might be a vector in the Hilbert space $\Omega$ of all $L^2$-integrable functions over $V$, and $P(s)$ is then a probability density functional. Information theory for $s$ becomes IFT, which is a statistical field theory.

Inference on the signal $s$ from data $d$ is done from the posterior probability $P(s|d)$, which can be constructed from the joint PDF of signal and data $P(d,s)$ via

$$P(s|d) = \frac{P(d,s)}{P(d)} = \left.\frac{e^{-\beta H(d,s)}}{Z_\beta}\right|_{\beta=1},\qquad (1)$$

where $P(d,s) = \int_\Omega \mathcal{D}\psi\, P(d|\psi)\,\delta(s - s(\psi))\, P(\psi) = P(d|s)\,P(s)$ and $P(d) = \int \mathcal{D}s\, P(d,s)$. The second equality in (1) is just a renaming of the numerator and denominator of the first fraction, which highlights the connection to statistical mechanics. Thus we define the information Hamiltonian

$$H(d,s) = -\log P(d,s),\qquad (2)$$

the partition function including a moment generating source term $J$,

$$Z_\beta(d,J) = \int \mathcal{D}s\; e^{-\beta\,(H(d,s) + J^\dagger s)},\qquad (3)$$

and the inverse temperature $\beta = 1/T$ as usual in statistical mechanics. Here $s^\dagger$ is the transposed and complex conjugated signal vector $s$, leading to a scalar product $j^\dagger s = \int_V dx\, \bar{j}_x\, s_x$. The ad-hoc notion of temperature is as in standard simulated annealing practice. It permits us to narrow (for $T < 1$) or widen (for $T > 1$) the explored phase-space region with respect to the one of the joint PDF, and therefore is a useful auxiliary parameter. We show in Sect. II A that the well known thermodynamical equipartition theorem holds:

$$\langle H(s,d)\rangle_{(s|d)} - H(m,d) \approx \tfrac{1}{2}\, N_{\rm dgf}\, T,\qquad (4)$$

where $N_{\rm dgf}$ is the number of degrees of freedom and $m$ is the mean signal field as defined below in (5), which defines the ground-state energy. This relation can e.g. be used to check the correctness of an implementation of a signal phase-space sampling algorithm.
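As a minimal illustration of such a check, the following sketch (not from the paper; all numbers and the toy Gaussian posterior are assumed for illustration) draws samples from a tempered multivariate Gaussian and compares $\langle H\rangle - H(m)$ with $N_{\rm dgf}\, T/2$:

```python
# Quick check of the equipartition relation (4) with a toy Gaussian posterior,
# for which exact sampling is available; T is the sampling temperature.
import numpy as np

rng = np.random.default_rng(3)
N_dgf, T, nsamples = 50, 1.0, 100_000

A = rng.standard_normal((N_dgf, N_dgf))
D_star = A @ A.T / N_dgf + np.eye(N_dgf)          # assumed posterior covariance
m_star = rng.standard_normal(N_dgf)               # assumed posterior mean
D_inv = np.linalg.inv(D_star)

def H(s):                                         # quadratic (free) information Hamiltonian
    r = s - m_star
    return 0.5 * np.einsum('...i,ij,...j->...', r, D_inv, r)

L = np.linalg.cholesky(T * D_star)                # tempered posterior has covariance T * D_star
samples = m_star + rng.standard_normal((nsamples, N_dgf)) @ L.T
print("<H> - H(m)    =", H(samples).mean() - H(m_star))
print("N_dgf * T / 2 =", N_dgf * T / 2)
```

A sampler for a non-Gaussian posterior can be tested in the same way, with the analytic expectation replaced by the approximate relation (4).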
C. Maximum a posteriori

The first guess for a suitable energy to be minimized to obtain the information state might be the Hamiltonian. Minimizing the Hamiltonian with respect to $s$, while keeping $d$ at its observed values, is equivalent to maximizing the joint probability $P(d,s)$ and also the posterior $P(s|d)$. The classical field emerging from this is called the MAP signal reconstruction in signal processing. For a detailed discussion of the usage of the MAP principle in IFT see [10]. The MAP field is often a very good approximation of the mean field

$$m = \langle s\rangle_{(s|d)} \equiv \int \mathcal{D}s\; s\; P(s|d),\qquad (5)$$

which is the optimal estimator of the signal in a statistical $L^2$ error norm sense:

$$m = \underset{\tilde{s}}{\mathrm{argmin}}\ \left\langle \int_V dx\, (s_x - \tilde{s}_x)^2 \right\rangle_{(s|d)}.\qquad (6)$$

The MAP estimator, on the other hand, can be shown to optimize the statistical $L^0$ norm [the $L^0$ norm measures the amount of exact agreement via $\|f\| = \lim_{\varepsilon\to 0} \varepsilon \int dx\, \theta(f(x) - \varepsilon)$, with $\theta$ denoting the Heaviside function], the result of which may strongly deviate from the mean $m$ if the posterior is highly asymmetric around its maximum. Thus we can regard the MAP estimator as a good reference point, but not as the solution we are seeking in general. It is, however, accurate (in the $L^2$ error norm sense) in case the posterior around its maximum is close to a Gaussian. In this case, the MAP field can easily be augmented with some uncertainty information from the Hessian of the Hamiltonian,

$$\mathcal{H} = \left.\frac{\delta^2 H(d,s)}{\delta s\, \delta s^\dagger}\right|_{s=m},\qquad (7)$$

as an approximation of the two-point function of the signal uncertainty,

$$D \equiv \big\langle (s-m)(s-m)^\dagger\big\rangle_{(s|d)}.\qquad (8)$$

Thus we set $D \approx \mathcal{H}^{-1}$ in

$$P(s|d) \approx \tilde{P}(s|d) = \mathcal{G}(s-m, D),\qquad (9)$$

where we introduced the Gaussian

$$\mathcal{G}(\phi, D) \equiv |2\pi D|^{-1/2}\, e^{-\frac{1}{2}\phi^\dagger D^{-1}\phi}.\qquad (10)$$

Unfortunately, the MAP estimator can perform suboptimally in cases where the Gaussian approximation does not hold, see e.g. [11].
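The following one-dimensional sketch (a toy Hamiltonian with assumed numbers, not the paper's implementation) illustrates the MAP recipe of Eqs. (7)–(10): a Newton search for the posterior maximum, the inverse Hessian as uncertainty, and a comparison with the exact mean obtained by quadrature, which differ for a skewed posterior:

```python
import numpy as np

d, kappa, b = 2.0, 1.0, 1.0                       # toy data and model constants
H   = lambda s: 0.5 * s**2 - d * b * s + kappa * np.exp(b * s)   # 1-pixel log-normal Poisson Hamiltonian
dH  = lambda s: s - d * b + kappa * b * np.exp(b * s)
d2H = lambda s: 1.0 + kappa * b**2 * np.exp(b * s)

s_map = 0.0
for _ in range(50):                               # Newton iterations for the MAP point
    step = dH(s_map) / d2H(s_map)
    s_map -= step
    if abs(step) < 1e-12:
        break
D_laplace = 1.0 / d2H(s_map)                      # Eq. (9): D ~ inverse Hessian

s = np.linspace(-8.0, 8.0, 20001)                 # exact posterior moments by quadrature
w = np.exp(-H(s)); w /= w.sum()
mean = (s * w).sum()
var  = ((s - mean)**2 * w).sum()

print(f"MAP  = {s_map:.3f}, Laplace variance = {D_laplace:.3f}")
print(f"mean = {mean:.3f}, exact variance   = {var:.3f}")
```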
D. Maximum Entropy

1. Image entropy
Another quantity often extremized in image reconstruction problems is the so-called image entropy (iE) [14–25]. In classical maximum (image) entropy (MiE here, usually ME) methods the iE is defined for a strictly positive signal via

$$S_{\rm iE}(s) = -\int_V dx\; s_x\, \log(s_x/\tilde{s}_x) \equiv -s^\dagger \log(s/\tilde{s}),\qquad (11)$$

where $\tilde{s}_x$ is the reference image, which is used to model some prior information. In the second equality we have defined the component-wise application of functions on fields, e.g. $(f(s))_x = f(s_x)$, which we use throughout this work.

We note that the iE is actually not a physical entropy. Usually its usage is argued for by ad hoc assumptions on the distribution entropy of photon packages in the image plane, rather than being a well motivated description of the signal prior knowledge (or lack thereof). In the following we will reveal the implicitly assumed prior of MiE methods.

The data enter the MiE method in the form of an image energy, which is ideally chosen to be the negative log-likelihood,

$$E(d|s) = -\log\big(P(d|s)\big),\qquad (12)$$

in order to ensure the best imprint of the data on the reconstruction. The entropy is then maximized under the energy constraint by minimizing

$$E_{\rm iE}(d,s) = E(d|s) - T\, S_{\rm iE}(s)\qquad (13)$$

with respect to $s$. Here $T$ is some adjustable temperature-like parameter, permitting us to choose the relative weight of image entropy and image energy. Low temperature means that the MiE map follows the data closely, high temperature that the map space wants to be more uniformly occupied by the signal reconstruction.

The prior information on the signal, $P(s)$, does not enter the MiE formalism explicitly. However, an implicit prior can be identified, assuming that MiE is actually a MAP principle. In that case the implicitly assumed Hamiltonian is $H_{\rm iE}(d,s) \cong E_{\rm iE}(d,s)$, where $\cong$ denotes equality up to an irrelevant, since $s$-independent, additive constant, and we find

$$P_{\rm iE}(s) \propto e^{T\, S_{\rm iE}(s)} \propto \prod_x \left(\frac{s_x}{\tilde{s}_x}\right)^{-T s_x}.\qquad (14)$$

This is not a general prior, but a very specific PDF. Although there is some flexibility to adapt its functional form by choosing $\tilde{s}$, $T$, and the image space (pixel space, Fourier space, wavelet space, etc.) in which (11) holds, $P_{\rm iE}(s)$ cannot be regarded as being generic. The MiE prior strongly suppresses large values in the MiE map. If a data feature can be explained either by a single map pixel exhibiting a peak value or by several pixels dividing that value among themselves, MiE will usually prefer the second option, leading to blurred reconstructed images.

We conclude that the term maximum entropy commonly used in image reconstruction is very misleading. A more accurate term would be minimal dynamical range, since the implicitly assumed prior states that pixels carrying a larger than average signal $s_x$ are extremely unlikely.
2. Physical entropy
A physical entropy should measure the distribution spread of a PDF using an integral over its phase space. In fact, such an entropy is given by the Boltzmann entropy, the negative Shannon information,

$$S_{\rm B} = -\int \mathcal{D}s\; P(s|d)\, \log P(s|d),\qquad (15)$$

which is a functional of the signal posterior, $S_{\rm B} = S_{\rm B}[P(s|d)]$, and not of the signal map. Inserting (1) yields

$$S_{\rm B} = \langle H(d,s)\rangle_{(s|d)} + \log Z_1(d,0) = U - F,\qquad (16)$$

where we introduced the internal energy $U = \langle H(d,s)\rangle_{(s|d)}$ and the Helmholtz free energy $F = F_1(d,0)$, with

$$F_\beta(d,J) = -\beta^{-1}\, \log Z_\beta(d,J).\qquad (17)$$

The fully $J$-dependent Helmholtz free energy provides the field expectation value via

$$m = \langle s\rangle_{(s|d)} = \left.\frac{\partial F_\beta(d,J)}{\partial J}\right|_{\beta=1,\,J=0}.\qquad (18)$$

The entropy is also given in terms of the free energy via

$$S_{\rm B} = \beta^2 \left.\frac{\partial F_\beta(d,J)}{\partial \beta}\right|_{\beta=1,\,J=0}.\qquad (19)$$

The entropy as well as the free energy are functionals of the posterior and not of the signal. Maximizing or minimizing them does not provide a signal estimator, but singles out a PDF. If we restrict the space of PDFs to the ones we can handle analytically, namely Gaussians as given in (9) and (10), we might obtain a suitable approximation scheme for the full field-theoretical inference problem.

Maximizing the entropy alone does not lead to a suitable algorithm, since the maximal entropy state is that of complete lack of knowledge, with a uniform probability for every signal possibility. The internal energy, however, favors knowledge states close to the posterior maximum and would return the MAP solution if extremized alone. Thus the right combination of entropy and internal energy has to be extremized. We would expect a free energy of the form $U - T S_{\rm B}$ to be this function, in analogy to the energy (13) used in MiE methods. Thermodynamics teaches us that the Gibbs free energy is the quantity to be minimized (which is identical to the Helmholtz free energy in case $J = 0$). Since we are going to calculate it for an approximation of the real PDF, it is necessary to go through the derivation in order to make sure we do this in the right fashion and understand all implications.

II. THERMODYNAMICAL INFERENCE

A. Tempered Posterior
In order to take full advantage of the existing thermodynamical machinery we want to construct the Gibbs free energy for information problems. To this end, we introduce a temperature and a source function into the PDF of the signal posterior, as suggested by the definition of the partition function (3), by defining

$$P(s|d,T,J) = \frac{e^{-\beta\,(H(s,d)+J^\dagger s)}}{Z_\beta(d,J)} = \frac{\big(P(d,s)\, e^{-J^\dagger s}\big)^\beta}{\int \mathcal{D}s'\, \big(P(d,s')\, e^{-J^\dagger s'}\big)^\beta}.\qquad (20)$$

With the temperature we can broaden (for $T > 1$) or narrow (for $T < 1$) the posterior. Three temperature values are of special importance, namely $T = 0$, which modifies the PDF into a delta peak located at the posterior maximum, $T = 1$, which returns the original posterior, and $T = \infty$, leading to the maximum entropy state of a uniform PDF. The source function $J$ permits us to shift the mean of the PDF to any possible signal configuration $m = m(d,T,J)$.

The modified PDF will be approximated by a Gaussian with identical mean and variance:

$$P(s|d,T,J) \approx \mathcal{G}(s-m, D) = \tilde{P}(s|m,D),\qquad (21)$$

where also $D = D(d,T,J)$.

We will see that the width $D$ of this Gaussian approximation of the PDF increases with increasing temperature. At low temperature ($T \ll 1$) the center of the PDF is probed and modeled, while at large temperatures ($T \gg 1$) the focus is on its asymptotic tails. Since the Gaussian in (21) is an approximation, it is not even guaranteed that $T = 1$ provides the best recipe for signal reconstruction. E.g. in [9] a case is shown where signal reconstruction at an intermediate temperature performed better than at both $T = 0$ and $T = 1$. Since working at multiple temperatures can reveal different aspects of the same non-Gaussian PDF (i.e. its central or asymptotic behavior), the question arises how the differently retrieved Gaussian approximations can be combined into a single and more accurate representation of the original PDF. This will be addressed in Sect. IV. For the moment we approximate our posterior by a single Gaussian as in (21).

In this case, the partition function can be calculated explicitly and reads

$$\tilde{Z}_\beta(d,J) = \left|\frac{2\pi}{\beta}\, D\right|^{1/2} \exp\!\left(\frac{\beta}{2}\, J^\dagger D\, J + J^\dagger m - \beta\, H(m,d)\right).$$

With the standard thermodynamical procedure we calculate

$$\langle H\rangle_{(s|d)} \approx -\left.\frac{\partial \log \tilde{Z}_\beta(d,J)}{\partial \beta}\right|_{J=0} = \frac{1}{2}\, N_{\rm dgf}\, T + H(m,d),\qquad (22)$$

where $N_{\rm dgf}$ is the dimension of the signal vector. This result is the re-phrased equipartition theorem (4) from classical thermodynamics and further motivates the notion of temperature in IFT.

B. Internal, Helmholtz and Gibbs energy
The next step is to calculate the Helmholtz free energy. In case it can be calculated explicitly from (17), the inference problem is basically solved, since any (connected) moment of the signal posterior can directly be calculated from it by taking derivatives with respect to the moment generating function $J$, see e.g. (18). This will, however, only be the case for a very restricted class of Hamiltonians, like the free ones, which are only quadratic in $s$. In the more interesting case that the Helmholtz free energy cannot be calculated explicitly, we can use the thermodynamical relation of the Helmholtz free energy with the internal energy and entropy.

First, we note that the internal energy of the modified posterior is given by

$$U(d,T,J) = \langle H(s,d)\rangle_{(s|d,T,J)} \approx \langle H(s,d)\rangle_{(s|m,D)} = \tilde{U}(d,m,D),\qquad (23)$$

where $m$ and $D$ are still functions of $d$, $T$, and $J$. The average in the second expression has to be understood as being performed over a Gaussian with mean $m$ and dispersion $D$: $\langle f(s)\rangle_{(s|m,D)} = \int \mathcal{D}s\; f(s)\, \mathcal{G}(s-m,D)$.

Further, we need to calculate the entropy of the modified PDF, which for a Gaussian depends only on $D$:

$$S_{\rm B}[\mathcal{G}(s-m,D)] = \frac{1}{2}\,\mathrm{Tr}\big[1 + \ln(2\pi D)\big] = \tilde{S}_{\rm B}(D).\qquad (24)$$

For the full modified posterior (20), the entropy is calculated via (15) to be

$$S_{\rm B} = \beta\,\big(U + J^\dagger m - F\big),\qquad (25)$$

where $m = m(d,T,J) = \langle s\rangle_{(s|d,T,J)}$, $U$ is given by (23), and $F$ by (17). Solving (25) for the Helmholtz free energy yields

$$F_\beta(d,J) = U - T\, S_{\rm B} + J^\dagger m.\qquad (26)$$

This expresses the Helmholtz free energy in terms of internal energy and entropy. Unfortunately, this expression contains the term $J^\dagger m$, where $m$ depends on $J$ implicitly through (18). In order to get rid of this term, we Legendre transform with respect to $J$ and thereby use (18), which provides us with the Gibbs free energy

$$G_\beta(d,m) = F - J^\dagger \frac{\delta F}{\delta J} = U - T\, S_{\rm B}.\qquad (27)$$

The Gibbs energy depends solely on $m$ and not on $J$. It can be constructed approximately, in case approximations of the internal energy and the entropy are available. For our Gaussian approximation of the modified posterior we therefore write

$$\tilde{G}_\beta(d,m,D) = \tilde{U}(d,m,D) - T\, \tilde{S}_{\rm B}(D).\qquad (28)$$

We know from thermodynamics that the minimum of the Gibbs free energy with respect to variations in $m$ provides the expectation value $\langle s\rangle_{(s|d)}$ of our field:

$$\left.\frac{\delta G(d,m,D)}{\delta m}\right|_{m = \langle s\rangle_{(s|d)}} = 0.\qquad (29)$$

Thus, the Gibbs energy is the information energy we were looking for in the introduction. Minimizing the Gibbs free energy for a Gaussian PDF with respect to $m$ yields

$$0 = \frac{\delta \tilde{G}}{\delta m} = \int \mathcal{D}s\; H(d,s)\, \frac{\delta \mathcal{G}(s-m,D)}{\delta m} = D^{-1}\,\langle \phi\, H_m(d,\phi)\rangle_{(\phi|D)},\qquad (30)$$

with $H_m(d,\phi) = H(d, m+\phi)$, which implies

$$m = \frac{\langle s\, H(d,s)\rangle_{(s|m,D)}}{\langle H(d,s)\rangle_{(s|m,D)}} = \frac{\langle s\, H(d,s)\rangle_{(s|m,D)}}{\tilde{U}(m,D)}.\qquad (31)$$

The optimal map is therefore the first signal moment of the full Hamiltonian weighted with the approximating Gaussian.

Thermodynamics teaches us further that the propagator, the uncertainty dispersion of the field, is provided by the second derivative of the Gibbs free energy around this location, thanks to the well known relation

$$\left.\left(\frac{\delta^2 G}{\delta m\, \delta m^\dagger}\right)^{-1}\right|_{m = \langle s\rangle_{(s|d)}} = -\left.\frac{\delta^2 F}{\delta J\, \delta J^\dagger}\right|_{J=0} = \beta\, D.\qquad (32)$$

This relation closes the set of equations by providing $D$. Evaluating (32) with our approximate Gibbs energy (28) and using (29) yields

$$T\, D^{-1} = \left.\frac{\delta^2 \tilde{G}}{\delta m\, \delta m^\dagger}\right|_{m = \langle s\rangle_{(s|d)}} = -D^{-1}\, \tilde{U}(d,m,D) + D^{-1}\,\big\langle \phi\, \phi^\dagger\, H_m(d,\phi)\big\rangle_{(\phi|D)}\, D^{-1}.$$

Thus the propagator is the second moment of the Gaussian-weighted Hamiltonian,

$$D = \frac{\big\langle \phi\, \phi^\dagger\, H_m(\phi)\big\rangle_{(\phi|D)}}{\tilde{U}(d,m,D) + T}.\qquad (33)$$

This equation seems to suggest that the propagator evaluated at higher temperature is narrower, since $T$ appears in the denominator. However, the opposite is the case, due to the presence of $D$ in all terms, as a test with a free Hamiltonian will show in (44).

C. Cross information
The Gibbs free energy at $T = 1$ is directly related to the cross information between the posterior and its Gaussian approximation. The cross information (or negative relative entropy) of a PDF $\tilde{P}$ with respect to another one $P$ is measured by the so-called Kullback-Leibler divergence [26]:

$$d_{\rm KL}[\tilde{P}, P] = \int \mathcal{D}s\; \tilde{P}(s|d)\, \log\!\left(\frac{\tilde{P}(s|d)}{P(s|d)}\right).\qquad (34)$$

The Kullback-Leibler divergence characterizes the distance between a surrogate and a target PDF in an information theoretical sense. It is an asymmetric distance measure, reflecting that the roles of the two involved PDFs differ. The equivalence of Gibbs free energy and cross information with respect to inference problems can easily be seen:

$$\tilde{G}(m,D) = \big\langle H(d,s) + \log(\mathcal{G}(s-m,D))\big\rangle_{(s|m,D)} = \int \mathcal{D}s\; \mathcal{G}(s-m,D)\, \log\!\left(\frac{\mathcal{G}(s-m,D)}{P(s,d)}\right) \cong \int \mathcal{D}s\; \mathcal{G}(s-m,D)\, \log\!\left(\frac{\mathcal{G}(s-m,D)}{P(s|d)}\right) = d_{\rm KL}[\tilde{P}, P].\qquad (35)$$

In the second-to-last step we added the term $\log P(d)$, which is irrelevant here, since it is $m$- and $D$-independent, and in the last step we introduced the Kullback-Leibler divergence between the posterior $P(s|d)$ and its Gaussian surrogate $\tilde{P}(s|d) = \mathcal{G}(s-m,D)$. Minimal Gibbs free energy therefore seems to correspond to minimal Kullback-Leibler divergence, and therefore to maximal cross information of the surrogate with the exact posterior.

However, we have only minimized the Gibbs free energy so far with respect to $m$, the mean-field degrees of freedom of our Gaussian, not with respect to the ones parameterizing the uncertainty dispersion $D$. We have determined the latter using the thermodynamical relation (32). If we want our surrogate PDF to have maximal cross information with the posterior with respect to all degrees of freedom of our Gaussian, we also have to minimize the Gibbs energy with respect to $D$. A short calculation shows that this actually yields a result which is equivalent to the thermodynamical relation (32):

$$0 = \frac{\delta \tilde{G}}{\delta D} = \int \mathcal{D}\phi\; H_m(d,\phi)\, \frac{\delta \mathcal{G}(\phi,D)}{\delta D} - T\, \frac{\delta \tilde{S}_{\rm B}(D)}{\delta D} = \frac{1}{2}\, D^{-1}\Big[\big\langle \phi\, \phi^\dagger\, H_m(d,\phi)\big\rangle_{(\phi|D)} - D\,\big(\tilde{U}(m,D) + T\big)\Big]\, D^{-1},$$

from which also (33) follows. Thus, we can regard both the map $m$ and its uncertainty covariance $D$ as parameters for which the Gibbs energy should be minimized. We will refer to this as the maximal cross information principle.

We further note that the maximal cross information principle also holds if the Gaussian is replaced by some other model function, $G[\tilde{P}(s|d)] \cong d_{\rm KL}[\tilde{P}, P]$, a property we will use later in Sect. IV.

Note that the maximal cross information and the thermodynamical relations yield exactly the same results for $m$ and $D$ only if $\tilde{G}(m,D)$ is calculated exactly. In case there are approximations involved, the resulting algorithms differ slightly, and this difference can be used to monitor the impact of the approximation made. In the following, we use the cross information principle for our examples.
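The equivalence (35) can be made concrete numerically. The following one-dimensional sketch (toy Hamiltonian and surrogate parameters assumed for illustration) verifies by quadrature that the Gibbs free energy of a Gaussian surrogate and its Kullback-Leibler distance to the posterior differ only by the $(m,D)$-independent evidence term $\log P(d)$:

```python
import numpy as np

H = lambda s: 0.5 * s**2 + np.exp(s) - 3.0 * s           # toy joint Hamiltonian H(d,s)
s = np.linspace(-10.0, 4.0, 40001)
ds = s[1] - s[0]
P_joint = np.exp(-H(s))
evidence = P_joint.sum() * ds                             # P(d) = int ds exp(-H)
P_post = P_joint / evidence                               # posterior P(s|d)

def gauss(x, m, D):
    return np.exp(-0.5 * (x - m)**2 / D) / np.sqrt(2 * np.pi * D)

for (m, D) in [(0.0, 1.0), (0.7, 0.3), (-1.0, 2.0)]:      # arbitrary surrogate parameters
    G = gauss(s, m, D)
    gibbs = ((H(s) + np.log(G)) * G).sum() * ds           # U - S_B = <H + log G>_G at T = 1
    dkl = (G * np.log(G / P_post)).sum() * ds             # Kullback-Leibler divergence (34)
    print(f"m={m:+.1f} D={D:.1f}:  d_KL - Gibbs = {dkl - gibbs:.6f}"
          f"  (log P(d) = {np.log(evidence):.6f})")
```

Because the two quantities differ only by this constant, minimizing either over $(m, D)$ selects the same surrogate.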
D. Calculating the internal energy

In order to calculate the approximate Gibbs energy, we need to estimate the internal energy, for which we have to specify the exact Hamiltonian. We assume that it can be Taylor-Fréchet expanded as

$$H(d,s) = \sum_{n=0}^{\infty} \frac{1}{n!}\, \underbrace{\Lambda^{(n)}_{x_1\ldots x_n}\, s_{x_1}\cdots s_{x_n}}_{\Lambda^{(n)}(s,\ldots,s)},\qquad (36)$$

where repeated coordinates are thought to be integrated or summed over. The approximate internal energy is then

$$\tilde{U}(m,D) = U[\tilde{P}(s|d)] = \int \mathcal{D}s\; H(d,s)\, \tilde{P}(s|d) = \sum_{n=0}^{\infty} \frac{1}{n!}\, \big\langle \Lambda^{(n)}(s,\ldots,s)\big\rangle_{(s|m,D)}.\qquad (37)$$

The Gaussian $n$-point correlation functions in this equation can actually be calculated analytically. For this, we again use the shifted field $\phi = s - m$, which has the Hamiltonian

$$H_m(d,\phi) = \sum_{n=0}^{\infty} \frac{1}{n!}\, \Lambda_m^{(n)}(\phi,\ldots,\phi),\quad\text{with}\qquad (38)$$
$$\Lambda_m^{(n)}(\phi,\ldots,\phi) = \sum_{k=0}^{\infty} \frac{1}{k!}\, \Lambda^{(n+k)}(\underbrace{\phi,\ldots,\phi}_{n},\, \underbrace{m,\ldots,m}_{k}).$$

We assume that the interaction coefficients $\Lambda^{(n)}_{x_1\ldots x_n}$ are symmetric with respect to index permutations, since they resulted from a Taylor-Fréchet expansion.

The internal energy can then be calculated via the Wick theorem and the fact that all odd moments of $\phi$ vanish:

$$\tilde{U}(m,D) = \sum_{n=0}^{\infty} \frac{1}{n!}\, \big\langle \Lambda_m^{(n)}(\phi,\ldots,\phi)\big\rangle_{(\phi|D)} = \sum_{n=0}^{\infty} \frac{1}{2^n\, n!}\, \Lambda_m^{(2n)}(\overbrace{D\otimes\cdots\otimes D}^{n}) = \sum_{n,k=0}^{\infty} \frac{\Lambda^{(2n+k)}(\overbrace{D\otimes\cdots\otimes D}^{n} \otimes \overbrace{m\otimes\cdots\otimes m}^{k})}{2^n\, n!\, k!}.\qquad (39)$$

Here, we defined the symmetrized tensor product $\big(T\otimes T'\big)_{x_1\ldots x_n} \equiv \sum_{\pi\in S_n} \frac{1}{n!}\, T_{x_{\pi(1)}\ldots x_{\pi(k)}}\, T'_{x_{\pi(k+1)}\ldots x_{\pi(n)}}$ by averaging over all permutations in $S_n$, the symmetric group.

Having obtained the internal energy with (39), and the entropy with (24), we can construct the Gibbs free energy approximately according to (28), which we use for our inference.

E. Minimizing
In order to get our optimal Gaussian approximation to the posterior, we have to minimize $\tilde{G}_\beta(m,D)$ with respect to $m$ and $D$. Minimizing for $m$ is equivalent to minimizing the internal energy, since the entropy does not depend on $m$. This yields

$$0 = \frac{\delta \tilde{U}(m,D)}{\delta m} = \sum_{n,k=0}^{\infty} \frac{\Lambda^{(2n+k+1)}(\overbrace{D\otimes\cdots\otimes D}^{n} \otimes \overbrace{m\otimes\cdots\otimes m}^{k},\,\cdot\,)}{2^n\, n!\, k!},\qquad (40)$$

which has to be solved for $m$ for any given $D$. The propagator derives from (32) or from

$$0 = \frac{\delta \tilde{G}(m,D)}{\delta D} \;\Rightarrow\; T\, D^{-1} = \sum_{n,k=0}^{\infty} \frac{\Lambda^{(2n+k+2)}(\,\cdot\,,\,\cdot\,, \overbrace{D\otimes\cdots\otimes D}^{n} \otimes \overbrace{m\otimes\cdots\otimes m}^{k})}{2^n\, n!\, k!},\qquad (41)$$

which also depends on $m$. Thus, (40) and (41) have to be solved simultaneously.

A simple example is in order. The simplest case is that of the original Hamiltonian being quadratic. The approximated one should then match this exactly. A quadratic or free Hamiltonian is equivalent to a Gaussian posterior, $P(s|d) = \mathcal{G}(s-m_*, D_*)$. We get

$$H(d,s) \cong \frac{1}{2}\,(s-m_*)^\dagger D_*^{-1}\,(s-m_*) \cong \Lambda^{(1)}_x\, s_x + \frac{1}{2}\,\Lambda^{(2)}_{xy}\, s_x\, s_y,\quad\text{with}\qquad (42)$$
$$\Lambda^{(1)} = -D_*^{-1}\, m_*,\quad\text{and}\quad \Lambda^{(2)} = D_*^{-1}.$$

Inserting this into (40) and (41) yields

$$0 = \Lambda^{(1)}(\cdot) + \Lambda^{(2)}(m,\cdot) = D_*^{-1}\,(m - m_*) \;\Rightarrow\; m = m_*,\qquad (43)$$
$$T\, D^{-1} = \Lambda^{(2)}(\cdot,\cdot) = D_*^{-1} \;\Rightarrow\; D = T\, D_*,\qquad (44)$$

which indeed recovers the original coefficients for $T = 1$, and a narrower or wider uncertainty dispersion for $T < 1$ or $T > 1$, respectively. In the following, we will see that also in the case of interacting Hamiltonians the minimal free energy principle provides the correct results. We show this by reproducing (and extending) signal estimators derived previously in IFT using renormalization techniques.
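The whole scheme can be exercised in miniature on a one-dimensional interacting Hamiltonian. The sketch below (toy coefficients assumed; not from the paper) writes out the Gaussian averages of Eq. (39) with the Wick theorem for a quartic Hamiltonian and alternates the conditions (40) and (41), comparing the result with the exact posterior moments from quadrature:

```python
import numpy as np

a, lam, j, T = 1.0, 0.6, 1.5, 1.0                 # H(s) = a s^2/2 + lam s^4/4! - j s

def U_tilde(m, D):                                 # <H>_G via Wick: <s^2>=m^2+D, <s^4>=m^4+6m^2 D+3D^2
    return 0.5 * a * (m**2 + D) + lam / 24.0 * (m**4 + 6 * m**2 * D + 3 * D**2) - j * m

m, D = 0.0, 1.0 / a
for _ in range(200):                               # alternate Eq. (40) (for m) and Eq. (41) (for D)
    for _ in range(50):                            # Newton steps on dU/dm = a m + lam(m^3 + 3 m D)/6 - j = 0
        g = a * m + lam * (m**3 + 3 * m * D) / 6.0 - j
        h = a + lam * (m**2 + D) / 2.0
        m -= g / h
    D_new = T / (a + lam * (m**2 + D) / 2.0)       # Eq. (41): T D^{-1} = a + lam (m^2 + D)/2
    if abs(D_new - D) < 1e-12:
        D = D_new
        break
    D = D_new

s = np.linspace(-10.0, 10.0, 100001)               # exact posterior moments by quadrature
w = np.exp(-(0.5 * a * s**2 + lam * s**4 / 24.0 - j * s)); w /= w.sum()
mean = (s * w).sum()
print(f"Gibbs minimum: m = {m:.4f}, D = {D:.4f}")
print(f"exact:        <s> = {mean:.4f}, var = {((s - mean)**2 * w).sum():.4f}")
```

For this weakly interacting case the Gaussian surrogate tracks the exact moments closely; for stronger interactions the residual difference measures the quality of the approximation.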
III. APPLICATION EXAMPLES

A. Poissonian log-normal data
1. Separable case
Many inference problems have to deal with Poissonian noise, like X-ray and $\gamma$-ray astronomy as well as the reconstruction of the cosmic large-scale structure from galaxy counts. Let us assume that the mean count rate $\lambda$ of photons or galaxies is proportional to an exponentiated Gaussian random field $s$ with covariance $S = \langle s\, s^\dagger\rangle_{(s)}$, according to

$$\lambda(s) = \kappa\, e^{b\, s}.\qquad (45)$$

Here, $\kappa$ is the expected count rate for $s = 0$, which may depend on the spatial position. The scalar $b$ permits us to conveniently change the strength of the non-linearity of the problem without changing the signal statistics. This log-normal model for the cosmic large-scale structure as an approximate description is actually supported observationally [27, 28] and theoretically [29–34].

As a starting point, we assume a local response, so that the Poisson statistics for the actual counts $d_x$ at location $x$ are

$$P(d_x|\lambda_x) = \frac{\lambda_x^{d_x}}{d_x!}\, e^{-\lambda_x},\qquad (46)$$

and the full likelihood is well separable into local ones:

$$P(d|s) = \prod_x P(d_x|\lambda_x(s_x)).\qquad (47)$$

The corresponding Hamiltonian was shown in [9] to be

$$H(d,s) \cong \frac{1}{2}\, s^\dagger S^{-1} s - d^\dagger b\, s + \kappa^\dagger e^{b\, s}.\qquad (48)$$

Reconstruction methods for this data model were developed by [9, 35–37].

The internal energy of our Gaussian approximation can be calculated analytically,

$$\tilde{U}(m,D) \cong \frac{1}{2}\, m^\dagger S^{-1} m + \frac{1}{2}\,\mathrm{Tr}(D\, S^{-1}) - d^\dagger b\, m + \kappa^\dagger e^{b\, m + \frac{1}{2} b^2 \hat{D}},\qquad (49)$$

where $\hat{D}$ denotes the vector of diagonal elements of $D$. Minimizing $\tilde{G}(m,D) = \tilde{U}(m,D) - T\, \tilde{S}_{\rm B}(D)$ with respect to $m$ and $D$ yields

$$m = S\, b\,\big(d - \kappa_{m + b\hat{D}/2}\big),\quad\text{and}\quad D = T\,\big(S^{-1} + b^2\, \hat{\kappa}_{m + b\hat{D}/2}\big)^{-1},\qquad (50)$$

respectively. Here we have defined $\kappa_t = \kappa\, \exp(b\, t)$ and denote a diagonal matrix by putting a hat onto the vector of its diagonal elements, $(\hat{\lambda})_{xy} = \lambda_x\, \delta_{xy}$. This result is identical to the one found in [9] using a lengthy renormalization calculation. There it was found by numerical experiment that an intermediate temperature between $T = 0$ and $T = 1$ gave better reconstructions than either extreme.
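The coupled equations (50) can be solved iteratively. The following minimal sketch (not the authors' implementation; the periodic 1-D grid, squared-exponential prior covariance, and all parameter values are illustrative assumptions) writes the $m$-update as a Newton step whose fixed point satisfies $m = S b (d - \kappa_{m + b\hat{D}/2})$, alternating with the $D$-update:

```python
import numpy as np

rng = np.random.default_rng(42)
npix, b, kappa0, T = 64, 0.5, 5.0, 1.0

# assumed prior covariance S: squared-exponential on a periodic 1-D grid
idx = np.arange(npix)
dist = np.minimum(np.abs(idx[:, None] - idx[None, :]),
                  npix - np.abs(idx[:, None] - idx[None, :]))
S = 0.5 * np.exp(-0.5 * (dist / 4.0) ** 2) + 1e-4 * np.eye(npix)
S_inv = np.linalg.inv(S)

# synthetic data: s ~ G(s, S), lambda = kappa0 exp(b s), d ~ Poisson(lambda)
s_true = np.linalg.cholesky(S) @ rng.standard_normal(npix)
d = rng.poisson(kappa0 * np.exp(b * s_true)).astype(float)

m, D = np.zeros(npix), S.copy()
for it in range(100):
    kappa_eff = kappa0 * np.exp(b * m + 0.5 * b**2 * np.diag(D))   # kappa_{m + b D_hat/2}
    curv = S_inv + b**2 * np.diag(kappa_eff)                       # inverse propagator of Eq. (50)
    D = T * np.linalg.inv(curv)
    grad = S_inv @ m - b * (d - kappa_eff)                         # vanishes at the solution of Eq. (50)
    step = np.linalg.solve(curv, grad)
    m -= np.clip(step, -2.0, 2.0)                                  # crude step limiting for robustness
    if np.max(np.abs(grad)) < 1e-8:
        break

print("rms deviation of the map from the true signal:",
      np.sqrt(np.mean((m - s_true) ** 2)))
```

The diagonal of the final $D$ provides the pixel-wise uncertainty of the reconstruction.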
2. Entangled case
So far, we assumed that the response provides a one-to-one correspondence between locations in signal and data space. However, for most measurements this is not exactly true. X- and $\gamma$-ray telescopes typically exhibit point spread functions, which map a single signal-space location onto several detectors, each of which detects events coming from several indistinguishable directions. Also galaxy redshifts do not provide accurate distance information, since redshift distortions and measurement errors lead to effective point spread functions.

In the following, we generalize to the case of a known and fixed, but non-local measurement response. Fixed means that the response is independent of the signal. This excludes the treatment of galaxy redshift distortions (e.g. see [38] for this), but still includes photometric redshift errors of galaxy catalogs as well as X- and $\gamma$-ray telescope data. Such problems have been approached in the past via the MAP principle [39–42].

The point spread function is modeled by the response matrix $R = (R_{ix})$, which describes how emissivity at location $x$ is expected to be observed in data channel $i$. The expected count rate is now

$$\lambda(s) = R\, e^{b\, s},\qquad (51)$$

and the likelihood does not separate any more with respect to $x$,

$$P(d|s) = \prod_i P(d_i|\lambda_i(s)),\qquad (52)$$

since $\lambda_i(s)$ entangles the signal from several locations, whereas in (47) it depends only on the local signal value. We recover the former case for a diagonal response $R_{ix} = \kappa_x\, \delta_{ix}$. The resulting Hamiltonian

$$H(s|d) \cong \frac{1}{2}\, s^\dagger S^{-1} s + 1^\dagger R\, e^{b\, s} - d^\dagger \log(R\, e^{b\, s})\qquad (53)$$

reduces to (48) for $R$ being diagonal.

The internal energy of our surrogate Gaussian $\tilde{P}(s|d) = \mathcal{G}(s-m,D)$ is then

$$\tilde{U}(m,D) = \frac{1}{2}\, m^\dagger S^{-1} m + \frac{1}{2}\,\mathrm{Tr}(D\, S^{-1}) + 1^\dagger R\, e^{b\, m + \frac{1}{2} b^2 \hat{D}} - \sum_i d_i\, \underbrace{\int \mathcal{D}\phi\; \log\!\big(R_i^\dagger\, e^{b(m+\phi)}\big)\, \mathcal{G}(\phi,D)}_{I_i}.\qquad (54)$$

The integral $I_i$ cannot be calculated in closed form due to the logarithm in the integrand. We expand the logarithm around $R_i^\dagger e^{b\, m}$, since we will see that this recovers the result of the separable case most easily for $R$ being diagonal. We get

$$I_i = \log\!\big(R_i^\dagger e^{b\, m}\big) + \left\langle \log\!\left(\frac{R_i^\dagger\, e^{b(m+\phi)}}{R_i^\dagger\, e^{b\, m}}\right)\right\rangle_{(\phi|D)}.\qquad (55)$$

In case $R$ is diagonal, the first term reduces to $b\, m_i + \log R_{ii}$, the second vanishes since $\langle \log(\exp(b\phi))\rangle_{(\phi|D)} = \langle b\phi\rangle_{(\phi|D)} = 0$, and we recover the Hamiltonian of the separable case.

In the general case of an entangling response we Taylor expand the logarithm of the second term,

$$I_i = \log\!\big(R_i^\dagger e^{b\, m}\big) - \sum_{n=1}^{\infty} \frac{(-1)^n}{n}\, \underbrace{\Big\langle \big(r_i^\dagger e^{b\phi} - 1\big)^n \Big\rangle_{(\phi|D)}}_{\mathrm{II}_i^n},\quad\text{with}\qquad (56)$$
$$r_i = \frac{R_i\, e^{b\, m}}{R_i^\dagger\, e^{b\, m}}\quad\text{or}\quad r_i(x) = \frac{R_i(x)\, e^{b\, m(x)}}{\int dx'\, R_i(x')\, e^{b\, m(x')}}.$$

We note that $1^\dagger r_i = \int dx\, r_{ix} = 1$ by construction.

The expansion coefficients $\mathrm{II}_i^n$ can be worked out one by one. We provide here the first few, namely

$$\mathrm{II}_i^1 = r_i^\dagger\, e^{\frac{1}{2} b^2 \hat{D}} - 1,$$
$$\mathrm{II}_i^2 = r_{ix}\, r_{iy}\, e^{\frac{1}{2} b^2 (D_{xx} + D_{yy} + 2 D_{xy})} - 2\, r_i^\dagger\, e^{\frac{1}{2} b^2 \hat{D}} + 1,$$
$$\mathrm{II}_i^3 = r_{ix}\, r_{iy}\, r_{iz}\, \exp\!\Big(\frac{b^2}{2} \sum_{a,b\in\{x,y,z\}} D_{ab}\Big) - 3\, r_{ix}\, r_{iy}\, \exp\!\Big(\frac{b^2}{2} \sum_{a,b\in\{x,y\}} D_{ab}\Big) + 3\, r_i^\dagger\, e^{\frac{1}{2} b^2 \hat{D}} - 1.\qquad (57)$$

These coefficients stay small if $b^2 \hat{D} \ll 1$, which means that the expansion can be truncated if the signal is known to within a few ten percent or if the non-Gaussianity is small. Large uncertainties in the signal strength do not necessarily lead to large coefficients if they are located at positions without instrumental sensitivity ($R_{ix}$ small) or with much lower expected count rates ($m_x$ small). In both cases mostly prior information and extrapolation from regions with more informative data will determine the solution at such locations.

In case some of these coefficients are large, substantial signal uncertainty must be present at the locations to which they are sensitive. In this case an accurate reconstruction for these locations cannot be expected. Thus, if we simplify the Hamiltonian by dropping such terms, even if they are relatively large, the quality of the reconstruction will not suffer too much, since only regions are affected which are poorly constrained by the data anyway. Therefore, truncating the expansion should already provide usable algorithms.
3. Zeroth order solution
To zeroth order, we ignore all $\mathrm{II}_i^n$ terms and find for the approximate free energy

$$\tilde{G}(m,D) \approx \frac{1}{2}\, m^\dagger S^{-1} m + \frac{1}{2}\,\mathrm{Tr}(D\, S^{-1}) + \sum_i \Big[ R_i^\dagger\, e^{b\, m + \frac{1}{2} b^2 \hat{D}} - d_i\, \log\!\big(R_i^\dagger e^{b\, m}\big)\Big] - \frac{T}{2}\,\mathrm{Tr}\big[1 + \ln(2\pi D)\big].\qquad (58)$$

Minimizing this with respect to $m$ and $D$ yields

$$m = S\, b\, \sum_i R_i\, e^{b\, m}\left(\frac{d_i}{R_i^\dagger e^{b\, m}} - e^{\frac{1}{2} b^2 \hat{D}}\right) = S\, b\,\Big(d^\dagger r - \kappa'(m + b\hat{D}/2)\Big),\quad\text{and}$$
$$D = T\,\Big(S^{-1} + b^2\, \widehat{\kappa'(m + b\hat{D}/2)}\Big)^{-1},\quad\text{with}\quad \kappa'(t) = \sum_i R_i\, e^{b\, t}.\qquad (59)$$

This is very similar to (50) and reduces to it for a diagonal response.
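As a sketch of how Eq. (59) can be iterated in practice (again with an assumed grid, an assumed Gaussian point spread function, and illustrative parameters; damping and step sizes are pragmatic choices, not prescriptions from the paper), the $m$-update is once more written as a damped Newton-type step whose fixed point is $m = S b\,(d^\dagger r - \kappa'(m + b\hat{D}/2))$:

```python
import numpy as np

rng = np.random.default_rng(5)
npix, b, T = 64, 0.5, 1.0

idx = np.arange(npix)
dist = np.minimum(np.abs(idx[:, None] - idx[None, :]),
                  npix - np.abs(idx[:, None] - idx[None, :]))
S = 0.5 * np.exp(-0.5 * (dist / 4.0) ** 2) + 1e-4 * np.eye(npix)   # assumed prior covariance
S_inv = np.linalg.inv(S)
R = np.exp(-0.5 * (dist / 2.0) ** 2)                                # Gaussian point spread function
R *= 5.0 / R.sum(axis=1, keepdims=True)                             # ~5 expected counts per channel at s = 0

s_true = np.linalg.cholesky(S) @ rng.standard_normal(npix)
d = rng.poisson(R @ np.exp(b * s_true)).astype(float)

m, D = np.zeros(npix), S.copy()
for it in range(100):
    kap_prime = R.sum(axis=0) * np.exp(b * m + 0.5 * b**2 * np.diag(D))  # kappa'(m + b D_hat/2)
    lam = R @ np.exp(b * m)                                              # R_i^dagger e^{bm} per channel
    data_drive = np.exp(b * m) * (R.T @ (d / lam))                       # sum_i d_i r_i = d^dagger r
    curv = S_inv + b**2 * np.diag(kap_prime)                             # inverse propagator of Eq. (59)
    D = T * np.linalg.inv(curv)
    grad = S_inv @ m - b * (data_drive - kap_prime)                      # zero at the fixed point of Eq. (59)
    m -= 0.5 * np.linalg.solve(curv, grad)                               # damped quasi-Newton step
    if np.max(np.abs(grad)) < 1e-8:
        break

print("rms deviation of the map from the true signal:",
      np.sqrt(np.mean((m - s_true) ** 2)))
```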
4. First order correction
First-order corrections are included by keeping the $\mathrm{II}_i^1$ term in the approximate free energy, but ignoring higher terms. The resulting equations are

$$m = S\, b\left(\sum_i d_i\,\big(r_i^\dagger e^{\frac{1}{2} b^2 \hat{D}}\big)\, r_i - \kappa''(m + b\hat{D}/2)\right),$$
$$D = T\,\Big(S^{-1} + b^2\, \widehat{\kappa''(m + b\hat{D}/2)}\Big)^{-1},\quad\text{with}\quad \kappa''(t) = \sum_i R_i\, e^{b\, t}\,\frac{d_i}{R_i^\dagger e^{b\, m}}.\qquad (60)$$

This is a slight modification of (59) in two respects. The map changes a bit, but the sign of the change depends on the details of the point spread function, since there are two new terms of similar order, but with opposite signs. The uncertainty variance is reduced, since the term added to the inverse propagator is always positive.
5. Observation with background
The observation may suffer from a background: events in data space which do not contribute to our signal knowledge. For example, $\gamma$-ray astronomy has to suppress cosmic-ray events as much as possible, since charged particles do not point back to the same sources as neutral photons, due to cosmic magnetic fields. Fortunately, cosmic rays have different signatures in data space due to the differences in hadronic and electromagnetic interactions. However, not for all measured events is the distinction clear-cut, and we have to use prior knowledge to suppress the background events.

Therefore we should extend our formalism to also take such unwanted backgrounds into account. Actually, a reinterpretation of the above formulae will do. We extend our signal space by the quantity $f$ determining the logarithm of the background count rate, $s \to s' = (s, f)$. Here $f_z$ might be a field over the same physical space as $s_x$, or just a single number such as a total isotropic cosmic-ray flux. In any case, the $x$- and $z$-coordinates are regarded to be over different spaces, or distinct areas of the joint space over which $f$ and $s$ live. The joint covariance reads

$$S' = \begin{pmatrix} S & 0 \\ 0 & F \end{pmatrix}\qquad (61)$$

due to the independence of signal and background. Here, $F = \langle f\, f^\dagger\rangle_{(f)}$ is the log-background covariance. The response $R \to R'$ has to be extended to map also the background space into the data space. Whether the response images of signal and background events in data space are well separated or whether they overlap decides about the background-discriminating power of the instrument.

The combined map and covariance of signal and log-background can now be obtained, e.g. from (59) or (60), with the appropriate replacements $S, R, m, D \to S', R', m', D'$. Our joint map can be split into a signal and a log-background part, $m' = (\tilde{s}, \tilde{f})$. Since we are usually not interested in the background properties, we marginalize over it. This is especially simple in the Gaussian approximation of our joint posterior $P(s'|d) \approx \mathcal{G}(s'-m', D')$, with $s' = (s,f)$ and $m' = (\tilde{s}, \tilde{f})$:

$$m \approx \int \mathcal{D}s'\; s\; \mathcal{G}(s'-m', D') = \tilde{s},\quad\text{and}\qquad (62)$$
$$D_{xy} \approx \int \mathcal{D}s'\; (s-\tilde{s})_x\, (s-\tilde{s})_y\; \mathcal{G}(s'-m', D') = D'_{xy}.$$

Although this does not look too different from the formulae for the case without background, the effect of the background enters through the joint covariance matrix $D'$, which mixes the contributions from the signal and background events appropriately.

B. Reconstruction without spectral knowledge
1. Effective theory
The reconstruction of the signal in the Poisson log-normal model in the previous section assumed that the signal covariance is known a priori. In case it is unknown, it has to be extracted from the same data used for the signal inference [43–47]. However, the optimal way to do this was usually not derived from first principles, maybe except in [48–50]. A rigorous approach to such problems is given by the computationally expensive Gibbs-sampling technique, which investigates the joint space of signal realizations and power spectra [51–54], and which can then easily be marginalized over the power spectra to obtain a generic signal reconstruction. This problem was also addressed approximately for the case of linear-response data from a Gaussian signal subject to Gaussian noise, using the MAP principle as well as parameter-uncertainty renormalized estimation, by [11]. We re-address this problem here using the minimal free energy approach.

We assume the covariance $S = \langle s\, s^\dagger\rangle_{(s)}$ of our Gaussian signal $s$ to be diagonal within some known function basis $O_{kx}$, e.g. the Fourier basis with $O_{kx} = e^{i k x}$. We model the power spectrum (in this basis) as a linear combination of a number of positive basis functions $f_i(k)$ with disjoint supports (the spectral bands), so that

$$P_s(k) = \sum_i p_i\, f_i(k)\qquad (63)$$

is positive for all $k$ (all coefficients of $p = (p_i)_i$ are positive and the spectral bands cover the full $k$-space domain). We define

$$(S_i)_{xy} = (O^\dagger\, \hat{f}_i\, O)_{xy} = \overline{O_{kx}}\; f_i(k)\; O_{ky}\qquad (64)$$

to be the $i$-th spectral band matrix, and $S_i^{-1}$ to be its pseudo-inverse. Thus, we write our signal covariance as

$$S = \sum_i p_i\, S_i,\qquad (65)$$

with $p = (p_i)$ the vector of unknown spectral parameters. We further assume that the individual signal-band amplitudes $p_i$ have independent prior distributions,

$$P(p) = \prod_i P(p_i),\qquad (66)$$

with the individual priors being inverse-gamma distributions, power laws with an exponential low-amplitude cutoff at $q_i$:

$$P(p_i) = \frac{1}{q_i\, \Gamma(\alpha_i - 1)}\, \left(\frac{p_i}{q_i}\right)^{-\alpha_i}\, \exp\!\left(-\frac{q_i}{p_i}\right).\qquad (67)$$

For $\alpha_i \gg 1$ the ratio $q_i/\alpha_i$ determines the preferred value. A non-informative prior would be given by Jeffreys prior with $\alpha_i = 1$ and $q_i = 0$ (since this would result in an improperly normalized prior, we understand it as $\alpha_i = 1 + \epsilon$, $q_i = \epsilon$, with the limit $\epsilon \to 0$ taken at the end of the calculation). For a linear data model

$$d = R\, s + n,\qquad (68)$$

with Gaussian noise with covariance $N = \langle n\, n^\dagger\rangle_{(n)}$, the parameter-marginalized effective Hamiltonian is, according to [11],

$$H(d,s) \cong \frac{1}{2}\, s^\dagger M\, s - j^\dagger s + \sum_i \gamma_i\, \log\!\left(q_i + \frac{1}{2}\, s^\dagger S_i^{-1} s\right).\qquad (69)$$

Here $M = R^\dagger N^{-1} R$, $j = R^\dagger N^{-1} d$, $\gamma_i = \alpha_i - 1 + \varrho_i/2$, and $\varrho_i = \mathrm{Tr}[S_i^{-1} S_i]$ is the number of spectral degrees of freedom within band $i$.
2. Free energy expansion
The internal energy of a Gaussian posterior ansatz is then

$$\tilde{U}(m,D) \cong \frac{1}{2}\, m^\dagger M\, m + \frac{1}{2}\,\mathrm{Tr}(D\, M) - j^\dagger m + \sum_i \gamma_i\, \underbrace{\left\langle \log\!\left(q_i + \frac{1}{2}\, s^\dagger S_i^{-1} s\right)\right\rangle_{(s|m,D)}}_{I_i}.\qquad (70)$$

Again we have to deal with a Gaussian average over a logarithm, which we expand as

$$I_i = \log(\tilde{q}_i) - \sum_{k=1}^{\infty} \frac{(-1)^k}{k\, (\tilde{q}_i)^k}\, \underbrace{\left\langle \left(q_i + \frac{1}{2}\, s^\dagger S_i^{-1} s - \tilde{q}_i\right)^{k}\right\rangle_{(s|m,D)}}_{\mathrm{II}_{ik}},\quad\text{with}\quad \tilde{q}_i = q_i + \frac{1}{2}\,\mathrm{Tr}\big((m\, m^\dagger + \delta\, D)\, S_i^{-1}\big).\qquad (71)$$

Here we have introduced a parameter $\delta$, to be fixed soon. The first two expansion coefficients are

$$\mathrm{II}_{i1} = \frac{1}{2}\,(1-\delta)\,\mathrm{Tr}(D\, S_i^{-1}),$$
$$\mathrm{II}_{i2} = \mathrm{II}_{i1}^2 + \mathrm{Tr}\!\left(\left(m\, m^\dagger + \frac{1}{2}\, D\right) S_i^{-1}\, D\, S_i^{-1}\right).\qquad (72)$$
3. Zeroth order solution
To zeroth order we find, by minimizing the free energy while ignoring the II corrections,

$$m = D'\, j,\qquad D = T\, D',\qquad\text{and}\qquad D' = \left(M + \sum_i p_i^{-1}\, S_i^{-1}\right)^{-1}.\qquad (73)$$

This means that the map is the Wiener-filtered data, where the spectral coefficients are assumed to be

$$p_i = \frac{\tilde{q}_i}{\gamma_i\, \delta} = \frac{1}{\gamma_i\, \delta}\left(q_i + \frac{1}{2}\,\mathrm{Tr}\big((m\, m^\dagger + \delta\, D)\, S_i^{-1}\big)\right).\qquad (74)$$

For $\delta = 0$ this yields $p_i = \infty$ and therefore $D = M^{-1}$, if $M$ is (pseudo-)invertible. The resulting filter provides a noise-weighted deconvolution, but it is unable to extrapolate into unobserved regions of the signal space. It is widely used for map making in the field of cosmic microwave background observations. For $\delta = 1$ we recover the critical estimator of [11]. Since there it was shown that the latter performs significantly better than the former, and also since $\mathrm{II}_{i1} = 0$ and $\mathrm{II}_{i2}$ is minimal for $\delta = 1$, we adopt this value in the following. For Jeffreys prior we find

$$p_i = \frac{\mathrm{Tr}(B_i)}{\varrho_i},\qquad (75)$$

with $B_i = (m\, m^\dagger + D)\, S_i^{-1}$.
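The zeroth-order scheme of Eqs. (73)–(75) can be iterated directly. The sketch below (a self-contained toy with assumed settings: the simplest linear model $d = s + n$ with white Gaussian noise, Fourier modes grouped by $|k|$ as spectral bands, Jeffreys prior and $\delta = 1$) alternates the Wiener filter step with the band-power update; note that bands with low signal-to-noise may be driven towards zero power by this filter:

```python
import numpy as np

rng = np.random.default_rng(7)
npix, noise_var, T = 256, 1.0, 1.0

k = np.rint(np.abs(np.fft.fftfreq(npix) * npix)).astype(int)   # integer |k| of each Fourier mode
P_true = 50.0 / (1.0 + k) ** 3                                  # assumed true power spectrum

white = np.fft.fft(rng.standard_normal(npix), norm="ortho")     # unitary FFT keeps white noise white
s_true = np.fft.ifft(np.sqrt(P_true) * white, norm="ortho").real
d_k = np.fft.fft(s_true + np.sqrt(noise_var) * rng.standard_normal(npix), norm="ortho")

bands = np.unique(k)
p = np.full(bands.size, np.mean(np.abs(d_k) ** 2))              # initial guess for the band powers

for it in range(300):
    P = p[np.searchsorted(bands, k)]                            # per-mode spectrum from band powers
    Dprime = 1.0 / (1.0 / noise_var + 1.0 / P)                   # D' = (M + S^-1)^-1, diagonal here
    m_k = Dprime * d_k / noise_var                               # m = D' j (Wiener filter), Eq. (73)
    Dkk = T * Dprime                                             # D = T D'
    # Jeffreys-prior band power update, Eq. (75): p_i = Tr(B_i)/rho_i
    p_new = np.array([np.mean(np.abs(m_k[k == b]) ** 2 + Dkk[k == b]) for b in bands])
    p_new = np.maximum(p_new, 1e-20)
    if np.allclose(p_new, p, rtol=1e-6):
        p = p_new
        break
    p = p_new

print("recovered band powers (bands 0-4):", np.round(p[:5], 2))
print("true band powers      (bands 0-4):", np.round(50.0 / (1.0 + bands[:5]) ** 3, 2))
```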
4. Second order correction
Including higher-order corrections should improve the reconstruction. The first-order corrections vanish for $\delta = 1$. The second-order correction yields

$$m = D'\, j,\qquad D = T\left[D'^{-1} - \sum_i \frac{\gamma_i}{\tilde{q}_i^{\,2}}\, S_i^{-1}\, m\, m^\dagger\, S_i^{-1}\right]^{-1},\qquad D' = \left[M + \sum_i \frac{\gamma_i}{\tilde{q}_i}\, X_i\, S_i^{-1}\right]^{-1},\qquad (76)$$
$$X_i = 1 + \frac{1}{\tilde{q}_i}\,\mathrm{Tr}\!\left(\left(m\, m^\dagger + \frac{1}{2}\, D\right) S_i^{-1}\, D\, S_i^{-1}\right) - \frac{1}{\tilde{q}_i}\, S_i^{-1}\, D.$$

The operator $D'$, which is applied to $j$ to generate the map, and the uncertainty dispersion $D$ are not identical any more. Neither of them can still be expressed as $(M + \sum_i p_i^{-1} S_i^{-1})^{-1}$, due to the operator structure of the $S_i^{-1} D$ and $S_i^{-1} m\, m^\dagger$ terms. This was also found in [11]. However, if we can assume that this operator processes any channel in the $i$-th band in a similar way, we can replace $S_i^{-1} D\, S_i^{-1}$ and $S_i^{-1} m\, m^\dagger\, S_i^{-1}$ by their channel-averaged values $\mathrm{Tr}(D\, S_i^{-1})\, S_i^{-1}/\varrho_i$ and $\mathrm{Tr}(m\, m^\dagger S_i^{-1})\, S_i^{-1}/\varrho_i$, respectively. This permits us to identify spectral coefficients of $D = (M + \sum_i p_i^{-1} S_i^{-1})^{-1}$ and $D' = (M + \sum_i p_i'^{-1} S_i^{-1})^{-1}$. For Jeffreys prior they become

$$p_i = \frac{\mathrm{Tr}(B_i)}{\varrho_i}\left(1 - \frac{1}{\varrho_i}\,\frac{\mathrm{Tr}\big(m\, m^\dagger S_i^{-1}\big)}{\mathrm{Tr}(B_i)}\right)^{-1},\quad\text{and}\quad p_i' = \frac{\mathrm{Tr}(B_i)}{\varrho_i}\left[1 + \frac{1}{\varrho_i}\,\frac{\mathrm{Tr}\big(m\, m^\dagger S_i^{-1}\big)\,\mathrm{Tr}\big(D\, S_i^{-1}\big)}{\big(\mathrm{Tr}(B_i)\big)^{2}}\right]^{-1},\qquad (77)$$

where $m$, $D$, and $B_i = (m\, m^\dagger + D)\, S_i^{-1}$ all depend on $p$. It is obvious that the second-order correction increases $p_i$ by some margin compared to (75), meaning that the reconstruction uncertainty increases. It is less obvious how $p_i'$ develops, since at first glance it seems to be corrected downwards. Note, however, that an increased $p_i$ implies an increased $\mathrm{Tr}(B_i)$, since $D$ grows (spectrally) with increasing $p_i$.

The fact that we get two differing sets of spectral coefficients, $p_i$ and $p_i'$, reminds us to regard them as auxiliary variables of our signal reconstruction algorithm, rather than as optimal spectrum estimates.

C. Poisson log-normal distribution with unknown spectrum
The combined problem, reconstructing a Poisson log-normal signal with unknown spectrum, can now be treated approximately. The combined free energy for the Gaussian posterior approximation to zeroth order is

$$\tilde{G}(m,D) \approx \sum_i \Big[ R_i^\dagger\, e^{b\, m + \frac{1}{2} b^2 \hat{D}} - d_i\, \log\!\big(R_i^\dagger e^{b\, m}\big)\Big] + \sum_i \gamma_i\, \log\!\left(q_i + \frac{1}{2}\,\mathrm{Tr}\big((m\, m^\dagger + D)\, S_i^{-1}\big)\right) - \frac{T}{2}\,\mathrm{Tr}\big[1 + \ln(2\pi D)\big].\qquad (78)$$

The resulting map and uncertainty dispersion are provided by (59), with the addition that $S = \sum_i p_i\, S_i$ and the $p_i$ are provided by (74). Higher-order corrections can be included in a similar way as in the individual problems. Also background counts with known or unknown covariance structure can be included in the same way they were treated in Sect. III A 5.

IV. INFORMATION SYNTHESIS

A. Multi-temperature posterior
Although the Gaussian knowledge states obtained from minimal free energy estimation are approximate and therefore of limited accuracy, they might permit us to construct more accurate models of the posterior. The idea is to combine several Gaussian distributions into a more accurate approximation of the true non-Gaussian posterior probability, and to measure the mean map and its uncertainty dispersion from this combination.

We recall that Gaussian approximations of the posterior obtained at low temperatures ($T \ll 1$) mostly carry information on its peak region, while those obtained at large temperatures ($T \gg 1$) carry information on its asymptotics. Also the canonical $T = 1$ does not provide a perfect representation of the posterior, as a Gaussian approximation of a non-Gaussian PDF never can. However, by combining such different approximations in an appropriate way, we should obtain an improved representation of the correct PDF, which permits a much easier calculation of moments like the signal mean and its uncertainty variance.

To this end we postulate the existence of a temperature distribution function $P(T)$, such that

$$P(s|d) = \int_0^{\infty} dT\; \mathcal{G}\big(s - m^{(d,T)},\, D^{(d,T)}\big)\, P(T)\qquad (79)$$

combines the different Gaussians with means $m^{(d,T)}$ and dispersions $D^{(d,T)}$ to synthesize the right posterior probability. A formal proof of the existence of $P(T)$, and of the necessary conditions for this, is beyond the scope of this work. It should be noted that, e.g., multi-peaked distributions cannot accurately be represented by approximate Gaussians obtained at different temperatures. They can, however, often be well approximated by Gaussians centered on those peaks. The recipes described below do not depend on the way the different Gaussians used in the mixture model were obtained, and therefore can also be used in such cases.

In the following we provide a recipe to construct $P(T)$ in practice. We assume that $m_i = m^{(d,T_i)}$ and $D_i = D^{(d,T_i)}$ have been computed for a number $N_T$ of temperatures $T_i$. The temperatures are best chosen to sample well the different parts of the posterior: its peak by having some $T_i \ll 1$, the bulk of the PDF with $T_i = 1$, and the PDF tails with $T_i \gg 1$. The synthesized posterior approximation is then

$$\tilde{P}(s|d) = \sum_{i=1}^{N_T} \mathcal{G}(s - m_i, D_i)\, P_i.\qquad (80)$$

$\tilde{P}(s|d)$ should be as close as possible to $P(s|d)$ in an information theoretical sense. The natural choice for the distance measure is the Kullback-Leibler divergence, which measures the cross information of $\tilde{P}(s|d)$ on $P(s|d)$, and which is practically identical to the free energy $\tilde{G}[\tilde{P}(s|d)]$ of our surrogate posterior according to (35). Introducing un-normalized probabilities $p_i$ as our degrees of freedom, and setting $P_i = p_i/Z_p$ with $Z_p = \sum_j p_j$ in order to enforce the proper normalization, $\sum_i P_i = 1$, this reads

$$\tilde{G}(p) = \sum_i \frac{p_i}{Z_p}\,\big(U_i - \tilde{U}_i(p)\big) - F.\qquad (81)$$

We have introduced the free energy $F = -\log Z_d$ of the original problem, which is irrelevant here since it is $p$-independent, and the energies $U_i$ and $\tilde{U}_i(p)$ with respect to the template distributions $\mathcal{G}_i(s) = \mathcal{G}(s - m_i, D_i)$:

$$U_i = \langle H(d,s)\rangle_{\mathcal{G}_i} = \int \mathcal{D}s\; \mathcal{G}_i(s)\, H(d,s),\quad\text{and}\quad \tilde{U}_i(p) = \big\langle \tilde{H}_p(s)\big\rangle_{\mathcal{G}_i},\quad\text{with}\qquad (82)$$
$$\tilde{H}_p(s) = -\log\!\Big(\sum_i p_i\, \mathcal{G}_i(s)\, /\, Z_p\Big).$$

B. Minimizing the Gibbs energy
1. Analytical scheme
Now one has to minimize $\tilde{G}(p)$ with respect to $p$. The problem of calculating the path integrals defining the energies was already addressed in this work. A systematic way is to Taylor-Fréchet expand the Hamiltonians around the centers of the Gaussians $m_i$ and then use the known moments of $\mathcal{G}_i(s)$ to approximate the energies. For the surrogate energies this yields, up to second order in $\phi_i = s - m_i$,

$$\tilde{U}_i(p) = -\log g_i + \frac{1}{2}\sum_j \frac{g_j^i}{g_i}\,\mathrm{Tr}\big(D_j^{-1} D_i\big) + \frac{1}{2}\sum_{jk} \frac{g_j^i}{g_i}\left(\frac{g_k^i}{g_i} - \delta_{jk}\right) m_{ij}^\dagger\, D_j^{-1}\, D_i\, D_k^{-1}\, m_{ik},\qquad (83)$$

with

$$g_j^i = p_j\, \mathcal{G}_j(m_i)\, /\, Z_p,\qquad g_i = \sum_{j=1}^{N_T} g_j^i,\qquad\text{and}\qquad m_{ij} = m_i - m_j.\qquad (84)$$
2. Monte-Carlo scheme
Alternatively, one can approximate the average $\langle X[s]\rangle_{\mathcal{G}_i}$ of a quantity $X[s]$ by a sum over $N_i$ sampling points $\{s_i^{(j)}\}_j$, which can easily be drawn from $\mathcal{G}_i(s)$:

$$\langle X[s]\rangle_{\mathcal{G}_i} \approx \sum_j X[s_i^{(j)}]\, /\, N_i.\qquad (85)$$

This way, $\tilde{G}(p)$ can be approximated and minimized with a suitable optimization scheme. The sampling points, their Gaussian probabilities $\mathcal{G}_{ki}^{(j)} = \mathcal{G}_k(s_i^{(j)})$, as well as the energies $U_i$ need only be calculated once, but the surrogate energies $\tilde{U}_i(p) = \log Z_p - \sum_j \log\big(\sum_k p_k\, \mathcal{G}_{ki}^{(j)}\big)\, /\, N_i$ have to be updated at every step of the scheme.

One might argue that if we use stochastic methods to build $\tilde{P}(s|d)$, one could have used a Markov-Chain Monte-Carlo (MCMC) method right from the beginning for the signal inference problem. However, we expect that the posterior synthesis method described here should reproduce the correct posterior better than a sample point cloud, since we are using well adapted Gaussians as our building blocks and not delta functions as the direct MCMC approach does. Furthermore, the analytical and sampling methods can be combined, in that the analytical estimates are complemented with sampling estimates of the contributions of the neglected terms in the Taylor-Fréchet expansions of (83). And finally, since our scheme draws samples from Gaussians, it can be trivially parallelized, which is not easily possible with MCMC schemes.
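The following one-dimensional sketch (a toy bimodal posterior with assumed numbers; the two Gaussians are centered on the two peaks, one of the valid choices mentioned above) implements the Monte-Carlo variant: the energies of Eqs. (81)–(82) are estimated from samples, the mixture weight is found by a simple search, and Eqs. (86)–(87) of the next subsection then give mean and variance:

```python
import numpy as np

rng = np.random.default_rng(11)
H = lambda s: 2.0 * (s**2 - 1.0) ** 2 + 0.8 * s           # toy double-well Hamiltonian

m_i = np.array([-1.05, 0.95])                              # approximate peak locations (assumed)
D_i = 1.0 / (24.0 * m_i**2 - 8.0)                          # 1 / H''(m_i): local-curvature variances
nsamp = 20000
samples = [m + np.sqrt(Dv) * rng.standard_normal(nsamp) for m, Dv in zip(m_i, D_i)]

gauss = lambda s, m, Dv: np.exp(-0.5 * (s - m) ** 2 / Dv) / np.sqrt(2 * np.pi * Dv)
U_i = np.array([H(si).mean() for si in samples])           # U_i = <H>_{G_i}, Eq. (82)

def gibbs(weight):                                         # sampled Eq. (81) with P = (w, 1-w), F dropped
    P = np.array([weight, 1.0 - weight])
    U_tilde = np.array([
        -np.log(P[0] * gauss(si, m_i[0], D_i[0]) + P[1] * gauss(si, m_i[1], D_i[1])).mean()
        for si in samples])                                # <H_p>_{G_i} estimated from the samples
    return np.sum(P * (U_i - U_tilde))

weights = np.linspace(0.01, 0.99, 99)
w_best = weights[np.argmin([gibbs(w) for w in weights])]
P = np.array([w_best, 1.0 - w_best])
mean = np.sum(P * m_i)                                     # Eq. (86)
var = np.sum(P * (D_i + m_i**2)) - mean**2                 # Eq. (87)
print(f"mixture weights = {P.round(2)}, mean = {mean:.3f}, variance = {var:.3f}")
```

For higher-dimensional problems the grid search over weights would be replaced by a standard optimizer, while the per-sample Gaussian probabilities are still computed only once.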
C. Maps and moments

Once the minimum of $\tilde{G}(p)$ with respect to $p$ is found, one has synthesized a posterior approximation with a Gaussian mixture model. From this, any moment of the distribution function can easily be calculated. The mean map can be expressed as

$$m \approx \langle s\rangle_{\tilde{P}(s)} = \sum_i P_i\, \langle s\rangle_{\mathcal{G}_i(s)} = \sum_i P_i\, m_i,\qquad (86)$$

as well as the uncertainty dispersion as

$$D \approx \big\langle (s-m)(s-m)^\dagger\big\rangle_{\tilde{P}(s)} = \sum_i P_i\,\big(D_i + m_i\, m_i^\dagger\big) - m\, m^\dagger.\qquad (87)$$

We leave the verification and application of the information synthesis method for future work.

V. CONCLUSIONS
We have shown that the minimal Gibbs free energy principle in information field theory can be used to obtain approximate knowledge states with maximal cross information to the exact posterior. The construction of such knowledge states with a Gaussian PDF is relatively straightforward:

1. The joint PDF of signal and data $P(d,s)$ has to be specified, e.g. by specifying a data likelihood $P(d|s)$ and signal prior $P(s)$, and using $P(d,s) = P(d|s)\, P(s)$.

2. The information Hamiltonian $H$ is the negative logarithm of this, $H(d,s) = -\log(P(d,s))$.

3. A suitably parametrized PDF as a surrogate for the posterior has to be specified, e.g. a Gaussian with its mean and dispersion as degrees of freedom.

4. The internal energy $U$ and entropy $S_{\rm B}$ of this PDF have to be calculated as the PDF-average of the Hamiltonian and of the negative log-PDF, respectively.

5. The Gibbs free energy, $G = U - T S_{\rm B}$, has then to be minimized with respect to all degrees of freedom of the surrogate PDF.

6. Any statistical summary like mean and variance can now be extracted from the surrogate PDF.

The minimal free energy principle is therefore well suited to tackle statistical inference problems. We have demonstrated this with two different problems and their combination: reconstructing a log-normal field from Poisson data subject to a point spread function, and reconstruction without prior knowledge of the signal power spectrum. Earlier results from renormalization calculations in [9, 11] have been reproduced. The renormalization schemes used there can therefore be understood as aiming for a surrogate Gaussian PDF which has maximal cross information with the correct posterior. Since these results were previously shown to yield good reconstructions, the method proposed here for the more complicated combined case can also be expected to work. However, a detailed implementation and verification of this is left for future work.

Finally, we have sketched how Gaussian knowledge states obtained at different thermodynamical temperatures can be combined into a more accurate representation of the posterior, from which moments of the signal uncertainty distribution can easily be extracted.

The minimal Gibbs energy and maximal cross information principles introduced here to IFT should allow the construction of novel reconstruction schemes for statistical inference problems on spatially distributed signals. The thermodynamical language may help to clarify concepts and to simplify applications of IFT, since it permits us to tackle non-linear inverse problems without the need to use diagrammatic perturbation theory and renormalization schemes.

Acknowledgments
We thank Mona Frommert, Jens Jasche, Niels Oppermann, Gerhard Börner, and three referees for discussions and comments on the manuscript.

[1] E. T. Jaynes, Physical Review, 620 (1957).
[2] E. T. Jaynes, Physical Review, 171 (1957).
[3] J. N. Fry, ApJ, 10 (1985).
[4] E. Bertschinger, ApJ, L103 (1987).
[5] W. Bialek and A. Zee, Physical Review Letters, 741 (1987).
[6] W. Bialek and A. Zee, Physical Review Letters, 1512 (1988).
[7] W. Bialek, C. G. Callan, and S. P. Strong, Physical Review Letters, 4693 (1996), arXiv:cond-mat/9607180.
[8] J. C. Lemm, ArXiv Physics e-prints (1999), physics/9912005.
[9] T. A. Enßlin, M. Frommert, and F. S. Kitaura, Phys. Rev. D, 105005 (2009), 0806.3474.
[10] J. C. Lemm, Bayesian Field Theory (Johns Hopkins University Press, 2003).
[11] T. A. Enßlin and M. Frommert, ArXiv e-prints (2010), 1002.2928.
[12] R. T. Cox, American Journal of Physics, 1 (1946).
[13] R. T. Cox, American Journal of Physics, 66 (1963).
[14] S. F. Gull and G. J. Daniell, Nature, 686 (1978).
[15] J. Skilling, A. W. Strong, and K. Bennett, MNRAS, 145 (1979).
[16] R. K. Bryan and J. Skilling, MNRAS, 69 (1980).
[17] S. F. Burch, S. F. Gull, and J. Skilling, Computer Vision Graphics and Image Processing, 113 (1983).
[18] S. F. Gull and J. Skilling, in Indirect Imaging. Measurement and Processing for Indirect Imaging (1983), p. 267.
[19] S. Sibisi, J. Skilling, R. G. Brereton, E. D. Laue, and J. Staunton, Nature, 446 (1984).
[20] D. M. Titterington and J. Skilling, Nature, 381 (1984).
[21] J. Skilling and R. K. Bryan, MNRAS, 111 (1984).
[22] R. K. Bryan and J. Skilling, Journal of Modern Optics, 287 (1986).
[23] S. F. Gull, in Maximum Entropy and Bayesian Methods, edited by J. Skilling (Kluwer Academic Publishers, Dordrecht, 1989), pp. 53–71.
[24] S. F. Gull and J. Skilling, The MEMSYS5 User's Manual (Maximum Entropy Data Consultants Ltd, Royston, 1990).
[25] J. Skilling, in Maximum Entropy and Bayesian Methods, edited by G. J. Erickson, J. T. Rychert, and C. R. Smith (1998), p. 1.
[26] S. Kullback and R. Leibler, Annals of Mathematical Statistics 22 (1), 79 (1951).
[27] E. Hubble, ApJ, 8 (1934).
[28] F. S. Kitaura, J. Jasche, C. Li, T. A. Enßlin, R. B. Metcalf, B. D. Wandelt, G. Lemson, and S. D. M. White, MNRAS, 183 (2009), 0906.3978.
[29] D. Layzer, AJ, 383 (1956).
[30] P. Coles and B. Jones, MNRAS, 1 (1991).
[31] R. K. Sheth, MNRAS, 933 (1995), astro-ph/9511096.
[32] I. Kayo, A. Taruya, and Y. Suto, ApJ, 22 (2001), arXiv:astro-ph/0105218.
[33] R. Vio, P. Andreani, and W. Wamsteker, PASP, 1009 (2001), arXiv:astro-ph/0105107.
[34] M. C. Neyrinck, I. Szapudi, and A. S. Szalay, ApJ, L90 (2009), 0903.4693.
[35] J. Jasche and F. S. Kitaura, ArXiv e-prints (2009), 0911.2496.
[36] J. Jasche, F. S. Kitaura, C. Li, and T. A. Enßlin, ArXiv e-prints (2009), 0911.2498.
[37] F. Kitaura, J. Jasche, and R. B. Metcalf, MNRAS, 589 (2010), 0911.1407.
[38] C. Weig and T. A. Enßlin, ArXiv e-prints (2010), 1003.1311.
[39] T. J. Hebert and R. Leahy, IEEE Transactions on Signal Processing, 2290 (1992).
[40] T. J. Cornwell and K. F. Evans, A&A, 77 (1985).
[41] G. Wang, L. Fu, and J. Qi, Physics in Medicine and Biology, 593 (2008).
[42] C. Oh and B. Roy Frieden, Optics Communications, 2489 (2009).
[43] D. H. Roberts, J. Lehar, and J. W. Dreher, AJ, 968 (1987).
[44] G. B. Rybicki and W. H. Press, ApJ, 169 (1992).
[45] J. Li and P. Stoica, IEEE Transactions on Signal Processing, 1469 (1996).
[46] P. Stoica, H. Li, and J. Li, IEEE Signal Processing Letters, 205 (1999).
[47] P. Stoica, E. G. Larsson, and J. Li, AJ, 2163 (2000).
[48] P. K. Kitanidis, Water Resources Research, 499 (1986).
[49] G. Rydbeck, ApJ, 1304 (2008).
[50] U. Seljak, ApJ, 492 (1998), astro-ph/9710269.
[51] B. D. Wandelt, D. L. Larson, and A. Lakshminarayanan, Phys. Rev. D, 083511 (2004), arXiv:astro-ph/0310080.
[52] H. K. Eriksen, I. J. O'Dwyer, J. B. Jewell, B. D. Wandelt, D. L. Larson, K. M. Górski, S. Levin, A. J. Banday, and P. B. Lilje, ApJS, 227 (2004), arXiv:astro-ph/0407028.
[53] J. Jewell, S. Levin, and C. H. Anderson, ApJ, 1 (2004), arXiv:astro-ph/0209560.
[54] J. Jasche, F. S. Kitaura, B. D. Wandelt, and T. A. Enßlin, MNRAS 406