Akaike's Bayesian information criterion (ABIC) or not ABIC for geophysical inversion
aa r X i v : . [ s t a t . M E ] N ov Akaike’s Bayesian information criterion (ABIC) or not ABICfor geophysical inversion
Peiliang XuDisaster Prevention Research Institute, Kyoto University, Uji, Kyoto 611-0011, Japanemail: [email protected]
Abstract:
Akaike’s Bayesian information criterion (ABIC) has been widely used in geophysicalinversion and beyond. However, little has been done to investigate its statistical aspects. Wepresent an alternative derivation of the marginal distribution of measurements, whose maximiza-tion directly leads to the invention of ABIC by Akaike. We show that ABIC is to statisticallyestimate the variance of measurements and the prior variance by maximizing the marginal dis-tribution of measurements. The determination of the regularization parameter on the basis ofABIC is actually equivalent to estimating the relative weighting factor between the variance ofmeasurements and the prior variance for geophysical inverse problems. We show that if the noiselevel of measurements is unknown, ABIC tends to produce a substantially biased estimate of thevariance of measurements. In particular, since the prior mean is generally unknown but arbitrarilytreated as zero in geophysical inversion, ABIC does not produce a reasonable estimate for the priorvariance either.
Keywords:
Akaike’s Bayesian information criterion, marginal likelihood functions, prior variance,regularization parameter, variance of measurements.
A linear inverse ill-posed problem can often be written symbolically as the following linear model: y = A β + ǫ , (1)where y is an ( n ×
1) vector of measurements, A is the deterministic coefficient matrix, which is assumedto be theoretically of full column rank but with singular values close to zero, β is a ( t ×
1) vector ofunknown parameters to be estimated, and the random error vector ǫ is of zero mean and variance-covariance matrix W σ , W is an ( n × n ) (positive definite) weighting matrix. If the least squares (LS)method is applied to (1), we have the weighted LS solution of β below:ˆ β = ( A T WA ) − A T Wy . (2)Although ˆ β is unbiased, it can become highly unstable and practically meaningless physically, becausethe normal matrix A T WA , often denoted by N = A T WA , is of almost zero eigenvalues (see e.g.,Phillips 1962; Tikhonov 1963; Tikhonov and Arsenin 1977).To obtain a mathematically and/or physically meaningful solution to inverse ill-posed problems (1),we can either limit ourselves to a sub-space spanned by the vectors corresponding to sufficiently largeeigenvalues (see e.g., Xu 1998) or add a positive (semi-)definite matrix, say W β κ , to the normal matrix N such that the addition of both matrices avoids sufficiently small eigenvalues, where W β is positive(semi-)definite and κ is a positive scalar. This latter method is well known either as ridge regressionin the statistical literature (see e.g., Hoerl and Kennard 1970; Vinod and Ullah 1981; Xu 1992a; Xuand Rummel 1994) or regularization in the literature of mathematics and applied sciences (see e.g.,Tikhonov 1963; Tikhonov and Arsenin 1977), with the solution being given as follows:ˆ β r = ( A T WA + κ W β ) − A T Wy , (3)which will be called a regularized solution in this letter, for simplicity but without loss of confusion.Determination of the regularization parameter κ is of crucial importance in regularizing the inverse ill-posed model (1). If κ is too small, the regularized solution ˆ β r of (3) may still remain unstable; however,if it is chosen too large, ˆ β r will become over-smoothed. In particular, if κ tends positively to infinity,ˆ β r will shrink towards zero and the measurements y play no role at all in retrieving the informationon β . There are a number of methods to determine the regularization parameter κ , depending on howthe regularized solution ˆ β r of (3) is interpreted. The view points can be either presented in terms offrequentist, under the Bayesian framework, or even intuitively in terms of some norms of residuals ofmeasurements and/or estimates of β .From the frequentist point of view, ˆ β r of (3) has been known as a biased estimator of β . As a result,the regularization parameter has been mainly motivated to compromise between noise amplification and1ias such as the criterion of mean squared errors (see e.g., Hoerl and Kennard 1970; Vinod and Ullah1981; Xu 1992a; Xu and Rummel 1994) or predict measurements such as generalized cross-validationand maximum likelihood (see e.g., Golub et al. 1979; Wahba 1985; Xu 2009). Norms of residuals andparameters have played an important role in choosing the regularization parameter. Very often, one caneither determine the regularization parameter, given a fixed level of noise for the residual norm (see e.g.,Miller 1970; Tikhonov and Arsenin 1977; Morozov 1984) or by finding a balance between some normsof residuals and parameters (see e.g., Hansen and O’Leary DP 1993).The inverse ill-posed model (1) has also been solved under the Bayesian framework (see e.g., Akaike1980; Tarantola 1987). Assuming that there exists prior information on the parameters β in terms ofthe first and second (central) moments, namely, E ( β ) = µ , D ( β ) = W − β σ β , (4)where the prior vector µ is given, D ( · ) stands for variance operator, W β is a positive definite matrixand the scalar σ β is given as well. Then applying stochastic inference to inverse problems (1) with priorinformation (4) results in the following estimator:ˆ β i = A T WA σ + W β σ β ! − A T Wy σ + W β µ σ β ! . (5)If the prior values µ are equal to zero, namely, µ = , and denoting κ = σ /σ β , then the stochasticinference ˆ β i of (5) formally becomes the regularized or biased solution ˆ β r of (3).An alternative use of prior information to solve inverse problems (1) is based on full Bayesian in-ference, which requires the assumption of distributions of both measurements and prior data. As inthe case of regularization and/or stochastic inference, proper choice of the regularization parameter iscrucial in Bayesian inversion. Akaike (1980) proposed a Bayesian information criterion to determinethe regularization parameter, which has since been widely applied in geophysical inversion. However,little has been known about statistical aspects of Akaike’s Bayesian information criterion (ABIC) forill-posed inverse problems. The purpose of this paper is primarily to investigate statistical performancesof ABIC. We will provide an alternative representation of the marginal distribution of measurements.We will show that the ABIC determination of regularization parameter is statistically equivalent toestimating the variance σ of measurements and the prior variance σ β . We will also prove in this sectionthat if the prior mean µ is unknown but arbitrarily treated as zero, as almost always in the case ofpractical geophysical inversion, the estimate of σ will be significantly biased, which will further affectthe estimate of σ β and thus further the determination of the regularization parameter. The theoreticalresults will then be summarized in the concluding remarks. The measurement and prior distributions are usually assumed to be normal and respectively given asfollows: f y ( y / β , σ ) = p det( W )(2 π ) n/ σ n exp (cid:26) − σ ( y − A β ) T W ( y − A β ) (cid:27) , (6a)and π β ( β /σ β ) = p det( W β )(2 π ) t/ σ tβ exp ( − σ β ( β − µ ) T W β ( β − µ ) ) , (6b)as can be found, for example, in Zellner (1971) and Akaike (1980), where det( · ) stands for the determi-nant of a square matrix. Then the joint distribution density of both y and β is given by f ( y , β /σ , σ β ) = f y ( y / β , σ ) π β ( β /σ β ) . (7)Bayesian inference on β has been done on the basis of the following posterior distribution of β giventhe measurements y : π ( β / y , σ , σ β ) = f ( y , β /σ , σ β ) m ( y /σ , σ β )= f y ( y / β , σ ) π β ( β /σ β ) m ( y /σ , σ β ) , (8)2here π ( β / y , σ , σ β ) stands for the posterior distribution of β given y , and m ( y /σ , σ β ) is the marginaldistribution of y , which is defined and given as follows: m ( y /σ , σ β ) = Z f ( y , β /σ , σ β ) d β . (9)As a result, one can derive the Bayesian (posterior) estimate, either by maximizing or computing theposterior mean from the posterior distribution π ( β / y , σ , σ β ) of (8), which is denoted by ˆ β b and simplywritten below ˆ β b = A T WA σ + W β σ β ! − A T Wy σ + W β µ σ β ! , = ( A T WA + κ W ) − ( A T Wy + κ W β µ ) , (10)(see e.g., Zellner 1971; Akaike 1980). It is obvious that both stochastic and Bayesian inferences resultin the same estimator under the assumption of normal distributions by comparing ˆ β i of (5) with ˆ β b of(10). The Bayesian posterior estimator (10) can also be equivalently rewritten asˆ β b = A T WA σ + W β σ β ! − A T WA ˆ β σ + W β µ σ β ! , (11)indicating that ˆ β b is actually the weighted mean of the LS estimate ˆ β and the prior mean µ . If the priordistribution is too poor to be useful, or more precisely, σ β = ⇒ ∞ , then ˆ β b becomes ˆ β . However, for anill-posed inverse problem, the LS estimate ˆ β can be too poor in accuracy and the prior information willconstrain ˆ β b even for a large value of σ β .If µ = in (10), the corresponding Bayesian estimator is then denoted by ˆ β b and (10) becomesˆ β b = A T WA σ + W β σ β ! − A T Wy σ = ( A T WA + κ W β ) − A T Wy , (12)which essentially turns out to be ˆ β r of (3). Remark 1 : For a geophysical inverse problem, the geophysical signals we are seeking from measure-ments y are not equal to zero. In other words, it is generally not fair/reasonable to assume µ = ;otherwise, we have no reasons to collect the data y , as explained in the case of satellite gravimetry (Xuand Rummel 1994). By setting µ = for geophysical inverse problems, the Bayesian estimator (12) isbiased, as correctly pointed out by Akaike (1980) (see also Xu 1992b). From this point of view, it maybecome easily understandable why the regularized solution ˆ β r of (3) has been called a biased estimatorfrom the frequentist point of view.Realizing the importance of choosing the parameter κ , or equivalently σ and σ β , Akaike (1980)proposed a criterion by maximizing the marginal distribution of y , i.e. m ( y /σ , σ β ) of (9) to choose κ ,which is known as Akaike’s Bayesian information criterion or simply ABIC. ABIC has since been widelyapplied to geophysical inverse problems, as can be seen, for example, in Tamura et al. (1991), Yabukiand Matsu’ura (1992) and Fukuda and Johnson (2008), just to name a few of geophysical applications.Although m ( y /σ , σ β ) plays a key role in ABIC, it was simply given in Akaike (1980) without pro-viding any details of its derivation, which has been further basically used verbatim in geophysical appli-cations (see e.g., Tamura et al. 1991; Yabuki and Matsu’ura 1992). Given the normal distribution (6a)of measurements and the normal prior distribution (6b), we have re-derived the marginal distribution m ( y /σ , σ β ) of the measurements y , which is simply given as follows: m ( y /σ , σ β ) = Z f ( y , β /σ , σ β ) d β = 1(2 π ) n/ p det( Σ py ) exp (cid:26) −
12 ( y − A µ ) T Σ − py ( y − A µ ) (cid:27) , (13)where Σ py = W − σ + AW − β A T σ β . (14)The detailed derivation of (13) is given in appendix A. The univariate version of (13) can be found inZellner (1971, p.28). The marginal distribution m ( y /σ , σ β ) appears to be formally different from thecorresponding distribution in Akaike (1980), they are essentially identical.3t has become clear that ABIC for geophysical inversion is the same as estimating the two unknownvariance components σ and σ β or equivalently, the unknown variance σ and the relative weight κ , bymaximizing the marginal distribution (13) of measurements. Mathematically, maximizing m ( y /σ , σ β )of (13) is equivalent to minimizingmin: L ( σ , σ β ) = ln { det( Σ py ) } + ( y − A µ ) T Σ − py ( y − A µ ) . (15)Since κ = σ /σ β , the objective function L ( σ , σ β ) can also be rewritten in terms of σ and κ as follows: L ( σ , κ ) = n ln( σ ) + ln { det( E py ) } + ( y − A µ ) T E − py ( y − A µ ) /σ , (16)where E py = W − + AW − β A T /κ. We should like to note that L ( σ , κ ) of (16) is essentially the same as the corresponding formula inSection 5 of Akaike (1980). Case 1: both σ and σ β unknown. Differentiating L ( σ , κ ) of (16) with respect to σ and letting it equal to zero, we have n c σ − ( y − A µ ) T E − py ( y − A µ )( c σ ) = 0 , or equivalently, c σ = ( y − A µ ) T E − py ( y − A µ ) /n. (17)Given the prior variance σ β or the equivalent relative weight κ , it is rather easy to prove that c σ of (17)is an unbiased estimate of σ . Substituting (17) into (16) and neglecting a constant term, we have theABIC for determining the relative weight (or regularization parameter) κ as follows: L ( κ ) = n ln { ( y − A µ ) T E − py ( y − A µ ) } + ln { det( E py ) } . (18)The likelihood function L ( κ ) consists of two parts: a positive definite quadratic form of the predictedresiduals ( y − A µ ) and their cofactor matrix E py . It is easy to prove mathematically that the first partof L ( κ ) increases with κ , while the second part decreases with the increase of κ .Although µ is generally not equal to zero but unknown in geophysical inversion, it is almost alwaysreplaced with a zero vector under the framework of Bayesian geophysical inversion. In this case, byimposing µ = (even if we know it cannot not zero), both formulae (17) and (18) become c σ s = y T E − py y /n, (19)and L s ( κ ) = n ln { y T E − py y } + ln { det( E py ) } , (20)respectively.Given a value κ , applying the expectation operator to (19) yields E ( c σ s ) = E ( E − py yy T ) /n = y T E − py y /n + tr { E − py W − } σ /n, (21)which clearly indicates that c σ s of (19) can be significantly biased from its true value σ , depending onthe true values of measurements, where y stands for the vector of true values of measurements y . Onthe other hand, since both L ( κ ) of (18) and L s ( κ ) of (20) are different only in whether the predictedresiduals ( y − A µ ) or the measurements y are used to compute the positive definite quadratic form.Since y are generally larger than the residuals ( y − A µ ), the relative weighting factor or regularizationparameter κ is expected to be estimated with a bias. As a result, we expect that the prior variance σ β will also be computed with a bias from both c σ s and κ .4 ase 2: σ known but σ β unknown. If σ is given/known but σ β unknown, use of ABIC to determine the regularization parameter is math-ematically equivalent to finding the best estimate of the prior variance σ β such that it maximizes themarginal distribution m ( y /σ , σ β ) of measurements to most favor the output of the measurements y .It is also equivalent to finding the optimal ABIC estimate of σ β (or κ ) that minimizes the followinglikelihood function: L b ( κ ) = ( y − A µ ) T E − py ( y − A µ ) /σ + ln { det( E py ) } . (22)As in the case of L ( κ ) in (18), the first term of L b ( κ ) increases with κ , while its second term with thecofactor matrix E py decreases with the increase of κ .Since the prior mean µ is practically unknown in geophysical inversion, it is almost always treatedas zero and the likelihood function L b ( κ ) of (22) becomes L bs ( κ ) = y T E − py y /σ + ln { det( E py ) } . (23)Since y is generally expected to be much larger in size than the predicted residual vector ( y − A µ ),the optimal κ from minimizing (23) may tend to be smaller than that from minimizing (22). Thus,arbitrarily assigning the prior mean µ to zero would affect the ABIC estimate of σ β . Akaike’s Bayesian information criterion (ABIC) was proposed by Akaike (1980) and has since beenwidely applied in geophysical inversion. The ABIC method is to determine the regularization parametersuch that the marginal distribution of measurements is maximized. In other words, the ABIC-basedregularization parameter is optimally chosen to most favor the output of the collected measurements.However, little has been done to investigate its statistical aspects for geophysical inversion. We havepresented an alternative representation of the marginal distribution of measurements, which is thestarting point of ABIC. We have shown that ABIC for geophysical inverse problems is statisticallyequivalent to estimating the variance of measurements and the prior variance by maximizing the marginaldistribution of measurements or minimizing the likelihood function of the variance of measurementsand the prior variance. If the prior distribution is correct, the regularization parameter actually reflectsthe relative weighting factor between these two variances. We have proved that if the noise level ofmeasurements is unknown, ABIC tends to produce a substantially biased estimate of the variance ofmeasurements. In particular, since the prior mean is generally unknown but arbitrarily treated as zeroin geophysical inversion, ABIC does not produce a reasonable estimate for the prior variance either. Incase of a given variance of measurements, the determination of the regularization parameter on the basisof ABIC is mathematically equivalent to estimating the prior variance for geophysical inverse problems.We may also like to note that although ABIC maximizes the marginal distribution of measurementsunder the Bayesian framework, it does not directly target at constructing a solution of high quality interms of mean squared errors.
Acknowledgements:
This work is partially supported by the National Natural Science Foundationof China (41874012 and 41674013).
References
Akaike H (1980). Likelihood and the Bayes procedure. in:
Bayesian Statistics , Bernardo JM, de GrootMH, Lindley DV, Smith AFM, eds., University Press, Valencia, Spain, pp.1-13.Anderson TW, Rubin H (1956). Statistical inference in factor analysis. In
Proceedings of the thirdBerkeley symposium on mathematical statistics and probability , Vol.5, pp. 111-150.Fukuda J, Johnson KM (2008). A fully Bayesian inversion for spatial distribution of fault slip withobjective smoothing.
Bull. seism. Soc. Am. , , 1128-1146.Golub GM, Heath M, Wahba G (1979). Generalized cross-validation as a method for choosing a goodridge parameter. Technometrics , , 215-223.Hansen PC, O’Leary DP (1993). The use of the L-curve in the regularization of discrete ill-posedproblems. SIAM J. Sci. Comput. , , 1487-1503.Heiskanen W, Moritz H (1967). Physical Geodesy . Freeman, San Francisco.5oerl AE, Kennard RW (1970). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics , , 55-67.Miller K (1970). Least squares methods for ill-posed problems with a prescribed bound. SIAM J. math.Anal. , , 52-74.Morozov VA (1984). Methods for Solving Incorrectly Posed Problems , Springer, Berlin.Phillips DL (1962). A technique for the numerical solution of certain integral equations of the first kind.
J. Assoc. Comput. Mach. , , 84-97.Reed GB (1973). Application of kinematical geodesy for determining the short wave length componentsof the gravity field by satellite gradiometry. Reports of the Department of Geodetic Science , No.201,Ph.D. dissertation, The Ohio State University.Rummel R (1986). Satellite gradiometry. In:
Mathematical and Numerical Techniques in PhysicalGeodesy , edited by H S¨unkel, Springer, Berlin, pp.317-363.Tamura Y, Sato T, Ooe M, Ishiguro M (1991). A procedure for tidal analysis with a Bayesian informationcriterion.
Geophys. J. Int. , , 507-516.Tarantola A (1987). Inverse Problem Theory: Methods for Data Fitting and Model Parameter Estima-tion . Elsevier, Amsterdam.Tikhonov AN (1963). Regularization of incorrectly posed problems,
Soviet Math. , , 1624-1627.Tikhonov AN, Arsenin VY (1977). Solutions of Ill-posed Problem . John Wiley & Sons, New York.Vinod H, Ullah A (1981).
Recent Advances in Regression . Marcel Dekker, New York.Wahba G (1985). A comparison of GCV and GML for choosing the smoothing parameter in thegeneralized spline smoothing problem.
Ann. Statist. , , 1378-1402.Xu PL (1992a). Determination of surface gravity anomalies using gradiometric observables. Geophys.J. Int. , , 321-332.Xu PL (1992b). The value of minimum norm estimation of geopotential fields. Geophys. J. Int. , 111,170-178.Xu PL (1998) Truncated SVD methods for linear discrete ill-posed problems.
Geophys. J. Int. , ,505-514.Xu PL (2009) Iterative generalized cross-validation for fusing heteroscedastic data of inverse ill-posedproblems. Geophys. J. Int. , , 182-200, doi: 10.1111/j.1365-246X.2009.04280.xXu PL, Rummel R (1994). A generalized ridge regression method with applications in determinationof potential fields. manuscr. geodaetica , , 8-20.Yabuki T, Matsu’ura M (1992). Geodetic data inversion using a Bayesian information criterion forspatial distribution of fault slip. Grophys. J. Int. , , 363-375.Zellner A (1971). An Introduction to Bayesian Inference in Econometrics . Wiley, New York.
Appendix A: the derivation of the marginal distribution m ( y /σ , σ β ) of measurements y in (9) Based on the normal distributions (6a) and (6b), we can write the major exponent part of the jointdistribution (7), denoted by F ( y , β /σ , σ β ), as follows: F ( y , β /σ , σ β ) = 1 σ ( y − A β ) T W ( y − A β ) + 1 σ β ( µ − β ) T W β ( µ − β )= ( β − ˆ β b ) T A T WA σ + W β σ β ! ( β − ˆ β b ) + F y ( y /σ , σ β ) , (24)where F y ( y /σ , σ β ) = 1 σ ( y − A ˆ β b ) T W ( y − A ˆ β b ) + 1 σ β ( µ − ˆ β b ) T W β ( µ − ˆ β b ) , (25)6nd ˆ β b has been given in (10).We now rewrite F y ( y /σ , σ β ) of (25) as follows: F y ( y /σ , σ β ) = 1 σ [( y − A µ ) + A ( µ − ˆ β b )] T W [( y − A µ ) + A ( µ − ˆ β b )]+ 1 σ β ( µ − ˆ β b ) T W β ( µ − ˆ β b )= ( y − A µ ) T W σ ( y − A µ ) + ( y − A µ ) T W σ ( µ − ˆ β b )+( µ − ˆ β b ) T W σ ( y − A µ )+( µ − ˆ β b ) T W σ ( µ − ˆ β b ) + ( µ − ˆ β b ) T W β σ β ( µ − ˆ β b )= ( y − A µ ) T W σ ( y − A µ ) + ( µ − ˆ β b ) T ( W σ + W β σ β ) ( µ − ˆ β b )+( y − A µ ) T W σ ( µ − ˆ β b ) + ( µ − ˆ β b ) T W σ ( y − A µ ) . (26)On the other hand, since the Bayesian estimator ˆ β b is derived from the following normal equations: (cid:20) AI (cid:21) T (cid:20) W /σ
00 W β /σ β (cid:21) (cid:20) y − A ˆ β b µ − ˆ β b (cid:21) = , or A T W σ ( y − A ˆ β b ) + W β σ β ( µ − ˆ β b ) = , which is equivalent to A T W σ ( y − A µ ) + A T W σ A ( µ − ˆ β b ) + W β σ β ( µ − ˆ β b ) = . (27)Thus, we have ( µ − ˆ β b ) = − " A T WA σ + W β σ β − A T W σ ( y − A µ ) . (28)By substituting ( µ − ˆ β b ) of (28) into (26) and after some slight rearrangement, we finally obtain F y ( y /σ , σ β ) = ( y − A µ ) T W σ ( y − A µ ) − ( y − A µ ) T W σ A " A T WA σ + W β σ β − A T W σ ( y − A µ )= ( y − A µ ) T W σ − W σ A " A T WA σ + W β σ β − A T W σ ( y − A µ )= ( y − A µ ) T { W − σ + AW − β A T σ β } − ( y − A µ )= ( y − A µ ) T Σ − py ( y − A µ ) . (29)Actually, the term ( W − σ + AW − β A T σ β ) in the last line of (29) is exactly equal to the variance-covariance matrix of the random vector ( A β + ǫ ). It is also obvious that the mean of ( A β + ǫ ) is A µ .Inserting F ( y , β /σ , σ β ) of (24) and F y ( y /σ , σ β ) of (29) into m ( y /σ , σ β ) of (9) and after theintegration, we finally obtain the marginal distribution m ( y /σ , σ β ) of y as follows: m ( y /σ , σ β ) = 1(2 π ) n/ p det( Σ py ) exp (cid:26) −
12 ( y − A µ ) T Σ − py ( y − A µ ) (cid:27) ,,