Interpolating the Trace of the Inverse of Matrix $\mathbf{A} + t\mathbf{B}$

Siavash Ameli* and Shawn C. Shadden†
Mechanical Engineering, University of California, Berkeley, CA, USA 94720

*Email address: [email protected]
†Email address: [email protected]
Abstract
We develop heuristic interpolation methods for the function $t \mapsto \operatorname{trace}\big((\mathbf{A} + t\mathbf{B})^{-1}\big)$, where the matrices $\mathbf{A}$ and $\mathbf{B}$ are symmetric and positive definite and $t$ is a real variable. This function is featured in many applications in statistics, machine learning, and computational physics. The presented interpolation functions are based on the modification of a sharp upper bound that we derive for this function, which is a new trace inequality for matrices. We demonstrate the accuracy and performance of the proposed method with numerical examples, namely, the marginal maximum likelihood estimation for linear Gaussian process regression and the estimation of the regularization parameter of ridge regression with the generalized cross-validation method.

Keywords: parameter estimation, Gaussian process, generalized cross-validation, maximum likelihood method, inequality, harmonic mean
Mathematics Subject Classification (2020):
1 Introduction

Estimation of the trace of matrices is a key component, and often a computational challenge, in many algorithms in data analysis, statistics, machine learning, computational physics, and computational biology. Some applications of trace estimation can be found in Ubaru & Saad (2018). A few examples of such applications are high-performance uncertainty quantification (Bekas et al., 2012; Kalantzis et al., 2013), optimal Bayesian experimental design (Chaloner & Verdinelli, 1995), regression with Gaussian process (MacKay, 2003), rank estimation (Ubaru & Saad, 2016), and computing observables in lattice quantum chromodynamics (Wu et al., 2016).
In this paper, we are interested in estimating the trace of the matrix $(\mathbf{A} + t\mathbf{B})^{-1}$, where $\mathbf{A}$ and $\mathbf{B}$ are symmetric and positive definite, and $t$ is a real number. The function

$$ t \mapsto \operatorname{trace}\big((\mathbf{A} + t\mathbf{B})^{-1}\big) \tag{1} $$

is featured in a vast number of applications in statistics and machine learning. Often, in these applications, the goal is to optimize a problem for the parameter $t$, and (1) should be evaluated for a wide range of $t$ during the optimization process.

Throughout, we use boldface lowercase letters for vectors, boldface uppercase letters for matrices, and normal face letters for scalars, including the components of vectors and matrices, such as $x_i$ and $H_{ij}$, respectively, for the components of the vector $\mathbf{x}$ and the matrix $\mathbf{H}$.

A common example of such an application can be found in regularization techniques applied to inverse problems and supervised learning. For instance, in ridge regression by generalized cross-validation (Wahba, 1977; Craven & Wahba, 1978; Golub & von Matt, 1997), the optimal regularization parameter $t$ is sought by minimizing a function that involves (1) (see §4.2). Another example is matrices of the form $\mathbf{A} + t\mathbf{I}$ that appear frequently in Gaussian processes with additive noise (Ameli & Shadden, 2020) (see also §4.1), where the derivative of the log-determinant,

$$ \frac{\partial}{\partial t} \log\det(\mathbf{A} + t\mathbf{I}) = \operatorname{trace}\big((\mathbf{A} + t\mathbf{I})^{-1}\big), $$

frequently appears. Other examples of (1) can be found in the optimal design of experiments (Haber et al., 2008) and probabilistic principal component analysis (Bishop, 2006).

The difficulty of estimating (1) in all the above applications is that the matrices are generally large. Thus, the inverse of $\mathbf{A} + t\mathbf{B}$ is not available explicitly; rather, it is implicitly known through matrix-vector multiplications by solving a linear system. Because of this, the evaluation of (1) is usually the main computational challenge in these problems, and several algorithms have been developed to address it. The trace of the inverse of a symmetric and positive-definite matrix can be calculated by the Cholesky factorization (see, e.g., the simple formulation (17) in §4.1) at a cost of $\mathcal{O}(nw)$, where $w$ is the bandwidth of the Cholesky factor (see also Björck (1996)). Other approaches include stochastic trace estimators, such as Hutchinson's Monte-Carlo estimator (Girard, 1989; Hutchinson, 1990; Avron & Toledo, 2011; Saibaba et al., 2017), quadrature-based estimates and bounds (Golub & Strakoš, 1994; Bai et al., 1996; Bai & Golub, 1997; Golub & Meurant, 2009), probing methods (Tang & Saad, 2012; Stathopoulos et al., 2013), and methods based on sparse factorizations (Takahashi et al., 1973; Golub & Plemmons, 1980; Niessner & Reichert, 1983; Gibbs & MacKay, 1997; Tipping, 2001).

It is desirable in the above applications to have the estimation of (1) available inexpensively for a wide range of $t$. For instance, if the matrices $\mathbf{A}$ and $\mathbf{B}$ are small enough to pre-compute their eigenvalues, $\lambda_i$ and $\mu_i$, respectively, then the evaluation of (1) for any $t$ is immediate by

$$ \operatorname{trace}\big((\mathbf{A} + t\mathbf{B})^{-1}\big) = \sum_{i=1}^{n} \frac{1}{\lambda_i + t\mu_i}. $$

However, for large matrices, estimating all eigenvalues is impractical. Motivated by this objective, we developed an interpolation scheme for the function (1), which employs pre-computed function values at a few locations $t_i$ obtained with any of the methods mentioned above. Our contributions are as follows:

- We present a new sharp inequality for the trace of the inverse of the sum of two matrices that are symmetric and positive definite. For the matrices of the form in (1), the sharp inequality serves as a rough estimate of this function.
- We propose two interpolation methods based on the inequality that we find, namely, an interpolation function based on the linear combination of orthogonal basis functions, and interpolation by rational polynomials.

We demonstrate the computational advantage of our method through two examples, and we accomplish the following results:

- Gaussian process regression. We compute (1) in the context of marginal likelihood estimation of Gaussian process regression. We show that with only a few interpolant points, an accuracy of 0.01% can be achieved.
- Ridge regression. We estimate the regularization parameter of ridge regression with the generalized cross-validation method. We demonstrate that with only a few interpolation points, the ridge parameter can be estimated and the overall computational cost is reduced by two orders of magnitude.

The outline of the paper is as follows. In §2, we present a matrix trace inequality. In §3, we propose interpolation methods. In §4, we demonstrate the methods on numerical examples, and we conclude in §5.

2 A Matrix Trace Inequality

We will seek an interpolation of $t \mapsto \operatorname{trace}((\mathbf{A} + t\mathbf{B})^{-1})$ by modifying a sharp bound to this function. In this section, particularly in Theorem 1, we find bounds to $\operatorname{trace}((\mathbf{A} \pm \mathbf{B})^{-1})$, where, without loss of generality, we omit the parameter $t$ momentarily. In §3, we will retrieve the desired relations for our purpose by replacing $\mathbf{B}$ with $t\mathbf{B}$.

Theorem 1 (A matrix trace inequality). Let $\mathbf{A}, \mathbf{B} \in \mathbb{C}^{n \times n}$ be Hermitian and positive-definite matrices. Then

$$ \frac{1}{\operatorname{trace}\big((\mathbf{A} + \mathbf{B})^{-1}\big)} \geq \frac{1}{\operatorname{trace}(\mathbf{A}^{-1})} + \frac{1}{\operatorname{trace}(\mathbf{B}^{-1})}. \tag{2a} $$

Furthermore, let $\lambda_i$ and $\mu_i$, $i = 1, \dots, n$, denote the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$, respectively. If $\lambda_i > \mu_i$ for all $i$, then

$$ \frac{1}{\operatorname{trace}\big((\mathbf{A} - \mathbf{B})^{-1}\big)} \leq \frac{1}{\operatorname{trace}(\mathbf{A}^{-1})} - \frac{1}{\operatorname{trace}(\mathbf{B}^{-1})}. \tag{2b} $$

In both (2a) and (2b), the equality holds if and only if $\mathbf{A}$ and $\mathbf{B}$ are similar (conjugate) matrices modulo a positive scalar multiplication.
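As a quick numerical illustration (our own sketch, not part of the paper's experiments), the inequality (2a) can be spot-checked with NumPy on random symmetric positive-definite matrices; the matrix size and seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

def random_spd(rng, n):
    # A random symmetric positive-definite matrix.
    Q = rng.standard_normal((n, n))
    return Q @ Q.T + n * np.eye(n)

A, B = random_spd(rng, n), random_spd(rng, n)

lhs = 1.0 / np.trace(np.linalg.inv(A + B))
rhs = 1.0 / np.trace(np.linalg.inv(A)) + 1.0 / np.trace(np.linalg.inv(B))
assert lhs >= rhs  # (2a); equality requires B to be a positive multiple of A
print(lhs, rhs)
```

Running this with different seeds shows the gap between the two sides closing as `B` approaches a scalar multiple of `A`, consistent with the equality condition of the theorem.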
We prove the above theorem as follows.

Definition 2.1 (Harmonic mean). The smooth function $H \in C^{\infty}(\mathbb{R}_{+}^{n}; \mathbb{R}_{+})$ defined by

$$ H(\mathbf{x}) := \frac{n}{\displaystyle\sum_{i=1}^{n} \frac{1}{x_i}}, \tag{3} $$

where $\mathbf{x} := (x_1, \dots, x_n) \in \mathbb{R}_{+}^{n}$, is the harmonic mean of the positive numbers $x_1, \dots, x_n$ (Bullen, 2013, Ch. II).

In Lemma 2, we show that $H$ is a concave function, which leads to Proposition 3, where we show $-H$ is a sublinear function.

Lemma 2. $H$ is a concave function.

Proof. We show that the Hessian $\mathbf{H}$ of the function $H$ is negative-semidefinite. The component $H_{ij}$ of the Hessian matrix $\mathbf{H}$ is

$$ H_{ij} := \frac{\partial^2 H}{\partial x_i \partial x_j} = -\frac{2n}{x_i^2} \left( \frac{\delta_{ij}}{x_i} \sum_{k=1}^{n} \frac{1}{x_k} - \frac{1}{x_j^2} \right) \left( \sum_{k=1}^{n} \frac{1}{x_k} \right)^{-3}, $$

where $\delta_{ij}$ is the Kronecker delta. The matrix $\mathbf{H}$ is negative-semidefinite if and only if $\mathbf{w}^{\intercal}\mathbf{H}\mathbf{w} \leq 0$ for all $\mathbf{w} := (w_1, \dots, w_n)$. The latter condition is equivalent to

$$ \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{w_i w_j}{x_i^2} \left( \frac{\delta_{ij}}{x_i} \sum_{k=1}^{n} \frac{1}{x_k} - \frac{1}{x_j^2} \right) \geq 0, $$

which simplifies to

$$ \left( \sum_{j=1}^{n} \frac{w_j}{x_j^2} \right)^{2} \leq \left( \sum_{k=1}^{n} \frac{1}{x_k} \right) \left( \sum_{i=1}^{n} \frac{w_i^2}{x_i^3} \right). \tag{4} $$

The relation (4) holds, since by the Cauchy-Schwarz inequality we have

$$ \left( \sum_{j=1}^{n} \frac{1}{\sqrt{x_j}} \cdot \frac{w_j}{x_j \sqrt{x_j}} \right)^{2} \leq \left( \sum_{k=1}^{n} \left( \frac{1}{\sqrt{x_k}} \right)^{2} \right) \left( \sum_{i=1}^{n} \left( \frac{w_i}{x_i \sqrt{x_i}} \right)^{2} \right). $$

Thus, $\mathbf{H}$ is negative-semidefinite and $H$ is a concave function. ∎

Proposition 3. Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}_{+}^{n}$. It holds that

$$ H(\mathbf{x} + \mathbf{y}) \geq H(\mathbf{x}) + H(\mathbf{y}). \tag{5a} $$

Furthermore, suppose $x_i > y_i$ for all $i = 1, \dots, n$. Then

$$ H(\mathbf{x} - \mathbf{y}) \leq H(\mathbf{x}) - H(\mathbf{y}). \tag{5b} $$

In both (5a) and (5b), the equality holds if and only if $\mathbf{x} = c\mathbf{y}$ where $c > 0$.

Proof. Since by Lemma 2 the function $H$ is concave, Jensen's inequality (see, e.g., Hardy et al. (1952)) gives

$$ H\!\left( \frac{\mathbf{x} + \mathbf{y}}{2} \right) \geq \frac{H(\mathbf{x}) + H(\mathbf{y})}{2}, $$

which, together with the homogeneity $H(c\mathbf{x}) = cH(\mathbf{x})$, concludes (5a). Jensen's inequality becomes an equality if $\mathbf{x} = \mathbf{y}$; but since $H(c\mathbf{x}) = cH(\mathbf{x})$, the equality criterion extends to $\mathbf{x} = c\mathbf{y}$ for some positive constant $c$. To show (5b), write $H(\mathbf{x}) = H(\mathbf{y} + (\mathbf{x} - \mathbf{y}))$. If $\mathbf{x} - \mathbf{y} \in \mathbb{R}_{+}^{n}$, then by (5a) we have $H(\mathbf{y} + (\mathbf{x} - \mathbf{y})) \geq H(\mathbf{y}) + H(\mathbf{x} - \mathbf{y})$, which concludes (5b). ∎

Remark. The relation (5a) in the expanded form is

$$ \frac{n}{\dfrac{1}{x_1 + y_1} + \cdots + \dfrac{1}{x_n + y_n}} \geq \frac{n}{\dfrac{1}{x_1} + \cdots + \dfrac{1}{x_n}} + \frac{n}{\dfrac{1}{y_1} + \cdots + \dfrac{1}{y_n}}, $$

which appears to be an original inequality, even though many inequalities on the harmonic mean have been noted in the literature; see Hardy et al. (1952), Mitrinović & Vasić (1970), Mitrinović et al. (1992) and references therein, (Bullen, 1998, p. 117), (Herman et al., 2000, p. 156 and 167), and (Bullen, 2013, Ch. II). A simpler form of (5a) for $y_1 = \cdots = y_n = c$ has been given in Hoehn & Niven (1985, Equation 8).
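A one-line check of (5a) and of the homogeneity used in the proof (again our own illustration, with arbitrary random positive vectors):

```python
import numpy as np

def H(x):
    # Harmonic mean (3) of a vector of positive numbers.
    return len(x) / np.sum(1.0 / x)

rng = np.random.default_rng(42)
x = rng.uniform(0.1, 10.0, 1000)
y = rng.uniform(0.1, 10.0, 1000)
assert H(x + y) >= H(x) + H(y)             # superadditivity (5a)
assert np.isclose(H(3.0 * x), 3.0 * H(x))  # homogeneity H(cx) = cH(x)
```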
Based on the above inequality, we prove Theorem 1 as follows.

Proof of Theorem 1.
Let $\boldsymbol{\lambda} := (\lambda_1, \dots, \lambda_n)$ and $\boldsymbol{\mu} := (\mu_1, \dots, \mu_n)$ be the tuples of the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$, respectively. Since $\mathbf{A}$ and $\mathbf{B}$ are Hermitian and positive definite, we have $\boldsymbol{\lambda}, \boldsymbol{\mu} \in \mathbb{R}_{+}^{n}$. Also,

$$ H(\boldsymbol{\lambda}) = \frac{n}{\operatorname{trace}(\mathbf{A}^{-1})}, \quad H(\boldsymbol{\mu}) = \frac{n}{\operatorname{trace}(\mathbf{B}^{-1})}, \quad \text{and} \quad H(\boldsymbol{\lambda} \pm \boldsymbol{\mu}) = \frac{n}{\operatorname{trace}\big((\mathbf{A} \pm \mathbf{B})^{-1}\big)}. $$

Applying Proposition 3 to the above concludes (2a) and (2b). Also, equality holds if and only if $\boldsymbol{\lambda} = c\boldsymbol{\mu}$ for some positive constant $c$, that is, when $\mathbf{A}$ and $\mathbf{B}$ are similar matrices up to a constant multiplication. ∎

Remark 2.2. The relation (2a) is an upper bound to $\operatorname{trace}((\mathbf{A} + \mathbf{B})^{-1})$. For completeness, we also mention a known lower bound for this trace. By the arithmetic-harmonic mean inequality, we have $H(\boldsymbol{\lambda} \pm \boldsymbol{\mu}) \leq A(\boldsymbol{\lambda} \pm \boldsymbol{\mu})$, where $A$ is the arithmetic mean function of a tuple (Mitrinović & Vasić, 1970, Ch. 2, Theorem 1). Also, $A(\boldsymbol{\lambda} \pm \boldsymbol{\mu}) = \operatorname{trace}(\mathbf{A} \pm \mathbf{B})/n$. Combining both relations implies a lower bound on $\operatorname{trace}((\mathbf{A} \pm \mathbf{B})^{-1})$ by

$$ \operatorname{trace}\big((\mathbf{A} \pm \mathbf{B})^{-1}\big) \geq \frac{n^2}{\operatorname{trace}(\mathbf{A}) \pm \operatorname{trace}(\mathbf{B})}. \tag{6} $$

The equality in the above holds only if $\mathbf{A} \pm \mathbf{B} = c\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. The above inequality, however, is not as useful as the inequalities we found in Theorem 1: if $\mathbf{B}$ is either too small or too large compared to $\mathbf{A}$, (6) does not asymptote to equality, whereas (2a) and (2b) become asymptotic equalities, which is a desired property for our purpose.

3 Interpolation Methods

In this section, we present interpolations of the trace of the matrix $(\mathbf{A} + t\mathbf{B})^{-1}$. To this end, we replace the matrix $\mathbf{B}$ with $t\mathbf{B}$ in the bounds found in §2. Define

$$ \tau(t) := \frac{\operatorname{trace}\big((\mathbf{A} + t\mathbf{B})^{-1}\big)}{\operatorname{trace}(\mathbf{B}^{-1})}, \quad \text{and} \quad \tau_0 := \tau(0). $$

We assume $\tau_0$ is known by computing $\operatorname{trace}(\mathbf{A}^{-1})$ and $\operatorname{trace}(\mathbf{B}^{-1})$ directly. Then, (2a) reads as

$$ \frac{1}{\tau(t)} \geq \frac{1}{\tau_0} + t, \qquad t \in [0, \infty), \tag{7a} $$

and (2b) becomes

$$ \frac{1}{\tau(t)} \leq \frac{1}{\tau_0} + t, \qquad t \in (t_{\min}, 0], \tag{7b} $$

where $t_{\min} := -\min_i(\lambda_i/\mu_i)$, and $\lambda_i$ and $\mu_i$ are the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$, respectively. The above sharp inequalities become equalities at $t = 0$; also, (7a) becomes an asymptotic equality as $t \to \infty$. Based on (7a) and (7b), the bound function

$$ \hat{\tau}(t) := \frac{\tau_0}{1 + t\tau_0} \tag{8} $$

can be regarded as a reasonable approximation of $\tau(t)$ at $|t\tau_0| \ll 1$, where $\tau(t) \approx \tau_0$, and at $t\tau_0 \gg 1$, where $\tau(t) \approx t^{-1}$. We expect $\hat{\tau}(t)$ to deviate from $\tau(t)$ the most roughly at $t\tau_0 = \mathcal{O}(1)$.
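The following NumPy sketch (ours; the test matrices and seed are arbitrary) checks the sharp bound (7a) and evaluates the relative gap of the bound function (8) against the exact $\tau(t)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
Q1, Q2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A = Q1 @ Q1.T + n * np.eye(n)   # symmetric positive definite
B = Q2 @ Q2.T + n * np.eye(n)

def tau(t):
    # tau(t) = trace((A + tB)^{-1}) / trace(B^{-1})
    return np.trace(np.linalg.inv(A + t * B)) / np.trace(np.linalg.inv(B))

tau0 = tau(0.0)
for t in np.logspace(-3, 3, 13):
    assert 1.0 / tau(t) >= 1.0 / tau0 + t - 1e-9    # sharp bound (7a)
    tau_hat = tau0 / (1.0 + t * tau0)               # bound function (8)
    print(f"t = {t:9.3e}   relative gap = {(tau_hat - tau(t)) / tau(t):.3e}")
```

The printed gap is largest near $t\tau_0 = \mathcal{O}(1)$ and vanishes at both ends, as the text describes.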
To improve the approximation in $t\tau_0 \in (c, c^{-1})$ for some $c \ll 1$, we define interpolating functions based on the above bounds that honor the known function values at some intermediate points $t_i \in (c\tau_0^{-1}, c^{-1}\tau_0^{-1})$. In particular, we specify interpolant points on logarithmically spaced intervals, because $t$ is usually varied over a wide logarithmic range in most applications. We compute the function values at the interpolant points, $\tau(t_i)$, with any of the trace estimation methods mentioned earlier.

Many types of interpolating functions could be employed to improve the above approximation. However, we seek interpolating functions whose parameters can be easily obtained by solving a linear system of equations. In this section, we define two such interpolations, namely, by a linear combination of basis functions and by rational polynomials, respectively in §3.1 and §3.2.

3.1 Interpolation with a Linear Combination of Basis Functions

Based on (7a), we define an interpolating function $\tilde{\tau}(t)$ by

$$ \frac{1}{\tilde{\tau}(t)} := \frac{1}{\tau_0} + \sum_{i=0}^{p} w_i \phi_i(t), \tag{9} $$

where $\phi_i$ are basis functions with the weights $w_i$. To maintain the integrity of the formulation in (7a), we impose $\phi_0(t) = t$ and $w_0 = 1$. The rest of the functions, i.e., $i = 1, \dots, p$, improve the approximation while satisfying the boundary conditions $\phi_i(0) = 0$ and $\phi_i(t) \ll \phi_0(t)$ at $t \gg 1$, so that $1/\tilde{\tau}(t) \to \phi_0(t) = t$ for large $t$. The coefficients $w_i$, $i = 1, \dots, p$, are found by solving a linear system of $p$ equations using the a priori known values $\tau_i := \tau(t_i)$, $i = 1, \dots, p$. When $p = 0$, no intermediate interpolation point is introduced and the approximation function is the same as the upper bound $\hat{\tau}(t)$ given by (8). To satisfy the criteria of the basis functions, we propose

$$ \phi_i(t) = t^{\frac{1}{i+1}}, \qquad i = 0, \dots, p. \tag{10} $$

The above set of functions satisfies $\phi_0(t) = t$, $\phi_i(0) = 0$, and $\phi_0(t) \gg \phi_i(t)$, $i > 0$, at $t \gg 1$.

Remark. An alternative interpolation function can, for instance, be defined using the monomials $t^i$, leading to the form

$$ \frac{1}{(\tilde{\tau}(t))^{p+1}} := \frac{1}{\tau_0^{p+1}} + \sum_{i=1}^{p+1} w_i t^i, \tag{11} $$

with $w_{p+1} = 1$, and the rest of the weights $w_i$, $i = 1, \dots, p$, to be found from the known values of the function. Unlike the interpolation function in (9), the interpolation of the form (11) is not useful in practice for two reasons. Firstly, the monomial terms $t^i$ in (11) vary over many orders of magnitude, leading to an ill-conditioned system of equations for the weights $w_i$. Secondly, Runge's phenomenon occurs even at low-order interpolations $p > 1$; that is, the interpolated curve develops spurious oscillations between the interpolant points $t_i$.

In practice, only a few interpolating points $t_i$ are sufficient to obtain a reasonable interpolation of $\tau(t)$. However, when more interpolation points are used, the linear system of equations for the weights $w_i$ becomes ill-conditioned. To overcome this issue, orthogonal basis functions can be used (see, e.g., Seber & Lee (2012)). We define functions $\phi_i^{\perp}(t)$ that are orthogonal on the unit interval $t \in [0, 1]$. Instead of the Lebesgue measure in $t$, we use the Haar measure $\mathrm{d}\log(t) = \mathrm{d}t/t$ to define the inner product in the space of functions. The applicability of the Haar measure can be justified by letting $t_i = e^{x_i}$, where $x_i$ are normally spaced interpolant points; following the discussion of Seber & Lee (2012), the measure $\mathrm{d}x$ then defines the inner product of functions, and $\mathrm{d}x$ is equivalent to the Haar measure $\mathrm{d}\log t$ for the variable $t$.

The desired orthogonality condition in the Hilbert space of functions on $[0, 1]$ with respect to the Haar measure becomes

$$ \langle \phi_i^{\perp}, \phi_j^{\perp} \rangle_{L^2([0,1],\, \mathrm{d}t/t)} = \int_0^1 \phi_i^{\perp}(t)\, \phi_j^{\perp}(t)\, \frac{\mathrm{d}t}{t} = \delta_{ij}, \tag{12} $$

where $\delta_{ij}$ is the Kronecker delta. The set of orthogonal functions $\phi_i^{\perp}(t)$ can be represented by linear combinations of the non-orthogonal basis functions in (10) by

$$ \phi_i^{\perp}(t) = \alpha_i \sum_{j=1}^{p} a_{ij} \phi_j(t), \qquad i = 1, \dots, p. \tag{13} $$

Using the recursive application of the Gram-Schmidt orthogonalization process on the set $\{\phi_i\}_{i=1}^{p}$ (note, $\phi_0$ is excluded), we computed the first nine orthogonal basis functions $\{\phi_i^{\perp}\}_{i=1}^{p}$. These functions are shown in Figure 1, and their coefficients $\alpha_i$ and $a_{ij}$, corresponding to (13), are given in Table 1.

Figure 1: Orthogonal functions $\phi_i^{\perp}(t)$, $i = 1, \dots, 9$, in the logarithmic scale of $t$.
Table 1: Coefficients $\alpha_i$ and $a_{ij}$, $i, j = 1, \dots, 9$, of the orthogonal functions in (13).
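The coefficients in Table 1 can be regenerated symbolically (see also the repository referenced in the footnote below). The following SymPy sketch is our own minimal rendition of the Gram-Schmidt procedure under the Haar-measure inner product (12) applied to the basis (10); the resulting functions may differ from the table's entries by normalization or sign:

```python
import sympy as sp

t = sp.symbols('t', positive=True)

def inner(f, g):
    # Inner product with the Haar measure dt/t on [0, 1], cf. (12).
    return sp.integrate(f * g / t, (t, 0, 1))

p = 4
basis = [t**sp.Rational(1, j + 1) for j in range(1, p + 1)]  # phi_j, j >= 1

ortho = []
for phi in basis:
    # Subtract projections onto the previously orthonormalized functions.
    psi = phi - sum(inner(phi, q) * q for q in ortho)
    ortho.append(sp.simplify(psi / sp.sqrt(inner(psi, psi))))

for i, q in enumerate(ortho, start=1):
    print(i, q)
```

Note that the integrals converge despite the $1/t$ weight, since $\phi_j(t)\phi_k(t)/t = t^{1/(j+1)+1/(k+1)-1}$ has an exponent greater than $-1$.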
The set of orthogonal functions can also be defined on intervals other than $[0, 1]$ by adjusting the integral bounds in the orthogonality condition (12), which yields a different set of function coefficients. However, it is more convenient to fix the domain of the orthogonal functions to the unit interval $[0, 1]$ and scale the interpolant points to $[0, l]$, where $l := \max(t_i)$. Although the latter approach does not lead to functions that are orthogonal on $[0, l]$, it equivalently produces a well-conditioned system of equations for the weights $w_i$.¹

Remark. The interpolation function defined in (9) asymptotes consistently to $\tilde{\tau}(t) \to t^{-1}$ at $t\tau_0 \gg 1$. On the other end, the convergence $\tilde{\tau}(t) \to \tau_0$ at $t\tau_0 \ll 1$ is less well behaved, owing to the basis functions $\phi_i$, $i > 0$, that are not independent at $t\tau_0 \ll 1$, particularly near the origin. This dependency of the basis functions cannot be resolved by the orthogonalized functions $\phi_i^{\perp}$, as they are orthogonal with respect to the weight function $t^{-1}$, which is singular at the origin. In conclusion, (9) should not be employed on very small logarithmic scales; rather, other interpolation functions should be employed for such purpose.

¹To generate an arbitrary number of orthogonal functions $\phi_j^{\perp}(t)$ in Table 1, a simple computer algebra program using Python's SymPy package can be found at https://github.com/ameli/Orthogonal-Functions.
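For completeness, here is a sketch (ours) of solving the $p \times p$ linear system for the weights $w_1, \dots, w_p$ in (9); for simplicity it uses the non-orthogonalized basis (10), which is adequate for small $p$:

```python
import numpy as np

def phi(i, t):
    """Basis functions (10): phi_i(t) = t**(1/(i+1)); in particular phi_0(t) = t."""
    return t**(1.0 / (i + 1))

def fit_weights(t_pts, tau_pts, tau0):
    """Solve for w_1..w_p in (9) from p interpolant points (t_i, tau(t_i))."""
    p = len(t_pts)
    M = np.array([[phi(i, tj) for i in range(1, p + 1)] for tj in t_pts])
    rhs = np.array([1.0 / tau_j - 1.0 / tau0 - t_j
                    for t_j, tau_j in zip(t_pts, tau_pts)])
    return np.linalg.solve(M, rhs)

def tau_tilde(t, w, tau0):
    """Evaluate the interpolant (9), with w_0 = 1 and phi_0(t) = t built in."""
    s = 1.0 / tau0 + t + sum(w_i * phi(i + 1, t) for i, w_i in enumerate(w))
    return 1.0 / s
```

For larger $p$, the matrix `M` should be assembled from the orthogonalized functions $\phi_i^{\perp}$ of (13) instead, for the conditioning reasons discussed above.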
3.2 Interpolation with Rational Polynomials

We define another type of interpolating function, which can perform well at small scales of $t$, by using rational polynomials. Define

$$ \tilde{\tau}(t) := \frac{t^p + a_{p-1}t^{p-1} + \cdots + a_1 t + a_0}{t^{p+1} + b_p t^p + \cdots + b_1 t + b_0}, \tag{14} $$

where $a_0 = b_0 \tau_0$ to satisfy $\tilde{\tau}(0) = \tau_0$. We note that the above interpolation also satisfies the asymptotic equality $\tilde{\tau}(t) \to t^{-1}$ as $t \to \infty$. At $p = 0$, where no interpolant point is used, the above interpolation function falls back to the bound given in (2a) by setting $b_0 = \tau_0^{-1}$. For $p > 0$, $2p$ interpolant points $t_i$ are needed to solve the linear system of equations for the coefficients $a_1, \dots, a_{p-1}$ and $b_0, \dots, b_p$.

The advantage of the rational polynomial interpolation is that, at $t\tau_0 \ll 1$, the relation (14) converges to $\tau_0$ without undesirable oscillations. On the other hand, a disadvantage of the rational polynomial is the possible singularity at the poles of (14). In practice, however, this issue can be resolved simply by a slight adjustment of the locations of the interpolant points $t_i$ to move the poles of the rational polynomial away from the domain of interest. Another possible issue of the rational polynomial emerges when many interpolant points, e.g., $p > 6$, are used, in which case the linear system of equations for the coefficients becomes ill-conditioned. In practice, however, only a small number of interpolant points is needed to obtain an accurate interpolation.
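A minimal sketch (ours) of forming and solving the $2p \times 2p$ linear system for the coefficients of (14), obtained by multiplying through by the denominator at each interpolant point and eliminating $a_0 = b_0\tau_0$:

```python
import numpy as np

def fit_rational(t_pts, tau_pts, tau0, p):
    """Fit the coefficients of the rational interpolant (14) from 2p samples.
    Unknowns: a_1..a_{p-1} and b_0..b_p, with a_0 = b_0 * tau0 eliminated."""
    t = np.asarray(t_pts, dtype=float)
    tau = np.asarray(tau_pts, dtype=float)
    assert t.size == 2 * p
    M = np.zeros((2 * p, 2 * p))
    for j in range(1, p):                 # columns for a_1..a_{p-1}
        M[:, j - 1] = t**j
    M[:, p - 1] = tau0 - tau              # column for b_0 (a_0 folded in)
    for j in range(1, p + 1):             # columns for b_1..b_p
        M[:, p - 1 + j] = -tau * t**j
    rhs = tau * t**(p + 1) - t**p
    c = np.linalg.solve(M, rhs)
    a = np.concatenate([[c[p - 1] * tau0], c[:p - 1]])   # a_0..a_{p-1}
    b = c[p - 1:]                                        # b_0..b_p
    return a, b

def tau_tilde(t, a, b, p):
    """Evaluate the rational interpolant (14)."""
    num = t**p + sum(a[j] * t**j for j in range(p))
    den = t**(p + 1) + sum(b[j] * t**j for j in range(p + 1))
    return num / den
```

For $p = 1$, two sample points suffice; for $p = 2$, four, matching the examples in §4.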
4 Numerical Examples

We provide two examples in the following sections: the first, in §4.1, arises in Gaussian process regression, and the second, in §4.2, in the generalized cross-validation of ridge regression.

4.1 Gaussian Process Regression

In our first example, we generate a full-rank correlation matrix from a spatially correlated set of points $\mathbf{x} \in \mathcal{D}$ in the domain $\mathcal{D} = [0, 1]^2$. To define a spatial correlation function, we use the isotropic exponential decay kernel given by

$$ K(\mathbf{x}, \mathbf{x}' \mid \rho) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2}{\rho} \right), \tag{15} $$

where $\rho$ is the decorrelation scale of the kernel, and we set $\rho = 0.1$. The above exponential decay kernel represents an Ornstein-Uhlenbeck random process, which is a Gaussian and zeroth-order Markov process (Rasmussen & Williams, 2006, p. 85). To produce discrete data, we sample $n = 50$ points from $\mathcal{D}$, which yields the symmetric and positive-definite correlation matrix $\mathbf{K}$ with the components $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j \mid \rho)$. We aim to interpolate the function

$$ \tau(t) := \frac{1}{n}\operatorname{trace}\big((\mathbf{K} + t\mathbf{I})^{-1}\big), \tag{16} $$

which appears in many statistical applications. For instance, Ameli & Shadden (2020) developed a computational procedure to efficiently estimate noise in Gaussian process regression based on maximizing the marginal likelihood function. The function $\tau(t)$ given in (16) is a computationally expensive term in the derivatives of the marginal likelihood function in that reference; we refer the reader to (Ameli & Shadden, 2020, Theorem 6) for details.

We compute $\tau(t)$ (either at the interpolant points $t_i$, or at all points $t$ for the purpose of benchmark comparison) as follows. We compute the Cholesky factorization $\mathbf{K}_t = \mathbf{L}\mathbf{L}^{\intercal}$ of $\mathbf{K}_t := \mathbf{K} + t\mathbf{I}$, where $\mathbf{L}$ is lower triangular. Then

$$ \operatorname{trace}\big(\mathbf{K}_t^{-1}\big) = \operatorname{trace}(\mathbf{L}^{-\intercal}\mathbf{L}^{-1}) = \operatorname{trace}(\mathbf{L}^{-1}\mathbf{L}^{-\intercal}) = \|\mathbf{L}^{-1}\|_F^2, \tag{17} $$

where $\|\cdot\|_F$ is the Frobenius norm; the second equality applies the cyclic property of the trace operator. A simple method to compute $\|\mathbf{L}^{-1}\|_F^2$ without storing $\mathbf{L}^{-1}$ is to solve the lower triangular system $\mathbf{L}\mathbf{x}_i = \mathbf{e}_i$ for $\mathbf{x}_i$, $i = 1, \dots, n$, where $\mathbf{e}_i = (0, \dots, 0, 1, 0, \dots, 0)^{\intercal}$ is a column vector of zeros except for its $i$th entry, which is one. The solution vector $\mathbf{x}_i$ is the $i$th column of $\mathbf{L}^{-1}$; thus, $\|\mathbf{L}^{-1}\|_F^2 = \sum_{i=1}^{n} \|\mathbf{x}_i\|^2$. This method is memory efficient since the vectors $\mathbf{x}_i$ do not need to be stored.

Remark. There exist efficient methods to compute the Cholesky factorization of sparse matrices (see, e.g., Davis (2006, Ch. 4)). Also, the inverse of the sparse triangular matrix $\mathbf{L}$ can be computed at $\mathcal{O}(n)$ complexity (Stewart, 1998, pp. 93-95), and a linear system with both a sparse kernel $\mathbf{L}$ and a sparse right-hand side $\mathbf{e}_i$ can be solved efficiently (see Davis (2006)).

The exact value of $\tau(t)$ is computed directly using the Cholesky factorization described by (17) and is shown in Figure 2(a) by the solid black curve (overlaid by the red curve) over a wide logarithmic range of $t$. Also, the dashed black curve in Figure 2(a) is the upper bound $\hat{\tau}(t)$ given by (8), which can be thought of as the estimation with zero interpolant points, i.e., $p = 0$. For completeness, we have also shown the lower bound of $\tau(t)$ by the black dash-dot line, given by

$$ \check{\tau}(t) := \frac{1}{1 + t} \leq \tau(t). \tag{18} $$

The above lower bound can be obtained from Remark 2.2 and the fact that $\operatorname{trace}(\mathbf{K}) = n$, since the diagonal entries of the correlation matrix are all 1. We recall that, unlike the upper bound we found in (2a), the lower bound (18) is not useful for approximation, as it does not asymptote to $\tau(t)$ at small $t$. However, both the lower and upper bounds consistently asymptote to $t^{-1}$ at large $t$.

To estimate $\tau$, we used the interpolation function in (9) with the set of orthonormal basis functions in Table 1. The colored solid lines in Figure 2(a) are the interpolations $\tilde{\tau}(t)$ with $p = 1, 3, 5, 7, 9$ interpolant points $t_i$, spanning a wide logarithmic range. In practice, fewer interpolant points in a smaller range are sufficient to effectively interpolate $\tau$. It can be seen from the embedded diagram in Figure 2(a) that $\tilde{\tau}(t)$ with only a few interpolant points is remarkably close to the true function value.
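A compact sketch (ours; the random sampling and seed are illustrative) of assembling the correlation matrix from (15) and evaluating $\tau(t)$ of (16) via the Cholesky identity (17):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

# Sample points in D = [0, 1]^2 and build the exponential-decay
# correlation matrix of (15); rho = 0.1 as in the text.
n, rho = 50, 0.1
points = rng.random((n, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
K = np.exp(-dist / rho)

def tau(t):
    """tau(t) = (1/n) trace((K + t I)^{-1}) via the Cholesky method (17)."""
    L = cholesky(K + t * np.eye(n), lower=True)
    total = 0.0
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        x = solve_triangular(L, e, lower=True)  # x is the i-th column of L^{-1}
        total += x @ x                          # accumulate ||x_i||_2^2
    return total / n

print(tau(0.0), tau(1.0))
```

The column-by-column triangular solves mirror the memory-efficient scheme described above; only one column of $\mathbf{L}^{-1}$ is held in memory at a time.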
Figure 2: (a) Comparison of the exact function $\tau(t)$, the upper bound $\hat{\tau}(t)$, the lower bound $\check{\tau}(t)$, and the interpolations $\tilde{\tau}(t)$ for $p = 1, 3, 5, 7, 9$ interpolant points. The black, green, and red curves are almost indistinguishable in the diagram. (b) The relative error of the estimations and of the upper bound. The green and red curves have relative errors of less than 0.02% and 0.01%, respectively.

To better compare the exact and the interpolated functions, we show the relative error of the interpolations in Figure 2(b). The relative error of the upper bound (dashed curve) rapidly vanishes at both ends, namely, at $t\tau_0 \ll 1$ and $t\tau_0 \gg 1$, where $\tau_0 = 6.33$. The absolute error of the upper bound is highest at $t\tau_0 = \mathcal{O}(1)$, or $t \approx \tau_0^{-1} = 0.16$; the peak of the relative error of the upper bound is slightly to the right of the peak of the absolute error, which can be verified in Figure 2(b). For an effective interpolation, we distribute the interpolant points $t_i$ almost evenly, on a logarithmic scale, around $t \approx \tau_0^{-1}$, where the upper bound has the highest error.

The blue curve in Figure 2(b) corresponds to the case with one interpolant point, with a relative error of less than 3% everywhere. The red curve, with nine logarithmically spaced interpolant points, shows a relative error of less than 0.01%. Moreover, $\tau(t)$ decays by orders of magnitude at large $t$, making the absolute error significantly small at $t \gg \tau_0^{-1}$.

4.2 Ridge Regression

In the second example, we calculate the optimal regularization parameter of a linear ridge regression model using generalized cross-validation (GCV). Consider the linear model $\mathbf{z} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta}$, where $\mathbf{z} \in \mathbb{R}^n$ is a column vector of given data, $\mathbf{X} \in \mathbb{R}^{n \times m}$ is the known design matrix representing $m$ basis functions, $\boldsymbol{\beta} \in \mathbb{R}^m$ is the vector of unknown coefficients of the linear model, and $\boldsymbol{\delta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ is the residual error of the model, a zero-mean Gaussian random vector with unknown variance $\sigma^2$. An ordinary least-squares solution to this problem minimizes $\|\mathbf{z} - \mathbf{X}\boldsymbol{\beta}\|_2^2$, yielding the estimate $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{z}$ (Seber & Lee, 2012, p. 37). When $\mathbf{X}$ is not full rank, the least-squares problem is not well-conditioned. A resolution of ill-conditioned problems is the ridge (Tikhonov) regularization, where the function $\|\mathbf{z} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + n\theta\|\boldsymbol{\beta}\|_2^2$ is minimized instead (Seber & Lee, 2012). The regularization parameter $\theta$ plays a crucial role in balancing the residual error against the penalty term. The generalized cross-validation of Wahba (1977), Craven & Wahba (1978), and Golub et al. (1979) is one of the popular methods to seek an optimal regularization parameter without the necessity of estimating the error variance $\sigma^2$. In this method, the regularization parameter is sought as the minimizer of

$$ V(\theta) := \frac{\frac{1}{n}\left\| \left( \mathbf{I} - \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X} + n\theta\mathbf{I})^{-1}\mathbf{X}^{\intercal} \right)\mathbf{z} \right\|_2^2}{\left( \frac{1}{n}\operatorname{trace}\left( \mathbf{I} - \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X} + n\theta\mathbf{I})^{-1}\mathbf{X}^{\intercal} \right) \right)^2}. \tag{19} $$

For large matrices, a major difficulty is computing the trace in the denominator of (19), and several methods have been developed to address this problem (Golub & von Matt, 1997; Lukas et al., 2010). Using the presented interpolation method, we aim to estimate the trace in the denominator of (19), which can also be written as

$$ \operatorname{trace}\left( \mathbf{I} - \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X} + n\theta\mathbf{I})^{-1}\mathbf{X}^{\intercal} \right) = n - m + n\theta \operatorname{trace}\left( (\mathbf{X}^{\intercal}\mathbf{X} + n\theta\mathbf{I})^{-1} \right), $$

provided that $n > m$ (see Golub & von Matt (1997, p. 4)). To compute the above term, we define

$$ \tau(t) := \frac{1}{m}\operatorname{trace}\big((\mathbf{A} + t\mathbf{I})^{-1}\big), \tag{20} $$

where $\mathbf{A} = \mathbf{X}^{\intercal}\mathbf{X} + s\mathbf{I}$ and $t := n\theta - s$. We note that the size of $\mathbf{A}$ and $\mathbf{I}$ here is $m \times m$. The purpose of the fixed shift parameter $s \ll 1$ is to make $\mathbf{A}$ non-singular. The shift is necessary since, without it, (20) is undefined at $t = 0$ and we cannot compute $\tau_0 = \frac{1}{m}\operatorname{trace}(\mathbf{A}^{-1})$. Also, the shift can improve the interpolation by relocating the origin of $t$ to the vicinity of the interval where we are interested in computing $V(\theta)$.

In our example, we create a singular matrix $\mathbf{X}$ by the singular value decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\intercal}$ as follows.
The orthogonal matrices $\mathbf{U} \in \mathbb{R}^{n \times n}$ and $\mathbf{V} \in \mathbb{R}^{m \times m}$ are produced by the Householder matrices

$$ \mathbf{U} := \mathbf{I} - 2\frac{\mathbf{u}\mathbf{u}^{\intercal}}{\|\mathbf{u}\|_2^2}, \quad \text{and} \quad \mathbf{V} := \mathbf{I} - 2\frac{\mathbf{v}\mathbf{v}^{\intercal}}{\|\mathbf{v}\|_2^2}, $$

where $\mathbf{u} \in \mathbb{R}^n$ and $\mathbf{v} \in \mathbb{R}^m$ are random vectors (see also Golub & von Matt (1997)). The matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times m}$ is diagonal, with exponentially decaying diagonal components

$$ \Sigma_{ii}, \qquad i = 1, \dots, m, \tag{21} $$

chosen so that the smallest singular values are numerically zero, rendering $\mathbf{X}$ singular. We set $n = 10^3$ and $m = 500$. Also, we generate data by $\mathbf{z} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta}$, where $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$ are randomly generated with unit variance and $\sigma = 0.4$, respectively.

We computed the exact solution of $\tau(t)$ in (20) using the Cholesky factorization method described by (17). The exact solution is shown by the solid black curve in Figure 3 (overlaid by the red curve), with $\tau_0 = 960.5$.
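The following sketch (ours) mirrors this setup; the decay rate of the singular values is an illustrative assumption, and the trace in the GCV denominator is computed through $\tau(t)$ of (20):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 1000, 500, 1e-3            # sizes as in the text; s is the small shift

# Householder reflections give exact orthogonal factors U and V.
u = rng.standard_normal(n)
v = rng.standard_normal(m)
U = np.eye(n) - 2.0 * np.outer(u, u) / (u @ u)
V = np.eye(m) - 2.0 * np.outer(v, v) / (v @ v)

# Exponentially decaying singular values, cf. (21); the decay constant
# here is an assumption, chosen so that X is numerically singular.
sigma = np.exp(-40.0 * np.arange(m) / m)
X = (U[:, :m] * sigma) @ V.T

beta = rng.standard_normal(m)                 # unit-variance coefficients
z = X @ beta + 0.4 * rng.standard_normal(n)   # noise with sigma = 0.4

A = X.T @ X + s * np.eye(m)                   # shifted, hence non-singular

def tau(t):
    """tau(t) of (20): (1/m) trace((A + t I)^{-1})."""
    return np.trace(np.linalg.inv(A + t * np.eye(m))) / m

def V_gcv(theta):
    """GCV function (19); the denominator trace uses the identity
    n - m + n*theta*trace((X^T X + n*theta I)^{-1}) = n - m + n*theta*m*tau(n*theta - s)."""
    tr = n - m + n * theta * m * tau(n * theta - s)
    r = z - X @ np.linalg.solve(X.T @ X + n * theta * np.eye(m), X.T @ z)
    return (r @ r / n) / (tr / n) ** 2
```

In a full implementation, `tau` would be replaced by the Cholesky-based evaluation (17) at the interpolant points and by the rational interpolant (14) elsewhere; the dense inverse here is for brevity only.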
Also, the upper bound $\hat{\tau}(t)$ from (7a) is shown by the dashed black curve for $t > 0$; in contrast, at $t \in (t_{\min}, 0)$, the bound follows from (7b), where $t_{\min} = -\lambda_1$ and $\lambda_1 \approx s = 10^{-3}$ is the smallest eigenvalue of $\mathbf{A}$.

Figure 3: (a) The exact solution $\tau(t)$, the bounds $\hat{\tau}(t)$, and the rational polynomial interpolations $\tilde{\tau}(t)$ for $p = 1, 2$. A small interval of the $t$ axis around the origin is scaled linearly, while outside this interval the axis is logarithmic. (b) The relative error of the estimations of $\tau(t)$.

The relative errors of the bounds with respect to the exact solution are shown in Figure 3(b). As expected, the peak of the absolute error of the upper bound is located approximately at $t \approx \tau_0^{-1} = 10^{-3}$, and the peak of its relative error is slightly to the right of this value, as seen in the diagram.

We aim to interpolate (20) over a wide logarithmic interval of $\theta$ extending up to $\theta = 10$, setting the parameters and interpolant points as follows. We set $s = 10^{-3}$, which shifts the origin of $t = n\theta - s$ inside the interval of interest of $n\theta$. Because this interval contains the origin, we employ the rational polynomial interpolation method described in §3.2; at small $t$, particularly at $t \ll \tau_0^{-1}$, the rational polynomial interpolation performs better than the basis functions interpolation. Also, we distribute the interpolant points at $t_i \geq \tau_0^{-1} = 10^{-3}$, where the rational polynomial interpolation has to adhere to the exact solution. At the interpolant points, we compute $\tau(t_i)$ using the Cholesky factorization method in (17).

The interpolation function $\tilde{\tau}(t)$ with $p = 1$, using two interpolation points, is shown by the green curve in Figure 3(a). Also, the interpolation with $p = 2$, using four interpolation points (the largest at $t = 1$), is shown by the red curve in that figure. The black and red curves are indistinguishable even in the embedded diagram that magnifies the location with the highest error. The relative error of the interpolation is shown in Figure 3(b), where the approximations are compared with the benchmark solution. On the far left of the range of $t$, the error spikes due to the singularity of the matrix $\mathbf{X}$, which makes $\tau(t)$ undefined at $t = -s$, corresponding to $\theta = 0$. On the rest of the range shown in the diagram, the red curve shows less than 0.05% relative error, which is an outstanding accuracy for a broad range of $t$, achieved with only four interpolation points.
Based on the $\tau(t)$ obtained above, we compute the generalized cross-validation function $V(\theta)$ in (19), shown in Figure 4.

Figure 4: The generalized cross-validation function, where the black and colored curves correspond respectively to the exact and interpolated computation of $\tau(t)$ in the denominator of $V(\theta)$. The global minimum of each curve at $\theta^*$ is shown by a dot.

The black curve in that diagram corresponds to the exact solution of $\tau(t)$ applied in the denominator of $V(\theta)$ and serves as a benchmark for comparison. The green and red curves correspond to the rational polynomial interpolation $\tilde{\tau}(t)$ applied in the denominator of $V(\theta)$, with $p = 1$ and $p = 2$, respectively. The interpolated curves exhibit both local minima of $V(\theta)$, similar to the benchmark curve, but with a slight difference in the position of the minima. Due to the singularity at $\theta = 0$, the interpolations of $\tau(t)$ become less accurate at low ranges of $\theta$, which leads to a higher discrepancy with the benchmark curve. On the other end, at high ranges of $\theta$, all curves steadily asymptote to a constant. We note that the results in Figure 4 are impressive considering that the estimation of $V(\theta)$ is sensitive to the interpolation of its denominator; in particular, a consistent interpolation accuracy over the whole parameter range is essential to capture the qualitative shape of $V(\theta)$ correctly.

The global minimum of $V(\theta)$ at $\theta = \theta^*$ is the optimal compromise between an ill-conditioned regression problem (small $\theta$) and a highly regularized regression problem (large $\theta$). We aim to test the practicality of the interpolation method in searching for the global minimum, $V(\theta^*)$, by numerical optimization. We note that, by the specific design of the singular values $\Sigma_{ii}$ of the singular matrix $\mathbf{X}$ in (21), we intended to generate a $V(\theta)$ with two local minima to make the optimization less trivial. In general, the generalized cross-validation function may have more than one local minimum, which necessitates global search algorithms (Kent & Mohammadzadeh, 2000). To this end, we use the differential evolution optimization method (Storn & Price, 1997) with the best/1/exp strategy and 40 initial guess points. The results are shown in the first three rows of Table 2, where the trace of the matrix inverse is computed by the Cholesky factorization described in (17). In the first row, $\tau$ is computed exactly, i.e., without interpolation, at all locations $t$ requested during the optimization procedure. In the second and third rows, $\tau$ is first pre-computed at the interpolant points $t_i$ by the Cholesky factorization, and then interpolated during the optimization procedure at all other locations.

In the table, $N_{\mathrm{tr}}$ counts the number of exact evaluations of $\tau$. For the rational polynomial interpolation method of degree $p$, we have $N_{\mathrm{tr}} = 2p + 1$, accounting for $2p$ interpolant points in addition to the evaluation of $\tau_0$ at $t = 0$. Also, $N_{\mathrm{tot}}$ is the total number of estimations of $\tau$ during the optimization process. In the first row, $N_{\mathrm{tr}} = N_{\mathrm{tot}}$, as all points are evaluated exactly, i.e., without interpolation; for the interpolation methods, $N_{\mathrm{tot}}$ consists of $N_{\mathrm{tr}}$ plus the evaluations of $\tau$ via interpolation at other locations.

The exact computations of $\tau$ (at the $N_{\mathrm{tr}}$ points) are the most computationally expensive part of the overall process. Our numerical experiments were performed on an Intel Xeon E5-2640 v4 processor using shared-memory parallelism, and we measure computational cost by the total CPU time of all computing cores. In the sixth column of the table, $T_{\mathrm{tr}}$ denotes the processing time of computing $\tau$ at the $N_{\mathrm{tr}}$ exactly computed points. In the seventh column, $T_{\mathrm{tot}}$ measures the processing time of the overall optimization, which includes $T_{\mathrm{tr}}$. It can be seen that the interpolation methods take significantly less processing time than no interpolation, namely, two orders of magnitude less for $T_{\mathrm{tr}}$ and an order of magnitude less for $T_{\mathrm{tot}}$.

The results of the optimized parameter $\theta^*$ and the corresponding minimum $V(\theta^*)$ are shown in the eighth and ninth columns of Table 2, respectively. The last column is the relative error of estimating $\theta^*$, obtained by comparing $\log_{10}\theta^*$ between the interpolated and the benchmark solution of the first row. From the third row of the table, we observe that an accuracy with 0.4% error is achieved with one-tenth of the processing time. In practice, a rough approximation of the regularization parameter is often sufficient.
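A sketch (ours) of the global search with SciPy's differential evolution; `'best1exp'` is SciPy's spelling of the best/1/exp strategy, while the search bounds over $\log_{10}\theta$ and the population size are assumptions:

```python
from scipy.optimize import differential_evolution

def minimize_gcv(V, log10_bounds=(-8.0, 1.0)):
    """Globally minimize a GCV function V(theta) over log10(theta).
    The bounds and popsize are illustrative, not values from the paper."""
    result = differential_evolution(lambda x: V(10.0 ** x[0]),
                                    bounds=[log10_bounds],
                                    strategy='best1exp', popsize=40, seed=0)
    return 10.0 ** result.x[0], result.fun

# Example usage with V_gcv from the earlier ridge regression sketch:
# theta_star, V_star = minimize_gcv(V_gcv)
```

Searching in $\log_{10}\theta$ rather than $\theta$ matches the logarithmic range over which $V(\theta)$ varies.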
Table 2: Comparison of methods to optimize the regularization parameter $\theta$, with and without interpolation of $\tau(t)$, and by various algorithms of computing the trace of a matrix inverse.

Algorithm  | Interpolation              | Interpolant points $t_i$ | $N_{\mathrm{tr}}$ | $N_{\mathrm{tot}}$ | $T_{\mathrm{tr}}$ (sec) | $T_{\mathrm{tot}}$ (sec) | $V(\theta^*)$ | $\log_{10}\theta^*$ | Error
Cholesky   | No interpolation (exact)   | ∅        | 282 | 282 | 28.0 | 31.2 | 0.16376 | -3.8164 | 0.00%
Cholesky   | Rational polynomial, p = 1 | 2 points | 3   |     |      |      |         |         |
Cholesky   | Rational polynomial, p = 2 | 4 points | 5   |     |      |      |         |         |
Hutchinson | No interpolation (exact)   | ∅        | 414 | 414 | 2.13 | 6.93 | 0.16369 | -3.8066 | 0.25%
Hutchinson | Rational polynomial, p = 1 | 2 points | 3   |     |      |      |         |         |
Hutchinson | Rational polynomial, p = 2 | 4 points | 5   |     |      |      |         |         |
SLQ        | No interpolation (exact)   | ∅        | 394 | 394 | 51.1 | 55.6 | 0.16373 | -3.8038 | 0.33%
SLQ        | Rational polynomial, p = 1 | 2 points | 3   |     |      |      |         |         |
SLQ        | Rational polynomial, p = 2 | 4 points | 5   |     |      |      |         |         |
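The stochastic estimators used in the lower rows of Table 2 are discussed next; as a reference point, a minimal Hutchinson estimator of $\operatorname{trace}(\mathbf{C}^{-1})$ can be sketched as follows (our own illustration, using a linear solve in place of an explicit inverse):

```python
import numpy as np

def hutchinson_trace_inv(C, num_vectors=30, rng=None):
    """Hutchinson estimate of trace(C^{-1}) with Rademacher probes,
    using E[v^T C^{-1} v] = trace(C^{-1}) for +-1 entries of v."""
    rng = rng or np.random.default_rng()
    n = C.shape[0]
    estimate = 0.0
    for _ in range(num_vectors):
        v = rng.integers(0, 2, size=n) * 2.0 - 1.0   # Rademacher +-1 probe
        estimate += v @ np.linalg.solve(C, v)        # v^T C^{-1} v via a solve
    return estimate / num_vectors
```

In practice, the solve would be performed with a sparse or iterative method rather than a dense factorization.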
Besides the Cholesky factorization algorithm, we also repeated the numerical experiment with stochastic trace estimators, namely, Hutchinson's algorithm (Hutchinson, 1990) and the stochastic Lanczos quadrature (SLQ) algorithm (Golub & Meurant, 2009; Ubaru et al., 2017). In Hutchinson's algorithm, we use $n_v = 30$ random vectors with Rademacher distribution for the Monte-Carlo sampling. Also, in the SLQ algorithm, we set the Lanczos degree to 30, which is the number of Lanczos iterations for the tri-diagonalization (see details, e.g., in Ubaru et al. (2017)). The results are shown in the remaining rows of Table 2. With both stochastic estimators, the interpolation technique reduces the processing times compared to no interpolation, namely, $T_{\mathrm{tr}}$ is reduced by two orders of magnitude and $T_{\mathrm{tot}}$ by an order of magnitude, while maintaining a reasonable accuracy of roughly 5% to 6% error. Lastly, without interpolation, the overall time $T_{\mathrm{tot}}$ with Hutchinson's algorithm is faster than with the SLQ algorithm by an order of magnitude, whereas with interpolation, the overall processing time is more or less similar for both algorithms. This is because, with both stochastic estimators, $T_{\mathrm{tr}}$ becomes so small that the interpolation of $\tau(t)$ and the evaluation of $V(\theta)$ become the main time-consuming parts of the computation.

5 Conclusion

In many applications in statistics and machine learning, it is desirable to estimate the trace of the inverse of a one-parameter family of matrices $\mathbf{A} + t\mathbf{B}$, where only the parameter $t$ varies and the matrices $\mathbf{A}$ and $\mathbf{B}$ remain unchanged. There exist many efficient techniques to estimate the trace of a matrix inverse; however, these methods are geared toward generic matrices, and with them the computation must be repeated for each parameter value as the matrix is updated. To efficiently perform such computations for a wide range of the parameter, we presented heuristic methods to interpolate the function $t \mapsto \operatorname{trace}((\mathbf{A} + t\mathbf{B})^{-1})$. Our interpolation methods are based on a sharp upper bound we obtained for this function. We proposed two interpolation functions, namely, interpolation with a linear combination of orthogonal basis functions and interpolation with rational polynomials. We summarize our results as follows:

- We demonstrated that both interpolation methods can provide remarkably accurate interpolation over a wide range of the parameter using only a few interpolant points. Namely, with 9 interpolant points, the interpolation with linear basis functions led to 0.01% accuracy over a range of seven orders of magnitude of the parameter $t$. Also, with 4 interpolant points, the rational polynomial interpolation led to 0.05% error over the examined interval of the parameter.
- The interpolation method with rational polynomials provides better results in the neighborhood of the origin of the parameter. For this reason, this method is suitable for ill-conditioned or singular matrices, where a shift of the origin of the parameter is required to circumvent the singularity.
- The interpolation with the linear combination of orthogonal basis functions can be used in applications where many interpolant points are needed. Generally, when many interpolant points are used, the system of equations for the coefficients of the interpolation function becomes ill-conditioned. However, since the basis functions of this interpolation method can be orthogonalized, the system of equations for the coefficients remains well-conditioned even for many interpolant points. In contrast, the interpolation with rational polynomials can be applied with only a limited number of interpolant points.
- By employing the presented interpolation methods, the processing time in applications requiring frequent computation of the above trace function can be reduced dramatically with an acceptable interpolation error. For instance, we estimated the regularization parameter of ridge regression with roughly 5% error using two and one orders of magnitude less processing time, respectively, for estimating the above trace function and for the total numerical procedure of the generalized cross-validation method.

Acknowledgment.
The authors acknowledge support from the National Science Foundation, award number 1520825, and the American Heart Association, award number 18EIA33900046.
References
Ameli, S. & Shadden, S. C. (2020). Marginal maximum likelihood estimation of noise in general linear model. In preparation.

Avron, H. & Toledo, S. (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58(2).

Bai, Z., Fahey, G., & Golub, G. (1996). Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1), 71-89.

Bai, Z. & Golub, G. H. (1997). Bounds for the trace of the inverse and the determinant of symmetric positive definite matrices. Annals Numer. Math., 4, 29-38.

Bekas, C., Curioni, A., & Fedulova, I. (2012). Low-cost data uncertainty quantification. Concurrency and Computation: Practice and Experience, 24(8), 908-920.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag.

Björck, Å. (1996). Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics.

Bullen, P. (1998). A Dictionary of Inequalities. Monographs and Research Notes in Mathematics. Taylor & Francis.

Bullen, P. (2013). Handbook of Means and Their Inequalities. Mathematics and Its Applications. Springer Netherlands.

Chaloner, K. & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3), 273-304.

Craven, P. & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische Mathematik, 31(4), 377-403.

Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics.

Gibbs, M. & MacKay, D. J. C. (1997). Efficient Implementation of Gaussian Processes. Technical report, Cavendish Laboratory, Cambridge, UK.

Girard, A. (1989). A fast 'Monte-Carlo cross-validation' procedure for large least squares problems with noisy data. Numerische Mathematik, 56(1), 1-23.

Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2), 215-223.

Golub, G. H. & Meurant, G. (2009). Matrices, Moments and Quadrature with Applications. USA: Princeton University Press.

Golub, G. H. & Plemmons, R. J. (1980). Large-scale geodetic least-squares adjustment by dissection and orthogonal decomposition. Linear Algebra and its Applications, 34, 3-28.

Golub, G. H. & Strakoš, Z. (1994). Estimates in quadratic formulas. Numerical Algorithms, 8(2), 241-268.

Golub, G. H. & von Matt, U. (1997). Generalized cross-validation for large-scale problems. Journal of Computational and Graphical Statistics, 6(1), 1-34.

Haber, E., Horesh, L., & Tenorio, L. (2008). Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems, 24(5), 055012.

Hardy, G. H., Littlewood, J. E., & Pólya, G. (1952). Inequalities. Cambridge Mathematical Library. Cambridge University Press.

Herman, J., Kucera, R., & Simsa, J. (2000). Equations and Inequalities: Elementary Problems and Theorems in Algebra and Number Theory. CMS Books in Mathematics. Springer-Verlag New York.

Hoehn, L. & Niven, I. (1985). Averages on the move. Mathematics Magazine, 58(3), 151-156.

Hutchinson, M. F. (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2), 433-450.

Kalantzis, V., Bekas, C., Curioni, A., & Gallopoulos, E. (2013). Accelerating data uncertainty quantification by solving linear systems with multiple right-hand sides. Numerical Algorithms, 62(4), 637-653.

Kent, J. T. & Mohammadzadeh, M. (2000). Global optimization of the generalized cross-validation criterion. Statistics and Computing, 10(3), 231-236.

Lukas, M. A., de Hoog, F. R., & Anderssen, R. S. (2010). Efficient algorithms for robust generalized cross-validation spline smoothing. Journal of Computational and Applied Mathematics, 235(1), 102-107.

MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Mitrinović, D. S., Pečarić, J., & Fink, A. M. (1992). Classical and New Inequalities in Analysis. Mathematics and its Applications. Springer Netherlands.

Mitrinović, D. S. & Vasić, P. M. (1970). Analytic Inequalities. Grundlehren der mathematischen Wissenschaften. Springer-Verlag.

Niessner, H. & Reichert, K. (1983). On computing the inverse of a sparse matrix. International Journal for Numerical Methods in Engineering, 19(10), 1513-1526.

Rasmussen, C. E. & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press.

Saibaba, A. K., Alexanderian, A., & Ipsen, I. C. F. (2017). Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2), 353-395.

Seber, G. & Lee, A. (2012). Linear Regression Analysis. Wiley Series in Probability and Statistics. Wiley.

Stathopoulos, A., Laeuchli, J., & Orginos, K. (2013). Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM Journal on Scientific Computing, 35(5), S299-S322.

Stewart, G. W. (1998). Matrix Algorithms: Volume 1: Basic Decompositions. Society for Industrial and Applied Mathematics.

Storn, R. & Price, K. (1997). Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4), 341-359.

Takahashi, K., Fagan, J., & Chen, M. S. (1973). Formation of a sparse bus impedance matrix and its application to short circuit study. In Proceedings (pp. 63-69). IEEE Power Engineering Society.

Tang, J. M. & Saad, Y. (2012). A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, 19(3), 485-501.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211-244.

Ubaru, S., Chen, J., & Saad, Y. (2017). Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4), 1075-1099.

Ubaru, S. & Saad, Y. (2016). Fast methods for estimating the numerical rank of large matrices. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16 (pp. 468-477). JMLR.org.

Ubaru, S. & Saad, Y. (2018). Applications of trace estimation techniques. In T. Kozubek, M. Čermák, P. Tichý, R. Blaheta, J. Šístek, D. Lukáš, & J. Jaroš (Eds.), High Performance Computing in Science and Engineering (pp. 19-33). Cham: Springer International Publishing.

Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4), 651-667.

Wu, L., Laeuchli, J., Kalantzis, V., Stathopoulos, A., & Gallopoulos, E. (2016). Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse.