Inexact Krylov Subspace Algorithms for Large Matrix Exponential Eigenproblem from Dimensionality Reduction
GANG WU*, TING-TING FENG†, LI-JIA ZHANG‡, AND MENG YANG§

* Corresponding author. G. Wu is with the Department of Mathematics, China University of Mining and Technology, and the School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, P.R. China. E-mail: [email protected] and [email protected]. This author is supported by the National Science Foundation of China under grant 11371176, the Natural Science Foundation of Jiangsu Province under grant BK20131126, the 333 Project of Jiangsu Province, and the Talent Introduction Program of China University of Mining and Technology.
† Department of Mathematics, Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice, East China Normal University, Dongchuan RD 500, Shanghai, 200241, P.R. China. E-mail: [email protected].
‡ School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, P.R. China. E-mail: [email protected].
§ School of Computer Science and Technology, Soochow University, Suzhou, 215006, Jiangsu, P.R. China. E-mail: [email protected].

Abstract.
Matrix exponential discriminant analysis (EDA) is a generalized discriminant analysis method based on the matrix exponential. It can essentially overcome the intrinsic difficulty of the small sample size problem that exists in the classical linear discriminant analysis (LDA). However, for data of high dimension, one has to solve a large matrix exponential eigenproblem in this method, and the time complexity is dominated by the computation of the exponential of large matrices. In this paper, we propose two inexact Krylov subspace algorithms for solving the large matrix exponential eigenproblem effectively. The contribution of this work is threefold. First, we consider how to compute matrix exponential-vector products efficiently, which is the key step in the Krylov subspace method. Second, we compare the discriminant analysis criterion of EDA and that of LDA from a theoretical point of view. Third, we establish a relationship between the accuracy of the approximate eigenvectors and the distance to the nearest neighbour classifier, and show why the matrix exponential eigenproblem can be solved approximately in practice. Numerical experiments on some real-world databases show the superiority of our new algorithms over many state-of-the-art algorithms for face recognition.
Key words.
Large matrix exponential eigenproblem, Krylov subspace method, Dimensionality reduction, Face recognition, Linear discriminant analysis (LDA), Exponential discriminant analysis (EDA).
AMS subject classifications.
1. Introduction. Face recognition has become one of the most successful applications of image analysis. Real data for face images are usually depicted in high dimensions. In order to handle high dimensional data, their dimensionality needs to be reduced. In essence, dimensionality reduction is the transformation of high-dimensional data into a lower dimensional data space. Currently, one of the most extensively used dimensionality reduction methods is subspace transformation [12, 41, 42, 44].

Linear discriminant analysis (LDA) is one of the notable subspace transformation methods for dimensionality reduction [12, 14, 28]. LDA encodes discriminant information by maximizing the between-class scatter while minimizing the within-class scatter in the projected subspace. More precisely, let X = [χ_1, χ_2, ..., χ_n] be a set of training samples in a d-dimensional feature space, and assume that the original data in X is partitioned into k classes as X = [X_1, X_2, ..., X_k]. We denote by n_j the number of samples in the j-th class, so that Σ_{j=1}^{k} n_j = n. Let μ_j be the centroid of the j-th class, and μ be the global centroid of the training data set. If we denote e_j = [1, 1, ..., 1]^T ∈ R^{n_j}, then the within-class scatter matrix is defined as

    S_W = Σ_{j=1}^{k} Σ_{χ_i ∈ X_j} (χ_i − μ_j)(χ_i − μ_j)^T = H_W H_W^T,    (1.1)

where H_W = [X_1 − μ_1·e_1^T, ..., X_k − μ_k·e_k^T] ∈ R^{d×n}. The between-class scatter matrix is defined as

    S_B = Σ_{j=1}^{k} n_j (μ_j − μ)(μ_j − μ)^T = H_B H_B^T,    (1.2)

where H_B = [√n_1(μ_1 − μ), √n_2(μ_2 − μ), ..., √n_k(μ_k − μ)] ∈ R^{d×k}. The LDA method is realized by maximizing the between-class scatter distance while minimizing the within-class scatter distance, which involves solving a "Trace Ratio" problem [19, 23, 24, 25] of the form

    τ = max_{V ∈ R^{d×t}, V^T V = I} tr(V^T S_B V) / tr(V^T S_W V),    (1.3)

where tr(·) denotes the trace of a matrix, and t ≪ d is the dimension of the projection subspace. However, this problem is seldom solved in practice. It is generally considered too difficult to solve and is commonly replaced by a simpler, but not equivalent, problem called the "Ratio Trace" problem, of the following form [23, 28]:

    ϱ = max_{V ∈ R^{d×t}, V^T V = I} tr((V^T S_W V)^{-1}(V^T S_B V)),    (1.4)

and the optimal projection matrix V can be calculated by solving the following generalized symmetric eigenproblem

    S_B x = λ S_W x.    (1.5)
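To make the constructions (1.1) and (1.2) concrete, here is a minimal NumPy sketch (ours, not from the paper; the helper name scatter_factors and all variable names are our own) that forms the factors H_W and H_B from a toy data matrix with class labels:

```python
import numpy as np

def scatter_factors(X, labels):
    """Form H_W and H_B of (1.1)-(1.2) for data X (d x n) with class labels."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)            # global centroid
    HW_blocks, HB_cols = [], []
    for c in np.unique(labels):
        Xj = X[:, labels == c]                    # samples of class j
        mu_j = Xj.mean(axis=1, keepdims=True)     # class centroid
        HW_blocks.append(Xj - mu_j)               # X_j - mu_j * e_j^T
        HB_cols.append(np.sqrt(Xj.shape[1]) * (mu_j - mu))
    return np.hstack(HW_blocks), np.hstack(HB_cols)   # H_W (d x n), H_B (d x k)

# S_W = H_W H_W^T and S_B = H_B H_B^T, as in (1.1) and (1.2)
X = np.random.rand(20, 9)
labels = np.repeat([0, 1, 2], 3)
H_W, H_B = scatter_factors(X, labels)
S_W, S_B = H_W @ H_W.T, H_B @ H_B.T
```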
In practice, the dimension of real data usually exceeds the number of training samples, which results in the scatter matrix S_W being singular. This is called the small-sample-size (SSS) or undersampled problem [28]. It is an intrinsic limitation of the classical LDA method, and it is also a common problem in classification applications [28]. In other words, the SSS problem stems from generalized eigenproblems with singular matrices. To tackle the SSS problem, many variants of LDA have been proposed in recent years. To name a few: the regularized LDA method (RLDA) [15], LDA+PCA [5], the null-space LDA method (NLDA) [8], LDA/QR [49], LDA/GSVD [18, 48], the direct LDA method (DLDA) [50], the orthogonal LDA method (OLDA) [26, 47], the neighborhood minmax projections (NMMP) [27], and so on. These variations on LDA have both advantages and disadvantages [28, 51]. An et al. [2] unified these LDA variants in one framework: principal component analysis plus constrained ridge regression.

Recently, a novel method based on the matrix exponential, called exponential discriminant analysis (EDA), was proposed in [51]. Instead of the LDA criterion (1.4), EDA considers the following criterion:

    ρ = max_{V ∈ R^{d×t}, V^T V = I} tr((V^T exp(S_W) V)^{-1}(V^T exp(S_B) V)),    (1.6)

where exp(·) denotes the exponential of a matrix or scalar in this paper. The projection matrix V can be obtained from the t dominant eigenvectors of the following generalized symmetric matrix exponential eigenproblem [51]:

    exp(S_B) x = λ exp(S_W) x.    (1.7)

The framework of the EDA method for dimensionality reduction has gained wide attention in recent years. For instance, exponential locality preserving projections (ELPP) [40] were proposed to avoid the SSS problem occurring in locality preserving projections (LPP). Wang et al. [41] applied the matrix exponential to extend many popular Laplacian embedding algorithms, such as locality preserving projections, unsupervised discriminant projections, and marginal Fisher analysis. A matrix exponential local discriminant embedding method (ELDE) was investigated in [11] to deal with the SSS problem appearing in local discriminant embedding (LDE). Using the method of exponential discriminant analysis, Ahmed [1] proposed a novel image clustering model that incorporates both local and global information in image databases. A 2DEDA method was presented in [46], which is an algorithm based on image matrices (2D data) rather than image vectors (1D data); thus, 2DEDA can be viewed as a generalization of the EDA method to 2D data.

It has been widely shown that the EDA method has more discriminant power than its original counterpart [1, 11, 40, 41, 46, 51]. In the EDA framework, the matrix exponential can be roughly interpreted as a random walk over the feature similarity matrix, and thus is more robust. As the exponentials of S_W and S_B are symmetric positive definite (SPD), the EDA method naturally deals with the SSS problem. Moreover, the decay behavior of the matrix exponential is more significant in emphasizing small distance pairs [41]. However, in all the EDA-based methods, one has to solve the large matrix exponential eigenproblem (1.7) for data with high dimensionality [1, 11, 40, 41, 46, 51], and the time complexity is dominated by the computation of exp(S_B) and exp(S_W) [51, pp. 191]. This cost will be prohibitively large, as the exponential of a matrix is often dense even if the matrix in question is sparse [17].
Thus, for data of high dimension, the EDA method often suffers from heavy computational overhead and storage requirements. So it is urgent to seek new techniques to speed up the solution of the large generalized matrix exponential eigenproblem arising in the framework of EDA.

Modern numerical linear algebra exploits the Krylov subspace method in different advanced iterative procedures for large-scale eigenvalue problems [4, 31, 37]. Indeed, this type of method ranks among "The Top 10 Algorithms of the 20th Century" [9]. In this paper, we devote ourselves to solving (1.7) with the Krylov subspace method. The key step involves the evaluation of matrix exponential-vector products, which is a hot topic in large scientific and engineering computations [17, 22]. In conventional approaches, one has to evaluate these products by using some iterative methods [17, 22]. In this work, we derive closed-form formulae for the matrix exponential-vector products, so that these products can be formed very efficiently. The second contribution of this work is to give a theoretical comparison of the discriminant analysis criteria (1.4) and (1.6) of LDA and EDA, respectively, and to show the reason why EDA can improve the classification rate of the original LDA method. Finally, we establish a relationship between the accuracy of the approximate eigenvectors and the distance to the nearest neighbour classifier (NN) [10], and shed light on why the matrix exponential eigenproblem can be solved approximately in practical calculations.

This paper is organized as follows. In Section 2, we briefly overview the EDA method. In Section 3, we propose two inexact Krylov subspace algorithms for EDA. Numerical experiments are presented in Section 4, and some concluding remarks are given in Section 5. The notations used in this paper are listed in Table 1.1.
Table 1.1
Notations used in this paper
Notation                 Description
d                        Data dimension
k                        Number of classes
n                        Number of samples
t                        Dimension of projection subspace (or number of desired eigenpairs)
ℓ                        Number of samples in the training set
X                        Training samples in a d-dimensional feature space
X_i                      The i-th class of training samples
χ_i                      The i-th sample
S_W                      The within-class scatter matrix defined in (1.1)
S_B                      The between-class scatter matrix defined in (1.2)
exp(S_W), exp(S_B)       Exponentials of S_W and S_B
exp^{1/2}(−S_W)          Square root of exp(−S_W)
λ, x                     Eigenvalue and eigenvector
K_m(A, v)                Krylov subspace with respect to A and v
tr(A)                    Trace of the matrix A
V                        Projection matrix composed of the desired eigenvectors
A^T                      Transpose of A
I                        The identity matrix
O                        Zero matrix or vector
span{W}                  Subspace spanned by the columns of W
dim(W)                   Dimension of the subspace W
‖·‖                      2-norm (Euclidean norm) of a vector or matrix
2. The exponential discriminant analysis method.
Given an arbitrary square matrix A, its exponential is defined as [16, 17]

    exp(A) = Σ_{j=0}^{∞} A^j / j! = I + A + A^2/2! + ··· + A^s/s! + ···,    (2.1)

where I is the identity matrix. An important consequence is that any matrix exponential is invertible. Indeed, we have

    exp(−A) · exp(A) = exp(A − A) = I,  i.e.,  exp^{-1}(A) = exp(−A).    (2.2)

Suppose that A ∈ R^{d×d} is symmetric, and let A = X Λ X^{-1} be the eigendecomposition of A, where Λ = diag{λ_1, λ_2, ..., λ_d} is diagonal and X is a d×d matrix whose columns are the eigenvectors. Then it is easy to see that

    exp(A) = X exp(Λ) X^{-1}.    (2.3)

Therefore, A and exp(A) share the same eigenvectors, and the eigenvalues of exp(A) are exp(Λ) = diag{exp(λ_1), exp(λ_2), ..., exp(λ_d)}.

The exponential discriminant analysis method (EDA) makes use of the exponential criterion (1.6), and the projection matrix V is obtained from solving the matrix exponential eigenproblem (1.7). This method is equivalent to transforming the original data into a new space by distance diffusion mapping, with the LDA criterion applied in the new space [51].
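As a quick sanity check of (2.2) and (2.3), the following sketch (our illustration, not part of the paper) compares scipy.linalg.expm with the eigendecomposition route for a random symmetric matrix:

```python
import numpy as np
from scipy.linalg import expm, eigh

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                        # random symmetric matrix

lam, X = eigh(A)                         # A = X diag(lam) X^T
E_eig = X @ np.diag(np.exp(lam)) @ X.T   # exp(A) via (2.3)
E = expm(A)

print(np.allclose(E, E_eig))                  # True: (2.3)
print(np.allclose(E @ expm(-A), np.eye(6)))   # True: (2.2), exp(A)^{-1} = exp(-A)
```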
The EDA algorithm is described as follows:

Algorithm 1. The exponential discriminant analysis method (EDA) [51]
Input:
The data matrix X = [χ_1, χ_2, ..., χ_n] ∈ R^{d×n}, where χ_j represents the j-th training image.

Output:
The projection matrix V.
1. Compute the matrices S_B, S_W, exp(S_B), and exp(S_W);
2. Compute the eigenvectors {x_i} and eigenvalues {λ_i} of exp(S_W)^{-1} exp(S_B);
3. Sort the eigenvectors V = {x_i} according to λ_i in decreasing order;
4. Orthogonalize the columns of the projection matrix V.

As a result of diffusion mapping, the margin between different classes is enlarged in the EDA framework, which is helpful in improving classification accuracy [51]. In the small-sample-size case, EDA can extract not only the most discriminant information contained in the null space of the within-class scatter matrix, where it is similar to NLDA, but also the discriminant information contained in the non-null space of the within-class scatter matrix, where it is equivalent to LDA+PCA [51]. In addition, there are at most k − 1 nonzero eigenvalues of the between-class scatter matrix, where k is the number of classes in the data set, so the dimension of the projected subspace is at most k − 1.
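For reference, a direct (and, for large d, prohibitively expensive) realization of Algorithm 1 could look as follows. This is a sketch under our own naming; it assumes the scatter_factors helper sketched in the introduction, and scipy's orth plays the role of step 4:

```python
import numpy as np
from scipy.linalg import expm, eig, orth

def eda_naive(X, labels, t):
    """Algorithm 1 (EDA) done literally: dense d x d exponentials, O(d^3) work."""
    H_W, H_B = scatter_factors(X, labels)     # sketched in the introduction
    S_W, S_B = H_W @ H_W.T, H_B @ H_B.T
    A = expm(-S_W) @ expm(S_B)                # exp(S_W)^{-1} exp(S_B), cf. (2.2)
    lam, Vecs = eig(A)                        # step 2
    idx = np.argsort(-lam.real)[:t]           # step 3: t dominant eigenvectors
    return orth(Vecs[:, idx].real)            # step 4: orthonormalize V
```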
Example 1.1. We illustrate the superiority of the EDA method over the LDA method via a toy example. Consider the set of training samples X = [χ_1, χ_2, χ_3] ∈ R^{3×3}, whose dimension is equal to the number of samples. Let the original data be partitioned into k = 3 classes: X = [χ_1, χ_2, χ_3] = [X_1, X_2, X_3], and assume that the columns of X are linearly independent. So we have μ_i = χ_i, i = 1, 2, 3, the global centroid μ = (μ_1 + μ_2 + μ_3)/3, and

    H_W = [χ_1 − μ_1, χ_2 − μ_2, χ_3 − μ_3] = O,    H_B = [χ_1 − μ, χ_2 − μ, χ_3 − μ].

Therefore, S_W = H_W H_W^T = O, S_B = H_B H_B^T, and the generalized eigenproblem (1.5) for the LDA method becomes

    S_B x = O.    (2.4)

Since the columns of X are linearly independent, we have rank(S_B) = 2, and there is only one solution vector (up to a scalar) for (2.4).

On the other hand, the matrix exponential eigenproblem (1.7) for the EDA method reduces to

    exp(S_B) x = λ x,    (2.5)

which has three linearly independent solution vectors. Therefore, the LDA method loses some useful discriminant information, and the EDA method can perform better than the LDA method.
As was stressed in [51, pp. 191], however, the time complexity of EDA is dominated by the computation of exp(S_B) and exp(S_W), as well as the solution of the large matrix exponential eigenproblem. More precisely, one has to explicitly form and store exp(S_B) and exp(−S_W), as well as their product exp(−S_W) exp(S_B). The complexity is O(d^3), which is prohibitively large when the dimension of the data is high [17]. Moreover, we have to solve a large matrix exponential eigenproblem with respect to exp(S_W)^{-1} exp(S_B). If the matrix exponential eigenproblem is solved by using the implicit QR algorithm, another O(d^3) flops are required [4, 16]. Therefore, it is necessary to seek new strategies to improve the numerical performance of the EDA method.
3. Inexact Krylov subspace algorithms for matrix exponential discriminant analysis.
In this section, we propose two inexact Krylov subspace algorithms for solving (1.7). We first consider how to compute the matrix exponential-vector products involved in the Krylov subspace method, and then derive new lower and upper bounds on the EDA criterion (1.6) and the LDA criterion (1.4). Finally, we provide a practical stopping criterion for solving the matrix exponential eigenproblem approximately.
The generalized matrix exponential eigenproblem (1.7) is mathematically equivalent to the following standard nonsymmetric matrix exponential eigenproblem

    exp(−S_W) exp(S_B) x = λ x,    (3.1)

where we used exp^{-1}(S_W) = exp(−S_W). As only t (t ≪ d) dominant eigenpairs of the large matrix exp(−S_W) exp(S_B) are required, we are interested in solving the large matrix exponential eigenproblem by using the Krylov subspace method.

The Arnoldi method is a widely used Krylov subspace method for finding a few extreme eigenvalues and their associated eigenvectors of a large nonsymmetric matrix [4, 31, 37]. It requires only matrix-vector products to extract the eigen-information needed to compute the desired solutions, and thus is very attractive in practice when the matrix is too large to be handled by, e.g., the implicit QR algorithm [4, 16], or when the matrix does not exist explicitly but only in the form of being able to generate matrix-vector multiplications.

The principle of the Arnoldi method is as follows. Given a real nonsymmetric matrix A and an initial real vector v_1 of unit norm, in exact arithmetic the m-step Arnoldi process recursively generates an orthonormal basis V_m = [v_1, v_2, ..., v_m] for the Krylov subspace

    K_m(A, v_1) = span{v_1, A v_1, ..., A^{m−1} v_1}.

At the same time, the projection of A onto K_m(A, v_1) is expressed as an upper Hessenberg matrix H_m = V_m^T A V_m. Afterwards, the Rayleigh-Ritz procedure [4, 31, 37] is applied to compute approximate eigenpairs of A: if (λ̃, ỹ) is an eigenpair of H_m, then (λ̃, V_m ỹ) is taken as an approximate eigenpair of A. Here λ̃ is called a Ritz value and V_m ỹ is called a Ritz vector.

Thus, denoting A = exp(−S_W) exp(S_B), the key ingredient of the Arnoldi method is to evaluate the matrix exponential-vector products

    w = A v = exp(−S_W) exp(S_B) v    (3.2)

with different vectors v. When the matrices S_W, S_B are large, Krylov subspace methods are widely used for this type of problem [17, 22]. Generally speaking, there are two classes of Krylov subspace methods for evaluating (3.2) iteratively [30]. In the first class of methods, the matrix is projected into a much smaller subspace, then the exponential is applied to the reduced matrix, and finally the approximation is projected back to the original large space [32]. In the second class of methods, the exponential function is first approximated by simpler functions such as rational functions, and then the action of the matrix exponential is calculated [43]. In this case, the core of (3.2) reduces to solving some linear systems with multiple shifts [45].

Thanks to the structure of exp(−S_W) and exp(S_B), instead of evaluating (3.2) iteratively, we compute it directly by using a closed-form formula. More precisely, let

    H_W = Q_W R_W,    H_B = Q_B R_B    (3.3)

be the (skinny) QR decompositions [16] of the matrices H_W and H_B from (1.1) and (1.2), respectively, where Q_B ∈ R^{d×k}, Q_W ∈ R^{d×n} are orthonormal, and R_B ∈ R^{k×k}, R_W ∈ R^{n×n} are upper triangular. Denote by

    R_B = U_B Σ_B V_B^T,    R_W = U_W Σ_W V_W^T    (3.4)

the singular value decompositions (SVD) [16] of R_B and R_W, respectively, and let

    Q̃_B = Q_B U_B,    D_B = Σ_B^2,    (3.5)

and

    Q̃_W = Q_W U_W,    D_W = Σ_W^2.    (3.6)

Then we have the following theorem:

Theorem 3.1.
Let Q̃_B^⊥ and Q̃_W^⊥ be orthonormal bases for the orthogonal complements of span{Q̃_B} and span{Q̃_W}, respectively, and let f be a function defined on the spectra of S_W and S_B from (1.1) and (1.2), respectively. Then

    f(S_B) = Q̃_B f(D_B) Q̃_B^T + Q̃_B^⊥ f(O_{(d−k)×(d−k)}) (Q̃_B^⊥)^T    (3.7)

and

    f(S_W) = Q̃_W f(D_W) Q̃_W^T + Q̃_W^⊥ f(O_{(d−n)×(d−n)}) (Q̃_W^⊥)^T,    (3.8)

where O_{(d−k)×(d−k)} and O_{(d−n)×(d−n)} are (d−k)×(d−k) and (d−n)×(d−n) zero matrices, respectively. In particular, for the matrix exponential, we have

    exp(S_B) = Q̃_B exp(D_B) Q̃_B^T + I − Q̃_B Q̃_B^T,    (3.9)
    exp(−S_W) = Q̃_W exp(−D_W) Q̃_W^T + I − Q̃_W Q̃_W^T,    (3.10)

and

    exp^{1/2}(−S_W) = Q̃_W exp(−D_W/2) Q̃_W^T + I − Q̃_W Q̃_W^T,    (3.11)

where exp^{1/2}(−S_W) represents the square root of exp(−S_W).
Proof. We only prove (3.7); the proof of (3.8) is similar. It follows from (1.1), (1.2) and (3.3)–(3.6) that

    S_B = H_B H_B^T = (Q_B U_B) Σ_B (V_B^T V_B) Σ_B (U_B^T Q_B^T) = Q̃_B D_B Q̃_B^T
        = [Q̃_B, Q̃_B^⊥] diag(D_B, O_{(d−k)×(d−k)}) [Q̃_B, Q̃_B^⊥]^T.

From the properties of matrix functions [17, Theorem 1.13], we obtain

    f(S_B) = [Q̃_B, Q̃_B^⊥] diag(f(D_B), f(O_{(d−k)×(d−k)})) [Q̃_B, Q̃_B^⊥]^T
           = Q̃_B f(D_B) Q̃_B^T + Q̃_B^⊥ f(O_{(d−k)×(d−k)}) (Q̃_B^⊥)^T.

Finally, (3.9)–(3.11) follow from Q̃_B^⊥ (Q̃_B^⊥)^T = I − Q̃_B Q̃_B^T, Q̃_W^⊥ (Q̃_W^⊥)^T = I − Q̃_W Q̃_W^T, and the fact that exp(O) = I for any square zero matrix.
Remark 3.1. According to (3.9) and (3.10), for a given vector v, we can compute the matrix exponential-vector product (3.2) in two steps:

(i)  v ← Q̃_B exp(D_B)(Q̃_B^T v) + v − Q̃_B(Q̃_B^T v);
(ii) w ← Q̃_W exp(−D_W)(Q̃_W^T v) + v − Q̃_W(Q̃_W^T v).

Notice that D_B and D_W are k×k and n×n diagonal matrices, respectively. As a result, there is no need to explicitly form and store the large matrices S_B, S_W and their exponentials exp(S_B), exp(−S_W) in the EDA framework.

Indeed, the generalized matrix exponential eigenproblem (1.7) can also be reformulated as a standard symmetric eigenvalue problem, say, by using the Cholesky decomposition of exp(S_W) [16]. Unfortunately, calculating the Cholesky decomposition of a large matrix can be very expensive. Rather than the Cholesky decomposition, we consider the square root of exp(S_W) [17], which is advantageous from both computational and theoretical points of view. By (1.7),

    [exp^{−1/2}(S_W) exp(S_B) exp^{−1/2}(S_W)] [exp^{1/2}(S_W) x] = λ [exp^{1/2}(S_W) x].    (3.12)

Denoting by exp^{−1/2}(S_W) the inverse of the square root of exp(S_W), and noting that exp^{−1/2}(S_W) = exp^{1/2}(−S_W), (3.12) can be rewritten as

    [exp^{1/2}(−S_W) exp(S_B) exp^{1/2}(−S_W)] y = λ y,    (3.13)

and

    x = exp^{1/2}(−S_W) y    (3.14)

is the desired solution. Notice that the matrix

    M = exp^{1/2}(−S_W) exp(S_B) exp^{1/2}(−S_W)    (3.15)

is a symmetric positive definite (SPD) matrix, and our aim is to compute a few dominant eigenpairs of it. The (symmetric) Lanczos method [4, 31, 37] is a widely used Krylov subspace method for this task, in which the projected matrix H_m becomes symmetric tridiagonal, leading to a three-term recurrence in the Lanczos process. Similar to the Arnoldi process, in the Lanczos process we need to compute the matrix exponential-vector products

    w = M v = exp^{1/2}(−S_W) exp(S_B) exp^{1/2}(−S_W) v    (3.16)

with different vectors v. In terms of (3.9), (3.10) and (3.11), this product can be computed in three steps:

(iii) v ← Q̃_W exp(−D_W/2)(Q̃_W^T v) + v − Q̃_W(Q̃_W^T v);
(iv)  v ← Q̃_B exp(D_B)(Q̃_B^T v) + v − Q̃_B(Q̃_B^T v);
(v)   w ← Q̃_W exp(−D_W/2)(Q̃_W^T v) + v − Q̃_W(Q̃_W^T v).

Compared with the Arnoldi method for the standard nonsymmetric matrix exponential eigenproblem (3.1), the advantage of the Lanczos method for (3.13) is that one can use a three-term recurrence in the Lanczos procedure for a symmetric eigenproblem [4, 31, 37]. The disadvantage is that one has to evaluate three matrix exponential-vector products (iii)–(v) in each step of the Lanczos procedure, rather than two matrix exponential-vector products (i)–(ii) in each step of the Arnoldi procedure. So a natural question is: which one is better, Arnoldi or Lanczos? We believe that the answer is problem-dependent; indeed, we cannot tell which one is definitely better than the other, see the numerical experiments given in Section 4. The pseudo-code for computing (3.2) and (3.16) is listed as follows:

Algorithm 2.
Computing the matrix exponential-vector product (3.2)/(3.16) in the Arnoldi/Lanczos process

• Preprocessing:
1. Given the training set X, form the matrices H_B and H_W;
2. Compute the skinny QR decompositions: H_B = Q_B R_B, H_W = Q_W R_W;
3. Compute the SVDs: R_B = U_B Σ_B V_B^T, R_W = U_W Σ_W V_W^T;
4. Let Q̃_B = Q_B U_B, Q̃_W = Q_W U_W, Σ̃_B = exp(Σ_B^2);

• Computing:
5. Given a vector v;
if computing (3.2) in the Arnoldi process
  6A.  Σ̃_W = exp(−Σ_W^2);
  7A.  w = Q̃_B^T v;
  8A.  v = Q̃_B Σ̃_B w + v − Q̃_B w;
  9A.  w = Q̃_W^T v;
  10A. w = Q̃_W Σ̃_W w + v − Q̃_W w;
end
if computing (3.16) in the Lanczos process
  6L.  Σ̃_W = exp(−Σ_W^2/2);
  7L.  w = Q̃_W^T v;
  8L.  v = Q̃_W Σ̃_W w + v − Q̃_W w;
  9L.  w = Q̃_B^T v;
  10L. v = Q̃_B Σ̃_B w + v − Q̃_B w;
  11L. w = Q̃_W^T v;
  12L. w = Q̃_W Σ̃_W w + v − Q̃_W w;
end
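A compact NumPy realization of Algorithm 2 might read as follows. This is a sketch under our own naming (eda_setup, expv, etc. are not from the paper), with the preprocessing folded into a setup function:

```python
import numpy as np

def eda_setup(H_B, H_W):
    """Preprocessing steps 1-4: skinny QR plus SVD of the small triangular factors."""
    Q_B, R_B = np.linalg.qr(H_B)        # H_B = Q_B R_B,  Q_B: d x k
    Q_W, R_W = np.linalg.qr(H_W)        # H_W = Q_W R_W,  Q_W: d x n
    U_B, s_B, _ = np.linalg.svd(R_B)    # R_B = U_B diag(s_B) V_B^T
    U_W, s_W, _ = np.linalg.svd(R_W)
    return Q_B @ U_B, s_B**2, Q_W @ U_W, s_W**2   # Q~_B, diag(D_B), Q~_W, diag(D_W)

def expv(Q, diag_exp, v):
    """v <- Q exp(D) Q^T v + v - Q Q^T v, with exp(D) stored as a vector."""
    w = Q.T @ v
    return Q @ (diag_exp * w) + v - Q @ w

def arnoldi_product(QB, dB, QW, dW, v):
    """Steps 6A-10A: w = exp(-S_W) exp(S_B) v, cf. (3.2)."""
    return expv(QW, np.exp(-dW), expv(QB, np.exp(dB), v))

def lanczos_product(QB, dB, QW, dW, v):
    """Steps 6L-12L: w = M v with M from (3.15), cf. (3.16)."""
    v = expv(QW, np.exp(-dW / 2), v)
    v = expv(QB, np.exp(dB), v)
    return expv(QW, np.exp(-dW / 2), v)
```

Each product costs O((k + n)d) operations, in line with Remark 3.2 below.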
Remark 3.2.
In practical implementations, the "Preprocessing" phase (steps 1–4) is run once for all, and the quantities computed there (Q̃_B, Q̃_W, Σ̃_B and Σ_W) can be stored for later use. In this phase, one needs to perform two skinny QR decompositions, in O(dk^2) and O(dn^2) flops [16], respectively. Further, it is required to compute the SVDs of the two small matrices R_B and R_W, in O(k^3) and O(n^3) flops [16], respectively, as well as to compute a small diagonal matrix exponential in O(k) flops [17]; these costs are negligible as k ≤ n ≪ d. In the "Computing" phase (steps 6A–10A and 6L–12L), the computational cost is O((k + n)d). Thus, as k ≤ n ≪ d, the main overhead for solving (3.2) and (3.16) is O(d).

On the other hand, if the n training vectors are linearly independent, the ranks of the matrices S_B and S_W are k − 1 and n − k, respectively, and the main storage requirement of Algorithm 2 is to store n − 1 vectors of length d. Therefore, the computational complexities of our new algorithms are much lower than those of Algorithm 1.

In this subsection, we establish new lower and upper bounds for the criteria (1.4) and (1.6), and give a theoretical comparison of the LDA method and the EDA method. Recall from (3.1) and (3.12) that the eigenvalues of the matrix M from (3.15) are the same as those of exp(−S_W) exp(S_B) and of the matrix exponential pencil (exp(S_B), exp(S_W)).

Theorem 3.2.
Denote by ν_1 ≥ ν_2 ≥ ··· ≥ ν_d the eigenvalues of S_W, by μ_1 ≥ μ_2 ≥ ··· ≥ μ_d the eigenvalues of S_B, and let λ_1(M) ≥ λ_2(M) ≥ ··· ≥ λ_d(M) be the eigenvalues of M. Then, for i = 1, 2, ..., d,

    λ_i(M) ≥ max{exp(μ_i − ν_1), exp(μ_d − ν_{d−i+1})},    (3.17)

and

    λ_i(M) ≤ min{exp(μ_i − ν_d), exp(μ_1 − ν_{d−i+1})}.    (3.18)

Proof. Note that λ_i(M) = λ_i(exp(−S_W) exp(S_B)), i = 1, 2, ..., d. Denoting W̃ = exp^{1/2}(−S_W), we have from the Courant-Fischer minimax theorem [16] that

    λ_i(exp(−S_W) exp(S_B)) = λ_i(W̃^T exp(S_B) W̃)
      = max_{dim(S)=i} min_{v ∈ S, ‖v‖=1} v^T W̃^T exp(S_B) W̃ v
      = max_{dim(S)=i} min_{v ∈ S, ‖v‖=1} [(W̃v)^T exp(S_B)(W̃v) / ((W̃v)^T (W̃v))] · [v^T W̃^T W̃ v / (v^T v)].    (3.19)

Recall that the eigenvalues of exp(−S_W) and exp(S_B) are exp(−ν_d) ≥ exp(−ν_{d−1}) ≥ ··· ≥ exp(−ν_1) and exp(μ_1) ≥ exp(μ_2) ≥ ··· ≥ exp(μ_d), respectively. For any v ∈ R^d with ‖v‖ = 1, it follows that

    exp(−ν_1) ≤ v^T W̃^T W̃ v / (v^T v) = v^T exp(−S_W) v / (v^T v) ≤ exp(−ν_d),    (3.20)

and

    exp(μ_i) = max_{dim(S)=i} min_{v ∈ S, ‖v‖=1} (W̃v)^T exp(S_B)(W̃v) / ((W̃v)^T (W̃v)).    (3.21)

Therefore, we have from (3.19)–(3.21) that

    exp(μ_i − ν_1) ≤ λ_i(M) ≤ exp(μ_i − ν_d),  i = 1, 2, ..., d.    (3.22)

On the other hand, λ_i(exp(−S_W) exp(S_B)) = λ_i(exp^{1/2}(S_B) exp(−S_W) exp^{1/2}(S_B)). Using the same trick, we can prove that for i = 1, 2, ..., d,

    exp(μ_d − ν_{d−i+1}) ≤ λ_i(M) ≤ exp(μ_1 − ν_{d−i+1}).    (3.23)

A combination of (3.22) and (3.23) yields (3.17) and (3.18).

We are in a position to establish new lower and upper bounds for the value ρ defined in (1.6). First we need the following lemma.

Lemma 3.3 ([39, Corollary 4.6.4, pp. 101]). Let A and B be two d×d Hermitian matrices, where A is positive semidefinite and B is positive definite, and let X be a d×t matrix with rank(X) = t. Then

    tr(X^H A X (X^H B X)^{-1}) ≥ Σ_{i=1}^{t} λ_{d−t+i}(B^{-1}A),

and

    tr(X^H A X (X^H B X)^{-1}) ≤ Σ_{i=1}^{t} λ_i(B^{-1}A).

Combining Theorem 3.2 and Lemma 3.3, we get the following theorem for the exponential discriminant analysis criterion (1.6).
Theorem 3.4.
Under the above notations, for the EDA criterion there holds

    ρ ≥ Σ_{i=1}^{t} max{exp(μ_{d−t+i} − ν_1), exp(μ_d − ν_{t−i+1})},    (3.24)

and

    ρ ≤ Σ_{i=1}^{t} min{exp(μ_i − ν_d), exp(μ_1 − ν_{d−i+1})}.    (3.25)

Proof. The result follows from Lemma 3.3, (3.17), (3.18), and the fact that for i = 1, 2, ..., t,

    λ_{d−t+i}(exp(−S_W) exp(S_B)) ≥ max{exp(μ_{d−t+i} − ν_1), exp(μ_d − ν_{t−i+1})}.

If S_W is nonsingular, we have the following result on the LDA criterion (1.4), whose proof is similar to that of Theorem 3.4 and thus is omitted.

Theorem 3.5. If S_W is nonsingular, let ν_1 ≥ ν_2 ≥ ··· ≥ ν_d > 0 and μ_1 ≥ μ_2 ≥ ··· ≥ μ_d be the eigenvalues of S_W and S_B, respectively, and let ϱ be the LDA criterion defined in (1.4). Then

    Σ_{i=1}^{t} max{μ_{d−t+i}/ν_1, μ_d/ν_{t−i+1}} ≤ ϱ ≤ Σ_{i=1}^{t} min{μ_i/ν_d, μ_1/ν_{d−i+1}}.    (3.26)
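These eigenvalue bounds are easy to probe numerically. The sketch below (our illustration, under the assumption that random low-rank Gram matrices stand in for the scatter matrices) checks (3.17)–(3.18) on a small example:

```python
import numpy as np
from scipy.linalg import expm, eigh

rng = np.random.default_rng(1)
d = 8
HW = rng.standard_normal((d, 5))
HB = rng.standard_normal((d, 3))
S_W, S_B = HW @ HW.T, HB @ HB.T          # SPSD stand-ins for S_W, S_B

nu = np.sort(eigh(S_W, eigvals_only=True))[::-1]    # nu_1 >= ... >= nu_d
mu = np.sort(eigh(S_B, eigvals_only=True))[::-1]    # mu_1 >= ... >= mu_d
# eigenvalues of M = eigenvalues of the pencil (exp(S_B), exp(S_W))
lam = np.sort(eigh(expm(S_B), expm(S_W), eigvals_only=True))[::-1]

for i in range(d):                        # 0-based: position i is i+1 in the text
    lo = max(np.exp(mu[i] - nu[0]), np.exp(mu[-1] - nu[d - i - 1]))
    hi = min(np.exp(mu[i] - nu[-1]), np.exp(mu[0] - nu[d - i - 1]))
    assert lo * (1 - 1e-8) <= lam[i] <= hi * (1 + 1e-8)   # (3.17) and (3.18)
```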
Without loss of generality, we assume that the n training vectors are linearly independent, so that the ranks of the matrices S_B and S_W are k − 1 and n − k, respectively. As k ≤ n ≪ d, there are many 1's in the spectra of exp(−S_W) and exp(S_B). The following theorem gives the number of unit eigenvalues of M.

Theorem 3.6.
If the n samples in X are linearly independent, then M has at least d − n + 1 eigenvalues that are equal to 1.

Proof. Since the n samples are linearly independent, there are d − k + 1 and d − n + k zero eigenvalues in the spectra of S_B and S_W, respectively. As a result, exp(S_B) and exp(S_W) have d − k + 1 and d − n + k eigenvalues, respectively, that are equal to 1. Let {u_i}_{i=1}^{d−k+1} and {g_i}_{i=1}^{d−n+k} be the orthonormal eigenvectors of exp(S_B) and exp(S_W) corresponding to the eigenvalue 1, respectively, and denote by

    Φ_1 = span{u_1, ..., u_{d−k+1}},    Φ_2 = span{g_1, ..., g_{d−n+k}}

the corresponding eigenspaces (or invariant subspaces) of exp(S_B) and exp(S_W), respectively. Let dim(Φ_i) be the dimension of the space Φ_i (i = 1, 2). Since

    d ≥ dim(Φ_1 + Φ_2) = dim(Φ_1) + dim(Φ_2) − dim(Φ_1 ∩ Φ_2),

we obtain

    dim(Φ_1 ∩ Φ_2) ≥ (d − k + 1) + (d − n + k) − d = d − n + 1.

Therefore, there are at least d − n + 1 linearly independent vectors {z_i}_{i=1}^{d−n+1} ∈ Φ_1 ∩ Φ_2 such that exp(S_W) z_i = z_i and exp(S_B) z_i = z_i. That is,

    exp(−S_W) exp(S_B) z_i = z_i,  i = 1, 2, ..., d − n + 1,

and M has at least d − n + 1 eigenvalues that are equal to 1.
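Theorem 3.6 can also be observed numerically. A small sketch (ours, reusing the scatter_factors helper from the introduction):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
d, n, k = 12, 6, 3
X = rng.standard_normal((d, n))               # n linearly independent samples
labels = np.repeat(np.arange(k), n // k)      # k classes of equal size
H_W, H_B = scatter_factors(X, labels)         # helper from the introduction
A = expm(-(H_W @ H_W.T)) @ expm(H_B @ H_B.T)  # same spectrum as M
lam = np.linalg.eigvals(A).real
print(np.sum(np.isclose(lam, 1.0)))           # at least d - n + 1 = 7 ones
```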
Remark 3.3. It was assumed in [51, pp. 189] that if

    exp(μ_i)/exp(ν_i) > μ_i/ν_i,  i = 1, 2, ..., t,    (3.27)

then there is a difference in diffusion scale between the within-class and between-class distances, and the diffusion scale of the between-class distance is larger than that of the within-class distance. Hence, the margin between different classes is enlarged, which is helpful in improving classification accuracy. Theorems 3.4–3.6 show this more precisely. They indicate that if

    exp(μ_i)/exp(ν_d) > μ_i/ν_d  or  exp(μ_1)/exp(ν_{d−i+1}) > μ_1/ν_{d−i+1},  i = 1, 2, ..., t,    (3.28)

then ρ can be larger than ϱ. Notice that the conditions in (3.28) are (much) weaker than those in (3.27). So our result is stronger than the one given in [51], and can be useful to clarify the motivation of EDA.

In the new strategy, one needs to solve the large matrix exponential eigenproblem (3.1) or (3.12) to a prescribed accuracy. If the desired accuracy is high (say, a tolerance near the machine precision ε ≈ 2.2 × 10^{−16}), then the cost for solving the large eigenproblems will be very high. In this subsection, we give a theoretical analysis of the relationship between the accuracy of the eigenvectors and the distance measure of the nearest neighbour classifier (NN) [10]. Based on the theoretical results, we provide a practical stopping criterion for the matrix exponential eigenproblems.

K nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure [13], e.g., the Euclidean distance. It has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970's. In KNN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, the object being assigned to the class most common among its K nearest neighbours (where K is a positive integer, typically small). If K = 1, then the case is simply assigned to the class of its nearest neighbour (NN) [10].

Denote by V, Ṽ ∈ R^{d×t} the orthonormal matrices whose columns are the "exact" and "computed" solutions of (3.1) or (3.12), respectively. Let x̂_i be a sample from the training set, and ŷ_j be a sample from the testing set. Then the nearest neighbour classifier assigns class membership by investigating the Euclidean distance [10]

    d_ij = ‖V V^T (x̂_i − ŷ_j)‖ = ‖V^T (x̂_i − ŷ_j)‖,    (3.29)

where ‖·‖ denotes the 2-norm (Euclidean norm) of a vector or matrix. The following theorem sheds light on why we can solve the matrix exponential eigenproblems approximately.

Theorem 3.7.
Let V, Ṽ ∈ R^{d×t} be orthonormal matrices whose columns are the "exact" and "computed" solutions of (3.1) or (3.12), respectively. Denote by d_ij = ‖V^T(x̂_i − ŷ_j)‖ and d̃_ij = ‖Ṽ^T(x̂_i − ŷ_j)‖ the "exact" and "computed" Euclidean distances, respectively, and let sin∠(V, Ṽ) = ‖(I − V V^T) Ṽ‖ be the distance between the eigenspace span{V} and the approximate eigenspace span{Ṽ}. If ‖x̂_i‖ = ‖ŷ_j‖ = 1 and cos∠(V, Ṽ) ≠ 0, then

    (d̃_ij − 2 sin∠(V, Ṽ)) / cos∠(V, Ṽ) ≤ d_ij ≤ d̃_ij cos∠(V, Ṽ) + 2 sin∠(V, Ṽ).    (3.30)

Proof. Note that V = Ṽ Ṽ^T V + (I − Ṽ Ṽ^T) V, and cos∠(V, Ṽ) = ‖V^T Ṽ‖ [37]. Thus,

    d_ij = ‖V^T(x̂_i − ŷ_j)‖
         = ‖(V^T Ṽ) Ṽ^T (x̂_i − ŷ_j) + V^T (I − Ṽ Ṽ^T)(x̂_i − ŷ_j)‖
         ≤ d̃_ij ‖V^T Ṽ‖ + ‖V^T(I − Ṽ Ṽ^T)‖ · ‖x̂_i − ŷ_j‖
         = d̃_ij cos∠(V, Ṽ) + sin∠(V, Ṽ) · ‖x̂_i − ŷ_j‖
         ≤ d̃_ij cos∠(V, Ṽ) + 2 sin∠(V, Ṽ),

where we used ‖x̂_i − ŷ_j‖ ≤ ‖x̂_i‖ + ‖ŷ_j‖ ≤ 2.
Using the same trick, we can prove that

    d̃_ij ≤ d_ij cos∠(V, Ṽ) + 2 sin∠(V, Ṽ);

a combination of the above two inequalities gives (3.30).
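The quantities in Theorem 3.7 can be computed directly. The following small check (our illustration) perturbs an orthonormal V and verifies (3.30):

```python
import numpy as np

rng = np.random.default_rng(3)
d, t = 50, 5
V, _ = np.linalg.qr(rng.standard_normal((d, t)))               # "exact" basis
Vt, _ = np.linalg.qr(V + 1e-3 * rng.standard_normal((d, t)))   # "computed" basis

sin_a = np.linalg.norm((np.eye(d) - V @ V.T) @ Vt, 2)          # sin angle(V, V~)
cos_a = np.linalg.norm(V.T @ Vt, 2)                            # cos angle(V, V~)

x = rng.standard_normal(d); x /= np.linalg.norm(x)             # unit training sample
y = rng.standard_normal(d); y /= np.linalg.norm(y)             # unit test sample
d_ij = np.linalg.norm(V.T @ (x - y))                           # "exact" distance
d_ij_t = np.linalg.norm(Vt.T @ (x - y))                        # "computed" distance

assert (d_ij_t - 2 * sin_a) / cos_a <= d_ij <= d_ij_t * cos_a + 2 * sin_a  # (3.30)
```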
Remark 3.4.
Theorem 3.7 indicates that if sin∠(V, Ṽ) is small, then the d_ij's and the d̃_ij's will be close to each other. Thus, there is no need to compute the eigenvectors too accurately in practice. This explains why the matrix exponential eigenproblems (3.1) and (3.13) can be solved approximately in practice. However, we cannot get the value of sin∠(V, Ṽ) a priori, as the "exact" eigenspace span{V} is unavailable. Let (λ_1, x_1), ..., (λ_t, x_t) be the desired eigenpairs. It was shown that if the minimal distance between the Ritz values λ̃_1, λ̃_2, ..., λ̃_t and the other eigenvalues (i.e., the eigenvalues other than λ̃_1, λ̃_2, ..., λ̃_t) is sufficiently large, then sin∠(V, Ṽ) is proportional to the residual norms of the Ritz pairs [20, 37]. Therefore, we can use the largest residual norm of the Ritz pairs in place of sin∠(V, Ṽ) in practice. Experimentally, we find that a modest tolerance is good enough for the matrix exponential eigenproblems; see Example 4.3.

In summary, we propose the main algorithm of this work for solving the large generalized matrix exponential eigenproblems.
Algorithm 3.
Inexact Krylov-EDA algorithms for matrix exponential discriminant analysis
Steps 1–4 are the same as those of Algorithm 2.
5. Given a convergence threshold tol (cf. Remark 3.4), compute the desired eigenvectors by using a restarted Krylov subspace algorithm (e.g., [3, 36]) for solving (3.1) (by using the Arnoldi method) or (3.13) (by using the Lanczos method), where steps 6A–10A or 6L–12L of Algorithm 2 are used for the matrix exponential-vector products.
6. Orthogonalize the columns of the projection matrix V.

Remark 3.5.
Our new methods fall within the class of inexact Krylov subspace methods. However, they are different from the inner-outer Krylov subspace methods with inexact matrix-vector products [6, 7, 33, 34]. In those methods, the tolerance of the inner iterative process (i.e., approximating the matrix exponential action) can be significantly relaxed once the outer process (i.e., the Arnoldi/Lanczos eigenvalue solver) has started to converge. In our new methods, by contrast, the matrix exponential-vector products are computed "exactly" based on closed-form formulae, and the tolerance for the Arnoldi/Lanczos eigenvalue solver in Algorithm 3 is determined in advance.
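In a SciPy environment, Algorithm 3 can be sketched with a LinearOperator wrapped around the products of Algorithm 2 and handed to an implicitly restarted solver (eigsh, the analogue of MATLAB's eigs.m). The helper names and the tol value below are illustrative only, reusing the eda_setup, expv and lanczos_product sketches given earlier:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def lanczos_eda(H_B, H_W, t, tol=1e-3):
    """Inexact Lanczos-EDA: t dominant eigenpairs of the SPD matrix M in (3.13)."""
    d = H_B.shape[0]
    QB, dB, QW, dW = eda_setup(H_B, H_W)            # Algorithm 2, steps 1-4
    M = LinearOperator((d, d),
                       matvec=lambda v: lanczos_product(QB, dB, QW, dW, v))
    lam, Y = eigsh(M, k=t, which='LM', tol=tol)     # implicitly restarted Lanczos
    # back-transform x = exp^{1/2}(-S_W) y, cf. (3.14)
    X = np.column_stack([expv(QW, np.exp(-dW / 2), y) for y in Y.T])
    Q, _ = np.linalg.qr(X)                          # step 6: orthogonalize V
    return lam, Q
```

The Arnoldi variant is analogous: wrap arnoldi_product in a LinearOperator and call scipy.sparse.linalg.eigs on the nonsymmetric problem (3.1), with no back-transformation needed.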
4. Numerical Experiments.
In this section, we present numerical experiments to illustrate the efficiency of Algorithm 3 for face recognition. All the numerical experiments were run on a Dell PC with an eight-core Intel(R) Core(TM) i7-2600 processor with CPU 3.40 GHz and RAM 16.0 GB, under the Windows 7 64-bit operating system; all the numerical results were obtained using a MATLAB 7.10.0 implementation. In all the algorithms under comparison, the projection matrix V is composed of the t = k − 1 dominant discriminant vectors, where k is the number of classes.
We apply the nearest neighbour classifier (NN) [10] with the L_2 metric as the distance measure. Each column χ_i of the data matrices is scaled by its 2-norm, and 10 random splits are run so that one can obtain stable recognition results. In all the EDA-based algorithms, the columns of V are orthonormalized by using the MATLAB built-in function orth.m, where the orth operation is performed via a QR decomposition, i.e., stabilized Gram-Schmidt.

For the performance of the EDA-based methods and how they compare with the current state-of-the-art, we refer to [11, 40, 41, 46, 51]. Indeed, if the large matrix exponential eigenproblems (1.7) and (3.13) were solved "exactly", then our new methods would be mathematically equivalent to the original EDA method proposed in [51]. The main aim of this section is to show that Algorithm 3 outperforms Algorithm 1 in terms of CPU time, while the classification accuracy of the former is comparable to that of the latter.

Figure 4.1, Example 4.1: Sample face images of 3 individuals from the AR database, d = 50 × 40.

The different EDA-based algorithms are listed as follows:
• (Inexact) Arnoldi-EDA (Algorithm 3): We apply Algorithm 3 to solve the large nonsymmetric matrix exponential eigenproblem (3.1), in which the implicitly restarted Arnoldi algorithm [36] (the MATLAB built-in function eigs.m) is used. The matrix exponential-vector products are computed by using Algorithm 2, and a modest stopping tolerance is used for the large eigenproblem.
• (Inexact) Lanczos-EDA (Algorithm 3): We apply Algorithm 3 to solve the large symmetric matrix exponential eigenproblem (3.13), in which the implicitly restarted Lanczos algorithm [36] (the MATLAB built-in function eigs.m) is used. The matrix exponential-vector products are computed by using Algorithm 2, and the same modest stopping tolerance is used.
• EDA-eigs: Solving the generalized symmetric matrix exponential eigenproblem (1.7) by using the MATLAB built-in function eigs.m, with a more stringent convergence threshold. This mimics forming the two matrix exponentials exp(S_B) and exp(S_W) explicitly while solving the matrix exponential eigenproblem iteratively with a relatively higher accuracy. The matrix exponentials are evaluated by using the MATLAB built-in function expm.m.
• EDA (Algorithm 1): The EDA algorithm advocated in [51], in which we form exp(−S_W) exp(S_B) and solve the nonsymmetric matrix exponential eigenproblem (3.1) by using the implicit QR algorithm [4, 16] (the MATLAB built-in function eig.m). The matrix exponentials are evaluated by using the MATLAB built-in function expm.m.

Example 4.1.
In this example, we compare the EDA-based algorithms with LDA+PCA [5] and the classical LDA method [12, 14], and show the efficiency of the exponential discriminant analysis method. A subset of the AR database [21] is used here, with 1680 face images from 120 persons (14 images per person); all images are cropped and scaled to 50 × 40.
Figure 4.1 presents sample cropped AR database images of three individuals. A random subset with ℓ = 2, 3, 5 images per individual is taken to form the training set, and the rest of the images are used for testing (the eigenproblem of the classical LDA method is solved by the MATLAB built-in function eig.m). As was done in [51], for LDA+PCA we reserve 99% of the energy in the PCA stage, followed by LDA. Table 4.1 lists the CPU time (in seconds) and the recognition accuracy obtained from the different methods. When ℓ = 3, we plot in Figure 4.2 the recognition accuracy of the six algorithms with respect to the projected dimension on the AR database. Notice that the curves of Arnoldi-EDA, EDA-eigs, and EDA overlap with each other.

We see from Table 4.1 that the recognition accuracies obtained from the four EDA-based algorithms are about the same, and are (much) higher than those from LDA+PCA and classical LDA. Furthermore, both Arnoldi-EDA and Lanczos-EDA converge much faster than EDA-eigs and EDA, while EDA-eigs outperforms EDA. The classical LDA method performs the poorest in terms of CPU time and recognition accuracy.
Figure 4.2, Example 4.1: Recognition accuracy with respect to the projected dimension on the AR database, 3 Train.
Algorithm        2 Train         3 Train         5 Train
Arnoldi-EDA      0.51 (84.3%)    0.71 (92.0%)    1.28 (97.0%)
Lanczos-EDA      0.55 (84.6%)    0.80 (92.2%)    1.58 (97.3%)
EDA-eigs         6.15 (84.3%)    6.69 (92.0%)    7.35 (97.0%)
EDA              9.61 (84.3%)    9.75 (92.0%)    10.0 (97.0%)
LDA+PCA          0.09 (82.6%)    0.21 (87.7%)    0.46 (94.8%)
Classical LDA    51.4 (65.0%)    52.1 (71.3%)    52.0 (78.0%)

Table 4.1, Example 4.1: CPU time in seconds and recognition accuracy (in brackets) of the six algorithms on the AR database, d = 50 × 40, k = 120.

This is due to the fact that the dimension d = 2000 > n = 1680, so we suffer from the SSS problem; recall that the classical LDA method cannot cure this drawback properly.

Example 4.2.
The aim of this example is two-fold. First, we show that our new algorithms Arnoldi-EDA and Lanczos-EDA run much faster than EDA and EDA-eigs for face recognition. Second, we illustrate the effectiveness of Theorem 3.6. There are two data sets in this example. The first one is the Yale face database, taken from the Yale Center for Computational Vision and Control. It contains 165 grayscale images of k = 15 individuals. The images demonstrate variation in the following expressions or configurations: 1) lighting: center light, left light, and right light; 2) with/without glasses; and 3) facial expressions: normal, happy, sad, sleepy, surprised, and winking. The original image size is 320 × 243 pixels.
Figure 4.3 gives sample cropped Yale database images of three individuals.

Figure 4.3, Example 4.2: Sample face images of 3 individuals from the Yale database, d = 64 × 64.

The second test set is the Extended YaleB database. This database contains 5760 single-light-source images of 10 subjects, each seen under 576 viewing conditions (9 different poses and 64 illumination conditions per person). The images have normal, sleepy, sad and surprised expressions, and there are images with and without glasses. The images are captured by varying the position of the light source at the center, left or right. For every subject in a particular pose, an image with ambient (background) illumination was also captured. A subset of k = 38 subjects with 2432 images is used in this example (64 images per individual, with varying illumination). Figure 4.4 shows sample cropped Extended YaleB database images of three individuals.

Figure 4.4, Example 4.2: Sample face images of 3 individuals from the Extended YaleB database, d = 64 × 64.

In this example, all images are aligned based on eye coordinates and are cropped and scaled to 32 × 32,
64 × 64, and 100 × 100 pixels, respectively. A random subset with ℓ = 3, 5, 8 images per individual is taken to form the training set, and the rest of the images are used for testing; the results are reported in Tables 4.2 and 4.3. We have the following observations.
First, when d = 100 × 100 and ℓ = 3, Arnoldi-EDA and Lanczos-EDA used 0.12 and 0.11 seconds, while EDA-eigs and EDA used 409.4 and 1027.4 seconds, respectively. Thus, both the Arnoldi-EDA and the Lanczos-EDA algorithms circumvent the drawback of heavy computational complexity that bothers the original EDA algorithm, while keeping comparable recognition accuracy. Second, cropping the original images may lose some useful information and thus may result in lower recognition accuracy. For instance, for the Yale database with ℓ = 5, the recognition accuracy of Arnoldi-EDA applied to the d = 100 × 100 images is 93.8%, while it is only 74.7% for d = 32 × 32. Third, it is seen from Tables 4.2–4.3 that we cannot tell which one, Arnoldi-EDA or Lanczos-EDA, is definitely better than the other in terms of CPU time and recognition accuracy.

In order to show the effectiveness of Theorem 3.6, we plot in Figure 4.5 the first k − 1 eigenvalues and the k-th to the (d − n + k)-th eigenvalues of the matrix exp(−S_W) exp(S_B) for d = 32 × 32. One observes that there are d − n + 1 eigenvalues equal to 1, which coincides with the theoretical results given in Theorem 3.6.

Example 4.3.
In this example, we show the efficiency of our "inexact" strategy (see Theorem 3.7 and Remark 3.4) for solving the large matrix exponential eigenproblems (3.1) and (3.13). The FERET database is one of the standard image databases specially collected for face recognition algorithms [29]. The final corpus consists of 14051 eight-bit grayscale images of human heads with views ranging from frontal to left and right profiles. In this example, we consider a subset of 1400 images of k = 200 individuals, each person contributing seven images with different illumination and expression variations. Before the experiment, the facial images are cropped to a size of 80 × 80 pixels.
Figure 4.6 shows sample cropped FERET database images of three individuals. A random subset of the images of each individual is taken to form the training set, and the projection matrix V is composed of the t = k − 1 dominant eigenvectors, where d = 80 × 80.
Figure 4.5, Example 4.2: The first k − 1 eigenvalues (Left) and the k-th to the (d − n + k)-th eigenvalues (Right) of exp(−S_W) exp(S_B), Yale database, d = 32 × 32, k = 15; 3 Train.

Algorithm      d           3 Train          5 Train          8 Train
Arnoldi-EDA    32 × 32     0.04 (65.3%)     0.04 (74.7%)     0.06 (83.1%)
Lanczos-EDA    32 × 32     0.02 (65.2%)     0.02 (73.8%)     0.04 (83.1%)
EDA-eigs       32 × 32     0.76 (65.8%)     0.79 (74.7%)     0.78 (83.1%)
EDA            32 × 32     1.24 (65.8%)     1.23 (74.7%)     1.24 (83.1%)
Arnoldi-EDA    64 × 64     0.06 (74.0%)     0.08 (83.8%)     0.11 (90.2%)
Lanczos-EDA    64 × 64     0.04 (72.8%)     0.07 (84.9%)     0.11 (90.2%)
EDA-eigs       64 × 64     29.7 (74.8%)     31.1 (84.7%)     31.4 (89.8%)
EDA            64 × 64     73.0 (74.8%)     74.7 (84.7%)     75.1 (89.8%)
Arnoldi-EDA    100 × 100   0.12 (88.3%)     0.15 (93.8%)     0.25 (96.4%)
Lanczos-EDA    100 × 100   0.11 (88.5%)     0.16 (95.3%)     0.31 (96.0%)
EDA-eigs       100 × 100   409.4 (88.0%)    397.1 (94.8%)    401.4 (96.9%)
EDA            100 × 100   1027.4 (88.0%)   993.5 (94.9%)    1009.2 (96.9%)

Table 4.2, Example 4.2: CPU time in seconds and recognition accuracy (in brackets) of the four algorithms on the Yale database, k = 15.

In order to show the efficiency of our inexact strategy, we run the (implicitly restarted) Arnoldi and Lanczos algorithms in Step 5 of Algorithm 3 with convergence tolerances tol = 10^{−1}, 10^{−2}, 10^{−3}, 10^{−4} and 10^{−5}, respectively. Table 4.4 presents the numerical results. We observe that the two inexact algorithms work quite well in all the cases, and a modest tolerance is good enough. Indeed, as is shown by Theorem 3.7, it is unnecessary to solve the large exponential eigenproblems to very high accuracy. This is favorable for very large matrix computations arising from high-dimensionality reduction.

It is seen that the recognition rates of Lanczos-EDA are lower than those of Arnoldi-EDA for this problem. A possible reason is that the approximation y from Lanczos-EDA undergoes an exponential transformation to get the desired solution x; see (3.14).
Algorithm      d           3 Train          5 Train          8 Train
Arnoldi-EDA    32 × 32     0.07 (55.9%)     0.11 (73.0%)     0.16 (82.8%)
Lanczos-EDA    32 × 32     0.06 (59.5%)     0.11 (71.2%)     0.15 (82.9%)
EDA-eigs       32 × 32     1.02 (56.6%)     1.16 (73.2%)     1.28 (83.7%)
EDA            32 × 32     1.28 (56.6%)     1.34 (73.2%)     1.35 (83.7%)
Arnoldi-EDA    64 × 64     0.23 (56.3%)     0.50 (72.9%)     0.78 (83.2%)
Lanczos-EDA    64 × 64     0.23 (59.4%)     0.58 (71.1%)     0.99 (83.0%)
EDA-eigs       64 × 64     36.1 (56.9%)     37.7 (73.4%)     41.2 (84.1%)
EDA            64 × 64     78.0 (56.9%)     78.4 (73.4%)     81.4 (84.1%)
Arnoldi-EDA    100 × 100   0.60 (56.6%)     1.46 (73.9%)     1.90 (83.2%)
Lanczos-EDA    100 × 100   0.74 (59.4%)     1.88 (71.4%)     2.46 (83.3%)
EDA-eigs       100 × 100   421.7 (57.3%)    444.8 (73.7%)    541.8 (84.2%)
EDA            100 × 100   1013.2 (57.3%)   1030.0 (73.7%)   1129.7 (84.2%)

Table 4.3, Example 4.2: CPU time in seconds and recognition accuracy (in brackets) of the four algorithms on the Extended YaleB database, k = 38.

Figure 4.6, Example 4.3: Sample face images of 3 individuals from the FERET database, d = 80 × 80.

As a comparison, we also list the numerical results obtained from running EDA and LDA+PCA. One observes that our new algorithms run much faster than EDA, and are superior to LDA+PCA in terms of recognition accuracy.
[Table 4.4 rows: Arnoldi-EDA and Lanczos-EDA with tol = 10^{−1} through 10^{−5}; the numerical entries are not recoverable from the source.]

Table 4.4, Example 4.3: CPU time in seconds and recognition accuracy (in brackets) of the Arnoldi-EDA and Lanczos-EDA algorithms (with different tolerances for the exponential eigenproblems) on the FERET database, d = 80 × 80, k = 200.

Example 4.4.
In this example, we demonstrate that Arnoldi-EDA and Lanczos-EDA are suitable for data sets of high dimension. There are two test sets in this example. The ORL database contains 400 face images of k = 40 individuals, and the image size is 92 × 112. A random subset with ℓ = 2, 3, 5 images per individual is taken to form the training set, and the rest of the images are used for testing.
Figure 4.7, Example 4.4: Sample face images of 3 individuals from the ORL database, d = 92 × 112.

Figure 4.8, Example 4.4: Sample face images of 3 individuals from the CMU-PIE database, d = 640 × 486.

The CMU-PIE database contains more than 40,000 images of 68 subjects, with more than 500 images for each. These face images are captured by 13 synchronized cameras and 21 flashes under varying pose, illumination, expression and lights. In this experiment, we choose k = 10 subjects and 10 images under different illuminations, lights, expressions and poses for each subject. Thus, the total number of images chosen from the CMU-PIE database is 100. The image size is d = 486 ×
640 pixels, so we have to deal with eigenproblems of size 311,040 × 311,040. A random subset with ℓ = 3, 5, 8 images per subject is taken to form the training set. Tables 4.5 and 4.6 list the numerical results, and Figures 4.9 and 4.10 plot the recognition accuracy with respect to the projected dimension, with ℓ = 5 and 3, respectively.

It is observed from Table 4.5 and Figure 4.9 that the recognition accuracies of the four EDA-based algorithms are higher than those of many state-of-the-art algorithms such as LDA+PCA, PCA, LDA/QR and LDA/GSVD. This coincides with the numerical results reported in [11, 40, 41, 46, 51]. We notice that the recognition accuracies of RLDA and NLDA are comparable to those of the four EDA-based methods, while Arnoldi-EDA and Lanczos-EDA run much faster than RLDA, NLDA and LDA/GSVD.

Moreover, it is seen from Table 4.6 and Figure 4.10 that for very large data sets such as CMU-PIE, the algorithms EDA-eigs, EDA, RLDA, NLDA and LDA/GSVD all fail to work, due to their heavy storage requirements. As a comparison, Arnoldi-EDA and Lanczos-EDA perform quite well in all the situations, and their recognition accuracies are a little higher than those of PCA, LDA+PCA and LDA/QR. Consequently, our new algorithms are suitable for data sets of high dimension, and are competitive alternatives to EDA for dimensionality reduction.
5. Conclusions and future work.
The computation of the large-scale matrix exponential eigenproblem is the bottleneck in the framework of EDA for high-dimensionality reduction [1, 11, 40, 41, 46, 51]. In this paper, we propose two inexact Krylov subspace algorithms, i.e., the inexact Arnoldi algorithm and the inexact Lanczos algorithm, for the nonsymmetric and symmetric matrix exponential eigenproblems, respectively. Our main contribution is to investigate computing matrix exponential-vector products efficiently, and to solve the matrix exponential eigenproblems "approximately". The relationship between the accuracy of the approximate eigenvectors and the distance used by the nearest neighbour classifier (NN) is revealed. A theoretical comparison of the LDA criterion and the EDA criterion is also given.
Figure 4.9, Example 4.4: Recognition accuracy with respect to the projected dimension on the ORL database, 5 Train.
Figure 4.10, Example 4.4: Recognition accuracy with respect to the projected dimension on the CMU-PIE database, 3 Train.

Experimental results on popular databases for face recognition have validated the effectiveness of the proposed inexact Krylov subspace algorithms.

We point out that our new strategies also apply to other exponential-based methods, such as exponential Laplacian embedding, exponential LDE, exponential LPP, exponential discriminant regularization, and 2DEDA, for high-dimensionality reduction beyond face recognition [1, 11, 40, 41, 46, 51]. Furthermore, how to combine our new strategies with other eigensolvers deserves further investigation. For instance, the Jacobi-Davidson method [35] is a successful eigenvalue solver for large eigenproblems.
Algorithm      2 Train           3 Train           5 Train
Arnoldi-EDA    0.36 (84.3%)      0.51 (87.8%)      0.72 (96.4%)
Lanczos-EDA    0.41 (83.1%)      0.63 (86.9%)      0.98 (95.7%)
EDA-eigs       407.9 (84.3%)     418.2 (87.8%)     426.0 (96.4%)
EDA            1079.9 (84.3%)    1090.9 (87.8%)    1098.9 (96.4%)
LDA+PCA        0.05 (83.2%)      0.08 (86.7%)      0.17 (91.6%)
PCA            0.03 (78.4%)      0.06 (83.4%)      0.12 (93.5%)
RLDA           1142.3 (84.3%)    1165.7 (88.3%)    1159.0 (95.8%)
NLDA           12.0 (83.8%)      17.4 (87.7%)      16.5 (94.8%)
LDA/QR         0.09 (80.2%)      0.12 (85.0%)      0.17 (94.0%)
LDA/GSVD       10.9 (82.3%)      16.3 (85.9%)      15.5 (93.6%)

Table 4.5, Example 4.4: CPU time in seconds and recognition accuracy (in brackets) of the 10 algorithms on the ORL database, d = 92 × 112, k = 40.

Algorithm      3 Train           5 Train           8 Train
Arnoldi-EDA    1.98 (88.9%)      3.44 (94.4%)      5.81 (92.0%)
Lanczos-EDA    2.46 (90.0%)      4.17 (92.4%)      6.93 (91.0%)
EDA-eigs       –                 –                 –
EDA            –                 –                 –
LDA+PCA        0.36 (88.6%)      0.74 (94.0%)      1.15 (92.0%)
PCA            0.31 (82.3%)      0.67 (88.4%)      1.06 (89.0%)
RLDA           –                 –                 –
NLDA           –                 –                 –
LDA/QR         0.47 (88.9%)      0.64 (93.6%)      0.89 (92.0%)
LDA/GSVD       –                 –                 –
Table 4.6, Example 4.4: CPU time in seconds and recognition accuracy (in brackets) of the 10 algorithms on the CMU-PIE database, d = 640 × 486, k = 10. Here "–" means the algorithm suffers from "Out of Memory".

This method is applicable to the matrix exponential eigenproblems (3.1) and (3.12). Furthermore, the bound (3.30) only relates the perturbation of the projections to the perturbation of the projected points for the nearest neighbour classifier (NN). It is interesting to establish a performance bound for general KNN, which should also involve the distances between points of different classes.

REFERENCES
[1] N. Ahmed. Exponential discriminant regularization using nonnegative constraint and image descriptor, IEEE 9th International Conference on Emerging Technologies, pp. 1–6, 2013.
[2] S. An, W. Liu, S. Venkatesh, and H. Yan. Unified formulation of linear discriminant analysis methods and optimal parameter selection, Pattern Recognition, 44: 307–319, 2011.
[3] J. Baglama, D. Calvetti, and L. Reichel. A MATLAB program for computing a few eigenpairs of a large sparse Hermitian matrix, ACM Transactions on Mathematical Software, 29: 337–348, 2003.
[4] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, 2000.
[5] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class-specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19: 711–720, 1997.
[6] A. Bouras and V. Frayssé. A relaxation strategy for the Arnoldi method in eigenproblems, Technical Report TR/PA/00/16, CERFACS, Toulouse, France, 2000.
[7] A. Bouras and V. Frayssé. Inexact matrix-vector products in Krylov methods for solving linear systems: a relaxation strategy, SIAM Journal on Matrix Analysis and Applications, 26: 660–678, 2005.
[8] L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can solve the small-sample-size problem, Pattern Recognition, 33: 1713–1726, 2000.
[9] B. Cipra. The best of the 20th century: editors name top 10 algorithms, SIAM News, 33(4), 2000.
[10] T. Cover and P. Hart. Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13: 21–27, 1967.
[11] F. Dornaika and A. Bosaghzadeh. Exponential local discriminant embedding and its application to face recognition, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 43: 921–934, 2013.
[12] R. Duda, P. Hart, and D. Stork. Pattern Classification, 2nd ed., Wiley, New York, 2000.
[13] B. Everitt, S. Landau, M. Leese, and D. Stahl. Miscellaneous Clustering Methods, in Cluster Analysis, 5th ed., John Wiley & Sons, Chichester, UK, 2011.
[14] R. Fisher. The statistical utilization of multiple measurements, Annals of Eugenics, 8: 376–386, 1938.
[15] J. Friedman. Regularized discriminant analysis, Journal of the American Statistical Association, 84: 165–175, 1989.
[16] G.H. Golub and C.F. Van Loan. Matrix Computations, 4th ed., The Johns Hopkins University Press, 2013.
[17] N.J. Higham. Functions of Matrices: Theory and Computation, SIAM, Philadelphia, 2008.
[18] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26: 995–1006, 2004.
[19] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited, IEEE Transactions on Neural Networks, 20: 729–735, 2009.
[20] Z. Jia and G.W. Stewart. An analysis of the Rayleigh-Ritz method for approximating eigenspaces, Mathematics of Computation, 70: 637–647, 2001.
[21] A. Martinez. Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24: 748–763, 2002.
[22] C. Moler and C.F. Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later, SIAM Review, 45: 3–49, 2003.
[23] T. Ngo, M. Bellalij, and Y. Saad. The trace optimization problem for dimensionality reduction, SIAM Review, 54: 545–569, 2012.
[24] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan. Trace ratio criterion for feature selection, Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 671–676, 2008.
[25] F. Nie, S. Xiang, Y. Jia, and S. Zhang. Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recognition, 42: 2615–2627, 2009.
[26] F. Nie, S. Xiang, Y. Liu, C. Hou, and C. Zhang. Orthogonal vs. uncorrelated least squares discriminant analysis for feature extraction, Pattern Recognition Letters, 33: 485–491, 2012.
[27] F. Nie, S. Xiang, and C. Zhang. Neighborhood MinMax Projections, The Twentieth International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, pp. 993–997, 2007.
[28] C. Park and H. Park. A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition, 41: 1083–1097, 2008.
[29] P. Phillips, H. Moon, P.J. Rauss, and S. Rizvi. The FERET evaluation methodology for face recognition algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22: 1090–1104, 2000.
[30] M. Popolizio and V. Simoncini. Acceleration techniques for approximating the matrix exponential operator, SIAM Journal on Matrix Analysis and Applications, 30: 657–683, 2008.
[31] Y. Saad. Numerical Methods for Large Eigenvalue Problems, 2nd ed., SIAM, Philadelphia, 2011.
[32] Y. Saad. Analysis of some Krylov subspace approximations to the matrix exponential operator, SIAM Journal on Numerical Analysis, 29: 209–228, 1992.
[33] V. Simoncini. Variable accuracy of matrix-vector products in projection methods for eigencomputation, SIAM Journal on Numerical Analysis, 43: 1155–1174, 2005.
[34] V. Simoncini and D. Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing, SIAM Journal on Scientific Computing, 25: 454–477, 2003.
[35] G. Sleijpen and H. van der Vorst. A Jacobi-Davidson iteration method for linear eigenvalue problems, SIAM Journal on Matrix Analysis and Applications, 17: 401–425, 1996.
[36] D. Sorensen. Implicit application of polynomial filters in a k-step Arnoldi method, SIAM Journal on Matrix Analysis and Applications, 13: 357–385, 1992.
[37] G.W. Stewart. Matrix Algorithms II: Eigensystems, SIAM, Philadelphia, 2001.
[38] M. Turk and A. Pentland. Eigenfaces for recognition, Journal of Cognitive Neuroscience, 3: 71–86, 1991.
[39] S.G. Wang, M. Wu, and Z.Z. Jia. Matrix Inequalities (in Chinese), 2nd ed., Science Academic Press, Beijing, 2005.
[40] S. Wang, H. Chen, X. Peng, and C. Zhou. Exponential locality preserving projections for small sample size problem, Neurocomputing, 74: 36–54, 2011.
[41] S. Wang, S. Yan, J. Yang, C. Zhou, and X. Fu. A general exponential framework for dimensionality reduction, IEEE Transactions on Image Processing, 23: 920–930, 2014.
[42] X. Wang and X. Tang. A unified framework for subspace face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26: 1222–1228, 2004.
[43] G. Wu and Y. Wei. On analysis of projection methods for rational function approximation to the matrix exponential, SIAM Journal on Numerical Analysis, 48: 191–197, 2010.
[44] G. Wu, W. Xu, and H. Leng. Inexact and incremental bilinear Lanczos components algorithms for high dimensionality reduction and image reconstruction, Pattern Recognition, 48: 244–263, 2015.
[45] G. Wu, H. Pang, and J. Sun. Preconditioning the restarted and shifted block FOM algorithm for matrix exponential computation, arXiv preprint, available online at http://arxiv.org/abs/1405.0707.
[46] L. Yan and J. Pan. Two-dimensional exponential discriminant analysis and its application to face recognition, International Conference on Computational Aspects of Social Networks (CASoN), pp. 528–531, 2010.
[47] J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, Journal of Machine Learning Research, 6: 483–502, 2006.
[48] J. Ye, R. Janardan, C.H. Park, and H. Park. An optimization criterion for generalized discriminant analysis on undersampled problems, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26: 982–994, 2004.
[49] J. Ye and Q. Li. A two-stage linear discriminant analysis via QR decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27: 929–941, 2005.
[50] H. Yu and H. Yang. A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition, 34: 2067–2070, 2001.
[51]