Robust Bhattacharyya bound linear discriminant analysis through adaptive algorithm
Chun-Na Li^a, Yuan-Hai Shao^b, Zhen Wang^c, Nai-Yang Deng^d

a Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, P.R. China
b School of Economics and Management, Hainan University, Haikou, 570228, P.R. China
c School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R. China
d College of Science, China Agricultural University, Beijing, 100083, P.R. China
Abstract
In this paper, we propose novel linear discriminant analysis criteria via Bhattacharyya error bound estimation, based on the L1-norm (L1BLDA) and the L2-norm (L2BLDA). Both L1BLDA and L2BLDA maximize the between-class scatter, measured by the weighted pairwise distances of the class means, and meanwhile minimize the within-class scatter, under the L1-norm and the L2-norm, respectively. The proposed models avoid the small sample size (SSS) problem and the rank limit that LDA may encounter. It is worth mentioning that the employment of the L1-norm gives L1BLDA robust performance, and that L1BLDA is solved through an effective non-greedy alternating direction method of multipliers (ADMM), where all the projection vectors are obtained at once. In addition, the weighting constants of L1BLDA and L2BLDA between the between-class and within-class terms are determined by the involved data set, which makes L1BLDA and L2BLDA adaptive. Experimental results on benchmark data sets as well as on handwritten digit databases demonstrate the effectiveness of the proposed methods.
Key words: dimensionality reduction; linear discriminant analysis; robust linear discriminant analysis; Bhattacharyya error bound; alternating direction method of multipliers

✩ This work is supported by the National Natural Science Foundation of China (No. 61703370, No. 61866010, No. 11871183, No. 11501310 and No. 61603338), the Natural Science Foundation of Zhejiang Province (No. LQ17F030003 and No. LY18G010018), and the Hainan Provincial Natural Science Foundation of China (No. 118QN181).
1. Introduction
Linear discriminant analysis (LDA) [1, 2] is a well-known supervised dimensionality reduction method and has been extensively studied since it was proposed. LDA tries to find an optimal linear transformation that maximizes the quadratic distance between the class means while simultaneously minimizing the within-class distance in the projected space. Due to its simplicity and effectiveness, LDA is widely used in many applications, including image recognition [3–6], gene expression [7], biological populations [8], and image retrieval [9].

Despite the popularity of LDA, some drawbacks restrict its applications. As we know, LDA is solved through a generalized eigenvalue problem $S_b w = \lambda S_w w$, where $S_b$ and $S_w$ are the classical between-class and within-class scatter matrices, respectively. When dealing with the SSS problem, $S_w$ is not of full rank and LDA encounters singularity. Moreover, since LDA is built on the L2-norm, it is sensitive to the presence of outliers. These facts make LDA non-robust. In addition, since the rank of $S_b$ is at most $c - 1$, where $c$ is the number of classes, LDA can find at most $c - 1$ projection directions. Remedies for the singularity and non-robustness issues include small sample size oriented variants of LDA [10, 11], robust discriminant regression [12], regularization [13, 14], modifications of $S_w$ [15], and using the robust mean and scatter variance estimators [16, 17]. For the rank limit issue, incorporating local information [18] and recursive techniques [19–22] were usually considered. Recently, the employment of the L1-norm rather than the L2-norm in LDA has been studied to cope with the non-robustness and rank limit problems. Li et al. [23] considered a rotational invariant L1-norm (R1-norm) based LDA, while the L1-norm based LDA-L1 [24–26], ILDA-L1 [27], L1-LDA [28] and L1-ELDA [29] were also put forward, with scatter matrices measured by the R1-norm and the L1-norm, respectively. Matrix-based LDA-L1 was further raised and studied [30–32]. The extension to Lp-norm ($p > 0$) scatter covariances was also used in LDA [33, 34]. However, as pointed out in [29], some of the above methods are still not robust enough.

As we know, minimizing the Bhattacharyya error bound [35] is a reasonable way to establish classification [2, 37]. In this paper, based on the Bhattacharyya error bound, a novel robust L1-norm linear discriminant analysis (L1BLDA) and its corresponding L2-norm criterion (L2BLDA) are proposed. Both of them avoid the singularity and the rank limit issues, and the employment of the L1-norm makes L1BLDA more robust.

In summary, the proposed L1BLDA and L2BLDA have the following characteristics:
[Figure 1 about here: an artificial data set with four classes and its projections obtained by LDA, L2BLDA, and L1BLDA; panels: (a) original data and the projection directions, (b) projected data, (c) data with outliers and the projection directions, (d) projected data with outliers.]

• Both L1BLDA and L2BLDA are derived by minimizing the Bhattacharyya error bound, which ensures the rationality of the proposed methods. Specifically, we prove that an upper bound of the Bhattacharyya error can be expressed linearly by the between-class scatter and the within-class scatter, so that minimizing this upper bound leads to an optimization problem of the form $\min\, -S_{bB} + D \cdot S_{wB}$, where $S_{bB}$ is the between-class scatter, $S_{wB}$ is the within-class scatter, and $D$ weights $S_{bB}$ against $S_{wB}$. In particular, it should be pointed out that the weight $D$ is calculated from the input data, so that our models adapt to different data sets.

• For the between-class and within-class scatters, L1BLDA uses the L1-norm (LASSO) loss $f_1(a) = |a|$, while the scatters of both L2BLDA and LDA are described by the L2-norm (square) loss $f_2(a) = a^2$. It is obvious that the difference between $f_1(a)$ and $f_2(a)$ becomes larger as $|a|$ gets larger, so we expect L1BLDA to be more robust than L2BLDA and LDA when the data set contains outliers. To verify the robustness of L1BLDA, we perform an experiment on a simple data set with four classes. The first class contains 120 data samples, while each of the other three classes contains 30 data samples. We apply LDA, L2BLDA and L1BLDA on the data set and obtain the one-dimensional projected data, as shown in Fig. 1. Then two additional outliers are added to the above data for testing. It is obvious that for L1BLDA the outliers have little influence on the projection direction, and the projected samples are separated well. On the contrary, LDA and L2BLDA are greatly affected by the outliers.

• Two non-greedy adaptive algorithms are proposed for solving L1BLDA and L2BLDA, respectively: i) L1BLDA is solved by an effective ADMM algorithm, which produces the whole projection matrix at once without recursively solving for one projection vector at a time; compared with traditional recursive algorithms for L1-norm based LDA, our ADMM approach maintains the orthogonality and normalization of the projection directions; ii) L2BLDA is solved through a standard eigenvalue decomposition that does not involve a matrix inversion, rather than the generalized eigenvalue decomposition of LDA.

• L1BLDA and L2BLDA avoid the singularity caused by the SSS problem. Moreover, L1BLDA does not have the rank limit issue.

The notation of the paper is given as follows. All vectors are column vectors. The training set is $T = \{x_1, x_2, \ldots, x_N\}$ with associated class labels $y_1, y_2, \ldots, y_N$ belonging to $\{1, 2, \ldots, c\}$, where $x_l \in \mathbb{R}^n$ for $l = 1, 2, \ldots, N$. Denote $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{n \times N}$ as the data matrix. Assume that the $i$-th class contains $N_i$ samples, so that $\sum_{i=1}^{c} N_i = N$. Let $\bar{x} = \frac{1}{N}\sum_{l=1}^{N} x_l$ be the mean of all samples and $\bar{x}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij}$ be the mean of the samples in the $i$-th class, where $x_{ij}$ denotes the $j$-th sample of the $i$-th class.

The paper is organized as follows. Section 2 briefly reviews LDA. Section 3 and Section 4 elaborate on L2BLDA and L1BLDA, respectively. Section 5 compares the proposed methods with their related methods experimentally. At last, concluding remarks are given in Section 6.
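To make the notation introduced above concrete, the following short sketch computes the global mean $\bar{x}$ and the class means $\bar{x}_i$ from a labelled sample set; the function name and layout are our own illustration, not part of the proposed method.

```python
import numpy as np

def class_statistics(X, y):
    """Compute the global mean and the per-class means.

    X : (n, N) data matrix whose columns are the samples x_l.
    y : (N,) integer class labels.
    Returns the global mean x_bar of shape (n,) and an (n, c) matrix
    whose i-th column is the class mean x_bar_i.
    """
    classes = np.unique(y)
    x_bar = X.mean(axis=1)
    class_means = np.stack([X[:, y == i].mean(axis=1) for i in classes], axis=1)
    return x_bar, class_means
```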
2. Linear discriminant analysis
The main idea of LDA is to find an optimal projection matrix $W \in \mathbb{R}^{n \times d}$, $d \le n$, such that the ratio of the between-class scatter to the within-class scatter is maximized in the projected space. Specifically, LDA solves the following optimization problem
\[ \max_{W} \; \frac{\operatorname{tr}(W^T S_b W)}{\operatorname{tr}(W^T S_w W)}, \tag{1} \]
where the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined by
\[ S_b = \frac{1}{N} \sum_{i=1}^{c} N_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T \tag{2} \]
and
\[ S_w = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T. \tag{3} \]
The optimization problem (1) is equivalent to the generalized eigenvalue problem $S_b w = \lambda S_w w$ with $\lambda \neq 0$, and its solution $W = (w_1, \ldots, w_d)$ is given by the eigenvectors corresponding to the first $d$ largest eigenvalues of $S_w^{-1} S_b$ in case $S_w$ is nonsingular. Since the rank of $S_b$ is at most $c - 1$, the number of extracted features is less than or equal to $c - 1$.
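For reference, here is a minimal sketch of the classical LDA solver (1)–(3) just reviewed, assuming $S_w$ is nonsingular; it is an illustration of the standard procedure rather than code from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def lda_fit(X, y, d):
    """Classical LDA: top-d eigenvectors of the pencil (S_b, S_w).

    X : (n, N) data matrix, y : (N,) labels, d : target dimension (<= c-1).
    """
    n, N = X.shape
    x_bar = X.mean(axis=1)
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for i in np.unique(y):
        Xi = X[:, y == i]
        Ni = Xi.shape[1]
        di = Xi.mean(axis=1) - x_bar
        S_b += Ni / N * np.outer(di, di)           # between-class scatter (2)
        Ci = Xi - Xi.mean(axis=1, keepdims=True)
        S_w += Ci @ Ci.T / N                        # within-class scatter (3)
    # generalized eigenproblem S_b w = lambda S_w w (requires nonsingular S_w)
    evals, evecs = eigh(S_b, S_w)
    return evecs[:, np.argsort(evals)[::-1][:d]]    # columns = projection directions
```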
3. L2-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation
Minimizing the error probability is a natural way to obtain dimensionality reduction for classification, and it involves maximizing probabilistic distance measures and probabilistic dependence measures between different classes [2, 3, 36–40]. Since the Bayes classifier is the best classifier in the sense that it minimizes the probability of classification error, minimizing its error rate (called the Bayes error or probability of misclassification [2]) is expected to lead to a good classification model. The Bayes error is defined as
\[ \epsilon = 1 - \int \max_{i \in \{1,2,\ldots,c\}} \{ P_i\, p_i(x) \}\, dx, \tag{4} \]
where $x \in \mathbb{R}^n$ is a sample vector, and $P_i$ and $p_i(x)$ are the prior probability and the probability density function of the $i$-th class of the data set $T$, respectively. Computing the Bayes error is a very difficult task except in some special cases, and an alternative to minimizing the Bayes error is to minimize one of its upper bounds [2, 41]. In particular, the Bhattacharyya error [35] is a widely used upper bound that stays close to the Bayes error. In the following, we derive a novel L2-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation, named L2BLDA, and give its solving algorithm. The Bhattacharyya error bound is given by
\[ \epsilon_B = \sum_{i<j}^{c} \sqrt{P_i P_j} \int \sqrt{p_i(x)\, p_j(x)}\, dx. \tag{5} \]

Proposition 1. Assume $P_i$ and $p_i(x)$ are the prior probability and the probability density function of the $i$-th class of the training data set $T$, respectively, and the data samples in each class are independent and identically normally distributed. Let $p_1(x), p_2(x), \ldots, p_c(x)$ be the Gaussian densities given by $p_i(x) = \mathcal{N}(x \mid \bar{x}_i, \Sigma_i)$, where $\bar{x}_i$ and $\Sigma_i$ are the class mean and the class covariance matrix, respectively. We further suppose $\Sigma_i = \Sigma$, $i = 1, 2, \ldots, c$, and that $\bar{x}_i$ and $\Sigma_i$ can be estimated accurately from $T$. Then for an arbitrary projection vector $w \in \mathbb{R}^n$ with $\|w\|_2 = 1$, the Bhattacharyya error bound $\epsilon_B$ defined by (5) on the projected data set $\widetilde{T} = \{\widetilde{x}_l \mid \widetilde{x}_l = w^T x_l\}$ satisfies
\[ \epsilon_B \le \Delta - \frac{\delta}{8} \sum_{i<j}^{c} \sqrt{N_i N_j} \Big( \|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 - \Delta'_{ij}\, \|w^T X - w^T \bar{X}\|_2^2 \Big), \tag{6} \]
where $\bar{X} = (\bar{x}_{t_1}, \ldots, \bar{x}_{t_N}) \in \mathbb{R}^{n \times N}$ collects the class mean of each sample ($t_l$ denotes the class of $x_l$), and $\Delta$, $\delta > 0$ and $\Delta'_{ij} \ge \|\bar{x}_i - \bar{x}_j\|_2^2$ are constants determined by the training data.

Proof. We first note that $\widetilde{p}_i(\widetilde{x}) = \mathcal{N}(\widetilde{x} \mid \widetilde{x}_i, \sigma^2)$, where $\widetilde{x}_i = w^T \bar{x}_i$ is the $i$-th class mean and $\sigma$ is the common standard deviation in the one-dimensional projected space. Then we have [2]
\[ \int \sqrt{\widetilde{p}_i(\widetilde{x})\, \widetilde{p}_j(\widetilde{x})}\, d\widetilde{x} = e^{-\frac{(\widetilde{x}_i - \widetilde{x}_j)^2}{8\sigma^2}}. \tag{7} \]
Since
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{N_i} \big(w^T(x_{ij} - \bar{x}_i)\big)^2 = \frac{1}{N}\sum_{l=1}^{N} \big(w^T x_l - w^T \bar{x}_{t_l}\big)^2 = \frac{1}{N}\, \|w^T X - w^T \bar{X}\|_2^2, \tag{8} \]
we have
\[ \epsilon_B \le \sum_{i<j}^{c} \sqrt{P_i P_j}\; e^{-\frac{(\widetilde{x}_i - \widetilde{x}_j)^2}{8\sigma^2}} \le \sum_{i<j}^{c} \sqrt{P_i P_j} \Big( 1 - \frac{\delta}{8}\, \frac{(\widetilde{x}_i - \widetilde{x}_j)^2}{\sigma^2} \Big) \le \Delta - \frac{\delta}{8} \sum_{i<j}^{c} \sqrt{N_i N_j} \Big( \|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 - \Delta'_{ij}\, \|w^T X - w^T \bar{X}\|_2^2 \Big), \tag{9} \]
where $\Delta = \sum_{i<j}^{c} \sqrt{P_i P_j}$, the second inequality uses the linear upper bound $e^{-z} \le 1 - \delta z$ of the convex function $e^{-z}$ on the bounded range of the exponent with $\delta > 0$, and the last inequality uses $\sqrt{P_i P_j} = \sqrt{N_i N_j}/N$ together with $(\widetilde{x}_i - \widetilde{x}_j)^2/\sigma^2 = N\,\|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 / \|w^T X - w^T \bar{X}\|_2^2$ and the following estimate. Since $\|w^T(\bar{x}_i - \bar{x}_j)\|_2 \le \|w\|_2 \cdot \|\bar{x}_i - \bar{x}_j\|_2 = \|\bar{x}_i - \bar{x}_j\|_2$ and $\|w^T X - w^T \bar{X}\|_2^2 \big( 1 - \|w^T X - w^T \bar{X}\|_2^2 \big) \le \frac{1}{4}$, we have
\[ \|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 \cdot \|w^T X - w^T \bar{X}\|_2^2 \big( 1 - \|w^T X - w^T \bar{X}\|_2^2 \big) \le \frac{1}{4}\, \|\bar{x}_i - \bar{x}_j\|_2^2, \tag{10} \]
which implies
\[ \frac{\|w^T(\bar{x}_i - \bar{x}_j)\|_2^2}{\|w^T X - w^T \bar{X}\|_2^2} \ge \|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 - \|\bar{x}_i - \bar{x}_j\|_2^2 \cdot \|w^T X - w^T \bar{X}\|_2^2 \ge \|w^T(\bar{x}_i - \bar{x}_j)\|_2^2 - \Delta'_{ij} \cdot \|w^T X - w^T \bar{X}\|_2^2, \tag{11} \]
and hence (9). □

By Proposition 1, minimizing the upper bound (6) over projection directions amounts to minimizing a quantity of the form $-S_{bB} + D \cdot S_{wB}$. Replacing the single direction $w$ by an orthonormal matrix $W \in \mathbb{R}^{n \times d}$, we obtain the L2BLDA optimization problem
\[ \min_{W^T W = I} \; -\sum_{i<j}^{c} \sqrt{N_i N_j}\, \|W^T(\bar{x}_i - \bar{x}_j)\|_2^2 + \sum_{i<j}^{c} \sqrt{N_i N_j}\, \Delta'_{ij}\, \|W^T X - W^T \bar{X}\|_F^2, \tag{12} \]
whose weighting between the between-class and within-class terms is determined by the data through the $\Delta'_{ij}$, so that L2BLDA is adaptive. Problem (12) can be written as $\min_{W^T W = I} \operatorname{tr}(W^T M W)$ with the symmetric matrix
\[ M = -\sum_{i<j}^{c} \sqrt{N_i N_j}\, (\bar{x}_i - \bar{x}_j)(\bar{x}_i - \bar{x}_j)^T + \Big( \sum_{i<j}^{c} \sqrt{N_i N_j}\, \Delta'_{ij} \Big) (X - \bar{X})(X - \bar{X})^T, \]
so its solution is given by the eigenvectors of $M$ corresponding to its $d$ smallest eigenvalues. This is a standard eigenvalue decomposition that involves no matrix inversion; hence L2BLDA avoids the singularity caused by the SSS problem and is not limited to at most $c - 1$ features.
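To make the construction concrete, the sketch below assembles the matrix $M$ of (12) and takes the eigenvectors of its $d$ smallest eigenvalues. Since the precise adaptive weights are only partially recoverable here, the simple choice $\Delta'_{ij} = \|\bar{x}_i - \bar{x}_j\|_2^2$ is an assumption made for illustration, and all names are our own.

```python
import numpy as np

def l2blda_fit(X, y, d):
    """Sketch of an L2BLDA-style solver: min tr(W^T M W) over W^T W = I,
    with M = -(weighted between-class part) + D * (within-class part),
    solved by a standard (not generalized) eigendecomposition."""
    n, N = X.shape
    classes = list(np.unique(y))
    means = {i: X[:, y == i].mean(axis=1) for i in classes}
    counts = {i: int(np.sum(y == i)) for i in classes}
    # within-class deviations: each sample minus its class mean
    Xc = X - np.stack([means[i] for i in y], axis=1)
    Sw = Xc @ Xc.T
    M = np.zeros((n, n))
    D = 0.0
    for a, i in enumerate(classes):
        for j in classes[a + 1:]:
            w_ij = np.sqrt(counts[i] * counts[j])
            d_ij = means[i] - means[j]
            M -= w_ij * np.outer(d_ij, d_ij)   # between-class term of (12)
            D += w_ij * (d_ij @ d_ij)          # assumed weight Delta'_ij = ||d_ij||^2
    M += D * Sw                                 # within-class term of (12)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, :d]                         # d smallest eigenvalues of M
```

No inversion of $S_w$ appears anywhere, which is why the construction remains well defined in the small sample size setting.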
Assume P i and p i ( x ) are the prior probability and the prob-ability density function of the i th class for the training data set T , respec-tively, and the data samples in each class are independent identically nor-mally distributed. Let p ( x ) , p ( x ) , . . . , p c ( x ) be the Gaussian functions givenby p i ( x ) = N ( x | x i , Σ i ) , where x i and Σ i are the class mean and the classcovariance matrix, respectively. We further suppose Σ i = Σ , i = 1 , , . . . , c ,and x i and Σ i can be estimated accurately from the training data set T . Thenfor arbitrary projection vector w ∈ R n , there exist some constants B and C such that the Bhattacharyya error bound ǫ B defined by (5) on the data set e T = { e x i | e x i = w T x i } satisfies the following: ǫ B ≤ − B c X i From (7), we have ǫ B ≤ c X i 18 ( f x i − f x j )2 σ ≤ c X i 18 ( w T x i − w T x j )21 N N P i =1( w T x k − w T x lk )2 = c X i 0. It is easy to know that h ( x ) is concavewhen 0 ≤ x ≤ √ b , and h ( x ) is convex when x ≥ √ b . Therefore, when x ≥ √ b , the linear function passing through ( √ b , h ( √ b )) and ( a, h ( a )) isthe tightest linear upper bound of h ( x ), e.g., g ( x ) = − e − b − e − ba a − √ b x + (cid:0) e − b − √ b · e − ba − e − b a − √ b (cid:1) . When 0 ≤ x ≤ √ b , it is obvious that there exists someconstant s > g ( x ) = − e − b − e − ba a − √ b x + (cid:0) e − b − √ b · e − ba − e − b a − √ b + s (cid:1) is tangent to h ( x ) and also a linear upper bound of h ( x ).In summary, if we define g ( x ) = − Ex + C, (19)where E = e − b − e − ba a − √ b , C = e − b − √ b · e − ba − e − b a − √ b + s if 0 ≤ x < √ b and C = e − b − √ b · e − ba − e − b a − √ b if x ≥ √ b , b = N , then by combining (17) we11ave ǫ B ≤ c X i Input: Data set T = { ( x , y ) , ..., ( x m , y m ) } ; the positive tolerances ǫ pri and ǫ dual . Set the iteration number k = 0 and initialize D ∈ R n × d , B ij ∈ R d , Z is ∈ R d and α ij , β ij ∈ R d , Γ ∈ R n × d , i = 1 , . . . , c , j = 1 , . . . , N i ; maximumiteration number ItMax . Process:while k < ItM ax (a) W k +1 = arg min W L ρ ( W , B kij , Z kis , D k ; α kij , β kij , Γ k );(b) B k +1 ij = arg min B ij L ρ ( W k +1 , B ij , Z kis , D k ; α kij , β kij , Γ k );(c) Z k +1 is = arg min Z is L ρ ( W k +1 , B k +1 ij , Z is , D k ; α kij , β kij , Γ k );(d) D k +1 = arg min Z is L ρ ( W k +1 , B k +1 ij , Z k +1 is , D ; α kij , β kij , Γ k );(e) α k +1 ij = α kij + ( N p N i N j ( W k +1 ) T ( x i − x j ) − B k +1 ij );(f) β k +1 ij = β kij + (( W k +1 ) T ( x is − x i ) − Z k +1 is );(g) Γ k +1 = Γ k + ( D k +1 − W k +1 ) Until || r k +1 || = max i,j {|| N p N i N j ( W k +1 ) T ( x i − x j ) − B k +1 ij || , || ( W k +1 ) T ( x is − x i ) − Z k +1 is || , || W k +1 − D k +1 || } ≤ ǫ pri and || s k +1 || = max i,j {|| ρ ( x i − x j )( B k +1 ij − B kij ) ′ || , || ρ ( x is − x i )( Z k +1 is − Z kis ) ′ || , || ρ ( D k +1 − D k ) || F } ≤ ǫ dual . Output: W ∗ = W k +1 . 16n this situation, problem (27) is a balanced Procrustes problem [42], andcan be solved as W k +1 = P k ( P k ) T , where P k and P k are orthogonal matricesfrom the SVD A k = P k Σ k P k . Case 2: d < n . In this situation, problem (26) is an unbalanced Procrustesproblem [43], and there is no analytic solution. We here adopt a recentlyproposed convergent algorithm studied in [44] to solve (26). Specifically, weuse Algorithm 2. 
Algorithm 2 Algorithm for problem (26) when $d < n$.
(a) Compute the dominant eigenvalue $a$ of $G$.
(b) Randomly initialize $W \in \mathbb{R}^{n \times d}$ such that $W^T W = I$.
(c) Update $M \leftarrow (a I - G) W + A^k$.
(d) Calculate $W$ by solving the balanced problem $\max_{W^T W = I} \operatorname{tr}(W^T M)$, whose solution is $W = P_1 P_2^T$ with the thin SVD $M = P_1 \Sigma P_2^T$.
(e) Iteratively perform the above steps (c) and (d) until convergence.

For step (b) in Algorithm 1, we need to solve
\[ B^{k+1}_{ij} = \arg\min_{B_{ij}} L_\rho(W^{k+1}, B_{ij}, Z^k_{is}, D^k; \alpha^k_{ij}, \beta^k_{is}, \Gamma^k) = \arg\min_{B_{ij}} \; -\|B_{ij}\|_1 + \frac{\rho}{2} \Big\| \frac{\sqrt{N_i N_j}}{N}(W^{k+1})^T(\bar{x}_i - \bar{x}_j) - B_{ij} + \frac{\alpha^k_{ij}}{\rho} \Big\|_2^2, \]
which decouples over the coordinates of $B_{ij}$ and admits the closed-form solution $B^{k+1}_{ij} = v^k_{ij} + \frac{1}{\rho}\,\mathrm{sign}(v^k_{ij})$, where $v^k_{ij} = \frac{\sqrt{N_i N_j}}{N}(W^{k+1})^T(\bar{x}_i - \bar{x}_j) + \frac{\alpha^k_{ij}}{\rho}$ and the sign is taken elementwise. The $Z_{is}$-subproblem in step (c) is solved coordinate-wise in the same way, by the soft-thresholding operator, since its L1 term enters $L_\rho$ with the positive weight $D$.
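Two computational kernels recur in the algorithms above: the orthogonality-constrained linear maximization of step (d) in Algorithm 2 (the same operation solves the balanced case (27)), and the coordinate-wise $B_{ij}$ update. A minimal sketch of both follows, under the closed form of the $B_{ij}$ subproblem reconstructed above; it is an illustration, not the authors' implementation.

```python
import numpy as np

def procrustes_step(M):
    """Solve max_{W^T W = I} tr(W^T M) for M in R^{n x d} via the thin SVD:
    with M = P1 @ diag(s) @ P2^T, the maximizer is W = P1 @ P2^T."""
    P1, _, P2t = np.linalg.svd(M, full_matrices=False)
    return P1 @ P2t

def b_update(v, c, rho):
    """Reconstructed closed form of the B_ij subproblem
        min_B  -c * ||B||_1 + (rho / 2) * ||v - B||_2^2,
    whose minimizer pushes each coordinate of v away from zero by c / rho."""
    s = np.where(v >= 0, 1.0, -1.0)   # elementwise sign, with sign(0) taken as +1
    return v + (c / rho) * s
```

The expansion in `b_update` (rather than shrinkage) reflects the negative sign of the between-class L1 term in (20); the within-class $Z_{is}$ update would instead use ordinary soft-thresholding.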
5. Experiments

In this section, we perform experiments to test the proposed methods on some UCI benchmark data sets and two handwritten digit databases. Several related dimensionality reduction methods, including PCA [45], PCA-L1 [46], LDA [2] and LDA-L1 [24], are used for comparison. The learning rate parameter of LDA-L1 is chosen from a predefined candidate set. To test the discriminant ability of the various methods, the nearest neighbor classification accuracy (%) in the projected space is used as the indicator, where the projected space is obtained by applying each dimensionality reduction method on the training data. All the methods are carried out on a PC with a P4 2 GHz CPU and 2 GB RAM in Matlab 2017b.

5.1. The UCI data sets

We first apply the proposed methods on 21 benchmark data sets, whose information is listed in Table 1. All the data sets are normalized to the interval [0, 1]. The classification results on the original data sets and on the data sets with 30% and 50% of the features contaminated by Gaussian noise are reported in Tables 2–4; Figure 2 depicts the mean accuracies and Table 5 the average ranks of the various methods.

Table 1: UCI data sets information.
Data set         Sample no.  Feature no.  Class no.
Australian       690         14           2
BUPA             345         6            2
Car              1782        6            4
Credit           690         15           2
Diabetics        768         8            2
Echocardiogram   131         10           2
Ecoli            336         7            8
German           1000        20           2
Haberman         306         3            2
Hourse           300         2            2
House votes      435         16           2
Iris             150         4            3
Monks3           432         6            2
Musk1            476         166          2
Libras           360         90           15
Sonar            208         60           2
Spect            267         44           2
TicTacToe        958         27           2
Titanic          2201        3            2
Waveform         5000        21           2
WPBC             198         34           2

Table 2: Classification results on original UCI data sets. Entries are Acc (Dim) for PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA.
Australian    81.16 (7)   80.19 (6)   80.19 (1)   82.31 (8)   82.31 (2)
BUPA          60.19 (6)   61.17 (2)   56.31 (1)   63.11 (6)   65.05 (3)
Car           93.63 (3)   76.25 (6)   50.00 (3)   75.87 (5)
Diabetics     71.74 (4)   70.87 (8)   68.70 (1)
Ecoli         78.22 (7)
German        73.67 (18)  73.67 (17)  69.33 (1)   74.33 (15)  73.67 (18)
Haberman      71.43 (3)   71.43 (3)   51.65 (1)   68.13 (2)   73.63 (2)
Hourse
House votes   88.46 (18)  87.69 (16)
Iris          100 (1)     100 (3)     100 (2)     100 (3)     100 (4)     100 (3)
Monks3        63.08 (3)   60.77 (3)   60.00 (1)   70.00 (1)   56.15 (5)
Musk1         63.08 (41)
Libras
Sonar         56.45 (9)   56.45 (9)   46.77 (1)
Spect
TicTacToe     97.57 (14)  96.88 (15)  94.44 (1)   95.83 (15)
Waveform      86.23 (3)   86.13 (3)   81.40 (1)

[Figure 2 about here: the mean accuracies (%) of PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA on the UCI data sets, for the original data, the 30% noise data and the 50% noise data.]

Table 3: Classification results on the UCI data sets with 30% of the features contaminated by Gaussian noise. Entries are Acc (Dim) for PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA.
Australian      81.64 (9)   81.64 (2)   78.74 (1)
BUPA            56.31 (6)   57.28 (1)   57.28 (1)   60.19 (2)   56.31 (3)
Car             81.66 (4)   70.08 (6)   45.56 (3)   67.57 (2)   81.80 (3)
Credit          83.57 (15)  84.06 (9)   78.26 (1)
Echocardiogram  76.92 (1)   82.05 (6)   74.36 (1)
Ecoli           77.23 (4)   75.25 (7)   75.25 (5)   75.25 (5)   75.25 (6)
German          71.67 (8)   73.00 (8)   67.33 (1)   72.33 (18)  72.00 (5)
Haberman
Hourse          80.00 (9)   77.78 (8)   65.56 (1)
Iris            88.89 (2)   91.11 (3)
Monks3          65.38 (2)   47.69 (1)   51.54 (5)   53.08 (1)   74.62 (2)
Musk1           79.72 (27)  78.32 (21)  65.73 (1)   79.72 (11)  79.72 (48)
Libras          60.00 (24)  57.14 (37)  39.05 (13)
Spect           75.00 (9)
Titanic
Waveform        81.67 (8)   81.87 (12)  80.86 (1)   83.60 (17)  81.73 (12)
WPBC            72.88 (15)  67.80 (7)   59.32 (1)

Table 4: Classification results on the UCI data sets with 50% of the features contaminated by Gaussian noise. Entries are Acc (Dim) for PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA.
Australian      82.61 (7)   83.57 (3)   78.74 (1)   83.09 (5)   83.09 (7)
BUPA            58.25 (2)   57.28 (4)   62.14 (1)   59.22 (2)   60.19 (2)
Car             78.19 (5)   70.08 (6)   44.21 (3)   68.73 (4)   77.03 (5)
Credit          79.71 (4)   76.81 (6)   78.26 (1)
Diabetics       66.52 (8)
Echocardiogram  82.50 (7)
Ecoli
German          70.33 (5)   70.67 (10)  68.33 (1)   73.67 (8)   69.33 (14)
Haberman        70.33 (3)   71.43 (3)   63.74 (1)   70.33 (3)   70.33 (3)
Hourse          83.33 (17)  82.22 (23)  76.67 (1)   82.22 (26)  84.44 (15)
House votes     90.00 (7)
Iris            88.89 (1)   84.44 (2)   82.22 (1)   86.67 (3)   84.44 (4)
Monks3          51.54 (1)   65.38 (3)   48.46 (1)
Libras          59.05 (48)
Spect           77.50 (7)   78.75 (8)   70.00 (1)   80.00 (7)   78.75 (7)
TicTacToe
Waveform        76.73 (11)  76.67 (19)  75.27 (1)   77.27 (16)  76.73 (20)
WPBC            76.26 (8)   74.58 (7)   59.32 (1)   74.58 (27)  77.97 (9)

Table 5: Average ranks of the various methods for the accuracies on the UCI data sets.
Data set         PCA    PCA-L1  LDA    LDA-L1  L2BLDA  L1BLDA
Australian       4.00   3.50    5.83   2.50    3.00    1.17
BUPA             5.17   4.50    3.83   4.17    2.33    1.00
Car              2.33   4.00    6.00   5.00    2.00    1.67
CMC              2.33   5.00    5.67   2.00    4.17    1.83
Credit           3.83   3.00    5.83   1.50    5.00    1.83
Diabetics        3.83   3.17    5.83   2.00    4.67    1.50
Echocardiogram   3.50   4.00    5.00   2.33    3.83    2.33
Ecoli            2.67   2.83    5.00   5.33    3.50    1.67
German           4.33   3.00    6.00   2.33    4.33    1.00
Glass            4.00   2.50    4.83   3.00    5.33    1.33
Haberman         3.33   2.67    6.00   4.67    2.83    1.50
Heartstatlog     4.50   3.33    3.67   2.67    5.83    1.00
Hourse           2.50   4.83    6.00   3.17    3.00    1.50
House votes      4.83   4.50    2.50   2.83    4.83    1.50
Ionosphere       3.33   4.33    5.33   1.83    5.00    1.17
Iris             3.83   4.33    3.67   3.33    3.83    2.00
Monks3           3.33   4.33    5.33   2.33    4.33    1.33
Musk1            4.00   3.83    4.83   2.83    2.83    1.67
Libras           2.83   2.50    6.00   2.00    3.83    3.83
Sonar            4.67   3.67    4.83   1.83    4.67    1.33
Spect            3.50   3.00    6.00   2.83    3.83    1.83
TicTacToe        2.83   3.33    5.67   2.83    2.17    4.17
Titanic          1.83   3.17    2.83   3.17    2.83    3.17
Waveform         3.83   4.00    6.00   1.67    3.17    2.33
WPBC             3.50   4.67    6.00   3.00    2.67    1.17
Average rank     3.55   3.66    5.19   2.94    3.50    1.88

5.2. The handwritten digit databases

In this subsection, the behaviors of the various methods are investigated on two handwritten digit databases, the MNIST database and the USPS database.

5.2.1. The MNIST database

The MNIST database contains 70000 digit images of size 28 × 28 belonging to 10 classes. We down-sample the images to the size 16 × 16 and further reshape them to vectors of length 256. 30% of the data from each class are randomly selected for training, while the rest are used for testing. Further, Gaussian noise with mean 0 and variance 0.05 is added to the training data, where the noise covers a random 30% rectangular area of each image. The contaminated digit images are displayed in Figure 3. All the methods are then applied on the original training data and on the contaminated training data. We show the classification results in Table 6.
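The contamination and evaluation protocol just described can be sketched as follows; the patch placement and the 1-nearest-neighbor scoring are our illustration of the stated setup (zero-mean noise of variance 0.05 over a random 30% rectangle), with hypothetical function names.

```python
import numpy as np

def contaminate(images, area=0.3, var=0.05, rng=np.random.default_rng(0)):
    """Add zero-mean Gaussian noise over a random rectangle covering roughly
    `area` of each image (images: (m, 16, 16) with values in [0, 1])."""
    out = images.copy()
    _, h, w = images.shape
    rh = max(1, int(round(h * np.sqrt(area))))
    rw = max(1, int(round(w * np.sqrt(area))))
    for img in out:
        r = rng.integers(0, h - rh + 1)
        c = rng.integers(0, w - rw + 1)
        img[r:r + rh, c:c + rw] += rng.normal(0.0, np.sqrt(var), (rh, rw))
    return np.clip(out, 0.0, 1.0)

def nn_accuracy(W, X_train, y_train, X_test, y_test):
    """1-nearest-neighbor accuracy in the projected space W^T x
    (naive all-pairs distances, for illustration only)."""
    P_tr, P_te = W.T @ X_train, W.T @ X_test
    d2 = ((P_te[:, :, None] - P_tr[:, None, :]) ** 2).sum(axis=0)
    return np.mean(y_train[d2.argmin(axis=1)] == y_test)
```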
[Figure 3 about here: contaminated sample images from the MNIST database.]

Table 6: Classification results on the MNIST database. Entries are Acc (Dim) for PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA.
Original data  96.79 (53)

The table shows that for the original data, all the methods behave similarly except for LDA. However, when the samples are contaminated, PCA, PCA-L1 and LDA are all greatly influenced by the noise, while LDA-L1 and our L2BLDA and L1BLDA change only slightly. In addition, our L2BLDA and L1BLDA are both better than LDA-L1, and our L1BLDA has the best performance. This demonstrates the effectiveness of the proposed methods. To see how the reduced dimension affects each method, we depict the variation of the accuracies along the dimensions, as shown in Figure 4. For the original data, as the dimension grows, the accuracies of all the methods first grow rapidly and then stay steady with similar performance. When the noise is considered, our L1BLDA, L2BLDA and LDA-L1 are affected less by the noise compared to PCA and PCA-L1, while our L2BLDA and L1BLDA have higher accuracies than LDA-L1 after dimension 17. This demonstrates the effectiveness of the proposed methods.

[Figure 4 about here: the variation of the accuracies along different dimensions on the MNIST database, for (a) the original data and (b) the noise data.]

5.2.2. The USPS database

The USPS database contains 11000 samples of dimension 256 belonging to 10 classes, and each sample corresponds to a digit. We randomly select 80% of the samples from each class for training, while the rest are used for testing. To test the robustness of the proposed methods, we further add a black block to each training image, where the block covers a random 20% rectangular area of the image, as shown in Figure 5.

[Figure 5 about here: contaminated sample images from the USPS database.]

As before, all the methods are applied on the original training data and on the contaminated training data, and the corresponding results are given in Table 7 and Figure 6. When no noise is added, our L2BLDA performs the best, while our L1BLDA, PCA and PCA-L1 are comparable to L2BLDA. However, when the images are contaminated, L1BLDA behaves the best, which shows its robustness. Similar to the MNIST database, the variation of the accuracies along different dimensions shown in Figure 6 also demonstrates the superiority of the proposed methods.

Table 7: Classification results on the USPS database. Entries are Acc (Dim) for PCA, PCA-L1, LDA, LDA-L1, L2BLDA and L1BLDA.
Original data  97.53 (55)  97.53 (53)  92.63 (9)  96.88 (139)

[Figure 6 about here: the variation of the accuracies along different dimensions on the USPS database, for (a) the original data and (b) the noise data.]

6. Conclusion

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 61703370 and No. 61603338), the Natural Science Foundation of Zhejiang Province (No. LQ17F030003 and No. LY18G010018), and the Natural Science Foundation of Hainan Province (No. 118QN181).

References

[1] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2): 179-188.
[2] Fukunaga K. Introduction to statistical pattern recognition, second edition. Academic Press, New York, 1991.
[3] Belhumeur P N, Hespanha J P, Kriegman D J. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 711-720.
[4] Sun S, Xie X, Yang M. Multiview uncorrelated discriminant analysis. IEEE Transactions on Cybernetics, 2016, 46(12): 3272-3284.
[5] Luo T, Hou C, Nie F, et al. Dimension reduction for non-gaussian data by adaptive discriminative analysis. IEEE Transactions on Cybernetics, 2018, DOI: 10.1109/TCYB.2018.2789524.
[6] Li C N, Shao Y H, Chen W J, et al. Generalized two-dimensional linear discriminant analysis with regularization. arXiv:1801.07426, 2018.
[7] Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 2006, 8(1): 86-100.
[8] Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 2010, 11(1): 94.
[9] Swets D L, Weng J J. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996, 18(8): 831-836.
[10] Chen L F, Liao H Y M, Ko M T, et al. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 2000, 33(10): 1713-1726.
[11] Yu H, Yang J. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 2001, 34(10): 2067-2070.
[12] Lai Z, Mo D, Wong W K, et al. Robust discriminant regression for feature extraction. IEEE Transactions on Cybernetics, 2018, 48(8): 2472-2484.
[13] Friedman J H. Regularized discriminant analysis. Journal of the American Statistical Association, 1989, 84(405): 165-175.
[14] Kim S J, Magnani A, Boyd S. Robust fisher discriminant analysis. Advances in Neural Information Processing Systems, 2006: 659-666.
[15] Tian Q, Barbero M, Gu Z H, et al. Image classification by the Foley-Sammon transform. Optical Engineering, 1986, 25(7): 257834.
[16] Croux C, Dehon C. Robust linear discriminant analysis using S-estimators. Canadian Journal of Statistics, 2001, 29(3): 473-493.
[17] Hubert M, Van Driessen K. Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004, 45(2): 301-320.
[18] Sugiyama M. Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. Journal of Machine Learning Research, 2007, 8: 1027-1061.
[19] Xiang C, Fan X A, Lee T H. Face recognition using recursive Fisher linear discriminant. IEEE Transactions on Image Processing, 2006, 15(8): 2097-2105.
[20] Ye Q L, Zhao C X, Zhang H F, et al. Recursive "concave-convex" Fisher linear discriminant with applications to face, handwritten digit and terrain recognition. Pattern Recognition, 2012, 45: 54-65.
[21] Chen X, Yang J, Mao Q, et al. Regularized least squares fisher linear discriminant with applications to image recognition. Neurocomputing, 2013, 122: 521-534.
[22] Li C N, Zheng Z R, Liu M Z, et al. Robust recursive absolute value inequalities discriminant analysis with sparseness. Neural Networks, 2017, 93: 205-218.
[23] Li X, Hua W, Wang H, et al. Linear discriminant analysis using rotational invariant L1