One-shot Distributed Algorithm for Generalized Eigenvalue Problem
Kexin Lv, Fan He, Xiaolin Huang∗, Jie Yang∗, Liming Chen

Abstract
Nowadays, more and more datasets are stored in a distributed way for the sake of memory storage or data privacy. The generalized eigenvalue problem (GEP) plays a vital role in a large family of high-dimensional statistical models. However, the existing distributed methods for eigenvalue decomposition cannot be applied to GEP because the empirical covariance matrix diverges after aggregation. Here we propose a general distributed framework with one-shot communication for GEP. If the symmetric data covariance has repeated eigenvalues, e.g., in canonical correlation analysis, we further modify the method for better convergence. A theoretical analysis of the approximation error is conducted, and its relation to the divergence of the data covariance, the eigenvalues of the empirical data covariance, and the number of local servers is established. Numerical experiments also show the effectiveness of the proposed algorithms.

Keywords:
Generalized eigenvalue problems, canonical correlation analysis, distributed optimization algorithm.
This research is partly supported by National Key R&D Program of China (No. 2019YFB1311503); National Science Foundation, China (61876107, U1803261, 61977046); Committee of Science and Technology, Shanghai, China (No. 19510711200).

∗ Corresponding author
Email addresses: [email protected] (Kexin Lv), [email protected] (Fan He), [email protected] (Xiaolin Huang), [email protected] (Jie Yang), [email protected] (Liming Chen)

Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, also with the MOE Key Laboratory of System Control and Information Processing, 800 Dongchuan Road, Shanghai, 200240, P. R. China. Institute of Medical Robotics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, P. R. China. Ecole Centrale de Lyon, France.
1. Introduction

Nowadays, more and more data, large in both dimension and quantity, are stored in distributed local servers. A distributed framework enables many peer local servers to seek the solution to the same class of problems, where specific servers may provide insight for other servers. This is usually achieved by iteratively exchanging information between local servers to partially improve the way they accomplish their tasks. As more scenarios such as financial, medical, and biomedical tasks arise, which hold sensitive and limited-scale datasets in a distributed manner, distributed systems shall offer a relatively safe and efficient way to obtain a satisfying result.

In the past decades, distributed systems for regression and classification have been discussed. There have been many works on that, especially distributed linear regression [5, 6, 12], distributed sketched ridge regression [31], etc. Besides classification and regression, there is another important type of learning task that involves (generalized) eigenvalue decomposition, including SVD [21], PCA [20], CCA [18], etc. For distributed (generalized) eigenvalue decomposition, the studies of PCA [15, 10, 1] give distributed algorithms but lack communication efficiency and privacy. Recently, an efficient structure of distributed privacy-preserving sparse PCA [13] was proposed, which solves a general distributed eigenvalue decomposition. Ref. [3] proposes a decentralized CCA framework in wireless sensor networks with a fully connected or a tree topology. Distributed SVD [22] utilizes multiple local power iterations between local and central servers.

Although great achievements have been obtained in individual tasks of distributed (generalized) eigenvalue problems, we lack a common approach for the distributed generalized eigenvalue problem (GEP). The GEP, which plays a vital role in a large family of high-dimensional statistical models, is formulated as
$$A w = \lambda B w, \qquad (1)$$
where $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix and $B \in \mathbb{R}^{d \times d}$ is a positive definite matrix. $w \in \mathbb{R}^d$ denotes the generalized eigenvector of the GEP corresponding to the generalized eigenvalue $\lambda$. The solutions to Eq. (1) can be modeled as local optima of the maximum generalized eigenvalue problem
$$\max_{w \in \mathbb{R}^d} \; w^* A w \quad \mathrm{s.t.} \quad w^* B w = 1, \qquad (2)$$
where $w^*$ denotes the conjugate transpose of $w$. It finds the generalized eigenvector corresponding to the largest generalized eigenvalue. When $B$ is an identity matrix, the GEP reduces to the ordinary eigenvalue problem (EP). The optimization problem has an equivalent solution $\lambda_{\max}(B^{-1/2} A B^{-1/2})$ [2], which pursues the maximum eigenvalue of the symmetric data covariance $B^{-1/2} A B^{-1/2}$. SVD is an effective and accurate means to solve this problem. However, it is often replaced with power iterations [14], which are numerically efficient for large-scale data.

Since GEP covers a series of multivariate statistical problems, directly applying distributed algorithms for EP [10, 13] is not suitable. Besides, recall from Eq. (2) that when $B$ is a general dense matrix rather than an identity matrix, the symmetric data covariance $B^{-1/2} A B^{-1/2}$ meets a convergence barrier in the power iterations of [13, 22]: the symmetric self-adjoint dilation [29] of the empirical covariance matrix has repeated nonzero generalized eigenvalues, namely no eigengap [see Corollary 1 in 28], which appears in GEPs with multiple views like CCA and PLS.
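To make this convergence barrier concrete, the following minimal numpy sketch (ours, for illustration only) builds the self-adjoint dilation of a generic matrix $C$ and checks that its eigenvalues come in $\pm$ pairs, so the two leading eigenvalue magnitudes coincide and the plain power method has no eigengap to exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((4, 4))           # a generic cross-covariance block

# Self-adjoint dilation of C [29]: its spectrum is {+sigma_i(C), -sigma_i(C)},
# so the leading eigenvalue magnitude is repeated -- no eigengap.
D = np.block([[np.zeros((4, 4)), C],
              [C.T, np.zeros((4, 4))]])

mags = np.sort(np.abs(np.linalg.eigvalsh(D)))[::-1]
print(mags)                               # each singular value of C appears twice
assert np.isclose(mags[0], mags[1])       # repeated leading magnitude
```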
Consequently, the existing distributed methods for PCA and SVD cannot simply be carried over to GEP because of this convergence barrier.

Based on these observations, in this paper we design a general distributed GEP framework with one-shot communication and high efficiency. The empirical data covariance is broadcast to the central server and the pursuit of GEP is carried out there. For large-scale data, power iterations are usually used to solve the GEP in the central server. We overcome the convergence barrier caused by the lack of an eigengap in the power method for CCA involving multi-view data, and put forward a distributed multiple CCA algorithm with the power method in the central server for better convergence. The analysis of the approximation error accounts for the divergence of the data covariance in one-shot communication, which is related to the eigenvalues of the covariance and the number of local servers.

The remainder of the paper is organized as follows. Section 2 reviews related work on GEP and distributed algorithms. Section 3 gives a general framework of the distributed GEP algorithm with one-shot communication and its applications. In Section 4, we analyze the approximation error of the proposed algorithm. Section 5 puts forward the distributed multiple CCA algorithm with the power method. Numerical experiments are reported in Section 6, and Section 7 concludes.
2. Related works
Generalized eigenvalue problems play important roles in machine learning. There are different approaches to solving them. One group is Jacobi-type sequential algorithms, which originate from the Falk–Langemeyer method [9] and the Hari–Zimmermann algorithm [17]. The other group of algorithms is based on generalizations of the Kogbetliantz method [24]. These works designed algorithms for the generalized singular value decomposition (GSVD) problem [30], which is a popular route to GEP, and QR-iteration type algorithms [23] are suitable for solving GSVD. On this basis, power and inverse power methods defined for a matrix product with an arbitrary number of factors have been developed to solve GEP. From an optimization perspective, the trace minimization method for GEP [25] arises; it uses the power method to build a sequence of subspaces containing the desired eigenvectors.

Distributed algorithms usually aim at solving large-scale problems efficiently with large quantities of data. Due to the limitation of cache memory in large-scale problems, block and parallel methods [4, 8] rearrange and parallelize operations of the original algorithm to take full advantage of resources. They need to consider the trade-off between communication and computation. Owing to the limited bandwidth of communication in distributed systems and the strong computational power available nowadays, communication cost is usually more expensive than computation cost. Moreover, when communication and privacy preservation matter for sensitive datasets, distributed optimization algorithms formulated as constrained minimization of convex functions offer a great solution; they are applied in a variety of fields including machine learning, robotics, and resource allocation [13, 27, 33]. In machine learning, power and inverse power methods can be extended to the distributed setting, especially in PCA [10, 13, 11], where the global empirical covariance matrix coincides with the sum of the local covariance matrices. However, this consistency of the covariance no longer holds when similar methods are extended to GEP in distributed systems.
3. Distributed GEP algorithm and its applications
In this section, GEP is represented as a trace optimization subject to fixed constraints, and a distributed GEP algorithm (DGEP) is proposed. We seek the generalized eigenvectors of the symmetric empirical data covariance in a distributed setting with one-shot communication.
We consider a kind of distributed generalized eigenvalue problem where data sharing the same feature dimension $d$ are provided from $N$ local servers to a shared, trusted central server. That is, the centered data are $X = [X_1, X_2, \ldots, X_N] \in \mathbb{R}^{d \times num_1}$ and $Y = [Y_1, Y_2, \ldots, Y_N] \in \mathbb{R}^{d \times num_2}$, where $d$ is the feature dimension and $num_1$, $num_2$ are the numbers of data points in $X$ and $Y$. The $X_i$ and the $Y_i$ are i.i.d., respectively, for $i \in \{1, 2, \ldots, N\}$. Each local server prepares data matrices $A_i \in \mathbb{R}^{d \times d}$ and $B_i \in \mathbb{R}^{d \times d}$ from $X$ and $Y$ according to the specific GEP task. The centered GEP is formulated as the maximization
$$\max_{w \in \mathbb{R}^d} \mathrm{Trace}\Big(w^* \Big(\sum_{i=1}^N A_i\Big) w\Big) \quad \mathrm{s.t.} \quad w^* \Big(\sum_{i=1}^N B_i\Big) w = 1. \qquad (3)$$
It is equivalent to
$$\max_{w \in \mathbb{R}^d} \mathrm{Trace}(w^* M w) \quad \mathrm{s.t.} \quad w^* w = 1, \qquad (4)$$
where the covariance matrix $M = \big(\sum_{i=1}^N B_i\big)^{-1/2} \big(\sum_{i=1}^N A_i\big) \big(\sum_{i=1}^N B_i\big)^{-1/2}$ and $A_i$, $B_i$, $M \in \mathbb{R}^{d \times d}$ are all symmetric. Based on the trace maximization method, the central optimization problem in the central server is formulated as
$$\max_{w \in \mathbb{R}^d} \mathrm{Trace}\Big(w^* \Big(\sum_{i=1}^N M_i\Big) w\Big) \quad \mathrm{s.t.} \quad w^* w = 1. \qquad (5)$$
From the data privacy perspective, sending the local covariance $M_i = B_i^{-1/2} A_i B_i^{-1/2}$ to the central server is safer than sending $A_i$ and $B_i$ directly. But a divergence arises in the covariance of GEP after communication: in general, $\sum_{i=1}^N M_i$ does not reproduce $M$. This is a significant difference from distributed SVD [22], whose assumption that local covariances aggregate exactly is broken here.

In the distributed system, node failure and asynchronization are assumed to be negligible. All the local servers hold data of similar scale, so that synchronous communication will not cause a long delay in the whole system.

The general distributed GEP algorithm (DGEP) is shown in Algorithm 1. We focus on the distributed optimization of GEP in Eq. (5) with symmetric $M_i \in \mathbb{R}^{d \times d}$ for $i \in \{1, 2, \ldots, N\}$. The $N$ local servers broadcast the data covariances $M_i$ to the central server in one-shot communication for efficiency. Learning the generalized eigenvector $\hat w \in \mathbb{R}^d$ occurs in the central server. In EPs like PCA, sending $M_i$ may lose privacy because of its simple symmetric structure. However, in GEPs like CCA, a dense $M_i$ compresses a mix of multi-view data which is hard to recover. The communication complexity of the distributed system is $O(d^2 N)$. The whole process is depicted in Figure 1; a numerical illustration of the covariance divergence follows.
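As a quick numerical illustration of this divergence (our own sketch, not from the paper's experiments), the snippet below forms the centralized covariance $M$ of Eq. (4) and the one-shot aggregate $\sum_i M_i$ of Eq. (5) for random symmetric $A_i$ and positive definite $B_i$, and compares their leading eigenvectors:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(1)
d, N = 5, 3
A, B = [], []
for _ in range(N):
    S = rng.standard_normal((d, d))
    A.append((S + S.T) / 2)                  # symmetric A_i
    G = rng.standard_normal((d, d))
    B.append(G @ G.T + d * np.eye(d))        # positive definite B_i

R = inv_sqrt(sum(B))
M = R @ sum(A) @ R                           # centralized covariance, Eq. (4)
M_hat = sum(inv_sqrt(Bi) @ Ai @ inv_sqrt(Bi) for Ai, Bi in zip(A, B))  # Eq. (5)

w = np.linalg.eigh(M)[1][:, -1]              # leading eigenvector, centralized
w_hat = np.linalg.eigh(M_hat)[1][:, -1]      # leading eigenvector, one-shot
print(1.0 - abs(w @ w_hat))                  # > 0 unless all (A_i, B_i) coincide
```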
Algorithm 1 One-shot Distributed GEP algorithm.
1. In the local servers, calculate the local covariance matrix $M_i$ in the $i$-th local server and broadcast it to the central server.
2. In the central server, calculate $\hat M = \sum_{i=1}^N M_i$ as the approximation of $M$.
3. Compute the leading $k$ eigenvectors $\hat W$ of the approximate matrix $\hat M$.
4. Return $\hat W$.

Figure 1: General process of distributed GEP.

Many statistical tools in machine learning can be formulated as special instances of GEP in Eq. (2) with a symmetric-definite matrix pair $(A, B)$. We briefly give three instances below; they all fit Algorithm 1 in the distributed framework, and a minimal sketch of Algorithm 1 follows the list.

• FDA (Fisher's Discriminant Analysis [26]): It is desired to maximize the between-class variance $S_B$ and minimize the within-class variance $S_W$, so that the instances of each class get close to one another while the classes move away from each other. It is a direct instance of GEP with $A = S_B$ and $B = S_W$.

• CCA (Canonical Correlation Analysis [18]): Given two random vectors $X$ and $Y$, let $\Sigma_{XX}$ and $\Sigma_{YY}$ be the covariance matrices of $X$ and $Y$ respectively, and let $\Sigma_{XY}$ be the cross-covariance matrix between $X$ and $Y$. The canonical vectors $w_X$ and $w_Y$ can be obtained by solving GEP with
$$A = \begin{pmatrix} 0 & \Sigma_{XY} \\ \Sigma_{YX} & 0 \end{pmatrix}, \quad B = \begin{pmatrix} \Sigma_{XX} & 0 \\ 0 & \Sigma_{YY} \end{pmatrix}, \quad w = \begin{pmatrix} w_X \\ w_Y \end{pmatrix}.$$

• PLS (Partial Least Squares [19]): It is a common method for dimension reduction. Derived from CCA with $X$ and $Y$, it is also an instance of GEP with
$$A = \begin{pmatrix} 0 & \Sigma_{XY} \\ \Sigma_{YX} & 0 \end{pmatrix}, \quad B = I, \quad w = \begin{pmatrix} w_X \\ w_Y \end{pmatrix}.$$
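The following is a minimal, illustrative implementation of Algorithm 1 in numpy (the function names are ours; the dense eigendecomposition corresponds to the DGEP-SVD variant used later in the experiments, while a power method could replace it for large $d$):

```python
import numpy as np

def local_covariance(A_i, B_i):
    """Step 1 (local server i): form M_i = B_i^{-1/2} A_i B_i^{-1/2}."""
    vals, vecs = np.linalg.eigh(B_i)
    B_inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
    return B_inv_sqrt @ A_i @ B_inv_sqrt

def dgep(local_pairs, k):
    """Steps 2-4 (central server): aggregate the broadcast M_i and take the
    leading k eigenvectors of the approximation M_hat."""
    M_hat = sum(local_covariance(A_i, B_i) for A_i, B_i in local_pairs)
    vals, vecs = np.linalg.eigh(M_hat)
    return vecs[:, np.argsort(vals)[::-1][:k]]    # W_hat, top-k eigenvectors
```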
4. Approximation error analysis of distributed GEP
In this section, we analyze the approximation error of DGEP in Algorithm 1 in terms of the top-$k$ subspace distance $\|D_k(W_c, \hat W)\|$ in Theorem 1, where the norm $\|\cdot\|$ denotes the $\ell_2$ norm by default. The error accounts for the divergence of the data covariance in one-shot communication.

The analysis of communication is based on the $\sin\Theta$ theorem [32]. The centered covariance before one-shot communication, $\Sigma = M$, and the approximate covariance in the central server, $\tilde\Sigma = \sum_{i=1}^N M_i$, are represented by the detailed data matrix multiplications of the $A_i$ and $B_i$ respectively. The goal of Algorithm 1 is to learn a central orthonormal matrix $\hat W \in \mathbb{R}^{d \times k}$ to estimate $W_c$, the top-$k$ eigenspace of $\Sigma$. Our main result is Theorem 1.

Theorem 1.
Given matrices $\Sigma \in \mathbb{R}^{d \times d}$ with eigenvalues $\sigma_1 \ge \cdots \ge \sigma_d$ and $\tilde\Sigma \in \mathbb{R}^{d \times d}$ in the central server with eigenvalues $\tilde\sigma_1 \ge \cdots \ge \tilde\sigma_d$. For a given dimension $k$ with $k \le d$, let $W_c = (w_1, \ldots, w_k) \in \mathbb{R}^{d \times k}$ and $\hat W = (\tilde w_1, \ldots, \tilde w_k) \in \mathbb{R}^{d \times k}$ have orthonormal columns satisfying $\Sigma w_j = \sigma_j w_j$ and $\tilde\Sigma \tilde w_j = \tilde\sigma_j \tilde w_j$ for $j = 1, \ldots, k$. Define the eigengap
$$\Delta_k = \inf\{|\hat\sigma - \sigma| : \sigma \in [\sigma_k, \sigma_1],\ \hat\sigma \in (-\infty, \tilde\sigma_{k+1}] \cup [\tilde\sigma_0, \infty)\},$$
where by convention $\tilde\sigma_0 = \infty$ and $\tilde\sigma_{d+1} = -\infty$, and assume $\Delta_k > 0$. Assume $\|\mathrm{Var}(B_i^{-1/2})\| \le \gamma_1$ and $\|\mathrm{Var}(A_i B_i^{-1/2})\| \le \gamma_2$ for $i = 1, \ldots, N$, where $\mathrm{Var}(\cdot)$ denotes the variance of the total elements in the matrix. Then the approximation error satisfies
$$\|D_k(W_c, \hat W)\| = \|\sin\theta_k(W_c, \hat W)\| \le \frac{\|\Sigma - \tilde\Sigma\|}{\Delta_k} \le \frac{N \gamma_1 \max_i b_i^{-1/2} \max_j a_j + \gamma_2 \max_i b_i^{-1/2}}{\Delta_k},$$
where $a_i$, $b_i$ are the maximum spectral radii of $A_i$, $B_i$ respectively.

Proof. The bound $D_k(W_c, \hat W) = \|\sin\Theta(W_c, \hat W)\| \le \|\Sigma - \tilde\Sigma\| / \Delta_k$ is a direct conclusion of the $\sin\Theta$ theorem. Recall that $M_i = B_i^{-1/2} A_i B_i^{-1/2}$ and $M = \big(\sum_{i=1}^N B_i\big)^{-1/2} \big(\sum_{i=1}^N A_i\big) \big(\sum_{i=1}^N B_i\big)^{-1/2}$; then
$$\|\Sigma - \tilde\Sigma\| = \Big\| M - \sum_{i=1}^N M_i \Big\| = \bigg\| \Big(\sum_{i=1}^N B_i\Big)^{-1/2} \sum_{i=1}^N A_i \Big(\sum_{i=1}^N B_i\Big)^{-1/2} - \sum_{i=1}^N B_i^{-1/2} A_i B_i^{-1/2} \bigg\|.$$
Considering Jensen's inequality for the $B_i$, by the operator convexity of the inverse square root we have
$$\Big(\frac{1}{N}\sum_{i=1}^N B_i\Big)^{-1/2} \preceq \frac{1}{N}\sum_{i=1}^N B_i^{-1/2}, \quad \text{that is,} \quad \Big(\sum_{i=1}^N B_i\Big)^{-1/2} \preceq N^{-3/2} \sum_{i=1}^N B_i^{-1/2},$$
with equality only if $B_1 = B_2 = \cdots = B_N$. Then
$$\begin{aligned} \|\Sigma - \tilde\Sigma\| &\le \bigg\| \frac{1}{N^3} \sum_{i=1}^N B_i^{-1/2} \sum_{i=1}^N A_i \sum_{i=1}^N B_i^{-1/2} - \sum_{i=1}^N B_i^{-1/2} A_i B_i^{-1/2} \bigg\| \\ &= \bigg\| \sum_{i=1}^N B_i^{-1/2} \Big( \frac{1}{N^3} \sum_j A_j \sum_k B_k^{-1/2} - A_i B_i^{-1/2} \Big) \bigg\| \\ &= \bigg\| \sum_{i=1}^N B_i^{-1/2} \Big( \frac{1}{N^3} \sum_j A_j \sum_k B_k^{-1/2} - \frac{1}{N}\sum_{i=1}^N A_i B_i^{-1/2} + \frac{1}{N}\sum_{i=1}^N A_i B_i^{-1/2} - A_i B_i^{-1/2} \Big) \bigg\| \\ &= \bigg\| \sum_{i=1}^N B_i^{-1/2} \bigg( \frac{1}{N^2} \sum_j A_j \Big( \frac{1}{N} \sum_k B_k^{-1/2} - B_j^{-1/2} \Big) + \frac{1}{N}\sum_{i=1}^N A_i B_i^{-1/2} - A_i B_i^{-1/2} \bigg) \bigg\|, \end{aligned}$$
where the former term is related to the variance of $B_i^{-1/2}$ and the latter term concerns the variance of $A_i B_i^{-1/2}$ for $i = 1, \ldots, N$. Assume $\|\mathrm{Var}(B_i^{-1/2})\| \le \gamma_1$ and $\|\mathrm{Var}(A_i B_i^{-1/2})\| \le \gamma_2$, where $\mathrm{Var}(\cdot)$ denotes the variance of the total elements in the matrix; the tightest choices are $\gamma_1 = \max_i \|\mathrm{Var}(B_i^{-1/2})\|$ and $\gamma_2 = \max_i \|\mathrm{Var}(A_i B_i^{-1/2})\|$. Then
$$\begin{aligned} \|\Sigma - \tilde\Sigma\| &\le \bigg\| \sum_{i=1}^N B_i^{-1/2} \Big( \frac{1}{N^2} \sum_{j=1}^N A_j \,\mathrm{Var}\big(B_j^{-1/2}\big) + \mathrm{Var}\big(A_i B_i^{-1/2}\big) \Big) \bigg\| \\ &\le \bigg\| \mathrm{Var}\big(B_j^{-1/2}\big) \frac{1}{N} \sum_{i=1}^N B_i^{-1/2} \sum_{j=1}^N A_j \bigg\| + \bigg\| \mathrm{Var}\big(A_i B_i^{-1/2}\big) \sum_{i=1}^N B_i^{-1/2} \bigg\| \\ &\le \frac{\gamma_1}{N} \sum_{i=1}^N \big\| B_i^{-1/2} \big\| \sum_{j=1}^N \big\| A_j \big\| + \frac{\gamma_2}{N} \sum_{i=1}^N \big\| B_i^{-1/2} \big\| \\ &= N \gamma_1 \max_i b_i^{-1/2} \max_j a_j + \gamma_2 \max_i b_i^{-1/2}, \end{aligned}$$
where $a_i$, $b_i$ are the maximum spectral radii of $A_i$, $B_i$ respectively for $i = 1, \ldots, N$. □
When $\gamma_1 = \gamma_2 = 0$, that is, $A_1 = A_2 = \cdots = A_N$ and $B_1 = B_2 = \cdots = B_N$, the approximation error $D_k(W_c, \hat W) = 0$ and the distributed estimate spans the same top-$k$ eigenspace as the centered solution.
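For reference, the subspace distance used above can be computed from the principal angles between the two estimated eigenspaces; a minimal numpy sketch (ours, assuming $W_c$ and $\hat W$ have orthonormal columns) is:

```python
import numpy as np

def subspace_distance(W_c, W_hat):
    """||sin Theta(W_c, W_hat)||_2 for matrices with orthonormal columns.

    The singular values of W_c^* W_hat are the cosines of the principal
    angles, so the largest sine is sqrt(1 - min(cos)^2).
    """
    cosines = np.linalg.svd(W_c.T @ W_hat, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - cosines.min() ** 2))
```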
5. Distributed Canonical Correlation Analysis (CCA) and its multiple formulation
CCA in the classical setting, a well-known technique proposed by [18], finds the linear combinations of two sets of random variables with maximal correlation. Given $d$ observations, denote $X = (x_1, \ldots, x_d) \in \mathbb{R}^{n_1 \times d}$ and $Y = (y_1, \ldots, y_d) \in \mathbb{R}^{n_2 \times d}$. The covariances $\Sigma_{XY} = E(XY^*)$, $\Sigma_{XX} = E(XX^*)$, and $\Sigma_{YY} = E(YY^*)$ are estimated empirically by $XY^*$, $XX^*$, and $YY^*$. Multiple CCA [16] is formulated as follows:
$$\begin{aligned} \max_{U, V} \; & \mathrm{Trace}(U^* X Y^* V) \\ \mathrm{s.t.} \; & U^* X X^* U = I_k, \; U \in \mathbb{R}^{n_1 \times k}, \\ & V^* Y Y^* V = I_k, \; V \in \mathbb{R}^{n_2 \times k}, \end{aligned} \qquad (6)$$
where $I_k$ stands for the identity matrix with $k$ columns; $k$ also denotes the number of canonical components and normally $k \le \min\{n_1, n_2\}$. When $k = 1$, it degrades to standard CCA. When it is formulated as a special instance [28] of GEP as in Eq. (2), we obtain
$$\max_{W \in \mathbb{R}^{d \times k}} \mathrm{Trace}(W^* A W) \quad \mathrm{s.t.} \quad W^* B W = I_k, \qquad (7)$$
with
$$A = \begin{pmatrix} 0 & XY^* \\ YX^* & 0 \end{pmatrix}, \quad B = \begin{pmatrix} XX^* & 0 \\ 0 & YY^* \end{pmatrix}, \quad W = \begin{pmatrix} U \\ V \end{pmatrix}$$
(here $d = n_1 + n_2$).

According to Theorem 2 in [22], there must be an eigengap of $M = B^{-1/2} A B^{-1/2} \in \mathbb{R}^{d \times d}$ when using the power method. However, Eq. (7) has repeated generalized eigenvalues due to the self-adjoint structure of $A$ and $B$ when formulated as a GEP. This may not only increase the approximation error during communication but also cause a convergence barrier for the lack of an eigengap. So in this section we reformulate CCA on the basis of Algorithm 1 and put forward a one-shot distributed multiple CCA algorithm with the power method, which resolves this inherent structural defect efficiently. The reformulated centered problem is
$$\max_{W_1, W_2} \mathrm{Trace}\big(W_1^* (XX^*)^{-1/2} XY^* (YY^*)^{-1/2} W_2\big) \quad \mathrm{s.t.} \quad W_1^* W_1 = I_k, \; W_2^* W_2 = I_k, \qquad (8)$$
where $X = [X_1, X_2, \ldots, X_N]$ and $Y = [Y_1, Y_2, \ldots, Y_N]$ are concatenations of the original data from the $N$ local servers and $M = (XX^*)^{-1/2} XY^* (YY^*)^{-1/2}$ is the asymmetric covariance of the centered problem. The distributed optimization problem is
$$\max_{W_1, W_2} \mathrm{Trace}\Big(W_1^* \Big(\sum_{i=1}^N M_i\Big) W_2\Big) \quad \mathrm{s.t.} \quad W_1^* W_1 = I_k, \; W_2^* W_2 = I_k, \qquad (9)$$
where the covariance $M_i = (X_i X_i^*)^{-1/2} X_i Y_i^* (Y_i Y_i^*)^{-1/2}$ for CCA is also asymmetric, and $\hat M = \sum_{i=1}^N M_i$ is an approximation of $M$ in the distributed setting.

Algorithm 2 shows the one-shot distributed multiple CCA algorithm, which improves the performance by breaking the self-adjoint symmetric GEP structure. The power iterations in the central optimization are divided into two parts, computing the eigenspaces of $\hat M_1 = \sum_{i=1}^N M_i M_i^*$ and $\hat M_2 = \sum_{i=1}^N M_i^* M_i$. The process can be viewed as seeking the left and right singular vectors of the approximated covariance $\hat M$ at the same time. QR factorization [14] is used to obtain an orthonormal approximated eigenspace. The algorithm takes $O(k \cdot d^2)$ time per iteration for the matrix-matrix multiplications, and the QR decomposition has time complexity $O(d \cdot k^2)$. The analysis of the approximation error of Algorithm 2 is similar to that in Section 4, and the convergence boils down to that of the power method in [14].

Algorithm 2 One-shot Distributed multiple CCA algorithm.
1. In the local servers, calculate the local covariance matrix $M_i$ in the $i$-th local server and broadcast $M_i M_i^*$ and $M_i^* M_i$ respectively to the central server.
2. In the central server, calculate the leading $k$ eigenvectors of $\hat M_1$ and $\hat M_2$ as $\hat W_1$ and $\hat W_2$ by the power method in parallel; these are viewed as the left and right approximated eigenspaces of $\hat M$.
3. Return $\hat W_1$ and $\hat W_2$.
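A minimal numpy sketch of Algorithm 2 follows (ours; the function names are illustrative, matrix square-root inverses are computed by eigendecomposition, and the iteration count mirrors the cap of 10 power steps used in the experiments below):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def local_cca_covariance(X_i, Y_i):
    """Local server i: whitened cross-covariance M_i of Eq. (9)."""
    return inv_sqrt(X_i @ X_i.T) @ (X_i @ Y_i.T) @ inv_sqrt(Y_i @ Y_i.T)

def distributed_multiple_cca(views, k, iters=10, seed=0):
    """Central server: power method with QR on M_hat_1 and M_hat_2."""
    Ms = [local_cca_covariance(X_i, Y_i) for X_i, Y_i in views]
    M1 = sum(M @ M.T for M in Ms)        # M_hat_1 = sum_i M_i M_i^*
    M2 = sum(M.T @ M for M in Ms)        # M_hat_2 = sum_i M_i^* M_i
    rng = np.random.default_rng(seed)
    W1 = np.linalg.qr(rng.standard_normal((M1.shape[0], k)))[0]
    W2 = np.linalg.qr(rng.standard_normal((M2.shape[0], k)))[0]
    for _ in range(iters):
        W1 = np.linalg.qr(M1 @ W1)[0]    # M1, M2 are positive semidefinite,
        W2 = np.linalg.qr(M2 @ W2)[0]    # so the power method sees only
    return W1, W2                        # nonnegative eigenvalues here
```

Since $\hat M_1$ and $\hat M_2$ are sums of Gram matrices, their spectra are nonnegative, which is exactly how Algorithm 2 sidesteps the $\pm$-paired eigenvalues of the self-adjoint GEP formulation.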
6. Numerical experiments
In this section, numerical experiments on synthetic and real data are carried out to illustrate the effectiveness and accuracy of the proposed algorithms. For GEP, Algorithm 1 (DGEP) is compared with the centered result. For the pursuit of eigenvectors of DGEP in the central server, SVD and the power method (PM) are carried out respectively; they are denoted as DGEP-SVD and DGEP-PM in the following. For CCA, Algorithm 2 and DGEP-PM are compared with the centered results. For convenience, we only pursue the generalized eigenvector corresponding to the largest generalized eigenvalue in the maximum optimization of distributed GEP and distributed CCA, i.e., $k = 1$. All the experiments are performed in MATLAB R2019a on a computer with 6-core 2.20 GHz CPUs and 8 GB RAM.
Considering the general setting of GEP in Eq. (2), we generate a symmetric matrix $A_i \in \mathbb{R}^{feats \times feats}$ and a positive definite matrix $B_i \in \mathbb{R}^{feats \times feats}$ randomly and independently, which carry the mixed data information in the $i$-th local server for $i = 1, \ldots, N$. The running time includes communication time and optimization time for DGEP-PM, and the maximum number of iteration steps of PM is set to 10. The error is calculated as
$$Error = \sin(subspace(\hat W, W_c)) = \| \hat W - W_c (W_c^* \hat W) \|,$$
where $W_c$ is the result from the centered data and $\hat W$ is obtained in the central server. The performance of DGEP in time and error is depicted in Fig. 2 with $feats = 100$ and $N$ varying from 2 to 100 in increments of 2.

Figure 2: Performance of DGEP in running time and error.

As the number of local servers $N$ increases, the running time of DGEP goes up. SVD as a solver in MATLAB is efficient as long as its $O(d^3)$ complexity is acceptable. The errors of DGEP-PM and DGEP-SVD are similar and remain at a low level as $N$ varies.

The technique of FDA [26], as an instance of GEP, is used for binary classification on the GSE2187 dataset, a large cRNA microarray dataset reflecting the drug and toxicant response of rats. Only two categories (toxicants and fibrates, named $C_1$ and $C_2$ for short) are used in classification, and missing values are filled with mean values. The data are randomly divided into train data and test data with a ratio of 1 : 1. The detailed information is displayed in Table 1.

Table 1: GSE2187 data structures: number of data ($num$), data dimension ($d$).

Data | toxicants ($C_1$) | fibrates ($C_2$)
num | 181 | 107

The classification threshold is
$$thres = \frac{W_c^* (num_{C_1} \cdot m_{C_1} - num_{C_2} \cdot m_{C_2})}{num_{C_1} + num_{C_2}}$$
with the means $m_{C_1}$ and $m_{C_2}$ of $C_1$ and $C_2$. The classification accuracy is calculated with $data \in \mathbb{R}^{num \times d}$ as
$$acc_{C_1} = \frac{\sum_{i=1}^{num_{C_1}} \delta_1(\hat W^* data_i', thres)}{num_{C_1}}, \qquad acc_{C_2} = \frac{\sum_{i=1}^{num_{C_2}} \delta_2(\hat W^* data_i', thres)}{num_{C_2}},$$
where $\delta_1$ and $\delta_2$ are indicator functions: $\delta_1$ equals 1 when the projected value is less than the mean threshold value and 0 otherwise, and $\delta_2$ is exactly the opposite. The experiments are repeated 10 times and the classification accuracy on the GSE2187 data is reported in Table 2.

Table 2: Classification accuracy of GSE2187 data in FDA: training accuracy of $C_i$ ($acc^{C_i}_{tr}$), testing accuracy of $C_i$ ($acc^{C_i}_{ts}$).

Next, we apply Algorithm 2 to a multi-classification problem where the features are regarded as one view $X$ and the class labels as another view $Y$. Two real datasets from the gene expression database [7] are considered as data-sensitive and are dispersed evenly among the local servers for convenience. Each dataset is locally divided into train data and test data with the same data dimension $d$. The details are explained below and the statistics can be found in Table 3.

• Lymphoma: 42 samples of diffuse large B-cell lymphoma, 9 observations of follicular lymphoma, and 11 cases of chronic lymphocytic leukemia.

• SRBCT: the filtered dataset of 2308 gene expression profiles for 4 types of small round blue cell tumors of childhood.
Table 3: Gene data structures: data dimension ($d$), number of data ($num$), number of training data ($tr_{num}$), number of testing data ($ts_{num}$), number of classes ($K$).

Type | Data | $K$ | $d$ | $num$ | $tr_{num}$ | $ts_{num}$
Gene Data | Lymphoma | 3 | 4026 | 62 | 48 | 14
Gene Data | SRBCT | 4 | 2308 | 63 | 48 | 15

The classification accuracy is defined as
$$acc = \frac{\sum_{i=1}^{num} \delta(ol_i, pl_i)}{num},$$
where $\delta(ol_i, pl_i)$ is the indicator function that equals 1 when the obtained label $ol_i$ equals the provided label $pl_i$ and 0 otherwise. Considering the limited number of total data, the maximum $N$ is set to 8. The experiments on the two gene datasets are repeated 20 times each, and the detailed mean accuracy is reported in Table 4. We compare the accuracy results $acc_{tr}$ and $acc_{ts}$ obtained by Algorithm 2 with those obtained by DGEP-PM to demonstrate the advantage of Algorithm 2 on the gene datasets.

When $N = 1$, without the communication error, the results reveal that DGEP-PM, the distributed CCA in GEP form without an eigengap, leads to worse accuracy: no higher than 72% for training and 68% for testing on Lymphoma, and no higher than 41% for training and 36% for testing on SRBCT. The performance of Algorithm 2 in Table 4 is clearly higher. From the perspective of the number of local servers $N$, the accuracy decreases as $N$ goes up; the divergence of the covariance in the distributed setting influences the classification accuracy on the gene data, especially on SRBCT. Nevertheless, Algorithm 2 consistently improves the effectiveness compared with DGEP-PM.

Table 4: Classification accuracy of Gene data: number of local servers ($N$), training and testing accuracy of Alg. 2 ($acc_{tr}$, $acc_{ts}$).
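To make the multi-classification pipeline concrete, the sketch below (ours; the paper does not spell out the prediction rule, so the nearest-projected-centroid step is an assumption) encodes classes as a one-hot view and classifies by proximity in the projected feature space:

```python
import numpy as np

def one_hot_view(labels, K):
    """Build the label view Y (K x n) from integer class labels."""
    Y = np.zeros((K, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def predict(W1, X_train, labels_train, X_test, K):
    """Assign each test column of X_test (features x samples) to the class
    whose projected training centroid is nearest; a hypothetical rule."""
    Z_train, Z_test = W1.T @ X_train, W1.T @ X_test
    centroids = np.stack([Z_train[:, labels_train == c].mean(axis=1)
                          for c in range(K)], axis=1)       # k x K
    dists = ((Z_test[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
    return dists.argmin(axis=0)                             # predicted labels
```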
7. Conclusion
The paper proposes a general distributed algorithm for the generalized eigenvalue problem (GEP) with one-shot communication. For multi-view analyses in DGEP such as distributed CCA, the plain algorithm meets a convergence barrier because the approximated covariance matrix has no eigengap; the proposed one-shot distributed multiple CCA algorithm solves this problem. The theoretical analysis of the approximation error reveals the divergence of the data covariance in the distributed system and gives an upper bound in terms of the eigenvalues of the data covariance and the number of local servers. As a one-shot method, the proposed algorithms sacrifice computation efficiency for communication efficiency. Extensive numerical experiments demonstrate the effectiveness of the proposed algorithms in different applications.

References

[1] Balcan, M.F., Kanchanapally, V., Liang, Y., Woodruff, D., 2014. Improved distributed principal component analysis, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, MIT Press, Cambridge, MA, USA, pp. 3113–3121.
[2] Beck, A., Teboulle, M., 2010. On minimizing quadratically constrained ratio of two quadratic functions. Journal of Convex Analysis 17.
[3] Bertrand, A., Moonen, M., 2015. Distributed canonical correlation analysis in wireless sensor networks with application to distributed blind source separation. IEEE Transactions on Signal Processing 63, 4800–4813.
[4] Bosner, N., 2020. Parallel reduction of four matrices to condensed form for a generalized matrix eigenvalue algorithm. Numerical Algorithms. URL: https://doi.org/10.1007/s11075-020-00883-z.
[5] de Cock, M., Dowsley, R., Nascimento, A.C., Newman, S.C., 2015. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data, in: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
[6] Dankar, F.K., 2015. Privacy preserving linear regression on distributed databases. Transactions on Data Privacy 8, 3–28.
[7] Dettling, M., 2004. Bagboosting for tumor classification with gene expression data. Bioinformatics 20(18), 3583–3593.
[8] Duan, M., Li, K., Liao, X., Li, K., 2018. A parallel multiclassification algorithm for big data using an extreme learning machine. IEEE Transactions on Neural Networks and Learning Systems 29, 2337–2351.
[9] Falk, S., Langemeyer, P., 1960. Das Jacobische Rotations-Verfahren für reell symmetrische Matrizenpaare I, II. Elektronische Datenverarbeitung, 30–43.
[10] Fan, J., Wang, D., Wang, K., Zhu, Z., 2017. Distributed estimation of principal eigenspaces. Annals of Statistics 47.
[11] Garber, D., Shamir, O., Srebro, N., 2017. Communication-efficient algorithms for distributed stochastic principal component analysis, in: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia, pp. 1203–1212. URL: http://proceedings.mlr.press/v70/garber17a.html.
[12] Gascón, A., Schoppmann, P., Balle, B., Raykova, M., Doerner, J., Zahur, S., Evans, D., 2017. Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing Technologies 2017, 345–364.
[13] Ge, J., Wang, Z., Wang, M., Liu, H., 2018. Minimax-optimal privacy-preserving sparse PCA in distributed systems, in: 21st International Conference on Artificial Intelligence and Statistics, AISTATS 2018, pp. 1589–1598.
[14] Golub, G.H., 1983.
Matrix Computations.
[15] Grammenos, A., Mendoza-Smith, R., Mascolo, C., Crowcroft, J., 2019. Federated PCA with adaptive rank estimation. CoRR abs/1907.08059. URL: http://arxiv.org/abs/1907.08059.
[16] Hardoon, D.R., Szedmak, S., Shawe-Taylor, J., 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16, 2639–2664.
[17] Hari, V., 1984. On Cyclic Jacobi Methods for the Positive Definite Generalized Eigenvalue Problem. Ph.D. thesis. FernUniversität-Gesamthochschule, Hagen.
[18] Hotelling, H., 1936. Relations between two sets of variates. Biometrika 28, 321–377.
[19] Höskuldsson, A., 1988. PLS regression methods. Journal of Chemometrics 2, 211–228.
[20] Jolliffe, I.T., 1986. Principal Component Analysis.
[21] Klema, V., Laub, A., 1980. The singular value decomposition: Its computation and some applications. IEEE Transactions on Automatic Control 25, 164–176.
[22] Li, X., Wang, S., Chen, K., Zhang, Z., 2020. Communication-efficient distributed SVD via local power iterations. arXiv:2002.08014.
[23] Moler, C.B., Stewart, G.W., 1973. An algorithm for generalized matrix eigenvalue problems. SIAM Journal on Numerical Analysis 10, 241–256. URL: https://doi.org/10.1137/0710024.
[24] Paige, C.C., 1986. Computing the generalized singular value decomposition. SIAM Journal on Scientific and Statistical Computing 7, 1126–1146. URL: https://doi.org/10.1137/0907077.
[25] Sameh, A., Tong, Z., 2000. The trace minimization method for the symmetric generalized eigenvalue problem. Journal of Computational and Applied Mathematics 123, 155–175.
[26] Sugiyama, M., 2007. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research 8, 1027–1061.
[27] Sundararajan, A., Hu, B., Lessard, L., 2017. Robust convergence analysis of distributed optimization algorithms, in: 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1206–1212.
[28] Tan, K.M., Wang, Z., Liu, H., Zhang, T., 2018. Sparse generalized eigenvalue problem: optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society Series B (Statistical Methodology) 80, 1057–1086.
[29] Tropp, J.A., 2012. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12, 389–434.
[30] Van Loan, C.F., 1975. A general matrix eigenvalue algorithm. SIAM Journal on Numerical Analysis 12, 819–834. URL: https://doi.org/10.1137/0712061.
[31] Wang, S., Gittens, A., Mahoney, M.W., 2017. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. arXiv:1702.04837.
[32] Wedin, P.Å., 1972. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12, 99–111. URL: https://doi.org/10.1007/BF01932678.
[33] Xiao, L., Boyd, S., 2006. Optimal scaling of a gradient method for distributed resource allocation. Journal of Optimization Theory and Applications 129, 469–488. URL: https://doi.org/10.1007/s10957-006-9080-1.