Learnable Graph-regularization for Matrix Decomposition
Penglong Zhai and Shihua Zhang*
Abstract—Low-rank approximation models of data matrices have become important machine learning and data mining tools in many fields including computer vision, text mining, bioinformatics and many others. They allow for embedding high-dimensional data into low-dimensional spaces, which mitigates the effects of noise and uncovers latent relations. In order to make the learned representations inherit the structures in the original data, graph-regularization terms are often added to the loss function. However, the prior graph construction often fails to reflect the true network connectivity and the intrinsic relationships. In addition, many graph-regularized methods fail to take the dual spaces into account. Probabilistic models are often used to model the distribution of the representations, but most previous methods assume that the hidden variables are independent and identically distributed for simplicity. To this end, we propose a learnable graph-regularization model for matrix decomposition (LGMD), which builds a bridge between graph-regularized methods and probabilistic matrix decomposition models. LGMD learns two graphical structures (i.e., two precision matrices) in real-time in an iterative manner via sparse precision matrix estimation and is more robust to noise and missing entries. Extensive numerical results and comparison with competing methods demonstrate its effectiveness.
Index Terms—Matrix decomposition, dual graph regularization, probabilistic model, matrix normal distribution, sparse precision estimation, unsupervised learning
• Penglong Zhai and Shihua Zhang are with the NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, the School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, and the Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China. *To whom correspondence should be addressed. Email: [email protected].

INTRODUCTION

Many data of diverse fields are redundant and noisy. Extracting useful information from such primitive data directly is often infeasible. Therefore, how to discover compact and meaningful representations of high-dimensional data is a fundamental problem. Researchers have developed many powerful methods to address this issue from different views. Matrix decomposition approaches have been successfully applied to various fields in the past two decades [1], [2], [3]. A classical low-rank matrix decomposition is achieved by minimizing a loss or error function between an observed measurement matrix and a bilinear factorization. Specifically, for a given data matrix $Y \in \mathbb{R}^{n \times p}$, its low-rank factorization model is written as follows

$Y = XW^T + E$   (1)

where $X \in \mathbb{R}^{n \times k}$, $W \in \mathbb{R}^{p \times k}$ and $E$ is the noise matrix. There are mainly two ways to obtain better representations. One is to model the noise matrix $E$ of the original data and the other is to model the subspace structures of the representations or factor matrices $X$ and $W$.

Principal component analysis (PCA) is one of the most fundamental and widely used matrix decomposition methods, which has become a standard tool for dimension reduction and feature extraction in various fields such as signal processing [4], human face recognition [5], and gene expression analysis [6]. PCA obtains a low-dimensional representation for high-dimensional data in an $l_2$-norm sense. However, it is known that the conventional PCA is sensitive to gross errors. To remove the effect of sparse gross errors, robust PCA (RPCA) [7], [8], [9] has been proposed. RPCA seeks to decompose the data matrix as a superposition of a low-rank matrix with a sparse matrix. The sparse matrix captures the gross errors and enables RPCA to recover the underlying low-rank representation of the primitive data with vast applications [9], [10], [11], [12].

Tipping and Bishop [13] originally presented that PCA can be derived from a Gaussian latent variable model (named as the probabilistic PCA, PPCA), which became the beginning of probabilistic decomposition models. PPCA assumes that the underlying distribution of noise is the independent identical distribution (IID) of Gaussian. Its maximum likelihood estimation leads to a PCA solution. This probabilistic framework allows various extensions of PCA and matrix decomposition, and becomes a natural way to model different types of noise or error of data. Different prior distributions have been introduced into this framework. For example, Zhao and Jiang [14] used the multivariate Student-t distribution to model the noise (tPPCA). Wang et al. [15] introduced the Laplace distribution to model the noise (RPMF), which is more robust to outliers. Li and Tao [16] employed the exponential family distributions to handle general types of noise. Zhao et al. [17] introduced the mixture of Gaussians to model the complex noise. However, most of those variants assume the underlying distribution of noise is IID, which rarely holds in reality. More recently, Zhang et al.
[18] proposed a novel probabilistic model, which places a matrix normal prior on the noise to achieve graphical noise modeling.

On the other hand, a number of studies have been developed to model the representation structures by applying diverse regularization terms to the loss functions [19], [20], [21]. A common assumption is that if two data points are close in the original space, their representations should be close to each other too. To preserve the local geometrical structure in representation, graph regularizers have been imposed. Gao et al. [22] proposed a sparse coding method to exploit the dependence among the feature space by constructing a Laplacian matrix. Zheng et al. [23] also proposed a graph-regularized sparse coding model to learn the sparse representations that explicitly takes into account the local manifold structure of the data. More recently, Yin et al. [24] proposed a low-rank representation method to consider the geometrical structure in both feature and sample spaces. As many measurements in experiments are naturally non-negative, some non-negative matrix factorization variants with graph-regularization have also been developed [20], [25]. Nevertheless, the graph construction is rather arbitrary. It may not accurately reflect the true network connectivity and the intrinsic relationships between data entities (e.g., negative conditional dependency). Probabilistic models on matrix decomposition have also been tried to model the structure of the representations by placing different priors on X and/or W [26], [27], [28]. However, in previous probabilistic models, both sample and feature spaces of hidden representations are always assumed to be IID, which rarely holds in real-world scenarios.

To this end, we propose a matrix decomposition model with learnable graph-regularization (LGMD) to obtain the low-rank representation and the structure of the underlying latent variables simultaneously. The key idea is to model the latent variables with matrix normal distributions. It enables us to explore the structures of the latent variables in both sample and feature spaces in real-time and better estimate graphs with the hidden information extracted from the data. The parameters of LGMD can be estimated with a hybrid algorithm of alternating least squares (ALS) and sparse precision matrix estimation methods under the block coordinate descent framework. Extensive experiments on various data show the effectiveness of LGMD by considering the structure of the underlying sample and feature manifolds. In short, LGMD can obtain a better low-rank representation and a better restoration of the original data with the learned sample and feature structures.

RELATED WORK
Low-rank matrix decomposition is a large class of methods to achieve the low-rank approximation of a given data matrix. The conventional matrix decomposition models are based on the assumption that the data matrices are contaminated stochastically with diverse types of noises and the low-rank matrices are deterministic with unknown parameters. Thus, the point estimations of low-rank components can be obtained by maximum likelihood estimation or maximum a posteriori estimation. A prominent advantage of the aforementioned point estimation methods is that they are simple and easy to implement. However, we cannot obtain the probability distributions of the low-rank matrices, which are a prerequisite in exploring generative models. In the past two decades, a variety of probabilistic models of low-rank matrix decomposition have been developed. The most significant difference between low-rank matrix decomposition methods and their corresponding probabilistic models is that the latter treat the low-rank components as random variables. These probabilistic models have been widely applied in the fields of signal and image processing, computer vision, bioinformatics and so on.

Tipping and Bishop [13] originally presented PPCA by assuming the latent variables follow the unit isotropic Gaussian distribution. Specifically, the generative model of PPCA is

$Y = XW^T + E$   (2)

where $X \in \mathbb{R}^{n \times k}$, $W \in \mathbb{R}^{p \times k}$ and $E_{ij} \overset{\mathrm{IID}}{\sim} \mathcal{N}(0, \sigma^2)$. The negative log-likelihood is as follows

$\frac{1}{2\sigma^2}\left\|Y - XW^T\right\|_F^2 + \frac{np}{2}\log\sigma^2$   (3)

where $\sigma$ is the standard deviation of the noise and $\|\cdot\|_F$ denotes the Frobenius norm. If we treat $\sigma$ as a constant, the objective function Eq. (3) is equivalent to the minimization of the reconstruction error in PCA. PPCA further places the standard Gaussian $\mathcal{N}(0, I)$ prior on each column of $X$ and derives the maximum likelihood estimation of $W$ and $\sigma$.

The goal of PPCA is not to give better results than PCA, but to permit a broad range of future extensions by facilitating various probabilistic techniques and introducing different assumptions on prior distributions. For example, Bishop [29] developed a variational formulation of Bayesian PCA, which can automatically determine the number of retained principal components. Zhao and Jiang [14] proposed tPPCA which assumes that data are sampled from the multivariate Student-t distribution. Probabilistic matrix factorization (PMF) [26] assumes that the $X_{ik}$ are independent and identically distributed and so are the $W^T_{kj}$, and variational Bayesian PMF [27] assumes the entries from different columns of $X$ or $W$ have different variances. Bayesian PMF [28] further generalizes PMF by assuming that columns of $X$ or $W$ are multivariate Gaussian distributed and the rows are independent and identically distributed. Among all of these probabilistic matrix factorization models, PMF can be a typical case. Similar to PPCA, it follows the same generative model, while placing isotropic Gaussian priors on $X$, $W$ and the noise:

$E_{ij} \overset{\mathrm{IID}}{\sim} \mathcal{N}(0, \sigma^2), \quad X_{ik} \overset{\mathrm{IID}}{\sim} \mathcal{N}(0, \sigma_X^2), \quad W^T_{kj} \overset{\mathrm{IID}}{\sim} \mathcal{N}(0, \sigma_W^2)$

Its negative log-posterior over the observed data is

$\frac{1}{2\sigma^2}\left\|Y - XW^T\right\|_F^2 + \frac{\lambda_X}{2}\|X\|_F^2 + \frac{\lambda_W}{2}\|W\|_F^2$   (4)

where $\lambda_X = \sigma^2/\sigma_X^2$ and $\lambda_W = \sigma^2/\sigma_W^2$. Most of these methods assume that the underlying distributions of both sample and feature spaces are independent and identically distributed for simplicity.
However, the structures of the sample manifold and feature manifold might be complicated and nonlinear, which are often ignored in previous probabilistic models. To address this challenge, Zhang et al. [18] recently proposed a novel probabilistic model on matrix decomposition by placing the matrix normal prior on the noise to explore the structures of sample and feature spaces. However, their focus is on graphical noise modeling rather than on the latent variables.

Fig. 1: Illustration of the Laplacian and precision matrices based on a data with 250 samples [30].
Graph-regularized matrix decomposition (GRMD) methods obtain the low-rank representation of the primitive data and preserve the local geometrical structure by graph regularizers to some extent. For example, Zheng et al. [23] proposed a graph-regularized sparse coding method for image representation. It has the following form

$\min \left\|Y - XW^T\right\|_F^2 + \eta\,\mathrm{tr}(X^T L X) + \rho\|X\|_1$   (5)

where $Y \in \mathbb{R}^{n \times p}$ is the data matrix, $\eta > 0$, $\rho > 0$, $X$ is the low-dimensional representation of images, $L \in \mathbb{R}^{n \times n}$ is the Laplacian matrix, and the $l_1$-norm regularizer encourages $X$ to be sparse. One can construct a binary graph matrix $G$ by the $k$-NN algorithm

$G_{ij} = 1$ if $x_i$ is a neighbor of $x_j$, and $G_{ij} = 0$ otherwise.   (6)

The Laplacian matrix is defined as $L = D - G$, where $D = \mathrm{diag}(d_1, d_2, \cdots, d_n)$ with $d_j = \sum_{i=1}^{n} G_{ij}$ is the degree matrix. Then the graph regularizer can be written as

$\mathrm{tr}(X^T L X) = \frac{1}{2}\sum_{ij}\|x_i - x_j\|^2 G_{ij}$   (7)

and it encourages neighbors in the original space to be neighbors in the sparse representation $X$.

Recent studies have shown that not only are the observed data found to lie on a nonlinear low-dimensional manifold, namely the sample manifold, but the features also lie on a manifold, namely the feature manifold. For example, Yankelevsky and Elad [31] proposed a low-rank representation method to consider the geometrical structure in both feature and sample spaces in the following manner

$\min_{X,W} \left\|Y - XW^T\right\|_F^2 + \eta_1\,\mathrm{tr}(X^T L X) + \eta_2\,\mathrm{tr}(W^T L_c W) \quad \text{s.t. } \|x_i\|_0 \le T\ \forall i$   (8)

where $L$ and $L_c$ are the Laplacian matrices of the sample and feature spaces respectively, and $T$ is a parameter to control the sparsity of $x_i$. However, the Laplacian matrix prefers to consider the local geometry, while the precision matrix not only captures the neighbor relationships between the variables but also recovers relationships between variables that are negatively conditionally dependent (Fig. 1) [30]. In addition, conventional graph-regularization is derived based on the original data directly, which does not consider the intrinsic relationships among latent variables.

The precision matrix (i.e., the inverse of the covariance matrix) reveals the conditional correlations between pairs of variables. How to estimate a large precision matrix is a fundamental issue in modern multivariate analysis. Formally, suppose we have $n$ multivariate normal observations of dimension $p$ with covariance $\Sigma$. Let $\Theta = \Sigma^{-1}$ be the precision matrix and $S$ be the empirical covariance matrix; then the problem of precision matrix estimation is to maximize the log-likelihood

$\ln|\Theta| - \mathrm{tr}(S\Theta)$   (9)

where $\Theta$ is a positive definite matrix, and $|M|$ is the determinant of a matrix $M$.

To estimate a large precision matrix, the graphical Lasso model has been developed by employing the sparsity assumption, i.e., many entries in the precision matrix are zeros. Specifically, it adds the $l_1$-norm penalty onto the precision matrix $\Theta$ as follows

$\ln|\Theta| - \mathrm{tr}(S\Theta) - \rho\|\Theta\|_1$   (10)

where $\rho > 0$ controls the sparsity of $\Theta$. This problem has been extensively studied and many algorithms have been proposed [32], [33], [34], [35], [36], [37]. Readers may refer to [38] for a comprehensive review. However, these graphical Lasso algorithms are computationally expensive for large-scale problems. To address this, a fast heuristic method based on simply thresholding the sample covariance matrix has been developed [39], [40].
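For illustration, the $l_1$-penalized estimator of Eq. (10) can be computed with scikit-learn's GraphicalLasso; the data, penalty value and variable names in this sketch are purely illustrative and not taken from the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Toy data: n observations of a p-dimensional variable (illustrative values).
rng = np.random.default_rng(0)
n, p = 200, 30
data = rng.standard_normal((n, p))

# Fit the l1-penalized precision matrix of Eq. (10); alpha plays the role of rho.
model = GraphicalLasso(alpha=0.1)
model.fit(data)

Theta = model.precision_           # sparse estimate of the precision matrix
S = np.cov(data, rowvar=False)     # empirical covariance used in the likelihood

# Penalized log-likelihood of Eq. (10), with the l1 penalty on the off-diagonal entries.
off_diag = Theta - np.diag(np.diag(Theta))
loglik = np.linalg.slogdet(Theta)[1] - np.trace(S @ Theta) - 0.1 * np.abs(off_diag).sum()
print(Theta.shape, loglik)
```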
Furthermore, it has been proved that there is an explicit closed-form solution under some less stringent conditions when the graphs are sufficiently sparse.

Fig. 2: Illustration of (a) PMF and (b) LGMD.
We propose the LGMD model to simultaneously consider the structures of both the sample and feature manifolds (Fig. 2b). Furthermore, we develop a block-coordinate optimization scheme based on the iterative updating rules of two low-rank factor matrices and the estimation of their corresponding precision matrices to solve it.
Let us first recall the generative model of PPCA (Fig. 2a)

$Y = XW^T + E$   (11)

The underlying noise matrix $E$ is assumed to be IID Gaussian. Thus, the conditional distribution over the observations can be defined as

$p(Y \mid X, W, \sigma^2) = \prod_{i=1}^{n}\prod_{j=1}^{p}\mathcal{N}(Y_{ij} \mid X_i W_j^T, \sigma^2)$   (12)

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is the probability density function of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. In order to capture the distribution of the latent variables, a naive approach is to simply assume that the latent variables follow a multivariate Gaussian distribution, i.e., $\mathrm{vec}(X) \sim \mathcal{N}(0, \Sigma_X)$ and $\mathrm{vec}(W) \sim \mathcal{N}(0, \Sigma_W)$. However, $X$ and $W$ consist of $k(n+p)$ variables in total, and the corresponding covariance matrices are of size $nk \times nk$ and $pk \times pk$ respectively, which are too huge to estimate.

A more reasonable assumption is that the latent variables are correlated in the sample and feature spaces, respectively. Thus, we place matrix Gaussian priors over $X$ and $W$ as follows

$p(X \mid A) = \mathcal{MN}_{n,k}(X \mid 0, A^{-1}, \sigma_X^2 I), \quad p(W^T \mid B) = \mathcal{MN}_{k,p}(W^T \mid 0, \sigma_W^2 I, B^{-1})$   (13)

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{p \times p}$ are the among-sample and among-feature inverse covariance matrices, i.e., precision matrices, respectively. Additionally, the columns of $X$ and $W$ are assumed independent and identically distributed with variances $\sigma_X^2$ and $\sigma_W^2$ respectively. Note that the matrix normal is the same as the multivariate normal distribution if and only if

$\mathrm{vec}(X) \sim \mathcal{N}_{nk}(0, A^{-1} \otimes \sigma_X^2 I), \quad \mathrm{vec}(W^T) \sim \mathcal{N}_{kp}(0, \sigma_W^2 I \otimes B^{-1})$   (14)

where $\otimes$ denotes the Kronecker product. Therefore, the log of the posterior distribution over $X$ and $W$ is given by

$\ln p(X, W \mid Y, A, B, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\sum_{j=1}^{p}(Y_{ij} - X_i W_j^T)^2 - \frac{1}{2\sigma_X^2}\mathrm{tr}(X^T A X) + \frac{k}{2}\ln|A| - \frac{1}{2\sigma_W^2}\mathrm{tr}(W^T B W) + \frac{k}{2}\ln|B| + C$

where $|M|$ is the determinant of a matrix $M$, and $C$ is a constant that does not depend on the parameters. Maximizing the log-posterior with fixed hyperparameters (noise variance and prior variances) is equivalent to minimizing the sum-of-squared-errors function with quadratic regularization terms

$L = \frac{1}{2}\left\|Y - XW^T\right\|_F^2 + \frac{\lambda_1}{2}\left(\left\|A^{\frac{1}{2}}\hat{X}\right\|_F^2 - k\ln|A|\right) + \frac{\lambda_2}{2}\left(\left\|\hat{W}^T B^{\frac{1}{2}}\right\|_F^2 - k\ln|B|\right)$   (15)

where $\lambda_1 = \sigma^2/\sigma_X^2$ and $\lambda_2 = \sigma^2/\sigma_W^2$. $\hat{X}$ and $\hat{W}$ are $X$ and $W$ normalized by dividing their standard deviations respectively, which will be denoted as $X$ and $W$ for simplicity in the rest of this paper. Under the matrix normal distribution assumption, we reduce the number of parameters of the covariance from $(n^2 + p^2)k^2$ to $n^2 + p^2$.
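To make the matrix normal prior of Eqs. (13)-(14) concrete, the following sketch draws $X \sim \mathcal{MN}_{n,k}(0, A^{-1}, \sigma_X^2 I)$ without ever forming the $nk \times nk$ Kronecker covariance. The tridiagonal precision matrix used here is a made-up example, and the sampler assumes the convention that the first covariance argument acts on the rows of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma_x = 50, 5, 1.0

# A hypothetical sparse, positive-definite among-sample precision matrix A (illustration only).
A = np.eye(n) + 0.3 * np.diag(np.ones(n - 1), 1) + 0.3 * np.diag(np.ones(n - 1), -1)

# Draw X ~ MN_{n,k}(0, A^{-1}, sigma_x^2 I): if Z has iid N(0,1) entries and
# L L^T = A^{-1}, then X = sigma_x * L Z has the desired matrix normal law.
L = np.linalg.cholesky(np.linalg.inv(A))
Z = rng.standard_normal((n, k))
X = sigma_x * (L @ Z)

# Sanity check: X @ X.T / k approaches sigma_x**2 * inv(A) as k grows.
print(X.shape)
```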
However, it is still infeasible to minimize the negative log-likelihood in Eq. (15), because the number of free parameters in $A$ and $B$ grows quadratically with $n$ and $p$, respectively. More importantly, the underlying graphical structure of the sample and feature representations should be very sparse. Thus, we impose sparse constraints onto the precision matrices $A$ and $B$ by introducing $l_1$-norm regularizations

$L = \frac{1}{2}\left\|Y - XW^T\right\|_F^2 + \frac{\lambda_1}{2}\left(\left\|A^{\frac{1}{2}} X\right\|_F^2 - k\ln|A| + \eta_1'\|A\|_{1,\mathrm{off}}\right) + \frac{\lambda_2}{2}\left(\left\|W^T B^{\frac{1}{2}}\right\|_F^2 - k\ln|B| + \eta_2'\|B\|_{1,\mathrm{off}}\right)$   (16)

where $\|\cdot\|_{1,\mathrm{off}}$ represents the off-diagonal $l_1$-norm, and $\eta_1'$ and $\eta_2'$ control the sparsity of $A$ and $B$, respectively. When $\eta_1'$ and $\eta_2'$ are too large, $A$ and $B$ turn to be diagonal. Thus, PMF can be viewed as a special case of LGMD.

Fig. 3: Change of the graph Laplacian with the increase of the noise level and the percentage of missing entries. The blue and red lines represent the change of edge numbers under different noise levels $\sigma_n$ with respect to (w.r.t.) the data standard deviation (STD) $\sigma_d$ and missing entry percentages, respectively.

A common assumption in GRMD is that the latent variables $X$ and $W$ have the same structures as the original matrix data $Y$. Dual GRMD (dGRMD) considers the following formula

$\left\|Y - XW^T\right\|_F^2 + \lambda_1\,\mathrm{tr}(X^T L X) + \lambda_2\,\mathrm{tr}(W^T L_c W)$   (17)

where $L$ and $L_c$ are the graph Laplacians of the sample space and feature space respectively, and they both are computed from the original data matrix $Y$ or the prior graph structure. However, this is often not easy when the data is relatively noisy and incomplete. As we can see in Eq. (16), when $A$ and $B$ are assumed to be known, LGMD becomes the conventional dGRMD with the graphs being the precision matrices of $X$ and $W$ respectively.

Another application of LGMD is matrix completion, which is the task of filling in the missing entries of a partially observed matrix. One classical example is the movie-rating matrix such as the Netflix problem. Given a rating matrix in which each entry $(i, j)$ represents the rating of movie $j$ by customer $i$ if customer $i$ has rated movie $j$ and is otherwise missing, we want to predict the remaining entries in order to make good recommendations to customers on what to watch next. A common method for matrix completion is based on GRMD as follows

$\sum_{i,j} O_{ij}(Y_{ij} - X_i W_j^T)^2 + \lambda_1\,\mathrm{tr}(X^T L X) + \lambda_2\,\mathrm{tr}(W^T L_c W)$

where $O_{ij}$ is an indicator matrix of $Y$, i.e., $O_{ij} = 1$ if $Y_{ij}$ is observed and 0 otherwise. The construction of the graph based on the partially observed data is a problem. Both data noise and missing entries have a great impact on its estimation (Fig. 3).

Fig. 4: The number of learned edges based on the latent variables $X$ versus the iterations.

In order to reduce the influence of the bias of the graph estimation, LGMD attempts to learn and modify the graphs during the iterations. We know that even without the regularization terms, minimizing the objective function $\min_{X,W}\|Y - XW^T\|_F^2$ can make $X$ and $W$ automatically learn the structure of the original data to some extent (Fig. 4), which can be used to modify the initial estimated graphs and accelerate the decomposition.
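Before turning to the optimization scheme, a small helper that evaluates the objective of Eq. (16) may help fix the notation. This is only an illustrative NumPy sketch under the assumption that $\|A^{1/2}X\|_F^2 = \mathrm{tr}(X^T A X)$ for positive definite $A$; it is not the authors' implementation.

```python
import numpy as np

def lgmd_objective(Y, X, W, A, B, lam1, lam2, eta1, eta2):
    """Evaluate the LGMD loss of Eq. (16); an illustrative sketch only."""
    recon = 0.5 * np.linalg.norm(Y - X @ W.T, "fro") ** 2
    k = X.shape[1]
    off = lambda M: M - np.diag(np.diag(M))   # off-diagonal part for the l1 penalty
    # tr(X^T A X) equals ||A^{1/2} X||_F^2 when A is positive definite (same for B).
    pen_A = np.trace(X.T @ A @ X) - k * np.linalg.slogdet(A)[1] + eta1 * np.abs(off(A)).sum()
    pen_B = np.trace(W.T @ B @ W) - k * np.linalg.slogdet(B)[1] + eta2 * np.abs(off(B)).sum()
    return recon + 0.5 * lam1 * pen_A + 0.5 * lam2 * pen_B
```

With $A$ and $B$ held fixed, only the reconstruction and trace terms vary, which is exactly the subproblem solved next.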
Clearly, minimizing the objective function in Eq. (16) is a matrix optimization problem involving four matrix variables $X$, $W$, $A$ and $B$. $X$ and $W$ are controlled by the graphs $A$ and $B$, while $A$ and $B$ are estimated based on $X$ and $W$ under the sparsity assumption. It is easy to verify that the function in Eq. (16) is convex with respect to each matrix variable, respectively. Thus, a local minimum can be found by block coordinate descent [41], [42] as stated below.

Updating $X$ and $W$: Given $A$ and $B$, one retains the terms involving $X$ and $W$

$L = \frac{1}{2}\left\|Y - XW^T\right\|_F^2 + \frac{\lambda_1}{2}\mathrm{tr}(X^T A X) + \frac{\lambda_2}{2}\mathrm{tr}(W^T B W)$   (18)

LGMD becomes a dGRMD problem with the graphs $A$ and $B$ being the precision matrices of $X$ and $W$. ALS is adopted to obtain the optimal $X$ and $W$ for a local minimum

$X = (YW - \lambda_1 A X)\left(W^T W + \epsilon I\right)^{-1}, \quad W^T = \left(X^T X + \epsilon I\right)^{-1}\left(X^T Y - \lambda_2 W^T B\right)$   (19)

where $\epsilon$ is a small positive constant to protect the matrix inverses from being singular. $\left(X^T X + \epsilon I\right)^{-1}$ and $\left(W^T W + \epsilon I\right)^{-1}$ are pretty small matrices and easy to compute.

Updating $A$ and $B$: A straightforward approach is to optimize $A$ and $B$ iteratively as follows

$\hat{A} = \arg\min_A\ \mathrm{tr}(A S_1) - \log|A| + \eta_1\|A\|_1, \quad \hat{B} = \arg\min_B\ \mathrm{tr}(B S_2) - \log|B| + \eta_2\|B\|_1$   (20)

where $\eta_1 = \eta_1'/k$ and $\eta_2 = \eta_2'/k$, and $S_1 = XX^T/k$ and $S_2 = WW^T/k$ are the sample and feature space covariance matrices, respectively. Efficient optimization algorithms have been intensively studied, as we described in Section 2.3, for such $l_1$-norm regularized precision matrix estimation problems. We use an explicit closed-form solution based on simply thresholding the covariance matrices, denoted as Threshold, as a basic solver [43], [44]. Take $A$ as an example. The approximate closed-form solution for the graphical Lasso is as follows

$A_{ij} = \begin{cases} \frac{1}{\Sigma_{ii}}\left(1 + \sum_{(i,m)\in\mathcal{E}^{\mathrm{res}}}\frac{(\Sigma^{\mathrm{res}}_{im})^2}{\Sigma_{ii}\Sigma_{mm} - (\Sigma^{\mathrm{res}}_{im})^2}\right) & \text{if } i = j \\ -\frac{\Sigma^{\mathrm{res}}_{ij}}{\Sigma_{ii}\Sigma_{jj} - (\Sigma^{\mathrm{res}}_{ij})^2} & \text{if } (i,j)\in\mathcal{E}^{\mathrm{res}} \\ 0 & \text{otherwise} \end{cases}$   (21)

where $\Sigma$ is the empirical covariance matrix and $\Sigma^{\mathrm{res}}$ is short for $\Sigma^{\mathrm{res}}(\lambda)$, which is defined as the residual of $\Sigma$ relative to $\lambda$. The $(i,j)$-th entry of $\Sigma^{\mathrm{res}}(\lambda)$ equals $\Sigma_{ij} - \lambda \times \mathrm{sign}(\Sigma_{ij})$ if $i \ne j$ and $|\Sigma_{ij}| > \lambda$, and 0 otherwise. $\mathcal{E}^{\mathrm{res}}$ is the support graph of $\Sigma^{\mathrm{res}}$, which includes the edge $(i,j)$ if $\Sigma^{\mathrm{res}}_{ij} \ne 0$. This formula can be applied to solve $A$ and $B$:

$A = \mathrm{Threshold}(S_1, \eta_1), \quad B = \mathrm{Threshold}(S_2, \eta_2)$   (22)

The inference scheme is summarized in Algorithm 1.
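A direct NumPy sketch of the Threshold solver of Eqs. (21)-(22) is given below. It follows the formula as reconstructed above and should be treated as an illustration rather than a verified reimplementation of [43], [44].

```python
import numpy as np

def soft_threshold_residual(Sigma, lam):
    """Sigma^res(lam): soft-threshold the off-diagonal entries of the covariance."""
    R = np.sign(Sigma) * np.maximum(np.abs(Sigma) - lam, 0.0)
    np.fill_diagonal(R, 0.0)
    return R

def threshold_precision(Sigma, lam):
    """Approximate closed-form graphical Lasso solution of Eq. (21)."""
    p = Sigma.shape[0]
    R = soft_threshold_residual(Sigma, lam)
    d = np.diag(Sigma)
    denom = np.outer(d, d) - R ** 2          # Sigma_ii * Sigma_jj - (Sigma^res_ij)^2
    A = np.zeros((p, p))
    mask = R != 0                            # support graph E^res
    A[mask] = -R[mask] / denom[mask]         # off-diagonal entries on the support
    for i in range(p):                       # diagonal entries
        s = np.sum(R[i, mask[i]] ** 2 / denom[i, mask[i]])
        A[i, i] = (1.0 + s) / d[i]
    return A
```

With this helper, the updates of Eq. (22) read, for instance, `A = threshold_precision(X @ X.T / k, eta1)` and `B = threshold_precision(W @ W.T / k, eta2)`.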
Post processing: Given any invertible matrix $S$, $XSS^{-1}W^T$ equals $XW^T$. Therefore, $X$ and $W$ are not identifiable. To obtain a unique low-rank representation, we use GPCA [45] to post-process $X$ and $W$. Specifically, given $A^{-1}$ and $B^{-1}$, GPCA solves the following problem

$\min\ \frac{1}{2}\left\|Y - UDV^T\right\|^2_{A^{-1},B^{-1}} \quad \text{s.t. } U^T A^{-1} U = I,\ V^T B^{-1} V = I,\ \mathrm{diag}(D) \ge 0$   (23)

where $\|X\|_{A^{-1},B^{-1}} = \sqrt{\mathrm{tr}(A^{-1} X B^{-1} X^T)}$. It has been proved that GPCA converges to the unique global solution when $A^{-1}$ and $B^{-1}$ are positive definite. Then we set $X = UD$ and $W = V$ as the output.

Algorithm 1 Maximum Regularized Likelihood
Input: data matrix Y, rank k
Output: X, W, A and B
1: Compute the truncated rank-k SVD $Y \approx U_k \Sigma_k V_k^T$
2: Initialize $X = U_k$, $W^T = \Sigma_k V_k^T$, $A = I$, $B = I$
3: repeat
4: update A and normalize A
5: update B and normalize B
6: repeat
7: update X
8: update W
9: until convergence
10: until the change of the objective function value is small enough

Additionally, as mentioned in the last part, a positive off-diagonal entry in the precision matrix implies a negative partial correlation between the two random variables, which is difficult to interpret in some contexts, such as road traffic networks [46], [47], [48]. For such application settings, it is therefore desirable to learn a graph with non-negative weights. The problem has been formulated as [49]

$\max_{\Theta, \sigma^2}\ \ln|\Theta| - \mathrm{tr}(S\Theta) - \rho\|\Theta\|_1 \quad \text{s.t. } \Theta = L + \frac{1}{\sigma^2}I,\ L \in \mathcal{L}$

where $I$ is the identity matrix, $\sigma^2$ is the a priori feature variance, and $\mathcal{L}$ is the set of valid graph Laplacian matrices. The solution was found using the package CVX for solving convex programs [50]. We denote the proposed algorithm estimating graphs with the above formula as LGMD+ and we will apply it to the real-world scenarios for comparison.

Here we briefly discuss the computational complexity of the maximum regularized likelihood algorithm for LGMD. At each iteration of updating $X$, the computational cost, lying in matrix multiplication and inversion, is $O(nk^2 + npk)$. Similarly, the computational cost for updating $W$ is $O(pk^2 + npk)$. Generally, solving $A$ and $B$ is more computationally expensive if we use conventional algorithms. Thus, we adopt the approximate thresholding method, which gives a close enough solution and is less time-consuming than traditional methods. Given two empirical covariance matrices of size $n \times n$ and $p \times p$, the computational complexities of solving $A$ and $B$ are $O(n^2)$ and $O(p^2)$, respectively. In contrast, solving the constrained precision estimation is more time-consuming, as it lacks an effective thresholding algorithm. We will compare the running time in the real-world data experiment.

Fig. 5: The support graphs of the two generated ground-truth precision matrices (a) A and (b) B using a data with 100 samples and 100 features with dimension reduction. The blue and the gray lines represent positive and negative edges, respectively.

In our algorithm, we have four hyperparameters in total, in which $\lambda_1$ and $\lambda_2$ balance the trade-off between the low-rank approximation and structural restoration, while $\eta_1$ and $\eta_2$ control how many edges should be restored. In the experiments, we kept the sparseness of the graphs fixed, since one of our goals is to test whether the learned graphs are informative. We only tune the two parameters $\lambda_1$ and $\lambda_2$. However, it still takes a long time (minutes or hours) to evaluate using the grid-search strategy. Thus, we adopt an effective Bayesian optimization approach, which uses fewer combinations of hyperparameters than the grid-search strategy. Bayesian optimization is best-suited for optimization over continuous domains of fewer than 20 dimensions, and tolerates stochastic noise in function evaluations [51]. We also applied the Bayesian optimization strategy to other methods in the experiments.
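The paper does not name a particular Bayesian optimization package; one possible realization of the described tuning of $\lambda_1$ and $\lambda_2$ is scikit-optimize's gp_minimize, sketched below. The data, search ranges and the stand-in solver (a plain truncated SVD in place of the real LGMD algorithm) are placeholders for illustration only.

```python
import numpy as np
from skopt import gp_minimize

rng = np.random.default_rng(0)
Y_train = rng.standard_normal((100, 80))   # placeholder training matrix
Y_valid = rng.standard_normal((100, 80))   # placeholder validation matrix

def fit_lgmd(Y, rank, lam1, lam2):
    # Placeholder for the actual LGMD solver (Algorithm 1); here a plain truncated SVD.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank].T, np.eye(Y.shape[0]), np.eye(Y.shape[1])

def validation_rmse(Y, X, W):
    return float(np.sqrt(np.mean((Y - X @ W.T) ** 2)))

def objective(params):
    lam1, lam2 = params
    X, W, A, B = fit_lgmd(Y_train, rank=20, lam1=lam1, lam2=lam2)
    return validation_rmse(Y_valid, X, W)

# Log-uniform ranges for the two regularization weights (illustrative choices).
result = gp_minimize(objective,
                     dimensions=[(1e-3, 1e2, "log-uniform"), (1e-3, 1e2, "log-uniform")],
                     n_calls=15, random_state=0)
best_lam1, best_lam2 = result.x
```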
In this section, we demonstrate the effectiveness of LGMDon both synthetic and real-world data and compare it withPCA, PMF and dGRMD. With the synthetic datasets, wedemonstrate the capability of LGMD to improve denoisingperformance, and successfully infer the underlying struc-ture. In addition, we evaluate the ability of LGMD andLGMD+ to recover the true underlying signals from noisyand incomplete samples on two sensor applications includ-ing the temperature data and netflix data. At last, we applyLGMD to real labeled image and text datasets to check itsclustering performance.
For the synthetic experiment, the underlying graphs are known, and the recovered structure by LGMD can be quantitatively assessed. Due to the model complexity and the inherent coupling, these matrices cannot be drawn independently, and need to be designed carefully. We generated two graphs with 100 nodes each, keeping about 6% of the overall edges (Fig. 5), whose adjacency matrices are denoted as $A_{gt}$ and $B_{gt}$, which are symmetric and sparse as assumed. The toy data can be constructed in the following way

$X_{gt} \sim \mathcal{MN}_{n,k}(0, A_{gt}^{-1}, I), \quad W_{gt}^T \sim \mathcal{MN}_{k,p}(0, I, B_{gt}^{-1}), \quad Y_{gt} = X_{gt}W_{gt}^T, \quad Y_{no} = Y_{gt} + E$

where the entries of $E$ follow Gaussian noise with variance $\sigma^2$, $Y_{gt}$ represents the ground truth matrix and the noisy matrix is denoted as $Y_{no}$.

To test the quality of the learned representations, the final performance of each method on each experiment was measured as the average over 30 realizations with the following eight measures

$E_1 = \left\|O \odot (Y_{no} - \tilde{X}\tilde{W}^T)\right\|_F$, $E_2 = \left\|Y_{gt} - \tilde{X}\tilde{W}^T\right\|_F$, $E_3 = \mathrm{corr}(\tilde{X})$, $E_4 = \mathrm{corr}(\tilde{W})$, $E_5 = \mathrm{subspace}(X_{gt}, \tilde{X})$, $E_6 = \mathrm{subspace}(W_{gt}, \tilde{W})$, $E_7 = \mathrm{edge}(A_{gt}, \tilde{A})$, $E_8 = \mathrm{edge}(B_{gt}, \tilde{B})$

where $\tilde{X}, \tilde{W}, \tilde{A}, \tilde{B}$ are the outputs of LGMD. $E_1$ is used to measure the performance of matrix completion and $O$ is an indicator matrix of $Y$, i.e., $O_{ij} = 1$ if $Y_{ij}$ is observed and 0 otherwise. In our experiment, we used a part of the observed entries of $Y_{no}$ for training and the remaining entries (denoted by $\bar{O}$) for prediction. $E_2$ measures the denoising performance. They are compared in terms of the root mean squared error (RMSE). Note that $E_1$ and $E_2$ are commonly used to measure the performance of representation learning. In addition, we also adopt some other meaningful measures to evaluate whether the method recovers the correct subspace. $\mathrm{subspace}(X_1, X_2)$ denotes the angle between the two subspaces spanned by the columns of $X_1$ and $X_2$ respectively. $\mathrm{corr}(X)$ is a vector containing all the Pearson correlation coefficients between any two columns of $X$. Lastly, $\mathrm{edge}(A_1, A_2)$ denotes the number of edges of $A_1$ recovered by $A_2$.

We will analyse the performance of the methods in the next sections in two parts. The first part mainly reports denoising and matrix completion performance ($E_1$, $E_2$), and the second part compares the ability of recovering the hidden space and structure ($E_3$-$E_8$). We use the true graphs as a baseline for comparison (GRMD-real).

TABLE 1: Performance of the recovered subspace angles of X on the small synthetic datasets. The best results are highlighted in bold.
(Columns: increasing noise level σ_n/σ_d)
GRMD-real  0.0887  0.1793  0.2737  0.3725  0.4799  0.5736  0.6671  0.8049  0.9366  1.0151
TABLE 2: Performance of the recovered subspace angles of W on the small synthetic datasets. The best results are highlighted in bold.
(Columns: increasing noise level σ_n/σ_d)
GRMD-real  0.0881  0.1738  0.2613  0.3412  0.4533  0.5882  0.7144  0.8728  0.9852  1.0454
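For completeness, the subspace(·,·) and edge(·,·) measures defined above can be computed as in the sketch below. Taking the largest principal angle and a simple support comparison are our own illustrative choices, not necessarily the exact definitions used by the authors.

```python
import numpy as np
from scipy.linalg import subspace_angles

def subspace(X1, X2):
    """Largest principal angle (radians) between the column spaces of X1 and X2."""
    return float(np.max(subspace_angles(X1, X2)))

def edge(A_true, A_est, tol=1e-8):
    """Number of true off-diagonal edges of A_true whose entries are also present in A_est."""
    off = ~np.eye(A_true.shape[0], dtype=bool)
    true_edges = (np.abs(A_true) > tol) & off
    est_edges = (np.abs(A_est) > tol) & off
    return int(np.sum(true_edges & est_edges))
```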
Fig. 6: Comparison of four methods in terms of RMSE tested on the synthetic dataset. (a) denoising error for different noise levels $\sigma_n$ w.r.t. the data STD $\sigma_d$, and (b) data completion error for different percentages of remaining samples.

The reconstruction error $E_2$ and data completion error $E_1$ measure the performance of the methods on denoising and missing value completion. Clearly, LGMD gets superior performance compared to the three competing methods (Fig. 6). dGRMD is closer to LGMD when the noise is smaller, but the results gradually get worse as the noise increases. This suggests that the learnable graph regularization in LGMD is more robust to noise and missing values.

Here we explore the performance of LGMD on extracting effective representations with $E_5$ and $E_6$, which are important measures for evaluating whether the outputs truly restore the original data space (Table 1 and Table 2). LGMD performs best in all the experiments.

Feature decoupling is always a hot topic and an important measure for evaluating whether the learned representation is more compact and independent [52], [53]. In this experiment, we generate $X$ and $W$ with rank 20, so there are 190 sets of correlation coefficients in total, which are plotted with the increase of noise (Fig. 7). The results demonstrate that LGMD has comparable performance with PCA, and does not significantly get worse as the noise increases, whereas dGRMD and PMF fail to decouple the features.

Fig. 7: Comparison of four methods on the correlation coefficients.

A more reasonable measure is whether the learned representation recovers the structures (edges) in the original data. So we compared the recovered structure performance (Fig. 8), which illustrates the comparison results of the most significant edges from 10% to 30%. The results show that our method performs best in restoring the original edges with the increase of noise level. The more significant an edge is, the more easily it is recovered by our method. Moreover, LGMD still performs best on edges that are not significant enough.

Fig. 8: Comparison of four methods on the recovered structures.

Here, we compared LGMD with several matrix decomposition methods on the measures for reconstruction, completion and representation. The results demonstrate the effectiveness of LGMD. The learnable characteristic of the graphs turns out to be more robust to noise and missing values. Furthermore, through the estimated graphs, we can get a more compact and independent representation and recover more underlying edges.

This experiment aims to test the effectiveness of LGMD for data completion and denoising on real-world datasets. The underlying true graphs are unknown here. We consider two datasets including the Netflix and Temperature data, which
have clear sample and feature structures. This kind of data is very common in many applications. For example, in the movie rating data, the users and movies both have their own structures. Users with similar age, gender, and growth environment tend to have similar ratings on the same movies. Then there should be a connected edge between two such users.

To assess the potential benefit of LGMD for denoising, Gaussian noises of different levels $\sigma_n$ were added to the original data. Since the noises are random and affect the graph estimation, the learnable graph-regularization is more likely to restore the true graph structure. We further evaluate the performance of LGMD for data completion. In practice, missing entries may arise either from a budget-restricted data acquisition, or from faulty sensors. For this scenario, we set the noise level $\sigma_n = 0$ and draw $O$ through randomly subsampling the test signals with various predefined percentages of samples preserved. In addition, we also demonstrate that LGMD+ performs best in terms of cluster accuracy, but costs more time.

Temperature data.
We first consider a dataset of daily temperature measurements collected from January 2018 to October 2019 by N =
150 weather sensors across the mainland United States. The entries in the original data represent the average temperatures given in degrees Fahrenheit measured across the sensor network on a single day. Data completion makes sense for this data because short-term temperature prediction is still problematic. The underlying assumption is that close sensors and close dates will have highly correlated temperatures. From the structure illustration (Fig. 11), we can see that the temperature in spring is similar to that in winter while more different from that in summer. Moreover, from the distribution of the predefined 150 sensors in the United States, sensors with similar temperatures are always distributed in similar places. Making full use of the structure of the data itself can help predict future temperature. The results clearly demonstrate the advantages of LGMD and LGMD+ on temperature prediction and denoising (Fig. 9). LGMD+ performs the best with various noise levels and missing entry percentages.

Netflix data.
Recommendation systems are a hot and difficult problem nowadays. Graph-regularization terms are often used to obtain more meaningful features, in which the graphs are constructed based on scores given by the users. However, the computation of the graphs is often biased because of the sparsity of the data. The Netflix data was used here. This data was collected between October 1998 and December 2005 and represents the distribution of all ratings. The training dataset consists of 100,480,507 ratings from 480,189 randomly-chosen, anonymous users on 17,770 movie titles. 1000 users and 1000 movies were chosen respectively to conduct this experiment for convenience. The numerical results here further demonstrate the superiority of LGMD and LGMD+ (Fig. 10). With the increase of noise level and proportion of missing entries, the performance of PMF and dGRMD gradually becomes worse, while LGMD and LGMD+ show robust performance.

Fig. 9: Comparison of four methods in terms of RMSE tested on the temperature dataset. (a) denoising error for different noise levels $\sigma_n$ w.r.t. the data STD $\sigma_d$, and (b) data completion error for different percentages of remaining samples.

Fig. 10: Comparison of four methods in terms of RMSE tested on the Netflix dataset (the same setting as Fig. 9).

TABLE 3: Comparison of Running Time on the Temperature data
             PMF   dGRMD   LGMD   LGMD+
Denoising    10s   20s     230s   2500s
Completion   80s   80s     500s   3000s

Table 3 reports the running time per 1000 iterations on the denoising and completion tasks on the temperature data. The running time of LGMD is acceptable while LGMD+ is more time-consuming compared to other methods. In short, LGMD not only performs better in denoising, but also outperforms PMF and dGRMD in data completion. Even with relatively high noise levels and severe missing percentages, the performance of LGMD is not seriously degraded compared to other methods, suggesting its robustness against missing data and noise. LGMD and LGMD+ perform better than all the other methods, and LGMD+ is more time-consuming.

Clustering performance is a critical measure for evaluating the learned representations. To demonstrate how the clustering performance can be improved by LGMD, we compare with the following three popular matrix decomposition methods: k-means, PMF and dGRMD. Note that k-means is indeed a special matrix decomposition method where the entries of the coefficient matrix are only 1s or 0s. Two datasets were used in this section.

MNIST data.
The first data set is the MNIST image data consisting of 60,000 gray-scale images of size 32 × 32 of handwritten digits ranging from 0 to 9.
TDT2 data
The second data set is from the NIST Topic Detection and Tracking (TDT2) corpus, which consists of data collected during the first half of 1998 and taken from 6 sources, including 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). Those documents appearing in two or more categories were removed and only the largest 20 categories were kept.

Fig. 11: Illustration of the sample and feature structures of the Temperature data. (a) heatmap of the temperature data of 12 months in 2018. (b) distribution of 150 sensors in the United States and the local temperature on a certain day in 2019.

In order to facilitate the calculation, we only used 2000 samples (Table 4 and Table 5). In order to randomize the experiments, we conduct the evaluations with different cluster numbers. For each given cluster number K, 20 repetitions were conducted on different randomly chosen clusters. The mean and standard error of the performance are reported. We can see that the graph-regularization methods outperform PMF and k-means, suggesting the importance of the geometrical structure in learning the hidden factors. In addition, regardless of the data sets, LGMD and LGMD+ always show the best performance. This shows that by leveraging the power of learnable graph-regularization, LGMD can learn better compact representations.

TABLE 4: Comparison of Clustering Accuracy on MNIST. Columns: nClusters = 10, 9, 8, 7, 6; entries are mean ± standard error (k-means, 10 clusters: 0.530).

TABLE 5: Comparison of Clustering Accuracy on TDT2. Columns: nClusters = 20, 18, 16, 14, 12; entries are mean ± standard error (k-means, 20 clusters: 0.233).
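The clustering accuracies reported in Tables 4 and 5 require matching predicted clusters to ground-truth labels; a standard recipe (not necessarily the authors' exact protocol) is Hungarian matching via scipy.optimize.linear_sum_assignment, sketched below on k-means labels obtained from a learned representation. The representation and labels used here are random stand-ins.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between predicted cluster indices and ground-truth labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / len(y_true)

# Example: cluster a learned representation X_hat (n x k) into 10 groups.
rng = np.random.default_rng(0)
X_hat = rng.standard_normal((300, 20))          # stand-in for the low-rank factor X
labels = rng.integers(0, 10, size=300)          # stand-in for ground-truth classes
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_hat)
print(clustering_accuracy(labels, pred))
```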
In this paper, we propose a novel matrix decompositionmodel LGMD with learnable graph-regularization to modelthe underlying structure in both feature and sample spaces of the latent variables by the matrix normal distribu-tion. Intriguingly, LGMD builds a bridge between graph-regularized methods and probabilistic models of matrix de-composition. It can learn and update the graph knowledgebased on the hidden information extracted from the data.To our knowledge, this is the first study to propose theconcept of learnable graph-regularization and develop aneffective algorithm to solve it. This method demonstratessuperior performance to several competing methods on thesynthetic and real-world data when the effect of the noise ismild and the percentage of missing entries is large. Specif-ically, numerical experiments on synthetic datasets showthat considering the structures of sample and feature spacesof latent variables brings an improvement in recoveringmore structures, denoising and predicting missing entries.LGMD also shows its advantages in feature decouplingand subspace recovering. In addition, we also suggest toconsider the non-negativity of graph structure (LGMD + )in real-world applications with superior performance. Nu-merical experiments on real-world data show that LGMD(and LGMD + ) is competitive or superior to the competingmethods and could obtain better low-rank representations.There are several questions remained to be investigated.First, the estimation of the precision matrix is still inefficientfor very large data. The objective function involving thesquare root of the precision matrix makes it difficult to solve.Second, the estimation of the precision matrix may be notstable during optimization. LGMD + alleviates this problemin some sense but is relatively more time-consuming. R EFERENCES [1] S. Zhang, C.-C. Liu, W. Li, H. Shen, P. W. Laird, and X. J. Zhou,“Discovery of multi-dimensional modules by integrative analysisof cancer genomic data,”
Nucleic Acids Research, vol. 40, no. 19, pp. 9379–9391, 2012.
[2] L. Zhang and S. Zhang, "Learning common and specific patterns from data of multiple interrelated biological scenarios with matrix factorization," Nucleic Acids Research, vol. 47, no. 13, pp. 6606–6617, 2019.
[3] C. Zhang and S. Zhang, "Bayesian joint matrix decomposition for data integration with heterogeneous noise," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] B. Moore, "Principal component analysis in linear systems: Controllability, observability, and model reduction," IEEE Transactions on Automatic Control, vol. 26, no. 1, pp. 17–32, 1981.
[5] P. J. Hancock, A. M. Burton, and V. Bruce, "Face processing: Human perception and principal components analysis," Memory & Cognition, vol. 24, no. 1, pp. 26–40, 1996.
[6] T. Hastie, R. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown, "'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns," Genome Biology, vol. 1, no. 2, pp. research0003–1, 2000.
[7] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization," in Advances in Neural Information Processing Systems, 2009, pp. 2080–2088.
[8] H. Xu, C. Caramanis, and S. Sanghavi, "Robust PCA via outlier pursuit," in Advances in Neural Information Processing Systems, 2010, pp. 2496–2504.
[9] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
[10] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2233–2246, 2012.
[11] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[12] N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, "Robust principal component analysis on graphs," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2812–2820.
[13] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no. 3, pp. 611–622, 1999.
[14] J. Zhao and Q. Jiang, "Probabilistic PCA for t distributions," Neurocomputing, vol. 69, no. 16-18, pp. 2217–2226, 2006.
[15] N. Wang, T. Yao, J. Wang, and D.-Y. Yeung, "A probabilistic approach to robust matrix factorization," in European Conference on Computer Vision. Springer, 2012, pp. 126–139.
[16] J. Li and D. Tao, "Simple exponential family PCA," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 453–460.
[17] Q. Zhao, D. Meng, Z. Xu, W. Zuo, and L. Zhang, "Robust principal component analysis with complex noise," in International Conference on Machine Learning, 2014, pp. 55–63.
[18] C. Zhang, K. Gai, and S. Zhang, "Matrix normal PCA for interpretable dimension reduction and graphical noise modeling," arXiv preprint arXiv:1911.10796, 2019.
[19] W. Min, J. Liu, and S. Zhang, "Group-sparse SVD models and their applications in biological data," arXiv preprint arXiv:1807.10956, 2018.
[20] S. Zhang, Q. Li, J. Liu, and X. J. Zhou, "A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules," Bioinformatics, vol. 27, no. 13, pp. i401–i409, 2011.
[21] L. Zhang and S. Zhang, "A general joint matrix factorization framework for data integration and its systematic algorithmic exploration," IEEE Transactions on Fuzzy Systems, 2019.
[22] S. Gao, I. W.-H. Tsang, L.-T. Chia, and P. Zhao, "Local features are not lonely – Laplacian sparse coding for image classification," 2010.
[23] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai, "Graph regularized sparse coding for image representation," IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1327–1336, 2011.
[24] M. Yin, J. Gao, Z. Lin, Q. Shi, and Y. Guo, "Dual graph regularized latent low-rank representation for subspace clustering," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4918–4933, 2015.
[25] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, 2011.
[26] A. Mnih and R. R. Salakhutdinov, "Probabilistic matrix factorization," in Advances in Neural Information Processing Systems, 2008, pp. 1257–1264.
[27] Y. J. Lim and Y. W. Teh, "Variational Bayesian approach to movie rating prediction," in Proceedings of KDD Cup and Workshop, vol. 7, 2007, pp. 15–21.
[28] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 880–887.
[29] C. M. Bishop, "Variational principal components," 1999.
[30] A. N. Srivastava, R. Nemani, and K. Steinhaeuser, Large-Scale Machine Learning in the Earth Sciences. CRC Press, 2017.
[31] Y. Yankelevsky and M. Elad, "Dual graph regularized dictionary learning," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 611–624, 2016.
[32] M. Yuan and Y. Lin, "Model selection and estimation in the Gaussian graphical model," Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
[33] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[34] L. Li and K.-C. Toh, "An inexact interior point method for l1-regularized sparse covariance selection," Mathematical Programming Computation, vol. 2, no. 3-4, pp. 291–315, 2010.
[35] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: Quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, no. 1, pp. 2911–2947, 2014.
[36] P. Danaher, P. Wang, and D. M. Witten, "The joint graphical lasso for inverse covariance estimation across multiple classes," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 2, pp. 373–397, 2014.
[37] T. T. Cai, W. Liu, H. H. Zhou et al., "Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation," The Annals of Statistics, vol. 44, no. 2, pp. 455–488, 2016.
[38] J. Fan, Y. Liao, and H. Liu, "An overview of the estimation of large covariance and precision matrices," The Econometrics Journal, vol. 19, no. 1, pp. C1–C32, 2016.
[39] S. Sojoudi, "Equivalence of graphical lasso and thresholding for sparse graphs," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3943–3963, 2016.
[40] S. Fattahi and S. Sojoudi, "Graphical lasso and thresholding: Equivalence and closed-form solutions," arXiv preprint arXiv:1708.09479, 2017.
[41] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, Oct. 2013.
[42] Z. Cui, Y.-L. Gao, J.-X. Liu, L.-Y. Dai, and S.-S. Yuan, "L2,1-GRMF: An improved graph regularized matrix factorization method to predict drug-target interactions," BMC Bioinformatics, vol. 20, 2019.
[43] S. Fattahi and S. Sojoudi, "Closed-form solution and sparsity path for inverse covariance estimation problem," 2018, pp. 410–417.
[44] R. Zhang, S. Fattahi, and S. Sojoudi, "Large-scale sparse inverse covariance estimation via thresholding and max-det matrix completion," ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 5766–5775. [Online]. Available: http://proceedings.mlr.press/v80/zhang18c.html
[45] G. I. Allen, L. Grosenick, and J. Taylor, "A generalized least-square matrix decomposition," Journal of the American Statistical Association, vol. 109, no. 505, pp. 145–159, 2014.
[46] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning from data under Laplacian and structural constraints," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
[47] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, "Learning graphs from data: A signal representation perspective," IEEE Signal Processing Magazine, vol. 36, pp. 44–63, 2019.
[48] E. Pavez and A. Ortega, "Generalized Laplacian precision matrix estimation for graph signal processing," pp. 6350–6354, 2016.
[49] B. M. Lake and J. B. Tenenbaum, "Discovering structure by learning sparse graph," in