EquiNMF: Graph Regularized Multiview Nonnegative Matrix Factorization
Daniel Hidru∗
SickKids Research Institute
686 Bay St, Toronto, ON, Canada
[email protected]
Anna Goldenberg
SickKids Research Institute, University of Toronto
686 Bay St, Toronto, ON, Canada
[email protected]

∗ DH and AG were supported by the SickKids Foundation.
Abstract
Nonnegative matrix factorization (NMF) methods have proved to be powerful across a wide range of real-world clustering applications. Integrating multiple types of measurements for the same objects/subjects allows us to gain a deeper understanding of the data and refine the clustering. We have developed a novel graph-regularized multiview NMF-based method for data integration called EquiNMF. The parameters for our method are set in a completely automated, data-specific, unsupervised fashion, a highly desirable property in real-world applications. We performed extensive and comprehensive experiments on multiview imaging data. We show that EquiNMF consistently outperforms single-view NMF methods used on concatenated data and multi-view NMF methods with different types of regularization.
1 Introduction

Combining multiple sources of evidence helps us gain a deeper understanding of the data. If unsupervised clustering is to be performed, a simple way to utilize the multiple sources of data is to concatenate them after normalizing their features and to perform clustering on the unified data set. This is not an ideal strategy because concatenation is likely to cause the loss of structure inherent in the individual datasets, which could compromise the identification of clusters. For this reason, methods have been developed to cluster data sets while preserving their multiview structure (e.g. [3]).

Nonnegative Matrix Factorization (NMF) has achieved widespread popularity and has become a clustering method of choice in many applications, such as imaging [5], blind-source separation [13] and computational biology [15]. With NMF, clustering is performed on the lower dimensional representation of the data which arises from the matrix factors. The power of the method lies in the quality of the latent embedding, which was shown to yield superior performance to PCA [8]. Many NMF variants have been proposed to improve the performance [4]. For example, sparsity constraints have been enforced to identify better bases for NMF [6]. Graph regularization has also been added to generate superior clustering results [2].

Many application areas are now interested in data integration, since integrating various sources of data can yield a much finer picture of the domain. A recently proposed MultiNMF [9] extends NMF to the multi-view clustering problem by constraining each view's lower dimensional representation to be similar to the others. The current MultiNMF has a major disadvantage: it does not capture the geometric structure of the data, which has been shown to improve NMF for single views [2]. We propose EquiNMF: a graph-regularized multi-view method where the parameters are automatically learned from the data. It results in significant performance improvements over four alternative approaches on three imaging datasets and shows consistency and robustness across a variety of parameter settings that in our case determine the relative contributions of multiple views. Importantly, while competing methods perform well on one dataset and badly on others, our approach is able to deal with the data diversity appropriately.

Our three major contributions are: 1) a novel formalization of a graph regularized multi-view NMF which results in much improved accuracy; 2) a reformulation of the multi-view objective that simplifies and reduces the complexity of the approach by explicitly representing equal view contribution without the consensus matrix; 3) automatic parameter estimation in a truly unsupervised setting.

2 Background

Nonnegative Matrix Factorization (NMF) is a method used to factorize a matrix of nonnegative entries into the product of two lower dimensional, nonnegative matrices. Let $X \in \mathbb{R}^{M \times N}_+$, where $X$ contains $N$ data points and $M$ nonnegative measurements for each data point. NMF attempts to find $U \in \mathbb{R}^{M \times K}_+$ and $V \in \mathbb{R}^{N \times K}_+$ such that $X \approx UV^T$ [8]. This task is expressed mathematically as the following optimization problem, solved with iterative updates [12]:

$$\min_{U,V \ge 0} \|X - UV^T\|_F^2; \qquad U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k}}{(UV^TV)_{i,k}}, \quad V_{j,k} \leftarrow V_{j,k} \frac{(X^TU)_{j,k}}{(VU^TU)_{j,k}} \qquad (1)$$
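To make the updates in Eq. (1) concrete, here is a minimal NumPy sketch of the multiplicative update rules. The function name and the small `eps` guard against division by zero are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-10, seed=0):
    """Factorize nonnegative X (M x N) as X ~= U V^T via the updates of Eq. (1)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.uniform(size=(M, K))
    V = rng.uniform(size=(N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # U_ik <- U_ik (XV)_ik / (UV^T V)_ik
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # V_jk <- V_jk (X^T U)_jk / (VU^T U)_jk
    return U, V
```

Because the updates are multiplicative, nonnegative initial factors stay nonnegative throughout the iterations.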
Graph Regularized NMF (GNMF) is an extension of NMF which has been shown to improve the quality of the factorization of $X$ [2]. This improvement is achieved through the addition of a regularization term which causes similar data points to have similar lower dimensional representations. This in turn reduces overfitting of the basis vectors. Let $W$ be an $N \times N$ symmetric matrix representing the similarity between the $N$ data points. Let $D$ be the diagonal matrix such that $D_{jj} = \sum_l W_{jl}$; then the Laplacian of $W$ is $\Delta = D - W$. GNMF attempts to solve the following optimization problem with iterative updates [2]:

$$\min_{U,V \ge 0} \|X - UV^T\|_F^2 + \gamma\,\mathrm{Tr}(V^T \Delta V); \qquad U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k}}{(UV^TV)_{i,k}}, \quad V_{j,k} \leftarrow V_{j,k} \frac{(X^TU)_{j,k} + \gamma (WV)_{j,k}}{(VU^TU)_{j,k} + \gamma (DV)_{j,k}} \qquad (2)$$
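As an illustration of how the graph term enters the iteration in Eq. (2), the following hedged sketch extends the plain NMF loop above; only the $V$ update changes. As before, the function name and `eps` guard are illustrative additions:

```python
import numpy as np

def gnmf(X, W, K, gamma=100.0, n_iter=200, eps=1e-10, seed=0):
    """GNMF (Eq. 2): NMF plus the graph penalty gamma * Tr(V^T (D - W) V)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = np.diag(W.sum(axis=1))  # degree matrix of the similarity graph W
    U = rng.uniform(size=(M, K))
    V = rng.uniform(size=(N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + gamma * (W @ V)) / (V @ (U.T @ U) + gamma * (D @ V) + eps)
    return U, V
```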
Multi-view NMF (MultiNMF) is an extension of NMF to multiple nonnegative matrices describing the same set of data points. Let $\{X^{(1)}, \ldots, X^{(n_v)}\}$ be $n_v$ views of a set of data points. MultiNMF attempts to approximate $X^{(v)} \approx U^{(v)} (V^{(v)})^T$ for each $v$, while constraining the $V^{(v)}$'s to be similar [9]. This is achieved by solving the following optimization problem:

$$\min_{U^{(v)}, V^{(v)}, V^* \ge 0} \sum_{v=1}^{n_v} \|X^{(v)} - U^{(v)} (V^{(v)})^T\|_F^2 + \sum_{v=1}^{n_v} \lambda_v \|V^{(v)} Q^{(v)} - V^*\|_F^2 \qquad (3)$$

In the optimization above, $Q^{(v)}$ is a matrix which constrains the column sums of $U^{(v)}$ to make the $V^{(v)}$'s comparable [9]. The multi-view data is reduced to $V^*$.

3 EquiNMF

Capturing the internal structure of the data within each view of a multiview problem is key to improving performance and gaining meaningful insight into the data and its underlying domain (e.g. [14]). We thus propose a novel graph-regularized multi-view approach. The usual problem of the multi-view setting, especially in the unsupervised scenario, is that it is not clear how to choose how much each view should contribute to the final objective. The selection of parameter values in the objective function has a substantial effect on the results of NMF methods which require them. Previous methods determined these values empirically using their labeled data and recommended the use of the same parameter values on all datasets. Since the appropriate parameter values may depend on the size and scale of the data being used, we have developed a method to determine these parameters from the data by assuming equivalent contributions of each view (note that this does not mean that each view gets the same coefficient, as is done in many multi-view approaches).

Here we show how to extend graph-regularized NMF (GNMF) to the multi-view setting. Let $\{X^{(1)}, \ldots, X^{(n_v)}\}$ be $n_v$ views of a set of $N$ data points, such that $X^{(v)} \in \mathbb{R}^{M_v \times N}_+$. The proposed method attempts to approximate $X^{(v)} \approx U^{(v)} V^T$ for each $v$, where $U^{(v)} \in \mathbb{R}^{M_v \times K}_+$ and $V \in \mathbb{R}^{N \times K}_+$, and the coefficient matrix $V$ is shared between all of the views.

Since $V$ is shared between all of the views, we would like to guarantee that the entries from each row of $V$ have a magnitude which will allow them to approximate the corresponding column in each of the views. Suppose that $X \approx UV^T$, $\|X_{.,j}\| = 1$ and $\|U_{.,k}\| = 1$ for each $k$. Then:

$$\|X_{.,j}\| \approx \sum_{k=1}^{K} \|U_{.,k} V_{j,k}\| = \sum_{k=1}^{K} |V_{j,k}| = \|V_{j,.}\| \qquad (4)$$

so that $\|V_{j,.}\| \approx 1$. Given the above constraints, a single $V$ can be used to approximate each of the views simultaneously. This motivates us to normalize the original data such that $\|X^{(v)}_{.,j}\| = 1$ and express the other constraints within the optimization problem below:

$$\min_{U^{(v)}, V \ge 0} \sum_{v=1}^{n_v} \alpha_v \|X^{(v)} - U^{(v)} C^{(v)} V^T\|_F^2 + \gamma\,\mathrm{Tr}(V^T \Delta V) \qquad (5)$$

where $C^{(v)} = \mathrm{Diag}\big(\sum_{i=1}^{M_v} U^{(v)}_{i,1}, \ldots, \sum_{i=1}^{M_v} U^{(v)}_{i,K}\big)^{-1}$ is used to constrain the column sums of $U^{(v)}$, as $\|(UC)_{.,k}\| = \sum_{i=1}^{M} (UC)_{i,k} = C_{k,k} \sum_{i=1}^{M} U_{i,k} = 1$.

To solve the optimization problem in Eq. (5), we derive alternating updates in the same manner as previous NMF papers [8]. First, we fix $V$ and minimize the objective for each $U^{(v)}$. When $V$ is fixed, the $U^{(v)}$'s do not depend on each other. For this reason, the $v$ indices have been removed for notational convenience.

For each $U$, we only need to minimize the terms in the objective which depend on it. Let $\Psi$ be the Lagrange multiplier matrix for the constraint $U \ge 0$. Considering only the terms which are relevant to $U$, minimizing the objective is equivalent to minimizing the Lagrangian:

$$\begin{aligned} L_U &= \alpha\,\mathrm{Tr}\big(UCV^TVC^TU^T - 2XVC^TU^T\big) + \mathrm{Tr}(\Psi U) \\ &= \alpha \sum_{i=1}^{M} \big((UCV^TVC^TU^T)_{ii} - 2(XVC^TU^T)_{ii}\big) + \mathrm{Tr}(\Psi U) \\ &= \alpha \sum_{i=1}^{M} \sum_{k=1}^{K} \big((UCV^TV)_{ik} - 2(XV)_{ik}\big) \frac{U_{ik}}{\sum_{l=1}^{M} U_{lk}} + \mathrm{Tr}(\Psi U) \end{aligned} \qquad (6)$$

Taking the partial derivative of $L_U$ with respect to $U_{i,k}$ gives:

$$\frac{\partial L_U}{\partial U_{i,k}} = 2 C_{kk}\,\alpha \Big( (UCV^TV)_{i,k} - \sum_{l=1}^{M} (UCV^TV)_{l,k} (UC)_{l,k} - (XV)_{i,k} + \sum_{l=1}^{M} (XV)_{l,k} (UC)_{l,k} \Big) + \Psi_{i,k} \qquad (7)$$

If we assume that $U$ was column-normalized before the update, then $C = I$. Using the KKT conditions $\Psi_{i,k} U_{i,k} = 0$ and $\frac{\partial L_U}{\partial U_{i,k}} = 0$, we get the update:

$$U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k} + \sum_{l=1}^{M} (UV^TV)_{l,k} U_{l,k}}{(UV^TV)_{i,k} + \sum_{l=1}^{M} (XV)_{l,k} U_{l,k}} \qquad (8)$$
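A minimal NumPy sketch of the update in Eq. (8) follows, assuming the columns of $U$ sum to one before the step (so $C = I$); the helper name and `eps` guard are our own illustrative choices:

```python
import numpy as np

def update_U(U, X, V, eps=1e-10):
    """One multiplicative step of Eq. (8) for a single view; columns of U are
    assumed normalized to sum to 1 beforehand, so that C = I."""
    XV = X @ V
    UVtV = U @ (V.T @ V)
    corr_a = (UVtV * U).sum(axis=0)  # sum_l (UV^T V)_lk U_lk, one value per column k
    corr_b = (XV * U).sum(axis=0)    # sum_l (XV)_lk U_lk
    U = U * (XV + corr_a) / (UVtV + corr_b + eps)
    return U / (U.sum(axis=0) + eps)  # re-normalize column sums so that C = I again
```

The per-column correction sums arise because $C$ depends on $U$ through the column sums, as captured in Eq. (7).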
To compute the update for $V$, we first normalize the columns of $U$. This normalization does not change the value of the objective and reduces $C$ to the identity matrix. Let $\Phi$ be the Lagrange multiplier matrix for the constraint $V \ge 0$. If we fix each $U^{(v)}$ and only consider the terms which are relevant to $V$, minimizing the objective is equivalent to minimizing the Lagrangian:

$$L_V = \sum_{v=1}^{n_v} \alpha_v \Big( \mathrm{Tr}\big(V (U^{(v)})^T U^{(v)} V^T\big) - 2\,\mathrm{Tr}\big((X^{(v)})^T U^{(v)} V^T\big) \Big) + \gamma\,\mathrm{Tr}(V^T \Delta V) + \mathrm{Tr}(\Phi V) \qquad (9)$$

Taking the derivative of $L_V$ with respect to $V$ gives:

$$\frac{\partial L_V}{\partial V} = \sum_{v=1}^{n_v} 2\alpha_v \Big( V (U^{(v)})^T U^{(v)} - (X^{(v)})^T U^{(v)} \Big) + 2\gamma \Delta V + \Phi \qquad (10)$$

Using the KKT conditions $\Phi_{j,k} V_{j,k} = 0$ and $\frac{\partial L_V}{\partial V_{j,k}} = 0$, we get the update:

$$V_{j,k} \leftarrow V_{j,k} \frac{\sum_{v=1}^{n_v} \alpha_v \big((X^{(v)})^T U^{(v)}\big)_{j,k} + \gamma (WV)_{j,k}}{\sum_{v=1}^{n_v} \alpha_v \big(V (U^{(v)})^T U^{(v)}\big)_{j,k} + \gamma (DV)_{j,k}} \qquad (11)$$

In an unsupervised multi-view setting, it is reasonable to desire that each view contribute equally to the final result ($V$) unless prior information is available. Each view can be said to contribute equally to the final result if it contributes equally to each intermediate result ($V$ after every update). Since each view contributes to the intermediate result according to the magnitude of the term associated with it in the numerator of Eq. (11), equal contribution of the views can be enforced by requiring the average contribution of each view to be the same. Since

$$E\big[\alpha_v (X^T U)_{j,k}\big] = \alpha_v \sum_{i=1}^{M} E[X_{i,j} U_{i,k}] \approx \alpha_v M\, E[X_{i,j}]\, E[U_{i,k}] = \alpha_v M (1/M)(1/M) = \alpha_v / M \qquad (12)$$

setting $\alpha_v = M_v$ will ensure that each view contributes equally to the final result.

The selection of the regularization parameter $\gamma$ is also required. If $\gamma$ is too large, then the graph regularization term dominates, which might not lead to a desirable effect: data points would be forced to have similar values in $V$, even if this provided a poor approximation. If $\gamma$ is too small, then the graph would have little effect on the result. We thus hypothesize that it is reasonable to set the graph to have the same scale of influence as the data. Since the data has an expected total contribution of $n_v$ with the above parameter setting, and

$$E\big[\gamma (WV)_{j,k}\big] = \gamma \sum_{l=1}^{N} E[W_{j,l} V_{l,k}] \approx \gamma N\, E[W_{j,l}]\, E[V_{l,k}] \approx \gamma N\, E[W_{j,l}] / K \qquad (13)$$

setting $\gamma = n_v K / (N\, E[W_{j,l}])$ will ensure that the graph contributes equally to the final result.
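Putting the pieces together, the following sketch implements the shared $V$ update of Eq. (11) along with the automatic parameter heuristics $\alpha_v = M_v$ and $\gamma = n_v K / (N\, E[W_{j,l}])$ from Eqs. (12)-(13). The function names are illustrative, and `W.mean()` is one plausible estimator of $E[W_{j,l}]$:

```python
import numpy as np

def auto_params(Xs, W, K):
    """Heuristics of Eqs. (12)-(13): alpha_v = M_v and gamma = n_v*K / (N * E[W])."""
    alphas = [X.shape[0] for X in Xs]  # alpha_v = M_v, the number of rows of each view
    n_v, N = len(Xs), W.shape[0]
    gamma = n_v * K / (N * W.mean())   # W.mean() estimates E[W_jl]
    return alphas, gamma

def update_V(V, Xs, Us, alphas, gamma, W, D, eps=1e-10):
    """One multiplicative step of Eq. (11) for the shared coefficient matrix V."""
    num = gamma * (W @ V)
    den = gamma * (D @ V)
    for X, U, a in zip(Xs, Us, alphas):
        num += a * (X.T @ U)           # each view's pull on V, weighted by alpha_v
        den += a * (V @ (U.T @ U))
    return V * num / (den + eps)
```

Alternating `update_U` on each view with `update_V` on the shared factor yields the full EquiNMF iteration.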
4 Experiments

We have applied EquiNMF to three imaging datasets (Digits, Faces and Butterflies) and compared it to four competing approaches (K-means, NMF, GNMF and MultiNMF) using accuracy and normalized mutual information (NMI) [9].

4.1 Data description

A brief description of the three image data sets used in the tests is provided below; a summary of their dimensions can be found in Table 1:

• UCI Handwritten Digits: This UCI repository dataset (http://archive.ics.uci.edu/ml/datasets/Multiple+Features) contains handwritten digits from 0 to 9. Each class contains 200 examples. The first view contains 76 Fourier coefficients of the character shapes and the second view contains 240 pixel averages in 2 × 3 windows.

• ORL Face data set: This data set from the ORL database contains images of 40 individuals. The database contains 10 different photos for each individual. The images are grayscale and have been normalized to 64 × 64 pixels. The first view contains the raw pixel values and the second view contains GIST features [10].

• Butterfly data set: This data set contains 10 different classes of butterflies [16]. Each class contains 55 to 100 images, with 832 butterflies in total. The views were formed using two different encodings of the images which describe different statistics of the codebooks. The two encoding methods are Fisher Vector (FV) [11] and Vector of Locally Aggregated Descriptors (VLAD) [7] with dense SIFT [1].

Table 1: Summary of Datasets

Dataset     Samples   Clusters   Features
Digit       2000      10         (76, 240)
Face        400       40         (4096, 59)
Butterfly   832       10         (10240, 6400)
4.2 Experimental setup

Each method relied on a random initialization, so each test was performed 20 times. The reduced dimension $K$ of the factor matrices was set to the number of clusters in each data set, as in [9]. All of the methods which relied on regularization parameters had these parameters set to their recommended values. We use a 5-nearest-neighbour similarity matrix to obtain a graph for each view, as in [2] (see the code sketch after the algorithm list below). $W$ was set to the sum of each view's similarity graph.

Each of the methods tested had its own form of initialization contained within its code. Our method used a similar style of initialization as MultiNMF [9]. The factors were generated from the Uniform[0, 1] distribution and scaled so that the column sums of each $U^{(v)}$ and the row sums of $V$ were set to 1. Then, in a consecutive sequence which cycled through the views 50 times, each $U^{(v)}$ was used for a single iteration of NMF.

To evaluate our method, we compare its performance to the following algorithms:

• Kmeans: The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. Kmeans is performed on the concatenation.

• Concatenated NMF (NMF): The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. NMF is performed on the concatenation.

• Concatenated GNMF (GNMF): The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. GNMF is performed with the recommended value of $\gamma = 100$ [2].

• Multi-view NMF (MultiNMF): The data is normalized so that $\|X\| = 1$. MultiNMF is performed with the recommended value of $\lambda = 0.01$ [9].

To cluster our NMF results, k-means clustering was performed on $V^*$ for MultiNMF and on $V$ for all other methods. Clustering was run with 20 repeats and 100 iterations per repeat.
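The following is a rough sketch of the 5-nearest-neighbour similarity-graph construction described above. It uses simple 0/1 edge weights, one common choice in [2]; the exact weighting used in the experiments is an assumption here:

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-NN similarity graph over the N columns (data points) of X."""
    N = X.shape[1]
    sq_dist = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # N x N squared distances
    np.fill_diagonal(sq_dist, np.inf)                             # exclude self-edges
    W = np.zeros((N, N))
    nn = np.argsort(sq_dist, axis=1)[:, :k]                       # k nearest neighbours per point
    W[np.repeat(np.arange(N), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                                     # symmetrize

# W for EquiNMF is the sum of the per-view graphs, as described above:
# W = sum(knn_graph(Xv, k=5) for Xv in Xs)
```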
Table 2: Clustering accuracy on three imaging datasets. Statistically significantly better performers are in bold (t-test, α = 0.05).

Algorithm   Digit        Face         Butterfly
Kmeans        ± .04      0.51 ± .02   0.68 ± .04
NMF         0.84 ± .03   0.30 ± .02   0.57 ± .03
GNMF          ± .06      0.43 ± .02   0.62 ± .06
MultiNMF    0.87 ± .01   0.55 ± .04   0.67 ± .03
EquiNMF       ± .04        ± .02        ± .03

Table 3: Clustering NMI on three imaging datasets. Statistically significantly better performers are in bold (t-test, α = 0.05).

Algorithm   Digit        Face         Butterfly
Kmeans        ± .01      0.73 ± .02   0.68 ± .02
NMF         0.78 ± .02   0.54 ± .01   0.52 ± .03
GNMF          ± .02      0.66 ± .01   0.67 ± .03
MultiNMF    0.79 ± .01   0.75 ± .02   0.64 ± .02
EquiNMF     0.89 ± .01     ± .01        ± .01

We observe that NMF used on the concatenated views performs consistently the worst of the compared methods across all three datasets. We hypothesize that this is because it does not account at all for the internal geometric structure of the data. Interestingly, classic Kmeans performs well, outperforming NMF and MultiNMF on Digits and Butterflies. It additionally outperforms GNMF on the Faces and Butterfly datasets. Kmeans is a reasonable performer because it takes into account distances in the high dimensional space, something that a single-view NMF might miss, but it falls short of the best performance since it does not take into account the dependency between measurements. GNMF shows unstable performance, performing very well on Digits but falling far behind the other methods on the other datasets. This is because, as a single-view method, it cannot use multiple representations of the data effectively. EquiNMF performs consistently better than all of its competitors except for GNMF on the Digits dataset according to the NMI score (it is significantly better than GNMF according to accuracy).
We plotted the performance of EquiNMF as a function of a multiplicative constant applied to the selected graph-regularization parameter $\gamma$. Figure 1 shows that EquiNMF is robust for a range of $\gamma$ values. The resulting accuracy depends on the relative contributions of the objective and the regularizer, the graph Laplacian in our case. As such, it is very important to set the contribution of the regularization to the right scale. Here, we propose to have comparable contributions of the objective and regularizer, unless prior information is available. Figure 1 shows that while no graph regularization results in significantly worse performance, the equal contribution (multiplier of the graph parameter is 1) or half of the objective contribution (multiplier of the graph parameter is 0.5) performs as well as the best performing parameter setting. We have also observed that the performance deteriorates once graph regularization is given too much weight (Butterflies, multiplier equal to 2). We thus recommend our automatic setting of equal contribution (multiplier equal to 1), resulting in completely automatically set parameters for EquiNMF in a fully unsupervised though data-specific fashion.

5 Discussion

In this paper we propose a graph-regularized multi-view NMF with equal contribution from the views. We initially extended MultiNMF to use graph regularization. This approach raised many questions, such as: should we regularize each view, the consensus matrix, or both? Does it matter whether we converge for each $U$ and $V$ before we update the consensus matrix $V^*$? (It turned out that the answer to this question was yes.) Importantly, there was a lot of ambiguity about how to weigh the contributions of each of the views, the consensus, and each of the potential graph regularizers. We studied this idea extensively first and found that some of the solutions substantially increased the performance of MultiNMF, but made the search for the best parameter setting very difficult and often impossible without known labels. We have not pursued this approach, since it is not useful in the real-world applications where we would ultimately want our method to be used.

EquiNMF has many advantages over the graph-regularized MultiNMF approach. For example, automatically setting the parameters of the graph-regularized MultiNMF using our assumption of equal view contribution is not fully transferable to MultiNMF, because there is no way to determine the appropriate proportion of influence that $V^*$ should have on each $V$. An additional advantage of EquiNMF is that without the consensus, there is no longer a need to determine the order of updates. In MultiNMF, each $U$, $V$ pair is updated until convergence before $V^*$ is updated. Regularizing $V$ towards a consensus or average is problematic. In theory, as the regularization parameter increases, the method becomes equivalent to concatenation. This is undesirable because concatenation does not allow for the equal contribution of views to the determination of $V$. In practice, as the regularization parameter increases, the $V$'s become similar but are not necessarily a good approximation of the data. Due to the constraint, it is more difficult to move them from their initialization.

Other interesting observations about EquiNMF from our extensive experiments concern constraining (normalizing) the lengths of rows and columns. Under the constraints on $X$ and $U$ which we imposed above, $\|V_{j,.}\| \approx 1$.
In this case, we may wish to impose the row constraint $\|V_{j,.}\| = 1$ in a manner similar to the column constraints imposed on $U$. Unfortunately, this causes a deterioration in performance, as the model becomes over-constrained and loses its expressiveness.

Initialization also plays an important role. We found that initializing the matrices with (s)kmeans + noise does not allow the method to improve on the initialization. We have observed that our method performs well even with random initialization but has high variance in performance; we thus recommend using our proposed initialization, as it does not add a heavy computational load to the method.

Finally, in an unsupervised multiview setting the $\alpha$ parameters cannot be determined by cross-validation, as each view's error would decrease as its parameter, and hence its influence on $V$, increased. The graph parameter $\gamma$ may be determined by cross-validation, but this is not necessary because of our heuristic. If the graph parameter is determined by cross-validation, our heuristic gives a reasonable scale from which to select candidate values.

6 Conclusion

Many application areas of machine learning are now looking for multiview methods that will help domain experts gain a deeper understanding of their data. Being a powerful paradigm, NMF has received wide acclaim in many application areas, and thus it is of practical importance to develop novel multiview NMF methods. Existing multiview NMF methods have all relied on supervised parameter selection, either through simulations or through real-world datasets where labels are available. Here we make two major contributions to the field: 1) a novel graph-regularized multi-view method that outperforms its state-of-the-art competitors; 2) an automatic way to set all the parameters of our model in an unsupervised, data-specific fashion. We hope that our approach will be of wide applicability in multiview settings. We will provide both R and MATLAB code upon acceptance.
References

[1] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In ICCV, pages 1–8, 2007.

[2] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

[3] Ning Chen, Jun Zhu, and Eric P. Xing. Predictive subspace learning for multi-view data: a large margin approach. In Advances in Neural Information Processing Systems, pages 361–369, 2010.

[4] Andrzej Cichocki, Anh Huy Phan, and Rafal Zdunek. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Chichester, 2009.

[5] David Guillamet, Bernt Schiele, and Jordi Vitria. Analyzing non-negative matrix factorization for image classification. In Proceedings of the 16th International Conference on Pattern Recognition, volume 2, pages 116–119. IEEE, 2002.

[6] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

[7] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3304–3311. IEEE, 2010.

[8] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[9] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2013.

[10] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[11] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010, pages 143–156. Springer, 2010.

[12] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

[13] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

[14] Bo Wang, Aziz M. Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333–337, March 2014.

[15] Jim Jing-Yan Wang, Xiaolei Wang, and Xin Gao. Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics, 14(1):107, 2013.

[16] Josiah Wang, Katja Markert, and Mark Everingham. Learning models for object recognition from natural language descriptions. In BMVC, 2009.