EquiNMF: Graph Regularized Multiview Nonnegative Matrix Factorization
Daniel Hidru∗
SickKids Research Institute
686 Bay St, Toronto, ON, Canada
[email protected]
Anna Goldenberg
SickKids Research Institute, University of Toronto
686 Bay St, Toronto, ON, Canada
[email protected]

∗ DH and AG were supported by the SickKids Foundation.
Abstract
Nonnegative matrix factorization (NMF) methods have proved to be powerful across a wide range of real-world clustering applications. Integrating multiple types of measurements for the same objects/subjects allows us to gain a deeper understanding of the data and refine the clustering. We have developed a novel graph-regularized multiview NMF-based method for data integration called EquiNMF. The parameters for our method are set in a completely automated, data-specific, unsupervised fashion, a highly desirable property in real-world applications. We performed extensive and comprehensive experiments on multiview imaging data. We show that EquiNMF consistently outperforms single-view NMF methods used on concatenated data and multi-view NMF methods with different types of regularization.
1 Introduction

Combining multiple sources of evidence helps us gain a deeper understanding of the data. If unsupervised clustering is to be performed, a simple way to utilize the multiple sources of data is to concatenate them after normalizing their features and to perform clustering on the unified data set. This is not an ideal strategy because concatenation is likely to cause the loss of structure inherent in the individual datasets, which could compromise the identification of clusters. For this reason, methods have been developed to cluster data sets while preserving their multiview structure (e.g. [3]).

Nonnegative Matrix Factorization (NMF) has achieved widespread popularity and has become a clustering method of choice in many applications, such as imaging [5], blind-source separation [13] and computational biology [15]. With NMF, clustering is performed on the lower dimensional representation of the data which arises from the matrix factors. The power of the method lies in the quality of the latent embedding, which was shown to yield superior performance to PCA [8]. Many NMF variants have been proposed to improve the performance [4]. For example, sparsity constraints have been enforced to identify better bases for NMF [6]. Graph regularization has also been added to generate superior clustering results [2].

Many application areas are now interested in data integration, since integrating various sources of data can yield a much finer picture of the domain. A recently proposed MultiNMF [9] extends NMF to the multi-view clustering problem by constraining each view's lower dimensional representation to be similar to the others. The current MultiNMF has a major disadvantage: it does not capture the geometric structure of the data, which has been shown to improve NMF for single views [2]. We propose EquiNMF: a graph-regularized multi-view method where the parameters are automatically learned from the data. It results in significant performance improvements over four alternative approaches on three imaging datasets and shows consistency and robustness across a variety of parameter settings that in our case determine the relative contributions of multiple views. Importantly, while competing methods perform well on one dataset and badly on others, our approach is able to deal with the data diversity appropriately.

Our three major contributions are: 1) a novel formalization of a graph regularized multi-view NMF which results in much improved accuracy; 2) a reformulation of the multi-view objective that simplifies and reduces the complexity of the approach by explicitly representing equal view contribution without the consensus matrix; 3) automatic parameter estimation in a truly unsupervised setting.

2 Background

Nonnegative Matrix Factorization (NMF) is a method used to factorize a matrix of nonnegative entries into the product of two lower dimensional, nonnegative matrices. Let $X \in \mathbb{R}^{M \times N}_+$, where $X$ contains $N$ data points and $M$ nonnegative measurements for each data point. NMF attempts to find $U \in \mathbb{R}^{M \times K}_+$ and $V \in \mathbb{R}^{N \times K}_+$ such that $X \approx UV^T$ [8]. This task is expressed mathematically as the following optimization problem, solved with iterative updates [12]:

$$\min_{U,V \ge 0} \|X - UV^T\|_F^2; \qquad U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k}}{(UV^TV)_{i,k}}, \quad V_{j,k} \leftarrow V_{j,k} \frac{(X^TU)_{j,k}}{(VU^TU)_{j,k}} \qquad (1)$$
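To make the updates in Eq. (1) concrete, here is a minimal NumPy sketch of the multiplicative update rules. The function name and the small `eps` guard against division by zero are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-10, seed=0):
    """Factorize nonnegative X (M x N) as X ~= U V^T via the updates of Eq. (1)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.uniform(size=(M, K))
    V = rng.uniform(size=(N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # U_ik <- U_ik (XV)_ik / (UV^T V)_ik
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # V_jk <- V_jk (X^T U)_jk / (VU^T U)_jk
    return U, V
```

Because the updates are multiplicative, nonnegative initial factors stay nonnegative throughout the iterations.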
Graph Regularized NMF (GNMF) is an extension of NMF which has been shown to improve the quality of the factorization of $X$ [2]. This improvement is achieved through the addition of a regularization term which causes similar data points to have similar lower dimensional representations. This in turn reduces overfitting of the basis vectors. Let $W$ be an $N \times N$ symmetric matrix representing the similarity between the $N$ data points. Let $D$ be the diagonal matrix such that $D_{jj} = \sum_l W_{jl}$; then the Laplacian of $W$ is $\Delta = D - W$. GNMF attempts to solve the following optimization problem with iterative updates [2]:

$$\min_{U,V \ge 0} \|X - UV^T\|_F^2 + \gamma\,\mathrm{Tr}(V^T \Delta V); \qquad U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k}}{(UV^TV)_{i,k}}, \quad V_{j,k} \leftarrow V_{j,k} \frac{(X^TU)_{j,k} + \gamma (WV)_{j,k}}{(VU^TU)_{j,k} + \gamma (DV)_{j,k}} \qquad (2)$$
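As an illustration of how the graph term enters the iteration in Eq. (2), the following hedged sketch extends the plain NMF loop above; only the $V$ update changes. As before, the function name and `eps` guard are illustrative additions:

```python
import numpy as np

def gnmf(X, W, K, gamma=100.0, n_iter=200, eps=1e-10, seed=0):
    """GNMF (Eq. 2): NMF plus the graph penalty gamma * Tr(V^T (D - W) V)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = np.diag(W.sum(axis=1))  # degree matrix of the similarity graph W
    U = rng.uniform(size=(M, K))
    V = rng.uniform(size=(N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + gamma * (W @ V)) / (V @ (U.T @ U) + gamma * (D @ V) + eps)
    return U, V
```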
Multi-view NMF (MultiNMF) is an extension of NMF to multiple nonnegative matrices describing the same set of data points. Let $\{X^{(1)}, \ldots, X^{(n_v)}\}$ be $n_v$ views of a set of data points. MultiNMF attempts to approximate $X^{(v)} \approx U^{(v)} (V^{(v)})^T$ for each $v$, while constraining the $V^{(v)}$'s to be similar [9]. This is achieved by solving the following optimization problem:

$$\min_{U^{(v)}, V^{(v)}, V^* \ge 0} \sum_{v=1}^{n_v} \|X^{(v)} - U^{(v)} (V^{(v)})^T\|_F^2 + \sum_{v=1}^{n_v} \lambda_v \|V^{(v)} Q^{(v)} - V^*\|_F^2 \qquad (3)$$

In the optimization above, $Q^{(v)}$ is a matrix which constrains the column sums of $U^{(v)}$ to make the $V^{(v)}$'s comparable [9]. The multi-view data is reduced to $V^*$.

3 EquiNMF

Capturing the internal structure of the data within each view of a multiview problem is key to improving performance and gaining meaningful insight into the data and its underlying domain (e.g. [14]). We thus propose a novel graph-regularized multi-view approach. The usual problem of the multi-view setting, especially in the unsupervised scenario, is that it is not clear how to choose how much each view should contribute to the final objective. The selection of parameter values in the objective function has a substantial effect on the results of NMF methods which require them. Previous methods determined these values empirically using their labeled data and recommended the use of the same parameter values on all datasets. Since the appropriate parameter values may depend on the size and scale of the data being used, we have developed a method to determine these parameters from the data by assuming equivalent contributions of each view (note that this does not mean that each view gets the same coefficient, as is done in many multi-view approaches).

Here we show how to extend graph-regularized NMF (GNMF) to the multi-view setting. Let $\{X^{(1)}, \ldots, X^{(n_v)}\}$ be $n_v$ views of a set of $N$ data points, such that $X^{(v)} \in \mathbb{R}^{M_v \times N}_+$. The proposed method attempts to approximate $X^{(v)} \approx U^{(v)} V^T$ for each $v$, where $U^{(v)} \in \mathbb{R}^{M_v \times K}_+$ and $V \in \mathbb{R}^{N \times K}_+$, and the coefficient matrix $V$ is shared between all of the views.

Since $V$ is shared between all of the views, we would like to guarantee that the entries from each row of $V$ have a magnitude which will allow them to approximate the corresponding column in each of the views. Suppose that $X \approx UV^T$, $\|X_{.,j}\| = 1$ and $\|U_{.,k}\| = 1$ for each $k$. Then:

$$\|X_{.,j}\| \approx \sum_{k=1}^{K} \|U_{.,k} V_{j,k}\| = \sum_{k=1}^{K} |V_{j,k}| = \|V_{j,.}\| \qquad (4)$$

so that $\|V_{j,.}\| \approx 1$. Given the above constraints, a single $V$ can be used to approximate each of the views simultaneously. This motivates us to normalize the original data such that $\|X^{(v)}_{.,j}\| = 1$ and express the other constraints within the optimization problem below:

$$\min_{U^{(v)}, V \ge 0} \sum_{v=1}^{n_v} \alpha_v \|X^{(v)} - U^{(v)} C^{(v)} V^T\|_F^2 + \gamma\,\mathrm{Tr}(V^T \Delta V) \qquad (5)$$

where $C^{(v)} = \mathrm{Diag}\big(\sum_{i=1}^{M_v} U^{(v)}_{i,1}, \ldots, \sum_{i=1}^{M_v} U^{(v)}_{i,K}\big)^{-1}$ is used to constrain the column sums of $U^{(v)}$, as $\|(UC)_{.,k}\| = \sum_{i=1}^{M} (UC)_{i,k} = C_{k,k} \sum_{i=1}^{M} U_{i,k} = 1$.

To solve the optimization problem in Eq. (5), we derive alternating updates in the same manner as previous NMF papers [8]. First, we fix $V$ and minimize the objective for each $U^{(v)}$. When $V$ is fixed, the $U^{(v)}$'s do not depend on each other. For this reason, the $v$ indices have been removed for notational convenience.

For each $U$, we only need to minimize the terms in the objective which depend on it. Let $\Psi$ be the Lagrange multiplier matrix for the constraint $U \ge 0$. Considering only the terms which are relevant to $U$, minimizing the objective is equivalent to minimizing the Lagrangian:

$$\begin{aligned} L_U &= \alpha\,\mathrm{Tr}\big(UCV^TVC^TU^T - 2XVC^TU^T\big) + \mathrm{Tr}(\Psi U) \\ &= \alpha \sum_{i=1}^{M} \big((UCV^TVC^TU^T)_{ii} - 2(XVC^TU^T)_{ii}\big) + \mathrm{Tr}(\Psi U) \\ &= \alpha \sum_{i=1}^{M} \sum_{k=1}^{K} \big((UCV^TV)_{ik} - 2(XV)_{ik}\big) \frac{U_{ik}}{\sum_{l=1}^{M} U_{lk}} + \mathrm{Tr}(\Psi U) \end{aligned} \qquad (6)$$

Taking the partial derivative of $L_U$ with respect to $U_{i,k}$ gives:

$$\frac{\partial L_U}{\partial U_{i,k}} = 2 C_{kk}\,\alpha \Big( (UCV^TV)_{i,k} - \sum_{l=1}^{M} (UCV^TV)_{l,k} (UC)_{l,k} - (XV)_{i,k} + \sum_{l=1}^{M} (XV)_{l,k} (UC)_{l,k} \Big) + \Psi_{i,k} \qquad (7)$$

If we assume that $U$ was column-normalized before the update, then $C = I$. Using the KKT conditions $\Psi_{i,k} U_{i,k} = 0$ and $\frac{\partial L_U}{\partial U_{i,k}} = 0$, we get the update:

$$U_{i,k} \leftarrow U_{i,k} \frac{(XV)_{i,k} + \sum_{l=1}^{M} (UV^TV)_{l,k} U_{l,k}}{(UV^TV)_{i,k} + \sum_{l=1}^{M} (XV)_{l,k} U_{l,k}} \qquad (8)$$
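A minimal NumPy sketch of the update in Eq. (8) follows, assuming the columns of $U$ sum to one before the step (so $C = I$); the helper name and `eps` guard are our own illustrative choices:

```python
import numpy as np

def update_U(U, X, V, eps=1e-10):
    """One multiplicative step of Eq. (8) for a single view; columns of U are
    assumed normalized to sum to 1 beforehand, so that C = I."""
    XV = X @ V
    UVtV = U @ (V.T @ V)
    corr_a = (UVtV * U).sum(axis=0)  # sum_l (UV^T V)_lk U_lk, one value per column k
    corr_b = (XV * U).sum(axis=0)    # sum_l (XV)_lk U_lk
    U = U * (XV + corr_a) / (UVtV + corr_b + eps)
    return U / (U.sum(axis=0) + eps)  # re-normalize column sums so that C = I again
```

The per-column correction sums arise because $C$ depends on $U$ through the column sums, as captured in Eq. (7).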
To compute the update for $V$, we first normalize the columns of $U$. This normalization does not change the value of the objective and reduces $C$ to the identity matrix. Let $\Phi$ be the Lagrange multiplier matrix for the constraint $V \ge 0$. If we fix each $U^{(v)}$ and only consider the terms which are relevant to $V$, minimizing the objective is equivalent to minimizing the Lagrangian:

$$L_V = \sum_{v=1}^{n_v} \alpha_v \Big( \mathrm{Tr}\big(V (U^{(v)})^T U^{(v)} V^T\big) - 2\,\mathrm{Tr}\big((X^{(v)})^T U^{(v)} V^T\big) \Big) + \gamma\,\mathrm{Tr}(V^T \Delta V) + \mathrm{Tr}(\Phi V) \qquad (9)$$

Taking the derivative of $L_V$ with respect to $V$ gives:

$$\frac{\partial L_V}{\partial V} = \sum_{v=1}^{n_v} 2\alpha_v \Big( V (U^{(v)})^T U^{(v)} - (X^{(v)})^T U^{(v)} \Big) + 2\gamma \Delta V + \Phi \qquad (10)$$

Using the KKT conditions $\Phi_{j,k} V_{j,k} = 0$ and $\frac{\partial L_V}{\partial V_{j,k}} = 0$, we get the update:

$$V_{j,k} \leftarrow V_{j,k} \frac{\sum_{v=1}^{n_v} \alpha_v \big((X^{(v)})^T U^{(v)}\big)_{j,k} + \gamma (WV)_{j,k}}{\sum_{v=1}^{n_v} \alpha_v \big(V (U^{(v)})^T U^{(v)}\big)_{j,k} + \gamma (DV)_{j,k}} \qquad (11)$$

In an unsupervised multi-view setting, it is reasonable to desire that each view contribute equally to the final result ($V$) unless prior information is available. Each view can be said to contribute equally to the final result if it contributes equally to each intermediate result ($V$ after every update). Since each view contributes to the intermediate result according to the magnitude of the term associated with it in the numerator of Eq. (11), equal contribution of the views can be enforced by requiring the average contribution of each view to be the same. Since

$$E\big[\alpha_v (X^T U)_{j,k}\big] = \alpha_v \sum_{i=1}^{M} E[X_{i,j} U_{i,k}] \approx \alpha_v M\, E[X_{i,j}]\, E[U_{i,k}] = \alpha_v M (1/M)(1/M) = \alpha_v / M \qquad (12)$$

setting $\alpha_v = M_v$ will ensure that each view contributes equally to the final result.

The selection of the regularization parameter $\gamma$ is also required. If $\gamma$ is too large, then the graph regularization term dominates, which might not lead to a desirable effect: data points would be forced to have similar values in $V$, even if this provided a poor approximation. If $\gamma$ is too small, then the graph would have little effect on the result. We thus hypothesize that it is reasonable to set the graph to have the same scale of influence as the data. Since the data has an expected total contribution of $n_v$ with the above parameter setting, and

$$E\big[\gamma (WV)_{j,k}\big] = \gamma \sum_{l=1}^{N} E[W_{j,l} V_{l,k}] \approx \gamma N\, E[W_{j,l}]\, E[V_{l,k}] \approx \gamma N\, E[W_{j,l}] / K \qquad (13)$$

setting $\gamma = n_v K / (N\, E[W_{j,l}])$ will ensure that the graph contributes equally to the final result.
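Putting the pieces together, the following sketch implements the shared $V$ update of Eq. (11) along with the automatic parameter heuristics $\alpha_v = M_v$ and $\gamma = n_v K / (N\, E[W_{j,l}])$ from Eqs. (12)-(13). The function names are illustrative, and `W.mean()` is one plausible estimator of $E[W_{j,l}]$:

```python
import numpy as np

def auto_params(Xs, W, K):
    """Heuristics of Eqs. (12)-(13): alpha_v = M_v and gamma = n_v*K / (N * E[W])."""
    alphas = [X.shape[0] for X in Xs]  # alpha_v = M_v, the number of rows of each view
    n_v, N = len(Xs), W.shape[0]
    gamma = n_v * K / (N * W.mean())   # W.mean() estimates E[W_jl]
    return alphas, gamma

def update_V(V, Xs, Us, alphas, gamma, W, D, eps=1e-10):
    """One multiplicative step of Eq. (11) for the shared coefficient matrix V."""
    num = gamma * (W @ V)
    den = gamma * (D @ V)
    for X, U, a in zip(Xs, Us, alphas):
        num += a * (X.T @ U)           # each view's pull on V, weighted by alpha_v
        den += a * (V @ (U.T @ U))
    return V * num / (den + eps)
```

Alternating `update_U` on each view with `update_V` on the shared factor yields the full EquiNMF iteration.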
4 Experiments

We have applied EquiNMF to three imaging datasets (Digits, Faces and Butterflies) and compared it to four competing approaches (K-means, NMF, GNMF and MultiNMF) using accuracy and normalized mutual information (NMI) [9].

4.1 Data description

A brief description of the three image data sets used in the tests is provided below; a summary of their dimensions can be found in Table 1:

• UCI Handwritten Digits: This UCI repository dataset (http://archive.ics.uci.edu/ml/datasets/Multiple+Features) contains handwritten digits from 0 to 9. Each class contains 200 examples. The first view contains 76 Fourier coefficients of the character shapes and the second view contains 240 pixel averages in 2 × 3 windows.

• ORL Face data set: This data set from the ORL database contains images of 40 individuals. The database contains 10 different photos for each individual. The images are grayscale and have been normalized to 64 × 64 pixels. The first view contains the raw pixel values and the second view contains GIST features [10].

• Butterfly data set: This data set contains 10 different classes of butterflies [16]. Each class contains 55 to 100 images, with 832 butterflies in total. The views were formed using two different encodings of the images which describe different statistics of the codebooks. The two encoding methods are Fisher Vector (FV) [11] and Vector of Locally Aggregated Descriptors (VLAD) [7] with dense SIFT [1].

Table 1: Summary of Datasets

Dataset     Samples   Clusters   Features
Digit       2000      10         (76, 240)
Face        400       40         (4096, 59)
Butterfly   832       10         (10240, 6400)
4.2 Experimental setup

Each method relied on a random initialization, so each test was performed 20 times. The reduced dimension $K$ of the factor matrices was set to the number of clusters in each data set, as in [9]. All of the methods which relied on regularization parameters had these parameters set to their recommended values. We use a 5-nearest-neighbour similarity matrix to obtain a graph for each view, as in [2] (see the code sketch after the algorithm list below). $W$ was set to the sum of each view's similarity graph.

Each of the methods tested had its own form of initialization contained within its code. Our method used a similar style of initialization as MultiNMF [9]. The factors were generated from the Uniform[0, 1] distribution and scaled so that the column sums of each $U^{(v)}$ and the row sums of $V$ were set to 1. Then, in a consecutive sequence which cycled through the views 50 times, each $U^{(v)}$ was used for a single iteration of NMF.

To evaluate our method, we compare its performance to the following algorithms:

• Kmeans: The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. Kmeans is performed on the concatenation.

• Concatenated NMF (NMF): The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. NMF is performed on the concatenation.

• Concatenated GNMF (GNMF): The data is normalized so that $\|X_{.,j}\| = 1$ and concatenated into a single view. GNMF is performed with the recommended value of $\gamma = 100$ [2].

• Multi-view NMF (MultiNMF): The data is normalized so that $\|X\| = 1$. MultiNMF is performed with the recommended value of $\lambda = 0.01$ [9].

To cluster our NMF results, k-means clustering was performed on $V^*$ for MultiNMF and on $V$ for all other methods. Clustering was run with 20 repeats and 100 iterations per repeat.
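The following is a rough sketch of the 5-nearest-neighbour similarity-graph construction described above. It uses simple 0/1 edge weights, one common choice in [2]; the exact weighting used in the experiments is an assumption here:

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-NN similarity graph over the N columns (data points) of X."""
    N = X.shape[1]
    sq_dist = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # N x N squared distances
    np.fill_diagonal(sq_dist, np.inf)                             # exclude self-edges
    W = np.zeros((N, N))
    nn = np.argsort(sq_dist, axis=1)[:, :k]                       # k nearest neighbours per point
    W[np.repeat(np.arange(N), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                                     # symmetrize

# W for EquiNMF is the sum of the per-view graphs, as described above:
# W = sum(knn_graph(Xv, k=5) for Xv in Xs)
```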
Table 2: Clustering accuracy on three imaging datasets. Statistically significantly better performers are in bold (t-test, α = 0.05).

Algorithm   Digit        Face         Butterfly
Kmeans        ± .04      0.51 ± .02   0.68 ± .04
NMF         0.84 ± .03   0.30 ± .02   0.57 ± .03
GNMF          ± .06      0.43 ± .02   0.62 ± .06
MultiNMF    0.87 ± .01   0.55 ± .04   0.67 ± .03
EquiNMF       ± .04        ± .02        ± .03

Table 3: Clustering NMI on three imaging datasets. Statistically significantly better performers are in bold (t-test, α = 0.05).

Algorithm   Digit        Face         Butterfly
Kmeans        ± .01      0.73 ± .02   0.68 ± .02
NMF         0.78 ± .02   0.54 ± .01   0.52 ± .03
GNMF          ± .02      0.66 ± .01   0.67 ± .03
MultiNMF    0.79 ± .01   0.75 ± .02   0.64 ± .02
EquiNMF     0.89 ± .01     ± .01        ± .01

We observe that NMF used on the concatenated views performs consistently the worst of the compared methods across all three datasets. We hypothesize that this is because it does not account at all for the internal geometric structure of the data. Interestingly, classic Kmeans performs well, outperforming NMF and MultiNMF on Digits and Butterflies. It additionally outperforms GNMF on the Faces and Butterfly datasets. Kmeans is a reasonable performer because it takes into account distances in the high dimensional space, something that a single-view NMF might miss, but it falls short of the best performance since it does not take into account the dependency between measurements. GNMF shows unstable performance, performing very well on Digits but falling far behind the other methods on the other datasets. This is because, as a single-view method, it cannot use multiple representations of the data effectively. EquiNMF performs consistently better than all of its competitors except for GNMF on the Digits dataset according to the NMI score (it is significantly better than GNMF according to accuracy).
We plotted the performance of EquiNMF as a function of a multiplicative constant applied to the selected graph-regularization parameter $\gamma$. Figure 1 shows that EquiNMF is robust for a range of $\gamma$ values. The resulting accuracy depends on the relative contributions of the objective and the regularizer, the graph Laplacian in our case. As such, it is very important to set the contribution of the regularization to the right scale. Here, we propose to have comparable contributions of the objective and regularizer, unless prior information is available. Figure 1 shows that while no graph regularization results in significantly worse performance, the equal contribution (multiplier of the graph parameter is 1) or half of the objective contribution (multiplier of the graph parameter is 0.5) performs as well as the best performing parameter setting. We have also observed that the performance deteriorates once graph regularization is given too much weight (Butterflies, multiplier equal to 2). We thus recommend our automatic setting of equal contribution (multiplier equal to 1), resulting in completely automatically set parameters for EquiNMF in a fully unsupervised though data-specific fashion.

5 Discussion

In this paper we propose a graph-regularized multi-view NMF with equal contribution from the views. We initially extended MultiNMF to use graph regularization. This approach raised many questions, such as: should we regularize each view, the consensus matrix, or both? Does it matter whether we converge for each $U$ and $V$ before we update the consensus matrix $V^*$? (It turned out that the answer to this question was yes.) Importantly, there was a lot of ambiguity about how to weigh the contributions of each of the views, the consensus, and each of the potential graph regularizers. We studied this idea extensively first and found that some of the solutions substantially increased the performance of MultiNMF, but made the search for the best parameter setting very difficult and often impossible without known labels. We have not pursued this approach, since it is not useful in the real-world applications where we would ultimately want our method to be used.

EquiNMF has many advantages over the graph-regularized MultiNMF approach. For example, automatically setting the parameters of the graph-regularized MultiNMF using our assumption of equal view contribution is not fully transferable to MultiNMF, because there is no way to determine the appropriate proportion of influence that $V^*$ should have on each $V$. An additional advantage of EquiNMF is that without the consensus, there is no longer a need to determine the order of updates. In MultiNMF, each $U$, $V$ pair is updated until convergence before $V^*$ is updated. Regularizing $V$ towards a consensus or average is problematic. In theory, as the regularization parameter increases, the method becomes equivalent to concatenation. This is undesirable because concatenation does not allow for the equal contribution of views to the determination of $V$. In practice, as the regularization parameter increases, the $V$'s become similar but are not necessarily a good approximation of the data. Due to the constraint, it is more difficult to move them from their initialization.

Other interesting observations about EquiNMF from our extensive experiments concern constraining (normalizing) the lengths of rows and columns. Under the constraints on $X$ and $U$ which we imposed above, $\|V_{j,.}\| \approx 1$.
In this case, we may wish to impose the row constraint $\|V_{j,.}\| = 1$ in a manner similar to the column constraints imposed on $U$. Unfortunately, this causes a deterioration in performance, as the model becomes over-constrained and loses its expressiveness.

Initialization also plays an important role. We found that initializing the matrices with (s)kmeans + noise does not allow the method to improve on the initialization. We have observed that our method performs well even with random initialization but has high variance in performance; we thus recommend using our proposed initialization, as it does not add a heavy computational load to the method.

Finally, in an unsupervised multiview setting the $\alpha$ parameters cannot be determined by cross-validation, as each view's error would decrease as its parameter, and hence its influence on $V$, increased. The graph parameter $\gamma$ may be determined by cross-validation, but this is not necessary because of our heuristic. If the graph parameter is determined by cross-validation, our heuristic gives a reasonable scale from which to select candidate values.

6 Conclusion

Many application areas of machine learning are now looking for multiview methods that will help domain experts gain a deeper understanding of their data. Being a powerful paradigm, NMF has received wide acclaim in many application areas, and thus it is of practical importance to develop novel multiview NMF methods. Existing multiview NMF methods have all relied on supervised parameter selection, either through simulations or through real-world datasets where labels are available. Here we make two major contributions to the field: 1) a novel graph-regularized multi-view method that outperforms its state-of-the-art competitors; 2) an automatic way to set all the parameters of our model in an unsupervised, data-specific fashion. We hope that our approach will be of wide applicability in multiview settings. We will provide both R and MATLAB code upon acceptance.
References

[1] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In ICCV, pages 1–8, 2007.

[2] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

[3] Ning Chen, Jun Zhu, and Eric P. Xing. Predictive subspace learning for multi-view data: a large margin approach. In Advances in Neural Information Processing Systems, pages 361–369, 2010.

[4] Andrzej Cichocki, Anh Huy Phan, and Rafal Zdunek. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Chichester, 2009.

[5] David Guillamet, Bernt Schiele, and Jordi Vitria. Analyzing non-negative matrix factorization for image classification. In Proceedings of the 16th International Conference on Pattern Recognition, volume 2, pages 116–119. IEEE, 2002.

[6] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

[7] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3304–3311. IEEE, 2010.

[8] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[9] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2013.

[10] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[11] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010, pages 143–156. Springer, 2010.

[12] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

[13] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

[14] Bo Wang, Aziz M. Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333–337, March 2014.

[15] Jim Jing-Yan Wang, Xiaolei Wang, and Xin Gao. Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics, 14(1):107, 2013.

[16] Josiah Wang, Katja Markert, and Mark Everingham. Learning models for object recognition from natural language descriptions. In BMVC, 2009.