Low-Rank Matrices on Graphs: Generalized Recovery & Applications
Nauman Shahid*, Nathanael Perraudin, Pierre Vandergheynst
LTS2, EPFL, Switzerland
*nauman.shahid@epfl.ch
October 17, 2018
Abstract
Many real world datasets subsume a linear or non-linear low-rank structure in a very low-dimensional space. Unfortunately, one often has very little or no information about the geometry of the space, resulting in a highly under-determined recovery problem. Under certain circumstances, state-of-the-art algorithms provide an exact recovery for linear low-rank structures, but at the expense of poorly scalable algorithms which rely on the nuclear norm. The case of non-linear structures, however, remains unresolved. We revisit the problem of low-rank recovery from a totally different perspective, involving graphs which encode pairwise similarity between the data samples and features. Surprisingly, our analysis confirms that it is possible to recover many approximate linear and non-linear low-rank structures, with recovery guarantees, via a set of highly scalable and efficient algorithms. We call such data matrices Low-rank matrices on graphs and show that many real world datasets satisfy this assumption approximately due to underlying stationarity. Our detailed theoretical and experimental analysis unveils the power of the simple, yet very novel, recovery framework
Fast Robust PCA on Graphs.

“This technical report is a detailed explanation of the novel low-rank recovery concepts introduced in the previous work, Fast Robust PCA on Graphs [1]. The readers can refer to [1] for experiments, which are not repeated here for brevity.”

Consider some data sampled from a manifold (a low-dimensional structure embedded in a high-dimensional space, like a hyper-surface). In the first row of Fig. 1, we present some examples of low-dimensional manifolds embedded in a higher-dimensional space. Given a noisy version of the samples in the ambient space, we want to recover the underlying clean low-dimensional manifold. There exist two well-known approaches to solve this problem.

1. One common idea is to assume that the manifold is linear and the data is low-rank. This approach is used in all the Principal Component Analysis (PCA) techniques. Nevertheless, for many datasets, this assumption is only approximately or locally true; see Fig. 1 for example. PCA has a nice interpretation in the original data space (normally termed the low-rank representation) as well as in the low-dimensional space (the embedding or projected space). Throughout, we refer to the former as the ‘data space’ and the latter as the ‘projected space’.

2. Another idea is to use manifold learning techniques such as LLE, Laplacian eigenmaps or MDS. However, these techniques project the points into a new low-dimensional space (the projected space), which is not the task we would like to perform. We want the resulting points to remain in the data space.

N. Shahid and N. Perraudin are supported by the SNF grant no. 200021 154350/1 for the project “Towards signal processing on graphs”.

Thus, we need a framework to recover the low-dimensional manifold structure in the original space of the data points. It should be highly scalable to deal with datasets consisting of millions of samples that are corrupted with noise and errors.
This paper broadly deals with a class of algorithms which solve such problems, all under one framework, using simple, yet very powerful tools, i.e., graphs. We call our framework Generalized Recovery of Low-rank Matrices on Graphs or Generalized PCA on Graphs (GPCAG).

A first reaction is that the goals of this paper are trivial. The problem of low-rank recovery has been very well studied in the last decade. So, why would one still be interested in solving this problem? Moreover, what is special about specifically targeting this problem via graphs? In the following discussion we motivate the utility of this work with several real world examples and gradually describe the tools which are useful for our work.
HIGHLIGHT: “The choice of a suitable low-rank recovery method depends on the domain from which the data is acquired. Thus, there is a need to generalize linear subspace recovery to manifolds!”

The answer to such a question does not exist in general. In fact, the answer solely depends on the type of data under consideration. Real world datasets are complex, big and corrupted with various types of noise. What makes low-rank extraction a hard problem is the domain from which the dataset is acquired. Here we present a few examples.
Consider a real world surveillance application where a camera is installed in an airport lobby and collects frames at a certain time interval. As the structure of this lobby does not change over time, one would like to extract the moving people or objects from the static lobby. Such an application clearly falls under the low-rank extraction framework. A good enough assumption about the slowly varying images is that they can be very well described by a linear subspace in a low-dimensional space. Thus, a linear subspace model, like PCA, might work very well for such a dataset.
Now consider the same camera installed outside in the open air, where the lighting changes over time. Assume there are short periods of sunlight interleaved by cloudy periods and even rain. Compared to the lobby example, where all these conditions were constant, this is a more challenging environment. It might not be possible to extract the low-rank component of this set of frames via a single linear subspace model. One would need something slightly more complex than this simple assumption. Another example where the linear subspace assumption might fail is the case of a movie where the scenes change over time. A more reasonable assumption would be to target a dynamically changing low-rank representation over short bursts of frames. This would require some sort of grouping between the frames in time.
As another example, consider a huge database of faces of several people collected from a social network platform. Assume further that the faces are corrupted by different types of noise and errors. Furthermore, the faces undergo changes in facial expression. A linear subspace model would simply fail to extract a low-rank representation from such a database of faces because it would not take into account the subgroups of the samples in the data. A more reasonable assumption for such a dataset would be to group the images based on the people, using some notion of pairwise similarity. This would introduce non-linearity in the low-rank extraction method.

1.4 Topic extraction
Now consider a set of text messages collected from a social network about different topics. One would like to analyze the most common topics in the whole set of messages. A more interesting analysis would be to identify the communities of people in the social network among which a particular topic is more popular. Again, the messages are highly corrupted by noise due to the use of slang and unstructured sentences. In such a case one definitely needs to use the notion of pairwise similarity between the people in the network to identify the low-rank component, i.e., the most commonly used words and topics, as well as to analyze the community structure.
Finally, consider a set of MRI scans which vary slowly over time and across the regions of the brain. One would like to separate the clean low-rank components from sparse outliers which might correspond to some kind of abnormal condition. Furthermore, the acquired scans might be corrupted with noise due to limitations of the device or the movement of the patient, etc. This is a typical example of a space-time real world dataset, which has gained significant attention from the research community. For such a dataset, the notion of functional connectivity between different regions of the brain plays an important role, in addition to the structural connectivity. Moreover, the acquired images can be related in time. Thus, it is important to consider the space-time relationship while extracting the low-rank representation. A single linear subspace might not do well in such a scenario.
In the light of the above examples, we can conclude that a linear subspace model is not always good enough to extract low-rank structures from the data. In the sequel we use the word ‘manifold’ as a very general term to describe non-linear subspace structures. What type of manifold assumption is good enough for each of the above applications? As we do not know the answer to this question, we can instead characterize the manifold directly using the data itself.

Furthermore, sometimes the variation in the data is huge and we do not have enough samples to extract the low-rank structure. This is a completely reasonable scenario, as many real world datasets are highly corrupted by different types of noise and outliers. A general method should simply de-noise the data in such a scenario. Below we provide a list of the types of noise that should be tolerable by any low-rank recovery method:

• Gaussian noise.
• Sparse and gross errors in a few features of the data.
• Sample specific errors, where only a few samples of the data are corrupted by errors.
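For concreteness, these three corruption models can be simulated directly; the following is a toy numpy sketch (the noise levels, proportions and matrix sizes are our own illustrative choices, not values from this work):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 200))      # clean data: p = 50 features, n = 200 samples

# 1) Gaussian noise on every entry.
Y_gauss = Y + 0.1 * rng.standard_normal(Y.shape)

# 2) Sparse, gross errors: roughly 2% of the entries hit by large perturbations.
Y_sparse = Y.copy()
mask = rng.random(Y.shape) < 0.02
Y_sparse[mask] += 10 * rng.standard_normal(mask.sum())

# 3) Sample specific errors: a few whole columns (samples) replaced by garbage.
Y_cols = Y.copy()
bad = rng.choice(Y.shape[1], size=5, replace=False)
Y_cols[:, bad] = 10 * rng.standard_normal((Y.shape[0], 5))

print(Y_gauss.shape, mask.sum(), bad.shape)
```

A robust recovery method should tolerate all three: the first calls for plain de-noising, while the second and third motivate sparse and column-sparse error terms respectively.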
MOTIVATION: “Standard linear low-rank recovery methods, including standard PCA and Robust PCA, fail to recover a non-linear manifold. We need new methods to recover low-rank manifolds.”

We argue with a few simple examples, shown in Fig. 1, to demonstrate the motivation of our work. The figure shows various datasets which are corrupted by noise. The goal is to:

1. de-noise and recover the underlying low-rank manifold if the data is low-rank;
2. simply de-noise the dataset and recover the clean manifold if it is not low-rank.

We try to perform this task using the very famous state-of-the-art algorithm Robust PCA (with an ℓ1 data fidelity term) [2] from the regime of linear dimensionality reduction. This model has the tendency to exactly recover the linear low-rank structure of the data. In very simple words, the algorithm decomposes a dataset into a low-rank and a noisy matrix.

Figure 1: This figure shows the recovery of the underlying clean low-rank structure of the noisy 2D or 3D manifolds using the Robust PCA framework [2]. RPCA, which is a linear subspace recovery algorithm, totally fails in all the scenarios as it tries to reduce the whole data to a point at the center of the manifold in an attempt to exactly recover the manifold. The last column shows the approximate desired behavior for each of the cases.

The first two rows of Fig. 1 show two different manifolds lying in a 2D space. We add noise in the third dimension and create artificial 3D data from these manifolds. The goal here is to de-noise and recover the low-rank manifold in 2D. The Robust PCA (RPCA) framework, which is a linear subspace recovery algorithm, totally fails in this scenario. It can be seen that it tries to reduce the whole data to a point in the center of the manifold.

Next, consider the datasets shown in the third, fourth and fifth rows of Fig. 1. Here, the manifolds are not low-rank, i.e., they only lie in the 2D space and the noise is introduced in the same dimensions.
Apparently, this task is easier than the first one because it only involves de-noising. Again we use RPCA to recover the manifold. We observe that it fails in this task as well.

The last column of Fig. 1 shows what we would approximately like to achieve for each of these datasets. Clearly, RPCA is far from achieving the goal. One can simply argue that RPCA can only extract linear structure from the data, so it is supposed to fail in such a scenario. The argument is entirely true. It is possible to use some non-linear dimensionality reduction algorithm, like the LE framework, but it would recover the low-rank structure of the manifold in a projected space. Currently we do not have algorithms that would simultaneously de-noise the data, recover the manifold and extract the low-rank structures in the data space itself.
Figure 2: A summary of the main goals and the evaluation schemes used in this work.

More specifically, the goal of this work is to solve the following problem:
OUR GOALS: “Given a high-dimensional big dataset that follows some unknown linear or non-linear structure in the low-dimensional space, recover the underlying low-rank structure in the data space via highly scalable and efficient algorithms.”

In the context of the above problem statement, we can define some basic goals of this work. Assume we have a high-dimensional dataset, corrupted by noise and errors and consisting of millions of samples. Further assume that a clean representation of this data lies on a linear or non-linear low-dimensional subspace. Then our goal is to perform the following while remaining in the original data space:

1. De-noise the data.
2. Extract the low-dimensional manifold.
3. Be able to exploit the low-dimensional structure of the data.
4. Handle corruptions and outliers.
5. Do all of the above with highly scalable algorithms.

It is slightly overwhelming that we want to target the above goals altogether. We argue that the main aim of this work is not to provide exact solutions for the above tasks. In fact, our main motivation is just to provide a general scalable framework which provides approximate solutions for each of the above and works reasonably well in practice. We also provide theoretical bounds on the approximation errors for our framework.
Fig. 3 shows the general framework that we propose for GPCAG. GPCAG stands on three important pillars to extract low-rank manifold structures:

1. De-noising.
2. Graph filtering.
3. Data and graph compression.

It is quite obvious that we need a de-noising setup for real world data. In the following, therefore, we motivate and discuss the second tool in depth, which was introduced in [1]. The compression framework is presented in [3] and will not be repeated here for brevity.
Figure 3: The general framework that we propose for GPCAG: 1) De-noising, 2) Data and graph compression, 3) Graph filtering.

“We propose the GRAPH as a building block for the low-rank extraction and recover the low-rank matrix ON a graph. This is different from Graph Regularized PCA. The graph is given, i.e., it can come from an external source or can be constructed by using fast state-of-the-art methods. However, graph construction is not the focus of our work.”

Low-rank Representation on Graphs: The key to our solution

It is clear that the solution to our problem lies in some sort of dimensionality reduction framework which has a nice interpretation in the data space. As we target a generalized low-rank recovery for both linear and non-linear manifolds, we need tools from both linear and non-linear dimensionality reduction.
It is not surprising that principal component analysis (PCA) [4] is the most widely used tool for linear dimensionality reduction or low-rank recovery. For a dataset Y ∈ R^{p×n} with n p-dimensional data vectors, standard PCA learns the projections or principal components V ∈ R^{n×k} of Y on a k-dimensional orthonormal basis U ∈ R^{p×k}, where k < p (model 1 in Table 2). Though non-convex, this problem has a global minimum that can be computed using the Singular Value Decomposition (SVD), giving a unique low-rank representation X = UV⊤. In the following, without loss of generality, we also refer to X = UΣV⊤ as the factorized form (SVD) of X, where Σ is a k×k diagonal matrix which consists of the singular values of X, U⊤U = I_k and V⊤V = I_k. Note that the principal components satisfy V₁⊤ = ΣV⊤.

A main drawback of PCA is its sensitivity to heavy-tailed noise, i.e., a few strong outliers can result in erratic principal components. However, we want our low-rank extraction method to be robust to outliers as well. RPCA, proposed by Candès et al. [2], overcomes this problem by recovering the clean low-rank representation X from the grossly corrupted Y by solving model 2 in Table 2. Here S represents the sparse matrix containing the errors and ‖X‖∗ denotes the nuclear norm of X, the tightest convex relaxation of rank(X) [5]. Thus, RPCA is the tool that we will use from the class of linear dimensionality reduction algorithms.

Graphs have the ability to model a wide variety of pairwise relationships between the data samples. Consider the real world examples of low-rank extraction which we described earlier. One can safely use graphs for each of those applications. For the surveillance application, the graph can encode the pairwise relationship between the different frames. One can either use a K-nearest neighbors approach or just connect each frame to its two neighbors in time (a line graph).
For the case of messages collected from a social network, there can be several possibilities for graph construction. A straightforward way would be to construct a graph by using some additional information about the physical connectivity of the network, such as a graph of friends on the particular social platform. Another way would be to extract features from the messages and construct a graph using those features. For the case of a database of faces of several people, the most meaningful graph would be between the pairwise faces. A good graph in such a case would be well connected between the faces of the same person. This will automatically give rise to a community structure in the dataset. For the case of time series data collected for a brain application, one can construct a graph based on the anatomical connectivity of the brain regions or the functional connectivity calculated directly from the brain signals.

Thus, the notion of non-linearity in our framework follows directly from the use of graphs in the above mentioned applications. In fact, many non-linear dimensionality reduction techniques, like Laplacian Eigenmaps (LE) [6], use graphs to find the embedding of the data in a low-dimensional space.

“Low-rank matrices on graphs (constructed between their rows and columns) can be recovered approximately from the uncompressed or compressed measurements via a dual graph regularization scheme.”

1. Two graphs:
For a data matrix Y ∈ R^{p×n}, we propose to construct two K-nearest neighbor graphs: 1) between the rows, G_r, and 2) between the columns, G_c, of Y. The goal is to recover a low-rank representation X of Y on these graphs.

2. Low-rank matrices on graphs (LRMG):
In Section 5 we present a first glimpse of our proposed framework and define a few basic concepts which provide the foundation for this work. More specifically, we introduce the novel concept of “Low-rank matrices on graphs” (LRMG). We define a LRMG as a matrix whose rows and columns belong to the span of the first few eigenvectors of the graphs constructed between its rows (G_r) and columns (G_c). This is a direct generalization of the concept of band-limited signals introduced in [7]. Many real world datasets can benefit from this notion for low-rank feature extraction.

3. Generalized Fast Robust PCA on Graphs (GFRPCAG):
Sections 6 and 9 provide a step-by-step, theoretically sound algorithmic construction of a general method to recover a LRMG. We call this method “Fast Robust PCA on Graphs” (FRPCAG) as it resembles Robust PCA in its functionality but is actually a much faster method. We show that the recovery of a LRMG depends on the spectral gaps of the Laplacians L_r and L_c of G_r and G_c. More specifically, we solve the following dual graph filtering problem:

min_X φ(Y − X) + γ_c tr(X L_c X⊤) + γ_r tr(X⊤ L_r X),    (1)

where φ is a convex loss function and γ_r, γ_c are the model parameters.

Motivated by a deeper theoretical and experimental understanding of the approximation error of FRPCAG, we generalize the recovery to general graph filters. We design a family of deterministic filters which is characterized by a simple and easy-to-tune parameter and prove, experimentally and theoretically, that it improves upon the error incurred by FRPCAG. We call this framework “Generalized FRPCAG” (GFRPCAG).

Our proposed methods are highly scalable and memory efficient. For a dataset Y ∈ R^{p×n}, the complete generalized framework, involving graph construction, compression, graph filtering and decoding, poses a complexity that is O(n log n) in the number of data samples n. In simple words, our framework can cluster the 70,000 digits of MNIST in less than a minute. The experimental results for the complete framework are presented in [1] and [3].

A summary of all the main notations used in this work is presented in Table 1. The rest of this work is organized as follows: in Section 3 we introduce graphs as the basic building blocks of our framework and present graph nomenclature. In Section 4 we discuss the state-of-the-art and the connections and differences of our framework with the prior work. In Section 5 we present the concept of Low-rank matrices on graphs (LRMG), which is key to our proposed framework.
In Section 6 we introduce Fast Robust PCA on Graphs (FRPCAG) as a recovery method for LRMG and present the optimization solution. In Section 7 we present a sound theoretical analysis of FRPCAG and analyze the approximation error. In Section 8 we present a few working examples of FRPCAG on artificial and real world datasets to justify our theoretical findings. In Section 9 we re-visit the approximation error of FRPCAG and present a generalized version, GFRPCAG, to recover LRMG with a lower approximation error. Section 10 concludes the paper by discussing a few cases for which it makes sense to perform dual graph filtering. Please refer to [1] for the experimental results.
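To make the dual graph filtering problem (1) concrete, the following is a minimal numpy sketch (our own toy illustration, not the implementation from [1]): we assume path-graph Laplacians for G_r and G_c and the smooth choice φ(Z) = ½‖Z‖²_F, build a low-rank matrix on graphs, corrupt it with Gaussian noise, and recover it by plain gradient descent on (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def path_laplacian(m):
    # Unnormalized Laplacian L = D - W of a path graph on m nodes.
    W = np.zeros((m, m))
    i = np.arange(m - 1)
    W[i, i + 1] = W[i + 1, i] = 1.0
    return np.diag(W.sum(1)) - W

# Toy row (feature) and column (sample) graphs standing in for G_r, G_c.
p, n, kr, kc = 20, 30, 3, 3
Lr, Lc = path_laplacian(p), path_laplacian(n)
_, P = np.linalg.eigh(Lr)          # eigenvectors, eigenvalues in ascending order
_, Q = np.linalg.eigh(Lc)

# A low-rank matrix on graphs: columns in span(P_kr), rows in span(Q_kc).
X0 = P[:, :kr] @ rng.standard_normal((kr, kc)) @ Q[:, :kc].T
Y = X0 + 0.1 * rng.standard_normal((p, n))       # noisy observation

gamma_r = gamma_c = 1.0                          # untuned penalty weights
step = 1.0 / (1 + 2 * gamma_c * np.linalg.eigvalsh(Lc)[-1]
                + 2 * gamma_r * np.linalg.eigvalsh(Lr)[-1])

X = Y.copy()
for _ in range(300):                             # gradient descent on (1)
    grad = (X - Y) + 2 * gamma_c * X @ Lc + 2 * gamma_r * Lr @ X
    X -= step * grad

# Dual graph filtering pulls Y back towards the clean LRMG X0.
print(np.linalg.norm(X - X0) < np.linalg.norm(Y - X0))
```

With this smooth loss the minimizer is a joint low-pass filter on both graphs, X̂_ij = Ŷ_ij / (1 + 2γ_c λ_{c,j} + 2γ_r λ_{r,i}) in the double spectral basis, which is why the iteration attenuates high-frequency noise while largely preserving the band-limited LRMG.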
A graph is a tuple G = {V, E, W} where V is a set of vertices, E a set of edges, and W : V × V → R+ a weight function. The vertices are indexed from 1, . . . , |V| and each entry of the weight matrix W ∈ R^{|V|×|V|}_+ contains the weight of the edge connecting the corresponding vertices: W_{i,j} = W(v_i, v_j). If there is no edge between two vertices, the weight is set to 0. We assume W is symmetric, non-negative and with zero diagonal. We denote by i ↔ j the connection of node v_i to node v_j.

Table 1: A summary of notations used in this work

Notation | Terminology
‖·‖_F | matrix Frobenius norm
‖·‖_1 | matrix ℓ1 norm
‖·‖_{2,1} | matrix ℓ2,1 norm for column group sparsity
n | number of data samples
p | number of features / pixels
k | dimension of the subspace / rank of the data matrix
K | number of classes in the dataset
Y ∈ R^{p×n} = UΣV⊤ | clean data matrix & its SVD
Y ∈ R^{p×n} = U_y Ω_y V_y⊤ | noisy data matrix & its SVD
X ∈ R^{p×n} = UΣV⊤ | low-rank noiseless approximation of Y & its SVD
V₁ (V₁⊤ = ΣV⊤) ∈ R^{p×k} | principal components of X
W ∈ R^{n×n} or R^{p×p} | adjacency matrix between samples / features of Y
D = diag(Σ_j W_{ij}) ∀i | diagonal degree matrix
σ | smoothing parameter / window width of the Gaussian kernel
G_c | graph between the samples / columns of Y
G_r | graph between the features / rows of Y
(V, E) | set of vertices, edges for a graph
γ_r | penalty for the G_r Tikhonov regularization term
γ_c | penalty for the G_c Tikhonov regularization term
K | number of nearest neighbors for the construction of graphs
C_c | sample covariance of Y
C_r | feature covariance of Y
L_r ∈ R^{p×p} = PΛ_rP⊤ = P_{k_r}Λ_{k_r}P⊤_{k_r} + P̄_{k_r}Λ̄_{k_r}P̄⊤_{k_r} | Laplacian for graph G_r & its EVD
L_c ∈ R^{n×n} = QΛ_cQ⊤ = Q_{k_c}Λ_{k_c}Q⊤_{k_c} + Q̄_{k_c}Λ̄_{k_c}Q̄⊤_{k_c} | Laplacian for graph G_c & its EVD
P_{k_r} ∈ R^{p×k_r}, Q_{k_c} ∈ R^{n×k_c} | first k_r, k_c eigenvectors in P, Q
P̄_{k_r} ∈ R^{p×(p−k_r)}, Q̄_{k_c} ∈ R^{n×(n−k_c)} | eigenvectors beyond k_r + 1, k_c + 1 in P, Q
Λ_{k_r} ∈ R^{k_r×k_r}, Λ_{k_c} ∈ R^{k_c×k_c} | Λ_r, Λ_c with the first k_r, k_c eigenvalues
Λ̄_{k_r} ∈ R^{(p−k_r)×(p−k_r)}, Λ̄_{k_c} ∈ R^{(n−k_c)×(n−k_c)} | Λ_r, Λ_c with all the eigenvalues above k_r, k_c

Standard graph construction methods

There can be several ways to define a graph G for a dataset. For example, for a dataset Y ∈ R^{p×n}, the vertices v_i of the graph G can correspond to the data samples y_i ∈ R^p. The most common method to construct graphs is via a standard K-nearest neighbors strategy. The first step consists of searching the closest neighbors for all the samples using Euclidean distances. Thus, each y_i is connected to its K nearest neighbors y_j, resulting in |E| ≈ Kn connections. The second step consists of computing the graph weight matrix W using one of several commonly used strategies [8]. We mention some of the most common schemes below:

Gaussian kernel weighting: W_{ij} = exp(−‖y_i − y_j‖²₂ / σ²) if y_j is connected to y_i, and 0 otherwise.

Binary weighting: W_{ij} = 1 if y_j is connected to y_i, and 0 otherwise.

Correlation weighting: W_{ij} = y_i⊤ y_j / (‖y_i‖₂ ‖y_j‖₂) if y_j is connected to y_i, and 0 otherwise.

One can also use an ℓ1 distance weighting scheme if the data is corrupted by outliers.

For a vertex v_i ∈ V (or a data sample y_i ∈ Y), the degree d(i) is defined as the sum of the weights of incident edges: d(i) = Σ_{j↔i} W_{i,j}. Finally, the graph can be characterized using a normalized or unnormalized graph Laplacian.
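The K-nearest neighbor construction with Gaussian weighting, together with both Laplacians, can be sketched as follows. This is a brute-force toy version (the helper name and sizes are ours); a FLANN-style approximate search would replace the exact O(n²) distance computation for large n:

```python
import numpy as np

def knn_gaussian_graph(Y, K=4, sigma=1.0):
    # Brute-force K-nearest-neighbor graph with Gaussian kernel weights.
    # Y is (p, n), one sample per column.
    n = Y.shape[1]
    D2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:K + 1]        # K closest samples, excluding i
        W[i, nbrs] = np.exp(-D2[i, nbrs] / sigma ** 2)
    return np.maximum(W, W.T)                    # symmetrize the adjacency

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 40))                 # 40 samples with 5 features
W = knn_gaussian_graph(Y)

d = W.sum(axis=1)                                # node degrees
L = np.diag(d) - W                               # unnormalized Laplacian
L_n = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian

print(np.allclose(L.sum(axis=1), 0))             # rows of L = D - W sum to zero
```

Note that by construction L is symmetric and positive semi-definite, with each row summing to zero, which is the starting point of the spectral machinery below.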
The normalized graph Laplacian L_n is defined as

L_n = D^{−1/2}(D − W)D^{−1/2} = I − D^{−1/2}WD^{−1/2},

where D is the diagonal degree matrix with diagonal entries D_{ii} = d(i) and I is the identity, and the un-normalized Laplacian is defined as L = D − W.

For big or high-dimensional datasets, i.e., large n or large p or both, the exact K-nearest neighbors strategy can become computationally cumbersome, with cost O(n²). In such a scenario the computations can be made efficient using the FLANN library (Fast Library for Approximate Nearest Neighbors searches in high dimensional spaces) [9]. The quality of the graphs constructed using this strategy is slightly lower than with the exact strategy, due to the approximate nearest neighbor search method. However, for a graph of n nodes the computational complexity is as low as O(n log n) [10]. Throughout this work we construct our graphs using FLANN, irrespective of the size of the datasets.

In this framework, a graph signal is defined as a function s : V → R assigning a value to each vertex. It is convenient to consider a signal s as a vector of size |V| with the i-th component representing the signal value at the i-th vertex. For a signal s living on the graph G, the gradient ∇_G : R^{|V|} → R^{|E|} is defined as

∇_G s(i, j) = √(W(i, j)) ( s(j)/√(d(j)) − s(i)/√(d(i)) )

for i ↔ j. For a signal c living on the graph edges, the adjoint of the gradient ∇*_G : R^{|E|} → R^{|V|}, called the divergence, can be written as

∇*_G c(i) = Σ_{i↔j} √(W(i, j)) ( c(i, i)/√(d(i)) − c(i, j)/√(d(j)) ).

The unnormalized Laplacian corresponds to the second-order derivative and its definition arises from Ls := ∇*_G ∇_G s. Since the Laplacian L is by construction a symmetric positive semi-definite operator, it always possesses a complete set of orthonormal eigenvectors that we denote by {q_ℓ}, ℓ = 0, 1, . . . , n − 1.
For convenience, and without loss of generality, we order the set of eigenvalues as follows: 0 = λ₀ < λ₁ ≤ λ₂ ≤ . . . ≤ λ_{n−1} = λ_max. When the graph is connected, there is only one zero eigenvalue. In fact, the multiplicity of the zero eigenvalue is equal to the number of connected components. See for example [11] for more details on spectral graph theory. For simplicity, we use the following notation: L = QΛQ⊤. For symmetric (Hermitian) matrices, we define the following operator: g(L) = Q g(Λ) Q⊤, where g(Λ) is a diagonal matrix with g(Λ)_{ℓ,ℓ} = g(λ_ℓ).

One of the key ideas of this spectral decomposition is that the Laplacian eigenfunctions can be used as a graph “Fourier” basis [12]. The graph Fourier transform x̂ of a signal x defined on a graph G is the projection onto the orthonormal set of eigenvectors Q of the graph Laplacian associated with G:

x̂ = Q⊤x.    (2)

Since Q is an orthonormal matrix, the inverse graph Fourier transform becomes:

x = Qx̂.    (3)

The spectrum of the Laplacian replaces the frequencies as coordinates in the Fourier domain, with the relation |f_ℓ| = √λ_ℓ. The main motivation of this definition is the fact that the Fourier modes are intuitively the oscillating modes for the frequencies given by the eigenvalues [12, 13]. A very important property of these modes is that the higher the eigenvalue, the more the mode oscillates. As a consequence, high eigenvalues are associated with high frequencies and low eigenvalues correspond to low frequencies.

In many problems, tr(X⊤LX) is used as a smoothness regularization term for the signals contained in X. The most common interpretation is that it is equivalent to penalizing the gradient of the function, i.e.:

tr(X⊤LX) = ‖∇_G X‖²_F.    (4)

However, we can also study this term from a different angle.
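The transform pair (2)–(3) and the eigenvalue weighting of the quadratic form can be illustrated numerically; this is a toy check on a small path graph (our own sketch, not taken from the paper):

```python
import numpy as np

# Unnormalized Laplacian of a path graph on 8 nodes.
m = 8
W = np.zeros((m, m))
i = np.arange(m - 1)
W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(1)) - W

lam, Q = np.linalg.eigh(L)                 # graph frequencies and Fourier basis
s = np.random.default_rng(1).standard_normal(m)

s_hat = Q.T @ s                            # graph Fourier transform, eq. (2)
print(np.allclose(Q @ s_hat, s))           # inverse transform, eq. (3)

# The Laplacian quadratic form weights each Fourier coefficient by its eigenvalue.
print(np.isclose(s @ L @ s, np.sum(lam * s_hat ** 2)))

# A connected path graph has exactly one (numerically) zero eigenvalue.
print(abs(lam[0]) < 1e-8 and lam[1] > 1e-8)
```

The second check, s⊤Ls = Σ_ℓ λ_ℓ ŝ_ℓ², is the single-signal instance of the spectral penalization derived next.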
Using the graph Fourier transform, it is possible to rewrite it as a spectral penalization:

tr(X⊤LX) = tr(X⊤QΛQ⊤X) = tr(X̂⊤ΛX̂) = ‖Λ^{1/2}X̂‖²_F.    (5)

From the previous equation, we observe that the regularization term tr(X⊤LX) penalizes the Fourier transform of the signal with the weights √Λ. This point of view suggests a new potential regularization term. We can use a function g : R+ → R+ to change the frequency weighting:

tr(X⊤g(L)X) = tr(X̂⊤g(Λ)X̂) = ‖g(Λ)^{1/2}X̂‖²_F.    (6)

This technique will be a good asset to improve the quality of the low-rank representation, as described in the following sections.

Prior Work: Connections & Differences with Our Framework

“The goal is NOT to improve the low-rank representation via graph regularization. Instead, we aim to solely recover an approximate low-rank matrix with dual-graph regularization only. We show that one can obtain a good enough low-rank representation without using the expensive nuclear norm or non-convex matrix factorization.”

Table 2: A comparison of various PCA models and their properties. Y ∈ R^{p×n} is the data matrix, U ∈ R^{p×k} and V ∈ R^{n×k} are the principal directions and principal components in a k-dimensional linear space (rank = k). X = UV⊤ ∈ R^{p×n} is the low-rank representation and S ∈ R^{p×n} is the sparse matrix. L, Φ_g and Φ_gh ∈ S^{n×n} characterize a simple graph or a hypergraph G between the samples of X. ‖·‖_F, ‖·‖∗ and ‖·‖₁ denote the Frobenius, nuclear and ℓ1 norms respectively.

Model | Objective | Constraints | Parameters | Graph? | Factors? | Convex?
1 PCA | min_{U,V} ‖Y − UV⊤‖²_F | U⊤U = I | k | no | yes | no
2 RPCA [2] | min_{X,S} ‖X‖∗ + λ‖S‖₁ | Y = X + S | λ | no | no | yes
3 RPCAG [14] | min_{X,S} ‖X‖∗ + λ‖S‖₁ + γ tr(XLX⊤) | Y = X + S | λ, γ | yes | no | yes
4 GLPCA [15] | min_{U,V} ‖X − UV⊤‖²_F + γ tr(V⊤LV) | V V⊤ = I | k, γ | yes | yes | no
5 RGLPCA [15] | min_{U,V} ‖X − UV⊤‖_{2,1} + γ tr(V⊤LV) | V V⊤ = I | k, γ | yes | yes | no
6 MMF [16] | min_{U,V} ‖X − UV⊤‖²_F + γ tr(V⊤LV) | U⊤U = I | k, γ | yes | yes | no
7 MMMF [17] | min_{U,V,α} ‖X − UV⊤‖²_F + γ tr(V⊤(Σ_g α_g Φ_g)V) + β‖α‖² | U⊤U = I | k, γ, β | yes | yes | no
8 Hypergraph ensemble [8] | min_{U,V,α} ‖X − UV⊤‖²_F + γ tr(V⊤(Σ_g α_g Φ_gh)V) + β‖α‖² | 1⊤α = 1 | k, γ, β | yes | yes | no
9 LRR [18] | min_{X,S} ‖X‖∗ + λ‖S‖₂,₁ | Y = YX + S | λ | no | no | yes
10 GLRR [19] | min_{X,S} ‖X‖∗ + λ‖S‖₂,₁ + γ tr(XLX⊤) | Y = YX + S | λ, γ | yes | no | yes

The idea of extracting enhanced low-rank and sparse representations using graphs has been around for a while now. Many recent works propose to incorporate the data manifold information, in the form of a discrete graph, into the dimensionality reduction framework [15, 16, 20, 21, 17, 22, 8, 23, 24]. In fact, for PCA, this can be considered as a method of exploiting the local smoothness information in order to improve the low-rank representation and clustering quality.

Depending on how the graph information is used in the context of PCA, we divide the works into two types:

• The graph encodes the smoothness of the principal components V. Such methods explicitly learn the basis U and components V, and we refer to them as factorized models.
• The graph encodes the smoothness of the low-rank matrix X as a whole, and we refer to such methods as non-factorized models.

The graph smoothness of the principal components V using the graph Laplacian L has been exploited in various works that explicitly learn V and the basis U. We refer to such models as the ones which use the principal components graph. In this context, Graph Laplacian PCA (GLPCA) was proposed in [15] (model 4 in Table 2). This model explicitly learns the basis U and principal components V by assuming that V is smooth on the graph of data samples. Note that, as compared to standard PCA, the model requires V to be orthonormal, instead of U. The closed form solution to this problem is given by an eigenvalue decomposition.
The model is non-convex; however, a globally unique solution can be obtained for small datasets via the closed form solution.

Manifold Regularized Matrix Factorization (MMF) [16] (model 6 in Table 2) is another such factorized model which explicitly learns the basis U and the principal components V. Note that, like standard PCA, the orthonormality constraint in this model is on U, instead of the principal components V. This model implicitly encodes the smoothness of the low-rank matrix X on the graph, since tr(XℒX⊤) = tr(V⊤ℒV) for X = UV⊤.

More recently, the idea of using multiple graphs has appeared in the PCA community. In this context, the smoothness of the principal components V has been proposed on a linear combination of graphs. Such models are particularly useful when the data is noisy and one cannot rely on a single graph construction strategy: a linear combination of graphs constructed with various strategies might provide robustness to noise. Tao et al. [17] propose an ensemble of graphs in this context, whereas Jin et al. [8] propose to use an ensemble of hypergraphs (the 7th and 8th models in Table 2). All of the above mentioned models are non-convex and require the rank k to be specified as a model parameter.

Another important line of work related to the factorized models computes a non-negative low-rank representation (NMF) of the data by decomposing the data matrix as a product of two thin non-negative matrices [25]. Several models which incorporate graph regularization into NMF have been proposed in the literature, e.g. [21].

In contrast to the factorized models, the non-factorized models directly learn a low-rank matrix X by decomposing the data matrix Y into a sum of a low-rank X and a sparse matrix S. Such models are convex. The authors of [14] have generalized robust PCA [2] by incorporating graph smoothness (model 3 in Table 2).
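The implicit smoothness identity mentioned above, tr(XℒX⊤) = tr(V⊤ℒV) for X = UV⊤ with orthonormal U, is easy to verify numerically. The sketch below uses an arbitrary random graph and random factors of our own choosing, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 8, 5, 3

# A small symmetric weight matrix and its combinatorial Laplacian
# L = D - W for a graph between the n samples.
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

# Factorized low-rank matrix X = U V^T with orthonormal U (U^T U = I_k),
# as in the MMF model.
U, _ = np.linalg.qr(rng.standard_normal((p, k)))
V = rng.standard_normal((n, k))
X = U @ V.T

# tr(X L X^T) = tr(V^T L V U^T U) = tr(V^T L V) since U^T U = I_k.
lhs = np.trace(X @ L @ X.T)
rhs = np.trace(V.T @ L @ V)
assert abs(lhs - rhs) < 1e-8
```

The equality follows from the cyclic invariance of the trace, which is why the graph on the samples can equivalently be seen as regularizing either X or the components V.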
In their model, they propose a simple graph smoothness regularization term directly on the low-rank matrix instead of the principal components, and improve the clustering and low-rank recovery properties. They call it Robust PCA on Graphs (RPCAG). We refer to such models as the ones which use a low-rank graph.

Another line of work, known as Low-Rank Representations (LRR) [18] (model 9 in Table 2), falls under the category of convex non-factorized models. More specifically, LRR corresponds to a generalization of RPCA to multiple subspaces. The framework is a dictionary learning problem, where one would like to encode each data sample as a low-rank combination of the atoms of a dictionary. In its most basic form, LRR uses the data matrix itself as the dictionary. Later on, the authors of [19], [26], [27] proposed to incorporate the smoothness of the low-rank codes on a graph into the LRR framework (model 10 in Table 2). This enhances the subspace recovery capability of simple LRR. An extension of graph regularized LRR with local constraints for graph construction has also been proposed in [28].

Both the factorized and the non-factorized PCA-based models suffer from a few problems. The factorized models require the rank k to be specified as a model parameter. One never knows the exact rank of real world data, so several parameters need to be tried to obtain an optimal solution. The specification of the rank makes these algorithms scalable to some extent, because they do not need a full SVD. However, the non-convexity requires the algorithms to be run several times for each parameter setting unless a good initialization scheme is known. Furthermore, GLPCA and MMF are not robust to gross errors and outliers in the dataset; making them robust requires a modification of the objective function, thus adding another model parameter.

The non-factorized models, on the other hand, are convex but not scalable.
RPCA and RPCAG require the full SVD of the estimated low-rank matrix in every iteration of the algorithm. For a data matrix Y ∈ ℝ^{p×n} the complexity of the SVD is O(np²), where p < n. LRR and GLRR are even more computationally complex because they require the SVD of an n×n matrix, which costs O(n³). As already demonstrated with the simple examples of Fig. 1, RPCA recovers only linear structures in the data. Although RPCAG incorporates graph structure into the standard RPCA framework, the issue of scalability remains. Randomized algorithms reduce the computational complexity of these methods, but they still require an SVD on the compressed data, which is still an issue for very big datasets.

The idea of using two graph regularization terms has previously appeared in the works on matrix completion [29], co-clustering [30], NMF [31], [32] and, more recently, in the context of low-rank representation [33]. However, to the best of our knowledge, all these models aim to improve the clustering quality of the data in the low-dimensional space. The co-clustering & NMF based models which use such a scheme [30], [31] suffer from non-convexity, and the works of [29] and [33] use a nuclear-norm formulation which is computationally expensive and not scalable to big datasets. Our proposed method is different from these models in the following sense:

• We do not target an improvement of the low-rank representation via graphs. Our method aims to solely recover an approximate low-rank matrix with dual-graph regularization only. The underlying motivation is that one can obtain a good enough low-rank representation without using the expensive nuclear norm or non-convex matrix factorization. Note that the NMF-based method [31] targets the smoothness of the factors of the low-rank matrix, while the co-clustering approach [30] focuses on the smoothness of the labels.
Our method, on the other hand, directly targets the recovery of the low-rank matrix itself, not that of the factors or labels.

• We introduce the concept of low-rank matrices on graphs and provide a theoretical justification for the success of our model. The use of PCA as a scalable and efficient clustering method via dual graph regularization surfaces for the very first time in this paper.
A NOVEL CONCEPT: "A matrix is said to be low-rank on the graphs (LRMG) constructed between its rows G_r and its columns G_c if its rows belong to the span of the eigenvectors of G_c and its columns to the span of the eigenvectors of G_r."

In this section we define the novel concept of low-rank matrices on graphs, which motivates a fast recovery method to approximate Robust PCA.

Consider a simple example of a digit from the traditional USPS dataset (example taken from [34]). We vectorize all the images and form a data matrix Y whose columns consist of different samples of the digit from the USPS dataset. We build the K-nearest neighbors graph of features G_r, i.e., a graph between the rows of Y, and G_c, i.e., a graph between the columns of Y, using the FLANN strategy of Section 3. Let ℒ_r ∈ ℝ^{p×p} = D_r − W_r = PΛ_rP⊤ and ℒ_c ∈ ℝ^{n×n} = D_c − W_c = QΛ_cQ⊤ be the normalized graph Laplacians for the graphs G_r, G_c between the rows and the columns of the data Y.

Fig. 4 shows the eigenvectors of the normalized Laplacian ℒ_r, denoted by P. We observe that they have a digit-like shape. In Fig. 4, we also plot the eigenvectors associated with the experimental covariance matrix C_r. We observe that both sets of eigenvectors are similar. This is confirmed by computing the following matrix:

Γ_r = P⊤C_rP. (7)

In order to measure the level of alignment between the orthogonal basis P and the basis of C_r, we use the following ratio, which we call the alignment order:

s_r(Γ_r) = (Σ_ℓ (Γ_r)²_{ℓ,ℓ}) / (Σ_ℓ Σ_{ℓ'} (Γ_r)²_{ℓ,ℓ'}) = ‖diag(Γ_r)‖₂² / ‖Γ_r‖²_F. (8)

The experimental covariance matrix is computed as C_r = ỸỸ⊤/n, where n is the number of samples and Ỹ = Y − µ_Y, with µ_Y = (1/n)Σ_{j=1}^n (Y)_j the sample mean. This definition is motivated in [34]. When C_r and the graph Laplacian ℒ_r are simultaneously diagonalizable, Γ_r is diagonal, giving a ratio equal to 1.
On the contrary, when the bases are not aligned, the ratio is close to 1/p, where p is the dimension of the dataset. Note that an alternative would be to compute directly the inner products between P and the eigenvectors of C_r. However, using Γ_r we implicitly weight the eigenvectors of C_r according to the importance given by their corresponding eigenvalues.

In the special case of this digit, we obtain a ratio s_r(Γ_r) close to 1, meaning that the main covariance eigenvectors are well aligned with the graph eigenvectors. Fig. 4 shows a few eigenvectors of both sets and the matrix Γ_r. This effect has been studied in [34], where the definition of stationary signals on graphs is proposed. A similar idea is also the motivation of the Laplacianfaces algorithm [35].

Figure 4: Studying a digit of the USPS dataset. Left: covariance eigenvectors associated with the highest eigenvalues. Middle: Laplacian eigenvectors associated with the smallest non-zero eigenvalues. Because of stationarity, the Laplacian eigenvectors are similar to the covariance eigenvectors. Right: Γ_r = P⊤C_rP in dB; note the diagonal shape of the matrix, implying that P is aligned with the eigenvectors of C_r. (Figure taken from [34] with permission.)

A closer look at the rightmost plot of Fig. 4 shows that most of the energy on the diagonal is concentrated in the first few entries. To study how well the first few eigenvectors of P and C_r are aligned, we compute another ratio, which we call the rank-k alignment order:

ŝ_r(Γ_r) = (Σ_{ℓ=1}^k (Γ_r)²_{ℓ,ℓ}) / (Σ_ℓ (Γ_r)²_{ℓ,ℓ}). (9)

This ratio denotes the fraction of the diagonal energy which is concentrated in the first k entries. For the USPS example above, with k = 10 most of the diagonal energy is concentrated in the first ten entries.
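The alignment orders (8) and (9) are straightforward to compute. The sketch below uses synthetic covariances of our own choosing (not the USPS data) to check that s(Γ) = 1 when the two bases coincide and is small when a low-rank covariance lives in an unrelated basis:

```python
import numpy as np

def alignment_order(P, C):
    """Alignment order s(Γ) = ‖diag(Γ)‖² / ‖Γ‖²_F for Γ = PᵀCP (Eq. 8)."""
    G = P.T @ C @ P
    return np.sum(np.diag(G) ** 2) / np.sum(G ** 2)

def rank_k_alignment_order(P, C, k):
    """Rank-k alignment order (Eq. 9): fraction of diagonal energy in the first k entries."""
    d = np.diag(P.T @ C @ P) ** 2
    return d[:k].sum() / d.sum()

rng = np.random.default_rng(1)
p = 20

# Eigenvectors of the Laplacian of an arbitrary random graph on p vertices.
W = rng.random((p, p)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
_, P = np.linalg.eigh(L)

# Covariance diagonalized by the SAME basis: perfectly aligned, so s = 1.
C_aligned = P @ np.diag(rng.random(p)) @ P.T
s_aligned = alignment_order(P, C_aligned)

# Low-rank covariance (fast-decaying spectrum) in an unrelated random basis:
# the diagonal of Γ captures little energy, so s is small.
R, _ = np.linalg.qr(rng.standard_normal((p, p)))
C_random = R @ np.diag(2.0 ** -np.arange(p)) @ R.T
s_random = alignment_order(P, C_random)

frac10 = rank_k_alignment_order(P, C_aligned, 10)
assert s_aligned > 0.999 and s_random < s_aligned
assert 0 < frac10 <= 1 + 1e-12
```

The function names and test matrices here are ours; on real data one would plug in the Laplacian eigenvectors and the experimental covariance as defined above.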
Thus, the first 10 eigenvectors of the Laplacian are very well aligned with the first 10 eigenvectors of the covariance matrix. This phenomenon implies that this digit of the USPS dataset is low-rank, i.e., only the first few eigenvectors (corresponding to the low eigenvalues) are enough to serve as features for this dataset.

Of course, this phenomenon also holds for the full dataset. Let us analyze how the graph eigenvectors evolve when all the digits are taken into account. Fig. 5 shows the Laplacian and covariance eigenvectors for the full USPS dataset. Again, we observe some alignment.

From this example, we can conclude that every column of a low-rank matrix X lies approximately in the span of the eigenvectors P_{k_r} of the features graph G_r, where P_{k_r} denotes the eigenvectors corresponding to the smallest k_r eigenvalues. This is similar to PCA, where a low-rank matrix is represented in the span of the first few principal directions or atoms of the basis. Alternately, the Laplacian eigenvectors are meaningful features for the USPS dataset. Let the eigenvectors P of ℒ_r be divided into two sets (P_{k_r} ∈ ℝ^{p×k_r}, P̄_{k_r} ∈ ℝ^{p×(p−k_r)}). Then, we can write

ℒ_r = PΛ_rP⊤ = P_{k_r}Λ_{k_r}P⊤_{k_r} + P̄_{k_r}Λ̄_{k_r}P̄⊤_{k_r}. (10)

Note that the columns of P_{k_r} contain the eigenvectors corresponding to the low graph frequencies and P̄_{k_r} contains those corresponding to the higher graph frequencies. Now we can write X = X* + E, where X* is the low-rank part and E models the noise or corruptions. Thus, X = P_{k_r}A + P̄_{k_r}Ā and X* = P_{k_r}A, where A ∈ ℝ^{k_r×n} and Ā ∈ ℝ^{(p−k_r)×n}. From Fig. 4 it is also clear that ‖P̄_{k_r}Ā‖_F ≪ ‖P_{k_r}A‖_F for a specific value of k_r.

Figure 5: Studying the full USPS dataset. Left: covariance eigenvectors associated with the highest eigenvalues. Middle: Laplacian eigenvectors associated with the smallest non-zero eigenvalues. Because of stationarity, the Laplacian eigenvectors are similar to the covariance eigenvectors. Right: Γ_r = P⊤C_rP in dB; note the diagonal shape of the matrix, implying that P is aligned with the eigenvectors of C_r. (Figure taken from [34] with permission.)

Let ℒ_c ∈ ℝ^{n×n} = D_c^{−1/2}(D_c − W_c)D_c^{−1/2} be the normalized graph Laplacian for the graph G_c between the columns of the data Y. The smallest eigenvectors of the graph of samples (columns) G_c provide an embedding of the data in the low-dimensional space [6]. This is at the heart of many algorithms in clustering [36] and dimensionality reduction [6], and has a similar interpretation as the principal components in PCA. Let C_c be the sample covariance of the data Y, ℒ_c = QΛ_cQ⊤ and Γ_c = Q⊤C_cQ; then we can define the alignment order s_c(Γ_c) and the rank-k alignment order ŝ_c(Γ_c) as for the rows of the data Y and derive the same conclusions. We do not repeat this procedure here for brevity.

Thus, we argue that every row of a low-rank matrix lies in the span of the first few eigenvectors of the graph of samples G_c. In our present application this term has two effects:

1. Firstly, when the data has a class structure, the graph of samples enforces the low-rank X to benefit from this class structure. This results in an enhanced clustering of the low-rank signals.

2. Secondly, it forces the low-rank X of the signals to be well represented by the first few Laplacian eigenvectors associated with the low λ_{c,j}.

Let the eigenvectors Q of ℒ_c be divided into two sets (Q_{k_c} ∈ ℝ^{n×k_c}, Q̄_{k_c} ∈ ℝ^{n×(n−k_c)}), where Q_{k_c} denotes the eigenvectors in Q corresponding to the smallest k_c eigenvalues. Then, we can write

ℒ_c = QΛ_cQ⊤ = Q_{k_c}Λ_{k_c}Q⊤_{k_c} + Q̄_{k_c}Λ̄_{k_c}Q̄⊤_{k_c}. (11)

Note that the columns of Q_{k_c} contain the eigenvectors corresponding to the low graph frequencies and Q̄_{k_c} contains those corresponding to the higher graph frequencies. Now we can write X = BQ⊤_{k_c} + B̄Q̄⊤_{k_c} and X* = BQ⊤_{k_c}, where B ∈ ℝ^{p×k_c} and B̄ ∈ ℝ^{p×(n−k_c)}. As argued in the previous subsection, ‖B̄Q̄⊤_{k_c}‖_F ≪ ‖BQ⊤_{k_c}‖_F.

5.1.3 Real world examples

Many real world datasets can benefit from such notions of data similarity. We have already described an example for the USPS digits. Other examples include the matrix of users and items in a typical recommendation system, where one can assume that the users and the items form communities among themselves; it is thus meaningful to exploit one graph similarity between the users and another between the items. One can also use such a notion of similarity to extract repeating patterns across space and time, such as in EEG, fMRI and other types of brain data.
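The decomposition X = P_{k_r}A + P̄_{k_r}Ā described above can be reproduced numerically by orthogonal projection onto the two sets of Laplacian eigenvectors. The graph and the coefficient matrices below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, kr = 30, 50, 5

# Combinatorial Laplacian of a random graph between the p features (rows).
W = rng.random((p, p)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
Lr = np.diag(W.sum(axis=1)) - W
lam, P = np.linalg.eigh(Lr)          # eigenvalues sorted in ascending order

Pk, Pk_bar = P[:, :kr], P[:, kr:]    # low / high graph-frequency eigenvectors

# A matrix whose columns live mostly in span(P_k), plus a small
# high-frequency part playing the role of the corruption E.
A = rng.standard_normal((kr, n))
A_bar = 0.01 * rng.standard_normal((p - kr, n))
X = Pk @ A + Pk_bar @ A_bar

# Recover the two parts by orthogonal projection (P is orthonormal).
X_low = Pk @ (Pk.T @ X)              # = Pk A, the low-rank part X*
E = Pk_bar @ (Pk_bar.T @ X)          # = Pk_bar A_bar, the residual

assert np.allclose(X, X_low + E)
assert np.linalg.norm(E) < 0.1 * np.linalg.norm(X_low)   # ‖P̄Ā‖_F ≪ ‖PĀ... ‖PA‖_F
```

Because `eigh` returns an orthonormal basis, the projections exactly recover A and Ā; on real data the low-frequency part would only approximately capture X*.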
Thus we propose to construct two graphs: one between the rows and another between the columns of the data. Clearly, our proposed framework moves in a direction where we want to exploit the row- and column-graph based similarity of the data matrix to extract the low-rank representation. In simpler words, we look for the underlying noiseless low-rank data matrix whose rows vary smoothly on the row graph and whose columns vary smoothly on the column graph.

Smoothness in general corresponds to the low frequency content of a signal. For example, for a database of noisy images of a landscape, the smooth images would be the clean landscape, which constitutes most of the information in the dataset. In other words, a smooth signal corresponds to a representative signal which explains most of the data variance, or which corresponds to the low-frequency components of the data. Now, how can we characterize the notion of smoothness on a graph in a formal manner? This motivates us to define low-rank matrices on graphs.
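The graph Dirichlet energy x⊤ℒx gives exactly such a formal measure of smoothness. The following sketch, on a simple path graph of our own choosing, shows that a slowly varying signal has much lower energy than its noisy version:

```python
import numpy as np

p = 100
# Laplacian of a path graph: each node connected to its neighbors.
L = 2 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
L[0, 0] = L[-1, -1] = 1              # boundary nodes have degree 1

t = np.linspace(0, 1, p)
smooth = np.sin(2 * np.pi * t)       # slowly varying signal on the path
rng = np.random.default_rng(3)
noisy = smooth + 0.5 * rng.standard_normal(p)

# x^T L x equals the sum of squared differences across the edges,
# so it is small for smooth signals and large for noisy ones.
e_smooth = smooth @ L @ smooth
e_noisy = noisy @ L @ noisy
assert e_smooth < e_noisy
```

The same quantity, in matrix form tr(X⊤ℒX), is precisely the graph Tikhonov term used as a regularizer in the recovery problem defined below in this report.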
From the above explanation of the roles of the two graphs, we can conclude the following facts about the representation of any clusterable low-rank matrix X* ∈ ℝ^{p×n} (p features and n samples):

1. Each of its columns can be represented in the span of the Laplacian eigenvectors of the graph of features G_r, i.e., X* = P_{k_r}A.

2. Each of its rows can be represented in the span of the Laplacian eigenvectors of the graph of samples, i.e., X* = BQ⊤_{k_c}.

As already pointed out, only the first k_c or k_r eigenvectors of the graphs correspond to the low frequency information; the other eigenvectors correspond to the high frequency content. We are now in a position to define a low-rank matrix on graphs.

Definition 1. A matrix X* is (k_r, k_c)-low-rank on the graphs ℒ_r and ℒ_c if (X*⊤)_i ∈ span(Q_{k_c}) for all i = 1, …, p, and (X*)_j ∈ span(P_{k_r}) for all j = 1, …, n. The set of (k_r, k_c)-low-rank matrices on the graphs ℒ_r and ℒ_c is denoted by LR(P_{k_r}, Q_{k_c}).

We note here that X* ∈ span(P_{k_r}) means that the columns of X* are in span(P_{k_r}), i.e., (X*)_i ∈ span(P_{k_r}) for all i = 1, …, n, where for any matrix A, (A)_i is its i-th column vector.

RECOVERY: "Low-rank matrices on graphs (LRMG) can be recovered via dual graph regularization with a suitable loss function. This method is called Fast Robust PCA on Graphs (FRPCAG)."

We want to recover a matrix X that is smooth / low-rank with respect to the row and column graphs of the data Y. Furthermore, we want the recovery to be robust to noise and a wide variety of gross errors in the data. Thus, we solve the following generalized robust low-rank recovery problem:

min_X Φ(X − Y)  s.t.  X ∈ span(P_{k_r}), X⊤ ∈ span(Q_{k_c}), (12)

where Φ(X − Y) models a smooth or non-smooth loss function depending on the type of noise or errors in the dataset. We provide a few examples below:

1. ‖X − Y‖₁, the ℓ1 loss function, which provides robustness to sparse gross errors in the dataset.

2. ‖X − Y‖²_F, the ℓ2 loss function, which provides robustness to Gaussian noise.

3. ‖X − Y‖_{2,1}, the mixed ℓ2,1 loss function, which provides robustness to sample specific corruptions and outliers; the ℓ2,1 norm forces entire columns of a matrix to be zero, promoting group sparsity along the columns.

In the remainder of this paper we only discuss (12) with the ℓ1 loss function. However, our theoretical results are general and can easily be extended to the case of more specific loss functions.

Eq. (12) is computationally expensive to solve because it requires the knowledge of P_{k_r} and Q_{k_c}, which can be obtained by diagonalizing the Laplacians ℒ_r and ℒ_c at a cost of O(p³) and O(n³) respectively. Therefore, we need to transform the constraints such that we do not require the diagonalization of the Laplacians. Given a vector x, its smoothness on a graph with Laplacian ℒ can be measured by the graph Tikhonov / graph Dirichlet energy x⊤ℒx (as mentioned in Section 3): the lower this energy, the smoother / more low-rank the signal x. Thus we can transform problem (12) into the following:

min_X Φ(X − Y) + γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX), (13)

where γ_c and γ_r are parameters which control the regularization of the two graph Tikhonov terms. These constants trade off the amount of signal energy in the high frequency content versus the low frequency content; thus, they indirectly control the rank of X. Using the ℓ1 loss function in (12) we get:

min_X ‖X − Y‖₁ + γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX). (14)

Model (14) is known as Fast Robust PCA on Graphs (FRPCAG) [1] and corresponds to our earlier work in this direction.

We use the Fast Iterative Soft Thresholding Algorithm (FISTA) [37] to solve problem (14). Let g: ℝ^N → ℝ be a convex, differentiable function with a β-Lipschitz continuous gradient ∇g, and h: ℝ^N → ℝ a convex function with a proximity operator prox_h: ℝ^N → ℝ^N defined as

prox_{λh}(y) = argmin_s (1/2)‖s − y‖₂² + λh(s).

Our goal is to minimize the sum g(s) + h(s), which is done efficiently with proximal splitting methods. More information about proximal operators and splitting methods for non-smooth convex optimization can be found in [38]. For model (14), g(X) = γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX) and h(X) = ‖Y − X‖₁. The gradient of g is

∇g(X) = 2(γ_c Xℒ_c + γ_r ℒ_rX). (15)

We define an upper bound on the Lipschitz constant β as β ≤ β′ = 2γ_c‖ℒ_c‖₂ + 2γ_r‖ℒ_r‖₂, where ‖ℒ‖₂ is the spectral norm (or maximum eigenvalue) of ℒ. Moreover, the proximal operator of the function h is the ℓ1 soft-thresholding, given by the elementwise operations (here ∘ is the Hadamard product)

prox_{λh}(X) = Y + sgn(X − Y) ∘ max(|X − Y| − λ, 0). (16)

The FISTA algorithm [37] can now be stated as Algorithm 1, where λ is the step size (we use λ = 1/β′), ε the stopping tolerance and J the maximum number of iterations.

Algorithm 1 FISTA for FRPCAG
INPUT: S_1 = Y, X_0 = Y, t_1 = 1, ε > 0
for j = 1, …, J do
  X_j = prox_{λh}(S_j − λ∇g(S_j))
  t_{j+1} = (1 + √(1 + 4t_j²)) / 2
  S_{j+1} = X_j + ((t_j − 1)/t_{j+1})(X_j − X_{j−1})
  if ‖S_{j+1} − S_j‖_F < ε‖S_j‖_F then
    BREAK
  end if
end for
OUTPUT: X_{j+1}

While [1] focuses more on the experimental results for FRPCAG, one of the main contributions of this work is to present a theoretical understanding of this model in a more involved manner. Furthermore, we also present a more generalized version of FRPCAG in the discussion that follows. Therefore, we take a step back here and analyze the model (12) theoretically. It is this analysis which reveals that the model recovers an approximate low-rank representation. First, we present our first main result in simple words:

"FRPCAG (12) gives an approximate low-rank representation X of an LRMG Y which is (k_c, k_r)-clusterable across its columns and rows. The left and right singular vectors of the low-rank matrix X are given by a subspace rotation operation and the singular values are penalized by the graph eigenvalues. The approximation error of X depends on the spectral gaps of the Laplacians ℒ_r and ℒ_c, where the spectral gaps are defined as the ratios λ_{k_r}/λ_{k_r+1} and λ_{k_c}/λ_{k_c+1}."

Now we are ready to formalize our findings mathematically and prove that any solution of (12) yields an approximately low-rank matrix. In fact, we prove this for any proper, positive, convex and lower semi-continuous loss function φ (possibly ℓp-norms ‖·‖₁, ‖·‖₂², …, ‖·‖_p^p). We re-write (12) with a general loss function φ:

min_X φ(Y − X) + γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX). (17)

Before presenting our mathematical analysis we gather a few facts which will be used later:

• We assume that the observed data matrix Y satisfies Y = Y* + E, where Y* ∈ LR(P_{k_r}, Q_{k_c}) and E models noise/corruptions. Furthermore, for any Y* ∈ LR(P_{k_r}, Q_{k_c}) there exists a matrix C such that Y* = P_{k_r}CQ⊤_{k_c}.
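Algorithm 1 can be sketched in a few lines of NumPy. This is a didactic dense-matrix sketch under our own assumptions (combinatorial Laplacians, ℓ1 loss); a practical implementation would use sparse Laplacians, and the function and variable names are ours:

```python
import numpy as np

def soft_threshold_around(X, Y, lam):
    """prox of lam * ||X - Y||_1: elementwise soft-thresholding (Eq. 16)."""
    D = X - Y
    return Y + np.sign(D) * np.maximum(np.abs(D) - lam, 0)

def frpcag(Y, Lr, Lc, gamma_r, gamma_c, max_iter=500, tol=1e-6):
    """FISTA for min_X ||X - Y||_1 + gamma_c tr(X Lc X^T) + gamma_r tr(X^T Lr X)."""
    # Upper bound on the Lipschitz constant of the smooth part's gradient:
    # beta' = 2 gamma_c ||Lc||_2 + 2 gamma_r ||Lr||_2; step size = 1/beta'.
    beta = 2 * gamma_c * np.linalg.norm(Lc, 2) + 2 * gamma_r * np.linalg.norm(Lr, 2)
    step = 1.0 / beta
    S, X_prev, t = Y.copy(), Y.copy(), 1.0
    for _ in range(max_iter):
        grad = 2 * (gamma_c * S @ Lc + gamma_r * Lr @ S)   # Eq. (15)
        X = soft_threshold_around(S - step * grad, Y, step)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2          # FISTA momentum
        S_next = X + ((t - 1) / t_next) * (X - X_prev)
        if np.linalg.norm(S_next - S) < tol * np.linalg.norm(S):
            break
        S, X_prev, t = S_next, X, t_next
    return X
```

Note that each iteration only needs matrix products with the Laplacians and an elementwise thresholding, which is what makes the method scalable compared to nuclear-norm solvers requiring an SVD per iteration.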
• ℒ_c = QΛ_cQ⊤ = Q_{k_c}Λ_{k_c}Q⊤_{k_c} + Q̄_{k_c}Λ̄_{k_c}Q̄⊤_{k_c}, where Λ_{k_c} ∈ ℝ^{k_c×k_c} is the diagonal matrix of the lower graph eigenvalues and Λ̄_{k_c} ∈ ℝ^{(n−k_c)×(n−k_c)} is the diagonal matrix of the higher graph eigenvalues. All values in Λ_c are sorted in increasing order, thus λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_{k_c} ≤ ⋯ ≤ λ_{n−1}. The same holds for ℒ_r as well.

• For a K-nearest neighbors graph constructed from k_c-clusterable data (along the samples / columns) one can expect λ_{k_c}/λ_{k_c+1} ≈ 0, as λ_{k_c} ≈ 0 and λ_{k_c} ≪ λ_{k_c+1}. The same holds for the graph of features / rows ℒ_r as well.

• For the proof of the theorem, we will use the fact that for any X ∈ ℝ^{p×n} there exist A ∈ ℝ^{k_r×n} and Ā ∈ ℝ^{(p−k_r)×n} such that X = P_{k_r}A + P̄_{k_r}Ā, and B ∈ ℝ^{p×k_c} and B̄ ∈ ℝ^{p×(n−k_c)} such that X = BQ⊤_{k_c} + B̄Q̄⊤_{k_c}.

Theorem 2. Let Y* ∈ LR(P_{k_r}, Q_{k_c}), γ > 0, and E ∈ ℝ^{p×n}. Any solution X* ∈ ℝ^{p×n} of (17) with γ_c = γ/λ_{k_c+1}, γ_r = γ/λ_{k_r+1} and Y = Y* + E satisfies

φ(X* − Y) + γ‖X*Q̄_{k_c}‖²_F + γ‖P̄⊤_{k_r}X*‖²_F ≤ φ(E) + γ‖Y*‖²_F (λ_{k_c}/λ_{k_c+1} + λ_{k_r}/λ_{k_r+1}), (18)

where λ_{k_c}, λ_{k_c+1} denote the k_c-th and (k_c+1)-th eigenvalues of ℒ_c, and λ_{k_r}, λ_{k_r+1} the k_r-th and (k_r+1)-th eigenvalues of ℒ_r.

Proof. As X* is a solution of (17), we have

φ(X* − Y) + γ_c tr(X*ℒ_c(X*)⊤) + γ_r tr((X*)⊤ℒ_rX*) ≤ φ(E) + γ_c tr(Y*ℒ_c(Y*)⊤) + γ_r tr((Y*)⊤ℒ_rY*). (19)

Using the facts that ℒ_c = Q_{k_c}Λ_{k_c}Q⊤_{k_c} + Q̄_{k_c}Λ̄_{k_c}Q̄⊤_{k_c} and that there exist B ∈ ℝ^{p×k_c} and B̄ ∈ ℝ^{p×(n−k_c)} such that X* = BQ⊤_{k_c} + B̄Q̄⊤_{k_c}, we obtain

tr(X*ℒ_c(X*)⊤) = tr(BΛ_{k_c}B⊤) + tr(B̄Λ̄_{k_c}B̄⊤) ≥ tr(Λ̄_{k_c}B̄⊤B̄) ≥ λ_{k_c+1}‖B̄‖²_F = λ_{k_c+1}‖X*Q̄_{k_c}‖²_F.

Then, using the fact that there exists C ∈ ℝ^{k_r×k_c} such that Y* = P_{k_r}CQ⊤_{k_c}, we obtain

tr(Y*ℒ_c(Y*)⊤) = tr(CΛ_{k_c}C⊤) ≤ λ_{k_c}‖C‖²_F = λ_{k_c}‖Y*‖²_F.

Similarly, we have

tr((X*)⊤ℒ_rX*) ≥ λ_{k_r+1}‖P̄⊤_{k_r}X*‖²_F,  tr((Y*)⊤ℒ_rY*) ≤ λ_{k_r}‖Y*‖²_F.

Using the last four bounds in (19) yields

φ(X* − Y) + γ_cλ_{k_c+1}‖X*Q̄_{k_c}‖²_F + γ_rλ_{k_r+1}‖P̄⊤_{k_r}X*‖²_F ≤ φ(E) + γ_cλ_{k_c}‖Y*‖²_F + γ_rλ_{k_r}‖Y*‖²_F,

which becomes

φ(X* − Y) + γ‖X*Q̄_{k_c}‖²_F + γ‖P̄⊤_{k_r}X*‖²_F ≤ φ(E) + γ‖Y*‖²_F (λ_{k_c}/λ_{k_c+1} + λ_{k_r}/λ_{k_r+1})

for our choice of γ_c and γ_r. This terminates the proof.

6.3 Remarks on the theoretical analysis

(18) implies that

‖X*Q̄_{k_c}‖²_F + ‖P̄⊤_{k_r}X*‖²_F ≤ (1/γ)φ(E) + ‖Y*‖²_F (λ_{k_c}/λ_{k_c+1} + λ_{k_r}/λ_{k_r+1}).

The smaller ‖X*Q̄_{k_c}‖²_F + ‖P̄⊤_{k_r}X*‖²_F is, the closer X* is to LR(P_{k_r}, Q_{k_c}). The above bound shows that to recover a low-rank matrix one should have large eigengaps λ_{k_c+1} − λ_{k_c} and λ_{k_r+1} − λ_{k_r}. This occurs when the rows and the columns of Y can be clustered into k_r and k_c clusters. Furthermore, one should also try to choose a metric φ (or ℓp-norm) that minimizes φ(E). Clearly, the rank of X* is approximately min{k_r, k_c}.

This section constitutes a more formal and theoretical discussion of FRPCAG. We base our explanations on two arguments:

1. FRPCAG penalizes the singular values of the data matrix. This can be viewed in two ways: 1) in the data domain, via an SVD analysis, and 2) in the graph Fourier domain.

2. FRPCAG is a subspace rotation / alignment method, where the left and right singular vectors of the resulting low-rank matrix are aligned with the singular vectors of the clean data.

In order to demonstrate how FRPCAG penalizes the singular values of the data, we study another way to handle the graph regularization in the solution of the optimization problem, contrary to the one presented in Section 6.1.
In Section 6.1 we computed the gradient of the graph regularization terms γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX) and passed it as an argument to the proximal operator for the soft-thresholding. What we did not point out there is that the graph regularizations can also be handled via their own proximal operators; we avoided this because, as we show here, the corresponding proximal operators are computationally more expensive. Assume that the prox of γ_c tr(Xℒ_cX⊤) is computed first, and let Z be a temporary variable; then it can be written as

min_Z ‖Y − Z‖²_F + γ_c tr(Zℒ_cZ⊤).

The above problem has the closed form solution

Z = Y(I + γ_cℒ_c)^{−1}.

Now, compute the proximal operator for the term γ_r tr(X⊤ℒ_rX):

min_X ‖Z − X‖²_F + γ_r tr(X⊤ℒ_rX),

whose closed form solution is

X = (I + γ_rℒ_r)^{−1}Z.

Thus, the low-rank X can be written as

X = (I + γ_rℒ_r)^{−1}Y(I + γ_cℒ_c)^{−1},

after which the soft-thresholding can be applied to X. Assume now that the data Y is stationary, with SVD Y = U_yΣ_yV_y⊤, and let ℒ_c = QΛ_cQ⊤ and ℒ_r = PΛ_rP⊤; then we get

X = (I + γ_rPΛ_rP⊤)^{−1}U_yΣ_yV_y⊤(I + γ_cQΛ_cQ⊤)^{−1} = P(I + γ_rΛ_r)^{−1}P⊤U_yΣ_yV_y⊤Q(I + γ_cΛ_c)^{−1}Q⊤;

thus, the singular values Σ_y of Y are penalized by the factors 1/((1 + γ_cλ_c)(1 + γ_rλ_r)). Clearly, the above solution requires the computation of two matrix inverses, which can be computationally intractable for big datasets.

The singular value thresholding effect is also shown in Fig. 6, where the green vectors (scaled by their singular values), as learned via our model, tend to shrink with an increasing regularization parameter.

Another way to analyse FRPCAG is to look at the penalization γ_c tr(Xℒ_cX⊤) + γ_r tr(X⊤ℒ_rX) in the spectral graph Fourier domain.
As stated in Subsection 3.1.5, it is possible to rewrite it as a spectral penalization: γ c tr( X L c X (cid:62) ) + γ r tr( X (cid:62) L r X ) = γ c (cid:107) Λ c ˆ X (cid:107) F + γ r (cid:107) Λ r ˆ X (cid:62) (cid:107) F . (20)The signal is thus penalized in the spectral graph domain with a weight proportional to the squared root of the graph eigenvalue.As a result, the signal is pushed towards the lowest frequencies of both graphs, enforcing its low-rank structure. This interpre-tation allows us to clearly identify the role of the regularization constant γ r and γ c . With big constants γ c and γ r , the resultingsignal is closer to low rank but we also attenuate the content inside the band as illustrated in Figure 6. Let X = U Σ V (cid:62) be the SVD of X and suppose now that we minimize w.r.t factors U, Σ , V instead of X . Then we have γ c tr( X L c X (cid:62) ) + γ r tr( X (cid:62) L r X ) = γ c tr( U Σ V (cid:62) Q Λ c Q (cid:62) V Σ U (cid:62) ) + γ r tr( V Σ U (cid:62) P Λ r P (cid:62) U Σ V (cid:62) )= γ c tr(Σ V (cid:62) Q Λ c Q (cid:62) V Σ) + γ r tr(Σ U (cid:62) P Λ r P (cid:62) U Σ) (21) = min { n,p } (cid:88) i,j =1 σ i ( γ c λ cj ( v (cid:62) i q j ) + γ r λ rj ( u (cid:62) i p j ) )= min { n,p } (cid:88) i =1 σ i γ c n (cid:88) j =1 λ cj ( v (cid:62) i q j ) + γ r p (cid:88) j =1 λ rj ( u (cid:62) i p j ) , where λ cj and λ rj are the eigenvalues in the matrices Λ c and Λ r respectively. The second step follows from V (cid:62) V = I and U (cid:62) U = I and the cyclic permutation invariance of the trace. In the standard terminology v i and u i are the principal componentsand principal directions of of the low-rank matrix X . From the above expression, the minimization is carried out with respect tothe singular values σ i and the singular vectors v i , u i . The minimization has the following effect:1. Minimize σ i by performing a penalization with the graph eigenvalues as explained earlier.2. 
When σ_i is big, the principal components v_i are more well aligned with the graph eigenvectors q_j for small values of λ_{cj}, i.e, the lower graph frequencies of L_c, as compared to the q_j for higher λ_{cj}. The principal directions u_i are also more aligned with the graph eigenvectors p_j for small values of λ_{rj}, i.e, the lower graph frequencies of L_r, as compared to the higher frequencies. This alignment makes sense, as the higher eigenvalues correspond to the higher graph frequencies which constitute the noise in the data.

.4.4 FRPCAG is a subspace rotation method

The coherency of the principal directions and components with their respective graph eigenvectors implicitly promotes the coherency of these vectors with the principal components and directions of the clean data. This is of course due to the fact that we want X* ∈ LR(P_{k_r}, Q_{k_c}), assuming X* is the optimal solution. Note that this implication of FRPCAG is not different from the motivation of Section 5.1, as the eigenvectors of C_r are the principal directions of Y and the eigenvectors of C_c are the principal components of Y. Thus, the parameters γ_c and γ_r act as a force which pushes the singular vectors (principal directions and components) of the low-rank matrix away from those of the noisy data and towards the singular vectors of the clean data. This effect is shown in Fig. 6. We call this the subspace rotation / alignment effect, and it is the key feature of our graph based low-rank matrix approximation.
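The expansion in eq. (21) is easy to check numerically. A minimal sketch with random toy Laplacians (the helper names and sizes are our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 6, 7
X = rng.standard_normal((p, n))

def rand_laplacian(m, rng):
    """Combinatorial Laplacian of a random dense weighted graph (toy example)."""
    A = rng.random((m, m))
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 0.0)
    return np.diag(A.sum(axis=1)) - A

L_r, L_c = rand_laplacian(p, rng), rand_laplacian(n, rng)
gamma_r, gamma_c = 0.7, 1.3

# Left-hand side of (21): the two trace penalties.
lhs = gamma_c * np.trace(X @ L_c @ X.T) + gamma_r * np.trace(X.T @ L_r @ X)

# Right-hand side: expansion in singular values and graph eigenpair alignments.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_c, Q = np.linalg.eigh(L_c)
lam_r, P = np.linalg.eigh(L_r)
rhs = sum(s[i] ** 2 * (gamma_c * np.sum(lam_c * (Vt[i] @ Q) ** 2)
                       + gamma_r * np.sum(lam_r * (U[:, i] @ P) ** 2))
          for i in range(len(s)))
```

Up to floating-point error, `lhs` and `rhs` agree, confirming that the penalty decomposes over the singular values σ_i weighted by the alignments (v_i^⊤ q_j)² and (u_i^⊤ p_j)².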
Figure 6: The alignment of the singular vectors with the singular vectors of the clean data and the corresponding shrinking of the singular values with the increasing graph regularization parameter. The first row shows the original and noisy data, where the blue vectors show the first two singular vectors of the clean data and the red ones those of the noisy data. The second row shows how the singular vectors (green) learned with FRPCAG (14) are aligned with the singular vectors of the clean data. All the singular vectors are scaled by their singular values. Too much regularization shrinks the singular vectors.

.4.5 FRPCAG is a weighted rank-k alignment order maximizer
Eq. (21) can be further written as:

= γ_c tr(Q Λ_c Q^⊤ V Σ² V^⊤) + γ_r tr(P Λ_r P^⊤ U Σ² U^⊤)
= γ_c tr(Q Λ_c Q^⊤ C_c) + γ_r tr(P Λ_r P^⊤ C_r)
= γ_c tr(Λ_c Q^⊤ C_c Q) + γ_r tr(Λ_r P^⊤ C_r P)
= γ_c tr(Λ_c Γ_c) + γ_r tr(Λ_r Γ_r)
= γ_c ∑_i λ_{ci} Γ_{c,ii} + γ_r ∑_j λ_{rj} Γ_{r,jj},

where C_c, C_r are the sample and feature covariance matrices and Γ_r, Γ_c are the feature and sample alignment orders as defined in Section 5.1. Assuming the entries of Λ_r, Λ_c are sorted in increasing order, the last equation above corresponds to a weighted minimization of the diagonal entries of Γ_c and Γ_r. Thus, the lower diagonal entries undergo more penalization than the top ones. Now assume that there is a spectral gap, i.e, λ_{c,k_c} ≪ λ_{c,k_c+1} and λ_{r,k_r} ≪ λ_{r,k_r+1}; then the minimization above leads to a higher penalization of all the entries beyond k_r in Γ_r and k_c in Γ_c. This automatically leads to a maximization of the rank-k alignment orders ŝ_r(Γ_r), ŝ_c(Γ_c) (eq. (9)).

In the biclustering problem, one seeks to simultaneously cluster samples and features. Biclustering has applications in a wide variety of domains, ranging from text mining to collaborative filtering [39]. Very briefly, a typical biclustering method reveals:
1. the clusters across the rows and columns of a data matrix.
2. the checkerboard pattern hidden in a noisy dataset.
The theoretical analysis of FRPCAG reveals that the method can recover a low-rank representation for a dataset Y that is (k_r, k_c)-clusterable across its rows / features and columns / samples. This is not strange, as we show in the next subsection that FRPCAG is able to successfully recover meaningful clusters not only across samples but across features as well.
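The weighted trace tr(Λ Γ) minimized above can be sketched directly. In this toy example (a complete graph and a covariance spanned by its first k eigenvectors; all names are our own), the alignment order Γ is exactly the identity on its first k diagonal entries and zero beyond, so the penalty reduces to the sum of the first k graph eigenvalues:

```python
import numpy as np

def alignment_order(L, C):
    """Diagonal of Gamma = Q^T C Q: how much of the covariance C lies
    along each eigenvector of the Laplacian L (sorted by frequency)."""
    lam, Q = np.linalg.eigh(L)
    Gamma = Q.T @ C @ Q
    return lam, np.diag(Gamma)

n, k = 8, 3
A = np.ones((n, n)) - np.eye(n)        # complete graph on n nodes
L = np.diag(A.sum(axis=1)) - A
lam, Q = np.linalg.eigh(L)
C = Q[:, :k] @ Q[:, :k].T              # covariance spanned by low frequencies
lam_out, gamma_diag = alignment_order(L, C)
penalty = np.sum(lam_out * gamma_diag)  # = tr(Lambda * Gamma) = tr(L C)
```

A covariance concentrated on the low graph frequencies therefore pays only the small leading eigenvalues, which is exactly why the minimization favours (k_r, k_c)-clusterable data.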
“We justify the singular value penalization, singular vector alignment, low-rankness and bi-clustering effects of FRPCAG with a few artificial and real world datasets.”

We presented several different implications of FRPCAG in the previous section. These implications are justified with some artificial and real world datasets in this section.

We generate a random low-rank matrix (rank = 10) and construct a 10-nearest neighbors graph between its rows G_r and columns G_c. Let L_c = Q Λ_c Q^⊤, L_r = P Λ_r P^⊤. Then, we generate a noiseless data matrix Ȳ as P̄ R Q̄^⊤, where P̄, Q̄ are the first 10 vectors in P, Q and R is a random Gaussian matrix. Then, we corrupt Ȳ with Gaussian noise N(0, σ²I) to get Y. Finally, we perform FRPCAG on the noisy Y to recover the low-rank matrix for three different settings of the regularization parameters γ_r = γ_c.

Let X = U Σ V^⊤ be the SVD of the low-rank X and Ȳ = Ū Σ̄ V̄^⊤ be the SVD of the clean data. To study the error of approximation ‖Ȳ − X‖_F / ‖Ȳ‖_F, we plot the low-rank X, Γ_r, Γ_c, the alignment orders s_r(Γ_r), s_c(Γ_c) and rank-k alignments ŝ_r(Γ_r), ŝ_c(Γ_c) (k = 10), the singular values Σ of X, and the coherency between (U, Ū), (V, V̄) in Fig. 7. The following conclusions can be drawn:
1. The lowest error occurs for γ_r = γ_c = 1 (it is possible to get a lower error with finer parameter tuning, but the purpose here is to study the effect of regularization only).
2. For this parameter setting, the singular values Σ decay to 0 after 10.
3. An interesting observation is the coherency between (U, Ū) and (V, V̄) for increasing regularization. Clearly, for small parameters, these pairs are incoherent. The leading singular vectors U and V are aligned with the corresponding Ū and V̄ for the best parameters γ_c = γ_r = 1. The coherency of the pairs of singular vectors decreases again for higher parameters, when the thresholding of the singular values is more than required.
The rank-k alignment order also shows a very high degree of alignment between the graph eigenvectors P, Q and the covariance matrices C_r, C_c. For the best parameters γ_r = γ_c = 1, there is a strong alignment between the leading singular vectors of X and those of the clean data; only the thresholding of the last retained singular value makes it slightly lower than that of the clean data Ȳ, which results in a mis-alignment of the corresponding singular vector.

The purpose of this example is to show the following effects of FRPCAG:
1. The model recovers a close-to-low-rank representation.
2. The principal components V and principal directions U of X (assuming X = U Σ V^⊤) align with the first few eigenvectors of their respective graphs, automatically revealing a low-rank and enhanced class structure.
3. The singular values of the low-rank matrix obtained using our model closely approximate those obtained by nuclear norm based models, even in the presence of corruptions.
Our justification relies mostly on the quality of the singular values of the low-rank representation and the alignment of the singular vectors with their respective graphs.

We perform an experiment with 1000 samples of the MNIST dataset belonging to two different classes (digits 0 and 1). We vectorize all the digits and form a data matrix Y whose columns contain the digits. Then we compute a graph of samples between the columns of Y and a graph of features between the rows of Y, as mentioned in Section 3. We determine the clean low-rank X by solving model (14) and perform one SVD at the end: X = U Σ V^⊤. Finally, we do the clustering by performing k-means (k = 2) on the low-rank X. As argued in [18], if the data is arranged according to the classes, the matrix V V^⊤ (where V are the principal components of the data) reveals the subspace structure. The matrix V V^⊤ is also known as the shape interaction matrix (SIM) [40]. If the subspaces are orthogonal, then the SIM acquires a block diagonal structure.
Furthermore, our model (14) tends to align the first few principal components v_i and principal directions u_i of X to the first few eigenvectors q_j and p_j of L_c and L_r respectively, for larger σ_i. Thus, it is interesting to observe the matrices Σ V^⊤ Q and Σ U^⊤ P, scaled with the singular values Σ of the low-rank matrix X, as justified by eq. (21). This scaling takes into account the importance of the eigenvectors that are associated to bigger singular values.

Fig. 8 plots the matrix V V^⊤, the corresponding clustering error, and the matrices Σ, Σ V^⊤ Q and Σ U^⊤ P for different values of γ_r and γ_c from left to right. Increasing γ_r and γ_c from 1 to 30 leads to 1) the penalization of the singular values in Σ, resulting in a lower rank, 2) the alignment of the first few principal components v_i and principal directions u_i in the direction of the first few eigenvectors q_j and p_j of L_c and L_r respectively, 3) an enhanced subspace structure in V V^⊤, and 4) a lower clustering error. Together, the two graphs help in acquiring a low-rank structure that is suitable for clustering applications as well.
Figure 7: A study of the error of approximation ‖Ȳ − X‖_F / ‖Ȳ‖_F for data generated from graph eigenvectors. We plot the low-rank X, the alignment orders s_r(Γ_r), s_c(Γ_c) and rank-k alignments ŝ_r(Γ_r), ŝ_c(Γ_c) (k = 10), the singular values Σ of X, and the coherency between (U, Ū), (V, V̄).

Figure 8: The matrices
V V^⊤, Σ, Σ V^⊤ Q, Σ U^⊤ P and the corresponding clustering errors obtained for different values of the weights on the two graph regularization terms for 1000 samples of the MNIST dataset (digits 0 and 1). If X = U Σ V^⊤ is the SVD of X, then V corresponds to the matrix of principal components (right singular vectors of X) and U to the principal directions (left singular vectors of X). Let L_c = Q Λ_c Q^⊤ and L_r = P Λ_r P^⊤ be the eigenvalue decompositions of L_c and L_r respectively; then Q and P correspond to the eigenvectors of the Laplacians L_c and L_r. The block diagonal structure of V V^⊤ becomes more clear by increasing γ_r and γ_c, with a thresholding of the singular values in Σ. Further, the sparse structures of Σ V^⊤ Q and Σ U^⊤ P towards the rightmost corners show that the number of left and right singular vectors (weighted by singular values) which align with the eigenvectors of the Laplacians L_r and L_c goes on decreasing with increasing γ_r and γ_c. This shows that the two graphs help in attaining a low-rank structure with a low clustering error.

Figure 9: A comparison of the singular values of the low-rank matrix obtained via our model, RPCA and RPCAG. The experiments were performed on the ORL dataset with different levels of block occlusions. The parameters corresponding to the minimum validation clustering error for each of the models were used.

Next, we demonstrate that for data with or without corruptions, FRPCAG is able to acquire singular values as good as the nuclear norm based models, RPCA and RPCAG. We perform three clustering experiments on 30 classes of the ORL dataset with no block occlusions, 15% block occlusions and 25% block occlusions. Fig. 9 presents a comparison of the singular values of the original data with the singular values of the low-rank matrix obtained by solving RPCA, RPCAG and our model.
The parameters for all the models are selected corresponding to the lowest clustering error for each model. It is straightforward to conclude that the singular values of the low-rank representation obtained using our fast method closely approximate those of the nuclear norm based models, irrespective of the level of corruptions.
We perform a small experiment with 50 faces from the ORL dataset, corresponding to 5 different classes. We take the samples corresponding to the 5 different faces, pre-process the dataset to zero mean across the features and run FRPCAG with γ_r = γ_c = 3. Then, we cluster the low-rank X into 5 clusters (the number of classes in ORL) across the columns and 5 clusters across the features. While the clustering across samples / columns has a straightforward implication and has already been depicted in the previous subsection, we discuss the feature clustering in detail here. Fig. 10 shows the 5 clusters of the features of the ORL dataset for one face of each of the 5 different types of faces. It is interesting to note that the clusters across the features correspond to very localized regions of the faces. For example, the first cluster (first row of Fig. 10) corresponds to the region of eyes and upper lips, while the fifth cluster corresponds mostly to the smooth portion of the face (cheeks and chin). Of course, as FRPCAG is a low-rank recovery method which exploits the similarity information of the rows and columns, it also reveals the checkerboard pattern of the ORL dataset. This checkerboard pattern is visible in Fig. 11; it becomes more visible for larger regularization parameters γ_r and γ_c.

Figure 10: The clustering of features for the ORL dataset. The top five rows show the five different feature clusters across five different faces. The last row shows the actual face, which is the sum of all the feature clusters.

Figure 11: The checkerboard pattern of the low-rank representation of the ORL dataset, obtained by FRPCAG.

Generalized Fast Robust PCA on Graphs (GFRPCAG)

“FRPCAG suffers from uncontrolled penalization of the higher singular values by the lower graph eigenvalues, the Manifold Shrinking Effect.
The approximation can be improved by deterministic graph filters which shape the graph spectra to favour a reasonable penalization of the data singular values.”

In principle, model (14), which is a special case of (13), has some connections with the state-of-the-art nuclear norm based low-rank recovery method RPCA [2]. RPCA determines a low-rank representation by:
1. Penalizing the singular values by a model parameter.
2. Determining the clean left and right singular vectors iteratively from the data by performing an SVD in every iteration.
We first briefly discuss what makes our model an approximate recovery as compared to an exact recovery, study what the approximation error is composed of, and then use these arguments to motivate a more general model. We say that our model (14) has connections with RPCA because it performs the same two steps, but in a different way, as explained in the previous section. In our model:
1. The singular values are penalized by the graph eigenvalues. Therefore, one does not have direct control on this operation, as the thresholding depends on the graph spectrum.
2. The left and right singular vectors are constrained to be aligned with the low frequency eigenvectors of the row and column graphs. Thus, we take away the freedom of the singular vectors. In fact, this is the primary reason our method does not need an SVD and scales very well for big datasets.
There are two main reasons for the approximation error, a glimpse of which has already been given above. We now discuss these two reasons in detail:
Ideally, one would like the singular values corresponding to the high frequencies of the data matrix (smaller in magnitude) to be attenuated more, because they correspond to noise and errors in the data. Furthermore, the singular values corresponding to the lower frequencies, which constitute most of the data variance, should be attenuated less. Such an operation would require the graph spectrum to be a step function, where the position of the step depends on the rank of the matrix. For the readers who went through the theoretical details: from eq. (18) one can see that the error of the optimal low-rank X* depends on the two ratios λ_{k_c}/λ_{k_c+1} and λ_{k_r}/λ_{k_r+1}. These ratios are known as the spectral gaps of the graphs L_c and L_r respectively. In practice, the graphs constructed via the K-nearest neighbors strategy result in a smoothly increasing spectrum. Therefore, these ratios are never small enough. As a consequence, the undesired singular values (smaller in magnitude) are not attenuated enough and they are not exactly zero. If one forces a higher attenuation by increasing the constants γ_c and γ_r, then the desired singular values are penalized more than required. This results in a shrinking effect of the manifold, as shown in Fig. 12. In this figure, the first row shows the recovery of the manifold with increasing values of the graph regularization parameter in FRPCAG (with one graph only) and the second row shows the singular values. Clearly, the 3rd singular value does not become zero with higher penalization, which degrades the desired singular values, and the manifold begins to shrink. This calls for a procedure to manually threshold the singular values beyond a certain penalization.

Figure 12: The manifold shrinking effect of a 2D manifold corrupted with noise. The first row shows the recovery of the manifold with increasing values of FRPCAG (with one graph only) and the second row shows the singular values.
Clearly, the 3rd singular value does not become zero with higher penalization, which degrades the lower singular values, and the manifold begins to shrink.

We do not allow the left and right singular vectors of the low-rank matrix to be determined freely via an SVD. This results in a slight mis-alignment of the clean singular vectors of the resultant low-rank matrix as compared to the singular vectors of the original clean data. Note that the alignment occurs as a result of a force acting on the singular vectors which depends on the regularization parameters. While increasing this force results in more alignment, the downside is the shrinking effect of the singular values. Thus, a good parameter setting might still leave some alignment gap in order not to threshold the singular values more than required.

These two effects with the increasing values of the graph regularization parameter for FRPCAG have been illustrated in Fig. 6. Here we show this effect for only one subspace (the left singular vectors); a similar phenomenon takes place for the right singular vectors as well. For very small values of the regularization parameter, the singular vectors learned via FRPCAG are still aligned with the singular vectors of the noisy data and the thresholding on the higher singular values is not enough to remove the noise. With increasing regularization, the singular vectors start aligning and the higher singular values are thresholded more. Very high values of the regularization parameter might still provide a good alignment of the singular vectors, but the singular values might be thresholded more than required, resulting in the shrinking effect. Thus, beyond a certain point, the increase of the parameters contributes to higher errors due to the excessive shrinking of the singular values.
The thresholding of the singular values is governed by the graph spectra, as also pointed out in Theorem 2. While the subspace rotation effect is an implicit attribute of our graph based low-rank recovery method and cannot be improved trivially, we can do something to counter the shrinking effect and improve the approximation. A straightforward method to improve the approximation would be to have a better control on the singular values by manipulating the graph spectra. To do so, it would be wise to define functions of the graph spectrum which make the spectral gap as large as possible. In simpler words, we need a family of functions g_b, characterized by some parameter b, which force the graph filtering operation to follow a step behavior. More specifically, we would like to solve the following generalized problem instead of (12):

min_X Φ(X − Y) + γ_c tr(X g_c(L_c) X^⊤) + γ_r tr(X^⊤ g_r(L_r) X). (22)

The term tr(X^⊤ g(L) X) = ‖g^{1/2}(Λ) X̂‖_F² allows for a better control of the graph spectral penalization. For our specific problem, we would like to have the lowest frequencies untouched and the highest frequencies strongly penalized. In order to do so, we propose the following design. Let us first define a function

h_b(x) = { e^{−b/(x−b)} if x > b; 0 otherwise }.

Please note that this function is differentiable at all points. We then define g_b as

g_b(x) = h_b(x) / h_b(2b − x).

This new function has a few important properties. It is equal to 0 for all values smaller than b, and then grows; it is infinite for all values above b. As a result, choosing the function √g_b in our regularization term has the following effect: it preserves the frequency content for λ_ℓ ≪ b and cuts the high frequencies λ_ℓ ≫ b. The parameter b acts as a frequency limit for our function. See Fig. 13.
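The functions h_b, g_b and the induced filter f_b can be sketched as below. Note that the exact form of h_b is our reconstruction of the garbled definition, so treat it as an assumption; with this reading the filter f_b(x, γ) = h_b(2b − x) / (h_b(2b − x) + γ h_b(x)) passes everything below the frequency limit b and kills everything above it:

```python
import numpy as np

def h(x, b):
    """Reconstructed transition function: exp(-b/(x-b)) for x > b, 0 otherwise.
    All derivatives vanish as x -> b from above, so the two pieces join smoothly."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    m = x > b
    out[m] = np.exp(-b / (x[m] - b))
    return out

def f(x, b, gamma):
    """Filter f_b(x, gamma) = h(2b - x) / (h(2b - x) + gamma * h(x)):
    equal to 1 below the frequency limit b and to 0 above it."""
    num = h(2 * b - np.asarray(x, dtype=float), b)
    den = num + gamma * h(x, b)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

lam = np.array([0.1, 0.4, 1.6, 3.0])  # toy graph eigenvalues (our own choice)
vals = f(lam, b=1.0, gamma=2.0)       # keeps the low band, cuts the high band
```

In practice (and in the polynomial-approximation argument below), f_b would be evaluated on the Laplacian spectrum via a Chebyshev polynomial rather than pointwise on explicit eigenvalues.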
Since we solve our problem with a proximal splitting algorithm, we also need to compute the proximal operator of tr(X^⊤ g_b(L) X) = ‖g_b^{1/2}(Λ) X̂‖_F². The computation is straightforward and explains the inner mechanism of such a regularization term. We first compute the gradient of

argmin_x ‖x − y‖₂² + γ ‖g_b^{1/2}(Λ) Q^⊤ x‖₂²

and set it to 0:

2(x − y) + 2γ Q g_b(Λ) Q^⊤ x = 0.

We then rewrite this expression in the spectral domain:

x̂(ℓ) − ŷ(ℓ) + γ g_b(λ_ℓ) x̂(ℓ) = 0, ∀ ℓ ∈ {0, …, n − 1}.

Shuffling the equation leads to the solution:

x̂(ℓ) = ŷ(ℓ) / (1 + γ g_b(λ_ℓ)), ∀ ℓ ∈ {0, …, n − 1}. (23)

Equation (23) shows that the proximal operator of tr(X^⊤ g_b(L) X) is a graph spectral filtering with

f_b(x, γ) = 1 / (1 + γ g_b(x)) = h_b(2b − x) / (h_b(2b − x) + γ h_b(x)).

In fact, the filter f_b is infinitely many times differentiable and is smooth enough to be easily approximated by a low order polynomial. We observe that f_b(·, γ) is a low pass filter with bandwidth b. The computational cost of a filtering operation grows with the number of edges e and the order of the polynomial o: O(oe) [41]. In Fig. 13, we show an example of the different aforementioned functions.

Figure 13: Example of the different functions for a fixed b.

In order to solve our problem, we use a forward-backward based primal dual algorithm for GFRPCAG [42], which can be cast into the following form:

argmin_x a(x) + b(x) + c(x). (24)

Our goal is to minimize the sum a(x) + b(x) + c(x), which is done efficiently with proximal splitting methods [38, 42]. We set a(X) = γ_r tr(X^⊤ L_r X), b(X) = γ_c tr(X g_b(L_c) X^⊤) and c(X) = ‖Y − X‖₁. The gradient of a is ∇a(X) = 2γ_r L_r X. Note that here we assume the presence of a deterministic filter only on the graph of columns.
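The prox-as-filtering computation of eq. (23) can be sketched directly. For a small graph the eigendecomposition is affordable, so we can filter in the spectral domain and verify the stationarity condition of the prox; the ring-graph Laplacian and the choice g(x) = x (plain Tikhonov case) are our own toy choices:

```python
import numpy as np

def prox_spectral(y, L, gamma, g):
    """Prox of gamma * y^T g(L) y via graph spectral filtering (eq. 23):
    x_hat(l) = y_hat(l) / (1 + gamma * g(lambda_l))."""
    lam, Q = np.linalg.eigh(L)
    y_hat = Q.T @ y                        # graph Fourier transform
    x_hat = y_hat / (1.0 + gamma * g(lam)) # pointwise spectral attenuation
    return Q @ x_hat                       # inverse transform

# Ring-graph Laplacian on n nodes.
n, gamma = 6, 0.8
A = np.roll(np.eye(n), 1, axis=0) + np.roll(np.eye(n), -1, axis=0)
L = np.diag(A.sum(axis=1)) - A
y = np.random.default_rng(2).standard_normal(n)
x = prox_spectral(y, L, gamma, lambda lam: lam)
# For g(x) = x, the prox minimizes ||x - y||^2 + gamma * x^T L x, so the
# stationarity condition (x - y) + gamma * L x = 0 must hold exactly.
```

For large graphs one would replace the eigendecomposition by a Chebyshev polynomial approximation of the filter, which is the O(oe) operation mentioned above.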
We define an upper bound on the Lipschitz constant β as β ≤ β′ = 2γ_r ‖L_r‖. The proximal operator of the ℓ₁ term c is the elementwise soft-thresholding

prox_{λc}(X) = Y + sgn(X − Y) ∘ max(|X − Y| − λ, 0), (25)

where ∘ is the Hadamard product. Finally, from (23), the proximal operator of the function b is given by a graph spectral filtering. The forward-backward primal dual [42] Algorithm 2 can now be stated, where τ₁, τ₂, τ₃ are the time-steps, ε the stopping tolerance and J the maximum number of iterations. δ is a very small number to avoid a possible division by 0.
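Eq. (25) is the standard elementwise soft-thresholding, shifted so that entries of X close to Y are pulled exactly onto Y. A minimal sketch:

```python
import numpy as np

def prox_l1(X, Y, lam):
    """Prox of lam * ||X - Y||_1 evaluated at X (eq. 25):
    soft-threshold the residual X - Y and add Y back."""
    D = X - Y
    return Y + np.sign(D) * np.maximum(np.abs(D) - lam, 0.0)

# Entries of X - Y smaller than lam in magnitude are pulled back onto Y;
# larger entries are shrunk towards Y by lam.
out = prox_l1(np.array([3.0, -0.5, 1.2]), np.zeros(3), 1.0)  # ≈ [2.0, 0.0, 0.2]
```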
10 When does the dual-graph filtering make sense?
In this final section of this extensive work, we discuss some important implications of our work and its connections with the state-of-the-art that uses dual graph based filtering. Our dual graph filtering approach makes sense for a number of real world applications mentioned throughout the paper, as demonstrated experimentally in the results section of this work. In general, the use of dual-graph filtering should help in a broad range of clustering applications. However, we caution that this approach cannot be adopted as a rule-of-thumb in any graph based convex optimization problem, especially if one wants to target an exact low-rank recovery. Interestingly, we point out a few examples from the state-of-the-art which have close connections with our approach and justify why our model is different from those works.

Algorithm 2
Forward-backward primal dual for GFRPCAG with a filter on the column graph only.
INPUT: X₀ = Y, V₀ = Y, ε > 0
for j = 0, …, J − 1 do
    P_j = prox_{τ₁ c}(X_j − τ₁(∇a(X_j) + V_j))
    T_j = V_j + τ₂(2P_j − X_j)
    Q_j = T_j − τ₂ prox_{b/τ₂}(T_j / τ₂)
    (X_{j+1}, V_{j+1}) = (X_j, V_j) + τ₃((P_j, Q_j) − (X_j, V_j))
    if ‖X_{j+1} − X_j‖_F / (‖X_j‖_F + δ) < ε and ‖V_{j+1} − V_j‖_F / (‖V_j‖_F + δ) < ε then
        BREAK
    end if
end for
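The iteration of Algorithm 2 can be sketched generically as below. The toy problem used as a check (b ≡ 0, so the scheme collapses to relaxed proximal gradient descent on a lasso-type objective with a known minimizer) and all step sizes are our own illustrative choices, not the paper's:

```python
import numpy as np

def soft(u, t):
    """Elementwise soft-thresholding (prox of t * ||.||_1)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fbpd(x0, grad_a, prox_c, prox_b, tau1, tau2, tau3,
         eps=1e-9, J=5000, delta=1e-12):
    """Forward-backward primal dual in the shape of Algorithm 2:
    a smooth (gradient step), c proxable (primal prox), b proxable (dual prox
    via the Moreau decomposition in the q-update)."""
    x, v = x0.copy(), x0.copy()
    for _ in range(J):
        p = prox_c(x - tau1 * (grad_a(x) + v), tau1)
        t = v + tau2 * (2.0 * p - x)
        q = t - tau2 * prox_b(t / tau2, 1.0 / tau2)
        x_new = x + tau3 * (p - x)
        v_new = v + tau3 * (q - v)
        stop = (np.linalg.norm(x_new - x) / (np.linalg.norm(x) + delta) < eps
                and np.linalg.norm(v_new - v) / (np.linalg.norm(v) + delta) < eps)
        x, v = x_new, v_new
        if stop:
            break
    return x

# Toy check: with b = 0 (dual prox is the identity, so the dual variable
# vanishes) the scheme reduces to proximal gradient descent on
# 0.5*||x - y||^2 + lam*||x||_1, whose minimizer is the soft-thresholded y.
y = np.array([2.0, -0.3, 1.0])
lam = 0.5
x = fbpd(np.zeros(3),
         grad_a=lambda x: x - y,
         prox_c=lambda u, t: soft(u, t * lam),
         prox_b=lambda u, t: u,
         tau1=0.5, tau2=0.5, tau3=0.9)
```

In GFRPCAG itself, `grad_a` would be X ↦ 2γ_r L_r X, `prox_c` the shifted soft-thresholding of eq. (25), and `prox_b` the graph spectral filtering of eq. (23).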
The collaborative filtering approach is commonly used in recommendation systems, where one uses a notion of graph based similarity between the users and the items to perform an effective recommendation. The authors of [43] use an NMF based approach with dual graph filtering in their proposed recommendation system. Similarly, the authors of [29] propose a hybrid recommendation with nuclear-norm based exact recovery and dual-graph filtering. While the authors of [29] have shown that their proposed dual-graph filtering approach outperforms the state-of-the-art methods when the number of measurements is very small, their method does not improve upon the exact recovery of the nuclear-norm based approach [44]. Thus, it makes sense to use the dual-graph filtering to reduce the error of approximation in the low-rank representation, but it does not ensure an exact recovery if a standard exact recovery mechanism (like the nuclear norm) is not available.
Our previous work [14], which uses a nuclear norm based exact recovery framework with graph filtering, has also shown that the use of graph filtering does not improve upon the exact recovery conditions of the standard RPCA framework [2]. In fact, as shown in the rank-sparsity plots in [14], the graph based filtering reduces the error over RPCA [2] for various rank-sparsity pairs, but it does not guarantee an exact recovery in the region where the nuclear norm based recovery fails.

The conclusions drawn from the above two state-of-the-art techniques are exactly in accordance with the motivation of this work. Our motivation was not to design a framework that returns an exact low-rank representation, but to present a system that works well under a broad range of conditions, in contrast to those required for exact recovery [2]. Thus, we target an approximate and fast recovery under a broad range of conditions and data types. This is exactly why our proposed framework, although apparently very closely related to [43] & [29], is entirely different from these works. The recovery of our method totally depends on the quality of the information provided by the graphs. Moreover, our framework relies on the assumption that the data matrices are dual band-limited on two graphs. Although such an assumption is implicit in [43] & [29], the authors do not discuss it theoretically or experimentally in their works. Furthermore, RPCAG [14] uses only one graph. Thus, the assumption behind this model is totally different from our proposed framework.