Graph topology inference based on sparsifying transform learning
Stefania Sardellitti, Member, IEEE, Sergio Barbarossa, Fellow, IEEE, and Paolo Di Lorenzo, Member, IEEE

The authors are with the Department of Information Engineering, Electronics, and Telecommunications, Sapienza University of Rome, Via Eudossiana 18, 00184, Rome, Italy. E-mails: {stefania.sardellitti; sergio.barbarossa; paolo.dilorenzo}@uniroma1.it. This work has been supported by the H2020 EUJ Project 5G-MiEdge, Nr. 723171. This paper was presented in part at the IEEE Global Conference on Signal and Information Processing (GlobalSIP), USA, Dec. 7-9, 2016.
Abstract—Graph-based representations play a key role in machine learning. The fundamental step in these representations is the association of a graph structure to a dataset. In this paper, we propose a method that aims at finding a block-sparse representation of the graph signal, leading to a modular graph whose Laplacian matrix admits the learned dictionary as its eigenvectors. The role of sparsity here is to induce a band-limited representation or, equivalently, a modular structure of the graph. The proposed strategy is composed of two optimization steps: i) learning an orthonormal sparsifying transform from the data; ii) recovering the Laplacian, and then the topology, from the transform. The first step is achieved through an iterative algorithm whose alternating intermediate solutions are expressed in closed form. The second step recovers the Laplacian matrix from the sparsifying transform through a convex optimization method. Numerical results corroborate the effectiveness of the proposed methods over both synthetic data and real brain data, used for inferring the brain functionality network through experiments conducted over patients affected by epilepsy.
Index Terms—Graph learning, graph signal processing, sparsifying transform.
I. INTRODUCTION
In many machine learning applications, from brain functional analysis to gene regulatory networks or sensor networks, associating a graph-based representation to a dataset is a fundamental step to extract relevant information from the data, like detecting clusters or measuring node centrality. The graph underlying a given dataset can be real, as in the case of a sensor network, for example, or it can be just an abstract descriptor of pairwise similarities. There is a large amount of work aimed at learning the network topology from a set of observations [1]. Typically, the graph topology reflects correlations among signals defined over its vertices. However, looking only at correlations may fail to capture the causality relations existing among the data. Alternative approaches, built on partial correlation [1] or Gaussian graphical models [2], [3], have been deeply investigated. The analysis of signals defined over graphs, or Graph Signal Processing (GSP) [4]-[7], has provided a further strong impulse to the discovery of new techniques for inferring the graph topology from a set of observations. Some GSP-based approaches make assumptions about the graph by enforcing properties such as sparsity and/or smoothness of the signals [8], [9], [10]. Specifically, Kalofolias
in [8] proposed a Laplacian learning framework for smooth graph signals, by reformulating the problem as a weighted ℓ_1-norm minimization with a logarithmic barrier penalty term on the node degrees. Recently, the problem of learning a sparse, unweighted graph Laplacian from smooth graph signals has been modeled by Chepuri et al. as a rank ordering problem, where the Laplacian is expressed in terms of a sparse edge selection vector by assuming a priori knowledge of the graph sparsity level [9]. An alternative approach consists in defining joint properties between the signals and the underlying graph, such that the signal representation is consistent with given statistical priors on latent variables [10], [11]. Under the assumption that the observed field is a Gaussian Markov Random Field (GMRF), whose precision matrix is the graph Laplacian, Dong et al. [10] proposed a graph learning method that favors signal representations that are smooth and consistent with the underlying statistical prior model. Recently, in [11] Egilmez et al. proposed a general framework where graph learning is formulated as the estimation of different types of graph Laplacian matrices, by adopting as minimization metric the log-determinant Bregman divergence with an additional, sparsity-promoting, ℓ_1-regularization term. In [11] it is shown that such a problem corresponds to a maximum a posteriori parameter estimation of GMRFs whose precision matrices are graph Laplacians. In the case of directed graphs, Mei et al. proposed algorithms for estimating the directed, weighted adjacency matrices describing the dependencies among time series generated by systems of interacting agents [12]. There are also some recent works that focus on learning the graph topology from signals that diffuse over a graph [13], [14], [15]. Segarra et al. developed efficient sparse graph recovery algorithms for identifying the graph shift operator (the adjacency matrix or the normalized graph Laplacian) given only the eigenvectors of the shift operator [13]. These bases are estimated from the eigenvectors of the sample covariance matrix of stationary graph signals resulting from diffusion dynamics on the graphs. The optimization strategy then minimizes the ℓ_1-norm of the shift operator by imposing complete or partial knowledge of the eigenvectors of the empirical covariance matrix. In [14], Pasdeloup et al. formulated the graph-inference problem in a way similar to [13], albeit with a different matrix selection strategy, which aims at recovering a matrix modeling a diffusion process without imposing the structure of a valid Laplacian on this matrix. Recently, Thanou et al. proposed a graph learning strategy based on the assumption that the graph signals are generated from heat diffusion processes starting from a few nodes of the graphs, enforcing sparsity of the graph signals in a dictionary that is a concatenation of graph diffusion kernels at different scales [15]. The problem of jointly estimating the dictionary matrix and the graph topology from data has been recently tackled by Yankelevsky et al.
by taking into account the underlying structure in both the signal and the manifold domains [16]. Specifically, two smoothness regularization terms are introduced to describe, respectively, the relations between the rows and the columns of the data matrix. This leads to a dual graph regularized formulation aimed at finding, jointly, the dictionary and the sparse data matrix. The graph Laplacian is then estimated by imposing smoothness of any signal represented over the estimated dictionary. Given the observed graph signal, the estimation of the Laplacian eigenvectors is conceptually equivalent to finding a sparsifying transform [17]. A dictionary learning method for sparse coding incorporating a graph Laplacian regularization term, which enables the exploitation of data similarity, was developed in [18], while a dictionary learning method reinforcing group-graph structures was proposed in [19].

One of the key properties of the spectral representation of signals defined over a graph is that the representation depends on the graph topology. The Graph Fourier Transform (GFT), for example, is the projection of a signal onto the space spanned by the eigenvectors of the graph Laplacian matrix. The main idea underlying our work is to associate a graph topology to the data in order to make the observed signals band-limited over the inferred graph or, equivalently, to make their spectral representation sparse. Finding this sparse representation of a signal on a graph is instrumental to finding graph topologies that are modular, i.e. graphs characterized by densely interconnected subsets (clusters) weakly interconnected with other clusters. Modularity is in fact a structure that conveys important information. This is useful in all those applications, like clustering or classification, whose goal is to associate a label to all vertices belonging to the same class or cluster. If we look at the labels as a signal defined over the vertices of a graph, we have a signal that is constant within each cluster and varies arbitrarily across different clusters. This is a typical band-limited signal over a graph. The proposed approach is composed of the following two main steps: i) learn the GFT basis and the sparse signal representation jointly from the observed dataset; ii) infer the graph weighted Laplacian, and then the graph topology, from the estimated Laplacian eigenvectors. A nice feature of the proposed approach is that the first step is solved through an alternating optimization algorithm whose single steps are solved in closed form. Furthermore, the second step is cast as a convex problem, so that it can be efficiently solved. With respect to previous works, we make no assumptions about the stationarity of the observed signal and we do not assume any diffusion process taking place on the graph. Interestingly, we find theoretical conditions relating the sparsity of the graph that can be inferred and the bandwidth (sparsity level) of the signal.

To assess the performance of the proposed approaches, we consider both synthetic and real data. In the first case, we know the ground-truth graph and then we use it as a benchmark. In such a case, we also compare our methods with similar approaches, under different data models (e.g., band-limitedness, multivariate Gaussian random variables described by an inverse Laplacian covariance matrix, etc.). Then, we apply our methods to infer the brain functionality graph from electrocorticography (ECoG) seizure data collected in an epilepsy study [1], [20].
In such a case, the brain network is not directly observable and must be inferred from measurable ECoG signals. This represents an important application for the recently developed, graph-based learning methods [21], [22]. By exploiting such methods, the hope is to gain more insight into the unknown mechanisms underlying some neurological diseases such as, for example, epileptic seizures.

The rest of the paper is organized as follows. Section II introduces the graph signal models and the topology learning strategy. Section III describes the first step of our method, designed to learn, jointly, a subset of the GFT basis and the sparse signal. Two alternative algorithms to learn the Laplacian matrix from the estimated (incomplete) Fourier basis are then presented in Section IV. Theoretical conditions on graph identifiability are discussed in Section V. Numerical tests based on both synthetic and real data from an epilepsy study are then presented in Section VI.

II. LEARNING TOPOLOGY FROM GRAPH SIGNALS
A. Graph signal model
We consider an undirected graph G = {V, E} consisting of a set of N vertices (or nodes) V = {1, ..., N}, along with a set of edges E = {a_ij}_{i,j∈V}, such that a_ij > 0 if there is a link between nodes j and i, or a_ij = 0 otherwise. We denote by |V| the cardinality of V, i.e. the number of elements of V. Let A denote the N × N symmetric adjacency matrix whose entries are the edge weights a_ij, for i, j = 1, ..., N. The (unnormalized) graph Laplacian is defined as L := D − A, where the degree matrix D is a diagonal matrix whose i-th diagonal entry is d_i = Σ_j a_ij. Since G is an undirected graph, the associated Laplacian matrix is symmetric and positive semidefinite, and admits the eigendecomposition L = UΛU^T, where U collects all the eigenvectors {u_i}_{i=1}^N of L, whereas Λ is a diagonal matrix containing the eigenvalues of L. A signal y on a graph G is defined as a mapping from the vertex set to a complex vector of size N = |V|, i.e. y : V → C. For undirected graphs, the GFT ŷ of a graph signal y is defined as the projection of y onto the subspace spanned by the eigenvectors of the Laplacian matrix, see e.g. [4], [6], [7]:

    ŷ = U^T y.    (1)

A band-limited graph signal is a signal whose GFT is sparse. Let us denote these signals as

    y = U s    (2)

where s is a sparse vector. One of the main motivations for projecting the signal y onto the subspace spanned by the eigenvectors of L is that these eigenvectors encode some of the graph topological features [23], [24]. In particular, the eigenvectors associated with the smallest eigenvalues of L identify graph clusters [23]. Hence, a band-limited signal with support on the lowest indices is a signal that is smooth within clusters, while it can vary arbitrarily across clusters. More formally, given a subset of indices K ⊆ V, we introduce the band-limiting operator B_K = U Σ_K U^T, where Σ_K is a diagonal matrix defined as Σ_K = Diag{1_K} and 1_K is the set indicator vector, whose i-th entry is equal to one if i ∈ K, and zero otherwise.

Let us assume that s ∈ R^N is a |K|-sparse vector, i.e. ‖s‖_0 ≤ K, where the ℓ_0-norm counts the number of non-zero entries of s and K = |K|. We say that y is a K-band-limited graph signal if B_K y = y or, equivalently, if y = U s, where s is a |K|-sparse vector [25]. The observed signal y ∈ R^N can be seen as a linear combination of columns from a dictionary U ∈ R^{N×N}. Collecting M vectors s_i ∈ R^N in the matrix S ∈ R^{N×M}, we assume that the signals s_i share a common support. This implies that S is a block-sparse matrix having entire rows equal to zero or nonzero. We define the set of K-block sparse signals [26], [27] as

    B_K = { S = [s_1, ..., s_M] ∈ R^{N×M} | S(i,:) = 0^T, ∀ i ∉ K ⊆ V, K = |K| }    (3)

where S(i,:) denotes the i-th row of S. Hence, introducing the matrix Y ∈ R^{N×M} whose columns are the M observed vectors y_i, we get the compact form Y = US, with S ∈ B_K.
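To make the signal model concrete, here is a minimal Python sketch (numpy only; the graph parameters and helper name are our own illustrative choices, not taken from the paper) that builds a modular graph, extracts the GFT basis U from the Laplacian eigendecomposition, and synthesizes K-block-sparse signals Y = US with a common row support:

```python
import numpy as np

def modular_graph(n_clusters=3, size=10, p_in=0.8, p_out=0.02, seed=0):
    """Random modular graph: dense within clusters, sparse across them."""
    rng = np.random.default_rng(seed)
    N = n_clusters * size
    labels = np.repeat(np.arange(n_clusters), size)
    prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    A = np.triu(rng.random((N, N)) < prob, k=1).astype(float)
    A = A + A.T                                  # symmetric, no self-loops
    L = np.diag(A.sum(axis=1)) - A               # combinatorial Laplacian L = D - A
    return A, L

A, L = modular_graph()
_, U = np.linalg.eigh(L)                         # GFT basis: eigenvectors of L (ascending)

N, M, K = L.shape[0], 15, 3
S = np.zeros((N, M))                             # K-block-sparse spectral coefficients
S[:K, :] = np.random.default_rng(1).normal(1.0, 0.5, size=(K, M))
Y = U @ S                                        # K-band-limited observations, Y = U S
```

With this construction the columns of Y are smooth within clusters, which is exactly the band-limited behavior that the inference strategy described next tries to reproduce.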
B. Problem formulation

In this paper we propose a novel strategy to learn, jointly, the orthonormal transform matrix U, the sparse graph signal matrix S, and the underlying graph topology L. Although the matrices U, S and L are not uniquely identified, we will devise an efficient algorithm to find optimal solutions that minimize the estimation error. Before formulating our problem, we introduce the space of valid combinatorial Laplacians, i.e.

    𝓛 = { L ∈ S^N_+ | L1 = 0, L_ij = L_ji ≤ 0, ∀ i ≠ j }    (4)

where S^N_+ is the set of real, symmetric and positive semidefinite matrices. In principle, our goal would be to find the solution of the following optimization problem:

    min_{L, U, Λ ∈ R^{N×N}, S ∈ R^{N×M}}  ‖Y − US‖_F^2 + f(L, Y, S)    (P)
    s.t.  a) U^T U = I, u_1 = b
          b) LU = UΛ, L ∈ 𝓛, tr(L) = p
          c) Λ ⪰ 0, Λ_ij = Λ_ji = 0, ∀ i ≠ j
          d) S ∈ B_K,

where: the objective function is the sum of the data fitting error plus the function f(L, Y, S) enabling, as we will see in Section IV, the recovery of a graph topology which reflects the smoothness of the observed graph signals; constraint a) forces U to be unitary and to include a vector proportional to the vector of all ones; b) constrains L to possess the desired structure and imposes that U and Λ form its eigendecomposition; the trace constraint, with p > 0, is used to avoid the trivial solution; c) forces L to be positive semidefinite (and Λ to be diagonal); finally, d) enforces the K-block sparsity of the signals. However, problem P is non-convex in both the objective function and the constraint set. To make the problem tractable, we devise a strategy that decouples P into two simpler optimization problems. The proposed strategy is rooted in two basic steps: i) first, given the observation matrix Y, learn the orthonormal transform matrix U and the sparse signal matrix S jointly, in order to minimize the fitting error ‖Y − US‖_F^2, subject to a block-sparsity constraint; ii) second, given U, infer the graph Laplacian matrix L that admits the columns of U as its eigenvectors. While the theoretical proof of convergence of this double-step approach is still an open question, the effectiveness of the proposed strategy has been corroborated by extensive numerical results over both simulated and real data. In the next sections, we will develop alternative, suboptimal algorithms to solve the proposed graph inference problem and we will present the numerical results.

III. LEARNING FOURIER BASIS AND GRAPH SIGNALS JOINTLY

In this section we describe the first step of the proposed graph recovery strategy, aimed at learning the pair of matrices (U, S) jointly, up to a multiplicative rotation matrix. Observe that, since U ∈ R^{N×N} is full rank, if U is known, recovering S from Y is easy. However, in our case U is unknown. Hence, recovering both S and U is a challenging task, which is prone to an identifiability problem. To solve this task, we use a method conceptually similar to sparsifying transform learning, assuming a unitary transform [28], with the only difference that we enforce block-sparsity and we impose, a priori, that one of the eigenvectors be a constant vector. More specifically, we set u_1 = b, with b = (1/√N) 1. Because of the unitary property of U, we can write ‖Y − US‖_F^2 = ‖U^T Y − S‖_F^2. Our goal is to find the block-sparse columns {s_i}_{i=1}^M and the Fourier basis orthonormal vectors {u_i}_{i=1}^N minimizing the sparsification error.
Therefore, given the training matrix Y ∈ R^{N×M}, motivated by the original problem P, we formulate the following optimization problem:

    min_{U ∈ R^{N×N}, S ∈ R^{N×M}}  ‖U^T Y − S‖_F^2    (P_{U,S})
    s.t.  U^T U = I, u_1 = b, S ∈ B_K.    (5)

Although the objective function is convex, problem P_{U,S} is non-convex due to the nonconvexity of both the orthonormality and sparsity constraints. In the following, inspired by [28], we propose an algorithm to solve P_{U,S} that alternates between solving for the block-sparse signal matrix S and for the orthonormal transform U.
A. Alternating optimization algorithm

The proposed approach to solve the above non-convex problem P_{U,S} alternates between the minimization of the objective function with respect to S and U at each step k, iteratively, as follows:

    Ŝ^k ≜ arg min_{S ∈ R^{N×M}}  ‖(Û^{k−1})^T Y − S‖_F^2    (S_k)
          s.t.  S ∈ B_K

    Û^k ≜ arg min_{U ∈ R^{N×N}}  ‖U^T Y − Ŝ^k‖_F^2    (U_k)
          s.t.  U^T U = I, u_1 = b.

The algorithm iterates until a termination criterion is met, within a prescribed accuracy, as summarized in Algorithm 1. The interesting feature is that the two non-convex problems S_k and U_k admit closed-form solutions. Therefore, both steps of the above algorithm can be performed exactly, with a low computational cost, as follows.

Algorithm 1: Learning Fourier basis and graph signals
    Set Û^0 ∈ R^{N×N} with (Û^0)^T Û^0 = I, u_1 = b; set k = 1.
    Repeat:
        find Ŝ^k as the solution of S_k in (7);
        compute the optimal point Û^k as in (9);
        k = k + 1;
    until convergence.
1) Computation of the critical point Ŝ^k: To solve problem S_k we need to define the (p,q)-mixed norm of the matrix S as

    ‖S‖_{(p,q)} = ( Σ_{i=1}^N ‖S(i,:)‖_p^q )^{1/q}.    (6)

When q = 0, ‖S‖_{(p,0)} simply counts the number of nonzero rows of S, whereas for p = q = 2 we get ‖S‖_{(2,2)} = ‖S‖_F. Hence, we reformulate problem S_k as in [26], to obtain the best block-sparse approximation of the matrix S, as follows:

    arg min_{S ∈ R^{N×M}}  ‖(Û^{k−1})^T Y − S‖_F^2    (S_k)
    s.t.  ‖S‖_{(2,0)} ≤ K    (7)

where we force the rows of S to be K-block sparse. The solution of this problem is simply obtained by sorting the rows of (Û^{k−1})^T Y by their ℓ_2-norm and then selecting the K rows with largest norms [26].
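A minimal numpy sketch of this row-selection step (the closed-form solution of problem S_k; function and variable names are ours) is:

```python
import numpy as np

def block_sparse_update(U_prev, Y, K):
    """Closed-form solution of problem S_k: keep the K rows of U_prev^T Y
    with largest l2-norm and set all remaining rows to zero."""
    T = U_prev.T @ Y
    row_norms = np.linalg.norm(T, axis=1)
    support = np.argsort(row_norms)[-K:]         # indices of the K largest rows
    S = np.zeros_like(T)
    S[support, :] = T[support, :]
    return S, np.sort(support)
```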
2) Computation of the critical point Û^k: The optimal update for the transform matrix U^k is provided in the following proposition.

Proposition 1: Define Ȳ^k = Y(S^k)^T and Z^k = [ȳ_2^k, ..., ȳ_N^k], where ȳ_i^k represents the i-th column of Ȳ^k, and let Z^k_⊥ = PZ^k with P = I − 11^T/N. If r = rank(Z^k_⊥), we have

    Z^k_⊥ = XΣV^T = [X_r X_s] ΣV^T    (8)

where the N × r matrix X_r contains as columns the r left singular vectors associated with the non-zero singular values of Z^k_⊥, and X_s is an orthonormal N × (N − r) matrix selected as belonging to the nullspace of the matrix B = [1^T; X_r^T], i.e. BX_s = 0. The optimal solution of the non-convex problem U_k is then

    Û^k = [b  Ū^⋆]    (9)

where Ū^⋆ = X_− V^T and X_− is obtained by removing from X its last column.

Proof: See Appendix A.
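Numerically, the update of Proposition 1 can be implemented as an orthogonal Procrustes problem on the orthogonal complement of the all-ones vector, which is equivalent to the SVD construction above; a sketch, assuming numpy/scipy and our own naming:

```python
import numpy as np
from scipy.linalg import null_space, svd

def transform_update(Y, S_k):
    """U-update of Proposition 1, implemented as an orthogonal Procrustes
    problem on the orthogonal complement of the all-ones vector."""
    N = Y.shape[0]
    b = np.ones((N, 1)) / np.sqrt(N)             # known constant eigenvector u_1
    C = null_space(b.T)                          # N x (N-1) orthonormal basis of span{1}^perp
    Z = (Y @ S_k.T)[:, 1:]                       # Z^k: columns 2..N of Y (S^k)^T
    W1, _, W2t = svd(C.T @ Z)                    # Procrustes in the (N-1)-dim subspace
    U_bar = C @ (W1 @ W2t)                       # optimal [u_2, ..., u_N]
    return np.hstack([b, U_bar])                 # U_hat^k = [b, U_bar]
```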
Of course, the pair of matrices (U, S) solving (5) can only be found up to multiplication by a unitary matrix W that preserves the sparsity of S. In fact, US = UWW^T S := Û Ŝ, where W is a unitary matrix, i.e., WW^T = I. In the next section, we will take into account the presence of this unknown rotation matrix in the recovery of the graph topology. Nevertheless, if the graph signals assume values in a discrete alphabet, it is possible to de-rotate the estimated transform Û by using the adaptive, iterative algorithm developed in [29], based on the stochastic gradient approximation, as shown in Section VI.

IV. LEARNING THE GRAPH TOPOLOGY

In this section we propose a strategy aimed at identifying the topology of the graph by learning the Laplacian matrix L from the estimated transform matrix Û. Of course, if we knew the eigenvalue and eigenvector matrices of L, we could find L by solving

    LU = UΛ.    (10)

However, from the algorithm solving problem (5) we can only expect to find two matrices Û = UW and Ŝ = W^T S, where W is a unitary matrix that preserves the block-sparsity of S. This means that, if we take the product LÛ, we get

    LÛ = LUW = UWW^T ΛW.    (11)

Introducing the matrix C = W^T ΛW ⪰ 0, we can write

    LÛ = ÛC.    (12)

Furthermore, having supposed that the observed graph signals are K-bandlimited, with |K| = K, we can only assume, from the first step of the algorithm, knowledge of the K columns of Û associated with the observed signal subspace. Under this setting, the constraint in (12) reduces to

    LÛ_K = Û_K C_K    (13)

with C_K ∈ R^{K×K}, C_K = Φ^T Λ_K Φ ⪰ 0, Φ a unitary matrix, and Λ_K the diagonal matrix obtained by selecting the set K of diagonal entries of Λ. For the sake of simplifying the optimization problem P, the graph learning problem can be formulated as follows:

    min_{L ∈ R^{N×N}, C_K ∈ R^{K×K}}  f(L, Y, Ŝ)    (P_f)
    s.t.  L ∈ 𝓛, tr(L) = p,
          LÛ_K = Û_K C_K, C_K ⪰ 0,

with 𝓛 defined as in (4); we denote this feasible set by X(Û_K). Different choices for the objective function f(L, Y, Ŝ) have been proposed in the literature, such as, for example: i) f(L) = ‖L‖_1, as in [13], or ii) f(L, Y) = tr(Y^T LY) + μ‖L‖_F^2, as in [10]. In the first case, the goal was to recover the sparsest graph, given the eigenvectors of the shift matrix, estimated via principal component analysis (PCA) of the sample covariance of graph signals diffused on the network. In the second case, the goal was to minimize a linear combination of the ℓ_2-norm total variation of the observed signal Y, measuring the signal smoothness, plus a Frobenius norm term, specifically added to control (through the coefficient μ ≥ 0) the distribution of the off-diagonal entries of L. Note that, because of the constraint on the Laplacian trace, setting μ to large values penalizes large degrees and leads, in the limit, to dense graphs with constant degrees across the nodes.

In this paper, our goal is to infer a graph topology that gives rise to a band-limited signal, consistent with the observed signal. This inference is based on the joint estimation of the |K|-sparse signal vectors s_i and the Laplacian eigenvectors. To this end, we propose two alternative choices for the objective function, leading to the following two algorithms.
A. Total variation based graph learning

The Total Variation Graph Learning (TV-GL) algorithm is formulated as follows:

    min_{L ∈ R^{N×N}, C_K ∈ R^{K×K}}  f(L, Y, μ) ≜ TV(L, Y) + μ‖L‖_F^2    (P_{f1})
    s.t.  (L, C_K) ∈ X(Û_K)

where TV(L, Y) = Σ_{m=1}^M Σ_{i,j=1}^N |L_ij| |Y_im − Y_jm| represents the ℓ_1-norm total variation of the observed signal Y on the graph. Minimizing this ℓ_1-norm tends to disconnect nodes having distinct signal values and to connect nodes having similar values, whereas the Frobenius norm controls the sparsity of the off-diagonal entries of L; the constraint on the Laplacian trace in X(Û_K) is instrumental to avoid the trivial solution by fixing the ℓ_1-norm of L. Note that this constraint can be read as a sparsity constraint on L since, for a valid Laplacian, tr(L) = ‖L‖_1/2, and the ℓ_1 norm represents the tightest convex relaxation of the ℓ_0 norm. The coefficient μ is used to control the graph sparsity: if μ increases, the Frobenius norm of L tends to decrease and, because of the trace constraint, this leads to a more uniform distribution of the off-diagonal entries, so that the number of edges increases; on the contrary, as μ goes to 0, the graph tends to be more and more sparse. Problem P_{f1} is convex, since both the feasible set and the objective function are convex.
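The paper solves its convex programs with CVX [31]; the following CVXPY sketch of TV-GL is an illustrative translation (our own naming; it uses −Σ_ij L_ij D_ij for the TV term, which is valid because the off-diagonal entries of L are constrained to be nonpositive):

```python
import numpy as np
import cvxpy as cp

def tv_gl(U_K, Y, mu=1.0, p=None):
    """Illustrative TV-GL solver (problem P_f1). With L_ij <= 0 off-diagonal,
    TV(L, Y) = -sum_ij L_ij D_ij, where D_ij = sum_m |Y_im - Y_jm| (D_ii = 0)."""
    N, K = U_K.shape
    p = N if p is None else p
    D = np.abs(Y[:, None, :] - Y[None, :, :]).sum(axis=2)

    L = cp.Variable((N, N), symmetric=True)
    C = cp.Variable((K, K), PSD=True)            # the variable C_K
    off_mask = np.ones((N, N)) - np.eye(N)
    constraints = [
        L @ np.ones(N) == 0,                     # L 1 = 0
        cp.multiply(off_mask, L) <= 0,           # L_ij <= 0 for i != j
        L >> 0,                                  # L positive semidefinite
        cp.trace(L) == p,                        # avoids the trivial solution
        L @ U_K == U_K @ C,                      # L U_K = U_K C_K
    ]
    objective = -cp.sum(cp.multiply(L, D)) + mu * cp.sum_squares(L)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return L.value
```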
B. Estimated-signal-aided graph learning

We now propose an alternative method, which we name the Estimated-Signal-Aided Graph Learning (ESA-GL) algorithm. In this case the signals Ŝ estimated by Algorithm 1 are used in the graph recovery method. To motivate this method, we begin by observing that

    tr(Y^T LY) = tr(S_K^T Λ_K S_K) = tr(Ŝ_K^T C_K Ŝ_K)    (14)

where S_K contains the rows with index in the support set of the graph signals, whereas Ŝ_K = Φ^T S_K are the graph signals estimated by Algorithm 1. Therefore, an alternative formulation of problem P_f, aiming at minimizing the smoothness term in (14), is

    min_{L ∈ R^{N×N}, C_K ∈ R^{K×K}}  f(L, Ŝ_K, μ) ≜ tr(Ŝ_K^T C_K Ŝ_K) + μ‖L‖_F^2    (P_{f2})
    s.t.  (L, C_K) ∈ X(Û_K).

Also in this case the formulated problem is convex, so that it can be efficiently solved.
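With respect to the TV-GL sketch above, only the objective changes; an illustrative ESA-GL variant would plug the following objective into the same constraint set (Ŝ_K is the K × M matrix of estimated signals restricted to the support):

```python
import cvxpy as cp

def esa_gl_objective(L, C, S_hat_K, mu):
    """ESA-GL objective (problem P_f2); it replaces the TV term of tv_gl
    above, while the constraint set X(U_hat_K) stays unchanged."""
    return cp.trace(S_hat_K.T @ C @ S_hat_K) + mu * cp.sum_squares(L)
```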
V. LAPLACIAN IDENTIFIABILITY CONDITIONS
Before assessing the goodness of the proposed inference algorithms, in this section we investigate the conditions under which the Laplacian matrix can be recovered uniquely. Interestingly, the derivations lead to a relation between the bandwidth of the signal and the sparsity of the graph. We start by looking for the conditions under which the feasible set X(Û_K) of problems P_{f1} and P_{f2} reduces to a singleton. To derive closed-form bounds for the identifiability conditions of the Laplacian matrix, we need to assume that the transform matrix is perfectly known, so that Û_K = U_K. Even though these are ideal conditions, because our proposed algorithms are only able to recover U_K up to a rotation, the closed forms provide a meaningful insight into the relation between graph sparsity and signal bandwidth. Furthermore, we remark that it is possible to recover the matrix U_K with no rotation in the case where the signal values belong to a known discrete set, by using the blind algorithm of [29], as shown in Section VI. Assume w.l.o.g. that the K eigenvectors of U_K are sorted in increasing order of their corresponding eigenvalues, so that λ_{K,1} ≤ λ_{K,2} ≤ ... ≤ λ_{K,K}, with Λ_K = diag(λ_K). As a consequence, the convex set X(U_K) of problems P_{f1} and P_{f2} can be written as

    X(U_K) ≜ { a) LU_K = U_K Λ_K
               b) L1 = 0
               c) L_ij = L_ji ≤ 0, ∀ i ≠ j
               d) λ_{K,1} = 0, λ_{K,2} ≥ 0
               e) λ_{K,i+1} ≥ λ_{K,i}, i = 2, ..., K−1
               f) tr(L) = p }    (15)

where conditions b)-e) impose the desired Laplacian structure on L and f) fixes its ℓ_1-norm, avoiding the trivial solution. For the sake of simplicity, we focus on the case where the subset K consists of the indices associated with the first K columns of U; otherwise, one can follow similar arguments for any subset of indices by properly rewriting conditions d)-e). As shown in Appendix B, we can reduce equations a)-c) and f) in (15) to the following compact form:

    Fx = b,  x ∈ R_+^{N(N−1)/2 + K−1}    (16)

where we defined x ≜ [−z; λ̄]; z ≜ vech(L) ∈ R_−^{N(N−1)/2}, with vech(L) the half-vectorization of L obtained by vectorizing only the lower triangular part (except the diagonal entries) of L; R_+ and R_− denote the sets of, respectively, nonnegative and nonpositive real numbers; λ̄ ≜ {λ_{K,i}}_{i∈K_−1}, where, assuming the entries of λ_K in increasing order, the index set K_−1 is obtained by removing from K the first index, corresponding to λ_{K,1}; b = [0_{KN}; p]; and, finally, the coefficient matrix F ∈ R^{m×n}, with m = KN + 1 and n = N(N−1)/2 + K − 1, is defined in Appendix B, equation (40). Note that the rank q of the coefficient matrix F ∈ R^{m×n} satisfies q ≥ K − 1 and q ≤ min{n, m}.

Proposition 2: Assume the set X(U_K) to be feasible; then it holds:
a) K − 1 ≤ rank(F) ≤ KN + 1 for K ≤ N/2 − 2/(N−1), and K − 1 ≤ rank(F) ≤ N(N−1)/2 + K − 1 for K > N/2 − 2/(N−1);
b) if rank(F) = KN + 1 and K = N/2 − 2/(N−1), or rank(F) = N(N−1)/2 + K − 1 and K > N/2 − 2/(N−1), the set X(U_K) is a singleton.

Proof: See Section A in Appendix C.

By using some properties of nonnegative solutions of underdetermined systems [30], under the assumption of existence and uniqueness of solutions for system (16), we can derive the following condition relating graph sparsity and signal bandwidth K.

Proposition 3: If the set X(U_K), with K < N/2 − 2/(N−1), is feasible and {x | Fx = Fx_0, x ≥ 0} is a singleton for any nonnegative s-sparse signal x_0, then N/2 − 2/(N−1) > K ≥ s/N and ‖A‖_0 ≤ 2K(N−1) + 2c, where c is the number of connected components in the graph.
Proof: See Section B in Appendix C.

Note that if we assume the graph to be disconnected, i.e. c > 1, the upper bound on the graph sparsity tends to increase, to compensate for the connectivity loss.
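To make the bound of Proposition 3 concrete (in the form reconstructed above, which should be taken as indicative), consider a setting like the one used in the numerical experiments of the next section, with N = 30 nodes, bandwidth K = 3 and c = 3 mutually disconnected clusters:

```latex
\[
  \|A\|_0 \;\le\; 2K(N-1) + 2c \;=\; 2\cdot 3\cdot 29 + 2\cdot 3 \;=\; 180,
\]
% i.e., at most 90 undirected edges, since each edge contributes two
% nonzero entries to A. Three fully connected 10-node clusters alone
% would already contain 135 edges, so the bound genuinely restricts the
% density of graphs identifiable from signals of bandwidth K = 3.
```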
VI. NUMERICAL RESULTS

In this section, we present some numerical results validating the effectiveness of the proposed graph-learning strategies for both synthetic and real-world graphs. In all numerical experiments we solved the proposed optimization problems by using the convex optimization package CVX [31].
Performance over synthetic data
Let us consider the graph in Fig. 1a, composed of N = 30 nodes forming 3 clusters of 10 nodes each. To start testing the effectiveness of the proposed graph inference strategies, in Fig. 1 we illustrate an example of graph topology recovery obtained by using the TV-GL and ESA-GL algorithms. The color on each edge encodes the strength of that edge. We can observe that, learning the Fourier basis using Algorithm 1, both algorithms are able to ensure a good topology recovery. As a statistical test, we ran simulations over independent signal matrix realizations, with N = 30 and M = 15, assuming K = 3 for the block sparsity and setting tr(L) = N. As a first performance measure, we computed the correlation coefficient between the true Laplacian entries L_ij and their estimates L̂_ij:

    ρ(L, L̂) ≜ Σ_ij (L_ij − L_m)(L̂_ij − L̂_m) / ( √(Σ_ij (L_ij − L_m)^2) · √(Σ_ij (L̂_ij − L̂_m)^2) )

where L_m and L̂_m are the average values of the entries of, respectively, the true and estimated Laplacian matrices. Note that, because of the Laplacian matrix structure, we always have L_m = L̂_m = 0. In Fig. 2, we plot the average correlation coefficient ρ̄(L, L̂) vs. the penalty coefficient μ, where L̂ is computed by applying the total variation based (TV-GL) or the estimated-signal-aided (ESA-GL) graph-learning algorithm. To get insight into the proposed algorithms, we considered both the case when the Fourier transform matrix U is a priori known and the case when it is estimated by using Algorithm 1. From Fig. 2, we notice that both methods are able to ensure high average correlation coefficients. Furthermore, we can observe a higher robustness of the TV-GL method to the choice of the penalty parameter, with respect to the ESA-GL algorithm. By comparing the curves obtained by assuming perfect knowledge of U with those derived by estimating it through Algorithm 1, we can also notice that the performance loss due to estimation errors is negligible.

In general, the choice of μ has an impact on the final result. To assess this impact, in Fig. 3 we illustrate the average normalized recovery error Ē versus the penalty coefficient μ. The error Ē represents the fraction of misidentified edges and is defined as ‖A − Â_b‖_0 / (N·(N−1)), where A and Â_b are, respectively, the ground-truth and the recovered binary adjacency matrices. The binary matrix Â_b has been obtained by thresholding the entries of the recovered matrix Â with a threshold equal to half the average value of the elements of Â. We consider both the case where the Fourier basis vectors are estimated (upper plot) and the case where they are a priori perfectly known (lower plot). Under this last setting, we solve problems P_{f1} and P_{f2} by assuming (L, C_K) ∈ X(U_K); thereby, we solve problem S_k by supposing perfect knowledge of U, in order to compute the estimated signals Ŝ needed as input to problem P_{f2}. We can see that the fraction of incorrectly recovered edges remains small. Furthermore, comparing the TV-GL algorithm with ESA-GL, we can observe that TV-GL tends to perform better.
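For concreteness, the two figures of merit can be computed as in the following numpy sketch (our own function names; the threshold rule follows the text above):

```python
import numpy as np

def correlation_coefficient(L_true, L_hat):
    """rho(L, L_hat); the entry means vanish for valid Laplacians
    (rows sum to zero), so the formula reduces to a normalized inner product."""
    return np.sum(L_true * L_hat) / (
        np.linalg.norm(L_true, 'fro') * np.linalg.norm(L_hat, 'fro'))

def edge_recovery_error(A_true, A_hat):
    """E-bar: fraction of misidentified edges, after thresholding A_hat
    at half the average value of its entries."""
    N = A_true.shape[0]
    A_bin = (A_hat > 0.5 * A_hat.mean()).astype(float)
    return np.count_nonzero(A_true - A_bin) / (N * (N - 1))
```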
Performance versus signal bandwidth

To evaluate the impact of the signal bandwidth K on the graph recovery strategy, we illustrate some numerical results obtained on random graphs composed of N = 30 nodes with 3 clusters, each of 10 nodes, by averaging the final results over independent graph and signal matrix realizations, for M = 15. We consider both exactly and only approximately (or compressible) bandlimited graph signals [32]. We recall that a vector is compressible if the error of its best k-term approximation decays quickly in k [32]. We generate each band-limited signal s_i, for i = 1, ..., M, as s_i(k) ∼ N(1, σ^2) for all k ≤ K, and s_i(k) = 0 for all k > K. In Fig. 4a we plot the average recovery error Ē (upper plot) and the average recovery error Ē_F (lower plot), defined as ‖A − Â‖_F / (N·(N−1)), vs. the signal bandwidth K. For each K we selected the optimal coefficient μ minimizing the average recovery error. We can observe that the error tends to increase as the signal bandwidth K gets larger than the number of clusters, equal to K = 3. Finally, in Fig. 4b we report the average recovery errors in the case of compressible graph signals, generated as s_i(k) ∼ N(1, σ^2) for all k ≤ K_v, and s_i(k) = (K_v/k)^β for all k > K_v, with β = 2, K_v = 5 [33]. We can observe that for the TV-GL method the average recovery error increases as the signal bandwidth increases, while the minimum is achieved for K = 5, since K_v = 5 represents the approximate graph signal bandwidth. Additionally, the ESA-GL method reaches minimum values quite similar to those of the TV-GL algorithm for K close to K_v.

Fig. 1: Examples of graph learning: (a) original graph; (b) graph recovered through the TV-GL algorithm with μ = 2; (c) graph recovered through the ESA-GL algorithm.

Fig. 2: Average correlation coefficient versus the parameter μ (M = 15, K = 3).
Performance over real data

In this section we test our methods over real data, for recovering the brain functional connectivity network associated with epilepsy. Understanding how this abnormal neuronal activity emerges, and from what epileptogenic zone, would help to refine surgical techniques and devise novel therapies. The diagnosis involves comparing ECoG time series obtained from the patient's brain before and after the onset of a seizure.
Epilepsy data description
We used the datasets taken from experiments conducted in an epilepsy study [20] to infer the brain functional activity map. The data were collected from a 39-year-old female subject with a case of intractable epilepsy at the University of California, San Francisco (UCSF) Epilepsy Center [20]. An 8 × 8 electrode grid was implanted on the cortical surface of the subject's brain, and two accompanying strips of six electrodes each were implanted deeper into the brain. This combined electrode network recorded ECoG time series, consisting of voltage levels in the vicinity of each electrode, which are indicative of the levels of local brain activity. Physicians recorded data over consecutive days, and ECoG epochs containing eight seizures were extracted from the record. The time series at each electrode were first passed through a bandpass filter, and two temporal intervals of interest were picked for analysis, namely, the preictal and ictal intervals. The preictal interval is defined as an interval preceding seizure onset, while the ictal interval corresponds to the interval immediately after the start of a seizure. For further details on data acquisition see [20].

Fig. 3: Average recovery error vs. the coefficient μ, using estimated Fourier basis vectors (upper plot) and a priori knowledge of the Fourier transform (lower plot).

Fig. 4: Average recovery errors versus K for (a) bandlimited graph signals; (b) compressible graph signals.

Recovery of brain functional activity graph
Since the observed signal is highly non-stationary, following the same approach described in [20], before running our algorithms we divide each interval of interest into overlapping segments of one second, each segment overlapping with the previous one; the reason is that, within a one-second interval, the brain activity is approximately stationary. After this segmentation, our goal is to infer the network topology for each segment. We denote by Y ∈ R^{N×M} the observed data matrix, with N = 76 and M = 400. Before applying our inference algorithm, we filter the data to reduce the effect of noise, using the method proposed in [34], where an optimal shrinkage of the empirical singular values of the observed matrix is applied. Given the compressed signal matrix Ŷ, we proceed by first running Algorithm 1 to estimate the transform matrix Û and, thereafter, we recover the brain network topology by using the TV-GL algorithm. In Fig. 5 we show two examples of graphs learned from the observed (and filtered) data. They refer, respectively, to a pre-ictal interval and an ictal interval. We can notice how the number of edges changes dramatically, showing that the network connectivity tends to decrease at seizure onset. This behavior is especially evident in the bottom corner of the grid closest to the two strips of electrodes, which were located close to where the seizure was suspected to emanate, so that in this region the connectivity of the network tends to decrease at seizure onset, as observed in [20].

The problem in associating a graph topology to a set of signals is that we do not know the ground truth. To validate our approach from a purely data-oriented perspective, we used the following procedure. We consider the observations taken in two intervals, namely a pre-ictal interval and an ictal interval; the graphs inferred in these two cases are the ones depicted in Fig. 5. Then, we assume that we observe only the signals taken from a subset of electrodes, and we ask ourselves whether we can recover the signals over the unobserved electrodes. If the signal is band-limited over the recovered graph, we can apply sampling theory for graph signals, as e.g. in [25], to recover the unobserved signals. Then we compare what we are able to reconstruct with what is indeed known. We use a greedy sampling strategy based on an E-optimal design criterion [35], selecting, for each of the two intervals, the bandwidth provided by our transform learning method. The number of electrodes assumed to be observed is chosen equal to the bandwidth in both cases. For each time instant, we use a batch consistent recovery method that reconstructs the signals from the collected samples [see eq. (1.9) in [35]]. In Fig. 6, we illustrate the true ECoG signal present at a given node (black line) and what we are able to reconstruct (green line) from a subset of nodes that does not contain that node. We repeated this operation for both the pre-ictal (top) and the ictal (bottom) phases. As we can notice from Fig. 6, the reconstructed signal is very close to the real signal, in both phases. This means that, evidently, the inferred graphs are close to reality, because otherwise the sampling theory would have been based on a wrong topology assumption and could not have led to such a good prediction.
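A simplified sketch of this validation step (Python; greedy vertex selection maximizing the smallest singular value of the sampled basis, as a stand-in for the E-optimal design of [35], with our own naming):

```python
import numpy as np

def sample_and_recover(L_hat, y, K):
    """Greedily pick K vertices (maximizing the smallest singular value of
    the sampled low-frequency basis, an E-optimal-type criterion) and
    reconstruct a K-band-limited signal on the inferred graph."""
    _, U = np.linalg.eigh(L_hat)
    U_K = U[:, :K]                               # low-frequency subspace
    sel = []
    for _ in range(K):
        scores = [
            (np.linalg.svd(U_K[sel + [v], :], compute_uv=False)[-1], v)
            for v in range(L_hat.shape[0]) if v not in sel
        ]
        sel.append(max(scores)[1])
    s = np.linalg.lstsq(U_K[sel, :], y[sel], rcond=None)[0]
    return U_K @ s, sel                          # reconstruction, sampled vertices
```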
Comparison with GSP-based topology inference methods

In this section we compare the performance of our algorithms with recent GSP-based algorithms, namely the methods used in [13], [8], [10] to identify the Laplacian matrix by exploiting the smoothness of the graph signals. To make a fair comparison, we considered the combinatorial Laplacian matrix for all methods. To implement the SpecTemp algorithm proposed in [13], we need to solve the following problem:

    min_{L, L_K̄, L', λ}  ‖L‖_1    (P_s)
    s.t.  L ∈ 𝓛,
          L' = L_K̄ + Σ_{k=1}^K λ_k v̂_k v̂_k^T,  L_K̄ V̂_K = 0,
          d(L', L) ≤ ε

where V̂_K ≜ [v̂_1, ..., v̂_K] is the graph Fourier basis obtained via PCA, i.e. taking the K eigenvectors associated with the K largest nonzero eigenvalues of the sample covariance matrix C_y ≜ (1/M) YY^T, with Y assumed to be zero mean; d(·,·) is a convex vector distance function, e.g. ‖L − L'‖_F, and ε is a positive constant controlling the feasibility of the set. We select ε as the smallest value that admits a feasible solution. As suggested in [13], the recovered eigenvalues λ are required to satisfy the separation condition λ_i ≥ λ_{i+k} + δ, ∀i, with k = 2 and a small δ > 0. This is done to enforce the property that the principal eigenvectors of the estimated covariance matrix give rise to the eigenvectors associated with the smallest eigenvalues of the Laplacian matrix.
Fig. 5: Recovered networks for the pre-ictal interval 19 (left) and the first ictal interval (right).
Fig. 6: True and recovered brain signals during the pre-ictal (top) and ictal (bottom) intervals.

We also considered the combinatorial Laplacian recovery algorithm proposed by Dong et al. in [10], which solves the following optimization problem:

    min_L  μ‖L‖_F^2 + tr(Y^T LY)    (P_D)
    s.t.  L ∈ 𝓛, tr(L) = p

where the coefficient μ > 0 is chosen in order to maximize the correlation coefficient ρ(L, L̂). Finally, we considered the Kalofolias algorithm proposed in [8], which finds the adjacency matrix as the optimal solution of the following convex problem:

    min_{A ∈ 𝒜_N}  tr(AZ) − α 1^T log(A1) + β‖A‖_F^2    (P_K)

where α, β > 0, 𝒜_N ≜ {A ∈ R_+^{N×N} : A = A^T, diag(A) = 0, A_ij ≥ 0} and Z ≜ (Z_ij)_{∀i,j}, with Z_ij ≜ ‖y_i − y_j‖^2 and y_i denoting the signal observed at node i. Note that the logarithmic barrier in P_K forces the node degrees to be positive, while the Frobenius norm is added to control sparsity.

As figures of merit, we considered the average correlation coefficient ρ̄(L, L̂), the average recovery error Ē, and three evaluation criteria commonly used in information retrieval [36], i.e. the F-measure, the edge Precision and the edge Recall. The Precision is the percentage of correct edges in the learned graph, and the Recall evaluates the percentage of the edges in the true graph that are present in the learned graph [36]. Defining E_g, E_r as the sets of edges present, respectively, in the ground truth and in the recovered graph, Precision and Recall are

    Precision = |E_g ∩ E_r| / |E_r|,   Recall = |E_g ∩ E_r| / |E_g|.    (17)

The F-measure is defined as the harmonic mean of edge Precision and edge Recall. In our numerical experiments, we average the results over independent random graph realizations and compare the different methods of interest on three different types of graphs, composed of N = 30 nodes and for M = 100 graph signal vectors. More specifically, we considered three cases: i) random graphs composed of three clusters, each of 10 nodes, with a small edge probability of connection among clusters; ii) Erdős-Rényi graphs; iii) Barabási-Albert graphs generated from a set of m_0 = 4 initial nodes, where one node is added at each step and connected to m = 3 already existing nodes according to the preferential attachment mechanism. For a fair comparison, we generated the signals according to two models. The graph signals are generated as y = Us, where the s are: i) K-bandlimited random i.i.d. signals, with zero mean, unit variance and |K| = 3 (bandlimited GS model); ii) zero-mean Gaussian random variables with covariance matrix defined as the pseudo-inverse of the eigenvalue matrix Λ of the graph Laplacian, i.e. s ∼ N(0, Λ†) [10] (inverse Laplacian GS model). To compute the Precision, Recall and F-measure, we applied a threshold on the estimated edge values, to identify the sets E_g and E_r.
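These retrieval metrics can be computed as in the following sketch (numpy; function names are ours, and the threshold thr is the one described next in the text):

```python
import numpy as np

def edge_set_metrics(A_true, A_hat, thr):
    """Edge Precision, Recall and F-measure after thresholding the
    estimated adjacency at thr (undirected graphs: upper triangle only)."""
    iu = np.triu_indices_from(A_true, k=1)
    true_e = A_true[iu] > 0
    rec_e = A_hat[iu] > thr
    inter = np.logical_and(true_e, rec_e).sum()
    precision = inter / max(rec_e.sum(), 1)
    recall = inter / max(true_e.sum(), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f_measure
```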
We used the same threshold for all methods, evaluated as one half the average value of the off-diagonal entries of the estimated Laplacian. In Table I, we summarize the performance measures for, respectively, the bandlimited and the inverse Laplacian graph signal models. To make a fair comparison, we put every method in its best conditions. So, for SpecTemp [13], we searched for the minimum ε value guaranteeing the feasibility of the set, while for the Kalofolias [8] and Dong et al. [10] algorithms we selected the optimal μ, α and β coefficients achieving the maximum ρ(L, L̂). Comparing all methods, we notice first that our methods offer superior performance when we observe K-bandlimited signals over random graphs composed of K clusters. Clearly, this had to be expected, because our methods are perfectly matched to such a condition. However, we can also notice that our methods are quite robust across the different graph and signal models. Additionally, it can be noted that the SpecTemp algorithm performs poorly when we use a low number of observed signals M and of basis vectors K, whereas the TV-GL and ESA-GL methods reach good performance under this critical setting. In Table II we report the performance metrics for different random geometric graphs with K = 3, 6, 9 clusters and a small probability of connection among clusters. Note that the proposed graph learning methods ensure good performance for both the bandlimited and the inverse Laplacian GS models. Finally, in Table III, under the setting of a discrete signal alphabet, we report the numerical results obtained by using as input of the TV-GL and ESA-GL methods the de-rotated transform matrix Û_K H^T, where the matrix H is estimated according to the blind recovery method proposed in [29]. We name these methods Derotated TV-GL (DTV-GL) and Derotated ESA-GL (DESA-GL). Interestingly, we can note that high performance levels are achieved in both cases, i.e. by applying the derotation method or by omitting it. Indeed, we observed that, for bandlimited graph signals over random graphs with K clusters, as the number of clusters increases, the transform and signal matrices learned through the TV-GL and ESA-GL methods become more and more accurate and are not rotated with respect to the true transform matrix. This is indeed an interesting behavior whose investigation we defer to future works.

TABLE I: Performance comparison between TV-GL, ESA-GL, SpecTemp [13], Kalofolias [8] and Dong et al. [10] for different graph signal models, with M = 100. For TV-GL and ESA-GL we set K = 3.

                            |       Bandlimited GS model           |    Inverse Laplacian GS model
                            | TV-GL  ESA-GL SpecTemp Kalof. Dong   | TV-GL  ESA-GL SpecTemp Kalof. Dong
RG, K = 3      F-measure    | 0.792  0.839  0.311    0.774  0.768  | 0.688  0.650  0.318    0.879  0.865
clusters       Precision    | 0.664  0.734  0.185    0.734  0.650  | 0.540  0.498  0.189    0.901  0.794
               Recall       | 0.989  0.982  0.980    0.829  0.943  | 0.955  0.946  1        0.859  0.952
Erdős-Rényi    F-measure    | 0.433  0.480  0.237    0.444  0.401  | 0.545  0.448  0.237    0.633  0.712
               Precision    | 0.280  0.323  0.134    0.331  0.255  | 0.383  0.308  0.134    0.501  0.577
               Recall       | 0.961  0.943  1        0.700  0.954  | 0.946  0.826  1        0.863  0.932
Barabási-      F-measure    | 0.436  0.462  0.301    0.464  0.448  | 0.549  0.433  0.304    0.652  0.670
Albert         Precision    | 0.287  0.313  0.178    0.364  0.294  | 0.440  0.311  0.179    0.579  0.662
               Recall       | 0.914  0.886  0.990    0.653  0.936  | 0.731  0.718  1        0.750  0.680

TABLE II: Performance comparison between TV-GL, ESA-GL, SpecTemp [13], Kalofolias [8] and Dong et al. [10] for random geometric graphs with K = 3, 6, 9 clusters and M = 100.

                            |       Bandlimited GS model           |    Inverse Laplacian GS model
                            | TV-GL  ESA-GL SpecTemp Kalof. Dong   | TV-GL  ESA-GL SpecTemp Kalof. Dong
RG, K = 3      F-measure    | 0.792  0.839  0.311    0.774  0.768  | 0.688  0.650  0.318    0.879  0.865
clusters       Precision    | 0.664  0.734  0.185    0.734  0.650  | 0.540  0.498  0.189    0.901  0.794
               Recall       | 0.989  0.982  0.980    0.829  0.943  | 0.955  0.946  1        0.859  0.952
RG, K = 6      F-measure    | 0.712  0.776  0.177    0.752  0.713  | 0.661  0.705  0.175    0.742  0.762
clusters       Precision    | 0.557  0.639  0.097    0.671  0.586  | 0.500  0.556  0.097    0.628  0.659
               Recall       | 0.993  0.992  1        0.863  0.917  | 0.981  0.967  0.980    0.913  0.907
RG, K = 9      F-measure    | 0.743  0.875  0.087    0.691  0.747  | 0.617  0.688  0.037    0.750  0.749
clusters       Precision    | 0.593  0.781  0.049    0.566  0.658  | 0.451  0.536  0.139    0.664  0.660
               Recall       | 0.997  0.999  0.376    0.898  0.867  | 0.980  0.965  0.120    0.862  0.867

TABLE III: Performance comparison between TV-GL, ESA-GL, DTV-GL, DESA-GL, SpecTemp [13], Kalofolias [8] and Dong et al. [10] for bandlimited GS with discrete alphabet, over RG with K = 2, 6 clusters.

                            | TV-GL  ESA-GL DTV-GL DESA-GL SpecTemp Kalofolias Dong et al.
RG, K = 2,     F-measure    | 0.795  0.827  0.821  0.811   0.449    0.711      0.762
M = 120        Precision    | 0.669  0.726  0.715  0.693   0.291    0.560      0.633
               Recall       | 0.987  0.966  0.971  0.983   1        0.995      0.968
RG, K = 6,     F-measure    | 0.809  0.840  0.779  0.839   0.151    0.745      0.746
M = 1000       Precision    | 0.685  0.728  0.641  0.726   0.090    0.634      0.634
               Recall       | 0.992  0.997  0.997  0.997   0.479    0.909      0.913
VII. CONCLUSIONS

In this paper we have proposed efficient strategies for recovering the graph topology from observed data. The main idea is to associate a graph to the data in such a way that the observed signals look like band-limited signals over the inferred graph. The motivation is that enforcing the observed signal to be band-limited over the inferred graph tends to yield modular graphs, and modularity is a structural property that carries important information. The proposed method consists of two steps: first, learn the sparsifying transform and the graph signals jointly from the observed signals; then, infer the graph Laplacian from the estimated orthonormal basis. Although the graph topology inference problem is intrinsically non-convex and we cannot make any claim of optimality on the achieved solution, the proposed algorithms are computationally efficient, because the first step alternates between intermediate solutions expressed in closed form, while the second step involves a convex optimization. We applied our methods to recover the brain functional activity network from ECoG seizure data collected in an epilepsy study, and we devised a purely data-driven strategy to assess the goodness of the inferred graph. Finally, we compared our methods with existing ones under different settings and conditions to assess their relative merits.
REFERENCES

[1] E. D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models, Springer Series in Statistics, 2009.
[2] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432-441, 2008.
[3] B. M. Lake and J. B. Tenenbaum, "Discovering structure by learning sparse graphs," in Proc. Annual Cognitive Science Conference, 2010, pp. 778-783.
[4] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83-98, May 2013.
[5] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042-3054, Jun. 2014.
[6] I. Pesenson, "Sampling in Paley-Wiener spaces on combinatorial graphs," Trans. Amer. Math. Soc., vol. 360, no. 10, pp. 5603-5627, Oct. 2008.
[7] X. Zhu and M. Rabbat, "Approximating signals supported on graphs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2012.
[8] V. Kalofolias, "How to learn a graph from smooth signals," in Proc. 19th Int. Conf. Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, May 2016, vol. 51 of Proc. Mach. Learn. Res., pp. 920-929, PMLR.
[9] S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero, "Learning sparse graphs under smoothness prior," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2017, pp. 6508-6512.
[10] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6160-6173, Dec. 2016.
[11] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning from data under Laplacian and structural constraints," IEEE J. Sel. Topics Signal Process., vol. 11, no. 6, pp. 825-841, Sep. 2017.
[12] J. Mei and J. M. F. Moura, "Signal processing on graphs: Causal modeling of unstructured data," IEEE Trans. Signal Process., vol. 65, no. 8, pp. 2077-2092, Apr. 2017.
[13] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, "Network topology inference from spectral templates," IEEE Trans. Signal Inf. Process. Netw., vol. 3, no. 3, pp. 467-483, Sep. 2017.
[14] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, "Characterization and inference of graph diffusion processes from observations of stationary signals," IEEE Trans. Signal Inf. Process. Netw., vol. PP, no. 99, pp. 1-1, 2017.
[15] D. Thanou, X. Dong, D. Kressner, and P. Frossard, "Learning heat diffusion graphs," IEEE Trans. Signal Inf. Process. Netw., vol. 3, no. 3, pp. 484-499, Sep. 2017.
[16] Y. Yankelevsky and M. Elad, "Dual graph regularized dictionary learning," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 4, pp. 611-624, Dec. 2016.
[17] S. Ravishankar and Y. Bresler, "Learning sparsifying transforms," IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1072-1086, Mar. 2013.
[18] E. Kodirov, T. Xiang, and S. Gong, "Dictionary learning with iterative Laplacian regularisation for unsupervised person re-identification," in Proc. British Machine Vision Conference (BMVC), Sep. 2015, pp. 44.1-44.12, BMVA Press.
[19] H. Xu, L. Yu, D. Luo, H. Zha, and Y. Xu, "Dictionary learning with mutually reinforcing group-graph structures," in Proc. AAAI Conf. Artificial Intelligence, 2015.
[20] M. A. Kramer, E. D. Kolaczyk, and H. E. Kirsch, "Emergent network topology at seizure onset in humans," Epilepsy Research, vol. 79, no. 2-3, pp. 173-186, 2008.
[21] Y. Shen, B. Baingana, and G. B. Giannakis, "Nonlinear structural vector autoregressive models for inferring effective brain network connectivity," submitted Oct. 2016, https://arxiv.org/abs/1610.06551.
[22] W. Huang, L. Goldsberry, N. F. Wymbs, S. T. Grafton, D. S. Bassett, and A. Ribeiro, "Graph frequency analysis of brain signals," IEEE J. Sel. Topics Signal Process., vol. 10, no. 7, pp. 1189-1203, Oct. 2016.
[23] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[24] F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc., 1997.
[25] M. Tsitsvero, S. Barbarossa, and P. Di Lorenzo, "Signals on graphs: Uncertainty principle and sampling," IEEE Trans. Signal Process., vol. 64, no. 18, pp. 4845-4860, Sep. 2016.
[26] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, "Model-based compressive sensing," IEEE Trans. Inf. Theory, vol. 56, no. 4, pp. 1982-2001, Apr. 2010.
[27] Y. C. Eldar and M. Mishali, "Robust recovery of signals from a structured union of subspaces," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 5302-5316, Nov. 2009.
[28] S. Ravishankar and Y. Bresler, "ℓ0 sparsifying transform learning with efficient optimal updates and convergence guarantees," IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2389-2404, May 2015.
[29] O. Macchi and E. Moreau, "Adaptive unsupervised separation of discrete sources," Signal Processing, vol. 73, pp. 49-66, 1999.
[30] M. Wang, W. Xu, and A. Tang, "A unique "nonnegative" solution to an underdetermined system: From vectors to matrices," IEEE Trans. Signal Process., vol. 59, no. 3, pp. 1007-1016, Mar. 2011.
[31] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[32] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Birkhäuser, Springer, New York, 2013.
[33] S. Chen, R. Varma, A. Singh, and J. Kovačević, "Signal recovery on graphs: Random versus experimentally designed sampling," in Proc. Int. Conf. Sampling Theory and Applications (SampTA), May 2015, pp. 337-341.
[34] M. Gavish and D. L. Donoho, "Optimal shrinkage of singular values," IEEE Trans. Inf. Theory, vol. 63, no. 4, pp. 2137-2152, Apr. 2017.
[35] P. Di Lorenzo, S. Barbarossa, and P. Banelli, "Sampling and recovery of graph signals," arXiv preprint arXiv:1712.09310, 2017.
[36] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge Univ. Press, 2008.
[37] J. H. Manton, "Optimization algorithms exploiting unitary constraints," IEEE Trans. Signal Process., vol. 50, no. 3, pp. 635-650, Mar. 2002.
IEEE Trans. Signal Process. , vol. 50, no. 3, pp. 635–650, Mar. 2002. Bandlimited, GS with discrete alphabetTV-GL ESA-GL DTV-GL DESA-GL SpecTemp Kalofolias Dong et al.RG, K = 2 , M = 120 F-measure 0.795 0.827 0.821 0.811 0.449 0.711 0.762Precision 0.669 0.726 0.715 0.693 0.291 0.560 0.633Recall 0.987 0.966 0.971 0.983 1 0.995 0.968 ¯ ρ ( L , ˆ L ) ¯ E RG, K = 6 , M = 1000 F-measure 0.809 0.840 0.779 0.839 0.151 0.745 0.746Precision 0.685 0.728 0.641 0.726 0.090 0.634 0.634Recall 0.992 0.997 0.997 0.997 0.479 0.909 0.913 ¯ ρ ( L , ˆ L ) ¯ E TABLE III: Performance comparison between TV-GL, ESA-GL, DTV-GL, DESA-GL, SpecTemp [13], Kalofolias [8] andDong et al. [10] for bandlimited, GS with discrete alphabet, over RG with K = 2 , clusters.A PPENDIX AC LOSED - FORM SOLUTION FOR PROBLEM ˜ U k The proof is conceptually similar to that given in [37], [28],for unitary transform, with the only difference that in our caseone eigenvector is known a priori. Hence, given the sparse datamatrix S k ∈ R N × M and the observations matrix Y ∈ R N × M ,we derive closed form solution for the optimization problem U k . Note that the objective function is equivalent to k U T Y − S k k F = tr (cid:16) U T YY T U + S k ( S k ) T − U T Y ( S k ) T (cid:17) and using the orthonormality property U T U = I , problem U k becomes max U ∈ R N × N tr (cid:16) U T Y ( S k ) T (cid:17) ( Q k ) s.t. U T U = I , u = b . (18)Defining ¯ Y k = Y ( S k ) T , it holdstr (cid:16) U T ¯ Y k (cid:17) = N X i =1 u Ti ¯ y ki = b T ¯ y k + N X i =2 u Ti ¯ y ki (19)where ¯ y ki represents the i th column of ¯ Y k . Therefore, byintroducing the matrices ¯ U = [ u , . . . , u N ] and Z k =[¯ y k , . . . , ¯ y kN ] , problem Q k is equivalent to the following non-convex problem max ¯ U ∈ R N × ( N − tr (cid:16) ¯ U T Z k (cid:17) ( ¯ Q k ) s.t. ¯ U T ¯ U = I N − , T ¯ U = T . (20)The Lagrangian function associated to ¯ Q k can be written as L ( ¯ U ) = tr (cid:16) ¯ U T Z k (cid:17) − tr (cid:16) ¯ Λ ( ¯ U T ¯ U − I N − ) (cid:17) + T ¯ U ¯ µ where ¯ Λ ∈ R ( N − × ( N − and ¯ µ ∈ R N − × contain theLagrangian multipliers associated to the constraints. Then, theKKT necessary conditions for the solutions optimality area) ∇ ¯ U L ( ¯ U ) = Z k − ¯ U ( ¯ Λ + ¯ Λ T ) + ¯ µ T = N × N − b) ¯ Λ ⊥ ( ¯ U T ¯ U − I N − ) = c) ¯ µ ⊥ T ¯ U = T (21) where A ⊥ B stands for h A , B i , tr ( AB T ) . From a), assum-ing w.l.o.g. ¯ Λ = ¯ Λ T , one gets U ¯ Λ = Z k + ¯ µ T (22)which, after multiplying both sides by T and using c), gives T ( Z k + ¯ µ T ) = T or ¯ µ T = − T Z k /N . Plugging this lastequality in (22), we have U ¯ Λ = PZ k (23)where P = I − T /N . Then, from (23), it holds Λ ¯ U T ¯ UΛ = ( Z k ) T PZ k ⇒ ¯ Λ = (( Z k ) T PZ k ) / / , (24)where the last equality follows from b). Hence, replacing (24)in (23), we get ¯ U (( Z k ) T PZ k ) / = PZ k . (25)Observe that Z k ⊥ = PZ k is the orthogonal projection of Z k onto the orthogonal complement of the one-dimensional spacespan { } , so we yield ¯ U (( Z k ) T ⊥ Z k ⊥ ) / = Z k ⊥ . (26)Let denote with Z k ⊥ = XΣV T the svd decomposition of Z k ⊥ where X ∈ R N × N , V ∈ R ( N − × ( N − and Σ is a diagonalrectangular N × ( N − matrix. More specifically, if r = rank ( Z k ⊥ ) we can rewrite Z k ⊥ as Z k ⊥ = XΣV T = [ X r X s ] ΣV T (27)where the N × r matrix X r contains as columns the r left-eigenvectors associated to the non-zeros singular values of Z k ⊥ , X s is a N × N − r matrix selected as belonging to the nullspaceof the matrix B = [ T ; X Tr ] , i.e. BX s = . 
This choice of $X$ guarantees that all of its columns but the last one are orthogonal to $\mathbf{1}$, while $X^T X = I_N$. Therefore, by using $Z^k_{\perp} = X \Sigma V^T$ in (26), we have

$\bar{U} \big( V \Sigma^T \Sigma V^T \big)^{1/2} = X \Sigma V^T \quad (28)$

or $\bar{U} V (\Sigma^T \Sigma)^{1/2} V^T = X \Sigma V^T$, and then

$\bar{U} V (\Sigma^T \Sigma)^{1/2} = X \Sigma. \quad (29)$

Since $\Sigma$ is an $N \times (N-1)$ rectangular diagonal matrix, it holds

$X \Sigma = X_{-} \Sigma_{-} \quad (30)$

where $X_{-}$ and $\Sigma_{-}$ are the matrices obtained by removing from $X$ its last column and from $\Sigma$ its last, all-zero, row, respectively. Hence, noting that $(\Sigma^T \Sigma)^{1/2} = \Sigma_{-}$, it holds

$\bar{U} V (\Sigma^T \Sigma)^{1/2} = X_{-} \Sigma_{-} \;\Rightarrow\; \bar{U} V \Sigma_{-} = X_{-} \Sigma_{-} \quad (31)$

so that the optimal solution is

$\bar{U}^{\star} = X_{-} V^T \quad (32)$

and

$U^{\star} = [\mathbf{b} \;\; \bar{U}^{\star}]. \quad (33)$

Let us now prove that $\bar{U}^{\star}$ is a global maximum for problem $\bar{\mathcal{Q}}_k$. First observe that, from the orthogonality condition $\bar{U}^T \mathbf{1} = \mathbf{0}$, we have

$\mathrm{tr}\big( \bar{U}^T Z^k \big) = \mathrm{tr}\big( \bar{U}^T Z^k_{\perp} \big) = \mathrm{tr}\big( V^T \bar{U}^T X \Sigma \big). \quad (34)$

Defining $Q = V^T \bar{U}^T X$, it holds

$Q Q^T = V^T \bar{U}^T X X^T \bar{U} V = I_{N-1}. \quad (35)$

From (34), and using $Q^{\star} = V^T \bar{U}^{\star T} X = I_{N-1,N}$, with $I_{N-1,N} = [I_{N-1} \;\; \mathbf{0}]$, we have to prove that

$\mathrm{tr}(Q \Sigma) \leq \mathrm{tr}(I_{N-1,N} \Sigma), \quad \forall \, Q : Q Q^T = I_{N-1} \quad (36)$

with equality if and only if $Q = I_{N-1,N}$. The inequality in (36) holds because $\Sigma_{ii} \geq 0$ and, from $Q Q^T = I_{N-1}$, we get $Q_{ii} \leq |Q_{ii}| \leq 1$ [37]. On the other hand, $Q_{ii} = 1$ for all $i$ if and only if $Q = I_{N-1,N}$, so that the maximum is achieved at $Q^{\star}$.
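The closed-form update (32)-(33) can then be sketched end to end as follows (synthetic data; left_basis is the helper from the previous sketch, and the soft-thresholded surrogate for $S^k$ is purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 50
Y = rng.standard_normal((N, M))                       # observation matrix Y
Sk = np.sign(Y) * np.maximum(np.abs(Y) - 0.5, 0.0)    # stand-in sparse matrix S^k
Ybar = Y @ Sk.T                                       # Ybar^k = Y (S^k)^T
Zk = Ybar[:, 1:]                                      # Z^k = [ybar_2, ..., ybar_N]
X, s, V = left_basis(Zk)
Ubar = X[:, :-1] @ V.T                                # Ubar* = X_- V^T, eq. (32)
U = np.hstack([np.ones((N, 1)) / np.sqrt(N), Ubar])   # U* = [b, Ubar*], eq. (33)
assert np.allclose(U.T @ U, np.eye(N), atol=1e-8)     # orthonormality of U*
assert np.allclose(np.ones(N) @ Ubar, 0, atol=1e-8)   # 1^T Ubar* = 0^T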
APPENDIX B
DERIVATION OF THE COMPACT FORM IN (16)

In this section we show how equations a)-c) and f) in system (15) can be reduced to the matrix form in (16). Note that, from conditions c), the vectorization of $L$ reads

$\mathrm{vec}(L) = M \, \mathrm{vech}(L) \triangleq M \mathbf{z} \quad (37)$

where $\mathrm{vec}(L)$ is the $N^2$-dimensional column vector obtained by stacking the columns of the matrix $L$ on top of one another, and $\mathbf{z} \triangleq \mathrm{vech}(L) \in \mathbb{R}^{N(N-1)/2}$, with $\mathrm{vech}(L)$ the half-vectorization of $L$ obtained by vectorizing only the lower triangular part (except the diagonal entries) of $L$. To define the duplication matrix $M \in \mathbb{R}^{N^2 \times N(N-1)/2}$ that meets conditions b)-f), we first introduce the matrix $M_1$, which can be described in terms of its rows as follows: for $i \geq j$, the $[(j-1)N + i]$-th and the $[(i-1)N + j]$-th rows of $M_1$ equal the $[(j-1)(2N-j)/2 + i]$-th row of $I_{N(N+1)/2}$. Then, $M$ can be derived from $M_1$ as follows: i) first, remove from $M_1$ the columns of index $(k-1)N + k - \sum_{l=1}^{k-1} l$, for $k = 1, \ldots, N$, thus defining the new matrix $M_2$; ii) replace in $M_2$ the rows of index $k + N(k-1)$ with the vector $\mathbf{v}_k$ derived by summing up the rows of the matrix $D_k = -M_2(1 + N(k-1) : Nk, :)$, for $k = 1, \ldots, N$.
As an example, if we set $N = 3$, the matrix $M$ reads as

$M = \begin{bmatrix} -1 & 1 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 & -1 & 1 & 0 & 1 & -1 \end{bmatrix}^T. \quad (38)$
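The same matrix can be generated programmatically for any $N$; the sketch below (helper name ours) builds $M$ directly, which is equivalent to steps i)-ii) above: each column of $M$ places a lower-triangular entry $L_{ij}$ in the two symmetric positions of $\mathrm{vec}(L)$ and subtracts it from the two diagonal entries $L_{ii}$ and $L_{jj}$, enforcing zero row sums:

import numpy as np

def duplication_like(N):
    # Columns indexed by the strictly lower-triangular entries of L,
    # in the vech order z = [L21, ..., LN1, L32, ..., LN2, ...].
    pairs = [(i, j) for j in range(N) for i in range(j + 1, N)]
    M = np.zeros((N * N, len(pairs)))
    for c, (i, j) in enumerate(pairs):        # 0-based indices
        M[j * N + i, c] = 1                   # L_ij, column-major vec(L)
        M[i * N + j, c] = 1                   # symmetry: L_ji = L_ij
        M[i * N + i, c] = -1                  # diagonal: L_ii = -sum_{j != i} L_ij
        M[j * N + j, c] = -1
    return M

M3 = duplication_like(3)   # reproduces (38) for N = 3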
Therefore, the first equation in (15) can be vectorized as $\mathrm{vec}(L U_K) - \mathrm{vec}(U_K \Lambda_K) = \mathbf{0}$. We can now exploit the property of the vec operator, namely $\mathrm{vec}(AB) = (I \otimes A)\mathrm{vec}(B) = (B^T \otimes I)\mathrm{vec}(A)$, and define to that purpose the matrices $B \in \mathbb{R}^{KN \times N(N-1)/2}$ and $Q \in \mathbb{R}^{KN \times K}$ as

$B = (U_K^T \otimes I_N) M, \qquad Q = (I_K \otimes U_K) \Big( \sum_{k=1}^{K} \mathbf{e}_k \otimes E_k \Big)$

where $\mathbf{e}_k$ denotes the canonical $K$-dimensional vector with the $k$-th entry equal to $1$ and $E_k = \mathbf{e}_k \mathbf{e}_k^T$. Note that the matrix $Q$ is full-column rank (with rank $K$), since its $k$-th column is $[\mathbf{0}; \ldots; \mathbf{u}_k; \ldots; \mathbf{0}]$, with $\mathbf{u}_k$ in the $k$-th block. Let us introduce the matrix $Q_{-}$ obtained by removing from $Q$ the column with index corresponding to $\lambda_{K,1}$. Hence, one can easily see that part of system (15) reduces to the following form

$F \mathbf{x} = \mathbf{b}, \quad \mathbf{x} \in \mathbb{R}^{N(N-1)/2 + K - 1} \quad (39)$

where we defined the coefficient matrix $F \in \mathbb{R}^{m \times n}$, with $m = KN + 1$ and $n = N(N-1)/2 + K - 1$, as

$F \triangleq \begin{bmatrix} -B & -Q_{-} \\ \mathbf{1}^T_{N(N-1)/2} & \mathbf{0}^T_{K-1} \end{bmatrix} \quad (40)$

and $\mathbf{x} \triangleq [-\mathbf{z}; \bar{\boldsymbol{\lambda}}]$, with $\bar{\boldsymbol{\lambda}} \triangleq \{\lambda_{K,i}\}_{i \in \mathcal{K}_{-}}$, where, assuming the entries of $\boldsymbol{\lambda}_K$ sorted in increasing order, the index set $\mathcal{K}_{-}$ is obtained by removing from $\mathcal{K}$ the first index, corresponding to $\lambda_{K,1}$; finally, $\mathbf{b} = [\mathbf{0}_{KN}; p]$.

APPENDIX C

A. Proof of Proposition 1

Assume that the set is feasible, i.e., there exists at least a point $\mathbf{x} \in \mathcal{X}(U_K)$. The size of the solution set depends on the rank $q$ of the coefficient matrix $F$ and of the augmented matrix $[F, \mathbf{b}]$. Let us distinguish the two cases: $m \leq n$, which leads to $K \leq \frac{N^2 - N - 4}{2(N-1)}$, and $m > n$, or $K > \frac{N^2 - N - 4}{2(N-1)}$. Then, necessary conditions for the system (16) to admit at least one solution are: $q = \mathrm{rank}(F) = \mathrm{rank}([F, \mathbf{b}]) \leq m = KN + 1$ for $K \leq \frac{N^2 - N - 4}{2(N-1)}$; and $q = \mathrm{rank}(F) = \mathrm{rank}([F, \mathbf{b}]) \leq n$ for $K > \frac{N^2 - N - 4}{2(N-1)}$. Additionally, since the rank of $Q_{-}$ is $K - 1$, it holds $K - 1 \leq \mathrm{rank}(F)$. This proves statement a). To show b), note that, from the feasibility of $\mathcal{X}(U_K)$, the condition $q = m = n$, i.e., $K = \frac{N^2 - N - 4}{2(N-1)}$, is sufficient for the set to be a singleton. Additionally, if $K > \frac{N^2 - N - 4}{2(N-1)}$ and $\mathrm{rank}(F) = n = N(N-1)/2 + K - 1$, the set is a singleton.
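The rank conditions above are straightforward to test numerically. The following sketch (helper names ours) assembles $F$ as in (40) from the $K$ known eigenvectors, assuming the eigenvalues are sorted in increasing order so that the first column of $Q$ corresponds to $\lambda_{K,1} = 0$; duplication_like is the helper from the previous sketch:

import numpy as np

def build_F(U_K, M):
    N, K = U_K.shape
    B = np.kron(U_K.T, np.eye(N)) @ M                  # B = (U_K^T kron I_N) M
    Q = np.zeros((K * N, K))
    for k in range(K):
        Q[k * N:(k + 1) * N, k] = U_K[:, k]            # column k is [0;...;u_k;...;0]
    Q_minus = Q[:, 1:]                                 # drop column of lambda_{K,1} = 0
    top = np.hstack([-B, -Q_minus])
    bottom = np.hstack([np.ones((1, M.shape[1])),      # last row of (40): 1^T(-z) = p
                        np.zeros((1, K - 1))])
    return np.vstack([top, bottom])                    # (KN+1) x (N(N-1)/2 + K - 1)

def is_singleton(F):
    # Proposition 1 b): under feasibility, rank(F) = n implies a singleton set.
    return np.linalg.matrix_rank(F) == F.shape[1]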
B. Proof of Proposition 2

Assume that the set $\mathcal{X}(U_K)$, with $m < n$, is feasible and that $\{\bar{\mathbf{x}} \,|\, F\bar{\mathbf{x}} = F\mathbf{x}, \, \bar{\mathbf{x}} \geq \mathbf{0}\}$ is a singleton for any nonnegative $s$-sparse signal $\mathbf{x}$. This implies $\{\bar{\mathbf{x}} \,|\, F\bar{\mathbf{x}} = F\mathbf{x}, \, \bar{\mathbf{x}} \geq \mathbf{0}\} = \{\mathbf{x}\}$, with $\mathbf{x}$ a nonnegative $s$-sparse vector. Then, from the proposition in [30], if $F \in \mathbb{R}^{m \times n}$ with $m < n$, or $\frac{N^2 - N - 4}{2(N-1)} > K$, it holds that $m \geq 2s + 1$, i.e., $K \geq 2s/N$. Therefore we get $\frac{N^2 - N - 4}{2(N-1)} > K \geq 2s/N$. Additionally, from the feasibility of $\mathcal{X}(U_K)$, if $c$ denotes the multiplicity of the zero eigenvalue of $L$, i.e., the number of connected components of the graph, it results that $\|\bar{\boldsymbol{\lambda}}\|_0 = K - c$, so that $s = K - c + \|A\|_0 / 2$. From $m \geq 2s + 1$, this implies $KN \geq 2(K - c) + \|A\|_0$, or $\|A\|_0 \leq K(N-2) + 2c$.
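For instance, for $N = 10$, $K = 4$, and a connected graph ($c = 1$), the bound gives $\|A\|_0 \leq 4(10-2) + 2 = 34$; since each undirected edge contributes two nonzero entries of $A$, the inferred topology can have at most $17$ edges.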