Graph Learning from Filtered Signals: Graph System and Diffusion Kernel Identification
Hilmi E. Egilmez, Student Member, IEEE, Eduardo Pavez, Student Member, IEEE, and Antonio Ortega, Fellow, IEEE
Abstract—This paper introduces a novel graph signal processing framework for building graph-based models from classes of filtered signals. In our framework, graph-based modeling is formulated as a graph system identification problem, where the goal is to learn a weighted graph (a graph Laplacian matrix) and a graph-based filter (a function of graph Laplacian matrices). In order to solve the proposed problem, an algorithm is developed to jointly identify a graph and a graph-based filter (GBF) from multiple signal/data observations. Our algorithm is valid under the assumption that GBFs are one-to-one functions. The proposed approach can be applied to learn diffusion (heat) kernels, which are popular in various fields for modeling diffusion processes. In addition, for specific choices of graph-based filters, the proposed problem reduces to a graph Laplacian estimation problem. Our experimental results demonstrate that the proposed algorithm outperforms the current state-of-the-art methods. We also implement our framework on a real climate dataset for modeling of temperature signals.
Index Terms—Graph learning, graph signal processing, graph-based filtering, graph system identification, diffusion kernels, heat kernels.
I. INTRODUCTION

GRAPHS are fundamental mathematical structures used in various fields to characterize data, signals and processes. Particularly in signal processing, machine learning and statistics, structured modeling of signals/data by means of graphs is essential in numerous problems including clustering [1]-[3] and regularized regression and denoising [4]-[6], where graphs provide concise (sparse) representations for effective modeling and analysis of signals/data [7]. Graph signal processing [6] offers a new general paradigm for processing and analyzing signals/data on graphs, referred to as graph signals, by using graph Laplacian matrices to extend basic signal processing operations such as filtering [9], [10], transformation [11], [12] and sampling [13] to graph signals. However, in practice, datasets typically consist of an unstructured list of samples, where the graph information (representing the structural relations between samples/features) is latent. For analysis, learning, processing and algorithmic purposes, it is often useful to build graph-based models that provide a concise explanation for datasets and also reduce the dimension of the problem [14], [15] by exploiting the available prior knowledge and assumptions about the desired graph (e.g., structural information

(The authors are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089 USA. Contact author e-mail: [email protected]. This work was funded in part by NSF under grant CCF-1410009.)

(In [8], adjacency matrices are used to define basic processing operations. In this work, we adopt the graph Laplacian based construction in [6].)

Fig. 1. Input-output relation of a graph system defined by a graph Laplacian (L) and a graph-based filter (h): x_out = h(L) x_in. In this work, we focus on joint identification of L and h, given the observed data x_out.
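As a concrete sketch of the input-output relation in Fig. 1, the snippet below builds a small combinatorial Laplacian and applies an exponential decay GBF (a diffusion kernel, see Definition 2) to an impulse input. The 4-node graph and the value of β are illustrative choices, not taken from the paper; `scipy.linalg.expm` is used only to cross-check the spectral construction h(L) = U h(Λ) Uᵀ.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative 4-node weighted path graph (weights are arbitrary).
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W            # combinatorial graph Laplacian

beta = 0.5                                # hypothetical diffusion bandwidth
lam, U = np.linalg.eigh(L)                # graph frequencies and GFT basis
h_of_L = U @ np.diag(np.exp(-beta * lam)) @ U.T   # h(L) = U h(Lambda) U^T

x_in = np.array([1., 0., 0., 0.])         # impulse input signal
x_out = h_of_L @ x_in                     # graph system output x_out = h(L) x_in

# Cross-check: the spectral construction equals the matrix exponential.
assert np.allclose(h_of_L, expm(-beta * L))
```

Since L1 = 0, the rows of exp(−βL) sum to one, so the diffusion redistributes the signal over the graph while preserving its total mass.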
including connectivity and sparsity level) and observed data (e.g., signal smoothness).

The focus of this paper is to build graph-based models from signals/data, where the models of interest are defined by graph systems consisting of a graph (a graph Laplacian matrix L) and a graph-based filter (GBF h), as depicted in Fig. 1. For graph-based modeling, graph systems provide a general abstraction, in which graphs represent pairwise relations between the signal samples and GBFs model smoothness of signals defined on a designated graph. In particular, graph systems (with specific choices of GBFs) have been introduced to represent diffusion processes. Prominent examples include diffusion kernels defined on graphs [16] and polynomials of graph Laplacian matrices used to define localized diffusion operators on graphs [6]. Models based on graph systems can be useful in a broad range of applications, including signal/data processing on computer, social, sensor, energy, transportation and biological networks, when signals are observed without knowledge of the graph and GBF associated with the underlying network. For example, signals can be diffused on an unknown network (L) with an unknown rate (β) as the smoothness parameter of a GBF (h_β).

In the literature, there has been a lot of recent interest in the problem of learning graph-based models from data. Several papers, including [17]-[19], consider the problem of learning a graph, represented by a matrix such as a graph Laplacian, where the goal is to find the graph that provides the best fit to the training data according to different criteria. A few studies (see Section III for more details) consider that the training data are the output of an unknown graph system and propose methods under different assumptions on GBFs. In [20], [21], the authors aim to learn a graph (i.e., a graph Laplacian or an adjacency matrix) from eigenvectors of the sample covariance of training data, where the GBF is an arbitrary function.
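The shared-eigenvector property exploited by these approaches can be checked numerically: whenever the covariance is a matrix function of the Laplacian, Σ = h(L) = U h(Λ) Uᵀ, the matrices Σ and L are simultaneously diagonalizable. A small sketch (the graph weights and the filter parameter are illustrative assumptions):

```python
import numpy as np

W = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W             # CGL with distinct eigenvalues

lam, U = np.linalg.eigh(L)                 # ascending graph frequencies
Sigma = U @ np.diag(np.exp(-0.3 * lam)) @ U.T   # covariance h(L), exponential GBF

# Eigenvectors of Sigma match those of L up to sign; since h is decreasing,
# eigh's ascending ordering for Sigma is the reverse of the ordering for L.
_, V = np.linalg.eigh(Sigma)
V = V[:, ::-1]
for i in range(3):
    assert np.isclose(abs(U[:, i] @ V[:, i]), 1.0)
```

With finite data only a noisy sample covariance is available, which is why (as argued below) the accuracy of such eigenvector-based methods degrades when the number of samples is small.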
In [22], the GBF is assumed to be a polynomial function of a specific order, and the goal is to learn a graph (an adjacency matrix) and the coefficients of a polynomial GBF. However, with such general assumptions on GBFs, it is hard to provide guarantees about the performance of the proposed methods. Specifically, in [22] the solutions require explicit sparsity conditions on both graphs and polynomials of adjacency matrices, which are restrictive and may not hold in some practical cases. Moreover, the approaches in [20], [21] are based on the assumption that the sample covariance of the observed data and the graph Laplacian have the same set of eigenvectors. Thus, their performance strictly depends on the accuracy of the eigenvectors obtained from the sample covariance of the data, which is not a good estimator of the true covariance of a model, especially when the number of data samples is small [23], [24]. Indeed, better eigenvector estimates can be obtained using inverse covariance estimation methods [24], [25] or graph learning methods, such as those introduced in our prior work [17], which minimize a regularized maximum-likelihood (ML) criterion.

In this paper, we propose an extension of our prior work [17] to estimate the parameters of a graph system (a graph and a GBF) from data. For this purpose, we formulate a graph system identification (GSI) problem with a regularized ML criterion, where our goal is to jointly identify a combinatorial graph Laplacian (CGL) matrix L and a GBF h from multiple signal/data observations under the main assumption that the GBF h is a one-to-one function. While this assumption is not as general as the assumptions on GBFs made in existing studies [20]-[22], several useful GBFs have this property, as shown in Table I. The key novelty of our work is to propose new methods that rely on the one-to-one GBF property to learn both the graph and the GBF with stronger optimality guarantees as compared to the approaches in [20]-[22].
In Section IV-D, we will show that this assumption is one of the sufficient conditions required for graph system identifiability, and it also allows our algorithm to perform a prefiltering step that significantly improves the estimation accuracy. Although our algorithm can be extended to any one-to-one GBF, we focus on methods to estimate the parametric GBFs listed in Table I, which are one-to-one functions that depend on a single parameter β and have useful applications discussed in Section IV-C. The first three GBF types in the table define basic scaling and shifting operations, while the exponential decay and β-hop localized GBFs provide diffusion-based models. Since all these GBFs yield larger filter responses (h_β) at lower graph frequencies (λ), as illustrated in Fig. 2, they can be used for modeling different classes of smooth graph signals in which most of the signal energy is concentrated in the low graph frequencies.

In order to solve the GSI problem, we propose an alternating optimization algorithm that first optimizes the graph by fixing the GBF and then designs the GBF by fixing the graph. Basically, the proposed algorithm involves three main steps, which are repeated until convergence is achieved:
• The graph-based filter (GBF) identification step designs a GBF h for the current estimate of the graph Laplacian L, so that its inverse (i.e., h⁻¹) can be used in the prefiltering step. Note that h has to be a one-to-one function for its inverse h⁻¹ to exist.
• The prefiltering step filters the observed signals using h⁻¹ to compute the covariance matrix that will be used in the graph Laplacian estimation step. We show that this step significantly improves the accuracy of graph estimation.
• The graph Laplacian estimation step learns a graph from the covariance of the prefiltered signals obtained in the previous step by using the combinatorial graph Laplacian (CGL) estimation algorithm introduced in our prior work [17].
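The role of the one-to-one assumption in the prefiltering step can be illustrated in the idealized, noiseless limit where the true covariance Σ = h_β(L) is available: applying the scalar inverse h_β⁻¹ from Table I to the spectrum of Σ recovers L exactly. The sketch below assumes the exponential decay GBF with known β and an illustrative graph; it does not reproduce the paper's GBF identification or regularized-ML estimation steps.

```python
import numpy as np

W = np.array([[0., 2., 0.],
              [2., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

beta = 0.4                                  # assumed known for this sketch
lam, U = np.linalg.eigh(L)
Sigma = U @ np.diag(np.exp(-beta * lam)) @ U.T   # model covariance h_beta(L)

# "Prefiltering" in the infinite-sample limit: invert the filter response
# on the covariance spectrum using h_beta^{-1}(s) = -log(s)/beta (Table I).
s, V = np.linalg.eigh(Sigma)
L_rec = V @ np.diag(-np.log(s) / beta) @ V.T

assert np.allclose(L_rec, L, atol=1e-8)     # Laplacian recovered exactly
```

With finite samples, Σ is replaced by the sample covariance and the recovered matrix is only an estimate, which is why the algorithm feeds the prefiltered covariance into the CGL estimation step of [17] rather than using it directly.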
Although the present paper focuses on graphs associated with CGL matrices, our solution can be easily extended to other types of graph Laplacian matrices, e.g., generalized graph Laplacians [26].

In order to accommodate the GBFs in Table I in our algorithm, we propose specific methods (for the GBF identification step) to find the filter parameter β fully characterizing the selected GBF. Our proposed algorithm guarantees optimal identification of β and L in an ℓ1-regularized ML sense for frequency shifting, variance shifting and β-hop localized GBFs. However, for frequency scaling and exponential decay GBFs, our algorithm cannot find the optimal β in general, but it guarantees that the estimated L is optimal up to a constant factor. In practice, the type of GBF and its parameter β can also be selected based on the prior knowledge available about the problem and the application. In this case, different GBFs and their parameters can be tested until the estimated graph and GBF pair achieves the desired performance (e.g., in terms of mean square error, likelihood or sparsity), where the parameter β serves as a regularization parameter that can be used, for example, for tuning the smoothness of the signal. Moreover, our algorithm can be extended to support GBFs beyond the filters in Table I (including one-to-one functions with more than one parameter) by developing specific methods for the GBF identification step. As long as a specified GBF (h) has an inverse function (h⁻¹), the proposed prefiltering and graph Laplacian estimation steps can be directly utilized to learn graphs from signals/data.

To the best of our knowledge, this is the first work that (i) formulates the graph-based modeling problem as identification of graph systems with the types of GBFs in Table I under a regularized ML criterion and (ii) proposes a prefiltering-based algorithm to jointly identify a graph and a GBF.
Existing related studies (see Section III) consider optimization of different criteria (see Table III) and do not use prefiltering, which can be shown to be optimal in some cases (see Section V). The proposed approach can significantly improve the accuracy of graph learning from filtered signals, as compared to existing methods that estimate a graph directly from the observed signals without prefiltering. In particular, if the observed signals are diffused/filtered on an unknown network/graph to be learned, applying a graph learning algorithm (e.g., the CGL algorithm in [17]) to the diffused signals potentially results in a dense graph due to diffusion, even when the underlying graph is sparse. On the other hand, our proposed algorithm reverses the effect of diffusion via prefiltering (h⁻¹), whose corresponding GBF (h) is jointly estimated with the graph (L). Thus, the latent graph can be learned more accurately. Note that the diffusion (heat) kernels [16] used in a number of applications [4], [5], [27] are special cases of graph systems with exponential decay GBFs, and thus our proposed algorithm can be applied to learn their parameters.

The rest of the paper is organized as follows. Section II presents the notation and some concepts used throughout

TABLE I
GBFS WITH PARAMETER β AND CORRESPONDING INVERSE FUNCTIONS, WHERE λ† DENOTES THE PSEUDOINVERSE OF A SCALAR λ, THAT IS, λ† = 1/λ FOR λ ≠ 0 AND λ† = 0 FOR λ = 0.

Filter name | h_β(λ) | h_β⁻¹(s) | Range of β
Frequency scaling | (βλ)† | (βs)† | β ∈ R, β > 0
Frequency shifting | (λ + β)† | s† − β | β ∈ R, β ≥ 0
Variance shifting | λ† + β | (s − β)† | β ∈ R, β ≥ 0
Exponential decay | exp(−βλ) | −log(s)/β | β ∈ R, β > 0
β-hop localized | (λ†)^β | (s†)^{1/β} | β ∈ N, β ≥ 1

TABLE II
LIST OF SYMBOLS AND THEIR MEANING
Symbols | Meaning
L | L_c : graph Laplacian or CGL matrix | set of CGLs
h, h_β | H : GBF function | set of GBFs
λ_i, λ_i(L) : i-th eigenvalue of L in ascending order
n | k : number of vertices | number of data samples
I | W | D : identity matrix | adjacency matrix | degree matrix
1 | 0 : column vector of ones | column vector of zeros
Θ⁻¹ | Θ† : inverse of Θ | pseudo-inverse of Θ
Θᵀ | θᵀ : transpose of Θ | transpose of θ
det(Θ) | |Θ| : determinant of Θ | pseudo-determinant of Θ
(Θ)_ij : entry of Θ at the i-th row and j-th column
(θ)_i : i-th entry of θ
≥ (≤) : element-wise greater (less) than or equal to operator
Θ ⪰ 0 : Θ is a positive semidefinite matrix
Tr(Θ) | logdet(Θ) : trace of Θ | natural logarithm of det(Θ)
p(x) : probability density function of random vector x
x ∼ N(0, Σ) : zero-mean multivariate Gaussian with covariance Σ
‖θ‖₁, ‖Θ‖₁ : sum of absolute values of all elements (ℓ1-norm)
‖θ‖₂², ‖Θ‖F² : sum of squared values of all elements

the paper. Section III discusses the prior related work. In Section IV, the GSI problem and its variations are formulated. Additionally, the graph system identifiability conditions are introduced. In Section V, we derive optimality conditions and develop methods to solve the GSI problem. The experimental results are presented in Section VI, and Section VII draws some conclusions.

II. NOTATION AND PRELIMINARIES
Throughout the paper, lowercase normal (e.g., a and θ), lowercase bold (e.g., a and θ) and uppercase bold (e.g., A and Θ) letters denote scalars, vectors and matrices, respectively. Unless otherwise stated, calligraphic capital letters (e.g., E and S) represent sets. Notation is summarized in Table II.

The graph-based models considered in this paper are defined based on undirected, simple weighted graphs with nonnegative edge weights and no self-loops. Let G = (V, E, f_w) be a simple weighted graph with n vertices in the set V = {v_1, . . . , v_n}, where the edge set E = {e | f_w(e) ≠ 0, ∀e ∈ P_u} is a subset of P_u, the set of all unordered pairs of distinct vertices, and f_w((v_i, v_j)) ≥ 0 for i ≠ j is a real-valued edge weight function. The adjacency matrix of G is an n × n symmetric matrix, W, such that (W)_ij = (W)_ji = f_w((v_i, v_j)) for (v_i, v_j) ∈ P_u. The degree matrix of G is an n × n diagonal matrix, D, with entries (D)_ii = Σ_{j=1}^{n} (W)_ij and (D)_ij = 0 for i ≠ j. The combinatorial graph Laplacian (CGL) of G is defined as L = D − W. Alternatively, the set of CGL matrices can be written as

L_c = {L | L ⪰ 0, (L)_ij ≤ 0 for i ≠ j, L1 = 0}.   (1)

By construction, CGLs are symmetric and positive semidefinite matrices, so each of them has a complete set of orthogonal eigenvectors u_1, u_2, . . . , u_n whose associated eigenvalues λ_1 ≤ λ_2 ≤ · · · ≤ λ_n are nonnegative real numbers. In addition, the CGL of a connected graph always has a zero eigenvalue (i.e., λ_1 = 0 with multiplicity one) whose associated eigenvector is u_1 = (1/√n)1.

In graph signal processing, the eigenpairs of a CGL matrix, (λ_1, u_1), (λ_2, u_2), . . . , (λ_n, u_n), provide a Fourier-like spectral interpretation for signals defined on graphs, where the graph frequency spectrum is defined by the eigenvalues of the CGL, which are called graph frequencies, and the eigenvectors of the CGL u_1, u_2, . . .
, u_n are the harmonics associated with the graph frequencies. Based on the eigenvectors of a CGL matrix L, the graph Fourier transform (GFT) is defined as the orthogonal matrix U (i.e., satisfying UᵀU = I) obtained by the eigendecomposition L = UΛUᵀ, where Λ denotes the diagonal matrix consisting of the eigenvalues λ_1, λ_2, . . . , λ_n (graph frequencies).

(Without loss of generality, we assume that the eigenvalues are ordered as λ_1 ≤ λ_2 ≤ · · · ≤ λ_n throughout the paper.)

For a given signal x = [x_1 x_2 · · · x_n]ᵀ defined on a graph G with n vertices, where x_i is attached to v_i (the i-th vertex), the GFT of x is x̂ = Uᵀx. Naturally, the variation of the GFT basis vectors gradually increases with the graph frequencies, and the GFT basis vectors corresponding to low frequencies are relatively smooth. As a specific example, the GFT basis vector u_1 = (1/√n)1, associated with the lowest graph frequency (λ_1 = 0), is the smoothest among the GFT basis vectors.

To quantify the smoothness of a graph signal x, a common variation measure used in graph signal processing is the graph Laplacian quadratic form, xᵀLx, which can be written in terms of the edge weights of G as

xᵀLx = Σ_{(i,j)∈I} (W)_ij (x_i − x_j)²   (2)

where (W)_ij = −(L)_ij, and I = {(i, j) | (v_i, v_j) ∈ E} is the set of index pairs of vertices associated with the edge set E. A smaller Laplacian quadratic term (xᵀLx) indicates a smoother graph signal (x). A more general notion of smoothness can be obtained by defining graph-based filters (GBFs), i.e., matrix functions h(L) = U h(Λ) Uᵀ, where U is the GFT and (h(Λ))_ii = h((Λ)_ii) = h(λ_i) for all i. Note that a GBF h serves as a scalar function of λ that maps graph frequencies λ_1, . . . , λ_n
Fig. 2. Illustration of the graph-based filters (GBFs) in Table I for different values of β, as a function of graph frequency h_β(λ): (a) frequency scaling, (b) frequency shifting, (c) variance shifting, (d) exponential decay, (e) β-hop localized.

to filter responses h(λ_1), . . . , h(λ_n). Thus, the measure in (2) can be generalized using a GBF h as

xᵀh(L)x = xᵀU h(Λ) Uᵀx = Σ_{i=1}^{n} h(λ_i) x̂_i²,   (3)

where x̂_i = u_iᵀx for i = 1, . . . , n denote the GFT coefficients. Based on a graph and a GBF, we formally define a graph system as follows.

Definition 1 (Graph System). A graph system is defined by a simple graph G and a GBF h such that the input-output relation of the system is x_out = h(L) x_in, where L ∈ L_c is a CGL matrix associated with G, and x_in is the input signal vector.

In this paper, the graph system parameter h is selected from a set of GBFs H determined based on the GBF types in Table I. As a specific example, the exponential decay GBFs lead to the set H = {h | h(λ) = exp(−βλ), β ∈ R and β > 0} and the operator h(L) = exp(−βL), which is also known as the diffusion (heat) kernel defined on a graph [16].

Definition 2 (Diffusion Kernels on Graphs). The diffusion kernel over a graph G is the matrix exponential of L, that is,

exp(−βL) = lim_{t→∞} (I − βL/t)^t,  t ∈ N,   (4)

where L denotes a graph Laplacian associated with G and the parameter β is a real number called the diffusion bandwidth.

Appendix A presents a derivation of diffusion kernels (i.e., exponential GBFs) based on a basic class of random diffusion processes defined by CGLs.

III. RELATED WORK
Table III summarizes prior related studies by comparing them against our present work in terms of (i) target variables, (ii) underlying assumptions and (iii) optimization criteria. In our previous work on graph learning from data [17], we proposed algorithms for estimating models based on three different types of graph Laplacians (including CGLs). The present paper extends our prior work by introducing more general graph-based models based on graph systems, defined by a CGL and a GBF, and by proposing an algorithm to learn the parameters of a graph system from multiple signal/data observations.

(Filter responses corresponding to eigenvalues with multiplicity more than one have the same value. Formally, if λ_i = λ_{i+1} = · · · = λ_{i+c−1} for some i > 0 and multiplicity c > 1, then h(λ_i) = h(λ_{i+1}) = · · · = h(λ_{i+c−1}).)

In the literature, there are several papers on graph learning from signals/data. Methods to estimate CGLs from smooth signals are proposed in [18] and [19], but, in contrast with our work, GBFs are not considered explicitly. Instead, the graph estimation problem is formulated as the minimization of a regularized graph Laplacian quadratic form (i.e., a smoothness metric for graph signals). In [20], [21], the authors focus on learning graph shift/diffusion operators (e.g., adjacency and graph Laplacian matrices) from a complete set of eigenvectors that has to be computed from the observed data. More specifically, Segarra et al. [20] solve a sparse recovery problem by minimizing the ℓ1-norm of the target variable (i.e., minimizing ‖Θ‖₁ where Θ is the target variable) to infer the topology of a graph, and Pasdeloup et al. [21] estimate an adjacency matrix by minimizing its trace, Tr(Θ), as well as its ℓ1-norm, ‖Θ‖₁. In fact, the problems in [20] and [21] are equivalent if the target matrix is constrained to be a CGL, since the problems of minimizing ‖L‖₁ and Tr(L) over the set of CGL matrices (i.e., L ∈ L_c) lead to the same solution.
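One way to see this equivalence: every CGL has nonpositive off-diagonal entries and zero row sums, so ‖L‖₁ = Tr(L) + Σ_{i≠j}|(L)_ij| = 2 Tr(L), and the two objectives differ only by a constant factor. A quick check with a randomly weighted CGL (the random graph is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                         # random symmetric nonnegative weights
L = np.diag(W.sum(axis=1)) - W      # a valid CGL

assert np.allclose(L.sum(axis=1), 0.0)                 # L1 = 0
assert np.isclose(np.abs(L).sum(), 2.0 * np.trace(L))  # ||L||_1 = 2 Tr(L)
```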
Note that both of these methods [20], [21] use only the eigenvectors of the sample covariance, while its eigenvalues (which also carry graph information implicitly) are not exploited in graph estimation. Although our approach also requires a set of eigenvectors to be computed, they are not directly used to estimate a graph. Instead, the computed eigenvectors and their associated eigenvalues are used in the prefiltering step, and then a graph is estimated from the covariance of the prefiltered signals by minimizing a regularized ML criterion. The optimality of our prefiltering-based approach is discussed in Section V.

Thanou et al. [28] address the estimation of a graph and a sparse input (i.e., localized sources) from a set of observed signals, and they propose a dictionary-based method, where a graph estimation problem is solved to construct a dictionary consisting of multiple diffusion kernels, and the resulting dictionary is used for identifying the localized sources in the diffusion process. Noting that our paper focuses on the GSI problem without locality assumptions on diffused/filtered signals (see Table III), when no locality assumptions are imposed and a single diffusion kernel is used in the dictionary, the problem in [28] reduces to the graph estimation problem in [18], formulated as the minimization of a regularized graph Laplacian quadratic form. In contrast with the work in [28], our algorithm iteratively solves a different graph learning problem with a regularized ML criterion (i.e., the CGL estimation problem in [17]) and also performs prefiltering on the observed signals. Since our algorithm often results in a more accurate

(We refer to [17] for a discussion of methods beyond CGL estimation.)
TABLE III
AN OVERVIEW OF THE RELATED WORK ON LEARNING GRAPHS FROM SMOOTH/DIFFUSED SIGNALS.

References | Target variable(s) | Assumptions (graph / filter / signal) | Optimization criterion
Dong et al. [18], Kalofolias [19] | graph Laplacian | none / none / none | regularized Laplacian quadratic form
Segarra et al. [20] | graph shift operator | none / none / none | ℓ1-norm of graph shift operator
Pasdeloup et al. [21] | adjacency matrix | none / none / none | trace and ℓ1-norm of adjacency
Thanou et al. [28] | multiple heat kernels, source locations | none / heat kernel / localized | regularized least squares
Mei and Moura [22] | polynomials of adjacency | sparse / polynomial / none | regularized least squares
Egilmez et al. [17] | graph Laplacian | none / none / none | regularized maximum likelihood
This work | graph Laplacian, GBF (Table I) | none / one-to-one function / none | regularized maximum likelihood

graph estimation as compared to the method in [18] (see Section VI for the details), it can be extended to construct potentially better dictionaries as alternatives to the ones in [28] used for identifying localized sources. Such an extension is out of the scope of the present paper. In [22], Mei and Moura address the estimation of polynomials of adjacency matrices by solving a regularized least-squares problem. As their counterparts, polynomials of graph Laplacians can be used to approximate the GBFs in Table I as well as many other types of filters, such as bandlimited GBFs. Since polynomial filters provide more degrees of freedom in designing GBFs, they can be considered more general than our GBFs of interest. However, polynomials are not one-to-one functions in general, so our proposed prefiltering step cannot be applied. The algorithm in [22] provides some optimality guarantees in a mean-square sense, but this is under a restrictive set of assumptions, which require the polynomials of adjacency matrices to be sparse.
Our proposed algorithm provides stronger optimality guarantees in a regularized ML sense without imposing any assumptions on the sparsity of graphs (see Table III), as long as the GBFs are one-to-one. The identification of graph systems with polynomial GBFs will be studied as part of our future work.

IV. PROBLEM FORMULATION: GRAPH SYSTEM IDENTIFICATION
Our formulation is based on the following general assumption on a GBF h, which holds for the GBFs in Table I.

Assumption 1. We assume that a graph-based filter h(λ) is a nonnegative and one-to-one function of λ.

A. Filtered Signal Model
We formulate the GSI problem in a probabilistic setting by assuming that the observed (filtered) signals have been sampled from a zero-mean n-variate Gaussian distribution x ∼ N(0, h(L)),

p(x | h(L)) = (2π)^{−n/2} |h(L)|^{−1/2} exp( −(1/2) xᵀ h(L)† x ),   (5)

with the covariance Σ = h(L) defined based on a CGL matrix L and a GBF h.

(In practice, powers of adjacency/Laplacian matrices lead to dense matrices. The zero-mean assumption is made to simplify the notation; our model can be trivially extended to multivariate Gaussian distributions with nonzero mean.)

For modeling smooth graph signals, it is reasonable to choose h(λ) to be a monotonically decreasing function satisfying h(λ_1) ≥ h(λ_2) ≥ · · · ≥ h(λ_{n−1}) ≥ h(λ_n) > 0, where the graph frequencies (eigenvalues of L) are ordered as λ_1 ≤ λ_2 ≤ · · · ≤ λ_n. Thus, the corresponding covariance matrix h(L) represents graph signals whose energy is larger at lower graph frequencies. Note that the nonnegativity condition in Assumption 1, that is, h(λ_i) ≥ 0 for i = 1, . . . , n, ensures that the covariance h(L) is a positive semidefinite matrix.

From the graph system perspective (see Fig. 1 and Definition 1), the above probabilistic model corresponds to the case where the input is n-variate white Gaussian noise, x_in ∼ N(0, Σ_in = I). Then the covariance of the output vector x_out = h(L) x_in becomes Σ_out = h(L) Σ_in h(L)ᵀ = h(L)², which can be translated into our model in (5) by simply transforming Σ_out as Σ = Σ_out^{1/2} = h(L), where the mapping between h(L)² and h(L) is one-to-one (i.e., there is no loss of information after the transformation), since the eigenvalues of h(L) are nonnegative.

B. Problem Formulation
Our goal is to estimate the graph system parameters L and h based on k random samples, x_i for i = 1, . . . , k, obtained from the signal model in (5), by maximizing the likelihood of h(L), that is,

∏_{i=1}^{k} p(x_i | h(L)) = c |h(L)|^{−k/2} ∏_{i=1}^{k} exp( −(1/2) x_iᵀ h(L)† x_i )   (6)

with the constant c = (2π)^{−kn/2}. The maximization of (6) can be equivalently stated as minimizing its negative log-likelihood, which leads to the following problem for estimating the graph system parameters L and h:

(L̂, ĥ) = argmin_{L,h} { Σ_{i=1}^{k} Tr(x_iᵀ h(L)† x_i) − k log|h(L)†| } = argmin_{L,h} { Tr(h(L)† S) − log|h(L)†| }   (7)

where S denotes the sample covariance calculated using the k samples x_i, i = 1, . . . , k. In our problem formulation, we additionally introduce a sparsity-promoting weighted ℓ1-regularization, ‖L ⊙ H‖₁, for robust graph estimation, where H is a symmetric regularization matrix and ⊙ denotes element-wise multiplication. Thus, the proposed GSI problem is formulated as follows:

minimize_{L ⪰ 0, β}   Tr(h_β(L)† S) − log|h_β(L)†| + ‖L ⊙ H‖₁
subject to   L1 = 0,  (L)_ij ≤ 0 for i ≠ j   (8)

where β is the (unknown) parameter for a given type of GBF h_β from Table I, and H denotes the regularization matrix. The constraints ensure that L is a CGL matrix. In this work, we particularly choose H = α(2I − 11ᵀ), so that the regularization term in (8) reduces to the standard ℓ1-regularization α‖L‖₁ with parameter α.

(All of the GBFs in Table I are monotonically decreasing functions on the interval λ ∈ (0, ∞). The exponential decay and frequency shifting GBFs further satisfy h(λ_1) ≥ h(λ_2) at the zero frequency (i.e., λ_1 ≤ λ_2), while for the other GBFs we have h(λ_2) ≥ h(λ_1). The formulation can be trivially extended to different types of graph Laplacians (e.g., the generalized graph Laplacian) discussed in [17].)

C. Special Cases of Graph System Identification Problem and Applications of Graph-based Filters
Graph Learning.
The problem in (8) can be reduced to the CGL estimation problem proposed in our previous work [17] by choosing the filter h_β to be the frequency scaling or β-hop localized GBF with β = 1, so that we have

h_β(λ) = λ† = { 1/λ for λ > 0;  0 for λ = 0 },  that is,  h_β(L) = L†.   (9)

Since h_β(L)† = L, we can rewrite the objective function in (8) as

Tr(LS) − log|L| + ‖L ⊙ H‖₁   (10)

which is the objective function of the graph learning problems originally introduced in [17].

Graph Learning from Noisy Signals.
The variance shifting filter corresponds to a noisy signal model with variance β = σ². Specifically, assume that the noisy signal is modeled as x̂ = x + e, where x ∼ N(0, L†) denotes the original signal, and e ∼ N(0, σ²I) is an additive white Gaussian noise vector independent of x with variance σ². Then the noisy signal x̂ is distributed as x̂ ∼ N(0, Σ̂ = L† + σ²I). The same model is obtained by using a variance shifting filter with β = σ², so that

h_β(λ) = λ† + β  and  h_β(L) = L† + βI = L† + σ²I.   (11)

With this type of GBF, our GSI problem in (8) can be solved to learn graphs from noisy signals by identifying L and the noise parameter β = σ². This problem can also be viewed as an extension of our formulations in [17], which were derived based on signal models assumed to be noise-free (i.e., β = 0), so it can be solved to improve the performance of graph learning from data in the presence of noise [29].
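A minimal simulation of this noisy model, with an illustrative 3-node graph and σ² = 0.25: signals with covariance L† are corrupted by white Gaussian noise, and the sample covariance approaches h_β(L) = L† + σ²I as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
L = np.diag(W.sum(axis=1)) - W
L_pinv = np.linalg.pinv(L)

# Square root of L^dagger, used to sample x ~ N(0, L^dagger) from white noise.
lam_p, U = np.linalg.eigh(L_pinv)
half = U @ np.diag(np.sqrt(np.clip(lam_p, 0.0, None))) @ U.T

sigma2 = 0.25
k = 200000
x = half @ rng.standard_normal((3, k))             # original smooth signals
e = np.sqrt(sigma2) * rng.standard_normal((3, k))  # additive white noise
S = ((x + e) @ (x + e).T) / k                      # sample covariance

# Empirically close to the variance shifting model h_beta(L) = L^dagger + beta*I.
assert np.allclose(S, L_pinv + sigma2 * np.eye(3), atol=0.05)
```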
Graph Learning from Frequency Shifted Signals.
For the frequency shifting filter with parameter β, we have

h_β(λ) = (λ + β)†  and  h_β(L) = (L + βI)†.   (12)

By substituting h_β(L) with (L + βI)†, the problem in (8) can be written as

minimize_{L̃ ⪰ 0, β ≥ 0}   Tr(L̃S) − log|L̃| + ‖L̃ ⊙ H‖₁
subject to   L̃ = L + βI,  L1 = 0,  (L)_ij ≤ 0 for i ≠ j   (13)

which is the shifted combinatorial Laplacian (SCGL) estimation problem originally proposed in [30].

Diffusion (Heat) Kernel Learning.
To learn diffusion kernels, formally defined in Definition 2, the proposed GSI problem in (8) can be modified by choosing h_β(λ) as the exponential decay filter, so that we get

h_β(L) = U h_β(Λ) Uᵀ = U exp(−βΛ) Uᵀ = exp(−βL).   (14)

The resulting problem can be solved to learn diffusion kernels from signals/data, which are popular in many applications [4], [5], [27].

Graph Learning from β-hop Localized Signals. To learn graphs from β-hop localized signals, where β is a positive integer, the GBF can be selected as h(λ) = (λ†)^β, so that the corresponding inverse covariance (i.e., precision) matrix in (8) is h(L)† = L^β, which is generally not a graph Laplacian matrix due to the exponent β. However, it defines diffusion operators (on a graph associated with L) that can be used as alternatives to heat kernels, and the corresponding signal model leads to β-hop localized/diffused signals, in which each sample depends on samples located at most β hops away.

D. Graph System Identifiability
In statistical learning theory, a probabilistic model is identifiable if it is possible to learn the model parameters exactly from an infinite number of data samples. This requires different model parameters to generate different probability distributions [31], which is formally stated in the following definition.
Definition 3 (Model Identifiability [31]). Let M = {p(x|Θ) : Θ ∈ P_Θ} be the family of probabilistic models defined by the parameter space P_Θ. M is identifiable if the mapping from P_Θ to M is one-to-one, that is, if Θ₁ ≠ Θ₂ then p(x|Θ₁) ≠ p(x|Θ₂) for arbitrary Θ₁, Θ₂ ∈ P_Θ.

Based on the above definition, we introduce three different notions of identifiability for graph systems characterizing the probabilistic model discussed in Section IV-A.

Definition 4 (Notions of Identifiability). Let the parameter set of the signal model in (5) be P_Θ = {(L, h) | L ∈ 𝓛, h ∈ 𝓗}, such that a class of graph systems is defined by a specific choice of the sets 𝓛 and 𝓗, denoting the admissible graphs and filters in the class. For any such class of graph systems, we define the following types of identifiability:
• A class is filter identifiable if h₁ ≠ h₂ implies h₁(L) ≠ h₂(L) for arbitrary L ∈ 𝓛 and h₁, h₂ ∈ 𝓗.
• A class is graph identifiable if L₁ ≠ L₂ implies h(L₁) ≠ h(L₂) for arbitrary h ∈ 𝓗 and L₁, L₂ ∈ 𝓛.
• A class is jointly identifiable if any two pairs satisfying (L₁, h₁) ≠ (L₂, h₂) lead to h₁(L₁) ≠ h₂(L₂) for arbitrary L₁, L₂ ∈ 𝓛 and h₁, h₂ ∈ 𝓗.

The following proposition categorizes the classes of systems defined by the parametric GBFs in Table I in terms of their identifiability.

Proposition 1.
Let 𝓛_c and 𝓗 define a class of graph systems. All classes corresponding to the parametric GBFs in Table I are both filter and graph identifiable. In addition, the classes with β-hop localized, frequency shifting or variance shifting GBFs in Table I are jointly identifiable.

Proof: The proof follows from Definition 4 by checking the types of identifiability for each GBF with different β parameters. For filter identifiability the proof is straightforward, because two different parameters β₁ ≠ β₂ imply h_β₁(L) ≠ h_β₂(L). Since the GBFs of interest are one-to-one functions, the corresponding classes are also graph identifiable by Proposition 2. The classes with frequency scaling and exponential decay GBFs are not jointly identifiable, since we have h_β₁(L₁) = h_β₂(L₂) when L₂ = (β₁/β₂)L₁. However, no such construction leading to h_β₁(L₁) = h_β₂(L₂) exists for the other GBF types, as shown in (24)–(30); thus the corresponding classes are jointly identifiable. ∎

For a general choice of 𝓗, the following proposition states a sufficient condition for graph identifiability.

Proposition 2.
A class of graph systems defined by 𝓛 and 𝓗 is graph identifiable if 𝓛 is a set of CGLs and 𝓗 is a set of one-to-one functions.

Proof: Assuming that all h ∈ 𝓗 are one-to-one functions, we need to show that for L₁ ≠ L₂ in 𝓛, we have h(L₁) ≠ h(L₂). Specifically, if L₁ = UΛ₁Uᵀ and L₂ = UΛ₂Uᵀ have the same GFT, then U h(Λ₁) Uᵀ ≠ U h(Λ₂) Uᵀ is satisfied, because h is one-to-one and Λ₁ ≠ Λ₂. The proof is obvious when the GFTs of L₁ and L₂ are different. ∎

To derive necessary conditions for graph identifiability, an additional set of assumptions is needed. For example, if graphs (i.e., graph Laplacians in 𝓛) are assumed to be sparse, the one-to-one requirement on 𝓗 may not be necessary. In particular, [32] investigates the identifiability of graphs from their eigenspaces (i.e., from GFTs as in [20]) and introduces necessary conditions for a specific type of identifiability, called diagonal identifiability. To the best of our knowledge, there is no other analysis of the identifiability of graphs in the literature. As part of our future work, we will focus on characterizing necessary and sufficient conditions for joint identifiability under more general choices of 𝓗.

V. PROPOSED SOLUTION AND OPTIMALITY CONDITIONS
Algorithm 1 is proposed to solve the GSI problem in (8) for a given sample covariance S and a selected type of GBF h_β.

Algorithm 1: Graph System Identification (GSI)
Input: sample covariance S, graph-based filter type h_β
Output: graph Laplacian L̂ and filter parameter β̂
1: Obtain U and Λ_s via the eigendecomposition S = UΛ_sUᵀ.
2: Initialize the parameter β̂: for variance or frequency shifting GBFs, apply the initialization methods in Section V-C; for the other GBF types, apply random initialization.
3: repeat
4:   Prefilter the sample covariance S:
       S_pf = (h_β̂⁻¹(S))† = U (h_β̂⁻¹(Λ_s))† Uᵀ. (15)
5:   Estimate L from the prefiltered data S_pf:
       L̂ ← run the CGL algorithm in [17] to solve (23).
6:   Update the filter parameter β̂ (or skip this step if β̂ is already optimal):
       β̂ ← apply the GBF-specific update in Section V-C.
7: until convergence is achieved
8: return the graph system parameters L̂ and β̂

After obtaining U and Λ_s via the eigendecomposition of S and initializing the parameter β (lines 1 and 2), the algorithm performs three main steps to find the optimal graph Laplacian L and filter parameter β:
1) The prefiltering step applies inverse filtering to the sample covariance S in order to reverse the effect of the filter h in the graph system. Without the prefiltering step, it may be impossible to effectively recover L from S, since Σ† = h(L)† is generally not a graph Laplacian. The proposed prefiltering allows us to effectively estimate the original eigenvalues of L (i.e., Λ_λ) from the prefiltered covariance S_pf in (15).
2) The graph Laplacian estimation step uses the CGL estimation algorithm introduced in [17] to learn L̂ from the prefiltered covariance S_pf.
3) The filter parameter selection step finds the best matching β for the graph system. Depending on the type of GBF, we propose different methods for parameter selection.

In the rest of this section, we derive the optimality conditions and discuss optimal prefiltering, graph Laplacian estimation and filter parameter selection in Sections V-A, V-B and V-C, respectively.

A. Optimal Prefiltering
Let the CGL L and the GBF h be the parameters of a graph system, so that the covariance matrix of the model in (5) is Σ = h(L). There is a GFT matrix U jointly diagonalizing L and Σ such that L = UΛ_λUᵀ and Σ = UΛ_σUᵀ. In the ideal case where S = Σ is obtained from an asymptotically large number of samples (k), a change of variables turns the objective function in (7) into

J(U, h(Λ_λ)) = Tr(U h(Λ_λ)† Λ_σ Uᵀ) − log|U h(Λ_λ)† Uᵀ|, (16)

which simplifies, using properties of Tr(·) and |·|, to

J(h(Λ_λ)) = Tr(h(Λ_λ)† Λ_σ) − log|h(Λ_λ)†|, (17)

where the GBF h and the diagonal matrix of graph frequencies Λ_λ are unknown, and the diagonal matrix Λ_σ is known from the data. By letting φ_i = h(λ_i)† = (h(Λ_λ)†)_ii and σ_i = (Λ_σ)_ii for i = 1, 2, …, n, we can write (17) as

J(φ) = Σ_{i=1}^{n} (φ_i σ_i − log(φ_i)), (18)

where φ = [φ₁ φ₂ ⋯ φ_n]ᵀ. In minimizing the convex function (18), the optimal solution satisfies the following necessary and sufficient conditions [33], obtained by taking the derivative of (18) with respect to φ_i and setting it to zero:

∂J(φ)/∂φ_i = σ_i − 1/φ_i = 0. (19)

By a change of variables, the optimality conditions can be stated as

h(λ_i) = (h(Λ_λ))_ii = (Λ_σ)_ii = σ_i  ∀i. (20)

Based on Assumption 1, we can also write (20) as

h⁻¹(h(λ_i)) = λ_i = h⁻¹(σ_i)  ∀i, (21)

where h⁻¹ is the inverse function of h. In matrix notation, (21) can be stated more compactly as

h⁻¹(Σ) = U h⁻¹(Λ_σ) Uᵀ = UΛ_λUᵀ = L. (22)

This condition shows that we can find the optimal Laplacian L via inverse filtering (inverse prefiltering), h⁻¹(Σ). However, in practice we only have access to a sample estimate of Σ (i.e., S) obtained from a limited number of data samples (k), which is not a good estimator, especially when k is small [24]. Thus, computing h⁻¹(S) generally does not lead to a CGL matrix.
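For the exponential decay GBF, condition (22) can be verified numerically: applying the inverse filter (here, a scaled matrix logarithm) to the true covariance Σ = exp(−βL) recovers L exactly. A minimal sketch, using a toy graph not taken from the paper:

```python
import numpy as np
from scipy.linalg import expm, logm

# Toy CGL and its exponentially filtered (diffusion-kernel) covariance.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
beta = 0.4
Sigma = expm(-beta * L)        # Sigma = h_beta(L) = exp(-beta*L)

# Inverse prefiltering of the *true* covariance, as in (22): h^{-1}(Sigma) = L.
L_rec = -logm(Sigma).real / beta
print(np.allclose(L_rec, L))   # True
```

Replacing Σ with a finite-sample estimate S breaks this: h⁻¹(S) typically has positive off-diagonal entries and is no longer a CGL.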
In order to address this problem, the proposed Algorithm 1 first estimates the prefiltered sample covariance S_pf as in (15), and then employs the CGL estimation algorithm [17] to find the best CGL from S_pf by minimizing the criterion in (10).

B. Optimal Graph Laplacian Estimation
For a GBF h (or h_β) satisfying the optimal prefiltering condition in (21), the GSI problem in (8) can be rewritten as the following graph learning problem:

minimize_{L ⪰ 0}  Tr(LS_pf) − log|L| + ‖L ⊙ H‖₁
subject to  L1 = 0,  (L)_ij ≤ 0 for i ≠ j, (23)

where S_pf = (h⁻¹(S))† is obtained by prefiltering the empirical covariance S. This problem is discussed in detail in our previous work [17], where we derived the optimality conditions for (23) and developed the CGL estimation algorithm to solve it.

C. Filter Parameter Selection
Based on the optimality condition in (21), we propose different methods to identify the parameter β for the GBFs in Table I. Specifically, optimal parameter initializations are derived for the variance and frequency shifting GBFs, and a line search method is proposed for the β-hop localized GBF. Since graph systems with exponential decay and frequency scaling GBFs are not jointly identifiable (as discussed in Section IV-D), we cannot identify β optimally for them. Yet, we show that their graph Laplacian matrices can be identified up to a constant factor, which depends on β.

Initialization for Variance/Frequency Shifting Filters.
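For a CGL, the eigenvector of the zero eigenvalue is the constant vector u₁ = (1/√n)1, so σ₁ = u₁ᵀΣu₁ is directly computable from the (sample) covariance. The sketch below (toy graph and β value are assumptions, not from the paper) checks the closed-form initializations for both filter types derived in this subsection:

```python
import numpy as np

# Toy CGL (illustrative only).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
n = L.shape[0]
u1 = np.ones(n) / np.sqrt(n)   # zero-eigenvalue eigenvector of any CGL

beta_true = 0.3

# Variance shifting: Sigma = L^dagger + beta*I  =>  beta = u1' Sigma u1, as in (25).
Sigma_vs = np.linalg.pinv(L) + beta_true * np.eye(n)
beta_vs = u1 @ Sigma_vs @ u1

# Frequency shifting: Sigma = (L + beta*I)^{-1}  =>  beta = 1/(u1' Sigma u1), as in (27).
Sigma_fs = np.linalg.inv(L + beta_true * np.eye(n))
beta_fs = 1.0 / (u1 @ Sigma_fs @ u1)

print(round(beta_vs, 6), round(beta_fs, 6))  # 0.3 0.3
```

Both recoveries are exact because L†u₁ = 0 and (L + βI)⁻¹u₁ = (1/β)u₁, so the DC component of Σ isolates β.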
For both variance and frequency shifting GBFs, the optimal β is found by calculating σ₁ = u₁ᵀΣu₁, where u₁ is the eigenvector associated with the zero eigenvalue of the Laplacian ((Λ_λ)₁₁ = λ₁ = 0). Specifically, by using the optimality condition in (21),
• if h_β(λ) is a variance shifting filter, we get

h_β(λ₁) = σ₁ = u₁ᵀΣu₁ = λ₁† + β; (24)

since λ₁ = λ₁† = 0, the optimal β satisfies

β = u₁ᵀΣu₁ = σ₁. (25)

• if h_β(λ) is a frequency shifting filter, we obtain

h_β(λ₁) = σ₁ = u₁ᵀΣu₁ = 1/(λ₁ + β) = 1/β, (26)

so the optimal β satisfies

β = 1/(u₁ᵀΣu₁) = 1/σ₁. (27)

Since the optimal β can be directly estimated from the sample covariance S, as calculated from Σ in (25) and (27), Algorithm 1 uses the optimized initial β (in line 2) for prefiltering and then estimates L, so that the filter parameter update (line 6) is skipped for graph systems with frequency and variance shifting filters.

Exponential Decay and Frequency Scaling Filters.
Suppose that L̂ is obtained by inverse prefiltering of Σ with β̂, that is, L̂ = h_β̂⁻¹(Σ).
• If h_β(λ) is a frequency scaling filter, we have

h_β̂⁻¹(σ_i):  λ₁ = σ₁ = 0,  (β/β̂)λ_i = 1/(β̂σ_i) for i = 2, …, n. (28)

• If h_β(λ) is an exponential decay filter, we have

h_β̂⁻¹(σ_i) = −log(σ_i)/β̂ = −log(exp(−βλ_i))/β̂ = (β/β̂)λ_i. (29)

Based on (28) and (29), for any selected β̂, the resulting matrix is a CGL satisfying L̂ = (β/β̂)L, where L and β denote the original graph system parameters. Since the inverse of a different GBF (i.e., h_β̂⁻¹ with β ≠ β̂) leads to a CGL for any β̂, Algorithm 1 can only find the optimal CGL matrix L up to a constant factor β/β̂. In practice, the parameter β̂ can be tuned so that the desired normalization (scaling factor) for L is achieved.

β-hop Localized Filter. For estimation of the optimal hop count β in Algorithm 1, inverse prefiltering with β̂ gives

h_β̂⁻¹(σ_i):  λ₁ = σ₁ = 0,  λ_i^{β/β̂} = (1/σ_i)^{1/β̂} for i = 2, …, n. (30)

Since this requires the graph learning step to estimate L^{β/β̂}, which is not a graph Laplacian in general, Algorithm 1 cannot guarantee optimal graph system identification unless β = β̂. In order to find the optimal β, we perform a line search over a given range of integers, optimizing

β̂ = argmin_{β̂ ∈ ℕ} ‖h_β̂(L̂) − S‖_F. (31)

VI. RESULTS
A. Graph Learning from Diffusion Signals/Data
We evaluate the performance of our proposed graph system identification method (GSI, Algorithm 1) by benchmarking it against the current state-of-the-art approaches proposed for learning graphs from smooth signals (GLS) [18], [19], as well as the graph topology inference (GTI) method in [20]. The proposed GSI is also compared against the CGL estimation algorithm, CGL [17] (i.e., CGL estimation without prefiltering), and the inverse prefiltering (IPF) approach, which estimates a graph Laplacian matrix by inverting the prefiltered covariance S_pf in (15), so that L̂ = h_β⁻¹(S) = S_pf†. For this purpose, we generate several artificial datasets based on the signal model in (5), defined by a graph Laplacian (L) and a GBF (h_β), where the dataset entries are generated by random sampling from the distribution N(0, h_β(L)). The generated data are then used by the proposed and benchmark algorithms to recover the corresponding graph Laplacian L. We repeat our experiments for different L and h_β, where the graphs are constructed using three different graph connectivity models:
• Grid graph, G_grid^(n), consisting of n vertices, each attached to its four nearest neighbors (except the vertices at boundaries).
• Random Erdos-Renyi graph, G_ER^(n,p), with n vertices attached to other vertices with probability p.
• Random modular graph (also known as the stochastic block model), G_M^(n,p1,p2), with n vertices and four modules (subgraphs), where the vertex attachment probabilities across modules and within modules are p₁ and p₂, respectively.
The edge weights of a graph are then randomly selected from a uniform distribution over a fixed positive interval. For each L and h_β pair, we perform Monte-Carlo simulations to test the average performance of the proposed and benchmark methods with a varying number of data samples (k) and a fixed number of vertices (n = 36). To measure the estimation performance, we use the following two metrics:

RE(L̂, L*) = ‖L̂ − L*‖_F / ‖L*‖_F, (32)

which is the relative error between the ground truth graph (L*) and the estimated graph parameters (L̂), and

FS(L̂, L*) = 2tp / (2tp + fn + fp), (33)

which is the F-score metric (a commonly used measure of binary classification performance), calculated based on the true-positive (tp), false-positive (fp) and false-negative (fn) detection of graph edges in the estimated L̂ with respect to the ground truth edges in L*.
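The two metrics in (32) and (33) translate directly into code; the edge supports are read off the off-diagonal entries of the Laplacians (a sketch; the tolerance used to declare an edge present is an assumption):

```python
import numpy as np

def relative_error(L_hat, L_star):
    """RE in (32): Frobenius-norm error relative to the ground truth."""
    return np.linalg.norm(L_hat - L_star, 'fro') / np.linalg.norm(L_star, 'fro')

def f_score(L_hat, L_star, tol=1e-8):
    """FS in (33): F-score of edge detection, computed from the off-diagonal
    supports of the estimated and ground-truth Laplacians."""
    iu = np.triu_indices(L_star.shape[0], k=1)
    est = np.abs(L_hat[iu]) > tol
    true = np.abs(L_star[iu]) > tol
    tp = np.sum(est & true)
    fp = np.sum(est & ~true)
    fn = np.sum(~est & true)
    return 2 * tp / (2 * tp + fn + fp)
```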
F-score values lie between 0 and 1, where the value 1 means perfect classification. In our experiments, for the proposed GSI, the regularization parameter α in (23) is selected from a finite set consisting of 0 and geometrically spaced multiples of s_max √(log(n)/k), (34) where s_max = max_{i≠j} |(S)_ij| is the maximum off-diagonal entry of S in absolute value, and the scaling term √(log(n)/k) adjusts the regularization according to k and n, as suggested in [24], [35]. Methods are evaluated on small graphs (with n = 36 vertices), since GTI [20] is implemented using CVX [34] and does not currently have an efficient and scalable implementation; the proposed and other benchmark methods are more efficient and can support larger graphs. Monte-Carlo simulations are performed for each proposed/baseline method by successively solving the associated problem with different regularization parameters, in order to find the regularization that minimizes RE. The corresponding graph estimate is also used to calculate FS. For all baseline methods [18]–[20], the required parameters are selected by fine tuning. Since the CGL [17], GLS [18], [19] and GTI [20] approaches generally result in severely biased solutions with respect to the ground truth L* (based on our observations from the experiments), RE values are calculated after normalizing the estimated solution L as L̂ = (Tr(L*)/Tr(L))L. Note that this normalization also resolves the scale ambiguity in the identification of graph systems with exponential decay and frequency scaling filters (discussed in Section V-C).

Figures 3 and 4 depict the performance of the different methods when estimating graphs from signals modeled with exponential decay filters (diffusion kernels) and β-hop localized filters, respectively. As shown in Figs. 3 and 4, the proposed GSI significantly outperforms all baseline methods, including the state-of-the-art GLS [18], [19] and GTI [20], in terms of average RE and FS. The performance difference between GSI and CGL [17] demonstrates the impact of the prefiltering step, which substantially improves graph learning accuracy. Similarly, the performance gap between GSI and IPF shows the advantage of Algorithm 1 over direct prefiltering of the input covariance (S) as in (22): GSI provides better graph estimation, especially when the number of data samples (i.e., k/n) is small. In addition, Figs. 5 and 6 illustrate two examples from the experiments with grid graphs for k/n = 30, where the proposed GSI constructs the graphs most similar to the ground truth (L*).

B. Graph Learning from Variance/Frequency Shifted Signals
In this subsection, we compare the CGL estimation performance of the GSI, CGL [17] and SCGL [30] methods on signals modeled with variance and frequency shifting GBFs. As discussed in Section IV-C, the covariance matrices for signals modeled with these GBFs with parameter β are
• Σ = L† + βI for variance shifting,
• Σ = (L + βI)† for frequency shifting,
where L denotes the associated combinatorial Laplacian. In our experiments, we construct 10 random Erdos-Renyi graphs (G_ER^(n,p)) with n = 36 vertices, and then generate Σ for each GBF by varying β between 0 and 1. To evaluate the effect of β only, we use the actual covariance matrices, instead of sample covariances, as input to the algorithms; thus, GSI, CGL and SCGL estimate a graph Laplacian L from Σ. The average RE results are presented in Tables IV and V for various β. Table IV shows that the proposed GSI significantly outperforms CGL for β > 0, and the average RE difference increases as β gets larger. This is because the variance shifting GBF leads to the noisy signal model with the covariance in (11), where β represents the variance of the noise (σ²), and the
Fig. 3. Average RE and FS results for graph estimation from signals modeled with exponential decay GBFs, tested on 10 different grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods (IPF, CGL [17], GLS [18], GLS [19], GTI [20]) in terms of both RE and FS.

Fig. 4. Average RE and FS results for graph estimation from signals modeled with β-hop localized GBFs, tested on 10 different grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS.

TABLE IV. Average relative errors for the variance shifting GBF: CGL [17] versus the proposed GSI, across several values of the filter parameter β.

TABLE V. Average relative errors for the frequency shifting GBF: SCGL [30] versus the proposed GSI, across several values of the filter parameter β.

prefiltering step allows GSI to perfectly estimate the parameter β from Σ by using (25), so that the covariance is prefiltered as in (15) based on the optimal β. The prefiltering step can also be considered as a denoising operation (reversing the effect of the variance shifting GBF) on the signal covariance before the graph estimation step, while CGL works with noisy (i.e., shifted) covariances, which diminishes its estimation performance. For β = 0 (i.e., when Σ is noise-free), problem (8) reduces to the CGL estimation problem in [17], so both GSI and
CGL lead to the same average RE. For the frequency shifting GBFs with β > 0, GSI performs slightly better than SCGL, since SCGL is implemented using the general-purpose solver CVX [34], which generally produces less accurate solutions than our algorithm. Moreover, our algorithm is approximately 90 times faster than SCGL on average, and it significantly outperforms SCGL for β = 0, since the SCGL method is developed for shifted covariance matrices (i.e., L + βI) in which β must be strictly positive.

C. Illustrative Results on Temperature Data
In this experiment, we apply our proposed method to a real (climate) dataset consisting of air temperature measurements [36]. We specifically use the average daily temperature measurements collected from 45 states in the US over 16 years (2000-2015), so that in total there are k = 5844 samples for each of the n = 45 states. Fig. 7 shows samples of average temperature signals, which are spatially smooth across different states. Also, the Rocky Mountains region has lower average temperature values than the other regions in the western US. (The Rocky Mountains cross through the states of Idaho, Montana, Wyoming, Utah, Colorado and New Mexico. For example, in Fig. 7b, the areas with temperature values between 0 and 10 degrees Celsius, colored in green, correspond approximately to the Rocky Mountains region.)

Fig. 5. A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the exponential decay GBF, where L* is derived from the grid graph in panel (a); panels (b)-(d) show the graphs estimated by GLS [18], GTI [20] and the proposed GSI, together with their (RE, FS) values. The edge weights are color coded, with darker colors indicating larger weights. The proposed GSI leads to the graph that is most similar to the ground truth.

Fig. 6. A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the β-hop localized GBF with β = 2, where L* is derived from the grid graph in panel (a); panels (b)-(d) show the graphs estimated by GLS [18], GTI [20] and the proposed GSI, together with their (RE, FS) values. The edge weights are color coded, with darker colors indicating larger weights. The proposed GSI leads to the graph that is most similar to the ground truth.
The goal of this experiment is to learn graphs (with 45 vertices) associated with exponential decay and β-hop localized filters from temperature data, and to investigate the effect of the filter type and parameter β on the resulting graphs, which represent the similarity of temperature conditions between the 45 states. For modeling temperature signals, a diffusion kernel is a good candidate, because it is a fundamental solution of the heat equation, which describes the distribution of heat in a physical environment [37]. (This is why diffusion kernels are also known as heat kernels in the literature.)

Figure 8 illustrates the graphs estimated using the GSI method without regularization (i.e., H in (8) is set to the zero matrix). As shown in the figure, larger edge weights are generally assigned between vertices (i.e., states) that are closer to each other, since temperature values are mostly similar between states within close proximity. However, the distance between states is obviously not the only factor affecting the similarity of temperature values. For example, in Figs. 8b-8f, the weights are considerably small between the states in the Rocky Mountains region and their neighboring states to the east (e.g., Nebraska and Kansas), due to the large differences in altitude. Note also that different choices of GBFs can lead to substantially different similarity graphs. Especially for the β-hop localized GBF with β = 1 (corresponding to the CGL method), the resulting graph is significantly different from the results in Figs. 8b and 8c, since the 1-hop localized filter does not lead to a diffusion model. The exponential decay GBF leads to sparser graphs, better revealing the structure of the signal, than the graphs in Figs. 8a-8c. The structures of the graphs in Figs. 8d-8f are similar for different β because of the relation in (29) for the exponential decay filter: since the inverse filter scales as 1/β, increasing β proportionally shrinks the edge weights, as shown in Figs.
8d and 8e. In addition, the distribution of edge weights for the β-hop localized GBF becomes more similar to those in Figs. 8d-8f as β gets larger.

VII. CONCLUSION
In this paper, we have introduced a novel graph-based modeling framework that (i) formulates the modeling problem as graph system identification from signals/data and (ii) proposes an alternating optimization algorithm that iteratively solves for a graph and a GBF. At each iteration of the algorithm, a prefiltering operation, defined by the estimated GBF, is applied to the observed signals, and a graph is then estimated from the prefiltered signals. The experimental results have demonstrated that the proposed algorithm outperforms the existing methods for modeling smooth signals and learning diffusion-based models [18]–[20] in terms of graph estimation accuracy.

Fig. 7. Average air temperatures (in degrees Celsius) for the (a) 45th (Winter), (b) 135th (Spring), (c) 225th (Summer) and (d) 315th (Autumn) days, averaged over 16 years (2000-2015). Black dots denote the 45 states.

Fig. 8. Graphs learned from temperature data using the GSI method with β-hop localized GBFs (panels a-c, with β = 1, 5, 10) and exponential decay GBFs (panels d-f, with increasing β) for fixed β parameters, where no regularization is applied (i.e., H is set to the zero matrix). The edge weights are color coded so that darker colors represent larger weights (i.e., more similarity). The exponential decay GBF leads to sparser structures than the β-hop localized GBFs.

The proposed framework supports various types of GBFs (in Table I), including the diffusion (heat) kernels, widely used in many applications [4], [5], [27], as special cases. Our future work focuses on extensions of our algorithm for joint identification of graphs and polynomial filters (i.e., estimation of polynomials of graph Laplacians), which can provide more degrees of freedom in designing filters than the GBFs in Table I. Data-oriented applications of the proposed modeling framework are also considered as part of our future work.

APPENDIX A
DERIVATION OF DIFFUSION KERNELS
Suppose that the random process is obtained by diffusion of an initial random vector x(0), whose entries (x(0))_i for i = 1, 2, …, n are independent, zero-mean random variables with variance σ². The random diffusion over a graph G is formulated recursively as

x(t+1) = x(t) − rLx(t) = (I − rL)x(t),  t ∈ {0, 1, 2, …}, (35)

where t is the time index, L is the graph Laplacian of G, and the positive real number r is the diffusion rate of the process. On the i-th vertex of the graph G, the random process in (35) can be written as

(x(t+1))_i = (x(t))_i + r Σ_{j∈N_i} (W)_ij ((x(t))_j − (x(t))_i) (36)

for all i, where W is the adjacency matrix of G and N_i is the set of vertex indices neighboring the i-th vertex (v_i). More compactly, (35) can be written as x(t) = G(t)x(0), where

G(t) = (I − rL)^t. (37)

The covariance of x(t) is

(Σ(t))_ij = E[(G(t)x(0))_i (G(t)x(0))_j] = E[(Σ_{k=1}^n (G(t))_ik (x(0))_k)(Σ_{l=1}^n (G(t))_jl (x(0))_l)], (38)

which simplifies, using the independence of the entries of x(0), to

(Σ(t))_ij = σ² Σ_{k=1}^n (G(t))_ik (G(t))_kj = σ² (G(t)²)_ij. (39)

Therefore, the covariance matrix is

Σ(t) = σ² G(t)² = σ² (I − rL)^{2t}. (40)

To adjust the time resolution of the process, we replace t with t/Δt and r with rΔt in G(t), so that

Σ̃(t) = σ² (I − rΔt L)^{2t/Δt}. (41)

For arbitrarily small Δt, Σ̃(t) converges to a matrix exponential function of L,

Σ̃(t) = σ² lim_{Δt→0} (I − rΔt L)^{2t/Δt} = σ² exp(−2rtL), (42)

which is equivalent to the exponential decay filtering operation in Table I with β(t) = 2rt. For arbitrarily large t (i.e., when the diffusion reaches steady state), the covariance is

lim_{t→∞} Σ̃(t) = lim_{t→∞} σ² Σ_{i=1}^n exp(−2rtλ_i) u_i u_iᵀ = (σ²/n) 11ᵀ, (43)

all of whose entries have the same value, since λ₁ = 0 for CGLs.
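The continuous-time limit above can be checked numerically: as Δt shrinks, the finite-step covariance σ²(I − rΔtL)^{2t/Δt} (where the exponent 2 comes from squaring G(t) in the covariance) approaches σ²exp(−2rtL), which can itself be formed spectrally as in (14). A sketch, with a toy graph and parameters chosen only for illustration:

```python
import numpy as np
from scipy.linalg import expm

# Toy 3-vertex path CGL.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
sigma2, r, t = 1.0, 0.1, 2.0

# Limit covariance: sigma^2 * exp(-2*r*t*L), built two equivalent ways.
lam, U = np.linalg.eigh(L)
Sigma_inf = sigma2 * (U @ np.diag(np.exp(-2 * r * t * lam)) @ U.T)
assert np.allclose(Sigma_inf, sigma2 * expm(-2 * r * t * L))

# Finite-step covariance sigma^2 * (I - r*dt*L)^(2t/dt) for shrinking dt.
for m in (10, 100, 10000):       # m = t/dt diffusion steps
    dt = t / m
    G = np.linalg.matrix_power(np.eye(3) - r * dt * L, m)
    err = np.linalg.norm(sigma2 * G @ G - Sigma_inf)
    print(m, err)                # error decreases as dt -> 0
```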
Thus, the statistical properties eventually becomespatially flat across all vertices as intuitively expected for adiffusion process. Yet, for nonsingular graph Laplacians (e.g.,generalized graph Laplacians [17], [26]), the process behavesdifferently so that the signal energy vanishes, since the abovelimit converges to the n × n zero matrix as t gets larger.R EFERENCES[1] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysisand an algorithm,” in
Proceedings of the 14th International Conferenceon Neural Information Processing Systems: Natural and Synthetic , ser.NIPS’01. Cambridge, MA, USA: MIT Press, 2001, pp. 849–856.[2] M. Soltanolkotabi, E. Elhamifar, and E. J. Cand`es, “Robust subspaceclustering,”
Ann. Statist. , vol. 42, no. 2, pp. 669–699, 04 2014.[3] U. Luxburg, “A tutorial on spectral clustering,”
Statistics and Computing ,vol. 17, no. 4, pp. 395–416, Dec. 2007.[4] A. J. Smola and I. R. Kondor, “Kernels and regularization on graphs.”in
Proceedings of the Annual Conference on Computational LearningTheory , 2003.[5] A. D. Szlam, M. Maggioni, and R. R. Coifman, “Regularization ongraphs with function-adapted diffusion processes,”
J. Mach. Learn. Res. ,vol. 9, pp. 1711–1739, Jun. 2008.[6] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst,“The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,”
IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, 2013.[7] P. Buhlmann and S. van de Geer,
Statistics for High-Dimensional Data:Methods, Theory and Applications . Springer Publishing Co., Inc., 2011.[8] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing ongraphs: Frequency analysis,”
IEEE Transactions on Signal Processing ,vol. 62, no. 12, pp. 3042–3054, June 2014.[9] P. Milanfar, “A tour of modern image filtering: New insights and meth-ods, both practical and theoretical,”
IEEE Signal Processing Magazine ,vol. 30, no. 1, pp. 106–128, Jan 2013.[10] H. E. Egilmez and A. Ortega, “Spectral anomaly detection using graph-based filtering for wireless sensor networks,” in , May2014, pp. 1085–1089.[11] H. E. Egilmez, Y. H. Chao, A. Ortega, B. Lee, and S. Yea, “GBST:Separable transforms based on line graphs for predictive video coding,”in ,Sept 2016, pp. 2375–2379.[12] S. K. Narang and A. Ortega, “Perfect reconstruction two-channel waveletfilter banks for graph structured data,”
IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2786–2799, June 2012.
[13] A. Anis, A. Gadde, and A. Ortega, “Efficient sampling set selection for bandlimited graph signals using graph spectral proxies,” IEEE Transactions on Signal Processing, vol. 64, no. 14, pp. 3775–3789, July 2016.
[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. Springer, 2008.
[15] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.
[16] R. I. Kondor and J. D. Lafferty, “Diffusion kernels on graphs and other discrete input spaces,” in Proceedings of the Nineteenth International Conference on Machine Learning, ser. ICML ’02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 315–322.
[17] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under Laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, Sept 2017.
[18] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning Laplacian matrix in smooth graph signal representations,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160–6173, Dec 2016.
[19] V. Kalofolias, “How to learn a graph from smooth signals,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2016, pp. 920–929.
[20] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, “Network topology inference from spectral templates,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, Sept 2017.
[21] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, “Characterization and inference of graph diffusion processes from observations of stationary signals,” IEEE Transactions on Signal and Information Processing over Networks, vol. PP, no. 99, pp. 1–1, 2017.
[22] J. Mei and J. M. F. Moura, “Signal processing on graphs: Causal modeling of unstructured data,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 2077–2092, April 2017.
[23] I. M. Johnstone and A. Y. Lu, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[24] P. Ravikumar, M. Wainwright, B. Yu, and G. Raskutti, “High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence,” Electronic Journal of Statistics (EJS), vol. 5, pp. 935–980, 2011.
[25] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.
[26] T. Bıyıkoglu, J. Leydold, and P. F. Stadler, “Laplacian eigenvectors of graphs,” Lecture Notes in Mathematics, vol. 1915, 2007.
[27] J. Lafferty and G. Lebanon, “Diffusion kernels on statistical manifolds,” J. Mach. Learn. Res., vol. 6, pp. 129–163, Dec. 2005.
[28] D. Thanou, X. Dong, D. Kressner, and P. Frossard, “Learning heat diffusion graphs,”
IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 484–499, Sept 2017.
[29] M. G. Rabbat, “Inferring sparse graphs from smooth signals with theoretical guarantees,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 6533–6537.
[30] B. M. Lake and J. B. Tenenbaum, “Discovering structure by learning sparse graph,” in
Proceedings of the 33rd Annual Cognitive Science Conference, 2010, pp. 778–783.
[31] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York, NY, USA: Springer-Verlag, 1998.
[32] Y. D. Castro, T. Espinasse, and P. Rochet, “Reconstructing undirected graphs from eigenspaces,” Journal of Machine Learning Research, vol. 18, no. 51, pp. 1–24, 2017.
[33] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1999.
[34] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.
[35] S. Zhou, P. Rutimann, M. Xu, and P. Buhlmann, “High-dimensional covariance estimation based on Gaussian graphical models,” J. Mach. Learn. Res., vol. 12, pp. 2975–3026, Nov. 2011.
[36] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, Y. Zhu, A. Leetmaa, R. Reynolds, M. Chelliah, W. Ebisuzaki, W. Higgins, J. Janowiak, K. C. Mo, C. Ropelewski, J. Wang, R. Jenne, and D. Joseph, “The NCEP/NCAR 40-year reanalysis project,” Bulletin of the American Meteorological Society, vol. 77, no. 3, pp. 437–471, 1996.
[37] L. C. Evans,