Graph Learning from Filtered Signals: Graph System and Diffusion Kernel Identification
Hilmi E. Egilmez, Student Member, IEEE, Eduardo Pavez, Student Member, IEEE, and Antonio Ortega, Fellow, IEEE
Abstract—This paper introduces a novel graph signal processing framework for building graph-based models from classes of filtered signals. In our framework, graph-based modeling is formulated as a graph system identification problem, where the goal is to learn a weighted graph (a graph Laplacian matrix) and a graph-based filter (a function of graph Laplacian matrices). In order to solve the proposed problem, an algorithm is developed to jointly identify a graph and a graph-based filter (GBF) from multiple signal/data observations. Our algorithm is valid under the assumption that GBFs are one-to-one functions. The proposed approach can be applied to learn diffusion (heat) kernels, which are popular in various fields for modeling diffusion processes. In addition, for specific choices of graph-based filters, the proposed problem reduces to a graph Laplacian estimation problem. Our experimental results demonstrate that the proposed algorithm outperforms the current state-of-the-art methods. We also implement our framework on a real climate dataset for modeling of temperature signals.
Index Terms—Graph learning, graph signal processing, graph-based filtering, graph system identification, diffusion kernels, heat kernels.
I. INTRODUCTION

GRAPHS are fundamental mathematical structures used in various fields to characterize data, signals and processes. Particularly in signal processing, machine learning and statistics, structured modeling of signals/data by means of graphs is essential in numerous problems including clustering [1]-[3] and regularized regression and denoising [4]-[6], where graphs provide concise (sparse) representations for effective modeling and analysis of signals/data [7]. Graph signal processing [6] offers a new general paradigm for processing and analyzing signals/data on graphs, referred to as graph signals, by using graph Laplacian matrices to extend basic signal processing operations such as filtering [9], [10], transformation [11], [12] and sampling [13] to graph signals. However, in practice, datasets typically consist of an unstructured list of samples, where the graph information (representing the structural relations between samples/features) is latent. For analysis, learning, processing and algorithmic purposes, it is often useful to build graph-based models that provide a concise explanation for datasets and also reduce the dimension of the problem [14], [15] by exploiting the available prior knowledge and assumptions about the desired graph (e.g., structural information

(The authors are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089 USA. Contact author e-mail: [email protected]. This work was funded in part by NSF under grant CCF-1410009.)

(In [8], adjacency matrices are used to define basic processing operations. In this work, we adopt the graph Laplacian based construction in [6].)

Fig. 1. Input-output relation of a graph system defined by a graph Laplacian (L) and a graph-based filter (h): x_out = h(L) x_in. In this work, we focus on joint identification of L and h, given the observed data x_out.
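As a concrete sketch of the input-output relation in Fig. 1, the snippet below builds a small combinatorial Laplacian and applies an exponential decay GBF (a diffusion kernel, see Definition 2) to an impulse input. The 4-node graph and the value of β are illustrative choices, not taken from the paper; `scipy.linalg.expm` is used only to cross-check the spectral construction h(L) = U h(Λ) Uᵀ.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative 4-node weighted path graph (weights are arbitrary).
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W            # combinatorial graph Laplacian

beta = 0.5                                # hypothetical diffusion bandwidth
lam, U = np.linalg.eigh(L)                # graph frequencies and GFT basis
h_of_L = U @ np.diag(np.exp(-beta * lam)) @ U.T   # h(L) = U h(Lambda) U^T

x_in = np.array([1., 0., 0., 0.])         # impulse input signal
x_out = h_of_L @ x_in                     # graph system output x_out = h(L) x_in

# Cross-check: the spectral construction equals the matrix exponential.
assert np.allclose(h_of_L, expm(-beta * L))
```

Since L1 = 0, the rows of exp(−βL) sum to one, so the diffusion redistributes the signal over the graph while preserving its total mass.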
including connectivity and sparsity level) and observed data (e.g., signal smoothness).

The focus of this paper is to build graph-based models from signals/data, where the models of interest are defined by graph systems consisting of a graph (a graph Laplacian matrix L) and a graph-based filter (GBF h), as depicted in Fig. 1. For graph-based modeling, graph systems provide a general abstraction, in which graphs represent pairwise relations between the signal samples and GBFs model smoothness of signals defined on a designated graph. In particular, graph systems (with specific choices of GBFs) have been introduced to represent diffusion processes. Prominent examples include diffusion kernels defined on graphs [16] and polynomials of graph Laplacian matrices used to define localized diffusion operators on graphs [6]. Models based on graph systems can be useful in a broad range of applications, including signal/data processing on computer, social, sensor, energy, transportation and biological networks, when signals are observed without knowledge of the graph and GBF associated with the underlying network. For example, signals can be diffused on an unknown network (L) with an unknown rate (β) as the smoothness parameter of a GBF (h_β).

In the literature, there has been a lot of recent interest in the problem of learning graph-based models from data. Several papers, including [17]-[19], consider the problem of learning a graph, represented by a matrix such as a graph Laplacian, where the goal is to find the graph that provides the best fit to the training data according to different criteria. A few studies (see Section III for more details) consider that the training data are the output of an unknown graph system and propose methods under different assumptions on GBFs. In [20], [21], the authors aim to learn a graph (i.e., a graph Laplacian or an adjacency matrix) from eigenvectors of the sample covariance of training data, where the GBF is an arbitrary function.
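The shared-eigenvector property exploited by these approaches can be checked numerically: whenever the covariance is a matrix function of the Laplacian, Σ = h(L) = U h(Λ) Uᵀ, the matrices Σ and L are simultaneously diagonalizable. A small sketch (the graph weights and the filter parameter are illustrative assumptions):

```python
import numpy as np

W = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W             # CGL with distinct eigenvalues

lam, U = np.linalg.eigh(L)                 # ascending graph frequencies
Sigma = U @ np.diag(np.exp(-0.3 * lam)) @ U.T   # covariance h(L), exponential GBF

# Eigenvectors of Sigma match those of L up to sign; since h is decreasing,
# eigh's ascending ordering for Sigma is the reverse of the ordering for L.
_, V = np.linalg.eigh(Sigma)
V = V[:, ::-1]
for i in range(3):
    assert np.isclose(abs(U[:, i] @ V[:, i]), 1.0)
```

With finite data only a noisy sample covariance is available, which is why (as argued below) the accuracy of such eigenvector-based methods degrades when the number of samples is small.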
In [22], the GBF is assumed to be a polynomial function of a specific order, and the goal is to learn a graph (an adjacency matrix) and the coefficients of a polynomial GBF. However, with such general assumptions on GBFs, it is hard to provide guarantees about the performance of the proposed methods. Specifically, in [22] the solutions require explicit sparsity conditions on both graphs and polynomials of adjacency matrices, which are restrictive and may not hold in some practical cases. Moreover, the approaches in [20], [21] are based on the assumption that the sample covariance of the observed data and the graph Laplacian have the same set of eigenvectors. Thus, their performance strictly depends on the accuracy of the eigenvectors obtained from the sample covariance of the data, which is not a good estimator of the true covariance of a model, especially when the number of data samples is small [23], [24]. Indeed, better eigenvector estimates can be obtained using inverse covariance estimation methods [24], [25] or graph learning methods, such as those introduced in our prior work [17], which minimize a regularized maximum-likelihood (ML) criterion.

In this paper, we propose an extension of our prior work [17] to estimate the parameters of a graph system (a graph and a GBF) from data. For this purpose, we formulate a graph system identification (GSI) problem with a regularized ML criterion, where our goal is to jointly identify a combinatorial graph Laplacian (CGL) matrix L and a GBF h from multiple signal/data observations under the main assumption that the GBF h is a one-to-one function. While this assumption is not as general as the assumptions on GBFs made in existing studies [20]-[22], several useful GBFs have this property, as shown in Table I. The key novelty of our work is to propose new methods that rely on the one-to-one GBF property to learn both the graph and the GBF with stronger optimality guarantees as compared to the approaches in [20]-[22].
In Section IV-D, we will show that this assumption is one of the sufficient conditions required for graph system identifiability, and it also allows our algorithm to perform a prefiltering step that significantly improves the estimation accuracy. Although our algorithm can be extended to any one-to-one GBF, we focus on methods to estimate the parametric GBFs listed in Table I, which are one-to-one functions that depend on a single parameter β and have useful applications discussed in Section IV-C. The first three GBF types in the table define basic scaling and shifting operations, while the exponential decay and β-hop localized GBFs provide diffusion-based models. Since all these GBFs yield larger filter responses (h_β) at lower graph frequencies (λ), as illustrated in Fig. 2, they can be used for modeling different classes of smooth graph signals in which most of the signal energy is concentrated in the low graph frequencies.

In order to solve the GSI problem, we propose an alternating optimization algorithm that first optimizes the graph by fixing the GBF and then designs the GBF by fixing the graph. Basically, the proposed algorithm involves three main steps, which are repeated until convergence is achieved:
• The graph-based filter (GBF) identification step designs a GBF h for the current estimate of the graph Laplacian L, so that its inverse (i.e., h⁻¹) can be used in the prefiltering step. Note that h has to be a one-to-one function for its inverse h⁻¹ to exist.
• The prefiltering step filters the observed signals using h⁻¹ to compute the covariance matrix that will be used in the graph Laplacian estimation step. We show that this step significantly improves the accuracy of graph estimation.
• The graph Laplacian estimation step learns a graph from the covariance of the prefiltered signals obtained in the previous step by using the combinatorial graph Laplacian (CGL) estimation algorithm introduced in our prior work [17].
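The role of the one-to-one assumption in the prefiltering step can be illustrated in the idealized, noiseless limit where the true covariance Σ = h_β(L) is available: applying the scalar inverse h_β⁻¹ from Table I to the spectrum of Σ recovers L exactly. The sketch below assumes the exponential decay GBF with known β and an illustrative graph; it does not reproduce the paper's GBF identification or regularized-ML estimation steps.

```python
import numpy as np

W = np.array([[0., 2., 0.],
              [2., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

beta = 0.4                                  # assumed known for this sketch
lam, U = np.linalg.eigh(L)
Sigma = U @ np.diag(np.exp(-beta * lam)) @ U.T   # model covariance h_beta(L)

# "Prefiltering" in the infinite-sample limit: invert the filter response
# on the covariance spectrum using h_beta^{-1}(s) = -log(s)/beta (Table I).
s, V = np.linalg.eigh(Sigma)
L_rec = V @ np.diag(-np.log(s) / beta) @ V.T

assert np.allclose(L_rec, L, atol=1e-8)     # Laplacian recovered exactly
```

With finite samples, Σ is replaced by the sample covariance and the recovered matrix is only an estimate, which is why the algorithm feeds the prefiltered covariance into the CGL estimation step of [17] rather than using it directly.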
Although the present paper focuses on graphs associated with CGL matrices, our solution can be easily extended to other types of graph Laplacian matrices, e.g., generalized graph Laplacians [26].

In order to accommodate the GBFs in Table I in our algorithm, we propose specific methods (for the GBF identification step) to find the filter parameter β fully characterizing the selected GBF. Our proposed algorithm guarantees optimal identification of β and L in an ℓ1-regularized ML sense for frequency shifting, variance shifting and β-hop localized GBFs. However, for frequency scaling and exponential decay GBFs, our algorithm cannot find the optimal β in general, but it guarantees that the estimated L is optimal up to a constant factor. In practice, the type of GBF and its parameter β can also be selected based on the prior knowledge available about the problem and the application. In this case, different GBFs and their parameters can be tested until the estimated graph and GBF pair achieves the desired performance (e.g., in terms of mean square error, likelihood or sparsity), where the parameter β serves as a regularization parameter that can be used, for example, for tuning the smoothness of the signal. Moreover, our algorithm can be extended to support GBFs beyond the filters in Table I (including one-to-one functions with more than one parameter) by developing specific methods for the GBF identification step. As long as a specified GBF (h) has an inverse function (h⁻¹), the proposed prefiltering and graph Laplacian estimation steps can be directly utilized to learn graphs from signals/data.

To the best of our knowledge, this is the first work that (i) formulates the graph-based modeling problem as identification of graph systems with the types of GBFs in Table I under a regularized ML criterion and (ii) proposes a prefiltering-based algorithm to jointly identify a graph and a GBF.
Existing related studies (see Section III) consider optimization of different criteria (see Table III) and do not use prefiltering, which can be shown to be optimal in some cases (see Section V). The proposed approach can significantly improve the accuracy of graph learning from filtered signals, as compared to existing methods that estimate a graph directly from the observed signals without prefiltering. In particular, if the observed signals are diffused/filtered on an unknown network/graph to be learned, applying a graph learning algorithm (e.g., the CGL algorithm in [17]) to the diffused signals potentially results in a dense graph due to diffusion, even when the underlying graph is sparse. On the other hand, our proposed algorithm reverses the effect of diffusion via prefiltering (h⁻¹), whose corresponding GBF (h) is jointly estimated with the graph (L). Thus, the latent graph can be learned more accurately. Note that the diffusion (heat) kernels [16] used in a number of applications [4], [5], [27] are special cases of graph systems with exponential decay GBFs, and thus our proposed algorithm can be applied to learn their parameters.

The rest of the paper is organized as follows. Section II presents the notation and some concepts used throughout

TABLE I
GBFS WITH PARAMETER β AND CORRESPONDING INVERSE FUNCTIONS, WHERE λ† DENOTES THE PSEUDOINVERSE OF A SCALAR λ, THAT IS, λ† = 1/λ FOR λ ≠ 0 AND λ† = 0 FOR λ = 0.

Filter name | h_β(λ) | h_β⁻¹(s) | Range of β
Frequency scaling | (βλ)† | (βs)† | β ∈ R, β > 0
Frequency shifting | (λ + β)† | s† − β | β ∈ R, β ≥ 0
Variance shifting | λ† + β | (s − β)† | β ∈ R, β ≥ 0
Exponential decay | exp(−βλ) | −log(s)/β | β ∈ R, β > 0
β-hop localized | (λ†)^β | (s†)^{1/β} | β ∈ N, β ≥ 1

TABLE II
LIST OF SYMBOLS AND THEIR MEANING
Symbols | Meaning
L | L_c : graph Laplacian or CGL matrix | set of CGLs
h, h_β | H : GBF function | set of GBFs
λ_i, λ_i(L) : i-th eigenvalue of L in ascending order
n | k : number of vertices | number of data samples
I | W | D : identity matrix | adjacency matrix | degree matrix
1 | 0 : column vector of ones | column vector of zeros
Θ⁻¹ | Θ† : inverse of Θ | pseudo-inverse of Θ
Θᵀ | θᵀ : transpose of Θ | transpose of θ
det(Θ) | |Θ| : determinant of Θ | pseudo-determinant of Θ
(Θ)_ij : entry of Θ at the i-th row and j-th column
(θ)_i : i-th entry of θ
≥ (≤) : element-wise greater (less) than or equal to operator
Θ ⪰ 0 : Θ is a positive semidefinite matrix
Tr(Θ) | logdet(Θ) : trace of Θ | natural logarithm of det(Θ)
p(x) : probability density function of random vector x
x ∼ N(0, Σ) : zero-mean multivariate Gaussian with covariance Σ
‖θ‖₁, ‖Θ‖₁ : sum of absolute values of all elements (ℓ1-norm)
‖θ‖₂², ‖Θ‖F² : sum of squared values of all elements

the paper. Section III discusses the prior related work. In Section IV, the GSI problem and its variations are formulated. Additionally, the graph system identifiability conditions are introduced. In Section V, we derive optimality conditions and develop methods to solve the GSI problem. The experimental results are presented in Section VI, and Section VII draws some conclusions.

II. NOTATION AND PRELIMINARIES
Throughout the paper, lowercase normal (e.g., a and θ), lowercase bold (e.g., a and θ) and uppercase bold (e.g., A and Θ) letters denote scalars, vectors and matrices, respectively. Unless otherwise stated, calligraphic capital letters (e.g., E and S) represent sets. Notation is summarized in Table II.

The graph-based models considered in this paper are defined based on undirected, simple weighted graphs with nonnegative edge weights and no self-loops. Let G = (V, E, f_w) be a simple weighted graph with n vertices in the set V = {v_1, . . . , v_n}, where the edge set E = {e | f_w(e) ≠ 0, ∀e ∈ P_u} is a subset of P_u, the set of all unordered pairs of distinct vertices, and f_w((v_i, v_j)) ≥ 0 for i ≠ j is a real-valued edge weight function. The adjacency matrix of G is an n × n symmetric matrix, W, such that (W)_ij = (W)_ji = f_w((v_i, v_j)) for (v_i, v_j) ∈ P_u. The degree matrix of G is an n × n diagonal matrix, D, with entries (D)_ii = Σ_{j=1}^{n} (W)_ij and (D)_ij = 0 for i ≠ j. The combinatorial graph Laplacian (CGL) of G is defined as L = D − W. Alternatively, the set of CGL matrices can be written as

L_c = {L | L ⪰ 0, (L)_ij ≤ 0 for i ≠ j, L1 = 0}.   (1)

By construction, CGLs are symmetric and positive semidefinite matrices, so each of them has a complete set of orthogonal eigenvectors u_1, u_2, . . . , u_n whose associated eigenvalues λ_1 ≤ λ_2 ≤ · · · ≤ λ_n are nonnegative real numbers. In addition, the CGL of a connected graph always has a zero eigenvalue (i.e., λ_1 = 0 with multiplicity one) whose associated eigenvector is u_1 = (1/√n)1.

In graph signal processing, the eigenpairs of a CGL matrix, (λ_1, u_1), (λ_2, u_2), . . . , (λ_n, u_n), provide a Fourier-like spectral interpretation for signals defined on graphs, where the graph frequency spectrum is defined by the eigenvalues of the CGL, which are called graph frequencies, and the eigenvectors of the CGL u_1, u_2, . . .
, u_n are the harmonics associated with the graph frequencies. Based on the eigenvectors of a CGL matrix L, the graph Fourier transform (GFT) is defined as the orthogonal matrix U (i.e., satisfying UᵀU = I) obtained by the eigendecomposition L = UΛUᵀ, where Λ denotes the diagonal matrix consisting of the eigenvalues λ_1, λ_2, . . . , λ_n (graph frequencies).

(Without loss of generality, we assume that the eigenvalues are ordered as λ_1 ≤ λ_2 ≤ · · · ≤ λ_n throughout the paper.)

For a given signal x = [x_1 x_2 · · · x_n]ᵀ defined on a graph G with n vertices, where x_i is attached to v_i (the i-th vertex), the GFT of x is x̂ = Uᵀx. Naturally, the variation of the GFT basis vectors gradually increases with the graph frequencies, and the GFT basis vectors corresponding to low frequencies are relatively smooth. As a specific example, the GFT basis vector u_1 = (1/√n)1, associated with the lowest graph frequency (λ_1 = 0), is the smoothest among the GFT basis vectors.

To quantify the smoothness of a graph signal x, a common variation measure used in graph signal processing is the graph Laplacian quadratic form, xᵀLx, which can be written in terms of the edge weights of G as

xᵀLx = Σ_{(i,j)∈I} (W)_ij (x_i − x_j)²   (2)

where (W)_ij = −(L)_ij, and I = {(i, j) | (v_i, v_j) ∈ E} is the set of index pairs of vertices associated with the edge set E. A smaller Laplacian quadratic term (xᵀLx) indicates a smoother graph signal (x). A more general notion of smoothness can be obtained by defining graph-based filters (GBFs), i.e., matrix functions h(L) = U h(Λ) Uᵀ, where U is the GFT and (h(Λ))_ii = h((Λ)_ii) = h(λ_i) for all i. Note that a GBF h serves as a scalar function of λ that maps graph frequencies λ_1, . . . , λ_n
Fig. 2. Illustration of the graph-based filters (GBFs) in Table I for different values of β, as a function of graph frequency h_β(λ): (a) frequency scaling, (b) frequency shifting, (c) variance shifting, (d) exponential decay, (e) β-hop localized.

to filter responses h(λ_1), . . . , h(λ_n). Thus, the measure in (2) can be generalized using a GBF h as

xᵀh(L)x = xᵀU h(Λ) Uᵀx = Σ_{i=1}^{n} h(λ_i) x̂_i²,   (3)

where x̂_i = u_iᵀx for i = 1, . . . , n denote the GFT coefficients. Based on a graph and a GBF, we formally define a graph system as follows.

Definition 1 (Graph System). A graph system is defined by a simple graph G and a GBF h such that the input-output relation of the system is x_out = h(L) x_in, where L ∈ L_c is a CGL matrix associated with G, and x_in is the input signal vector.

In this paper, the graph system parameter h is selected from a set of GBFs H determined based on the GBF types in Table I. As a specific example, the exponential decay GBFs lead to the set H = {h | h(λ) = exp(−βλ), β ∈ R and β > 0} and the operator h(L) = exp(−βL), which is also known as the diffusion (heat) kernel defined on a graph [16].

Definition 2 (Diffusion Kernels on Graphs). The diffusion kernel over a graph G is the matrix exponential of L, that is,

exp(−βL) = lim_{t→∞} (I − βL/t)^t,  t ∈ N,   (4)

where L denotes a graph Laplacian associated with G and the parameter β is a real number called the diffusion bandwidth.

Appendix A presents a derivation of diffusion kernels (i.e., exponential GBFs) based on a basic class of random diffusion processes defined by CGLs.

III. RELATED WORK
Table III summarizes prior related studies by comparing them against our present work in terms of (i) target variables, (ii) underlying assumptions and (iii) optimization criteria. In our previous work on graph learning from data [17], we proposed algorithms for estimating models based on three different types of graph Laplacians (including CGLs). The present paper extends our prior work by introducing more general graph-based models based on graph systems, defined by a CGL and a GBF, and by proposing an algorithm to learn the parameters of a graph system from multiple signal/data observations.

(Filter responses corresponding to eigenvalues with multiplicity more than one have the same value. Formally, if λ_i = λ_{i+1} = · · · = λ_{i+c−1} for some i > 0 and multiplicity c > 1, then h(λ_i) = h(λ_{i+1}) = · · · = h(λ_{i+c−1}).)

In the literature, there are several papers on graph learning from signals/data. Methods to estimate CGLs from smooth signals are proposed in [18] and [19], but, in contrast with our work, GBFs are not considered explicitly. Instead, the graph estimation problem is formulated as the minimization of a regularized graph Laplacian quadratic form (i.e., a smoothness metric for graph signals). In [20], [21], the authors focus on learning graph shift/diffusion operators (e.g., adjacency and graph Laplacian matrices) from a complete set of eigenvectors that has to be computed from the observed data. More specifically, Segarra et al. [20] solve a sparse recovery problem by minimizing the ℓ1-norm of the target variable (i.e., minimizing ‖Θ‖₁ where Θ is the target variable) to infer the topology of a graph, and Pasdeloup et al. [21] estimate an adjacency matrix by minimizing its trace, Tr(Θ), as well as its ℓ1-norm, ‖Θ‖₁. In fact, the problems in [20] and [21] are equivalent if the target matrix is constrained to be a CGL, since the problems of minimizing ‖L‖₁ and Tr(L) over the set of CGL matrices (i.e., L ∈ L_c) lead to the same solution.
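One way to see this equivalence: every CGL has nonpositive off-diagonal entries and zero row sums, so ‖L‖₁ = Tr(L) + Σ_{i≠j}|(L)_ij| = 2 Tr(L), and the two objectives differ only by a constant factor. A quick check with a randomly weighted CGL (the random graph is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                         # random symmetric nonnegative weights
L = np.diag(W.sum(axis=1)) - W      # a valid CGL

assert np.allclose(L.sum(axis=1), 0.0)                 # L1 = 0
assert np.isclose(np.abs(L).sum(), 2.0 * np.trace(L))  # ||L||_1 = 2 Tr(L)
```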
Note that both of these methods [20], [21] use only the eigenvectors of the sample covariance, while its eigenvalues (which also carry graph information implicitly) are not exploited in graph estimation. Although our approach also requires a set of eigenvectors to be computed, they are not directly used to estimate a graph. Instead, the computed eigenvectors and their associated eigenvalues are used in the prefiltering step, and then a graph is estimated from the covariance of the prefiltered signals by minimizing a regularized ML criterion. The optimality of our prefiltering-based approach is discussed in Section V.

Thanou et al. [28] address the estimation of a graph and a sparse input (i.e., localized sources) from a set of observed signals, and they propose a dictionary-based method, where a graph estimation problem is solved to construct a dictionary consisting of multiple diffusion kernels, and the resulting dictionary is used for identifying the localized sources in the diffusion process. Noting that our paper focuses on the GSI problem without locality assumptions on diffused/filtered signals (see Table III), when no locality assumptions are imposed and a single diffusion kernel is used in the dictionary, the problem in [28] reduces to the graph estimation problem in [18], formulated as the minimization of a regularized graph Laplacian quadratic form. In contrast with the work in [28], our algorithm iteratively solves a different graph learning problem with a regularized ML criterion (i.e., the CGL estimation problem in [17]) and also performs prefiltering on the observed signals. Since our algorithm often results in a more accurate

(We refer to [17] for a discussion of methods beyond CGL estimation.)
TABLE III
AN OVERVIEW OF THE RELATED WORK ON LEARNING GRAPHS FROM SMOOTH/DIFFUSED SIGNALS.

References | Target variable(s) | Assumptions (graph / filter / signal) | Optimization criterion
Dong et al. [18], Kalofolias [19] | graph Laplacian | none / none / none | regularized Laplacian quadratic form
Segarra et al. [20] | graph shift operator | none / none / none | ℓ1-norm of graph shift operator
Pasdeloup et al. [21] | adjacency matrix | none / none / none | trace and ℓ1-norm of adjacency
Thanou et al. [28] | multiple heat kernels, source locations | none / heat kernel / localized | regularized least squares
Mei and Moura [22] | polynomials of adjacency | sparse / polynomial / none | regularized least squares
Egilmez et al. [17] | graph Laplacian | none / none / none | regularized maximum likelihood
This work | graph Laplacian, GBF (Table I) | none / one-to-one function / none | regularized maximum likelihood

graph estimation as compared to the method in [18] (see Section VI for the details), it can be extended to construct potentially better dictionaries as alternatives to the ones in [28] used for identifying localized sources. Such an extension is out of the scope of the present paper. In [22], Mei and Moura address the estimation of polynomials of adjacency matrices by solving a regularized least-squares problem. As their counterparts, polynomials of graph Laplacians can be used to approximate the GBFs in Table I as well as many other types of filters, such as bandlimited GBFs. Since polynomial filters provide more degrees of freedom in designing GBFs, they can be considered more general than our GBFs of interest. However, polynomials are not one-to-one functions in general, so our proposed prefiltering step cannot be applied. The algorithm in [22] provides some optimality guarantees in a mean-square sense, but this is under a restrictive set of assumptions, which require the polynomials of adjacency matrices to be sparse.
Our proposed algorithm provides stronger optimality guarantees in a regularized ML sense without imposing any assumptions on the sparsity of graphs (see Table III), as long as the GBFs are one-to-one. The identification of graph systems with polynomial GBFs will be studied as part of our future work.

IV. PROBLEM FORMULATION: GRAPH SYSTEM IDENTIFICATION
Our formulation is based on the following general assumption on a GBF h, which holds for the GBFs in Table I.

Assumption 1. We assume that a graph-based filter h(λ) is a nonnegative and one-to-one function of λ.

A. Filtered Signal Model
We formulate the GSI problem in a probabilistic setting by assuming that the observed (filtered) signals have been sampled from a zero-mean n-variate Gaussian distribution x ∼ N(0, h(L)),

p(x | h(L)) = (2π)^{−n/2} |h(L)|^{−1/2} exp( −(1/2) xᵀ h(L)† x ),   (5)

with the covariance Σ = h(L) defined based on a CGL matrix L and a GBF h.

(In practice, powers of adjacency/Laplacian matrices lead to dense matrices. The zero-mean assumption is made to simplify the notation; our model can be trivially extended to multivariate Gaussian distributions with nonzero mean.)

For modeling smooth graph signals, it is reasonable to choose h(λ) to be a monotonically decreasing function satisfying h(λ_1) ≥ h(λ_2) ≥ · · · ≥ h(λ_{n−1}) ≥ h(λ_n) > 0, where the graph frequencies (eigenvalues of L) are ordered as λ_1 ≤ λ_2 ≤ · · · ≤ λ_n. Thus, the corresponding covariance matrix h(L) represents graph signals whose energy is larger at lower graph frequencies. Note that the nonnegativity condition in Assumption 1, that is, h(λ_i) ≥ 0 for i = 1, . . . , n, ensures that the covariance h(L) is a positive semidefinite matrix.

From the graph system perspective (see Fig. 1 and Definition 1), the above probabilistic model corresponds to the case where the input is n-variate white Gaussian noise, x_in ∼ N(0, Σ_in = I). Then the covariance of the output vector x_out = h(L) x_in becomes Σ_out = h(L) Σ_in h(L)ᵀ = h(L)², which can be translated into our model in (5) by simply transforming Σ_out as Σ = Σ_out^{1/2} = h(L), where the mapping between h(L)² and h(L) is one-to-one (i.e., there is no loss of information after the transformation), since the eigenvalues of h(L) are nonnegative.

B. Problem Formulation
Our goal is to estimate the graph system parameters L and h based on k random samples, x_i for i = 1, . . . , k, obtained from the signal model in (5), by maximizing the likelihood of h(L), that is,

∏_{i=1}^{k} p(x_i | h(L)) = c |h(L)|^{−k/2} ∏_{i=1}^{k} exp( −(1/2) x_iᵀ h(L)† x_i )   (6)

with the constant c = (2π)^{−kn/2}. The maximization of (6) can be equivalently stated as minimizing its negative log-likelihood, which leads to the following problem for estimating the graph system parameters L and h:

(L̂, ĥ) = argmin_{L,h} { Σ_{i=1}^{k} Tr(x_iᵀ h(L)† x_i) − k log|h(L)†| } = argmin_{L,h} { Tr(h(L)† S) − log|h(L)†| }   (7)

where S denotes the sample covariance calculated using the k samples x_i, i = 1, . . . , k. In our problem formulation, we additionally introduce a sparsity-promoting weighted ℓ1-regularization, ‖L ⊙ H‖₁, for robust graph estimation, where H is a symmetric regularization matrix and ⊙ denotes element-wise multiplication. Thus, the proposed GSI problem is formulated as follows:

minimize_{L ⪰ 0, β}   Tr(h_β(L)† S) − log|h_β(L)†| + ‖L ⊙ H‖₁
subject to   L1 = 0,  (L)_ij ≤ 0 for i ≠ j   (8)

where β is the (unknown) parameter for a given type of GBF h_β from Table I, and H denotes the regularization matrix. The constraints ensure that L is a CGL matrix. In this work, we particularly choose H = α(2I − 11ᵀ), so that the regularization term in (8) reduces to the standard ℓ1-regularization α‖L‖₁ with parameter α.

(All of the GBFs in Table I are monotonically decreasing functions on the interval λ ∈ (0, ∞). The exponential decay and frequency shifting GBFs further satisfy h(λ_1) ≥ h(λ_2) at the zero frequency (i.e., λ_1 ≤ λ_2), while for the other GBFs we have h(λ_2) ≥ h(λ_1). The formulation can be trivially extended to different types of graph Laplacians (e.g., the generalized graph Laplacian) discussed in [17].)

C. Special Cases of Graph System Identification Problem and Applications of Graph-based Filters
Graph Learning.
The problem in (8) can be reduced to the CGL estimation problem proposed in our previous work [17] by choosing the filter h_β to be the frequency scaling or β-hop localized GBF with β = 1, so that we have

h_β(λ) = λ† = { 1/λ for λ > 0;  0 for λ = 0 },  that is,  h_β(L) = L†.   (9)

Since h_β(L)† = L, we can rewrite the objective function in (8) as

Tr(LS) − log|L| + ‖L ⊙ H‖₁   (10)

which is the objective function of the graph learning problems originally introduced in [17].

Graph Learning from Noisy Signals.
The variance shifting filter corresponds to a noisy signal model with variance β = σ². Specifically, assume that the noisy signal is modeled as x̂ = x + e, where x ∼ N(0, L†) denotes the original signal, and e ∼ N(0, σ²I) is an additive white Gaussian noise vector independent of x with variance σ². Then the noisy signal x̂ is distributed as x̂ ∼ N(0, Σ̂ = L† + σ²I). The same model is obtained by using a variance shifting filter with β = σ², so that

h_β(λ) = λ† + β  and  h_β(L) = L† + βI = L† + σ²I.   (11)

With this type of GBF, our GSI problem in (8) can be solved to learn graphs from noisy signals by identifying L and the noise parameter β = σ². This problem can also be viewed as an extension of our formulations in [17], which were derived based on signal models assumed to be noise-free (i.e., β = 0), so it can be solved to improve the performance of graph learning from data in the presence of noise [29].
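A minimal simulation of this noisy model, with an illustrative 3-node graph and σ² = 0.25: signals with covariance L† are corrupted by white Gaussian noise, and the sample covariance approaches h_β(L) = L† + σ²I as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
L = np.diag(W.sum(axis=1)) - W
L_pinv = np.linalg.pinv(L)

# Square root of L^dagger, used to sample x ~ N(0, L^dagger) from white noise.
lam_p, U = np.linalg.eigh(L_pinv)
half = U @ np.diag(np.sqrt(np.clip(lam_p, 0.0, None))) @ U.T

sigma2 = 0.25
k = 200000
x = half @ rng.standard_normal((3, k))             # original smooth signals
e = np.sqrt(sigma2) * rng.standard_normal((3, k))  # additive white noise
S = ((x + e) @ (x + e).T) / k                      # sample covariance

# Empirically close to the variance shifting model h_beta(L) = L^dagger + beta*I.
assert np.allclose(S, L_pinv + sigma2 * np.eye(3), atol=0.05)
```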
Graph Learning from Frequency Shifted Signals.
For the frequency shifting filter with parameter β, we have

h_β(λ) = (λ + β)†  and  h_β(L) = (L + βI)†.   (12)

By substituting h_β(L) with (L + βI)†, the problem in (8) can be written as

minimize_{L̃ ⪰ 0, β ≥ 0}   Tr(L̃S) − log|L̃| + ‖L̃ ⊙ H‖₁
subject to   L̃ = L + βI,  L1 = 0,  (L)_ij ≤ 0 for i ≠ j   (13)

which is the shifted combinatorial Laplacian (SCGL) estimation problem originally proposed in [30].

Diffusion (Heat) Kernel Learning.
To learn diffusion kernels, formally defined in Definition 2, the proposed GSI problem in (8) can be modified by choosing h_β(λ) as the exponential decay filter, so that we get

h_β(L) = U h_β(Λ) Uᵀ = U exp(−βΛ) Uᵀ = exp(−βL).   (14)

The resulting problem can be solved to learn diffusion kernels from signals/data, which are popular in many applications [4], [5], [27].

Graph Learning from β-hop Localized Signals. To learn graphs from β-hop localized signals, where β is a positive integer, the GBF can be selected as h(λ) = (λ†)^β, so that the corresponding inverse covariance (i.e., precision) matrix in (8) is h(L)† = L^β, which is generally not a graph Laplacian matrix due to the exponent β. However, it defines diffusion operators (on a graph associated with L) that can be used as alternatives to heat kernels, and the corresponding signal model leads to β-hop localized/diffused signals, in which each sample depends on samples located at most β hops away.

D. Graph System Identifiability
In statistical learning theory, a probabilistic model is identifiable if it is possible to learn the model parameters exactly from an infinite number of data samples. This requires different model parameters to generate different probability distributions [31], which is formally stated in the following definition.
Definition 3 (Model Identifiability [31]). Let M = {p(x|Θ) : Θ ∈ P_Θ} be the family of probabilistic models defined by the parameter space P_Θ. M is identifiable if the mapping from P_Θ to M is one-to-one, that is, if Θ₁ ≠ Θ₂ then p(x|Θ₁) ≠ p(x|Θ₂) for arbitrary Θ₁, Θ₂ ∈ P_Θ.

Based on the above definition, we introduce three different notions of identifiability for graph systems characterizing the probabilistic model discussed in Section IV-A.

Definition 4 (Notions of Identifiability). Let the parameter set of the signal model in (5) be P_Θ = {(L, h) | L ∈ 𝓛, h ∈ 𝓗}, such that a class of graph systems is defined by a specific choice of the sets 𝓛 and 𝓗, denoting the admissible graphs and filters in the class. For any such class of graph systems, we define the following types of identifiability:
• A class is filter identifiable if h₁ ≠ h₂ implies h₁(L) ≠ h₂(L) for arbitrary L ∈ 𝓛 and h₁, h₂ ∈ 𝓗.
• A class is graph identifiable if L₁ ≠ L₂ implies h(L₁) ≠ h(L₂) for arbitrary h ∈ 𝓗 and L₁, L₂ ∈ 𝓛.
• A class is jointly identifiable if any two pairs satisfying (L₁, h₁) ≠ (L₂, h₂) lead to h₁(L₁) ≠ h₂(L₂) for arbitrary L₁, L₂ ∈ 𝓛 and h₁, h₂ ∈ 𝓗.

The following proposition categorizes the classes of systems defined by the parametric GBFs in Table I in terms of their identifiability.

Proposition 1.
Let 𝓛_c and 𝓗 define a class of graph systems. All classes corresponding to the parametric GBFs in Table I are both filter and graph identifiable. In addition, the classes with β-hop localized, frequency shifting or variance shifting GBFs in Table I are jointly identifiable.

Proof: The proof follows from Definition 4 by checking the types of identifiability for each GBF with different β parameters. For filter identifiability the proof is straightforward, because two different parameters β₁ ≠ β₂ imply h_β₁(L) ≠ h_β₂(L). Since the GBFs of interest are one-to-one functions, the corresponding classes are also graph identifiable by Proposition 2. The classes with frequency scaling and exponential decay GBFs are not jointly identifiable, since we have h_β₁(L₁) = h_β₂(L₂) when L₂ = (β₁/β₂)L₁. However, no such construction leading to h_β₁(L₁) = h_β₂(L₂) exists for the other GBF types, as shown in (24)–(30); thus the corresponding classes are jointly identifiable. ∎

For a general choice of 𝓗, the following proposition states a sufficient condition for graph identifiability.

Proposition 2.
A class of graph systems defined by 𝓛 and 𝓗 is graph identifiable if 𝓛 is a set of CGLs and 𝓗 is a set of one-to-one functions.

Proof: Assuming that all h ∈ 𝓗 are one-to-one functions, we need to show that for L₁ ≠ L₂ in 𝓛, we have h(L₁) ≠ h(L₂). Specifically, if L₁ = UΛ₁Uᵀ and L₂ = UΛ₂Uᵀ have the same GFT, then U h(Λ₁) Uᵀ ≠ U h(Λ₂) Uᵀ is satisfied, because h is one-to-one and Λ₁ ≠ Λ₂. The proof is obvious when the GFTs of L₁ and L₂ are different. ∎

To derive necessary conditions for graph identifiability, an additional set of assumptions is needed. For example, if graphs (i.e., graph Laplacians in 𝓛) are assumed to be sparse, the one-to-one requirement on 𝓗 may not be necessary. In particular, [32] investigates the identifiability of graphs from their eigenspaces (i.e., from GFTs as in [20]) and introduces necessary conditions for a specific type of identifiability, called diagonal identifiability. To the best of our knowledge, there is no other analysis of the identifiability of graphs in the literature. As part of our future work, we will focus on characterizing necessary and sufficient conditions for joint identifiability under more general choices of 𝓗.

V. PROPOSED SOLUTION AND OPTIMALITY CONDITIONS
Algorithm 1 is proposed to solve the GSI problem in (8) for a given sample covariance S and a selected type of GBF h_β.

Algorithm 1: Graph System Identification (GSI)
Input: sample covariance S, graph-based filter type h_β
Output: graph Laplacian L̂ and filter parameter β̂
1: Obtain U and Λ_s via the eigendecomposition S = UΛ_sUᵀ.
2: Initialize the parameter β̂: for variance or frequency shifting GBFs, apply the initialization methods in Section V-C; for the other GBF types, apply random initialization.
3: repeat
4:   Prefilter the sample covariance S:
       S_pf = (h_β̂⁻¹(S))† = U (h_β̂⁻¹(Λ_s))† Uᵀ. (15)
5:   Estimate L from the prefiltered data S_pf:
       L̂ ← run the CGL algorithm in [17] to solve (23).
6:   Update the filter parameter β̂ (or skip this step if β̂ is already optimal):
       β̂ ← apply the GBF-specific update in Section V-C.
7: until convergence is achieved
8: return the graph system parameters L̂ and β̂

After obtaining U and Λ_s via the eigendecomposition of S and initializing the parameter β (lines 1 and 2), the algorithm performs three main steps to find the optimal graph Laplacian L and filter parameter β:
1) The prefiltering step applies inverse filtering to the sample covariance S in order to reverse the effect of the filter h in the graph system. Without the prefiltering step, it may be impossible to effectively recover L from S, since Σ† = h(L)† is generally not a graph Laplacian. The proposed prefiltering allows us to effectively estimate the original eigenvalues of L (i.e., Λ_λ) from the prefiltered covariance S_pf in (15).
2) The graph Laplacian estimation step uses the CGL estimation algorithm introduced in [17] to learn L̂ from the prefiltered covariance S_pf.
3) The filter parameter selection step finds the best matching β for the graph system. Depending on the type of GBF, we propose different methods for parameter selection.

In the rest of this section, we derive the optimality conditions and discuss optimal prefiltering, graph Laplacian estimation and filter parameter selection in Sections V-A, V-B and V-C, respectively.

A. Optimal Prefiltering
Let the CGL L and the GBF h be the parameters of a graph system, so that the covariance matrix of the model in (5) is Σ = h(L). There is a GFT matrix U jointly diagonalizing L and Σ such that L = UΛ_λUᵀ and Σ = UΛ_σUᵀ. In the ideal case where S = Σ is obtained from an asymptotically large number of samples (k), a change of variables turns the objective function in (7) into

J(U, h(Λ_λ)) = Tr(U h(Λ_λ)† Λ_σ Uᵀ) − log|U h(Λ_λ)† Uᵀ|, (16)

which simplifies, using properties of Tr(·) and |·|, to

J(h(Λ_λ)) = Tr(h(Λ_λ)† Λ_σ) − log|h(Λ_λ)†|, (17)

where the GBF h and the diagonal matrix of graph frequencies Λ_λ are unknown, and the diagonal matrix Λ_σ is known from the data. By letting φ_i = h(λ_i)† = (h(Λ_λ)†)_ii and σ_i = (Λ_σ)_ii for i = 1, 2, …, n, we can write (17) as

J(φ) = Σ_{i=1}^{n} (φ_i σ_i − log(φ_i)), (18)

where φ = [φ₁ φ₂ ⋯ φ_n]ᵀ. In minimizing the convex function (18), the optimal solution satisfies the following necessary and sufficient conditions [33], obtained by taking the derivative of (18) with respect to φ_i and setting it to zero:

∂J(φ)/∂φ_i = σ_i − 1/φ_i = 0. (19)

By a change of variables, the optimality conditions can be stated as

h(λ_i) = (h(Λ_λ))_ii = (Λ_σ)_ii = σ_i  ∀i. (20)

Based on Assumption 1, we can also write (20) as

h⁻¹(h(λ_i)) = λ_i = h⁻¹(σ_i)  ∀i, (21)

where h⁻¹ is the inverse function of h. In matrix notation, (21) can be stated more compactly as

h⁻¹(Σ) = U h⁻¹(Λ_σ) Uᵀ = UΛ_λUᵀ = L. (22)

This condition shows that we can find the optimal Laplacian L via inverse filtering (inverse prefiltering), h⁻¹(Σ). However, in practice we only have access to a sample estimate of Σ (i.e., S) obtained from a limited number of data samples (k), which is not a good estimator, especially when k is small [24]. Thus, computing h⁻¹(S) generally does not lead to a CGL matrix.
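For the exponential decay GBF, condition (22) can be verified numerically: applying the inverse filter (here, a scaled matrix logarithm) to the true covariance Σ = exp(−βL) recovers L exactly. A minimal sketch, using a toy graph not taken from the paper:

```python
import numpy as np
from scipy.linalg import expm, logm

# Toy CGL and its exponentially filtered (diffusion-kernel) covariance.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
beta = 0.4
Sigma = expm(-beta * L)        # Sigma = h_beta(L) = exp(-beta*L)

# Inverse prefiltering of the *true* covariance, as in (22): h^{-1}(Sigma) = L.
L_rec = -logm(Sigma).real / beta
print(np.allclose(L_rec, L))   # True
```

Replacing Σ with a finite-sample estimate S breaks this: h⁻¹(S) typically has positive off-diagonal entries and is no longer a CGL.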
In order to address this problem, the proposed Algorithm 1 first estimates the prefiltered sample covariance S_pf as in (15), and then employs the CGL estimation algorithm [17] to find the best CGL from S_pf by minimizing the criterion in (10).

B. Optimal Graph Laplacian Estimation
For a GBF h (or h_β) satisfying the optimal prefiltering condition in (21), the GSI problem in (8) can be rewritten as the following graph learning problem:

minimize_{L ⪰ 0}  Tr(LS_pf) − log|L| + ‖L ⊙ H‖₁
subject to  L1 = 0,  (L)_ij ≤ 0 for i ≠ j, (23)

where S_pf = (h⁻¹(S))† is obtained by prefiltering the empirical covariance S. This problem is discussed in detail in our previous work [17], where we derived the optimality conditions for (23) and developed the CGL estimation algorithm to solve it.

C. Filter Parameter Selection
Based on the optimality condition in (21), we propose different methods to identify the parameter β for the GBFs in Table I. Specifically, optimal parameter initializations are derived for the variance and frequency shifting GBFs, and a line search method is proposed for the β-hop localized GBF. Since graph systems with exponential decay and frequency scaling GBFs are not jointly identifiable (as discussed in Section IV-D), we cannot identify β optimally for them. Yet, we show that their graph Laplacian matrices can be identified up to a constant factor, which depends on β.

Initialization for Variance/Frequency Shifting Filters.
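For a CGL, the eigenvector of the zero eigenvalue is the constant vector u₁ = (1/√n)1, so σ₁ = u₁ᵀΣu₁ is directly computable from the (sample) covariance. The sketch below (toy graph and β value are assumptions, not from the paper) checks the closed-form initializations for both filter types derived in this subsection:

```python
import numpy as np

# Toy CGL (illustrative only).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
n = L.shape[0]
u1 = np.ones(n) / np.sqrt(n)   # zero-eigenvalue eigenvector of any CGL

beta_true = 0.3

# Variance shifting: Sigma = L^dagger + beta*I  =>  beta = u1' Sigma u1, as in (25).
Sigma_vs = np.linalg.pinv(L) + beta_true * np.eye(n)
beta_vs = u1 @ Sigma_vs @ u1

# Frequency shifting: Sigma = (L + beta*I)^{-1}  =>  beta = 1/(u1' Sigma u1), as in (27).
Sigma_fs = np.linalg.inv(L + beta_true * np.eye(n))
beta_fs = 1.0 / (u1 @ Sigma_fs @ u1)

print(round(beta_vs, 6), round(beta_fs, 6))  # 0.3 0.3
```

Both recoveries are exact because L†u₁ = 0 and (L + βI)⁻¹u₁ = (1/β)u₁, so the DC component of Σ isolates β.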
For both variance and frequency shifting GBFs, the optimal β is found by calculating σ₁ = u₁ᵀΣu₁, where u₁ is the eigenvector associated with the zero eigenvalue of the Laplacian ((Λ_λ)₁₁ = λ₁ = 0). Specifically, by using the optimality condition in (21),
• if h_β(λ) is a variance shifting filter, we get

h_β(λ₁) = σ₁ = u₁ᵀΣu₁ = λ₁† + β; (24)

since λ₁ = λ₁† = 0, the optimal β satisfies

β = u₁ᵀΣu₁ = σ₁. (25)

• if h_β(λ) is a frequency shifting filter, we obtain

h_β(λ₁) = σ₁ = u₁ᵀΣu₁ = 1/(λ₁ + β) = 1/β, (26)

so the optimal β satisfies

β = 1/(u₁ᵀΣu₁) = 1/σ₁. (27)

Since the optimal β can be directly estimated from the sample covariance S, as calculated from Σ in (25) and (27), Algorithm 1 uses the optimized initial β (in line 2) for prefiltering and then estimates L, so that the filter parameter update (line 6) is skipped for graph systems with frequency and variance shifting filters.

Exponential Decay and Frequency Scaling Filters.
Suppose that L̂ is obtained by inverse prefiltering of Σ with β̂, that is, L̂ = h_β̂⁻¹(Σ).
• If h_β(λ) is a frequency scaling filter, we have

h_β̂⁻¹(σ_i):  λ₁ = σ₁ = 0,  (β/β̂)λ_i = 1/(β̂σ_i) for i = 2, …, n. (28)

• If h_β(λ) is an exponential decay filter, we have

h_β̂⁻¹(σ_i) = −log(σ_i)/β̂ = −log(exp(−βλ_i))/β̂ = (β/β̂)λ_i. (29)

Based on (28) and (29), for any selected β̂, the resulting matrix is a CGL satisfying L̂ = (β/β̂)L, where L and β denote the original graph system parameters. Since the inverse of a different GBF (i.e., h_β̂⁻¹ with β ≠ β̂) leads to a CGL for any β̂, Algorithm 1 can only find the optimal CGL matrix L up to a constant factor β/β̂. In practice, the parameter β̂ can be tuned so that the desired normalization (scaling factor) for L is achieved.

β-hop Localized Filter. For estimation of the optimal hop count β in Algorithm 1, inverse prefiltering with β̂ gives

h_β̂⁻¹(σ_i):  λ₁ = σ₁ = 0,  λ_i^{β/β̂} = (1/σ_i)^{1/β̂} for i = 2, …, n. (30)

Since this requires the graph learning step to estimate L^{β/β̂}, which is not a graph Laplacian in general, Algorithm 1 cannot guarantee optimal graph system identification unless β = β̂. In order to find the optimal β, we perform a line search over a given range of integers, optimizing

β̂ = argmin_{β̂ ∈ ℕ} ‖h_β̂(L̂) − S‖_F. (31)

VI. RESULTS
A. Graph Learning from Diffusion Signals/Data
We evaluate the performance of our proposed graph system identification method (GSI, Algorithm 1) by benchmarking it against the current state-of-the-art approaches proposed for learning graphs from smooth signals (GLS) [18], [19], as well as the graph topology inference (GTI) method in [20]. The proposed GSI is also compared against the CGL estimation algorithm, CGL [17] (i.e., CGL estimation without prefiltering), and the inverse prefiltering (IPF) approach, which estimates a graph Laplacian matrix by inverting the prefiltered covariance S_pf in (15), so that L̂ = h_β⁻¹(S) = S_pf†. For this purpose, we generate several artificial datasets based on the signal model in (5), defined by a graph Laplacian (L) and a GBF (h_β), where the dataset entries are generated by random sampling from the distribution N(0, h_β(L)). The generated data are then used by the proposed and benchmark algorithms to recover the corresponding graph Laplacian L. We repeat our experiments for different L and h_β, where the graphs are constructed using three different graph connectivity models:
• Grid graph, G_grid^(n), consisting of n vertices, each attached to its four nearest neighbors (except the vertices at boundaries).
• Random Erdos-Renyi graph, G_ER^(n,p), with n vertices attached to other vertices with probability p.
• Random modular graph (also known as the stochastic block model), G_M^(n,p1,p2), with n vertices and four modules (subgraphs), where the vertex attachment probabilities across modules and within modules are p₁ and p₂, respectively.
The edge weights of a graph are then randomly selected from a uniform distribution over a fixed positive interval. For each L and h_β pair, we perform Monte-Carlo simulations to test the average performance of the proposed and benchmark methods with a varying number of data samples (k) and a fixed number of vertices (n = 36). To measure the estimation performance, we use the following two metrics:

RE(L̂, L*) = ‖L̂ − L*‖_F / ‖L*‖_F, (32)

which is the relative error between the ground truth graph (L*) and the estimated graph parameters (L̂), and

FS(L̂, L*) = 2tp / (2tp + fn + fp), (33)

which is the F-score metric (a commonly used measure of binary classification performance), calculated based on the true-positive (tp), false-positive (fp) and false-negative (fn) detection of graph edges in the estimated L̂ with respect to the ground truth edges in L*.
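The two metrics in (32) and (33) translate directly into code; the edge supports are read off the off-diagonal entries of the Laplacians (a sketch; the tolerance used to declare an edge present is an assumption):

```python
import numpy as np

def relative_error(L_hat, L_star):
    """RE in (32): Frobenius-norm error relative to the ground truth."""
    return np.linalg.norm(L_hat - L_star, 'fro') / np.linalg.norm(L_star, 'fro')

def f_score(L_hat, L_star, tol=1e-8):
    """FS in (33): F-score of edge detection, computed from the off-diagonal
    supports of the estimated and ground-truth Laplacians."""
    iu = np.triu_indices(L_star.shape[0], k=1)
    est = np.abs(L_hat[iu]) > tol
    true = np.abs(L_star[iu]) > tol
    tp = np.sum(est & true)
    fp = np.sum(est & ~true)
    fn = np.sum(~est & true)
    return 2 * tp / (2 * tp + fn + fp)
```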
F-score values lie between 0 and 1, where the value 1 means perfect classification. In our experiments, for the proposed GSI, the regularization parameter α in (23) is selected from a finite set consisting of 0 and geometrically spaced multiples of s_max √(log(n)/k), (34) where s_max = max_{i≠j} |(S)_ij| is the maximum off-diagonal entry of S in absolute value, and the scaling term √(log(n)/k) adjusts the regularization according to k and n, as suggested in [24], [35]. Methods are evaluated on small graphs (with n = 36 vertices), since GTI [20] is implemented using CVX [34] and does not currently have an efficient and scalable implementation; the proposed and other benchmark methods are more efficient and can support larger graphs. Monte-Carlo simulations are performed for each proposed/baseline method by successively solving the associated problem with different regularization parameters, in order to find the regularization that minimizes RE. The corresponding graph estimate is also used to calculate FS. For all baseline methods [18]–[20], the required parameters are selected by fine tuning. Since the CGL [17], GLS [18], [19] and GTI [20] approaches generally result in severely biased solutions with respect to the ground truth L* (based on our observations from the experiments), RE values are calculated after normalizing the estimated solution L as L̂ = (Tr(L*)/Tr(L))L. Note that this normalization also resolves the scale ambiguity in the identification of graph systems with exponential decay and frequency scaling filters (discussed in Section V-C).

Figures 3 and 4 depict the performance of the different methods when estimating graphs from signals modeled with exponential decay filters (diffusion kernels) and β-hop localized filters, respectively. As shown in Figs. 3 and 4, the proposed GSI significantly outperforms all baseline methods, including the state-of-the-art GLS [18], [19] and GTI [20], in terms of average RE and FS. The performance difference between GSI and CGL [17] demonstrates the impact of the prefiltering step, which substantially improves graph learning accuracy. Similarly, the performance gap between GSI and IPF shows the advantage of Algorithm 1 over direct prefiltering of the input covariance (S) as in (22): GSI provides better graph estimation, especially when the number of data samples (i.e., k/n) is small. In addition, Figs. 5 and 6 illustrate two examples from the experiments with grid graphs for k/n = 30, where the proposed GSI constructs the graphs most similar to the ground truth (L*).

B. Graph Learning from Variance/Frequency Shifted Signals
In this subsection, we compare the CGL estimation performance of the GSI, CGL [17] and SCGL [30] methods on signals modeled with variance and frequency shifting GBFs. As discussed in Section IV-C, the covariance matrices for signals modeled with these GBFs with parameter β are
• Σ = L† + βI for variance shifting,
• Σ = (L + βI)† for frequency shifting,
where L denotes the associated combinatorial Laplacian. In our experiments, we construct 10 random Erdos-Renyi graphs (G_ER^(n,p)) with n = 36 vertices, and then generate Σ for each GBF by varying β between 0 and 1. To evaluate the effect of β only, we use the actual covariance matrices, instead of sample covariances, as input to the algorithms; thus, GSI, CGL and SCGL estimate a graph Laplacian L from Σ. The average RE results are presented in Tables IV and V for various β. Table IV shows that the proposed GSI significantly outperforms CGL for β > 0, and the average RE difference increases as β gets larger. This is because the variance shifting GBF leads to the noisy signal model with the covariance in (11), where β represents the variance of the noise (σ²), and the
Fig. 3. Average RE and FS results for graph estimation from signals modeled with exponential decay GBFs, tested on 10 different grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods (IPF, CGL [17], GLS [18], GLS [19], GTI [20]) in terms of both RE and FS.

Fig. 4. Average RE and FS results for graph estimation from signals modeled with β-hop localized GBFs, tested on 10 different grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS.

TABLE IV. Average relative errors for the variance shifting GBF: CGL [17] versus the proposed GSI, across several values of the filter parameter β.

TABLE V. Average relative errors for the frequency shifting GBF: SCGL [30] versus the proposed GSI, across several values of the filter parameter β.

prefiltering step allows GSI to perfectly estimate the parameter β from Σ by using (25), so that the covariance is prefiltered as in (15) based on the optimal β. The prefiltering step can also be considered as a denoising operation (reversing the effect of the variance shifting GBF) on the signal covariance before the graph estimation step, while CGL works with noisy (i.e., shifted) covariances, which diminishes its estimation performance. For β = 0 (i.e., when Σ is noise-free), problem (8) reduces to the CGL estimation problem in [17], so both GSI and
CGL lead to the same average RE. For the frequency shifting GBFs with β > 0, GSI performs slightly better than SCGL, since SCGL is implemented using the general-purpose solver CVX [34], which generally produces less accurate solutions than our algorithm. Moreover, our algorithm is approximately 90 times faster than SCGL on average, and it significantly outperforms SCGL for β = 0, since the SCGL method is developed for shifted covariance matrices (i.e., L + βI) in which β must be strictly positive.

C. Illustrative Results on Temperature Data
In this experiment, we apply our proposed method to a real (climate) dataset consisting of air temperature measurements [36]. We specifically use the average daily temperature measurements collected from 45 states in the US over 16 years (2000-2015), so that in total there are k = 5844 samples for each of the n = 45 states. Fig. 7 shows samples of average temperature signals, which are spatially smooth across different states. Also, the Rocky Mountains region has lower average temperature values than the other regions in the western US. (The Rocky Mountains cross through the states of Idaho, Montana, Wyoming, Utah, Colorado and New Mexico. For example, in Fig. 7b, the areas with temperature values between 0 and 10 degrees Celsius, colored in green, correspond approximately to the Rocky Mountains region.)

Fig. 5. A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the exponential decay GBF, where L* is derived from the grid graph in panel (a); panels (b)-(d) show the graphs estimated by GLS [18], GTI [20] and the proposed GSI, together with their (RE, FS) values. The edge weights are color coded, with darker colors indicating larger weights. The proposed GSI leads to the graph that is most similar to the ground truth.

Fig. 6. A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the β-hop localized GBF with β = 2, where L* is derived from the grid graph in panel (a); panels (b)-(d) show the graphs estimated by GLS [18], GTI [20] and the proposed GSI, together with their (RE, FS) values. The edge weights are color coded, with darker colors indicating larger weights. The proposed GSI leads to the graph that is most similar to the ground truth.
The goal of this experiment is to learn graphs (with 45 vertices) associated with exponential decay and β-hop localized filters from temperature data, and to investigate the effect of the filter type and parameter β on the resulting graphs, which represent the similarity of temperature conditions between the 45 states. For modeling temperature signals, a diffusion kernel is a good candidate, because it is a fundamental solution of the heat equation, which describes the distribution of heat in a physical environment [37]. (This is why diffusion kernels are also known as heat kernels in the literature.)

Figure 8 illustrates the graphs estimated using the GSI method without regularization (i.e., H in (8) is set to the zero matrix). As shown in the figure, larger edge weights are generally assigned between vertices (i.e., states) that are closer to each other, since temperature values are mostly similar between states within close proximity. However, the distance between states is obviously not the only factor affecting the similarity of temperature values. For example, in Figs. 8b-8f, the weights are considerably small between the states in the Rocky Mountains region and their neighboring states to the east (e.g., Nebraska and Kansas), due to the large differences in altitude. Note also that different choices of GBFs can lead to substantially different similarity graphs. Especially for the β-hop localized GBF with β = 1 (corresponding to the CGL method), the resulting graph is significantly different from the results in Figs. 8b and 8c, since the 1-hop localized filter does not lead to a diffusion model. The exponential decay GBF leads to sparser graphs, better revealing the structure of the signal, than the graphs in Figs. 8a-8c. The structures of the graphs in Figs. 8d-8f are similar for different β because of the relation in (29) for the exponential decay filter: since the inverse filter scales as 1/β, increasing β proportionally shrinks the edge weights, as shown in Figs.
8d and 8e. In addition, the distribution of edge weights for the β-hop localized GBF becomes more similar to those in Figs. 8d-8f as β gets larger.

VII. CONCLUSION
In this paper, we have introduced a novel graph-based modeling framework that (i) formulates the modeling problem as graph system identification from signals/data and (ii) proposes an alternating optimization algorithm that iteratively solves for a graph and a GBF. At each iteration of the algorithm, a prefiltering operation, defined by the estimated GBF, is applied to the observed signals, and a graph is then estimated from the prefiltered signals. The experimental results have demonstrated that the proposed algorithm outperforms the existing methods for modeling smooth signals and learning diffusion-based models [18]–[20] in terms of graph estimation accuracy.

Fig. 7. Average air temperatures (in degrees Celsius) for the (a) 45th (Winter), (b) 135th (Spring), (c) 225th (Summer) and (d) 315th (Autumn) days, averaged over 16 years (2000-2015). Black dots denote the 45 states.

Fig. 8. Graphs learned from temperature data using the GSI method with β-hop localized GBFs (panels a-c, with β = 1, 5, 10) and exponential decay GBFs (panels d-f, with increasing β) for fixed β parameters, where no regularization is applied (i.e., H is set to the zero matrix). The edge weights are color coded so that darker colors represent larger weights (i.e., more similarity). The exponential decay GBF leads to sparser structures than the β-hop localized GBFs.

The proposed framework supports various types of GBFs (in Table I), including the diffusion (heat) kernels, widely used in many applications [4], [5], [27], as special cases. Our future work focuses on extensions of our algorithm for joint identification of graphs and polynomial filters (i.e., estimation of polynomials of graph Laplacians), which can provide more degrees of freedom in designing filters than the GBFs in Table I. Data-oriented applications of the proposed modeling framework are also considered as part of our future work.

APPENDIX A
DERIVATION OF DIFFUSION KERNELS
Suppose that the random process is obtained by diffusion of an initial random vector x(0), whose entries (x(0))_i for i = 1, 2, …, n are independent, zero-mean random variables with variance σ². The random diffusion over a graph G is formulated recursively as

x(t+1) = x(t) − rLx(t) = (I − rL)x(t),  t ∈ {0, 1, 2, …}, (35)

where t is the time index, L is the graph Laplacian of G, and the positive real number r is the diffusion rate of the process. On the i-th vertex of the graph G, the random process in (35) can be written as

(x(t+1))_i = (x(t))_i + r Σ_{j∈N_i} (W)_ij ((x(t))_j − (x(t))_i) (36)

for all i, where W is the adjacency matrix of G and N_i is the set of vertex indices neighboring the i-th vertex (v_i). More compactly, (35) can be written as x(t) = G(t)x(0), where

G(t) = (I − rL)^t. (37)

The covariance of x(t) is

(Σ(t))_ij = E[(G(t)x(0))_i (G(t)x(0))_j] = E[(Σ_{k=1}^n (G(t))_ik (x(0))_k)(Σ_{l=1}^n (G(t))_jl (x(0))_l)], (38)

which simplifies, using the independence of the entries of x(0), to

(Σ(t))_ij = σ² Σ_{k=1}^n (G(t))_ik (G(t))_kj = σ² (G(t)²)_ij. (39)

Therefore, the covariance matrix is

Σ(t) = σ² G(t)² = σ² (I − rL)^{2t}. (40)

To adjust the time resolution of the process, we replace t with t/Δt and r with rΔt in G(t), so that

Σ̃(t) = σ² (I − rΔt L)^{2t/Δt}. (41)

For arbitrarily small Δt, Σ̃(t) converges to a matrix exponential function of L,

Σ̃(t) = σ² lim_{Δt→0} (I − rΔt L)^{2t/Δt} = σ² exp(−2rtL), (42)

which is equivalent to the exponential decay filtering operation in Table I with β(t) = 2rt. For arbitrarily large t (i.e., when the diffusion reaches steady state), the covariance is

lim_{t→∞} Σ̃(t) = lim_{t→∞} σ² Σ_{i=1}^n exp(−2rtλ_i) u_i u_iᵀ = (σ²/n) 11ᵀ, (43)

all of whose entries have the same value, since λ₁ = 0 for CGLs.
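The continuous-time limit above can be checked numerically: as Δt shrinks, the finite-step covariance σ²(I − rΔtL)^{2t/Δt} (where the exponent 2 comes from squaring G(t) in the covariance) approaches σ²exp(−2rtL), which can itself be formed spectrally as in (14). A sketch, with a toy graph and parameters chosen only for illustration:

```python
import numpy as np
from scipy.linalg import expm

# Toy 3-vertex path CGL.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
sigma2, r, t = 1.0, 0.1, 2.0

# Limit covariance: sigma^2 * exp(-2*r*t*L), built two equivalent ways.
lam, U = np.linalg.eigh(L)
Sigma_inf = sigma2 * (U @ np.diag(np.exp(-2 * r * t * lam)) @ U.T)
assert np.allclose(Sigma_inf, sigma2 * expm(-2 * r * t * L))

# Finite-step covariance sigma^2 * (I - r*dt*L)^(2t/dt) for shrinking dt.
for m in (10, 100, 10000):       # m = t/dt diffusion steps
    dt = t / m
    G = np.linalg.matrix_power(np.eye(3) - r * dt * L, m)
    err = np.linalg.norm(sigma2 * G @ G - Sigma_inf)
    print(m, err)                # error decreases as dt -> 0
```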
Thus, the statistical properties eventually becomespatially flat across all vertices as intuitively expected for adiffusion process. Yet, for nonsingular graph Laplacians (e.g.,generalized graph Laplacians [17], [26]), the process behavesdifferently so that the signal energy vanishes, since the abovelimit converges to the n × n zero matrix as t gets larger.R EFERENCES[1] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysisand an algorithm,” in
Proceedings of the 14th International Conferenceon Neural Information Processing Systems: Natural and Synthetic , ser.NIPS’01. Cambridge, MA, USA: MIT Press, 2001, pp. 849–856.[2] M. Soltanolkotabi, E. Elhamifar, and E. J. Cand`es, “Robust subspaceclustering,”
Ann. Statist. , vol. 42, no. 2, pp. 669–699, 04 2014.[3] U. Luxburg, “A tutorial on spectral clustering,”
Statistics and Computing ,vol. 17, no. 4, pp. 395–416, Dec. 2007.[4] A. J. Smola and I. R. Kondor, “Kernels and regularization on graphs.”in
Proceedings of the Annual Conference on Computational LearningTheory , 2003.[5] A. D. Szlam, M. Maggioni, and R. R. Coifman, “Regularization ongraphs with function-adapted diffusion processes,”
J. Mach. Learn. Res. ,vol. 9, pp. 1711–1739, Jun. 2008.[6] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst,“The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,”
IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, 2013.[7] P. Buhlmann and S. van de Geer,
Statistics for High-Dimensional Data:Methods, Theory and Applications . Springer Publishing Co., Inc., 2011.[8] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing ongraphs: Frequency analysis,”
IEEE Transactions on Signal Processing ,vol. 62, no. 12, pp. 3042–3054, June 2014.[9] P. Milanfar, “A tour of modern image filtering: New insights and meth-ods, both practical and theoretical,”
IEEE Signal Processing Magazine ,vol. 30, no. 1, pp. 106–128, Jan 2013.[10] H. E. Egilmez and A. Ortega, “Spectral anomaly detection using graph-based filtering for wireless sensor networks,” in , May2014, pp. 1085–1089.[11] H. E. Egilmez, Y. H. Chao, A. Ortega, B. Lee, and S. Yea, “GBST:Separable transforms based on line graphs for predictive video coding,”in ,Sept 2016, pp. 2375–2379.[12] S. K. Narang and A. Ortega, “Perfect reconstruction two-channel waveletfilter banks for graph structured data,”
IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2786–2799, June 2012.
[13] A. Anis, A. Gadde, and A. Ortega, “Efficient sampling set selection for bandlimited graph signals using graph spectral proxies,” IEEE Transactions on Signal Processing, vol. 64, no. 14, pp. 3775–3789, July 2016.
[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. Springer, 2008.
[15] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.
[16] R. I. Kondor and J. D. Lafferty, “Diffusion kernels on graphs and other discrete input spaces,” in Proceedings of the Nineteenth International Conference on Machine Learning, ser. ICML ’02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 315–322.
[17] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under Laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, Sept 2017.
[18] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning Laplacian matrix in smooth graph signal representations,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160–6173, Dec 2016.
[19] V. Kalofolias, “How to learn a graph from smooth signals,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2016, pp. 920–929.
[20] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, “Network topology inference from spectral templates,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, Sept 2017.
[21] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, “Characterization and inference of graph diffusion processes from observations of stationary signals,” IEEE Transactions on Signal and Information Processing over Networks, vol. PP, no. 99, pp. 1–1, 2017.
[22] J. Mei and J. M. F. Moura, “Signal processing on graphs: Causal modeling of unstructured data,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 2077–2092, April 2017.
[23] I. M. Johnstone and A. Y. Lu, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[24] P. Ravikumar, M. Wainwright, B. Yu, and G. Raskutti, “High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence,” Electronic Journal of Statistics (EJS), vol. 5, pp. 935–980, 2011.
[25] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.
[26] T. Bıyıkoglu, J. Leydold, and P. F. Stadler, “Laplacian eigenvectors of graphs,” Lecture Notes in Mathematics, vol. 1915, 2007.
[27] J. Lafferty and G. Lebanon, “Diffusion kernels on statistical manifolds,” J. Mach. Learn. Res., vol. 6, pp. 129–163, Dec. 2005.
[28] D. Thanou, X. Dong, D. Kressner, and P. Frossard, “Learning heat diffusion graphs,”
IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 484–499, Sept 2017.
[29] M. G. Rabbat, “Inferring sparse graphs from smooth signals with theoretical guarantees,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 6533–6537.
[30] B. M. Lake and J. B. Tenenbaum, “Discovering structure by learning sparse graph,” in
Proceedings of the 33rd Annual Cognitive Science Conference, 2010, pp. 778–783.
[31] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York, NY, USA: Springer-Verlag, 1998.
[32] Y. D. Castro, T. Espinasse, and P. Rochet, “Reconstructing undirected graphs from eigenspaces,” Journal of Machine Learning Research, vol. 18, no. 51, pp. 1–24, 2017.
[33] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1999.
[34] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.
[35] S. Zhou, P. Rutimann, M. Xu, and P. Buhlmann, “High-dimensional covariance estimation based on Gaussian graphical models,” J. Mach. Learn. Res., vol. 12, pp. 2975–3026, Nov. 2011.
[36] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, Y. Zhu, A. Leetmaa, R. Reynolds, M. Chelliah, W. Ebisuzaki, W. Higgins, J. Janowiak, K. C. Mo, C. Ropelewski, J. Wang, R. Jenne, and D. Joseph, “The NCEP/NCAR 40-year reanalysis project,” Bulletin of the American Meteorological Society, vol. 77, no. 3, pp. 437–471, 1996.
[37] L. C. Evans,