Sparsistent Estimation of Time-Varying Discrete Markov Random Fields
Mladen Kolar and Eric P. Xing*

*Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15217, USA; e-mail: {mladenk,epxing}@cs.cmu.edu

Abstract
Network models have been popular for modeling and representing complex relationships and dependencies between observed variables. When data come from a dynamic stochastic process, a single static network model cannot adequately capture transient dependencies, such as gene regulatory dependencies throughout a developmental cycle of an organism. Kolar et al. (2010b) proposed a method based on kernel-smoothed $\ell_1$-penalized logistic regression for estimating time-varying networks from nodal observations collected as a time series of observational data. In this paper, we establish conditions under which the proposed method consistently recovers the structure of a time-varying network. This work complements previous empirical findings by providing sound theoretical guarantees for the proposed estimation procedure. For completeness, we include numerical simulations in the paper.
Key words and phrases: High-dimensional inference, Markov random fields, semi-parametric inference, time-varying Ising model, varying coefficient models.
1 Introduction

In recent years, we have witnessed fast advancement of data-acquisition techniques in many areas, including biological domains, engineering and social sciences. As a result, new statistical and machine learning techniques are needed to help us develop a better understanding of the complexities underlying large, noisy data sets. Networks have been commonly used to abstract noisy data and provide insight into regularities and dependencies between observed variables. For example, in a biological study, nodes of the network can represent genes in one organism and edges can represent associations or regulatory dependencies among genes. In a social domain, nodes of a network can represent actors and edges can represent interactions between actors. Recent popular techniques for modeling and exploring networks are based on structure estimation in probabilistic graphical models, specifically Markov random fields (MRFs). These models encode conditional independencies between variables, which are represented as nodes of a graph. The following examples motivate the estimation of networks that change with time.

• Analysis of gene regulatory networks.
Suppose that we have a set of n microarray measurements of gene expression levels, obtained at different stages during the development of an organism or at different times during the cell cycle. Given these data, biologists would like to gain insight into dynamic relationships between different genes and into how these relations change at different stages of development. The problem is that at each time point there is only one or at most a few measurements of the gene expressions; a naive approach to estimating the gene regulatory network, which uses only the data at the time point in question to infer the network, would fail. To obtain a good estimate of the regulatory network at any time point, we need to leverage the data collected at other time points and extract some information from them.

• Analysis of the stock market.
In a finance setting, we have values of different stocks at each time point. Suppose, for simplicity, that we only measure whether the value of a particular stock is going up or down. We would like to find the underlying transient relational patterns between different stocks from these measurements and gain insight into how these patterns change over time. Again, we only have one measurement at each time point and we need to leverage information from the data obtained at nearby time points.

• Understanding social networks.
There are 100 Senators in the U.S. Senate and each can cast a vote on different bills. Suppose that we are given n voting records over some period of time. How can one infer the latent political liaisons and coalitions among different senators, and the way these relationships change with respect to time and with respect to different issues raised in bills, just from the voting records?

The aforementioned problems share the task of estimating a sequence of time-specific latent relational structures between a fixed set of entities (i.e., variables) from a time series of observations of the entities' states; the relational structures between the entities are time evolving, rather than being invariant throughout the data collection period as commonly assumed in much of the current literature on high-dimensional undirected network estimation (see, e.g., Meinshausen and Bühlmann (2006), Bresler et al. (2007), Yuan and Lin (2007), Banerjee et al. (2008), Rothman et al. (2008), Friedman et al. (2008), Ravikumar et al. (2008), Fan et al. (2009), Peng et al. (2009), Ravikumar et al. (2010), Wang et al. (2009), Guo et al. (2010a) and references therein). Typically, the available data for the problem are very scarce, with only one or at most a few measurements corresponding to any particular latent structure, while the total number of potential relations is large and exceeds the total number of observations. However, as we will show later, under the assumption that the network is sparse and changes slowly with time, it is possible to consistently estimate the network structure at any time point.

2 Time-varying Markov random fields

A popular and widely studied model for the relational structure over a fixed set of entities is the Markov random field (MRF) (Wainwright and Jordan, 2008, Getoor and Taskar, 2007). Let $G = (V, E)$ represent a graph, where $V$ denotes the set of vertices and $E$ the set of edges over the vertices. Depending on the specific application of interest, a node $u \in V$ can represent a gene, a stock, or a social actor, and an edge $(u, v) \in E$ can represent a relationship (e.g., correlation, influence, friendship) between actors $u$ and $v$. Let $X = (X_1, \ldots, X_p)'$, where $p = |V|$, be a random vector of nodal states following a probability distribution indexed by $\theta \in \Theta$. Under an MRF, the nodal states $X_u$ are assumed to be discrete, i.e., $X_u \in \mathcal{X} \equiv \{s_1, \ldots, s_k\}$, and the edge set $E \subseteq V \times V$ encodes certain conditional independence assumptions among the components of the random vector $X$; for example, the random variable $X_u$ is conditionally independent of the random variable $X_v$ given the rest of the variables if $(u, v) \notin E$. In the special case of binary nodal states, e.g., $X_u \in \mathcal{X} \equiv \{-1, 1\}$, with pairwise potentials weighted by $\theta_{uv}$ for all $(u, v) \in E$ and $\theta_{uv} = 0$ for all $(u, v) \notin E$, the joint probability of $X = x$ can be expressed by a simple exponential family model:

\[ P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{(u,v) \in E} \theta_{uv} x_u x_v \Big\}, \tag{2.1} \]

where $Z(\theta)$ is the partition function. In this paper we allow the parameter vector, and hence the graph structure, to change with time: at each time point $t$ the observation $x^t$ is drawn from the distribution $P_{\theta^t}$ indexed by the time-specific parameter $\theta^t$, with $E^t = \{(u,v) : \theta^t_{uv} \neq 0\}$ the corresponding edge set. Given a time point of interest $\tau \in [0, 1]$ and a sequence of observations $\mathcal{D}_n = \{x^t \sim P_{\theta^t} \mid t \in \mathcal{T}_n\}$, where $\mathcal{T}_n = \{1/n, 2/n, \ldots, 1\}$ is the set of rescaled observation times and $P_{\theta^t}$ is defined in Eq. (2.1), the goal is to estimate the graph structure of the Markov random field associated with the distribution $P_{\theta^\tau}$. The parameter vector $\theta^\tau$ is a $\binom{p}{2}$-dimensional vector, indexed by distinct pairs of nodes, of which an element is non-zero if and only if the corresponding edge $(u, v) \in E^\tau$. The problem of recovering the graph structure $G^\tau$ is equivalent to estimating the non-zero pattern of the vector $\theta^\tau$, i.e., the locations of its non-zero elements.
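To make the model in Eq. (2.1) concrete, the following minimal sketch evaluates the joint probability of a configuration by brute-force enumeration of the partition function. The toy graph, edge weights, and configuration are illustrative choices of ours, and the enumeration is feasible only for very small $p$.

```python
# Sketch: evaluating the pairwise binary MRF of Eq. (2.1) for a tiny graph.
# The partition function Z(theta) is computed by enumerating all 2^p
# configurations, which is only practical for small p.
import itertools
import numpy as np

def log_potential(x, theta):
    """Unnormalized log-probability: sum over u < v of theta[u, v] * x[u] * x[v]."""
    p = len(x)
    return sum(theta[u, v] * x[u] * x[v]
               for u in range(p) for v in range(u + 1, p))

def ising_prob(x, theta):
    """P_theta(x) = exp{sum theta_uv x_u x_v} / Z(theta) for x in {-1, +1}^p."""
    p = len(x)
    configs = itertools.product([-1, 1], repeat=p)
    log_z = np.logaddexp.reduce(
        [log_potential(np.array(c), theta) for c in configs])
    return np.exp(log_potential(np.array(x), theta) - log_z)

p = 4
theta = np.zeros((p, p))
theta[0, 1] = theta[1, 2] = theta[2, 3] = 0.8   # an illustrative chain graph
print(ising_prob(np.array([1, 1, -1, -1]), theta))
```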
A stronger notion of structure estimation is that of signed edge recovery, in which an edge $(u, v) \in E^\tau$ is recovered together with the sign of the parameter, $\mathrm{sign}(\theta^\tau_{uv})$. We will show that the estimation procedure can consistently recover signed edges.

3 Estimation procedure

The estimation procedure is based on the neighborhood selection technique, where the graph structure is estimated by combining the local estimates of the neighborhoods of each node. For each vertex $u \in V$, define the set of neighboring edges $S^\tau(u) := \{(u, v) : (u, v) \in E^\tau\}$ and the set of signed neighboring edges $S^\tau_\pm(u) := \{(\mathrm{sign}(\theta^\tau_{uv}), (u, v)) : (u, v) \in S^\tau(u)\}$. The set of signed neighboring edges $S^\tau_\pm(u)$ can be determined from the signs of the elements of the $(p-1)$-dimensional subvector $\theta^\tau_u := \{\theta^\tau_{uv} : v \in V \setminus u\}$ associated with vertex $u$. Under the model (2.1), the conditional distribution of $X^\tau_u$ given the other variables $X^\tau_{\setminus u} := \{X^\tau_v : v \in V \setminus u\}$ takes the form

\[ P_{\theta^\tau_u}(x^\tau_u \mid X^\tau_{\setminus u} = x^\tau_{\setminus u}) = \frac{\exp(2 x^\tau_u \langle \theta^\tau_u, x^\tau_{\setminus u} \rangle)}{\exp(2 x^\tau_u \langle \theta^\tau_u, x^\tau_{\setminus u} \rangle) + 1}, \tag{3.1} \]

where $\langle a, b \rangle = a'b$ denotes the dot product. Observe that the model given in (3.1) can be viewed as expressing $X^\tau_u$ as the response variable in a generalized varying-coefficient model with $X^\tau_{\setminus u}$ playing the role of covariates. For simplicity, we write $P_{\theta^\tau_u}(x^\tau_u \mid X^\tau_{\setminus u} = x^\tau_{\setminus u})$ as $P_{\theta^\tau_u}(x^\tau_u \mid x^\tau_{\setminus u})$. Under the model given in Eq. (3.1), the log-likelihood for one data point $t \in \mathcal{T}_n$ can be written in the following form:

\[ \gamma(\theta_u; x^t) = \log P_{\theta_u}(x^t_u \mid x^t_{\setminus u}) = x^t_u \langle \theta_u, x^t_{\setminus u} \rangle - \log\big( \exp(\langle \theta_u, x^t_{\setminus u} \rangle) + \exp(-\langle \theta_u, x^t_{\setminus u} \rangle) \big). \tag{3.2} \]

For an arbitrary point of interest $\tau \in [0, 1]$, the estimate $\hat\theta^\tau_u$ of the sign pattern of the vector $\theta^\tau_u$ is defined as the solution to the following convex program:

\[ \hat\theta^\tau_u = \mathop{\mathrm{argmin}}_{\theta_u \in \mathbb{R}^{p-1}} \; \big\{ \ell(\theta_u; \mathcal{D}_n) + \lambda_n \|\theta_u\|_1 \big\}, \tag{3.3} \]

where $\ell(\theta_u; \mathcal{D}_n) = -\sum_{t \in \mathcal{T}_n} w^\tau_t \gamma(\theta_u; x^t)$ is the weighted logloss, with weights defined as

\[ w^\tau_t = \frac{K_h(t - \tau)}{\sum_{t' \in \mathcal{T}_n} K_h(t' - \tau)}, \]

and $K_h(\cdot) = K(\cdot / h)$ is a symmetric nonnegative kernel with bandwidth $h$. The regularization parameter $\lambda_n \geq 0$ controls the sparsity of the solution. The minimum over $\theta_u$ is always achieved, as the problem can be cast as a constrained optimization problem over the ball $\|\theta_u\|_1 \leq C(\lambda_n)$, and the claim follows from the Weierstrass theorem.

Let $\hat\theta^\tau_u$ be a minimizer of (3.3). The convex program (3.3) does not necessarily have a unique optimum but, as we will prove shortly, in the regime of interest any two solutions have non-zero elements in the same positions. Based on the vector $\hat\theta^\tau_u$, we have the following estimate of the signed neighborhood:

\[ \hat S^\tau_\pm(u) := \big\{ (\mathrm{sign}(\hat\theta^\tau_{uv}), (u, v)) : v \in V \setminus u, \; \hat\theta^\tau_{uv} \neq 0 \big\}. \tag{3.4} \]

The structure of the graph $G^\tau$ is consistently estimated if every signed neighborhood is recovered, i.e., $\hat S^\tau_\pm(u) = S^\tau_\pm(u)$ for all $u \in V$. A summary of the algorithm is given in Algorithm 1.

The convex program (3.3) can be solved using any general optimization solver. One particularly fast algorithm for this type of problem, based on the coordinate-wise descent method, is described in Friedman et al. (2010) and implemented in the R package glmnet. Note that the algorithm provides only an estimate of the graph structure at the time point $\tau$; in order to gain insight into the dynamics of graph changes, one needs to estimate the graph structure at multiple time points.
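As a minimal sketch of the estimator in Eq. (3.3), the following code solves the kernel-weighted $\ell_1$-penalized logistic regression for a single node using scikit-learn, as a stand-in for the glmnet solver mentioned above. The bandwidth, penalty level, and the mapping of $\lambda_n$ to scikit-learn's C parameter are illustrative assumptions. Note that scikit-learn fits $\sigma(\langle \beta, x\rangle)$ while Eq. (3.1) uses $\sigma(2\langle \theta, x\rangle)$, so $\beta = 2\theta$ and the recovered sign pattern is unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def box_kernel(z):
    """Box kernel supported on [-1, 1], cf. assumption A4 below."""
    return 0.5 * (np.abs(z) <= 1.0)

def estimate_signed_neighborhood(X, u, tau, h, lam):
    """Sketch of Eq. (3.3): weighted l1-logistic regression for node u.

    X is an n-by-p array with entries in {-1, +1}; row t holds the
    observation at the rescaled time point (t + 1) / n.
    """
    n, p = X.shape
    times = np.arange(1, n + 1) / n
    w = box_kernel((times - tau) / h)
    w = w / w.sum()                          # weights w_t^tau from Section 3
    y = (X[:, u] > 0).astype(int)            # recode {-1, +1} -> {0, 1}
    Z = np.delete(X, u, axis=1)              # covariates x_{\u}
    # liblinear minimizes ||beta||_1 + C * sum_t w_t * logloss_t, so we
    # (heuristically) set C = 1 / lam to match Eq. (3.3) up to scaling.
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             fit_intercept=False, C=1.0 / lam)
    clf.fit(Z, y, sample_weight=w)
    theta_hat = clf.coef_.ravel() / 2.0      # beta = 2 * theta
    return np.sign(theta_hat)                # signed neighborhood estimate
```

Running this sketch for every node $u$ and combining the signed neighborhoods yields the graph estimate described in Algorithm 1 below.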
Typically, in a real application, one is interested in estimating $G^\tau$ for all $\tau \in \mathcal{T}_n$.

Algorithm 1 Graph structure estimation
Input: Dataset $\mathcal{D}_n$, time point of interest $\tau \in [0, 1]$, penalty parameter $\lambda_n$, bandwidth parameter $h$
Output: Estimate of the graph structure $\hat G^\tau$
for all $u \in V$ do
  Estimate $\hat\theta^\tau_u$ by solving the convex program (3.3)
  Estimate the set of signed neighboring edges $\hat S^\tau_\pm(u)$ using (3.4)
end for
Combine the sets $\{\hat S^\tau_\pm(u)\}_{u \in V}$ to obtain $\hat G^\tau$.

4 Main theoretical result

In this section, we provide conditions under which Algorithm 1 consistently recovers the graph structure. In particular, we show that under suitable conditions $P[\forall u \; \hat S^\tau_\pm(u) = S^\tau_\pm(u)] \to 1$ as $n \to \infty$, a property known as sparsistency. We are mainly interested in the high-dimensional case, where the dimension $p = p_n$ is comparable to or even larger than the sample size $n$. It is of great interest to understand the performance of the estimator under this assumption, since in many real-world scenarios the dimensionality of the data is large. Our analysis is asymptotic and we allow the model dimension $p = p_n$ to grow at a certain rate as the sample size grows. This essentially allows us to consider more "complicated" models as we observe more data points. Another quantity that describes the complexity of the model is the maximum node degree $s = s_n$, which is also considered a function of the sample size. Under the assumption that the true graph structure is sparse, we will require that the maximum node degree is small, $s \ll n$. The main result describes the scaling of the triple $(n, p_n, s_n)$ under which the estimation procedure given in the previous section estimates the graph structure consistently.

We will need certain regularity conditions to hold in order to prove the sparsistency result. These conditions are expressed in terms of the Hessian of the log-likelihood function as evaluated at the true model parameter, i.e., the Fisher information matrix. The Fisher information matrix $Q^\tau_u \in \mathbb{R}^{(p-1) \times (p-1)}$ is defined for each node $u \in V$ as

\[ Q^\tau_u := \mathbb{E}\big[ \nabla^2 \log P_{\theta^\tau_u}[X_u \mid X_{\setminus u}] \big] = \mathbb{E}\big[ \eta(X; \theta^\tau_u) X_{\setminus u} X_{\setminus u}' \big], \]

where

\[ \eta(x; \theta_u) := \frac{4 \exp(2 x_u \langle \theta_u, x_{\setminus u} \rangle)}{\big( \exp(2 x_u \langle \theta_u, x_{\setminus u} \rangle) + 1 \big)^2} \]

is the variance function and $\nabla^2$ denotes the operator that computes the matrix of second derivatives. We write $Q^\tau := Q^\tau_u$ and assume that the following conditions hold for each node $u \in V$.

A1: Dependency condition. There exist constants $C_{\min}, D_{\min}, D_{\max} > 0$ such that

\[ \Lambda_{\min}(Q^\tau_{SS}) \geq C_{\min}, \qquad \Lambda_{\min}(\Sigma^\tau) \geq D_{\min}, \qquad \Lambda_{\max}(\Sigma^\tau) \leq D_{\max}, \]

where $\Sigma^\tau = \mathbb{E}_{\theta^\tau}[X^\tau X^{\tau\prime}]$. Here $\Lambda_{\min}(\cdot)$ and $\Lambda_{\max}(\cdot)$ denote the minimum and maximum eigenvalues of a matrix.

A2: Incoherence condition. There exists an incoherence parameter $\alpha \in (0, 1]$ such that

\[ ||| Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha, \]

where, for a matrix $A \in \mathbb{R}^{a \times b}$, the $\ell_\infty$ matrix norm is defined as $|||A|||_\infty := \max_{i \in \{1, \ldots, a\}} \sum_{j=1}^b |a_{ij}|$. Here the set $S^c$ denotes the complement of the set $S$ in $\{1, \ldots, p\}$, that is, $S^c = \{1, \ldots, p\} \setminus S$.

With some abuse of notation, when defining assumptions A1 and A2 we use the index set $S := S^\tau(u)$ to denote the nodes adjacent to node $u$ at time $\tau$. For example, if $s = |S|$, then $Q^\tau_{SS} \in \mathbb{R}^{s \times s}$ denotes the sub-matrix of $Q^\tau$ indexed by $S$. Condition A1 assures that the relevant features are not too correlated, while condition A2 assures that the irrelevant features do not exert too strong an effect on the relevant features.
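The sketch below checks conditions A1 and A2 numerically for a given matrix $Q$ and index set $S$. The matrix used here is a made-up positive definite stand-in, not a Fisher information matrix derived from data.

```python
# Sketch: numeric check of the dependency (A1) and incoherence (A2)
# conditions for an illustrative matrix Q and support set S.
import numpy as np

def check_A1_A2(Q, S):
    p = Q.shape[0]
    Sc = np.setdiff1d(np.arange(p), S)
    Q_SS = Q[np.ix_(S, S)]
    Q_ScS = Q[np.ix_(Sc, S)]
    c_min = np.linalg.eigvalsh(Q_SS).min()       # A1: want c_min > 0
    M = Q_ScS @ np.linalg.inv(Q_SS)
    incoherence = np.abs(M).sum(axis=1).max()    # A2: want <= 1 - alpha
    return c_min, incoherence

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
Q = A @ A.T / 6 + np.eye(6)                      # a generic PD stand-in for Q
print(check_A1_A2(Q, S=np.array([0, 1, 2])))
```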
Similar conditions are common in the literature on high-dimensional estimation (see, e.g., Meinshausen and Bühlmann (2006), Ravikumar et al. (2010), Peng et al. (2009), Guo et al. (2010a) and references therein). The difference here is that we assume the conditions hold at the time point of interest $\tau$ at which we want to recover the graph structure.

Next, we assume that the distribution $P_{\theta^t}$ changes smoothly over time, which we express in the following form, for every node $u \in V$.

A3: Smoothness conditions. Let $\Sigma^t = [\sigma^t_{uv}]$. There exists a constant $M > 0$ such that

\[ \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial}{\partial t} \sigma^t_{uv} \Big| < M, \qquad \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial^2}{\partial t^2} \sigma^t_{uv} \Big| < M, \]
\[ \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial}{\partial t} \theta^t_{uv} \Big| < M, \qquad \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial^2}{\partial t^2} \theta^t_{uv} \Big| < M. \]

Condition A3 captures our notion of a distribution that changes smoothly over time. If we consider the elements of the covariance matrix and the elements of the parameter vector as functions of time, then these functions have bounded first and second derivatives. From these assumptions, it is not too hard to see that the elements of the Fisher information matrix are also smooth functions of time.

A4: Kernel. The kernel $K : \mathbb{R} \to \mathbb{R}$ is a symmetric function, supported on $[-1, 1]$, and there exists a constant $M_K \geq 1$ which upper bounds the quantities $\max_{z \in \mathbb{R}} |K(z)|$ and $\max_{z \in \mathbb{R}} K(z)^2$.

Condition A4 gives some regularity conditions on the kernel used to define the weights. For example, the assumption is satisfied by the box kernel $K(z) = \frac{1}{2}\,1\{z \in [-1, 1]\}$.

With the assumptions made above, we are ready to state the theorem that characterizes the consistency of the method given in Section 3 for recovering the unknown time-varying graph structure. An important quantity appearing in the statement is the minimum absolute value of the non-zero entries of the parameter vector,

\[ \theta_{\min} = \min_{(u,v) \in E^\tau} |\theta^\tau_{uv}|. \]

Intuitively, the success of the recovery should depend on how hard it is to distinguish the true non-zero parameters from noise.

Theorem 1. Assume that the dependency condition A1 holds with $C_{\min}$, $D_{\min}$ and $D_{\max}$, that for each node $u \in V$ the Fisher information matrix $Q^\tau$ satisfies the incoherence condition A2 with parameter $\alpha$, that the smoothness assumption A3 holds with parameter $M$, and that the kernel function used in Algorithm 1 satisfies assumption A4 with parameter $M_K$. Let the regularization parameter satisfy

\[ \lambda_n \geq C \sqrt{\frac{\log p}{n^{2/3}}} \]

for a constant $C > 0$ independent of $(n, p, s)$. Furthermore, assume that the following conditions hold:

1. $h = O(n^{-1/3})$;
2. $s = o(n^{1/3})$ and $s^3 \log p / n^{2/3} = o(1)$;
3. $\theta_{\min} = \Omega\big( \sqrt{s}\, \sqrt{\log p / n^{2/3}} \big)$.

Then, for a fixed $\tau \in [0, 1]$, the estimated graph $\hat G^\tau(\lambda_n)$ obtained through neighborhood selection satisfies

\[ P\big[ \hat G^\tau(\lambda_n) \neq G^\tau \big] = O\Big( \exp\Big( -C' \frac{n^{2/3}}{s^3} + C'' \log p \Big) \Big) \to 0, \]

for some constants $C', C''$ independent of $(n, p, s)$.

This theorem guarantees that the procedure in Algorithm 1 asymptotically recovers the sequence of graphs underlying all the nodal-state measurements in a time series, and the snapshot of the evolving graph at any time point during the measurement intervals, under an appropriate regularization parameter $\lambda_n$, as long as the ambient dimensionality $p$ and the maximum node degree $s$ are not too large and the minimum $\theta$ values do not tend to zero too fast.

Remarks:

1. The bandwidth parameter $h$ is chosen so that it balances the variance and the squared bias of the estimation of the elements of the Fisher information matrix.

2.
Theorem 1 states that the tuning parameter can be set as $\lambda_n \geq C n^{-1/3} \sqrt{\log p}$. In practice, one can use the Bayesian information criterion to select the tuning parameter $\lambda_n$ in a data-dependent way, as explained in Section 2.4 of Kolar et al. (2010b). We conjecture that this approach would lead to asymptotically consistent model selection; however, this claim remains to be proven.

3. Condition 2 requires that the size of the neighborhood of each node remains smaller than the sample size. However, the model's ambient dimension $p$ is allowed to grow exponentially in $n$.

4. Condition 3 is crucial for being able to distinguish the true elements of the neighborhood of a node. We require that the minimum element of the parameter vector, in absolute value, stays bounded away from zero.

5. The rate of convergence is dictated by the rate of convergence of the sample Fisher information matrix to the true Fisher information matrix, as shown in Lemma 5. Using a local linear smoother, instead of the kernel smoother, to estimate the coefficients in the model (3.1), one could obtain a faster rate of convergence.

6. Theorem 1 provides sufficient conditions for reliable estimation of the sequence of graphs when the sample size is large enough. In order to improve the small-sample properties of the procedure, one could adapt the approach of Guo et al. (2010b) to the time-varying setting, to incorporate sharing between nodes. Guo et al. (2010b) estimate all local neighborhoods simultaneously, as opposed to estimating each neighborhood individually, effectively reducing the number of parameters that need to be inferred from the data. This is especially beneficial in networks with prominent hubs and in scale-free networks.

In order to obtain insight into the network dynamics, one needs to estimate the graph structure at multiple time points. A common choice is to estimate the graph structure for every $\tau \in \mathcal{T}_n$ and obtain a sequence of graph structures $\{\hat G^\tau\}_{\tau \in \mathcal{T}_n}$. We have the following immediate consequence of Theorem 1.

Corollary 2. Under the assumptions of Theorem 1, we have

\[ P\big[ \forall \tau \in \mathcal{T}_n : \hat G^\tau(\lambda_n) = G^\tau \big] \to 1 \quad \text{as } n \to \infty. \tag{4.1} \]

In the sequel, we set out to prove Theorem 1. First, we show that the minimizer $\hat\theta^\tau_u$ of (3.3) is unique under the assumptions given in Theorem 1. Next, we show that with high probability the estimator $\hat\theta^\tau_u$ recovers the true neighborhood of a node $u$. Repeating the procedure for all nodes $u \in V$, we obtain the result stated in Theorem 1. The proof uses the results that the empirical estimates of the Fisher information matrix and the covariance matrix are close, elementwise, to their population versions. These results are given in Appendix A.

5 Proof of the main result

In this section we give the proof of Theorem 1. The proof proceeds through a sequence of technical lemmas. We build on the ideas developed in Ravikumar et al. (2010). Note that in what follows we use $C$, $C'$ and $C''$ to denote positive constants independent of $(n, p, s)$, whose values may change from line to line.

The main idea behind the proof is to characterize the minimum obtained in Eq. (3.3) and show that the correct neighborhood of one node at an arbitrary time point can be recovered with high probability. Then, using the union bound over the nodes of the graph, we conclude that the whole graph is estimated sparsistently at the time points of interest.

We first address the problem of uniqueness of the solution to (3.3). Note that because the objective in Eq.
(3.3) is not strictly convex, it is necessary to show that the non-zero pattern of the parameter vector is unique, since otherwise the problem of sparsistent graph estimation would be meaningless. Under the conditions of Theorem 1, the solution is unique. This is shown in Lemma 3 and Lemma 4. Lemma 3 gives conditions under which two solutions to the problem in Eq. (3.3) have the same pattern of non-zero elements. Lemma 4 then shows that, with probability tending to 1, the solution is unique. Once we have shown that the solution to the problem in Eq. (3.3) is unique, we proceed to show that it recovers the correct pattern of non-zero elements. To show that, we require the sample version of the Fisher information matrix to satisfy certain conditions. Under the assumptions of Theorem 1, Lemma 5 shows that the sample version of the Fisher information matrix satisfies the same conditions as the true Fisher information matrix, although with worse constants. Next, we identify two events, related to the Karush-Kuhn-Tucker optimality conditions, on which the vector $\hat\theta_u$ recovers the correct neighborhood of the node $u$. This is shown in Proposition 6. Finally, Proposition 7 shows that the event on which the neighborhood of the node $u$ is correctly identified occurs with probability tending to 1 under the assumptions of Theorem 1. Table 1 provides a summary of the different parts of the proof.

Table 1: Outline of the proof strategy.

Lemma 3 and Lemma 4: Establish the uniqueness of the solution to the optimization problem in Eq. (3.3).
Lemma 5: Shows that the sample version of the Fisher information matrix satisfies conditions similar to those of the population version.
Proposition 6: Shows that on an event related to the KKT conditions, the vector $\hat\theta_u$ recovers the correct neighborhood of the node $u$.
Proposition 7: Shows that the event in Proposition 6 holds with probability tending to 1.

Let us denote the set of all solutions to (3.3) by $\Theta(\lambda_n)$. We define the objective function in Eq. (3.3) by

\[ F(\theta_u) := -\sum_{t \in \mathcal{T}_n} w^\tau_t \gamma(\theta_u; x^t) + \lambda_n \|\theta_u\|_1 \tag{5.1} \]

and we say that $\theta_u \in \mathbb{R}^{p-1}$ satisfies the system (S) when, for all $v = 1, \ldots, p-1$,

\[ \begin{cases} \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v = \lambda_n\, \mathrm{sign}(\theta_{uv}) & \text{if } \theta_{uv} \neq 0, \\ \big| \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v \big| \leq \lambda_n & \text{if } \theta_{uv} = 0, \end{cases} \tag{5.2} \]

where

\[ \nabla \gamma(\theta_u; x^t) = x^t_{\setminus u} \big\{ x^t_u + 1 - 2 P_{\theta_u}[x^t_u = 1 \mid x^t_{\setminus u}] \big\} \tag{5.3} \]

is the score function. Eq. (5.2) is obtained by taking the sub-gradient of $F(\theta)$ and equating it to zero. From the Karush-Kuhn-Tucker (KKT) conditions it follows that $\theta_u \in \mathbb{R}^{p-1}$ belongs to $\Theta(\lambda_n)$ if and only if $\theta_u$ satisfies the system (S). The following lemma shows that any two solutions have the same non-zero pattern.

Lemma 3. Consider a node $u \in V$. If $\bar\theta_u \in \mathbb{R}^{p-1}$ and $\tilde\theta_u \in \mathbb{R}^{p-1}$ both belong to $\Theta(\lambda_n)$, then $\langle x^t_{\setminus u}, \bar\theta_u \rangle = \langle x^t_{\setminus u}, \tilde\theta_u \rangle$ for all $t \in \mathcal{T}_n$. Furthermore, the solutions $\bar\theta_u$ and $\tilde\theta_u$ have non-zero elements in the same positions.

We now use the result of Lemma 3 to show that with high probability the minimizer in (3.3) is unique. We consider the following event:

\[ \Omega_0 = \big\{ D_{\min} - \delta \leq y' \hat\Sigma^\tau_{SS} y \leq D_{\max} + \delta \; : \; y \in \mathbb{R}^s, \|y\|_2 = 1 \big\}. \]

Lemma 4. Consider a node $u \in V$. Assume that the conditions of Lemma 10 are satisfied and that the dependency condition A1 holds. There are constants $C, C', C'' > 0$, depending on $M$ and $M_K$ only, such that

\[ P[\Omega_0] \geq 1 - 2\exp\big( -C n h (\delta s^{-1} - C' h)^2 + C'' \log s \big). \]

Moreover, on the event $\Omega_0$, the minimizer of (3.3) is unique.
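For intuition, the following sketch tests whether a candidate vector satisfies the system (S) of Eq. (5.2), with the score computed as in Eq. (5.3). The tolerance and all inputs are placeholders.

```python
# Sketch: checking the KKT conditions (5.2) for a candidate theta_u.
import numpy as np

def weighted_score(theta_u, X, u, w):
    """sum_t w_t * grad gamma(theta_u; x^t), with grad gamma from Eq. (5.3)."""
    Z = np.delete(X, u, axis=1)                  # covariates x_{\u}^t
    a = Z @ theta_u
    p1 = np.exp(2 * a) / (1 + np.exp(2 * a))     # P_theta[x_u^t = 1 | x_{\u}^t]
    resid = X[:, u] + 1 - 2 * p1
    return Z.T @ (w * resid)

def satisfies_system_S(theta_u, X, u, w, lam, tol=1e-6):
    """Equality on the support, magnitude at most lambda off the support."""
    g = weighted_score(theta_u, X, u, w)
    on = theta_u != 0
    ok_support = np.allclose(g[on], lam * np.sign(theta_u[on]), atol=tol)
    ok_off = np.all(np.abs(g[~on]) <= lam + tol)
    return ok_support and ok_off
```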
We have shown that the estimate $\hat\theta^\tau_u$ is unique on the event $\Omega_0$, which under the conditions of Theorem 1 occurs with probability converging to 1 exponentially fast. To finish the proof of Theorem 1, we need to show that the estimate $\hat\theta^\tau_u$ has the same non-zero pattern as the true parameter vector $\theta^\tau_u$. To show this, we consider a few "good" events, which happen with high probability and on which the estimate $\hat\theta^\tau_u$ behaves well. Define

\[ \Omega_1 := \big\{ C_{\min} - \delta \leq y' \hat Q^\tau_{SS} y \; : \; y \in \mathbb{R}^s, \|y\|_2 = 1 \big\} \]

and

\[ \Omega_2 := \big\{ ||| \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha/2 \big\}. \]

Lemma 5. Assume that the conditions of Lemma 10 are satisfied. Assume also that the dependency condition A1 holds and that the incoherence condition A2 holds with incoherence parameter $\alpha$. There are constants $C, C', C'' > 0$, depending on $M$, $M_K$ and $\alpha$ only, such that

\[ P[\Omega_1] \geq 1 - 2\exp\big( -C n h \delta^2 s^{-2} + C' \log s \big) \]

and

\[ P[\Omega_2] \geq 1 - \exp\big( -C n h s^{-3} + C'' \log p \big). \]

Lemma 5 guarantees that the sample Fisher information matrix satisfies the "good" properties with high probability, under the appropriate scaling of the quantities $n$, $p$, $s$ and $h$. We are now ready to analyze the optimum of the convex program (3.3). To that end, we apply the mean value theorem coordinate-wise to the gradient of the weighted logloss $\sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta_u; x^t)$ and obtain

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla \gamma(\hat\theta^\tau_u; x^t) - \nabla \gamma(\theta^\tau_u; x^t) \big) = \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla^2 \gamma(\theta^\tau_u; x^t) \Big] (\hat\theta^\tau_u - \theta^\tau_u) + \Delta^\tau, \tag{5.4} \]

where $\Delta^\tau \in \mathbb{R}^{p-1}$ is the remainder term of the form

\[ \Delta^\tau_v = \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla^2 \gamma(\bar\theta^{(v)}_u; x^t) - \nabla^2 \gamma(\theta^\tau_u; x^t) \big) \Big]'_v (\hat\theta^\tau_u - \theta^\tau_u), \tag{5.5} \]

$\bar\theta^{(v)}_u$ is a point on the line between $\theta^\tau_u$ and $\hat\theta^\tau_u$, and $[\cdot]'_v$ denotes the $v$-th row of the matrix. Recall that $\hat Q^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla^2 \gamma(\theta^\tau_u; x^t)$. Using the expansion (5.4), we write the KKT conditions given in Eq. (5.2) in the following form: for all $v = 1, \ldots, p-1$,

\[ \begin{cases} \hat Q^\tau_v (\theta_u - \theta^\tau_u) + \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta^\tau_u; x^t))_v + \Delta^\tau_v = \lambda_n\, \mathrm{sign}(\theta_{uv}) & \text{if } \theta_{uv} \neq 0, \\ \big| \hat Q^\tau_v (\theta_u - \theta^\tau_u) + \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta^\tau_u; x^t))_v + \Delta^\tau_v \big| \leq \lambda_n & \text{if } \theta_{uv} = 0. \end{cases} \tag{5.6} \]

We consider the following events, with $\bar\Omega = \Omega_0 \cap \Omega_1 \cap \Omega_2$,

\[ \Omega_3 = \big\{ \forall v \in S : \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v - ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < |\theta^\tau_{uv}| \big\} \]

and

\[ \Omega_4 = \big\{ \forall v \in S^c : \big| (W^\tau_{S^c} - \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < \tfrac{\alpha}{2} \lambda_n \big\}, \]

where

\[ W^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) + \Delta^\tau. \]

We will work on the event $\Omega_1$, on which the minimum eigenvalue of $\hat Q^\tau_{SS}$ is strictly positive, so that $\hat Q^\tau_{SS}$ is regular and the events $\Omega_1 \cap \Omega_3$ and $\Omega_1 \cap \Omega_4$ are well defined.

Proposition 6. Assume that the conditions of Lemma 5 are satisfied. The event

\[ \big\{ \forall \hat\theta^\tau_u \in \mathbb{R}^{p-1} \text{ solution of } (S),\; \mathrm{sign}(\hat\theta^\tau_u) = \mathrm{sign}(\theta^\tau_u) \big\} \cap \bar\Omega \]

contains the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$.

Proof. We consider the following linear functional:

\[ G : \begin{cases} \mathbb{R}^s \to \mathbb{R}^s \\ \theta \mapsto \theta - \theta^\tau_S + (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S). \end{cases} \]

For any two vectors $y = (y_1, \ldots, y_s)' \in \mathbb{R}^s$ and $r = (r_1, \ldots, r_s)' \in \mathbb{R}^s_+$, define the following set centered at $y$:

\[ \mathcal{B}(y, r) = \prod_{i=1}^s (y_i - r_i,\; y_i + r_i). \]

Now, we have

\[ G\big( \mathcal{B}(\theta^\tau_S, |\theta^\tau_S|) \big) = \mathcal{B}\big( (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S), \; |\theta^\tau_S| \big). \]

On the event $\Omega_3 \cap \bar\Omega$,

\[ 0 \in \mathcal{B}\big( (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S), \; |\theta^\tau_S| \big), \]

which implies that there exists a vector $\bar\theta^\tau_S \in \mathcal{B}(\theta^\tau_S, |\theta^\tau_S|)$ such that $G(\bar\theta^\tau_S) = 0$. For $\bar\theta^\tau_S$ it holds that

\[ \bar\theta^\tau_S = \theta^\tau_S + \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S) - (\hat Q^\tau_{SS})^{-1} W^\tau_S \]

and $|\bar\theta^\tau_S - \theta^\tau_S| < |\theta^\tau_S|$.
Thus, the vector $\bar\theta^\tau_S$ satisfies

\[ \mathrm{sign}(\bar\theta^\tau_S) = \mathrm{sign}(\theta^\tau_S) \]

and

\[ \hat Q^\tau_{SS} (\bar\theta^\tau_S - \theta^\tau_S) + W^\tau_S = \lambda_n\, \mathrm{sign}(\bar\theta^\tau_S). \tag{5.7} \]

Next, we consider the vector $\bar\theta^\tau = (\bar\theta^{\tau\prime}_S, \bar\theta^{\tau\prime}_{S^c})'$, where $\bar\theta^\tau_{S^c}$ is the null vector of $\mathbb{R}^{p-1-s}$. On the event $\Omega_2$, from Lemma 5 we know that $||| \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha/2$. Now, on the event $\Omega_4 \cap \bar\Omega$ it holds that

\[ \| \hat Q^\tau_{S^c S} (\bar\theta^\tau_S - \theta^\tau_S) + W^\tau_{S^c} \|_\infty = \| -\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S + W^\tau_{S^c} + \lambda_n \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\bar\theta^\tau_S) \|_\infty < \lambda_n. \tag{5.8} \]

Note that for $\bar\theta^\tau$, equations (5.7) and (5.8) are equivalent to saying that $\bar\theta^\tau$ satisfies conditions (5.6), or (5.2), i.e., that $\bar\theta^\tau$ satisfies the KKT conditions. Since $\mathrm{sign}(\bar\theta^\tau_S) = \mathrm{sign}(\theta^\tau_S)$, we have $\mathrm{sign}(\bar\theta^\tau) = \mathrm{sign}(\theta^\tau_u)$. Furthermore, because of the uniqueness of the solution to (3.3) on the event $\Omega_0$, we conclude that $\hat\theta^\tau_u = \bar\theta^\tau$.

Proposition 6 implies Theorem 1 if we manage to show that the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$ occurs with high probability under the assumptions stated in Theorem 1. Proposition 7 characterizes the probability of that event, which concludes the proof of Theorem 1.

Proposition 7. Assume that the conditions of Theorem 1 are satisfied. Then there are constants $C, C' > 0$, depending on $M$, $M_K$, $D_{\max}$, $C_{\min}$ and $\alpha$ only, such that the following holds:

\[ P[\Omega_3 \cap \Omega_4 \cap \bar\Omega] \geq 1 - 2\exp\big( -C n h (\lambda_n - C' s h)^2 + \log p \big). \tag{5.9} \]

Proof. We start the proof of the proposition with a technical lemma that characterizes the distance between the vectors $\hat\theta^\tau_u = \bar\theta^\tau$ and $\theta^\tau_u$ under the assumptions of Theorem 1, where $\bar\theta^\tau$ is the vector constructed in the proof of Proposition 6. The following lemma gives a bound on the distance between the vectors $\hat\theta^\tau_S$ and $\theta^\tau_S$, which we use in the proof of the proposition. The proof of the lemma is given in the Appendix.

Lemma 8. Assume that the conditions of Theorem 1 are satisfied. There are constants $C, C' > 0$, depending on $M$, $M_K$, $D_{\max}$, $C_{\min}$ and $\alpha$ only, such that

\[ \| \hat\theta^\tau_S - \theta^\tau_S \|_2 \leq C \sqrt{\frac{s \log p}{n^{2/3}}} \tag{5.10} \]

with probability at least $1 - \exp(-C' \log p)$.

Using Lemma 8, we can prove Proposition 7. We start by studying the probability of the event $\Omega_4$. We have

\[ \Omega_4^c \subset \bigcup_{v \in S^c} \Big\{ \big| W^\tau_v + (\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \geq \tfrac{\alpha}{2} \lambda_n \Big\}. \]

Recall that $W^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) + \Delta^\tau$. Let us define the event

\[ \Omega_5 = \Big\{ \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| < \frac{\alpha \lambda_n}{2(4-\alpha)} \Big\}, \]

where $e_v \in \mathbb{R}^{p-1}$ is the unit vector with one at position $v$ and zeros elsewhere. From the proof of Lemma 8, available in the appendix, we have that $P[\Omega_5] \geq 1 - 2\exp(-C \log p)$, and on that event the bound given in Eq. (5.10) holds.

On the event $\Omega_5$, we bound the remainder term $\Delta^\tau$. Let $g : \mathbb{R} \to \mathbb{R}$ be defined as $g(z) = \frac{4\exp(2z)}{(1+\exp(2z))^2}$. Then $\eta(x; \theta_u) = g(x_u \langle \theta_u, x_{\setminus u} \rangle)$. For $v \in \{1, \ldots, p-1\}$, using the mean value theorem it follows that

\[ \begin{aligned} \Delta^\tau_v &= \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla^2 \gamma(\bar\theta^{(v)}_u; x^t) - \nabla^2 \gamma(\theta^\tau_u; x^t) \big) \Big]'_v (\hat\theta^\tau_u - \theta^\tau_u) \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t \big[ \eta(x^t; \bar\theta^{(v)}_u) - \eta(x^t; \theta^\tau_u) \big] \big[ x^t_{\setminus u} x^{t\prime}_{\setminus u} \big]'_v (\hat\theta^\tau_u - \theta^\tau_u) \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle \big) \big[ x^t_u x^t_{\setminus u} \big]' \big[ \bar\theta^{(v)}_u - \theta^\tau_u \big] \big[ x^t_v x^{t\prime}_{\setminus u} \big] \big[ \hat\theta^\tau_u - \theta^\tau_u \big] \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t \big\{ g'\big( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle \big) x^t_u x^t_v \big\} \big\{ [\bar\theta^{(v)}_u - \theta^\tau_u]' x^t_{\setminus u} x^{t\prime}_{\setminus u} [\hat\theta^\tau_u - \theta^\tau_u] \big\}, \end{aligned} \]

where $\bar{\bar\theta}^{(v)}_u$ is another point on the line joining $\hat\theta^\tau_u$ and $\theta^\tau_u$.
A simple calculation shows that $| g'( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle ) x^t_u x^t_v | \leq 1$ for all $t \in \mathcal{T}_n$, so we have

\[ \begin{aligned} |\Delta^\tau_v| &\leq \big[ \bar\theta^{(v)}_u - \theta^\tau_u \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big\} \big[ \hat\theta^\tau_u - \theta^\tau_u \big] \\ &\leq \big[ \hat\theta^\tau_u - \theta^\tau_u \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big\} \big[ \hat\theta^\tau_u - \theta^\tau_u \big] = \big[ \hat\theta^\tau_S - \theta^\tau_S \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_S x^{t\prime}_S \Big\} \big[ \hat\theta^\tau_S - \theta^\tau_S \big] \leq D_{\max} \| \hat\theta^\tau_S - \theta^\tau_S \|_2^2. \end{aligned} \tag{5.11} \]

Combining equations (5.11) and (5.10), we have that on the event $\Omega_5$

\[ \max_{1 \leq v \leq p-1} |\Delta^\tau_v| \leq C \lambda_n^2 s < \frac{\alpha \lambda_n}{2(4-\alpha)}, \]

where $C$ is a constant depending on $D_{\max}$ and $C_{\min}$ only. On the event $\Omega_2 \cap \Omega_5$, we then have

\[ \big| W^\tau_v + (\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < \frac{\alpha \lambda_n}{4-\alpha} + \Big( 1 - \frac{\alpha}{2} \Big) \frac{\alpha \lambda_n}{4-\alpha} \leq \frac{\alpha \lambda_n}{2}, \]

so that $P[\Omega_4] \geq 1 - 2\exp(-C \log p)$ for some constant $C$ depending on $M$, $M_K$, $C_{\min}$, $D_{\max}$ and $\alpha$ only.

Next, we study the probability of the event $\Omega_3$. We have

\[ \Omega_3^c \subset \bigcup_{v \in S} \Big\{ \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v + ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \geq |\theta^\tau_{uv}| \Big\}. \tag{5.12} \]

Again, we consider the event $\Omega_5$. On the event $\Omega_1 \cap \Omega_5$ we have that

\[ \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v + ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \leq \frac{\lambda_n \sqrt{s}}{C_{\min}} + \frac{\lambda_n}{C_{\min}} \leq C \lambda_n \sqrt{s}, \tag{5.13} \]

for some constant $C$. When $\theta_{\min} > C \lambda_n \sqrt{s}$, we have that $P[\Omega_3] \geq 1 - 2\exp(-C' \log p)$ for some constant $C'$ that depends on $M$, $M_K$, $C_{\min}$, $D_{\max}$ and $\alpha$ only.

In summary, under the assumptions of Theorem 1, the probability of the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$ converges to one exponentially fast. On this event, we have shown that the estimator $\hat\theta^\tau_u$ is the unique minimizer of (3.3) and that it consistently estimates the signed non-zero pattern of the true parameter vector $\theta^\tau_u$, i.e., it consistently estimates the neighborhood of the node $u$. Applying the union bound over all nodes $u \in V$, we conclude that our estimation procedure, explained in Section 3, consistently estimates the graph structure at the time point $\tau$.

6 Numerical simulation

In this section, we demonstrate the numerical performance of Algorithm 1. A detailed comparison with other estimation procedures and an application to biological data have been reported in Kolar et al. (2010b). We use three different types of graph structures: a chain, a nearest-neighbor and a random graph. Each graph has $p = 50$ nodes and the maximum node degree is bounded by $s = 4$. These graphs are detailed below.

Example 1: Chain graph. First, a random permutation $\pi$ of $\{1, \ldots, p\}$ is chosen. Then a graph structure is created by connecting consecutive nodes in the permutation, that is, $(\pi(1), \pi(2)), \ldots, (\pi(p-1), \pi(p)) \in E$.

Example 2: Nearest-neighbor graph. A nearest-neighbor graph is generated following the procedure outlined in Li and Gui (2006). For each node, we draw a point uniformly at random on the unit square and compute the pairwise distances between nodes. Each node is then connected to its 4 closest neighbors. Since some of the nodes will have more than 4 adjacent edges, we randomly remove edges from nodes that have degree larger than 4, until the maximum degree of a node in the graph is 4.

Example 3: Random graph. To generate a random graph with $e = 45$ edges, we add edges one at a time, between random pairs of nodes whose degree is less than 4.

We use the procedure described above to create the first random graph $\tilde G_1$. Next, we randomly add 10 edges and remove 10 edges from $\tilde G_1$, taking care that the maximum node degree remains 4, to obtain $\tilde G_2$. We repeat this process of adding and removing edges to obtain $\tilde G_3, \ldots, \tilde G_6$. We refer to these 6 graphs as the anchor graphs; a sketch of their construction is given below.
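The following sketch implements the anchor-graph construction just described: a random graph on $p = 50$ nodes with 45 edges and maximum degree 4 (Example 3), followed by the add/remove perturbation that produces the next anchor graph. The helper names and the random seed are our own choices.

```python
import numpy as np

def add_random_edges(edges, deg, n_new, p, max_deg, rng):
    """Add n_new edges, one at a time, between random pairs of nodes
    whose current degree is below max_deg (cf. Example 3)."""
    while n_new > 0:
        u, v = rng.choice(p, size=2, replace=False)
        e = (min(u, v), max(u, v))
        if e not in edges and deg[u] < max_deg and deg[v] < max_deg:
            edges.add(e)
            deg[u] += 1
            deg[v] += 1
            n_new -= 1
    return edges, deg

def anchor_graphs(p=50, n_edges=45, max_deg=4, k=10, n_anchors=6, seed=0):
    """Generate the first random graph, then repeatedly remove and add
    k edges (keeping the degree cap) to obtain the anchor sequence."""
    rng = np.random.default_rng(seed)
    edges, deg = add_random_edges(set(), np.zeros(p, dtype=int),
                                  n_edges, p, max_deg, rng)
    anchors = [sorted(edges)]
    for _ in range(n_anchors - 1):
        drop_idx = set(rng.choice(len(edges), size=k, replace=False).tolist())
        kept = set()
        for i, e in enumerate(sorted(edges)):
            if i in drop_idx:
                deg[e[0]] -= 1
                deg[e[1]] -= 1
            else:
                kept.add(e)
        edges, deg = add_random_edges(kept, deg, k, p, max_deg, rng)
        anchors.append(sorted(edges))
    return anchors

graphs = anchor_graphs()
print([len(g) for g in graphs])  # each anchor graph keeps 45 edges
```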
We randomly generate the prototype parameter vectors $\tilde\theta_1, \ldots, \tilde\theta_6$ corresponding to the anchor graphs, and then interpolate 200 points between consecutive prototypes to obtain the parameters $\{\theta^t\}_{t \in \mathcal{T}_n}$, which gives us $n = 1000$. A prototype parameter vector $\tilde\theta_i$ for each anchor graph $\tilde G_i$, $i \in \{1, \ldots, 6\}$, is generated by sampling the non-zero elements of the vector independently from $\mathrm{Unif}([-1, -0.5] \cup [0.5, 1])$. For each $t \in \mathcal{T}_n$, we generate 10 i.i.d. samples from the distribution $P_{\theta^t}$ using Gibbs sampling; specifically, we discard samples from the first 10 iterations and collect samples every 100 iterations.

We estimate $\hat G^t$ for each $t \in \mathcal{T}_n$ using $k \in \{1, \ldots, 10\}$ samples at each time point. The results are expressed in terms of the precision (Pre), the recall (Rec) and the F1 score, defined as $F_1 := 2 \cdot \mathrm{Pre} \cdot \mathrm{Rec} / (\mathrm{Pre} + \mathrm{Rec})$. Let $\hat E^t$ denote the estimated edge set of $\hat G^t$; then the precision is calculated as $\mathrm{Pre} := \frac{1}{n} \sum_{t \in \mathcal{T}_n} |\hat E^t \cap E^t| / |\hat E^t|$ and the recall as $\mathrm{Rec} := \frac{1}{n} \sum_{t \in \mathcal{T}_n} |\hat E^t \cap E^t| / |E^t|$. Furthermore, we report results averaged over 100 independent runs. The tuning parameters are selected by maximizing the BIC score over a grid of regularization parameters, as described in Kolar et al. (2010b). Table 2 contains a summary of the simulation results.

Table 2: Summary of simulation results. The number of nodes is $p = 50$ and the number of discrete time points is $n = 1000$.

                     Number of independent samples
            1     2     3     4     5     6     7     8     9     10
Precision
  Chain   0.75  0.95  0.96  0.96  0.97  0.98  0.99  0.99  0.99  0.99
  NN      0.84  0.98  0.97  0.96  0.98  0.98  0.98  0.98  0.97  0.98
  Random  0.55  0.57  0.65  0.71  0.75  0.79  0.83  0.84  0.85  0.85
Recall
  Chain   0.59  0.65  0.69  0.72  0.73  0.73  0.73  0.73  0.73  0.73
  NN      0.48  0.57  0.61  0.63  0.63  0.64  0.64  0.64  0.65  0.65
  Random  0.50  0.52  0.55  0.56  0.56  0.58  0.60  0.60  0.63  0.66
F1 score
  Chain   0.66  0.76  0.80  0.82  0.83  0.84  0.84  0.84  0.85  0.84
  NN      0.61  0.72  0.74  0.76  0.77  0.77  0.77  0.77  0.77  0.78
  Random  0.52  0.54  0.60  0.63  0.64  0.67  0.70  0.70  0.72  0.74

As suggested by the reviewer, we perform an additional simulation that illustrates that the conditions of Theorem 1 can be satisfied. We use the random chain graph and the nearest-neighbor graph as two simulation settings. In each setting, we generate two anchor graphs with $p$ nodes and create two prototype parameter vectors, as described above. We then interpolate these two parameter vectors over $n$ points. Theorem 1 predicts the scaling of the sample size $n$, as a function of the other parameters, required to successfully recover the graph at a time point $\tau$. Therefore, if our theory correctly predicts the behavior of the estimation procedure and we plot the Hamming distance between the true and the recovered graph structures against an appropriately rescaled sample size, we expect the curves for different problem sizes to reach zero distance at the same point. The bandwidth parameter $h$ is set as $h = 4 \cdot n^{-1/3}$ and the penalty parameter as $\lambda_n = 2\sqrt{n^{-2/3}\log p}$, as suggested by the theory. Figure 1 shows the Hamming distance against the rescaled sample size $n/(s^{4.5}\log^{1.5}(p))$. Each point is averaged over 100 independent runs.

7 Discussion

In this paper, we focus on sparsistent estimation of the time-varying high-dimensional graph structure in Markov random fields from a small sample. An interesting open direction is the estimation of the graph structure from a general time series, where observations are dependent. In our opinion, the graph structure that changes with time creates the biggest technical difficulties.
Incorporating dependent observations would be an easier problem to address; it is, however, one of great practical importance, since samples in real data sets are likely to be dependent. Another open direction is to establish necessary conditions, complementing the sufficient conditions established here, under which it is possible to estimate a time-varying graph structure. Yet another research direction may be to use the non-convex penalties introduced by Fan and Li (2001) in place of the $\ell_1$ penalty. The idea would be to relax the condition imposed in assumption A2, since it is well known that SCAD penalties improve performance when the variables are correlated.

[Figure 1: Average Hamming distance plotted against the rescaled sample size $n/(s^{4.5}\log^{1.5}(p))$ for the chain graph and the nearest-neighbor graph, with $p = 60$, $p = 100$ and $p = 140$. Each column represents one simulation setting. Results are averaged over 100 independent runs.]

Acknowledgment

We would like to thank Larry Wasserman for many useful discussions and suggestions. The research reported here was supported in part by Grant ONR N000140910758, NSF DBI-0640543, NSF DBI-0546594, NSF IIS-0713379, an Alfred P. Sloan Research Fellowship to EPX and a graduate fellowship from Facebook to MK.

A Large deviation inequalities

In this section we characterize the deviation of the elements of the sample Fisher information matrix $\hat Q^\tau := \hat Q^\tau_u$ at the time point $\tau$, defined as

\[ \hat Q^\tau = \sum_{t} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_{\setminus u} x^{t\prime}_{\setminus u}, \tag{A.1} \]

and of the sample covariance matrix $\hat\Sigma^\tau$, from their population versions $Q^\tau$ and $\Sigma^\tau$. These results are crucial for the proof of the main theorem, where the consistency result depends on the bounds on the differences $\hat Q^\tau - Q^\tau$ and $\hat\Sigma^\tau - \Sigma^\tau$. In the following, we use $C$, $C'$ and $C''$ as generic positive constants independent of $(n, p, s)$.

A.1 Sample Fisher information matrix

To bound the deviation between the elements of $\hat Q^\tau = [\hat q^\tau_{vv'}]$ and $Q^\tau = [q^\tau_{vv'}]$, $v, v' \in V \setminus u$, we use the following decomposition:

\[ \begin{aligned} |\hat q^\tau_{vv'} - q^\tau_{vv'}| \leq \; & \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_v x^t_{v'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| \\ & + \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] \Big| \\ & + \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big|. \end{aligned} \tag{A.2} \]

The following lemma gives bounds on the terms in Eq. (A.2).

Lemma 9. Assume that the smoothness condition A3 is satisfied and that the kernel function $K(\cdot)$ satisfies A4. Furthermore, assume that

\[ \max_{t \in [0,1]} |\{ v \in \{1, \ldots, p\} : \theta^t_{uv} \neq 0 \}| < s, \]

i.e., the number of non-zero elements of the parameter vector is bounded by $s$. There exist constants $C, C', C'' > 0$, depending only on $M$ and $M_K$ (the constants quantifying assumptions A3 and A4, respectively), such that for any $\tau \in [0, 1]$ we have

\[ \max_{v,v'} \Big| \hat q^\tau_{vv'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| \leq C s h \tag{A.3} \]

and

\[ \max_{v,v'} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| \leq C' h. \tag{A.4} \]

Furthermore,

\[ \Big| \sum_{t \in \mathcal{T}_n} \big( w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} - \mathbb{E}[ w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} ] \big) \Big| < \epsilon \tag{A.5} \]

with probability at least $1 - 2\exp(-C'' n h \epsilon^2)$.

Proof. We start the proof by bounding the difference $|\eta(x; \theta^{t+\delta}_u) - \eta(x; \theta^t_u)|$, which will be useful later on.
By applying the mean value theorem to $\eta(x; \cdot)$ and the Taylor expansion to $\theta^t_u$, we obtain

\[ \begin{aligned} |\eta(x; \theta^{t+\delta}_u) - \eta(x; \theta^t_u)| &= \Big| \sum_{v=1}^{p-1} (\theta^{t+\delta}_{uv} - \theta^t_{uv})\, \eta'(x; \bar\theta^{(v)}_u) \Big| && \text{($\bar\theta^{(v)}_u$ on the line between $\theta^{t+\delta}_u$ and $\theta^t_u$)} \\ &\leq \sum_{v=1}^{p-1} |\theta^{t+\delta}_{uv} - \theta^t_{uv}| && \text{($|\eta'(x; \cdot)| \leq 1$)} \\ &\leq \sum_{v=1}^{p-1} \Big| \delta \frac{\partial}{\partial t}\theta^t_{uv} + \frac{\delta^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\beta_v} \Big| && \text{($\beta_v$ on the line between $t$ and $t+\delta$)} \\ &\leq C s (\delta + \delta^2). \end{aligned} \]

We now prove Eq. (A.3). Using the above bound and approximating the sum by a Riemann integral, we have

\[ \begin{aligned} \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_v x^t_{v'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| &\approx \Big| \int \frac{1}{h} K\Big( \frac{z - \tau}{h} \Big) \big[ \eta(x^z; \theta^\tau_u) - \eta(x^z; \theta^z_u) \big] x^z_v x^z_{v'} \, dz \Big| \\ &\leq \int_{-1}^{1} K(z') \, \big| \eta(x^{\tau + z'h}; \theta^\tau_u) - \eta(x^{\tau + z'h}; \theta^{\tau + z'h}_u) \big| \, dz' \\ &\leq \int_{-1}^{1} K(z') \Big[ \sum_{v=1}^{p-1} \Big| z'h \frac{\partial}{\partial t}\theta^t_{uv}\Big|_{t=\tau} + \frac{(z'h)^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\beta_v} \Big| \Big] dz' \\ &\leq C s h, \end{aligned} \]

for some constant $C > 0$ depending on $M$ from A3, which bounds the derivatives in the equation above, and $M_K$ from A4, which bounds the kernel. The last inequality follows from the assumption that the number of non-zero components of the vector $\theta^t_u$ is bounded by $s$.

Next, we prove equation (A.4). Using the Taylor expansion, for any fixed $1 \leq v, v' \leq p-1$,

\[ \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t (q^t_{vv'} - q^\tau_{vv'}) \Big| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \Big( (t - \tau) \frac{\partial}{\partial t} q^t_{vv'}\Big|_{t=\tau} + \frac{(t-\tau)^2}{2} \frac{\partial^2}{\partial t^2} q^t_{vv'}\Big|_{t=\xi} \Big) \Big|, \]

where $\xi \in [t, \tau]$. Since $w^\tau_t = 0$ for $|t - \tau| > h$, we have

\[ \max_{v,v'} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| \leq C' h \]

for some constant $C' > 0$ depending on $M$ and $M_K$ only.

Finally, we prove equation (A.5). Observe that the $w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'}$ are independent random variables, bounded in $[-w^\tau_t, w^\tau_t]$. The claim then follows from Hoeffding's inequality.

Using the results of Lemma 9, we can obtain the rate at which the element-wise distance between the true and sample Fisher information matrices decays to zero, as a function of the bandwidth parameter $h$ and the neighborhood size $s$. In the proof of the main theorem, the bandwidth parameter is chosen so that the bias and variance terms are balanced.

A.2 Sample covariance matrix

The deviation of the elements of the sample covariance matrix is bounded in a similar way as the deviation of the elements of the sample Fisher information matrix, given in Lemma 9. Denote the sample covariance matrix at the time point $\tau$ by

\[ \hat\Sigma^\tau = \sum_{t} w^\tau_t x^t x^{t\prime}. \]

The difference between the elements of $\hat\Sigma^\tau$ and $\Sigma^\tau$ can be bounded as

\[ |\hat\sigma^\tau_{uv} - \sigma^\tau_{uv}| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \sigma^\tau_{uv} \Big| \leq \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] \Big| + \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] - \sigma^\tau_{uv} \Big|. \tag{A.6} \]

The following lemma gives bounds on the terms in Eq. (A.6).

Lemma 10. Assume that the smoothness condition A3 is satisfied and that the kernel function $K(\cdot)$ satisfies A4. There are constants $C, C' > 0$, depending on $M$ and $M_K$ only, such that for any $\tau \in [0, 1]$ we have

\[ \max_{u,v} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] - \sigma^\tau_{uv} \Big| \leq C h \tag{A.7} \]

and

\[ \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] \Big| \leq \epsilon \tag{A.8} \]

with probability at least $1 - 2\exp(-C' n h \epsilon^2)$.

Proof. We follow the same proof strategy as in the proof of Lemma 9. In particular, Eq. (A.7) is proved in the same way as Eq. (A.4), and Eq. (A.8) in the same way as Eq. (A.5). The details of the derivation are omitted.
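The following sketch computes the plug-in quantities analyzed in this appendix: the kernel weights, the sample covariance $\hat\Sigma^\tau$, and the sample Fisher information matrix $\hat Q^\tau$ of Eq. (A.1), for given data and parameter vector. The box kernel and the inputs are placeholder choices.

```python
import numpy as np

def sample_matrices(X, u, theta_u, tau, h):
    """Compute Sigma_hat^tau and Q_hat^tau (Eq. (A.1)) at time tau.

    X is an n-by-p array in {-1, +1}; theta_u is the (p-1)-vector for node u.
    """
    n, p = X.shape
    times = np.arange(1, n + 1) / n
    w = 0.5 * (np.abs((times - tau) / h) <= 1.0)   # box kernel K_h
    w = w / w.sum()                                # normalized weights w_t^tau
    Sigma_hat = (X * w[:, None]).T @ X             # sum_t w_t x^t x^t'
    Z = np.delete(X, u, axis=1)                    # covariates x_{\u}^t
    a = np.exp(2 * X[:, u] * (Z @ theta_u))
    eta = 4 * a / (1 + a) ** 2                     # variance function eta(x; theta_u)
    Q_hat = (Z * (w * eta)[:, None]).T @ Z         # Eq. (A.1)
    return Sigma_hat, Q_hat
```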
B Technical proofs

In this appendix we provide the proofs of the lemmas used to prove the main result.

B.1 Proof of Lemma 3

The set of minima $\Theta(\lambda_n)$ of a convex function is convex. So, for two distinct points of minima $\bar\theta_u$ and $\tilde\theta_u$, every point on the line connecting them also belongs to the set of minima, i.e., $\xi \bar\theta_u + (1 - \xi) \tilde\theta_u \in \Theta(\lambda_n)$ for any $\xi \in (0, 1)$. Let $\boldsymbol\eta = \bar\theta_u - \tilde\theta_u$, so that any point on the line can be written as $\tilde\theta_u + \xi \boldsymbol\eta$. The value of the objective at any point of minima is constant, so we have

\[ F(\tilde\theta_u + \xi \boldsymbol\eta) = c, \quad \xi \in (0, 1), \]

where $c$ is some constant. By taking the derivative of $F(\tilde\theta_u + \xi \boldsymbol\eta)$ with respect to $\xi$, we obtain

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t \Big[ -x^t_u + \frac{\exp(\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) - \exp(-\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)}{\exp(\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) + \exp(-\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)} \Big] \langle \boldsymbol\eta, x^t_{\setminus u} \rangle + \lambda_n \sum_{v=1}^{p-1} \eta_v\, \mathrm{sign}(\tilde\theta_{uv} + \xi \eta_v) = 0. \tag{B.1} \]

On a small neighborhood of $\xi$, the sign of each component of $\tilde\theta_u + \xi \boldsymbol\eta$ is constant, since $\tilde\theta_u + \xi \boldsymbol\eta$ is continuous in $\xi$. By taking the derivative of Eq. (B.1) with respect to $\xi$, and noting that the last term is constant on a small neighborhood of $\xi$, we obtain

\[ 4 \sum_{t \in \mathcal{T}_n} w^\tau_t\, \langle \boldsymbol\eta, x^t_{\setminus u} \rangle^2 \, \frac{\exp(2\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)}{\big( 1 + \exp(2\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) \big)^2} = 0. \]

This implies that $\langle \boldsymbol\eta, x^t_{\setminus u} \rangle = 0$ for every $t \in \mathcal{T}_n$, which in turn implies that $\langle x^t_{\setminus u}, \bar\theta_u \rangle = \langle x^t_{\setminus u}, \tilde\theta_u \rangle$, $t \in \mathcal{T}_n$, for any two solutions $\bar\theta_u$ and $\tilde\theta_u$. Since $\bar\theta_u$ and $\tilde\theta_u$ were two arbitrary elements of $\Theta(\lambda_n)$, we conclude that $\langle x^t_{\setminus u}, \theta_u \rangle$, $t \in \mathcal{T}_n$, is constant across all elements $\theta_u \in \Theta(\lambda_n)$.

Next, we show that this conclusion implies that any two solutions have non-zero elements in the same positions. From equation (5.2), it follows that the set of non-zero components of a solution is given by

\[ S = \Big\{ 1 \leq v \leq p-1 \; : \; \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v \Big| = \lambda_n \Big\}. \]

Using equation (5.3), we have that

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v = \sum_{t \in \mathcal{T}_n} w^\tau_t \Big( x^t_{\setminus u} \Big\{ x^t_u + 1 - \frac{2\exp(2 x^t_u \langle \theta_u, x^t_{\setminus u} \rangle)}{\exp(2 x^t_u \langle \theta_u, x^t_{\setminus u} \rangle) + 1} \Big\} \Big)_v, \]

which is constant across different elements $\theta_u \in \Theta(\lambda_n)$, since $\langle x^t_{\setminus u}, \theta_u \rangle$, $t \in \mathcal{T}_n$, is constant for all $\theta_u \in \Theta(\lambda_n)$. This implies that the set of non-zero components is the same for all solutions. □

B.2 Proof of Lemma 4

Under the assumptions of the lemma, we can apply the result of Lemma 10. Let $y \in \mathbb{R}^s$ be a unit-norm minimal eigenvector of $\hat\Sigma^\tau_{SS}$. We have

\[ \Lambda_{\min}(\Sigma^\tau_{SS}) = \min_{\|x\|_2 = 1} x' \Sigma^\tau_{SS} x = \min_{\|x\|_2 = 1} \big\{ x' \hat\Sigma^\tau_{SS} x + x' (\Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS}) x \big\} \leq y' \hat\Sigma^\tau_{SS} y + y' (\Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS}) y, \]

which implies

\[ \Lambda_{\min}(\hat\Sigma^\tau_{SS}) \geq D_{\min} - ||| \Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS} |||_2. \]

Let $\Sigma^\tau = [\sigma^\tau_{uv}]$ and $\hat\Sigma^\tau = [\hat\sigma^\tau_{uv}]$. We have the following bound on the spectral norm:

\[ ||| \Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS} |||_2 \leq \Big( \sum_{u=1}^s \sum_{v=1}^s (\hat\sigma^\tau_{uv} - \sigma^\tau_{uv})^2 \Big)^{1/2} \leq \delta, \]

with probability at least $1 - 2\exp(-C n h (\delta s^{-1} - C' h)^2 + C'' \log s)$, for some fixed constants $C, C', C'' > 0$ depending on $M$ and $M_K$ only. Similarly, we have that

\[ \Lambda_{\max}(\hat\Sigma^\tau_{SS}) \leq D_{\max} + \delta, \]

with probability at least $1 - 2\exp(-C n h (\delta s^{-1} - C' h)^2 + C'' \log s)$, for some fixed constants $C, C', C'' > 0$ depending on $M$ and $M_K$ only.

From Lemma 3, we know that any two solutions $\bar\theta_u, \tilde\theta_u \in \Theta(\lambda_n)$ of the optimization problem (3.3) have non-zero elements in the same positions. So, for any two solutions $\bar\theta_u, \tilde\theta_u \in \Theta(\lambda_n)$, it holds that

\[ X_{\setminus u} (\bar\theta_u - \tilde\theta_u) = X_{\setminus u, S} (\bar\theta_u - \tilde\theta_u)_S + X_{\setminus u, S^c} (\bar\theta_u - \tilde\theta_u)_{S^c} = X_{\setminus u, S} (\bar\theta_u - \tilde\theta_u)_S. \]
Furthermore, from Lemma 3 we know that the difference of the two solutions lies in the kernel of $X_{\setminus u, S}$. On the event $\Omega_0$, the kernel of $X_{\setminus u, S}$ is $\{0\}$. Thus, the solution is unique on $\Omega_0$. □

B.3 Proof of Lemma 5

We first analyze the probability of the event $\Omega_1$. Using the same argument as in the proof of Lemma 4, we obtain

\[ \Lambda_{\min}(\hat Q^\tau_{SS}) \geq C_{\min} - ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2. \]

Next, using the results of Lemma 9, we have the bound

\[ ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2 \leq \Big( \sum_{u=1}^s \sum_{v=1}^s (\hat q^\tau_{uv} - q^\tau_{uv})^2 \Big)^{1/2} \leq \delta, \tag{B.2} \]

with probability at least $1 - 2\exp(-C n h \delta^2 s^{-2} + 2\log s)$, for some fixed constants $C, C' > 0$ depending on $M$ and $M_K$ only.

Next, we deal with the event $\Omega_2$. We use the following decomposition:

\[ \begin{aligned} \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} = \; & Q^\tau_{S^c S} \big[ (\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1} \big] + \big[ \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} \big] (Q^\tau_{SS})^{-1} \\ & + \big[ \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} \big] \big[ (\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1} \big] + Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} \\ = \; & T_1 + T_2 + T_3 + T_4. \end{aligned} \]

Under assumption A2, we have that $||| T_4 |||_\infty \leq 1 - \alpha$. The lemma follows if we prove that $||| \cdot |||_\infty \leq \alpha/6$ for each of the other terms. Using the submultiplicative property of the norm, we have for the first term

\[ ||| T_1 |||_\infty \leq ||| Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} |||_\infty \, ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \, ||| (\hat Q^\tau_{SS})^{-1} |||_\infty \leq (1 - \alpha) \, ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \sqrt{s} \, ||| (\hat Q^\tau_{SS})^{-1} |||_2. \tag{B.3} \]

Using Eq. (B.2), we can bound $||| (\hat Q^\tau_{SS})^{-1} |||_2 \leq C''$, for some constant depending on $C_{\min}$ only, with probability at least $1 - 2\exp(-C n h s^{-2} + 2\log s)$, for some fixed constant $C > 0$. The bound on the term $||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty$ follows from an application of Lemma 9. Observe that

\[ P\big[ ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \geq \delta \big] = P\Big[ \max_{v \in S} \Big\{ \sum_{v' \in S} |\hat q^\tau_{vv'} - q^\tau_{vv'}| \Big\} \geq \delta \Big] \leq 2\exp\big( -C n h (\delta s^{-1} - C' s h)^2 + 2\log s \big), \tag{B.4} \]

for some fixed constants $C, C' > 0$. Combining these bounds, we obtain $||| T_1 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log s)$, for some constants $C, C', C'' > 0$. For the second term, we have

\[ ||| T_2 |||_\infty \leq ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \sqrt{s} \, ||| (Q^\tau_{SS})^{-1} |||_2 \leq \frac{\sqrt{s}}{C_{\min}} ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty. \tag{B.5} \]

The bound on the term $||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty$ follows in the same way as the bound in Eq. (B.4), and we conclude that $||| T_2 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log p)$, for some constants $C, C', C'' > 0$. The same holds for the term $T_3$, for which we have the decomposition

\[ \begin{aligned} ||| [\hat Q^\tau_{S^c S} - Q^\tau_{S^c S}] [(\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1}] |||_\infty &\leq ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \sqrt{s} \, ||| (Q^\tau_{SS})^{-1} [Q^\tau_{SS} - \hat Q^\tau_{SS}] (\hat Q^\tau_{SS})^{-1} |||_2 \\ &\leq \frac{\sqrt{s}}{C_{\min}} ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \, ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2 \, ||| (\hat Q^\tau_{SS})^{-1} |||_2, \end{aligned} \]

so that $||| T_3 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log p)$. The bound on the probability of the event $\Omega_2$ follows by combining the bounds on all the terms. □

B.4 Proof of Lemma 8

To prove this lemma, we use a technique of Rothman et al. (2008), applied there to the problem of consistency of a penalized covariance matrix estimator. Define the function

\[ H : \begin{cases} \mathbb{R}^{p-1} \to \mathbb{R} \\ D \mapsto F(\theta^\tau_u + D) - F(\theta^\tau_u), \end{cases} \]

where the function $F(\cdot)$ is defined in equation (5.1). The function $H(\cdot)$ takes the form

\[ H(D) = \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \gamma(\theta^\tau_u; x^t) - \gamma(\theta^\tau_u + D; x^t) \big) + \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big). \]

Recall the minimizer of (3.3) constructed in the proof of Proposition 6, $\hat\theta^\tau_u = (\bar\theta'_S, 0'_{S^c})'$. The minimizer of the function $H(\cdot)$ is $\hat D = \hat\theta^\tau_u - \theta^\tau_u$. The function $H(\cdot)$ is convex and $H(0) = 0$ by construction. Therefore $H(\hat D) \leq 0$.
If we show that, for some radius $B > 0$ and all $D \in \mathbb{R}^{p-1}$ with $\|D\|_2 = B$ and $D_{S^c} = 0$, we have $H(D) > 0$, then we can claim that $\|\hat D\|_2 \leq B$. This follows from the convexity of $H(\cdot)$.

We proceed to show strict positivity of $H(\cdot)$ on the boundary of the ball with radius $B = K \lambda_n \sqrt{s}$, where $K > 0$ is a constant to be chosen. Let $D \in \mathbb{R}^{p-1}$ be an arbitrary vector with $\|D\|_2 = B$ and $D_{S^c} = 0$. By a Taylor expansion of $\gamma(\cdot; x^t)$, we have

\[ H(D) = -\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big)' D + \frac{1}{2} D' \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big] D + \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big) = (I) + (II) + (III), \tag{B.6} \]

for some $\alpha \in [0, 1]$. We first bound the term $(I)$. Let $e_v \in \mathbb{R}^{p-1}$ be the unit vector with one at position $v$ and zeros elsewhere. The random variables $e'_v w^\tau_t \nabla \gamma(\theta^\tau_u; x^t)$ are bounded in $[-C/(nh), C/(nh)]$ for all $1 \leq v \leq p-1$, with a constant $C > 0$ depending on $M_K$ only. Using the Hoeffding inequality and the union bound, we have

\[ \max_{1 \leq v \leq p-1} \Big| e'_v \Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) \Big] \Big) \Big| \leq \delta \]

with probability at least $1 - 2\exp(-C n h \delta^2 + \log p)$, where $C > 0$ depends on $M_K$ only. Moreover, denoting $p(\theta^t_u) = P_{\theta^t_u}[x^t_u = 1 \mid x^t_{\setminus u}]$ to simplify the notation, we have, for all $1 \leq v \leq p-1$,

\[ \Big| \mathbb{E}\Big[ e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) \;\Big|\; \{x^t_{\setminus u}\}_{t \in \mathcal{T}_n} \Big] \Big| = \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_v \big[ x^t_u + 1 - 2 p(\theta^\tau_u) \big] \;\Big|\; \{x^t_{\setminus u}\}_{t \in \mathcal{T}_n} \Big] \Big| = \Big| \sum_{t \in \mathcal{T}_n} 2 w^\tau_t x^t_v \big[ p(\theta^t_u) - p(\theta^\tau_u) \big] \Big| \leq 2 \int_{-1}^{1} K(z) \, \big| p(\theta^{\tau + zh}_u) - p(\theta^\tau_u) \big| \, dz. \tag{B.7} \]

Next, we apply the mean value theorem to $p(\cdot)$ and Taylor's theorem to $\theta^t_u$. Under assumption A3, using $|p'(\cdot)| \leq 1$, we have

\[ \big| p(\theta^{\tau + zh}_u) - p(\theta^\tau_u) \big| \leq \sum_{v=1}^{p-1} |\theta^{\tau+zh}_{uv} - \theta^\tau_{uv}| \leq \sum_{v=1}^{p-1} \Big| zh \frac{\partial}{\partial t}\theta^t_{uv}\Big|_{t=\tau} + \frac{(zh)^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\alpha_v} \Big| \leq C s |zh + (zh)^2|, \tag{B.8} \]

where $\alpha_v$ is a point between $\tau + zh$ and $\tau$, and $C > 0$ depends on $M$ only. Combining (B.8) and (B.7), we have that $| \mathbb{E}[ e'_v \sum_{t} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) ] | \leq C s h$ for all $1 \leq v \leq p-1$. Thus, with probability greater than $1 - 2\exp(-C n h (\lambda_n - C' s h)^2 + \log p)$, for some constants $C, C' > 0$ depending on $M_K$, $M$ and $\alpha$ only, which under the conditions of Theorem 1 converges to 1 exponentially fast, we have

\[ \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| \leq \frac{\alpha \lambda_n}{2(4-\alpha)} < \lambda_n. \]

On that event, using Hölder's inequality, we have

\[ \Big| \Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big)' D \Big| \leq \|D\|_1 \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| \leq \lambda_n \sqrt{s} \|D\|_2 \leq (\lambda_n \sqrt{s})^2 K. \]

The triangle inequality applied to the term $(III)$ of equation (B.6) yields

\[ \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big) \geq -\lambda_n \|D_S\|_1 \geq -\lambda_n \sqrt{s} \|D_S\|_2 \geq -K (\lambda_n \sqrt{s})^2. \]

Finally, we bound the term $(II)$ of equation (B.6). Observe that, since $D_{S^c} = 0$, we have

\[ D' \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big] D = D'_S \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big] D_S \geq K^2 \lambda_n^2 s \, \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big). \]

Let $g : \mathbb{R} \to \mathbb{R}$ be defined as $g(z) = \frac{4\exp(2z)}{(1+\exp(2z))^2}$.
Now, $\eta(x; \theta_u) = g(x_u \langle \theta_u, x_{\setminus u} \rangle)$ and we have

\[ \begin{aligned} \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) &\geq \min_{\alpha \in [0,1]} \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) \\ &\geq \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_S x^{t\prime}_S \Big) - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)\, x^t_S x^{t\prime}_S \Big|\Big|\Big|_2 \\ &\geq C_{\min} - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)\, x^t_S x^{t\prime}_S \Big|\Big|\Big|_2. \end{aligned} \]

To bound the spectral norm, observe that for any fixed $\alpha \in [0, 1]$ and $y \in \mathbb{R}^s$ with $\|y\|_2 = 1$,

\[ \begin{aligned} y' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S) x^t_S x^{t\prime}_S \Big\} y &= \sum_{t \in \mathcal{T}_n} w^\tau_t g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)(x^{t\prime}_S y)^2 \\ &\leq \sum_{t \in \mathcal{T}_n} w^\tau_t \big| g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S) \big| (x^{t\prime}_S y)^2 \\ &\leq \sqrt{s} \|D\|_2 \, \Big|\Big|\Big| \sum_{t} w^\tau_t x^t_S x^{t\prime}_S \Big|\Big|\Big|_2 \qquad \big( |g'(\cdot)| \leq 1 \big) \\ &\leq D_{\max} K \lambda_n s \leq \frac{C_{\min}}{2}. \end{aligned} \]

The last inequality holds as long as $\lambda_n s \leq \frac{C_{\min}}{2 D_{\max} K}$. We have thus shown that

\[ \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) \geq \frac{C_{\min}}{2}, \]

with high probability. Putting the bounds on the three terms together, we have

\[ H(D) \geq (\lambda_n \sqrt{s})^2 \Big\{ -K + \frac{C_{\min}}{4} K^2 - K \Big\}, \]

which is strictly positive for $K = 12 / C_{\min}$. For this choice of $K$, the requirement $\lambda_n s \leq \frac{C_{\min}^2}{24 D_{\max}}$ holds under the conditions of Theorem 1 for $n$ large enough. □

References

A. Ahmed and E. P. Xing. Recovering time-varying networks of dependencies in social and biological studies. Proc. Natl. Acad. Sci., 106(29):11878–11883, 2009.

C. Andrieu, M. Davy, and A. Doucet. Efficient particle filtering for jump Markov systems. Application to time-varying autoregressions. IEEE Transactions on Signal Processing, 51(7):1762–1770, 2003.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485–516, 2008.

G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: Some easy observations and algorithms. Arxiv, 0712.1402, 2007.

W. S. Cleveland, E. Grosse, and W. M. Shyu. Local regression models. In J. M. Chambers and T. J. Hastie, editors, Statistical Models in S, pages 309–376, 1991.

N. Dobigeon, J.-Y. Tourneret, and M. Davy. Joint segmentation of piecewise constant autoregressive processes by using a hierarchical model and a Bayesian sampling approach. IEEE Transactions on Signal Processing, 55(4):1251–1263, 2007.

F. Dondelinger, S. Lebre, and D. Husmeier. Heterogeneous continuous dynamic Bayesian networks with flexible structure and inter-time segment information sharing. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

F. Dondelinger, S. Lebre, and D. Husmeier. Non-homogeneous dynamic Bayesian networks with Bayesian regularization for inferring gene regulatory networks with gradually time-varying structure. Machine Learning, pages 1–40, 2012. ISSN 0885-6125. doi: 10.1007/s10994-012-5311-x.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96:1348–1360, 2001.

J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Statist., 3(2):521–541, 2009.

P. Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213, 2006.

E. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Bayesian nonparametric inference of switching dynamic linear models.