Sparsistent Estimation of Time-Varying Discrete Markov Random Fields
Mladen Kolar and Eric P. Xing*

*Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15217, USA; e-mail: {mladenk,epxing}@cs.cmu.edu

Abstract
Network models have been popular for modeling and representing complex relationships and dependencies between observed variables. When data come from a dynamic stochastic process, a single static network model cannot adequately capture transient dependencies, such as gene regulatory dependencies throughout a developmental cycle of an organism. Kolar et al. (2010b) proposed a method based on kernel-smoothed $\ell_1$-penalized logistic regression for estimating time-varying networks from nodal observations collected as a time series of observational data. In this paper, we establish conditions under which the proposed method consistently recovers the structure of a time-varying network. This work complements previous empirical findings by providing sound theoretical guarantees for the proposed estimation procedure. For completeness, we include numerical simulations in the paper.
Key words and phrases: High-dimensional inference, Markov random fields, semi-parametric inference, time-varying Ising model, varying coefficient models.
1 Introduction

In recent years, we have witnessed fast advancement of data-acquisition techniques in many areas, including biological domains, engineering and social sciences. As a result, new statistical and machine learning techniques are needed to help us develop a better understanding of the complexities underlying large, noisy data sets. Networks have been commonly used to abstract noisy data and provide insight into regularities and dependencies between observed variables. For example, in a biological study, nodes of the network can represent genes in one organism and edges can represent associations or regulatory dependencies among genes. In a social domain, nodes of a network can represent actors and edges can represent interactions between actors. Recent popular techniques for modeling and exploring networks are based on structure estimation in probabilistic graphical models, specifically Markov random fields (MRFs). These models encode conditional independencies between variables, which are represented as nodes of a graph. The following examples motivate the estimation of networks that change with time.

• Analysis of gene regulatory networks.
Suppose that we have a set of n microarray measurements of gene expression levels, obtained at different stages during the development of an organism or at different times during the cell cycle. Given these data, biologists would like to gain insight into dynamic relationships between different genes and into how these relations change at different stages of development. The problem is that at each time point there is only one or at most a few measurements of the gene expressions; a naive approach to estimating the gene regulatory network, which uses only the data at the time point in question to infer the network, would fail. To obtain a good estimate of the regulatory network at any time point, we need to leverage the data collected at other time points and extract some information from them.

• Analysis of the stock market.
In a finance setting, we have values of different stocks at each time point. Suppose, for simplicity, that we only measure whether the value of a particular stock is going up or down. We would like to find the underlying transient relational patterns between different stocks from these measurements and gain insight into how these patterns change over time. Again, we only have one measurement at each time point and we need to leverage information from the data obtained at nearby time points.

• Understanding social networks.
There are 100 Senators in the U.S. Senate and each can cast a vote on different bills. Suppose that we are given n voting records over some period of time. How can one infer the latent political liaisons and coalitions among different senators, and the way these relationships change with respect to time and with respect to different issues raised in bills, just from the voting records?

The aforementioned problems share the task of estimating a sequence of time-specific latent relational structures between a fixed set of entities (i.e., variables) from a time series of observations of the entities' states; the relational structures between the entities are time evolving, rather than being invariant throughout the data collection period as commonly assumed in much of the current literature on high-dimensional undirected network estimation (see, e.g., Meinshausen and Bühlmann (2006), Bresler et al. (2007), Yuan and Lin (2007), Banerjee et al. (2008), Rothman et al. (2008), Friedman et al. (2008), Ravikumar et al. (2008), Fan et al. (2009), Peng et al. (2009), Ravikumar et al. (2010), Wang et al. (2009), Guo et al. (2010a) and references therein). Typically, the available data for the problem are very scarce, with only one or at most a few measurements corresponding to any particular latent structure, while the total number of potential relations is large and exceeds the total number of observations. However, as we will show later, under the assumption that the network is sparse and changes slowly with time, it is possible to consistently estimate the network structure at any time point.

2 Time-varying Markov random fields

A popular and widely studied model for the relational structure over a fixed set of entities is the Markov random field (MRF) (Wainwright and Jordan, 2008, Getoor and Taskar, 2007). Let $G = (V, E)$ represent a graph, where $V$ denotes the set of vertices and $E$ the set of edges over the vertices. Depending on the specific application of interest, a node $u \in V$ can represent a gene, a stock, or a social actor, and an edge $(u, v) \in E$ can represent a relationship (e.g., correlation, influence, friendship) between actors $u$ and $v$. Let $X = (X_1, \ldots, X_p)'$, where $p = |V|$, be a random vector of nodal states following a probability distribution indexed by $\theta \in \Theta$. Under an MRF, the nodal states $X_u$ are assumed to be discrete, i.e., $X_u \in \mathcal{X} \equiv \{s_1, \ldots, s_k\}$, and the edge set $E \subseteq V \times V$ encodes certain conditional independence assumptions among the components of the random vector $X$; for example, the random variable $X_u$ is conditionally independent of the random variable $X_v$ given the rest of the variables if $(u, v) \notin E$. In the special case of binary nodal states, e.g., $X_u \in \mathcal{X} \equiv \{-1, 1\}$, with pairwise potentials weighted by $\theta_{uv}$ for all $(u, v) \in E$ and $\theta_{uv} = 0$ for all $(u, v) \notin E$, the joint probability of $X = x$ can be expressed by a simple exponential family model:

\[ P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{(u,v) \in E} \theta_{uv} x_u x_v \Big\}, \tag{2.1} \]

where $Z(\theta)$ is the partition function. In this paper we allow the parameter vector, and hence the graph structure, to change with time: at each time point $t$ the observation $x^t$ is drawn from the distribution $P_{\theta^t}$ indexed by the time-specific parameter $\theta^t$, with $E^t = \{(u,v) : \theta^t_{uv} \neq 0\}$ the corresponding edge set. Given a time point of interest $\tau \in [0, 1]$ and a sequence of observations $\mathcal{D}_n = \{x^t \sim P_{\theta^t} \mid t \in \mathcal{T}_n\}$, where $\mathcal{T}_n = \{1/n, 2/n, \ldots, 1\}$ is the set of rescaled observation times and $P_{\theta^t}$ is defined in Eq. (2.1), the goal is to estimate the graph structure of the Markov random field associated with the distribution $P_{\theta^\tau}$. The parameter vector $\theta^\tau$ is a $\binom{p}{2}$-dimensional vector, indexed by distinct pairs of nodes, of which an element is non-zero if and only if the corresponding edge $(u, v) \in E^\tau$. The problem of recovering the graph structure $G^\tau$ is equivalent to estimating the non-zero pattern of the vector $\theta^\tau$, i.e., the locations of its non-zero elements.
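To make the model in Eq. (2.1) concrete, the following minimal sketch evaluates the joint probability of a configuration by brute-force enumeration of the partition function. The toy graph, edge weights, and configuration are illustrative choices of ours, and the enumeration is feasible only for very small $p$.

```python
# Sketch: evaluating the pairwise binary MRF of Eq. (2.1) for a tiny graph.
# The partition function Z(theta) is computed by enumerating all 2^p
# configurations, which is only practical for small p.
import itertools
import numpy as np

def log_potential(x, theta):
    """Unnormalized log-probability: sum over u < v of theta[u, v] * x[u] * x[v]."""
    p = len(x)
    return sum(theta[u, v] * x[u] * x[v]
               for u in range(p) for v in range(u + 1, p))

def ising_prob(x, theta):
    """P_theta(x) = exp{sum theta_uv x_u x_v} / Z(theta) for x in {-1, +1}^p."""
    p = len(x)
    configs = itertools.product([-1, 1], repeat=p)
    log_z = np.logaddexp.reduce(
        [log_potential(np.array(c), theta) for c in configs])
    return np.exp(log_potential(np.array(x), theta) - log_z)

p = 4
theta = np.zeros((p, p))
theta[0, 1] = theta[1, 2] = theta[2, 3] = 0.8   # an illustrative chain graph
print(ising_prob(np.array([1, 1, -1, -1]), theta))
```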
A stronger notion of structure estimation is that of signed edge recovery, in which an edge $(u, v) \in E^\tau$ is recovered together with the sign of the parameter, $\mathrm{sign}(\theta^\tau_{uv})$. We will show that the estimation procedure can consistently recover signed edges.

3 Estimation procedure

The estimation procedure is based on the neighborhood selection technique, where the graph structure is estimated by combining the local estimates of the neighborhoods of each node. For each vertex $u \in V$, define the set of neighboring edges $S^\tau(u) := \{(u, v) : (u, v) \in E^\tau\}$ and the set of signed neighboring edges $S^\tau_\pm(u) := \{(\mathrm{sign}(\theta^\tau_{uv}), (u, v)) : (u, v) \in S^\tau(u)\}$. The set of signed neighboring edges $S^\tau_\pm(u)$ can be determined from the signs of the elements of the $(p-1)$-dimensional subvector $\theta^\tau_u := \{\theta^\tau_{uv} : v \in V \setminus u\}$ associated with vertex $u$. Under the model (2.1), the conditional distribution of $X^\tau_u$ given the other variables $X^\tau_{\setminus u} := \{X^\tau_v : v \in V \setminus u\}$ takes the form

\[ P_{\theta^\tau_u}(x^\tau_u \mid X^\tau_{\setminus u} = x^\tau_{\setminus u}) = \frac{\exp(2 x^\tau_u \langle \theta^\tau_u, x^\tau_{\setminus u} \rangle)}{\exp(2 x^\tau_u \langle \theta^\tau_u, x^\tau_{\setminus u} \rangle) + 1}, \tag{3.1} \]

where $\langle a, b \rangle = a'b$ denotes the dot product. Observe that the model given in (3.1) can be viewed as expressing $X^\tau_u$ as the response variable in a generalized varying-coefficient model with $X^\tau_{\setminus u}$ playing the role of covariates. For simplicity, we write $P_{\theta^\tau_u}(x^\tau_u \mid X^\tau_{\setminus u} = x^\tau_{\setminus u})$ as $P_{\theta^\tau_u}(x^\tau_u \mid x^\tau_{\setminus u})$. Under the model given in Eq. (3.1), the log-likelihood for one data point $t \in \mathcal{T}_n$ can be written in the following form:

\[ \gamma(\theta_u; x^t) = \log P_{\theta_u}(x^t_u \mid x^t_{\setminus u}) = x^t_u \langle \theta_u, x^t_{\setminus u} \rangle - \log\big( \exp(\langle \theta_u, x^t_{\setminus u} \rangle) + \exp(-\langle \theta_u, x^t_{\setminus u} \rangle) \big). \tag{3.2} \]

For an arbitrary point of interest $\tau \in [0, 1]$, the estimate $\hat\theta^\tau_u$ of the sign pattern of the vector $\theta^\tau_u$ is defined as the solution to the following convex program:

\[ \hat\theta^\tau_u = \mathop{\mathrm{argmin}}_{\theta_u \in \mathbb{R}^{p-1}} \; \big\{ \ell(\theta_u; \mathcal{D}_n) + \lambda_n \|\theta_u\|_1 \big\}, \tag{3.3} \]

where $\ell(\theta_u; \mathcal{D}_n) = -\sum_{t \in \mathcal{T}_n} w^\tau_t \gamma(\theta_u; x^t)$ is the weighted logloss, with weights defined as

\[ w^\tau_t = \frac{K_h(t - \tau)}{\sum_{t' \in \mathcal{T}_n} K_h(t' - \tau)}, \]

and $K_h(\cdot) = K(\cdot / h)$ is a symmetric nonnegative kernel with bandwidth $h$. The regularization parameter $\lambda_n \geq 0$ controls the sparsity of the solution. The minimum over $\theta_u$ is always achieved, as the problem can be cast as a constrained optimization problem over the ball $\|\theta_u\|_1 \leq C(\lambda_n)$, and the claim follows from the Weierstrass theorem.

Let $\hat\theta^\tau_u$ be a minimizer of (3.3). The convex program (3.3) does not necessarily have a unique optimum but, as we will prove shortly, in the regime of interest any two solutions have non-zero elements in the same positions. Based on the vector $\hat\theta^\tau_u$, we have the following estimate of the signed neighborhood:

\[ \hat S^\tau_\pm(u) := \big\{ (\mathrm{sign}(\hat\theta^\tau_{uv}), (u, v)) : v \in V \setminus u, \; \hat\theta^\tau_{uv} \neq 0 \big\}. \tag{3.4} \]

The structure of the graph $G^\tau$ is consistently estimated if every signed neighborhood is recovered, i.e., $\hat S^\tau_\pm(u) = S^\tau_\pm(u)$ for all $u \in V$. A summary of the algorithm is given in Algorithm 1.

The convex program (3.3) can be solved using any general optimization solver. One particularly fast algorithm for this type of problem, based on the coordinate-wise descent method, is described in Friedman et al. (2010) and implemented in the R package glmnet. Note that the algorithm provides only an estimate of the graph structure at the time point $\tau$; in order to gain insight into the dynamics of graph changes, one needs to estimate the graph structure at multiple time points.
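As a minimal sketch of the estimator in Eq. (3.3), the following code solves the kernel-weighted $\ell_1$-penalized logistic regression for a single node using scikit-learn, as a stand-in for the glmnet solver mentioned above. The bandwidth, penalty level, and the mapping of $\lambda_n$ to scikit-learn's C parameter are illustrative assumptions. Note that scikit-learn fits $\sigma(\langle \beta, x\rangle)$ while Eq. (3.1) uses $\sigma(2\langle \theta, x\rangle)$, so $\beta = 2\theta$ and the recovered sign pattern is unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def box_kernel(z):
    """Box kernel supported on [-1, 1], cf. assumption A4 below."""
    return 0.5 * (np.abs(z) <= 1.0)

def estimate_signed_neighborhood(X, u, tau, h, lam):
    """Sketch of Eq. (3.3): weighted l1-logistic regression for node u.

    X is an n-by-p array with entries in {-1, +1}; row t holds the
    observation at the rescaled time point (t + 1) / n.
    """
    n, p = X.shape
    times = np.arange(1, n + 1) / n
    w = box_kernel((times - tau) / h)
    w = w / w.sum()                          # weights w_t^tau from Section 3
    y = (X[:, u] > 0).astype(int)            # recode {-1, +1} -> {0, 1}
    Z = np.delete(X, u, axis=1)              # covariates x_{\u}
    # liblinear minimizes ||beta||_1 + C * sum_t w_t * logloss_t, so we
    # (heuristically) set C = 1 / lam to match Eq. (3.3) up to scaling.
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             fit_intercept=False, C=1.0 / lam)
    clf.fit(Z, y, sample_weight=w)
    theta_hat = clf.coef_.ravel() / 2.0      # beta = 2 * theta
    return np.sign(theta_hat)                # signed neighborhood estimate
```

Running this sketch for every node $u$ and combining the signed neighborhoods yields the graph estimate described in Algorithm 1 below.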
Typically, in a real application, one is interested in estimating $G^\tau$ for all $\tau \in \mathcal{T}_n$.

Algorithm 1 Graph structure estimation
Input: Dataset $\mathcal{D}_n$, time point of interest $\tau \in [0, 1]$, penalty parameter $\lambda_n$, bandwidth parameter $h$
Output: Estimate of the graph structure $\hat G^\tau$
for all $u \in V$ do
  Estimate $\hat\theta^\tau_u$ by solving the convex program (3.3)
  Estimate the set of signed neighboring edges $\hat S^\tau_\pm(u)$ using (3.4)
end for
Combine the sets $\{\hat S^\tau_\pm(u)\}_{u \in V}$ to obtain $\hat G^\tau$.

4 Main theoretical result

In this section, we provide conditions under which Algorithm 1 consistently recovers the graph structure. In particular, we show that under suitable conditions $P[\forall u \; \hat S^\tau_\pm(u) = S^\tau_\pm(u)] \to 1$ as $n \to \infty$, a property known as sparsistency. We are mainly interested in the high-dimensional case, where the dimension $p = p_n$ is comparable to or even larger than the sample size $n$. It is of great interest to understand the performance of the estimator under this assumption, since in many real-world scenarios the dimensionality of the data is large. Our analysis is asymptotic and we allow the model dimension $p = p_n$ to grow at a certain rate as the sample size grows. This essentially allows us to consider more "complicated" models as we observe more data points. Another quantity that describes the complexity of the model is the maximum node degree $s = s_n$, which is also considered a function of the sample size. Under the assumption that the true graph structure is sparse, we will require that the maximum node degree is small, $s \ll n$. The main result describes the scaling of the triple $(n, p_n, s_n)$ under which the estimation procedure given in the previous section estimates the graph structure consistently.

We will need certain regularity conditions to hold in order to prove the sparsistency result. These conditions are expressed in terms of the Hessian of the log-likelihood function as evaluated at the true model parameter, i.e., the Fisher information matrix. The Fisher information matrix $Q^\tau_u \in \mathbb{R}^{(p-1) \times (p-1)}$ is defined for each node $u \in V$ as

\[ Q^\tau_u := \mathbb{E}\big[ \nabla^2 \log P_{\theta^\tau_u}[X_u \mid X_{\setminus u}] \big] = \mathbb{E}\big[ \eta(X; \theta^\tau_u) X_{\setminus u} X_{\setminus u}' \big], \]

where

\[ \eta(x; \theta_u) := \frac{4 \exp(2 x_u \langle \theta_u, x_{\setminus u} \rangle)}{\big( \exp(2 x_u \langle \theta_u, x_{\setminus u} \rangle) + 1 \big)^2} \]

is the variance function and $\nabla^2$ denotes the operator that computes the matrix of second derivatives. We write $Q^\tau := Q^\tau_u$ and assume that the following conditions hold for each node $u \in V$.

A1: Dependency condition. There exist constants $C_{\min}, D_{\min}, D_{\max} > 0$ such that

\[ \Lambda_{\min}(Q^\tau_{SS}) \geq C_{\min}, \qquad \Lambda_{\min}(\Sigma^\tau) \geq D_{\min}, \qquad \Lambda_{\max}(\Sigma^\tau) \leq D_{\max}, \]

where $\Sigma^\tau = \mathbb{E}_{\theta^\tau}[X^\tau X^{\tau\prime}]$. Here $\Lambda_{\min}(\cdot)$ and $\Lambda_{\max}(\cdot)$ denote the minimum and maximum eigenvalues of a matrix.

A2: Incoherence condition. There exists an incoherence parameter $\alpha \in (0, 1]$ such that

\[ ||| Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha, \]

where, for a matrix $A \in \mathbb{R}^{a \times b}$, the $\ell_\infty$ matrix norm is defined as $|||A|||_\infty := \max_{i \in \{1, \ldots, a\}} \sum_{j=1}^b |a_{ij}|$. Here the set $S^c$ denotes the complement of the set $S$ in $\{1, \ldots, p\}$, that is, $S^c = \{1, \ldots, p\} \setminus S$.

With some abuse of notation, when defining assumptions A1 and A2 we use the index set $S := S^\tau(u)$ to denote the nodes adjacent to node $u$ at time $\tau$. For example, if $s = |S|$, then $Q^\tau_{SS} \in \mathbb{R}^{s \times s}$ denotes the sub-matrix of $Q^\tau$ indexed by $S$. Condition A1 assures that the relevant features are not too correlated, while condition A2 assures that the irrelevant features do not exert too strong an effect on the relevant features.
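The sketch below checks conditions A1 and A2 numerically for a given matrix $Q$ and index set $S$. The matrix used here is a made-up positive definite stand-in, not a Fisher information matrix derived from data.

```python
# Sketch: numeric check of the dependency (A1) and incoherence (A2)
# conditions for an illustrative matrix Q and support set S.
import numpy as np

def check_A1_A2(Q, S):
    p = Q.shape[0]
    Sc = np.setdiff1d(np.arange(p), S)
    Q_SS = Q[np.ix_(S, S)]
    Q_ScS = Q[np.ix_(Sc, S)]
    c_min = np.linalg.eigvalsh(Q_SS).min()       # A1: want c_min > 0
    M = Q_ScS @ np.linalg.inv(Q_SS)
    incoherence = np.abs(M).sum(axis=1).max()    # A2: want <= 1 - alpha
    return c_min, incoherence

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
Q = A @ A.T / 6 + np.eye(6)                      # a generic PD stand-in for Q
print(check_A1_A2(Q, S=np.array([0, 1, 2])))
```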
Similar conditions are common in the literature on high-dimensional estimation (see, e.g., Meinshausen and Bühlmann (2006), Ravikumar et al. (2010), Peng et al. (2009), Guo et al. (2010a) and references therein). The difference here is that we assume the conditions hold at the time point of interest $\tau$ at which we want to recover the graph structure.

Next, we assume that the distribution $P_{\theta^t}$ changes smoothly over time, which we express in the following form, for every node $u \in V$.

A3: Smoothness conditions. Let $\Sigma^t = [\sigma^t_{uv}]$. There exists a constant $M > 0$ such that

\[ \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial}{\partial t} \sigma^t_{uv} \Big| < M, \qquad \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial^2}{\partial t^2} \sigma^t_{uv} \Big| < M, \]
\[ \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial}{\partial t} \theta^t_{uv} \Big| < M, \qquad \max_{u,v \in V \times V} \sup_{t \in [0,1]} \Big| \frac{\partial^2}{\partial t^2} \theta^t_{uv} \Big| < M. \]

Condition A3 captures our notion of a distribution that changes smoothly over time. If we consider the elements of the covariance matrix and the elements of the parameter vector as functions of time, then these functions have bounded first and second derivatives. From these assumptions, it is not too hard to see that the elements of the Fisher information matrix are also smooth functions of time.

A4: Kernel. The kernel $K : \mathbb{R} \to \mathbb{R}$ is a symmetric function, supported on $[-1, 1]$, and there exists a constant $M_K \geq 1$ which upper bounds the quantities $\max_{z \in \mathbb{R}} |K(z)|$ and $\max_{z \in \mathbb{R}} K(z)^2$.

Condition A4 gives some regularity conditions on the kernel used to define the weights. For example, the assumption is satisfied by the box kernel $K(z) = \frac{1}{2}\,1\{z \in [-1, 1]\}$.

With the assumptions made above, we are ready to state the theorem that characterizes the consistency of the method given in Section 3 for recovering the unknown time-varying graph structure. An important quantity appearing in the statement is the minimum absolute value of the non-zero entries of the parameter vector,

\[ \theta_{\min} = \min_{(u,v) \in E^\tau} |\theta^\tau_{uv}|. \]

Intuitively, the success of the recovery should depend on how hard it is to distinguish the true non-zero parameters from noise.

Theorem 1. Assume that the dependency condition A1 holds with $C_{\min}$, $D_{\min}$ and $D_{\max}$, that for each node $u \in V$ the Fisher information matrix $Q^\tau$ satisfies the incoherence condition A2 with parameter $\alpha$, that the smoothness assumption A3 holds with parameter $M$, and that the kernel function used in Algorithm 1 satisfies assumption A4 with parameter $M_K$. Let the regularization parameter satisfy

\[ \lambda_n \geq C \sqrt{\frac{\log p}{n^{2/3}}} \]

for a constant $C > 0$ independent of $(n, p, s)$. Furthermore, assume that the following conditions hold:

1. $h = O(n^{-1/3})$;
2. $s = o(n^{1/3})$ and $s^3 \log p / n^{2/3} = o(1)$;
3. $\theta_{\min} = \Omega\big( \sqrt{s}\, \sqrt{\log p / n^{2/3}} \big)$.

Then, for a fixed $\tau \in [0, 1]$, the estimated graph $\hat G^\tau(\lambda_n)$ obtained through neighborhood selection satisfies

\[ P\big[ \hat G^\tau(\lambda_n) \neq G^\tau \big] = O\Big( \exp\Big( -C' \frac{n^{2/3}}{s^3} + C'' \log p \Big) \Big) \to 0, \]

for some constants $C', C''$ independent of $(n, p, s)$.

This theorem guarantees that the procedure in Algorithm 1 asymptotically recovers the sequence of graphs underlying all the nodal-state measurements in a time series, and the snapshot of the evolving graph at any time point during the measurement intervals, under an appropriate regularization parameter $\lambda_n$, as long as the ambient dimensionality $p$ and the maximum node degree $s$ are not too large and the minimum $\theta$ values do not tend to zero too fast.

Remarks:

1. The bandwidth parameter $h$ is chosen so that it balances the variance and the squared bias of the estimation of the elements of the Fisher information matrix.

2.
Theorem 1 states that the tuning parameter can be set as $\lambda_n \geq C n^{-1/3} \sqrt{\log p}$. In practice, one can use the Bayesian information criterion to select the tuning parameter $\lambda_n$ in a data-dependent way, as explained in Section 2.4 of Kolar et al. (2010b). We conjecture that this approach would lead to asymptotically consistent model selection; however, this claim remains to be proven.

3. Condition 2 requires that the size of the neighborhood of each node remains smaller than the sample size. However, the model's ambient dimension $p$ is allowed to grow exponentially in $n$.

4. Condition 3 is crucial for being able to distinguish the true elements of the neighborhood of a node. We require that the minimum element of the parameter vector, in absolute value, stays bounded away from zero.

5. The rate of convergence is dictated by the rate of convergence of the sample Fisher information matrix to the true Fisher information matrix, as shown in Lemma 5. Using a local linear smoother, instead of the kernel smoother, to estimate the coefficients in the model (3.1), one could obtain a faster rate of convergence.

6. Theorem 1 provides sufficient conditions for reliable estimation of the sequence of graphs when the sample size is large enough. In order to improve the small-sample properties of the procedure, one could adapt the approach of Guo et al. (2010b) to the time-varying setting, to incorporate sharing between nodes. Guo et al. (2010b) estimate all local neighborhoods simultaneously, as opposed to estimating each neighborhood individually, effectively reducing the number of parameters that need to be inferred from the data. This is especially beneficial in networks with prominent hubs and in scale-free networks.

In order to obtain insight into the network dynamics, one needs to estimate the graph structure at multiple time points. A common choice is to estimate the graph structure for every $\tau \in \mathcal{T}_n$ and obtain a sequence of graph structures $\{\hat G^\tau\}_{\tau \in \mathcal{T}_n}$. We have the following immediate consequence of Theorem 1.

Corollary 2. Under the assumptions of Theorem 1, we have

\[ P\big[ \forall \tau \in \mathcal{T}_n : \hat G^\tau(\lambda_n) = G^\tau \big] \to 1 \quad \text{as } n \to \infty. \tag{4.1} \]

In the sequel, we set out to prove Theorem 1. First, we show that the minimizer $\hat\theta^\tau_u$ of (3.3) is unique under the assumptions given in Theorem 1. Next, we show that with high probability the estimator $\hat\theta^\tau_u$ recovers the true neighborhood of a node $u$. Repeating the procedure for all nodes $u \in V$, we obtain the result stated in Theorem 1. The proof uses the results that the empirical estimates of the Fisher information matrix and the covariance matrix are close, elementwise, to their population versions. These results are given in Appendix A.

5 Proof of the main result

In this section we give the proof of Theorem 1. The proof proceeds through a sequence of technical lemmas. We build on the ideas developed in Ravikumar et al. (2010). Note that in what follows we use $C$, $C'$ and $C''$ to denote positive constants independent of $(n, p, s)$, whose values may change from line to line.

The main idea behind the proof is to characterize the minimum obtained in Eq. (3.3) and show that the correct neighborhood of one node at an arbitrary time point can be recovered with high probability. Then, using the union bound over the nodes of the graph, we conclude that the whole graph is estimated sparsistently at the time points of interest.

We first address the problem of uniqueness of the solution to (3.3). Note that because the objective in Eq.
(3.3) is not strictly convex, it is necessary to show that the non-zero pattern of the parameter vector is unique, since otherwise the problem of sparsistent graph estimation would be meaningless. Under the conditions of Theorem 1, the solution is unique. This is shown in Lemma 3 and Lemma 4. Lemma 3 gives conditions under which two solutions to the problem in Eq. (3.3) have the same pattern of non-zero elements. Lemma 4 then shows that, with probability tending to 1, the solution is unique. Once we have shown that the solution to the problem in Eq. (3.3) is unique, we proceed to show that it recovers the correct pattern of non-zero elements. To show that, we require the sample version of the Fisher information matrix to satisfy certain conditions. Under the assumptions of Theorem 1, Lemma 5 shows that the sample version of the Fisher information matrix satisfies the same conditions as the true Fisher information matrix, although with worse constants. Next, we identify two events, related to the Karush-Kuhn-Tucker optimality conditions, on which the vector $\hat\theta_u$ recovers the correct neighborhood of the node $u$. This is shown in Proposition 6. Finally, Proposition 7 shows that the event on which the neighborhood of the node $u$ is correctly identified occurs with probability tending to 1 under the assumptions of Theorem 1. Table 1 provides a summary of the different parts of the proof.

Table 1: Outline of the proof strategy.

Lemma 3 and Lemma 4: Establish the uniqueness of the solution to the optimization problem in Eq. (3.3).
Lemma 5: Shows that the sample version of the Fisher information matrix satisfies conditions similar to those of the population version.
Proposition 6: Shows that on an event related to the KKT conditions, the vector $\hat\theta_u$ recovers the correct neighborhood of the node $u$.
Proposition 7: Shows that the event in Proposition 6 holds with probability tending to 1.

Let us denote the set of all solutions to (3.3) by $\Theta(\lambda_n)$. We define the objective function in Eq. (3.3) by

\[ F(\theta_u) := -\sum_{t \in \mathcal{T}_n} w^\tau_t \gamma(\theta_u; x^t) + \lambda_n \|\theta_u\|_1 \tag{5.1} \]

and we say that $\theta_u \in \mathbb{R}^{p-1}$ satisfies the system (S) when, for all $v = 1, \ldots, p-1$,

\[ \begin{cases} \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v = \lambda_n\, \mathrm{sign}(\theta_{uv}) & \text{if } \theta_{uv} \neq 0, \\ \big| \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v \big| \leq \lambda_n & \text{if } \theta_{uv} = 0, \end{cases} \tag{5.2} \]

where

\[ \nabla \gamma(\theta_u; x^t) = x^t_{\setminus u} \big\{ x^t_u + 1 - 2 P_{\theta_u}[x^t_u = 1 \mid x^t_{\setminus u}] \big\} \tag{5.3} \]

is the score function. Eq. (5.2) is obtained by taking the sub-gradient of $F(\theta)$ and equating it to zero. From the Karush-Kuhn-Tucker (KKT) conditions it follows that $\theta_u \in \mathbb{R}^{p-1}$ belongs to $\Theta(\lambda_n)$ if and only if $\theta_u$ satisfies the system (S). The following lemma shows that any two solutions have the same non-zero pattern.

Lemma 3. Consider a node $u \in V$. If $\bar\theta_u \in \mathbb{R}^{p-1}$ and $\tilde\theta_u \in \mathbb{R}^{p-1}$ both belong to $\Theta(\lambda_n)$, then $\langle x^t_{\setminus u}, \bar\theta_u \rangle = \langle x^t_{\setminus u}, \tilde\theta_u \rangle$ for all $t \in \mathcal{T}_n$. Furthermore, the solutions $\bar\theta_u$ and $\tilde\theta_u$ have non-zero elements in the same positions.

We now use the result of Lemma 3 to show that with high probability the minimizer in (3.3) is unique. We consider the following event:

\[ \Omega_0 = \big\{ D_{\min} - \delta \leq y' \hat\Sigma^\tau_{SS} y \leq D_{\max} + \delta \; : \; y \in \mathbb{R}^s, \|y\|_2 = 1 \big\}. \]

Lemma 4. Consider a node $u \in V$. Assume that the conditions of Lemma 10 are satisfied and that the dependency condition A1 holds. There are constants $C, C', C'' > 0$, depending on $M$ and $M_K$ only, such that

\[ P[\Omega_0] \geq 1 - 2\exp\big( -C n h (\delta s^{-1} - C' h)^2 + C'' \log s \big). \]

Moreover, on the event $\Omega_0$, the minimizer of (3.3) is unique.
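For intuition, the following sketch tests whether a candidate vector satisfies the system (S) of Eq. (5.2), with the score computed as in Eq. (5.3). The tolerance and all inputs are placeholders.

```python
# Sketch: checking the KKT conditions (5.2) for a candidate theta_u.
import numpy as np

def weighted_score(theta_u, X, u, w):
    """sum_t w_t * grad gamma(theta_u; x^t), with grad gamma from Eq. (5.3)."""
    Z = np.delete(X, u, axis=1)                  # covariates x_{\u}^t
    a = Z @ theta_u
    p1 = np.exp(2 * a) / (1 + np.exp(2 * a))     # P_theta[x_u^t = 1 | x_{\u}^t]
    resid = X[:, u] + 1 - 2 * p1
    return Z.T @ (w * resid)

def satisfies_system_S(theta_u, X, u, w, lam, tol=1e-6):
    """Equality on the support, magnitude at most lambda off the support."""
    g = weighted_score(theta_u, X, u, w)
    on = theta_u != 0
    ok_support = np.allclose(g[on], lam * np.sign(theta_u[on]), atol=tol)
    ok_off = np.all(np.abs(g[~on]) <= lam + tol)
    return ok_support and ok_off
```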
We have shown that the estimate $\hat\theta^\tau_u$ is unique on the event $\Omega_0$, which under the conditions of Theorem 1 occurs with probability converging to 1 exponentially fast. To finish the proof of Theorem 1, we need to show that the estimate $\hat\theta^\tau_u$ has the same non-zero pattern as the true parameter vector $\theta^\tau_u$. To show this, we consider a few "good" events, which happen with high probability and on which the estimate $\hat\theta^\tau_u$ behaves well. Define

\[ \Omega_1 := \big\{ C_{\min} - \delta \leq y' \hat Q^\tau_{SS} y \; : \; y \in \mathbb{R}^s, \|y\|_2 = 1 \big\} \]

and

\[ \Omega_2 := \big\{ ||| \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha/2 \big\}. \]

Lemma 5. Assume that the conditions of Lemma 10 are satisfied. Assume also that the dependency condition A1 holds and that the incoherence condition A2 holds with incoherence parameter $\alpha$. There are constants $C, C', C'' > 0$, depending on $M$, $M_K$ and $\alpha$ only, such that

\[ P[\Omega_1] \geq 1 - 2\exp\big( -C n h \delta^2 s^{-2} + C' \log s \big) \]

and

\[ P[\Omega_2] \geq 1 - \exp\big( -C n h s^{-3} + C'' \log p \big). \]

Lemma 5 guarantees that the sample Fisher information matrix satisfies the "good" properties with high probability, under the appropriate scaling of the quantities $n$, $p$, $s$ and $h$. We are now ready to analyze the optimum of the convex program (3.3). To that end, we apply the mean value theorem coordinate-wise to the gradient of the weighted logloss $\sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta_u; x^t)$ and obtain

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla \gamma(\hat\theta^\tau_u; x^t) - \nabla \gamma(\theta^\tau_u; x^t) \big) = \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla^2 \gamma(\theta^\tau_u; x^t) \Big] (\hat\theta^\tau_u - \theta^\tau_u) + \Delta^\tau, \tag{5.4} \]

where $\Delta^\tau \in \mathbb{R}^{p-1}$ is the remainder term of the form

\[ \Delta^\tau_v = \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla^2 \gamma(\bar\theta^{(v)}_u; x^t) - \nabla^2 \gamma(\theta^\tau_u; x^t) \big) \Big]'_v (\hat\theta^\tau_u - \theta^\tau_u), \tag{5.5} \]

$\bar\theta^{(v)}_u$ is a point on the line between $\theta^\tau_u$ and $\hat\theta^\tau_u$, and $[\cdot]'_v$ denotes the $v$-th row of the matrix. Recall that $\hat Q^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla^2 \gamma(\theta^\tau_u; x^t)$. Using the expansion (5.4), we write the KKT conditions given in Eq. (5.2) in the following form: for all $v = 1, \ldots, p-1$,

\[ \begin{cases} \hat Q^\tau_v (\theta_u - \theta^\tau_u) + \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta^\tau_u; x^t))_v + \Delta^\tau_v = \lambda_n\, \mathrm{sign}(\theta_{uv}) & \text{if } \theta_{uv} \neq 0, \\ \big| \hat Q^\tau_v (\theta_u - \theta^\tau_u) + \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta^\tau_u; x^t))_v + \Delta^\tau_v \big| \leq \lambda_n & \text{if } \theta_{uv} = 0. \end{cases} \tag{5.6} \]

We consider the following events, with $\bar\Omega = \Omega_0 \cap \Omega_1 \cap \Omega_2$,

\[ \Omega_3 = \big\{ \forall v \in S : \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v - ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < |\theta^\tau_{uv}| \big\} \]

and

\[ \Omega_4 = \big\{ \forall v \in S^c : \big| (W^\tau_{S^c} - \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < \tfrac{\alpha}{2} \lambda_n \big\}, \]

where

\[ W^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) + \Delta^\tau. \]

We will work on the event $\Omega_1$, on which the minimum eigenvalue of $\hat Q^\tau_{SS}$ is strictly positive, so that $\hat Q^\tau_{SS}$ is regular and the events $\Omega_1 \cap \Omega_3$ and $\Omega_1 \cap \Omega_4$ are well defined.

Proposition 6. Assume that the conditions of Lemma 5 are satisfied. The event

\[ \big\{ \forall \hat\theta^\tau_u \in \mathbb{R}^{p-1} \text{ solution of } (S),\; \mathrm{sign}(\hat\theta^\tau_u) = \mathrm{sign}(\theta^\tau_u) \big\} \cap \bar\Omega \]

contains the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$.

Proof. We consider the following linear functional:

\[ G : \begin{cases} \mathbb{R}^s \to \mathbb{R}^s \\ \theta \mapsto \theta - \theta^\tau_S + (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S). \end{cases} \]

For any two vectors $y = (y_1, \ldots, y_s)' \in \mathbb{R}^s$ and $r = (r_1, \ldots, r_s)' \in \mathbb{R}^s_+$, define the following set centered at $y$:

\[ \mathcal{B}(y, r) = \prod_{i=1}^s (y_i - r_i,\; y_i + r_i). \]

Now, we have

\[ G\big( \mathcal{B}(\theta^\tau_S, |\theta^\tau_S|) \big) = \mathcal{B}\big( (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S), \; |\theta^\tau_S| \big). \]

On the event $\Omega_3 \cap \bar\Omega$,

\[ 0 \in \mathcal{B}\big( (\hat Q^\tau_{SS})^{-1} W^\tau_S - \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S), \; |\theta^\tau_S| \big), \]

which implies that there exists a vector $\bar\theta^\tau_S \in \mathcal{B}(\theta^\tau_S, |\theta^\tau_S|)$ such that $G(\bar\theta^\tau_S) = 0$. For $\bar\theta^\tau_S$ it holds that

\[ \bar\theta^\tau_S = \theta^\tau_S + \lambda_n (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S) - (\hat Q^\tau_{SS})^{-1} W^\tau_S \]

and $|\bar\theta^\tau_S - \theta^\tau_S| < |\theta^\tau_S|$.
Thus, the vector $\bar\theta^\tau_S$ satisfies

\[ \mathrm{sign}(\bar\theta^\tau_S) = \mathrm{sign}(\theta^\tau_S) \]

and

\[ \hat Q^\tau_{SS} (\bar\theta^\tau_S - \theta^\tau_S) + W^\tau_S = \lambda_n\, \mathrm{sign}(\bar\theta^\tau_S). \tag{5.7} \]

Next, we consider the vector $\bar\theta^\tau = (\bar\theta^{\tau\prime}_S, \bar\theta^{\tau\prime}_{S^c})'$, where $\bar\theta^\tau_{S^c}$ is the null vector of $\mathbb{R}^{p-1-s}$. On the event $\Omega_2$, from Lemma 5 we know that $||| \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} |||_\infty \leq 1 - \alpha/2$. Now, on the event $\Omega_4 \cap \bar\Omega$ it holds that

\[ \| \hat Q^\tau_{S^c S} (\bar\theta^\tau_S - \theta^\tau_S) + W^\tau_{S^c} \|_\infty = \| -\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S + W^\tau_{S^c} + \lambda_n \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\bar\theta^\tau_S) \|_\infty < \lambda_n. \tag{5.8} \]

Note that for $\bar\theta^\tau$, equations (5.7) and (5.8) are equivalent to saying that $\bar\theta^\tau$ satisfies conditions (5.6), or (5.2), i.e., that $\bar\theta^\tau$ satisfies the KKT conditions. Since $\mathrm{sign}(\bar\theta^\tau_S) = \mathrm{sign}(\theta^\tau_S)$, we have $\mathrm{sign}(\bar\theta^\tau) = \mathrm{sign}(\theta^\tau_u)$. Furthermore, because of the uniqueness of the solution to (3.3) on the event $\Omega_0$, we conclude that $\hat\theta^\tau_u = \bar\theta^\tau$.

Proposition 6 implies Theorem 1 if we manage to show that the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$ occurs with high probability under the assumptions stated in Theorem 1. Proposition 7 characterizes the probability of that event, which concludes the proof of Theorem 1.

Proposition 7. Assume that the conditions of Theorem 1 are satisfied. Then there are constants $C, C' > 0$, depending on $M$, $M_K$, $D_{\max}$, $C_{\min}$ and $\alpha$ only, such that the following holds:

\[ P[\Omega_3 \cap \Omega_4 \cap \bar\Omega] \geq 1 - 2\exp\big( -C n h (\lambda_n - C' s h)^2 + \log p \big). \tag{5.9} \]

Proof. We start the proof of the proposition with a technical lemma that characterizes the distance between the vectors $\hat\theta^\tau_u = \bar\theta^\tau$ and $\theta^\tau_u$ under the assumptions of Theorem 1, where $\bar\theta^\tau$ is the vector constructed in the proof of Proposition 6. The following lemma gives a bound on the distance between the vectors $\hat\theta^\tau_S$ and $\theta^\tau_S$, which we use in the proof of the proposition. The proof of the lemma is given in the Appendix.

Lemma 8. Assume that the conditions of Theorem 1 are satisfied. There are constants $C, C' > 0$, depending on $M$, $M_K$, $D_{\max}$, $C_{\min}$ and $\alpha$ only, such that

\[ \| \hat\theta^\tau_S - \theta^\tau_S \|_2 \leq C \sqrt{\frac{s \log p}{n^{2/3}}} \tag{5.10} \]

with probability at least $1 - \exp(-C' \log p)$.

Using Lemma 8, we can prove Proposition 7. We start by studying the probability of the event $\Omega_4$. We have

\[ \Omega_4^c \subset \bigcup_{v \in S^c} \Big\{ \big| W^\tau_v + (\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \geq \tfrac{\alpha}{2} \lambda_n \Big\}. \]

Recall that $W^\tau = \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) + \Delta^\tau$. Let us define the event

\[ \Omega_5 = \Big\{ \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| < \frac{\alpha \lambda_n}{2(4-\alpha)} \Big\}, \]

where $e_v \in \mathbb{R}^{p-1}$ is the unit vector with one at position $v$ and zeros elsewhere. From the proof of Lemma 8, available in the appendix, we have that $P[\Omega_5] \geq 1 - 2\exp(-C \log p)$, and on that event the bound given in Eq. (5.10) holds.

On the event $\Omega_5$, we bound the remainder term $\Delta^\tau$. Let $g : \mathbb{R} \to \mathbb{R}$ be defined as $g(z) = \frac{4\exp(2z)}{(1+\exp(2z))^2}$. Then $\eta(x; \theta_u) = g(x_u \langle \theta_u, x_{\setminus u} \rangle)$. For $v \in \{1, \ldots, p-1\}$, using the mean value theorem it follows that

\[ \begin{aligned} \Delta^\tau_v &= \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \nabla^2 \gamma(\bar\theta^{(v)}_u; x^t) - \nabla^2 \gamma(\theta^\tau_u; x^t) \big) \Big]'_v (\hat\theta^\tau_u - \theta^\tau_u) \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t \big[ \eta(x^t; \bar\theta^{(v)}_u) - \eta(x^t; \theta^\tau_u) \big] \big[ x^t_{\setminus u} x^{t\prime}_{\setminus u} \big]'_v (\hat\theta^\tau_u - \theta^\tau_u) \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle \big) \big[ x^t_u x^t_{\setminus u} \big]' \big[ \bar\theta^{(v)}_u - \theta^\tau_u \big] \big[ x^t_v x^{t\prime}_{\setminus u} \big] \big[ \hat\theta^\tau_u - \theta^\tau_u \big] \\ &= \sum_{t \in \mathcal{T}_n} w^\tau_t \big\{ g'\big( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle \big) x^t_u x^t_v \big\} \big\{ [\bar\theta^{(v)}_u - \theta^\tau_u]' x^t_{\setminus u} x^{t\prime}_{\setminus u} [\hat\theta^\tau_u - \theta^\tau_u] \big\}, \end{aligned} \]

where $\bar{\bar\theta}^{(v)}_u$ is another point on the line joining $\hat\theta^\tau_u$ and $\theta^\tau_u$.
A simple calculation shows that $| g'( x^t_u \langle \bar{\bar\theta}^{(v)}_u, x^t_{\setminus u} \rangle ) x^t_u x^t_v | \leq 1$ for all $t \in \mathcal{T}_n$, so we have

\[ \begin{aligned} |\Delta^\tau_v| &\leq \big[ \bar\theta^{(v)}_u - \theta^\tau_u \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big\} \big[ \hat\theta^\tau_u - \theta^\tau_u \big] \\ &\leq \big[ \hat\theta^\tau_u - \theta^\tau_u \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big\} \big[ \hat\theta^\tau_u - \theta^\tau_u \big] = \big[ \hat\theta^\tau_S - \theta^\tau_S \big]' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_S x^{t\prime}_S \Big\} \big[ \hat\theta^\tau_S - \theta^\tau_S \big] \leq D_{\max} \| \hat\theta^\tau_S - \theta^\tau_S \|_2^2. \end{aligned} \tag{5.11} \]

Combining equations (5.11) and (5.10), we have that on the event $\Omega_5$

\[ \max_{1 \leq v \leq p-1} |\Delta^\tau_v| \leq C \lambda_n^2 s < \frac{\alpha \lambda_n}{2(4-\alpha)}, \]

where $C$ is a constant depending on $D_{\max}$ and $C_{\min}$ only. On the event $\Omega_2 \cap \Omega_5$, we then have

\[ \big| W^\tau_v + (\hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| < \frac{\alpha \lambda_n}{4-\alpha} + \Big( 1 - \frac{\alpha}{2} \Big) \frac{\alpha \lambda_n}{4-\alpha} \leq \frac{\alpha \lambda_n}{2}, \]

so that $P[\Omega_4] \geq 1 - 2\exp(-C \log p)$ for some constant $C$ depending on $M$, $M_K$, $C_{\min}$, $D_{\max}$ and $\alpha$ only.

Next, we study the probability of the event $\Omega_3$. We have

\[ \Omega_3^c \subset \bigcup_{v \in S} \Big\{ \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v + ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \geq |\theta^\tau_{uv}| \Big\}. \tag{5.12} \]

Again, we consider the event $\Omega_5$. On the event $\Omega_1 \cap \Omega_5$ we have that

\[ \big| \lambda_n ((\hat Q^\tau_{SS})^{-1} \mathrm{sign}(\theta^\tau_S))_v + ((\hat Q^\tau_{SS})^{-1} W^\tau_S)_v \big| \leq \frac{\lambda_n \sqrt{s}}{C_{\min}} + \frac{\lambda_n}{C_{\min}} \leq C \lambda_n \sqrt{s}, \tag{5.13} \]

for some constant $C$. When $\theta_{\min} > C \lambda_n \sqrt{s}$, we have that $P[\Omega_3] \geq 1 - 2\exp(-C' \log p)$ for some constant $C'$ that depends on $M$, $M_K$, $C_{\min}$, $D_{\max}$ and $\alpha$ only.

In summary, under the assumptions of Theorem 1, the probability of the event $\Omega_3 \cap \Omega_4 \cap \bar\Omega$ converges to one exponentially fast. On this event, we have shown that the estimator $\hat\theta^\tau_u$ is the unique minimizer of (3.3) and that it consistently estimates the signed non-zero pattern of the true parameter vector $\theta^\tau_u$, i.e., it consistently estimates the neighborhood of the node $u$. Applying the union bound over all nodes $u \in V$, we conclude that our estimation procedure, explained in Section 3, consistently estimates the graph structure at the time point $\tau$.

6 Numerical simulation

In this section, we demonstrate the numerical performance of Algorithm 1. A detailed comparison with other estimation procedures and an application to biological data have been reported in Kolar et al. (2010b). We use three different types of graph structures: a chain, a nearest-neighbor and a random graph. Each graph has $p = 50$ nodes and the maximum node degree is bounded by $s = 4$. These graphs are detailed below.

Example 1: Chain graph. First, a random permutation $\pi$ of $\{1, \ldots, p\}$ is chosen. Then a graph structure is created by connecting consecutive nodes in the permutation, that is, $(\pi(1), \pi(2)), \ldots, (\pi(p-1), \pi(p)) \in E$.

Example 2: Nearest-neighbor graph. A nearest-neighbor graph is generated following the procedure outlined in Li and Gui (2006). For each node, we draw a point uniformly at random on the unit square and compute the pairwise distances between nodes. Each node is then connected to its 4 closest neighbors. Since some of the nodes will have more than 4 adjacent edges, we randomly remove edges from nodes that have degree larger than 4, until the maximum degree of a node in the graph is 4.

Example 3: Random graph. To generate a random graph with $e = 45$ edges, we add edges one at a time, between random pairs of nodes whose degree is less than 4.

We use the procedure described above to create the first random graph $\tilde G_1$. Next, we randomly add 10 edges and remove 10 edges from $\tilde G_1$, taking care that the maximum node degree remains 4, to obtain $\tilde G_2$. We repeat this process of adding and removing edges to obtain $\tilde G_3, \ldots, \tilde G_6$. We refer to these 6 graphs as the anchor graphs; a sketch of their construction is given below.
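The following sketch implements the anchor-graph construction just described: a random graph on $p = 50$ nodes with 45 edges and maximum degree 4 (Example 3), followed by the add/remove perturbation that produces the next anchor graph. The helper names and the random seed are our own choices.

```python
import numpy as np

def add_random_edges(edges, deg, n_new, p, max_deg, rng):
    """Add n_new edges, one at a time, between random pairs of nodes
    whose current degree is below max_deg (cf. Example 3)."""
    while n_new > 0:
        u, v = rng.choice(p, size=2, replace=False)
        e = (min(u, v), max(u, v))
        if e not in edges and deg[u] < max_deg and deg[v] < max_deg:
            edges.add(e)
            deg[u] += 1
            deg[v] += 1
            n_new -= 1
    return edges, deg

def anchor_graphs(p=50, n_edges=45, max_deg=4, k=10, n_anchors=6, seed=0):
    """Generate the first random graph, then repeatedly remove and add
    k edges (keeping the degree cap) to obtain the anchor sequence."""
    rng = np.random.default_rng(seed)
    edges, deg = add_random_edges(set(), np.zeros(p, dtype=int),
                                  n_edges, p, max_deg, rng)
    anchors = [sorted(edges)]
    for _ in range(n_anchors - 1):
        drop_idx = set(rng.choice(len(edges), size=k, replace=False).tolist())
        kept = set()
        for i, e in enumerate(sorted(edges)):
            if i in drop_idx:
                deg[e[0]] -= 1
                deg[e[1]] -= 1
            else:
                kept.add(e)
        edges, deg = add_random_edges(kept, deg, k, p, max_deg, rng)
        anchors.append(sorted(edges))
    return anchors

graphs = anchor_graphs()
print([len(g) for g in graphs])  # each anchor graph keeps 45 edges
```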
We randomly generate the prototype parameter vectors $\tilde\theta_1, \ldots, \tilde\theta_6$ corresponding to the anchor graphs, and then interpolate 200 points between consecutive prototypes to obtain the parameters $\{\theta^t\}_{t \in \mathcal{T}_n}$, which gives us $n = 1000$. A prototype parameter vector $\tilde\theta_i$ for each anchor graph $\tilde G_i$, $i \in \{1, \ldots, 6\}$, is generated by sampling the non-zero elements of the vector independently from $\mathrm{Unif}([-1, -0.5] \cup [0.5, 1])$. For each $t \in \mathcal{T}_n$, we generate 10 i.i.d. samples from the distribution $P_{\theta^t}$ using Gibbs sampling; specifically, we discard samples from the first 10 iterations and collect samples every 100 iterations.

We estimate $\hat G^t$ for each $t \in \mathcal{T}_n$ using $k \in \{1, \ldots, 10\}$ samples at each time point. The results are expressed in terms of the precision (Pre), the recall (Rec) and the F1 score, defined as $F_1 := 2 \cdot \mathrm{Pre} \cdot \mathrm{Rec} / (\mathrm{Pre} + \mathrm{Rec})$. Let $\hat E^t$ denote the estimated edge set of $\hat G^t$; then the precision is calculated as $\mathrm{Pre} := \frac{1}{n} \sum_{t \in \mathcal{T}_n} |\hat E^t \cap E^t| / |\hat E^t|$ and the recall as $\mathrm{Rec} := \frac{1}{n} \sum_{t \in \mathcal{T}_n} |\hat E^t \cap E^t| / |E^t|$. Furthermore, we report results averaged over 100 independent runs. The tuning parameters are selected by maximizing the BIC score over a grid of regularization parameters, as described in Kolar et al. (2010b). Table 2 contains a summary of the simulation results.

Table 2: Summary of simulation results. The number of nodes is $p = 50$ and the number of discrete time points is $n = 1000$.

                     Number of independent samples
            1     2     3     4     5     6     7     8     9     10
Precision
  Chain   0.75  0.95  0.96  0.96  0.97  0.98  0.99  0.99  0.99  0.99
  NN      0.84  0.98  0.97  0.96  0.98  0.98  0.98  0.98  0.97  0.98
  Random  0.55  0.57  0.65  0.71  0.75  0.79  0.83  0.84  0.85  0.85
Recall
  Chain   0.59  0.65  0.69  0.72  0.73  0.73  0.73  0.73  0.73  0.73
  NN      0.48  0.57  0.61  0.63  0.63  0.64  0.64  0.64  0.65  0.65
  Random  0.50  0.52  0.55  0.56  0.56  0.58  0.60  0.60  0.63  0.66
F1 score
  Chain   0.66  0.76  0.80  0.82  0.83  0.84  0.84  0.84  0.85  0.84
  NN      0.61  0.72  0.74  0.76  0.77  0.77  0.77  0.77  0.77  0.78
  Random  0.52  0.54  0.60  0.63  0.64  0.67  0.70  0.70  0.72  0.74

As suggested by the reviewer, we perform an additional simulation that illustrates that the conditions of Theorem 1 can be satisfied. We use the random chain graph and the nearest-neighbor graph as two simulation settings. In each setting, we generate two anchor graphs with $p$ nodes and create two prototype parameter vectors, as described above. We then interpolate these two parameter vectors over $n$ points. Theorem 1 predicts the scaling of the sample size $n$, as a function of the other parameters, required to successfully recover the graph at a time point $\tau$. Therefore, if our theory correctly predicts the behavior of the estimation procedure and we plot the Hamming distance between the true and the recovered graph structures against an appropriately rescaled sample size, we expect the curves for different problem sizes to reach zero distance at the same point. The bandwidth parameter $h$ is set as $h = 4 \cdot n^{-1/3}$ and the penalty parameter as $\lambda_n = 2\sqrt{n^{-2/3}\log p}$, as suggested by the theory. Figure 1 shows the Hamming distance against the rescaled sample size $n/(s^{4.5}\log^{1.5}(p))$. Each point is averaged over 100 independent runs.

7 Discussion

In this paper, we focus on sparsistent estimation of the time-varying high-dimensional graph structure in Markov random fields from a small sample. An interesting open direction is the estimation of the graph structure from a general time series, where observations are dependent. In our opinion, the graph structure that changes with time creates the biggest technical difficulties.
Incorporating dependent observations would be an easier problem to address; it is, however, one of great practical importance, since samples in real data sets are likely to be dependent. Another open direction is to establish necessary conditions, complementing the sufficient conditions established here, under which it is possible to estimate a time-varying graph structure. Yet another research direction may be to use the non-convex penalties introduced by Fan and Li (2001) in place of the $\ell_1$ penalty. The idea would be to relax the condition imposed in assumption A2, since it is well known that SCAD penalties improve performance when the variables are correlated.

[Figure 1: Average Hamming distance plotted against the rescaled sample size $n/(s^{4.5}\log^{1.5}(p))$ for the chain graph and the nearest-neighbor graph, with $p = 60$, $p = 100$ and $p = 140$. Each column represents one simulation setting. Results are averaged over 100 independent runs.]

Acknowledgment

We would like to thank Larry Wasserman for many useful discussions and suggestions. The research reported here was supported in part by Grant ONR N000140910758, NSF DBI-0640543, NSF DBI-0546594, NSF IIS-0713379, an Alfred P. Sloan Research Fellowship to EPX and a graduate fellowship from Facebook to MK.

A Large deviation inequalities

In this section we characterize the deviation of the elements of the sample Fisher information matrix $\hat Q^\tau := \hat Q^\tau_u$ at the time point $\tau$, defined as

\[ \hat Q^\tau = \sum_{t} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_{\setminus u} x^{t\prime}_{\setminus u}, \tag{A.1} \]

and of the sample covariance matrix $\hat\Sigma^\tau$, from their population versions $Q^\tau$ and $\Sigma^\tau$. These results are crucial for the proof of the main theorem, where the consistency result depends on the bounds on the differences $\hat Q^\tau - Q^\tau$ and $\hat\Sigma^\tau - \Sigma^\tau$. In the following, we use $C$, $C'$ and $C''$ as generic positive constants independent of $(n, p, s)$.

A.1 Sample Fisher information matrix

To bound the deviation between the elements of $\hat Q^\tau = [\hat q^\tau_{vv'}]$ and $Q^\tau = [q^\tau_{vv'}]$, $v, v' \in V \setminus u$, we use the following decomposition:

\[ \begin{aligned} |\hat q^\tau_{vv'} - q^\tau_{vv'}| \leq \; & \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_v x^t_{v'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| \\ & + \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] \Big| \\ & + \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big|. \end{aligned} \tag{A.2} \]

The following lemma gives bounds on the terms in Eq. (A.2).

Lemma 9. Assume that the smoothness condition A3 is satisfied and that the kernel function $K(\cdot)$ satisfies A4. Furthermore, assume that

\[ \max_{t \in [0,1]} |\{ v \in \{1, \ldots, p\} : \theta^t_{uv} \neq 0 \}| < s, \]

i.e., the number of non-zero elements of the parameter vector is bounded by $s$. There exist constants $C, C', C'' > 0$, depending only on $M$ and $M_K$ (the constants quantifying assumptions A3 and A4, respectively), such that for any $\tau \in [0, 1]$ we have

\[ \max_{v,v'} \Big| \hat q^\tau_{vv'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| \leq C s h \tag{A.3} \]

and

\[ \max_{v,v'} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| \leq C' h. \tag{A.4} \]

Furthermore,

\[ \Big| \sum_{t \in \mathcal{T}_n} \big( w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} - \mathbb{E}[ w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} ] \big) \Big| < \epsilon \tag{A.5} \]

with probability at least $1 - 2\exp(-C'' n h \epsilon^2)$.

Proof. We start the proof by bounding the difference $|\eta(x; \theta^{t+\delta}_u) - \eta(x; \theta^t_u)|$, which will be useful later on.
By applying the mean value theorem to $\eta(x; \cdot)$ and the Taylor expansion to $\theta^t_u$, we obtain

\[ \begin{aligned} |\eta(x; \theta^{t+\delta}_u) - \eta(x; \theta^t_u)| &= \Big| \sum_{v=1}^{p-1} (\theta^{t+\delta}_{uv} - \theta^t_{uv})\, \eta'(x; \bar\theta^{(v)}_u) \Big| && \text{($\bar\theta^{(v)}_u$ on the line between $\theta^{t+\delta}_u$ and $\theta^t_u$)} \\ &\leq \sum_{v=1}^{p-1} |\theta^{t+\delta}_{uv} - \theta^t_{uv}| && \text{($|\eta'(x; \cdot)| \leq 1$)} \\ &\leq \sum_{v=1}^{p-1} \Big| \delta \frac{\partial}{\partial t}\theta^t_{uv} + \frac{\delta^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\beta_v} \Big| && \text{($\beta_v$ on the line between $t$ and $t+\delta$)} \\ &\leq C s (\delta + \delta^2). \end{aligned} \]

We now prove Eq. (A.3). Using the above bound and approximating the sum by a Riemann integral, we have

\[ \begin{aligned} \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_v x^t_{v'} - \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'} \Big| &\approx \Big| \int \frac{1}{h} K\Big( \frac{z - \tau}{h} \Big) \big[ \eta(x^z; \theta^\tau_u) - \eta(x^z; \theta^z_u) \big] x^z_v x^z_{v'} \, dz \Big| \\ &\leq \int_{-1}^{1} K(z') \, \big| \eta(x^{\tau + z'h}; \theta^\tau_u) - \eta(x^{\tau + z'h}; \theta^{\tau + z'h}_u) \big| \, dz' \\ &\leq \int_{-1}^{1} K(z') \Big[ \sum_{v=1}^{p-1} \Big| z'h \frac{\partial}{\partial t}\theta^t_{uv}\Big|_{t=\tau} + \frac{(z'h)^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\beta_v} \Big| \Big] dz' \\ &\leq C s h, \end{aligned} \]

for some constant $C > 0$ depending on $M$ from A3, which bounds the derivatives in the equation above, and $M_K$ from A4, which bounds the kernel. The last inequality follows from the assumption that the number of non-zero components of the vector $\theta^t_u$ is bounded by $s$.

Next, we prove equation (A.4). Using the Taylor expansion, for any fixed $1 \leq v, v' \leq p-1$,

\[ \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t (q^t_{vv'} - q^\tau_{vv'}) \Big| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t \Big( (t - \tau) \frac{\partial}{\partial t} q^t_{vv'}\Big|_{t=\tau} + \frac{(t-\tau)^2}{2} \frac{\partial^2}{\partial t^2} q^t_{vv'}\Big|_{t=\xi} \Big) \Big|, \]

where $\xi \in [t, \tau]$. Since $w^\tau_t = 0$ for $|t - \tau| > h$, we have

\[ \max_{v,v'} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(X^t; \theta^t_u) X^t_v X^t_{v'} \Big] - q^\tau_{vv'} \Big| \leq C' h \]

for some constant $C' > 0$ depending on $M$ and $M_K$ only.

Finally, we prove equation (A.5). Observe that the $w^\tau_t \eta(x^t; \theta^t_u) x^t_v x^t_{v'}$ are independent random variables, bounded in $[-w^\tau_t, w^\tau_t]$. The claim then follows from Hoeffding's inequality.

Using the results of Lemma 9, we can obtain the rate at which the element-wise distance between the true and sample Fisher information matrices decays to zero, as a function of the bandwidth parameter $h$ and the neighborhood size $s$. In the proof of the main theorem, the bandwidth parameter is chosen so that the bias and variance terms are balanced.

A.2 Sample covariance matrix

The deviation of the elements of the sample covariance matrix is bounded in a similar way as the deviation of the elements of the sample Fisher information matrix, given in Lemma 9. Denote the sample covariance matrix at the time point $\tau$ by

\[ \hat\Sigma^\tau = \sum_{t} w^\tau_t x^t x^{t\prime}. \]

The difference between the elements of $\hat\Sigma^\tau$ and $\Sigma^\tau$ can be bounded as

\[ |\hat\sigma^\tau_{uv} - \sigma^\tau_{uv}| = \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \sigma^\tau_{uv} \Big| \leq \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] \Big| + \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] - \sigma^\tau_{uv} \Big|. \tag{A.6} \]

The following lemma gives bounds on the terms in Eq. (A.6).

Lemma 10. Assume that the smoothness condition A3 is satisfied and that the kernel function $K(\cdot)$ satisfies A4. There are constants $C, C' > 0$, depending on $M$ and $M_K$ only, such that for any $\tau \in [0, 1]$ we have

\[ \max_{u,v} \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] - \sigma^\tau_{uv} \Big| \leq C h \tag{A.7} \]

and

\[ \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_u x^t_v - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t X^t_u X^t_v \Big] \Big| \leq \epsilon \tag{A.8} \]

with probability at least $1 - 2\exp(-C' n h \epsilon^2)$.

Proof. We follow the same proof strategy as in the proof of Lemma 9. In particular, Eq. (A.7) is proved in the same way as Eq. (A.4), and Eq. (A.8) in the same way as Eq. (A.5). The details of the derivation are omitted.
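The following sketch computes the plug-in quantities analyzed in this appendix: the kernel weights, the sample covariance $\hat\Sigma^\tau$, and the sample Fisher information matrix $\hat Q^\tau$ of Eq. (A.1), for given data and parameter vector. The box kernel and the inputs are placeholder choices.

```python
import numpy as np

def sample_matrices(X, u, theta_u, tau, h):
    """Compute Sigma_hat^tau and Q_hat^tau (Eq. (A.1)) at time tau.

    X is an n-by-p array in {-1, +1}; theta_u is the (p-1)-vector for node u.
    """
    n, p = X.shape
    times = np.arange(1, n + 1) / n
    w = 0.5 * (np.abs((times - tau) / h) <= 1.0)   # box kernel K_h
    w = w / w.sum()                                # normalized weights w_t^tau
    Sigma_hat = (X * w[:, None]).T @ X             # sum_t w_t x^t x^t'
    Z = np.delete(X, u, axis=1)                    # covariates x_{\u}^t
    a = np.exp(2 * X[:, u] * (Z @ theta_u))
    eta = 4 * a / (1 + a) ** 2                     # variance function eta(x; theta_u)
    Q_hat = (Z * (w * eta)[:, None]).T @ Z         # Eq. (A.1)
    return Sigma_hat, Q_hat
```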
B Technical proofs

In this appendix we provide the proofs of the lemmas used to prove the main result.

B.1 Proof of Lemma 3

The set of minima $\Theta(\lambda_n)$ of a convex function is convex. So, for two distinct points of minima $\bar\theta_u$ and $\tilde\theta_u$, every point on the line connecting them also belongs to the set of minima, i.e., $\xi \bar\theta_u + (1 - \xi) \tilde\theta_u \in \Theta(\lambda_n)$ for any $\xi \in (0, 1)$. Let $\boldsymbol\eta = \bar\theta_u - \tilde\theta_u$, so that any point on the line can be written as $\tilde\theta_u + \xi \boldsymbol\eta$. The value of the objective at any point of minima is constant, so we have

\[ F(\tilde\theta_u + \xi \boldsymbol\eta) = c, \quad \xi \in (0, 1), \]

where $c$ is some constant. By taking the derivative of $F(\tilde\theta_u + \xi \boldsymbol\eta)$ with respect to $\xi$, we obtain

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t \Big[ -x^t_u + \frac{\exp(\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) - \exp(-\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)}{\exp(\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) + \exp(-\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)} \Big] \langle \boldsymbol\eta, x^t_{\setminus u} \rangle + \lambda_n \sum_{v=1}^{p-1} \eta_v\, \mathrm{sign}(\tilde\theta_{uv} + \xi \eta_v) = 0. \tag{B.1} \]

On a small neighborhood of $\xi$, the sign of each component of $\tilde\theta_u + \xi \boldsymbol\eta$ is constant, since $\tilde\theta_u + \xi \boldsymbol\eta$ is continuous in $\xi$. By taking the derivative of Eq. (B.1) with respect to $\xi$, and noting that the last term is constant on a small neighborhood of $\xi$, we obtain

\[ 4 \sum_{t \in \mathcal{T}_n} w^\tau_t\, \langle \boldsymbol\eta, x^t_{\setminus u} \rangle^2 \, \frac{\exp(2\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle)}{\big( 1 + \exp(2\langle \tilde\theta_u + \xi\boldsymbol\eta, x^t_{\setminus u} \rangle) \big)^2} = 0. \]

This implies that $\langle \boldsymbol\eta, x^t_{\setminus u} \rangle = 0$ for every $t \in \mathcal{T}_n$, which in turn implies that $\langle x^t_{\setminus u}, \bar\theta_u \rangle = \langle x^t_{\setminus u}, \tilde\theta_u \rangle$, $t \in \mathcal{T}_n$, for any two solutions $\bar\theta_u$ and $\tilde\theta_u$. Since $\bar\theta_u$ and $\tilde\theta_u$ were two arbitrary elements of $\Theta(\lambda_n)$, we conclude that $\langle x^t_{\setminus u}, \theta_u \rangle$, $t \in \mathcal{T}_n$, is constant across all elements $\theta_u \in \Theta(\lambda_n)$.

Next, we show that this conclusion implies that any two solutions have non-zero elements in the same positions. From equation (5.2), it follows that the set of non-zero components of a solution is given by

\[ S = \Big\{ 1 \leq v \leq p-1 \; : \; \Big| \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v \Big| = \lambda_n \Big\}. \]

Using equation (5.3), we have that

\[ \sum_{t \in \mathcal{T}_n} w^\tau_t (\nabla \gamma(\theta_u; x^t))_v = \sum_{t \in \mathcal{T}_n} w^\tau_t \Big( x^t_{\setminus u} \Big\{ x^t_u + 1 - \frac{2\exp(2 x^t_u \langle \theta_u, x^t_{\setminus u} \rangle)}{\exp(2 x^t_u \langle \theta_u, x^t_{\setminus u} \rangle) + 1} \Big\} \Big)_v, \]

which is constant across different elements $\theta_u \in \Theta(\lambda_n)$, since $\langle x^t_{\setminus u}, \theta_u \rangle$, $t \in \mathcal{T}_n$, is constant for all $\theta_u \in \Theta(\lambda_n)$. This implies that the set of non-zero components is the same for all solutions. □

B.2 Proof of Lemma 4

Under the assumptions of the lemma, we can apply the result of Lemma 10. Let $y \in \mathbb{R}^s$ be a unit-norm minimal eigenvector of $\hat\Sigma^\tau_{SS}$. We have

\[ \Lambda_{\min}(\Sigma^\tau_{SS}) = \min_{\|x\|_2 = 1} x' \Sigma^\tau_{SS} x = \min_{\|x\|_2 = 1} \big\{ x' \hat\Sigma^\tau_{SS} x + x' (\Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS}) x \big\} \leq y' \hat\Sigma^\tau_{SS} y + y' (\Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS}) y, \]

which implies

\[ \Lambda_{\min}(\hat\Sigma^\tau_{SS}) \geq D_{\min} - ||| \Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS} |||_2. \]

Let $\Sigma^\tau = [\sigma^\tau_{uv}]$ and $\hat\Sigma^\tau = [\hat\sigma^\tau_{uv}]$. We have the following bound on the spectral norm:

\[ ||| \Sigma^\tau_{SS} - \hat\Sigma^\tau_{SS} |||_2 \leq \Big( \sum_{u=1}^s \sum_{v=1}^s (\hat\sigma^\tau_{uv} - \sigma^\tau_{uv})^2 \Big)^{1/2} \leq \delta, \]

with probability at least $1 - 2\exp(-C n h (\delta s^{-1} - C' h)^2 + C'' \log s)$, for some fixed constants $C, C', C'' > 0$ depending on $M$ and $M_K$ only. Similarly, we have that

\[ \Lambda_{\max}(\hat\Sigma^\tau_{SS}) \leq D_{\max} + \delta, \]

with probability at least $1 - 2\exp(-C n h (\delta s^{-1} - C' h)^2 + C'' \log s)$, for some fixed constants $C, C', C'' > 0$ depending on $M$ and $M_K$ only.

From Lemma 3, we know that any two solutions $\bar\theta_u, \tilde\theta_u \in \Theta(\lambda_n)$ of the optimization problem (3.3) have non-zero elements in the same positions. So, for any two solutions $\bar\theta_u, \tilde\theta_u \in \Theta(\lambda_n)$, it holds that

\[ X_{\setminus u} (\bar\theta_u - \tilde\theta_u) = X_{\setminus u, S} (\bar\theta_u - \tilde\theta_u)_S + X_{\setminus u, S^c} (\bar\theta_u - \tilde\theta_u)_{S^c} = X_{\setminus u, S} (\bar\theta_u - \tilde\theta_u)_S. \]
Furthermore, from Lemma 3 we know that the difference of the two solutions lies in the kernel of $X_{\setminus u, S}$. On the event $\Omega_0$, the kernel of $X_{\setminus u, S}$ is $\{0\}$. Thus, the solution is unique on $\Omega_0$. □

B.3 Proof of Lemma 5

We first analyze the probability of the event $\Omega_1$. Using the same argument as in the proof of Lemma 4, we obtain

\[ \Lambda_{\min}(\hat Q^\tau_{SS}) \geq C_{\min} - ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2. \]

Next, using the results of Lemma 9, we have the bound

\[ ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2 \leq \Big( \sum_{u=1}^s \sum_{v=1}^s (\hat q^\tau_{uv} - q^\tau_{uv})^2 \Big)^{1/2} \leq \delta, \tag{B.2} \]

with probability at least $1 - 2\exp(-C n h \delta^2 s^{-2} + 2\log s)$, for some fixed constants $C, C' > 0$ depending on $M$ and $M_K$ only.

Next, we deal with the event $\Omega_2$. We use the following decomposition:

\[ \begin{aligned} \hat Q^\tau_{S^c S} (\hat Q^\tau_{SS})^{-1} = \; & Q^\tau_{S^c S} \big[ (\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1} \big] + \big[ \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} \big] (Q^\tau_{SS})^{-1} \\ & + \big[ \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} \big] \big[ (\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1} \big] + Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} \\ = \; & T_1 + T_2 + T_3 + T_4. \end{aligned} \]

Under assumption A2, we have that $||| T_4 |||_\infty \leq 1 - \alpha$. The lemma follows if we prove that $||| \cdot |||_\infty \leq \alpha/6$ for each of the other terms. Using the submultiplicative property of the norm, we have for the first term

\[ ||| T_1 |||_\infty \leq ||| Q^\tau_{S^c S} (Q^\tau_{SS})^{-1} |||_\infty \, ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \, ||| (\hat Q^\tau_{SS})^{-1} |||_\infty \leq (1 - \alpha) \, ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \sqrt{s} \, ||| (\hat Q^\tau_{SS})^{-1} |||_2. \tag{B.3} \]

Using Eq. (B.2), we can bound $||| (\hat Q^\tau_{SS})^{-1} |||_2 \leq C''$, for some constant depending on $C_{\min}$ only, with probability at least $1 - 2\exp(-C n h s^{-2} + 2\log s)$, for some fixed constant $C > 0$. The bound on the term $||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty$ follows from an application of Lemma 9. Observe that

\[ P\big[ ||| \hat Q^\tau_{SS} - Q^\tau_{SS} |||_\infty \geq \delta \big] = P\Big[ \max_{v \in S} \Big\{ \sum_{v' \in S} |\hat q^\tau_{vv'} - q^\tau_{vv'}| \Big\} \geq \delta \Big] \leq 2\exp\big( -C n h (\delta s^{-1} - C' s h)^2 + 2\log s \big), \tag{B.4} \]

for some fixed constants $C, C' > 0$. Combining these bounds, we obtain $||| T_1 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log s)$, for some constants $C, C', C'' > 0$. For the second term, we have

\[ ||| T_2 |||_\infty \leq ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \sqrt{s} \, ||| (Q^\tau_{SS})^{-1} |||_2 \leq \frac{\sqrt{s}}{C_{\min}} ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty. \tag{B.5} \]

The bound on the term $||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty$ follows in the same way as the bound in Eq. (B.4), and we conclude that $||| T_2 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log p)$, for some constants $C, C', C'' > 0$. The same holds for the term $T_3$, for which we have the decomposition

\[ \begin{aligned} ||| [\hat Q^\tau_{S^c S} - Q^\tau_{S^c S}] [(\hat Q^\tau_{SS})^{-1} - (Q^\tau_{SS})^{-1}] |||_\infty &\leq ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \sqrt{s} \, ||| (Q^\tau_{SS})^{-1} [Q^\tau_{SS} - \hat Q^\tau_{SS}] (\hat Q^\tau_{SS})^{-1} |||_2 \\ &\leq \frac{\sqrt{s}}{C_{\min}} ||| \hat Q^\tau_{S^c S} - Q^\tau_{S^c S} |||_\infty \, ||| Q^\tau_{SS} - \hat Q^\tau_{SS} |||_2 \, ||| (\hat Q^\tau_{SS})^{-1} |||_2, \end{aligned} \]

so that $||| T_3 |||_\infty \leq \alpha/6$ with probability at least $1 - C\exp(-C' n h s^{-3} + C'' \log p)$. The bound on the probability of the event $\Omega_2$ follows by combining the bounds on all the terms. □

B.4 Proof of Lemma 8

To prove this lemma, we use a technique of Rothman et al. (2008), applied there to the problem of consistency of a penalized covariance matrix estimator. Define the function

\[ H : \begin{cases} \mathbb{R}^{p-1} \to \mathbb{R} \\ D \mapsto F(\theta^\tau_u + D) - F(\theta^\tau_u), \end{cases} \]

where the function $F(\cdot)$ is defined in equation (5.1). The function $H(\cdot)$ takes the form

\[ H(D) = \sum_{t \in \mathcal{T}_n} w^\tau_t \big( \gamma(\theta^\tau_u; x^t) - \gamma(\theta^\tau_u + D; x^t) \big) + \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big). \]

Recall the minimizer of (3.3) constructed in the proof of Proposition 6, $\hat\theta^\tau_u = (\bar\theta'_S, 0'_{S^c})'$. The minimizer of the function $H(\cdot)$ is $\hat D = \hat\theta^\tau_u - \theta^\tau_u$. The function $H(\cdot)$ is convex and $H(0) = 0$ by construction. Therefore $H(\hat D) \leq 0$.
If we show that, for some radius $B > 0$ and all $D \in \mathbb{R}^{p-1}$ with $\|D\|_2 = B$ and $D_{S^c} = 0$, we have $H(D) > 0$, then we can claim that $\|\hat D\|_2 \leq B$. This follows from the convexity of $H(\cdot)$.

We proceed to show strict positivity of $H(\cdot)$ on the boundary of the ball with radius $B = K \lambda_n \sqrt{s}$, where $K > 0$ is a constant to be chosen. Let $D \in \mathbb{R}^{p-1}$ be an arbitrary vector with $\|D\|_2 = B$ and $D_{S^c} = 0$. By a Taylor expansion of $\gamma(\cdot; x^t)$, we have

\[ H(D) = -\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big)' D + \frac{1}{2} D' \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big] D + \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big) = (I) + (II) + (III), \tag{B.6} \]

for some $\alpha \in [0, 1]$. We first bound the term $(I)$. Let $e_v \in \mathbb{R}^{p-1}$ be the unit vector with one at position $v$ and zeros elsewhere. The random variables $e'_v w^\tau_t \nabla \gamma(\theta^\tau_u; x^t)$ are bounded in $[-C/(nh), C/(nh)]$ for all $1 \leq v \leq p-1$, with a constant $C > 0$ depending on $M_K$ only. Using the Hoeffding inequality and the union bound, we have

\[ \max_{1 \leq v \leq p-1} \Big| e'_v \Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) - \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) \Big] \Big) \Big| \leq \delta \]

with probability at least $1 - 2\exp(-C n h \delta^2 + \log p)$, where $C > 0$ depends on $M_K$ only. Moreover, denoting $p(\theta^t_u) = P_{\theta^t_u}[x^t_u = 1 \mid x^t_{\setminus u}]$ to simplify the notation, we have, for all $1 \leq v \leq p-1$,

\[ \Big| \mathbb{E}\Big[ e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) \;\Big|\; \{x^t_{\setminus u}\}_{t \in \mathcal{T}_n} \Big] \Big| = \Big| \mathbb{E}\Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t x^t_v \big[ x^t_u + 1 - 2 p(\theta^\tau_u) \big] \;\Big|\; \{x^t_{\setminus u}\}_{t \in \mathcal{T}_n} \Big] \Big| = \Big| \sum_{t \in \mathcal{T}_n} 2 w^\tau_t x^t_v \big[ p(\theta^t_u) - p(\theta^\tau_u) \big] \Big| \leq 2 \int_{-1}^{1} K(z) \, \big| p(\theta^{\tau + zh}_u) - p(\theta^\tau_u) \big| \, dz. \tag{B.7} \]

Next, we apply the mean value theorem to $p(\cdot)$ and Taylor's theorem to $\theta^t_u$. Under assumption A3, using $|p'(\cdot)| \leq 1$, we have

\[ \big| p(\theta^{\tau + zh}_u) - p(\theta^\tau_u) \big| \leq \sum_{v=1}^{p-1} |\theta^{\tau+zh}_{uv} - \theta^\tau_{uv}| \leq \sum_{v=1}^{p-1} \Big| zh \frac{\partial}{\partial t}\theta^t_{uv}\Big|_{t=\tau} + \frac{(zh)^2}{2} \frac{\partial^2}{\partial t^2}\theta^t_{uv}\Big|_{t=\alpha_v} \Big| \leq C s |zh + (zh)^2|, \tag{B.8} \]

where $\alpha_v$ is a point between $\tau + zh$ and $\tau$, and $C > 0$ depends on $M$ only. Combining (B.8) and (B.7), we have that $| \mathbb{E}[ e'_v \sum_{t} w^\tau_t \nabla \gamma(\theta^\tau_u; X^t) ] | \leq C s h$ for all $1 \leq v \leq p-1$. Thus, with probability greater than $1 - 2\exp(-C n h (\lambda_n - C' s h)^2 + \log p)$, for some constants $C, C' > 0$ depending on $M_K$, $M$ and $\alpha$ only, which under the conditions of Theorem 1 converges to 1 exponentially fast, we have

\[ \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| \leq \frac{\alpha \lambda_n}{2(4-\alpha)} < \lambda_n. \]

On that event, using Hölder's inequality, we have

\[ \Big| \Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big)' D \Big| \leq \|D\|_1 \max_{1 \leq v \leq p-1} \Big| e'_v \sum_{t \in \mathcal{T}_n} w^\tau_t \nabla \gamma(\theta^\tau_u; x^t) \Big| \leq \lambda_n \sqrt{s} \|D\|_2 \leq (\lambda_n \sqrt{s})^2 K. \]

The triangle inequality applied to the term $(III)$ of equation (B.6) yields

\[ \lambda_n \big( \|\theta^\tau_u + D\|_1 - \|\theta^\tau_u\|_1 \big) \geq -\lambda_n \|D_S\|_1 \geq -\lambda_n \sqrt{s} \|D_S\|_2 \geq -K (\lambda_n \sqrt{s})^2. \]

Finally, we bound the term $(II)$ of equation (B.6). Observe that, since $D_{S^c} = 0$, we have

\[ D' \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_{\setminus u} x^{t\prime}_{\setminus u} \Big] D = D'_S \Big[ \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big] D_S \geq K^2 \lambda_n^2 s \, \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big). \]

Let $g : \mathbb{R} \to \mathbb{R}$ be defined as $g(z) = \frac{4\exp(2z)}{(1+\exp(2z))^2}$.
Now, $\eta(x; \theta_u) = g(x_u \langle \theta_u, x_{\setminus u} \rangle)$ and we have

\[ \begin{aligned} \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) &\geq \min_{\alpha \in [0,1]} \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) \\ &\geq \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u) x^t_S x^{t\prime}_S \Big) - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)\, x^t_S x^{t\prime}_S \Big|\Big|\Big|_2 \\ &\geq C_{\min} - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \sum_{t \in \mathcal{T}_n} w^\tau_t\, g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)\, x^t_S x^{t\prime}_S \Big|\Big|\Big|_2. \end{aligned} \]

To bound the spectral norm, observe that for any fixed $\alpha \in [0, 1]$ and $y \in \mathbb{R}^s$ with $\|y\|_2 = 1$,

\[ \begin{aligned} y' \Big\{ \sum_{t \in \mathcal{T}_n} w^\tau_t g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S) x^t_S x^{t\prime}_S \Big\} y &= \sum_{t \in \mathcal{T}_n} w^\tau_t g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S)(x^{t\prime}_S y)^2 \\ &\leq \sum_{t \in \mathcal{T}_n} w^\tau_t \big| g'\big( x^t_u \langle \theta^\tau_u + \alpha D, x^t_S \rangle \big) (x^t_u D'_S x^t_S) \big| (x^{t\prime}_S y)^2 \\ &\leq \sqrt{s} \|D\|_2 \, \Big|\Big|\Big| \sum_{t} w^\tau_t x^t_S x^{t\prime}_S \Big|\Big|\Big|_2 \qquad \big( |g'(\cdot)| \leq 1 \big) \\ &\leq D_{\max} K \lambda_n s \leq \frac{C_{\min}}{2}. \end{aligned} \]

The last inequality holds as long as $\lambda_n s \leq \frac{C_{\min}}{2 D_{\max} K}$. We have thus shown that

\[ \Lambda_{\min}\Big( \sum_{t \in \mathcal{T}_n} w^\tau_t \eta(x^t; \theta^\tau_u + \alpha D) x^t_S x^{t\prime}_S \Big) \geq \frac{C_{\min}}{2}, \]

with high probability. Putting the bounds on the three terms together, we have

\[ H(D) \geq (\lambda_n \sqrt{s})^2 \Big\{ -K + \frac{C_{\min}}{4} K^2 - K \Big\}, \]

which is strictly positive for $K = 12 / C_{\min}$. For this choice of $K$, the requirement $\lambda_n s \leq \frac{C_{\min}^2}{24 D_{\max}}$ holds under the conditions of Theorem 1 for $n$ large enough. □

References

A. Ahmed and E. P. Xing. Recovering time-varying networks of dependencies in social and biological studies. Proc. Natl. Acad. Sci., 106(29):11878–11883, 2009.

C. Andrieu, M. Davy, and A. Doucet. Efficient particle filtering for jump Markov systems. Application to time-varying autoregressions. IEEE Transactions on Signal Processing, 51(7):1762–1770, 2003.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485–516, 2008.

G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: Some easy observations and algorithms. Arxiv, 0712.1402, 2007.

W. S. Cleveland, E. Grosse, and W. M. Shyu. Local regression models. In J. M. Chambers and T. J. Hastie, editors, Statistical Models in S, pages 309–376, 1991.

N. Dobigeon, J.-Y. Tourneret, and M. Davy. Joint segmentation of piecewise constant autoregressive processes by using a hierarchical model and a Bayesian sampling approach. IEEE Transactions on Signal Processing, 55(4):1251–1263, 2007.

F. Dondelinger, S. Lebre, and D. Husmeier. Heterogeneous continuous dynamic Bayesian networks with flexible structure and inter-time segment information sharing. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

F. Dondelinger, S. Lebre, and D. Husmeier. Non-homogeneous dynamic Bayesian networks with Bayesian regularization for inferring gene regulatory networks with gradually time-varying structure. Machine Learning, pages 1–40, 2012. ISSN 0885-6125. doi: 10.1007/s10994-012-5311-x.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96:1348–1360, 2001.

J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Statist., 3(2):521–541, 2009.

P. Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213, 2006.

E. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Bayesian nonparametric inference of switching dynamic linear models.