Information-theoretic limits of selecting binary graphical models in high dimensions
Narayana Santhanam
Department of ECE
University of Hawaii
Honolulu, HI

Martin J. Wainwright
Departments of Statistics and EECS
UC Berkeley
Berkeley, CA 94720
Abstract
The problem of graphical model selection is to correctly estimate the graph structure of a Markov random field given samples from the underlying distribution. We analyze the information-theoretic limitations of the problem of graph selection for binary Markov random fields under high-dimensional scaling, in which the graph size p and the number of edges k, and/or the maximal node degree d, are allowed to increase to infinity as a function of the sample size n. For pairwise binary Markov random fields, we derive both necessary and sufficient conditions for correct graph selection over the class G_{p,k} of graphs on p vertices with at most k edges, and over the class G_{p,d} of graphs on p vertices with maximum degree at most d. For the class G_{p,k}, we establish the existence of constants c and c′ such that if n < c k log p, any method has error probability at least 1/2, and we demonstrate a graph decoder that succeeds with high probability if n > c′ k log p. Similarly, for the class G_{p,d}, we exhibit constants c and c′ such that for n < c d log p, any method fails with probability at least 1/
2, and we demonstrate a graph decoder that succeeds with highprobability for n > c ′ d log p . Markov random fields (also known as undirected graphical models) provide a structured representa-tion of the joint distributions of families of random variables. They are used in various applicationdomains, among them image analysis [14, 5], social network analysis [27, 29], and computationalbiology [12, 20, 1]. Any Markov random field is associated with an underlying graph that describesconditional independence properties associated with the joint distribution of the random variables.The problem of graphical model selection is to recover this unknown graph using samples from thedistribution.Given its relevance in many domains, the graph selection problem has attracted a great dealof attention. The naive approach of searching exhaustively over the space of all graphs is compu-tationally intractable, since there 2( p ) distinct graphs over p vertices. If the underlying graph isknown to be tree-structured, then the graph selection problem can be reduced to a maximum-weightspanning tree problem and solved in polynomial time [9]. On the other hand, for general graphswith cycles, the problem is known to be difficult in a complexity-theoretic sense [8]. Nonetheless,a variety of methods have been proposed, including constraint-based approaches [26, 20], thresh-olding methods [6], and ℓ -based relaxations [21, 22, 32, 13, 24]. Other researchers [19, 11] haveanalyzed graph selection methods based on penalized forms of pseudolikelihood.Given a particular procedure for graph selection, a classical analysis studies its behavior for afixed graph as the sample size n is increased. In this paper, as with an evolving line of contemporarystatistical research, we address the graph selection problem in the high-dimensional setting , meaningthat we allow the graph size p as well as other structural parameters, such as the number of edges k or the maximum vertex degree d , to scale with the sample size n . We note that a line of recent workhas established some high-dimensional consistency results for various graph selection procedures,including methods based on ℓ -regularization for Gaussian models [21, 23, 24], ℓ -regularization1or binary discrete Markov random fields [22], thresholding methods for discrete models [6], andvariants of the PC algorithm for directed graphical models [20]. All of these methods are practicallyappealing given their low-computational cost.Of complementary interest—and the focus of the paper—are the information-theoretic limita-tions of graphical model selection. More concretely, consider a graph G = ( V, E ), consisting of avertex set V with cardinality p , and an edge set E ⊂ V × V . In this paper, we consider both theclass G p,k of all graphs with | E | ≤ k edges, as well as the class G p,d all graphs with maximum vertexdegree d . Now suppose that we are allowed to collect n independent and identically distributed(i.i.d.) samples from a Markov random field defined by some graph G ∈ G p,k (or G p,d ). Remember-ing that the graph size p and structural parameters ( k, d ) are allowed to scale with the sample size,we thereby obtain sequences of statistical inference problems, indexed by the triplet ( n, p, k ) for theclass G p,k , and by the triplet ( n, p, d ) for the class G p,d . The goal of this paper is to address ques-tions of the following type. 
First, under what scalings of the triplet (n, p, k) (or correspondingly, the triplet (n, p, d)) is it possible to recover the correct graph with high probability? Conversely, under what scalings of these triplets does any method fail most of the time?

Although our methods are somewhat more generally applicable, we limit the analysis of this paper to the case of pairwise binary Markov random fields, also known as the Ising model, so as to bring sharp focus to these issues. The Ising model is a classical model from statistical physics [18, 4], where it is used to model physical phenomena such as crystal structure and magnetism; more recently it has been used in image analysis [5, 14], social network modeling [3, 27], and gene network analysis [1, 25].

At a high level, then, the goal of this paper is to understand the information-theoretic capacity of Ising model selection. Our perspective is not unrelated to a line of statistical work in non-parametric estimation [15, 17, 31, 30], in that we view the observation process as a channel communicating information about graphs to the statistician. In contrast to non-parametric estimation, the spaces of possible "codewords" are not function spaces but rather classes of graphs. Accordingly, part of the analysis in this paper involves developing ways in which to measure distances between graphs, and to relate these distances to the Kullback-Leibler divergence known to control error rates in statistical testing. We note that understanding the graph selection capacity can be practically useful in two different ways. First, it can clarify when computationally efficient algorithms achieve information-theoretic limits, and hence are optimal up to constant factors. Second, it can reveal regimes in which the best known methods to date are sub-optimal, thereby motivating the search for new and possibly better methods. Indeed, the analysis of this paper has consequences of both types.

In this paper, we prove four main theorems, more specifically necessary and sufficient conditions for the class G_{p,k} of bounded edge cardinality models, and for the class G_{p,d} of bounded vertex degree models. Proofs of the necessary conditions (Theorems 1 and 2) use indirect methods, based on a version of Fano's lemma applied to carefully constructed sub-families of graphs. On the other hand, our proof of the sufficient conditions (Theorems 3 and 4) is based on direct analysis of explicit "graph decoders". The remainder of this paper is organized as follows. We begin in Section 2 with background on Markov random fields, the classes of graphs considered in this paper, and a precise statement of the graphical model selection problem. In Section 3, we state our main results and explore some of their consequences. Section 4 is devoted to proofs of the necessary conditions on the sample size (Theorems 1 and 2), whereas Section 5 is devoted to proofs of the sufficient conditions. We conclude with a discussion in Section 6.

In this paper, we assume that the data is drawn from some Ising model in the class G_{p,k} or G_{p,d}, so that we study the probability of recovering the exact model. However, a similar analysis can be applied to the problem of finding the best approximating distribution using an Ising model from the class G_{p,k} or G_{p,d}.

Notation: For the convenience of the reader, we summarize here notation to be used throughout the paper.
We use the following standard notation for asymptotics: we write f ( n ) = O ( g ( n )) if f ( n ) ≤ cg ( n ) for some constant c < ∞ , and f ( n ) = Ω( g ( n )) if f ( n ) ≥ c ′ g ( n ) for some constant c ′ >
0. The notation f ( n ) = Θ( g ( n )) means that f ( n ) = O ( g ( n )) and f ( n ) = Ω( g ( n )). We begin with some background on Markov random fields, and then provide a precise formulationof the problem.
An undirected graph G = (V, E) consists of a collection V = {1, 2, ..., p} of vertices joined by a collection E ⊂ V × V of edges. The neighborhood of any node s ∈ V is the subset N(s) ⊂ V given by

N(s) := { t ∈ V | (s, t) ∈ E },   (1)

and the degree of vertex s is given by d_s := |N(s)|, corresponding to the cardinality of this neighbor set. We use d = max_{s ∈ V} d_s to denote the maximum vertex degree, and k = |E| to denote the total number of edges.

A Markov random field is obtained by associating a random variable X_s with each vertex s ∈ V, and then specifying a joint distribution P over the random vector (X_1, ..., X_p) that respects the graph structure in a particular way. In the special case of the Ising model, each random variable X_s takes values in {−1, +1}, and the probability mass function has the form

P_θ(x_1, ..., x_p) = (1/Z(θ)) exp{ Σ_{(s,t) ∈ E} θ_st x_s x_t },   (2)

where Z(θ) is the normalization factor given by

Z(θ) := Σ_{x ∈ {−1,+1}^p} exp{ Σ_{(s,t) ∈ E} θ_st x_s x_t }.   (3)

To be clear, we view the parameter vector θ as an element of R^{\binom{p}{2}}, with the understanding that θ_st = 0 for all pairs (s, t) ∉ E. So as to emphasize the graph-structured nature of the parameter θ, we often use the notation θ(G). The edge weight θ_st describes the conditional dependence between X_s and X_t, given fixed values for all other vertices X_u, u ≠ s, t. In particular, a little calculation shows that the conditional distribution takes the form

P_θ( x_s, x_t | x_{V \ {s,t}} ) ∝ exp( θ_st x_s x_t + Σ_{u ∈ N(s) \ t} θ_us x_u x_s + Σ_{u ∈ N(t) \ s} θ_ut x_u x_t ).

The Ising model (2) has its origins in statistical physics [18, 4], where it is used to model physical phenomena such as crystal structure and magnetism; it has also been used as a simple model in image processing [5, 14], gene network analysis [1, 25], and in modeling social networks [3, 27]. For instance, Banerjee et al. [3] use this model to describe the voting behaviors of p politicians, where X_s represents whether politician s voted for (X_s = +1) or against (X_s = −
1) a particular bill. In this case, a positive edge weight θ_st > 0 means that politicians s and t are more likely to agree in their voting (i.e., X_s = X_t) than to disagree (X_s ≠ X_t), whereas a negative edge weight means that they are more likely to disagree. In this paper, we forbid self-loops in the graph, meaning that (s, s) ∉ E for all s ∈ V.

2.2 Classes of graphical models

In this paper, we consider two different classes of Ising models (2), depending on the condition that we impose on the edge set E. In particular, we consider the two classes of graphs:

(a) the collection G_{p,d} of graphs such that each vertex has degree at most d for some d ≥
1, and (b) the collection G_{p,k} of graphs G with |E| ≤ k edges for some k ≥ 1. Any graph G in these classes is associated with parameter vectors θ(G) ∈ R^{\binom{p}{2}} that respect its edge structure. Naturally, one important property is the minimum value over the edges. Accordingly, we define the function

λ*(θ(G)) := min_{(s,t) ∈ E} |θ_st|.   (4)

The interpretation of the parameter λ is clear: as in any signal detection problem, it is obviously difficult to detect an interaction θ_st if it is extremely close to zero. In contrast to classical signal detection problems, estimation of the graphical structure turns out to be harder if the edge parameters θ_st are large, since large edge parameters can mask the presence of interactions on other edges. The following example illustrates this point:

Example 1.
Consider the family G_{p,k} of graphs on p = 3 vertices with k = 2 edges; note that there are a total of three such graphs. For each of these three graphs, consider the parameter vector θ(G) = [θ θ 0], where the single zero corresponds to the single pair of distinct vertices s ≠ t not in the graph's edge set, as illustrated in Figure 1. Consider the limiting case θ = +∞ for any choice of graph with two edges.

[Figure 1, panels (a)-(c): the three two-edge graphs on p = 3 vertices, each edge carrying weight θ.]

Figure 1. Illustration of the family G_{p,k} for p = 3 and k = 2; note that there are three distinct graphs G with p = 3 vertices and k = 2 edges. Setting the edge parameter θ(G) = [θ θ
0] induces a family of three Markov random fields. As the edge weight parameter θ increases, the associated distributions P_θ(G) become arbitrarily difficult to separate.

In this limit, the Ising model distribution enforces the "hard-core" constraint that (X_1, X_2, X_3) must all be equal; that is, for any graph G, the distribution P_θ(G) places mass 1/2 on the configuration [+1 +1 +1] and mass 1/2 on the configuration [−1 −1 −1]. Of course, this hard-core limit is an extreme case, in which the models are not actually identifiable. Nonetheless, it shows that if the edge weight θ is finite but very large, the models will not be identical, but will nonetheless be extremely hard to distinguish. Motivated by this example, we define the maximum neighborhood weight

ω*(θ(G)) := max_{s ∈ V} Σ_{t ∈ N(s)} |θ_st|.   (5)

Our analysis shows that the number of samples n required to distinguish graphs grows exponentially in this quantity. In this paper, we study classes of Markov random fields that are parameterized by a lower bound λ on the minimum edge weight, and an upper bound ω on the maximum neighborhood weight.

Definition 1 (Classes of graphical models). (a) Given a pair (λ, ω) of positive numbers, the set G_{p,d}(λ, ω) consists of all distributions P_θ(G) of the form (2) such that (i) the underlying graph G = (V, E) is a member of the family G_{p,d} of graphs on p vertices with vertex degree at most d; (ii) the parameter vector θ = θ(G) respects the structure of G, meaning that θ_st ≠ 0 only when (s, t) ∈ E; and (iii) the minimum edge weight and maximum neighborhood weight satisfy the bounds

λ*(θ(G)) ≥ λ, and ω*(θ(G)) ≤ ω.   (6)

(b) The set G_{p,k}(λ, ω) is defined in an analogous manner, with the graph G belonging to the class G_{p,k} of graphs with p vertices and at most k edges.

We note that for any parameter vector θ(G), we always have the inequality

ω*(θ(G)) ≥ max_{s ∈ V} |N(s)| · λ*(θ(G)),   (7)

so that the families G_{p,k}(λ, ω) and G_{p,d}(λ, ω) are only well-defined for suitable pairs (λ, ω).

For a given graph class G (either G_{p,d} or G_{p,k}) and positive weights (λ, ω), suppose that nature chooses some member P_θ(G) from the associated family G(λ, ω) of Markov random fields. Assume that the statistician observes n samples X^n := {X^{(1)}, ..., X^{(n)}} drawn in an independent and identically distributed (i.i.d.) manner from the distribution P_θ(G). Note that by definition of the Markov random field, each sample X^{(i)} belongs to the discrete set X := {−1, +1}^p, so that the overall data set X^n belongs to the Cartesian product space X^n.

We assume that the goal of the statistician is to use the data X^n to infer the underlying graph G ∈ G, which we refer to as the problem of graphical model selection. More precisely, we consider functions φ : X^n → G, which we refer to as graph decoders. We measure the quality of a given graph decoder φ using the 0-1 loss function I[φ(X^n) ≠ G], which takes the value 1 when φ(X^n) ≠ G and the value 0 otherwise, and we define the associated 0-1 risk

P_θ(G)[φ(X^n) ≠ G] = E_θ(G)[ I[φ(X^n) ≠ G] ],

corresponding to the probability of incorrect graph selection. Here the probability (and expectation) are taken with respect to the product distribution of P_θ(G) over the n i.i.d.
samples.

The main purpose of this paper is to study the scalings of the sample size n, as a function of the graph size p, number of edges k, maximum degree d, minimum edge weight λ, and maximum neighborhood weight ω, that are either sufficient for some graph decoder φ to output the correct graph with high probability, or conversely, are necessary for any graph decoder to output the correct graph with probability larger than 1/2. We consider two variants of the problem, depending on whether the edge weights θ are known or unknown. In the known edge weight variant, the task of the decoder is to distinguish between graphs, where for any candidate graph G = (V, E), the decoder knows the numerical values of the parameters θ(G). (Recall that by definition, [θ(G)]_uv = 0 for all (u, v) ∉ E, so that the additional information being provided consists of the values [θ(G)]_st for all (s, t) ∈ E.) In the unknown edge weight variant, both the graph structure and the numerical values of the edge weights are unknown. Clearly, the unknown edge variant is more difficult than the known edge variant. We prove necessary conditions (lower bounds on sample size) for the known edge variant, which are then also valid for the unknown variant. In terms of sufficiency, we provide separate sets of conditions for the known and unknown variants.

In this section, we state our main results and then discuss some of their consequences. We begin with the statement and discussion of necessary conditions in Section 3.1, followed by sufficient conditions in Section 3.2.
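To make the sampling model and the 0-1 risk just described concrete, the following sketch (not part of the original paper) enumerates all 2^p configurations of a small Ising model to compute the partition function Z(θ) of equation (3), draws i.i.d. samples, and evaluates a toy correlation-thresholding decoder. The specific graph, edge weight, threshold, and decoder are illustrative placeholders, not the decoders analyzed later in the paper.

```python
import itertools
import math
import random

def ising_pmf(p, theta):
    """Exact pmf of the pairwise Ising model (2); theta maps edges (s, t) to weights."""
    weights = {}
    for x in itertools.product([-1, +1], repeat=p):
        weights[x] = math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
    Z = sum(weights.values())          # partition function Z(theta), equation (3)
    return {x: w / Z for x, w in weights.items()}

def draw_samples(pmf, n, rng):
    """Draw n i.i.d. configurations X^(1), ..., X^(n) from the exact pmf."""
    configs = list(pmf)
    probs = [pmf[x] for x in configs]
    return rng.choices(configs, weights=probs, k=n)

def correlation_threshold_decoder(data, p, tau):
    """Toy graph decoder: declare an edge wherever the empirical correlation
    exceeds tau in absolute value (a heuristic stand-in, not the ML decoder)."""
    n = len(data)
    return {(s, t) for s in range(p) for t in range(s + 1, p)
            if abs(sum(x[s] * x[t] for x in data)) / n > tau}

# Illustrative member of G_{3,2}(lambda, omega): the chain 0 - 1 - 2 with weight 0.5.
theta = {(0, 1): 0.5, (1, 2): 0.5}
pmf = ising_pmf(3, theta)
data = draw_samples(pmf, n=500, rng=random.Random(0))
decoded = correlation_threshold_decoder(data, p=3, tau=0.35)
print("true edges   :", set(theta))
print("decoded edges:", decoded)
print("0-1 loss     :", int(decoded != set(theta)))
```

The brute-force enumeration is only feasible for tiny p, but it makes the roles of the partition function, the product sampling distribution, and the 0-1 loss explicit.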
We begin with stating some necessary conditions on the sample size n that any decoder must satisfyfor recovery over the families G p,d and G p,k . Recall (6) for the definitions of λ and ω used in thetheorems to follow. Theorem 1 (Necessary conditions for G p,d ) . Consider the family G p,d ( λ, ω ) of Markov randomfields for some ω ≥ . If the sample size is upper bounded as n ≤ max (cid:26) log p λ tanh( λ ) , exp( ω/ dλ log( pd − λ ) , d p d , (cid:27) , (8) then for any graph decoder φ : X n → G p,d , whether given known edge weights or not, max θ ( G ) ∈G p,d ( λ,ω ) P θ ( G ) (cid:2) φ ( X n ) = G (cid:3) ≥ . (9) Remarks:
Let us make some comments regarding the interpretation and consequences of Theorem 1. First, suppose that both the maximum degree d and the minimum edge weight λ remain bounded (i.e., do not increase along the problem sequence). In this case, the necessary conditions (8) can be summarized more compactly as requiring that, for some constant c, a sample size n > c log p / (λ tanh(λ)) is required for bounded degree graphs. The observation of log p scaling has also been made in independent work [6], although the dependence on the signal-to-noise ratio λ given here is more refined. Indeed, note that if the minimum edge weight decreases to zero as the sample size increases, then since λ tanh(λ) = O(λ^2) for λ → 0, we conclude that a sample size n > c′ log p / λ^2 is required, for some constant c′.

Some interesting phenomena arise in the case of growing maximum degree d. Observe that in the family G_{p,d}, we necessarily have ω ≥ λd. Therefore, in the case of growing maximum degree d → +∞, if we wish the bound (8) not to grow exponentially in d, it is necessary to impose the constraint λ = O(1/d). But as observed previously, since λ tanh(λ) = O(λ^2) as λ → 0, we obtain the following corollary of Theorem 1:
Corollary 1.
For the family G_{p,d}(λ, ω) with increasing maximum degree d, there is a constant c > 0 such that, in a worst-case sense, any method requires at least n > c max{d^2, λ^{-2}} log p samples to recover the correct graph with probability at least 1/2.
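As a quick numerical illustration of the λ tanh(λ) = Θ(λ^2) behavior invoked above, the following sketch (an illustrative check, not from the paper) computes the edge correlation E_θ[X_s X_t] = tanh(λ) of a single-edge Ising model by exhaustive enumeration, and compares the resulting symmetrized divergence 2λ tanh(λ) between two distinct single-edge models (Ensemble A of Section 4) with 2λ^2.

```python
import itertools
import math

def edge_correlation(p, theta, s, t):
    """E_theta[X_s X_t] by exhaustive enumeration over {-1,+1}^p (small p only)."""
    num = den = 0.0
    for x in itertools.product([-1, +1], repeat=p):
        w = math.exp(sum(v * x[a] * x[b] for (a, b), v in theta.items()))
        den += w
        num += w * x[s] * x[t]
    return num / den

for lam in [1.0, 0.5, 0.1, 0.01]:
    mu = edge_correlation(3, {(0, 1): lam}, 0, 1)
    # For a single-edge model, mu = tanh(lam); the symmetrized KL divergence
    # between two distinct single-edge models (Ensemble A) is 2 * lam * mu.
    print(f"lam={lam:5.2f}  E[XsXt]={mu:+.4f}  "
          f"2*lam*tanh(lam)={2 * lam * math.tanh(lam):.6f}  "
          f"2*lam^2={2 * lam * lam:.6f}")
```

The output shows the two quantities agreeing as λ → 0, which is exactly the regime in which the log p / (λ tanh λ) term behaves like λ^{-2} log p.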
6e note that Ravikumar et al. [22] have shown that under certain incoherence assumptions(roughly speaking, control on the Fisher information matrix of the distributions P θ ( G ) ) and assumingthat λ = Ω( d − ), a computationally tractable method using ℓ -regularization can recover graphsover the family G p,d using n > c ′ d log p samples, for some constant c ′ ; consequently, Corollary 1shows concretely that this scaling is within a factor d of information-theoretic bound.We now turn to some analogous necessary conditions over the family G p,k of graphs on p verticeswith at most k edges. Theorem 2 (Necessary conditions for G p,k ) . Consider the family G p,k ( λ, ω ) of Markov randomfields for some ω ≥ . If the sample size is upper bounded as n ≤ max (cid:26) log p λ tanh( λ ) , exp( ω ) log( k/ ω exp( λ ) sinh( λ ) , (cid:27) , (10) then for any graph decoder φ : X n → G p,k , whether given known edge weights or not, max θ ( G ) ∈G p,k ( λ,ω ) P θ ( G ) (cid:2) φ ( X n ) = G (cid:3) ≥ . (11) Remarks:
Again, we make some comments about the consequences of Theorem 2. First, suppose that both the number of edges k and the minimum edge weight λ remain bounded (i.e., do not increase along the problem sequence). In this case, the necessary conditions (10) can be summarized more compactly as requiring that, for some constant c, a sample size n > c log p / (λ tanh(λ)) is required for graphs with a constant number of edges. Again, note that if the minimum edge weight decreases to zero as the sample size increases, then since λ tanh(λ) = O(λ^2) for λ → 0, we conclude that a sample size n > c′ log p / λ^2 is required, for some constant c′.

The behavior is more subtle in the case of graph sequences in which the number of edges k increases with the sample size. As shown in the proof of Theorem 2, it is possible to construct a parameter vector θ(G) over a graph G with k edges such that ω*(θ(G)) ≥ λ ⌊√k⌋. (More specifically, the construction is based on forming a completely connected subgraph on ⌊√k⌋ vertices, which has a total of at most k edges.) Therefore, if we wish to avoid the exponential growth from the term exp(ω), we require that λ = O(k^{-1/2}) as the graph size increases. Therefore, we obtain the following corollary of Theorem 2:

Corollary 2.
For the family G p,k ( λ, ω ) with increasing number of edges k , there is a constant c > such that in a worst case sense, any method requires at least n > c max { k, λ − } log p samples torecover the correct graph with probability at least / . To clarify a subtle point about comparing Theorems 1 and 2, consider a graph G ∈ G p,d , say onewith homogeneous degree d at each node. Note that such a graph has a total of k = dp/ n > c pd log p sampleswould be required in this case. However, as shown in our development of sufficient conditions forthe class G p,d (see Theorem 3), this is not true for sufficiently small degrees d .To understand the difference, it should be remembered that our necessary conditions are worst-case results, based on adversarial choices from the graph families. As mentioned, the necessaryconditions of Theorem 2 and hence of Corollary 2 are obtained by constructing a graph G thatcontains a completely connected graph, K √ k , with uniform degree √ k . But K √ k is not a memberof G p,d unless d ≥ √ k . On the other hand, for the case when d ≥ √ k , the necessary conditionsof Corollary 1 amount to n > ck log p samples being required, which matches the scaling given inCorollary 2. 7 .2 Sufficient conditions We now turn to stating and discussing sufficient conditions (lower bounds on the sample size) forgraph recovery over the families G p,d and G p,k . These conditions provide complementary insight tothe necessary conditions discussed so far. Theorem 3 (Sufficient conditions for G p,d ) . (a) Suppose that for some δ ∈ (0 , , the sample size n satisfies n ≥ (cid:2) ω ) + 1 (cid:3) sinh ( λ ) d (cid:8) p + log(2 d ) + log 1 δ (cid:9) . (12) Then there exists a graph decoder φ ∗ : X n → G p,d such that given known edge weights, the worst-caseerror probability satisfies max θ ( G ) ∈G p,d ( λ,ω ) P θ ( G ) (cid:2) φ ∗ ( X n ) = G ] ≤ δ. (13) (b) In the case of unknown edge weights, suppose that the sample size satisfies n > h ω (cid:0) ω + 1 (cid:1) sinh ( λ/ i (cid:8)
16 log p + 4 log(2/δ)}.   (14)

Then there exists a graph decoder φ† : X^n → G_{p,d} that has worst-case error probability at most δ.

Remarks:
It is worthwhile comparing the sufficient conditions provided by Theorem 3 to the necessary conditions from Theorem 1. First, consider the case of finite degree graphs. In this case, the condition (12) reduces to the statement that, for some constant c, it suffices to have n > c λ^{-2} log p samples. Comparing with the necessary conditions (see the discussion following Theorem 1), we see that for known edge weights and bounded degrees, the information-theoretic capacity scales as λ^{-2} log p. For unknown edge weights, the conditions (14) provide a weaker guarantee, namely that n > c′ λ^{-4} log p samples are required; we suspect that this guarantee could be improved by a more careful analysis.

In the case of growing maximum graph degree d, we note that, like the necessary conditions (8), the sample size specified by the sufficient conditions (12) scales exponentially in the parameter ω. If we wish not to incur such exponential growth, we necessarily must have λ = O(1/d). We thus obtain the following consequence of Theorem 3:

Corollary 3. For the graph family G_{p,d}(λ, ω) with increasing maximum degree, there exists a graph decoder that succeeds with high probability using n > c max{d^2, λ^{-2}} d log p samples.

This corollary follows because the scaling λ = O(1/d) implies that λ → 0 as d increases, and sinh(λ/2) = O(λ) as λ →
0. Note that in this regime, Corollary 1 of Theorem 1 showed thatno method has error probability below 1 / n < c max { d , λ − } log p , for some constant c .Therefore, together Theorems 1 and 3 provide upper and lower bounds on the sample complexityof graph selection that are matching to within a factor of d . We note that under the condition λ ≥ c d , the results of Ravikumar et al. [22] also guarantee correct recovery with high probabilityfor n > c d log p using ℓ -regularized logistic regression; however, their method requires additional(somewhat restrictive) incoherence assumptions that are not imposed here.Finally, we state sufficient conditions for the class G p,k in the case of known edge weights:8 heorem 4 (Sufficient conditions for G p,k ) . (a) Suppose that for some δ ∈ (0 , , the sample size n satisfies n > ω ) + 1sinh ( λ ) (cid:0) ( k + 1) log p + log 1 δ (cid:1) . (15) Then for known edge weights, there exists a graph decoder φ ∗ : X n → G p,k such that max θ ( G ) ∈G p,k ( λ,ω ) P θ ( G ) (cid:2) φ ∗ ( X n ) = G ] ≤ δ. (16) (b) For unknown edge weights, there also exists a graph decoder that succeeds under the condi-tion (14) . Remarks:
It is again interesting to compare Theorem 4 with the necessary conditions from Theorem 2. To begin, let the number of edges k remain bounded. In this case, for λ = o(1), condition (15) states that, for some constant c, it suffices to have n > c log p / λ^2 samples, which matches (up to constant factors) the lower bound implied by Theorem 2. In the more general setting of k → +∞, we begin by noting that, as in Theorem 2, the sample size in Theorem 4 grows exponentially unless the parameter ω stays controlled. As with the discussion following Theorem 2, one interesting scaling is to require that λ ≍ k^{-1/2}, a choice which controls the worst-case construction that leads to the factor exp(ω) in the proof of Theorem 2. With this scaling, we have the following consequence:

Corollary 4. Suppose that the minimum value λ scales with the number of edges k as λ ≍ k^{-1/2}. Then in the case of known edge weights, there exists a decoder that succeeds with high probability using n > c k^2 log p samples.

Note that these sufficient conditions are within a factor of k of the necessary conditions from Corollary 2, which show that unless n > c′ max{k, λ^{-2}} log p, any graph estimator fails at least half of the time.

In the following two sections, we provide the proofs of our main theorems. We begin by introducing some background on distances between distributions, as well as some results on the cardinalities of our model classes. We then provide proofs of the necessary conditions (Theorems 1 and 2) in this section, followed by the proofs of the sufficient conditions stated in Theorems 3 and 4 in Section 5.
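Before turning to the proofs, the following sketch tabulates the sample-size orders implied by Corollaries 1-4 at the critical scalings λ ≍ 1/d and λ ≍ k^{-1/2}. All constants are set to one (the theorems' constants are not reproduced here), so the output only illustrates the factor-d and factor-k gaps between the necessary and sufficient conditions discussed above.

```python
import math

def orders_bounded_degree(p, d):
    """Necessary (Corollary 1) vs. sufficient (Corollary 3) orders for G_{p,d}
    at the critical scaling lambda = 1/d; constants dropped, illustration only."""
    lam = 1.0 / d
    necessary = max(d ** 2, lam ** -2) * math.log(p)
    sufficient = max(d ** 2, lam ** -2) * d * math.log(p)
    return necessary, sufficient

def orders_bounded_edges(p, k):
    """Necessary (Corollary 2) vs. sufficient (Corollary 4) orders for G_{p,k}
    at the critical scaling lambda = k**(-1/2); constants dropped."""
    lam = k ** -0.5
    necessary = max(k, lam ** -2) * math.log(p)
    sufficient = k ** 2 * math.log(p)
    return necessary, sufficient

for p, d in [(100, 5), (10000, 20)]:
    nec, suf = orders_bounded_degree(p, d)
    print(f"G_p,d (p={p}, d={d}):  necessary ~ {nec:10.1f}  sufficient ~ {suf:10.1f}  ratio = {suf / nec:.1f}")
for p, k in [(100, 25), (10000, 100)]:
    nec, suf = orders_bounded_edges(p, k)
    print(f"G_p,k (p={p}, k={k}):  necessary ~ {nec:10.1f}  sufficient ~ {suf:10.1f}  ratio = {suf / nec:.1f}")
```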
We begin with some preliminary definitions and results concerning “distance” measures betweendifferent models, and some estimates of the cardinalities of different model classes.
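The divergence measures defined in equations (17)-(19) just below can be evaluated exactly for very small p by enumerating {−1, +1}^p. The following reference sketch (an aid to the reader, not part of the paper) does so, representing each parameter vector as a dictionary of edge weights and forming the averaged model used in the J divergence.

```python
import itertools
import math

def ising_pmf(p, theta):
    """Exact pmf of the Ising model with edge weights theta = {(s, t): w}."""
    weights = {}
    for x in itertools.product([-1, +1], repeat=p):
        weights[x] = math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def kl(p, theta_a, theta_b):
    """D(theta_a || theta_b), equation (17), by exhaustive summation."""
    pa, pb = ising_pmf(p, theta_a), ising_pmf(p, theta_b)
    return sum(pa[x] * math.log(pa[x] / pb[x]) for x in pa)

def sym_kl(p, theta_a, theta_b):
    """Symmetrized divergence S, equation (18)."""
    return kl(p, theta_a, theta_b) + kl(p, theta_b, theta_a)

def j_div(p, theta_a, theta_b):
    """J divergence, equation (19): KL of the averaged model to each endpoint."""
    keys = set(theta_a) | set(theta_b)
    avg = {e: 0.5 * (theta_a.get(e, 0.0) + theta_b.get(e, 0.0)) for e in keys}
    return kl(p, avg, theta_a) + kl(p, avg, theta_b)

# Two single-edge models on p = 3 vertices (Ensemble A of Section 4), lambda = 0.5.
ta, tb = {(0, 1): 0.5}, {(1, 2): 0.5}
print("S =", round(sym_kl(3, ta, tb), 6),
      " (compare 2*lam*tanh(lam) =", round(2 * 0.5 * math.tanh(0.5), 6), ")")
print("J =", round(j_div(3, ta, tb), 6))
```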
In order to quantify the distinguishability of different models, we begin by defining some useful“distance” measures. Given two parameters θ and θ ′ in R ( p ), we let D ( θ k θ ′ ) denote the Kullback-Leibler divergence [10] between the two distributions P θ and P θ ′ . For the special case of the Isingmodel distributions (2), this Kullback-Leibler divergence takes the form D ( θ k θ ′ ) : = X x ∈{− , +1 } p P θ ( x ) log P θ ( x ) P θ ′ ( x ) . (17)9ote that the Kullback-Leibler divergence is not symmetric in its arguments (i.e., D ( θ k θ ′ ) = D ( θ k θ ′ ) in general).Our analysis also makes use of two other closely related divergence measures, both of which aresymmetric. First, we define the symmetrized Kullback-Leibler divergence, defined in the naturalway via S ( θ k θ ′ ) : = D ( θ k θ ′ ) + D ( θ ′ k θ ) . (18)Secondly, given two parameter vectors θ and θ ′ , we may consider the model P θ + θ ′ specified by theiraverage. In terms of this averaged model, we define another type of divergence via J ( θ k θ ′ ) : = D ( θ + θ ′ k θ ) + D ( θ + θ ′ k θ ′ ) . (19)Note that this divergence is also symmetric in its arguments. A straightforward calculation showsthat this divergence measure can be expressed in terms of the cumulant function (3) associatedwith the Ising family as J ( θ k θ ′ ) = log Z ( θ ) Z ( θ ′ ) Z ( θ + θ ′ ) . (20)Useful in our analysis are representations of these distance measures in terms of the vector of mean parameters µ ( θ ) ∈ R ( p ), where element µ st is given by µ st : = E θ [ X s X t ] = X x ∈{− , +1 } p P θ [ X ] X s X t . (21)It is well-known from the theory of exponential families [7, 28] that there is a bijection between thecanonical parameters θ and the mean parameters µ .Using this notation, a straightforward calculation shows that the symmetrized Kullback-Leiblerdivergence between P θ and P θ ′ is equal to S ( θ k θ ′ ) = X s,t ∈ V,s = t (cid:0) θ st − θ ′ st (cid:1) (cid:0) µ st − µ ′ st (cid:1) , (22)where µ st and µ ′ st denote the edge-based mean parameters under θ and θ ′ respectively. In addition to these divergence measures, we require some estimates of the cardinalities of the graphclasses G p,d and G p,k , as summarized in the following: Lemma 1. (a) For k ≤ (cid:0) p (cid:1) / , the cardinality of G p,k is bounded as (cid:18)(cid:0) p (cid:1) k (cid:19) ≤ |G p,k | ≤ k (cid:18)(cid:0) p (cid:1) k (cid:19) , (23) and hence log (cid:12)(cid:12) G p,k (cid:12)(cid:12) = Θ( k log p √ k ) .(b) For d ≤ p − , the cardinality of G p,d is bounded as (cid:20) ⌊ pd + 1 ⌋ ! (cid:21) d ( d +1)2 ≤ |G p,d | ≤ pd (cid:18)(cid:0) p (cid:1) pd (cid:19) . (24) and hence log (cid:12)(cid:12) G p,d (cid:12)(cid:12) = Θ (cid:0) pd log pd (cid:1) . roof. (a) For the bounds (23) on |G p,k | , we observe that there are (cid:0) ( p ) ℓ (cid:1) graphs with exactly ℓ edges, and that for k ≤ (cid:0) p (cid:1) /
2, we have (cid:0) ( p ) ℓ (cid:1) ≤ (cid:0) ( p ) k (cid:1) for all ℓ = 1 , , . . . , k .(b) Turning to the bounds (24) on |G p,d | , observe that every model in G p,d has at most pd edges.Note that d ≤ p − ensures that pd ≤ (cid:18) p (cid:19) / . Therefore, following the argument in part (a), we conclude that |G p,d | ≤ pd (cid:0) p (cid:1) pd as claimed.In order to establish the lower bound (24), we first group the p vertices into d + 1 groups ofsize ⌊ pd +1 ⌋ , discarding any remaining vertices. We consider a subset of G p,d : graphs with maximumdegree d having the property that each component edge straddles vertices in two different groups.To construct one such graph, we pick a permutation of ⌊ pd +1 ⌋ , and form an bijection from group1 to group 2 corresponding to the permutation. Similarly, we form an bijection from group 1 to 3,and so on up until d + 1. Note that use d permutations to complete this procedure, and at the endof this round, every vertex in group 1 has degree d , vertices in all other groups have degree 1.Similarly, in the next round, we use d − d + 1. In general, for i = 1 , . . . , d , in round i , we use d + 1 − i permutations to connect group i withgroups i + 1 , . . . , d + 1. Each choice of these permutations yields a distinct graph in G p,d . Note thatwe use a total of d X i =1 ( d + 1 − i ) = d X ℓ =1 ℓ = d ( d + 1)2permutations over ⌊ pd +1 ⌋ elements, from which the stated claim (24) follows. We provide some background on Fano’s lemma and its variants needed in our arguments. Consider afamily of M models indexed by the parameter vectors { θ (1) , θ (2) , . . . , θ ( M ) } . Suppose that we choosea model index k uniformly at random from { , . . . , M } , and than sample a data set X n of n samplesdrawn in an i.i.d. manner according to a distribution P θ ( k ) . In this setting, Fano’s lemma provides alower bound on the probability of error of any classification function φ : X n → { , . . . , M } , specifiedin terms of the mutual information I ( X n ; K ) = H ( X n ) − H ( X n | K ) (25)between the data X n and the random model index K . We say that a decoder φ : X n → { , . . . , M } is unreliable over the family { θ (1) , . . . , θ ( M ) } ifmax k =1 ,...,M P θ ( k ) (cid:2) φ ( X n ) = k ] ≥ . (26)We summarize Fano’s inequality and a variant thereof in the following lemma: Lemma 2.
Any of the following upper bounds on the sample size imply that any decoder φ isunreliable over the family { θ (1) , . . . , θ ( M ) } :(a) The sample size n is upper bounded as n < log( M/ I ( X n ; K ) . (27)11 b) The sample size n is upper bounded as n < log( M/ M M P k =1 M P ℓ = k +1 S ( θ ( k ) k θ ( ℓ ) ) . (28)These variants of Fano’s inequality are standard and widely-used in the non-parametric statisticsliterature (e.g., [15, 17, 31, 30]); see Cover and Thomas [10] for a statement and proof of the originalFano’s inequality. In order to exploit the condition (28), one needs to construct families of models with relativelylarge cardinality ( M large) such that the models are all relatively close in symmetrized Kullback-Leibler (KL) divergence. Recalling the definition (21) of the mean parameters and the form of thesymmetrized KL divergence (22), we see that control of the divergence between P θ and P θ ′ canbe achieved by ensuring that their respective mean parameters µ st µ st stay relatively close for alledges ( s, t ) where the models differ.In this section, we state and prove a key technical lemma that allows us to control the meanparameters of a certain carefully constructed class of models. As shown in the proofs of Theorems 1and 2 to follow, this lemma allows us to gain good control on the symmetrized Kullback-Leiblerdivergences between pairs of models. Our construction, which applies to any integer m ≥
2, is basedon the following procedure. We begin with the complete graph on m vertices, denoted by K m . Wethen form a set of (cid:0) m (cid:1) graphs, each of which is a subgraph of K m , by removing a particular edge.Denoting by G st the subgraph with edge ( s, t ) removed, we define the Ising model distribution P θ ( G st ) by setting [ θ ( G st )] uv = λ for all edges ( u, v ), and [ θ ( G st )] st = 0.The following lemma shows that the mean parameter µ st = E θ ( G st ) [ X s X t ] approaches its maxi-mum value 1 exponentially quickly in the parameter ω = λm . Lemma 1.
Suppose that ω = λm ≥ . Then the likelihood ratio on edge ( s, t ) is lower bounded as P θ ( G st ) [ X s X t = +1] P θ ( G st ) [ X s X t = −
1] = q st − q st ≥ exp (cid:0) ω − λ (cid:1) m + 1 . (29) and moreover, the mean parameter over the pair ( s, t ) is lower bounded as E θ ( G st ) [ X s X t ] ≥ − m + 1) exp( λ )exp( ω ) + ( m + 1) exp( λ ) . (30) Proof.
Let us introduce the convenient shorthand q st = P θ ( G st ) [ X s X t = 1]. We begin by observingthat the bound (29) implies the bound (30). Indeed, suppose that equation (29) holds, or equiva-lently that q st ≥ b b where b = exp( ω − λ ) m +1 . Observing that E θ ( G st ) [ X s X t ] = 2 q st −
1, we see that q st − q st ≥ b b implies that E θ ( G st ) [ X s X t ] ≥ b b − −
21 + b , from which equation (30) follows. 12he remainder of our proof is devoted to proving the lower bound (29). Some calculation showsthat q st − q st = P mj =0 (cid:0) mj (cid:1) exp (cid:0) λ (cid:2) (2 j − m + 1) − (cid:3)(cid:1)P mj =0 (cid:0) mj (cid:1) exp (cid:0) λ (cid:2) (2 j − m ) (cid:3)(cid:1) . (31)We lower bound the ratio (31) by choosing one of largest terms in the denominator. It can be shownthat for λm ≥
2, the largest terms always lie in the range j > m/ j < m/
4. Accordingly, wemay choose a maximizing point j ∗ > m/
4. Since all the terms in the numerator are non-negative,we have P θ ( G st ) [ X s X t = +1] P θ ( G st ) [ X s X t = − ≥ (cid:0) mj ∗ (cid:1) exp (cid:0) λ (cid:2) (2 j ∗ − m + 1) − (cid:3)(cid:1) ( m + 1) (cid:0) mj ∗ (cid:1) exp (cid:0) λ (cid:2) (2 j ∗ − m ) (cid:3)(cid:1) = exp (cid:0) λ (cid:2) j ∗ − m − (cid:3)(cid:1) m + 1 ≥ exp (cid:0) λ (cid:2) m − (cid:3)(cid:1) m + 1= exp (cid:0) ω − λ (cid:1) m + 1 , which completes the proof of the bound (29). We begin with necessary conditions for the bounded degree family G p,d . The proof is based on ap-plying Fano’s inequality to three ensembles of graphical models, each contained within the family G p,d ( λ, ω ). Ensemble A:
In this ensemble, we consider the set of (cid:0) p (cid:1) graphs, each of which contains a singleedge. For each such graph—say the one containing edge ( s, t ), which we denote by H st —we set[ θ ( H st )] st = λ , and all other entries equal to zero. Clearly, the resulting Markov random fields P θ ( H ) all belong to the family G p,d ( λ, ω ). (Note that by definition, we must have ω ≥ λ for thefamily to be non-empty.)Let us compute the symmetrized Kullback-Leibler divergence between the MRFs indexed by θ ( G st ) and θ ( G uv ). Using the representation (22), we have S ( θ ( H st ) k θ ( H uv )) = λ (cid:26)(cid:0) E θ ( H st ) [ X s X t ] − E θ ( H uv ) [ X s X t ] (cid:1) − (cid:0) E θ ( H st ) [ X u X v ] − E θ ( H uv ) [ X u X v ] (cid:1)(cid:27) = 2 λ E θ ( H st ) [ X s X t ] , since E θ ( H st ) [ X u X v ] = 0 for all ( u, v ) = ( s, t ), and E θ ( H uv ) [ X u X v ] = E θ ( H st ) [ X s X t ]. Finally, bydefinition of the distribution P θ ( H st ) , we have E θ ( H st ) [ X s X t ] = exp( λ ) − exp( − λ )exp( λ ) + exp( − λ ) = tanh( λ ) , so that we conclude that the symmetrized Kullback-Leibler divergence is equal to 2 λ tanh( λ ) foreach pair. 13sing the bound (28) from Lemma 2 with M = (cid:0) p (cid:1) , we conclude that the graph recovery isunreliable (i.e., has error probability above 1 /
2) if the sample size is upper bounded as n < log( (cid:0) p (cid:1) / λ tanh( λ ) . (32) Ensemble B:
In order to form this graph ensemble, we begin with a grouping of the p vertices into ⌊ pd +1 ⌋ groups, each with d + 1 vertices. We then consider the graph G obtained by fully connectingeach subset of d + 1 vertices. More explicitly, G is a graph that contains ⌊ pd +1 ⌋ cliques of size d + 1.Using this base graph, we form a collection of graphs by beginning with G , and then removing asingle edge ( u, v ). We denote the resulting graph by G uv . Note that if p ≥ d + 1), then we canform ⌊ pd + 1 ⌋ (cid:18) d + 12 (cid:19) ≥ pd G uv , we form an associated Markov random field P θ ( G uv ) by setting[ θ ( G uv )] ab = λ > a, b ) in the edge set of G uv , and setting the parameter to zero otherwise.A central component of the argument is the following bound on the symmetrized Kullback-Leibler divergence between these distributions Lemma 2.
For all distinct pairs of models θ ( G st ) = θ ( G uv ) in ensemble B and for all λ ≥ /d ,the symmetrized Kullback-Leibler divergence is upper bounded as S ( θ ( G st ) k θ ( G uv )) ≤ λ d exp( λ )exp( λd ) . Proof.
Note that any pair of distinct parameter vectors θ ( G st ) = θ ( G uv ) differ in exactly two edges.Consequently, by the representation (22), and the definition of the parameter vectors, S ( θ ( G st ) k θ ( G uv )) = λ (cid:0) E θ ( G uv ) [ X s X t ] − E θ ( G st ) [ X s X t ] (cid:1) + λ (cid:0) E θ ( G st ) [ X u X v ] − E θ ( G uv ) [ X u X v ] (cid:1) ≤ λ (cid:0) − E θ ( G st ) [ X s X t ] (cid:1) + λ (cid:0) − E θ ( G uv ) [ X u X v ] (cid:1) , where the inequality uses the fact that λ >
0, and the edge-based mean parameters are upperbounded by 1.Since the model P θ ( G st ) factors as a product of separate distributions over the ⌊ pd +1 ⌋ cliques, wecan now apply the separation result (30) from Lemma 1 with m = d + 1 to conclude that S ( θ ( G st ) k θ ( G uv )) ≤ λ d + 2) exp( λ )exp( λ ( d +1)2 ) + ( d + 2) exp( λ ) ≤ λ d exp( λ )exp( λd ) , as claimed.Using Lemma 2 and applying the bound (28) from Lemma 2 with M = pd yields that forprobability of error below 1 / λd ≥
2, we require at least n > log( pd − S ( θ ( G st ) k θ ( G uv )) ≥ exp( dλ ) log( pd − dλ exp( λ )14amples. Since exp( t/ ≥ t /
16 for all t ≥
1, we certainly need at least n > exp( dλ/ dλ log( pd − λ ) samples. Since ω = dλ in this construction, we conclude that n > exp( ω/ dλ log( pd − λ )samples are required, as claimed in Theorem 1. Ensemble C:
Finally, we prove the third component in the bound (8). In this case, we considerthe ensemble consisting of all graphs in G p,d . From Lemma 1(b), we havelog |G p,d | ≥ d ( d + 1)2 log ⌊ pd + 1 ⌋ ! ≥ d ( d + 1)2 ⌊ pd + 1 ⌋ log ⌊ pd +1 ⌋ e ≥ dp p d . For this ensemble, it suffices to use a trivial upper bound on the mutual information (25), namely I ( X n ; G ) ≤ H ( X n ) ≤ np, where the second bound follows since X n is a collection of np binary variables, each with entropyat most 1. Therefore, from the Fano bound (27), we conclude that the error probability stays above1 / n is upper bounded as n < d log p d , as claimed. We now turn to the proof of necessary conditions for the graph family G p,k with at most k edges.As with the proof of Theorem 2, it is based on applying Fano’s inequality to three ensembles ofMarkov random fields contained in G p,k ( λ, ω ). Ensemble A:
Note that the ensemble (A) previously constructed in the proof of Theorem 1 is alsovalid for the family G p,k ( λ, ω ), and hence the bound (32) is also valid for this family. Ensemble B:
For this ensemble, we choose the largest integer m such that k + 1 ≥ (cid:0) m (cid:1) . Note thatwe certainly have (cid:18) m (cid:19) ≥ ⌊√ k ⌋ ≥ √ k . We then form a family of (cid:0) m (cid:1) graphs as follows: (a) first form the complete graph K m on a subsetof m vertices, and (b) for each ( s, t ) ∈ K m , form the graph G st by removing edge ( s, t ) from K m .We form Markov random fields on these graphs by setting [ θ ( G st )] wz = λ if ( w, z ) ∈ E ( G st ), andsetting it to zero otherwise. Lemma 3.
For all distinct model pairs θ ( G st ) and θ ( G uv ) , we have S ( θ ( G st ) k θ ( G uv )) ≤ ω exp( λ ) sinh( λ )exp( ω ) . (33)15 roof. We begin by claiming for any pair ( s, t ) = ( u, v ), the distribution P θ ( G uv ) (i.e., correspondingto the subgraph that does not contain edge ( u, v )) satisfies P θ ( G uv ) [ X s X t = +1] P θ ( G uv ) [ X s X t = − ≤ exp(2 λ ) P θ ( G st ) [ X s X t = +1] P θ ( G st ) [ X s X t = −
1] = exp(2 λ ) q st − q st , (34)where we have re-introduced the convenient shorthand q st = P θ ( G st ) [ X s X t = 1] from Lemma 1.To prove this claim, let P θ be the distribution that contains all edges in the complete subgraph K m , each with weight λ . Let Z ( θ ) and Z ( θ ( G uv )) be the normalization constants associated with P θ and P θ ( G uv ) respectively. Now since λ > P θ ( G uv ) [ X s X t = +1] P θ ( G uv ) [ X s X t = − ≤ P θ [ X s X t = +1] P θ [ X s X t = − P θ and expand the right-hand side of this expression, recalling thefact that the model P θ ( G st ) does not contain the edge ( s, t ). Thus we obtain P θ [ X s X t = +1] P θ [ X s X t = −
1] = exp( λ ) Z ( θ ( G st )) Z ( θ ) P θ ( G st ) [ X s X t = +1]exp( − λ ) Z ( θ ( G st )) Z ( θ ) P θ ( G st ) [ X s X t = − λ ) P θ ( G st ) [ X s X t = +1] P θ ( G st ) [ X s X t = − , which establishes the claim (34).Finally, from the representation (22) for the symmetrized Kullback-Leibler divergence and thedefinition of the models, S ( θ ( G st ) k θ ( G uv )) = λ (cid:8) E θ ( G st ) [ X u X v ] − E θ ( G uv ) [ X u X v ] (cid:9) + λ (cid:8) E θ ( G uv ) [ X s X t ] − E θ ( G st ) [ X s X t ] (cid:9) = 2 λ (cid:8) E θ ( G uv ) [ X s X t ] − E θ ( G st ) [ X s X t ] (cid:9) , where we have used the symmetry of the two terms. Continuing on, we observe the decomposi-tion E θ ( G st ) [ X s X t ] = 1 − P θ ( G st ) [ X s X t = − S ( θ ( G st ) k θ ( G uv )) = 4 λ (cid:8) P θ ( G st ) [ X s X t = − − P θ ( G uv ) [ X s X t = − (cid:9) = 4 λ n P θ ( Gst ) [ X s X t =+1] P θ ( Gst ) [ X s X t = − + 1 − P θ ( Guv ) [ X s X t =+1] P θ ( Guv ) [ X s X t = − + 1 o ( a ) ≤ λ n q st − q st + 1 − λ ) q st − q st + 1 o = 4 λ q st − q st n
11 + − q st q st − λ ) + − q st q st o = 4 λ (exp(2 λ ) − q st − q st (cid:2) − q st q st (cid:3) (cid:2) exp(2 λ ) + − q st q st (cid:3) where in obtaining the inequality (a), we have applied the bound (34) and recalled our shorthandnotation q st = P θ ( G st ) [ X s X t = +1]. Since λ > − q st ) /q st ≥
0, both terms in the denominatorof the second term are at least one, so that we conclude that S ( θ ( G st ) k θ ( G uv )) ≤ λ (exp(2 λ ) − qst − qst .16inally, applying the lower bound (29) from Lemma 1 on the ratio q st / (1 − q st ), we obtain that S ( θ ( G st ) k θ ( G uv )) ≤ λ (exp(2 λ ) −
1) ( m + 1)exp( ω − λ ) ≤ ω exp( λ ) sinh( λ )exp( ω ) , where we have used the fact that λ ( m + 1) ≤ mλ = 2 ω .By combining Lemma 3 with Lemma 2(b), we conclude that for correctness with probability atleast 1 /
2, the sample size n must be at least n > exp( ω ) log ( m ) ω exp( λ ) sinh( λ ) ≥ exp( ω ) log( k/ ω exp( λ ) sinh( λ ) , as claimed in Theorem 2. We now turn to the proofs of the sufficient conditions given in Theorems 3 and 4, respectively, for theclasses G p,d and G p,k . In both cases, our method involves a direct analysis of a maximum likelihood(ML) decoder, which searches exhaustively over all graphs in the given class, and computes themodel with highest likelihood. We begin by describing this ML decoder and providing a standardlarge deviations bound that governs its performance. The remainder of the proof involves moredelicate analysis to lower bound the error exponent in the large deviations bound in terms of theminimum edge weight λ and other structural properties of the distributions. Given a collection X n = { X (1) , . . . , X ( n ) } of n i.i.d. samples, its (rescaled) likelihood with respectto model P θ is given by ℓ θ ( X n ) : = 1 n n X i =1 log P θ [ X ( i ) ] . (35)For a given graph class G and an associated set of graphical models { θ ( G ) | G ∈ G} , the maximumlikelihood decoder is the mapping φ ∗ : X → G defined by φ ∗ ( X n ) = arg max G ∈G ℓ θ ( G ) ( X n ) . (36)(If the maximum is not uniquely achieved, we choose some graph G from the set of models thatattains the maximum.)Suppose that the data is drawn from model P θ ( G ) for some G ∈ G . Then the ML decoder φ ∗ fails only if there exists some other θ ( G ′ ) = θ ( G ) such that ℓ θ ( G ′ ) ( X n ) ≥ ℓ θ ( G ) ( X n ). (Note that weare being conservative by declaring failure when equality holds). Consequently, by union bound,we have P θ [ φ ∗ ( X n ) = G ] ≤ X G ′ ∈G\ G P (cid:2) ℓ θ ( G ′ ) ( X n ) ≥ ℓ θ ( G ) ( X n ) (cid:3) Therefore, in order to provide sufficient conditions for the error probability of the ML decoder tovanish, we need to provide an appropriate large deviations bound.17 emma 3.
Given n i.i.d. samples X n = { X (1) , . . . , X ( n ) } from P θ ( G ) , for any G ′ = G , we have P (cid:2) ℓ θ ( G ′ ) ( X n ) ≥ ℓ θ ( G ) ( X n ) (cid:3) ≤ exp (cid:0) − n J ( θ ( G ) k θ ( G ′ )) , (37) where the distance S was defined previously (19) .Proof. So as to lighten notation, let us write θ = θ ( G ) and θ ′ = θ ( G ′ ). We apply the Chernoffbound to the random variable V = ℓ θ ′ ( X n ) − ℓ θ ( X n ), thereby obtaining that1 n log P θ [ V ≥ ≤ n inf s> log E θ (cid:2) exp( sV ) (cid:3) = inf s> X x ∈{− , +1 } p (cid:2) P θ ( x ) (cid:3) − s (cid:2) P θ ′ ( x ) (cid:3) s ≤ log Z (cid:0) θ/ θ ′ / −
12 log Z ( θ ) −
12 log Z ( θ ′ ) , where Z ( θ ) denotes the normalization constant associated with the Markov random field P θ , asdefined in equation (3). The claim then follows by applying the representation (20) of J ( θ k θ ′ ). In order to exploit the large deviations claim in Lemma 3, we need to derive lower bounds on thedivergence J ( θ ( G ) k θ ( G ′ )) between different models. Intuitively, it is clear that this divergenceis related to the discrepancy of the edge sets of the two graph. The following lemma makes thisintuition precise. We first recall some standard graph-theoretic terminology: a matching of a graph G = ( V, E ) is a subgraph H such that each vertex in H has degree one. The matching number of G is the maximum number of edges in any matching of G . Lemma 4.
Given two distinct graphs G = ( V, E ) and G ′ = ( V, E ′ ) , let m ( G, G ′ ) be the matchingnumber of the graph with edge set E ∆ E ′ : = ( E \ E ′ ) ∪ ( E ′ \ E ′ ) . Then for any pair of parameter vectors θ ( G ) and θ ( G ′ ) in G ( λ, ω ) , we have J ( θ ( G ) k θ ( G ′ )) ≥ m ( G, G ′ )3 exp(2 ω ) + 1 sinh ( λ . (38) Proof.
Some comments on notation before proceeding: we again adopt the shorthand notation θ = θ ( G ) and θ ′ = θ ( G ′ ). In this proof, we use e j to denote either a particular edge, or the setof two vertices that specify the edge, depending on the context. Given any subset A ⊂ V , we use x A = { x s , s ∈ A } to denote the collection of variables indexed by A .Given any edge e = ( u, v ) with u / ∈ A and v / ∈ A , we define the conditional distribution P eθ [ x A ] ( x u , x v ) = P θ ( x u , x v , x A ) P θ ( x A ) (39)over the random variables x e = ( x u , x v ) indexed by the edge. Finally, we use J ex A ( θ k θ ′ ) : = D (cid:0) P e θ [ xA ]+ θ ′ [ xA ]2 k P eθ [ x A ] (cid:1) + D (cid:0) P e θ [ xA ]+ θ ′ [ xA ]2 k P eθ ′ [ x A ] (cid:1) (40)18o denote the divergence (19) applied to the conditional distributions of ( X u , X v | X A = x A ).With this notation, let M ⊂ E ∆ E ′ be the subset of edges in some maximal matching of thegraph with edge set E ∆ E ′ ; concretely, let us write M = { e , . . . , e m } , and denote by V \ M thesubset of vertices that are not involved in the matching. Note that since J is a combination ofKullback-Leibler (KL) divergences, the usual chain rule for KL divergences [10] also applies to it.Consequently, we have J ( θ k θ ′ ) ≥ m X ℓ =1 X x V \ M ,x e ,...x eℓ − P θ + θ ′ ( x V \ M , x e , . . . , x e ℓ − ) J e ℓ x Sℓ − ( θ k θ ′ ) , where for each ℓ , we are conditioning on the set of variables x S ℓ − : = (cid:0) x V \ M , x e , . . . , x e ℓ − (cid:1) . Finally,from Lemma 7 in Appendix A, for all ℓ = 1 , . . . , m and all values of x S ℓ − , we have J e ℓ x Sℓ − ( θ k θ ′ ) ≥
13 exp(2 ω ) + 1 sinh ( λ , from which the claim follows. We first consider distributions belonging to the class G p,d ( λ, ω ), where λ is the minimum absolutevalue of any non-zero edge weight, and ω is the maximum neighborhood weight (5). Consider apair of graphs G and G ′ in the class G p,d that differ in ℓ = | E ∆ E ′ | edges. Since both graphs havemaximum degree at most d , we necessarily have a matching number m ( G, G ′ ) ≥ ℓ d . Note that theparameter ℓ = | E ∆ E ′ | can range from 1 all the way up to dp , since a graph with maximum degree d has at most dp edges.Now consider some fixed graph G ∈ G p,d and associated distribution P θ ( G ) ∈ G p,d ; we upperbound the error probability P θ ( G ) [ φ ∗ ( X n ) = G ]. For each ℓ = 1 , , . . . , dp , there are at most (cid:0) ( p ) ℓ (cid:1) models in G p,d with mismatch ℓ from G . Therefore, applying the union bound, the large deviationsbound in Lemma 3, and the lower bound in terms of matching from Lemma 4, we obtain P θ ( G ) [ φ ∗ ( X n ) = G ] ≤ pd X ℓ =1 (cid:18)(cid:0) p (cid:1) ℓ (cid:19) exp n − n ℓ/ (4 d )3 exp(2 ω ) + 1 sinh ( λ o ≤ pd max ℓ =1 ,...,pd exp n log (cid:18)(cid:0) p (cid:1) ℓ (cid:19) − n ℓ/ (4 d )3 exp(2 ω ) + 1 sinh ( λ o ≤ max ℓ =1 ,...,pd exp n log( pd ) + ℓ log p − n ℓ/ (4 d )3 exp(2 ω ) + 1 sinh ( λ o . This probability is at most δ under the given conditions on n in the statement of Theorem 3(a). Next we consider the class G p,k of graphs with at most k edges. Given some fixed graph G ∈ G p,k ,consider some other graph G ′ ∈ G p,k such that the set E ∆ E ′ has cardinality m . We claim that foreach m = 1 , , . . . , k , the number of such graphs is at mostTo verify this claim, recall the notion of a vertex cover of a set of edges, namely a subset ofvertices such that each edge in the set is incident to at least one vertex of the set. Note also that the19ertices involved in any maximal matching form a vertex cover. Consequently, any maximal match-ing over the edge set E ∆ E ′ of cardinality m be described in the following (suboptimal) fashion:(i) first specify which of the k edges in E are missing in E ′ ; (ii) describe which of the at most 2 m vertices belong to the vertex cover defined by the maximal matching; and (iii) describe the subset ofat most k vertices that are connected to it. This procedure yields at most 2 k p m p k m = 2 k p m ( k +1) possibilities, as claimed.Consequently, applying the union bound, the large deviations bound in Lemma 3, and the lowerbound in terms of matching from Lemma 4, we obtain P θ ( G ) [ φ ∗ ( X n ) = G ] ≤ k X m =1 k p m ( k +1) exp n − n m ω ) + 1 sinh ( λ o ≤ k max m =1 ,...,k exp n k + 2 m ( k + 1) log p − n m ω ) + 1 sinh ( λ o . This probability is less than δ under the conditions of Theorem 4, which completes the proof. Finally, we prove the sufficient conditions given in Theorem 3(b) and 4(b), which do not assumethat the decoder knows the parameter vector θ ( G ) for each graph G ∈ G p,d . In this case, the simpleML decoder (36) cannot be applied, since it assumes knowledge of the model parameters θ ( G ) foreach graph G ∈ G p,d . A natural alternative would be the generalized likelihood ratio approach,which would maximize the likelihood over each model class, and then compare the maximizedlikelihoods. Our proof of Theorem 3(b) is based on minimizing the distance between the empiricaland model mean parameters in the ℓ ∞ norm, which is easier to analyze. We begin by describing the graph decoder used to establish the sufficient conditions of Theo-rem 3(b). 
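For very small p, the maximum likelihood decoder φ* of equation (36) analyzed above can be implemented literally by enumeration. Before turning to the unknown-weight decoder, the following sketch (illustrative only, with a placeholder data set and a hypothetical candidate class G_{3,2} with known edge weight λ = 0.5) shows that structure.

```python
import itertools
import math

def log_likelihood(data, p, theta):
    """Rescaled log-likelihood (35) of the sample under the Ising model theta."""
    log_Z = math.log(sum(
        math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
        for x in itertools.product([-1, +1], repeat=p)))
    return sum(sum(w * x[s] * x[t] for (s, t), w in theta.items()) - log_Z
               for x in data) / len(data)

def ml_decoder(data, p, candidates):
    """Exhaustive ML decoder (36): candidates maps each graph (a frozenset of
    edges) to its known parameter vector theta(G); ties broken arbitrarily."""
    return max(candidates, key=lambda G: log_likelihood(data, p, candidates[G]))

# Hypothetical candidate class: the three 2-edge graphs on p = 3 vertices,
# each with known edge weight 0.5 (known edge weight variant).
p, lam = 3, 0.5
all_edges = [(0, 1), (0, 2), (1, 2)]
candidates = {frozenset(E): {e: lam for e in E}
              for E in itertools.combinations(all_edges, 2)}

# Placeholder data; in the analysis the samples are drawn i.i.d. from P_theta(G).
data = [(1, 1, 1), (1, 1, -1), (-1, -1, -1), (1, 1, 1), (-1, -1, 1)]
print("decoded edge set:", set(ml_decoder(data, p, candidates)))
```

The exhaustive search over the graph class is, of course, exponentially expensive; the point of the sketch is only to make the object being analyzed explicit.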
For any parameter vector θ ∈ R ( p ), let µ ( θ ) ∈ R ( p ) represent the associated set ofmean parameters, with element ( s, t ) given by [ µ ( θ )] st : = E θ [ X s X t ]. Given a data set X n = { X (1) , . . . , X ( n ) } , the empirical mean parameters are given by b µ st : = 1 n n X i =1 X ( i ) s X ( i ) t . (41)For a given graph G = ( V, E ), let Θ λ,ω ( G ) ⊂ R ( p ) be a subset of exponential parameters thatrespect the graph structure—viz.(a) we have θ uv = 0 for all ( u, v ) / ∈ E ;(b) for all edges ( s, t ) ∈ E , we have | θ st | ≥ λ , and(c) for all vertices s ∈ V , we have P t ∈N ( s ) | θ st | ≤ ω .For any graph G and set of mean parameters µ ∈ R ( p ), we define a projection-type distance via J G ( µ ) = min θ ∈ Θ λ,ω ( G ) k µ − µ ( θ ) k ∞ .We now have the necessary ingredients to define a graph decoder φ † : X n → G p,d ; in particular,it is given by φ † ( X n ) : = arg min G ∈G p,d J G ( b µ ) , (42)20here b µ are the empirical mean parameters previously defined (41). (If the minimum (42) is notuniquely achieved, then we choose some graph that achieves the minimum.) Suppose that the data are sampled from P θ ( G ) for some fixed but known graph G ∈ G p,d , andparameter vector θ ( G ) ∈ Θ λ,ω ( G ). Note that the graph decoder φ † can fail only if there exists someother graph G ′ such that the difference ∆( G ′ ; G ) : = J G ′ ( b µ ) − J G ( b µ ) is not positive. (Again, weare conservative in declaring failure if there are ties.)Let θ ′ denote some element of Θ λ,ω ( G ′ ) that achieves the minimum defining J G ′ ( b µ ), so that J G ′ ( b µ ) = k b µ − µ ( θ ′ ) k ∞ . Note that by the definition of J G , we have J G ( b G ) ≤ k b µ − θ ( G ) k ∞ , where θ ( G ) are the parameters of the true model. Therefore, by the definition of ∆( G ′ ; G ), we have∆( G ′ ; G ) ≥ k b µ − µ ( θ ′ ) k ∞ − k b µ − µ ( θ ( G )) k ∞ ≥ k µ ( θ ′ ) − µ ( θ ( G )) k ∞ − k b µ − µ ( θ ( G )) k ∞ , (43)where the second inequality applies the triangle inequality.Therefore, in order to prove that ∆( G ′ ; G ) is positive, it suffices to obtain an upper bound on k b µ − µ ( θ ( G )) k ∞ , and a lower bound on k µ ( θ ′ ) − µ ( θ ( G )) k ∞ , where θ ′ ranges over Θ λ,ω ( G ′ ). Withthis perspective, let us state two key lemmas. We begin with the deviation between the sampleand population mean parameters: Lemma 5 (Elementwise deviation) . Given n i.i.d. samples drawn from P θ ( G ) , the sample meanparameters b µ and population mean parameters µ ( θ ( G )) satisfy the tail bound P [ k b µ − µ ( θ ( G )) k ∞ ≥ t ] ≤ (cid:0) − n t p (cid:1) . This probability is less than δ for t ≥ q p +log(2 /δ ) n . Our second lemma concerns the separation of the mean parameters of models with differentgraph structure:
Suppose that the data are sampled from $\mathbb{P}_{\theta(G)}$ for some fixed but unknown graph $G \in \mathcal{G}_{p,d}$, and parameter vector $\theta(G) \in \Theta_{\lambda,\omega}(G)$. Note that the graph decoder $\phi^\dagger$ can fail only if there exists some other graph $G'$ such that the difference $\Delta(G';G) := J_{G'}(\hat\mu) - J_G(\hat\mu)$ is not positive. (Again, we are conservative in declaring failure if there are ties.)

Let $\theta'$ denote some element of $\Theta_{\lambda,\omega}(G')$ that achieves the minimum defining $J_{G'}(\hat\mu)$, so that $J_{G'}(\hat\mu) = \|\hat\mu - \mu(\theta')\|_\infty$. Note that by the definition of $J_G$, we have $J_G(\hat\mu) \le \|\hat\mu - \mu(\theta(G))\|_\infty$, where $\theta(G)$ are the parameters of the true model. Therefore, by the definition of $\Delta(G';G)$, we have
$$
\Delta(G';G) \;\ge\; \|\hat\mu - \mu(\theta')\|_\infty - \|\hat\mu - \mu(\theta(G))\|_\infty \;\ge\; \|\mu(\theta') - \mu(\theta(G))\|_\infty - 2\,\|\hat\mu - \mu(\theta(G))\|_\infty, \qquad (43)
$$
where the second inequality applies the triangle inequality.

Therefore, in order to prove that $\Delta(G';G)$ is positive, it suffices to obtain an upper bound on $\|\hat\mu - \mu(\theta(G))\|_\infty$, and a lower bound on $\|\mu(\theta') - \mu(\theta(G))\|_\infty$, where $\theta'$ ranges over $\Theta_{\lambda,\omega}(G')$. With this perspective, let us state two key lemmas. We begin with the deviation between the sample and population mean parameters:

Lemma 5 (Elementwise deviation). Given $n$ i.i.d. samples drawn from $\mathbb{P}_{\theta(G)}$, the sample mean parameters $\hat\mu$ and population mean parameters $\mu(\theta(G))$ satisfy the tail bound
$$
\mathbb{P}\big[\,\|\hat\mu - \mu(\theta(G))\|_\infty \ge t\,\big] \;\le\; 2\exp\Big(-\frac{n t^2}{2} + 2\log p\Big).
$$
This probability is less than $\delta$ for $t \ge \sqrt{\frac{4\log p + \log(2/\delta)}{n}}$.

Our second lemma concerns the separation of the mean parameters of models with different graph structure:

Lemma 6 (Pairwise separations). Consider any two graphs $G = (V,E)$ and $G' = (V,E')$, and an associated set of model parameters $\theta(G) \in \Theta_{\lambda,\omega}(G)$ and $\theta(G') \in \Theta_{\lambda,\omega}(G')$. Then for all edges $(s,t) \in (E\setminus E') \cup (E'\setminus E)$,
$$
\max_{u \in \{s,t\},\, v \in V} \big| \mathbb{E}_{\theta(G)}[X_u X_v] - \mathbb{E}_{\theta(G')}[X_u X_v] \big| \;\ge\; \frac{\sinh^2(\lambda/2)}{4\,\omega\,\big(3\exp(2\omega)+1\big)}.
$$

We provide the proofs of these two lemmas in Sections 5.5.3 and 5.5.4 below. Given these two lemmas, we can complete the proofs of Theorem 3(b) and Theorem 4(b). Using the lower bound (43), with probability greater than $1-\delta$ we have
$$
\Delta(G';G) \;\ge\; \frac{\sinh^2(\lambda/2)}{4\,\omega\,\big(3\exp(2\omega)+1\big)} \;-\; 2\sqrt{\frac{4\log p + \log(2/\delta)}{n}}.
$$
This quantity is positive as long as
$$
n \;>\; \Big[\frac{4\,\omega\,\big(3\exp(2\omega)+1\big)}{\sinh^2(\lambda/2)}\Big]^2\,\big\{16\log p + 4\log(2/\delta)\big\},
$$
which completes the proof. It remains to prove the auxiliary lemmas used in the proof.

5.5.3 Proof of Lemma 5

This claim is an elementary consequence of the Hoeffding bound. By definition, for each pair $(s,t)$ of distinct vertices, we have
$$
\hat\mu_{st} - [\mu(\theta(G))]_{st} \;=\; \frac{1}{n}\sum_{i=1}^{n} X^{(i)}_s X^{(i)}_t \;-\; \mathbb{E}_{\theta(G)}[X_s X_t],
$$
which is the deviation of a sample mean from its expectation. Since the random variables $\{X^{(i)}_s X^{(i)}_t\}_{i=1}^n$ are i.i.d. and lie in the interval $[-1,+1]$, an application of Hoeffding's inequality [16] yields that
$$
\mathbb{P}\big[\,\big|\hat\mu_{st} - [\mu(\theta(G))]_{st}\big| \ge t\,\big] \;\le\; 2\exp\big(-n t^2/2\big).
$$
The lemma follows by applying the union bound over all $\binom{p}{2}$ vertex pairs, and the fact that $\log\binom{p}{2} \le 2\log p$.
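As a quick sanity check on Lemma 5 (a toy simulation, not from the paper), one can estimate the deviation probability for a small Ising model by Monte Carlo and compare it with the tail bound in the form stated above. The model, sample size, and threshold below are arbitrary choices.

```python
import itertools, math, random

def exact_model(theta, p):
    """Configurations, Boltzmann weights, and exact pair means E[X_s X_t] for a small Ising model."""
    configs = list(itertools.product([-1, +1], repeat=p))
    w = [math.exp(sum(v * x[s] * x[t] for (s, t), v in theta.items())) for x in configs]
    Z = sum(w)
    pairs = list(itertools.combinations(range(p), 2))
    mu = {(s, t): sum(wi * x[s] * x[t] for wi, x in zip(w, configs)) / Z for (s, t) in pairs}
    return configs, w, pairs, mu

def deviation_probability(theta, p, n, t, trials, seed=0):
    """Monte Carlo estimate of P[ max_{s<t} |mu_hat_st - mu_st| >= t ]."""
    rng = random.Random(seed)
    configs, w, pairs, mu = exact_model(theta, p)
    hits = 0
    for _ in range(trials):
        data = rng.choices(configs, weights=w, k=n)
        dev = max(abs(sum(x[s] * x[t] for x in data) / n - mu[(s, t)]) for (s, t) in pairs)
        hits += dev >= t
    return hits / trials

if __name__ == "__main__":
    p, n, t = 5, 400, 0.15
    theta = {(0, 1): 0.4, (1, 2): -0.4, (3, 4): 0.4}
    emp = deviation_probability(theta, p, n, t, trials=500)
    bound = min(1.0, 2 * math.exp(-n * t * t / 2 + 2 * math.log(p)))
    print(f"Monte Carlo estimate: {emp:.3f}   Lemma 5 bound: {bound:.3f}")
```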
5.5.4 Proof of Lemma 6

The proof of this claim is more involved. Let $(s,t)$ be an edge in $E \setminus E'$, and let $C$ be the set of all other vertices that are adjacent to $s$ or $t$ in either graph—namely, the set
$$
C := \big\{ u \in V \,\mid\, (u,s) \in E \cup E' \text{ or } (u,t) \in E \cup E' \big\} \;=\; \big(\mathcal{N}(s) \cup \mathcal{N}(t)\big) \setminus \{s,t\}.
$$
Our approach is to condition on the variables $x_C = \{x_u,\, u \in C\}$, and consider the two conditional distributions over the pair $(X_s, X_t)$, defined by $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ respectively. In particular, for any subset $S \subset V$, let us define the unnormalized distribution
$$
Q_\theta(x_S) := \sum_{x_a,\; a \notin S} \exp\Big( \sum_{(u,v) \in E} \theta_{uv}\, x_u x_v \Big), \qquad (44)
$$
obtained by summing out all variables $x_a$ for $a \notin S$. With this notation, we can write the conditional distribution of $(X_s, X_t)$ given $\{X_C = x_C\}$ as
$$
\mathbb{P}_{\theta[x_C]}(x_s, x_t) \;=\; \frac{Q_\theta(x_s, x_t, x_C)}{Q_\theta(x_C)}. \qquad (45)
$$
As reflected in our choice of notation, for each fixed $x_C$, the distribution (45) can be viewed as an Ising model over the pair $(X_s, X_t)$ with exponential parameter $\theta[x_C]$. We define the unnormalized distributions $Q_{\theta'}(x_S)$ and the conditional distributions $\mathbb{P}_{\theta'[x_C]}$ in an analogous manner.

Our approach now is to study the divergence $J(\theta[x_C] \,\|\, \theta'[x_C])$ between the conditional distributions induced by $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. Using Lemma 7 from Appendix A, for each choice of $x_C$ we have $J(\theta[x_C]\,\|\,\theta'[x_C]) \ge \frac{\sinh^2(\lambda/2)}{3\exp(2\omega)+1}$, and hence
$$
\mathbb{E}_{\theta+\theta'}\big[ J(\theta[X_C] \,\|\, \theta'[X_C]) \big] \;\ge\; \frac{\sinh^2(\lambda/2)}{3\exp(2\omega)+1}, \qquad (46)
$$
where the expectation is taken under the model $\mathbb{P}_{\theta+\theta'}$. Some calculation shows that
$$
\mathbb{E}_{\theta+\theta'}\big[ J(\theta[X_C] \,\|\, \theta'[X_C]) \big] \;=\; \mathbb{E}_{\theta+\theta'}\Big[ \log\frac{Q_\theta(X_C)}{Q_{\theta+\theta'}(X_C)} \Big] + \mathbb{E}_{\theta+\theta'}\Big[ \log\frac{Q_{\theta'}(X_C)}{Q_{\theta+\theta'}(X_C)} \Big].
$$
By Jensen's inequality, we have
$$
\mathbb{E}_{\theta+\theta'}\Big[ \log\frac{Q_\theta(X_C)}{Q_{\theta+\theta'}(X_C)} \Big] \;\le\; \log\Big[ \frac{1}{\sum_{x_C} Q_{\theta+\theta'}(x_C)} \sum_{x_C} Q_{\theta+\theta'}(x_C)\,\frac{Q_\theta(x_C)}{Q_{\theta+\theta'}(x_C)} \Big] \;=\; \log\frac{\sum_{x_C} Q_\theta(x_C)}{\sum_{x_C} Q_{\theta+\theta'}(x_C)},
$$
with an analogous inequality for the term involving $Q_{\theta'}$. Consequently, we have the upper bound
$$
\mathbb{E}_{\theta+\theta'}\big[ J(\theta[X_C] \,\|\, \theta'[X_C]) \big] \;\le\; \log\frac{\big[\sum_{x_C} Q_\theta(x_C)\big]\,\big[\sum_{x_C} Q_{\theta'}(x_C)\big]}{\big[\sum_{x_C} Q_{\theta+\theta'}(x_C)\big]^2}, \qquad (47)
$$
which we exploit momentarily.

In order to use this bound, let us upper bound the quantity
$$
\Delta(\theta,\theta') := \mathbb{E}_\theta\big[ D\big(\theta[X_C] \,\|\, (\theta+\theta')[X_C]\big) \big] + \mathbb{E}_{\theta'}\big[ D\big(\theta'[X_C] \,\|\, (\theta+\theta')[X_C]\big) \big].
$$
By the definition of the Kullback–Leibler divergence, we have
$$
\Delta(\theta,\theta') \;=\; \sum_{u \in \mathcal{N}(s)\setminus\{t\}} (\mu_{su} - \mu'_{su})(\theta_{su} - \theta'_{su}) \;+\; \sum_{v \in \mathcal{N}(t)\setminus\{s\}} (\mu_{tv} - \mu'_{tv})(\theta_{tv} - \theta'_{tv}) \;+\; \mathbb{E}_\theta\Big[\log\frac{Q_{\theta+\theta'}(X_C)}{Q_\theta(X_C)}\Big] + \mathbb{E}_{\theta'}\Big[\log\frac{Q_{\theta+\theta'}(X_C)}{Q_{\theta'}(X_C)}\Big]. \qquad (48)
$$
In this equation, the quantities $\mu$ and $\mu'$ denote mean parameters computed under the distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ respectively. But by Jensen's inequality, we have the upper bound
$$
\mathbb{E}_\theta\Big[\log\frac{Q_{\theta+\theta'}(X_C)}{Q_\theta(X_C)}\Big] \;\le\; \log\frac{\sum_{x_C} Q_{\theta+\theta'}(x_C)}{\sum_{x_C} Q_\theta(x_C)}, \qquad (49)
$$
with an analogous upper bound for the term involving $\theta'$.

Combining the bounds (47), (48) and (49), we obtain
$$
\mathbb{E}_{\theta+\theta'}\big[ J(\theta[X_C] \,\|\, \theta'[X_C]) \big] \;\le\; \sum_{u \in \mathcal{N}(s)\setminus\{t\}} (\mu_{su} - \mu'_{su})(\theta_{su} - \theta'_{su}) \;+\; \sum_{v \in \mathcal{N}(t)\setminus\{s\}} (\mu_{tv} - \mu'_{tv})(\theta_{tv} - \theta'_{tv}).
$$
Finally, since $\sum_{u \in \mathcal{N}(s)} |\theta_{us}| \le \omega$ by the definition (5) (and similarly for the neighborhood of $t$, under both $\theta$ and $\theta'$), we conclude that
$$
\mathbb{E}_{\theta+\theta'}\big[ J(\theta[X_C] \,\|\, \theta'[X_C]) \big] \;\le\; 4\,\omega \max_{u \in \{s,t\},\, v \in V} |\mu_{uv} - \mu'_{uv}|.
$$
Combining this upper bound with the lower bound (46) yields the claim.
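The separation guaranteed by Lemma 6 can be checked numerically on small examples. The sketch below (ours; the parameter values are arbitrary, and the bound is evaluated in the reconstructed form stated above) computes exact mean parameters of two models whose graphs differ in the edge $(0,1)$ and compares the observed maximum difference with the stated lower bound.

```python
import itertools, math

def mean_params(theta, p):
    """Exact E[X_u X_v] for all pairs, by enumerating the 2^p configurations."""
    configs = list(itertools.product([-1, +1], repeat=p))
    w = [math.exp(sum(v * x[s] * x[t] for (s, t), v in theta.items())) for x in configs]
    Z = sum(w)
    return {(s, t): sum(wi * x[s] * x[t] for wi, x in zip(w, configs)) / Z
            for (s, t) in itertools.combinations(range(p), 2)}

if __name__ == "__main__":
    p, lam, omega = 4, 0.3, 0.6
    theta  = {(0, 1): 0.3, (2, 3): 0.3}   # graph G: contains the edge (0,1)
    thetap = {(0, 2): 0.3, (2, 3): 0.3}   # graph G': does not contain (0,1)
    mu, mup = mean_params(theta, p), mean_params(thetap, p)
    s, t = 0, 1                           # an edge in E \ E'
    sep = max(abs(mu[tuple(sorted((u, v)))] - mup[tuple(sorted((u, v)))])
              for u in (s, t) for v in range(p) if v != u)
    bound = math.sinh(lam / 2) ** 2 / (4 * omega * (3 * math.exp(2 * omega) + 1))
    print(f"observed separation: {sep:.4f}   Lemma 6 lower bound: {bound:.5f}")
```

Both parameter vectors above satisfy the constraints of $\Theta_{\lambda,\omega}$ (every edge weight has magnitude at least $\lambda = 0.3$, and every neighborhood weight is at most $\omega = 0.6$), so the comparison is meaningful; the observed separation is much larger than the worst-case guarantee, as expected.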
Discussion

In this paper, we have analyzed the information-theoretic limits of binary graphical model selection in a high-dimensional framework, in which the sample size $n$, the number of graph vertices $p$, the number of edges $k$, and/or the maximum vertex degree $d$ are allowed to tend to infinity. We proved four main results, corresponding to both necessary and sufficient conditions for the class $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with maximum vertex degree $d$, as well as for the class $\mathcal{G}_{p,k}$ of graphs on $p$ vertices with at most $k$ edges. More specifically, for the class $\mathcal{G}_{p,d}$, we showed that any algorithm requires at least $c\, d \log p$ samples, and we demonstrated an algorithm that succeeds using at most $c'\, d \log p$ samples. Our two main results for the class $\mathcal{G}_{p,k}$ have a similar flavor: we showed that any algorithm requires at least $c\, k \log p$ samples, and we demonstrated an algorithm that succeeds using at most $c'\, k \log p$ samples. Thus, for graphs with constant degree $d$ or a constant number of edges $k$, our bounds provide a characterization of the information-theoretic complexity of binary graphical model selection that is tight up to constant factors. For growing degrees or edge numbers, there remains a minor gap in our conditions.

In terms of open questions, one immediate issue is to close the current gap between our necessary and sufficient conditions; as summarized above, these gaps are of order $d$ and $k$ for $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$ respectively. We note that previous work by Ravikumar et al. [22] has shown that a computationally tractable method, based on $\ell_1$-regularization and logistic regression, can recover binary graphical models using $n = \Omega(d^3 \log p)$ samples. This result is consistent with the theory given here, and it would be interesting to determine whether or not their algorithm, appealing due to its computational tractability, is actually information-theoretically optimal. Moreover, although in the current paper we have focused exclusively on binary graphical models with pairwise interactions, many of the techniques and results (e.g., constructing "packings" of graph classes, Fano's lemma and variants, large deviations analysis) apply to more general classes of discrete graphical models, and it would be interesting to explore extensions in this direction.

Acknowledgements
This work was partially supported by NSF grants CAREER-0545862 and DMS-0528488 to MJW.
A A separation lemma
In this appendix, we prove the following lemma, which plays a key role in the proofs of both Lemmas 4 and 6. Given an edge $e = (u,v)$ and some subset $U \subseteq V \setminus \{u,v\}$, recall that $J^{e}_{x_U}(\theta\,\|\,\theta')$ denotes the divergence (19) applied to the conditional distributions of $(X_u, X_v \mid X_U = x_U)$, as defined explicitly in equation (40).

Lemma 7.
Consider two distinct graphs $G = (V,E)$ and $G' = (V,E')$, with associated parameter vectors $\theta$ and $\theta'$. Given an edge $(u,v) \in E \setminus E'$ and any subset $U \subseteq V \setminus \{u,v\}$, we have
$$
J^{e}_{x_U}(\theta\,\|\,\theta') \;\ge\; \frac{\sinh^2(\theta_{uv}/2)}{3\exp(2\omega)+1}. \qquad (50)
$$

Proof.
To lighten notation, we define
$$
\epsilon := \frac{2\,\sinh^2(\theta_{uv}/2)}{3\exp(2\omega)+1} \;>\; 0.
$$
Note that from the definition (5), we have $\omega \ge |\theta_{uv}|$, which implies that $\epsilon \le 2$. For future reference, we also note the relation
$$
\big[\exp(\theta_{uv}/2) - \exp(-\theta_{uv}/2)\big]^2 \;=\; 2\epsilon + 6\epsilon\exp(2\omega). \qquad (51)
$$
With this set-up, our argument proceeds via proof by contradiction. In particular, we assume that
$$
J^{e}_{x_U}(\theta\,\|\,\theta') \;\le\; \epsilon/2. \qquad (52)
$$
Throughout the proof, we write $Q_\theta(x_A)$ for the unnormalized distribution applied to the subset of variables $x_A = \{x_i,\, i \in A\}$. With a little bit of algebra, we find that
$$
J^{e}_{x_U}(\theta\,\|\,\theta') \;=\; \log \frac{Q_\theta(x_U)\, Q_{\theta'}(x_U)}{\Big( \sum_{z_1 \ldots z_p:\; z_U = x_U} \sqrt{Q_\theta(z)\, Q_{\theta'}(z)} \Big)^2}.
$$
Let us introduce some additional shorthand so as to lighten notation in the remainder of the proof. First we define $\beta(x) := \sqrt{Q_\theta(x)\, Q_{\theta'}(x)}$, as well as
$$
\alpha(x) := \sqrt{\frac{Q_\theta(x)}{Q_{\theta'}(x)}}.
$$
Now observe that $\alpha(x) = \exp(\Delta(x)/2)$, where $\Delta(x) := \sum_{(s,t) \in E} \theta_{st} x_s x_t - \sum_{(s,t) \in E'} \theta'_{st} x_s x_t$. Observe that Lemma 8 in Appendix B characterizes the behavior of $\Delta(x)$ under changes to $x$. Finally, we define the set
$$
\mathcal{Y}(x_U) := \big\{ y \in \{-1,+1\}^p \,\mid\, y_i = x_i \text{ for all } i \in U \big\},
$$
corresponding to the subset of configurations $y \in \{-1,+1\}^p$ that agree with $x_U$ over the subset $U$. From the definitions of $\alpha$ and $\beta$, we observe that
$$
\Big[ \sum_{y \in \mathcal{Y}(x_U)} Q_\theta(y) \Big] \Big[ \sum_{y \in \mathcal{Y}(x_U)} Q_{\theta'}(y) \Big] \;=\; \Big[ \sum_{y \in \mathcal{Y}(x_U)} \alpha(y)\beta(y) \Big] \Big[ \sum_{y \in \mathcal{Y}(x_U)} \frac{\beta(y)}{\alpha(y)} \Big] \;\le\; (1+\epsilon) \Big[ \sum_{y \in \mathcal{Y}(x_U)} \beta(y) \Big]^2, \qquad (53)
$$
where the inequality follows from the fact that $\epsilon \le 2$, our original assumption (52), and the elementary relation $\exp(z) < 1 + 2z$, valid for all $z \in (0,1)$.

Now consider the family of quadratic equations in $t$, one for each $y \in \mathcal{Y}(x_U)$, given by
$$
\alpha(y)\beta(y) \;-\; 2(1+\epsilon)\,\beta(y)\, t \;+\; \frac{\beta(y)}{\alpha(y)}\, t^2 \;=\; 0.
$$
Summing these quadratic equations over $y \in \mathcal{Y}(x_U)$ yields
$$
q(t) := \sum_{y \in \mathcal{Y}(x_U)} \alpha(y)\beta(y) \;-\; 2t(1+\epsilon) \sum_{y \in \mathcal{Y}(x_U)} \beta(y) \;+\; t^2 \sum_{y \in \mathcal{Y}(x_U)} \frac{\beta(y)}{\alpha(y)} \;=\; 0,
$$
which by equation (53) must have two real roots. Let $t^*$ denote the value of $t$ at which $q(\cdot)$ achieves its minimum. By the quadratic formula, we have
$$
t^* \;=\; \frac{(1+\epsilon)\sum_{y \in \mathcal{Y}(x_U)} \beta(y)}{\sum_{y \in \mathcal{Y}(x_U)} \big(\beta(y)/\alpha(y)\big)} \;>\; 0.
$$
Since $q(t^*) < 0$, we obtain
$$
2\epsilon\, t^* \sum_{y \in \mathcal{Y}(x_U)} \beta(y) \;>\; \sum_{y \in \mathcal{Y}(x_U)} \beta(y)\, t^* \Big[ \sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*} \Big]^2. \qquad (54)
$$
Using the notation
$$
\mathcal{Y}^*(x_U) := \Big\{ y \in \mathcal{Y}(x_U) \,\Big|\, \max\big\{ t^*/\alpha(y),\; \alpha(y)/t^* \big\} < \exp(|\theta_{uv}|) \Big\},
$$
we can rewrite equation (54) as
$$
2\epsilon\, t^* \sum_{y \in \mathcal{Y}^*(x_U)} \beta(y) \;>\; \sum_{y \notin \mathcal{Y}^*(x_U)} \beta(y)\, t^* \Big\{ \Big[ \sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*} \Big]^2 - 2\epsilon \Big\} \qquad (55)
$$
$$
\;\stackrel{(a)}{\ge}\; 6\epsilon\, t^* \exp(2\omega) \sum_{y \notin \mathcal{Y}^*(x_U)} \beta(y),
$$
where inequality (a) follows from the definition of $\mathcal{Y}^*(x_U)$, the monotonically increasing nature of the function $f(s) = (s - 1/s)^2$ for $s \ge 1$, and the relation (51).

From Lemma 8, for each $y \in \mathcal{Y}^*(x_U)$, we obtain a configuration $a \notin \mathcal{Y}^*(x_U)$ by flipping either $y_u$, $y_v$, or both. Note that at most three configurations $y \in \mathcal{Y}^*(x_U)$ can yield the same configuration $z \notin \mathcal{Y}^*(x_U)$. Since these flips do not decrease $\beta(y)$ by more than a factor of $\exp(2\omega)$, we conclude that
$$
\sum_{y \in \mathcal{Y}^*(x_U)} \beta(y) \;\le\; 3\exp(2\omega) \sum_{y \notin \mathcal{Y}^*(x_U)} \beta(y),
$$
which is a contradiction of equation (55). Hence the quadratic $q(\cdot)$ cannot have two real roots, which contradicts our initial assumption (52).

B Proof of a flipping lemma
It remains to state and prove a lemma that we exploited in the proof of Lemma 7 from Appendix A.
Lemma 8.
Consider distinct models $\theta$ and $\theta'$, and for each $x \in \{-1,+1\}^p$, define
$$
\Delta(x) := \sum_{(u,v) \in E} \theta_{uv} x_u x_v \;-\; \sum_{(u,v) \in E'} \theta'_{uv} x_u x_v. \qquad (56)
$$
Then for any edge $(s,t) \in E \setminus E'$ and for any configuration $x \in \{-1,+1\}^p$, flipping either $x_s$ or $x_t$ (or both) changes $\Delta(x)$ by at least $|\theta_{st}|$.

Proof. We use $\mathcal{N}(s)$ and $\mathcal{N}'(s)$ to denote the neighborhood sets of $s$ in the graphs $G = (V,E)$ and $G' = (V,E')$ respectively, with analogous notation for the sets $\mathcal{N}(t)$ and $\mathcal{N}'(t)$. We then define
$$
\delta_s(x) := \sum_{u \in \mathcal{N}(s)\setminus\mathcal{N}'(s)} \theta_{su} x_u, \qquad \text{and} \qquad \delta'_s(x) := \sum_{u \in \mathcal{N}'(s)\setminus\mathcal{N}(s)} \theta'_{su} x_u,
$$
with analogous definitions for the quantities $\delta_t(x)$ and $\delta'_t(x)$. Similarly, we define
$$
\gamma_s(x) := \sum_{u \in \mathcal{N}(s)\cap\mathcal{N}'(s)} \theta_{su} x_u, \qquad \text{and} \qquad \gamma_t(x) := \sum_{v \in \mathcal{N}(t)\cap\mathcal{N}'(t)} \theta_{tv} x_v.
$$
Finally, let the contributions to the first and second sums defining $\Delta(x)$ from edges not involving $s$ or $t$ be $\mu(x)$ and $\mu'(x)$ respectively, namely
$$
\mu(x) := \sum_{\substack{(u,v) \in E \\ u,v \notin \{s,t\}}} \theta_{uv} x_u x_v, \qquad \text{and} \qquad \mu'(x) := \sum_{\substack{(u,v) \in E' \\ u,v \notin \{s,t\}}} \theta'_{uv} x_u x_v.
$$
We first show that $\Delta(x)$ must change when $(x_s, x_t)$ are flipped: to the contrary, suppose that $\Delta(x)$ stays fixed for all four choices $(x_s, x_t) \in \{-1,+1\}^2$. We then show that this assumption implies that $\theta_{st} = 0$. Note that both of the terms $\delta_s(x)$ and $\delta_t(x)$ include a contribution from the edge $(s,t)$. When $(x_s, x_t) = (+1,+1)$, we have
$$
(\delta_s(x) - \theta_{st}) + (\delta_t(x) - \theta_{st}) + \theta_{st} + \mu(x) + \gamma_s(x) + \gamma_t(x) \;=\; \delta'_s(x) + \delta'_t(x) + \mu'(x) + \gamma_s(x) + \gamma_t(x) + \Delta(x),
$$
whereas when $(x_s, x_t) = (-1,-1)$,
$$
-(\delta_s(x) + \theta_{st}) - (\delta_t(x) + \theta_{st}) + \theta_{st} + \mu(x) - \gamma_s(x) - \gamma_t(x) \;=\; -\delta'_s(x) - \delta'_t(x) + \mu'(x) - \gamma_s(x) - \gamma_t(x) + \Delta(x).
$$
Adding these two equations together yields the equality
$$
\mu(x) - \theta_{st} \;=\; \mu'(x) + \Delta(x). \qquad (57)
$$
On the other hand, for $(x_s, x_t) = (-1,+1)$, we have
$$
-(\delta_s(x) - \theta_{st}) + (\delta_t(x) + \theta_{st}) - \theta_{st} + \mu(x) - \gamma_s(x) + \gamma_t(x) \;=\; -\delta'_s(x) + \delta'_t(x) + \mu'(x) - \gamma_s(x) + \gamma_t(x) + \Delta(x),
$$
and for $(x_s, x_t) = (+1,-1)$,
$$
(\delta_s(x) + \theta_{st}) - (\delta_t(x) - \theta_{st}) - \theta_{st} + \mu(x) + \gamma_s(x) - \gamma_t(x) \;=\; \delta'_s(x) - \delta'_t(x) + \mu'(x) + \gamma_s(x) - \gamma_t(x) + \Delta(x).
$$
Adding together these two equations yields
$$
\mu(x) + \theta_{st} \;=\; \mu'(x) + \Delta(x). \qquad (58)
$$
Note that equations (57) and (58) cannot hold simultaneously unless $\theta_{st} = 0$, which implies that our initial assumption—namely, that $\Delta(x)$ does not change as we vary $(x_s, x_t) \in \{-1,+1\}^2$—was false.

Finally, we show that $\Delta(x)$ must in fact change by at least $|\theta_{st}|$. For each pair $(i,j) \in \{-1,+1\}^2$, let $E_{ij} = \Delta(x \mid x_s = i, x_t = j)$ be the value of $\Delta(x)$ when $x_s = i$ and $x_t = j$. Suppose that for some constant $c$ and $\epsilon > 0$, we have $E_{ij} \in [c - \epsilon, c + \epsilon]$ for all $(i,j)$. By following the same reasoning as above, we obtain the inequalities $\mu(x) - \theta_{st} \ge \mu'(x) + c - \epsilon$ and $\mu(x) + \theta_{st} \le \mu'(x) + c + \epsilon$, which together imply that $\theta_{st} \le \epsilon$. In a similar manner, we obtain the inequalities $\mu(x) + \theta_{st} \ge \mu'(x) + c - \epsilon$ and $\mu(x) - \theta_{st} \le \mu'(x) + c + \epsilon$, which imply that $-\theta_{st} \le \epsilon$, thereby completing the proof.
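Lemma 8 is also easy to verify exhaustively on small random instances. The following Python check (ours, purely illustrative; the model sizes and the random weights are arbitrary) draws random pairs of models, picks an edge in $E \setminus E'$, and confirms that from every configuration at least one of the three flips of $(x_s, x_t)$ changes $\Delta(x)$ by at least $|\theta_{st}|$.

```python
import itertools, random

def Delta(x, theta, theta_p):
    """Delta(x) = sum_E theta_st x_s x_t - sum_E' theta'_st x_s x_t, as in (56)."""
    return (sum(v * x[s] * x[t] for (s, t), v in theta.items())
            - sum(v * x[s] * x[t] for (s, t), v in theta_p.items()))

def flipping_holds(p, theta, theta_p, s, t):
    """Check: from every x, some flip of (x_s, x_t) changes Delta(x) by at least |theta_st|."""
    target = abs(theta[(s, t)])
    for x in itertools.product([-1, +1], repeat=p):
        base = Delta(x, theta, theta_p)
        moves = []
        for flip_s, flip_t in [(True, False), (False, True), (True, True)]:
            y = list(x)
            if flip_s:
                y[s] = -y[s]
            if flip_t:
                y[t] = -y[t]
            moves.append(abs(Delta(tuple(y), theta, theta_p) - base))
        if max(moves) < target - 1e-12:
            return False
    return True

if __name__ == "__main__":
    rng = random.Random(1)
    p = 5
    pairs = list(itertools.combinations(range(p), 2))
    for _ in range(200):
        theta   = {e: rng.uniform(-1, 1) for e in rng.sample(pairs, 4)}
        theta_p = {e: rng.uniform(-1, 1) for e in rng.sample(pairs, 4)}
        only_in_E = [e for e in theta if e not in theta_p]
        if not only_in_E:
            continue
        s, t = only_in_E[0]
        assert flipping_holds(p, theta, theta_p, s, t), "Lemma 8 violated?"
    print("Lemma 8 held on all sampled instances.")
```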
References

[1] A. Ahmed, L. Song, and E. P. Xing. Time-varying networks: Recovering temporally rewiring genetic networks during the life cycle of Drosophila melanogaster. Technical report, arXiv, Carnegie Mellon University, 2008.

[2] N. Alon and J. Spencer.
The Probabilistic Method. Wiley Interscience, New York, 2000. [3] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data.
Jour. Mach. Learn. Res., 9:485–516, March 2008. [4] R. J. Baxter.
Exactly solved models in statistical mechanics . Academic Press, New York, 1982.[5] J. Besag. On the statistical analysis of dirty pictures.
Journal of the Royal Statistical Society, Series B, 48(3):259–279, 1986. [6] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: Some easy observations and algorithms. Technical report, arXiv, UC Berkeley, 2008. [7] L. D. Brown.
Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA, 1986. [8] D. Chickering. Learning Bayesian networks is NP-complete.
Proceedings of AI and Statistics, 1995. [9] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees.
IEEE Trans. Info. Theory , IT-14:462–467, 1968.[10] T.M. Cover and J.A. Thomas.
Elements of Information Theory. John Wiley and Sons, New York, 1991. [11] I. Csiszár and Z. Talata. Consistent estimation of the basic neighborhood structure of Markov random fields.
The Annals of Statistics , 34(1):123–145, 2006.[12] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, editors.
Biological Sequence Analysis. Cambridge University Press, Cambridge, 1998. [13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso.
Biostatistics, 2007. [14] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
IEEE Trans. PAMI, 6:721–741, 1984. [15] R. Z. Has'minskii. A lower bound on the risks of nonparametric estimates of densities in the uniform metric.
Theory Prob. Appl. , 23:794–798, 1978.[16] W. Hoeffding. Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58:13–30, 1963. [17] I. A. Ibragimov and R. Z. Has'minskii.
Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York, 1981. [18] E. Ising. Beitrag zur Theorie des Ferromagnetismus.
Zeitschrift für Physik, 31:253–258, 1925. [19] C. Ji and L. Seymour. A consistent model selection procedure for Markov random fields based on penalized pseudolikelihood.
Annals of Applied Prob., 6(2):423–443, 1996. [20] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC algorithm.
Journal of Machine Learning Research, 8:613–636, 2007. [21] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso.
Annals of Statistics, 34:1436–1462, 2006. [22] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional graph selection using ℓ1-regularized logistic regression. Annals of Statistics, 2008. To appear. [23] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation: Convergence rates of ℓ1-regularized log-determinant divergence. Technical report, Department of Statistics, UC Berkeley, September 2008. [24] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2009. [25] N. Santhanam, J. Dingel, and O. Milenkovic. On modeling gene regulatory networks using Markov random fields. In
Information Theory Workshop , Volos, Greece, June 2009.[26] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction and search.
MIT Press , 2000.[27] F. Vega-Redondo.
Complex Social Networks. Econometric Society Monographs. Cambridge University Press, Cambridge, 2007. [28] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference.
Foundations and Trends in Machine Learning, 1(1–2):1–305, December 2008. [29] S. Wasserman and K. Faust.
Social Network Analysis: Methods and Applications. Cambridge University Press, New York, NY, 1994. [30] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence.
Annals of Statistics , 27(5):1564–1599, 1999.[31] B. Yu. Assouad, Fano and Le Cam. In
Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, Berlin, 1997. [32] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.