Model Selection in Undirected Graphical Models with the Elastic Net
Mihai Cucuringu (Applied Mathematics, Princeton University), Jesús Puente (Applied Mathematics, Princeton University), and David Shue (Computer Science, Princeton University)

Abstract
Structure learning in random fields has attracted considerable attention due to its difficulty and importance in areas such as remote sensing, computational biology, natural language processing, protein networks, and social network analysis. We consider the problem of estimating the probabilistic graph structure associated with a Gaussian Markov Random Field (GMRF), the Ising model and the Potts model, by extending previous work on l1-regularized neighborhood estimation to include the elastic net l1 + l2 penalty. Additionally, we show numerical evidence that the edge density plays a role in the graph recovery process. Finally, we introduce a novel method for augmenting neighborhood estimation by leveraging pair-wise neighborhood union estimates.

Edge sparsity in an undirected graphical model (Markov Random Field) encodes conditional independence via graph separation. Essentially, graphical models disentangle the global interconnections between the random variables of a joint distribution into localized neighborhoods. Any distribution P(X) consistent with the graphical model must abide by these simplifying constraints; thus, the graph learning problem is equivalent to a model class selection problem. Let G = (V, E) be an undirected graph on p = |V(G)| vertices and m = |E(G)| edges, and let X = (X_1, ..., X_p) ∈ 𝒳^p denote a random vector with distribution P(X), where variable X_i is associated with vertex i ∈ V(G). Graphical model selection attempts to find the simplest graph, often dubbed the concentration graph, consistent with the underlying distribution.

Recent work in graphical model selection exploits the local structure of the underlying distribution to derive consistent neighborhoods for each random variable. In terms of graphs, the neighborhood set of a vertex r is N(r) = {t ∈ V(G) : (r, t) ∈ E}. More importantly, for undirected graphical models, N(r) is the Markov blanket of r: X_r is rendered conditionally independent of all other variables given N(r), i.e. X_r ⊥⊥ X_{V \ ({r} ∪ N(r))} | X_{N(r)}. To estimate the neighborhood conditional probabilities P(X_r | X_{\setminus r}), these methods employ pseudo-likelihood measures, specifically l1-regularized regression: the lasso [11]. Compared to other l_p penalty based regularization schemes, the l1 penalty enjoys the dual properties of convexity and sparsity by straddling the boundary between the two domains. By treating X_r as the response variable and X_{\setminus r} as the predictors in a generalized linear model, the l1 regularization penalty can recover an appropriately sparse representation of N(r). Reconstructing the full edge set of the graph from the estimated neighborhoods N̂(r) allows for two alternate definitions: Ê_AND = {(a, b) : a ∈ N̂(b) ∧ b ∈ N̂(a)}, which we call the AND rule, and Ê_OR = {(a, b) : a ∈ N̂(b) ∨ b ∈ N̂(a)}, which we call the OR rule.
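As an illustration, the following minimal Python sketch applies the two rules to a dictionary of estimated neighborhoods; the function name reconstruct_edges and the toy dictionary N_hat are our own, not part of the paper.

```python
from itertools import combinations

def reconstruct_edges(N_hat, rule="AND"):
    """N_hat: dict mapping each vertex to its estimated neighbor set."""
    edges = set()
    for a, b in combinations(N_hat, 2):
        in_a, in_b = b in N_hat[a], a in N_hat[b]
        keep = (in_a and in_b) if rule == "AND" else (in_a or in_b)
        if keep:
            edges.add((a, b))
    return edges

# Toy example: the AND rule drops the one-sided pair (0, 2), the OR rule keeps it.
N_hat = {0: {1, 2}, 1: {0}, 2: set()}
print(reconstruct_edges(N_hat, "AND"))  # {(0, 1)}
print(reconstruct_edges(N_hat, "OR"))   # {(0, 1), (0, 2)}
```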
Ravikumar et al. [8], [9] consider the problem of estimating the graph structure associated with a Gaussian Markov Random Field and with the Ising model. Their main result shows that, under certain assumptions, the neighborhoods can be accurately estimated with a sample size of n = Ω(d^3 log p) in the high dimensional regime p ≫ n, where d is the maximum degree. Note that the number of samples needed is further improved in [8] to Ω(d^2 log p) for GMRF model selection. Meinshausen et al. [7] also examine the GMRF case, and provide an asymptotic analysis of consistency under relatively mild conditions, along with an alternate choice of the λ penalty parameter.

We build upon this previous work by extending the l1-penalized neighborhood estimation framework to use the elastic net [13] l1 + l2 penalty, and by expanding the scope of graphical model recovery to include the multinomial discrete case. While the lasso performs beautifully in many settings, it has its drawbacks. In particular, when p > n, the lasso can select at most n variables. Moreover, for highly correlated covariates, the lasso tends to select a single variable to represent the entire group. By incorporating the l2 penalty term, the elastic net is able to retain the lasso's sparsity while selecting highly correlated variables together.

Additionally, we introduce a novel scheme for augmenting neighborhood recovery by pooling pair-wise neighborhood union estimates. The idea is to infer the joint neighborhood of a pair of nodes (i, j) (not necessarily adjacent), and to obtain the neighborhood of node i by combining the information given by all the pairs containing i. The frequency with which nodes appear in a specially designed neighbor list for node i gives us a weighted ranking of nodes in terms of their neighbor likelihood. This method can be combined with the usual neighborhood recovery to extract more information from a possibly insufficient set of samples.

Undirected graphical models encode the factorization of potential functions over cliques, which in their most basic form are comprised of first and second order interactions, i.e. functions that map node and edge values to the real line:

P(X) = \frac{1}{Z} \prod_{(s,t) \in E} \phi_{s,t}(X_s, X_t) \prod_{u \in V} \phi_u(X_u),

which for maximum-entropy exponential family distributions can be written as

P(X) = \frac{1}{Z} \exp\Big( \sum_{(s,t) \in E} \phi_{s,t}(X_s, X_t) + \sum_{u \in V} \phi_u(X_u) \Big).

Note that Z represents the normalization constant, or partition function. For continuous random variables, the most common exponential family MRF representation is the multivariate Gaussian, with sufficient statistics {X_s, X_s^2 : s ∈ V} ∪ {X_s X_t : (s, t) ∈ E}:

P(X) = \frac{1}{Z} \exp\Big( \sum_{r \in V} \theta_r X_r + \frac{1}{2} \sum_{s \in V} \sum_{t \in V} \Theta_{st} X_s X_t \Big).    (1)

The p × p symmetric pairwise parameter matrix Θ, known as the inverse covariance matrix of X, encodes the partial correlations between pairs of nodes given the remaining nodes. Every edge (s, t) ∈ E has a non-zero entry in Θ, and each row s of Θ specifies the graph neighborhood N(s). Conversely, the sparsity of Θ reveals the conditional independencies of the graph, with Θ_st = 0 for all (s, t) ∉ E. Conditional neighborhood expectations can be represented by a linear model: E(X_s | X_{\setminus s}) = \sum_{t \in N(s)} \theta_{st} X_t.
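The following small numerical sketch (toy sizes and parameter values of our own choosing) illustrates the standard fact behind this linear model: the regression coefficient of X_t in E(X_s | X_{\setminus s}) equals -Θ_st / Θ_ss, so it is nonzero exactly when (s, t) is an edge.

```python
import numpy as np

p = 5
edges = [(0, 1), (1, 2), (3, 4)]          # edges of a toy graph
Theta = 2.0 * np.eye(p)                   # diagonal chosen large enough for positive definiteness
for i, j in edges:
    Theta[i, j] = Theta[j, i] = 0.4

Sigma = np.linalg.inv(Theta)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=20000)

s = 1
others = np.arange(p) != s
beta = -Theta[s, others] / Theta[s, s]    # population regression coefficients
ols = np.linalg.lstsq(X[:, others], X[:, s], rcond=None)[0]
print(np.round(beta, 2))                  # nonzero only at the neighbors of s: [-0.2 -0.2  0.  0.]
print(np.round(ols, 2))                   # the empirical estimate is close to beta
```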
In the binary case, the MRF distribution can be described by an Ising model, where X_s ∈ {-1, +1} for all s ∈ V and φ_st(X_s, X_t) = θ_st X_s X_t. The full probability distribution takes the following form, which omits first order terms:

P(X) = \frac{1}{Z} \exp\Big( \sum_{(s,t) \in E} \theta_{st} X_s X_t \Big).    (2)

The conditional neighborhood probability P(X_s | X_{\setminus s}) is

P(X_s \mid X_{\setminus s}) = \frac{\exp\big( 2 X_s \sum_{(s,t) \in E} \theta_{st} X_t \big)}{\exp\big( 2 X_s \sum_{(s,t) \in E} \theta_{st} X_t \big) + 1}.    (3)

Taking the Hessian of the local conditional probability gives the Fisher information matrix for X_s. Much like the partial correlations in the Gaussian concentration matrix, zero entries in the Fisher information matrix indicate conditional independence.
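To make equation (3) concrete, here is a short Python sketch of the conditional probability and of the single-site Gibbs update it induces; the symmetric coupling matrix theta (zero diagonal, zeros off the edge set) is a toy parameterization of our own.

```python
import numpy as np

def p_conditional(x, theta, s):
    """P(X_s = x_s | X_rest) for x in {-1, +1}^p, following equation (3)."""
    a = np.exp(2.0 * x[s] * (theta[s] @ x))   # theta[s, s] = 0, so the sum runs over neighbors only
    return a / (a + 1.0)

def gibbs_update(x, theta, s, rng):
    """Resample coordinate s from its conditional distribution P(X_s | X_rest)."""
    x_plus = x.copy()
    x_plus[s] = 1                             # probability that the new value is +1
    x[s] = 1 if rng.random() < p_conditional(x_plus, theta, s) else -1
    return x
```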
Extending the discrete parameterization to variables with k > 2 states, the potentials φ now describe a set of parameterized indicator variables I(X_s = l, X_t = m) representing the k^2 possible value pairs between X_s and X_t:

P(X) = \frac{1}{Z} \exp\Big( \sum_{s \in V} \sum_{i=1}^{k} \theta_{s:i} I(X_s = i) + \sum_{(s,t) \in E} \sum_{l=1}^{k} \sum_{m=1}^{k} \theta_{st:lm} I(X_s = l, X_t = m) \Big).

As described in [12], this particular representation is overcomplete, since the indicator functions satisfy a variety of linear relationships, e.g. \sum_{i=1}^{k} I(X_s = i) = 1. However, despite the lack of a guaranteed unique solution, the factorization can still satisfy the desired neighborhood recovery criterion. A simplified variant of the general discrete parameterization is the Potts model, where each φ is defined by two indicator functions denoting node agreement and disagreement for arbitrary k > 2. We observe that in the Ising model the form of φ may be recast as φ_st(X_s, X_t) = θ_st I(X_s = X_t) - θ_st I(X_s ≠ X_t), so that

P(X) = \frac{1}{Z} \exp\Big( \sum_{s \in V} \sum_{i=1}^{k} \theta_{s:i} I(X_s = i) + \sum_{(s,t) \in E} \big[ \theta_{st} I(X_s = X_t) - \theta_{st} I(X_s \neq X_t) \big] \Big).

Note that the Potts model requires only a single parameter per edge and generalizes the Ising model to k states.

To extend neighborhood estimation from the binary Ising case to the discrete parameterization, we note that the neighborhood conditional probability takes the form

P(X_s = d \mid X_{\setminus s}) = \frac{\exp\big\{ \theta_{s:d} + \sum_{(s,t) \in E} \sum_{m=1}^{k} \theta_{st:dm} I(X_s = d, X_t = m) \big\}}{\sum_{l=1}^{k} \exp\big\{ \theta_{s:l} + \sum_{(s,t) \in E} \sum_{m=1}^{k} \theta_{st:lm} I(X_s = l, X_t = m) \big\}},    (4)

which is equivalent, after a variable transformation from the discrete feature space to indicators, to the classical multinomial logistic regression equation

P(X_s = d \mid X_{\setminus s}) = \frac{\exp\{ \theta_d^T X_{N(s),d} \}}{\sum_{l=1}^{k} \exp\{ \theta_l^T X_{N(s),l} \}},    (5)

where X_{N(s),l} = {I(X_s = l, X_t = m) : (s, t) ∈ E}, with an additional singleton indicator variable I(X_s = l) always set to 1. With the conditional probability equations in hand, we can approach the problem of neighborhood estimation as a generalized linear regression. Building on previous model selection work using the l1 penalty, we extend the approach to use the combined l1 + l2 penalty of the elastic net [13], which for the basic linear model takes the form

L(\lambda_1, \lambda_2, \theta) = \| x_s - X_{\setminus s} \theta \|_2^2 + \lambda_1 \| \theta \|_1 + \lambda_2 \| \theta \|_2^2.

The elastic net surpasses the l1 penalty under noisy conditions and when groups of highly correlated variables exist in the graph. However, as noted by Bunea [2], the additional l2 smoothing penalty should be small relative to the l1 term in order to preserve sparsity. Many authors have extended the elastic net penalty to additional regression models, covering a broad swath of the generalized linear realm. For the linear Gaussian case, we use the original elastic net package of Zou and Hastie [13]. For binary and multinomial regression we rely on the glmnet library of Friedman et al. [4].
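As a concrete (if approximate) analogue of this setup, the sketch below runs elastic net neighborhood selection for the Gaussian case with scikit-learn rather than the R packages used in the paper; scikit-learn's (alpha, l1_ratio) parameterization is a rescaling of (λ1, λ2), so the correspondence holds only up to constants, and the parameter values shown are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def estimate_neighborhood(X, s, alpha=0.05, l1_ratio=0.9):
    """Regress column s on the remaining columns; return indices with nonzero coefficients."""
    p = X.shape[1]
    others = np.delete(np.arange(p), s)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
    model.fit(X[:, others], X[:, s])
    return set(others[np.abs(model.coef_) > 1e-8])

# N_hat = {s: estimate_neighborhood(X, s) for s in range(X.shape[1])}
# edges = reconstruct_edges(N_hat, "AND")   # AND/OR rules sketched in the introduction
```

For the binary and multinomial cases the same idea applies with a penalized logistic regression in place of the squared loss, e.g. scikit-learn's LogisticRegression(penalty="elasticnet", solver="saga") as a rough stand-in for glmnet.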
To evaluate the elastic net for Gaussian MRF model selection, we generate the inverse covariance matrix Θ = Σ^{-1} of the distribution in the following way. We set Θ_ij to a fixed nonzero value for every (i, j) ∈ E, and then perturb the diagonal of the matrix, Θ_ii = τ, with τ large enough to force all eigenvalues of Θ to be positive. We choose τ experimentally, starting from 1 and increasing it in small increments until Θ is positive definite. Sampling from the discrete models is more delicate, since a standard Gibbs sampler mixes very slowly when the temperature (which in our case is governed by the θ's) is low. To overcome this difficulty, we turn to the Swendsen-Wang algorithm. This method generates an augmented graph G̃ = (Ṽ, Ẽ), where Ṽ = V ∪ E and Ẽ contains (v, e) iff v ∈ V and e ∈ E are incident. Given this formulation, G̃ is bipartite between the V nodes and the E nodes. Thus, in the joint distribution of G̃, the Markov blanket of E consists only of elements of V, and vice versa. The random variables assigned to E can take only the values 0 and 1. We define the conditional probabilities of V and E as follows:

• P_G̃(e = 1 | V) is given by considering the nodes s, t ∈ G such that e is incident with s and t: P_G̃(e = 1 | V) = 1 - e^{-J} if x_s = x_t, and 0 otherwise.

• P_G̃(V | E) is such that all nodes in the same component (in the graph in which we keep only the edges e with e = 1) have the same value, and each component takes each of the k possible values with equal probability.

Essentially, the algorithm generates MCMC samples by alternately updating the values of V and E using Gibbs sampling. Although the augmentation substantially increases the number of vertices, the resulting Markov chain explores the space of outcomes much more rapidly. For the details of the Swendsen-Wang algorithm, we refer the reader to [10] and [6].
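The following is a compact sketch of the Swendsen-Wang sweep just described, for a k-state Potts model with a uniform coupling J; the use of networkx for the connected components and the specific function names are our own choices.

```python
import numpy as np
import networkx as nx

def swendsen_wang_sample(G, k, J, n_sweeps, rng=None):
    """Return one Potts configuration after n_sweeps bond/spin updates."""
    rng = np.random.default_rng(rng)
    nodes = list(G.nodes())
    x = {v: rng.integers(k) for v in nodes}          # random initial spins
    p_bond = 1.0 - np.exp(-J)                        # P(e = 1 | V) when the endpoints agree
    for _ in range(n_sweeps):
        # Bond step: activate each edge independently given the current spins.
        active = [(s, t) for s, t in G.edges() if x[s] == x[t] and rng.random() < p_bond]
        H = nx.Graph()
        H.add_nodes_from(nodes)
        H.add_edges_from(active)
        # Spin step: each connected component gets a uniformly random common value.
        for comp in nx.connected_components(H):
            val = rng.integers(k)
            for v in comp:
                x[v] = val
    return np.array([x[v] for v in nodes])

# Example: draw a few samples from a Potts model on a small random graph.
G = nx.gnm_random_graph(20, 40, seed=0)
samples = np.stack([swendsen_wang_sample(G, k=3, J=1.0, n_sweeps=50, rng=i) for i in range(5)])
```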
By introducing an l2 penalty term into the regression model, the maximization problem in the elastic net setup becomes

\hat{\theta}^{s, \lambda_1, \lambda_2} = \arg\min_{\theta : \theta_s = 0} \| X_s - X\theta \|_2^2 + \lambda_1 \| \theta \|_1 + \lambda_2 \| \theta \|_2^2,    (6)

with λ1 = c \sqrt{\log p / n}. We experimented with several values of λ2 for different numbers of samples, and observed the type I and type II error probabilities. In Figure 1 we show 3D plots of the total error rates as a function of the number of samples and the λ2 parameter, for the AND and the OR neighborhood estimation. Several observations can be made from these plots. First, larger values of λ2 perform worse than smaller values, no matter what the sample size is; the best recovery rates are achieved when λ2 is very small. Also note that the AND neighborhood selection performs much worse than its OR counterpart when λ2 is large.

Figure 1: Total error rates as a function of λ2 and the number of samples n, for the AND neighborhood selection (left) and the OR neighborhood selection (right).

Figure 2 plots the error rates versus the sample size, for a fixed λ1 = c \sqrt{\log p / n}. Note that the chosen graph has 40 vertices and maximum degree d = 25, and it cannot be recovered without errors even when the number of samples scales as 5 d log(p).

Figure 2: Error rates for a random graph with p = 40 and λ1 = \sqrt{\log(p) / n}, as a function of the sample size n.

The first family of graphs we tested were the star graphs and the more general cliques of stars. A star graph consists of one node of degree q connected to q vertices of degree 1. We denote by Star(a, b) the clique-star graph obtained from a copies of a star graph by connecting all star centers among them; in other words, we add b different neighbors to each vertex of a clique of size a. Star(1, b) is just the standard star graph. Note that the maximum degree in Star(a, b) is d = a + b - 1 and the number of vertices is p = a(b + 1). For a graph G we let ρ(G) denote its edge density, i.e. ρ = |E| / \binom{n}{2}. The reason we introduce this parameter in our simulations is to observe the impact of the edge density on the recovery rates when the maximum degree and the number of samples are fixed. As our simulations show, recovering the graph structure is significantly harder when the graph has a higher density but the same fixed maximum degree d. To test this, we generate a star graph with maximum degree d and edge density ρ, and then start adding edges among the lower degree neighbors to obtain a new graph with a larger edge density ρ' > ρ. Note that this edge density dependence can be equivalently formulated in terms of the average degree d̄ of a graph via the formula d̄ = (p - 1)ρ. The error rates are averaged over 20 runs for a fixed graph G on different samples of size n. Additionally, λ2 was chosen by discretizing the interval [0, \sqrt{\log p / n}] into 15 equally sized subintervals.

Figure 3 plots the error recovery rates for G = Star(1, 24) with ρ = 0.16 and for the graph G' obtained by adding edges to G until ρ = 0.76, while keeping the maximum degree d = 24. Note that even with a number of samples on the order of d^2 log(p), we recover G' only with a 0.30 error rate. We repeat the above experiment for the clique-star graph H = Star(6, 4), with p = 30 and edge density ρ = 0.18, and the graph H' obtained by adding edges to H while keeping the maximum degree d = 9 unchanged. In this case d^2 log(p) = 275, and we successfully recover H' only when the sample size exceeds 1400, due to the higher edge density (plot omitted). Figure 4 shows the error rates for the denser graph H'; in these experiments the additional λ2 penalty was rarely of any help, and in most cases λ2 = 0 achieved the best error rates.

A second type of graph we considered was the community graph, denoted Com(s, t, β_in, β_out), which consists of s groups of highly connected nodes, where each group has size t, so p = st. Two vertices within the same group (or community) are connected with probability β_in, while nodes that belong to two different communities share an edge with a smaller probability β_out. Such community structures are a common feature of complex networks, and have the property that nodes within a group are much more connected to each other than to the rest of the network (for β_in > β_out). In applications, these communities may represent groups of related individuals in social networks, topically related web pages, or biochemical pathways, and thus their identification is of central importance. To completely understand the modular structure of such graphs, one should be able to both detect overlapping communities and make meaningful statements about their hierarchies [5].
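For concreteness, here is one way (our own sketch, using networkx) to generate the two graph families used in these experiments; the β_in value in the example call is an arbitrary illustrative choice.

```python
import networkx as nx
import random

def clique_star(a, b):
    """Clique of a centers, each attached to b private leaves: p = a*(b+1), max degree a+b-1."""
    G = nx.complete_graph(a)                       # the a star centers form a clique
    for c in range(a):
        for l in range(b):
            G.add_edge(c, a + c * b + l)           # b fresh leaves per center
    return G

def community_graph(s, t, beta_in, beta_out, seed=0):
    """s groups of t nodes; edge probability beta_in within a group, beta_out across groups."""
    rng = random.Random(seed)
    G = nx.empty_graph(s * t)
    for i in range(s * t):
        for j in range(i + 1, s * t):
            p_edge = beta_in if i // t == j // t else beta_out
            if rng.random() < p_edge:
                G.add_edge(i, j)
    return G

H = clique_star(6, 4)                                   # p = 30, max degree 9, as in the experiments
C = community_graph(4, 8, beta_in=0.8, beta_out=0.15)   # beta_in = 0.8 chosen only for illustration
```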
Figure 5 plots the error rates in the recovery of a Com(4, 8, β_in, β_out) graph with β_out = 0.15, p = 32, d = 15, and ρ = 0.28. Here d^2 log(p) = 780; however, even with 9000 samples, the error rate is still over 10 percent. When few samples are available, λ2 = 0 achieves the best error rates, but as we increase the number of samples we notice that the elastic net does slightly better for some λ2 > 0 than for λ2 = 0. While this improvement is not significant, it hints that the additional l2 regularization may produce better results in some cases.

Figure 3: Error recovery rates (y-axis) versus the λ2 parameter (x-axis) for the graphs G = Star(1, 24) with ρ = 0.16 (top) and G' with ρ = 0.76 (bottom).

Figure 4: Error recovery rates (y-axis) versus the λ2 parameter (x-axis) for the graph H'.

Figure 5: Error rates for the community graph Com(4, 8, β_in, β_out) with β_out = 0.15, p = 32, d = 15, and ρ = 0.28.

Figures 6 and 7 depict the performance of the elastic net neighborhood estimator over a range of discrete MRF graphs. The graphs evaluated for the Ising and Potts models are random graphs with bounded maximum degree. All experiments were run over a sample range covering the d log p edge recovery threshold, and over α values ranging from 0.5 to 1, where α = λ1 / (λ1 + λ2). Results from multiple trials were averaged for the Ising model. Due to the computational load of the glmnet multinomial regression for large data sizes, only single-run results are shown for the Potts model. Smaller α values are omitted from the plots in order to limit the scale and improve clarity. Unless otherwise noted, the plots represent AND neighborhood unions, with OR neighborhood unions showing similar performance.

As seen in the plots, the elastic net neighborhood estimator recovers the underlying graph with high probability under the corresponding Ω(d log p) sample sizes, validating our formulation of the discrete model neighborhood estimation as a multinomial logistic regression. Similar to the Gaussian case, the l2 penalty λ2 (while λ1 is set to \sqrt{\log(p) / n}) tends to benefit neighborhood recovery mostly at small sample sizes and when α is close to 1. Oversized l2 penalties introduce an inordinate number of noise edges, but small l2 penalties reduce the chance of missing edges with weak correlation which the l1 penalty rejects. When the l2 penalty is non-zero, the minimization function is strictly convex and allows the estimator to select additional nodes that exhibit highly correlated behavior by effectively averaging their contribution.

Figure 6: Type I and II error rates for an Ising model graph with 64 nodes and max degree 10. While α ≈ 1 minimizes the false positive rate, the pure l1 penalty induces false negatives even for large sample sizes; choosing an α < 1 reduces the number of missed edges.

Figure 7: Type I and II error rates for a Potts model graph with 64 nodes, max degree 10, and 4 states. When α = 1 the FPR is minimized over all sample sizes. However, again, the l1 penalty induces false negatives for small sample sizes.

The λ2 parameter provides, in essence, a trade-off between precision and recall, as can be seen in Figure 8, especially when the number of samples is small, which is the case in many high dimensional p ≫ n applications. While the α = 1 curve consistently provides the highest graph recovery precision over all sample sizes, the actual number of recovered edges may be extremely limited due to the sparsity constraint. For the Ising model graph depicted in the figure, the smallest sample size, 1200, with α = 1 gives a precision of 0.88 with a recall of 0.7. By introducing a λ2 term with α = 0.8, the precision drops to 0.8 but the recall improves to 0.79, which is better than a one-to-one trade-off. As expected, with large sample sizes the benefit of the λ2 parameter diminishes, as shown by the nearly vertical slope of the large-sample-size curves. Similarly, the Potts model recall-precision plot displays the same trend, albeit in a compressed fashion, since the neighborhood estimator is able to recover the graph at a smaller sample size, rendering the larger sample size curves uninformative.

Figure 8: Recall-precision curves for the Ising and Potts models with 64 nodes and max degree 10. Curves correspond to α values varying over the sample sizes represented by the vertical dashed black curves, which increase to the right; the leftmost vertical curve corresponds to sample size 1200. Although the α = 1 pure l1 penalty dominates in precision, choosing a smaller α permits a trade-off of precision for recall that is particularly beneficial for small sample sizes.

From these results we can say that the additional presence of an l2 penalty may yield substantial benefits in p ≫ n situations where the goal is to extract relevant correlation information from small sample sizes.
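For reference, the precision and recall numbers quoted above are the usual edge-set quantities; a small helper (names of our own choosing) makes the computation explicit.

```python
def precision_recall(true_edges, est_edges):
    """Edge-set precision and recall, treating edges as unordered pairs."""
    true_edges = {tuple(sorted(e)) for e in true_edges}
    est_edges = {tuple(sorted(e)) for e in est_edges}
    tp = len(true_edges & est_edges)
    precision = tp / len(est_edges) if est_edges else 1.0
    recall = tp / len(true_edges) if true_edges else 1.0
    return precision, recall

# Example: two of three estimated edges are correct, two of three true edges are found.
print(precision_recall({(0, 1), (1, 2), (2, 3)}, {(0, 1), (1, 2), (0, 3)}))
```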
8, the precision drops to 0.8 but therecall improves to 0.79, which is better than a 1 to 1 trade-off. As expected, withlarge sample sizes, the benefit of the λ parameter diminishes as shown by thenearly vertical slope of the large sample size curves. Similarly, the Potts modelrecall-precision plot also displays this trend, albeit in a compressed fashion sincethe neighborhood estimator is able to recover the graph at a smaller sample size,rendering the larger sample size curves uninformative. From these results we cansay that the additional presence of an l penalty may yield substantial benefitsfor p (cid:29) n situations where the goal is to extract relevant correlation informationfrom small sample sizes. The technique introduced in this section is useful in neighborhood reconstruc-tion especially in the case of regular graphs, graphs with a small gap betweenknown maximum and minimum degree, or when we would like to obtain a like-lihood ranking of the p − i . The idea is toinfer neighborhoods not only for one vertex at a time, but for pairs of vertices,which may or may not be adjacent. We denote by ˆ N ( i ) the estimation of theneighborhood of node i , as given by the optimization in equation (6).We denote by N ij the set of neighbors of nodes i and j , i.e. N ij = { v ∈ V ( G ) \{ i, j } : ( i, v ) ∈ E or ( j, v ) ∈ E } , in other words N ij is the union of theneighborhoods of nodes i and j , minus the edge ( i, j ) if it exists. We now definethe following optimization problem similar to the one in equation (6)10igure 8: Recall-Precision curve for Ising and Potts model with 64 nodes andmax degree 10. Horizontal curves correspond to α values varying over samplesizes represented by the vertical dashed black curves increasing to the right. Theleft most vertical curve corresponds to sample size 1200. Although the α = 1pure l penalty dominates precision performance, choosing a smaller α permitsa trade-off in precision for recall that is particularly beneficial for small samplesizes(ˆ θ I , ˆ θ J ) λ ,λ := arg min θ I ,θ J : θ Ii =0 ,θ Jj =0 (cid:107) X i − Xθ I + X j − Xθ J (cid:107) + λ (cid:107) θ I (cid:107) + λ (cid:107) θ I (cid:107) + λ (cid:107) θ J (cid:107) + λ (cid:107) θ J (cid:107) (7)After grouping the terms of λ and λ and approximating (cid:107) θ I (cid:107) + (cid:107) θ J (cid:107) by (cid:107) θ I + θ J (cid:107) , and (cid:107) θ I (cid:107) + (cid:107) θ J (cid:107) by (cid:107) θ I + θ J (cid:107) , we approximate (5) byarg min θ,θ i =0 ,θ j =0 (cid:107) ( X i + X j ) − Xθ (cid:107) + λ (cid:107) θ (cid:107) + λ (cid:107) θ (cid:107) =: ˆ θ I,J,λ ,λ (8)where in the last step we make the change of variable θ = θ I + θ J and wedenote by ˆ θ I,J,λ ,λ the regression coefficients of the sum of variables I and J against the remaining variable. We now define the estimated neighborhood ofa pair of vertices ( i, j ), not necessarily adjacent, to be ˆ N ij = { v ∈ V ( G ) \{ i, j } :ˆ θ ij,λ ,λ ( v ) (cid:54) = 0 } , in other words ˆ N ij is an estimate of N ij . Note that for avertex v ∈ ˆ N ij it may be the case that v is adjacent to either i or j or perhapsboth.For a fixed node i , we obtain its neighborhood in the following way. We let L i denote the list obtained by concatenating the pair neighborhoods of node i L i = (cid:71) j (cid:54) = i N ij where (cid:70) denotes union with repetitions. We denote by ˆ L i the concatenation ofthe estimated pair neighborhoods, i.e. ˆ L i = (cid:70) j (cid:54) = i ˆ N ij . 
Note that L_i includes all nodes that are neighbors of i, each appearing with multiplicity p - 2: if j is a neighbor of i, then j ∈ N_ik for every k ≠ i, j, i.e. for p - 2 choices of k. If there are no estimation errors, so that N̂_ij = N_ij and hence L̂_i = L_i, then, with the exception described in the next paragraph, we can correctly recover the neighborhood of node i by picking the most frequent elements of L̂_i, i.e. all nodes which appear in the list exactly p - 2 times. In practice we therefore estimate the neighborhood of i by selecting the most frequent elements in L̂_i. Also, if there are no errors, L_i contains every other node j ≠ i of G at least once. This is obvious if j ∼ i, as j appears p - 2 times in L_i, as explained above. If j ≁ i, pick k a neighbor of j (such a k exists since we assume G is connected); then j ∈ N_ik ⊆ L_i.

Note that a non-neighbor node j of i can also appear p - 2 times in L_i if j is connected to all nodes in V(G) \ {i}, in which case we (incorrectly) add j to the neighborhood of node i. Similarly, if i is connected to all nodes in V(G) \ {j}, then i appears in L_j with multiplicity p - 2 and we incorrectly mark i as being in the neighborhood of node j. In other words, i and j appear in each other's neighborhood lists, and thus render our approach incorrect, whenever i and j are both connected to all of the other p - 2 nodes.

For one node i in our simulations, for instance, we obtained N(i) = {…}, N̂(i) = {…}, and L̂_i = {…, 15(22), 35(18), 32(17), …}, where bold face indicates true neighbors and the numbers in parentheses denote frequencies in L̂_i. As shown in Figure 9, true neighbors of node i occur most frequently in L̂_i. When ordering the vertices in L̂_i by their frequency, most of the true neighbors appear at the top of the list. However, errors do occur, and false neighbors sometimes precede true neighbors. Note, however, that there are cases where the single neighborhood estimation performs much worse and omits many true neighbors.

We have seen how histograms based on neighborhoods of pairs are useful in determining a likelihood ranking of the possible neighbors of a given node: the top t_i most frequent elements in L̂_i are the most likely neighbors of i. The problem now becomes how to select this threshold value t_i for each list. If t_i is too small then true neighbors might be left out, and if t_i is too large then we will introduce false neighbors. We also make an additional observation that improves the accuracy of the ordering obtained from L̂_i. Denote by L the matrix formed from the lists L̂_i by letting L_ij equal the frequency of node j in the list L̂_i. To incorporate the symmetry between two neighboring nodes (if i is a neighbor of j then j is also a neighbor of i), we build the symmetric matrix S = L + L^T. The intuition here is to average out the votes received by nodes i and j in their respective rows. Suppose that (i, j) ∈ E but j does not rank highly in L̂_i; it may still be the case that i ranks highly in L̂_j, which helps in identifying that i and j are indeed neighbors in G. One can think of this method as averaging out the bad information (noisy edges) and boosting the good information (correct edges). Finally, another alternative is to first row-normalize L and then construct the symmetric matrix described above. We divide each row of L by the largest entry in that row and obtain the matrix L̄, whose row i may be thought of as ranked likelihoods of the possible neighbors of i. We then build the symmetrized matrix S̄ = L̄ + L̄^T, as described above.
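The aggregation step that turns the pair estimates into the lists L̂_i, the count matrix L, and the symmetrized score matrix S = L + L^T can be sketched as follows; the function names are our own, and estimate_pair_neighborhood refers to the pair regression sketched earlier.

```python
import numpy as np

def neighbor_scores(X, estimate_pair_neighborhood, **kw):
    """Build the frequency matrix L (row i counts occurrences in L_hat_i) and S = L + L^T."""
    p = X.shape[1]
    L = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            est = estimate_pair_neighborhood(X, i, j, **kw)
            for v in est:
                L[i, v] += 1          # v appears once more in L_hat_i ...
                L[j, v] += 1          # ... and once more in L_hat_j
    S = L + L.T                       # vote symmetrization
    return L, S

# ranking_i = np.argsort(-S[i]) lists candidate neighbors of node i, most likely first;
# with an oracle degree d_i one keeps the top d_i entries.
```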
As mentioned earlier, the main problem is finding the threshold t_i for each row i that separates the neighbors from the non-neighbors of i. If we know a priori what the degree of each node is, then one way to pick the neighbors is to select the most frequent d_i entries in L̂_i. Alternatively, if we know that the graph is almost regular of degree r, in other words that the average degree of G is r and the degree distribution has very little variance around r, then we can again select the top r most frequent entries in L̂_i. Another way to choose a threshold is to plot the frequency values in decreasing order and look for a big jump. This idea is illustrated in the right plot of Figure 9. Note how the frequency values decrease suddenly within two steps, from 26 (for node 6) to 22 (for node 15) and then to 18 (for node 35). Such large sudden drops in the ordered list of frequency values hint at a good threshold point.

Table 1 shows the results of an experiment that illustrates the above ideas when we have oracle information about the degree of each node. We observe that the symmetric matrix S works better than just using L, and that S̄ performs better than S. Note also that the type I and II errors are now more balanced in both the AND and the OR case. When using the estimate N̂ alone, in the AND case almost all the errors came from missing edges, while in the OR scenario almost all errors were false edges. Picking the threshold t_i allows for a trade-off between these two types of errors.

Method      AND: I    AND: II    AND: Total    OR: I    OR: II    OR: Total
N̂            …         …          …             …        …         …
N̂, L         …         …          …             …        …         …
N̂, S         …         …          …             …        …         …
N̂, S̄         …         …          …             …        …         …

Table 1: Type I, type II, and total error rates under the AND and OR edge rules. The estimate N̂ is obtained by the original single-neighborhood method; the remaining three rows rely on histograms of neighborhoods of pairs when the node degrees are given.

It would be interesting to compare the results of our new pair neighborhood recovery to the original single neighborhood method when both techniques take the node degrees into account. One way to incorporate this information into the single neighborhood method is to avoid 10-fold cross validation and replace it with the following procedure. We have seen that when regressing variable i against all other variables, the elastic net method does not return a single θ^i vector corresponding to the optimal λ value, but rather computes an entire matrix (or sequence) in which each row corresponds to a value of λ at which an additional variable "turns on". Instead of picking the empirically optimal row via 10-fold cross validation as our θ^i vector, we can simply pick the first row which has r_i nonzero entries, where r_i is the degree of node i. One can also interpret the order in which the remaining p - 1 variables turn on as a likelihood ranking: a variable that activates early is more likely to be a neighbor of i than a node which activates later on. This ranking can also be obtained from our pair-wise neighborhood union estimate; however, we produce more than just a simple ranking. The frequency of each node in L̂_i combines additional information, and one can interpret this ordering as a weighted ranking of possible neighbors of i. This additional information may capture longer range correlations that elude single neighborhood estimation, as suggested in a recent paper of Bento and Montanari [1].
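A sketch of this degree-informed alternative to cross-validation, using scikit-learn's enet_path as a stand-in for the elastic net regularization path: walk the path from the sparse end and stop at the first point with r_i active variables.

```python
import numpy as np
from sklearn.linear_model import enet_path

def neighborhood_with_degree_oracle(X, s, r_s, l1_ratio=0.9):
    """Return the support of the first point on the path with at least r_s active variables."""
    p = X.shape[1]
    others = np.delete(np.arange(p), s)
    alphas, coefs, _ = enet_path(X[:, others], X[:, s], l1_ratio=l1_ratio)
    support = set()
    for k in range(coefs.shape[1]):           # alphas are ordered from large (sparse) to small
        support = set(others[np.abs(coefs[:, k]) > 1e-8])
        if len(support) >= r_s:
            break
    return support
```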
In this paper we considered the problem of estimating the graph structure associated with a GMRF, the Ising model and the Potts model. Building on previous work using l1-penalized neighborhood estimation, we experimented with an additional l2 penalty term. Simulations across the three models show that a small but non-negligible λ2 penalty improves the edge recovery rates when the sample size is small, by trading precision for recall. We note, however, that in the GMRF model the addition of the l2 penalty term does not have much influence on the recovery rates. Numerical simulations confirm our hypothesis that the lower bounds on the number of samples needed for recovery should be a function not only of the maximum degree d and the number of nodes p, but also of the edge density ρ (or, equivalently, the average degree of the graph). We also introduced a new method for improving the neighborhood recovery by considering pair-wise neighborhood unions, which produce, for each node, a ranking of the other p - 1 nodes of G with respect to their likelihood of being adjacent to it. This can be thought of as a way to incorporate local information (rankings) at each node into a globally consistent estimate of the edge structure of the graph G.

References

[1] Bento, J. and Montanari, A., Which graphical models are difficult to learn?, arXiv:0910.5761, 2009.

[2] Bunea, F., Honest variable selection in linear and logistic regression models via l1 and l1 + l2 penalization, Electronic Journal of Statistics, 2008.

[8] Ravikumar, P., Raskutti, G., Wainwright, M. J., and Yu, B., Model selection in Gaussian graphical models: high-dimensional consistency of l1-regularized MLE, NIPS, 2008.

[9] Ravikumar, P., Wainwright, M. J., and Lafferty, J., High-dimensional graphical model selection using l1-regularized logistic regression, NIPS, 2006.