Approximating the Log-Partition Function
Romain Cosson [email protected]
Devavrat Shah [email protected]
Abstract
Variational approximations, such as mean-field (MF) and tree-reweighted (TRW), provide a computationally efficient approximation of the log-partition function for a generic graphical model. TRW provably provides an upper bound, but the approximation ratio is generally not quantified. As the primary contribution of this work, we provide an approach to quantify the approximation ratio through properties of the underlying graph structure. Specifically, we argue that (a variant of) TRW produces an estimate that is within a factor $1/\sqrt{\kappa(G)}$ of the true log-partition function for any discrete pairwise graphical model over graph $G$, where $\kappa(G) \in (0, 1]$ captures how far $G$ is from tree structure, with $\kappa(G) = 1$ for trees and $2/N$ for the complete graph over $N$ vertices. As a consequence, the approximation ratio is $1$ for trees, $\sqrt{(d+1)/2}$ for any graph with maximum average degree $d$, and approximately $1 + 1/(2\beta)$, for large $\beta$, for graphs with girth (shortest cycle) at least $\beta \log N$. In general, $\kappa(G)$ is the solution of a max-min problem associated with $G$ that can be evaluated in polynomial time for any graph. Using samples from the uniform distribution over the spanning trees of $G$, we provide a near linear-time variant that achieves an approximation ratio equal to the inverse of the square root of the minimal (across edges) effective resistance of the graph. We connect our results to the graph partition-based approximation method and thus provide a unified perspective.

Keywords: variational inference, log-partition function, spanning tree polytope, minimum effective resistance, min-max spanning tree, local inference
The Setup.
We consider a collection of $N$ discrete valued random variables, $X = (X_1, \ldots, X_N)$, whose joint distribution is modeled as a pairwise graphical model. Let $G = (V, E)$ represent the associated graph with vertices $V = \{1, \ldots, N\}$ representing the $N$ variables and $E \subset V \times V$ representing edges. Let each variable take value in a discrete set $\mathcal{X} \subset \mathbb{R}_+$. For $e \in E$, let $\phi_e : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ denote the edge potential and let $\theta_e \in \mathbb{R}_+$ denote the associated parameter. This leads to a joint distribution with probability mass function

$\mathbb{P}(X = x;\, \theta) \propto \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big) = \frac{1}{Z(\theta)} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big), \qquad (1)$

where $x = (x_1, \ldots, x_N) \in \mathcal{X}^N$, $x_e$ is shorthand for $(x_s, x_t)$ if $e = (s, t) \in E$, $\theta = (\theta_e : e \in E) \in \mathbb{R}_+^{|E|}$, and the normalizing constant or partition function $Z(\theta)$ is defined as

$Z(\theta) = \sum_{x \in \mathcal{X}^N} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big). \qquad (2)$

Such pairwise graphical models provide a succinct description of complicated joint distributions. However, the key challenge in utilizing them (e.g. for inference) arises in estimating the partition function $Z(\theta)$. In this work, our interest is in computing the logarithm of $Z(\theta)$, precisely

$\Phi(\theta) = \log Z(\theta) = \log\Big[\sum_{x \in \mathcal{X}^N} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big)\Big]. \qquad (3)$

Computing $Z(\theta)$ is known to be computationally hard in general, i.e. #P-complete, due to its relation to counting discrete objects such as independent sets, cf. [23, 15]. Due to reductions from discrete optimization problems to log-partition function computation, approximating $\Phi(\theta)$, even up to a multiplicative error, can be NP-hard, cf. [26, 25, 7]. Therefore, the goal is to develop a polynomial time (in $N$) approximation method for computing $\Phi(\theta)$ or $Z(\theta)$ with provable guarantees on the approximation error.
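For intuition, on very small instances $\Phi(\theta)$ can be computed by the direct enumeration in (2)-(3); the exponential cost in $N$ is what motivates the approximation methods studied here. A minimal sketch (the graph, alphabet and agreement potentials below are illustrative choices, not from the paper):

```python
# Brute-force evaluation of Z(theta) and Phi(theta) as in (2)-(3).
# Cost is O(|X|^N): feasible only for tiny N.
import itertools
import math

def log_partition(N, edges, theta, phi, alphabet):
    """Phi(theta) = log sum_x exp(sum_e theta_e * phi(x_s, x_t))."""
    total = 0.0
    for x in itertools.product(alphabet, repeat=N):
        weight = sum(th * phi(x[a], x[b]) for th, (a, b) in zip(theta, edges))
        total += math.exp(weight)
    return math.log(total)

# A 3-cycle over binary variables with agreement potentials phi(a,b) = 1[a=b].
edges = [(0, 1), (1, 2), (0, 2)]
theta = [0.5, 0.5, 0.5]
phi = lambda a, b: 1.0 if a == b else 0.0
Phi = log_partition(3, edges, theta, phi, [0, 1])
# Z = 2 e^{1.5} + 6 e^{0.5}: 2 configurations agree on all 3 edges, 6 on exactly one.
assert abs(Phi - math.log(2 * math.exp(1.5) + 6 * math.exp(0.5))) < 1e-9
```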
Specifically, let ALG denote such an approximation method that takes the problem description $(G, (\phi_e)_{e \in E}, \mathcal{X})$ as input and produces an estimate $\widehat{\Phi}_{\mathrm{ALG}}(\theta)$ of $\Phi(\theta)$ for any given $\theta \in \mathbb{R}_+^{|E|}$. Then, we define the approximation ratio associated with ALG, $\alpha(G, \mathrm{ALG}) \geq 1$, as

$\alpha(G, \mathrm{ALG}) = \sup_{\theta \in \mathbb{R}_+^{|E|}} \max\Big(\frac{\Phi(\theta)}{\widehat{\Phi}_{\mathrm{ALG}}(\theta)}, \frac{\widehat{\Phi}_{\mathrm{ALG}}(\theta)}{\Phi(\theta)}\Big). \qquad (4)$

Prior Work.
There is a long literature on developing computationally efficient approximation methods for the log-partition function, with significant progress in the past two decades. We recall a few relevant prior works here.

A collection of methods, classified as variational approximations, utilize the (Gibbs) variational characterization of the log-partition function when the distribution (1) is viewed as a member of an exponential family, cf. [11, 25]. Specifically, $\Phi(\theta)$ can be viewed as the solution of a high-dimensional constrained maximization problem. By solving the problem with additional constraints, one obtains a valid lower bound such as that given by mean-field methods. By utilizing the convexity of $\Phi(\cdot)$ and restricting it to tree-structured subgraphs of $G$, one obtains a valid upper bound such as that given by the tree-reweighted (TRW) method. By relaxing the constraints and adapting the objective to allow for pairwise pseudo-marginals, one obtains heuristics such as Belief Propagation (BP) via the Bethe approximation [27, 28]. While BP does not provide a provable upper or lower bound in general, for graphs with large girth, such as sparse random graphs, and distributions with spatial decay of correlation, it provides an excellent approximation, cf. [21]. The spatial decay of correlation property has been further exploited to obtain deterministic Fully Polynomial Time Approximation Schemes (FPTAS) for various counting problems, i.e. computing partition functions, cf. [26, 10, 2, 9]. The approximation error of belief propagation for computing the log-partition function has been studied through the connection to loop calculus as well, cf. [6, 5].

In another line of works, graph partitioning based methods have been proposed to provide Polynomial Time Approximation Schemes (PTAS) for a class of graphs that satisfy certain graph partitioning properties, which includes minor-excluded graphs [17] and graphs with polynomial growth [16].

In summary, despite the progress, the approximation ratio $\alpha(G, \mathrm{ALG})$ for any of the known variational approximation methods ALG remains undetermined.

Summary of Contributions.
As the main contribution, for a simple variant of the tree-reweighted (TRW) method, denoted as TRW′, we quantify $\alpha(G, \mathrm{TRW}')$ for any $G$. The TRW′ method is described in Section 3 and produces an estimate of $\Phi(\cdot)$ in polynomial time. Specifically, we establish

Theorem 1.1.
For any graph $G$, the approximation ratio of TRW′ is such that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa(G)}$ where

$\kappa(G) = \min_{S \subset V} \frac{|S| - 1}{|E(S)|}, \qquad (5)$

with $E(S) = E \cap (S \times S)$ for any $S \subset V$.

The term $\kappa(G)$ captures the proximity of $G$ with respect to the tree structure across all of its induced subgraphs: for $S \subset V$, the induced subgraph $(S, E(S))$ would have at most $|S| - 1$ edges if it were cycle free, but it has $|E(S)|$ edges. Therefore, the ratio $(|S| - 1)/|E(S)|$ measures how far it is from a tree: $1$ if a connected tree and $2/|S|$ if a complete graph. The minimum over all possible $S \subset V$ of this ratio captures how far $G$ is from a tree structure.

Using this characterization, we provide bounds on $\alpha(G, \mathrm{TRW}')$ in terms of various simpler graph properties in Section 4.4. Specifically, we show that for any graph with maximum average vertex degree $d$, $\alpha(G, \mathrm{TRW}') \leq \sqrt{(d+1)/2}$. And for graphs with girth (i.e. length of shortest cycle) $g > 2$, $\alpha(G, \mathrm{TRW}') \leq \sqrt{\big(1 + N^{2/(g-2)}(1 - 2/g)\big)/2}$: for $g \geq \beta \log N$, it is $1 + 1/(2\beta) + o(1/\beta)$ for large $\beta$. This means that for any $G$ with large ($\gg \log N$) girth, $\alpha(G, \mathrm{TRW}') \approx 1$.

In general, we establish that $\kappa(G)$ can be evaluated in polynomial time for any graph $G$ by solving an appropriate linear program on the (polynomially-)extended spanning tree polytope. This is explained in Section 4.

The tree-reweighted variant TRW′ considered here requires solving a certain optimization problem over the tree polytope of the graph $G$. Though it can be computed in polynomial time, it can be quite involved. With an eye towards near linear-time (in $|E|$) computation, we also consider a variant that, instead of optimizing over the tree polytope, simply considers a feasible point in the tree polytope that corresponds to the uniform distribution over spanning trees of $G$.
Using the near-linear time sampling of spanning trees from [22], we provide a randomized approximation method. Its approximation ratio $\alpha(G)$ is bounded above by $1/\sqrt{\min_{e \in E} r_e}$, where $r_e > 0$ is the effective resistance of $e \in E$ for the graph $G = (V, E)$ (see (39) for the precise definition). While in general this provides a weaker approximation guarantee than that of TRW′, for graphs with vertex degree bounded by $d$ it leads to a similar guarantee of $\alpha(G) \leq \sqrt{(d+1)/2}$.

We show that the results based on graph partitioning, cf. [17, 16], can be recovered as a natural extension of the variant of TRW introduced in this work by allowing for general subgraphs with bounded tree-width beyond trees.

We take note of the fact that though the results discussed in this work are primarily for the variant of TRW described in Section 3, as an immediate consequence of our results, $\alpha(G, \mathrm{TRW}) \leq 1/\kappa(G)$, i.e. it is bounded by the square of that derived in Theorem 1.1. As discussed in Section 7, understanding the tightness of this characterization, especially for TRW, remains an important open direction.
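For concreteness, the quantity $\kappa(G)$ of (5) can be evaluated by brute force on small graphs, confirming the two extremes quoted above ($\kappa = 1$ for trees, $\kappa = 2/N$ for the complete graph). A sketch (exponential in $N$; the polynomial-time evaluation is via the linear program of Section 4):

```python
# kappa(G) = min over vertex subsets S with |E(S)| > 0 of (|S|-1)/|E(S)|,
# computed by explicit enumeration of all subsets (for intuition only).
import itertools

def kappa(n, edges):
    best = 1.0  # a single edge gives (2 - 1)/1 = 1, so kappa(G) <= 1
    for k in range(2, n + 1):
        for S in itertools.combinations(range(n), k):
            s = set(S)
            m = sum(1 for (a, b) in edges if a in s and b in s)
            if m > 0:
                best = min(best, (k - 1) / m)
    return best

path = [(i, i + 1) for i in range(4)]           # a tree on N = 5 vertices
K5 = list(itertools.combinations(range(5), 2))  # complete graph on N = 5
assert kappa(5, path) == 1.0                    # trees: kappa = 1
assert abs(kappa(5, K5) - 2 / 5) < 1e-12        # complete graph: kappa = 2/N
```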
Outline of Paper.
In Section 2, we provide some preliminaries including recalling the tree-reweighted (TRW) method. In Section 3, we provide a modification of TRW and characterize its approximation guarantee. In Section 4, we provide a linear optimization characterization of the approximation guarantee which leads to the proof of Theorem 1.1. We discuss implications of Theorem 1.1 for various classes of graphs as well. In Section 5, we present a near linear-time variant of the modified TRW based on sampling from the uniform distribution over spanning trees of $G$. We derive approximation guarantees for the resulting method in terms of the effective resistance of the graph and derive its implications. In Section 6, we discuss the connection with graph partitioning methods by extending the modified TRW of Section 3 to allow for bounded tree-width subgraphs beyond trees. We argue how the results of [17, 16] follow naturally. Section 7 discusses directions for future work.

We start by recalling the variational characterization of the log-partition function $\Phi(\cdot)$. Let $\mathcal{P}(\mathcal{X}^N)$ denote the space of all probability distributions over $\mathcal{X}^N$. Then, the Gibbs variational characterization states that

$\Phi(\theta) = \sup_{q \in \mathcal{P}(\mathcal{X}^N)} \mathbb{E}_{x \sim q}\Big(\sum_{e} \theta_e \phi_e(x_e)\Big) + H(q), \qquad (6)$

where $H(q) = -\mathbb{E}_{x \sim q}(\log(q(x)))$ is the entropy of $q$. While computationally (6) does not provide a tractable solution for evaluating $\Phi(\cdot)$, it provides a framework to develop approximation methods; such methods, inspired by this characterization, are called variational approximations.

As mentioned earlier, the classical mean-field method consists in relaxing $\mathcal{P}(\mathcal{X}^N)$ to the space of independent (product) distributions over $\mathcal{X}^N$, denoted as $\mathcal{I}(\mathcal{X}^N)$, i.e. $\mathcal{I}(\mathcal{X}^N) = \{q \in \mathcal{P}(\mathcal{X}^N) : q(x_1, \ldots, x_N) = \prod_{i=1}^N q_i(x_i)\}$. By restricting the optimization in (6) to $\mathcal{I}(\mathcal{X}^N)$, the resulting answer is a lower bound on $\Phi(\theta)$.
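The characterization (6) can be sanity-checked numerically on a tiny model: the Gibbs distribution itself attains the supremum, while any other distribution (here: uniform) only yields a lower bound. A minimal sketch with illustrative potentials, not from the paper:

```python
# Numerical check of the Gibbs variational characterization (6).
import itertools
import math

edges = [(0, 1), (1, 2), (0, 2)]
theta = [1.0, 0.7, 0.3]
phi = lambda a, b: 1.0 if a == b else 0.0
configs = list(itertools.product([0, 1], repeat=3))

def weight(x):  # sum_e theta_e * phi_e(x_e)
    return sum(th * phi(x[a], x[b]) for th, (a, b) in zip(theta, edges))

Z = sum(math.exp(weight(x)) for x in configs)
Phi = math.log(Z)

def objective(q):  # E_q[sum_e theta_e phi_e] + H(q), the objective of (6)
    return sum(p * weight(x) for p, x in zip(q, configs)) + \
           sum(-p * math.log(p) for p in q if p > 0)

gibbs = [math.exp(weight(x)) / Z for x in configs]
uniform = [1.0 / len(configs)] * len(configs)
assert abs(objective(gibbs) - Phi) < 1e-9   # the Gibbs distribution attains Phi
assert objective(uniform) < Phi             # any other q gives a lower bound
```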
And the mean-field method precisely attempts to solve for such a lower bound.

It turns out that (6) is solvable efficiently for tree-structured graphs. Specifically, if $G$ is a connected tree, i.e. $G$ is connected with $|E| = N - 1$, then any distribution satisfying (1) can be re-parametrized as

$\mathbb{P}(x;\, \theta) = \prod_{u \in V} \mathbb{P}_{X_u}(x_u) \prod_{(u,v) \in E} \frac{\mathbb{P}_{X_u, X_v}(x_u, x_v)}{\mathbb{P}_{X_u}(x_u)\,\mathbb{P}_{X_v}(x_v)}. \qquad (7)$

In the expression above, $\mathbb{P}_{X_u}(\cdot)$ denotes the marginal distribution of $X_u$, $u \in V$, and $\mathbb{P}_{X_u, X_v}(\cdot, \cdot)$ denotes the pairwise marginal distribution of $(X_u, X_v)$ for any edge $e = (u, v) \in E$. The Belief Propagation (or sum-product) algorithm can compute these marginal distributions efficiently for tree graphs using only knowledge of $\theta$ and $\phi_e$, $e \in E$, without requiring $\Phi(\theta)$. It utilizes $O(|\mathcal{X}|^2 N)$ computation time when implemented efficiently. Therefore, $Z(\theta)$ and hence $\Phi(\theta)$ can be computed for tree graphs using $O(|\mathcal{X}|^2 N)$ computations.

Indeed, the re-parametrization of the form (7) was a basis for the Belief Propagation (BP) algorithm for generic graphical models and also led to the so-called Bethe approximation of (6), cf. [27]. However, it does not result in a provable upper or lower bound in general (with few exceptions).

To obtain an upper bound on $\Phi(\cdot)$, its convexity was exploited in [24] along with the fact that (6) is solvable efficiently for tree-structured graphs. This resulted in the tree-reweighted (TRW) algorithm, which we describe next.

Tree-Reweighted (TRW) Upper Bound on $\Phi(\cdot)$.

Recall that a spanning tree $T$ is a subgraph of $G$ that contains all vertices $V$ and a subset of edges $E$ such that the resulting subgraph is a tree, i.e. does not have a cycle. Let $\mathcal{T}(G)$ be the set of all spanning trees of $G$. We shall denote a distribution on $\mathcal{T}(G)$ as $\rho = (\rho_T)_{T \in \mathcal{T}(G)}$, where $\rho_T \geq 0$ for all $T \in \mathcal{T}(G)$ and $\sum_{T \in \mathcal{T}(G)} \rho_T = 1$. The space of all distributions on $\mathcal{T}(G)$ is denoted by $\mathcal{P}(\mathcal{T}(G))$.
For simplicity, we shall drop the notation of $G$ at times when it is clear from the context and denote it simply as $\mathcal{P}(\mathcal{T})$. A distribution $\rho \in \mathcal{P}(\mathcal{T})$ induces, for each edge $e \in E$, a probability $\rho_e$ that this edge will appear in a tree selected from $\rho$,

$\rho_e = \mathbb{P}_{T \sim \rho}(e \in T) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \mathbb{1}(e \in T). \qquad (8)$

Note that in the above, we have abused notation using $T$ as a spanning tree as well as the set of edges constituting it. We shall continue using this notation since all spanning trees have the same set of vertices, $V$, and only the edges differ (among subsets of $E$). Also note another convenient abuse of notation: given $\rho$, $\rho_T$ denotes the probability of $T \in \mathcal{T}(G)$ while $\rho_e$ is the marginal probability of edge $e \in E$ being present in a tree as per $\rho$, and satisfies $\sum_{e \in E} \rho_e = N - 1$. Given $\rho \in \mathcal{P}(\mathcal{T}(G))$, we now define $\kappa_\rho$ as

$\kappa_\rho = \min_{e \in E} \rho_e. \qquad (9)$

For any $\theta \in \mathbb{R}_+^{|E|}$, define its support as $s(\theta) = \{e \in E : \theta_e \neq 0\}$. Given a spanning tree $T \in \mathcal{T}(G)$, let $\theta^T \in \mathbb{R}_+^{|E|}$ be such that $s(\theta^T) \subset T$. Let $\rho \in \mathcal{P}(\mathcal{T})$ along with $(\theta^T)_{T \in \mathcal{T}}$ be such that $\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta$. That is, $\mathbb{E}_{T \sim \rho}[\theta^T] = \theta$. Therefore, we can write

$\Phi(\theta) = \Phi\big(\mathbb{E}_{T \sim \rho}[\theta^T]\big). \qquad (10)$

It has been well established that $\Phi : \mathbb{R}_+^{|E|} \to \mathbb{R}$ is a convex function. Precisely, for any $\theta_1, \theta_2 \in \mathbb{R}_+^{|E|}$ and $\gamma \in [0, 1]$,

$\Phi(\gamma \theta_1 + (1 - \gamma)\theta_2) \leq \gamma \Phi(\theta_1) + (1 - \gamma)\Phi(\theta_2). \qquad (11)$

From (10) and (11), it follows by Jensen's inequality that

$\Phi(\theta) \leq \mathbb{E}_{T \sim \rho}\big[\Phi(\theta^T)\big] = \sum_{T \in \mathcal{T}} \rho_T \Phi(\theta^T). \qquad (12)$

Since the upper bound (12) holds for any $\rho \in \mathcal{P}(\mathcal{T})$ and $(\theta^T)_{T \in \mathcal{T}}$ such that $\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta$, we can optimize over these two parameters to obtain

$\Phi(\theta) \leq \inf_{\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta} \Big(\sum_{T \in \mathcal{T}} \rho_T \Phi(\theta^T)\Big) \equiv U_{\mathrm{TRW}}(\theta). \qquad (13)$

As established in [24], this seemingly complicated optimized bound, $U_{\mathrm{TRW}}(\theta)$, can be computed via an iterative tree-reweighted message-passing algorithm through the dual of the above optimization problem. While this is a valid upper bound, how tight the upper bound is for a given graphical model is not quantified in the literature. And this is precisely the primary contribution of this work.

Modified Tree-Reweighted: TRW′.

We describe a simple variant of TRW that enables us to bound the approximation ratio of the estimation of $\Phi$ using properties of $G$. We start with some useful notation. Given $\theta = (\theta_e)_{e \in E} \in \mathbb{R}_+^{|E|}$, $\rho \in \mathcal{P}(\mathcal{T}(G))$ and a spanning tree $T \in \mathcal{T}(G)$ of graph $G$, define the "projection" operations

$\Pi^T : \mathbb{R}_+^{|E|} \to \mathbb{R}_+^{|E|} \ \text{ where } \ \Pi^T(\theta) = \big(\mathbb{1}(e \in T)\,\theta_e\big)_{e \in E},$
$\Pi^T_\rho : \mathbb{R}_+^{|E|} \to \mathbb{R}_+^{|E|} \ \text{ where } \ \Pi^T_\rho(\theta) = \big(\rho_e^{-1}\mathbb{1}(e \in T)\,\theta_e\big)_{e \in E}. \qquad (14)$

With these notations, for a given $\rho \in \mathcal{P}(\mathcal{T}(G))$, define

$L_\rho(\theta) = \mathbb{E}_{T \sim \rho}\big(\Phi(\Pi^T(\theta))\big) = \sum_{T \in \mathcal{T}(G)} \rho_T \Phi(\Pi^T(\theta)), \qquad (15)$
$U_\rho(\theta) = \mathbb{E}_{T \sim \rho}\big(\Phi(\Pi^T_\rho(\theta))\big) = \sum_{T \in \mathcal{T}(G)} \rho_T \Phi(\Pi^T_\rho(\theta)). \qquad (16)$

For a given $\rho \in \mathcal{P}(\mathcal{T}(G))$, one obtains an estimate of $\Phi(\theta)$:

$\widehat{\Phi}_\rho(\theta) = \sqrt{L_\rho(\theta)\, U_\rho(\theta)}. \qquad (17)$

For reasons that will become clear, TRW′ outputs $\widehat{\Phi}_{\rho^\star}(\theta)$, where $\rho^\star = \rho^\star(G)$ is defined as

$\rho^\star(G) \in \arg\max_{\rho \in \mathcal{P}(\mathcal{T}(G))} \big(\min_{e \in E} \rho_e\big) \quad \text{and} \quad \kappa_{\rho^\star}(G) = \max_{\rho \in \mathcal{P}(\mathcal{T}(G))} \big(\min_{e \in E} \rho_e\big). \qquad (18)$

Guarantee.
The lemma below quantifies the approximation ratio for TRW′. Its proof is in Appendix A.
Lemma 3.1.
Given $\theta \in \mathbb{R}_+^{|E|}$, TRW′ produces $\widehat{\Phi}_{\rho^\star}(\theta)$ with $\rho^\star = \rho^\star(G)$ as defined in (18). Then,

$\alpha(G, \mathrm{TRW}') \leq \frac{1}{\sqrt{\kappa_{\rho^\star}}}. \qquad (19)$

$\kappa_{\rho^\star}(G)$: Efficient Computation, Characterization

Lemma 3.1 establishes the approximation guarantee for TRW′ as claimed in Theorem 1.1, with the caveat that it is in terms of $\kappa_{\rho^\star}(G)$ while Theorem 1.1 states it in the form of $\kappa(G)$ as defined in (5). In this section, we shall establish the characterization $\kappa_{\rho^\star}(G) = \kappa(G)$ and in the process argue that it can be evaluated in polynomial time for any graph $G$. This characterization will allow us to bound $\kappa(G)$ for certain classes of graphs to obtain meaningful intuition.

4.1 Computing $\rho^\star(G)$ and $\kappa_{\rho^\star}(G)$ Efficiently

Spanning Tree Polytope.
We define a notion of the spanning tree polytope for a given graph $G$. Recall that $\mathcal{T}(G)$ is the set of all spanning trees of $G$. For any tree $T \in \mathcal{T}(G)$, we shall utilize the notation $\chi^T = [\chi^T_e] \in \{0, 1\}^E$ to represent the characteristic vector of the tree $T$, defined such that

$\forall e \in E : \ \chi^T_e = \mathbb{1}(e \in T). \qquad (20)$

Given this notation, we define the polytope of spanning trees of $G$, denoted $P^{\mathrm{tree}}(G)$, as the convex hull of their characteristic vectors. That is,

$P^{\mathrm{tree}}(G) = \Big\{v \in [0, 1]^E : v = \sum_{T \in \mathcal{T}(G)} \rho_T \chi^T, \ \sum_{T \in \mathcal{T}(G)} \rho_T = 1, \ \rho_T \geq 0 \ \forall T \in \mathcal{T}(G)\Big\}. \qquad (21)$

The weights $(\rho_T)_{T \in \mathcal{T}(G)}$ can be viewed as a probability distribution on $\mathcal{T}(G)$, i.e. an element of $\mathcal{P}(\mathcal{T}(G))$. Therefore $v = \sum_{T \in \mathcal{T}(G)} \rho_T \chi^T$ corresponds to a vector representing the probabilities that edges in $E$ will be present in $T \sim \rho = (\rho_T)$, i.e. $v_e = \mathbb{E}_{T \sim \rho}[\mathbb{1}(e \in T)]$. That is, $v = (\rho_e)_{e \in E}$ as defined in (8). Therefore, we shall abuse notation and write

$P^{\mathrm{tree}}(G) = \big\{(\rho_e)_{e \in E} \ \big| \ (\rho_T)_{T \in \mathcal{T}(G)} \in \mathcal{P}(\mathcal{T}(G))\big\}. \qquad (22)$

[8] gave the following characterization of the spanning tree polytope:

$P^{\mathrm{tree}}(G) = \Big\{(v_e)_{e \in E} \in \mathbb{R}_+^E \ \Big| \ \forall S \subset V : v(E(S)) \leq |S| - 1; \ \ v(E) = |V| - 1\Big\}, \qquad (23)$

where $v(E(S)) = \sum_{e \in E(S)} v_e$.

Efficient Separation Oracle.
A polytope $P \subset \mathbb{R}^n$, defined through a set of linear constraints, is said to have a separation oracle if there exists an algorithm, polynomial time in $n$, which for any given $x \in \mathbb{R}^n$ can determine whether $x \in P$ or not, and output a violated constraint if $x \notin P$. Edmonds' characterization of the spanning tree polytope, though it has an exponential number of constraints, admits an efficient separation oracle. Such an efficient separation oracle is defined explicitly via a min-cut reduction, see [19, Chapter 4.1].

Complexity of Linear Programming.
Consider a linear program where the goal is to find the minimum of a linear objective function over a polytope defined by finitely many linear constraints. Such a linear program can be solved in polynomial time (in the size of the problem description) via the Ellipsoid method if the polytope admits an efficient separation oracle, see [3, Theorem 8.5] for example. Given that the spanning tree polytope has an efficient separation oracle, optimizing a linear objective over it can be done efficiently. Of course, due to the structure of trees, a greedy algorithm like Kruskal's may be a lot more direct for solving such a linear program. Having said that, the benefit of an efficient separation oracle becomes apparent as soon as we consider additional linear constraints beyond those describing $P^{\mathrm{tree}}(G)$. Indeed, such approaches have found utility in solving other problems, such as bounded-degree maximum-spanning-tree relaxations as in [12].

Augmented Spanning Tree Polytope.
We consider a reformulation of the max-min problem in (18). To that end, consider the following augmented spanning tree polytope:

$P^{\mathrm{tree}}_{\min}(G) = \Big\{(z, (v_e)_{e \in E}) \in \mathbb{R} \times \mathbb{R}_+^{|E|} \ \Big| \ \forall e \in E : z \leq v_e; \ \ \forall S \subset V : v(E(S)) \leq |S| - 1; \ \ v(E) = |V| - 1\Big\}. \qquad (24)$

With this notation, we can re-write $\kappa_{\rho^\star}(G)$ as per (18) as

$\kappa_{\rho^\star}(G) = \max_{(v_e)_{e \in E} \in P^{\mathrm{tree}}} \big\{\min_{e \in E} v_e\big\} = \max_{(z, (v_e)_{e \in E}) \in P^{\mathrm{tree}}_{\min}} z. \qquad (25)$

Next, we argue that $P^{\mathrm{tree}}_{\min}$ admits an efficient separation oracle as follows. The separation oracle for $P^{\mathrm{tree}}_{\min}$ takes $(z, (v_e)_{e \in E})$ as input. It first checks that all $|E|$ constraints of the form $z \leq v_e$ are satisfied. If one is not satisfied, then the oracle outputs this constraint. If all are satisfied, the algorithm runs the separation oracle of $P^{\mathrm{tree}}$ on $(v_e)_{e \in E}$ and reproduces its output. Since $|E| \leq N^2$ and $P^{\mathrm{tree}}$ has an efficient separation oracle, this leads to a polynomial time separation oracle for $P^{\mathrm{tree}}_{\min}$.

Efficient Computation of $\rho^\star(G)$ and $\kappa_{\rho^\star}(G)$. From the linear program formulation (25) and the efficient separation oracle defined above, we can compute $\kappa_{\rho^\star}(G)$ in polynomial time using the Ellipsoid algorithm. Note that this does not directly provide $\rho^\star(G) \in \mathcal{P}(\mathcal{T}(G))$, since the representation in $P^{\mathrm{tree}}$ corresponds to the edge probabilities $(\rho^\star(G)_e)_{e \in E}$. However, $(\rho^\star(G)_e)_{e \in E}$ is a convex combination of extreme points of $P^{\mathrm{tree}}$, which correspond to the spanning trees of $G$. Since $P^{\mathrm{tree}}$ has an efficient separation oracle, we can recover a decomposition of $(\rho^\star(G)_e)_{e \in E}$ as a convex combination of characteristic vectors weighted by $(\rho^\star(G)_T)_{T \in \mathcal{T}(G)}$, such that at most $|E|$ of these weights are strictly positive; see details in [13, Theorem 3.9].
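As a sanity check of the construction above, a naive separation "oracle" for $P^{\mathrm{tree}}_{\min}$ can be written by checking the constraints of (24) by explicit enumeration of vertex subsets; this is exponential in $|V|$ and for intuition only, with the efficient oracle replacing the enumeration by the min-cut reduction mentioned earlier. A sketch on the triangle graph, for which $\kappa_{\rho^\star} = 2/3$:

```python
# Naive membership/separation check for P^tree_min of (24), by enumeration.
import itertools

def separate(z, v, n):
    """Return None if (z, v) lies in P^tree_min, else a violated constraint."""
    for e, ve in v.items():
        if z > ve + 1e-12:
            return ("z <= v_e violated", e)
    if abs(sum(v.values()) - (n - 1)) > 1e-9:
        return ("v(E) = |V| - 1 violated",)
    for k in range(2, n + 1):
        for S in itertools.combinations(range(n), k):
            s = set(S)
            tot = sum(ve for (a, b), ve in v.items() if a in s and b in s)
            if tot > k - 1 + 1e-9:
                return ("Edmonds constraint violated", S)
    return None

# Uniform edge weights v_e = 2/3 on the triangle: every spanning tree has 2 of
# the 3 edges, so these are valid edge marginals, and z may be as large as 2/3.
v = {(0, 1): 2 / 3, (1, 2): 2 / 3, (0, 2): 2 / 3}
assert separate(2 / 3, v, 3) is None
assert separate(0.9, v, 3) is not None   # z exceeds min_e v_e
```

Note that feeding the output of such an oracle to the Ellipsoid method is exactly how (25) is solved in polynomial time.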
4.2 $\kappa_{\rho^\star}(G) = \kappa(G)$

We wish to establish $\kappa_{\rho^\star}(G) = \kappa(G)$, i.e. we want to establish

$\kappa_{\rho^\star}(G) = \max_{(v_e)_{e \in E} \in P^{\mathrm{tree}}} \big\{\min_{e \in E} v_e\big\} = \min_{S \subset V} \frac{|S| - 1}{|E(S)|}. \qquad (26)$

Upper bound: $\kappa_{\rho^\star}(G) \leq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}$. The upper bound is immediately given by Edmonds' characterization of the spanning tree polytope. For any $(\rho_e)_{e \in E} \in P^{\mathrm{tree}}$ and any $S \subset V$:

$|E(S)| \big(\min_{e \in E} \rho_e\big) \leq \Big(\sum_{e \in E(S)} \rho_e\Big) = \rho(E(S)) \leq |S| - 1. \qquad (27)$

That is, for any $\rho \in \mathcal{P}(\mathcal{T}(G))$,

$\kappa_\rho \leq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}. \qquad (28)$

And hence it holds for $\rho^\star(G)$ as well.

Lower bound: $\kappa_{\rho^\star}(G) \geq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}$. To establish the lower bound, we need a few additional results. To start with, we define a dual of the optimization problem (25) to characterize $\kappa_{\rho^\star}(G)$. By strong duality it follows that

$\kappa_{\rho^\star}(G) = \max_{\rho \in \mathcal{P}(\mathcal{T})} \min_{e \in E} \sum_{T \in \mathcal{T}} \mathbb{1}(e \in T)\, \rho_T = \min_{w \in \mathcal{P}(E)} \max_{T \in \mathcal{T}} \sum_{e \in E} \mathbb{1}(e \in T)\, w_e, \qquad (29)$

where $\mathcal{P}(E) = \{w = (w_e)_{e \in E} : \sum_{e \in E} w_e = 1, \ w_e \geq 0 \ \forall e \in E\}$. Table 1 provides the precise primal and dual formulations associated with $\kappa_{\rho^\star}(G)$, justifying (29). We state the following Lemma characterizing an optimal solution of Dual, whose proof is in Appendix B.
Lemma 4.1.
There exists an optimal solution of Dual,

$w^\star \in \arg\min_{w \in \mathcal{P}(E)} \max_{T \in \mathcal{T}} \sum_{e \in E} \mathbb{1}(e \in T)\, w_e,$

such that all non-zero components of $w^\star$ take identical values, i.e. $|\{w^\star_e : w^\star_e \neq 0, e \in E\}| = 1$.

Primal: maximize $z$ over variables $z \in \mathbb{R}$ and $\rho_T \in \mathbb{R}_+$, $T \in \mathcal{T}$, subject to $\sum_{T \in \mathcal{T}} \rho_T = 1$ and $\forall e \in E : \sum_{T \ni e} \rho_T - z \geq 0$.
Dual: minimize $y$ over variables $y \in \mathbb{R}$ and $w_e \in \mathbb{R}_+$, $e \in E$, subject to $\sum_{e \in E} w_e = 1$ and $\forall T \in \mathcal{T} : y - \sum_{e \in T} w_e \geq 0$.

Table 1: The primal (cf. (25)) and dual formulations of $\kappa_{\rho^\star}(G)$.

As per Lemma 4.1, consider an optimal solution of Dual, $w^\star$, that assigns a constant value to a subset $F \subset E$ of edges and $0$ to the edges $E \setminus F$: let $w^\star = (w^\star_e)_{e \in E}$ with $w^\star_e = 1/|F|$ for $e \in F$ and $w^\star_e = 0$ for $e \in E \setminus F$. Let $V(F) \subset V$ be the set of all vertices corresponding to the end points of edges in $F$, making a subgraph $(V(F), F)$ of $G$. Let $c(F) \geq 1$ denote the number of connected components of $(V(F), F)$. Per Dual, given $w^\star$, $\kappa_{\rho^\star}(G)$ equals the weight of the maximum weight spanning tree in $G$ with edges assigned weights as per $w^\star$. Such a maximum weight spanning tree must select as many edges as possible from $F$: since $(V(F), F)$ has $c(F)$ connected components and $|V(F)|$ vertices, the tree can select at most $|V(F)| - c(F)$ such edges, and each such edge has weight $1/|F|$. The rest of the edges in the maximum weight spanning tree will carry weight $0$. Thus, the total weight of such a maximum weight spanning tree is $(|V(F)| - c(F))/|F|$. This gives us an equivalent characterization of $\kappa_{\rho^\star}(G)$ as

$\kappa_{\rho^\star}(G) = \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}. \qquad (30)$

Now we state a Lemma, whose proof is in Appendix C, which relates the characterization (30) with that of (5).

Lemma 4.2.
For any graph $G$,

$\min_{S \subset V} \frac{|S| - 1}{|E(S)|} = \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}. \qquad (31)$

The primary claim of Theorem 1.1 is that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa(G)}$. As per Lemma 3.1, we have that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa_{\rho^\star}(G)}$. As per the arguments in Section 4.2, we have that $\kappa_{\rho^\star}(G) = \kappa(G)$. Therefore, we conclude the proof of Theorem 1.1.

4.4 Bounding $\kappa(G)$ for a Class of Graphs
As established in Section 4.1, $\kappa(G)$ or $\kappa_{\rho^\star}(G)$ can be computed in polynomial time for any $G$. Here, we attempt to obtain a (lower) bound on $\kappa(G)$ in terms of simple graph properties. To that end, we obtain the following for graphs with bounded maximum average degree.

Lemma 4.3.
For a graph $G = (V, E)$, let $\bar{d} = \max_{S \subset V} 2|E(S)|/|S|$ denote the maximum average degree. Then

$\kappa(G) \geq \frac{2}{\bar{d} + 1}. \qquad (32)$

For graphs with large girth, we obtain the following.

Lemma 4.4.
For a graph $G = (V, E)$, let $g > 2$ be its girth, i.e. the length of the shortest cycle. Then

$\kappa(G) \geq \frac{2}{1 + N^{\frac{2}{g-2}}\big(1 - \frac{2}{g}\big)}. \qquad (33)$

The proofs of Lemmas 4.3 and 4.4 are presented in Appendix D. As per Lemma 4.4, for $g = \beta \log N$ with $\beta \gg 1$ and $N$ large enough,

$\kappa(G) \geq \frac{2}{1 + N^{\frac{2}{g-2}}\big(1 - \frac{2}{g}\big)} \approx \frac{1}{1 + 1/\beta}. \qquad (34)$

Therefore,

$\alpha(G, \mathrm{TRW}') \leq \frac{1}{\sqrt{\kappa(G)}} \approx 1 + \frac{1}{2\beta}. \qquad (35)$

Near Linear-Time Variant of TRW
The TRW′ method requires finding $\rho^\star(G)$. As discussed in Section 4, it can be computed efficiently. However, it can be cumbersome, and having a near linear-time (in $|E|$) variant can be more attractive in practice. With this as motivation, we propose utilizing the uniform distribution on $\mathcal{T}(G)$, denoted as $u \equiv u(\mathcal{T}(G))$, in place of $\rho^\star(G)$ in TRW′. The challenge is that it has a very large support, $\mathcal{T}(G)$, and hence it is difficult to compute $L_u(\theta)$, $U_u(\theta)$ exactly. But both of these quantities are averages, with respect to $u$, of a certain functional. And it is feasible to sample a spanning tree uniformly at random for any $G$ in near-linear time. Therefore, we can draw $n$ samples from the distribution $u$ and use the empirical distribution $\hat{u}^n$ to compute the estimates $L_{\hat{u}^n}(\theta)$, $U_{\hat{u}^n}(\theta)$ with few samples. This is precisely the algorithm.

To that end, consider $n$ trees $T_1, \ldots, T_n$ sampled uniformly at random from $\mathcal{T}(G)$. Compute

$\hat{u}^n_e = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(e \in T_i) \ \forall e \in E, \quad L_{\hat{u}^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}(\theta)), \quad U_{\hat{u}^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_{\hat{u}^n}(\theta)), \qquad (36)$

where $\hat{u}^n = (\hat{u}^n_e)_{e \in E}$. Given this, produce the estimate

$\widehat{\Phi}_{\hat{u}^n}(\theta) = \sqrt{L_{\hat{u}^n}(\theta)\, U_{\hat{u}^n}(\theta)}. \qquad (37)$

Given a graph $G$, recall that $\kappa_u(G) = \min_{e \in E} u_e$, with $u$ being the uniform distribution on $\mathcal{T}(G)$ and $u_e = \mathbb{E}_{T \sim u}[\mathbb{1}(e \in T)]$. We state the following Lemma, whose proof can be found in Appendix E.

Lemma 5.1.
Given $\epsilon > 0$ and $\delta > 0$, for $n \geq O\big(\log(N/\delta)\, \kappa_u(G)^{-2} \epsilon^{-2}\big)$ and $\epsilon$ sufficiently small, with probability at least $1 - \delta$,

$\max_{\theta \in \mathbb{R}_+^{|E|}}\Big(\frac{\Phi(\theta)}{\widehat{\Phi}_{\hat{u}^n}(\theta)}, \frac{\widehat{\Phi}_{\hat{u}^n}(\theta)}{\Phi(\theta)}\Big) \leq \frac{1 + \epsilon}{\sqrt{\kappa_u(G)}}. \qquad (38)$

To sample a tree uniformly at random from $\mathcal{T}(G)$, [22] recently proposed a method that has $O(|E|^{1+o(1)})$ run-time using a short-cutting method and insights from effective resistance. The earliest polynomial time algorithm has been known since [14]. While we do not recall either of these here, we briefly recall the algorithm from [4] due to its elegance, even though it is not optimal (it has $O(N|E|)$ run time): (1) starting with any $u \in V$, run a random walk on $G$ until it covers all vertices; (2) for every vertex $v \neq u$, select the edge through which $v$ was reached for the first time during the walk; and (3) output the $N - 1$ edges (which form a tree) thus selected. Given $n$ such samples, to compute $\widehat{\Phi}_{\hat{u}^n}(\theta)$ we have to compute $2n$ log-partition functions for tree-structured graphs. As noted in Section 2, each such computation requires $O(N|\mathcal{X}|^2)$ operations.

By Lemma 5.1, we need $n \geq O\big((d + \log N)\, \kappa_u(G)^{-2} \epsilon^{-2}\big)$ to achieve a $(1 + \epsilon)/\sqrt{\kappa_u(G)}$ approximation with probability $1 - e^{-d}$. That is, in total we need $O(|E|^{1+o(1)} + N|\mathcal{X}|^2) \times O(\kappa_u(G)^{-2}\epsilon^{-2}\log(1/\epsilon))$ computation for a $(1 + \epsilon)/\sqrt{\kappa_u(G)}$ approximation with probability $1 - \epsilon$.

$\kappa_u(G)$ and Effective Resistance

The quantity $\kappa_u(G) = \min_{e \in E} u_e$, where $u_e = \mathbb{E}_{T \sim u}[\mathbb{1}(e \in T)]$, turns out to be related to the so-called "effective resistance" associated with edge $e \in E$ for the graph $G = (V, E)$. The notion was introduced by [18] and has multiple interpretations. We present one such here.
For $e = (s, t) \in E$, the effective resistance $u_e$ is equal to the amount of electric energy dissipated by the network when all edges are seen as electric wires of resistance $R_e = 1$ and a generator guarantees a total current flow ($\iota_{\mathrm{gen}} = 1$) from $s$ to $t$. The distribution of the current $\iota$ across the network must minimize the dissipated energy while respecting the constraints imposed by Kirchhoff's laws (also see [20, Chapter 2]). Below we provide a variational characterization of it:

$\forall e = (s, t) \in E : \ u_e = \min\Big\{\sum_{\{u,v\} \in E} \iota(u, v)^2 \ \Big| \ \forall \{u, v\} \in E : \iota(u, v) + \iota(v, u) = 0; \ \ \forall u \in V \setminus \{s, t\} : \sum_{v \,|\, (u,v) \in E} \iota(u, v) = 0; \ \ \sum_{v \,|\, (s,v) \in E} \iota(s, v) = \sum_{u \,|\, (u,t) \in E} \iota(u, t) = 1\Big\}. \qquad (39)$

Lemma 5.2.
Given $G = (V, E)$: (a) if $d$ is the maximum vertex degree, then for any $e \in E$, $u_e \geq \frac{2}{d+1}$; (b) if the girth is at least $g > 2$, then for any $e \in E$, $u_e \geq \frac{g-1}{|E|}$.
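On small graphs, the uniform-spanning-tree marginals $u_e$ can be computed exactly by enumerating all spanning trees, and compared against the random-walk sampler of [4] recalled above. On $K_4$ every marginal equals $1/2$, which matches the degree bound of part (a) (as reconstructed here, $2/(d+1)$ with $d = 3$) with equality. A sketch:

```python
# Exact uniform-spanning-tree edge marginals u_e on K_4 by enumeration, plus
# the first-entrance random-walk (Aldous-Broder-style) sampler of [4].
import itertools
import random

def spanning_trees(n, edges):
    """Yield all (n-1)-edge acyclic subsets, i.e. the spanning trees."""
    for cand in itertools.combinations(edges, n - 1):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        ok = True
        for a, b in cand:
            ra, rb = find(a), find(b)
            if ra == rb:          # adding (a, b) would create a cycle
                ok = False
                break
            parent[ra] = rb
        if ok:
            yield cand

def random_spanning_tree(n, adj, rng):
    """First-entrance edges of a covering random walk: a uniform spanning tree."""
    cur = rng.randrange(n)
    seen, tree = {cur}, set()
    while len(seen) < n:
        nxt = rng.choice(adj[cur])
        if nxt not in seen:
            seen.add(nxt)
            tree.add((min(cur, nxt), max(cur, nxt)))
        cur = nxt
    return tree

n = 4
edges = list(itertools.combinations(range(n), 2))            # K_4
trees = list(spanning_trees(n, edges))
u = {e: sum(e in t for t in trees) / len(trees) for e in edges}
assert len(trees) == 16                                      # Cayley: 4^{4-2}
assert all(abs(ue - 0.5) < 1e-12 for ue in u.values())       # u_e = 2/(d+1), d = 3

# Empirical check with the sampler (2000 draws, loose tolerance).
adj = {i: [j for j in range(n) if j != i] for i in range(n)}
rng = random.Random(0)
freq = sum((0, 1) in random_spanning_tree(n, adj, rng) for _ in range(2000)) / 2000
assert abs(freq - 0.5) < 0.05
```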
This far, we have restricted to approximating Φ( θ ) by decomposing θ = E T ∼ ρ [Π T ρ ( θ )] and then using convex-ity, monotonicity and sub-linearity to produce an approximation guarantee. Such arguments would hold evenif we can decompose θ using subgraphs of G beyond trees. The choice of trees was particularly useful sincethey allow for an efficient computation of Φ . In general, graphs with bounded tree-width lend themselves toefficient computation of Φ , cf. [5].To that end, let T k ( G ) denote the set of all subgraphs of G that have treewidth bounded by k ≥ . Let P ( T k ( G )) denote the distribution over all such subgraphs. For any H ∈ T k ( G ) and ρ ∈ P ( T k ( G )) , define Π H ( · ) and Π H ρ ( · ) similar to that in (14) in Section 3 in the definition of TRW (cid:48) , L ρ ( θ ) = E H ∼ ρ (Φ(Π H ( θ ))) , U ρ ( θ ) = E H ∼ ρ (Φ(Π H ρ ( θ ))) , and (cid:98) Φ ρ ( θ ) = (cid:113) L ρ ( θ ) U ρ ( θ ) . (40)Using identical arguments as in Theorem 1.1, it follows that (cid:98) Φ ρ ( θ ) is √ κ k ρ -approximation where κ k ρ = max ρ ∈P ( T k ( G )) min e ∈ E ρ e . (41) ( (cid:15), k ) -partitioning. While such generality is pleasing its utility is in improved approximation. Indeed, in[17, 16] a seemingly different approach was proposed using graph partitioning. At its core, it was shown thatfor a large family of graphs including minor-excluded graphs and graphs with polynomial growth, there exists ρ ∈ P ( T k ( G )) which satisfies certain ( (cid:15), k ) -partitioning property (for appropriately chosen (cid:15), k ). Consider k -partitions of G defined as Part k ( G ) = { H = ( V, K (cid:91) i =1 E ( S i )) | ( S i ) ≤ i ≤ k is a partition of V and ∀ i : | S i | ≤ k } . (42)10ote that Part k ( G ) ⊂ T k ( G ) . A distribution ρ ∈ P ( Part k ( G )) ⊂ P ( T k ( G )) is called an ( (cid:15), k ) -partitioning of G if ∀ e ∈ E : 1 − (cid:15) ≤ E H ∼ ρ [ ( e ∈ H )] ≤ . (43)We state the following result whose proof can be found in Appendix E. Theorem 6.1.
Let $G$ be such that there exists $\rho \in \mathcal{P}(\mathrm{Part}_k(G)) \subset \mathcal{P}(\mathcal{T}_k(G))$ that is an $(\epsilon, k)$-partitioning of $G$. Then, for any $\theta \in \mathbb{R}^{|E|}_+$,
$$\sqrt{1-\epsilon} \leq \frac{\Phi(\theta)}{\widehat\Phi_\rho(\theta)} \leq \frac{1}{\sqrt{1-\epsilon}}. \quad (44)$$
We note that $1/\sqrt{1-\epsilon} = 1 + \frac{\epsilon}{2} + o(\epsilon)$, and hence this improves upon the result of [17, 16], which achieves an $\epsilon$ approximation error.

Conclusion. We presented a method to quantify the approximation ratio of variational approximation methods for estimating the log-partition function of discrete pairwise graphical models. As the main contribution, we quantified the approximation error as a function of properties of the underlying graph. In particular, for a variant of the tree-reweighted algorithm, the approximation ratio is a constant factor (a function of the degree) for graphs with bounded degree, and close to 1 for graphs with large ($\gg$ logarithmic) girth. The method naturally extends beyond trees, unifying prior works on the graph-partitioning-based approach.

In this work, we restricted the analysis to non-negative valued potentials and edge parameters. If the potentials are bounded, we can transform the general setting into one with non-negative potentials. However, the approximation ratio with respect to this transformed setting may not translate to that of the original setting. This may be an interesting direction for future work. Acknowledgements
This work is supported in part by projects from NSF and KACST, as well as by a Hewlett Packard graduate fellowship. We would like to thank Moïse Blanchard for useful discussions on duality.
References

[1] Noga Alon, Shlomo Hoory, and Nathan Linial. The Moore bound for irregular graphs. Graphs and Combinatorics, 18(1):53–57, 2002.
[2] Mohsen Bayati, David Gamarnik, Dimitriy Katz, Chandra Nair, and Prasad Tetali. Simple deterministic approximation algorithms for counting matchings. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 122–127, 2007.
[3] Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, Belmont, MA, 1997.
[4] Andrei Z. Broder. Generating random spanning trees. In FOCS, volume 89, pages 442–447, 1989.
[5] Venkat Chandrasekaran, Misha Chertkov, David Gamarnik, Devavrat Shah, and Jinwoo Shin. Counting independent sets using the Bethe approximation. SIAM Journal on Discrete Mathematics, 25(2):1012–1034, 2011.
[6] Michael Chertkov and Vladimir Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.
[7] Amir Dembo, Andrea Montanari, and Nike Sun. Factor models on locally tree-like graphs. Annals of Probability, 41(6):4162–4213, 2013.
[8] Jack Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1(1):127–136, 1971.
[9] David Gamarnik and Dmitriy Katz. Sequential cavity method for computing free energy and surface pressure. Journal of Statistical Physics, 137(2):205–232, 2009.
[10] David Gamarnik and Dmitriy Katz. Correlation decay and deterministic FPTAS for counting colorings of a graph. Journal of Discrete Algorithms, 12:29–47, 2012.
[11] Hans-Otto Georgii. Gibbs Measures and Phase Transitions, volume 9. Walter de Gruyter, 2011.
[12] Michel X. Goemans. Minimum bounded degree spanning trees. In FOCS, pages 273–282. IEEE, 2006.
[13] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.
[14] Alain Guenoche. Random spanning tree. Journal of Algorithms, 4(3):214–220, 1983.
[15] Mark Jerrum and Alistair Sinclair. Approximating the permanent. SIAM Journal on Computing, 18(6):1149–1178, 1989.
[16] Kyomin Jung, Pushmeet Kohli, and Devavrat Shah. Local rules for global MAP: When do they work? In NIPS, pages 871–879, 2009.
[17] Kyomin Jung and Devavrat Shah. Local approximate inference algorithms. arXiv preprint cs/0610111, 2006.
[18] Douglas J. Klein and Milan Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.
[19] Lap Chi Lau, Ramamoorthi Ravi, and Mohit Singh. Iterative Methods in Combinatorial Optimization, volume 46. Cambridge University Press, 2011.
[20] Russell Lyons and Yuval Peres. Probability on Trees and Networks, volume 42. Cambridge University Press, 2017.
[21] Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.
[22] Aaron Schild. An almost-linear time algorithm for uniform random spanning tree generation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 214–227, 2018.
[23] Leslie G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
[24] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
[25] Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, 2008.
[26] Dror Weitz. Counting independent sets up to the tree threshold. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 140–149, 2006.
[27] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2001.
[28] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8:236–239, 2003.
A Proof of Lemma 3.1
Proof.
We start by observing a few properties of the function $\Phi(\cdot)$.

Property 1. $\Phi$ is non-decreasing. For $a, b \in \mathbb{R}^n$, let $a \preceq b$ denote that every component of $a$ is less than or equal to that of $b$, i.e. $a_i \leq b_i$, $i \in [n]$. With this, for $\theta, \theta' \in \mathbb{R}^{|E|}_+$ such that $\theta \preceq \theta'$, it can be easily verified that
$$\Phi(\theta) \leq \Phi(\theta'). \quad \text{(monotonicity)}$$
Since $\Phi(0) = N \log |\mathcal{X}|$ and $0 \preceq \theta \preceq \theta'$, we have
$$N \log |\mathcal{X}| \leq \Phi(\theta) \leq \Phi(\theta'). \quad (45)$$

Property 2. $\Phi$ is sub-linear. For $\lambda \geq 1$ and $\theta \in \mathbb{R}^{|E|}_+$,
$$\Phi(\lambda\theta) \leq \lambda\, \Phi(\theta). \quad \text{(sub-linearity)}$$
The above follows from the fact that for any $s = (s_i) \in \mathbb{R}^n_+$ and $\lambda \geq 1$,
$$\sum_{i=1}^n s_i^\lambda \leq \Big(\sum_{i=1}^n s_i\Big)^\lambda.$$

Now consider any $\rho \in \mathcal{P}(\mathcal{T}(G))$. For any $T \in \mathcal{T}(G)$ and $\theta \in \mathbb{R}^{|E|}_+$, by definition of $\Pi^T$, we have that $\Pi^T(\theta) \preceq \theta$. Therefore, using the monotonicity of the log-partition function, it follows that
$$L_\rho(\theta) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T(\theta)) \leq \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\theta) \leq \Phi(\theta). \quad (46)$$
By definition $\theta = \mathbb{E}_{T \sim \rho}[\Pi^T_\rho(\theta)]$, and due to the convexity of $\Phi$ (cf. (11)), it follows that
$$\Phi(\theta) = \Phi\big(\mathbb{E}_{T \sim \rho}[\Pi^T_\rho(\theta)]\big) \leq \mathbb{E}_{T \sim \rho}[\Phi(\Pi^T_\rho(\theta))] = U_\rho(\theta). \quad (47)$$
By definition of $\kappa_\rho = \min_{e \in E} \rho_e$, it follows that
$$\Pi^T_\rho(\theta) \preceq \tfrac{1}{\kappa_\rho}\, \Pi^T(\theta), \quad \forall\, T \in \mathcal{T}(G). \quad (48)$$
And, by definition, $\tfrac{1}{\kappa_\rho} \geq 1$. Therefore, by (monotonicity) and (sub-linearity), we have
$$\Phi(\Pi^T_\rho(\theta)) \leq \Phi\big(\tfrac{1}{\kappa_\rho}\, \Pi^T(\theta)\big) \leq \tfrac{1}{\kappa_\rho}\, \Phi(\Pi^T(\theta)). \quad (49)$$
Therefore,
$$U_\rho(\theta) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T_\rho(\theta)) \leq \tfrac{1}{\kappa_\rho} \Big(\sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T(\theta))\Big) = \tfrac{1}{\kappa_\rho}\, L_\rho(\theta). \quad (50)$$
As a consequence of (46), (47) and (50), we obtain
$$\Phi(\theta) \leq U_\rho(\theta) \leq \tfrac{1}{\kappa_\rho}\, \Phi(\theta) \quad \text{and} \quad \kappa_\rho\, \Phi(\theta) \leq L_\rho(\theta) \leq \Phi(\theta). \quad (51)$$
From this, it follows that
$$\sqrt{\kappa_\rho}\, \Phi(\theta) \leq \sqrt{L_\rho(\theta)\, U_\rho(\theta)} \leq \tfrac{1}{\sqrt{\kappa_\rho}}\, \Phi(\theta), \quad (52)$$
which can be rewritten as
$$\sqrt{\kappa_\rho} \leq \frac{\widehat\Phi_\rho(\theta)}{\Phi(\theta)} \leq \frac{1}{\sqrt{\kappa_\rho}}. \quad (53)$$
By optimizing over the choice of $\rho = \rho^\star$, we conclude that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa_{\rho^\star}}$.

B Proof of Lemma 4.1
Proof. (See illustration in Figure 1.) For $w = (w_e)_{e \in E}$, denote by $f(w)$ the number of distinct values in its support:
$$f(w) = |\{ w_e : e \in E,\ w_e \neq 0 \}|. \quad (54)$$
To prove the lemma, it suffices to show that there exists an optimal solution of Dual such that $f(w) = 1$. We will prove that if $w$ is an optimal solution and $f(w) > 1$, then we can build $w'$ of the same objective value such that $f(w') \leq f(w) - 1$. Repeating this until $f(w) = 1$ concludes the proof.

Let $w$ be an optimal solution with $f(w) > 1$. We consider the edges $e_1, e_2, \ldots, e_{|E|}$ ordered by their weights, i.e.
$$w_{e_1} \geq \ldots \geq w_{e_{|E|}}. \quad (55)$$
In what follows, we will make sure that the ordering on the edges never changes; we therefore allow ourselves to write $w_i$ instead of $w_{e_i}$. Now, the objective of Dual achieved by such an optimal $w$ corresponds to the weight of a maximum-weight spanning tree. Let us use Kruskal's algorithm to find such a tree. Recall that Kruskal's algorithm greedily selects edges from higher to lower weight as long as they do not create a cycle with previously selected edges. We denote by $I_T = \{t_1 < \ldots < t_{N-1}\}$ the indices of the edges selected by the algorithm to construct tree $T$, and let $I_{E \setminus T} = \cup_{k=1}^{N-1} \{ s : t_k < s < t_{k+1} \}$ denote the indices of the edges not part of $T$, with the notation $t_N = |E| + 1$. The weight of the maximum spanning tree is then $w(T) = \sum_{k=1}^{N-1} w_{t_k}$. Note that $t_1 = 1$ and $t_2 = 2$, since a cycle requires 3 or more edges. By definition, $w_{j-1} \geq w_j$ for $2 \leq j \leq |E|$. Now, if $w_{j-1} > w_j$, then we claim that $j \in I_T$. This is because for $1 \leq k \leq N-1$, if $(w_{t_k}, \ldots, w_{t_{k+1}-1})$ are not all equal, setting them all to their average strictly decreases $w_{t_k}$ while preserving $w \in \mathcal{P}(E)$ as well as the order on the edges, contradicting the optimality of $w$ for Dual. Therefore, $w$ is piecewise constant, with discontinuities only appearing at indices $j \in I_T$.

If $f(w) = 2$ and all weights are positive, we denote by $2 \leq k \leq N-1$ the index such that $w_{t_k - 1} > w_{t_k} > 0$, and we have:
$$w_1 = \ldots = w_{t_k - 1} > w_{t_k} = \ldots = w_{|E|}. \quad (56)$$
In this case, the optimal objective value for Dual is equal to $(k-1)\, w_1 + (N-k)\, w_{t_k}$. To make $w$ constant on its support while preserving the order on the weights, there are two possibilities. Either transfer all weight from $(w_{t_k}, \ldots, w_{|E|})$ to $(w_1, \ldots, w_{t_k-1})$ until $(w_{t_k}, \ldots, w_{|E|})$ reaches zero; the objective will then involve $w_1 + \frac{|E| - t_k + 1}{t_k - 1}\, w_{t_k}$. Or transfer all weight from $(w_1, \ldots, w_{t_k-1})$ to $(w_{t_k}, \ldots, w_{|E|})$ until all weights are equal; the objective will then involve $w_{t_k} + \frac{t_k - 1}{|E| - t_k + 1}\, (w_1 - w_{t_k})$. Because either $\frac{|E| - t_k + 1}{t_k - 1} \leq 1$ or $\frac{t_k - 1}{|E| - t_k + 1} \leq 1$, one of these transfers does not increase the objective and yields $f(w) = 1$.

If $f(w) = 2$ and some weights are $0$, denote by $k_0$ the smallest index such that $w_{t_{k_0}} = 0$. The method above still holds when replacing $|E| - t_k + 1$ by $t_{k_0} - t_k$.

Now suppose $f(w) \geq 3$; making sure that the order on the weights is preserved requires extra caution. In addition to $k$ and $k_0$ (if required), we denote by $k_1$ the index of the discontinuity that follows $k$. We have:
$$\ldots = w_{t_k - 1} > w_{t_k} = \ldots = w_{t_{k_1} - 1} > w_{t_{k_1}} = \ldots \quad (57)$$
In the event that we want to transfer weight from $(w_{t_{k_1}}, \ldots)$ to $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$, we must make sure that $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$ does not exceed $w_{t_k - 1}$. If $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$ attains $w_{t_k - 1}$, the transfer must stop at equality, and one should observe that we have then strictly decreased $f(w)$ by $1$, because the discontinuity at $w_{t_k}$ has disappeared and no new discontinuity was created.

In summary, we have argued that if $w$ is an optimal solution and $f(w) > 1$, then we can build $w'$ of the same (optimal) objective value such that $f(w') \leq f(w) - 1$. This completes the proof of the lemma.

Figure 1: An example of an optimal weight assignment for the problem Dual on nine different graphs. The solution was found by an interior point method using a linear programming solver. Note that for most graphs, the solution reached is already constant on its support. On two of the graphs, note that evening out the weights would not increase the weight of the maximum spanning tree.

C Proof of Lemma 4.2
Proof.
We prove the equality by establishing the inequality in both directions.

Establishing $\min_{S \subset V} \frac{|S|-1}{|E(S)|} \geq \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$: For $S \subset V$, note that $V(E(S)) \subset S$ and $c(E(S)) \geq 1$, and therefore
$$\frac{|S|-1}{|E(S)|} \geq \frac{|V(E(S))| - c(E(S))}{|E(S)|}, \quad \text{with } E(S) \subset E.$$
Thus, $\min_{S \subset V} \frac{|S|-1}{|E(S)|}$ minimizes a larger objective function over a smaller set compared to $\min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$. The inequality follows immediately.

Establishing $\min_{S \subset V} \frac{|S|-1}{|E(S)|} \leq \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$: Let $F^\star \subset E$ be a minimizer of $\min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$, and let $H = (V(F^\star), F^\star)$. By optimality, all connected components of $H$ must be vertex-induced subgraphs of $G$: if not, it would be possible to add edges to $H$ without changing its number of vertices or its number of connected components, contradicting optimality. In other words, there exist disjoint subsets $S_i$, $1 \leq i \leq c(H)$, of $V(F^\star)$ with $V(F^\star) = \cup_{i=1}^{c(H)} S_i$ and $F^\star = \cup_{i=1}^{c(H)} E(S_i)$. If $c(H) = 1$, the inequality follows immediately. If $c(H) \geq 2$, denote by $H \setminus H_1$ the graph obtained by removing $H_1 = (S_1, E(S_1))$ from $H$. Note that $c(H \setminus H_1) = c(H) - 1$ and that $c(H_1) = 1$. By Lemma C.1, for $a, b, c, d \in \mathbb{R}_+$: $\min(\frac{a}{b}, \frac{c}{d}) \leq \frac{a+c}{b+d}$. Therefore,
$$\min\left( \frac{|V(H_1)| - c(H_1)}{|E(H_1)|},\; \frac{|V(H \setminus H_1)| - c(H \setminus H_1)}{|E(H \setminus H_1)|} \right) \leq \frac{|V(H)| - c(H)}{|E(H)|}. \quad (58)$$
If $H_1$ achieves the minimum on the left-hand side, this concludes the proof. If $H \setminus H_1$ achieves the minimum, simply iterate the above argument until a single connected component remains, which concludes the proof.

Lemma C.1.
For any $a, b, c, d \in \mathbb{R}_+$,
$$\min\Big(\frac{a}{b}, \frac{c}{d}\Big) \leq \frac{a+c}{b+d}. \quad (59)$$
Proof. Let $a, b, c, d \in \mathbb{R}_+$. Then the following sequence of statements holds, leading to the proof of the claim:
$$ad \leq bc \quad \text{or} \quad bc \leq ad, \quad (60)$$
$$\min\big(ad(b+d),\; cb(b+d)\big) \leq (a+c)\, bd, \quad (61)$$
$$\min\Big(\frac{a}{b}, \frac{c}{d}\Big) \leq \frac{a+c}{b+d}. \quad (62)$$

D Proofs of Lemmas 4.3 and 4.4
Proof of Lemma 4.3.
Assume $G$ has maximum average degree bounded by $\bar d$, where by definition
$$\bar d = \max_{S \subset V} \frac{2|E(S)|}{|S|}. \quad (63)$$
Therefore, for any $S \subset V$, $|E(S)| \leq \frac{\bar d\, |S|}{2}$. Moreover, there can be at most $\binom{|S|}{2}$ edges in a graph over the vertex set $S$, and hence $|E(S)| \leq \frac{|S|(|S|-1)}{2}$. Therefore, we obtain
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{\bar d}\Big(1 - \frac{1}{|S|}\Big) = L_1(|S|), \quad (64)$$
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{|S|} = L_2(|S|). \quad (65)$$
Therefore,
$$\frac{|S|-1}{|E(S)|} \geq \min_{x \in \mathbb{R}_+} \big\{ \max\big(L_1(x), L_2(x)\big) \big\}. \quad (66)$$
Note that $L_1$ is increasing and bounded, whereas $L_2$ is decreasing. Therefore, $\max(L_1(x), L_2(x))$ with $x \in \mathbb{R}_+$ reaches its minimum at $x$ such that $L_1(x) = L_2(x)$, which gives $x = \bar d + 1$. Therefore, we conclude that for all $S \subset V$,
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{\bar d + 1}. \quad (67)$$

Proof of Lemma 4.4.
Let $G$ have girth $g > 2$. Then all subgraphs of $G$ have girth at least $g$. The generalised Moore bound (obtained in [1]) then gives
$$\forall S \subset V: \quad |S| \geq 1 + d_S \sum_{i=0}^{(g-3)/2} (d_S - 1)^i \quad \text{if } g \text{ is odd}, \quad (68)$$
$$|S| \geq 2 \sum_{i=0}^{(g-2)/2} (d_S - 1)^i \quad \text{if } g \text{ is even}, \quad (69)$$
with $d_S = \frac{2|E(S)|}{|S|}$. We will only keep a weaker version of this bound that does not depend on the parity of $g$. Specifically, for all $S \subset V$:
$$|S| \geq \Big(\frac{2|E(S)|}{|S|} - 1\Big)^{\frac{g-2}{2}}. \quad (70)$$
Therefore, $|E(S)| \leq \frac{1}{2}\big(|S|^{\frac{2}{g-2}+1} + |S|\big)$ for all $S \subset V$. Subsequently, we have
$$\frac{|S|-1}{|E(S)|} \geq \frac{2\big(1 - \frac{1}{|S|}\big)}{1 + |S|^{\frac{2}{g-2}}} \geq \frac{2\big(1 - \frac{1}{|S|}\big)}{1 + N^{\frac{2}{g-2}}}. \quad (71)$$
This bound is clearly increasing with $|S|$. Also note that if $|S| \leq g - 1$, the subgraph $(S, E(S))$ can have no cycle and therefore $\frac{|S|-1}{|E(S)|} \geq 1$. The worst case is therefore attained for $|S| = g$, where we have:
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{1 + N^{\frac{2}{g-2}}}\Big(1 - \frac{1}{g}\Big). \quad (72)$$

E Proof of Lemma 5.1
Proof.
We shall use Hoeffding's inequality: for any bounded random variable $a \leq X \leq b$, the deviation of its empirical average $\overline{X}_n$ computed from $n$ independent samples is such that, for any $t > 0$,
$$\mathbb{P}\big(|\mathbb{E}(X) - \overline{X}_n| \geq t\big) \leq 2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big). \quad (73)$$
Another version of this inequality, when $\mathbb{E}(X) > 0$, is as follows: for any $\epsilon > 0$,
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\overline{X}_n}{\mathbb{E}(X)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\Big(-\frac{2n\epsilon^2\, \mathbb{E}(X)^2}{(b-a)^2}\Big). \quad (74)$$
An immediate consequence is that $\hat u^n$ is a good approximation of $u$. For any $e \in E$,
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 u_e^2\big), \quad (75)$$
and therefore, by the union bound,
$$\mathbb{P}\Big(\forall e \in E: 1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon\Big) \geq 1 - 2|E|\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (76)$$
Another consequence is that $L_{\hat u^n}$ is a good approximation of $L_u$. Indeed, consider the random variable $\Phi(\Pi^T(\theta))$ of mean $L_u(\theta)$ and of empirical average $L_{\hat u^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}(\theta))$, and note that this variable is bounded as follows: $0 \leq \Phi(\Pi^T(\theta))\ (\leq \Phi(\theta))\ \leq \frac{1}{\kappa_u} L_u(\theta)$. We have
$$\mathbb{P}\Big(1-\epsilon \leq \frac{L_{\hat u^n}(\theta)}{L_u(\theta)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (77)$$
Regarding $U_{\hat u^n}(\theta)$, the discussion requires an additional argument, because $\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_{\hat u^n}(\theta))$ is not a sum of independent random variables. Instead, let us focus on the close quantity $\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))$, for which we have $0 \leq \Phi(\Pi^T_u(\theta))\ (\leq \frac{1}{\kappa_u}\Phi(\theta))\ \leq \frac{1}{\kappa_u} U_u(\theta)$ and therefore
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))}{U_u(\theta)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (78)$$
Fortunately, if (76) is satisfied, this quantity turns out to be a good approximation of $U_{\hat u^n}(\theta)$. Indeed, assuming that $\forall e \in E: 1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon$, we have that for all $T \in \mathcal{T}(G)$,
$$(1-\epsilon)\, \Pi^T_u(\theta) \preceq \Pi^T_{\hat u^n}(\theta) \preceq (1+\epsilon)\, \Pi^T_u(\theta), \quad (79)$$
and therefore, by (monotonicity) and (sub-linearity),
$$(1-\epsilon)\, \Phi(\Pi^T_u(\theta)) \leq \Phi(\Pi^T_{\hat u^n}(\theta)) \leq (1+\epsilon)\, \Phi(\Pi^T_u(\theta)), \quad (80)$$
which shows
$$(1-\epsilon) \leq \frac{U_{\hat u^n}(\theta)}{\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))} \leq (1+\epsilon). \quad (81)$$
Therefore, by the union bound,
$$\mathbb{P}\Big((1-\epsilon)^2 \leq \frac{U_{\hat u^n}(\theta)}{U_u(\theta)} \leq (1+\epsilon)^2\Big) \geq 1 - (2|E|+2)\exp\big(-2n\kappa_u^2\epsilon^2\big). \quad (82)$$
By putting together (77) and (82), we obtain
$$\mathbb{P}\Big((1-\epsilon)^2 \leq \frac{\widehat\Phi_{\hat u^n}(\theta)}{\widehat\Phi_u(\theta)} \leq (1+\epsilon)^2\Big) \geq 1 - (2|E|+4)\exp\big(-2n\kappa_u^2\epsilon^2\big), \quad (83)$$
and by the arguments of Lemma 3.1, we can conclude that
$$\mathbb{P}\Big(\sqrt{\kappa_u}\, (1-\epsilon)^2 \leq \frac{\widehat\Phi_{\hat u^n}(\theta)}{\Phi(\theta)} \leq \frac{(1+\epsilon)^2}{\sqrt{\kappa_u}}\Big) \geq 1 - (2|E|+4)\exp\big(-2n\kappa_u^2\epsilon^2\big). \quad (84)$$
This completes the proof of Lemma 5.1.

F Proof of Lemma 5.2
Bounded degree graph $G$. First assume that $G$ has maximum degree $d$. Consider any edge $e = (s, t) \in E$. Denote by $\mathcal{N}(s), \mathcal{N}(t) \subset V$ the neighbours of $s$ and $t$. Consider a current $\iota : V \times V \to \mathbb{R}$ which is a solution of the optimization problem corresponding to the effective resistance, as defined in (39). By definition, we have that the effective resistance $u_e$ for $e \in E$ is given by
$$u_e = \sum_{(u,v) \in E} \iota(u,v)^2 \geq \iota(s,t)^2 + \sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2 + \sum_{u \in \mathcal{N}(t) \setminus \{s\}} \iota(u,t)^2. \quad (85)$$
By the constraints of the optimization problem, the sum of the currents entering the source $s$ and leaving the sink $t$ is equal to $1$ (whereas it is null at the other vertices). Therefore, focusing on $s$, we have $\sum_{u \in \mathcal{N}(s) \setminus \{t\}} |\iota(s,u)| \geq 1 - |\iota(s,t)|$. By applying the Cauchy–Schwarz inequality, we have that
$$\Big(\sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2\Big) \times \Big(\sum_{u \in \mathcal{N}(s) \setminus \{t\}} 1\Big) \geq \big(1 - |\iota(s,t)|\big)^2. \quad (86)$$
Recall that $G$ has maximum vertex degree $d$, and therefore $|\mathcal{N}(s) \setminus \{t\}| \leq d - 1$. Therefore,
$$\sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2 \geq \frac{\big(1 - |\iota(s,t)|\big)^2}{d-1}. \quad (87)$$
Because the same holds for the term $\sum_{u \in \mathcal{N}(t) \setminus \{s\}} \iota(u,t)^2$, we obtain from (85) that
$$u_e \geq \iota(s,t)^2 + \frac{2\big(1 - |\iota(s,t)|\big)^2}{d-1}. \quad (88)$$
This expression holds for all possible values of $\iota(s,t)$. We note that for any given $\lambda \in \mathbb{R}_+$,
$$\inf_{x \in \mathbb{R}}\ x^2 + (1-x)^2\, \lambda \geq \frac{\lambda}{1+\lambda}. \quad (89)$$
Therefore, we conclude that for a graph $G$ with maximum degree $d$,
$$u_e \geq \frac{2}{d+1}. \quad (90)$$

Graph $G$ with girth $g$. We now assume that $G$ has girth $g$. As before, let $e = (s, t) \in E$. Denote by $G \setminus \{e\} = (V, E \setminus \{e\})$ the graph obtained by removing edge $e$ from $G$. For $0 \leq k \leq \frac{g-3}{2}$, we define
$$E_k = \{ (u,v) \in E : d_{G \setminus \{e\}}(s, u) = k,\ d_{G \setminus \{e\}}(s, v) = k+1 \}, \quad (91)$$
where $d_{G \setminus \{e\}}(s, u)$ denotes the shortest-path distance between vertices $s, u$ in the graph $G$ excluding edge $e$. That is, $E_k$ is the set of edges connecting vertices at distance $k$ from $s$ in $G \setminus \{e\}$ to vertices at distance $k+1$ from $s$ in $G \setminus \{e\}$. Since $k \leq \frac{g-3}{2}$, all $E_k$ are disjoint, and hence the current $\iota$ satisfies
$$u_e \geq \iota(s,t)^2 + \sum_{k=0}^{(g-3)/2} \sum_{(u,v) \in E_k} \iota(u,v)^2. \quad (92)$$
For $0 \leq k \leq \frac{g-3}{2}$, note that $E_k \cup \{e\}$ defines a cut of $G$. Therefore, by Kirchhoff's law, $\sum_{(u,v) \in E_k} |\iota(u,v)| \geq 1 - |\iota(s,t)|$. Using the Cauchy–Schwarz inequality, we obtain:
$$\Big(\sum_{(u,v) \in E_k} \iota(u,v)^2\Big) \times |E_k| \geq \big(1 - |\iota(s,t)|\big)^2. \quad (93)$$
By summing up all these inequalities, we obtain
$$\sum_{k=0}^{(g-3)/2} \sum_{(u,v) \in E_k} \iota(u,v)^2 \geq \big(1 - |\iota(s,t)|\big)^2 \sum_{k=0}^{(g-3)/2} \frac{1}{|E_k|}. \quad (94)$$
Note that if a sequence $(m_k)_{k \geq 0}$ satisfies $\sum_{k=1}^{l} m_k \leq |E|$, then $\sum_{k=1}^{l} \frac{1}{m_k} \geq \frac{l^2}{|E|}$. Therefore, because all $E_k$ are disjoint, $\sum_{k=0}^{(g-3)/2} \frac{1}{|E_k|} \geq \frac{(g-1)^2}{4|E|}$. Inserting this in (92), we obtain
$$u_e \geq \iota(s,t)^2 + \big(1 - |\iota(s,t)|\big)^2\, \frac{(g-1)^2}{4|E|}. \quad (95)$$
Using (89), we obtain
$$u_e \geq \frac{1}{1 + \frac{4|E|}{(g-1)^2}}. \quad (96)$$
This completes the proof of Lemma 5.2.

G Proof of Theorem 6.1
Proof.
The proof follows by establishing that $\kappa^k_\rho$, as defined in (41), for $\rho \in \mathcal{P}(\mathrm{Part}_k(G))$, satisfies
$$\kappa^k_\rho \geq 1 - \epsilon \quad (97)$$
if $\rho$ is an $(\epsilon, k)$-partitioning. Indeed, by the definition of an $(\epsilon, k)$-partitioning, we have that for any $e \in E$,
$$\rho_e = \mathbb{E}_{H \sim \rho}[\mathbb{1}(e \in H)] \geq 1 - \epsilon. \quad (98)$$
Therefore,
$$\kappa^k_\rho = \min_{e \in E} \rho_e \geq 1 - \epsilon. \quad (99)$$
Subsequently, using arguments identical to those in the proof of Lemma 3.1, it follows that $\widehat\Phi_\rho(\theta)$ is a $1/\sqrt{\kappa^k_\rho}$-approximation. That is,
$$\sqrt{1-\epsilon} \leq \frac{\Phi(\theta)}{\widehat\Phi_\rho(\theta)} \leq \frac{1}{\sqrt{1-\epsilon}}.$$
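The sandwich bounds underlying both Lemma 3.1 and Theorem 6.1 can be verified by brute force on a toy model. The sketch below (the triangle graph, the agreement potentials and the uniform tree distribution are illustrative choices, not from the paper) checks $L_\rho(\theta) \leq \Phi(\theta) \leq U_\rho(\theta)$, the relation $U_\rho(\theta) \leq \frac{1}{\kappa_\rho} L_\rho(\theta)$, and the resulting $\sqrt{\kappa_\rho}$ guarantee for $\widehat\Phi_\rho(\theta) = \sqrt{L_\rho(\theta)\, U_\rho(\theta)}$:

```python
import itertools
import math

# Triangle graph on {0, 1, 2}, binary variables, with the illustrative
# potential phi_e(x_s, x_t) = 1 if x_s == x_t else 0.
theta = {(0, 1): 1.0, (1, 2): 0.7, (0, 2): 0.4}

def log_partition(th):
    """Brute-force Phi(theta) = log sum_x exp(sum_e theta_e * phi_e(x_e))."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=3):
        energy = sum(w * (x[s] == x[t]) for (s, t), w in th.items())
        total += math.exp(energy)
    return math.log(total)

# Uniform distribution over the 3 spanning trees of the triangle; each edge
# appears in 2 of the 3 trees, so rho_e = 2/3 for every edge.
trees = [[(0, 1), (1, 2)], [(1, 2), (0, 2)], [(0, 1), (0, 2)]]
kappa = 2 / 3  # kappa_rho = min_e rho_e

phi = log_partition(theta)
# L: trees keep theta as is; U: edge weights are rescaled by 1/rho_e.
L = sum(log_partition({e: theta[e] for e in T}) for T in trees) / 3
U = sum(log_partition({e: theta[e] / kappa for e in T}) for T in trees) / 3
est = math.sqrt(L * U)

assert L <= phi <= U                                             # (46), (47)
assert U <= L / kappa + 1e-9                                     # (50)
assert math.sqrt(kappa) * phi <= est <= phi / math.sqrt(kappa)   # (52)
print(round(phi, 3), round(est, 3))
```

Here $\sqrt{\kappa_\rho} = \sqrt{2/3} \approx 0.816$, so the estimate $\widehat\Phi_\rho(\theta)$ is guaranteed to lie within roughly 22% of the true log-partition function, as the assertions confirm.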