Approximating the Log-Partition Function
Romain Cosson [email protected]
Devavrat Shah [email protected]
Abstract
Variational approximations, such as mean-field (MF) and tree-reweighted (TRW), provide a computationally efficient approximation of the log-partition function for a generic graphical model. TRW provably provides an upper bound, but the approximation ratio is generally not quantified. As the primary contribution of this work, we provide an approach to quantify the approximation ratio through properties of the underlying graph structure. Specifically, we argue that (a variant of) TRW produces an estimate that is within a factor $1/\sqrt{\kappa(G)}$ of the true log-partition function for any discrete pairwise graphical model over graph $G$, where $\kappa(G) \in (0, 1]$ captures how far $G$ is from tree structure, with $\kappa(G) = 1$ for trees and $2/N$ for the complete graph over $N$ vertices. As a consequence, the approximation ratio is $1$ for trees, $\sqrt{(d+1)/2}$ for any graph with maximum average degree $d$, and approximately $1 + 1/(2\beta)$, for large $\beta$, for graphs with girth (shortest cycle) at least $\beta \log N$. In general, $\kappa(G)$ is the solution of a max-min problem associated with $G$ that can be evaluated in polynomial time for any graph. Using samples from the uniform distribution over the spanning trees of $G$, we provide a near linear-time variant that achieves an approximation ratio equal to the inverse of the square root of the minimal (across edges) effective resistance of the graph. We connect our results to the graph partition-based approximation method and thus provide a unified perspective.

Keywords: variational inference, log-partition function, spanning tree polytope, minimum effective resistance, min-max spanning tree, local inference
The Setup.
We consider a collection of $N$ discrete valued random variables, $X = (X_1, \ldots, X_N)$, whose joint distribution is modeled as a pairwise graphical model. Let $G = (V, E)$ represent the associated graph with vertices $V = \{1, \ldots, N\}$ representing the $N$ variables and $E \subset V \times V$ representing edges. Let each variable take value in a discrete set $\mathcal{X} \subset \mathbb{R}_+$. For $e \in E$, let $\phi_e : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ denote the edge potential and let $\theta_e \in \mathbb{R}_+$ denote the associated parameter. This leads to a joint distribution with probability mass function

$\mathbb{P}(X = x;\, \theta) \propto \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big) = \frac{1}{Z(\theta)} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big), \qquad (1)$

where $x = (x_1, \ldots, x_N) \in \mathcal{X}^N$, $x_e$ is shorthand for $(x_s, x_t)$ if $e = (s, t) \in E$, $\theta = (\theta_e : e \in E) \in \mathbb{R}_+^{|E|}$, and the normalizing constant or partition function $Z(\theta)$ is defined as

$Z(\theta) = \sum_{x \in \mathcal{X}^N} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big). \qquad (2)$

Such pairwise graphical models provide a succinct description of complicated joint distributions. However, the key challenge in utilizing them (e.g. for inference) arises in estimating the partition function $Z(\theta)$. In this work, our interest is in computing the logarithm of $Z(\theta)$, precisely

$\Phi(\theta) = \log Z(\theta) = \log\Big[\sum_{x \in \mathcal{X}^N} \exp\Big(\sum_{e \in E} \theta_e \phi_e(x_e)\Big)\Big]. \qquad (3)$

Computing $Z(\theta)$ is known to be computationally hard in general, i.e. #P-complete, due to its relation to counting discrete objects such as independent sets, cf. [23, 15]. Due to reductions from discrete optimization problems to log-partition function computation, approximating $\Phi(\theta)$, even up to a multiplicative error, can be NP-hard, cf. [26, 25, 7]. Therefore, the goal is to develop a polynomial time (in $N$) approximation method for computing $\Phi(\theta)$ or $Z(\theta)$ with provable guarantees on the approximation error.
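For intuition, on very small instances $\Phi(\theta)$ can be computed by the direct enumeration in (2)-(3); the exponential cost in $N$ is what motivates the approximation methods studied here. A minimal sketch (the graph, alphabet and agreement potentials below are illustrative choices, not from the paper):

```python
# Brute-force evaluation of Z(theta) and Phi(theta) as in (2)-(3).
# Cost is O(|X|^N): feasible only for tiny N.
import itertools
import math

def log_partition(N, edges, theta, phi, alphabet):
    """Phi(theta) = log sum_x exp(sum_e theta_e * phi(x_s, x_t))."""
    total = 0.0
    for x in itertools.product(alphabet, repeat=N):
        weight = sum(th * phi(x[a], x[b]) for th, (a, b) in zip(theta, edges))
        total += math.exp(weight)
    return math.log(total)

# A 3-cycle over binary variables with agreement potentials phi(a,b) = 1[a=b].
edges = [(0, 1), (1, 2), (0, 2)]
theta = [0.5, 0.5, 0.5]
phi = lambda a, b: 1.0 if a == b else 0.0
Phi = log_partition(3, edges, theta, phi, [0, 1])
# Z = 2 e^{1.5} + 6 e^{0.5}: 2 configurations agree on all 3 edges, 6 on exactly one.
assert abs(Phi - math.log(2 * math.exp(1.5) + 6 * math.exp(0.5))) < 1e-9
```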
Specifically, let ALG denote such an approximation method that takes the problem description $(G, (\phi_e)_{e \in E}, \mathcal{X})$ as input and produces an estimate $\widehat{\Phi}_{\mathrm{ALG}}(\theta)$ of $\Phi(\theta)$ for any given $\theta \in \mathbb{R}_+^{|E|}$. Then, we define the approximation ratio associated with ALG, $\alpha(G, \mathrm{ALG}) \geq 1$, as

$\alpha(G, \mathrm{ALG}) = \sup_{\theta \in \mathbb{R}_+^{|E|}} \max\Big(\frac{\Phi(\theta)}{\widehat{\Phi}_{\mathrm{ALG}}(\theta)}, \frac{\widehat{\Phi}_{\mathrm{ALG}}(\theta)}{\Phi(\theta)}\Big). \qquad (4)$

Prior Work.
There is a long literature on developing computationally efficient approximation methods for the log-partition function, with significant progress in the past two decades. We recall a few relevant prior works here.

A collection of methods, classified as variational approximations, utilize the (Gibbs) variational characterization of the log-partition function when the distribution (1) is viewed as a member of an exponential family, cf. [11, 25]. Specifically, $\Phi(\theta)$ can be viewed as the solution of a high-dimensional constrained maximization problem. By solving the problem with additional constraints, one obtains a valid lower bound such as that given by mean-field methods. By utilizing the convexity of $\Phi(\cdot)$ and restricting it to tree-structured subgraphs of $G$, one obtains a valid upper bound such as that given by the tree-reweighted (TRW) method. By relaxing the constraints and adapting the objective to allow for pairwise pseudo-marginals, one obtains heuristics such as Belief Propagation (BP) via the Bethe approximation [27, 28]. While BP does not provide a provable upper or lower bound in general, for graphs with large girth, such as sparse random graphs, and distributions with spatial decay of correlation, it provides an excellent approximation, cf. [21]. The spatial decay of correlation property has been further exploited to obtain deterministic Fully Polynomial Time Approximation Schemes (FPTAS) for various counting problems, i.e. computing partition functions, cf. [26, 10, 2, 9]. The approximation error of belief propagation for computing the log-partition function has been studied through the connection to loop calculus as well, cf. [6, 5].

In another line of works, graph partitioning based methods have been proposed to provide Polynomial Time Approximation Schemes (PTAS) for a class of graphs that satisfy certain graph partitioning properties, which includes minor-excluded graphs [17] and graphs with polynomial growth [16].

In summary, despite the progress, the approximation ratio $\alpha(G, \mathrm{ALG})$ for any of the known variational approximation methods ALG remains undetermined.

Summary of Contributions.
As the main contribution, for a simple variant of the tree-reweighted (TRW) method, denoted as TRW′, we quantify $\alpha(G, \mathrm{TRW}')$ for any $G$. The TRW′ method is described in Section 3 and produces an estimate of $\Phi(\cdot)$ in polynomial time. Specifically, we establish

Theorem 1.1.
For any graph $G$, the approximation ratio of TRW′ is such that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa(G)}$ where

$\kappa(G) = \min_{S \subset V} \frac{|S| - 1}{|E(S)|}, \qquad (5)$

with $E(S) = E \cap (S \times S)$ for any $S \subset V$.

The term $\kappa(G)$ captures the proximity of $G$ with respect to the tree structure across all of its induced subgraphs: for $S \subset V$, the induced subgraph $(S, E(S))$ would have at most $|S| - 1$ edges if it were cycle free, but it has $|E(S)|$ edges. Therefore, the ratio $(|S| - 1)/|E(S)|$ measures how far it is from a tree: $1$ if a connected tree and $2/|S|$ if a complete graph. The minimum over all possible $S \subset V$ of this ratio captures how far $G$ is from a tree structure.

Using this characterization, we provide bounds on $\alpha(G, \mathrm{TRW}')$ in terms of various simpler graph properties in Section 4.4. Specifically, we show that for any graph with maximum average vertex degree $d$, $\alpha(G, \mathrm{TRW}') \leq \sqrt{(d+1)/2}$. And for graphs with girth (i.e. length of shortest cycle) $g > 2$, $\alpha(G, \mathrm{TRW}') \leq \sqrt{\big(1 + N^{2/(g-2)}(1 - 2/g)\big)/2}$: for $g \geq \beta \log N$, it is $1 + 1/(2\beta) + o(1/\beta)$ for large $\beta$. This means that for any $G$ with large ($\gg \log N$) girth, $\alpha(G, \mathrm{TRW}') \approx 1$.

In general, we establish that $\kappa(G)$ can be evaluated in polynomial time for any graph $G$ by solving an appropriate linear program on the (polynomially-)extended spanning tree polytope. This is explained in Section 4.

The tree-reweighted variant TRW′ considered here requires solving a certain optimization problem over the tree polytope of the graph $G$. Though it can be computed in polynomial time, it can be quite involved. With an eye towards near linear-time (in $|E|$) computation, we also consider a variant that, instead of optimizing over the tree polytope, simply considers a feasible point in the tree polytope that corresponds to the uniform distribution over spanning trees of $G$.
Using the near-linear time sampling of spanning trees from [22], we provide a randomized approximation method. Its approximation ratio $\alpha(G)$ is bounded above by $1/\sqrt{\min_{e \in E} r_e}$, where $r_e > 0$ is the effective resistance of $e \in E$ for the graph $G = (V, E)$ (see (39) for the precise definition). While in general this provides a weaker approximation guarantee than that of TRW′, for graphs with vertex degree bounded by $d$ it leads to a similar guarantee of $\alpha(G) \leq \sqrt{(d+1)/2}$.

We show that the results based on graph partitioning, cf. [17, 16], can be recovered as a natural extension of the variant of TRW introduced in this work by allowing for general subgraphs with bounded tree-width beyond trees.

We take note of the fact that though the results discussed in this work are primarily for the variant of TRW described in Section 3, as an immediate consequence of our results, $\alpha(G, \mathrm{TRW}) \leq 1/\kappa(G)$, i.e. it is bounded by the square of that derived in Theorem 1.1. As discussed in Section 7, understanding the tightness of this characterization, especially for TRW, remains an important open direction.
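For concreteness, the quantity $\kappa(G)$ of (5) can be evaluated by brute force on small graphs, confirming the two extremes quoted above ($\kappa = 1$ for trees, $\kappa = 2/N$ for the complete graph). A sketch (exponential in $N$; the polynomial-time evaluation is via the linear program of Section 4):

```python
# kappa(G) = min over vertex subsets S with |E(S)| > 0 of (|S|-1)/|E(S)|,
# computed by explicit enumeration of all subsets (for intuition only).
import itertools

def kappa(n, edges):
    best = 1.0  # a single edge gives (2 - 1)/1 = 1, so kappa(G) <= 1
    for k in range(2, n + 1):
        for S in itertools.combinations(range(n), k):
            s = set(S)
            m = sum(1 for (a, b) in edges if a in s and b in s)
            if m > 0:
                best = min(best, (k - 1) / m)
    return best

path = [(i, i + 1) for i in range(4)]           # a tree on N = 5 vertices
K5 = list(itertools.combinations(range(5), 2))  # complete graph on N = 5
assert kappa(5, path) == 1.0                    # trees: kappa = 1
assert abs(kappa(5, K5) - 2 / 5) < 1e-12        # complete graph: kappa = 2/N
```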
Outline of Paper.
In Section 2, we provide some preliminaries including recalling the tree-reweighted (TRW) method. In Section 3, we provide a modification of TRW and characterize its approximation guarantee. In Section 4, we provide a linear optimization characterization of the approximation guarantee which leads to the proof of Theorem 1.1. We discuss implications of Theorem 1.1 for various classes of graphs as well. In Section 5, we present a near linear-time variant of the modified TRW based on sampling from the uniform distribution over spanning trees of $G$. We derive approximation guarantees for the resulting method in terms of the effective resistance of the graph and derive its implications. In Section 6, we discuss the connection with graph partitioning methods by extending the modified TRW of Section 3 to allow for bounded tree-width subgraphs beyond trees. We argue how the results of [17, 16] follow naturally. Section 7 discusses directions for future work.

We start by recalling the variational characterization of the log-partition function $\Phi(\cdot)$. Let $\mathcal{P}(\mathcal{X}^N)$ denote the space of all probability distributions over $\mathcal{X}^N$. Then, the Gibbs variational characterization states that

$\Phi(\theta) = \sup_{q \in \mathcal{P}(\mathcal{X}^N)} \mathbb{E}_{x \sim q}\Big(\sum_{e} \theta_e \phi_e(x_e)\Big) + H(q), \qquad (6)$

where $H(q) = -\mathbb{E}_{x \sim q}(\log(q(x)))$ is the entropy of $q$. While computationally (6) does not provide a tractable solution for evaluating $\Phi(\cdot)$, it provides a framework to develop approximation methods; such methods, inspired by this characterization, are called variational approximations.

As mentioned earlier, the classical mean-field method consists in relaxing $\mathcal{P}(\mathcal{X}^N)$ to the space of independent (product) distributions over $\mathcal{X}^N$, denoted as $\mathcal{I}(\mathcal{X}^N)$, i.e. $\mathcal{I}(\mathcal{X}^N) = \{q \in \mathcal{P}(\mathcal{X}^N) : q(x_1, \ldots, x_N) = \prod_{i=1}^N q_i(x_i)\}$. By restricting the optimization in (6) to $\mathcal{I}(\mathcal{X}^N)$, the resulting answer is a lower bound on $\Phi(\theta)$.
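The characterization (6) can be sanity-checked numerically on a tiny model: the Gibbs distribution itself attains the supremum, while any other distribution (here: uniform) only yields a lower bound. A minimal sketch with illustrative potentials, not from the paper:

```python
# Numerical check of the Gibbs variational characterization (6).
import itertools
import math

edges = [(0, 1), (1, 2), (0, 2)]
theta = [1.0, 0.7, 0.3]
phi = lambda a, b: 1.0 if a == b else 0.0
configs = list(itertools.product([0, 1], repeat=3))

def weight(x):  # sum_e theta_e * phi_e(x_e)
    return sum(th * phi(x[a], x[b]) for th, (a, b) in zip(theta, edges))

Z = sum(math.exp(weight(x)) for x in configs)
Phi = math.log(Z)

def objective(q):  # E_q[sum_e theta_e phi_e] + H(q), the objective of (6)
    return sum(p * weight(x) for p, x in zip(q, configs)) + \
           sum(-p * math.log(p) for p in q if p > 0)

gibbs = [math.exp(weight(x)) / Z for x in configs]
uniform = [1.0 / len(configs)] * len(configs)
assert abs(objective(gibbs) - Phi) < 1e-9   # the Gibbs distribution attains Phi
assert objective(uniform) < Phi             # any other q gives a lower bound
```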
And the mean-field method precisely attempts to solve for such a lower bound.

It turns out that (6) is solvable efficiently for tree-structured graphs. Specifically, if $G$ is a connected tree, i.e. $G$ is connected with $|E| = N - 1$, then any distribution satisfying (1) can be re-parametrized as

$\mathbb{P}(x;\, \theta) = \prod_{u \in V} \mathbb{P}_{X_u}(x_u) \prod_{(u,v) \in E} \frac{\mathbb{P}_{X_u, X_v}(x_u, x_v)}{\mathbb{P}_{X_u}(x_u)\,\mathbb{P}_{X_v}(x_v)}. \qquad (7)$

In the expression above, $\mathbb{P}_{X_u}(\cdot)$ denotes the marginal distribution of $X_u$, $u \in V$, and $\mathbb{P}_{X_u, X_v}(\cdot, \cdot)$ denotes the pairwise marginal distribution of $(X_u, X_v)$ for any edge $e = (u, v) \in E$. The Belief Propagation (or sum-product) algorithm can compute these marginal distributions efficiently for tree graphs using only knowledge of $\theta$ and $\phi_e$, $e \in E$, without requiring $\Phi(\theta)$. It utilizes $O(|\mathcal{X}|^2 N)$ computation time when implemented efficiently. Therefore, $Z(\theta)$ and hence $\Phi(\theta)$ can be computed for tree graphs using $O(|\mathcal{X}|^2 N)$ computations.

Indeed, the re-parametrization of the form (7) was a basis for the Belief Propagation (BP) algorithm for generic graphical models and also led to the so-called Bethe approximation of (6), cf. [27]. However, it does not result in a provable upper or lower bound in general (with few exceptions).

To obtain an upper bound on $\Phi(\cdot)$, its convexity was exploited in [24] along with the fact that (6) is solvable efficiently for tree-structured graphs. This resulted in the tree-reweighted (TRW) algorithm, which we describe next.

Tree-Reweighted (TRW) Upper Bound on $\Phi(\cdot)$.

Recall that a spanning tree $T$ is a subgraph of $G$ that contains all vertices $V$ and a subset of edges $E$ such that the resulting subgraph is a tree, i.e. does not have a cycle. Let $\mathcal{T}(G)$ be the set of all spanning trees of $G$. We shall denote a distribution on $\mathcal{T}(G)$ as $\rho = (\rho_T)_{T \in \mathcal{T}(G)}$, where $\rho_T \geq 0$ for all $T \in \mathcal{T}(G)$ and $\sum_{T \in \mathcal{T}(G)} \rho_T = 1$. The space of all distributions on $\mathcal{T}(G)$ is denoted by $\mathcal{P}(\mathcal{T}(G))$.
For simplicity, we shall drop the notation of $G$ at times when it is clear from the context and denote it simply as $\mathcal{P}(\mathcal{T})$. A distribution $\rho \in \mathcal{P}(\mathcal{T})$ induces, for each edge $e \in E$, a probability $\rho_e$ that this edge will appear in a tree selected from $\rho$,

$\rho_e = \mathbb{P}_{T \sim \rho}(e \in T) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \mathbb{1}(e \in T). \qquad (8)$

Note that in the above, we have abused notation using $T$ as a spanning tree as well as the set of edges constituting it. We shall continue using this notation since all spanning trees have the same set of vertices, $V$, and only the edges differ (among subsets of $E$). Also note another convenient abuse of notation: given $\rho$, $\rho_T$ denotes the probability of $T \in \mathcal{T}(G)$ while $\rho_e$ is the marginal probability of edge $e \in E$ being present in a tree as per $\rho$, and satisfies $\sum_{e \in E} \rho_e = N - 1$. Given $\rho \in \mathcal{P}(\mathcal{T}(G))$, we now define $\kappa_\rho$ as

$\kappa_\rho = \min_{e \in E} \rho_e. \qquad (9)$

For any $\theta \in \mathbb{R}_+^{|E|}$, define its support as $s(\theta) = \{e \in E : \theta_e \neq 0\}$. Given a spanning tree $T \in \mathcal{T}(G)$, let $\theta^T \in \mathbb{R}_+^{|E|}$ be such that $s(\theta^T) \subset T$. Let $\rho \in \mathcal{P}(\mathcal{T})$ along with $(\theta^T)_{T \in \mathcal{T}}$ be such that $\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta$. That is, $\mathbb{E}_{T \sim \rho}[\theta^T] = \theta$. Therefore, we can write

$\Phi(\theta) = \Phi\big(\mathbb{E}_{T \sim \rho}[\theta^T]\big). \qquad (10)$

It has been well established that $\Phi : \mathbb{R}_+^{|E|} \to \mathbb{R}$ is a convex function. Precisely, for any $\theta_1, \theta_2 \in \mathbb{R}_+^{|E|}$ and $\gamma \in [0, 1]$,

$\Phi(\gamma \theta_1 + (1 - \gamma)\theta_2) \leq \gamma \Phi(\theta_1) + (1 - \gamma)\Phi(\theta_2). \qquad (11)$

From (10) and (11), it follows by Jensen's inequality that

$\Phi(\theta) \leq \mathbb{E}_{T \sim \rho}\big[\Phi(\theta^T)\big] = \sum_{T \in \mathcal{T}} \rho_T \Phi(\theta^T). \qquad (12)$

Since the upper bound (12) holds for any $\rho \in \mathcal{P}(\mathcal{T})$ and $(\theta^T)_{T \in \mathcal{T}}$ such that $\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta$, we can optimize over these two parameters to obtain

$\Phi(\theta) \leq \inf_{\sum_{T \in \mathcal{T}} \rho_T \theta^T = \theta} \Big(\sum_{T \in \mathcal{T}} \rho_T \Phi(\theta^T)\Big) \equiv U_{\mathrm{TRW}}(\theta). \qquad (13)$

As established in [24], this seemingly complicated optimized bound, $U_{\mathrm{TRW}}(\theta)$, can be computed via an iterative tree-reweighted message-passing algorithm through the dual of the above optimization problem. While this is a valid upper bound, how tight the upper bound is for a given graphical model is not quantified in the literature. And this is precisely the primary contribution of this work.

Modified Tree-Reweighted: TRW′.

We describe a simple variant of TRW that enables us to bound the approximation ratio of the estimation of $\Phi$ using properties of $G$. We start with some useful notation. Given $\theta = (\theta_e)_{e \in E} \in \mathbb{R}_+^{|E|}$, $\rho \in \mathcal{P}(\mathcal{T}(G))$ and a spanning tree $T \in \mathcal{T}(G)$ of graph $G$, define the "projection" operations

$\Pi^T : \mathbb{R}_+^{|E|} \to \mathbb{R}_+^{|E|} \ \text{ where } \ \Pi^T(\theta) = \big(\mathbb{1}(e \in T)\,\theta_e\big)_{e \in E},$
$\Pi^T_\rho : \mathbb{R}_+^{|E|} \to \mathbb{R}_+^{|E|} \ \text{ where } \ \Pi^T_\rho(\theta) = \big(\rho_e^{-1}\mathbb{1}(e \in T)\,\theta_e\big)_{e \in E}. \qquad (14)$

With these notations, for a given $\rho \in \mathcal{P}(\mathcal{T}(G))$, define

$L_\rho(\theta) = \mathbb{E}_{T \sim \rho}\big(\Phi(\Pi^T(\theta))\big) = \sum_{T \in \mathcal{T}(G)} \rho_T \Phi(\Pi^T(\theta)), \qquad (15)$
$U_\rho(\theta) = \mathbb{E}_{T \sim \rho}\big(\Phi(\Pi^T_\rho(\theta))\big) = \sum_{T \in \mathcal{T}(G)} \rho_T \Phi(\Pi^T_\rho(\theta)). \qquad (16)$

For a given $\rho \in \mathcal{P}(\mathcal{T}(G))$, one obtains an estimate of $\Phi(\theta)$:

$\widehat{\Phi}_\rho(\theta) = \sqrt{L_\rho(\theta)\, U_\rho(\theta)}. \qquad (17)$

For reasons that will become clear, TRW′ outputs $\widehat{\Phi}_{\rho^\star}(\theta)$, where $\rho^\star = \rho^\star(G)$ is defined as

$\rho^\star(G) \in \arg\max_{\rho \in \mathcal{P}(\mathcal{T}(G))} \big(\min_{e \in E} \rho_e\big) \quad \text{and} \quad \kappa_{\rho^\star}(G) = \max_{\rho \in \mathcal{P}(\mathcal{T}(G))} \big(\min_{e \in E} \rho_e\big). \qquad (18)$

Guarantee.
The lemma below quantifies the approximation ratio for TRW′. Its proof is in Appendix A.
Lemma 3.1.
Given $\theta \in \mathbb{R}_+^{|E|}$, TRW′ produces $\widehat{\Phi}_{\rho^\star}(\theta)$ with $\rho^\star = \rho^\star(G)$ as defined in (18). Then,

$\alpha(G, \mathrm{TRW}') \leq \frac{1}{\sqrt{\kappa_{\rho^\star}}}. \qquad (19)$

$\kappa_{\rho^\star}(G)$: Efficient Computation, Characterization

Lemma 3.1 establishes the approximation guarantee for TRW′ as claimed in Theorem 1.1, with the caveat that it is in terms of $\kappa_{\rho^\star}(G)$ while Theorem 1.1 states it in the form of $\kappa(G)$ as defined in (5). In this section, we shall establish the characterization $\kappa_{\rho^\star}(G) = \kappa(G)$ and in the process argue that it can be evaluated in polynomial time for any graph $G$. This characterization will allow us to bound $\kappa(G)$ for certain classes of graphs to obtain meaningful intuition.

4.1 Computing $\rho^\star(G)$ and $\kappa_{\rho^\star}(G)$ Efficiently

Spanning Tree Polytope.
We define a notion of the spanning tree polytope for a given graph $G$. Recall that $\mathcal{T}(G)$ is the set of all spanning trees of $G$. For any tree $T \in \mathcal{T}(G)$, we shall utilize the notation $\chi^T = [\chi^T_e] \in \{0, 1\}^E$ to represent the characteristic vector of the tree $T$, defined such that

$\forall e \in E : \ \chi^T_e = \mathbb{1}(e \in T). \qquad (20)$

Given this notation, we define the polytope of spanning trees of $G$, denoted $P^{\mathrm{tree}}(G)$, as the convex hull of their characteristic vectors. That is,

$P^{\mathrm{tree}}(G) = \Big\{v \in [0, 1]^E : v = \sum_{T \in \mathcal{T}(G)} \rho_T \chi^T, \ \sum_{T \in \mathcal{T}(G)} \rho_T = 1, \ \rho_T \geq 0 \ \forall T \in \mathcal{T}(G)\Big\}. \qquad (21)$

The weights $(\rho_T)_{T \in \mathcal{T}(G)}$ can be viewed as a probability distribution on $\mathcal{T}(G)$, i.e. an element of $\mathcal{P}(\mathcal{T}(G))$. Therefore $v = \sum_{T \in \mathcal{T}(G)} \rho_T \chi^T$ corresponds to a vector representing the probabilities that edges in $E$ will be present in $T \sim \rho = (\rho_T)$, i.e. $v_e = \mathbb{E}_{T \sim \rho}[\mathbb{1}(e \in T)]$. That is, $v = (\rho_e)_{e \in E}$ as defined in (8). Therefore, we shall abuse notation and write

$P^{\mathrm{tree}}(G) = \big\{(\rho_e)_{e \in E} \ \big| \ (\rho_T)_{T \in \mathcal{T}(G)} \in \mathcal{P}(\mathcal{T}(G))\big\}. \qquad (22)$

[8] gave the following characterization of the spanning tree polytope:

$P^{\mathrm{tree}}(G) = \Big\{(v_e)_{e \in E} \in \mathbb{R}_+^E \ \Big| \ \forall S \subset V : v(E(S)) \leq |S| - 1; \ \ v(E) = |V| - 1\Big\}, \qquad (23)$

where $v(E(S)) = \sum_{e \in E(S)} v_e$.

Efficient Separation Oracle.
A polytope $P \subset \mathbb{R}^n$, defined through a set of linear constraints, is said to have a separation oracle if there exists an algorithm, polynomial time in $n$, which for any given $x \in \mathbb{R}^n$ can determine whether $x \in P$ or not, and output a violated constraint if $x \notin P$. Edmonds' characterization of the spanning tree polytope, though it has an exponential number of constraints, admits an efficient separation oracle. Such an efficient separation oracle is defined explicitly via a min-cut reduction, see [19, Chapter 4.1].

Complexity of Linear Programming.
Consider a linear program where the goal is to find the minimum of a linear objective function over a polytope defined by finitely many linear constraints. Such a linear program can be solved in polynomial time (in the size of the problem description) via the Ellipsoid method if the polytope admits an efficient separation oracle, see [3, Theorem 8.5] for example. Given that the spanning tree polytope has an efficient separation oracle, optimizing a linear objective over it can be done efficiently. Of course, due to the structure of trees, a greedy algorithm like Kruskal's may be a lot more direct for solving such a linear program. Having said that, the benefit of an efficient separation oracle becomes apparent as soon as we consider additional linear constraints beyond those describing $P^{\mathrm{tree}}(G)$. Indeed, such approaches have found utility in solving other problems, such as bounded-degree maximum-spanning-tree relaxations as in [12].

Augmented Spanning Tree Polytope.
We consider a reformulation of the max-min problem in (18). To that end, consider the following augmented spanning tree polytope:

$P^{\mathrm{tree}}_{\min}(G) = \Big\{(z, (v_e)_{e \in E}) \in \mathbb{R} \times \mathbb{R}_+^{|E|} \ \Big| \ \forall e \in E : z \leq v_e; \ \ \forall S \subset V : v(E(S)) \leq |S| - 1; \ \ v(E) = |V| - 1\Big\}. \qquad (24)$

With this notation, we can re-write $\kappa_{\rho^\star}(G)$ as per (18) as

$\kappa_{\rho^\star}(G) = \max_{(v_e)_{e \in E} \in P^{\mathrm{tree}}} \big\{\min_{e \in E} v_e\big\} = \max_{(z, (v_e)_{e \in E}) \in P^{\mathrm{tree}}_{\min}} z. \qquad (25)$

Next, we argue that $P^{\mathrm{tree}}_{\min}$ admits an efficient separation oracle as follows. The separation oracle for $P^{\mathrm{tree}}_{\min}$ takes $(z, (v_e)_{e \in E})$ as input. It first checks that all $|E|$ constraints of the form $z \leq v_e$ are satisfied. If one is not satisfied, then the oracle outputs this constraint. If all are satisfied, the algorithm runs the separation oracle of $P^{\mathrm{tree}}$ on $(v_e)_{e \in E}$ and reproduces its output. Since $|E| \leq N^2$ and $P^{\mathrm{tree}}$ has an efficient separation oracle, this leads to a polynomial time separation oracle for $P^{\mathrm{tree}}_{\min}$.

Efficient Computation of $\rho^\star(G)$ and $\kappa_{\rho^\star}(G)$. From the linear program formulation (25) and the efficient separation oracle defined above, we can compute $\kappa_{\rho^\star}(G)$ in polynomial time using the Ellipsoid algorithm. Note that this does not directly provide $\rho^\star(G) \in \mathcal{P}(\mathcal{T}(G))$, since the representation in $P^{\mathrm{tree}}$ corresponds to the edge probabilities $(\rho^\star(G)_e)_{e \in E}$. However, $(\rho^\star(G)_e)_{e \in E}$ is a convex combination of extreme points of $P^{\mathrm{tree}}$, which correspond to the spanning trees of $G$. Since $P^{\mathrm{tree}}$ has an efficient separation oracle, we can recover a decomposition of $(\rho^\star(G)_e)_{e \in E}$ as a convex combination of characteristic vectors weighted by $(\rho^\star(G)_T)_{T \in \mathcal{T}(G)}$, such that at most $|E|$ of these weights are strictly positive; see details in [13, Theorem 3.9].
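As a sanity check of the construction above, a naive separation "oracle" for $P^{\mathrm{tree}}_{\min}$ can be written by checking the constraints of (24) by explicit enumeration of vertex subsets; this is exponential in $|V|$ and for intuition only, with the efficient oracle replacing the enumeration by the min-cut reduction mentioned earlier. A sketch on the triangle graph, for which $\kappa_{\rho^\star} = 2/3$:

```python
# Naive membership/separation check for P^tree_min of (24), by enumeration.
import itertools

def separate(z, v, n):
    """Return None if (z, v) lies in P^tree_min, else a violated constraint."""
    for e, ve in v.items():
        if z > ve + 1e-12:
            return ("z <= v_e violated", e)
    if abs(sum(v.values()) - (n - 1)) > 1e-9:
        return ("v(E) = |V| - 1 violated",)
    for k in range(2, n + 1):
        for S in itertools.combinations(range(n), k):
            s = set(S)
            tot = sum(ve for (a, b), ve in v.items() if a in s and b in s)
            if tot > k - 1 + 1e-9:
                return ("Edmonds constraint violated", S)
    return None

# Uniform edge weights v_e = 2/3 on the triangle: every spanning tree has 2 of
# the 3 edges, so these are valid edge marginals, and z may be as large as 2/3.
v = {(0, 1): 2 / 3, (1, 2): 2 / 3, (0, 2): 2 / 3}
assert separate(2 / 3, v, 3) is None
assert separate(0.9, v, 3) is not None   # z exceeds min_e v_e
```

Note that feeding the output of such an oracle to the Ellipsoid method is exactly how (25) is solved in polynomial time.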
4.2 $\kappa_{\rho^\star}(G) = \kappa(G)$

We wish to establish $\kappa_{\rho^\star}(G) = \kappa(G)$, i.e. we want to establish

$\kappa_{\rho^\star}(G) = \max_{(v_e)_{e \in E} \in P^{\mathrm{tree}}} \big\{\min_{e \in E} v_e\big\} = \min_{S \subset V} \frac{|S| - 1}{|E(S)|}. \qquad (26)$

Upper bound: $\kappa_{\rho^\star}(G) \leq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}$. The upper bound is immediately given by Edmonds' characterization of the spanning tree polytope. For any $(\rho_e)_{e \in E} \in P^{\mathrm{tree}}$ and any $S \subset V$:

$|E(S)| \big(\min_{e \in E} \rho_e\big) \leq \Big(\sum_{e \in E(S)} \rho_e\Big) = \rho(E(S)) \leq |S| - 1. \qquad (27)$

That is, for any $\rho \in \mathcal{P}(\mathcal{T}(G))$,

$\kappa_\rho \leq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}. \qquad (28)$

And hence it holds for $\rho^\star(G)$ as well.

Lower bound: $\kappa_{\rho^\star}(G) \geq \min_{S \subset V} \frac{|S| - 1}{|E(S)|}$. To establish the lower bound, we need a few additional results. To start with, we define a dual of the optimization problem (25) to characterize $\kappa_{\rho^\star}(G)$. By strong duality it follows that

$\kappa_{\rho^\star}(G) = \max_{\rho \in \mathcal{P}(\mathcal{T})} \min_{e \in E} \sum_{T \in \mathcal{T}} \mathbb{1}(e \in T)\, \rho_T = \min_{w \in \mathcal{P}(E)} \max_{T \in \mathcal{T}} \sum_{e \in E} \mathbb{1}(e \in T)\, w_e, \qquad (29)$

where $\mathcal{P}(E) = \{w = (w_e)_{e \in E} : \sum_{e \in E} w_e = 1, \ w_e \geq 0 \ \forall e \in E\}$. Table 1 provides the precise primal and dual formulations associated with $\kappa_{\rho^\star}(G)$, justifying (29). We state the following Lemma characterizing an optimal solution of Dual, whose proof is in Appendix B.
Lemma 4.1.
There exists an optimal solution of Dual,

$w^\star \in \arg\min_{w \in \mathcal{P}(E)} \max_{T \in \mathcal{T}} \sum_{e \in E} \mathbb{1}(e \in T)\, w_e,$

such that all non-zero components of $w^\star$ take identical values, i.e. $|\{w^\star_e : w^\star_e \neq 0, e \in E\}| = 1$.

Primal: maximize $z$ over variables $z \in \mathbb{R}$ and $\rho_T \in \mathbb{R}_+$, $T \in \mathcal{T}$, subject to $\sum_{T \in \mathcal{T}} \rho_T = 1$ and $\forall e \in E : \sum_{T \ni e} \rho_T - z \geq 0$.
Dual: minimize $y$ over variables $y \in \mathbb{R}$ and $w_e \in \mathbb{R}_+$, $e \in E$, subject to $\sum_{e \in E} w_e = 1$ and $\forall T \in \mathcal{T} : y - \sum_{e \in T} w_e \geq 0$.

Table 1: The primal (cf. (25)) and dual formulations of $\kappa_{\rho^\star}(G)$.

As per Lemma 4.1, consider an optimal solution of Dual, $w^\star$, that assigns a constant value to a subset $F \subset E$ of edges and $0$ to the edges $E \setminus F$: let $w^\star = (w^\star_e)_{e \in E}$ with $w^\star_e = 1/|F|$ for $e \in F$ and $w^\star_e = 0$ for $e \in E \setminus F$. Let $V(F) \subset V$ be the set of all vertices corresponding to the end points of edges in $F$, making a subgraph $(V(F), F)$ of $G$. Let $c(F) \geq 1$ denote the number of connected components of $(V(F), F)$. Per Dual, given $w^\star$, $\kappa_{\rho^\star}(G)$ equals the weight of the maximum weight spanning tree in $G$ with edges assigned weights as per $w^\star$. Such a maximum weight spanning tree must select as many edges as possible from $F$: since $(V(F), F)$ has $c(F)$ connected components and $|V(F)|$ vertices, the tree can select at most $|V(F)| - c(F)$ such edges, and each such edge has weight $1/|F|$. The rest of the edges in the maximum weight spanning tree will carry weight $0$. Thus, the total weight of such a maximum weight spanning tree is $(|V(F)| - c(F))/|F|$. This gives us an equivalent characterization of $\kappa_{\rho^\star}(G)$ as

$\kappa_{\rho^\star}(G) = \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}. \qquad (30)$

Now we state a Lemma, whose proof is in Appendix C, which relates the characterization (30) with that of (5).

Lemma 4.2.
For any graph $G$,

$\min_{S \subset V} \frac{|S| - 1}{|E(S)|} = \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}. \qquad (31)$

The primary claim of Theorem 1.1 is that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa(G)}$. As per Lemma 3.1, we have that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa_{\rho^\star}(G)}$. As per the arguments in Section 4.2, we have that $\kappa_{\rho^\star}(G) = \kappa(G)$. Therefore, we conclude the proof of Theorem 1.1.

4.4 Bounding $\kappa(G)$ for a Class of Graphs
As established in Section 4.1, $\kappa(G)$ or $\kappa_{\rho^\star}(G)$ can be computed in polynomial time for any $G$. Here, we attempt to obtain a (lower) bound on $\kappa(G)$ in terms of simple graph properties. To that end, we obtain the following for graphs with bounded maximum average degree.

Lemma 4.3.
For a graph $G = (V, E)$, let $\bar{d} = \max_{S \subset V} 2|E(S)|/|S|$ denote the maximum average degree. Then

$\kappa(G) \geq \frac{2}{\bar{d} + 1}. \qquad (32)$

For graphs with large girth, we obtain the following.

Lemma 4.4.
For a graph $G = (V, E)$, let $g > 2$ be its girth, i.e. the length of the shortest cycle. Then

$\kappa(G) \geq \frac{2}{1 + N^{\frac{2}{g-2}}\big(1 - \frac{2}{g}\big)}. \qquad (33)$

The proofs of Lemmas 4.3 and 4.4 are presented in Appendix D. As per Lemma 4.4, for $g = \beta \log N$ with $\beta \gg 1$ and $N$ large enough,

$\kappa(G) \geq \frac{2}{1 + N^{\frac{2}{g-2}}\big(1 - \frac{2}{g}\big)} \approx \frac{1}{1 + 1/\beta}. \qquad (34)$

Therefore,

$\alpha(G, \mathrm{TRW}') \leq \frac{1}{\sqrt{\kappa(G)}} \approx 1 + \frac{1}{2\beta}. \qquad (35)$

Near Linear-Time Variant of TRW
The TRW′ method requires finding $\rho^\star(G)$. As discussed in Section 4, it can be computed efficiently. However, it can be cumbersome, and having a near linear-time (in $|E|$) variant can be more attractive in practice. With this as motivation, we propose utilizing the uniform distribution on $\mathcal{T}(G)$, denoted as $u \equiv u(\mathcal{T}(G))$, in place of $\rho^\star(G)$ in TRW′. The challenge is that it has a very large support, $\mathcal{T}(G)$, and hence it is difficult to compute $L_u(\theta)$, $U_u(\theta)$ exactly. But both of these quantities are averages, with respect to $u$, of a certain functional. And it is feasible to sample a spanning tree uniformly at random for any $G$ in near-linear time. Therefore, we can draw $n$ samples from the distribution $u$ and use the empirical distribution $\hat{u}^n$ to compute the estimates $L_{\hat{u}^n}(\theta)$, $U_{\hat{u}^n}(\theta)$ with few samples. This is precisely the algorithm.

To that end, consider $n$ trees $T_1, \ldots, T_n$ sampled uniformly at random from $\mathcal{T}(G)$. Compute

$\hat{u}^n_e = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(e \in T_i) \ \forall e \in E, \quad L_{\hat{u}^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}(\theta)), \quad U_{\hat{u}^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_{\hat{u}^n}(\theta)), \qquad (36)$

where $\hat{u}^n = (\hat{u}^n_e)_{e \in E}$. Given this, produce the estimate

$\widehat{\Phi}_{\hat{u}^n}(\theta) = \sqrt{L_{\hat{u}^n}(\theta)\, U_{\hat{u}^n}(\theta)}. \qquad (37)$

Given a graph $G$, recall that $\kappa_u(G) = \min_{e \in E} u_e$, with $u$ being the uniform distribution on $\mathcal{T}(G)$ and $u_e = \mathbb{E}_{T \sim u}[\mathbb{1}(e \in T)]$. We state the following Lemma, whose proof can be found in Appendix E.

Lemma 5.1.
Given $\epsilon > 0$ and $\delta > 0$, for $n \geq O\big(\log(N/\delta)\, \kappa_u(G)^{-2} \epsilon^{-2}\big)$ and $\epsilon$ sufficiently small, with probability at least $1 - \delta$,

$\max_{\theta \in \mathbb{R}_+^{|E|}}\Big(\frac{\Phi(\theta)}{\widehat{\Phi}_{\hat{u}^n}(\theta)}, \frac{\widehat{\Phi}_{\hat{u}^n}(\theta)}{\Phi(\theta)}\Big) \leq \frac{1 + \epsilon}{\sqrt{\kappa_u(G)}}. \qquad (38)$

To sample a tree uniformly at random from $\mathcal{T}(G)$, [22] recently proposed a method that has $O(|E|^{1+o(1)})$ run-time using a short-cutting method and insights from effective resistance. The earliest polynomial time algorithm has been known since [14]. While we do not recall either of these here, we briefly recall the algorithm from [4] due to its elegance, even though it is not optimal (it has $O(N|E|)$ run time): (1) starting with any $u \in V$, run a random walk on $G$ until it covers all vertices; (2) for every vertex $v \neq u$, select the edge through which $v$ was reached for the first time during the walk; and (3) output the $N - 1$ edges (which form a tree) thus selected. Given $n$ such samples, to compute $\widehat{\Phi}_{\hat{u}^n}(\theta)$ we have to compute $2n$ log-partition functions for tree-structured graphs. As noted in Section 2, each such computation requires $O(N|\mathcal{X}|^2)$ operations.

By Lemma 5.1, we need $n \geq O\big((d + \log N)\, \kappa_u(G)^{-2} \epsilon^{-2}\big)$ to achieve a $(1 + \epsilon)/\sqrt{\kappa_u(G)}$ approximation with probability $1 - e^{-d}$. That is, in total we need $O(|E|^{1+o(1)} + N|\mathcal{X}|^2) \times O(\kappa_u(G)^{-2}\epsilon^{-2}\log(1/\epsilon))$ computation for a $(1 + \epsilon)/\sqrt{\kappa_u(G)}$ approximation with probability $1 - \epsilon$.

$\kappa_u(G)$ and Effective Resistance

The quantity $\kappa_u(G) = \min_{e \in E} u_e$, where $u_e = \mathbb{E}_{T \sim u}[\mathbb{1}(e \in T)]$, turns out to be related to the so-called "effective resistance" associated with edge $e \in E$ for the graph $G = (V, E)$. The notion was introduced by [18] and has multiple interpretations. We present one such here.
For $e = (s, t) \in E$, the effective resistance $u_e$ is equal to the amount of electric energy dissipated by the network when all edges are seen as electric wires of resistance $R_e = 1$ and a generator guarantees a total current flow ($\iota_{\mathrm{gen}} = 1$) from $s$ to $t$. The distribution of the current $\iota$ across the network must minimize the dissipated energy while respecting the constraints imposed by Kirchhoff's laws (also see [20, Chapter 2]). Below we provide a variational characterization of it:

$\forall e = (s, t) \in E : \ u_e = \min\Big\{\sum_{\{u,v\} \in E} \iota(u, v)^2 \ \Big| \ \forall \{u, v\} \in E : \iota(u, v) + \iota(v, u) = 0; \ \ \forall u \in V \setminus \{s, t\} : \sum_{v \,|\, (u,v) \in E} \iota(u, v) = 0; \ \ \sum_{v \,|\, (s,v) \in E} \iota(s, v) = \sum_{u \,|\, (u,t) \in E} \iota(u, t) = 1\Big\}. \qquad (39)$

Lemma 5.2.
Given $G = (V, E)$: (a) if $d$ is the maximum vertex degree, then for any $e \in E$, $u_e \geq \frac{2}{d+1}$; (b) if the girth is at least $g > 2$, then for any $e \in E$, $u_e \geq \frac{g-1}{|E|}$.
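On small graphs, the uniform-spanning-tree marginals $u_e$ can be computed exactly by enumerating all spanning trees, and compared against the random-walk sampler of [4] recalled above. On $K_4$ every marginal equals $1/2$, which matches the degree bound of part (a) (as reconstructed here, $2/(d+1)$ with $d = 3$) with equality. A sketch:

```python
# Exact uniform-spanning-tree edge marginals u_e on K_4 by enumeration, plus
# the first-entrance random-walk (Aldous-Broder-style) sampler of [4].
import itertools
import random

def spanning_trees(n, edges):
    """Yield all (n-1)-edge acyclic subsets, i.e. the spanning trees."""
    for cand in itertools.combinations(edges, n - 1):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        ok = True
        for a, b in cand:
            ra, rb = find(a), find(b)
            if ra == rb:          # adding (a, b) would create a cycle
                ok = False
                break
            parent[ra] = rb
        if ok:
            yield cand

def random_spanning_tree(n, adj, rng):
    """First-entrance edges of a covering random walk: a uniform spanning tree."""
    cur = rng.randrange(n)
    seen, tree = {cur}, set()
    while len(seen) < n:
        nxt = rng.choice(adj[cur])
        if nxt not in seen:
            seen.add(nxt)
            tree.add((min(cur, nxt), max(cur, nxt)))
        cur = nxt
    return tree

n = 4
edges = list(itertools.combinations(range(n), 2))            # K_4
trees = list(spanning_trees(n, edges))
u = {e: sum(e in t for t in trees) / len(trees) for e in edges}
assert len(trees) == 16                                      # Cayley: 4^{4-2}
assert all(abs(ue - 0.5) < 1e-12 for ue in u.values())       # u_e = 2/(d+1), d = 3

# Empirical check with the sampler (2000 draws, loose tolerance).
adj = {i: [j for j in range(n) if j != i] for i in range(n)}
rng = random.Random(0)
freq = sum((0, 1) in random_spanning_tree(n, adj, rng) for _ in range(2000)) / 2000
assert abs(freq - 0.5) < 0.05
```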
This far, we have restricted to approximating Φ( θ ) by decomposing θ = E T ∼ ρ [Π T ρ ( θ )] and then using convex-ity, monotonicity and sub-linearity to produce an approximation guarantee. Such arguments would hold evenif we can decompose θ using subgraphs of G beyond trees. The choice of trees was particularly useful sincethey allow for an efficient computation of Φ . In general, graphs with bounded tree-width lend themselves toefficient computation of Φ , cf. [5].To that end, let T k ( G ) denote the set of all subgraphs of G that have treewidth bounded by k ≥ . Let P ( T k ( G )) denote the distribution over all such subgraphs. For any H ∈ T k ( G ) and ρ ∈ P ( T k ( G )) , define Π H ( · ) and Π H ρ ( · ) similar to that in (14) in Section 3 in the definition of TRW (cid:48) , L ρ ( θ ) = E H ∼ ρ (Φ(Π H ( θ ))) , U ρ ( θ ) = E H ∼ ρ (Φ(Π H ρ ( θ ))) , and (cid:98) Φ ρ ( θ ) = (cid:113) L ρ ( θ ) U ρ ( θ ) . (40)Using identical arguments as in Theorem 1.1, it follows that (cid:98) Φ ρ ( θ ) is √ κ k ρ -approximation where κ k ρ = max ρ ∈P ( T k ( G )) min e ∈ E ρ e . (41) ( (cid:15), k ) -partitioning. While such generality is pleasing its utility is in improved approximation. Indeed, in[17, 16] a seemingly different approach was proposed using graph partitioning. At its core, it was shown thatfor a large family of graphs including minor-excluded graphs and graphs with polynomial growth, there exists ρ ∈ P ( T k ( G )) which satisfies certain ( (cid:15), k ) -partitioning property (for appropriately chosen (cid:15), k ). Consider k -partitions of G defined as Part k ( G ) = { H = ( V, K (cid:91) i =1 E ( S i )) | ( S i ) ≤ i ≤ k is a partition of V and ∀ i : | S i | ≤ k } . (42)10ote that Part k ( G ) ⊂ T k ( G ) . A distribution ρ ∈ P ( Part k ( G )) ⊂ P ( T k ( G )) is called an ( (cid:15), k ) -partitioning of G if ∀ e ∈ E : 1 − (cid:15) ≤ E H ∼ ρ [ ( e ∈ H )] ≤ . (43)We state the following result whose proof can be found in Appendix E. Theorem 6.1.
Let $G$ be such that there exists $\rho \in \mathcal{P}(\mathrm{Part}_k(G)) \subset \mathcal{P}(\mathcal{T}_k(G))$ that is an $(\epsilon, k)$-partitioning of $G$. Then, for any $\theta \in \mathbb{R}^{|E|}_+$,
$$\sqrt{1-\epsilon} \leq \frac{\Phi(\theta)}{\widehat\Phi_\rho(\theta)} \leq \frac{1}{\sqrt{1-\epsilon}}. \quad (44)$$
We note that $1/\sqrt{1-\epsilon} = 1 + \frac{\epsilon}{2} + o(\epsilon)$, and hence this improves upon the result of [17, 16], which achieves an $\epsilon$ approximation error.

Conclusion. We presented a method to quantify the approximation ratio of variational approximation methods for estimating the log-partition function of discrete pairwise graphical models. As the main contribution, we quantified the approximation error as a function of properties of the underlying graph. In particular, for a variant of the tree-reweighted algorithm, the approximation ratio is a constant factor (a function of the degree) for graphs with bounded degree, and close to 1 for graphs with large ($\gg$ logarithmic) girth. The method naturally extends beyond trees, unifying prior works on the graph-partitioning-based approach.

In this work, we restricted the analysis to non-negative valued potentials and edge parameters. If the potentials are bounded, we can transform the general setting into one with non-negative potentials. However, the approximation ratio with respect to this transformed setting may not translate to that of the original setting. This may be an interesting direction for future work. Acknowledgements
This work is supported in part by projects from NSF and KACST, as well as by a Hewlett Packard graduate fellowship. We would like to thank Moïse Blanchard for useful discussions on duality.
References

[1] Noga Alon, Shlomo Hoory, and Nathan Linial. The Moore bound for irregular graphs. Graphs and Combinatorics, 18(1):53–57, 2002.
[2] Mohsen Bayati, David Gamarnik, Dimitriy Katz, Chandra Nair, and Prasad Tetali. Simple deterministic approximation algorithms for counting matchings. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 122–127, 2007.
[3] Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, Belmont, MA, 1997.
[4] Andrei Z. Broder. Generating random spanning trees. In FOCS, volume 89, pages 442–447, 1989.
[5] Venkat Chandrasekaran, Misha Chertkov, David Gamarnik, Devavrat Shah, and Jinwoo Shin. Counting independent sets using the Bethe approximation. SIAM Journal on Discrete Mathematics, 25(2):1012–1034, 2011.
[6] Michael Chertkov and Vladimir Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.
[7] Amir Dembo, Andrea Montanari, and Nike Sun. Factor models on locally tree-like graphs. Annals of Probability, 41(6):4162–4213, 2013.
[8] Jack Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1(1):127–136, 1971.
[9] David Gamarnik and Dmitriy Katz. Sequential cavity method for computing free energy and surface pressure. Journal of Statistical Physics, 137(2):205–232, 2009.
[10] David Gamarnik and Dmitriy Katz. Correlation decay and deterministic FPTAS for counting colorings of a graph. Journal of Discrete Algorithms, 12:29–47, 2012.
[11] Hans-Otto Georgii. Gibbs Measures and Phase Transitions, volume 9. Walter de Gruyter, 2011.
[12] Michel X. Goemans. Minimum bounded degree spanning trees. In FOCS, pages 273–282. IEEE, 2006.
[13] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.
[14] Alain Guenoche. Random spanning tree. Journal of Algorithms, 4(3):214–220, 1983.
[15] Mark Jerrum and Alistair Sinclair. Approximating the permanent. SIAM Journal on Computing, 18(6):1149–1178, 1989.
[16] Kyomin Jung, Pushmeet Kohli, and Devavrat Shah. Local rules for global MAP: When do they work? In NIPS, pages 871–879, 2009.
[17] Kyomin Jung and Devavrat Shah. Local approximate inference algorithms. arXiv preprint cs/0610111, 2006.
[18] Douglas J. Klein and Milan Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.
[19] Lap Chi Lau, Ramamoorthi Ravi, and Mohit Singh. Iterative Methods in Combinatorial Optimization, volume 46. Cambridge University Press, 2011.
[20] Russell Lyons and Yuval Peres. Probability on Trees and Networks, volume 42. Cambridge University Press, 2017.
[21] Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.
[22] Aaron Schild. An almost-linear time algorithm for uniform random spanning tree generation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 214–227, 2018.
[23] Leslie G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
[24] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
[25] Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, 2008.
[26] Dror Weitz. Counting independent sets up to the tree threshold. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 140–149, 2006.
[27] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2001.
[28] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8:236–239, 2003.
A Proof of Lemma 3.1
Proof.
We start by observing a few properties of the function $\Phi(\cdot)$.

Property 1. $\Phi$ is non-decreasing. For $a, b \in \mathbb{R}^n$, let $a \preceq b$ denote that every component of $a$ is less than or equal to that of $b$, i.e. $a_i \leq b_i$, $i \in [n]$. With this, for $\theta, \theta' \in \mathbb{R}^{|E|}_+$ such that $\theta \preceq \theta'$, it can be easily verified that
$$\Phi(\theta) \leq \Phi(\theta'). \quad \text{(monotonicity)}$$
Since $\Phi(0) = N \log |\mathcal{X}|$ and $0 \preceq \theta \preceq \theta'$, we have
$$N \log |\mathcal{X}| \leq \Phi(\theta) \leq \Phi(\theta'). \quad (45)$$

Property 2. $\Phi$ is sub-linear. For $\lambda \geq 1$ and $\theta \in \mathbb{R}^{|E|}_+$,
$$\Phi(\lambda\theta) \leq \lambda\, \Phi(\theta). \quad \text{(sub-linearity)}$$
The above follows from the fact that for any $s = (s_i) \in \mathbb{R}^n_+$ and $\lambda \geq 1$,
$$\sum_{i=1}^n s_i^\lambda \leq \Big(\sum_{i=1}^n s_i\Big)^\lambda.$$

Now consider any $\rho \in \mathcal{P}(\mathcal{T}(G))$. For any $T \in \mathcal{T}(G)$ and $\theta \in \mathbb{R}^{|E|}_+$, by definition of $\Pi^T$, we have that $\Pi^T(\theta) \preceq \theta$. Therefore, using the monotonicity of the log-partition function, it follows that
$$L_\rho(\theta) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T(\theta)) \leq \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\theta) \leq \Phi(\theta). \quad (46)$$
By definition $\theta = \mathbb{E}_{T \sim \rho}[\Pi^T_\rho(\theta)]$, and due to the convexity of $\Phi$ (cf. (11)), it follows that
$$\Phi(\theta) = \Phi\big(\mathbb{E}_{T \sim \rho}[\Pi^T_\rho(\theta)]\big) \leq \mathbb{E}_{T \sim \rho}[\Phi(\Pi^T_\rho(\theta))] = U_\rho(\theta). \quad (47)$$
By definition of $\kappa_\rho = \min_{e \in E} \rho_e$, it follows that
$$\Pi^T_\rho(\theta) \preceq \tfrac{1}{\kappa_\rho}\, \Pi^T(\theta), \quad \forall\, T \in \mathcal{T}(G). \quad (48)$$
And, by definition, $\tfrac{1}{\kappa_\rho} \geq 1$. Therefore, by (monotonicity) and (sub-linearity), we have
$$\Phi(\Pi^T_\rho(\theta)) \leq \Phi\big(\tfrac{1}{\kappa_\rho}\, \Pi^T(\theta)\big) \leq \tfrac{1}{\kappa_\rho}\, \Phi(\Pi^T(\theta)). \quad (49)$$
Therefore,
$$U_\rho(\theta) = \sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T_\rho(\theta)) \leq \tfrac{1}{\kappa_\rho} \Big(\sum_{T \in \mathcal{T}(G)} \rho_T\, \Phi(\Pi^T(\theta))\Big) = \tfrac{1}{\kappa_\rho}\, L_\rho(\theta). \quad (50)$$
As a consequence of (46), (47) and (50), we obtain
$$\Phi(\theta) \leq U_\rho(\theta) \leq \tfrac{1}{\kappa_\rho}\, \Phi(\theta) \quad \text{and} \quad \kappa_\rho\, \Phi(\theta) \leq L_\rho(\theta) \leq \Phi(\theta). \quad (51)$$
From this, it follows that
$$\sqrt{\kappa_\rho}\, \Phi(\theta) \leq \sqrt{L_\rho(\theta)\, U_\rho(\theta)} \leq \tfrac{1}{\sqrt{\kappa_\rho}}\, \Phi(\theta), \quad (52)$$
which can be rewritten as
$$\sqrt{\kappa_\rho} \leq \frac{\widehat\Phi_\rho(\theta)}{\Phi(\theta)} \leq \frac{1}{\sqrt{\kappa_\rho}}. \quad (53)$$
By optimizing over the choice of $\rho = \rho^\star$, we conclude that $\alpha(G, \mathrm{TRW}') \leq 1/\sqrt{\kappa_{\rho^\star}}$.

B Proof of Lemma 4.1
Proof. (See illustration in Figure 1.) For $w = (w_e)_{e \in E}$, denote by $f(w)$ the number of distinct values in its support:
$$f(w) = |\{ w_e : e \in E,\ w_e \neq 0 \}|. \quad (54)$$
To prove the lemma, it suffices to show that there exists an optimal solution of Dual such that $f(w) = 1$. We will prove that if $w$ is an optimal solution and $f(w) > 1$, then we can build $w'$ of the same objective value such that $f(w') \leq f(w) - 1$. Repeating this until $f(w) = 1$ concludes the proof.

Let $w$ be an optimal solution with $f(w) > 1$. We consider the edges $e_1, e_2, \ldots, e_{|E|}$ ordered by their weights, i.e.
$$w_{e_1} \geq \ldots \geq w_{e_{|E|}}. \quad (55)$$
In what follows, we will make sure that the ordering on the edges never changes; we therefore allow ourselves to write $w_i$ instead of $w_{e_i}$. Now, the objective of Dual achieved by such an optimal $w$ corresponds to the weight of a maximum-weight spanning tree. Let us use Kruskal's algorithm to find such a tree. Recall that Kruskal's algorithm greedily selects edges from higher to lower weight as long as they do not create a cycle with previously selected edges. We denote by $I_T = \{t_1 < \ldots < t_{N-1}\}$ the indices of the edges selected by the algorithm to construct tree $T$, and let $I_{E \setminus T} = \cup_{k=1}^{N-1} \{ s : t_k < s < t_{k+1} \}$ denote the indices of the edges not part of $T$, with the notation $t_N = |E| + 1$. The weight of the maximum spanning tree is then $w(T) = \sum_{k=1}^{N-1} w_{t_k}$. Note that $t_1 = 1$ and $t_2 = 2$, since a cycle requires 3 or more edges. By definition, $w_{j-1} \geq w_j$ for $2 \leq j \leq |E|$. Now, if $w_{j-1} > w_j$, then we claim that $j \in I_T$. This is because for $1 \leq k \leq N-1$, if $(w_{t_k}, \ldots, w_{t_{k+1}-1})$ are not all equal, setting them all to their average strictly decreases $w_{t_k}$ while preserving $w \in \mathcal{P}(E)$ as well as the order on the edges, contradicting the optimality of $w$ for Dual. Therefore, $w$ is piecewise constant, with discontinuities only appearing at indices $j \in I_T$.

If $f(w) = 2$ and all weights are positive, we denote by $2 \leq k \leq N-1$ the index such that $w_{t_k - 1} > w_{t_k} > 0$, and we have:
$$w_1 = \ldots = w_{t_k - 1} > w_{t_k} = \ldots = w_{|E|}. \quad (56)$$
In this case, the optimal objective value for Dual is equal to $(k-1)\, w_1 + (N-k)\, w_{t_k}$. To make $w$ constant on its support while preserving the order on the weights, there are two possibilities. Either transfer all weight from $(w_{t_k}, \ldots, w_{|E|})$ to $(w_1, \ldots, w_{t_k-1})$ until $(w_{t_k}, \ldots, w_{|E|})$ reaches zero; the objective will then involve $w_1 + \frac{|E| - t_k + 1}{t_k - 1}\, w_{t_k}$. Or transfer all weight from $(w_1, \ldots, w_{t_k-1})$ to $(w_{t_k}, \ldots, w_{|E|})$ until all weights are equal; the objective will then involve $w_{t_k} + \frac{t_k - 1}{|E| - t_k + 1}\, (w_1 - w_{t_k})$. Because either $\frac{|E| - t_k + 1}{t_k - 1} \leq 1$ or $\frac{t_k - 1}{|E| - t_k + 1} \leq 1$, one of these transfers does not increase the objective and yields $f(w) = 1$.

If $f(w) = 2$ and some weights are $0$, denote by $k_0$ the smallest index such that $w_{t_{k_0}} = 0$. The method above still holds when replacing $|E| - t_k + 1$ by $t_{k_0} - t_k$.

Now suppose $f(w) \geq 3$; making sure that the order on the weights is preserved requires extra caution. In addition to $k$ and $k_0$ (if required), we denote by $k_1$ the index of the discontinuity that follows $k$. We have:
$$\ldots = w_{t_k - 1} > w_{t_k} = \ldots = w_{t_{k_1} - 1} > w_{t_{k_1}} = \ldots \quad (57)$$
In the event that we want to transfer weight from $(w_{t_{k_1}}, \ldots)$ to $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$, we must make sure that $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$ does not exceed $w_{t_k - 1}$. If $(w_{t_k}, \ldots, w_{t_{k_1} - 1})$ attains $w_{t_k - 1}$, the transfer must stop at equality, and one should observe that we have then strictly decreased $f(w)$ by $1$, because the discontinuity at $w_{t_k}$ has disappeared and no new discontinuity was created.

In summary, we have argued that if $w$ is an optimal solution and $f(w) > 1$, then we can build $w'$ of the same (optimal) objective value such that $f(w') \leq f(w) - 1$. This completes the proof of the lemma.

Figure 1: An example of an optimal weight assignment for the problem Dual on nine different graphs. The solution was found by an interior point method using a linear programming solver. Note that for most graphs, the solution reached is already constant on its support. On two of the graphs, note that evening out the weights would not increase the weight of the maximum spanning tree.

C Proof of Lemma 4.2
Proof.
We prove the equality by establishing the inequality in both directions.

Establishing $\min_{S \subset V} \frac{|S|-1}{|E(S)|} \geq \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$: For $S \subset V$, note that $V(E(S)) \subset S$ and $c(E(S)) \geq 1$, and therefore
$$\frac{|S|-1}{|E(S)|} \geq \frac{|V(E(S))| - c(E(S))}{|E(S)|}, \quad \text{with } E(S) \subset E.$$
Thus, $\min_{S \subset V} \frac{|S|-1}{|E(S)|}$ minimizes a larger objective function over a smaller set compared to $\min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$. The inequality follows immediately.

Establishing $\min_{S \subset V} \frac{|S|-1}{|E(S)|} \leq \min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$: Let $F^\star \subset E$ be a minimizer of $\min_{F \subset E} \frac{|V(F)| - c(F)}{|F|}$, and let $H = (V(F^\star), F^\star)$. By optimality, all connected components of $H$ must be vertex-induced subgraphs of $G$: if not, it would be possible to add edges to $H$ without changing its number of vertices or its number of connected components, contradicting optimality. In other words, there exist disjoint subsets $S_i$, $1 \leq i \leq c(H)$, of $V(F^\star)$ with $V(F^\star) = \cup_{i=1}^{c(H)} S_i$ and $F^\star = \cup_{i=1}^{c(H)} E(S_i)$. If $c(H) = 1$, the inequality follows immediately. If $c(H) \geq 2$, denote by $H \setminus H_1$ the graph obtained by removing $H_1 = (S_1, E(S_1))$ from $H$. Note that $c(H \setminus H_1) = c(H) - 1$ and that $c(H_1) = 1$. By Lemma C.1, for $a, b, c, d \in \mathbb{R}_+$: $\min(\frac{a}{b}, \frac{c}{d}) \leq \frac{a+c}{b+d}$. Therefore,
$$\min\left( \frac{|V(H_1)| - c(H_1)}{|E(H_1)|},\; \frac{|V(H \setminus H_1)| - c(H \setminus H_1)}{|E(H \setminus H_1)|} \right) \leq \frac{|V(H)| - c(H)}{|E(H)|}. \quad (58)$$
If $H_1$ achieves the minimum on the left-hand side, this concludes the proof. If $H \setminus H_1$ achieves the minimum, simply iterate the above argument until a single connected component remains, which concludes the proof.

Lemma C.1.
For any $a, b, c, d \in \mathbb{R}_+$,
$$\min\Big(\frac{a}{b}, \frac{c}{d}\Big) \leq \frac{a+c}{b+d}. \quad (59)$$
Proof. Let $a, b, c, d \in \mathbb{R}_+$. Then the following sequence of statements holds, leading to the proof of the claim:
$$ad \leq bc \quad \text{or} \quad bc \leq ad, \quad (60)$$
$$\min\big(ad(b+d),\; cb(b+d)\big) \leq (a+c)\, bd, \quad (61)$$
$$\min\Big(\frac{a}{b}, \frac{c}{d}\Big) \leq \frac{a+c}{b+d}. \quad (62)$$

D Proofs of Lemmas 4.3 and 4.4
Proof of Lemma 4.3.
Assume $G$ has maximum average degree bounded by $\bar d$, where by definition
$$\bar d = \max_{S \subset V} \frac{2|E(S)|}{|S|}. \quad (63)$$
Therefore, for any $S \subset V$, $|E(S)| \leq \frac{\bar d\, |S|}{2}$. Moreover, there can be at most $\binom{|S|}{2}$ edges in a graph over the vertex set $S$, and hence $|E(S)| \leq \frac{|S|(|S|-1)}{2}$. Therefore, we obtain
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{\bar d}\Big(1 - \frac{1}{|S|}\Big) = L_1(|S|), \quad (64)$$
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{|S|} = L_2(|S|). \quad (65)$$
Therefore,
$$\frac{|S|-1}{|E(S)|} \geq \min_{x \in \mathbb{R}_+} \big\{ \max\big(L_1(x), L_2(x)\big) \big\}. \quad (66)$$
Note that $L_1$ is increasing and bounded, whereas $L_2$ is decreasing. Therefore, $\max(L_1(x), L_2(x))$ with $x \in \mathbb{R}_+$ reaches its minimum at $x$ such that $L_1(x) = L_2(x)$, which gives $x = \bar d + 1$. Therefore, we conclude that for all $S \subset V$,
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{\bar d + 1}. \quad (67)$$

Proof of Lemma 4.4.
Let $G$ have girth $g > 2$. Then all subgraphs of $G$ have girth at least $g$. The generalised Moore bound (obtained in [1]) then gives
$$\forall S \subset V: \quad |S| \geq 1 + d_S \sum_{i=0}^{(g-3)/2} (d_S - 1)^i \quad \text{if } g \text{ is odd}, \quad (68)$$
$$|S| \geq 2 \sum_{i=0}^{(g-2)/2} (d_S - 1)^i \quad \text{if } g \text{ is even}, \quad (69)$$
with $d_S = \frac{2|E(S)|}{|S|}$. We will only keep a weaker version of this bound that does not depend on the parity of $g$. Specifically, for all $S \subset V$:
$$|S| \geq \Big(\frac{2|E(S)|}{|S|} - 1\Big)^{\frac{g-2}{2}}. \quad (70)$$
Therefore, $|E(S)| \leq \frac{1}{2}\big(|S|^{\frac{2}{g-2}+1} + |S|\big)$ for all $S \subset V$. Subsequently, we have
$$\frac{|S|-1}{|E(S)|} \geq \frac{2\big(1 - \frac{1}{|S|}\big)}{1 + |S|^{\frac{2}{g-2}}} \geq \frac{2\big(1 - \frac{1}{|S|}\big)}{1 + N^{\frac{2}{g-2}}}. \quad (71)$$
This bound is clearly increasing with $|S|$. Also note that if $|S| \leq g - 1$, the subgraph $(S, E(S))$ can have no cycle and therefore $\frac{|S|-1}{|E(S)|} \geq 1$. The worst case is therefore attained for $|S| = g$, where we have:
$$\frac{|S|-1}{|E(S)|} \geq \frac{2}{1 + N^{\frac{2}{g-2}}}\Big(1 - \frac{1}{g}\Big). \quad (72)$$

E Proof of Lemma 5.1
Proof.
We shall use Hoeffding's inequality: for any bounded random variable $a \leq X \leq b$, the deviation of its empirical average $\overline{X}_n$ computed from $n$ independent samples is such that, for any $t > 0$,
$$\mathbb{P}\big(|\mathbb{E}(X) - \overline{X}_n| \geq t\big) \leq 2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big). \quad (73)$$
Another version of this inequality, when $\mathbb{E}(X) > 0$, is as follows: for any $\epsilon > 0$,
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\overline{X}_n}{\mathbb{E}(X)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\Big(-\frac{2n\epsilon^2\, \mathbb{E}(X)^2}{(b-a)^2}\Big). \quad (74)$$
An immediate consequence is that $\hat u^n$ is a good approximation of $u$. For any $e \in E$,
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 u_e^2\big), \quad (75)$$
and therefore, by the union bound,
$$\mathbb{P}\Big(\forall e \in E: 1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon\Big) \geq 1 - 2|E|\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (76)$$
Another consequence is that $L_{\hat u^n}$ is a good approximation of $L_u$. Indeed, consider the random variable $\Phi(\Pi^T(\theta))$ of mean $L_u(\theta)$ and of empirical average $L_{\hat u^n}(\theta) = \frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}(\theta))$, and note that this variable is bounded as follows: $0 \leq \Phi(\Pi^T(\theta))\ (\leq \Phi(\theta))\ \leq \frac{1}{\kappa_u} L_u(\theta)$. We have
$$\mathbb{P}\Big(1-\epsilon \leq \frac{L_{\hat u^n}(\theta)}{L_u(\theta)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (77)$$
Regarding $U_{\hat u^n}(\theta)$, the discussion requires an additional argument, because $\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_{\hat u^n}(\theta))$ is not a sum of independent random variables. Instead, let us focus on the close quantity $\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))$, for which we have $0 \leq \Phi(\Pi^T_u(\theta))\ (\leq \frac{1}{\kappa_u}\Phi(\theta))\ \leq \frac{1}{\kappa_u} U_u(\theta)$ and therefore
$$\mathbb{P}\Big(1-\epsilon \leq \frac{\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))}{U_u(\theta)} \leq 1+\epsilon\Big) \geq 1 - 2\exp\big(-2n\epsilon^2 \kappa_u^2\big). \quad (78)$$
Fortunately, if (76) is satisfied, this quantity turns out to be a good approximation of $U_{\hat u^n}(\theta)$. Indeed, assuming that $\forall e \in E: 1-\epsilon \leq \frac{\hat u^n_e}{u_e} \leq 1+\epsilon$, we have that for all $T \in \mathcal{T}(G)$,
$$(1-\epsilon)\, \Pi^T_u(\theta) \preceq \Pi^T_{\hat u^n}(\theta) \preceq (1+\epsilon)\, \Pi^T_u(\theta), \quad (79)$$
and therefore, by (monotonicity) and (sub-linearity),
$$(1-\epsilon)\, \Phi(\Pi^T_u(\theta)) \leq \Phi(\Pi^T_{\hat u^n}(\theta)) \leq (1+\epsilon)\, \Phi(\Pi^T_u(\theta)), \quad (80)$$
which shows
$$(1-\epsilon) \leq \frac{U_{\hat u^n}(\theta)}{\frac{1}{n}\sum_{i=1}^n \Phi(\Pi^{T_i}_u(\theta))} \leq (1+\epsilon). \quad (81)$$
Therefore, by the union bound,
$$\mathbb{P}\Big((1-\epsilon)^2 \leq \frac{U_{\hat u^n}(\theta)}{U_u(\theta)} \leq (1+\epsilon)^2\Big) \geq 1 - (2|E|+2)\exp\big(-2n\kappa_u^2\epsilon^2\big). \quad (82)$$
By putting together (77) and (82), we obtain
$$\mathbb{P}\Big((1-\epsilon)^2 \leq \frac{\widehat\Phi_{\hat u^n}(\theta)}{\widehat\Phi_u(\theta)} \leq (1+\epsilon)^2\Big) \geq 1 - (2|E|+4)\exp\big(-2n\kappa_u^2\epsilon^2\big), \quad (83)$$
and by the arguments of Lemma 3.1, we can conclude that
$$\mathbb{P}\Big(\sqrt{\kappa_u}\, (1-\epsilon)^2 \leq \frac{\widehat\Phi_{\hat u^n}(\theta)}{\Phi(\theta)} \leq \frac{(1+\epsilon)^2}{\sqrt{\kappa_u}}\Big) \geq 1 - (2|E|+4)\exp\big(-2n\kappa_u^2\epsilon^2\big). \quad (84)$$
This completes the proof of Lemma 5.1.

F Proof of Lemma 5.2
Bounded degree graph $G$. First assume that $G$ has maximum degree $d$. Consider any edge $e = (s, t) \in E$. Denote by $\mathcal{N}(s), \mathcal{N}(t) \subset V$ the neighbours of $s$ and $t$. Consider a current $\iota : V \times V \to \mathbb{R}$ which is a solution of the optimization problem corresponding to the effective resistance, as defined in (39). By definition, we have that the effective resistance $u_e$ for $e \in E$ is given by
$$u_e = \sum_{(u,v) \in E} \iota(u,v)^2 \geq \iota(s,t)^2 + \sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2 + \sum_{u \in \mathcal{N}(t) \setminus \{s\}} \iota(u,t)^2. \quad (85)$$
By the constraints of the optimization problem, the sum of the currents entering the source $s$ and leaving the sink $t$ is equal to $1$ (whereas it is null at the other vertices). Therefore, focusing on $s$, we have $\sum_{u \in \mathcal{N}(s) \setminus \{t\}} |\iota(s,u)| \geq 1 - |\iota(s,t)|$. By applying the Cauchy–Schwarz inequality, we have that
$$\Big(\sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2\Big) \times \Big(\sum_{u \in \mathcal{N}(s) \setminus \{t\}} 1\Big) \geq \big(1 - |\iota(s,t)|\big)^2. \quad (86)$$
Recall that $G$ has maximum vertex degree $d$, and therefore $|\mathcal{N}(s) \setminus \{t\}| \leq d - 1$. Therefore,
$$\sum_{u \in \mathcal{N}(s) \setminus \{t\}} \iota(s,u)^2 \geq \frac{\big(1 - |\iota(s,t)|\big)^2}{d-1}. \quad (87)$$
Because the same holds for the term $\sum_{u \in \mathcal{N}(t) \setminus \{s\}} \iota(u,t)^2$, we obtain from (85) that
$$u_e \geq \iota(s,t)^2 + \frac{2\big(1 - |\iota(s,t)|\big)^2}{d-1}. \quad (88)$$
This expression holds for all possible values of $\iota(s,t)$. We note that for any given $\lambda \in \mathbb{R}_+$,
$$\inf_{x \in \mathbb{R}}\ x^2 + (1-x)^2\, \lambda \geq \frac{\lambda}{1+\lambda}. \quad (89)$$
Therefore, we conclude that for a graph $G$ with maximum degree $d$,
$$u_e \geq \frac{2}{d+1}. \quad (90)$$

Graph $G$ with girth $g$. We now assume that $G$ has girth $g$. As before, let $e = (s, t) \in E$. Denote by $G \setminus \{e\} = (V, E \setminus \{e\})$ the graph obtained by removing edge $e$ from $G$. For $0 \leq k \leq \frac{g-3}{2}$, we define
$$E_k = \{ (u,v) \in E : d_{G \setminus \{e\}}(s, u) = k,\ d_{G \setminus \{e\}}(s, v) = k+1 \}, \quad (91)$$
where $d_{G \setminus \{e\}}(s, u)$ denotes the shortest-path distance between vertices $s, u$ in the graph $G$ excluding edge $e$. That is, $E_k$ is the set of edges connecting vertices at distance $k$ from $s$ in $G \setminus \{e\}$ to vertices at distance $k+1$ from $s$ in $G \setminus \{e\}$. Since $k \leq \frac{g-3}{2}$, all $E_k$ are disjoint, and hence the current $\iota$ satisfies
$$u_e \geq \iota(s,t)^2 + \sum_{k=0}^{(g-3)/2} \sum_{(u,v) \in E_k} \iota(u,v)^2. \quad (92)$$
For $0 \leq k \leq \frac{g-3}{2}$, note that $E_k \cup \{e\}$ defines a cut of $G$. Therefore, by Kirchhoff's law, $\sum_{(u,v) \in E_k} |\iota(u,v)| \geq 1 - |\iota(s,t)|$. Using the Cauchy–Schwarz inequality, we obtain:
$$\Big(\sum_{(u,v) \in E_k} \iota(u,v)^2\Big) \times |E_k| \geq \big(1 - |\iota(s,t)|\big)^2. \quad (93)$$
By summing up all these inequalities, we obtain
$$\sum_{k=0}^{(g-3)/2} \sum_{(u,v) \in E_k} \iota(u,v)^2 \geq \big(1 - |\iota(s,t)|\big)^2 \sum_{k=0}^{(g-3)/2} \frac{1}{|E_k|}. \quad (94)$$
Note that if a sequence $(m_k)_{k \geq 0}$ satisfies $\sum_{k=1}^{l} m_k \leq |E|$, then $\sum_{k=1}^{l} \frac{1}{m_k} \geq \frac{l^2}{|E|}$. Therefore, because all $E_k$ are disjoint, $\sum_{k=0}^{(g-3)/2} \frac{1}{|E_k|} \geq \frac{(g-1)^2}{4|E|}$. Inserting this in (92), we obtain
$$u_e \geq \iota(s,t)^2 + \big(1 - |\iota(s,t)|\big)^2\, \frac{(g-1)^2}{4|E|}. \quad (95)$$
Using (89), we obtain
$$u_e \geq \frac{1}{1 + \frac{4|E|}{(g-1)^2}}. \quad (96)$$
This completes the proof of Lemma 5.2.

G Proof of Theorem 6.1
Proof.
The proof follows by establishing that $\kappa^k_\rho$, as defined in (41), for $\rho \in \mathcal{P}(\mathrm{Part}_k(G))$, satisfies
$$\kappa^k_\rho \geq 1 - \epsilon \quad (97)$$
if $\rho$ is an $(\epsilon, k)$-partitioning. Indeed, by the definition of an $(\epsilon, k)$-partitioning, we have that for any $e \in E$,
$$\rho_e = \mathbb{E}_{H \sim \rho}[\mathbb{1}(e \in H)] \geq 1 - \epsilon. \quad (98)$$
Therefore,
$$\kappa^k_\rho = \min_{e \in E} \rho_e \geq 1 - \epsilon. \quad (99)$$
Subsequently, using arguments identical to those in the proof of Lemma 3.1, it follows that $\widehat\Phi_\rho(\theta)$ is a $1/\sqrt{\kappa^k_\rho}$-approximation. That is,
$$\sqrt{1-\epsilon} \leq \frac{\Phi(\theta)}{\widehat\Phi_\rho(\theta)} \leq \frac{1}{\sqrt{1-\epsilon}}.$$
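The sandwich bounds underlying both Lemma 3.1 and Theorem 6.1 can be verified by brute force on a toy model. The sketch below (the triangle graph, the agreement potentials and the uniform tree distribution are illustrative choices, not from the paper) checks $L_\rho(\theta) \leq \Phi(\theta) \leq U_\rho(\theta)$, the relation $U_\rho(\theta) \leq \frac{1}{\kappa_\rho} L_\rho(\theta)$, and the resulting $\sqrt{\kappa_\rho}$ guarantee for $\widehat\Phi_\rho(\theta) = \sqrt{L_\rho(\theta)\, U_\rho(\theta)}$:

```python
import itertools
import math

# Triangle graph on {0, 1, 2}, binary variables, with the illustrative
# potential phi_e(x_s, x_t) = 1 if x_s == x_t else 0.
theta = {(0, 1): 1.0, (1, 2): 0.7, (0, 2): 0.4}

def log_partition(th):
    """Brute-force Phi(theta) = log sum_x exp(sum_e theta_e * phi_e(x_e))."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=3):
        energy = sum(w * (x[s] == x[t]) for (s, t), w in th.items())
        total += math.exp(energy)
    return math.log(total)

# Uniform distribution over the 3 spanning trees of the triangle; each edge
# appears in 2 of the 3 trees, so rho_e = 2/3 for every edge.
trees = [[(0, 1), (1, 2)], [(1, 2), (0, 2)], [(0, 1), (0, 2)]]
kappa = 2 / 3  # kappa_rho = min_e rho_e

phi = log_partition(theta)
# L: trees keep theta as is; U: edge weights are rescaled by 1/rho_e.
L = sum(log_partition({e: theta[e] for e in T}) for T in trees) / 3
U = sum(log_partition({e: theta[e] / kappa for e in T}) for T in trees) / 3
est = math.sqrt(L * U)

assert L <= phi <= U                                             # (46), (47)
assert U <= L / kappa + 1e-9                                     # (50)
assert math.sqrt(kappa) * phi <= est <= phi / math.sqrt(kappa)   # (52)
print(round(phi, 3), round(est, 3))
```

Here $\sqrt{\kappa_\rho} = \sqrt{2/3} \approx 0.816$, so the estimate $\widehat\Phi_\rho(\theta)$ is guaranteed to lie within roughly 22% of the true log-partition function, as the assertions confirm.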