The Bethe Permanent of a Non-Negative Matrix

Abstract

It has recently been observed that the permanent of a non-negative square matrix, i.e., of a square matrix containing only non-negative real entries, can very well be approximated by solving a certain Bethe free energy function minimization problem with the help of the sum-product algorithm. We call the resulting approximation of the permanent the Bethe permanent. In this paper we give reasons why this approach to approximating the permanent works well. Namely, we show that the Bethe free energy function is convex and that the sum-product algorithm finds its minimum efficiently. We then discuss the fact that the permanent is lower bounded by the Bethe permanent, and we comment on potential upper bounds on the permanent based on the Bethe permanent. We also present a combinatorial characterization of the Bethe permanent in terms of permanents of so-called lifted versions of the matrix under consideration. Moreover, we comment on possibilities to modify the Bethe permanent so that it approximates the permanent even better, and we conclude the paper with some observations and conjectures about permanent-based pseudo-codewords and permanent-based kernels.


Pascal O. Vontobel


Index Terms: Bethe approximation, Bethe permanent, fractional Bethe approximation, graph cover, partition function, perfect matching, permanent, sum-product algorithm.

I. INTRODUCTION

Central to the topic of this paper is the definition of the permanent of a square matrix (see, e.g., [1]).

Definition 1

Let $\boldsymbol{\theta} = (\theta_{i,j})_{i,j}$ be a real matrix of size $n \times n$. The permanent of $\boldsymbol{\theta}$ is defined to be the scalar

\[ \operatorname{perm}(\boldsymbol{\theta}) = \sum_{\sigma} \prod_{i \in [n]} \theta_{i,\sigma(i)}, \tag{1} \]

where the summation is over all $n!$ permutations of the set $[n] \triangleq \{1, 2, \ldots, n\}$. □

Contrast this definition with the definition of the determinant of $\boldsymbol{\theta}$, i.e.,

\[ \det(\boldsymbol{\theta}) = \sum_{\sigma} \operatorname{sgn}(\sigma) \prod_{i \in [n]} \theta_{i,\sigma(i)}, \]

where $\operatorname{sgn}(\sigma)$ equals $+1$ if $\sigma$ is an even permutation and equals $-1$ if $\sigma$ is an odd permutation.

Footnote: Accepted for IEEE Transactions on Information Theory. Manuscript received July 21, 2011; date of current version October 20, 2012. Some of the material in this paper was previously presented at the 48th Annual Allerton Conference on Communications, Control, and Computing, Monticello, IL, USA, Sep. 29–Oct. 1, 2010, and at the 2011 Information Theory and Applications Workshop, UC San Diego, La Jolla, CA, USA, Feb. 6–11, 2011. P. O. Vontobel is with Hewlett–Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304, USA (e-mail: [email protected]).
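The two definitions above can be compared directly on small matrices. The following Python sketch is an illustration only, not code from the paper; the function names are hypothetical and indices are 0-based. It evaluates both sums literally over all $n!$ permutations, so it is usable only for very small $n$.

```python
from itertools import permutations
from math import prod


def permutation_sign(sigma):
    """Return +1 for an even permutation and -1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for a in range(len(sigma))
        for b in range(a + 1, len(sigma))
        if sigma[a] > sigma[b]
    )
    return -1 if inversions % 2 else 1


def permanent(theta):
    """Evaluate (1) literally: sum over all n! permutations of the row products."""
    n = len(theta)
    return sum(
        prod(theta[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


def determinant(theta):
    """Same sum as the permanent, but with each term weighted by sgn(sigma)."""
    n = len(theta)
    return sum(
        permutation_sign(sigma) * prod(theta[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


theta = [[1, 2], [3, 4]]
print(permanent(theta))    # 1*4 + 2*3 = 10
print(determinant(theta))  # 1*4 - 2*3 = -2
```

The only difference between the two functions is the sign factor, which is exactly the difference between the two displayed formulas.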

A. Complexity of Computing the Permanent

Because the definition of the permanent looks simpler than the definition of the determinant, it is tempting to conclude that the permanent can be computed at least as efficiently as the determinant. However, this does not seem to be the case. Namely, whereas the arithmetic complexity (number of real additions and multiplications) needed to compute the determinant is in $O(n^3)$, Ryser's algorithm (one of the most efficient algorithms for computing the permanent) requires $\Theta(n \cdot 2^n)$ arithmetic operations [2]. This clearly improves upon the brute-force complexity $O(n \cdot n!) = O\bigl(n^{3/2} \cdot (n/e)^n\bigr)$ for computing the permanent, but is still exponential in the matrix size.

In terms of complexity classes, the computation of the permanent is in the complexity class #P.

B. Approximations to the Permanent

Given the difficulty of computing the permanent exactly, and given the fact that in many applications it is good enough to compute an approximation to the permanent, this paper focuses on efficient methods to approximate the permanent. This relaxation in requirements, from exact to approximate evaluation of the permanent, allows one to devise algorithms that potentially have much lower complexity.

Moreover, we will consider only the case where the matrix $\boldsymbol{\theta}$ in (1) is non-negative, i.e., where all entries of $\boldsymbol{\theta}$ are non-negative. It is to be expected that approximating the permanent is simpler in this case because with this restriction the sum in (1) contains only non-negative terms, i.e., the terms in this sum "interfere constructively." This is in contrast to the general case where the sum in (1) contains positive and negative terms, i.e., the terms in this sum "interfere constructively and destructively." Despite this restriction to non-negative matrices, many interesting counting problems can be captured by this setup.

Footnote: Strictly speaking, there are also matrices $\boldsymbol{\theta}$ with positive and negative entries but where the product $\prod_{i \in [n]} \theta_{i,\sigma(i)}$ is non-negative for every $\sigma$.

Earlier work on approximating the permanent of a non-negative matrix includes:
• Markov-chain-Monte-Carlo-based methods, which started with the work of Broder [4] and ultimately led to a famous fully polynomial randomized approximation scheme (FPRAS) by Jerrum, Sinclair, and Vigoda [5] (for more details, in particular for complexity estimates of these and related methods, see for example the discussion in [6]);
• Godsil-Gutman-estimator-based methods by Karmarkar, Karp, Lipton, Lovász, and Luby [7] and by Barvinok [8];
• a divide-and-conquer approach by Jerrum and Vazirani [9];
• a Sinkhorn-matrix-rescaling-based method by Linial, Samorodnitsky, and Wigderson [10];
• Bethe-approximation / sum-product-algorithm (SPA) based methods by Chertkov, Kroc, and Vergassola [11] and by Huang and Jebara [12].

The study in this paper was very much motivated by these last two papers on graphical-model-based methods, in particular because the resulting algorithms are very efficient and the obtained permanent estimates have an accuracy that is good enough for many purposes.

The main idea behind this graphical-model-based approach is to formulate a factor graph whose partition function equals the permanent that we are looking for. Consequently, the negative logarithm of the permanent equals the minimum of the so-called Gibbs free energy function that is associated with this factor graph. Although this is an elegant reformulation of the permanent computation problem, it does not yet yield any computational savings. Nevertheless, it suggests looking for a function that is tractable and whose minimum is close to the minimum of the Gibbs free energy function. One such function is the so-called Bethe free energy function [13], and with this, paralleling the above-mentioned relationship between the permanent and the minimum of the Gibbs free energy function, the Bethe permanent is defined such that its negative logarithm equals the global minimum of the Bethe free energy function. The Bethe free energy function is an interesting candidate because a theorem by Yedidia, Freeman, and Weiss [13] says that fixed points of the SPA correspond to stationary points of the Bethe free energy function.

In general, this approach of replacing the Gibbs free energy function by the Bethe free energy function comes with very few guarantees, though.
• The Bethe free energy function might have multiple local minima.
• It is unclear how close the (global) minimum of the Bethe free energy function is to the minimum of the Gibbs free energy function.
• It is unclear if the SPA converges, even to a local minimum of the Bethe free energy function. (As we will see, the factor graph that we use (see Fig. 1) is not sparse and has many short cycles, in particular many four-cycles. These facts might suggest that the application of the SPA to this factor graph is rather problematic.)

Luckily, in the case of the permanent approximation problem, one can formulate a factor graph where the Bethe free energy function is very well behaved. In particular, in this paper we discuss a factor graph that has the following properties.
• We show that the Bethe free energy function is, when suitably parameterized, a convex function; therefore it has no non-global local minima.
• The minimum of the Bethe free energy function is quite close to the minimum of the Gibbs free energy function. Namely, as was recently shown by Gurvits [14], [15], the permanent is lower bounded by the Bethe permanent. Moreover, we list conjectures on strict and probabilistic Bethe-permanent-based upper bounds on the permanent. In particular, for certain classes of square non-negative matrices, empirical evidence suggests that the permanent is upper bounded by some constant (that grows rather modestly with the matrix size) times the Bethe permanent.
• We show that the SPA finds the minimum of the Bethe free energy function under rather mild conditions. In fact, the error between the iteration-dependent estimate of the Bethe permanent and the Bethe permanent itself decays exponentially fast, with an exponent depending on the matrix $\boldsymbol{\theta}$. Interestingly enough, in the associated convergence analysis a key role is played by a certain Markov chain that maximizes the sum of its entropy rate plus some average state transition cost.

Besides leaving some questions open with respect to (w.r.t.) the Bethe free energy function (see, e.g., the above-mentioned conjectures concerning permanent upper bounds), these results by and large validate the empirical success, as observed by Chertkov, Kroc, and Vergassola [11] and by Huang and Jebara [12], of approximating the permanent by graphical-model-based methods.

Let us remark that for many factor graphs with cycles the Bethe free energy function is not as well behaved as the Bethe free energy function under consideration in this paper. In particular, as discussed in [16], every code picked from an ensemble of regular low-density parity-check codes [17], where the ensemble is such that the minimum Hamming distance grows (with high probability) linearly with the block length, has a Bethe free energy function that is non-convex in certain regions of its domain. Nevertheless, decoding such codes with SPA-based decoders has been highly successful (see, e.g., [18]).

C. Related Work

The literature on permanents (and adjacent areas of counting perfect matchings, counting zero/one matrices with specified row and column sums, etc.) is vast. Therefore, we just mention works that are (to the best of our knowledge) the most relevant to the present paper.

Besides the already cited papers [11], [12] on Bethe-approximation-based methods for the permanent of a non-negative matrix, some aspects of the Bethe free energy function were analyzed by Watanabe and Chertkov in [19] and by Chertkov, Kroc, Krzakala, Vergassola, and Zdeborová in [20]. (In particular, the paper [19] applied the loop calculus technique by Chertkov and Chernyak [21].) Very recent work in that line of research is presented in a paper by A. B. Yedidia and Chertkov [22] that studies so-called fractional free energy functionals, and resulting lower and upper bounds on the permanent of a non-negative matrix.

Because computing the permanent is related to counting perfect matchings, the paper by Bayati and Nair [23] on counting matchings in graphs with the help of the SPA is very relevant. Note that their setup is such that the perfect matching case can be seen as a limiting case (namely the zero-temperature limit) of the matching setup. However, for the perfect matching case (a case for which the authors of [23] make no claims) the convergence proof of the SPA in [23] is incomplete. Moreover, their matchings are weighted only inasmuch as the weight of a matching depends on the size of the matching. Consequently, because all perfect matchings have the same size, they all are assigned the same weight. (See also the related paper by Bayati, Gamarnik, Katz, Nair, and Tetali [24], and an extension to counting perfect matchings in certain types of graph by Gamarnik and Katz [25].)
For an SPA convergence analysis of a slightly generalized weighted matching setup, the interested reader is referred to a recent paper by Williams and Lau [26].

Very relevant to the present paper are also papers on max-product algorithm / min-sum algorithm based approaches to the maximum weight perfect matching problem [27]–[30]. As shown in these papers, these algorithms find the desired solution efficiently for bipartite graphs, a fact which is strongly related to the observation that the linear programming relaxation of the underlying integer linear program is tight in this case. This tightness in relaxation, which is an immediate consequence of a theorem by Birkhoff and von Neumann (see Theorem 3), goes also a long way towards explaining why the Bethe free energy function under consideration in the present paper is well behaved. Finally, let us remark that because the difference between two perfect matchings corresponds to a union of disjoint cycles, the max-product algorithm / min-sum algorithm convergence analysis in [27]–[30] has some resemblance with Wiberg's max-product algorithm / min-sum algorithm convergence analysis for so-called cycle codes [31].

Linial, Samorodnitsky, and Wigderson [10] published a deterministic strongly polynomial algorithm to compute the permanent of an $n \times n$ non-negative matrix within a multiplicative factor of $e^n$. This is related to the present paper because their approach is based on Sinkhorn's matrix rescaling method, which can be seen as finding the minimum of a certain free energy type function.

The present paper has some similarities with recent papers by Barvinok on counting zero/one matrices with prescribed row and column sums [32] and by Barvinok and Samorodnitsky on computing the partition function for perfect matchings in hypergraphs [33]. However, these papers pursue what would be called a mean-field theory approach in the physics literature [34].
An exception to the previous statement is Section 3.2 in [32], which contains Bethe-approximation-type computations. (See the references in that section for further papers that investigate similar approaches.)

As mentioned in the abstract, the present paper discusses a combinatorial characterization of the Bethe permanent in terms of permanents of so-called lifted versions of the matrix under consideration. For this we use results from [16] that give a combinatorial characterization of the Bethe partition function of a factor graph in terms of the partition function of graph covers of this factor graph. Interestingly, very similar objects were considered by Greenhill, Janson, and Ruciński [35]; we will comment on this connection in Section VII-E.

Finally, as already mentioned in the previous subsection, Gurvits's recent papers [14], [15] contain important observations w.r.t. the relationship between the permanent and the Bethe permanent of a non-negative matrix, and put them into the context of Schrijver's permanental inequality.

D. Overview of the Paper

This paper is structured as follows. We conclude this introductory section with a discussion of some of the notation that is used. In Section II we then introduce the main normal factor graph (NFG) for this paper, in Section III we formally define the Bethe permanent, in Section IV we discuss properties of the Bethe entropy function and the Bethe free energy function, in Section V we analyze the SPA, in Section VI we give a "combinatorial characterization" of the Bethe permanent in terms of graph covers of the above-mentioned NFG, in Section VII we discuss Bethe-permanent-based bounds on the permanent, in Section VIII we list some thoughts on using the concept of the "fractional Bethe entropy function," in Section IX we list some observations and conjectures, and we conclude the paper in Section X. Finally, the appendix contains some of the proofs.

E. Basic Notations and Definitions

This subsection discusses the most important notations that will be used in this paper. More notational definitions will be given in later sections.

We let $\mathbb{R}$ be the field of real numbers, $\mathbb{R}_{\geq 0}$ be the set of non-negative real numbers, $\mathbb{R}_{>0}$ be the set of positive real numbers, $\mathbb{Z}$ be the ring of integers, $\mathbb{Z}_{\geq 0}$ be the set of non-negative integers, $\mathbb{Z}_{>0}$ be the set of positive integers, and for any positive integer $L$ we define $[L] \triangleq \{1, \ldots, L\}$. Scalars are denoted by non-boldface characters, whereas vectors and matrices are denoted by boldface characters. For any positive integer $L$, the matrix $\mathbf{1}_{L \times L}$ is the all-one matrix of size $L \times L$.

Assumption 2

Throughout this paper, if not mentioned otherwise, $n$ is a positive integer and $\boldsymbol{\theta} = (\theta_{i,j})_{i,j}$ is a non-negative matrix of size $n \times n$. Moreover, we assume that $\boldsymbol{\theta}$ is such that $\operatorname{perm}(\boldsymbol{\theta}) > 0$, i.e., there is at least one permutation $\sigma$ of $[n]$ such that $\prod_{i \in [n]} \theta_{i,\sigma(i)} > 0$. □

We use calligraphic letters for sets, and the size of a set $\mathcal{S}$ is denoted by $|\mathcal{S}|$. For a finite set $\mathcal{S}$, we let $\Pi_{\mathcal{S}}$ be the set of probability mass functions over $\mathcal{S}$, i.e.,

\[ \Pi_{\mathcal{S}} \triangleq \Bigl\{ \mathbf{p} = (p_s)_{s \in \mathcal{S}} \Bigm| p_s \geq 0 \text{ for all } s \in \mathcal{S},\ \sum_{s \in \mathcal{S}} p_s = 1 \Bigr\}. \]

Moreover, for any positive integer $L$, we define $\mathcal{P}_{L \times L}$ to be the set of all $L \times L$ permutation matrices, i.e.,

\[ \mathcal{P}_{L \times L} \triangleq \left\{ \mathbf{P} \;\middle|\; \begin{array}{l} \mathbf{P} \text{ is a matrix of size } L \times L \\ \mathbf{P} \text{ contains exactly one } 1 \text{ per row} \\ \mathbf{P} \text{ contains exactly one } 1 \text{ per column} \\ \mathbf{P} \text{ contains } 0\text{s otherwise} \end{array} \right\}. \]

Clearly, there is a bijection between $\mathcal{P}_{L \times L}$ and the set of all permutations of $[L]$. Finally, for any positive integer $L$, we let $\Gamma_{L \times L}$ be the set of doubly stochastic matrices of size $L \times L$, i.e.,

\[ \Gamma_{L \times L} \triangleq \left\{ \boldsymbol{\gamma} = (\gamma_{i,j}) \;\middle|\; \begin{array}{l} \gamma_{i,j} \geq 0 \text{ for all } (i,j) \in [L] \times [L] \\ \sum_{j \in [L]} \gamma_{i,j} = 1 \text{ for all } i \in [L] \\ \sum_{i \in [L]} \gamma_{i,j} = 1 \text{ for all } j \in [L] \end{array} \right\}. \]

The convex hull [36] of some subset $\mathcal{S}$ of some multi-dimensional real space is denoted by $\operatorname{conv}(\mathcal{S})$. In the following, when talking about the interior of a polytope, we will mean the relative interior [36] of that polytope.

When appropriate, we will identify the set of $L \times L$ real matrices with the $L^2$-dimensional real space. In that sense, $\Gamma_{L \times L}$ can be seen as a polytope in the $L^2$-dimensional real space. Clearly, $\Gamma_{L \times L}$ is a convex set, and every permutation matrix of size $L \times L$ is a doubly stochastic matrix of size $L \times L$.
Most interestingly, every doubly stochastic matrix of size $L \times L$ can be written as a convex combination of permutation matrices of size $L \times L$; this observation is a consequence of the important Birkhoff–von Neumann Theorem.

Theorem 3 (Birkhoff–von Neumann Theorem)

For any positive integer $L$, the set of doubly stochastic matrices of size $L \times L$ is a polytope whose vertex set equals the set of permutation matrices of size $L \times L$, i.e.,

\[ \operatorname{vertex-set}(\Gamma_{L \times L}) = \mathcal{P}_{L \times L}. \]

As a consequence, the set of doubly stochastic matrices of size $L \times L$ is the convex hull of the set of all permutation matrices of size $L \times L$, i.e.,

\[ \Gamma_{L \times L} = \operatorname{conv}(\mathcal{P}_{L \times L}). \]

Proof: See, e.g., [37, Section 8.7]. ■
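The convex-combination statement in Theorem 3 can be illustrated on a small example. The following Python sketch is not taken from the paper or from [37]; it is a greedy, brute-force decomposition with hypothetical function names. It assumes that the input is exactly doubly stochastic (hence the use of `Fraction`) and that $L$ is small enough to search over all permutations.

```python
from fractions import Fraction
from itertools import permutations


def birkhoff_decomposition(gamma):
    """Greedily write a doubly stochastic matrix as a convex combination of permutations.

    Returns a list of (weight, sigma) pairs, where sigma[i] is the column of the
    single 1 in row i of the corresponding permutation matrix (0-based indices).
    Assumes gamma is exactly doubly stochastic; otherwise next() may raise
    StopIteration.
    """
    L = len(gamma)
    residual = [[Fraction(x) for x in row] for row in gamma]
    terms = []
    while any(x > 0 for row in residual for x in row):
        # The residual is a positive multiple of a doubly stochastic matrix, so by
        # Theorem 3 its support contains at least one permutation.
        sigma = next(
            s for s in permutations(range(L))
            if all(residual[i][s[i]] > 0 for i in range(L))
        )
        weight = min(residual[i][sigma[i]] for i in range(L))
        for i in range(L):
            residual[i][sigma[i]] -= weight  # zeroes at least one entry per round
        terms.append((weight, sigma))
    return terms


def recombine(terms, L):
    """Sum weight * permutation matrix over all terms."""
    out = [[Fraction(0)] * L for _ in range(L)]
    for weight, sigma in terms:
        for i in range(L):
            out[i][sigma[i]] += weight
    return out


h, q = Fraction(1, 2), Fraction(1, 4)
gamma = [[h, h, 0], [q, q, h], [q, q, h]]
terms = birkhoff_decomposition(gamma)
print(sum(w for w, _ in terms))       # weights of a convex combination sum to 1
print(recombine(terms, 3) == gamma)   # the combination reproduces gamma
```

Each round sets at least one entry of the residual to zero, so the loop terminates after at most $L^2$ rounds; the decomposition it finds is in general not unique.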

Finally, all logarithms will be natural logarithms and the value of $0 \cdot \log(0)$ is defined to be equal to $0$.

II. NORMAL FACTOR GRAPH REPRESENTATION

Factor graphs are a convenient way to represent multivariate functions [38]. In this paper we use a variant called "normal factor graphs (NFGs)" [39] (also called "Forney-style factor graphs" [40]), where variables are associated with edges.

As already mentioned in the introduction, the main idea behind the graphical-model-based approach to estimating the permanent is to formulate an NFG such that its partition function equals the permanent. There are of course different ways to do this and typically different formulations will yield different results when estimating the permanent with sub-optimal algorithms like the SPA. It is well known that when the NFG has no cycles, the SPA computes the partition function exactly. However, for the given problem any NFG without cycles yields highly inefficient SPA update rules for reasonably large $n$ (otherwise there would be a contradiction to the considerations in Section I-A), and so we will focus on NFGs with cycles. The NFG that is introduced in the following definition and that is based on a complete bipartite graph with two times $n$ vertices, is a rather natural candidate, and, as we will see, has very interesting and useful properties.

Fig. 1. The NFG $\mathsf{N}(\boldsymbol{\theta})$ is based on a complete bipartite graph with two times $n$ vertices (here $n = 5$). The function nodes on the left-hand side represent the local functions $\{g_i\}_{i \in \mathcal{I}}$, the function nodes on the right-hand side represent the functions $\{g_j\}_{j \in \mathcal{J}}$, and with the edge $e = (i,j)$ we associate the variable $A_e = A_{i,j}$. (See Definition 4 for more details.)

Definition 4

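We define the NFG $\mathsf{N}(\boldsymbol{\theta}) \triangleq \mathsf{N}(\mathcal{F}, \mathcal{E}, \mathcal{A}, \mathcal{G})$ as follows (see also Fig. 1).
• The set of vertices (henceforth also called function nodes) is $\mathcal{F} \triangleq \mathcal{I} \mathbin{\dot\cup} \mathcal{J}$, where $\mathcal{I} \triangleq [n]$ will be called the set of left vertices and $\mathcal{J} \triangleq [n]$ will be called the set of right vertices. (See Footnote a.)
• The set of full-edges is $\mathcal{E}_{\mathrm{full}} \triangleq \mathcal{I} \times \mathcal{J} = \{ (i,j) \mid i \in \mathcal{I},\ j \in \mathcal{J} \}$ and the set of half-edges is $\mathcal{E}_{\mathrm{half}} = \emptyset$, i.e., the empty set. (A full-edge is an edge connecting two vertices, whereas a half-edge is an edge that is connected to only one vertex.) The set of edges is $\mathcal{E} \triangleq \mathcal{E}_{\mathrm{full}} \cup \mathcal{E}_{\mathrm{half}} = \mathcal{E}_{\mathrm{full}}$.
• With every edge $e = (i,j) \in \mathcal{E}$ we associate the variable $A_e = A_{i,j}$ with alphabet $\mathcal{A}_e = \mathcal{A}_{i,j} \triangleq \{0, 1\}$; a realization of $A_e = A_{i,j}$ will be denoted by $a_e = a_{i,j}$.
• The set $\mathcal{A} \triangleq \prod_e \mathcal{A}_e = \prod_{i,j} \mathcal{A}_{i,j}$ will be called the configuration set, and so $\mathbf{a} \triangleq (a_e)_{e \in \mathcal{E}} = (a_{i,j})_{(i,j) \in \mathcal{I} \times \mathcal{J}} \in \mathcal{A}$ will be called a configuration. For a given vector $\mathbf{a}$, we also define the sub-vectors $\mathbf{a}_i \triangleq (a_{i,j})_{j \in \mathcal{J}}$ and $\mathbf{a}_j \triangleq (a_{i,j})_{i \in \mathcal{I}}$. When convenient, the vector $\mathbf{a}$ will be considered to be an $n \times n$ matrix. Then $\mathbf{a}_i$ corresponds to the $i$th row of $\mathbf{a}$, and $\mathbf{a}_j$ corresponds to the $j$th column of $\mathbf{a}$. (Note that we will also use the notations $\mathbf{a}_i \triangleq (a_{i,j})_{j \in \mathcal{J}}$ and $\mathbf{a}_j \triangleq (a_{i,j})_{i \in \mathcal{I}}$ when there is not necessarily an underlying configuration $\mathbf{a}$ of the whole NFG.)
• For every $i \in \mathcal{I}$ we define the local functions (see Footnotes b and c)
\[ g_i : \prod_{j'} \mathcal{A}_{i,j'} \to \mathbb{R}, \quad \mathbf{a}_i \mapsto \begin{cases} \sqrt{\theta_{i,j}} & \text{(if } \mathbf{a}_i = \mathbf{u}_j \text{)} \\ 0 & \text{(otherwise)} \end{cases} \]
Similarly, for every $j \in \mathcal{J}$ we define the local functions
\[ g_j : \prod_{i'} \mathcal{A}_{i',j} \to \mathbb{R}, \quad \mathbf{a}_j \mapsto \begin{cases} \sqrt{\theta_{i,j}} & \text{(if } \mathbf{a}_j = \mathbf{u}_i \text{)} \\ 0 & \text{(otherwise)} \end{cases} \]
• For every $i \in \mathcal{I}$ we define the function node alphabet $\mathcal{A}_i$ to be the set
\[ \mathcal{A}_i \triangleq \Bigl\{ \mathbf{a}_i \in \prod_{j'} \mathcal{A}_{i,j'} \Bigm| g_i(\mathbf{a}_i) \neq 0 \Bigr\} = \{ \mathbf{u}_j \mid j \in \mathcal{J} \}. \]
Similarly, for every $j \in \mathcal{J}$ we define the function node alphabet $\mathcal{A}_j$ to be the set
\[ \mathcal{A}_j \triangleq \Bigl\{ \mathbf{a}_j \in \prod_{i'} \mathcal{A}_{i',j} \Bigm| g_j(\mathbf{a}_j) \neq 0 \Bigr\} = \{ \mathbf{u}_i \mid i \in \mathcal{I} \}. \]
(The sets $\mathcal{A}_i$ and $\mathcal{A}_j$ are also known as the local constraint codes of the function nodes $i$ and $j$, respectively.)
• The global function $g$ is defined to be
\[ g : \mathcal{A} \to \mathbb{R}, \quad \mathbf{a} \mapsto \Bigl( \prod_i g_i(\mathbf{a}_i) \Bigr) \cdot \Bigl( \prod_j g_j(\mathbf{a}_j) \Bigr). \]
• A configuration $\mathbf{c}$ with $g(\mathbf{c}) \neq 0$ will be called a valid configuration. The set of all valid configurations, i.e.,
\[ \mathcal{C} \triangleq \bigl\{ \mathbf{c} \in \mathcal{A} \bigm| g(\mathbf{c}) \neq 0 \bigr\} = \left\{ (c_{i,j})_{(i,j) \in \mathcal{I} \times \mathcal{J}} \;\middle|\; \begin{array}{l} c_{i,j} \in \mathcal{A}_{i,j},\ (i,j) \in \mathcal{I} \times \mathcal{J} \\ \mathbf{c}_i \in \mathcal{A}_i,\ i \in \mathcal{I} \\ \mathbf{c}_j \in \mathcal{A}_j,\ j \in \mathcal{J} \end{array} \right\}, \]
will be called the global behavior of $\mathsf{N}(\boldsymbol{\theta})$. Considering the elements of $\mathcal{C}$ as $n \times n$ matrices, it can easily be verified that $\mathcal{C} = \mathcal{P}_{n \times n}$. This allows us to associate with $\mathbf{c} \in \mathcal{C}$ the permutation $\sigma_{\mathbf{c}} : [n] \to [n]$ that maps $i \in \mathcal{I}$ to $j \in \mathcal{J}$ if $c_{i,j} = 1$. □

Footnote a: Here, $\mathcal{F} \triangleq \mathcal{I} \mathbin{\dot\cup} \mathcal{J}$ stands for the more cumbersome $\mathcal{F} \triangleq \bigl( \{\text{left}\} \times \mathcal{I} \bigr) \cup \bigl( \{\text{right}\} \times \mathcal{J} \bigr)$. In the following, $i$ (and variations thereof) will refer to a left vertex and $j$ (and variations thereof) will refer to a right vertex. In that spirit, variables like $\eta_i$ and $\eta_j$ are different variables, also if $i = j$.

Footnote b: Here and in the following, $\mathbf{u}_j$, $j \in \mathcal{J}$, stands for the length-$n$ vector where all entries are zero except for the $j$th entry that equals $1$. The vector $\mathbf{u}_i$, $i \in \mathcal{I}$, is defined similarly.

Footnote c: Here and in the following, we will use the short-hands $\sum_i$, $\sum_j$, $\sum_{i'}$, $\sum_{j'}$, $\sum_e$, $\sum_{e'}$ for $\sum_{i \in \mathcal{I}}$, $\sum_{j \in \mathcal{J}}$, $\sum_{i' \in \mathcal{I}}$, $\sum_{j' \in \mathcal{J}}$, $\sum_{e \in \mathcal{E}}$, $\sum_{e' \in \mathcal{E}}$, respectively, with similar conventions for products.

Lemma 5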

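Consider the NFG $\mathsf{N}(\boldsymbol{\theta})$ and let $\mathbf{c} \in \mathcal{C}$ be a valid configuration of it. Then

\[ g_i(\mathbf{c}_i) = \sqrt{\theta_{i,\sigma_{\mathbf{c}}(i)}}, \quad i \in \mathcal{I}, \]
\[ g_j(\mathbf{c}_j) = \sqrt{\theta_{\sigma_{\mathbf{c}}^{-1}(j),j}}, \quad j \in \mathcal{J}, \]
\[ g(\mathbf{c}) = \prod_i \theta_{i,\sigma_{\mathbf{c}}(i)} = \prod_j \theta_{\sigma_{\mathbf{c}}^{-1}(j),j}. \]

Proof: The first two expressions follow easily from the definitions of $g_i$ and $g_j$ in Definition 4. The third expression is a consequence of

\[ g(\mathbf{c}) = \Bigl( \prod_i g_i(\mathbf{c}_i) \Bigr) \cdot \Bigl( \prod_j g_j(\mathbf{c}_j) \Bigr) = \Bigl( \prod_i \sqrt{\theta_{i,\sigma_{\mathbf{c}}(i)}} \Bigr) \cdot \Bigl( \prod_j \sqrt{\theta_{\sigma_{\mathbf{c}}^{-1}(j),j}} \Bigr) = \Bigl( \prod_i \sqrt{\theta_{i,\sigma_{\mathbf{c}}(i)}} \Bigr) \cdot \Bigl( \prod_{i'} \sqrt{\theta_{i',\sigma_{\mathbf{c}}(i')}} \Bigr) = \prod_i \theta_{i,\sigma_{\mathbf{c}}(i)}. \]
■

Definition 6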

The (Gibbs) partition function of the NFG $\mathsf{N}(\boldsymbol{\theta})$ is defined to be the sum of the global function over all configurations, or, equivalently, the sum of the global function over all valid configurations, i.e.,

\[ Z_{\mathrm{G}} \triangleq \sum_{\mathbf{a} \in \mathcal{A}} g(\mathbf{a}) = \sum_{\mathbf{c} \in \mathcal{C}} g(\mathbf{c}). \tag{2} \]

In the following, when confusion can arise what NFG a certain Gibbs partition function is referring to, we will use $Z_{\mathrm{G}}\bigl(\mathsf{N}(\boldsymbol{\theta})\bigr)$, etc., instead of $Z_{\mathrm{G}}$. □

Definition 7

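The Gibbs free energy function associated with the NFG $\mathsf{N}(\boldsymbol{\theta})$ is defined to be

\[ F_{\mathrm{G}} : \Pi_{\mathcal{C}} \to \mathbb{R}, \quad \mathbf{p} \mapsto U_{\mathrm{G}}(\mathbf{p}) - H_{\mathrm{G}}(\mathbf{p}), \]

where

\[ U_{\mathrm{G}} : \Pi_{\mathcal{C}} \to \mathbb{R}, \quad \mathbf{p} \mapsto - \sum_{\mathbf{c} \in \mathcal{C}} p_{\mathbf{c}} \cdot \log\bigl( g(\mathbf{c}) \bigr), \]
\[ H_{\mathrm{G}} : \Pi_{\mathcal{C}} \to \mathbb{R}, \quad \mathbf{p} \mapsto - \sum_{\mathbf{c} \in \mathcal{C}} p_{\mathbf{c}} \cdot \log( p_{\mathbf{c}} ). \]

Here, $U_{\mathrm{G}}$ is called the Gibbs average energy function and $H_{\mathrm{G}}$ is called the Gibbs entropy function. In the following, when confusion can arise what NFG a certain Gibbs free energy function is referring to, we will use $F_{\mathrm{G},\mathsf{N}(\boldsymbol{\theta})}$, etc., instead of $F_{\mathrm{G}}$. Similar comments apply to $U_{\mathrm{G}}$ and $H_{\mathrm{G}}$. □

For more details on these functions we refer to, e.g., [13]. For a discussion of these functions in the context of NFGs we refer to, e.g., [16]. Note that $H_{\mathrm{G}}$ is a concave function of $\mathbf{p}$, that $U_{\mathrm{G}}$ is a linear function of $\mathbf{p}$, and that, consequently, $F_{\mathrm{G}}$ is a convex function of $\mathbf{p}$.

Lemma 8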

The permanent of $\boldsymbol{\theta}$ can be expressed in terms of the partition function or in terms of the minimum of the Gibbs free energy function of $\mathsf{N}(\boldsymbol{\theta})$. Namely,

\[ \operatorname{perm}(\boldsymbol{\theta}) = Z_{\mathrm{G}} = \exp\Bigl( - \min_{\mathbf{p}} F_{\mathrm{G}}(\mathbf{p}) \Bigr), \tag{3} \]

where the minimization is over $\mathbf{p} \in \Pi_{\mathcal{C}}$.

Proof: The first equality is a straightforward consequence of Definitions 1 and 4, along with Lemma 5. For the second equality we refer to, e.g., [13], [16]. ■
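Both equalities in (3) can be checked numerically on a tiny example. The Python sketch below is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, indices are 0-based, the matrix is assumed to be strictly positive so that all logarithms are finite, and the candidate minimizer $p_{\mathbf{c}} = g(\mathbf{c}) / Z_{\mathrm{G}}$ is the standard Gibbs distribution rather than something derived in the text above. The enumeration runs over all $2^{n^2}$ configurations, so it is meant for $n \leq 3$ only.

```python
from itertools import permutations, product
from math import isclose, log, prod, sqrt


def local_function(weights, a):
    """g_i (or g_j): sqrt(theta) at the position of the single 1, and 0 otherwise."""
    if sum(a) != 1:
        return 0.0
    return sqrt(weights[a.index(1)])


def global_function(theta, a):
    """g(a) = prod_i g_i(a_i) * prod_j g_j(a_j) for an n x n 0/1 configuration a."""
    n = len(theta)
    rows = prod(local_function(theta[i], a[i]) for i in range(n))
    cols = prod(
        local_function([theta[i][j] for i in range(n)], [a[i][j] for i in range(n)])
        for j in range(n)
    )
    return rows * cols


def gibbs_partition_function(theta):
    """Z_G as in (2): sum of g over all 2^(n^2) configurations."""
    n = len(theta)
    total = 0.0
    for bits in product((0, 1), repeat=n * n):
        a = [list(bits[i * n:(i + 1) * n]) for i in range(n)]
        total += global_function(theta, a)
    return total


def gibbs_free_energy(theta, p):
    """F_G(p) = U_G(p) - H_G(p); p maps permutations sigma to probabilities."""
    n = len(theta)
    value = 0.0
    for sigma, p_c in p.items():
        if p_c > 0:
            g_c = prod(theta[i][sigma[i]] for i in range(n))  # Lemma 5
            value += -p_c * log(g_c) + p_c * log(p_c)
    return value


theta = [[1.0, 2.0], [3.0, 4.0]]
n = len(theta)
z_g = gibbs_partition_function(theta)
perm = sum(prod(theta[i][s[i]] for i in range(n)) for s in permutations(range(n)))
# Candidate minimizer (an assumption, checked numerically here): p_c = g(c) / Z_G.
p_star = {s: prod(theta[i][s[i]] for i in range(n)) / z_g for s in permutations(range(n))}
print(isclose(z_g, perm))                                      # first equality in (3)
print(isclose(gibbs_free_energy(theta, p_star), -log(z_g)))    # second equality in (3)
```

Only the $n!$ permutation matrices contribute non-zero terms to the sum in `gibbs_partition_function`, which is the content of the statement $\mathcal{C} = \mathcal{P}_{n \times n}$ in Definition 4.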

The partition function $Z_{\mathrm{G}}$ and the Gibbs free energy function $F_{\mathrm{G}}$ were specified for temperature $T = 1$ in the above definitions. For a general temperature parameter $T \in \mathbb{R}_{>0}$, these functions have to be replaced by $Z_{\mathrm{G}} \triangleq \sum_{\mathbf{c} \in \mathcal{C}} g(\mathbf{c})^{1/T}$ and by $F_{\mathrm{G}}(\mathbf{p}) \triangleq U_{\mathrm{G}}(\mathbf{p}) - T \cdot H_{\mathrm{G}}(\mathbf{p})$, respectively, and Lemma 8 has to be replaced by $Z_{\mathrm{G}} = \exp\bigl( - \frac{1}{T} \min_{\mathbf{p}} F_{\mathrm{G}}(\mathbf{p}) \bigr)$. Of course, $Z_{\mathrm{G}} = \operatorname{perm}(\boldsymbol{\theta})$ does not hold anymore, unless a suitable $T$-dependence is built into the definition of $\operatorname{perm}(\boldsymbol{\theta})$.

Footnote: Note that "function" in "partition function" refers to the fact that the expression in (2) typically is a function of some parameters like the temperature $T$ (see the discussion above). A better word for "partition function" would possibly be "partition sum" or "state sum," which would more closely follow the German "Zustandssumme" whose first letter is used to denote the partition function.

III. THE BETHE PERMANENT

Although the reformulation of the permanent in Lemma 8 in terms of a convex minimization problem is elegant, from a computational perspective it does not represent much progress. However, it suggests looking for a minimization problem that can be solved efficiently and whose minimum value is related to the desired quantity. This is the approach taken in this section; it is based on the Bethe approximation of the Gibbs free energy function, and the resulting approximation of the permanent of a non-negative square matrix will be called the Bethe permanent. (Note that in this section we give the technical details only; for a general discussion w.r.t. the motivations behind the Bethe approximation we refer to [13], and for a discussion of the Bethe approximation in the context of NFGs we refer to [16].)

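Definition 9

Consider the NFG $\mathsf{N}(\boldsymbol{\theta})$. We let

\[ \boldsymbol{\beta} \triangleq \bigl( (\boldsymbol{\beta}_i)_{i \in \mathcal{I}},\ (\boldsymbol{\beta}_j)_{j \in \mathcal{J}},\ (\boldsymbol{\beta}_e)_{e \in \mathcal{E}} \bigr) \]

be a collection of vectors based on the real vectors

\[ \boldsymbol{\beta}_i \triangleq (\beta_{i,\mathbf{a}_i})_{\mathbf{a}_i \in \mathcal{A}_i}, \quad \boldsymbol{\beta}_j \triangleq (\beta_{j,\mathbf{a}_j})_{\mathbf{a}_j \in \mathcal{A}_j}, \quad \boldsymbol{\beta}_e \triangleq (\beta_{e,a_e})_{a_e \in \mathcal{A}_e}. \]

Moreover, we define the sets

\[ \mathcal{B}_i \triangleq \Pi_{\mathcal{A}_i},\ i \in \mathcal{I}, \quad \mathcal{B}_j \triangleq \Pi_{\mathcal{A}_j},\ j \in \mathcal{J}, \quad \mathcal{B}_e \triangleq \Pi_{\mathcal{A}_e},\ e \in \mathcal{E}, \]

and call $\mathcal{B}_i$, $\mathcal{B}_j$, and $\mathcal{B}_e$ the $i$th local marginal polytope, the $j$th local marginal polytope, and the $e$th local marginal polytope, respectively. (Sometimes $\mathcal{B}_i$ is also called the $i$th belief polytope, etc.)

With this, the local marginal polytope (or belief polytope) $\mathcal{B}$ is defined to be the set

\[ \mathcal{B} = \left\{ \boldsymbol{\beta} \;\middle|\; \begin{array}{l} \boldsymbol{\beta}_i \in \mathcal{B}_i \text{ for all } i \in \mathcal{I} \\ \boldsymbol{\beta}_j \in \mathcal{B}_j \text{ for all } j \in \mathcal{J} \\ \boldsymbol{\beta}_e \in \mathcal{B}_e \text{ for all } e \in \mathcal{E} \\ \sum_{\mathbf{a}'_i \in \mathcal{A}_i :\, a'_{i,j} = a_e} \beta_{i,\mathbf{a}'_i} = \beta_{e,a_e} \text{ for all } e = (i,j) \in \mathcal{E},\ a_e \in \mathcal{A}_e \\ \sum_{\mathbf{a}'_j \in \mathcal{A}_j :\, a'_{i,j} = a_e} \beta_{j,\mathbf{a}'_j} = \beta_{e,a_e} \text{ for all } e = (i,j) \in \mathcal{E},\ a_e \in \mathcal{A}_e \end{array} \right\}, \]

where $\boldsymbol{\beta} \in \mathcal{B}$ is called a pseudo-marginal vector. (The two constraints that were listed last in the definition of $\mathcal{B}$ will be called "edge consistency constraints.") □

Definition 10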

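The Bethe free energy function associated with the NFG $\mathsf{N}(\boldsymbol{\theta})$ is defined to be the function

\[ F_{\mathrm{B}} : \mathcal{B} \to \mathbb{R}, \quad \boldsymbol{\beta} \mapsto U_{\mathrm{B}}(\boldsymbol{\beta}) - H_{\mathrm{B}}(\boldsymbol{\beta}), \]

where

\[ U_{\mathrm{B}} : \mathcal{B} \to \mathbb{R}, \quad \boldsymbol{\beta} \mapsto \sum_i U_{\mathrm{B},i}(\boldsymbol{\beta}_i) + \sum_j U_{\mathrm{B},j}(\boldsymbol{\beta}_j), \]
\[ H_{\mathrm{B}} : \mathcal{B} \to \mathbb{R}, \quad \boldsymbol{\beta} \mapsto \sum_i H_{\mathrm{B},i}(\boldsymbol{\beta}_i) + \sum_j H_{\mathrm{B},j}(\boldsymbol{\beta}_j) - \sum_e H_{\mathrm{B},e}(\boldsymbol{\beta}_e), \]

with

\[ U_{\mathrm{B},i} : \mathcal{B}_i \to \mathbb{R}, \quad \boldsymbol{\beta}_i \mapsto - \sum_{\mathbf{a}_i} \beta_{i,\mathbf{a}_i} \cdot \log\bigl( g_i(\mathbf{a}_i) \bigr), \]
\[ U_{\mathrm{B},j} : \mathcal{B}_j \to \mathbb{R}, \quad \boldsymbol{\beta}_j \mapsto - \sum_{\mathbf{a}_j} \beta_{j,\mathbf{a}_j} \cdot \log\bigl( g_j(\mathbf{a}_j) \bigr), \]
\[ H_{\mathrm{B},i} : \mathcal{B}_i \to \mathbb{R}, \quad \boldsymbol{\beta}_i \mapsto - \sum_{\mathbf{a}_i} \beta_{i,\mathbf{a}_i} \cdot \log( \beta_{i,\mathbf{a}_i} ), \]
\[ H_{\mathrm{B},j} : \mathcal{B}_j \to \mathbb{R}, \quad \boldsymbol{\beta}_j \mapsto - \sum_{\mathbf{a}_j} \beta_{j,\mathbf{a}_j} \cdot \log( \beta_{j,\mathbf{a}_j} ), \]
\[ H_{\mathrm{B},e} : \mathcal{B}_e \to \mathbb{R}, \quad \boldsymbol{\beta}_e \mapsto - \sum_{a_e} \beta_{e,a_e} \cdot \log( \beta_{e,a_e} ). \]

Here, $U_{\mathrm{B}}$ is the Bethe average energy function and $H_{\mathrm{B}}$ is the Bethe entropy function. In the following, when confusion can arise what NFG a certain Bethe free energy function is referring to, we will use $F_{\mathrm{B},\mathsf{N}(\boldsymbol{\theta})}$, etc., instead of $F_{\mathrm{B}}$. Similar comments apply to $U_{\mathrm{B}}$ and $H_{\mathrm{B}}$. □

With this, the Bethe partition function of an NFG is defined such that an equality analogous to the second equality in (3) holds.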
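The Bethe free energy function of Definition 10 can be evaluated numerically at specific pseudo-marginal vectors. The Python sketch below is an illustration, not the paper's algorithm. It assumes one particular way of building a point of $\mathcal{B}$ from a doubly stochastic matrix $\boldsymbol{\gamma}$, namely $\beta_{i,\mathbf{u}_j} = \beta_{j,\mathbf{u}_i} = \beta_{e,1} = \gamma_{i,j}$ and $\beta_{e,0} = 1 - \gamma_{i,j}$ for $e = (i,j)$, which satisfies the edge consistency constraints of Definition 9. The function names are hypothetical, and $\theta_{i,j} > 0$ is assumed wherever $\gamma_{i,j} > 0$.

```python
from math import exp, isclose, log


def xlogx(x):
    """x * log(x) with the convention 0 * log(0) = 0."""
    return x * log(x) if x > 0 else 0.0


def bethe_free_energy(theta, gamma):
    """F_B = U_B - H_B at the pseudo-marginal vector built from gamma (see lead-in)."""
    n = len(theta)
    u_b = 0.0
    h_b = 0.0
    for i in range(n):
        for j in range(n):
            g = gamma[i][j]
            if g > 0:
                # U_{B,i} and U_{B,j} together: 2 * log sqrt(theta_ij) = log(theta_ij)
                u_b -= g * log(theta[i][j])
            # H_{B,i} and H_{B,j} each contribute -g * log(g) ...
            h_b -= 2 * xlogx(g)
            # ... and -H_{B,e} contributes g * log(g) + (1 - g) * log(1 - g)
            h_b += xlogx(g) + xlogx(1 - g)
    return u_b - h_b


theta = [[1.0, 2.0], [3.0, 4.0]]
for gamma in ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [1.0, 0.0]]):
    # Each value is exp(-F_B) at one feasible point, hence at most exp(-min F_B).
    print(exp(-bethe_free_energy(theta, gamma)))
```

Because the minimum of $F_{\mathrm{B}}$ over $\mathcal{B}$ is at most its value at any single point of $\mathcal{B}$, each printed number is a lower bound on $\exp(-\min_{\boldsymbol{\beta}} F_{\mathrm{B}}(\boldsymbol{\beta}))$. Combined with the result by Gurvits quoted in Section I-B, it is therefore also at most $\operatorname{perm}(\boldsymbol{\theta}) = 10$ for this example.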

Definition 11

The Bethe partition function of the NFG $\mathsf{N}(\boldsymbol{\theta})$ is defined to be

\[ Z_{\mathrm{B}} \triangleq \exp\Bigl( - \min_{\boldsymbol{\beta} \in \mathcal{B}} F_{\mathrm{B}}(\boldsymbol{\beta}) \Bigr). \]

In the following, when confusion can arise what NFG a certain Bethe partition function is referring to, we will use $Z_{\mathrm{B}}(\mathsf{N})$, etc., instead of $Z_{\mathrm{B}}$. □

The next definition is the main definition of this paper and was motivated by the work of Chertkov, Kroc, and Vergassola [11] and by the work of Huang and Jebara [12].

Definition 12

Consider the NFG $\mathsf{N}(\boldsymbol{\theta})$. The Bethe permanent of $\boldsymbol{\theta}$, which will be denoted by $\operatorname{perm}_{\mathrm{B}}(\boldsymbol{\theta})$, is defined to be

\[ \operatorname{perm}_{\mathrm{B}}(\boldsymbol{\theta}) \triangleq Z_{\mathrm{B}}\bigl( \mathsf{N}(\boldsymbol{\theta}) \bigr). \]
□

A comment w.r.t. a temperature parameter $T \in \mathbb{R}_{>0}$, similar to the one at the end of Section II, also applies to the definitions of the Bethe partition function and the Bethe free energy function. In the following, however, we will only consider the case $T = 1$. An exception is Section VIII on the fractional Bethe approximation: this approximation can be viewed as introducing multiple temperature parameters, namely one temperature parameter for every term of $H_{\mathrm{B}}$, and therefore includes the single temperature parameter case as a special case.

Footnote (to Definition 10): Here and in the following, we use the short-hand $\sum_{\mathbf{a}_i}$ for $\sum_{\mathbf{a}_i \in \mathcal{A}_i}$, etc.

IV. PROPERTIES OF THE BETHE ENTROPY FUNCTION AND THE BETHE FREE ENERGY FUNCTION

There are relatively few general statements about the shapeof the Bethe entropy function. In this section we show thatBethe entropy function associated with N ( θ ) has many specialproperties. • In general, the Bethe entropy function is not a concavefunction. However, here we show that the Bethe entropyfunction associated with N ( θ ) is, when suitably parame-terized, a concave function.Similarly, the Bethe free energy function is in generalnot a convex function. However, because the Bethe freeenergy function is the difference of the Bethe averageenergy function and the Bethe entropy function, becausethe Bethe average energy function is linear in its argu-ments, and because the Bethe entropy function is concave,the Bethe free energy function associated with N ( θ ) isconvex and does not have non-global local minima. • In general, the Bethe entropy function can take on pos-itive, zero, and negative values. However, here we showthat the Bethe entropy function associated with N ( θ ) isnon-negative. • Very often, the directional derivative of the Bethe entropyfunction away from a vertex of its domain is + ∞ or −∞ . For the Bethe entropy function of N ( θ ) we showthat the directional derivative away from any vertex ofits domain has a (non-negative) ﬁnite value. (As we willsee in Section V, this observation will have importantconsequences for the SPA convergence analysis.) A. Reformulation of the Bethe Entropy Functionand the Bethe Free Energy Function

As mentioned in Section I-C, the success of max-product algorithm / min-sum algorithm based approaches to the bipartite graph maximum weight perfect matching problem in the papers [27]–[30] was heavily based on a theorem by Birkhoff and von Neumann (see Theorem 3). This theorem is equally central to the results of the present paper. Namely, in the next lemma we introduce a parameterization of the belief polytope $\mathcal{B}$ based on $\Gamma_{n \times n}$ that will be used for the rest of the paper.

Lemma 13

Consider the NFG $\mathsf{N}(\theta)$. Its belief polytope $\mathcal{B}$ can be parameterized by $\Gamma_{n \times n}$, the set of doubly stochastic matrices of size $n \times n$. In particular, we define the parameterization such that the matrix $\gamma = (\gamma_{i,j})_{(i,j) \in \mathcal{I} \times \mathcal{J}} \in \Gamma_{n \times n}$ indexes the pseudo-marginal vector $\beta \in \mathcal{B}$ with
\[
  \beta_{i,\mathbf{a}_i} \big|_{\mathbf{a}_i = \mathbf{u}_j} = \beta_{j,\mathbf{a}_j} \big|_{\mathbf{a}_j = \mathbf{u}_i} = \gamma_{i,j},
\]
and
\[
  \beta_{e,a_e} \big|_{a_e = 0} = 1 - \gamma_{i,j}, \qquad \beta_{e,a_e} \big|_{a_e = 1} = \gamma_{i,j},
\]
for every $i \in \mathcal{I}$, $j \in \mathcal{J}$, and $e = (i,j) \in \mathcal{E}$.

Footnote: The fact that convexity / non-convexity of a function depends on its parameterization might explain the non-convexity observations in [12, Section 3.3] w.r.t. the Bethe free energy function.

Proof: It is straightforward to verify that the pseudo-marginal vector $\beta$ which is specified in the lemma statement is indeed in $\mathcal{B}$. Moreover, one can verify that for every pseudo-marginal vector $\beta \in \mathcal{B}$ there is a $\gamma \in \Gamma_{n \times n}$ such that $\gamma$ indexes $\beta$. ∎

In the following, for a given matrix $\gamma = (\gamma_{i,j})_{(i,j) \in \mathcal{I} \times \mathcal{J}}$, the $i$th row of $\gamma$ will be denoted by $\gamma_i = (\gamma_{i,j})_{j \in \mathcal{J}}$ and the $j$th column of $\gamma$ will be denoted by $\gamma_j = (\gamma_{i,j})_{i \in \mathcal{I}}$. The above observations allow us to express the Bethe free energy function and related functions in terms of $\gamma \in \Gamma_{n \times n}$.

Lemma 14

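Consider the NFG $\mathsf{N}(\theta)$. Then
\[
  F_{\mathrm{B}} : \Gamma_{n \times n} \to \mathbb{R}, \quad \gamma \mapsto U_{\mathrm{B}}(\gamma) - H_{\mathrm{B}}(\gamma),
\]
where
\begin{align*}
  U_{\mathrm{B}} &: \Gamma_{n \times n} \to \mathbb{R}, \quad \gamma \mapsto \sum_i U_{\mathrm{B},i}(\gamma_i) + \sum_j U_{\mathrm{B},j}(\gamma_j), \\
  H_{\mathrm{B}} &: \Gamma_{n \times n} \to \mathbb{R}, \quad \gamma \mapsto \sum_i H_{\mathrm{B},i}(\gamma_i) + \sum_j H_{\mathrm{B},j}(\gamma_j) - \sum_{i,j} H_{\mathrm{B},(i,j)}(\gamma_{i,j}),
\end{align*}
with
\begin{align*}
  U_{\mathrm{B},i} &: \Pi_{[n]} \to \mathbb{R}, \quad \gamma_i \mapsto - \frac{1}{2} \sum_j \gamma_{i,j} \cdot \log(\theta_{i,j}), \\
  U_{\mathrm{B},j} &: \Pi_{[n]} \to \mathbb{R}, \quad \gamma_j \mapsto - \frac{1}{2} \sum_i \gamma_{i,j} \cdot \log(\theta_{i,j}), \\
  H_{\mathrm{B},i} &: \Pi_{[n]} \to \mathbb{R}, \quad \gamma_i \mapsto - \sum_j \gamma_{i,j} \cdot \log(\gamma_{i,j}), \\
  H_{\mathrm{B},j} &: \Pi_{[n]} \to \mathbb{R}, \quad \gamma_j \mapsto - \sum_i \gamma_{i,j} \cdot \log(\gamma_{i,j}), \\
  H_{\mathrm{B},(i,j)} &: [0,1] \to \mathbb{R}, \quad \gamma_{i,j} \mapsto - \gamma_{i,j} \log(\gamma_{i,j}) - (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}).
\end{align*}
(The factors $\frac{1}{2}$ in $U_{\mathrm{B},i}$ and $U_{\mathrm{B},j}$ reflect that every $\theta_{i,j}$ is shared between a left-hand side and a right-hand side function node, each contributing $\sqrt{\theta_{i,j}}$; with them, the two sums add up to the expression for $U_{\mathrm{B}}(\gamma)$ in Corollary 15.)

Proof: This follows straightforwardly from Definition 10 and Lemma 13. ∎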

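Corollary 15

It holds that
\[
  \operatorname{perm}_{\mathrm{B}}(\theta) = \exp\Bigl( - \min_{\gamma \in \Gamma_{n \times n}} F_{\mathrm{B}}(\gamma) \Bigr),
\]
where
\begin{align*}
  F_{\mathrm{B}}(\gamma) &= U_{\mathrm{B}}(\gamma) - H_{\mathrm{B}}(\gamma), \\
  U_{\mathrm{B}}(\gamma) &= - \sum_{i,j} \gamma_{i,j} \log(\theta_{i,j}), \\
  H_{\mathrm{B}}(\gamma) &= - \sum_{i,j} \gamma_{i,j} \log(\gamma_{i,j}) + \sum_{i,j} (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}).
\end{align*}

Proof: This follows from Definitions 11 and 12 and from Lemma 14. ∎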
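The expressions in Corollary 15 are easy to evaluate numerically. The following is a minimal sketch, not code from the paper: the helper names `xlogx`, `bethe_free_energy`, and `permanent` are illustrative, and the convention $0 \log 0 = 0$ is assumed. Because $\operatorname{perm}_{\mathrm{B}}(\theta)$ is defined through the minimum of $F_{\mathrm{B}}$, the value $\exp(-F_{\mathrm{B}}(\gamma))$ at any doubly stochastic $\gamma$ is at most $\operatorname{perm}_{\mathrm{B}}(\theta)$.

```python
import math
from itertools import permutations


def xlogx(x):
    """Return x * log(x), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)


def bethe_free_energy(theta, gamma):
    """Evaluate F_B(gamma) = U_B(gamma) - H_B(gamma) for a positive matrix theta."""
    n = len(theta)
    cells = [(i, j) for i in range(n) for j in range(n)]
    u_b = -sum(gamma[i][j] * math.log(theta[i][j]) for i, j in cells)
    h_b = sum(-xlogx(gamma[i][j]) + xlogx(1.0 - gamma[i][j]) for i, j in cells)
    return u_b - h_b


def permanent(theta):
    """Brute-force permanent; only sensible for small n."""
    n = len(theta)
    return sum(
        math.prod(theta[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


# All-ones 3 x 3 matrix, evaluated at the uniform doubly stochastic matrix.
theta = [[1.0] * 3 for _ in range(3)]
gamma = [[1.0 / 3.0] * 3 for _ in range(3)]
print(math.exp(-bethe_free_energy(theta, gamma)), permanent(theta))
```

For the all-ones $3 \times 3$ matrix the uniform $\gamma$ gives $\exp(-F_{\mathrm{B}}(\gamma)) = 64/27 \approx 2.37$, compared with a permanent of $6$.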

If the sign in front of the second half of the expression for $H_{\mathrm{B}}(\gamma)$ in Corollary 15 were a minus sign, then $H_{\mathrm{B}}(\gamma)$ could be expressed as a sum of binary entropy functions, and therefore the concavity of $H_{\mathrm{B}}(\gamma)$ would be immediate. However, the presence of the plus sign means that a more careful look at $H_{\mathrm{B}}(\gamma)$ is required to determine whether it is concave or not.

Assumption 16

For the rest of this section we assume that $n \geq 2$ and that $\theta$ is a positive matrix of size $n \times n$. This simplifies the wording of most results without hurting their generality too much. In practice, two possible ways to deal with the issue of zero entries in $\theta$ are the following.

• One can change the matrix $\theta$ so that zero entries become tiny positive entries.

• One can redefine $\mathsf{N}(\theta)$ by removing the edge $e = (i,j)$, along with redefining the local functions $g_i$ and $g_j$, if $\theta_{i,j} = 0$.

□

B. Concavity of the Bethe Entropy Function and Convexity of the Bethe Free Energy Function

Towards showing that $H_{\mathrm{B}}(\gamma)$ is a concave function of $\gamma$, and subsequently that $F_{\mathrm{B}}(\gamma)$ is a convex function of $\gamma$, we first study two useful functions. Namely, in Definition 17 and Lemma 18 we look at a function called $s$, and in Definition 19 and Theorem 20 we look at a function called $S$. Note that in this section we use the short-hands $\sum_{\ell}$ and $\sum_{\ell \neq \ell^*}$ for $\sum_{\ell \in [n]}$ and $\sum_{\ell \in [n] : \ell \neq \ell^*}$, respectively.

Definition 17

Let $s$ be the function
\[
  s : [0,1] \to \mathbb{R}, \quad \xi \mapsto - \xi \log(\xi) + (1 - \xi) \log(1 - \xi).
\]
Note that in contrast to the binary entropy function, there is a plus sign (not a minus sign) in front of the second term. □

Lemma 18

The function $s$ that is specified in Definition 17 has the following properties.

• As can be seen from Fig. 2 (left), the graph of the function $s$ is s-shaped.

• The first-order derivative of $s$ is
\[
  \frac{\mathrm{d}}{\mathrm{d}\xi} s(\xi) = - 2 - \log\bigl( \xi (1 - \xi) \bigr).
\]

• The second-order derivative of $s$ is
\[
  \frac{\mathrm{d}^2}{\mathrm{d}\xi^2} s(\xi) = - \frac{1}{\xi} + \frac{1}{1 - \xi} = - \frac{1 - 2\xi}{\xi (1 - \xi)}.
\]
Clearly, the function $s(\xi)$ is strictly concave in the interval $0 < \xi < 1/2$ and strictly convex in the interval $1/2 < \xi < 1$.

• The graph of $s$ has a point-symmetry at $(1/2, 0)$.

Proof: The proof of this lemma is based on straightforward calculus and is therefore omitted. ∎
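The properties listed in Lemma 18 can be checked numerically. The sketch below is illustrative only: the function names `s`, `ds`, and `d2s` are chosen here, and the closed-form derivatives are compared against a central finite difference with an assumed step size of $10^{-6}$.

```python
import math


def s(xi):
    """s(xi) = -xi*log(xi) + (1 - xi)*log(1 - xi), with 0*log(0) = 0."""
    first = 0.0 if xi == 0.0 else -xi * math.log(xi)
    second = 0.0 if xi == 1.0 else (1.0 - xi) * math.log(1.0 - xi)
    return first + second


def ds(xi):
    """First derivative of s on the open interval (0, 1)."""
    return -2.0 - math.log(xi * (1.0 - xi))


def d2s(xi):
    """Second derivative of s on the open interval (0, 1)."""
    return -(1.0 - 2.0 * xi) / (xi * (1.0 - xi))


h = 1e-6
for xi in (0.1, 0.3, 0.7, 0.9):
    numeric = (s(xi + h) - s(xi - h)) / (2.0 * h)
    # Closed-form and finite-difference derivatives, plus the symmetry s(xi) + s(1 - xi).
    print(xi, ds(xi), numeric, s(xi) + s(1.0 - xi))
```

The last printed column stays at zero up to rounding, which is the point symmetry about $(1/2, 0)$, and the sign of `d2s` changes at $\xi = 1/2$.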

Definition 19

Let $S$ be the function
\[
  S : \Pi_{[n]} \to \mathbb{R}, \quad \xi \mapsto \sum_{\ell} s(\xi_\ell) = - \sum_{\ell} \xi_\ell \log(\xi_\ell) + \sum_{\ell} (1 - \xi_\ell) \log(1 - \xi_\ell).
\]
□

[Fig. 2. Left: plot of the function $s$, see Definition 17. Right: contour plot of the function $(\xi_1, \xi_2) \mapsto S(\xi_1, \xi_2, 1 - \xi_1 - \xi_2)$, see Definition 19.]

Fig. 2 (right) shows the function $S$ for $n = 3$. More precisely, that plot shows the contour plot of the function $(\xi_1, \xi_2) \mapsto S(\xi_1, \xi_2, 1 - \xi_1 - \xi_2)$.

Clearly, if the domain of the function $S$ were the set $[0,1]^n$, then $S$ would not be concave everywhere because $s$ is not concave everywhere. Therefore, the observation that is made in the following theorem, namely that $S$ is concave, is non-trivial. (Note that because the function $s$ is concave in $[0, 1/2]$, the function $S$ is concave in $\Pi_{[n]} \cap [0, 1/2]^n$. Therefore, as we will see, most of the work in the proof of the following theorem will be devoted to proving the concavity of the function $S$ in $\Pi_{[n]} \setminus [0, 1/2]^n$.)

Theorem 20

The function $S$ from Definition 19 is concave and satisfies $S(\xi) \geq 0$ for all $\xi \in \Pi_{[n]}$. Moreover,

• For $n = 2$, it holds that $S(\xi) = 0$ for all $\xi \in \Pi_{[n]}$.

• For $n \geq 3$, the function $S$ is at almost all points in its domain a strictly concave function. However, there are points in its domain and corresponding directions in which the function $S$ is linear.

Proof: See Appendix A. ∎
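A numerical spot check of Theorem 20 can be useful when experimenting with $S$. The sketch below is not a proof and is not taken from the paper: it samples random pairs of points in $\Pi_{[n]}$ (the sampling scheme and the helper names are assumptions made here) and records the smallest midpoint gap $S\bigl((x+y)/2\bigr) - \bigl(S(x) + S(y)\bigr)/2$, which concavity requires to be non-negative.

```python
import math
import random


def s(xi):
    """s(xi) = -xi*log(xi) + (1 - xi)*log(1 - xi), with 0*log(0) = 0."""
    first = 0.0 if xi == 0.0 else -xi * math.log(xi)
    second = 0.0 if xi == 1.0 else (1.0 - xi) * math.log(1.0 - xi)
    return first + second


def S(xi):
    """S(xi) = sum over the components of s(xi_l)."""
    return sum(s(x) for x in xi)


def random_simplex_point(n, rng):
    """Random probability vector obtained by normalizing exponential samples."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def min_midpoint_gap(n, trials, seed=0):
    """Smallest observed S((x+y)/2) - (S(x)+S(y))/2 over random pairs."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = random_simplex_point(n, rng)
        y = random_simplex_point(n, rng)
        mid = [(a + b) / 2.0 for a, b in zip(x, y)]
        worst = min(worst, S(mid) - 0.5 * (S(x) + S(y)))
    return worst


print(min_midpoint_gap(3, 1000), min_midpoint_gap(2, 1000))
```

For $n = 2$ the gap is zero up to rounding because $S$ vanishes identically there; for $n = 3$ the gap should never be negative beyond rounding error.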

After the original submission of the present paper, an alternative proof of the concavity of the function $S$ has been given by Gurvits, see [15, Section 5.1]. Interestingly, the functions $s$ and $S$ have recently appeared also in another context [41]. (We refer to [41] for details.) In particular, that paper gives a direct proof of $S(\xi) \geq 0$ for all $\xi \in \Pi_{[n]}$; this is in contrast to the proof of that statement in Theorem 20, which was mainly based on the concavity of $S$.

Lemma 21

The Bethe entropy function can be expressed in terms of the function $S$ as follows:
\[
  H_{\mathrm{B}} : \Gamma_{n \times n} \to \mathbb{R}, \quad \gamma \mapsto \frac{1}{2} \sum_i S(\gamma_i) + \frac{1}{2} \sum_j S(\gamma_j).
\]

Proof: This result follows from
\begin{align*}
  H_{\mathrm{B}}(\gamma)
    &\overset{\text{(a)}}{=} - \sum_{i,j} \gamma_{i,j} \log(\gamma_{i,j}) + \sum_{i,j} (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}) \\
    &= \frac{1}{2} \sum_i \Bigl( - \sum_j \gamma_{i,j} \log(\gamma_{i,j}) + \sum_j (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}) \Bigr) \\
    &\quad + \frac{1}{2} \sum_j \Bigl( - \sum_i \gamma_{i,j} \log(\gamma_{i,j}) + \sum_i (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}) \Bigr) \\
    &\overset{\text{(b)}}{=} \frac{1}{2} \sum_i S(\gamma_i) + \frac{1}{2} \sum_j S(\gamma_j),
\end{align*}
where at step (a) we have used Corollary 15 and where at step (b) we have used Definition 19. ∎

Theorem 22

The Bethe entropy function $H_{\mathrm{B}}(\gamma)$ is a concave function of $\gamma \in \Gamma_{n \times n}$. Moreover, for all $\gamma \in \Gamma_{n \times n}$ it holds that $H_{\mathrm{B}}(\gamma) \geq 0$.

Proof: Lemma 21 showed that $H_{\mathrm{B}}(\gamma)$ can be written as a sum of $S$-functions. The concavity of $H_{\mathrm{B}}(\gamma)$ then follows from Theorem 20 and the fact that the sum of concave functions is a concave function. Similarly, the non-negativity of $H_{\mathrm{B}}(\gamma)$ follows from Theorem 20 and the fact that the sum of non-negative functions is a non-negative function. ∎

Corollary 23

The Bethe free energy function $F_{\mathrm{B}}(\gamma)$ is a convex function of $\gamma \in \Gamma_{n \times n}$.

Proof: This follows from $F_{\mathrm{B}}(\gamma) = U_{\mathrm{B}}(\gamma) - H_{\mathrm{B}}(\gamma)$ (see Corollary 15), from the fact that $U_{\mathrm{B}}(\gamma)$ is a linear function of $\gamma$ (see Corollary 15), and from the fact that $H_{\mathrm{B}}(\gamma)$ is a concave function of $\gamma$ (see Theorem 22). ∎

C. Behavior of the Bethe Entropy Function and the Bethe Free Energy Function at a Vertex of their Domain

In this section we study the Bethe entropy function and the Bethe free energy function near a vertex of their domain. Because both functions can be expressed in terms of the function $S$, we first study the behavior of $S$ near a vertex of its domain.

Lemma 24

Let $\xi(\tau) \triangleq \xi + \tau \cdot \hat{\xi}$, where the vector $\xi \in \Pi_{[n]}$ is a vertex of $\Pi_{[n]}$ and where $\hat{\xi} \neq 0$ is such that $\xi(\tau) \in \Pi_{[n]}$ for small non-negative $\tau$. This means that there is an $\ell^* \in [n]$ such that $\xi$ satisfies $\xi_{\ell^*} = 1$ and $\xi_\ell = 0$, $\ell \neq \ell^*$, and such that $\hat{\xi}$ satisfies $\hat{\xi}_{\ell^*} < 0$, $\hat{\xi}_\ell \geq 0$, $\ell \neq \ell^*$, and $\sum_\ell \hat{\xi}_\ell = 0$. Then, for $0 < \tau \ll 1$, we have
\[
  S\bigl( \xi(\tau) \bigr) = \tau \cdot |\hat{\xi}_{\ell^*}| \cdot \Biggl( - \sum_{\ell \neq \ell^*} \frac{|\hat{\xi}_\ell|}{|\hat{\xi}_{\ell^*}|} \log\biggl( \frac{|\hat{\xi}_\ell|}{|\hat{\xi}_{\ell^*}|} \biggr) \Biggr) + O(\tau^2), \tag{4}
\]
i.e., the function $S\bigl( \xi(\tau) \bigr)$ can very well be approximated by a linear function for $0 < \tau \ll 1$. Note that the coefficient of $\tau$ in (4) is non-negative.

Proof: See Appendix B. ∎
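The linear behavior stated in (4) can be observed numerically. The following sketch uses illustrative helper names (`vertex_slope`, `S_along_direction`) and an arbitrarily chosen direction $\hat{\xi} = (-1, 1/2, 1/2)$ at the vertex $\xi = (1, 0, 0)$; for this direction the coefficient of $\tau$ in (4) equals $\log 2$.

```python
import math


def s(xi):
    """s(xi) = -xi*log(xi) + (1 - xi)*log(1 - xi), with 0*log(0) = 0."""
    first = 0.0 if xi == 0.0 else -xi * math.log(xi)
    second = 0.0 if xi == 1.0 else (1.0 - xi) * math.log(1.0 - xi)
    return first + second


def S(xi):
    return sum(s(x) for x in xi)


def vertex_slope(xi_hat, l_star):
    """Coefficient of tau in (4) for the vertex with a one in position l_star."""
    a = abs(xi_hat[l_star])
    ratios = [abs(x) / a for l, x in enumerate(xi_hat) if l != l_star and x != 0.0]
    return -a * sum(r * math.log(r) for r in ratios)


def S_along_direction(xi_hat, l_star, tau):
    """Evaluate S at the vertex plus tau times the direction xi_hat."""
    xi = [tau * x for x in xi_hat]
    xi[l_star] += 1.0
    return S(xi)


xi_hat = [-1.0, 0.5, 0.5]
for tau in (1e-2, 1e-3, 1e-4):
    # The ratio S(xi(tau)) / tau approaches the predicted slope as tau shrinks.
    print(tau, S_along_direction(xi_hat, 0, tau) / tau, vertex_slope(xi_hat, 0))
```

No $\tau \log \tau$ term survives: the ratio $S(\xi(\tau))/\tau$ settles at a finite value instead of diverging logarithmically.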

A word of caution: the behavior of the function $S$ is somewhat special around a vertex $\xi$ of $\Pi_{[n]}$: namely, in general there is no gradient vector $G$ such that
\[
  S(\xi + \tau \cdot \hat{\xi}) = S(\xi) + \tau \cdot \sum_\ell G_\ell \hat{\xi}_\ell + O(\tau^2) = \tau \cdot \sum_\ell G_\ell \hat{\xi}_\ell + O(\tau^2)
\]
for $0 < \tau \ll 1$ and for all possible direction vectors $\hat{\xi}$.

Lemma 24 has the following consequences for the behavior of the Bethe entropy function at a vertex of its domain.

Lemma 25

Let $\gamma(\tau) \triangleq \gamma + \tau \cdot \hat{\gamma}$, where $\gamma$ is a vertex of $\Gamma_{n \times n}$ and where $\hat{\gamma} \neq 0$ is such that $\gamma(\tau) \in \Gamma_{n \times n}$ for small non-negative $\tau$. This means that $\gamma$ corresponds to the permutation $\sigma_\gamma$. (In the following statement we will use the short-hands $\sigma \triangleq \sigma_\gamma$ and $\bar{\sigma} \triangleq \sigma_\gamma^{-1}$.) Then, for $0 < \tau \ll 1$, we have
\begin{align*}
  H_{\mathrm{B}}\bigl( \gamma(\tau) \bigr)
    &= \tau \sum_i |\hat{\gamma}_{i,\sigma(i)}| \cdot \Biggl( - \sum_{j \neq \sigma(i)} \frac{|\hat{\gamma}_{i,j}|}{|\hat{\gamma}_{i,\sigma(i)}|} \log\biggl( \frac{|\hat{\gamma}_{i,j}|}{|\hat{\gamma}_{i,\sigma(i)}|} \biggr) \Biggr) + O(\tau^2) \\
    &= \tau \sum_j |\hat{\gamma}_{\bar{\sigma}(j),j}| \cdot \Biggl( - \sum_{i \neq \bar{\sigma}(j)} \frac{|\hat{\gamma}_{i,j}|}{|\hat{\gamma}_{\bar{\sigma}(j),j}|} \log\biggl( \frac{|\hat{\gamma}_{i,j}|}{|\hat{\gamma}_{\bar{\sigma}(j),j}|} \biggr) \Biggr) + O(\tau^2),
\end{align*}
i.e., the function $H_{\mathrm{B}}\bigl( \gamma(\tau) \bigr)$ can very well be approximated by a linear function for $0 < \tau \ll 1$. Note that the coefficient of $\tau$ is non-negative.

Proof: See Appendix C. ∎

Assume that $\hat{\gamma}$ in Lemma 25 is chosen such that $\sum_i |\hat{\gamma}_{i,\sigma(i)}| = 1$. (If this is not the case, then $\hat{\gamma}$ can be rescaled by a positive real number such that this condition is satisfied.) The coefficient of $\tau$ in the first display equation of Lemma 25 can be given the following meaning. It is the entropy rate of the time-invariant Markov chain corresponding to the (backtrackless) random walk on the NFG $\mathsf{N}(\theta)$ (see Fig. 1) with the following properties:

• The probability of being at vertex $i \in \mathcal{I}$ is $|\hat{\gamma}_{i,\sigma(i)}|$.

• The probability of going to vertex $j \in \mathcal{J} \setminus \{ \sigma(i) \}$, conditioned on being at vertex $i \in \mathcal{I}$, is $|\hat{\gamma}_{i,j}| / |\hat{\gamma}_{i,\sigma(i)}|$. The probability of going to vertex $\sigma(i) \in \mathcal{J}$, conditioned on being at vertex $i \in \mathcal{I}$, is $0$.

• The probability of being at vertex $j \in \mathcal{J}$ is $|\hat{\gamma}_{\bar{\sigma}(j),j}|$.

• The probability of going to vertex $\bar{\sigma}(j) \in \mathcal{I}$, conditioned on being at vertex $j \in \mathcal{J}$, is $1$. The probability of going to vertex $i' \in \mathcal{I} \setminus \{ \bar{\sigma}(j) \}$, conditioned on being at vertex $j \in \mathcal{J}$, is $0$.

Footnote: For a discussion of the entropy rate of a time-invariant Markov chain, see, e.g., [42, Section 4.2].

The above two half-steps of the random walk can be combined into one step:

• The probability of being at vertex $i \in \mathcal{I}$ is $|\hat{\gamma}_{i,\sigma(i)}|$.

• For $i, i' \in \mathcal{I}$ with $i \neq i'$, the probability of going to vertex $\sigma(i')$ and then to vertex $i'$, conditioned on being at vertex $i$, is $|\hat{\gamma}_{i,\sigma(i')}| / |\hat{\gamma}_{i,\sigma(i)}|$.

An analogous interpretation can be given to the coefficient of $\tau$ in the second display equation of Lemma 25. Observe that the condition $\sum_i |\hat{\gamma}_{i,\sigma(i)}| = 1$ is equivalent to the condition $\sum_j |\hat{\gamma}_{\bar{\sigma}(j),j}| = 1$.

Note that similar random walks appeared in the analysis of the Bethe entropy function for so-called cycle codes (cf. [43]) and in the analysis of linear programming decoding of low-density parity-check codes (cf. [44], which gives a random walk interpretation of a result by Arora, Daskalakis, Steurer [45] and its extensions by Halabi and Even [46]). Actually, given the fact that the symmetric difference of two perfect matchings corresponds to a union of cycles in $\mathsf{N}(\theta)$, the similarity of the random walks here and of the random walks in the above-mentioned context of cycle codes is not totally surprising.

We come now to the main result of this subsection. Although this result is interesting in its own right, it will be especially important for the convergence analysis of the SPA in Section V.

Theorem 26

Let $\gamma(\tau) \triangleq \gamma + \tau \cdot \hat{\gamma}$, where $\gamma$ is a vertex of $\Gamma_{n \times n}$ and where $\hat{\gamma} \neq 0$ is such that $\gamma(\tau) \in \Gamma_{n \times n}$ for small non-negative $\tau$. This means that $\gamma$ corresponds to the permutation $\sigma_\gamma$. (In the following statement we will use the short-hands $\sigma \triangleq \sigma_\gamma$ and $\bar{\sigma} \triangleq \sigma_\gamma^{-1}$.) We also assume that $\hat{\gamma}$ is normalized as follows:
\[
  \sum_i |\hat{\gamma}_{i,\sigma(i)}| = \sum_j |\hat{\gamma}_{\bar{\sigma}(j),j}| = 1. \tag{5}
\]
Then, for $0 < \tau \ll 1$, we have
\[
  F_{\mathrm{B}}\bigl( \gamma(\tau) \bigr) \geq - \sum_i \log(\theta_{i,\sigma(i)}) - \tau \cdot \log(\rho) + O(\tau^2), \tag{6}
\]
where $\rho$ is the maximal (real) eigenvalue of the $n \times n$ matrix $A$ with entries
\[
  A_{i,i'} \triangleq
  \begin{cases}
    \dfrac{\theta_{i,\sigma(i')}}{\theta_{i,\sigma(i)}} & \text{(if $i \neq i'$)} \\[1ex]
    0 & \text{(otherwise)}.
  \end{cases}
\]
Note that equality holds in (6) for the matrix $\hat{\gamma}$ with entries
\[
  \hat{\gamma}_{i,\sigma(i')} \triangleq
  \begin{cases}
    + \kappa \cdot \dfrac{u^{\mathrm{L}}_i \cdot A_{i,i'} \cdot u^{\mathrm{R}}_{i'}}{\rho} & \text{(if $i \neq i'$)} \\[1ex]
    - \kappa \cdot u^{\mathrm{L}}_i \cdot u^{\mathrm{R}}_i & \text{(otherwise)},
  \end{cases}
\]
where $u^{\mathrm{L}}$ and $u^{\mathrm{R}}$ are, respectively, the left and right eigenvectors of $A$ with eigenvalue $\rho$, and where $\kappa$ is a suitable normalization constant such that (5) is satisfied.

Proof: See Appendix D. ∎
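To make the role of $\rho$ concrete, the matrix $A$ of Theorem 26 can be assembled for a given vertex and its maximal eigenvalue estimated by power iteration. This is a minimal sketch under stated assumptions: the helper names are hypothetical, permutations are given as 0-based lists `sigma`, and the iteration is run on $A + I$ (whose maximal eigenvalue is $\rho + 1$) because $A$ has a zero diagonal and can be periodic, e.g. for $n = 2$.

```python
def build_A(theta, sigma):
    """A[i][k] = theta[i][sigma[k]] / theta[i][sigma[i]] for i != k, and 0 on the diagonal."""
    n = len(theta)
    return [
        [0.0 if i == k else theta[i][sigma[k]] / theta[i][sigma[i]] for k in range(n)]
        for i in range(n)
    ]


def perron_root(A, iterations=2000):
    """Estimate the maximal eigenvalue of a non-negative matrix A.

    Power iteration is applied to A + I with max-norm normalization;
    the estimate for A is the resulting value minus one.
    """
    n = len(A)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iterations):
        w = [v[i] + sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam - 1.0


theta = [[4.0, 1.0], [1.0, 4.0]]
print(perron_root(build_A(theta, [0, 1])))  # vertex: identity permutation
print(perron_root(build_A(theta, [1, 0])))  # vertex: swap permutation
```

For this $\theta$ the identity vertex gives $\rho = 1/4$ and the swap vertex gives $\rho = 4$, so by (6) the first-order change of $F_{\mathrm{B}}$ away from the identity vertex is positive in every direction, whereas at the swap vertex a descent direction exists.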

Corollary 27

Consider a vertex $\gamma$ of $\Gamma_{n \times n}$ and define $\rho$ for $\gamma$ as in Theorem 26.

• If $\rho < 1$ then $F_{\mathrm{B}}$ has its unique minimum at $\gamma$.

• If $\rho > 1$ then $F_{\mathrm{B}}$ is not minimal at $\gamma$.

Proof: Consider the setup of Theorem 26. From that theorem we know that
\[
  F_{\mathrm{B}}\bigl( \gamma(\tau) \bigr) \geq - \sum_i \log(\theta_{i,\sigma(i)}) - \tau \cdot \log(\rho) + O(\tau^2),
\]
with equality for the direction matrix $\hat{\gamma}$ that was specified there. Moreover, from Corollary 23 we know that $F_{\mathrm{B}}$ is convex over $\Gamma_{n \times n}$. Therefore, if $\log(\rho) < 0$ (i.e., $\rho < 1$) then $F_{\mathrm{B}}$ has a unique minimum at $\gamma$. On the other hand, if $\log(\rho) > 0$ (i.e., $\rho > 1$) then $F_{\mathrm{B}}$ cannot be minimal at $\gamma$. Note that for $\log(\rho) = 0$ (i.e., $\rho = 1$), the minimality / non-minimality of $F_{\mathrm{B}}$ at $\gamma$ is determined by the $O(\tau^2)$ term. ∎

Typically, the Bethe entropy function and the Bethe free energy function have a positive or negative infinite directional derivative away from a vertex of their domain because of the appearance of terms like $c \cdot \tau \cdot \log(\tau)$. However, because for the function $S$ all these $c \cdot \tau \cdot \log(\tau)$ terms cancel in the vicinity of a vertex of its domain (see the proof of Theorem 20, in particular Eq. (19) in Appendix A-B), the directional derivatives of the Bethe entropy function and the Bethe free energy function are finite away from a vertex of their domain.

Let us conclude this section by pointing out that the observations that were made in this subsection give an alternative viewpoint of some of the results that were presented in [19, Section 3].

V. SUM-PRODUCT-ALGORITHM-BASED SEARCH OF THE MINIMUM OF THE BETHE FREE ENERGY FUNCTION

Assumption 28

In this section we make the following two assumptions, both with the goal of simplifying the wording of most results without hurting their generality too much.

• We assume that $n \geq 2$ and that $\theta$ is a positive matrix of size $n \times n$.

• We assume that the minimum of the Bethe free energy function $F_{\mathrm{B}}$ is either in the interior of $\Gamma_{n \times n}$ or at a vertex of $\Gamma_{n \times n}$, but not at a non-vertex boundary point of $\Gamma_{n \times n}$. A possibility to guarantee this with probability $1$ is to apply tiny random perturbations to the entries of $\theta$.

□

Footnote: The purpose of these assumptions is, in particular, to avoid dealing with matrices $\theta$ which have the following property. Namely, consider the subgraph induced by the edge subset $\{ (i,j) \in \mathcal{E} \mid \theta_{i,j} > 0 \}$. Assume that one of the connected components of this subgraph is a cycle (necessarily of even length), and consider the partition of the edge set of this cycle into two sets $\mathcal{E}'$ and $\mathcal{E}''$ such that the edges of this cycle are alternatingly placed into $\mathcal{E}'$ and $\mathcal{E}''$, respectively. If $\prod_{(i,j) \in \mathcal{E}'} \theta_{i,j} = \prod_{(i,j) \in \mathcal{E}''} \theta_{i,j}$ holds, then the SPA exhibits a periodic behavior unless the initial messages correspond to SPA fixed point messages. A matrix having this property is, e.g., a $2 \times 2$ matrix $\theta$ with positive entries that satisfy $\theta_{1,1} \cdot \theta_{2,2} = \theta_{1,2} \cdot \theta_{2,1}$. Here, the relevant cycle $(1,1) - (1,2) - (2,2) - (2,1) - (1,1)$ has length four.

In Definition 12 we have defined the Bethe permanent of a square matrix $\theta$ via the minimum of the Bethe free energy function of the NFG $\mathsf{N}(\theta)$. In Corollary 23 we have seen that the Bethe free energy function is a convex function, i.e., it behaves very favorably. This means that we could use any generic optimization algorithm (see, e.g., [36], [47]) to find the minimum of the Bethe free energy function, and with that the Bethe permanent of $\theta$. However, given the special structure of the optimization problem, there is the hope that there are more efficient approaches.

A natural candidate for searching this minimum is the SPA [38]–[40]. The reason for this is that a theorem by Yedidia, Freeman, and Weiss [13] says that fixed points of the SPA correspond to stationary points of the Bethe free energy function. Given the convexity of the Bethe free energy function, the following two questions must therefore be answered:

• If the minimum of $F_{\mathrm{B}}$ is in the interior of $\Gamma_{n \times n}$, does the SPA always converge to a fixed point?

• If the minimum of $F_{\mathrm{B}}$ is at a vertex of $\Gamma_{n \times n}$, does the SPA find that vertex?

In this section we answer both questions affirmatively, independently of the matrix $\theta$, and (nearly) independently of the chosen initial messages.

The rest of this section is structured as follows. First we discuss the details of the SPA message update rules in Section V-A. Afterwards, we state the SPA convergence result in Section V-B.

A. Sum-Product Algorithm Message Update Rules

In this subsection we derive the SPA message update rules for the NFG $\mathsf{N}(\theta)$ in Fig. 1. Here we only give the technical details; for a general discussion w.r.t. the motivations behind the SPA we refer to [38]–[40]. Note that analogous SPA message update rules were already stated in [12], [20]. (In contrast to [12], we use an undampened version of the SPA.)

On a high level, the SPA works as follows. With every edge in Fig. 1 we associate a right-going message and a left-going message. Every iteration of the SPA then consists of two half-iterations: in the first half-iteration the right-going messages are updated based on the left-going messages, and in the second half-iteration the left-going messages are updated based on the right-going messages. Finally, once some suitable convergence criterion is met or a fixed number of iterations has been reached, the pseudo-marginal vector (belief vector) is computed based on the messages at the last iteration.

Mathematically, we define for every $t \geq 0$ and every edge $(i,j) \in \mathcal{I} \times \mathcal{J}$ a left-going message $\overleftarrow{\mu}^{(t)}_{i,j} : \mathcal{A}_{i,j} \to \mathbb{R}$, and for every $t \geq 1$ and every edge $(i,j) \in \mathcal{I} \times \mathcal{J}$ a right-going message $\overrightarrow{\mu}^{(t)}_{i,j} : \mathcal{A}_{i,j} \to \mathbb{R}$. For every left-going and for every right-going message it turns out to be sufficient to keep track of the likelihood ratios
\[
  \overrightarrow{\Lambda}^{(t)}_{i,j} \triangleq \frac{\overrightarrow{\mu}^{(t)}_{i,j}(0)}{\overrightarrow{\mu}^{(t)}_{i,j}(1)}, \qquad
  \overleftarrow{\Lambda}^{(t)}_{i,j} \triangleq \frac{\overleftarrow{\mu}^{(t)}_{i,j}(0)}{\overleftarrow{\mu}^{(t)}_{i,j}(1)},
\]
respectively. Actually, for the NFG under consideration it is more convenient to deal with the inverses of these quantities, and so we define the inverse likelihood ratios as follows:
\[
  \overrightarrow{V}^{(t)}_{i,j} \triangleq \Bigl( \overrightarrow{\Lambda}^{(t)}_{i,j} \Bigr)^{-1}, \qquad
  \overleftarrow{V}^{(t)}_{i,j} \triangleq \Bigl( \overleftarrow{\Lambda}^{(t)}_{i,j} \Bigr)^{-1}.
\]

Footnote (to the statement that fixed points of the SPA correspond to stationary points of the Bethe free energy function): Strictly speaking, for NFGs with hard constraints, i.e., NFGs that contain local functions that can assume the value zero for certain points in their domain (which is the case for $\mathsf{N}(\theta)$), this statement has only been proven for interior stationary points of the Bethe free energy function (see [13, Theorem 2]). For SPA fixed points with some beliefs equal to zero it is only conjectured that they correspond to edge-stationary points of the Bethe free energy function (cf. discussion in [13, Section VI.D]).

Lemma 29

Consider the NFG $\mathsf{N}(\theta)$. The inverse likelihood ratio update rules for the left-hand side and right-hand side function nodes of $\mathsf{N}(\theta)$ are given by, respectively,
\begin{align*}
  \overrightarrow{V}^{(t)}_{i,j} &= \frac{\sqrt{\theta_{i,j}}}{\sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}^{(t-1)}_{i,j'}}, \quad t \geq 1, \ (i,j) \in \mathcal{I} \times \mathcal{J}, \\
  \overleftarrow{V}^{(t)}_{i,j} &= \frac{\sqrt{\theta_{i,j}}}{\sum_{i' \neq i} \sqrt{\theta_{i',j}} \cdot \overrightarrow{V}^{(t)}_{i',j}}, \quad t \geq 1, \ (i,j) \in \mathcal{I} \times \mathcal{J}.
\end{align*}
The beliefs at the left-hand side and right-hand side function nodes of $\mathsf{N}(\theta)$ are given by, respectively,
\begin{align*}
  \beta^{(t)}_{i,\mathbf{a}_i} \Big|_{\mathbf{a}_i = \mathbf{u}_j} &\propto \sqrt{\theta_{i,j}} \cdot \overleftarrow{V}^{(t)}_{i,j}, \quad t \geq 1, \ (i,j) \in \mathcal{I} \times \mathcal{J}, \\
  \beta^{(t)}_{j,\mathbf{a}_j} \Big|_{\mathbf{a}_j = \mathbf{u}_i} &\propto \sqrt{\theta_{i,j}} \cdot \overrightarrow{V}^{(t)}_{i,j}, \quad t \geq 1, \ (i,j) \in \mathcal{I} \times \mathcal{J}.
\end{align*}
Here the proportionality constants are defined such that for every function node the beliefs sum to $1$. At a fixed point of the SPA, the beliefs satisfy the edge consistency constraints, i.e., for every $e = (i,j) \in \mathcal{E}$ and every $a_e \in \mathcal{A}_e$, it holds that
\[
  \sum_{\mathbf{a}'_i \in \mathcal{A}_i : \, a'_{i,j} = a_e} \beta^{(t)}_{i,\mathbf{a}'_i} = \sum_{\mathbf{a}'_j \in \mathcal{A}_j : \, a'_{i,j} = a_e} \beta^{(t)}_{j,\mathbf{a}'_j}.
\]

Proof: See Appendix E. ∎
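The update rules of Lemma 29 translate directly into a few lines of code. The sketch below is an illustration rather than the paper's implementation: matrices are plain nested lists with 0-based indices, the function names are chosen here, and `V_left[i][j]` and `V_right[i][j]` stand for the left-going and right-going inverse likelihood ratios on edge $(i,j)$.

```python
import math


def spa_iteration(theta, V_left):
    """One full SPA iteration: right-going update followed by left-going update."""
    n = len(theta)
    r = [[math.sqrt(theta[i][j]) for j in range(n)] for i in range(n)]
    # First half-iteration: right-going messages from the previous left-going ones.
    V_right = [
        [r[i][j] / sum(r[i][k] * V_left[i][k] for k in range(n) if k != j)
         for j in range(n)]
        for i in range(n)
    ]
    # Second half-iteration: left-going messages from the new right-going ones.
    V_left_new = [
        [r[i][j] / sum(r[k][j] * V_right[k][j] for k in range(n) if k != i)
         for j in range(n)]
        for i in range(n)
    ]
    return V_right, V_left_new


def beliefs(theta, V_left, V_right):
    """Return (row_beliefs, col_beliefs), each an n x n list.

    row_beliefs[i][j] is the belief at left node i for a_i = u_j (rows sum to 1);
    col_beliefs[i][j] is the belief at right node j for a_j = u_i (columns sum to 1).
    """
    n = len(theta)
    r = [[math.sqrt(theta[i][j]) for j in range(n)] for i in range(n)]
    row = [[r[i][j] * V_left[i][j] for j in range(n)] for i in range(n)]
    row = [[x / sum(row[i]) for x in row[i]] for i in range(n)]
    col = [[r[i][j] * V_right[i][j] for j in range(n)] for i in range(n)]
    col_sums = [sum(col[i][j] for i in range(n)) for j in range(n)]
    col = [[col[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return row, col


theta = [[1.0] * 3 for _ in range(3)]
V_left = [[1.0] * 3 for _ in range(3)]
V_right, V_left = spa_iteration(theta, V_left)
print(beliefs(theta, V_left, V_right))
```

For the all-ones $3 \times 3$ matrix with all initial left-going ratios equal to one, a single iteration already reaches a fixed point, and both belief matrices equal the uniform doubly stochastic matrix, in line with the edge consistency constraints.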

Let us remark on the side that the above update equations can be reformulated such that we only multiply by factors like $\theta_{i,j}$ instead of by factors like $\sqrt{\theta_{i,j}}$. We leave the details to the reader.

Remark 30

The SPA messages for the NFG $\mathsf{N}(\theta)$ exhibit the following property, a property that we will henceforth call "message gauge invariance." Namely, consider the messages
\[
  \bigl\{ \overleftarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t} \quad \text{and} \quad \bigl\{ \overrightarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t}
\]
that are connected by the update equations in Lemma 29. It is then easy to show that for any $C \in \mathbb{R}_{>0}$ the messages
\[
  \bigl\{ C \cdot \overleftarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t} \quad \text{and} \quad \bigl\{ C^{-1} \cdot \overrightarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t}
\]
also satisfy the update equations in Lemma 29. (Indeed, by Lemma 29 every right-going ratio is inversely proportional to a sum of left-going ratios, and vice versa.) Moreover, the beliefs $\bigl\{ \beta^{(t)}_{i,\mathbf{a}_i} \bigr\}_{i,\mathbf{a}_i,t}$ and $\bigl\{ \beta^{(t)}_{j,\mathbf{a}_j} \bigr\}_{j,\mathbf{a}_j,t}$ are left unchanged by this rescaling of the inverse likelihood ratios. This is because the normalization that appears in the definition of $\bigl\{ \beta^{(t)}_{i,\mathbf{a}_i} \bigr\}_{i,\mathbf{a}_i,t}$ and $\bigl\{ \beta^{(t)}_{j,\mathbf{a}_j} \bigr\}_{j,\mathbf{a}_j,t}$ removes the influence of this message rescaling. □

Strictly speaking, the Bethe free energy function can only be evaluated at fixed points of the SPA. However, very often it is desirable to track the progress towards the minimum Bethe free energy function value. This can be done via the so-called pseudo-dual function of the Bethe free energy function [48], [49]. This function has the following two properties: it can be evaluated at any point during the SPA computations, and at a fixed point of the SPA its value equals the value of the Bethe free energy function. However, in general it is not a non-increasing or a non-decreasing function of the iteration number.

Lemma 31

Consider the NFG $\mathsf{N}(\theta)$. For any set of left-going messages $\bigl\{ \overleftarrow{V}_{i,j} \bigr\}_{i,j}$ and any set of right-going messages $\bigl\{ \overrightarrow{V}_{i,j} \bigr\}_{i,j}$, the pseudo-dual function of the Bethe free energy function is
\[
  F\Bigl( \bigl\{ \overleftarrow{V}_{i,j} \bigr\}, \bigl\{ \overrightarrow{V}_{i,j} \bigr\} \Bigr)
  = - \sum_i \log\Biggl( \sum_j \sqrt{\theta_{i,j}} \cdot \overleftarrow{V}_{i,j} \Biggr)
    - \sum_j \log\Biggl( \sum_i \sqrt{\theta_{i,j}} \cdot \overrightarrow{V}_{i,j} \Biggr)
    + \sum_{i,j} \log\Bigl( 1 + \overleftarrow{V}_{i,j} \cdot \overrightarrow{V}_{i,j} \Bigr).
\]
(The edge term is written here as $\log(1 + \overleftarrow{V}_{i,j} \overrightarrow{V}_{i,j})$, i.e., as the logarithm of the sum over $a_e \in \{0,1\}$ of the product of the two messages on edge $(i,j)$ when each message is normalized to take the value $1$ at $a_e = 0$.)

Proof: See Appendix F. ∎
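The pseudo-dual function can be evaluated from any pair of message sets, which makes it convenient for monitoring. The sketch below is a hedged illustration: the function name `pseudo_dual` is chosen here, and the edge term is taken to be $\log(1 + \overleftarrow{V}_{i,j} \overrightarrow{V}_{i,j})$, which is the form for which the fixed-point value on the all-ones $3 \times 3$ matrix matches $\min F_{\mathrm{B}} = -\log(64/27)$ attained at the uniform doubly stochastic matrix.

```python
import math


def pseudo_dual(theta, V_left, V_right):
    """Pseudo-dual function of the Bethe free energy for inverse likelihood ratios."""
    n = len(theta)
    r = [[math.sqrt(theta[i][j]) for j in range(n)] for i in range(n)]
    left_nodes = sum(
        math.log(sum(r[i][j] * V_left[i][j] for j in range(n))) for i in range(n)
    )
    right_nodes = sum(
        math.log(sum(r[i][j] * V_right[i][j] for i in range(n))) for j in range(n)
    )
    edges = sum(
        math.log1p(V_left[i][j] * V_right[i][j]) for i in range(n) for j in range(n)
    )
    return -left_nodes - right_nodes + edges


# Fixed-point messages for the all-ones 3 x 3 matrix (left-going 1, right-going 1/2).
theta = [[1.0] * 3 for _ in range(3)]
V_left = [[1.0] * 3 for _ in range(3)]
V_right = [[0.5] * 3 for _ in range(3)]
print(math.exp(-pseudo_dual(theta, V_left, V_right)))

# Message gauge invariance: rescale by C and 1/C and evaluate again.
C = 3.0
V_left_scaled = [[C * v for v in row] for row in V_left]
V_right_scaled = [[v / C for v in row] for row in V_right]
print(math.exp(-pseudo_dual(theta, V_left_scaled, V_right_scaled)))
```

Both evaluations give $64/27 \approx 2.37$, illustrating that the value is unaffected by the rescaling described in Remark 30.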

In particular, if desired, we can evaluate $F$ after every half-iteration of the SPA, i.e., we can compute $F\bigl( \bigl\{ \overleftarrow{V}^{(t-1)}_{i,j} \bigr\}, \bigl\{ \overrightarrow{V}^{(t)}_{i,j} \bigr\} \bigr)$ and $F\bigl( \bigl\{ \overleftarrow{V}^{(t)}_{i,j} \bigr\}, \bigl\{ \overrightarrow{V}^{(t)}_{i,j} \bigr\} \bigr)$ for every $t \geq 1$.

B. Convergence of the Sum-Product Algorithm

Note that there are rather few general results concerning the behavior of message-passing type algorithms for NFGs with cycles. For certain classes of graphical models and message-passing type algorithms, early results showed that, under the assumption that the algorithm converges, the obtained estimates are correct (see, e.g., the results in [50], [51]). Later, conditions for convergence were established for a variety of graphical models and message-passing type algorithms (see, e.g., [52]–[55] and references therein). However, these results do not seem to be applicable to the NFG under consideration in this paper.

The SPA convergence proof that is the most relevant for the present paper is the one in the paper by Bayati and Nair [23] (see also the comments that we made about this paper in Section I-C). However, the fact that the graphical model in [23] counts matchings (and not only perfect matchings like here) implies a different behavior of the Bethe free energy function near the boundary of its domain, and so no separate analysis of interior and boundary minima of the Bethe free energy is required in the convergence proof in [23]. The SPA convergence analysis for a slightly generalized weighted matching setup was recently presented by Williams and Lau [26].

Note that, interestingly enough, establishing convergence for the SPA on $\mathsf{N}(\theta)$ is independent of the choice of $\theta$, which is in contrast to, say, Gaussian graphical models where the convergence behavior not only depends on the connectivity of the underlying graph but also on the values of the non-zero entries of the information matrix describing the Gaussian graphical model. (Of course, the convergence speed of the SPA on $\mathsf{N}(\theta)$ does depend on the choice of $\theta$.)

Theorem 32

Consider the SPA for the NFG $\mathsf{N}(\theta)$, for which the message update rules were established in Lemma 29. For any initial set of inverse likelihood ratios $\bigl\{ \overleftarrow{V}^{(0)}_{i,j} \bigr\}_{i,j}$ that satisfies $0 < \overleftarrow{V}^{(0)}_{i,j} < \infty$, $(i,j) \in \mathcal{I} \times \mathcal{J}$, the pseudo-marginals computed by the SPA converge to the pseudo-marginals that minimize the Bethe free energy function of $\mathsf{N}(\theta)$. More precisely, we can make the following statements. (We remind the reader of the assumptions that were made in Assumption 28.)

• If the minimum of $F_{\mathrm{B}}$ is in the interior of $\Gamma_{n \times n}$, then the inverse likelihood ratios $\bigl\{ \overleftarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t}$ and $\bigl\{ \overrightarrow{V}^{(t)}_{i,j} \bigr\}_{i,j,t}$ stay bounded as $t \to \infty$ and converge (modulo the message gauge invariance mentioned in Remark 30) to the fixed point inverse likelihood ratios corresponding to the minimum of $F_{\mathrm{B}}$.

• If the minimum of $F_{\mathrm{B}}$ is at the vertex $\gamma$ of $\Gamma_{n \times n}$, then the inverse likelihood ratios satisfy
\begin{align*}
  \overleftarrow{V}^{(t)}_{i,j} \Big|_{j = \sigma_\gamma(i)} &\xrightarrow{t \to \infty} \infty, &
  \overrightarrow{V}^{(t)}_{i,j} \Big|_{j = \sigma_\gamma(i)} &\xrightarrow{t \to \infty} \infty, \\
  \overleftarrow{V}^{(t)}_{i,j} \Big|_{j \neq \sigma_\gamma(i)} &\xrightarrow{t \to \infty} 0, &
  \overrightarrow{V}^{(t)}_{i,j} \Big|_{j \neq \sigma_\gamma(i)} &\xrightarrow{t \to \infty} 0.
\end{align*}

Finally,
\[
  \Bigl| \exp\Bigl( - F\Bigl( \bigl\{ \overleftarrow{V}^{(t)}_{i,j} \bigr\}, \bigl\{ \overrightarrow{V}^{(t)}_{i,j} \bigr\} \Bigr) \Bigr) - \operatorname{perm}_{\mathrm{B}}(\theta) \Bigr| \leq C \cdot e^{- \nu \cdot t}
\]
for some constants $C, \nu \in \mathbb{R}_{>0}$ that depend on the matrix $\theta$ and the initial messages.

Proof: See Appendix G. ∎
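Both cases of Theorem 32 can be observed on small examples. The sketch below is illustrative and makes several assumptions: all helper names are chosen here, the initial left-going ratios are set to one, the number of iterations is fixed rather than governed by a convergence criterion, and the pseudo-dual function is used in the form with edge term $\log(1 + \overleftarrow{V}_{i,j} \overrightarrow{V}_{i,j})$. The matrix with rows $(4, 1)$ and $(1, 4)$ has its Bethe free energy minimum at the identity vertex, where $\exp(-F_{\mathrm{B}}) = \theta_{1,1} \theta_{2,2} = 16$; the all-ones $3 \times 3$ matrix has an interior minimum.

```python
import math


def spa_iteration(theta, V_left):
    """One full SPA iteration of the inverse likelihood ratio updates."""
    n = len(theta)
    r = [[math.sqrt(theta[i][j]) for j in range(n)] for i in range(n)]
    V_right = [
        [r[i][j] / sum(r[i][k] * V_left[i][k] for k in range(n) if k != j)
         for j in range(n)]
        for i in range(n)
    ]
    V_left_new = [
        [r[i][j] / sum(r[k][j] * V_right[k][j] for k in range(n) if k != i)
         for j in range(n)]
        for i in range(n)
    ]
    return V_right, V_left_new


def pseudo_dual(theta, V_left, V_right):
    """Pseudo-dual function, with edge term log(1 + V_left * V_right)."""
    n = len(theta)
    r = [[math.sqrt(theta[i][j]) for j in range(n)] for i in range(n)]
    value = 0.0
    for i in range(n):
        value -= math.log(sum(r[i][j] * V_left[i][j] for j in range(n)))
    for j in range(n):
        value -= math.log(sum(r[i][j] * V_right[i][j] for i in range(n)))
    for i in range(n):
        for j in range(n):
            value += math.log1p(V_left[i][j] * V_right[i][j])
    return value


def run_spa(theta, iterations):
    """Run the SPA from all-one left-going ratios; return final messages and estimate."""
    n = len(theta)
    V_left = [[1.0] * n for _ in range(n)]
    V_right = None
    for _ in range(iterations):
        V_right, V_left = spa_iteration(theta, V_left)
    return V_left, V_right, math.exp(-pseudo_dual(theta, V_left, V_right))


# Vertex case: ratios on the matched edges grow, the others shrink.
V_left, V_right, estimate = run_spa([[4.0, 1.0], [1.0, 4.0]], 30)
print(V_left[0][0], V_left[0][1], estimate)

# Interior case: messages stay bounded.
print(run_spa([[1.0] * 3 for _ in range(3)], 30)[2])
```

In the vertex case the diagonal ratios diverge while the off-diagonal ones tend to zero, and the estimate approaches $16$; in the interior case the estimate equals $64/27$. The iteration count is kept moderate because the diverging ratios would eventually overflow in floating-point arithmetic.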

Explicit convergence speed estimates (in particular, values for $C$ and $\nu$) can be extracted from the proof of Theorem 32. However, we think that a more sophisticated analysis might yield tighter convergence speed estimates; we leave this as an open problem for future research.

VI. FINITE-GRAPH-COVER INTERPRETATION OF THE BETHE PERMANENT

Note that the definition of the permanent of $\theta$ in Definition 1 has a "combinatorial flavor." In particular, it can be seen as a sum over all weighted perfect matchings of a complete bipartite graph. This is in contrast to the definition of the Bethe permanent of $\theta$ (see Definitions 11 and 12) that has an "analytical flavor." In this section we show that it is possible to represent the Bethe permanent by an expression that has a "combinatorial flavor." We do this by applying the results from [16], that hold for general NFGs, to the NFG $\mathsf{N}(\theta)$. The key concept in that respect are so-called finite graph covers. (We keep the discussion here somewhat brief and we refer to [16] for all the details. See also [56].)

Besides being of interest in its own right, we think that the combinatorial interpretation of the Bethe permanent discussed in this section can lead to alternative proofs of known results or to proofs of new results for the Bethe permanent. See, e.g., Appendix I that gives an alternative proof of a special case of Theorem 49 in the next section.

This section is structured as follows. In Section VI-A we define the degree-$M$ Bethe permanent of a non-negative square matrix with the help of finite graph covers and show that in the limit $M \to \infty$ the degree-$M$ Bethe permanent converges to the Bethe permanent. Towards obtaining a better understanding of the degree-$M$ Bethe permanent, we then study various examples of matrices in Sections VI-B–VI-E. Because the Bethe permanent can be computed with the help of the SPA, and because the SPA is a locally operating algorithm on the relevant NFG, it is not surprising that finite graph covers play a central role in the above-mentioned combinatorial interpretation of the Bethe permanent; this aspect will be discussed in Section VI-F.

A. The Degree-$M$ Bethe Permanent of a Non-Negative Matrix

Definition 33 (see, e.g., [57], [58])

A cover of a graph $\mathsf{G}$ with vertex set $\mathcal{V}$ and edge set $\mathcal{E}$ is a graph $\tilde{\mathsf{G}}$ with vertex set $\tilde{\mathcal{V}}$ and edge set $\tilde{\mathcal{E}}$, along with a surjection $\pi : \tilde{\mathcal{V}} \to \mathcal{V}$ which is a graph homomorphism (i.e., $\pi$ takes adjacent vertices of $\tilde{\mathsf{G}}$ to adjacent vertices of $\mathsf{G}$) such that for each vertex $v \in \mathcal{V}$ and each $\tilde{v} \in \pi^{-1}(v)$, the neighborhood $\partial(\tilde{v})$ of $\tilde{v}$ is mapped bijectively to $\partial(v)$. A cover is called an $M$-cover, where $M \in \mathbb{Z}_{>0}$, if $\bigl| \pi^{-1}(v) \bigr| = M$ for every vertex $v$ in $\mathcal{V}$. □

Footnote: The number $M$ is also known as the degree of the cover. (Not to be confused with the degree of a vertex.)

Because NFGs are graphs, it is straightforward to extend this definition to NFGs. (Of course, the variables that are associated with the $M$ copies of an edge are allowed to take on different values.) For an $M$-cover, the left-hand side function nodes will be labeled by elements of $\mathcal{I} \times [M]$, the right-hand side function nodes will be labeled by elements of $\mathcal{J} \times [M]$, and the edges will be labeled by elements of a cover-dependent subset of $\mathcal{I} \times [M] \times \mathcal{J} \times [M]$. We will denote the set of all $M$-covers $\tilde{\mathsf{N}}$ of $\mathsf{N}(\theta)$ by $\tilde{\mathcal{N}}_M(\theta)$. (Note that we distinguish two $M$-covers with different function node labels, even if the underlying graphs are isomorphic; see also the comments on labeled graph covers after [16, Definition 19].)

Example 34

Let $n = 3$. The NFG $\mathsf{N}(\theta)$ is shown in Fig. 3(a). There is only one $1$-cover of $\mathsf{N}(\theta)$, namely $\mathsf{N}(\theta)$ itself. Two possible $M$-covers of $\mathsf{N}(\theta)$ with $M > 1$ are shown in Figs. 3(b)–(c). The cover in Fig. 3(b) is "trivial" in the sense that it consists of $M$ disconnected copies of $\mathsf{N}(\theta)$. On the other hand, the cover in Fig. 3(c) is "nontrivial" in the sense that it consists of $M$ copies of $\mathsf{N}(\theta)$ that are intertwined. □

Lemma 35

It holds that
\[
  \bigl| \tilde{\mathcal{N}}_M(\theta) \bigr| = (M!)^{(n^2)}. \tag{7}
\]

Fig. 3. (a) NFG $N(\theta)$ for $n = 3$. (b) "Trivial" $M$-cover of $N(\theta)$. (c) A possible $M$-cover of $N(\theta)$.

Proof:

This follows from [16, Lemma 20] and the fact that the NFG $N(\theta)$ has $n^2$ full-edges. ■

The following definition is the main definition of this section.

Definition 36

For any $M \in \mathbb{Z}_{>0}$ we define the degree-$M$ Bethe permanent of $\theta$ to be
\[ \operatorname{perm}_{\mathrm{B},M}(\theta) \triangleq \sqrt[M]{\Big\langle Z_{\mathrm{G}}(\tilde N) \Big\rangle_{\tilde N \in \tilde{\mathcal{N}}_M}}, \]
where the angular brackets represent the arithmetic average of $Z_{\mathrm{G}}(\tilde N)$ over all $\tilde N \in \tilde{\mathcal{N}}_M$. (Note that the right-hand side is based on the Gibbs partition function, not the Bethe partition function.) □

As we will now show, one can express $Z_{\mathrm{G}}(\tilde N)$ for any $M$-cover $\tilde N$ of $N(\theta)$ as the permanent of some matrix that is derived from $\theta$.

Definition 37

For any $M \in \mathbb{Z}_{>0}$ we define $\tilde\Psi_M$ to be the set
\[ \tilde\Psi_M \triangleq \Big\{ \tilde P = \big\{ \tilde P^{(i,j)} \big\}_{i \in \mathcal{I},\, j \in \mathcal{J}} \;\Big|\; \tilde P^{(i,j)} \in \mathcal{P}_{M \times M} \Big\}. \]
Moreover, for $\tilde P \in \tilde\Psi_M$ we define the $\tilde P$-lifting of $\theta$ to be the following $(nM) \times (nM)$ matrix
\[ \theta^{\uparrow \tilde P} \triangleq \begin{pmatrix} \theta_{1,1} \tilde P^{(1,1)} & \cdots & \theta_{1,n} \tilde P^{(1,n)} \\ \vdots & & \vdots \\ \theta_{n,1} \tilde P^{(n,1)} & \cdots & \theta_{n,n} \tilde P^{(n,n)} \end{pmatrix}. \]
□

For any positive integer $M$ it is straightforward to see that there is a bijection between the set $\tilde{\mathcal{N}}_M(\theta)$ of all $M$-covers of $N(\theta)$ and the set $\{ \theta^{\uparrow \tilde P} \}_{\tilde P \in \tilde\Psi_M}$. In particular, because of Lemma 8, for an $M$-cover $\tilde N$ and its corresponding matrix $\theta^{\uparrow \tilde P}$ it holds that $Z_{\mathrm{G}}(\tilde N) = \operatorname{perm}(\theta^{\uparrow \tilde P})$. Therefore, we have the following reformulation of Definition 36.

Definition 38 (Reformulation of Definition 36)

For any $M \in \mathbb{Z}_{>0}$ we define the degree-$M$ Bethe permanent of $\theta$ to be
\[ \operatorname{perm}_{\mathrm{B},M}(\theta) \triangleq \sqrt[M]{\Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M}}, \tag{8} \]
where the angular brackets represent the arithmetic average of $\operatorname{perm}(\theta^{\uparrow \tilde P})$ over all $\tilde P \in \tilde\Psi_M$. (Note that the permanent, not the Bethe permanent, appears on the right-hand side of the above expression.) □

In order to better appreciate the right-hand side of the above expression, it is worthwhile to make the following two observations.

• For $M = 1$, the averaging is trivial because $\tilde\Psi_M$ contains only one element. Moreover, letting $\tilde P$ be this single element, it holds that $\theta^{\uparrow \tilde P} = \theta$. Therefore $\operatorname{perm}_{\mathrm{B},1}(\theta) = \operatorname{perm}(\theta)$.

• For any $M \in \mathbb{Z}_{>0}$, the "trivial" $M$-cover of $N(\theta)$ is given by the choice $\tilde P = \{ \tilde P^{(i,j)} \}_{i \in \mathcal{I},\, j \in \mathcal{J}}$ with $\tilde P^{(i,j)} = \tilde I$, $(i,j) \in \mathcal{I} \times \mathcal{J}$, where $\tilde I$ is the identity matrix of size $M \times M$. For this $M$-cover we obtain $\operatorname{perm}(\theta^{\uparrow \tilde P}) = \operatorname{perm}(\theta)^M$, i.e., $\sqrt[M]{\operatorname{perm}(\theta^{\uparrow \tilde P})} = \operatorname{perm}(\theta)$.

With this, we are ready for the main result of this section.
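Definition 38 and the two observations above can be checked numerically for very small $n$ and $M$. The following is a minimal brute-force sketch, not an efficient method: the function names (`permanent`, `perm_matrix`, `lift`, `degree_m_bethe_permanent`) are illustrative choices and do not come from the paper, and the permanent is evaluated by summing over all permutations, which is only feasible for tiny matrices.

```python
import itertools
import math


def permanent(a):
    """Permanent of a square matrix by brute force over all permutations."""
    n = len(a)
    return sum(
        math.prod(a[i][s[i]] for i in range(n))
        for s in itertools.permutations(range(n))
    )


def perm_matrix(sigma):
    """M x M permutation matrix with a one at position (r, sigma[r])."""
    m = len(sigma)
    return [[1 if sigma[r] == c else 0 for c in range(m)] for r in range(m)]


def lift(theta, blocks):
    """P-lifting of theta: block (i, j) equals theta[i][j] * blocks[i][j]."""
    n = len(theta)
    m = len(blocks[0][0])
    return [
        [theta[i][j] * blocks[i][j][r][c] for j in range(n) for c in range(m)]
        for i in range(n)
        for r in range(m)
    ]


def degree_m_bethe_permanent(theta, m):
    """M-th root of the average of perm(lifting) over all (M!)^(n^2) liftings."""
    n = len(theta)
    perms = [perm_matrix(s) for s in itertools.permutations(range(m))]
    total = 0
    count = 0
    for choice in itertools.product(perms, repeat=n * n):
        blocks = [list(choice[i * n:(i + 1) * n]) for i in range(n)]
        total += permanent(lift(theta, blocks))
        count += 1
    return (total / count) ** (1.0 / m)
```

For example, `degree_m_bethe_permanent(theta, 1)` reduces to `permanent(theta)`, and choosing all blocks equal to the identity in `lift` reproduces the "trivial" cover whose permanent is `permanent(theta) ** M`.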

Theorem 39

It holds that
\[ \limsup_{M \to \infty} \operatorname{perm}_{\mathrm{B},M}(\theta) = \operatorname{perm}_{\mathrm{B}}(\theta). \]

Proof: This follows from Definitions 12 and 38, along with the application of [16, Theorem 33] to $N = N(\theta)$. ■

Theorem 39, together with the relation $\operatorname{perm}_{\mathrm{B},1}(\theta) = \operatorname{perm}(\theta)$, is visualized in Fig. 4. Because the permanents that appear on the right-hand side of (8) are combinatorial objects, Definition 38 and Theorem 39 give the promised "combinatorial characterization" of the Bethe permanent.

Fig. 4. The degree-$M$ Bethe permanent of the non-negative matrix $\theta$ for different values of $M$: $\operatorname{perm}_{\mathrm{B},M}(\theta)|_{M=1} = \operatorname{perm}(\theta)$ and $\operatorname{perm}_{\mathrm{B},M}(\theta)|_{M \to \infty} = \operatorname{perm}_{\mathrm{B}}(\theta)$.

B. The Bethe Permanent for Matrices of Size $2 \times 2$

In this and the following subsections we illustrate the concepts and results that have been presented so far in this section by having a detailed look at the case $n = 2$, i.e., we study the permanent, the Bethe permanent, and the degree-$M$ Bethe permanent for the matrix
\[ \theta = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \end{pmatrix}. \]
The corresponding NFG $N(\theta)$ is shown in Fig. 5(a). Of course, nobody would use the Bethe permanent to approximate the permanent of a $2 \times 2$ matrix; however, this case gives some good insights into the strengths and the weaknesses of the Bethe approximation to the permanent.

Lemma 40

For $n = 2$ it holds that
\[ \operatorname{perm}(\theta) = \theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1}, \qquad \operatorname{perm}_{\mathrm{B}}(\theta) = \max(\theta_{1,1}\theta_{2,2},\ \theta_{1,2}\theta_{2,1}). \]

Proof: The result for $\operatorname{perm}(\theta)$ follows from Definition 1. On the other hand, in order to obtain $\operatorname{perm}_{\mathrm{B}}(\theta)$, we apply Corollary 15. The crucial step in Corollary 15 is to minimize $F_{\mathrm{B}}(\gamma)$ over $\gamma \in \Gamma_{2 \times 2}$. Because $H_{\mathrm{B}}(\gamma) = 0$ for all $\gamma \in \Gamma_{2 \times 2}$, minimizing $F_{\mathrm{B}}(\gamma)$ is equivalent to minimizing $U_{\mathrm{B}}(\gamma) = -\sum_{i,j} \gamma_{i,j} \log(\theta_{i,j})$.

• For $\theta_{1,1}\theta_{2,2} = \theta_{1,2}\theta_{2,1}$ the minimum is achieved at every $\gamma \in \Gamma_{2 \times 2}$.

• For $\theta_{1,1}\theta_{2,2} > \theta_{1,2}\theta_{2,1}$ the minimum is achieved at $\gamma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.

• For $\theta_{1,1}\theta_{2,2} < \theta_{1,2}\theta_{2,1}$ the minimum is achieved at $\gamma = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

■

Example 41

For $n = 2$ and $\theta_{i,j} = 1$, $(i,j) \in \mathcal{I} \times \mathcal{J}$, we have
\[ \operatorname{perm}(\theta) = 2, \qquad \operatorname{perm}_{\mathrm{B}}(\theta) = 1. \]
Recall that $\operatorname{perm}(\theta)$ represents the sum of all the weighted perfect matchings of the complete bipartite graph $N(\theta)$, and so, for the special choice $\theta_{i,j} = 1$, $(i,j) \in \mathcal{I} \times \mathcal{J}$, the quantity $\operatorname{perm}(\theta)$ represents the number of perfect matchings of $N(\theta)$. As is illustrated in Figs. 5(b)–(c), the graph $N(\theta)$ has two perfect matchings, thereby combinatorially verifying $\operatorname{perm}(\theta) = 2$. □

C. The Degree-$M$ Bethe Permanent for Matrices of Size $2 \times 2$: Initial Considerations

One of the goals of this and the next subsections is to obtain a better combinatorial understanding of the result $\operatorname{perm}_{\mathrm{B}}(\theta) = 1$ for $n = 2$, in particular, why it is different from $\operatorname{perm}(\theta)$, yet not too different.

Towards this goal, let us study the degree-$M$ Bethe permanent of $\theta$ as specified in Definition 38. Therein, the average is taken over $|\tilde\Psi_M| = (M!)^4$ matrices
\[ \theta^{\uparrow \tilde P} = \begin{pmatrix} \theta_{1,1} \tilde P_{1,1} & \theta_{1,2} \tilde P_{1,2} \\ \theta_{2,1} \tilde P_{2,1} & \theta_{2,2} \tilde P_{2,2} \end{pmatrix}, \qquad \tilde P \in \tilde\Psi_M. \]

Fig. 5. Graphs (NFGs) that are discussed in Sections VI-C–VI-E. (a) Base graph. (b)–(c) Perfect matchings of the graph in (a). (d) A possible double cover of the graph in (a). (e)–(h) Perfect matchings of the graph in (d). (i) A possible double cover of the graph in (a). (j)–(k) Perfect matchings of the graph in (i).

We can simplify the analysis by realizing that the permanent of $\theta^{\uparrow \tilde P}$ equals the permanent of a modified matrix, where the first block row is multiplied from the left by $\tilde P_{1,1}^{-1}$, where the second block row is multiplied from the left by $\tilde P_{2,1}^{-1}$, and where the second block column is multiplied from the right by $\tilde P_{1,2}^{-1} \cdot \tilde P_{1,1}$, i.e.,
\[ \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) = \operatorname{perm} \begin{pmatrix} \theta_{1,1} \tilde I & \theta_{1,2} \tilde I \\ \theta_{2,1} \tilde I & \theta_{2,2} \tilde P_{2,1}^{-1} \tilde P_{2,2} \tilde P_{1,2}^{-1} \tilde P_{1,1} \end{pmatrix}, \]
where $\tilde I$ is the identity matrix of size $M \times M$. Therefore, we can rewrite $\operatorname{perm}_{\mathrm{B},M}(\theta)$ as follows:
\[ \operatorname{perm}_{\mathrm{B},M}(\theta) = \sqrt[M]{\left\langle \operatorname{perm} \begin{pmatrix} \theta_{1,1} \tilde I & \theta_{1,2} \tilde I \\ \theta_{2,1} \tilde I & \theta_{2,2} \tilde P'_{2,2} \end{pmatrix} \right\rangle_{\tilde P'_{2,2} \in \mathcal{P}_{M \times M}}}, \tag{9} \]
i.e., an average over the $M!$ permutation matrices of size $M \times M$.

D. The Degree-$M$ Bethe Permanent for Matrices of Size $2 \times 2$: All-One Matrix

In this subsection we consider the cases $M = 2$, $M = 3$, and general $M$ for the special choice
\[ \theta = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}. \]

Example 42

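Let $n = 2$, $M = 2$, and $\theta_{i,j} = 1$, $(i,j) \in \mathcal{I} \times \mathcal{J}$. We make the following observations.

• The average in (9) is over $2! = 2$ matrices, namely over
\[ \theta^{\uparrow(1)} \triangleq \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}, \qquad \theta^{\uparrow(2)} \triangleq \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}. \]

• The matrix $\theta^{\uparrow(1)}$ corresponds to the double cover of $N(\theta)$ shown in Fig. 5(d). Because that graph has $4$ perfect matchings, see Figs. 5(e)–(h), we have $\operatorname{perm}(\theta^{\uparrow(1)}) = 4$.

• The matrix $\theta^{\uparrow(2)}$ corresponds to the double cover of $N(\theta)$ shown in Fig. 5(i). Because that graph has $2$ perfect matchings, see Figs. 5(j)–(k), we have $\operatorname{perm}(\theta^{\uparrow(2)}) = 2$.

Putting everything together, we obtain the degree-$2$ Bethe permanent of $\theta$, i.e.,
\[ \operatorname{perm}_{\mathrm{B},2}(\theta) = \sqrt{\tfrac{1}{2} \cdot (4 + 2)} = \sqrt{\tfrac{1}{2} \cdot 6} = \sqrt{3} \approx 1.732. \]
We note that the graph in Fig. 5(d) consists of $M$ independent copies of the graph in Fig. 5(a); therefore it is not surprising that $\operatorname{perm}(\theta^{\uparrow(1)}) = \operatorname{perm}(\theta)^M = 2^2 = 4$. On the other hand, the graph in Fig. 5(i) consists of $M$ coupled copies of the graph in Fig. 5(a), which implies that we cannot choose the perfect matchings independently. Therefore, it is not surprising that we have $\operatorname{perm}(\theta^{\uparrow(2)}) < \operatorname{perm}(\theta)^M = 2^2 = 4$, which finally results in $\operatorname{perm}_{\mathrm{B},2}(\theta) < \operatorname{perm}(\theta)$. Nevertheless, these considerations also show why $\operatorname{perm}_{\mathrm{B},2}(\theta)$ is not too different from $\operatorname{perm}(\theta)$. □

Example 43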

Let $n = 2$, $M = 3$, and $\theta_{i,j} = 1$, $(i,j) \in \mathcal{I} \times \mathcal{J}$. The average in (9) is over $3! = 6$ matrices. These matrices correspond to the triple covers of $N(\theta)$ shown in Fig. 6(b)–(g). Computing the number of perfect matchings for each of these cases, we obtain
\[ \operatorname{perm}_{\mathrm{B},3}(\theta) = \sqrt[3]{\tfrac{1}{6} \cdot (8 + 4 + 4 + 4 + 2 + 2)} = \sqrt[3]{\tfrac{1}{6} \cdot 24} = \sqrt[3]{4} \approx 1.587. \]
In particular, for the triple cover in Fig. 6(c) we show its $4$ perfect matchings explicitly in Fig. 7.

Overall, we can make similar observations as at the end of Example 42 concerning the coupling of the $M$ copies of $N(\theta)$ that make up a degree-$M$ cover and its influence on the number of perfect matchings. □

Example 44

Let $n = 2$, $M \in \mathbb{Z}_{>0}$, and $\theta_{i,j} = 1$, $(i,j) \in \mathcal{I} \times \mathcal{J}$. The average in (9) is over $M!$ matrices that correspond to the $M$-covers of $N(\theta)$. For each of these matrices, the permanent equals the number of perfect matchings in the corresponding $M$-cover. We make the following observations (see Figs. 5–7 for illustrations for the cases $M = 2$ and $M = 3$).

• Every $M$-cover consists of up to $M$ cycles.

• Every cycle supports two perfect matchings (independently of the cycle length and independently of the perfect matchings chosen on the rest of the graph).

Therefore, if an $M$-cover has $c$ cycles then it has $2^c$ perfect matchings. The average in (9) can then be evaluated with suitable combinatorial tools, for example by using the so-called cycle index of the symmetric group over $M$ elements (see, e.g., [59]), and we obtain
\[ \operatorname{perm}_{\mathrm{B},M}(\theta) = \sqrt[M]{M + 1}. \]
Therefore, in the limit $M \to \infty$, we get
\[ \operatorname{perm}_{\mathrm{B}}(\theta) = \limsup_{M \to \infty} \operatorname{perm}_{\mathrm{B},M}(\theta) = 1. \]
This confirms the result for $\operatorname{perm}_{\mathrm{B}}(\theta)$ in Example 41, which was obtained by analytical means. □

Fig. 6. Graphs (NFGs) that are discussed in Sections VI-C–VI-E. (a) Base graph. (b)–(g) Possible triple covers of the graph in (a), each annotated with its number of perfect matchings ("pms.").

Fig. 7. The four perfect matchings of the triple cover in Fig. 6(c).

E. The Degree-$M$ Bethe Permanent for Matrices of Size $2 \times 2$: General Non-Negative Matrix

In this subsection we consider the cases $M = 2$, $M = 3$, and general $M$ for the general non-negative matrix
\[ \theta = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \end{pmatrix}. \]
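For the $2 \times 2$ setting, the average in (9) can be evaluated by enumerating the $M!$ permutations and reading off their cycle structure. The sketch below assumes, in line with Examples 44 and 46, that a cycle of length $\ell$ in the permutation contributes the factor $(\theta_{1,1}\theta_{2,2})^{\ell} + (\theta_{1,2}\theta_{2,1})^{\ell}$ to the permanent of the lifted matrix; the function names are illustrative and not taken from the paper.

```python
import itertools
import math


def cycle_lengths(sigma):
    """Cycle lengths of a permutation given as a tuple with i -> sigma[i]."""
    seen = [False] * len(sigma)
    lengths = []
    for start in range(len(sigma)):
        if seen[start]:
            continue
        length = 0
        k = start
        while not seen[k]:
            seen[k] = True
            k = sigma[k]
            length += 1
        lengths.append(length)
    return lengths


def summed_lifted_permanents(a, b, m):
    """Sum over all permutations of S_m of prod over cycles (a^len + b^len).

    Here a = theta11 * theta22 and b = theta12 * theta21. Returns the pair
    (sum, m!), so that sum / m! is the average appearing in (9).
    """
    total = 0
    for sigma in itertools.permutations(range(m)):
        total += math.prod(a ** l + b ** l for l in cycle_lengths(sigma))
    return total, math.factorial(m)


def degree_m_bethe_permanent_2x2(theta, m):
    """Degree-M Bethe permanent of a 2 x 2 matrix via cycle enumeration."""
    a = theta[0][0] * theta[1][1]
    b = theta[0][1] * theta[1][0]
    total, count = summed_lifted_permanents(a, b, m)
    return (total / count) ** (1.0 / m)
```

With integer inputs the sum is exact, so for small $M$ it can be compared directly against the closed-form expression $\sum_{\ell=0}^{M} (\theta_{1,1}\theta_{2,2})^{M-\ell} (\theta_{1,2}\theta_{2,1})^{\ell}$ stated in Example 47.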
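A particular goal of this subsection is to compare the degree-$M$ Bethe permanent of $\theta$ with the permanent of $\theta$. In fact, as we will see, for every considered case in this subsection we have $\operatorname{perm}_{\mathrm{B},M}(\theta) \le \operatorname{perm}(\theta)$.

Example 45

Let $n = 2$ and $M = 2$. We perform similar computations as in Example 42, but for a general non-negative matrix $\theta$. Towards computing $\operatorname{perm}_{\mathrm{B},2}(\theta)$ as given in (9), we make the following observations.

• The average in (9) is over $2! = 2$ matrices, namely over
\[ \theta^{\uparrow(1)} \triangleq \begin{pmatrix} \theta_{1,1} & 0 & \theta_{1,2} & 0 \\ 0 & \theta_{1,1} & 0 & \theta_{1,2} \\ \theta_{2,1} & 0 & \theta_{2,2} & 0 \\ 0 & \theta_{2,1} & 0 & \theta_{2,2} \end{pmatrix}, \qquad \theta^{\uparrow(2)} \triangleq \begin{pmatrix} \theta_{1,1} & 0 & \theta_{1,2} & 0 \\ 0 & \theta_{1,1} & 0 & \theta_{1,2} \\ \theta_{2,1} & 0 & 0 & \theta_{2,2} \\ 0 & \theta_{2,1} & \theta_{2,2} & 0 \end{pmatrix}. \]

• We obtain
\[ \operatorname{perm}\big( \theta^{\uparrow(1)} \big) = (\theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1})^2 = \theta_{1,1}^2 \theta_{2,2}^2 + 2\, \theta_{1,1}\theta_{1,2}\theta_{2,1}\theta_{2,2} + \theta_{1,2}^2 \theta_{2,1}^2. \]
Note that the coefficients add up to $4$ because $\theta^{\uparrow(1)}$ corresponds to the double cover of $N(\theta)$ shown in Fig. 5(d), which admits $4$ (weighted) perfect matchings.

• We obtain
\[ \operatorname{perm}\big( \theta^{\uparrow(2)} \big) = \theta_{1,1}^2 \theta_{2,2}^2 + \theta_{1,2}^2 \theta_{2,1}^2. \]
Note that the coefficients add up to $2$ because $\theta^{\uparrow(2)}$ corresponds to the double cover of $N(\theta)$ shown in Fig. 5(i), which admits $2$ (weighted) perfect matchings.

Putting everything together, we obtain for the square of the degree-$2$ Bethe permanent of $\theta$
\[ \big( \operatorname{perm}_{\mathrm{B},2}(\theta) \big)^2 = \tfrac{1}{2} \cdot \Big( \operatorname{perm}(\theta^{\uparrow(1)}) + \operatorname{perm}(\theta^{\uparrow(2)}) \Big) = \theta_{1,1}^2 \theta_{2,2}^2 + \theta_{1,1}\theta_{1,2}\theta_{2,1}\theta_{2,2} + \theta_{1,2}^2 \theta_{2,1}^2. \]
Given the observations that
\[ \operatorname{perm}\big( \theta^{\uparrow(1)} \big) \le \big( \operatorname{perm}(\theta) \big)^2, \qquad \operatorname{perm}\big( \theta^{\uparrow(2)} \big) \le \big( \operatorname{perm}(\theta) \big)^2, \]
it is not surprising that we also have the inequality $\big( \operatorname{perm}_{\mathrm{B},2}(\theta) \big)^2 \le \big( \operatorname{perm}(\theta) \big)^2$, i.e., $\operatorname{perm}_{\mathrm{B},2}(\theta) \le \operatorname{perm}(\theta)$. □

Example 46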

Let $n = 2$ and $M = 3$. We perform similar computations as in Example 43, but for a general non-negative matrix $\theta$. Towards computing $\operatorname{perm}_{\mathrm{B},3}(\theta)$ as given in (9), we make the following observations.

• The average in (9) is over $3! = 6$ matrices. These matrices correspond to the triple covers of $N(\theta)$ shown in Fig. 6(b)–(g).

• For example, for the matrix $\theta^{\uparrow(2)}$ corresponding to the triple cover in Fig. 6(c), we obtain
\[ \operatorname{perm}\big( \theta^{\uparrow(2)} \big) = \theta_{1,1}^3 \theta_{2,2}^3 + \theta_{1,1}^2 \theta_{1,2} \theta_{2,1} \theta_{2,2}^2 + \theta_{1,1} \theta_{1,2}^2 \theta_{2,1}^2 \theta_{2,2} + \theta_{1,2}^3 \theta_{2,1}^3, \]
where each (weighted) perfect matching in Fig. 7 contributes one monomial to the above expression. One can verify that
\[ \operatorname{perm}\big( \theta^{\uparrow(2)} \big) = \big( \theta_{1,1}^2 \theta_{2,2}^2 + \theta_{1,2}^2 \theta_{2,1}^2 \big) \cdot (\theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1}) \le \big( \theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1} \big)^2 \cdot (\theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1}) = \big( \theta_{1,1}\theta_{2,2} + \theta_{1,2}\theta_{2,1} \big)^3 = \big( \operatorname{perm}(\theta) \big)^3. \]
(The product expression in the first line is not surprising given the fact that the graph in Fig. 6(c) contains two independent components, each contributing one factor to the above product.)

Similar observations can be made for the other five triple covers in Fig. 6(b)–(g), and so we obtain $\big( \operatorname{perm}_{\mathrm{B},3}(\theta) \big)^3 \le \big( \operatorname{perm}(\theta) \big)^3$, i.e., $\operatorname{perm}_{\mathrm{B},3}(\theta) \le \operatorname{perm}(\theta)$. □

Example 47

Let $n = 2$ and $M \in \mathbb{Z}_{>0}$. We perform similar computations as in Example 44, but for a general non-negative matrix $\theta$. The observations that we made there can be generalized (beyond the all-one matrix), and we obtain
\[ \big( \operatorname{perm}_{\mathrm{B},M}(\theta) \big)^M = \sum_{\ell=0}^{M} (\theta_{1,1}\theta_{2,2})^{M-\ell} (\theta_{1,2}\theta_{2,1})^{\ell}. \]
Because
\[ \big( \operatorname{perm}(\theta) \big)^M = \sum_{\ell=0}^{M} \binom{M}{\ell} (\theta_{1,1}\theta_{2,2})^{M-\ell} (\theta_{1,2}\theta_{2,1})^{\ell}, \]
we see that $\big( \operatorname{perm}_{\mathrm{B},M}(\theta) \big)^M \le \big( \operatorname{perm}(\theta) \big)^M$, i.e., $\operatorname{perm}_{\mathrm{B},M}(\theta) \le \operatorname{perm}(\theta)$. Moreover, in the limit $M \to \infty$, we have
\[ \operatorname{perm}_{\mathrm{B}}(\theta) = \limsup_{M \to \infty} \operatorname{perm}_{\mathrm{B},M}(\theta) = \max(\theta_{1,1}\theta_{2,2},\ \theta_{1,2}\theta_{2,1}). \]
This confirms the result for $\operatorname{perm}_{\mathrm{B}}(\theta)$ in Lemma 40, which was obtained by analytical means. □

For $n > 2$, we leave it as an open problem to obtain an "explicit expression" for $\operatorname{perm}_{\mathrm{B},M}(\theta)$, $M \in \mathbb{Z}_{>0}$, either for the all-one matrix case or for the general non-negative matrix case.

In conclusion, the above examples show that in general $\operatorname{perm}_{\mathrm{B}}(\theta) \ne \operatorname{perm}(\theta)$; however, they also show that the Bethe permanent has the potential to give reasonably good estimates, in particular in the cases where the "coupling effect" in the average graph cover is not too strong. Heuristically, this "coupling effect" seems to be worst for $n = 2$ and to become weaker the larger $n$ is.

F. Relevance of Finite Graph Covers

If the NFG $N(\theta)$ had no cycles then the SPA could be used to exactly compute the partition function. Namely, after a finite number of iterations, the SPA would reach a fixed point and the partition function $Z_{\mathrm{G}}\big( N(\theta) \big) = \operatorname{perm}(\theta)$ could be computed with the help of an expression like $\exp\big( -F\big( \{ \overleftarrow{V}^{(t)}_{i,j} \}, \{ \overrightarrow{V}^{(t)}_{i,j} \} \big) \big)$, where $F$ is defined in Lemma 31. However, $N(\theta)$ has cycles: the use of this expression at a fixed point of the SPA is still possible, but usually it does not yield the correct partition function. In this subsection, we would like to better understand the source of this suboptimality.

To that end, observe that the SPA is an algorithm that processes information locally on $N(\theta)$, i.e., messages are sent along edges, and function nodes take incoming messages from incident edges, do some computations, and send out new messages along the incident edges. On the one hand, this locality explains the main strengths of the SPA, namely its low complexity and its parallelizability, two key factors for making the SPA a popular algorithm. On the other hand, this locality also explains the main weakness of the SPA. Namely, a locally operating algorithm like the SPA "cannot distinguish" if it is operating on $N(\theta)$ or any of its covers [16], [60], [61].

More precisely, let $\tilde N$ be an $M$-cover of $N(\theta)$. Such an $M$-cover "looks locally the same" as $N(\theta)$ in the sense that the local structure of $\tilde N$ is exactly the same as the one of $N(\theta)$. (Of course, globally $\tilde N$ and $N(\theta)$ are different because the former NFG contains $M$ times as many function nodes and $M$ times as many edges.) Consequently, if the SPA is run on $\tilde N$ with the same initialization as the SPA on $N(\theta)$ (every initial message is replicated $M$ times), we observe that, because both graphs look locally the same and because the SPA is a locally operating algorithm, after every iteration the messages on $\tilde N$ are exactly the same as the messages on $N(\theta)$, simply replicated $M$ times. In that sense, the SPA "cannot distinguish" if it is operating on $N(\theta)$, or, implicitly, on $\tilde N$, or any other $M$-cover of $N(\theta)$.

This observation allows us to give the following interpretation of (8) (which is reproduced here for ease of reference):
\[ \operatorname{perm}_{\mathrm{B},M}(\theta) \triangleq \sqrt[M]{\Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M}}. \tag{10} \]
Namely, because the SPA implicitly tries to compute in parallel the partition function $Z_{\mathrm{G}}\big( N(\theta^{\uparrow \tilde P}) \big) = \operatorname{perm}(\theta^{\uparrow \tilde P})$ for all $M$-covers of $N(\theta)$, yet it has to give back one real number only, the "best it can do" is to give back the average of these partition functions, i.e., $\big\langle \operatorname{perm}( \theta^{\uparrow \tilde P} ) \big\rangle_{\tilde P \in \tilde\Psi_M}$. (The $M$th root that appears in (10) is included so that the result is properly normalized w.r.t. $Z_{\mathrm{G}}\big( N(\theta) \big) = \operatorname{perm}(\theta)$.)

Let us conclude this subsection by commenting on two recent papers.

• Translating the results of a paper by Greenhill, Janson, and Ruciński [35] to graphical models, it turns out that the authors compute a high-order approximation to the quantity $\big\langle Z_{\mathrm{G}}(\tilde N') \big\rangle_{\tilde N' \in \tilde{\mathcal{N}}'_M}$ for some NFG $N'(\theta)$ with $Z_{\mathrm{G}}(N'(\theta)) = \operatorname{perm}(\theta)$. The NFG $N'(\theta)$ is in general different from $N(\theta)$, where the latter NFG was specified in Definition 4. We will elaborate on this interesting connection in Section VII-E.

• The paper [32] by Barvinok presents bounds on the number of zero/one matrices with prescribed row and column sums. (As already mentioned in Section I-C, in statistical physics terms the approach taken therein can be considered as a mean-field approach.) In terms of NFGs, the quantity of interest is expressed as the partition function of an NFG that has the same topology as $N(\theta)$ but different function nodes.

Section 3.1 of [32] then presents an interpretation of these bounds that has a similar flavor to the graph cover interpretation of the Bethe permanent; however, it also has stark differences. Namely, in terms of NFGs, Section 3.1 of [32] presents an NFG where every function node of the base graph is replicated $M$ times and every edge is replicated $M^2$ times, i.e., all $Mn$ left-hand side function nodes are connected by exactly one edge to all the $Mn$ right-hand side function nodes. In order for this to make sense, the local functions are adapted so that they have $Mn$ arguments instead of $n$ arguments. It is then shown that the $M$th root of the partition function of this new NFG, $M \to \infty$, yields the relevant number in which the bounds are expressed. Despite all the similarities, the differences to finite graph covers are clear:

– There is only one such $M$-fold version of the base graph, whereas the number of $M$-covers of $N(\theta)$ is $(M!)^{(n^2)}$.

– The number of edges is $M^2 n^2$, whereas the number of edges in an $M$-cover of $N(\theta)$ is $M n^2$.

– The local functions need to be adapted in order to allow for $Mn$ instead of $n$ arguments, whereas the local functions of an $M$-cover of $N(\theta)$ are the same as the local functions of $N(\theta)$.

VII. The Relationship between the Permanent and the Bethe Permanent

In this section we explore the relationship between $\operatorname{perm}(\theta)$ and $\operatorname{perm}_{\mathrm{B}}(\theta)$, in particular, if and how $\operatorname{perm}(\theta)$ can be upper and lower bounded by expressions that are functions of $\operatorname{perm}_{\mathrm{B}}(\theta)$. For an additional/complementary discussion on this topic we refer to [22].

We start with a lemma that shows that there are non-negative square matrices for which the Bethe permanent can give rather accurate estimates of the permanent, thereby showing the overall potential of the Bethe permanent to be the basis for good upper and lower bounds on the permanent of general non-negative square matrices.

Lemma 48

Let $\mathbf{1}_{n \times n}$ be the all-one matrix of size $n \times n$. Then
\[ \frac{\operatorname{perm}(\mathbf{1}_{n \times n})}{\operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n})} = \sqrt{\frac{2\pi n}{e}} \cdot \big( 1 + o(1) \big), \]
where $o(1)$ is w.r.t. $n$.

Proof: See Appendix H. ■
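The ratio in Lemma 48 can be evaluated numerically for finite $n$. The sketch below rests on an assumption that is not stated in this section: that for the all-one matrix the Bethe free energy is minimized at the uniform matrix $\gamma_{i,j} = 1/n$, which gives the closed form $\operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n}) = n^n (1 - 1/n)^{n(n-1)}$. The function names are illustrative.

```python
import math


def log_perm_all_one(n):
    """log perm(1_{n x n}) = log(n!)."""
    return math.lgamma(n + 1)


def log_bethe_perm_all_one(n):
    """log perm_B(1_{n x n}) under the assumed uniform minimizer gamma = 1/n.

    Assumption (not from this section): perm_B = n^n * (1 - 1/n)^(n (n - 1)).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return n * math.log(n) + n * (n - 1) * math.log1p(-1.0 / n)


def ratio(n):
    """perm(1_{n x n}) / perm_B(1_{n x n}) for finite n."""
    return math.exp(log_perm_all_one(n) - log_bethe_perm_all_one(n))


def asymptotic_ratio(n):
    """Leading term sqrt(2 pi n / e) from Lemma 48."""
    return math.sqrt(2.0 * math.pi * n / math.e)
```

For $n = 2$ this reproduces the ratio $2/1$ of Example 41, and for growing $n$ the quotient `ratio(n) / asymptotic_ratio(n)` approaches one, in line with the $1 + o(1)$ factor.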

Although the factor $\sqrt{2\pi n / e}$ is non-negligible, compared to $\operatorname{perm}(\mathbf{1}_{n \times n}) = n!$ it is rather small.

A. Lower Bounds on the Permanent of the Matrix $\theta$

In this subsection we study lower bounds on $\operatorname{perm}(\theta)$ based on $\operatorname{perm}_{\mathrm{B}}(\theta)$.

Theorem 49 (Gurvits [14], [15])

It holds that
\[ \frac{\operatorname{perm}(\theta)}{\operatorname{perm}_{\mathrm{B}}(\theta)} \ge 1. \]

Proof: This result was recently shown by Gurvits [14], [15]. Roughly speaking, its elegant proof is based on first expressing $\theta$ in terms of a stationary point of $F_{\mathrm{B}, N(\theta)}$ and then applying an inequality due to Schrijver [62]. ■

For more details, along with a discussion of this result's relationship to the results in [63], [64], we refer to [14], [15]. For a somewhat different approach to proving this theorem, we refer the interested reader to [22].

Corollary 50 (Gurvits [14], [15])

For any $\gamma \in \Gamma_{n \times n}$ it holds that
\[ \frac{\operatorname{perm}(\theta)}{\exp\big( -F_{\mathrm{B}, N(\theta)}(\gamma) \big)} \ge 1. \]

Proof: This is a straightforward consequence of Theorem 49 and Definitions 11 and 12. ■
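Corollary 50 can be tried out on small matrices by evaluating $\exp(-F_{\mathrm{B}}(\gamma))$ for a hand-picked doubly stochastic $\gamma$. The sketch below uses $F_{\mathrm{B}} = U_{\mathrm{B}} - H_{\mathrm{B}}$ with $U_{\mathrm{B}}(\gamma) = -\sum_{i,j} \gamma_{i,j} \log \theta_{i,j}$ as in the proof of Lemma 40; the entropy expression $H_{\mathrm{B}}(\gamma) = -\sum_{i,j} \gamma_{i,j} \log \gamma_{i,j} + \sum_{i,j} (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j})$ is assumed here (it is not restated in this section, although it is consistent with $H_{\mathrm{B}}(\gamma) = 0$ on $\Gamma_{2 \times 2}$). The function names are illustrative, and the permanent is computed by brute force.

```python
import itertools
import math


def permanent(a):
    """Permanent of a square matrix by brute force over all permutations."""
    n = len(a)
    return sum(
        math.prod(a[i][s[i]] for i in range(n))
        for s in itertools.permutations(range(n))
    )


def xlogx(x):
    """x * log(x) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)


def bethe_lower_bound(theta, gamma):
    """exp(-F_B(gamma)) for a doubly stochastic gamma (not checked here).

    Assumes F_B = U_B - H_B with the entropy expression named in the lead-in.
    """
    n = len(theta)
    log_bound = 0.0
    for i in range(n):
        for j in range(n):
            g = float(gamma[i][j])
            if g > 0.0:
                if theta[i][j] == 0:
                    return 0.0  # U_B is +infinity, so the bound is zero
                log_bound += g * math.log(theta[i][j])  # contribution to -U_B
            log_bound += -xlogx(g) + xlogx(1.0 - g)  # contribution to H_B
    return math.exp(log_bound)
```

For the matrix with rows $(1, 2)$ and $(3, 4)$, the uniform $\gamma$ gives the bound $\sqrt{24}$ and the anti-diagonal permutation matrix gives $6 = \max(\theta_{1,1}\theta_{2,2}, \theta_{1,2}\theta_{2,1})$, both below the permanent $10$.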

Some comments on Theorem 49 and Corollary 50:

• Corollary 50 is significant when one is not willing to run the SPA, but one has a reasonably good estimate of the $\gamma \in \Gamma_{n \times n}$ that minimizes $F_{\mathrm{B}, N(\theta)}$. This approach is for example interesting when one wants to obtain analytical lower bounds on the permanent of some parameterized class of non-negative square matrices.

• Chertkov, Kroc, and Vergassola [11] observed in 2008 that $\operatorname{perm}(\theta) \ge \operatorname{perm}_{\mathrm{B}}(\theta)$ holds for all the matrices that they experimented with. They also outlined a potential approach to proving this inequality via the loop calculus technique by Chertkov and Chernyak [21], which in the case of $N(\theta)$ states that $\operatorname{perm}(\theta)$ equals $\operatorname{perm}_{\mathrm{B}}(\theta)$ plus certain correction terms (see [65] for a reformulation of the loop calculus in terms of NFGs). However, given the fact that for $N(\theta)$ these correction terms happen to be positive and negative, it is at present unclear if Theorem 49 can be proven with this technique.

• In the Allerton 2010 version of this paper we stated the inequality that appears in Theorem 49 as a theorem. However, while writing the present paper we realized that our "proof" had a flaw, which, so far, we have not been able to fix. Nevertheless, we still think that our proof strategy can work out and possibly give an alternative viewpoint of Schrijver's inequality that features prominently in [14], [15]. In that respect, we list below some special cases of matrices $\theta$ for which our proof strategy works, along with conjectures that, if true, would give an alternative proof of Theorem 49 in its full generality.

Conjecture 51

For any $M \in \mathbb{Z}_{>0}$ it holds that
\[ \Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M} \le \big( \operatorname{perm}(\theta) \big)^M. \]
Possibly also the following, stronger, statement is true: for any $M \in \mathbb{Z}_{>0}$ and any $\tilde P \in \tilde\Psi_M$ it holds that
\[ \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \le \big( \operatorname{perm}(\theta) \big)^M. \]
□

Theorem 49 would then follow from
\[ \operatorname{perm}_{\mathrm{B}}(\theta) \overset{(a)}{=} \limsup_{M \to \infty} \operatorname{perm}_{\mathrm{B},M}(\theta) \overset{(b)}{=} \limsup_{M \to \infty} \sqrt[M]{\Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M}} \overset{(c)}{\le} \limsup_{M \to \infty} \sqrt[M]{\operatorname{perm}(\theta)^M} = \limsup_{M \to \infty} \operatorname{perm}(\theta) \overset{(d)}{=} \operatorname{perm}(\theta), \]
where at step (a) we have used Theorem 39, where at step (b) we have used Definition 38, where at step (c) we have used the weaker part of Conjecture 51, and where step (d) follows from evaluating the (now trivial) limit $M \to \infty$.

We now list some special matrices $\theta$ for which Conjecture 51 is true.

• Conjecture 51 is true for $\theta = \mathbf{1}_{n \times n}$. (The proof is given in Appendix I.)

• Conjecture 51 is true for all matrices $\theta$ that were studied in Section VI.

Actually, the results in Section VI suggest the following, stronger version of Conjecture 51.

Conjecture 52

Fix some $M \in \mathbb{Z}_{>0}$ and consider the expressions
\[ \Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M} \quad \text{and} \quad \big( \operatorname{perm}(\theta) \big)^M \]
as polynomials in the indeterminates $\{ \theta_{i,j} \}_{i,j}$. We conjecture that the coefficient of every monomial of the first polynomial is upper bounded by the coefficient of the corresponding monomial of the second polynomial.

Possibly also the following, stronger, statement is true. Fix some $M \in \mathbb{Z}_{>0}$ and $\tilde P \in \tilde\Psi_M$, and consider the expressions
\[ \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \quad \text{and} \quad \big( \operatorname{perm}(\theta) \big)^M \]
as polynomials in the indeterminates $\{ \theta_{i,j} \}_{i,j}$. We conjecture that the coefficient of every monomial of the first polynomial is upper bounded by the coefficient of the corresponding monomial of the second polynomial. □

Let us conclude this subsection by noting that the inequalities $Z_{\mathrm{G}}(\tilde N) \le Z_{\mathrm{G}}(N)^M$, $M \in \mathbb{Z}_{>0}$, $\tilde N \in \tilde{\mathcal{N}}_M$, i.e., inequalities of the type that appear in Conjecture 51, have recently been used to prove $Z_{\mathrm{B}}(N) \le Z_{\mathrm{G}}(N)$ for graphical models $N$ appearing in other contexts. We refer the interested reader to [16, Example 34 and Lemma 35] and [66] for details.

B. Upper Bounds on the Permanent of the Matrix $\theta$

In this subsection we list conjectures and open problems w.r.t. upper bounds on $\operatorname{perm}(\theta)$ based on $\operatorname{perm}_{\mathrm{B}}(\theta)$.

Conjecture 53 (Gurvits [14], [15])

Let $\theta$ be an arbitrary non-negative matrix of size $n \times n$. For even $n$ it is conjectured that
\[ \frac{\operatorname{perm}(\theta)}{\operatorname{perm}_{\mathrm{B}}(\theta)} \le \sqrt{2}^{\,n}, \tag{11} \]
with a similar conjecture for odd $n$. Note that (11) holds with equality for the matrix $\theta = I_{(n/2) \times (n/2)} \otimes \mathbf{1}_{2 \times 2}$, i.e., the Kronecker product of an identity matrix of size $(n/2) \times (n/2)$ and the all-one matrix of size $2 \times 2$. □

We refer the interested reader to [14], [15] for a discussion of families of non-negative matrices for which the above conjecture has been verified.

Note that Conjecture 53 replaces the conjecture that we made in the Allerton 2010 version of this paper where, for fixed $n$, the largest ratio $\operatorname{perm}(\theta) / \operatorname{perm}_{\mathrm{B}}(\theta)$ was thought to be obtained for the all-one matrix of size $n \times n$.

Besides proving the bound in Conjecture 53, it would be desirable to prove statements of the form
\[ \Pr \left\{ \theta \in \Theta : \frac{\operatorname{perm}(\theta)}{\operatorname{perm}_{\mathrm{B}}(\theta)} \le \tau \right\} \ge 1 - \varepsilon, \]
where $\Theta$ is some ensemble of random matrices of size $n \times n$, where $\tau$ is some positive real number, and where $\varepsilon$ is some small positive number. For example, for the ensemble of $n \times n$ matrices where the matrix entries are chosen uniformly and independently between $0$ and $1$, we conjecture that $\operatorname{perm}(\theta) / \operatorname{perm}_{\mathrm{B}}(\theta)$ is, with high probability, upper bounded by the ratio that appears in Lemma 48. (Note that this ratio is much smaller than the ratio that appears in Conjecture 53.)

C. Closeness of the Permanent to the Bethe Permanent

In this subsection we list some cases where $\operatorname{perm}(\theta)$ is relatively close to $\operatorname{perm}_{\mathrm{B}}(\theta)$. We start with an auxiliary result that relates the Bethe permanent of a lifted matrix to the Bethe permanent of the base matrix.

Lemma 54

For any $M \in \mathbb{Z}_{>0}$ and any $\tilde P \in \tilde\Psi_M$ it holds that
\[ \operatorname{perm}_{\mathrm{B}}\big( \theta^{\uparrow \tilde P} \big) = \big( \operatorname{perm}_{\mathrm{B}}(\theta) \big)^M. \]

Proof: See Appendix J. ■

Theorem 55

For any $\alpha > 1$ and any $M \ge M_\alpha$, the majority of the matrices in $\{ \theta^{\uparrow \tilde P} \}_{\tilde P \in \tilde\Psi_M}$ satisfies
\[ 1 \le \frac{\operatorname{perm}\big( \theta^{\uparrow \tilde P} \big)}{\operatorname{perm}_{\mathrm{B}}\big( \theta^{\uparrow \tilde P} \big)} < \alpha^M. \]
Here $M_\alpha$ is a parameter that depends on $\alpha$.

Proof: The first inequality follows from Theorem 49. We prove the second inequality by contradiction. So, assume that there is an $\alpha > 1$ and a constant $M_\alpha$ such that for all $M \ge M_\alpha$ the set $\tilde\Psi'_M \subseteq \tilde\Psi_M$ of all lifted matrices $\theta^{\uparrow \tilde P}$ that satisfy $\operatorname{perm}(\theta^{\uparrow \tilde P}) \ge \alpha^M \cdot \operatorname{perm}_{\mathrm{B}}(\theta^{\uparrow \tilde P})$ has size at least $|\tilde\Psi_M| / 2$. Then
\begin{align*}
\operatorname{perm}_{\mathrm{B},M}(\theta)
&\overset{(a)}{=} \sqrt[M]{\Big\langle \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) \Big\rangle_{\tilde P \in \tilde\Psi_M}} \\
&\overset{(b)}{=} \sqrt[M]{\frac{1}{|\tilde\Psi_M|} \sum_{\tilde P \in \tilde\Psi_M} \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big)} \\
&\ge \sqrt[M]{\frac{1}{|\tilde\Psi_M|} \sum_{\tilde P \in \tilde\Psi'_M} \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big)} \\
&\overset{(c)}{\ge} \sqrt[M]{\frac{1}{|\tilde\Psi_M|} \sum_{\tilde P \in \tilde\Psi'_M} \alpha^M \cdot \operatorname{perm}_{\mathrm{B}}\big( \theta^{\uparrow \tilde P} \big)} \\
&\overset{(d)}{=} \sqrt[M]{\frac{1}{|\tilde\Psi_M|} \sum_{\tilde P \in \tilde\Psi'_M} \alpha^M \cdot \big( \operatorname{perm}_{\mathrm{B}}(\theta) \big)^M} \\
&= \sqrt[M]{\frac{|\tilde\Psi'_M|}{|\tilde\Psi_M|}} \cdot \alpha \cdot \operatorname{perm}_{\mathrm{B}}(\theta) \\
&\overset{(e)}{\ge} 2^{-1/M} \cdot \alpha \cdot \operatorname{perm}_{\mathrm{B}}(\theta),
\end{align*}
where at step (a) we have used Definition 38, where at step (b) we have replaced the angular brackets by the corresponding normalized sum, where at step (c) we have used the assumption, where at step (d) we have used Lemma 54, and where at step (e) we have again used the assumption. However, taking $\limsup_{M \to \infty}$ on both sides of the above expression, we see that we obtain a contradiction w.r.t. Theorem 39. ■

The following example partially corroborates Theorem 55.

Example 56

For some positive integer $M$, consider the matrix
\[ \theta^{\uparrow \tilde P} = \begin{pmatrix} \theta_{1,1} \tilde I & \theta_{1,2} \tilde I \\ \theta_{2,1} \tilde I & \theta_{2,2} \tilde P'_{2,2} \end{pmatrix}, \]
where $\tilde I$ is the identity matrix of size $M \times M$ and where $\tilde P'_{2,2}$ is a once cyclically left-shifted identity matrix of size $M \times M$. Then
\[ \operatorname{perm}\big( \theta^{\uparrow \tilde P} \big) = \theta_{1,1}^M \theta_{2,2}^M + \theta_{1,2}^M \theta_{2,1}^M, \qquad \operatorname{perm}_{\mathrm{B}}\big( \theta^{\uparrow \tilde P} \big) = \big( \operatorname{perm}_{\mathrm{B}}(\theta) \big)^M = \big( \max(\theta_{1,1}\theta_{2,2},\ \theta_{1,2}\theta_{2,1}) \big)^M, \]
where the first result is a consequence of the observation that the underlying graph has exactly one cycle, i.e., only two perfect matchings, and where the second result follows from Lemmas 40 and 54. Therefore,
\[ 1 \le \frac{\operatorname{perm}\big( \theta^{\uparrow \tilde P} \big)}{\operatorname{perm}_{\mathrm{B}}\big( \theta^{\uparrow \tilde P} \big)} \le 2. \]
Note that the right-hand side of the above expression does not only grow sub-exponentially in $M$, it does not grow at all. □

Let us conclude this subsection with the following remark. As already mentioned, the proof of Theorem 49 takes advantage of an inequality by Schrijver [62], and therefore the closeness of $\operatorname{perm}(\theta)$ to $\operatorname{perm}_{\mathrm{B}}(\theta)$ is linked with the tightness of Schrijver's inequality. Now, interestingly enough, when Schrijver demonstrates a certain asymptotic tightness of his inequality, see [62, Section 3], he implicitly evaluates and compares both sides of his inequality for some finite cover of a certain graph.

D. Open Problems on the Relationship between the Permanent and the Bethe Permanent

There are also classes of structured matrices for which it would be interesting to better understand the relationship between the permanent and the Bethe permanent. For example, the permanent of the matrix
\[ \theta = \begin{pmatrix} \alpha_1^{\mu_1} & \alpha_1^{\mu_2} & \cdots & \alpha_1^{\mu_m} & 1 & \cdots & 1 \\ \alpha_2^{\mu_1} & \alpha_2^{\mu_2} & \cdots & \alpha_2^{\mu_m} & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ \alpha_n^{\mu_1} & \alpha_n^{\mu_2} & \cdots & \alpha_n^{\mu_m} & 1 & \cdots & 1 \end{pmatrix}, \]
with $m \le n$, real numbers $\alpha_\ell > 0$, $\ell \in [n]$, and real numbers $\mu_\ell$, $\ell \in [m]$, turns up in a variety of contexts.

• When $\sum_{\ell \in [n]} \alpha_\ell = 1$ and the $\mu_\ell$ are non-negative integers, then $\operatorname{perm}(\theta)$ corresponds to the probability of the pattern of a sequence (see, e.g., [67], [68]).

• When $m = n$ and $\mu_\ell = n - 1 - \ell$, $\ell \in [n]$, then $\operatorname{perm}(\theta)$ appears in the analysis of list ordering algorithms (see, e.g., [69]) or in the analysis of source coding algorithms (see, e.g., [70]). Note that in this case, $\theta$ is a Vandermonde matrix.

Moreover, given the fact that the above $\theta$ depends only on (at most) $2n$ parameters (and not on $n^2$ parameters as $\theta$ in (1)), one wonders if speed-ups in the SPA-based computation of $\operatorname{perm}_{\mathrm{B}}(\theta)$ are possible.

In some applications one is not interested in the absolute value of the permanent, only the relative value in the sense that for two matrices $\theta$ and $\theta'$ one wants to know which one has the larger permanent. Therefore, for some suitable stochastic setting it would be desirable to state with what probability $\operatorname{perm}(\theta) \le \operatorname{perm}(\theta')$ is equivalent to $\operatorname{perm}_{\mathrm{B}}(\theta) \le \operatorname{perm}_{\mathrm{B}}(\theta')$. Some very encouraging initial investigations of this topic have been presented in [12, Section 4.2].

E. Connections to Results by Greenhill, Janson, and Ruciński

After the initial submission of the present paper, we becameaware of the paper by Greenhill, Janson, and Ruci´nski [35] oncounting perfect matchings in random graph covers. Using theﬁndings of [16] and the present paper, their results can, oncethey have been translated to factor graphs, be seen as deﬁningan NFG N ′ , N ′ ( θ ) with Z G ( N ′ ) = perm( θ ) and computing Z B ( N ′ ) , along with approximately computing Z B ,M ( N ′ ) . TheNFG N ′ is in general different from N , N ( θ ) , where the latterNFG was speciﬁed in Deﬁnition 4 and shown in Figure 1.The advantage of N ′ is that minimizing its Bethe free energyfunction towards determining Z B ( N ′ ) is quite straightforward.Moreover, high-order approximations to Z B ,M ( N ′ ) can begiven. The disadvantage of N ′ is that Z B ( N ′ ) is a weakerlower bound to perm( θ ) than perm B ( θ ) = Z B ( N ) .Let us elaborate on these comments. Namely, consider amatrix like θ , (cid:18) (cid:19) , (12)where all entries are non-negative integers and where all rowand all column sums are equal to some constant d . Here, n = 2 , d = 4 , and perm( θ ) = 10 . Its NFG N , N ( θ ) asspeciﬁed in Deﬁnition 4 is shown in Figure 8 (a). In terms offactor graphs, the paper [35] considers the NFG N ′ , N ′ ( θ ) shown in Figure 8 (b): like N it has n function nodes on theleft-hand side and n function nodes on the right-hand side.However, for every ( i, j ) ∈ I × J , there are d · θ i,j edgesconnecting function node i on the left-hand side to functionnode j on the right-hand side. The variable associated withan edge of N ′ takes on values in the set { , } . Moreover,a local function takes on the value if exactly one of thevariables associated with the incident edges is , and takes onthe value otherwise. One can show that these deﬁnitionsyield Z ( N ′ ) = perm( θ ) . 
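As a sanity check of the claim $Z(\mathsf{N}') = \mathrm{perm}(\theta)$, the following brute-force sketch enumerates all $\{0,1\}$ assignments to the edges and counts those in which every function node sees exactly one incident edge equal to 1. It assumes that left node $i$ and right node $j$ are joined by $\theta_{i,j}$ parallel edges, which is the multiplicity that makes the number of perfect matchings equal to $\mathrm{perm}(\theta)$. The matrix used is a hypothetical instance, chosen only because it has $n = 2$, $d = 4$, and permanent 10; the function names are illustrative.

```python
from itertools import permutations, product

def permanent(theta):
    """Permanent of a small square matrix by summing over all permutations."""
    n = len(theta)
    total = 0
    for sigma in permutations(range(n)):
        term = 1
        for i in range(n):
            term *= theta[i][sigma[i]]
        total += term
    return total

def count_valid_configurations(theta):
    """Count {0,1} edge assignments of the bipartite multigraph with
    theta[i][j] parallel edges between left node i and right node j such
    that every node has exactly one incident edge set to 1."""
    n = len(theta)
    edges = [(i, j) for i in range(n) for j in range(n)
             for _ in range(theta[i][j])]
    count = 0
    for assignment in product((0, 1), repeat=len(edges)):
        left = [0] * n
        right = [0] * n
        for (i, j), bit in zip(edges, assignment):
            left[i] += bit
            right[j] += bit
        if all(v == 1 for v in left) and all(v == 1 for v in right):
            count += 1
    return count

# Hypothetical example matrix: row and column sums d = 4, permanent 10.
theta = [[3, 1], [1, 3]]
print(permanent(theta), count_valid_configurations(theta))
```

The enumeration is exponential in the number of edges, so it is only meant for very small instances.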
Indeed, this result follows from observing that valid configurations of $\mathsf{N}'$ correspond to perfect matchings of the graph underlying $\mathsf{N}'$, that the global function value of every valid configuration of $\mathsf{N}'$ is $1$, and that the graph underlying $\mathsf{N}'$ has $\mathrm{perm}(\theta)$ perfect matchings.

Note that in the case of $\mathsf{N}$, the graph structure is independent of $\theta$ but the local function values depend on $\theta$, whereas in the case of $\mathsf{N}'$, the graph structure depends on $\theta$ but the local function node values are independent of $\theta$.

The Bethe free energy function of $\mathsf{N}'$ is minimized by $(\beta'_{e,0}, \beta'_{e,1}) = (1 - 1/d, \, 1/d)$, $e \in \mathcal{E}(\mathsf{N}')$, with corresponding beliefs for the function nodes. (This can, e.g., be verified with the help of symmetry arguments, along with suitably generalizing the convexity results of Corollary 23 from $\mathsf{N}$ to $\mathsf{N}'$.) With this, after a few manipulations,
\[
  Z_{\mathrm{B}}(\mathsf{N}') = \left( \frac{(d-1)^{d-1}}{d^{d-2}} \right)^{n}. \tag{13}
\]

[Fig. 8. NFGs used in Section VII-E. (a) NFG $\mathsf{N} \triangleq \mathsf{N}(\theta)$. (b) NFG $\mathsf{N}' \triangleq \mathsf{N}'(\theta)$.]

Interestingly, the expression on the right-hand side of (13) appears also in Corollary 1a in [62]. (One of the main results of Schrijver's paper [62] is to show that this expression is a lower bound on $\mathrm{perm}(\theta)$.)

Clearly, the advantage of $\mathsf{N}'$ is that we can explicitly compute $Z_{\mathrm{B}}(\mathsf{N}')$. However, $Z_{\mathrm{B}}(\mathsf{N}')$ is a weaker lower bound on $\mathrm{perm}(\theta)$ than $\mathrm{perm}_{\mathrm{B}}(\theta) = Z_{\mathrm{B}}(\mathsf{N})$. (For example, for the matrix $\theta$ in (12) we obtain $\mathrm{perm}(\theta) = 10 > \mathrm{perm}_{\mathrm{B}}(\theta) = Z_{\mathrm{B}}(\mathsf{N}) = 9 > Z_{\mathrm{B}}(\mathsf{N}') = 729/256 = 2.847\ldots$.) This is not totally surprising given the fact that the right-hand side of (13) depends only on $\theta$ inasmuch as $\theta$ determines $n$ and $d$. Indeed, observing that $\frac{1}{d} \cdot \theta$ is a doubly stochastic matrix, we get
\begin{align*}
  \log\bigl( Z_{\mathrm{B}}(\mathsf{N}) \bigr)
    &\overset{(a)}{\geq}
       - F_{\mathrm{B},\mathsf{N}}(\gamma) \Big|_{\gamma = \frac{1}{d} \cdot \theta} \\
    &\overset{(b)}{=}
       - U_{\mathrm{B},\mathsf{N}}(\gamma) + H_{\mathrm{B},\mathsf{N}}(\gamma)
         \Big|_{\gamma = \frac{1}{d} \cdot \theta} \\
    &\overset{(c)}{=}
       \sum_{i,j} \frac{\theta_{i,j}}{d} \log(\theta_{i,j})
       - \sum_{i,j} \frac{\theta_{i,j}}{d} \log\left( \frac{\theta_{i,j}}{d} \right)
       + \sum_{i,j} \left( 1 - \frac{\theta_{i,j}}{d} \right)
                    \log\left( 1 - \frac{\theta_{i,j}}{d} \right) \\
    &= n \log(d)
       + \sum_{i,j} \left( 1 - \frac{\theta_{i,j}}{d} \right)
                    \log\left( 1 - \frac{\theta_{i,j}}{d} \right) \\
    &\overset{(d)}{=}
       n \log(d)
       + \sum_{i} \left[ \sum_{j=1}^{n} u\left( \frac{\theta_{i,j}}{d} \right)
                         + \sum_{j=n+1}^{\max(n,d)} u(0) \right] \\
    &\overset{(e)}{\geq}
       n \log(d)
       + \sum_{i} \left[ \sum_{j=1}^{d} u\left( \frac{1}{d} \right)
                         + \sum_{j=d+1}^{\max(n,d)} u(0) \right] \\
    &= n (d-1) \log(d-1) - n (d-2) \log(d) \\
    &\overset{(f)}{=}
       \log\bigl( Z_{\mathrm{B}}(\mathsf{N}') \bigr),
\end{align*}
where at step (a) we have used Definition 11, where at steps (b) and (c) we have used Lemma 14, where at step (d) we have used the function $u : [0,1] \to \mathbb{R}$, $\xi \mapsto (1-\xi) \log(1-\xi)$, where at step (e) we have used Karamata's inequality [71] (note that $u$ is convex and that, after sorting, $(\theta_{i,1}/d, \ldots, \theta_{i,n}/d, 0, \ldots, 0)$ majorizes $(1/d, \ldots, 1/d, 0, \ldots, 0)$), and where at step (f) we have used (13). (See also [15, Section 3] for inequalities similar to those in the above display equation.)

Interestingly enough, as shown by the authors of [35], for any $M \in \mathbb{Z}_{>0}$ one can give a high-order approximation of $\bigl\langle Z_{\mathrm{G}}(\tilde{\mathsf{N}}') \bigr\rangle_{\tilde{\mathsf{N}}' \in \tilde{\mathcal{N}}'_M}$, and therefore of the degree-$M$ Bethe partition function [16]
\[
  Z_{\mathrm{B},M}(\mathsf{N}')
    = \Bigl( \bigl\langle Z_{\mathrm{G}}(\tilde{\mathsf{N}}') \bigr\rangle_{\tilde{\mathsf{N}}' \in \tilde{\mathcal{N}}'_M} \Bigr)^{1/M}.
\]
For the corresponding expressions we refer the interested reader to [35].

Near the beginning of this subsection we assumed that $\theta$ is a non-negative integral matrix where all row and all column sums are equal to some constant $d$. This is less restrictive than it appears. Namely, Sinkhorn's theorem states that any positive $n \times n$ matrix $\theta$ can be written as $\theta = D_1 \cdot \theta' \cdot D_2$, where $\theta'$ is doubly stochastic and where $D_1$ and $D_2$ are diagonal matrices with strictly positive diagonal elements (see, e.g., [72], which also presents some generalizations of this statement). If there is a positive integer $d$ such that $d \cdot \theta'$ has only integral entries, then we can write $\theta = \frac{1}{d} \cdot D_1 \cdot (d \cdot \theta') \cdot D_2$. (If there is no such $d$, then $d$ can be chosen large enough so that $d \cdot \theta'$ is as close to an integral matrix as desired.)
With this,
\[
  \mathrm{perm}(\theta)
    = d^{-n}
      \cdot \Bigl( \prod_{i \in [n]} (D_1)_{i,i} \Bigr)
      \cdot \mathrm{perm}(d \cdot \theta')
      \cdot \Bigl( \prod_{i \in [n]} (D_2)_{i,i} \Bigr),
\]
and we have reduced the problem of (approximately) computing the permanent of $\theta$ to (approximately) computing the permanent of $d \cdot \theta'$, a non-negative integral matrix where all row and all column sums are equal to some constant $d$. The complexity of (approximately) computing the decomposition $\theta = D_1 \cdot \theta' \cdot D_2$ is discussed in [10].

VIII. THE FRACTIONAL BETHE PERMANENT

The terms that appear in $H_{\mathrm{B}}(\beta)$ in Definition 10 all have either coefficient $+1$ or $-1$, with obvious implications for the coefficients of the terms of $H_{\mathrm{B}}(\gamma)$ in Lemma 14. The main idea behind the fractional Bethe entropy function is to allow these coefficients to take on other values as well. The goal is to obtain a modified Bethe free energy function whose minimum resembles the minimum of the Gibbs free energy function even more closely. Such generalizations of the Bethe entropy function were for example considered in [73]–[78], and a combinatorial characterization of the fractional Bethe entropy function was discussed in [56]. In particular, for the permanent estimation problem such generalizations are extensively studied in the very recent paper by A. B. Yedidia and Chertkov [22], to which we refer for additional discussion on this topic.

As we will see in this section, if the modifications to the Bethe entropy function are applied within some suitable limits, the concavity of the modified Bethe entropy function (and therefore the convexity of the modified Bethe free energy function) will be maintained.

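Definition 57. Let $\kappa \triangleq \bigl\{ \{\kappa_i\}_{i \in \mathcal{I}}, \{\kappa_j\}_{j \in \mathcal{J}}, \{\kappa_{i,j}\}_{(i,j) \in \mathcal{I} \times \mathcal{J}} \bigr\}$ be a collection of real values. We define the $\kappa$-fractional Bethe entropy function to be
\[
  H_{\mathrm{B}}^{(\kappa)} : \Gamma_{n \times n} \to \mathbb{R}, \quad
  \gamma \mapsto
    \sum_{i} \kappa_i \cdot H_{\mathrm{B},i}(\gamma_i)
    + \sum_{j} \kappa_j \cdot H_{\mathrm{B},j}(\gamma_j)
    - \sum_{i,j} \kappa_{i,j} \cdot H_{\mathrm{B},(i,j)}(\gamma_{i,j}).
\]
(Clearly, if all values in $\kappa$ equal $1$ then $H_{\mathrm{B}}^{(\kappa)}(\gamma) = H_{\mathrm{B}}(\gamma)$, with $H_{\mathrm{B}}(\gamma)$ as shown in Lemma 14.) □

(Footnote: One might also modify $U_{\mathrm{B}}(\gamma)$; however, we do not pursue this option here.)

Lemma 58. The fractional Bethe entropy function from Definition 57 can also be expressed as follows:
\[
  H_{\mathrm{B}}^{(\kappa)}(\gamma)
    = - \sum_{i,j} (\kappa_i + \kappa_j - \kappa_{i,j}) \cdot \gamma_{i,j} \log(\gamma_{i,j})
      + \sum_{i,j} \kappa_{i,j} \cdot (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}).
\]
(If all values in $\kappa$ equal $1$ then $H_{\mathrm{B}}^{(\kappa)}(\gamma) = H_{\mathrm{B}}(\gamma)$, with $H_{\mathrm{B}}(\gamma)$ as shown in Corollary 15.)

Proof: Follows from combining Definition 57 and Lemma 14. ∎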
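The equivalence of the two expressions for $H_{\mathrm{B}}^{(\kappa)}$ can be checked numerically. The sketch below assumes the usual forms of the local entropy terms, namely $H_{\mathrm{B},i}(\gamma_i) = -\sum_j \gamma_{i,j} \log(\gamma_{i,j})$ for a row, the analogous column sum for $H_{\mathrm{B},j}$, and the binary entropy of $\gamma_{i,j}$ for $H_{\mathrm{B},(i,j)}$; these forms are consistent with the closed form above, but they are not restated in this section. All function names and the test values of $\kappa$ are illustrative.

```python
import math

def xlogx(x):
    """x * log(x) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)

def node_entropy(vec):
    """Assumed form of H_{B,i} and H_{B,j}: -sum_k v_k log(v_k)."""
    return -sum(xlogx(v) for v in vec)

def edge_entropy(g):
    """Assumed form of H_{B,(i,j)}: binary entropy of gamma_{i,j}."""
    return -xlogx(g) - xlogx(1.0 - g)

def fractional_entropy_definition(gamma, k_row, k_col, k_edge):
    """Weighted sum of local entropies, following the definition."""
    n = len(gamma)
    rows = sum(k_row[i] * node_entropy(gamma[i]) for i in range(n))
    cols = sum(k_col[j] * node_entropy([gamma[i][j] for i in range(n)])
               for j in range(n))
    edges = sum(k_edge[i][j] * edge_entropy(gamma[i][j])
                for i in range(n) for j in range(n))
    return rows + cols - edges

def fractional_entropy_closed_form(gamma, k_row, k_col, k_edge):
    """Closed form in terms of gamma_{i,j} only, following the lemma."""
    n = len(gamma)
    total = 0.0
    for i in range(n):
        for j in range(n):
            g = gamma[i][j]
            total -= (k_row[i] + k_col[j] - k_edge[i][j]) * xlogx(g)
            total += k_edge[i][j] * xlogx(1.0 - g)
    return total

# A doubly stochastic 3 x 3 test matrix (convex combination of cyclic shifts).
a, b, c = 0.5, 0.3, 0.2
gamma = [[a, b, c], [c, a, b], [b, c, a]]
k_row = [1.0, 0.7, 1.3]
k_col = [0.9, 1.1, 1.0]
k_edge = [[0.8, 0.5, 0.9], [0.6, 0.7, 0.4], [1.0, 0.3, 0.2]]
print(fractional_entropy_definition(gamma, k_row, k_col, k_edge),
      fractional_entropy_closed_form(gamma, k_row, k_col, k_edge))
```

The two printed values agree up to floating-point rounding; the identity is entry-wise and does not actually require $\gamma$ to be doubly stochastic.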

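The following definition generalizes Definitions 11 and 12 and Corollary 15.

Definition 59. We define the $\kappa$-fractional Bethe free energy function to be
\[
  F_{\mathrm{B}}^{(\kappa)} : \Gamma_{n \times n} \to \mathbb{R}, \quad
  \gamma \mapsto U_{\mathrm{B}}(\gamma) - H_{\mathrm{B}}^{(\kappa)}(\gamma),
\]
and the $\kappa$-fractional Bethe permanent to be
\[
  \mathrm{perm}_{\mathrm{B}}^{(\kappa)}(\theta)
    \triangleq \exp\Bigl( - \min_{\beta \in \mathcal{B}} F_{\mathrm{B}}^{(\kappa)}(\beta) \Bigr).
\]
□

The following theorem gives a sufficient condition on $\kappa$ so that the $\kappa$-fractional Bethe entropy function is concave in $\gamma$, thereby generalizing Theorem 22.

Theorem 60. If $\kappa$ is such that
\[
  \kappa_i \geq 0 \ \ (i \in \mathcal{I}), \qquad
  \kappa_j \geq 0 \ \ (j \in \mathcal{J}), \qquad
  \frac{\kappa_i + \kappa_j}{2} \geq \kappa_{i,j} \ \ ((i,j) \in \mathcal{I} \times \mathcal{J}),
\]
then $H_{\mathrm{B}}^{(\kappa)}(\gamma)$ is a concave function of $\gamma$ and $F_{\mathrm{B}}^{(\kappa)}(\gamma)$ is a convex function of $\gamma$.

Proof: We have
\begin{align*}
  H_{\mathrm{B}}^{(\kappa)}(\gamma)
    &\overset{(a)}{=}
       - \sum_{i,j} \left( \frac{\kappa_i + \kappa_j}{2} + \frac{\kappa_i + \kappa_j}{2} - \kappa_{i,j} \right)
           \cdot \gamma_{i,j} \log(\gamma_{i,j}) \\
    &\qquad
       + \sum_{i,j} \left( \frac{\kappa_i + \kappa_j}{2} - \frac{\kappa_i + \kappa_j}{2} + \kappa_{i,j} \right)
           \cdot (1 - \gamma_{i,j}) \log(1 - \gamma_{i,j}) \\
    &\overset{(b)}{=}
       \sum_{i} \frac{\kappa_i}{2} \cdot S(\gamma_i)
       + \sum_{j} \frac{\kappa_j}{2} \cdot S(\gamma_j)
       + \sum_{i,j} \left( \frac{\kappa_i + \kappa_j}{2} - \kappa_{i,j} \right) \cdot h(\gamma_{i,j}),
\end{align*}
where at step (a) we have used Lemma 58, and where at step (b) we have used the $S$-function as specified in Definition 19 and have introduced the binary entropy function $h : [0,1] \to \mathbb{R}$, $\xi \mapsto -\xi \log(\xi) - (1-\xi) \log(1-\xi)$. If $\kappa_i \geq 0$, $\kappa_j \geq 0$, and $\frac{\kappa_i + \kappa_j}{2} - \kappa_{i,j} \geq 0$ (the latter being equivalent to $\frac{\kappa_i + \kappa_j}{2} \geq \kappa_{i,j}$), then the concavity of $H_{\mathrm{B}}^{(\kappa)}(\gamma)$ in $\gamma$ follows from Theorem 20, the well-known concavity of the binary entropy function, and the fact that the sum of concave functions is a concave function.

The convexity of $F_{\mathrm{B}}^{(\kappa)}(\gamma)$ in $\gamma$ follows from the concavity of $H_{\mathrm{B}}^{(\kappa)}(\gamma)$ in $\gamma$ and the linearity of $U_{\mathrm{B}}(\gamma)$ in $\gamma$. ∎

Lemma 61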

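An interesting choice for $\kappa$ is
\[
  \kappa_i = 1 \ \ (i \in \mathcal{I}), \qquad
  \kappa_j = 1 \ \ (j \in \mathcal{J}), \qquad
  \kappa_{i,j} = 1 - \frac{1}{2n} \ \ ((i,j) \in \mathcal{I} \times \mathcal{J}).
\]
The resulting $H_{\mathrm{B}}^{(\kappa)}(\gamma)$ is a concave function of $\gamma$ and the resulting $F_{\mathrm{B}}^{(\kappa)}(\gamma)$ is a convex function of $\gamma$. Moreover, letting $\mathbf{1}_{n \times n}$ be the all-one matrix of size $n \times n$, we obtain
\[
  \frac{\mathrm{perm}(\mathbf{1}_{n \times n})}
       {\mathrm{perm}_{\mathrm{B}}^{(\kappa)}(\mathbf{1}_{n \times n})}
    = \frac{\sqrt{2\pi}}{e} \cdot \bigl( 1 + o(1) \bigr)
    = 0.922\ldots \cdot \bigl( 1 + o(1) \bigr),
\]
where $o(1)$ is w.r.t. $n$. (Note that, in contrast to Lemma 48, there is no $\sqrt{n}$-factor on the right-hand side of the above expression.)

Proof: See Appendix K. ∎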
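The ratio for the all-one matrix can be evaluated numerically with a few lines of code. The sketch below assumes $\kappa_i = \kappa_j = 1$ and $\kappa_{i,j} = 1 - 1/(2n)$, which is the value consistent with a ratio of exactly 1 at $n = 2$ and with the absence of a $\sqrt{n}$-factor. It also uses the fact that, for the all-one matrix, $U_{\mathrm{B}}$ vanishes and the convex, permutation-symmetric function $F_{\mathrm{B}}^{(\kappa)}$ is minimized at the uniform matrix $\gamma_{i,j} = 1/n$. The function names are illustrative.

```python
import math

def log_fractional_bethe_perm_all_ones(n):
    """log of the fractional Bethe permanent of the all-one n x n matrix,
    for kappa_i = kappa_j = 1 and kappa_{i,j} = 1 - 1/(2n) (assumed value),
    evaluated at the uniform doubly stochastic matrix gamma_{i,j} = 1/n."""
    k_edge = 1.0 - 1.0 / (2.0 * n)
    g = 1.0 / n
    # Closed form of the fractional Bethe entropy, one entry at a time:
    # -(kappa_i + kappa_j - kappa_ij) g log g + kappa_ij (1 - g) log(1 - g).
    per_entry = -(2.0 - k_edge) * g * math.log(g)
    if n > 1:
        per_entry += k_edge * (1.0 - g) * math.log(1.0 - g)
    return n * n * per_entry

def ratio(n):
    """perm(1_{n x n}) / perm_B^(kappa)(1_{n x n}), with log(n!) = lgamma(n + 1)."""
    return math.exp(math.lgamma(n + 1) - log_fractional_bethe_perm_all_ones(n))

limit = math.sqrt(2.0 * math.pi) / math.e
for n in (2, 5, 10, 100, 1000):
    print(n, ratio(n), limit)
```

Working with logarithms avoids overflow of $n!$ for large $n$.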

Let us make a few comments about the choice of κ inLemma 61. • Fig. 9 shows the exact ratios for n from to . Inparticular, note that for n = 2 we have perm( × )perm ( κ )B ( × ) = 1 . • For even integers n and for the choice of κ fromLemma 61, the matrix θ = I ( n/ × ( n/ ⊗ × yieldsthe ratio perm( θ )perm ( κ )B ( θ ) = 1 . This is in stark contrast toConjecture 53 where θ represents the conjectured “worst-case” matrix for the ratio perm( θ )perm B ( θ ) . n p e r m ( n × n ) / p e r m ( κ ) B ( n × n ) Fig. 9. Illustration of the ratio perm( n × n ) / perm ( κ )B ( n × n ) for thespecial choice of κ in Lemma 61, when n varies from to . • For integers n and k such that k divides n we have (0 . . . . ) n/k perm( θ )perm ( κ )B ( θ ) for the matrix θ , I ( n/k ) × ( n/k ) ⊗ k × k .Let us conclude this section on the fractional Bethe entropyfunction with a few comments. • The SPA message update equations in Section V needto be modiﬁed so that its ﬁxed points correspond tostationary points of the fractional Bethe free energy, i.e. ,so that a modiﬁed version of the theorem by Yedidia,Freeman, and Weiss [13] holds. In contrast to the SPAmessage update equations in Section V, the modiﬁedSPA message update equations will be such that theright-going messages depend not only on the previousleft-going messages but also on the previous right-goingmessages, and such that the left-going messages dependnot only on the previous right-going messages but also onthe previous left-going messages. (We omit the details.)Moreover, the convergence analysis in Section V has tobe revisited. • We leave it as an open problem to explore the κ parameterspace and to ﬁnd fractional Bethe permanents for whichinteresting statements can be made, in particular forwhich a statement like the one in Theorem 49 can bemade. IX. C OMMENTS AND C ONJECTURES

It is an interesting challenge to look at theorems involving permanents and to prove that the theorems still hold if the permanents in these theorems are replaced by Bethe permanents. Let us mention two conjectures along these lines that were listed in [43].

A. Perm-Pseudo-Codewords

The following conjecture is based on a theorem in [79] involving permanents of submatrices of a parity-check matrix.

Definition 62. Let $\mathcal{C}$ be a binary linear code described by a parity-check matrix $H \in \mathbb{F}_2^{m \times n}$, $m < n$. For a size-$(m+1)$ subset $\mathcal{S}$ of the column index set $\mathcal{I}(H)$ we define the Bethe perm-vector based on $\mathcal{S}$ to be the vector $\omega \in \mathbb{Z}^n$ with components
\[
  \omega_i \triangleq
    \begin{cases}
      \mathrm{perm}_{\mathrm{B}}\bigl( H_{\mathcal{S} \setminus i} \bigr) & \text{if } i \in \mathcal{S} \\
      0 & \text{otherwise},
    \end{cases}
\]
where $H_{\mathcal{S} \setminus i}$ is the submatrix of $H$ consisting of all the columns of $H$ whose index is in the set $\mathcal{S} \setminus \{i\}$. □

Conjecture 63. Let $\mathcal{C}$ be a binary linear code described by the parity-check matrix $H \in \mathbb{F}_2^{m \times n}$, $m < n$, let $\mathcal{K}(H)$ be the fundamental cone associated with $H$ [60], [61], and let $\mathcal{S}$ be a size-$(m+1)$ subset of $\mathcal{I}(H)$. The Bethe perm-vector $\omega$ based on $\mathcal{S}$ is a pseudo-codeword of $H$, i.e.,
\[
  \omega \in \mathcal{K}(H). \tag{14}
\]
□

A proof of this conjecture has recently been presented by Smarandache [80].

B. Permanent-Based Kernels

Based on a result by Cuturi [81], Huang and Jebara [12] made the following conjecture.

Conjecture 64 (Huang and Jebara [12]). Let $n$ be a positive integer and let $\mathcal{X}$ be a set endowed with a kernel $\kappa$. Let $X = \{x_1, \ldots, x_n\} \in \mathcal{X}^n$ and $Y = \{y_1, \ldots, y_n\} \in \mathcal{X}^n$. Then
\[
  \kappa_{\mathrm{perm}_{\mathrm{B}}} : (X, Y) \mapsto
    \mathrm{perm}_{\mathrm{B}}\Bigl( \bigl[ \kappa(x_i, y_j) \bigr]_{1 \leq i \leq n, \, 1 \leq j \leq n} \Bigr)
\]
is a positive definite kernel on $\mathcal{X}^n \times \mathcal{X}^n$. □

X. CONCLUSIONS

In this paper, we have pursued a graphical-model-based approach to approximating the permanent of a non-negative square matrix, the resulting approximation being called the Bethe permanent. We have seen that the associated functions, like the Bethe entropy function and the Bethe free energy function, are remarkably well behaved for a graphical model with a non-trivial cycle structure. In that respect, an important part is played by a theorem by Birkhoff and von Neumann (see Theorem 3). Moreover, the SPA can be used to efficiently find the minimum of the Bethe free energy function and thereby the Bethe permanent. We have also presented a graph-cover-based analysis that gives additional insights into the inner workings of the Bethe permanent, its strengths, and its weaknesses, and we have commented on Bethe-permanent-based upper and lower bounds on the permanent. Along the way we have stated several conjectures and open problems that, if answered one way or the other, could further elucidate the relationship between the permanent and the Bethe permanent.

ACKNOWLEDGMENTS

We gratefully acknowledge Farzad Parvaresh for pointingout to us the papers [32], [33], Krishna Viswanathan fordiscussions on the permanent of structured matrices, AdamYedidia and Misha Chertkov for sharing an early version oftheir paper [22], and Leonid Gurvits for general discussionsabout permanents. Moreover, we very much appreciate thehelpful comments that were made by the reviewers.A

PPENDIX AP ROOF OF T HEOREM S is established, it isstraightforward to verify the claim in the theorem statementthat S ( ξ ) > for all ξ ∈ Π [ n ] . Indeed, because Π [ n ] is apolytope with n vertices, because S takes on the value ateach of these vertices, and because S is concave, this statementis true.Therefore, let us focus on the concavity statement. Clearly,for n = 2 the statement can easily be veriﬁed and so the restof this appendix will only discuss the case n > .By deﬁnition, a multi-dimensional function is concave ifit is a concave function along any straight line in its domain.Towards showing that this is indeed the case for S , let us ﬁx anarbitrary point ξ ∈ Π [ n ] and an arbitrary direction ˆ ξ ∈ R n \{ } such that the function ξ ( τ ) , ξ + τ · ˆ ξ satisﬁes ξ ( τ ) ∈ Π [ n ] for a suitable τ -interval around (to be deﬁned later). Weneed to distinguish three different cases that will be discussedseparately in the following subsections:1) The point ξ is in the interior of Π [ n ] .2) The point ξ is at a vertex of Π [ n ] .3) The point ξ is neither in the interior nor at a vertexof Π [ n ] . A. The Point ξ is in the Interior of Π [ n ] It is straightforward to see that the direction vector ˆ ξ mustsatisfy X ℓ ˆ ξ ℓ = 0 , (15)otherwise ξ ( τ ) ∈ Π [ n ] holds only for τ = 0 . Therefore,we assume that (15) is satisﬁed. Moreover, because ξ ∈ interior(Π [ n ] ) , we have < ξ ℓ < , ℓ ∈ [ n ] , and we canﬁnd an ε > such that ξ ( τ ) ∈ Π [ n ] for − ε τ ε . Wewill now show that the function τ S (cid:0) ξ ( τ ) (cid:1) is concave at τ = 0 .We start by computing the ﬁrst-order derivative dd τ S (cid:0) ξ ( τ ) (cid:1) = − X ℓ dd ξ ℓ ( τ ) s (cid:0) ξ ℓ ( τ ) (cid:1) · ˆ ξ ℓ , and the second-order derivative d d τ S (cid:0) ξ ( τ ) (cid:1) = X ℓ d d ξ ℓ ( τ ) s (cid:0) ξ ℓ ( τ ) (cid:1) · ˆ ξ ℓ (a) = − X ℓ ˆ ξ ℓ ξ ℓ ( τ ) + X ℓ ˆ ξ ℓ − ξ ℓ ( τ ) , where at step (a) we have used Lemma 18. 
In particular, at τ = 0 we have d d τ S (cid:0) ξ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) τ =0 = X ℓ δ ℓ , where δ ℓ , ℓ ∈ [ n ] , is deﬁned as δ ℓ , − X ℓ ˆ ξ ℓ ξ ℓ + X ℓ ˆ ξ ℓ − ξ ℓ = − X ℓ ˆ ξ ℓ · − ξ ℓ ξ ℓ (1 − ξ ℓ ) . (16)The proof will be ﬁnished once we have shown that d d τ S (cid:0) ξ ( τ ) (cid:1) at τ = 0 , which is equivalent to the conditionthat X ℓ δ ℓ . (17)We show this by separately considering two cases, the ﬁrstcase being ξ ∈ interior(Π [ n ] ) ∩ [0 , / n , the second casebeing ξ ∈ interior(Π [ n ] ) \ [0 , / n .The ﬁrst case, ξ ∈ interior(Π [ n ] ) ∩ [0 , / n , is relativelystraightforward. Namely, for all ℓ ∈ [ n ] we have < ξ ℓ / ,which implies − ξ ℓ > , which in turn implies δ ℓ , andso (17) is satisﬁed.The second case, ξ ∈ interior(Π [ n ] ) \ [0 , / n , needssomewhat more work. We start by observing that there is aunique ℓ ∗ ∈ [ n ] such that ξ ℓ ∗ > / . (Note that there canonly be one such ℓ ∗ ∈ [ n ] because P ℓ ξ ℓ = 1 .) Consequently, − ξ ℓ ∗ < and − ξ ℓ > , ℓ = ℓ ∗ .In the following, it is sufﬁcient to consider only directions ˆ ξ that satisfy ˆ ξ ℓ ∗ > and ˆ ξ ℓ , ℓ = ℓ ∗ , or that satisfy ˆ ξ ℓ ∗ < and ˆ ξ ℓ > , ℓ = ℓ ∗ . This follows from contemplating (15)and (16) and from observing that for a given ξ and givendirectional magnitudes (cid:8) | ˆ ξ ℓ | (cid:9) ℓ = ℓ ∗ , the left-hand side of (17) ismaximized by a ˆ ξ that satisﬁes the conditions that we have justmentioned. From (15) it follows that such direction vectors ˆ ξ satisfy | ˆ ξ ℓ ∗ | = X ℓ = ℓ ∗ | ˆ ξ ℓ | . (18)Before continuing, let us introduce δ ′ , − ˆ ξ ℓ ∗ ξ ℓ ∗ + X ℓ = ℓ ∗ ˆ ξ ℓ − ξ ℓ ,δ ′′ , + ˆ ξ ℓ ∗ − ξ ℓ ∗ − X ℓ = ℓ ∗ ˆ ξ ℓ ξ ℓ . 
Note that P ℓ δ ℓ = δ ′ + δ ′′ , and so, if we can show that δ ′ and δ ′′ then we have veriﬁed the desired result (17).The fact δ ′ is a consequence of the equation X ℓ = ℓ ∗ ˆ ξ ℓ − ξ ℓ (a) ξ ℓ ∗ X ℓ = ℓ ∗ ˆ ξ ℓ (b) ξ ℓ ∗ ·  X ℓ = ℓ ∗ | ˆ ξ ℓ |  (c) = ˆ ξ ℓ ∗ ξ ℓ ∗ , where step (a) follows from ξ being in Π [ n ] , which implies that ξ ℓ ∗ = 1 − P ℓ ′ = ℓ ∗ ξ ℓ ′ , which in turn implies that ξ ℓ ∗ − ξ ℓ for In other words, such a ˆ ξ produces the “worst-case” left-hand side in (17):if we can show non-positivity for such direction vectors, we have implicitlyshown non-positivity for any other direction vector. all ℓ = ℓ ∗ . Moreover, step (b) follows from a simple inequalityand step (c) follows from (18).The fact δ ′′ is shown as follows. We start by observingthat (1 − ξ ℓ ∗ ) ·  X ℓ = ℓ ∗ ˆ ξ ℓ ξ ℓ  (a) =  X ℓ = ℓ ∗ ξ ℓ  ·  X ℓ = ℓ ∗ ˆ ξ ℓ ξ ℓ  =  X ℓ = ℓ ∗ p ξ ℓ  ·  X ℓ = ℓ ∗ | ˆ ξ ℓ |√ ξ ℓ !  (b) >  X ℓ = ℓ ∗ | ˆ ξ ℓ |  (c) = ˆ ξ ℓ ∗ , where step (a) follows from ξ being in Π [ n ] (which impliesthat ξ ℓ ∗ = 1 − P ℓ = ℓ ∗ ξ ℓ ), where at step (b) we use theCauchy-Schwarz inequality, and where at step (c) we use (18).Rearranging this inequality, we see that it is equivalent to theinequality δ ′′ ℓ . B. The Point ξ is at a Vertex of Π [ n ] Clearly, the direction vector ˆ ξ must satisfy (15). Moreover,because ξ is at a vertex of Π [ n ] , there is an ℓ ∗ ∈ [ n ] such that ξ ℓ ∗ = 1 and ξ ℓ = 0 , ℓ = ℓ ∗ , and such that ˆ ξ ℓ ∗ < and ˆ ξ ℓ > , ℓ = ℓ ∗ . Then we can ﬁnd an ε > such that ξ ( τ ) ∈ Π [ n ] for τ ε . We will now show that the function τ S (cid:0) ξ ( τ ) (cid:1) is concave at τ = 0 .We start by plugging in the deﬁnition of ξ ( τ ) into S (cid:0) ξ ( τ ) (cid:1) , i.e. 
, S (cid:0) ξ ( τ ) (cid:1) = − X ℓ ξ ℓ ( τ ) log (cid:0) ξ ℓ ( τ ) (cid:1) + X ℓ (cid:0) − ξ ℓ ( τ ) (cid:1) log (cid:0) − ξ ℓ ( τ ) (cid:1) = − (1 + τ ˆ ξ ℓ ∗ ) log(1 + τ ˆ ξ ℓ ∗ ) − X ℓ = ℓ ∗ ( τ ˆ ξ ℓ ) log( τ ˆ ξ ℓ )+ ( − τ ˆ ξ ℓ ∗ ) log( − τ ˆ ξ ℓ ∗ ) + X ℓ = ℓ ∗ (1 − τ ˆ ξ ℓ ) log(1 − τ ˆ ξ ℓ ) . From this we compute the ﬁrst-order derivative dd τ S (cid:0) ξ ( τ ) (cid:1) = − ˆ ξ ℓ ∗ log(1 + τ ˆ ξ ℓ ∗ ) − ˆ ξ ℓ ∗ − X ℓ = ℓ ∗ ˆ ξ ℓ log( τ ) − X ℓ = ℓ ∗ ˆ ξ ℓ log( ˆ ξ ℓ ) − X ℓ = ℓ ∗ ˆ ξ ℓ − ˆ ξ ℓ ∗ log( τ ) − ˆ ξ ℓ ∗ log( − ˆ ξ ℓ ∗ ) − ˆ ξ ℓ ∗ − X ℓ = ℓ ∗ ˆ ξ ℓ log(1 − τ ˆ ξ ℓ ) − X ℓ = ℓ ∗ ˆ ξ ℓ (a) = − ˆ ξ ℓ ∗ log(1 + τ ˆ ξ ℓ ∗ ) − X ℓ = ℓ ∗ ˆ ξ ℓ log( ˆ ξ ℓ ) − ˆ ξ ℓ ∗ log( − ˆ ξ ℓ ∗ ) − X ℓ = ℓ ∗ ˆ ξ ℓ log(1 − τ ˆ ξ ℓ ) , (19)where at step (a) we have used P ℓ ˆ ξ ℓ = 0 multiple times. Thesecond-order derivative is then d d τ S (cid:0) ξ ( τ ) (cid:1) = − ˆ ξ ℓ ∗ τ ˆ ξ ℓ ∗ + X ℓ = ℓ ∗ ˆ ξ ℓ − τ ˆ ξ ℓ . For τ ↓ we obtain d d τ S (cid:0) ξ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) τ ↓ = − ˆ ξ ℓ ∗ + X ℓ = ℓ ∗ ˆ ξ ℓ (a) = −  − X ℓ = ℓ ∗ ˆ ξ ℓ  + X ℓ = ℓ ∗ ˆ ξ ℓ (b) , where at step (a) we have used (15) and where step (b) followsfrom a simple inequality and the fact that ˆ ξ ℓ > for ℓ = ℓ ∗ .Therefore, the function τ S (cid:0) ξ ( τ ) (cid:1) is concave at τ = 0 . C. The Point ξ is Neither in the Interior nor at a Vertex of Π [ n ] The fact that ξ is neither in the interior nor at a vertex of Π [ n ] means that there is an ℓ ∗ ∈ [ n ] such that < ξ ℓ ∗ < .Clearly, the direction vector ˆ ξ must satisfy (15), plus someadditional constraints that are irrelevant for the discussionhere. Then we can ﬁnd an ε > such that ξ ( τ ) ∈ Π [ n ] for τ ε . The concavity of the function τ S (cid:0) ξ ( τ ) (cid:1) at τ = 0 follows then from the observation that, for small non-negative τ , the second-order derivative of S (cid:0) ξ ( τ ) (cid:1) w.r.t. 
τ isdominated by the second-order derivative of the expression − P ℓ : ξ ℓ =0 , ˆ ξ ℓ > ξ ℓ ( τ ) log (cid:0) ξ ℓ ( τ ) (cid:1) , a function that is concavein τ . A PPENDIX BP ROOF OF L EMMA S (cid:0) ξ ( τ ) (cid:1) and the ﬁrst-order derivative of S (cid:0) ξ ( τ ) (cid:1) w.r.t. τ at τ = 0 . Clearly, S (cid:0) ξ ( τ ) (cid:1) = 0 and so we can focus oncomputing the ﬁrst-order derivative.Fortunately, in Appendix A-B we have already computedthe ﬁrst-order derivative for exactly the same setup. Namely,from (19) we obtain dd τ S (cid:0) ξ ( τ ) (cid:1) = − ˆ ξ ℓ ∗ log(1 + τ ˆ ξ ℓ ∗ ) − X ℓ = ℓ ∗ ˆ ξ ℓ log( ˆ ξ ℓ ) − ˆ ξ ℓ ∗ log( − ˆ ξ ℓ ∗ ) − X ℓ = ℓ ∗ ˆ ξ ℓ log(1 − τ ˆ ξ ℓ ) . In the limit τ ↓ this simpliﬁes to dd τ S (cid:0) ξ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) τ ↓ = − X ℓ = ℓ ∗ ˆ ξ ℓ log( ˆ ξ ℓ ) + ( − ˆ ξ ℓ ∗ ) log( − ˆ ξ ℓ ∗ ) . (20)This can be rewritten as follows dd τ S (cid:0) ξ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) τ ↓ = | ˆ ξ ℓ ∗ | ·  − X ℓ = ℓ ∗ | ˆ ξ ℓ || ˆ ξ ℓ ∗ | log | ˆ ξ ℓ || ˆ ξ ℓ ∗ | ! , where we have used − ˆ ξ ℓ ∗ = | ˆ ξ ℓ ∗ | , ˆ ξ ℓ = | ˆ ξ ℓ | , ℓ = ℓ ∗ , and | ˆ ξ ℓ ∗ | = P ℓ = ℓ ∗ | ˆ ξ ℓ | , i.e. , P ℓ = ℓ ∗ | ˆ ξ ℓ | / | ˆ ξ ℓ ∗ | = 1 . This veriﬁesthe expressions for S (cid:0) ξ ( τ ) (cid:1) in the lemma statement.Finally, the non-negativity of the coefﬁcient of τ in (4)follows from | ˆ ξ ℓ ∗ | > | ˆ ξ ℓ | , ℓ = ℓ ∗ , which is a consequenceof the above-mentioned relation | ˆ ξ ℓ ∗ | = P ℓ = ℓ ∗ | ˆ ξ ℓ | . A PPENDIX CP ROOF OF L EMMA γ i,j = 1 if j = σ ( i ) and γ i,j = 0 otherwise.From the condition that ˆ γ is such that γ ( τ ) ∈ Γ n × n for smallnon-negative τ , it follows that P j ˆ γ i,j = 0 for all i ∈ I and P i ˆ γ i,j = 0 for all j ∈ J . Moreover, for every i ∈ I we have ˆ γ i,j if j = σ ( i ) and ˆ γ i,j > otherwise. 
Then H B (cid:0) γ ( τ ) (cid:1) (a) = 12 X i S (cid:0) γ i ( τ ) (cid:1) + 12 X j S (cid:0) γ j ( τ ) (cid:1) (b) = − τ X i X j = σ ( i ) ˆ γ i,j log(ˆ γ i,j ) + τ X i ( − ˆ γ i,σ ( i ) ) log( − ˆ γ i,σ ( i ) ) − τ X j X i =¯ σ ( j ) ˆ γ i,j log(ˆ γ i,j ) + τ X j ( − ˆ γ ¯ σ ( j ) ,j ) log( − ˆ γ ¯ σ ( j ) ,j )+ O ( τ ) , where step (a) follows from Lemma 21 and where at step (b)we have used S ( γ i ) = 0 , S ( γ j ) = 0 , and (20).We observe that in the above expression there are exactlytwo terms for every edge e = ( i, j ) ∈ I × J . Rewriting thesesummations such that all the main summations are over i ∈ I ,we obtain H B (cid:0) γ ( τ ) (cid:1) = − τ X i X j = σ ( i ) ˆ γ i,j log(ˆ γ i,j ) + τ X i ( − ˆ γ i,σ ( i ) ) log( − ˆ γ i,σ ( i ) )+ O ( τ ) (a) = τ X i | ˆ γ i,σ ( i ) | ·  − X j = σ ( i ) | ˆ γ i,j || ˆ γ i,σ ( i ) | log (cid:18) | ˆ γ i,j || ˆ γ i,σ ( i ) | (cid:19) + O ( τ ) , which is the ﬁrst display equation in the lemma state-ment. Here, at step (a) we have used − ˆ γ i,σ ( i ) = | ˆ γ i,σ ( i ) | , ˆ γ i,j = | ˆ γ i,j | , j = σ ( i ) , and | ˆ γ i,σ ( i ) | = P j = σ ( i ) | ˆ γ i,j | , i.e. , P j = σ ( i ) | ˆ γ i,j | / | ˆ γ i,σ ( i ) | = 1 .The non-negativity of the coefﬁcient of τ in the aboveexpression follows from | ˆ γ i,σ ( i ) | > | ˆ γ i,j | , j = σ ( i ) , whichis a consequence of the above-mentioned relation | ˆ γ i,σ ( i ) | = P j = σ ( i ) | ˆ γ i,j | .On the other hand, rewriting these summations such that allthe main summations are over j ∈ J , we obtain the seconddisplay equation in the lemma statement.A PPENDIX DP ROOF OF T HEOREM | ˆ γ i,σ ( i ) | = − ˆ γ i,σ ( i ) for all i ∈ I and that | ˆ γ i,j | = ˆ γ i,j forall i ∈ I , j ∈ J \ { σ ( i ) } (see also the proof of Lemma 25 in Appendix C). 
Then, U B (cid:0) γ ( τ ) (cid:1) (a) = − X i (1+ τ ˆ γ i,σ ( i ) ) log( θ i,σ ( i ) ) − X i X j = σ ( i ) ( τ ˆ γ i,j ) log( θ i,j ) (b) = − X i log( θ i,σ ( i ) ) − τ X i X j = σ ( i ) | ˆ γ i,j | log (cid:18) θ i,j θ i,σ ( i ) (cid:19) (c) = C − τ X i X j = σ ( i ) | ˆ γ i,j | log (cid:18) θ i,j θ i,σ ( i ) (cid:19) , where at step (a) we have used Corollary 15, where at step (b)we have used that P j ˆ γ i,j = 0 holds for every i ∈ I , i.e. , that − ˆ γ i,σ ( i ) = P j = σ ( i ) ˆ γ i,j = P j = σ ( i ) | ˆ γ i,j | holdsfor every i ∈ I , and where at step (c) we have deﬁned C , − P i log( θ i,σ ( i ) ) . (Note that there is no O ( τ ) termin the above expressions.) Then F B (cid:0) γ ( τ ) (cid:1) (a) = U B ( γ ) − H B ( γ ) (b) = C − τ X i X j = σ ( i ) | ˆ γ i,j | log (cid:18) θ i,j θ i,σ ( i ) (cid:19) − τ X i | ˆ γ i,σ ( i ) | ·  − X j = σ ( i ) | ˆ γ i,j || ˆ γ i,σ ( i ) | log (cid:18) | ˆ γ i,j || ˆ γ i,σ ( i ) | (cid:19) + O ( τ ) (c) = C − τ X i X i ′ = i | ˆ γ i,σ ( i ′ ) | log (cid:18) θ i,σ ( i ′ ) θ i,σ ( i ) (cid:19) − τ X i | ˆ γ i,σ ( i ) | ·  − X i ′ = i | ˆ γ i,σ ( i ′ ) || ˆ γ i,σ ( i ) | log (cid:18) | ˆ γ i,σ ( i ′ ) || ˆ γ i,σ ( i ) | (cid:19) + O ( τ ) (d) = C − τ X i X i ′ = i µ i · p i,i ′ | {z } = Q i,i ′ · (cid:2) − log( p i,i ′ ) + T i,i ′ (cid:3) + O ( τ ) , (21)where at step (a) we have used Corollary 15, where at step (b)we have inserted the above expression for U B ( γ ) and theexpression for H B ( γ ) from Lemma 25, where at step (c) wehave replaced the summations over j ∈ J , j = σ ( i ) , bysummations over i ′ ∈ I , σ ( i ′ ) = σ ( i ) , i.e. , by summationsover i ′ ∈ I , i ′ = i , and where at step (d) we have introducedthe deﬁnitions µ i , | ˆ γ i,σ ( i ) | (22) p i,i ′ , | ˆ γ i,σ ( i ′ ) || ˆ γ i,σ ( i ) | , (23) Q i,i ′ , µ i · p i,i ′ = | ˆ γ i,σ ( i ′ ) | , (24) T i,i ′ , log (cid:18) θ i,σ ( i ′ ) θ i,σ ( i ) (cid:19) , (25) Fig. 10. Trellis for the random walk described in Appendix D. 
(Here n = 5 .)Highlighted is an instance of a possible walk. for all ( i, i ′ ) ∈ I × I with i = i ′ . One can verify that theassumptions on ˆ γ imply that X i µ i = 1 , X i ′ = i p i,i ′ = 1 (for all i ∈ I ) , X i ′ = i Q i,i ′ = µ i (for all i ∈ I ) , X i = i ′ Q i,i ′ = µ i ′ (for all i ′ ∈ I ) , X i X i ′ = i Q i,i ′ = 1 . In order to obtain the theorem statement, we need tomaximize the coefﬁcient of ( − τ ) in (21). Before doing this,let us quickly discuss the meaning of this coefﬁcient.Namely, consider the trellis in Fig. 10 with state space I ( i.e. , with n states) and where a trellis section has a branchfrom state i ∈ I to state i ′ ∈ I if and only if i = i ′ . It isstraightforward to see that there is a bijection between, on theone hand, the set of all left-to-right walks in the time-invarianttrellis shown in Fig. 10, and, on the other hand, the set ofbacktrackless walks in N ( θ ) (see Fig. 1) that were mentionedafter Lemma 25. In particular, going from state i ∈ I to state i ′ ∈ I \{ i } in the trellis of Fig. 10 corresponds to the two half-steps of going from node i ∈ I to node σ ( i ′ ) ∈ J and thento node i ′ ∈ I in N ( θ ) . With this, translating (backtrackless)random walks to left-to-right random walks in the trellis inFig. 
10, we obtain that
• $\mu_i$ is the probability of being in state $i$,
• $p_{i,i'}$ is the probability of going to state $i' \neq i$, conditioned on being in state $i$,
• $Q_{i,i'}$ is the probability of being in state $i$ and then going to state $i' \neq i$,
• $- \sum_{i} \sum_{i' \neq i} \mu_i \, p_{i,i'} \log(p_{i,i'})$ is the entropy rate of (the Markov chain corresponding to) the random walk on this trellis,
• $T_{i,i'}$ is a branch metric,
• $\sum_{i} \sum_{i' \neq i} \mu_i \, p_{i,i'} \, T_{i,i'}$ is the average branch metric of the random walk on this trellis,
• and maximizing the coefficient of $(-\tau)$ in the above expression for $F_{\mathrm{B}}\bigl( \gamma(\tau) \bigr)$ (see (21)) means to find the (time-invariant) left-to-right random walk on this trellis that maximizes
\[
  \sum_{i} \sum_{i' \neq i} \mu_i \cdot p_{i,i'} \cdot \bigl[ - \log(p_{i,i'}) + T_{i,i'} \bigr],
\]
i.e., the sum of the entropy rate and the average branch metric of the random walk. (In statistical physics terms, this expression can be considered to be some negative free energy function.)

The purpose of rewriting the above expression in the way we did was to bring it very close to the notation used in [82, Lemma 44], which solved exactly the above maximization problem. (Note that related problems were also solved in [83] and [84].)

As was shown in [82, Lemma 44], the maximal value of
\[
  \sum_{i} \sum_{i' \neq i}
    \underbrace{\mu_i \cdot p_{i,i'}}_{= \, Q_{i,i'}}
    \cdot \bigl[ - \log(p_{i,i'}) + T_{i,i'} \bigr]
\]
is $\log(\rho)$ and is attained by
\[
  \mu^*_i = \kappa \cdot u^{\mathrm{L}}_i \cdot u^{\mathrm{R}}_i,
\]
\[
  p^*_{i,i'} =
    \begin{cases}
      \dfrac{u^{\mathrm{R}}_{i'}}{u^{\mathrm{R}}_i} \cdot \dfrac{A_{i,i'}}{\rho} & (\text{if } i \neq i') \\
      0 & (\text{otherwise}),
    \end{cases}
\]
\[
  Q^*_{i,i'} = \mu^*_i \cdot p^*_{i,i'} =
    \begin{cases}
      \kappa \cdot \dfrac{u^{\mathrm{L}}_i \cdot A_{i,i'} \cdot u^{\mathrm{R}}_{i'}}{\rho} & (\text{if } i \neq i') \\
      0 & (\text{otherwise}),
    \end{cases}
\]
where $A$, $\rho$, $u^{\mathrm{L}}$, and $u^{\mathrm{R}}$ are defined in the theorem statement, and where $\kappa$ is a normalization constant such that $\sum_i \mu^*_i = 1$.
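The maximization result quoted above can be illustrated numerically. The following sketch builds a matrix with entries $A_{i,i'} = \exp(T_{i,i'})$ for $i \neq i'$ and zero diagonal, computes its largest eigenvalue $\rho$ together with left and right eigenvectors by plain power iteration, forms $\mu^*$ and $p^*$ as above, and compares the objective with $\log(\rho)$. The branch metrics $T$ are arbitrary test values rather than values derived from a matrix $\theta$, the helper names are illustrative, and power iteration is only one possible way to obtain the eigen-data.

```python
import math

def perron(A, iters=5000):
    """Power iteration for a primitive non-negative matrix.
    Returns (rho, u) with A u = rho u and sum(u) = 1."""
    n = len(A)
    u = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        v = [sum(A[i][k] * u[k] for k in range(n)) for i in range(n)]
        rho = sum(v)              # valid because sum(u) == 1
        u = [x / rho for x in v]
    return rho, u

def optimal_walk(T):
    """Stationary distribution mu and transition matrix p of the
    maximizing random walk, built from the eigen-data of A."""
    n = len(T)
    A = [[math.exp(T[i][j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    rho, u_right = perron(A)
    A_transposed = [[A[j][i] for j in range(n)] for i in range(n)]
    _, u_left = perron(A_transposed)
    norm = sum(u_left[i] * u_right[i] for i in range(n))
    mu = [u_left[i] * u_right[i] / norm for i in range(n)]
    p = [[(u_right[j] / u_right[i]) * A[i][j] / rho for j in range(n)]
         for i in range(n)]
    return rho, mu, p

def objective(mu, p, T):
    """Entropy rate plus average branch metric of the walk (mu, p)."""
    n = len(T)
    return sum(mu[i] * p[i][j] * (-math.log(p[i][j]) + T[i][j])
               for i in range(n) for j in range(n) if i != j)

n = 4
T = [[0.5 * math.sin(1 + 3 * i + 7 * j) for j in range(n)] for i in range(n)]
rho, mu, p = optimal_walk(T)
print(objective(mu, p, T), math.log(rho))
```

For comparison, any other stationary walk without self-loops, such as the uniform one with $\mu_i = 1/n$ and $p_{i,i'} = 1/(n-1)$, yields an objective value that does not exceed $\log(\rho)$.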
Note that A , called the noisy adjacency matrix in [82,Lemma 44], is such that A i,i ′ = exp( T i,i ′ ) for i = i ′ andsuch that A i,i = 0 .Because A contains only non-negative entries, ρ is the so-called Perron eigenvector of A , and u L and u R are the so-called left and right, respectively, Perron eigenvectors of A ;one can show that these two vectors contain only non-negativeentries.Translating this result back using (22), (23), and (24), weobtain the result given in the theorem statement.A PPENDIX EP ROOF OF L EMMA g i , i ∈ I , at iteration t > . Following [38]–[40], we have for every i ∈ I , every j ∈ J , and every ¯ a i,j ∈A i,j , −→ µ ( t ) i,j (¯ a i,j ) , C i,j · X a i ∈A iai,j =¯ ai,j f i ( a i ) · Y j ′ = j ←− µ ( t − i,j ′ ( a i,j ′ ) , where C i,j is some suitable normalization constant. Conse-quently, the update of the likelihood ratio reads −→ Λ ( t ) i,j , −→ µ ( t ) i,j (0) −→ µ ( t ) i,j (1) = P a i ∈A iai,j =0 f i ( a i ) · Q j ′′ = j ←− µ ( t − i,j ( a i,j ′′ ) P a i ∈A iai,j =1 f i ( a i ) · Q j ′′ = j ←− µ ( t − i,j ′′ ( a i,j ′′ ) (a) = P j ′ = j p θ i,j ′ · ←− µ ( t − i,j ′ (1) · Q j ′′ = j,j ′ ←− µ ( t − i,j ′′ (0) p θ i,j · Q j ′′ = j ←− µ ( t − i,j ′′ (0) (b) = 1 p θ i,j · X j ′ = j p θ i,j ′ · (cid:16) ←− Λ ( t − i,j ′ (cid:17) − , where at step (a) we have used A i = { u j | j ∈ J } forsimplifying the numerator, and where at step (b) we haveused the deﬁnition of ←− Λ ( t − i,j ′ , j ′ = j . This yields the ﬁrstexpression in the lemma statement. The second expression isobtained analogously by considering the SPA message updaterule for function nodes g j , j ∈ J , at iteration t > .Now we turn our attention to computing the beliefs at thefunction nodes g i , i ∈ I , at iteration t > . Following [38]–[40], we have for every i ∈ I and every a i ∈ A i , β ( t ) i, a i = 1 C i · f i ( a i ) · Y j ←− µ ( t ) i,j ( a i,j ) , where C i is chosen such that P a i β ( t ) i, a i = 1 . 
In particular, for a i = u j , j ∈ J , we get β ( t ) i, a i = 1 C i · f i ( a i ) · Y j ′ ←− µ ( t ) i,j ′ ( a i,j ′ )= 1 C i · f i ( a i ) · Y j ′ ←− µ ( t ) i,j ′ (0)  · Y j ′ ←− µ ( t ) i,j ′ ( a i,j ′ ) ←− µ ( t ) i,j ′ (0)= 1 C i · p θ i,j · Y j ′ ←− µ ( t ) i,j ′ (0)  · ←− V ( t ) i,j . Because C i and the expression in the parentheses are inde-pendent of j , we have just veriﬁed the third expression inthe lemma statement. The fourth expression in the lemmastatement is obtained analogously by considering the beliefsat function nodes g j , j ∈ J , at iteration t > .A PPENDIX FP ROOF OF L EMMA i.e. , F B (cid:0) { β i } , { β j } , { β e } (cid:1) = X i U B ,i ( β i ) + X j U B ,j ( β j ) − X i H B ,i ( β i ) − X j H B ,j ( β j ) + X e H B ,e ( β e ) . (For the purposes of this appendix, the expression for F B inDeﬁnition 10 is somewhat more convenient than the one inLemma 14.) Now, introducing a Lagrange multiplier for the edge con-sistency constraints (but not for the other constraints imposedby the local marginal polytope B , see Deﬁnition 9), we obtainthe relevant Lagrangian L Bethe (cid:0) { β i } , { β j } , { β e } , {←− λ e } , {−→ λ e } (cid:1) = F B ( { β i } , { β j } , { β e } ) − X e =( i,j ) X a e ←− λ e,a e ·  X a i : a i,e = a e β i, a i − β e,a e  − X e =( i,j ) X a e −→ λ e,a e ·  X a j : a j,e = a e β j, a j − β e,a e  , Because F B is convex in { β i } i and { β j } j , but concave in { β e } e , the pseudo-dual function of F B is given by F (cid:0) {←− λ e } , {−→ λ e } (cid:1) = max { β e } min { β i } , { β j } L Bethe (cid:0) { β i } , { β j } , { β e } , {←− λ e } , {−→ λ e } (cid:1) , where the maximization/minimization is over all { β e } e , { β i } i , { β j } j that satisfy the constraints imposed by the localmarginal polytope B , except for the edge consistency con-straints. 
We obtain the maximizing { β e } e and the minimizing { β i } i , { β j } j by setting suitable partial derivatives to zero.This yields, β i, a i = 1 Z i · g i ( a i ) · Y e : i ( e )= i exp (cid:16) ←− λ e,a i,j ( e ) (cid:17) ,β j, a j = 1 Z j · g j ( a j ) · Y e : j ( e )= j exp (cid:16) −→ λ e,a i ( e ) ,j (cid:17) ,β e,a e = 1 Z e · exp (cid:16) ←− λ e,a e (cid:17) · exp (cid:16) −→ λ e,a e (cid:17) , where i ( e ) and j ( e ) give the label of the, respectively, leftand right vertex to which e is incident, and where { Z i } i , { Z j } j , and { Z e } e are suitable normalization constants suchthat relevant sums are equal to one.Now, plugging these beliefs into the Lagrangian, we obtain(after cancelling several terms) the expression F (cid:0) {←− λ e } , {−→ λ e } (cid:1) = − X i log( Z i ) − X j log( Z j ) + X e log( Z e )= − X i log X a i g i ( a i ) · Y e : i ( e )= i exp (cid:16) ←− λ e,a i,j ( e ) (cid:17) − X j log X a j g j ( a j ) · Y e : j ( e )= j exp (cid:16) −→ λ e,a i ( e ) ,j (cid:17) + X e log X a e exp (cid:16) ←− λ e,a e + −→ λ e,a e (cid:17)! . We proceed by using some details of the deﬁnition of N ( θ ) .Namely, using the deﬁnition of the local function nodes and taking advantage of the binary alphabet A e = { , } , e ∈ E ,we obtain (after some simpliﬁcations) F (cid:0) {←− λ e } , {−→ λ e } (cid:1) = − X i log X j p θ i,j · exp (cid:16) ←− λ ( i,j ) , − ←− λ ( i,j ) , (cid:17) − X j log X i p θ i,j · exp (cid:16) −→ λ ( i,j ) , − −→ λ ( i,j ) , (cid:17)! + X e log (cid:16) (cid:16)(cid:0) ←− λ e, − ←− λ e, (cid:1) + (cid:0) −→ λ e, − −→ λ e, (cid:1)(cid:17)(cid:17) From the results in [13] it follows that at a ﬁxed point ofthe SPA, the quantity ←− λ ( i,j ) , − ←− λ ( i,j ) , represents the log-likelihood ratio of the left-going message along the edge ( i, j ) ,and the quantity −→ λ ( i,j ) , −−→ λ ( i,j ) , represents the log-likelihoodratio of the right-going message along the edge ( i, j ) . 
Clearly,for every edge ( i, j ) ∈ I × J , these quantities are related tothe inverse likelihood ratios by ←− V i,j = exp (cid:16) ←− λ ( i,j ) , − ←− λ ( i,j ) , (cid:17) , −→ V i,j = exp (cid:16) −→ λ ( i,j ) , − −→ λ ( i,j ) , (cid:17) , respectively. Therefore, we get F (cid:0) {←− V i,j } , {−→ V i,j } (cid:1) = − X i log (cid:16)p θ i,j · ←− V i,j (cid:17) − X j log (cid:16)p θ i,j · −→ V i,j (cid:17) + X i,j log (cid:16) ←− V i,j · −→ V i,j (cid:17) , which is the expression in the lemma statement.Although the interpretation of the log-likelihood ratios wasgiven by looking at ﬁxed points of the SPA, it is not difﬁcultto see that we can evaluate this last expression for any set ofinverse likelihood ratios.A PPENDIX GP ROOF OF T HEOREM F B is achievedat a vertex of Γ n × n , whereas the second subsection considersthe case where the global minimum of F B is achieved in theinterior of Γ n × n .For ease of reference, we reproduce here the SPA messageupdate rules from Lemma 29, i.e. , −→ V ( t ) i,j = p θ i,j P j ′ = j p θ i,j ′ · ←− V ( t − i,j ′ , t > , ( i, j ) ∈ I × J , (26) ←− V ( t ) i,j = p θ i,j P i ′ = i p θ i ′ ,j · −→ V ( t ) i ′ ,j , t > , ( i, j ) ∈ I × J . (27)In both parts of this appendix, the main task will be to exhibita contraction operation of a suitably chosen subset of the SPAmessages. A. Global Minimum of F B is Achieved at a Vertex of Γ n × n Let γ ∈ C be the vertex of Γ n × n that uniquely minimizes F B . This means that γ corresponds to the permutation σ γ .(In the following, we will use the short-hands σ , σ γ and ¯ σ , σ − γ .)From (26) it follows that −→ Λ ( t ) i,σ ( i ) = 1 / −→ V ( t ) i,σ ( i ) , i ∈ I , can bewritten as −→ Λ ( t ) i,σ ( i ) = 1 p θ i,σ ( i ) · X j = σ ( i ) p θ i,j · ←− V ( t − i,j , t > , i ∈ I . On the other hand, for i ∈ I and j = σ ( i ) the SPA messageupdate equation in (27) implies ←− V ( t − i,j = p θ i,j P i ′ = i p θ i ′ ,j · −→ V ( t − i ′ ,j = p θ i,j p θ ¯ σ ( j ) ,j · −→ V ( t − σ ( j ) ,j ·

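$$\frac{1}{1 + \sum_{i' \neq i,\bar{\sigma}(j)} \dfrac{\sqrt{\theta_{i',j}} \cdot \overrightarrow{V}^{(t-1)}_{i',j}}{\sqrt{\theta_{\bar{\sigma}(j),j}} \cdot \overrightarrow{V}^{(t-1)}_{\bar{\sigma}(j),j}}} \ \leq\ \frac{\sqrt{\theta_{i,j}}}{\sqrt{\theta_{\bar{\sigma}(j),j}} \cdot \overrightarrow{V}^{(t-1)}_{\bar{\sigma}(j),j}} = \frac{\sqrt{\theta_{i,j}}}{\sqrt{\theta_{\bar{\sigma}(j),j}}} \cdot \overrightarrow{\Lambda}^{(t-1)}_{\bar{\sigma}(j),j}, \quad t \geq 2,\ i \in \mathcal{I},\ j \neq \sigma(i),$$
where the inequality follows from the fact that all terms in the summation $\sum_{i' \neq i,\bar{\sigma}(j)}$ are non-negative. Then, combining the two above expressions, we obtain
$$\overrightarrow{\Lambda}^{(t)}_{i,\sigma(i)} \leq \sum_{j \neq \sigma(i)} \frac{\theta_{i,j}}{\sqrt{\theta_{i,\sigma(i)}} \cdot \sqrt{\theta_{\bar{\sigma}(j),j}}} \cdot \overrightarrow{\Lambda}^{(t-1)}_{\bar{\sigma}(j),j}, \quad t \geq 2,\ i \in \mathcal{I}.$$
Rearranging terms, we obtain
$$\frac{\overrightarrow{\Lambda}^{(t)}_{i,\sigma(i)}}{\sqrt{\theta_{i,\sigma(i)}}} \leq \sum_{j \neq \sigma(i)} \frac{\theta_{i,j}}{\theta_{i,\sigma(i)}} \cdot \frac{\overrightarrow{\Lambda}^{(t-1)}_{\bar{\sigma}(j),j}}{\sqrt{\theta_{\bar{\sigma}(j),j}}} = \sum_{i' \neq i} \frac{\theta_{i,\sigma(i')}}{\theta_{i,\sigma(i)}} \cdot \frac{\overrightarrow{\Lambda}^{(t-1)}_{i',\sigma(i')}}{\sqrt{\theta_{i',\sigma(i')}}}, \quad t \geq 2,\ i \in \mathcal{I}.$$
Now, for every $t \geq 1$, consider the length-$n$ vector $\overrightarrow{\mathbf{m}}^{(t)}$ whose $i$th entry is $\overrightarrow{\Lambda}^{(t)}_{i,\sigma(i)} / \sqrt{\theta_{i,\sigma(i)}}$. Grouping several of the above inequalities together, we obtain the vector inequality
$$\overrightarrow{\mathbf{m}}^{(t)} \leq \mathbf{A} \cdot \overrightarrow{\mathbf{m}}^{(t-1)}, \quad t \geq 2, \qquad (28)$$
where the vector inequality has to be understood component-wise, and where the $n \times n$ matrix $\mathbf{A}$ was defined in Theorem 26 for the vertex $\boldsymbol{\gamma}$ of $\Gamma_{n \times n}$. Let $\rho$ be the maximal (real) eigenvalue of $\mathbf{A}$. Then, Corollary 27 and the assumption that $\boldsymbol{\gamma}$ is the unique minimizer of $F_{\mathrm{B}}$ allow us to conclude that $\rho < 1$. Because $\rho < 1$ implies that all eigenvalues of $\mathbf{A}$ have magnitude strictly smaller than $1$, the update equation in (28) represents a contraction, and so
$$\big\| \overrightarrow{\mathbf{m}}^{(t)} \big\| \xrightarrow{t \to \infty} 0.$$
Therefore,
$$\overrightarrow{\Lambda}^{(t)}_{i,\sigma(i)} \xrightarrow{t \to \infty} 0, \quad i \in \mathcal{I}.$$
A similar argument shows that
$$\overleftarrow{\Lambda}^{(t)}_{\bar{\sigma}(j),j} \xrightarrow{t \to \infty} 0, \quad j \in \mathcal{J}.$$

(Footnote: For simplicity, because $j$ does not appear on the left-hand side of the expression for $\overrightarrow{\Lambda}^{(t)}_{i,\sigma(i)}$ derived from (26), we use $j$ as a summation variable on the right-hand side. This is in contrast to (26), where $j$ appears on the left-hand side and where the summation variable on the right-hand side is $j'$.)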
Finally, from (26) and (27) and the above results it followsthat −→ V ( t ) i,j t →∞ −−−→ , i ∈ I , j ∈ J , j = σ ( i ) , ←− V ( t ) i,j t →∞ −−−→ , i ∈ I , j ∈ J , j = σ ( i ) . All these quantities converge to zero exponentially fast.When F B achieves its minimum in the interior of Γ n × n ,then we have equality between F B and F at stationarypoints of the SPA. However, we also have equality in thepresent case. Namely, evaluating F (see Lemma 31) forthe above messages, we obtain F (cid:0)(cid:8) ←− V ( t ) i,j (cid:9) , (cid:8) −→ V ( t ) i,j (cid:9)(cid:1) t →∞ −−−→ − X i log( θ i,σ ( i ) ) , which indeed equals F B ( γ ) . From ρ < and F B ( γ ) = − log (cid:0) perm B ( θ ) (cid:1) it also follows that (cid:12)(cid:12)(cid:12)(cid:12) exp (cid:18) − F (cid:16)(cid:8) ←− V ( t ) i,j (cid:9) , (cid:8) −→ V ( t ) i,j (cid:9)(cid:17)(cid:19) − perm B ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) C · e − ν · t for some suitable constants C, ν ∈ R > . B. Global Minimum of F B is Achieved in the Interior of Γ n × n In Corollary 23 we established that the Bethe free energyfunction of N ( θ ) is convex, i.e. , it does not have stationarypoints besides the global minimum. Therefore, using a theoremby Yedidia, Freeman, Weiss [13], we know that ﬁxed pointsof the SPA correspond to the global minimum of the Bethefree energy function.Let (cid:8) ←− V i,j (cid:9) i,j , (cid:8) −→ V i,j (cid:9) i.j be inverse likelihood ratios thatconstitute a ﬁxed point of the SPA update rules in (26)–(27).As such, these inverse likelihoods must satisfy −→ V i,j = p θ i,j P j ′ = j p θ i,j ′ · ←− V i,j ′ , (29) ←− V i,j = p θ i,j P i ′ = i p θ i ′ ,j · −→ V i ′ ,j , (30)for every ( i, j ) ∈ E . 
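The update rules (26)–(27) and the evaluation of the pseudo-dual function at the current messages can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, all entries of `theta` are assumed strictly positive, the messages are initialized to one (an arbitrary choice), and the pseudo-dual function is assumed to have the form $-\sum_i \log \sum_j \sqrt{\theta_{i,j}} \overleftarrow{V}_{i,j} - \sum_j \log \sum_i \sqrt{\theta_{i,j}} \overrightarrow{V}_{i,j} + \sum_{i,j} \log(1 + \overleftarrow{V}_{i,j} \overrightarrow{V}_{i,j})$, which is how we read the expression derived in Appendix F.

```python
import math
from itertools import permutations

def bethe_permanent_spa(theta, iters=500):
    """Run the SPA updates (26)-(27) and return exp(-F) at the final messages."""
    n = len(theta)
    s = [[math.sqrt(x) for x in row] for row in theta]   # sqrt(theta_{i,j})
    v_left = [[1.0] * n for _ in range(n)]               # left-going inverse likelihood ratios
    v_right = [[1.0] * n for _ in range(n)]              # right-going inverse likelihood ratios
    for _ in range(iters):
        # (26): right-going update from the previous left-going messages
        v_right = [[s[i][j] / sum(s[i][k] * v_left[i][k] for k in range(n) if k != j)
                    for j in range(n)] for i in range(n)]
        # (27): left-going update from the new right-going messages
        v_left = [[s[i][j] / sum(s[k][j] * v_right[k][j] for k in range(n) if k != i)
                   for j in range(n)] for i in range(n)]
    f = 0.0
    for i in range(n):
        f -= math.log(sum(s[i][j] * v_left[i][j] for j in range(n)))
    for j in range(n):
        f -= math.log(sum(s[i][j] * v_right[i][j] for i in range(n)))
    for i in range(n):
        for j in range(n):
            f += math.log(1.0 + v_left[i][j] * v_right[i][j])
    return math.exp(-f)

def permanent(theta):
    """Brute-force permanent, only sensible for very small matrices."""
    n = len(theta)
    return sum(math.prod(theta[i][sig[i]] for i in range(n))
               for sig in permutations(range(n)))

ones = [[1.0] * 3 for _ in range(3)]
print(bethe_permanent_spa(ones), permanent(ones))
```

For the all-one $3 \times 3$ matrix the fixed point can be worked out by hand ($\overrightarrow{V}_{i,j} = 1/2$, $\overleftarrow{V}_{i,j} = 1$), which gives $\exp(-F) = 64/27 \approx 2.370$, compared with a permanent of $3! = 6$.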
Note that these SPA ﬁxed point inverselikelihood ratios satisfy < −→ V i,j < ∞ and < ←− V i,j < ∞ ,otherwise the assumption that we are dealing with an interiorpoint of Γ n × n would be violated.It follows from the message gauge invariance mentioned inRemark 30 that, for any positive real number C , the inverselikelihoods (cid:8) C · ←− V i,j (cid:9) i,j , (cid:8) C · −→ V i,j (cid:9) i.j also constitute a ﬁxedpoint of the SPA update rules. We will use this fact later on.On the other hand, let (cid:8) ←− V ( t ) i,j (cid:9) i,j,t , (cid:8) −→ V ( t ) i,j (cid:9) i,j,t be a set ofinverse likelihoods obtained by running the SPA on N ( θ ) ac-cording to the SPA update rules in (26)–(27). In the following,we will not work with (cid:8) ←− V ( t ) i,j (cid:9) i,j,t , (cid:8) −→ V ( t ) i,j (cid:9) i,j,t directly, but with (cid:8) ←− ε ( t ) i,j (cid:9) i,j,t , (cid:8) −→ ε ( t ) i,j (cid:9) i,j,t , which are implicitly deﬁned bythe equations −→ V ( t ) i,j = −→ V i,j · (cid:16) −→ ε ( t ) i,j (cid:17) , (31) ←− V ( t ) i,j = ←− V i,j · (cid:16) ←− ε ( t ) i,j (cid:17) . (32)(Note that − < ←− ε ( t ) i,j < ∞ and − < −→ ε ( t ) i,j < ∞ .)Clearly, (cid:8) ←− ε ( t ) i,j (cid:9) i,j,t , (cid:8) −→ ε ( t ) i,j (cid:9) i,j,t can be considered to be a“measure” of the distance of the SPA messages to the ﬁxed-point messages. In particular, we have established convergenceof the SPA if we can show that these values converge to zerofor t → ∞ .In a ﬁrst step, we express the SPA message update rules interms of (cid:8) ←− ε ( t ) i,j (cid:9) i,j,t and (cid:8) −→ ε ( t ) i,j (cid:9) i,j,t . Lemma 65

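For the right-going messages it holds that
$$\overrightarrow{\delta}^{(t)}_{i,j} \triangleq \frac{\sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}_{i,j'} \cdot \overleftarrow{\varepsilon}^{(t-1)}_{i,j'}}{\sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}_{i,j'}}, \qquad (33)$$
$$\overrightarrow{\varepsilon}^{(t)}_{i,j} = - \frac{\overrightarrow{\delta}^{(t)}_{i,j}}{1 + \overrightarrow{\delta}^{(t)}_{i,j}}. \qquad (34)$$
For the left-going messages it holds that
$$\overleftarrow{\delta}^{(t)}_{i,j} \triangleq \frac{\sum_{i' \neq i} \sqrt{\theta_{i',j}} \cdot \overrightarrow{V}_{i',j} \cdot \overrightarrow{\varepsilon}^{(t)}_{i',j}}{\sum_{i' \neq i} \sqrt{\theta_{i',j}} \cdot \overrightarrow{V}_{i',j}}, \qquad (35)$$
$$\overleftarrow{\varepsilon}^{(t)}_{i,j} = - \frac{\overleftarrow{\delta}^{(t)}_{i,j}}{1 + \overleftarrow{\delta}^{(t)}_{i,j}}. \qquad (36)$$
Here, as in (33), the weights in (35) are the fixed-point messages $\overrightarrow{V}_{i',j}$; this is what the derivation of (36) from (27), (30), and (31) requires.

Proof: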

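Let us establish (34). The expression in (36) then follows analogously. We compute
$$\overrightarrow{V}_{i,j} \cdot \Big( 1 + \overrightarrow{\varepsilon}^{(t)}_{i,j} \Big) \overset{\text{(a)}}{=} \overrightarrow{V}^{(t)}_{i,j} \overset{\text{(b)}}{=} \frac{\sqrt{\theta_{i,j}}}{\sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}^{(t-1)}_{i,j'}} \overset{\text{(c)}}{=} \frac{\sqrt{\theta_{i,j}}}{\sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}_{i,j'} \cdot \Big( 1 + \overleftarrow{\varepsilon}^{(t-1)}_{i,j'} \Big)}$$
$$\overset{\text{(d)}}{=} \frac{\sqrt{\theta_{i,j}}}{\Big( \sum_{j' \neq j} \sqrt{\theta_{i,j'}} \cdot \overleftarrow{V}_{i,j'} \Big) \cdot \Big( 1 + \overrightarrow{\delta}^{(t)}_{i,j} \Big)} \overset{\text{(e)}}{=} \frac{\overrightarrow{V}_{i,j}}{1 + \overrightarrow{\delta}^{(t)}_{i,j}},$$
where at step (a) we have used (31), where at step (b) we have used (26), where at step (c) we have used (32), where at step (d) we have used (33), and where at step (e) we have used (29). Dividing both sides by $\overrightarrow{V}_{i,j}$, and then subtracting $1$ from both sides, yields the expression in (34). ∎

Note that $\overrightarrow{\delta}^{(t)}_{i,j}$ is a weighted arithmetic average of the error values $\big\{ \overleftarrow{\varepsilon}^{(t-1)}_{i,j'} \big\}_{j' \neq j}$, and that $\overleftarrow{\delta}^{(t)}_{i,j}$ is a weighted arithmetic average of the error values $\big\{ \overrightarrow{\varepsilon}^{(t)}_{i',j} \big\}_{i' \neq i}$.

Note also that the expressions in (34) and (36) have the following peculiarity. Namely, solving $\varepsilon = -\delta / (1 + \delta)$ for $\delta$ we obtain $\delta = -\varepsilon / (1 + \varepsilon)$, which is structurally the same expression as the first expression but with the roles of $\varepsilon$ and $\delta$ interchanged.

Lemma 66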

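Fix an iteration number $t \geq 1$. Taking advantage of the message gauge invariance that was mentioned in Remark 30, we can rescale the left-going and right-going fixed-point messages such that all $\big\{ \overleftarrow{\varepsilon}^{(t-1)}_{i,j} \big\}_{i,j}$ are non-negative. With this we define the numbers $\overleftarrow{\varepsilon}^{(t-1)}_{\max} \geq 0$ and $\overleftarrow{\varepsilon}^{(t)}_{\max} \geq 0$ to be the smallest numbers that satisfy
$$\overleftarrow{\varepsilon}^{(t-1)}_{i,j} \leq \overleftarrow{\varepsilon}^{(t-1)}_{\max}, \quad (i,j) \in \mathcal{E},$$
$$\overleftarrow{\varepsilon}^{(t)}_{i,j} \leq \overleftarrow{\varepsilon}^{(t)}_{\max}, \quad (i,j) \in \mathcal{E}.$$
Then
$$0 \leq \overleftarrow{\varepsilon}^{(t)}_{i,j} \leq \overleftarrow{\varepsilon}^{(t)}_{\max} \leq \overleftarrow{\varepsilon}^{(t-1)}_{\max}, \quad (i,j) \in \mathcal{E}.$$

Proof: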

It follows immediately from (33) that
$$0 \leq \overrightarrow{\delta}^{(t)}_{i,j} \leq \overleftarrow{\varepsilon}^{(t-1)}_{\max}, \quad (i,j) \in \mathcal{E},$$
and so, because of (34), we have
$$-1 < - \frac{\overleftarrow{\varepsilon}^{(t-1)}_{\max}}{1 + \overleftarrow{\varepsilon}^{(t-1)}_{\max}} \leq \overrightarrow{\varepsilon}^{(t)}_{i,j} \leq 0, \quad (i,j) \in \mathcal{E}. \qquad (37)$$
Using (35), this implies
$$-1 < - \frac{\overleftarrow{\varepsilon}^{(t-1)}_{\max}}{1 + \overleftarrow{\varepsilon}^{(t-1)}_{\max}} \leq \overleftarrow{\delta}^{(t)}_{i,j} \leq 0, \quad (i,j) \in \mathcal{E},$$
and so, because of (36), we have
$$0 \leq \overleftarrow{\varepsilon}^{(t)}_{i,j} \leq \overleftarrow{\varepsilon}^{(t-1)}_{\max}, \quad (i,j) \in \mathcal{E}. \qquad (38)$$
This proves the statement in the lemma. ∎

This shows that the errors stay bounded, but it does not prove convergence yet. (This result is essentially equivalent to the result that is obtained by taking the zero-temperature limit of the contraction coefficient that is computed in the SPA convergence analysis of [23]: the result is a contraction coefficient of $1$, which is non-trivial, but not good enough to show that the message update map is a contraction.)

It turns out that in order to improve these bounds we have to track the error values over two iterations, i.e., four half-iterations. (We suspect that this is related to the fact that the girth of $\mathsf{N}(\boldsymbol{\theta})$, i.e., the length of the shortest cycle of $\mathsf{N}(\boldsymbol{\theta})$, is $4$.)

Lemma 67

Fix an iteration number t > . Taking advantageof the message gauge invariance that was mentioned in Re-mark 30, we can rescale the left-going and right-going ﬁxed-point messages such that all {←− ε ( t − i,j } i,j are non-negative andsuch that, additionally, min i,j ←− ε ( t − i,j = 0 . With this, we deﬁne Given the difference in the graphical model in [23] and the graphicalmodel considered here, some care is required when comparing the temperaturethat is mentioned here and the temperature that is mentioned in Sections IIand III. the numbers ←− ε ( t − > and ←− ε ( t +1)max > to be the smallestnumbers that satisfy ←− ε ( t − i,j ←− ε ( t − , ( i, j ) ∈ E , ←− ε ( t +1) i,j ←− ε ( t +1)max , ( i, j ) ∈ E . Then ←− ε ( t +1) i,j ←− ε ( t +1)max ν ′ · ←− ε ( t − ( i, j ) ∈ E , for some constant ν ′ < that depends only on θ andthe ﬁxed-point messages {←− V i,j } i,j and {−→ V i,j } i,j , i.e. , ν ′ isindependent of t .Proof: The statement ←− ε ( t +1) i,j > , ( i, j ) ∈ E follows fromapplying Lemma 66 twice. Therefore, we can focus on theproof of ←− ε ( t +1)max ν ′ · ←− ε ( t − .For a given edge ( i, j ) ∈ E , we observe that −←− ε ( t − (cid:14)(cid:0) ←− ε ( t − (cid:1) −→ ε ( t ) i,j in (37) holds with equality only if ←− ε ( t − i,j ′ = ←− ε ( t − for all edges ( i, j ′ ) with j ′ = j . Similarly, for a givenedge ( i, j ) ∈ E we observe that ←− ε ( t ) i,j ←− ε ( t − in (38) holdswith equality only if −→ ε ( t ) i ′ ,j = −←− ε ( t − (cid:14)(cid:0) ←− ε ( t − (cid:1) forall edges ( i ′ , j ) with i ′ = i . This motivates the deﬁnition ofthe following sets where we track the edges for which a strictinequality holds w.r.t. the inequalities just mentioned. 
Namely,for t > we deﬁne −→E ( t ) , (cid:26) ( i, j ) ∈ E (cid:12)(cid:12)(cid:12)(cid:12) there is at least one edge ( i, j ′ ) , j ′ = j , such that ( i, j ′ ) ∈ ←−E ( t − (cid:27) , ←−E ( t ) , (cid:26) ( i, j ) ∈ E (cid:12)(cid:12)(cid:12)(cid:12) there is at least one edge ( i ′ , j ) , i ′ = i , such that ( i ′ , j ) ∈ −→E ( t ) (cid:27) . With this, assume that ←−E ( t − contains all the edges forwhich ←− ε ( t − i,j < ←− ε ( t − . Clearly, −→E ( t ) then contains all edges ( i, j ) for which −→ ε ( t ) i,j > −←− ε ( t − (cid:14)(cid:0) ←− ε ( t − (cid:1) . Similarly, ←−E ( t ) contains all edges ( i, j ) for which ←− ε ( t ) i,j < ←− ε ( t − .If ←− ε ( t − = 0 then the lemma is clearly true. So, assumethat ←− ε ( t − > . Let ←−E ( t − contain all edges ( i, j ) for which −→ ε ( t − i,j < ←− ε ( t − . The assumptions in the lemma statementguarantee that there is at least one such edge, namely theedge(s) ( i, j ) for which −→ ε ( t − i,j = 0 , and so the set ←−E ( t − is non-empty. It can then be veriﬁed that four half-iterationslater we have ←−E ( t +1) = E .The fact that there is, as mentioned in the lemma statement,a constant ν ′ that is t -independent and strictly smaller than is then established by tracking the differences between theleft- and the right-hand sides in the above-mentioned strictinequalities. This is done with the help of (33) and (35). (cid:4) The convergence proof is then completed by applyingLemma 67 repeatedly. One detail needs to be mentioned,though. 
Namely, if $\min_{i,j} \overleftarrow{\varepsilon}^{(t+1)}_{i,j} > 0$, and a non-trivial re-gauging occurs at the beginning of the next application of Lemma 67, then in this re-gauging process the value of $\max_{i,j} \overleftarrow{\varepsilon}^{(t+1)}_{i,j} > 0$ never increases (in fact, it always decreases).

Finally, we have
$$\left| \exp\Big( - F\Big( \big\{ \overleftarrow{V}^{(t)}_{i,j} \big\}, \big\{ \overrightarrow{V}^{(t)}_{i,j} \big\} \Big) \Big) - \operatorname{perm}_{\mathrm{B}}(\boldsymbol{\theta}) \right| \leq C \cdot e^{-\nu \cdot t}$$
for suitable constants $C, \nu \in \mathbb{R}_{>0}$. This follows from, on the one hand, the fact that when $F_{\mathrm{B}}$ achieves its minimum in the interior of $\Gamma_{n \times n}$ then we have equality between $F_{\mathrm{B}}$ and $F$ at stationary points of the SPA [13], and, on the other hand, the above convergence analysis.

APPENDIX H
PROOF OF LEMMA 48

In a first step we evaluate $\operatorname{perm}(\mathbf{1}_{n \times n})$. Namely, we obtain
$$\operatorname{perm}(\mathbf{1}_{n \times n}) = n! \overset{\text{(a)}}{=} \sqrt{2 \pi n} \cdot \Big( \frac{n}{e} \Big)^n \cdot \big( 1 + o(1) \big), \qquad (39)$$
where at step (a) we have used Stirling's approximation of $n!$.

In a second step we evaluate $\operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n})$. From Definitions 11 and 12 it follows that
$$\operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n}) \triangleq \exp\Big( - \min_{\boldsymbol{\gamma}} F_{\mathrm{B}}(\boldsymbol{\gamma}) \Big).$$
From Corollary 23 and symmetry considerations it follows that the minimum in the above expression is achieved by $\gamma_{i,j} = 1/n$, $(i,j) \in \mathcal{I} \times \mathcal{J}$. Therefore,
$$\log\big( \operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n}) \big) = - F_{\mathrm{B}}(\boldsymbol{\gamma}) \Big|_{\gamma_{i,j} = 1/n,\ (i,j) \in \mathcal{I} \times \mathcal{J}}$$
$$\overset{\text{(a)}}{=} - U_{\mathrm{B}}(\boldsymbol{\gamma}) + H_{\mathrm{B}}(\boldsymbol{\gamma}) \Big|_{\gamma_{i,j} = 1/n,\ (i,j) \in \mathcal{I} \times \mathcal{J}}$$
$$\overset{\text{(b)}}{=} - n^2 \cdot \frac{1}{n} \cdot \log\Big( \frac{1}{n} \Big) + n^2 \cdot \Big( 1 - \frac{1}{n} \Big) \cdot \log\Big( 1 - \frac{1}{n} \Big)$$
$$= n \cdot \log(n) + n \cdot (n-1) \cdot \log\Big( 1 - \frac{1}{n} \Big)$$
$$= n \cdot \log(n) + n \cdot (n-1) \cdot \Big( - \frac{1}{n} - \frac{1}{2 n^2} + o\Big( \frac{1}{n^2} \Big) \Big)$$
$$= n \cdot \log(n) - (n-1) - \frac{n-1}{2n} + o(1)$$
$$= n \cdot \log(n) - n + \frac{1}{2} + o(1),$$
where at steps (a) and (b) we have used Corollary 15. Consequently,
$$\operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n}) = \sqrt{e} \cdot \Big( \frac{n}{e} \Big)^n \cdot \big( 1 + o(1) \big). \qquad (40)$$
Combining (39) and (40) we obtain the promised result in the lemma statement.
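The asymptotics (39) and (40) can be compared with the exact finite-$n$ values. The sketch below uses the closed form $\log \operatorname{perm}_{\mathrm{B}}(\mathbf{1}_{n \times n}) = n \log(n) + n (n-1) \log(1 - 1/n)$ from the derivation above; the function names are hypothetical, and $n \geq 2$ is assumed so that the logarithm is finite.

```python
import math

def log_perm_all_ones(n):
    """log(n!), the log-permanent of the all-one n x n matrix."""
    return math.lgamma(n + 1)

def log_bethe_perm_all_ones(n):
    """Closed form n*log(n) + n*(n-1)*log(1 - 1/n), valid for n >= 2."""
    return n * math.log(n) + n * (n - 1) * math.log1p(-1.0 / n)

def ratio(n):
    """Exact ratio perm / perm_B for the all-one matrix."""
    return math.exp(log_perm_all_ones(n) - log_bethe_perm_all_ones(n))

for n in (3, 10, 100, 1000):
    limit = math.sqrt(2.0 * math.pi * n / math.e)   # ratio implied by (39) and (40)
    print(n, ratio(n), limit)
```

For small $n$ the exact ratio sits slightly below $\sqrt{2 \pi n / e}$, and the relative gap shrinks as $n$ grows, in line with the $(1 + o(1))$ factors.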
A PPENDIX IP ROOF OF C ONJECTURE FOR θ = n × n Let θ = n × n . In this appendix we prove that for any M ∈ Z > and any ˜ P ∈ ˜Ψ M it holds that perm (cid:16) θ ↑ ˜ P (cid:17) (cid:0) perm( θ ) (cid:1) M . (41)Although the proof is somewhat lengthy, the combinatorialidea behind it is quite straightforward. Moreover, the onlyinequality that we use is the AM–GM inequality, whichsays that the arithmetic mean of a list of non-negative realnumbers is at least as large as the geometric mean of thislist of numbers. Notably, there is no need to use Stirling’sapproximation of the factorial function. Towards showing (41), let us ﬁx some positive inte-ger M , ﬁx some collection of permutation matrices ˜ P = (cid:8) ˜ P ( i,j ) (cid:9) i ∈I ,j ∈J ∈ ˜Ψ M , deﬁne ˜ θ , θ ↑ ˜ P as in Deﬁnition 37,and let the row and column index sets of θ ↑ ˜ P be I × [ M ] and J × [ M ] , respectively. With this, it follows from Deﬁnition 1that perm( θ ) = X σ Y i ∈I θ i,σ ( i ) , (42) perm( ˜ θ ) = X ˜ σ Y ( i,m ) ∈I× [ M ] ˜ θ ( i,m ) , ˜ σ (( i,m )) , (43)where σ ranges over all permutations of the set I and where ˜ σ ranges over all permutations of the set I × [ M ] .Note that, because all entries of ˜ θ are either equal tozero or to one, the products in (43) evaluate either to zeroor to one. Computing perm( ˜ θ ) is therefore equivalent tocounting the ˜ σ ’s for which these products evaluate to one.Equivalently, perm( ˜ θ ) equals the number of perfect matchingsin the NFG N ( ˜ θ ) . Example 68

Some of the steps of the proof will be illustratedwith the help of the NFGs in Fig. 3 (which are reproduced inFig. 11 for ease of reference), where n = 3 and M = 4 . • Fig. 11(a) shows the NFG N ( θ ) ; perm( θ ) equalsthe number of perfect matchings in Fig. 11(a). Note: perm( θ ) = n ! . • If ˜ P = (cid:8) ˜ P ( i,j ) (cid:9) i ∈I ,j ∈J = (cid:8) ˜ I (cid:9) i ∈I ,j ∈J , where ˜ I isthe identity matrix of size M × M , then we obtainthe M -cover shown in Fig. 11(b), which is a “trivial” M -cover of N ( θ ) ; perm (cid:0) θ ↑ ˜ P (cid:1) equals the number ofperfect matchings in Fig. 11(b). Note: perm (cid:0) θ ↑ ˜ P (cid:1) = (cid:0) perm( θ ) (cid:1) M = ( n !) M . • For a “non-trivial” collection of permutation matrices ˜ P = (cid:8) ˜ P ( i,j ) (cid:9) i ∈I ,j ∈J we obtain an M -cover like inFig. 11(c); perm (cid:0) θ ↑ ˜ P (cid:1) equals the number of perfectmatchings in Fig. 11(c). (cid:3) Let us therefore count the number of perfect matchings in N ( ˜ θ ) , see Fig. 11(c). Before continuing, we deﬁne ˜ ∂ (( i, m )) , ( i, m ) ∈ I × [ M ] , to be the set of neighbors of the vertex ( i, m ) in N ( ˜ θ ) , i.e. , ˜ ∂ (( i, m )) , n ( j, m ′ ) ∈ J × [ M ] (cid:12)(cid:12)(cid:12) ˜ P ( i,j ) m,m ′ = 1 o . One can easily verify that for every i ∈ I , the sets ˜ ∂ (( i, m )) , m ∈ [ M ] , form a partition of J × [ M ] . (See Figs. 11(b)–(c)that highlight this partitioning for i = 1 .) This observationwill be the crucial ingredient of the following steps.We count the number of perfect matchings in N ( ˜ θ ) byconsidering the vertices (cid:8) ( i, m ) (cid:9) m ∈ [ M ] for i = 1 , i = 2 , up to i = n , thereby counting in how many ways we can specify ˜ σ such that the product in (43) equals one. Note that because ofthe above partitioning observation, we can, conditioned on theselection of a perfect matching up to and including step i − (which we shall symbolically denote by ˜ σ i − ), consider thevertices (cid:8) ( i, m ) (cid:9) m ∈ [ M ] independently. 
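The partitioning observation and the inequality (41) can be checked by brute force on small covers. The sketch below is illustrative only: it assumes that the lifted matrix places the permutation matrix $\tilde{P}^{(i,j)}$ in block $(i,j)$ (our reading of Definition 37 for the all-one $\boldsymbol{\theta}$), represents each permutation matrix by a list `perm` with a one at position `(m, perm[m])`, and uses hypothetical helper names.

```python
import random
from itertools import permutations
from math import factorial

def lifted_all_ones(n, M, perms):
    """0/1 matrix of size nM x nM whose block (i, j) is the permutation perms[i][j]."""
    size = n * M
    lifted = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            for m in range(M):
                lifted[i * M + m][j * M + perms[i][j][m]] = 1
    return lifted

def neighbor_sets(n, M, perms, i):
    """Neighbor sets of the vertices (i, m), m in [M], as sets of pairs (j, m')."""
    return [{(j, perms[i][j][m]) for j in range(n)} for m in range(M)]

def permanent01(a):
    """Number of perfect matchings, by brute force (tiny matrices only)."""
    size = len(a)
    return sum(all(a[r][sig[r]] for r in range(size))
               for sig in permutations(range(size)))

n, M = 3, 2
rng = random.Random(0)
perms = [[rng.sample(range(M), M) for _ in range(n)] for _ in range(n)]
sets = neighbor_sets(n, M, perms, 0)
print(sorted(set().union(*sets)))                      # all of J x [M], each pair once
print(permanent01(lifted_all_ones(n, M, perms)), factorial(n) ** M)
```

With identity permutations everywhere the cover is the trivial one and the count equals $(n!)^M$; for other choices the count should never exceed that value.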
Then we define $\tilde{d}_{i,m|\tilde{\sigma}_{i-1}}$, $(i,m) \in \mathcal{I} \times [M]$, to be the number of possibilities of choosing $\tilde{\sigma}((i,m))$, i.e., the number of ways that the edge of the perfect matching of $\mathsf{N}(\tilde{\boldsymbol{\theta}})$ that is incident on $(i,m)$ can be chosen.

Fig. 11. (a) NFG $\mathsf{N}(\boldsymbol{\theta})$ for $n = 3$. (b) "Trivial" 4-cover of $\mathsf{N}(\boldsymbol{\theta})$. (c) A possible 4-cover of $\mathsf{N}(\boldsymbol{\theta})$. The coloring of the edges in (b) and (c) shows visually the fact that the sets $\tilde{\partial}((i,m))$, $m \in [M]$, form a partition of $\mathcal{J} \times [M]$ (here for $i = 1$). (For more details, see the text in Appendix I.)

• Let $i = 1$. Then $\tilde{d}_{i,m|\tilde{\sigma}_{i-1}}$, $m \in [M]$, is the number of possibilities of choosing the edge of the perfect matching of $\mathsf{N}(\tilde{\boldsymbol{\theta}})$ that is incident on $(i,m)$. Because the $i$th row of $\boldsymbol{\theta}$ contains only ones, and because of the above partitioning observation, we find that $\tilde{d}_{i,m} = n$ for all $m \in [M]$, and so,
$$\sum_{m \in [M]} \tilde{d}_{i,m|\tilde{\sigma}_{i-1}} = M n.$$

We observe that, whatever the selection of these $M$ edges is, $M$ vertices on the right-hand side will be incident on a selected edge, and therefore be "not available anymore" in the following steps. This reduces the number of "available" right-hand side vertices to $M n - M = M \cdot (n-1)$.
• Let $i = 2$. Then $\tilde{d}_{i,m|\tilde{\sigma}_{i-1}}$, $m \in [M]$, is the number of possibilities of choosing the edge of the perfect matching of $\mathsf{N}(\tilde{\boldsymbol{\theta}})$ that is incident on $(i,m)$. Because the $i$th row of $\boldsymbol{\theta}$ contains only ones, because of the above partitioning observation, and because of the observation at the end of the above step, we find that
$$\sum_{m \in [M]} \tilde{d}_{i,m|\tilde{\sigma}_{i-1}} \leq M \cdot (n-1). \qquad (44)$$
(If all permutation matrices in $\tilde{\mathbf{P}}$ are identity matrices, then it can be verified that the inequality in (44) is an equality. However, for general $\tilde{\mathbf{P}}$, equality in (44) does not need to hold.) Similar to the end of the above step, we observe that whatever the selection of these $M$ edges is, $M$ vertices on the right-hand side will be incident on a selected edge, and therefore be "not available anymore" in the following steps. This reduces the number of "available" right-hand side vertices to $M \cdot (n-1) - M = M \cdot (n-2)$.
• Continuing as above, we observe that for general $i \in \mathcal{I}$ it holds that
$$\sum_{m \in [M]} \tilde{d}_{i,m|\tilde{\sigma}_{i-1}} \leq M \cdot (n - i + 1).$$
(45)Note that for i ∈ I we have Y m ∈ [ M ] ˜ d i,m | ˜ σ i − =  Y m ∈ [ M ] ˜ d /Mi,m | ˜ σ i −  M (a)  M X m ∈ [ M ] ˜ d i,m | ˜ σ i −  M (b) (cid:18) M · M · ( n − i + 1) (cid:19) M = ( n − i + 1) M , (46)where at step (a) we have used the fact that the geometric meanof a collection of non-negative numbers is upper bounded bythe arithmetic mean of the same collection of numbers, andwhere at step (b) we have used (45).With this, we obtain the following upper bound on perm( ˜ θ ) .Namely, perm( ˜ θ ) (a) = X ˜ σ X ˜ σ | ˜ σ · · · X ˜ σ n − | ˜ σ n − X ˜ σ n | ˜ σ n − (b) = X ˜ σ X ˜ σ | ˜ σ · · · X ˜ σ n − | ˜ σ n − Y m n ∈ [ M ] ˜ d n,m n | ˜ σ n − (c) X ˜ σ X ˜ σ | ˜ σ · · · X ˜ σ n − | ˜ σ n − ( n − n + 1) M (d) ( n − n + 1) M · X ˜ σ X ˜ σ | ˜ σ · · · X ˜ σ n − | ˜ σ n − ... (e) Y i ∈I ( n − i + 1) M = ( n !) M (f) = perm( θ ) M , where at step (a) we have used the fact that perm( ˜ θ ) equalsthe number of perfect matchings in N ( ˜ θ ) , where at step (b)we have used the deﬁnition of ˜ d n,m | ˜ σ n − , where at step (c) wehave used (46) for i = n , where at step (d) we take advantageof the fact that ( n − n + 1) M is independent of ˜ σ n − , where atstep (e) we apply similar results as at steps (b)–(d) (note thatfor all i , the quantity ( n − i +1) M is independent of ˜ σ i − ), andwhere at step (f) we have used the observation perm( θ ) = n ! .This shows that the desired inequality (41) indeed holds forarbitrary positive integer M and ˜ P ∈ ˜Ψ M .A PPENDIX JP ROOF OF L EMMA perm B (cid:0) θ ↑ ˜ P (cid:1) > (cid:0) perm B ( θ ) (cid:1) M and then perm B (cid:0) θ ↑ ˜ P (cid:1) (cid:0) perm B ( θ ) (cid:1) M , from which the promisedequality follows. For the rest of the proof, we will use the short-hand ˜ θ for θ ↑ ˜ P . We remind the reader of Assumption 2, i.e. 
, we willassume that there is at least one permutation σ : [ n ] → [ n ] suchthat Q i θ i,σ ( i ) > (otherwise, perm B ( ˜ θ ) = perm B ( θ ) = 0 ).Moreover, N ( ˜ θ ) will be the NFG associated with ˜ θ . Towards proving the ﬁrst inequality, let γ ∈ Γ n × n be amatrix that minimizes F B , N ( θ ) . Based on γ , we deﬁne the ( M n ) × ( M n ) matrix ˜ γ with entries ˜ γ ( i,m ) , ( j,m ′ ) , γ i,j · ˜ P ( i,j ) m,m ′ for all ( i, m, j, m ′ ) ∈ I × [ M ] ×J × [ M ] . One can easily verifythat ˜ γ ∈ Γ ( Mn ) × ( Mn ) and that F B , N ( ˜ θ ) ( ˜ γ ) = M · F B , N ( θ ) ( γ ) .From this and Corollary 15 it then follows that perm B ( ˜ θ ) > (cid:0) perm B ( θ ) (cid:1) M . Towards proving the second inequality, let ˜ γ ∈ Γ ( Mn ) × ( Mn ) be a matrix that minimizes F B , N ( ˜ θ ) . One can easily verifythat ˜ γ ( i,m ) , ( j,m ′ ) = 0 whenever ˜ P ( i,j ) m,m ′ = 0 , ( i, m, j, m ′ ) ∈I × [ M ] × J × [ M ] . Based on ˜ γ , we deﬁne the n × n matrix γ with entries γ i,j , M X m X m ′ ˜ γ ( i,m ) , ( j,m ′ ) · ˜ P ( i,j ) m,m ′ for all ( i, j ) ∈ I×J . One can easily verify that γ ∈ Γ n × n . Let ˜ γ ( i,m ) be the length- n vector based on the ( i, m ) th row of ˜ γ ,where we include an entry only if ˜ P ( i,j ) m,m ′ = 1 . Similarly, deﬁnethe length- n vector ˜ γ ( j,m ′ ) based on the ( j, m ′ ) th columnof ˜ γ . One can verify that the i th row of γ , i.e. , γ i , equals M P m ˜ γ ( i,m ) . Similarly, the j th column of γ , i.e. , γ j , equals M P m ′ ˜ γ ( j,m ′ ) . Then H B , N ( ˜ θ ) ( ˜ γ ) (a) = 12 X i X m S ( ˜ γ ( i,m ) ) + 12 X j X m ′ S ( ˜ γ ( j,m ′ ) ) (b) M X i S ( ˜ γ i ) + M X j S ( ˜ γ j ) (c) = M · H B , N ( θ ) ( γ ) , where at step (a) we have used Lemma 21, where at step (b)we have used the concavity of the S -function (see Theo-rem 20), and where at step (c) we have used once againLemma 21. Moreover, one can easily show that U B , N ( ˜ θ ) ( ˜ γ ) = M · U B , N ( θ ) ( γ ) , and so F B , N ( ˜ θ ) ( ˜ γ ) > M · F B , N ( θ ) ( γ ) . 
Fromthis and Corollary 15 it then follows that perm B ( ˜ θ ) (cid:0) perm B ( θ ) (cid:1) M . Let ˜ N be the M -cover of N ( θ ) corresponding to ˜ P . Note that, strictlyspeaking, ˜ N and N ( ˜ θ ) are not the same NFG. The former is an M -coverof N ( θ ) (therefore it has two times Mn function nodes, all of them withdegree n ), whereas the latter is a complete bipartite graph with two times Mn function nodes. However, with the above condition on θ , for all practicalpurposes they are the same because F B , N ( ˜ θ ) ( ˜ γ ) < ∞ only for matrices ˜ γ ∈ Γ ( Mn ) × ( Mn ) for which ˜ γ ( i,m ) , ( j,m ′ ) = 0 whenever ˜ P ( i,j ) m,m ′ = 0 , ( i, m, j, m ′ ) ∈ I × [ M ] × J × [ M ] . A PPENDIX KP ROOF OF L EMMA κ satisﬁes the conditions listed in Theorem 60, theconcavity statement for the Bethe entropy function and theconvexity statement for the Bethe free energy function followimmediately.Therefore, let us turn our attention to evaluating the ratio perm( n × n ) / perm ( κ )B ( n × n ) . In a ﬁrst step we evaluate perm( n × n ) . Namely, as in the proof of Lemma 48 inAppendix H we have perm( n × n ) = n ! = √ πn · (cid:16) n e (cid:17) n · (cid:0) o (1) (cid:1) . (47)In a second step we evaluate perm ( κ )B ( n × n ) . From The-orem 60 and symmetry considerations it follows that theminimum in the above expression is achieved by γ i,j = 1 /n , ( i, j ) ∈ I × J . 
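Therefore,
$$\log\big( \operatorname{perm}^{(\kappa)}_{\mathrm{B}}(\mathbf{1}_{n \times n}) \big) \overset{\text{(a)}}{=} - U_{\mathrm{B}}(\boldsymbol{\gamma}) + H^{(\kappa)}_{\mathrm{B}}(\boldsymbol{\gamma})$$
$$\overset{\text{(b)}}{=} - n^2 \cdot \Big( 1 + \frac{1}{2n} \Big) \cdot \frac{1}{n} \cdot \log\Big( \frac{1}{n} \Big) + n^2 \cdot \Big( 1 - \frac{1}{2n} \Big) \cdot \Big( 1 - \frac{1}{n} \Big) \cdot \log\Big( 1 - \frac{1}{n} \Big)$$
$$= \Big( n + \frac{1}{2} \Big) \cdot \log(n) + \Big( n - \frac{1}{2} \Big) \cdot (n-1) \cdot \log\Big( 1 - \frac{1}{n} \Big)$$
$$= \Big( n + \frac{1}{2} \Big) \cdot \log(n) + \Big( n - \frac{1}{2} \Big) \cdot (n-1) \cdot \Big( - \frac{1}{n} - \frac{1}{2 n^2} + o\Big( \frac{1}{n^2} \Big) \Big)$$
$$= \Big( n + \frac{1}{2} \Big) \cdot \log(n) - n + 1 + o(1),$$
where at step (a) we have used $F^{(\kappa)}_{\mathrm{B}}(\boldsymbol{\gamma}) = U_{\mathrm{B}}(\boldsymbol{\gamma}) - H^{(\kappa)}_{\mathrm{B}}(\boldsymbol{\gamma})$, and where at step (b) we have used $U_{\mathrm{B}}(\boldsymbol{\gamma}) = - \sum_{i,j} \gamma_{i,j} \log(\theta_{i,j}) = 0$ and the expression for $H^{(\kappa)}_{\mathrm{B}}(\boldsymbol{\gamma})$ from Lemma 58. Therefore,
$$\operatorname{perm}^{(\kappa)}_{\mathrm{B}}(\mathbf{1}_{n \times n}) = e \cdot \sqrt{n} \cdot \Big( \frac{n}{e} \Big)^n \cdot \big( 1 + o(1) \big). \qquad (48)$$
By combining (47) and (48) we obtain the promised result.

REFERENCES
[1] H. Minc,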

Permanents. Reading, MA: Addison-Wesley, 1978.
[2] H. J. Ryser, Combinatorial Mathematics (Carus Mathematical Monographs No. 14). Mathematical Association of America, 1963.
[3] L. Valiant, "The complexity of computing the permanent," Theor. Comp. Sc., vol. 8, no. 2, pp. 189–201, 1979.
[4] A. Z. Broder, "How hard is it to marry at random? (On the approximation of the permanent)," in Proc. 18th Annual ACM Symp. Theory of Comp., Berkeley, CA, USA, May 28–30 1986, pp. 50–58, (Erratum in Proc. 20th Annual ACM Symposium on Theory of Computing, 1988, p. 551).
[5] M. Jerrum, A. Sinclair, and E. Vigoda, "A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries," J. ACM, vol. 51, no. 4, pp. 671–697, Jul. 2004.
[6] M. Huber and J. Law, "Fast approximation of the permanent for very dense problems," in Proc. ACM-SIAM Symp. Discr. Alg., San Francisco, CA, USA, Jan. 20–22 2008.
[7] N. Karmarkar, R. Karp, R. Lipton, L. Lovász, and M. Luby, "A Monte-Carlo algorithm for estimating the permanent," SIAM J. Comp., vol. 22, no. 2, pp. 284–293, Apr. 1993.
[8] A. Barvinok, "Polynomial time algorithms to approximate permanents and mixed discriminants within a simply exponential factor," Random Structures and Algorithms, vol. 14, no. 1, pp. 29–61, Jan. 1999.
[9] M. Jerrum and U. V. Vazirani, "A mildly exponential approximation algorithm for the permanent," Algorithmica, vol. 16, no. 4–5, pp. 392–401, Oct.–Nov. 1996.
[10] N. Linial, A. Samorodnitsky, and A. Wigderson, "A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents," Combinatorica, vol. 20, no. 4, pp. 545–568, 2000.
[11] M. Chertkov, L. Kroc, and M. Vergassola, "Belief propagation and beyond for particle tracking," CoRR, available online under http://arxiv.org/abs/0806.1199, Jun. 2008.
[12] B. Huang and T. Jebara, "Approximating the permanent with belief propagation," CoRR, available online under http://arxiv.org/abs/0908.1769, Aug. 2009.
[13] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.
[14] L. Gurvits, "Unharnessing the power of Schrijver's permanental inequality," CoRR, http://arxiv.org/abs/1106.2844, Jun. 2011.
[15] L. Gurvits, "Unleashing the power of Schrijver's permanental inequality with the help of the Bethe approximation," Elec. Coll. Comp. Compl., Dec. 2011.
[16] P. O. Vontobel, "Counting in graph covers: a combinatorial characterization of the Bethe entropy function," submitted to IEEE Trans. Inf. Theory, Nov. 2010, available online under http://arxiv.org/abs/1012.0065 (ver. 2), Oct. 2012.
[17] R. G. Gallager, Low-Density Parity-Check Codes. M.I.T. Press, Cambridge, MA, 1963.
[18] T. Richardson and R. Urbanke, Modern Coding Theory. New York, NY: Cambridge University Press, 2008.
[19] Y. Watanabe and M. Chertkov, "Belief propagation and loop calculus for the permanent of a non-negative matrix,"

Journal of Physics A:Mathematical and Theoretical , vol. 43, p. 242002, 2010.[20] M. Chertkov, L. Kroc, F. Krzakala, M. Vergassola, and L. Zdeborov´a,“Inference in particle tracking experiments by passing messsages be-tween images,”

Proc. Natl. Acad. Sci. , vol. 107, no. 17, pp. 7663–7668,Apr. 2010.[21] M. Chertkov and V. Y. Chernyak, “Loop series for discrete statisticalmodels on graphs,”

J. Stat. Mech.: Theory and Experiment , p. P06009,Jun. 2006.[22] A. B. Yedidia and M. Chertkov, “Computing the permanent with beliefpropagation,” submitted to J. Mach. Learn. Res., available online under http://arxiv.org/abs/1108.0065 , Jul. 2011.[23] M. Bayati and C. Nair, “A rigorous proof of the cavity method forcounting matchings,” in

Proc. 44th Allerton Conf. on Communications,Control, and Computing , Allerton House, Monticello, IL, USA, Sep. 27–29 2006.[24] M. Bayati, D. Gamarnik, D. Katz, C. Nair, and P. Tetali, “Simpledeterministic approximation algorithms for counting matchings,” in

Proc. Symp. Theory of Computing, , San Diego, CA, USA, Jun.13–162007, pp. 122–127.[25] D. Gamarnik and D. Katz, “A deterministic approximation algorithmfor computing the permanent of a 0, 1 matrix,”

J. Computer and SystemSciences , vol. 76, no. 8, pp. 879–883, Dec. 2010.[26] J. L. Williams and R. A. Lau, “Convergence of loopy belief propagationfor data association,” in

Proc. 6th Int. Conf. on Intelligent Sensors, Sen-sor Networks and Information Processing , Brisbane, Australia, Dec. 7–10 2010, pp. 175–180.[27] B. Huang and T. Jebara, “Loopy belief propagation for bipartite max-imum weight b-matching,” in

Proc. 11th Intern. Conf. on ArtiﬁcialIntelligence and Statistics , San Juan, Puerto Rico, Mar. 21–24 2007.[28] M. Bayati, D. Shah, and M. Sharma, “Max-product for maximum weightmatching: convergence, correctness, and LP duality,”

IEEE Trans. Inf.Theory , vol. 54, no. 3, pp. 1241–1251, Mar. 2008.[29] M. Bayati, C. Borgs, J. Chayes, and R. Zecchina, “Belief-propagationfor weighted b-matchings on arbitrary graphs and its relation to linearprograms with integer solutions,”

SIAM J. Discr. Math. , vol. 25, no. 2,pp. 989–1011, 2011.[30] S. Sanghavi, D. Malioutov, and A. Willsky, “Belief propagation and LPrelaxation for weighted matching in general graphs,”

IEEE Trans. Inf.Theory , vol. 57, no. 4, pp. 2203–2212, Apr. 2011.[31] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation,Department of Electrical Engineering, Link¨oping University, Sweden,1996. [32] A. Barvinok, “On the number of matrices and a random matrix withprescribed row and column sums and 0–1 entries,” Adv. in Math. , vol.224, no. 1, pp. 316–339, May 2010.[33] A. Barvinok and A. Samorodnitsky, “Computing the partition functionfor perfect matchings in a hypergraph,”

Comb., Prob., and Comp. ,vol. 20, no. 6, pp. 815–835, Nov. 2011.[34] J. Yedidia, “An idiosyncratic journey beyond mean ﬁeld theory,” in

Advanced Mean Field Methods, Theory and Practice , M. Opper andD. Saad, Eds. MIT Press, Jan. 2001, pp. 21–36.[35] C. Greenhill, S. Janson, and A. Ruci´nski, “On the number of perfectmatchings in random lifts,”

Comb., Prob., and Comp. , vol. 19, no. 5–6,pp. 791–817, Nov. 2010.[36] S. Boyd and L. Vandenberghe,

Convex Optimization . Cambridge, UK:Cambridge University Press, 2004.[37] R. A. Horn and C. R. Johnson,

Matrix Analysis . Cambridge: CambridgeUniversity Press, 1990, corrected reprint of the 1985 original.[38] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs andthe sum-product algorithm,”

IEEE Trans. Inf. Theory , vol. 47, no. 2, pp.498–519, Feb. 2001.[39] G. D. Forney, Jr., “Codes on graphs: normal realizations,”

IEEE Trans.Inf. Theory , vol. 47, no. 2, pp. 520–548, Feb. 2001.[40] H.-A. Loeliger, “An introduction to factor graphs,”

IEEE Sig. Proc. Mag. ,vol. 21, no. 1, pp. 28–41, Jan. 2004.[41] F. Lad, G. Sanﬁlippo, and G. Agr`o, “Extropy: a complementary dual ofentropy,”

CoRR, available online under http://arxiv.org/abs/1109.6440 , Sep. 2011.[42] T. M. Cover and J. A. Thomas,

Elements of Information Theory , 2nd ed.New York: John Wiley & Sons Inc., 2006.[43] P. O. Vontobel, “Connecting the Bethe entropy and the edge zeta functionof a cycle code,” in

Proc. IEEE Int. Symp. Inf. Theory , Austin, TX, USA,Jun. 13–18 2010, pp. 704–708.[44] ——, “A factor-graph-based random walk, and its relevance for LPdecoding analysis and Bethe entropy characterization,” in

Proc. Inf.Theory Appl. Workshop , UC San Diego, La Jolla, CA, USA, Jan. 31–Feb. 5 2010.[45] S. Arora, C. Daskalakis, and D. Steurer, “Message-passing algorithmsand improved LP decoding,” in

Proc. 41st Annual ACM Symp. Theoryof Computing , Bethesda, MD, USA, May 31–June 2 2009.[46] N. Halabi and G. Even, “LP decoding of regular LDPC codes inmemoryless channels,”

IEEE Trans. Inf. Theory , vol. 57, no. 2, pp. 887–897, Feb. 2011.[47] D. Bertsekas,

Nonlinear Programming , 2nd ed. Belmont, MA: AthenaScientiﬁc, 1999.[48] P. A. Regalia and J. M. Walsh, “Optimality and duality of the turbodecoder,”

Proceedings of the IEEE , vol. 95, no. 6, pp. 1362–1377, Jun.2007.[49] M. M´ezard and A. Montanari,

Information, Physics, and Computation .New York, NY: Oxford University Press, 2009.[50] Y. Weiss and W. T. Freeman, “On the optimality of the max-productbelief propagation algorithm in arbitrary graphs,”

IEEE Trans. Inf.Theory , vol. 47, no. 2, pp. 736–744, 2001.[51] ——, “Correctness of belief propagation in Gaussian graphical modelsof arbitrary topology,”

Neural Computation , vol. 13, no. 10, pp. 2173–2200, Oct. 2001.[52] P. Rusmevichientong and B. Van Roy, “An analysis of belief propagationon the turbo decoding graph with Gaussian densities,”

IEEE Trans. Inf.Theory , vol. 47, no. 2, pp. 745–765, 2001.[53] J. M. Mooij and H. J. Kappen, “Sufﬁcient conditions for convergenceof the sum-product algorithm,”

IEEE Trans. Inf. Theory , vol. 53, no. 12,pp. 4422–4437, Dec. 2007.[54] D. M. Malioutov, J. K. Johnson, and A. S. Willsky, “Walk-sums andbelief propagation in Gaussian graphical models,”

J. Mach. Learn. Res. ,vol. 7, pp. 2031–2064, Dec. 2006.[55] N. Ruozzi, J. Thaler, and S. Tatikonda, “Graph covers and quadratic min-imization,” in

Proc. 47th Allerton Conf. on Communications, Control,and Computing , Allerton House, Monticello, IL, USA, Sep. 30–Oct. 22009, pp. 1590–1596.[56] P. O. Vontobel, “A combinatorial characterization of the Bethe and theKikuchi partition functions,” in

Proc. Inf. Theory Appl. Workshop , UCSan Diego, La Jolla, CA, USA, Feb. 6–11 2011.[57] W. S. Massey,

Algebraic Topology: an Introduction . New York:Springer-Verlag, 1977, reprint of the 1967 edition, Graduate Texts inMathematics, Vol. 56. [58] H. M. Stark and A. A. Terras, “Zeta functions of ﬁnite graphs andcoverings,”

Adv. in Math. , vol. 121, no. 1, pp. 124–165, Jul. 1996.[59] N. L. Biggs,

Discrete Mathematics , 2nd ed. New York: The ClarendonPress and Oxford University Press, 1989.[60] R. Koetter and P. O. Vontobel, “Graph covers and iterative decodingof ﬁnite-length codes,” in

Proc. 3rd Intern. Symp. on Turbo Codes andRelated Topics , Brest, France, Sep. 1–5 2003, pp. 75–82.[61] P. O. Vontobel and R. Koetter, “Graph-cover decoding and ﬁnite-lengthanalysis of message-passing iterative decoding of LDPC codes,”

CoRR, , Dec. 2005.[62] A. Schrijver, “Counting 1-factors in regular bipartite graphs,”

J. Comb.Theory, Ser. B , vol. 72, no. 1, pp. 122–135, Jan. 1998.[63] L. Gurvits, “Van der Waerden / Schrijver-Valiant like conjectures andstable (aka hyperbolic) homogeneous polynomials: one theorem for all,”

Elec. J. Comb. , vol. 15, p. R66, 2008.[64] M. Laurent and A. Schrijver, “On Leonid Gurvits’s proof for perma-nents,”

Amer. Math. Monthly , vol. 117, no. 10, pp. 903–911, Dec. 2010.[65] G. D. Forney, Jr. and P. O. Vontobel, “Partition functions of normalfactor graphs,” in

Proc. Inf. Theory Appl. Workshop , UC San Diego, LaJolla, CA, USA, Feb. 6–11 2011.[66] N. Ruozzi, “The Bethe partition function of log-supermodular graphicalmodels,” in

Proc. Neural Inf. Proc. Sys. Conf. , Lake Tahoe, NV, USA,Dec. 3–6 2012.[67] K. Viswanathan, “Pattern maximum-likelihood,” Talk at Workshop on“Permanents and modeling probability distributions,” American Instituteof Mathematics, Palo Alto, CA, USA, Sep. 1 2009.[68] P. O. Vontobel, “The Bethe approximation of the pattern maximum like-lihood distribution,” in

Proc. IEEE Int. Symp. Inf. Theory , Cambridge,MA, USA, Jul. 1–6 2012, pp. 2012–2016.[69] R. Rivest, “On self-organizing sequential search heuristics,”

Comm.ACM , vol. 19, no. 2, pp. 63–67, Feb. 1976.[70] J. Sayir, “Ordering memoryless source alphabets using competitive lists,”in

Proc. First INTAS International Seminar on Coding Theory andCombinatorics , Thahkadzor, Armenia, Oct. 6–11 1996.[71] A. W. Marshall and I. Olkin,

Inequalities: Theory of Majorization andIts Applications . San Diego, CA: Academic Press, 1979.[72] A. Marshall and I. Olkin, “Scaling of matrices to achieve speciﬁed rowand column sums,”

Num. Math. , vol. 12, no. 1, pp. 83–90, Jan. 1968.[73] W. Wiegerinck and T. Heskes, “Fractional belief propagation,” in

Advances in Neural Information Processing Systems 15 , S. Becker,S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003,pp. 438–445.[74] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class ofupper bounds on the log partition function,”

IEEE Trans. Inf. Theory ,vol. 51, no. 7, pp. 2313–2335, Jul. 2005.[75] T. Heskes, “On the uniqueness of loopy belief propagation ﬁxed points,”

Neural Computation , vol. 16, no. 11, pp. 2379–2413, Nov. 2004.[76] Y. Weiss, T. Meltzer, and C. Yanover, “MAP estimation, linear program-ming and belief propagation with convex free energies,” in

Proc. Conf.Uncert. in Artif. Intell. , Vancouver, Canada, July 19–22 2007.[77] T. Hazan and A. Shashua, “Norm-product belief propagation: primal-dual message-passing for approximate inference,”

IEEE Trans. Inf.Theory , vol. 56, no. 12, pp. 6294–6316, Dec. 2010.[78] N. Ruozzi and S. Tatikonda, “Convergent and correct message passingschemes for optimization problems over graphical models,” submittedto JMLR , 2010, available online under http://arxiv.org/abs/1002.3239 .[79] R. Smarandache and P. O. Vontobel, “Absdet-pseudo-codewords andperm-pseudo-codewords: deﬁnitions and properties,” in

Proc. IEEE Int.Symp. Inf. Theory , Seoul, Korea, June 28–July 3 2009.[80] R. Smarandache, “Pseudocodewords from Bethe permanents,” submit-ted to IEEE Trans. Inf. Theory, available online under http://arxiv.org/abs/1112.4625 , Dec. 2011.[81] M. Cuturi, “Permanents, transportation polytopes and positive deﬁnitekernels on histograms,” in

Proc. Int. Joint Conf. Artiﬁcial Intelligence ,Hyderabad, India, Jan. 6–12 2007.[82] P. O. Vontobel, A. Kavˇci´c, D. M. Arnold, and H.-A. Loeliger, “Ageneralization of the Blahut-Arimoto algorithm to ﬁnite-state channels,”

IEEE Trans. Inf. Theory , vol. 54, no. 5, pp. 1887–1918, May 2008.[83] J. Justesen and T. Høholdt, “Maxentropic Markov chains,”

IEEE Trans.Inf. Theory , vol. 30, no. 4, pp. 665–667, Jul. 1984.[84] A. S. Khayrallah and D. L. Neuhoff, “Coding for channels with costconstraints,”