Convergence of the Min-Sum Algorithm for Convex Optimization
Ciamac C. Moallemi
Electrical Engineering, Stanford University
Email: [email protected]
Benjamin Van Roy
Management Science & Engineering and Electrical Engineering, Stanford University
Email: [email protected]
28 May 2007
Abstract
We establish that the min-sum message-passing algorithm and its asynchronous variants converge for a large class of unconstrained convex optimization problems.
1 Introduction

Consider an optimization problem of the form

(1.1)    minimize    $F(x) = \sum_{C \in \mathcal{C}} f_C(x_C)$
         subject to  $x \in \mathcal{X}^V$.

Here, the vector of decision variables $x$ is indexed by a finite set $V = \{1, \ldots, n\}$. Each decision variable takes values in the set $\mathcal{X}$. The set $\mathcal{C}$ is a collection of subsets of the index set $V$. This collection describes an additive decomposition of the objective function. We associate with each set $C \in \mathcal{C}$ a component function (or factor) $f_C \colon \mathcal{X}^C \to \mathbb{R}$, which takes values as a function of those components $x_C$ of the vector $x$ identified by the elements of $C$. Given a vector $x \in \mathcal{X}^V$ and a subset $A \subset V$, we use the notation $x_A = (x_i,\ i \in A) \in \mathcal{X}^A$ for the vector of components of $x$ specified by the set $A$.

The min-sum algorithm is a method for optimization problems of the form (1.1). It is one of a class of methods known as message-passing algorithms. These algorithms have been the subject of considerable research recently across a number of fields, including communications, artificial intelligence, statistical physics, and theoretical computer science. Interest in message-passing algorithms has been sparked by their success in solving certain classes of NP-hard combinatorial optimization problems, such as the decoding of low-density parity-check codes and turbo codes (e.g., [1, 2, 3]), or the solution of certain classes of satisfiability problems (e.g., [4, 5]).

Despite their successes, message-passing algorithms remain poorly understood. For example, conditions for convergence and for the accuracy of the resulting solutions are not well characterized.

In this paper, we consider the case where $\mathcal{X} = \mathbb{R}$, so that the optimization problem is continuous. One such case that has been examined previously in the literature is where the objective is pairwise separable (i.e., $|C| \leq 2$ for all $C \in \mathcal{C}$) and the component functions $\{f_C(\cdot)\}$ are quadratic and convex. Here, the min-sum algorithm is known to compute the optimal solution when it converges [6, 7, 8], and sufficient conditions for convergence identify a broad class of problems [9, 10].

Our main contribution is the analysis of cases where the functions are convex but not necessarily quadratic. We establish that the min-sum algorithm and its asynchronous variants converge for a large class of such problems. The main sufficient condition is that of scaled diagonal dominance. This condition is similar to known sufficient conditions for asynchronous convergence of other decentralized optimization algorithms, such as coordinate descent and gradient descent.

Analysis of the convex case has been an open challenge, and its resolution advances the state of understanding in the growing literature on message-passing algorithms. Further, it builds a bridge between this emerging research area and the better established fields of convex analysis and optimization.

This paper is organized as follows. The next section studies the min-sum algorithm in the context of pairwise separable convex programs, establishing convergence for a broad class of such problems. Section 3 extends this result to more general separable convex programs, where each factor can be a function of more than two variables. In Section 4, we discuss how our convergence results hold even with a totally asynchronous model of computation. When applied to a continuous optimization problem, the messages computed and stored by the min-sum algorithm are functions over continuous domains. Except in very special cases, this is not feasible for digital computers, and in Section 5 we discuss implementable approaches to approximating the behavior of the min-sum algorithm. We close by discussing possible extensions and open issues in Section 6.

2 Pairwise Separable Convex Programs
Consider first the case of pairwise separable programs. These are programs of the form (1.1), where $|C| \leq 2$ for all $C \in \mathcal{C}$. In this case, we can define an undirected graph $(V, E)$ based on the objective function. This graph has a vertex set $V$ corresponding to the decision variables, and an edge set $E$ defined by the pairwise factors, $E = \{C \in \mathcal{C} : |C| = 2\}$.

Definition 1. (Pairwise Separable Convex Program)
A pairwise separable convex program is an optimization problem of the form

(2.1)    minimize    $F(x) = \sum_{i \in V} f_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j)$
         subject to  $x \in \mathbb{R}^V$,

where the factors $\{f_i(\cdot)\}$ are strictly convex, coercive, and twice continuously differentiable, the factors $\{f_{ij}(\cdot,\cdot)\}$ are convex and twice continuously differentiable, and

    $M \triangleq \min_{i \in V} \inf_{x \in \mathbb{R}^V} \frac{\partial^2}{\partial x_i^2} F(x) > 0$.

Under this definition, the objective function $F(x)$ is strictly convex and coercive. Hence, we can define $x^* \in \mathbb{R}^V$ to be the unique optimal solution.

2.1 The Min-Sum Algorithm

The min-sum algorithm attempts to minimize the objective function $F(\cdot)$ by an iterative, message-passing procedure. For each vertex $i \in V$, denote the set of neighbors of $i$ in the graph by $N(i) = \{j \in V : (i,j) \in E\}$. Denote the set of edges with direction distinguished by $\vec{E} = \{(i,j) \in V \times V : i \in N(j)\}$. At time $t$, each vertex $i$ keeps track of a "message" from each neighbor $u \in N(i)$. This message takes the form of a function $J^{(t)}_{u \to i} \colon \mathbb{R} \to \mathbb{R}$. These incoming messages are combined to compute new outgoing messages for each neighbor. The message $J^{(t+1)}_{i \to j}(\cdot)$ from vertex $i$ to vertex $j \in N(i)$ evolves according to

(2.2)    $J^{(t+1)}_{i \to j}(x_j) = \min_{y_i} \Big[ f_i(y_i) + f_{ij}(y_i, x_j) + \sum_{u \in N(i) \setminus j} J^{(t)}_{u \to i}(y_i) \Big] + \kappa^{(t+1)}_{i \to j}$.

Here, $\kappa^{(t+1)}_{i \to j}$ represents an arbitrary offset term that varies from message to message. Only the relative values of the function $J^{(t+1)}_{i \to j}(\cdot)$ matter, so the choice of $\kappa^{(t+1)}_{i \to j}$ does not influence relevant information.
At each time $t > 0$, a local objective function $b^{(t)}_i(\cdot)$ is defined for each variable $x_i$ by

(2.3)    $b^{(t)}_i(x_i) = f_i(x_i) + \sum_{u \in N(i)} J^{(t)}_{u \to i}(x_i)$.

An estimate $x^{(t)}_i$ can be obtained for the optimal value of the variable $x_i$ by minimizing the local objective function:

(2.4)    $x^{(t)}_i = \operatorname*{argmin}_{y_i} b^{(t)}_i(y_i)$.
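To make the recursion concrete, the following sketch implements equations (2.2)-(2.4) literally, representing each message as a Python function and carrying out the inner minimization numerically. The specific graph, factors, and the use of scipy.optimize.minimize_scalar are illustrative assumptions, not part of the paper; storing messages as nested function objects is impractical in general, which is precisely why Section 5 develops implementable approximations.

```python
from scipy.optimize import minimize_scalar

# Illustrative instance (assumption): a 3-vertex chain with quadratic factors.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
neighbors = {i: [] for i in V}
for (a, b) in E:
    neighbors[a].append(b)
    neighbors[b].append(a)

def f_node(i, x):                        # single-variable factors f_i: strictly convex, coercive
    return (i + 1.0) * (x - 1.0) ** 2

def f_edge(i, j, xi, xj):                # pairwise factors f_ij: convex
    return 0.5 * (xi - xj) ** 2

def initial_message(i, j):
    # One admissible initialization (see Assumption 1 below): J0_{i->j}(xj) = f_ij(0, xj).
    return lambda xj: f_edge(i, j, 0.0, xj)

def update_message(i, j, msgs):
    """Equation (2.2): new message i -> j, given the previous set of messages."""
    def J_new(xj):
        inner = lambda yi: (f_node(i, yi) + f_edge(i, j, yi, xj)
                            + sum(msgs[(u, i)](yi) for u in neighbors[i] if u != j))
        return minimize_scalar(inner).fun          # the offset kappa can be ignored
    return J_new

def estimate(i, msgs):
    """Equations (2.3)-(2.4): minimize the local objective b_i."""
    b = lambda xi: f_node(i, xi) + sum(msgs[(u, i)](xi) for u in neighbors[i])
    return minimize_scalar(b).x

msgs = {(i, j): initial_message(i, j) for i in V for j in neighbors[i]}
for t in range(4):                                 # synchronous sweeps
    msgs = {(i, j): update_message(i, j, msgs) for i in V for j in neighbors[i]}
    print(t + 1, [round(estimate(i, msgs), 4) for i in V])
```

For this toy instance the unique optimum is $x^* = (1, 1, 1)$, and the printed estimates approach it within a few sweeps.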
The min-sum algorithm requires an initial set of messages $\{J^{(0)}_{i \to j}(\cdot)\}$ at time $t = 0$. We make the following assumption regarding these messages:

Assumption 1. (Min-Sum Initialization) Assume that the initial messages $\{J^{(0)}_{i \to j}(\cdot)\}$ are chosen to be twice continuously differentiable and so that, for each message $J^{(0)}_{i \to j}(\cdot)$, there exists some $z_{i \to j} \in \mathbb{R}$ with

(2.5)    $\frac{d^2}{dx_j^2} J^{(0)}_{i \to j}(x_j) \geq \frac{\partial^2}{\partial x_j^2} f_{ij}(z_{i \to j}, x_j), \quad \forall\, x_j \in \mathbb{R}$.

Assumption 1 guarantees that the messages at time $t = 0$ are convex functions. Examining the update equation (2.2), it is clear that, by induction, this implies that all future messages are also convex functions. Similarly, since the functions $\{f_i(\cdot)\}$ are strictly convex and coercive, and the functions $\{f_{ij}(\cdot,\cdot)\}$ are convex, it follows that the optimization problem in the update equation (2.2) is well-defined and uniquely minimized. Finally, each local objective function $b^{(t)}_i(\cdot)$ must be strictly convex and coercive, and hence each estimate $x^{(t)}_i$ is uniquely defined by (2.4).

Assumption 1 also requires that the initial messages be sufficiently convex, in the sense of (2.5). As we will shortly demonstrate, this will be an important condition for our convergence results. For the moment, however, note that it is easy to select a set of initial messages satisfying Assumption 1. For example, one might choose $J^{(0)}_{i \to j}(x_j) = f_{ij}(0, x_j)$.

2.2 Convergence

Our goal is to understand conditions under which the min-sum algorithm converges to the optimal solution $x^*$, i.e.,

    $\lim_{t \to \infty} x^{(t)} = x^*$.

Consider the following diagonal dominance condition:
Definition 2. (Scaled Diagonal Dominance)
An objective function $F \colon \mathbb{R}^V \to \mathbb{R}$ is $(\lambda, w)$-scaled diagonally dominant if $\lambda$ is a scalar with $0 < \lambda < 1$ and $w \in \mathbb{R}^V$ is a vector with $w > 0$, so that for each $i \in V$ and all $x \in \mathbb{R}^V$,

    $\sum_{j \in V \setminus i} w_j \left| \frac{\partial^2}{\partial x_i \partial x_j} F(x) \right| \leq \lambda w_i \frac{\partial^2}{\partial x_i^2} F(x)$.
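For a quadratic objective $F(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with $A$ symmetric positive definite, the second derivatives are constant, $\frac{\partial^2}{\partial x_i \partial x_j} F(x) = A_{ij}$, so Definition 2 reduces to the matrix condition $\sum_{j \neq i} w_j |A_{ij}| \leq \lambda w_i A_{ii}$ for each $i$. The snippet below checks this condition numerically for a matrix of our own choosing; it is a sanity-check sketch rather than anything prescribed by the paper.

```python
import numpy as np

def scaled_diag_dominance_lambda(A, w):
    """Smallest lambda with sum_{j != i} w_j |A_ij| <= lambda * w_i * A_ii for every i;
    the quadratic F(x) = 0.5 x' A x is (lambda, w)-scaled diagonally dominant iff
    this value is strictly less than 1."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    off = np.abs(A) - np.diag(np.diag(np.abs(A)))      # off-diagonal magnitudes
    return max((off[i] @ w) / (w[i] * A[i, i]) for i in range(len(w)))

# Illustrative example (assumption): a chain-structured precision matrix.
A = np.array([[2.0, 0.6, 0.0],
              [0.6, 2.0, 0.6],
              [0.0, 0.6, 2.0]])
w = np.ones(3)
lam = scaled_diag_dominance_lambda(A, w)
print(lam, lam < 1.0)      # approximately 0.6, so the condition of Definition 2 holds
```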
Our main convergence result is as follows:

Theorem 1.
Consider a pairwise separable convex program with an objective function that is $(\lambda, w)$-scaled diagonally dominant. Assume that the min-sum algorithm is initialized in accordance with Assumption 1. Define the constant

    $K = \frac{1}{M} \, \frac{\max_u w_u}{\min_u w_u}$.

Then, the iterates of the min-sum algorithm satisfy

    $\|x^{(t)} - x^*\|_\infty \leq K \frac{\lambda^t}{1 - \lambda} \sum_{(u,v) \in \vec{E}} \left| \frac{d}{dx_v} J^{(0)}_{u \to v}(x^*_v) - \frac{\partial}{\partial x_v} f_{uv}(x^*_u, x^*_v) \right|$.

Hence, $\lim_{t \to \infty} x^{(t)} = x^*$.
The proof of Theorem 1 is provided in Section 2.4.

We can compare Theorem 1 to existing results on min-sum convergence in the case where the objective function $F(\cdot)$ is quadratic. Rusmevichientong and Van Roy [7] developed abstract conditions for convergence, but these conditions are difficult to verify in practical instances. Convergence has also been established in special cases arising in certain applications [11, 12].

More closely related to our current work, Weiss and Freeman [6] established convergence when the factors $\{f_i(\cdot), f_{ij}(\cdot,\cdot)\}$ are quadratic, the single-variable factors $\{f_i(\cdot)\}$ are strictly convex, and the pairwise factors $\{f_{ij}(\cdot,\cdot)\}$ are convex and diagonally dominated, i.e.,

    $\left| \frac{\partial^2}{\partial x_i \partial x_j} f_{ij}(x_i, x_j) \right| \leq \frac{\partial^2}{\partial x_i^2} f_{ij}(x_i, x_j), \quad \forall\, (i,j) \in E,\ x_i, x_j \in \mathbb{R}$.

The results of Malioutov et al. [10] and our prior work [9] remove the diagonal dominance assumption. However, all of these results are special cases of Theorem 1. In particular, if a quadratic objective function $F(\cdot)$ decomposes into pairwise factors so that the single-variable factors are quadratic and strictly convex, and the pairwise factors are quadratic and convex, then $F(\cdot)$ must be scaled diagonally dominant. This can be established as a consequence of the Perron-Frobenius theorem [10]. Finally, as we will see in Section 3, Theorem 1 also generalizes beyond pairwise decompositions.

2.3 The Computation Tree

In order to prove Theorem 1, we first introduce the notion of the computation tree. This is a useful device in the analysis of message-passing algorithms, originally introduced by Wiberg [13]. Given a vertex $r \in V$ and a time $t$, the computation tree defines an optimization problem that is constructed by "unrolling" all the optimizations involved in the computation of the min-sum estimate $x^{(t)}_r$.

Formally, the computation tree is a graph $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ where each vertex $i \in \mathcal{V}$ is labeled by a vertex $\tilde{i} \in V$ in the original graph, through a mapping $\sigma \colon \mathcal{V} \to V$. This mapping is required to preserve the edge structure of the graph, so that if $(i,j) \in \mathcal{E}$, then $(\sigma_i, \sigma_j) \in E$. Given a vertex $i \in \mathcal{V}$, we will abuse notation and refer to the corresponding vertex $\sigma_i \in V$ in the original graph simply by $i$.

Fixing a vertex $r \in V$ and a time $t$, the computation tree rooted at $r$ and of depth $t$ is defined in an iterative fashion. Initially, the tree consists of a single root vertex corresponding to $r$. At each subsequent step, the leaves in the computation tree are examined. Given a leaf $i$ with a parent $j$, a vertex $u$ and an edge $(u, i)$ are added to the computation tree corresponding to each neighbor of $i$ in the original graph, excluding $j$. This process is repeated for $t$ steps (a code sketch of this construction follows the factor list below). An example of the resulting graph is illustrated in Figure 1.

[Figure 1: An example computation tree with $t = 3$. The vertices in the computation tree are labeled according to the corresponding vertex in the original graph.]

Given the graph $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ and the correspondence mapping $\sigma$, define a decision variable $x_i$ for each vertex $i \in \mathcal{V}$. Define a pairwise separable objective function $F_\mathcal{T} \colon \mathbb{R}^\mathcal{V} \to \mathbb{R}$ by considering factors of the form:

1. For each $i \in \mathcal{V}$, add a single-variable factor $f_i(x_i)$ by setting $f_i(x_i) \triangleq f_{\sigma_i}(x_i)$.
2. For each $(i,j) \in \mathcal{E}$, add a pairwise factor $f_{ij}(x_i, x_j)$ by setting $f_{ij}(x_i, x_j) \triangleq f_{\sigma_i \sigma_j}(x_i, x_j)$.
3. For each $i \in \mathcal{V}$ that is a leaf vertex with parent $j$, add a single-variable factor $J^{(0)}_{u \to \sigma_i}(x_i)$ for each neighbor $u \in N(\sigma_i) \setminus \sigma_j$ of $i$ in the original graph, excluding $j$.
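The following sketch makes the unrolling explicit: it builds the depth-$t$ computation tree rooted at $r$ and records, for each tree vertex, the label $\sigma_i$ in the original graph, its parent, and the set of depth-$t$ leaves. The graph used and the data-structure choices are illustrative assumptions; the tree and its labels are exactly what is needed to assemble the objective $F_\mathcal{T}$ from the factor list above.

```python
def computation_tree(neighbors, root, depth):
    """Unroll a graph (given as a neighbor map) into the computation tree rooted
    at `root` with `depth` levels.  Each tree vertex is a dict holding its label
    sigma_i in the original graph and the index of its parent in the tree."""
    tree = [{"label": root, "parent": None}]
    frontier = [0]                                   # indices of the current leaves
    for _ in range(depth):
        next_frontier = []
        for leaf in frontier:
            i, parent = tree[leaf]["label"], tree[leaf]["parent"]
            parent_label = tree[parent]["label"] if parent is not None else None
            for u in neighbors[i]:
                if u == parent_label:                # exclude the parent's label
                    continue
                tree.append({"label": u, "parent": leaf})
                next_frontier.append(len(tree) - 1)
        frontier = next_frontier
    return tree, frontier                            # frontier holds the depth-t leaves

# Illustrative example (assumption): the 3-cycle on {0, 1, 2}, rooted at 0, t = 3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
tree, leaves = computation_tree(neighbors, root=0, depth=3)
print(len(tree), [v["label"] for v in tree])         # 7 vertices: two paths from the root
```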
Now, let $\tilde{x}$ be the optimal solution to the minimization of the computation tree objective $F_\mathcal{T}(\cdot)$. By inductively examining the operation of the min-sum algorithm, it is easy to establish that the component $\tilde{x}_r$ of this solution at the root of the tree is precisely the min-sum estimate $x^{(t)}_r$.

The following lemma establishes that the computation tree inherits the scaled diagonal dominance property from the original objective function.

Lemma 1.
Consider a pairwise separable convex program with an objective function that is $(\lambda, w)$-scaled diagonally dominant. Assume that the min-sum algorithm is initialized in accordance with Assumption 1, and let $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ be a computation tree associated with this program. Then, the computation tree objective function $F_\mathcal{T}(\cdot)$ is also $(\lambda, w)$-scaled diagonally dominant.

Proof.
Given a vertex $i \in \mathcal{V}$, let $N_\mathcal{V}(i)$ be its neighborhood in the computation tree, and let $N(i)$ be the neighborhood of the corresponding vertex in the original graph. If $i \in \mathcal{V}$ is an interior vertex of the computation tree, then

    $\sum_{u \in \mathcal{V} \setminus i} w_u \left| \frac{\partial^2}{\partial x_i \partial x_u} F_\mathcal{T}(x) \right|
    = \sum_{u \in N_\mathcal{V}(i)} w_u \left| \frac{\partial^2}{\partial x_i \partial x_u} f_{iu}(x_i, x_u) \right|
    \leq \lambda w_i \left( \frac{\partial^2}{\partial x_i^2} f_i(x_i) + \sum_{u \in N_\mathcal{V}(i)} \frac{\partial^2}{\partial x_i^2} f_{iu}(x_i, x_u) \right)
    = \lambda w_i \frac{\partial^2}{\partial x_i^2} F_\mathcal{T}(x)$,

where the inequality follows from the scaled diagonal dominance of the original objective function $F(\cdot)$.

Similarly, if $i$ is a leaf vertex with parent $j$,

    $\sum_{u \in \mathcal{V} \setminus i} w_u \left| \frac{\partial^2}{\partial x_i \partial x_u} F_\mathcal{T}(x) \right|
    = w_j \left| \frac{\partial^2}{\partial x_i \partial x_j} f_{ij}(x_i, x_j) \right|
    \leq w_j \left| \frac{\partial^2}{\partial x_i \partial x_j} f_{ij}(x_i, x_j) \right| + \sum_{u \in N(i) \setminus j} w_u \left| \frac{\partial^2}{\partial x_i \partial x_u} f_{iu}(x_i, z_{u \to i}) \right|$
    $\leq \lambda w_i \left( \frac{\partial^2}{\partial x_i^2} f_i(x_i) + \frac{\partial^2}{\partial x_i^2} f_{ij}(x_i, x_j) + \sum_{u \in N(i) \setminus j} \frac{\partial^2}{\partial x_i^2} f_{iu}(x_i, z_{u \to i}) \right)$
    $\leq \lambda w_i \left( \frac{\partial^2}{\partial x_i^2} f_i(x_i) + \frac{\partial^2}{\partial x_i^2} f_{ij}(x_i, x_j) + \sum_{u \in N(i) \setminus j} \frac{d^2}{dx_i^2} J^{(0)}_{u \to i}(x_i) \right)
    = \lambda w_i \frac{\partial^2}{\partial x_i^2} F_\mathcal{T}(x)$.

Here, the second inequality follows from the scaled diagonal dominance of the original objective function $F(\cdot)$, and the third inequality follows from Assumption 1.

2.4 Proof of Theorem 1

In order to prove Theorem 1, we will study the evolution of the min-sum algorithm under a set of linear perturbations. Consider an arbitrary vector $p \in \mathbb{R}^{\vec{E}}$ with one component $p_{i \to j}$ for each $i \in V$ and $j \in N(i)$. Given an arbitrary vector $p$, define $\{J^{(t)}_{i \to j}(\cdot, p)\}$ to be the set of messages that evolve according to

(2.6)    $J^{(0)}_{i \to j}(x_j, p) = J^{(0)}_{i \to j}(x_j) + p_{i \to j} x_j$,
         $J^{(t+1)}_{i \to j}(x_j, p) = \min_{y_i} \Big[ f_i(y_i) + f_{ij}(y_i, x_j) + \sum_{u \in N(i) \setminus j} J^{(t)}_{u \to i}(y_i, p) \Big] + \kappa^{(t+1)}_{i \to j}$.

Similarly, define $\{b^{(t)}_i(\cdot, p)\}$ and $\{x^{(t)}_i(p)\}$ to be the resulting local objective functions and optimal value estimates under this perturbation:

    $b^{(t)}_i(x_i, p) = f_i(x_i) + \sum_{u \in N(i)} J^{(t)}_{u \to i}(x_i, p)$,
    $x^{(t)}_i(p) = \operatorname*{argmin}_{y_i} b^{(t)}_i(y_i, p)$.

The following simple lemma gives a particular choice of $p$ for which the min-sum algorithm yields the optimal solution at every time step.

Lemma 2.
Define the vector $p^* \in \mathbb{R}^{\vec{E}}$ by setting, for each $i \in V$ and $j \in N(i)$,

    $p^*_{i \to j} = \frac{\partial}{\partial x_j} f_{ij}(x^*_i, x^*_j) - \frac{d}{dx_j} J^{(0)}_{i \to j}(x^*_j)$.

Then, at every time $t \geq 0$,

(2.7)    $\frac{\partial}{\partial x_j} J^{(t)}_{i \to j}(x^*_j, p^*) = \frac{\partial}{\partial x_j} f_{ij}(x^*_i, x^*_j)$,

and $x^{(t)}_j(p^*) = x^*_j$.

Proof.
Note that the first-order optimality conditions for $F(x)$ at $x^*$ imply that, for each $j \in V$,

    $\frac{d}{dx_j} f_j(x^*_j) + \sum_{i \in N(j)} \frac{\partial}{\partial x_j} f_{ij}(x^*_i, x^*_j) = 0$.

If (2.7) holds at time $t$, this is exactly the first-order optimality condition for the minimization of $b^{(t)}_j(\cdot, p^*)$, thus $x^{(t)}_j(p^*) = x^*_j$.

Clearly (2.7) holds at time $t = 0$. Assume it holds at time $t \geq 0$. Then, when $x_j = x^*_j$, the minimizing value of $y_i$ in (2.6) is $x^*_i$. Hence, (2.7) holds at time $t + 1$.

Next, we will bound the sensitivity of the estimate $x^{(t)}_i(p)$ to the choice of $p$. The main technique employed here is analysis of the computation tree described in Section 2.3. In particular, the perturbation impacts the computation tree only through the leaf vertices at depth $t$. The scaled diagonal dominance property of the computation tree, provided by Lemma 1, can then be used to guarantee that this impact is diminishing in $t$.

Lemma 3.
We have, for all $p \in \mathbb{R}^{\vec{E}}$, $r \in V$, $(u,v) \in \vec{E}$, and $t \geq 0$,

    $\left| \frac{\partial}{\partial p_{u \to v}} x^{(t)}_r(p) \right| \leq K \frac{\lambda^t}{1 - \lambda}$.

Proof.
Fix $r \in V$, and let $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ be the computation tree rooted at $r$ after $t$ time steps. Let $F_\mathcal{T}(x, p)$ be the objective of this computation tree, and let

    $\tilde{x}(p) = \operatorname*{argmin}_x F_\mathcal{T}(x, p)$,

so that $\tilde{x}_r(p) = x^{(t)}_r(p)$. By the first-order optimality conditions, for any $j \in \mathcal{V}$,

    $\frac{\partial}{\partial x_j} F_\mathcal{T}(\tilde{x}(p), p) = 0$.

If $j$ is an interior vertex of $\mathcal{T}$, this becomes

(2.8)    $\frac{d}{dx_j} f_j(\tilde{x}_j(p)) + \sum_{i \in N(j)} \frac{\partial}{\partial x_j} f_{ij}(\tilde{x}_i(p), \tilde{x}_j(p)) = 0$.

If $j$ is a leaf with parent $u$, we have

(2.9)    $\frac{d}{dx_j} f_j(\tilde{x}_j(p)) + \frac{\partial}{\partial x_j} f_{uj}(\tilde{x}_u(p), \tilde{x}_j(p)) + \sum_{i \in N(j) \setminus u} \left( \frac{d}{dx_j} J^{(0)}_{i \to j}(\tilde{x}_j(p)) + p_{i \to j} \right) = 0$.

Now, fix some directed edge $(a, b)$, and differentiate (2.8)-(2.9) with respect to $p_{a \to b}$. We have, for an interior vertex $j$,

    $0 = \frac{d^2}{dx_j^2} f_j(\tilde{x}_j(p)) \frac{\partial \tilde{x}_j(p)}{\partial p_{a \to b}}
    + \sum_{i \in N(j)} \frac{\partial^2}{\partial x_j^2} f_{ij}(\tilde{x}_i(p), \tilde{x}_j(p)) \frac{\partial \tilde{x}_j(p)}{\partial p_{a \to b}}
    + \sum_{i \in N(j)} \frac{\partial^2}{\partial x_i \partial x_j} f_{ij}(\tilde{x}_i(p), \tilde{x}_j(p)) \frac{\partial \tilde{x}_i(p)}{\partial p_{a \to b}}$,

and for a leaf vertex $j$ with parent $u$,

    $0 = \frac{d^2}{dx_j^2} f_j(\tilde{x}_j(p)) \frac{\partial \tilde{x}_j(p)}{\partial p_{a \to b}}
    + \frac{\partial^2}{\partial x_j^2} f_{uj}(\tilde{x}_u(p), \tilde{x}_j(p)) \frac{\partial \tilde{x}_j(p)}{\partial p_{a \to b}}
    + \frac{\partial^2}{\partial x_u \partial x_j} f_{uj}(\tilde{x}_u(p), \tilde{x}_j(p)) \frac{\partial \tilde{x}_u(p)}{\partial p_{a \to b}}
    + \sum_{i \in N(j) \setminus u} \left( \frac{d^2}{dx_j^2} J^{(0)}_{i \to j}(\tilde{x}_j(p)) \frac{\partial \tilde{x}_j(p)}{\partial p_{a \to b}} + \mathbb{I}_{\{(a,b) = (i,j)\}} \right)$.

We can write this system of equations in matrix form, as

(2.10)    $\Gamma v^{a \to b} + h^{a \to b} = 0$.

Here, $v^{a \to b} \in \mathbb{R}^\mathcal{V}$ is a vector with components

    $v^{a \to b}_j = \frac{\partial}{\partial p_{a \to b}} \tilde{x}_j(p)$.

The vector $h^{a \to b} \in \mathbb{R}^\mathcal{V}$ has components

    $h^{a \to b}_j = \mathbb{I}_{\{ j \text{ is a leaf vertex of type } a \text{ with a parent of type } b \}}$.

The symmetric matrix $\Gamma \in \mathbb{R}^{\mathcal{V} \times \mathcal{V}}$ has components as follows:

1. If $j$ is an interior vertex,

    $\Gamma_{jj} = \frac{d^2}{dx_j^2} f_j(\tilde{x}_j(p)) + \sum_{i \in N(j)} \frac{\partial^2}{\partial x_j^2} f_{ij}(\tilde{x}_i(p), \tilde{x}_j(p))$.

2. If $j$ is an interior vertex and $i \in N(j)$,

    $\Gamma_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} f_{ij}(\tilde{x}_i(p), \tilde{x}_j(p))$.

3. If $j$ is a leaf vertex with parent $u$,

    $\Gamma_{jj} = \frac{d^2}{dx_j^2} f_j(\tilde{x}_j(p)) + \frac{\partial^2}{\partial x_j^2} f_{uj}(\tilde{x}_u(p), \tilde{x}_j(p)) + \sum_{i \in N(j) \setminus u} \frac{d^2}{dx_j^2} J^{(0)}_{i \to j}(\tilde{x}_j(p))$,

    $\Gamma_{uj} = \frac{\partial^2}{\partial x_u \partial x_j} f_{uj}(\tilde{x}_u(p), \tilde{x}_j(p))$.

4. All other entries of $\Gamma$ are zero.

Note that $\Gamma = \nabla^2_x F_\mathcal{T}(\tilde{x}(p), p)$. Then, Lemma 1 implies that

(2.11)    $\sum_{i \in \mathcal{V} \setminus j} w_i |\Gamma_{ij}| \leq \lambda w_j \Gamma_{jj}$.

Define, for vectors $x \in \mathbb{R}^\mathcal{V}$, the weighted sup-norm

    $\|x\|^w_\infty = \max_{j \in \mathcal{V}} |x_j| / w_j$.

For a linear operator $A \colon \mathbb{R}^\mathcal{V} \to \mathbb{R}^\mathcal{V}$, the corresponding induced operator norm is given by

    $\|A\|^w_\infty = \max_{j \in \mathcal{V}} \frac{1}{w_j} \sum_{i \in \mathcal{V}} w_i |A_{ji}|$.

Define the matrices

    $D = \operatorname{diag}(\Gamma)$,  $R = I - D^{-1} \Gamma$.

Then, (2.11) implies that $\|R\|^w_\infty \leq \lambda < 1$. Hence, the matrix $I - R = D^{-1} \Gamma$ is invertible, and

    $(D^{-1} \Gamma)^{-1} = (I - R)^{-1} = \sum_{s=0}^\infty R^s$.

Examining the linear equation (2.10), we have

    $v^{a \to b} = -\Gamma^{-1} h^{a \to b} = -(I - R)^{-1} D^{-1} h^{a \to b} = -\sum_{s=0}^\infty R^s D^{-1} h^{a \to b}$.

We are interested in bounding the value of the component $v^{a \to b}_r$ (recall that $v^{a \to b}_r = \partial x^{(t)}_r(p) / \partial p_{a \to b}$). Hence, we have

    $v^{a \to b}_r = -\sum_{s=0}^\infty \left[ R^s D^{-1} h^{a \to b} \right]_r$.

Since $h^{a \to b}$ is zero on interior vertices, and any leaf vertex is at distance $t$ from the root $r$, we have

    $\left[ R^s D^{-1} h^{a \to b} \right]_r = 0, \quad \forall\, s < t$.

Thus,

    $v^{a \to b}_r = -\sum_{s=t}^\infty \left[ R^s D^{-1} h^{a \to b} \right]_r$.

Then,

    $\frac{|v^{a \to b}_r|}{w_r}
    \leq \left\| \sum_{s=t}^\infty R^s D^{-1} h^{a \to b} \right\|^w_\infty
    \leq \sum_{s=t}^\infty \|R^s\|^w_\infty \left\| D^{-1} h^{a \to b} \right\|^w_\infty
    \leq \frac{\lambda^t}{1 - \lambda} \left\| D^{-1} h^{a \to b} \right\|^w_\infty
    \leq \frac{\lambda^t}{1 - \lambda} \max_{i \in V} \sup_{x \in \mathbb{R}^V} \left( w_i \frac{\partial^2}{\partial x_i^2} F(x) \right)^{-1}
    \leq \frac{1}{M} \frac{\lambda^t}{1 - \lambda} \max_{i \in V} \frac{1}{w_i}$.

Since $w_r \leq \max_u w_u$, it follows that $|v^{a \to b}_r| \leq K \lambda^t / (1 - \lambda)$, as required.

The following lemma combines the results from Lemmas 2 and 3. Theorem 1 follows by taking $p = 0$.

Lemma 4.
Given an arbitrary vector $p \in \mathbb{R}^{\vec{E}}$,

    $\|x^{(t)}(p) - x^*\|_\infty \leq K \frac{\lambda^t}{1 - \lambda} \sum_{(u,v) \in \vec{E}} |p_{u \to v} - p^*_{u \to v}|$.

Proof.
For any $j \in V$, define

    $g^{(t)}_j(\theta) = x^{(t)}_j(\theta p + (1 - \theta) p^*)$.

We have, from Lemma 2,

    $x^{(t)}_j(p) - x^*_j = x^{(t)}_j(p) - x^{(t)}_j(p^*) = g^{(t)}_j(1) - g^{(t)}_j(0)$.

By the mean value theorem and Lemma 3,

    $|x^{(t)}_j(p) - x^*_j|
    \leq \sup_{\theta \in [0,1]} \left| \frac{d}{d\theta} g^{(t)}_j(\theta) \right|
    \leq \sup_{\theta \in [0,1]} \sum_{(u,v) \in \vec{E}} \left| \frac{\partial}{\partial p_{u \to v}} x^{(t)}_j(\theta p + (1 - \theta) p^*) \right| |p_{u \to v} - p^*_{u \to v}|
    \leq K \frac{\lambda^t}{1 - \lambda} \sum_{(u,v) \in \vec{E}} |p_{u \to v} - p^*_{u \to v}|$.

3 General Separable Convex Programs
In this section we consider convergence of the min-sum algorithm for more general separable convex programs. In particular, consider a vector of real-valued decision variables $x \in \mathbb{R}^V$, indexed by a finite set $V$, and a hypergraph $(V, \mathcal{C})$, where the set $\mathcal{C}$ is a collection of subsets (or "hyperedges") of the vertex set $V$.

Definition 3. (General Separable Convex Program)
A general separable convex program is an optimization problem of the form

(3.1)    minimize    $F(x) = \sum_{i \in V} f_i(x_i) + \sum_{C \in \mathcal{C}} f_C(x_C)$
         subject to  $x \in \mathbb{R}^V$,

where the factors $\{f_i(\cdot)\}$ are strictly convex, coercive, and twice continuously differentiable, the factors $\{f_C(\cdot)\}$ are convex and twice continuously differentiable, and

    $M \triangleq \min_{i \in V} \inf_{x \in \mathbb{R}^V} \frac{\partial^2}{\partial x_i^2} F(x) > 0$.

In this setting, the min-sum algorithm operates by passing messages between vertices and hyperedges. In particular, denote the set of hyperedges neighboring a vertex $i \in V$ by $N_f(i) = \{C \in \mathcal{C} : i \in C\}$. The min-sum update equations take the form

(3.2)    $J^{(t+1)}_{i \to C}(x_i) = f_i(x_i) + \sum_{C' \in N_f(i) \setminus C} J^{(t)}_{C' \to i}(x_i) + \kappa^{(t+1)}_{i \to C}$,
         $J^{(t+1)}_{C \to i}(x_i) = \min_{y_{C \setminus i}} \Big[ f_C(x_i, y_{C \setminus i}) + \sum_{i' \in C \setminus i} J^{(t+1)}_{i' \to C}(y_{i'}) \Big] + \kappa^{(t+1)}_{C \to i}$.

Local objective functions and estimates of the optimal solution are defined by

    $b^{(t)}_i(x_i) = f_i(x_i) + \sum_{C \in N_f(i)} J^{(t)}_{C \to i}(x_i)$,
    $x^{(t)}_i = \operatorname*{argmin}_{y_i} b^{(t)}_i(y_i)$.
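As an illustration of the update (3.2), the sketch below passes messages between variables and a single three-variable hyperedge, again representing messages as functions and delegating the multivariate inner minimization to scipy.optimize.minimize. The instance, the initialization, and the numerical minimizer are assumptions made only for this example.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Illustrative instance (assumption): V = {0, 1, 2} and one hyperedge C = (0, 1, 2).
V = [0, 1, 2]
C = (0, 1, 2)

def f_node(i, x):                         # strictly convex single-variable factors
    return (x - i) ** 2

def f_C(xC):                              # convex factor coupling all of C
    return 0.25 * (xC[0] + xC[1] + xC[2]) ** 2

def msg_var_to_factor(i, factor_msgs):
    # J_{i->C}: each vertex here touches only the one hyperedge C, so the sum
    # over C' in N_f(i) \ C in (3.2) is empty and the message is just f_i.
    return lambda xi: f_node(i, xi)

def msg_factor_to_var(i, var_msgs):
    """J_{C->i}(x_i): minimize f_C plus incoming variable messages over y_{C \\ i}."""
    others = [k for k in C if k != i]
    def J(xi):
        def inner(y):                     # y holds the coordinates of C \ i
            xC = {i: xi, **{k: y[n] for n, k in enumerate(others)}}
            return (f_C([xC[k] for k in C])
                    + sum(var_msgs[k](y[n]) for n, k in enumerate(others)))
        return minimize(inner, x0=np.zeros(len(others))).fun
    return J

# Initialization analogous to the pairwise case: fix the other coordinates at zero.
factor_msgs = {i: (lambda xi, i=i: f_C([xi if k == i else 0.0 for k in C])) for i in V}
for t in range(3):
    var_msgs = {i: msg_var_to_factor(i, factor_msgs) for i in V}
    factor_msgs = {i: msg_factor_to_var(i, var_msgs) for i in V}
    est = [minimize_scalar(lambda xi, i=i: f_node(i, xi) + factor_msgs[i](xi)).x
           for i in V]
    print(t + 1, [round(float(v), 3) for v in est])
```

Because this toy factor graph is a tree (one hyperedge connected to three variables), the estimates match the exact minimizer, approximately $(-0.429, 0.571, 1.571)$, after the first iteration.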
We will make the following assumption on the initial messages:

Assumption 2. (Min-Sum Initialization)
Assume that the initial messages $\{J^{(0)}_{C \to j}(\cdot)\}$ are chosen to be twice continuously differentiable and so that, for each message $J^{(0)}_{C \to j}(\cdot)$, there exists some $z_{C \to j} \in \mathbb{R}^{C \setminus j}$ with

    $\frac{d^2}{dx_j^2} J^{(0)}_{C \to j}(x_j) \geq \frac{\partial^2}{\partial x_j^2} f_C(x_j, z_{C \to j}), \quad \forall\, x_j \in \mathbb{R}$.

Then, we have the following analog of Theorem 1:
Theorem 2.
Consider a general separable convex program. Assume that either:

(i) The objective function $F(x)$ is $(\lambda, w)$-scaled diagonally dominant, and each pair of vertices $i, j \in V$ participates in at most one common factor. That is,

    $|\{C \in \mathcal{C} : \{i, j\} \subset C\}| \leq 1$, for all distinct $i, j \in V$.

(ii) The factors $\{f_C(\cdot)\}$ are individually scaled diagonally dominant, in the sense that there exists a scalar $\lambda \in (0,1)$ and a vector $w \in \mathbb{R}^V$, with $w > 0$, so that for all $C \in \mathcal{C}$, $i \in C$, and $x_C \in \mathbb{R}^C$,

    $\sum_{j \in C \setminus i} w_j \left| \frac{\partial^2}{\partial x_i \partial x_j} f_C(x_C) \right| \leq \lambda w_i \frac{\partial^2}{\partial x_i^2} f_C(x_C)$.

Assume that the min-sum algorithm is initialized in accordance with Assumption 2. Define the constant

    $K = \frac{1}{M} \, \frac{\max_u w_u}{\min_u w_u}$.

Then, the iterates of the min-sum algorithm satisfy

    $\|x^{(t)} - x^*\|_\infty \leq K \frac{\lambda^t}{1 - \lambda} \sum_{C \in \mathcal{C}} \sum_{v \in C} \left| \frac{d}{dx_v} J^{(0)}_{C \to v}(x^*_v) - \frac{\partial}{\partial x_v} f_C(x^*_C) \right|$.

Hence, $\lim_{t \to \infty} x^{(t)} = x^*$.

Proof.
This result can be proved using the same method as Theorem 1. The main modification required is the development of a suitable analog of Lemma 1. In the general case, scaled diagonal dominance of the computation tree does not follow from scaled diagonal dominance of the objective function $F(x)$. However, it is easy to verify that either of the hypotheses (i) or (ii) implies scaled diagonal dominance of the computation tree. The balance of the proof proceeds as in Section 2.4.

4 Asynchronous Convergence
The convergence results of Theorems 1 and 2 assumed a synchronous model of computation. That is, each message is updated at every time step, in parallel. The min-sum update equations (2.2) and (3.2) are naturally decentralized, however. If we consider the application of the min-sum algorithm in distributed contexts, it is necessary to consider convergence under an asynchronous model of computation. In this section, we establish that Theorems 1 and 2 extend to an asynchronous setting.

Without loss of generality, consider the pairwise case. Assume that there is a processor associated with each vertex $i$ in the graph, and that this processor is responsible for computing the message $J_{i \to j}(\cdot)$ for each neighbor $j$ of vertex $i$. Each processor occasionally communicates its messages to neighboring processors, and occasionally computes new messages based on the most recent messages it has received. Define $T^i$ to be the set of times at which new messages are computed at vertex $i$, and define $0 \leq \tau_{j \to i}(t) \leq t$ to be the last time at which the processor at vertex $j$ communicated to the processor at vertex $i$. Then, the messages evolve according to

    $J^{(t+1)}_{i \to j}(x_j) = \min_{y_i} \Big[ f_i(y_i) + f_{ij}(y_i, x_j) + \sum_{u \in N(i) \setminus j} J^{(\tau_{u \to i}(t))}_{u \to i}(y_i) \Big] + \kappa^{(t+1)}_{i \to j}$,

if $t \in T^i$, and $J^{(t+1)}_{i \to j}(x_j) = J^{(t)}_{i \to j}(x_j)$, otherwise.
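To make the asynchronous protocol concrete, the sketch below reuses the same toy chain instance as the synchronous sketch in Section 2, but lets each processor update at its own (here, random) times and read only stale copies of its neighbors' messages. The random schedule and the communication rule are illustrative assumptions standing in for the sets $T^i$ and the times $\tau_{j \to i}(t)$.

```python
import random
from scipy.optimize import minimize_scalar

random.seed(0)

# Same illustrative chain instance as in Section 2 (an assumption of this sketch).
V = [0, 1, 2]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
f_node = lambda i, x: (i + 1.0) * (x - 1.0) ** 2
f_edge = lambda yi, xj: 0.5 * (yi - xj) ** 2

def new_message(i, j, received):
    """One min-sum update at processor i, using the stale messages it last received."""
    incoming = [received[(u, i)] for u in neighbors[i] if u != j]   # snapshot at update time
    def J(xj):
        inner = lambda yi: f_node(i, yi) + f_edge(yi, xj) + sum(g(yi) for g in incoming)
        return minimize_scalar(inner).fun
    return J

computed = {(i, j): (lambda xj: f_edge(0.0, xj)) for i in V for j in neighbors[i]}
received = dict(computed)                  # stale copies held by the receiving processors

for t in range(30):
    i = random.choice(V)                   # this step belongs to T^i for vertex i only
    for j in neighbors[i]:
        computed[(i, j)] = new_message(i, j, received)
    if random.random() < 0.5:              # communication happens only occasionally,
        edge = random.choice(list(computed))            # and possibly out of order
        received[edge] = computed[edge]

estimates = [minimize_scalar(lambda x, i=i: f_node(i, x)
                             + sum(received[(u, i)](x) for u in neighbors[i])).x for i in V]
print([round(v, 4) for v in estimates])    # drifts toward the optimum (1, 1, 1)
```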
We will make the following assumption [14]:

Assumption 3. (Total Asynchronism) Assume that each set $T^i$ is infinite, and that if $\{t_k\}$ is a sequence in $T^i$ tending to infinity, then $\lim_{k \to \infty} \tau_{i \to j}(t_k) = \infty$, for each neighbor $j \in N(i)$.

Total asynchronism is a very mild assumption. It guarantees that each component is updated infinitely often, and that processors eventually communicate with neighboring processors. It allows for arbitrary delays in communication, and even the out-of-order arrival of messages between processors.

Theorem 1 can be extended to the totally asynchronous setting. To see this, note that we can repeat the construction of the computation tree in Section 2.3. As in the synchronous case, the initial messages only impact the leaves of the computation tree. The total asynchronism assumption guarantees that these leaves are, eventually, arbitrarily far away from the root of the computation tree. The arguments in Lemma 3 then imply that the optimal value at the root of the computation tree is insensitive to the choice of initial messages. Convergence follows, as in Section 2.4.

The scaled diagonal dominance requirement of our convergence result is similar to conditions required for the totally asynchronous convergence of other optimization algorithms. Consider, for example, a decentralized coordinate descent algorithm. Here, the processor associated with vertex $i$ maintains an estimate $x^{(t)}_i$ of the $i$th component of the optimal solution at time $t$. These estimates are updated according to

    $x^{(t+1)}_i = \operatorname*{argmin}_{y_i} \Big[ f_i(y_i) + \sum_{u \in N(i)} f_{ui}\big(x^{(\tau_{u \to i}(t))}_u, y_i\big) \Big]$,

if $t \in T^i$, and $x^{(t+1)}_i = x^{(t)}_i$, otherwise. Similarly, consider a decentralized gradient method, where

    $x^{(t+1)}_i = x^{(t)}_i - \alpha \left( \frac{d}{dx_i} f_i\big(x^{(t)}_i\big) + \sum_{u \in N(i)} \frac{\partial}{\partial x_i} f_{ui}\big(x^{(\tau_{u \to i}(t))}_u, x^{(t)}_i\big) \right)$,

if $t \in T^i$, and $x^{(t+1)}_i = x^{(t)}_i$, otherwise, for some small positive stepsize $\alpha$. These methods are not guaranteed to converge for arbitrary pairwise separable convex optimization problems. Typically, some sort of diagonal dominance condition is needed [14].

5 Implementation

The convergence theory we have presented elucidates properties of the min-sum algorithm and builds a bridge to the more established areas of convex analysis and optimization. However, except in very special cases, the algorithm as we have formulated it cannot be implemented on a digital computer, because the messages that are computed and stored are functions over continuous domains. In this section, we present two variations that can be implemented to approximate the behavior of the min-sum algorithm. For simplicity, we restrict attention to the case of synchronous min-sum for pairwise separable convex programs.

Our first approach approximates messages using quadratic functions and can be viewed as a hybrid between the min-sum algorithm and Newton's method. It is easy to show that, if the single-variable factors $\{f_i(\cdot)\}$ are positive definite quadratics and the pairwise factors $\{f_{ij}(\cdot,\cdot)\}$ are positive semidefinite quadratics, then min-sum updates map quadratic messages to quadratic messages. The algorithm we propose here maintains a running estimate $\tilde{x}^{(t)}$ of the optimal solution, and at each time approximates each factor by a second-order Taylor expansion. In particular, let $\tilde{f}^{(t)}_i(\cdot)$ be the second-order Taylor expansion of $f_i(\cdot)$ around $\tilde{x}^{(t)}_i$, and let $\tilde{f}^{(t)}_{ij}(\cdot,\cdot)$ be the second-order Taylor expansion of $f_{ij}(\cdot,\cdot)$ around $(\tilde{x}^{(t)}_i, \tilde{x}^{(t)}_j)$.
Quadratic messages are updated according to

(5.1)    $J^{(t+1)}_{i \to j}(x_j) = \min_{y_i} \Big[ \tilde{f}^{(t)}_i(y_i) + \tilde{f}^{(t)}_{ij}(y_i, x_j) + \sum_{u \in N(i) \setminus j} J^{(t)}_{u \to i}(y_i) \Big] + \kappa^{(t+1)}_{i \to j}$,

where running estimates of the optimal solution are generated according to

(5.2)    $\tilde{x}^{(t+1)}_i = \operatorname*{argmin}_{y_i} \Big[ \tilde{f}^{(t+1)}_i(y_i) + \sum_{u \in N(i)} J^{(t+1)}_{u \to i}(y_i) \Big]$.

Note that the message update equation (5.1) takes the form of a Riccati equation for a scalar system, which can be carried out efficiently. Further, each optimization problem (5.2) is a scalar unconstrained convex quadratic program.

A second approach makes use of a piecewise-linear approximation to each message. Let us assume knowledge that the optimal solution $x^*$ lies in a closed bounded set $[-B, B]^n$. Let $S = \{\hat{x}_1, \ldots, \hat{x}_m\} \subset [-B, B]$, with $-B = \hat{x}_1 < \cdots < \hat{x}_m = B$, be a set of points where the linear pieces begin and end. Our approach applies the min-sum update equation to compute values at these points. Then, an approximation to the min-sum message is constructed via linear interpolation between consecutive points, or extrapolation beyond the end points. In particular, the algorithm takes the form

(5.3)    $J^{(t+1)}_{i \to j}(x_j) = \min_{y_i \in [-B, B]} \Big[ f_i(y_i) + f_{ij}(y_i, x_j) + \sum_{u \in N(i) \setminus j} J^{(t)}_{u \to i}(y_i) \Big] + \kappa^{(t+1)}_{i \to j}$,

for $x_j \in S$, where

(5.4)    $J^{(t)}_{u \to i}(x_i) = \max_{1 \leq k \leq m-1} \frac{(\hat{x}_{k+1} - x_i)\, J^{(t)}_{u \to i}(\hat{x}_k) + (x_i - \hat{x}_k)\, J^{(t)}_{u \to i}(\hat{x}_{k+1})}{\hat{x}_{k+1} - \hat{x}_k}$,

for all $x_i \in \mathbb{R}$. As opposed to the case of quadratic approximations, where each message is parameterized by two numerical values, the number of parameters for each piecewise-linear message grows with $m$. Hence, we anticipate that for fine-grained approximations, our second approach is likely to require greater computational resources. On the other hand, piecewise-linear approximations may extend more effectively to non-convex problems, since non-convex messages are unlikely to be well-approximated by convex quadratic functions.
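A minimal sketch of the piecewise-linear variant follows. Messages are stored as their values on the grid $S$, the interpolation/extrapolation (5.4) is taken as the maximum over the chord lines through consecutive grid points, and the scalar minimization in (5.3) is carried out by brute force over a fine grid of $y_i$ values. The paper does not prescribe a particular scalar minimizer, so that choice, the test instance, the initialization, and the grid sizes are all assumptions of this sketch.

```python
import numpy as np

# Illustrative instance (assumption): a 3-vertex chain with non-quadratic convex factors.
V = [0, 1, 2]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
f_node = lambda i, x: np.cosh(x - i)             # strictly convex, coercive
f_edge = lambda xi, xj: 0.5 * (xi - xj) ** 2     # convex coupling

B, m = 5.0, 41
S = np.linspace(-B, B, m)                        # breakpoints of the piecewise-linear messages
Y = np.linspace(-B, B, 401)                      # grid for the inner minimization over y_i

def interp(values, x):
    """Equation (5.4): maximum over the lines through consecutive breakpoints."""
    lines = [((S[k + 1] - x) * values[k] + (x - S[k]) * values[k + 1]) / (S[k + 1] - S[k])
             for k in range(m - 1)]
    return np.max(np.stack(lines), axis=0)

msgs = {(i, j): f_edge(0.0, S) for i in V for j in neighbors[i]}   # values on S
for t in range(20):
    new = {}
    for i in V:
        for j in neighbors[i]:
            incoming = sum(interp(msgs[(u, i)], Y) for u in neighbors[i] if u != j)
            # Equation (5.3), evaluated at every breakpoint x_j in S.
            vals = np.array([np.min(f_node(i, Y) + f_edge(Y, xj) + incoming) for xj in S])
            new[(i, j)] = vals - vals.min()      # the offset kappa keeps values bounded
    msgs = new

# Estimates: minimize the (approximate) local objectives b_i over the breakpoint grid.
estimates = [S[np.argmin(f_node(i, S) + sum(interp(msgs[(u, i)], S) for u in neighbors[i]))]
             for i in V]
print([round(float(x), 3) for x in estimates])
```

The accuracy of the resulting estimates is limited by the breakpoint spacing, which is consistent with the trade-off noted above: finer grids cost more but track the exact messages more closely.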
6 Conclusion

There are many open questions in the theory of message-passing algorithms. They fuel a growing research community that cuts across communications, artificial intelligence, statistical physics, theoretical computer science, and operations research. This paper has focused on application of the min-sum message-passing algorithm to convex programs, and even in this context a number of interesting issues remain unresolved.

Our proof technique establishes convergence under total asynchronism assuming a scaled diagonal dominance condition. With such a flexible model of asynchronous computation, convergence results for gradient descent and coordinate descent also require similar diagonal dominance assumptions. On the other hand, in the partially asynchronous setting, where communication delays and times between successive updates are bounded, such assumptions are no longer required to guarantee convergence of these two algorithms. It would be interesting to see whether convergence of the min-sum algorithm under partial asynchronism can be established in the absence of scaled diagonal dominance.

Another direction would be to assess the practical value of the min-sum algorithm for convex optimization problems. This calls for theoretical or empirical analysis of convergence and convergence times for implementable variants such as those proposed in the previous section. Some convergence time results for a special case reported in [11] may provide a starting point. Our expectation is that for most relevant centralized optimization problems, the min-sum algorithm will be more efficient than gradient descent or coordinate descent but fall short of Newton's method. On the other hand, Newton's method does not decentralize gracefully, so in applications that call for decentralized solution, the min-sum algorithm may prove to be useful.

Finally, it would be interesting to explore whether ideas from this paper can be helpful in analyzing the behavior of the min-sum algorithm for non-convex programs. It is encouraging that convex optimization theory has more broadly proved to be useful in designing and analyzing approximation methods for non-convex programs.

Acknowledgments

The first author was supported by a Benchmark Stanford Graduate Fellowship. This research was supported in part by the National Science Foundation through grant IIS-0428868.
References

[1] R. G. Gallager. Low-Density Parity Check Codes. M.I.T. Press, Cambridge, MA, 1963.

[2] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding. In Proc. Int. Communications Conf., pages 1064–1070, Geneva, Switzerland, May 1993.

[3] T. Richardson and R. Urbanke. The capacity of low-density parity check codes under message-passing decoding. IEEE Transactions on Information Theory, 47:599–618, 2001.

[4] M. Mézard, G. Parisi, and R. Zecchina. Analytic and algorithmic solutions to random satisfiability problems. Science, 297(5582):812–815, 2002.

[5] A. Braunstein, M. Mézard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Struct. Algorithms, 27(2):201–226, 2005.

[6] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13:2173–2200, 2001.

[7] P. Rusmevichientong and B. Van Roy. An analysis of belief propagation on the turbo decoding graph with Gaussian densities. IEEE Transactions on Information Theory, 47(2):745–765, 2001.

[8] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49(5):1120–1146, 2003.

[9] C. C. Moallemi and B. Van Roy. Convergence of the min-sum message passing algorithm for quadratic optimization. Technical report, Management Science & Engineering Department, Stanford University, 2006.

[10] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, 7:2031–2064, October 2006.

[11] C. C. Moallemi and B. Van Roy. Consensus propagation. IEEE Transactions on Information Theory, 52(11):4753–4766, 2006.

[12] A. Montanari, B. Prabhakar, and D. Tse. Belief propagation based multi-user detection. In Proceedings of the Allerton Conference on Communication, Control, and Computing, 2005.

[13] N. Wiberg. Codes and decoding on general graphs. PhD thesis, Linköping University, Linköping, Sweden, 1996.

[14] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA, 1997.