Decomposition Methods for Global Solutions of Mixed-Integer Linear Programs
Kaizhao Sun · Mou Sun · Wotao Yin
February 25, 2021
Abstract
This paper introduces two decomposition-based methods for two-block mixed-integer linear programs (MILPs), which break the original problem into a sequence of smaller MILP subproblems. The first method is based on the ℓ1-augmented Lagrangian. The second method is based on the alternating direction method of multipliers. When the original problem has a block-angular structure, the subproblems of the first block have low dimensions and can be solved in parallel. We add reverse-norm cuts and augmented Lagrangian cuts to the subproblems of the second block. For both methods, we show asymptotic convergence to globally optimal solutions and present iteration upper bounds. Numerical comparisons with recent decomposition methods demonstrate the exactness and efficiency of our proposed methods.

Keywords Mixed-integer linear programming · Augmented Lagrangian · Alternating direction method of multipliers
1 Introduction

In this paper, we consider the generic two-block mixed-integer linear program (MILP)

    p* := min_{x,z} c⊤x + g⊤z                (1a)
    subject to Ax + Bz = 0,                  (1b)
               x ∈ X, z ∈ Z,                 (1c)

where we have real decision variables x ∈ R^n and z ∈ R^d, rational parameters c ∈ Q^n, g ∈ Q^d, A ∈ Q^{m×n}, and B ∈ Q^{m×d}, and compact sets X ⊂ R^n and Z ⊂ R^d. The sets require some coordinates to take only integer values. Without loss of generality, we use a zero vector on the right-hand side of (1b), since any Ax + Bz = b can be equivalently expressed as Ax + B̃z̃ = 0 with B̃ = [B, b] and z̃ = [z⊤, −1]⊤. Let us write x in the block-coordinate form x = [x_1⊤, . . . , x_p⊤]⊤ for p ≥ 1. Assume that A is block diagonal and X = X_1 × · · · × X_p.

K. Sun
School of Industrial and Systems Engineering, Georgia Institute of Technology
E-mail: [email protected]

M. Sun
Decision Intelligence Lab, DAMO Academy, Alibaba Group
E-mail: [email protected]

W. Yin
Decision Intelligence Lab, DAMO Academy, Alibaba Group (US)
E-mail: [email protected]

This paper introduces methods that take advantage of the lower dimensions of the separable blocks x_1, . . . , x_p. The methods decouple them into a set of mixed-integer subproblems, each involving one x_i ∈ X_i, so that p of them can be solved in parallel on multiple computing cores. Of course, when p = 1, our methods still offer a decomposition between x and z. They coordinate the subproblem solutions and converge to a solution of the global problem.

1.1 Background

A mixed-integer program (MIP) is an optimization problem involving integer variables. A MILP is a MIP with affine objectives and constraints. The MIP is a common formulation in decision sciences and data sciences. As data have become much larger, decision problems with higher dimensions have arisen. For example, in online marketing, decisions are made on billions of consumers and millions of SKUs, which are significantly more than those in classic marketing. As another example, dispatching resources in cloud computing has a typical scale that is orders of magnitude larger than dispatching a fleet of vehicles. The scale increase of these MIPs has surpassed the evolution of MIP methods. Therefore, it is of central importance to develop algorithms that can scale well enough to solve huge MIPs.

Huge MIPs typically have decomposable structures. The coefficient matrix of (1) may take the classic block-angular structure:

    [A | B] = [ A_1                  B_1 ]
              [      A_2             B_2 ]
              [           ...        ... ]
              [                A_p   B_p ]      (2)

where A has a block-diagonal structure and the rows of B are divided into groups accordingly.
The matrix in (2) naturally arises in two-stage optimization problems: in Ax + Bz = 0, the z variable represents the first-stage decision and couples with the second-stage variables x_i via A_i x_i + B_i z = 0 in each scenario i ∈ [p]. More often than not, each second-stage variable x_i is independently constrained by x_i ∈ X_i.

We can also transform the following multi-block problem into problem (1):

    min_{x_1 ∈ X_1, ..., x_p ∈ X_p} Σ_{i=1}^p c_i⊤x_i, subject to Σ_{i=1}^p G_i x_i = h.      (3)

There possibly exist dense rows of the constraints that couple all the p blocks of variables. To obtain (1), introduce an auxiliary variable z_i for each block variable x_i: let X := ∏_{i=1}^p X_i, Z := {z = [z_1⊤, · · · , z_p⊤]⊤ | Σ_{i=1}^p G_i z_i = h}, and impose the consensus constraints x_i = z_i for all i ∈ [p]. Formulation (3) has many important applications including, but not limited to, the consensus and sharing problems in utility maximization.

1.2 Augmented Lagrangian Decomposition and Challenges

Decomposing a large global problem into smaller local subproblems is not new. The most common methods are based on the augmented Lagrangian (AL) and the alternating direction method of multipliers (ADMM). However, theoretical and practical issues arise when it comes to decomposing MILPs. Consider the following ℓ1-penalty AL for problem (1):

    L(x, z, λ, ρ) := c⊤x + g⊤z + ⟨λ, Ax + Bz⟩ + ρ‖Ax + Bz‖₁,  λ ∈ R^m, ρ > 0,      (4)

and the corresponding dual function

    d(λ, ρ) := min_{x ∈ X, z ∈ Z} L(x, z, λ, ρ).      (5)

We call the minimization problem in (5) the primal subproblem and call the following problem

    sup_{λ ∈ R^m, ρ > 0} d(λ, ρ)      (6)

the dual problem. Note that the penalty parameter ρ is a variable. Feizollahi, Ahmed, and Sun [13] established that one can solve (1) by solving (6).

The classic AL uses the squared penalty ‖Ax + Bz‖₂², which couples the blocks of x and z in a nonlinear way.
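The consensus reformulation of (3) into the two-block form (1) can be sketched numerically. The instance below (two scalar blocks coupled by one dense row) is a hypothetical toy, and both formulations are solved by brute-force enumeration rather than by a MILP solver; the point is only that the reformulation with A = I and B = −I preserves the optimal value.

```python
from itertools import product

# Hypothetical multi-block instance of (3): p = 2 scalar blocks,
# X_i = {0,1,2,3}, objective c_1*x_1 + c_2*x_2, and one dense
# coupling row G_1*x_1 + G_2*x_2 = h.
c = [1.0, -2.0]
G = [1.0, 2.0]
h = 5.0
X_i = range(4)

# Solve (3) directly by enumeration.
best3 = min(c[0]*x1 + c[1]*x2
            for x1, x2 in product(X_i, X_i)
            if G[0]*x1 + G[1]*x2 == h)

# Two-block reformulation (1): x in X = X_1 x X_2, auxiliary z with
# Z = {z : G_1*z_1 + G_2*z_2 = h}, and consensus coupling x - z = 0,
# i.e., A = I (block diagonal, one block per x_i) and B = -I.
Z = [z for z in product(X_i, X_i) if G[0]*z[0] + G[1]*z[1] == h]
best1 = min(c[0]*x[0] + c[1]*x[1]
            for x in product(X_i, X_i)
            for z in Z
            if x[0] == z[0] and x[1] == z[1])

print(best3, best1)  # the two formulations agree
```

In the reformulation, the dense row lives entirely in Z, so the x-part splits into p independent low-dimensional subproblems, which is the separable structure the proposed methods exploit.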
The subproblems of the classic AL are, therefore, more difficult to solve than those given by the ℓ1-penalty AL.

The augmented Lagrangian method (ALM) was proposed in the late 1960s by Hestenes [15] and Powell [22], and is still a popular method for constrained programs. Convergence of ALM has been established for convex programs [23,24] and smooth nonlinear programs [3].

For MILP (1) and other nonsmooth nonconvex problems, ALM-based algorithms have been developed based on the theory of nonconvex AL duality [14,16,25] and aim to solve the max-min problem (5)–(6). Since d(λ, ρ) is concave and upper-semicontinuous in (λ, ρ) [25, Exercise 11.56], one can maximize it with existing methods for nonsmooth convex optimization. In the context of a more general equality-constrained nonlinear program, the works [6,7,8] introduced various modified subgradient methods, and [10] proposed an inexact bundle method for solving the dual problem (6); convergence of the dual variable is established as long as the dual optimal set is nonempty, while convergence of the primal variable can be ensured even when the dual optimal set is empty.

When we consider MILP (1), the subproblem (5) has nonconvex mixed-integer constraint sets X or Z. To our knowledge, there is no efficient decomposition method for (5) that can guarantee convergence to a globally optimal solution. The methods [6,7,8,10] all need an exact or inexact oracle of d(λ, ρ), thus requiring a (nearly) global minimizer of subproblem (5). The method in [10] alternately updates two blocks of variables in (5), yet block coordinate descent (BCD) methods, in general, may get stuck at a non-stationary point when a nonsmooth coupling function, such as ‖Ax + Bz‖₁ in (4), is present [34].
For the special case of (1) with B = I, g = 0, and Z being a linear subspace, the recent work [5] applied proximal ALM to a convexified formulation, which may not return a feasible solution x ∈ X.

1.3 ADMM and Challenges

The ADMM successively updates each of the primal variable blocks, rather than updating them together, so it has simpler subproblems than ALM. Proving that ADMM converges is more challenging.

Although the majority of the ADMM literature assumes convexity, convergence results of ADMM for nonconvex multi-block problems have appeared [32,17,21,27]. So have ADMM-type methods involving discrete variables. The authors of [36] reformulated a vehicle routing problem as a multi-block MILP and applied ADMM where blocks are updated in the Gauss–Seidel style. They discovered that each mixed-integer quadratic program (MIQP) subproblem can be reduced to a MILP since the coupling terms in the objective are binary. The problems studied in [2,18,19,28,29,35] can be abstracted into the form min_x {f(x) | x ∈ X ∩ Z}, where f and X are continuous and easy to optimize with, but Z is discrete and poses challenges. Introduce a variable z to obtain

    min_{x,z} {f(x) | x = z, x ∈ X, z ∈ Z},      (7)

on which ADMM can be applied. The x-subproblem becomes a convex program (except in [19], where it is a mixed-integer nonlinear program); the z-subproblem is a projection onto a discrete set, which may admit a closed-form solution and may also be parallelized.
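A minimal sketch of ADMM applied to the splitting (7), under the illustrative assumption f(x) = ‖x − a‖² with X = [0, 1]^n and Z = {0, 1}^n (a hypothetical instance, not one from the paper's experiments):

```python
# ADMM sketch for (7): min f(x) s.t. x = z, x in X, z in Z,
# with f(x) = ||x - a||^2, X = [0,1]^n (box), Z = {0,1}^n (discrete).
# Scaled-dual form: u plays the role of lambda / rho.
a = [0.2, 0.8]
rho = 1.0
x = [0.0, 0.0]
z = [0.0, 0.0]
u = [0.0, 0.0]

for _ in range(100):
    # x-update: argmin ||x - a||^2 + (rho/2)||x - z + u||^2 over the box X
    # (closed form: unconstrained minimizer, then clip to [0,1])
    x = [min(1.0, max(0.0, (2*ai + rho*(zi - ui)) / (2 + rho)))
         for ai, zi, ui in zip(a, z, u)]
    # z-update: projection of x + u onto the discrete set Z = {0,1}^n
    z = [1.0 if xi + ui >= 0.5 else 0.0 for xi, ui in zip(x, u)]
    # dual update
    u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]

print(z, max(abs(xi - zi) for xi, zi in zip(x, z)))
```

On this particular instance the iteration happens to land on the binary point nearest to a with a vanishing consensus gap; as the discussion below explains, no such global or even feasibility guarantee holds in general for nonconvex Z.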
The aforementioned works on ADMM cannot ensure convergence to the global solution. The conditions in [32] for ADMM to converge to a stationary point of a nonconvex problem also fail to hold for (7). Indeed, one common set of structural conditions in [32] is 1) Im(A) ⊆ Im(B), and 2) Z = R^m. They fail to hold for (7) when Z is not R^m.

The authors of [33] proposed an ADMM-based method to solve problems involving binary variables; the binary constraint x ∈ {0, 1}^n is formulated as the intersection of [0, 1]^n and a shifted ℓp-sphere, and the resulting ADMM is shown to converge to a stationary point of a perturbed problem. It is, however, unclear whether such a stationary point is always a (nearly) optimal binary solution of the original problem.

Encouraging numerical results of ADMM on discrete optimization are reported in [2,18,19,28,29,35,36], where valuable experimental observations on adaptive tuning of the penalty parameter and restart heuristics are also discussed. None of these works offers convergence guarantees to global solutions. Indeed, there are examples on which ADMM either diverges or converges to local or infeasible solutions.

1.4 Our Approach

We aim to overcome the challenges outlined in Sections 1.2 and 1.3.

Our first method is based on ALM with its subproblem (5) solved by a new method we call the Alternating Update scheme for the Sharp AL function, or AUSAL. Here, "sharp" refers to the exact penalty property of (4). AUSAL has two nested stages. The inner stage minimizes over x ∈ X with z fixed. The outer-stage objective is a Lipschitz function in z, which we can evaluate but for which we do not have an explicit form. Borrowing a classic idea in global Lipschitz minimization from [20], we approximate this objective using the point-wise maximum of a set of nonconvex minorants. AUSAL generates nonconvex minorants using reverse norm cuts, which are obtained from the inner-stage solutions.
After many cuts are generated, AUSAL returns a nearly optimal z in the outer stage and a nearly optimal x in the inner stage for (5). We can bound the error by the number of cuts and other constants. The x and z are used to update the Lagrange multipliers and penalty parameter.

Our second method is based on the ADMM, where we update the Lagrange multipliers and penalty parameter more frequently. Since a historical reverse norm cut cannot be effectively reused in a new AUSAL subroutine, we instead use a class of nonconvex minorants known as AL cuts (though we use them in ADMM, not ALM), which were proposed recently by Ahmed et al. [1] and Zhang and Sun [37] in nonconvex stochastic optimization. Each such cut can be obtained by minimizing over x for fixed Lagrange multipliers, penalty parameter, and z.

We show that the two methods are guaranteed to converge to a global solution of the original MIP (1) under proper conditions. We summarize our contributions in the next subsection.

1.5 Contributions

We propose AUSAL to solve the primal subproblem (5), and we bound its iteration complexity by O(ρ^d/ǫ^d) to find an ǫ-optimal solution, where d is the dimension of z. In order to obtain an ǫ-solution of MILP (1) (see Definition 1), we can either apply AUSAL to the penalty formulation, i.e., λ = 0 in (5), or apply AUSAL inside an ALM. The former requires properly increasing the penalty parameter to reach exact penalty. The latter updates both the Lagrange multipliers and the penalty parameter. We provide two variants of the latter based on two subgradient methods.

We also propose an ADMM-based method that uses AL cuts. Our approach differs from those in [1,37]. In particular, we obtain an AL cut by solving a single MIP subproblem in x with z fixed rather than getting an AL cut after solving a series of subproblems as in [1,37].
Under conditions on the sequence of Lagrange multipliers and penalty parameters, the method finds an ǫ-solution in O(1/ǫ^d) iterations. The conditions allow flexible ways to update the Lagrange multipliers and penalty parameter.

The proposed ALM-based and ADMM-based methods maintain the separable structure of the original problem, where the update of the x variable can be decomposed into lower-dimensional MILPs and solved simultaneously. They are applicable to the motivating examples discussed in Section 1.1.

We conduct numerical experiments and compare our methods with existing decomposition algorithms on a class of randomly generated MILP instances. The proposed methods demonstrate advantages in either solution time or quality, and often both.

A disadvantage of our methods is that the z-subproblem grows in size as more cuts are added, which makes the z-subproblem increasingly difficult to solve. This is a common issue associated with cut-based MILP methods. We believe, however, that we can alleviate this issue by generating more efficient cuts, applying row generation, or using other means to solve the z-subproblem efficiently. We leave these to future work.

The proposed methods are not replacements for MILP solvers but means to scale the existing solvers to larger MILP instances, especially those with block-separable substructures.

1.6 Other Related Works

References [9,31] study the multi-block MILP problem, where the number of blocks is far greater than the dimension of the coupling constraints. Paper [9] applies primal decomposition to the multi-block problem (3), giving rise to a bi-level optimization framework, where each lower-level subproblem involves only one pair of cost vector c_i and constraint set {x_i ∈ X_i : G_i x_i ≤ y_i}, and the high-level problem coordinates their local resource variables y_1, . . . , y_p.
Paper [31] and its follow-up works [11,12] are based on dual decomposition, where a restricted Lagrangian dual problem reduces to subproblems that each involve only one pair (c_i, G_i) and constraint set X_i. In addition to pioneering distributed computation of multi-block MILPs, both primal and dual decompositions are also theoretically supported: under mild assumptions, a feasible primal solution can be recovered with a suboptimality bound by invoking the Shapley–Folkman theorem [3, Section 5.6]. Paper [30] considers a generic MILP in a fully decentralized setting, where constraints are distributed over a time-varying network. During each communication round, nodal agents solve localized linear program (LP) subproblems, generate Gomory mixed-integer (GMI) cuts, and exchange their cuts and local bases with neighbors; finite-time convergence to an optimal solution is established via the lexicographical optimality of the solution to each LP relaxation.

1.7 Notation and Organization

We let R, Z, Q, and N denote the sets of real, integer, rational, and natural numbers. Write [k] = {1, · · · , k}. For a vector x ∈ R^n, ‖x‖_p denotes the ℓp-norm of x for 1 ≤ p ≤ ∞. The inner product of x, y ∈ R^n is denoted by ⟨x, y⟩ or x⊤y. For a matrix A ∈ R^{m×n}, ‖A‖_p denotes its induced (operator) norm, and Im(A) denotes its column space. We introduce B_p(x; R) = {y ∈ R^n | ‖x − y‖_p ≤ R} and D_p(X) = sup{‖x − y‖_p | x, y ∈ X}. We let I_X(x) be the 0/∞-indicator function of X, which equals 0 if x ∈ X and +∞ otherwise.

The rest of the paper is organized as follows. We state our assumption on problem (1) and provide a more detailed review of background materials in Section 2. We present the proposed ALM framework in Section 3 and the ADMM variant in Section 4, together with their convergence results. In Section 5, we discuss implementation issues and present numerical experiments. Finally, we give some concluding remarks in Section 6.

2 Preliminaries
Assumption 1
Problem (1) is feasible, and the constraint sets X and Z in (1c) are compact and mixed-integer representable, i.e.,

    X = {x ∈ R^{n_1}_+ × Z^{n_2}_+ | Ex = f},  Z = {z ∈ R^{d_1}_+ × Z^{d_2}_+ | Gz = h},      (8)

for some rational matrices E and G (and vectors f and h), where n_1 + n_2 = n and d_1 + d_2 = d.

We measure the accuracy of an approximate solution as follows.
Definition 1
Let ǫ > 0. We say (x*, z*) is an ǫ-solution of the MIP (1) if x* ∈ X, z* ∈ Z, and

    c⊤x* + g⊤z* ≤ p* + ǫ,  ‖Ax* + Bz*‖₁ ≤ ǫ.      (9)

Note that infeasibility is measured in the ℓ1-norm due to our analysis. Since (x*, z*) may be infeasible, it is possible that c⊤x* + g⊤z* < p*.

2.2 Exact Penalization

Since the AL dual problem in (6) has weak duality sup_{λ ∈ R^m, ρ ≥ 0} d(λ, ρ) ≤ p*, two important follow-up questions are: 1) how to obtain strong duality, i.e., when "=" holds; 2) how to obtain an optimal primal solution by solving the dual problem. Paper [13] provides positive answers to both questions for MILP.

Definition 2 (Exact Penalization [25, Definition 11.60])
A dual variable λ ∈ R^m is said to support exact penalization if, for all sufficiently large ρ > 0, it holds that

    Argmin_{x ∈ X, z ∈ Z} {c⊤x + g⊤z | Ax + Bz = 0} = Argmin_{x ∈ X, z ∈ Z} L(x, z, λ, ρ).

We say a pair (λ, ρ) supports exact penalization if the above equation holds. We simply say ρ supports exact penalization when (0, ρ) does so.

Theorem 1 (Exact Penalization for MILP [13, Proposition 1 and Theorem 5])
Suppose Assumption 1 holds. Strong duality holds for (6), i.e., sup_{λ ∈ R^m, ρ ≥ 0} d(λ, ρ) = p*. For any λ ∈ R^m, there exists a ρ̄ > 0 such that every (λ, ρ) with ρ ∈ [ρ̄, +∞) supports exact penalization.

The exact penalization result proved in [13] applies to a broader class of penalty functions, including all norms on a finite-dimensional Euclidean space. This paper focuses on the ℓ1-norm for its component-wise separability. The augmented Lagrangian given by ‖·‖₂² does not support exact penalization; in general, no finite penalty ρ can eliminate the duality gap [13, Proposition 7]. A complete characterization of exact penalization is given in the following theorem.

Theorem 2 (Criterion for Exact Penalization [25, Theorem 11.61])
Suppose Assumption 1 holds. The following statements are equivalent:
1. The pair (λ, ρ) supports exact penalization.
2. The pair (λ, ρ) solves the dual problem (6).
3. There is r > 0 such that

    p(u) ≥ p(0) + ⟨λ, u⟩ − ρ‖u‖₁, ∀u ∈ B₁(0; r),

for

    p(u) := min_{x,z} {c⊤x + g⊤z | Ax + Bz + u = 0, x ∈ X, z ∈ Z}.      (10)

3 The ALM Framework

In this section, we introduce an ALM framework for MILP (1). In Section 3.1, we present the AUSAL algorithm for subproblem (5) with convergence guarantees. In order to find an ǫ-solution of MILP (1), AUSAL is further applied to the penalty formulation in Section 3.2 or embedded in the ALM in Section 3.3. We present two variants of ALM based on different subgradient updates.

3.1 AUSAL

In this section, we propose AUSAL to solve the primal subproblem in (5):

    d(λ, ρ) = min_{x ∈ X, z ∈ Z} c⊤x + g⊤z + ⟨λ, Ax + Bz⟩ + ρ‖Ax + Bz‖₁,      (11)

where λ ∈ R^m and ρ are considered as constants. We decompose the minimization into two stages: the inner stage minimizes over x ∈ X with z fixed, and the outer stage minimizes over Z:

    R(z) := min_{x ∈ X} ⟨c + A⊤λ, x⟩ + ρ‖Ax + Bz‖₁,      (12)
    d(λ, ρ) = min_{z ∈ Z} ⟨g + B⊤λ, z⟩ + R(z).      (13)

Note that R(z) is well defined over z ∈ R^d since X is compact. The function R(z) captures the dependency of the x-objective value on the variable z, and is closely related to the concept of a value function or cost-to-go function in the context of sequential decision-making problems.

Since the minimizer x does not have a closed form, R(z) lacks an explicit formula. We can only compute it for each z. One approach is to solve (13) with a blackbox optimization method, which we do not study here.

We present an alternative approach that replaces R(z) by an explicit function that is its lower approximation and reduces (13) to a MILP. To this end, we need the following properties of R(z):

Lemma 1
Suppose Assumption 1 holds, and X is compact for any right-hand-side vector f (possibly empty). Then R(z) is piecewise linear and K_ρ-Lipschitz continuous with respect to the ℓ1-norm over R^d, where K_ρ := ρ‖B‖₁.

Proof. We first show R(z) is piecewise linear by considering the standard-form MILP problem

    v(b) := min_x { c̄⊤x | Āx := [Ā₁x; Ā₂x] = [b₁; b₂] =: b, x ∈ R^{p}_+ × Z^{q}_+ }.

Let v(b) equal +∞ if the problem is infeasible and −∞ if it is unbounded. Define D := {b | v(b) < +∞}. It is shown in [4] that if the MILP is described by rational data and v(b) > −∞ for all b ∈ D, then v(b) is a piecewise linear function over D. Now fix b₂ and define the partial value function with respect to b₁ by v₁(b₁) := v([b₁⊤, b₂⊤]⊤). Then v₁(b₁) is also piecewise linear over its domain D₁ := {b₁ | [b₁⊤, b₂⊤]⊤ ∈ D}. Notice that the premise of [4] is satisfied for problem (12) by assumption, and we can equivalently write R(z) = min_{x ∈ X, u} {⟨c + A⊤λ, x⟩ + ρ‖Ax + Bu‖₁ | u = z}, where z plays the role of b₁. Since the rest of the problem defining R(z) can be cast as a standard MILP, we conclude that R(z) is a piecewise linear function on R^d.

To prove R(z) is Lipschitz, take any z̃, z̄ ∈ R^d, and let x̃ and x̄ be optimal solutions to (12) with z = z̃ and z = z̄, respectively. It holds that

    R(z̃) − R(z̄) ≤ ⟨c + A⊤λ, x̄⟩ + ρ‖Ax̄ + Bz̃‖₁ − ⟨c + A⊤λ, x̄⟩ − ρ‖Ax̄ + Bz̄‖₁ ≤ ρ‖B‖₁ ‖z̃ − z̄‖₁,

where the first inequality is due to the optimality of x̃, and the second inequality is by the triangle inequality. Similarly, we have R(z̄) − R(z̃) ≤ ρ‖B‖₁ ‖z̃ − z̄‖₁, which concludes the proof with the claimed modulus K_ρ. ⊓⊔

Since R(z) is K_ρ-Lipschitz, it has the lower approximation

    r(z; z̄) := R(z̄) − K_ρ‖z − z̄‖₁,      (14)

where, for any given z̄ ∈ Z, we can compute R(z̄).
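The Lipschitz bound of Lemma 1 is easy to check numerically. In the sketch below, all data are illustrative assumptions, R(z) is evaluated by enumerating a small X rather than by solving the MILP (12), and ‖B‖₁ is the induced norm, i.e., the maximum absolute column sum:

```python
from itertools import product

# Toy data for (12): X = {0,1}^2, with A, B, c, lam, rho chosen arbitrarily.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[-1.0, 0.5], [0.0, -1.0]]
c = [1.0, -1.0]
lam = [0.3, -0.2]
rho = 2.0
X = list(product([0.0, 1.0], repeat=2))

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def R(z):
    # R(z) = min_{x in X} <c + A^T lam, x> + rho * ||A x + B z||_1
    cl = [c[i] + sum(A[j][i] * lam[j] for j in range(2)) for i in range(2)]
    def obj(x):
        resid = [ax + bz for ax, bz in zip(matvec(A, x), matvec(B, z))]
        return sum(ci * xi for ci, xi in zip(cl, x)) + rho * sum(abs(t) for t in resid)
    return min(obj(x) for x in X)

# K_rho = rho * ||B||_1, where ||B||_1 is the max absolute column sum.
K = rho * max(sum(abs(B[i][j]) for i in range(2)) for j in range(2))

zs = [(0.0, 0.0), (1.0, 2.0), (-0.5, 0.7), (2.0, -1.0)]
ok = all(abs(R(zt) - R(zb)) <= K * sum(abs(a - b) for a, b in zip(zt, zb)) + 1e-9
         for zt in zs for zb in zs)
print(ok)
```

The check holds on every sampled pair, as the lemma guarantees for all of R^d.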
Let us call −‖z − z̄‖₁ a "reverse norm"; hence,

Definition 3

We call the inequality R(z) ≥ r(z; z̄) a reverse norm cut of R at z̄.

From all z̄ ∈ Z̄ ⊆ Z, we collect a set of lower approximations and take their point-wise supremum:

    R(z; Z̄) := sup {r(z; z̄) | z̄ ∈ Z̄},      (15)

which becomes a tighter lower approximation of R(z) as Z̄ gets larger. Eventually, if Z̄ = Z, we have R(z; Z̄) = R(z) over all z ∈ Z.

The function R(z; Z̄) is piecewise linear and K_ρ-Lipschitz under the ℓ1-norm. So, we solve

    min_{z ∈ Z} ⟨g + B⊤λ, z⟩ + R(z; Z̄)      (16)

either with a method for piecewise linear objectives or with a MILP solver (after expanding R(z; Z̄) into a set of linear constraints).

Suppose (16) returns a solution z′. Then ⟨g + B⊤λ, z′⟩ + R(z′) is an upper bound of (13). When Z̄ changes, we keep track of the best (i.e., lowest) upper bound. On the other side, ⟨g + B⊤λ, z′⟩ + R(z′; Z̄) is a lower bound of (13). Adding z′ to Z̄ can reduce the gap between the best upper bound and the lower bound. Indeed, even if z′ is a minimizer of (13), the upper bound is tight but the lower bound may not be, since R(z′; Z̄) < R(z′) is possible. If z′ is not a minimizer of (13), the lower bound must be non-tight. Adding z′ to Z̄ leads to R(z′; Z̄) = R(z′) and tightens the lower bound. This adding step requires computing R(z′) according to (12).

The approach described in the last paragraph is known to the community of global Lipschitz minimization [20] and was recently used in stochastic optimization [1]. Applying this idea to the AL framework has advantages that have not been noted before: the explicit Lipschitz constant K_ρ = ρ‖B‖₁ of R(z), and the fact that problem (12) is always feasible without assuming the complete continuous recourse (CCR) condition [38].

We present the proposed AUSAL method in Algorithm 1, which returns an ǫ-optimal solution to the primal subproblem (5).
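Before the algorithm listing, the cut machinery (14)–(16) can be exercised on a tiny hypothetical instance: each visited z̄ contributes the minorant r(z; z̄) = R(z̄) − K_ρ‖z − z̄‖₁, and the model R(z; Z̄) is their point-wise maximum, which never exceeds R(z) and matches it at every z̄ ∈ Z̄ (here R is evaluated by brute force over a tiny X, not by a solver):

```python
from itertools import product

# Toy instance of (12): X = {0,1}^2, A = I, B = -I, lam = 0, rho = 2.
c = [1.0, -1.0]
rho = 2.0
X = list(product([0.0, 1.0], repeat=2))
K = rho * 1.0  # K_rho = rho * ||B||_1, and ||B||_1 = 1 for B = -I

def R(z):
    # R(z) = min_{x in X} <c, x> + rho * ||x - z||_1  (since A = I, B = -I)
    return min(sum(ci * xi for ci, xi in zip(c, x))
               + rho * sum(abs(xi - zi) for xi, zi in zip(x, z))
               for x in X)

def cut_model(z, Zbar):
    # R(z; Zbar): point-wise max over the reverse norm cuts r(z; zbar)
    return max(R(zb) - K * sum(abs(a - b) for a, b in zip(z, zb)) for zb in Zbar)

Zbar = [(0.0, 0.0), (1.0, 1.0)]
grid = [(i * 0.5, j * 0.5) for i in range(5) for j in range(5)]

minorant = all(cut_model(z, Zbar) <= R(z) + 1e-9 for z in grid)  # lower approx.
tight = all(abs(cut_model(zb, Zbar) - R(zb)) <= 1e-9 for zb in Zbar)  # exact at cuts
print(minorant, tight)
```

Tightness at the cut points is exactly what closes the gap between the upper and lower bounds in the adding step described above.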
Algorithm 1: AUSAL for subproblem (11)
1: Input: λ ∈ R^m, ρ > 0, ǫ ≥ 0
2: Initialize z^0 ∈ Z, compute R(z^0), Z^0 ← {z^0}, UB ← +∞
3: for k = 1, 2, · · · do
4:   compute z^k ∈ Argmin_{z ∈ Z} ⟨g + B⊤λ, z⟩ + R(z; Z^{k−1})
5:   compute x^k ∈ Argmin_{x ∈ X} ⟨c + A⊤λ, x⟩ + ρ‖Ax + Bz^k‖₁   ⊲ parallelizable if A is block-diagonal
6:   UB ← min{UB, ⟨g + B⊤λ, z^k⟩ + R(z^k)}
7:   if UB − (⟨g + B⊤λ, z^k⟩ + R(z^k; Z^{k−1})) ≤ ǫ then
8:     return (x^k, z^k)
9:   end if
10:  Z^k ← Z^{k−1} ∪ {z^k}
11: end for

We present the asymptotic convergence and iteration complexity of AUSAL.
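A minimal executable sketch of Algorithm 1, assuming finite toy sets X and Z so that both Argmin steps (lines 4–5 of the listing) can be done by enumeration; the instance (A = I, B = −I, consensus-type coupling) is a hypothetical illustration, and a real implementation would call a MILP solver in those two steps:

```python
from itertools import product

# Toy subproblem (11): min <c,x> + <g,z> + <lam, x - z> + rho*||x - z||_1
# over X = {0,1,2}^2 and Z = {z in {0,1,2}^2 : z_1 + z_2 = 2}  (A = I, B = -I).
c = [2.0, -1.0]
g = [-3.0, 1.0]
lam = [0.5, -0.5]
rho = 4.0
X = list(product(range(3), repeat=2))
Z = [z for z in product(range(3), repeat=2) if z[0] + z[1] == 2]
K = rho * 1.0  # K_rho = rho*||B||_1, and ||B||_1 = 1 for B = -I

def R(z):
    # inner stage (12): min over x of <c + lam, x> + rho*||x - z||_1
    return min(sum((ci + li) * xi for ci, li, xi in zip(c, lam, x))
               + rho * sum(abs(xi - zi) for xi, zi in zip(x, z))
               for x in X)

def outer_obj(z):
    # <g + B^T lam, z> = <g - lam, z> since B = -I
    return sum((gi - li) * zi for gi, li, zi in zip(g, lam, z))

def model(z, Zbar):
    # cut model R(z; Zbar) from the reverse norm cuts (14)-(15)
    return max(R(zb) - K * sum(abs(a - b) for a, b in zip(z, zb)) for zb in Zbar)

def ausal(eps=1e-9):
    Zbar = [Z[0]]                               # line 2: one initial cut
    UB = float("inf")
    while True:
        zk = min(Z, key=lambda z: outer_obj(z) + model(z, Zbar))  # line 4
        LB = outer_obj(zk) + model(zk, Zbar)    # lower bound at zk
        UB = min(UB, outer_obj(zk) + R(zk))     # line 6: refresh upper bound
        if UB - LB <= eps:                      # line 7: stopping criterion
            return UB
        Zbar.append(zk)                         # line 10: add the cut at zk

# AUSAL should match brute-force minimization of (11) over X x Z.
brute = min(outer_obj(z) + R(z) for z in Z)
print(ausal(), brute)
```

Because Z is finite here, termination is immediate to see: once a z repeats, its own cut makes the lower bound meet the upper bound, which is the finite counterpart of the packing argument in Theorem 3 below.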
Lemma 2
Let z^k be an iterate of AUSAL (Algorithm 1). Then R(z) − R(z; Z^k) ≤ ǫ for all z ∈ Z such that ‖z − z^k‖₁ ≤ ǫ/(2K_ρ).

Proof. Notice that for all z ∈ Z with ‖z − z^k‖₁ ≤ ǫ/(2K_ρ), we have

    R(z) − R(z; Z^k) ≤ R(z) − R(z^k) + K_ρ‖z − z^k‖₁
                     ≤ R(z^k) + K_ρ‖z − z^k‖₁ − R(z^k) + K_ρ‖z − z^k‖₁
                     = 2K_ρ‖z − z^k‖₁ ≤ ǫ,

where the first inequality is due to R(z; Z^k) ≥ R(z^k) − K_ρ‖z − z^k‖₁, and the second inequality is due to R being K_ρ-Lipschitz. ⊓⊔

In words, Lemma 2 means that, once Z^k includes z^k, the function R(z; Z^k) (and all future R(z; Z^{k+1}), . . .) well approximates R near z^k. Consider two consecutive iterates z^k and z^{k+1}. If ‖z^{k+1} − z^k‖₁ ≤ ǫ/(2K_ρ), then the stopping criterion in line 7 of AUSAL is satisfied at iteration k + 1; otherwise, z^{k+1} must be bounded away from z^k by some positive distance.

Theorem 3
Suppose Assumption 1 holds.
1. (Convergence) Let {(x^k, z^k)}_{k ∈ N} be the sequence generated by AUSAL, and define t_k = R(z^k; Z^{k−1}). The sequence {⟨g + B⊤λ, z^k⟩ + t_k}_{k ∈ N} converges to d(λ, ρ) monotonically from below. Any limit point (x*, z*) of {(x^k, z^k)}_{k ∈ N} is a global solution to the primal subproblem (5).
2. (Complexity) Let ǫ > 0, and suppose Z ⊆ B₁(z̄; R) for some z̄ ∈ R^d and radius R > 0. Then AUSAL terminates in no more than

    (4ρ‖B‖₁R/ǫ + 1)^d = O(ρ^d/ǫ^d)      (17)

iterations.

Proof. 1. Notice R(z; Z^{k−1}) ≤ R(z; Z^k) for all k ∈ N and z ∈ Z, so the sequence {⟨g + B⊤λ, z^k⟩ + t_k}_{k ∈ N} is nondecreasing and bounded from above by d(λ, ρ), and thus converges to some d̄ ≤ d(λ, ρ). Let (x*, z*) be a limit point and {(x^{k_j}, z^{k_j})}_{j ∈ N} be a subsequence convergent to (x*, z*). Then we have

    ⟨g + B⊤λ, z^{k_j}⟩ + t_{k_j} ≥ ⟨g + B⊤λ, z^{k_j}⟩ + R(z^{k_{j−1}}) − K_ρ‖z^{k_j} − z^{k_{j−1}}‖₁.      (18)

Since ‖z^{k_j} − z^{k_{j−1}}‖₁ → 0, letting j → ∞ in (18) gives d̄ ≥ ⟨g + B⊤λ, z*⟩ + R(z*) ≥ d(λ, ρ), where the first inequality follows from the continuity of R, and the second inequality is due to z* ∈ Z. As a result, we have d̄ = d(λ, ρ) = ⟨g + B⊤λ, z*⟩ + R(z*). Finally, taking the limit on both sides of R(z^{k_j}) = ⟨c + A⊤λ, x^{k_j}⟩ + ρ‖Ax^{k_j} + Bz^{k_j}‖₁ over j implies that (x*, z*) is optimal to (5).

2. Since lim_{j→∞} z^{k_j} = z*, we have ‖z^{k_{j+1}} − z^{k_j}‖₁ ≤ ǫ/(2K_ρ) for all large enough j ∈ N. Notice k_{j+1} ≥ k_j + 1; by Lemma 2, it follows that R(z^{k_{j+1}}) − t_{k_{j+1}} ≤ R(z^{k_{j+1}}) − R(z^{k_{j+1}}; Z^{k_j}) ≤ ǫ. Now let K be the first index such that R(z^K) − t_K ≤ ǫ, which is sufficient to ensure the termination of AUSAL (line 7). For all 0 ≤ i < j ≤ K − 1, we claim that ‖z^i − z^j‖₁ > ǫ/(2K_ρ): suppose not; then Lemma 2 implies that R(z^j) − t_j ≤ R(z^j) − R(z^j; Z^i) ≤ ǫ, contradicting the choice of K. Let r = ǫ/(4K_ρ). Since z^i ∈ Z for all 0 ≤ i ≤ K − 1 and Z ⊆ B₁(z̄; R),

    ∪_{i=0}^{K−1} B₁(z^i; r) ⊆ Z + B₁(0; r) ⊆ B₁(z̄; R + r);

since the balls {B₁(z^i; r)}_{0 ≤ i ≤ K−1} are disjoint, it follows that

    Vol(∪_{i=0}^{K−1} B₁(z^i; r)) = K Vol(B₁(0; r)) ≤ Vol(B₁(z̄; R + r)),

where Vol(·) returns the volume of the argument, and thus

    K ≤ Vol(B₁(z̄; R + r))/Vol(B₁(0; r)) = ((R + r)/r)^d = (4K_ρR/ǫ + 1)^d = (4ρ‖B‖₁R/ǫ + 1)^d.

This completes the proof. ⊓⊔

3.2 The Penalty Formulation

Suppose we know a pair (λ, ρ) that supports exact penalization; then we only need to call AUSAL to solve the subproblem (5) once. Our first result suggests that, with such a pair (λ, ρ), we can expect to obtain an ǫ-solution of (1) upon the termination of AUSAL.

Theorem 4
Suppose (λ, ρ) ∈ R^{m+1} satisfies (1) ρ ≥ ‖λ‖∞ and (2) (λ, ρ − 1) supports exact penalization. Then AUSAL applied to the primal subproblem (5) returns an ǫ-solution of MILP (1).

Proof. Let (x*, z*) denote the solution returned by Algorithm 1. Clearly, x* ∈ X and z* ∈ Z. We first show c⊤x* + g⊤z* ≤ p* + ǫ:

    c⊤x* + g⊤z* ≤ c⊤x* + g⊤z* + (ρ − ‖λ‖∞)‖Ax* + Bz*‖₁
                ≤ c⊤x* + g⊤z* + ⟨λ, Ax* + Bz*⟩ + ρ‖Ax* + Bz*‖₁
                ≤ d(λ, ρ) + ǫ ≤ p* + ǫ.      (19)

Furthermore, (19) also implies

    ρ‖Ax* + Bz*‖₁ ≤ p* + ǫ − c⊤x* − g⊤z* − ⟨λ, Ax* + Bz*⟩;

since (λ, ρ − 1) also supports exact penalization,

    p* ≤ c⊤x* + g⊤z* + ⟨λ, Ax* + Bz*⟩ + (ρ − 1)‖Ax* + Bz*‖₁.

The above two inequalities together imply ‖Ax* + Bz*‖₁ ≤ ǫ. ⊓⊔

Since the current (λ, ρ) may not support exact penalization, we must address the question: given any λ ∈ R^m, what is a sufficiently large ρ for AUSAL to deliver an ǫ-solution of MILP (1)? Note that λ = 0 is possible, and in this case, the primal subproblem (5) becomes the penalty formulation of (1).

Theorem 5
Let λ ∈ R^m and ǫ > 0. Then AUSAL applied to the primal subproblem (5) with

    ρ = (‖c‖∞ D₁(X) + ‖g‖∞ D₁(Z))/ǫ + ‖λ‖∞ + 1

returns an ǫ-solution (x*, z*) of the MILP (1) in at most

    [(4‖B‖₁R/ǫ)((‖c‖∞ D₁(X) + ‖g‖∞ D₁(Z))/ǫ + ‖λ‖∞ + 1) + 1]^d      (20)

iterations, where D₁(·) returns the ℓ1-diameter of the argument, and R is the ℓ1-radius of Z.

Proof. We have c⊤x* + g⊤z* ≤ p* + ǫ from (19). It remains to show ‖Ax* + Bz*‖₁ ≤ ǫ. Let (x̄, z̄) be an optimal solution of MILP (1). Again from (19), we have

    ‖Ax* + Bz*‖₁ ≤ (c⊤(x̄ − x*) + g⊤(z̄ − z*) + ǫ)/(ρ − ‖λ‖∞) ≤ (‖c‖∞ D₁(X) + ‖g‖∞ D₁(Z) + ǫ)/(ρ − ‖λ‖∞).

It is straightforward to verify that the choice of ρ ensures ‖Ax* + Bz*‖₁ ≤ ǫ, and finally the complexity result is a direct consequence of Theorem 3. ⊓⊔

The choice of ρ in Theorem 5 serves as a sufficient condition for AUSAL to deliver an ǫ-solution of (1) given any λ ∈ R^m; however, the resulting O(1/ǫ^{2d}) iteration complexity is undesirable. In practice, instead of calling AUSAL just once with some fixed (λ, ρ), we should update them in an iterative scheme, which we discuss in the next section.

3.3 ALM and Dual Updates

We present an ALM that uses AUSAL for its subproblem and two different subgradient methods for the update of (λ, ρ). This method is appropriate when a pair (λ, ρ) that supports exact penalization is not known initially.

The dual function d(λ, ρ) is concave and upper-semicontinuous in (λ, ρ). Let (x̄, z̄) be a pair of solutions returned by AUSAL with input (λ̄, ρ̄, ǫ). Then for all (λ, ρ) ∈ R^m × R_+, it holds that

    d(λ, ρ) ≤ c⊤x̄ + g⊤z̄ + ⟨λ, Ax̄ + Bz̄⟩ + ρ‖Ax̄ + Bz̄‖₁
            = c⊤x̄ + g⊤z̄ + ⟨λ̄, Ax̄ + Bz̄⟩ + ρ̄‖Ax̄ + Bz̄‖₁ + ⟨λ − λ̄, Ax̄ + Bz̄⟩ + (ρ − ρ̄)‖Ax̄ + Bz̄‖₁
            ≤ d(λ̄, ρ̄) + ǫ + ⟨λ − λ̄, Ax̄ + Bz̄⟩ + (ρ − ρ̄)‖Ax̄ + Bz̄‖₁,

where the last inequality is due to the ǫ-optimality of (x̄, z̄). Therefore, we have

    (−d)(λ, ρ) ≥ (−d)(λ̄, ρ̄) − [Ax̄ + Bz̄; ‖Ax̄ + Bz̄‖₁]⊤ [λ − λ̄; ρ − ρ̄] − ǫ,

or equivalently,

    −[Ax̄ + Bz̄; ‖Ax̄ + Bz̄‖₁] ∈ ∂_ǫ(−d)(λ̄, ρ̄).      (21)

When ǫ = 0, the above inclusion yields an exact subgradient in the sense of convex analysis.
Therefore, we can use the primal information $(\bar x, \bar z)$ to construct an $\epsilon$-subgradient of $(-d)$ at $(\bar\lambda, \bar\rho)$ and apply an inexact subgradient method to solve the dual problem (6). Note that the compactness of $X$ and $Z$ ensures that $d(\lambda, \rho)$ is well defined for all $\lambda \in \mathbb{R}^m$ and $\rho \ge 0$; the exact penalization of the MILP from Theorem 1 guarantees that $\operatorname{Argmax} d(\lambda, \rho)$ is nonempty.

There are various subgradient methods under different assumptions and variations, such as the exactness of the subgradient, the choice of step size, and the stopping criterion. Below we provide two specific implementations. Since the analysis is standard, we place the proofs in Appendix A only for completeness. Beyond the two implementations below, we can apply the methods of [6,7,8,10] with AUSAL as a subroutine to solve the dual problem (6).

Algorithm 2: A Subgradient Variant with Iteration Complexity on Objective Gap
1: Input $\epsilon_p \ge 0$;
2: Initialize $(\lambda^1, \rho^1)$ and a sequence $\{\tau_k\}_{k \in \mathbb{N}}$ such that $\tau_k > 0$, $\tau_k \to 0$, and $\sum_{k \in \mathbb{N}} \tau_k = +\infty$;
3: for $k = 1, 2, \cdots$ do
4:   $(x^k, z^k) \leftarrow \mathrm{AUSAL}(\lambda^k, \rho^k, \epsilon_p)$;
5:   set $\alpha_k = \tau_k / (\sqrt{2}\,\|Ax^k + Bz^k\|)$;
6:   $\lambda^{k+1} \leftarrow \lambda^k + \alpha_k (Ax^k + Bz^k)$, $\rho^{k+1} \leftarrow \rho^k + \alpha_k \|Ax^k + Bz^k\|$;
7: end for

The first subgradient algorithm is presented as Algorithm 2. The constant $\epsilon_p \ge 0$ specifies the accuracy of the AUSAL subroutine: if AUSAL returns $(x^k, z^k)$ with $Ax^k + Bz^k = 0$, then $(x^k, z^k)$ is an $\epsilon_p$-solution to MILP (1). To study convergence and complexity, we assume $\|Ax^k + Bz^k\| > 0$ for all $k \in \mathbb{N}$, so that Algorithm 2 produces an infinite sequence $\{(\lambda^k, \rho^k)\}_{k \in \mathbb{N}}$. We introduce the following constants for our analysis:
$$D^* := \operatorname*{Argmax}_{\lambda,\, \rho \ge 0} d(\lambda, \rho), \quad d_0 := \min_{(\lambda, \rho) \in D^*} \sqrt{\|\lambda - \lambda^1\|^2 + |\rho - \rho^1|^2}, \quad M := \max_{x \in X,\, z \in Z} \|Ax + Bz\|.$$
Consider the step size $\alpha_k$ chosen as in line 5 of Algorithm 2. Other standard choices include $\alpha_k = \epsilon_d/(2M^2)$ and $\alpha_k = \epsilon_d/(2\|Ax^k + Bz^k\|^2)$; their convergence results are similar and thus omitted.

Theorem 6
The following statements hold.
1. Let $\epsilon_d > 0$. Suppose Algorithm 2 performs $K = \lceil 2M^2 d_0^2/\epsilon_d^2 \rceil$ iterations with $\tau_i = d_0/\sqrt{K}$. Then
$$\min_{k \in [K]}\; p^* - d(\lambda^k, \rho^k) \le \epsilon_p + \epsilon_d.$$
2. Suppose $\epsilon_p = 0$, and the sequence $\{\tau_k\}_{k \in \mathbb{N}}$ also satisfies $0 < \tau_k \le \bar\tau$ for all $k \in \mathbb{N}$ and $\sum_{k \in \mathbb{N}} \tau_k^2 < +\infty$. Then $(\lambda^k, \rho^k)$ converges to some $(\lambda^*, \rho^*) \in D^*$.

Proof See Appendix A.1. ⊓⊔

In addition, we can impose a compact feasible region for $(\lambda, \rho)$ and apply the projected subgradient method; the results in Theorem 6 still apply with essentially the same proof. The rationale is that, if Algorithm 2 does not find an optimal dual variable after $K$ iterations, then by Theorem 6 we can estimate the optimality gap in the objective and expect the best-so-far iterate $(\lambda^k, \rho^k)$ to be close to some optimal dual solution $(\lambda^*, \rho^*)$. Consequently, we can post-process to recover an optimal dual solution.

Corollary 1
Suppose Algorithm 2 generates a pair $(\lambda^k, \rho^k)$ such that $\|(\lambda^k, \rho^k) - (\lambda^*, \rho^*)\|_\infty \le l$ for some $(\lambda^*, \rho^*) \in D^*$. Then $(\lambda^k, \rho^k + 2l)$ supports exact penalization. Applying AUSAL with $\lambda = \lambda^k$, $\rho = \max\{\|\lambda^k\|_\infty, \rho^k + 2l + 1\}$, and $\epsilon > 0$ will return an $\epsilon$-solution of the MILP (1).

Proof By Theorem 2, there exists $r > 0$ such that for all $u \in B(0; r)$, it holds that
$$\begin{aligned} p(u) &\ge p(0) + \langle \lambda^*, u\rangle - \rho^* \|u\| \\ &\ge p(0) + \langle \lambda^k, u\rangle - \rho^k \|u\| - \|\lambda^k - \lambda^*\|_\infty \|u\| - |\rho^k - \rho^*|\,\|u\| \\ &\ge p(0) + \langle \lambda^k, u\rangle - (\rho^k + 2l)\|u\|, \end{aligned}$$
so that $(\lambda^k, \rho^k + 2l)$ supports exact penalization. The second claim follows from Theorem 4. ⊓⊔

We present the second variant in Algorithm 3. Now we assume $\epsilon_p$ is strictly positive, and we are interested in finding an $\epsilon_p$-solution of MILP (1).

Algorithm 3: A Subgradient Variant with Finite Convergence to an Approximate Solution
1: Input $\epsilon_p > 0$;
2: Initialize $(\lambda^1, \rho^1)$ with $\rho^1 \ge \|\lambda^1\|_\infty$, and some $\tau > 0$;
3: for $k = 1, 2, \cdots$ do
4:   $(x^k, z^k) \leftarrow \mathrm{AUSAL}(\lambda^k, \rho^k, \epsilon_p)$;
5:   if $\|Ax^k + Bz^k\| \le \epsilon_p$ then
6:     return $(x^k, z^k)$;
7:   end if
8:   set $\alpha_k = \tau/\|Ax^k + Bz^k\|$;
9:   $\lambda^{k+1} \leftarrow \lambda^k + \alpha_k(Ax^k + Bz^k)$, $\rho^{k+1} \leftarrow \max\{\|\lambda^{k+1}\|_\infty, \rho^k + \alpha_k\|Ax^k + Bz^k\|\}$;
10: end for

If Algorithm 3 terminates with $(x^k, z^k)$, then $x^k \in X$, $z^k \in Z$, and $\|Ax^k + Bz^k\| \le \epsilon_p$. Since $\rho^k \ge \|\lambda^k\|_\infty$, the same derivation as in (19) gives $c^\top x^k + g^\top z^k \le p^* + \epsilon_p$. Therefore, $(x^k, z^k)$ is indeed an $\epsilon_p$-solution of the MILP (1).

Theorem 7
Algorithm 3 returns an $\epsilon_p$-solution of MILP (1) in a finite number of iterations.

Proof See Appendix A.2. ⊓⊔

In this section, we present an ADMM variant and state its convergence result. We start by looking into the penalty formulation of MILP (1):
$$\min_{x \in X,\, z \in Z}\; c^\top x + g^\top z + \rho\|Ax + Bz\|$$
for some $\rho > 0$. We further adopt the new notation
$$R_\rho(z) := \min_{x \in X}\; c^\top x + \rho\|Ax + Bz\|, \qquad (22)$$
in place of $R(z)$ in (12), where the dependency on $\lambda$ vanishes and $R_\rho$ is parameterized only by the penalty $\rho$. As a special case of (12), $R_\rho$ is also Lipschitz continuous with modulus $\rho\|B\|$.

We first introduce a new point of view for generating valid lower approximations of $R_\rho$ in Section 4.1, and then present our proposed ADMM variant in Section 4.2. Convergence and complexity results are given in Section 4.3.

4.1 Generating Cuts from the Dual Perspective

To begin with, for a given $z \in Z$ and $\rho > 0$, evaluating $R_\rho(z)$ requires solving the optimization problem (22), which can be regarded as a primal problem in $x$. We consider the following relaxation of this primal problem: given $\mu \in \mathbb{R}^m$ and $\beta \ge 0$, let
$$P(z, \mu, \beta) := \min_{x \in X}\; c^\top x + \langle \mu, Ax + Bz\rangle + \beta\|Ax + Bz\|. \qquad (23)$$
Notice that
$$P(z, \mu, \beta) \le \min_{x \in X}\; c^\top x + (\beta + \|\mu\|_\infty)\|Ax + Bz\| \le R_\rho(z) \qquad (24)$$
for all $(\mu, \beta) \in \Lambda(\rho)$, where
$$\Lambda(\rho) := \{(\mu, \beta) \in \mathbb{R}^{m+1} \mid \beta \ge 0,\; \beta + \|\mu\|_\infty \le \rho\}. \qquad (25)$$
Inequality (24) is a weak duality result in the sense that $P(z, \mu, \beta)$ provides a lower bound for $R_\rho(z)$ when the pair $(\mu, \beta)$ is constrained to $\Lambda(\rho)$. It turns out that strong duality also holds:
$$R_\rho(z) = \max_{(\mu, \beta) \in \Lambda(\rho)} P(z, \mu, \beta) = \max_{(\mu, \beta) \in \Lambda(\rho)} \min_{x \in X}\; c^\top x + \langle \mu, Ax + Bz\rangle + \beta\|Ax + Bz\|, \qquad (26)$$
simply due to the fact that $(0, \rho) \in \Lambda(\rho)$ and $P(z, 0, \rho) = R_\rho(z)$ by definition. Given some $\bar z \in Z$ and $(\bar\mu, \bar\beta) \in \Lambda(\rho)$, suppose we solve problem (23); then it holds that for any $z \in Z$,
$$\begin{aligned} R_\rho(z) = \max_{(\mu,\beta) \in \Lambda(\rho)} P(z, \mu, \beta) &\ge P(z, \bar\mu, \bar\beta) = \min_{x \in X}\; c^\top x + \langle \bar\mu, Ax + Bz\rangle + \bar\beta\|Ax + Bz\| \\ &\ge \min_{x \in X}\; c^\top x + \langle \bar\mu, Ax + B\bar z\rangle + \bar\beta\|Ax + B\bar z\| + \langle \bar\mu, Bz - B\bar z\rangle - \bar\beta\|Bz - B\bar z\| \\ &= P(\bar z, \bar\mu, \bar\beta) + \langle \bar\mu, Bz - B\bar z\rangle - \bar\beta\|Bz - B\bar z\|. \end{aligned}$$
Therefore, the function defined by
$$\tilde r(z; \bar z, \bar\mu, \bar\beta) := P(\bar z, \bar\mu, \bar\beta) + \langle \bar\mu, Bz - B\bar z\rangle - \bar\beta\|Bz - B\bar z\| \qquad (27)$$
is a lower approximation of $R_\rho(z)$.

Definition 4
We call the inequality $R_\rho(z) \ge \tilde r(z; \bar z, \bar\mu, \bar\beta)$ an augmented Lagrangian cut (AL cut) at $\bar z$ parameterized by $(\bar\mu, \bar\beta)$. We say the cut is tight at $z$ if $R_\rho(z) = \tilde r(z; \bar z, \bar\mu, \bar\beta)$.

We note that an AL cut is not necessarily tight, since $P(\bar z, \bar\mu, \bar\beta) < R_\rho(\bar z)$ when $(\bar\mu, \bar\beta) \in \Lambda(\rho)$ is not optimal for (26). The additional linear term $\langle \bar\mu, Bz - B\bar z\rangle$ corresponds to a rotation around the pivot $(\bar z, P(\bar z, \bar\mu, \bar\beta)) \in \mathbb{R}^{d+1}$. So, unlike a reverse norm cut, an AL cut does not point symmetrically downwards. In addition, since $\|\bar\mu\|_\infty + \bar\beta \le \rho$, an AL cut may have a smaller Lipschitz constant than $R_\rho$. Geometrically, the rotation effect and the smaller Lipschitz constant allow an AL cut to be "fatter" than $R_\rho$ and thus cover a wider range of $Z$ than a reverse norm cut. Moreover, a smaller value of $\bar\beta + \|\bar\mu\|_\infty$ can be preferable for optimization solvers.

AL cuts appeared recently in [1,37] to approximate nonconvex value functions in multistage stochastic optimization. Ahmed et al. [1] observed that AL cuts can be numerically more efficient than reverse norm cuts, but they did not provide a systematic convergence analysis of their algorithm using AL cuts. In order to obtain a tight cut, Zhang and Sun [37] assumed that an oracle for (a similar version of) the max-min problem (26) is available; in practice, this requires solving a series of MILP problems of the form (23), for example, in a double-looped ALM. As we will show in the next subsection, the proposed ADMM variant avoids the max-min problem (26) and only solves a single MILP of the form (23) in each iteration.

4.2 ADMM with AL Cuts

We approximate $R_\rho(z)$ using the AL cuts (27). The $x$-subproblem has the following form: at iteration $k$, given $z^{k-1}$ and some dual information $(\mu^k, \beta^k)$, we obtain $x^k$ as
$$x^k \in \operatorname*{Argmin}_{x \in X}\; c^\top x + \langle \mu^k, Ax + Bz^{k-1}\rangle + \beta^k\|Ax + Bz^{k-1}\|. \qquad (28)$$
Then an AL cut (27) is generated and appended to the $z$-subproblem, whose solution yields $z^k$:
$$\begin{aligned} \min_{z \in Z,\, t}\;\; & g^\top z + t & (29\mathrm{a})\\ \text{s.t.}\;\; & t \ge \tilde r(z; z^{j-1}, \mu^j, \beta^j) \quad \forall j \in [k]. & (29\mathrm{b}) \end{aligned}$$

Algorithm 4: An ADMM Framework with AL Cuts
1: Initialize $z^0 \in Z$, $(\mu^1, \beta^1) \in \mathbb{R}^m \times \mathbb{R}_+$;
2: for $k = 1, 2, \cdots$ do
3:   obtain $x^k$ as a minimizer of (28);
4:   obtain $(z^k, t^k)$ as a minimizer of (29);
5:   update $(\mu^{k+1}, \beta^{k+1}) \in \mathbb{R}^m \times \mathbb{R}_+$;
6: end for

A conceptual ADMM is described in Algorithm 4. One major difference between our proposed ADMM variant and the classic ADMM lies in the $z$-subproblem. In a traditional ADMM framework, the $z$-subproblem has a structure similar to that of the $x$-subproblem, i.e.,
$$\min_{z \in Z}\; g^\top z + \big(c^\top x^k + \langle \mu^k, Ax^k + Bz\rangle + \beta^k\sigma(Ax^k + Bz)\big), \qquad (30)$$
where $\sigma(\cdot) = \|\cdot\|^2$ (proximal Lagrangian) or $\sigma(\cdot) = \|\cdot\|$ (sharp Lagrangian) is usually used. Problem (29) consists of valid lower approximations of $R_\rho(z)$, which are refined over the iterations. In contrast, (30) approximates this dependency locally by only one piece (inside the parentheses), which is not necessarily a lower approximation. This might shed some light on why the classic ADMM may fail to converge to globally optimal solutions.

We do not specify how $(\mu^{k+1}, \beta^{k+1})$ is updated in Algorithm 4. Instead, we present a set of assumptions on the selection of $(\mu^{k+1}, \beta^{k+1})$ under which we establish convergence results, and we provide geometric intuition in the next section. Any update that meets the assumptions will work.

4.3 Convergence and Complexity

In this subsection, we study the convergence of the ADMM framework. In particular, we establish asymptotic convergence in Theorem 8 under Assumption 2 and an iteration complexity result in Theorem 9 under the slightly stronger Assumption 3.

Assumption 2
Let $\rho > 0$ be the minimum penalty that supports exact penalization for MILP (1). Suppose $(\mu^k, \beta^k)$ are chosen such that
1. $\beta^k - \|\mu^k\|_\infty \ge \rho$ for all sufficiently large $k \in \mathbb{N}$;
2. $\beta^k + \|\mu^k\|_\infty \le \bar\rho$ for all $k \in \mathbb{N}$, for some $\bar\rho > 0$.

Remark 1
1. Part 1 of Assumption 2 avoids too many loose cuts: it guarantees that, for sufficiently large $k$, the peak of the AL cut reaches at least $R_\rho$. Otherwise, the objective of the $z$-subproblem may remain a loose lower bound of $p^*$.
2. Part 1 can be satisfied if $\beta^k$ is bounded away from $\|\mu^k\|_\infty$ by some constant. However, we do not want $\beta^k$ to go to infinity, as the resulting generated cut would be very "slim"; this is prevented by Part 2. "Slim" cuts are not desirable, since many more of them are needed to construct a good approximation. The constant $\bar\rho$ also appears in the complexity result in Theorem 9.
3. The exact values of $\rho$ and $\bar\rho$ are not required in a numerical implementation. For example, one can update the dual variables as in the classic ADMM, $\mu^{k+1} = \mu^k + \beta^k(Ax^k + Bz^k)$, and then increase $\beta^{k+1}$ accordingly. The bound $\bar\rho$ is required for the analysis only; treat it as a constant large enough that projecting $(\mu, \beta)$ onto some bounded set is not necessary.

Recall that $p^*$ is the optimal value of the MILP (1) and that $t^k = \max_{j \in [k]} \{\tilde r(z^k; z^{j-1}, \mu^j, \beta^j)\}$.

Theorem 8
Suppose Assumption 2 holds. Let $\{(z^k, t^k, x^{k+1})\}_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 4, and let $(x^*, z^*)$ be a limit point of $\{(z^k, x^{k+1})\}_{k \in \mathbb{N}}$. The following claims hold.
1. $\{g^\top z^k + t^k\}_{k \in \mathbb{N}}$ converges to $p^*$ monotonically from below.
2. $p^* = g^\top z^* + R_\rho(z^*) = g^\top z^* + R_{\bar\rho}(z^*)$.
3. $(x^*, z^*)$ is an optimal solution to MILP (1).

Proof
1. We first prove that the sequence $\{g^\top z^k + t^k\}_{k \in \mathbb{N}}$ converges to $p^*$ monotonically from below. Since $g^\top z^k + t^k$ is the optimal value of problem (29), whose feasible region shrinks over $k \in \mathbb{N}$, the sequence $\{g^\top z^k + t^k\}_{k \in \mathbb{N}}$ is monotonically nondecreasing. Since $\|\mu^k\|_\infty + \beta^k \le \bar\rho$, it follows that $(\mu^k, \beta^k) \in \Lambda(\bar\rho)$ for all $k \in \mathbb{N}$, i.e., all cuts of the form (29b) are valid lower approximations of $R_{\bar\rho}(z)$. As a result, we have
$$g^\top z^k + t^k \le \min_{z \in Z}\; g^\top z + R_{\bar\rho}(z) = p^*,$$
where the equality holds because $\bar\rho$ supports exact penalization. The sequence $\{g^\top z^k + t^k\}_{k \in \mathbb{N}}$ is nondecreasing and bounded from above by $p^*$, so it converges to some $\bar p \le p^*$. Next, let $z^*$ be a limit point and $\{z^{k_j}\}_{j \in \mathbb{N}}$ a subsequence converging to $z^*$. Notice that, by Hölder's inequality and Assumption 2, we have for large enough $j$,
$$\begin{aligned} P(z^{k_{j-1}}, \mu^{k_{j-1}+1}, \beta^{k_{j-1}+1}) &= \min_{x \in X}\; c^\top x + \langle \mu^{k_{j-1}+1}, Ax + Bz^{k_{j-1}}\rangle + \beta^{k_{j-1}+1}\|Ax + Bz^{k_{j-1}}\| \\ &\ge \min_{x \in X}\; c^\top x + \big(\beta^{k_{j-1}+1} - \|\mu^{k_{j-1}+1}\|_\infty\big)\|Ax + Bz^{k_{j-1}}\| \\ &\ge \min_{x \in X}\; c^\top x + \rho\|Ax + Bz^{k_{j-1}}\| = R_\rho(z^{k_{j-1}}). \end{aligned} \qquad (31)$$
In addition,
$$\begin{aligned} g^\top z^{k_j} + t^{k_j} &\ge g^\top z^{k_j} + P(z^{k_{j-1}}, \mu^{k_{j-1}+1}, \beta^{k_{j-1}+1}) + \langle \mu^{k_{j-1}+1}, Bz^{k_j} - Bz^{k_{j-1}}\rangle - \beta^{k_{j-1}+1}\|Bz^{k_j} - Bz^{k_{j-1}}\| \\ &\ge g^\top z^{k_j} + R_\rho(z^{k_{j-1}}) + \langle \mu^{k_{j-1}+1}, Bz^{k_j} - Bz^{k_{j-1}}\rangle - \beta^{k_{j-1}+1}\|Bz^{k_j} - Bz^{k_{j-1}}\|, \end{aligned}$$
where the first inequality is due to (29b) and the second is due to (31). Taking limits on both sides of the above inequality gives
$$p^* \ge \bar p = \lim_{j \to \infty}\; g^\top z^{k_j} + t^{k_j} \ge g^\top z^* + R_\rho(z^*) \ge p^*,$$
where the second inequality is due to the continuity of $R_\rho$, the fact that $Bz^{k_j} - Bz^{k_{j-1}}$ vanishes, and the boundedness of $\{(\mu^k, \beta^k)\}_{k \in \mathbb{N}}$ from Assumption 2. So we conclude that $g^\top z^* + R_\rho(z^*) = p^* = \bar p$.

2. We have already shown the first equality. Let $x(\bar\rho, z^*) \in \operatorname{Argmin}_{x \in X} c^\top x + \bar\rho\|Ax + Bz^*\|$. Then it holds that
$$(x(\bar\rho, z^*), z^*) \in \operatorname*{Argmin}_{x \in X,\, z \in Z}\; c^\top x + g^\top z + \bar\rho\|Ax + Bz\| = \operatorname*{Argmin}_{x \in X,\, z \in Z}\{c^\top x + g^\top z \mid Ax + Bz = 0\}.$$
So we know $Ax(\bar\rho, z^*) + Bz^* = 0$ and $c^\top x(\bar\rho, z^*) + g^\top z^* = p^*$. Consequently,
$$p^* = g^\top z^* + R_\rho(z^*) \le g^\top z^* + R_{\bar\rho}(z^*) = g^\top z^* + \min_{x \in X}\; c^\top x + \bar\rho\|Ax + Bz^*\| \le g^\top z^* + c^\top x(\bar\rho, z^*) = p^*.$$
This proves the second equality.

3. It remains to show that $(x^*, z^*)$ is an optimal solution to MILP (1). Again, let $\{(z^{k_j}, x^{k_j+1})\}_{j \in \mathbb{N}}$ be the subsequence converging to $(z^*, x^*)$. Notice that
$$g^\top z^{k_j} + R_\rho(z^{k_j}) \le g^\top z^{k_j} + c^\top x^{k_j+1} + \langle \mu^{k_j+1}, Ax^{k_j+1} + Bz^{k_j}\rangle + \beta^{k_j+1}\|Ax^{k_j+1} + Bz^{k_j}\| \le g^\top z^{k_j} + R_{\bar\rho}(z^{k_j});$$
assuming without loss of generality that $\lim_{j \to \infty}(\mu^{k_j+1}, \beta^{k_j+1}) = (\mu^*, \beta^*)$ and taking limits on both sides of the above two inequalities, we have
$$c^\top x^* + g^\top z^* + \langle \mu^*, Ax^* + Bz^*\rangle + \beta^*\|Ax^* + Bz^*\| = p^*,$$
where the equality holds due to the second claim. It suffices to show $Ax^* + Bz^* = 0$. Suppose not; then by Theorem 1, $(x^*, z^*) \notin \operatorname{Argmin}_{x \in X, z \in Z} c^\top x + g^\top z + \rho\|Ax + Bz\|$; therefore,
$$p^* = c^\top x^* + g^\top z^* + \langle \mu^*, Ax^* + Bz^*\rangle + \beta^*\|Ax^* + Bz^*\| \ge c^\top x^* + g^\top z^* + \rho\|Ax^* + Bz^*\| > \min_{x \in X,\, z \in Z}\; c^\top x + g^\top z + \rho\|Ax + Bz\| = p^*,$$
which is the desired contradiction. This completes the proof. ⊓⊔

In order to establish the iteration complexity of Algorithm 4, we need a slightly stronger version of Assumption 2.
Assumption 3
Let $\rho > 0$ be the minimum penalty that supports exact penalization for MILP (1). Suppose $(\mu^k, \beta^k)$ are chosen such that
1. $\beta^k - \|\mu^k\|_\infty \ge \rho$ for all $k \in \mathbb{N}$;
2. $\beta^k + \|\mu^k\|_\infty \le \bar\rho$ for all $k \in \mathbb{N}$.

Theorem 9
Suppose Assumption 3 holds, and $Z \subseteq B(\bar z; R)$ for some $\bar z \in \mathbb{R}^d$ and radius $R > 0$. Let $\epsilon > 0$, and let $\{(z^k, t^k)\}_{k \in \mathbb{N}}$ be the sequence generated by line 4 of Algorithm 4. Then Algorithm 4 finds a solution $(z^K, t^K)$ satisfying
1. (optimality gap) $p^* - (g^\top z^K + t^K) \le \epsilon$,
2. (approximate solution) $(\tilde x^K, z^K)$ is an $\epsilon$-solution to MILP (1), where $\tilde x^K \in \operatorname{Argmin}_{x \in X} c^\top x + \rho\|Ax + Bz^K\|$,
in no more than
$$K \le \left(\frac{(\rho + \bar\rho)\|B\| R}{\epsilon}\right)^d \qquad (32)$$
iterations.

Proof For all nonnegative integers $i < j$, we have
$$\begin{aligned} \max\{p^* - (g^\top z^j + t^j),\; g^\top z^j + R_\rho(z^j) - p^*\} &\le R_\rho(z^j) - t^j \\ &\le R_\rho(z^j) - P(z^i, \mu^{i+1}, \beta^{i+1}) - \langle \mu^{i+1}, Bz^j - Bz^i\rangle + \beta^{i+1}\|Bz^j - Bz^i\| \\ &\le R_\rho(z^j) - R_\rho(z^i) + (\|\mu^{i+1}\|_\infty + \beta^{i+1})\|Bz^j - Bz^i\| \\ &\le (\rho + \bar\rho)\|B\|\,\|z^j - z^i\|, \end{aligned} \qquad (33)$$
where the first inequality is due to $g^\top z^j + t^j \le p^* \le g^\top z^j + R_\rho(z^j)$, the second inequality is due to (29b), the third inequality is due to (31), and the last inequality is due to Assumption 3 and $R_\rho$ being $\rho\|B\|$-Lipschitz. Let $K$ be the smallest index such that
$$\max\{p^* - (g^\top z^K + t^K),\; g^\top z^K + R_\rho(z^K) - p^*\} \le \epsilon.$$
Then we must have $\|z^i - z^j\| > \epsilon/[(\rho + \bar\rho)\|B\|]$ for all $0 \le i < j \le K - 1$, since otherwise (33) implies
$$\max\{p^* - (g^\top z^j + t^j),\; g^\top z^j + R_\rho(z^j) - p^*\} \le (\rho + \bar\rho)\|B\|\,\|z^j - z^i\| \le \epsilon,$$
contradicting the choice of $K$. Using the same argument as in the proof of Theorem 3, Part 2, we can bound $K$ by (32).

It remains to show that the claimed $(\tilde x^K, z^K)$ is an $\epsilon$-solution of the MILP (1). Since $g^\top z^K + R_\rho(z^K) - p^* \le \epsilon$ and $c^\top \tilde x^K + \rho\|A\tilde x^K + Bz^K\| = R_\rho(z^K)$, we know $(\tilde x^K, z^K)$ is an $\epsilon$-optimal solution to the problem $\min_{x \in X, z \in Z} c^\top x + g^\top z + \rho\|Ax + Bz\|$. Now by Assumption 3 and the proof of Theorem 4, we conclude that $(\tilde x^K, z^K)$ is an $\epsilon$-solution of MILP (1). ⊓⊔

In each outer iteration of the proposed ALM, given a pair of dual variables $(\bar\lambda, \bar\rho)$, we use AUSAL (Algorithm 1) to solve the primal subproblem (5): in each inner iteration, given some $\bar z \in Z$, we approximate the function $R(z; \bar\lambda, \bar\rho)$ by adding a reverse norm cut of the form
$$R(z; \bar\lambda, \bar\rho) \ge R(\bar z; \bar\lambda, \bar\rho) - K_{\bar\rho}\|z - \bar z\| \qquad (34)$$
to the $z$-subproblem, where $R(z; \bar\lambda, \bar\rho)$ is defined in (12) and explicitly parameterized by $(\bar\lambda, \bar\rho)$. When the current AUSAL call terminates, we update a new pair of dual variables $(\lambda, \rho)$ and start the next AUSAL call.

Notice that the cuts (34) generated with $(\bar\lambda, \bar\rho)$ are in general no longer valid for $R(z; \lambda, \rho)$. Naively, we could remove all previous cuts when starting a new AUSAL call, but this would lose historical information. Instead, we can modify the old cuts so that they always stay valid for the latest dual information $(\lambda, \rho)$. Assuming the penalty is nondecreasing, i.e., $\rho \ge \bar\rho$, we have
$$\begin{aligned} R(\bar z; \lambda, \rho) &= \min_{x \in X}\; c^\top x + \langle \lambda, Ax\rangle + \rho\|Ax + B\bar z\| \\ &= \min_{x \in X}\; c^\top x + \langle \bar\lambda, Ax\rangle + \bar\rho\|Ax + B\bar z\| + \langle \lambda - \bar\lambda, Ax\rangle + (\rho - \bar\rho)\|Ax + B\bar z\| \\ &\ge R(\bar z; \bar\lambda, \bar\rho) - \|\lambda - \bar\lambda\|_\infty \Big(\max_{x \in X}\|Ax\|\Big), \end{aligned}$$
and it follows that
$$R(z; \lambda, \rho) \ge R(\bar z; \bar\lambda, \bar\rho) - \|\lambda - \bar\lambda\|_\infty \Big(\max_{x \in X}\|Ax\|\Big) - K_\rho\|z - \bar z\| \qquad (35)$$
is a valid inequality for $R(z; \lambda, \rho)$ over $z \in Z$.
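The cut-modification rule (35) amounts to a simple bookkeeping step. The sketch below is our own illustration, with hypothetical names and with $\max_{x\in X}\|Ax\|$ assumed precomputed: it lowers the constant of an old reverse norm cut and swaps in the new Lipschitz coefficient.

```python
import numpy as np

def modify_cut(cut_val, lam_old, lam_new, K_new, max_Ax_norm):
    """Per (35): an old cut R(z) >= cut_val - K_old*||z - z_bar||, generated
    under lam_old, stays valid under (lam_new, rho_new) with rho_new >= rho_old
    after lowering the constant by ||lam_new - lam_old||_inf * max_{x in X}||Ax||
    and using the new Lipschitz coefficient K_new."""
    new_val = cut_val - np.max(np.abs(lam_new - lam_old)) * max_Ax_norm
    return new_val, K_new  # modified cut: R(z) >= new_val - K_new*||z - z_bar||

val, K = modify_cut(5.0, np.zeros(2), np.array([0.5, -1.0]), 3.0, 4.0)
# ||lam_new - lam_old||_inf = 1.0, so val = 5.0 - 1.0*4.0 = 1.0 and K = 3.0
```

Only the scalar offset and the slope of each stored cut change; the anchor point $\bar z$ is untouched.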
The new cuts (35) can be obtained by modifying the coefficient $K_{\bar\rho}$ and the constant $R(\bar z; \bar\lambda, \bar\rho)$ in the old cuts (34). Though these modified cuts may not be tight, empirically we observe that they help ALM maintain a more stable lower bound.

5.2 Comparison with Primal and Dual Decompositions

In this section, we present numerical experiments on the generic multi-block MILP problem as a proof of concept:
$$\min_{x_1 \in X_1, \cdots, x_p \in X_p}\; c_1^\top x_1 + \cdots + c_p^\top x_p \quad \text{subject to} \quad A_1 x_1 + \cdots + A_p x_p \le b. \qquad (36)$$
In view of the discussion on (3), our proposed algorithms are applicable to the two-block reformulation of the above problem. The authors of [9,31] apply primal and dual decomposition algorithms, respectively, to solve a restricted problem, where the right-hand-side vector $b$ is replaced by $b - \sigma$. The introduction of the vector $\sigma \in \mathbb{R}^m_+$ makes it possible to recover a feasible solution of (36) from the Lagrangian dual framework.

We compare the performance of the proposed algorithms with the primal and dual decomposition algorithms. We set $p = 100$ and $m = 10$. For each block $i \in [p]$, we let $X_i$ be a bounded mixed-integer set of the form $\{x_i \mid E_i x_i \le f_i\}$ with box-constrained integer and continuous coordinates; we generate $(E_i, f_i, c_i)$ according to [9, Section 5] and $A_i$ with standard Gaussian entries. The restriction vector $\sigma$ is calculated according to [9, Section 4.1] for both the primal and dual decomposition algorithms, as the one used in [31] is typically larger in magnitude and results in infeasibility for the tested instances. For five decreasing right-hand-side vectors of the form $b = b_0 e$, where $e$ is the vector of all ones and the largest value is $b_0 = 1500$, we generate five instances. Notice that as the resource vector $b$ decreases, we expect the problem to become more challenging.

We initialize the ALM dual variable as $\lambda^1 = 0$ with a small initial penalty $\rho^1$. Given $(x^k, z^k)$, we update the dual variable $\lambda^{k+1}$ as in line 6 of Algorithm 2 with $\tau_k$ proportional to $1/k$, and increase the penalty geometrically, $\rho^{k+1} = \gamma\rho^k$ for a fixed $\gamma > 1$. For ADMM, we initialize the dual variable as $\mu^1 = 0$ with a small initial penalty $\beta^1$, and update
$$\mu^{k+1} = \mu^k + \beta^k(Ax^k + Bz^{k-1}), \qquad \beta^{k+1} = \begin{cases} \gamma\beta^k & \text{if } 5 \text{ divides } k, \\ \beta^k & \text{otherwise.} \end{cases} \qquad (37)$$
All MILP subproblems are solved by Gurobi 8.1.0. Since the comparison involves both single-looped (ADMM and dual decomposition) and double-looped (ALM and primal decomposition) algorithms, we set a time limit of 20 minutes (MAX) for each generated instance, and report the best duality gap achieved, i.e., (upper bound − lower bound)/|upper bound|, as well as the corresponding time, in Table 1.

[Table 1 reports, for each right-hand side $b$ from $1500e$ downward, the Gap(%) and Time(s) of ALM, ADMM, Primal [9], and Dual [31].]

Table 1 Comparison with primal and dual decomposition on multi-block MILP
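The ADMM dual and penalty update used in these experiments, cf. (37), can be sketched as follows. This is a hedged illustration: the growth factor `growth` and the period of 5 iterations stand in for the constants of the actual implementation, and the function name is our own.

```python
import numpy as np

def dual_update(mu, beta, A, B, x_k, z_prev, k, growth=2.0):
    """Multiplier step on the residual A x^k + B z^{k-1}, followed by a
    geometric increase of the penalty every 5 iterations, as in (37)."""
    mu_next = mu + beta * (A @ x_k + B @ z_prev)
    beta_next = growth * beta if k % 5 == 0 else beta
    return mu_next, beta_next

mu1, b1 = dual_update(np.zeros(2), 1.0, np.eye(2), -np.eye(2),
                      np.array([1.0, 2.0]), np.array([0.0, 2.0]), k=5)
# residual = (1, 0), so mu1 = (1, 0); k = 5 doubles the penalty: b1 = 2.0
```

Keeping the penalty increase infrequent (every few iterations) is what keeps the generated AL cuts from becoming too "slim", in the sense of Remark 1.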
The proposed ALM and ADMM are able to find globally optimal solutions in a few iterations for relatively large values of $b$. However, as the magnitude of $b$ decreases, ALM and ADMM have difficulty locating feasible solutions for some cases; there we use the augmented Lagrangian function value as an upper-bound estimate to calculate the gap (marked by "∗") and report "MAX" in the time column. For such cases, ALM and ADMM cannot terminate successfully within the time limit, and we observe that the $z$-subproblem suffers from increasingly long computation times as the two algorithms proceed. This is because the number of binary variables and constraints in the $z$-subproblem grows linearly with the number of cuts, which makes the subproblem harder to solve over the iterations. In contrast, the primal and dual decomposition algorithms are able to obtain feasible solutions in all tested instances. Nevertheless, since they essentially solve a restricted problem, convergence to solutions with large duality gaps is observed.

In terms of computation time, the dual decomposition algorithm is consistently fast, since it is single-looped and maintains simple subproblems. The primal decomposition algorithm is double-looped, and each local subproblem is solved by a cutting-plane subroutine, which can affect the overall computation time in two ways: on one hand, the cutting-plane subroutine itself can be slow when applied to nonsmooth problems; on the other hand, the quality of the solutions returned by the cutting-plane subroutine can further influence the convergence of the outer-level subgradient method. We indeed observe that sometimes the cutting-plane subroutine is not able to terminate within 200 iterations, so that only an inexact subgradient is available for the outer-level update. This might also account for the large duality gaps in Table 1. Finally, the proposed ALM and ADMM are usually fast in the early stage: when they converge successfully for the instances in Table 1, the solution time is less than or comparable to that of dual decomposition. However, as we commented earlier, they can be slow in the long run, especially after a few hundred cuts have been appended to the $z$-subproblem.

In this paper, we study generic MILP problems with two blocks of variables $x$ and $z$. We propose an algorithm named AUSAL that alternately updates $x$ and $z$ in the augmented Lagrangian function, and provide its convergence and complexity results. AUSAL can be directly applied to the penalty formulation, or embedded inside ALM to solve the augmented Lagrangian dual problem. We also propose a single-looped ADMM variant, which is built upon the AL cuts introduced in [1,37]. Different from the procedures used in these two references, we obtain an AL cut by solving a single augmented Lagrangian relaxation in the variable $x$; compared to existing ADMM works, our ADMM variant allows a more flexible update scheme for the dual variable and penalty, and is guaranteed to converge to a globally optimal solution with an iteration complexity estimate. When a certain block-angular structure is present, the update of $x$ can be further decomposed and solved in parallel in both algorithms.

We test the proposed ALM and ADMM variants on multi-block MILP problems and compare them with existing decomposition algorithms. Admittedly, the update of the $z$ variable in both algorithms requires solving a MILP problem with an increasing number of variables and constraints, which can be the computational bottleneck for large and dense problems. We are interested in investigating more practical subproblem oracles, e.g., managing a controllable number of cuts, or new methodologies to approximate the dependency between $x$ and $z$. We leave these for future work.

Acknowledgements
We would like to thank the authors of [9] for making their primal-decomposition code available and the authors of [37] for the discussion on the numerical performance of AL cuts. A part of this work was done during an internship with Alibaba (US) Innovation Research.
References
1. Ahmed, S., Cabral, F.G., da Costa, B.F.P.: Stochastic Lipschitz dynamic programming. Mathematical Programming pp. 1–39 (2020)
2. Alavian, A., Rotkowitz, M.C.: Improving ADMM-based optimization of mixed integer objectives. In: 2017 51st Annual Conference on Information Sciences and Systems (CISS). IEEE (2017)
3. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press (2014)
4. Blair, C.E., Jeroslow, R.G.: The value function of a mixed integer program: I. Discrete Mathematics (2), 121–138 (1977)
5. Boland, N., Christiansen, J., Dandurand, B., Eberhard, A., Oliveira, F.: A parallelizable augmented Lagrangian method applied to large-scale non-convex-constrained optimization problems. Mathematical Programming (1-2), 503–536 (2019)
6. Burachik, R.S., Gasimov, R.N., Ismayilova, N.A., Kaya, C.Y.: On a modified subgradient algorithm for dual problems via sharp augmented Lagrangian. Journal of Global Optimization (1), 55–78 (2006)
7. Burachik, R.S., Iusem, A.N., Melo, J.G.: A primal dual modified subgradient algorithm with sharp Lagrangian. Journal of Global Optimization (3), 347–361 (2010)
8. Burachik, R.S., Iusem, A.N., Melo, J.G.: An inexact modified subgradient algorithm for primal-dual problems via augmented Lagrangians. Journal of Optimization Theory and Applications (4), 563–568 (2018)
12. Falsone, A., Margellos, K., Prandini, M.: A decentralized approach to multi-agent MILPs: finite-time feasibility and performance guarantees. Automatica, 141–150 (2019)
13. Feizollahi, M.J., Ahmed, S., Sun, A.: Exact augmented Lagrangian duality for mixed integer linear programming. Mathematical Programming (1-2), 365–387 (2017)
14. Gasimov, R.N.: Augmented Lagrangian duality and nondifferentiable optimization methods in nonconvex programming. Journal of Global Optimization (2), 187–203 (2002)
15. Hestenes, M.R.: Multiplier and gradient methods. Journal of Optimization Theory and Applications (5), 303–320 (1969)
16.
Huang, X., Yang, X.: A unified augmented Lagrangian approach to duality and exact penalization. Mathematics of Operations Research (3), 533–552 (2003)
17. Jiang, B., Lin, T., Ma, S., Zhang, S.: Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications (1), 115–157 (2019)
18. Kanno, Y., Fujita, S.: Alternating direction method of multipliers for truss topology optimization with limited number of nodes: A cardinality-constrained second-order cone programming approach. Optimization and Engineering (2), 327–358 (2018)
19. Kanno, Y., Kitayama, S.: Alternating direction method of multipliers as a simple effective heuristic for mixed-integer nonlinear optimization. Structural and Multidisciplinary Optimization (3), 1291–1295 (2018)
20. Mayne, D.Q., Polak, E.: Outer approximation algorithm for nondifferentiable optimization problems. Journal of Optimization Theory and Applications (6), 555–562 (1973)
24. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research (2), 97–116 (1976)
25. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer Science & Business Media (2009)
26. Ruszczynski, A.: Nonlinear Optimization. Princeton University Press (2011)
27. Sun, K., Sun, X.A.: A two-level distributed algorithm for nonconvex constrained optimization. arXiv preprint arXiv:1902.07654 (2019)
28. Takapoui, R.: The alternating direction method of multipliers for mixed-integer optimization applications. Ph.D. thesis, Stanford University (2017)
29. Takapoui, R., Moehle, N., Boyd, S., Bemporad, A.: A simple effective heuristic for embedded mixed-integer quadratic programming. International Journal of Control (1), 2–12 (2020)
30. Testa, A., Rucco, A., Notarstefano, G.: A finite-time cutting plane algorithm for distributed mixed integer linear programming. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 3847–3852. IEEE (2017)
31. Vujanic, R., Esfahani, P.M., Goulart, P.J., Mariéthoz, S., Morari, M.: A decomposition method for large scale MILPs, with performance guarantees and a power system application. Automatica, 144–156 (2016)
32. Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing (1), 29–63 (2019)
33. Wu, B., Ghanem, B.: ℓp-box ADMM: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence (7), 1695–1708 (2018)
34. Xu, Y., Yin, W.: A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing (2), 700–734 (2017)
35. Yadav, A.K., Ranjan, R., Mahbub, U., Rotkowitz, M.C.: New methods for handling binary constraints. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1074–1080. IEEE (2016)
36. Yao, Y., Zhu, X., Dong, H., Wu, S., Wu, H., Tong, L.C., Zhou, X.: ADMM-based problem decomposition scheme for vehicle routing problem with time windows. Transportation Research Part B: Methodological, 156–174 (2019)
37. Zhang, S., Sun, X.A.: Stochastic dual dynamic programming for multistage stochastic mixed-integer nonlinear optimization. arXiv preprint arXiv:1912.13278 (2019)
38. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Mathematical Programming (1-2), 461–502 (2019)

A Proofs
A.1 Proof of Theorem 6
We first state a standard lemma regarding the progress in objective value.
Lemma 3
Let $\{(\lambda^k, \rho^k)\}_{k \in \mathbb{N}}$ be the sequence of iterates generated by Algorithm 2. Then for all $K \ge 1$, it holds that
$$\min_{k \in [K]}\; p^* - d(\lambda^k, \rho^k) \le \frac{M\big(d_0^2 + \sum_{k=1}^K \tau_k^2\big)}{\sqrt{2}\sum_{k=1}^K \tau_k} + \epsilon_p.$$
To simplify notation, we denote $w^k = (\lambda^k, \rho^k)$ and let $w^* = (\lambda^*, \rho^*)$ be the minimizer in the definition of $d_0$. We also denote the $\epsilon_p$-subgradient of $(-d)$ at $w^k$ by $g^k$; it holds that
$$\|g^k\|^2 = \|Ax^k + Bz^k\|^2 + \|Ax^k + Bz^k\|^2 = 2\|Ax^k + Bz^k\|^2 \le 2M^2. \qquad (38)$$
Also notice that
$$\|w^{k+1} - w^*\|^2 = \|w^k - \alpha_k g^k - w^*\|^2 = \|w^k - w^*\|^2 + \alpha_k^2\|g^k\|^2 - 2\alpha_k\langle g^k, w^k - w^*\rangle \le \|w^k - w^*\|^2 + \alpha_k^2\|g^k\|^2 - 2\alpha_k\big(p^* - d(w^k) - \epsilon_p\big),$$
where we use the fact that $g^k \in \partial_{\epsilon_p}(-d)(w^k)$ in the last inequality. Rearranging terms and summing over $k = 1, \cdots, K$, we have
$$2\Big(\min_{k \in [K]}\; p^* - d(\lambda^k, \rho^k) - \epsilon_p\Big)\sum_{k=1}^K \alpha_k \le \sum_{k=1}^K 2\alpha_k\big(p^* - d(\lambda^k, \rho^k) - \epsilon_p\big) \le \|w^1 - w^*\|^2 + \sum_{k=1}^K \alpha_k^2\|g^k\|^2 \le \|w^1 - w^*\|^2 + 2\sum_{k=1}^K \alpha_k^2\|Ax^k + Bz^k\|^2 = d_0^2 + \sum_{k=1}^K \tau_k^2,$$
which further implies
$$\min_{k \in [K]}\; p^* - d(\lambda^k, \rho^k) \le \frac{d_0^2 + \sum_{k=1}^K \tau_k^2}{2\sum_{k=1}^K \alpha_k} + \epsilon_p \le \frac{M\big(d_0^2 + \sum_{k=1}^K \tau_k^2\big)}{\sqrt{2}\sum_{k=1}^K \tau_k} + \epsilon_p,$$
since $\alpha_k = \tau_k/(\sqrt{2}\|Ax^k + Bz^k\|) \ge \tau_k/(\sqrt{2}M)$. ⊓⊔

Proof (Proof of Theorem 6)
The first claim follows from Lemma 3 and the choices of $\tau_i$ and $K$, so that
$$\frac{M\big(d_0^2 + \sum_{k=1}^K \tau_k^2\big)}{\sqrt{2}\sum_{k=1}^K \tau_k} = \frac{M(d_0^2 + d_0^2)}{\sqrt{2}\sqrt{K}\, d_0} = \frac{\sqrt{2}M d_0}{\sqrt{K}} \le \epsilon_d.$$
Let $\gamma_k = \alpha_k/\tau_k$. Recalling the bound in (38), we have $\gamma_k^2\|g^k\|^2 \le 1$; the second claim is then proved in [26, Theorem 7.4]. ⊓⊔

A.2 Proof of Theorem 7
Proof (Proof of Theorem 7)
First notice that, according to the penalty update, we have
$$\rho^k = \rho^1 + \sum_{j=2}^k (\rho^j - \rho^{j-1}) = \rho^1 + \sum_{j=2}^k \max\{\|\lambda^j\|_\infty - \rho^{j-1},\; \alpha_{j-1}\|Ax^{j-1} + Bz^{j-1}\|\} \ge \rho^1 + \sum_{j=2}^k \alpha_{j-1}\|Ax^{j-1} + Bz^{j-1}\| = \rho^1 + (k-1)\tau.$$
For the purpose of contradiction, suppose $\|Ax^k + Bz^k\| > \epsilon_p$ for all $k \in \mathbb{N}$, so that Algorithm 3 generates an unbounded sequence $\{\rho^k\}_{k \in \mathbb{N}}$. Let $(\lambda^*, \rho^*)$ be an optimal solution of the dual problem (6). Then we have
$$\begin{aligned} \|\lambda^{k+1} - \lambda^*\|^2 &= \|\lambda^k + \alpha_k(Ax^k + Bz^k) - \lambda^*\|^2 = \|\lambda^k - \lambda^*\|^2 + \alpha_k^2\|Ax^k + Bz^k\|^2 + 2\alpha_k\langle Ax^k + Bz^k, \lambda^k - \lambda^*\rangle \\ &\le \|\lambda^k - \lambda^*\|^2 + \alpha_k^2\|Ax^k + Bz^k\|^2 + 2\alpha_k\big(d(\lambda^k, \rho^k) - p^* + \epsilon_p + \|Ax^k + Bz^k\|(\rho^* - \rho^k)\big) \\ &\le \|\lambda^k - \lambda^*\|^2 + \alpha_k^2\|Ax^k + Bz^k\|^2 + 2\alpha_k\big(\epsilon_p + \|Ax^k + Bz^k\|(\rho^* - \rho^k)\big), \end{aligned} \qquad (39)$$
where the first inequality is due to (21), and the second inequality is due to $d(\lambda^k, \rho^k) \le p^*$. In view of the definition of $\alpha_k$ and the fact that $\|Ax^k + Bz^k\| > \epsilon_p$, (39) further implies that
$$\|\lambda^{k+1} - \lambda^*\|^2 \le \|\lambda^k - \lambda^*\|^2 + \tau^2 + 2\tau + 2\tau\rho^* - 2\tau\rho^k. \qquad (40)$$
Notice that when $\rho^k \ge \rho^* + \tau/2 + 1$, we have $\|\lambda^{k+1} - \lambda^*\| \le \|\lambda^k - \lambda^*\|$, and thus the dual sequence $\{\lambda^k\}$ stays bounded. Now, summing (40) over $k$ and letting $k \to \infty$, the left-hand side is nonnegative while the right-hand side goes to $-\infty$ (since $\rho^k \to +\infty$), which is the desired contradiction. ⊓⊔