Inexact Stochastic Mirror Descent for two-stage nonlinear stochastic programs
IINEXACT STOCHASTIC MIRROR DESCENT FOR TWO-STAGE NONLINEARSTOCHASTIC PROGRAMS
Vincent GuiguesSchool of Applied Mathematics, FGVPraia de Botafogo, Rio de Janeiro, Brazil [email protected]
Abstract.
We introduce an inexact variant of Stochastic Mirror Descent (SMD), called Inexact StochasticMirror Descent (ISMD), to solve nonlinear two-stage stochastic programs where the second stage problemhas linear and nonlinear coupling constraints and a nonlinear objective function which depends on bothfirst and second stage decisions. Given a candidate first stage solution and a realization of the second stagerandom vector, each iteration of ISMD combines a stochastic subgradient descent using a prox-mappingwith the computation of approximate (instead of exact for SMD) primal and dual second stage solutions.We provide two convergence analysis of ISMD, under two sets of assumptions. We show in particular thatunder some assumptions, ISMD has the same convergence rate as SMD. The first convergence analysis isbased on the formulas for inexact cuts of value functions of convex optimization problems shown recentlyin [6]. The second convergence analysis relies on new formulas that we derive for inexact cuts of valuefunctions of convex optimization problems assuming that the dual function of the second stage problem forall fixed first stage solution and realization of the second stage random vector, is strongly concave. We showthat this assumption of strong concavity is satisfied for some classes of problems and present the results ofnumerical experiments on two simple two-stage problems which show that solving approximately the secondstage problem for the first iterations of ISMD can help us obtain a good approximate first stage solutionquicker than with SMD.
Keywords:
Inexact cuts for value functions and Inexact Stochastic Mirror Descent and Strong Concavityof the dual function and Stochastic Programming.AMS subject classifications: 90C15, 90C90.1.
Introduction
We are interested in inexact solution methods for two-stage nonlinear stochastic programs of form(1.1) (cid:26) min f ( x ) := f ( x ) + Q ( x ) x ∈ X with X ⊂ R n a convex, nonempty, and compact set, and Q ( x ) = E ξ [ Q ( x , ξ )] where E is the expectationoperator, ξ is a random vector with probability distribution P on Ξ ⊂ R k , and(1.2) Q ( x , ξ ) = (cid:26) min x f ( x , x , ξ ) x ∈ X ( x , ξ ) := { x ∈ X : Ax + Bx = b, g ( x , x , ξ ) ≤ } . In the problem above vector ξ contains in particular the random elements in matrices A, B , and vector b .Problem (1.1) is the first stage problem while problem (1.2) is the second stage problem which has abstractconstraints ( x ∈ X ), and linear ( Ax + Bx = b ) and nonlinear ( g ( x , x , ξ ) ≤
0) constraints both ofwhich couple first stage decision x and second stage decision x . Our solution methods are suited for thefollowing framework:a) first stage problem (1.1) is convex, i.e., X , X , and f are convex and for every ξ functions f ( · , · , ξ )and g ( · , · , ξ ) are convex;b) second stage problem (1.2) is convex, i.e., X is convex and for every ξ functions f ( · , · , ξ ) and g ( · , · , ξ ) are convex; a r X i v : . [ m a t h . O C ] J u l ) for every realization ˜ ξ of ξ , the primal second stage problem obtained replacing ξ by ˜ ξ in (1.2)with optimal value Q ( x , ˜ ξ ) and its dual (obtained dualizing coupling constraints) are solved ap-proximately.There is a large literature on solution methods for two-stage risk-neutral stochastic programs. Essentially,these methods can be cast in two categories: (A) decomposition methods based on sampling and cuttingplane approximations of Q (which date back to [3],[8]) and their variants with regularization such as [17]and (B) Robust Stochastic Approximation [15] and its variants such as stochastic Primal-Dual subgradientmethods [9], Stochastic Mirror Descent (SMD) [13], [10], or Multistep Stochastic Mirror Descent (MSMD)[5]. These methods have been extended to solve multistage problems, for instance Stochastic Dual DynamicProgramming [14], belonging to class (A), and recently Dynamic Stochastic Approximation [11], belongingto class (B).However, for all these methods, it is assumed that second stage problems are solved exactly. This latterassumption is not satisfied when the second stage problem is nonlinear since in this setting only approximatesolutions are available. On top of that, for the first iterations, we still have crude approximations of thefirst stage solution and it may be useful to solve inexactly, with less accuracy, the second stage problem forthese iterations and to increase the accuracy of the second stage solutions computed when the algorithmprogresses in order to decrease the overall computational bulk.Therefore the objective of this paper is to fill a gap considering the situation when second stage problemsare nonlinear and solved approximately (both primal and dual, see Assumption c) above). More precisely,to account for Assumption (c), as an extension of the methods from class (B) we derive an Inexact Sto-chastic Mirror Descent (ISMD) algorithm, designed to solve problems of form (1.1). This inexact solutionmethod is based on an inexact black box for the objective in (1.1). To this end, we compute inexact cuts(affine lower bounding functions) for the value function of a convex optimization problem with the argumentof the value function appearing in the objective and the linear and nonlinear constraints, on the basis ofapproximate primal and dual solutions. For this analysis, we first need formulas for exact cuts (cuts basedon exact primal and dual solutions). We had shown such formulas in [4, Lemma 2.1] using convex analysistools, in particular standard calculus on normal and tangeant cones. We derive in Proposition 3.2 a prooffor these formulas based purely on duality. This is an adaptation of the proof of the formulas we gave in [6,Proposition 2.7] for inexact cuts, considering exact solutions instead of inexact solutions. To our knowledge,the computation of inexact cuts for value functions has only been discussed in [6] so far (see Proposition3.7). We propose in Section 3 new formulas for computing inexact cuts based in particular on the strongconcavity of the dual function. In Section 2, we provide, for several classes of problems, conditions ensuringthat the dual function of an optimization problem is strongly concave and give formulas for computing thecorresponding constant of strong concavity when possible. It turns out that our results improve Theorem10 in [19] (the only reference we are aware of on the strong concavity of the dual function) which provesthe strong concavity of the dual function under stronger assumptions. The tools developped in Sections 2and 3 allow us to build the inexact black boxes necessary for the Inexact Stochastic Mirror Descent (ISMD)algorithm and its convergence analysis presented in Section 4. Finally, in Section 5 we report the results ofnumerical tests comparing the performance of SMD and ISMD on two simple two-stage nonlinear stochasticprograms.Throughout the paper, we use the following notation: • The domain dom( f ) of a function f : X → ¯ R is the set of points in X such that f is finite:dom( f ) = { x ∈ X : −∞ < f ( x ) < + ∞} . • The largest (resp. smallest) eigenvalue of a matrix Q having real-valued eigenvalues is denoted by λ max ( Q ) (resp. λ min ( Q )). • The (cid:107) · (cid:107) of a matrix A is given by (cid:107) A (cid:107) = max x (cid:54) =0 (cid:107) Ax (cid:107) (cid:107) x (cid:107) . • Diag( x , x , . . . , x n ) is the n × n diagonal matrix whose entry ( i, i ) is x i . • For a linear application A , Ker( A ) is its kernel and Im( A ) its image. • (cid:104)· , · , (cid:105) is the usual scalar product in R n : (cid:104) x, y (cid:105) = (cid:80) ni =1 x i y i which induces the norm (cid:107) x (cid:107) = (cid:112)(cid:80) ni =1 x i . • Let f : R n → ¯ R be an extended real-valued function. The Fenchel conjugate f ∗ of f is the functiongiven by f ∗ ( x ∗ ) = sup x ∈ R n (cid:104) x ∗ , x (cid:105) − f ( x ). For functions f : X → Y and g : Y → Z , the function g ◦ f : X → Z is the composition of functions g and f given by ( g ◦ f )( x ) = g ( f ( x )) for every x ∈ X .2. On the strong concavity of the dual function of an optimization problem
The study of the strong concavity of the dual function of an optimization problem on some set hasapplications in numerical optimization. For instance, the strong concavity of the dual function and theknowledge of the associated constant of strong concavity are used by the Drift-Plus-Penalty algorithm in[19] and by the (convergence proof of) Inexact SMD algorithm presented in Section 4 when inexact cuts arecomputed using Proposition 3.8.The only paper we are aware of providing conditions ensuring this strong concavity property is [19]. Inthis section, we prove similar results under weaker assumptions and study an additional class of problems(quadratic with a quadratic constraint, see Proposition 2.11).2.1.
Preliminaries.
In what follows, X ⊂ R n is a nonempty convex set. Definition 2.1 (Strongly convex functions) . Function f : X → R ∪ { + ∞} is strongly convex with constantof strong convexity α > with respect to norm (cid:107) · (cid:107) if for every x, y ∈ dom(f ) we have f ( tx + (1 − t ) y ) ≤ tf ( x ) + (1 − t ) f ( y ) − αt (1 − t )2 (cid:107) y − x (cid:107) , for all ≤ t ≤ . We can deduce the following well known characterizations of strongly convex functions f : R n → R ∪{ + ∞} (see for instance [7]): Proposition 2.2. (i) Function f : X → R ∪ { + ∞} is strongly convex with constant of strong convexity α > with respect to norm (cid:107) · (cid:107) if and only if for every x, y ∈ dom(f ) we have f ( y ) ≥ f ( x ) + s T ( y − x ) + α (cid:107) y − x (cid:107) , ∀ s ∈ ∂f ( x ) . (ii) Function f : X → R ∪ { + ∞} is strongly convex with constant of strong convexity α > with respectto norm (cid:107) · (cid:107) if and only if for every x, y ∈ dom(f ) we have f ( y ) ≥ f ( x ) + f (cid:48) ( x ; y − x ) + α (cid:107) y − x (cid:107) , where f (cid:48) ( x ; y − x ) denotes the derivative of f at x in the direction y − x .(iii) Let f : X → R ∪ { + ∞} be differentiable. Then f is strongly convex with constant of strong convexity α > with respect to norm (cid:107) · (cid:107) if and only if for every x, y ∈ dom(f ) we have ( ∇ f ( y ) − ∇ f ( x )) T ( y − x ) ≥ α (cid:107) y − x (cid:107) . (iv) Let f : X → R ∪ { + ∞} be twice differentiable. Then f is strongly convex on X ⊂ R n with constantof strong convexity α > with respect to norm (cid:107) · (cid:107) if and only if for every x ∈ dom(f ) we have h T ∇ f ( x ) h ≥ α (cid:107) h (cid:107) , ∀ h ∈ R n . Definition 2.3 (Strongly concave functions) . f : X → R ∪ {−∞} is strongly concave with constant ofstrong concavity α > with respect to norm (cid:107) · (cid:107) if and only if − f is strongly convex with constant of strongconvexity α > with respect to norm (cid:107) · (cid:107) . The following propositions are immediate and will be used in the sequel:
Proposition 2.4. If f : X → R ∪ { + ∞} is strongly convex with constant of strong convexity α > withrespect to norm (cid:107) · (cid:107) and (cid:96) : R n → R is linear then f + (cid:96) is strongly convex on X with constant of strongconvexity α > with respect to norm (cid:107) · (cid:107) . Proposition 2.5.
Let X ⊂ R m , Y ⊂ R n , be two nonempty convex sets. Let A : X → Y be a linear operatorand let f : Y → R ∪ { + ∞} be a strongly convex function with constant of strong convexity α > with respectto a norm (cid:107) · (cid:107) n on R n induced by scalar product (cid:104)· , ·(cid:105) n on R n . Assume that Ker ( A ∗ ◦ A ) = { } . Then g = f ◦ A is strongly convex on X with constant of strong convexity αλ min ( A ∗ ◦ A ) with respect to norm (cid:107) · (cid:107) m . roof. For every x, y ∈ X , using Proposition 2.2-(ii) we have f ( A ( y )) ≥ f ( A ( x )) + f (cid:48) ( A ( x ); A ( y − x )) + α (cid:107)A ( y − x ) (cid:107) n and since g (cid:48) ( x ; y − x ) = f (cid:48) ( A ( x ); A ( y − x )), we get g ( y ) ≥ g ( x ) + g (cid:48) ( x ; y − x ) + 12 αλ min ( A ∗ ◦ A ) (cid:107) y − x (cid:107) m with αλ min ( A ∗ ◦ A ) > λ min ( A ∗ ◦ A ) is nonnegative because A ∗ ◦ A is self-adjoint and it cannot be zerobecause A ∗ ◦ A is nondegenerate). (cid:3) In the rest of this section, we fix (cid:107) · (cid:107) = (cid:107) · (cid:107) and provide, under some assumptions, the constant of strongconcavity of the dual function of an optimization problem for this norm. Problems with linear constraints.
Consider the optimization problem(2.3) (cid:26) inf f ( x ) Ax ≤ b where f : R n → R ∪ { + ∞} , b ∈ R q , and A is a q × n real matrix.We will use the following known fact, see for instance [16]: Proposition 2.6.
Let f : R n → R ∪ { + ∞} be a proper convex lower semicontinuous function. Then f ∗ isstrongly convex with constant of strong convexity α > for norm (cid:107) · (cid:107) if and only if f is differentiable and ∇ f is Lipschitz continuous with constant /α for norm (cid:107) · (cid:107) . Proposition 2.7.
Let θ be the dual function of (2.3) given by (2.4) θ ( λ ) = inf x ∈ R n { f ( x ) + λ T ( Ax − b ) } , for λ ∈ R q . Assume that the rows of matrix A are independent, that f is convex, differentiable, and ∇ f is Lipschitz continuous with constant L ≥ with respect to norm (cid:107) · (cid:107) . Then dual function θ is stronglyconcave on R q with constant of strong concavity λ min ( AA T ) L with respect to norm (cid:107) · (cid:107) on R q .Proof. The dual function of (2.3) can be written(2.5) θ ( λ ) = inf x ∈ R n { f ( x ) + λ T ( Ax − b ) } = − λ T b − sup x ∈ R n {− x T A T λ − f ( x ) } = − λ T b − f ∗ ( − A T λ ) by definition of f ∗ . Since the rows of A are independent, matrix AA T is invertible and Ker( AA T ) = { } . The result followsfrom the above representation of θ and Propositions 2.4, 2.5, and 2.6. (cid:3) The strong concavity of the dual function of (2.3) was shown in Corollary 5 in [19] assuming that f issecond-order continuously differentiable and strongly convex. Therefore Proposition 2.7 (whose proof is veryshort), which only assumes that f is convex, differentiable, and has Lipschitz continuous gradient, improvesexisting results (neither second-order differentiability nor strong convexity is required). Below, we discussseveral examples. In particular, in Examples 2.9 and 2.10, f may not be strongly convex. Example 2.8 (Linear programs) . Let f : R n → R given by (2.6) f ( x ) = c T x + c where c ∈ R n , c ∈ R . Clearly f is convex differentiable with Lipschitz continuous gradients; any L ≥ being a valid Lipschitz constant. Proposition (2.7) tells us that if the rows of A are independent then dualfunction θ of (2.3) given by (2.4) is strongly concave on R q . In this case, the strong concavity can be checkeddirectly computing θ . Indeed, we have f ∗ ( x ) = (cid:26) − c if x = c, + ∞ if x (cid:54) = c, Using the equivalence between norms in R n , we can derive a valid constant of strong concavity for other norms, for instance (cid:107) · (cid:107) ∞ and (cid:107) · (cid:107) . nd plugging this expression of f ∗ into (2.5) , we get θ ( λ ) = (cid:26) − λ T b + c if A T λ = − c, −∞ if A T λ (cid:54) = − c. Therefore if c ∈ Im ( A T ) then there is λ ∈ R q such that (2.7) A T λ = − c, and if the rows of A are independent then there is only one λ , let us call it λ , satisfying (2.7) . In thissituation, the domain of θ is a singleton: dom ( θ ) = { λ } , and θ indeed is strongly convex (see Definition2.1). If c / ∈ Im ( A T ) then dom ( θ ) = ∅ and θ is again strongly convex. The example which follows gives a class of problems where the dual function is strongly concave on R q : Example 2.9 (Quadratic convex programs) . Consider a problem of form (2.3) where f ( x ) = x T Q x + a T x + b , Q is an n × n nonnull positive semidefinite matrix, A is a q × n real matrix, a ∈ Im ( Q ) , and b ∈ R . Clearly, f is convex, differentiable, and ∇ f is Lipschitz continuous with Lipschitz constant L = (cid:107) Q (cid:107) = λ max ( Q ) > with respect to (cid:107) · (cid:107) on R n . If the rows of A are independent, using Proposition 2.7we obtain that the dual function of (2.3) is strongly concave with constant of strong concavity λ min ( AA T ) λ max ( Q ) > with respect to norm (cid:107) · (cid:107) on R q . Observe that strong concavity holds in particular if Q is not positivedefinite, in which case f is not strongly convex.Since f is convex, differentiable, its gradient being Lipschitz continuous with Lipschitz constant λ max ( Q ) ,from Proposition 2.6, we know that f ∗ is strongly convex with constant of strong convexity /λ max ( Q ) .This can be checked by direct computation. Indeed, let λ max ( Q ) = λ ( Q ) ≥ λ ( Q ) ≥ . . . ≥ λ r ( Q ) >λ r +1 ( Q ) = λ r +2 ( Q ) = . . . = λ n ( Q ) = 0 be the ordered eigenvalues of Q where r is the rank of Q . Let P be a corresponding orthogonal matrix of eigenvectors for Q , i.e., Diag( λ ( Q ) , . . . , λ n ( Q )) = P T Q P with P P T = P T P = I n . Defining Q +0 = P Diag (cid:16) λ ( Q ) , . . . , λ r ( Q ) , , . . . , (cid:124) (cid:123)(cid:122) (cid:125) n-r times (cid:17) P T , it is straightforward to check that (2.8) f ∗ ( x ) = (cid:26) − b + ( x − a ) T Q +0 ( x − a ) if x ∈ Im ( Q ) , + ∞ otherwise , and plugging expression (2.8) of f ∗ into (2.5) , we get θ ( λ ) = (cid:26) b − λ T b − ( a + A T λ ) T Q +0 ( a + A T λ ) if A T λ ∈ Im ( Q ) , −∞ otherwise.If x (cid:48) = ( x (cid:48) , . . . , x (cid:48) n ) is the vector of the coordinates of x in the basis ( v , v , . . . , v n ) where v i is i th columnof P = [ v , v , . . . , v n ] (i.e., ( v , . . . , v r ) is a basis of Im( Q ) and ( v r +1 , . . . , v n ) is a basis of Ker( Q )) andwriting a = (cid:80) ri =1 a (cid:48) i v i , we obtain f ∗ ( x ) = (cid:40) g ( P T x ) where g : R n → R is given by g ( x (cid:48) ) = − b + (cid:80) ri =1 ( x (cid:48) i − a (cid:48) i ) λ i ( Q ) if x ∈ Im ( Q ) , + ∞ otherwise.Observe that for x (cid:48) , y (cid:48) ∈ R r ×{ (0 , . . . , (cid:124) (cid:123)(cid:122) (cid:125) n-r times } we have g ( y (cid:48) ) ≥ g ( x (cid:48) ) + ∇ g ( x (cid:48) ) T ( y (cid:48) − x (cid:48) ) + 12 λ ( Q ) (cid:107) y (cid:48) − x (cid:48) (cid:107) and g is strongly convex with constant of strong convexity λ ( Q ) with respect to norm (cid:107)·(cid:107) on R r ×{ (0 , . . . , (cid:124) (cid:123)(cid:122) (cid:125) n-r times } .Recalling that f ∗ ( x ) = g ( P T x ) for x ∈ dom ( f ∗ ) = Im ( Q ) , that P T x ∈ R r ×{ (0 , . . . , (cid:124) (cid:123)(cid:122) (cid:125) n-r times } for x ∈ Im ( Q ) , and In this simple case, the dual function is well known and can also be obtained without using the conjugate of f sing Proposition 2.5, we get that f ∗ is strongly convex with constant of strong convexity λ min ( P P T ) λ ( Q ) = λ min ( I n ) λ max ( Q ) = 1 λ max ( Q ) with respect to norm (cid:107) · (cid:107) . Example 2.10.
Let f ( x ) = (cid:80) Mk =1 α k f k ( x ) for α k ∈ R and f k : R n → R convex differentiable with Lipschitzconstant L k ≥ with respect to norm (cid:107) · (cid:107) on R n for k = 1 , . . . , M . Let A be a q × n matrix withindependent rows. Then dual function (2.4) of (2.3) is strongly concave on R q with constant of strongconcavity λ min ( AA T ) / (cid:80) Mk =1 α k L k with respect to (cid:107) · (cid:107) . Problems with quadratic objective and a quadratic constraint.
We now consider the followingquadratically constrained quadratic optimization problem(2.9) (cid:26) inf x ∈ R n f ( x ) := x T Q x + a T x + b g i ( x ) := x T Q i x + a Ti x + b i ≤ , i = 1 , . . . , p, with Q positive definite and Q i , i = 1 , . . . , p , positive semidefinite. The dual function θ of this problem isknown in closed-form: for µ ∈ R p , µ ≥
0, we have(2.10) θ ( µ ) = inf x ∈ R n { f ( x ) + p (cid:88) i =1 µ i g i ( x ) } = − A ( µ ) T Q ( µ ) − A ( µ ) + B ( µ )where A ( µ ) = a + p (cid:88) i =1 µ i a i , Q ( µ ) = Q + p (cid:88) i =1 µ i Q i , and B ( µ ) = b + p (cid:88) i =1 µ i b i . When there is only one quadratic constraint in (2.9), we can show, under some assumptions, that dualfunction θ is strongly concave on some set and compute analytically the corresponding constant of strongconcavity: Proposition 2.11.
Consider optimization problem (2.9) with p = 1 quadratic constraint. Assume that Q , Q , are positive definite, that there exists x such that g ( x ) < , and that a (cid:54) = Q Q − a . Let L be anylower bound on the optimal value of (2.9) and let ¯ µ = ( L − f ( x )) /g ( x ) ≥ . Then the optimal solution ofthe dual problem max µ ≥ θ ( µ ) is contained in the interval [0 , ¯ µ ] and the dual function θ given by (2.10) is strongly concave on the interval [0 , ¯ µ ] with constant of strong concavity α D = ( Q − / ( a − Q Q − a )) T ( Q − / Q Q − / + ¯ µI n ) − Q − / ( a − Q Q − a ) > .Proof. Making the change of variable x = y − Q − a , we can rewrite (2.9) without linear terms in g underthe form: (cid:26) inf x ∈ R n x T Q x + ( a − Q Q − a ) T x + b + a T Q − Q Q − a − a T Q − a x T Q x + b − a T Q − a ≤ , with corresponding dual function given by θ ( µ ) = −
12 ¯ a T ( Q + µQ ) − ¯ a + ( b − a T Q − a ) µ + b − a T Q − a + 12 a T Q − Q Q − a where we have set ¯ a = a − Q Q − a (see (2.10)).Using [7, Remark 2.3.3, p.313] we obtain that the optimal dual solutions are contained in the interval[0 , ¯ µ ]. Setting ˜ a = Q − / ¯ a and A = Q − / Q Q − / , we compute the first and second derivatives of thenonlinear term θ q ( µ ) = − ¯ a T ( Q + µQ ) − ¯ a = − ˜ a T ( A + µI n ) − ˜ a of θ on [0 , ¯ µ ]: θ (cid:48) q ( µ ) = ˜ a T ( A + µI n ) − ˜ a and θ (cid:48)(cid:48) q ( µ ) = − ˜ a T ( A + µI n ) − ˜ a . For these computations we have used the fact that for F : I → GL n ( R ) differentiable on I ⊂ R , we have d F ( t ) − dt = −F ( t ) − d F ( t ) dt F ( t ) − . Since − θ (cid:48)(cid:48) q ( µ ) is decreasing on [0 , ¯ µ ], we get − θ (cid:48)(cid:48) q ( µ ) ≥ α D = − θ (cid:48)(cid:48) q (¯ µ ) on[0 , ¯ µ ]. This computation, together with Proposition 2.2-(iv), shows that θ is strongly concave on [0 , ¯ µ ] withconstant of strong concavity α D . (cid:3) .4. General case: problems with linear and nonlinear constraints.
Let us add to problem (2.3)nonlinear constraints. More precisely, given f : R n → R , a q × n real matrix A , b ∈ R q , and g : R n → R p with convex component functions g i , i = 1 , . . . , p , we consider the optimization problem(2.11) (cid:26) inf f ( x ) x ∈ X, Ax ≤ b, g ( x ) ≤ . Let v be the value function of this problem given by(2.12) v ( c ) = v ( c , c ) = (cid:26) inf f ( x ) x ∈ X, Ax − b + c ≤ , g ( x ) + c ≤ , for c ∈ R q , c ∈ R p . In the next lemma, we relate the conjugate of v to the dual function θ ( λ, µ ) = (cid:26) inf f ( x ) + λ T ( Ax − b ) + µ T g ( x ) x ∈ X, of this problem: Lemma 2.12. If v ∗ is the conjugate of the value function v then v ∗ ( λ, µ ) = − θ ( λ, µ ) for every ( λ, µ ) ∈ R q + × R p + .Proof. For ( λ, µ ) ∈ R q + × R p + , we have − v ∗ ( λ, µ ) = − sup ( c ,c ) ∈ R q × R p λ T c + µ T c − v ( c , c )= inf − λ T c − µ T c + f ( x ) x ∈ X, Ax − b + c ≤ , g ( x ) + c ≤ ,c ∈ R q , c ∈ R p , = (cid:26) inf f ( x ) + λ T ( Ax − b ) + µ T g ( x ) x ∈ X, = θ ( λ, µ ) . (cid:3) From Lemma 2.12 and Proposition 2.6, we obtain that dual function θ of problem (2.11) is stronglyconcave with constant α with respect to norm (cid:107) · (cid:107) on R p + q if and only if the value function v given by(2.12) is differentiable and ∇ v is Lipschitz continuous with constant 1 /α with respect to norm (cid:107) · (cid:107) on R p + q .Using Lemma 2.1 in [4] the subdifferential of the value function is the set of optimal dual solutions of (2.12).Therefore θ is strongly concave with constant α with respect to norm (cid:107) · (cid:107) on R p + q if and only if the valuefunction is differentiable and the dual solution of (2.12) seen as a function of ( c , c ) is Lipschitz continuouswith Lipschitz constant 1 /α with respect to norm (cid:107) · (cid:107) on R p + q .We now provide conditions ensuring that the dual function is strongly concave in a neighborhood of theoptimal dual solution. Theorem 2.13.
Consider the optimization problem (2.13) inf x ∈ R n { f ( x ) : Ax ≤ b, g i ( x ) ≤ , i = 1 , . . . , p } . We assume that (A1) f : R n → R ∪ { + ∞} is strongly convex and has Lipschitz continuous gradient; (A2) g i : R n → R ∪ { + ∞} , i = 1 , . . . , p , are convex and have Lipschitz continuous gradients; (A3) if x ∗ is the optimal solution of (2.13) then the rows of matrix (cid:18) AJ g ( x ∗ ) (cid:19) are linearly independentwhere J g ( x ) denotes the Jacobian matrix of g ( x ) = ( g ( x ) , . . . , g p ( x )) at x ; (A4) there is x ∈ ri ( { g ≤ } ) such that Ax ≤ b .Let θ be the dual function of this problem: (2.14) θ ( λ, µ ) = (cid:26) inf f ( x ) + λ T ( Ax − b ) + µ T g ( x ) x ∈ R n . et ( λ ∗ , µ ∗ ) ≥ be an optimal solution of the dual problem sup λ ≥ ,µ ≥ θ ( λ, µ ) . Then there is some neighborhood N of ( λ ∗ , µ ∗ ) such that θ is strongly concave on N ∩ R p + q + .Proof. Due to (A1) the optimization problem (2.14) has a unique optimal solution that we denote by x ( λ, µ ).Assumptions (A2) and (A3) imply that there is some neighborhood V ε ( x ∗ ) = { x ∈ R n : (cid:107) x − x ∗ (cid:107) ≤ ε } of x ∗ for some ε > (cid:18) AJ g ( x ) (cid:19) are independent for x in V ε ( x ∗ ).We argue that ( λ, µ ) → x ( λ, µ ) is continuous on R q × R p . Indeed, let (¯ λ, ¯ µ ) ∈ R q × R p and take a sequence( λ k , µ k ) converging to (¯ λ, ¯ µ ). We want to show that x ( λ k , µ k ) converges to x (¯ λ, ¯ µ ). Take an arbitraryaccumulation point ¯ x of the sequence x ( λ k , µ k ), i.e., ¯ x = lim k → + ∞ x ( λ σ ( k ) , µ σ ( k ) ) for some subsequence x ( λ σ ( k ) , µ σ ( k ) ) of x ( λ k , µ k ). Then by definition of x ( λ σ ( k ) , µ σ ( k ) ), for every x ∈ R n and every k ≥ f ( x ( λ σ ( k ) , µ σ ( k ) )) + λ Tσ ( k ) ( Ax ( λ σ ( k ) , µ σ ( k ) ) − b ) + µ Tσ ( k ) g ( x ( λ σ ( k ) , µ σ ( k ) )) ≤ f ( x ) + λ Tσ ( k ) ( Ax − b ) + µ Tσ ( k ) g ( x ) . Passing to the limit in the inequality above and using the continuity of f and g i we obtain for all x ∈ R n : f (¯ x ) + ¯ λ T ( A ¯ x − b ) + ¯ µ T g (¯ x ) ≤ f ( x ) + ¯ λ T ( Ax − b ) + ¯ µ T g ( x ) , which shows that ¯ x = x (¯ λ, ¯ µ ). Therefore there is only one accumuation point ¯ x = x (¯ λ, ¯ µ ) for the sequence x ( λ k , µ k ) which shows that this sequence converges to x (¯ λ, ¯ µ ). Consequently, we have shown that ( λ, µ ) → x ( λ, µ ) is continuous on R q × R p . This implies that there is a neighborhood N ( λ ∗ , µ ∗ ) of ( λ ∗ , µ ∗ ) such thatfor ( λ, µ ) ∈ N ( λ ∗ , µ ∗ ) we have (cid:107) x ( λ, µ ) − x ( λ ∗ , µ ∗ ) (cid:107) ≤ ε . Moreover, due to (A4), we have x ( λ ∗ , µ ∗ ) = x ∗ .It follows that for ( λ, µ ) ∈ N ( λ ∗ , µ ∗ ) we have (cid:107) x ( λ, µ ) − x ( λ ∗ , µ ∗ ) (cid:107) = (cid:107) x ( λ, µ ) − x ∗ (cid:107) ≤ ε which in turnimplies that the rows of matrix (cid:18) AJ g ( x ( λ, µ )) (cid:19) are independent. We now show that θ is strongly concaveon N ( λ ∗ , µ ∗ ) ∩ R p + q + .Take ( λ , µ ), ( λ , µ ) in N ( λ ∗ , µ ∗ ) ∩ R p + q + and denote x = x ( λ , µ ) and x = x ( λ , µ ). The optimalityconditions give(2.15) ∇ f ( x ) + A T λ + J g ( x ) T µ = 0 , ∇ f ( x ) + A T λ + J g ( x ) T µ = 0 . Recall that (2.14) has a unique solution and therefore θ is differentiable. The gradient of θ is given by (seefor instance Lemma 2.1 in [4]) ∇ θ ( λ, µ ) = (cid:18) Ax ( λ, µ ) − bg ( x ( λ, µ )) (cid:19) and we obtain, using the notation (cid:104) x, y (cid:105) = x T y :(2.16) − (cid:28) ∇ θ ( λ , µ ) − ∇ θ ( λ , µ ) , (cid:18) λ − λ µ − µ (cid:19)(cid:29) = −(cid:104) A ( x − x ) , λ − λ (cid:105) − (cid:104) g ( x ) − g ( x ) , µ − µ (cid:105) . By convexity of constraint functions we can write for i = 1 , . . . , p :(2.17) g i ( x ) ≥ g i ( x ) + (cid:104)∇ g i ( x ) , x − x (cid:105) ( a ) g i ( x ) ≥ g i ( x ) + (cid:104)∇ g i ( x ) , x − x (cid:105) . ( b )Multiplying (2.17)-(a) by µ ( i ) ≥ µ ( i ) ≥ − (cid:104) g ( x ) − g ( x ) , µ − µ (cid:105) ≥ (cid:104) J g ( x ) T µ − J g ( x ) T µ , x − x (cid:105) . Recalling (A1), we can find 0 ≤ L ( f ) < + ∞ such that for all x, y ∈ R n :(2.19) (cid:107)∇ f ( y ) − ∇ f ( x ) (cid:107) ≤ L ( f ) (cid:107) y − x (cid:107) . sing (2.16) and (2.18) and denoting by α > f with respect to norm (cid:107) · (cid:107) we get: (2.20) − (cid:28) ∇ θ ( λ , µ ) − ∇ θ ( λ , µ ) , (cid:18) λ − λ µ − µ (cid:19)(cid:29) ≥ −(cid:104) x − x , A T ( λ − λ ) (cid:105) + (cid:104) J g ( x ) T µ − J g ( x ) T µ , x − x (cid:105) , (2.15) = (cid:104) x − x , ∇ f ( x ) − ∇ f ( x ) (cid:105)≥ α (cid:107) x − x (cid:107) by strong convexity of f, ≥ αL ( f ) (cid:107)∇ f ( x ) − ∇ f ( x ) (cid:107) using (2.19) , (2.15) = αL ( f ) (cid:107) (cid:16) A T J g ( x ) T (cid:17) (cid:18) λ − λ µ − µ (cid:19)(cid:124) (cid:123)(cid:122) (cid:125) a + ( J g ( x ) − J g ( x )) T µ (cid:124) (cid:123)(cid:122) (cid:125) b (cid:107) . Now recall that for every x ∈ V ε ( x ∗ ) the rows of the matrix (cid:18) AJ g ( x ) (cid:19) are independent and therefore thematrix (cid:18) AJ g ( x ) (cid:19) (cid:18) AJ g ( x ) (cid:19) T is invertible. Moreover, the function x → λ min (cid:32)(cid:18) AJ g ( x ) (cid:19) (cid:18) AJ g ( x ) (cid:19) T (cid:33) is continuous (due to (A2)) and positive on the compact set V ε ( x ∗ ). It follows that we can define λ ε ( x ∗ ) = min x ∈V ε ( x ∗ ) λ min (cid:32)(cid:18) AJ g ( x ) (cid:19) (cid:18) AJ g ( x ) (cid:19) T (cid:33) , and λ ε ( x ∗ ) >
0. Since x ∈ V ε ( x ∗ ), we deduce that(2.21) (cid:107) a (cid:107) ≥ (cid:112) λ ε ( x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) λ − λ µ − µ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Recalling that ( λ , µ ) is in N ( λ ∗ , µ ∗ ), we obtain for (cid:107) µ (cid:107) the bound(2.22) (cid:107) µ (cid:107) ≤ U ε ( x ∗ ) := (cid:107) µ ∗ (cid:107) + εαL . Due to (A2), there is L ( g ) ≥ x, y ∈ R n , we have (cid:107)∇ g i ( y ) − ∇ g i ( x ) (cid:107) ≤ L ( g ) (cid:107) y − x (cid:107) , , . . . , p. Combining this relation with (2.22), we get(2.23) (cid:107) b (cid:107) ≤ (cid:107) µ (cid:107) L ( g ) (cid:107) x − x (cid:107) ≤ L ( g ) U ε ( x ∗ ) (cid:107) x − x (cid:107) . Therefore (cid:107) a + b (cid:107) ≥ (cid:107) a (cid:107) − (cid:107) b (cid:107) ≥ (cid:112) λ ε ( x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) λ − λ µ − µ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − L ( g ) U ε ( x ∗ ) (cid:107) x − x (cid:107) and combining this relation with (2.20) we obtain (cid:107) x − x (cid:107) ≥ L ( f ) (cid:20)(cid:112) λ ε ( x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) λ − λ µ − µ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − L ( g ) U ε ( x ∗ ) (cid:107) x − x (cid:107) (cid:21) which gives(2.24) (cid:107) x − x (cid:107) ≥ (cid:112) λ ε ( x ∗ ) L ( f ) + L ( g ) U ε ( x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) λ − λ µ − µ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Plugging (2.24) into (2.20) we get − (cid:28) ∇ θ ( λ , µ ) − ∇ θ ( λ , µ ) , (cid:18) λ − λ µ − µ (cid:19)(cid:29) ≥ αλ ε ( x ∗ )( L ( f )+ L ( g ) U ε ( x ∗ )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) λ − λ µ − µ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Using Proposition 2.2-(iii), the relation above shows that θ is strongly concave on N ( λ ∗ , µ ∗ ) ∩ R p + q + withconstant of strong concavity αλ ε ( x ∗ )( L ( f )+ L ( g ) U ε ( x ∗ )) with respect to norm (cid:107) · (cid:107) . (cid:3) he local strong concavity of the dual function of (2.13) was shown recently in Theorem 10 in [19] assum-ing (A3), assuming instead of (A1) that f is strongly convex and second-order continuously differentiable(which is stronger than (A1)), and assuming instead of (A2) that g i , i = 1 , . . . , p , are convex second-ordercontinuously differentiable, which is stronger than (A2). Therefore Theorem 2.13 gives a new proof of thelocal strong concavity of the dual function and improves existing results.3.
Computing inexact cuts for value functions of convex optimization problems
Preliminaries.
Let Q : X → R ∪ { + ∞} be the value function given by(3.25) Q ( x ) = (cid:26) inf y ∈ R n f ( y, x ) y ∈ S ( x ) := { y ∈ Y : Ay + Bx = b, g ( y, x ) ≤ } . Here, and in all this section, X ⊆ R m and Y ⊆ R n are nonempty, compact, and convex sets, and A and B are respectively q × n and q × m real matrices. We will make the following assumptions: (H1) f : R n × R m → R ∪ { + ∞} is lower semicontinuous, proper, and convex.(H2) For i = 1 , . . . , p , the i -th component of function g ( y, x ) is a convex lower semicontinuous function g i : R n × R m → R ∪ { + ∞} .In what follows, we say that C is a cut for Q on X if C is an affine function of x such that Q ( x ) ≥ C ( x ) forall x ∈ X . We say that the cut is exact at ¯ x if Q (¯ x ) = C (¯ x ). Otherwise, the cut is said to be inexact at ¯ x .In this section, our basic goal is, given ¯ x ∈ X and ε -optimal primal and dual solutions of (3.25) writtenfor x = ¯ x , to derive an inexact cut C ( x ) for Q at ¯ x , i.e., an affine lower bounding function for Q such thatthe distance Q (¯ x ) − C (¯ x ) between the values of Q and of the cut at ¯ x is bounded from above by a knownfunction of the problem parameters. Of course, when ε = 0, we will check that Q (¯ x ) = C (¯ x ).We first provide in Proposition 3.2 below a characterization of the subdifferential of value function Q at¯ x ∈ X when optimal primal and dual solutions for (3.25) written for x = ¯ x are available (computation ofexact cuts).Consider for problem (3.25) the Lagrangian dual problem(3.26) sup ( λ,µ ) ∈ R q × R p + θ x ( λ, µ )for the dual function(3.27) θ x ( λ, µ ) = inf y ∈ Y L x ( y, λ, µ )where L x ( y, λ, µ ) = f ( y, x ) + λ T ( Ay + Bx − b ) + µ T g ( y, x ) . We denote by Λ( x ) the set of optimal solutions of the dual problem (3.26) and we use the notationSol( x ) := { y ∈ S ( x ) : f ( y, x ) = Q ( x ) } to indicate the solution set to (3.25). Lemma 3.1 (Lemma 2.1 in [4]) . Consider the value function Q given by (3.25) and take ¯ x ∈ X such that S (¯ x ) (cid:54) = ∅ . Let Assumptions (H1) and (H2) hold and assume the Slater-type constraint qualification condition: there exists ( x ∗ , y ∗ ) ∈ X × ri( Y ) such that Ay ∗ + Bx ∗ = b and ( y ∗ , x ∗ ) ∈ ri( { g ≤ } ) . Then s ∈ ∂ Q (¯ x ) if and only if (3.28) (0 , s ) ∈ ∂f (¯ y, ¯ x ) + (cid:110) [ A T ; B T ] λ : λ ∈ R q (cid:111) + (cid:110) (cid:88) i ∈ I (¯ y, ¯ x ) µ i ∂g i (¯ y, ¯ x ) : µ i ≥ (cid:111) + N Y (¯ y ) ×{ } , Note that we used (A4) to ensure that x ( λ ∗ , µ ∗ ) = x ∗ , which is also used in the proof of Theorem 10 in [19]. Note that (H1) and (H2) imply the convexity of Q given by (3.25). Indeed, let x , x ∈ X , 0 ≤ t ≤
1, and y ∈ S ( x ) , y ∈ S ( x ) , such that Q ( x ) = f ( y , x ) and Q ( x ) = f ( y , x ). By convexity of g and Y , we have that have ty +(1 − t ) y ∈ S ( tx +(1 − t ) x ) and therefore Q ( tx +(1 − t ) x ) ≤ f ( ty +(1 − t ) y , tx +(1 − t ) x ) ≤ tf ( y , x )+(1 − t ) f ( y , x ) = t Q ( x )+(1 − t ) Q ( x )where for the last inequality we have used the convexity of f . here ¯ y is any element in the solution set Sol( ¯ x ) and with I (¯ y, ¯ x ) = (cid:110) i ∈ { , . . . , p } : g i (¯ y, ¯ x ) = 0 (cid:111) . In particular, if f and g are differentiable, then (3.29) ∂ Q (¯ x ) = (cid:110) ∇ x f (¯ y, ¯ x ) + B T λ + (cid:88) i ∈ I (¯ y, ¯ x ) µ i ∇ x g i (¯ y, ¯ x ) : ( λ, µ ) ∈ Λ(¯ x ) (cid:111) . The proof of Lemma 3.1 is given in [4] using calculus on normal and tangeant cones. In Proposition3.2 below, we show how to obtain an inexact cut for Q at ¯ x ∈ X using convex duality when f and g aredifferentiable. Proposition 3.2.
Consider the value function Q given by (3.25) and take ¯ x ∈ X such that S (¯ x ) (cid:54) = ∅ . LetAssumptions (H1) and (H2) hold and assume the following constraint qualification condition: there exists y ∈ ri ( Y ) ∩ ri ( { g ( · , ¯ x ) ≤ } ) such that Ay + B ¯ x = b . Assume that f and g are differentiable on Y × X .Let (¯ λ, ¯ µ ) be an optimal solution of dual problem (3.26) written with x = ¯ x and let (3.30) s (¯ x ) = ∇ x f (¯ y, ¯ x ) + B T ¯ λ + (cid:88) i ∈ I (¯ y, ¯ x ) ¯ µ i ∇ x g i (¯ y, ¯ x ) , where ¯ y is any element in the solution set Sol( ¯ x ) and with I (¯ y, ¯ x ) = (cid:110) i ∈ { , . . . , p } : g i (¯ y, ¯ x ) = 0 (cid:111) . Then s (¯ x ) ∈ ∂ Q (¯ x ) .Proof. The constraint qualification condition implies that there is no duality gap and therefore(3.31) f (¯ y, ¯ x ) = Q (¯ x ) = θ ¯ x (¯ λ, ¯ µ ) . Moreover, ¯ y is an optimal solution of inf { L ¯ x ( y, ¯ λ, ¯ µ ) : y ∈ Y } which gives (cid:104)∇ y L ¯ x (¯ y, ¯ λ, ¯ µ ) , y − ¯ y (cid:105) ≥ ∀ y ∈ Y, and therefore(3.32) min y ∈ Y (cid:104)∇ y L ¯ x (¯ y, ¯ λ, ¯ µ ) , y − ¯ y (cid:105) = 0 . Using the convexity of the function which associates to ( x, y ) the value L x ( y, ¯ λ, ¯ µ ) we obtain for every x ∈ X and y ∈ Y that(3.33) L x ( y, ¯ λ, ¯ µ ) ≥ L ¯ x (¯ y, ¯ λ, ¯ µ ) + (cid:104)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) , x − ¯ x (cid:105) + (cid:104)∇ y L ¯ x (¯ y, ¯ λ, ¯ µ ) , y − ¯ y (cid:105) . By definition of θ x , for any x ∈ X we get Q ( x ) ≥ θ x (¯ λ, ¯ µ )which combined with (3.33) gives Q ( x ) ≥ L ¯ x (¯ y, ¯ λ, ¯ µ ) + (cid:104)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) , x − ¯ x (cid:105) + min y ∈ Y (cid:104)∇ y L ¯ x (¯ y, ¯ λ, ¯ µ ) , y − ¯ y (cid:105) (3.32) = L ¯ x (¯ y, ¯ λ, ¯ µ ) + (cid:104)∇ x f (¯ y, ¯ x ) + B T ¯ λ + p (cid:88) i =1 ¯ µ i ∇ x g i (¯ y, ¯ x ) , x − ¯ x (cid:105) , = Q (¯ x ) + (cid:104) s (¯ x ) , x − ¯ x (cid:105) where the last equality follows from (3.31), A ¯ y + B ¯ x = b (feasibility of ¯ y ), (cid:104) ¯ µ, g (¯ y, ¯ x ) (cid:105) = 0, and ¯ µ i = 0 if i / ∈ I (¯ y, ¯ x )(complementary slackness for ¯ y ). (cid:3) .2. Inexact cuts with fixed feasible set.
As a special case of (3.25), we first consider value functionswhere the argument only appears in the objective of optimization problem (3.25):(3.34) Q ( x ) = (cid:26) inf y ∈ R n f ( y, x ) y ∈ Y. We fix ¯ x ∈ X and denote by ¯ y ∈ Y an optimal solution of (3.34) written for x = ¯ x :(3.35) Q (¯ x ) = f (¯ y, ¯ x ) . If f is differentiable, using Proposition 3.2, we have that ∇ x f (¯ y, ¯ x ) ∈ ∂ Q (¯ x ) and C ( x ) := Q (¯ x ) + (cid:104)∇ x f (¯ y, ¯ x ) , x − ¯ x (cid:105) is an exact cut for Q at ¯ x . If instead of an optimal solution ¯ y of (3.34) we only have at hand an approximate ε -optimal solution ˆ y ( ε ) Proposition 3.3 below gives an inexact cut for Q at ¯ x : Proposition 3.3 (Proposition 2.2 in [6]) . Let ¯ x ∈ X and let ˆ y ( ε ) ∈ Y be an (cid:15) -optimal solution for problem (3.34) written for x = ¯ x with optimal value Q (¯ x ) , i.e., Q (¯ x ) ≥ f (ˆ y ( ε ) , ¯ x ) − ε . Assume that f is convex anddifferentiable on Y × X . Then setting η ( ε, ¯ x ) = (cid:96) (ˆ y ( ε ) , ¯ x ) where (cid:96) : Y × X → R + is the function given by (3.36) (cid:96) (ˆ y, ¯ x ) = − min y ∈ Y (cid:104)∇ y f (ˆ y, ¯ x ) , y − ˆ y (cid:105) = max y ∈ Y (cid:104)∇ y f (ˆ y, ¯ x ) , ˆ y − y (cid:105) , the affine function (3.37) C ( x ) := f (ˆ y ( ε ) , ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x f (ˆ y ( ε ) , ¯ x ) , x − ¯ x (cid:105) is a cut for Q at ¯ x , i.e., for every x ∈ X we have Q ( x ) ≥ C ( x ) and the quantity η ( ε, ¯ x ) is an upper boundfor the distance Q (¯ x ) − C (¯ x ) between the values of Q and of the cut at ¯ x . Remark 3.4. If ε = 0 then ˆ y ( ε ) is an optimal solution of problem (3.34) written for x = ¯ x , η ( ε, ¯ x ) = (cid:96) (ˆ y ( ε ) , ¯ x ) = 0 and the cut given by Proposition 3.3 is exact. Otherwise it is inexact. In Proposition 3.5 below, we derive inexact cuts with an additional assumption of strong convexity on f :(H3) f is convex and differentiable on Y × X and for every x ∈ X there exists α ( x ) > f ( · , x ) is strongly convex on Y with constant of strong convexity α ( x ) > (cid:107) · (cid:107) : f ( y , x ) ≥ f ( y , x ) + ( y − y ) T ∇ y f ( y , x ) + α ( x )2 (cid:107) y − y (cid:107) , ∀ x ∈ X, ∀ y , y ∈ Y. We will also need the following assumption, used to control the error on the gradients of f :(H4) For every y ∈ Y the function f ( y, · ) is differentiable on X and for every x ∈ X there exists 0 ≤ M ( x ) < + ∞ such that for every y , y ∈ Y , we have (cid:107)∇ x f ( y , x ) − ∇ x f ( y , x ) (cid:107) ≤ M ( x ) (cid:107) y − y (cid:107) . Proposition 3.5.
Let ¯ x ∈ X and let ˆ y ( ε ) ∈ Y be an (cid:15) -optimal solution for problem (3.34) written for x = ¯ x with optimal value Q (¯ x ) , i.e., Q (¯ x ) ≥ f (ˆ y ( ε ) , ¯ x ) − ε . Let Assumptions (H3) and (H4) hold. Then setting (3.38) η ( ε, ¯ x ) = ε + M (¯ x )Diam( X ) (cid:115) εα (¯ x ) , the affine function (3.39) C ( x ) := f (ˆ y ( ε ) , ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x f (ˆ y ( ε ) , ¯ x ) , x − ¯ x (cid:105) is a cut for Q at ¯ x , i.e., for every x ∈ X we have Q ( x ) ≥ C ( x ) and the distance Q (¯ x ) − C (¯ x ) between thevalues of Q and of the cut at ¯ x is at most η ( ε, ¯ x ) , or, equivalently, ∇ x f (ˆ y, ¯ x ) ∈ ∂ η ( ε, ¯ x ) Q (¯ x ) .Proof. For short, we use the notation ˆ y instead of ˆ y ( ε ). Using the fact that ˆ y ∈ Y , the first order optimalityconditions for ¯ y imply (ˆ y − ¯ y ) T ∇ y f (¯ y, ¯ x ) ≥
0, which combined with Assumption (H3), gives f (ˆ y, ¯ x ) ≥ f (¯ y, ¯ x ) + (ˆ y − ¯ y ) T ∇ y f (¯ y, ¯ x ) + α (¯ x )2 (cid:107) ˆ y − ¯ y (cid:107) ≥ Q (¯ x ) + α (¯ x )2 (cid:107) ˆ y − ¯ y (cid:107) , ielding(3.40) (cid:107) ¯ y − ˆ y (cid:107) ≤ (cid:115) α (¯ x ) (cid:16) f (ˆ y, ¯ x ) − Q (¯ x ) (cid:17) ≤ (cid:115) εα (¯ x ) . Now recalling that ∇ x f (¯ y, ¯ x ) ∈ ∂ Q (¯ x ), we have for every x ∈ X ,(3.41) Q ( x ) ≥ Q (¯ x ) + ( x − ¯ x ) T ∇ x f (¯ y, ¯ x ) ≥ f (ˆ y, ¯ x ) − ε + ( x − ¯ x ) T ∇ x f (¯ y, ¯ x )= f (ˆ y, ¯ x ) − ε + ( x − ¯ x ) T ∇ x f (ˆ y, ¯ x ) + ( x − ¯ x ) T (cid:16) ∇ x f (¯ y, ¯ x ) − ∇ x f (ˆ y, ¯ x ) (cid:17) ≥ f (ˆ y, ¯ x ) − ε + ( x − ¯ x ) T ∇ x f (ˆ y, ¯ x ) − M (¯ x ) (cid:107) ˆ y − ¯ y (cid:107) (cid:107) x − ¯ x (cid:107) ≥ f (ˆ y, ¯ x ) − ε − M (¯ x )Diam( X ) (cid:113) εα (¯ x ) + ( x − ¯ x ) T ∇ x f (ˆ y, ¯ x ) , where for the third inequality we have used Cauchy-Schwartz inequality and Assumption (H4). Finally,observe that C (¯ x ) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) ≥ Q (¯ x ) − η ( ε, ¯ x ). (cid:3) Remark 3.6.
As expected, if ε = 0 then η ( ε, ¯ x ) = 0 and the cut given by Proposition 3.5 is exact. Otherwiseit is inexact. The error term η ( ε, ¯ x ) is the sum of the upper bound ε on the error on the optimal value andof the error term M (¯ x )Diam( X ) (cid:113) εα (¯ x ) which accounts for the error on the subgradients of Q . Inexact cuts with variable feasible set.
For x ∈ X , recall that for problem (3.25) the Lagrangianfunction is L x ( y, λ, µ ) = f ( y, x ) + λ T ( Bx + Ay − b ) + µ T g ( y, x ) , and the dual function is given by(3.42) θ x ( λ, µ ) = inf y ∈ Y L x ( y, λ, µ ) . Define (cid:96) : Y × X × R q × R p + → R + by(3.43) (cid:96) (ˆ y, ¯ x, ˆ λ, ˆ µ ) = − min y ∈ Y (cid:104)∇ y L ¯ x (ˆ y, ˆ λ, ˆ µ ) , y − ˆ y (cid:105) = max y ∈ Y (cid:104)∇ y L ¯ x (ˆ y, ˆ λ, ˆ µ ) , ˆ y − y (cid:105) . We make the following assumption which ensures no duality gap for (3.25) for any x ∈ X :(H5) if Y is polyhedral then for every x ∈ X there exists y x ∈ Y such that Bx + Ay x = b and g ( y x , x ) < Y is not polyhedral then for every x ∈ X there exists y x ∈ ri( Y ) such that Bx + Ay x = b and g ( y x , x ) < Q given by (3.25): Proposition 3.7. [Proposition 2.7 in [6] ] Let ¯ x ∈ X , let ˆ y ( (cid:15) ) be an (cid:15) -optimal feasible primal solution forproblem (3.25) written for x = ¯ x and let (ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) be an (cid:15) -optimal feasible solution of the correspondingdual problem, i.e., of problem (3.26) written for x = ¯ x . Let Assumptions (H1), (H2), and (H5) hold.If additionally f and g are differentiable on Y × X then setting η ( ε, ¯ x ) = (cid:96) (ˆ y ( (cid:15) ) , ¯ x, ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) , the affinefunction (3.44) C ( x ) := L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) − η ( ε, ¯ x ) + (cid:104)∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) , x − ¯ x (cid:105) with ∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) = ∇ x f (ˆ y ( (cid:15) ) , ¯ x ) + B T ˆ λ ( (cid:15) ) + p (cid:88) i =1 ˆ µ i ( (cid:15) ) ∇ x g i (ˆ y ( (cid:15) ) , ¯ x ) , is a cut for Q at ¯ x and the distance Q (¯ x ) − C (¯ x ) between the values of Q and of the cut at ¯ x is at most ε + (cid:96) (ˆ y ( (cid:15) ) , ¯ x, ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) . In Proposition 3.8 below, we derive another formula for inexact cuts with an additional assumption ofstrong convexity:(H6) Strong concavity of the dual function: for every x ∈ X there exists α D ( x ) > D x containingthe set of optimal solutions of dual problem (3.26) such that the dual function θ x is strongly concaveon D x with constant of strong concavity α D ( x ) with respect to (cid:107) · (cid:107) . e refer to Section 2 for conditions on the problem data ensuring Assumption (H6).If the constants α (¯ x ) and α D (¯ x ) in Assumptions (H3) and (H6) are sufficiently large and n is small thenthe cuts given by Proposition 3.8 are better than the cuts given by Proposition 3.7, i.e., Q (¯ x ) − C (¯ x ) issmaller. We refer to Section 3.4 for numerical tests comparing the cuts given by Propositions 3.7 and 3.8 onquadratic programs.To proceed, take an optimal primal solution ¯ y of problem (3.25) written for x = ¯ x and an optimal dualsolution (¯ λ, ¯ µ ) of the corresponding dual problem, i.e., problem (3.26) written for x = ¯ x .With this notation, using Proposition 3.2, we have that ∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) ∈ ∂ Q (¯ x ). Since we only haveapproximate primal and dual solutions, ˆ y ( ε ) and (ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) respectively, we will use the approximate sub-gradient ∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) instead of ∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ). To control the error on this subgradient, we assumedifferentiability of the constraint functions and that the gradients of these functions are Lipschitz continuous.More precisely, we assume:(H7) g is differentiable on Y × X and for every x ∈ X there exists 0 ≤ M ( x ) < + ∞ such that for all y , y ∈ Y , we have (cid:107)∇ x g i ( y , x ) − ∇ x g i ( y , x ) (cid:107) ≤ M ( x ) (cid:107) y − y (cid:107) , i = 1 , . . . , p. If Assumptions (H1)-(H7) hold, the following proposition provides an inexact cut for Q at ¯ x : Proposition 3.8.
Let ¯ x ∈ X , let ˆ y ( ε ) be an (cid:15) -optimal feasible primal solution for problem (3.25) written for x = ¯ x and let (ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) be an (cid:15) -optimal feasible solution of the corresponding dual problem, i.e., of problem (3.26) written for x = ¯ x . Let Assumptions (H1), (H2), (H3), (H4), (H5), (H6), and (H7) hold. Assumethat (ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) ∈ D ¯ x where D ¯ x is defined in (H6) and let (3.45) U = max i =1 ,...,p (cid:107)∇ x g i (ˆ y ( ε ) , ¯ x ) (cid:107) . Let also L ¯ x be any lower bound on Q (¯ x ) . Define (3.46) U ¯ x = f ( y ¯ x , ¯ x ) − L ¯ x min( − g i ( y ¯ x , ¯ x ) , i = 1 , . . . , p ) and η ( ε, ¯ x ) = ε + (cid:16) ( M (¯ x ) + M (¯ x ) U ¯ x ) (cid:115) α (¯ x ) + 2 max( (cid:107) B T (cid:107) , √ pU ) (cid:112) α D (¯ x ) (cid:17) Diam( X ) √ ε. Then C ( x ) := f (ˆ y ( ε ) , ¯ x ) − η ( (cid:15), ¯ x ) + (cid:104)∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) , x − ¯ x (cid:105) where ∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) = ∇ x f (ˆ y ( (cid:15) ) , ¯ x ) + B T ˆ λ ( (cid:15) ) + p (cid:88) i =1 ˆ µ i ( (cid:15) ) ∇ x g i (ˆ y ( (cid:15) ) , ¯ x ) , is a cut for Q at ¯ x and the distance Q (¯ x ) − C (¯ x ) between the values of Q and of the cut at ¯ x is at most η ( (cid:15), ¯ x ) .Proof. For short, we use the notation ˆ y, ˆ λ, ˆ µ instead of ˆ y ( ε ) , ˆ λ ( ε ) , ˆ µ ( ε ). Since ∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) ∈ ∂ Q (¯ x ), wehave(3.47) Q ( x ) ≥ Q (¯ x ) + (cid:104)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) , x − ¯ x (cid:105) ≥ f (ˆ y, ¯ x ) − ε + (cid:104)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) , x − ¯ x (cid:105) . Next observe that (cid:107)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) − ∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) (cid:107) ≤ M (¯ x ) (cid:107) ¯ y − ˆ y (cid:107) + (cid:107) B T (cid:107)(cid:107) ¯ λ − ˆ λ (cid:107) + (cid:107) p (cid:88) i =1 ¯ µ ( i ) (cid:16) ∇ x g i (¯ y, ¯ x ) − ∇ x g i (ˆ y, ¯ x ) (cid:17) (cid:107) + (cid:107) p (cid:88) i =1 (cid:16) ¯ µ ( i ) − ˆ µ ( i ) (cid:17) ∇ x g i (ˆ y, ¯ x ) (cid:107)≤ M (¯ x ) (cid:107) ¯ y − ˆ y (cid:107) + (cid:107) B T (cid:107)(cid:107) ¯ λ − ˆ λ (cid:107) + M (¯ x ) (cid:107) ¯ µ (cid:107) (cid:107) ¯ y − ˆ y (cid:107) + U √ p (cid:107) ¯ µ − ˆ µ (cid:107)≤ ( M (¯ x ) + M (¯ x ) (cid:107) ¯ µ (cid:107) ) (cid:107) ¯ y − ˆ y (cid:107) + √ (cid:107) B T (cid:107) , U √ p ) (cid:113) (cid:107) ˆ λ − ¯ λ (cid:107) + (cid:107) ˆ µ − ¯ µ (cid:107) . (3.48) sing Remark 2.3.3, p.313 in [7] and Assumption (H5) we have for (cid:107) ¯ µ (cid:107) the upper bound(3.49) (cid:107) ¯ µ (cid:107) ≤ f ( y ¯ x , ¯ x ) − Q (¯ x )min( − g i ( y ¯ x , ¯ x ) , i = 1 , . . . , p ) ≤ U ¯ x . Using Assumptions (H3) and (H6), we also get(3.50) (cid:107) ˆ y − ¯ y (cid:107) ≤ εα (¯ x ) and (cid:107) ˆ λ − ¯ λ (cid:107) + (cid:107) ˆ µ − ¯ µ (cid:107) ≤ εα D (¯ x ) . Combining (3.48), (3.49), and (3.50), we get (cid:107)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) − ∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) (cid:107) ≤ η ( ε, ¯ x ) − ε Diam( X ) . (3.51)Plugging the above relation into (3.47) and using Cauchy-Schwartz inequality, we get(3.52) Q ( x ) ≥ f (ˆ y, ¯ x ) − ε + (cid:104)∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) , x − ¯ x (cid:105) + (cid:104)∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) − ∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) , x − ¯ x (cid:105)≥ f (ˆ y, ¯ x ) − ε − (cid:107)∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) − ∇ x L ¯ x (¯ y, ¯ λ, ¯ µ ) (cid:107) Diam( X ) + (cid:104)∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) , x − ¯ x (cid:105)≥ f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x L ¯ x (ˆ y, ˆ λ, ˆ µ ) , x − ¯ x (cid:105) . Finally, since ˆ y ∈ S (¯ x ) we check that Q (¯ x ) − C (¯ x ) = Q (¯ x ) − f (ˆ y, ¯ x ) + η ( ε, ¯ x ) ≤ η ( ε, ¯ x ) , which achieves theproof of the proposition. (cid:3) Observe that the “slope” ∇ x L ¯ x (ˆ y ( (cid:15) ) , ˆ λ ( (cid:15) ) , ˆ µ ( (cid:15) )) of the cut given by Proposition 3.7 is the same as the“slope” of the cut given by Proposition 3.8. Remark 3.9. If ˆ y ( ε ) and (ˆ λ ( ε ) , ˆ µ ( ε )) are respectively optimal primal and dual solutions, i.e., ε = 0 , thenProposition 3.8 gives, as expected, an exact cut for Q at ¯ x . As shown in Corollary 3.10, the formula for the inexact cuts given in Proposition 3.8 can be simplifieddepending if there are nonlinear coupling constraints or not, if f is separable (sum of a function of x and ofa function of y ) or not, and if g is separable. Corollary 3.10.
Consider the value functions Q : X → R where Q ( x ) is given by the optimal value of thefollowing optimization problems: (3.53) ( a ) min y f ( y, x ) Ay + Bx = b,h ( y ) + k ( x ) ≤ ,y ∈ Y, ( b ) min y f ( y ) + f ( x ) Ay + Bx = b,g ( y, x ) ≤ ,y ∈ Y, ( c ) min y f ( y ) + f ( x ) Ay + Bx = b,h ( y ) + k ( x ) ≤ ,y ∈ Y, ( d ) min y f ( y, x ) g ( y, x ) ≤ ,y ∈ Y, ( e ) min y f ( y, x ) h ( y ) + k ( x ) ≤ ,y ∈ Y, ( f ) min y f ( y ) + f ( x ) g ( y, x ) ≤ ,y ∈ Y, ( g ) min y f ( y ) + f ( x ) h ( y ) + k ( x ) ≤ ,y ∈ Y, ( h ) min y f ( y, x ) Ay + Bx = b,y ∈ Y, ( i ) min y f ( y ) + f ( x ) Ay + Bx = b,y ∈ Y. For problems (b),(c),(f ),(g), (i) above define f ( y, x ) = f ( y ) + f ( x ) and for problems (a), (c), (e), (g) define g ( y, x ) = h ( y ) + k ( x ) . With this notation, assume that (H1), (H2), (H3), (H4), (H5), (H6), and (H7) holdfor these problems. If g is defined, let L x ( y, λ, µ ) = f ( y, x ) + λ T ( Bx + Ay − b ) + µ T g ( y, x ) be the Lagrangianand define U = max i =1 ,...,p (cid:107)∇ x g i (ˆ y ( ε ) , ¯ x ) (cid:107) and U ¯ x = f ( y ¯ x , ¯ x ) − L ¯ x min( − g i ( y ¯ x , ¯ x ) , i = 1 , . . . , p ) where L ¯ x is any lower bound on Q (¯ x ) . If g is not defined, define L x ( y, λ ) = f ( y, x ) + λ T ( Bx + Ay − b ) .Let ¯ x ∈ X , let ˆ y be an (cid:15) -optimal feasible primal solution for problem (3.25) written for x = ¯ x and let (ˆ λ, ˆ µ ) be an (cid:15) -optimal feasible solution of the corresponding dual problem, i.e., of problem (3.26) written for x = ¯ x . hen C ( x ) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104) s (¯ x ) , x − ¯ x (cid:105) is an inexact cut for Q at ¯ x where the formulas for η ( ε, ¯ x ) and s (¯ x ) in each of cases (a)-(i) above are the following: (3.54) ( a ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) M (¯ x ) √ α (¯ x ) + √ (cid:107) B T (cid:107) , √ pU ) √ α D (¯ x ) (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (ˆ y, ¯ x ) + B T ˆ λ + (cid:80) pi =1 ˆ µ i ∇ x k i (¯ x ) , ( b ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) M (¯ x ) U ¯ x √ α (¯ x ) + √ (cid:107) B T (cid:107) , √ pU ) √ α D (¯ x ) (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (¯ x ) + B T ˆ λ + (cid:80) pi =1 ˆ µ i ∇ x g i (ˆ y, ¯ x ) , ( c ) (cid:40) η ( ε, ¯ x ) = ε + 2 max( (cid:107) B T (cid:107) , √ pU )Diam( X ) (cid:113) εα D (¯ x ) ,s (¯ x ) = ∇ x f (¯ x ) + B T ˆ λ + (cid:80) pi =1 ˆ µ i ∇ x k i (¯ x ) , ( d ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) ( M (¯ x ) + M (¯ x ) U ¯ x ) √ α (¯ x ) + U (cid:113) pα D (¯ x ) (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (ˆ y, ¯ x ) + (cid:80) pi =1 ˆ µ i ∇ x g i (ˆ y, ¯ x ) , ( e ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) M (¯ x ) √ α (¯ x ) + (cid:113) pα D (¯ x ) U (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (ˆ y, ¯ x ) + (cid:80) pi =1 ˆ µ i ∇ x k i (¯ x ) , ( f ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) M (¯ x ) √ α (¯ x ) U ¯ x + U (cid:113) pα D (¯ x ) (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (¯ x ) + (cid:80) pi =1 ˆ µ i ∇ x g i (ˆ y, ¯ x ) , ( g ) (cid:40) η ( ε, ¯ x ) = ε + Diam( X ) (cid:113) εpα D (¯ x ) U,s (¯ x ) = ∇ x f (¯ x ) + (cid:80) pi =1 ˆ µ i ∇ x k i (¯ x ) , ( h ) (cid:40) η ( ε, ¯ x ) = ε + (cid:16) M (¯ x ) √ α (¯ x ) + (cid:107) B T (cid:107) √ α D (¯ x ) (cid:17) Diam( X ) √ ε,s (¯ x ) = ∇ x f (ˆ y, ¯ x ) + B T ˆ λ, ( i ) (cid:40) η ( ε, ¯ x ) = ε + (cid:107) B T (cid:107) (cid:113) εα D (¯ x ) Diam( X ) ,s (¯ x ) = ∇ x f (¯ x ) + B T ˆ λ. Proof.
It suffices to follow the proof of Proposition 3.8, specialized to cases (a)-(i). For instance, let us checkthe formulas in case (g). For (g), s (¯ x ) = ∇ x L ¯ x (ˆ y, ˆ µ ) = ∇ x f (¯ x ) + (cid:80) pi =1 ˆ µ i ∇ x k i (¯ x ) and(3.55) (cid:107)∇ x L ¯ x (ˆ y, ˆ µ ) − ∇ x L ¯ x (¯ y, ¯ µ ) (cid:107) = (cid:107) (cid:80) pi =1 (ˆ µ i − ¯ µ i ) ∇ x k i (¯ x ) (cid:107) ≤ U (cid:107) ˆ µ − ¯ µ (cid:107) ≤ U √ p (cid:107) ˆ µ − ¯ µ (cid:107) ≤ U √ p (cid:113) εα D (¯ x ) . It then suffices to combine (3.47) and (3.55). (cid:3)
Numerical results.
Argument of the value function in the objective only.
Let S = (cid:18) S S S T S (cid:19) be a positive definitematrix, let c ∈ R m , c ∈ R n be vectors of ones, and let Q be the value function given by(3.56) Q ( x ) = min y ∈ R n f ( y, x ) = (cid:18) xy (cid:19) T S (cid:18) xy (cid:19) + (cid:18) c c (cid:19) T (cid:18) xy (cid:19) y ∈ Y := { y ∈ R n : y ≥ , (cid:80) ni =1 y i = 1 } , = (cid:26) min y ∈ R n c T x + c T y + x T S x + x T S y + y T S yy ≥ , (cid:80) ni =1 y i = 1 . Clearly, Assumption (H3) is satisfied with α ( x ) = λ min ( S ), and (cid:107)∇ x f ( y , x ) − ∇ x f ( y , x ) (cid:107) = (cid:107) S ( y − y ) (cid:107) ≤ (cid:107) S (cid:107) (cid:107) y − y (cid:107) implying that Assumption (H4) is satisfied with M (¯ x ) = (cid:107) S (cid:107) = σ ( S ) where σ ( S ) is the largest singularvalue of S . We take X = Y with Diam( X ) = max x ,x ∈ X (cid:107) x − x (cid:107) ≤ √
2. With this notation, if ˆ y is an (cid:15) -optimal solution of (3.56) written for x = ¯ x , we compute at ¯ x the cut C ( x ) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x f (ˆ y, ¯ x ) , x − ¯ x (cid:105) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104) c + S ¯ x + S ˆ y, x − ¯ x (cid:105) where • η ( ε, ¯ x ) = η ( ε, ¯ x ) = ε + 2 M (¯ x ) (cid:113) εα (¯ x ) using Proposition 3.5; α (¯ x ) n η η ε α (¯ x ) n η η
10 0.076 0.3540.016 129.0 10 9.81 0.047 0.0084 174.5 10 4.85 0.0370.029 10054 10 2.49 0.128 0.002 10
10 0.09 0.3420.008 112.3 10 8.07 0.043 0.008 150.0 10 6.36 0.0220.018 10 090 10 1.29 0.078 0.0019 10
10 0.06 0.4420.15 531.9 100 175.6 0.3 0.18 665.3 100 183.5 0.30.23 10 687 100 44.5 0.2 0.03 10
100 2.1 0.90.17 676.2 100 185.7 0.2 0.09 734.3 100 106.5 0.20.11 10 638 100 37.9 0.2 0.02 10
100 1.7 0.30.05 660 100 106.7 0.2 0.40 777 100 253.8 0.40.07 10 585 100 32.6 0.2 0.02 10
100 1.3 0.46.78 6017.9 1000 4177.8 9.5 2.69 5991.4 1000 2778.8 6.88.12 15 722 1000 3059.5 11.1 0.99 10 Table 1.
Values of η ( ε, ¯ x ) = η ( ε, ¯ x ) (resp. η ( ε, ¯ x ) = η ( ε, ¯ x )) for the inexact cuts givenby Proposition 3.5 (resp. Proposition 3.3) for value function (3.56) for various values of n (problem dimension), α (¯ x ) = λ min ( S ), and ε . • η ( ε, ¯ x ) is given by η ( ε, ¯ x ) = η ( ε, ¯ x ) = (cid:26) max (cid:104)∇ y f (ˆ y, ¯ x ) , ˆ y − y (cid:105) y ≥ , (cid:80) ni =1 y i = 1 , = (cid:26) max (cid:104) c + S T ¯ x + S ˆ y, ˆ y − y (cid:105) y ≥ , (cid:80) ni =1 y i = 1 , using Proposition 3.3.We compare in Table 1 the values of η ( ε, ¯ x ) and η ( ε, ¯ x ) for several values of m = n , ε , and α (¯ x ). Inthese experiments S is of the form AA T + λI n for some λ > A has random entries in [ − , ε r on the optimal value to 0.1,0.5, and 1. In each run, ε was estimated computing the duality gap (the difference between the approximateoptimal values of the dual and the primal). Though η ( ε, ¯ x ) does not depend on ¯ x (because on this example α and M do not depend on ¯ x ), the absolute error ε depends on the run (for a fixed ε r , different runs corre-sponding to different ¯ x yield different errors ε, η ( ε, ¯ x ) and η ( ε, ¯ x )). Therefore, for each fixed ( ε r , α (¯ x ) , n ),the values ε, η ( ε, ¯ x ), and η ( ε, ¯ x ) reported in the table correspond to the mean values of ε , η ( ε, ¯ x ), and η ( ε, ¯ x ) obtained taking randomly 50 points in X . We see that the cuts computed by Proposition 3.5 aremuch more conservative on nearly all combinations of parameters, except on three of these combinationswhen n = 10 and α (¯ x ) = 10 is very large.3.4.2. Argument of the value function in the objective and constraints.
We close this section comparing theerror terms in the cuts given by Propositions 3.7 and 3.8 on a very simple problem with a quadratic objectiveand a quadratic constraint.Let S = (cid:18) S S S T S (cid:19) be a positive definite matrix, let c , c ∈ R n , and let Q : X → R be the valuefunction given by(3.57) Q ( x ) = min y ∈ R n { f ( y, x ) : g ( y, x ) ≤ } , here(3.58) f ( y, x ) = (cid:18) xy (cid:19) T S (cid:18) xy (cid:19) + (cid:18) c c (cid:19) T (cid:18) xy (cid:19) = c T x + c T y + x T S x + x T S y + y T S y,g ( y, x ) = (cid:107) y − y (cid:107) + (cid:107) x − x (cid:107) − R ,X = { x ∈ R n : (cid:107) x − x (cid:107) ≤ } . In what follows, we take R = 5 and x , y ∈ R n given by x ( i ) = y ( i ) = 10 , i = 1 , . . . , n . Clearly, for fixed¯ x ∈ X and any feasible y for (3.57), (3.58) written for x = ¯ x , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) x y (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) + R ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ¯ xy (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) x y (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − R. Knowing that with our problem data (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) x y (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − R >
0, we obtain the bound Q (¯ x ) ≥ L ¯ x where L ¯ x = 12 λ min ( S ) (cid:16) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) x y (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) − R (cid:17) − (cid:16) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) x y (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) + R (cid:17) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) c c (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Next, for every ¯ x ∈ X we have g ( y , ¯ x ) < U ¯ x = L ¯ x − f ( y , ¯ x ) g ( y , ¯ x )for any optimal dual solution ¯ µ ≥ x = ¯ x . Making the change ofvariable z = y − y , we can express (3.57) under the form (2.9) where(3.60) Q = S , a = a ( x ) = c + S T x + S y , b = b ( x ) = x T S x + c T x + y T ( c + S T x ) + y T S y ,Q = I n , a = 0 , b = b ( x ) = ( (cid:107) x − x (cid:107) − R ) . Therefore, using Proposition 2.11, we have that dual function θ ¯ x for (3.57) is given by(3.61) θ ¯ x ( µ ) = − a (¯ x ) T ( S + µI n ) − a (¯ x ) + b (¯ x ) + µb (¯ x )with a , b , b given by (3.60) and setting α D (¯ x ) = a (¯ x ) T ( S + U ¯ x I n ) − a (¯ x ) , if a (¯ x ) (cid:54) = 0 then θ ¯ x is strongly concave on the interval [0 , U ¯ x ] with constant of strong concavity α D (¯ x ) where U ¯ x is given by (3.59). Let ˆ y be an ε -optimal primal solution of (3.57) written for x = ¯ x and let ˆ µ be an ε -optimal solution of its dual. If a (¯ x ) (cid:54) = 0, using Corollary 3.10-(e) with p = 1 , U = (cid:107) ¯ x − x (cid:107) , we obtain for Q the cut(3.62) C ( x ) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x L ¯ x (ˆ y, ˆ µ ) , x − ¯ x (cid:105) where η ( ε, ¯ x ) = ε + D ( X ) √ ε (cid:16) M (¯ x ) √ α (¯ x ) + (cid:107) ¯ x − x (cid:107) √ α D (¯ x ) (cid:17) with D ( X ) = 2 , M (¯ x ) = (cid:107) S (cid:107) , α (¯ x ) = λ min ( S ) , ∇ x L ¯ x (ˆ y, ˆ µ ) = S ¯ x + c + S ˆ y + ˆ µ (¯ x − x ) . We now apply Proposition 3.7 to obtain another inexact cut for Q at ¯ x ∈ X rewriting (3.57) under theform (3.25) with Y the compact set Y = { y ∈ R n : (cid:107) y − y (cid:107) ≤ R } :(3.63) Q ( x ) = min y ∈ R n { f ( y, x ) : g ( y, x ) ≤ , (cid:107) y − y (cid:107) ≤ R } . Applying Proposition 3.7 to reformulation (3.63) of (3.57), we obtain for Q the inexact cut C at ¯ x where(3.64) C ( x ) = f (ˆ y, ¯ x ) − η ( ε, ¯ x ) + (cid:104)∇ x L ¯ x (ˆ y, ˆ µ ) , x − ¯ x (cid:105) with η ( ε, ¯ x ) = − min {(cid:104)∇ y L ¯ x (ˆ y, ˆ µ ) , y − ˆ y (cid:105) : (cid:107) y − y (cid:107) ≤ R } , = (cid:104)∇ y L ¯ x (ˆ y, ˆ µ ) , ˆ y − y (cid:105) + R (cid:107)∇ y L ¯ x (ˆ y, ˆ µ ) (cid:107) , ∇ x L ¯ x (ˆ y, ˆ µ ) = S ¯ x + c + S ˆ y + ˆ µ (¯ x − x ) , ∇ y L ¯ x (ˆ y, ˆ µ ) = S ˆ y + S T ¯ x + c + ˆ µ (ˆ y − y ) . As in the previous example, we take S of form S = AA T + λI n where the entries of A are randomly selectedin the range [ − , c ( i ) = c ( i ) = 1 , i = 1 , . . . , n . For 8 values of the pair ( n, λ ), namely n, λ ) ∈ { (1 , , (10 , , (100 , , (1000 , , (1 , , (10 , , (100 , , (1000 , } , we generate a matrix S of form AA T + λI n where the entries of A are realizations of independent random variables with uniformdistribution in [ − , x ∈ X and solve (3.57), (3.58) and its dual writtenfor x = ¯ x using Mosek interior point solver. The value of α (¯ x ) = λ min ( S ), the dual function θ ¯ x ( · ), and thedual iterates computed along the iterations are reported in Figure 6 in the Appendix. Figure 7 shows theplots of η ( ε k , ¯ x ) and η ( ε k , ¯ x ) as a function of iteration k where ε k is the duality gap at iteration k .The cuts computed by Proposition 3.8 are more conservative than cuts given by Proposition 3.7 on nearlyall instances and iterations. We also see that, as expected, the error terms η ( ε k , ¯ x ) and η ( ε k , ¯ x ) go to zerowhen ε k goes to zero (see the proof of Theorem 4.2 for a proof of this statement).4. Inexact Stochastic Mirror Descent for two-stage nonlinear stochastic programs
The algorithm to be described in this section is an inexact extension of SMD [13] to solve(4.65) (cid:26) min f ( x ) := f ( x ) + Q ( x ) x ∈ X with X ⊂ R n a convex, nonempty, and compact set, and Q ( x ) = E ξ [ Q ( x , ξ )], ξ is a random vectorwith probability distribution P on Ξ ⊂ R k , and(4.66) Q ( x , ξ ) = (cid:26) min x f ( x , x , ξ ) x ∈ X ( x , ξ ) := { x ∈ X : Ax + Bx = b, g ( x , x , ξ ) ≤ } . Recall that ξ contains the random variables in ( A, B, b ) and eventually other sources of randomness. Let (cid:107) · (cid:107) be a norm on R n and let ω : X → R be a distance-generating function . This function should • be convex and continuous on X , • admit on X o = { x ∈ X : ∂ω ( x ) (cid:54) = ∅} a selection ω (cid:48) ( x ) of subgradients, and • be compatible with (cid:107) · (cid:107) , meaning that ω ( · ) is strongly convex with constant of strong convexity µ ( ω ) > (cid:107) · (cid:107) :( ω (cid:48) ( x ) − ω (cid:48) ( y )) T ( x − y ) ≥ µ ( ω ) (cid:107) x − y (cid:107) ∀ x, y ∈ X o . We also define(1) the ω -center of X given by x ω = argmin x ∈ X ω ( x ) ∈ X o ;(2) the Bregman distance or prox-function(4.67) V x ( y ) = ω ( y ) − ω ( x ) − ( y − x ) T ω (cid:48) ( x ) , for x ∈ X o , y ∈ X ;(3) the ω -radius of X defined as(4.68) D ω,X = (cid:114) (cid:104) max x ∈ X ω ( x ) − min x ∈ X ω ( x ) (cid:105) . (4) The proximal mapping (4.69) Prox x ( ζ ) = argmin y ∈ X { ω ( y ) + y T ( ζ − ω (cid:48) ( x )) } [ x ∈ X o , ζ ∈ R n ] , taking values in X o .We describe below ISMD, an inexact variant of SMD for solving problem (4.65) in which primal and dualsecond stage problems are solved approximately.For x ∈ X , ξ ∈ Ξ, and ε ≥
0, we denote by x ( x , ξ , ε ) an ε -optimal feasible primal solution of (4.66),i.e., x ( x , ξ , ε ) ∈ X ( x , ξ ) and Q ( x , ξ ) ≤ f ( x , x , ξ ) ≤ Q ( x , ξ ) + ε. We now define ε -optimal dual second stage solutions. For x ∈ X and ξ ∈ Ξ let L x ,ξ ( x , λ, µ ) = f ( x , x , ξ ) + (cid:104) λ, Ax + Bx − b (cid:105) + (cid:104) µ, g ( x , x , ξ ) (cid:105) , nd let θ x ,ξ be the dual function given by(4.70) θ x ,ξ ( λ, µ ) = (cid:26) min L x ,ξ ( x , λ, µ ) x ∈ X . For x ∈ X , ξ ∈ Ξ, and ε ≥
0, we denote by ( λ ( x , ξ , ε ) , µ ( x , ξ , ε )) an ε -optimal feasible solution of thedual problem(4.71) (cid:26) max θ x ,ξ ( λ, µ ) µ ≥ , λ = Ax + Bx − b, x ∈ Aff( X ) . Under Slater-type constraint qualification conditions to be specified in Theorems 4.2 and 4.4, the opti-mal values of primal second stage problem (4.66) and dual second stage problem (4.71) are the same and( λ ( x , ξ , ε ) , µ ( x , ξ , ε )) satisfies: µ ( x , ξ , ε ) ≥ , λ ( x , ξ , ε ) = Ax + Bx − b, for some x ∈ Aff( X ) and Q ( x , ξ ) − ε ≤ θ x ,ξ ( λ ( x , ξ , ε ) , µ ( x , ξ , ε )) ≤ Q ( x , ξ ) . We also denote by D X = max x,y ∈ X (cid:107) y − x (cid:107) the diameter of X , by s f ( x ) a subgradient of f at x , andwe define(4.72) H ( x , ξ , ε ) = ∇ x f ( x ( x , ξ , ε ) , x , ξ ) + B T λ ( x , ξ , ε ) + (cid:80) pi =1 µ i ( x , ξ , ε ) ∇ x g i ( x ( x , ξ , ε ) , x , ξ ) ,G ( x , ξ , ε ) = s f ( x ) + H ( x , ξ , ε ) . Inexact Stochastic Mirror Descent (ISMD) for risk-neutral two-stage nonlinear stochasticproblems.Parameters:
Sequence ( ε t ) and θ > For N = 2 , , . . . , Take x N, = x ω . For t = 1 , . . . , N −
1, sample a realization ξ N,t of ξ (with corresponding realizations A N,t of A , B N,t of B , and b N,t of b ), compute an ε t -optimal solution x N,t of the problem(4.73) Q ( x N,t , ξ N,t ) = min x f ( x , x N,t , ξ N,t ) A N,t x + B N,t x N,t = b N,t ,g ( x , x N,t , ξ N,t ) ≤ ,x ∈ X , and an ε t -optimal solution ( λ N,t , µ
N,t ) = ( λ ( x N,t , ξ N,t , ε t ) , µ ( x N,t , ξ N,t , ε t )) of the dual problem(4.74) (cid:40) max θ x N,t ,ξ N,t ( λ, µ ) µ ≥ , λ = A N,t x + B N,t x N,t − b N,t , x ∈ Aff( X )used to compute G ( x N,t , ξ N,t , ε t ) given by (4.72) replacing ( x , ξ , ε ) by ( x t , ξ t , ε t ). Compute γ t ( N ) = θ √ N and(4.75) x N,t +11 = Prox x N,t ( γ t ( N ) G ( x N,t , ξ N,t , ε t )) . Compute(4.76) x ( N ) = 1Γ N N (cid:88) τ =1 γ τ ( N ) x N,τ andˆ f N = 1Γ N (cid:34) N (cid:88) τ =1 γ τ ( N ) (cid:16) f ( x N,τ ) + f ( x N,τ , x N,τ , ξ N,τ ) (cid:17)(cid:35) with Γ N = N (cid:88) τ =1 γ τ ( N ) . Any optimization solver for convex nonlinear programs able to provide ε t -optimal solutions can be used (for instance aninterior point solver). nd ForEnd ForRemark 4.1. In practise ISMD is run fixing the number N of inner iterations, i.e., we fix N and compute x ( N ) and ˆ f N . Convergence of Inexact Stochastic Mirror Descent for solving (4.65) can be shown when error terms ( ε t )asymptotically vanish: Theorem 4.2 (Convergence of ISMD) . Consider problem (4.65) and assume that (i) X and X arenonempty, convex, and compact, (ii) f is convex, finite-valued, and has bounded subgradients on X , (iii)for every x ∈ X and x ∈ X , f ( x , x , · ) and g i ( x , x , · ) , i = 1 , . . . , p , are measurable, (iv) for every ξ ∈ Ξ the functions f ( · , · , ξ ) and g i ( · , · , ξ ) , i = 1 , . . . , p , are convex and continuously differentiable on X × X , (v) ∃ κ > and r > such that for all x ∈ X , for all ˜ ξ ∈ Ξ , there exists x ∈ X such that B ( x , r ) ∩ Aff ( X ) (cid:54) = ∅ , ˜ Ax + ˜ Bx = ˜ b , and g ( x , x , ˜ ξ ) < − κe where e is a vector of ones. If γ t = θ √ N forsome θ > , if the support Ξ of ξ is compact, and if lim t →∞ ε t = 0 , then lim N → + ∞ E [ f ( x ( N ))] = lim N → + ∞ E [ ˆ f N ] = f ∗ where f ∗ is the optimal value of (4.65) .Proof. For fixed N , to alleviate notation, we denote vectors x N,t , x N,t , ξ N,t , A N,t , B
N,t , b
N,t , γ t ( N ) , λ N,t , µ
N,t used to compute x ( N ) and ˆ f N by x t , x t , ξ t , A t , B t , b t , γ t , λ t , µ t , respectively. Let x ∗ be an optimal solutionof (4.65). Standard computations on the proximal mapping give(4.77) N (cid:88) τ =1 γ τ G ( x τ , ξ τ , ε τ ) T ( x τ − x ∗ ) ≤ D ω,X + 12 µ ( ω ) N (cid:88) τ =1 γ τ (cid:107) G ( x τ , ξ τ , ε τ ) (cid:107) ∗ . Next using Proposition 3.7 we have(4.78) Q ( x ∗ , ξ τ ) ≥ Q ( x τ , ξ τ ) − η ξ τ ( ε τ , x τ ) + (cid:104) H ( x τ , ξ τ , ε τ ) , x ∗ − x τ (cid:105) where(4.79) η ξ τ ( ε τ , x τ ) = (cid:26) max (cid:104)∇ x L x τ ,ξ τ ( x τ , λ τ , µ τ ) , x τ − x (cid:105) x ∈ X = (cid:26) max (cid:104)∇ x f ( x τ , x τ , ξ τ ) + ( A τ ) T λ τ + (cid:80) pi =1 µ τi ∇ x g i ( x τ , x τ , ξ τ ) , x τ − x (cid:105) x ∈ X . Setting ξ τ − = ( ξ , . . . , ξ τ − ) and taking the conditional expectation E ξ τ [ ·| ξ τ − ] on each side of (4.78) weobtain almost surely(4.80) Q ( x ∗ ) ≥ Q ( x τ ) − E ξ τ [ η ξ τ ( ε τ , x τ ) | ξ τ − ] + ( E ξ τ [ H ( x τ , ξ τ , ε τ ) | ξ τ − ]) T ( x ∗ − x τ ) . Combining (4.77), (4.80), and using the convexity of f we get(4.81) 0 ≤ E [ f ( x ( N )) − f ( x ∗ )] ≤ N N (cid:88) τ =1 γ τ E [ f ( x τ ) − f ( x ∗ )] ≤ N N (cid:88) τ =1 γ τ E [ η ξ τ ( ε τ , x τ )] + 12Γ N (cid:104) D ω,X + 1 µ ( ω ) N (cid:88) τ =1 γ τ E [ (cid:107) G ( x τ , ξ τ , ε τ ) (cid:107) ∗ ] (cid:105) . We now show by contradiction that (4.82) lim τ → + ∞ η ξ τ ( ε τ , x τ ) = 0 almost surely.Take an arbitrary realization of ISMD. We want to show that(4.83) lim τ → + ∞ η ξ τ ( ε τ , x τ ) = 0 The proof is similar to the proof of Proposition 4.6 in [6]. or that realization. Assume that (4.83) does not hold. Let x t ∗ (resp. ˜ x τ ) be an optimal solution of (4.73)(resp. (4.79)). Then there is ε > σ : N → N increasing such that for every τ ∈ N , we have(4.84) (cid:104)∇ x f ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) + ( A σ ( τ ) ) T λ σ ( τ ) + p (cid:88) i =1 µ σ ( τ ) i ∇ x g i ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) , x σ ( τ )2 − ˜ x σ ( τ )2 (cid:105) ≥ ε . By ε t -optimality of x t we obtain(4.85) f ( x t ∗ , x t , ξ t ) ≤ f ( x t , x t , ξ t ) ≤ f ( x t ∗ , x t , ξ t ) + ε t . Using Assumptions (i), (iii), (iv), and Proposition 3.1 in [6] we get that the sequence ( λ τ , µ τ ) τ is almost surelybounded. Let D be a compact set to which this sequence belongs. By compacity, we can find σ : N → N increasing such that setting σ = σ ◦ σ the sequence ( x σ ( τ )2 , x σ ( τ )1 , λ σ ( τ ) , µ σ ( τ ) , ξ σ ( τ )2 ) converges to some(¯ x , x ∗ , λ ∗ , µ ∗ , ξ ∗ ) ∈ X × X × D × Ξ. We will denote by A ∗ , B ∗ , b ∗ the values of A, B, and b in ξ ∗ . Bycontinuity arguments there is τ ∈ N such that for every τ ≥ τ :(4.86) (cid:12)(cid:12)(cid:12) (cid:104)∇ x f ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) + ( A σ ( τ ) ) T λ σ ( τ ) + (cid:80) pi =1 µ σ ( τ ) i ∇ x g i ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) , x σ ( τ )2 − ˜ x σ ( τ )2 (cid:105)− (cid:104)∇ x f (¯ x , x ∗ , ξ ∗ ) + A T ∗ λ ∗ + (cid:80) pi =1 µ ∗ ( i ) ∇ x g i (¯ x , x ∗ , ξ ∗ ) , ¯ x − ˜ x σ ( τ )2 (cid:105) (cid:12)(cid:12)(cid:12) ≤ ε / . We deduce from (4.84) and (4.86) that for all τ ≥ τ (4.87) (cid:42) ∇ x f (¯ x , x ∗ , ξ ∗ ) + A T ∗ λ ∗ + p (cid:88) i =1 µ ∗ ( i ) ∇ x g i (¯ x , x ∗ , ξ ∗ ) , ¯ x − ˜ x σ ( τ )2 (cid:43) ≥ ε / > . Assumptions (i)-(iv) imply that primal problem (4.73) and dual problem (4.74) have the same optimal valueand for every x ∈ X and τ ≥ τ we have: f ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) + (cid:104) A σ ( τ ) x σ ( τ )2 + B σ ( τ ) x σ ( τ )1 − b σ ( τ ) , λ σ ( τ ) (cid:105) + (cid:104) µ σ ( τ ) , g ( x σ ( τ )2 , x σ ( τ )1 , ξ σ ( τ )2 ) (cid:105)≤ f ( x σ ( τ )2 ∗ , x σ ( τ )1 , ξ σ ( τ )2 ) + ε σ ( τ ) by definition of x σ ( τ )2 ∗ , x σ ( τ )2 and since µ σ ( τ ) ≥ , x σ ( τ )2 ∈ X ( x σ ( τ )1 , ξ σ ( τ )2 ) , ≤ θ x σ ( τ )1 ,ξ σ ( τ )2 ( λ σ ( τ ) , µ σ ( τ ) ) + 2 ε σ ( τ ) , [( λ σ ( τ ) , µ σ ( τ ) ) is an (cid:15) σ ( τ ) -optimal dual solution and there is no duality gap] , ≤ f ( x , x σ ( τ )1 , ξ σ ( τ )2 ) + (cid:104) A σ ( τ ) x + B σ ( τ ) x σ ( τ )1 − b σ ( τ ) , λ σ ( τ ) (cid:105) + (cid:104) µ σ ( τ ) , g ( x , x σ ( τ )1 , ξ σ ( τ )2 ) (cid:105) + 2 ε σ ( τ ) where in the last relation we have used the definition of θ x σ ( τ )1 ,ξ σ ( τ )2 . Taking the limit in the above relationas τ → + ∞ , we get for every x ∈ X : f (¯ x , x ∗ , ξ ∗ ) + (cid:104) A ∗ ¯ x + B ∗ x ∗ − b ∗ , λ ∗ (cid:105) + (cid:104) µ ∗ , g (¯ x , x ∗ , ξ ∗ ) (cid:105)≤ f ( x , x ∗ , ξ ∗ ) + (cid:104) A ∗ x + B ∗ x ∗ − b ∗ , λ ∗ (cid:105) + (cid:104) µ ∗ , g ( x , x ∗ , ξ ∗ ) (cid:105) . Recalling that ¯ x ∈ X this shows that ¯ x is an optimal solution of(4.88) (cid:26) min f ( x , x ∗ , ξ ∗ ) + (cid:104) A ∗ x + B ∗ x ∗ − b ∗ , λ ∗ (cid:105) + (cid:104) µ ∗ , g ( x , x ∗ , ξ ∗ ) (cid:105) x ∈ X . The first order optimality conditions for ¯ x can be written(4.89) (cid:42) ∇ x f (¯ x , x ∗ , ξ ∗ ) + A T ∗ λ ∗ + p (cid:88) i =1 µ ∗ ( i ) ∇ x g i (¯ x , x ∗ , ξ ∗ ) , x − ¯ x (cid:43) ≥ x ∈ X . Specializing the above relation for x = ˜ x σ ( τ )2 ∈ X , we get (cid:42) ∇ x f (¯ x , x ∗ , ξ ∗ ) + A T ∗ λ ∗ + p (cid:88) i =1 µ ∗ ( i ) ∇ x g i (¯ x , x ∗ , ξ ∗ ) , ˜ x σ ( τ )2 − ¯ x (cid:43) ≥ , but the left-hand side of the above inequality is ≤ − ε / < η ξ τ ( ε τ , x τ ) is almost surely bounded,this implies lim τ → + ∞ E [ η ξ τ ( ε τ , x τ )] = 0 and consequently lim N → + ∞ N (cid:80) Nτ =1 γ τ E [ η ξ τ ( ε τ , x τ )] = 0. Us-ing the boundedness of the sequence ( λ t , µ t ) and Assumption (ii) we get that (cid:107) G ( x τ , ξ τ , ε τ ) (cid:107) ∗ is almostsurely bounded. Combining these observations with relation (4.81) and using the definition of γ t we have im N → + ∞ E [ f ( x ( N ))] = f ∗ . Finally, recalling relation (4.81), to show lim N → + ∞ E [ ˆ f N ] = f ∗ all we haveto show is(4.90) lim N → + ∞ N N (cid:88) τ =1 γ τ E [ Q ( x τ ) − f ( x τ , x τ , ξ τ )] = 0 . The above relation immediately follows from(4.91) E [ Q ( x τ )] = E ξ τ − [ Q ( x τ )] = E ξ τ − [ E ξ τ [ Q ( x τ , ξ τ ) | ξ τ − ]] ≤ E ξ τ [ f ( x τ , x τ , ξ τ )] ≤ E [ Q ( x τ )] + ε τ which holds since Q ( x τ , ξ τ ) ≤ f ( x τ , x τ , ξ τ ) ≤ Q ( x τ , ξ τ ) + ε τ by definition of x τ . (cid:3) Remark 4.3.
Output ˆ f N of ISMD is a computable approximation of the optimal value f ∗ of optimizationproblem (4.65) . Theorem 4.4. [Convergence rate for ISMD] Consider problem (4.65) and assume that Assumptions (i)-(iv)of Theorem 4.2 are satisfied. We alse make the following assumptions: (a) ∃ α > such that for every ξ ∈ Ξ , for every x ∈ X , for every y , y ∈ X we have f ( y , x , ξ ) ≥ f ( y , x , ξ ) + ( y − y ) T ∇ x f ( y , x , ξ ) + α (cid:107) y − y (cid:107) ;(b) there is < M < + ∞ such that for every ξ ∈ Ξ , for every x ∈ X , for every y , y ∈ X we have (cid:107)∇ x f ( y , x , ξ ) − ∇ x f ( y , x , ξ ) (cid:107) ≤ M (cid:107) y − y (cid:107) ;(c) there is < M < + ∞ such that for every ξ ∈ Ξ , for every x ∈ X , for every i = 1 , . . . , p , forevery y , y ∈ X , we have (cid:107)∇ x g i ( y , x , ξ ) − ∇ x g i ( y , x , ξ ) (cid:107) ≤ M (cid:107) y − y (cid:107) ;(d) ∃ α D > such that for every x ∈ X , for every ξ ∈ Ξ , dual function θ x ,ξ given by (4.70) isstrongly concave on D x ,ξ with constant of strong concavity α D where D x ,ξ is a set containing theset of solutions of second stage dual problem (4.71) such that ( λ t , µ t ) ∈ D x t ,ξ t . (e) There are functions G , M such that for every x ∈ X , for every x ∈ X we have max( (cid:107) B T (cid:107) , √ p max i =1 ,...,p (cid:107)∇ x g i ( x , x , ξ ) (cid:107) ) ≤ G ( ξ ) and (cid:107)∇ x f ( x , x , ξ ) (cid:107) ≤ M ( ξ ) with E [ G ( ξ )] and E [ M ( ξ )] finite; (f) There are functions f , f such that for all x ∈ X , x ∈ X we have f ( ξ ) ≤ f ( x , x , ξ ) ≤ f ( ξ ) with E [ f ( ξ )] and E [ f ( ξ )] finite. (g) There exists < L ( f ) < + ∞ such that for every ξ ∈ Ξ , for every x ∈ X , function f ( · , x , ξ ) isLipschitz continuous with Lipschitz constant L ( f ) .Let A such that matrix A in ξ almost surely belongs to A and let M < + ∞ such that (cid:107) s f ( x ) (cid:107) ≤ M forall x ∈ X . Let V X be the vector space V X = { x − y : x, y ∈ Aff( X ) } . Define the functions ρ and ρ ∗ by ρ ( A, z ) = (cid:26) max t (cid:107) z (cid:107) t ≥ , tz ∈ A ( B (0 , r ) ∩ V X ) , ρ ∗ ( A ) = (cid:26) min ρ ( A, z ) (cid:107) z (cid:107) = 1 , z ∈ AV X . Assume that γ t = θ √ N and ε t = θ t for some θ , θ > . Let U = ( E [ f ( ξ )] − E [ f ( ξ )]) /κ, U ( r, ξ ) = f ( ξ ) − f ( ξ )+ θ + L ( f ) r min( ρ ∗ ,κ/ with ρ ∗ = min A ∈A ρ ∗ ( A ) , U = (cid:16) ( M + M U ) (cid:113) α + E [ G ( ξ )] √ α D (cid:17) Diam( X ) ,M ∗ ( r ) = (cid:113) E ( M + M ( ξ ) + √ U ( r, ξ ) G ( ξ )) . et ˆ f N computed by ISMD. Then there is r > such that (4.92) f ∗ ≤ E [ ˆ f N ] ≤ f ∗ + 2 θ + U√ θ N + U√ θ ln( N ) N + D ω,X θ + θ M ∗ ( r ) µ ( ω ) √ N where f ∗ is the optimal value of (4.65) .Proof. Let x ∗ be an optimal solution of (4.65). Under our assumptions, we can apply Proposition 3.8 tovalue function Q ( · , ξ t ) and ¯ x = x t , which gives(4.93) Q ( x ∗ , ξ t ) ≥ f ( x t , x t , ξ t ) + (cid:104) H ( x t , ξ t , ε t ) , x ∗ − x t (cid:105) − η ξ t ( ε t , x t ) , where η ξ t ( ε t , x t ) = ε t + (cid:16) M + M κ ( f (¯ x t , x t , ξ t ) − f ( ξ t )) (cid:17)(cid:113) ε t α Diam( X ) , +2 max (cid:16) (cid:107) ( B t ) T (cid:107) , √ p max i =1 ,...,p (cid:107)∇ x g i ( x t , x t , ξ t ) (cid:107) (cid:17) Diam( X ) (cid:113) ε t α D , for some ¯ x t ∈ X depending on ξ t . Taking the conditional expectation E ξ t [ ·| ξ t − ] in (4.93) and using(e)-(f), we get(4.94) Q ( x ∗ ) ≥ E ξ t [ f ( x t , x t , ξ t ) | ξ t − ] + E ξ t [ (cid:104) H ( x t , ξ t , ε t ) , x ∗ − x t (cid:105)| ξ t − ] − ( ε t + U√ ε t ) . Summing (4.94) with the relation f ( x ∗ ) ≥ f ( x t ) + (cid:104) s f ( x t ) , x ∗ − x t (cid:105) and taking the expectation operator E ξ t − [ · ] on each side of the resulting inequality gives(4.95) f ( x ∗ ) ≥ E [ f ( x t , x t , ξ t ) + f ( x t )] + E [ (cid:104) G ( x t , ξ t , ε t ) , x ∗ − x t (cid:105) ] − ( ε t + U√ ε t ) . From (4.95), we deduce(4.96) E [ ˆ f N − f ∗ ] ≤ N N (cid:88) t =1 γ t ( ε t + U√ ε t ) + 1Γ N N (cid:88) t =1 γ t E [ (cid:104) G ( x t , ξ t , ε t ) , x t − x ∗ (cid:105) ] . Using Proposition 3.1 in [6] and our assumptions, we can find r > M ∗ ( r ) is an upper boundfor E [ (cid:107) G ( x t , ξ t , ε t ) (cid:107) ∗ ]. Using this observation, (4.96), and (4.93) (which still holds), we get(4.97) E [ ˆ f N − f ∗ ] ≤ N (cid:16) θ (cid:16) (cid:90) N dxx (cid:17) + U (cid:112) θ (cid:16) (cid:90) N dxx (cid:17)(cid:17) + 12 θ √ N (cid:16) D ω,X + M ∗ ( r ) θ µ ( ω ) (cid:17) ≤ θ + U√ θ N + U√ θ ln( N ) N + D ω,X θ + θ M ∗ ( r µ ( ω ) √ N . Finally(4.98) 0 (4.81) ≤ N N (cid:88) τ =1 γ τ E [ f ( x τ )] − f ∗ = 1Γ N N (cid:88) τ =1 γ τ E [ f ( x τ ) + Q ( x τ )] − f ∗ (4.91) ≤ N N (cid:88) τ =1 γ τ E [ f ( x τ ) + f ( x τ , x τ , ξ τ )] − f ∗ = E [ ˆ f N − f ∗ ] . Combining (4.97) and (4.98) we obtain (4.92). (cid:3) . Numerical experiments
We compare the performances of SMD, ISMD, SAA (Sample Average Approximation, see [18]), and the L-shaped method (see [2]) on two simple two-stage quadratic stochastic programs which satisfy the assumptionsof Theorems 4.2 and 4.4.The first two-stage program is(5.99) (cid:26) min c T x + E [ Q ( x , ξ )] x ∈ { x ∈ R n : x ≥ , (cid:80) ni =1 x ( i ) = 1 } where the second stage recourse function is given by(5.100) Q ( x , ξ ) = min x ∈ R n (cid:18) x x (cid:19) T (cid:16) ξ ξ T + λI n (cid:17) (cid:18) x x (cid:19) + ξ T (cid:18) x x (cid:19) x ≥ , n (cid:88) i =1 x ( i ) = 1 . The second two-stage program is(5.101) (cid:26) min c T x + E [ Q ( x , ξ )] x ∈ { x ∈ R n : (cid:107) x − x (cid:107) ≤ } where cost-to-go function Q ( x , ξ ) has nonlinear objective and constraint coupling functions and is givenby(5.102) Q ( x , ξ ) = min x ∈ R n (cid:18) x x (cid:19) T (cid:16) ξ ξ T + λI n (cid:17) (cid:18) x x (cid:19) + ξ T (cid:18) x x (cid:19) (cid:107) x − y (cid:107) + (cid:107) x − x (cid:107) − R ≤ . For both problems, ξ is a Gaussian random vector in R n and λ >
0. We consider several instances of theseproblem with n = 5, 10, 200, 400, and n = 600. For each instance, the components of ξ are independentwith means and standard deviations randomly generated in respectively intervals [5 ,
25] and [5 , λ = 2 while the components of c are generated randomly in interval [1 , R = 5 and x ( i ) = y ( i ) = 10 , i = 1 , . . . , n .In SMD and ISMD, we take ω ( x ) = (cid:80) ni =1 x i ln( x i ) for problem (5.99)-(5.100). For this distance generatingfunction, x + =Prox x ( ζ ) can be computed analytically for x ∈ R n with x > z ∈ R n by z ( i ) = ln( x ( i )) we have x + ( i ) = exp( z + ( i )) where z + = w − ln (cid:32) n (cid:88) i =1 e w ( i ) (cid:33) with w = z − ζ − max i [ z ( i ) − ζ ( i )] , and with a vector in R n of ones.For problem (5.101)-(5.102), SMD and ISMD are run taking distance generating function ω ( x ) = (cid:107) x (cid:107) (in this case, SMD is just the Robust Stochastic Approximation). For this choice of ω , if x + = Prox x ( ζ ) wehave x + = (cid:26) x − ζ if (cid:107) x − ζ − x (cid:107) ≤ ,x + x − ζ − x (cid:107) x − ζ − x (cid:107) otherwise.In SMD and ISMD, the interior point solver of the Mosek Optimization Toolbox [1] is used at eachiteration to solve the quadratic second stage problem (given first stage decision x t and realization ξ t of ξ at iteration t ) and constant steps are used: if there are N iterations, the step γ t for iteration t is γ t = √ N .For ISMD, we limit the number of iterations of Mosek solver used to solve subproblems. More precisely, weconsider four strategies for the limitation of these numbers of iterations given in Table 5 in the Appendix,which define four variants of ISMD denoted by ISMD 1, ISMD 2, ISMD 3, and ISMD 4. The variants thatmost limit the number of iterations are ISMD 1 and ISMD 2. All methods were implemented in Matlab andrun on an Intel Core i7, 1.8GHz, processor with 12,0 Go of RAM. According to current Mosek documentation, it is not possible to use absolute errors. Therefore, early termination of thesolver can either be obtained limiting the number of iterations or defining relative errors. N Problem
L-shaped SAA SMD . × Table 2.
CPU time in seconds required to solve instances of problems (5.99)-(5.100) and(5.101)-(5.102) (for n = 5 ,
10 and N = 20 000) obtained with the L-shaped method, SAA,and SMD. n N Problem
L-shaped SAA SMD × . × . ×
10 20 000 (5.99) 78.8 78.9 78.610 20 000 (5.101) 3 . × . × . × Table 3.
Approximate optimal value of instances of problems (5.99)-(5.100) and (5.101)-(5.102) (for n = 5 ,
10 and N = 20 000) obtained with the L-shaped method, SAA, andSMD.To check the implementations and compare the accuracy and CPU time of all methods, we first considerproblems (5.99)-(5.100) and (5.101)-(5.102) with n = 5 ,
10, and a large sample of size N = 20 000 of ξ . Inthese experiments, the L-shaped method terminates when the relative error is at most 5%. The CPU timeneeded to solve these instances with the L-shaped method, SAA, and SMD are given in Table 2. For theseinstances, we also report in Table 3 the approximate optimal values given by all methods knowing that for theL-shaped method we report the value of the last upper bound computed. For SMD, the approximate optimalvalue after N iterations is given by ˆ f N . On the four experiments, all methods give very close approximationsof the optimal value, which is a good indication that the methods were well implemented. SMD is by farthe quickest and SAA by far the slowest. For the instance of Problem (5.99)-(5.100) with n = 10, we reportin the left plot of Figure 1 the evolution of the approximate optimal value along the iterations of SMD. We also report on the right plot of this figure the evolution of the upper and lower bounds computed alongthe iterations of the L-shaped method for the instance of Problem (5.99)-(5.100) with n = 10. For problem(5.101)-(5.102), the evolution of the approximate optimal value along the iterations of SMD is represented inFigure 2. For SMD and these simulations, we obtain a stabilization of the approximate optimal value after10 000 iterations (observe that with SMD the approximate optimal value is not the value of the objectivefunction at a feasible point and therefore some of these approximations can be below the optimal value ofthe problem).We now consider larger instances taking n = 200, 400 , and 600. For these simulations we do not useSAA and L-shaped method anymore which were not as efficient as SMD on previous simulations and requireprohibitive computational time for n = 200 , , n = 200 and n = 400, we run all methods 10 times taking samples of ξ of size N = 2000 for n = 200,of size N = 1000 for Problem (5.99)-(5.100) and n = 400, and of size N = 500 for Problem (5.101)-(5.102)and n = 400. For n = 600, it takes much more time to load and solve subproblems and we only run SMDand ISMD once taking a sample of size N = 500 for Problem (5.99)-(5.100) and of size N = 300 for Problem(5.101)-(5.102). Naturally, after running t − N − (cid:80) tτ =1 γ τ ( N ) t (cid:88) τ =1 γ τ ( N ) (cid:16) f ( x N,τ ) + f ( x N,τ , x N,τ , ξ N,τ ) (cid:17) obtained on the basis of sample ξ N, , . . . , ξ N,t of ξ . Due to the increase in computational time when N increases, we do not take the largest sample size N = 2000 for allinstances. However, for all instances and values of N chosen, we observe a stabilization of the approximate optimal value beforestopping the algorithm, which indicates a good solution has been found at termination. terations SAASMD
Iterations
Figure 1.
Left plot: optimal value of our instance of Problem (5.99)-(5.100) with n = 10estimated using SAA as well as evolution of the approximate optimal value computed alongthe iterations of SMD. Right plot: for the same instance, evolution of the lower and upperbounds computed along the iterations of the L-shaped method. Iterations SAASMD
Iterations Figure 2.
Left plot: optimal value of our instance of Problem (5.101)-(5.102) with n = 5estimated using SAA as well as evolution of the approximate optimal value computed alongthe iterations of SMD. Right plot: same outputs for Problem (5.101)-(5.102) and n = 10.In Figure 3, we report for our instances of Problem (5.99)-(5.100) the mean (computed over the 10 runsof the methods for n = 200 , We also report on this figure the empirical distribution (over the 10 runs of the methods for n = 200 , When SMD (and similarly for ISMD) is run on samples of ξ of size N , we have seen how to compute at iteration t − (cid:80) tτ =1 γ τ ( N ) t (cid:88) τ =1 γ τ ( N ) (cid:16) f ( x N,τ ) + f ( x N,τ , x N,τ , ξ N,τ ) (cid:17) of the optimal value on the basis of sample ξ N, , . . . , ξ N,t of ξ . The mean approximate optimal value after t − ξ ofsize N and computing the mean of these values on these samples. nstance SMD ISMD 1 ISMD 2 ISMD 3 ISMD 4 n = 200, Problem (5.99) 1.2 3.2 1.7 1.2 1.2 n = 400, Problem (5.99) 0.86 3.14 1.27 0.86 0.86 n = 600, Problem (5.99) 0.81 6.59 3,33 0.81 0.81 n = 200, Problem (5.101) 1 . × . × . × . × . × n = 400, Problem (5.101) 6.9978 × × × × × n = 600, Problem (5.101) 1.5524 × × × × × Table 4.
Approximate optimal values of instances of Problems (5.99) and (5.101) estimatedwith SMD, ISMD 1, ISMD 2, ISMD 3, and ISMD 4.As expected, ISMD 1 and ISMD 2 complete the N iterations quicker (since they run Mosek for lessiterations) but start with worse approximations of the optimal values. ISMD 3 and ISMD 4 also completethe N iterations quicker than SMD but provide approximations of the optimal values very close to SMDalong the iterations of the method and in particular at termination, see also Table 4 which gives the meanapproximate optimal value at the last iteration N for all methods. We should also note that most ofthe computational time for these methods is spent in loading the data for Mosek solver through a seriesof loops and this step requires the same computational time for all methods. Therefore, the difference incomputational time only comes from the time spent by Mosek to solve subproblems. With a C++ or Fortranimplementation, this time would remain similar but the loops for loading the data would be much quickerand the total solution time would decrease by a much more important factor. However, even with our Matlabimplementation, the total time decreases significantly.For our instances of Problem (5.100)-(5.102), we report in Figure 4 the mean (over the 10 runs for n = 200and n = 400) approximate optimal values computed along the iterations of SMD and our variants of ISMD.For the instances n = 200 and n = 400, we also report in Figure 5 the empirical distribution of the totalsolution time and of the time required for Mosek to solve subproblems for SMD and all variants of ISMD. Theremarks made for Problem (5.99) still apply for these simulations performed on Prob lem (5.101). We alsorefer to Table 4 which provides the mean approximate optimal value at the last iteration N for all methods.As for Problem (5.99), ISMD 3 and ISMD 4 provide after our N − Conclusion
We introduced an inexact variant of SMD called ISMD to solve (general) nonlinear two-stage stochasticprograms. We have shown on two examples of two-stage nonlinear problems that ISMD can allow us toobtain quicker than SMD a good solution and a good approximation of the optimal value.The method and convergence analysis was based on two studies of convex analysis:(a) the computation of inexact cuts for value functions of a large class of convex optimization problemshaving nonlinear objective and constraints which couple the argument of the value function and thedecision variable;(b) the study of the strong concavity of the dual function of an optimization problem (used to deriveone of our formulas for inexact cuts).It is worth mentioning that the formulas we derived for inexact cuts could also be used to propose inexactlevel methods [12] to solve nonlinear two-stage stochastic programs (4.65)-(4.66), when primal and dualsecond stage problems are solved approximately (inexactly).It would also be interesting to test ISMD and the aforementioned inexact level methods on several relevantinstances of nonlinear two-stage stochastic programs.
Acknowledgments
The author’s research was partially supported by an FGV grant, CNPq grant 311289/2016-9, and FAPERJgrant E-26/201.599/2014. The author would like to thank Alberto Seeger for helpful discussions.
200 400 600 800 1000 1200 1400 1600 1800 2000
Iteration
SMDISMD 1ISMD 2ISMD 3ISMD 4
Time
SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 200 n = 200 Iteration
SMDISMD 1ISMD 2ISMD 3ISMD 4
Time
SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 400 n = 400 Iteration
SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 600 Figure 3.
Top left plot: approximate optimal values of our instance of Problem (5.99) with n = 200 along the iterations of SMD and our variants of ISMD. Top right plot: empiricaldistribution of the solution time in seconds on this instance and for these methods. Middleplots: same as the top plot replacing n = 200 by n = 400. Bottom plot: same as the topleft plot replacing n = 200 by n = 600. teration SMDISMD 1ISMD 2ISMD 3ISMD 4
Iteration SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 200 n = 400 Iteration SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 600 Figure 4.
Top left plot: approximate optimal values of our instance of Problem (5.101)with n = 200 along the iterations of SMD and our variants of ISMD. Top right and bottomplots provide the same graphs for respectively n = 400 and n = 600. References [1] E. D. Andersen and K.D. Andersen.
The MOSEK optimization toolbox for MATLAB manual. Version 7.0 , 2013. .[2] J. Birge and F. Louveaux.
Introduction to Stochastic Programming . Springer-Verlag, New York, 1997.[3] G.B. Dantzig and P.W. Glynn. Parallel processors for planning under uncertainty.
Annals of Operations Research , 22:1–21,1990.[4] V. Guigues. Convergence analysis of sampling-based decomposition methods for risk-averse multistage stochastic convexprograms.
SIAM Journal on Optimization , 26:2468–2494, 2016.[5] V. Guigues. Multistep stochastic mirror descent for risk-averse convex stochastic programs based on extended polyhedralrisk measures.
Mathematical Programming , 163:169–212, 2017.[6] V. Guigues. Inexact cuts in Stochastic Dual Dynamic Programming. arXiv , 2018. https://arxiv.org/abs/1809.02007 .[7] J-B Hiriart-Urruty and C. Lemar´echal.
Convex Analysis and Minimization Algorithms I . Springer-Verlag, 1996.[8] G. Infanger. Monte Carlo (Importance) Sampling within a Benders Decomposition Algorithm for Stochastic Linear Pro-grams.
Annals of Operations Research , 39:69–95, 1992. ime SMDISMD 1ISMD 2ISMD 3ISMD 4
Time
30 35 40 45 50 550.10.20.30.40.50.60.70.80.91
SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 200 n = 200 Time
SMDISMD 1ISMD 2ISMD 3ISMD 4
80 90 100 110 120 130 140 150 160 170
Time
SMDISMD 1ISMD 2ISMD 3ISMD 4 n = 400 n = 400 Figure 5.
Top left: empirical distribution of the total solution time to solve Problem(5.101) for SMD and and our four variants of ISMD for the instance with n = 200. Rightplot: empirical distribution of the time required for Mosek to solve all subproblems for thatinstance. Bottom plots: same outputs for n = 400. [9] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Available onarXiv at http://arxiv.org/abs/1401.1792 , 2010.[10] G. Lan, A. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method.
Math.Program. , 134:425–458, 2012.[11] G. Lan and Z. Zhou. Dynamic stochastic approximation for multi-stage stochastic optimization. arXiv , 2017.[12] C. Lemar´echal, A. Nemirovski, and Y. Nesterov. New variants of bundle methods.
Mathematical Programming , 69:111–148,1995.[13] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.
SIAM J. Optim. , 19:1574–1609, 2009.[14] M.V.F. Pereira and L.M.V.G Pinto. Multi-stage stochastic optimization applied to energy planning.
Math. Program. ,52:359–375, 1991.[15] B.T. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Contr. and Optim. , 30:838–855, 1992.[16] R. T. Rockafellar and R. J-B Wets.
Variational Analysis . Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, 1997.
17] A. Ruszczy´nski. A multicut regularized decomposition method for minimizing a sum of polyhedral functions.
MathematicalProgramming , 35:309–333, 1986.[18] A. Shapiro, D. Dentcheva, and A. Ruszczy´nski.
Lectures on Stochastic Programming: Modeling and Theory . SIAM,Philadelphia, 2009.[19] H. Yu and J. Neely. On the Convergence Time of the Drift-Plus-Penalty Algorithm for Strongly Convex Programs.
Availableat http: // arxiv. org/ abs/ 1503. 06235 , 2015. ppendix λ = 1, n = 1, α (¯ x ) = 574 λ = 100, n = 1, α (¯ x ) = 637 λ = 1, n = 10, α (¯ x ) = 610 λ = 1, n = 100, α (¯ x ) = 2 303 λ = 1, n = 1000, α (¯ x ) = 23 149 λ = 100, n = 1000, α (¯ x ) = 23 988 λ = 100, n = 100, α (¯ x ) = 2 839 λ = 100, n = 10, α (¯ x ) = 526 Figure 6.
Dual function θ ¯ x of problem (3.57) for some ¯ x randomly drawn in ball { x ∈ R n : (cid:107) x − x (cid:107) ≤ } , S = AA T + λI n for some random matrix A with random entries in[ − , n, λ ). The dual iterates are represented by reddiamonds. η η η η λ = 1, n = 1, α (¯ x ) = 574 λ = 100, n = 1, α (¯ x ) = 637 η η η η λ = 1, n = 10, α (¯ x ) = 610 λ = 1, n = 100, α (¯ x ) = 2 303 η η η η λ = 1, n = 1000, α (¯ x ) = 23 149 λ = 100, n = 1000, α (¯ x ) = 23 988 η η η η λ = 100, n = 100, α (¯ x ) = 2 839 λ = 100, n = 10, α (¯ x ) = 526 Figure 7.
Plots of η ( ε k , ¯ x ) and η ( ε k , ¯ x ) as a function of iteration k where ε k is theduality gap at iteration k for problem (3.57) for some ¯ x randomly drawn in ball { x ∈ R n : (cid:107) x − x (cid:107) ≤ } , S = AA T + λI n for some random matrix A with random entries in [ − , n, λ ). SMD 1
Iteration number [1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) Iteration number [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) Iteration number [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , N ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) I max ISMD 2
Iteration number [1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) Iteration number [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , N ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) I max ISMD 3
Iteration number [1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) Iteration number [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , N ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) I max ISMD 4
Iteration number [1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ] [ (cid:100) . N (cid:101) + 1 , N ]IP solver maximalnumber of iterations (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) (cid:100) . I max (cid:101) I max Table 5.
Maximal number of iterations for Mosek interior point solver used to solve secondstage problems as a function of the iteration number i = 1 , . . . , N , of ISMD and the maximalnumber of iterations I max allowed for Mosek solver to solve subproblems with SMD. In thistable, (cid:100) x (cid:101) is the smallest integer larger than or equal to x . For problem (5.99)-(5.100)and n = 200 , ,
600 and problem (5.101)-(5.102) and n = 200, we take I max = 15, forproblem (5.101)-(5.102) and n = 400 we take I max = 25, and for problem (5.101)-(5.102)and n = 600 we take I max = 28. For instance for ISMD 1, N = 2000, and problem (5.99)-(5.100), for iterations [ (cid:100) . N (cid:101) + 1 , (cid:100) . N (cid:101) ], i.e., for iterations 0 . × , . . . , . × , . . . , (cid:100) . I max (cid:101) = 8.= 8.