Difference of convex algorithms for bilevel programs with applications in hyperparameter selection
aa r X i v : . [ m a t h . O C ] F e b Noname manuscript No. (will be inserted by the editor)
Difference of convex algorithms for bilevel programswith applications in hyperparameter selection
Jane J. Ye · Xiaoming Yuan · ShangzhiZeng · Jin Zhang
Received: date / Accepted: date
Abstract
In this paper, we present difference of convex algorithms for solvingbilevel programs in which the upper level objective functions are difference ofconvex functions, and the lower level programs are fully convex. This nontrivialclass of bilevel programs provides a powerful modelling framework for dealingwith applications arising from hyperparameter selection in machine learning.Thanks to the full convexity of the lower level program, the value function ofthe lower level program turns out to be convex and hence the bilevel programcan be reformulated as a difference of convex bilevel program. We propose twoalgorithms for solving the reformulated difference of convex program and showtheir convergence under very mild assumptions. Finally we conduct numericalexperiments to a bilevel model of support vector machine classification.
This paper is dedicated to the memory of Olvi L. Mangasarian.The research of the first author was partially supported by NSERC. The second authorwas supported by a General Research Fund from Hong Kong Research Grants Council. Thelast author was supported by NSFC No. 11971220 and Guangdong Basic and Applied BasicResearch Foundation 2019A1515011152.Jane J. YeDepartment of Mathematics and Statistics, University of Victoria, Canada.E-mail: [email protected] YuanDepartment of Mathematics, The University of Hong Kong, Hong Kong SAR, China.E-mail: [email protected] ZengDepartment of Mathematics, The University of Hong Kong, Hong Kong SAR, China.E-mail: [email protected] ZhangDepartment of Mathematics, Southern University of Science and Technology, National Cen-ter for Applied Mathematics Shenzhen, Shenzhen, China.E-mail: [email protected] Jane J. Ye et al.
Keywords
Bilevel program · difference of convex algorithm · hyperparameterselection Mathematics Subject Classification (2010) · Bilevel programs are a class of hierarchical optimization problems which haveconstraints containing a lower-level optimization problem parameterized byupper-level variables. Bilevel programs capture a wide range of importantapplications in various fields including Stackelberg games and moral hazardproblems in economics ([37,26]), hyperparameter selection and meta learningin machine learning ([18,19,20,14,23,24,27,28,31]). More applications can befound in the monographs [3,10,13,36], the survey on bilevel optimization [9,12] and the references within.In this paper, we develop some numerical algorithms for solving the fol-lowing difference of convex (DC) bilevel program:(DCBP) min F ( x, y ) := F ( x, y ) − F ( x, y ) s.t. x ∈ X, y ∈ S ( x ) , with S ( x ) being the set of optimal solutions of the lower level problem,( P x ) : min y ∈ Y f ( x, y ) s.t. g ( x, y ) ≤ , where X ⊆ R n and Y ⊆ R m are nonempty closed sets, g := ( g , . . . , g l ), allfunctions g i : R n × R m → R , i = 1 , . . . , l are convex on an open convex setcontaining the set X × Y , and the functions F , F , f : R n × R m → R areconvex on an open convex set containing the set C := { ( x, y ) ∈ X × Y : g ( x, y ) ≤ } . To ensure the bilevel program is well-defined, we assume that S ( x ) = ∅ for all x ∈ X . Moreover we assume that for all x in an open convex set O ⊇ X , thefeasible region for the lower level program F ( x ) := { y ∈ Y : g ( x, y ) ≤ } isnonempty and the lower level objective function f ( x, y ) is bounded below on F ( x ). Note that all of our results applied to the case where there are also someupper level constraint G i ( x, y ) ≤ , i = 1 , . . . , k with each defining function G i a difference of convex function. For simplicity of notations, in order toconcentrate on the main idea, we choose not to include them here.Although the objective function in the DC bilevel program we considermust be a DC function, this setting is general enough to capture many casesof practical interests. In particular any lower C function (i.e., a functionwhich can be locally written as a supremum of a family of C functions) and C function (i.e., a differentiable function whose gradient is locally Lipschitzcontinuous) are DC functions and the class of DC functions is closed under ifference of convex algorithms for bilevel programs 3 many operations encountered frequently in optimization; see, e.g., [17,40]. Inthe lower level program, we assume all functions are fully convex, i.e., convex inboth variables x and y . However as pointed out by [21, Example 1 and Section5], using some suitable reformulations one may turn a non-fully convex lowerlevel program into a fully convex one. Also as demonstrated in this paper, thebilevel model for hyperparameter selection problem can be reformulated as abilevel program where the lower level is fully convex.Solving bilevel programs numerically is extremely challenging. It is knownthat even when all defining functions are linear, the computational complexityis already NP-hard [4]. If all defining functions are smooth and the lower levelprogram is convex with respect to the lower level variable, the first order ap-proach was popularly used to replace the lower level problem by its first orderoptimality condition and to solve the resulting problem as the mathematicalprogram with equilibrium constraints (MPEC); see e.g. [3,1,12,25,33]. 
Thefirst order approach may be problematic since it may not provide an equiv-alent reformulation to the original bilevel program if only local (not global)optimal solutions are considered; see [11]. Moreover even in the case of a fullyconvex lower level program, [21, Example 1] shows that it is still possible thata local optimal solution of the corresponding MPEC does not correspond to alocal optimal solution of the original bilevel program. Recently some numer-ical algorithms have been introduced for solving bilevel programs where thelower level problem is not necessarily convex in the lower level variable; seee.g., [22,29,30]. However these approaches have limitations in the numbers ofvariables in the bilevel program. In most of literature on numerical algorithmsfor solving bilevel programs, smoothness of all defining functions are assumed.In some special cases, non-smoothness can be dealt with by introducing auxil-iary variables and constraints to reformulate a nonsmooth lower level programas a smooth constrained lower level program. But using such an approach thenumbers of variables or constraints would increase.Our research on the DC bilevel program is motivated by a number of im-portant applications in model selection and hyperparameter learning. Recentlyin the statistical learning, the regularization parameters has been successfullyused, e.g., in the least absolute shrinkage and selection operator (lasso) methodfor regression and support vector machines (SVMs) for classification. Howeverthe regularization parameters have to be set a priori and the choice of theseparameters dramatically affects results on the model selection. The most com-monly used method for selecting these parameters is the so-called T -fold crossvalidation. By T -fold cross validation, a data set Ω is partitioned into T pair-wise disjoint subsets called the validation sets Ω tval , t = 1 , . . . , T . For each fold t = 1 , . . . , T , a subset of Ω denoted by Ω ttrn := Ω \ Ω tval is used for trainingand the validation set Ω tval is used for testing the result. Take the lasso prob-lem for example, suppose the data set Ω = { ( a j , b j ) } ℓj =1 where a j ∈ R n and b j ∈ R . For each λ > t = 1 , . . . , T , the following lasso problem can Jane J. Ye et al. be solved. ( P tλ ) min w ∈ R n X j ∈ Ω ttrn ( a Tj w − b j ) + λ k w k . The desirable penalty parameter ¯ λ can be selected by minimizing the crossvalidation error based on the validation set: Θ ( w , . . . , w T ) := 1 T T X t =1 | Ω tval | X j ∈ Ω tval ( a Tj w tλ − b j ) , where | M | denotes the number of elements in set M and w tλ denotes a solutionto the lasso problem ( P tλ ). In fact, the hyperparameter selection for the lassoproblem can be considered as the following bilevel program with T lower levelprograms:min Θ ( w , . . . , w T ) s.t. λ > , w t ∈ argmin w X j ∈ Ω ttrn ( a Tj w − b j ) + λ k w k , t = 1 , , . . . , T. It is easy to see that we can reformulate the above problem equivalently asthe following bilevel program with a single lower level programmin Θ ( w , . . . , w T ) s.t. λ > , ( w , . . . , w T ) ∈ argmin T X t =1 X j ∈ Ω ttrn ( a Tj w t − b j ) λ + k w t k . Moreover using the fact that a function in the form φ ( x , λ ) = k x k /λ with λ > T -fold cross vali-dation method for selecting hyperparameters usually implements a grid search :training T models at each point of a discretized parameter space in order tofind an approximate optimal parameter. This method has many drawbacksand limitations. 
In particular its computational complexity scales exponen-tially with the number of hyperparameters and the number of grid points foreach hyperparameter and hence is not practical for problem requiring severalhyperparameters. To deal with limitations of grid search, introduced first in[5] in 2006, the bilevel program has been used to model hyperparameter se-lection problems in [5,18,19,20,28,27]. We now consider the bilevel model forsupport vector classification using cross validation studied in [18,19]. Given atraining set Ω containing ℓ labelled data pairs { ( a j , b j ) } ℓj =1 where a j ∈ R n , ifference of convex algorithms for bilevel programs 5 and the labels b j = ± Θ ( w , . . . , w T , c ) s.t. λ lb ≤ λ ≤ λ ub , ¯ w lb ≤ ¯ w ≤ ¯ w ub , and for t = 1 , . . . , T :( w t , c t ) ∈ argmin − ¯ w ≤ w ≤ ¯ w c ∈ R λ k w k + X j ∈ Ω ttrn max(1 − b j ( a Tj w − c ) , , (1)where c ∈ R T is the vector with c t as the t th component, and Θ ( w , . . . , w T , c )is some measure of validation accuracy over all folds, typically the averagenumber of misclassifications. Here λ lb , λ ub are given nonnegative numbers and¯ w lb , ¯ w ub are given vectors in R n . By changing the variable λ to µ := λ we canequivalently rewrite the bilevel model for problem (1) so that there is only asingle lower level problem and it is fully convex; see the details in (32).This paper is motivated by an interesting fact that, under our problemsetting, the value function of the lower level in (DCBP) defined by v ( x ) := inf y ∈ Y { f ( x, y ) s.t. g ( x, y ) ≤ } is convex and locally Lipschitz continuous on X . We take full advantage ofthis convexity and use the value function approach first proposed in [32] for anumerical purpose and further used to study optimality conditions in [44] toreformulate (DCBP) as the following DC program:(VP) min ( x,y ) ∈ C F ( x, y ) − F ( x, y ) s.t. f ( x, y ) − v ( x ) ≤ . Unfortunately, due to the value function constraint, (VP) violates the usualconstraint qualification such as the nonsmooth Mangasarian Fromovitz con-straint qualification (MFCQ) at each feasible point, see [44, Proposition 3.2]for the smooth case and Proposition 9 for the nonsmooth case. It is well-known that convergence of the difference of convex algorithm (DCA) is onlyguaranteed under constraint qualifications such as the extended MFCQ, whichis MFCQ extended to infeasible points; see, e.g., [39]. To deal with this issue,we consider the following approximate bilevel program(VP) ǫ min ( x,y ) ∈ C F ( x, y ) − F ( x, y ) s.t. f ( x, y ) − v ( x ) ≤ ǫ, for some ǫ >
0. Such a relaxation strategy has been used for example in[22] based with the reasoning that in numerical algorithms one usually obtainan inexact optimal solution anyway and the solutions of (VP) ǫ approximate asolution of the original bilevel program (VP) as ǫ approaches zero. In this paper Jane J. Ye et al. we will show that EMFCQ holds for problem (VP) ǫ when ǫ > ǫ with ǫ ≥
0. When ǫ >
0, theconvergence of our algorithm is guaranteed and when ǫ = 0, the convergenceis not guaranteed but it could still converge if the penalty parameter sequenceis bounded.Using DCA approach, at each iteration point ( x k , y k ), one linearises theconcave part of the function, i.e., the functions F ( x, y ) , v ( x ) by using anelement of the subdifferentials ∂F ( x k , y k ) , ∂v ( x k ) and solve a resulting convexsubproblem. The value function is an implicit function. How do we obtainan element of the subgradient ∂v ( x k )? At current iteration x k , assuming wecan solve the lower level problem ( P x k ) with a global minimizer ˜ y k and acorresponding Karush-Kuhn-Tucker (KKT) multiplier denoted by γ k . Supposethat the following partial derivative formula holds: ∂f ( x, y ) = ∂ x f ( x, y ) × ∂ y f ( x, y ) , ∂g i ( x, y ) = ∂ x g i ( x, y ) × ∂ y g i ( x, y ) (2)at ( x, y ) = ( x k , ˜ y k ). Then since by convex analysis ∂ x f ( x k , ˜ y k ) + l X i =1 γ ki ∂ x g i ( x k , ˜ y k ) ⊆ ∂v ( x k ) , we can select an element of ∂v ( x k ) from the set ∂ x f ( x k , ˜ y k )+ P li =1 γ ki ∂ x g i ( x k , ˜ y k )and use it to linearize the value function. We then solve the resulting convexsubproblem approximately to obtain a new iterate ( x k +1 , y k +1 ). Thanks torecent developments in large-scale convex programming, using this approachwe can deal with a large scale DC bilevel program.Now we summarize our contributions as follows. – We propose two new algorithms for solving DC program. These algorithmshave modified the classical DCA in two ways. First, we add an proximalterm in each convex subproblem so the the objective function is stronglyconvex and at each iteration point, only an approximate solution for theconvex subproblem is solved. Second, our penalty parameter update issimplier. – We have laid down all theoretical foundations from convex analysis that arerequired for our algorithms to work. In particular we have demonstratedthat under the minimal assumptions that we specify for problem (DCBP),the value function is convex and locally Lipschitz on set X automatically. – Using the two new algorithms for solving DC program, we propose twocorresponding algorithms to solve problem (DCBP). Our algorithms holdunder very mild and natural assumptions. In particular we allow all definingfunctions to be nonsmooth and we do not require any constraint qualifica-tion to hold for the lower level program. The main assumptions we need areonly the partial derivative formula (2) which holds under many practicalsituations (see Proposition 2 for sufficient conditions) and the existence ofan KKT multiplier for the lower level program under each iteration. ifference of convex algorithms for bilevel programs 7 – Taking advantage of large scale convex programming, our algorithm canhandle high dimensional hyperparameter selection problems. To test ef-fectiveness of our algorithm, we have tested it in the bilevel model for SVclassification (1). Our results compare favourably with the MPEC approach[18,19,20].This paper is organized as follows. In Section 2 we give preliminaries fromconvex analysis. We propose two modified DCA and study their convergencefor a class of general DC program in Section 3. In Section 4, we derive explicitconditions for the bilevel program under which the algorithms introduced inSection 3 can be applied. Numerical experiments on the bilevel model for SVclassification is conducted on Section 5. Section 6 concludes the paper.
Let us recall some notations from convex analysis and variational analysis,which will be needed thereafter. Let ϕ ( x ) : R n → [ −∞ , ∞ ] be a convex func-tion, and let ¯ x be a point where ϕ be finite. Recall that the subdifferential of ϕ at ¯ x is a closed convex set defined by ∂ϕ (¯ x ) := { ξ ∈ R n | ϕ ( x ) ≥ ϕ (¯ x ) + h ξ, x − ¯ x i , ∀ x } . For a function ϕ : R n × R m → [ −∞ , + ∞ ], we denote its partial differentialof ϕ with respect to x and y by ∂ x ϕ ( x, y ) and ∂ y ϕ ( x, y ) respectively. Let Σ be a convex subset in R n and ¯ x ∈ Σ . The normal cone to Σ at ¯ x is a closedconvex set defined by N Σ (¯ x ) := { ξ ∈ R n |h ξ, x − ¯ x i ≥ , ∀ x ∈ Σ } . When¯ x / ∈ Σ , we let N Σ (¯ x ) = ∅ . Let δ Σ ( x ) denote the indicator function of set Σ at x , i.e., δ Σ ( x ) := 0 if x ∈ Σ and + ∞ if x / ∈ Σ . Then by definition, it iseasy to see that ∂δ Σ (¯ x ) = N Σ (¯ x ) . It is well-known that when the function ϕ : R n → [ −∞ , ∞ ] is Lipschitz continuous near a point ¯ x where ϕ (¯ x ) is finite,the Clarke generalized gradient denoted by ∂ c ϕ (¯ x ) coincides with the convexsubdifferential; see [7].Recall that a convex function ϕ ( x ) : R n → [ −∞ , ∞ ] is proper if ϕ ( x ) < + ∞ for at least one x and ϕ ( x ) > −∞ for every x . The following calculus ruleswill be useful. Proposition 1 (Sum rule) [34, Theorem 23.8][7, Corollary 1 to Theorem2.9.8] Let ϕ , . . . , ϕ m : R n → [ −∞ , ∞ ] be proper convex functions. Supposethat ¯ x is a point where ϕ (¯ x ) + · · · + ϕ m (¯ x ) is finite. Then ∂ϕ (¯ x ) + · · · + ∂ϕ m (¯ x ) ⊆ ∂ ( ϕ + · · · + ϕ m )(¯ x ) . Moreover if all except one function among ϕ i ( i = 1 , · · · , m ) is Lipschitz con-tinuous around point ¯ x , then the inclusion (1) becomes an equality. Proposition 2 (Partial subdifferentiation)
Let ϕ : R n × R m → [ −∞ , + ∞ ] be a convex function and let (¯ x, ¯ y ) be a point where ϕ is finite. Then ∂ϕ (¯ x, ¯ y ) ⊆ ∂ x ϕ (¯ x, ¯ y ) × ∂ y ϕ (¯ x, ¯ y ) . (3) The inclusion (3) becomes an equality under one of the following conditions.
Jane J. Ye et al. (a) For every ξ ∈ ∂ x ϕ (¯ x, ¯ y ) , it holds that ϕ ( x, y ) − ϕ (¯ x, y ) ≥ h ξ, x − ¯ x i , ∀ ( x, y ) ∈ R n × R m . (b) ϕ ( x, y ) = ϕ ( x ) + ϕ ( y ) .(c) For any ε > , there is δ > such thateither ∂ x ϕ (¯ x, ¯ y ) ⊆ ∂ x ϕ (¯ x, y ) + εB R n ∀ y ∈ B (¯ y ; δ ) (4) or ∂ y ϕ (¯ x, ¯ y ) ⊆ ∂ y ϕ ( x, ¯ y ) + εB R m ∀ x ∈ B (¯ x ; δ ) , (5) where B (¯ x ; δ ) denotes the open ball centered at ¯ x with radius equal to δ and B R n denotes the open unit ball centered at the origin in R n .(d) ϕ ( x, y ) is continuously differentiable respect to one of the variable x or y at (¯ x, ¯ y ) .Moreover ( b ) = ⇒ ( a ) , ( d ) = ⇒ ( c ) = ⇒ ( a ) . Proof
The inclusion (3) and its reverse under (a) follow directly from def-initions of the convex subdifferential and the partial subdifferential. When ϕ ( x, y ) = ϕ ( x ) + ϕ ( y ), we have that ∂ϕ ( x, y ) = ∂ϕ ( x ) × { } + { } × ∂ϕ ( y ).Hence obviously (b) implies (a). The implication of (d) to (c) is obvious. Nowsuppose that (4) holds. Let ξ ∈ ∂ x ϕ (¯ x, ¯ y ). Then according to (4), for any ε > δ > ξ = η + εe , where e ∈ B R n , and h ξ, x − ¯ x i ≤ ϕ ( x, y ) − ϕ (¯ x, y ) + ε k x − ¯ x k ∀ y ∈ B (¯ y ; δ ) . Thanks to the convexity of ϕ , using the proof technique of [8, Corollary 2.6(c)], we can easily show that (a) holds. The proof for the case where (5) holdsis similar and thus omitted. ⊓⊔ Proposition 3 [34, Theorem 10.4] Let ϕ : R n → [ −∞ , ∞ ] be a proper convexfunction. Then ϕ is Lipschitz continuous near any point ¯ x ∈ int(dom ϕ ) , where dom ϕ := { x | ϕ ( x ) < + ∞} is the effective domain of ϕ and int( M ) is theinterior of set M . In order to solve the (relaxed) value function reformulation of problem DCBP,in this section we propose numerical algorithms to solve the following differenceof convex program:(DC) min z ∈ Σ f ( z ) := g ( z ) − h ( z ) s.t. f ( z ) := g ( z ) − h ( z ) ≤ , where Σ is a closed convex subset of R d and g ( z ) , h ( z ) , g ( z ) , h ( z ) : Σ → R are convex functions. Although the results in this section can be generalizedto the case where there are more than one inequality in a straight-forwardedmanner, to simplify the notation and concentrate on the main idea we assumethere is only one inequality constraint in problem (DC). Our algorithms aremodifications of the classical DCA (see [39]).Before we introduce our algorithms and conduct the convergence analyses,we first brief some solution quality characterizations for problem (DC). ifference of convex algorithms for bilevel programs 9 Definition 1
Let ¯ z be a feasible solution of problem (DC). We say that ¯ z isan KKT point of problem (DC) if there exists a multiplier λ ≥ ∈ ∂g (¯ z ) − ∂h (¯ z ) + λ ( ∂g (¯ z ) − ∂h (¯ z )) + N Σ (¯ z ) , ( g (¯ z ) − h (¯ z )) λ = 0 . Definition 2
Let ¯ z be a feasible point of problem (DC). We say that thenonzero abnormal multiplier constraint qualification (NNAMCQ) holds at ¯ z for problem (DC) if either f (¯ z ) < f (¯ z ) = 0 but0 ∂g (¯ z ) − ∂h (¯ z ) + N Σ (¯ z ) . (6)Let ¯ z ∈ Σ , we say that the extended no nonzero abnormal multiplier constraintqualification (ENNAMCQ) holds at ¯ z for problem (DC) if either f (¯ z ) < f (¯ z ) ≥ g, h which are Lipschitz around point ¯ z , we have ∂ c ( g (¯ z ) − h (¯ z )) ⊆ ∂ c g (¯ z ) − ∂ c h (¯ z ) = ∂g (¯ z ) − ∂h (¯ z ). Proposition 4
Let ¯ z be a local solution of problem (DC). If NNAMCQ holdsat ¯ z and all functions g , g , h , h are Lipschitz around point ¯ z , then ¯ z is aKKT point of problem (DC). z k ∈ Σ with k = 0 , , . . . , we select a subdifferential ξ ki ∈ ∂h i ( z k ), for i = 0 ,
1. Then we solve the following subproblem approxi-mately and select z k +1 as an approximate minimizer:min z ∈ Σ ˜ ϕ k ( z ) := g ( z ) − h ( z k ) − h ξ k , z − z k i + β k max { g ( z ) − h ( z k ) − h ξ k , z − z k i , } + ρ k z − z k k , (7)where ρ is a given positive constant and β k represents the adaptive penaltyparameter. Our scheme is similar to that of DCA2 in [39] but different in thatthe subproblem (7) has a strongly convex objective function, the subproblemis only solved approximately, and a simplier penalty parameter update is used.In particular, denote t k +1 := max { g ( z k +1 ) − h ( z k ) − h ξ k , z k +1 − z k i , } . (8) We propose the following two inexact conditions for choosing z k +1 as an ap-proximate solution to (7):dist(0 , ∂ ˜ ϕ k ( z k +1 )+ N Σ ( z k +1 )) ≤ ζ k , for some ζ k ≥ ∞ X k =0 ζ k < ∞ , (9)and dist(0 , ∂ ˜ ϕ k ( z k +1 ) + N Σ ( z k +1 )) ≤ √ ρ k z k − z k − k , (10)where dist( x, M ) denotes the distance from a point x to set M .Using above constructions, we are ready to propose the inexact proximalDCA (iP-DCA) in Algorithm 1. Algorithm 1 iP-DCA
1: Take an initial point z ∈ Σ ; δ β >
0; an initial penalty parameter β > tol > for k = 0 , , . . . do
1. Compute ξ ki ∈ ∂h i ( z k ), i = 0 , z k +1 of (7) satisfying (9) or (10).3. Stopping test. Compute t k +1 in (8). Stop if max {k z k +1 − z k k , t k +1 } < tol .4. Penalty parameter update. Set β k +1 = ( β k + δ β , if max { β k , /t k +1 } < k z k +1 − z k k − ,β k , otherwies .
5. Set k := k + 1.3: end for In DCA2 of [39], the subproblem (7) was solved as a constrained optimiza-tion problem and a Lagrange multiplier is used to update the penalty parame-ter. Since our penalty parameter update rule does not involve any multipliers,it is easier to implement. In the rest of this section we show that the proposedalgorithm converges. Let us start with the following lemma which provides asufficient decrease of the merit function of (DC) defined by ϕ k ( z ) := g ( z ) − h ( z ) + β k max { g ( z ) − h ( z ) , } . Lemma 1
Let { z k } be a sequence of iterates generated by iP-DCA as definedin Algorithm . If the inexact criterion (9) or (10) is applied, then z k satisfies ϕ k ( z k ) ≥ ϕ k ( z k +1 ) + ρ k z k +1 − z k k − ρ ζ k , or ϕ k ( z k ) ≥ ϕ k ( z k +1 ) + ρ k z k +1 − z k k − ρ k z k − z k − k , where ζ k ≥ satisfying P ∞ k =0 ζ k < ∞ respectively. ifference of convex algorithms for bilevel programs 11 Proof
Since z k +1 is an approximation solution to problem (7) with inexactcriterion (9) or (10), there exists a vector e k such that e k ∈ ∂ ˜ ϕ k ( z k +1 ) + N Σ ( z k +1 ) ⊆ ∂ ( ˜ ϕ k + δ Σ )( z k +1 ) and k e k k ≤ ζ k or k e k k ≤ √ ρ k z k − z k − k , (11)respectively. As ˜ ϕ k is strongly convex with modulus ρ , Σ is a closed convexset and z k ∈ Σ , we have˜ ϕ k ( z k ) ≥ ˜ ϕ k ( z k +1 ) + h e k , z k +1 − z k i + ρ k z k +1 − z k k ≥ ˜ ϕ k ( z k +1 ) − ρ k e k k − ρ k z k +1 − z k k + ρ k z k +1 − z k k = ˜ ϕ k ( z k +1 ) − ρ k e k k . (12)Next, by the convexity of h i ( z ) and ξ ki ∈ ∂h i ( z k ), i = 0 ,
1, there holds that h i ( z k +1 ) ≥ h i ( z k ) + h ξ ki , z k +1 − z k i , i = 0 , , and thus ˜ ϕ k ( z k +1 ) ≥ ϕ k ( z k +1 ) + ρ k z k +1 − z k k . Combined with (12), we have ϕ k ( z k ) = ˜ ϕ k ( z k ) ≥ ˜ ϕ k ( z k +1 ) − ρ k e k k ≥ ϕ k ( z k +1 ) − ρ k e k k + ρ k z k +1 − z k k . The conclusion follows immediately from (11). ⊓⊔ The following theorem is the main result of this section. It proves thatany accumulation point of iP-DCA is an KKT point as long as the penaltyparameter sequence { β k } is bounded. Theorem 1
Suppose f is bounded below on Σ and the sequences { z k } and { β k } generated by iP-DCA are bounded. Moreover suppose functions g , g , h , h are locally Lipschitz on set Σ . Then every accumulation point of { z k } is anKKT point of problem (DC).Proof Since { β k } is bounded, there exists some iteration index k such that β k = β k , ∀ k ≥ k , and thus ϕ k ( z ) = ϕ k ( z ) for all k ≥ k . As f is boundedbelow, ϕ k ( z ) is bounded below for all k ≥ k . Then, by the inequality (11)and (11) obtained in Lemma 1, we have ∞ X k =1 k z k +1 − z k k < + ∞ , lim k →∞ k z k +1 − z k k = 0 , and thus β k < k z k +1 − z k k − always holds when k is large enough. Accordingto the update strategy of β k in iP-DCA, there exists some iteration index k such that t k +1 := max { g ( z k +1 ) − h ( z k ) −h ξ k , z k +1 − z k i , } ≤ k z k +1 − z k k ∀ k ≥ k , and thus t k →
0. Since z k +1 is an approximate solution to problem (7) andinexact criterion (9) or (10) holds, there exists a vector e k such that e k ∈ ∂ ˜ ϕ k ( z k +1 ) + N Σ ( z k +1 ) and (11) holds. According to the sum rule in Proposi-tion 1 and the subdifferential calculus rules for the pointwise maximum (see e.g.[7, Proposition 2.3.12]), there exist ˜ λ k +1 ∈ [0 ,
1] and η k +1 i ∈ ∂g i ( z k +1 )( i = 0 , e k ∈ η k +10 − ξ k + β k ˜ λ k +1 ( η k +11 − ξ k ) + ρ ( z k +1 − z k ) + N Σ ( z k +1 ) , (13) g ( z k +1 ) − h ( z k ) − h ξ k , z k +1 − z k i ≤ t k +1 , (14)˜ λ k +1 ( g ( z k +1 ) − h ( z k ) − h ξ k , z k +1 − z k i − t k +1 ) = 0 , (15) t k +1 (1 − ˜ λ k +1 ) = 0 , t k +1 ≥ . (16)Since { β k ˜ λ k +1 } is bounded, we may suppose that ˜ z and ˜ λ are accumulationpoints of { z k } and { β k ˜ λ k +1 } respectively. Taking subsequences if necessary,without loss of generality we may assume that z k → ˜ z ∈ Σ and β k ˜ λ k +1 → ˜ λ .Since g i ( z ), h i ( z ), i = 0 , z , ∂g i ( z ), ∂h i ( z ), i = 0 , N Σ ( z ) are outer semicontinuous. Now passing onto the limit as k → ∞ in (13)-(15), as e k → ζ k → k z k +1 − z k k → t k →
0, we obtain that ˜ z is a KKT solution of problem (DC). ⊓⊔ Notice that the boundedness of the penalty parameters is needed for anaccumulation point to be an KKT point. The following proposition providesa sufficient condition for the boundedness of the penalty parameters sequence { β k } . Proposition 5
Suppose that the sequence { z k } generated by iP-DCA is bounded.Moreover suppose functions g , g , h , h are Lipschitz around at any accu-mulation point of { z k } . Assume that ENNAMCQ holds at any accumulationpoints of the sequence { z k } . Then the sequence { β k } must be bounded.Proof The proof is inspired by [39, Theorem 3.1]. To the contrary, supposethat β k → + ∞ as k → ∞ . Then there exist infinitely many indices j such that β k j < k z k j +1 − z k j k − and t k j +1 > k z k j +1 − z k j k , and thus lim j →∞ k z k j +1 − z k j k = 0 , t k j +1 > , ∀ j. From (16), since t k j +1 > j , we have ˜ λ k j +1 = 1 for all j and thus λ k j +1 := β k j ˜ λ k j +1 → + ∞ as j → ∞ . Taking a further subsequence, if neces-sary, we can assume that z k j → ˜ z ∈ Σ as j → ∞ . If g (˜ z ) − h (˜ z ) <
0, then as g , h are continuous at ˜ z , { ξ k j } is bounded, and lim j →∞ k z k j +1 − z k j k = 0,when j is sufficiently large, one has g ( z k j +1 ) − h ( z k j ) −h ξ k j , z k j +1 − z k j i < , which contradicts to t k j +1 := max { g ( z k j +1 ) − h ( z k j ) −h ξ k j , z k j +1 − z k j i , } > j . Thus, g (˜ z ) − h (˜ z ) ≥
0. From (13), we have e k j ∈ ∂g ( z k j +1 ) − ∂h ( z k j ) + λ k j +1 ∂g ( z k j +1 ) − λ k j +1 ∂h ( z k j )+ ρ ( z k j +1 − z k j ) + N Σ ( z k j +1 ) , ifference of convex algorithms for bilevel programs 13 where λ k +1 := β k ˜ λ k +1 . Dividing both sides of this equality by λ k j +1 , andpassing onto the limit as j → ∞ , we have 0 ∈ ∂g (˜ z ) − ∂h (˜ z ) + N Σ (˜ z ) , whichcontradicts ENNAMCQ. ⊓⊔ g is L -smoothwhich means that ∇ g ( z ) is Lipschitz continuous with constant L on Σ . Thissetting motivates a very simple linearization approach in which we not onlylinearize the concave part but also the convex part of the DC structure. Givena current iteration z k ∈ Σ with k = 0 , , . . . , we select a subdifferential ξ ki ∈ ∂h i ( z k ), for i = 0 ,
1. Then we solve the following subproblem approximately.min z ∈ Σ ˆ ϕ k ( z ) := g ( z ) − h ( z k ) − h ξ k , z − z k i + ρ k k z − z k k + β k max { g ( z k ) + h∇ g ( z k ) , z − z k i − h ( z k ) − h ξ k , z − z k i , } , (17)where ρ k and β k are the adaptive proximal and penalty parameters respec-tively. Choose z k +1 as an approximate minimizer of the convex subproblem(17) satisfying one of the following two inexact criteriadist(0 , ∂ ˆ ϕ k ( z k +1 ) + N Σ ( z k +1 )) ≤ ζ k , for some ζ k satisfying ∞ X k =0 ζ k < ∞ , (18)and dist(0 , ∂ ˆ ϕ k ( z k +1 ) + N Σ ( z k +1 )) ≤ √ σ k z k − z k − k . (19)This yields the inexact proximal linearized DCA (iPL-DCA), whose exactdescription is given in Algorithm 2.Recall that the merit function of (DC) is defined by ϕ k ( z ) := f ( z ) + β k max { f ( z ) , } . Similar to Lemma 1, we first give following sufficiently de-crease result of iPL-DCA.
Lemma 2
Let { z k } be the sequence of iterates generated by iPL-DCA as de-fined in Algorithm . If the inexact criterion (18) or (19) is applied, then z k satisfies ϕ k ( z k ) ≥ ϕ k ( z k +1 ) + σ k z k +1 − z k k − σ ζ k ,ϕ k ( z k ) ≥ ϕ k ( z k +1 ) + σ k z k +1 − z k k − σ k z k − z k − k , respectively. Algorithm 2 iPL-DCA
1: Take an initial point z ∈ Σ ; δ β > σ >
0, an initial penalty parameter β > , ρ = β L + σ , tol > for k = 0 , , . . . do
1. Compute ξ ki ∈ ∂h i ( z k ), i = 0 , z k +1 of (17) satisfying (18) or (19).3. Stopping test. Compute t k +1 := max { g ( z k ) + h∇ g ( z k ) , z k +1 − z k i − h ( z k ) −h ξ k , z k +1 − z k i , } . Stop if max {k z k +1 − z k k , t k +1 } < tol .4. Penalty parameter update. Set β k +1 = ( β k + δ β , if max { β k , /t k +1 } < k z k +1 − z k k − ,β k , otherwies .ρ k +1 = 12 β k +1 L + σ.
5. Set k := k + 1.3: end for Proof
Since z k +1 is an approximation solution to problem (17) with inexactcriterion (18) or (19), there exists a vector e k such that e k ∈ ∂ ˆ ϕ k ( z k +1 ) + N Σ ( z k +1 )) ⊆ ∂ ( ˜ ϕ k + δ Σ )( z k +1 ) and k e k k ≤ ζ k or k e k k ≤ √ σ k z k − z k − k , (20)respectively. As ˆ ϕ k is strongly convex with modulus ρ k and Σ is a closedconvex set, we haveˆ ϕ k ( z k ) ≥ ˆ ϕ k ( z k +1 ) + h e k , z k +1 − z k i + ρ k k z k +1 − z k k ≥ ˆ ϕ k ( z k +1 ) − σ k e k k − σ k z k +1 − z k k + ρ k k z k +1 − z k k = ˆ ϕ k ( z k +1 ) − σ k e k k + ρ k − σ k z k +1 − z k k . (21)Next, by the convexity of h i ( z ) and ξ ki ∈ ∂h i ( z k ), i = 0 ,
1, we have h i ( z k +1 ) ≥ h i ( z k ) + h ξ ki , z k +1 − z k i , i = 0 , . And since g is L -smooth, we have g ( z k +1 ) ≤ g ( z k ) + h∇ g ( z k ) , z − z k i + L k z k +1 − z k k . Thus, we have ˆ ϕ k ( z k +1 ) ≥ ϕ k ( z k +1 ) + ρ k − β k L k z k +1 − z k k . ifference of convex algorithms for bilevel programs 15 Combined with (21), we have ϕ k ( z k ) = ˆ ϕ k ( z k ) ≥ ˆ ϕ k ( z k +1 ) − σ k e k k + ρ k − σ k z k +1 − z k k ≥ ϕ k ( z k +1 ) − σ k e k k + 2 ρ k − β k L − σ k z k +1 − z k k = ϕ k ( z k +1 ) − σ k e k k + σ k z k +1 − z k k . Then the conclusion follows immediately from (20). ⊓⊔ Similar to Theorem 1 and Proposition 5, by Lemma 2, the following conver-gence results of iPL-DCA can be derived easily. The proofs are purely technicaland thus omitted.
Theorem 2
Suppose f is bounded below and the sequences { z k } and { β k } generated by iPL-DCA are bounded, functions g , h , h are locally Lipschitzon set Σ . Then every accumulation point of { z k } is an KKT point for problem(DC). Proposition 6
Suppose the sequence { z k } generated by iPL-DCA is bounded,functions g , h , h are Lipschitz around at any accumulation point of { z k } ,and ENNAMCQ holds at any accumulation points of the sequence { z k } . Thenthe sequence { β k } is bounded.Remark 1 In fact, if g is further assumed to be differentiable and ∇ g isLipschitz continuous, we can also linearize g in iPL-DCA. The proof of con-vergence is similar. In this section we will show how to solve problem (DCBP) numerically.It is obvious that problem (VP) ǫ is problem (DC) with z := ( x, y ) , f ( x, y ) := F ( x, y ) − F ( x, y ) , f ( x, y ) := f ( x, y ) − v ( x ) − ǫ, Σ = C. According to Proposition 3, since F ( x, y ) , F ( x, y ) , f ( x, y ) are Lipschitz con-tinuous near every point on an open convex set containing C and hence Lips-chitz continuous near every point on C . However our problem (VP) ǫ involvesthe value function which is an extended-value function v ( x ) : X → [ −∞ , ∞ ]defined by v ( x ) := inf y ∈ Y { f ( x, y ) s.t. g ( x, y ) ≤ } , with the convention of v ( x ) = + ∞ if the feasible region F ( x ) is empty. Toapply the proposed DC algorithms, we need to answer the following questions.(a) Is the value function convex and locally Lipschitz on the convex set X andhow to obtain one element from ∂V ( x k ) in terms of problem data?(b) Will the constraint qualification ENNAMCQ hold at any accumulationpoint of the iteration sequence?We now give answers to these questions in the next two subsections. Lemma 3
The value function v ( x ) : X → R is convex and Lipschitz contin-uous around any point in set X . Given ¯ x ∈ X and ¯ y ∈ S (¯ x ) , we have ∂v (¯ x ) = { ξ ∈ R n : ( ξ, ∈ ∂φ (¯ x, ¯ y ) } , where φ ( x, y ) := f ( x, y ) + δ D ( x, y ) , D := { ( x, y ) ∈ O × Y | g ( x, y ) ≤ } , with O being the open set defined in the introduction.Proof First we extend the definition of the value function from any element x ∈ X to the whole space R n as follows: v ( x ) := inf y ∈ R m φ ( x, y ) , ∀ x ∈ R n . It follows that v ( x ) = + ∞ for x
6∈ O . In our problem setting, f is fullyconvex on an open convex set containing the convex set C and hence we canassume without loss of generality that f is fully convex on the convex set D . Therefore the extended-valued function φ ( x, y ) : R n × R m → [ −∞ , ∞ ]is convex. The convexity of the value function v ( x ) = inf y ∈ R m φ ( x, y ) thenfollows from [35, Theorem 1]. Hence the value function restricted on set X isconvex. Next, according to [35, Theorem 24], we have the equation (3). Byassumption stated in the introduction of the paper, the feasible region of thelower level program F ( x ) := { y ∈ Y : g ( x, y ) ≤ } 6 = ∅ and v ( x ) = −∞ for all x in the open set O . Hence v ( x ) : R n → [ −∞ , ∞ ] is proper convex.Since dom v := { x : v ( x ) < + ∞} = { x : F ( x ) = ∅} ⊇ O ⊇ X , we have X ⊆ int(dom v ). The result on Lipschitz continuity of the value function followsfrom Proposition 3. ⊓⊔ By using some sensitivity analysis techniques, an element of the subdif-ferential of the value function v ( x ) can be expressed in terms of Lagrangianmultipliers. In particular, given ¯ y ∈ S (¯ x ), we denote the set of KKT multipliersof the lower-level problem ( P ¯ x ) by KT (¯ x, ¯ y ):= ( γ ∈ R l + (cid:12)(cid:12)(cid:12) ∈ ∂ y f (¯ x, ¯ y ) + l X i =1 γ i ∂ y g i (¯ x, ¯ y ) + N Y (¯ y ) , l X i =1 γ i g i (¯ x, ¯ y ) = 0 ) . Theorem 3
Let ¯ x ∈ X and ¯ y ∈ S (¯ x ) . Then ∂v (¯ x ) ⊇ ( ξ (cid:12)(cid:12)(cid:12) ( ξ, ∈ ∂f (¯ x, ¯ y ) + l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) , γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 ) , (22) ifference of convex algorithms for bilevel programs 17 and the equality holds in (22) provided that N E (¯ x, ¯ y ) = n l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) | γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 o , (23) where E := { ( x, y ) ∈ R n × Y : g ( x, y ) ≤ } .Moreover if the partial derivative formula holds ∂f (¯ x, ¯ y ) = ∂ x f (¯ x, ¯ y ) × ∂ y f (¯ x, ¯ y ) , ∂g i (¯ x, ¯ y ) = ∂ x g i (¯ x, ¯ y ) × ∂ y g i (¯ x, ¯ y ) (24) then [ γ ∈ KT (¯ x, ¯ y ) ∂ x f (¯ x, ¯ y ) + l X i =1 γ i ∂ x g i (¯ x, ¯ y ) ! ⊆ ∂v (¯ x ) , (25) and the equality in (25) holds provided that (23) holds.Proof Let φ E ( x, y ) := f ( x, y ) + δ E ( x, y ) = f ( x, y ) + δ Y ( y ) + P li =1 δ C i ( x, y )with C i := { ( x, y ) | g i ( x, y ) ≤ } . Then by the sum rule (see Proposition 1) andthe fact that N E = ∂δ E , we have ∂f (¯ x, ¯ y ) + { } × N Y (¯ y ) + l X i =1 N C i (¯ x, ¯ y ) ⊆ ∂φ E (¯ x, ¯ y ) . (26)When g i (¯ x, ¯ y ) <
0, we have (¯ x, ¯ y ) ∈ int C i and hence γ i = 0 ∈ N C i (¯ x, ¯ y ).Otherwise if g i (¯ x, ¯ y ) = 0, by definition of subdifferential and the normal conewe can show that for any γ i ≥ γ i ∂g i (¯ x, ¯ y ) ⊆ N C i (¯ x, ¯ y ) . Hence together with(26), we have n ∂f (¯ x, ¯ y ) + l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) | γ ∈ R l , γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 o ⊆ ∂φ E (¯ x, ¯ y ) . Since ∂φ E (¯ x, ¯ y ) = ∂φ (¯ x, ¯ y ) where φ ( x, y ) = f ( x, y ) + δ D ( x, y ), it follows fromLemma 3 that ∂v (¯ x ) = { ξ | ( ξ, ∈ ∂φ (¯ x, ¯ y ) } = { ξ | ( ξ, ∈ ∂φ E (¯ x, ¯ y ) }⊇ ( ξ | ( ξ, ∈ ∂f (¯ x, ¯ y ) + l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) ,γ ∈ R l , γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 ) . Hence (22) holds. Since f is Lipschitz continuous at (¯ x, ¯ y ), by the sum rule inProposition 1, we have ∂φ E (¯ x, ¯ y ) = ∂f (¯ x, ¯ y ) + N E (¯ x, ¯ y ). Hence if (23) holds, then ∂v (¯ x ) = { ξ | ( ξ, ∈ ∂φ (¯ x, ¯ y ) } = { ξ | ( ξ, ∈ ∂φ E (¯ x, ¯ y ) } = { ξ | ( ξ, ∈ ∂f (¯ x, ¯ y ) + N E (¯ x, ¯ y ) } = ( ξ | ( ξ, ∈ ∂f (¯ x, ¯ y ) + l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) , γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 ) . This shows that the equality holds in (22).Now suppose that (24) holds. Then for any γ ∈ R l , γ ≥ , P li =1 γ i g i (¯ x, ¯ y ) =0, by the sum rule in Proposition 1 we have ∂φ E (¯ x, ¯ y ) ⊇ ∂f (¯ x, ¯ y ) + l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y )= ( ∂ x f (¯ x, ¯ y ) + l X i =1 γ i ∂ x g i (¯ x, ¯ y ) ) × ( ∂ y f (¯ x, ¯ y ) + l X i =1 γ i ∂ y g i (¯ x, ¯ y ) + N Y (¯ y ) ) . Combining with (22), we obtain (25). Similarly when (23) holds, the equalityholds in (25). ⊓⊔ By the description in (25), ∂v (¯ x ) can be calculated as long as (23) is satis-fied. We claim that (23) is a mild condition. In fact, by convexity, there alwaysholds the inclusion N E (¯ x, ¯ y ) ⊇ n l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) : γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 o . By virtue of [16, Theorem 4.1], the reverse inclusion also follows under standardassumptions. Some suffcient conditions for (23) are thus summarized in thefollowing proposition.
Proposition 7
Equation (23) holds provided that the set-valued map Ψ ( α ) := { ( x, y ) ∈ R n × Y : g ( x, y ) + α ≤ } is calm at (0 , ¯ x, ¯ y ) , i.e., there exist κ, δ > such that dist E ( x, y ) ≤ κ k max { g ( x, y ) , }k ∀ ( x, y ) ∈ B δ (¯ x, ¯ y ) ∩ E ; in particular if one of the following conditions:(a) The linear constraint qualification holds: g ( x, y ) is an affine mapping of ( x, y ) and Y is convex polyhedral.(b) The Slater condition holds: there exists a point ( x , y ) ∈ R n × Y such that g ( x , y ) < holds. ifference of convex algorithms for bilevel programs 19 Proof
By virtue of [16, Theorem 4.1], the reverse inclusion N E (¯ x, ¯ y ) ⊆ n l X i =1 γ i ∂g i (¯ x, ¯ y ) + { } × N Y (¯ y ) : γ ≥ , l X i =1 γ i g i (¯ x, ¯ y ) = 0 o holds provided that the system y ∈ Y, g ( x, y ) ≤ , ¯ x, ¯ y ). It iswell-known that (a) or (b) is a sufficient condition for calmness. ⊓⊔ ǫ . First,as shown in [22], the solutions of (VP) ǫ approximate a true solution of theoriginal bilevel program (DCBP) as ǫ approaches zero. Second, the proximityfrom a local minimizer of (VP) ǫ to the solution set of (DCBP) can be controlledby adjusting the value of ǫ . The third motivation is that the approximateprogram (VP) ǫ when ǫ > Proposition 8
Suppose S ∗ , the solution set of problem (VP) , is compact andthe value function v ( x ) is continuous. Then for any δ > , there exists ¯ ǫ > such that for any ǫ ∈ (0 , ¯ ǫ ] , there exists ( x ǫ , y ǫ ) which is a local minimum of ǫ -approximation problem (VP) ǫ with dist(( x ǫ , y ǫ ) , S ∗ ) < δ .Proof To the contrary, assume that there exist δ > { ǫ k } with ǫ k ↓ k → ∞ such that there does not exist ( x, y ) being a local minimumof ǫ k -approximation problem (VP) ǫ k satisfying dist(( x k , y k ) , S ∗ ) < δ for all k .Consider a point (ˆ x k , ˆ y k ), which is a global minimum to the following problemmin ( x,y ) ∈ C F ( x, y ) s.t. f ( x, y ) − v ( x ) ≤ ǫ, dist(( x, y ) , S ∗ ) ≤ δ. Then by assumption, it holds that dist((ˆ x k , ˆ y k ) , Σ ∗ ) = δ and F (ˆ x k , ˆ y k ) ≤ F ∗ ,where F ∗ is the optimal value of problem (VP). As S ∗ is compact, sequence { (ˆ x k , ˆ y k ) } is bounded and we can assume without lost of generality that(ˆ x k , ˆ y k ) → (¯ x, ¯ y ) as k → ∞ . Since the value function v is continuous, bytaking k → ∞ in (ˆ x k , ˆ y k ) ∈ C and f (ˆ x k , ˆ y k ) − v (ˆ x k ) ≤ ǫ k , we obtain the fea-sibility of the limit point (¯ x, ¯ y ) for problem (VP). Next, by taking k → ∞ indist((ˆ x k , ˆ y k ) , S ∗ ) = δ and F (ˆ x k , ˆ y k ) ≤ F ∗ , we obtain that dist((¯ x, ¯ y ) , S ∗ ) = δ and F (¯ x, ¯ y ) ≤ F ∗ , a contradiction. ⊓⊔ Before we clarify the third motivation, we define the concepts of NNAMCQfor problem (VP) and ENNAMCQ for problem (VP) ǫ where ǫ ≥ Definition 3
Let (¯ x, ¯ y ) be a feasible solution to problem (VP). We say thatNNAMCQ holds at (¯ x, ¯ y ) for problem (VP) if0 / ∈ ∂f (¯ x, ¯ y ) − ∂v (¯ x ) × { } + N C (¯ x, ¯ y ) . (27)Let (¯ x, ¯ y ) ∈ C , we say that ENNAMCQ holds at (¯ x, ¯ y ) for problem ( V P ) ǫ ifeither f (¯ x, ¯ y ) − v (¯ x ) < ǫ or f (¯ x, ¯ y ) − v (¯ x ) ≥ ǫ but (27) holds. Proposition 9
Let (¯ x, ¯ y ) be a feasible solution to problem ( V P ) . Suppose that v ( x ) is Lipschitz continuous around ¯ x . Then NNAMCQ never holds at (¯ x, ¯ y ) .Proof By definition of the value function, we can never have f (¯ x, ¯ y ) − v (¯ x ) < f (¯ x, ¯ y ) − v (¯ x ) = 0. But since (¯ x, ¯ y ) is a feasible solution toproblem (VP), it is easy to see that (¯ x, ¯ y ) must be a solution to the followingproblem min ( x,y ) ∈ C { f ( x, y ) − v ( x ) } . But by the optimality condition we must have0 ∈ ∂f (¯ x, ¯ y ) − ∂v (¯ x ) × { } + N C (¯ x, ¯ y ) . This means that (27) would never hold. ⊓⊔ Although the NNAMCQ never hold for (VP), thanks to the model structures,it holds automatically for (VP) ǫ if ǫ >
0. Hence, according to the DC theoriesestablished in the preceding section, powerful DCA can be employed to solve(VP) ǫ . Proposition 10
For any (¯ x, ¯ y ) ∈ C , problem (VP) ǫ with ǫ > satisfies EN-NAMCQ at (¯ x, ¯ y ) .Proof If f ( x, y ) − v ( x ) < ǫ holds, then by definition, ENNAMCQ holds at(¯ x, ¯ y ). Now suppose that f (¯ x, ¯ y ) − v (¯ x ) ≥ ǫ and ENNAMCQ does not hold,i.e., 0 ∈ ∂f (¯ x, ¯ y ) − ∂v (¯ x ) × { } + N C (¯ x, ¯ y ) . It follows from (3) that0 ∈ (cid:20) ∂ x f (¯ x, ¯ y ) − ∂v (¯ x ) ∂ y f (¯ x, ¯ y ) (cid:21) + N C (¯ x, ¯ y ) . (28)By (3) we have N C (¯ x, ¯ y ) = ∂δ C (¯ x, ¯ y ) ⊆ ∂ x δ C (¯ x, ¯ y ) × ∂ y δ C (¯ x, ¯ y ) ⊆ R n × N C (¯ x ) (¯ y ) , where C (¯ x ) := { y ∈ Y | g i (¯ x, y ) ≤ , i = 1 , . . . , l } . Thus, it follows from (28)that 0 ∈ ∂ y f (¯ x, ¯ y ) + N C (¯ x ) (¯ y ) , which further implies that ¯ y ∈ S (¯ x ) . However, an obvious contradiction tothe assumption that f (¯ x, ¯ y ) − v (¯ x ) ≥ ǫ occurs and thus the desired conclusionfollows immediately. ⊓⊔ ifference of convex algorithms for bilevel programs 21 By using the definition of KKT points for (DC) in Definition 1, under theassumption that the value function is locally Lipschitz continiuous, we defineKKT points for problem (VP) ǫ . Definition 4
We say a point (¯ x, ¯ y ) is a KKT point of problem (VP) ǫ with ǫ ≥ λ ≥ ( ∈ ∂F (¯ x, ¯ y ) − ∂F (¯ x, ¯ y ) + λ∂f (¯ x, ¯ y ) − λ∂v (¯ x ) × { } + N C (¯ x, ¯ y ) ,f (¯ x, ¯ y ) − v (¯ x ) − ǫ ≤ , λ ( f (¯ x, ¯ y ) − v (¯ x ) − ǫ ) = 0 . By virtue of Proposition 10 and Theorem 1, we have the following necessaryoptimality condition. Since the issue of constraint qualifications for problem(VP) is complicated and it is not the main concern in this paper, we refer thereader to discussions on this topic in [2,43].
Theorem 4
Let (¯ x, ¯ y ) be a local optimal solution to problem (VP) ǫ with ǫ ≥ .Suppose either ǫ > or ǫ = 0 and a constraint qualification holds. Then (¯ x, ¯ y ) is an KKT point of problem (VP) ǫ . ǫ In this subsection we implement the proposed DC algorithms in Section 3to solve (VP) ǫ . To proceed, let us describe iP-DCA to solve (VP) ǫ . Given acurrent iteration ( x k , y k ) for each k = 0 , , . . . , solving the lower level problemparameterized by x k min y ∈ Y f ( x k , y ) , s.t. g ( x k , y ) ≤ y k ∈ S ( x k ) and a corresponding KKT multiplier γ k ∈ KT ( x k , ˜ y k ). Select ξ k ∈ ∂F ( x k , y k ) , ξ k ∈ ∂ x f ( x k , ˜ y k ) + l X i =1 γ ki ∂ x g i ( x k , ˜ y k ) . (29)Note that according to (25) in Theorem 3, if the partial derivative formulaholds, then ξ k ∈ ∂ x f ( x k , ˜ y k ) + P li =1 γ ki ∂ x g i ( x k , ˜ y k ) is an element of the subd-ifferential ∂v ( x k ). Compute ( x k +1 , y k +1 ) as an approximate minimizer of thestrongly convex subproblem for (VP) ǫ given bymin ( x,y ) ∈ C F ( x, y ) − h ξ k , ( x, y ) i + ρ k ( x, y ) − ( x k , y k ) k + β k max { f ( x, y ) − f ( x k , ˜ y k ) − h ξ k , x − x k i − ǫ, } , (30)satisfying one of the two inexact criteria. Under the assumption that KT ( x k , y ) = ∅ for all y ∈ S ( x k ), the description of iP-DCA on (VP) ǫ with ǫ ≥ Algorithm 3 iP-DCA for solving (VP) ǫ
1: Take an initial point ( x , y ) ∈ X × Y ; δ β >
0; an initial penalty parameter β > tol > for k = 0 , , . . . do
1. Solve the lower level problem P x k defined in (4.3) and obtain ˜ y k ∈ S ( x k ) and γ k ∈ KT ( x k , ˜ y k ).2. Compute ξ ki , i = 0 , x k +1 , y k +1 ) of (30).3. Stopping test. Compute t k +1 = max { f ( x k +1 , y k +1 ) − f ( x k , ˜ y k ) − h ξ k , x k +1 − x k i − ǫ, } . Stop if max {k ( x k +1 , y k +1 ) − ( x k , y k ) k , t k +1 } < tol .4. Penalty parameter update. Set β k +1 = ( β k + δ β , if max { β k , /t k +1 } < k ( x k +1 , y k +1 ) − ( x k , y k ) k − ,β k , otherwise .
5. Set k := k + 1.3: end for Thanks to Proposition 10, when ǫ >
0, problem (VP) ǫ satisfies ENNAMCQautomatically. Moreover, since the partial subgradient formula (24) holds,according to (25) in Theorem 3, the selection criteria in (29) implies that ξ k ∈ ∂v ( x k ). Hence the convergence of iP-DCA for solving (VP) ǫ (Algorithm3) follows from Theorem 1 and Proposition 5. Theorem 5
Assume that F is bounded below on C . Let { ( x k , y k ) } be an itera-tion sequence generated by Algorithm 3. Moreover assume that the partial sub-gradient formula (2) holds at every iteration point ( x k , y k ) and KT ( x k , y ) = ∅ for all y ∈ S ( x k ) . Suppose that either ǫ > or ǫ = 0 and the penalty sequence { β k } is bounded. Than any accumulation point of { ( x k , y k ) } is an KKT pointof problem (VP) ǫ . We now assume that the lower level objective f is differentiable and ∇ f is Lipschitz continuous with modulus L f on set C . Given a current iterate( x k , y k ), the next iterate ( x k +1 , y k +1 ) can be returned as an approximate min-imizer of subproblem (30) with linearized f given bymin ( x,y ) ∈ C F ( x, y ) − h ξ k , ( x, y ) i + ρ k k ( x, y ) − ( x k , y k ) k + β k max { f ( x k , y k ) + h∇ f ( x k , y k ) , ( x, y ) − ( x k , y k ) i − f ( x k , ˜ y k ) − h ξ k , x − x k i − ǫ, } . (31)Under the assumption that KT ( x k , y ) = ∅ for all y ∈ S ( x k ), the iterationscheme of iPL-DCA on (VP) ǫ with ǫ ≥ ifference of convex algorithms for bilevel programs 23 Algorithm 4 iPL-DCA for solving (VP) ǫ
1: Take an initial point ( x , y ) ∈ X × Y ; δ β , σ > β > ρ = β L f + σ ; tol > for k = 0 , , . . . do
1. Solve the lower level problem P x k defined in (4.3) and obtain ˜ y k ∈ S ( x k ) and γ k ∈ KT ( x k , ˜ y k ).2. Compute ξ ki , i = 0 , x k +1 , y k +1 ) of (31).3. Stopping test. Compute t k +1 = max { f ( x k , y k )+ h∇ f ( x k , y k ) , ( x k +1 , y k +1 ) − ( x k , y k ) i− f ( x k , ˜ y k ) −h ξ k , x k +1 − x k i− ǫ, } . Stop if max {k ( x k +1 , y k +1 ) − ( x k , y k ) k , t k +1 } < tol .4. Penalty parameter update. Set β k +1 = ( β k + δ β , if max { β k , /t k +1 } < k ( x k +1 , y k +1 ) − ( x k , y k ) k − ,β k , otherwies .ρ k +1 = 12 β k +1 L f + σ.
5. Set k := k + 1.3: end for The convergence of iPL-DCA follows from Theorem 2 and Proposition 6directly.
Theorem 6
Assume that F is bounded below on C , f is L f smooth on C .Let the sequence { ( x k , y k ) } be generated by Algorithm 4. Moreover assume thatthe partial subgradient formula (2) holds at every iteration point ( x k , y k ) and KT ( x k , y ) = ∅ for all y ∈ S ( x k ) . Suppose that either ǫ > or ǫ = 0 and thepenalty parameter sequence { β k } is bounded. Then any accumulation point of { ( x k , y k ) } is an KKT point of problem (VP) ǫ . In this section, we will conduct numerical experiments on the bilevel modelfor SV classification which is equivalent to (1):min Θ ( w , . . . , w T , c ) := 1 T T X t =1 | Ω tval | X j ∈ Ω tval max(1 − b j ( a Tj w t − c t ) , s.t. λ ub ≤ µ ≤ λ lb , ¯ w lb ≤ ¯ w ≤ ¯ w ub , ( w , . . . , w T , c ) ∈ argmin − ¯ w ≤ w t ≤ ¯ w ,c t ∈ R T X t =1 k w t k µ + X j ∈ Ω ttrn max(1 − b j ( a Tj w t − c t ) , . (32)Here we use the hinge loss as the cross validation error. One can find otherfunctions that can be used for cross validation error in [5,18,19]. Let x := ( µ, ¯ w ), y := ( w , . . . , w T , c ), X = [ λ ub , λ lb ] × [ ¯ w lb , ¯ w ub ], Y = R ( n +1) T , f ( x, y ) = T X t =1 k w t k µ + X j ∈ Ω ttrn max(1 − b j ( a Tj w t − c t ) , , and g ( x, y ) = g ( x, y )... g T ( x, y ) and g t ( x, y ) = (cid:18) − ¯ w − w t w t − ¯ w (cid:19) , t = 1 , . . . , T. Obviously F , f and g are all convex functions defined an open set containing X × Y . Both F and f are bounded below on X × Y . When ¯ w ub ≥ ¯ w lb > F ( x ) = ∅ for an open set containing X . And since b j ∈ {− , } , f ( x, y ) is co-ercive and continuous with respect to lower-level variable y for any given x inan open set containing X , thus S ( x ) = ∅ for all x in an open set containing X .The function g is smooth and f is a sum of a smooth function and a functionwhich is independent of variable x . Hence by Proposition 2, the partial differ-ential formula (2) holds at each point ( x, y ). Since the lower level constraintsare affine, KKT conditions holds at any y ∈ S ( x ) for any x ∈ X . Therefore,all conditions required by the convergence results of iP-DCA in Theorem 5 aresatisfied.We will compare our proposed algorithms with the MPEC approach con-sidered in [19], in which the authors reformulate such bilevel model into MPECby replacing the lower level problem with its KKT optimality condition andthen apply nonlinear program solver to solve the obtained MPEC. In numer-ical experiments, we will follow the suggestions given in [19] to replace thecomplementarity constraints with the relaxed complementarity constraints.As claimed by [19], such approach can facilitate an early termination of cross-validation and ease the difficulty of dealing the complementarity constraintsfor nonlinear program solver.5.1 Numerical testsAll the numerical experiments are implemented on a laptop with Intel(R)Core(TM) i7-9750H CPU@ 2.60GHz and 32.00 GB memory. All the codes areimplemented on MATLAB 2019b. The subproblems in iP-DCA are all con-vex optimization problems and we apply the Matlab software package SDPT3[41,42] with default setting to solve them. MPEC problem is solved by us-ing fmincon in Matlab optimization toolbox with setting ′ Algorithm ′ being ′ interior − point ′ , ′ MaxIterations ′ being 200, ′ MaxFunctionEvaluations ′ being 10 and ′ OptimalityTolerance ′ being 10 − . As fmincon needs ex-tremely long time to solve large dimension MPEC problems, we first usesmall size datasets to conduct the numerical comparison between iP-DCA and ifference of convex algorithms for bilevel programs 25 MPEC approach. 
We test here three real datasets “australian scale”, “breast-cancer scale” and “diabetes scale” downloaded from the SVMLib repository[6] . Each dataset is randomly split into a training set Ω with | Ω | = ℓ train data pairs, which is used in the cross-validation bilevel model and a hold-outtest set N with |N | = ℓ test data pairs. We give the descriptions of datasetsin Table 1. For each dataset, we use a three-fold cross-validation in SV clas-sification bilevel model (32), i.e. T = 3, and that each training fold consistsof two-thirds of the total training data and validation fold consists of one-third of the total training data. We repeat the experiments 20 times for eachdataset. The values of parameters in SV classification bilevel model (32) areset as: λ lb = 10 − , λ ub = 10 , ¯ w lb = 10 − and ¯ w ub = 1 .
5. For our approach,we test three different values of relaxation parameter ǫ ∈ { , − , − } in(VP) ǫ . And the value of relaxation parameter of the relaxed complementarityconstraints in MPEC is set to be 10 − . The initial points for both iP-DCA andMPEC approach are chosen as λ = 1, ¯ w = 0 . e , where e denotes the vectorwhose elements are all equal to 1, and the values of other variables are allequal to 0. These settings are used for all experiments. Parameters in iP-DCAare set as β = 1, ρ = 10 − and δ β = 5. And we terminate iP-DCA when t k +1 < − and k ( x k +1 , y k +1 ) − ( x k , y k ) k / (1 + k ( x k , y k ) k ) < tol .For each experiment, after we obtain the hyperparameters ˆ µ and ˆ¯ w fromimplementing our proposed algorithm for the bilevel SV classification modeland MPEC approach, we calculate their corresponding cross-validation error(CV error) and test error for comparing the performances of these two meth-ods. For calculating the CV error, we put ˆ µ and ˆ¯ w back to the lower levelproblem in (32) to get the corresponding lower level solution ( ˆ w , . . . , ˆ w T , ˆ c )and calculate the corresponding cross-validation error Θ ( ˆ w , . . . , ˆ w T , ˆ c ). Next,as in [19] we implement a post-processing procedure to calculate the general-ization error on the hold-out data for each instance. In particular as suggestedin [19], since only two thirds of the data in Ω was used in each fold whilein testing we use all the training data from Ω , we should solve the followingsupport vector classification problem with ˆ λ = µ and ˆ¯ w as hyperparametermin − ˆ¯ w ≤ w ≤ ˆ¯ w c ∈ R µ k w k + X j ∈ Ω max(1 − b j ( a Tj w − c ) , to obtain the final classifier ( ˆ w , ˆ c ). Then the test (hold-out) error rate is cal-culated as: Test error = 1 ℓ test X i ∈N | sign( a Ti ˆ w − ˆ c ) − b i | , where sign( x ) denote the sign function. Note that for each ( a i , b i ) in the testset N , | sign( a Ti ˆ w − ˆ c ) − b i | is either equal to zero or 2 and hence the test erroris the average misclassification by the final classifier. The achieved numericalresults averaged over 20 repetitions for each dataset are reported in Table 2. ∼ cjlin/libsvmtools/datasets/.6 Jane J. Ye et al. Observe from Table 2 that our proposed iP-DCA always achieves a smallercross-validation error, which is exactly the value of upper level objective ofbilevel problem (32). Furthermore, the time taken by our proposed iP-DCAis much shorter than MPEC approach. And the test errors of our proposediP-DCA and MPEC approach are similar and iP-DCA achieves a smaller testerror than MPEC approach on dataset “diabetes scale”. And it is observedthat different values of ǫ and tol do not influence the cross-validation errorobtained by iP-DCA much. iP-DCA can obtain a relatively good solutionwithout requiring a small tol .Next, we are going to test our proposed iP-DCA on two large scale datasets“mushrooms” and “phishing” downloaded from the SVMLib repository. Thedescriptions of datasets are given in Table 1. We set tol = 10 − for thesetests. The numerical results averaged over 20 repetitions for each dataest arereported in Table 3. It can be observed from Table 3 that different values of ǫ and tol do not influence the cross-validation error obtained by iP-DCA muchbut the case ǫ = 0 takes more time to achieve desired tolerance. And iP-DCAcan obtain a satisfactory solution within an acceptable time on large scaleproblems. Table 1
Table 1  Description of datasets used

Dataset               ℓ_train   ℓ_test   n     T
australian scale      345       345      14    3
breast-cancer scale   339       344      10    3
diabetes scale        384       384      8     3
mushrooms             4062      4062     112   3
phishing              5526      5529     68    3

Table 2  Numerical results comparing iP-DCA and MPEC approach

Dataset               Method                             CV error    Test error   Time (sec)
australian scale      iP-DCA (ε = 0,    tol = 10^- )     0.28 ±      ±            ±
                      iP-DCA (ε = 0,    tol = 10^- )     0.28 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.28 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.28 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.28 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.28 ±      ±            ±
                      MPEC                               ±           ±            ±
breast-cancer scale   iP-DCA (ε = 0,    tol = 10^- )     0.06 ±      ±            ±
                      iP-DCA (ε = 0,    tol = 10^- )     0.06 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.06 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.06 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.06 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.06 ±      ±            ±
                      MPEC                               ±           ±            ±
diabetes scale        iP-DCA (ε = 0,    tol = 10^- )     0.56 ±      ±            ±
                      iP-DCA (ε = 0,    tol = 10^- )     0.56 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.57 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.56 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.56 ±      ±            ±
                      iP-DCA (ε = 10^-, tol = 10^- )     0.56 ±      ±            ±
                      MPEC                               ±           ±            ±

Table 3  Numerical results of iP-DCA on datasets “mushrooms” and “phishing” with tol = 10^-

Dataset      Method              CV error       Test error   Time (sec)
mushrooms    iP-DCA (ε = 0)      6.36e-04 ±     ±            ±
             iP-DCA (ε = 10^- )  1.53e-03 ±     ±            ±
             iP-DCA (ε = 10^- )  6.38e-04 ±     ±            ±
phishing     iP-DCA (ε = 0)      0.29 ±         ±            ±
             iP-DCA (ε = 10^- )  0.29 ±         ±            ±
             iP-DCA (ε = 10^- )  0.29 ±         ±            ±
Motivated by hyperparameter selection problems, in this paper we have developed two DCA-type algorithms for solving the DC bilevel program. Our numerical experiments on the bilevel model of SV classification problems show that our approach is promising. Due to space limitations, we are not able to present studies of more complicated hyperparameter selection models here; we hope to study these problems in future work.
References
1. Allende, G., Still, G.: Solving bilevel programs with the KKT-approach. Mathematical Programming. , 309-332 (2013)
2. Bai, K., Ye, J.J.: Directional necessary optimality conditions for bilevel programs. arXiv:2004.01783v2 (2020)
3. Bard, J.: Practical Bilevel Optimization: Algorithms and Applications. Kluwer Academic Publishers, Dordrecht (1998)
4. Ben-Tal, A., Blair, C.: Computational difficulties of bilevel linear programming. Operations Research. , 556-560 (1990)
5. Bennett, K.P., Hu, J., Ji, X., Kunapuli, G., Pang, J.-S.: Model selection via bilevel optimization. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, 1922-1929 (2006)
6. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. (3), 1-27 (2011)
7. Clarke, F.H.: Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, Philadelphia (1990)
8. Clarke, F.H., Ledyaev, Y.S., Stern, R.J., Wolenski, P.R.: Nonsmooth Analysis and Control Theory. Springer Science & Business Media, New York (1998)
9. Colson, B., Marcotte, P., Savard, G.: An overview of bilevel optimization. Annals of Operations Research. (1-2), 235-256 (2007)
10. Dempe, S.: Foundations of Bilevel Programming. Kluwer Academic Publishers, Dordrecht (2002)
11. Dempe, S., Dutta, J.: Is bilevel programming a special case of mathematical programming with equilibrium constraints? Mathematical Programming. , 37-48 (2012)
12. Dempe, S., Zemkoho, A.: Bilevel Optimization: Advances and Next Challenges. Springer Optimization and Its Applications, vol. 161 (2020)
13. Dempe, S., Kalashnikov, V., Pérez-Valdés, G., Kalashnykova, N.: Bilevel Programming Problems. Energy Systems, Springer Science & Business Media, Berlin (2015)
14. Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., Pontil, M.: Bilevel programming for hyperparameter optimization and meta-learning. In: International Conference on Machine Learning, 1568-1577 (2018)
15. Jourani, A.: Constraint qualifications and Lagrange multipliers in nondifferentiable programming problems. Journal of Optimization Theory and Applications. , 533-548 (1994)
16. Henrion, R., Jourani, A., Outrata, J.V.: On the calmness of a class of multifunctions. SIAM Journal on Optimization. , 603-618 (2002)
17. Horst, R., Thoai, N.V.: DC programming: overview. Journal of Optimization Theory and Applications. (1), 1-43 (1999)
18. Kunapuli, G.: A Bilevel Optimization Approach to Machine Learning. Ph.D. Thesis (2008)
19. Kunapuli, G., Bennett, K.P., Hu, J., Pang, J.-S.: Classification model selection via bilevel programming. Optimization Methods and Software. , 475-489 (2008)
20. Kunapuli, G., Bennett, K.P., Hu, J., Pang, J.-S.: Bilevel model selection for support vector machines. In: CRM Proceedings and Lecture Notes, 129-158 (2008)
21. Lampariello, L., Sagratella, S.: Numerically tractable optimistic bilevel problems. Computational Optimization and Applications. , 277-303 (2020)
22. Lin, G., Xu, M., Ye, J.J.: On solving simple bilevel programs with a nonconvex lower level program. Mathematical Programming. , 277-305 (2014)
23. Liu, R., Mu, P., Yuan, X., Zeng, S., Zhang, J.: A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In: International Conference on Machine Learning, 6305-6315 (2020)
24. Liu, R., Mu, P., Yuan, X., Zeng, S., Zhang, J.: A generic descent aggregation framework for gradient-based bi-level optimization. arXiv:2102.07976 (2021)
25. Luo, Z.-Q., Pang, J.-S., Ralph, D.: Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge (1996)
26. Mirrlees, J.A.: The theory of moral hazard and unobservable behaviour: Part I. The Review of Economic Studies. , 3-21 (1999)
27. Moore, G.: Bilevel Programming Algorithms for Machine Learning Model Selection. Ph.D. Thesis (2010)
28. Moore, G., Bergeron, C., Bennett, K.P.: Model selection for primal SVM. Machine Learning. , 175-208 (2011)
29. Nie, J., Wang, L., Ye, J.J.: Bilevel polynomial programs and semidefinite relaxation methods. SIAM Journal on Optimization. , 1728-1757 (2017)
30. Nie, J., Wang, L., Ye, J.J., Zhong, S.: A Lagrange multiplier expression method for bilevel polynomial optimization. SIAM Journal on Optimization. revised.
31. Okuno, T., Kawana, A.: Bilevel optimization of regularization hyperparameters in machine learning. In: Bilevel Optimization: Advances and Next Challenges, Ch. 6. Springer Optimization and Its Applications, vol. 161 (2020)
32. Outrata, J.V.: On the numerical solution of a class of Stackelberg problems. ZOR - Methods and Models of Operations Research. , 255-277 (1990)
33. Outrata, J., Kocvara, M., Zowe, J.: Nonsmooth Approach to Optimization Problems with Equilibrium Constraints: Theory, Applications and Numerical Results. Kluwer Academic Publishers, Boston (1998)
34. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
35. Rockafellar, R.T.: Conjugate duality and optimization. CBMS-NSF Regional Conference Series in Applied Mathematics. , 1-74 (1974)
36. Shimizu, K., Ishizuka, Y., Bard, J.: Nondifferentiable and Two-Level Mathematical Programming. Kluwer Academic Publishers, Dordrecht (1997)
37. Stackelberg, H.: Market Structure and Equilibrium. Springer Science & Business Media, Berlin (2010)
38. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
39. Thi, H.A.L., Dinh, T.P.: DC programming and DCA for general DC programs. In: Advanced Computational Methods for Knowledge Engineering, pp. 15-35. Springer, Cham, Switzerland (2014)
40. Thi, H.A.L., Dinh, T.P.: DC programming and DCA: thirty years of developments. Mathematical Programming. , 5-68 (2018)
41. Toh, K.C., Todd, M.J., Tutuncu, R.H.: SDPT3 — a Matlab software package for semidefinite programming. Optimization Methods and Software. , 545-581 (1999)
42. Tutuncu, R.H., Toh, K.C., Todd, M.J.: Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, Series B. , 189-217 (2003)
43. Ye, J.J.: Constraint qualifications and optimality conditions in bilevel optimization. In: Bilevel Optimization: Advances and Next Challenges, Ch. 8. Springer Optimization and Its Applications, vol. 161 (2020)
44. Ye, J.J., Zhu, D.L.: Optimality conditions for bilevel programming problems. Optimization. 33, 9-27 (1995)