Derivative-Free Multiobjective Trust Region Descent Method Using Radial Basis Function Surrogate Models
Article

Manuel Berkemeier and Sebastian Peitz

Paderborn University; [email protected]
Paderborn University; [email protected]
* Correspondence: [email protected]
Abstract:
We present a flexible trust region descent algorithm for unconstrained and convexly constrained multiobjective optimization problems. It is targeted at heterogeneous and expensive problems, i.e., problems that have at least one objective function that is computationally expensive. The method is derivative-free in the sense that derivative information need not be available for the expensive objectives, nor are gradients approximated using repeated function evaluations, as is the case in finite-difference methods. Instead, a multiobjective trust region approach is used that works similarly to its well-known scalar counterparts. Local surrogate models constructed from evaluation data of the true objective functions are employed to compute possible descent directions. In contrast to existing multiobjective trust region algorithms, these surrogates are not polynomial but carefully constructed radial basis function networks. This has the important advantage that the number of data points scales linearly with the parameter space dimension. The local models qualify as fully linear and the corresponding general scalar framework is adapted for problems with multiple objectives. Convergence to Pareto critical points is proven and numerical examples illustrate our findings.
Keywords: multiobjective optimization; trust region methods; multiobjective descent; derivative-free optimization; radial basis functions; fully linear models
1. Introduction
Optimization problems arise in a multitude of applications in mathematics, computer science, engineering and the natural sciences. In many real-life scenarios, there are multiple, equally important objectives that need to be optimized. Such problems are then called Multiobjective Optimization Problems (MOP). In contrast to the single objective case, an MOP often does not have a single solution but an entire set of optimal trade-offs between the different objectives, which we call Pareto optimal. They constitute the Pareto Set and their image is the Pareto Frontier. The goal in the numerical treatment of an MOP is to either approximate these sets or to find single points within these sets. In applications, the problem can become more difficult when some of the objectives require computationally expensive or time-consuming evaluations. For instance, the objectives could depend on a computer simulation or some other black-box. It is then of primary interest to reduce the overall number of function evaluations. Consequently, it becomes infeasible to approximate derivative information of the true objectives using, e.g., finite differences. In this work, optimization methods that do not use the objective gradients (which nonetheless are assumed to exist) are referred to as derivative-free.
There is a variety of methods to deal with multiobjective optimization problems, some of which are also derivative-free or try to constrain the number of expensive function evaluations. A broad overview of different problems and techniques concerning multiobjective optimization can be found, e.g., in [1–4]. One popular approach for calculating Pareto optimal solutions is scalarization, i.e., the transformation of an MOP into a single objective problem, cf. [5] for an overview. Alternatively, classical (single objective) descent algorithms can be adapted for the multiobjective case [6–11]. What is more, the structure of the Pareto Set can be exploited to find multiple solutions [12,13]. There are also methods for non-smooth problems [14,15] and multiobjective direct-search variants [16,17]. Both scalarization and descent techniques may be included in Evolutionary Algorithms (EA) [18–21], the most prominent of which probably is NSGA-II [22].
To address computationally expensive objectives or missing derivative information, there are algorithms that use surrogate models (see the surveys [23–25]) or borrow ideas from scalar trust region methods, e.g., [26]. In single objective optimization, trust region methods are well suited for derivative-free optimization [27,28]. Our work is based on the recent development of multiobjective trust region methods:
• In [29], a trust region method using Newton steps for functions with positive definite Hessians on an open domain is proposed.
• In [30], quadratic Taylor polynomials are used to compute the steepest descent direction, which is used in a backtracking manner to find solutions for unconstrained problems.
• In [31], polynomial regression models are used to solve an augmented MOP based on the scalarization in [17]. The algorithm is designed for unconstrained bi-objective problems.
• In [32], quadratic Lagrange polynomials are used and the Pascoletti-Serafini scalarization is employed for the descent step calculation.
Our contribution is the extension of the above-mentioned methods to general fully linear models (and in particular radial basis function surrogates as in [33]), which is related to the scalar framework in [34]. Most importantly, this reduces the complexity with respect to the parameter space dimension to linear, in contrast to the quadratically increasing number of function evaluations in other methods. We further prove convergence to critical points when the problem is constrained to a convex and compact set by using an analogous argumentation as in [35]. This requires new results concerning the continuity of the projected steepest descent direction. We also show how to keep the convergence properties for constrained problems when the Pascoletti-Serafini scalarization is employed (as in [32]).
The remainder of the paper is structured as follows: Section 2 provides a brief introduction to multiobjective optimality and criticality concepts. In Section 3, the fundamentals of our algorithm are explained. In Section 4, we introduce fully linear surrogate models and describe their construction. We also formalize the main algorithm in this section. Section 5 deals with the descent step calculation so that a sufficient decrease is achieved in each iteration. Convergence is proven in Section 6 and a few numerical examples are shown in Section 7. We conclude with a brief discussion in Section 8.
2. Optimality and Criticality in Multiobjective Optimization
We consider the following (real-valued) multiobjective optimization problem:

  min_{x ∈ X} f(x) := min_{x ∈ X} [ f_1(x), …, f_k(x) ]^T ∈ R^k,   (MOP)

with a feasible set X ⊆ R^n and k objective functions f_ℓ : R^n → R, ℓ = 1, …, k. We further assume (MOP) to be heterogeneous. That is, there is a non-empty subset I_ex ⊆ {1, …, k} of indices so that the gradients of f_ℓ, ℓ ∈ I_ex, are unknown and cannot be approximated, e.g., via finite differences. The (possibly empty) index set I_cheap = {1, …, k} \ I_ex indicates functions whose gradients are available.
Solutions for (MOP) consist of optimal trade-offs x* ∈ X between the different objectives and are called non-dominated or Pareto optimal. That is, there is no x ∈ X with f(x) ≺ f(x*) (i.e., f(x) ≤ f(x*) and f_ℓ(x) < f_ℓ(x*) for some index ℓ ∈ {1, …, k}). The subset P_S ⊆ X of non-dominated points is then called the Pareto Set and its image P_F := f(P_S) ⊆ R^k is called the Pareto Frontier. All concepts can be defined in a local fashion in an analogous way.
Similar to scalar optimization, local optima can be characterized using the gradients of the objective function. We therefore implicitly assume all objective functions f_ℓ, ℓ = 1, …, k, to be continuously differentiable on X. Moreover, the following assumption allows for an easier treatment of tangent cones in the constrained case:

Assumption 1. Either X = R^n or the feasible set X ⊆ R^n is closed, bounded and convex. All functions are defined on X.

Because R^n is finite-dimensional, Assumption 1 is equivalent to requiring X to be compact and convex, which is a standard assumption in the MO literature [6,7].
Now let ∇f_ℓ(x) denote the gradient of f_ℓ and Df(x) ∈ R^{k×n} the Jacobian of f at x ∈ X.

Definition 1.
We call a vector d ∈ X − x a multi-descent direction for f at x if

  ⟨∇f_ℓ(x), d⟩ < 0 for all ℓ ∈ {1, …, k}, or equivalently, max_{ℓ=1,…,k} ⟨∇f_ℓ(x), d⟩ < 0,   (1)

where ⟨•, •⟩ is the standard inner product on R^n and we consider X − x = X in the unconstrained case X = R^n.

A point x* ∈ X is called critical for (MOP) iff there is no d ∈ X − x* with (1). As all Pareto optimal points are also critical (cf. [6,36] or [2, Ch. 17]), it is viable to search for optimal points by calculating points from the superset P_crit ⊇ P_S of critical points for (MOP). One way to do so is by iteratively performing descent steps. Fliege and Svaiter [7] propose several ways to compute suitable descent directions. The minimizer d* of the following problem is known as the multiobjective steepest-descent direction:

  min_{d ∈ X − x} max_{ℓ=1,…,k} ⟨∇f_ℓ(x), d⟩  s.t. ‖d‖ ≤ 1.   (P1)

Problem (P1) has an equivalent reformulation as

  min_{d ∈ X − x, β ∈ R} β  s.t. ‖d‖ ≤ 1, ⟨∇f_ℓ(x), d⟩ ≤ β ∀ ℓ = 1, …, k,   (P2)

which is a linear program if X is defined by linear constraints and the maximum norm ‖•‖ = ‖•‖_∞ is used [7]. We thus stick with this choice because it facilitates implementation, but note that other choices are possible (see for example [32]).
Motivated by the next theorem, we can use the optimal value of either problem as a measure of criticality, i.e., as a multiobjective pendant for the gradient norm. As is standard in most multiobjective trust region works (cf. [29,30,32]), we flip the sign so that the values are non-negative.

Theorem 1.
For x ∈ X, let d*(x) be the minimizer of (P1) and ω(x) be the negative optimal value, that is,

  ω(x) := − max_{ℓ=1,…,k} ⟨∇f_ℓ(x), d*(x)⟩.

Then the following statements hold:
1. ω(x) ≥ 0 for all x ∈ X.
2. The function ω : R^n → R is continuous.
3. The following statements are equivalent:
 (a) The point x ∈ X is not critical.
 (b) ω(x) > 0.
 (c) d*(x) ≠ 0.
Consequently, the point x is critical iff ω(x) = 0.

Proof.
For the unconstrained case, all statements are proven in [7, Lemma 3]. The first and the third statement hold true for X convex and compact by definition. The continuity of ω can be shown similarly as in [6], see Appendix A.1.
With further conditions on f and X, the criticality measure ω(x) is even Lipschitz continuous and subsequently uniformly and Cauchy continuous:

Theorem 2. If ∇f_ℓ, ℓ = 1, …, k, are Lipschitz continuous and Assumption 1 holds, then the map ω(•) as defined in Theorem 1 is uniformly continuous.

Proof.
The proof for X = R^n is given by Thomann [37]. A proof for the constrained case can be found in Appendix A.1, so as to not clutter this introductory section.
Together with Theorem 1, this hints at ω(•) being a criticality measure as defined for scalar trust region methods in [35, Ch. 8]:

Definition 2.
We call π : N × R^n → R a criticality measure for (MOP) if π is Cauchy continuous with respect to its second argument and if lim_{t→∞} π(t, x^(t)) = 0 implies that the sequence {x^(t)} asymptotically approaches a Pareto-critical point.
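Because (P2) is a linear program for the maximum norm, the criticality measure ω(x) and the steepest descent direction d*(x) can be computed with any off-the-shelf LP solver. The following sketch illustrates this for the unconstrained case X = R^n using SciPy's linprog; the bi-objective example gradients and the evaluation point are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def steepest_descent_lp(grads):
    """Solve (P2) for X = R^n with the max-norm:
    minimize beta subject to <grad_l, d> <= beta and ||d||_inf <= 1.
    Returns (omega, d) where omega = -beta* is the criticality measure."""
    grads = np.asarray(grads, dtype=float)  # shape (k, n): one gradient per row
    k, n = grads.shape
    # Decision variables z = (d_1, ..., d_n, beta); objective: minimize beta.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # Constraints <grad_l, d> - beta <= 0 for every objective l.
    A_ub = np.hstack([grads, -np.ones((k, 1))])
    b_ub = np.zeros(k)
    # Box constraints encode ||d||_inf <= 1; beta is a free variable.
    bounds = [(-1.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    d, beta = res.x[:n], res.x[-1]
    return -beta, d  # omega(x) = -beta*, d*(x)

# Hypothetical example: f(x) = (x1^2 + x2^2, (x1 - 1)^2 + x2^2) at x = (0.5, 0.5),
# so the gradients are (1, 1) and (-1, 1).
omega, d = steepest_descent_lp([[1.0, 1.0], [-1.0, 1.0]])
```

Here ω > 0 certifies that x is not critical (Theorem 1), and d is a common descent direction for both objectives.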
3. Trust Region Ideas
Multiobjective trust region algorithms closely follow the design of scalar approaches (see [35] for an extensive treatment). Consequently, the requirements and convergence proofs in [29,30,32] for the unconstrained multiobjective case are fairly similar to those in [35]. We will reexamine the core concepts to provide a clear understanding and point out the similarities to the scalar case.
The main idea is to iteratively compute multi-descent steps s^(t) in every iteration t ∈ N. We could, for example, use the steepest descent direction given by (P1). This would require knowledge of the objective gradients, which need not be available for objective functions with indices in I_ex. Hence, benevolent surrogate model functions

  m^(t) : R^n → R^k,  x ↦ m^(t)(x) = [ m_1^(t)(x), …, m_k^(t)(x) ]^T,

are employed. Note that for cheap objectives f_ℓ, ℓ ∈ I_cheap, we could simply use m_ℓ = f_ℓ as long as these f_ℓ are twice continuously differentiable and have Hessians of bounded norm. The surrogate models are constructed to be sufficiently accurate within a trust region

  B^(t) := B(x^(t); Δ^(t)) = { x ∈ X : ‖x − x^(t)‖ ≤ Δ^(t) },  with ‖•‖ = ‖•‖_∞,   (2)

around the current iterate x^(t). The model steepest descent direction d_m^(t) can then be computed as the minimizer of the surrogate problem

  ω_m^(t)(x^(t)) := − min_{d ∈ X − x^(t), β ∈ R} β  s.t. ‖d‖ ≤ 1 and ⟨∇m_ℓ^(t)(x^(t)), d⟩ ≤ β ∀ ℓ = 1, …, k.   (Pm)

Now let σ^(t) > 0 be a step size. In general, d_m^(t) need not be a descent direction for the true objectives f, and the trial point x^(t)+ = x^(t) + σ^(t) d_m^(t) is only accepted if a measure ρ^(t) of improvement and model quality surpasses a positive threshold ν_+. As in [30,32], we scalarize the multiobjective problems by defining

  Φ(x) := max_{ℓ=1,…,k} f_ℓ(x),  Φ_m^(t)(x) := max_{ℓ=1,…,k} m_ℓ^(t)(x).

Whenever Φ(x^(t)) − Φ(x^(t)+) > 0, there is a reduction in at least one objective function of f because of

  0 < Φ(x^(t)) − Φ(x^(t)+) = f_ℓ(x^(t)) − f_q(x^(t)+) ≤ (df.) f_ℓ(x^(t)) − f_ℓ(x^(t)+),

where we denoted by ℓ the maximizing index in Φ(x^(t)) and by q the maximizing index in Φ(x^(t)+). Of course, the same property holds for Φ_m^(t)(•) and m^(t).
Thus, the step size σ^(t) > 0 is chosen so that the step s^(t) = σ^(t) d_m^(t) satisfies both x^(t) + s^(t) ∈ B^(t) and a “sufficient decrease condition” of the form

  Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s^(t)) ≥ κ_sd ω_m^(t)(x^(t)) min{ C · ω_m^(t)(x^(t)), 1, Δ^(t) } ≥ 0,

see Section 5. Such a condition is also required in the scalar case [34,35] and is essential for the convergence proof in Section 6, where we show lim_{t→∞} ω(x^(t)) = 0. It also ensures that the denominator of the acceptance ratio

  ρ^(t) = ( Φ(x^(t)) − Φ(x^(t)+) ) / ( Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t)+) )  if x^(t) ≠ x^(t)+,  and ρ^(t) = 0  if x^(t) = x^(t)+ ⇔ s^(t) = 0,   (3)

is nonnegative. A positive ρ^(t) implies a decrease in at least one objective f_ℓ, so we accept x^(t)+ as the next iterate if ρ^(t) > ν_+ > 0. If ρ^(t) is sufficiently large, say ρ^(t) ≥ ν_{++} > ν_+ > 0, the next trust region might have a larger radius Δ^(t+1) ≥ Δ^(t). If in contrast ρ^(t) < ν_{++}, the next trust region radius should be smaller and the surrogates improved.
This encompasses the case s^(t) = 0, when the iterate x^(t) is critical for

  min_{x ∈ B^(t)} m^(t)(x) ∈ R^k.   (MOPm)

Roughly speaking, we suppose that x^(t) is near a critical point for the original problem (MOP) if m^(t) is sufficiently accurate. If we truly are near a critical point, then the trust region radius will approach 0. For further details concerning the acceptance ratio ρ^(t), see [32, Sec. 2.2].

The abbreviation “df.” above the inequality symbol stands for “(by) definition” and is used throughout this document when appropriate.
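The acceptance test and radius update can be illustrated with a minimal sketch. The thresholds and update factors below are hypothetical choices, and the additional condition that couples the radius increase to the model criticality (see Section 4) is deliberately omitted for brevity.

```python
def acceptance_ratio(f_old, f_new, m_old, m_new):
    """Acceptance ratio rho from (3): actual decrease of Phi = max_l f_l
    versus the decrease predicted by the surrogate maximum Phi_m."""
    actual = max(f_old) - max(f_new)
    predicted = max(m_old) - max(m_new)
    return actual / predicted

def update(rho, delta, nu_plus=0.1, nu_pp=0.7,
           gamma_down=0.5, gamma_up=2.0, delta_ub=10.0):
    """Simplified trust region update: accept the trial point if rho > nu_plus,
    enlarge the radius on very successful iterations, shrink it otherwise."""
    accept = rho > nu_plus
    if rho >= nu_pp:
        delta_next = min(gamma_up * delta, delta_ub)
    else:
        delta_next = gamma_down * delta
    return accept, delta_next

# Hypothetical iteration: the surrogate predicts a decrease of 0.6 in Phi_m,
# while the true decrease in Phi is 0.5.
rho = acceptance_ratio([1.0, 2.0], [0.8, 1.5], [1.0, 2.0], [0.9, 1.4])
accept, delta_next = update(rho, delta=1.0)
```

A ratio close to 1 indicates a trustworthy model; small or negative ratios trigger a radius reduction and, if the models are not yet fully linear, a model improvement step.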
Remark 1.
We can modify ρ^(t) in (3) to obtain a descent in all objectives, i.e., if x^(t) ≠ x^(t)+ we test

  ρ^(t) = ( f_ℓ(x^(t)) − f_ℓ(x^(t)+) ) / ( m_ℓ^(t)(x^(t)) − m_ℓ^(t)(x^(t)+) ) > ν_+  for all ℓ = 1, …, k.

This is the strict acceptance test.
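A minimal sketch of this component-wise test (with a hypothetical threshold value):

```python
def strict_accept(f_old, f_new, m_old, m_new, nu_plus=0.1):
    """Strict acceptance test from Remark 1: every objective must realize
    a sufficiently large fraction of its model-predicted decrease."""
    ratios = [(fo - fn) / (mo - mn)
              for fo, fn, mo, mn in zip(f_old, f_new, m_old, m_new)]
    return all(r > nu_plus for r in ratios)

# Both objectives decrease roughly as the models predicted:
ok = strict_accept([1.0, 2.0], [0.8, 1.6], [1.0, 2.0], [0.9, 1.5])
# The second objective increases although its model predicted a decrease:
bad = strict_accept([1.0, 2.0], [0.8, 2.1], [1.0, 2.0], [0.9, 1.5])
```

In contrast to (3), a trial point failing this test for a single objective is rejected even if the maximum Φ decreases.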
4. Surrogate Models and the Final Algorithm
Until now, we have not discussed the actual choice of surrogate models used for m^(t). As is shown in Section 5, the models should be twice continuously differentiable with uniformly bounded Hessians. To prove convergence of our algorithm we have to impose further requirements on the (uniform) approximation qualities of the surrogates m^(t). We can meet these requirements using so-called fully linear models. Moreover, fully linear models intrinsically allow for modifications of the basic trust region method that are aimed at reducing the total number of expensive objective evaluations. Finally, we briefly recapitulate how radial basis functions and multivariate Lagrange polynomials can be made fully linear. Let us begin with the abstract definition of full linearity as given in [27,34]:
Definition 3.
Let Δ_ub > 0 be given and let f : R^n → R be a function that is continuously differentiable in an open domain containing X and has a Lipschitz continuous gradient on X. A set of model functions M = { m : R^n → R } ⊆ C^1(R^n, R) is called a fully linear class of models if the following hold:
1. There are positive constants ε, ε̇ and L_m such that for any given Δ ∈ (0, Δ_ub] and for any x ∈ X there is a model function m ∈ M with Lipschitz continuous gradient and corresponding Lipschitz constant bounded by L_m and such that
 • the error between the gradient of the model and the gradient of the function satisfies
   ‖∇f(ξ) − ∇m(ξ)‖ ≤ ε̇ Δ  ∀ ξ ∈ B(x; Δ),
 • the error between the model and the function satisfies
   |f(ξ) − m(ξ)| ≤ ε Δ²  ∀ ξ ∈ B(x; Δ).
2. For this class M there exists a “model-improvement” algorithm that – in a finite, uniformly bounded (w.r.t. x and Δ) number of steps – can
 • either establish that a given model m ∈ M is fully linear on B(x; Δ),
 • or find a model m̃ that is fully linear on B(x; Δ).

Remark 2.
In the constrained case, we treat the constraints as hard, that is, we do not allow for evaluations of the true objectives outside X; see the definition of B^(t) ⊆ X in (2). We also ensure to only select training data in X during the construction of surrogate models.
In the unconstrained case, the requirements in Definition 3 can be relaxed a bit, at least when using the strict acceptance test with f(x^(T)) ≤ f(x^(t)) for all T ≥ t ≥ 0. We can then restrict ourselves to the set

  X′ := ⋃_{x ∈ L(x^(0))} B(x; Δ_ub),  where L(x^(0)) := { x ∈ R^n : f(x) ≤ f(x^(0)) }.

For the convergence analysis in Section 6, we cite [27, Lemma 10.25] concerning the approximation quality of fully linear models on enlarged trust regions:

Lemma 1.
For x ∈ X and Δ ≤ Δ_ub, consider a function f and a fully linear model m as in Definition 3 with constants ε, ε̇, L_m > 0. Let L_f > 0 be a Lipschitz constant of ∇f. Assume w.l.o.g. that L_m + L_f ≤ ε and ε̇ ≤ ε. Then m is fully linear on B(x; Δ̃) for any Δ̃ ∈ [Δ, Δ_ub] with respect to the same constants ε, ε̇, L_m.

Compared to the basic method of Section 3, our algorithm uses two modifications based on the model criticality ω_m^(t)(x^(t)):
• A criticality test based on ω_m^(t)(x^(t)). If this value is very small at the current iterate, then x^(t) could lie near a Pareto-critical point. With the criticality test and criticalityRoutine we ensure that the next model is fully linear and the trust region is not too large. This allows for a more accurate criticality measure and descent step calculation.
• A trust region update that also takes into consideration ω_m^(t)(x^(t)). The radius should be enlarged only if we have a large acceptance ratio ρ^(t) and Δ^(t) is small as measured against β ω_m^(t)(x^(t)) for a constant β > 0.
Based on the value of ρ^(t), we classify the iterations in the following way:

Definition 4.
For given constants 0 ≤ ν_+ ≤ ν_{++} < 1 with ν_{++} ≠ 0, we call the iteration with index t ∈ N of Algorithm 1 …
• … successful if ρ^(t) ≥ ν_{++}. The set of successful indices is S = { t ∈ N : ρ^(t) ≥ ν_{++} } ⊆ N.
• … model-improving if ρ^(t) < ν_{++} and the models m^(t) = [m_1^(t), …, m_k^(t)]^T are not fully linear. In these iterations the trust region radius is not changed.
• … acceptable if ν_{++} > ρ^(t) ≥ ν_+ and the models m^(t) are fully linear. If ν_{++} = ν_+ ∈ (0, 1), then there are no acceptable indices.
• … inacceptable otherwise, i.e., if ρ^(t) < ν_+ and m^(t) are fully linear.

4.2. Fully Linear Lagrange Polynomials

Quadratic Taylor polynomial models are used very frequently. As explained in [27], we can alternatively use multivariate interpolating Lagrange polynomial models when derivative information is not available. We will consider first and second degree Lagrange models. Even though the latter require O(n²) function evaluations, they are still cheaper than second degree finite difference models. For this reason, these models are also used in [32,37].
To construct an interpolating polynomial model we have to provide p data sites, where p is the dimension of the space Π_n^d of real-valued n-variate polynomials with degree d. For d = 1 it is p = n + 1 and for d = 2 it is p = (n + 1)(n + 2)/2. For n ≥ 2, the Mairhuber-Curtis theorem [38] applies and the data sites must form a so-called poised set in X. The set Ξ = {ξ^1, …, ξ^p} ⊂ R^n is poised if for any basis {ψ_i} of Π_n^d the matrix M_ψ := [ψ_i(ξ^j)]_{1 ≤ i,j ≤ p} is non-singular. Then there is a unique polynomial m(x) = Σ_{i=1}^p λ_i ψ_i(x) with m(ξ^j) = F(ξ^j)

Algorithm 1:
General TRM for (MOP)

Configuration: Criticality parameters ε_crit > 0, µ > 0, β > 0; acceptance parameters 1 > ν_{++} ≥ ν_+ ≥ 0, ν_{++} ≠ 0; update factors γ_↑ ≥ 1 > γ_↓ ≥ γ_⇓ > 0; maximum radius Δ_ub > 0.
Input: The initial site x^(0) ∈ R^n.
for t = 0, 1, … do
  if t > 0 and iteration (t − 1) was model-improving (cf. Definition 4) then
    Perform at least one improvement step on m^(t−1) and then let m^(t) ← m^(t−1);
  else
    Construct surrogate models m^(t) on B^(t);
  end
  /* Criticality Step: */
  if ω_m^(t)(x^(t)) < ε_crit and ( m^(t) not fully linear or Δ^(t) > µ ω_m^(t)(x^(t)) ) then
    Set Δ^(t)_* ← Δ^(t);
    Call criticalityRoutine() so that m^(t) is fully linear on B^(t) with Δ^(t) ∈ (0, µ ω_m^(t)(x^(t))];
    Then set Δ^(t) ← min{ max{ Δ^(t), β ω_m^(t)(x^(t)) }, Δ^(t)_* };
  end
  Compute a suitable descent step s^(t);
  Set x^(t)+ ← x^(t) + s^(t), evaluate f(x^(t)+) and compute ρ^(t) with (3);
  Perform the following updates:
    x^(t+1) ← x^(t)   if ρ^(t) < ν_+,
    x^(t+1) ← x^(t)+  if ν_+ ≤ ρ^(t) < ν_{++} and m^(t) is fully linear,
    x^(t+1) ← x^(t)+  if ν_{++} ≤ ρ^(t);
    Δ^(t+1) ← Δ_+, where
      Δ_+ = Δ^(t)                              if ρ^(t) < ν_{++} and m^(t) is not fully linear,
      Δ_+ ∈ [γ_⇓ Δ^(t), γ_↓ Δ^(t)]             if ρ^(t) < ν_{++} and m^(t) is fully linear,
      Δ_+ ∈ [Δ^(t), min{γ_↑ Δ^(t), Δ_ub}]      if ν_{++} ≤ ρ^(t) and Δ^(t) ≥ β ω_m^(t)(x^(t)),
      Δ_+ = min{γ_↑ Δ^(t), Δ_ub}               if ν_{++} ≤ ρ^(t) and Δ^(t) < β ω_m^(t)(x^(t)).
end
for all j = 1, …, p and any function F : R^n → R. Given a poised set Ξ, the associated Lagrange basis {l_i} of Π_n^d is defined by l_i(ξ^j) = δ_{i,j}. The model coefficients then simply are the data values, i.e., λ_i = F(ξ^i).
Same as in [37], we implement Algorithm 6.2 from [27] to ensure poisedness. It selects training sites Ξ from the current (slightly enlarged) trust region of radius θ₁ Δ^(t) and calculates the associated Lagrange basis. We can then separately evaluate the true objectives f_ℓ on Ξ to easily build the surrogates m_ℓ^(t), ℓ ∈ {1, …, k}. Our implementation always includes ξ^1 = x^(t) and tries to select points from a database of prior evaluations first.
We employ an additional algorithm (Algorithm 6.3 in [27]) to ensure that the set Ξ is even Λ-poised, see [27, Definition 3.6]. The procedure is still finite and ensures the models are

Procedure criticalityRoutine()
Configuration: backtracking constant α ∈ (0, 1); µ > 0, β > 0 as in Algorithm 1.
Δ ← Δ^(t);
for j = 1, 2, … do
  Set radius: Δ^(t) ← α^{j−1} Δ;
  Make models m^(t) fully linear on B^(t);  /* can change ω_m^(t)(x^(t)) */
  if Δ^(t) ≤ µ ω_m^(t)(x^(t)) then
    Break;
  end
end

actually fully linear. The quality of the surrogate models can be improved by choosing a small algorithm parameter Λ > 1. Our implementation again tries to recycle points from a database. Different to before, interpolation at x^(t) can no longer be guaranteed. This second step can also be omitted first and then used as a model-improvement step in a subsequent iteration.
The main drawback of quadratic Lagrange models is that we still need O(n²) function evaluations in each iteration of Algorithm 1. A possible fix is to use under-determined regression polynomials instead [27,31,39]. Motivated by the findings in [33], we chose so-called Radial Basis Function (RBF) models as an alternative. RBF are well-known for their approximation capabilities on irregular data [38]. In our implementation they have the form

  m(x) = Σ_{i=1}^N c_i φ(‖x − ξ^i‖) + π(x),  π = Σ_i λ_i ψ_i ∈ Π_n^d,  d ∈ {0, 1},  N ∈ N,   (4)

where φ is a function from R_{≥0} to R. For a fixed φ, the mapping φ(‖•‖) from R^n to R is radially symmetric with respect to its argument, and the mapping (x, ξ) ↦ φ(‖x − ξ‖) is called a kernel.
Wild et al. [33] describe a construction of RBF surrogate models as in (4) (see also [40] and the dissertation [39] for more details). If we restrict ourselves to functions φ(‖•‖) that are conditionally positive definite (c.p.d. – see [33,38] for the definition) of order at most two, then the surrogates can be made certifiably fully linear with N = n + 1. As before, the algorithm tries to select an initial training set Ξ = {ξ^1, …, ξ^N} ⊂ B(x^(t); θ₁ Δ^(t)) with N = n + 1 and θ₁ ≥ 1. The set must be poised for interpolation with affine linear polynomials. Due to φ(‖•‖) being c.p.d. of order D ≤ 2, the interpolation system

  [ Φ  M_ψ^T ; M_ψ  0 ] · [ c ; λ ] = [ F(Ξ) ; 0 ] ∈ R^{N+p},
  c = [c_i]_{1≤i≤N},  λ = [λ_i]_{1≤i≤p},  Φ = [φ(‖ξ^i − ξ^j‖)]_{1≤i,j≤N},  M_ψ = [ψ_i(ξ^j)]_{1≤i≤p, 1≤j≤N},

is uniquely solvable for any F : R^n → R if we choose Π_n^d such that d ≥ max{D − 1, 0}. We can even include more points, N ≥ n + 1, from within a region of maximum radius θ₂ Δ_ub, θ₂ ≥ θ₁ ≥ 1, to capture nonlinear behavior of F. More detailed explanations can be found in [33]. Modifications for box constraints are shown in [39] and [41].

Table 1.
Some radial functions φ : R_{≥0} → R that are c.p.d. of order D ≤ 2, cf. [33].

Name | φ(r) | c.p.d. order D | deg π
Cubic | r³ | 2 | 1
Multiquadric | −√(1 + (αr)²), α > 0 | 1 | {0, 1}
Gaussian | exp(−(αr)²), α > 0 | 0 | {0, 1}

Table 1 shows the RBF we are using and the possible polynomial degrees for π. Both the Gaussian and the Multiquadric allow for fine-tuning with a shape parameter α > 0. This can potentially improve the conditioning of the interpolation system. Figure 1 (b) illustrates the effect of the shape parameter. As can be seen, the radial functions become narrower for larger shape parameters. Hence, we do not only use a constant shape parameter as Wild et al. [33] do, but we also use an α that is (within lower and upper bounds) inversely proportional to Δ^(t). Figure 1 (a) shows interpolation of a nonlinear function by a surrogate based on the Multiquadric with a linear tail.

Figure 1. (a) Interpolation of a nonlinear function by a Multiquadric surrogate based on 5 discrete training points. Dashed lines show the kernels and the polynomial tail. (b) Different kernels in 1D with varying shape parameter (1 or 10), see also Table 1.
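To make the construction concrete, the following sketch assembles and solves the interpolation system from above for the cubic kernel φ(r) = r³ with a linear tail (d = 1). The training sites and test function are hypothetical, and this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def fit_rbf(Xi, F, phi=lambda r: r ** 3):
    """Fit an RBF surrogate m(x) = sum_i c_i * phi(||x - xi_i||) + pi(x)
    with a linear polynomial tail pi, by solving the saddle point system
    [[Phi, P], [P.T, 0]] [c; lam] = [F; 0], where P = M_psi^T."""
    Xi = np.asarray(Xi, dtype=float)        # training sites, shape (N, n)
    N, n = Xi.shape
    dists = np.linalg.norm(Xi[:, None, :] - Xi[None, :, :], axis=2)
    Phi = phi(dists)                        # kernel matrix, shape (N, N)
    P = np.hstack([np.ones((N, 1)), Xi])    # linear tail basis 1, x_1, ..., x_n
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([np.asarray(F, dtype=float), np.zeros(n + 1)])
    coef = np.linalg.solve(A, rhs)
    c, lam = coef[:N], coef[N:]

    def m(x):
        xv = np.asarray(x, dtype=float)
        r = np.linalg.norm(Xi - xv, axis=1)
        return float(phi(r) @ c + lam[0] + lam[1:] @ xv)

    return m

# Hypothetical training data in R^2 (distinct sites that affinely span R^2):
Xi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]])
F = np.array([np.sin(x[0]) + x[1] ** 2 for x in Xi])
m = fit_rbf(Xi, F)
```

Because the cubic kernel is c.p.d. of order D = 2 and the tail degree satisfies d = 1 ≥ D − 1, the system is nonsingular whenever the sites are distinct and affinely span R^n, so m interpolates F exactly at the training sites.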
5. Descent Steps
In this section, we introduce some possible steps s^(t) to use in Algorithm 1. We begin by defining the best step along the steepest descent direction as given by (Pm). Subsequently, backtracking variants are defined that use a multiobjective variant of Armijo's rule. Both the Pareto-Cauchy point as well as a backtracking variant, the modified Pareto-Cauchy point, are points along the descent direction d_m^(t) within B^(t) so that a sufficient decrease measured by Φ_m^(t)(•) and ω_m^(t)(•) is achieved. Under mild assumptions we can then derive a decrease in terms of ω(•).

Definition 5. For t ∈ N, let d_m^(t) be a minimizer for (Pm). The best attainable trial point x_PC^(t) along d_m^(t) is called the Pareto-Cauchy point and is given by

  x_PC^(t) := x^(t) + σ^(t) · d_m^(t),  σ^(t) = arg min_{0 ≤ σ} Φ_m^(t)( x^(t) + σ · d_m^(t) )  s.t. x_PC^(t) ∈ B^(t).   (5)

Let σ^(t) be the minimizer in (5). We call s_PC^(t) := σ^(t) d_m^(t) the Pareto-Cauchy step.

If we make the following standard assumption, then the Pareto-Cauchy point allows for a lower bound on the improvement in terms of Φ_m^(t).

Assumption 2.
For all t ∈ N, the surrogates m^(t)(x) = [m_1^(t)(x), …, m_k^(t)(x)]^T are twice continuously differentiable on an open set containing X. Denote by H m_ℓ^(t)(x) the Hessian of m_ℓ^(t) for ℓ = 1, …, k.

Theorem 3. If Assumptions 1 and 2 are satisfied, then for any iterate x^(t) the Pareto-Cauchy point x_PC^(t) satisfies

  Φ_m^(t)(x^(t)) − Φ_m^(t)(x_PC^(t)) ≥ (1/2) ω_m^(t)(x^(t)) · min{ ω_m^(t)(x^(t)) / (c H_m^(t)), Δ^(t), 1 },   (6)

where

  H_m^(t) = max_{ℓ=1,…,k} max_{x ∈ B^(t)} ‖H m_ℓ^(t)(x)‖_F   (7)

and the constant c > 0 relates the trust region norm ‖•‖ to the Euclidean norm ‖•‖₂ via

  ‖x‖₂ ≤ √c ‖x‖  ∀ x ∈ R^n.   (8)

If ‖•‖ = ‖•‖_∞ is used, then c can be chosen as c = n. The proof for Theorem 3 is provided after the next auxiliary lemma.

Lemma 2.
Under Assumptions 1 and 2, let d be a non-increasing direction at x^(t) ∈ R^n for m^(t), i.e.,

  ⟨∇m_ℓ^(t)(x^(t)), d⟩ ≤ 0  ∀ ℓ = 1, …, k.

Let q ∈ {1, …, k} be any objective index and σ̄ ≥ min{ Δ^(t), ‖d‖ }. Then it holds that

  m_q^(t)(x^(t)) − min_{0 < σ < σ̄} m_q^(t)( x^(t) + σ d/‖d‖ ) ≥ (w/2) min{ w / (‖d‖ c H_m^(t)), Δ^(t)/‖d‖, 1 },

where we have used the shorthand notation

  w = − max_{ℓ=1,…,k} ⟨∇m_ℓ^(t)(x^(t)), d⟩ ≥ 0.

Lemma 2 states that a minimizer along any non-increasing direction d achieves a minimum reduction w.r.t. Φ_m^(t). Similar results can be found in [30] or [32]. But since we do not use polynomial surrogates m^(t), we have to employ the multivariate version of Taylor's theorem to make the proof work. We can do this because, according to Assumption 2, the functions m_q^(t), q ∈ {1, …, k}, are twice continuously differentiable in an open domain containing X. Moreover, Assumption 1 ensures that the function is defined on the line from χ to x. As shown in [42, Ch. 3], a first degree expansion at x ∈ B(χ; Δ) around χ ∈ X then leads to

  m_q^(t)(x) = m_q^(t)(χ) + ∇m_q^(t)(χ)^T h + (1/2) h^T H m_q^(t)(ξ_q) h,  with h = x − χ,   (9)

for some ξ_q ∈ { x + θ(χ − x) : θ ∈ [0, 1] }, for all q = 1, …, k.

Proof of Lemma 2.
Let the requirements of Lemma 2 hold and let $d$ be a non-increasing direction for $m^{(t)}$. Then:
\[
m_q^{(t)}(x^{(t)}) - \min_{0<\sigma<\bar\sigma} m_q^{(t)}\Big(x^{(t)} + \sigma\frac{d}{\|d\|}\Big) = \max_{0\le\sigma\le\bar\sigma} \Big\{ m_q^{(t)}(x^{(t)}) - m_q^{(t)}\Big(x^{(t)} + \sigma\frac{d}{\|d\|}\Big) \Big\}
\]
\[
\overset{(9)}{=} \max_{0\le\sigma\le\bar\sigma} \Big\{ -\frac{\sigma}{\|d\|}\big\langle \nabla m_q^{(t)}(x^{(t)}), d\big\rangle - \frac{\sigma^2}{2\|d\|^2}\big\langle d,\, Hm_q^{(t)}(\xi_q)\, d\big\rangle \Big\}
\ge \max_{0\le\sigma\le\bar\sigma} \Big\{ -\frac{\sigma}{\|d\|}\max_{j=1,\ldots,k}\big\langle \nabla m_j^{(t)}(x^{(t)}), d\big\rangle - \frac{\sigma^2}{2\|d\|^2}\big\langle d,\, Hm_q^{(t)}(\xi_q)\, d\big\rangle \Big\}.
\]
We use the shorthand $w = -\max_j \langle \nabla m_j^{(t)}(x^{(t)}), d\rangle$ and the Cauchy–Schwarz inequality to get
\[
\ldots \ge \max_{0\le\sigma\le\bar\sigma} \Big\{ \frac{\sigma}{\|d\|}\, w - \frac{\sigma^2}{2\|d\|^2}\,\|d\|_2^2\,\big\|Hm_q^{(t)}(\xi_q)\big\|_F \Big\} \overset{(8),(7)}{\ge} \max_{0\le\sigma\le\bar\sigma} \Big\{ \frac{\sigma}{\|d\|}\, w - \frac{\sigma^2}{2}\, c\, H_m^{(t)} \Big\}.
\]
The RHS is concave in $\sigma$, and we can thus easily determine the global maximizer $\sigma^*$. Similar to [30, Lemma 4.1] we find
\[
m_q^{(t)}(x^{(t)}) - \min_{0<\sigma<\bar\sigma} m_q^{(t)}\Big(x^{(t)} + \sigma\frac{d}{\|d\|}\Big) \ge \frac{w}{2}\min\Big\{ \frac{w}{\|d\|^2\, c\, H_m^{(t)}},\ \frac{\Delta^{(t)}}{\|d\|},\ 1 \Big\},
\]
where we have additionally used $\bar\sigma \ge \min\{\Delta^{(t)}, \|d\|\}$. $\square$

Proof of Theorem 3. If $x^{(t)}$ is Pareto-critical for (MOPm), then $d_m^{(t)} = 0$ and $\omega_m^{(t)}(x^{(t)}) = 0$, so that (6) holds trivially. Otherwise, let $\ell, q \in \{1,\ldots,k\}$ be such that
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x_{PC}^{(t)}) = m_\ell^{(t)}(x^{(t)}) - m_q^{(t)}(x_{PC}^{(t)}) \ge m_q^{(t)}(x^{(t)}) - m_q^{(t)}(x_{PC}^{(t)}),
\]
and define
\[
\bar\sigma := \begin{cases} \min\big\{\Delta^{(t)}, \|d_m^{(t)}\|\big\} & \text{if } \|d_m^{(t)}\| < \Delta^{(t)},\\ \Delta^{(t)} & \text{else.} \end{cases} \tag{10}
\]
Then clearly $\bar\sigma \ge \min\{\Delta^{(t)}, \|d_m^{(t)}\|\}$, and for the Pareto–Cauchy point we have
\[
m_q^{(t)}\big(x_{PC}^{(t)}\big) = \min_{0\le\sigma\le\bar\sigma} m_q^{(t)}\Big(x^{(t)} + \sigma\,\frac{d_m^{(t)}}{\|d_m^{(t)}\|}\Big).
\]
From Lemma 2 and $\|d_m^{(t)}\| \le 1$ the bound (6) immediately follows. $\square$

Remark 3.
Some authors define the Pareto–Cauchy point as the actual minimizer $x_{\min}^{(t)}$ of $\Phi_m^{(t)}$ within the current trust region (instead of the minimizer along the steepest descent direction). For this true minimizer the same bound (6) holds. This is due to
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x_{\min}^{(t)}) = m_\ell^{(t)}(x^{(t)}) - \min_{x \in B^{(t)}} m_q^{(t)}(x) \ge m_q^{(t)}(x^{(t)}) - m_q^{(t)}(x_{PC}^{(t)}).
\]
A common approach in trust region methods is to find an approximate solution to (5) within the current trust region. Usually a backtracking approach similar to Armijo's inexact line-search is used for the Pareto–Cauchy subproblem. Doing so, we can still guarantee a sufficient decrease.

Before we actually define the backtracking step along $d_m^{(t)}$, we derive a more general lemma. It illustrates that backtracking along any suitable direction is well-defined.

Lemma 3.
Suppose Assumptions 1 and 2 hold. For $x^{(t)} \in \mathbb{R}^n$, let $d$ be a descent direction for $m^{(t)}$, let $q \in \{1,\ldots,k\}$ be any objective index, and let $\bar\sigma > 0$. Then there is an integer $j \in \mathbb{N}$ such that
\[
\Psi\Big(x^{(t)} + b^j \frac{\bar\sigma}{\|d\|}\, d\Big) \le \Psi(x^{(t)}) - a\, b^j \frac{\bar\sigma}{\|d\|}\, w, \qquad a, b \in (0,1), \tag{11}
\]
where, again, we have used the shorthand notation $w = -\max_{\ell=1,\ldots,k} \langle \nabla m_\ell^{(t)}(x^{(t)}), d\rangle > 0$, and $\Psi$ is either some specific model, $\Psi = m_\ell^{(t)}$, or the maximum value, $\Psi = \Phi_m^{(t)}$. Moreover, if we define the step $s^{(t)} = b^j \frac{\bar\sigma}{\|d\|}\, d$ for the smallest $j \in \mathbb{N}$ satisfying (11), then there is a constant $\kappa_{sd}^m \in (0,1)$ such that
\[
\Psi(x^{(t)}) - \Psi\big(x^{(t)} + s^{(t)}\big) \ge \kappa_{sd}^m\, w\, \min\Big\{ \frac{w}{\|d\|^2\, c\, H_m^{(t)}},\ \frac{\bar\sigma}{\|d\|} \Big\}. \tag{12}
\]

Proof.
The first part can be derived from the fact that $d$ is a descent direction, see e.g. [6]. However, we will use the approach from [30] to also derive the bound (12). With Taylor's theorem we obtain
\[
\Psi\Big(x^{(t)} + b^j\frac{\bar\sigma}{\|d\|}\, d\Big) = m_\ell^{(t)}\Big(x^{(t)} + b^j\frac{\bar\sigma}{\|d\|}\, d\Big) \qquad (\text{for some } \ell \in \{1,\ldots,k\})
\]
\[
= m_\ell^{(t)}(x^{(t)}) + b^j\frac{\bar\sigma}{\|d\|}\big\langle \nabla m_\ell^{(t)}(x^{(t)}), d\big\rangle + \frac{(b^j\bar\sigma)^2}{2\|d\|^2}\big\langle d,\, Hm_\ell^{(t)}(\xi_\ell)\, d\big\rangle
\]
\[
\le \Psi(x^{(t)}) + \max_{q=1,\ldots,k} b^j\frac{\bar\sigma}{\|d\|}\big\langle \nabla m_q^{(t)}(x^{(t)}), d\big\rangle + \max_{q=1,\ldots,k} \frac{(b^j\bar\sigma)^2}{2\|d\|^2}\big\langle d,\, Hm_q^{(t)}(\xi_q)\, d\big\rangle
\overset{\text{(Pm)},(7)}{\le} \Psi(x^{(t)}) - b^j\frac{\bar\sigma}{\|d\|}\, w + \frac{(b^j\bar\sigma)^2}{2}\, c\, H_m^{(t)}. \tag{13}
\]
In the last line, we have additionally used the Cauchy–Schwarz inequality.

For a constructive proof, suppose now that (11) is violated for some $j \in \mathbb{N}$, i.e.,
\[
\Psi\Big(x^{(t)} + b^j\frac{\bar\sigma}{\|d\|}\, d\Big) > \Psi(x^{(t)}) - a\, b^j\frac{\bar\sigma}{\|d\|}\, w.
\]
Plugging in (13) for the LHS and subtracting $\Psi(x^{(t)})$ then leads to
\[
b^j > \frac{2(1-a)\, w}{\|d\|\,\bar\sigma\, c\, H_m^{(t)}},
\]
where the right hand side is positive and completely independent of $j$. Since $b \in (0,1)$, there must be a $j^* \in \mathbb{N}$, $j^* > j$, for which $b^{j^*} \le \frac{2(1-a)\, w}{\|d\|\,\bar\sigma\, c\, H_m^{(t)}}$, so that (11) must also be fulfilled for this $b^{j^*}$. Analogous to the proof of [30, Lemma 4.2], we can now derive the constant $\kappa_{sd}^m$ from (12) as $\kappa_{sd}^m = \min\{a,\ 2ab(1-a)\}$. $\square$

Lemma 3 applies naturally to the step along $d_m^{(t)}$:

Definition 6.
For $x^{(t)} \in B^{(t)}$, let $d_m^{(t)}$ be a solution to (Pm) and define the modified Pareto–Cauchy step as
\[
\tilde s_{PC}^{(t)} := b^j\,\bar\sigma\, \frac{d_m^{(t)}}{\|d_m^{(t)}\|},
\]
where again $\bar\sigma$ is as in (10) and $j \in \mathbb{N}$ is the smallest integer that satisfies
\[
\Phi_m^{(t)}(x^{(t)} + \tilde s_{PC}^{(t)}) \le \Phi_m^{(t)}(x^{(t)}) - a\, b^j \frac{\bar\sigma}{\|d_m^{(t)}\|}\, \omega_m^{(t)}\big(x^{(t)}\big) \tag{14}
\]
for predefined constants $a, b \in (0,1)$.

The definition of $\bar\sigma$ ensures that $x^{(t)} + \tilde s_{PC}^{(t)}$ is contained in the current trust region $B^{(t)}$. Furthermore, these steps provide a sufficient decrease very similar to (6):

Corollary 1.
Suppose Assumptions 1 and 2 hold. For the step $\tilde s_{PC}^{(t)}$ the following statements are true:
1. A $j \in \mathbb{N}$ as in (14) exists.
2. There is a constant $\kappa_{sd}^m \in (0,1)$ such that the modified Pareto–Cauchy step $\tilde s_{PC}^{(t)}$ satisfies
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)} + \tilde s_{PC}^{(t)}) \ge \kappa_{sd}^m\, \omega_m^{(t)}\big(x^{(t)}\big) \min\Big\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, H_m^{(t)}},\ \Delta^{(t)},\ 1 \Big\}.
\]

Proof. If $x^{(t)}$ is critical, then the bound is trivial. Otherwise, the existence of a $j$ satisfying (14) follows from Lemma 3 for $\Psi = \Phi_m^{(t)}$. The lower bound on the decrease follows immediately from $\bar\sigma \ge \min\{\|d_m^{(t)}\|, \Delta^{(t)}\}$. $\square$

From Lemma 3 it follows that the backtracking condition (14) can be modified to explicitly require a decrease in every objective:
Definition 7.
Let $j \in \mathbb{N}$ be the smallest integer satisfying
\[
\min_{\ell=1,\ldots,k}\Big\{ m_\ell^{(t)}(x^{(t)}) - m_\ell^{(t)}\Big(x^{(t)} + b^j\bar\sigma\, \frac{d_m^{(t)}}{\|d_m^{(t)}\|}\Big)\Big\} \ge a\, b^j \frac{\bar\sigma}{\|d_m^{(t)}\|}\, \omega_m^{(t)}\big(x^{(t)}\big).
\]
We define the strict modified Pareto–Cauchy point as $\hat x_{PC}^{(t)} = x^{(t)} + \hat s_{PC}^{(t)}$ and the corresponding step as
\[
\hat s_{PC}^{(t)} = b^j\,\bar\sigma\, \frac{d_m^{(t)}}{\|d_m^{(t)}\|}.
\]

Corollary 2.
Suppose Assumptions 1 and 2 hold.
1. The strict modified Pareto–Cauchy point exists, i.e., the backtracking is finite.
2. There is a constant $\kappa_{sd}^m \in (0,1)$ such that
\[
\min_{\ell=1,\ldots,k}\Big\{ m_\ell^{(t)}(x^{(t)}) - m_\ell^{(t)}\big(\hat x_{PC}^{(t)}\big)\Big\} \ge \kappa_{sd}^m\, \omega_m^{(t)}\big(x^{(t)}\big) \min\Big\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, H_m^{(t)}},\ \Delta^{(t)},\ 1 \Big\}. \tag{15}
\]

Remark 4.
In the preceding subsections, we have shown descent steps along the model steepest descent direction. Similar to the single objective case, we do not necessarily have to use the steepest descent direction, and different step calculation methods are viable. For instance, Thomann and Eichfelder [32] use the well-known Pascoletti–Serafini scalarization to solve the subproblem (MOPm). We refer to their work and Appendix B to see how this method can be related to the steepest descent direction.

5.3. Sufficient Decrease for the Original Problem
In the previous subsections, we have shown how to compute steps $s^{(t)}$ that achieve a sufficient decrease in terms of $\Phi_m^{(t)}$ and $\omega_m^{(t)}(\bullet)$. For a descent step $s^{(t)}$ the bound is of the form
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)} + s^{(t)}) \ge \kappa_{sd}^m\, \omega_m^{(t)}\big(x^{(t)}\big) \min\Big\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, H_m^{(t)}},\ \Delta^{(t)},\ 1 \Big\}, \quad \kappa_{sd}^m \in (0,1), \tag{16}
\]
and is thereby very similar to the bounds for the scalar projected gradient trust region method [35]. By introducing a slightly modified version of $\omega_m^{(t)}(\bullet)$, we can transform (16) into the bound used in [32] and [30].

Lemma 4. If $\pi(t, x^{(t)})$ is a criticality measure for some multiobjective problem, then $\tilde\pi(t, x^{(t)}) = \min\{\pi(t, x^{(t)}),\, 1\}$ is also a criticality measure for the same problem.

Proof.
We have $0 \le \tilde\pi(t, x^{(t)}) \le \pi(t, x^{(t)})$. Thus, $\tilde\pi(t, x^{(t)}) \to 0$ if and only if $\pi(t, x^{(t)}) \to 0$. The minimum of uniformly continuous functions is again uniformly continuous. $\square$

We next make another standard assumption on the class of surrogate models.
Assumption 3.
The norm of all model Hessians is uniformly bounded above on $\mathcal{X}$, i.e., there is a positive constant $H_m$ such that
\[
\big\| Hm_\ell^{(t)}(x) \big\|_F \le H_m \quad \forall \ell = 1,\ldots,k,\ \forall x \in B^{(t)},\ \forall t \in \mathbb{N}.
\]
W.l.o.g., we assume
\[
H_m \cdot c \ge 1, \quad \text{with } c \text{ as in (8)}. \tag{17}
\]

Remark 5.
From this assumption it follows that the model gradients are then Lipschitz as well. Together with Theorem 2, we then know that $\omega_m^{(t)}(\bullet)$ is a criticality measure for (MOPm).

Motivated by the previous remark, we will from now on refer to the following functions:
\[
\upsilon(x) := \min\{\omega(x),\, 1\} \quad \text{and} \quad \upsilon_m^{(t)}(x) := \min\big\{\omega_m^{(t)}(x),\, 1\big\} \quad \forall t = 0, 1, \ldots \tag{18}
\]
We can thereby derive the sufficient decrease condition in "standard form":
Corollary 3.
Under Assumption 3, suppose that for $x^{(t)}$ and some descent step $s^{(t)}$ the bound (16) holds. For the criticality measure $\upsilon_m^{(t)}(\bullet)$ it follows that
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)} + s^{(t)}) \ge \kappa_{sd}^m\, \upsilon_m^{(t)}\big(x^{(t)}\big) \min\Big\{ \frac{\upsilon_m^{(t)}\big(x^{(t)}\big)}{c\, H_m},\ \Delta^{(t)} \Big\}. \tag{19}
\]

Proof. $\upsilon_m^{(t)}(\bullet)$ is a criticality measure due to Assumption 3 and Lemma 4. Further, $\omega_m^{(t)} \ge \upsilon_m^{(t)}$ and $H_m \ge H_m^{(t)}$, and from (18) and (17) it follows that
\[
\frac{\upsilon_m^{(t)}\big(x^{(t)}\big)}{c\, H_m} \le \frac{1}{c\, H_m} \le 1,
\]
so that the constant $1$ in the minimum of (16) is redundant. $\square$

To relate $\upsilon_m^{(t)}(\bullet)$ to the criticality measure $\omega(\bullet)$ of the original problem, we require another assumption.

Assumption 4.
There is a constant $\kappa_\omega > 0$ such that
\[
\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \kappa_\omega\, \omega_m^{(t)}\big(x^{(t)}\big).
\]
This assumption is also made by Thomann and Eichfelder [32] and can easily be justified by using fully linear surrogate models and a bounded trust region radius in combination with a criticality test, see Lemma 7.

Assumption 4 can be used to formulate the next two lemmata relating the model criticality and the true criticality. They are proven in Appendix A.2. From these lemmata and Corollary 3, the final result, Corollary 4, easily follows.

Lemma 5.
If Assumption 4 holds, then it holds for $\upsilon_m^{(t)}(\bullet)$ and $\upsilon(\bullet)$ from (18) that
\[
\big| \upsilon_m^{(t)}\big(x^{(t)}\big) - \upsilon\big(x^{(t)}\big) \big| \le \kappa_\omega\, \upsilon_m^{(t)}\big(x^{(t)}\big).
\]

Lemma 6.
From Assumption 4 it follows that
\[
\upsilon_m^{(t)}\big(x^{(t)}\big) \ge \frac{1}{\kappa_\omega + 1}\, \upsilon\big(x^{(t)}\big), \quad \text{with } (\kappa_\omega + 1)^{-1} \in (0,1).
\]

Corollary 4.
Suppose that Assumptions 3 and 4 hold and that $x^{(t)}$ and $s^{(t)}$ satisfy (19). Then
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)} + s^{(t)}) \ge \kappa_{sd}\, \upsilon\big(x^{(t)}\big) \min\Big\{ \frac{\upsilon\big(x^{(t)}\big)}{c\, H_m},\ \Delta^{(t)} \Big\}, \tag{20}
\]
where $\kappa_{sd} = \dfrac{\kappa_{sd}^m}{1 + \kappa_\omega} \in (0,1)$.
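To make the step computation of this section concrete, the following is a minimal numerical sketch (in Python with NumPy/SciPy, whereas the reference implementation discussed in Section 7 is written in Julia). It computes a multiobjective steepest descent direction $d$ and the criticality value $\omega$ from a small linear program over the $\|\cdot\|_\infty$-ball, then backtracks along $d$ until the maximum of the models has decreased sufficiently, in the spirit of condition (14). The toy quadratic models, all constants, and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def steepest_descent_direction(grads):
    """grads: (k, n) array of model gradients at the current iterate.
    Solves  min_{||d||_inf <= 1} max_j <g_j, d>  as an LP in (d, beta)."""
    k, n = grads.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                   # minimize beta
    A_ub = np.hstack([grads, -np.ones((k, 1))])   # <g_j, d> - beta <= 0
    b_ub = np.zeros(k)
    bounds = [(-1.0, 1.0)] * n + [(None, None)]   # box ||d||_inf <= 1, beta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    d, beta = res.x[:n], res.x[-1]
    return d, -beta                               # criticality value omega = -beta

def backtrack(models, x, d, omega, sigma_bar, a=1e-4, b=0.5, max_iter=50):
    """Shrink the step by factors of b until the max-model decreases
    sufficiently (Armijo-like test in the spirit of (14))."""
    phi = max(m(x) for m in models)
    nrm = np.linalg.norm(d, np.inf)
    for j in range(max_iter):
        step = (b ** j) * sigma_bar / nrm
        trial = x + step * d
        if max(m(trial) for m in models) <= phi - a * step * omega:
            return trial
    return x  # near-critical: no admissible step found

# toy quadratic stand-ins for the surrogate models of two objectives
models = [lambda z: (z[0] - 1.0) ** 2 + z[1] ** 2,
          lambda z: (z[0] + 1.0) ** 2 + z[1] ** 2]
x = np.array([0.0, 1.0])
grads = np.array([[2 * (x[0] - 1.0), 2 * x[1]],
                  [2 * (x[0] + 1.0), 2 * x[1]]])
d, omega = steepest_descent_direction(grads)   # d = (0, -1), omega = 2
x_new = backtrack(models, x, d, omega, sigma_bar=1.0)
```

For the toy models above, the direction points straight down in the coordinate both objectives agree on, and already the first trial step passes the decrease test.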
6. Convergence
To prove convergence of Algorithm 1, we first have to make sure that at least one of the objectives is bounded from below:
Assumption 5.
The maximum $\max_{\ell=1,\ldots,k} f_\ell(x)$ of all objective functions is bounded from below on $\mathcal{X}$.

To be able to use $\upsilon(\bullet)$ as a criticality measure and to refer to fully linear models, we further require:

Assumption 6.
The objective $f \colon \mathbb{R}^n \to \mathbb{R}^k$ is continuously differentiable in an open domain containing $\mathcal{X}$ and has a Lipschitz continuous gradient on $\mathcal{X}$.

We summarize the assumptions on the surrogates as follows:
Assumption 7.
The surrogate model functions $m_1^{(t)}, \ldots, m_k^{(t)}$ belong to a fully linear class $\mathcal{M}$ as defined in Definition 3. For each objective index $\ell \in \{1,\ldots,k\}$, the error constants are then denoted by $\epsilon_\ell$ and $\dot\epsilon_\ell$. For the subsequent analysis we define the component-wise maximum constants
\[
\epsilon := \max_{\ell=1,\ldots,k} \epsilon_\ell, \qquad \dot\epsilon := \max_{\ell=1,\ldots,k} \dot\epsilon_\ell. \tag{21}
\]
We also wish for the descent steps to fulfill a sufficient decrease condition for the surrogate criticality measure, as discussed in Section 5.

Assumption 8.
For all $t \in \mathbb{N}$, the descent steps $s^{(t)}$ are assumed to fulfill both $x^{(t)} + s^{(t)} \in B^{(t)}$ and (19).

Finally, to avoid cluttered notation when dealing with subsequences, we define the shorthand
\[
\upsilon_m^{(t)} := \upsilon_m^{(t)}\big(x^{(t)}\big), \qquad \upsilon^{(t)} := \upsilon\big(x^{(t)}\big) \quad \forall t \in \mathbb{N}.
\]
In the following, we prove convergence of Algorithm 1 to Pareto-critical points. We account for the case that no criticality test is used, i.e., $\varepsilon_{\mathrm{crit}} = 0$. We then require all surrogates to be fully linear in each iteration and need Assumption 4. The proof is an adapted version of the scalar case in [34]. It is also similar to the proofs for the multiobjective algorithms in [30,32]. However, in both cases, no criticality test is employed, there is no distinction between successful and acceptable iterations ($\nu_+ = \nu_{++}$), and interpolation at $x^{(t)}$ by the surrogates is required. We indicate notable differences when appropriate.

We start with two results concerning the criticality test in Algorithm 1.

Lemma 7.
Outside the criticalityRoutine, Assumption 4 is fulfilled if the model $m^{(t)}$ is fully linear (and if $\Delta^{(t)} \le \Delta^{ub} < \infty$).

Proof.
Let $\ell, q \in \{1,\ldots,k\}$ and $d_\ell, d_q \in \mathcal{X} - x^{(t)}$ be solutions of (Pm) and (P1), respectively, such that
\[
\omega_m^{(t)}\big(x^{(t)}\big) = -\big\langle \nabla m_\ell^{(t)}(x^{(t)}),\, d_\ell \big\rangle, \qquad \omega\big(x^{(t)}\big) = -\big\langle \nabla f_q(x^{(t)}),\, d_q \big\rangle.
\]
If $\omega_m^{(t)}\big(x^{(t)}\big) \ge \omega\big(x^{(t)}\big)$, then, using the Cauchy–Schwarz inequality and $\|d_\ell\| \le 1$,
\[
\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| = \big\langle \nabla f_q(x^{(t)}), d_q \big\rangle - \big\langle \nabla m_\ell^{(t)}(x^{(t)}), d_\ell \big\rangle
\overset{\text{df.}}{\le} \big\langle \nabla f_q(x^{(t)}), d_\ell \big\rangle - \big\langle \nabla m_q^{(t)}(x^{(t)}), d_\ell \big\rangle
\le \big\| \nabla f_q(x^{(t)}) - \nabla m_q^{(t)}(x^{(t)}) \big\|_2,
\]
and if $\omega_m^{(t)}\big(x^{(t)}\big) < \omega\big(x^{(t)}\big)$, we obtain
\[
\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \big\| \nabla m_\ell^{(t)}(x^{(t)}) - \nabla f_\ell(x^{(t)}) \big\|_2.
\]
Because $m^{(t)}$ is fully linear, it follows that
\[
\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \sqrt{c}\,\dot\epsilon\,\Delta^{(t)}, \quad \text{with } \dot\epsilon \text{ from (21)}.
\]
If we have just left the criticalityRoutine, then the model is fully linear for $\Delta^{(t)}$ due to Lemma 1, and we have $\Delta^{(t)} \le \mu\,\upsilon_m^{(t)}\big(x^{(t)}\big) \le \mu\,\omega_m^{(t)}\big(x^{(t)}\big)$.
If we otherwise did not enter the criticalityRoutine in the first place, it must hold that $\omega_m^{(t)}\big(x^{(t)}\big) \ge \varepsilon_{\mathrm{crit}}$ and
\[
\Delta^{(t)} \le \Delta^{ub} = \frac{\Delta^{ub}}{\varepsilon_{\mathrm{crit}}}\,\varepsilon_{\mathrm{crit}} \le \frac{\Delta^{ub}}{\varepsilon_{\mathrm{crit}}}\,\omega_m^{(t)}\big(x^{(t)}\big),
\]
and thus, in both cases,
\[
\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \kappa_\omega\,\omega_m^{(t)}\big(x^{(t)}\big), \qquad \kappa_\omega = \sqrt{c}\,\dot\epsilon\,\max\big\{ \mu,\ \varepsilon_{\mathrm{crit}}^{-1}\Delta^{ub} \big\} > 0. \qquad \square
\]

Assumption 9.
Either $\varepsilon_{\mathrm{crit}} > 0$ or Assumption 4 holds.

We have also implicitly shown the following property of the criticality measures.
Corollary 5. If $m^{(t)}$ is fully linear for $f$ with $\dot\epsilon > 0$ as in (21), then
\[
\big| \upsilon_m^{(t)}\big(x^{(t)}\big) - \upsilon\big(x^{(t)}\big) \big| \le \big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \sqrt{c}\,\dot\epsilon\,\Delta^{(t)}.
\]

Lemma 8. If $x^{(t)}$ is not critical for the true problem (MOP), i.e., $\upsilon\big(x^{(t)}\big) \ne 0$, then the criticalityRoutine will terminate after a finite number of iterations.

Proof.
At the start of the criticalityRoutine, we know that $m^{(t)}$ is not fully linear or $\Delta^{(t)} > \mu\,\upsilon_m^{(t)}\big(x^{(t)}\big)$. For clarity, we denote the first model by $m_0^{(t)}$ and define $\Delta_0 = \Delta^{(t)}$. We then ensure that the model is made fully linear on $\Delta_0$ and denote this fully linear model by $m_1^{(t)}$. If afterwards $\Delta^{(t)} \le \mu\,\upsilon_{m_1}^{(t)}\big(x^{(t)}\big)$, then the criticalityRoutine terminates.

Otherwise, the process is repeated: the radius is multiplied by $\alpha \in (0,1)$, so that in the $j$-th iteration we have $\Delta_j^{(t)} = \alpha^{j-1}\Delta_0$ and $m_j^{(t)}$ is made fully linear on $\Delta_j^{(t)}$, until
\[
\Delta_j^{(t)} = \alpha^{j-1}\Delta_0 \le \mu\,\upsilon_{m_j}^{(t)}\big(x^{(t)}\big).
\]
The only way for the criticalityRoutine to loop infinitely is
\[
\upsilon_{m_j}^{(t)}\big(x^{(t)}\big) < \frac{\alpha^{j-1}\Delta_0}{\mu} \quad \forall j \in \mathbb{N}. \tag{22}
\]
Because $m_j^{(t)}$ is fully linear on $\alpha^{j-1}\Delta_0$, we know from Corollary 5 that
\[
\big| \upsilon_{m_j}^{(t)}\big(x^{(t)}\big) - \upsilon\big(x^{(t)}\big) \big| \le \sqrt{c}\,\dot\epsilon\,\alpha^{j-1}\Delta_0 \quad \forall j \in \mathbb{N}.
\]
Using the triangle inequality together with (22) gives us
\[
\big| \upsilon\big(x^{(t)}\big) \big| \le \big| \upsilon_{m_j}^{(t)}\big(x^{(t)}\big) - \upsilon\big(x^{(t)}\big) \big| + \big| \upsilon_{m_j}^{(t)}\big(x^{(t)}\big) \big| \le \big( \mu^{-1} + \sqrt{c}\,\dot\epsilon \big)\,\alpha^{j-1}\Delta_0 \quad \forall j \in \mathbb{N}.
\]
As $\alpha \in (0,1)$, this implies $\upsilon\big(x^{(t)}\big) = 0$; $x^{(t)}$ is hence critical. $\square$

We next state another auxiliary lemma that we need for the convergence proof.

Lemma 9.
Suppose Assumptions 6 and 7 hold. For the iterate $x^{(t)}$, let $s^{(t)} \in \mathbb{R}^n$ be any step with $x^{(t)+} = x^{(t)} + s^{(t)} \in B^{(t)}$. If $m^{(t)}$ is fully linear on $B^{(t)}$, then it holds that
\[
\big| \Phi(x^{(t)+}) - \Phi_m^{(t)}(x^{(t)+}) \big| \le \epsilon\,\big(\Delta^{(t)}\big)^2.
\]

Proof.
The proof follows from the definition of $\Phi$ and $\Phi_m^{(t)}$ and the full linearity of $m^{(t)}$. It can be found in [32, Lemma 4.16]. $\square$

Convergence of Algorithm 1 is proven by showing that in certain situations, the iteration must be acceptable or successful as defined in Definition 4. This is done indirectly and relies on the next two lemmata. They use the preceding result to show that, in a (hypothetical) situation where no Pareto-critical point is approached, the trust region radius must be bounded from below.

Lemma 10.
Suppose Assumptions 1, 3 and 6 to 8 hold. If $x^{(t)}$ is not Pareto-critical for (MOPm), $m^{(t)}$ is fully linear on $B^{(t)}$, and
\[
\Delta^{(t)} \le \frac{\kappa_{sd}^m (1 - \nu_{++})\,\upsilon_m^{(t)}\big(x^{(t)}\big)}{\lambda}, \quad \text{where } \lambda = \max\{2\epsilon,\ c\,H_m\} \text{ and } \kappa_{sd}^m \text{ as in (19)},
\]
then the iteration is successful, that is, $t \in \mathcal{S}$ and $\Delta^{(t+1)} \ge \Delta^{(t)}$.

Proof.
The proof is very similar to [34, Lemma 5.3] and [32, Lemma 4.17]. In contrast to the latter, we use the surrogate problem and do not require interpolation at $x^{(t)}$: By definition we have $\kappa_{sd}^m(1-\nu_{++}) < 1$, and thus
\[
\Delta^{(t)} \le \frac{\kappa_{sd}^m(1-\nu_{++})\,\upsilon_m^{(t)}\big(x^{(t)}\big)}{\lambda} \le \frac{\upsilon_m^{(t)}}{\lambda} \le \frac{\upsilon_m^{(t)}}{c\,H_m}. \tag{23}
\]
With Assumption 8 we can plug this into (19) and obtain
\[
\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)+}) \ge \kappa_{sd}^m\,\upsilon_m^{(t)}\min\Big\{ \frac{\upsilon_m^{(t)}}{c\,H_m},\ \Delta^{(t)} \Big\} \ge \kappa_{sd}^m\,\upsilon_m^{(t)}\,\Delta^{(t)}. \tag{24}
\]
Due to Assumption 7, we can take the definition (3) and estimate
\[
\big| \rho^{(t)} - 1 \big| = \Bigg| \frac{\Phi(x^{(t)}) - \Phi(x^{(t)+}) - \big( \Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)+}) \big)}{\Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)+})} \Bigg|
\le \frac{\big| \Phi(x^{(t)}) - \Phi_m^{(t)}(x^{(t)}) \big| + \big| \Phi_m^{(t)}(x^{(t)+}) - \Phi(x^{(t)+}) \big|}{\big| \Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)+}) \big|}
\]
\[
\overset{\text{Lemma 9, (24)}}{\le} \frac{2\epsilon\big(\Delta^{(t)}\big)^2}{\kappa_{sd}^m\,\upsilon_m^{(t)}\,\Delta^{(t)}} \le \frac{\lambda\,\Delta^{(t)}}{\kappa_{sd}^m\,\upsilon_m^{(t)}} \overset{(23)}{\le} 1 - \nu_{++}.
\]
Therefore $\rho^{(t)} \ge \nu_{++}$ and the iteration $t$ using step $s^{(t)}$ is successful. $\square$

The same statement can be made for the true problem and $\upsilon(\bullet)$:

Corollary 6.
Suppose Assumptions 1, 3 and 6 to 9 hold. If $x^{(t)}$ is not Pareto-critical for (MOP), $m^{(t)}$ is fully linear on $B^{(t)}$, and
\[
\Delta^{(t)} \le \frac{\kappa_{sd}(1-\nu_{++})\,\upsilon\big(x^{(t)}\big)}{\lambda}, \quad \text{where } \lambda = \max\{2\epsilon,\ c\,H_m\} \text{ and } \kappa_{sd} \text{ as in (20)},
\]
then the iteration is successful, that is, $t \in \mathcal{S}$ and $\Delta^{(t+1)} \ge \Delta^{(t)}$.

Proof.
The proof works exactly as for Lemma 10, but due to Assumption 9 we can use Lemma 7 and employ the sufficient decrease condition (20) for $\upsilon(\bullet)$ instead. $\square$

As in [34, Lemma 5.4] and [32, Lemma 4.18], it is now easy to show that, when no Pareto-critical point of (MOPm) is approached, the trust region radius must be bounded:

Lemma 11.
Suppose Assumptions 1, 3 and 6 to 8 hold and that there exists a constant $\upsilon_{lb}^m > 0$ such that $\upsilon_m^{(t)}\big(x^{(t)}\big) \ge \upsilon_{lb}^m$ for all $t$. Then there is a constant $\Delta_{lb} > 0$ with $\Delta^{(t)} \ge \Delta_{lb}$ for all $t \in \mathbb{N}$.

Proof.
We first investigate the criticality step and assume $\varepsilon_{\mathrm{crit}} > \upsilon_m^{(t)} \ge \upsilon_{lb}^m$. After we finish the criticality loop, we get a radius $\Delta^{(t)}$ such that $\Delta^{(t)} \ge \min\{\Delta_*^{(t)}, \beta\upsilon_m^{(t)}\}$ and therefore $\Delta^{(t)} \ge \min\{\beta\upsilon_{lb}^m, \Delta_*^{(t)}\}$ for all $t$.

Outside the criticality step, we know from Lemma 10 that, whenever $\Delta^{(t)}$ falls below
\[
\tilde\Delta := \frac{\kappa_{sd}^m(1-\nu_{++})\,\upsilon_{lb}^m}{\lambda},
\]
iteration $t$ must be either model-improving or successful, and hence $\Delta^{(t+1)} \ge \Delta^{(t)}$; the radius cannot decrease until $\Delta^{(k)} > \tilde\Delta$ for some $k > t$. Because $\gamma_{\Downarrow} \in (0,1)$ is the severest possible shrinking factor in Algorithm 1, we therefore know that $\Delta^{(t)}$ can never be actively shrunk to a value below $\gamma_{\Downarrow}\tilde\Delta$.

Combining both bounds on $\Delta^{(t)}$ results in
\[
\Delta^{(t)} \ge \Delta_{lb} := \min\big\{ \beta\upsilon_{lb}^m,\ \gamma_{\Downarrow}\tilde\Delta,\ \Delta_*^{(0)} \big\} \quad \forall t \in \mathbb{N},
\]
where we have again used the fact that $\Delta_*^{(t)}$ cannot be reduced further if it is less than or equal to $\tilde\Delta$, due to the update mechanism in Algorithm 1. $\square$

We can now state the first convergence result:

Theorem 4.
Suppose that Assumptions 1, 3 and 6 to 8 hold. If Algorithm 1 has only a finite number $0 \le |\mathcal{S}| < \infty$ of successful iterations $\mathcal{S} = \{t \in \mathbb{N} : \rho^{(t)} \ge \nu_{++}\}$, then
\[
\lim_{t\to\infty} \upsilon\big(x^{(t)}\big) = 0.
\]

Proof.
If the criticality loop runs infinitely, then the result follows from Lemma 8. Otherwise, let $t_0$ be any index larger than the last successful index (or $t_0 \ge 0$ if $\mathcal{S} = \emptyset$). All $t \ge t_0$ then must be model-improving, acceptable or inacceptable. In all cases, the trust region radius $\Delta^{(t)}$ is never increased. Due to Assumption 7, the number of successive model-improvement steps is bounded above by $M \in \mathbb{N}$. Hence, $\Delta^{(t)}$ is decreased by a factor of $\gamma \in [\gamma_{\Downarrow}, \gamma_{\downarrow}] \subseteq (0,1)$ at least once every $M$ iterations. Thus,
\[
\sum_{t > t_0}^{\infty} \Delta^{(t)} \le M \sum_{i=1}^{\infty} \gamma_{\downarrow}^i\,\Delta^{(t_0)} = \frac{M\gamma_{\downarrow}}{1 - \gamma_{\downarrow}}\,\Delta^{(t_0)},
\]
and $\Delta^{(t)}$ must go to zero for $t \to \infty$.

Clearly, for any $\tau \ge t_0$, the iterates (and trust region centers) $x^{(\tau)}$ and $x^{(t_0)}$ cannot be further apart than the sum of all subsequent trust region radii, i.e.,
\[
\big\| x^{(\tau)} - x^{(t_0)} \big\| \le \sum_{t \ge t_0}^{\infty} \Delta^{(t)} \le \frac{M\gamma_{\downarrow}}{1 - \gamma_{\downarrow}}\,\Delta^{(t_0)}.
\]
The RHS goes to zero as we let $t_0$ go to infinity, and so must the norm on the LHS, i.e.,
\[
\lim_{t_0\to\infty} \big\| x^{(\tau)} - x^{(t_0)} \big\| = 0. \tag{25}
\]
Now let $\tau = \tau(t_0) \ge t_0$ be the first iteration index such that $m^{(\tau)}$ is fully linear. Then
\[
\big| \upsilon^{(t_0)} \big| \le \big| \upsilon^{(t_0)} - \upsilon^{(\tau)} \big| + \big| \upsilon^{(\tau)} - \upsilon_m^{(\tau)} \big| + \big| \upsilon_m^{(\tau)} \big|,
\]
and for the terms on the right, as $t_0 \to \infty$, we find:
• Because of Assumptions 1 and 6 and Theorem 2, $\upsilon(\bullet)$ is Cauchy-continuous, and with (25) the first term goes to zero.
• Due to Corollary 5, the second term is in $\mathcal{O}(\Delta^{(\tau)})$ and goes to zero.
• Suppose the third term does not go to zero as well, i.e., $\{\upsilon_m^{(\tau)}(x^{(\tau)})\}$ is bounded below by a positive constant. Due to Assumptions 1 and 7, the iterates $x^{(\tau)}$ are then not Pareto-critical for (MOPm), and because of $\Delta^{(\tau)} \to 0$, Lemma 10 would eventually imply a successful iteration, a contradiction. Hence, the third term goes to zero as well.

Thus, $\upsilon\big(x^{(t)}\big)$ goes to zero for $t \to \infty$. $\square$

We now address the case of infinitely many successful iterations, first for the surrogate measure $\upsilon_m^{(t)}(\bullet)$ and then for $\upsilon(\bullet)$. We show that the criticality measures are not bounded away from zero. We start with the observation that in any case the trust region radius converges to zero:

Lemma 12.
If Assumptions 1, 3 and 6 to 8 hold, then the sequence of trust region radii generated by Algorithm 1 goes to zero, i.e., $\lim_{t\to\infty} \Delta^{(t)} = 0$.

Proof.
We have shown in the proof of Theorem 4 that this is the case for finitely many successful iterations.
Suppose there are infinitely many successful iterations. Take any successful index $t \in \mathcal{S}$. Then $\rho^{(t)} \ge \nu_{++}$, and from Assumption 8 it follows for $x^{(t+1)} = x^{(t)+} = x^{(t)} + s^{(t)}$ that
\[
\Phi(x^{(t)}) - \Phi(x^{(t)+}) \ge \nu_{++}\big( \Phi_m^{(t)}(x^{(t)}) - \Phi_m^{(t)}(x^{(t)+}) \big) \overset{(19)}{\ge} \nu_{++}\,\kappa_{sd}^m\,\upsilon_m^{(t)}\min\Big\{ \frac{\upsilon_m^{(t)}}{c\,H_m},\ \Delta^{(t)} \Big\}.
\]
The criticality step ensures that
\[
\upsilon_m^{(t)} \ge \min\Big\{ \varepsilon_{\mathrm{crit}},\ \frac{\Delta^{(t)}}{\mu} \Big\},
\]
so that
\[
\Phi(x^{(t)}) - \Phi(x^{(t)+}) \ge \nu_{++}\,\kappa_{sd}^m \min\Big\{ \varepsilon_{\mathrm{crit}},\ \frac{\Delta^{(t)}}{\mu} \Big\} \min\Big\{ \frac{\Delta^{(t)}}{\mu\, c\, H_m},\ \Delta^{(t)} \Big\} \ge 0. \tag{26}
\]
Now the right hand side has to go to zero: Suppose it was bounded below by a positive constant $\varepsilon > 0$. We could then compute a lower bound on the improvement from the first iteration with index $0$ up to $t+1$:
\[
\Phi(x^{(0)}) - \Phi(x^{(t+1)}) \ge \sum_{\tau \in \mathcal{S}_t} \big( \Phi(x^{(\tau)}) - \Phi(x^{(\tau+1)}) \big) \ge |\mathcal{S}_t|\,\varepsilon,
\]
where $\mathcal{S}_t = \mathcal{S} \cap \{0, \ldots, t\}$ are all successful indices with a maximum index of $t$. Because $\mathcal{S}$ is unbounded, the right side diverges for $t \to \infty$, and so must the left side, in contradiction to $\Phi$ being bounded below by Assumption 5. From (26) we see that this implies $\Delta^{(t)} \to 0$ for $t \in \mathcal{S}$, $t \to \infty$.

Now consider any sequence $\mathcal{T} \subseteq \mathbb{N}$ of indices that are not necessarily successful, i.e., $|\mathcal{T} \setminus \mathcal{S}| \ge 0$. The radius is only ever increased in successful iterations, and at most by a factor of $\gamma_{\uparrow}$. Since $\mathcal{S}$ is unbounded, there is for any $\tau \in \mathcal{T}$ a largest $t_\tau \in \mathcal{S}$ with $t_\tau \le \tau$. Then $\Delta^{(\tau)} \le \gamma_{\uparrow}\Delta^{(t_\tau)}$, and because of $\Delta^{(t_\tau)} \to 0$ it follows that
\[
\lim_{\tau \in \mathcal{T},\, \tau\to\infty} \Delta^{(\tau)} = 0. \qquad \square
\]

Lemma 13.
Suppose Assumptions 1, 3 and 5 to 8 hold. For the iterates produced by Algorithm 1 it holds that
\[
\liminf_{t\to\infty} \upsilon_m^{(t)}\big(x^{(t)}\big) = 0.
\]

Proof.
For a contradiction, suppose that $\liminf_{t\to\infty} \upsilon_m^{(t)}\big(x^{(t)}\big) \ne 0$. Then there is a constant $\upsilon_{lb}^m > 0$ with $\upsilon_m^{(t)} \ge \upsilon_{lb}^m$ for all $t \in \mathbb{N}$. According to Lemma 11, there then exists a constant $\Delta_{lb} > 0$ with $\Delta^{(t)} \ge \Delta_{lb}$ for all $t$. This contradicts Lemma 12. $\square$

The next result allows us to transfer the result to $\upsilon(\bullet)$.

Lemma 14.
Suppose Assumptions 1, 6 and 7 hold. For any subsequence $\{t_i\}_{i\in\mathbb{N}} \subseteq \mathbb{N}$ of iteration indices of Algorithm 1 with
\[
\lim_{i\to\infty} \upsilon_m^{(t_i)}\big(x^{(t_i)}\big) = 0, \tag{27}
\]
it also holds that
\[
\lim_{i\to\infty} \upsilon\big(x^{(t_i)}\big) = 0. \tag{28}
\]
Proof.
By (27), $\upsilon_m^{(t_i)} < \varepsilon_{\mathrm{crit}}$ for sufficiently large $i$. If $x^{(t_i)}$ is critical for (MOP), then the result follows from Lemma 8. Otherwise, $m^{(t_i)}$ is fully linear on $B\big(x^{(t_i)}; \Delta^{(t_i)}\big)$ for some $\Delta^{(t_i)} \le \mu\,\upsilon_m^{(t_i)}$. From Corollary 5 it follows that
\[
\big| \upsilon_m^{(t_i)} - \upsilon^{(t_i)} \big| \le \sqrt{c}\,\dot\epsilon\,\Delta^{(t_i)} \le \sqrt{c}\,\dot\epsilon\,\mu\,\upsilon_m^{(t_i)}.
\]
The triangle inequality yields
\[
\upsilon^{(t_i)} \le \big| \upsilon^{(t_i)} - \upsilon_m^{(t_i)} \big| + \upsilon_m^{(t_i)} \le \big( \sqrt{c}\,\dot\epsilon\,\mu + 1 \big)\,\upsilon_m^{(t_i)}
\]
for sufficiently large $i$, and (27) then implies (28). $\square$

The next global convergence result immediately follows from Theorem 4 and Lemmas 13 and 14:

Theorem 5.
Suppose Assumptions 1, 3 and 5 to 8 hold. Then
\[
\liminf_{t\to\infty} \upsilon\big(x^{(t)}\big) = 0.
\]
Hence, there is a subsequence of iterates in $\mathbb{R}^n$ approximating a Pareto-critical point. We next show that all limit points of a sequence generated by Algorithm 1 are Pareto-critical.

Theorem 6.
Suppose Assumptions 1 and 3 to 8 hold. Then
\[
\lim_{t\to\infty} \upsilon\big(x^{(t)}\big) = 0.
\]

Proof.
We have already proven the result for finitely many successful iterations, see Theorem4. We thus suppose that S is unbounded.For the purpose of establishing a contradiction, suppose that there exists a sequence (cid:8) t j (cid:9) j ∈ N of indices that are successful or acceptable with (cid:118) ( t j ) ≥ ε > ε > j . (29)We can ignore model-improving and inacceptable iterations: During those the iterate does notchange and we find a larger acceptable or successful index with the same criticality value.From Theorem 5 we obtain that for every such t j , there exists a first index τ j > t j suchthat (cid:118) (cid:16) x ( τ j ) (cid:17) < ε . We thus find another subsequence indexed by { τ j } such that (cid:118) ( t ) ≥ ε for t j ≤ t < τ j and (cid:118) ( τ j ) < ε . (30)Using (29) and (30), it also follows from a triangle inequality that (cid:12)(cid:12)(cid:12) (cid:118) ( t j ) − (cid:118) ( τ j ) (cid:12)(cid:12)(cid:12) ≥ (cid:118) ( t j ) − (cid:118) ( τ j ) > ε − ε = ε ∀ j ∈ N . (31)With { t j } and { τ j } as in (30), define the following subset set of indices T = (cid:8) t ∈ N : ∃ j ∈ N such that t j ≤ t < τ j (cid:9) .By (30) we have (cid:118) ( t ) ≥ ε for t ∈ T , and due to Lemma 14, we also know that then (cid:118) ( t ) m cannotgo to zero neither, i.e., there is some ε m > (cid:118) ( t ) m ≥ ε m > ∀ t ∈ T . From Lemma 12 we know that ∆ ( t ) t → ∞ −−→ t ∈ T must be either successful or model-improving (if m ( t ) is not fully linear). 
For $t \in \mathcal{T} \cap S$, it follows from Assumption 8 that
$$\Phi\big(x^{(t)}\big) - \Phi\big(x^{(t+1)}\big) \ge \nu_{++} \left( \Phi_m\big(x^{(t)}\big) - \Phi_m\big(x^{(t+1)}\big) \right) \ge \nu_{++}\, \kappa_{\mathrm{sdm}}\, \varepsilon_m \min\left\{ \frac{\varepsilon_m}{c\,H_m},\ \Delta^{(t)} \right\}.$$
If $t \in \mathcal{T} \cap S$ is sufficiently large, we have $\Delta^{(t)} \le \varepsilon_m / (c\,H_m)$ and hence
$$\Delta^{(t)} \le \frac{1}{\nu_{++}\, \kappa_{\mathrm{sdm}}\, \varepsilon_m} \left( \Phi\big(x^{(t)}\big) - \Phi\big(x^{(t+1)}\big) \right).$$
Since the iteration is either successful or model-improving for sufficiently large $t \in \mathcal{T}$, and since $x^{(t)} = x^{(t+1)}$ for a model-improving iteration, we deduce from the previous inequality that
$$\left\| x^{(t_j)} - x^{(\tau_j)} \right\| \le \sum_{\substack{t = t_j \\ t \in \mathcal{T} \cap S}}^{\tau_j - 1} \left\| x^{(t)} - x^{(t+1)} \right\| \le \sum_{\substack{t = t_j \\ t \in \mathcal{T} \cap S}}^{\tau_j - 1} \Delta^{(t)} \le \frac{1}{\nu_{++}\, \kappa_{\mathrm{sdm}}\, \varepsilon_m} \left( \Phi\big(x^{(t_j)}\big) - \Phi\big(x^{(\tau_j)}\big) \right)$$
for $j \in \mathbb{N}$ sufficiently large. The sequence $\big\{\Phi\big(x^{(t)}\big)\big\}_{t\in\mathbb{N}}$ is bounded below (Assumption 5) and monotonically decreasing by construction. Hence, the RHS above must converge to zero for $j \to \infty$. This implies $\lim_{j\to\infty} \big\| x^{(t_j)} - x^{(\tau_j)} \big\| = 0$. The criticality measure $v(\bullet)$ is uniformly continuous, so that then
$$\lim_{j \to \infty} v\big(x^{(t_j)}\big) - v\big(x^{(\tau_j)}\big) = 0,$$
in contradiction to (31).
7. Numerical Examples
In this section, we provide more details on the actual implementation of Algorithm 1 and present the results of various experiments. We compare different surrogate model types with regard to their efficiency (in terms of expensive objective evaluations) and their ability to find Pareto-critical points.
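To give a concrete feel for the mechanics before discussing the implementation, the following minimal Python sketch runs a toy multiobjective trust region descent on a cheap bi-objective problem. It is our own illustration, not the Julia implementation: exact gradients replace the surrogate models, the descent direction is the multiobjective steepest descent direction of Fliege and Svaiter [7] (computable in closed form for two objectives), and the strict acceptance test requires all objectives to decrease.

```python
import math

def f1(x):  # toy objective 1
    return x[0]**2 + x[1]**2

def f2(x):  # toy objective 2
    return (x[0] - 1)**2 + x[1]**2

def grad_f1(x):
    return [2*x[0], 2*x[1]]

def grad_f2(x):
    return [2*(x[0] - 1), 2*x[1]]

def descent_direction(g1, g2):
    """Multiobjective steepest descent direction for k = 2 objectives:
    d = -(lam*g1 + (1-lam)*g2), where lam in [0,1] minimizes the norm of
    the convex combination (closed form, cf. Fliege & Svaiter)."""
    diff = [b - a for a, b in zip(g1, g2)]          # g2 - g1
    denom = sum(c*c for c in diff)
    if denom == 0:
        lam = 0.5
    else:
        lam = max(0.0, min(1.0, sum(c*g for c, g in zip(diff, g2)) / denom))
    v = [lam*a + (1 - lam)*b for a, b in zip(g1, g2)]
    return [-c for c in v]

def trust_region_descent(x0, radius=1.0, max_iter=40):
    """Toy trust region loop: step of length `radius` along the descent
    direction, strict acceptance (every objective must decrease), and a
    simple shrink/grow radius update with illustrative factors."""
    gamma_down, gamma_up = 0.5, 2.0
    x = list(x0)
    for _ in range(max_iter):
        d = descent_direction(grad_f1(x), grad_f2(x))
        norm = math.sqrt(sum(c*c for c in d))
        if norm < 1e-12:                 # (approximately) Pareto-critical
            break
        trial = [xi + radius*di/norm for xi, di in zip(x, d)]
        if f1(trial) < f1(x) and f2(trial) < f2(x):   # strict acceptance
            x, radius = trial, gamma_up * radius
        else:
            radius *= gamma_down
    return x
```

Starting at $(2.0, 1.5)$, the iterates approach the Pareto set of this toy problem (the segment $x_1 \in [0,1]$, $x_2 = 0$); in the actual algorithm, the gradients above are replaced by fully linear surrogate models.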
We implemented the algorithm in the Julia language. The OSQP solver [43] was used to solve (Pm). For non-linear problems we used the NLopt.jl [44] package; more specifically, we used the BOBYQA algorithm [45] in conjunction with DynamicPolynomials.jl [46] for the Lagrange polynomials, and the population-based ISRES method [47] for the Pascoletti-Serafini subproblems. The derivatives of cheap objective functions were obtained by means of automatic differentiation [48], and the Taylor models used FiniteDiff.jl.

In accordance with Algorithm 1, we perform the shrinking trust region update via
$$\Delta^{(t+1)} \leftarrow \begin{cases} \gamma_{\Downarrow}\, \Delta^{(t)} & \text{if } \rho^{(t)} < \nu_{+}, \\ \gamma_{\downarrow}\, \Delta^{(t)} & \text{if } \nu_{+} \le \rho^{(t)} < \nu_{++}. \end{cases}$$
Note that for box-constrained problems we internally scale the feasible set to the unit hypercube $[0, 1]^n$, and all radii are measured with regard to this scaled domain.

For stopping we use a combination of different criteria:
• We have an upper bound $N_{\mathrm{it.}} \in \mathbb{N}$ on the maximum number of iterations and an upper bound $N_{\mathrm{exp.}} \in \mathbb{N}$ on the number of expensive objective evaluations.
• The surrogate criticality naturally allows for a stopping test, and due to Lemma 11 the trust region radius can also be used (see also [32, Sec. 5]). We combine this with a relative tolerance test and stop if
$$\Delta^{(t)} \le \Delta_{\min} \quad \text{OR} \quad \left( \Delta^{(t)} \le \Delta_{\mathrm{crit}} \ \text{AND} \ \big\| s^{(t)} \big\| \le \varepsilon_{\mathrm{rel}} \right).$$
• At a truly critical point the criticality loop criticalityRoutine runs infinitely. We stop after a maximum number $N_{\mathrm{loops}} \in \mathbb{N}$ of iterations. If $N_{\mathrm{loops}}$ equals 0, the algorithm effectively stops for small $v_m^{(t)}$ values.

We tested our method on a multitude of academic test problems with a varying number of decision variables $n$ and objective functions $k$. We were able to approximate Pareto-critical points in both cases, i.e., if we treat the problems as heterogeneous and if we declare all objectives as expensive. We benchmarked RBF against polynomial models because in [32] it was shown that a trust region method using second-degree Lagrange polynomials outperforms commercial solvers on scalarized problems. Most often, RBF surrogates outperform the other model types with regard to the number of expensive function evaluations. This is illustrated in Fig. 2. It shows two runs of Algorithm 1 on the non-convex problem (T6), taken from [37]:
$$\min_{x \in X} \begin{bmatrix} x_1 + \ln(x_1) + x_2^2 \\ x_1^2 + x_2^2 \end{bmatrix}, \qquad X = [\varepsilon, 30] \times [0, 30] \subseteq \mathbb{R}^2. \tag{T6}$$

Figure 2.
Two runs with the maximum number of expensive evaluations set to 20 (soft limit). Test points are light-gray, the iterates are black, the final iterate is red, and white markers show other points where the objectives are evaluated. The successive trust regions are also shown. (a) Using RBF surrogate models, we converge to the optimum using only 12 expensive evaluations. (b) Quadratic Lagrange models do not reach the optimum within 19 evaluations. (c) Iterations and test points in the objective space.
The first objective function is treated as expensive while the second is cheap. The only Pareto-optimal point of (T6) is $[\varepsilon, 0]^T$. When we set a very restrictive limit of $N_{\mathrm{exp.}} = 20$, we run out of budget with second-degree Lagrange surrogates before reaching the optimum, see Fig. 2 (b). As evident in Fig. 2 (a), surrogates based on (cubic) RBF require significantly less training data. For the RBF models, the algorithm stopped after 2 criticality loops, and the model refinement during these loops is made visible by the samples on the problem boundary converging to zero. The complete set of relevant parameters for the test runs is given in Table 2. We used a strict acceptance test and the strict Pareto-Cauchy step.
Table 2. Parameters for Fig. 2, radii relative to $[0,1]^n$: $\varepsilon_{\mathrm{crit}}$, $N_{\mathrm{exp.}}$, $N_{\mathrm{loops}}$, $\mu$, $\beta$, $\Delta_{\mathrm{ub}}$, $\Delta_{\min}$, $\Delta^{(0)}$, $\nu_{+}$, $\nu_{++}$, $\gamma_{\Downarrow}$, $\gamma_{\downarrow}$, $\gamma_{\uparrow}$.

To assess the performance with a growing number of decision variables $n$, we performed tests on scalable problems of the ZDT and DTLZ families [49,50]. Fig. 3 shows results for the bi-objective problems ZDT1–ZDT3 and for the $k$-objective problems DTLZ1 and DTLZ6 (the number of objectives $k$ was chosen in dependence of $n$). All problems are box-constrained. Eight feasible starting points were generated for each problem setting, i.e., for each combination of $n$, a test problem and a descent method. In all cases, the first objective was considered cheap and all other objectives expensive. First- and second-degree Lagrange models were compared against linear Taylor models and (cubic) RBF surrogates. The Lagrange models were built using a $\Lambda$-poised interpolation set. The Taylor models used finite differences, and points outside of the box constraints were simply projected back onto the boundary. The RBF models were allowed to include up to $(n+1)(n+2)/2$ training points from the database if $n \le 10$, and otherwise the maximum number of points was $2n+1$. All other parameters are listed in Table 3.
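The training-point budgets just described are easy to state in code (the helper names are ours): $(n+1)(n+2)/2$ is the number of interpolation points a fully quadratic polynomial model in $n$ variables requires, which is also why the second-degree Lagrange models scale quadratically, while the RBF cap of $2n+1$ grows only linearly.

```python
def quadratic_budget(n):
    # number of points determining a full degree-2 polynomial in n variables
    return (n + 1) * (n + 2) // 2

def rbf_point_cap(n):
    # cap used in the experiments: quadratic budget for n <= 10, else 2n + 1
    return quadratic_budget(n) if n <= 10 else 2 * n + 1

# the gap widens quickly with the dimension:
for n in (2, 10, 15):
    print(n, quadratic_budget(n), rbf_point_cap(n))
```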
Table 3. Parameters for Fig. 3, radii relative to $[0,1]^n$. $\theta_1$ is used to construct Lagrange and RBF models in an enlarged trust region, $\theta_2$ is used only for RBF, see Section 4.3. The listed parameters are $\varepsilon_{\mathrm{crit}}$, $N_{\mathrm{it.}}$, $N_{\mathrm{exp.}}$, $N_{\mathrm{loops}}$, $\mu$, $\beta$, $\Delta_{\mathrm{ub}}$, $\Delta_{\mathrm{crit}}$, $\varepsilon_{\mathrm{rel}}$, $\Delta_{\min}$, $\Delta^{(0)}$, $\nu_{+}$, $\nu_{++}$, $\gamma_{\Downarrow}$, $\gamma_{\downarrow}$, $\gamma_{\uparrow}$, $\theta_1$, $\theta_2$.
As expected, the second-degree Lagrange polynomials require the most objective evaluations, and the quadratic dependence on $n$ is clearly visible in Fig. 3; the quadratic growth of the dark-blue line continues for $n \ge 8$. On average, the Taylor models perform better than the linear Lagrange polynomials, despite requiring more evaluations per iteration. This is possibly due to more accurate derivative information and resulting faster convergence. The Lagrange models do slightly better when the Pascoletti-Serafini step calculation is used (see Appendix B). By far the fewest evaluations are needed for the RBF models: the light-blue line consistently stays below all other data points. Often, the average number of evaluations is less than half that of the second-best method.
Figure 3.
Average number of expensive objective evaluations by number of decision variables $n$, surrogate type and descent method. LP1 denotes linear Lagrange models, LP2 quadratic Lagrange models, TP1 linear Taylor polynomials based on finite differences, and "cubic" refers to cubic RBF models. Steepest descent and Pascoletti-Serafini were tested on scalable problems, and 12 runs were performed per setting.

Fig. 4 illustrates that RBF not only perform better on average, but also overall. With regard to the final solution criticality, there are a few outliers where the method did not converge using RBF models. However, in most cases the solution criticality is acceptable, see Fig. 4 (b). Moreover, Fig. 5 shows that a good percentage of problem instances is solved with RBF, especially when compared to linear Lagrange polynomial models. Note that in cases where the true objectives are not differentiable at the final iterate, $\omega$ was set to 0, because the selected problems are non-differentiable only in Pareto-optimal points.

Figure 4.
Box-plots of the number of evaluations and of the solution criticality for $n = 15$, for the steepest-descent runs from Fig. 3.
Figure 5.
Percentage of solved problem instances, i.e., test runs where the final solution criticality has a value below 0.1. Per model and $n$-value there were 40 runs.

Furthermore, we compared the RBF kernels from Table 1. In [33], the cubic kernel performs best on single-objective problems while the Gaussian does worst. As can be seen in Fig. 6, this holds for multiple objective functions, too. The dark-blue and the light-blue bars show that both the Gaussian and the multiquadric require more function evaluations, especially in higher dimensions. If, however, we use a very simple adaptive strategy to fine-tune the shape parameter, then both kernels can finish significantly faster. The pink and the gray bars illustrate this fact. In both cases, the shape parameter was set to $\alpha = \Delta^{(t)}$ in each iteration. Nevertheless, the cubic function (orange) appears to be a good choice in general.

Figure 6.
Influence of an adaptive shape parameter on the performance of RBF models (tested on ZDT3).
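The kernels compared in Fig. 6 and the adaptive shape strategy can be written down directly. The sign convention for the multiquadric and the helper names are our choices for this sketch; the adaptive strategy simply sets the shape parameter to the current trust region radius, as in the experiments above.

```python
import math

def cubic(r):                 # shape-parameter free
    return r ** 3

def gaussian(r, alpha):
    return math.exp(-(alpha * r) ** 2)

def multiquadric(r, alpha):   # one common sign convention
    return -math.sqrt(1.0 + (alpha * r) ** 2)

def adaptive_shape(trust_region_radius):
    """Simple adaptive strategy: set the shape parameter to the current
    trust region radius so the kernel rescales as the region shrinks."""
    return trust_region_radius
```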
8. Conclusion
We have developed a trust region framework for heterogeneous and expensive multiobjective optimization problems. It is based on similar work [29–32], and our main contributions are the integration of constraints and of radial basis function surrogates. Our method is provably convergent for unconstrained problems and when the feasible set is convex and compact, while requiring significantly fewer expensive function evaluations due to a linear scaling of complexity with respect to the number of decision variables.

For future work, several modifications and extensions can likely be transferred from the single-objective to the multiobjective case. For example, the trust region update can be made dependent on the step size (rather than on $\rho^{(t)}$ alone) to allow for a more precise model refinement, see [35, Ch. 10]. We have also experimented with the nonlinear CG method [9] for a multiobjective Steihaug-Toint step [35, Ch. 7], and early results look promising.

Going forward, we would like to apply our algorithm to a real-world application, similar to what has been done in [51]. Moreover, it would be desirable to obtain not just one but multiple Pareto-critical solutions. Because the Pascoletti-Serafini scalarization is compatible with constraints, the iterations can be guided in image space by providing different global utopia vectors. Furthermore, it is straightforward to use RBF with the heuristic methods from [52] for heterogeneous problems. We believe that it should also be possible to propagate multiple solutions and combine the TRM method with non-dominance testing, as has been done in the bi-objective case [31]. One can think of other globalization strategies as well: RBF models have been used in multiobjective stochastic search algorithms [53], and trust region ideas have been included in population-based strategies [26]. It will thus be interesting to see whether the theoretical convergence properties can be maintained within these contexts by employing a careful trust region management. Finally, re-using the data sampled near the final iterate within a continuation framework like in [54] is a promising next step.
Author Contributions:
Conceptualization, M.B. and S.P.; methodology, M.B.; software, M.B.; validation,M.B. and S.P.; formal analysis, M.B. and S.P.; investigation, M.B.; writing—original draft preparation,M.B.; writing—review and editing, S.P.; visualization, M.B.; supervision, S.P.; All authors have read andagreed to the published version of the manuscript.
Funding:
This research has been funded by the European Union and the German Federal State of NorthRhine-Westphalia within the EFRE.NRW project “SET CPS”.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. Miscellaneous Proofs
Appendix A.1. Continuity of the Constrained Optimal Value
In this subsection we show the continuity of $\omega(x)$ in the constrained case, where $\omega(x)$ is the negative optimal value of (P1), i.e.,
$$\omega(x) := - \min_{d \in X - x} \max_{\ell = 1, \dots, k} \langle \nabla f_\ell(x), d \rangle \quad \text{s.t.} \quad \| d \| \le 1.$$
The continuity of $\omega(x)$, as stated in Theorem 1, follows the reasoning from [6], where continuity is shown for a related constrained descent direction program.

Proof of Item 2 in Theorem 1.
Let the requirements of Item 1 be fulfilled, i.e., let $f$ be continuously differentiable and let $X \subset \mathbb{R}^n$ be convex and compact. Further, let $x$ be a point in $X$, denote the minimizing direction in (P1) by $d(x)$ and the optimal value by $\theta(x)$. We show that $\theta(x)$ is continuous, by which $\omega(x) = -\theta(x)$ is continuous as well.

First, note the following properties of the maximum function:
1. $u \mapsto \max_\ell u_\ell$ is positively homogeneous and subadditive, and hence
$$\max_\ell \big( \langle \nabla f_\ell(x), d_1 \rangle + \langle \nabla f_\ell(x), d_2 \rangle \big) \le \max_\ell \langle \nabla f_\ell(x), d_1 \rangle + \max_\ell \langle \nabla f_\ell(x), d_2 \rangle.$$
2. $u \mapsto \max_\ell u_\ell$ is Lipschitz with constant 1, so that
$$\left| \max_\ell \langle \nabla f_\ell(x_1), d \rangle - \max_\ell \langle \nabla f_\ell(x_2), d \rangle \right| \le \| Df(x_1)\, d - Df(x_2)\, d \|,$$
for both the maximum and the Euclidean norm.

Now let $\{x^{(t)}\} \subseteq X$ be a sequence with $x^{(t)} \to x$. Due to the constraints, we have $d(x) \in X - x$ and thereby $d(x) + x - x^{(t)} \in X - x^{(t)}$. Let
$$(0,1] \ni \sigma^{(t)} := \begin{cases} \min\left\{ 1,\ \dfrac{1}{\| d(x) + x - x^{(t)} \|} \right\} & \text{if } d(x) \ne x^{(t)} - x, \\ 1 & \text{else.} \end{cases}$$
Then $\sigma^{(t)} \big( d(x) + x - x^{(t)} \big)$ is feasible for (P1) at $x^{(t)}$:
• $\sigma^{(t)} \big( d(x) + x - x^{(t)} \big) \in X - x^{(t)}$, because $X - x^{(t)}$ is convex and $0, \big( d(x) + x - x^{(t)} \big) \in X - x^{(t)}$, as well as $\sigma^{(t)} \in (0,1]$.
• $\big\| \sigma^{(t)} \big( d(x) + x - x^{(t)} \big) \big\| \le 1$.

By the definition of (P1) it follows that
$$\max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \le \sigma^{(t)} \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x) + x - x^{(t)} \rangle,$$
and by the maximum property 1,
$$\max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \le \sigma^{(t)} \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x) \rangle + \sigma^{(t)} \max_\ell \langle \nabla f_\ell(x^{(t)}), x - x^{(t)} \rangle. \qquad (A1)$$
We make the following observations:
• Because of $\| d(x) + x - x^{(t)} \| \xrightarrow{t\to\infty} \| d(x) \| \le 1$, it follows that $\sigma^{(t)} \xrightarrow{t\to\infty} 1$.
• For all $\ell \in \{1, \dots, k\}$ we have $\nabla f_\ell(x^{(t)}) \to \nabla f_\ell(x)$, and because $u \mapsto \max_\ell u_\ell$ is continuous as well, it then follows that $\max_\ell \langle \nabla f_\ell(x^{(t)}), d(x) \rangle \to \max_\ell \langle \nabla f_\ell(x), d(x) \rangle$ for $t \to \infty$.
• The last term on the RHS of (A1) vanishes for $t \to \infty$.

By taking the limit superior in (A1), we then find that
$$\limsup_{t\to\infty} \theta\big(x^{(t)}\big) = \limsup_{t\to\infty} \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \le \max_\ell \langle \nabla f_\ell(x), d(x) \rangle = \theta(x). \qquad (A2)$$
Vice versa, we know that because of $d(x^{(t)}) \in X - x^{(t)}$ it holds that $d(x^{(t)}) + x^{(t)} - x \in X - x$, and as above we find that
$$\max_\ell \langle \nabla f_\ell(x), d(x) \rangle \le \lambda^{(t)} \max_\ell \langle \nabla f_\ell(x), d(x^{(t)}) \rangle + \lambda^{(t)} \max_\ell \langle \nabla f_\ell(x), x^{(t)} - x \rangle \qquad (A3)$$
with
$$\lambda^{(t)} := \begin{cases} \min\left\{ 1,\ \dfrac{1}{\| d(x^{(t)}) + x^{(t)} - x \|} \right\} & \text{if } d(x^{(t)}) \ne x - x^{(t)}, \\ 1 & \text{else.} \end{cases}$$
Again, the last term of (A3) vanishes in the limit, so that by using the properties of the maximum function and the continuity of $\nabla f_\ell$, as well as $\lambda^{(t)} \xrightarrow{t\to\infty} 1$, in taking the limit inferior on (A3) we find that
$$\begin{aligned} \theta(x) = \max_\ell \langle \nabla f_\ell(x), d(x) \rangle &\le \liminf_{t\to\infty} \max_\ell \langle \nabla f_\ell(x), d(x^{(t)}) \rangle \\ &\le \liminf_{t\to\infty} \left[ \left( \max_\ell \langle \nabla f_\ell(x), d(x^{(t)}) \rangle - \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \right) + \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \right] \\ &\le \liminf_{t\to\infty} \left[ \big\| Df(x) - Df(x^{(t)}) \big\| \big\| d(x^{(t)}) \big\| + \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle \right] \\ &\le \liminf_{t\to\infty} \max_\ell \langle \nabla f_\ell(x^{(t)}), d(x^{(t)}) \rangle = \liminf_{t\to\infty} \theta\big(x^{(t)}\big). \end{aligned} \qquad (A4)$$
Combining (A2) and (A4) shows that $\theta\big(x^{(t)}\big) \xrightarrow{t\to\infty} \theta(x)$.

Theorem 2 claims that $\omega(x)$ is uniformly continuous, provided the objective gradients are Lipschitz. The implied Cauchy continuity is an important property in the convergence proof of the algorithm.

Proof of Theorem 2.
We will consider the constrained case only, i.e., $X$ convex and compact, and show uniform continuity a fortiori by proving that $\omega(\bullet)$ is Lipschitz. Let the objective gradients be Lipschitz continuous. Then $Df$ is Lipschitz as well, with constant $L > 0$. Let $x, y \in X$ with $x \ne y$ (the other case is trivial) and let again $d(x), d(y)$ be the respective optimizers. Suppose w.l.o.g. that
$$\left| \max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \right| = \max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle.$$
If we define
$$(0,1] \ni \sigma := \begin{cases} \min\left\{ 1,\ \dfrac{1}{\| d(y) + y - x \|} \right\} & \text{if } d(y) \ne x - y, \\ 1 & \text{else,} \end{cases}$$
then again $\sigma \big( d(y) + y - x \big)$ is feasible for (P1) at $x$. Thus,
$$\begin{aligned} \max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle &\le \max_\ell \langle \nabla f_\ell(x), \sigma ( d(y) + y - x ) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \\ &\le \| \sigma\, Df(x) ( d(y) + y - x ) - Df(y)\, d(y) \| \\ &\le \| \sigma\, Df(x) - Df(y) \| \| d(y) \| + \| Df(x) \| \| x - y \|, \end{aligned} \qquad (A5)$$
where we have again used the maximum property 2 for the second inequality. We now investigate the first term on the RHS. Using $\| d(y) \| \le 1$,
$$\| \sigma\, Df(x) - Df(y) \| \| d(y) \| \le \| Df(x) - Df(y) \| + (1 - \sigma) \| Df(x) \| \le L \| x - y \| + (1 - \sigma) \| Df(x) \|. \qquad (A6)$$
Furthermore, $\| d(y) + y - x \| \le 1 + \| y - x \|$ implies $1 / (1 + \| y - x \|) \le \sigma$ and
$$1 - \sigma \le 1 - \frac{1}{1 + \| y - x \|} = \frac{\| y - x \|}{1 + \| y - x \|} \le \| y - x \|.$$
We use this inequality and plug (A6) into (A5) to obtain
$$\max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \le L \| x - y \| + 2 \| Df(x) \| \| x - y \| \le (L + 2D) \| x - y \|,$$
with $D = \max_{x \in X} \| Df(x) \|$, which is well defined because $X$ is compact and $\| Df(\bullet) \|$ is continuous.

Appendix A.2. Modified Criticality Measures
Proof of Lemma 5.
There are two cases to consider:
• If $\omega_m^{(t)}\big(x^{(t)}\big) \ge \omega\big(x^{(t)}\big)$, then
$$\left| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \right| = \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \le \kappa_\omega\, \omega_m^{(t)}\big(x^{(t)}\big),$$
and then also
$$\left| v_m^{(t)}\big(x^{(t)}\big) - v\big(x^{(t)}\big) \right| \le \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \le \kappa_\omega\, \omega_m^{(t)}\big(x^{(t)}\big).$$
• The case $\omega_m^{(t)}\big(x^{(t)}\big) < \omega\big(x^{(t)}\big)$ can be shown similarly.

Proof of Lemma 6.
Use Lemma 5 and then investigate the two possible cases:
• If $v_m^{(t)}\big(x^{(t)}\big) \ge v\big(x^{(t)}\big)$, then the first inequality follows because $1 \ge 1/(1 + \kappa_\omega)$.
• If $v_m^{(t)}\big(x^{(t)}\big) < v\big(x^{(t)}\big)$, then $v\big(x^{(t)}\big) - v_m^{(t)}\big(x^{(t)}\big) \le \kappa_\omega\, v_m^{(t)}\big(x^{(t)}\big)$, i.e., $v\big(x^{(t)}\big) \le (1 + \kappa_\omega)\, v_m^{(t)}\big(x^{(t)}\big)$, and again the first inequality follows.

Appendix B. Pascoletti-Serafini Step
One example of an alternative descent step $s^{(t)} \in \mathbb{R}^n$ is given in [32]. Thomann and Eichfelder [32] leverage the Pascoletti-Serafini scalarization to define local subproblems that guide the iterates towards the (local) model ideal point. To be precise, it is shown that the trial point $x^{(t)+}$ can be computed as the solution to
$$\min_{\tau \in \mathbb{R},\, x \in B^{(t)}} \tau \quad \text{s.t.} \quad m^{(t)}\big(x^{(t)}\big) + \tau\, r^{(t)} - m^{(t)}(x) \ge 0, \qquad (A7)$$
where $r^{(t)} = m^{(t)}\big(x^{(t)}\big) - i_m^{(t)} \in \mathbb{R}^k_{\ge 0}$ is the direction vector pointing from the local model ideal point $i_m^{(t)} = \big[ i_1^{(t)}, \dots, i_k^{(t)} \big]^T$, with
$$i_\ell^{(t)} = \min_{x \in X} m_\ell^{(t)}(x) \quad \text{for } \ell = 1, \dots, k, \qquad (A8)$$
to the current iterate value. If the surrogates are linear or quadratic polynomials and the trust region uses a $p$-norm with $p \in \{1, 2, \infty\}$, these sub-problems are linear or quadratic programs. A convergence proof for the unconstrained case is given in [32]. It relies on a sufficient decrease bound similar to (20). However, it is not shown that $\kappa_{\mathrm{sd}} \in (0,1)$ exists independent of the iteration index $t$; this is instead stated as an assumption.

Furthermore, constraints (in particular box constraints) are integrated into the definition of $\omega(\bullet)$ and $\omega_m^{(t)}(\bullet)$ using an active set strategy (see [37]). Consequently, both values are no longer Cauchy continuous. We can remedy both drawbacks by relating the (possibly constrained) Pascoletti-Serafini trial point to the strict modified Pareto-Cauchy point in our projection framework. To this end, we allow in (A7) and (A8) any feasible set fulfilling Assumption 1. Moreover, we recite the following assumption:

Assumption 10 (Assumption 4.10 in [32]). There is a constant $r \in (0,1]$ so that, if $x^{(t)}$ is not Pareto-critical, the components $r_1^{(t)}, \dots, r_k^{(t)}$ of $r^{(t)}$ satisfy
$$\frac{\min_\ell r_\ell^{(t)}}{\max_\ell r_\ell^{(t)}} \ge r.$$
The assumption can be justified because $r_\ell^{(t)} > 0$ if $x^{(t)}$ is not critical, and $r_\ell^{(t)}$ can be bounded above and below by expressions involving $\omega_m^{(t)}(\bullet)$, see Remark 3 and [32, Lemma 4.9]. We can then derive the following lemma:

Lemma A1.
Suppose Assumptions 1, 2 and 10 hold. Let $\big(\tau^+, x^{(t)+}\big)$ be the solution to (A7). Then there exists a constant $\tilde{\kappa}_{\mathrm{sdm}} \in (0,1)$ such that
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)+}\big) \ge \tilde{\kappa}_{\mathrm{sdm}}\, \omega_m^{(t)}\big(x^{(t)}\big) \min\left\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\,H_m^{(t)}},\ \Delta^{(t)},\ 1 \right\}.$$

Proof. If $x^{(t)}$ is critical for (MOPm), then $\tau^+ = 0$, $x^{(t)+} = x^{(t)}$ and the bound is trivial [5]. Otherwise, we can use the same argumentation as in [32, Lemma 4.13] to show that for the strict modified Pareto-Cauchy point $\hat{x}^{(t)}_{\mathrm{PC}}$ it holds that
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)+}\big) \ge r \min_\ell \left\{ m_\ell^{(t)}\big(x^{(t)}\big) - m_\ell^{(t)}\big(\hat{x}^{(t)}_{\mathrm{PC}}\big) \right\},$$
and the final bound follows from Corollary 2 with the new constant $\tilde{\kappa}_{\mathrm{sdm}} = r\, \kappa_{\mathrm{sdm}}$.

References
1. Ehrgott, M. Multicriteria Optimization, 2nd ed.; Springer: Berlin; New York, 2005.
2. Jahn, J. Vector Optimization: Theory, Applications, and Extensions, 2nd ed.; Springer. OCLC: 725378304.
3. Miettinen, K. Nonlinear Multiobjective Optimization; Springer Verlag, 2013. OCLC: 1089790877.
4. Eichfelder, G. Twenty Years of Continuous Multiobjective Optimization. p. 34.
5. Eichfelder, G.
Adaptive Scalarization Methods in Multiobjective Optimization ; Vector Optimization, Springer Berlin Heidelberg.doi:10.1007/978-3-540-79159-1.6. Fukuda, E.H.; Drummond, L.M.G. A SURVEY ON MULTIOBJECTIVE DESCENT METHODS. , 585–620. doi:10.1590/0101-7438.2014.034.03.0585.7. Fliege, J.; Svaiter, B.F. Steepest descent methods for multicriteria optimization. , 479–494. doi:10.1007/s001860000043.8. Graña Drummond, L.; Svaiter, B. A steepest descent method for vector optimization. Journal of Computational and Applied Mathematics , , 395–414. doi:10.1016/j.cam.2004.06.018.9. Lucambio Pérez, L.R.; Prudente, L.F. Nonlinear Conjugate Gradient Methods for Vector Optimization. , 2690–2720.doi:10.1137/17M1126588.10. Lucambio Pérez, L.R.; Prudente, L.F. A Wolfe Line Search Algorithm for Vector Optimization. , 1–23. doi:10.1145/3342104.11. Gebken, B.; Peitz, S.; Dellnitz, M. A Descent Method for Equality and Inequality Constrained Multiobjective Optimization Problems.Numerical and Evolutionary Optimization – NEO 2017; Trujillo, L.; Schütze, O.; Maldonado, Y.; Valle, P., Eds. Springer, Cham, 2019,pp. 29–61.
12. Hillermeier, C.
Nonlinear multiobjective optimization: a generalized homotopy approach ; Springer Basel AG: Basel, 2001. OCLC: 828735498.13. Gebken, B.; Peitz, S.; Dellnitz, M. On the hierarchical structure of Pareto critical sets.
Journal of Global Optimization , , 891–913.14. Wilppu, O.; Karmitsa, N.; Mäkelä, M. New multiple subgradient descent bundle method for nonsmooth multiobjective optimization. Report no., Turku Centre for Computer Science .15. Gebken, B.; Peitz, S. An Efficient Descent Method for Locally Lipschitz Multiobjective Optimization Problems.
Journal of OptimizationTheory and Applications . doi:10.1007/s10957-020-01803-w.16. Custódio, A.L.; Madeira, J.F.A.; Vaz, A.I.F.; Vicente, L.N. Direct Multisearch for Multiobjective Optimization.
SIAM Journal onOptimization , , 1109–1140. doi:10.1137/10079731x.17. Audet, C.; Savard, G.; Zghal, W. Multiobjective Optimization Through a Series of Single-Objective Formulations. SIAM J. onOptimization , , 188–210. doi:10.1137/060677513.18. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms ; Wiley, 2001.19. Coello, C.A.C.; Lamont, G.B.; Veldhuizen, D.A.V. Evolutionary Algorithms for Solving Multi-Objective Problems Second Edition .20. Abraham, A.; Jain, L.C.; Goldberg, R., Eds.
Evolutionary multiobjective optimization: theoretical advances and applications ; Advancedinformation and knowledge processing, Springer: New York, 2005.21. Zitzler, E. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. p. 134.22. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II.
IEEE Transactions onEvolutionary Computation , , 182–197. doi:10.1109/4235.996017.23. Peitz, S.; Dellnitz, M. A Survey of Recent Trends in Multiobjective Optimal Control—Surrogate Models, Feedback Control andObjective Reduction. Mathematical and Computational Applications , , 30. doi:10.3390/mca23020030.24. Chugh, T.; Sindhya, K.; Hakanen, J.; Miettinen, K. A survey on handling computationally expensive multiobjective optimizationproblems with evolutionary algorithms. Soft Computing , , 3137–3166. doi:10.1007/s00500-017-2965-0.25. Deb, K.; Roy, P.C.; Hussein, R. Surrogate Modeling Approaches for Multiobjective Optimization: Methods, Taxonomy, and Results. Mathematical and Computational Applications , , 5. doi:10.3390/mca26010005.26. Roy, P.C.; Hussein, R.; Blank, J.; Deb, K. Trust-Region Based Multi-objective Optimization for Low Budget Scenarios. In EvolutionaryMulti-Criterion Optimization ; Deb, K.; Goodman, E.; Coello Coello, C.A.; Klamroth, K.; Miettinen, K.; Mostaghim, S.; Reed, P.,Eds.; Springer International Publishing: Cham, 2019; Vol. 11411, pp. 373–385. Series Title: Lecture Notes in Computer Science,doi:10.1007/978-3-030-12598-1_30.27. Conn, A.R.; Scheinberg, K.; Vicente, L.N.
Introduction to derivative-free optimization ; Number 8 in MPS-SIAM series on optimization,Society for Industrial and Applied Mathematics/Mathematical Programming Society. OCLC: ocn244660709.28. Larson, J.; Menickelly, M.; Wild, S.M. Derivative-free optimization methods. arXiv:1904.11585 [math] . arXiv: 1904.11585,doi:10.1017/S0962492919000060.29. Qu, S.; Goh, M.; Liang, B. Trust region methods for solving multiobjective optimisation. , 796–811. doi:10.1080/10556788.2012.660483.30. Villacorta, K.D.V.; Oliveira, P.R.; Soubeyran, A. A Trust-Region Method for Unconstrained Multiobjective Problems with Applicationsin Satisficing Processes. , 865–889. doi:10.1007/s10957-013-0392-7.31. Ryu, J.h.; Kim, S. A Derivative-Free Trust-Region Method for Biobjective Optimization. SIAM Journal on Optimization , , 334–362.doi:10.1137/120864738.32. Thomann, J.; Eichfelder, G. A Trust-Region Algorithm for Heterogeneous Multiobjective Optimization. , 1017–1047.doi:10.1137/18M1173277.33. Wild, S.M.; Regis, R.G.; Shoemaker, C.A. ORBIT: Optimization by Radial Basis Function Interpolation in Trust-Regions. , 3197–3219.doi:10.1137/070691814.34. Conn, A.R.; Scheinberg, K.; Vicente, L.N. Global Convergence of General Derivative-Free Trust-Region Algorithms to First- andSecond-Order Critical Points. , 387–415. doi:10.1137/060673424.35. Conn, A.R.; Gould, N.I.M.; Toint, P.L. Trust-region methods ; MPS-SIAM series on optimization, Society for Industrial and AppliedMathematics.36. Luc, D.T.
Theory of Vector Optimization ; Vol. 319,
Lecture Notes in Economics and Mathematical Systems , Springer Berlin Heidelberg.doi:10.1007/978-3-642-50280-4.37. Thomann, J. A Trust Region Approach for Multi-Objective Heterogeneous Optimization . p. 257.38. Wendland, H.
Scattered Data Approximation , 1 ed.; Cambridge University Press, 2004. doi:10.1017/CBO9780511617539.39. Stefan, W. Derivative-Free Optimization Algorithms For Computationally Expensive Functions .40. Wild, S.M.; Shoemaker, C. Global Convergence of Radial Basis Function Trust Region Derivative-Free Algorithms. , 761–781.doi:10.1137/09074927X.41. Regis, R.G.; Wild, S.M. CONORBIT: constrained optimization by radial basis function interpolation in trust regions. OptimizationMethods and Software , , 552–580. doi:10.1080/10556788.2016.1226305.
42. Fleming, W.
Functions of Several Variables ; Undergraduate Texts in Mathematics, Springer New York: New York, NY, 1977.doi:10.1007/978-1-4684-9461-7.43. Stellato, B.; Banjac, G.; Goulart, P.; Bemporad, A.; Boyd, S. OSQP: an operator splitting solver for quadratic programs.
MathematicalProgramming Computation , , 637–672. doi:10.1007/s12532-020-00179-2.44. Johnson, S.G. The NLopt nonlinear-optimization package.45. Powell, M.J. The BOBYQA algorithm for bound constrained optimization without derivatives .46. Legat, B.; Timme, S.; Weisser, T.; Kapelevich, L.; Rackauckas, C.; TagBot, J. JuliaAlgebra/DynamicPolynomials.jl: v0.3.15, 2020.doi:10.5281/zenodo.4153432.47. Runarsson, T.P.; Yao, X. Search biases in constrained evolutionary optimization. IEEE Transactions on Systems, Man, and Cybernetics,Part C (Applications and Reviews) , , 233–243.48. Revels, J.; Lubin, M.; Papamarkou, T. Forward-Mode Automatic Differentiation in Julia. arXiv:1607.07892 [cs.MS] .49. Zitzler, E.; Deb, K.; Thiele, L. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evolutionary Computation , , 173–195. doi:10.1162/106365600568202.50. Deb, K.; Thiele, L.; Laumanns, M.; Zitzler, E. Scalable Test Problems for Evolutionary Multiobjective Optimization. In EvolutionaryMultiobjective Optimization ; Abraham, A.; Jain, L.; Goldberg, R., Eds.; Springer-Verlag: London, 2005; pp. 105–145. Series Title:Advanced Information and Knowledge Processing, doi:10.1007/1-84628-137-7_6.51. Prinz, S.; Thomann, J.; Eichfelder, G.; Boeck, T.; Schumacher, J. Expensive multi-objective optimization of electromagnetic mixing in aliquid metal.
Optimization and Engineering . doi:10.1007/s11081-020-09561-4.52. Thomann, J.; Eichfelder, G. Representation of the Pareto front for heterogeneous multi-objective optimization.
J. Appl. Numer. Optim , , 293–323.53. Regis, R.G. Multi-objective constrained black-box optimization using radial basis function surrogates. Journal of Computational Science , , 140–155. doi:10.1016/j.jocs.2016.05.013.54. Schütze, O.; Cuate, O.; Martín, A.; Peitz, S.; Dellnitz, M. Pareto Explorer: a global/local exploration tool for many-objectiveoptimization problems. Engineering Optimization ,52