Scalable Subspace Methods for Derivative-Free Nonlinear Least-Squares Optimization
Coralia Cartis∗  Lindon Roberts†

Abstract
We introduce a general framework for large-scale model-based derivative-free optimization based on iterative minimization within random subspaces. We present a probabilistic worst-case complexity analysis for our method, where in particular we prove high-probability bounds on the number of iterations before a given optimality is achieved. This framework is specialized to nonlinear least-squares problems, with a model-based approach based on the Gauss-Newton method. This method achieves scalability by constructing local linear interpolation models to approximate the Jacobian, and computes new steps at each iteration in a subspace with user-determined dimension. We then describe a practical implementation of this framework, which we call DFBGN. We outline efficient techniques for selecting the interpolation points and search subspace, yielding an implementation that has a low per-iteration linear algebra cost (linear in the problem dimension) while also achieving fast objective decrease as measured by evaluations. Extensive numerical results demonstrate that DFBGN has improved scalability, yielding strong performance on large-scale nonlinear least-squares problems.
Keywords: derivative-free optimization, large-scale optimization, nonlinear least-squares, worst-case complexity.
Mathematics Subject Classification:
An important class of nonlinear optimization methods is so-called derivative-free optimization (DFO). In DFO, we consider problems where derivatives of the objective (and/or constraints) are not available to be evaluated, and we only have access to function values. This topic has received growing attention in recent years, and is primarily used for objectives which are black-box (so analytic derivatives or algorithmic differentiation are not available), and expensive to evaluate or noisy (so finite differencing is impractical or inaccurate). There are many types of DFO methods, such as model-based, direct and pattern search, implicit filtering and others (see [50] for a recent survey), and these techniques have been used in a variety of applications [1].

Here, we consider model-based DFO methods for unconstrained optimization, which are based on iteratively constructing and minimizing interpolation models for the objective. We also specialize these methods for nonlinear least-squares problems, by constructing interpolation models for each residual term rather than for the full objective [79, 73, 18].

This paper aims to provide a method that attempts to answer a key question regarding model-based DFO: how to improve the scalability of this class. Existing model-based DFO

∗ Mathematical Institute, University of Oxford, Radcliffe Observatory Quarter, Woodstock Road, Oxford, OX2 6GG, United Kingdom ([email protected]). This work was supported by the EPSRC Centre for Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) in collaboration with the Numerical Algorithms Group Ltd.

† Mathematical Sciences Institute, Building 145, Science Road, Australian National University, Canberra ACT 2601, Australia ([email protected]). We note that the work in Sections 4 and 5 originally appeared in this author's thesis [65, Chapter 7].
techniques are primarily designed for small- to medium-scale problems, as the linear algebra cost of each iteration (largely due to the cost of constructing interpolation models) means that their runtime increases rapidly for large problems. There are several settings where scalable DFO algorithms may be useful, such as data assimilation [9, 3], machine learning [67, 35], generating adversarial examples for deep neural networks [2, 68], image analysis [30], and as a possible proxy for global optimization methods [19].

To address this, we introduce RSDFO, a scalable algorithmic framework for model-based DFO. At each iteration of RSDFO we select a random low-dimensional subspace, build and minimize a model to compute a step in this space, then change the subspace at the next iteration. We provide a probabilistic worst-case complexity analysis of RSDFO. To our knowledge, this is the first subspace model-based DFO method with global complexity and convergence guarantees. We then describe how this general framework can be specialized to the case of nonlinear least-squares minimization through a model construction technique inspired by the Gauss-Newton method, yielding a new algorithm RSDFO-GN with associated worst-case complexity bounds. We then present an efficient implementation of RSDFO-GN, which we call DFBGN. DFBGN is available on Github and includes several algorithmic features that yield strong performance on large-scale problems and a low per-iteration linear algebra cost that is typically linear in the problem dimension.

The contributions in this paper are connected to several areas of research. We briefly review these topics below.
Block Coordinate Descent
There is a large body of work on (derivative-based) block coordinate descent (BCD) methods, typically motivated by machine learning applications. BCD extends coordinate search methods [75] by updating a subset of the variables at each iteration, typically using a coordinate-wise variant of a first-order method. For nonconvex problems, the first convergence result for a randomized coordinate descent method based on proximal gradient descent was given in [58]. Here, the sampling of coordinates was uniform and required step sizes based on Lipschitz constants associated with the objective. This was extended in [52] to general randomized block selection with a nonmonotone linesearch-type method (to allow for unknown Lipschitz constants), and to a (possibly deterministic) ‘essentially cyclic’ block selection and extrapolation (but requiring Lipschitz constants) in [77]. Several extensions of this approach have been developed, including the use of stochastic gradients [76], parallel block updating [32] and inexact step calculations [32, 78].

BCD methods have been extended to nonlinear least-squares problems, leading to so-called Subspace Gauss-Newton methods. These are derivative-based methods where a Gauss-Newton step is computed for a subset of variables. This approach was initially proposed in [70] for parameter estimation in climate models, where derivative estimates were computed using implicit filtering [49], and analyzed in quadratic regularization and trust-region settings for general unconstrained objectives in [15, 16].
Sketching
Sketching is an alternative dimensionality reduction technique for least-squares problems, reducing the number of residuals rather than the number of variables. Sketching ideas have been applied to linear [46, 53, 74] and nonlinear [31] least-squares problems, to model-based DFO for nonlinear least-squares [13], and to subsampling algorithms for finite sum-of-functions minimization such as Newton's method [66, 6].

There are also alternative approaches to sketching which lead to subspace-type methods, where local gradients and Hessians are estimated only within a subspace (possibly used in conjunction with random subsampling). Sketching in this context has been applied to, for example, Newton's method [59, 40, 6], BFGS [39], and SAGA [41], as well as to trust-region and quadratic regularization methods [15, 16].

https://github.com/numericalalgorithmsgroup/dfbgn

Random Embeddings for Global Optimization

Some global optimization methods have been proposed which randomly project a high-dimensional problem into a low-dimensional subspace and solve this smaller problem using existing (global or local) methods. Though applicable to general global optimization problems (as a more sophisticated variant of random search), this technique has been explored particularly for defeating the curse of dimensionality when optimising functions which have low effective dimensionality [64, 72, 17]. For the latter class, often only one random subspace projection is needed, though the addition of constraints leads to multiple embeddings being required [17]. Our approach here differs from these works in both theoretical and numerical aspects, as it is focused on a specific random subspace technique for local optimization.
Probabilistic Model-Based DFO
For model-based DFO, several algorithms have been developed and analyzed where the local model at each iteration is only sufficiently accurate with a certain probability [5, 21, 11]. Similar analysis also exists for derivative-based algorithms [20, 43]. Our approach is based on deterministic model-based DFO within subspaces, and we instead require a very weak probabilistic condition on the (randomly chosen) subspaces (Assumption 2.9).
Randomized Direct Search DFO
In randomized direct search methods, iterates are perturbed in a random subset of directions (rather than a positive spanning set) when searching for local improvement. In this framework, effectively only a random subspace is searched in each iteration. Worst-case complexity bounds for this technique are given under predetermined step length regimes in [10, 36], and with adaptive step sizes in [42, 44], where [44] extends [42] to linearly constrained problems.
Large-Scale DFO
There have been several alternative approaches considered for improving the scalability of DFO. These often consider problems with specific structure which enables efficient model construction, such as partial separability [24, 60], sparse Hessians [4], and minimization over the convex hull of finitely many points [27]. On the other hand, there is a growing body of literature on ‘gradient sampling’ techniques for machine learning problems. These methods typically consider stochastic first-order methods but with a gradient approximation based on finite differencing in random directions [56]. This framework has led to variants of methods such as stochastic gradient descent [34], SVRG [51] and Adam [22], for example. We note that linear interpolation to orthogonal directions, more similar to traditional model-based DFO, has been shown to outperform gradient sampling as a gradient estimation technique [8, 7].
Subspace DFO Methods
A model-based DFO method with similarities to our subspace approach is the moving ridge function method from [45]. Here, existing objective evaluations are used to determine an ‘active subspace’ which captures the largest variability in the objective, and an interpolation model is built within this subspace. We also note the VXQR method from [57], which performs line searches along a direction chosen from a subspace determined by previous iterates. Neither of these methods includes convergence theory. By comparison, aside from our focus on nonlinear least-squares problems, both our general theoretical framework and our implemented method select their working subspaces randomly, and we provide (probabilistic) convergence guarantees.
We introduce RSDFO (Randomized Subspace Derivative-Free Optimization), a generic model-based DFO framework that relies on constructing a model in a subspace at each iteration. Our novel approach enables model-based DFO methods to be applied in a large-scale regime by giving the user explicit control over the subspace dimension, and hence control over the per-iteration linear algebra cost of the method. This framework is then specialized to the case of nonlinear least-squares problems, yielding a new algorithm RSDFO-GN (Randomized Subspace DFO with Gauss-Newton). The subspace model construction framework of RSDFO-GN is based on DFO Gauss-Newton methods [18, 14], and retains the same theoretical guarantees as RSDFO. We then describe a practical implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss-Newton). Compared to existing methods, DFBGN reduces the linear algebra cost of model construction and the initial objective evaluation cost by allowing fewer interpolation points at every iteration. In order for DFBGN to have both scalability and a similar evaluation efficiency to existing methods (i.e. objective reduction achieved for a given number of objective evaluations), several modifications to the theoretical framework, regarding the selection of interpolation points and the search subspace, are necessary.
Theoretical Results
We consider a generic theoretical framework RSDFO, where the subspace dimension is a user-chosen algorithm hyperparameter and no specific model construction approach is specified. Our framework is not specific to a least-squares problem structure: it holds for any objective with Lipschitz continuous gradient, and allows for a general class of random subspace constructions (not relying on a specific class of embeddings or projections).

The theoretical results here extend the approach and techniques in [15, 16] to model-based DFO methods. In particular, we use the notion of a well-aligned subspace (Definition 2.8) from [15, 16], one in which sufficient decrease is achievable, and assume that our search subspace is well-aligned with some probability (Assumption 2.9). This is achieved provided we select a sufficiently large subspace dimension (depending on the desired failure probability and subspace alignment quality).

We derive a high probability worst-case complexity bound for RSDFO. Specifically, our main bounds are of the form
$$\mathbb{P}\left[\min_{j \leq k} \|\nabla f(x_j)\| \leq C k^{-1/2}\right] \geq 1 - e^{-ck} \quad \text{and} \quad \mathbb{P}\left[K_\epsilon \leq C \epsilon^{-2}\right] \geq 1 - e^{-c\epsilon^{-2}},$$
where $K_\epsilon$ is the first iteration to achieve first-order optimality $\epsilon$ (see Theorem 2.18 and Corollary 2.19). This result then implies a variety of alternative convergence results, such as expectation bounds and almost-sure convergence. Based on [15, 16], we give several constructions for determining our random subspace, and show that for constructions based on Johnson-Lindenstrauss transformations, we can achieve convergence with a subspace dimension that is independent of the ambient dimension.

Our analysis matches the standard deterministic $\mathcal{O}(\epsilon^{-2})$ complexity bounds for model-based DFO methods (e.g. [33, 18]). Compared to the analysis of derivative-based methods (e.g.
BCD [77] and probabilistically accurate models [20]) we need to incorporate the possibility that the interpolation model is not accurate (not fully linear, see Definition 2.1). However, unlike [5, 21, 11] we do not assume that full linearity is a stochastic property; instead, our stochasticity comes from the subspace selection and we explicitly handle non-fully linear models similarly to [26, 18]. This gives us a framework which is similar to standard model-based DFO and with weak probabilistic conditions. Compared to the analysis of derivative-based random subspace methods in [15, 16], our analysis is complicated substantially by the possibility of inaccurate models and the intricacies of model-based DFO algorithms.

We then consider RSDFO-GN, which explicitly describes how interpolation models can be constructed for nonlinear least-squares problems, thus providing a concrete implementation of RSDFO in this context. We prove that RSDFO-GN retains the same complexity bounds as RSDFO.

Implementation
Although RSDFO-GN gives a reduced linear algebra cost of model construction compared to existing methods, it is important that we have an implementation that achieves this reduced cost without sacrificing overall performance (in the sense of objective decrease achieved within a given number of objective evaluations).

We introduce a practical, implementable variant of RSDFO-GN called DFBGN, which is based on the solver DFO-LS [14]. DFBGN includes an efficient, geometry-aware approach for selecting interpolation points, and hence directions in our subspace, for removal, and an adaptive randomized approach for selecting new interpolation points/subspace directions. We study the per-iteration linear algebra cost of DFBGN, and show that it is linear in the problem dimension, a substantial improvement over existing methods, which are cubic in the problem dimension.

Technically, DFBGN is not a block method as its subspaces are not coordinate-aligned, but it has already been released with this name.
Numerical Results
We compare DFBGN with DFO-LS (which itself is shown to have state-of-the-art performance in [14]) on collections of both medium-scale (approx. 100 dimensions) and large-scale (approx. 1000 dimensions) test problems. We show that DFBGN with a full-sized subspace has similar performance to DFO-LS in terms of objective evaluations, but improved runtime. As the dimension of the subspace reduces (i.e. the size of the interpolation set reduces), we demonstrate a tradeoff between reduced linear algebra costs and increased evaluation counts required to achieve a given objective reduction. The flexibility of DFBGN allows this tradeoff to be explicitly managed. When tested on large-scale problems, DFO-LS frequently reaches a reasonable runtime limit without making substantial progress, whereas DFBGN with a small subspace size can perform many more iterations and hence make better progress than DFO-LS. In the case of expensive objectives with small evaluation budgets, we show that DFBGN can make progress with few objective evaluations in a similar way to DFO-LS (which has a mechanism to make progress from as few as 2 objective evaluations, independent of problem dimension), but with substantially lower linear algebra costs.
Structure of paper
In Section 2 we describe RSDFO and provide our probabilistic worst-case complexity analysis. We specialize RSDFO to RSDFO-GN in Section 3. Then we describe the practical implementation DFBGN and its features in Section 4. Our numerical results are given in Section 5.
Implementation
A Python implementation of DFBGN is available on Github.

Notation
We use $\|\cdot\|$ to refer to the Euclidean norm of vectors and the operator 2-norm of matrices, and $B(x, \Delta)$ for $x \in \mathbb{R}^n$ and $\Delta > 0$ to be the closed ball $\{y \in \mathbb{R}^n : \|y - x\| \leq \Delta\}$.

In this section we outline our general model-based DFO algorithmic framework based on minimization in random subspaces. We consider the nonconvex problem
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (2.1)$$
where we assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, but that access to its gradient is not possible (e.g. for the reasons described in Section 1). In a standard model-based DFO framework (e.g. [26, 50]), at each iteration $k$ we construct a quadratic model $m_k : \mathbb{R}^n \to \mathbb{R}$ which approximates $f$ near our iterate $x_k$:
$$f(x_k + s) \approx m_k(s) := f(x_k) + g_k^T s + \frac{1}{2} s^T H_k s, \qquad (2.2)$$
for some $g_k \in \mathbb{R}^n$ and symmetric $H_k \in \mathbb{R}^{n \times n}$. Based on this model, we build a globally convergent algorithm using a trust-region framework [25]. This algorithmic framework is suitable provided that, when necessary, we can guarantee $m_k$ is a sufficiently accurate model for $f$ near $x_k$. Details about how to construct sufficiently accurate models based on interpolation are given in [26].

Our core idea here is to construct interpolation models which only approximate the objective in a subspace, rather than in the full space $\mathbb{R}^n$. This allows us to use interpolation sets with fewer points, since we do not have to capture the objective's behaviour outside our subspace, which improves the scalability of the method.
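To make the interpolation idea concrete, the following sketch builds a purely linear model ($H_k = 0$ in (2.2)) from $n+1$ function values; the objective, points and names here are hypothetical illustrations, not the interpolation scheme of [26] or Section 3.

```python
import numpy as np

def linear_interpolation_model(f, xk, directions):
    """Build a linear model m(s) = f(xk) + g^T s that interpolates f at
    xk + y_i for each column y_i of `directions` (an n-by-n matrix).

    Illustrative sketch only: practical DFO codes reuse points across
    iterations and monitor the conditioning of this linear system.
    """
    fk = f(xk)
    # Interpolation conditions m(y_i) = f(xk + y_i) give the linear
    # system Y^T g = rhs for the model gradient g.
    rhs = np.array([f(xk + y) - fk for y in directions.T])
    g = np.linalg.solve(directions.T, rhs)
    return fk, g

# Usage: a quadratic objective, interpolating along coordinate steps 0.1*e_i
f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
xk = np.array([1.0, -1.0])
fk, g = linear_interpolation_model(f, xk, 0.1 * np.eye(2))
# g approximates the true gradient [2, -6]
```

For a full-space model this system is $n \times n$; the subspace approach below replaces it with a much smaller $p \times p$ system.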
In this section, we outline our general algorithmic framework and provide a worst-case complexity analysis showing convergence to first-order stationary points with high probability. We then describe how this framework may be specialized to the case of nonlinear least-squares minimization.
In our general framework, which we call RSDFO (Randomized Subspace DFO), we modify the above approach by randomly choosing a $p$-dimensional subspace (where $p < n$ is user-chosen) and constructing an interpolation model defined only in that subspace. Specifically, in each iteration $k$ we randomly choose a $p$-dimensional affine space $\mathcal{Y}_k \subset \mathbb{R}^n$ given by the range of $Q_k \in \mathbb{R}^{n \times p}$, i.e.
$$\mathcal{Y}_k = \{x_k + Q_k \hat{s} : \hat{s} \in \mathbb{R}^p\}. \qquad (2.3)$$
We then construct a model which interpolates $f$ at points in $\mathcal{Y}_k$, and ultimately construct a local quadratic model for $f$ only on $\mathcal{Y}_k$. That is, given $Q_k$, we assume that we have $\hat{m}_k : \mathbb{R}^p \to \mathbb{R}$ given by
$$f(x_k + Q_k \hat{s}) \approx \hat{m}_k(\hat{s}) := f(x_k) + \hat{g}_k^T \hat{s} + \frac{1}{2} \hat{s}^T \hat{H}_k \hat{s}, \qquad (2.4)$$
where $\hat{g}_k \in \mathbb{R}^p$ and $\hat{H}_k \in \mathbb{R}^{p \times p}$ are the low-dimensional model gradient and Hessian respectively, adopting the convention of using hats on variables to denote low-dimensional quantities. In Section 3 we specialize this to a model construction process for nonlinear least-squares problems.

For our trust-region algorithm, we (approximately) minimize $\hat{m}_k$ inside the trust region to get a tentative step
$$\hat{s}_k \approx \arg\min_{\hat{s} \in \mathbb{R}^p} \hat{m}_k(\hat{s}), \quad \text{s.t.} \quad \|\hat{s}\| \leq \Delta_k, \qquad (2.5)$$
for the current trust-region radius $\Delta_k > 0$, yielding a tentative step $s_k = Q_k \hat{s}_k \in \mathbb{R}^n$. We thus also get the computational advantage coming from solving a $p$-dimensional trust-region subproblem.

In our setting we are only interested in the approximation properties of $\hat{m}_k$ in the space $\mathcal{Y}_k$, and so we introduce the following notion of a "sufficiently accurate" model:

Definition 2.1.
Given $Q \in \mathbb{R}^{n \times p}$, a model $\hat{m} : \mathbb{R}^p \to \mathbb{R}$ is $Q$-fully linear in $B(x, \Delta) \subset \mathbb{R}^n$ if
$$|f(x + Q\hat{s}) - \hat{m}(\hat{s})| \leq \kappa_{ef} \Delta^2, \qquad (2.6a)$$
$$\|Q^T \nabla f(x + Q\hat{s}) - \nabla \hat{m}(\hat{s})\| \leq \kappa_{eg} \Delta, \qquad (2.6b)$$
for all $\hat{s} \in \mathbb{R}^p$ with $\|\hat{s}\| \leq \Delta$. The constants $\kappa_{ef}$ and $\kappa_{eg}$ must be independent of $Q$, $\hat{m}$, $x$ and $\Delta$.

The gradient condition (2.6b) comes from noting that if $\hat{f}(\hat{s}) := f(x + Q\hat{s})$ then $\nabla \hat{f}(\hat{s}) = Q^T \nabla f(x + Q\hat{s})$. We note that if we have full-dimensional subspaces $p = n$ and take $Q = I$, then we recover the standard notion of fully linear models [26, Definition 6.1].

Complete RSDFO Algorithm
The complete RSDFO algorithm is stated in Algorithm 1. The overall structure is common to model-based DFO methods [26]. In particular, we assume that we have procedures to verify whether or not a model is $Q_k$-fully linear in $B(x_k, \Delta_k)$ and (if not) to generate a $Q_k$-fully linear model. When we specialize RSDFO to nonlinear least-squares problems in Section 3, we will describe how we can obtain such procedures.

After defining our subspace and model, we first perform a criticality step, which guarantees that, whenever we suspect we are close to first-order stationarity (as measured by $\|\hat{g}_k\|$), we have an accurate model and an appropriately sized trust-region radius. Often this criticality step is formulated as its own subroutine with an extra inner loop [26], but following [33] we only perform one criticality step per iteration and avoid nested loops.

Formally, we define our model in an affine space, but we call it a subspace throughout as this fits with an intuitive view of what RSDFO aims to achieve.

Algorithm 1 RSDFO (Randomized Subspace Derivative-Free Optimization) for solving (2.1).
Input:
Starting point $x_0 \in \mathbb{R}^n$, initial trust-region radius $\Delta_0 > 0$, and subspace dimension $p \in \{1, \ldots, n\}$.
Parameters: maximum trust-region radius $\Delta_{\max} \geq \Delta_0$, trust-region radius scalings $0 < \gamma_{dec} < 1 < \gamma_{inc} \leq \overline{\gamma}_{inc}$, criticality constants $\epsilon_C, \mu > 0$ and trust-region scaling $0 < \gamma_C < 1$, safety step threshold $\beta_F > 0$ and trust-region scaling $0 < \gamma_F < 1$, and acceptance thresholds $0 < \eta_1 \leq \eta_2 < 1$.
1: Set flag CHECK_MODEL = FALSE.
2: for $k = 0, 1, 2, \ldots$ do
3: if CHECK_MODEL = TRUE then
4: Set $Q_k = Q_{k-1}$.
5: Construct a reduced model $\hat{m}_k : \mathbb{R}^p \to \mathbb{R}$ (2.4) which is $Q_k$-fully linear in $B(x_k, \Delta_k)$.
6: else
7: Define a subspace by randomly sampling $Q_k \in \mathbb{R}^{n \times p}$.
8: Construct a reduced model $\hat{m}_k : \mathbb{R}^p \to \mathbb{R}$ (2.4) which need not be $Q_k$-fully linear.
9: end if
10: if $\|\hat{g}_k\| < \epsilon_C$ and ($\|\hat{g}_k\| < \mu^{-1} \Delta_k$ or $\hat{m}_k$ is not $Q_k$-fully linear in $B(x_k, \Delta_k)$) then
11: (Criticality step) Set $x_{k+1} = x_k$, $\Delta_{k+1} = \gamma_C \Delta_k$ and CHECK_MODEL = TRUE.
12: else  [i.e. $\|\hat{g}_k\| \geq \epsilon_C$ or ($\|\hat{g}_k\| \geq \mu^{-1} \Delta_k$ and $\hat{m}_k$ is $Q_k$-fully linear in $B(x_k, \Delta_k)$)]
13: Approximately solve the subspace trust-region subproblem (2.5) in $\mathbb{R}^p$ and calculate the step $s_k = Q_k \hat{s}_k \in \mathbb{R}^n$.
14: if $\|\hat{s}_k\| < \beta_F \Delta_k$ then
15: (Safety step) Set $x_{k+1} = x_k$ and $\Delta_{k+1} = \gamma_F \Delta_k$.
16: If $\hat{m}_k$ is $Q_k$-fully linear in $B(x_k, \Delta_k)$, then set CHECK_MODEL = FALSE, otherwise set
CHECK_MODEL = TRUE.
17: else
18: Evaluate $f(x_k + s_k)$ and calculate the ratio
$$\rho_k := \frac{f(x_k) - f(x_k + s_k)}{\hat{m}_k(0) - \hat{m}_k(\hat{s}_k)}. \qquad (2.7)$$
19: Accept/reject step and update trust-region radius: set
$$x_{k+1} = \begin{cases} x_k + s_k, & \rho_k \geq \eta_1, \\ x_k, & \rho_k < \eta_1, \end{cases} \quad \text{and} \quad \Delta_{k+1} = \begin{cases} \min(\max(\gamma_{inc} \Delta_k, \overline{\gamma}_{inc} \|\hat{s}_k\|), \Delta_{\max}), & \rho_k \geq \eta_2, \\ \max(\gamma_{dec} \Delta_k, \|\hat{s}_k\|), & \eta_1 \leq \rho_k < \eta_2, \\ \min(\gamma_{dec} \Delta_k, \|\hat{s}_k\|), & \rho_k < \eta_1. \end{cases} \qquad (2.8)$$
20: If $\rho_k \geq \eta_1$ or $\hat{m}_k$ is $Q_k$-fully linear in $B(x_k, \Delta_k)$, then set CHECK_MODEL = FALSE, otherwise set
CHECK_MODEL = TRUE.
21: end if
22: end if
23: end for
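To illustrate the flow of steps 7, 13 and 18-19 of Algorithm 1, the sketch below runs one simplified iteration. Everything here is a hypothetical illustration: a Gaussian sketch for $Q_k$, exact reduced derivatives in place of an interpolation model, a Cauchy-point subproblem solver, illustrative parameter values, and no criticality or safety logic.

```python
import numpy as np

def cauchy_step(g_hat, H_hat, delta):
    """Approximately solve the reduced subproblem (2.5) by minimizing the
    model along -g_hat inside the ball of radius delta (assumes g_hat != 0).
    This simple step satisfies the usual Cauchy decrease condition."""
    gnorm = np.linalg.norm(g_hat)
    t = delta / gnorm                     # boundary step length along -g_hat
    curv = g_hat @ H_hat @ g_hat
    if curv > 0:
        t = min(t, gnorm**2 / curv)       # 1-d unconstrained minimizer
    return -t * g_hat

def update_radius(delta, rho, step_norm, eta1=0.1, eta2=0.7, gamma_dec=0.5,
                  gamma_inc=2.0, gamma_inc_bar=4.0, delta_max=1e3):
    """Trust-region radius update mirroring (2.8); parameter values are
    illustrative defaults, not those recommended in the paper."""
    if rho >= eta2:   # very successful
        return min(max(gamma_inc * delta, gamma_inc_bar * step_norm), delta_max)
    if rho >= eta1:   # successful
        return max(gamma_dec * delta, step_norm)
    return min(gamma_dec * delta, step_norm)  # unsuccessful

# One illustrative iteration on a smooth quadratic test objective
rng = np.random.default_rng(0)
n, p, delta = 50, 5, 1.0
f = lambda x: np.sum((x - 1.0) ** 2)
xk = np.zeros(n)

Q = rng.standard_normal((n, p)) / np.sqrt(p)   # random subspace (step 7)
g_hat = Q.T @ (2 * (xk - 1.0))                 # reduced model gradient
H_hat = 2 * (Q.T @ Q)                          # reduced model Hessian
s_hat = cauchy_step(g_hat, H_hat, delta)       # step 13
sk = Q @ s_hat
pred = -(g_hat @ s_hat + 0.5 * s_hat @ H_hat @ s_hat)
rho = (f(xk) - f(xk + sk)) / pred              # ratio (2.7), step 18
xk1 = xk + sk if rho >= 0.1 else xk            # accept/reject, step 19
delta1 = update_radius(delta, rho, np.linalg.norm(s_hat))
```

Since the objective here is quadratic and the reduced derivatives are exact, the model predicts the decrease exactly and the iteration is very successful; with an interpolation model, $\rho_k$ would generally differ from 1.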
The remainder of Algorithm 1 broadly follows standard trust-region methods. If the computed step is much shorter than the trust-region radius, we enter a safety step, originally due to Powell [61], which is similar to an unsuccessful iteration (i.e. $\rho_k < \eta_1$, so we reject the step and decrease $\Delta_k$) but without evaluating $f(x_k + s_k)$, hence saving one objective evaluation. Our updating mechanism for $\Delta_k$ (2.8) also takes into account the computed step size, and is the same as in [63, 18].

An important feature of RSDFO is that in some iterations we reuse the previous subspace, $\mathcal{Y}_k = \mathcal{Y}_{k-1}$, corresponding to the flag CHECK_MODEL = TRUE. In this case, we had an inaccurate model in iteration $k-1$ and require that our new model $\hat{m}_k$ is accurate ($Q_k$-fully linear). This mechanism essentially ensures that $\Delta_k$ is not decreased too quickly as a result of inaccurate models, and is mostly decreased to achieve sufficient objective reduction.

We now give our convergence and worst-case complexity analysis of Algorithm 1. We begin our analysis with some basic assumptions and preliminary results.
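As a brief numerical aside before the analysis, the scaling required by the gradient condition (2.6b) of Definition 2.1 can be observed empirically. The sketch below builds a reduced linear model by forward differences along the columns of an orthonormal $Q$ (a hypothetical construction for illustration only; the paper's interpolation scheme appears in Section 3) and checks that the subspace gradient error shrinks like $\mathcal{O}(\Delta)$:

```python
import numpy as np

def reduced_linear_model(f, x, Q, delta):
    """Interpolate f_hat(s_hat) = f(x + Q s_hat) at s_hat = 0 and
    s_hat = delta * e_i, giving the reduced model gradient g_hat
    (equivalent to forward differences along the columns of Q)."""
    fx = f(x)
    g_hat = np.array([(f(x + delta * q) - fx) / delta for q in Q.T])
    return fx, g_hat

# Smooth test objective with known gradient (assumed for illustration)
f = lambda x: np.sin(x[0]) + x[1] ** 2 + 0.5 * x[2] * x[3]
grad_f = lambda x: np.array([np.cos(x[0]), 2 * x[1], 0.5 * x[3], 0.5 * x[2]])

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))  # orthonormal subspace basis
x = np.array([0.3, -0.2, 0.1, 0.4])

# Gradient error in (2.6b) at s_hat = 0, for two trust-region radii
errs = []
for delta in (1e-1, 1e-2):
    _, g_hat = reduced_linear_model(f, x, Q, delta)
    errs.append(np.linalg.norm(Q.T @ grad_f(x) - g_hat))
# errs[1] is roughly 10x smaller than errs[0], consistent with O(delta)
```

Note the comparison is against $Q^T \nabla f(x)$, the subspace gradient, as in (2.6b); the full-space gradient is never needed.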
Assumption 2.2 (Smoothness). The objective function $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below by $f_{low}$ and continuously differentiable, and $\nabla f$ is $L_{\nabla f}$-Lipschitz continuous in the extended level set $\{y \in \mathbb{R}^n : \|y - x\| \leq \Delta_{\max} \text{ for some } x \text{ with } f(x) \leq f(x_0)\}$, for some constant $L_{\nabla f} > 0$.

We also need two standard assumptions for trust-region methods: model Hessians that are uniformly bounded above, and sufficiently accurate solutions to the trust-region subproblem (2.5).

Assumption 2.3 (Bounded model Hessians). We assume that $\|\hat{H}_k\| \leq \kappa_H$ for all $k$, for some $\kappa_H \geq 1$.

Assumption 2.4 (Cauchy decrease). Our method for solving the trust-region subproblem (2.5) gives a step $\hat{s}_k$ satisfying the sufficient decrease condition
$$\hat{m}_k(0) - \hat{m}_k(\hat{s}_k) \geq c_1 \|\hat{g}_k\| \min\left(\Delta_k, \frac{\|\hat{g}_k\|}{\max(\|\hat{H}_k\|, 1)}\right), \qquad (2.9)$$
for some $c_1 \in [1/2, 1)$ independent of $k$.

A useful consequence, needed for the analysis of our trust-region radius updating scheme, is the following.

Lemma 2.5 (Lemma 3.6, [18]). Suppose Assumption 2.4 holds. Then
$$\|\hat{s}_k\| \geq c_2 \min\left(\Delta_k, \frac{\|\hat{g}_k\|}{\max(\|\hat{H}_k\|, 1)}\right), \qquad (2.10)$$
where $c_2 := 2c_1 / (1 + \sqrt{1 + 2c_1})$.

Lemma 2.6.
Suppose Assumptions 2.3 and 2.4 hold, and we run RSDFO with $\beta_F \leq c_2$. If $\hat{m}_k$ is $Q_k$-fully linear in $B(x_k, \Delta_k)$ and $\Delta_k \leq c_3 \|\hat{g}_k\|$, where
$$c_3 := \min\left(\mu, \frac{1}{\kappa_H}, \frac{c_1 (1 - \eta_2)}{2 \kappa_{ef}}\right), \qquad (2.11)$$
then the criticality and safety steps are not called, and $\rho_k \geq \eta_2$.

Proof. Since $\hat{m}_k$ is $Q_k$-fully linear and $\Delta_k \leq \mu \|\hat{g}_k\|$, the criticality step is not called. From Lemma 2.5 and $\Delta_k \leq \|\hat{g}_k\| / \kappa_H$, we have $\|\hat{s}_k\| \geq c_2 \Delta_k \geq \beta_F \Delta_k$, and so the safety step is not called.

From Assumptions 2.3 and 2.4, we have
$$\hat{m}_k(0) - \hat{m}_k(\hat{s}_k) \geq c_1 \|\hat{g}_k\| \min\left(\Delta_k, \frac{\|\hat{g}_k\|}{\kappa_H}\right) = c_1 \|\hat{g}_k\| \Delta_k, \qquad (2.12)$$
since $\Delta_k \leq \|\hat{g}_k\| / \kappa_H$ by assumption. Next, since $\hat{m}_k$ is $Q_k$-fully linear, from (2.6a) we have
$$|f(x_k) - \hat{m}_k(0)| = |f(x_k + Q_k \cdot 0) - \hat{m}_k(0)| \leq \kappa_{ef} \Delta_k^2, \qquad (2.13)$$
$$|f(x_k + s_k) - \hat{m}_k(\hat{s}_k)| = |f(x_k + Q_k \hat{s}_k) - \hat{m}_k(\hat{s}_k)| \leq \kappa_{ef} \Delta_k^2. \qquad (2.14)$$
Hence we have
$$|\rho_k - 1| \leq \frac{|f(x_k) - \hat{m}_k(0)|}{|\hat{m}_k(0) - \hat{m}_k(\hat{s}_k)|} + \frac{|f(x_k + s_k) - \hat{m}_k(\hat{s}_k)|}{|\hat{m}_k(0) - \hat{m}_k(\hat{s}_k)|} \leq \frac{2 \kappa_{ef} \Delta_k^2}{c_1 \|\hat{g}_k\| \Delta_k} \leq 1 - \eta_2, \qquad (2.15)$$
since $\Delta_k \leq c_3 \|\hat{g}_k\| \leq c_1 (1 - \eta_2) \|\hat{g}_k\| / (2 \kappa_{ef})$. Thus $\rho_k \geq \eta_2$, which is the claim of the lemma.

Remark 2.7. The requirement $\beta_F \leq c_2$ in Lemma 2.6 is not restrictive. Since we have $c_1 \geq 1/2$ in Assumption 2.4, it suffices to choose $\beta_F \leq \sqrt{2} - 1$, for example.

Our key new assumption is on the quality of our subspace selection, as suggested in [15, 16]:

Definition 2.8.
The matrix $Q_k$ is well-aligned if
$$\|Q_k^T \nabla f(x_k)\| \geq \alpha_Q \|\nabla f(x_k)\|, \qquad (2.16)$$
for some $\alpha_Q \in (0, 1]$ independent of $k$.

Assumption 2.9 (Subspace quality). Our subspace selection (determined by $Q_k$) satisfies the following two properties:
(a) At each iteration $k$ of RSDFO in which CHECK_MODEL = FALSE, our subspace selection $Q_k$ is well-aligned for some fixed $\alpha_Q \in (0, 1]$ with probability at least $1 - \delta_S$, for some $\delta_S \in (0, 1)$, independently of $\{Q_0, \ldots, Q_{k-1}\}$.
(b) $\|Q_k\| \leq Q_{\max}$ for all $k$ and some $Q_{\max} > 0$.

Of these two properties, (a) is needed for our complexity analysis, while (b) is only needed in order to construct $Q_k$-fully linear models (in Section 3). We will discuss how to achieve Assumption 2.9 in more detail in Section 2.6.

Lemma 2.10.
In all iterations $k$ of RSDFO where the criticality step is not called, we have $\|\hat{g}_k\| \geq \min(\epsilon_C, \mu^{-1} \Delta_k)$. If the criticality step is not called in iteration $k$, $Q_k$ is well-aligned and $\|\nabla f(x_k)\| \geq \epsilon$, then
$$\|\hat{g}_k\| \geq \epsilon_g(\epsilon) := \min\left(\epsilon_C, \frac{\alpha_Q \epsilon}{\kappa_{eg} \mu + 1}\right) > 0. \qquad (2.17)$$
The first part follows immediately from the entry condition of the criticality step. To prove (2.17), suppose the criticality step is not called in iteration $k$ and $\|\hat{g}_k\| < \epsilon_C$. Then we have $\Delta_k \leq \mu \|\hat{g}_k\|$ and $\hat{m}_k$ is $Q_k$-fully linear, and so from (2.6b) we have
$$\|Q_k^T \nabla f(x_k)\| \leq \|Q_k^T \nabla f(x_k) - \hat{g}_k\| + \|\hat{g}_k\| \leq \kappa_{eg} \Delta_k + \|\hat{g}_k\| \leq (\kappa_{eg} \mu + 1) \|\hat{g}_k\|. \qquad (2.18)$$
Since $Q_k$ is well-aligned, we conclude from (2.16) and (2.18) that
$$\alpha_Q \|\nabla f(x_k)\| \leq \|Q_k^T \nabla f(x_k)\| \leq (\kappa_{eg} \mu + 1) \|\hat{g}_k\|, \qquad (2.19)$$
and we are done, since $\|\nabla f(x_k)\| \geq \epsilon$.

We now provide a series of results counting the number of iterations of RSDFO of different types, following the style of analysis from [20, 16]. First we introduce some notation to enumerate our iterations. Suppose we run RSDFO until the end of iteration $K$. We then define the following subsets of $\{0, \ldots, K\}$:
• $\mathcal{C}$ is the set of iterations in $\{0, \ldots, K\}$ where the criticality step is called.
• $\mathcal{F}$ is the set of iterations in $\{0, \ldots, K\}$ where the safety step is called (i.e. $\|\hat{s}_k\| < \beta_F \Delta_k$).
• $\mathcal{VS}$ is the set of very successful iterations in $\{0, \ldots, K\}$, where $\rho_k \geq \eta_2$.
• $\mathcal{S}$ is the set of successful iterations in $\{0, \ldots, K\}$, where $\rho_k \geq \eta_1$. Note that $\mathcal{VS} \subset \mathcal{S}$.
• $\mathcal{U}$ is the set of unsuccessful iterations in $\{0, \ldots, K\}$, where $\rho_k < \eta_1$.
• $\mathcal{A}$ is the set of well-aligned iterations in $\{0, \ldots, K\}$, where (2.16) holds.
• $\mathcal{A}^C$ is the set of poorly aligned iterations in $\{0, \ldots, K\}$, where (2.16) does not hold.
• $\mathcal{D}(\Delta)$ is the set of iterations in $\{0, \ldots, K\}$ where $\Delta_k \geq \Delta$, for some $\Delta > 0$.
• $\mathcal{D}^C(\Delta)$ is the set of iterations in $\{0, \ldots, K\}$ where $\Delta_k < \Delta$.
• $\mathcal{L}$ is the set of iterations in $\{0, \ldots$
, K } where ˆ m k is Q k -fully linear in B ( x k , ∆ k ) .• L C is the set of iterations in { , . . . , K } where ˆ m k is not Q k -fully linear in B ( x k , ∆ k ) .9n particular, we have the partitions, for any ∆ > , { , . . . , K } = C ∪ F ∪ S ∪ U = A ∪ A C = D (∆) ∪ D C (∆) = L ∪ L C . (2.20)First, we bound the number of successful iterations with large ∆ k using standard argumentsfrom trust-region methods. Throughout, we use · ) to refer to the cardinality of a set ofiterations. Lemma 2.11.
Suppose Assumptions 2.2, 2.3 and 2.4 hold. If ‖∇f(x_k)‖ ≥ ε for all k = 0, ..., K, then
#(A ∩ D(Δ) ∩ S) ≤ φ(Δ, ε) := (f(x₀) − f_low) / (η₁ c₁ ε_g(ε) min(ε_g(ε)/κ_H, Δ)), (2.21)
for all Δ > 0.
Proof. Since ‖∇f(x_k)‖ ≥ ε, from Lemma 2.10 we have ‖ĝ_k‖ ≥ ε_g(ε) for all k ∈ A ∩ S (noting that k ∈ S implies k ∈ C^C). Then since ρ_k ≥ η₁, from Assumptions 2.3 and 2.4 we get
f(x_k) − f(x_{k+1}) ≥ η₁ [m̂_k(0) − m̂_k(ŝ_k)], (2.22)
≥ η₁ c₁ ‖ĝ_k‖ min(Δ_k, ‖ĝ_k‖/κ_H), (2.23)
≥ η₁ c₁ ε_g(ε) min(Δ, ε_g(ε)/κ_H), (2.24)
where the last line follows from ‖ĝ_k‖ ≥ ε_g(ε) and Δ_k ≥ Δ (from k ∈ D(Δ)). Since our step acceptance guarantees our algorithm is monotone (i.e. f(x_{k+1}) ≤ f(x_k) for all k), we get
f(x₀) − f_low ≥ Σ_{k ∈ A∩D(Δ)∩S} [f(x_k) − f(x_{k+1})] ≥ [η₁ c₁ ε_g(ε) min(Δ, ε_g(ε)/κ_H)] · #(A ∩ D(Δ) ∩ S), (2.25)
from which the result follows. Lemma 2.12.
Suppose Assumptions 2.2, 2.3 and 2.4 hold, and β_F ≤ c₂. If ‖∇f(x_k)‖ ≥ ε for all k = 0, ..., K, then
#(A ∩ D^C(Δ) ∩ L \ VS) = 0, (2.26)
for all
Δ ≤ Δ*(ε) := min(c₂ ε_g(ε), α_Q ε/(κ_eg + μ⁻¹)). (2.27)
Proof. To find a contradiction, first suppose k ∈ A ∩ D^C(Δ) ∩ L ∩ C^C \ VS. Then since k ∈ A ∩ C^C and ‖∇f(x_k)‖ ≥ ε by assumption, we have ‖ĝ_k‖ ≥ ε_g(ε) from Lemma 2.10. Since k ∈ D^C(Δ), we have Δ_k < Δ ≤ Δ*(ε) ≤ c₂ ε_g(ε) ≤ c₂ ‖ĝ_k‖ by definition of Δ*(ε). From this and k ∈ L, the assumptions of Lemma 2.6 are met, so k ∉ F and ρ_k ≥ η₂; that is, k ∈ VS, a contradiction. Hence we have #(A ∩ D^C(Δ) ∩ L ∩ C^C \ VS) = 0.
Next, we suppose k ∈ A ∩ D^C(Δ) ∩ L ∩ C and again look for a contradiction. In this case, we have Δ_k < Δ ≤ Δ*(ε) ≤ α_Q ε/(κ_eg + μ⁻¹), and so from k ∈ A ∩ L and ‖∇f(x_k)‖ ≥ ε we have
‖ĝ_k‖ ≥ ‖Q_kᵀ∇f(x_k)‖ − ‖Q_kᵀ∇f(x_k) − ĝ_k‖ ≥ α_Q ε − κ_eg Δ_k > μ⁻¹ Δ_k. (2.28)
This means we have ‖ĝ_k‖ > μ⁻¹ Δ_k and k ∈ L, so the criticality step is not entered; i.e. k ∈ C^C, a contradiction. Hence we have #(A ∩ D^C(Δ) ∩ L ∩ C) = 0 and we are done. Lemma 2.13.
Suppose Assumptions 2.2, 2.3 and 2.4 hold. Then we have
#(D(max(γ_C, γ_F, γ_dec)⁻¹ Δ) \ S) ≤ C₁ · #(D(γ_inc⁻¹ Δ) ∩ S) + C₂, (2.29)
for all Δ ≤ Δ₀, where
C₁ := log(γ_inc) / log(1/max(γ_C, γ_F, γ_dec)) and C₂ := log(Δ₀/Δ) / log(1/max(γ_C, γ_F, γ_dec)). (2.30)
Proof. If k ∈ S, we always have Δ_{k+1} ≤ γ_inc Δ_k. On the other hand, if k ∈ U, we have
Δ_{k+1} = min(γ_dec Δ_k, ‖ŝ_k‖) ≤ γ_dec Δ_k. (2.31)
Hence,
Δ_{k+1} ≤ max(γ_C, γ_F, γ_dec) Δ_k, for all k ∈ C ∪ F ∪ U. (2.32)
We now consider the value of log(Δ_k) for k = 0, ..., K, so at each iteration we have an additive change:
• Since Δ ≤ Δ₀, the threshold value log(Δ) is log(Δ₀/Δ) below the starting value log(Δ₀).
• If k ∈ S, then log(Δ_k) increases by at most log(γ_inc). In particular, Δ_{k+1} ≥ Δ is only possible if Δ_k ≥ γ_inc⁻¹ Δ.
• If k ∉ S, i.e. k ∈ C ∪ F ∪ U, then log(Δ_k) decreases by at least |log(max(γ_C, γ_F, γ_dec))| = log(1/max(γ_C, γ_F, γ_dec)).
Now, any decrease in Δ_k coming from k ∈ D(max(γ_C, γ_F, γ_dec)⁻¹ Δ) \ S yields Δ_{k+1} ≥ Δ. Hence the total decrease in log(Δ_k) from such iterations must be fully matched by the initial gap log(Δ₀/Δ) plus the maximum possible amount that log(Δ_k) can be increased above log(Δ). That is, we must have
log(1/max(γ_C, γ_F, γ_dec)) · #(D(max(γ_C, γ_F, γ_dec)⁻¹ Δ) \ S) ≤ log(Δ₀/Δ) + log(γ_inc) · #(D(γ_inc⁻¹ Δ) ∩ S), (2.33)
which gives us (2.29). Lemma 2.14.
Suppose Assumptions 2.2, 2.3 and 2.4 hold. Then
#(D^C(γ_inc⁻¹ Δ) ∩ VS) ≤ C₃ · #(D^C(min(γ_C, γ_F, γ_dec, β_F)⁻¹ Δ) \ VS), (2.34)
for all Δ ≤ min(Δ₀, γ_inc⁻¹ Δ_max), where
C₃ := log(1/min(γ_C, γ_F, γ_dec, β_F)) / log(γ_inc). (2.35)
Proof. We follow a similar reasoning to the proof of Lemma 2.13. For every iteration k ∈ VS ∩ D^C(Δ), we increase Δ_k by a factor of at least γ_inc, since Δ_k < Δ ≤ γ_inc⁻¹ Δ_max. Equivalently, we increase log(Δ_k) by at least log(γ_inc). In particular, if Δ_k < γ_inc⁻¹ Δ, then Δ_{k+1} < Δ. Alternatively, if k ∈ S \ VS, we set
Δ_{k+1} = max(γ_dec Δ_k, ‖ŝ_k‖) ≥ γ_dec Δ_k. (2.36)
If k ∈ U we set
Δ_{k+1} = min(γ_dec Δ_k, ‖ŝ_k‖) ≥ min(γ_dec, β_F) Δ_k, (2.37)
since ‖ŝ_k‖ ≥ β_F Δ_k from k ∉ F. Hence, for every iteration k ∉ VS, we decrease Δ_k by a factor of at most min(γ_C, γ_F, γ_dec, β_F), or equivalently we decrease log(Δ_k) by at most the amount |log(min(γ_C, γ_F, γ_dec, β_F))| = log(1/min(γ_C, γ_F, γ_dec, β_F)). Then, to have Δ_{k+1} < Δ we require Δ_k < min(γ_C, γ_F, γ_dec, β_F)⁻¹ Δ.
Therefore, since Δ₀ ≥ Δ, the total increase in log(Δ_k) from k ∈ VS ∩ D^C(γ_inc⁻¹ Δ) must be fully matched by the total decrease in log(Δ_k) from k ∈ D^C(min(γ_C, γ_F, γ_dec, β_F)⁻¹ Δ) \ VS. That is,
log(γ_inc) · #(VS ∩ D^C(γ_inc⁻¹ Δ)) ≤ log(1/min(γ_C, γ_F, γ_dec, β_F)) · #(D^C(min(γ_C, γ_F, γ_dec, β_F)⁻¹ Δ) \ VS), (2.38)
and we are done.

Lemma 2.15. Suppose Assumptions 2.2, 2.3 and 2.4 hold. Then
#(A ∩ D^C(Δ) ∩ L^C \ VS) ≤ #(A ∩ D^C(Δ) ∩ L) + 1, (2.39)
for all Δ > 0.
Proof. After every iteration k where m̂_k is not Q_k-fully linear and either the criticality step is called or ρ_k < η₂, we always set Δ_{k+1} ≤ Δ_k and CHECK_MODEL = TRUE. This means that Q_{k+1} = Q_k, so Q_{k+1} is well-aligned if and only if Q_k is well-aligned. Hence if k ∈ A ∩ D^C(Δ) ∩ L^C \ VS then either k = K or k + 1 ∈ A ∩ D^C(Δ) ∩ L, and we are done.
We are now in a position to bound the total number of well-aligned iterations. Lemma 2.16.
Suppose Assumptions 2.2, 2.3 and 2.4 hold, and both β_F ≤ c₂ and γ_inc > min(γ_C, γ_F, γ_dec, β_F)⁻² hold. Then if ‖∇f(x_k)‖ ≥ ε for all k = 0, ..., K, we have
#(A) ≤ ψ(ε) + (C₄/(1 + C₄))(K + 1), (2.40)
where
ψ(ε) := (1/(1 + C₄)) [ (C₁ + 2)φ(Δ_min(ε), ε) + (4/(1 − 2C₃)) φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F) Δ_min(ε), ε) (2.41)
+ C₂ + 2/(1 − 2C₃) + 1 ], (2.42)
Δ_min(ε) := min(γ_inc⁻¹ Δ₀, min(γ_C, γ_F, γ_dec) γ_inc⁻¹ Δ*(ε)), (2.43)
C₄ := max(C₁, 4C₃/(1 − 2C₃)) > 0. (2.44)
In these expressions, the values C₁ and C₂ are defined in Lemma 2.13, C₃ is defined in Lemma 2.14, φ(·, ε) is defined in Lemma 2.11, and ε_g(ε) and Δ*(ε) are defined in Lemmas 2.10 and 2.12 respectively.
Proof. For ease of notation, we will write Δ_min in place of Δ_min(ε). We begin by noting that γ_inc > min(γ_C, γ_F, γ_dec, β_F)⁻² implies C₃ ∈ (0, 1/2), which we will use later. Next, we have
#(A ∩ D(Δ_min)) = #(A ∩ D(Δ_min) ∩ S) + #(A ∩ D(Δ_min) \ S), (2.45)
≤ φ(Δ_min, ε) + #(A ∩ D(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) \ S) + #(A ∩ D(Δ_min) ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) \ S), (2.46)
≤ φ(Δ_min, ε) + C₁ #(D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) \ S), (2.47)
= φ(Δ_min, ε) + C₁ #(D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L \ S) + #(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L^C \ S), (2.48)
≤ φ(Δ_min, ε) + C₁ #(D(Δ_min) ∩ S) + C₂ + 2#(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L \ S) + #(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L ∩ S) + 1, (2.49)
where the first inequality follows from Lemma 2.11, the second inequality follows from Lemma 2.13 and Δ_min ≤ γ_inc⁻¹ Δ₀, and the last line follows from Lemma 2.15 and VS ⊂ S. Now we use Lemma 2.12 with Δ_min ≤ max(γ_C, γ_F, γ_dec) γ_inc⁻¹ Δ*(ε) to get
#(A ∩ D(Δ_min)) ≤ φ(Δ_min, ε) + C₁ #(D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L ∩ S) + 1, (2.50)
= φ(Δ_min, ε) + C₁ #(A ∩ D(Δ_min) ∩ S) + C₁ #(A^C ∩ D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(Δ_min) ∩ L ∩ S) + #(A ∩ D(Δ_min) ∩ D^C(max(γ_C, γ_F, γ_dec)⁻¹ γ_inc Δ_min) ∩ L ∩ S) + 1, (2.51)
≤ φ(Δ_min, ε) + C₁ #(A ∩ D(Δ_min) ∩ S) + C₁ #(A^C ∩ D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(Δ_min)) + #(A ∩ D(Δ_min) ∩ S) + 1, (2.52)
≤ (C₁ + 2)φ(Δ_min, ε) + C₁ #(A^C ∩ D(Δ_min) ∩ S) + C₂ + #(A ∩ D^C(Δ_min)) + 1, (2.53)
where the last line follows from Lemma 2.11.
Separately, we use Lemma 2.15, and apply Lemma 2.12 with Δ_min ≤ Δ*(ε), to get
#(A ∩ D^C(Δ_min)) = #(A ∩ D^C(Δ_min) ∩ VS) + #(A ∩ D^C(Δ_min) ∩ L \ VS) + #(A ∩ D^C(Δ_min) ∩ L^C \ VS), (2.54)
≤ #(A ∩ D^C(Δ_min) ∩ VS) + #(A ∩ D^C(Δ_min) ∩ L \ VS) + #(A ∩ D^C(Δ_min) ∩ L) + 1, (2.55)
= #(A ∩ D^C(Δ_min) ∩ VS) + 2#(A ∩ D^C(Δ_min) ∩ L \ VS) + #(A ∩ D^C(Δ_min) ∩ L ∩ VS) + 1, (2.56)
= #(A ∩ D^C(Δ_min) ∩ VS) + #(A ∩ D^C(Δ_min) ∩ L ∩ VS) + 1, (2.57)
≤ 2#(A ∩ D^C(Δ_min) ∩ VS) + 1. (2.58)
We then get
#(A ∩ D^C(Δ_min)) ≤ 2#(A ∩ D^C(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min) ∩ VS) + 2#(A ∩ D(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min) ∩ D^C(Δ_min) ∩ VS) + 1, (2.59)
≤ 2#(D^C(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min) ∩ VS) + 2#(A ∩ D(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min) ∩ VS) + 1, (2.60)
≤ 2C₃ #(D^C(Δ_min) \ VS) + 2φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + 1, (2.61)
= 2C₃ #(A ∩ D^C(Δ_min) \ VS) + 2C₃ #(A^C ∩ D^C(Δ_min) \ VS) + 2φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + 1, (2.62)
≤ 2C₃ #(A ∩ D^C(Δ_min)) + 2C₃ #(A^C ∩ D^C(Δ_min) \ VS) + 2φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + 1, (2.63)
where the third inequality follows from Lemma 2.11 and Lemma 2.14 with
min(γ_C, γ_F, γ_dec, β_F)Δ_min ≤ Δ_min ≤ γ_inc⁻¹ Δ₀ ≤ min(Δ₀, γ_inc⁻¹ Δ_max). (2.64)
Since C₃ ∈ (0, 1/2), we can rearrange (2.63) to conclude that
#(A ∩ D^C(Δ_min)) ≤ (1/(1 − 2C₃)) [2C₃ #(A^C ∩ D^C(Δ_min) \ VS) + 2φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + 1]. (2.65)
Now, we combine (2.53) and (2.65) to get
#(A) = #(A ∩ D(Δ_min)) + #(A ∩ D^C(Δ_min)), (2.66)
≤ (C₁ + 2)φ(Δ_min, ε) + C₁ #(A^C ∩ D(Δ_min) ∩ S) + C₂ + 2#(A ∩ D^C(Δ_min)) + 1, (2.67)
≤ (C₁ + 2)φ(Δ_min, ε) + (4/(1 − 2C₃))φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + C₁ #(A^C ∩ D(Δ_min) ∩ S) + C₂ + 2/(1 − 2C₃) + 1 + (4C₃/(1 − 2C₃)) #(A^C ∩ D^C(Δ_min) \ VS), (2.68)
≤ (C₁ + 2)φ(Δ_min, ε) + (4/(1 − 2C₃))φ(γ_inc⁻¹ min(γ_C, γ_F, γ_dec, β_F)Δ_min, ε) + C₂ + 2/(1 − 2C₃) + 1 + max(C₁, 4C₃/(1 − 2C₃)) [#(A^C ∩ D(Δ_min) ∩ S) + #(A^C ∩ D^C(Δ_min) \ VS)]. (2.69)
Since A^C ∩ D(Δ_min) ∩ S and A^C ∩ D^C(Δ_min) \ VS are disjoint subsets of A^C, we have
#(A^C ∩ D(Δ_min) ∩ S) + #(A^C ∩ D^C(Δ_min) \ VS) ≤ #(A^C) = (K + 1) − #(A). (2.70)
Substituting this into (2.69) and rearranging, we get the desired result. That C₄ > 0 follows from C₁ > 0 and C₃ ∈ (0, 1/2). The key remaining step is to compare #(A) with K.
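The radius-counting arguments of Lemmas 2.13 and 2.14 are bookkeeping on log(Δ_k): every decrease past a threshold Δ must be paid for by the initial gap log(Δ₀/Δ) or by earlier increases on successful iterations. The sketch below is illustrative only and not part of the analysis: it checks a Lemma 2.13-style bound numerically under the simplified updates Δ_{k+1} = min(γ_inc Δ_k, Δ_max) on successful iterations and Δ_{k+1} = γ_dec Δ_k otherwise (so that max(γ_C, γ_F, γ_dec) reduces to γ_dec); the success/failure pattern is random and all parameter values are arbitrary.

```python
import math
import random

def count_iterations(K, gamma_inc, gamma_dec, Delta0, Delta_max, Delta, p_success, rng):
    """Run K+1 iterations of the simplified radius update and return the pair
    (#unsuccessful iterations with Delta_k >= Delta/gamma_dec,
     #successful iterations with Delta_k >= Delta/gamma_inc)."""
    Dk = Delta0
    n_unsucc_large = 0  # plays the role of #(D(gamma_dec^{-1} Delta) \ S)
    n_succ_large = 0    # plays the role of #(D(gamma_inc^{-1} Delta) ∩ S)
    for _ in range(K + 1):
        if rng.random() < p_success:  # successful: radius grows (capped at Delta_max)
            if Dk >= Delta / gamma_inc:
                n_succ_large += 1
            Dk = min(gamma_inc * Dk, Delta_max)
        else:                         # unsuccessful: radius shrinks
            if Dk >= Delta / gamma_dec:
                n_unsucc_large += 1
            Dk = gamma_dec * Dk
    return n_unsucc_large, n_succ_large

rng = random.Random(0)
gamma_inc, gamma_dec, Delta0, Delta_max = 2.0, 0.5, 1.0, 1e3
C1 = math.log(gamma_inc) / math.log(1.0 / gamma_dec)
for _ in range(200):
    Delta = Delta0 * 10 ** (-rng.uniform(0.0, 3.0))  # any threshold Delta <= Delta0
    C2 = math.log(Delta0 / Delta) / math.log(1.0 / gamma_dec)
    nu, ns = count_iterations(500, gamma_inc, gamma_dec, Delta0, Delta_max,
                              Delta, rng.uniform(0.1, 0.9), rng)
    assert nu <= C1 * ns + C2 + 1e-9, (nu, ns, C1, C2)
print("Lemma 2.13-style bound held in all random trials")
```

The same telescoping viewpoint, with the roles of increases and decreases exchanged, gives the Lemma 2.14 bound.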
Since each event "Q_k is well-aligned" is effectively an independent Bernoulli trial with success probability at least 1 − δ_S, we derive the below result based on a concentration bound for Bernoulli trials [23, Lemma 2.1]. Lemma 2.17.
Suppose Assumptions 2.2, 2.3, 2.4 and 2.9 hold. Then we have
P[#(A) + 1 ≤ (1 − δ_S)(1 − δ)(K + 1)] ≤ e^{−δ²(1−δ_S)K/4}, (2.71)
for all δ ∈ (0, 1).
Proof. The
CHECK_MODEL = FALSE case of this proof has a general framework based on [42, Lemma 4.5]—also followed in [16]—with a probabilistic argument from [23, Lemma 2.1].
First, we consider only the subsequence of iterations K' := {k₀, ..., k_J} ⊂ {0, ..., K} when Q_k is resampled (i.e. where CHECK_MODEL = FALSE, so Q_k ≠ Q_{k−1}). For convenience, we define A' := A ∩ K' and A'^C := A^C ∩ K'.
Let T_{k_j} be the indicator function for the event "Q_{k_j} is well-aligned", so that #(A') = Σ_{j=0}^{J} T_{k_j}. Since T_{k_j} ∈ {0, 1}, and denoting p_{k_j} := P[T_{k_j} = 1 | x_{k_j}], for any t > 0 we have
E[e^{−t(T_{k_j} − p_{k_j})} | x_{k_j}] = p_{k_j} e^{−t(1 − p_{k_j})} + (1 − p_{k_j}) e^{t p_{k_j}} = e^{t p_{k_j} + log(1 − p_{k_j} + p_{k_j} e^{−t})} ≤ e^{t² p_{k_j}/2}, (2.72)
where the inequality follows from the identity px + log(1 − p + p e^{−x}) ≤ p x²/2, for all p ∈ [0, 1] and x ≥ 0, shown in [23, Lemma 2.1].
Using the tower property of conditional expectations and the fact that, since k_j ∈ K', T_{k_j} only depends on x_{k_j} and not any previous iteration, we then get
E[e^{−t(#(A') − Σ_{j=0}^{J} p_{k_j})}] = E[e^{−t Σ_{j=0}^{J}(T_{k_j} − p_{k_j})}], (2.73)
= E[E[e^{−t Σ_{j=0}^{J}(T_{k_j} − p_{k_j})} | Q₀, ..., Q_{k_J−1}, x₀, ..., x_{k_J}]], (2.74)
= E[e^{−t Σ_{j=0}^{J−1}(T_{k_j} − p_{k_j})} E[e^{−t(T_{k_J} − p_{k_J})} | Q₀, ..., Q_{k_J−1}, x₀, ..., x_{k_J}]], (2.75)
= E[e^{−t Σ_{j=0}^{J−1}(T_{k_j} − p_{k_j})} E[e^{−t(T_{k_J} − p_{k_J})} | x_{k_J}]], (2.76)
≤ e^{t² p_{k_J}/2} E[e^{−t Σ_{j=0}^{J−1}(T_{k_j} − p_{k_j})}], (2.77)
≤ e^{t²(Σ_{j=0}^{J} p_{k_j})/2}, (2.78)
where the second-last line follows from (2.72) and the last line follows by induction. This means that
P[#(A') ≤ Σ_{j=0}^{J} p_{k_j} − λ] = P[e^{−t(#(A') − Σ_{j=0}^{J} p_{k_j})} ≥ e^{tλ}], (2.79)
≤ e^{−tλ} E[e^{−t(#(A') − Σ_{j=0}^{J} p_{k_j})}], (2.80)
≤ e^{t²(Σ_{j=0}^{J} p_{k_j})/2 − tλ}, (2.81)
where the inequalities follow from Markov's inequality and (2.78) respectively. Taking t = λ/Σ_{j=0}^{J} p_{k_j}, we get
P[#(A') ≤ Σ_{j=0}^{J} p_{k_j} − λ] ≤ e^{−λ²/(2 Σ_{j=0}^{J} p_{k_j})}. (2.82)
Finally, we take λ = δ Σ_{j=0}^{J} p_{k_j} for some δ ∈ (0, 1) and note that p_{k_j} ≥ 1 − δ_S (from Assumption 2.9), to conclude
P[#(A') ≤ (1 − δ)(1 − δ_S)(J + 1)] ≤ P[#(A') ≤ (1 − δ) Σ_{j=0}^{J} p_{k_j}] ≤ e^{−δ²(Σ_{j=0}^{J} p_{k_j})/2}, (2.83)
or equivalently, using the partition K' = A' ∪ A'^C,
P[#(A') ≤ (1 − δ)(1 − δ_S)[#(A') + #(A'^C)]] ≤ e^{−δ²(1−δ_S)[#(A') + #(A'^C)]/2}. (2.84)
Now we must consider the iterations for which CHECK_MODEL = TRUE (so Q_k = Q_{k−1}), which we denote K'^C. The algorithm ensures that if k ∈ K'^C, then k + 1 ∈ K' (unless we are in the last iteration we consider, k = K). Further, the algorithm guarantees that if k ∈ K'^C, then k > 0 and k ∈ A if and only if k − 1 ∈ A. These are the key implications of RSDFO that we will now use.
Firstly, we have #(K'^C) ≤ #(K') + 1, and so
K + 1 = #(K') + #(K'^C) ≤ 2[#(A') + #(A'^C)] + 1, (2.85)
which means (2.84) becomes
P[#(A') ≤ (1 − δ)(1 − δ_S)[#(A') + #(A'^C)]] ≤ e^{−δ²(1−δ_S)K/4}. (2.86)
Setting α := δ + δ_S − δδ_S, we have (1 − δ)(1 − δ_S) = 1 − α, and so
P[#(A') ≤ ((1 − α)/α) #(A'^C)] = P[#(A') ≤ (1 − α)[#(A') + #(A'^C)]] ≤ e^{−δ²(1−δ_S)K/4}. (2.87)
Secondly, we have #(K'^C ∩ A^C) ≤ #(A'^C) + 1, and so #(A^C) ≤ 2#(A'^C) + 1. This and A' ⊂ A give
P[#(A) ≤ ((1 − α)/(2α))[#(A^C) − 1]] ≤ e^{−δ²(1−δ_S)K/4}. (2.88)
We then note that K + 1 = #(A) + #(A^C), and so
P[#(A) ≤ ((1 − α)/(2α))[K + 1 − #(A) − 1]] ≤ e^{−δ²(1−δ_S)K/4}, (2.89)
P[#(A) + (1 − α)/(1 + α) ≤ ((1 − α)/(1 + α))(K + 1)] ≤ e^{−δ²(1−δ_S)K/4}, (2.90)
P[#(A) + 1 ≤ (1 − α)(K + 1)] ≤ e^{−δ²(1−δ_S)K/4}, (2.91)
since α > 0.

Theorem 2.18. Suppose Assumptions 2.2, 2.3, 2.4 and 2.9 hold, and we have β_F ≤ c₂, δ_S < 1/(1 + C₄) for C₄ defined in Lemma 2.16, and γ_inc > min(γ_C, γ_F, γ_dec, β_F)⁻². Then for any ε > 0 and
k ≥ 2(ψ(ε) + 1)/(1 − δ_S − C₄/(1 + C₄)), (2.92)
we have
P[min_{j≤k} ‖∇f(x_j)‖ ≤ ε] ≥ 1 − exp(−k(1 − δ_S − C₄/(1 + C₄))²/(16(1 − δ_S))). (2.93)
Alternatively, if K_ε := min{k : ‖∇f(x_k)‖ ≤ ε} for any ε > 0, then
P[K_ε ≤ ⌈2(ψ(ε) + 1)/(1 − δ_S − C₄/(1 + C₄))⌉] ≥ 1 − exp(−(ψ(ε) + 1)(1 − δ_S − C₄/(1 + C₄))/(8(1 − δ_S))), (2.94)
where ψ(ε) is defined in Lemma 2.16.
Proof. First, fix some arbitrary k ≥ 0. Let ε_k := min_{j≤k} ‖∇f(x_j)‖ and A_k be the number of well-aligned iterations in {0, ..., k}. If ε_k > 0, from Lemma 2.16, we have
A_k ≤ ψ(ε_k) + (C₄/(1 + C₄))(k + 1). (2.95)
For any δ > 0 such that
δ < 1 − C₄/((1 + C₄)(1 − δ_S)), (2.96)
we have (1 − δ_S)(1 − δ) > C₄/(1 + C₄), and so we can compute
P[ψ(ε_k) ≤ [(1 − δ_S)(1 − δ) − C₄/(1 + C₄)](k + 1) − 1] ≤ P[A_k ≤ (1 − δ_S)(1 − δ)(k + 1)], (2.97)
≤ e^{−δ²(1−δ_S)k/4}, (2.98)
using Lemma 2.17. Defining
δ := ½[1 − C₄/((1 + C₄)(1 − δ_S))], (2.99)
we have
(1 − δ_S)(1 − δ) = ½[1 − δ_S + C₄/(1 + C₄)] > C₄/(1 + C₄), (2.100)
since 1 − δ_S > C₄/(1 + C₄) from our assumption on δ_S.
Hence we get
P[ψ(ε_k) ≤ ½(1 − δ_S − C₄/(1 + C₄))(k + 1) − 1] ≤ e^{−k[1 − δ_S − C₄/(1+C₄)]²/[16(1−δ_S)]}, (2.101)
and we note that this result still holds if ε_k = 0, as lim_{ε→0} ψ(ε) = ∞.
Now we fix ε > 0 and choose k satisfying (2.92). We use the fact that ψ(·) is non-increasing to get
P[ε_k ≥ ε] ≤ P[ψ(ε_k) ≤ ψ(ε)], (2.102)
≤ P[ψ(ε_k) ≤ ½(1 − δ_S − C₄/(1 + C₄)) k − 1], (2.103)
≤ P[ψ(ε_k) ≤ ½(1 − δ_S − C₄/(1 + C₄))(k + 1) − 1], (2.104)
and (2.93) follows. Lastly, we fix
k = ⌈2(ψ(ε) + 1)/(1 − δ_S − C₄/(1 + C₄))⌉, (2.105)
and we use (2.93) and the definition of K_ε to get
P[K_ε ≥ k] = P[ε_k ≥ ε], (2.106)
≤ e^{−k[1 − δ_S − C₄/(1+C₄)]²/[16(1−δ_S)]}, (2.107)
≤ exp(−(ψ(ε) + 1)[1 − δ_S − C₄/(1 + C₄)]/[8(1 − δ_S)]), (2.108)
and we get (2.94). Corollary 2.19.
Suppose the assumptions of Theorem 2.18 hold. Then for k ≥ k₀ for some k₀, we have
P[min_{j≤k} ‖∇f(x_j)‖ ≤ C/√k] ≥ 1 − e^{−ck}, (2.109)
for some constants c, C > 0. Alternatively, for ε ∈ (0, ε₀) for some ε₀, we have
P[K_ε ≤ C̃ ε⁻²] ≥ 1 − e^{−c̃ ε⁻²}, (2.110)
for constants c̃, C̃ > 0.
Proof. For ε sufficiently small, both ε_g(ε) and Δ_min(ε) are equal to a multiple of ε, and so ψ(ε) = α₁ ε⁻² + α₂ = Θ(ε⁻²), for some constants α₁, α₂ > 0. Therefore for k sufficiently large, the choice
ε = √(α₁ / (½(1 − δ_S − C₄/(1 + C₄)) k − 1 − α₂)) = Θ(k^{−1/2}), (2.111)
is sufficiently small that ψ(ε) = α₁ ε⁻² + α₂, and gives (2.92) with equality. The first result then follows from (2.93). The second result follows immediately from ψ(ε) = Θ(ε⁻²) and (2.94).

Remark 2.20. All the above analysis holds with minimal modifications if we replace the trust-region mechanisms in RSDFO with more standard trust-region updating mechanisms. This includes, for example, having no safety step (i.e. β_F = 0), and replacing (2.8) with
x_{k+1} = x_k + s_k if ρ_k ≥ η, and x_{k+1} = x_k if ρ_k < η; Δ_{k+1} = min(γ_inc Δ_k, Δ_max) if ρ_k ≥ η, and Δ_{k+1} = γ_dec Δ_k if ρ_k < η, (2.112)
for some η ∈ (0, 1). The corresponding requirement on the trust-region updating parameters to prove a version of Theorem 2.18 is simply γ_inc > γ_dec⁻² (provided we also set γ_C = γ_dec).

Our final complexity bounds for RSDFO in Corollary 2.19 are comparable to probabilistic direct search [42, Corollary 4.9].
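Underlying these rates is the Bernoulli concentration bound of Lemma 2.17: if each resampled subspace is well-aligned independently with probability at least 1 − δ_S, then the number of well-aligned draws among J + 1 falls below (1 − δ)(1 − δ_S)(J + 1) with probability at most e^{−δ²(1−δ_S)(J+1)/2}. A small Monte Carlo sanity check of this tail bound (illustrative only; the parameter values are arbitrary):

```python
import math
import random

def tail_probability(J, delta_S, delta, n_trials, rng):
    """Empirical estimate of P[ sum_j T_j <= (1-delta)(1-delta_S)(J+1) ] for
    independent T_j ~ Bernoulli(1 - delta_S), the worst case under Assumption 2.9."""
    threshold = (1.0 - delta) * (1.0 - delta_S) * (J + 1)
    hits = 0
    for _ in range(n_trials):
        total = sum(rng.random() < 1.0 - delta_S for _ in range(J + 1))
        if total <= threshold:
            hits += 1
    return hits / n_trials

rng = random.Random(1)
J, delta_S, delta = 200, 0.2, 0.1
empirical = tail_probability(J, delta_S, delta, n_trials=20000, rng=rng)
chernoff = math.exp(-delta**2 * (1.0 - delta_S) * (J + 1) / 2.0)
print(f"empirical tail = {empirical:.4f}, Chernoff bound = {chernoff:.4f}")
assert empirical <= chernoff
```

The bound is loose at moderate J, as Chernoff-type bounds typically are, but it decays geometrically in J, which is what produces the exp(−ck) failure probabilities in Corollary 2.19.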
These bounds also match, in the order of ε, the standard bounds for (full-space) model-based DFO methods for general objectives [71, 33] and for nonlinear least-squares problems [19]. Following [42], we may also derive complexity bounds on the expected first-order optimality measure (of O(k^{−1/2})) and the expected worst-case complexity (of O(ε⁻²) iterations) for RSDFO. Theorem 2.21.
Suppose the assumptions of Theorem 2.18 hold. Then for k ≥ k₀, the iterates of RSDFO satisfy
E[min_{j≤k} ‖∇f(x_j)‖] ≤ C k^{−1/2} + ‖∇f(x₀)‖ e^{−ck}, (2.113)
for c, C > 0 from (2.109), and for ε ∈ (0, ε₀) we have
E[K_ε] ≤ C̃₁ ε⁻² + 1/c̃₁, (2.114)
for constants c̃₁, C̃₁ > 0. Here, k₀ and ε₀ are the same as in Corollary 2.19.
Proof. First, for k ≥ k₀ define the random variable H_k as
H_k := C k^{−1/2} if min_{j≤k} ‖∇f(x_j)‖ ≤ C k^{−1/2}, and H_k := ‖∇f(x₀)‖ otherwise. (2.115)
Then since min_{j≤k} ‖∇f(x_j)‖ ≤ H_k, we get
E[min_{j≤k} ‖∇f(x_j)‖] ≤ E[H_k] ≤ C k^{−1/2} + ‖∇f(x₀)‖ P[min_{j≤k} ‖∇f(x_j)‖ > C k^{−1/2}], (2.116)
and we get the first result by applying Corollary 2.19.
Next, if ε ∈ (0, ε₀) then for any
k ≥ k₀(ε) := 2(ψ(ε) + 1)/(1 − δ_S − C₄/(1 + C₄)) = Θ(ε⁻²), (2.117)
from Theorem 2.18 we have
P[K_ε ≤ k] = P[min_{j≤k} ‖∇f(x_j)‖ ≤ ε] ≥ 1 − e^{−c̃₁ k}, (2.118)
where c̃₁ := (1 − δ_S − C₄/(1 + C₄))²/[16(1 − δ_S)]. We use the identity E[X] = ∫₀^∞ P[X > t] dt for non-negative random variables X (e.g. [69, eqn. (1.9)]) to get
E[K_ε] ≤ k₀(ε) + ∫_{k₀(ε)}^∞ P[K_ε > t] dt ≤ k₀(ε) + Σ_{k=k₀(ε)}^∞ e^{−c̃₁ k} = k₀(ε) + e^{−c̃₁ k₀(ε)}/(1 − e^{−c̃₁}), (2.119)
where C̃₁ comes from k₀(ε) = Θ(ε⁻²), which concludes our proof.
Furthermore, we also get almost-sure convergence of lim inf type, similar to [26, Theorem 10.12] in the deterministic case. Theorem 2.22.
Suppose the assumptions of Theorem 2.18 hold. Then the iterates of RSDFO satisfy inf_{k≥0} ‖∇f(x_k)‖ = 0 almost surely.
Proof. From Theorem 2.18, for any ε > 0 we have
lim_{k→∞} P[min_{j≤k} ‖∇f(x_j)‖ > ε] = 0. (2.120)
However, P[inf_{k≥0} ‖∇f(x_k)‖ > ε] ≤ P[min_{j≤k} ‖∇f(x_j)‖ > ε] for all k, and so
P[inf_{k≥0} ‖∇f(x_k)‖ > ε] = 0. (2.121)
The result follows from the union bound applied to any sequence ε_n = n⁻¹, for example. In particular, if ‖∇f(x_k)‖ > 0 for all k, then Theorem 2.22 implies lim inf_{k→∞} ‖∇f(x_k)‖ = 0 almost surely.

We now specify how to generate our subspaces Q_k to be probabilistically well-aligned and uniformly bounded (Assumption 2.9). These requirements are quite weak, and so there are several possible approaches for constructing Q_k. First, we discuss the case where Q_k is chosen to be a random matrix with orthonormal columns, and show that we need to choose our subspace dimension p ∼ √n. This is related to how we ultimately select Q_k in the practical implementation DFBGN (Section 4). However, we then show that by instead taking Q_k to be a Johnson-Lindenstrauss transform, we can choose a value of p independent of the ambient dimension n.

2.6.1 Random Orthogonal Basis

First, we consider the case where Q_k is a randomly generated matrix with orthonormal columns. We have the below result, a consequence of [54, Theorem 9]. Theorem 2.23.
Suppose the columns of Q_k ∈ R^{n×p} form an orthonormal basis for a randomly generated p-dimensional subspace of R^n. Then for any fixed vector v,
P[‖Q_kᵀ v‖² > (p/n − θ √(p/n)) ‖v‖²] ≥ 1 − 3e^{−pθ²/64}, (2.122)
for all θ ∈ (0, 1). For some δ_S ∈ (0, 1), if we set
θ = 8√(log(3/δ_S))/√p, (2.123)
then we get
P[‖Q_kᵀ ∇f(x_k)‖² > (p/n − θ √(p/n)) ‖∇f(x_k)‖²] > 1 − δ_S. (2.124)
Therefore Assumption 2.9 is achieved provided we set
p ≥ α_Q² n + √(64 log(3/δ_S) n). (2.125)
In particular, since our theory holds if we take α_Q > 0 arbitrarily small, our only limitation on the subspace dimension p is δ_S < 1/(1 + C₄) from Theorem 2.18, yielding the minimum requirement
p > √(64 log(3 + 3C₄) n), (2.126)
where C₄ depends only on our choice of trust-region algorithm parameters. The boundedness condition Assumption 2.9(b) holds with Q_max = 1 automatically. This gives us considerable scope to use very small subspace dimensions, which means our subspace approach has a strong chance of providing a substantial reduction in linear algebra costs.

We can improve on the requirement (2.126) on p by using Q_k with non-orthonormal columns. Specifically, we take Q_k to be a Johnson-Lindenstrauss transform (JLT) [74]. The application of these techniques to random subspace optimization algorithms follows [15, 16]. Definition 2.24.
A random matrix S ∈ R^{p×n} is an (ε, δ)-JLT if, for any point v ∈ R^n, we have
P[(1 − ε)‖v‖² ≤ ‖Sv‖² ≤ (1 + ε)‖v‖²] ≥ 1 − δ. (2.127)
Many different approaches for constructing (ε, δ)-JLT matrices have been proposed. Two common examples are:
• If S is a random Gaussian matrix with independent entries S_{i,j} ~ N(0, 1/p) and p = Ω(ε⁻²|log δ|), then S is an (ε, δ)-JLT (see [12, Theorem 2.13], for example).
• We say that S is an s-hashing matrix if it has exactly s nonzero entries per column (indices sampled independently), which take values ±1/√s selected independently with probability 1/2. If S is an s-hashing matrix with s = Θ(ε⁻¹|log δ|) and p = Ω(ε⁻²|log δ|), then S is an (ε, δ)-JLT [48].
By taking v = ∇f(x_k) in iteration k, and noting (1 − ε)² ≤ 1 − ε for all ε ∈ (0, 1), we have that Assumption 2.9(a) holds if we take Q_k = Sᵀ, where S is any (1 − α_Q², δ_S)-JLT. That is, Assumption 2.9(a) is satisfied using either of the constructions above and p = Ω((1 − α_Q²)⁻²|log δ_S|). Importantly, these constructions allow us to choose a subspace dimension p which has no dependence on the ambient dimension n (i.e. p = O(1) as n → ∞), an improvement on (2.126). We note that the requirement δ_S < 1/(1 + C₄) in Theorem 2.18 yields a very mild dependence of p on the choice of trust-region updating parameters.
We conclude by noting that the uniform boundedness property Assumption 2.9(b) is trivial if S is a hashing matrix. If S is Gaussian, by Bernstein's inequality we can choose Q_max large enough that P[‖S‖ > Q_max] is small. Then, we generate our final Q_k by sampling S independently until ‖S‖ ≤ Q_max holds (which almost surely takes finite time). A union bound argument then gives us Assumption 2.9(a) for this choice of Q_k, and so Assumption 2.9 is completely satisfied with the same (asymptotic) requirements on p.

We now describe how RSDFO (Algorithm 1) can be specialized to the unconstrained nonlinear least-squares problem
min_{x ∈ R^n} f(x) := ½‖r(x)‖² = ½ Σ_{i=1}^m r_i(x)², (3.1)
where r: R^n → R^m is given by r(x) := [r₁(x), ..., r_m(x)]ᵀ. We assume that r is differentiable, but that access to the Jacobian J: R^n → R^{m×n} is not possible. In addition, we typically assume that m ≥ n (regression), but everything here also applies to the case m < n (inverse problems). We now introduce the algorithm RSDFO-GN (Randomized Subspace DFO with Gauss-Newton), which is a randomized subspace version of a model-based DFO variant of the Gauss-Newton method [18].
Following the construction from [18], we assume that we have selected the p-dimensional search subspace 𝒴_k defined by Q_k ∈ R^{n×p} (as in RSDFO above). Then, we suppose that we have evaluated r at p + 1 points Y_k := {x_k, y₁, ..., y_p} ⊂ 𝒴_k (which typically are all close to x_k). Since y_t ∈ 𝒴_k for each t = 1, ..., p, from (2.3) we have y_t = x_k + Q_k ŝ_t for some ŝ_t ∈ R^p. Given this interpolation set, we first wish to construct a local subspace linear model for r:
r(x_k + Q_k ŝ) ≈ m̂_k(ŝ) = r(x_k) + Ĵ_k ŝ. (3.2)
To do this, we choose the approximate subspace Jacobian Ĵ_k ∈ R^{m×p} by requiring that m̂_k interpolate r at our interpolation points Y_k. That is, we impose
m̂_k(ŝ_t) = r(y_t), for all t = 1, ..., p, (3.3)
which yields the p × p linear system (with m right-hand sides)
Ŵ_k Ĵ_kᵀ := [ŝ₁, ..., ŝ_p]ᵀ Ĵ_kᵀ = [r(y₁) − r(x_k), ..., r(y_p) − r(x_k)]ᵀ. (3.4)
Our linear subspace model m̂_k (3.2) naturally yields a local subspace quadratic model for f, as in the classical Gauss-Newton method, namely (c.f.
(2.4)),
f(x_k + Q_k ŝ) ≈ m̂_k(ŝ) := ½‖m̂_k(ŝ)‖² = f(x_k) + ĝ_kᵀ ŝ + ½ ŝᵀ Ĥ_k ŝ, (3.5)
where ĝ_k := Ĵ_kᵀ r(x_k) and Ĥ_k := Ĵ_kᵀ Ĵ_k.

Q_k-Fully Linear Models

We now describe how we can achieve Q_k-fully linear models of the form (3.5) in RSDFO-GN. As in [18], we will need to define the Lagrange polynomials and Λ-poisedness of an interpolation set. Given that our interpolation set Y_k lies inside 𝒴_k, we consider the (low-dimensional) Lagrange polynomials associated with Y_k. These are the linear functions ℓ̂₀, ..., ℓ̂_p : R^p → R, defined by the interpolation conditions
ℓ̂_t(ŝ_{t'}) = δ_{t,t'}, for all t, t' = 0, ..., p, (3.6)
with the convention ŝ₀ = 0 corresponding to the interpolation point x_k. The Lagrange polynomials exist and are unique whenever Ŵ_k (3.4) is invertible, which we typically ensure through judicious updating of Y_k at each iteration. Definition 3.1.
For any Λ > 0, the set Y_k is Λ-poised in the p-dimensional ball B(x_k, Δ_k) ∩ 𝒴_k if
max_{t=0,...,p} max_{‖ŝ‖≤Δ_k} |ℓ̂_t(ŝ)| ≤ Λ. (3.7)
Note that since ℓ̂₀(0) = 1, for the set Y_k to be Λ-poised we require Λ ≥ 1. In general, a larger Λ indicates that Y_k has "worse" geometry, which leads to a less accurate approximation for f. This notion of Λ-poisedness (in a subspace) is sufficient to construct Q_k-fully linear models (3.5) for f. Lemma 3.2.
Suppose Assumption 2.9(b) holds, J ( x ) is Lipschitz continuous, and r and J areuniformly bounded above in ∪ k ≥ B ( x k , ∆ max ) . If Y k ⊂ B ( x k , ∆ k ) ∩ Y k and Y k is Λ -poised in B ( x k , ∆ k ) ∩ Y k , then ˆ m k (3.5) is a Q k -fully linear model for f , with κ ef , κ eg = O ( p Λ ) .Proof. Consider the low-dimensional functions ˆ r : R p → R m and ˆ f : R p → R given by ˆ r k (ˆ s ) := r ( x k + Q k ˆ s ) and ˆ f (ˆ s ) := (cid:107) ˆ r (ˆ s ) (cid:107) respectively. We note that ˆ r is continuously differentiablewith Jacobian ˆ J (ˆ s ) = J ( x k + Q k ˆ s ) Q k . Then since (cid:107) Q k (cid:107) ≤ Q max from Assumption 2.9(b), itis straightforward to show that both ˆ r and ˆ J are uniformly bounded above and ˆ J is Lipschitzcontinuous (with a Lipschitz constant Q times larger than for J ( x ) ).We can then consider ˆ m k (3.2) and ˆ m k (3.5) to be interpolation models for ˆ r and ˆ f in thelow-dimensional ball B ( , ∆ k ) ⊂ R p . From [18, Lemma 3.3], we conclude that ˆ m k is a fully linearmodel for ˆ f with constants κ ef , κ eg = O ( p Λ ) . The Q k -fully linear property follows immediatelyfrom this, noting that ∇ ˆ f k (ˆ s ) = Q Tk ∇ f ( x k + Q k ˆ s ) .Given this result, the procedures in [26, Chapter 6] allow us to check and/or guarantee the Λ -poisedness of an interpolation set, and we have met all the requirements needed to fully specifyRSDFO-GN.Lastly, we note that underdetermined linear interpolation, where (3.4) is underdeterminedand solved in a minimal norm sense, has been recently shown to yield a property similar to Q k -full linearity [47, Theorem 3.6]. Complete RSDFO-GN Algorithm
A complete statement of RSDFO-GN is given in Algorithm 2. This exactly follows RSDFO (Algorithm 1), but where we ask that the interpolation set satisfies the conditions: Y_k ⊂ B(x_k, ∆_k) ∩ 𝒴_k and Y_k is Λ-poised in B(x_k, ∆_k) ∩ 𝒴_k. From Lemma 3.2, this is sufficient to guarantee Q_k-full linearity of m̂_k.

We are now in a position to specialize our complexity analysis for RSDFO to RSDFO-GN. For this, we need to impose a smoothness assumption on r.

Assumption 3.3.
The extended level set L := {y ∈ R^n : ‖y − x‖ ≤ ∆_max for some x with f(x) ≤ f(x_0)} is bounded, r is continuously differentiable, and the Jacobian J is Lipschitz continuous on L.

This smoothness requirement allows us to immediately apply the complexity analysis for RSDFO, yielding the following result.

Algorithm 2 RSDFO-GN (Randomized Subspace Derivative-Free Optimization with Gauss-Newton) for solving (3.1).
Input:
Starting point x_0 ∈ R^n, initial trust-region radius ∆_0 > 0, and subspace dimension p ∈ {1, . . . , n}.
Parameters: maximum trust-region radius ∆_max ≥ ∆_0, trust-region radius scalings 0 < γ_dec < 1 < γ_inc ≤ γ̄_inc, criticality constants ε_C, μ > 0 and trust-region scaling 0 < γ_C < 1, safety step threshold β_F > 0 and trust-region scaling 0 < γ_F < 1, acceptance thresholds 0 < η_1 ≤ η_2 < 1, and poisedness constant Λ > 0.
1: Set flag CHECK_MODEL = FALSE.
2: for k = 0, 1, 2, . . . do
3: if CHECK_MODEL = TRUE then
4: Set Q_k = Q_{k−1}.
5: Construct an interpolation set Y_k ⊂ B(x_k, ∆_k) ∩ 𝒴_k which is Λ-poised in B(x_k, ∆_k) ∩ 𝒴_k.
6: Build the reduced model m̂_k : R^p → R (3.5) by solving (3.4).
7: else
8: Define a subspace by randomly sampling Q_k ∈ R^{n×p}.
9: Construct a reduced model m̂_k : R^p → R (3.5) by solving (3.4), where the interpolation points Y_k ⊂ 𝒴_k need not be contained in B(x_k, ∆_k) or be Λ-poised in B(x_k, ∆_k) ∩ 𝒴_k.
10: end if
11: Follow lines 10 to 22 of RSDFO (Algorithm 1), but replace every instance of checking Q_k-full linearity of m̂_k in B(x_k, ∆_k) with checking that Y_k ⊂ B(x_k, ∆_k) ∩ 𝒴_k and Y_k is Λ-poised in B(x_k, ∆_k) ∩ 𝒴_k.
12: end for

Corollary 3.4.
Suppose Assumptions 3.3, 2.3, 2.4 and 2.9 hold, and we have β_F ≤ c, δ_S < 1/(1 + C) for C defined in Lemma 2.16, and γ_inc > min(γ_C, γ_F, γ_dec, β_F)^{−1}. Then for the iterates generated by RSDFO-GN and k sufficiently large,
P[ min_{j≤k} ‖∇f(x_j)‖ ≤ C/√k ] ≥ 1 − e^{−ck},   (3.8)
for some constants c, C > 0. Alternatively, for ε ∈ (0, ε_0) for some ε_0, we have
P[ K_ε ≤ C̃ ε^{−2} ] ≥ 1 − e^{−c̃ ε^{−2}},   (3.9)
for constants c̃, C̃ > 0.

Proof. Assumption 3.3 implies that both r and J are uniformly bounded above on L, which is sufficient for Lemma 3.2 to hold. Hence, whenever we check/ensure that Y_k ⊂ B(x_k, ∆_k) ∩ 𝒴_k and Y_k is Λ-poised in B(x_k, ∆_k) ∩ 𝒴_k, we are checking/guaranteeing that m̂_k is Q_k-fully linear in B(x_k, ∆_k). In addition, from [18, Lemma 3.2] and taking f_low = 0, we have that Assumption 2.2 is satisfied. Therefore the result follows directly from Corollary 2.19.

Bounds on Objective Evaluations
In each iteration of RSDFO, we require at most p + 1 objective evaluations: at most p to form the model m̂_k—and this holds regardless of whether we need m̂_k to be Q_k-fully linear or not—and one evaluation for x_k + s_k. Hence the above bounds on K_ε also hold for the number of objective evaluations required to first achieve ‖∇f(x_k)‖ ≤ ε, up to a constant factor of p + 1.

Interpolation Models for RSDFO
Using existing techniques for constructing Λ-poised interpolation sets for quadratic interpolation (as outlined in [26]), the specific model construction ideas presented here can also be applied to general objective problems, and thus provide a concrete implementation of RSDFO.

In RSDFO-GN, the interpolation linear system (3.4) is solved in two steps: factorize the interpolation matrix Ŵ_k, then back-solve for each right-hand side. Thus, the cost of the linear algebra is:
1. Model construction costs O(p³) to compute the factorization of Ŵ_k, and O(mp²) for the back-substitution solves with m right-hand sides; and
2. Lagrange polynomial construction costs O(p³) in total, due to one backsolve for each of the p + 1 polynomials (using the pre-existing factorization of Ŵ_k).
By updating the factorization or Ŵ_k^{−1} directly (e.g. via the Sherman-Morrison formula), we can replace the O(p³) factorization cost with an O(p²) updating cost (c.f. [62]). However, the dominant O(mp²) model construction cost remains, and in practice we have observed that the factorization needs to be recomputed from scratch to avoid the accumulation of rounding errors. In the case of a full-space method where p = n, such as in [18], these costs become O(n³) for the factorization (or O(n²) if Sherman-Morrison is used) plus O(mn²) for the back-solves. When n grows large, this linear algebra cost rapidly dominates the total runtime of these algorithms and limits the efficiency of full-space methods. This issue is discussed in more detail, with numerical results, in [65, Chapter 7.2].

In light of this discussion, we now turn our attention to building an implementation of RSDFO-GN that has both strong performance (in terms of objective evaluations) and low linear algebra cost.
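To illustrate the factorize-once, solve-many pattern above, the following numpy sketch (illustrative only: the matrix sizes, names and random data are ours, not taken from any solver) solves one linear interpolation system for m model right-hand sides and for the p + 1 Lagrange polynomials simultaneously, and evaluates the resulting Λ-poisedness constant using the closed-form maximum of an affine function over a ball:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, delta = 4, 10, 0.5

# Interpolation points: s_0 = 0 plus p random displacements inside B(0, delta).
S = delta * rng.uniform(-1.0, 1.0, size=(p, p)) / np.sqrt(p)

# Linear interpolation matrix W_hat: row j is (1, s_j^T).
W_hat = np.zeros((p + 1, p + 1))
W_hat[:, 0] = 1.0
W_hat[1:, 1:] = S

# One factorization serves every right-hand side: the m columns of residual
# values (model construction, O(m p^2) in back-solves) and the p + 1 unit
# vectors giving the Lagrange polynomial coefficients (O(p^3) in back-solves).
f_vals = rng.standard_normal((p + 1, m))            # stand-in residual values
sol = np.linalg.solve(W_hat, np.hstack([f_vals, np.eye(p + 1)]))
coeffs, L = sol[:, :m], sol[:, m:]                  # model coeffs; Lagrange coeffs

# For affine l_t(s) = c_t + g_t^T s, the maximum of |l_t| over ||s|| <= delta
# is |c_t| + delta * ||g_t||, giving the Lambda-poisedness constant of the set:
Lambda = np.max(np.abs(L[0]) + delta * np.linalg.norm(L[1:], axis=0))
```

Because the maximum of an affine Lagrange polynomial over a ball is available in closed form, geometry checks are cheap once the Lagrange coefficients are known.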
An important tenet of DFO is that objective evaluations are often expensive, and so algorithms should be efficient in reusing information, hence limiting the total objective evaluations required to achieve a given decrease. However, because we require our model to sit within our active space 𝒴_k, we do not have a natural process by which to reuse evaluations between iterations, when the space changes. We dedicate this section to outlining an implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss-Newton). DFBGN is designed to be efficient in its objective queries while still only building low-dimensional models, and hence is also efficient in terms of linear algebra. Specifically, we design DFBGN to achieve two aims:
• Low computational cost: we want our implementation to have a per-iteration linear algebra cost which is linear in the ambient dimension;
•
Efficient use of objective evaluations: our implementation should follow the principles of other DFO methods and make progress with few objective evaluations. In particular, we hope that, when run with 'full-space models' (i.e. p = n), our implementation should have (close to) state-of-the-art performance.
We will assess the second point in Section 5 by comparison with DFO-LS [14], an open-source model-based DFO Gauss-Newton solver which explores the full space (i.e. p = n).

Remark 4.1. As discussed in [14], DFO-LS has a mechanism to build a model with fewer than n + 1 interpolation points. However, in that context we modify the model so that it varies over the whole space R^n, which enables the interpolation set to grow to the usual n + 1 points and yield a full-dimensional model. There, the goal is to make progress with very few evaluations, but here our goal is scalability, so we keep our model low-dimensional throughout and instead change the subspace at each iteration.

Similar to Section 3, we assume that, at iteration k, our interpolation set has p + 1 points {x_k, y_1, . . . , y_p} ⊂ R^n with 1 ≤ p ≤ n. However, we assume that these points are already given, and use them to determine the space 𝒴_k (as defined by Q_k). That is, given
W_k := [ (y_1 − x_k)^T ; . . . ; (y_p − x_k)^T ] ∈ R^{p×n},   (4.1)
we compute the QR factorization
W_k^T = Q_k R_k,   (4.2)
where Q_k ∈ R^{n×p} has orthonormal columns and R_k ∈ R^{p×p} is upper triangular—and invertible provided W_k^T is full rank, which we guarantee by judicious replacement of interpolation points. This gives us the Q_k that defines 𝒴_k via (2.3)—in this case Q_k has orthonormal columns—and in this way all our interpolation points are in 𝒴_k. Since each y_t ∈ 𝒴_k, from (4.2) we have y_t = x_k + Q_k ŝ_t, where ŝ_t is the t-th column of R_k. Hence we have Ŵ_k = R_k^T in (3.4), and so m̂_k (3.2) is given by solving
R_k^T Ĵ_k^T = [ (r(y_1) − r(x_k))^T ; . . . ; (r(y_p) − r(x_k))^T ],   (4.3)
via forward substitution, since R_k^T is lower triangular. This ultimately gives us our local model m̂_k via (3.5). We reiterate that, compared to RSDFO-GN, we have used the interpolation set Y_k to determine both Q_k and m̂_k, rather than first sampling Q_k and then finding interpolation points Y_k ⊂ 𝒴_k with which to construct m̂_k. This difference is crucial in allowing the reuse of interpolation points between iterations, and hence lowering the objective evaluation requirements of model construction.

Remark 4.2. As discussed in [65, Chapter 7.3], we can equivalently recover this construction by asking for a full-space model m_k : R^n → R^m given by m_k(s) = r(x_k) + J_k s such that the interpolation conditions m_k(y_t − x_k) = r(y_t) are satisfied and J_k has minimal Frobenius norm.

A complete statement of DFBGN is given in Algorithm 3. Compared to RSDFO-GN, we include specific steps to manage the interpolation set, which in turn dictates the choice of subspace 𝒴_k. Specifically, one issue with our approach is that our new iterate x_k + s_k is in 𝒴_k, so if we were to simply add x_k + s_k into the interpolation set, 𝒴_k would not change across iterations, and we would never explore the whole space. On the other hand, unlike RSDFO and RSDFO-GN, we do not want to completely resample Q_k, as this would require too many objective evaluations. Instead, in DFBGN we delete a subset of points from the interpolation set and add new directions orthogonal to the existing directions, which ensures that Q_{k+1} ≠ Q_k in every iteration.

We also note that DFBGN does not include some important algorithmic features present in RSDFO-GN, DFO-LS or other model-based DFO methods, and hence is quite simple to state. These features are not necessary for a variety of reasons, which we now outline.
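Before doing so, we note that the construction (4.1)–(4.3) can be sketched in a few lines of numpy (a hedged illustration with hypothetical names, not the DFBGN source; the sanity check uses a linear residual, for which Ĵ_k = A Q_k exactly):

```python
import numpy as np

def subspace_gn_model(xk, Y, r):
    """Build Q_k and the reduced Jacobian J_hat from points Y = [y_1, ..., y_p]."""
    W = np.stack([y - xk for y in Y])           # (4.1): W_k is p x n
    Qk, Rk = np.linalg.qr(W.T)                  # (4.2): thin QR, Q_k is n x p
    rhs = np.stack([r(y) - r(xk) for y in Y])   # rows (r(y_t) - r(x_k))^T
    # (4.3): solve R_k^T J_hat^T = rhs; in practice by forward substitution,
    # since R_k^T is lower triangular.
    J_hat = np.linalg.solve(Rk.T, rhs).T
    return Qk, J_hat                            # J_hat is m x p

# Sanity check on a linear residual r(x) = A x, where J_hat should equal A Q_k.
rng = np.random.default_rng(1)
n, p, m = 6, 3, 4
A = rng.standard_normal((m, n))
xk = rng.standard_normal(n)
Y = [xk + rng.standard_normal(n) for _ in range(p)]
Qk, J_hat = subspace_gn_model(xk, Y, lambda x: A @ x)
assert np.allclose(J_hat, A @ Qk)
```

The check confirms that, for points determining the subspace via (4.2), the interpolated reduced Jacobian is exactly the projected Jacobian whenever the residual is linear.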
No Criticality and Safety Steps
Compared to RSDFO-GN, the implementation of DFBGN does not include criticality steps (which is also the case in DFO-LS) or safety steps. These steps ultimately function to ensure we do not have ‖ĝ_k‖ ≪ ∆_k. In DFBGN, we ensure ∆_k does not get too large compared to ‖s_k‖ through (2.8), while ‖s_k‖ is linked to ‖ĝ_k‖ through Lemma 2.5. If ‖s_k‖ is much smaller than ∆_k and our step produces a poor objective decrease, then we will set ∆_{k+1} ← ‖s_k‖ for the next iteration. Although Lemma 2.5 allows ‖s_k‖ to be large even if ‖ĝ_k‖ is small, in practice we do not observe ∆_k ≫ ‖ĝ_k‖ without DFBGN setting ∆_{k+1} ← ‖s_k‖ after a small number of iterations.

No Model-Improving Steps
An important feature of model-based DFO methods is the use of model-improving procedures, which change the interpolation set to ensure Λ-poisedness (Definition 3.1), or equivalently ensure that the local model for f is fully linear. In RSDFO-GN for instance, model improvement is performed when CHECK_MODEL = TRUE, whereas in [26, Algorithm 10.1] there are dedicated model-improvement phases. Instead, DFBGN ensures accurate interpolation models via a geometry-aware (i.e. Λ-poisedness-aware) process for deleting interpolation points at each iteration, where they are replaced by

By contrast, the optional growing mechanism in DFO-LS (Remark 4.1) is designed such that x_k + s_k is not in 𝒴_k, and so the search space is automatically expanded at every iteration. However, this requires an expensive SVD of J_k ∈ R^{m×n} at every iteration, and so is not suitable for our large-scale setting.

Algorithm 3 DFBGN: Derivative-Free Block Gauss-Newton for solving (3.1).
Input:
Starting point x_0 ∈ R^n, initial trust-region radius ∆_0 > 0, subspace dimension p ∈ {1, . . . , n}, and number of points to drop at each iteration p_drop ∈ {1, . . . , p}.
Parameters: maximum and minimum trust-region radii ∆_max ≥ ∆_0 > ∆_end > 0, trust-region radius scalings 0 < γ_dec < 1 < γ_inc ≤ γ̄_inc, and acceptance thresholds 0 < η_1 ≤ η_2 < 1.
1: Select random orthonormal directions d_1, . . . , d_p ∈ R^n using Algorithm 5, and build the initial interpolation set Y_0 := {x_0, x_0 + ∆_0 d_1, . . . , x_0 + ∆_0 d_p}.
2: for k = 0, 1, 2, . . . do
3: Given x_k and Y_k, solve (4.3) to build subspace models m̂_k : R^p → R^m (3.2) and m̂_k : R^p → R (3.5).
4: Approximately solve the subspace trust-region subproblem in R^p (2.5) and calculate the step s_k = Q_k ŝ_k ∈ R^n.
5: Evaluate r(x_k + s_k) and calculate the ratio ρ_k (2.7).
6: Accept/reject the step and update the trust-region radius: set x_{k+1} and ∆_{k+1} as per (2.8).
7: if ∆_{k+1} ≤ ∆_end, terminate.
8: if p < n then
9: Set Y^init_{k+1} = Y_k ∪ {x_k + s_k}.
10: Remove min(max(p_drop, 1), p) points from Y^init_{k+1} (without removing x_{k+1}) using Algorithm 4.
11: else
12: Set Y^init_{k+1} = Y_k ∪ {x_k + s_k} \ {y} for some y ∈ Y_k \ {x_{k+1}}.
13: Remove min(max(p_drop, 1), p) points from Y^init_{k+1} (without removing x_{k+1}) using Algorithm 4.
14: end if
15: Let q := p + 1 − |Y^init_{k+1}|, and generate random orthonormal vectors {d_1, . . . , d_q} that are also orthogonal to {y − x_{k+1} : y ∈ Y^init_{k+1} \ {x_{k+1}}}, using Algorithm 5.
16: Set Y_{k+1} = Y^init_{k+1} ∪ {x_{k+1} + ∆_{k+1} d_1, . . . , x_{k+1} + ∆_{k+1} d_q}.
17: end for

new points in directions (from x_{k+1}) which are orthogonal to Q_k and selected at random. The processes for deleting interpolation points—and choosing a suitable number of points to remove, p_drop—at each iteration are considered in Sections 4.3 and 4.4 respectively. The process for generating new interpolation points, Algorithm 5, is outlined in Section 4.3.

A downside of our approach is that the new orthogonal directions are not chosen by minimizing a model for the objective (i.e. not attempting to reduce the objective), as we have no information about how the objective varies outside 𝒴_k. This is the fundamental trade-off between a subspace approach and standard methods (such as DFO-LS): we can reduce the linear algebra cost, but must spend objective evaluations to change the search space between iterations.
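The subspace update described above—delete a direction, then append a random direction orthogonal to those retained—can be sketched as follows (illustrative numpy code under our own naming, not the DFBGN implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 4

# Current subspace basis Q_k with orthonormal columns.
Qk, _ = np.linalg.qr(rng.standard_normal((n, p)))

# Delete one direction (as if its interpolation point were removed)...
Q_keep = Qk[:, :-1]

# ...and replace it with a random direction orthogonal to the kept ones,
# so that the new subspace differs from col(Q_k) almost surely.
a = rng.standard_normal(n)
a -= Q_keep @ (Q_keep.T @ a)        # project out the kept directions
Q_next = np.column_stack([Q_keep, a / np.linalg.norm(a)])

assert np.allclose(Q_next.T @ Q_next, np.eye(p))    # still orthonormal
assert np.allclose(Q_keep.T @ Q_next[:, -1], 0.0)   # new direction is orthogonal
```

Since the appended direction is drawn from the orthogonal complement of the retained directions, the retained interpolation points remain usable while the search subspace rotates between iterations.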
As in Section 3.3, our approach in DFBGN yields substantial reductions in the required linear algebra costs compared to DFO-LS:
• Model construction costs O(np²) for the factorization (4.2) and O(mp²) for the back-substitution solves (4.3), rather than O(n³) and O(mn²) respectively for DFO-LS; and
• Lagrange polynomial construction costs O(p³) rather than O(n³).
As well as these reductions, we also get a smaller trust-region subproblem (2.5)—in R^p rather than R^n—and smaller memory requirements for storing the model Jacobian: we only store Ĵ_k and Q_k, requiring O((m + n)p) memory rather than O(mn) for storing the full m × n Jacobian. However, in (2.5), we do have the extra cost of projecting ŝ_k ∈ R^p into the full space R^n, which requires a multiplication by Q_k, costing O(np). In addition to the reduced linear algebra costs, the smaller interpolation set means we have a lower evaluation cost to construct the initial model of p + 1 evaluations (rather than n + 1).

No particular choice of p is needed for this method, and anything from p = 1 (i.e. coordinate search) to p = n (i.e. full-space search) is allowed. However, unsurprisingly, we shall see that larger values of p give better performance in terms of evaluations, except for the very low-budget phase, where smaller values of p benefit from a lower initialization cost. Hence, we expect that our approach with small p is useful when the O(mn² + n³) per-iteration linear algebra cost of DFO-LS is too great, and reducing the linear algebra cost is worth (possibly) needing more

Here, we use the existing factorization (4.2) and solve with p + 1 right-hand sides, as in Section 3.3.
Algorithm phase | DFO-LS | DFBGN | Comment
Form m̂_k (3.2) | O(n³ + mn²) | O(np² + mp²) | Factorization plus linear solves
Form m̂_k (3.5) | O(mn²) | O(mp²) | Form Ĵ_k^T Ĵ_k
Trust-region subproblem | O(n²)–O(n³) | O(p²)–O(p³) | Depending on the number of CG iterations*
s_k = Q_k ŝ_k | — | O(np) |
Form new step x_k + s_k | O(n) | O(n) |
Choose point to replace | O(n³) | — | Compute Lagrange polynomials
Model improvement | O(n³) | — | Recompute Lagrange polynomials
Choose points to remove | — | O(p³ + np) | See Algorithm 4
Generate new directions | — | O(np²) | See Algorithm 5
Total | O(mn² + n³) | O(mp² + np² + p³) |

Table 1. Comparison of per-iteration linear algebra costs of DFO-LS and DFBGN (Algorithm 3) with subspace dimension p ∈ {1, . . . , n}. *Note that the trust-region subproblem is solved using a truncated CG method [25, Chapter 7.5.1] originally from [63] in both DFO-LS and DFBGN.

objective evaluations to achieve a given accuracy. As a result, p should in general be set as large as possible, given the linear algebra costs the user is willing to bear. In Table 1, we compare the linear algebra costs of DFO-LS and DFBGN. The overall per-iteration cost of DFO-LS is O(mn² + n³) and the cost of DFBGN is O(mp² + np² + p³), depending on the choice of p ∈ {1, . . . , n}. The key benefit is that our dependency on the underlying problem dimension n decreases from cubic in DFO-LS to linear in DFBGN (provided p ≪ n). We also note that both methods have linear cost in the number of residuals m, but with a factor that is significantly smaller in DFBGN than in DFO-LS—O(p²) compared to O(n²).

Remark 4.3. In every iteration we must compute the QR factorization (4.2). However, we note, similar to [18, Section 4.2], that adding, removing and changing interpolation points all induce simple changes to Ŵ_k^T (adding or removing columns, and low-rank updates). This means that (4.2) can be computed with cost O(np) per iteration using the updating methods in [37, Section 12.5]. In our implementation, however, we do not do this, as we find that these updates introduce errors that accumulate at every iteration and reduce the accuracy of the resulting interpolation models. To maintain the numerical performance of our method, we need to recompute (4.2) from scratch regularly (e.g. every 10 iterations), and so would not see the O(np) per-iteration cost, on average.

Remark 4.4. The default parameter choices for DFBGN are the same as DFO-LS, namely: ∆_max = 10^{10}, ∆_end = 10^{-8}, γ_dec = 0.5, γ_inc = 2, γ̄_inc = 4, η_1 = 0.1, and η_2 = 0.7.
DFBGN also uses the same default choice ∆_0 = 0.1 max(‖x_0‖_∞, 1). The default choice of p_drop is discussed in Section 4.4.

Adaptive Choice of p  One approach that we have considered is to allow p to vary between iterations of DFBGN, rather than being constant throughout. Instead of adding p_drop new points at the end of each iteration (line 15), we implement a variable p by adding at least one new point to the interpolation set, continuing until some criterion is met. This criterion is designed to allow p to be small when such a p allows us to make reasonable progress, but to grow p up to p ≈ n when necessary. We have tested several possible criteria—comparing some combination of model gradient and Hessian, trust-region radius, trust-region step length, and predicted decrease from the trust-region step—and found the most effective to be comparing the model gradient and Hessian with the trust-region radius. Specifically, we continue adding new directions until (c.f. Lemma 2.10 and [18, Lemma 3.22])
‖g_k‖ / max(‖H_k‖, 1) ≥ α ∆_k,   (4.4)
for some α > 0 (we use α proportional to (n − p)/n for an interpolation set with p + 1 points). However, our numerical testing has shown that DFBGN with p fixed outperforms this approach for all budget

Leading to Ŵ_k^T ≠ Q_k R_k, Q_k not having orthonormal columns, or R_k not being upper triangular.

Algorithm 4 Mechanism for removing points from the interpolation set in DFBGN.
Input:
Interpolation set {x_{k+1}, y_1, . . . , y_p} with current iterate x_{k+1}, trust-region radius ∆_{k+1} > 0, and number of points to remove p_drop ∈ {1, . . . , p}.
1: Compute the (linear) Lagrange polynomials for {x_{k+1}, y_1, . . . , y_p} in the same way as (4.3).
2: For t = 1, . . . , p (i.e. all interpolation points except x_{k+1}), compute
θ_t := max_{x ∈ B(x_{k+1}, ∆_{k+1})} |ℓ_t(x)| · max( ‖y_t − x_{k+1}‖⁴ / ∆_{k+1}⁴, 1 ).   (4.5)
3: Remove the p_drop interpolation points with the largest values of θ_t.

and accuracy levels, on both medium- and large-scale problems, and so we do not consider it further here. We defer further study of this approach to future work, to see if alternative adaptive choices for p can be beneficial.

We now provide more details about how we manage the interpolation set in DFBGN. Specifically, we discuss how points are chosen for removal from Y_k, and how new interpolation points are calculated. In the description of DFBGN, there are no explicit mechanisms to ensure that the interpolation set is well-poised. Instead, DFBGN ensures that the interpolation set has good geometry through two mechanisms:
• We use a geometry-aware mechanism for removing points, based on [63, 18], which requires the computation of Lagrange polynomials.
This mechanism is given in Algorithm 4, and is called in lines 10 and 13 of DFBGN, as well as to select a point to replace in line 12; and
• Adding new directions that are orthogonal to the existing directions, and of length ∆_k, means that adding these new points never causes the interpolation set to have poor poisedness.
Together, these two mechanisms mean that any points causing poor poisedness are quickly removed, and replaced by high-quality interpolation points (orthogonal to existing directions, and within distance ∆_k of the current iterate). The linear algebra cost of Algorithm 4 is O(p³) to compute the p Lagrange polynomials, with cost O(p²) each (since we already have a factorization of Ŵ_k^T). Then for each t we must evaluate θ_t (4.5), with cost O(p) to maximize ℓ_t(x) (since ℓ_t is linear and varies only in directions in 𝒴_k), and O(n) to calculate ‖y_t − x_{k+1}‖. This gives a total cost of O(p³ + np).
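A minimal sketch of this ranking (the function and variable names are ours; lagrange_max[t] stands for the precomputed maximum of |ℓ_t| over the trust region, and the fourth-power distance weighting mirrors (4.5)):

```python
import numpy as np

def points_to_remove(lagrange_max, dists, delta, p_drop):
    """Rank points by theta_t = max|l_t| * max(||y_t - x_{k+1}||^4 / delta^4, 1)
    and return the indices of the p_drop points with the largest theta_t."""
    theta = lagrange_max * np.maximum((dists / delta) ** 4, 1.0)
    return np.argsort(theta)[::-1][:p_drop]

# A far-away point dominates the ranking even if its Lagrange maximum is modest:
idx = points_to_remove(np.array([1.0, 1.0, 2.0]), np.array([0.5, 3.0, 1.0]),
                       delta=1.0, p_drop=1)
```

The heavy distance weighting means stale points far from the current iterate are evicted first, while among nearby points the Lagrange polynomial maxima (i.e. poisedness) decide.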
Instead of Algorithm 4, we could have used a simpler mechanism for removing points, such as removing the points furthest from the current iterate (with total cost O(np)). However, this leads to a substantial performance penalty. In Figure 1, we compare these two approaches for selecting points to be removed, namely Algorithm 4 and distance to x_{k+1}, on the (CR) test set with p = n/2 and p = n (using the default value of p_drop, as detailed in Section 4.4). For more details on the numerical testing framework, see Section 5.1 below. We see that the geometry-aware criterion (4.5) gives substantially better performance than the cheaper criterion.

We could instead compute ‖y_t − x_{k+1}‖ by taking the norm of the t-th column of R_k, provided we have the factorization (4.2), for cost O(p) for each t. This does not affect the overall conclusion of Table 1.

As above, if we have (4.2), we could calculate all distances to x_{k+1} using columns of R_k, with total cost O(p²).

Figure 1. Performance profiles (in evaluations) for DFBGN with p = n/2 and p = n, comparing removing points with the geometry-aware Algorithm 4 and without Lagrange polynomials (by distance to the current iterate). We use accuracy level τ = 10^{-5}, and results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details on the testing framework.

Algorithm 5
Mechanism for generating new directions in DFBGN.
Input:
Orthonormal basis for the current subspace Q ∈ R^{n×p̃} (optional), and number of new directions q ≤ n − p̃.
1: Generate A ∈ R^{n×q} with i.i.d. standard normal entries.
2: If Q is specified, calculate Ã = A − QQ^T A; otherwise set Ã = A.
3: Perform the QR factorization Q̃R̃ = Ã and return d_1, . . . , d_q as the columns of Q̃.

We now detail how the new directions d_1, . . . , d_q are created in line 15 of DFBGN (Algorithm 3). The same approach is suitable for generating the initial directions d_1, . . . , d_p in line 1 of DFBGN, using Ã = A below (i.e. no Q required). Suppose our current subspace is defined by the orthonormal columns of Q ∈ R^{n×p̃}, and we wish to generate q new orthonormal vectors that are also orthogonal to the columns of Q (with p̃ + q ≤ n). When called in line 15 of DFBGN, we will have p̃ = p − p_drop and q = p_drop. We use the approach in Algorithm 5. From the QR factorization, the columns of Q̃ are orthonormal, and if Ã is full rank (which occurs with probability 1; see Lemma 4.5 below) then we also have col(Q̃) = col(Ã). So, to confirm the columns of Q̃ are orthogonal to Q, we only need to check that the columns of Ã are orthogonal to Q. Let ã_i be the i-th column of Ã and q_j be the j-th column of Q. Then, if a_i is the i-th column of A, we have
ã_i^T q_j = a_i^T (I − QQ^T) q_j = a_i^T (q_j − Q e_j) = 0,   (4.6)
as required. The cost of Algorithm 5 is O(nq) to generate A, O(np̃q) to form Ã and O(nq²) for the QR factorization. Since p̃, q ≤ p (p̃ is the number of directions remaining in the interpolation set and q is the number of new directions to be added), the whole process has cost at most O(np²). This bound is tight, up to constant factors, as we could take p̃ = q = p/2, for instance.

Lemma 4.5.
The matrix Ã has full column rank with probability 1.

Proof. Let a_i and ã_i be the i-th columns of A and Ã respectively. From [29, Proposition 7.1], A has full column rank with probability 1, and each a_i ∉ col(Q) with probability 1. Now suppose we have constants c_1, . . . , c_q so that Σ_{i=1}^q c_i ã_i = 0. Then since ã_i = a_i − QQ^T a_i, we have
Σ_{i=1}^q c_i a_i = Σ_{i=1}^q c_i QQ^T a_i.   (4.7)
The right-hand side is in col(Q), so since a_i ∉ col(Q), we must have Σ_{i=1}^q c_i a_i = 0. Thus c_1 = · · · = c_q = 0 since A has full column rank, and so Ã has full column rank.

Figure 2. Performance profiles (in evaluations) comparing different choices of p_drop, for DFBGN with p = n/2 and p = n, with accuracy level τ = 10^{-5}. The choice 'p_drop mixed' uses p_drop = 1 for successful iterations and p_drop = p/10 for unsuccessful iterations. Results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details on the testing framework.

Choice of p_drop  An important component of DFBGN that we have not yet specified is how many points to remove from the interpolation set at each iteration, p_drop ∈ {1, . . . , p}. On one hand, a large p_drop enables us to change the subspace by a large amount between iterations, ensuring we explore the whole of R^n quickly, rather than searching in unproductive subspaces for many iterations. However, a small p_drop means we require few objective evaluations per iteration, and so are more likely to use our evaluation budget efficiently.
We consider two choices of p_drop, aimed at each of these possible benefits: p_drop = p/10 to change subspaces quickly, and p_drop = 1 (the minimum possible value) to use few objective evaluations. Another approach that we consider is a compromise between these two choices. We note that having p_drop = 1 is useful for making progress with few evaluations, so we use this value while we are making progress—we consider this to occur when we have a successful iteration (i.e. ρ_k ≥ η_1). When we are not progressing (i.e. unsuccessful steps with ρ_k < η_1), we use the larger value p_drop = p/10.

We compare these three approaches on the (CR) problem collection with a budget of 100(n + 1) evaluations in Figure 2. Since these different choices of p_drop are similar when p is small, we show results for subspace dimensions p = n/2 and p = n. We first see that, even with p = n/2, the three approaches all perform similarly. However, for p = n the compromise choice p_drop ∈ {1, p/10} performs better than the two constant-p_drop approaches. In addition, p_drop = 1 outperforms p_drop = p/10 for small performance ratios, but is less robust and solves fewer problems overall. Given these results, in DFBGN we use the compromise choice as the default mechanism: p_drop = 1 on successful iterations and p_drop = p/10 on unsuccessful iterations.

Relationship to model-improvement phases
The
CHECK_MODEL flag in RSDFO-GN is important for ensuring we do not reduce ∆_k too quickly without first ensuring the quality of the interpolation model. For a similar purpose, DFO-LS incorporates a second trust-region radius which is also involved in ensuring ∆_k does not decrease too quickly [14]. In DFBGN, as

This is related to ensuring ∆_k does not get too small compared to ‖ĝ_k‖ via Lemma 2.6.
Figure 3. Comparison of ∆_k, ‖ĝ_k‖ (left y-axis) and normalized objective value (right y-axis) for DFBGN with p = n and p_drop = 1, on problems (a) drcavty1 and (b) luksan13 from the (CR) collection. See Section 5.1 for details on the testing framework.

Figure 4. Comparison of ∆_k, ‖ĝ_k‖ (left y-axis) and normalized objective value (right y-axis) for DFBGN with p = n and the default p_drop ∈ {1, p/10}, on problems (a) drcavty1 and (b) luksan13 from the (CR) collection. See Section 5.1 for details on the testing framework.

described in Section 4.3.1, we maintain the geometry of the interpolation set by replacing poorly-located points with orthogonal directions around the current iterate; in practice this ensures the quality of the interpolation set. However, the choice of p_drop has a large impact on how quickly ∆_k shrinks. In many cases, DFBGN may reach a point where its model is not accurate and we start to have unsuccessful iterations. To fix this (and continue making progress), we need to introduce several new interpolation points to produce a high-quality model. If p_drop is small, this may take many unsuccessful iterations, causing ∆_k to decrease quickly. The result of having p_drop small is seen in Figure 3.
Here, we show ∆_k, ‖ĝ_k‖ and f(x_k) for DFBGN with p = n and p_drop = 1 for two problems from the (CR) collection. Both problems show that ∆_k can quickly shrink to be much smaller than ‖ĝ_k‖ before reaching optimality. In the case of drcavty1, we see multiple instances where, after several unsuccessful iterations, we recover a high-quality model and continue making progress (causing ∆_k to increase again); this manifests itself as large oscillations in ∆_k with comparatively little change in ‖ĝ_k‖. Ultimately, as we terminate on ∆_k ≤ ∆_end, DFBGN quits without solving the problem to high accuracy. A more extreme version of this behaviour is seen for problem luksan13, where we terminate on small ∆_k in the first sequence of unsuccessful iterations—DFBGN does not allow enough time to recover a high-quality model and terminates having achieved only low accuracy.

This effect is mitigated by our default choice of p_drop ∈ {1, p/10}. By using a larger p_drop on unsuccessful iterations, when our model is performing poorly, our interpolation set is changed quickly. This results in DFBGN recovering a high-quality model after a smaller decrease in ∆_k. To demonstrate this, in Figure 4 we show the results of DFBGN with this p_drop for the same problems as in Figure 3 above. In both cases, we still see oscillations in ∆_k, but their magnitude is substantially reduced—it takes fewer iterations to get successful steps, and ∆_k stays well above ∆_end. This leads to both problems being solved to high accuracy. In Figure 4, we also see that, as we approach the solution, ‖ĝ_k‖ and ∆_k decrease at the same rate, as we would hope.
For drcavty1 after iteration 150, we also see the phenomenon described above, where ∆_k can become much larger than ‖ĝ_k‖ due to many successful iterations, before an unsuccessful iteration with ‖s_k‖ small means that ∆_k regularly returns to the level of ‖ĝ_k‖.

Alternative Mechanism for Recovering High-Quality Models
An alternative approach for avoiding unnecessary decreases in ∆_k while the interpolation model quality is improved is simply to decrease ∆_k more slowly on unsuccessful iterations. This corresponds to setting γ_dec closer to 1, which is the default choice of DFO-LS for noisy problems (see [14, Section 3.1]), and aligns with our theoretical requirements on the trust-region parameters (Theorem 2.18). In Figure 5, we compare the DFBGN defaults (the adaptive p_drop with the standard γ_dec) against p_drop = 1 with each of two values of γ_dec (the standard value and one closer to 1), on the (CR) problem collection. For small values of p (where the different choices of p_drop are essentially identical), the choice of γ_dec has almost no impact on the performance of DFBGN. For larger values of p, using the larger γ_dec with p_drop = 1 performs comparably well to the DFBGN default (the standard γ_dec with the adaptive p_drop). However, we opt to keep the standard γ_dec, to allow us to use the larger value for noisy problems (just as in DFO-LS), and to reduce the risk of overfitting our trust-region parameters to a particular problem collection.

In this section we compare the performance of DFBGN (Algorithm 3) to that of DFO-LS. We note that DFO-LS has been shown to have state-of-the-art performance compared to other solvers in [14]. As described in Section 4.2, the implementation of DFBGN is based on the decision to reduce the linear algebra cost of the algorithm at the expense of more objective evaluations per iteration. However, we still maintain the goal of DFBGN achieving (close to) state-of-the-art performance when it is run as a ‘full space’ method (i.e. p = n). Here, we will investigate this tradeoff in practice.

In our testing, we compare a Python implementation of DFBGN (Algorithm 3) against DFO-LS version 1.0.2 (also implemented in Python). The implementation of DFBGN is available on Github. We consider both the standard version of DFO-LS, and a variant with a reduced initialization cost of n/ evaluations (c.f. Remark 4.1).
This will allow us to compare both the overall performance of DFBGN and its performance with small budgets (since DFBGN also has a reduced initialization cost of p + 1 evaluations). We compare these against DFBGN with the choices p ∈ {n/ , n/ , n/ , n} and the adaptive choice of p_drop (Section 4.4). All default settings are used for both solvers, and since both are randomized (DFO-LS uses random initial directions only, and DFBGN is randomized through Algorithm 5), we run 10 instances of each problem under all solver configurations.

Test Problems
We will consider two collections of nonlinear least-squares test problems, both taken from the CUTEst collection [38]. The first, denoted (CR), is a collection of 60 medium-scale problems (with n ≤ m); full details of the (CR) collection may be found in [18, Table 3]. The second, denoted (CR-large), is a collection of 28 large-scale problems (with n ≤ m). This collection is a subset of problems from (CR), with their dimensions increased substantially. Full details of the (CR-large) collection are given in Appendix B. Note that the 12 hour runtime limit was only relevant for (CR-large) in all cases.

See https://github.com/numericalalgorithmsgroup/dfbgn. Results here use version 0.1.

Figure 5. Performance profiles (in evaluations) comparing different choices of p_drop and γ_dec for DFBGN, with panels (a)–(c) using p equal to a fraction of n and panel (d) using p = n, at a single accuracy level τ. The choice ‘p_drop mixed’ uses p_drop = 1 for successful iterations and a larger p_drop for unsuccessful iterations. Results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details on the testing framework.

Measuring Solver Performance
For every problem, we allow all solvers a budget of at most 100(n + 1) objective evaluations (i.e. evaluations of the full vector r(x)). This dimension-dependent choice may be understood as equivalent to 100 evaluations of r(x) and the Jacobian J(x) via finite differencing. However, given the importance of linear algebra cost to our comparisons, we also allow each solver a maximum runtime of 12 hours for each instance of each problem. For each solver S, each problem instance P, and accuracy level τ ∈ (0, 1), we calculate

N(S, P, τ) := number of evaluations of r(x) required to find a point x with f(x) ≤ f(x*) + τ(f(x_0) − f(x*)),   (5.1)

where f(x*) is an estimate of the minimum of f, as listed in [18, Table 3] for (CR) and Appendix B for (CR-large). If this objective decrease is not achieved by a solver before its budget or runtime limit is hit, we set N(S, P, τ) = ∞. We then compare solver performances on a problem collection P by plotting either data profiles [55]

d_{S,τ}(α) := (1/|P|) |{P ∈ P : N(S, P, τ) ≤ α(n_P + 1)}|,   (5.2)

Since all problems are implemented in Fortran via CUTEst, the cost of objective evaluations for this testing is minimal.

Figure 6. Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels.
Results are an average of 10 runs for each problem, with a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR).

where n_P is the dimension of problem instance P and α ≥ 0 is an evaluation budget (in “gradients”, i.e. multiples of n + 1), or performance profiles [28]

π_{S,τ}(α) := (1/|P|) |{P ∈ P : N(S, P, τ) ≤ α N_min(P, τ)}|,   (5.3)

where N_min(P, τ) is the minimum value of N(S, P, τ) over all solvers S, and α ≥ 1 is a performance ratio. In some instances, we will plot profiles based on runtime rather than objective evaluations. For this, we simply replace “number of evaluations of r(x)” with “runtime” in (5.1). When we plot the objective reduction achieved by a given solver, we normalize the objective value to lie in [0, 1] by plotting

(f(x) − f(x*)) / (f(x_0) − f(x*)),   (5.4)

which corresponds to the best τ achieved in (5.1) after a given number of evaluations (again measured in “gradients”) or runtime.

We begin our comparisons by considering the performance of DFO-LS and DFBGN when measured in terms of evaluations.

Figure 7. Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR-large).

Medium-Scale Problems (CR)
First, in Figure 6, we show the results for different accuracy levels for the (CR) problem collection (with n ≈ 100). For the lowest accuracy level, DFO-LS with reduced initialization cost is the best-performing solver, followed by DFBGN with p = n/ . These correspond to methods with lower initialization costs than DFO-LS and DFBGN with p = n, so this is likely a large driver behind their performance. DFBGN with full space size p = n performs similarly to DFO-LS, and DFBGN with the two smallest choices of p performs worst (as they are optimizing in a very small subspace at each iteration).

However, as we look at higher accuracy levels, we see that DFO-LS (with and without reduced initialization cost) performs best, and the DFBGN methods perform worse. The performance gap is more noticeable for small values of p. As expected, this means that DFBGN requires more evaluations to achieve these levels of accuracy, and benefits from being allowed to use a larger p. Notably, DFBGN with p = n has only a slight performance loss compared to DFO-LS, even though it uses a fixed fraction of p evaluations on unsuccessful iterations (rather than 1–2 for DFO-LS); this indicates that our choice of p_drop provides a suitable compromise between solver robustness and evaluation efficiency.

Large-Scale Problems (CR-large)
Next, in Figure 7, we show the same plots but for the (CR-large) problem collection, with n ≈ 1000. Compared to Figure 6, the situation is quite different.

Table 2. Proportion of problem instances from (CR-large) for which each solver terminated on the maximum 12 hour runtime.

    Solver               % timeout
    DFO-LS               92.5%
    DFO-LS (init n/ )    97.9%
    DFBGN (p = n/ )      34.6%
    DFBGN (p = n/ )      73.9%
    DFBGN (p = n/ )      81.8%
    DFBGN (p = n)        66.4%

At the lowest accuracy level, DFBGN with small subspaces (p = n/ and p = n/ ) gives the best-performing solvers, followed by the full-space solvers (DFO-LS and DFBGN with p = n). For higher accuracy levels, the performance of DFBGN with small p deteriorates compared with the full-space methods. DFBGN with p = n/ is the worst-performing DFBGN variant at low accuracy levels, and performs similarly to DFBGN with small p at high accuracy levels. DFO-LS with reduced initialization cost is the worst-performing solver for this dataset.

Unlike the medium-scale results above, we no longer have a clear trend in the performance of DFBGN as we vary p. Instead, we have a combination of two factors coming into play, which have opposite impacts on the performance of DFBGN as we vary p. On one hand, we have the number of evaluations required for DFBGN (with a given p) to reach the desired accuracy level. On the other hand, we have the number of iterations that DFBGN can perform before reaching the 12 hour runtime limit.

DFBGN with small p requires more evaluations to reach a given level of accuracy (as seen in the medium-scale results), but can perform many evaluations before timing out due to its low per-iteration linear algebra cost. This is reflected in it solving many problems to low accuracy, but few problems to high accuracy. By contrast, DFBGN with p = n can perform fewer iterations before timing out (and hence sees fewer evaluations), but requires many fewer evaluations to solve problems, particularly to high accuracy.
This manifests in its good performance at both low and high accuracy levels. The middle ground, DFBGN with p = n/ , has its performance negatively impacted by both issues: it requires more evaluations to solve problems than p = n (especially for high accuracy), but also has a relatively high per-iteration linear algebra cost and so times out, compared to small p.

Both variants of DFO-LS show worse performance here than for the medium-scale problems. This is because, as suggested by the analysis in Table 1, they are both affected by the runtime limit. DFO-LS with reduced initialization cost is particularly affected, because of the high cost of the SVD (of the full m × n Jacobian) at each iteration for these problems. We note that this cost is only noticeable for these large-scale problems, and this variant of DFO-LS is still useful for small- and medium-scale problems, as discussed in [14].

We can verify the impact of the timeout on DFO-LS and DFBGN by considering the proportion of problem instances from (CR-large) for which the solver terminated because of the timeout. These results are presented in Table 2. DFO-LS reaches the 12 hour maximum much more frequently than DFBGN: over 90%, rather than 35% for DFBGN with p = n/ or 66% for DFBGN with p = n (see Remark 5.1 below). For DFBGN with different values of p, we see the same behaviour as in Figure 7. That is, DFBGN with small p times out the least frequently, as its low per-iteration runtime means it performs enough iterations to terminate naturally. DFBGN with p = n times out more frequently (due to the higher per-iteration runtime), but not as often as with p = n/ , as its superior budget performance at high accuracy levels means it fully solves more problems, even with comparatively fewer iterations. We note that Table 2 does not measure what accuracy level was achieved before the timeout, which is better captured in the performance profiles of Figure 7.

Remark 5.1.
DFBGN with p = n has a similar per-iteration linear algebra cost to DFO-LS. Hence it can perform a similar number of iterations before reaching the runtime limit. However, DFBGN performs more objective evaluations per iteration, because of the choice of p_drop. Since DFBGN with p = n has similar performance to DFO-LS when measured on budget (as seen in Figure 6), this means that it has superior performance when measured by runtime. Additionally, if multiple objective evaluations can be run in parallel, then DFBGN would also be able to benefit from this, unlike DFO-LS.

Remark 5.2. For completeness, in Appendix A we compare DFBGN with DFO-LS on the low-dimensional collection of test problems from Moré and Wild [55]. We do not include this discussion here as these problems are low-dimensional, which is not the main use case for DFBGN.

Figure 8. Data profiles comparing the runtime of DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR-large).
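The solver comparison machinery of Section 5.1 is straightforward to implement from per-problem evaluation counts. The sketch below is illustrative only: the helper names are ours (not from DFO-LS or DFBGN), and it implements the convergence test (5.1) and the data and performance profiles (5.2)–(5.3).

```python
import numpy as np

def evals_to_accuracy(f_history, f_star, tau):
    """Convergence test (5.1): number of r(x) evaluations needed to reach
    f(x) <= f(x*) + tau * (f(x_0) - f(x*)), or np.inf if never achieved.
    f_history[i] is the best objective value seen after i+1 evaluations."""
    target = f_star + tau * (f_history[0] - f_star)
    for i, f in enumerate(f_history):
        if f <= target:
            return i + 1
    return np.inf

def data_profile(N, dims, alphas):
    """Data profile (5.2): fraction of problems solved within alpha*(n_P+1)
    evaluations. N[j] = N(S, P_j, tau) (np.inf if unsolved); dims[j] = n_P."""
    N, budgets = np.asarray(N, float), np.asarray(dims, float) + 1.0
    return np.array([np.mean(N <= a * budgets) for a in alphas])

def performance_profile(N_by_solver, alphas):
    """Performance profile (5.3): fraction of problems on which each solver
    is within a factor alpha of the best solver. Assumes every problem is
    solved by at least one solver, so the per-problem minimum is finite."""
    counts = np.vstack([np.asarray(v, float) for v in N_by_solver.values()])
    N_min = counts.min(axis=0)  # best evaluation count per problem
    return {s: np.array([np.mean(counts[i] <= a * N_min) for a in alphas])
            for i, s in enumerate(N_by_solver)}
```

Runtime-based profiles, as used in Figures 8 and 12, follow by feeding runtimes instead of evaluation counts into the same functions.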
We have seen above that DFBGN performs well compared to DFO-LS on the (CR-large) problem collection, as the 12 hour timeout causes DFO-LS to terminate after relatively few objective evaluations. In Figure 8, we show the same comparison for (CR-large) as in Figure 7, but using data profiles of problems solved versus runtime (rather than evaluations). Here, all DFBGN variants perform similarly to or better than DFO-LS at low accuracy levels, since DFBGN has a lower per-iteration runtime than DFO-LS, and this is the regime where DFBGN performs best (on budget). For high accuracy levels, DFBGN with p = n is the best-performing solver, as it uses large enough subspaces to solve many problems to high accuracy. By contrast, both DFBGN with small p and DFO-LS perform similarly at high accuracy levels: the impact of the timeout on DFO-LS roughly matches the reduced robustness of DFBGN with small p at these accuracy levels. Again, as we observed above, DFO-LS with reduced initialization cost is the worst-performing solver, due to the high cost of the SVD at each iteration.
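The normalized objective trajectories plotted below (Figures 9 and 10) follow (5.4); a minimal sketch, with a hypothetical helper name:

```python
def normalized_objective(f_history, f_star):
    """Normalize a history of best-so-far objective values into [0, 1]
    via (5.4): (f(x) - f(x*)) / (f(x_0) - f(x*)), where f(x_0) is the
    first recorded value and f(x*) the (estimated) minimum of f."""
    f0 = f_history[0]
    return [(f - f_star) / (f0 - f_star) for f in f_history]
```

The value after a given number of evaluations (or amount of runtime) is exactly the best τ achieved in (5.1) by that point.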
Figure 9. Normalized objective value (versus evaluations in the left column, and versus runtime in the right column) for 10 runs of DFO-LS and DFBGN on the CUTEst problem arwhdne with n = 100, n = 1000 and n = 2000. These results use a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance.
To further see the impact of this issue, we now consider how the solvers perform on a variable-dimension test problem as we increase the underlying dimension. We run each solver, with the same settings as above, on the CUTEst problem arwhdne for different choices of problem dimension n. In Figure 9 we plot the objective reduction for each solver against budget and runtime for DFO-LS and DFBGN, showing n = 100, n = 1000 and n = 2000.

We see that, when measured on evaluations, both DFO-LS variants achieve the fastest objective reduction, and DFBGN with small p achieves the slowest. This is in line with our results from Section 5.2. However, when we instead consider objective decrease against runtime, we see that DFBGN with small p gives the fastest decrease: the larger number of iterations needed by these variants (as seen by the larger number of evaluations) is offset by the substantially reduced per-iteration linear algebra cost. When viewed against runtime, both DFO-LS variants can only achieve a small objective decrease in the allowed 12 hours, even though they show fast decrease against budget, and would achieve higher accuracy than DFBGN if the linear algebra cost were negligible.

Another benefit of DFBGN is its small initialization cost of p + 1 evaluations. When n is large, it is more likely for a user to be limited by a budget of fewer than n evaluations. Here, we examine how DFBGN compares for small budgets to DFO-LS with reduced initialization cost. We recall from Remark 4.1 that DFO-LS with reduced initialization cost progressively increases the dimension of the subspace of its interpolation model until it reaches the whole space R^n (after approximately n + 1 evaluations), while DFBGN restricts the dimension at all iterations.

In Figure 10 we consider three variable-dimensional CUTEst problems from (CR) and (CR-large), all using n = 1000 and n = 2000.
We show the objective decrease against budget for 10 runs of each solver, restricted to a maximum of n + 1 evaluations. We see that the smaller the p used in DFBGN, the faster DFBGN is able to make progress (due to the lower number of initial evaluations). However, this is offset by the faster objective decrease achieved by larger p values (after the higher initialization cost): if the user can afford a larger p, both in terms of linear algebra and initial evaluations, then this is usually the better option. An exception is the problem vardimne, whose simple structure means DFBGN with small p solves the problem to very high accuracy with very few evaluations, substantially outperforming both DFBGN with larger p and DFO-LS with reduced initialization cost.

In Figure 10 we also show the decrease for DFO-LS with full initialization cost and DFBGN with p = n, but they use the full budget on initialization and so make no progress. In addition, we show DFO-LS with a reduced initialization cost of n/ evaluations. This variant performs well, in most cases initially matching the decrease of DFBGN with small p, but achieving a faster objective reduction against budget; this matches our previous observations. However, the extra cost of the linear algebra means that DFO-LS with reduced initialization does not end up using the full budget, instead hitting the 12 hour timeout. This is most clear when comparing the results for n = 1000 with n = 2000, where DFO-LS with reduced initialization cost begins by achieving a similar decrease in both cases, but hits the timeout more quickly with n = 2000, and so terminates after fewer objective evaluations (with a correspondingly smaller objective decrease).

We analyze this more systematically in Figure 11, where we show data profiles (measured on budget) of DFBGN and DFO-LS on the (CR-large) problem collection, for low accuracy and small budgets.
These results verify our conclusions: DFBGN with small p can make progress on many problems with a very short budget (fewer than n + 1 evaluations), and outperforms DFO-LS with reduced initialization cost, due to the slow runtime of the latter. However, once we reach a budget of more than n + 1 evaluations, DFO-LS and DFBGN with p = n become the best-performing solvers (when measuring on evaluations only). They are also able to achieve a higher level of accuracy compared to DFBGN with small p.

This problem appears in the collections (CR) and (CR-large), with n = 100 and n = 1000 respectively.

Figure 10. Normalized objective value (versus evaluations) for 10 runs of DFO-LS and DFBGN on the CUTEst problems arwhdne, chandheq and vardimne (each with n = 1000 and n = 2000). These results use a budget of n + 1 evaluations and a 12 hour runtime limit per instance.

Figure 11. Data profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels and budgets: (a) a budget of n + 1 evaluations, (b) a budget of 100(n + 1) evaluations. Results are an average of 10 runs for each problem, with a 12 hour runtime limit per instance. The problem collection is (CR-large).

Lastly, in Figure 12 we show the same results as Figure 11, but with profiles measured on runtime. We note that we are only measuring the linear algebra costs, as the cost of objective evaluation for our problems is negligible. Here, the benefits of DFBGN with small p are not seen. This is because the problems that can be solved by DFBGN with small p using very few evaluations are likely easier, and so can likely be solved by DFBGN with large p in few iterations. Thus, the runtime requirements for DFBGN with large p to solve these problems are not large: even though it has a higher per-iteration cost, the number of iterations is small. In this setting, therefore, the benefit of DFBGN with small p is not lower linear algebra costs but fewer evaluations, which is likely to be the more relevant issue in this small-budget regime.

The development of scalable derivative-free optimization algorithms is an active area of research with many applications. In model-based DFO, the high per-iteration linear algebra cost associated (primarily) with interpolation model creation and point management is a barrier to its utility for large-scale problems.
To address this, we introduce three model-based DFO algorithms for large-scale problems. First, RSDFO is a general framework for model-based DFO based on model construction and minimization in random subspaces, suitable for general smooth nonconvex objectives. This is specialized to nonlinear least-squares problems in RSDFO-GN, a version of RSDFO based on Gauss-Newton interpolation models built in subspaces. Lastly, we introduce DFBGN, a practical implementation of RSDFO-GN. In all cases, the scalability of these methods arises from the construction and minimization of models in p-dimensional subspaces of the ambient space R^n. The subspace dimension can be specified by the user to reflect the computational resources available for linear algebra calculations.

We prove high-probability worst-case complexity bounds for RSDFO, and show that RSDFO-GN inherits the same bounds. In terms of selecting the subspace dimension, we show that by using matrices based on Johnson-Lindenstrauss transformations, we can choose p to be independent of the ambient dimension n. Our analysis extends to DFO the techniques in [15, 16], and yields similar results to probabilistic direct search [42] and standard model-based DFO [33, 18]. Our results also imply almost-sure global convergence to first-order stationary points.

Our practical implementation of RSDFO-GN, DFBGN, has very low computational requirements: asymptotically, linear in the ambient dimension rather than cubic as for standard model-based DFO. After the extensive algorithm development described here, our implementation is simple and combines several techniques for modifying the interpolation set which allow it to still make progress with few objective evaluations (an important consideration for DFO techniques). A Python version of DFBGN is available on Github.

Figure 12. Data profiles (in runtime) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels and budgets. Results are an average of 10 runs for each problem, with a budget of n + 1 or 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR-large).

For medium-scale problems, DFBGN operating in the full ambient space (p = n) has similar performance to DFO-LS [14] when measured by objective evaluations, validating the techniques introduced in the practical implementation. However, DFBGN (with any choice of subspace dimension) has substantially faster runtime, which means it is much more effective than DFO-LS at solving large-scale problems from CUTEst, even when working in a very low-dimensional subspace. Further, in the case of expensive objective evaluations, working in a subspace means that DFBGN can make progress with very few evaluations, many fewer than the n + 1 needed for standard methods to build their initial model. Overall, the implementation of DFBGN is suitable for large-scale problems both when objective evaluations are cheap (and linear algebra costs dominate) and when evaluations are expensive (and the initialization cost of standard methods is impractical).

Future work will focus on extending the ideas from the implementation of DFBGN to the case of general objectives with quadratic models. This will bring the available software in line with the theoretical guarantees for RSDFO. We note that model-based DFO for nonlinear least-squares problems has been adapted to include sketching methods, which use randomization to reduce the number of residuals considered at each iteration [13]. We also delegate to future work the development of techniques for nonlinear least-squares problems which combine sketching (i.e.
dimensionality reduction in the observation space) with our subspace approach (i.e. dimensionality reduction in variable space), and further study of methods for adaptively selecting a subspace dimension (c.f. Section 4.2).

The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work.

References

[1]
S. Alarie, C. Audet, A. E. Gheribi, M. Kokkolaras, and S. Le Digabel, Two decades of blackbox optimization applications, Tech. Rep. G-2020-58, GERAD, 2020.
[2] M. Alzantot, Y. Sharma, S. Chakraborty, and M. B. Srivastava, GenAttack: Practical black-box attacks with gradient-free optimization, arXiv preprint arXiv:1805.11090, (2018).
[3] W. Arter, A. Osojnik, C. Cartis, G. Madho, C. Jones, and S. Tobias, Data assimilation approach to analysing systems of ordinary differential equations, in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1–5.
[4] A. S. Bandeira, K. Scheinberg, and L. N. Vicente, Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization, Mathematical Programming, 134 (2012), pp. 223–257.
[5] ———, Convergence of trust-region methods based on probabilistic models, SIAM Journal on Optimization, 24 (2014), pp. 1238–1264.
[6] A. S. Berahas, R. Bollapragada, and J. Nocedal, An investigation of Newton-Sketch and subsampled Newton methods, Optimization Methods and Software, (2020).
[7] A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg, Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization, arXiv preprint arXiv:1905.13043, (2019).
[8] ———, A theoretical and empirical comparison of gradient approximations in derivative-free optimization, arXiv preprint arXiv:1905.01332, (2019).
[9] E. Bergou, S. Gratton, and L. N. Vicente, Levenberg–Marquardt methods based on probabilistic gradient models and inexact subproblem solution, with application to data assimilation, SIAM/ASA Journal on Uncertainty Quantification, 4 (2016), pp. 924–951.
[10] E. H. Bergou, E. Gorbunov, and P. Richtárik, Stochastic three points method for unconstrained smooth minimization, SIAM Journal on Optimization, (2020).
[11] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg, Convergence rate analysis of a stochastic trust region method for nonconvex optimization, INFORMS Journal on Optimization, 1 (2019), pp. 92–119.
[12] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, Clarendon Press, Oxford, 2012.
[13] C. Cartis, T. Ferguson, and L. Roberts, Scalable derivative-free optimization for nonlinear least-squares problems, in Workshop on “Beyond first-order methods in ML systems” at the 37th International Conference on Machine Learning, 2020.
[14] C. Cartis, J. Fiala, B. Marteau, and L. Roberts, Improving the flexibility and robustness of model-based derivative-free optimization solvers, ACM Transactions on Mathematical Software, 45 (2019), pp. 32:1–32:41.
[15] C. Cartis, J. Fowkes, and Z. Shao, A randomised subspace Gauss-Newton method for nonlinear least-squares, in Workshop on “Beyond first-order methods in ML systems” at the 37th International Conference on Machine Learning, Vienna, Austria, 2020.
[16] ———, Randomised subspace methods for nonlinear optimization with applications to nonlinear least-squares, in preparation, (2021).
[17] C. Cartis, E. Massart, and A. Otemissov, Constrained global optimization of functions with low effective dimensionality using multiple random embeddings, arXiv preprint arXiv:2009.10446, (2020).
[18] C. Cartis and L. Roberts, A derivative-free Gauss-Newton method, Mathematical Programming Computation, 11 (2019), pp. 631–674.
[19] C. Cartis, L. Roberts, and O. Sheridan-Methven, Escaping local minima with local derivative-free methods: a numerical investigation, Optimization, to appear (2021).
[20] C. Cartis and K. Scheinberg, Global convergence rate analysis of unconstrained optimization methods based on probabilistic models, Mathematical Programming, 169 (2018), pp. 337–375.
[21] R. Chen, M. Menickelly, and K. Scheinberg, Stochastic optimization using a trust-region method and random models, Mathematical Programming, 169 (2018), pp. 447–487.
[22] X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox, ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization, arXiv preprint arXiv:1910.06513, (2019).
[23] F. Chung and L. Lu, Connected components in random graphs with given expected degree sequences, Annals of Combinatorics, 6 (2002), pp. 125–145.
[24] B. Colson and P. L. Toint, Optimizing partially separable functions without derivatives, Optimization Methods and Software, 20 (2005), pp. 493–508.
[25] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust-Region Methods, vol. 1 of MPS-SIAM Series on Optimization, MPS/SIAM, Philadelphia, 2000.
[26]
A. R. Conn, K. Scheinberg, and L. N. Vicente , Introduction to Derivative-Free Op-timization , vol. 8 of MPS-SIAM Series on Optimization, MPS/SIAM, Philadelphia, 2009.[27]
A. Cristofari and F. Rinaldi , A derivative-free method for structured optimization prob-lems , arXiv preprint arXiv:2005.05224, (2020).[28]
E. D. Dolan and J. J. Moré , Benchmarking optimization software with performanceprofiles , Mathematical Programming, 91 (2002), pp. 201–213.[29]
M. L. Eaton , Multivariate Statistics: A Vector Space Approach , vol. 53 of Lecture Notes–Monograph Series, Institute of Mathematical Statistics, Beachwood, Ohio, 2007.[30]
M. J. Ehrhardt and L. Roberts , Inexact derivative-free optimization for bilevel learn-ing , Journal of Mathematical Imaging and Vision, to appear (2020).[31]
T. Ergen, E. Candès, and M. Pilanci , Random projections for learning non-convexmodels , in 33rd Conference on Neural Information Processing Systems, 2019.[32]
F. Facchinei, G. Scutari, and S. Sagratella , Parallel selective algorithms for non-convex big data optimization , IEEE Transactions on Signal Processing, 63 (2015), pp. 1874–1889.[33]
R. Garmanjani, D. Júdice, and L. N. Vicente , Trust-region methods without usingderivatives: Worst case complexity and the nonsmooth case , SIAM Journal on Optimization,26 (2016), pp. 1987–2011.[34]
S. Ghadimi and G. Lan , Stochastic first- and zeroth-order methods for nonconvexstochastic programming , SIAM Journal on Optimization, 23 (2013), pp. 2341–2368.[35]
H. Ghanbari and K. Scheinberg , Black-box optimization in machine learning with trustregion based derivative free algorithm , arXiv preprint arXiv:1703.06925, (2017).[36]
D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang , Gradient-less descent: High-dimensional zeroth-order optimization , arXiv preprint arXiv:1911.06317,(2019).[37]
G. H. Golub and C. F. Van Loan , Matrix Computations , The Johns Hopkins UniversityPress, Baltimore, 3rd ed., 1996. 4338]
N. I. M. Gould, D. Orban, and P. L. Toint , CUTEst: a constrained and unconstrainedtesting environment with safe threads for mathematical optimization , Computational Op-timization and Applications, 60 (2015), pp. 545–557.[39]
R. Gower, D. Goldfarb, and P. Richtárik , Stochastic block BFGS: Squeezing morecurvature out of data , in Proceedings of The 33rd International Conference on MachineLearning, M. F. Balcan and K. Q. Weinberger, eds., vol. 48 of Proceedings of MachineLearning Research, New York, 2016, PMLR, pp. 1869–1878.[40]
R. M. Gower, D. Kovalev, F. Lieder, and P. Richtárik , RSN: Randomized SubspaceNewton , in 33rd Conference on Neural Information Processing Systems, 2019.[41]
R. M. Gower, P. Richtárik, and F. Bach , Stochastic quasi-gradient methods: variancereduction via Jacobian sketching , Mathematical Programming, (2020).[42]
S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang , Direct search based onprobabilistic descent , SIAM Journal on Optimization, 25 (2015), pp. 1515–1541.[43]
S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang , Complexity and global ratesof trust-region methods based on probabilistic models , IMA Journal of Numerical Analysis,38 (2017), pp. 1579–1597.[44] ,
Direct search based on probabilistic feasible descent for bound and linearly constrainedproblems , Computational Optimization and Applications, 72 (2019), pp. 525–559.[45]
J. C. Gross and G. T. Parks , Optimization by moving ridge functions: Derivative-free optimization for computationally intensive functions , arXiv preprint arXiv:2007.04893,(2020).[46]
N. Halko, P. G. Martinsson, and J. A. Tropp , Finding structure with randomness:Probabilistic algorithms for constructing approximate matrix decompositions , SIAM Review,53 (2011), pp. 217–288.[47]
W. Hare, G. Jarry-Bolduc, and C. Planiden , Error bounds for overdetermined andunderdetermined generalized centred simplex gradients , arXiv preprint arXiv:2006.00742,(2020).[48]
D. M. Kane and J. Nelson , Sparser Johnson-Lindenstrauss transforms , Journal of theACM, 61 (2014), pp. 4:1–4:23.[49]
C. T. Kelley , Detection and remediation of stagnation in the Nelder-Mead algorithm usinga sufficient decrease condition , SIAM Journal on Optimization, 10 (1999), pp. 43–55.[50]
J. W. Larson, M. Menickelly, and S. M. Wild , Derivative-free optimization methods ,Acta Numerica, 28 (2019), pp. 287–404.[51]
S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini , Zeroth-orderstochastic variance reduction for nonconvex optimization , arXiv preprint arXiv:1805.10367,(2018).[52]
Z. Lu and L. Xiao , A randomized nonmonotone block proximal gradient method for aclass of structured nonlinear programming , SIAM Journal on Numerical Analysis, 55 (2017),pp. 2930–2955.[53]
M. W. Mahoney , Randomized algorithms for matrices and data , Foundations and Trendsin Machine Learning, 3 (2011), pp. 123–224.[54]
M. W. Mahoney , Lecture notes on randomized linear algebra , arXiv preprintarXiv:1608.0448, (2016).[55]
J. J. Moré and S. M. Wild , Benchmarking derivative-free optimization algorithms , SIAMJournal on Optimization, 20 (2009), pp. 172–191.4456]
Y. Nesterov and V. Spokoiny , Random gradient-free minimization of convex functions ,Foundations of Computational Mathematics, 17 (2017), pp. 527–566.[57]
A. Neumaier, H. Fendl, H. Schilly, and T. Leitner , VXQR: Derivative-free uncon-strained optimization based on QR factorizations , Soft Computing, 15 (2011), pp. 2287–2298.[58]
A. Patrascu and I. Necoara , Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization , Journal of Global Optimization, 61 (2015), pp. 19–46.[59]
M. Pilanci and M. J. Wainwright , Newton sketch: A linear-time optimization algorithmwith linear-quadratic convergence , SIAM Journal on Optimization, 27 (2017), pp. 205–245.[60]
M. Porcelli and P. L. Toint , Global and local information in structured derivative freeoptimization with BFO , arXiv preprint arXiv:2001.04801, (2020).[61]
M. J. D. Powell , On trust region methods for unconstrained minimization without deriv-atives , Mathematical Programming, 97 (2003), pp. 605–623.[62] ,
Least Frobenius norm updating of quadratic models that satisfy interpolation condi-tions , Mathematical Programming, 100 (2004), pp. 183–215.[63] ,
The BOBYQA algorithm for bound constrained optimization without derivatives ,Tech. Rep. DAMTP 2009/NA06, University of Cambridge, 2009.[64]
H. Qian, Y.-q. Hu, and Y. Yu , Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings , in Proceedings of the 25th InternationalJoint Conference on Artificial Intelligence, S. Kambhampati, ed., New York, 2016, AAAIPress, pp. 1946–1952.[65]
L. Roberts , Derivative-free algorithms for nonlinear optimisation problems , PhD thesis,University of Oxford, 2019.[66]
F. Roosta-Khorasani and M. W. Mahoney , Sub-sampled Newton methods , Mathem-atical Programming, 174 (2019), pp. 293–326.[67]
T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever , Evolution strategies as ascalable alternative to reinforcement learning , arXiv preprint arXiv:1703.03864, (2017).[68]
J. Tanner and G. Ughi , Model based optimisation applied to black-box attacks in deeplearning , in preparation, (2019).[69]
T. Tao , Topics in random matrix theory , vol. 132 of Graduate Studies in Mathematics,American Mathematical Society, Providence, Rhode Island, 2012.[70]
S. F. B. Tett, K. Yamazaki, M. J. Mineter, C. Cartis, and N. Eizenberg , Cal-ibrating climate models using inverse methods: Case studies with HadAM3, HadAM3P andHadCM3 , Geoscientific Model Development, 10 (2017), pp. 3567–3589.[71]
L. N. Vicente , Worst case complexity of direct search , EURO Journal on ComputationalOptimization, 1 (2013), pp. 143–153.[72]
Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas , Bayesian op-timization in a billion dimensions via random embeddings , Journal of Artificial IntelligenceResearch, 55 (2016), pp. 361–387.[73]
S. M. Wild , POUNDERS in TAO: Solving derivative-free nonlinear least-squares problemswith POUNDERS , in Advances and Trends in Optimization with Engineering Applications,T. Terlaky, M. F. Anjos, and S. Ahmed, eds., vol. 24 of MOS-SIAM Book Series on Optim-ization, MOS/SIAM, Philadelphia, 2017, pp. 529–539.4574]
D. P. Woodruff , Sketching as a tool for numerical linear algebra , Foundations and Trendsin Theoretical Computer Science, 10 (2014), pp. 1–157.[75]
S. J. Wright , Coordinate descent algorithms , Mathematical Programming, 151 (2015),pp. 3–34.[76]
Y. Xu and W. Yin , Block stochastic gradient iteration for convex and nonconvex optim-ization , SIAM Journal on Optimization, 25 (2015), pp. 1686–1716.[77] ,
A globally convergent algorithm for nonconvex optimization based on block coordinateupdate , Journal of Scientific Computing, 72 (2017), pp. 700–734.[78]
Y. Yang, M. Pesavento, Z.-Q. Luo, and B. Ottersten , Inexact block coordinatedescent algorithms for nonsmooth nonconvex optimization , IEEE Transactions on SignalProcessing, 68 (2020), pp. 947–961.[79]
H. Zhang, A. R. Conn, and K. Scheinberg , A derivative-free algorithm for least-squaresminimization , SIAM Journal on Optimization, 20 (2010), pp. 3555–3576.46
Extended Results for DFBGN
In Figure 13, we show performance profiles comparing DFBGN with DFO-LS on the (MW) problem collection. Since these problems are low-dimensional, they do not represent the setting for which DFBGN is designed; however, we include them here for completeness. Similar to Figure 6, we see that DFBGN performs better (in terms of evaluations) the larger the subspace size p, with the performance for p = n similar to that of DFO-LS. For very low accuracy levels, DFBGN with p < n sometimes outperforms DFO-LS with the full initialization cost, but not DFO-LS with reduced initialization cost.

Figure 13. Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices, including p = n) for different accuracy levels τ. Results are an average of 10 runs for each problem, with a budget that is a multiple of n + 1 evaluations and a 12 hour runtime limit per instance. The problem collection is (MW).

Large-Scale Test Problems (CR-large)

 #   Problem    n     m     f(x_0)    f(x*)  Parameters
 1   ARGLALE    2000  4000  10000     2000   (N, M) = (2000, )
 2   -          -     -     -         -      (N, M) = (2000, )
 3   -          -     -     -         -      N = 1000
 4   -          -     -     -         -      N = 5000
 5   -          -     -     -         -      N = 5000
 6   -          -     -     -         -      NDP = 1002
 7   -          -     -     -         -      P = 72
 8   -          -     -     -         -      P = 72
 9   -          -     -     -         -      P = 17
 10  BROWNALE   1000  1000  -         -      N = 1000
 11  BROYDN3D   1000  1000  1011      0      N = 1000
 12  BROYDNBD   5000  5000  124904    0      N = 5000
 13  CBRATU2D   2888  2888  -         -      P = 40
 14  CHANDHEQ   1000  1000  69.41682  0      N = 1000
 15  EIGENB     2550  2550  99        0      N = 50
 16  FREURONE   5000  9998  -         -      N = 5000
 17  INTEGREQ   1000  1000  5.678349  0      N = 1000
 18  MOREBVNE   1000  1000  -         -      N = 1000
 19  MSQRTA     4900  4900  -         -      P = 70
 20  MSQRTB     1024  1024  7926.444  0      P = 32
 21  OSCIGRNE   1000  1000  -         -      N = 1000
 22  PENLT1NE   1000  1001  -         -      N = 1000
 23  POWELLSE   1000  1000  418750    0      N = 1000
 24  SEMICN2U   1000  1000  -         -      (N, LN) = (1000, )
 25  SPMSQRT    1000  1664  797.0033  0      M = 334
 26  VARDIMNE   1000  1002  -         -      N = 1000
 27  YATP1SQ    2600  2600  -         -      N = 50
 28  YATP2SQ    2600  2600  -         -      N = 50

Table 3. Details of large-scale test problems from the CUTEst test set (showing f(x_0) and f(x*), as the implementations of DFO-LS and DFBGN do not have the 1/2 constant factor in (3.1)). The set of problems is taken from those in [18, Table 3]; the relevant parameters yielding the given (n, m) are provided. The value of n shown excludes fixed variables.
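The performance profiles in Figure 13 follow the standard methodology of Dolan and Moré [28] and Moré and Wild [55]: a run is declared to have solved a problem to accuracy τ once the objective has achieved a fraction 1 − τ of the best possible decrease from f(x_0), and solvers are then compared by the distribution of their evaluation counts relative to the best solver on each problem. The following is a minimal sketch of that computation; the function names, the toy data, and the assumption that every problem is solved by at least one solver are illustrative, not taken from the paper's actual benchmarking code.

```python
import numpy as np

def evals_to_solve(history, f0, fstar, tau):
    """First evaluation count at which f(x) <= fstar + tau * (f0 - fstar),
    or np.inf if accuracy level tau is never reached. `history` holds the
    best objective value seen after each evaluation."""
    target = fstar + tau * (f0 - fstar)
    hits = np.flatnonzero(np.asarray(history) <= target)
    return float(hits[0] + 1) if hits.size > 0 else np.inf

def performance_profile(t, alphas):
    """Dolan-More performance profiles.

    t: (num_problems, num_solvers) array of evaluation counts needed to
       solve each problem (np.inf where a solver failed). Assumes each
       problem is solved by at least one solver.
    Returns an array of shape (len(alphas), num_solvers): for each
    performance ratio alpha, the proportion of problems each solver solved
    within alpha times the evaluations of the best solver on that problem."""
    best = np.min(t, axis=1, keepdims=True)  # best solver per problem
    ratios = t / best                        # performance ratios r_{p,s}
    return np.array([(ratios <= a).mean(axis=0) for a in alphas])

# toy data: 3 problems, 2 solvers; solver 0 fails on problem 1
t = np.array([[10.0, 20.0],
              [np.inf, 30.0],
              [40.0, 40.0]])
rho = performance_profile(t, np.array([1.0, 2.0]))
```

Repeating this for each accuracy level τ produces one profile per panel, as in Figure 13.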