Derivative free optimization via repeated classification
Tatsunori B. Hashimoto Steve Yadlowsky John C. Duchi
Department of Statistics, Stanford University, Stanford, CA 94305
{thashim, syadlows, jduchi}@stanford.edu

Abstract
We develop an algorithm for minimizing a function using n batched function-value measurements at each of T rounds by using classifiers to identify a function's sublevel set. We show that sufficiently accurate classifiers can achieve linear convergence rates, and show that the convergence rate is tied to the difficulty of actively learning sublevel sets. Further, we show that the bootstrap is a computationally efficient approximation to the necessary classification scheme.

The end result is a computationally efficient derivative-free algorithm requiring no tuning that consistently outperforms other approaches on simulations, standard benchmarks, real-world DNA binding optimization, and airfoil design problems whenever batched function queries are natural.

Consider the following abstract problem: given access to a function f : X → R, where X is some space, find x ∈ X minimizing f(x). We study an instantiation of this problem that trades sequential access to f for large batches of parallel queries: one can query f for its value over n points at each of T rounds. In this setting, we propose a general algorithm that effectively optimizes f whenever there is a family of classifiers h : X → [0, 1] that can predict sublevel sets of f with high enough accuracy.

Our main motivation comes from settings in which n is large, on the order of hundreds to thousands, while possibly small relative to the size of X. These types of problems occur in biological assays [21], physical simulations [27], and reinforcement learning problems [33], where parallel computation or high-throughput measurement systems allow efficient collection of large batches of data. More concretely, consider the optimization of protein binding affinity to DNA sequence targets from biosensor data [11, 21, 38]. In this case, assays measure the binding of large batches of sequences in parallel (n is large), while the number of sequential assay rounds T must be small. In such problems, it is typically difficult to compute the gradients of f (if they even exist); consequently, we focus on derivative-free optimization (DFO, also known as zero-order optimization) techniques.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

The batched derivative-free optimization problem consists of a sequence of rounds t = 1, . . . , T in which we propose a distribution p^(t), draw a sample of n candidates X_i iid ∼ p^(t), and observe Y_i = f(X_i). The goal is to find at least one example X_i for which the gap min_i f(X_i) − inf_{x∈X} f(x) is small.

Our basic idea is conceptually simple: in each round, fit a classifier h predicting whether Y_i ≶ α^(t) for some threshold α^(t). Then, upweight points x that h predicts as f(x) < α^(t) and downweight the other points x for the proposal distribution p^(t) for the next round. This algorithm is inspired by classical cutting-plane algorithms [30, Sec. 3.2], which remove a constant fraction of the remaining feasible space at each iteration, and is extended into the stochastic setting based on multiplicative weights algorithms [25, 3]. We present the overall algorithm as Algorithm 1.
Algorithm 1 Cutting-planes using classifiers

Require: Objective f, action space X, hypothesis class H.
  Set p^(0)(x) = 1/|X|
  Draw X^(0) ∼ p^(0); observe Y^(0) = f(X^(0))
  for t ∈ {1, . . . , T} do
    Set α^(t) = median({Y_i^(t−1)}_{i=1}^n)
    Set h^(t) ∈ H as the loss minimizer of L over (X^(0), 1{Y^(0) > α^(t)}), . . . , (X^(t−1), 1{Y^(t−1) > α^(t)})
    Set p^(t)(x) ∝ p^(t−1)(x)(1 − η h^(t)(x))
    Draw X^(t) ∼ p^(t); observe Y^(t) = f(X^(t))
  end for
  Set i* = argmin_i Y_i^(T)
  return X_{i*}^(T)

When, as is typical in optimization, one has substantial sequential access to f, meaning that T can be large, there are a number of major approaches to optimization. Bayesian optimization [34, 7] and kernel-based bandits [9] construct an explicit surrogate function to minimize; often, one assumes it is possible to perfectly model the function f. Local search algorithms [12, 26] emulate gradient descent via finite-difference and local function evaluations. Our work differs conceptually in two ways: first, we think of T as being small while n is large, and second, we represent a function f by approximating its sublevel sets. Existing batched derivative-free optimizers encounter computational difficulties for batch sizes beyond dozens of points [16]. Our sublevel set approach scales to large batches of queries by simply sampling from the current sublevel set approximation.

While other researchers have considered level set estimation in the context of Bayesian optimization [17, 7] and evolutionary algorithms [29], these use the level set to augment a traditional optimization algorithm. We show that good sublevel set predictions alone are sufficient to achieve linear convergence. Moreover, given the extraordinary empirical success of modern classification algorithms, e.g. deep networks for image classification [22], it is natural to develop algorithms for derivative-free optimization based on fitting a sequence of classifiers. Yu et al.
[40] also propose optimization via classification, but their approach assumes a classifier constrained to never misclassify near the optimum, making the problem trivial.

We present Algorithm 1, characterize its convergence rate with appropriate classifiers, and show how it relates to measures of difficulty in active learning. We extend this basic approach, which may be computationally challenging, to an approach based on bootstrap resampling that is empirically quite effective and, in certain nice-enough scenarios, has provable guarantees of convergence. We provide empirical results on a number of different tasks: random (simulated) problems, airfoil (device) design based on physical simulators, and finding strongly-binding proteins based on DNA assays. We show that a black-box approach with random forests is highly effective within a few rounds T of sequential classification; this approach provides advantages in the large-batch setting.

The approach to optimization via classification has a number of practical benefits, many of which we verify experimentally. It is possible to incorporate prior knowledge in DFO through domain-specific classifiers, and in more generic optimization problems one can use black-box classifiers such as random forests. Any sufficiently accurate classifier guarantees optimization performance and can leverage the large-batch data collection that biological and physical problems essentially necessitate. Finally, one does not even need to evaluate f: it is possible to apply this framework with pairwise comparison or ordinal measurements of f.

Our starting point is a collection of "basic" results that apply to classification-based schemes and associated convergence results. Throughout this section, we assume we fit classifiers using pairs (x, z), where z is a 0/1 label indicating whether x belongs to the negative (low f(x)) or positive (high f(x)) class.
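As a concrete illustration, Algorithm 1 can be sketched in a few lines for a finite action space. The 1-nearest-neighbor classifier below is only a stand-in for the hypothesis classes discussed in the paper, and all function names are illustrative:

```python
import numpy as np

def fit_1nn(X_train, z_train):
    """Stand-in sublevel-set classifier: label a query by its nearest neighbor."""
    def h(X_query):
        d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        return z_train[np.argmin(d2, axis=1)]
    return h

def classify_cut(f, X, T=8, n=100, eta=0.5, seed=0):
    """Sketch of Algorithm 1 over a finite action set X (rows of an array)."""
    rng = np.random.default_rng(seed)
    p = np.full(len(X), 1.0 / len(X))          # p^(0): uniform over X
    idx = rng.choice(len(X), size=n, p=p)
    batches, values = [X[idx]], [f(X[idx])]
    for t in range(T):
        alpha = np.median(values[-1])          # alpha^(t): median of latest batch
        X_tr = np.vstack(batches)
        z_tr = (np.concatenate(values) > alpha).astype(int)
        h = fit_1nn(X_tr, z_tr)                # h^(t): predicts "above threshold"
        p = p * (1.0 - eta * h(X))             # multiplicative weight update
        p /= p.sum()
        idx = rng.choice(len(X), size=n, p=p)  # draw next batch from p^(t)
        batches.append(X[idx]); values.append(f(X[idx]))
    vals = np.concatenate(values)
    best = np.argmin(vals)
    return np.vstack(batches)[best], vals[best]
```

On a 41 x 41 grid over [−1, 1]^2 with f(x) = ‖x‖^2, this concentrates samples near the origin within a few rounds.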
We begin by demonstrating that two quantities govern the convergence of the optimizer: (1) the frequency with which the classifier misclassifies (and thus downweights) the optimum x* relative to the multiplicative weight η, and (2) the fraction of the feasible space each iteration removes.

If the classifier h^(t)(x) exactly recovers the sublevel set (h^(t)(x) = 0 if and only if f(x) < α^(t)), α^(t) is at most the population median of f(X^(t)), and X is finite, the basic cutting-plane bound immediately implies that

log[ P_{x ∼ p^(T)} ( f(x) = min_{x* ∈ X} f(x*) ) ] ≥ min( T log(1/(1 − η)) − log(|X|), 0 ).

It is not obvious that such a guarantee continues to hold for inaccurate h^(t): they may accidentally misclassify the optimum x*, and the thresholds α^(t) may not rapidly decrease the function value. To address these issues, we provide a careful analysis in the coming sections: first, we show the convergence guarantees implied by Algorithm 1 as a function of classification errors (Theorem 1); then we propose a classification strategy directly controlling errors (Sec. 2.2); and finally we give a computationally tractable approximation (Sec. 3).

We begin with our basic convergence result. Letting p^(t) and h^(t) be a sequence of distributions and classifiers on X, the convergence rate depends on two quantities: the coverage (number of items cut),

Σ_{x ∈ X} h^(t)(x) p^(t−1)(x),

and the number of times a hypothesis downweights an item x (because f(x) is too large), which we denote M_T(x) := Σ_{t=1}^T h^(t)(x). We have the following theorem.

Theorem 1.
Let γ > 0 and assume that for all t,

Σ_{x ∈ X} h^(t)(x) p^(t−1)(x) ≥ γ,

where p^(t)(x) ∝ p^(t−1)(x)(1 − η h^(t)(x)) as in Alg. 1. Let η ∈ [0, 1/2] and p^(0) be uniform. Then for all x ∈ X,

log p^(T)(x) ≥ (γη/(η + 2)) T − η(η + 1) M_T(x) − log(2|X|).

The theorem follows from a modification of standard multiplicative weights algorithm guarantees [3]; see supplemental section A.1 for a full proof.

We say that our algorithm converges linearly if log p^(t)(x) ≳ t. In the context of Theorem 1, a choice of η maximizing −η(η + 1) M_T(x*) + (η/(η + 2)) γT yields such convergence, as picking η sufficiently small that

T − ((η + 1)(η + 2)/γ) M_T(x*) = Ω(T)

guarantees linear convergence if 2 M_T(x*) < Tγ. A simpler form of the above bound for a fixed η shows the linear convergence behavior.

Corollary 1.
Let x ∈ X be such that q_T(x) := M_T(x)/(γT) ≤ 1/2. Under the conditions of Theorem 1,

log(p^(T)(x)) ≥ min(1, (1 − 2 q_T(x))/3) γT − log(2|X|), and q_T(x) ≥ 1/2 − log(2|X|)/(2γT).

The condition q_T(x) ≥ 1/2 − log(2|X|)/(2γT) arises because if M_T(x) is small, then eventually we must have p^(T)(x) ≥ 1 − γ, and any classifier h which fulfils the condition Σ_{x ∈ X} h^(t)(x) p^(t−1)(x) ≥ γ in Thm. 1 must downweight x. At this point, we can identify the optimum exactly with O(1/(1 − γ)) additional draws. The corollary shows that if M_T(x*) = 0 and γ = 1 − 1/e, we recover a linear cutting-plane-like convergence rate [cf. 30], which makes constant progress in volume reduction in each iteration.
The basic guarantee of Theorem 1 requires relatively few mistakes on x*, or at least on a point x with f(x) ≈ f(x*), to achieve good performance in optimization. It is thus important to develop careful classification strategies that are conservative: they do not prematurely cut out values x whose performance is uncertain. With this in mind, we now show how consistent selective classification strategies [15] (related to active learning techniques, and which abstain on "uncertain" examples, similar to the Knows-What-It-Knows framework [23, 2]) allow us to achieve linear convergence when the classification problems are realizable using a low-complexity hypothesis class.

The central idea is to classify an example only if all zero-error hypotheses agree on the label, and otherwise abstain. Since any hypothesis achieving zero population error must have zero training-set errors, we will only label points in a way consistent with the true labels. El-Yaniv and Wiener [15] define the following consistent selective strategy (CSS).

Definition 1 (Consistent selective strategy). For a hypothesis class H and training sample S_m, the version space VS_{H,S_m} ⊂ H is the set of all hypotheses which perfectly classify S_m. The consistent selective strategy is the classifier

h(x) = 1 if ∀g ∈ VS_{H,S_m}, g(x) = 1; 0 if ∀g ∈ VS_{H,S_m}, g(x) = 0; no decision otherwise.

Applied to our optimizer, this strategy enables safely downweighting examples whenever they are classified as being outside the sublevel set. Optimization performance guarantees then come from demonstrating that at each iteration the selective strategy does not abstain on too many examples. The rate of abstention for a selective classifier is related to the difficulty of disagreement-based active learning, controlled by the disagreement coefficient [18].
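For intuition, CSS is easy to state in code when H is a small finite set. This brute-force sketch (all names are illustrative) labels a point only when every hypothesis consistent with the sample agrees:

```python
import numpy as np

def css_predict(H, sample_x, sample_z, x):
    """Consistent selective strategy over a finite hypothesis class H.

    H is a list of functions mapping a point to {0, 1}. Returns 1 or 0
    when every hypothesis in the version space agrees at x, else None.
    """
    version_space = [h for h in H
                     if all(h(xi) == zi for xi, zi in zip(sample_x, sample_z))]
    preds = {h(x) for h in version_space}
    return preds.pop() if len(preds) == 1 else None  # None = "no decision"

# Threshold classifiers h_t(x) = 1{x > t} on the line:
H = [lambda x, t=t: int(x > t) for t in np.linspace(0.0, 1.0, 11)]
```

With the sample {(0.25, 0), (0.75, 1)}, the version space is the thresholds t in {0.3, ..., 0.7}; CSS labels 0.9 as 1 and 0.1 as 0, and abstains at 0.5, where consistent hypotheses disagree.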
Definition 2. The disagreement ball of a hypothesis class H for distribution P is

B_{H,P}(h, r) := { h' ∈ H : P(h(X) ≠ h'(X)) ≤ r }.

The disagreement region of a subset G ⊂ H is

Dis(G) := { x ∈ X : ∃ h1, h2 ∈ G s.t. h1(x) ≠ h2(x) }.

The disagreement coefficient ∆_h of the hypothesis class H for the distribution P is

∆_h := sup_{r > 0} P(X ∈ Dis(B_{H,P}(h, r))) / r.

The disagreement coefficient directly bounds the abstention rate as a function of generalization error.
Theorem 2. Let h be the CSS classifier in Definition 1, and let h* ∈ H be a classifier achieving zero risk. If P(g(X) ≠ h*(X)) < ε for all g ∈ VS_{H,S_m}, then the CSS abstention rate satisfies

P(h(X) = no decision) ≤ ∆_{h*} ε.

This follows from the definition of the disagreement coefficient and the size of the version space (Supp. section A.1 contains a full proof). The dependence of our results on the disagreement coefficient implies a reduction from zeroth-order optimization to disagreement-based active learning [15] and selective classification [39] over sublevel sets.

Implementing the CSS classifier may be somewhat challenging: given a particular point x, one must verify that all hypotheses consistent with the data classify it identically. In many cases, this requires training a classifier on the current training sample S^(t) at iteration t, coupled with x labeled positively, and then retraining the classifier with x labeled negatively [39]. This cost can be prohibitive. (Of course, implementing the multiplicative weights-update algorithm over x ∈ X is in general difficult as well, but in a number of application scenarios we know enough about H to be able to approximate sampling from p^(t) in Alg. 1.)

A natural strategy is to use the CSS classifier as part of Algorithm 1, setting all no-decision outputs to the zero class, only removing points confidently above the level set α^(t). That is, in round t of the algorithm, given samples S = (X^(t), Z^(t)), we define

h^(t)(x) = 1 if ∀g ∈ VS_{H,S}, g(x) = 1; 0 otherwise.

There is some tension between classifying examples correctly and cutting out bad x ∈ X, which the next theorem shows we can address by choosing large enough sample sizes n.
Theorem 3. Let H be a hypothesis class containing indicator functions for the sublevel sets of f, with VC dimension V and disagreement coefficient ∆_h. There exists a numerical constant C < ∞ such that for all δ ∈ [0, 1], ε ∈ [0, 1], and γ ∈ (∆_h ε, 1), if

n ≥ max{ C ε^(−1) [V log(ε^(−1)) + log(δ^(−1)) + log(2T)], γ^(−2) (log(δ^(−1)) + log(2T)) },

then with probability at least 1 − δ,

log(p^(T)(x*)) ≥ min{ (γ − ∆_h ε)(η/(η + 2)) T − log(2|X|), log(1 − γ) }

after T rounds of Algorithm 1.

The proof follows from combining the selective classification bound with standard VC-dimension arguments to obtain the sample-size requirement (Supp. A.1 contains a full proof). Thus if ∆_h is small, such as log(|X|), then choosing ε of order ∆_h^(−1) achieves exponential improvements over random sampling. In the worst case ∆_h = O(|X|), but small ∆_h are known for many problems; for example, for linear classification with continuous X over densities bounded away from zero, ∆_h = poly(log(Vol(X))), which would result in linear convergence rates (Theorem 7.16, [18]).

Using recent bounds for the disagreement coefficient for linear separators [5], we can show that for linear optimization over a convex domain, the CSS-based optimization algorithm above achieves linear convergence with a sample size on the order of d^{3/2}, up to logarithmic factors in d, T, and 1/δ, with probability at least 1 − δ (for lack of space, we present this as Theorem A.2 in the supplement). When the classification problem is non-realizable, but the Bayes-optimal hypothesis does not misclassify x*, an analogous result holds through the agnostic selective classification framework of Wiener and El-Yaniv [39]. The full result is in supplemental Theorem A.7.

While selective classification provides sufficient control of error for linear convergence, it is generally computationally intractable.
However, a bootstrap resampling algorithm [14] approximates selective classification well enough to provide finite-sample guarantees in parametric settings. Our analysis provides intuition for the empirical observation that selective classification via the bootstrap works well in many real-world problems [1].

Formally, consider a parametric family {P_θ}_{θ∈Θ} of conditional distributions of Z ∈ {0, 1} given X, with compact parameter space Θ. Given n samples X_1, . . . , X_n, we observe Z_i | X_i ∼ P_{θ*} with θ* ∈ int Θ. Let ℓ_θ(x, z) = −log(P_θ(z | x)) be the negative log-likelihood of z, which majorizes the 0-1 loss of the linear hypothesis class: ℓ_θ(x, z) ≥ 1{(2z − 1) x^⊤θ < 0}. Define the weighted likelihood

L_n(θ, u) := (1/n) Σ_{i=1}^n (1 + u_i) ℓ_θ(X_i, Z_i),

and consider the following multiplier bootstrap algorithm [14, 35], parameterized by B ∈ N and a variance σ²; σ adds additional variation in the estimates to increase parameter coverage.

1. Draw {(X_i, Z_i)}_{i=1}^n from P.
2. Compute θ_n = argmin_θ L_n(θ, 0).
3. For b = 1 to B:
   (a) Draw u_b iid ∼ Uni[−1, 1]^n.
   (b) Set θ°_{u_b} = σ(argmin_θ L_n(θ, u_b) − θ_n) + θ_n.
4. Define the estimator

h°(x) = 1 if ∀b ∈ [B], x^⊤θ°_{u_b} > 0; 0 if ∀b ∈ [B], x^⊤θ°_{u_b} ≤ 0; no decision otherwise.

For linear classifiers with strongly convex losses, this algorithm obtains selective classification guarantees under appropriate regularity conditions, as presented in the following theorem.
Theorem 4. Assume ℓ_θ is twice differentiable and fulfils ‖∇ℓ_θ(X, Z)‖ ≤ R and ‖∇²ℓ_θ(X, Z)‖_op ≤ S almost surely. Additionally, assume L_n(θ, 0) is γ-strongly convex and that ∇²L_n(θ, 0) is M-Lipschitz with probability one. For h° defined above and x ∈ X,

P(x^⊤θ* ≤ 0 and h°_u(x) = 1) < δ.

Further, the abstention rate is bounded by

∫_{x ∈ R^d} 1{h°_u(x) = ∅} p(x) dx ≤ ε ∆_h

with probability 1 − δ whenever B ≥ 15 log(3/δ), σ = O(d^{1/2} + log(1/δ)^{1/2} + n^{−1/2}), ε = O(σ √(log(B/δ)/n)), and n ≳ log(d/δ) S/γ.

Due to length, the proof and full statement with constants appear in the appendix as Theorem A.4; a sketch: we first show that a given quadratic version space and a multivariate Gaussian sample θ_quad obtain the selective classification guarantees (Lemmas A.3, A.4, A.5). We then show that θ° ≈ θ_quad to order n^{−1/2}, which is sufficient to recover Theorem A.4.

Figure 1. (a) Classification confidences formed by bootstrapping approximate selective classification. (b) Bootstrapping results in more consistent identification of minima.
Bootstrap consensus provides more conservative classification boundaries, which prevents repeatedly misclassifying the minimum, compared to direct loss minimization (panel b, triangle).
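The multiplier bootstrap above (steps 1-4) can be sketched as follows, with the logistic loss standing in for ℓ_θ and plain gradient descent standing in for the inner minimizer; both are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def multiplier_bootstrap(X, Z, B=20, sigma=1.0, steps=300, lr=0.5, seed=0):
    """Multiplier-bootstrap selective linear classifier (steps 1-4), sketched
    with logistic loss and gradient descent as a stand-in minimizer."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s = 2.0 * Z - 1.0                        # labels in {-1, +1}

    def argmin_loss(w):                      # minimize (1/n) sum_i w_i * l_theta
        theta = np.zeros(d)
        for _ in range(steps):
            m = np.clip(s * (X @ theta), -30, 30)
            grad = -((w * s / (1.0 + np.exp(m))) @ X) / n
            theta -= lr * grad
        return theta

    theta_n = argmin_loss(np.ones(n))        # step 2: u = 0
    boots = []
    for _ in range(B):                       # step 3
        u = rng.uniform(-1.0, 1.0, size=n)   # (a) multiplier weights
        boots.append(sigma * (argmin_loss(1.0 + u) - theta_n) + theta_n)  # (b)

    def h(x):                                # step 4: consensus of all replicates
        votes = np.array([float(x @ th > 0) for th in boots])
        if votes.all():
            return 1
        if not votes.any():
            return 0
        return None                          # abstain ("no decision")
    return h
```

Points far from every bootstrap replicate's decision boundary receive a firm label; points on which the replicates disagree are abstained, mimicking the CSS behavior.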
The abstention rate in this bound is a factor of d larger than the original selective classification result. This additional factor of d, which appears through σ, arises from the difference between finding an optimum within a ball and randomly sampling it: random vectors concentrate within O(1/d) of the origin, while the maximum possible value is 1. This gap forces us to scale the variance in the decision function by σ (step 3b). We present selective classification approximation bounds analogous to Theorem 3 for linear optimization in the Appendix as Theorem A.5.

To illustrate our results through simulations, consider optimizing a two-dimensional linear function in the unit box. Figure 1a shows the set of downweighted points (colored points) for various algorithms on classifying a single superlevel set based on eight observations (black points). Observe how the linear classifier downweights many points (colored 'x'), in contrast to exact CSS, which only downweights points guaranteed to be in the superlevel set. Errors of this type, combined with Alg. 1, result in optimizers which fail to find the true minimum depending on initialization (Figure 1b). The bootstrapped linear classifier behaves similarly to CSS, but is looser due to the non-asymptotic setting. Random forests, another type of bootstrapped classifier, are surprisingly good at approximating CSS, despite not making use of the linearity of the decision boundary.

One benefit of optimizing via classification is that the algorithm only requires a total ordering among the elements. Specifically, step 6 of Algorithm 1 only requires threshold comparisons against a percentile selected in step 5. This enables optimization under pairwise-comparison feedback. At each round, instead of observing f(X^(t)), we observe g(X_i^(t), X_j^(t)) = 1{f(X_i^(t)) < f(X_j^(t))}
for randomly selected pairs, and estimate each point's relative rank from c comparisons per proposed action; c ≥ 10 seems to work well in practice, and more sophisticated preference aggregation algorithms may reduce the number of comparisons even further.
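This reduction can be sketched as follows; the simple win-count estimator here is a stand-in for the paper's ordering estimator, and all names are illustrative. Each point is labeled by whether its estimated rank falls above the batch median, using only c comparisons per point:

```python
import numpy as np

def median_split_from_comparisons(compare, n, c=10, seed=0):
    """Estimate the above/below-median labels z_i in {0, 1} from c random
    pairwise comparisons per item; compare(i, j) returns 1 if f(x_i) < f(x_j)."""
    rng = np.random.default_rng(seed)
    win_frac = np.zeros(n)
    for i in range(n):
        opponents = rng.choice(np.delete(np.arange(n), i), size=c, replace=False)
        win_frac[i] = np.mean([compare(i, j) for j in opponents])
    # A low win fraction means large f(x_i): mark "above threshold" (downweight).
    return (win_frac < np.median(win_frac)).astype(int)
```

The resulting labels can be fed directly to the classifier-fitting step of Algorithm 1 in place of the threshold labels 1{Y_i > alpha}.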
We evaluate Algorithm 1 as a DFO algorithm across a few real-world experimental-design benchmarks, common synthetic toy optimization problems, and benchmarks that allow only pairwise function-value comparisons. The small-batch (n = 1-10) nature of hyperparameter optimization problems is outside the scope of our work, even though they are common DFO problems.

For constructing the classifier in Algorithm 1, we apply ensembled decision trees with a consensus decision defined as 75% of trees agreeing on the label (referred to as classify-rf). This particular classifier works in a black-box setting, and is highly effective across all problem domains with no tuning. We also empirically investigate the importance of well-specified hypotheses and consensus ensembling and show improved results for ensembles of linear classifiers and problem-specific classifiers, which we call classify-tuned.

In order to demonstrate that no special tuning is necessary, the same constants are used in the optimizer for all experiments, and the classifiers use off-the-shelf implementations from scikit-learn with no tuning. For sampling points according to the weighted distribution in Algorithm 1, we enumerate for discrete action spaces X, and for continuous X we perturb samples from the previous rounds using a Gaussian and use importance sampling to approximate the target distribution. Although exact sampling for the continuous case would be time-consuming, the Gaussian perturbation heuristic is fast and seems to work well enough for the functions tested here.

As a baseline, we compare to the following algorithms:

• Random sampling (random).
• Randomly sampling double the batch size (random-2x), which is a strong baseline recently shown to outperform many derivative-free optimizers [24].
• The evolutionary strategy (
CMA-ES) for continuous problems, due to its high performance in black-box optimization competitions as well as its inherent applicability to the large-batch setting [26].
• The Bayesian optimization algorithm provided by GPyOpt [4] (GP) for both continuous and discrete problems, using expected improvement as the acquisition function. We use the 'random' evaluator, which implements an epsilon-greedy batching strategy, since the large batch sizes (100-1000) make the use of more sophisticated evaluators completely intractable. The default RBF kernel was used in all experiments presented here. The 3/2- and 5/2-Matérn kernels and string kernels were tried where appropriate, but did not provide any performance improvements.

In terms of runtime, all computations for classify-rf take less than 1 second per iteration, compared to 0.1s for CMA-ES and 1.5 minutes for
GPyOpt. All experiments were replicated fifteen times to measure variability with respect to initialization. All new benchmark functions and reference implementations are made available at http://bit.ly/2FgiIxA.

The publicly available protein binding microarray (PBM) dataset consisting of 201 separate assays [6] allows us to accurately benchmark the optimization of protein binding over DNA sequences. In each assay, the binding affinity between a particular DNA-binding protein (transcription factor) and all 8-base DNA sequences is measured using a microarray.

(a) Binding to the CRX protein. (b) Binding to the VSX1 protein. (c) High-lift airfoil design.
Figure 2. Performance on two types of real-world batched zeroth-order optimization tasks. classify-rf consistently outperforms baselines and even random sampling with twice the batch size. Lines show the median function value over runs; shaded areas show quartiles.
This dataset defines 201 separate discrete optimization problems. For each protein, the objective function is the negative binding affinity (as measured by fluorescence), with a batch size of 100 (corresponding roughly to the size of a typical 96-well plate) across ten rounds. Each possible action corresponds to measuring the binding affinity of a particular 8-base DNA sequence exactly. The actions are featurized by the binary encoding of whether a base occurs in a position, resulting in a 32-dimensional space. This emulates the task of finding the DNA-binding sequence of a protein using purely low-throughput methods.

Figures 2a and 2b show the optimization traces of two randomly sampled examples, where the lines indicate the median achieved function value over 15 random initializations, and the shading indicates quartiles. classify-rf shows consistent improvements over all discrete-action-space baselines. For evaluation, we further sample 20 problems and find that the median binding affinity found across replicates is strictly better on 16 out of 20, and tied with the Gaussian process on 2. In this case, the high performance of random forests is relatively unsurprising, as random forests are known to be high-performance classifiers for DNA sequence recognition tasks [10, 21].
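The 32-dimensional featurization described above (one 0/1 indicator per position-base pair over 8 positions) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

BASES = "ACGT"

def featurize_8mer(seq):
    """Binary encoding of an 8-base DNA sequence: one indicator per
    (position, base) pair, giving an 8 x 4 = 32-dimensional vector."""
    assert len(seq) == 8 and all(b in BASES for b in seq)
    x = np.zeros(32, dtype=int)
    for i, base in enumerate(seq):
        x[4 * i + BASES.index(base)] = 1
    return x
```

Each of the 4^8 candidate sequences maps to one such vector, giving the discrete action space on which the classifier operates.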
Airfoil design, and other simulator-based objectives, are well-suited to the batched, classification-based optimization framework, as 30-40 simulations can be run in parallel on modern multicore computers. In the airfoil design case, the simulator is a 2-D aerodynamics simulator for airfoils [13]. The objective function is the negative of lift divided by drag (with a zero whenever the simulator throws an error), and the action space is the set of all common airfoils (NACA-series 4 airfoils). The airfoils are featurized by taking the coordinates around the perimeter of the airfoil as defined in the Selig airfoil format. This results in a highly correlated two-hundred-dimensional feature space. The batch size is 30 (corresponding to the number of cores in our machine) and T = 10 rounds of evaluations are performed.

We find in Figure 2c that the classify-rf algorithm converges to the optimal airfoil in only five rounds, and does so consistently, unlike the baselines. The Gaussian process beat the twice-random baseline, since the radial basis kernel is well-suited for this task (as lift is relatively smooth over the ℓ2 distance between airfoils), but it did not perform as well as the classify-rf algorithm.

Matching the classifier and objective function generally results in large improvements in optimization performance. We test two continuous optimization problems over [−1, 1]^d: optimizing a random linear function, and optimizing a random sum of quadratic and linear functions. For this high-dimensional task, we use a batch size of 1000. In both cases we compare continuous baselines with classify-rf and classify-tuned, which uses a linear classifier. We find that the use of the correct hypothesis class gives dramatic improvements over baseline in the linear case (Figure 3a) and continues to give substantial improvements even when a large quadratic term is added, making the hypothesis class misspecified (Figure 3b).
The classify-rf algorithm does not do as well as this custom classifier, but continues to do as well as the best baseline algorithm (CMA-ES).

We also find that using an ensembled classifier is important for optimization. Figure 3c shows an example run on the DNA binding task comparing the consensus of an ensemble of logistic regression classifiers against a single logistic regression classifier. Although both algorithms perform well in early iterations, the single logistic regression algorithm gets 'stuck' earlier and finds a suboptimal local minimum, due to an accumulation of errors. Ensembling consistently reduces such behavior.

(a) Random linear function. (b) Linear+quadratic function. (c) Ensembling classifiers improves optimization performance.

Figure 3. Testing the importance of ensembling and a well-specified hypothesis class on synthetic data, where the hypothesis for classify-tuned exactly matches the level sets (panel a) or matches the level sets with some error (panel b). Ensembling also consistently improves performance and reduces dependence on initialization (panel c).
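The consensus rule (declare a class only when a supermajority of ensemble members agree) can be sketched with bagged decision stumps standing in for the random forest; all names are illustrative:

```python
import numpy as np

def fit_consensus_stumps(X, z, n_members=30, consensus=0.75, seed=0):
    """Bagged one-split 'stumps' with a consensus vote: a minimal stand-in
    for the 75%-agreement random-forest rule used by classify-rf."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    stumps = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)               # bootstrap resample
        Xb, zb = X[idx], z[idx]
        best = (np.inf, 0, 0.0, 1)                     # (error, feature, cut, sign)
        for j in range(d):
            for cut in np.unique(Xb[:, j]):
                pred = (Xb[:, j] > cut).astype(int)
                for sign, p in ((1, pred), (0, 1 - pred)):
                    err = np.mean(p != zb)
                    if err < best[0]:
                        best = (err, j, cut, sign)
        stumps.append(best[1:])

    def predict(Xq):
        votes = np.mean([(Xq[:, j] > cut).astype(int) if sign else
                         (Xq[:, j] <= cut).astype(int)
                         for j, cut, sign in stumps], axis=0)
        # Emit 1 or 0 only with a consensus supermajority; -1 = no consensus.
        return np.where(votes >= consensus, 1,
                        np.where(votes <= 1 - consensus, 0, -1))
    return predict
```

Within Algorithm 1, the "no consensus" output would be treated as the zero class, so only points the ensemble confidently places above the threshold are downweighted.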
We additionally evaluate on two common synthetic benchmarks (Figures 4a and 4b). Although these tasks are not the focus of this work, we show that classify-rf is surprisingly good as a general black-box optimizer when batch sizes are large. We consider a batch size of 500 and ten steps, due to the moderate dimensionality and multi-modality relative to the number of steps. We find qualitatively similar results to before, with classify-rf outperforming other algorithms and CMA-ES as the best baseline.

(a) Shekel (4d). (b) Hartmann (6d).
Figure 4. classify-rf outperforms baselines onsynthetic benchmark functions with large batches
Finally, we demonstrate that we can optimize a function using only pairwise comparisons. In Figure 5 we show the optimization performance when using the ordering estimator from equation 1. For small numbers of comparisons per element (c = 5) we find a substantial loss of performance, but once we observe at least 10 pairwise comparisons per proposed action, we are able to reliably optimize as well as in the full function-value case. This suggests that classification-based optimization can handle pairwise feedback with little loss in efficiency.

Figure 5.
Optimization with pairwise comparisons between each action and a small set of (c) randomly selected actions. Between 10 and 20 pairwise comparisons per action gives sufficient information to fully optimize the function.

Our work demonstrates that the classification-based approach to derivative-free optimization is effective and principled, but it leaves open several theoretical and practical questions. In terms of theory, it is not clear whether a modified algorithm can make use of empirical risk minimizers instead of perfect selective classifiers. In practice, we have left open the question of tractably sampling from p^(t), as well as how to appropriately handle smaller-batch settings where d > n.

References

[1] N. Abe and M. Hiroshi. Query learning strategies using boosting and bagging. In
Proceedings of the Fifteenth International Conference on Machine Learning, volume 1, 1998.
[2] J. Abernethy, K. Amin, M. Draief, and M. Kearns. Large-scale bandit problems and KWIK learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[3] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[4] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.
[5] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, pages 288–316, 2013.
[6] L. A. Barrera, A. Vedenko, J. V. Kurland, J. M. Rogers, S. S. Gisselbrecht, E. J. Rossin, J. Woodard, L. Mariani, K. H. Kock, S. Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, 2016.
[7] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems 29, pages 1507–1515, 2016.
[8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[9] S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex optimization. In Proceedings of the Twenty Ninth Annual Conference on Computational Learning Theory, pages 583–589, 2016.
[10] X. Chen and H. Ishwaran. Random forests for genomic data analysis. Genomics, 99(6):323–329, 2012.
[11] A. Chevalier, D.-A. Silva, G. J. Rocklin, D. R. Hicks, R. Vergara, P. Murapa, S. M. Bernard, L. Zhang, K.-H. Lam, G. Yao, et al. Massively parallel de novo protein design for targeted therapeutics. Nature, 2017.
[12] A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Optimization, volume 8 of MPS-SIAM Series on Optimization. SIAM, 2009.
[13] M. Drela. XFOIL: An analysis and design system for low Reynolds number airfoils. In Low Reynolds Number Aerodynamics, pages 1–12. Springer, 1989.
[14] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[15] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.
[16] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 648–657, 2016.
[17] A. Gotovos, N. Casati, G. Hitz, and A. Krause. Active learning for level set estimation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1344–1350, 2013.
[18] S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2–3):131–309, 2014.
[19] C. Harwood and A. Wipat. Microbial Synthetic Biology, volume 40. Elsevier, 2013.
[20] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[21] C. G. Knight, M. Platt, W. Rowe, D. C. Wedge, F. Khan, P. J. Day, A. McShea, J. Knowles, and D. B. Kell. Array-based evolution of DNA aptamers allows modelling of an explicit sequence-fitness landscape. Nucleic Acids Research, 37(1):e6–e6, 2008.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[23] L. Li, M. Littman, and T. Walsh. Knows what it knows: A framework for self-aware learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[24] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of the Fourth International Conference on Learning Representations, 2016.
[25] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using winnow. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 147–156, 1991.
[26] I. Loshchilov. CMA-ES with restarts for solving CEC 2013 benchmark problems. In Evolutionary Computation (CEC), 2013, pages 369–376, 2013.
[27] A. L. Marsden, M. Wang, J. E. Dennis, and P. Moin. Optimal aeroacoustic shape design using the surrogate management framework. Optimization and Engineering, 5(2):235–262, 2004.
[28] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. arXiv:1312.3580 [math.PR], 2013.
[29] R. S. Michalski. Learnable evolution model: Evolutionary processes guided by machine learning. Machine Learning, 38(1):9–40, 2000.
[30] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
[31] A. S. Phelps, D. M. Naeger, J. L. Courtier, J. W. Lambert, P. A. Marcovici, J. E. Villanueva-Meyer, and J. D. MacKenzie. Pairwise comparison versus Likert scale for biomedical image assessment. American Journal of Roentgenology, 204(1):8–14, 2015.
[32] A. Rakhlin and K. Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv:1510.03925 [math.PR], 2015.
[33] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
[34] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
[35] V. Spokoiny. Parametric estimation. Finite sample theory. The Annals of Statistics, 40(6):2877–2909, 2012.
[36] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1–2):1–230, 2015.
[37] J. Van Hemmen and T. Ando. An inequality for trace ideals. Communications in Mathematical Physics, 76(2):143–148, 1980.
[38] J. Wang, Q. Gong, N. Maheshwari, M. Eisenstein, M. L. Arcila, K. S. Kosik, and H. T. Soh. Particle display: A quantitative screening method for generating high-affinity aptamers. Angewandte Chemie International Edition, 53(19):4796–4801, 2014.
[39] Y. Wiener and R. El-Yaniv. Agnostic selective classification. In Advances in Neural Information Processing Systems 24, pages 1665–1673, 2011.
[40] Y. Yu, H. Qian, and Y.-Q. Hu. Derivative-free optimization via classification. In Proceedings of the Thirty Second National Conference on Artificial Intelligence, pages 2286–2292, 2016.
A Supplementary materials
A.1 Cutting plane algorithms

Theorem 1. Let $\gamma > 0$ and assume that for all $t$, $\sum_{x \in \mathcal{X}} h^{(t)}(x) p^{(t-1)}(x) \ge \gamma$, where $p^{(t)}(x) \propto p^{(t-1)}(x)(1 - \eta h^{(t)}(x))$ as in Alg. 1. Let $\eta \in [0, 1/2]$ and $p^{(0)}$ be uniform. Then for all $x \in \mathcal{X}$,
$$\log p^{(T)}(x) \ge \frac{\gamma\eta}{\eta+2} T - \eta(\eta+1) M_T(x) - \log(2|\mathcal{X}|).$$

Proof. Since the sampling distribution is derived from multiplicative weights over $\sum_{t=1}^T h^{(t)}(x)$, the following regret bound holds with respect to any $p$ (Theorem 2.4, [3]):
$$\gamma T \le \sum_{t=1}^T \sum_{x\in\mathcal{X}} h^{(t)}(x) p^{(t-1)}(x) \le (1+\eta)\sum_{t=1}^T \sum_{x\in\mathcal{X}} h^{(t)}(x) p(x) + \frac{\mathrm{KL}(p\,\|\,p^{(0)})}{\eta}.$$
Pick $S = \{x : \sum_t h^{(t)}(x) < \nu T\}$ and $p$ uniform over $S$ to get
$$\gamma T \le (1+\eta)\nu T + \frac{\log(|\mathcal{X}|/|S|)}{\eta}.$$
From this we get the bound
$$\log(|S|) \le \log(|\mathcal{X}|) - \gamma\eta T + \nu\eta T + \nu\eta^2 T.$$
Now we can get the following basic bound on $\log(p^{(T)}(x^*))$ by decomposing the normalizer using the set $S$:
$$\log(p^{(T)}(x^*)) \ge \log(1-\eta)\, M_T(x^*) - \log\big(|S| + \exp(\log(1-\eta)\nu T)(|\mathcal{X}| - |S|)\big).$$
Note that $\exp(\log(1-\eta)\nu T) < \exp(-\eta\nu T)$, and that $-\eta\nu T = -\gamma\eta T + \nu\eta T + \nu\eta^2 T$ whenever $\nu = \gamma/(\eta+2)$. This gives the normalizer bound
$$\log\big(|S| + \exp(\log(1-\eta)\nu T)(|\mathcal{X}|-|S|)\big) < \log\big(2\exp(-\eta\gamma T/(\eta+2))\,|\mathcal{X}|\big).$$
Combining the above:
$$\log(p^{(T)}(x^*)) \ge \log(1-\eta)\, M_T(x^*) - \log\big(2\exp(-\eta\gamma T/(\eta+2))\,|\mathcal{X}|\big)$$
$$\ge \log(1-\eta)\, M_T(x^*) + \frac{\eta}{\eta+2}\gamma T - \log(2|\mathcal{X}|)$$
$$\ge -(\eta+\eta^2)\, M_T(x^*) + \frac{\eta}{\eta+2}\gamma T - \log(2|\mathcal{X}|)$$
$$= \gamma\frac{\eta}{\eta+2}\left(T - \frac{(\eta+1)(\eta+2)}{\gamma} M_T(x^*)\right) - \log(2|\mathcal{X}|),$$
where in the second step we use the normalizer bound, and in the third we use the identity $-\log(1-x) \le x + x^2$ for $x \in [0, 1/2]$.

Corollary 1.
Let $x \in \mathcal{X}$, where $q_T(x) := \frac{M_T(x)}{\gamma T} \le \frac{1}{4}$. Under the conditions of Theorem 1,
$$\log(p^{(T)}(x)) \ge \min\left(\frac{1}{10},\ \frac{2}{3}\left(\frac{1}{4} - q_T(x)\right)\right)\gamma T - \log(2|\mathcal{X}|)$$
and $\frac{-\log(2|\mathcal{X}|)}{2\gamma T} \le q_T(x)$.

Proof. First we balance the linear terms in Theorem 1 by solving for $\eta$ in
$$\frac{(\eta+1)(\eta+2)}{\gamma} M_T(x) = \frac{T}{2}.$$
This gives the multiplicative weight
$$\eta = \min\left\{\frac{1}{2},\ \sqrt{\frac{1}{4} + \frac{1}{2\,q_T(x)}} - \frac{3}{2}\right\} > 0.$$
The inequality follows from $q_T(x) \le 1/4$. Substituting into Theorem 1,
$$\log(p^{(T)}(x)) \ge \min\left(\frac{1}{10},\ \frac{1}{2} + q_T(x) - \sqrt{2q_T(x) + q_T(x)^2}\right)\gamma T - \log(2|\mathcal{X}|).$$
Using the linear lower bound on $\frac{1}{2} + q_T(x) - \sqrt{2q_T(x) + q_T(x)^2}$ at $q_T(x) = 1/4$ gives the stated bound on $\log(p^{(T)}(x))$ in terms of $q_T(x)$.

Theorem 2.
Let $h$ be the CSS classifier in Definition 1, and let $h^* \in \mathcal{H}$ be a classifier achieving zero risk. If $P(g(X) \neq h^*(X)) < \epsilon$ for all $g \in \mathrm{VS}_{\mathcal{H},S_m}$, then CSS achieves coverage
$$P(h(X) = \text{no decision}) \le \Delta_{h^*}\epsilon.$$

Proof.
The inequality follows from two facts. The first fact is that $P(h(X) = \text{no decision})$ is upper bounded by the probability of sampling an $X$ in the disagreement ball of radius $\epsilon$. This follows from the definition of the CSS classifier, which outputs no decision if and only if there exists a classifier in $\mathrm{VS}_{\mathcal{H},S_m}$ which does not agree with the others, which is the definition of the disagreement region.

The second fact is that if $\sup_{g \in \mathrm{VS}_{\mathcal{H},S_m}} P(g(x) \neq h^*(x)) < \epsilon$, then $\mathrm{VS}_{\mathcal{H},S_m} \subseteq B_{\mathcal{H},P}(h^*, \epsilon)$, by construction of the version space. Applying the definition of the disagreement coefficient completes the proof.

Theorem 3.
Let $\mathcal{H}$ be a hypothesis class containing indicator functions for the sublevel sets of $f$, with VC-dimension $V$ and disagreement coefficient $\Delta_h$. There exists a numerical constant $C < \infty$ such that for all $\delta \in [0,1]$, $\epsilon \in [0,1]$, $\gamma \in (\Delta_h\epsilon, 1/2)$, and
$$n \ge \max\left\{ C\epsilon^{-1}\left[V\log(\epsilon^{-1}) + \log(\delta^{-1}) + \log(2T)\right],\ (\gamma-0.5)^{-2}\left(\log(\delta^{-1}) + \log(2T)\right) \right\},$$
with probability at least $1-\delta$,
$$\log(p^{(T)}(x^*)) \ge \min\left\{ (\gamma - \Delta_h\epsilon)\frac{\eta}{\eta+2} T - \log(2|\mathcal{X}|),\ \log(1-\gamma) \right\}$$
after $T$ rounds of Algorithm 1.

Proof. The proof has three parts. First, we prove that the median of $n$ samples $Y_1, \ldots, Y_n$ is at least the $\gamma$ quantile over $p^{(t)}$. The probability that the $n$-sample empirical median is less than the $\gamma$ quantile over $p^{(t)}$ is equivalent to having $X \sim \mathrm{Binom}(\gamma, n)$ and $P(X > n/2)$, and by Hoeffding's inequality
$$P(X > n/2) \le \exp\left(-2(\gamma-0.5)^2 n\right).$$
Next, we show that the CSS prediction abstains on at most a $\Delta_h\epsilon$ fraction of the distribution. VC dimension bounds [20] imply that we can achieve $\epsilon$ error uniformly over a hypothesis class $\mathcal{H}$ with VC dimension $V$ in
$$n = C\epsilon^{-1}\left[V\log(\epsilon^{-1}) + \log(\delta'^{-1})\right]$$
samples with probability at least $1-\delta'$ for some constant $C$. We can then apply the CSS classifier bound to get that the abstention rate is $\Delta_h\epsilon$. This implies
$$\sum_{x\in\mathcal{X}} h^{(t)}(x) p^{(t-1)}(x) \ge \gamma - \Delta_h\epsilon. \quad (2)$$
Finally, whenever $p^{(t)}(x^*) < 1-\gamma$, we have $h(x^*) = 1\{f(x^*) \le \alpha^{(t)}\} = 0$, and thus $M_t(x^*) = 0$. The $\log(p^{(T)})$ inequality follows by applying Theorem 1 with Equation (2), noting that either $M_T(x^*) = 0$ or $p^{(T)}(x^*) > 1-\gamma$. Union bounding the above two probabilities, and ensuring each part has failure probability $\delta/2$,
$$n = \max\left( C\epsilon^{-1}\left[V\log(\epsilon^{-1}) + \log(\delta^{-1}) + \log(2T)\right],\ (\gamma-0.5)^{-2}\left(\log(\delta^{-1}) + \log(2T)\right) \right)$$
implies $\delta$ failure probability.

A.2 Convergence rates for optimizing linear functions

The first lemma generalizes Theorem 1 to the continuous case.
Lemma A.1. Consider a compact $\mathcal{X} \subset \mathbb{R}^d$ and $\gamma > 0$, and let $(h^{(t)} : \mathcal{X} \to [0,1],\ p^{(t-1)})_{t\in\mathbb{N}}$ be a sequence such that for every $t \in \mathbb{N}$,
$$\int_x h^{(t)}(x) p^{(t-1)}(x)\,dx \ge \gamma, \qquad p^{(t)}(x) \propto p^{(t-1)}(x)(1 - \eta h^{(t)}(x))$$
for some $\eta \in [0, 1/2]$, and $p^{(0)}$ the uniform distribution over $\mathcal{X}$. Further, let $M_T(x) = \sum_{t=1}^T h^{(t)}(x)$ be the number of times $h^{(t)}$ downweights item $x$. Then the following bound on the density at any item $x$ holds at the last step:
$$\log(p^{(T)}(x)) \ge \gamma\frac{\eta}{\eta+2}\left(T - \frac{(\eta+1)(\eta+2)}{\gamma} M_T(x)\right) - \log(2\,\mathrm{vol}(\mathcal{X})).$$

Proof.
Since the sampling distribution is derived from multiplicative weights over $\sum_{t=1}^T h^{(t)}(x)$, the following regret bound holds with respect to any $p$ (Theorem 2.4, [3]):
$$\gamma T \le \sum_{t=1}^T \int_{x\in\mathcal{X}} h^{(t)}(x) p^{(t-1)}(x)\,dx \le (1+\eta)\sum_{t=1}^T \int_{x\in\mathcal{X}} h^{(t)}(x) p(x)\,dx + \frac{\mathrm{KL}(p\,\|\,p^{(0)})}{\eta}.$$
Pick $S = \{x \in \mathcal{X} : \sum_t h^{(t)}(x) < \nu T\}$ and $p$ uniform over $S$ to get
$$\gamma T \le (1+\eta)\nu T + \frac{\log(\mathrm{vol}(\mathcal{X})/\mathrm{vol}(S))}{\eta}.$$
From this we get the bound
$$\log(\mathrm{vol}(S)) \le \log(\mathrm{vol}(\mathcal{X})) - \gamma\eta T + \nu\eta T + \nu\eta^2 T.$$
Now we can get the following basic bound on $\log(p^{(T)}(x^*))$ by decomposing the normalizer using $S$:
$$\log(p^{(T)}(x^*)) \ge \log(1-\eta)\, M_T(x^*) - \log\big(\mathrm{vol}(S) + \exp(\log(1-\eta)\nu T)(\mathrm{vol}(\mathcal{X}) - \mathrm{vol}(S))\big).$$
The rest of the proof is identical to that of Theorem 1, and we obtain
$$\log(p^{(T)}(x^*)) \ge \gamma\frac{\eta}{\eta+2}\left(T - \frac{(\eta+1)(\eta+2)}{\gamma} M_T(x^*)\right) - \log(2\,\mathrm{vol}(\mathcal{X})).$$

We now show that given a well-behaved starting distribution, the distribution induced by the multiplicative weights algorithm is close to a uniform distribution over the sublevel set.
Lemma A.2. Define $h^{*(t)}(x) = 1\{f(x) > \alpha^{(t)}\}$ and
$$q^{(t)}(x) = \frac{1 - h^{*(t)}(x)}{\int_{x\in\mathcal{X}} \big(1 - h^{*(t)}(x)\big)\,dx}.$$
Define $p^{(t)}$ and $h^{(t)}$ such that $p^{(t-1)}(x)$ is constant over $\{x : h^{*(t)}(x) = 0\}$,
$$\int_{x\in\mathcal{X}} h^{(t)}(x) p^{(t-1)}(x)\,dx = \gamma, \qquad p^{(t)}(x) \propto p^{(t-1)}(x)(1 - \eta h^{(t)}(x)),$$
and further $\int_{x\in\mathcal{X}} 1\{h^{*(t)}(x)=0,\ h^{(t)}(x)=1\}\,dx = 0$ and $\int_{x\in\mathcal{X}} 1\{h^{*(t)}(x)=1,\ h^{(t)}(x)=0\}\, p^{(t-1)}(x)\,dx \le \nu$. Then
$$\frac{1-\eta(\gamma-\nu)}{1-\gamma-\nu}\, p^{(t)}(x) \ge q^{(t)}(x)$$
for all $x \in \mathcal{X}$.

Proof. We verify the inequality on $S^* = \{x : h^{*(t)}(x) = 0\}$, since $q^{(t)}(x)$ is zero outside $S^*$. For any $x \in S^*$, $q^{(t)}(x) = 1/\mathrm{vol}(S^*)$ and $p^{(t-1)}(x) \ge (1-\gamma-\nu)/\mathrm{vol}(S^*)$. This implies for $x \in S^*$,
$$p^{(t)}(x) \ge \frac{(1-\gamma-\nu)/\mathrm{vol}(S^*)}{\int_{x\in\mathcal{X}} p^{(t-1)}(x)(1-\eta h^{(t)}(x))\,dx}.$$
We now upper bound the normalizer:
$$\int_{x\in\mathcal{X}} p^{(t-1)}(x)(1-\eta h^{(t)}(x))\,dx \le 1 - \eta\int_{x\in\mathcal{X}} p^{(t-1)}(x) h^{(t)}(x)\,dx \le 1 - \eta(\gamma-\nu).$$
This gives the bound that for all $x \in S^*$,
$$\frac{1-\eta(\gamma-\nu)}{1-\gamma-\nu}\, p^{(t)}(x) \ge q^{(t)}(x).$$

The next theorem gives the disagreement coefficient for linear classification in a log-concave density.
Theorem A.1 (Theorem 14, [5]). Let $D$ be a log-concave density in $\mathbb{R}^d$ and $\mathcal{H}$ the set of linear classifiers in $\mathbb{R}^d$; then the region of disagreement over error in Definition 2 satisfies
$$\frac{P_{X\sim D}\big(X \in \mathrm{Dis}(B(h,r))\big)}{r} = O\big(d^{1/2}\log(1/r)\big).$$

Proof. The statement here is identical to [5] with the exception of the identity covariance constraint on $D$. We omit this due to the equivalence of the isotropic and non-isotropic cases as noted in Appendix C, Theorem 6.

Finally, we combine the above results to obtain convergence rates for the linear optimization with linear classifier case.

Theorem A.2.
Let $f(x) = w^\top x$ for some $w$, and let $\mathcal{X}$ be a convex subset of $\mathbb{R}^d$. Define $p^{(0)}$ as uniform on $\mathcal{X}$. Define the classification label $Z_i = 1\{Y_i > \mathrm{median}(Y)\}$, on which we train a CSS classifier over the linear hypothesis class $\mathcal{H}$, which we define as
$$h^{(t)}(x) = \begin{cases} 1 & \text{if } \forall g \in \mathrm{VS}_{\mathcal{H},(X,Z)},\ g(x) = 1\\ 0 & \text{if } \forall g \in \mathrm{VS}_{\mathcal{H},(X,Z)},\ g(x) = 0\\ 0 & \text{otherwise.}\end{cases}$$
Define the sampling distributions for each step as $p^{(t)}(x) \propto p^{(t-1)}(x)(1 - h^{(t)}(x)/2)$. Then for any small $\nu$, a batch size of
$$n^{(t)} = O\left(\frac{d^{3/2}}{\nu}\left[\log(\nu^{-1}d) + \log(3T/\delta)\right]\right)$$
is sufficient to establish the following bound on sampling items below any $\alpha$ level set $S_\alpha = \{x \in \mathcal{X} : f(x) < \alpha\}$ at the last step:
$$\log\left(\int_{x\in S_\alpha} p^{(T)}(x)\right) \ge \min\left(\left(\frac{1}{20}-\nu\right)T - \log(2\,\mathrm{vol}(\mathcal{X})) + \log(\mathrm{vol}(S_\alpha)),\ -\log(4)\right)$$
with probability $1-\delta$.

Proof. We will begin by using induction to show that at each round $t$ the abstention rate is at most $\nu$.

In the base case of $t = 0$, $p^{(0)}(x)$ is log-concave and we can apply Theorems A.1 and 2, which imply $\nu = C_1 d^{1/2}\log(1/\epsilon)\,\epsilon$. VC dimension bounds imply there exists some $n = C_2\epsilon^{-1}(d\log(1/\epsilon) + \log(1/\delta))$ such that any consistent hypothesis incurs at most $\epsilon$ population error. Inverting and solving for $n$ shows that for some $C$,
$$n \ge C\nu^{-1} d^{1/2}\left[d\log(\nu^{-1}d^{1/2}) + \log(\delta^{-1})\right]$$
is sufficient to guarantee $\nu$-abstention.

For each round $t > 0$, assume $\nu$-abstention in all prior rounds. We then fulfill the conditions of Lemma A.2. To verify each condition: $p^{(t-1)}(x)$ is constant on sublevel sets below $\alpha^{(t)}$ since the selective classifier never makes false positives, by construction; $\eta = 1/2$; $\gamma^{(t)}$ is the population quantile of $p^{(t-1)}$ corresponding to $\mathrm{median}(Y^{(t-1)})$; and $\nu$ is the abstention rate at round $t-1$. If $x \sim p^{(t)}$ has $f(x) < \mathrm{median}(Y^{(t-1)}) = \alpha^{(t-1)}$, then this sample follows the uniform distribution over the sublevel set $S_{\alpha^{(t-1)}}$, and the probability of this event occurring is at least $\frac{1-\gamma^{(t)}-\nu}{1-(\gamma^{(t)}-\nu)/2}$.

Thus, if we sample $n'$ samples from $p^{(t)}$, then with probability at least $1-\delta_2$, $n$ of these draws will be from a uniform distribution in the $\alpha^{(t)}$ sublevel set, where $\delta_2$ satisfies
$$\delta_2 = P\left(\sum_{i=1}^{n'} 1\{X_i^{(t)} \in S_{\alpha^{(t-1)}}\} \le n\right) \le \exp\left(-2\left(\frac{1-\gamma^{(t)}-\nu}{1-(\gamma^{(t)}-\nu)/2} - n/n'\right)^2 n'\right).$$
Solving for $n'$, with the shorthand $\tau = \frac{1-\gamma^{(t)}-\nu}{1-(\gamma^{(t)}-\nu)/2}$ and $\delta_2 < 1$,
$$n' \ge \frac{\sqrt{\log(\delta_2)\big(\log(\delta_2) - 4\tau n\big)} + 4\tau n - \log(\delta_2)}{4\tau^2} \ge \frac{\tau n - \log(\delta_2)}{2\tau^2}.$$
Since the $\alpha^{(t)}$ sublevel set is convex, we can apply Theorem A.1 to these $n$ samples contained in the sublevel set to show that for all $t > 0$,
$$n \ge C\nu^{-1} d^{1/2}\left[d\log(\nu^{-1}d^{1/2}) + \log(\delta^{-1})\right]$$
samples are sufficient to ensure $\nu$-abstention when we train a selective classifier on $n$ points. Learning a selective classifier over all $n'$ points can only decrease the abstention region, and therefore $n^{(t)}$ is sufficient to guarantee $\nu$-abstention. Combining with the bound on $n'$, with probability $1-\delta_1-\delta_2$,
$$n^{(t)} \ge \frac{\tau\, C\nu^{-1} d^{1/2}\left[d\log(\nu^{-1}d^{1/2}) + \log(\delta^{-1})\right] - \log(\delta_2)}{2\tau^2}.$$
Next, we bound $\gamma$ from above and below using the same argument as Theorem 3. At any round $t$, defining $\gamma^{(t)} = \int_{x\in\mathcal{X}} p^{(t-1)}(x)\,1\{f(x) < \mathrm{median}(Y^{(t-1)})\}$, by Hoeffding's inequality the probability that $\gamma^{(t)}$ is within $[1/4, 3/4]$ is
$$P(1/4 \le \gamma^{(t)} \le 3/4) \ge 1 - 2\exp(-n^{(t)}/8).$$
Thus, we can ensure $\gamma^{(t)} \in [1/4, 3/4]$ with probability at least $1-\delta_3$ in round $t$ if we have $n^{(t)} \ge 8\log(2/\delta_3)$.

Simplifying the bounds, we can ensure $\nu$-abstention across all $T$ rounds with probability at least $1-\delta$ given the worst-case value $\tau = \frac{1/4-\nu}{5/8+\nu/2}$ and
$$n^{(t)} \ge C\,\frac{5/8+\nu/2}{1/4-\nu}\,\nu^{-1} d^{1/2}\left[d\log(\nu^{-1}d^{1/2}) + \log(3T/\delta)\right] + 8\log(3T/\delta).$$
Collecting first-order terms for $\nu$ small we have
$$n^{(t)} = O\left(\frac{d^{3/2}}{\nu}\left[\log(\nu^{-1}d) + \log(3T/\delta)\right]\right).$$
Finally, we can apply Lemma A.1 to get a lower bound for all points $x$ with $f(x) < \alpha^{(T)}$:
$$\log(p^{(T)}(x)) \ge \left(\frac{1}{20}-\nu\right)T - \log(2\,\mathrm{vol}(\mathcal{X})).$$
The complete bound with respect to the $\alpha$ sublevel set follows after checking two cases. If $\alpha^{(T)} \ge \alpha$, then the entire set $\{x : f(x) < \alpha\}$ obeys the lower bound, and we integrate to obtain the first part of the bound. If $\alpha^{(T)} < \alpha$, then by construction of $\alpha^{(T)}$, at least $1-\gamma$ of the distribution must lie below $\alpha^{(T)}$, and thus we draw an element with function value less than $\alpha^{(T)} < \alpha$ with log-probability at least $\log(1-\gamma) \ge -\log(4)$.

A.3 Approximating selective sampling with Gaussians
In this section, we show that sampling from a particular Gaussian approximates selective classification for linear classification with strongly convex losses. First, we show that the maximum over sampled Gaussian parameter vectors is close to the infimum over a hyperellipse.
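Before the formal statements, the construction analyzed in this section can be sketched directly: draw $B$ parameter vectors from $N(\hat\theta, \Sigma)$ and classify a point only when all samples agree, abstaining otherwise (the sampled selective classifier $h^{\mathrm{quad}}$ of Theorem A.3). The values of $\hat\theta$ and $\Sigma$ below are hypothetical placeholders; in the algorithm they come from the quadratic/bootstrap approximation developed in Section A.4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted linear model: mean parameter and covariance.
theta_hat = np.array([1.0, -0.5])
Sigma = 0.01 * np.eye(2)

B = 100
thetas = rng.multivariate_normal(theta_hat, Sigma, size=B)  # θ_1..θ_B ~ N(θ̂, Σ)

def h_quad(x):
    """Sampled selective classifier: unanimous vote or abstain (None)."""
    s = thetas @ x
    if np.all(s > 0):
        return 1
    if np.all(s <= 0):
        return 0
    return None  # no decision: x lies in the sampled disagreement region

print(h_quad(np.array([10.0, 1.0])),   # far from the decision boundary
      h_quad(np.array([0.5, 1.0])))    # near the boundary: tends to abstain
```

Points with a large margin relative to the parameter uncertainty receive a unanimous label, while points near the decision boundary fall in the disagreement region and are abstained on, mirroring the CSS behavior the theorems below quantify.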
Lemma A.3.
Consider $\theta^{\mathrm{quad}}_1, \ldots, \theta^{\mathrm{quad}}_B \sim N(\hat\theta, \Sigma)$; then for any $x$ and $Q_\tau = \{\theta : (\theta-\hat\theta)^\top\Sigma^{-1}(\theta-\hat\theta) \le \tau\}$,
$$P\left(\min_i x^\top\theta^{\mathrm{quad}}_i - \inf_{\theta\in Q_\tau} x^\top\theta > \sqrt{x^\top\Sigma x}\,\epsilon\right) \le \big(1 - \Phi(\epsilon - \sqrt{\tau})\big)^B,$$
where $\Phi$ is the cumulative distribution function of the standard Gaussian.

Proof. We can whiten the space with $\Sigma^{-1/2}$ to get the following equivalent statement. Define $\bar\theta^{\mathrm{quad}}_i = \Sigma^{-1/2}(\theta^{\mathrm{quad}}_i - \hat\theta)$, so that $\bar\theta^{\mathrm{quad}}_1, \ldots, \bar\theta^{\mathrm{quad}}_B \sim N(0, I)$, and let $\bar x = \Sigma^{1/2}x$. Then
$$P\left(\min_i x^\top\theta^{\mathrm{quad}}_i - \inf_{\theta\in Q_\tau} x^\top\theta > \sqrt{x^\top\Sigma x}\,\epsilon\right) = P\left(\min_i \bar x^\top\bar\theta^{\mathrm{quad}}_i - \inf_{\|\bar\theta\|\le\sqrt{\tau}} \bar x^\top\bar\theta > \|\bar x\|\epsilon\right)$$
$$= P\left(\min_i \frac{\bar x^\top}{\|\bar x\|}\bar\theta^{\mathrm{quad}}_i > \epsilon - \sqrt{\tau}\right) = P\left(\frac{\bar x^\top}{\|\bar x\|}\bar\theta^{\mathrm{quad}}_0 > \epsilon - \sqrt{\tau}\right)^B = \big(1-\Phi(\epsilon-\sqrt{\tau})\big)^B.$$

For $\epsilon = 0$ and $\tau = 1$, if $B \ge \log(\delta)/\log(0.95)$ then the minimum of the $B$ samples is smaller than the infimum over the $\Sigma^{-1}$ ellipse with probability at least $1-\delta$. Next, we show that the samples from $N(\hat\theta,\Sigma)$ in $d$ dimensions are contained in a ball a constant factor larger than $d$ in the $\Sigma^{-1}$ metric with high probability.

Lemma A.4.
Let $\theta^{\mathrm{quad}}_1, \ldots, \theta^{\mathrm{quad}}_B \sim N(\hat\theta,\Sigma)$; then
$$P\left(\max_i\, (\theta^{\mathrm{quad}}_i - \hat\theta)^\top\Sigma^{-1}(\theta^{\mathrm{quad}}_i - \hat\theta) \ge cd\right) \le B\big(c\exp(1-c)\big)^{d/2}$$
for $c > 1$.

Proof. Since $(\theta^{\mathrm{quad}}_i - \hat\theta)^\top\Sigma^{-1}(\theta^{\mathrm{quad}}_i - \hat\theta)$ is whitened, we can define the probability in terms of chi-squared variables $\xi_1, \ldots, \xi_B \sim \chi^2(d)$:
$$P\left(\max_i\, (\theta^{\mathrm{quad}}_i - \hat\theta)^\top\Sigma^{-1}(\theta^{\mathrm{quad}}_i - \hat\theta) \ge cd\right) = P\left(\max_i \xi_i \ge cd\right) = 1 - P(\xi_1 < cd)^B$$
$$\le 1 - \left(1 - \big(c\exp(1-c)\big)^{d/2}\right)^B \le B\big(c\exp(1-c)\big)^{d/2}.$$
Thus for $c = 1 + O\big(d^{-1}(\log(B) - \log(\delta))\big)$, we can ensure that all $B$ samples are contained in the $cd\,\Sigma^{-1}$ ball with probability at least $1-\delta$.

Combining the two results, we can show that for a quadratic version space, we can perform selective classification by sampling from a Gaussian.

Theorem A.3.
Define the selective classifier with respect to a set $Q$ (a subset of the parameter space $\Theta$) as
$$h_Q(x) = \begin{cases} 1 & \text{if } \forall\theta\in Q,\ \theta^\top x > 0\\ 0 & \text{if } \forall\theta\in Q,\ \theta^\top x \le 0\\ \emptyset & \text{otherwise.}\end{cases}$$
Let $Q_\tau = \{\theta : (\theta-\hat\theta)^\top\Sigma^{-1}(\theta-\hat\theta) < \tau\}$ and $\theta^{\mathrm{quad}}_1, \ldots, \theta^{\mathrm{quad}}_B \sim N(\hat\theta,\Sigma)$. Then the sampled selective classifier
$$h^{\mathrm{quad}}(x) = \begin{cases} 1 & \text{if } \forall i,\ x^\top\theta^{\mathrm{quad}}_i > 0\\ 0 & \text{if } \forall i,\ x^\top\theta^{\mathrm{quad}}_i \le 0\\ \emptyset & \text{otherwise}\end{cases}$$
has the following properties. For any $x \in \mathcal{X}$,
$$P\big(h_{Q_\tau}(x) = 0 \text{ and } h^{\mathrm{quad}}(x) = 1\big) \le \big(1-\Phi(-\sqrt{\tau})\big)^B,$$
and for any $c > 1$,
$$P\left(\sum_x 1\{h_{Q_{cd}}(x) = \emptyset\} < \sum_x 1\{h^{\mathrm{quad}}(x) = \emptyset\}\right) \le B\big(c\exp(1-c)\big)^{d/2}.$$

Proof.
The first statement follows directly from Lemma A.3, since having $h_{Q_\tau}(x) = 0$ and $h^{\mathrm{quad}}(x) = 1$ is a subset of the event that is upper bounded in Lemma A.3.

The second statement follows from Lemma A.4. Under the conditions of the theorem,
$$P\left(\max_i\, (\theta^{\mathrm{quad}}_i - \hat\theta)^\top\Sigma^{-1}(\theta^{\mathrm{quad}}_i - \hat\theta) \ge cd\right) \le B\big(c\exp(1-c)\big)^{d/2}.$$
If $\max_i (\theta^{\mathrm{quad}}_i - \hat\theta)^\top\Sigma^{-1}(\theta^{\mathrm{quad}}_i - \hat\theta) < cd$, then for all $i$, $\theta^{\mathrm{quad}}_i \in Q_{cd}$. By construction, if $h^{\mathrm{quad}}(x) = \emptyset$ then $h_{Q_{cd}}(x) = \emptyset$, since at least one pair of $\theta^{\mathrm{quad}}_i$ must disagree, and this pair must also be in $Q_{cd}$.

Now we show that a quadratic version space contains the Bayes optimal parameter, and is contained in a version space with low error. We first state a result on the strong convexity of the empirical loss, which follows immediately from Theorem 5.1.1 of [36].

Lemma A.5.
Let $\{P_\theta\}_{\theta\in\Theta}$ be a parametric family over a compact parameter space $\Theta$. Define $Z \mid X \sim P_{\theta^*}$ for some $\theta^* \in \mathrm{int}\,\Theta$. Let $\ell_\theta(x,z) = -\log(P_\theta(z\mid x))$ be the negative log-likelihood of $z$, and
$$L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_\theta(X_i, Z_i).$$
If $\lambda_{\min}\big(E[\nabla^2\ell_\theta(X,Z)]\big) \ge \gamma$ and $\lambda_{\max}\big(\nabla^2\ell_\theta(X,Z)\big) \le L$ w.p. 1, then
$$\lambda_{\min}\big(\nabla^2 L_n(\theta)\big) \ge \left(1 - \sqrt{\frac{2L\log(d/\delta)}{n\gamma}}\right)\gamma$$
with probability $1-\delta$.

We now define the three possible version spaces that we can consider: the version space of low-error hypotheses $VS$, the version space of low-loss models $\mathcal{L}$, and the quadratic approximation $Q$.

Definition A.1. Define the maximum likelihood estimator $\theta_n = \arg\min_{\theta\in\Theta} L_n(\theta)$. The quadratic version space with radius $\tau$ around $L_n$ is
$$Q_\tau = \left\{\theta : \nabla L_n(\theta_n)^\top(\theta-\theta_n) + \frac{1}{2}(\theta-\theta_n)^\top\nabla^2 L_n(\theta_n)(\theta-\theta_n) \le \tau\right\}.$$
The version space of low-loss parameters $\theta$ with radius $\tau$ is
$$\mathcal{L}_\tau = \left\{\theta : L_n(\theta) - L_n(\theta_n) \le \tau\right\}.$$
The version space of a set of hypotheses $h$ with error less than $\tau$ is
$$VS_\tau = \left\{\theta : P\big(h_\theta(X) \neq Z\big) - P\big(h_{\theta_n}(X) \neq Z\big) \le \tau\right\}.$$

Lemma A.6.
In the same setting as Lemma A.5, assume $\ell_\theta(x,z)$ majorizes the zero-one loss for some hypothesis class $\mathcal{H} = \{h_\theta : \theta\in\Theta\}$ as $\ell_\theta(x,z) \ge 1\{h_\theta(x) \neq z\}$, that $\|\nabla\ell_\theta(X,Z)\| \le R$ almost surely, and that $\nabla^2 L_n(\theta)$ is $M$-Lipschitz. Then, for the version spaces given in Definition A.1,
$$VS_{\tau+2\zeta} \supseteq \mathcal{L}_{\tau+2\zeta} \supseteq Q_{\tau+\zeta} \supseteq \mathcal{L}_\tau, \quad (3)$$
where $\zeta = \frac{M}{6}\left(\frac{2\tau}{\gamma}\right)^{3/2}$. Furthermore, if $\tau = 36\nu^2 C_f\,\frac{d + \log(1/\delta)}{n}$, then $\theta^* \in Q_\tau$ with probability $1-\delta$, where the constants $C_f$ and $\nu$ are defined in [35] as
$$C_f \le \sup_\theta \frac{D_{kl}(\theta, \theta^*)}{\|I^{-1/2}(\theta-\theta^*)\|^2}, \qquad \nu \ge R\,\lambda_{\max}(I_{\theta^*}).$$

Proof.
The norm $\|\theta-\theta_n\|$ bounds the remainder term of the second-order Taylor expansion of $L_n$ from above,
$$L_n(\theta) - L_n(\theta_n) \le \nabla L_n(\theta_n)^\top(\theta-\theta_n) + \frac{1}{2}(\theta-\theta_n)^\top\nabla^2 L_n(\theta_n)(\theta-\theta_n) + \|R(\theta,\theta_n)\|_{op}\|\theta-\theta_n\|^2,$$
and below,
$$L_n(\theta) - L_n(\theta_n) \ge \nabla L_n(\theta_n)^\top(\theta-\theta_n) + \frac{1}{2}(\theta-\theta_n)^\top\nabla^2 L_n(\theta_n)(\theta-\theta_n) - \|R(\theta,\theta_n)\|_{op}\|\theta-\theta_n\|^2,$$
with $\|R(\theta,\theta_n)\|_{op} \le \frac{M}{6}\|\theta-\theta_n\|$.

First, we prove $\mathcal{L}_\tau \subseteq Q_{\tau+\zeta}$. Let $\theta \in \mathcal{L}_\tau$. Strong convexity of $L_n$ at $\theta_n$ implies
$$\|\theta-\theta_n\|^2 \le \frac{2}{\gamma}\big(L_n(\theta) - L_n(\theta_n)\big) \le \frac{2}{\gamma}\tau.$$
Using this in the Taylor expansion above implies
$$\nabla L_n(\theta_n)^\top(\theta-\theta_n) + \frac{1}{2}(\theta-\theta_n)^\top\nabla^2 L_n(\theta_n)(\theta-\theta_n) \le L_n(\theta) - L_n(\theta_n) + \|R(\theta,\theta_n)\|_{op}\|\theta-\theta_n\|^2 \le \tau + \frac{M}{6}\left(\frac{2\tau}{\gamma}\right)^{3/2} = \tau + \zeta.$$
The argument to show $Q_\tau \subseteq \mathcal{L}_{\tau+\zeta}$ is nearly identical: strong convexity of $L_n$ implies strong convexity of its quadratic expansion with the same parameter. Let $\theta \in Q_\tau$. The above Taylor approximation shows
$$L_n(\theta) - L_n(\theta_n) \le \nabla L_n(\theta_n)^\top(\theta-\theta_n) + \frac{1}{2}(\theta-\theta_n)^\top\nabla^2 L_n(\theta_n)(\theta-\theta_n) + \|R(\theta,\theta_n)\|_{op}\|\theta-\theta_n\|^2 \le \tau + \zeta.$$
That $\ell_\theta(x,z)$ majorizes the 0–1 loss almost immediately implies $\mathcal{L}_\tau \subseteq VS_\tau$. Indeed, $L_n(\theta) \le \tau$ implies $\frac{1}{n}\sum 1\{h_\theta(X_i) \neq Z_i\} \le \tau$, and so $h_\theta$ is in $VS_\tau$. That $Q_\tau$ contains $\theta^*$ with probability $1-\delta$ follows from Theorem 5.2 of [35] with the stated constants.

A.4 Approximating selective sampling with the bootstrap
We begin by showing the minimizer of the quadratic approximation is close to the $\theta^{\mathrm{quad}}_u$ we defined. Define
$$L_n(\theta, u) := \frac{1}{n}\sum_{i=1}^n L_i(\theta) + \frac{1}{n}\sum_{i=1}^n u_i L_i(\theta),$$
which is convex for all $u \ge -1$. The standard negative log-likelihood is then $L_n(\theta, 0)$, and for simplicity in notation (and to reflect the dependence on $u$) we let $\theta_n = \mathrm{argmin}_\theta L_n(\theta, 0)$ and $\theta_u = \mathrm{argmin}_\theta L_n(\theta, u)$. Recall that we assume that $\theta \mapsto \nabla^2 L_i(\theta)$ is $M$-Lipschitz (in operator norm) and that $L_n(\theta, 0)$ is $\gamma$-strongly convex. We collect a few identification and complexity results.

Lemma A.7 (Rakhlin and Sridharan [32]). Let $u_i$ be independent, mean-zero, $\sigma$-sub-Gaussian random variables. Then for any sequence of vectors $v_i$ with $\|v_i\| \le R$ for each $i$,
$$P\left(\left\|\sum_{i=1}^n u_i v_i\right\| \ge cR\sigma\sqrt{n}(1+t)\right) \le \exp(-t^2),$$
where $c < \infty$ is a numerical (universal) constant.

We then have the following guarantee.
Lemma A.8.
Let $\|\nabla L_i(\theta)\| \le R$ for all $i$ and let $u_i$ be independent mean-zero $\sigma$-sub-Gaussian random variables. Then there exist numerical constants $C_1, C_2, C_3 < \infty$ such that for all $t \in \mathbb{R}$ and $n \ge \frac{d}{\gamma^2}C_2(C_3 - \log(\delta))$, we have
$$\|\theta_u - \theta_n\| \le C_1\cdot\frac{R\sigma}{\sqrt{n}}$$
with probability at least $1-\delta$.

Proof. The proof is a more or less standard exercise in localization and concentration [8]. We begin by noting that $|u_i|$ is $O(1)\sigma$-sub-Gaussian, so that $\frac{1}{n}\sum_{i=1}^n|u_i| \le E[|u_i|] + t$ with probability at least $1-\exp(-cnt^2/\sigma^2)$, where $c > 0$; in particular, $\frac{1}{n}\sum_{i=1}^n|u_i| \le 2\sigma$, which occurs with probability at least $1 - e^{-cn}$. Non-asymptotic lower bounds on the eigenvalues of random matrices [28, Thm. 1.3] imply that for any $t \ge 0$,
$$\nabla^2 L_n(\theta_n; u) \succeq \left(\gamma - \sqrt{\frac{d}{n}}\cdot t\right) I_d \quad (4)$$
with probability at least $1-\exp(-ct + C\log\log t)$ for constants $c > 0$, $C < \infty$. Then using the $M$-Lipschitz continuity of $\nabla^2 L$, we obtain
$$\|\nabla^2 L_n(\theta; u) - \nabla^2 L_n(\theta_n; u)\|_{op} \le \frac{M}{n}\sum_{i=1}^n|u_i|\,\|\theta-\theta_n\| \le 2M\sigma\|\theta-\theta_n\|.$$
With this identification, for all $\theta$ satisfying $\|\theta-\theta_n\| \le \frac{\epsilon}{2M\sigma}$, we have
$$L_n(\theta, u) \ge L_n(\theta_n, u) + \langle\nabla L_n(\theta_n, u), \theta-\theta_n\rangle + \frac{\gamma - \epsilon - t\sqrt{d/n}}{2}\|\theta-\theta_n\|^2$$
with probability at least $1-\exp(-ct + C\log\log t)$. Now, using Lemma A.7, we obtain that $\|\nabla L_n(\theta_n; u)\| \le \frac{cR\sigma}{\sqrt{n}}(1+t)$ with probability at least $1-e^{-t^2}$, so that we have
$$L_n(\theta, u) \ge L_n(\theta_n, u) - \frac{CR\sigma}{\sqrt{n}}(1+t)\|\theta_n-\theta\| + \frac{\gamma - \epsilon - t\sqrt{d/n}}{2}\|\theta_n-\theta\|^2$$
with probability at least $1 - e^{-ct + C\log\log t}$, where $0 < c, C < \infty$ are numerical constants. Now fix $\epsilon$.
Solving the above quadratic in $\|\theta_n - \theta\|$, we have that
$$\|\theta - \theta_n\| > \frac{2CR\sigma(1+t)}{\sqrt{n}(\gamma-\epsilon) - t\sqrt{d}}$$
implies that $L_n(\theta, u) > L_n(\theta_n, u)$. If we assume for simplicity $\epsilon = \gamma/2$, then we have that
$$\|\theta - \theta_n\| > \frac{4CR\sigma(1+t)}{\sqrt{n}\gamma - 2t\sqrt{d}}$$
implies that $L_n(\theta, u) > L_n(\theta_n, u)$, or (by convexity) that any minimizer $\theta_u$ of $L_n(\theta, u)$ must satisfy
$$\|\theta_u - \theta_n\| \le \frac{4CR\sigma(1+t)}{\sqrt{n}\gamma - 2t\sqrt{d}}.$$
The resulting bound states that there exist numerical constants $c > 0$, $C < \infty$ such that for all $t \in \mathbb{R}$, we have
$$\|\theta_u - \theta_n\| \le C_1\frac{R\sigma}{\sqrt{n}\big(\gamma - t\sqrt{d/n}\big)}$$
with probability at least $1 - Ce^{-ct+\log\log(t)}$. Upper bounding $\log(\log(t)) \le \log(t)$ and solving for $t$, for any $t \ge C_2\sqrt{C_3 - \log(\delta)}$ there exist $C_2, C_3 < \infty$ such that $1 - Ce^{-ct+\log(\log(t))} > 1-\delta$. Finally, if $n \ge \frac{d}{\gamma^2}C_2(C_3 - \log(\delta))$, then we have the upper bound
$$\|\theta_u - \theta_n\| \le C_1\frac{R\sigma}{\sqrt{n}}$$
with probability at least $1-\delta$.

Define the minimizer for a quadratic approximation to the multiplier bootstrap loss as
$$\theta^{\mathrm{quad}}_u := \theta_n - \nabla^2 L_n(\theta_n)^{-1}\frac{1}{n}\sum_{i=1}^n u_i\nabla L_i(\theta_n).$$
We now show that for sub-Gaussian $u$, $\theta^{\mathrm{quad}}_u \approx \theta_u$.

Lemma A.9.
Let $\|\nabla L_i(\theta)\| \le R$ and $\|\nabla^2 L_i(\theta_n)\|_{\rm op} \le S$ for all $i$, and let $u_i$ be independent mean-zero $\sigma$-sub-Gaussian random variables. Then there exist numerical constants $C, C_1, C_2, C_3 < \infty$ such that for all $\delta > 0$ and $n \ge d\gamma^{-2} C_1(C_2 - \log(\delta/2))$,
\[
\big\|\theta_u - \theta^{\rm quad}_u\big\| \le \frac{C_3 R^2 \sigma^2}{n}\Big(M + \frac{S\sqrt{\log(4d/\delta)}}{R}\Big)
\]
with probability at least $1 - \delta$.

Proof. By definition, we have
\[
\theta^{\rm quad}_u := \theta_n - \nabla^2 L_n(\theta_n)^{-1}\, \frac{1}{n}\sum_{i=1}^n u_i \nabla L_i(\theta_n).
\]
Now, consider $\theta_u$. By Taylor's theorem and the $M$-Lipschitz continuity of $\nabla^2 L_i$, there are matrices $E_i : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ with $\|E_i(\theta)\|_{\rm op} \le M\|\theta - \theta_n\|$ such that for all $\theta$ near $\theta_n$,
\[
\sum_{i=1}^n (1 + u_i)\nabla L_i(\theta) = n \nabla^2 L_n(\theta_n)(\theta - \theta_n) + \sum_{i=1}^n u_i \nabla L_i(\theta_n) + \sum_{i=1}^n E_i(\theta)(\theta - \theta_n) + \sum_{i=1}^n u_i\big[\nabla^2 L_i(\theta_n) + E_i(\theta)\big](\theta - \theta_n).
\]
That is, defining the random matrix
\[
E := \frac{1}{n}\sum_{i=1}^n \big[(1 + u_i)\, E_i(\theta_u) + u_i \nabla^2 L_i(\theta_n)\big]
\]
and noting that $0 = \frac{1}{n}\sum_{i=1}^n (1 + u_i)\nabla L_i(\theta_u)$, we have
\[
-\frac{1}{n}\sum_{i=1}^n u_i \nabla L_i(\theta_n) = \big(\nabla^2 L_n(\theta_n) + E\big)(\theta_u - \theta_n).
\]
Using standard matrix concentration inequalities [36] yields
\[
\mathbb{P}\Bigg(\bigg\|\frac{1}{n}\sum_{i=1}^n u_i \nabla^2 L_i(\theta_n)\bigg\|_{\rm op} \ge t\Bigg) \le 2d \exp\Big(-\frac{cnt^2}{\sigma^2 S^2}\Big)
\]
for a numerical constant $c > 0$. Using that $\|E_i(\theta)\|_{\rm op} \le M\|\theta - \theta_n\|$, we obtain that
\[
\|E\|_{\rm op} \le M\|\theta_u - \theta_n\| + \frac{S\sigma\sqrt{\log(2d/\delta)}}{\sqrt{n}}
\]
with probability at least $1 - \delta$. Using Lemma A.8, we have that for all $n \ge d\gamma^{-2} C_1(C_2 - \log(\delta/2))$,
\[
\theta_u = \theta_n - \big(\nabla^2 L_n(\theta_n) + E\big)^{-1}\, \frac{1}{n}\sum_{i=1}^n u_i \nabla L_i(\theta_n)
\]
with probability at least $1 - \delta$, where
\[
\|E\|_{\rm op} \le \frac{CMR\sigma}{\gamma\sqrt{n}} + \frac{S\sigma\sqrt{\log(4d/\delta)}}{\sqrt{n}}.
\]
Now we use the fact that if $\|EA^{-1}\|_{\rm op} < 1$, then $(A + E)^{-1} = A^{-1}\sum_{i=0}^\infty (-1)^i (EA^{-1})^i$. Since our bounds on $\|E\|_{\rm op}$ guarantee $\|E\, \nabla^2 L_n(\theta_n)^{-1}\|_{\rm op} < 1$ for the prescribed $n$, we have
\[
\big(\nabla^2 L_n(\theta_n) + E\big)^{-1} = \nabla^2 L_n(\theta_n)^{-1} + \tilde{E},
\]
where the matrix $\tilde{E}$ satisfies $\|\tilde{E}\|_{\rm op} \le 2\gamma^{-2}\|E\|_{\rm op}$. Applying Lemma A.7 gives the result.

We now collect the constant terms of the bootstrap error into a function $\psi$ to simplify notation in the remaining sections.

Definition A.2 (Bootstrap error constants). Under the conditions of Lemma A.9, define
\[
\psi(n, d, \delta) = \frac{C_3 R^2}{n}\Big(M + \frac{S\sqrt{\log(4d/\delta)}}{R}\Big).
\]
Now note that if $u_i \sim U(-1, 1)$, then $\mathbb{E}[\theta^{\rm quad}_u] = \theta_n$ and
\[
\mathrm{Cov}[\theta^{\rm quad}_u] = \frac{\sigma^2}{n}\, \nabla^2 L_n(\theta_n)^{-1}\bigg(\frac{1}{n}\sum_{i=1}^n \nabla L_i(\theta_n)\nabla L_i(\theta_n)^\top\bigg)\nabla^2 L_n(\theta_n)^{-1},
\]
which is close to the distribution used in the Gaussian sampling section above. We now show that for $\theta^\circ_u := \rho(\theta_u - \theta_n) + \theta_n$, the gap $\|\theta^\circ_u - \theta^{\rm quad}_u\|$ is small enough for the quadratic approximation to hold for the bootstrap samples. Lemma A.10.
Under the conditions of Lemmas A.8 and A.9, define $\theta^\circ_{u_1}, \ldots, \theta^\circ_{u_B}$, where $\theta^\circ_{u_i} := \sigma(\theta_{u_i} - \theta_n) + \theta_n$ and $u_{ij} \sim U(-1, 1)$. For any $x$ and
\[
Q_{\tau/n} = \big\{\theta : (\theta - \theta_n)^\top \nabla^2 L_n(\theta_n)(\theta - \theta_n) \le \tau^2/n\big\},
\]
we have
\[
\min_i\, x^\top \theta^\circ_{u_i} - \inf_{\theta \in Q_{\tau/n}} x^\top \theta \le 0
\]
with probability at least $1 - \delta$ whenever
\[
B \ge \frac{\log(\delta/2)}{\log(1 - \Phi(1)/2)}, \qquad n \ge \max\Big(R^2 S^2 \log(3d/\delta),\; \frac{C S^2 R^2}{\gamma\,\Phi(1)^{2}}\Big),
\]
\[
\sigma \ge \frac{\sqrt{\tau}\,\sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\sqrt{\log(2d/\delta)}}{1 - \psi(n, d, \delta)\, S\gamma^{-1/2}},
\]
where $\psi$ is defined in Definition A.2.

Proof. The proof proceeds in two parts: in the first, we bound the mismatch between the covariance matrix of $\theta^{\rm quad}_u$ and $\nabla^2 L_n(\theta_n)^{-1}$. In the second part, we bound the gap between $\theta^\circ$ and $\theta^{\rm quad}$. Define the covariance matrix
\[
\Sigma = \frac{1}{n}\,\nabla^2 L_n(\theta_n)^{-1}\bigg(\frac{1}{n}\sum_{i=1}^n \nabla L_i(\theta_n)\nabla L_i(\theta_n)^\top\bigg)\nabla^2 L_n(\theta_n)^{-1},
\]
the whitened samples $\bar{x} = \Sigma^{1/2} x$, and the whitened parameters $\bar{\theta}^\circ_u = \Sigma^{-1/2}(\theta^\circ_u - \theta_n)$. By whitening and applying the operator norm bound, we obtain the upper bound
\[
\min_i\, x^\top \theta^\circ_{u_i} - \inf_{\theta \in Q_{\tau/n}} x^\top \theta = \min_i\, x^\top \theta^\circ_{u_i} - \inf_{\|\theta\| \le \tau} x^\top \Sigma^{1/2}\,\Sigma^{-1/2}\nabla^2 L_n(\theta_n)^{-1/2}\theta/\sqrt{n} \le \|\bar{x}\| \bigg(\min_i \frac{\bar{x}^\top}{\|\bar{x}\|}\,\bar{\theta}^\circ_{u_i} + \sqrt{\tau}\, \Big\|I - \Sigma^{-1/2}\nabla^2 L_n(\theta_n)^{-1/2}/\sqrt{n}\Big\|_{\rm op}\bigg).
\]
The second term is the error from the quadratic approximation, and we can bound it via the Ando-van Hemmen inequality (Proposition 3.2, [37]), which states that
\[
\big\|A^{1/2} - B^{1/2}\big\|_{\rm op} \le \Big(\big\|A^{-1/2}\big\|_{\rm op}^{-1} + \big\|B^{-1/2}\big\|_{\rm op}^{-1}\Big)^{-1} \|A - B\|_{\rm op}.
\]
This gives
\[
\Big\|I - \Sigma^{-1/2}\nabla^2 L_n(\theta_n)^{-1/2}/\sqrt{n}\Big\|_{\rm op} \le \big\|\Sigma^{-1/2}/\sqrt{n}\big\|_{\rm op}\, \frac{\big\|n\Sigma - \nabla^2 L_n(\theta_n)^{-1}\big\|_{\rm op}}{\big\|\sqrt{n}\,\Sigma^{1/2}\big\|_{\rm op}^{-1} + \big\|\nabla^2 L_n(\theta_n)^{1/2}\big\|_{\rm op}^{-1}} \le \sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\bigg\|\frac{1}{n}\sum_{i=1}^n \nabla L_i(\theta_n)\nabla L_i(\theta_n)^\top - \nabla^2 L_n(\theta_n)\bigg\|_{\rm op}.
\]
Finally, applying the matrix Bernstein inequality (Thm 6.1.1, [36]),
\[
\Big\|I - \Sigma^{-1/2}\nabla^2 L_n(\theta_n)^{-1/2}/\sqrt{n}\Big\|_{\rm op} \le \sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\sqrt{\log(d/\delta)}
\]
with probability at least $1 - \delta$, when $n \ge R^2 S^2 \log(d/\delta)$. For the other parts of the bound, let $\bar{\theta}^{\rm quad}_{u_i} = \Sigma^{-1/2}(\theta^{\rm quad}_{u_i} - \theta_n)$, which is a $d$-dimensional Gaussian with variance $\sigma^2$. By Lemma A.9,
\[
\min_i\, x^\top \theta^\circ_{u_i} - \inf_{\theta \in Q_{\tau/n}} x^\top \theta \le \|\bar{x}\|\bigg(\min_i \frac{\bar{x}^\top}{\|\bar{x}\|}\,\bar{\theta}^{\rm quad}_{u_i} + \psi(n, d, \delta/3)\, S\gamma^{-1/2}\sigma^2 + \sqrt{\tau}\,\sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\sqrt{\log(3d/\delta)}\bigg)
\]
with probability $1 - \delta/3$. Define
\[
\sigma = \frac{\sqrt{\tau}\,\sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\sqrt{\log(2d/\delta)}}{1 - \psi(n, d, \delta)\, S\gamma^{-1/2}}.
\]
Since $\bar{x}^\top \bar{\theta}^{\rm quad}$ is a bounded i.i.d. sum of $n$ gradient terms, by the Berry-Esseen theorem there exists a universal constant $C$ such that
\[
\mathbb{P}\bigg(\min_i\, x^\top \theta^\circ_{u_i} - \inf_{\theta \in Q_{\tau/n}} x^\top \theta > 0\bigg) \le \bigg(1 - \Phi(1) + \frac{C S^2 R}{\gamma\sqrt{n}}\bigg)^B.
\]
If $B \ge \log(\delta/2)/\log(1 - \Phi(1)/2)$ and $n \ge C S^2 R^2 \gamma^{-1}\Phi(1)^{-2}$, we have that
\[
\min_i\, x^\top \theta^\circ_{u_i} - \inf_{\theta \in Q_{\tau/n}} x^\top \theta \le 0
\]
with probability at least $1 - \delta$. Thus, to satisfy the same bounds as the bootstrap, we must expand the ellipse under consideration by a small multiplicative factor.

We first define the bootstrap-based selective classifier. Definition A.3.
Let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family over a compact parameter space $\Theta$, and let $Z \mid X \sim P_{\theta^*}$ for some $\theta^* \in \mathrm{int}\,\Theta$. Let $\ell_\theta(x, z) = -\log P_\theta(z \mid x)$ be the negative log likelihood of $z$, and assume that $\|\nabla \ell_\theta(X, Z)\| \le R$ and $\|\nabla^2 \ell_\theta(X, Z)\|_{\rm op} \le S$ almost surely, and that this majorizes the 0-1 loss of a linear classifier, $\ell_\theta(x, z) \ge 1\{(2z - 1)\, x^\top\theta < 0\}$. Define the weighted sample negative log likelihood
\[
L_n(\theta, u) := \frac{1}{n}\sum_{i=1}^n (1 + u_i)\, \ell_\theta(X_i, Z_i),
\]
and assume $L_n(\theta, 0)$ to be $\gamma$-strongly convex and $\nabla^2 L_n(\theta, 0)$ to be $M$-Lipschitz. Given a scaling constant $\sigma > 0$, define the $B$ bootstrapped estimators $\theta^\circ_{u_i} = \sigma(\theta_{u_i} - \theta_n) + \theta_n$ with $u_{i1}, \ldots, u_{in} \sim U(-1, 1)$. The bootstrap selective classifier is defined as
\[
h^\circ_u(x) := \begin{cases} 1 & \text{if } x^\top \theta^\circ_{u_i} > 0 \text{ for all } i, \\ 0 & \text{if } x^\top \theta^\circ_{u_i} \le 0 \text{ for all } i, \\ \emptyset & \text{otherwise.} \end{cases}
\]
Combining the above results yields the following bootstrap-based selective classification bound.
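As a concrete illustration before the formal bound, the consensus rule of Definition A.3 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: `fit_weighted_nll` is a hypothetical routine that minimizes the weighted negative log likelihood $L_n(\theta, u)$, and all other names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplier_bootstrap_thetas(X, Z, fit_weighted_nll, theta_n, B, sigma):
    """Draw B multiplier-bootstrap estimators theta_u with uniform weights
    u_i ~ U(-1, 1), then rescale each around theta_n by the factor sigma,
    as in Definition A.3. fit_weighted_nll(X, Z, w) is assumed to minimize
    (1/n) * sum_i w_i * loss_theta(X_i, Z_i)."""
    n = len(X)
    thetas = []
    for _ in range(B):
        u = rng.uniform(-1.0, 1.0, size=n)          # mean-zero sub-Gaussian weights
        theta_u = fit_weighted_nll(X, Z, 1.0 + u)   # minimize L_n(theta, u)
        thetas.append(sigma * (theta_u - theta_n) + theta_n)
    return np.stack(thetas)

def h_bootstrap(x, thetas):
    """Consensus rule: 1 if every bootstrap classifier scores x positive,
    0 if every one scores it nonpositive, None (abstain, i.e. the empty-set
    label) on any disagreement."""
    scores = thetas @ x
    if np.all(scores > 0):
        return 1
    if np.all(scores <= 0):
        return 0
    return None
```

Abstention (the `None` return) is exactly the event that the $B$ rescaled estimators disagree about the sign of $x^\top\theta$, which the analysis charges to the disagreement coefficient $\Delta_h$.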
Theorem A.4.
The bootstrap selective classifier in Definition A.3 fulfils
\[
\int_{x \in \mathbb{R}^d} 1\{h^\circ_u(x) = 0 \text{ and } x^\top\theta^* \ge 0\}\, p(x)\, dx \le \epsilon, \qquad \int_{x \in \mathbb{R}^d} 1\{h^\circ_u(x) = \emptyset\}\, p(x)\, dx \le \epsilon \Delta_h,
\]
and, for any $x$, $\mathbb{P}(x^\top\theta^* \le 0 \text{ and } h^\circ_u(x) = 1) < \delta$, all with probability at least $1 - \delta$, whenever
\[
B \ge \log(\delta/2)/\log(\Phi(1)),
\]
\[
\sigma = \frac{\sqrt{\nu\, C_f\,\big(2.5\, d + \log(1/\delta)\big)}\,\sqrt{2}\,\gamma^{-1/2}\big(SR + \sqrt{S}\big)\sqrt{\log(2d/\delta)}}{1 - \psi(n, d, \delta/3)\, S\gamma^{-1/2}} = O\big(d^{1/2} + \log(1/\delta)^{1/2}\big),
\]
\[
\epsilon \ge \frac{2\sigma^2\nu}{n} + M\bigg(\frac{\sigma^2\nu}{n\gamma}\bigg)^{3/2} = O(\sigma^2/n), \qquad \nu = \frac{C R^2 S}{n},
\]
and $n \ge \log(2d/\delta)\, S^2/\gamma^2$. The constants $C_f$ and $\nu$ are as defined in Lemma A.6.

Proof. Consider the first inequality, on the classification of $x$. The parameter $\sigma$ is set such that, according to Lemma A.10, we achieve the infimum over $Q_\tau$ with probability $1 - \delta/3$ under the stated $n$ and $B$ for any $\tau \le \nu C_f \frac{2.5\, d + \log(1/\delta)}{n}$. By Lemma A.6, $\theta^* \in Q_\tau$ for $\tau = \nu C_f \frac{2.5\, d + \log(1/\delta)}{n}$, and thus achieving the infimum over $Q_\tau$ is equivalent to consistency with $x^\top\theta^*$.

For the second inequality, on abstention, we can begin by applying Lemma A.8 to the bootstrapped samples $\theta^\circ_{u_i}$. Lemma A.8 directly gives the result that, with probability at least $1 - \delta/3$, $\theta^\circ_{u_1}, \ldots, \theta^\circ_{u_B} \in Q_{\tau'}$, where $\tau' = \sigma^2\nu$. Next, we can apply Lemma A.6 to show that $Q_{\tau'}$ is strictly contained in the version space of (linear) hypotheses which incur at most $\epsilon = 2\tau'/n + M(\tau'/(n\gamma))^{3/2}$ error. Combining with the $\tau'$ above implies the error rate, and the definition of the disagreement coefficient implies the abstention bound.

For the misclassification result, we note that since all hypotheses are contained within a version space with $\epsilon$ error, the consensus classifier $h$ can make at most $\epsilon$ error. Interpreting these results in the context of our optimization algorithm yields the following theorem. Theorem A.5.
Let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family over a compact parameter space $\Theta$, let $\mathcal{H} = \{1\{\theta^\top x \ge 0\} : \theta \in \Theta\}$, and assume that $\mathcal{H}$ contains the sublevel sets of $f$. Given $T$ rounds of Algorithm 1 with $h$ defined as in Theorem A.4, we have that for $x^* = \arg\min_x f(x)$,
\[
\log\big(p^{(T)}(x^*)\big) \ge \min\bigg(\frac{\big(\xi - (\Delta_h + 1)\epsilon\big)\,\eta}{\eta + 2}\, T - \log(2|\mathcal{X}|),\; \log(1 - \gamma)\bigg).
\]
This holds with probability $1 - \delta$ as long as the conditions in Theorem A.4 are satisfied with probability $1 - \delta/T$, and
\[
n \ge \max\big\{S\gamma^{-2}\big(\log(2d/\delta) + \log(2T)\big),\; \xi^{-2}\big(\log(1/\delta) + \log(2T)\big)\big\}.
\]
Proof.
The proof follows almost identically to Theorem 3, given that the error bounds in Theorem A.4 hold. However, in Theorem A.4 a small number of misclassification mistakes can be made on elements besides $x^*$. This could reduce the fraction of the feasible space removed at each step, so we provide the guarantee here for completeness.

Without making any mistakes or abstentions, $h^{(t)}$ would remove a $\gamma$ fraction of the current feasible space. As before, abstention might increase this by $\Delta_h \epsilon$. Beyond this, misclassifications could prevent us from removing an additional $\epsilon$ fraction of the feasible space, giving the bound
\[
\sum_{x \in \mathcal{X}} h^{(t)}(x)\, p^{(t-1)}(x) \ge \gamma - (1 + \Delta_h)\epsilon. \tag{5}
\]
Beyond this change, the proof proceeds as in Theorem 3, with the choice of $n$ required in Theorem A.4 instead of the usual requirements for exact CSS.

A.5 Non-realizable agnostic selective classification
We begin by re-introducing the notation and conditions of agnostic selective classification, as defined by El-Yaniv [39].
Definition A.4.
Let the loss class $\mathcal{F}$ be defined as
\[
\mathcal{F} = \{\ell(h(x), z) - \ell(h^*(x), z) : h \in \mathcal{H}\}.
\]
This $\mathcal{F}$ is $(\beta, B)$-Bernstein with respect to $P$ if, for all $f \in \mathcal{F}$, some $0 < \beta \le 1$, and $B \ge 1$,
\[
\mathbb{E}[f^2] \le B\,(\mathbb{E}[f])^\beta.
\]
Define the empirical loss bound
\[
\sigma(n, \delta, d) = 2\sqrt{\frac{d\log(2ne/d) + \log(2/\delta)}{n}}.
\]
This is a bound on the deviation in loss for a classifier with VC dimension $d$ learnt from $n$ samples, holding with probability at least $1 - \delta$. Now define the empirical loss minimizer
\[
\hat{h} = \arg\min_{h \in \mathcal{H}} \bigg\{\sum_{i=1}^n \ell(h(x_i), z_i)\bigg\}
\]
and the excess loss incurred by disagreeing with the ERM on a particular point:
\[
\Delta(x) = \min_{h \in \mathcal{H}}\big\{\mathbb{E}[\ell(h(x), z)] \;\big|\; h(x) = -\mathrm{sign}(\hat{h}(x))\big\} - \mathbb{E}[\ell(\hat{h}(x), z)].
\]
Define the agnostic selective classifier:
\[
h_{n,\delta,d}(x) = \begin{cases} 1 & \text{if } \Delta(x) \ge \sigma(n, \delta, d) \text{ and } \hat{h}(x) = 1, \\ 0 & \text{if } \Delta(x) \ge \sigma(n, \delta, d) \text{ and } \hat{h}(x) = 0, \\ \emptyset & \text{otherwise.} \end{cases} \tag{6}
\]
That is, the classifier abstains whenever some hypothesis with excess loss within the deviation bound $\sigma(n, \delta, d)$ disagrees with the ERM at $x$. The main theorem of agnostic selective classification is the following: Theorem A.6.
Assume $\mathcal{H}$ has VC dimension $V$ and disagreement coefficient $\Delta_h$, and that $\mathcal{F}$ is $(\beta, B)$-Bernstein with respect to $P$. With probability at least $1 - \delta$,
\[
\mathbb{P}\big(h_{n,\delta,V}(x) = \emptyset\big) \le B\,\Delta_h\big(4\sigma(n, \delta/4, V)\big)^\beta,
\]
with the performance bound
\[
\mathbb{E}\big[\ell(h_{n,\delta,V}(x), z) - \ell(h^*(x), z) \;\big|\; h_{n,\delta,V}(x) \ne \emptyset\big] = 0,
\]
where $h^*(x) = \arg\min_{h \in \mathcal{H}} \mathbb{E}[\ell(h(x), z)]$.

A.6 Characterizing performance from misclassification rates for classifiers
Currently we have the requirement that an oracle return classifiers which control $M_T(x^*)$ for the true optimum $x^*$. However, we are often interested in finding one of the $K$ best elements in $\mathcal{X}$. This slight relaxation allows us to obtain bounds that rely on controlling 0-1 losses in $\mathcal{X}$. Combining our earlier selective-classification-based optimization bound with Theorem A.6 gives the following rate: Theorem A.7.
Let $\mathcal{H}$ be a hypothesis class with VC dimension $V$ and disagreement coefficient $\Delta_h$, and let $\mathcal{F}$ be a $(\beta, B)$-Bernstein class such that, for each $p^{(t)}$, the population loss minimizer correctly classifies $x^* := \arg\min_{x \in \mathcal{X}} f(x)$. There exists a numerical constant $C < \infty$ such that for all $\delta \in (0, 1)$, $\gamma \in \big(B\Delta_h(4\sigma(n, \delta/4, V))^\beta,\, 1\big)$, and
\[
n \ge C\gamma^{-2}\big(\log(1/\delta) + \log(2T)\big),
\]
with probability at least $1 - \delta$,
\[
\log\big(p^{(T)}(x^*)\big) \ge \min\bigg\{\frac{\big(\gamma - B\Delta_h(4\sigma(n, \delta/4, V))^\beta\big)\eta}{\eta + 2}\, T - \log(2|\mathcal{X}|),\; \log(1 - \gamma)\bigg\}
\]
after $T$ rounds of Algorithm 1, where the classifier is replaced by the agnostic selective classifier (6).

Proof. The proof follows directly from the proof of Theorem 3, with the added observation that the agnostic selective classifier (6) must always agree with the population loss minimizer, or else abstain, since $\sigma(n, \delta, V)$ is a bound on the excess loss of the empirical minimizer which holds with probability at least $1 - \delta$.
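To make the mechanism of the agnostic selective classifier (6) concrete, here is a minimal sketch over a finite pool of linear hypotheses. This is our illustration, not the paper's code: empirical losses stand in crudely for the population losses in $\Delta(x)$, and the names `sigma_bound`, `selective_predict`, and the hypothesis pool are ours.

```python
import numpy as np

def sigma_bound(n, delta, d):
    """The empirical loss deviation bound sigma(n, delta, d) of Definition A.4."""
    return 2.0 * np.sqrt((d * np.log(2.0 * n * np.e / d) + np.log(2.0 / delta)) / n)

def selective_predict(x, X, Z, hypotheses, n, delta, d):
    """Sketch of classifier (6): predict with the ERM unless some hypothesis
    whose excess loss is below sigma(n, delta, d) disagrees at x, in which
    case abstain (return None for the empty-set label). Labels Z are in
    {-1, +1}; outputs follow the {1, 0, abstain} convention of (6)."""
    losses = [(np.sign(X @ h) != Z).mean() for h in hypotheses]
    erm = hypotheses[int(np.argmin(losses))]
    erm_loss = min(losses)
    pred = 1 if x @ erm >= 0 else 0
    # Delta(x): smallest excess (empirical) loss among hypotheses that
    # disagree with the ERM's prediction at x; infinity if none disagree.
    disagree = [l for h, l in zip(hypotheses, losses)
                if (x @ h >= 0) != (x @ erm >= 0)]
    delta_x = (min(disagree) - erm_loss) if disagree else np.inf
    if delta_x < sigma_bound(n, delta, d):
        return None  # a near-optimal hypothesis disagrees at x: abstain
    return pred
```

In Algorithm 1, such a classifier would replace the bootstrap selective classifier: abstentions are charged to the disagreement coefficient $\Delta_h$, while non-abstaining predictions agree with the population loss minimizer as in the proof of Theorem A.7.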