Stabilizing Bi-Level Hyperparameter Optimization using Moreau-Yosida Regularization
7th ICML Workshop on Automated Machine Learning (2020)
Sauptik Dhar [email protected]
America Research Lab, LG Electronics
Unmesh Kurup [email protected]
America Research Lab, LG Electronics
Mohak Shah [email protected]
America Research Lab, LG Electronics
Abstract
This research proposes to use the Moreau-Yosida envelope to stabilize the convergence behavior of bi-level hyperparameter optimization solvers, and introduces a new algorithm called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO). Theoretical analysis on the correctness of the MY-HPO solution and an initial convergence analysis are also provided. Our empirical results show significant improvement in loss values for a fixed computation budget, compared to state-of-the-art bi-level HPO solvers.
1. Introduction
Successful application of any machine learning (ML) algorithm heavily depends on the careful tuning of its hyper-parameters. However, hyper-parameter optimization (HPO) is a non-trivial problem and has been a topic of research for several decades. Most existing works broadly adopt one of the following two approaches: a) black-box optimization or b) direct optimization. The black-box optimization approach is agnostic of the underlying ML model and adopts an advanced search algorithm to select the hyper-parameters that minimize the validation error. Popular examples include random search, grid search, Bayesian-optimization based approaches (Bergstra et al., 2013; Snoek et al., 2012), and other advanced exploration techniques like (Li et al., 2017). The advantage is that it is applicable to any machine learning problem. However, it does not utilize the structure of the underlying ML model, which can be exploited for faster solutions. Direct optimization utilizes the underlying ML algorithm's structure by parameterizing the validation error as a function of the hyper-parameters and directly optimizing it. Typical examples include bi-level hyper-parameter optimization (Lorraine and Duvenaud, 2018; MacKay et al., 2019; Pedregosa, 2016; Franceschi et al., 2017; Maclaurin et al., 2015; Franceschi et al., 2018; Mehra and Hamm, 2019) and bound-based analytic model selection (Chapelle et al., 2002; Dhar et al., 2019; Cortes et al., 2017). Such approaches can provide significant computational improvements for HPO. However, the limitations of these approaches include: a) the gradients of the validation error w.r.t. the hyperparameters must exist, and b) instability of the optimization problem under ill-conditioned settings. In this paper we address this unstable convergence behavior of bi-level HPO approaches for ill-conditioned problems, and propose a new framework to improve it. The main contributions of this work are,
1. We propose a new algorithm called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) to stabilize bi-level HPO for ill-conditioned problems.
2. We provide theoretical analysis on the correctness of the proposed MY-HPO algorithm and also provide some initial convergence analysis.
3. Finally, we provide extensive empirical results in support of our proposed approach.
The paper is structured as follows. Section 2 introduces the bi-level formulation for HPO and discusses the existing issues of state-of-the-art bi-level HPO solvers (Lorraine and Duvenaud, 2018; MacKay et al., 2019) under ill-conditioned settings. Section 3 introduces the Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm as a solution to stabilize bi-level HPO in such settings. Additional theoretical analysis of its solution's correctness and convergence behavior is also provided. Empirical results and discussions are provided in Section 4.
2. Bi-Level Hyperparameter Optimization (HPO)
The standard bi-level formulation for hyperparameter optimization (HPO) is given as,

λ* ∈ argmin_λ L_V(λ, argmin_w L_T(w, λ))    (1)

Here, L_V = validation loss, L_T = training loss, λ = hyperparameters, and w = model parameters. A popular approach involves introducing a best-response function (Lorraine and Duvenaud, 2018; MacKay et al., 2019) and solving instead,

λ* ∈ argmin_λ L_V(λ, G_φ(λ))   s.t.   G_φ(λ) ∈ argmin_w L_T(w, λ)    (2)

G_φ(λ) is the best-response function parameterized by the hyperparameters λ. For simplicity we limit our discussion to a scalar (single) λ; of course, it can be easily extended to multiple hyperparameters. The work on Stochastic Hyperparameter Optimization (SHO) (Lorraine and Duvenaud, 2018) uses a hypernetwork to model the best-response function as G_φ(λ) = λφ_1 + φ_0 = Λφ, where φ = [φ_1; φ_0] and Λ = [λI | I], with φ ∈ argmin_θ L_T(Λθ, λ); and it adopts an alternating minimization of the training loss L_T w.r.t. φ (hypernetwork parameters) and the validation loss L_V w.r.t. λ (hyperparameters) to solve the HPO problem. Their proposed algorithm is provided in Appendix B. MacKay et al. (2019) adopt a similar alternating-gradient approach but modify the algorithm's hypothesis class (by scaling and shifting the network hidden layers) and add additional stability by adaptively tuning the degree of stochasticity (i.e. the perturbation scale) of the gradient updates. In this work we adopt the SHO algorithm (Algorithm 2) as a representative of such alternating-gradient approaches for solving the bi-level HPO.
One major limitation of these alternating-gradient based bi-level HPO algorithms is that their convergence behavior heavily depends on the conditioning of the problem. Basically, for ill-conditioned settings, the step size used for the gradient updates needs to be sufficiently small to ensure stability of such algorithms. This in turn leads to poor convergence rates (also see our results in Section 4). To alleviate this instability and ensure improved convergence we introduce our Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO) algorithm. For simplicity we will assume unique solutions for (2) in the rest of the paper.
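To make the best-response idea concrete, the following is a minimal NumPy sketch (not the authors' code) of the linear hypernetwork G_φ(λ) = λφ_1 + φ_0 and of the chain rule ∇_λ L_V = φ_1^T ∇_w L_V that makes the validation loss differentiable w.r.t. λ. The least-squares validation loss and the synthetic data are illustrative assumptions.

```python
import numpy as np

def best_response(phi1, phi0, lam):
    # Linear best-response hypernetwork: G_phi(lambda) = lambda * phi_1 + phi_0
    return lam * phi1 + phi0

def val_loss_and_grad_wrt_lambda(phi1, phi0, lam, X_val, y_val):
    # Validation loss L_V(lambda, G_phi(lambda)) for a least-squares model, and its
    # gradient w.r.t. lambda via the chain rule: dL_V/dlambda = phi_1^T grad_w L_V.
    w = best_response(phi1, phi0, lam)
    resid = X_val @ w - y_val
    loss = float(np.mean(resid ** 2))
    grad_w = 2.0 / len(y_val) * (X_val.T @ resid)
    return loss, float(phi1 @ grad_w)

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)
phi1, phi0 = rng.normal(size=5), rng.normal(size=5)
loss, dloss_dlam = val_loss_and_grad_wrt_lambda(phi1, phi0, -1.0, X_val, y_val)
```

Once the validation loss is expressed this way, alternating gradient steps on φ (through L_T) and on λ (through L_V) give the SHO-style solvers discussed above.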
3. Moreau-Yosida regularized HPO
First we reformulate the problem in (2). Given φ* ∈ argmin_θ L_T(Λ*θ, λ*), we solve,

min_{λ, w}  L_T(w, λ*) + L_V(λ, G_{φ*}(λ))    (3)
s.t.  w = G_{φ*}(λ) = λφ*_1 + φ*_0 = Λφ*;   where φ* = [φ*_1; φ*_0] and Λ = [λI | I]

Here, we assume that λ* and φ* are provided to us by an oracle satisfying ∇_φ L_T(Λ*φ, λ*) = 0 at the solution Λ* = [λ*I | I]. Proposition 1 justifies solving (3) in lieu of (2).

Proposition 1  For a bijective mapping w = G_{φ*}(λ), the stationary points (w*, λ*) of (3) are also stationary points of (2), with ∇_w L_T(w*, λ*) = 0 and ∇_λ L_V(λ*, G_{φ*}(λ*)) = 0.

The problem (3) is a sum of two functions parameterized by different arguments connected through an equality constraint. Such formulations are frequently seen in machine learning problems and are popularly solved through variants of the Alternating Direction Method of Multipliers (Boyd et al., 2011; Goldstein et al., 2014), the Alternating Minimization Algorithm (Tseng, 1991), or Douglas-Rachford splitting (Eckstein and Bertsekas, 1992). However, a major difference in (3) is that the oracle solution λ* and φ* ∈ argmin_θ L_T(Λ*θ, λ*) ⇒ ∇_θ L_T(Λ*φ*, λ*) = 0 is not available a priori. Hence, we cannot directly apply these existing approaches to solve (3). This leads to our new algorithm, called Moreau-Yosida regularized Hyperparameter Optimization (MY-HPO), to solve (3). At iteration k + 1 we take the following steps,
Step 1. Update φ^{k+1} = [φ_1^{k+1} | φ_0^{k+1}]:
v^{k+1} = argmin_v L_T(v, λ^k);   φ_0^{k+1} = v̄^{k+1} = Σ_j v_j^{k+1};   φ_1^{k+1} = (v^{k+1} − v̄^{k+1}) / λ^k    (4)

Step 2.
w^{k+1} = argmin_w L_T(w, λ^k) + (u^k)^T (w − Λ^k φ^{k+1}) + (ρ/2) ||w − Λ^k φ^{k+1}||^2    (5)

Step 3.
λ^{k+1} = argmin_λ L_V(λ, G_{φ^{k+1}}(λ)) + (u^k)^T (w^{k+1} − Λφ^{k+1}) + (ρ/2) ||w^{k+1} − Λφ^{k+1}||^2    (6)

Step 4. Update the consensus variable:
u^{k+1} = u^k + ρ (w^{k+1} − Λ^{k+1} φ^{k+1})    (7)

The complete algorithm is provided in Appendix C (Algorithm 3). Step 1 ensures a unique solution for the given λ^k, i.e. φ^{k+1} = argmin_θ L_T(Λ^k θ, λ^k). In Steps 2 and 3, rather than taking gradient updates of L_T, L_V (as in SHO), we take gradients of the Moreau-Yosida (MY) regularized functions. The Moreau-Yosida (MY) regularization of a function is defined as f^{1/ρ}(·) := min_x f(x) + (ρ/2) ||· − x||^2 and serves as a smooth approximation of f(·). Note that (5) and (6) transform into gradient updates of the MY-regularized L_T and L_V in scaled form (Boyd et al., 2011). This lends better stability to the updates in Steps 2-3, which is highly desirable for ill-conditioned problems. Another aspect of this algorithm is that the w and λ updates are now not agnostic of each other. The w − Λφ terms constrain against larger steps in a direction detrimental to either of the loss functions L_T, L_V. In essence, these additional terms maintain consensus between the updates while minimizing L_T w.r.t. w and L_V w.r.t. λ separately. Any deviation from this consensus is captured in the u-update and fed back in the next iteration. The user-defined ρ ≥ 0 controls the weight of this consensus penalty.

Proposition 2 (Convergence Criteria)  The necessary and sufficient conditions for Steps 1-4 to reach a stationary point of (3) are r^{k+1} = w^{k+1} − Λ^{k+1}φ^{k+1} → 0 and s^{k+1} = ρ(Λ^{k+1}φ^{k+1} − Λ^kφ^{k+1}) → 0.

Claim 1 (Convergence Guarantee)  Under the assumptions that L_T, L_V are proper, closed and convex, with L_T strongly convex and {φ^k} bounded ∀k, Steps 1-4 converge with r^{k+1} → 0 and s^{k+1} → 0.

All proofs are provided in the Appendix. To maintain the same per-step iteration cost as the SHO algorithm, we simplify Steps 1-4 further and instead take single gradient updates in Steps 1-3. This results in the simplified MY-HPO Algorithm 1, which is used throughout our experiments. For simplicity we refer to this simplified Algorithm 1 as MY-HPO throughout the paper. Additionally, to ensure a descent direction we also add a backtracking (BT) line search that scales down the step size by a fixed factor.

Algorithm 1: Simplified MY-HPO Algorithm
Input: MaxIters = total iterations, ε_TOL = convergence error tolerance, α, β, δ = step sizes for the gradient updates.
Output: λ, w
Initialize: λ, v, w, u
while (k + 1 ≤ MaxIters) and (convergence error ≥ ε_TOL) do
    v^{k+1} = v^k − α ∇_v L_T;   φ_0^{k+1} = v̄^{k+1};   φ_1^{k+1} = (v^{k+1} − v̄^{k+1}) / λ^k;   φ^{k+1} = [φ_1^{k+1} | φ_0^{k+1}]
    w^{k+1} = w^k − β [∇_w L_T + u^k + ρ (w^k − Λ^k φ^{k+1})]
    λ^{k+1} = λ^k − δ ∇_λ [L_V(λ, G_{φ^{k+1}}(λ)) + (u^k)^T (w^{k+1} − Λφ^{k+1}) + (ρ/2) ||w^{k+1} − Λφ^{k+1}||^2]
    u^{k+1} = u^k + ρ (w^{k+1} − Λ^{k+1} φ^{k+1})
end
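For concreteness, the following is a minimal NumPy sketch of the simplified Algorithm 1 updates, not the authors' implementation. Several details are illustrative assumptions: the least-squares losses of Section 4, the ρ/2-scaled quadratic penalty, a mean aggregation for φ_0, and the initialization and step sizes.

```python
import numpy as np

def my_hpo(X_tr, y_tr, X_val, y_val, iters=1000, alpha=0.1, beta=0.1, delta=0.1, rho=1.0):
    # Simplified MY-HPO: one gradient step per sub-problem, least-squares losses.
    d = X_tr.shape[1]
    lam = -1.0                                   # log of the ridge weight, e^lam * ||w||^2
    v = np.zeros(d)
    w = np.zeros(d)
    u = np.zeros(d)

    def grad_LT(z, lam_):                        # d/dz [ (1/N_T)||y - Xz||^2 + e^lam ||z||^2 ]
        return 2.0 / len(y_tr) * X_tr.T @ (X_tr @ z - y_tr) + 2.0 * np.exp(lam_) * z

    def grad_LV(z):                              # d/dz [ (1/N_V)||y - Xz||^2 ]
        return 2.0 / len(y_val) * X_val.T @ (X_val @ z - y_val)

    for _ in range(iters):
        # Step 1: one gradient step on L_T w.r.t. v, then read off the best-response parameters.
        v = v - alpha * grad_LT(v, lam)
        phi0 = np.full(d, v.mean())              # assumed aggregation for phi_0 (illustrative)
        phi1 = (v - phi0) / lam
        Gphi = lam * phi1 + phi0                 # Lambda^k phi^{k+1}
        # Step 2: one gradient step on the MY-regularized training loss w.r.t. w.
        w = w - beta * (grad_LT(w, lam) + u + rho * (w - Gphi))
        # Step 3: one gradient step on the MY-regularized validation loss w.r.t. lambda;
        #   d/dlam L_V(lam, G_phi(lam)) = phi1^T grad_LV(G_phi(lam)), and d/dlam (Lambda phi) = phi1.
        g_lam = phi1 @ grad_LV(lam * phi1 + phi0) \
                - u @ phi1 - rho * phi1 @ (w - (lam * phi1 + phi0))
        lam = lam - delta * g_lam
        # Step 4: consensus (dual) update.
        u = u + rho * (w - (lam * phi1 + phi0))
    return lam, w
```

The backtracking (BT) variant used in the experiments additionally shrinks α, β, δ whenever a step fails to decrease the corresponding objective.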
4. Results and Summary
We provide analysis for both regression and classification problems using the loss functions,
– Least Squares: L_V = (1/N_V) Σ_{x_i ∈ V} (y_i − w^T x_i)^2,   L_T = (1/N_T) Σ_{x_i ∈ T} (y_i − w^T x_i)^2 + e^λ ||w||^2
– Logistic: L_V = (1/N_V) Σ_{x_i ∈ V} log(1 + e^{−y_i w^T x_i}),   L_T = (1/N_T) Σ_{x_i ∈ T} log(1 + e^{−y_i w^T x_i}) + e^λ ||w||^2

Table 1: Datasets and Experimental Settings.
Data Sets                            | Training/Validation size | Test size     | Dimension (d)
COOKIE (Osborne et al., 1984)        | – / – (50% split)        | –             | 699 (NIR)
MNIST Regression (LeCun, 1998)       | – (per digit)            | – (per digit) | – (Pixel)
MNIST Classification (LeCun, 1998)   | – (per class)            | – (per class) | – (Pixel)
GTSRB (Stallkamp et al., 2012)       | – (per class)            | – (per class) | – (HOG)

Here, x_i ∈ R^d, and y ∈ R (regression) or {−1, 1} (classification); d = dimension of the problem, T = training data, N_T = number of training samples, V = validation data, and N_V = number of validation samples. The experimental details are provided in Table 1. Here, for regression, we use the MNIST data to regress on the digit labels '0' - '9' (Park et al., 2020) and the Cookie data to predict the percentage of 'fat' w.r.t. the 'flour' content using near infrared (NIR) values (Osborne et al., 1984). For classification, we classify between digits '0' vs. '1' for MNIST, and the traffic signs '30' vs. '80' for GTSRB.
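As a small illustration of the classification losses above, the following NumPy sketch computes the logistic L_T (with the e^λ ridge penalty) and L_V together with their gradients w.r.t. w; the 1/N normalization and {−1, +1} labels follow the definitions above, while the exact implementation is an assumption, not the authors' code.

```python
import numpy as np

def logistic_losses(w, lam, X_tr, y_tr, X_val, y_val):
    # Logistic training loss L_T (with the e^lam ridge penalty) and validation loss L_V,
    # together with their gradients w.r.t. w. Labels y_i are assumed to be in {-1, +1}.
    def loss_and_grad(X, y):
        m = y * (X @ w)                                    # margins y_i * w^T x_i
        loss = float(np.mean(np.logaddexp(0.0, -m)))       # mean log(1 + exp(-m))
        grad = -(X.T @ (y / (1.0 + np.exp(m)))) / len(y)
        return loss, grad

    L_T, g_T = loss_and_grad(X_tr, y_tr)
    L_V, g_V = loss_and_grad(X_val, y_val)
    L_T += float(np.exp(lam) * (w @ w))                    # add e^lam * ||w||^2
    g_T = g_T + 2.0 * np.exp(lam) * w
    return L_T, g_T, L_V, g_V
```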
Table 2 provides the average (± standard deviation) of the loss values on training, validation, and a separate test set over 10 runs of the experiments. Here, for each experiment we partition the data sets in the same proportion as shown in Table 1. For regression we scale the output loss by the variance of the y-values. This is a standard normalization technique which reports the proportion of variance left unexplained by the estimated model. For the SHO and MY-HPO algorithms we only report the performance of the models using the best-performing step sizes. A detailed ablation study using different step sizes for the algorithms is provided in Appendix D. In addition, we also provide the results for popular black-box algorithms publicly available through the Auptimizer toolbox (Liu et al., 2019). Here, for all the algorithms we maintain a fixed budget of gradient computations. Additional details on the various algorithms' parameter settings are provided in Appendix D.
Table 2 shows that for both the classification and regression problems the MY-HPO algorithm significantly outperforms all the baseline algorithms. The dominant convergence behavior of the bi-level formulations (i.e. MY-HPO and SHO) compared to black-box approaches is well known from previous studies (Lorraine and Duvenaud, 2018; MacKay et al., 2019). This is also seen in our results. Of course, increasing the number of gradient computations allows the black-box approaches to achieve similar loss values (see Appendix D). However, a non-trivial observation is that MY-HPO significantly outperforms the SHO algorithm. To further analyze this improved convergence behavior of the MY-HPO algorithm, we provide the convergence curves of these algorithms for the logistic loss on GTSRB data in Fig. 1. Here, MY-HPO (BT) involves back-tracking to ensure a descent direction, whereas MY-HPO (C) uses constant-step updates. Fig. 1 shows that the SHO algorithm obtains its best results with the step size fixed to α = 0.01. Increasing it further to α = 0.05 destabilizes the updates and results in a sub-optimal SHO solution. On the other hand, using the Moreau-Yosida regularized updates we can accommodate larger step sizes and hence achieve better convergence rates. Additional improvement can be expected by using back-tracking, as it ensures descent directions, which adds to the stability of the algorithm. This behavior is persistently seen for both the classification and regression problems on all the data sets (see Appendix D).
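The variance normalization mentioned above amounts to reporting the fraction of unexplained variance; a minimal sketch (the exact implementation is an assumption):

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    # Mean squared error reported as a fraction of the output variance,
    # i.e. the proportion of variance left unexplained by the model.
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))
```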
Table 2: Loss values of different HPO algorithms using a fixed computation budget.
Data (Train / Val. / Test loss)                          | SHO | MyHPO (C) | MyHPO (BT) | Random | Grid | HyperOpt | Spearmint
Regression Problems
Cookie (No. of gradient computations = 5000)             | –   | –         | –          | –      | –    | –        | –
MNIST (No. of gradient computations = 6000)              | –   | –         | –          | –      | –    | –        | –
Classification Problems
MNIST (No. of gradient computations = 1000)              | –   | –         | –          | –      | –    | –        | –
GTSRB (No. of gradient computations = 1000)              | –   | –         | –          | –      | –    | –        | –

Figure 1: Convergence behavior of SHO vs. MY-HPO for different step sizes for GTSRB data using the logistic loss. For MY-HPO we show the ρ (α, β, δ) values.

In summary, our results confirm the effectiveness of Moreau-Yosida (MY) regularized updates for bi-level HPO under (ill-conditioned) limited data settings. Here, rather than taking alternating gradient updates as in SHO (Lorraine and Duvenaud, 2018) or STN (MacKay et al., 2019), we propose a modified algorithm, MY-HPO, which takes gradient updates on the Moreau-Yosida envelope and maintains a consensus variable. The proposed MY-HPO algorithm provides added stability and enables us to use larger step sizes, which in turn leads to better convergence rates. Under a fixed computation budget MY-HPO significantly outperforms the SHO algorithm (a popular representative of alternating-gradient based bi-level HPO solvers) and the widely used black-box HPO routines. Owing to space constraints, details regarding the algorithm parameters for reproducing the experimental results are provided in Appendix D.

References
Amir Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, volume 19. SIAM, 2014.
James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.
Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 874-883. JMLR.org, 2017.
Sauptik Dhar, Vladimir Cherkassky, and Mohak Shah. Multiclass learning from contradictions. In Advances in Neural Information Processing Systems, pages 8400-8410, 2019.
Jonathan Eckstein and Dimitri P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293-318, 1992.
Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1165-1173. JMLR.org, 2017.
Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
Pontus Giselsson and Stephen Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532-544, 2016.
Tom Goldstein, Brendan O'Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588-1623, 2014.
Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765-6816, 2017.
Jiayi Liu, Samarth Tripathi, Unmesh Kurup, and Mohak Shah. Auptimizer - an extensible, open-source framework for hyperparameter tuning. arXiv preprint arXiv:1911.02522, 2019.
J. Lorraine and D. Duvenaud. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.
M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations, 2019.
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113-2122, 2015.
Akshay Mehra and Jihun Hamm. Penalty method for inversion-free deep bilevel optimization. arXiv preprint arXiv:1911.03432, 2019.
Brian G. Osborne, Thomas Fearn, Andrew R. Miller, and Stuart Douglas. Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture, 35(1):99-105, 1984.
Y. Park, S. Dhar, S. Boyd, and M. Shah. Variable metric proximal gradient method with diagonal Barzilai-Borwein stepsize. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3597-3601, 2020.
Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016. URL http://jmlr.org/proceedings/papers/v48/pedregosa16.html.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.
J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 2012. ISSN 0893-6080. doi: 10.1016/j.neunet.2012.02.016.
Paul Tseng. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization, 29(1):119-138, 1991.

Appendix A. Proofs
A.1 Proof of Proposition 1
The proof follows from analyzing the KKT equations. Note that the Lagrangian of problem (3) is,

L(w, λ, u) = L_T(w, λ*) + L_V(λ, G_{φ*}(λ)) + u^T (w − G_{φ*}(λ))    (8)

KKT system:
∇_w L_T(w, λ*) + u = 0    (9)
∇_λ L_V(λ, G_{φ*}(λ)) − u^T ∇_λ G_{φ*}(λ) = 0    (10)
w − G_{φ*}(λ) = 0    (11)
∇_φ L_T(G_{φ*}(λ), λ) = 0    (12)

At the solution (w*, λ*, u*) of the KKT system, for a bijective mapping w* = G_{φ*}(λ*) = Λ*φ*, eq. (12) ⇒ ∇_w L_T(w*, λ*) = 0. This in turn indicates u* = 0 ⇒ ∇_λ L_V(λ*, G_{φ*}(λ*)) = 0. Hence, at the solution, ∇_w L_T(w*, λ*) = 0 and ∇_λ L_V(λ*, G_{φ*}(λ*)) = 0.

A.2 Proof of Proposition 2
Assume the conditions r^{k+1} = 0 and s^{k+1} = 0 are satisfied at iteration k + 1. Then Steps 1-4 give,

∇_w L_T(w^{k+1}, λ^k) + u^k + ρ (w^{k+1} − Λ^k φ^{k+1}) = 0    (13)
⇒ ∇_w L_T(w^{k+1}, λ^k) + u^{k+1} + ρ (Λ^{k+1} φ^{k+1} − Λ^k φ^{k+1}) = 0    (14)
∇_λ L_V(λ^{k+1}, Λ^{k+1} φ^{k+1}) − (u^k)^T φ_1^{k+1} − ρ (φ_1^{k+1})^T (w^{k+1} − Λ^{k+1} φ^{k+1}) = 0    (15)
u^{k+1} = u^k + ρ (w^{k+1} − Λ^{k+1} φ^{k+1})    (16)

Under the conditions r^{k+1} = 0, s^{k+1} = 0, the updates in eqs. (13)-(16) become,

∇_w L_T(w^{k+1}, λ^k) + u^{k+1} = 0
∇_λ L_V(λ^{k+1}, Λ^{k+1} φ^{k+1}) − (u^{k+1})^T φ_1^{k+1} = 0    (17)
u^{k+1} = u^k

In addition, at iteration k + 1 we also have from eq. (4) a unique solution (φ_1^{k+1}, φ_0^{k+1}) = argmin_{θ_1, θ_0} L_T(λ^k θ_1 + θ_0, λ^k). Also, from our assumptions,

r^{k+1} = 0 ⇒ w^{k+1} = λ^k φ_1^{k+1} + φ_0^{k+1} ⇒ ∇_w L_T(w^{k+1}, λ^k) = 0    (18)
s^{k+1} = 0 ⇒ λ^k φ_1^{k+1} = λ^{k+1} φ_1^{k+1} ⇒ ∇_w L_T(w^{k+1}, λ^{k+1}) = 0    (19)

Using eqs. (18) and (19) in (17) gives,

∇_w L_T(w^{k+1}, λ^{k+1}) = 0   (⇒ u^{k+1} = 0)
∇_λ L_V(λ^{k+1}, Λ^{k+1} φ^{k+1}) = 0   (∵ u^{k+1} = 0)
⇒ ∇_λ L_V(λ^{k+1}, w^{k+1}) = 0   (for the bijective map Λφ → w)

Next, for the necessary part: if r^{k+1} ≠ 0, the equality constraint in (3) is not satisfied. Hence r^{k+1} = 0 is a necessary condition. To establish the necessary condition s^{k+1} = 0, consider ∇_λ L_V(λ^{k+1}, w^{k+1}) = 0 and ∇_w L_T(w^{k+1}, λ^{k+1}) = 0. Now, r^{k+1} = 0 ⇒ w^{k+1} = Λ^{k+1} φ^{k+1} ⇒ ∇_φ L_T(Λ^{k+1} φ^{k+1}, λ^{k+1}) = 0 (from the above assumption). Finally, Step 1 of Algorithm 3 ensures ∇_φ L_T(Λ^k φ^{k+1}, λ^k) = 0. Hence Λ^k φ^{k+1} = Λ^{k+1} φ^{k+1} ⇒ s^{k+1} = 0.

A.3 Proof for Claim 1
For the first part of the proof, observe that Steps 2-4 operate on a fixed φ^{k+1}. This allows us to re-utilize Proposition 4 in (Giselsson and Boyd, 2016) and claim that the operator equivalent to Steps 2-4, mapping (λ^k, u^k) → (λ^{k+1}, u^{k+1}), converges to its fixed points. At the stationary points the states remain the same. This gives, from (16), w^{k+1} − Λ^{k+1} φ^{k+1} = 0 ⇒ r^{k+1} = 0. Further, ρ (Λ^{k+1} φ^{k+1} − Λ^k φ^{k+1}) = 0 (∵ Λ^k = Λ^{k+1} and φ is bounded), i.e. s^{k+1} = 0.

Appendix B. Stochastic Hyperparameter Optimization using Hypernetworks (SHO)
There are several versions (global vs. local) of the SHO algorithm introduced in (Lorraine and Duvenaud, 2018). A representative local version of the SHO algorithm is provided below in Algorithm 2.

Algorithm 2: SHO Algorithm (Local)
Input: α = learning rate for the training-loss gradient update w.r.t. the model parameters; β = learning rate for the validation-loss gradient update w.r.t. the hyperparameters.
Output: λ, G_φ(λ)
Data: T = training data, V = validation data
Initialize: φ, λ and define w = G_φ(λ) = Λφ
while not converged do
    λ̂ ∼ P(λ̂ | λ^{t−1})    // typically modeled as a Normal distribution
    φ^t ← φ^{t−1} − α ∇_{G_φ} L_T(G_φ(λ̂), λ̂, x ∈ T) · ∇_φ G_φ(λ̂),   where ∇_φ G_φ(λ̂) = Λ̂
    λ^t ← λ^{t−1} − β ∇_λ G_φ(λ) · ∇_{G_φ} L_V(G_φ(λ), x ∈ V),   where ∇_λ G_φ(λ) = φ_1^T
end
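The following NumPy sketch mirrors the local SHO loop above for the ridge-regression setting; the Gaussian perturbation scale, step sizes, initialization, and loss choice are illustrative assumptions, and this is not the authors' released code.

```python
import numpy as np

def sho_local(X_tr, y_tr, X_val, y_val, iters=1000, alpha=1e-2, beta=1e-2, sigma=1e-3):
    # Local SHO: alternate a hypernetwork update (on a perturbed hyperparameter)
    # with a hyperparameter update through the best response G_phi(lam) = lam*phi1 + phi0.
    d = X_tr.shape[1]
    phi1, phi0, lam = np.zeros(d), np.zeros(d), -1.0
    rng = np.random.default_rng(0)
    for _ in range(iters):
        lam_hat = lam + sigma * rng.standard_normal()      # lam_hat ~ N(lam, sigma^2)
        w_hat = lam_hat * phi1 + phi0
        # Training-loss gradient at w_hat (least squares + e^lam_hat ridge penalty),
        # pushed onto (phi1, phi0) through dG/dphi1 = lam_hat and dG/dphi0 = 1.
        g_w = 2.0 / len(y_tr) * X_tr.T @ (X_tr @ w_hat - y_tr) + 2.0 * np.exp(lam_hat) * w_hat
        phi1, phi0 = phi1 - alpha * lam_hat * g_w, phi0 - alpha * g_w
        # Validation-loss gradient w.r.t. lam through the best response: dG/dlam = phi1.
        w = lam * phi1 + phi0
        g_val = 2.0 / len(y_val) * X_val.T @ (X_val @ w - y_val)
        lam = lam - beta * float(phi1 @ g_val)
    return lam, phi1, phi0
```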
Appendix C. Moreau-Yosida regularized (MY)-HPO Algorithm

Algorithm 3: MY-HPO algorithm
Input: MaxIters = total iterations, ε_TOL = convergence error tolerance.
Output: λ, w
Data: T = training data, V = validation data
Initialize: λ, u
while (k + 1 ≤ MaxIters) and (convergence error ≥ ε_TOL) do
    Step 1. Update φ:
        v^{k+1} = argmin_v L_T(v, λ^k)
        φ_0^{k+1} = v̄^{k+1};   φ_1^{k+1} = (v^{k+1} − v̄^{k+1}) / λ^k;   φ^{k+1} = [φ_1^{k+1} | φ_0^{k+1}]
    Step 2. Update w:
        w^{k+1} = argmin_w L_T(w, λ^k) + (u^k)^T (w − Λ^k φ^{k+1}) + (ρ/2) ||w − Λ^k φ^{k+1}||^2
    Step 3. Update λ:
        λ^{k+1} = argmin_λ L_V(λ, G_{φ^{k+1}}(λ)) + (u^k)^T (w^{k+1} − Λφ^{k+1}) + (ρ/2) ||w^{k+1} − Λφ^{k+1}||^2
    Step 4. Update consensus:
        u^{k+1} = u^k + ρ (w^{k+1} − Λ^{k+1} φ^{k+1})
end

Connection with ADMM: The MY-HPO updates are closely connected to the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al., 2011). In fact, for a given φ^k, Steps 2-4 correspond to the standard ADMM updates for problem (3).
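Because the Step 2 sub-problem is a strongly convex quadratic for the least-squares training loss, it can be solved exactly; the sketch below assumes the ρ/2-scaled quadratic penalty written above and is an illustrative implementation only.

```python
import numpy as np

def step2_closed_form(X_tr, y_tr, lam, phi1, phi0, u, rho):
    # Exact w-update of Algorithm 3 for L_T(w) = (1/N_T)||y - Xw||^2 + e^lam ||w||^2.
    # Solves: min_w L_T(w, lam) + u^T (w - G) + (rho/2) ||w - G||^2, with G = lam*phi1 + phi0.
    # Setting the gradient to zero gives a linear system in w.
    n, d = X_tr.shape
    G = lam * phi1 + phi0
    A = 2.0 / n * X_tr.T @ X_tr + (2.0 * np.exp(lam) + rho) * np.eye(d)
    b = 2.0 / n * X_tr.T @ y_tr - u + rho * G
    return np.linalg.solve(A, b)
```

The simplified Algorithm 1 replaces this exact solve with a single gradient step, trading per-iteration cost for more outer iterations (compare Table 3).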
Table 3: Performance of the MY-HPO Algorithm 3 for the regression problems in Table 2.
Data   | Train Loss | Val. Loss | Test Loss | No. of Iterations (k) | MaxIters
Cookie | –          | –         | –         | –                     | –
MNIST  | –          | –         | –         | –                     | –

Relaxation to the simplified MY-HPO Algorithm 1: Algorithm 1 simplifies the above Algorithm 3 by taking one gradient update rather than solving the minimization problems exactly. This reduces the per-step iteration cost of Algorithm 1 to be the same as that of the SHO algorithm in Algorithm 2. Now, both Algorithms 1 and 2 incur two gradient steps per outer iteration. This simplification, however, deteriorates the convergence rate of the algorithm. We provide a convergence comparison between Algorithms 1 and 3 in Table 3.
Appendix D. Algorithm Parameters and Additional Results
For all the black-box approaches we match the number of outer iterations of the bi-level formulations, i.e. n_T, to train the model for any given hyperparameter value. Further, we train the models using gradient updates with step size α_train. This parameter is set specifically for each problem reported below, and helps us achieve training gradients similar to the bi-level formulations. In addition, we set the following parameters using the Auptimizer toolbox for each of the following algorithms.
Random Search: We select the random seed to be the same as that used to generate the data. Additionally, we search over the range [−, · · · , 5] and report the algorithm's performance for varying values of n_S.
Grid Search: We select the range of search as [−, · · · , 5] and report the algorithm's performance for varying values of n_S.
HyperOpt: We select the random seed to be the same as that used to generate the data. We select the range of search as [−, · · · , 5] and the engine = 'tpe'. We report the algorithm's performance for varying values of n_S.
Spearmint: We select the range of search as [−, · · · , 5], engine = 'GPEIOptChooser' and grid size = 20000. We report the algorithm's performance for varying values of n_S.
The parameters specific to the problems and the respective data sets are provided below.
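To make the fixed-budget protocol concrete, the following generic random-search sketch (plain NumPy, not the Auptimizer API) trains each sampled λ for n_T gradient steps so that the total budget is n_S × n_T training-gradient computations; the search range, step size, and loss are illustrative assumptions.

```python
import numpy as np

def random_search(X_tr, y_tr, X_val, y_val, n_s=10, n_t=2500, lr=0.1,
                  lam_range=(-5.0, 5.0), seed=0):
    # Fixed-budget random search over the log-regularizer lambda (illustrative baseline).
    # Each candidate lambda is evaluated by training a ridge model for n_t gradient steps.
    rng = np.random.default_rng(seed)
    d = X_tr.shape[1]
    best = (np.inf, None, None)
    for lam in rng.uniform(*lam_range, size=n_s):
        w = np.zeros(d)
        for _ in range(n_t):                                  # plain gradient descent on L_T
            g = 2.0 / len(y_tr) * X_tr.T @ (X_tr @ w - y_tr) + 2.0 * np.exp(lam) * w
            w -= lr * g
        val = np.mean((y_val - X_val @ w) ** 2)               # validation loss L_V
        if val < best[0]:
            best = (val, lam, w)
    return best
```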
D.1 Regression using Cookie Data
D.1.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We keep the following parameters fixed (to default values): β = 0.01, and the noise variance of λ̂ ∼ N(λ, σ) with σ set to a small default value. Changing these values in our experiments did not result in significant improvements. We report the performance of SHO for varying values of α. The code is publicly available at: https://github.com/lorraine2/hypernet-hypertraining
Moreau-Yosida Regularized HPO (MY-HPO): We keep the following parameter fixed: ρ = 1. We report the performance for varying values of α, β, δ. It is well known in the ADMM literature that the selection of the ρ parameter greatly impacts the convergence behavior. Such analyses will be explored in a longer version of the paper.
Further, we fix n_T (i.e. MaxIters in Algorithm 3) = 2500, and the step size α_train for training the model used for the black-box approaches.

D.1.2 Additional Results
The complete set of results with all the parameter settings is provided in Table 4. As seen from the results, increasing the budget (number of gradient computations n_g) to almost 5× that of the bi-level counterparts allows the black-box approaches to achieve similar performance.

Figure 2: Convergence behavior of SHO vs. MY-HPO for different step sizes for Cookie data using the least-squares loss. For MY-HPO we show the ρ (α, β, δ) values.

Figure 2 illustrates the convergence curves of the algorithms, and in particular the unstable behavior of SHO as the step size increases. Note that for larger step sizes the SHO algorithm fails to converge. As the step size gets smaller the algorithm converges, but demonstrates a very slow convergence rate. This is also seen for the other experiments used in this paper. On the contrary, MY-HPO can accommodate higher step sizes, which results in better convergence. Additionally, Fig. 2 also provides a comparison between the MY-HPO Algorithm 3 and the simplified version in Algorithm 1. As seen in the figure (and also confirmed in Table 3), the simplified algorithm provides a better per-step iteration cost but incurs many more overall iterations to convergence.

Table 4: Comparison between different HPO algorithms for the regression problem using Cookie data.
Method                                  | Train Loss | Validation Loss | Test Loss | n_g (n_T, n_S)
SHO (three values of α)                 | –          | –               | –         | 5000
MyHPO (C) (two settings of α, β, δ)     | –          | –               | –         | 5000
MyHPO (BT) (three settings of α, β, δ)  | –          | –               | –         | 5000
Random (n_S = 2 / 10)                   | –          | –               | –         | 5000 / 25000
Grid (n_S = 2 / 10)                     | –          | –               | –         | 5000 / 25000
HyperOpt (n_S = 2 / 10)                 | –          | –               | –         | 5000 / 25000
Spearmint (n_S = 2 / 10)                | –          | –               | –         | 5000 / 25000

D.2 Regression using MNIST Data
D.2.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We fix the parameters as in D.1.1 and vary the α parameter as shown in Table 5.
Moreau-Yosida Regularized HPO (MY-HPO): We fix the parameters as in D.1.1 and vary the α, β, δ parameters as shown in Table 5.
Further, for this data we fix n_T (i.e. MaxIters in Algorithm 3) = 3000, and the step size α_train for training the model used for the black-box approaches.

Table 5: Comparison between different HPO algorithms for the regression problem using MNIST data.
Method                                  | Train Loss | Validation Loss | Test Loss | n_g (n_T, n_S)
SHO (three values of α)                 | –          | –               | –         | 6000
MyHPO (C) (two settings of α, β, δ)     | –          | –               | –         | 6000
MyHPO (BT) (three settings of α, β, δ)  | –          | –               | –         | 6000
Random (n_S = 2)                        | –          | –               | –         | 6000
Grid (n_S = 2)                          | –          | –               | –         | 6000
HyperOpt (n_S = 2)                      | –          | –               | –         | 6000
Spearmint (n_S = 2)                     | –          | –               | –         | 6000

Figure 3: Convergence behavior of SHO vs. MY-HPO for different step sizes for MNIST data using the least-squares loss. For MY-HPO we show the ρ (α, β, δ) values.

D.2.2 Additional Results
Note that here, further increasing the computation budget (i.e. n_S = 10) does not provide additional improvement for the black-box approaches. The results support similar conclusions:
1. MY-HPO accommodates higher step sizes and results in better convergence.
2. Back-tracking further ensures a descent direction (in each iteration) and leads to better convergence.
3. Increasing the budget (number of gradient computations n_g) to almost 5× that of the bi-level counterparts allows the black-box approaches to achieve similar performance.

D.3 Classification using MNIST Data
D.3.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We fix the parameters as in D.1.1 and vary the α parameter as shown in Table 6.
Moreau-Yosida Regularized HPO (MY-HPO): We fix the parameters as in D.1.1 and vary the α, β, δ parameters as shown in Table 6.
Further, for this data we fix n_T (i.e. MaxIters in Algorithm 3) = 500, and the step size α_train for training the model used for the black-box approaches.

D.3.2 Additional Results
Figure 4: Convergence behavior of SHO vs. MY-HPO for different step sizes for MNIST data using the logistic loss. For MY-HPO we show the ρ (α, β, δ) values.

Table 6: Comparison between different HPO algorithms for the classification problem using MNIST data.
Method                                  | Train Loss | Validation Loss | Test Loss | n_g (n_T, n_S)
SHO (three values of α)                 | –          | –               | –         | 1000
MyHPO (C) (two settings of α, β, δ)     | –          | –               | –         | 1000
MyHPO (BT) (three settings of α, β, δ)  | –          | –               | –         | 1000
Random (n_S = 2 / 25)                   | –          | –               | –         | 1000 / 12500
Grid (n_S = 2 / 25)                     | –          | –               | –         | 1000 / 12500
HyperOpt (n_S = 2 / 25)                 | –          | –               | –         | 1000 / 12500
Spearmint (n_S = 2 / 25)                | –          | –               | –         | 1000 / 12500

D.4 Classification using GTSRB Data
D.4.1 Experiment Parameters
Stochastic Hypernetwork Optimization (SHO): We fix the parameters as in D.1.1 and vary the α parameter as shown in Table 7.
Moreau-Yosida Regularized HPO (MY-HPO): We fix the parameters as in D.1.1 and vary the α, β, δ parameters as shown in Table 7.
Further, for this data we fix n_T (i.e. MaxIters in Algorithm 3) = 500, and the step size α_train for training the model used for the black-box approaches.

D.4.2 Additional Results

Table 7: Comparison between different HPO algorithms for the classification problem using GTSRB data.
Method                                  | Train Loss | Validation Loss | Test Loss | n_g (n_T, n_S)
SHO (three values of α)                 | –          | –               | –         | 1000
MyHPO (C) (two settings of α, β, δ)     | –          | –               | –         | 1000
MyHPO (BT) (three settings of α, β, δ)  | –          | –               | –         | 1000
Random (n_S = 2 / 10)                   | –          | –               | –         | 1000 / 5000
Grid (n_S = 2 / 10)                     | –          | –               | –         | 1000 / 5000
HyperOpt (n_S = 2 / 10)                 | –          | –               | –         | 1000 / 5000
Spearmint (n_S = 2 / 10)                | –          | –               | –         | 1000 / 5000