Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces
Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, Michael A. Osborne
Kevin Swersky
University of Toronto [email protected]
David Duvenaud
University of Cambridge [email protected]
Jasper Snoek
Harvard University [email protected]
Frank Hutter
Freiburg University [email protected]
Michael A. Osborne
University of Oxford [email protected]
Abstract
In practical Bayesian optimization, we must often search over structures with differing numbers of parameters. For instance, we may wish to search over neural network architectures with an unknown number of layers. To relate performance data gathered for different architectures, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure. We show that this kernel improves model quality and Bayesian optimization results over several simpler baseline kernels.
Bayesian optimization (BO) is an efficient approach for solving blackbox optimization problems of the form arg min_{x ∈ X} f(x) (see [1] for a detailed overview), where f is expensive to evaluate. It employs a prior distribution p(f) over functions that is updated as new information on f becomes available. The most common choice of prior distribution is the Gaussian process (GP [2]), as GPs are powerful and flexible models for which the marginal and conditional distributions can be computed efficiently. (There are prominent exceptions to this rule, though. In particular, tree-based models such as random forests can be a better choice if there are many data points, so that GPs become computationally inefficient, if the input dimensionality is high, if the noise is not normally distributed, or if there are non-stationarities [3, 4, 5].) However, some problem domains remain challenging to model well with GPs, and the efficiency and effectiveness of Bayesian optimization suffer as a result. In this paper, we tackle the common problem of input dimensions that are only relevant if other inputs take certain values [6, 5]. This is a general problem in algorithm configuration [6] that occurs in many machine learning contexts, such as deep neural networks [7], flexible computer vision architectures [8], and the combined selection and hyperparameter optimization of machine learning algorithms [9]. We detail the case of deep neural networks below.

Bayesian optimization has recently been applied successfully to deep neural networks [10, 5] to optimize high-level model parameters and optimization parameters, which we will refer to collectively as hyperparameters. Deep neural networks represent the state of the art on multiple machine learning benchmarks, such as object recognition [11], speech recognition [12], natural language processing [13], and more. They are multi-layered models by definition, and each layer is typically parameterized by a unique set of hyperparameters, such as regularization parameters and the layer capacity or number of hidden units. Thus, adding layers introduces additional hyperparameters to be optimized. The result is a complex hierarchical conditional parameter space, which
is difficult to search over. Historically, practitioners have simply built a separate model for each type of architecture, used non-GP models [5], or assumed a fixed architecture [10]. If there is any relation between networks with different architectures, separately modelling each is wasteful.

GPs with standard kernels fail to model the performance of architectures with such conditional hyperparameters. To remedy this, the contribution of this paper is the introduction of a kernel that allows observed information to be shared across architectures when this is appropriate. We demonstrate the effectiveness of this kernel on a GP regression task and a Bayesian optimization task using a feed-forward classification neural network.

GPs employ a positive-definite kernel function k : X × X → R to model the covariance between function values. Typical GP models cannot, however, model the covariance between function values whose inputs have different (possibly overlapping) sets of relevant variables. In this section, we construct a kernel between points in a space that may have dimensions which are irrelevant under known conditions (further details are available in [14]). As an explicit example, we consider a deep neural network: if we set the network depth to 2, we know that the 3rd layer's hyperparameters have no effect (as there is no 3rd layer).

Formally, we aim to do inference about some function f with domain X. X = ∏_{i=1}^D X_i is a D-dimensional input space, where each individual dimension is bounded real, that is, X_i = [l_i, u_i] ⊂ R (with lower and upper bounds l_i and u_i, respectively). We define functions δ_i : X → {true, false}, for i ∈ {1, ..., D}; δ_i(x) stipulates the relevance of the i-th feature x_i to f(x).

As an example, imagine trying to model the performance of a neural network having either one or two hidden layers, with respect to the regularization parameters for each layer, x_1 and x_2. If y represents the performance of a one-layer net with regularization parameters x_1 and x_2, then the value of x_2 doesn't matter, since there is no second layer to the network. Below, we'll write an input triple as (x_1, δ_2(x), x_2) and assume that δ_1(x) = true; that is, the regularization parameter for the first layer is always relevant.

In this setting, we want a kernel k to depend on which parameters are relevant, and on the values of the relevant parameters for both points. For example, consider first-layer parameters x_1 and x'_1:

• If we are comparing two points for which the same parameters are relevant, the value of any unused parameters shouldn't matter:

k((x_1, false, x_2), (x'_1, false, x'_2)) = k((x_1, false, x''_2), (x'_1, false, x'''_2)),   ∀ x_2, x'_2, x''_2, x'''_2;   (1)

• The covariance between a point using both parameters and a point using only one should again only depend on their shared parameters:

k((x_1, false, x_2), (x'_1, true, x'_2)) = k((x_1, false, x''_2), (x'_1, true, x'''_2)),   ∀ x_2, x'_2, x''_2, x'''_2.   (2)

Put another way, in the absence of any other information, this specification encodes our prior ignorance about the irrelevant (missing) parameters while still allowing us to model correlations between relevant parameters.
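To make the notation concrete, the following is a minimal sketch (not from the paper) of how points in such a conditional space and the relevance functions δ_i might be represented for the one-or-two-layer example above. The tuple layout and function names are illustrative assumptions.

    # A minimal sketch (not from the paper) of conditional inputs and the
    # relevance functions delta_i for the one-or-two-layer example: x1 and x2
    # are per-layer regularization parameters, and x2 is only relevant when
    # the network actually has a second layer.

    def make_relevance_fns(depth_of):
        """Return (delta_1, delta_2); delta_i(x) is True iff dimension i is relevant.

        `depth_of` maps a point to its architecture depth -- an illustrative
        assumption; the paper only requires that relevance is known per point."""
        def delta_1(x):
            return True                    # first-layer parameter is always relevant
        def delta_2(x):
            return depth_of(x) >= 2        # second-layer parameter needs a second layer
        return delta_1, delta_2

    # Points are (depth, x1, x2); the objective ignores x2 whenever depth == 1.
    depth_of = lambda x: x[0]
    delta_1, delta_2 = make_relevance_fns(depth_of)

    one_layer_net = (1, 0.1, 0.7)   # x2 = 0.7 is irrelevant here
    two_layer_net = (2, 0.1, 0.7)   # x2 = 0.7 is relevant here
    print(delta_1(one_layer_net), delta_2(one_layer_net), delta_2(two_layer_net))
    # -> True False True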
We can build a kernel with these properties for each possibly irrelevant input dimension i by embedding our points into a Euclidean space. Specifically, the embedding we use is

g_i(x) = [0, 0]^T                                                         if δ_i(x) = false
       = ω_i [sin(πρ_i x_i/(u_i − l_i)), cos(πρ_i x_i/(u_i − l_i))]^T     otherwise,   (3)

where ω_i ∈ R+ and ρ_i ∈ [0, 1].

Figure 1: A demonstration of the embedding giving rise to the pseudo-metric. All points for which δ_2(x) = false are mapped onto a line varying only along x_1. Points for which δ_2(x) = true are mapped to the surface of a semicylinder, depending on both x_1 and x_2. This embedding gives a constant distance between pairs of points which have differing values of δ_2 but the same value of x_1.

Figure 1 shows a visualization of the embedding of points (x_1, δ_2(x), x_2) into R^3. In this space, we have the Euclidean distance

d_i(x, x') = ||g_i(x) − g_i(x')||_2
           = 0                                                           if δ_i(x) = δ_i(x') = false,
           = ω_i                                                         if δ_i(x) ≠ δ_i(x'),
           = ω_i √2 √(1 − cos(πρ_i (x_i − x'_i)/(u_i − l_i)))            if δ_i(x) = δ_i(x') = true.   (4)

We can use this to define a covariance over our original space. In particular, we consider the class of covariances that are functions only of the Euclidean distance ∆ between points. There are many examples of such covariances; popular ones are the exponentiated quadratic, for which κ(∆) = σ² exp(−∆²/2), and the rational quadratic, for which κ(∆) = σ² (1 + ∆²/(2α))^(−α). We can simply take (4) in the place of ∆, returning a valid covariance that satisfies all desiderata above.

Explicitly, note that, as desired, if i is irrelevant for both x and x', d_i specifies that g_i(x) and g_i(x') should not differ owing to differences between x_i and x'_i. Secondly, if i is relevant for both x and x', the difference between f(x) and f(x') due to x_i and x'_i increases monotonically with increasing |x_i − x'_i|. The parameter ρ_i controls whether differing in the relevance of i contributes more or less to the distance than differing in the value of x_i, should i be relevant. Hyperparameter ω_i defines a length scale for the i-th feature.

Note that so far we have only defined a kernel for dimension i. To obtain a kernel for the entire D-dimensional input space, we simply embed each dimension in R^2 using Equation (3) and then use the embedded input space of size 2D within any kernel that is defined in terms of Euclidean distance. We dub this new kernel the arc kernel. Its parameters, ω_i and ρ_i for each dimension, can be optimized using the GP marginal likelihood, or integrated out using Markov chain Monte Carlo.
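For concreteness, here is a minimal sketch of the arc-kernel construction: Equation (3) as the per-dimension embedding, Equation (4) arising implicitly as the Euclidean distance between embeddings, and an exponentiated-quadratic covariance applied to the stacked embedding of size 2D. The paper's experiments use a Matérn 5/2 covariance instead; the exponentiated quadratic is used here only for brevity, and the function and variable names are our own, not the authors' implementation.

    import numpy as np

    def arc_embed(x, relevant, omega, rho, lower, upper):
        # Embed one D-dimensional point into R^(2D) via Equation (3).
        # x, relevant, omega, rho, lower, upper are length-D arrays; `relevant`
        # holds the values delta_i(x).
        theta = np.pi * rho * x / (upper - lower)
        emb = omega[:, None] * np.column_stack([np.sin(theta), np.cos(theta)])
        emb[~relevant] = 0.0               # irrelevant dimensions map to the origin
        return emb.ravel()

    def arc_kernel(X, R, omega, rho, lower, upper, sigma2=1.0):
        # Exponentiated-quadratic covariance applied to distances between
        # embedded points; X is (N, D), R is (N, D) boolean relevance indicators.
        E = np.array([arc_embed(x, r, omega, rho, lower, upper)
                      for x, r in zip(X, R)])
        sq = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)   # Delta^2
        return sigma2 * np.exp(-0.5 * sq)

    # Tiny example with two conditional dimensions.
    D = 2
    omega, rho = np.ones(D), 0.5 * np.ones(D)
    lower, upper = np.zeros(D), np.ones(D)
    X = np.array([[0.1, 0.7], [0.1, 0.2], [0.1, 0.9]])
    R = np.array([[True, False], [True, True], [True, False]])
    K = arc_kernel(X, R, omega, rho, lower, upper)
    # Rows 0 and 2 differ only in an irrelevant dimension, so K[0, 2] == K[0, 0].
    print(np.allclose(K[0, 2], K[0, 0]))

The zeroed embedding for irrelevant dimensions is what yields the constant cross-relevance distance ω_i in Equation (4), and hence the desiderata (1) and (2).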
We now show that the arc kernel yields better results than other alternatives. We perform two types of experiments: first, we study model quality in isolation in a regression task; second, we study the effect of the arc kernel on BO performance. All GP models use a Matérn 5/2 kernel.

Data. We use two different datasets, both of which are common in the deep learning literature. The first is the canonical MNIST digits dataset [15], where the task is to classify handwritten digits. The second is the CIFAR-10 object recognition dataset [16]. We pre-processed CIFAR-10 by extracting features according to the pipeline given in [17].

3.1 Model Quality Experiments
Models. Our first experiments concern the quality of the regression models used to form the response surface for Bayesian optimization. We generated data by performing 10 independent runs of Bayesian optimization on MNIST and then treat this as a regression problem. We compare the GP with arc kernel (Arc GP) to several baselines: the first baseline is a simple linear regression model, and the second is a GP where irrelevant dimensions are simply filled in randomly for each input. We also compare to the case where each architecture uses its own separate GP, as in [5]. The results are averaged over several train/test splits. Kernel parameters were inferred using slice sampling [18]. As the errors lie between 0 and 1, with many distributed toward the lower end, it can be beneficial to take the log of the outputs before modelling them with a GP. We experiment with both the original and transformed outputs.

Method          | Original data | Log outputs
Separate Linear | … ± 0.045     | … ± …
Separate GP     | … ± 0.038     | … ± …
Separate Arc GP | … ± 0.030     | … ± …
Linear          | … ± 0.043     | … ± …
GP              | … ± 0.031     | … ± …
Arc GP          | … ± 0.033     | … ± …

Table 1: Normalized Mean Squared Error on MNIST Bayesian optimization data.
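As a point of reference for the evaluation above, the following sketch shows one common definition of normalized mean squared error together with the log-output transform discussed in the Models paragraph. The exact normalization used for Table 1 is not stated in the text, so this particular definition is an assumption.

    import numpy as np

    def normalized_mse(y_true, y_pred):
        # One common definition: MSE divided by the variance of the targets.
        # Whether this matches the normalization used for Table 1 is an assumption.
        return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

    # Classification errors lie in (0, 1) and cluster near 0, so the GP can be
    # fit to log-errors instead; predictions are mapped back with exp.
    errors = np.array([0.02, 0.05, 0.01, 0.30])
    log_errors = np.log(errors)          # transformed targets for the GP
    recovered = np.exp(log_errors)       # invert after prediction
    print(normalized_mse(errors, recovered))   # ~0.0 up to floating point error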
Results. Table 1 shows that a GP using the arc kernel performs favourably compared to a GP that ignores the relevance information of each point. The "separate" categories apply a different model to each layer and therefore do not take advantage of dependencies between layers. Interestingly, the separate Arc GP, which is effectively just a standard GP with an additional embedding, performs comparably to a standard GP, suggesting that the embedding does not limit the expressiveness of the model.

3.2 Bayesian Optimization Experiments

In this experiment, we test the ability of Bayesian optimization to tune the hyperparameters of each layer of a deep neural network. We allow the neural networks for these problems to use up to several hidden layers (or no hidden layer). We optimize over learning rates, L2 weight constraints, dropout rates [19], and the number of hidden units per layer, leading to a large number of hyperparameters and architectures. On MNIST, most effort is spent improving the error by a fraction of a percent; we therefore optimize this dataset using the log-classification error. For CIFAR-10, we use classification error as the objective. We use the Deepnet package (https://github.com/nitishsrivastava/deepnet), and each function evaluation took a considerable amount of time to run on NVIDIA GTX Titan GPUs. Note that when a network of depth n is tested, all hyperparameters from layers n + 1 onward are deemed irrelevant.
Experimental Setup. For Bayesian optimization, we follow the methodology of [10], using slice sampling and the expected improvement heuristic. In this methodology, the acquisition function is optimized by first selecting from a pre-determined grid of points lying in [0, 1]^D, distributed according to a Sobol sequence. Our baseline is a standard Gaussian process over this space that is agnostic to whether particular dimensions are irrelevant for a given point.
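A minimal sketch of this acquisition-optimization step, assuming standard expected improvement for minimization and SciPy's Sobol generator as a stand-in for the candidate grid. Here gp_predict is a hypothetical placeholder for the posterior of a GP (e.g. one using the arc kernel); it is not part of the paper.

    import numpy as np
    from scipy.stats import norm, qmc

    def expected_improvement(mu, sigma, best):
        # Standard EI for minimization, given posterior means/stds at candidates.
        sigma = np.maximum(sigma, 1e-12)
        z = (best - mu) / sigma
        return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Pre-determined grid of candidate points in [0, 1]^D from a Sobol sequence.
    D = 4
    grid = qmc.Sobol(d=D, scramble=True).random(1024)

    def gp_predict(X):
        # Hypothetical stand-in for the GP posterior at the candidate points.
        mu = np.sin(X).sum(axis=1)
        return mu, 0.1 * np.ones(len(X))

    mu, sigma = gp_predict(grid)
    best_observed = 0.5
    next_point = grid[np.argmax(expected_improvement(mu, sigma, best_observed))]
    print(next_point)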
Results. Figure 2 shows that on these datasets, using the arc kernel consistently reaches good solutions faster than the naive baseline, or it finds a better solution. In the case of MNIST, the best discovered model achieved a test error competitive with that reported in [20] for a similar model. Similarly, our best model for CIFAR-10 achieved a test error competitive with a support vector machine trained on the same feature extraction pipeline.

Figure 2: Bayesian optimization results using the arc kernel. (a) MNIST: log-classification error versus number of models trained. (b) CIFAR-10: classification error versus number of models trained. Each panel compares the Arc GP to the baseline.
Figure 3: Relative fraction of neural net architectures searched on the CIFAR-10 dataset (proportion of evaluations per architecture size, Arc GP versus baseline).

Figure 3 shows the proportion of function evaluations spent on each architecture size for the CIFAR-10 experiments. Interestingly, the baseline tends to favour smaller models, while a GP using the arc kernel distributes its efforts amongst deeper architectures that tend to yield better results.

We introduced the arc kernel for conditional parameter spaces, which facilitates modelling the performance of deep neural network architectures by enabling the sharing of information across architectures where useful. Empirical results show that this kernel improves GP model quality and GP-based Bayesian optimization results over several simpler baseline kernels. Allowing information to be shared across architectures improves the efficiency of Bayesian optimization and removes the need to manually search for good architectures. The resulting models perform favourably compared to established benchmarks by domain experts.

The authors would like to thank Ryan P. Adams for helpful discussions.
References

[1] Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.
[2] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[3] Matthew A. Taddy, Robert B. Gramacy, and Nicholas G. Polson. Dynamic trees for learning and design. Journal of the American Statistical Association, 106(493):109–123, 2011.
[4] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, pages 507–523, 2011.
[5] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.
[6] Frank Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, Department of Computer Science, Vancouver, Canada, October 2009.
[7] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
[8] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115–123, 2013.
[9] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD'13, pages 847–855, 2013.
[10] Jasper Snoek, Hugo Larochelle, and Ryan Prescott Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[12] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[13] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, pages 1045–1048, 2010.
[14] Frank Hutter and Michael A. Osborne. A kernel for hierarchical parameter spaces. arXiv:1310.5738, 2013.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.
[17] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Artificial Intelligence and Statistics, 2011.
[18] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, 2010.
[19] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[20] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, 2013.