Multi-level CNN for lung nodule classification with Gaussian Process assisted hyperparameter optimization
Miao Zhang a,b, Huiqi Li a,∗, Juan Lyu b,c, Sai Ho Ling b, Steven Su b

a School of Information and Electronics, Beijing Institute of Technology, Beijing, 100081, China
b Faculty of Engineering and Information Technology, University of Technology Sydney (UTS), 15 Broadway, Ultimo, NSW 2007, Australia
c College of Information and Communication Engineering, Harbin Engineering University, Harbin, 150001, China
Abstract
This paper investigates lung nodule classification by using deep neural networks (DNNs). DNNs have shown their superiority on several medical image processing problems, such as medical image segmentation and medical image synthesis, but their performance depends heavily on an appropriate hyperparameter setting. Hyperparameter optimization in DNNs is a computationally expensive problem, where evaluating a hyperparameter configuration may take several hours or even days. Bayesian optimization has recently been introduced for automatically searching for optimal hyperparameter configurations of DNNs. It applies probabilistic surrogate models, such as Gaussian processes, to approximate the validation error function of hyperparameter configurations, and reduces the computational complexity to a large extent. However, most existing surrogate models adopt stationary covariance functions (kernels) that measure the difference between hyperparameter points based on spatial distance without considering their spatial locations. This distance-based assumption, together with the condition of constant smoothness throughout the whole hyperparameter search space, clearly violates the property that points far away from the optimal point usually get similarly poor performance even though any two of them may have a huge spatial distance between them. In this paper, a non-stationary kernel is proposed which allows the surrogate model to adapt to functions whose smoothness varies with the spatial location of inputs, and a multi-level convolutional neural network (ML-CNN) is built for lung nodule classification, whose hyperparameter configuration is optimized by using the proposed non-stationary kernel based Gaussian surrogate model. Our algorithm searches the surrogate for the optimal setting via a hyperparameter importance based evolutionary strategy, and the experiments demonstrate that our algorithm outperforms manual tuning and well-established hyperparameter optimization methods such as Random search, Gaussian processes (GP) with stationary kernels, and the recently proposed Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD).

Keywords: Hyperparameter optimization, Gaussian process, Non-stationary kernel, Evolutionary strategy

∗ Corresponding author. Email address: [email protected] (Huiqi Li)
1. Introduction
Lung cancer is a notoriously aggressive cancer, with sufferers having an average 5-year survival rate of 18% and a mean survival time of less than 12 months [36], so early diagnosis is very important to improve the survival rate. Recently, deep learning has shown its superiority in computer vision [13, 21, 43], and more researchers try to diagnose lung cancers with deep neural networks as Computer Aided Diagnosis (CAD) systems to assist early diagnosis [2, 14, 39]. In our previous work [26], a multi-level convolutional neural network (ML-CNN) was proposed to handle lung nodule malignancy classification, which extracts multi-scale features through different convolutional kernel sizes. Our ML-CNN achieves state-of-the-art accuracies in both binary and ternary classification (92.21% and 84.81%, respectively) without any preprocessing. However, the experiments also demonstrate that the performance is very sensitive to the hyperparameter configuration, especially the number of feature maps in every convolutional layer, where we obtained the near-optimal hyperparameter configuration through trial and error.

Automatic hyperparameter optimization is crucial for applying deep learning algorithms in practice, and several methods, including Grid search [22], Random search [5], the Tree-structured Parzen Estimator approach (TPE) [4] and Bayesian optimization [37], have shown their superiority over manual search in the hyperparameter optimization of deep neural networks. Hyperparameter optimization in deep neural networks is a global optimization problem with a black-box, expensive function, where evaluating a hyperparameter choice may cost several hours or even days. It is a computationally expensive problem, and a popular solution is to employ a probabilistic surrogate, such as Gaussian processes (GP) or the Tree-structured Parzen Estimator (TPE), to approximate the expensive error function and guide the optimization process. A stationary covariance function (kernel) is usually used in these surrogates, which depends only on the spatial distance between two hyperparameter configurations, not on the hyperparameters themselves. Such a covariance function, which assumes constant smoothness throughout the hyperparameter search space, clearly violates the intuition that most points away from the optimal point all get similarly poor performance even though the spatial distance between any two of them may be large.

In this paper, the deep neural network for lung nodule classification is built based on multi-level convolutional neural networks, which design three levels of CNNs with the same structure but different convolutional kernel sizes to extract multi-scale features from inputs with variable nodule sizes and morphologies. The hyperparameter optimization of the deep convolutional neural network is then formulated as an expensive optimization problem, and a Gaussian surrogate model based on a non-stationary kernel is built to approximate the error function of hyperparameter configurations, which allows the model to adapt to functions whose smoothness varies with the inputs. Our algorithm searches the surrogate via a hyperparameter importance based evolutionary strategy and can find a near-optimal hyperparameter setting within a limited number of function evaluations. We name our algorithm Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy, or HOUSES for short.
We have compared our algorithm against several well-established hyperparameter optimization algorithms, including Random search, Gaussian processes with stationary kernels, and Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD) [17]. The main contributions of our paper are summarized as fourfold:

(1) A multi-level convolutional neural network is adopted for lung nodule malignancy classification, whose hyperparameter optimization is formulated as a computationally expensive optimization problem.

(2) A surrogate-assisted evolutionary strategy is introduced as the framework to solve the hyperparameter optimization of ML-CNN, which utilizes a hyperparameter importance based mutation as the sampling method for efficient candidate point generation.

(3) A non-stationary kernel is proposed as the covariance function to define the relationship between different hyperparameter configurations, which allows the model to adapt its spatial dependence structure as a function of location. Different from a constant smoothness throughout the whole sampling region, our non-stationary GP regression model satisfies the assumption that the correlation function no longer depends on distance only, but also on the relative locations of the points to the optimal point. An input-warping method is also adopted, which makes the covariance function more sensitive near the hyperparameter optimum.

(4) Extensive experiments illustrate the superiority of the proposed HOUSES for hyperparameter optimization of deep neural networks.

We organize this paper as follows: Section II introduces the background on lung nodule classification, hyperparameter optimization in deep neural networks, and surrogate-assisted evolutionary algorithms. Section III describes the proposed non-stationary covariance function for hyperparameter optimization in deep neural networks and the framework and details of Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for ML-CNN. The experimental design is described in Section IV, and we demonstrate the experimental results with discussions against state-of-the-art hyperparameter optimization approaches in Section V. We conclude and describe future work in Section VI.
2. Related Works
Deep neural networks have shown their superiority over conventional algorithms in computer vision applications, and more researchers are applying DNNs to medical imaging diagnosis. Paper [41] presents different deep architectures for lung cancer diagnosis, including the stacked denoising autoencoder, deep belief network, and convolutional neural network, which obtain binary classification accuracies of 79.76%, 81.19% and 79.29%, respectively. Shen et al. [34] proposed Multi-scale Convolutional Neural Networks (MCNN), which utilize multi-scale nodule patches to sufficiently quantify nodule characteristics, and obtained a binary classification accuracy of 86.84%. In MCNN, three CNNs that take differently scaled nodule patches as inputs are assembled in parallel, and the outputs of their fully-connected layers are concatenated as the resulting output. The experiments showed that multi-scale inputs could help a CNN learn a set of discriminative features. In 2017, they extended their research and proposed a multi-crop CNN (MC-CNN) [35], which automatically extracts nodule features by adopting a multi-crop pooling strategy, and obtained 87.14% binary and 62.46% ternary classification accuracy. In our previous work [26], a multi-level convolutional neural network (ML-CNN) was proposed which extracts multi-scale features through different convolutional kernel sizes. It also designs three CNNs with the same structure but different convolutional kernel sizes to extract multi-scale features for variable nodule sizes and morphologies. Our ML-CNN achieves state-of-the-art accuracies in both binary and ternary classification (92.21% and 84.81%, respectively) without any additional hand-crafted preprocessing. Even though these deep learning methods are end-to-end machine learning architectures and have shown their superiority over conventional methods, their structure design and hyperparameter configuration are based on human experts' experience through trial-and-error search guided by intuition, which is a difficult and time-consuming task [28, 9].
Determining appropriate values of the hyperparameters of a DNN is a frustratingly difficult task: all feasible hyperparameter configurations form a huge space, from which we need to choose the optimal one. Setting correct hyperparameters is often critical for reaching the full potential of the chosen or designed deep neural network; otherwise it may severely hamper the performance. Hyperparameter optimization in a DNN is a global optimization to find the D-dimensional hyperparameter setting x that minimizes the validation error f of the DNN with learned parameters θ. The optimal x can be obtained by optimizing f as follows:

$\min_{x \in \mathbb{R}^D} f(x, \theta; Z_{val}) \quad \text{s.t.} \quad \theta = \arg\min_{\theta} f(x, \theta; Z_{train})$   (1)

where Z_train and Z_val are the training and validation datasets, respectively. Solving Eq.(1) is very challenging due to the high complexity of the function f, and it is usually accomplished manually in the deep learning community, largely depending on expert experience or intuition. It is also hard to reproduce similar results when a configuration is applied to different datasets or problems. There are several systematic approaches to tune hyperparameters in the machine learning community, such as Grid search, Random search, and Bayesian optimization methods. Grid search is the most common strategy for hyperparameter optimization [22]; it is simple to implement with parallelization, which makes it reliable in low-dimensional spaces (e.g., 1-d, 2-d). However, Grid search suffers from the curse of dimensionality because the search space grows exponentially with the number of hyperparameters. Random search [5] proposes to randomly sample points from the hyperparameter configuration space. Although this approach looks simple, it can find hyperparameter configurations comparable to Grid search with less computation time. Hyperparameter optimization in deep neural networks is a computationally expensive problem where evaluating a hyperparameter choice may cost several hours or even days. This property also makes it unrealistic to sample enough points for evaluation in Grid and Random search. One popular approach is to use efficient surrogates to approximate the computationally expensive fitness function and guide the optimization process. Bayesian optimization [37] builds a probabilistic Gaussian surrogate model to estimate the distribution of computationally expensive validation errors. The hyperparameter configuration space is usually modeled as smooth, which means that knowing the quality of certain points can help infer the quality of their nearby points, and Bayesian optimization [4, 33, 3] utilizes this smoothness assumption to assist the search for hyperparameters. The Gaussian process is the most common method for modeling loss functions in Bayesian optimization because it is simple and flexible. There are several acquisition functions to determine the next promising points for a Gaussian process, including Probability of Improvement (PI), Expected Improvement (EI), Upper Confidence Bound (UCB) and Predictive Entropy Search (PES) [37, 15].

Surrogate-assisted evolutionary algorithms were designed to solve expensive optimization problems whose fitness functions are highly computationally expensive [19, 20, 10]. They usually utilize computationally efficient models, also called surrogates, to approximate the fitness function.
The surrogate model is built as:

$\hat{f}(x) = f^*(x) + \xi(x)$   (2)

where f* is the true fitness value, f̂ is the approximated fitness value, and ξ is the error function to be minimized by the selected surrogate. A surrogate-assisted evolutionary algorithm uses one or several surrogate models f̂ to approximate the true fitness value f* and uses the computationally cheap surrogate to guide the search process [45]. One iteration of the surrogate-assisted evolutionary algorithm is described as: 1) learn the surrogate model f̂ based on previously truly evaluated points (x, f(x)); 2) utilize f̂ to evaluate newly mutation-generated points and find the most promising individual x*; 3) evaluate the true fitness value of the additional point (x*, f(x*)); 4) update the training set. Gaussian processes, polynomials, Radial Basis Functions (RBFs), neural networks, and Support Vector Machines are the major techniques used to approximate the true objective function in surrogate model learning. A non-stationary covariance function based Gaussian process is adopted as the surrogate model in this paper, which allows the model to adapt its spatial dependence structure to varying locations and satisfies our assumption that hyperparameter configurations perform well near the optimal point and poorly away from it. An evolutionary strategy is then used to search for the near-optimal hyperparameter configuration. The next section presents the details of our Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for ML-CNN.
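To make the four-step iteration above concrete, the following is a minimal sketch of such a surrogate-assisted loop, using scikit-learn's Gaussian process as the surrogate. The toy quadratic f_true stands in for the expensive validation-error function, and the mutation operator is deliberately simplified; neither reflects the exact operators used in this paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for the expensive true fitness f* (a real call would train a DNN
# for hours and return its validation error); the loop structure is what matters.
f_true = lambda x: float(np.sum((x - 0.3) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(10, 5))       # initial design (LHS in HOUSES)
y = np.array([f_true(x) for x in X])          # truly evaluated points

for generation in range(20):
    # 1) learn surrogate f_hat from previously truly evaluated points (x, f(x))
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    # 2) mutate the current best points to generate candidate configurations
    parents = X[np.argsort(y)[:5]]
    children = parents[rng.integers(5, size=50)] \
        + 0.05 * rng.standard_normal((50, 5))
    # evaluate candidates on the cheap surrogate, pick the most promising one
    x_star = children[np.argmin(gp.predict(children))]
    # 3) truly evaluate the promising point; 4) update the training set
    X, y = np.vstack([X, x_star]), np.append(y, f_true(x_star))
```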
3. Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy
In our previous work [26], a multi-level convolutional neural network was proposed for lung nodule classification, which applies different kernel sizes in three parallel levels of CNNs to effectively extract different features of lung nodules with different sizes and various morphologies. Fig. 1 presents the structure of ML-CNN, which contains three levels of CNNs, each with the same structure but a different kernel size.

[Figure 1: The structure of the proposed ML-CNN for lung nodule malignancy classification [26].]

As suggested in our previous work, the number of feature maps in each convolutional layer has a significant impact on the performance of ML-CNN, as do the dropout rates. The hyperparameter configuration of ML-CNN in [26] was obtained through trial-and-error manual search, which is time-consuming for researchers and gives no guarantee of reaching an optimal configuration. In this section, we introduce Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for our ML-CNN for lung nodule classification, which can automatically find a competitive or even better hyperparameter configuration than manual search without excessive computational cost. The framework of the proposed HOUSES for ML-CNN is presented in Algorithm 1. In our hyperparameter optimization method, a non-stationary kernel is proposed as the covariance function to define the relationship between different hyperparameter configurations, which allows the model to adapt its spatial dependence structure as a function of location, and the algorithm searches for the most promising hyperparameter values on the surrogate model through an evolutionary strategy. In HOUSES, several initial hyperparameter configuration points are randomly generated through Latin Hypercube Sampling (LHS) [18] to keep the diversity of the initial population. These initial points are truly evaluated and used as the training set
Tr = {(x_i, f_i)}_{i=1}^{n} to build the initial surrogate model. Then the evolutionary strategy generates a group of new points, which are evaluated according to the acquisition function of the surrogate model. The most promising individuals x* among the newly generated points are selected based on the acquisition function and then truly evaluated. The most promising points with true fitness values (x*, f(x*)) are added to the training set to update the surrogate model. We describe our HOUSES in the following paragraphs.

Algorithm 1: General Framework of HOUSES
Input: initial population size n, maximum generation g_max, mutation rate p_m, number m of new points generated per generation, Dataset, DNN model
Output: best hyperparameter configuration c_best for the DNN model
Divide the dataset into training, validation and testing sets.
Initialization: a hyperparameter configuration population pop is randomly generated through Latin Hypercube Sampling. These hyperparameter points are used to train the DNN model on the training set and truly evaluated on the validation set to get the true fitness values Tr = {(x_i, f_i)}_{i=1}^{n}.
while maximum generation g_max is not reached do
  1. Use Tr to fit or update the Gaussian surrogate model f̂ according to Eq.(3);
  2. pop_selected = select(pop_g)  // select individuals with good performance and diversity for mutation;
  3. pop_m = mutation(pop_selected)  // apply the mutation operation to the selected points to generate m new points;
  4. Calculate {(x_i, f̂_i)}_{i=1}^{m} for the m newly generated points based on the Gaussian surrogate model and the acquisition functions Eq.(14)-(16);
  5. Set x* = argmin{f̂_i}_{i=1}^{m};
  6. Truly evaluate f(x*) on the training and validation sets to get its true fitness value;
  7. Update Tr_{g+1} = {Tr_g ∪ (x*, f(x*))};
end
Return the hyperparameter configuration c_best.

3.1. Surrogate model building

Gaussian process (also known as Kriging) is chosen as the surrogate model in HOUSES for searching the most promising hyperparameters. It uses a generalization of the Gaussian distribution to describe a function, defined by a mean µ and a covariance function σ:

$\hat{f}(x) \sim \mathcal{N}(\mu(x), \sigma(x))$   (3)

Given training data consisting of n D-dimensional inputs and outputs {x_{1:n}, f_{1:n}}, where x_i ∈ R^D and f_i = f(x_i), the predictive distribution of the Gaussian process at an unknown input x* is calculated as:

$\mu(x^*) = K_*(K + \theta_c I)^{-1} f_{1:n}$   (4)

$\sigma(x^*) = K_{**} - K_*(K + \theta_c I)^{-1} K_*^T$   (5)

where K_* = [k(x*, x_1), ..., k(x*, x_n)], K_{**} = k(x*, x*), θ_c is a noise parameter, and K is the associated covariance matrix, built as:

$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}$   (6)

k is a covariance function that defines the relationship between points in the form of a kernel. A commonly used kernel is the automatic relevance determination (ARD) squared exponential covariance function:

$k(x_i, x_j) = \theta_f \exp\left( \sum_{d=1}^{D} -\frac{(x_i^d - x_j^d)^2}{\theta_d^2} \right)$   (7)

In the hyperparameter optimization of DNNs, two far-apart hyperparameter points usually both perform poorly when they are far away from the optimal point. This property means that the correlation of two hyperparameter configurations depends not only on the distance between them, but also on the points' spatial locations. Stationary kernels, such as the Gaussian kernel, clearly cannot satisfy this property of hyperparameter optimization in DNNs. To account for this non-stationarity, we propose a non-stationary covariance function, where we use the relative distance to the optimal point to measure the spatial location difference of two hyperparameter points. The relative distance based kernel is defined as:

$k(x_i, x_j) = \theta_f \exp\left( \sum_{d=1}^{D} -\frac{\left( |x_i^d - s^d| - |x_j^d - s^d| \right)^2}{\theta_d^2} \right)$   (8)

where s is the assumed optimal point. It is also easy to prove that this relative distance based covariance function k(x_i, x_j) is a kernel, based on Theorem 1.
Eq.(8) can be obtained by setting ψ(x) = |x − s| and k′ as the Gaussian kernel. This relative distance based kernel is no longer a function of the distance between two points, but depends on their own spatial locations relative to the optimal point.

Theorem 1. If ψ is an R^D-valued function on X and k′ is a kernel on R^D × R^D, then

$k(\mathbf{x}, \mathbf{z}) = k'(\psi(\mathbf{x}), \psi(\mathbf{z}))$   (9)

is also a kernel.

Proof: k′: R^D × R^D → R and ψ: R^D → R^D. Since k′ is a valid kernel, we have k′(x, z) = ϕ(x)^T ϕ(z), so that k(x, z) = ϕ(ψ(x))^T ϕ(ψ(z)) is a kernel.

3.2.2. Input warping

In the hyperparameter optimization of machine learning models, objective functions are usually much more sensitive near the optimal hyperparameter setting than far away from the optimum. For example, if the optimal learning rate is 0.05, one may obtain a 50% performance increase when the learning rate changes from 0.04 to 0.05, but perhaps only a 5% increase from 0.25 to 0.24. Traditionally, researchers often use a logarithm function to transform the input space and then search in the transformed space, which is effective only when the non-stationary property of the input space is known in advance. Recently, a beta cumulative distribution function has been proposed as the input warping transformation [38, 42]:

$w_d(x_d) = \int_0^{x_d} \frac{u^{\alpha_d - 1}(1 - u)^{\beta_d - 1}}{B(\alpha_d, \beta_d)} \, du$   (10)

where B(α_d, β_d) is the beta function, and the shape of the warping function is adjusted to the original data through the parameters α_d and β_d.

Different from [38, 42], we take only the relative distance to the local optimum as the input to be warped, which makes the kernel function more sensitive to small inputs and less sensitive to large ones. We take the Kumaraswamy cumulative distribution function as the substitute, not only for computational reasons, but also because it more easily fulfills the non-stationary property of our kernel function after the spatial location transformation:

$w_d(x_d) = 1 - (1 - x_d^{\alpha_d})^{\beta_d}$   (11)

Similar to Eq.(9), it is easy to prove that k(x, x′) = k′(w(ψ(x)), w(ψ(x′))) is a kernel. Fig. 2 illustrates input warping with different shape parameters α_d and β_d.

[Figure 2: Example of how the Kumaraswamy cumulative distribution function transforms a concave function into a convex function, which makes the kernel function much more sensitive to small inputs.]

The final kernel for our HOUSES is defined as:

$k(x_i, x_j) = \theta_f \exp\left( \sum_{d=1}^{D} -\frac{\left( w_d(|x_i^d - s^d|) - w_d(|x_j^d - s^d|) \right)^2}{\theta_d^2} \right) + \theta_k \exp\left( \sum_{d=1}^{D} -\frac{w_d(|x_i^d - x_j^d|)^2}{\gamma_d^2} \right)$   (12)

Eq.(12) is also proven to be a kernel, based on Theorem 2.
This non-stationary kernel Eq.(12) satisfies the assumption that the correlation function of two hyperparameter configurations depends not only on their distance, but also on their relative locations to the optimal point. However, it is impossible to know the optimal point in advance, so we instead use the hyperparameter configuration with the best performance in the training set, and update it in every iteration of our proposed HOUSES.
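As an illustration, a minimal sketch of Eq.(12) is given below. The Kumaraswamy shape parameters, the amplitudes theta_f and theta_k, and a single shared length scale per term (in place of the per-dimension θ_d and γ_d) are illustrative assumptions rather than the settings used in our experiments, and configurations are assumed to be normalized to [0, 1]^D.

```python
import numpy as np

def kumaraswamy(u, alpha=2.0, beta=5.0):
    """Input warping w(u) of Eq.(11); u is assumed to lie in [0, 1]."""
    return 1.0 - (1.0 - u ** alpha) ** beta

def houses_kernel(xi, xj, s, theta_f=1.0, theta_k=1.0,
                  theta_d=1.0, gamma_d=1.0):
    """Non-stationary covariance of Eq.(12) between configurations xi and xj,
    with s the best configuration found so far (stand-in for the optimum)."""
    wi = kumaraswamy(np.abs(xi - s))   # warped relative distance of xi to s
    wj = kumaraswamy(np.abs(xj - s))   # warped relative distance of xj to s
    nonstat = theta_f * np.exp(-np.sum((wi - wj) ** 2) / theta_d ** 2)
    stat = theta_k * np.exp(
        -np.sum(kumaraswamy(np.abs(xi - xj)) ** 2) / gamma_d ** 2)
    return nonstat + stat

# Two points that are far from each other but equally far from s now
# correlate strongly through the first (non-stationary) term:
s = np.full(3, 0.5)
print(houses_kernel(np.zeros(3), np.ones(3), s))
```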
Theorem 2. If k_1 is a kernel on R^D × R^D and k_2 is also a kernel on R^D × R^D, then

$k(\mathbf{x}, \mathbf{z}) = k_1(\mathbf{x}, \mathbf{z}) + k_2(\mathbf{x}, \mathbf{z})$   (13)

is also a kernel.

Proof: If k_1(x, z) and k_2(x, z) are valid kernels on R^D × R^D → R, then we have k_1(x, z) = ϕ^T(x)ϕ(z) and k_2(x, z) = ψ^T(x)ψ(z). We may define θ(x) = ϕ(x) ⊕ ψ(x) = [ϕ(x), ψ(x)]^T, so that k(x, z) = θ(x)^T θ(z) is a kernel.

3.3. Acquisition function

After building a surrogate model, an acquisition function is required to choose the most promising point for true evaluation. Different from the surrogate model, which approximates the optimization problem, the acquisition function is utilized to find the most probable optimal solution. We apply three different acquisition functions for Gaussian process (GP) based hyperparameter optimization:

• Probability of Improvement:

$\alpha_{PI}(x) = \Phi(\gamma(x)), \qquad \gamma(x) = \frac{f(x_{best}) - \mu(x)}{\sigma(x)}$   (14)

where $\Phi(z) = (2\pi)^{-1/2} \int_{-\infty}^{z} \exp(-t^2/2) \, dt$.

• Expected Improvement:

$\alpha_{EI}(x) = \sigma(x)\left( \gamma(x)\Phi(\gamma(x)) + N(\gamma(x)) \right)$   (15)

where N(z) is the density of the standard Gaussian distribution, z ∼ N(0, 1).

• Upper Confidence Bound:

$\alpha_{UCB}(x) = \mu(x) + w\,\sigma(x)$   (16)

with a tunable w to balance exploitation against exploration [17].

3.4. Hyperparameter importance based mutation

The mutation operator aims to generate better individuals by mutating selected excellent individuals, which is a key step of optimization in an evolutionary strategy. To maintain the diversity of the population, a uniform selection strategy is adopted in mutation: every dimension is first divided into M uniform grids [44], and the point with the highest fitness in every dimensional grid is selected for mutation. In this way, D × M individuals are selected, and polynomial mutation is applied to every selected individual to generate n_d candidate hyperparameter points each. These D × M × n_d points are evaluated with the acquisition function, and the most promising point is selected for true evaluation and added into the training set to update the surrogate model.

[Figure 3: Functional ANOVA based marginal response performance of the number of feature maps of all convolutional layers in the three different levels of ML-CNN. The first two parameters are for the two convolutional layers in the first level, the middle two are for the second level, and the last two are for the third level. Results show that the latter layer in each of the three levels brings more effect to the performance, while there is no significant difference among all possible configurations for the earlier feature-map numbers in each level of ML-CNN.]

Our mutation probabilities are guided by hyperparameter importance, assessed with the functional ANOVA framework [16], which decomposes the predicted performance ŷ(θ) over the hyperparameter set N into additive components over all subsets U ⊆ N:

$\hat{y}(\theta) = \sum_{U \subseteq N} \hat{f}_U(\theta_U)$   (17)

where the component f̂_U(θ_U) is defined as:

$\hat{f}_U(\theta_U) = \begin{cases} \hat{f}_{\emptyset} & \text{if } U = \emptyset \\ \hat{a}_U(\theta_U) - \sum_{W \subsetneq U} \hat{f}_W(\theta_W) & \text{otherwise} \end{cases}$   (18)

where the constant f̂_∅ is the mean value of the function over its domain, and â_U(θ_U) is the marginal predicted performance, defined as $\hat{a}_U(\theta_U) = \frac{1}{\|\Theta_T\|} \int \hat{y}(\theta_{N|U}) \, d\theta_T$. Subsets with |U| > 1 capture interaction effects between hyperparameters, while we only consider separate hyperparameter importance in this paper and set |U| = 1.
The component function f̂_U(θ_U) is then calculated as:

$\hat{f}(\theta_d) = \hat{a}(\theta_d) = \frac{1}{\|\Theta_T\|} \int \hat{y}(\theta_{N|d}) \, d\theta_T$   (19)

where θ_d is a single hyperparameter, T = N \ d, Θ_T = Θ \ θ_d, and Θ = θ_1 × · · · × θ_D. The variance of the response performance ŷ across its domain Θ is:

$V = \sum_{d=1}^{D} V_d, \qquad V_d = \frac{1}{\|\theta_d\|} \int \hat{f}(\theta_d)^2 \, d\theta_d$   (20)

The importance of each hyperparameter can thus be quantified as:

$I_d = V_d / V$   (21)

When the polynomial mutation operator is applied to individuals, the genes corresponding to different hyperparameters have different mutation probabilities in terms of hyperparameter importance, where genes with larger importance have higher mutation probabilities and generate more offspring. In this way, our evolutionary strategy puts more emphasis on the subspaces of important hyperparameters and finds better hyperparameter settings.
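A minimal sketch of this importance-weighted mutation is given below, assuming configurations normalized to [0, 1]^D and a fitted surrogate gp. For brevity, the marginal â(θ_d) of Eq.(19) is approximated by sweeping one dimension over a grid while holding the others at mid-range, rather than integrating over the full domain, and a simple Gaussian perturbation stands in for polynomial mutation.

```python
import numpy as np

def importances(gp, D, grid=25):
    """Approximate I_d = V_d / V of Eqs.(19)-(21) from a fitted surrogate."""
    V = np.empty(D)
    base = np.full((grid, D), 0.5)                 # other dims fixed mid-range
    for d in range(D):
        theta = base.copy()
        theta[:, d] = np.linspace(0.0, 1.0, grid)  # sweep hyperparameter d
        a_d = gp.predict(theta)                    # marginal prediction a_hat
        V[d] = np.var(a_d)                         # V_d of Eq.(20)
    return V / V.sum()                             # I_d of Eq.(21)

def mutate(x, I, base_rate=0.2, scale=0.1, rng=np.random.default_rng(0)):
    """Genes with larger importance I_d mutate with higher probability."""
    mask = rng.random(x.size) < base_rate * I * I.size
    return np.clip(x + mask * scale * rng.standard_normal(x.size), 0.0, 1.0)
```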
4. Experimental Design
To examine the optimization performance of our proposed HOUSES for hyperparameter optimization, two sets of experiments have been conducted. In the first set, we test HOUSES on a Multi-layer Perceptron (MLP) network and LeNet applied to the popular MNIST dataset, and AlexNet applied to the CIFAR-10 dataset. The second set contains a single experiment whose target is to find an optimal hyperparameter configuration of ML-CNN applied to lung nodule classification. All experiments were performed with an Nvidia Quadro P5000 GPU (16.0 GB memory, 8873 GFLOPS). Our experiments are implemented in a Python 3.6 environment, and Tensorflow (https://github.com/tensorflow/tensorflow) and Tensorlayer (https://github.com/tensorlayer/tensorlayer) are used for building the deep neural networks. The following subsections present a brief introduction of the experimental problems, the peer algorithms, and the evaluation budget and experimental settings.

4.1. DNN problems

The first DNN problem in the first experimental set is an MLP network applied to MNIST, which consists of three dense layers with ReLU activation, dropout layers between them, and SoftMax at the end. The hyperparameters we optimize with HOUSES and the other peer algorithms include the dropout rate in each dropout layer and the number of units in the dense layers. This problem has 5 parameters to be optimized and is denoted 5-MLP in this paper. The second DNN problem is LeNet-5 applied to MNIST, which has 7 hyperparameters to be optimized and is denoted 7-CNN. The 7-CNN contains two convolutional blocks, each containing one convolutional layer with batch normalization, followed by ReLU activation and 2 × 2 max pooling. The third problem, denoted 9-CNN, is AlexNet applied to CIFAR-10 with 9 hyperparameters to be optimized. For the lung nodule classification problem, nodule patches are cropped to 52 × 52 as the input of our multi-level convolutional neural networks.
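For illustration, the 5-MLP search space could be encoded as simple ranges. The bounds below are assumptions made for the sketch, since the exact ranges are not listed here, and the key names are hypothetical.

```python
# Hypothetical encoding of the 5-MLP search space: three dense-layer widths
# and two dropout rates. The bounds are illustrative assumptions only.
search_space_5mlp = {
    "units_dense1":  (64, 1024),   # integer: units in the first dense layer
    "units_dense2":  (64, 1024),   # integer: units in the second dense layer
    "units_dense3":  (64, 1024),   # integer: units in the third dense layer
    "dropout_rate1": (0.1, 0.9),   # continuous: dropout after dense layer 1
    "dropout_rate2": (0.1, 0.9),   # continuous: dropout after dense layer 2
}
```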
4.2. Peer algorithms

We compare HOUSES against Random search, Gaussian processes (GP) with a Gaussian kernel, and Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD). We also compare three different acquisition functions for Gaussian process (GP) based hyperparameter optimization: Gaussian processes with Expected Improvement (GP-EI), Gaussian processes with Probability of Improvement (GP-PI), and Gaussian processes with Upper Confidence Bound (GP-UCB).
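For reference, a minimal sketch of the three acquisition functions of Eqs.(14)-(16) is shown below, computed from the surrogate's predictive mean mu and standard deviation sigma at candidate points; f_best denotes the best truly evaluated value so far.

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    """Eq.(14): Phi(gamma(x)) with gamma = (f_best - mu) / sigma."""
    return norm.cdf((f_best - mu) / sigma)

def expected_improvement(mu, sigma, f_best):
    """Eq.(15): sigma * (gamma * Phi(gamma) + N(gamma))."""
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def upper_confidence_bound(mu, sigma, w=2.0):
    """Eq.(16): mu + w * sigma, with w trading exploitation for exploration."""
    return mu + w * sigma
```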
4.3. Evaluation budget and experimental settings

Hyperparameter configuration evaluation is typically computationally expensive and constitutes most of the computation cost in the DNN hyperparameter optimization problem. For a fair comparison, we set the number of function evaluations to 200 for all comparing algorithms. The number of training iterations for the MNIST dataset is set to 100, while CIFAR-10 and LIDC-IDRI are set to 200 and 500, respectively. We implement Random search with the open-source HyperOpt library (http://hyperopt.github.io/hyperopt/). We use the public sklearn library to build the Gaussian process based surrogate models (https://scikit-learn.org/stable/modules/gaussian_process.html). The implementation of HORD is at bit.ly/hord-aaai. The code for hyperparameter importance assessment based on functional ANOVA is available at https://github.com/automl/fanova.

Table 1: Experimental mean accuracy of comparing algorithms on 4 DNN problems

DNN Problems    5-MLP    7-CNN    9-CNN    9-ML-CNN
Random Search   0.9731   0.9947   0.7429   0.8401
HORD            0.9684   0.9929   0.7471   0.8421
GP-EI           0.9647   0.9934   0.7546   0.8517
GP-PI           0.9645   0.9937   0.7650   0.8473
GP-UCB          0.9637   0.9942   0.7318   0.8457
HOUSES-EI       0.9698   0.9931   0.7642   0.8511
HOUSES-PI       0.9690   0.9949
Manual Tuning   -        -        -        0.8481
5. Experimental results and discussion
In this section, we evaluate the peer hyperparameter optimization algorithms on 3 DNN problems: MLP applied to MNIST (5-MLP), the LeNet network applied to MNIST (7-CNN), and AlexNet applied to CIFAR-10 (9-CNN). For the 5-MLP problem, Table 1 (Column 2) shows the test results of the different comparing methods, and Figure 4(a) also plots the average accuracy over epochs of the hyperparameter configurations obtained by the different hyperparameter optimization methods. One surprising observation from the table and figure is that the simplest Random Search method achieves satisfactory results, sometimes even outperforming some Bayesian optimization based methods (GPs and HORD). This phenomenon suggests that, for low-dimensional hyperparameter optimization, simple Random Search can perform very well, which is also in line with [5].

[Table 2: Comparison results (sensitivity, specificity and AUC) of Random Search, HORD, GP-EI, GP-PI, GP-UCB, HOUSES-EI, HOUSES-PI and HOUSES-UCB on the 5-MLP, 7-CNN and 9-CNN problems; the numeric values are not recoverable from the source.]

The assessment criteria are defined as:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}$   (22)
We also calculate the Area Under Curve (AUC) [7] of the Receiver Operating Characteristic (ROC) curve as an assessment criterion in Table 2. As demonstrated in Table 2, our HOUSES approach outperforms Random Search, HORD and standard kernel based Gaussian processes in accuracy, sensitivity and specificity on the 5-MLP and 9-CNN problems. On the 5-MLP problem, Random Search again achieves remarkable results, which also suggests that simple Random Search can perform very well in low-dimensional hyperparameter optimization. Although the classification rate increases by only 1-2 percentage points, this is a significant improvement for hyperparameter optimization. There are no large statistical differences between the comparing algorithms in the results of 7-CNN, which again demonstrates that a better neural network structure can significantly improve performance and relieve the burden of hyperparameter optimization.
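To make Eq.(22) and the AUC concrete, the following sketch computes all four criteria from hypothetical binary labels and scores with scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical validation labels and classifier scores, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)   # Eq.(22)
sensitivity = tp / (tp + fn)                 # Eq.(22)
specificity = tn / (tn + fp)                 # Eq.(22)
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve [7]
print(accuracy, sensitivity, specificity, auc)
```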
In this section, we evaluate HOUSES and all comparing algorithms applied to 9-ML-CNN (the Multi-Level Convolutional Neural Network applied to lung nodule classification, with 9 hyperparameters to be optimized); the results are demonstrated in Table 1 (Column 4) and Fig. 4. As expected, the performance of the conventional hyperparameter optimization methods degrades significantly in this complicated, high-dimensional search space, while HOUSES continues to obtain satisfactory results and outperforms the Gaussian process with stationary kernels. Similar to the results in the previous subsection, we found that UCB gets the best results.

[Figure 4: Testing accuracy of all hyperparameter optimization algorithms on the four DNN problems: (a) 5-MLP, (b) 7-CNN, (c) 9-CNN, (d) 9-ML-CNN.]
6. Conclusion
In this paper, Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy, named HOUSES, is proposed for CNN hyperparameter optimization. A non-stationary kernel is devised and adopted as the covariance function to define the relationship between different hyperparameter configurations.

[Table 3: Comparison results of the 9-ML-CNN problem for each class (Benign, Indeterminate, Malignant), reporting sensitivity, specificity and AUC for Random Search, HORD, GP-EI, GP-PI, GP-UCB, HOUSES-EI, HOUSES-PI, HOUSES-UCB and Manual Tuning; the numeric values are not recoverable from the source.]

References

[1] Armato SG 3rd, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, and E. A. Hoffman. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915, 2012.

[2] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network.
IEEE Transactions on Medical Imaging, 35(5):1207–1216, 2016.

[3] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search. 2012.

[4] James Bergstra and Yoshua Bengio. Algorithms for hyper-parameter optimization. In International Conference on Neural Information Processing Systems, pages 2546–2554, 2011.

[5] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305, 2012.

[6] Konstantinos Chatzilygeroudis, Roberto Rama, Rituraj Kaushik, Dorian Goepp, Vassilis Vassiliades, and Jean-Baptiste Mouret. Black-box data-efficient policy search for robotics. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.

[7] Y. Chien. Pattern classification and scene analysis. IEEE Transactions on Automatic Control, 19(4):462–463, 2003.

[8] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, and Michael Pringle. The cancer imaging archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.

[9] Hongbin Dong, Tao Li, Rui Ding, and Jing Sun. A novel hybrid genetic algorithm with granular information for feature selection and optimization. Applied Soft Computing, 65, 2018.

[10] Dominique Douguet. e-LEA3D: a computational-aided drug design web server. Nucleic Acids Research, 38(Web Server issue):615–21, 2010.

[11] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.

[12] Marc G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2(2):299–312, 2002.

[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Volume 00, pages 580–587, June 2014.

[14] Rotem Golan, Christian Jacob, and Jörg Denzinger. Lung nodule detection in CT images using deep convolutional neural networks. In International Joint Conference on Neural Networks, pages 243–250, 2016.

[15] Matthew W. Hoffman and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In International Conference on Neural Information Processing Systems, pages 918–926, 2014.

[16] Holger Hoos and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762, 2014.

[17] Ilija Ilievski, Taimoor Akhtar, Jiashi Feng, and Christine Shoemaker. Efficient hyperparameter optimization for deep learning algorithms using deterministic RBF surrogates, 2017.

[18] Ronald L. Iman. Latin Hypercube Sampling. John Wiley & Sons, Ltd, 2008.

[19] Yaochu Jin. Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation, 1(2):61–70, 2011.

[20] Yaochu Jin and B. Sendhoff. A systems approach to evolutionary multiobjective structural optimization and beyond. IEEE Computational Intelligence Magazine, 4(3):62–76, 2009.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.

[22] Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. Neural Networks: Tricks of the Trade, 1524(1):9–50, 1998.

[23] Joel Lehman and Kenneth O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 211–218. ACM, 2011.

[24] Kui Liu and Guixia Kang. Multiview convolutional neural networks for lung nodule classification. PLOS ONE, 12(11):12–22, 2017.

[25] Ilya Loshchilov and Frank Hutter. CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.

[26] Juan Lyu and Sai Ho Ling. Using multi-level convolutional neural network for classification of lung nodules on CT images. Pages 686–689. IEEE, 2018.

[27] Alan Tan Wei Min, Yew Soon Ong, Abhishek Gupta, and Chi Keong Goh. Multi-problem surrogates: Transfer evolutionary multiobjective optimization of computationally expensive problems. IEEE Transactions on Evolutionary Computation, PP(99):1–1, 2017.

[28] Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. 2017.

[29] Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.

[30] Anthony P. Reeves and Alberto M. Biancardi. The lung image database consortium (LIDC) nodule size report. 2011.

[31] Anthony P. Reeves, Alberto M. Biancardi, Tatiyana V. Apanasovich, Charles R. Meyer, Heber MacMahon, Edwin J. R. van Beek, Ella A. Kazerooni, David Yankelevitz, Michael F. McNitt-Gray, and Geoffrey McLennan. The lung image database consortium (LIDC): pulmonary nodule measurements, the variation, and the difference between different size metrics. In Medical Imaging 2007: Computer-Aided Diagnosis, pages 1475–1485, 2007.

[32] Rommel G. Regis and Christine A. Shoemaker. Combining radial basis function surrogates and dynamic coordinate search in high-dimensional expensive black-box optimization. Engineering Optimization, 45(5):529–555, 2013.

[33] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

[34] W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian. Multi-scale convolutional neural networks for lung nodule classification. Inf Process Med Imaging, 24:588–599, 2015.

[35] Wei Shen, Mu Zhou, Feng Yang, Dongdong Yu, Di Dong, Caiyun Yang, Yali Zang, and Jie Tian. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition, 61(61):663–673, 2017.

[36] R. L. Siegel, K. D. Miller, S. A. Fedewa, D. J. Ahnen, R. G. Meester, A. Barzi, and A. Jemal. Colorectal cancer statistics, 2017. CA Cancer J Clin, 67(3):104–17, 2017.

[37] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In International Conference on Neural Information Processing Systems, pages 2951–2959, 2012.

[38] Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682, 2014.

[39] Q. Song, L. Zhao, X. Luo, and X. Dou. Using deep learning for classification of lung nodules on computed tomography images. J Healthc Eng., 2017(1):1–7, 2017.

[40] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. Pages 497–504, 2017.

[41] Wenqing Sun, Bin Zheng, and Wei Qian. Computer aided lung cancer diagnosis with deep learning algorithms. In Medical Imaging 2016: Computer-Aided Diagnosis, 2016.

[42] Kevin Jordan Swersky. Improving Bayesian Optimization for Machine Learning using Expert Priors. PhD thesis, 2017.

[43] Naiyan Wang, Siyi Li, Abhinav Gupta, and Dit-Yan Yeung. Transferring rich feature hierarchies for robust visual tracking. Computer Science, 2015.

[44] Miao Zhang and Huiqi Li. A reference direction and entropy based evolutionary algorithm for many-objective optimization. Applied Soft Computing, 70:108–130, 2018.

[45] Qingfu Zhang, Wudong Liu, Edward Tsang, and Botond Virginas. Expensive multiobjective optimization by MOEA/D with Gaussian process model.