Evolutionary algorithms for hyperparameter optimization in machine learning for application in high energy physics
Laurits Tani ([email protected]), Diana Rand ([email protected]), Christian Veelken ([email protected]), Mario Kadastik ([email protected])
National Institute of Chemical Physics and Biophysics (NICPB)
November 10, 2020
Abstract

The analysis of vast amounts of data constitutes a major challenge in modern high energy physics experiments. Machine learning (ML) methods, typically trained on simulated data, are often employed to facilitate this task. Several choices need to be made by the user when training the ML algorithm. In addition to deciding which ML algorithm to use and choosing suitable observables as inputs, users typically need to choose among a plethora of algorithm-specific parameters. We refer to parameters that need to be chosen by the user as hyperparameters. These are to be distinguished from parameters that the ML algorithm learns autonomously during the training, without intervention by the user. The choice of hyperparameters is conventionally done manually by the user and often has a significant impact on the performance of the ML algorithm. In this paper, we explore two evolutionary algorithms, particle swarm optimization (PSO) and the genetic algorithm (GA), for the purpose of choosing optimal hyperparameter values in an autonomous manner. Both algorithms are tested on different datasets and compared to alternative methods.
1 Introduction

Owing to the large amount of data recorded by contemporary high energy physics (HEP) experiments, the analysis of data relies on powerful computing facilities. Machine learning (ML) methods are used extensively to aid the data analysis [1, 2]. Boosted decision trees (BDTs) [3] and artificial neural networks (ANNs) [4] are commonly used in HEP experiments. Even though these methods may aid the data analysis task significantly, their usage in practical HEP applications is not trivial. This is because, in order to achieve optimal results, a set of parameters, referred to as hyperparameters in the literature [5], needs to be chosen by the user, depending on the given task and data.

The subject of this paper is to describe two evolutionary algorithms [6], which allow a set of optimal hyperparameters to be found in an autonomous manner. The evolutionary algorithms studied in this paper are particle swarm optimization (PSO) [7] and the genetic algorithm (GA) [8].

The task of finding optimal hyperparameter values can be recast as function maximization. One considers a mapping from a point h in hyperparameter space H to a "score" value s(h), which quantifies the performance of the ML algorithm for a given task. Using a suitable encoding for hyperparameters of non-floating-point type, the hyperparameter space H can be taken to be the Euclidean space IR^N, with N denoting the number of hyperparameters. Formally, the optimal hyperparameters, denoted by the symbol ĥ, are those that satisfy the condition:

    ĥ = argmax_{h ∈ H} s(h) ,    (1)

where s : H → IR refers to the objective function that maps a point h in H to a score s(h). Recasting the hyperparameter optimization task as a function maximization problem allows the performance of the PSO and GA to be evaluated on reference function maximization problems from the literature, as well as compared with alternative methods. We remark that function minimization and function maximization problems are equivalent, as the multiplication of the objective function s(h) by a constant factor of −1 turns one into the other.

The paper is organized as follows: In Sections 2 and 3, we describe the PSO and GA, respectively. In Section 4, we apply both evolutionary algorithms to a well-known function minimization problem from the literature, based on the Rosenbrock function [9], as well as to a typical data analysis task from the domain of HEP, the "ATLAS Higgs boson machine learning challenge" [10]. We conclude the paper with a summary in Section 5.

2 Particle swarm optimization

Particle swarm optimization (PSO) [7] represents a computational method for optimizing continuous nonlinear functions. The method is effective for optimizing a wide range of functions. In common with other evolutionary algorithms, such as the GA, the PSO method is inspired by nature.

As the name of the method implies, the maximization of the objective function by the PSO is performed by a swarm of particles. The particles traverse the hyperparameter space H, with the position of each particle representing one set of hyperparameters h. Having a swarm of particles allows the exploration of multiple points in the space H in parallel, thereby allowing for a highly parallel implementation of the PSO algorithm on a computer. The evolution of the particle swarm proceeds in iterations, denoted by the letter k. In each iteration, a new position x^{k+1}_i is computed for each particle i according to the relation:

    x^{k+1}_i = x^k_i + w · p^k_i + F^k_i ,    (2)

where x^k_i denotes the current position of the particle and p^k_i its momentum.
We use bold symbols to denote vector quantities. The momentum term w · p^k_i represents the inertia of particles against changing their direction when traversing the space H. The symbol F^k_i represents an attractive force, which has the effect of moving particles towards previously discovered extrema of the objective function. The momentum term causes a tendency for the particles to continue moving in their current direction, past the previously found extrema. This behaviour increases the exploration of the hyperparameter space H and is found to improve performance [7]. The coefficient w is referred to as the inertial weight in the literature [11], though the term damping weight might actually be more descriptive, as suggested in reference [12].

Our implementation of the PSO algorithm distinguishes between the personal best location x̂^k_i, i.e. the best position found by particle i in any iteration k' ≤ k, and the best known global extremum x̂̂^k = argmax_{x̂^k_i} s(x̂^k_i):

    F^k_i = c_1 · r_1 · (x̂^k_i − x^k_i) + c_2 · r_2 · (x̂̂^k − x^k_i) .    (3)

The coefficients c_1 and c_2 are referred to as the cognitive and the social weights in the literature [13], and the symbols r_1 and r_2 represent random numbers, which are drawn from a uniform distribution in the interval [0, 1]. The known global extremum x̂̂^k for each particle is updated in each iteration by propagating the personal best locations of a subset of the population, referred to as informants. The number of particles in this subset is denoted by N_informants.
Restricting the computation of x̂̂^k to a subset of particles helps to avoid premature convergence of the swarm to a local minimum. We choose the coefficients c_1 and c_2 to be equal to 2, such that the particles would move past their previously found target about half of the time if the inertial weight w were negligible [7].

After each iteration, the momenta are updated according to the rule:

    p^{k+1}_i = x^{k+1}_i − x^k_i .    (4)

The positions x_i of all particles i are initialized randomly within the hyperparameter space H, while all momenta p_i are randomly initialized within one quarter of the range of each hyperparameter.

The relation between the inertial weight w and the size of the coefficients c_1 and c_2 controls the influence of global (wide-ranging) versus local (nearby) exploration abilities of the particles. A larger inertial weight w allows the particles to move into unexplored regions of the hyperparameter space H, whereas a small value of w causes the particles to hone in on local and global extrema found previously [11]. A suitable selection of w can provide a balance between global and local exploration abilities and thus require fewer iterations on average to find the optimum [11]. As discussed in Ref. [12], one may expect the performance of the PSO algorithm to be improved if one sets the inertial weight w to a large value for the first iterations of the PSO algorithm and gradually reduces w as the swarm evolves. Doing this allows the particles to explore the hyperparameter space H as fast as possible. By gradually reducing the value of w during subsequent iterations, when the approximate location of the extremum has been established, one switches smoothly from global exploration to local exploration, thus improving on the accuracy of the found extrema.
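The update rules of Eqs. (2)-(4) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the informant bookkeeping and the boundary clamping are omitted, and the inertia-weight schedule is left to the caller.

```python
import random

def pso_step(positions, momenta, personal_best, global_best, w, c1=2.0, c2=2.0):
    """One PSO iteration: x_{k+1} = x_k + w*p_k + F_k (Eqs. 2-4)."""
    new_positions, new_momenta = [], []
    for x, p, pb in zip(positions, momenta, personal_best):
        r1, r2 = random.random(), random.random()
        # Attractive force toward the personal and global best (Eq. 3)
        force = [c1 * r1 * (pb_d - x_d) + c2 * r2 * (gb_d - x_d)
                 for x_d, pb_d, gb_d in zip(x, pb, global_best)]
        # Position update (Eq. 2)
        x_new = [x_d + w * p_d + f_d for x_d, p_d, f_d in zip(x, p, force)]
        # Momentum update (Eq. 4): p_{k+1} = x_{k+1} - x_k
        p_new = [xn - xo for xn, xo in zip(x_new, x)]
        new_positions.append(x_new)
        new_momenta.append(p_new)
    return new_positions, new_momenta
```

A driver loop would call this once per iteration, linearly decreasing w between calls and refreshing `personal_best` and `global_best` from the evaluated scores.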
The idea is analogous to the gradual reduction of the temperature parameter in simulated annealing [12].

Each time the position of a particle would move outside the bounds of the hyperparameter space H, the position of the particle is set to the boundary value and its momentum is set to zero, thereby reducing the probability that the same particle moves against the boundary again in the next iteration.

3 Genetic algorithm

The second evolutionary algorithm considered in this paper, the genetic algorithm (GA), is motivated by the concept of natural selection [8]. The GA maintains a population of possible solutions to the optimization problem, which evolve through multiple generations in order to produce the best solution.

Each possible solution is referred to as a chromosome. Each chromosome (see Fig. 1) represents one point in the hyperparameter space H. Having multiple chromosomes allows the GA to explore multiple solutions in parallel. The number of genes in a chromosome matches the dimension of the hyperparameter space H.

The evolution towards the best solution is iterative. Each iteration corresponds to one generation in the evolution of all chromosomes and consists of 3 distinct stages: the selection of parents, the crossover of the genes, and the mutation.

The selection of parents is performed using the tournament method [14, 15]. In each tournament, a certain number of chromosomes compete to be selected as a parent for the next generation. The number of chromosomes participating in each tournament is denoted by the symbol N_tour. The participants are drawn from the population of chromosomes at random and are ranked in order of decreasing score s(h). The participant with the highest score is selected as a parent with the probability P_tour. In case the chromosome with the highest score is not selected, the chromosome with the second highest score gets selected, again with the probability P_tour, and so on.
The tournament ends when two chromosomes are selected in this way to be the parents. A larger value of N_tour has the effect that a chromosome with a low score s(h) has a smaller chance to be selected as a parent for the next generation, because there is a high probability that a chromosome with a better score participates in the same tournament. A smaller value of N_tour has the opposite effect. New tournaments are started until a sufficient number of pairs of parents are selected to produce the chromosomes for the next generation.

The chromosomes of two parents produce one new chromosome for the next generation by means of crossover [16, 17]. We use k-point crossover, in which the chromosomes of both parents are cut at k points (N_cross refers to the number of points, to avoid using the same symbol as for the number of iterations) and the chromosome of the offspring is produced by randomly choosing chromosome segments from either parent (see Fig. 2).

Figure 1: A chromosome consisting of genes.

Figure 2: Possible outcome of a 2-point crossover of two parents in case of a 6-dimensional hyperparameter space H, where h_1 denotes the first hyperparameter of the first parent, h_2 the second hyperparameter of the first parent, etc.

The chromosomes of the offspring that are obtained by the crossover operation are subject to mutation [8], which aims to increase the diversity of the population, thereby allowing the exploration of domains in the hyperparameter space H not populated by chromosomes from the parent generation. Mutation also helps to prevent the population from getting stuck in local minima.

In our implementation of the GA, the mutation of chromosomes is performed by adding a random number, drawn from a normal distribution with a mean of zero and a given width, to each gene. A high mutation rate has the effect of turning the GA into a random search.
We avoid this effect by linearly decreasing the width of the normal distribution in each iteration, with the initial width corresponding to a quarter of the maximum range of a given hyperparameter and the final width corresponding to zero.

Our implementation of the GA uses the concept of elitism [18]. Elitism means that the algorithm preserves a certain number of the best performing chromosomes within the population, passing these parent chromosomes on to the next generation together with their offspring. Elitism is found to improve the convergence toward an optimal solution. The number of parent chromosomes preserved in this manner is denoted by the symbol N_elite.

The convergence is further enhanced by culling [19], which means that we discard a certain number of chromosomes with the lowest score among the population before selecting the parents for the next generation. The number of chromosomes discarded in this way is referred to by the symbol N_cull. For each chromosome discarded by culling, we create a new chromosome with randomly initialized hyperparameter values to replace the one discarded.

Our implementation of the GA further allows groups of chromosomes to evolve in subpopulations [20]. The number of subpopulations is denoted by the symbol N_subpop. The selection of parents is restricted to the chromosomes from the same subpopulation for the first N^generations_subpop iterations of the algorithm. For the remaining iterations, the chromosomes from different subpopulations are allowed to mix freely.

4 Performance

The performance of both evolutionary algorithms, PSO and GA, is evaluated on two tasks: on the Rosenbrock function, which provides an example of a difficult function minimization problem, and on the ATLAS Higgs boson machine learning (ML) challenge, as a typical application of ML methods in HEP.
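The tournament selection and k-point crossover described above can be sketched as follows. This is a simplified sketch, not the authors' implementation: elitism, culling, mutation, and subpopulations are omitted.

```python
import random

def tournament_select(population, scores, n_tour, p_tour):
    """Select one parent via the tournament method: draw N_tour chromosomes
    at random, rank them by decreasing score, and accept each in turn with
    probability P_tour."""
    contestants = random.sample(range(len(population)), n_tour)
    contestants.sort(key=lambda i: scores[i], reverse=True)
    for i in contestants:
        if random.random() < p_tour:
            return population[i]
    return population[contestants[-1]]  # fallback: lowest-ranked participant

def k_point_crossover(parent1, parent2, n_cross):
    """Cut both parent chromosomes at N_cross random points and build the
    offspring by choosing each segment at random from either parent."""
    cuts = sorted(random.sample(range(1, len(parent1)), n_cross))
    child, prev = [], 0
    for cut in cuts + [len(parent1)]:
        src = parent1 if random.random() < 0.5 else parent2
        child.extend(src[prev:cut])
        prev = cut
    return child
```

Two calls to `tournament_select` yield a pair of parents, and `k_point_crossover` produces their offspring, to which mutation would then be applied.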
4.1 Rosenbrock function

The Rosenbrock function [9, 21] represents a well-known trial function for evaluating the performance of function minimization algorithms. The function is defined as:

    R(x, y) = (a − x)^2 + b · (y − x^2)^2 ,    (5)

where a and b are constants. The Rosenbrock function has a global minimum at (x, y) = (a, a^2). We chose to study the Rosenbrock function for the case a = 1 and b = 100. For the chosen values of a and b, the global minimum is located at the position (x, y) = (1, 1), where R(1, 1) = 0.

Figure 3: The Rosenbrock function in a region around its global minimum, located at the position (x, y) = (1, 1).

We treat the minimization of R(x, y) as a function of x and y as a two-dimensional hyperparameter optimization problem, identifying x and y with the first and second hyperparameter, respectively. The positions of particles in the case of the PSO algorithm and the values of the chromosomes in the case of the GA are initialized within the range [−500, +500] × [−500, +500] and are enforced to stay within this range during the evolution of both algorithms.

4.1.1 Stopping criteria

In order to limit the computing time, we define a criterion when to stop the training of the PSO and GA. We use two criteria for this purpose and terminate the evolution when either criterion is fulfilled. The first criterion is an upper limit on the number of iterations, denoted by the symbol N^max_iter. Additionally, we terminate the evolution once the algorithm has found a point (x, y) for which R(x, y) differs from the value of the Rosenbrock function at the global minimum by less than a small threshold.

We compare the performance of the PSO and GA for finding the minimum of the Rosenbrock function with three alternative methods: the gradient descent algorithm [22] and two naive methods for choosing the hyperparameters, to which we refer as "grid search" and "random guessing". The latter two serve as cross-checks. One would of course expect evolutionary algorithms such as the PSO and GA to outperform the naive methods.

(Modified) gradient descent
We have modified the gradient descent (GD) algorithm in order to improve its performance on the Rosenbrock function. The issue is that the unmodified GD algorithm often 'zig-zags' from one side of the valley to the other, causing the algorithm to progress very slowly in the direction along the valley, towards the global minimum [23]. To prevent this 'zig-zag' behaviour and improve the convergence of the algorithm, we have modified the GD algorithm in the following way: at each iteration, the algorithm determines the direction of steepest descent by numerical evaluation of the gradient at a given point h^k. The new position h^{k+1} is computed according to:

    h^{k+1} = h^k + δ · ∇h^k / |∇h^k| ,    (6)

where the term ∇h^k / |∇h^k| represents the direction of steepest descent and the step size δ represents the parameter of the algorithm.

Rather than moving immediately to the new position h^{k+1}, the modified GD algorithm computes the value of the objective function s at the new position, s(h^{k+1}). It then compares the actual decrease of the objective function, s(h^{k+1}) − s(h^k), with the expected decrease, given by the expression δ · ∇h^k / |∇h^k| · ∇h^k. In case the actual decrease falls short of a given fraction of the expected decrease, we conclude that the step size δ is too large and needs to be reduced in order to avoid the 'zig-zag' behaviour. In our implementation, we successively reduce the step size by a factor of two until the condition is satisfied. The algorithm then moves to the new position, the initial step size is restored, and the algorithm recomputes the gradient at the new position for the next iteration.

We choose the number of iterations for the GD algorithm to be 10^4 and the initial step size δ to be 10^{-1}.

Grid search
This is a widely used hyperparameter optimization method, available for example in the package sklearn [24]. The method is based on choosing N^d_grid grid points in each dimension d of the hyperparameter space H, evaluating the objective function s for all ∏_{d=1}^{N} N^d_grid combinations of grid points, and selecting the best performing combination. The number of evaluations of the objective function, ∏_{d=1}^{N} N^d_grid, is chosen to be the same as for the other algorithms, in order to compare all algorithms for the same usage of computing time. Here we assume that the evaluation of the objective function consumes the majority of the computing time and that the computations internal to the PSO and GA are negligible in comparison. We believe this assumption represents a very good approximation for practical applications of these methods in HEP, discussed in the introduction, where one evaluation of s corresponds to one training of an ML algorithm. For the Rosenbrock function minimization task, we choose N^1_grid = N^2_grid = 100 grid points for each of the two dimensions, spaced equidistantly within the interval [−500, +500] in each dimension.

Random guessing
In the random guessing (RNG) method, we draw a total of N_p = 10^4 points in the hyperparameter space H at random, sampling from a uniform distribution within the range [−500, +500] × [−500, +500]. The point corresponding to the minimum of the objective function s over the set of these points is selected as the best-performing point of the RNG method. The number of points N_p is chosen such that the objective function is evaluated the same number of times for the RNG method as for the PSO, GA, GD and GS methods.

Particle swarm optimization
The same maximal number of 10^4 evaluations of the objective function s was used for the PSO, by setting the number of particles in the swarm to 100 and the maximum number of iterations to 100. The evolution of the PSO was terminated before reaching the maximum number of iterations in case the global minimum s(x̂̂^k) found by the PSO differed from the global minimum of the Rosenbrock function by less than the threshold defined in Section 4.1.1. The coefficients c_1 and c_2 were chosen to be 2 and the inertial weight w was chosen to linearly decrease from 0.8 to 0.4 as a function of the iteration k. The number of informants N_informants was set to 7.

Genetic algorithm
The same maximum number of 10^4 evaluations of the objective function was used for the GA, for which the number of chromosomes was chosen to be 100 and the maximum number of iterations to be 100. The same threshold for early termination was chosen for the GA as for the PSO; the early termination triggers once s(h) differs from the global minimum by less than this threshold for the hyperparameter values h represented by any chromosome. The values of the other parameters of the GA, used for the Rosenbrock function minimization task, are given in Table 1. During the first iterations of the algorithm, when subpopulations are used, the parameters N_cull and N_elite amount to 10 and 5, respectively.

Table 1: Values of the parameters N_tour, P_tour, N_cross, P_mutate, N^generations_subpop, N_subpop, N_cull and N_elite of the GA, used for the Rosenbrock function minimization task.

Owing to the fact that the minima found by the GD, GS, RNG, PSO and GA methods depend on the values of the random numbers that are used to initialize and/or evolve each algorithm, the performance of each method needs to be evaluated for a set of different 'trials', each trial using a different seed to produce a different sequence of random numbers.
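As an illustration of the setup above, the Rosenbrock function of Eq. (5) and one step of the modified gradient descent of Eq. (6) can be sketched as follows. The step-halving criterion is simplified here (it requires any decrease of the objective rather than the expected-decrease comparison described above), so this is a sketch under stated assumptions, not the authors' implementation.

```python
import math

def rosenbrock(x, y, a=1.0, b=100.0):
    """R(x, y) = (a - x)^2 + b*(y - x^2)^2, Eq. (5); global minimum R(a, a^2) = 0."""
    return (a - x) ** 2 + b * (y - x * x) ** 2

def grad(f, x, y, eps=1e-6):
    """Central-difference numerical gradient of f at (x, y)."""
    gx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    gy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return gx, gy

def modified_gd_step(f, x, y, delta):
    """One step of the modified GD: move along -grad/|grad|, halving the
    trial step size until the objective actually decreases (simplified
    acceptance criterion)."""
    gx, gy = grad(f, x, y)
    norm = math.hypot(gx, gy) or 1.0
    dx, dy = -gx / norm, -gy / norm
    step = delta
    while f(x + step * dx, y + step * dy) >= f(x, y) and step > 1e-12:
        step /= 2.0  # step too large: 'zig-zag' suspected, halve it
    return x + step * dx, y + step * dy
```

A trial then consists of repeating `modified_gd_step` from a random starting point with the initial step size restored after each accepted move.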
The distribution in R̂̂ = R(ĥ̂) at the minima ĥ̂ found in 100 different trials is shown in Fig. 4. Numerical values of the average R̄ and of the width of the distribution, quantified by the standard deviation σ_R, are given in Table 2.

Figure 4: Distribution in R̂̂ = R(ĥ̂) of the Rosenbrock function at the minimum ĥ̂ found in 100 different trials for the GD, GS, RNG, PSO and GA.

Method   R̄         σ_R
GD       85.85      143.7
GS       3.29       3.89
RNG      3.11       3.44
PSO      0.00057    0.00030
GA       0.0014     0.0021

Table 2: Average value R̄ and standard deviation σ_R achieved by the GD, GS, RNG, PSO, and GA methods in the Rosenbrock function minimization task.

Discussion
One can see in Fig. 4 that the GD method performs extraordinarily well in about half of the trials, while in the other half it fails to get close to the minimum of the Rosenbrock function at all. The poor performance of the GD method in the latter trials is due to cases where the particle moves so slowly along the valley of the Rosenbrock function that the maximum number of 10^4 iterations is reached before the algorithm reaches the global minimum at the position (x, y) = (1, 1).

The PSO achieves the lowest average R̄, outperforming all other methods on the Rosenbrock function minimization task, followed by the GA. The PSO and GA also exhibit the lowest standard deviation σ_R, which means that their performance is robust against variations in the random choice of starting positions across different trials. We remark that the early termination limited the average number of evaluations of the objective function for the PSO to well below the maximum, while the early termination had little effect for the GA (as well as for the GD, GS, and RNG methods), which makes the performance of the PSO even more impressive. As expected, both evolutionary algorithms outperform all other methods.

4.2 ATLAS Higgs boson machine learning challenge

The ATLAS Higgs boson machine learning challenge (HBC) [10] represents a typical application of ML algorithms to the field of HEP. The task of the HBC is to obtain an optimal separation of the Standard Model (SM) Higgs boson → ττ signal from the large SM background. The background consists of Drell-Yan production of Z bosons, the production of W bosons in association with jets, and top quark pair production. Samples of signal and background events are generated by Monte Carlo (MC) simulation. Events are selected in the ττ → μ ν̄_μ ν_τ τ_h ν_τ final state, where we use the symbol τ_h to denote the hadronic decay of a τ lepton.
Background contributions arising from multijet production without associated production of bosons or top quarks are neglected.

In total, 550 000 signal plus background events are provided by the organizers of the HBC, of which we use 80% for training the ML algorithm and 20% for testing the performance of the trained ML algorithm. We refer to the former as the train sample and to the latter as the test sample.

We utilize a BDT to perform the separation of the Higgs boson signal from the backgrounds. For the BDT implementation, we chose the XGBoost package [25].

The objective function s for the hyperparameter optimization represents an approximation of the sensitivity to discover the Higgs boson signal in a physics analysis at the Large Hadron Collider (LHC). The function s was given by the organizers of the HBC and is referred to as the 'approximate median significance' (AMS), which is defined by:

    AMS(θ_cut) = sqrt( 2 · ( (s + b + b_r) · ln[1 + s/(b + b_r)] − s ) ) ,    (7)

where b denotes the amount of background and s the amount of signal that passes a cut θ_cut on the BDT output. The term b_r is introduced as a regularization in order to reduce the effect of statistical fluctuations of b and s, resulting from the limited MC statistics (as discussed in Ref. [10]). The value of b_r was given by the organizers of the HBC and amounts to b_r = 10. The function AMS(θ_cut), for θ_cut = 0.
15, is used as the objective function for the BDT training.

Even with the addition of the b_r term, statistical fluctuations of the number of signal and background events passing the cut on the BDT output still cause a sizable difference between the AMS scores computed on the test and on the train sample. We find that the difference between test and train performance can be reduced, and a higher AMS score on the test sample can be achieved, if we use a modified version of Eq. (7) as the objective function for the BDT training. We refer to the modified version of Eq. (7) as d-AMS. The idea is to add a penalty term for the difference between the AMS scores on the test compared to the train sample, so that the BDT training (and the hyperparameter optimization) reduces this difference:

    d-AMS = AMS_test − κ · max(0, [AMS_test − AMS_train]) ,    (8)

where the coefficient κ controls the strength of the penalty term. We find the choice κ = 1.5 to work well. Once the training of the BDT, using the objective function with θ_cut = 0.
15, has finished, the threshold θ_cut is optimized such that d-AMS attains its maximal value on the training sample.

The PSO was evolved for a maximum of 7000 evaluations of the objective function, using a swarm of 70 particles and a maximum number of 100 iterations. The coefficients c_1 and c_2 were both chosen to be equal to 2 and the inertial weight w was chosen to linearly decrease from 0.8 to 0.4 during the evolution of the PSO algorithm. For the GA, we used 70 chromosomes and a maximum number of 100 iterations. The values of the other parameters are given in Table 4.

The XGBoost hyperparameters chosen for optimization and the default values for these parameters are given in Table 3. The parameter num-boost-round specifies the number of boosting iterations, corresponding to the number of trees in the BDT. The learning-rate parameter controls the effect that trees added at a later stage of the boosting iterations have on the output of the BDT, relative to the effect of trees added at an earlier stage. Small values of the learning-rate parameter decrease the effect of trees added during the boosting iterations, thereby reducing the effect of boosting on the BDT output. The parameter max-depth specifies the maximum depth of a tree. The parameter gamma represents a regularization parameter, which aims to reduce overfitting. Large values of this parameter prevent the splitting of leaf nodes before the maximum depth of a tree is reached. The parameter min-child-weight specifies the minimum number of events that is required in each leaf node. The parameter subsample limits the number of training events that are used to grow each tree to a fraction of the full training sample. A value of this parameter smaller than one decreases overfitting. The parameter colsample-bytree specifies the number of different features that are used in a tree.
A value of one means that all features are considered for splitting leaf nodes, while a value smaller than one restricts the number of features that are used in a tree to a subset of all features. The purpose of this restriction is to reduce overfitting. The features considered for each tree are drawn at random, independently for each boosting iteration.

The choice of all of these parameters typically represents a trade-off. Large values of the parameters num-boost-round, learning-rate, max-depth, subsample, and colsample-bytree increase the complexity of the BDT, while large values of the parameters gamma and min-child-weight have the opposite effect. BDTs with a higher complexity in general perform better in separating signal from background on the training sample, but are typically also more susceptible to overfitting.

Table 3: The XGBoost hyperparameters chosen for optimization (num-boost-round, learning-rate, max-depth, gamma, min-child-weight, subsample, colsample-bytree) and their default values.

Table 4: Values of the parameters N_tour, P_tour, N_cross, P_mutate, N^generations_subpop, N_subpop, N_cull and N_elite of the GA, used for the HBC task.

In order to limit the computing time, we again impose an upper limit N^max_iter on the number of iterations. Additionally, we terminate the evolution once the variance between the positions of the particles in the PSO, or between the chromosomes in the GA, is below a certain threshold. The variance is quantified by the compactness (also known as the mean coefficient of variation), which is defined as:

    compactness = (1/N) · Σ_{j=1}^{N} σ_j / x̄_j ,    (9)

with

    σ_j = sqrt( (1/(n − 1)) · Σ_{i=1}^{n} (x_ji − x̄_j)^2 ) ,

where N denotes the number of hyperparameters, n the number of particles or chromosomes, and x̄_j the mean value of the j-th hyperparameter over the population of particles or chromosomes, respectively. A low value of the compactness means that the hyperparameters of different particles or chromosomes are very similar, indicating that the PSO or GA has converged to a single point in the hyperparameter space H. The optimal values of the hyperparameters obtained with the RNG method, the PSO and the GA are given in Table 5.
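The objective of Eqs. (7)-(8) and the compactness criterion of Eq. (9) can be sketched as follows. This is a minimal sketch: the signal and background yields s, b would come from counting weighted events passing the θ_cut on the BDT output, and the value of κ is passed explicitly.

```python
import math

def ams(s, b, b_r=10.0):
    """Approximate median significance, Eq. (7)."""
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))

def d_ams(ams_test, ams_train, kappa):
    """Penalized score of Eq. (8): subtract kappa times the test/train gap."""
    return ams_test - kappa * max(0.0, ams_test - ams_train)

def compactness(population):
    """Mean coefficient of variation over all hyperparameters, Eq. (9);
    assumes the per-hyperparameter means are nonzero."""
    n = len(population)           # number of particles / chromosomes
    n_hyper = len(population[0])  # number of hyperparameters N
    total = 0.0
    for j in range(n_hyper):
        vals = [p[j] for p in population]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / (n - 1)
        total += math.sqrt(var) / mean
    return total / n_hyper
```

The stopping criterion then compares `compactness(positions)` against the chosen threshold after each iteration.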
In Table 6, we compare the AMS scores obtained for these hyperparameter values to the AMS scores obtained with the default values of the hyperparameters defined in the XGBoost package. The performance is evaluated for two samples of events, referred to as the public and private leaderboard samples. Both samples are provided by the organizers of the HBC and contain signal and background events that overlap with neither the test nor the train sample.

The performance achieved by the PSO and GA is very similar and about 12-13% higher than the performance obtained using the default values of the hyperparameters. The results of the BDT trained using the hyperparameters obtained by the RNG method are similar to those obtained by the PSO and GA. Comparing the values of individual hyperparameters obtained using the PSO with those obtained by the GA, we find that almost all hyperparameters have similar values, except for two: the num-boost-round and learning-rate parameters. The value of the num-boost-round parameter optimized by the GA is higher by about a factor of 2.9, while the value of the learning-rate parameter is lower by a factor of 3.5. The fact that the learning-rate parameter decreases by a factor that is similar to the increase of the num-boost-round parameter is not surprising: using a large number of trees and a lower learning rate has about the same effect as using a lower number of trees and a higher learning rate. The product of the num-boost-round and learning-rate parameters is more similar, differing only by a factor of 1.2 between the PSO and GA. The situation is different for the colsample-bytree parameter. It has a small effect on the d-AMS and AMS scores and is hence only loosely constrained during the hyperparameter optimization.

The parameters obtained by the RNG method are more different: only the max-depth parameter is very similar to those found by the PSO and GA. Having a min-child-weight roughly two times smaller than the one found for the PSO and GA makes the model more prone to overfitting.
However, this effect is overcome by the product of the anti-correlated pair, num-boost-round and learning-rate, being two times smaller than the product obtained by the PSO and GA, thus making the model less susceptible to overfitting. Furthermore, having a three to four times smaller gamma helps the model generalize even further. Again, the colsample-bytree parameter had a negligible effect on the d-AMS and AMS scores.

Parameter           RNG    PSO    GA
num-boost-round     295    153    451
learning-rate       –      –      –
max-depth           –      –      –
gamma               –      –      –
min-child-weight    173    323.6  442.2
subsample           –      –      –
colsample-bytree    –      –      –

                          θ_cut   AMS score             AMS score
                                  (public leaderboard)  (private leaderboard)
Default hyperparameters   0.175   3.170                 3.200
RNG                       0.152   3.620                 3.608
PSO                       0.134   3.628                 3.655
GA                        0.152   3.619                 3.683

Table 6: Performance of BDTs trained using the optimal values of the hyperparameters obtained by the PSO and by the GA, compared to a BDT trained using the default values of hyperparameters in the XGBoost package [25] and with hyperparameters obtained by RNG, for the ATLAS Higgs boson machine learning challenge.
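The trade-off between the num-boost-round and learning-rate parameters discussed above can be verified with a few lines of arithmetic. The num-boost-round values (153 for the PSO, 451 for the GA) are taken from the table above; the learning-rate ratio of 3.5 is the factor quoted in the text, since the individual learning-rate values are not reproduced here.

```python
# Worked check of the num-boost-round / learning-rate trade-off between the
# PSO and GA optima.
nbr_pso, nbr_ga = 153, 451
nbr_ratio = nbr_ga / nbr_pso          # GA uses ~2.9x more boosting rounds...
lr_ratio = 3.5                        # ...at a ~3.5x lower learning rate (from text)

# The product num-boost-round * learning-rate is therefore nearly invariant:
product_ratio = lr_ratio / nbr_ratio  # ratio of the PSO product to the GA product
print(f"num-boost-round ratio: {nbr_ratio:.2f}")    # ~2.95
print(f"product ratio:         {product_ratio:.2f}")  # ~1.19, i.e. the factor 1.2
```

The two printed ratios reproduce the factors 2.9 and 1.2 stated in the discussion above.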
Two evolutionary algorithms, the particle swarm optimization (PSO) and the genetic algorithm (GA), for choosing an optimal set of hyperparameters in applications of machine learning (ML) methods to data analyses in high energy physics (HEP) have been presented. The performance of both methods has been studied for a difficult function minimization task (the Rosenbrock function) and for a typical data analysis task in the field of HEP (the Higgs boson machine learning challenge (HBC)). In the latter case, a boosted decision tree (BDT) has been used as the ML algorithm. The PSO as well as the GA demonstrate their ability to find the optimal parameter values in the function minimization task. Compared to using the default values of hyperparameters, the optimization of the hyperparameter values improves the sensitivity of the data analysis, as quantified by the AMS score, by 12-13%. This improvement demonstrates that the optimization of hyperparameters is a worthwhile task for data analyses in the field of HEP. Randomly probing different hyperparameter sets and subsequently picking the best-performing one showed similar performance to both the PSO and GA. This can be attributed to the highly fluctuating hyperparameter space of this particular example. For a highly structured hyperparameter space, the gain of using a more sophisticated method, like the PSO or GA, will be much higher, as was shown by the Rosenbrock minimization problem. We expect both evolutionary algorithms to perform similarly well in case an artificial neural network (ANN) is used as the ML algorithm. The optimization of the hyperparameters by the PSO and GA is fully automated and relieves the user from manual tuning of the hyperparameters.
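The comparison drawn in these conclusions can be illustrated with a minimal, self-contained sketch: a textbook global-best PSO (not the exact configuration used in this paper; the inertia weight and acceleration coefficients below are common textbook choices) pitted against random probing on the two-dimensional Rosenbrock function, with the same evaluation budget for both.

```python
import random

def rosenbrock(x, y):
    """2D Rosenbrock function; global minimum f(1, 1) = 0."""
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def pso_minimize(f, n_particles=30, n_iter=300, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Plain global-best PSO on [-2, 2]^2; hyperparameters are illustrative."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # personal best positions
    pbest_val = [f(*p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # velocity update: inertia + cognitive + social terms
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def random_probe(f, n_eval, seed=1):
    """Random probing (the RNG baseline): best of n_eval uniform draws."""
    rng = random.Random(seed)
    return min(f(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n_eval))

best, best_val = pso_minimize(rosenbrock)
rng_val = random_probe(rosenbrock, n_eval=30 * 300)  # identical budget
print(f"PSO best value: {best_val:.2e} at {best}")
print(f"RNG best value: {rng_val:.2e}")
```

With equal budgets, the PSO typically ends far below the random baseline on this smooth, structured landscape, mirroring the Rosenbrock result above; on a noisy score surface such as the HBC hyperparameter space, the gap between the two approaches shrinks.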
Acknowledgements
This work was supported by the Estonian Research Council grants PRG445 and PRG780. Also, we would like to thank Joosep Pata for constructive criticism of the manuscript.
References

[1] K. Albertsson, P. Altoe, D. Anderson, J. Anderson, M. Andrews, J. P. A. Espinosa, A. Aurisano, L. Basara, A. Bevan, W. Bhimji, et al., "Machine Learning in High Energy Physics Community White Paper," arXiv:1807.02876, 2018.
[2] S. Whiteson and D. Whiteson, "Machine learning for event selection in high energy physics," Engineering Applications of Artificial Intelligence, vol. 22, no. 8, p. 1203, 2009.
[3] B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, "Boosted decision trees as an alternative to artificial neural networks for particle identification," Nucl. Instr. Meth. A, vol. 543, p. 577, 2005.
[4] A. K. Jain, J. Mao, and K. M. Mohiuddin, "Artificial neural networks: A tutorial," Computer, vol. 29, no. 3, p. 31, 1996.
[5] M. Claesen and B. De Moor, "Hyperparameter search in machine learning," arXiv:1502.02127, 2015.
[6] E. Zitzler, K. Deb, and L. Thiele, "Comparison of multiobjective evolutionary algorithms: Empirical results," Evolutionary Computation, vol. 8, no. 2, p. 173, 2000.
[7] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of ICNN'95 - International Conference on Neural Networks, p. 1942, 1995.
[8] J. H. Holland, "Genetic algorithms," Scientific American, vol. 267, no. 1, p. 66, 1992.
[9] Y.-W. Shang and Y.-H. Qiu, "A note on the extended Rosenbrock function," Evolutionary Computation, vol. 14, no. 1, p. 119, 2006.
[10] C. Adam-Bourdarios, G. Cowan, C. Germain, I. Guyon, B. Kégl, and D. Rousseau, "The Higgs boson machine learning challenge," in NIPS 2014 Workshop on High-Energy Physics and Machine Learning, p. 19, 2015.
[11] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in International Conference on Evolutionary Programming, p. 591, Springer, 1998.
[12] R. C. Eberhart and Y. Shi, "Comparison between genetic algorithms and particle swarm optimization," in International Conference on Evolutionary Programming, p. 611, Springer, 1998.
[13] F. Van Den Bergh and A. P. Engelbrecht, "A study of particle swarm optimization particle trajectories," Information Sciences, vol. 176, no. 8, p. 937, 2006.
[14] J. Yang and C. K. Soh, "Structural optimization by genetic algorithms with tournament selection," Journal of Computing in Civil Engineering, vol. 11, no. 3, p. 195, 1997.
[15] N. M. Razali, J. Geraghty, et al., "Genetic algorithm performance with different selection strategies in solving TSP," in Proceedings, World Congress on Engineering, p. 1, 2011.
[16] W. M. Spears and K. A. De Jong, "An analysis of multi-point crossover," in Foundations of Genetic Algorithms, vol. 1, p. 301, 1991.
[17] K. A. De Jong and W. M. Spears, "A formal analysis of the role of multi-point crossover in genetic algorithms," Annals of Mathematics and Artificial Intelligence, vol. 5, no. 1, p. 1, 1992.
[18] D. Bhandari, C. Murthy, and S. K. Pal, "Genetic algorithm with elitist model and its convergence," International Journal of Pattern Recognition and Artificial Intelligence, vol. 10, no. 06, p. 731, 1996.
[19] E. B. Baum, D. Boneh, and C. Garrett, "Where genetic algorithms excel," Evolutionary Computation, vol. 9, no. 1, p. 93, 2001.
[20] R. Tanese, "Distributed genetic algorithms for function optimization," 1989.
[21] S. Kok and C. Sandrock, "Locating and characterizing the stationary points of the extended Rosenbrock function," Evolutionary Computation, vol. 17, no. 3, p. 437, 2009.
[22] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv:1609.04747, 2016.
[23] Wikipedia, "Gradient descent." https://en.wikipedia.org/wiki/Gradient_descent. [Online; accessed 08-July-2020].
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings, 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 785, 2016.
[26] "XGBoost Parameters." https://xgboost.readthedocs.io/en/latest/parameter.html