Gamma/hadron segregation for a ground based imaging atmospheric Cherenkov telescope using machine learning methods: Random Forest leads
arXiv [astro-ph.IM]. Research in Astronomy and Astrophysics manuscript no. (LaTeX: mridul-mlearning.tex; printed on October 17, 2018; 20:27)
Mradul Sharma∗, J. Nayak, M. K. Koul, S. Bose and Abhas Mitra
Astrophysical Sciences Division, Bhabha Atomic Research Centre, Mumbai, India; The Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, Kolkata, India
Received ; accepted
Abstract
A detailed case study of γ-hadron segregation for a ground based atmospheric Cherenkov telescope is presented. We have evaluated and compared various supervised machine learning methods, such as the Random Forest method, Artificial Neural Network, Linear Discriminant method, Naive Bayes Classifier and Support Vector Machine, as well as the conventional dynamic supercut method, by simulating triggering events with the Monte Carlo method and applying the results to a Cherenkov telescope. It is demonstrated that the Random Forest method is the most sensitive machine learning method for γ-hadron segregation.

Key words: methods: statistical — telescopes
Multidimensional datasets are very difficult to handle with conventional methods, which are generally linear in nature. Therefore, when multidimensional data are encountered, the efficiency of these methods reduces drastically, as any interdependence among various parameters is beyond the realm of linear methods. In the case of ground based atmospheric Cherenkov systems, the typical characterization of a signal involves more than four attributes/parameters. Present day Cherenkov systems are operating in an energy regime where conventional methods are losing their edge on account of fading differences in the discriminating attributes/parameters between signal and background. Therefore, the ground based gamma ray astronomy community has started exploring various options, including multivariate methods. These multivariate methods fall under the umbrella of machine learning methods. The simplicity and intrinsic ability of these methods to scrub out interdependence, if any, among various attributes/parameters has made machine learning one of the fastest growing scientific disciplines. These methods employ statistical tools to decipher hidden relationships, if any, among a few or a collection of attributes/parameters with comparatively little computing infrastructure.

Machine learning methods have been explored in the field of ground based gamma ray astronomy for quite some time. The earliest efforts were initiated by Bock et al. (2004). Later on, for γ-hadron segregation, the effectiveness of tree based multivariate classifiers was demonstrated by two operational ground based observatories, MAGIC (Albert et al. 2008) and HESS (Ohm et al. 2009; Fiasson et al. 2010; Dubois et al. 2009). It should be noted that no machine learning method is sacrosanct as far as its superiority over other multivariate methods is concerned.
Each dataset is unique and a classifier's performance depends on the dataset under investigation. Therefore, in order to assess the suitability of a classifier, each dataset needs to be probed independently. In this paper, we compare and evaluate various supervised machine learning methods to assess their suitability for γ-hadron segregation. A total of five machine learning methods, namely Random Forest (RF), Artificial Neural Network (ANN), Linear Discriminant Analysis (DISC), Naive Bayes (NB) Classifier and Support Vector Machine (SVM) with the Radial Basis Function (RBF) and polynomial kernels, have been investigated. They were selected so as to represent the different streams of machine learning. Among these five methods, the RF method represents a logic based algorithm. The ANN methods are perceptron based techniques. On the other hand, the DISC and NB Classifiers are statistical learning methods. Furthermore, the SVM represents a rather new (1992) machine learning technique. The signal strength after classification by each machine learning method was compared with that of the conventional dynamic supercut method and a conclusion is reached to select the best classification method.

The plan of the paper is as follows: In Section 2, a brief summary of ground based atmospheric Cherenkov telescopes and the underlying principle is outlined. Section 3 describes the database used to compare the various machine learning methods. The subsequent section provides an overview of all the machine learning methods. The final two sections deal with a critical analysis of all the classifiers and the conclusion, respectively.

Ground based gamma ray astronomy is a rather new discipline. The first successful detection of the TeV source, the Crab Nebula, took place in 1989 (Weekes et al. 1989). After a brief lull in the field, the next detection took place in 1992, when the second TeV γ-ray source, Markarian 421, was detected (Punch et al. 1992), followed in 1996 by Mrk 501 (Quinn et al.
1996) was detected. Slowly, a series of such extragalactic sources was discovered. With the advent of more sensitive systems, the catalog of TeV γ-ray sources saw the addition of newer sources. The present day field of ground based gamma ray astronomy is flourishing with new detections of exotic sources. In fact, so far more than 150 galactic and extragalactic sources have been discovered (http://tevcat.uchicago.edu/).

Fig. 1 Diagram of a few image parameters.

The detection of cosmic γ-ray sources is based on the principle of the detection of Cherenkov photons produced by cosmic rays in the atmosphere. When cosmic rays enter the atmosphere, they interact with atmospheric nuclei by hadronic and electromagnetic interactions. Electrons and the cosmic γ-rays interact electromagnetically, i.e. they generate secondary particles by the 'pair production' and 'bremsstrahlung' processes. The hadronic cosmic rays, namely protons and ionized nuclei, interact via the hadronic interaction and also give rise to a number of secondary particles. Such generation of secondary particles in the atmosphere is called the
Extensive Air Shower. The hadronic showers create π⁰ particles that decay into γ-rays, making it difficult to distinguish these hadronic showers from genuine showers initiated by γ-rays. The segregation of showers initiated by γ-rays is quite challenging because cosmic rays outnumber the γ-rays by a huge margin.

The secondary particles generated in extensive air showers move with relativistic speeds and generate Cherenkov radiation in the atmosphere. The technique of detecting the Cherenkov photon image is known as the Imaging Atmospheric Cherenkov Technique (IACT). If the shower is close enough to the telescope, the Cherenkov photons are reflected by the telescope's reflecting dish and focused onto the camera (an array of photomultiplier tubes in the focal plane of the detector). The geometrical projection of the shower onto the detector is called an image. The IACT is used to differentiate between γ and hadron initiated showers on the basis of the shape and orientation of the images. The image parameterization was introduced by Hillas and hence these parameters are known as Hillas parameters (Hillas 1985). Image properties (analyzed offline) provide information about the nature, energy and incoming direction of the primary particle triggering a shower. A representative diagram of the Hillas parameters is shown in Figure 1.
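As an illustration of how such image parameters are derived, the length and width are essentially the RMS spreads of the signal-weighted pixel distribution along the image's major and minor axes, and the distance is the centroid offset from the camera centre. The sketch below is a minimal, hypothetical moment analysis with toy pixel coordinates in degrees, not the actual TACTIC analysis code.

```python
import math

def hillas_length_width(pixels):
    """Toy Hillas moment analysis.

    pixels: list of (x_deg, y_deg, signal_pe) tuples for a cleaned image.
    Length/width are the square roots of the eigenvalues of the
    signal-weighted covariance matrix of the pixel positions.
    """
    s = sum(q for _, _, q in pixels)
    mx = sum(x * q for x, _, q in pixels) / s          # centroid x
    my = sum(y * q for _, y, q in pixels) / s          # centroid y
    cxx = sum((x - mx) ** 2 * q for x, _, q in pixels) / s
    cyy = sum((y - my) ** 2 * q for _, y, q in pixels) / s
    cxy = sum((x - mx) * (y - my) * q for x, y, q in pixels) / s
    # Eigenvalues of the 2x2 covariance matrix give major/minor spreads.
    tr, det = cxx + cyy, cxx * cyy - cxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    length = math.sqrt(tr / 2 + disc)
    width = math.sqrt(max(tr / 2 - disc, 0.0))
    distance = math.hypot(mx, my)                      # centroid offset
    return length, width, distance
```

For a perfectly straight toy image the width collapses to zero, while the length reduces to the RMS spread of the signal along the track.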
A database of Monte Carlo simulations was generated by using the CORSIKA air shower code (Heck et al. 1998) with the Cherenkov option. The simulations were carried out for the TACTIC telescope (Koul et al. 2011) located at the Mount Abu observatory. The simulated showers were generated at five zenith angles. An imaging camera with a total of 349 pixels was considered, with the innermost pixels being used for generating the trigger. The Cherenkov photons triggered the telescope after encountering the wavelength dependent photon absorption, the reflection coefficient of the mirror facets, the light cones used in the camera and the quantum efficiency of the photomultiplier tubes. All the triggered events underwent the usual image cleaning procedures described in the literature (Konopelko et al. 1996) to eliminate background noise.

The simulated events triggering the telescope were selected according to the appropriate differential spectral indices for γ-rays and protons. The γ events were generated in the energy range 1–20 TeV and the corresponding proton events from 2–40 TeV. In order to have a robust and well contained image inside the camera, prefiltering cuts on the size (photoelectrons) and on the distance parameter were applied. This process yielded equal numbers of events for γ-rays and protons.

Various Hillas image parameters (Hillas 1985) like length, width, distance, size (photoelectrons) and zenith angle can be used in the process of γ-hadron segregation. However, the size parameter as well as the zenith angle parameter are not strictly separation parameters for γ-ray and hadronic showers. The zenith angle, for instance, cannot by itself be used to separate the events, although different image parameters depend on it. The same is true of the size (photoelectron) parameter.
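The image cleaning step mentioned above is commonly implemented as a two-level tail cut: pixels above a high 'picture' threshold are kept, together with their neighbours above a lower 'boundary' threshold. The sketch below uses illustrative thresholds only; the actual procedure and values of Konopelko et al. (1996) are not reproduced here.

```python
def clean_image(pixels, neighbours, picture=10.0, boundary=5.0):
    """Two-level tail-cut cleaning (illustrative thresholds).

    pixels: dict pixel_id -> signal (photoelectrons).
    neighbours: dict pixel_id -> list of adjacent pixel ids.
    A pixel survives if it exceeds the picture threshold, or exceeds
    the boundary threshold while adjacent to a picture pixel.
    """
    picture_px = {p for p, q in pixels.items() if q >= picture}
    boundary_px = {
        p for p, q in pixels.items()
        if boundary <= q < picture
        and any(n in picture_px for n in neighbours.get(p, []))
    }
    return picture_px | boundary_px
```

Isolated low pixels (noise) are rejected, while the connected Cherenkov image survives.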
A typical problem with these parameters is that if the training samples for γ-rays and hadrons have different distributions in these parameters, the parameters may wrongly be picked up as separation parameters. This may lead to a rather risky situation, which is typically handled by preparing the training samples in such a way that their distributions in those parameters (typically size and zenith angle) are as close as possible. In this way, the uncertainty associated with using such parameters as separation parameters can be avoided. In this study, such complexities have been taken into account. In addition to these parameters, a derived parameter 'dens,' defined (Hengstebeck 2007) as

dens = log(size) / (length × width), (1)

was also used. A total of two sets of image parameters was considered. The idea was to investigate the various classifiers as a function of the image attributes/parameters. In the first instance, only five image parameters were considered from the simulation database: length, width, distance, size and frac2 (defined as the ratio of the sum of the two highest pixel signals to the sum of all the signals). In the second case, we considered a total of seven image parameters: in addition to the above mentioned five parameters, the zenith angle and the dens parameter were also included. However, for classification purposes, the alpha parameter was not considered. Alpha is a very powerful parameter as it carries the signature of the progenitor (γ or proton). The alpha distribution is expected to be flat for cosmic ray protons, whereas it shows a peaky behavior at small alpha values for γ-rays. In order to remove any bias from such a strong parameter, it was not considered for classification purposes. Moreover, this parameter plays a crucial role in the estimation of the signal strength: if the alpha parameter is used in the classification, then the hadronic background cannot be evaluated.
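The derived parameter of Equation (1) is straightforward to compute; the helper below assumes a base-10 logarithm (the base is not stated above) and takes the length and width in degrees.

```python
import math

def dens(size, length, width):
    """'dens' parameter of Eq. (1): log(size) / (length * width).

    A base-10 logarithm is assumed here; size is in photoelectrons,
    length and width in degrees.
    """
    return math.log10(size) / (length * width)
```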
The problem of γ-hadron segregation is formulated as a two class problem: γ represents one class and the hadron is the second class. In the literature, a large variety of multivariate classification methods exists. However, to have a tractable analysis, a few representative supervised machine learning methods were selected. The classification was carried out by using five different machine learning methods, namely RF, ANN, DISC, NB Classifier and SVM with the RBF and polynomial kernels. Except for the RF and the dynamic supercut methods, the methods were applied from a commercially available package named STATISTICA, while the RF method was studied by using the original RF code.

The spatial distribution of Cherenkov photons on the image plane of the camera is parameterized on the basis of the shape and size (light content) of each such image. The conventional parameterization leads to the estimation of the image parameters (Hillas 1985). In this technique, various sequential cuts on the image parameters are applied so as to maximize the γ-ray like signal and reject the maximum number of background events. However, this scheme has a disadvantage because the width and length parameters grow with the primary energy. It is observed that the width and length of an image are well correlated with the logarithm of the size, and the size of the image provides an estimate of the primary energy. This method of scaling the width and length parameters with the size is known as the dynamic supercut method (Mohanty et al. 1998). In this method, the optimum number of cut parameters and their values are estimated by numerically maximizing the so called quality factor Q (Gaug 2001), defined as

Q = ǫ_γ / √ǫ_P , (2)

where ǫ_γ and ǫ_P are the γ and hadron acceptances respectively.
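The numerical maximization of Q can be sketched as a scan over candidate cut values. The toy function below applies a single upper cut to one image parameter and keeps the cut that maximizes Equation (2); the real optimisation runs over several dynamic cuts simultaneously.

```python
import math

def quality_factor(eps_gamma, eps_proton):
    """Quality factor of Eq. (2): Q = eps_gamma / sqrt(eps_proton)."""
    return eps_gamma / math.sqrt(eps_proton)

def best_cut(gamma_vals, proton_vals, candidate_cuts):
    """Toy one-parameter version of the numerical Q maximization.

    Scan candidate upper cuts on a single parameter and return the
    (cut, Q) pair that maximizes the quality factor.
    """
    best = None
    for c in candidate_cuts:
        eg = sum(v <= c for v in gamma_vals) / len(gamma_vals)
        ep = sum(v <= c for v in proton_vals) / len(proton_vals)
        if ep == 0:          # no background survives: Q undefined
            continue
        q = quality_factor(eg, ep)
        if best is None or q > best[1]:
            best = (c, q)
    return best
```

A tight cut raises the γ purity but lowers ǫ_γ; Q balances the two in the same way as the sensitivity of a counting experiment.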
The γ-acceptance is defined as the fraction of correctly classified γ events out of the total number of γ events, and ǫ_P is the fraction of proton events which behave like γ events after the γ-hadron classification. The image parameter cuts in Table 1 lead to the maximum quality factor.

Table 1 Dynamic Supercut Parameters

Parameter      Cut Value
Length (L)     0.110° ≤ L ≤ (0.235 + …)°
Width (W)      0.065° ≤ W ≤ (0.085 + …)°
Distance (D)   …° ≤ D ≤ …°
Size (S)       S ≥ … pe
Alpha (α)      α ≤ …°
Frac2          frac2 ≥ …

The RF method is a flexible multivariate selection method. The algorithm for RF was developed by Leo Breiman and Adele Cutler. The RFs are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman 2001). The classification trees, also known as "decision trees," are machine learning prediction models constructed by recursively partitioning the data set. Each binary recursive partitioning splits the data set into different branches. The tree construction starts from the root node (the entire dataset) and ends at a leaf, and every leaf node is assigned to a class. The RF method combines the concepts of 'bagging' (Breiman 1996) and 'Random Split Selection.'

The RF builds on the bagging (Breiman 1996) technique, where bagging stands for the "Bootstrapping" and "Aggregating" techniques. The basic idea of bagging is to use bootstrap re-sampling to generate multiple versions of a predictor and to combine them to make the classification. Bootstrapping is based on random sampling with replacement: the probability of selecting a given event in each draw of the sampling (with replacement) procedure is constantly 1/n, so the probability of not selecting it in one draw is (1 − 1/n). If the selection process is repeated n times, where n is very large, the probability of an event never being selected approaches 1/e. Therefore, only about 2/3 (∼0.632) of the events enter each bootstrap sample. In addition to bagging, RF also employs "Random Split Selection": at each node of the decision tree, m variables are selected at random out of the M input variables and the best split is selected among these m. Typically m ≈ √M predictors are selected.
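The ingredients described above (bootstrap bagging, random selection of m ≈ √M features at each split, and the out-of-bag error estimate discussed below) can be sketched with the scikit-learn implementation of RF rather than the original Fortran code; the two-class Gaussian toy data merely stand in for the γ/hadron image parameters.

```python
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
# Toy stand-in for gamma/hadron image parameters: two overlapping
# 4-dimensional Gaussian clusters (labels 1 = gamma, 0 = hadron).
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(300)] + \
    [[random.gauss(2, 1) for _ in range(4)] for _ in range(300)]
y = [1] * 300 + [0] * 300

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    max_features="sqrt",     # random split selection: m = sqrt(M)
    oob_score=True,          # error estimate from out-of-bag events
    random_state=0,
).fit(X, y)

print("oob accuracy:", round(clf.oob_score_, 3))
```

The `oob_score_` attribute is the out-of-bag accuracy, i.e. one minus the oob error estimate described in the text, obtained without a separate test sample.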
Two sources of randomness, namely random inputs and random features, make RFs accurate classifiers. In order to measure the classification power (separation ability) of a parameter and to optimize the cut value, the Gini index is used, which measures the inequality of two distributions. It is defined as the ratio between (a) the area between the observed cumulative distribution and the hypothetical cumulative distribution for a non-discriminating variable (uniform distribution, the 45° line), and (b) the area under this uniform distribution. It is a variable between zero and one; a low Gini coefficient indicates more equal distributions, while a high Gini coefficient indicates an unequal distribution. Breiman (2001) estimated the error rate on out-of-bag (oob) data. Each tree is constructed on a different bootstrap sample. Since in each bootstrap training set about one third of the instances are left out (i.e. out-of-bag), we can estimate the test set classification error by applying each case that is left out of the construction of the t-th tree to the t-th tree. To be precise, the oob error estimate is the proportion of misclassifications for the oob data.

In this study, the original RF code in Fortran was employed and a forest of trees was generated; the m_try variable defined in the code was varied and very similar results were obtained in each case. The resultant output of this code was compared with the implementation of RF in the statistical package R (http://cran.r-project.org/). It is worth mentioning here that the Fortran code encounters some memory issues when the number of training/test events crosses a certain threshold; this limitation was not encountered in the RF implementation in R.

The ANN consists of many inputs (Gershenson 2003), which are multiplied by weights (the strengths of the respective signals) and then computed by a mathematical function that determines the activation of the neuron.
Another function computes the output of the artificial neuron. The specific output demanded by the user can be obtained by adjusting the weights of the artificial neurons. A multilayer perceptron (MLP) is perhaps the most popular network architecture in use today, due originally to Rumelhart and McClelland (Rumelhart et al. 1986) and discussed at length in most neural network textbooks (Bishop 1995). Each neuron performs a weighted sum of its inputs and passes it through a transfer function to produce its output.

In this work, we use an MLP network with five inputs and between three and 11 hidden units. For the classification tasks, the probabilistic output was generated and the misclassification rate was estimated.
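The behaviour described above, where each neuron forms a weighted sum of its inputs and passes it through a transfer function, can be sketched as a forward pass with a sigmoid activation; the weights here are placeholders, not a trained network.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs passed
    through a sigmoid transfer function."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def mlp_forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """Forward pass of a minimal MLP with one hidden layer; the
    sigmoid output can be read as a class probability."""
    hidden = [neuron(x, w, b)
              for w, b in zip(hidden_weights, hidden_biases)]
    return neuron(hidden, out_weights, out_bias)
```

Training then amounts to adjusting the weights and biases (e.g. by backpropagation) until the output matches the desired class labels.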
Linear Discriminant Analysis is also known as Discriminant Function Analysis (DFA). DFA combines aspects of multivariate analysis of variance with the ability to classify observations into known categories. It is a multivariate technique which is not only utilized for classification but also estimates how good the classification is. In this method, discriminant functions such as canonical correlations are constructed and each function is assessed for significance. The estimation of the significance of a set of discriminant functions is computationally identical to multivariate analysis of variance. After estimating the significance, one proceeds to classification. It generally turns out that the first one or two functions play an important role while the rest can be neglected. Each discriminant function is orthogonal to the previous function. In the present case, it is known that each class belongs to either γ or hadron; thus the a priori probabilities of these classes are known. Accordingly, in this work, these prior probabilities are used for classification.

Bayesian classifiers gained prominence in the early nineties and perform very well (Langley et al. 1992; Friedman et al. 1997). A Naive Bayes classifier is a generative classifier technique based on probability theory. The Bayes theorem plays a critical role in probabilistic learning and classification. It states that

p(B|A) = p(A|B) p(B) / p(A), (3)

where p(A) is the independent probability of A, p(B) is the independent probability of B, p(A|B) is the conditional probability of A given B, and p(B|A) is the conditional probability of B given A, i.e. the posterior probability. In "Naive Bayes Classification," the different variables/attributes/features are assumed to be strongly (naively) independent, i.e.,

p(<x_1, x_2, ..., x_n> | y) = ∏_{i=1}^{n} p(x_i | y). (4)

Using the strong "independence assumption" and the prior probabilities, the most probable class for a given x is estimated. The best class is the most likely or maximum a posteriori (MAP) class. The MAP estimate gives

argmax_B p(B|A) = argmax_B p(A|B) p(B). (5)

Training and evaluation with this method are very fast, but the assumption of strong independence among parameters is a condition generally not satisfied in real world problems.

The SVM was introduced by Boser et al. (1992). It is based on the concept of decision planes, termed hyperplanes, which are constructed in a multidimensional space for classification. The decision planes separate the classes. A linear decision plane is too limited in its application because of the heterogeneous nature of experimental data; in such cases, the linear decision plane lacks the ability to perform the classification. Here, nonlinear classifiers based on a kernel function play a vital role. The kernel function (a mathematical function) maps the data into a higher dimensional feature space, where each coordinate corresponds to one feature of the data items. In this way, the data are transformed into a set of points in a Euclidean space, leading to the classification. In the present work, the RBF and polynomial kernels are used, with the kernel parameters γ and ν chosen for each kernel.

The above listed methods were employed to classify the events into the γ and hadron cases. An equal number of events of each type was considered, as described in an earlier section. A portion of the events was used for training all the machine learning methods and the rest of the data was used as a test sample. The same training and test data were used by all the methods to have a one to one correspondence in the results.
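The MAP rule of Equations (3)–(5) can be sketched as a small Gaussian Naive Bayes classifier, an illustrative stand-in for the STATISTICA implementation: each class supplies a prior and per-feature (mean, std) parameters, and the label with the largest posterior score wins.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log of a univariate Gaussian density."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std * math.sqrt(2 * math.pi)))

def naive_bayes_predict(x, classes):
    """MAP classification under the naive independence assumption.

    classes: dict label -> (prior, [(mean, std) per feature]).
    Returns the label maximising log p(y) + sum_i log p(x_i | y),
    i.e. Eq. (5) with the factorization of Eq. (4).
    """
    def score(label):
        prior, params = classes[label]
        return math.log(prior) + sum(
            gaussian_logpdf(xi, m, s) for xi, (m, s) in zip(x, params))
    return max(classes, key=score)
```

Working in log space avoids numerical underflow when many features are multiplied as in Equation (4).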
After training, the test sample was passed through the trained classifiers and predictions of the γ and hadron classes were made. Our aim is to identify the best classifier. The accuracy of the prediction rules can be evaluated by Receiver Operating Characteristic (ROC) curves, which are graphical techniques (Fawcett 2006) to compare classifiers and visualize their performance. These curves are applied in virtually every field of decision making, such as signal detection theory (Egan 1975) and, more recently, the medical field (Swets 1988).

We are considering a binary classification problem where the two classes are γ and hadrons. For a binary classification problem, a total of four outcomes is possible: two outcomes are related to the correct classification of the two classes and two to incorrect classification. The True Positive (TP) class denotes the correct classification of class γ and the True Negative (TN) class represents the correct classification of class hadron. The False Negative (FN) class reflects class γ incorrectly classified as class hadron and the False Positive (FP) class is the incorrect classification of class hadron as class γ.

The ROC plot is generated by using the above mentioned possible outcomes (TP, TN, FP, FN). The correctly classified γ are represented by the true positive rate (TPR), estimated as in (Fawcett 2006):

TPR = TP / (TP + FN). (6)

The hadrons classified as γ are represented by the false positive rate (FPR), defined as

FPR = FP / (FP + TN). (7)

TPR and FPR can be expressed in terms of the fractions of correctly classified γ and hadrons. From Equations (6)–(7), it can be shown that

TPR = ǫ_γ, (8)
FPR = ǫ_hadron. (9)

Hence the TPR is the accepted γ fraction and the FPR is the accepted hadron fraction. The best classifier is the one which provides the maximum TPR for the minimum FPR.
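Equations (6) and (7) translate directly into code; given the four outcome counts of a confusion matrix, the two rates are:

```python
def rates(tp, fn, fp, tn):
    """TPR (Eq. 6) and FPR (Eq. 7) from the four outcome counts."""
    tpr = tp / (tp + fn)   # accepted gamma fraction
    fpr = fp / (fp + tn)   # accepted hadron fraction
    return tpr, fpr
```

Sweeping the classifier's decision boundary and recomputing these two rates at each setting traces out one curve of the decision-plot described below.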
It should be noted that we are not generating the ROC curves in the strict sense. The ROC curves lie between (0, 0) and (1, 1); in the present study, in order to better understand the results, the hadron acceptance was plotted on a logarithmic axis. Therefore, the ROC plots in this study differ from conventional ROC plots.

In order to find the best classifier, the decision boundary for prediction was varied. Each decision boundary generated one point in the γ-acceptance (TPR) versus hadron acceptance (FPR) curve. These rates were plotted and the resultant plot is referred to as a decision-plot. The decision-plot was generated for each classifier. If the decision-plot skews towards the left side, it indicates greater accuracy, i.e. a higher ratio of true positives to false positives. In order to compare the various classifiers, the decision-plot was generated after classification by all the methods. The top most curve in the decision-plot corresponds to the best classifier, because for the same hadron acceptance the upper curve gives the highest γ-acceptance.

The decision-plot is the qualifying metric to select the most suitable classification method. In addition to the decision-plot, the difference among the various classifiers was also quantified by estimating the signal strength at a representative γ-acceptance value. The quantifying metric is designated as "signal strength" and defined as

σ = S / √(2B + S), (10)

where S = ǫ_γ N_S and B = ǫ_p N_B are the signal and background events respectively (Li & Ma 1983). The signal strength was estimated by taking N_B = 10 000 and N_S = 500 (Bock et al. 2004). Since the conventional dynamic supercut method estimated the γ-acceptance at a particular value, the hadron acceptance for each classifier was derived from the decision-plot at that same γ-acceptance. The decision-plot was generated for two sets of image parameters; as mentioned earlier, the two sets were considered to evaluate the classification strength as a function of the number of image parameters.
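The signal strength of Equation (10) can be evaluated directly; the defaults below use the N_S = 500 and N_B = 10 000 quoted above.

```python
import math

def signal_strength(eps_gamma, eps_p, n_s=500, n_b=10000):
    """Signal strength of Eq. (10): sigma = S / sqrt(2B + S),
    with S = eps_gamma * N_S and B = eps_p * N_B."""
    s = eps_gamma * n_s   # surviving signal events
    b = eps_p * n_b       # surviving background events
    return s / math.sqrt(2 * b + s)
```

Lowering the hadron acceptance eps_p at a fixed γ-acceptance directly raises σ, which is how the position of a curve in the decision-plot translates into signal strength.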
The decision-plots for these two cases are shown in Figure 2.

Fig. 2 Signal vs. background acceptance. The left panel shows the classification result using the five attributes/parameters; the right panel shows it for the seven attributes/parameters.

The comparison of the decision-plots for the RF method with the five and seven image parameter sets shows that the RF method yields a better classification strength with seven parameters. This difference in the classification is, however, small in the γ-ray acceptance over the given hadron acceptance range. The difference results from the larger number of image parameters and guides us to choose a larger number of image parameters during the training of the classification method. The decision-plot for the artificial neural network method also reflects a tendency to prefer more image parameters for better classification. As per the decision-plot, the other two methods also indicate a positive effect of more image parameters on the classification strength. The decision-plot provides an estimate of the possible γ-acceptance for a user chosen background (hadron) rejection; the classifier yielding the maximum γ-acceptance for a given hadron acceptance is the better classifier.

Fig. 3 γ-acceptance as a function of projected hadron rejection.

Figure 3 shows the γ-acceptance as a function of projected hadron rejection for four representative projected hadron-rejection values. For the highest projected hadron rejection, the RF method yields the largest γ-acceptance. The classifier coming closest to RF is ANN, which for the same hadron rejection secures a noticeably smaller γ-acceptance. The other two classifiers fail to reach the highest projected hadron rejection, and they yield a much smaller γ-acceptance than the above two classifiers even at a lower projected hadron rejection.

In addition to estimating the signal strength, the misclassification rate was also estimated by using a confusion matrix. The misclassification rate and the signal strength are shown in Table 2.
Table 2 Misclassification Rate and Signal Strength with Five and Seven Image Parameters

Classification Method    Misclassification Rate (%) R_5/R_7    Signal Strength σ_5/σ_7
Random Forest            5.44 / …                              … / …

The positive effect of a greater number of parameters is better seen in a quantification of the misclassification rate as well as the signal strength. Table 2 shows that a higher number of attributes/parameters for training a classifier improves the signal strength while the misclassification rate goes down. Such improvement in the misclassification rate as well as the signal strength is equally visible in all the classification methods.

It should be noted that the signal strength entries for the SVM in Table 2 are absent; only the misclassification rate is given. Many classification methods (ANN, DISC, NB) used in STATISTICA give a probabilistic output as well as the prediction probability, but there are instances where the prediction is a hard prediction, i.e. a YES or NO output. In the case of the SVM, the STATISTICA package yields hard predictions, thereby hindering the generation of a set of confusion matrices for different decision boundaries. Due to the lack of probabilistic output from the SVM, it is difficult to estimate the signal strength. However, the misclassification rates in Table 2 for the SVM with both kernels (RBF and polynomial) suggest that, for the given dataset, the γ and hadron acceptances will remain lower compared to those of the RF and ANN methods. On this premise, it can be concluded that the SVM will not be able to match these two classifiers for our requirement.

Note that the strength of the ROC curves is generally exploited by comparing various classifiers, and a suitable classifier is selected on the basis of its position in the ROC space: the top left most curve is considered the best classifier. However, this view of selecting the classifier on the basis of its position in the top left most part of the decision-plot is oversimplistic. The Precision-Recall (PR) curves are more fundamental than the ROC plots.
According to the theorem of Davis & Goadrich (2006), "For a fixed number of positive and negative examples, one curve dominates a second curve in ROC space if and only if the first dominates the second in PR space." The precision is defined as

Precision = TP / (TP + FP). (11)

The precision essentially reflects the fraction of the examples classified as positive (here class γ) which are truly positive. The recall is the TPR. In the PR space, the recall is plotted on the x-axis and the precision on the y-axis. The classifier attaining the top position in the PR space, and hence in the ROC space (as per the above mentioned theorem), is regarded as the best classifier. Therefore, in order to reach a conclusion about the best classifier, it is important to evaluate the classifier performance in the PR space. The PR plot was generated for both sets of image parameters and is shown in Figure 4.

Fig. 4 PR curves. The left panel shows the PR curve for the five attributes/parameters; the right panel shows it for the seven attributes/parameters.

The RF method retains the top most position in the ROC curve as well as in the PR space compared to the other classifiers. Therefore, on the basis of these two curves, it can be concluded that, since the RF method dominates all the other classifiers for the given dataset, it turns out to be the best classifier. It should be noted that the superiority of the PR curve over ROC plots is more pronounced when there is skewness in the class distribution of a dataset.
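For completeness, the precision of Equation (11) and the recall (= TPR) follow from the same confusion-matrix counts used for the ROC analysis:

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 11) and recall (= TPR, Eq. 6) from the counts."""
    precision = tp / (tp + fp)   # purity of the predicted gamma sample
    recall = tp / (tp + fn)      # accepted gamma fraction
    return precision, recall
```

Unlike the FPR, the precision depends on how many hadrons leak into the predicted γ sample, which is why PR curves are more informative than ROC curves for skewed class distributions.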
Five different machine learning methods were evaluated and compared to decide which of these methods is most suitable for γ-hadron segregation. Given the position of all the methods in the ROC space, the PR space and the misclassification rate for the given dataset, the trend reflects the superiority of RF and ANN over the other methods, i.e., the DISC, NB classifier and SVM. The signal strength was estimated by using a confusion matrix at a representative γ-acceptance value of … . This acceptance value was chosen because the conventional dynamic supercut method yields the same γ-acceptance. The dynamic supercut method yields a signal strength of σ = 12…, whereas the signal strengths are … and … from the RF method and the ANN method, respectively. It is clear that these two methods yield better results than the conventional dynamic supercut method. For the given dataset, the RF method gives an almost … improvement in the signal strength over the ANN method. A similar story is repeated in the estimation of the misclassification rate. It is of course difficult to make a generalized statement about the superiority of the RF method over any other method. Yet, the dominance of the RF method in the ROC plot as well as in the PR space indicates that, for the given dataset, the results tilt in favor of the RF method. In addition to the above classification metrics, the RF method has an advantage in computational time over perceptron-based methods like the ANN: as the number of perceptrons increases, training becomes very computationally expensive, and an increase in the number of attributes/parameters adds further to this expense. Also, unlike the ANN method, which acts as a black box, the RF method is quite easy to understand. Furthermore, the RF method demands very little processing capability.
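The two comparison criteria discussed above, discrimination power and training cost, can be sketched as follows on synthetic data. `RandomForestClassifier` and `MLPClassifier` are scikit-learn stand-ins for the RF and ANN implementations actually used in the paper, and the AUC and timing numbers they produce are illustrative only.

```python
# Sketch: comparing a Random Forest against a small perceptron-based network
# on (i) discrimination power (ROC AUC) and (ii) wall-clock training time.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=7, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

results = {}
for name, clf in [
    ("RF", RandomForestClassifier(n_estimators=200, random_state=2)),
    ("ANN", MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                          random_state=2)),
]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    train_time = time.perf_counter() - t0
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    results[name] = (auc, train_time)
    print(f"{name}: AUC = {auc:.3f}, training time = {train_time:.2f} s")
```

Enlarging the hidden layer or adding attributes inflates the ANN's training time much faster than the forest's, which is the computational argument made above.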
Finally, the RF method takes care of parameters with little or no separation power, whereas ANN performance can be severely affected by the inclusion of such parameters.

In the next phase, a similar study will be carried out with a bigger dataset and the best method will be employed for γ-hadron segregation using experimental data. With the ever increasing data volume and the inclusion of larger numbers of attributes/parameters in the field of ground based γ-ray astronomy, the RF method, or more generally the tree based methods, is gaining all-around popularity and may soon become the method of choice.

Acknowledgements
MS thanks P. Savicky for making available the decision plot of the simulated MAGIC data. This helped in comparing the decision plot of their simulated data with that from our program.
Appendix A: VARIOUS MACHINE LEARNING METHODS
In addition to the five machine learning methods, various machine learning methods from the TMVA package (Hoecker et al. 2007) were tested and their resultant decision plot is presented. These methods are as follows: Boosted Decision Tree (BDT), BDT with gradient boost (BDTG), BDT with decorrelation (BDTD) + Adaptive Boost, TMlpANN (ROOT's own ANN), Fisher Boost (linear discriminant with boosting) and Probability Density Estimator Range-Search (PDERS). For all these methods, the default settings given by the TMVA developers were used. It is clear from the decision plot (Fig. A.1) that the RF method outperforms all the other methods.
Fig. A.1
The decision plot (γ-acceptance versus hadron acceptance) of the various machine learning methods.
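A decision plot of the kind shown in Fig. A.1 can be traced by scanning the classifier's score threshold and recording the fraction of γ (signal) and hadron (background) events retained at each cut. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a hedged stand-in for a TMVA boosted decision tree; the events and acceptances are illustrative, not the paper's.

```python
# Sketch: building the (hadron acceptance, gamma acceptance) curve of a
# "decision plot" by sweeping the score threshold of a boosted-tree classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for a BDT
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the simulated gamma (1) / hadron (0) events.
X, y = make_classification(n_samples=2000, n_features=7, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

bdt = GradientBoostingClassifier(random_state=3).fit(X_tr, y_tr)
scores = bdt.predict_proba(X_te)[:, 1]  # score that an event is a gamma

gamma_acc, hadron_acc = [], []
for thr in np.linspace(0.0, 1.0, 51):
    keep = scores >= thr
    gamma_acc.append(float(np.mean(keep[y_te == 1])))   # gammas retained
    hadron_acc.append(float(np.mean(keep[y_te == 0])))  # hadrons retained
```

Plotting `hadron_acc` on the x-axis against `gamma_acc` on the y-axis for each method reproduces the comparison made in Fig. A.1: the best classifier keeps the γ-acceptance high while driving the hadron acceptance toward zero.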
References
Albert, J., et al. 2008, Nuclear Instruments and Methods in Physics Research A, 588, 424
Bishop, C. M. 1995, Neural Networks for Pattern Recognition (New York, NY, USA: Oxford University Press, Inc.)
Bock, R. K., Chilingarian, A., Gaug, M., et al. 2004, Nuclear Instruments and Methods in Physics Research A, 516, 511
Boser, B. E., Guyon, I. M., & Vapnik, V. N. 1992, in Proceedings of the fifth annual workshop on Computational learning theory, 144–152, COLT '92 (New York, NY, USA: ACM)
Breiman, L. 1996, Machine Learning, 24, 41
Breiman, L. 2001, Machine Learning, 45, 5
Davis, J., & Goadrich, M. 2006, in Proceedings of the 23rd international conference on Machine learning, 233–240, ICML '06 (New York, NY, USA: ACM)
Dubois, F., Lamanna, G., & Jacholkowska, A. 2009, Astroparticle Physics, 32, 73
Egan, J. P. 1975, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception (New York, NY: Academic Press)
Fawcett, T. 2006, Pattern Recognition Letters, 27, 861
Fiasson, A., Dubois, F., Lamanna, G., Masbou, J., & Rosier-Lees, S. 2010, Astroparticle Physics, 34, 25
Friedman, N., Geiger, D., Goldszmidt, M., et al. 1997, in Machine Learning, 131–163
Gaug, M. 2001, DESY-THESIS-2001-022
Gershenson, C. 2003, CoRR, cs.NE/0308031
Heck, D., Knapp, J., Capdevielle, J. N., Schatz, G., & Thouw, T. 1998, Forschungszentrum Karlsruhe Report FZKA, 6019, 1
Hengstebeck, T. 2007, Measurement of the energy spectrum of the BL Lac object PG1553+113 with the MAGIC telescope in 2005 and 2006, Ph.D. thesis
Hillas, A. M. 1985, in International Cosmic Ray Conference,