Gamma/hadron segregation for a ground based imaging atmospheric Cherenkov telescope using machine learning methods: Random Forest leads
arXiv [astro-ph.IM]. Research in Astronomy and Astrophysics manuscript no. (LaTeX: mridul-mlearning.tex; printed on October 17, 2018; 20:27)
Mradul Sharma∗, J. Nayak, M. K. Koul, S. Bose and Abhas Mitra
Astrophysical Sciences Division, Bhabha Atomic Research Centre, Mumbai, India; The Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, Kolkata, India
Received ; accepted
Abstract
A detailed case study of γ-hadron segregation for a ground based atmospheric Cherenkov telescope is presented. We have evaluated and compared various supervised machine learning methods, such as the Random Forest method, Artificial Neural Network, Linear Discriminant method, Naive Bayes Classifier and Support Vector Machine, as well as the conventional dynamic supercut method, by simulating triggering events with the Monte Carlo method and applying the results to a Cherenkov telescope. It is demonstrated that the Random Forest method is the most sensitive machine learning method for γ-hadron segregation.

Key words: methods: statistical — telescopes
Multidimensional datasets are very difficult to handle with conventional methods, which are generally linear in nature. Therefore, when multidimensional data are encountered, the efficiency of these methods reduces drastically, as any interdependence among various parameters is beyond the realm of linear methods. In the case of ground based atmospheric Cherenkov systems, the typical characterization of a signal involves more than four attributes/parameters. Present day Cherenkov systems are operating in an energy regime where conventional methods are losing their edge on account of fading differences in the discriminating attributes/parameters between signal and background. Therefore, the ground based gamma ray astronomy community has started exploring various options, including multivariate methods. These multivariate methods fall under the umbrella of machine learning methods. The simplicity and intrinsic ability of these methods to scrub out interdependence, if any, among various attributes/parameters has made machine learning one of the fastest growing scientific disciplines. These methods employ statistical tools to decipher hidden relationships, if any, among a few or a collection of attributes/parameters with comparatively little computing infrastructure.

Machine learning methods have been explored in the field of ground based gamma ray astronomy for quite some time. The earliest efforts were initiated by Bock et al. (2004). Later on, for γ-hadron segregation, the effectiveness of tree based multivariate classifiers was demonstrated by two operational ground based observatories, MAGIC (Albert et al. 2008) and HESS (Ohm et al. 2009; Fiasson et al. 2010; Dubois et al. 2009). It should be noted that no machine learning method is sacrosanct as far as its superiority over other multivariate methods is concerned.
Each dataset is unique and a classifier's performance depends on the dataset under investigation. Therefore, in order to assess the suitability of a classifier, each dataset needs to be probed independently. In this paper, we compare and evaluate various supervised machine learning methods to assess their suitability for γ-hadron segregation. A total of five machine learning methods, namely Random Forest (RF), Artificial Neural Network (ANN), Linear Discriminant Analysis (DISC), Naive Bayes (NB) Classifier and Support Vector Machine (SVM) with the Radial Basis Function (RBF) and polynomial kernels, have been investigated. They were selected so as to represent the different streams of machine learning. Among these five methods, the RF method represents a logic based algorithm. The ANN methods are perceptron based techniques. On the other hand, the DISC and NB Classifiers are statistical learning methods. Furthermore, the SVM represents a rather new (1992) machine learning technique. The signal strength after classification by each machine learning method was compared with that of the conventional dynamic supercut method and a conclusion is reached to select the best classification method.

The plan of the paper is as follows: In Section 2, a brief summary of ground based atmospheric Cherenkov telescopes and the underlying principle is outlined. Section 3 describes the database used to compare the various machine learning methods. The subsequent section provides an overview of all the machine learning methods. The final two sections deal with a critical analysis of all the classifiers and the conclusion, respectively.

Ground based gamma ray astronomy is a rather new discipline. The first successful detection of the TeV source, the Crab Nebula, took place in 1989 (Weekes et al. 1989). After a brief lull in the field, the next detection took place in 1992, when the second TeV γ-ray source, Markarian 421, was detected (Punch et al. 1992), followed in 1996 by Mrk 501 (Quinn et al.
1996) was detected. Slowly, a series of such extragalactic sources was discovered. With the advent of more sensitive systems, the catalog of TeV γ-ray sources saw the addition of newer sources. The present day field of ground based gamma ray astronomy is flourishing with new detections of exotic sources. In fact, so far more than 150 galactic and extragalactic sources have been discovered (http://tevcat.uchicago.edu/).

Fig. 1 Diagram of a few image parameters.

The detection of cosmic γ-ray sources is based on the principle of the detection of Cherenkov photons produced by cosmic rays in the atmosphere. When cosmic rays enter the atmosphere, they interact with atmospheric nuclei by hadronic and electromagnetic interactions. Electrons and the cosmic γ-rays interact electromagnetically, i.e. they generate secondary particles by the 'pair production' and 'bremsstrahlung' processes. The hadronic cosmic rays, namely protons and ionized nuclei, interact via the hadronic interaction and also give rise to a number of secondary particles. Such generation of secondary particles in the atmosphere is called the
Extensive Air Shower. The hadronic showers create π⁰ particles that decay into γ-rays, making it difficult to distinguish these hadronic showers from genuine showers initiated by γ-rays. The segregation of showers initiated by γ-rays is quite challenging because cosmic rays outnumber the γ-rays by a huge margin.

The secondary particles generated in extensive air showers move with relativistic speeds and generate Cherenkov radiation in the atmosphere. The technique of detecting the Cherenkov photon image is known as the Imaging Atmospheric Cherenkov Technique (IACT). If the shower is close enough to the telescope, the Cherenkov photons are reflected by the telescope's reflecting dish and focused onto the camera (an array of photomultiplier tubes in the focal plane of the detector). The geometrical projection of the shower onto the detector is called an image. The IACT is used to differentiate between γ and hadron initiated showers on the basis of the shape and orientation of the images. The image parameterization was introduced by Hillas and hence these parameters are known as Hillas parameters (Hillas 1985). Image properties (analyzed offline) provide information about the nature, energy and incoming direction of the primary particle triggering a shower. A representative diagram of the Hillas parameters is shown in Figure 1.
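As an illustration of how such image parameters are derived, the length and width are essentially the RMS spreads of the signal-weighted pixel distribution along the image's major and minor axes, and the distance is the centroid offset from the camera centre. The sketch below is a minimal, hypothetical moment analysis with toy pixel coordinates in degrees, not the actual TACTIC analysis code.

```python
import math

def hillas_length_width(pixels):
    """Toy Hillas moment analysis.

    pixels: list of (x_deg, y_deg, signal_pe) tuples for a cleaned image.
    Length/width are the square roots of the eigenvalues of the
    signal-weighted covariance matrix of the pixel positions.
    """
    s = sum(q for _, _, q in pixels)
    mx = sum(x * q for x, _, q in pixels) / s          # centroid x
    my = sum(y * q for _, y, q in pixels) / s          # centroid y
    cxx = sum((x - mx) ** 2 * q for x, _, q in pixels) / s
    cyy = sum((y - my) ** 2 * q for _, y, q in pixels) / s
    cxy = sum((x - mx) * (y - my) * q for x, y, q in pixels) / s
    # Eigenvalues of the 2x2 covariance matrix give major/minor spreads.
    tr, det = cxx + cyy, cxx * cyy - cxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    length = math.sqrt(tr / 2 + disc)
    width = math.sqrt(max(tr / 2 - disc, 0.0))
    distance = math.hypot(mx, my)                      # centroid offset
    return length, width, distance
```

For a perfectly straight toy image the width collapses to zero, while the length reduces to the RMS spread of the signal along the track.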
A database of Monte Carlo simulations was generated by using the CORSIKA air shower code (Heck et al. 1998) with the Cherenkov option. The simulations were carried out for the TACTIC telescope (Koul et al. 2011) located at the Mount Abu observatory. The simulated showers were generated at five zenith angles. An imaging camera with a total of 349 pixels was considered, with the innermost pixels being used for generating the trigger. The Cherenkov photons triggered the telescope after encountering the wavelength dependent photon absorption, the reflection coefficient of the mirror facets, the light cones used in the camera and the quantum efficiency of the photomultiplier tubes. All the triggered events underwent the usual image cleaning procedures described in the literature (Konopelko et al. 1996) to eliminate background noise.

The simulated events triggering the telescope were selected according to the appropriate differential spectral indices for γ-rays and protons. The γ events were generated in the energy range 1–20 TeV and the corresponding proton events from 2–40 TeV. In order to have a robust and well contained image inside the camera, prefiltering cuts on the size (photoelectrons) and on the distance parameter were applied. This process yielded equal numbers of events for γ-rays and protons.

Various Hillas image parameters (Hillas 1985) like length, width, distance, size (photoelectrons) and zenith angle can be used in the process of γ-hadron segregation. However, the size parameter as well as the zenith angle parameter are not strictly separation parameters for γ-ray and hadronic showers. The zenith angle, for instance, cannot by itself be used to separate the events, although different image parameters depend on it. The same is true of the size (photoelectron) parameter.
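The image cleaning step mentioned above is commonly implemented as a two-level tail cut: pixels above a high 'picture' threshold are kept, together with their neighbours above a lower 'boundary' threshold. The sketch below uses illustrative thresholds only; the actual procedure and values of Konopelko et al. (1996) are not reproduced here.

```python
def clean_image(pixels, neighbours, picture=10.0, boundary=5.0):
    """Two-level tail-cut cleaning (illustrative thresholds).

    pixels: dict pixel_id -> signal (photoelectrons).
    neighbours: dict pixel_id -> list of adjacent pixel ids.
    A pixel survives if it exceeds the picture threshold, or exceeds
    the boundary threshold while adjacent to a picture pixel.
    """
    picture_px = {p for p, q in pixels.items() if q >= picture}
    boundary_px = {
        p for p, q in pixels.items()
        if boundary <= q < picture
        and any(n in picture_px for n in neighbours.get(p, []))
    }
    return picture_px | boundary_px
```

Isolated low pixels (noise) are rejected, while the connected Cherenkov image survives.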
A typical problem with these parameters is that if the training samples for γ-rays and hadrons have different distributions in these parameters, the parameters may wrongly be picked up as separation parameters. This may lead to a rather risky situation, which is typically handled by preparing the training samples in such a way that their distributions in those parameters (typically size and zenith angle) are as close as possible. In this way, the uncertainty associated with using such parameters as separation parameters can be avoided. In this study, such complexities have been taken into account. In addition to these parameters, a derived parameter 'dens,' defined (Hengstebeck 2007) as

dens = log(size) / (length × width), (1)

was also used. A total of two sets of image parameters was considered. The idea was to investigate the various classifiers as a function of the image attributes/parameters. In the first instance, only five image parameters were considered from the simulation database: length, width, distance, size and frac2 (defined as the ratio of the sum of the two highest pixel signals to the sum of all the signals). In the second case, we considered a total of seven image parameters: in addition to the above mentioned five parameters, the zenith angle and the dens parameter were also included. However, for classification purposes, the alpha parameter was not considered. Alpha is a very powerful parameter as it carries the signature of the progenitor (γ or proton). The alpha distribution is expected to be flat for cosmic ray protons, whereas it shows a peaky behavior at small alpha values for γ-rays. In order to remove any bias from such a strong parameter, it was not considered for classification purposes. Moreover, this parameter plays a crucial role in the estimation of the signal strength: if the alpha parameter is used in the classification, then the hadronic background cannot be evaluated.
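The derived parameter of Equation (1) is straightforward to compute; the helper below assumes a base-10 logarithm (the base is not stated above) and takes the length and width in degrees.

```python
import math

def dens(size, length, width):
    """'dens' parameter of Eq. (1): log(size) / (length * width).

    A base-10 logarithm is assumed here; size is in photoelectrons,
    length and width in degrees.
    """
    return math.log10(size) / (length * width)
```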
The problem of γ-hadron segregation is formulated as a two class problem: γ represents one class and the hadron is the second class. In the literature, a large variety of multivariate classification methods exists. However, to have a tractable analysis, a few representative supervised machine learning methods were selected. The classification was carried out by using five different machine learning methods, namely RF, ANN, DISC, NB Classifier and SVM with the RBF and polynomial kernels. Except for the RF and the dynamic supercut methods, the methods were applied from a commercially available package named STATISTICA, while the RF method was studied by using the original RF code.

The spatial distribution of Cherenkov photons on the image plane of the camera is parameterized on the basis of the shape and size (light content) of each such image. The conventional parameterization leads to the estimation of the image parameters (Hillas 1985). In this technique, various sequential cuts on the image parameters are applied so as to maximize the γ-ray like signal and reject the maximum number of background events. However, this scheme has a disadvantage because the width and length parameters grow with the primary energy. It is observed that the width and length of an image are well correlated with the logarithm of the size, and the size of the image provides an estimate of the primary energy. This method of scaling the width and length parameters with the size is known as the dynamic supercut method (Mohanty et al. 1998). In this method, the optimum number of cut parameters and their values are estimated by numerically maximizing the so called quality factor Q (Gaug 2001), defined as

Q = ǫ_γ / √ǫ_P , (2)

where ǫ_γ and ǫ_P are the γ and hadron acceptances respectively.
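The numerical maximization of Q can be sketched as a scan over candidate cut values. The toy function below applies a single upper cut to one image parameter and keeps the cut that maximizes Equation (2); the real optimisation runs over several dynamic cuts simultaneously.

```python
import math

def quality_factor(eps_gamma, eps_proton):
    """Quality factor of Eq. (2): Q = eps_gamma / sqrt(eps_proton)."""
    return eps_gamma / math.sqrt(eps_proton)

def best_cut(gamma_vals, proton_vals, candidate_cuts):
    """Toy one-parameter version of the numerical Q maximization.

    Scan candidate upper cuts on a single parameter and return the
    (cut, Q) pair that maximizes the quality factor.
    """
    best = None
    for c in candidate_cuts:
        eg = sum(v <= c for v in gamma_vals) / len(gamma_vals)
        ep = sum(v <= c for v in proton_vals) / len(proton_vals)
        if ep == 0:          # no background survives: Q undefined
            continue
        q = quality_factor(eg, ep)
        if best is None or q > best[1]:
            best = (c, q)
    return best
```

A tight cut raises the γ purity but lowers ǫ_γ; Q balances the two in the same way as the sensitivity of a counting experiment.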
The γ-acceptance is defined as the fraction of correctly classified γ events out of the total number of γ events, and ǫ_P is the fraction of proton events which behave like γ events after the γ-hadron classification. The image parameter cuts in Table 1 lead to the maximum quality factor.

Table 1 Dynamic Supercut Parameters

Parameter      Cut Value
Length (L)     0.110° ≤ L ≤ (0.235 + …)°
Width (W)      0.065° ≤ W ≤ (0.085 + …)°
Distance (D)   …° ≤ D ≤ …°
Size (S)       S ≥ … pe
Alpha (α)      α ≤ …°
Frac2          frac2 ≥ …

The RF method is a flexible multivariate selection method. The algorithm for RF was developed by Leo Breiman and Adele Cutler. The RFs are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman 2001). The classification trees, also known as "decision trees," are machine learning prediction models constructed by recursively partitioning the data set. Each binary recursive partitioning splits the data set into different branches. The tree construction starts from the root node (the entire dataset) and ends at a leaf, and every leaf node is assigned to a class. The RF method combines the concepts of 'bagging' (Breiman 1996) and 'Random Split Selection.'

The RF builds on the bagging (Breiman 1996) technique, where bagging stands for the "Bootstrapping" and "Aggregating" techniques. The basic idea of bagging is to use bootstrap re-sampling to generate multiple versions of a predictor and to combine them to make the classification. Bootstrapping is based on random sampling with replacement: the probability of selecting a given event in each draw of the sampling (with replacement) procedure is constantly 1/n, so the probability of not selecting it in one draw is (1 − 1/n). If the selection process is repeated n times, where n is very large, the probability of an event never being selected approaches 1/e. Therefore, only about 2/3 (∼0.632) of the events enter each bootstrap sample. In addition to bagging, RF also employs "Random Split Selection": at each node of the decision tree, m variables are selected at random out of the M input variables and the best split is selected among these m. Typically m ≈ √M predictors are selected.
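The ingredients described above (bootstrap bagging, random selection of m ≈ √M features at each split, and the out-of-bag error estimate discussed below) can be sketched with the scikit-learn implementation of RF rather than the original Fortran code; the two-class Gaussian toy data merely stand in for the γ/hadron image parameters.

```python
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
# Toy stand-in for gamma/hadron image parameters: two overlapping
# 4-dimensional Gaussian clusters (labels 1 = gamma, 0 = hadron).
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(300)] + \
    [[random.gauss(2, 1) for _ in range(4)] for _ in range(300)]
y = [1] * 300 + [0] * 300

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    max_features="sqrt",     # random split selection: m = sqrt(M)
    oob_score=True,          # error estimate from out-of-bag events
    random_state=0,
).fit(X, y)

print("oob accuracy:", round(clf.oob_score_, 3))
```

The `oob_score_` attribute is the out-of-bag accuracy, i.e. one minus the oob error estimate described in the text, obtained without a separate test sample.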
Two sources of randomness, namely random inputs and random features, make RFs accurate classifiers. In order to measure the classification power (separation ability) of a parameter and to optimize the cut value, the Gini index is used, which measures the inequality of two distributions. It is defined as the ratio between (a) the area between the observed cumulative distribution and the hypothetical cumulative distribution for a non-discriminating variable (uniform distribution, the 45° line), and (b) the area under this uniform distribution. It is a variable between zero and one; a low Gini coefficient indicates more equal distributions, while a high Gini coefficient indicates an unequal distribution. Breiman (2001) estimated the error rate on out-of-bag (oob) data. Each tree is constructed on a different bootstrap sample. Since in each bootstrap training set about one third of the instances are left out (i.e. out-of-bag), we can estimate the test set classification error by applying each case that is left out of the construction of the t-th tree to the t-th tree. To be precise, the oob error estimate is the proportion of misclassifications for the oob data.

In this study, the original RF code in Fortran was employed and a forest of trees was generated; the m_try variable defined in the code was varied and very similar results were obtained in each case. The resultant output of this code was compared with the implementation of RF in the statistical package R (http://cran.r-project.org/). It is worth mentioning here that the Fortran code encounters some memory issues when the number of training/test events crosses a certain threshold; this limitation was not encountered in the RF implementation in R.

The ANN consists of many inputs (Gershenson 2003), which are multiplied by weights (the strengths of the respective signals) and then computed by a mathematical function that determines the activation of the neuron.
Another function computes the output of the artificial neuron. The specific output demanded by the user can be obtained by adjusting the weights of the artificial neurons. A multilayer perceptron (MLP) is perhaps the most popular network architecture in use today, due originally to Rumelhart and McClelland (Rumelhart et al. 1986) and discussed at length in most neural network textbooks (Bishop 1995). Each neuron performs a weighted sum of its inputs and passes it through a transfer function to produce its output.

In this work, we use an MLP network with five inputs and between three and 11 hidden units. For the classification tasks, the probabilistic output was generated and the misclassification rate was estimated.
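The behaviour described above, where each neuron forms a weighted sum of its inputs and passes it through a transfer function, can be sketched as a forward pass with a sigmoid activation; the weights here are placeholders, not a trained network.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs passed
    through a sigmoid transfer function."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def mlp_forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """Forward pass of a minimal MLP with one hidden layer; the
    sigmoid output can be read as a class probability."""
    hidden = [neuron(x, w, b)
              for w, b in zip(hidden_weights, hidden_biases)]
    return neuron(hidden, out_weights, out_bias)
```

Training then amounts to adjusting the weights and biases (e.g. by backpropagation) until the output matches the desired class labels.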
Linear Discriminant Analysis is also known as Discriminant Function Analysis (DFA). DFA combines aspects of multivariate analysis of variance with the ability to classify observations into known categories. It is a multivariate technique which is not only utilized for classification but also estimates how good the classification is. In this method, discriminant functions such as canonical correlations are constructed and each function is assessed for significance. The estimation of the significance of a set of discriminant functions is computationally identical to multivariate analysis of variance. After estimating the significance, one proceeds to classification. It generally turns out that the first one or two functions play an important role while the rest can be neglected. Each discriminant function is orthogonal to the previous function. In the present case, it is known that each class belongs to either γ or hadron; thus the a priori probabilities of these classes are known. Accordingly, in this work, these prior probabilities are used for classification.

Bayesian classifiers gained prominence in the early nineties and perform very well (Langley et al. 1992; Friedman et al. 1997). A Naive Bayes classifier is a generative classifier technique based on probability theory. The Bayes theorem plays a critical role in probabilistic learning and classification. It states that

p(B|A) = p(A|B) p(B) / p(A), (3)

where p(A) is the independent probability of A, p(B) is the independent probability of B, p(A|B) is the conditional probability of A given B, and p(B|A) is the conditional probability of B given A, i.e. the posterior probability. In "Naive Bayes Classification," the different variables/attributes/features are assumed to be strongly (naively) independent, i.e.,

p(<x_1, x_2, ..., x_n> | y) = ∏_{i=1}^{n} p(x_i | y). (4)

Using the strong "independence assumption" and the prior probabilities, the most probable class for a given x is estimated. The best class is the most likely or maximum a posteriori (MAP) class. The MAP estimate gives

argmax_B p(B|A) = argmax_B p(A|B) p(B). (5)

Training and evaluation with this method are very fast, but the assumption of strong independence among parameters is a condition generally not satisfied in real world problems.

The SVM was introduced by Boser et al. (1992). It is based on the concept of decision planes, termed hyperplanes, which are constructed in a multidimensional space for classification. The decision planes separate the classes. A linear decision plane is too limited in its application because of the heterogeneous nature of experimental data; in such cases, the linear decision plane lacks the ability to perform the classification. Here, nonlinear classifiers based on a kernel function play a vital role. The kernel function (a mathematical function) maps the data into a higher dimensional feature space, where each coordinate corresponds to one feature of the data items. In this way, the data are transformed into a set of points in a Euclidean space, leading to the classification. In the present work, the RBF and polynomial kernels are used, with the kernel parameters γ and ν chosen for each kernel.

The above listed methods were employed to classify the events into the γ and hadron cases. An equal number of events of each type was considered, as described in an earlier section. A portion of the events was used for training all the machine learning methods and the rest of the data was used as a test sample. The same training and test data were used by all the methods to have a one to one correspondence in the results.
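The MAP rule of Equations (3)–(5) can be sketched as a small Gaussian Naive Bayes classifier, an illustrative stand-in for the STATISTICA implementation: each class supplies a prior and per-feature (mean, std) parameters, and the label with the largest posterior score wins.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log of a univariate Gaussian density."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std * math.sqrt(2 * math.pi)))

def naive_bayes_predict(x, classes):
    """MAP classification under the naive independence assumption.

    classes: dict label -> (prior, [(mean, std) per feature]).
    Returns the label maximising log p(y) + sum_i log p(x_i | y),
    i.e. Eq. (5) with the factorization of Eq. (4).
    """
    def score(label):
        prior, params = classes[label]
        return math.log(prior) + sum(
            gaussian_logpdf(xi, m, s) for xi, (m, s) in zip(x, params))
    return max(classes, key=score)
```

Working in log space avoids numerical underflow when many features are multiplied as in Equation (4).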
After training, the test sample was passed through the trained classifiers and predictions of the γ and hadron classes were made. Our aim is to identify the best classifier. The accuracy of the prediction rules can be evaluated by Receiver Operating Characteristic (ROC) curves, which are graphical techniques (Fawcett 2006) to compare classifiers and visualize their performance. These curves are applied in virtually every field of decision making, such as signal detection theory (Egan 1975) and, more recently, the medical field (Swets 1988).

We are considering a binary classification problem where the two classes are γ and hadrons. For a binary classification problem, a total of four outcomes is possible: two outcomes are related to the correct classification of the two classes and two to incorrect classification. The True Positive (TP) class denotes the correct classification of class γ and the True Negative (TN) class represents the correct classification of class hadron. The False Negative (FN) class reflects class γ incorrectly classified as class hadron and the False Positive (FP) class is the incorrect classification of class hadron as class γ.

The ROC plot is generated by using the above mentioned possible outcomes (TP, TN, FP, FN). The correctly classified γ are represented by the true positive rate (TPR), estimated as in (Fawcett 2006):

TPR = TP / (TP + FN). (6)

The hadrons classified as γ are represented by the false positive rate (FPR), defined as

FPR = FP / (FP + TN). (7)

TPR and FPR can be expressed in terms of the fractions of correctly classified γ and hadrons. From Equations (6)–(7), it can be shown that

TPR = ǫ_γ, (8)
FPR = ǫ_hadron. (9)

Hence the TPR is the accepted γ fraction and the FPR is the accepted hadron fraction. The best classifier is the one which provides the maximum TPR for the minimum FPR.
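Equations (6) and (7) translate directly into code; given the four outcome counts of a confusion matrix, the two rates are:

```python
def rates(tp, fn, fp, tn):
    """TPR (Eq. 6) and FPR (Eq. 7) from the four outcome counts."""
    tpr = tp / (tp + fn)   # accepted gamma fraction
    fpr = fp / (fp + tn)   # accepted hadron fraction
    return tpr, fpr
```

Sweeping the classifier's decision boundary and recomputing these two rates at each setting traces out one curve of the decision-plot described below.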
It should be noted that we are not generating the ROC curves in the strict sense. The ROC curves lie between (0, 0) and (1, 1); in the present study, in order to better understand the results, the hadron acceptance was plotted on a logarithmic axis. Therefore, the ROC plots in this study differ from conventional ROC plots.

In order to find the best classifier, the decision boundary for prediction was varied. Each decision boundary generated one point in the γ-acceptance (TPR) versus hadron acceptance (FPR) curve. These rates were plotted and the resultant plot is referred to as a decision-plot. The decision-plot was generated for each classifier. If the decision-plot skews towards the left side, it indicates greater accuracy, i.e. a higher ratio of true positives to false positives. In order to compare the various classifiers, the decision-plot was generated after classification by all the methods. The top most curve in the decision-plot corresponds to the best classifier, because for the same hadron acceptance the upper curve gives the highest γ-acceptance.

The decision-plot is the qualifying metric to select the most suitable classification method. In addition to the decision-plot, the difference among the various classifiers was also quantified by estimating the signal strength at a representative γ-acceptance value. The quantifying metric is designated as "signal strength" and defined as

σ = S / √(2B + S), (10)

where S = ǫ_γ N_S and B = ǫ_p N_B are the signal and background events respectively (Li & Ma 1983). The signal strength was estimated by taking N_B = 10 000 and N_S = 500 (Bock et al. 2004). Since the conventional dynamic supercut method estimated the γ-acceptance at a particular value, the hadron acceptance for each classifier was derived from the decision-plot at that same γ-acceptance. The decision-plot was generated for two sets of image parameters; as mentioned earlier, the two sets were considered to evaluate the classification strength as a function of the number of image parameters.
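The signal strength of Equation (10) can be evaluated directly; the defaults below use the N_S = 500 and N_B = 10 000 quoted above.

```python
import math

def signal_strength(eps_gamma, eps_p, n_s=500, n_b=10000):
    """Signal strength of Eq. (10): sigma = S / sqrt(2B + S),
    with S = eps_gamma * N_S and B = eps_p * N_B."""
    s = eps_gamma * n_s   # surviving signal events
    b = eps_p * n_b       # surviving background events
    return s / math.sqrt(2 * b + s)
```

Lowering the hadron acceptance eps_p at a fixed γ-acceptance directly raises σ, which is how the position of a curve in the decision-plot translates into signal strength.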
The decision-plots for these two cases are shown in Figure 2.

Fig. 2 Signal vs. background acceptance. The left panel shows the classification result using the five attributes/parameters; the right panel shows it for the seven attributes/parameters.

The comparison of the decision-plots for the RF method with the five and seven image parameter sets shows that the RF method yields a better classification strength with seven parameters. This difference in the classification is, however, small in the γ-ray acceptance over the given hadron acceptance range. The difference results from the larger number of image parameters and guides us to choose a larger number of image parameters during the training of the classification method. The decision-plot for the artificial neural network method also reflects a tendency to prefer more image parameters for better classification. As per the decision-plot, the other two methods also indicate a positive effect of more image parameters on the classification strength. The decision-plot provides an estimate of the possible γ-acceptance for a user chosen background (hadron) rejection; the classifier yielding the maximum γ-acceptance for a given hadron acceptance is the better classifier.

Fig. 3 γ-acceptance as a function of projected hadron rejection.

Figure 3 shows the γ-acceptance as a function of projected hadron rejection for four representative projected hadron-rejection values. For the highest projected hadron rejection, the RF method yields the largest γ-acceptance. The classifier coming closest to RF is ANN, which for the same hadron rejection secures a noticeably smaller γ-acceptance. The other two classifiers fail to reach the highest projected hadron rejection, and they yield a much smaller γ-acceptance than the above two classifiers even at a lower projected hadron rejection.

In addition to estimating the signal strength, the misclassification rate was also estimated by using a confusion matrix. The misclassification rate and the signal strength are shown in Table 2.
Table 2 Misclassification Rate and Signal Strength with Five and Seven Image Parameters

Classification Method    Misclassification Rate (%) R_5/R_7    Signal Strength σ_5/σ_7
Random Forest            5.44 / …                              … / …

The positive effect of a greater number of parameters is better seen in a quantification of the misclassification rate as well as the signal strength. Table 2 shows that a higher number of attributes/parameters for training a classifier improves the signal strength while the misclassification rate goes down. Such improvement in the misclassification rate as well as the signal strength is equally visible in all the classification methods.

It should be noted that the signal strength entries for the SVM in Table 2 are absent; only the misclassification rate is given. Many classification methods (ANN, DISC, NB) used in STATISTICA give a probabilistic output as well as the prediction probability, but there are instances where the prediction is a hard prediction, i.e. a YES or NO output. In the case of the SVM, the STATISTICA package yields hard predictions, thereby hindering the generation of a set of confusion matrices for different decision boundaries. Due to the lack of probabilistic output from the SVM, it is difficult to estimate the signal strength. However, the misclassification rates in Table 2 for the SVM with both kernels (RBF and polynomial) suggest that, for the given dataset, the γ and hadron acceptances will remain lower compared to those of the RF and ANN methods. On this premise, it can be concluded that the SVM will not be able to match these two classifiers for our requirement.

Note that the strength of the ROC curves is generally exploited by comparing various classifiers, and a suitable classifier is selected on the basis of its position in the ROC space: the top left most curve is considered the best classifier. However, this view of selecting the classifier on the basis of its position in the top left most part of the decision-plot is oversimplistic. The Precision-Recall (PR) curves are more fundamental than the ROC plots.
According to the theorem of Davis & Goadrich (2006), "For a fixed number of positive and negative examples, one curve dominates a second curve in ROC space if and only if the first dominates the second in PR space." The precision is defined as

Precision = TP / (TP + FP). (11)

The precision essentially reflects the fraction of the examples classified as positive (here class γ) which are truly positive. The recall is the TPR. In the PR space, the recall is plotted on the x-axis and the precision on the y-axis. The classifier attaining the top position in the PR space, and hence in the ROC space (as per the above mentioned theorem), is regarded as the best classifier. Therefore, in order to reach a conclusion about the best classifier, it is important to evaluate the classifier performance in the PR space. The PR plot was generated for both sets of image parameters and is shown in Figure 4.

Fig. 4 PR curves. The left panel shows the PR curve for the five attributes/parameters; the right panel shows it for the seven attributes/parameters.

The RF method retains the top most position in the ROC curve as well as in the PR space compared to the other classifiers. Therefore, on the basis of these two curves, it can be concluded that, since the RF method dominates all the other classifiers for the given dataset, it turns out to be the best classifier. It should be noted that the superiority of the PR curve over ROC plots is more pronounced when there is skewness in the class distribution of a dataset.
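For completeness, the precision of Equation (11) and the recall (= TPR) follow from the same confusion-matrix counts used for the ROC analysis:

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 11) and recall (= TPR, Eq. 6) from the counts."""
    precision = tp / (tp + fp)   # purity of the predicted gamma sample
    recall = tp / (tp + fn)      # accepted gamma fraction
    return precision, recall
```

Unlike the FPR, the precision depends on how many hadrons leak into the predicted γ sample, which is why PR curves are more informative than ROC curves for skewed class distributions.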
Five different machine learning methods were evaluated and compared to decide which of these methods is most suitable for γ-hadron segregation. Given the position of all the methods in the ROC space, the PR space and the misclassification rate for the given dataset, the trend reflects the superiority of RF and ANN over the other methods, i.e., the DISC, NB classifier and SVM. The signal strength was estimated by using a confusion matrix at a representative γ-acceptance value of … . This acceptance value was chosen because the conventional dynamic supercut method yields the same γ-acceptance. The dynamic supercut method yields a signal strength of σ = 12…, whereas the signal strengths are … and … from the RF method and the ANN method, respectively. It is clear that these two methods yield better results than the conventional dynamic supercut method. For the given dataset, the RF method gives an almost … improvement in the signal strength over the ANN method. A similar story is repeated in the estimation of the misclassification rate. It is of course difficult to make a generalized statement about the superiority of the RF method over any other method. Yet, the dominance of the RF method in the ROC plot as well as in the PR space indicates that, for the given dataset, the results tilt in favor of the RF method. In addition to the above classification metrics, the RF method has an advantage in computational time over perceptron-based methods like the ANN: as the number of perceptrons increases, training becomes very computationally expensive, and an increase in the number of attributes/parameters adds further to this expense. Also, unlike the ANN method, which acts as a black box, the RF method is quite easy to understand. Furthermore, the RF method demands very little processing capability.
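The two comparison criteria discussed above, discrimination power and training cost, can be sketched as follows on synthetic data. `RandomForestClassifier` and `MLPClassifier` are scikit-learn stand-ins for the RF and ANN implementations actually used in the paper, and the AUC and timing numbers they produce are illustrative only.

```python
# Sketch: comparing a Random Forest against a small perceptron-based network
# on (i) discrimination power (ROC AUC) and (ii) wall-clock training time.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=7, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

results = {}
for name, clf in [
    ("RF", RandomForestClassifier(n_estimators=200, random_state=2)),
    ("ANN", MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                          random_state=2)),
]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    train_time = time.perf_counter() - t0
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    results[name] = (auc, train_time)
    print(f"{name}: AUC = {auc:.3f}, training time = {train_time:.2f} s")
```

Enlarging the hidden layer or adding attributes inflates the ANN's training time much faster than the forest's, which is the computational argument made above.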
Finally, the RF method takes care of parameters with little or no separation power, whereas ANN performance can be severely affected by the inclusion of such parameters.

In the next phase, a similar study will be carried out with a bigger dataset and the best method will be employed for γ-hadron segregation using experimental data. With the ever increasing data volume and the inclusion of larger numbers of attributes/parameters in the field of ground based γ-ray astronomy, the RF method, or more generally the tree based methods, is gaining all-around popularity and may soon become the method of choice.

Acknowledgements
MS thanks P. Savicky for making available the decision plot of the simulated MAGIC data. This helped in comparing the decision plot of their simulated data with that from our program.
Appendix A: VARIOUS MACHINE LEARNING METHODS
In addition to the five machine learning methods, various machine learning methods from the TMVA package (Hoecker et al. 2007) were tested and their resultant decision plot is presented. These methods are as follows: Boosted Decision Tree (BDT), BDT with gradient boost (BDTG), BDT with decorrelation (BDTD) + Adaptive Boost, TMlpANN (ROOT's own ANN), Fisher Boost (linear discriminant with boosting) and Probability Density Estimator Range-Search (PDERS). For all these methods, the default settings given by the TMVA developers were used. It is clear from the decision plot (Fig. A.1) that the RF method outperforms all the other methods.
Fig. A.1
The decision plot (γ-acceptance versus hadron acceptance) of the various machine learning methods.
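A decision plot of the kind shown in Fig. A.1 can be traced by scanning the classifier's score threshold and recording the fraction of γ (signal) and hadron (background) events retained at each cut. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a hedged stand-in for a TMVA boosted decision tree; the events and acceptances are illustrative, not the paper's.

```python
# Sketch: building the (hadron acceptance, gamma acceptance) curve of a
# "decision plot" by sweeping the score threshold of a boosted-tree classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for a BDT
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the simulated gamma (1) / hadron (0) events.
X, y = make_classification(n_samples=2000, n_features=7, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

bdt = GradientBoostingClassifier(random_state=3).fit(X_tr, y_tr)
scores = bdt.predict_proba(X_te)[:, 1]  # score that an event is a gamma

gamma_acc, hadron_acc = [], []
for thr in np.linspace(0.0, 1.0, 51):
    keep = scores >= thr
    gamma_acc.append(float(np.mean(keep[y_te == 1])))   # gammas retained
    hadron_acc.append(float(np.mean(keep[y_te == 0])))  # hadrons retained
```

Plotting `hadron_acc` on the x-axis against `gamma_acc` on the y-axis for each method reproduces the comparison made in Fig. A.1: the best classifier keeps the γ-acceptance high while driving the hadron acceptance toward zero.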
References
Albert, J., et al. 2008, Nuclear Instruments and Methods in Physics Research A, 588, 424
Bishop, C. M. 1995, Neural Networks for Pattern Recognition (New York, NY, USA: Oxford University Press, Inc.)
Bock, R. K., Chilingarian, A., Gaug, M., et al. 2004, Nuclear Instruments and Methods in Physics Research A, 516, 511
Boser, B. E., Guyon, I. M., & Vapnik, V. N. 1992, in Proceedings of the fifth annual workshop on Computational learning theory, 144–152, COLT '92 (New York, NY, USA: ACM)
Breiman, L. 1996, Machine Learning, 24, 41
Breiman, L. 2001, Machine Learning, 45, 5
Davis, J., & Goadrich, M. 2006, in Proceedings of the 23rd international conference on Machine learning, 233–240, ICML '06 (New York, NY, USA: ACM)
Dubois, F., Lamanna, G., & Jacholkowska, A. 2009, Astroparticle Physics, 32, 73
Egan, J. P. 1975, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception (New York, NY: Academic Press)
Fawcett, T. 2006, Pattern Recognition Letters, 27, 861
Fiasson, A., Dubois, F., Lamanna, G., Masbou, J., & Rosier-Lees, S. 2010, Astroparticle Physics, 34, 25
Friedman, N., Geiger, D., Goldszmidt, M., et al. 1997, in Machine Learning, 131–163
Gaug, M. 2001, DESY-THESIS-2001-022
Gershenson, C. 2003, CoRR, cs.NE/0308031
Heck, D., Knapp, J., Capdevielle, J. N., Schatz, G., & Thouw, T. 1998, Forschungszentrum Karlsruhe Report FZKA, 6019, 1
Hengstebeck, T. 2007, Measurement of the energy spectrum of the BL Lac object PG1553+113 with the MAGIC telescope in 2005 and 2006, Ph.D. thesis
Hillas, A. M. 1985, in International Cosmic Ray Conference,