Comparing Classification Models on Kepler Data
Rohan Saha - [email protected]
University of Alberta
November 26, 2019
Abstract

Even though the original Kepler mission ended due to mechanical failures, the Kepler satellite continues to collect data. Using classification models, we can understand the features exoplanets possess and then use those features to investigate further for any more information on the candidate planet. Based on the classification model, the idea is to find out the probability of the planet under observation being a candidate for an exoplanet or a false positive. If the model predicts that the observation is a candidate for being an exoplanet, then further investigation can be conducted. From the model we can narrow down the features that might explain the difference between a candidate and a false positive, which ultimately helps us increase the efficiency of any model, fine-tune it, and improve the process of searching for future exoplanets.
Introduction

Space agencies have placed telescopes in orbit to look for exoplanets, that is, planets that are not present in our solar system. NASA launched the Kepler telescope in 2009 to find exoplanets in other star systems, with the goal of finding other habitable planets that might be explored in the future with scientific advancements. Though the mission ended due to mechanical failures, the satellite still records images on an extended mission. The data set contains numerous features, which are explained by the data dictionary here. Each sample in the dataset is termed a Kepler "object of interest", or KOI. The data set contains 9564 samples, 49 feature columns, and one categorical target variable. Some columns of interest for the experiment are given below [1]:

• kepler_name: These names are intended to clearly indicate a class of objects that have been confirmed or validated as planets, a step up from the planet candidate designation. However, this attribute will not be included for classification.
• koi_pdisposition: The disposition the Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.
• koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.

In addition, features like koi_fpflag_nt and koi_fpflag_ss have a high correlation with the target variable: the former deals with the light curve of the Kepler object of interest, and the latter with whether a transit-like event was significant. From preliminary exploratory data analysis, these two features had high scores in terms of their relation with the target variable. Also, koi_depth is an important variable for the problem, as is its relation to koi_duration and how these two variables affect the candidacy of the astronomical body being an exoplanet.

Problem Description
For the purposes of this problem, the target variable under consideration is 'koi_disposition', and a quick inspection of the data set shows that there are two categorical outputs, "CANDIDATE" and "FALSE POSITIVE". Therefore, this is a binary classification problem. The former output says that the sample under observation is a candidate for being an exoplanet, and the latter says that the sample is a false positive for being considered a planet. This poses an important question: how similar must the attributes be in order to distinguish between false positives and candidates? This is of paramount importance because if a model predicts a sample as a candidate, then further investigation should be undertaken to confirm the sample as an exoplanet; this approach filters out observations, ultimately saving time. Some observations after looking at the data are as follows (a quick check of the class distribution is sketched after this list).

1. The number of samples is almost equally distributed across the labels.
   a. There are 4,496 candidate exoplanets.
   b. There are 5,068 false positives.
2. koi_depth: Candidate exoplanets tend to have a lower koi_depth compared to false positives. However, the distinction is not clear-cut, because there are also some exoplanets that have a higher koi_depth. Therefore, predicting the label solely based on koi_depth would be naive. koi_depth is the fraction of stellar flux lost at the minimum of the planetary transit.
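For concreteness, the class balance can be checked directly from the data. The following is a minimal sketch, assuming pandas and the Kaggle CSV; the file name cumulative.csv and the column name koi_disposition come from the Kaggle dataset and data dictionary, not from the text above.

```python
import pandas as pd

# Load the Kepler KOI table (assumed file name from the Kaggle dataset).
df = pd.read_csv("cumulative.csv")

# Keep only the two labels used in this binary problem.
df = df[df["koi_disposition"].isin(["CANDIDATE", "FALSE POSITIVE"])]

# Verify the near-balanced class distribution reported above.
print(df["koi_disposition"].value_counts())
# Expected roughly: FALSE POSITIVE 5068, CANDIDATE 4496
```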
Experimental Design

The experiment was divided into multiple parts for ease of understanding and clarity. The choice of algorithms and the justification for each choice are provided in a later subsection.
Before preprocessing, basic exploratory data analysis was carried out to obtain a high-level understanding of the data. We first looked at the proportion of samples and found that the dataset has an almost equal number of samples for each class (CANDIDATE and FALSE POSITIVE). From some preliminary data analysis, the most insightful observations came from the two variables koi_duration and koi_depth. These two variables are most insightful because they represent the transit attributes of the astronomical body under consideration: koi_duration represents the time taken from start to end as the astronomical body passes in front of the host star [2], and koi_depth represents the fraction of stellar flux lost at the minimum of the transit. Figure 1 shows that a low koi_depth is significant because a smaller loss of stellar flux suggests that the astronomical body is closer to the host star and thus has a high possibility of revolving in an orbit. A low koi_duration signifies that the transit time is short and that the body has a higher chance of being within the gravitational influence of the host star. Therefore, low values of these two parameters support the candidacy of the astronomical body as an exoplanet.

From Figure 1, it can be observed that candidate exoplanets usually have low koi_depth and low koi_duration. On the other hand, false positives have either high koi_depth and low koi_duration, or low koi_depth and high koi_duration.

Figure 1: koi_duration vs. koi_depth
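A plot in the spirit of Figure 1 can be reproduced from the same dataframe. This is a minimal sketch, assuming matplotlib and the df from the previous snippet; the axis-label units and the log scale are assumptions for readability, not taken from the figure.

```python
import matplotlib.pyplot as plt

# Scatter plot in the spirit of Figure 1: koi_duration vs. koi_depth per class.
for label in ["CANDIDATE", "FALSE POSITIVE"]:
    subset = df[df["koi_disposition"] == label]
    plt.scatter(subset["koi_duration"], subset["koi_depth"],
                s=5, alpha=0.4, label=label)

plt.xlabel("koi_duration (hours)")
plt.ylabel("koi_depth (ppm)")
plt.yscale("log")  # assumed: transit depths span several orders of magnitude
plt.legend()
plt.show()
```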
The first part involved feature selection, where we remove redundant features. Out of the initial 49 feature variables, 18 were removed, reducing the dimensionality of the data. The features were removed manually after calculating the Pearson correlation between the features; after calculating the correlation, the dependent features were removed from the dataset. It must be noted that the features were selected manually because the number of features is small enough, and manual selection also helps us to understand the data at a higher level. (The χ² method for selecting the best features could not be used because the dataset contains negative values. This could have been addressed by normalizing the features between 0 and 1, but that was out of the scope of the experiment.) The scores for the top features are given in Figure 2.

Figure 2: Feature scores for the top 30 features

The second part is to preprocess the dataset. From an initial examination, a considerable number of samples contained null values. One could simply drop such samples from the dataset while training the model; however, that approach is only valid when the number of samples containing null values is extremely small, which ensures that there is no significant loss of information. Instead, the iterative imputer was used, which estimates p(x | y), where x is the missing value of a feature and y contains the remaining features that have values. (IterativeImputer is an experimental feature in the scikit-learn package and must be used with caution: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html.)

After the dataset has been preprocessed, it has to be divided into training and testing data. For the purposes of the experiment, the dataset was divided in an 80-20 split, where 80% of the data is used for training and 20% for testing, giving 7651 samples for training and 1913 samples for testing. Cross-validation [3] is used with k=5, and the validation set for the cross-validation is a part of the training set. The test set is never used for any form of training, as this might bias the results and lead to overfitting; no sample from the testing set is included in the training process in any form. This is ensured by implementing cross-validation after the training-test split. A sketch of this preprocessing pipeline is given after the following list. The algorithms used for modelling the dataset are:

1. Logistic Regression
2. Decision Tree
3. Artificial Neural Network
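As a sketch of the preprocessing steps just described, the snippet below assumes scikit-learn and the df from the earlier snippets. Note that it fits the imputer on the full dataset to mirror the text; fitting it on the training split alone would avoid any leakage into the test set.

```python
# enable_iterative_imputer must be imported first: IterativeImputer is experimental.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

# Numeric features only; the label is encoded as 1 = CANDIDATE, 0 = FALSE POSITIVE.
X = df.drop(columns=["koi_disposition"]).select_dtypes("number")
y = (df["koi_disposition"] == "CANDIDATE").astype(int)

# Each feature with missing entries is modelled from the remaining features.
X_imputed = IterativeImputer(random_state=0).fit_transform(X)

# 80-20 split; cross-validation later operates on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=0, stratify=y)
print(X_train.shape, X_test.shape)  # approximately 7651 and 1913 rows
```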
Logistic Regression

Logistic regression is one of the most basic models used in machine learning for classification. The term logistic refers to the logistic function, which normalizes the output between the values 0 and 1. This allows us to set a threshold (usually 0.5) above which the sample is classified as one class (usually class one) and below which it is classified as the other class (usually class zero). In other words, logistic regression tries to maximize the posterior class probability. The logistic function is the sigmoid function, given as:

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)

In fact, a neural network also uses a sigmoid function in the output layer in the case of binary classification, as we will examine later. The meta-parameter under consideration for logistic regression was the inverse regularization parameter, for which the values were in the set { }.

Decision Tree

The initial choice was to use a support vector machine for the problem. However, we found that the decision tree classifier provides performance similar to that of the support vector machine. The initial choice of SVM was motivated by the fact that it is robust to outliers, but the decision tree is also robust to outliers. For this dataset, the SVM took much longer to train than the decision tree while performing similarly; therefore, the decision tree was chosen as one of the algorithms for this experiment. The decision tree is also a simpler model than the support vector machine in terms of intuition and understanding. A high-level understanding of decision trees can be found here, and a more in-depth treatment is provided here. For the decision tree algorithm, the meta-parameter under consideration was the depth of the tree [4]. The values under observation for the tree depth ∈ Z+ and lie in the set { }. It was also observed that higher values of depth did not yield any improvement in the performance of the decision tree model.

Artificial Neural Network

Neural networks are variations of generalized linear models that learn a chain of hidden representations with the help of non-linear activations. For the purpose of the experiment, a two-layered neural network was used, with 10 hidden units in each hidden layer and ReLU activation in the hidden layers. The rectified linear unit was used because it is fast in terms of the training process and intuitive. The last layer had a sigmoid activation, and the loss function used was binary cross-entropy (the log loss), which is the same loss function used for logistic regression. Ten hidden neurons were used to provide a reasonably large function space, and two layers were used to learn the representations. Building on this, the number of layers could be increased and a comparison done on the basis of a chosen metric; from initial analysis, with two layers, the performance on the training set and the test set was fairly similar. Stochastic gradient descent was used with Nesterov Accelerated Gradient [5] and a momentum value of 0.9; this setting was used to accelerate the learning process. Other optimization procedures could also be used for the training process and a comparative analysis carried out.
Training used 100 epochs with a batch size of 10 samples. The meta-parameter under consideration is the learning rate, and the values under observation are:

1. 0.001
2. 0.01
3. 0.1
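The text does not name the framework used to build the network; the following is a minimal sketch of the described architecture, assuming Keras (tf.keras) and the X_train and y_train arrays from the preprocessing sketch. The reported results use feature-scaled inputs, which are sketched in a later snippet.

```python
from tensorflow import keras

# Two hidden layers of 10 ReLU units and a sigmoid output, trained with
# binary cross-entropy and SGD with Nesterov momentum, as described above.
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=0)
```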
Statistical Significance Tests

Statistical significance tests are used to compare any two models in terms of measures of accuracy or error. The paired t-test could be used to compare the models under observation, and is given by:

t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \qquad (2)

where \bar{x} is the sample mean, \mu is the population mean, s is the sample standard deviation, and n is the sample size. However, the t-test will not be used for this experiment because, due to cross-validation, the assumption that the samples are i.i.d. is violated: the estimated scores are now dependent. K-fold cross-validation leads to optimistic scores and results in a higher type-1 error [6].

Since the t-test is unsuitable, we use McNemar's test [7] to compare the models under observation. Some reasons for using McNemar's test are as follows: it is a distribution-free test, and it is suitable for binary classifiers with cross-validation. It must be noted that McNemar's test does not say which of the two models is better; it says whether the two models disagree in the same way or not. McNemar's test uses a contingency table to quantify the disagreement between the pair of algorithms being compared. A sample contingency table is given in Table 1.

                         Classifier 2 correct   Classifier 2 incorrect
Classifier 1 correct             a                       b
Classifier 1 incorrect           c                       d

Table 1: Contingency table

The McNemar's test statistic is given as:

\chi^2 = \frac{(b - c)^2}{b + c} \qquad (3)

which gives the value of the statistic, from which the p-value can be retrieved. As is evident, this test measures the disagreement between the two algorithms, i.e., how the algorithms disagree on the data. We use McNemar's test for three pairs of algorithms; it will be applied when evaluating the models in the analysis section.

Preliminary Results

For each of the algorithms, some preliminary results were calculated, which are given below. For each algorithm, GridSearch [8] with k-fold (k=5) cross-validation was used. For each algorithm, the ROC curve is provided, which tells us how good the model is at classification. This is important because we do not just want our model to achieve a high accuracy; the quality of the model should also be good in terms of the numbers of true positives and false positives. In addition to the ROC curve, the precision-recall curve is also provided, because it helps us to understand the performance of the model in terms of the numbers of correct predictions, incorrect predictions, and total predictions. Even though the dataset under consideration has almost balanced classes, the precision-recall curve can tell us about the relevant predictions returned out of the total number of predictions.
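As a concrete example of the grid search procedure, the following sketch assumes scikit-learn and the arrays from the preprocessing sketch. The grid of C values is hypothetical (the paper's candidate set was lost in extraction; only the reported best value of 0.1 is known), while the iteration and tolerance settings follow the next subsection.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Scale the features; the text reports a large accuracy gain from scaling.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

# 5-fold grid search over the inverse regularization strength C.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, tol=1e-5),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid; the paper's set was lost
    cv=5,
)
grid.fit(X_train_s, y_train)
print(grid.best_params_, grid.best_score_)  # the reported best C is 0.1
```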
Logistic Regression

With k-fold cross-validation (k=5), 1000 iterations, and a tolerance level of 0.00001, the best inverse regularization parameter [9] obtained was 0.1. The accuracy obtained with L2 regularization was 95.72% (with a standard deviation of 1.325%) without feature scaling (for the mentioned number of iterations and tolerance level), and 98.39% (with a standard deviation of 0.55%) with feature scaling. The difference arises because, without scaling, the gradient takes longer to converge, as the contours of the features are not symmetrical. The rest of the algorithms will therefore be trained and tested on the scaled dataset.

Now that we have looked at the accuracy, let us examine how good the model is at separating the two classes. To understand this, we look at the ROC curve and the area under the receiver operating characteristic curve.

Figure 3: ROC curve for Logistic Regression

From Figure 3, it is clear that the model can classify between the two classes very well. Correspondingly, the area under the receiver operating characteristic curve achieved a score of 99.41%. The precision-recall curve for logistic regression is given in Figure 4.

Figure 4: Precision-Recall curve for Logistic Regression
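The ROC curve and its area can be computed as follows; this is a minimal sketch continuing from the grid search snippet above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ROC curve and AUC on the held-out test set (cf. Figure 3).
probs = grid.predict_proba(scaler.transform(X_test))[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))  # the text reports roughly 0.9941

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```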
Decision Tree

The decision tree algorithm was also used with GridSearch to determine the best parameters. The meta-parameter under consideration was max_depth, and the best value was found to be 6 when using k-fold cross-validation (k=5). When using this best max_depth of 6, an accuracy of 98.53% (with a standard deviation of ±0.35%) was obtained on the feature-scaled dataset. The receiver operating characteristic curve is shown in Figure 5.

Figure 5: ROC curve for Decision Tree

The area under the ROC curve achieved a score of 99.39%, almost identical to that of logistic regression. The precision-recall curve for the decision tree is given in Figure 6.

Figure 6: Precision-Recall curve for Decision Tree
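A minimal sketch of fitting the selected tree, continuing from the earlier snippets:

```python
from sklearn.tree import DecisionTreeClassifier

# Fit the tree at the depth selected by the grid search (max_depth = 6).
tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X_train_s, y_train)
print("Test accuracy:", tree.score(scaler.transform(X_test), y_test))
```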
Neural Network

As mentioned before, the neural network contained two hidden layers with rectified-linear-unit activations. Using GridSearch, the best value of the learning rate was found to be 0.001, with a batch size of 10 and 100 epochs. The average accuracy was 98.33% (with a standard deviation of 0.53%) over 10 runs for this setting. The neural network used a momentum of 0.9 with Nesterov Accelerated Gradient to decrease the learning time. In addition, the Adam optimizer was also used to train the network, which gave a mean accuracy of 98.37% ± 0.6%. Since there is no significant difference in the accuracy estimates for the two optimizers, we select Nesterov Accelerated Gradient as the optimizer of choice on the basis of its lower standard deviation. Further analysis could compare different optimizers and their effects on prediction performance, but such analysis is out of scope for this experiment. The receiver operating characteristic curve for the neural network is shown in Figure 7.

Figure 7: ROC curve for Neural Network

The area under the ROC curve achieved a score of 99.49%, slightly higher than the other models in the experiment. The precision-recall curve for the neural network is given in Figure 8.

Figure 8: Precision-Recall curve for Neural Network
Analysis

All three algorithms will be analyzed in terms of the following:

1. Execution time
2. Statistical significance test
3. Prediction performance
Execution Time

Logistic regression took an average of 5.39 seconds to train on the non-scaled dataset and an average of 0.06 seconds on the scaled dataset. This shows that the training process is expedited when the dataset is scaled. The behavior is observed because the contours of the features are of different scales, so the gradient takes uneven steps to reach the minimum and thus takes longer in terms of execution time. This result also suggests that we should always scale our dataset when training complex models like neural networks, because they inherently have longer execution times.

The decision tree provides interesting results compared to logistic regression because of the selection of meta-parameters. Using GridSearch with the scaled dataset, we found that the optimal depth of the decision tree is six. Since the depth of the tree is fixed, the time required for the training process will be the same (on average 0.08 seconds for a max_depth of 6), but the accuracy on the scaled and non-scaled datasets will differ. It is also worth reporting that when using cross-validation with the scaled dataset, the best max_depth was found to be six, but with the non-scaled dataset it was 18. For memory and execution-time considerations, a lower tree depth is beneficial for the training process.

For the two-hidden-layer neural network, the training time on the scaled dataset was about 326.33 seconds on average (with a batch size of 10, momentum of 0.9, and Nesterov Accelerated Gradient). Increasing the size of the neural network did not considerably increase the performance of the model, so the two-layer network was retained. Training a neural network with a single hidden layer resulted in a lower average training time, but at the cost of a minor drop in accuracy (about 1%). In addition, the Adam optimizer was also used, as reported in the preliminary results.
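The timing comparison can be reproduced roughly as follows. This is a sketch using wall-clock time, continuing from the earlier snippets; absolute numbers will vary with hardware.

```python
import time
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Rough wall-clock comparison of training times on the scaled training data.
for name, estimator in [
    ("logistic regression", LogisticRegression(max_iter=1000, tol=1e-5, C=0.1)),
    ("decision tree", DecisionTreeClassifier(max_depth=6, random_state=0)),
]:
    start = time.perf_counter()
    estimator.fit(X_train_s, y_train)
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```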
Statistical Significance Test

Following the explanation in the experimental design section, we use McNemar's test to compare the algorithms for the binary classification problem in this experiment. It must be noted that McNemar's test does not say which algorithm is better than the other, but whether the two algorithms under consideration disagree in the same manner or not. This is important because there is no significant difference in the accuracy of the models on the test data over multiple runs. The null hypotheses for the three pairs of algorithms are given below. The value of α is set to 0.05, meaning we accept a 5% probability of incorrectly rejecting the null hypothesis. It must also be noted that if the sum of disagreements is less than 25, the binomial distribution is used, which is the default behaviour for this experiment; if the sum is equal to or above 25, the chi-squared distribution is used instead.

1. Logistic Regression = Decision Tree
2. Decision Tree = Neural Network
3. Logistic Regression = Neural Network

The alternative hypothesis for each pair is that there is a statistically significant difference between the two algorithms:

1. Logistic Regression ≠ Decision Tree
2. Decision Tree ≠ Neural Network
3. Logistic Regression ≠ Neural Network

Which algorithms differ must be inferred from the statistical significance test.

• Logistic Regression and Decision Tree: The p-value obtained from the test is 0.012 on average, which is less than α = 0.05. Therefore, the null hypothesis is rejected and there is a significant difference in the disagreement between the two algorithms. Looking at the contingency table, logistic regression gets more predictions incorrect where the decision tree gets them correct than the reverse. This suggests that the decision tree performs slightly better than logistic regression; in any case, the differences in the disagreements are significant, and therefore the two are statistically different.

• Decision Tree and Neural Network: The p-value obtained from McNemar's test is 0.0044, which is lower than α = 0.05. Therefore, there is a significant difference between the decision tree and the neural network. Looking at the contingency matrix, we observe that 21 times the decision tree predicted correctly when the neural network predicted incorrectly, while only 3 times did the decision tree predict incorrectly when the neural network predicted correctly. Again, this indicates that the decision tree performs better for the experiment under consideration.

• Logistic Regression and Neural Network: The p-value obtained was 0.015 when considering a neural network trained with a batch size of 10, and 0.5 when trained with a batch size of 100. With a batch size of 10, the disagreement was one-sided: the neural network predicted incorrectly 7 times when logistic regression predicted correctly, and there were no instances of the reverse, so logistic regression performed slightly better. With a batch size of 100, the p-value of 0.5 exceeds α = 0.05, which says that there is no significant difference. This agrees with the contingency table of logistic regression and the neural network (trained with a batch size of 100), which shows no meaningful difference between the numbers of incorrect predictions for the two models.

In order to save resources, time, and capital, the decision tree algorithm should be considered when classifying Kepler objects of interest as either candidate exoplanets or false positives. The results of McNemar's test are summarized in Table 2 (with the neural network trained with a batch size of 100).

Algorithm 1           Algorithm 2      P-Value
Logistic Regression   Decision Tree    0.012
Decision Tree         Neural Network   0.0044
Logistic Regression   Neural Network   0.5

Table 2: P-values for the algorithm pairs

From the values in Table 2, we can infer that the third null hypothesis cannot be rejected: there is no statistically significant difference between the performance of logistic regression and the neural network. Again, it is emphasized that McNemar's test does not state which algorithm is better, but to what extent the algorithms disagree with each other. The decision to reject or retain each null hypothesis is given in Table 3. A sketch of the test procedure follows the tables.

Algorithm 1           Algorithm 2      Reject Null Hypothesis?
Logistic Regression   Decision Tree    Yes
Decision Tree         Neural Network   Yes
Logistic Regression   Neural Network   No

Table 3: Hypothesis decision results
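A sketch of the test procedure is given below, assuming statsmodels (whose mcnemar function implements both the exact binomial and the chi-squared forms) and the fitted models from the earlier snippets; the text does not state which implementation was actually used.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Per-sample correctness of the two classifiers on the held-out test set.
y_true = np.asarray(y_test)
correct_a = grid.predict(scaler.transform(X_test)) == y_true   # logistic regression
correct_b = tree.predict(scaler.transform(X_test)) == y_true   # decision tree

# 2x2 contingency table laid out as in Table 1 (rows: classifier 1, cols: classifier 2).
table = np.array([
    [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])

# Exact binomial form when the disagreements b + c are below 25, chi-squared
# otherwise, matching the rule described in the text.
exact = table[0, 1] + table[1, 0] < 25
result = mcnemar(table, exact=exact, correction=True)
print("statistic:", result.statistic, "p-value:", result.pvalue)
```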
Prediction Performance

Instead of looking at the raw numbers of false positives and false negatives, a closer look at the precision-recall curve for each algorithm provides some good insights.

For logistic regression, the precision drops steeply once the recall goes above roughly 98%. This means that there is a low chance that the predictions of logistic regression will remain relevant when the recall is pushed higher. In other words, an increase in recall results in some samples being predicted incorrectly; for example, some samples are predicted to belong to class 0 when the correct label is class 1. This is not desirable, because classifying a Kepler object of interest correctly matters: only when the body is classified as a candidate planet should the space organization invest time and funds in further investigation.

For the decision tree algorithm, the precision-recall curve shows that precision drops steeply only when the recall reaches almost 100%. This shows that the observations are predicted almost 100% correctly and that the samples predicted correctly are relevant; in other words, the samples predicted to be in class 1 have an extremely low chance of being in class 0, and vice versa. This result is better than that of logistic regression, so between the two, the decision tree would be the choice of algorithm for this dataset.

For the neural network trained with a batch size of 10, the accuracy is comparable to the other models. However, the precision-recall curve shows that the precision drops sharply once the recall goes just over 97.5%. The loss in relevant predictions is small, but the difference is noticeable when compared to the decision tree. Again, the observation from the precision-recall curve for the neural network agrees with the conclusion that the decision tree is a good algorithm to choose, given the scaled dataset and the meta-parameters selected in this experiment via cross-validation.

It is surprising that the neural network performs slightly worse than the decision tree and logistic regression in terms of precision and recall even when its accuracy is on the higher side. Therefore, considering only accuracy as the metric for algorithm selection is insufficient when there is no significant difference over multiple runs. From the results above, the decision tree outperforms logistic regression and the two-layered neural network in terms of predicting both accurately and relevantly.
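The precision-recall curves discussed here can be computed as in the following sketch, continuing from the earlier snippets.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Precision-recall curves for two of the models (cf. Figures 4 and 6).
X_test_s = scaler.transform(X_test)
curves = {
    "logistic regression": grid.predict_proba(X_test_s)[:, 1],
    "decision tree": tree.predict_proba(X_test_s)[:, 1],
}
for name, scores in curves.items():
    precision, recall, _ = precision_recall_curve(y_test, scores)
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```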
Conclusion
The project conducted an experiment on a dataset containing information about Kepler objects of interest, with features observed by the Kepler satellite. Using this dataset, basic exploratory data analysis helped to visualize the data and gain a high-level understanding of it. The selection of the machine learning models was carried out based on existing work on binary classification, with the expectation that the selected models would perform comparably to other experiments under similar conditions. The models were run with cross-validation to select the best meta-parameters for each model. The accuracy of each algorithm was evaluated against the others as a basic comparison technique. In addition, the receiver operating characteristic and precision-recall curves were plotted to understand, respectively, the diagnostic ability of each model and the number of relevant results returned during prediction. The algorithms were also compared using McNemar's test, which led to rejecting the null hypothesis on two occasions and retaining it on one. Ultimately, from the statistical significance tests, the accuracy, and the execution time, it can be concluded that the decision tree with the selected meta-parameter (max_depth=6) was the model that performed best.
As an extension to the experiment, it would be beneficial to study the effects of using various optimization algorithms for the classification process. Especially for the neural network, it would be interesting to study the effects of choosing various optimization algorithms while increasing the number of layers, and their effect on performance. Other parameters critical to the classification of astronomical objects can also be tested. For example, feature engineering techniques can be used to create new features that help learn other attributes from the data and thus improve the model's performance. Engaging in such studies will help to build and use sophisticated models for analyzing various astronomical objects.
References

[1] Kepler Exoplanet Search Results. [Online]. Available: https://kaggle.com/nasa/kepler-exoplanet-search-results
[2] NASA Exoplanet Archive, Kepler candidate columns. [Online]. Available: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
[3] Cross-validation, scikit-learn documentation. [Online]. Available: https://scikit-learn.org/stable/modules/cross_validation.html
[4] Decision trees, scikit-learn 0.21.3 documentation. [Online]. Available: https://scikit-learn.org/stable/modules/tree.html
[5] "What's the difference between momentum based gradient descent and Nesterov's accelerated gradient descent?", Cross Validated. [Online]. Available: https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc
[6] J. Brownlee, Statistical Significance Tests for Comparing Machine Learning Algorithms, Jun. 2018. [Online]. Available: https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/
[7] J. Brownlee, How to Calculate McNemar's Test to Compare Two Machine Learning Classifiers, Jul. 2018. [Online]. Available: https://machinelearningmastery.com/mcnemars-test-for-machine-learning/
[8] Tuning the hyper-parameters of an estimator, scikit-learn 0.21.3 documentation. [Online]. Available: https://scikit-learn.org/stable/modules/grid_search.html
[9] sklearn.linear_model.LogisticRegression, scikit-learn 0.21.3 documentation. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html