Machine learning methods for constructing probabilistic Fermi-LAT catalogs
A. Bhat⋆ and D. Malyshev⋆⋆

Erlangen Centre for Astroparticle Physics, Erwin-Rommel-Str. 1, Erlangen, Germany

February 16, 2021
ABSTRACT
Context.
Classification of sources is one of the most important tasks in astronomy. Sources detected in one wavelength band, for example in gamma rays, may have several possible associations in other wavebands, or there may be no plausible association candidates.
Aims.
In this work, we aim to determine a probabilistic classification of unassociated sources in the third and the fourth Fermi Large Area Telescope (LAT) point source catalogs (3FGL and 4FGL-DR2) into two classes (pulsars and active galactic nuclei (AGNs)) or three classes (pulsars, AGNs, and other sources).
Methods.
We use several machine learning (ML) methods to determine a probabilistic classification of Fermi-LAT sources. We evaluate the dependence of the results on the meta-parameters of the ML methods, such as the maximal depth of the trees in tree-based classification methods and the number of neurons in neural networks.
Results.
We determine the probabilistic classification of both associated and unassociated sources in the 3FGL and 4FGL-DR2 catalogs. We cross-check the accuracy by comparing the predicted classes of unassociated sources in 3FGL that have associations in 4FGL-DR2. We find that in the 2-class case it is important to correct for the presence of other sources among the unassociated ones in order to realistically estimate the number of pulsars and AGNs. In particular, the estimated number of pulsars in the 3FGL (4FGL-DR2) catalog is 270 (481) in the 2-class case without corrections for the other sources and 158 (214) in the 3-class case. Provided that the number of associated pulsars is 167 (271) in the 3FGL (4FGL-DR2) catalog, the number of pulsars among the unassociated sources is expected to be similar to or larger than the number of associated ones.
Key words.
Methods: statistical – Catalogs
⋆ e-mail: [email protected]
⋆⋆ on leave of absence from NRC “Kurchatov Institute” - ITEP, B. Cheremushkinskaya st. 25, Moscow, Russia 117218, e-mail: [email protected]
1. Introduction
Multi-wavelength association of astronomical sources is important for understanding their nature. Unfortunately, in many cases a firm association of sources detected at different wavelengths is not possible. For example, about one third of gamma-ray sources in Fermi Large Area Telescope (LAT) catalogs are unassociated (Abdo et al. 2010a; Nolan et al. 2012; Acero et al. 2015; Abdollahi et al. 2020). It is useful to know at least to which class the unassociated sources belong or, which is more typical, what the probabilities are for a source to belong to various classes. In this paper we use several machine learning (ML) algorithms to find a probabilistic classification of unassociated sources in the Fermi-LAT third (3FGL, Acero et al. 2015) and fourth data release 2 (4FGL-DR2, Abdollahi et al. 2020; Ballet et al. 2020) catalogs. We used versions gll_psc_v16 for 3FGL and gll_psc_v27 for 4FGL-DR2.

We will refer to catalogs where the classification of the sources is given by probabilities as probabilistic catalogs. In general, the classes may include the possibility that a source is not a real source but a fluctuation of the background, or that a source is an overlay of two sources, etc.

Probabilistic catalogs were previously introduced for optical sources (e.g., Hogg & Lang 2010; Brewer et al. 2013) and for gamma-ray sources (Daylan et al. 2017). Bayesian association probabilities were also included in the 4FGL (Abdollahi et al. 2020) and 4FGL-DR2 (Ballet et al. 2020) catalogs for faint sources. Probabilistic classification of unassociated
Fermi-LAT sources was performed by, e.g., Ackermann et al. (2012); Saz Parkinson et al. (2016); Mirabal et al. (2016); Lefaucheur & Pita (2017); Luo et al. (2020); Zhu et al. (2021), or in application to the sub-classification of blazars by Hassan et al. (2013); Doert & Errando (2014); Chiaro et al. (2016); Salvetti et al. (2017); Kovačević et al. (2019, 2020), and to the sub-classification of pulsars by Lee et al. (2012); Saz Parkinson et al. (2016). In this work, we consider the classification of gamma-ray sources into two classes (AGNs and pulsars) as well as into three classes (AGNs, pulsars, and other associated sources). We revisit the probabilistic classification of 3FGL sources and compare the results of the classification of unassociated sources with the associations in 4FGL-DR2. We also determine a probabilistic classification of the 4FGL-DR2 sources.

Catalogs of gamma-ray point sources are typically designed to have a low false detection rate. Nevertheless, 469 sources out of 3033 in the 3FGL catalog (Acero et al. 2015) have no counterparts in the 4FGL catalog (Abdollahi et al. 2020). This is much larger than the expected false detection rate in 3FGL arising from statistical fluctuations. For the majority of sources in 3FGL without counterparts in 4FGL the problem is not false detection, but rather association. For example, some sources can be detected due to deficiencies in the Galactic diffuse emission model. In this case, the statistical significance of the detection is high, but the association is wrong: the sources should be classified as a part of the Galactic diffuse emission rather than as point-like sources. Another reason could be that two (or more) point-like sources in 3FGL are associated to a single extended source in 4FGL, or a single source is resolved into two sources. Again, this is a problem of classification (or association) rather than false detection.

Another reason for the absence of a previously detected source in a new catalog is variability. In particular, flat spectrum radio quasars (FSRQs) are highly variable AGNs. If a source was active during the observation time of 3FGL but inactive afterwards, then its significance in the 4FGL can be below the detection threshold. This problem is connected with the choice of a hard detection threshold of TS = 25 for the 3FGL and 4FGL catalogs. Selecting a lower detection threshold could help to keep the variable sources inside the catalog, but it would not solve the problem, since variable sources near the lower threshold can also disappear in the new catalog. Moreover, a lower threshold would lead to more false detections due to fluctuations of the background. Thus, on the one hand, a lower threshold can be useful in studies where a more complete list of sources is desirable while a higher false detection rate is admissible. On the other hand, a lower threshold can be problematic for studies where a clean sample is necessary. The problem of the detection threshold selection can be ameliorated with the development of a probabilistic catalog. In such a catalog, each point-like object detected above a certain relatively low confidence level is probabilistically classified into classes, which include a statistical fluctuations class. At low confidence, the probability for a source to come from a background fluctuation is high. This probability decreases as the significance of the source increases. Apart from the statistical fluctuations class, the classes can include various types of Galactic and extragalactic sources, diffuse emission deficiencies, and extended sources. A user of such a catalog will have the freedom to choose the probability threshold for the class that he or she is interested in. In this paper we make a first step in this direction by providing a probabilistic classification of
Fermi-LAT sources into two or three classes. We also show how the probabilistic catalogs can be used for population studies of sources, e.g., as a function of their flux or position on the sky, where one includes not only associated sources but also unassociated ones according to their class probabilities.
The paper is organized as follows. In Section 2 we discuss general questions of constructing the probabilistic catalogs and the choice of the ML methods. In Section 3 we construct the classification algorithms using the associated sources in the 3FGL catalog for training. We consider several aspects: 1) feature selection, 2) training of the algorithms and selection of the meta-parameters, 3) oversampling of the datasets in order to have equal numbers of pulsars and AGNs in training (there are many more AGNs observed than pulsars). In Section 4 we apply the classification algorithms determined in Section 3 to the classification of 3FGL and 4FGL-DR2 sources. We compare the predictions for the unassociated sources in 3FGL with the associations in 4FGL-DR2. We also retrain the algorithms using associated 4FGL-DR2 sources and construct a probabilistic catalog based on 4FGL-DR2. In Section 5 we classify sources in the 3FGL and 4FGL-DR2 catalogs into three classes (AGNs, pulsars, and other sources). In Section 6 we show applications of the probabilistic catalogs for predicting the number of pulsars, AGNs, and other sources among the unassociated sources and in the construction of the source counts as a function of their flux, N(S), and as a function of Galactic latitude and longitude, N(b) and N(ℓ). We compare the N(S), N(b), and N(ℓ) distributions for associated and unassociated sources in the 3FGL and 4FGL-DR2 catalogs. In Section 7 we present conclusions.
2. Choice of methods
The first choice one has to make in constructing a probabilistic catalog is the input data and the machine learning methods. For the input data we take the associated point sources (PS) in the 3FGL catalog, which we split into training and testing subsets. We consider four machine learning algorithms: random forests (RF, Ho 1998; Breiman 2001), boosted decision trees (BDT, Friedman 2001a), logistic regression (LR, Cox 1958), and neural networks (NN, Hopfield 1982). Although the performance of the algorithms on testing data is slightly different, we report the classification probabilities for all four algorithms. The difference among the predictions will serve as a measure of the modeling uncertainty related to the choice of the classification algorithm.
One of the simplest and most transparent algorithms for classification is decision trees. In this algorithm, at each step the sample is split into two subsets using one of the input features. The choice of the feature and of the separating value are determined by minimizing an objective function, such as the misclassification error, Gini index, or cross-entropy. This method is very intuitive, since at each step the results can be described in words. For example, at the first step, the sources can be split into mostly Galactic and extragalactic ones by a cut on the Galactic latitude. At the next step, the high-latitude sources can be further subsplit into millisecond pulsars and other sources by a cut on the spectral index around 1 GeV (pulsars have a hard spectrum below a few GeV), etc. One of the problems with decision trees is either overfitting or bias: if a tree is too deep, then it will pick up particular cases of the training sample, resulting in overfitting, while too shallow trees cannot describe the data well, which can lead to bias. As a result, one needs to be very careful in selecting the depth of the tree. This problem can be avoided if a random subset of features is used to find a division at each node. This is the basis of the RF algorithm, where the final classification is given by an average of several trees with random subsets of features used at each node. Another problem with simple trees is that they can miss the classification of some subsets of data. In BDT algorithms, the final classification is given by a collection of trees, where each new tree is created by increasing the weights of misclassified samples of the previous step. Finally, simple trees predict classes for the data samples, while we would like to have probabilities of classes (also known as soft classification). RF and BDT algorithms, by virtue of averaging, provide probabilities. As a result, we will use RF and BDT algorithms rather than simple decision trees in this paper.

Tree-based algorithms, even after averaging in the RF and BDT methods, have sharp edges among domains with different probabilities. In the LR algorithm, the probabilities of classes are by construction smooth functions of the features. In particular, for two-class classification the probability of class 1, given the set of features $x$, is modeled by the sigmoid (logit) function

p(x) = \frac{e^{m(x)}}{1 + e^{m(x)}} .  (1)

The probability of class 0 is then modeled as $p_0(x) = 1 - p(x)$. If $m(x)$ is a linear function of the features, then the boundary between the domains, defined, e.g., as $p(x) = 0.5$, will be linear and given by $m(x) = 0$. More complicated boundaries can be modeled by taking non-linear functions $m(x)$. The unknown parameters of the function $m(x)$ are determined by maximizing the log likelihood of the model given the known classes of the data in the training sample. A useful feature of the LR method is that it, by construction, provides probabilities of classes with smooth transitions among the domains of different classes. A limitation is that the form of the probability function is fixed to the sigmoid function in Equation (1).

We notice that if $m(x)$ is a linear function of the features $x$, then the LR model is obtained by an application of the sigmoid function to a linear combination of the input features. This is in fact a single-layer perceptron, or a NN, with several input nodes (each node corresponds to a feature) and one output node, which corresponds to $p(x)$, but without any hidden layers.
The output value is obtained by a non-linear transformation (sigmoid) of a linear combination of the features. A neural network with several hidden layers is obtained by a sequence of non-linear transformations of linear combinations of features. In particular, the values in the first hidden layer are obtained by a non-linear transformation of linear combinations of the input features. Then the values in the second hidden layer are obtained by a non-linear transformation of linear combinations of the values in the first hidden layer, etc. In the context of neural networks, the non-linear transformations are called activation functions. If the activation function for the output layer is the sigmoid, then the output value (values) can be interpreted as probabilities.
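To make the relation between LR and a single-layer network concrete, the following minimal sketch (numpy only; all weights and feature values are illustrative placeholders, not fitted parameters) evaluates Equation (1) and a one-hidden-layer forward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_prob(x, f):
    # p(x) = e^{m(x)} / (1 + e^{m(x)}) with a linear m(x) = f_0 + f . x
    m = f[0] + np.dot(f[1:], x)
    return sigmoid(m)

def one_hidden_layer_prob(x, W, w0, v, v0, activation=np.tanh):
    # hidden layer: non-linear transformation of linear combinations of inputs
    y = activation(W @ x + w0)
    # output layer: sigmoid, so the result can be read as a class probability
    return sigmoid(v @ y + v0)

rng = np.random.default_rng(0)
x = rng.normal(size=11)                  # 11 input features, as in Section 3
print(logistic_regression_prob(x, rng.normal(size=12)))
print(one_hidden_layer_prob(x, rng.normal(size=(5, 11)),
                            rng.normal(size=5), rng.normal(size=5), 0.1))
```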
3. Construction of probabilistic catalogs
As an example of the construction of a probabilistic catalog, we use the 3FGL catalog. For training and testing the methods, we use sources which have associations and no missing values in the catalog. In this section we perform a two-class classification to separate PS into pulsars and AGNs. Thus for training and testing, we subselect the sources which are associated with pulsars and AGNs (the three-class classification into pulsars, AGNs, and other sources is discussed in Section 5). After training the algorithms, we test the performance with the test sources and predict the classes of unassociated sources that have all features present in the catalog table. The general workflow has the following steps:

1. Select data for training and testing.
2. Optimize the algorithms using the training datasets. We select meta-parameters of the algorithms by optimizing the accuracy of the classification and test for overfitting using the test datasets. In order to get stable results, we repeat the separation of the data into training and testing samples 100 times and average the accuracy.
3. Make predictions for the unassociated point sources of the 3FGL. We also apply the classification to associated sources, which we use for consistency checks.

As a result of the analysis in this section, we select meta-parameters for the four ML algorithms, which we use in the following section for the construction of probabilistic catalogs based on the Fermi-LAT 3FGL and 4FGL-DR2 catalogs.
We restrict the analysis to associated and unassociated sources without any missing or statistically insignificant values (e.g., none or infinity). We use the associated sources which were classified as either AGNs (classification labels in the 3FGL catalog: agn, FSRQ, fsrq, BLL, bll, BCU, bcu, RDG, rdg, NLSY1, nlsy1, ssrq, and sey) or pulsars (classification labels in 3FGL: PSR, psr), which results in a list of 1905 sources.

There are several tens of features of point sources quoted in the catalog, such as the position, the photon and energy fluxes integrated in different energy bands, spectral parameters, and the variability index, as well as the corresponding uncertainties. We took the main features, including those discussed in Saz Parkinson et al. (2016), and plotted their Pearson correlation coefficients. A graphical representation of the correlation table is shown in Figure 1. In the following, if two features are strongly correlated or anti-correlated, then we keep only one of the features for classification.
Fig. 1. The correlation matrix of features for 3FGL associated sources. The “500MeV_Index” is defined in Equation (2); all other features are taken directly from the 3FGL catalog.
The spectral index is one of the important characteristics of sources. Unfortunately, in the 3FGL catalog the definition of the spectral index is different for associated and unassociated sources. In particular, the gamma-ray flux of pulsars is described by a power law with a (super)exponential cutoff $\propto E^{-\Gamma} e^{-(E/E_c)^b}$, and the “Spectral_Index” feature in the catalog is the parameter $\Gamma$. The gamma-ray flux of unassociated sources with significant curvature is represented by a log-parabola function $\propto (E/E_0)^{-\alpha - \beta \ln(E/E_0)}$, where the “Spectral_Index” feature is the parameter $\alpha$, i.e., the tilt in the spectrum at the pivot energy $E_0$ (which also varies for different sources). Since the “Spectral_Index” feature has different definitions for associated pulsars and for possible pulsars among the unassociated sources, its use for training the algorithms to separate pulsars from AGNs is problematic. If one fits all spectra of sources in the catalog by a power-law function, then the corresponding indices of the power laws are represented by the “PowerLaw_Index” feature in the catalog. This feature is defined uniformly for all associated and unassociated sources, i.e., it is safe to use for training. Unfortunately, the power-law function is not a good description of the gamma-ray flux from pulsars. For the classification of the 3FGL sources we have constructed a new feature: the index at 500 MeV (denoted in the following as “500MeV_Index”), defined as minus the derivative of the log flux:

n(500\,\mathrm{MeV}) = - \left. \frac{d \ln F}{d \ln E} \right|_{E = 500\,\mathrm{MeV}}  (2)

For the log parabola and for the power law with a (super)exponential cutoff it is, respectively,

n(500\,\mathrm{MeV}) = \alpha + 2 \beta \ln(500\,\mathrm{MeV} / E_0),  (3)

n(500\,\mathrm{MeV}) = \Gamma + b \, (500\,\mathrm{MeV} / E_c)^b.  (4)

This feature has a more uniform definition for all sources in the 3FGL catalog than the Spectral_Index. It also has a better separating power than the PowerLaw_Index, given that pulsars typically have harder spectra at energies below 1 GeV than AGNs.

Taking into account the correlations among the features and the above discussion of the spectral index definition, we have selected the following eleven features for the classification of the 3FGL sources: Galactic latitude (GLAT), Galactic longitude (GLON), ln(Energy_Flux100), ln(Unc_Energy_Flux100), 500MeV_Index, ln(Signif_Curve), ln(Variability_Index), and the four hardness ratios $hr_{ij} = (EF_j - EF_i)/(EF_j + EF_i)$, where $EF_i$ is the energy flux in bin $i$. We have added the hardness ratios in order to allow the algorithms to “construct” their own features from the raw data rather than to use derived features, such as the spectral index or the integrated flux. The table of features and their statistics can be found in Appendix A.

The number of tunable parameters in the classification algorithms is not fixed a priori. Moreover, there is a certain freedom in the choice of the architecture of the algorithms, such as the number of hidden layers and the number of neurons in neural networks. In general, one starts with a simple model and increases the complexity (the number of tunable parameters) until the model can describe the data well but does not overfit it. The overfitting is tested by splitting the input data into training and testing samples. The training sample is used for optimizing the parameters, while the test sample is used to check that the model is not overtrained (for overtrained models the accuracy on the test sample is significantly worse than the performance on the training sample). We split the data randomly into 70% training and 30% testing samples.
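As an illustration, the derived features can be computed along the following lines. This is a minimal sketch assuming the catalog row is available as a dictionary-like object; the column names “SpectrumType”, “Spectral_Index”, “beta”, “Pivot_Energy”, “Cutoff”, and “Exp_Index” are assumptions that should be checked against the catalog FITS table, and the five-band energy fluxes are passed explicitly.

```python
import numpy as np

E_REF = 500.0  # MeV

def index_500mev(row):
    """n(500 MeV) = -dlnF/dlnE at E = 500 MeV, Equations (2)-(4)."""
    if row["SpectrumType"] == "LogParabola":
        # Eq. (3): alpha + 2 beta ln(500 MeV / E0), with E0 the pivot energy
        return (row["Spectral_Index"]
                + 2.0 * row["beta"] * np.log(E_REF / row["Pivot_Energy"]))
    # Eq. (4): Gamma + b (500 MeV / E_c)^b for a (super)exponential cutoff;
    # for a pure power law this branch reduces to Gamma if b = 0
    b = row["Exp_Index"]
    return row["Spectral_Index"] + b * (E_REF / row["Cutoff"]) ** b

def hardness_ratios(energy_fluxes):
    """hr_ij = (EF_j - EF_i) / (EF_j + EF_i) for adjacent energy bands."""
    ef = np.asarray(energy_fluxes, dtype=float)
    return (ef[1:] - ef[:-1]) / (ef[1:] + ef[:-1])

row = {"SpectrumType": "LogParabola", "Spectral_Index": 2.2,
       "beta": 0.1, "Pivot_Energy": 1000.0}
print(index_500mev(row), hardness_ratios([1.0, 1.5, 1.2, 0.8, 0.3]))
```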
The two main parameters characterizing the RF algorithm are the number of trees and the maximum depth allowed in the trees. We use the Gini index as the objective function for the optimization of parameters (the split values of features in the nodes).
Fig. 2. Test score (accuracy) of the RF classification as a function of the number of trees and the maximal depth of trees.
Figure 2 shows the accuracy on the test sample as a function of the maximum depth and the number of trees. The results for each point are averaged over 100 realizations of the split into training and testing samples. We notice that the accuracy does not decrease as the maximal depth of the trees increases, i.e., there is no overfitting as the complexity of the model increases with increasing maximum depth. This is due to the random choice of a subset of features at each node (the maximal number of allowed features is the square root of the total number of features).
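A sketch of this scan with scikit-learn could look as follows, where X and y are assumed to hold the 11 features and binary labels (1 = PSR) of the associated sources; the averaging over random splits mirrors the procedure described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_test_score(X, y, n_trees, max_depth, n_splits=100):
    """Mean and std of the test accuracy over random 70%/30% splits."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=seed)
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_depth=max_depth,
                                     criterion="gini", random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return np.mean(scores), np.std(scores)
```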
Feature                 | 20 trees, depth 2 | 50 trees, depth 6
GLON                    | 0.001             | 0.014
GLAT                    | 0.009             | 0.017
ln(Energy_Flux100)      | 0.073             | 0.071
ln(Unc_Energy_Flux100)  | …                 | …
ln(Signif_Curve)        | …                 | …
ln(Variability_Index)   | 0.051             | 0.098
500MeV_Index            | 0.065             | 0.065
HR12                    | 0.047             | 0.052
HR23                    | 0.032             | 0.062
HR34                    | 0.016             | 0.025
HR45                    | …                 | …

Table 1. Feature importances for two RF algorithms.

We have also used sin(GLAT) in place of GLAT to check that this is not due to scaling, i.e., to the large range of values of GLAT, but the importance is similar to that of GLAT itself. We further discuss the dependence on GLAT in Section 6.2, where we calculate the latitude and longitude profiles of the associated and unassociated source counts.

In order to illustrate the separation of PS into AGNs and pulsars, we retrain the RF algorithm using only two features, the log of the curvature significance and the log of the variability index, and plot the resulting probabilities of classes in Figure 3 for the model with 50 trees and a maximum depth of 6. The probabilities are averaged over 100 splits into training and testing samples. It is important to note that in this plot the model is trained on only two features. Nevertheless, a good testing accuracy of 97% is reached, which is similar to the accuracy of the RF classification with all 11 features. For the final classification with RF, we use 11 features and average over 1000 splits into training and testing samples.
Fig. 3. RF classification domains (50 trees, maximum depth 6, testing score 0.97) showing class probabilities for training with two features, Ln(Variability_Index) and Ln(Significant_Curvature), averaged over 100 random splits into training and testing samples. The color scale describes the probability for a source to be a pulsar.
The meta-parameters for the BDT algorithms are similar to those of the RF algorithms: the number of trees and the maximal depth. We used the Gradient Boosting algorithm for the construction of the BDT (Friedman 2001b). The classification is performed by a weighted average of trees, where the trees are constructed recursively in order to better address the misclassifications of the previous step. The dependence of the accuracy on the tree depth is shown in Figure 4. Unlike for the RF, which is also an ensemble-based method, the testing accuracy drops for maximal depths larger than 7.
Fig. 4. Dependence of the BDT accuracy on the maximum depth and the number of trees (20, 50, 100, 200).
The classification domains in the case of two features for 20 trees and a maximum depth of 2 are presented in Figure 5. For the classification we will use BDT with 100 trees and a maximum depth of 2.
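A minimal sketch of the adopted BDT configuration, using scikit-learn's gradient boosting (X_train and y_train are assumed to be the training features and labels):

```python
from sklearn.ensemble import GradientBoostingClassifier

# 100 trees with a maximum depth of 2, as adopted above
bdt = GradientBoostingClassifier(n_estimators=100, max_depth=2)
# bdt.fit(X_train, y_train)
# bdt.predict_proba(X)[:, 1] then gives the soft (PSR) class probability
```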
Fig. 5. Classification domains for BDT (20 trees, maximum depth 2, testing score 0.97) for training with two features, averaged over 100 splits into training and testing samples.
In the case of the NN, the number of free parameters depends on the number of hidden layers and on the number of neurons in the hidden layers. The final model accuracy also depends on the number of epochs that the network is allowed to be trained for and on the optimization algorithm.
Fig. 6. The NN architecture (input, hidden, and output layers) that we use in the construction of the probabilistic catalogs. The activation function in the output layer is the sigmoid $S(x) = e^x / (1 + e^x)$.

The general architecture of the NN that we use in this paper is shown in Figure 6. It is a fully connected NN with 11 input nodes (shown by red circles with input features $x_i$), one hidden layer (shown by blue circles), and an output layer (shown by the green circle). The hidden layer consists of several nodes with values $y_j$. For the activation function at the hidden layer we use either the hyperbolic tangent (tanh, shown on the plot) or the rectified linear unit (relu). The activation function for the output layer is the sigmoid, which we use to make sure that the output value can be interpreted as a class probability. The unknown parameters are the weights of the features in the hidden layer $w_{ji}$ and in the output layer $v_j$, including the offsets. The unknown parameters are optimized by minimizing a loss function, which we choose to be the cross-entropy

-\log L = -\sum_i \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right),

where $y_i = 0, 1$ are the true labels of the sources and $p_i$ are the predicted class probabilities. This is the same loss function as in the LR case below. We have also used a NN with two hidden layers, but the accuracy was similar to that of networks with one hidden layer (Appendix A). For the final classification model, we have chosen to use one hidden layer.
Fig. 7. Dependence of accuracy on the number of neurons for different NN models (300 epochs; LBFGS or Adam solver with tanh or relu activation).
The dependence of the testing accuracy on the number of neurons in the hidden layer, on the activation function, and on the optimization algorithm is shown in Figure 7. We compare two activation functions at the hidden layer (tanh and relu) and two optimization algorithms: limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS, Liu & Nocedal 1989) and the stochastic gradient descent algorithm Adam (Kingma & Ba 2014). We use 300 epochs for training. Around 11 neurons in the hidden layer appears to be an optimal choice, since increasing the number of neurons leads to no significant increase in accuracy for any of the models. The dependence on the number of epochs (the number of iterations in fitting) is presented in Figure 8. The accuracy increases with the number of epochs and saturates at around 200 epochs for LBFGS and 300 for Adam.
Fig. 8. Dependence of the testing accuracy on the number of epochs in training for different solvers and activation functions (11 neurons).
We illustrate the classification domains for a NN with two input features in Figure 9. In this case we also use only two neurons in the hidden layer. One can see that the separation boundary is smoother compared to the RF domains in Figure 3 or the BDT domains in Figure 5. For our final model we chose one hidden layer with eleven neurons, 300 training epochs, the LBFGS solver, and the tanh activation function at the hidden layer.
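The final NN configuration corresponds, for instance, to scikit-learn's MLPClassifier with the settings below; this is a sketch, with MLPClassifier minimizing the cross-entropy loss written above and predict_proba applying the output sigmoid.

```python
from sklearn.neural_network import MLPClassifier

# one hidden layer with 11 tanh neurons, LBFGS solver, 300 epochs
nn = MLPClassifier(hidden_layer_sizes=(11,), activation="tanh",
                   solver="lbfgs", max_iter=300)
# nn.fit(X_train, y_train); nn.predict_proba(X)[:, 1] is the PSR probability
```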
As we have discussed in Section 2.2, the probability to belong to class 1 or 0 in LR is represented by the sigmoid function $p_1(x) = 1 - p_0(x) = e^{m(x)}/(1 + e^{m(x)})$ (see Eq. (1)), where $m(x)$ is a function of the input features $x$. The complexity of the model is given by the number of parameters in $m(x)$. We have considered two cases for $m(x)$: a linear and a quadratic function of the input features $x$. A quadratic $m(x)$ resulted in a similar accuracy as a linear $m(x)$. Consequently, we have restricted our attention to linear functions $m(x) = \sum_{k=1}^{11} f_k x_k$. In Figure 10 we show the accuracy of the LR method as a function of the number of iterations for different solvers: LBFGS (Liu & Nocedal 1989), Stochastic Average Gradient (SAG, Schmidt et al. 2017), SAGA (a variant of SAG, Defazio et al. 2014), and liblinear (a special solver for LR and support vector machine classifications, Fan et al. 2008).
Fig. 9. NN classification domains for 2 input features (300 epochs, LBFGS solver, testing score 0.97), averaged over 100 random splits into training and testing samples. We use 2 neurons in the hidden layer, the tanh activation function, and the LBFGS solver.

As one can see from Figure 10, LBFGS and liblinear outperform the other two solvers and converge much faster. In order to illustrate the probability domains in LR, we show the classification with two features (LBFGS solver, 200 iterations) in Figure 11. The domains look similar to the domains in the NN case (Figure 9). For the final classification we will use the LBFGS solver with 200 iterations.
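In scikit-learn terms, the adopted LR model is a sketch like the following; the other solvers of Figure 10 are obtained by changing the solver argument.

```python
from sklearn.linear_model import LogisticRegression

# linear m(x) with the LBFGS solver and 200 iterations, as adopted above
lr = LogisticRegression(solver="lbfgs", max_iter=200)
# lr.fit(X_train, y_train); lr.predict_proba(X)[:, 1] gives the smooth
# sigmoid probability of Equation (1)
```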
Fig. 10. Dependence of the LR testing accuracy on the number of iterations for different solvers (LBFGS, liblinear, SAG, SAGA).
Fermi-LAT catalogs have many more AGNs than pulsars, i.e., the datasets are imbalanced. In the previous subsections we have optimized the overall accuracy. In this case, the algorithms try to identify AGNs rather than pulsars, since this gives a better accuracy. As a result, in the region of parameter space where both pulsars and AGNs are present, the algorithms will give a higher probability for a source to be an AGN. The problem of the classification of imbalanced datasets can be quantitatively described in terms of precision and recall.
Fig. 11. Classification domains for LR with two features (LBFGS solver, 200 iterations, testing score 0.97), averaged over 100 random splits into training and testing samples.

If we denote by TP the number of true positives (pulsars classified as pulsars), by FP the number of false positives (AGNs classified as pulsars), and by FN the number of false negatives (pulsars classified as AGNs), then precision = TP / (TP + FP) is a measure of how clean the prediction is, while recall = TP / (TP + FN) is a measure of how well the algorithm can detect pulsars, i.e., how complete the list of predicted pulsars is. If we reduce the pulsar domain by attributing uncertain sources predominantly to AGNs, then for pulsars the precision will increase, but the recall will decrease.
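The following minimal sketch shows the precision and recall computation, together with the intersection (“all algorithms”) and union (“any algorithm”) of the pulsar predictions; the labels and predictions here are toy placeholders.

```python
import numpy as np

def precision_recall(y_true, is_psr):
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    tp = np.sum(is_psr & (y_true == 1))
    return tp / max(is_psr.sum(), 1), tp / max((y_true == 1).sum(), 1)

# toy example: true labels and PSR predictions from four algorithms
y_true = np.array([1, 1, 0, 0, 1])
preds = [np.array([1, 0, 0, 0, 1], bool) for _ in range(4)]
all_psr = np.logical_and.reduce(preds)  # intersection of pulsar domains
any_psr = np.logical_or.reduce(preds)   # union of pulsar domains
print(precision_recall(y_true, all_psr), precision_recall(y_true, any_psr))
```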
Fig. 12.
Precision and recall for pulsars using all-algorithm and any-algorithm classification for unweighted training data and all-algorithm classification for oversampling of pulsars in the training data. For details see Section 3.3.
In Figure 12 we show the precision and recall for the detection of pulsars using the algorithms from the previous sections. In particular, in the first two lines (solid blue with squares and dashed orange with right triangles) a source is categorized as a pulsar if all four algorithms classify it as a pulsar, while in lines 3 and 4 (solid green with diamonds and dashed red with down triangles) a source is attributed to the PSR class if any of the algorithms classifies it as a pulsar. It is clear that for lines 1 and 2 the pulsar domain is smaller than for lines 3 and 4, since in the former case the domain is the intersection of the domains of the individual algorithms, while in the latter it is their union. For the all-algorithms classification the precision is 100% for most latitudes, while the recall is between 40% and 80%, i.e., the list of pulsars is generally clean but incomplete. In the case of the any-algorithm classification, the recall is increased by about 20% for most latitudes compared to the all-algorithms classification, but the precision drops by up to 20% at some latitudes, i.e., the completeness improves at the expense of the cleanliness of the sample. As an alternative to the any-algorithm classification, one can give larger weights to pulsars or oversample pulsars in the training process, i.e., use the same source several times, so that the numbers of pulsars and AGNs in training are the same. Given that in some applications it is beneficial to have a list of pulsar candidates among the unassociated sources that is as complete as possible, we have retrained the algorithms using oversampling with the same meta-parameters as in the previous sections; a sketch of the procedure is shown after Figure 13.

In general, one can either under- or oversample a dataset. Undersampling would reduce the number of AGNs to match the number of pulsars. However, since the total number of sources is not very high, we have chosen oversampling. For training with oversampling, we randomly copy existing pulsars and add them to the dataset until the numbers of pulsars and AGNs are the same. Although the pulsars in the training dataset are redundant, they help to increase the weight of pulsars in the classification model. We illustrate the oversampling procedure in the top panel of Figure 13: the number of times a source appears in training is shown by adding markers with shifts to the right and above the original position of the source (note that the shift is introduced for presentation only; the parameters of the sources are exactly the same as for the original source). In the bottom panel of Figure 13 we repeat Figure 11 in order to compare the classification domains with and without oversampling. One can see that the pulsar domain in the top panel is larger than the pulsar domain in the bottom panel. As a result, in the top panel more pulsars are classified as pulsars, but also more AGNs are falsely classified as pulsars in the intersection region. Since the overall number of AGNs is larger than the number of pulsars, the testing accuracy with oversampling is smaller than that without oversampling.

The results of training with oversampling are presented in Figure 12, lines 5 and 6 (solid purple with circles and dashed brown with stars). These lines show the precision and recall when a source is categorized as a pulsar if all four algorithms trained with oversampling classify it as a pulsar. The precision and recall in this case are similar to the any-algorithm classification for the training without oversampling.
Fig. 13. Top panel: LR classification domains (LBFGS solver, 200 iterations, testing score 0.92) showing class probabilities for training with oversampling. The oversampling is illustrated by repeating the pulsar markers with a shift: the number of markers is equal to the number of times the pulsar appears in training. Bottom panel: we repeat Figure 11 (testing score 0.97) for convenience of comparison with the oversampled training in the top panel. In both panels the domains are obtained by averaging over 100 random splits into training and testing samples.
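A sketch of the oversampling step, where X and y are assumed to be the training features and labels (1 = PSR, 0 = AGN): pulsars are duplicated at random until the two classes are balanced.

```python
import numpy as np

def oversample(X, y, rng=np.random.default_rng(0)):
    """Duplicate random pulsars until the PSR and AGN counts are equal."""
    n_agn, n_psr = np.sum(y == 0), np.sum(y == 1)
    psr_idx = np.where(y == 1)[0]
    extra = rng.choice(psr_idx, size=n_agn - n_psr, replace=True)
    X_os = np.vstack([X, X[extra]])
    y_os = np.concatenate([y, y[extra]])
    return X_os, y_os
```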
4. Probabilistic catalogs based on the 3FGL and 4FGL catalogs
In this section we use the ML algorithms optimized in the previous section to construct a probabilistic classification of sources in the 3FGL and 4FGL catalogs.
We use the following four algorithms for the classification of sources: RF with 50 trees and a maximal depth of 6, BDT with 100 trees and a maximal depth of 2, NN with 11 neurons, the LBFGS solver, and 300 epochs, and LR with the LBFGS solver and 200 iterations. For training we use the pulsars and AGNs from the 3FGL catalog. In addition to the original datasets, we perform oversampling of pulsars in order to balance the numbers of pulsars and AGNs. As a result, we have 8 classification methods: 4 algorithms trained with and without oversampling.
Algorithm | Parameters                    | Testing Accuracy | Std. Dev. | Comparison with 4FGL-DR2 Accuracy
RF        | 50 trees, max depth 6         | 97.37 | 0.60 | 91.06
RF_O      |                               | 97.90 | 0.50 | 89.40
BDT       | 100 trees, max depth 2        | 97.65 | 0.54 | 90.40
BDT_O     |                               | 97.79 | 0.51 | 91.72
NN        | 300 epochs, 11 neurons, LBFGS | 97.29 | 0.97 | 90.07
NN_O      |                               | 94.31 | 5.13 | 87.09
LR        | 200 iterations, LBFGS solver  | 97.63 | 0.54 | 90.40
LR_O      |                               | 93.68 | 0.99 | 85.10

Table 2. Testing accuracy of the 4 selected algorithms for the classification of 3FGL sources and comparison with associations in the 4FGL-DR2 catalog. “_O” denotes training with oversampling.

Fig. 14. Comparison of class predictions for unassociated 3FGL sources with classes in 4FGL-DR2, in the plane of Ln(Variability_Index) and Ln(Significant_Curvature). For more details see Section 4.1.
The selected algorithms are summarized in Table 2, where oversampling is denoted by “_O”. The average testing accuracy is computed by performing 1000 times the 70%/30% split into training and testing samples and averaging over the accuracies computed for the testing samples. In addition, we look at sources which are unassociated in 3FGL but have either a pulsar or an AGN association in 4FGL-DR2: there are 302 such sources. The accuracy of our prediction for the four selected algorithms with and without oversampling, taking the 4FGL-DR2 classes as the true values, is reported in the column “Comparison with 4FGL-DR2 Accuracy”. The correct classifications and misclassifications for the 302 sources with PSR or AGN associations in 4FGL-DR2 are also presented in Figure 14. The class at the beginning of the label name corresponds to the association in 4FGL-DR2, while the second half of the label corresponds to the classification of the unassociated sources in 3FGL. For example, “PSRs classified only as PSRs” shows sources which have a PSR association in 4FGL-DR2 and for which all eight methods classified the corresponding unassociated source in 3FGL as a pulsar. “PSRs classified as PSRs or AGNs” labels sources with PSR associations in 4FGL-DR2 for which the corresponding unassociated sources in 3FGL have both PSR and AGN classifications by different ML methods. The unassociated sources are classified as PSRs or AGNs if the corresponding probability is larger than 0.5. We notice that misclassified or partially misclassified sources in Figure 14 typically lie on the boundary between the two classes or even inside the opposite class. Many of these sources also have flags in the 3FGL catalog, such as a potential problem with the background diffuse emission model at the location of the source, which can lead to a poor reconstruction of the source spectrum and, consequently, to a misclassification of the source.

As a result of the classification with the eight ML methods, we created a probabilistic catalog based on the 3FGL sources without missing values. We train on 70% of the sources associated with pulsars or AGNs and save the probability values for the testing sources, for the sources which are not classified as pulsars or AGNs, and for the unassociated sources. We repeat the splitting and training 1000 times and report the sample average and standard deviation of the classification probabilities, i.e., we average over 1000 values for unassociated sources and sources not classified as AGNs or pulsars, while the average for AGNs and pulsars is over the number of times the sources appear in the testing sample, which is 300 on average. We have also subselected the 302 unassociated 3FGL sources which have PSR or AGN associations in 4FGL-DR2 and saved them, for convenience of comparison, as a separate file.

In the probabilistic catalogs we add columns with the corresponding probabilities for each algorithm and each class, i.e., given that there are 8 methods (including oversampling) and 2 classes, we add 16 columns: 8 for unweighted and 8 for oversampled training data. The columns with “_O” represent the oversampled probabilities. We also add 16 columns for the standard deviations of the probabilities. Although the class probabilities and standard deviations for each algorithm are not independent (the probabilities add up to 1 and the standard deviations are equal for the AGN and PSR classes), we keep the corresponding columns in view of the generalization to multi-class classification (e.g., the 3-class classification in Section 5).
There are thirteen sources with missing values in the 3FGL catalog (2 unassociated, 5 AGNs, 1 pulsar, and 5 “other” sources), which we save in a separate file “3FGL_sources_with_missing_values.csv” for reference. In particular, all of these sources have a significant curvature value of 0.
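Schematically, the reported probabilities for the unassociated sources are assembled as in the sketch below, where make_classifier stands for any of the eight methods and X_assoc, y_assoc, and X_unassoc are assumed inputs (for associated sources the average is instead taken over their appearances in the testing samples).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def average_probabilities(make_classifier, X_assoc, y_assoc, X_unassoc,
                          n_splits=1000):
    """Mean and std of the PSR probability over random 70%/30% splits."""
    probs = []
    for seed in range(n_splits):
        X_tr, _, y_tr, _ = train_test_split(X_assoc, y_assoc, test_size=0.3,
                                            random_state=seed)
        clf = make_classifier().fit(X_tr, y_tr)
        probs.append(clf.predict_proba(X_unassoc)[:, 1])
    probs = np.array(probs)
    return probs.mean(axis=0), probs.std(axis=0)
```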
Table 4 shows an example of the probabilistic catalog for a few unassociated 3FGL sources. Notice that the first source is classified as a pulsar by BDT and as an AGN by the RF, LR, and NN algorithms; it is an example of a source with mixed classification. Out of 1008 unassociated sources in 3FGL, 111 are classified as pulsars by all eight methods, 597 are classified as AGNs, and 300 have mixed classifications. Out of the 111 sources classified as pulsars, 6 sources have counterparts in the Parkes survey (Camilo et al. 2015) within 2 arc minutes (see Table 5).
Catalog   | Classification | AGN | PSR | OTHER | MIXED
3FGL      | 2-class        | 597 | 111 | –     | 300
          | 2-class corr   | 578 | 97  | 51    | 279
          | 3-class        | 585 | 53  | 69    | 301
4FGL-DR2  | 2-class        | 872 | 162 | –     | 624
          | 2-class corr   | 821 | 135 | 139   | 563
          | 3-class        | 736 | 64  | 271   | 587

Table 3. Expected numbers of AGNs, pulsars, and other sources, as well as sources with mixed classifications, among the unassociated 3FGL and 4FGL-DR2 sources derived with the 2-class (Section 4) and 3-class (Section 5) classification. The “2-class corr” row shows the correction of the 2-class classification prediction due to the presence of OTHER sources among the unassociated ones (see Section 4.1 for details).
We summarize the results of the classification of unassociated sources in Table 3 in the “2-class” row. The “AGN” column shows the number of unassociated sources for which all eight methods from Table 2 give a probability for the source to be an AGN above 50%. Similarly, the “PSR” column shows the number of unassociated sources for which all algorithms predict the source to be more likely a pulsar. The “MIXED” column shows the number of sources with mixed classification, i.e., some algorithms predict that the source is more likely an AGN while the other algorithms predict that it is more likely a pulsar. We also add the “OTHER” column in order to compare the results with the 3-class classification in Section 5. Since there is no “OTHER” class in the 2-class classification, the corresponding entry is empty. In the “2-class corr” row we show a possible correction of the number of pulsars and AGNs due to the presence of other sources. Here we assume that the fraction of AGN-like and pulsar-like sources among the other sources is the same for associated and for unassociated sources. In particular, if we denote by $N_{\rm AGN}$ the number of unassociated sources with AGN-like probabilistic classification, by $N^{\rm ass\,OTHER}_{\rm AGN}$ the number of sources with AGN-like classification among the associated OTHER sources, and by $N^{\rm ass}$ ($N^{\rm unass}$) the total number of associated (unassociated) sources, then the number of AGN-like sources among the unassociated ones, corrected for the presence of OTHER sources, can be estimated as

N^{\rm corr}_{\rm AGN} = N_{\rm AGN} - N^{\rm ass\,OTHER}_{\rm AGN} \frac{N^{\rm unass}}{N^{\rm ass}} .  (5)

Analogous corrections are applied for the numbers of unassociated sources with PSR and with mixed classifications. If we denote by $N^{\rm ass\,OTHER}$ the total number of associated other sources, then the estimated number of OTHER sources among the unassociated ones is

N^{\rm unass}_{\rm OTHER} = N^{\rm ass\,OTHER} \frac{N^{\rm unass}}{N^{\rm ass}} .  (6)

We show this estimate in the OTHER column of the “2-class corr” row. We note that since $N^{\rm ass\,OTHER}_{\rm AGN} + N^{\rm ass\,OTHER}_{\rm PSR} + N^{\rm ass\,OTHER}_{\rm MIXED} = N^{\rm ass\,OTHER}$, this estimate is consistent with the corrections in Equation (5) for sources classified as AGNs, pulsars, or with mixed classification.
Source_Name_3FGL  | AGN probability: BDT | RF | LR | NN
3FGL J0000.2-3738 | …                    | …  | …  | …

Table 4. Example of the AGN classification probabilities for a few unassociated sources in the 3FGL catalog (Acero et al. 2015). We have omitted the oversampled probability columns here.
Source_Name_3FGL  | GLON | GLAT | Sep (arcsec)
3FGL J0933.9-6232 | …    | …    | …

Table 5. Connection of unassociated 3FGL sources classified as pulsars with Parkes pulsars (Camilo et al. 2015).

In this section we construct a probabilistic classification of sources in the 4FGL-DR2 catalog. The 4FGL-DR2 catalog (Ballet et al. 2020) is based on 10 years of
Fermi-LAT data (compared to 8 years of data in the 4FGL catalog, Abdollahi et al. 2020). It contains 5788 sources, which is 723 sources more than in the 4FGL catalog (all sources in 4FGL are kept in 4FGL-DR2 even if they fall below the detection threshold with 10 years of data). 3770 sources in 4FGL-DR2 have either an AGN or a PSR classification, 1658 sources are unassociated (we only look at the CLASS1 column in the catalog), and the remaining 346 sources are other sources with classifications such as PWN, SNR, etc. There are 14 sources in 4FGL-DR2 with missing values: four AGNs, one PWN (the Crab), and nine unassociated sources. As in the previous section, we use for training and testing the sources associated with either AGNs or pulsars which have no missing values used for classification. We then calculate the classification probabilities of the AGN and PSR classes for both the associated and the unassociated sources. The 4FGL catalog has a higher number of features, especially due to the difference in the modeling of the spectra compared with the 3FGL catalog. We selected 28 of these features and looked for correlations among them. If any feature was strongly correlated or anti-correlated with another one, we kept only one of the two features.

The 303 unassociated 4FGL-DR2 sources correspond to 302 unassociated 3FGL sources, because there are two 4FGL-DR2 sources which are associated with one 3FGL source.
Algorithm | Parameters                    | Testing Accuracy | Std. Dev.
RF        | 50 trees, max depth 6         | 97.87 | 0.36
RF_O      |                               | 97.56 | 0.39
BDT       | 100 trees, max depth 2        | 97.63 | 0.39
BDT_O     |                               | 97.72 | 0.38
NN        | 300 epochs, 16 neurons, LBFGS | 97.41 | 0.47
NN_O      |                               | 95.48 | 0.66
LR        | LBFGS solver, 200 iterations  | 97.80 | 0.38
LR_O      |                               | 96.03 | 0.53

Table 6. Testing accuracy of the 4 algorithms on the 4FGL-DR2 associated data. “_O” denotes training with oversampling.
40 sources are predicted to be pulsars using the 3FGL features and 75 sources are predicted to be pulsars using the 4FGL features. This leads to 29 sources which are predicted by all eight methods to be pulsars for features taken from both the 3FGL and 4FGL catalogs. For convenience, we save these 29 pulsar candidates as a separate file. Among the 6 unassociated 3FGL sources classified as pulsars and spatially associated with pulsars in the Parkes survey in Table 5, four are unassociated in 4FGL-DR2 and are predicted to be pulsars by the probabilistic classification of unassociated 4FGL-DR2 sources. Of the other two, one is now associated as a pulsar in 4FGL-DR2, and the second one is not detected in 4FGL-DR2.
5. Three-Class Classification
One of the caveats of the analysis with two classes is that there are associated sources which do not belong to the AGN or PSR classes. These have the labels: unk, spp, glc, snr, gal, sbg, GAL, sfr, bin, SNR, HMB, LMB, css, PWN, pwn, hmb, SFR, BIN, lmb, NOV. We collect all these associated sources, which do not belong to the AGN and PSR classes, into a new class, which we label “OTHER”. Since in the two-class classification we train the algorithms to classify sources only into the AGN and PSR classes, OTHER sources are also classified as either AGN or PSR. This introduces a bias in the estimates of the numbers of AGNs and pulsars among the unassociated sources. One possibility to correct this bias is to assume that the fractions of OTHER sources among associated and unassociated sources are the same (Equation (5)). This correction can be applied to the total number of sources or to the number of sources in some window of parameters, e.g., in a flux bin or in a range of latitudes and longitudes. This is a straightforward calculation, but it has some limitations. In particular, it implicitly assigns equal probabilities to all AGNs (and all PSRs) to belong to the OTHER class. For a small range of parameters the variance of this estimate can be very large due to the small number of associated OTHER sources in this parameter range. As we will see
in Section 6, this correction depends on the choice of the variable used for binning, e.g., the overall correction with latitude bins is not equal to the correction with longitude bins.

In this section we discuss the construction of probabilistic catalogs with multi-class classification (3-class classification in our case). We start with the construction of the probabilistic catalog based on 3FGL by adding the class “OTHER”, which includes all associated sources without AGN or PSR associations: there are 108 such sources in 3FGL. We use the same 11 features as in the 2-class classification; the only difference is that we use cos(GLON) instead of GLON. The reason is that the LR and NN methods have a significantly worse performance than the RF and BDT methods when we use GLON but, as we show below, all four methods have comparable accuracy when we use cos(GLON). This can be due to the discontinuity in the GLON variable. We perform the optimization of the meta-parameters for the four ML algorithms with the 3 classes.

The dependence of the accuracy on the meta-parameters of the algorithms is shown in Figures 15 and 16. We see that for the tree-based algorithms the optimal parameters are similar to the 2-class classification, i.e., 50 trees with depth 6 for RF and 100 trees with depth 2 for BDT provide close to optimal performance at a minimal cost in complexity (depth of the trees). The main difference for the NN and LR algorithms is that more steps are needed for convergence, especially in the case of oversampling. In the following we use 600 epochs for NN and 500 iterations for LR, instead of the 300 epochs and 200 iterations, respectively, in the two-class case. For NN, the accuracy stops increasing above about 10 neurons in the hidden layer (in the following we use 11 neurons for classification, the same as in the two-class case). For oversampling, we use oversampling factors equal to the square roots of the class imbalance ratios, $\sqrt{N_{\rm AGN}/N_{\rm PSR}}$ and $\sqrt{N_{\rm AGN}/N_{\rm OTHER}}$, for the PSR and OTHER classes respectively (instead of the full ratios used in the 2-class case). The reason for the smaller oversampling factors is to avoid overweighting the relatively small OTHER class.

We show an example of the domains in the 3-class case in Figure 17. A class domain is determined by the class with the largest probability. Since in the 3-class case there are two independent probabilities, which are difficult to show with a single color bar, we present only the domains, represented by three different colors: brown for the PSR, green for the OTHER, and blue for the AGN class. The corresponding training and testing data are shown by red crosses and brown rotated squares for PSR, by green squares and yellow diamonds for OTHER, and by blue and purple triangles for AGN. The classification domains are averaged over 100 realizations of splitting the data into training and testing samples. One of these splittings is shown in the figure.
Fig. 15. Accuracy for the 3-class classification with the RF, BDT, and LR methods (RF and BDT: testing score as a function of the maximum depth for 20, 50, 100, and 200 trees; LR: testing score as a function of the number of iterations for the LBFGS, SAG, and SAGA solvers). LR does not have the liblinear solver here, since liblinear cannot handle the multinomial loss.

The accuracies of our chosen models for the classification of the 3FGL sources are presented in Table 7. As in the 2-class case, the accuracies are averaged over 1000 realizations of splitting the data into training and testing samples. We notice that the accuracies presented in Table 2 are calculated relative to the AGN and PSR classes only; if we take into account that all OTHER sources are misclassified in this case, then the testing accuracy is reduced by about 5% (the fraction of OTHER sources among the associated sources in 3FGL), while the accuracy of the comparison with 4FGL-DR2 is reduced by about 10% (there are 37 unassociated sources in 3FGL with OTHER class associations in 4FGL-DR2, out of in total 339 unassociated sources in 3FGL with associations in 4FGL-DR2). Thus the testing accuracy of 93-94% in Table 7 provides at least a 1-2% improvement over the accuracy in Table 2, after taking into account the misclassification of OTHER sources in the 2-class case, and a similar improvement for the accuracy of the classification of unassociated sources in 3FGL with 4FGL-DR2 associations.
Fig. 16. Accuracy of the NN classification as a function of the number of epochs (11 neurons) and of the number of neurons (600 epochs) for the 3-class classification, for the LBFGS and Adam solvers with tanh or relu activation.
Fig. 17. Classification domains for RF in the 3-class classification (50 trees, maximum depth 6, testing score 0.92).

The 3-class classification of the 4FGL-DR2 sources is performed similarly to the 3-class classification of the 3FGL sources. The differences are similar to the differences in the 2-class classification of 3FGL and 4FGL-DR2 sources: we use 16 features (the same features as in the 2-class classification of 4FGL-DR2 sources in Section 4.2, but with GLON replaced by cos(GLON)) and we have 16 neurons in the hidden layer of the NN method. Furthermore, for logistic regression we use 1000 iterations instead of 500, as this gives a better performance for the oversampled cases.
Algorithm   Parameters                       Testing Accuracy   Std. Dev.   Comparison with 4FGL-DR2 Accuracy
RF          50 trees, max depth 6            93.96              0.85        84.96
RF_O                                         94.38              0.76        84.96
BDT         100 trees, max depth 2           93.72              0.83        83.19
BDT_O                                        93.83              0.80        85.25
NN          600 epochs, 11 neurons, LBFGS    93.17              1.05        83.48
NN_O                                         92.51              1.34        81.71
LR          500 iterations, LBFGS solver     93.93              0.88        83.19
LR_O                                         93.01              0.96        83.19
Table 7. Testing accuracy of the four selected algorithms for the 3-class classification of the 3FGL sources and comparison with associations in the 4FGL-DR2 catalog. "_O" denotes training with oversampling.

The corresponding accuracies for the 4FGL-DR2 catalog are reported in Table 8. In comparing these accuracies with the 2-class classification in Table 6, one has to take into account that there are 346 OTHER sources among the 4116 associated sources in 4FGL-DR2, i.e., about 8.4%. Since all OTHER sources are "misclassified" by the 2-class classification, the 3-class classification provides an improvement of about 2-4% compared to the 2-class classification.
Algorithm   Parameters                       Testing Accuracy   Std. Dev.
RF          50 trees, max depth 6            92.91              0.66
RF_O                                         92.83              0.63
BDT         100 trees, max depth 2           92.51              0.67
BDT_O                                        92.27              0.67
NN          600 epochs, 16 neurons, LBFGS    91.86              0.72
NN_O                                         90.26              0.83
LR          1000 iterations, LBFGS solver    92.63              0.67
LR_O                                         92.22              0.69
Table 8. Testing accuracy of the four selected algorithms for the 3-class classification of the 4FGL-DR2 sources. "_O" denotes training with oversampling.
The numbers of unassociated sources classified by all 8 methods as AGNs, pulsars, and other sources in the 3FGL and 4FGL-DR2 catalogs are presented in the "3-class" rows of Table 3. For each algorithm the most probable class of a source is determined by the class with the largest probability; since there are three classes, the largest probability can be as small as just above 1/3. The "Mixed" column shows the number of sources with different classification results for different algorithms. Classification of Fermi-LAT 4FGL sources into three classes was considered earlier by, e.g., Zhu et al. (2021).
Zhu et al. (2021) primarily used a two-step classification procedure, in which AGNs are first separated from the rest of the sources and the remaining sources are then split into pulsars and other sources.
Zhu et al. (2021) also tested a simultaneous classification of the sources into three classes (AGNs, pulsars, other), but the results were inconsistent between the two ML algorithms they used (RF and NN); in particular, the number of OTHER sources predicted by the NN was zero. In our case, the predictions of the various algorithms are relatively consistent with one another. For example, in the 3FGL (4FGL-DR2) catalog all 8 methods classify 69 (271) unassociated sources as OTHER. Also, 8 of the 37 unassociated 3FGL sources that are associated with OTHER sources in 4FGL-DR2 are classified by all 8 algorithms as OTHER (6 are classified as AGNs, 5 as PSRs, and 18 have mixed classifications).
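Counting the sources that receive the same most probable class from all methods, as quoted above, amounts to an argmax over the class probabilities. A minimal sketch, assuming an array probs of shape (8, n_sources, 3) collected from the eight classifiers:

    import numpy as np

    classes = ("AGN", "PSR", "OTHER")
    best = probs.argmax(axis=2)                # most probable class per method
    unanimous = (best == best[0]).all(axis=0)  # all eight methods agree
    for k, name in enumerate(classes):
        print(name, np.sum(unanimous & (best[0] == k)))
    print("Mixed:", np.sum(~unanimous))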
6. Application of probabilistic catalogs for population studies
In this section we show how probabilistic catalogs can be used, for instance, for population studies. One of the most important questions in gamma-ray astronomy is the contribution of point sources, e.g., AGNs, to the extragalactic gamma-ray flux (e.g., Abdo et al. 2010b; Malyshev & Hogg 2011; Ackermann et al. 2016; Zechlin et al. 2016b,a; Lisanti et al. 2016; Di Mauro et al. 2018): if most of the extragalactic emission is explained by point sources, then one can put stringent constraints, e.g., on dark matter annihilation or decay into gamma rays (Ajello et al. 2015; Di Mauro & Donato 2015; Ackermann et al. 2015; Fornasa & Sánchez-Conde 2015; Liu et al. 2017) or on the evaporation of primordial black holes (Carr et al. 2010). In particular, it is important to understand the contribution of the unassociated sources to the population of AGNs. A probabilistic catalog provides an answer to the question of how many sources among the unassociated ones are expected to belong to the different classes, such as pulsars or AGNs. One can calculate the total expected number of AGNs or pulsars among the unassociated sources, or calculate the contribution as a function of one or more parameters. In this subsection we determine the numbers of AGNs, pulsars, and, in the case of the 3-class classification, OTHER sources as a function of their flux; in the following subsection we discuss the distributions of sources as functions of Galactic latitude and longitude.

In Figure 18 we show the cumulative numbers of AGNs and pulsars with flux above 1 GeV larger than the value on the x-axis. Solid blue lines show the actual counts of sources (AGNs or pulsars) in the 3FGL and 4FGL-DR2 catalogs. As a consistency check of the method, we calculate the AGN- and PSR-like probabilities for the associated sources. The sums of probabilities (uncorrected for sources other than AGNs and pulsars) for the LR algorithm are shown by dotted purple lines. In order to correct the expected number of AGNs among associated sources for the AGN-like probabilities of the "other" sources, we subtract the corresponding AGN-like probabilities in each flux band:

    N^{\rm ass}_{\rm AGN} = \sum_{i \in {\rm ass}} p^i_{\rm AGN} - \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} .    (7)

The corrected sums of probabilities for the LR method are shown by dashed purple lines. The green bands show the envelopes of the sums of corrected probabilities for the eight methods used in this paper. We see that the counts of associated sources, AGNs and pulsars, are consistent with the expected numbers of associated sources calculated from the class probabilities of the associated sources. This conclusion is not very surprising, since we used the associated sources for the training of the ML algorithms. We note that the correction for "other" sources is important for the consistency of the sums of probabilities with the numbers of associated sources. We have also compared with the sums of probabilities for the 3FGL associated sources from Saz Parkinson et al. (2016): the sums uncorrected for "other" sources are shown (for the LR case) by a dotted black line, while the sums corrected for "other" sources are shown by black dash-dotted lines. The gray band is the envelope of the two methods (LR and RF) used by Saz Parkinson et al. (2016).
We see that the sums of probabilities for AGNs and pulsars overpredict the source counts in 3FGL, while the correction for "other" sources makes the prediction for associated sources consistent with the source counts.

The predictions for the numbers of AGNs and pulsars among the unassociated sources, corrected for "other" sources and added to the 3FGL and 4FGL-DR2 source counts, are shown by solid orange lines (for the LR case). The orange bands show the corresponding envelopes for the eight ML methods. We assume that the fractional contribution of other sources is the same for associated and unassociated sources in the different flux bands; the correction for the presence of other sources is therefore calculated similarly to that for the associated sources in Equation (7), but rescaled to account for the fact that there are fewer unassociated than associated sources. In particular, the number of AGNs among the unassociated sources in a flux band \Delta F is estimated as

    N^{\rm unass}_{\rm AGN} = \sum_{i \in {\rm unass}} p^i_{\rm AGN} - \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} \cdot \frac{N_{\rm unass}}{N_{\rm ass}} ,    (8)

where all probabilities and numbers of sources are computed for sources with flux inside \Delta F. The first term is the sum of the AGN-like probabilities of the unassociated sources, while the second term is the sum of the AGN-like probabilities of the associated "other" sources, rescaled by the ratio of the numbers of unassociated to associated sources in this flux band.
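Equations (7) and (8) translate directly into code; a minimal sketch, assuming numpy arrays of AGN-like probabilities within one flux band and a boolean mask is_other marking the associated sources that are neither AGNs nor pulsars:

    import numpy as np

    def n_agn_assoc(p_agn_ass, is_other):
        """Eq. (7): expected number of AGNs among associated sources in one
        flux band, corrected for the AGN-like probabilities of the associated
        sources that are neither AGNs nor pulsars."""
        return p_agn_ass.sum() - p_agn_ass[is_other].sum()

    def n_agn_unassoc(p_agn_unass, p_agn_ass, is_other):
        """Eq. (8): expected number of AGNs among unassociated sources in the
        same flux band, with the "other" correction rescaled by N_unass/N_ass."""
        return (p_agn_unass.sum()
                - p_agn_ass[is_other].sum()
                * len(p_agn_unass) / len(p_agn_ass))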
Fig. 18. Cumulative number of sources as a function of their flux. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Green band – envelope of the sums of class probabilities for associated sources for the eight ML methods, corrected for the presence of OTHER sources. Orange solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods (corrected for the presence of OTHER sources) added to the source counts of associated sources. Purple dashed (dotted) line – sum of class probabilities for associated sources for the LR method without oversampling, corrected (uncorrected) for the presence of OTHER sources. Gray dash-dotted (dotted) line – sums of class probabilities from Saz Parkinson et al. (2016) using LR, corrected (uncorrected) for the presence of OTHER sources. Gray band – envelope of the sums of class probabilities for associated sources for the LR and RF methods from Saz Parkinson et al. (2016), corrected for the presence of OTHER sources. For details see Section 6.1.
The expected number of pulsars among the unassociated sources is calculated analogously. The corresponding sums of the associated source counts plus the expected numbers of sources, calculated with the LR method of Saz Parkinson et al. (2016) and corrected for other sources, are shown by dashed gray lines.

The expected numbers of pulsars and AGNs among the unassociated sources are summarized in Table 9 for the 3FGL catalog and in Table 10 for the 4FGL-DR2 catalog. The row "2-class" shows the expectations for the 2-class classification uncorrected for the presence of other sources. The row "2-class corr total" shows the expectations corrected for other sources using Equation (8) without any binning, while "2-class corr F-bins" shows the correction with the flux bins of Figure 18. We see that the corrections using the total counts and using the flux bins are similar and relatively small. In the following section we also calculate the corrections using latitude and longitude bins.

The numbers of PSRs and OTHER sources in the 3-class classification as a function of flux are shown in Figure 19. As in the 2-class case, the solid blue lines show the counts of associated sources in the corresponding classes. The blue bands show the envelopes of the expected numbers of associated sources, and the red bands the envelopes of the associated source counts plus the expected numbers of sources among the unassociated ones. In the PSR plots we also replot, from the 2-class classification corrected for the presence of OTHER sources among the unassociated ones, the envelopes of the expected numbers of associated PSRs (green bands) and of the associated source counts plus the predictions for the unassociated sources (orange bands). We notice that the bands for the 3-class case are narrower in the PSR plots than the bands for the 2-class case.
Fig. 19. Cumulative number of sources as a function of their flux. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Blue band – envelope of the sums of class probabilities for associated sources for the eight ML methods. Purple dash-dotted line – sum of class probabilities for associated sources for the LR method without oversampling. Red solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods added to the source counts of associated sources. Purple dashed line – sum of class probabilities for associated sources for the LR method without oversampling in the 2-class classification, corrected for the presence of OTHER sources. Orange solid line (band) in the PSR plots – average (envelope) of the sums of class probabilities for the eight ML methods (corrected for the presence of OTHER sources) in the 2-class classification added to the source counts of associated sources (the band and the line are the same as in Figure 18). Green band in the PSR plots – envelope of the sums of class probabilities for associated sources for the eight ML methods in the 2-class classification, corrected for the presence of OTHER sources (the same as in Figure 18). Orange solid line in the OTHER plots – OTHER source counts plus an estimate of the number of OTHER sources among the unassociated ones using Equation (9). For details see Section 6.1.

In part this narrowing is due to less oversampling in the 3-class case. We also note that the red band lies almost entirely below the orange one. This points to a possible underestimation of the contribution of OTHER sources to the PSR class among the unassociated sources. In particular, in the bottom panels of Figure 19 we show by the orange line the estimated total number of OTHER sources in the 2-class case. Since we do not have probabilities for the OTHER sources in this case, we estimate the number of OTHER sources among the unassociated ones simply by rescaling the number of associated OTHER sources in each flux band:

    N^{\rm unass}_{\rm OTHER} = N^{\rm ass}_{\rm OTHER} \, \frac{N_{\rm unass}}{N_{\rm ass}} .    (9)

We notice that this estimate agrees with the correction of the estimated numbers of AGNs and pulsars in the 2-class case for the presence of other sources, since

    N^{\rm ass}_{\rm OTHER} = \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} + \sum_{i \in {\rm ass\,other}} p^i_{\rm PSR} = \sum_{i \in {\rm ass\,other}} \left( p^i_{\rm AGN} + p^i_{\rm PSR} \right) ,    (10)

provided that p^i_{\rm AGN} + p^i_{\rm PSR} = 1 for each source in the 2-class case. The estimated total number of OTHER sources in the 2-class case (orange line) is significantly below the red band derived from the sums of probabilities of the OTHER class in the 3-class case. The model in Equation (9) depends on the binning: as we see in the following section, for latitude binning this 2-class estimate of the number of OTHER sources among the unassociated ones agrees with the 3-class estimate.
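A minimal sketch of Equation (9), together with a toy check of the consistency statement of Equation (10); the probabilities here are random numbers, standing in for the 2-class outputs of the 108 associated OTHER sources in 3FGL:

    import numpy as np

    def n_other_unassoc(n_other_ass, n_unass, n_ass):
        """Eq. (9): rescale the associated OTHER counts by the
        unassociated-to-associated ratio in the given bin."""
        return n_other_ass * n_unass / n_ass

    # Eq. (10): with p_AGN + p_PSR = 1 per source in the 2-class case, the
    # AGN and PSR corrections over the associated OTHER sources sum to
    # exactly their count
    p_agn = np.random.default_rng(0).uniform(size=108)  # toy probabilities
    p_psr = 1.0 - p_agn
    assert np.isclose(p_agn.sum() + p_psr.sum(), 108.0)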
Fig. 20. Ratio of the numbers of AGNs, pulsars, and other sources among the unassociated sources, estimated with LR without oversampling, to the counts of the respective associated sources. The AGN and pulsar estimates are corrected for the presence of other sources using Equation (8), while the number of other sources among the unassociated ones in the 2-class case is estimated using Equation (9).

The reason is that for flux bins the ratio N_unass/N_ass is dominated by AGNs at high latitudes and is always smaller than 1, while at low latitudes, where most of the OTHER sources are, the ratio N_unass/N_ass is about 1.

We note that the probabilistic classification mostly affects sources with small fluxes. In Figure 20 we plot the ratio of the expected number of sources of a certain class among the unassociated sources, computed according to Equation (8) with the LR algorithm (without oversampling), to the number of associated sources in this class. The ratio generally increases as the flux decreases. Negative values (e.g., at high fluxes for AGNs) are due to the subtraction of the probabilities of the "other" associated sources. For OTHER sources in the 2-class case we use the estimate of Equation (9). As discussed above, these estimates are significantly below the estimated numbers of OTHER sources among the unassociated ones in the 3-class classification (for the LR method without oversampling in this case).

In this subsection we show the Galactic latitude and longitude profiles of the distributions of associated and unassociated sources. In Figures 21 and 22 we present the source counts as a function of abs(sin(GLAT)) for the 2-class and 3-class classifications, respectively. We use 20 bins, i.e., each bin corresponds to a solid angle of π/5 sr. Solid blue lines show the counts of associated sources in the 3FGL and 4FGL-DR2 catalogs. It is interesting to note that the density of associated AGNs decreases towards the Galactic plane. The total counts of unassociated sources are shown by red dash-dotted lines. Green bands show the envelopes of the sums of probabilities for the classification of unassociated sources into the AGN, PSR, and, in the case of 3 classes, OTHER classes for the eight ML methods, corrected in the 2-class case for the presence of OTHER sources. The black dotted lines in the plots of the OTHER class show the model of Equation (9) for the number of OTHER sources among the unassociated ones, computed in latitude bins. We see that in the latitude profiles the 2-class model for the contribution of OTHER sources to the unassociated ones is generally consistent with the estimate of the number of OTHER sources in the 3-class model.

The classifications of the unassociated 3FGL sources by Saz Parkinson et al. (2016) are shown by gray dashed (RF) and dotted (LR) lines. The numbers of unassociated sources classified as AGN, PSR, or OTHER grow towards the Galactic plane (GP). Near the GP the expected number of PSRs is about the same as the number of AGNs among the unassociated sources, while at high latitudes most of the unassociated sources are classified as AGNs. It is interesting to note that, according to Table 1, GLAT is one of the least important features for the RF algorithm. This can be explained a posteriori by the fact that the density of AGNs is so high that even in the GP the expected number of AGNs is comparable to the expected number of PSRs.
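The latitude profiles described above reduce to weighted histograms in |sin(GLAT)|; a minimal sketch, assuming arrays glat_ass_agn and glat_unass of Galactic latitudes (in degrees) and AGN-like probabilities p_agn_unass of the unassociated sources:

    import numpy as np

    # 20 uniform bins in |sin(GLAT)|: each bin subtends 4*pi*0.05 = pi/5 sr,
    # so an isotropic population would give a flat profile
    bins = np.linspace(0.0, 1.0, 21)
    x_ass = np.abs(np.sin(np.deg2rad(glat_ass_agn)))
    x_unass = np.abs(np.sin(np.deg2rad(glat_unass)))

    counts_ass, _ = np.histogram(x_ass, bins=bins)
    expected_agn, _ = np.histogram(x_unass, bins=bins, weights=p_agn_unass)
    profile = counts_ass + expected_agn   # "src count + unass prob" in Fig. 21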
Fig. 21. Latitude profiles of the source counts for the 2-class classification. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Red dash-dotted line – counts of all unassociated sources. Blue band – envelope of the sums of class probabilities for associated sources for the eight ML methods (with and without oversampling), corrected for the presence of OTHER sources. Green band – envelope of the sums of class probabilities for unassociated sources for the eight ML methods, corrected for the presence of OTHER sources. Orange solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods added to the source counts of associated sources. Green dotted line in the AGN plots – mean of the orange solid line. Gray dashed (dotted) lines – RF (LR) sums of class probabilities from Saz Parkinson et al. (2016). For details see Section 6.2.

The orange shaded areas show the sums of the source counts and the expected numbers of sources for the eight methods (both with and without oversampling). The average among the eight methods added to the counts of associated sources is shown by the solid orange lines (for AGNs we also show the mean of these points by a dotted green line). We find that the number of associated AGNs decreases towards the GP while the expected number of AGNs among the unassociated sources increases towards the GP, so that the sum of the two is relatively uniform as a function of Galactic latitude.

In Figures 23 and 24 we show plots analogous to Figures 21 and 22 for Galactic longitude (we use 30 bins). We note that there is a significant increase in the number of unassociated sources in the 4FGL-DR2 catalog at longitudes near the Galactic center. In the 2-class classification about half of these unassociated sources are attributed to pulsars and half to AGNs, while in the 3-class classification at least one third of the unassociated sources in this range of longitudes is attributed to the OTHER class; this comes mostly at the expense of reducing the predicted number of pulsars. It is interesting to note that the 2-class model for the OTHER sources calculated using Equation (9) in longitude bins (black dotted lines in the OTHER plots) is significantly below the bands of the 3-class expectations, whereas for the latitude bins the 2-class model is consistent with the bands. The reason is that the ratio of unassociated to associated sources, which we use to estimate the number of OTHER sources among the unassociated ones, depends on the binning. In particular, this ratio is about 1 at low latitudes, which leads to an estimate that at low latitudes the number of OTHER sources among the unassociated ones is similar to the number of associated OTHER sources; for longitude and flux binning the ratio of unassociated to associated sources is smaller than 1, which leads to a prediction (in the 2-class case) that the number of OTHER sources among the unassociated ones is smaller than the number of associated OTHER sources.
Fig. 22. Latitude profiles of the source counts for the 3-class classification. For the definition of the labels see Figure 21. The black dashed lines in the plots of the OTHER class show the model of Equation (9) for the number of OTHER sources among the unassociated ones, computed in latitude bins.

We summarize the expected numbers of sources in the 2-class and 3-class models in Table 9 for the 3FGL and in Table 10 for the 4FGL-DR2 catalog. In the 2-class case we also show the corrections for the presence of OTHER sources among the unassociated ones. We make the corrections using the total numbers of unassociated and associated sources, as well as the binning in flux from Figure 18 (F-bins), in latitude from Figure 21 (Lat-bins), and in longitude from Figure 23 (Lon-bins). We see that the predictions for the numbers of sources using latitude binning in the 2-class case are closest to the 3-class case. Even better agreement would likely be achieved with simultaneous binning in flux, latitude, longitude and, possibly, other variables. If, in addition, one allows bins of variable size and non-rectangular boundaries in the multi-dimensional space, then one effectively ends up with one of the ML algorithms for classification.
Fig. 23. Longitude profiles of the source counts for the 2-class classification. For the definition of the labels see Figure 21.

Indeed, ML algorithms are designed to provide an optimal binning, one which maximizes the separation of the classes, and to make predictions for data with unknown labels (e.g., the unassociated sources) based on the counts of samples in bins with known labels (e.g., the associated sources). Thus the 3-class prediction for the classification of unassociated sources is likely more accurate than the 2-class classification with a correction for OTHER sources based on an ad hoc binning of one of the variables. The 2-class classification may still be useful in situations where high recall is necessary for either pulsars or AGNs: since OTHER sources are mixed with pulsars and AGNs (cf. Figure 17), some pulsars and AGNs can be mistakenly classified as OTHER in the 3-class case.
7. Conclusions
In this paper we determine the probabilities of classification of unassociated sources in the 3FGL and 4FGL-DR2 Fermi-LAT catalogs into two (AGNs and pulsars) and three (AGNs, pulsars, and other sources) classes. The probabilities are calculated with 8 different ML methods (RF, BDT, LR, and NN, each algorithm with and without oversampling during training), which were trained and tested with associated sources. We have scanned some meta-parameters of the algorithms, such as the depth of the trees, the number of trees, and the number of neurons, in order to determine optimal parameters that avoid overfitting the data while providing good classification accuracy.
Classification            AGN    PSR    OTHER
2-class                   ...    ...    –
2-class corr total        ...    ...    ...
2-class corr F-bins       ...    ...    ...
2-class corr Lat-bins     ...    ...    ...
2-class corr Lon-bins     ...    ...    ...
3-class                   ...    ...    ...

Table 9. Expected counts of sources among the unassociated 3FGL sources.
Classification            AGN    PSR    OTHER
2-class                   ...    ...    –
2-class corr total        ...    ...    ...
2-class corr F-bins       ...    ...    ...
2-class corr Lat-bins     ...    ...    ...
2-class corr Lon-bins     ...    ...    ...
3-class                   ...    ...    ...

Table 10. Expected counts of sources among the unassociated 4FGL-DR2 sources.
Fig. 24. Longitude profiles of the source counts for the 3-class classification. For the definition of the labels see Figures 21 and 22.

The accuracies which we obtained for the 3FGL catalog for the four algorithms without oversampling in the 2-class case are about 97%, and between 93% and 97% with oversampling. We have also checked the accuracy of the classification by selecting unassociated sources in 3FGL which have associations in 4FGL-DR2. If we take the 4FGL-DR2 associations as the true classes, then the accuracies of classification for this subset of sources without (with) oversampling are between 90% and 91% (85% and 92%). Most of the misclassified sources in this comparison have spectral parameters in 3FGL that are typical of the other class (Figure 14), i.e., the misclassification can be due to problems with reconstructing the spectra of the sources. In the 3-class classification, the testing accuracies for the 3FGL catalog are between 92% and 94% (both with and without oversampling), while the comparison with 4FGL-DR2 gives accuracies between 82% and 85%.
For the 4FGL-DR2 classifications, the testing accuracy is between 90% and 93% (both with and without oversampling). If one takes into account that all other sources are misclassified in the 2-class case, then the 3-class case provides an improvement in accuracy of 1% to 5%.

We have created four catalogs with probabilistic classifications of sources: based on 3FGL and on 4FGL-DR2, each with 2- and 3-class classifications. For each source and for each class we report the class probabilities for each of the eight ML methods (with and without oversampling). We also provide individual standard deviations for all classification probabilities, computed over realizations of the selection of the training and testing datasets. We report the classification probabilities not only for the unassociated sources but also for the associated ones, which can be used to find outliers. An advantage of such a probabilistic classification is that a threshold on probability for selecting, e.g., pulsar candidates, can be chosen by the user based on his or her needs (a sketch of such a selection is given below). For example, in a search for new pulsars one can select a low threshold in order to avoid missing possible pulsars, while in a derivation of an average property of the class, e.g., the spectral index or cutoff energy, one can select a high threshold in order to avoid contamination from the other classes.

As an example of the application of the probabilistic catalogs, we derive the expected number of sources in the catalog as a function of their flux, including the unassociated sources. As a consistency check, we compare the counts of associated sources with the sums of probabilities for the associated ones. We find that correcting for the contribution of other sources in the 2-class case plays an important role in the estimation of the expected number of sources in a particular class. We find the total expected numbers of AGNs and pulsars in the 3FGL and 4FGL-DR2 catalogs by adding the class probabilities of the unassociated sources in the 2- and 3-class cases to the source counts of associated sources, correcting in the 2-class case for the contribution of other classes among the unassociated sources. In particular, we find that the total expected number of pulsars is about two times larger than the number of associated pulsars.

We plot the counts of associated sources and the expected numbers of AGNs, pulsars, and other sources among the unassociated sources as functions of Galactic latitude and longitude. We find that the number of associated AGNs decreases towards low latitudes while the expected number of AGNs among the unassociated sources increases, so that the sum of the two is relatively uniform, as expected for extragalactic sources.
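As a usage sketch of the threshold selection mentioned above (with a hypothetical file name and column layout, since the catalog format is only described qualitatively here):

    import pandas as pd

    # hypothetical file and column names: one PSR probability per ML method,
    # e.g. "RF_PSR", "LR_PSR", ...
    cat = pd.read_csv("3class_probabilistic_catalog.csv")
    psr_cols = [c for c in cat.columns if c.endswith("_PSR")]
    cat["P_PSR_mean"] = cat[psr_cols].mean(axis=1)

    candidates = cat[cat["P_PSR_mean"] > 0.3]   # low threshold: high recall
    pure_sample = cat[cat["P_PSR_mean"] > 0.9]  # high threshold: low contamination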
Acknowledgements
The authors would like to thank Pablo Saz Parkinson and Jean Ballet for valuable discussions. The work of DM was in part supported by BMBF under the ErUM-Data project "Innovative Digital Technologies for Research on Universe and Matter" (grant number 05H18WERC1) and by DFG grant MA 8279/2-1. We would like to acknowledge the use of the following software: Astropy (Robitaille et al. 2013), matplotlib (Hunter 2007), scikit-learn (https://scikit-learn.org/stable/about.html), and TOPCAT (Taylor 2005).
References
Abdo, A. A., Ackermann, M., Ajello, M., et al. 2010a, ApJS, 188, 405
Abdo, A. A., Ackermann, M., Ajello, M., et al. 2010b, ApJ, 720, 435
Abdollahi, S., Acero, F., Ackermann, M., et al. 2020, ApJS, 247, 33
Acero, F., Ackermann, M., Ajello, M., et al. 2015, ApJS, 218, 23
Ackermann, M., Ajello, M., Albert, A., et al. 2015, J. Cosmology Astropart. Phys., 2015, 008
Ackermann, M., Ajello, M., Albert, A., et al. 2016, Phys. Rev. Lett., 116, 151105
Ackermann, M., Ajello, M., Allafort, A., et al. 2012, ApJ, 753, 83
Ajello, M., Gasparrini, D., Sánchez-Conde, M., et al. 2015, ApJ, 800, L27
Ballet, J., Burnett, T. H., Digel, S. W., & Lott, B. 2020, arXiv e-prints, arXiv:2005.11208
Breiman, L. 2001, Machine Learning, 45, 5
Brewer, B. J., Foreman-Mackey, D., & Hogg, D. W. 2013, AJ, 146, 7
Camilo, F., Kerr, M., Ray, P. S., et al. 2015, ApJ, 810, 85
Carr, B. J., Kohri, K., Sendouda, Y., & Yokoyama, J. 2010, Phys. Rev. D, 81, 104019
Chiaro, G., Salvetti, D., La Mura, G., et al. 2016, MNRAS, 462, 3180
Cox, D. R. 1958, J. R. Stat. Soc. B, 20, 215
Daylan, T., Portillo, S. K. N., & Finkbeiner, D. P. 2017, ApJ, 839, 4
Defazio, A., Bach, F., & Lacoste-Julien, S. 2014, arXiv e-prints, arXiv:1407.0202
Di Mauro, M. & Donato, F. 2015, Phys. Rev. D, 91, 123001
Di Mauro, M., Manconi, S., Zechlin, H. S., et al. 2018, ApJ, 856, 106
Doert, M. & Errando, M. 2014, ApJ, 782, 41
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. 2008, J. Mach. Learn. Res., 9, 1871–1874
Fornasa, M. & Sánchez-Conde, M. A. 2015, Phys. Rep., 598, 1
Friedman, J. H. 2001a, Ann. Statist., 29, 1189
Friedman, J. H. 2001b, Ann. Statist., 29, 1189
Hassan, T., Mirabal, N., Contreras, J. L., & Oya, I. 2013, MNRAS, 428, 220
Ho, T. K. 1998, IEEE Trans. Pattern Anal. Mach. Intell., 20, 832
Hogg, D. W. & Lang, D. 2010, in EAS Publications Series, Vol. 45, 351–358
Hopfield, J. 1982, Proc. Nat. Acad. Sci., 79, 2554
Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90
Kingma, D. P. & Ba, J. 2014, arXiv e-prints, arXiv:1412.6980
Kovačević, M., Chiaro, G., Cutini, S., & Tosti, G. 2019, MNRAS, 490, 4770
Kovačević, M., Chiaro, G., Cutini, S., & Tosti, G. 2020, MNRAS, 493, 1926
Lee, K. J., Guillemot, L., Yue, Y. L., Kramer, M., & Champion, D. J. 2012, MNRAS, 424, 2832
Lefaucheur, J. & Pita, S. 2017, A&A, 602, A86
Lisanti, M., Mishra-Sharma, S., Necib, L., & Safdi, B. R. 2016, ApJ, 832, 117
Liu, D. C. & Nocedal, J. 1989, Math. Program., 45, 503–528
Liu, W., Bi, X.-J., Lin, S.-J., & Yin, P.-F. 2017, Chinese Physics C, 41, 045104
Luo, S., Leung, A. P., Hui, C. Y., & Li, K. L. 2020, MNRAS, 492, 5377
Malyshev, D. & Hogg, D. W. 2011, ApJ, 738, 181
Mirabal, N., Charles, E., Ferrara, E. C., et al. 2016, ApJ, 825, 69
Nolan, P. L., Abdo, A. A., Ackermann, M., et al. 2012, ApJS, 199, 31
Robitaille, T. P., Tollerud, E. J., Greenfield, P., et al. 2013, A&A, 558, A33
Salvetti, D., Chiaro, G., La Mura, G., & Thompson, D. J. 2017, MNRAS, 470, 1291
Saz Parkinson, P. M., Xu, H., Yu, P. L. H., et al. 2016, ApJ, 820, 8
Schmidt, M., Le Roux, N., & Bach, F. 2017, Math. Program., 162, 83–112
Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
Zechlin, H.-S., Cuoco, A., Donato, F., Fornengo, N., & Regis, M. 2016a, ApJ, 826, L31
Zechlin, H.-S., Cuoco, A., Donato, F., Fornengo, N., & Vittino, A. 2016b, ApJS, 225, 18
Zhu, K.-R., Kang, S.-J., & Zheng, Y.-G. 2021, Research in Astronomy and Astrophysics, 21, 015
Appendix A: Tests of additional meta-parameters
In this appendix we discuss tests of some hyper-parameters that had relatively little effect on the accuracy of the algorithms. For these tests we use the 2-class classification in the 3FGL catalog.

The LR algorithm has two hyper-parameters: the regularization and the tolerance. The effect of the choice of these parameters on the accuracy is less than 1% (Figure A.1). We therefore used the default values for these parameters (a tolerance of 10^-4 and a regularization parameter of 1). The same choice was used in the 3-class case, where the change from GLON to cos(GLON) was the main improvement for the LR algorithm.
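The scan behind Figure A.1 can be sketched as follows; the tolerance grid mirrors the figure legend, the C grid is illustrative, and X_tr, y_tr, X_te, y_te denote a prepared train/test split as in the earlier sketches:

    from sklearn.linear_model import LogisticRegression

    for C in (0.01, 0.1, 1.0, 10.0, 100.0):  # C is the inverse regularization strength
        for tol in (1e-3, 1.0, 10.0):        # stopping tolerance
            clf = LogisticRegression(C=C, tol=tol, solver="lbfgs",
                                     max_iter=300)
            clf.fit(X_tr, y_tr)
            print(C, tol, clf.score(X_te, y_te))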
Fig. A.1. Dependence of the LR testing accuracy on the regularization parameter for the LBFGS solver with 300 epochs, for tolerance values of 0.001, 1, and 10.
In Figure A.2 we show the effect of adding a second hidden layer in the NN algorithm. The best accuracies with the additional hidden layer differ by less than 1% from those of the NN with one hidden layer (cf. Table 2).
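A minimal sketch of this test, again assuming prepared training and testing arrays:

    from sklearn.neural_network import MLPClassifier

    # add a second hidden layer to the 11-neuron network from the main text
    for n2 in (2, 5, 11, 20, 40):
        clf = MLPClassifier(hidden_layer_sizes=(11, n2), solver="lbfgs",
                            activation="tanh", max_iter=300, random_state=0)
        clf.fit(X_tr, y_tr)
        print(n2, clf.score(X_te, y_te))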
Fig. A.2. Dependence of the NN testing accuracy on the number of neurons in the second hidden layer, for 11 neurons in the first hidden layer and 300 epochs (LBFGS and Adam solvers with tanh and ReLU activations).
Finally, we summarize the features and their statistics, which we use for the probabilistic classification of sources in the 3FGL and 4FGL-DR2 catalogs, in Tables A.1 and A.2, respectively.
Feature Name              Mean    Standard Deviation    Minimum    Maximum
GLON                      ...     ...                   ...        ...
GLAT                      ...     ...                   ...        ...
ln(Energy_Flux100)        ...     ...                   ...        ...
ln(Unc_Energy_Flux100)    ...     ...                   ...        ...
ln(Signif_Curve)          ...     ...                   ...        ...
ln(Variability_Index)     ...     ...                   ...        ...
...                       ...     ...                   ...        ...
HR12                      ...     ...                   ...        ...
HR23                      ...     ...                   ...        ...
HR34                      ...     ...                   ...        ...
HR45                      ...     ...                   ...        ...

Table A.1. Statistics of the features used for the 2-class probabilistic classification of the 3FGL sources.
Feature Name              Mean    Standard Deviation    Minimum    Maximum
GLON                      ...     ...                   ...        ...
GLAT                      ...     ...                   ...        ...
ln(Energy_Flux100)        ...     ...                   ...        ...
ln(Unc_Energy_Flux100)    ...     ...                   ...        ...
ln(Pivot_Energy)          ...     ...                   ...        ...
LP_Index                  ...     ...                   ...        ...
Unc_LP_Index              ...     ...                   ...        ...
LP_beta                   ...     ...                   ...        ...
LP_SigCurv                ...     ...                   ...        ...
ln(Variability_Index)     ...     ...                   ...        ...
HR12                      ...     ...                   ...        ...
HR23                      ...     ...                   ...        ...
HR34                      ...     ...                   ...        ...
HR45                      ...     ...                   ...        ...
HR56                      ...     ...                   ...        ...
HR67                      ...     ...                   ...        ...