Machine learning methods for constructing probabilistic Fermi-LAT catalogs
A. Bhat⋆ and D. Malyshev⋆⋆

Erlangen Centre for Astroparticle Physics, Erwin-Rommel-Str. 1, Erlangen, Germany

February 16, 2021
ABSTRACT
Context.
Classification of sources is one of the most important tasks in astronomy. Sources detected in one wavelength band, for example in gamma rays, may have several possible associations in other wavebands, or there may be no plausible association candidates.
Aims.
In this work, we aim to determine a probabilistic classification of unassociated sources in the third and the fourth Fermi Large Area Telescope (LAT) point source catalogs (3FGL and 4FGL-DR2) into two classes (pulsars and active galactic nuclei (AGNs)) or three classes (pulsars, AGNs, and other sources).
Methods.
We use several machine learning (ML) methods to determine a probabilistic classification of Fermi-LAT sources. We evaluate the dependence of the results on the meta-parameters of the ML methods, such as the maximal depth of the trees in tree-based classification methods and the number of neurons in neural networks.
Results.
We determine the probabilistic classification of both associated and unassociated sources in the 3FGL and 4FGL-DR2 catalogs. We cross-check the accuracy by comparing the predicted classes of unassociated sources in 3FGL that have associations in 4FGL-DR2. We find that in the 2-class case it is important to correct for the presence of other sources among the unassociated ones in order to realistically estimate the number of pulsars and AGNs. In particular, the estimated number of pulsars in the 3FGL (4FGL-DR2) catalog is 270 (481) in the 2-class case without corrections for the other sources and 158 (214) in the 3-class case. Provided that the number of associated pulsars is 167 (271) in the 3FGL (4FGL-DR2) catalog, the number of pulsars among the unassociated sources is expected to be similar to or larger than the number of associated ones.
Key words.
Methods: statistical – Catalogs
⋆ e-mail: [email protected]
⋆⋆ on leave of absence from NRC “Kurchatov Institute” - ITEP, B. Cheremushkinskaya st. 25, Moscow, Russia 117218, e-mail: [email protected]
1. Introduction
Multi-wavelength association of astronomical sources is important for understanding their nature. Unfortunately, in many cases a firm association of sources detected at different wavelengths is not possible. For example, about one third of gamma-ray sources in Fermi Large Area Telescope (LAT) catalogs are unassociated (Abdo et al. 2010a; Nolan et al. 2012; Acero et al. 2015; Abdollahi et al. 2020). It is useful to know at least to which class the unassociated sources belong or, which is more typical, what the probabilities are for a source to belong to various classes. In this paper we use several machine learning (ML) algorithms to find a probabilistic classification of unassociated sources in the Fermi-LAT third (3FGL, Acero et al. 2015) and fourth data release 2 (4FGL-DR2, Abdollahi et al. 2020; Ballet et al. 2020) catalogs. We used versions gll_psc_v16 for 3FGL and gll_psc_v27 for 4FGL-DR2.

We will refer to catalogs where the classification of the sources is given by probabilities as probabilistic catalogs. In general, the classes may include the possibility that a source is not a real source but a fluctuation of the background, or that a source is an overlay of two sources, etc.

Probabilistic catalogs were previously introduced for optical sources (e.g., Hogg & Lang 2010; Brewer et al. 2013) and for gamma-ray sources (Daylan et al. 2017). Bayesian association probabilities were also included in the 4FGL (Abdollahi et al. 2020) and 4FGL-DR2 (Ballet et al. 2020) catalogs for faint sources. Probabilistic classification of unassociated
Fermi-LAT sources was performed by, e.g., Ackermann et al. (2012); Saz Parkinson et al. (2016); Mirabal et al. (2016); Lefaucheur & Pita (2017); Luo et al. (2020); Zhu et al. (2021), or in application to the sub-classification of blazars by Hassan et al. (2013); Doert & Errando (2014); Chiaro et al. (2016); Salvetti et al. (2017); Kovačević et al. (2019, 2020), and to the sub-classification of pulsars by Lee et al. (2012); Saz Parkinson et al. (2016). In this work, we consider the classification of gamma-ray sources into two classes (AGNs and pulsars) as well as into three classes (AGNs, pulsars, and other associated sources). We revisit the probabilistic classification of 3FGL sources and compare the results of the classification of unassociated sources with the associations in 4FGL-DR2. We also determine a probabilistic classification of the 4FGL-DR2 sources.

Catalogs of gamma-ray point sources are typically designed to have a low false detection rate. Nevertheless, 469 sources out of 3033 in the 3FGL catalog (Acero et al. 2015) have no counterparts in the 4FGL catalog (Abdollahi et al. 2020). This is much larger than the expected false detection rate in 3FGL arising from statistical fluctuations. For the majority of sources in 3FGL without counterparts in 4FGL the problem is not false detection, but rather association. For example, some sources can be detected due to deficiencies in the Galactic diffuse emission model. In this case, the statistical significance of the detection is high, but the association is wrong: the sources should be classified as a part of the Galactic diffuse emission rather than as point-like sources. Another reason could be that two (or more) point-like sources in 3FGL are associated to a single extended source in 4FGL, or a single source is resolved into two sources. Again, this is a problem of classification (or association) rather than false detection.

Another reason for the absence of a previously detected source in a new catalog is variability. In particular, flat spectrum radio quasars (FSRQs) are highly variable AGNs. If a source was active during the observation time of 3FGL but inactive afterwards, then its significance in the 4FGL can be below the detection threshold. This problem is connected with the choice of a hard detection threshold of TS = 25 for the 3FGL and 4FGL catalogs. Selecting a lower detection threshold could help to keep the variable sources inside the catalog, but it would not solve the problem, since variable sources near the lower threshold can also disappear in the new catalog. Moreover, a lower threshold would lead to more false detections due to fluctuations of the background. Thus, on the one hand, a lower threshold can be useful in studies where a more complete list of sources is desirable while a higher false detection rate is admissible. On the other hand, a lower threshold can be problematic for studies where a clean sample is necessary. The problem of the detection threshold selection can be ameliorated with the development of a probabilistic catalog. In such a catalog, each point-like object detected above a certain relatively low confidence level is probabilistically classified into classes, which include a statistical fluctuations class. At low confidence, the probability for a source to come from a background fluctuation is high. This probability decreases as the significance of the source increases. Apart from the statistical fluctuations class, the classes can include various types of Galactic and extragalactic sources, diffuse emission deficiencies, and extended sources. A user of such a catalog will have the freedom to choose the probability threshold for the class that he or she is interested in. In this paper we make a first step in this direction by providing a probabilistic classification of
Fermi-LAT sources into two or three classes. We also show how the probabilistic catalogs can be used for population studies of sources, e.g., as a function of their flux or position on the sky, where one includes not only associated sources but also unassociated ones according to their class probabilities.
The paper is organized as follows. In Section 2 we discuss general questions of constructing the probabilistic catalogs and the choice of the ML methods. In Section 3 we construct the classification algorithms using the associated sources in the 3FGL catalog for training. We consider several aspects: 1) feature selection, 2) training of the algorithms and selection of the meta-parameters, 3) oversampling of the datasets in order to have equal numbers of pulsars and AGNs in training (there are many more AGNs observed than pulsars). In Section 4 we apply the classification algorithms determined in Section 3 to the classification of 3FGL and 4FGL-DR2 sources. We compare the predictions for the unassociated sources in 3FGL with the associations in 4FGL-DR2. We also retrain the algorithms using associated 4FGL-DR2 sources and construct a probabilistic catalog based on 4FGL-DR2. In Section 5 we classify sources in the 3FGL and 4FGL-DR2 catalogs into three classes (AGNs, pulsars, and other sources). In Section 6 we show applications of the probabilistic catalogs for predicting the number of pulsars, AGNs, and other sources among the unassociated sources and in the construction of the source counts as a function of their flux, N(S), and as a function of Galactic latitude and longitude, N(b) and N(ℓ). We compare the N(S), N(b), and N(ℓ) distributions for associated and unassociated sources in the 3FGL and 4FGL-DR2 catalogs. In Section 7 we present conclusions.
2. Choice of methods
The first choice one has to make in constructing a probabilistic catalog is the input data and the machine learning methods. For the input data we take the associated point sources (PS) in the 3FGL catalog, which we split into training and testing subsets. We consider four machine learning algorithms: random forests (RF, Ho 1998; Breiman 2001), boosted decision trees (BDT, Friedman 2001a), logistic regression (LR, Cox 1958), and neural networks (NN, Hopfield 1982). Although the performance of the algorithms on testing data is slightly different, we report the classification probabilities for all four algorithms. The difference among the predictions will serve as a measure of the modeling uncertainty related to the choice of the classification algorithm.
One of the simplest and most transparent algorithms for classification is decision trees. In this algorithm, at each step the sample is split into two subsets using one of the input features. The choice of the feature and of the separating value are determined by minimizing an objective function, such as the misclassification error, Gini index, or cross-entropy. This method is very intuitive, since at each step the results can be described in words. For example, at the first step, the sources can be split into mostly Galactic and extragalactic ones by a cut on the Galactic latitude. At the next step, the high-latitude sources can be further subsplit into millisecond pulsars and other sources by a cut on the spectral index around 1 GeV (pulsars have a hard spectrum below a few GeV), etc. One of the problems with decision trees is either overfitting or bias: if a tree is too deep, then it will pick up particular cases of the training sample, resulting in overfitting, while too shallow trees cannot describe the data well, which can lead to bias. As a result, one needs to be very careful in selecting the depth of the tree. This problem can be avoided if a random subset of features is used to find a division at each node. This is the basis of the RF algorithm, where the final classification is given by an average of several trees with random subsets of features used at each node. Another problem with simple trees is that they can miss the classification of some subsets of data. In BDT algorithms, the final classification is given by a collection of trees, where each new tree is created by increasing the weights of misclassified samples of the previous step. Finally, simple trees predict classes for the data samples, while we would like to have probabilities of classes (also known as soft classification). RF and BDT algorithms, by virtue of averaging, provide probabilities. As a result, we will use RF and BDT algorithms rather than simple decision trees in this paper.

Tree-based algorithms, even after averaging in the RF and BDT methods, have sharp edges among domains with different probabilities. In the LR algorithm, the probabilities of classes are by construction smooth functions of the features. In particular, for two-class classification the probability of class 1, given the set of features $x$, is modeled by the sigmoid (logit) function

p(x) = \frac{e^{m(x)}}{1 + e^{m(x)}} .  (1)

The probability of class 0 is then modeled as $p_0(x) = 1 - p(x)$. If $m(x)$ is a linear function of the features, then the boundary between the domains, defined, e.g., as $p(x) = 0.5$, will be linear and given by $m(x) = 0$. More complicated boundaries can be modeled by taking non-linear functions $m(x)$. The unknown parameters of the function $m(x)$ are determined by maximizing the log likelihood of the model given the known classes of the data in the training sample. A useful feature of the LR method is that it, by construction, provides probabilities of classes with smooth transitions among the domains of different classes. A limitation is that the form of the probability function is fixed to the sigmoid function in Equation (1).

We notice that if $m(x)$ is a linear function of the features $x$, then the LR model is obtained by an application of the sigmoid function to a linear combination of the input features. This is in fact a single-layer perceptron, or a NN, with several input nodes (each node corresponds to a feature) and one output node, which corresponds to $p(x)$, but without any hidden layers.
The output value is obtained by a non-linear transformation (sigmoid) of a linear combination of the features. A neural network with several hidden layers is obtained by a sequence of non-linear transformations of linear combinations of features. In particular, the values in the first hidden layer are obtained by a non-linear transformation of linear combinations of the input features. Then the values in the second hidden layer are obtained by a non-linear transformation of linear combinations of the values in the first hidden layer, etc. In the context of neural networks, the non-linear transformations are called activation functions. If the activation function for the output layer is the sigmoid, then the output value (values) can be interpreted as probabilities.
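To make the relation between LR and a single-layer network concrete, the following minimal sketch (numpy only; all weights and feature values are illustrative placeholders, not fitted parameters) evaluates Equation (1) and a one-hidden-layer forward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_prob(x, f):
    # p(x) = e^{m(x)} / (1 + e^{m(x)}) with a linear m(x) = f_0 + f . x
    m = f[0] + np.dot(f[1:], x)
    return sigmoid(m)

def one_hidden_layer_prob(x, W, w0, v, v0, activation=np.tanh):
    # hidden layer: non-linear transformation of linear combinations of inputs
    y = activation(W @ x + w0)
    # output layer: sigmoid, so the result can be read as a class probability
    return sigmoid(v @ y + v0)

rng = np.random.default_rng(0)
x = rng.normal(size=11)                  # 11 input features, as in Section 3
print(logistic_regression_prob(x, rng.normal(size=12)))
print(one_hidden_layer_prob(x, rng.normal(size=(5, 11)),
                            rng.normal(size=5), rng.normal(size=5), 0.1))
```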
3. Construction of probabilistic catalogs
As an example of the construction of a probabilistic catalog, we use the 3FGL catalog. For training and testing the methods, we use sources which have associations and no missing values in the catalog. In this section we perform a two-class classification to separate PS into pulsars and AGNs. Thus for training and testing, we subselect the sources which are associated with pulsars and AGNs (the three-class classification into pulsars, AGNs, and other sources is discussed in Section 5). After training the algorithms, we test the performance with the test sources and predict the classes of unassociated sources that have all features present in the catalog table. The general workflow has the following steps:

1. Select data for training and testing.
2. Optimize the algorithms using the training datasets. We select meta-parameters of the algorithms by optimizing the accuracy of the classification and test for overfitting using the test datasets. In order to get stable results, we repeat the separation of the data into training and testing samples 100 times and average the accuracy.
3. Make predictions for the unassociated point sources of the 3FGL. We also apply the classification to associated sources, which we use for consistency checks.

As a result of the analysis in this section, we select meta-parameters for the four ML algorithms, which we use in the following section for the construction of probabilistic catalogs based on the Fermi-LAT 3FGL and 4FGL-DR2 catalogs.
We restrict the analysis to associated and unassociated sources without any missing or statistically insignificant values (e.g., none or infinity). We use the associated sources which were classified as either AGNs (classification labels in the 3FGL catalog: agn, FSRQ, fsrq, BLL, bll, BCU, bcu, RDG, rdg, NLSY1, nlsy1, ssrq, and sey) or pulsars (classification labels in 3FGL: PSR, psr), which results in a list of 1905 sources.

There are several tens of features of point sources quoted in the catalog, such as the position, the photon and energy fluxes integrated in different energy bands, spectral parameters, and the variability index, as well as the corresponding uncertainties. We took the main features, including those discussed in Saz Parkinson et al. (2016), and plotted their Pearson correlation coefficients. A graphical representation of the correlation table is shown in Figure 1. In the following, if two features are strongly correlated or anti-correlated, then we keep only one of the features for classification.
Fig. 1. The correlation matrix of features for 3FGL associated sources. The “500MeV_Index” is defined in Equation (2); all other features are taken directly from the 3FGL catalog.
The spectral index is one of the important characteristics of sources. Unfortunately, in the 3FGL catalog the definition of the spectral index is different for associated and unassociated sources. In particular, the gamma-ray flux of pulsars is described by a power law with a (super)exponential cutoff $\propto E^{-\Gamma} e^{-(E/E_c)^b}$, and the “Spectral_Index” feature in the catalog is the parameter $\Gamma$. The gamma-ray flux of unassociated sources with significant curvature is represented by a log-parabola function $\propto (E/E_0)^{-\alpha - \beta \ln(E/E_0)}$, where the “Spectral_Index” feature is the parameter $\alpha$, i.e., the tilt in the spectrum at the pivot energy $E_0$ (which also varies for different sources). Since the “Spectral_Index” feature has different definitions for associated pulsars and for possible pulsars among the unassociated sources, its use for training the algorithms to separate pulsars from AGNs is problematic. If one fits all spectra of sources in the catalog by a power-law function, then the corresponding indices of the power laws are represented by the “PowerLaw_Index” feature in the catalog. This feature is defined uniformly for all associated and unassociated sources, i.e., it is safe to use for training. Unfortunately, the power-law function is not a good description of the gamma-ray flux from pulsars. For the classification of the 3FGL sources we have constructed a new feature: the index at 500 MeV (denoted in the following as “500MeV_Index”), defined as minus the derivative of the log flux:

n(500\,\mathrm{MeV}) = - \left. \frac{d \ln F}{d \ln E} \right|_{E = 500\,\mathrm{MeV}}  (2)

For the log parabola and for the power law with a (super)exponential cutoff it is, respectively,

n(500\,\mathrm{MeV}) = \alpha + 2 \beta \ln(500\,\mathrm{MeV} / E_0),  (3)

n(500\,\mathrm{MeV}) = \Gamma + b \, (500\,\mathrm{MeV} / E_c)^b.  (4)

This feature has a more uniform definition for all sources in the 3FGL catalog than the Spectral_Index. It also has a better separating power than the PowerLaw_Index, given that pulsars typically have harder spectra at energies below 1 GeV than AGNs.

Taking into account the correlations among the features and the above discussion of the spectral index definition, we have selected the following eleven features for the classification of the 3FGL sources: Galactic latitude (GLAT), Galactic longitude (GLON), ln(Energy_Flux100), ln(Unc_Energy_Flux100), 500MeV_Index, ln(Signif_Curve), ln(Variability_Index), and the four hardness ratios $hr_{ij} = (EF_j - EF_i)/(EF_j + EF_i)$, where $EF_i$ is the energy flux in bin $i$. We have added the hardness ratios in order to allow the algorithms to “construct” their own features from the raw data rather than to use derived features, such as the spectral index or the integrated flux. The table of features and their statistics can be found in Appendix A.

The number of tunable parameters in the classification algorithms is not fixed a priori. Moreover, there is a certain freedom in the choice of the architecture of the algorithms, such as the number of hidden layers and the number of neurons in neural networks. In general, one starts with a simple model and increases the complexity (the number of tunable parameters) until the model can describe the data well but does not overfit it. The overfitting is tested by splitting the input data into training and testing samples. The training sample is used for optimizing the parameters, while the test sample is used to check that the model is not overtrained (for overtrained models the accuracy on the test sample is significantly worse than the performance on the training sample). We split the data randomly into 70% training and 30% testing samples.
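As an illustration, the derived features can be computed along the following lines. This is a minimal sketch assuming the catalog row is available as a dictionary-like object; the column names “SpectrumType”, “Spectral_Index”, “beta”, “Pivot_Energy”, “Cutoff”, and “Exp_Index” are assumptions that should be checked against the catalog FITS table, and the five-band energy fluxes are passed explicitly.

```python
import numpy as np

E_REF = 500.0  # MeV

def index_500mev(row):
    """n(500 MeV) = -dlnF/dlnE at E = 500 MeV, Equations (2)-(4)."""
    if row["SpectrumType"] == "LogParabola":
        # Eq. (3): alpha + 2 beta ln(500 MeV / E0), with E0 the pivot energy
        return (row["Spectral_Index"]
                + 2.0 * row["beta"] * np.log(E_REF / row["Pivot_Energy"]))
    # Eq. (4): Gamma + b (500 MeV / E_c)^b for a (super)exponential cutoff;
    # for a pure power law this branch reduces to Gamma if b = 0
    b = row["Exp_Index"]
    return row["Spectral_Index"] + b * (E_REF / row["Cutoff"]) ** b

def hardness_ratios(energy_fluxes):
    """hr_ij = (EF_j - EF_i) / (EF_j + EF_i) for adjacent energy bands."""
    ef = np.asarray(energy_fluxes, dtype=float)
    return (ef[1:] - ef[:-1]) / (ef[1:] + ef[:-1])

row = {"SpectrumType": "LogParabola", "Spectral_Index": 2.2,
       "beta": 0.1, "Pivot_Energy": 1000.0}
print(index_500mev(row), hardness_ratios([1.0, 1.5, 1.2, 0.8, 0.3]))
```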
The two main parameters characterizing the RF algorithm are the number of trees and the maximum depth allowed in the trees. We use the Gini index as the objective function for the optimization of parameters (the split values of features in the nodes).
Fig. 2. Test score (accuracy) of the RF classification as a function of the number of trees and the maximal depth of trees.
Figure 2 shows the accuracy on the test sample as a function of the maximum depth and the number of trees. The results for each point are averaged over 100 realizations of the split into training and testing samples. We notice that the accuracy does not decrease as the maximal depth of the trees increases, i.e., there is no overfitting as the complexity of the model increases with increasing maximum depth. This is due to the random choice of a subset of features at each node (the maximal number of allowed features is the square root of the total number of features).
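A sketch of this scan with scikit-learn could look as follows, where X and y are assumed to hold the 11 features and binary labels (1 = PSR) of the associated sources; the averaging over random splits mirrors the procedure described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_test_score(X, y, n_trees, max_depth, n_splits=100):
    """Mean and std of the test accuracy over random 70%/30% splits."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=seed)
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_depth=max_depth,
                                     criterion="gini", random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return np.mean(scores), np.std(scores)
```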
Feature                 | 20 trees, depth 2 | 50 trees, depth 6
GLON                    | 0.001             | 0.014
GLAT                    | 0.009             | 0.017
ln(Energy_Flux100)      | 0.073             | 0.071
ln(Unc_Energy_Flux100)  | …                 | …
ln(Signif_Curve)        | …                 | …
ln(Variability_Index)   | 0.051             | 0.098
500MeV_Index            | 0.065             | 0.065
HR12                    | 0.047             | 0.052
HR23                    | 0.032             | 0.062
HR34                    | 0.016             | 0.025
HR45                    | …                 | …

Table 1. Feature importances for two RF algorithms.

We have also used sin(GLAT) in place of GLAT to check that this is not due to scaling, i.e., to the large range of values of GLAT, but the importance is similar to that of GLAT itself. We further discuss the dependence on GLAT in Section 6.2, where we calculate the latitude and longitude profiles of the associated and unassociated source counts.

In order to illustrate the separation of PS into AGNs and pulsars, we retrain the RF algorithm using only two features, the log of the curvature significance and the log of the variability index, and plot the resulting probabilities of classes in Figure 3 for the model with 50 trees and a maximum depth of 6. The probabilities are averaged over 100 splits into training and testing samples. It is important to note that in this plot the model is trained on only two features. Nevertheless, a good testing accuracy of 97% is reached, which is similar to the accuracy of the RF classification with all 11 features. For the final classification with RF, we use 11 features and average over 1000 splits into training and testing samples.
Fig. 3. RF classification domains (50 trees, maximum depth 6, testing score 0.97) showing class probabilities for training with two features, Ln(Variability_Index) and Ln(Significant_Curvature), averaged over 100 random splits into training and testing samples. The color scale describes the probability for a source to be a pulsar.
The meta-parameters for the BDT algorithms are similar to those of the RF algorithms: the number of trees and the maximal depth. We used the Gradient Boosting algorithm for the construction of the BDT (Friedman 2001b). The classification is performed by a weighted average of trees, where the trees are constructed recursively in order to better address the misclassifications of the previous step. The dependence of the accuracy on the tree depth is shown in Figure 4. Unlike for the RF, which is also an ensemble-based method, the testing accuracy drops for maximal depths larger than 7.
Fig. 4. Dependence of the BDT accuracy on the maximum depth and the number of trees (20, 50, 100, 200).
The classification domains in the case of two features for 20 trees and a maximum depth of 2 are presented in Figure 5. For the classification we will use BDT with 100 trees and a maximum depth of 2.
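A minimal sketch of the adopted BDT configuration, using scikit-learn's gradient boosting (X_train and y_train are assumed to be the training features and labels):

```python
from sklearn.ensemble import GradientBoostingClassifier

# 100 trees with a maximum depth of 2, as adopted above
bdt = GradientBoostingClassifier(n_estimators=100, max_depth=2)
# bdt.fit(X_train, y_train)
# bdt.predict_proba(X)[:, 1] then gives the soft (PSR) class probability
```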
Fig. 5. Classification domains for BDT (20 trees, maximum depth 2, testing score 0.97) for training with two features, averaged over 100 splits into training and testing samples.
In the case of the NN, the number of free parameters depends on the number of hidden layers and on the number of neurons in the hidden layers. The final model accuracy also depends on the number of epochs that the network is allowed to be trained for and on the optimization algorithm.
Fig. 6. The NN architecture (input, hidden, and output layers) that we use in the construction of the probabilistic catalogs. The activation function in the output layer is the sigmoid $S(x) = e^x / (1 + e^x)$.

The general architecture of the NN that we use in this paper is shown in Figure 6. It is a fully connected NN with 11 input nodes (shown by red circles with input features $x_i$), one hidden layer (shown by blue circles), and an output layer (shown by the green circle). The hidden layer consists of several nodes with values $y_j$. For the activation function at the hidden layer we use either the hyperbolic tangent (tanh, shown on the plot) or the rectified linear unit (relu). The activation function for the output layer is the sigmoid, which we use to make sure that the output value can be interpreted as a class probability. The unknown parameters are the weights of the features in the hidden layer $w_{ji}$ and in the output layer $v_j$, including the offsets. The unknown parameters are optimized by minimizing a loss function, which we choose to be the cross-entropy

-\log L = -\sum_i \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right),

where $y_i = 0, 1$ are the true labels of the sources and $p_i$ are the predicted class probabilities. This is the same loss function as in the LR case below. We have also used a NN with two hidden layers, but the accuracy was similar to that of networks with one hidden layer (Appendix A). For the final classification model, we have chosen to use one hidden layer.
Fig. 7. Dependence of accuracy on the number of neurons for different NN models (300 epochs; LBFGS or Adam solver with tanh or relu activation).
The dependence of the testing accuracy on the number of neurons in the hidden layer, on the activation function, and on the optimization algorithm is shown in Figure 7. We compare two activation functions at the hidden layer (tanh and relu) and two optimization algorithms: limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS, Liu & Nocedal 1989) and the stochastic gradient descent algorithm Adam (Kingma & Ba 2014). We use 300 epochs for training. Around 11 neurons in the hidden layer appears to be an optimal choice, since increasing the number of neurons leads to no significant increase in accuracy for any of the models. The dependence on the number of epochs (the number of iterations in fitting) is presented in Figure 8. The accuracy increases with the number of epochs and saturates at around 200 epochs for LBFGS and 300 for Adam.
Fig. 8. Dependence of the testing accuracy on the number of epochs in training for different solvers and activation functions (11 neurons).
We illustrate the classification domains for a NN with two input features in Figure 9. In this case we also use only two neurons in the hidden layer. One can see that the separation boundary is smoother compared to the RF domains in Figure 3 or the BDT domains in Figure 5. For our final model we chose one hidden layer with eleven neurons, 300 training epochs, the LBFGS solver, and the tanh activation function at the hidden layer.
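The final NN configuration corresponds, for instance, to scikit-learn's MLPClassifier with the settings below; this is a sketch, with MLPClassifier minimizing the cross-entropy loss written above and predict_proba applying the output sigmoid.

```python
from sklearn.neural_network import MLPClassifier

# one hidden layer with 11 tanh neurons, LBFGS solver, 300 epochs
nn = MLPClassifier(hidden_layer_sizes=(11,), activation="tanh",
                   solver="lbfgs", max_iter=300)
# nn.fit(X_train, y_train); nn.predict_proba(X)[:, 1] is the PSR probability
```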
As we have discussed in Section 2.2, the probability to belong to class 1 or 0 in LR is represented by the sigmoid function $p_1(x) = 1 - p_0(x) = e^{m(x)}/(1 + e^{m(x)})$ (see Eq. (1)), where $m(x)$ is a function of the input features $x$. The complexity of the model is given by the number of parameters in $m(x)$. We have considered two cases for $m(x)$: a linear and a quadratic function of the input features $x$. A quadratic $m(x)$ resulted in a similar accuracy as a linear $m(x)$. Consequently, we have restricted our attention to linear functions $m(x) = \sum_{k=1}^{11} f_k x_k$. In Figure 10 we show the accuracy of the LR method as a function of the number of iterations for different solvers: LBFGS (Liu & Nocedal 1989), Stochastic Average Gradient (SAG, Schmidt et al. 2017), SAGA (a variant of SAG, Defazio et al. 2014), and liblinear (a special solver for LR and support vector machine classifications, Fan et al. 2008).
Fig. 9. NN classification domains for 2 input features (300 epochs, LBFGS solver, testing score 0.97), averaged over 100 random splits into training and testing samples. We use 2 neurons in the hidden layer, the tanh activation function, and the LBFGS solver.

As one can see from Figure 10, LBFGS and liblinear outperform the other two solvers and converge much faster. In order to illustrate the probability domains in LR, we show the classification with two features (LBFGS solver, 200 iterations) in Figure 11. The domains look similar to the domains in the NN case (Figure 9). For the final classification we will use the LBFGS solver with 200 iterations.
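In scikit-learn terms, the adopted LR model is a sketch like the following; the other solvers of Figure 10 are obtained by changing the solver argument.

```python
from sklearn.linear_model import LogisticRegression

# linear m(x) with the LBFGS solver and 200 iterations, as adopted above
lr = LogisticRegression(solver="lbfgs", max_iter=200)
# lr.fit(X_train, y_train); lr.predict_proba(X)[:, 1] gives the smooth
# sigmoid probability of Equation (1)
```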
Fig. 10. Dependence of the LR testing accuracy on the number of iterations for different solvers (LBFGS, liblinear, SAG, SAGA).
Fermi-LAT catalogs have many more AGNs than pulsars, i.e., the datasets are imbalanced. In the previous subsections we have optimized the overall accuracy. In this case, the algorithms try to identify AGNs rather than pulsars, since this gives a better accuracy. As a result, in the region of parameter space where both pulsars and AGNs are present, the algorithms will give a higher probability for a source to be an AGN. The problem of the classification of imbalanced datasets can be quantitatively described in terms of precision and recall.
Fig. 11. Classification domains for LR with two features (LBFGS solver, 200 iterations, testing score 0.97), averaged over 100 random splits into training and testing samples.

If we denote by TP the number of true positives (pulsars classified as pulsars), by FP the number of false positives (AGNs classified as pulsars), and by FN the number of false negatives (pulsars classified as AGNs), then precision = TP / (TP + FP) is a measure of how clean the prediction is, while recall = TP / (TP + FN) is a measure of how well the algorithm can detect pulsars, i.e., how complete the list of predicted pulsars is. If we reduce the pulsar domain by attributing uncertain sources predominantly to AGNs, then for pulsars the precision will increase, but the recall will decrease.
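The following minimal sketch shows the precision and recall computation, together with the intersection (“all algorithms”) and union (“any algorithm”) of the pulsar predictions; the labels and predictions here are toy placeholders.

```python
import numpy as np

def precision_recall(y_true, is_psr):
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    tp = np.sum(is_psr & (y_true == 1))
    return tp / max(is_psr.sum(), 1), tp / max((y_true == 1).sum(), 1)

# toy example: true labels and PSR predictions from four algorithms
y_true = np.array([1, 1, 0, 0, 1])
preds = [np.array([1, 0, 0, 0, 1], bool) for _ in range(4)]
all_psr = np.logical_and.reduce(preds)  # intersection of pulsar domains
any_psr = np.logical_or.reduce(preds)   # union of pulsar domains
print(precision_recall(y_true, all_psr), precision_recall(y_true, any_psr))
```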
Fig. 12.
Precision and recall for pulsars using all-algorithm and any-algorithm classification for unweighted training data and all-algorithm classification for oversampling of pulsars in the training data. For details see Section 3.3.
In Figure 12 we show the precision and recall for the detection of pulsars using the algorithms from the previous sections. In particular, in the first two lines (solid blue with squares and dashed orange with right triangles) a source is categorized as a pulsar if all four algorithms classify it as a pulsar, while in lines 3 and 4 (solid green with diamonds and dashed red with down triangles) a source is attributed to the PSR class if any of the algorithms classifies it as a pulsar. It is clear that for lines 1 and 2 the pulsar domain is smaller than for lines 3 and 4, since in the former case the domain is the intersection of the domains of the individual algorithms, while in the latter it is their union. For the all-algorithms classification the precision is 100% for most latitudes, while the recall is between 40% and 80%, i.e., the list of pulsars is generally clean but incomplete. In the case of the any-algorithm classification, the recall is increased by about 20% for most latitudes compared to the all-algorithms classification, but the precision drops by up to 20% at some latitudes, i.e., the completeness improves at the expense of the cleanliness of the sample. As an alternative to the any-algorithm classification, one can give larger weights to pulsars or oversample pulsars in the training process, i.e., use the same source several times, so that the numbers of pulsars and AGNs in training are the same. Given that in some applications it is beneficial to have a list of pulsar candidates among the unassociated sources that is as complete as possible, we have retrained the algorithms using oversampling with the same meta-parameters as in the previous sections; a sketch of the procedure is shown after Figure 13.

In general, one can either under- or oversample a dataset. Undersampling would reduce the number of AGNs to match the number of pulsars. However, since the total number of sources is not very high, we have chosen oversampling. For training with oversampling, we randomly copy existing pulsars and add them to the dataset until the numbers of pulsars and AGNs are the same. Although the pulsars in the training dataset are redundant, they help to increase the weight of pulsars in the classification model. We illustrate the oversampling procedure in the top panel of Figure 13: the number of times a source appears in training is shown by adding markers with shifts to the right and above the original position of the source (note that the shift is introduced for presentation only; the parameters of the sources are exactly the same as for the original source). In the bottom panel of Figure 13 we repeat Figure 11 in order to compare the classification domains with and without oversampling. One can see that the pulsar domain in the top panel is larger than the pulsar domain in the bottom panel. As a result, in the top panel more pulsars are classified as pulsars, but also more AGNs are falsely classified as pulsars in the intersection region. Since the overall number of AGNs is larger than the number of pulsars, the testing accuracy with oversampling is smaller than that without oversampling.

The results of training with oversampling are presented in Figure 12, lines 5 and 6 (solid purple with circles and dashed brown with stars). These lines show the precision and recall when a source is categorized as a pulsar if all four algorithms trained with oversampling classify it as a pulsar. The precision and recall in this case are similar to the any-algorithm classification for the training without oversampling.
Fig. 13. Top panel: LR classification domains (LBFGS solver, 200 iterations, testing score 0.92) showing class probabilities for training with oversampling. The oversampling is illustrated by repeating the pulsar markers with a shift: the number of markers is equal to the number of times the pulsar appears in training. Bottom panel: we repeat Figure 11 (testing score 0.97) for convenience of comparison with the oversampled training in the top panel. In both panels the domains are obtained by averaging over 100 random splits into training and testing samples.
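A sketch of the oversampling step, where X and y are assumed to be the training features and labels (1 = PSR, 0 = AGN): pulsars are duplicated at random until the two classes are balanced.

```python
import numpy as np

def oversample(X, y, rng=np.random.default_rng(0)):
    """Duplicate random pulsars until the PSR and AGN counts are equal."""
    n_agn, n_psr = np.sum(y == 0), np.sum(y == 1)
    psr_idx = np.where(y == 1)[0]
    extra = rng.choice(psr_idx, size=n_agn - n_psr, replace=True)
    X_os = np.vstack([X, X[extra]])
    y_os = np.concatenate([y, y[extra]])
    return X_os, y_os
```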
4. Probabilistic catalogs based on the 3FGL and 4FGL catalogs
In this section we use the ML algorithms optimized in the previous section to construct a probabilistic classification of sources in the 3FGL and 4FGL catalogs.
We use the following four algorithms for the classification of sources: RF with 50 trees and a maximal depth of 6, BDT with 100 trees and a maximal depth of 2, NN with 11 neurons, the LBFGS solver, and 300 epochs, and LR with the LBFGS solver and 200 iterations. For training we use the pulsars and AGNs from the 3FGL catalog. In addition to the original datasets, we perform oversampling of pulsars in order to balance the numbers of pulsars and AGNs. As a result, we have 8 classification methods: 4 algorithms trained with and without oversampling.
Algorithm | Parameters                    | Testing Accuracy | Std. Dev. | Comparison with 4FGL-DR2 Accuracy
RF        | 50 trees, max depth 6         | 97.37 | 0.60 | 91.06
RF_O      |                               | 97.90 | 0.50 | 89.40
BDT       | 100 trees, max depth 2        | 97.65 | 0.54 | 90.40
BDT_O     |                               | 97.79 | 0.51 | 91.72
NN        | 300 epochs, 11 neurons, LBFGS | 97.29 | 0.97 | 90.07
NN_O      |                               | 94.31 | 5.13 | 87.09
LR        | 200 iterations, LBFGS solver  | 97.63 | 0.54 | 90.40
LR_O      |                               | 93.68 | 0.99 | 85.10

Table 2. Testing accuracy of the 4 selected algorithms for the classification of 3FGL sources and comparison with associations in the 4FGL-DR2 catalog. “_O” denotes training with oversampling.

Fig. 14. Comparison of class predictions for unassociated 3FGL sources with classes in 4FGL-DR2, in the plane of Ln(Variability_Index) and Ln(Significant_Curvature). For more details see Section 4.1.
The selected algorithms are summarized in Table 2, where oversampling is denoted by “_O”. The average testing accuracy is computed by performing 1000 times the 70%/30% split into training and testing samples and averaging over the accuracies computed for the testing samples. In addition, we look at sources which are unassociated in 3FGL but have either a pulsar or an AGN association in 4FGL-DR2: there are 302 such sources. The accuracy of our prediction for the four selected algorithms with and without oversampling, taking the 4FGL-DR2 classes as the true values, is reported in the column “Comparison with 4FGL-DR2 Accuracy”. The correct classifications and misclassifications for the 302 sources with PSR or AGN associations in 4FGL-DR2 are also presented in Figure 14. The class at the beginning of the label name corresponds to the association in 4FGL-DR2, while the second half of the label corresponds to the classification of the unassociated sources in 3FGL. For example, “PSRs classified only as PSRs” shows sources which have a PSR association in 4FGL-DR2 and for which all eight methods classified the corresponding unassociated source in 3FGL as a pulsar. “PSRs classified as PSRs or AGNs” labels sources with PSR associations in 4FGL-DR2 for which the corresponding unassociated sources in 3FGL have both PSR and AGN classifications by different ML methods. The unassociated sources are classified as PSRs or AGNs if the corresponding probability is larger than 0.5. We notice that misclassified or partially misclassified sources in Figure 14 typically lie on the boundary between the two classes or even inside the opposite class. Many of these sources also have flags in the 3FGL catalog, such as a potential problem with the background diffuse emission model at the location of the source, which can lead to a poor reconstruction of the source spectrum and, consequently, to a misclassification of the source.

As a result of the classification with the eight ML methods, we created a probabilistic catalog based on the 3FGL sources without missing values. We train on 70% of the sources associated with pulsars or AGNs and save the probability values for the testing sources, for the sources which are not classified as pulsars or AGNs, and for the unassociated sources. We repeat the splitting and training 1000 times and report the sample average and standard deviation of the classification probabilities, i.e., we average over 1000 values for unassociated sources and sources not classified as AGNs or pulsars, while the average for AGNs and pulsars is over the number of times the sources appear in the testing sample, which is 300 on average. We have also subselected the 302 unassociated 3FGL sources which have PSR or AGN associations in 4FGL-DR2 and saved them, for convenience of comparison, as a separate file.

In the probabilistic catalogs we add columns with the corresponding probabilities for each algorithm and each class, i.e., given that there are 8 methods (including oversampling) and 2 classes, we add 16 columns: 8 for unweighted and 8 for oversampled training data. The columns with “_O” represent the oversampled probabilities. We also add 16 columns for the standard deviations of the probabilities. Although the class probabilities and standard deviations for each algorithm are not independent (the probabilities add up to 1 and the standard deviations are equal for the AGN and PSR classes), we keep the corresponding columns in view of the generalization to multi-class classification (e.g., the 3-class classification in Section 5).
There are thirteen sources with missing values in the 3FGL catalog (2 unassociated, 5 AGNs, 1 pulsar, and 5 “other” sources), which we save in a separate file “3FGL_sources_with_missing_values.csv” for reference. In particular, all of these sources have a significant curvature value of 0.
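Schematically, the reported probabilities for the unassociated sources are assembled as in the sketch below, where make_classifier stands for any of the eight methods and X_assoc, y_assoc, and X_unassoc are assumed inputs (for associated sources the average is instead taken over their appearances in the testing samples).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def average_probabilities(make_classifier, X_assoc, y_assoc, X_unassoc,
                          n_splits=1000):
    """Mean and std of the PSR probability over random 70%/30% splits."""
    probs = []
    for seed in range(n_splits):
        X_tr, _, y_tr, _ = train_test_split(X_assoc, y_assoc, test_size=0.3,
                                            random_state=seed)
        clf = make_classifier().fit(X_tr, y_tr)
        probs.append(clf.predict_proba(X_unassoc)[:, 1])
    probs = np.array(probs)
    return probs.mean(axis=0), probs.std(axis=0)
```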
Table 4 shows an example of the probabilistic catalog for a few unassociated 3FGL sources. Notice that the first source is classified as a pulsar by BDT and as an AGN by the RF, LR, and NN algorithms; it is an example of a source with mixed classification. Out of 1008 unassociated sources in 3FGL, 111 are classified as pulsars by all eight methods, 597 are classified as AGNs, and 300 have mixed classifications. Out of the 111 sources classified as pulsars, 6 sources have counterparts in the Parkes survey (Camilo et al. 2015) within 2 arc minutes (see Table 5).
Catalog   | Classification | AGN | PSR | OTHER | MIXED
3FGL      | 2-class        | 597 | 111 | –     | 300
          | 2-class corr   | 578 | 97  | 51    | 279
          | 3-class        | 585 | 53  | 69    | 301
4FGL-DR2  | 2-class        | 872 | 162 | –     | 624
          | 2-class corr   | 821 | 135 | 139   | 563
          | 3-class        | 736 | 64  | 271   | 587

Table 3. Expected numbers of AGNs, pulsars, and other sources, as well as sources with mixed classifications, among the unassociated 3FGL and 4FGL-DR2 sources derived with the 2-class (Section 4) and 3-class (Section 5) classification. The “2-class corr” row shows the correction of the 2-class classification prediction due to the presence of OTHER sources among the unassociated ones (see Section 4.1 for details).
We summarize the results of the classification of unassociated sources in Table 3 in the “2-class” row. The “AGN” column shows the number of unassociated sources for which all eight methods from Table 2 give a probability for the source to be an AGN above 50%. Similarly, the “PSR” column shows the number of unassociated sources for which all algorithms predict the source to be more likely a pulsar. The “MIXED” column shows the number of sources with mixed classification, i.e., some algorithms predict that the source is more likely an AGN while the other algorithms predict that it is more likely a pulsar. We also add the “OTHER” column in order to compare the results with the 3-class classification in Section 5. Since there is no “OTHER” class in the 2-class classification, the corresponding entry is empty. In the “2-class corr” row we show a possible correction of the number of pulsars and AGNs due to the presence of other sources. Here we assume that the fraction of AGN-like and pulsar-like sources among the other sources is the same for associated and for unassociated sources. In particular, if we denote by $N_{\rm AGN}$ the number of unassociated sources with AGN-like probabilistic classification, by $N^{\rm ass\,OTHER}_{\rm AGN}$ the number of sources with AGN-like classification among the associated OTHER sources, and by $N^{\rm ass}$ ($N^{\rm unass}$) the total number of associated (unassociated) sources, then the number of AGN-like sources among the unassociated ones, corrected for the presence of OTHER sources, can be estimated as

N^{\rm corr}_{\rm AGN} = N_{\rm AGN} - N^{\rm ass\,OTHER}_{\rm AGN} \frac{N^{\rm unass}}{N^{\rm ass}} .  (5)

Analogous corrections are applied for the numbers of unassociated sources with PSR and with mixed classifications. If we denote by $N^{\rm ass\,OTHER}$ the total number of associated other sources, then the estimated number of OTHER sources among the unassociated ones is

N^{\rm unass}_{\rm OTHER} = N^{\rm ass\,OTHER} \frac{N^{\rm unass}}{N^{\rm ass}} .  (6)

We show this estimate in the OTHER column of the “2-class corr” row. We note that since $N^{\rm ass\,OTHER}_{\rm AGN} + N^{\rm ass\,OTHER}_{\rm PSR} + N^{\rm ass\,OTHER}_{\rm MIXED} = N^{\rm ass\,OTHER}$, this estimate is consistent with the corrections in Equation (5) for sources classified as AGNs, pulsars, or with mixed classification.
Source_Name_3FGL  | AGN probability: BDT | RF | LR | NN
3FGL J0000.2-3738 | …                    | …  | …  | …

Table 4. Example of the AGN classification probabilities for a few unassociated sources in the 3FGL catalog (Acero et al. 2015). We have omitted the oversampled probability columns here.
Source_Name_3FGL  | GLON | GLAT | Sep (arcsec)
3FGL J0933.9-6232 | …    | …    | …

Table 5. Connection of unassociated 3FGL sources classified as pulsars with Parkes pulsars (Camilo et al. 2015).

In this section we construct a probabilistic classification of sources in the 4FGL-DR2 catalog. The 4FGL-DR2 catalog (Ballet et al. 2020) is based on 10 years of
Fermi-LAT data (compared to 8 years of data in the 4FGL catalog, Abdollahi et al. 2020). It contains 5788 sources, which is 723 sources more than in the 4FGL catalog (all sources in 4FGL are kept in 4FGL-DR2 even if they fall below the detection threshold with 10 years of data). 3770 sources in 4FGL-DR2 have either an AGN or a PSR classification, 1658 sources are unassociated (we only look at the CLASS1 column in the catalog), and the remaining 346 sources are other sources with classifications such as PWN, SNR, etc. There are 14 sources in 4FGL-DR2 with missing values: four AGNs, one PWN (the Crab), and nine unassociated sources. As in the previous section, we use for training and testing the sources associated with either AGNs or pulsars which have no missing values used for classification. We then calculate the classification probabilities of the AGN and PSR classes for both the associated and the unassociated sources. The 4FGL catalog has a higher number of features, especially due to the difference in the modeling of the spectra compared with the 3FGL catalog. We selected 28 of these features and looked for correlations among them. If any feature was strongly correlated or anti-correlated with another one, we kept only one of the two features.

The 303 unassociated 4FGL-DR2 sources correspond to 302 unassociated 3FGL sources, because there are two 4FGL-DR2 sources which are associated with one 3FGL source.
Algorithm | Parameters                    | Testing Accuracy | Std. Dev.
RF        | 50 trees, max depth 6         | 97.87 | 0.36
RF_O      |                               | 97.56 | 0.39
BDT       | 100 trees, max depth 2        | 97.63 | 0.39
BDT_O     |                               | 97.72 | 0.38
NN        | 300 epochs, 16 neurons, LBFGS | 97.41 | 0.47
NN_O      |                               | 95.48 | 0.66
LR        | LBFGS solver, 200 iterations  | 97.80 | 0.38
LR_O      |                               | 96.03 | 0.53

Table 6. Testing accuracy of the 4 algorithms on the 4FGL-DR2 associated data. “_O” denotes training with oversampling.
40 sources are predicted to be pulsars using the 3FGL features and 75 sources are predicted to be pulsars using the 4FGL features. This leads to 29 sources which are predicted by all eight methods to be pulsars for features taken from both the 3FGL and 4FGL catalogs. For convenience, we save these 29 pulsar candidates as a separate file. Among the 6 unassociated 3FGL sources classified as pulsars and spatially associated with pulsars in the Parkes survey in Table 5, four are unassociated in 4FGL-DR2 and are predicted to be pulsars by the probabilistic classification of unassociated 4FGL-DR2 sources. Of the other two, one is now associated as a pulsar in 4FGL-DR2, and the second one is not detected in 4FGL-DR2.
5. Three-Class Classification
One of the caveats of the analysis with two classes is that there are associated sources which do not belong to the AGN or PSR classes. These have the labels: unk, spp, glc, snr, gal, sbg, GAL, sfr, bin, SNR, HMB, LMB, css, PWN, pwn, hmb, SFR, BIN, lmb, NOV. We collect all these associated sources, which do not belong to the AGN and PSR classes, into a new class, which we label “OTHER”. Since in the two-class classification we train the algorithms to classify sources only into the AGN and PSR classes, OTHER sources are also classified as either AGN or PSR. This introduces a bias in the estimates of the numbers of AGNs and pulsars among the unassociated sources. One possibility to correct this bias is to assume that the fractions of OTHER sources among associated and unassociated sources are the same (Equation (5)). This correction can be applied to the total number of sources or to the number of sources in some window of parameters, e.g., in a flux bin or in a range of latitudes and longitudes. This is a straightforward calculation, but it has some limitations. In particular, it implicitly assigns equal probabilities to all AGNs (and all PSRs) to belong to the OTHER class. For a small range of parameters the variance of this estimate can be very large due to the small number of associated OTHER sources in this parameter range. As we will see
in Section 6, this correction depends on the choice of the variable used for binning, e.g., the overall correction with latitude bins is not equal to the correction with longitude bins.

In this section we discuss the construction of probabilistic catalogs with multi-class classification (3-class classification in our case). We start with the construction of the probabilistic catalog based on 3FGL by adding the class “OTHER”, which includes all associated sources without AGN or PSR associations: there are 108 such sources in 3FGL. We use the same 11 features as in the 2-class classification; the only difference is that we use cos(GLON) instead of GLON. The reason is that the LR and NN methods have a significantly worse performance than the RF and BDT methods when we use GLON but, as we show below, all four methods have comparable accuracy when we use cos(GLON). This can be due to the discontinuity in the GLON variable. We perform the optimization of the meta-parameters for the four ML algorithms with the 3 classes.

The dependence of the accuracy on the meta-parameters of the algorithms is shown in Figures 15 and 16. We see that for the tree-based algorithms the optimal parameters are similar to the 2-class classification, i.e., 50 trees with depth 6 for RF and 100 trees with depth 2 for BDT provide close to optimal performance at a minimal cost in complexity (depth of the trees). The main difference for the NN and LR algorithms is that more steps are needed for convergence, especially in the case of oversampling. In the following we use 600 epochs for NN and 500 iterations for LR, instead of the 300 epochs and 200 iterations, respectively, in the two-class case. For NN, the accuracy stops increasing above about 10 neurons in the hidden layer (in the following we use 11 neurons for classification, the same as in the two-class case). For oversampling, we use oversampling factors equal to the square roots of the class imbalance ratios, $\sqrt{N_{\rm AGN}/N_{\rm PSR}}$ and $\sqrt{N_{\rm AGN}/N_{\rm OTHER}}$, for the PSR and OTHER classes respectively (instead of the full ratios used in the 2-class case). The reason for the smaller oversampling factors is to avoid overweighting the relatively small OTHER class.

We show an example of the domains in the 3-class case in Figure 17. A class domain is determined by the class with the largest probability. Since in the 3-class case there are two independent probabilities, which are difficult to show with a single color bar, we present only the domains, represented by three different colors: brown for the PSR, green for the OTHER, and blue for the AGN class. The corresponding training and testing data are shown by red crosses and brown rotated squares for PSR, by green squares and yellow diamonds for OTHER, and by blue and purple triangles for AGN. The classification domains are averaged over 100 realizations of splitting the data into training and testing samples. One of these splittings is shown in the figure.
Fig. 15. Accuracy for the 3-class classification with the RF, BDT, and LR methods (RF and BDT: testing score as a function of the maximum depth for 20, 50, 100, and 200 trees; LR: testing score as a function of the number of iterations for the LBFGS, SAG, and SAGA solvers). LR does not have the liblinear solver here, since liblinear cannot handle the multinomial loss.

The accuracies of our chosen models for the classification of the 3FGL sources are presented in Table 7. As in the 2-class case, the accuracies are averaged over 1000 realizations of splitting the data into training and testing samples. We notice that the accuracies presented in Table 2 are calculated relative to the AGN and PSR classes only; if we take into account that all OTHER sources are misclassified in this case, then the testing accuracy is reduced by about 5% (the fraction of OTHER sources among the associated sources in 3FGL), while the accuracy of the comparison with 4FGL-DR2 is reduced by about 10% (there are 37 unassociated sources in 3FGL with OTHER class associations in 4FGL-DR2, out of in total 339 unassociated sources in 3FGL with associations in 4FGL-DR2). Thus the testing accuracy of 93-94% in Table 7 provides at least a 1-2% improvement over the accuracy in Table 2, after taking into account the misclassification of OTHER sources in the 2-class case, and a similar improvement for the accuracy of the classification of unassociated sources in 3FGL with 4FGL-DR2 associations.
Fig. 16. Accuracy of the NN classification as a function of the number of epochs (11 neurons) and of the number of neurons (600 epochs) for the 3-class classification, for the LBFGS and Adam solvers with tanh or relu activation.
Fig. 17. Classification domains for RF in the 3-class classification (50 trees, maximum depth 6, testing score 0.92).

The 3-class classification of the 4FGL-DR2 sources is performed similarly to the 3-class classification of the 3FGL sources. The differences are similar to the differences in the 2-class classification of 3FGL and 4FGL-DR2 sources: we use 16 features (the same features as in the 2-class classification of 4FGL-DR2 sources in Section 4.2, but with GLON replaced by cos(GLON)) and we have 16 neurons in the hidden layer of the NN method. Furthermore, for logistic regression we use 1000 iterations instead of 500, as this gives a better performance for the oversampled cases.
Algorithm   Parameters                       Testing Accuracy   Std. Dev.   Comparison with 4FGL-DR2 Accuracy
RF          50 trees, max depth 6            93.96              0.85        84.96
RF_O                                         94.38              0.76        84.96
BDT         100 trees, max depth 2           93.72              0.83        83.19
BDT_O                                        93.83              0.80        85.25
NN          600 epochs, 11 neurons, LBFGS    93.17              1.05        83.48
NN_O                                         92.51              1.34        81.71
LR          500 iterations, LBFGS solver     93.93              0.88        83.19
LR_O                                         93.01              0.96        83.19
Table 7. Testing accuracy of the four selected algorithms for the 3-class classification of the 3FGL sources and comparison with associations in the 4FGL-DR2 catalog. "_O" denotes training with oversampling.

The corresponding accuracies for the 4FGL-DR2 catalog are reported in Table 8. In comparing these accuracies with the 2-class classification in Table 6, one has to take into account that there are 346 OTHER sources among the 4116 associated sources in 4FGL-DR2, i.e., about 8.4%. Since all OTHER sources are "misclassified" by the 2-class classification, the 3-class classification provides an improvement of about 2-4% compared to the 2-class classification.
Algorithm   Parameters                       Testing Accuracy   Std. Dev.
RF          50 trees, max depth 6            92.91              0.66
RF_O                                         92.83              0.63
BDT         100 trees, max depth 2           92.51              0.67
BDT_O                                        92.27              0.67
NN          600 epochs, 16 neurons, LBFGS    91.86              0.72
NN_O                                         90.26              0.83
LR          1000 iterations, LBFGS solver    92.63              0.67
LR_O                                         92.22              0.69
Table 8. Testing accuracy of the four selected algorithms for the 3-class classification of the 4FGL-DR2 sources. "_O" denotes training with oversampling.
The numbers of unassociated sources classified by all 8 methods as AGNs, pulsars, and other sources in the 3FGL and 4FGL-DR2 catalogs are presented in the "3-class" rows of Table 3. For each algorithm the most probable class of a source is determined by the class with the largest probability; since there are three classes, the largest probability can be as small as just above 1/3. The "Mixed" column shows the number of sources with different classification results for different algorithms. Classification of Fermi-LAT 4FGL sources into three classes was considered earlier by, e.g., Zhu et al. (2021).
Zhu et al. (2021) primarily used a two-step classification procedure, in which AGNs are first separated from the rest of the sources and the remaining sources are then split into pulsars and other sources.
Zhu et al. (2021) also tested a simultaneous classification of the sources into three classes (AGNs, pulsars, other), but the results were inconsistent between the two ML algorithms they used (RF and NN); in particular, the number of OTHER sources predicted by the NN was zero. In our case, the predictions of the various algorithms are relatively consistent with one another. For example, in the 3FGL (4FGL-DR2) catalog all 8 methods classify 69 (271) unassociated sources as OTHER. Also, 8 of the 37 unassociated 3FGL sources that are associated with OTHER sources in 4FGL-DR2 are classified by all 8 algorithms as OTHER (6 are classified as AGNs, 5 as PSRs, and 18 have mixed classifications).
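Counting the sources that receive the same most probable class from all methods, as quoted above, amounts to an argmax over the class probabilities. A minimal sketch, assuming an array probs of shape (8, n_sources, 3) collected from the eight classifiers:

    import numpy as np

    classes = ("AGN", "PSR", "OTHER")
    best = probs.argmax(axis=2)                # most probable class per method
    unanimous = (best == best[0]).all(axis=0)  # all eight methods agree
    for k, name in enumerate(classes):
        print(name, np.sum(unanimous & (best[0] == k)))
    print("Mixed:", np.sum(~unanimous))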
6. Application of probabilistic catalogs for population studies
In this section we show how probabilistic catalogs can be used, for instance, for population studies. One of the most important questions in gamma-ray astronomy is the contribution of point sources, e.g., AGNs, to the extragalactic gamma-ray flux (e.g., Abdo et al. 2010b; Malyshev & Hogg 2011; Ackermann et al. 2016; Zechlin et al. 2016b,a; Lisanti et al. 2016; Di Mauro et al. 2018): if most of the extragalactic emission is explained by point sources, then one can put stringent constraints, e.g., on dark matter annihilation or decay into gamma rays (Ajello et al. 2015; Di Mauro & Donato 2015; Ackermann et al. 2015; Fornasa & Sánchez-Conde 2015; Liu et al. 2017) or on the evaporation of primordial black holes (Carr et al. 2010). In particular, it is important to understand the contribution of the unassociated sources to the population of AGNs. A probabilistic catalog provides an answer to the question of how many sources among the unassociated ones are expected to belong to the different classes, such as pulsars or AGNs. One can calculate the total expected number of AGNs or pulsars among the unassociated sources, or calculate the contribution as a function of one or more parameters. In this subsection we determine the numbers of AGNs, pulsars, and, in the case of the 3-class classification, OTHER sources as a function of their flux; in the following subsection we discuss the distributions of sources as functions of Galactic latitude and longitude.

In Figure 18 we show the cumulative numbers of AGNs and pulsars with flux above 1 GeV larger than the value on the x-axis. Solid blue lines show the actual counts of sources (AGNs or pulsars) in the 3FGL and 4FGL-DR2 catalogs. As a consistency check of the method, we calculate the AGN- and PSR-like probabilities for the associated sources. The sums of probabilities (uncorrected for sources other than AGNs and pulsars) for the LR algorithm are shown by dotted purple lines. In order to correct the expected number of AGNs among associated sources for the AGN-like probabilities of the "other" sources, we subtract the corresponding AGN-like probabilities in each flux band:

    N^{\rm ass}_{\rm AGN} = \sum_{i \in {\rm ass}} p^i_{\rm AGN} - \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} .    (7)

The corrected sums of probabilities for the LR method are shown by dashed purple lines. The green bands show the envelopes of the sums of corrected probabilities for the eight methods used in this paper. We see that the counts of associated sources, AGNs and pulsars, are consistent with the expected numbers of associated sources calculated from the class probabilities of the associated sources. This conclusion is not very surprising, since we used the associated sources for the training of the ML algorithms. We note that the correction for "other" sources is important for the consistency of the sums of probabilities with the numbers of associated sources. We have also compared with the sums of probabilities for the 3FGL associated sources from Saz Parkinson et al. (2016): the sums uncorrected for "other" sources are shown (for the LR case) by a dotted black line, while the sums corrected for "other" sources are shown by black dash-dotted lines. The gray band is the envelope of the two methods (LR and RF) used by Saz Parkinson et al. (2016).
We see that the sums of probabilities for AGNs and pulsars overpredict the source counts in 3FGL, while the correction for "other" sources makes the prediction for associated sources consistent with the source counts.

The predictions for the numbers of AGNs and pulsars among the unassociated sources, corrected for "other" sources and added to the 3FGL and 4FGL-DR2 source counts, are shown by solid orange lines (for the LR case). The orange bands show the corresponding envelopes for the eight ML methods. We assume that the fractional contribution of other sources is the same for associated and unassociated sources in the different flux bands; the correction for the presence of other sources is therefore calculated similarly to that for the associated sources in Equation (7), but rescaled to account for the fact that there are fewer unassociated than associated sources. In particular, the number of AGNs among the unassociated sources in a flux band \Delta F is estimated as

    N^{\rm unass}_{\rm AGN} = \sum_{i \in {\rm unass}} p^i_{\rm AGN} - \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} \cdot \frac{N_{\rm unass}}{N_{\rm ass}} ,    (8)

where all probabilities and numbers of sources are computed for sources with flux inside \Delta F. The first term is the sum of the AGN-like probabilities of the unassociated sources, while the second term is the sum of the AGN-like probabilities of the associated "other" sources, rescaled by the ratio of the numbers of unassociated to associated sources in this flux band.
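Equations (7) and (8) translate directly into code; a minimal sketch, assuming numpy arrays of AGN-like probabilities within one flux band and a boolean mask is_other marking the associated sources that are neither AGNs nor pulsars:

    import numpy as np

    def n_agn_assoc(p_agn_ass, is_other):
        """Eq. (7): expected number of AGNs among associated sources in one
        flux band, corrected for the AGN-like probabilities of the associated
        sources that are neither AGNs nor pulsars."""
        return p_agn_ass.sum() - p_agn_ass[is_other].sum()

    def n_agn_unassoc(p_agn_unass, p_agn_ass, is_other):
        """Eq. (8): expected number of AGNs among unassociated sources in the
        same flux band, with the "other" correction rescaled by N_unass/N_ass."""
        return (p_agn_unass.sum()
                - p_agn_ass[is_other].sum()
                * len(p_agn_unass) / len(p_agn_ass))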
Fig. 18. Cumulative number of sources as a function of their flux. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Green band – envelope of the sums of class probabilities for associated sources for the eight ML methods, corrected for the presence of OTHER sources. Orange solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods (corrected for the presence of OTHER sources) added to the source counts of associated sources. Purple dashed (dotted) line – sum of class probabilities for associated sources for the LR method without oversampling, corrected (uncorrected) for the presence of OTHER sources. Gray dash-dotted (dotted) line – sums of class probabilities from Saz Parkinson et al. (2016) using LR, corrected (uncorrected) for the presence of OTHER sources. Gray band – envelope of the sums of class probabilities for associated sources for the LR and RF methods from Saz Parkinson et al. (2016), corrected for the presence of OTHER sources. For details see Section 6.1.
The expected number of pulsars among the unassociated sources is calculated analogously. The corresponding sums of the associated source counts plus the expected numbers of sources, calculated with the LR method of Saz Parkinson et al. (2016) and corrected for other sources, are shown by dashed gray lines.

The expected numbers of pulsars and AGNs among the unassociated sources are summarized in Table 9 for the 3FGL catalog and in Table 10 for the 4FGL-DR2 catalog. The row "2-class" shows the expectations for the 2-class classification uncorrected for the presence of other sources. The row "2-class corr total" shows the expectations corrected for other sources using Equation (8) without any binning, while "2-class corr F-bins" shows the correction with the flux bins of Figure 18. We see that the corrections using the total counts and using the flux bins are similar and relatively small. In the following section we also calculate the corrections using latitude and longitude bins.

The numbers of PSRs and OTHER sources in the 3-class classification as a function of flux are shown in Figure 19. As in the 2-class case, the solid blue lines show the counts of associated sources in the corresponding classes. The blue bands show the envelopes of the expected numbers of associated sources, and the red bands the envelopes of the associated source counts plus the expected numbers of sources among the unassociated ones. In the PSR plots we also replot, from the 2-class classification corrected for the presence of OTHER sources among the unassociated ones, the envelopes of the expected numbers of associated PSRs (green bands) and of the associated source counts plus the predictions for the unassociated sources (orange bands). We notice that the bands for the 3-class case are narrower in the PSR plots than the bands for the 2-class case.
Fig. 19. Cumulative number of sources as a function of their flux. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Blue band – envelope of the sums of class probabilities for associated sources for the eight ML methods. Purple dash-dotted line – sum of class probabilities for associated sources for the LR method without oversampling. Red solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods added to the source counts of associated sources. Purple dashed line – sum of class probabilities for associated sources for the LR method without oversampling in the 2-class classification, corrected for the presence of OTHER sources. Orange solid line (band) in the PSR plots – average (envelope) of the sums of class probabilities for the eight ML methods (corrected for the presence of OTHER sources) in the 2-class classification added to the source counts of associated sources (the band and the line are the same as in Figure 18). Green band in the PSR plots – envelope of the sums of class probabilities for associated sources for the eight ML methods in the 2-class classification, corrected for the presence of OTHER sources (the same as in Figure 18). Orange solid line in the OTHER plots – OTHER source counts plus an estimate of the number of OTHER sources among the unassociated ones using Equation (9). For details see Section 6.1.

In part this narrowing is due to less oversampling in the 3-class case. We also note that the red band lies almost entirely below the orange one. This points to a possible underestimation of the contribution of OTHER sources to the PSR class among the unassociated sources. In particular, in the bottom panels of Figure 19 we show by the orange line the estimated total number of OTHER sources in the 2-class case. Since we do not have probabilities for the OTHER sources in this case, we estimate the number of OTHER sources among the unassociated ones simply by rescaling the number of associated OTHER sources in each flux band:

    N^{\rm unass}_{\rm OTHER} = N^{\rm ass}_{\rm OTHER} \, \frac{N_{\rm unass}}{N_{\rm ass}} .    (9)

We notice that this estimate agrees with the correction of the estimated numbers of AGNs and pulsars in the 2-class case for the presence of other sources, since

    N^{\rm ass}_{\rm OTHER} = \sum_{i \in {\rm ass\,other}} p^i_{\rm AGN} + \sum_{i \in {\rm ass\,other}} p^i_{\rm PSR} = \sum_{i \in {\rm ass\,other}} \left( p^i_{\rm AGN} + p^i_{\rm PSR} \right) ,    (10)

provided that p^i_{\rm AGN} + p^i_{\rm PSR} = 1 for each source in the 2-class case. The estimated total number of OTHER sources in the 2-class case (orange line) is significantly below the red band derived from the sums of probabilities of the OTHER class in the 3-class case. The model in Equation (9) depends on the binning: as we see in the following section, for latitude binning this 2-class estimate of the number of OTHER sources among the unassociated ones agrees with the 3-class estimate.
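A minimal sketch of Equation (9), together with a toy check of the consistency statement of Equation (10); the probabilities here are random numbers, standing in for the 2-class outputs of the 108 associated OTHER sources in 3FGL:

    import numpy as np

    def n_other_unassoc(n_other_ass, n_unass, n_ass):
        """Eq. (9): rescale the associated OTHER counts by the
        unassociated-to-associated ratio in the given bin."""
        return n_other_ass * n_unass / n_ass

    # Eq. (10): with p_AGN + p_PSR = 1 per source in the 2-class case, the
    # AGN and PSR corrections over the associated OTHER sources sum to
    # exactly their count
    p_agn = np.random.default_rng(0).uniform(size=108)  # toy probabilities
    p_psr = 1.0 - p_agn
    assert np.isclose(p_agn.sum() + p_psr.sum(), 108.0)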
Fig. 20. Ratio of the numbers of AGNs, pulsars, and other sources among the unassociated sources, estimated with LR without oversampling, to the counts of the respective associated sources. The AGN and pulsar estimates are corrected for the presence of other sources using Equation (8), while the number of other sources among the unassociated ones in the 2-class case is estimated using Equation (9).

The reason is that for flux bins the ratio N_unass/N_ass is dominated by AGNs at high latitudes and is always smaller than 1, while at low latitudes, where most of the OTHER sources are, the ratio N_unass/N_ass is about 1.

We note that the probabilistic classification mostly affects sources with small fluxes. In Figure 20 we plot the ratio of the expected number of sources of a certain class among the unassociated sources, computed according to Equation (8) with the LR algorithm (without oversampling), to the number of associated sources in this class. The ratio generally increases as the flux decreases. Negative values (e.g., at high fluxes for AGNs) are due to the subtraction of the probabilities of the "other" associated sources. For OTHER sources in the 2-class case we use the estimate of Equation (9). As discussed above, these estimates are significantly below the estimated numbers of OTHER sources among the unassociated ones in the 3-class classification (for the LR method without oversampling in this case).

In this subsection we show the Galactic latitude and longitude profiles of the distributions of associated and unassociated sources. In Figures 21 and 22 we present the source counts as a function of abs(sin(GLAT)) for the 2-class and 3-class classifications, respectively. We use 20 bins, i.e., each bin corresponds to a solid angle of π/5 sr. Solid blue lines show the counts of associated sources in the 3FGL and 4FGL-DR2 catalogs. It is interesting to note that the density of associated AGNs decreases towards the Galactic plane. The total counts of unassociated sources are shown by red dash-dotted lines. Green bands show the envelopes of the sums of probabilities for the classification of unassociated sources into the AGN, PSR, and, in the case of 3 classes, OTHER classes for the eight ML methods, corrected in the 2-class case for the presence of OTHER sources. The black dotted lines in the plots of the OTHER class show the model of Equation (9) for the number of OTHER sources among the unassociated ones, computed in latitude bins. We see that in the latitude profiles the 2-class model for the contribution of OTHER sources to the unassociated ones is generally consistent with the estimate of the number of OTHER sources in the 3-class model.

The classifications of the unassociated 3FGL sources by Saz Parkinson et al. (2016) are shown by gray dashed (RF) and dotted (LR) lines. The numbers of unassociated sources classified as AGN, PSR, or OTHER grow towards the Galactic plane (GP). Near the GP the expected number of PSRs is about the same as the number of AGNs among the unassociated sources, while at high latitudes most of the unassociated sources are classified as AGNs. It is interesting to note that, according to Table 1, GLAT is one of the least important features for the RF algorithm. This can be explained a posteriori by the fact that the density of AGNs is so high that even in the GP the expected number of AGNs is comparable to the expected number of PSRs.
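The latitude profiles described above reduce to weighted histograms in |sin(GLAT)|; a minimal sketch, assuming arrays glat_ass_agn and glat_unass of Galactic latitudes (in degrees) and AGN-like probabilities p_agn_unass of the unassociated sources:

    import numpy as np

    # 20 uniform bins in |sin(GLAT)|: each bin subtends 4*pi*0.05 = pi/5 sr,
    # so an isotropic population would give a flat profile
    bins = np.linspace(0.0, 1.0, 21)
    x_ass = np.abs(np.sin(np.deg2rad(glat_ass_agn)))
    x_unass = np.abs(np.sin(np.deg2rad(glat_unass)))

    counts_ass, _ = np.histogram(x_ass, bins=bins)
    expected_agn, _ = np.histogram(x_unass, bins=bins, weights=p_agn_unass)
    profile = counts_ass + expected_agn   # "src count + unass prob" in Fig. 21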
Fig. 21. Latitude profiles of the source counts for the 2-class classification. Blue solid line – associated 3FGL and 4FGL-DR2 sources. Red dash-dotted line – counts of all unassociated sources. Blue band – envelope of the sums of class probabilities for associated sources for the eight ML methods (with and without oversampling), corrected for the presence of OTHER sources. Green band – envelope of the sums of class probabilities for unassociated sources for the eight ML methods, corrected for the presence of OTHER sources. Orange solid line (band) – average (envelope) of the sums of class probabilities for the eight ML methods added to the source counts of associated sources. Green dotted line in the AGN plots – mean of the orange solid line. Gray dashed (dotted) lines – RF (LR) sums of class probabilities from Saz Parkinson et al. (2016). For details see Section 6.2.

The orange shaded areas show the sums of the source counts and the expected numbers of sources for the eight methods (both with and without oversampling). The average among the eight methods added to the counts of associated sources is shown by the solid orange lines (for AGNs we also show the mean of these points by a dotted green line). We find that the number of associated AGNs decreases towards the GP while the expected number of AGNs among the unassociated sources increases towards the GP, so that the sum of the two is relatively uniform as a function of Galactic latitude.

In Figures 23 and 24 we show plots analogous to Figures 21 and 22 for Galactic longitude (we use 30 bins). We note that there is a significant increase in the number of unassociated sources in the 4FGL-DR2 catalog at longitudes near the Galactic center. In the 2-class classification about half of these unassociated sources are attributed to pulsars and half to AGNs, while in the 3-class classification at least one third of the unassociated sources in this range of longitudes is attributed to the OTHER class; this comes mostly at the expense of reducing the predicted number of pulsars. It is interesting to note that the 2-class model for the OTHER sources calculated using Equation (9) in longitude bins (black dotted lines in the OTHER plots) is significantly below the bands of the 3-class expectations, whereas for the latitude bins the 2-class model is consistent with the bands. The reason is that the ratio of unassociated to associated sources, which we use to estimate the number of OTHER sources among the unassociated ones, depends on the binning. In particular, this ratio is about 1 at low latitudes, which leads to an estimate that at low latitudes the number of OTHER sources among the unassociated ones is similar to the number of associated OTHER sources; for longitude and flux binning the ratio of unassociated to associated sources is smaller than 1, which leads to a prediction (in the 2-class case) that the number of OTHER sources among the unassociated ones is smaller than the number of associated OTHER sources.
Fig. 22. Latitude profiles of the source counts for the 3-class classification. For the definition of the labels see Figure 21. The black dashed lines in the plots of the OTHER class show the model of Equation (9) for the number of OTHER sources among the unassociated ones, computed in latitude bins.

We summarize the expected numbers of sources in the 2-class and 3-class models in Table 9 for the 3FGL and in Table 10 for the 4FGL-DR2 catalog. In the 2-class case we also show the corrections for the presence of OTHER sources among the unassociated ones. We make the corrections using the total numbers of unassociated and associated sources, as well as the binning in flux from Figure 18 (F-bins), in latitude from Figure 21 (Lat-bins), and in longitude from Figure 23 (Lon-bins). We see that the predictions for the numbers of sources using latitude binning in the 2-class case are closest to the 3-class case. Even better agreement would likely be achieved with simultaneous binning in flux, latitude, longitude and, possibly, other variables. If, in addition, one allows bins of variable size and non-rectangular boundaries in the multi-dimensional space, then one effectively ends up with one of the ML algorithms for classification.
Fig. 23. Longitude profiles of the source counts for the 2-class classification. For the definition of the labels see Figure 21.

Indeed, ML algorithms are designed to provide an optimal binning, one which maximizes the separation of the classes, and to make predictions for data with unknown labels (e.g., the unassociated sources) based on the counts of samples in bins with known labels (e.g., the associated sources). Thus the 3-class prediction for the classification of unassociated sources is likely more accurate than the 2-class classification with a correction for OTHER sources based on an ad hoc binning of one of the variables. The 2-class classification may still be useful in situations where high recall is necessary for either pulsars or AGNs: since OTHER sources are mixed with pulsars and AGNs (cf. Figure 17), some pulsars and AGNs can be mistakenly classified as OTHER in the 3-class case.
7. Conclusions
In this paper we determine the probabilities of classification of unassociated sources in the 3FGL and 4FGL-DR2 Fermi-LAT catalogs into two (AGNs and pulsars) and three (AGNs, pulsars, and other sources) classes. The probabilities are calculated with 8 different ML methods (RF, BDT, LR, and NN, each algorithm with and without oversampling during training), which were trained and tested with associated sources. We have scanned some meta-parameters of the algorithms, such as the depth of the trees, the number of trees, and the number of neurons, in order to determine optimal parameters that avoid overfitting the data while providing good classification accuracy.
Classification            AGN    PSR    OTHER
2-class                   ...    ...    –
2-class corr total        ...    ...    ...
2-class corr F-bins       ...    ...    ...
2-class corr Lat-bins     ...    ...    ...
2-class corr Lon-bins     ...    ...    ...
3-class                   ...    ...    ...

Table 9. Expected counts of sources among the unassociated 3FGL sources.
Classification            AGN    PSR    OTHER
2-class                   ...    ...    –
2-class corr total        ...    ...    ...
2-class corr F-bins       ...    ...    ...
2-class corr Lat-bins     ...    ...    ...
2-class corr Lon-bins     ...    ...    ...
3-class                   ...    ...    ...

Table 10. Expected counts of sources among the unassociated 4FGL-DR2 sources.
Fig. 24. Longitude profiles of the source counts for the 3-class classification. For the definition of the labels see Figures 21 and 22.

The accuracies which we obtained for the 3FGL catalog for the four algorithms without oversampling in the 2-class case are about 97%, and between 93% and 97% with oversampling. We have also checked the accuracy of the classification by selecting unassociated sources in 3FGL which have associations in 4FGL-DR2. If we take the 4FGL-DR2 associations as the true classes, then the accuracies of classification for this subset of sources without (with) oversampling are between 90% and 91% (85% and 92%). Most of the misclassified sources in this comparison have spectral parameters in 3FGL that are typical of the other class (Figure 14), i.e., the misclassification can be due to problems with reconstructing the spectra of the sources. In the 3-class classification, the testing accuracies for the 3FGL catalog are between 92% and 94% (both with and without oversampling), while the comparison with 4FGL-DR2 gives accuracies between 82% and 85%.
For the 4FGL-DR2 classifications, the testing accuracy is between 90% and 93% (both with and without oversampling). If one takes into account that all other sources are misclassified in the 2-class case, then the 3-class case provides an improvement in accuracy of 1% to 5%.

We have created four catalogs with probabilistic classifications of sources: based on 3FGL and on 4FGL-DR2, each with 2- and 3-class classifications. For each source and for each class we report the class probabilities for each of the eight ML methods (with and without oversampling). We also provide individual standard deviations for all classification probabilities, computed over realizations of the selection of the training and testing datasets. We report the classification probabilities not only for the unassociated sources but also for the associated ones, which can be used to find outliers. An advantage of such a probabilistic classification is that a threshold on probability for selecting, e.g., pulsar candidates, can be chosen by the user based on his or her needs (a sketch of such a selection is given below). For example, in a search for new pulsars one can select a low threshold in order to avoid missing possible pulsars, while in a derivation of an average property of the class, e.g., the spectral index or cutoff energy, one can select a high threshold in order to avoid contamination from the other classes.

As an example of the application of the probabilistic catalogs, we derive the expected number of sources in the catalog as a function of their flux, including the unassociated sources. As a consistency check, we compare the counts of associated sources with the sums of probabilities for the associated ones. We find that correcting for the contribution of other sources in the 2-class case plays an important role in the estimation of the expected number of sources in a particular class. We find the total expected numbers of AGNs and pulsars in the 3FGL and 4FGL-DR2 catalogs by adding the class probabilities of the unassociated sources in the 2- and 3-class cases to the source counts of associated sources, correcting in the 2-class case for the contribution of other classes among the unassociated sources. In particular, we find that the total expected number of pulsars is about two times larger than the number of associated pulsars.

We plot the counts of associated sources and the expected numbers of AGNs, pulsars, and other sources among the unassociated sources as functions of Galactic latitude and longitude. We find that the number of associated AGNs decreases towards low latitudes while the expected number of AGNs among the unassociated sources increases, so that the sum of the two is relatively uniform, as expected for extragalactic sources.
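As a usage sketch of the threshold selection mentioned above (with a hypothetical file name and column layout, since the catalog format is only described qualitatively here):

    import pandas as pd

    # hypothetical file and column names: one PSR probability per ML method,
    # e.g. "RF_PSR", "LR_PSR", ...
    cat = pd.read_csv("3class_probabilistic_catalog.csv")
    psr_cols = [c for c in cat.columns if c.endswith("_PSR")]
    cat["P_PSR_mean"] = cat[psr_cols].mean(axis=1)

    candidates = cat[cat["P_PSR_mean"] > 0.3]   # low threshold: high recall
    pure_sample = cat[cat["P_PSR_mean"] > 0.9]  # high threshold: low contamination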
Acknowledgements
The authors would like to thank Pablo Saz Parkinson and Jean Ballet for valuable discussions. The work of DM was in part supported by BMBF under the ErUM-Data project "Innovative Digital Technologies for Research on Universe and Matter" (grant number 05H18WERC1) and by DFG grant MA 8279/2-1. We would like to acknowledge the use of the following software: Astropy (Robitaille et al. 2013), matplotlib (Hunter 2007), scikit-learn (https://scikit-learn.org/stable/about.html), and TOPCAT (Taylor 2005).
References
Abdo, A. A., Ackermann, M., Ajello, M., et al. 2010a, ApJS, 188, 405
Abdo, A. A., Ackermann, M., Ajello, M., et al. 2010b, ApJ, 720, 435
Abdollahi, S., Acero, F., Ackermann, M., et al. 2020, ApJS, 247, 33
Acero, F., Ackermann, M., Ajello, M., et al. 2015, ApJS, 218, 23
Ackermann, M., Ajello, M., Albert, A., et al. 2015, J. Cosmology Astropart. Phys., 2015, 008
Ackermann, M., Ajello, M., Albert, A., et al. 2016, Phys. Rev. Lett., 116, 151105
Ackermann, M., Ajello, M., Allafort, A., et al. 2012, ApJ, 753, 83
Ajello, M., Gasparrini, D., Sánchez-Conde, M., et al. 2015, ApJ, 800, L27
Ballet, J., Burnett, T. H., Digel, S. W., & Lott, B. 2020, arXiv e-prints, arXiv:2005.11208
Breiman, L. 2001, Machine Learning, 45, 5
Brewer, B. J., Foreman-Mackey, D., & Hogg, D. W. 2013, AJ, 146, 7
Camilo, F., Kerr, M., Ray, P. S., et al. 2015, ApJ, 810, 85
Carr, B. J., Kohri, K., Sendouda, Y., & Yokoyama, J. 2010, Phys. Rev. D, 81, 104019
Chiaro, G., Salvetti, D., La Mura, G., et al. 2016, MNRAS, 462, 3180
Cox, D. R. 1958, J. R. Stat. Soc. B, 20, 215
Daylan, T., Portillo, S. K. N., & Finkbeiner, D. P. 2017, ApJ, 839, 4
Defazio, A., Bach, F., & Lacoste-Julien, S. 2014, arXiv e-prints, arXiv:1407.0202
Di Mauro, M. & Donato, F. 2015, Phys. Rev. D, 91, 123001
Di Mauro, M., Manconi, S., Zechlin, H. S., et al. 2018, ApJ, 856, 106
Doert, M. & Errando, M. 2014, ApJ, 782, 41
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. 2008, J. Mach. Learn. Res., 9, 1871–1874
Fornasa, M. & Sánchez-Conde, M. A. 2015, Phys. Rep., 598, 1
Friedman, J. H. 2001a, Ann. Statist., 29, 1189
Friedman, J. H. 2001b, Ann. Statist., 29, 1189
Hassan, T., Mirabal, N., Contreras, J. L., & Oya, I. 2013, MNRAS, 428, 220
Ho, T. K. 1998, IEEE Trans. Pattern Anal. Mach. Intell., 20, 832
Hogg, D. W. & Lang, D. 2010, in EAS Publications Series, Vol. 45, 351–358
Hopfield, J. 1982, Proc. Nat. Acad. Sci., 79, 2554
Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90
Kingma, D. P. & Ba, J. 2014, arXiv e-prints, arXiv:1412.6980
Kovačević, M., Chiaro, G., Cutini, S., & Tosti, G. 2019, MNRAS, 490, 4770
Kovačević, M., Chiaro, G., Cutini, S., & Tosti, G. 2020, MNRAS, 493, 1926
Lee, K. J., Guillemot, L., Yue, Y. L., Kramer, M., & Champion, D. J. 2012, MNRAS, 424, 2832
Lefaucheur, J. & Pita, S. 2017, A&A, 602, A86
Lisanti, M., Mishra-Sharma, S., Necib, L., & Safdi, B. R. 2016, ApJ, 832, 117
Liu, D. C. & Nocedal, J. 1989, Math. Program., 45, 503–528
Liu, W., Bi, X.-J., Lin, S.-J., & Yin, P.-F. 2017, Chinese Physics C, 41, 045104
Luo, S., Leung, A. P., Hui, C. Y., & Li, K. L. 2020, MNRAS, 492, 5377
Malyshev, D. & Hogg, D. W. 2011, ApJ, 738, 181
Mirabal, N., Charles, E., Ferrara, E. C., et al. 2016, ApJ, 825, 69
Nolan, P. L., Abdo, A. A., Ackermann, M., et al. 2012, ApJS, 199, 31
Robitaille, T. P., Tollerud, E. J., Greenfield, P., et al. 2013, A&A, 558, A33
Salvetti, D., Chiaro, G., La Mura, G., & Thompson, D. J. 2017, MNRAS, 470, 1291
Saz Parkinson, P. M., Xu, H., Yu, P. L. H., et al. 2016, ApJ, 820, 8
Schmidt, M., Le Roux, N., & Bach, F. 2017, Math. Program., 162, 83–112
Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
Zechlin, H.-S., Cuoco, A., Donato, F., Fornengo, N., & Regis, M. 2016a, ApJ, 826, L31
Zechlin, H.-S., Cuoco, A., Donato, F., Fornengo, N., & Vittino, A. 2016b, ApJS, 225, 18
Zhu, K.-R., Kang, S.-J., & Zheng, Y.-G. 2021, Research in Astronomy and Astrophysics, 21, 015
Appendix A: Tests of additional meta-parameters
In this appendix we discuss tests of some hyper-parameters that had relatively little effect on the accuracy of the algorithms. For these tests we use the 2-class classification in the 3FGL catalog.

The LR algorithm has two hyper-parameters: the regularization and the tolerance. The effect of the choice of these parameters on the accuracy is less than 1% (Figure A.1). We therefore used the default values for these parameters (a tolerance of 10^-4 and a regularization parameter of 1). The same choice was used in the 3-class case, where the change from GLON to cos(GLON) was the main improvement for the LR algorithm.
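The scan behind Figure A.1 can be sketched as follows; the tolerance grid mirrors the figure legend, the C grid is illustrative, and X_tr, y_tr, X_te, y_te denote a prepared train/test split as in the earlier sketches:

    from sklearn.linear_model import LogisticRegression

    for C in (0.01, 0.1, 1.0, 10.0, 100.0):  # C is the inverse regularization strength
        for tol in (1e-3, 1.0, 10.0):        # stopping tolerance
            clf = LogisticRegression(C=C, tol=tol, solver="lbfgs",
                                     max_iter=300)
            clf.fit(X_tr, y_tr)
            print(C, tol, clf.score(X_te, y_te))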
Fig. A.1. Dependence of the LR testing accuracy on the regularization parameter for the LBFGS solver with 300 epochs, for tolerance values of 0.001, 1, and 10.
In Figure A.2 we show the effect of adding a second hidden layer in the NN algorithm. The best accuracies with the additional hidden layer differ by less than 1% from those of the NN with one hidden layer (cf. Table 2).
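A minimal sketch of this test, again assuming prepared training and testing arrays:

    from sklearn.neural_network import MLPClassifier

    # add a second hidden layer to the 11-neuron network from the main text
    for n2 in (2, 5, 11, 20, 40):
        clf = MLPClassifier(hidden_layer_sizes=(11, n2), solver="lbfgs",
                            activation="tanh", max_iter=300, random_state=0)
        clf.fit(X_tr, y_tr)
        print(n2, clf.score(X_te, y_te))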
Fig. A.2. Dependence of the NN testing accuracy on the number of neurons in the second hidden layer, for 11 neurons in the first hidden layer and 300 epochs (LBFGS and Adam solvers with tanh and ReLU activations).
Finally, we summarize the features and their statistics, which we use for the probabilistic classification of sources in the 3FGL and 4FGL-DR2 catalogs, in Tables A.1 and A.2, respectively.
Feature Name              Mean    Standard Deviation    Minimum    Maximum
GLON                      ...     ...                   ...        ...
GLAT                      ...     ...                   ...        ...
ln(Energy_Flux100)        ...     ...                   ...        ...
ln(Unc_Energy_Flux100)    ...     ...                   ...        ...
ln(Signif_Curve)          ...     ...                   ...        ...
ln(Variability_Index)     ...     ...                   ...        ...
...                       ...     ...                   ...        ...
HR12                      ...     ...                   ...        ...
HR23                      ...     ...                   ...        ...
HR34                      ...     ...                   ...        ...
HR45                      ...     ...                   ...        ...

Table A.1. Statistics of the features used for the 2-class probabilistic classification of the 3FGL sources.
Feature Name              Mean    Standard Deviation    Minimum    Maximum
GLON                      ...     ...                   ...        ...
GLAT                      ...     ...                   ...        ...
ln(Energy_Flux100)        ...     ...                   ...        ...
ln(Unc_Energy_Flux100)    ...     ...                   ...        ...
ln(Pivot_Energy)          ...     ...                   ...        ...
LP_Index                  ...     ...                   ...        ...
Unc_LP_Index              ...     ...                   ...        ...
LP_beta                   ...     ...                   ...        ...
LP_SigCurv                ...     ...                   ...        ...
ln(Variability_Index)     ...     ...                   ...        ...
HR12                      ...     ...                   ...        ...
HR23                      ...     ...                   ...        ...
HR34                      ...     ...                   ...        ...
HR45                      ...     ...                   ...        ...
HR56                      ...     ...                   ...        ...
HR67                      ...     ...                   ...        ...