A Hybrid Ensemble Learning Approach to Star-Galaxy Classification
Mon. Not. R. Astron. Soc., 1–15 (2015). Printed 14 October 2018 (MN LaTeX style file v2.2)
Edward J. Kim ([email protected]), Robert J. Brunner, and Matias Carrasco Kind
Department of Physics, University of Illinois, Urbana, IL 61801 USA
Department of Astronomy, University of Illinois, Urbana, IL 61801 USA
Department of Statistics, University of Illinois, Champaign, IL 61820 USA
National Center for Supercomputing Applications, Urbana, IL 61801 USA
14 October 2018
ABSTRACT
There exist a variety of star-galaxy classification techniques, each with its own strengths and weaknesses. In this paper, we present a novel meta-classification framework that combines and fully exploits different techniques to produce a more robust star-galaxy classification. To demonstrate this hybrid, ensemble approach, we combine a purely morphological classifier, a supervised machine learning method based on random forest, an unsupervised machine learning method based on self-organizing maps, and a hierarchical Bayesian template fitting method. Using data from the CFHTLenS survey, we consider different scenarios: when a high-quality training set is available with spectroscopic labels from DEEP2, SDSS, VIPERS, and VVDS, and when the demographics of sources in a low-quality training set do not match the demographics of objects in the test data set. We demonstrate that our Bayesian combination technique improves the overall performance over any individual classification method in these scenarios. Thus, strategies that combine the predictions of different classifiers may prove to be optimal in currently ongoing and forthcoming photometric surveys, such as the Dark Energy Survey and the Large Synoptic Survey Telescope.
Key words: methods: data analysis – methods: statistical – surveys – stars: statistics – galaxies: statistics.
1 INTRODUCTION

The problem of source classification is fundamental to astronomy and goes as far back as Messier (1781). A variety of different strategies have been developed to tackle this long-standing problem, and yet there is no consensus on the optimal star-galaxy classification strategy. The most commonly used method to classify stars and galaxies in large sky surveys is morphological separation (Sebok 1979; Kron 1980; Valdes 1982; Yee 1991; Vasconcellos et al. 2011; Henrion et al. 2011). It relies on the assumption that stars appear as point sources while galaxies appear as resolved sources. However, currently ongoing and upcoming large photometric surveys, such as the Dark Energy Survey (DES) and the Large Synoptic Survey Telescope (LSST), will detect a vast number of unresolved galaxies at faint magnitudes. Near a survey's limit, the photometric observations cannot reliably separate stars from unresolved galaxies by morphology alone without leading to incompleteness and contamination in the star and galaxy samples.

The contamination of unresolved galaxies can be mitigated by using training based algorithms. Machine learning methods have the advantage that it is easier to include extra information, such as concentration indices, shape information, or different model magnitudes. However, they are only reliable within the limits of the training data, and it can be difficult to extrapolate these algorithms outside the parameter range of the training data. These techniques can be further categorized into supervised and unsupervised learning approaches.

In supervised learning, the input attributes (e.g., magnitudes or colors) are provided along with the truth labels (e.g., star or galaxy). Odewahn et al. (1992) pioneered the application of neural networks to the star-galaxy classification problem, and it has become a core part of the astronomical image processing software SExtractor (Bertin & Arnouts 1996).
Other successfully implemented examples include decision trees (Weir et al. 1995; Suchkov et al. 2005; Ball et al. 2006; Sevilla-Noarbe & Etayo-Sotos 2015) and Support Vector Machines (Fadely, Hogg & Willman 2012). Unsupervised machine learning techniques are less common, as they do not utilize the truth labels during the training process, and only the input attributes are used.

Physically based template fitting methods have also been used for the star-galaxy classification problem (Robin et al. 2007; Fadely et al. 2012). Template fitting approaches infer a source's properties by finding the best match between the measured set of magnitudes (or colors) and the synthetic set of magnitudes (or colors) computed from a set of spectral templates. Although it is not necessary to obtain a high-quality spectroscopic training sample, these techniques do require a representative sample of theoretical or empirical templates that span the possible spectral energy distributions (SEDs) of stars and galaxies. Furthermore, they are not exempt from uncertainties due to measurement errors on the filter response curves, or from mismatches between the observed magnitudes and the template SEDs.

In this paper, we present a novel star-galaxy classification framework that combines and fully exploits different classification techniques to produce a more robust classification. In particular, we show that the combination of a morphological separation method, a template fitting technique, a supervised machine learning method, and an unsupervised machine learning algorithm can improve the overall performance over any individual method. In Section 2, we describe each of the star-galaxy classification methods. In Section 3, we describe different classification combination techniques.
In Section 4, we describe the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS) data set with which we test the algorithms. In Section 5, we compare the performance of our combination techniques to the performance of the individual classification techniques. Finally, we outline our conclusions in Section 6.
2 STAR-GALAXY CLASSIFICATION METHODS

In this section, we present four distinct star-galaxy classification techniques. The first method is a morphological separation method, which uses a hard cut in the half-light radius vs. magnitude plane. The second method is a supervised machine learning technique named TPC (Trees for Probabilistic Classification), which uses prediction trees and a random forest (Carrasco Kind & Brunner 2013). The third method is an unsupervised machine learning technique named SOMc, which uses self-organizing maps (SOMs) and a random atlas to provide a classification (Carrasco Kind & Brunner 2014b). The fourth method is a Hierarchical Bayesian (HB) template fitting technique based on the work by Fadely et al. (2012), which fits SED templates from star and galaxy libraries to an observed set of measured flux values.

Collectively, these four methods represent the majority of all standard star-galaxy classification approaches published in the literature. It is very likely that any new classification technique would be functionally similar to one of these four methods. Therefore, any of these four methods could in principle be replaced by a similar method.
2.1 Morphological separation

The simplest and perhaps the most widely used approach to star-galaxy classification is to make a hard cut in the space of photometric attributes. As a first-order morphological selection of point sources, we adopt a technique that is popular among the weak lensing community (Kaiser, Squires & Broadhurst 1995). As Figure 1 shows, there is a distinct locus produced by point sources in the half-light radius (estimated by SExtractor's FLUX_RADIUS parameter) vs. the i-band magnitude plane. A rectangular cut in this size-magnitude plane separates point sources, which are presumed to be stars, from resolved sources, which are presumed to be galaxies. The boundaries of the selection box are determined by manually inspecting the size-magnitude diagram.

One of the disadvantages of such cut-based methods is that they classify every source with absolute certainty. It is difficult to justify such a decisive classification near a survey's magnitude limits, where measurement uncertainties generally increase. A more informative approach is to provide probabilistic classifications. Although recent work by Henrion et al. (2011) implemented a probabilistic classification using a Bayesian approach on the morphological measurements alone, here we use a cut-based morphological separation to demonstrate the advantages of our combination techniques. In particular, we later show that the binary outputs (i.e., 0 or 1) of cut-based methods can be transformed into probability estimates by combining them with the probability outputs from other probabilistic classification techniques, such as TPC, SOMc, and HB.

Figure 1. Half-light radius vs. magnitude.
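As a concrete illustration, a rectangular size-magnitude cut like the one described above can be written in a few lines. The boundary values below are hypothetical placeholders, not the cut actually used for the CFHTLenS data:

```python
import numpy as np

def morphological_cut(half_light_radius, i_mag,
                      r_min=1.0, r_max=2.0, mag_max=21.0):
    """Binary star/galaxy split via a rectangular cut in the
    size-magnitude plane. Returns 1 for presumed stars (point
    sources inside the box) and 0 for presumed galaxies.

    The boundary values are illustrative; in practice they are set
    by manually inspecting the size-magnitude diagram.
    """
    is_star = ((half_light_radius > r_min) &
               (half_light_radius < r_max) &
               (i_mag < mag_max))
    return is_star.astype(int)

# Example: a compact bright source vs. an extended faint source
radii = np.array([1.5, 4.0])
mags = np.array([19.0, 23.5])
print(morphological_cut(radii, mags))  # -> [1 0]
```

Note that every source falls on one side of the box or the other, which is precisely the "absolute certainty" drawback discussed above.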
2.2 TPC

TPC is a parallel, supervised machine learning algorithm that uses prediction trees and random forest techniques (Breiman et al. 1984; Breiman 2001) to produce a star-galaxy classification. TPC is part of a publicly available software package called MLZ (Machine Learning for Photo-z; http://lcdm.astro.illinois.edu/code/mlz.html). The full software package includes: TPZ, a supervised photometric redshift (photo-z) estimation technique (regression mode; Carrasco Kind & Brunner 2013); TPC, a supervised star-galaxy classification technique (classification mode); SOMz, an unsupervised photo-z technique (regression mode; Carrasco Kind & Brunner 2014b); and SOMc, an unsupervised star-galaxy classification technique (classification mode).

TPC uses classification trees, a type of prediction tree designed to provide a classification or predict a discrete category. Prediction trees are built by asking a sequence of questions that recursively split the data into branches until a terminal leaf is created that meets a stopping criterion (e.g., a minimum leaf size). The optimal split dimension is decided by choosing the attribute that maximizes the information gain (IG), which is defined as

\[ IG(D_{\mathrm{node}}, X) = I_d(D_{\mathrm{node}}) - \sum_{x \in \mathrm{values}(X)} \frac{|D_{\mathrm{node},x}|}{|D_{\mathrm{node}}|} \, I_d(D_{\mathrm{node},x}), \quad (1) \]

where D_node is the training data in a given node, X is one of the possible dimensions (e.g., magnitudes or colors) along which the node is split, and x are the possible values of a specific dimension X. |D_node| and |D_node,x| are the size of the total training data and the number of objects in a given subset x within the current node, respectively. I_d is the impurity degree index, and TPC can calculate I_d from any of three standard impurity indices: information entropy, Gini impurity, and classification error.
In this work, we use the information entropy, which is defined similarly to the thermodynamic entropy:

\[ I_d(D) = -f_g \log f_g - (1 - f_g) \log (1 - f_g), \quad (2) \]

where f_g is the fraction of galaxies in the training data. At each node in our tree, we scan all dimensions to identify the split point that maximizes the information gain as defined by Equation 1, and select the attribute that maximizes the information gain overall.

In a technique called random forest, we create bootstrap samples (i.e., N randomly selected objects with replacement) from the input training data by sampling repeatedly from the magnitudes and colors using their measurement errors. We use these bootstrap samples to construct multiple, uncorrelated prediction trees whose individual predictions are aggregated to produce a star-galaxy classification for each source.

We also use a cross-validation technique called Out-of-Bag (OOB; Breiman et al. 1984; Carrasco Kind & Brunner 2013). When a tree (or a map) is built in TPC (or SOMc), a fraction of the training data, usually one-third, is left out and not used in training the trees (or maps). After a tree is constructed using two-thirds of the training data, the final tree is applied to the remaining one-third to make a classification. This process is repeated for every tree, and the predictions from each tree are aggregated for each object to make the final star-galaxy classification. We emphasize that if an object is used for training a given tree, it is never used for subsequent prediction by that tree. Thus, the OOB data provide an unbiased estimation of the errors and can be used as cross-validation data as long as the OOB data remain similar to the final test data set. The OOB technique can also provide extra information, such as a ranking of the relative importance of the input attributes used in the prediction.
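The split criterion of Equations 1 and 2 can be sketched directly. This toy implementation (the helper names are our own, not TPC's) computes the entropy impurity and the information gain of a candidate binary split:

```python
import numpy as np

def entropy(labels):
    """Impurity I_d of Equation 2: binary information entropy of the
    galaxy fraction f_g in a node (labels: 1 = galaxy, 0 = star)."""
    f_g = np.mean(labels)
    if f_g == 0.0 or f_g == 1.0:  # a pure node has zero entropy
        return 0.0
    return -f_g * np.log2(f_g) - (1 - f_g) * np.log2(1 - f_g)

def information_gain(labels, left_mask):
    """Equation 1 for a binary split: parent impurity minus the
    size-weighted impurities of the two child branches."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

labels = np.array([1, 1, 1, 0, 0, 0])  # 3 galaxies, 3 stars
perfect_split = np.array([True, True, True, False, False, False])
print(information_gain(labels, perfect_split))  # -> 1.0
```

A split that perfectly separates stars from galaxies yields the maximum gain of one bit, while a split that leaves each branch with the parent's galaxy fraction yields zero gain.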
The OOB technique can prove extremely valuable when calibrating the algorithm, when deciding which attributes to incorporate in the construction of the trees, and when combining this approach with other techniques.

2.3 SOMc

A self-organizing map (Kohonen 1990, 2001) is an unsupervised, artificial neural network algorithm that is capable of projecting high-dimensional input data onto a low-dimensional map through a process of competitive learning. In astronomical applications, the high-dimensional input data can be magnitudes, colors, or some other photometric attributes. The output map is usually chosen to be two-dimensional so that the resulting map can be used for visualizing various properties of the input data. The differences between a SOM and typical neural network algorithms are that a SOM is unsupervised, there are no hidden layers and therefore no extra parameters, and it produces a direct mapping between the training set and the output network. In fact, a SOM can be viewed as a non-linear generalization of a principal component analysis (PCA) algorithm (Yin 2008).

The key characteristic of a SOM is that it retains the topology of the input training set, revealing correlations between input data that are not obvious. The method is unsupervised: the user is not required to specify the desired output during the creation of the lower-dimensional map, and the mapping of the components from the input vectors is a natural outcome of the competitive learning process.

During the construction of a SOM, each node on the two-dimensional map is represented by a weight vector of the same dimension as the number of attributes used to create the map itself. In an iterative process, each object in the input sample is individually used to correct these weight vectors. This correction is determined so that the specific neuron (or node), which at a given moment best represents the input source, is modified along with the weight vectors of that node's neighboring neurons.
As a result, this sector within the map becomes a better representation of the current input object. This process is repeated for every object in the training data, and the entire process is repeated for several iterations. Eventually, the SOM converges to its final form, where the training data are separated into groups of similar features. Although the spectroscopic labels are not used at all in the learning process, they are used (only after the map has been constructed) to generate predictions for each cell in the resulting two-dimensional map.

In an approach similar to random forest in TPZ and TPC, SOMz uses a technique called random atlas to provide photo-z estimation (Carrasco Kind & Brunner 2014b). In random atlas, the prediction trees of random forest are replaced by maps, and each map is constructed from different bootstrap samples of the training data. Furthermore, we create random realizations of the training data by perturbing the magnitudes and colors by their measurement errors. For each map, we can either use all available attributes, or randomly select a subsample of the attribute space. This SOM implementation can also be applied to the classification problem, and we refer to it as SOMc in order to differentiate it from the photo-z estimation problem (regression mode). We also use the random atlas approach in some of the classification combination approaches, as discussed in Section 3.

One of the most important parameters in SOMc is the topology of the two-dimensional SOM, which can be rectangular, hexagonal, or spherical. In our SOM implementation, it is also possible to use periodic boundary conditions for the non-spherical cases. The spherical topology is by definition periodic and is constructed by using HEALPix (Górski et al. 2005). Similar to TPC, we use the OOB technique to make an unbiased estimation of errors.
We determine the optimal parameters by performing a grid search in the parameter space of different topologies, as well as other SOM parameters, for the OOB data. We find that the spherical topology gives the best performance for the CFHTLenS data, likely due to its natural periodicity. Thus, we use a spherical topology to classify stars and galaxies in the CFHTLenS data. For a complete description of the SOM implementation and its application to the estimation of photo-z probability density functions (photo-z PDFs), we refer the reader to Carrasco Kind & Brunner (2014b).
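The competitive-learning update described above can be sketched as follows. The decay schedules, map size, and rectangular topology here are illustrative choices for a toy example, not the SOMc/MLZ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def som_update(weights, x, t, n_iter, sigma0=2.0, eta0=0.5):
    """One competitive-learning step on a (rows, cols, dim) weight
    grid: find the best-matching unit (BMU) for input x, then pull
    the BMU and its neighbors toward x with a Gaussian neighborhood
    that shrinks over iterations. The decay schedules are
    illustrative placeholders.
    """
    rows, cols, dim = weights.shape
    frac = t / n_iter
    eta = eta0 * (1 - frac)             # decaying learning rate
    sigma = sigma0 * (1 - frac) + 0.5   # decaying neighborhood radius

    # BMU: the cell whose weight vector best represents x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))

    # Grid distance of every cell to the BMU (rectangular topology)
    r, c = np.indices((rows, cols))
    grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
    h = np.exp(-grid_d2 / (2 * sigma ** 2))   # neighborhood kernel
    weights += eta * h[..., None] * (x - weights)
    return weights

# Train a tiny 5x5 map on 2-D toy data for a few sweeps
W = rng.random((5, 5, 2))
data = rng.random((100, 2))
for t in range(10):
    for x in data:
        som_update(W, x, t, n_iter=10)
```

After training, each input object maps to its BMU, so objects with similar attributes land in the same region of the grid; spectroscopic labels can then be attached to cells after the fact, as described above.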
2.4 Hierarchical Bayesian template fitting

One of the most common methods to classify a source based on its observed magnitudes is template fitting. Template fitting algorithms do not require a spectroscopic training sample; there is no need for additional knowledge outside the observed data and the template SEDs. However, any incompleteness in our knowledge of the template SEDs that fully span the possible SEDs of observed sources may lead to misclassification of sources.

Bayesian algorithms use Bayesian inference to quantify the relative probability that each template matches the input photometry, and determine a probability estimate by computing the posterior that a source is a star or a galaxy. In this work, we have modified and parallelized a publicly available Hierarchical Bayesian (HB) template fitting algorithm by Fadely et al. (2012). In this section, we provide a brief description of the HB template fitting technique; for the details of the underlying HB approach, we refer the reader to Fadely et al. (2012).

We write the posterior probability that a source is a star as

\[ P(S \mid \boldsymbol{x}, \theta) \propto P(\boldsymbol{x} \mid S, \theta) \, P(S \mid \theta), \quad (3) \]

where x represents a given set of observed magnitudes. We have also introduced the hyperparameter θ, a nuisance parameter that characterizes our uncertainty in the prior distribution. To compute the likelihood that a source is a star, we marginalize over all star and galaxy templates T. In a template-fitting approach, we marginalize by summing the likelihood that a source has the set of magnitudes x for a given star template, as well as the likelihood for a given galaxy template:

\[ P(\boldsymbol{x} \mid S, \theta) = \sum_{t \in T} P(\boldsymbol{x} \mid S, t, \theta) \, P(t \mid S, \theta). \quad (4) \]
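As an illustration of the marginalization in Equation 4, the following toy sketch evaluates a Gaussian per-template likelihood in magnitude space and sums it against the template prior weights. This is a simplification: the full HB method also marginalizes over the template-fitting coefficient and, for galaxy templates, over redshift, and the template magnitudes and weights here are hypothetical:

```python
import numpy as np

def star_likelihood(mags, mag_errs, templates, template_weights):
    """Equation 4 in miniature: the likelihood that a source is a
    star, marginalized over the stellar template library. Each
    per-template likelihood P(x | S, t, theta) is modeled as a
    Gaussian in magnitude space.

    mags, mag_errs: (n_bands,) observed magnitudes and errors.
    templates: (n_templates, n_bands) model magnitudes.
    template_weights: prior P(t | S, theta), summing to 1 (Eq. 5).
    """
    chi2 = np.sum(((mags - templates) / mag_errs) ** 2, axis=1)
    norm = np.prod(1.0 / (np.sqrt(2 * np.pi) * mag_errs))
    per_template = norm * np.exp(-0.5 * chi2)   # P(x | S, t, theta)
    return np.sum(template_weights * per_template)

# Two hypothetical 5-band stellar templates; the first matches the
# observed magnitudes exactly, so it dominates the marginalized sum.
obs = np.array([20.0, 19.5, 19.2, 19.0, 18.9])
errs = np.full(5, 0.05)
temps = np.array([obs, obs + 1.0])
weights = np.array([0.5, 0.5])   # flat prior P(t | S, theta)
print(star_likelihood(obs, errs, temps, weights))
```

An analogous call over the galaxy library, with the posterior of Equation 3 formed from the two marginalized likelihoods and the prior weights, completes the star/galaxy probability.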
The likelihood of each template, P(x | S, t, θ), is itself marginalized over the uncertainty in the template-fitting coefficient. Furthermore, for galaxy templates, we introduce another step that marginalizes the likelihood by redshifting a given galaxy template by a factor of z.

The marginalization in Equation 4 requires that we specify the prior probability P(t | S, θ) that a source has a spectral template t (at a given redshift). Thus, the probability that a source is a star (or a galaxy) is either the posterior probability itself if a prior is used, or the likelihood itself if an uninformative prior is used. In a Bayesian analysis, it is preferable to use a prior, which can be directly computed either from physical assumptions, or from an empirical function calibrated by using a spectroscopic training sample. In an HB approach, the entire sample of sources is used to infer the prior probabilities for each individual source.

Since the templates are discrete in both SED shape and physical properties, we parametrize the prior probability of each template as a discrete set of weights such that

\[ \sum_{t \in T} P(t \mid S, \theta) = 1. \quad (5) \]

Similarly, we also parametrize the overall prior probability, P(S | θ), in Equation 3, as a weight. These weights correspond to the hyperparameters, which can be inferred by sampling the posterior probability distribution in the hyperparameter space. For the sampling, we use EMCEE, a Python implementation of the affine-invariant Markov Chain Monte Carlo (MCMC) ensemble sampler (Foreman-Mackey et al. 2013).

As the goal of template fitting methods is to minimize the difference between observed and theoretical magnitudes, this approach relies heavily on both the SED templates used and the accuracy of the transmission functions for the filters of a particular survey. For our stellar templates, we use the empirical SED library from Pickles (1998).
The Pickles library consists of 131 stellar templates, which span all normal spectral types and luminosity classes at solar abundance, as well as metal-poor and metal-rich F-K dwarf and G-K giant and supergiant stars. We supplement the stellar library with 100 SEDs from Chabrier et al. (2000), which include low-mass stars and brown dwarfs with different effective temperatures and surface gravities. We also include four white dwarf templates of Bohlin, Colina & Finley (1995), for a total of 235 templates in our final stellar library. For our galaxy templates, we use the four CWW spectra from Coleman, Wu & Weedman (1980), which include an Elliptical, an Sbc, an Scd, and an Irregular galaxy template. When extending an analysis to higher redshifts, the CWW library is often augmented with two starburst galaxy templates from Kinney et al. (1996). From the six original CWW and Kinney spectra, intermediate templates are created by interpolation, for a total of 51 SEDs in our final galaxy library.

All of the above templates are convolved with the filter response curves to generate model magnitudes. These response curves consist of the u, g, r, i, z filter transmission functions for the observations taken by the Canada-France-Hawaii Telescope (CFHT).

3 CLASSIFICATION COMBINATION METHODS

Building on work in the field of ensemble learning, we combine the predictions from individual star-galaxy classification techniques using four combination techniques. The main idea behind ensemble learning is to weight the predictions from individual models and combine them to obtain a prediction that outperforms every one of them individually (Rokach 2010).
Given the variety of star-galaxy classification methods we are using, we fully expect the relative performance of the individual techniques to vary across the parameter space spanned by the data. For example, it is reasonable to expect supervised techniques to outperform other techniques in areas of parameter space that are well-populated with training data. Similarly, we can expect unsupervised approaches such as SOM or template fitting approaches to generally perform better when a training sample is either sparse or unavailable.

We therefore adopt a binning strategy similar to Carrasco Kind & Brunner (2014a). In this binning strategy, we allow different classifier combinations in different parts of parameter space by creating two-dimensional SOM representations of the full nine-dimensional magnitude-color space: u, g, r, i, z, u−g, g−r, r−i, and i−z. A SOM representation can be rectangular, hexagonal, or spherical; here we choose a 10 × 10 rectangular topology to facilitate visualization, as shown in Figure 2. We note that this choice is mainly for convenience and that the optimal topology and map size would likely depend on a number of factors, such as the number of objects and attributes. For all combination methods, we use only the OOB (cross-validation) data contained in each cell to compute the relative weights for the base classifiers. The weights within individual cells are then applied to the blind test data set to make the prediction.

Furthermore, we construct a collection of SOM representations and subsequently combine the predictions from each map into a meta-prediction. Given a training sample of N sources, we generate N_R random realizations of the training data by perturbing the attributes with the measured uncertainty for each attribute. The uncertainties are assumed to be normally distributed. In this manner, we reduce the bias towards the data and introduce randomness in a systematic manner. For each random realization of a training sample, we create N_M bootstrap samples of size N to generate N_M different maps.

After all maps are built, we have a total of N_R × N_M probabilistic outputs for each of the N sources. To produce a single probability estimate for each source, we could take the mean, the median, or some other simple statistic. With a sufficient number of maps, we find that there is usually a negligible difference between taking the mean and taking the median, and we use the median in the following sections. We note that it is also possible to establish confidence intervals using the distribution of the probability estimates.

The simplest approach to combining different classification techniques is to simply add the individual classifications from the base classifiers and renormalize the sum. In this case, the final probability is given by

\[ P(S \mid \boldsymbol{x}, M) \propto \sum_i P(S \mid \boldsymbol{x}, M_i), \quad (6) \]

where M is the set of models (TPC, SOMc, HB, and morphological separation in our work).
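Equation 6 amounts to an unweighted average of the base-classifier probabilities, as in this minimal sketch (the probability values are made up for illustration):

```python
import numpy as np

def combine_unweighted(probs):
    """Equation 6: sum the star probabilities from the base
    classifiers and renormalize, i.e. an unweighted average.

    probs: (n_models, n_sources) array of P(star) from each model.
    """
    return probs.sum(axis=0) / probs.shape[0]

# Four base classifiers (e.g. morphology, TPC, SOMc, HB) voting on
# two sources; note the binary morphological output (first row)
# blends naturally with the probabilistic outputs.
p = np.array([[1.0, 0.0],
              [0.9, 0.2],
              [0.8, 0.1],
              [0.7, 0.3]])
print(combine_unweighted(p))  # -> [0.85 0.15]
```

This also illustrates the earlier point that the binary outputs of cut-based methods become probability estimates once combined with probabilistic classifiers.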
We improve on this simple approach by using the binning strategy to calculate the weighted average (WA) of objects in each SOM cell separately for each map, and then combine the predictions from each map into a final prediction.

After the multi-dimensional input data have been binned, we can use the cross-validation data to choose the best model within each bin, and use only that model within that specific bin to make predictions for the test data; we refer to this strategy as Best of Models (BoM). We use the mean squared error (MSE; also known as the Brier score; Brier 1950) as a classification error metric. We define the MSE as

\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2, \quad (7) \]

where ŷ_i is the actual truth value (e.g., 0 or 1) of the i-th datum, and y_i is the probability prediction made by the models. Thus, the model with the minimum MSE is chosen in each bin and is assigned a weight of one, with zero weight for all other models. However, the chosen model is allowed to vary between different bins.

Instead of selecting a single model that performs best within each bin, we can train a learning algorithm to combine the output values of several other base classifiers in each bin. An ensemble learning method of using a meta-classifier to combine lower-level classifiers is known as stacking or stacked generalization (Wolpert 1992). Although any arbitrary algorithm can theoretically be used as a meta-classifier, a logistic regression or a linear regression is often used in practice. In our work, we use a single-layer multi-response linear regression algorithm, which often shows the best performance (Breiman 1996; Ting & Witten 1999). This algorithm is a variant of the least-squares regression algorithm, where a linear regression model is constructed for each class.

We also use a model combination technique known as Bayesian Model Combination (BMC; Monteith et al. 2011), which uses Bayesian principles to generate an ensemble combination of different classifiers.
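The per-bin model selection with the MSE of Equation 7 can be sketched as follows; this is an illustrative implementation with hypothetical data, not the MLZ code:

```python
import numpy as np

def mse(y_pred, y_true):
    """Equation 7: mean squared error (Brier score) between
    probabilistic predictions and 0/1 truth labels."""
    return np.mean((y_pred - y_true) ** 2)

def best_model_per_bin(cell_ids, preds, truth, n_cells):
    """Best-of-models selection: within each SOM cell, pick the base
    classifier with the lowest MSE on the cross-validation (OOB)
    data; that model alone then predicts for test objects in the
    cell.

    cell_ids: (n_sources,) SOM cell index of each OOB object.
    preds: (n_models, n_sources) OOB probability predictions.
    truth: (n_sources,) 0/1 labels.
    Returns (n_cells,) winning model index per cell.
    """
    winners = np.zeros(n_cells, dtype=int)
    for c in range(n_cells):
        in_cell = cell_ids == c
        if not in_cell.any():
            continue  # empty cell: keep the default model 0
        errors = [mse(p[in_cell], truth[in_cell]) for p in preds]
        winners[c] = int(np.argmin(errors))
    return winners

# Two models, four objects in two cells: model 0 is better in cell
# 0 and model 1 in cell 1 (the values are illustrative).
cells = np.array([0, 0, 1, 1])
truth = np.array([1, 0, 1, 0])
preds = np.array([[0.9, 0.1, 0.4, 0.6],   # model 0
                  [0.6, 0.4, 0.9, 0.1]])  # model 1
print(best_model_per_bin(cells, preds, truth, n_cells=2))  # -> [0 1]
```

The winner-takes-all weights (one for the chosen model, zero otherwise) vary from bin to bin, which is what distinguishes BoM from picking a single globally best classifier.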
The posterior probability that a source is a star is givenby P ( S | x , D , M , E ) = (cid:88) e ∈ E P ( S | x , M , e ) P ( e | D ) , (8)where D is the data set, and e is an element in the ensemble space E of possible model combinations. By Bayes’ Theorem, the posteriorprobability of e given D is given by P ( e | D ) = P ( e ) P ( D ) (cid:89) d ∈ D P ( d | e ) ∝ P ( e ) (cid:89) d ∈ D P ( d | e ) . (9)Here, P ( e ) is the prior probability of e , which we assume to be uni-form. The product of P ( d | e ) is over all individual data d in the train-ing data D , and P ( D ) is merely a normalization factor and not im-portant.For binary classifiers whose output is either zero or one (e.g., acut-based morphological separation), we assume that each exampleis corrupted with an average error rate (cid:15) . This means that P ( d | e ) =1 − (cid:15) if the combination e correctly predicts class ˆ y i for the i th ob-ject, and P ( d | e ) = (cid:15) if it predicts an incorrect class. The averagerate (cid:15) can be estimated by the fraction ( M g + M s ) /N , where M g isthe number of true galaxies classified as stars, M s is the number oftrue stars classified as galaxies, and N is the total number of sources.Equation 9 then becomes P ( e | D ) ∝ P ( e ) (1 − (cid:15) ) N − M s − M g ( (cid:15) ) M s + M g . (10)For probabilistic classifiers, we can directly use the probabilistic pre-dictions and write Equation 9 as P ( e | D ) ∝ P ( e ) N − (cid:89) i =0 ˆ y i y i + (1 − ˆ y i ) (1 − y i ) . (11)Although the space E of potential model combinations is inprinciple infinite, we can produce a reasonable finite set of potentialmodel combinations by using sampling techniques. In our implemen-tation, the weights of each combination of the base classifiers is ob-tained by sampling from a Dirichlet distribution. We first set all alphavalues of a Dirichlet distribution to unity. We then sample this distri-bution q times to obtain q sets of weights. 
For each combination, we assume a uniform prior and calculate P(e | D) using Equation 10 or 11. We select the combination with the highest P(e | D), and update the alpha values by adding the weights of the most probable combination to the current alpha values. The next q sets of weights are drawn using the updated alpha values.

We continue the sampling process until we reach a predefined number of combinations, and finally use Equation 8 to compute the posterior probability that a source is a star (or a galaxy). In this paper, we use a q value of three, and 1,000 model combinations are considered.

We also use a binned version of the BMC technique, where we use a SOM representation to apply different model combinations to different regions of the parameter space. We note, however, that introducing randomness through the construction of N_R × N_M different SOM representations does not show significant improvement over using only one single SOM representation. This similarity is likely due to the randomness that has already been introduced by sampling from the Dirichlet distribution. Thus, our BMC technique uses one SOM, while the other base models (WA, BoM, and stacking) generate N_R random realizations of N_M maps.

4 DATA

We use photometric data from the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS; Heymans et al. 2012; Erben et al. 2013; Hildebrandt et al. 2012). This catalog consists of more than twenty-five million objects with a limiting magnitude of i_AB ≈ …. It covers a total of 154 square degrees in the four fields (named W1, W2, W3, and W4) of the CFHT Legacy Survey (CFHTLS; Gwyn 2012), observed in the five photometric bands: u, g, r, i, and z.

We have cross-matched reliable spectroscopic galaxies from the Deep Extragalactic Evolutionary Probe Phase 2 (DEEP2; Davis et al. 2003; Newman et al. 2013), the Sloan Digital Sky Survey Data Release 10 (Ahn et al.
2014, SDSS-DR10), the VIsible imaging Multi-Object Spectrograph (VIMOS) Very Large Telescope (VLT) DeepSurvey (VVDS; Le F`evre et al. 2005; Garilli et al. 2008), and theVIMOS Public Extragalactic Redshift Survey (VIPERS; Garilli et al.2014). We have selected only sources with very secure redshifts andno bad flags (quality flags -1, 3, and 4 for DEEP2; quality flag 0 forSDSS; quality flags 3, 4, 23, and 24 for VIPERS and VVDS). In theend, we have 8,545 stars and 57,843 galaxies available for the trainingand testing processes. We randomly select 13,278 objects for the blindtesting set, and use the remainder for training and cross-validation.While HB uses only the magnitudes in the five bands, u , g , r , i , and z , TPC and SOMc are trained with a total of 9 attributes: the fivemagnitudes and their corresponding colors, u − g , g − r , r − i , and i − z . The morphological separation method uses SE XTRACTOR ’sFLUX RADIUS parameter provided by the CFHTLenS catalog.Our goal here is not to obtain the best classifier performance; (cid:13)000
10 rectangular topology to facilitate visualization, as shown in Figure 2. We note that this choice is mainly for convenience, and that the optimal topology and map size would likely depend on a number of factors, such as the number of objects and attributes. For all combination methods, we use only the OOB (cross-validation) data contained in each cell to compute the relative weights for the base classifiers. The weights within individual cells are then applied to the blind test data set to make the prediction.

Furthermore, we construct a collection of SOM representations and subsequently combine the predictions from each map into a meta-prediction. Given a training sample of N sources, we generate N_R random realizations of the training data by perturbing the attributes with the measured uncertainty for each attribute. The uncertainties are assumed to be normally distributed. In this manner, we reduce the bias towards the data and introduce randomness in a systematic manner. For each random realization of a training sample, we create N_M bootstrap samples of size N to generate N_M different maps.

After all maps are built, we have a total of N_R × N_M probabilistic outputs for each of the N sources. To produce a single probability estimate for each source, we could take the mean, the median, or some other simple statistic. With a sufficient number of maps, we find that there is usually a negligible difference between the mean and the median, and we use the median in the following sections. We note that it is also possible to establish confidence intervals using the distribution of the probability estimates.

The simplest approach to combining different classification techniques is to add the individual classifications from the base classifiers and renormalize the sum. In this case, the final probability is given by

    P(S | x, M) = ∑_i P(S | x, M_i),   (6)

where M is the set of models (TPC, SOMc, HB, and morphological separation in our work).
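The realization-and-bootstrap loop described above (N_R noisy realizations, each resampled N_M times, combined by the median) can be sketched as follows. This is a minimal numpy illustration, not the authors' code; `fit_map` is a hypothetical callable standing in for whatever routine trains a single SOM-like model and returns a predict-probability function.

```python
import numpy as np

rng = np.random.default_rng(42)

def build_ensemble(train_x, train_err, fit_map, n_r=10, n_m=5):
    """Train N_R x N_M maps: N_R Gaussian-noise realizations of the
    training attributes, each resampled N_M times with the bootstrap.
    `fit_map` (hypothetical) trains one model on an attribute array
    and returns a predict-probability function."""
    maps = []
    n = len(train_x)
    for _ in range(n_r):
        # perturb each attribute with its measured (assumed Gaussian) error
        realization = train_x + rng.normal(size=train_x.shape) * train_err
        for _ in range(n_m):
            boot = realization[rng.integers(0, n, size=n)]  # bootstrap, size N
            maps.append(fit_map(boot))
    return maps

def combine(maps, test_x):
    """Collapse the N_R x N_M probabilistic outputs to one estimate
    per source; the median is used, as in the text."""
    probs = np.stack([m(test_x) for m in maps])
    return np.median(probs, axis=0)
```

The spread of the per-map probabilities can also be kept to form the confidence intervals mentioned above.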
We improve on this simple approach by using the binning strategy to calculate the weighted average of objects in each SOM cell separately for each map, and then combining the predictions from each map into a final prediction.

After the multi-dimensional input data have been binned, we can use the cross-validation data to choose the best model within each bin, and use only that model within that specific bin to make predictions for the test data. We use the mean squared error (MSE; also known as the Brier score, Brier 1950) as a classification error metric. We define the MSE as

    MSE = (1/N) ∑_{i=0}^{N−1} (y_i − ŷ_i)²,   (7)

where ŷ_i is the actual truth value (e.g., 0 or 1) of the i-th object, and y_i is the probability prediction made by the models. Thus, the model with the minimum MSE is chosen in each bin and assigned a weight of one, with a weight of zero for all other models. However, the chosen model is allowed to vary between different bins.

Instead of selecting the single model that performs best within each bin, we can train a learning algorithm to combine the output values of several other base classifiers in each bin. An ensemble learning method that uses a meta-classifier to combine lower-level classifiers is known as stacking or stacked generalization (Wolpert 1992). Although any arbitrary algorithm can in theory be used as a meta-classifier, a logistic regression or a linear regression is often used in practice. In our work, we use a single-layer multi-response linear regression algorithm, which often shows the best performance (Breiman 1996; Ting & Witten 1999). This algorithm is a variant of the least-squares regression algorithm, where a linear regression model is constructed for each class.

We also use a model combination technique known as Bayesian Model Combination (BMC; Monteith et al. 2011), which uses Bayesian principles to generate an ensemble combination of different classifiers.
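The per-bin selection rule of Equation 7 can be illustrated with a short sketch (our own minimal implementation, not the authors' code; the model names and probabilities below are hypothetical):

```python
import numpy as np

def mse(y_true, y_prob):
    """Equation 7: the Brier score between truth labels and probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_true - y_prob) ** 2))

def best_model_per_bin(cell_ids, y_true, model_probs):
    """BoM rule: in each SOM cell, give weight one to the model with the
    lowest cross-validation MSE and weight zero to all others.
    `model_probs` maps a model name to its per-object probabilities."""
    choice = {}
    for cell in np.unique(cell_ids):
        mask = cell_ids == cell
        choice[int(cell)] = min(
            model_probs,
            key=lambda name: mse(y_true[mask], model_probs[name][mask]))
    return choice
```

The chosen model can then differ from cell to cell, as described above.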
The posterior probability that a source is a star is given by

    P(S | x, D, M, E) = ∑_{e ∈ E} P(S | x, M, e) P(e | D),   (8)

where D is the data set, and e is an element in the ensemble space E of possible model combinations. By Bayes' Theorem, the posterior probability of e given D is

    P(e | D) = [P(e) / P(D)] ∏_{d ∈ D} P(d | e) ∝ P(e) ∏_{d ∈ D} P(d | e).   (9)

Here, P(e) is the prior probability of e, which we assume to be uniform. The product of P(d | e) is over all individual data d in the training data D, and P(D) is merely a normalization factor and is not important here.

For binary classifiers whose output is either zero or one (e.g., a cut-based morphological separation), we assume that each example is corrupted with an average error rate ε. This means that P(d | e) = 1 − ε if the combination e correctly predicts class ŷ_i for the i-th object, and P(d | e) = ε if it predicts an incorrect class. The average error rate ε can be estimated by the fraction (M_g + M_s)/N, where M_g is the number of true galaxies classified as stars, M_s is the number of true stars classified as galaxies, and N is the total number of sources. Equation 9 then becomes

    P(e | D) ∝ P(e) (1 − ε)^(N − M_s − M_g) ε^(M_s + M_g).   (10)

For probabilistic classifiers, we can directly use the probabilistic predictions and write Equation 9 as

    P(e | D) ∝ P(e) ∏_{i=0}^{N−1} [ŷ_i y_i + (1 − ŷ_i)(1 − y_i)].   (11)

Although the space E of potential model combinations is in principle infinite, we can produce a reasonable finite set of potential model combinations by using sampling techniques. In our implementation, the weights of each combination of the base classifiers are obtained by sampling from a Dirichlet distribution. We first set all alpha values of the Dirichlet distribution to unity. We then sample this distribution q times to obtain q sets of weights.
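Concretely, the adaptive sampling might look as follows: draw q weight vectors from the Dirichlet, score each combination against the training data with Equation 11, fold the best draw's weights back into the alphas, and repeat. This is our own rough numpy sketch, not the authors' implementation; we work with the logarithm of Equation 11 for numerical stability, and return the posterior-weighted mean weight vector, which is equivalent to Equation 8 because the combined prediction is linear in the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def bmc_weights(base_probs, y_true, q=3, n_comb=1000):
    """Sample model-weight vectors from an adapting Dirichlet, score each
    combination with the log of Equation 11, and return the posterior-
    weighted mean weight vector. `base_probs` has one row of star
    probabilities per base classifier."""
    alpha = np.ones(base_probs.shape[0])      # flat Dirichlet to start
    combos, log_posts = [], []
    while len(combos) < n_comb:
        batch = rng.dirichlet(alpha, size=q)  # q candidate weight vectors
        scores = []
        for w in batch:
            yhat = w @ base_probs             # combined star probabilities
            like = yhat * y_true + (1.0 - yhat) * (1.0 - y_true)
            scores.append(np.sum(np.log(np.clip(like, 1e-12, None))))
        alpha += batch[int(np.argmax(scores))]  # adapt toward the best draw
        combos.extend(batch)
        log_posts.extend(scores)
    log_posts = np.asarray(log_posts)
    p_e = np.exp(log_posts - log_posts.max())  # uniform prior P(e) cancels
    p_e /= p_e.sum()
    return p_e @ np.asarray(combos)
```

With an informative classifier among the inputs, the returned weights concentrate on it, as one would expect.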
For each combination, we assume a uniform prior and calculate P(e | D) using Equation 10 or 11. We select the combination with the highest P(e | D), and update the alpha values by adding the weights of the most probable combination to the current alpha values. The next q sets of weights are drawn using the updated alpha values. We continue the sampling process until we reach a predefined number of combinations, and finally use Equation 8 to compute the posterior probability that a source is a star (or a galaxy). In this paper, we use a q value of three, and 1,000 model combinations are considered.

We also use a binned version of the BMC technique, where we use a SOM representation to apply different model combinations in different regions of the parameter space. We note, however, that introducing randomness through the construction of N_R × N_M different SOM representations does not show significant improvement over using a single SOM representation. This similarity is likely due to the randomness already introduced by sampling from the Dirichlet distribution. Thus, our BMC technique uses one SOM, while the other combination methods (WA, BoM, and stacking) generate N_R random realizations of N_M maps.

We use photometric data from the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS; Heymans et al. 2012; Erben et al. 2013; Hildebrandt et al. 2012). This catalog consists of more than twenty-five million objects with a limiting magnitude of i_AB ≈ …. It covers a total of 154 square degrees in the four fields (named W1, W2, W3, and W4) of the CFHT Legacy Survey (CFHTLS; Gwyn 2012), observed in five photometric bands: u, g, r, i, and z.

We have cross-matched reliable spectroscopic sources from the Deep Extragalactic Evolutionary Probe Phase 2 (DEEP2; Davis et al. 2003; Newman et al. 2013), the Sloan Digital Sky Survey Data Release 10 (Ahn et al.
2014, SDSS-DR10), the VIsible imaging Multi-Object Spectrograph (VIMOS) Very Large Telescope (VLT) Deep Survey (VVDS; Le Fèvre et al. 2005; Garilli et al. 2008), and the VIMOS Public Extragalactic Redshift Survey (VIPERS; Garilli et al. 2014). We have selected only sources with very secure redshifts and no bad flags (quality flags -1, 3, and 4 for DEEP2; quality flag 0 for SDSS; quality flags 3, 4, 23, and 24 for VIPERS and VVDS). In the end, we have 8,545 stars and 57,843 galaxies available for the training and testing processes. We randomly select 13,278 objects for the blind testing set, and use the remainder for training and cross-validation.

While HB uses only the magnitudes in the five bands u, g, r, i, and z, TPC and SOMc are trained with a total of nine attributes: the five magnitudes and their corresponding colors, u − g, g − r, r − i, and i − z. The morphological separation method uses SExtractor's FLUX_RADIUS parameter provided by the CFHTLenS catalog.

Our goal here is not to obtain the best classifier performance; for this, we would have fine-tuned the individual base classifiers and chosen sophisticated models best suited to the particular properties of the CFHTLenS data. For example, Hildebrandt et al. (2012) suggest that all objects with i > … in the CFHTLenS data set may be classified as galaxies without significant incompleteness and contamination in the galaxy sample. Although this approach works because the high Galactic latitude fields of the CFHTLS contain relatively few stars, it is very unlikely that such an approach will meet the science requirements for the quality of star-galaxy classification in lower-latitude, star-crowded fields. Rather, our goal for the CFHTLenS data set is to demonstrate the usefulness of combining different classifiers even when the base classifiers may be poor or trained on partial data.
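For concreteness, the nine-attribute input used by TPC and SOMc can be assembled from the five magnitudes like this (a trivial numpy sketch; the array layout and function name are our own choices):

```python
import numpy as np

def build_attributes(mags):
    """Assemble the nine TPC/SOMc inputs from the five CFHTLenS
    magnitudes (columns ordered u, g, r, i, z): the magnitudes
    themselves plus the adjacent colors u-g, g-r, r-i, i-z."""
    u, g, r, i, z = mags.T
    colors = np.column_stack([u - g, g - r, r - i, i - z])
    return np.column_stack([mags, colors])
```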
We also note that the relatively small number of stars in the CFHTLS fields might paint too positive a picture of completeness and purity, especially for the stars. Thus, we caution the reader that the specific completeness and purity values will likely vary in other surveys that observe large portions of the sky, and we emphasize once again that our aim is to highlight the relative improvement in performance when multiple star-galaxy classification techniques are combined to generate a meta-classification.

In this section, we present the classification performance of the four different combination techniques, as well as the individual star-galaxy classification techniques, on the CFHTLenS test data.
Probabilistic classification models can be considered as functions that output a probability estimate for each source to be in one of the classes (e.g., a star or a galaxy). Although the probability estimate can be used as a weight in subsequent analyses to improve or enhance a particular measurement (Ross et al. 2011), it can also be converted into a class label by using a threshold (a probability cut). The simplest way to choose the threshold is to set it to a fixed value, e.g., p_cut = 0.5. This is, in fact, what is often done (e.g., Henrion et al. 2011; Fadely et al. 2012). However, 0.5 is not the best choice of threshold for an unbalanced data set, where galaxies outnumber stars. Furthermore, setting a fixed threshold ignores the operating condition (e.g., science requirements, stellar distribution, misclassification costs) under which the model will be applied.

When we have no information about the operating condition when evaluating the performance of classifiers, there are effective tools such as the Receiver Operating Characteristic (ROC) curve (Swets, Dawes & Monahan 2000). An ROC curve is a graphical plot that illustrates the true positive rate versus the false positive rate of a binary classifier as its classification threshold is varied. The Area Under the Curve (AUC) summarizes the curve in a single number and can be used as an assessment of the overall performance.
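The AUC can be computed without tracing the curve at all, via the equivalent rank statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative. The following self-contained numpy sketch uses that Mann-Whitney formulation (this is not necessarily how the authors computed their AUC values, and the O(n²) pairwise comparison is for illustration only):

```python
import numpy as np

def roc_auc(y_true, y_prob):
    """AUC via the rank (Mann-Whitney) formulation: the probability that
    a randomly chosen star is scored above a randomly chosen galaxy,
    with ties counting one half."""
    pos = y_prob[y_true == 1]                 # scores of true stars
    neg = y_prob[y_true == 0]                 # scores of true galaxies
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a reversed ranking 0.0, and an uninformative classifier 0.5, independent of any particular threshold.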
In astronomical applications, the operating condition usually translates to the completeness and purity requirements of the star or galaxy sample. We define the galaxy completeness c_g (also known as recall or sensitivity) as the fraction of the number of true galaxies classified

Table 1.
The definition of the classification performance metrics.

Metric          Meaning
AUC             Area under the Receiver Operating Characteristic curve
MSE             Mean squared error
c_g             Galaxy completeness
p_g             Galaxy purity
c_s             Star completeness
p_s             Star purity
p_g(c_g = x)    Galaxy purity at galaxy completeness x
p_s(c_s = x)    Star purity at star completeness x

as galaxies out of the total number of true galaxies,

    c_g = N_g / (N_g + M_g),   (12)

where N_g is the number of true galaxies classified as galaxies, and M_g is the number of true galaxies classified as stars. We define the galaxy purity p_g (also known as precision or positive predictive value) as the fraction of the number of true galaxies classified as galaxies out of the total number of objects classified as galaxies,

    p_g = N_g / (N_g + M_s),   (13)

where M_s is the number of true stars classified as galaxies. Star completeness and purity are defined in a similar manner.

One of the advantages of a probabilistic classification is that the threshold can be adjusted to produce a more complete but less pure sample, or a less complete but purer one. To compare the performance of the probabilistic classification techniques with that of morphological separation, which has a fixed completeness (c_g = …, c_s = …) at a certain purity (p_g = …, p_s = …), we adjust the threshold of each probabilistic classifier until its galaxy completeness c_g matches that of morphological separation, and compute the galaxy purity p_g at that completeness. Similarly, the star purity p_s is computed by adjusting the threshold until the star completeness of each classifier equals that of morphological separation.

We can also compare the performance of different classification techniques by assuming an arbitrary operating condition. For example, weak lensing science measurements of the DES require c_g > … and p_g > … to control both the statistical and systematic errors on the cosmological parameters, and c_s > … and p_s > …
for stellar Point Spread Function (PSF) calibration (Soumagnac et al. 2015). Although these values will likely be different for the science cases of the CFHTLenS data, we adopt them to compare the classification performance at a reasonable operating condition. Thus, we compute p_g at c_g = … and p_s at c_s = …. We also use the MSE defined in Equation 7 as a classification error metric.

We present in Table 2 the classification performance obtained by applying the four different combination techniques, as well as the individual star-galaxy classification techniques, on the CFHTLenS test data. The bold entries highlight the best technique for any particular metric. The first four rows show the performance of the four individual star-galaxy classification techniques. Given high-quality training data, it is not surprising that our supervised machine learning technique, TPC, outperforms the other techniques. TPC is thus shown in the first row as the benchmark.

The simplest of the combination techniques, WA and BoM, generally do not perform better than TPC. It is also interesting that, even

Table 2.
A summary of the classification performance metrics for the four individual methods and the four different classification combination methods, as applied to the CFHTLenS data with no cut applied to the training data set. The definition of the metrics is summarized in Table 1. The bold entries highlight the best performance values within each column. Note that some objects in the test set have bad or missing values (e.g., −99 or 99) in one or more attributes, which are included here (but are omitted, for example, in Figure 5 when the corresponding attribute is not available).

Classifier   AUC     MSE     p_g(c_g = …)  p_s(c_s = …)  p_g(c_g = …)  p_s(c_s = …)
TPC          …       …       …             …             …             …
BMC          0.9852  …       …             …             …             …

Figure 2.
A two-dimensional 10 ×
10 SOM representation showing the mean i-band magnitude (top left), the fraction of true stars in each cell (top right), and the mean values of u − g (bottom left) and g − r (bottom right) for the cross-validation data.

with binning the parameter space and selecting the best model within each bin, BoM almost always chooses TPC as the best model in all bins, and therefore gives the same performance as TPC in the end. However, our BMC and stacking techniques have similar performance and often outperform TPC. Although TPC shows the best performance as measured by the AUC, BMC shows the best performance in all other metrics.

In Figure 2, we show in the top left panel the mean CFHTLenS i-band magnitude in each cell, and in the top right panel the fraction of stars in each cell. The bottom two panels show the mean u − g and g − r colors in each cell. These two-dimensional maps clearly show the ability of the SOM to preserve relationships between sources when it projects the full nine-dimensional space onto the two-dimensional map. We note that these SOM maps should only be used to provide guidance, as the SOM mapping is a non-linear representation of all magnitudes and colors.

We can also use the same SOM from Figure 2 to determine the relative weights for the four individual classification methods in each cell. We present the four weight maps for the BMC technique in Figure 3. In these maps, a darker color indicates a higher weight, or equivalently that the corresponding classifier performs better in that region. These weight maps demonstrate the variation in the performance of the individual techniques across the two-dimensional parameter space defined by the SOM. Furthermore, since the maps in Figures 2 and 3 are constructed using the same SOM, we can determine the regions in the parameter space where each individual tech-

Figure 3.
A two-dimensional 10 ×
10 SOM representation showing the relative weights for the BMC combination technique applied to the four individual methods for the CFHTLenS data.

nique performs better or worse. Not surprisingly, the morphological separation performs best in the top left corner of the weight map in Figure 3, which corresponds to the brightest CFHTLenS magnitudes, i ≲ …, in the i-band magnitude map of Figure 2. It is also clear that the SOM cells where the morphological separation performs best have a higher stellar fraction than the other cells. On the other hand, TPC seems to perform best in the region that corresponds to intermediate magnitudes, … ≲ i ≲ …, and … ≲ u − g ≲ …. Our unsupervised learning method SOMc performs relatively better at fainter magnitudes, i ≳ …, with … ≲ u − g ≲ … and … ≲ g − r ≲ …. Although HB shows the worst performance when there exists a high-quality training data set, BMC still utilizes information from HB, especially at intermediate magnitudes, … ≲ i ≲ …. Another interesting pattern is that the four techniques seem complementary, and they are weighted most strongly in different regions of the SOM representation.

In Figure 4, we compare the star and galaxy purity values for BMC, TPC, and morphological separation as functions of i-band magnitude. We use kernel density estimation (KDE; Silverman 1986) with a Gaussian kernel to smooth the fluctuations in the distribution. Although morphological separation shows a slightly better performance in galaxy purity at bright magnitudes, i ≲ …, BMC outperforms both TPC and morphological separation at faint magnitudes, i ≳ …. As the top panel shows, the number count distribution peaks at i ∼ …, and BMC therefore outperforms both TPC and morphological separation for the majority of objects. It is also clear that BMC outperforms TPC over all magnitudes.
BMC can presumably accomplish this by combining information from all base classifiers, e.g., giving more weight to the morphological separation method at bright

Figure 4.
Purity as a function of the i-band magnitude, as estimated by the kernel density estimation (KDE) method. The top panel shows the histogram with a bin size of 0.1 mag and the KDE for objects in the test set. The second panel shows the fraction of stars estimated by KDE as a function of magnitude. The bottom two panels compare the galaxy and star purity values for BMC, TPC, and morphological separation as functions of magnitude. Results for BMC, TPC, and morphological separation are in blue, green, and red, respectively. The 1σ confidence bands are estimated by bootstrap sampling.

magnitudes. The bottom panel shows that the star purity of morphological separation drops to p_s < … at fainter magnitudes, i > …. This is expected, as our crude morphological separation classifies every object as a galaxy beyond i > …, and purity measures the number of true stars classified as stars. It is again clear that BMC outperforms both TPC and morphological separation in star purity values over all magnitudes.

In Figure 5, we show the cumulative galaxy and star purity values as functions of magnitude. Although morphological separation performs better than TPC at bright magnitudes, its purity values decrease as the magnitudes become fainter, and TPC eventually outperforms morphological separation by 1–2% at i > …. BMC clearly outperforms both TPC and morphological separation, and it maintains an overall galaxy purity of 0.980 up to i ∼ ….

We also show the star and galaxy purity values as functions of the photometric redshift estimate in Figure 6. The photo-z is estimated with the BPZ algorithm (Benítez 2000) and provided with the CFHTLenS photometric redshift catalogue (Hildebrandt et al. 2012). The advantage of BMC over TPC and morphological separation is now

Figure 5.
Cumulative purity as a function of the i-band magnitude. The upper panel compares the galaxy purity values for BMC (blue solid line), TPC (green dashed line), and morphological separation (red dashed line). The lower panel compares the star purity. The 1σ error bars are computed following the method of Paterno (2003) to avoid the unphysical errors of binomial or Poisson statistics.

more pronounced in Figure 6. Although the morphological separation method outperforms BMC at bright magnitudes in Figure 4, it is clear that BMC outperforms both TPC and morphological separation over all redshifts. We also present in Figure 7 how the star and galaxy purity values vary as a function of g − r color. It is again clear that BMC outperforms both TPC and morphological separation over all g − r colors.

In Figure 8, we show the distribution of P(S), the posterior probability that an object is a star, for BMC, TPC, and morphological separation. It is interesting that the BMC technique assigns a posterior star probability P(S) ≲ … to significantly more true galaxies than TPC, and a probability P(S) ≳ … to significantly fewer true galaxies. By utilizing information from different types of classification techniques in different parts of the parameter space, BMC becomes more certain whether an object is a star or a galaxy, resulting in an improvement of the overall performance.

It is very costly in terms of telescope time to obtain a large sample of spectroscopic observations down to the limiting magnitude of a photometric sample. Thus, we investigate the impact of training set quality by considering a more realistic case where the training data set is available only for a small number of objects with bright magnitudes. To emulate this scenario, we only use objects that have spectroscopic labels from the VVDS 0226-04 field (which is located within the CFHTLS W1 field) and impose a magnitude cut of i < … in the training data, leaving us a training set with only 1,365 objects.
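The bootstrap confidence bands quoted in the figure captions can be estimated along these lines (a generic sketch, not the authors' code; the number of resamples and the 68.3% level are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_purity_band(y_true, y_pred_galaxy, n_boot=500, level=68.3):
    """Bootstrap confidence interval for the galaxy purity
    p_g = N_g / (N_g + M_s), with galaxies labeled 1 and stars 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred_galaxy)
    n = len(y_true)
    purities = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        t, p = y_true[idx], y_pred[idx]
        n_g = np.sum((t == 1) & (p == 1))     # true galaxies kept
        m_s = np.sum((t == 0) & (p == 1))     # stars leaking in
        if n_g + m_s > 0:                     # skip empty galaxy samples
            purities.append(n_g / (n_g + m_s))
    lo, hi = np.percentile(purities, [50 - level / 2, 50 + level / 2])
    return lo, hi
```

Applied per magnitude (or photo-z, or color) bin, this yields the shaded bands in the purity panels.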
We apply the same four star-galaxy classification techniques and four combination methods, and measure the performance of each technique on the same test data set as in Section 5.2. As the top two panels of Figures 11, 13, and 14 show, the demographics of objects in the training

Table 3.
A summary of the classification performance metrics for the four individual methods and the four different classification combination methods when the training data set consists of only the sources that are in the CFHTLS W1 field, have spectroscopic labels available from VVDS, and have i < …. The definition of the metrics is summarized in Table 1. The bold entries highlight the best performance values within each column. Note that some objects in the test set have bad or missing values (e.g., −99 or 99) in one or more attributes, which are included here (but are omitted, for example, in Figure 12 when the corresponding attribute is not available).

Classifier   AUC     MSE     p_g(c_g = …)  p_s(c_s = …)  p_g(c_g = …)  p_s(c_s = …)
TPC          0.9399  0.0511  0.9350        0.7060        0.9570        0.9747
SOMc         0.8861  0.0989  0.8843        0.4316        0.9165        0.6263
HB           0.9386  0.0760  0.9325        0.6911        0.9424        0.6918
Morphology   -       0.0397  0.9597        0.9666        -             -
WA           0.9600  0.0536  0.9208        0.8818        0.9757        0.9815
BoM          0.9587  0.1511  0.9658        0.9862        0.9790        0.9977
Stacking     0.9442  0.1847  0.9561        0.9309        0.9664        0.9983
BMC          …       …       …             …             …             …

Figure 6.
Similar to Figure 4, but as a function of photo-z. The bin size of the histogram in the top panel is 0.02.

set are different from the distribution of sources in the test set. Thus, this also serves as a test of the efficacy of heterogeneous training.

We present in Table 3 the same six metrics for each method, and highlight the best method for each metric. Overall, the results obtained for the reduced data set are remarkable. With a smaller training set, our training based methods, TPC and SOMc, suffer a significant decrease in performance. The performance of morphological separation and HB is essentially unchanged from Table 2, as they do not depend on the training data. Without sufficient training data, the ad-

Figure 7.
Similar to Figure 4, but as a function of g − r color. The bin size of the histogram in the top panel is 0.05.

vantage of combining the predictions of different classifiers is more obvious. Even WA, the simplest of the combination techniques, outperforms all individual classification techniques in four metrics: AUC, p_s at c_s = …, p_g at c_g = …, and p_s at c_s = …. Although BoM always chooses TPC as the best model when we have a high-quality training set, it now chooses various methods in different bins and outperforms all base classifiers. While the performance of the stacking technique is only slightly worse than that of BMC when we have a high-quality training set, stacking now fails to outperform
Figure 8.
Histogram of the posterior probability that a source is a star, for morphological separation (top), TPC (middle), and BMC (bottom), for a high-quality training data set. The true galaxies are in green, and the true stars are in blue. The bin size is 0.05.

Figure 9.
Similar to Figure 2, but for the reduced training data set.

morphological separation. BMC shows an impressive performance and outperforms all other classification techniques in all six metrics. Overall, the improvements are small but still significant, since these metrics are averaged over the full test data.

In Figure 10, we again show the 10 × 10 two-dimensional weight map defined by the SOM. When the quality of the training data is relatively poor, the performance of training based algorithms will decrease, while the performance of template fitting algorithms or morphological separation methods is independent of the training data. Thus, when the weight maps of Figure 3 and Figure 10 are visually compared, it is clear that the BMC algorithm now uses more information from morphological separation and HB, while it uses considerably less information from our training based algorithms, TPC and SOMc. Not surprisingly, the morphological separation method performs best

Figure 10.
Similar to Figure 3 but for the reduced training data set.

at bright magnitudes, and BMC assigns more weight to HB at fainter magnitudes.

We present the star and galaxy purity values as functions of i-band magnitude in Figure 11. The normalized density distribution as a function of magnitude in the top panel and the stellar distribution in the second panel clearly show that the demographics of the training set and those of the test set are different. Because the training set is cut at a brighter i-band magnitude than the test set, its density distribution falls off sharply near that limit, and it has a higher fraction of stars than the test set. Compared to the purity values in Figure 4, TPC now suffers a significant decrease in both star and galaxy purity. However, the purity of BMC does not show such a significant drop and decreases by only 2–5%. As suggested by the weight maps in Figure 10, BMC accomplishes this by shifting the relative weights assigned to each base classifier in different SOM cells. As the quality of the training set worsens, BMC assigns less weight to training based methods and more weight to HB and morphological separation.

In Figure 12, we show the cumulative galaxy and star purity values as functions of magnitude. Compared to Figure 5, the drop in the performance of TPC is clear. However, even when some classifiers have been trained on a significantly reduced training set, BMC maintains a galaxy purity of 0.970 and a star purity of 1.0 over most of the magnitude range, and it still outperforms morphological separation at fainter magnitudes.

We also show the star and galaxy purity values as functions of photo-z in Figure 13 and as functions of g − r color in Figure 14. Compared to Figures 6 and 7, the performance of BMC becomes worse in some photo-z and g − r bins.
However, this drop in performance seems to be confined to a small number of objects in particular regions of the parameter space, and BMC still outperforms both TPC and morphological separation for the majority of objects.

Compared to Figure 8, the difference between the posterior star probability distribution of TPC and that of BMC is more pronounced in Figure 15. The P(S) distribution of BMC for true galaxies falls off sharply at high P(S), and BMC does not assign a large star probability to any true galaxy. On the other hand, both TPC and morphological separation classify some true galaxies as stars with absolute certainty.

The combination techniques that we have demonstrated so far use two training based algorithms as base classifiers. Ideally, the training data should mirror the entire parameter space occupied by the data to be classified. Yet we have seen in Section 5.3 that the BMC technique does reliably extrapolate past the limits of the training data,

Figure 11.
Purity as a function of the i-band magnitude for the reduced training data set. The top panel shows the histograms and KDEs of the number count distribution for the training (blue) and test (green) data sets. The second panel shows the fraction of stars in the training and test data sets in blue and green, respectively. The bottom two panels compare the galaxy and star purity values for BMC, TPC, and morphological separation as functions of i-band magnitude.

even when some base classifiers are trained on a low-quality training data set. In this section, we further investigate if and where BMC begins to break down by imposing various magnitude, photo-z, and color cuts to change the size and composition of the training set.

In Figure 16, we present a visual comparison between the different classification techniques when various magnitude cuts are applied to the training data and the performance is measured on the same test set as in Sections 5.2 and 5.3. It is not surprising that the performance of TPC decreases as we decrease the size of the training set by imposing more restrictive magnitude cuts, while the performance of HB and morphological separation is essentially unchanged. However, the effect of the change in size and composition of the training set is significantly mitigated by the use of the BMC technique. BMC outperforms both HB and TPC in all four metrics, even when the training set is restricted to bright magnitudes. BMC also consistently outperforms morphological separation until a very restrictive magnitude cut is imposed on the training data, beyond which point BMC finally performs worse

Figure 12.
Similar to Figure 5 but for the reduced training data set.

than morphological separation. It is remarkable that BMC is able to reliably extrapolate past the training data to the limiting magnitude of the test set and outperform HB, TPC, and morphological separation in all performance metrics, even when the demographics of the training set do not accurately sample the data to be classified.

Similarly, we impose various spectroscopic redshift cuts on the training data in Figure 17. Since all stars have z_spec values close to zero, we are effectively changing the demographics of the training set by keeping all stars and gradually removing galaxies with high redshifts. BMC begins to perform worse than morphological separation only when a conservative z_spec cut is imposed. However, it is again clear that BMC is able to utilize information from HB and morphological separation to mitigate the drop in the performance of TPC.

In Figure 18, we decrease the size of the training set by keeping red objects and gradually removing blue objects. A color cut has a more pronounced effect on the performance of TPC and BMC, which perform worse than morphological separation when the training set is restricted to sufficiently red objects. The performance depends more strongly on the color distribution because a significant fraction of blue objects are stars, while objects with fainter magnitudes and higher redshifts are mostly galaxies. We can verify this in Figure 2, where the darker (higher stellar fraction) cells in the upper middle region of the stellar fraction map (top right panel) have bright magnitudes in the i-band magnitude map (top left panel) and blue colors in the g − r color map (bottom right panel). On the other hand, the darker (fainter magnitude) cells on the right-hand side of the i-band magnitude map contain almost no stars and are represented by bright (low stellar fraction) cells in the stellar fraction map.
Thus, these results indicate that the performance of training based methods depends more strongly on the composition of the training data than on its size, and it is necessary to have a sufficient number of the minority class in the training data set to ensure optimal performance.
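For concreteness, the headline metrics used throughout this section — the galaxy purity at a fixed galaxy completeness, the mean squared error (the score of Brier 1950), and the AUC — can be sketched as follows. This is a minimal illustration assuming P(S) posteriors and boolean truth labels; it is not the paper's own implementation, and the AUC helper uses a naive rank-sum with no tie handling:

```python
import numpy as np

def purity_at_completeness(p_star, is_star, target):
    """Galaxy purity p_g at a fixed galaxy completeness c_g.

    Objects with P(S) at or below a cut are classified as galaxies; the
    cut is raised until the galaxy completeness first reaches the target.
    """
    p_star = np.asarray(p_star, dtype=float)
    is_star = np.asarray(is_star, dtype=bool)
    n_gal = (~is_star).sum()
    for cut in np.sort(np.unique(p_star)):
        sel = p_star <= cut                      # classified as galaxies
        if (sel & ~is_star).sum() / n_gal >= target:
            return (sel & ~is_star).sum() / sel.sum()
    return float("nan")

def brier_mse(p_star, is_star):
    """Mean squared error between P(S) and the 0/1 truth (Brier 1950)."""
    y = np.asarray(is_star, dtype=float)
    return float(np.mean((np.asarray(p_star, dtype=float) - y) ** 2))

def auc(p_star, is_star):
    """Area under the ROC curve via the rank-sum statistic (no ties)."""
    p = np.asarray(p_star, dtype=float)
    y = np.asarray(is_star, dtype=bool)
    ranks = np.argsort(np.argsort(p)) + 1        # 1-based ranks of P(S)
    n_pos, n_neg = y.sum(), (~y).sum()
    return float((ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```

A perfectly separating classifier gives an AUC of 1.0, an MSE of 0, and a purity of 1.0 at any completeness; the star purity at fixed star completeness follows by swapping the roles of the two classes.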
Figure 13.
Similar to Figure 11 but as a function of photo-z.

We have presented and analyzed a novel star-galaxy classification framework for combining star-galaxy classifiers using the CFHTLenS data. In particular, we use four independent classification techniques: a morphological separation method; TPC, a supervised machine learning technique based on prediction trees and a random forest; SOMc, an unsupervised machine learning approach based on self-organizing maps and a random atlas; and HB, a hierarchical Bayesian template fitting method that we have modified and parallelized. Both the TPC and SOMc algorithms are currently available within a software package named MLZ (http://lcdm.astro.illinois.edu/code/mlz.html). Our implementation of HB and BMC, as well as the IPython notebooks that have been used to produce the results in this paper, are available at https://github.com/EdwardJKim/astroclass.

Given the variety of star-galaxy classification methods we are using, we fully expect the relative performance of the individual techniques to vary across the parameter space spanned by the data. We therefore adopt a binning strategy, where we allow different classifier combinations in different parts of parameter space by creating

Figure 14.
Similar to Figure 11 but as a function of g − r color.

two-dimensional self-organizing maps of the full multi-dimensional magnitude-color space. We apply the different star-galaxy classification techniques within each cell of this map, and find that the four techniques are weighted most strongly in different regions of the map.

Using data from the CFHTLenS survey, we have considered different scenarios: when an excellent training set is available with spectroscopic labels from DEEP2, SDSS, VIPERS, and VVDS, and when the demographics of sources in a low-quality training set do not match the demographics of objects in the test data set. We demonstrate that the Bayesian Model Combination (BMC) technique improves the overall performance over any individual classification method in both cases. We note that Carrasco Kind & Brunner (2014a) analyzed different techniques for combining photometric redshift probability density functions (photo-z PDFs) and also found that BMC is in general the best photo-z PDF combination technique.

The problem of star-galaxy classification is a rich area for future research. It is unclear if sufficient training data will be available in future ground-based surveys. Furthermore, in large sky surveys such as DES and LSST, photometric quality is not uniform across the sky, and a purely morphological classifier alone will not be sufficient, especially at faint magnitudes. Given the efficacy of our approach, classifier combination strategies are likely the optimal approach for currently ongoing and forthcoming photometric surveys. We therefore plan to apply the combination technique described in this paper to other surveys, such as the DES. Our approach can also be extended more broadly to classify objects that are neither stars nor galaxies (e.g., quasars). Finally, future studies could explore the use of multi-epoch data, which would be particularly useful for the next generation of synoptic surveys.

Figure 15.
Similar to Figure 8 but for the reduced training data set.
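The cell-by-cell weighting summarized above — different classifier weights in different SOM cells — can be sketched as follows. The helper assumes the SOM cell assignments and the per-cell weights have already been computed; all names are illustrative rather than taken from the released code:

```python
import numpy as np

def combine_per_cell(posteriors, cell_id, cell_weights):
    """Combine base-classifier posteriors with weights that vary by SOM cell.

    posteriors   : (n_classifiers, n_objects) P(S) from each base classifier
    cell_id      : (n_objects,) index of the SOM cell each object maps to
    cell_weights : (n_cells, n_classifiers) weights per cell; rows sum to 1
    """
    posteriors = np.asarray(posteriors, dtype=float)
    w = np.asarray(cell_weights, dtype=float)[np.asarray(cell_id)]
    # w is (n_objects, n_classifiers); contract over the classifier axis
    return np.einsum("oc,co->o", w, posteriors)

# Two classifiers, two objects in different cells: cell 0 trusts the first
# classifier only, while cell 1 splits the weight evenly.
p = [[0.8, 0.2],
     [0.4, 0.6]]
combined = combine_per_cell(p, [0, 1], [[1.0, 0.0], [0.5, 0.5]])  # → [0.8, 0.4]
```

The key design point is that the weights are indexed by cell rather than global, so a classifier that excels only in one region of magnitude-color space (e.g. morphology at bright magnitudes) can dominate there without degrading the combination elsewhere.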
ACKNOWLEDGEMENTS

Figure 16.
The classification performance metrics for BMC (blue), TPC (green), morphology (red), and HB (purple) as applied to the CFHTLenS data in the VVDS field with various magnitude cuts. The top panel shows the number of sources in the training set at the corresponding magnitude cuts. We show only one of the four combination methods, BMC, which has the best overall performance.

for the Participating Institutions of the SDSS-III Collaboration, including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.

This paper uses data from the VIMOS Public Extragalactic Redshift Survey (VIPERS). VIPERS has been performed using the ESO Very Large Telescope, under the "Large Programme" 182.A-
Figure 17.
Similar to Figure 16 but using z_spec cuts.

REFERENCES
Ahn C. P., et al., 2014, ApJS, 211, 17
Ball N. M., Brunner R. J., Myers A. D., Tcheng D., 2006, ApJ, 650, 497
Benítez N., 2000, ApJ, 536, 571
Bertin E., Arnouts S., 1996, A&AS, 117, 393
Bohlin R. C., Colina L., Finley D. S., 1995, AJ, 110, 1316
Breiman L., 1996, Machine Learning, 24, 49
Breiman L., 2001, Machine Learning, 45, 5
Breiman L., Friedman J., Stone C. J., Olshen R. A., 1984, Classification and Regression Trees. CRC Press
Brier G. W., 1950, Monthly Weather Review, 78, 1
Carrasco Kind M., Brunner R. J., 2013, MNRAS, 432, 1483
Carrasco Kind M., Brunner R. J., 2014a, MNRAS, 442, 3380
Carrasco Kind M., Brunner R. J., 2014b, MNRAS, 438, 3409
Chabrier G., Baraffe I., Allard F., Hauschildt P., 2000, ApJ, 542, 464
Coleman G. D., Wu C.-C., Weedman D. W., 1980, ApJS, 43, 393

Figure 18.
Similar to Figure 16 but using g − r color cuts.

Davis M., et al., 2003, Astronomical Telescopes and Instrumentation, pp 161–172
Erben T., et al., 2013, MNRAS, p. stt928
Fadely R., Hogg D. W., Willman B., 2012, ApJ, 760, 15
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2013, PASP, 125, 306
Garilli B., et al., 2008, A&A, 486, 683
Garilli B., et al., 2014, A&A, 562, A23
Górski K. M., Hivon E., Banday A. J., Wandelt B. D., Hansen F. K., Reinecke M., Bartelmann M., 2005, ApJ, 622, 759
Gwyn S. D., 2012, AJ, 143, 38
Henrion M., Mortlock D. J., Hand D. J., Gandy A., 2011, MNRAS, 412, 2286
Heymans C., et al., 2012, MNRAS, 427, 146
Hildebrandt H., et al., 2012, MNRAS, 421, 2355
Kaiser N., Squires G., Broadhurst T., 1995, ApJ, 449, 460
Kinney A. L., Calzetti D., Bohlin R. C., McQuade K., Storchi-Bergmann T., Schmitt H. R., 1996, ApJ, 467, 38
Kohonen T., 1990, Proceedings of the IEEE, 78, 1464
Kohonen T., 2001, Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30, Springer
Kron R. G., 1980, ApJS, 43, 305
Le Fèvre O., et al., 2005, A&A, 439, 845
Messier C., 1781, Connoissance des Temps for 1784, pp 227–267
Monteith K., Carroll J. L., Seppi K., Martinez T., 2011, in The 2011 International Joint Conference on Neural Networks (IJCNN), Turning Bayesian model averaging into Bayesian model combination. pp 2657–2663
Newman J. A., et al., 2013, ApJS, 208, 5
Odewahn S. C., Stockwell E. B., Pennington R. L., Humphreys R. M., Zumach W. A., 1992, AJ, 103, 318
Paterno M., 2003, Calculating Efficiencies and Their Uncertainties, http://home.fnal.gov/~paterno/images/effic.pdf
Pickles A. J., 1998, PASP, 110, 863
Robin A. C., et al., 2007, ApJS, 172, 545
Rokach L., 2010, Artificial Intelligence Review, 33, 1
Ross A. J., et al., 2011, MNRAS, 417, 1350
Sebok W. L., 1979, AJ, 84, 1526
Sevilla-Noarbe I., Etayo-Sotos P., 2015, Astronomy and Computing, in press (arXiv:1504.06776)
Silverman B. W., 1986, Density Estimation for Statistics and Data Analysis. CRC Press, 26
Soumagnac M. T., et al., 2015, MNRAS, 450, 666
Suchkov A. A., Hanisch R. J., Margon B., 2005, AJ, 130, 2439
Swets J. A., Dawes R. M., Monahan J., 2000, Scientific American, p. 83
Ting K. M., Witten I. H., 1999, J. Artif. Intell. Res. (JAIR), 10, 271
Valdes F., 1982, in Instrumentation in Astronomy IV, Vol. 331 of SPIE Conference Series, Resolution classifier. pp 465–472
Vasconcellos E. C., de Carvalho R. R., Gal R. R., LaBarbera F. L., Capelato H. V., Frago Campos Velho H., Trevisan M., Ruiz R. S. R., 2011, AJ, 141, 189
Weir N., Fayyad U. M., Djorgovski S., 1995, AJ, 109, 2401
Wolpert D. H., 1992, Neural Networks, 5, 241
Yee H. K. C., 1991, PASP, 103, 396
Yin H., 2008, Computational Intelligence: A Compendium, pp 715–762
This paper has been typeset from a TeX/LaTeX file prepared by the author.