Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra
C.A.L. Bailer-Jones⋆, K.W. Smith, C. Tiede, R. Sordo, A. Vallenari
Mon. Not. R. Astron. Soc. 000, 000–000 (0000)
Max-Planck-Institut für Astronomie, Königstuhl 17, 69117 Heidelberg, Germany
INAF, Osservatorio di Padova, Vicolo Osservatorio 5, 35122 Padova, Italy
submitted 7 August 2008; resubmitted 18 September 2008
ABSTRACT
We develop and demonstrate a probabilistic method for classifying rare objects in surveys, with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier (DSC), a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40 000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination. We discuss how a classification model prior may, or may not, be influenced by the class distribution in the training data. Our method controls this prior and so allows a single model to be applied to any target population without having to tune the training data and retrain the model.
Key words: surveys – methods: data analysis, statistical – quasars
Finding rare objects is hard, for two reasons. First, we expect to have to look at many objects before we encounter one, and second, we may not recognise it even when we do. The reason for this is that the prior probability that any one object is of the rare class, P(C_rare), is very small. So even if it has very characteristic features, i.e. the likelihood of the data given the rare object, P(Data | C_rare), is high, the posterior probability, P(C_rare | Data), could still be low. A survey for rare objects obviously requires a very discriminating classifier, but we could assist it by modifying the class probabilities. If we raised the prior, P(C_rare), we are more likely to find the rare objects, but it is inevitable that we will then incorrectly classify other objects as being of the rare class. Depending on our goals, there may be a satisfactory balance between maximizing the expected number of true positives (sample completeness) and minimizing the expected number of false positives (sample contamination). Here we present a method for achieving an optimal balance and a correct prediction of this balance which, although simple, is not trivial and has important implications.

⋆ Email: [email protected]

We illustrate our method in the context of the problem of detecting quasars in the Gaia survey based on their low resolution (R ≃ 30) optical spectra. Gaia is an all-sky astrometric and spectrophotometric survey complete to G = 20, expecting to observe some 10^9 stars, a few million galaxies and half a million quasars. Its primary mission is to study Galactic structure by measuring the 3D spatial distribution and 2D kinematic distribution of stars throughout the
Galaxy and correlating these with stellar properties (abundances, ages etc.) derived from the spectra. With astrometric accuracies as good as 10 µas, Gaia cannot be externally calibrated with an existing catalogue. Instead it must observe a large number of quasars over the whole sky with which to define its own reference frame. This quasar sample must be very clean (low contamination). (The quasar sample is also, of course, of intrinsic interest.)

We present our Gaia classification model, the Discrete Source Classifier (DSC), and report its classification performance using simulated data. We will show how, using our probability modification approach, we can use this to build pure quasar samples, at the (acceptable) loss of sample completeness.
Related work. One of the most comprehensive searches for quasars to date is that done with the SDSS. Richards et al. (2002) defined a colour locus in the ugriz space to identify objects for spectroscopic follow up and estimated the contamination rate of their photometric selection to be 34%. Using the spectroscopy of all point sources taken in SDSS stripe 82, Vanden Berk et al. (2005) assessed the completeness of the Richards et al. selection at 95.7% for sources brighter than i = 19.1. Later, Richards et al. (2004) trained a photometric classifier (based on kernel density estimation) on a set of 22 000 spectroscopically identified quasars and used this to build a sample of 100 000 quasars brighter than g = 21.0, achieving high completeness to g = 19.5 with a contamination of just 5% down to g = 21.0. Suchkov et al. (2005) and Ball et al. (2006) both used decision trees to classify objects in the SDSS photometric catalogue, by training on objects with spectroscopic classifications from SDSS. Gao et al. (2008) used support vector machines and Kd-tree nearest neighbours to classify spectroscopically-confirmed SDSS and 2MASS objects as quasars or stars using just the photometry, with stars and quasars present in equal proportions. With both methods they obtained contamination rates of a few percent (averaged across both classes).

Outline. We start in section 2 by presenting our general prior modification method. In section 3 we present the Gaia-DSC classifier and the data used to train and test it. The performance of this and the application of our method are given in section 4 and discussed in section 5, where we also discuss the interpretation of the probabilities and some of the limitations of this work. We conclude in section 6.
Definitions and notation. We use the term class fractions to mean the relative numbers of objects of each class in a data set. The nominal model refers to the classifier "as is", i.e. the output probabilities without any modifications. In contrast, in the modified model we have modified these outputs to be appropriate to a situation in which there would be very different class fractions, e.g. quasars very rare (section 2.4). The subscript i refers to true classes and the subscript j to model output classes. f^train_i, f^test_i and f^mod_i refer to the class fractions of class i for the training data, test data and effective test data respectively. They are normalized: Σ_i f^train_i = Σ_i f^test_i = Σ_i f^mod_i = 1. This effective test data set is one which we don't actually have, but reflects a target population with different class fractions, e.g. quasars very rare. We make predictions of the performance on this by using the actual test data but by modifying the performance equations (section 2.5). In the figures especially, lower case class names, e.g. quasar, denote estimated classes and upper case names, e.g. STAR, true classes. Thus P(quasar | STAR) means "the DSC-assigned quasar probability given that the source is truly a STAR". This refers to DSC outputs for objects with known classes (e.g. in a test set). This is not to be confused with the notation for the DSC outputs in the general case, P(C_j | x_n, θ), which means "the probability which DSC model θ assigns to class C_j for object x_n". "C&C" stands for "completeness and contamination".

The probabilities and assigned classes from any classifier depend not only on the evidence in the data, but also on the prior probabilities. In this context, the prior is the class probabilities before you look at the data. Priors are always present (whether you accept the Bayesian philosophy or not), but they are not explicit in many models, so we may not know what they are. This is a problem, because if the model priors are inappropriate (e.g. equal probability of star and quasar) this will translate into poor posterior probabilities and thus class assignments. The method we present here allows us to replace priors (even if implicit) with something more appropriate, e.g. quasars much rarer than stars.
DSC assigns to every data vector (e.g. spectrum), x_n, a probability, P(quasar), that it is a quasar, and likewise for the other classes. The classes are exclusive and exhaustive, so the probabilities sum to unity. The classifier is not perfect, so generally P(quasar | QUASAR) < 1 and P(quasar | not QUASAR) > 0. To build a sample for a given class, we select objects which have a DSC output P(class) > P_t, where P_t is a user-defined threshold (and may be different for each class). The sample completeness is the number of objects in the sample which are truly of that class divided by the total number of objects of that class in the test set. The contamination is the number of objects in the sample which are truly of other classes divided by the total number of objects in the sample. For output class j,

completeness_j = n_{i=j,j} / N_{i=j}
contamination_j = Σ_{i≠j} n_{i,j} / Σ_i n_{i,j}     (1)

where n_{i,j} is the number of objects of (true) class i in the sample selected for (output) class j, and N_i is the total number of objects of (true) class i in the test set. (There are various other diagnostics one could use, such as the ROC curve or precision rate.) These equations give us predictions of the completeness and contamination for a new (unlabelled) data set, insofar as we believe it to have the same class fractions as the training data (that's our prior).

All classifiers include a prior, whether explicit or not. We need to know this prior for two reasons. First, we would like to know what assumption our model is actually making (and not what we suspect it is making). Second, we would like to change this prior to something which is appropriate to the problem at hand. Here we discuss what the prior is, how the training data may influence it, and how to calculate it post hoc from a trained model.
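As a concrete illustration of equation 1, the completeness and contamination of a thresholded sample can be computed directly from the posterior probabilities and true labels. The following is a minimal sketch; the function name and the toy numbers are ours, not part of DSC.

```python
def sample_stats(probs, true_classes, cls, p_t):
    """Completeness and contamination (eq. 1) of the sample built by
    selecting objects whose posterior probability for class cls exceeds
    the threshold p_t."""
    # Objects selected into the sample for output class cls
    in_sample = [t for p, t in zip(probs, true_classes) if p[cls] > p_t]
    # N_{i=cls}: total number of objects truly of this class
    n_class_total = sum(1 for t in true_classes if t == cls)
    # n_{i=cls, cls}: true members of the class which made it into the sample
    n_class_in_sample = sum(1 for t in in_sample if t == cls)
    completeness = n_class_in_sample / n_class_total
    contamination = (len(in_sample) - n_class_in_sample) / len(in_sample)
    return completeness, contamination

# Four objects, two classes (0 = star, 1 = quasar), toy posteriors:
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
truth = [0, 1, 1, 0]
c, x = sample_stats(probs, truth, cls=0, p_t=0.5)
# → completeness 1.0 (both true stars are selected),
#   contamination 1/3 (one quasar among the three selected objects)
```

Raising p_t shrinks the sample, typically lowering both completeness and contamination, which is exactly the trade-off explored below.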
The outputs from a trained classifier when presented with data x_n are P(C_j | x_n, θ), the probability that the data is of class C_j given the data and the model, θ. This latter quantity reflects both the architecture of the model and the training data set used to fix its internal parameters. We can think of this output as a posterior probability and write it using Bayes' theorem

P(C_j | x_n, θ) = P(x_n | C_j, θ) P(C_j | θ) / P(x_n | θ)     (2)

The term P(x_n | C_j, θ) is the likelihood of the data given the class and model. The term P(C_j | θ) is the prior probability that, given our model, an object is of class C_j. Bayesian statistics deals with updating probabilities based on new data: the prior reflects our knowledge (based on some other data) before we look at the new data. In the present context, the prior P(C_k = quasar | θ) is the probability that any one object in our survey is a quasar, before we look at its spectrum. We always have some prior information, e.g. with Gaia the fact that it is an all-sky survey to G=20.0. If we know them, we could even treat the magnitude and Galactic latitude as prior information.

Given that we can make the decomposition in equation 2, it follows that all classification models must possess a prior on the class probabilities. In some models, for example linear discriminant analysis or Gaussian mixture models, this prior is explicit and so can be controlled. But in many others, such as neural networks or support vector machines, it is not explicit. (See a standard text on machine learning for details of these methods, e.g. Hastie et al. (2001).) In particular, it may depend on the class fractions in the training data. Take, for example, a standard neural network regression model which is trained by minimizing an error function over the whole data set.
If we trained this on 1000 stars and just one quasar, it will learn to recognise stars much better than quasars, because in minimizing the error it hardly has to worry about fitting the lone quasar. If we changed the training data (class fractions), the model and thus the classifications would change. Other regression models are influenced by the class fractions in different ways, or not at all. Given this dependence on the model and data, we refer to the priors as model-based priors, and the notation P(C_j | θ) reminds us of this.

This issue of class fractions influencing the model performance is well-known in the machine learning literature, where it is referred to as the problem of "class imbalance" or "imbalanced data sets". It has been demonstrated to influence neural networks, support vector machines and classification trees (e.g. Shin & Cho 2003, Visa & Ralescu 2005, Weiss 2004). But how, exactly, do the class fractions affect the classifications and, more specifically, the model-based priors? We might think that in the above example the ratio of the star to quasar prior probabilities implicit in the model is 1000 to 1, but this is generally not the case, because it depends on the model and how it is trained. (If this were the case, then we might be tempted to address the class imbalance problem by changing the training data to have class fractions equal to our priors. But if we had just 10 000 training vectors, we could then have only 10 quasars, making it hard for the classifier to correctly classify quasars. We demonstrate this later, in section 4.6.) The bottom line is that, in general, the model-based prior is not equal to the class fractions in the training data.

We can calculate the model-based priors, P(C_j | θ), directly from the trained model via the marginalization equation

P(C_j | θ) = Σ_{n=1}^{N_test} P(C_j | x_n, θ) P(x_n | θ)     (3)

where the sum is taken over all N_test objects in the test data set. The first term in the sum is the posterior probability. The second term is the probability that we draw object x_n from the test data set, which is 1/N_test. Hence the prior is simply the average of the posterior probabilities. (Note that this calculation assumes the test set has the same class fractions as the training set. Interestingly, because the true classes don't appear in the equation, we don't actually need labels on the individual objects in order to calculate the model-based prior.) It might seem strange that the prior can be calculated from the posterior. Yet because the sum is over all objects in the test set, regardless of their true class, we can think of this summation as eliminating information on individual objects, leaving us with what the model probability is for class C_j in the absence of specific data. This is the prior. If we had three classes equally represented in the data, and the classifier were perfect, the prior for each class would be 1/3. Different class fractions and non-perfect classifiers will give different results. In section 4 we will compare the model-based priors with the training data class fractions.
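With P(x_n | θ) = 1/N_test, equation 3 reduces to a simple average of the posterior probabilities over the (possibly unlabelled) test set. A sketch with made-up posteriors for five objects and three classes:

```python
# Toy posterior probabilities P(C_j | x_n, theta) for 5 objects and 3
# classes (each row sums to 1; in practice these come from the classifier).
posteriors = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
    [0.90, 0.05, 0.05],
]

# Eq. 3 with P(x_n | theta) = 1/N_test: the model-based prior for each
# class is the column mean of the posteriors. No class labels are needed.
n = len(posteriors)
model_prior = [sum(col) / n for col in zip(*posteriors)]
# → approximately [0.5, 0.31, 0.19]
```

Because each posterior row is normalized, the resulting priors automatically sum to one.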
Typically we train a classification algorithm on more or less equal class fractions, because this helps the algorithm to reliably identify the class boundaries. This is our nominal model, θ_nom. Yet the target population to which we want to apply our classifier may have different class fractions (e.g. quasars very rare), in which case the classifier may make poor predictions. The underlying issue here is that the model-based priors of our nominal model differ from our target priors.

To correct for this we would ideally just replace the original (nominal) prior with our modified prior. But we cannot do this directly because the model priors are not explicit. Instead we modify the output (posterior) probabilities to give us the "modified model", θ_mod. From inspection of equation 2 we see we may achieve this by dividing the posterior by the nominal prior and then multiplying it by the modified prior. The nominal priors can be calculated using equation 3. But we cannot use this equation to calculate the modified priors because we don't yet have the posteriors from the modified model. We therefore approximate the modified priors using the class fractions. Let P_nom(C_j | x_n, θ_nom) be the output (posterior) probability from the nominal model, i.e. that trained on the data set with class fractions f^train_i. The modified probability appropriate for a target population with class fractions f^mod_i we define as

P_mod(C_j | x_n, θ_mod) = a_n P_nom(C_j | x_n, θ_nom) f^mod_{i=j} / f^train_{i=j}     (4)

where a_n is a normalization factor which ensures that Σ_j P_mod(C_j | x_n, θ_mod) = 1 for each object n. The factor f^mod_{i=j} / f^train_{i=j} is equivalent to changing the prior in equation 2 and has the expected impact on the posterior. Following the discussion in section 2.3 we do not necessarily expect the class fractions to be a good proxy for the priors, but for now it's the best we can do.
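Equation 4 amounts to re-weighting each nominal posterior by the ratio f^mod/f^train and renormalizing per object. A minimal sketch (the function name and the illustrative numbers are ours):

```python
def modify_posteriors(p_nom, f_train, f_mod):
    """Re-weight the nominal posteriors of one object by the ratio of
    target to training class fractions (eq. 4), then renormalize so the
    modified probabilities again sum to 1 (the factor a_n)."""
    scaled = [p * fm / ft for p, ft, fm in zip(p_nom, f_train, f_mod)]
    a_n = 1.0 / sum(scaled)
    return [a_n * s for s in scaled]

# Two classes (star, quasar): nominal posteriors (0.2, 0.8) from a model
# trained on equal class fractions, re-weighted for a target population
# in which stars are nine times more common than quasars.
p_mod = modify_posteriors([0.2, 0.8], f_train=[0.5, 0.5], f_mod=[0.9, 0.1])
# → [0.6923..., 0.3076...]: the star probability rises from 0.2 to ~0.69
```

Only the ratio of the class fractions matters, so f_mod need not be normalized before the call.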
(We will later check how good the approximation is.) We illustrate the above with a simple example, a two-class model trained on equal numbers of stars and quasars, i.e. f^train = (0.5, 0.5). Suppose that for some object the nominal model gives P_nom(C_star | x_n, θ_nom) = 0.2 and P_nom(C_quasar | x_n, θ_nom) = 0.8. If we want f^mod = (0.9, 0.1) (stars nine times more common than quasars), then from equation 4 we get

P_mod(C_star | x_n, θ_mod) = a_n × 0.2 × 0.9/0.5 = 0.36 a_n = 0.69
P_mod(C_quasar | x_n, θ_mod) = a_n × 0.8 × 0.1/0.5 = 0.16 a_n = 0.31

with a_n = 1/(0.36 + 0.16). We might imagine that these are the probabilities we would have obtained if we had trained the model on the modified class fractions. But this is generally not true, because what we are doing in equation 4 is changing the priors (via the proxy of the class fractions) and this is not equivalent to changing the training data (as we will demonstrate later).

The equations for the completeness and contamination (eqn. 1) depend on the actual number of objects correctly and incorrectly classified and were calculated from the test data set, which by definition has class fractions f^test_i. However, we expect to apply the modified model to a target population with different class fractions (this was the whole point of modifying it). To estimate the C&C, it is not necessary to create a new test data set. Instead, we can modify the equations to reflect the target class fractions, f^mod, namely

completeness_j = (f^mod_{i=j} / f^test_{i=j}) n_{i=j,j} / [(f^mod_{i=j} / f^test_{i=j}) N_{i=j}] = n_{i=j,j} / N_{i=j}
contamination_j = Σ_{i≠j} (f^mod_i / f^test_i) n_{i,j} / Σ_i (f^mod_i / f^test_i) n_{i,j}     (5)

This effectively changes the number of objects in each true class. The quantities n_{i,j} and N_i refer to the number of objects in samples obtained from the modified model applied to the actual test data set. Note that the equation for the completeness remains unchanged, yet the value of the completeness will generally change because the posterior probabilities have changed and so n_{i=j,j} changes. We emphasise that the above equation is independent of the procedure to modify the probabilities described in section 2.4. It is only to be used to make predictions for an effective test set. To calculate the completeness and contamination for an actual test set which already has the modified class fractions, we use equation 1.

Once we have the posteriors from the modified model we can calculate its model-based priors from the test set using equation 3.
But just as for the C&C calculation, we must change the effective number of objects in each true class to represent the modified class fractions. Thus in equation 3 we should use

P(x_n | θ_mod) = (f^mod_i / f^test_i) / N_test     (6)

where x_n, being an object in the (labelled) test set, has known true class, i. If a class is ten times rarer in the modified population, then P(x_n | θ_mod) is reduced by a factor of ten compared to before. In principle we could now use these model-based priors to recalculate the posteriors for the modified model (without using the approximation in equation 4), but we don't do this in the present article. Rather we just report them as a check on what the prior really is.

The Discrete Source Classifier (DSC) is the algorithm we are developing as part of our work in the Gaia Data Processing and Analysis Consortium (DPAC) (O'Mullane et al. 2007). DSC will classify objects based primarily on low resolution prism spectra (so-called "BP/RP" spectra), but supplemented with parallaxes and proper motions and photometric variability where available. Here we restrict ourselves to the three classes "star", "galaxy" and "quasar". After DSC in the pipeline come several algorithms for estimating astrophysical parameters, e.g. stellar parameters. For more details of the classification system, see Bailer-Jones (2005).

Gaia detects objects in a very broad band, G, which has roughly the same wavelength range as the BP/RP instrument. In common with the Gaia detection strategy, DSC only operates on point sources (i.e. makes no use of morphological information). In this paper we present results for simulations at both G=20.0 and G=18.5. Here we describe the underlying classifier, the source libraries and the Gaia simulated data used to train and test DSC.

DSC is presently based on a Support Vector Machine (SVM) (e.g. Cortes & Vapnik 1995, Burges 1998), using the libSVM implementation (Chang & Lin 2001).
An SVM is a supervised learning algorithm which finds the boundary which "optimally" separates two classes of objects. In contrast to many other classifiers, SVMs are defined only by those training objects which lie close to the boundary, the so-called support vectors. The basic algorithm is linear, but by using the "kernel trick" to implicitly map data into a higher dimensional space, it achieves a nonlinear mapping between the input data (the spectrum) and the output classes. The SVM is trained by minimizing the number of misclassifications and using a regularizer for complexity control.

SVMs are fundamentally non-probabilistic: they simply define a boundary and assign a class (C_1 or C_2) depending on which side of the boundary an object falls. Probabilities may be derived from the distance, f, of an object from the boundary, in the present case by fitting a sigmoidal function to P(C_j = C_1 | f) (Platt 2000).

The basic SVM is a two-class classifier. libSVM actually trains K(K−1)/2 two-class classifiers (for the K class problem) and uses the pairwise coupling method described in Wu et al. (2004) (their algorithm 2) to produce normalized probabilities for K classes.

One of the advantages of SVMs over many other nonlinear classifiers (e.g. neural networks) is that the SVM objective function is convex, so the training has a unique solution. However, the model has two hyperparameters, the regularization parameter (the "cost") and the scale length in the radial basis function (RBF) kernel we use. We optimize these ("tune") using a Nelder-Mead algorithm. As this involves retraining the SVM for each combination of hyperparameters, it is a time-consuming process.

The main concepts presented in this paper are not dependent on the choice of classifier. SVM is not perfect, not least because its underlying philosophy of not modelling the data distribution means that probabilities do not arise naturally.
It is, nonetheless, a competitive model in terms of classification error, and in tests we have done it has performed as well as or better than several alternatives, including multilayer perceptrons, RBF networks, boosted trees and mixture models (Elting & Bailer-Jones 2007).

Figure 1. The distribution of stellar parameters in the training and testing data sets.
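The idea of mapping SVM decision values to probabilities via a sigmoid, mentioned above, can be illustrated with a bare-bones stand-in for the Platt (2000) procedure. The decision values, labels, and the simple gradient-descent fit below are our own toy construction (Platt's actual method uses a more careful Newton-style optimizer with regularized targets):

```python
import math

# Toy SVM decision values f (signed distance from the boundary) and the
# true labels of those training objects (1 = class of interest).
f_vals = [-2.0, -1.5, -0.5, 0.3, 1.0, 2.2]
labels = [0, 0, 0, 1, 1, 1]

def sigmoid(f, A, B):
    """Platt's parametric form P(C | f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Fit A and B by gradient descent on the cross-entropy.
A, B = -1.0, 0.0
lr = 0.1
for _ in range(2000):
    grad_A = grad_B = 0.0
    for f, y in zip(f_vals, labels):
        p = sigmoid(f, A, B)
        grad_A += (y - p) * f  # d(cross-entropy)/dA
        grad_B += (y - p)      # d(cross-entropy)/dB
    A -= lr * grad_A
    B -= lr * grad_B

# After the fit, objects far on the positive side of the boundary get a
# probability near 1, those far on the negative side a probability near 0.
```

The same fitted sigmoid is then applied to the decision values of new, unlabelled objects.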
Libraries of spectra are being assembled or created for Gaia purposes. Our stellar libraries are based on the Basel (Lejeune et al. 1997) and MARCS (e.g. Gustafsson et al. 2008) libraries plus an OB star library from J.-C. Bouret. Together these span the full range of effective temperature, metallicity and surface gravity. The galaxy library is described by Tsalmantza et al. (2008) (see also Tsalmantza et al. (2007)). It is based on synthetic spectra derived from galaxy evolution models with several astrophysical parameters, in particular those which describe the star formation history and the redshift. The quasar library is also synthetic, with each spectrum uniquely defined by the three parameters continuum slope, α, emission line equivalent width, EW, and redshift, z (Claeskens et al. 2006). Their values are chosen at random from a distribution which is flat in redshift over the range 0–5.5 and flat in α. Interstellar extinction is applied to the stars over the range A_V = 0–10 (with fixed extinction law R_V = 3.1). The key stellar parameters are T_eff and A_V; their distribution is plotted in Fig. 1. The distributions may be a little artificial (the three "steps" reflect the three libraries), but they are sufficient for the purpose of this article.

This data set is obviously rather limited (e.g. purely synthetic data, no extinction in the quasars, no emission line stars, no strong galaxy emission). Our main goal at this stage is not to produce a data set tuned to the real Gaia sky. We discuss the issue of training sets more in section 5.

Figure 2. Random selection of five stars (blue solid lines) and five quasars (red dashed lines) (G=18.5, noise included). The first 46 pixels are the blue channel (Blue Prism, BP), the remaining 40 the red channel (Red Prism, RP), separated by a dashed line.
To build and test the processing pipeline, the DPAC has built a sophisticated data simulator (Luri et al. 2005). For each object in the spectral libraries we simulate its BP/RP spectrum, parallax and proper motion, including all sources of random error. Gaia observes every part of the sky between 40 and 200 times over five years (depending on ecliptic latitude). The average number of spectra for an object in an end-of-mission stack is 72, which we use here. Our simulated data are the so-called cycle 3 data (Sordo & Vallenari 2008).

Gaia prism spectra are obtained in two channels, one for the blue called BP and one for the red called RP, with cut-offs defined by the (silver) mirror response in the blue, CCD QE in the red, and bandpass filters in the middle. After removal of low sensitivity regions, BP ranges from 338 to 998 nm over 46 pixels and RP from 646 to 1082 nm over 40 pixels. We use all 86 pixels as separate inputs to the classifier (we do not attempt to combine the wavelength overlapped region). The dispersion varies strongly with wavelength from 3 nm/pixel (blue end) to 30 nm/pixel (red end). The instrumental point spread function (PSF) is significantly broader than the pixel size, and in one conservative estimate there are only 18 independent measures in BP/RP (Brown 2006). Example spectra are shown in Fig. 2. The dominant noise source is Poisson noise from the source, so the SNR is approximately the square root of the flux. (At G=20, the fluxes are four times smaller and the SNR is about half.) The corresponding library spectra (i.e. prior to being put through the Gaia instrument simulator) are shown in Fig. 3. The original stellar library spectra are calculated at a resolution of 0.1 nm and show many absorption lines, so to avoid confusion, we just plot every 20th pixel. Note how much information is lost – especially concerning the spectral lines – in BP/RP.

Figure 3. The (noise-free) library spectra for the stars (top panel) and quasars (bottom panel) shown in Fig. 2. The photon flux has been normalized in each case to have a maximum of 1.0.
We included the astrometry in our classifiers in the expectation that it could help distinguish galactic and extragalactic objects. Fig. 4 shows the distribution of the simulated astrometry (with noise) for G=18.5. Astrometry for the extragalactic objects is just zero plus noise. Note that this causes negative parallaxes (which are preserved in the Gaia astrometric data processing for model fit assessment purposes). Stellar parallaxes, ̟, are simulated in a way to ensure consistency with the apparent and absolute magnitudes and stellar parameters. The stellar proper motions are drawn at random from a zero mean Gaussian with standard deviation 10.55̟ mas/yr, which simulates a distance-independent space velocity distribution with a standard deviation of 50 km/s. At G=18.5, the Gaia parallax error is typically 0.1 mas and the proper motion error around 0.14 mas/yr (both dominated by source photon statistics). The errors are twice as large at G=20.0.

The input data for our classifiers is BP/RP plus astrometry (although it turns out that inclusion of astrometry hardly improves the results). Models are trained and tested on noisy data and a single G-band magnitude is used in each experiment. Train and test data are drawn at random (without replacement) from a common data set. Unless stated otherwise, the training data set comprises 5000 objects from each of the three classes and the test data 60 000 objects of each class. The reason for the large number in the test set will become clear in section 4.3.

Figure 4. Distribution of the simulated astrometry. Blue/solid = star, red/dashed = quasar, black/dotted = galaxy (which is indistinguishable from the quasars).
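The proper-motion recipe described above can be sketched in a few lines. The function name and structure are ours; the only physical inputs are the 50 km/s velocity dispersion stated in the text and the standard conversion v_t = 4.74 µ/̟ between tangential velocity, proper motion and parallax (which gives the 10.55̟ mas/yr dispersion):

```python
import random

KAPPA = 4.74  # v_t [km/s] = KAPPA * mu [mas/yr] / parallax [mas]

def draw_proper_motion(parallax_mas, sigma_v_kms=50.0, rng=None):
    """Draw one proper-motion component (mas/yr) for a star of given
    parallax, such that the tangential velocity dispersion is
    sigma_v_kms independently of distance."""
    rng = rng or random.Random()
    # sigma_mu = sigma_v * parallax / 4.74 ≈ 10.55 * parallax for 50 km/s
    sigma_mu = sigma_v_kms / KAPPA * parallax_mas
    return rng.gauss(0.0, sigma_mu)

# A star at 1 kpc (parallax 1 mas) gets a proper-motion dispersion of
# about 10.55 mas/yr per component.
mu = draw_proper_motion(1.0, rng=random.Random(42))
```

A nearby star (large parallax) thus receives proportionally larger proper motions, as expected for a fixed velocity dispersion.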
We present results for three experiments:
(1) objects at G=18.5;
(2) as (1), but in which quasars with emission line equivalent widths less than 5000 Å have been removed from the training data set, leaving 2901 quasars (the test set is unchanged);
(3) as (2), but for G=20.0.

We show results for both the "nominal" and "modified" cases. In the nominal cases, the class fractions are those given by the data sets themselves, namely f^train = f^test = (1, 1, 1) for (star, quasar, galaxy) respectively, except for the training data in cases 2 and 3 which has f^train = (1, 0.58, 1). In the modified cases quasars are made very rare, with f^mod set such that quasars have a frequency of only 1 in every 2000 objects.

Table 1.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=18.5.

          galaxy   quasar    star
GALAXY     99.36     0.19    0.45
QUASAR      1.03    96.04    2.94
STAR        0.49     0.88   98.63
Figure 5. Histograms of DSC outputs for each class showing how confident DSC is of identifying each class. Nominal priors, G=18.5.

We illustrate the results in three ways: via the confusion matrix and the histograms of the output probabilities; by imposing a threshold, P_t, on the posterior probabilities in order to build samples (section 2.2); and by showing how the sample completeness and contamination vary with this threshold for each output class. We report only briefly on this first experiment, primarily in order to motivate the removal of the low EW quasars in experiments (2) and (3).
Figure 6. Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number of different types of class in the quasar sample: stars (blue/solid line); quasars (red/dashed line); galaxies (black/dotted line). Nominal priors, G=18.5.
Figure 7. Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. Modified priors, G=18.5.
Figure 8.
Examples of low EW quasars (red/dashed lines) and those stars (blue/solid lines) which are the contaminants in the quasar sample in the modified G=18.5 model.
The confusion matrix (Table 1) shows that good classification accuracy is achieved with the nominal model. The histograms of the posterior probabilities (Fig. 5) further show that these classifications are achieved with high confidence. This translates into a satisfactory trade-off in sample completeness and contamination at thresholds between 0.5 and 0.9 (Fig. 6). The bottom right-hand panel shows that stars rather than galaxies are the main contaminant in the quasar sample. If we now proceed to the modified model, the C&C curves are very different (Fig. 7). While we would expect higher contamination for a given completeness level for quasars (because they are now very rare), we see that it is impossible to get a low contamination quasar sample at all. The bottom right-hand panel of that plot explains why: the effective number of stars in the sample (366) is much higher than the effective number of quasars (63), so the contamination is high, 366/(366 + 63) = 0.85. (To recall: the effective number is the number we would have in a data set which had the modified class fractions.)

Inspection of the contaminants shows that they are predominantly cool, highly-reddened stars, with T_eff between 4000 and 8000 K and A_V mostly between 8 and 10 mag. While the full stellar data set is dominated by stars in this temperature range, the extinction distribution peaks towards lower extinctions (Fig. 1), so this tendency for highly extincted contaminants is significant. Figs. 8 and 9 plot a random selection of these stellar contaminants along with a random selection of quasars with lower values of the emission line EW parameter. Note how the very broad PSF of BP/RP smooths out the sharp features in the quasar spectra, especially at the red end of the two channels where the dispersion is lower. The narrower quasar emission lines are not resolved, and these objects look more like the smoother stellar spectra. We hypothesize that if we remove such quasars from

(The completeness always decreases monotonically from 1 at P_t = 0, where all objects are included in the sample, to 0 at P_t = 1, where no objects are in the sample. The maximum contamination occurs at P_t = 0 with a value depending on the class fractions (for equal class fractions in K classes it will be 1/K) and drops to zero at P_t = 1.)

Figure 9.
The (noise-free) input spectra for the stars (top panel) and quasars (bottom panel) shown in Fig. 8. The photon flux has been normalized in each case to have a maximum of 1.0.
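The effective-number bookkeeping used in the contamination figures can be sketched as follows. The idea is that raw per-class counts in a selected sample are rescaled by the ratio of the target class fractions to the test-set class fractions before computing the contamination. All numbers below are illustrative, not the paper's actual run:

```python
# Sketch of computing a contamination on a test set whose class fractions
# differ from the target population: rescale raw counts by f_mod/f_test
# ("effective numbers"), then take the non-target fraction.
def effective_contamination(raw_counts, f_test, f_mod, target):
    eff = {c: raw_counts[c] * f_mod[c] / f_test[c] for c in raw_counts}
    total = sum(eff.values())
    return (total - eff[target]) / total

raw = {"star": 366, "quasar": 63000}       # hypothetical counts in the sample
f_test = {"star": 0.5, "quasar": 0.5}      # equal fractions in the test set
f_mod = {"star": 0.5, "quasar": 0.0005}    # quasars 1000 times rarer
cont = effective_contamination(raw, f_test, f_mod, target="quasar")  # ~0.85
```

With these made-up inputs the 63 000 raw quasars become an effective 63, giving the kind of high contamination, 366/(366 + 63), that the rescaling is meant to reveal.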
Table 2.
Model-based priors for the nominal, P(C_j|θ_nom), and modified, P(C_j|θ_mod), cases for the full training data and the case in which low EW quasars have been removed ("nlEW"). For comparison we show the class fractions relevant to the nominal models, f_train_i, and the modified models, f_mod_i.

                data    G      star     quasar     galaxy
P(C_j|θ_nom)    full   18.5   0.3380   0.3279     0.3341
f_train_i       full   18.5   0.3333   0.3333     0.3333
P(C_j|θ_mod)    full   18.5   0.4965   0.002514   0.5010
f_mod_i         full   18.5   0.4998   0.000500   0.4998
P(C_j|θ_nom)    nlEW   18.5   0.367    0.283      0.350
P(C_j|θ_nom)    nlEW   20.0   0.368    0.260      0.372
f_train_i       nlEW   both   0.388    0.225      0.388
P(C_j|θ_mod)    nlEW   18.5   0.4983   0.000328   0.5013
P(C_j|θ_mod)    nlEW   20.0   0.4762   0.000277   0.5234
f_mod_i         nlEW   both   0.4998   0.000500   0.4998

our training set, and thus from our definition of quasars, then the SVM should not so readily confuse these stars with our quasar class. We test this in the next experiment (section 4.3).

Table 2 lists the model-based priors (section 2.3). The first line is for the nominal model. The second row gives, for comparison, the fraction of objects in each true class, f_train_i, in the training data. These we may consider as frequentist estimates of the model priors, insofar as the frequency distribution of the classes dictates these. At least for this SVM model with equal class fractions, the model-based priors are close to the class fractions.

The third and fourth lines give the same but for the modified model. Now we see that the modified class fraction, f_mod_i, for the quasars is not a good proxy for the model-based prior. This implies that its use in equation 4 will give poor estimates for the true posterior probabilities. We could attempt to improve this by an iterative procedure: now that we have the model-based priors, we can recalculate the model posteriors directly from Bayes' equation (2), rather than our approximation (equation 4), and then recalculate the model-based priors with equation 3. However, we do not do this because in our main experiments (next), the discrepancy is not as large.

Motivated by the results of the previous experiment, we removed the low equivalent width quasars (EW < 5000 Å) from the training set.

Table 3.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=18.5, no low EW quasars in training data.

            galaxy   quasar    star
GALAXY       99.37     0.00    0.63
QUASAR        4.22    85.59   10.19
STAR          0.68     0.13   99.19

Comparing the confusion matrix (Table 3) to that in the previous experiment, we now see that fewer quasars are correctly classified, with 10% being misclassified as stars. Yet this loss of quasars is balanced by the fact that six times fewer stars are now misclassified as quasars (0.13% rather than 0.88% previously, or 78 stars rather than 528). This is what we wanted to achieve by modifying the training sample. Note how few galaxies are misclassified as stars and quasars, and how this has hardly changed from the previous experiment. In all experiments we have performed with these data (including many not reported here), the galaxies are always classified with high confidence.
As we are not changing the class fractions for galaxies – they are acting mostly to make the classification problem for the SVM harder – we focus on the stars and quasars from now on.

We summarize the model confidence (posterior probabilities) in the histograms in Fig. 10. We can read several things from this: the leading diagonal shows P(class|CLASS), how confident the true positives are; the central row shows P(class|QUASAR), the probabilities assigned to each class for true quasars; the central column shows P(quasar|CLASS), the quasar probabilities assigned to objects of each true class. We see that the confidences for the correct classes
Figure 10.
Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true class (rows). Nominal priors, G=18.5, no low EW quasars in training data.

are still high. However, comparing with the results of the previous experiment (Fig. 5), we now see a set of true quasars which are assigned very low probabilities, P(quasar|QUASAR). These are predominantly (but not exclusively) the low EW quasars that were removed from the training data. These low probability cases are reflected as a dip in the quasar completeness curve (Fig. 11) at low P_t. These quasars show up in the star sample and we correspondingly see a higher contamination for the stars. Yet the positive trade-off from this is a lower quasar sample contamination.

We turn now to the modified model in which our prior is that quasars are 1000 times rarer. The histograms of the posteriors are shown in Fig. 12, which we may compare to the nominal case in Fig. 10. The main difference can be seen in the central plot, where we now observe a significant peak around zero probability for P(quasar|QUASAR): the modified model has a high barrier to classifying anything as a quasar. This is the result of the probability remapping, equation 4 (whereby the impact of the normalization should not be ignored). This provides more discrimination between those quasars which, in the nominal case, were all assigned an output probability of almost 1.0.

Fig. 13 shows the C&C for the modified model. It is significantly different from the nominal case. The quasar completeness is slightly reduced, but we win a much lower contamination. Indeed, at P_t = 0.11 the contamination drops to zero for a sample completeness of 65%. This translates to a contamination of less than 1 in 0.65 × 60 000 ≈ 39 000. At P_t = 0.11 the effective number of quasars in the sample is 54. If we had done this experiment with a test set containing, say, only 2000 quasars, then the effective number of quasars in the resultant sample would only have been 54 × (2000/60 000) = 1.8. This is not enough to draw results of any significance and explains why we need large test sets for evaluating the modified model.

Figure 12.
Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true class (rows). Modified priors, G=18.5, no low EW quasars in training data.
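The probability remapping referred to above can be sketched as follows. Equation 4 itself is not reproduced in this excerpt, so the form below (reweight each nominal posterior by the ratio of target to training class fractions, then renormalize) is our paraphrase; all posteriors and class fractions are illustrative:

```python
# Sketch of remapping nominal posteriors to new class priors: reweight
# by f_mod/f_train and renormalize. With quasars 1000 times rarer, even a
# nominally confident quasar classification is strongly suppressed.
def modify_posteriors(p_nom, f_train, f_mod):
    p = [pi * m / t for pi, m, t in zip(p_nom, f_mod, f_train)]
    s = sum(p)
    return [pi / s for pi in p]

p_nom = [0.10, 0.85, 0.05]     # nominal output for (star, quasar, galaxy)
f_train = [1.0, 1.0, 1.0]      # equal class fractions in training
f_mod = [1.0, 1e-3, 1.0]       # quasars assumed 1000 times rarer
p_mod = modify_posteriors(p_nom, f_train, f_mod)
```

In this toy case the quasar posterior drops from 0.85 to below 0.01, which is the "high barrier" to quasar classification seen in the central histogram.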
Table 4.
Confusion matrix for class assignments achieved by applying a probability threshold. Each row gives the percentage of objects of each true class assigned to the different output classes (columns). As an object may now be assigned to more than one class, the values in a row no longer sum to 100%, and some objects may remain unclassified.

The confusion matrix for this modified case is shown in Table 4. As class assignments are now based on thresholds rather than maximum probability, and because we can set these thresholds independently of one another, a given object may be assigned to more than one class or it may remain unclassified. Therefore the values in a row no longer sum to 100%. Here we use a separate threshold for each class, with P_t = 0.11 for the quasars. Of the stars contaminating the quasar sample at this threshold, one is hot, with T_eff ≃ 40 000 K. The other three are cooler (two around 4000 K, one around 7000 K) but have very high extinctions (8–9 magnitudes in A_V). This gives rise to the much
Figure 11.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number of different types of class in the quasar sample: stars (blue/solid line); quasars (red/dashed line); galaxies (black/dotted line). Nominal priors, G=18.5, no low EW quasars in training data.

larger flux in the red which DSC confuses with a quasar-like broad emission line feature.

We recall that in this experiment we removed the low EW quasars from the training set, but not the test set. We may, therefore, expect this model to perform particularly badly on these objects. However, if we plot Fig. 12 just for these objects it turns out to be broadly similar, although with about half as many objects in the P(quasar|lowEWQUASAR) = 1 bin and twice as many in the P(quasar|lowEWQUASAR) = 0 bin. So while this model does not do as well on the objects omitted from the training data, it is still able to assign many of them a high probability. That is, DSC has been able to extend its recognition of quasars to lower EW cases than it was trained on.

The C&C plots for the modified model are of course just predictions of what the C&C would be if we applied the modified model to a real data set which has the modified population fractions. We can test these predictions by building a new test data set in which the population fractions really are (1, 1/1000, 1). To do this we took the original test data set but retained just a random selection of 1/1000 of the quasars (i.e. 60 quasars). We applied the modified model and calculated the C&C (now using equation 1, of course, not equation 5). As there are only 60 quasars in this set, the predictions have high variance, so we bootstrapped the selection 20 times. Fig. 16 shows that the measured C&C are very close to the predictions, thus validating our approach.

Figure 13.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. There are no galaxy contaminants. Modified priors, G=18.5, no low EW quasars in training data.
Theoretical arguments aside, could we nonetheless use the nominal model to classify data sets in which quasars are rare? To assess this, we applied the nominal model to the few-quasar data set from the above test (section 4.3.3). The measured C&C are plotted in Fig. 17 as the dashed lines. These agree well with the C&C we would predict on the effective test data set (i.e. using equation 5) but using the nominal model, which are shown as the solid lines in Fig. 17. While they agree, this figure shows that the sample contamination is far higher when using the nominal model than the modified model (cf. Fig. 16). That is, the modified model is superior at obtaining pure quasar samples. This demonstrates that we must take into account the expected class fractions (i.e. use a suitable prior) when applying a classifier to a data set in which we expect a class to be rare. Finally for this experiment, we note that the model-based priors agree quite well with the training data class fraction or the modified class fraction for the nominal and modified models respectively (Table 2).
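The model-based priors referred to here can be estimated with a very short computation. Equation 3 is not reproduced in this excerpt, so the sketch below paraphrases it as the model's posterior averaged over a representative set of objects, which is consistent with the observed behaviour (priors close to the class fractions); the toy posteriors are made up:

```python
# Sketch of a model-based prior estimate: average the classifier's
# posterior for each class over a set of objects.
def model_based_priors(posteriors):
    n = len(posteriors)
    k = len(posteriors[0])
    return [sum(row[j] for row in posteriors) / n for j in range(k)]

# Toy posteriors for four objects over (star, quasar, galaxy):
post = [
    [0.9, 0.05, 0.05],
    [0.1, 0.80, 0.10],
    [0.2, 0.10, 0.70],
    [0.8, 0.05, 0.15],
]
priors = model_based_priors(post)   # ~ [0.5, 0.25, 0.25]
```

If the model were perfectly calibrated on data with known class fractions, these averages would reproduce those fractions, which is the comparison made in Table 2.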
At G=20 the noise is higher, so we expect DSC to perform less well. The basic class assignment based on maximum

Figure 14.
Spectra of stars which are most confused as quasars, namely those 7 stars which have the highest values of P_mod(quasar|STAR). G=18.5, modified priors, no low EW quasars in training data.
Figure 15.
The (noise-free) input spectra for the stars shown in Fig. 14. The photon flux has been normalized in each case to have a maximum of 1.0.
Figure 16.
Comparison of the completeness (blue/thick lines) and contamination (red/thin lines) obtained with the modified model at G=18.5 (no low EW quasars in training data). The solid lines are the predictions (same as those in the top right panel of Fig. 13) and the dashed lines are the measurements from a data set in which quasars really are rare (class fractions of (1, 1/1000, 1)).
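The construction of the rare-quasar test sets behind this comparison (keep a random 1/1000 of the test-set quasars and repeat the selection 20 times) can be sketched as follows; the 60 000-quasar test set is a stand-in consistent with the numbers quoted in the text:

```python
import random

# Sketch of repeatedly drawing a rare-quasar test set: retain a random
# `fraction` of the quasars, repeating the selection to estimate the
# variance of the measured completeness and contamination.
def rare_subsample(objects, fraction, rng):
    k = max(1, round(fraction * len(objects)))
    return rng.sample(objects, k)

rng = random.Random(42)              # fixed seed for repeatability
quasars = list(range(60000))         # stand-ins for the test-set quasars
selections = [rare_subsample(quasars, 1 / 1000, rng) for _ in range(20)]
```

Each of the 20 selections contains 60 quasars, matching the repeated-selection procedure described for Fig. 16.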
Figure 17.
As Fig. 16 but now for the nominal model
Table 5.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=20.0, no low EW quasars in training data.

            galaxy   quasar    star
GALAXY       93.53     0.12    6.66
QUASAR        7.40    77.13   15.47
STAR         13.92     0.21   85.87

posterior class probability is indeed worse (Table 5). Comparing with Table 3 we see that the stars suffer in particular, with only 86% being correctly classified, as opposed to 99% at G=18.5. This is reflected by the lower confidence the model gives to its classifications (Fig. 18). Likewise, the completeness is lower and the contamination higher at a given threshold (Fig. 19).
Figure 18.
Histograms of DSC outputs for each class showing how confident DSC is of identifying each class. Nominal priors, G=20.0, no low EW quasars in training data.
Figure 19.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). Nominal priors, G=20.0, no low EW quasars in training data.
If we now modify the priors and recalculate the posterior probabilities then we see a similar shift in the distributions to that for the G=18.5 case (histograms not shown). The C&C plot with the modified model is shown in Fig. 20. What stands out immediately is that we are still able to get a clean (zero contamination) quasar sample. This occurs once P_t = 0.07, at which point the completeness is 58% (and remains above 50% if we choose to take a larger threshold). This compares to a completeness of 65% at P_t = 0.11 for zero contamination at G=18.5. Thus we see that the higher noise in the data can be translated into a lower completeness (which is acceptable) rather than a raised contamination (which is not). Notice that there is no guarantee that the contamination should drop to zero before P_t = 1 (cf. section 4.2), so this is an important result.

We may expect parallaxes and proper motions to provide a good discriminant between Galactic and extragalactic objects. Yet it turns out that if we remove astrometry from the DSC, the performance hardly degrades (accuracy decreases by no more than 1%). This observation may not be surprising, because the majority of objects with astrometry consistent with zero (within the measurement errors) are actually stars, and that is because our training sample is dominated by distant stars. It is also not specific to the SVM. We also built a two-stage classifier, in which (1) an SVM predicts classes based only on photometry and (2) Gaussian mixtures
Figure 20.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. There are no galaxy contaminants. Modified priors, G=20.0, no low EW quasars in training data.

are used to model the density in the 2D astrometric space and predict classes. The probabilities are then combined to give a single posterior (Bailer-Jones & Smith 2008b). The results are no better than an SVM using BP/RP and astrometry (although the two-stage classifier offers more flexibility and interpretability).
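The combination rule actually used in the two-stage classifier is given in Bailer-Jones & Smith (2008b) and is not reproduced here. A common choice, sketched below under the assumption that the two inputs are conditionally independent given the class, multiplies the two posteriors and divides out the shared prior once; all numbers are illustrative:

```python
# Sketch of naive-Bayes style combination of two classifier stages:
# P(C | x1, x2) proportional to P(C | x1) * P(C | x2) / P(C),
# assuming x1 and x2 are conditionally independent given the class.
def combine_posteriors(p1, p2, prior):
    p = [a * b / pr for a, b, pr in zip(p1, p2, prior)]
    s = sum(p)
    return [x / s for x in p]

p_photo = [0.2, 0.7, 0.1]    # hypothetical posterior from the BP/RP stage
p_astro = [0.3, 0.6, 0.1]    # hypothetical posterior from the astrometric stage
prior = [1 / 3, 1 / 3, 1 / 3]  # shared prior, divided out once
combined = combine_posteriors(p_photo, p_astro, prior)
```

Two stages that weakly agree on a class reinforce each other under this rule, which is why a combined posterior can be sharper than either input alone.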
In section 2.3.2 we argued that the model class priors are not necessarily dictated by the class fractions in the training data. Moreover, in attempting to match the training data distribution to that in the target population for a rare object, we may end up with very few objects in the training data, such that the classifier does not learn to classify them well. We test this by varying the fraction of quasars in the training data set, q, from 100% (equal class fractions) down to 0.1% (60 quasars). We then build samples based on maximum probability and measure the C&C (again using equation 5 to account for the modified class fractions). Fig. 21 shows how the completeness and contamination of the quasar sample vary with q. For modest reductions in q the contamination is 5%, which is too high for a pure sample (although the samples are built using maximum probability, so we could reduce this at the cost of reduced completeness). Surprisingly, the contamination actually drops to zero once

Figure 21.
Quasar sample completeness (thick blue line) and contamination from stars (thin red line) and galaxies (dotted black line) as a function of the fraction of quasars in the training data set (the first two values are 0.1% and 1%).

the quasar fraction drops below 5%. However, this comes at the price of greatly reduced completeness: just 0.1% at q = 0.1%. This we can directly compare to the results from the modified model in experiment (2) (section 4.3.2), where the completeness is still 62%. Thus we conclude that pruning the training data set is not the way to achieve a good classifier for a rare class.

Any supervised method is limited by the data it is trained on. Systematic differences between these and the target population can compromise the classifier. It is not the goal of this article to present the final classifier for Gaia, so we have not yet optimized the training sets. Nonetheless, our work does raise some issues relevant to this important next step, which we briefly discuss.

First, we found that removing the low EW quasars was important for achieving clean quasar samples in the modified case. This may reflect a limitation of the SVMs, but it raises the general issue of how to set the training data distribution. We do not expect a supervised method to perform well on regions of the data space not included in the training data (although we did find that the model could correctly classify many low EW quasars). For this reason, DSC will be preceded by an "outlier detector" in the classification pipeline. This assesses how well an observed object is "covered" by the training space and only passes "known" objects to DSC. It could be based on density estimation (in low dimensions) or one-class SVMs.

Second, if we are building a classifier only to look for a specific type of object, then we do not need to aim for completeness in the training sample (although we do need to include potential contaminants).
For example, none of our quasar simulations include interstellar extinction (unlike the stars and galaxies). This could be motivated by saying we do not trust extinction laws and just want to find unreddened quasars. We are free to make this choice at the cost of reduced completeness of real quasars. If we included reddened quasars in the test set, then either they would be classified as quasars anyway (in which case the completeness goes up), or they would be classified as something else. Either way, the quasar contamination would not increase, so our conclusion about building pure samples of unreddened quasars is unaffected. Incidentally, the results in section 4 are very similar if we repeat the experiments but with interstellar extinction applied also to the quasars. The main difference is that the quasar sample completeness is reduced by around 10%, with the more reddened quasars now contaminating the star samples. The quasar contamination is not increased.

Third, how should we define the classes? If we had split the quasars into two classes (e.g. high and low emission line EW), maybe this would help the SVM to retain zero contamination on the high EW class without having to remove the low EW quasars entirely from the training data. This will be the subject of future work.

Gaia does not exist in a void; we of course have information from other catalogues/surveys to help us identify quasars. The full DSC is actually a multi-stage algorithm, in which the present paper just describes the stage based on the BP/RP spectrum. Other stages provide class probabilities based on other data, for example the source magnitude or Galactic latitude or external catalogues, in which case we might call them "priors". We will describe this multi-stage approach in a future paper.
When we build a sample by applying a threshold, all of the objects have a posterior probability which is at least as high as the threshold. For example, with the modified model in section 4.3.2, we were able to set P_t = 0.11 and produce a sample of only quasars. There is of course no guarantee that all samples built using this threshold have no contaminants. On the other hand, this threshold is not a statement about the sample as a whole: it would be wrong to expect only 11% of the sample to be quasars. We need to look at the actual probabilities, not a lower limit. How can we use these? Imagine a case in which we set P_t = 0.90 and thereby obtained a sample of 1000 (supposed) quasars, all of which happen to have P(quasar) = 0.95. We would reasonably expect 5% not to be quasars. Yet we must be careful with such an inference, because posterior probabilities are not frequencies (especially if we have very non-uniform priors, e.g. class fractions of (1, 1/1000, 1)). If P(quasar) = p for all objects in a sample of size n, then the probability that exactly r of them (r ≤ n) are non-quasars is

P(n, r, p) = [n! / (r! (n − r)!)] (1 − p)^r p^(n − r)     (7)

The left panel of Fig. 22 shows how (the logarithm of) this quantity varies with r for n = 1000 and p = 0.95. (It has a maximum of P(n, r, p) = 0.058 at r = 50.) By summing equation 7 over ranges of r we can make more useful statements. For example, summing from r = 0 to r = 50, we can say that the probability that there are 50/1000 = 5% contaminants or fewer is 0.54. Equivalently, the probability that
Figure 22.
The left panel shows the variation with r of P(n = 1000, r, p = 0.95) (equation 7), the probability that there are exactly r contaminants in the quasar sample of n objects, each with posterior quasar probability of p. The right hand panel is 1 minus the cumulation of this (equation 8), and so shows the probability that there will be more than r contaminants (shown as the contamination fraction r/n).

there is more than 5% contamination is 0.46. The right hand panel of Fig. 22 shows this for variable r. Specifically it plots

1 − Σ_{r′=0}^{r} P(n, r′, p)     (8)

against r/n, the contamination fraction. This figure agrees approximately with our intuition of a 5% contamination when all posterior probabilities are 95%.

We can apply this to the case in section 4.3.3, where we applied the modified model to a test set with 60 quasars. Applying a threshold gives a sample of 37 objects. In principle we should use the individual probabilities (the distribution of P(quasar|QUASAR) in Fig. 10), but we approximate by setting p equal to the average for this sample, which is 0.95. Equation 8 tells us (n = 37, r = 1, p = 0.95) that the probability of having more than one contaminant is 0.56. This is consistent with having no contaminants. Of course, this average probability is not very representative, so this does not strictly apply. For example, we could also have taken a higher threshold, for which p = 0.997 and the probability of more than one contaminant is just 0.006.

Despite these approximations, it appears that the inferences about the sample as a whole are consistent with the individual posterior probabilities.
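Equations 7 and 8 are directly computable; the sketch below reproduces the two quantities quoted in the text (the peak of the binomial for n = 1000, p = 0.95, and the more-than-r tail for the 37-object sample):

```python
from math import comb

# Sketch of equations 7 and 8: if each of the n sample members has quasar
# probability p, the number of contaminants r is binomially distributed;
# one minus the cumulative sum gives the chance of more than r contaminants.
def p_exactly(n, r, p):
    return comb(n, r) * (1 - p) ** r * p ** (n - r)

def p_more_than(n, r, p):
    return 1.0 - sum(p_exactly(n, rr, p) for rr in range(r + 1))

peak = p_exactly(1000, 50, 0.95)     # ~0.058, the maximum quoted in the text
tail = p_more_than(37, 1, 0.95)      # ~0.56, for the 37-object sample
```

This makes the point of the section concrete: a single threshold says nothing about the sample as a whole, but the individual posteriors, treated as per-object contamination probabilities, do.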
Ultimately we would like to know whether the modified model is "better" than the nominal one. If we ignored the issue of priors and class fractions, then our predictions of sample completeness and contamination would always be given by Fig. 11 (at G=18.5), regardless of the true class fractions in the target population. If we improve this by instead using equation 5 to accommodate the modified class fractions (but still with the nominal model posteriors), then our predictions would be the solid lines in Fig. 17. Yet both are poor predictions of the actual completeness and contamination in a sample where quasars are rare (dashed lines in Fig. 16).

We demonstrated (in section 4.3.4) that the application of the nominal model to a data set with rare quasars gives poorer results (larger contamination) than the modified model at a given probability threshold. It is not yet clear whether we could just use the nominal model but with much higher probability thresholds to achieve similar results. However, there are two reasons why we should not want to do this. First, as we must not build test data sets with very small numbers of the rare class, we would still have to use the modified class fractions in order to correctly predict the C&C (section 2.5). Second, we want models which deliver actual probabilities, not just numbers between 0.0 and 1.0 to which we apply a meaningless threshold. This is especially important if we later want to update the probabilities, e.g. in a multi-stage classifier. We therefore believe it is better to use the modified model.
We have introduced a method of probability modification which can be used to build clean samples of rare objects for which the completeness and contamination can be reliably predicted. We have demonstrated this using a support vector machine classifier, although it may, of course, be used in conjunction with any classifier which gives probabilities. The main conclusions of our work are as follows.

• To construct a pure sample of objects we should use a probabilistic classifier and only select objects with high probabilities. By varying this probability threshold we can trade off sample completeness and contamination.

• To achieve pure samples of rare objects, we must take into account the expected class fractions in the target population, which act as a prior probability on the classifier. We use these to modify the nominal classifier outputs to give the modified model.

• We applied our modified model to a three class problem in which quasars are simulated to be 1000 times rarer than stars and galaxies. We can achieve a pure quasar sample (zero contamination) yet still reach a sample completeness of 50–65% for magnitudes down to G=20.0. Although the test set is finite in size, this corresponds to an upper limit on the contamination of 1 in 39 000. This is more than adequate for establishing an astrometric reference frame for Gaia: if Gaia observes half a million quasars, we can build a quasar sample of 250 000 with no more than 13 contaminants.

• While achieving this pure quasar sample, we simultaneously achieve very complete galaxy and star samples (both 99%) with low contamination (both 0.7%) (figures for G=18.5).

• These results were achieved after removing quasars with low equivalent width emission lines from the training sample (defined somewhat arbitrarily as EW < 5000 Å). Including these precluded establishing a low contamination sample, because it resulted in cool (4000–8000 K), highly reddened (A_V = 8–10) stars being confused with them. After removing these quasars, the first stars to contaminate a quasar sample (if we set a low threshold) are these reddened cool stars as well as hot stars (T_eff > 40 000 K).

• Including parallax and proper motion (either as additional SVM inputs, or in a separate mixture model classifier) hardly changes the performance. This is not surprising since the majority of objects with astrometry consistent with zero are actually stars.

• Extending the training and testing sets to include quasars with a full range of interstellar extinction does not significantly alter the results (completeness slightly lower, but contamination unaffected).

• All classification models have a prior, but the prior is often not explicit and is sometimes implicitly influenced by the training data distribution. We have introduced a simple method for calculating the implicit priors in a classification model, which we call the model-based priors. In many cases we have experimented with, these priors are close to the true class fractions in the training data (nominal model) or modified class fractions (modified model).

• We recommend that a classifier be trained on roughly equal numbers of objects in each class so that it can properly learn the class distributions or boundaries. By determining the model-based priors and replacing them with something more appropriate to the target population (e.g. quasars being rare), we can produce a modified model with superior performance. In particular, this is far better at producing large, pure samples of the rare class.

• With our approach we can apply the model to any new target population by specifying the appropriate class fractions (priors) without having to change the training data distribution or re-train the model.
ACKNOWLEDGEMENTS
This work makes use of Gaia simulated observations and we thank the members of the Gaia DPAC Coordination Unit 2, in particular Paola Sartoretti and Yago Isasi, for their work. These data simulations were done with the MareNostrum supercomputer at the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (the Spanish National Supercomputing Center). We thank Jean-François Claeskens, Vivi Tsalmantza and Jean-Claude Bouret for use of the quasar, galaxy and OB star spectral libraries (respectively), and Christian Elting for assistance in assembling the data samples. The MPIA Gaia team was supported in part by a grant from the Deutsches Zentrum für Luft- und Raumfahrt (DLR).
REFERENCES
The Three-dimensional Universe with Gaia, C. Turon, K.S. O'Flaherty, M.A.C. Perryman (eds), ESA, SP-576, 393
Ball N.M., Brunner R.J., Myers A.D., Tcheng D., 2006, ApJ 650, 497
Brown A.G.A., 2006, Gaia DPAC Technical note, GAIA-CA-TN-LEI-AB-005
Burges C.J.C., 1998, Data Mining and Knowledge Discovery 2, 121
Chang C.-C., Lin C.-J., 2001, LIBSVM: a library for support vector machines
The Three-dimensional Universe with Gaia, C. Turon, K.S. O'Flaherty, M.A.C. Perryman (eds), ESA, SP-576, 357
O'Mullane W., Lammers U., Bailer-Jones C.A.L., et al., in Astronomical Data Analysis Software and Systems 16, R.A. Shaw, F. Hill, D.J. Bell (eds), ASP Conf. Ser. 376, 99
Platt J., 2000, in A.J. Smola et al. (eds), Advances in Large Margin Classifiers, MIT Press, p. 61, http://citeseer.ist.psu.edu/platt99probabilistic.html
Richards G.T., Fan X., Newberg H., et al., 2002, AJ 123, 2945
Richards G.T., Nichol R.C., Gray A.G., et al., 2004, ApJS 155, 257
Sordo R., Vallenari A., 2008, Gaia DPAC Technical note, GAIA-C8-DA-OAPD-RS-002
Suchkov A.A., Hanisch R.J., Margon B., 2005, AJ 130, 2439
Tsalmantza P., Kontizas M., Bailer-Jones C.A.L., et al., 2007, A&A 470, 761
Tsalmantza P., Kontizas M., Rocca-Volmerange B., et al., 2008, A&A, submitted
Vanden Berk D.E., Schneider D.P., Richards G.T., et al., 2005, ApJ 129, 2047
Wu T.-F., Lin C.-J., Weng R.C., 2004, Journal of Machine Learning Research 5, 975