Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra
C.A.L. Bailer-Jones⋆, K.W. Smith, C. Tiede, R. Sordo, A. Vallenari
Mon. Not. R. Astron. Soc. 000, 000–000 (0000)
Max-Planck-Institut für Astronomie, Königstuhl 17, 69117 Heidelberg, Germany
INAF, Osservatorio di Padova, Vicolo Osservatorio 5, 35122 Padova, Italy
submitted 7 August 2008; resubmitted 18 September 2008
ABSTRACT
We develop and demonstrate a probabilistic method for classifying rare objects in surveys, with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier (DSC), a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40 000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination. We discuss how a classification model prior may, or may not, be influenced by the class distribution in the training data. Our method controls this prior and so allows a single model to be applied to any target population without having to tune the training data and retrain the model.
Key words: surveys – methods: data analysis, statistical – quasars
Finding rare objects is hard, for two reasons. First, we expect to have to look at many objects before we encounter one, and second, we may not recognise it even when we do. The reason for this is that the prior probability that any one object is of the rare class, P(C_rare), is very small. So even if it has very characteristic features, i.e. the likelihood of the data given the rare object, P(Data | C_rare), is high, the posterior probability, P(C_rare | Data), could still be low. A survey for rare objects obviously requires a very discriminating classifier, but we could assist it by modifying the class probabilities. If we raised the prior, P(C_rare), we are more likely to find the rare objects, but it is inevitable that we will then incorrectly classify other objects as being of the rare class. Depending on our goals, there may be a satisfactory balance between maximizing the expected number of true positives (sample completeness) and minimizing the expected number of false positives (sample contamination). Here we present a method for achieving an optimal balance and a correct prediction of this balance which, although simple, is not trivial and has important implications.

⋆ Email: [email protected]

We illustrate our method in the context of the problem of detecting quasars in the Gaia survey based on their low resolution (R ≃ 30) optical spectra. Gaia is an all-sky astrometric and spectrophotometric survey complete to G = 20, expecting to observe some 10^9 stars, a few million galaxies and half a million quasars. Its primary mission is to study Galactic structure by measuring the 3D spatial distribution and 2D kinematic distribution of stars throughout the
Galaxy and correlating these with stellar properties (abundances, ages etc.) derived from the spectra. With astrometric accuracies as good as 10 µas, Gaia cannot be externally calibrated with an existing catalogue. Instead it must observe a large number of quasars over the whole sky with which to define its own reference frame. This quasar sample must be very clean (low contamination). (The quasar sample is also, of course, of intrinsic interest.)

We present our Gaia classification model, the Discrete Source Classifier (DSC), and report its classification performance using simulated data. We will show how, using our probability modification approach, we can use this to build pure quasar samples, at the (acceptable) loss of sample completeness.
Related work. One of the most comprehensive searches for quasars to date is that done with the SDSS. Richards et al. (2002) defined a colour locus in the ugriz space to identify objects for spectroscopic follow up and estimated the contamination rate of their photometric selection to be 34%. Using the spectroscopy of all point sources taken in SDSS stripe 82, Vanden Berk et al. (2005) assessed the completeness of the Richards et al. selection at 95.7% for sources brighter than i = 19.1. Later, Richards et al. (2004) trained a photometric classifier (based on kernel density estimation) on a set of 22 000 spectroscopically identified quasars and used this to build a sample of 100 000 quasars brighter than g = 21.0, achieving high completeness to g = 19.5 with a contamination of just 5% down to g = 21.0. Suchkov et al. (2005) and Ball et al. (2006) both used decision trees to classify objects in the SDSS photometric catalogue, by training on objects with spectroscopic classifications from SDSS. Gao et al. (2008) used support vector machines and Kd-tree nearest neighbours to classify spectroscopically-confirmed SDSS and 2MASS objects as quasars or stars using just the photometry, with stars and quasars present in equal proportions. With both methods they obtained contamination rates of a few percent (averaged across both classes).

Outline. We start in section 2 by presenting our general prior modification method. In section 3 we present the Gaia-DSC classifier and the data used to train and test it. The performance of this and the application of our method are given in section 4 and discussed in section 5, where we also discuss the interpretation of the probabilities and some of the limitations of this work. We conclude in section 6.
Definitions and notation. We use the term class fractions to mean the relative numbers of objects of each class in a data set. The nominal model refers to the classifier "as is", i.e. the output probabilities without any modifications. In contrast, in the modified model we have modified these outputs to be appropriate to a situation in which there would be very different class fractions, e.g. quasars very rare (section 2.4). The subscript i refers to true classes and the subscript j to model output classes. f^train_i, f^test_i and f^mod_i refer to the class fractions of class i for the training data, test data and effective test data respectively. They are normalized: Σ_i f^train_i = Σ_i f^test_i = Σ_i f^mod_i = 1. This effective test data set is one which we don't actually have, but reflects a target population with different class fractions, e.g. quasars very rare. We make predictions of the performance on this by using the actual test data but by modifying the performance equations (section 2.5). In the figures especially, lower case class names, e.g. quasar, denote estimated classes and upper case names, e.g. STAR, true classes. Thus P(quasar | STAR) means "the DSC-assigned quasar probability given that the source is truly a STAR". This refers to DSC outputs for objects with known classes (e.g. in a test set). This is not to be confused with the notation for the DSC outputs in the general case, P(C_j | x_n, θ), which means "the probability which DSC model θ assigns to class C_j for object x_n". "C&C" stands for "completeness and contamination".

The probabilities and assigned classes from any classifier depend not only on the evidence in the data, but also on the prior probabilities. In this context, the prior is the class probabilities before you look at the data. Priors are always present (whether you accept the Bayesian philosophy or not), but they are not explicit in many models, so we may not know what they are. This is a problem, because if the model priors are inappropriate (e.g. equal probability of star and quasar) this will translate into poor posterior probabilities and thus class assignments. The method we present here allows us to replace priors (even if implicit) with something more appropriate, e.g. quasars much rarer than stars.
DSC assigns to every data vector (e.g. spectrum), x_n, a probability, P(quasar), that it is a quasar, and likewise for the other classes. The classes are exclusive and exhaustive, so the probabilities sum to unity. The classifier is not perfect, so generally P(quasar | QUASAR) < 1 and P(quasar | not QUASAR) > 0. To build a sample for a given class, we select objects which have a DSC output P(class) > P_t, where P_t is a user-defined threshold (and may be different for each class). The sample completeness is the number of objects in the sample which are truly of that class divided by the total number of objects of that class in the test set. The contamination is the number of objects in the sample which are truly of other classes divided by the total number of objects in the sample. For output class j,

completeness_j = n_{i=j,j} / N_{i=j}
contamination_j = Σ_{i≠j} n_{i,j} / Σ_i n_{i,j}     (1)

where n_{i,j} is the number of objects of (true) class i in the sample selected for (output) class j, and N_i is the total number of objects of (true) class i in the test set. (There are various other diagnostics one could use, such as the ROC curve or precision rate.) These equations give us predictions of the completeness and contamination for a new (unlabelled) data set, insofar as we believe it to have the same class fractions as the training data (that's our prior).

All classifiers include a prior, whether explicit or not. We need to know this prior for two reasons. First, we would like to know what assumption our model is actually making (and not what we suspect it is making). Second, we would like to change this prior to something which is appropriate to the problem at hand. Here we discuss what the prior is, how the training data may influence it, and how to calculate it post hoc from a trained model.
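As a concrete illustration of equation 1, the completeness and contamination of a thresholded sample can be computed directly from the posterior probabilities and true labels. The following is a minimal sketch; the function name and the toy numbers are ours, not part of DSC.

```python
def sample_stats(probs, true_classes, cls, p_t):
    """Completeness and contamination (eq. 1) of the sample built by
    selecting objects whose posterior probability for class cls exceeds
    the threshold p_t."""
    # Objects selected into the sample for output class cls
    in_sample = [t for p, t in zip(probs, true_classes) if p[cls] > p_t]
    # N_{i=cls}: total number of objects truly of this class
    n_class_total = sum(1 for t in true_classes if t == cls)
    # n_{i=cls, cls}: true members of the class which made it into the sample
    n_class_in_sample = sum(1 for t in in_sample if t == cls)
    completeness = n_class_in_sample / n_class_total
    contamination = (len(in_sample) - n_class_in_sample) / len(in_sample)
    return completeness, contamination

# Four objects, two classes (0 = star, 1 = quasar), toy posteriors:
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
truth = [0, 1, 1, 0]
c, x = sample_stats(probs, truth, cls=0, p_t=0.5)
# → completeness 1.0 (both true stars are selected),
#   contamination 1/3 (one quasar among the three selected objects)
```

Raising p_t shrinks the sample, typically lowering both completeness and contamination, which is exactly the trade-off explored below.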
The outputs from a trained classifier when presented with data x_n are P(C_j | x_n, θ), the probability that the data is of class C_j given the data and the model, θ. This latter quantity reflects both the architecture of the model and the training data set used to fix its internal parameters. We can think of this output as a posterior probability and write it using Bayes' theorem

P(C_j | x_n, θ) = P(x_n | C_j, θ) P(C_j | θ) / P(x_n | θ)     (2)

The term P(x_n | C_j, θ) is the likelihood of the data given the class and model. The term P(C_j | θ) is the prior probability that, given our model, an object is of class C_j. Bayesian statistics deals with updating probabilities based on new data: the prior reflects our knowledge (based on some other data) before we look at the new data. In the present context, the prior P(C_k = quasar | θ) is the probability that any one object in our survey is a quasar, before we look at its spectrum. We always have some prior information, e.g. with Gaia the fact that it is an all-sky survey to G=20.0. If we know them, we could even treat the magnitude and Galactic latitude as prior information.

Given that we can make the decomposition in equation 2, it follows that all classification models must possess a prior on the class probabilities. In some models, for example linear discriminant analysis or Gaussian mixture models, this prior is explicit and so can be controlled. But in many others, such as neural networks or support vector machines, it is not explicit. (See a standard text on machine learning for details of these methods, e.g. Hastie et al. (2001).) In particular, it may depend on the class fractions in the training data. Take, for example, a standard neural network regression model which is trained by minimizing an error function over the whole data set.
If we trained this on 1000 stars and just one quasar, it will learn to recognise stars much better than quasars, because in minimizing the error it hardly has to worry about fitting the lone quasar. If we changed the training data (class fractions), the model and thus the classifications would change. Other regression models are influenced by the class fractions in different ways, or not at all. Given this dependence on the model and data, we refer to the priors as model-based priors, and the notation P(C_j | θ) reminds us of this.

This issue of class fractions influencing the model performance is well-known in the machine learning literature, where it is referred to as the problem of "class imbalance" or "imbalanced data sets". It has been demonstrated to influence neural networks, support vector machines and classification trees (e.g. Shin & Cho 2003, Visa & Ralescu 2005, Weiss 2004). But how, exactly, do the class fractions affect the classifications and, more specifically, the model-based priors? We might think that in the above example the ratio of the star to quasar prior probabilities implicit in the model is 1000 to 1, but this is generally not the case, because it depends on the model and how it is trained. (If this were the case, then we might be tempted to address the class imbalance problem by changing the training data to have class fractions equal to our priors. But if we had just 10 000 training vectors, we could then have only 10 quasars, making it hard for the classifier to correctly classify quasars. We demonstrate this later, in section 4.6.) The bottom line is that, in general, the model-based prior is not equal to the class fractions in the training data.

We can calculate the model-based priors, P(C_j | θ), directly from the trained model via the marginalization equation

P(C_j | θ) = Σ_{n=1}^{N_test} P(C_j | x_n, θ) P(x_n | θ)     (3)

where the sum is taken over all N_test objects in the test data set. The first term in the sum is the posterior probability. The second term is the probability that we draw object x_n from the test data set, which is 1/N_test. Hence the prior is simply the average of the posterior probabilities. (Note that this calculation assumes the test set has the same class fractions as the training set. Interestingly, because the true classes don't appear in the equation, we don't actually need labels on the individual objects in order to calculate the model-based prior.) It might seem strange that the prior can be calculated from the posterior. Yet because the sum is over all objects in the test set, regardless of their true class, we can think of this summation as eliminating information on individual objects, leaving us with what the model probability is for class C_j in the absence of specific data. This is the prior. If we had three classes equally represented in the data, and the classifier were perfect, the prior for each class would be 1/3. Different class fractions and non-perfect classifiers will give different results. In section 4 we will compare the model-based priors with the training data class fractions.
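With P(x_n | θ) = 1/N_test, equation 3 reduces to a simple average of the posterior probabilities over the (possibly unlabelled) test set. A sketch with made-up posteriors for five objects and three classes:

```python
# Toy posterior probabilities P(C_j | x_n, theta) for 5 objects and 3
# classes (each row sums to 1; in practice these come from the classifier).
posteriors = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
    [0.90, 0.05, 0.05],
]

# Eq. 3 with P(x_n | theta) = 1/N_test: the model-based prior for each
# class is the column mean of the posteriors. No class labels are needed.
n = len(posteriors)
model_prior = [sum(col) / n for col in zip(*posteriors)]
# → approximately [0.5, 0.31, 0.19]
```

Because each posterior row is normalized, the resulting priors automatically sum to one.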
Typically we train a classification algorithm on more or less equal class fractions, because this helps the algorithm to reliably identify the class boundaries. This is our nominal model, θ_nom. Yet the target population to which we want to apply our classifier may have different class fractions (e.g. quasars very rare), in which case the classifier may make poor predictions. The underlying issue here is that the model-based priors of our nominal model differ from our target priors.

To correct for this we would ideally just replace the original (nominal) prior with our modified prior. But we cannot do this directly because the model priors are not explicit. Instead we modify the output (posterior) probabilities to give us the "modified model", θ_mod. From inspection of equation 2 we see we may achieve this by dividing the posterior by the nominal prior and then multiplying it by the modified prior. The nominal priors can be calculated using equation 3. But we cannot use this equation to calculate the modified priors because we don't yet have the posteriors from the modified model. We therefore approximate the modified priors using the class fractions. Let P_nom(C_j | x_n, θ_nom) be the output (posterior) probability from the nominal model, i.e. that trained on the data set with class fractions f^train_i. The modified probability appropriate for a target population with class fractions f^mod_i we define as

P_mod(C_j | x_n, θ_mod) = a_n P_nom(C_j | x_n, θ_nom) f^mod_{i=j} / f^train_{i=j}     (4)

where a_n is a normalization factor which ensures that Σ_j P_mod(C_j | x_n, θ_mod) = 1 for each object n. The factor f^mod_{i=j} / f^train_{i=j} is equivalent to changing the prior in equation 2 and has the expected impact on the posterior. Following the discussion in section 2.3 we do not necessarily expect the class fractions to be a good proxy for the priors, but for now it's the best we can do.
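Equation 4 amounts to re-weighting each nominal posterior by the ratio f^mod/f^train and renormalizing per object. A minimal sketch (the function name and the illustrative numbers are ours):

```python
def modify_posteriors(p_nom, f_train, f_mod):
    """Re-weight the nominal posteriors of one object by the ratio of
    target to training class fractions (eq. 4), then renormalize so the
    modified probabilities again sum to 1 (the factor a_n)."""
    scaled = [p * fm / ft for p, ft, fm in zip(p_nom, f_train, f_mod)]
    a_n = 1.0 / sum(scaled)
    return [a_n * s for s in scaled]

# Two classes (star, quasar): nominal posteriors (0.2, 0.8) from a model
# trained on equal class fractions, re-weighted for a target population
# in which stars are nine times more common than quasars.
p_mod = modify_posteriors([0.2, 0.8], f_train=[0.5, 0.5], f_mod=[0.9, 0.1])
# → [0.6923..., 0.3076...]: the star probability rises from 0.2 to ~0.69
```

Only the ratio of the class fractions matters, so f_mod need not be normalized before the call.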
(We will later check how good the approximation is.) We illustrate the above with a simple example, a two-class model trained on equal numbers of stars and quasars, i.e. f^train = (0.5, 0.5). Suppose that for some object the nominal model gives P_nom(C_star | x_n, θ_nom) = 0.2 and P_nom(C_quasar | x_n, θ_nom) = 0.8. If we want f^mod = (0.9, 0.1) (stars nine times more common than quasars), then from equation 4 we get

P_mod(C_star | x_n, θ_mod) = a_n × 0.2 × 0.9/0.5 = 0.36 a_n = 0.69
P_mod(C_quasar | x_n, θ_mod) = a_n × 0.8 × 0.1/0.5 = 0.16 a_n = 0.31

with a_n = 1/(0.36 + 0.16). We might imagine that these are the probabilities we would have obtained if we had trained the model on the modified class fractions. But this is generally not true, because what we are doing in equation 4 is changing the priors (via the proxy of the class fractions) and this is not equivalent to changing the training data (as we will demonstrate later).

The equations for the completeness and contamination (eqn. 1) depend on the actual number of objects correctly and incorrectly classified and were calculated from the test data set, which by definition has class fractions f^test_i. However, we expect to apply the modified model to a target population with different class fractions (this was the whole point of modifying it). To estimate the C&C, it is not necessary to create a new test data set. Instead, we can modify the equations to reflect the target class fractions, f^mod, namely

completeness_j = (f^mod_{i=j} / f^test_{i=j}) n_{i=j,j} / [(f^mod_{i=j} / f^test_{i=j}) N_{i=j}] = n_{i=j,j} / N_{i=j}
contamination_j = Σ_{i≠j} (f^mod_i / f^test_i) n_{i,j} / Σ_i (f^mod_i / f^test_i) n_{i,j}     (5)

This effectively changes the number of objects in each true class. The quantities n_{i,j} and N_i refer to the number of objects in samples obtained from the modified model applied to the actual test data set. Note that the equation for the completeness remains unchanged, yet the value of the completeness will generally change because the posterior probabilities have changed and so n_{i=j,j} changes. We emphasise that the above equation is independent of the procedure to modify the probabilities described in section 2.4. It is only to be used to make predictions for an effective test set. To calculate the completeness and contamination for an actual test set which already has the modified class fractions, we use equation 1.

Once we have the posteriors from the modified model we can calculate its model-based priors from the test set using equation 3.
But just as for the C&C calculation, we must change the effective number of objects in each true class to represent the modified class fractions. Thus in equation 3 we should use

P(x_n | θ_mod) = (f^mod_i / f^test_i) / N_test     (6)

where x_n, being an object in the (labelled) test set, has known true class, i. If a class is ten times rarer in the modified population, then P(x_n | θ_mod) is reduced by a factor of ten compared to before. In principle we could now use these model-based priors to recalculate the posteriors for the modified model (without using the approximation in equation 4), but we don't do this in the present article. Rather we just report them as a check on what the prior really is.

The Discrete Source Classifier (DSC) is the algorithm we are developing as part of our work in the Gaia Data Processing and Analysis Consortium (DPAC) (O'Mullane et al. 2007). DSC will classify objects based primarily on low resolution prism spectra (so-called "BP/RP" spectra), but supplemented with parallaxes and proper motions and photometric variability where available. Here we restrict ourselves to the three classes "star", "galaxy" and "quasar". After DSC in the pipeline come several algorithms for estimating astrophysical parameters, e.g. stellar parameters. For more details of the classification system, see Bailer-Jones (2005).

Gaia detects objects in a very broad band, G, which has roughly the same wavelength range as the BP/RP instrument. In common with the Gaia detection strategy, DSC only operates on point sources (i.e. makes no use of morphological information). In this paper we present results for simulations at both G=20.0 and G=18.5. Here we describe the underlying classifier, the source libraries and the Gaia simulated data used to train and test DSC.

DSC is presently based on a Support Vector Machine (SVM) (e.g. Cortes & Vapnik 1995, Burges 1998), using the libSVM implementation (Chang & Lin 2001).
An SVM is a supervised learning algorithm which finds the boundary which "optimally" separates two classes of objects. In contrast to many other classifiers, SVMs are defined only by those training objects which lie close to the boundary, the so-called support vectors. The basic algorithm is linear, but by using the "kernel trick" to implicitly map data into a higher dimensional space, it achieves a nonlinear mapping between the input data (the spectrum) and the output classes. The SVM is trained by minimizing the number of misclassifications and using a regularizer for complexity control.

SVMs are fundamentally non-probabilistic: they simply define a boundary and assign a class (C_1 or C_2) depending on which side of the boundary an object falls. Probabilities may be derived from the distance, f, of an object from the boundary, in the present case by fitting a sigmoidal function to P(C_j = C_1 | f) (Platt 2000).

The basic SVM is a two-class classifier. libSVM actually trains K(K−1)/2 two-class classifiers (for the K class problem) and uses the pairwise coupling method described in Wu et al. (2004) (their algorithm 2) to produce normalized probabilities for K classes.

One of the advantages of SVMs over many other nonlinear classifiers (e.g. neural networks) is that the SVM objective function is convex, so the training has a unique solution. However, the model has two hyperparameters, the regularization parameter (the "cost") and the scale length in the radial basis function (RBF) kernel we use. We optimize these ("tune") using a Nelder-Mead algorithm. As this involves retraining the SVM for each combination of hyperparameters, it is a time-consuming process.

The main concepts presented in this paper are not dependent on the choice of classifier. SVM is not perfect, not least because its underlying philosophy of not modelling the data distribution means that probabilities do not arise naturally.
It is, nonetheless, a competitive model in terms of classification error, and in tests we have done it has performed as well as or better than several alternatives, including multilayer perceptrons, RBF networks, boosted trees and mixture models (Elting & Bailer-Jones 2007).

Figure 1. The distribution of stellar parameters in the training and testing data sets.
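The idea of mapping SVM decision values to probabilities via a sigmoid, mentioned above, can be illustrated with a bare-bones stand-in for the Platt (2000) procedure. The decision values, labels, and the simple gradient-descent fit below are our own toy construction (Platt's actual method uses a more careful Newton-style optimizer with regularized targets):

```python
import math

# Toy SVM decision values f (signed distance from the boundary) and the
# true labels of those training objects (1 = class of interest).
f_vals = [-2.0, -1.5, -0.5, 0.3, 1.0, 2.2]
labels = [0, 0, 0, 1, 1, 1]

def sigmoid(f, A, B):
    """Platt's parametric form P(C | f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Fit A and B by gradient descent on the cross-entropy.
A, B = -1.0, 0.0
lr = 0.1
for _ in range(2000):
    grad_A = grad_B = 0.0
    for f, y in zip(f_vals, labels):
        p = sigmoid(f, A, B)
        grad_A += (y - p) * f  # d(cross-entropy)/dA
        grad_B += (y - p)      # d(cross-entropy)/dB
    A -= lr * grad_A
    B -= lr * grad_B

# After the fit, objects far on the positive side of the boundary get a
# probability near 1, those far on the negative side a probability near 0.
```

The same fitted sigmoid is then applied to the decision values of new, unlabelled objects.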
Libraries of spectra are being assembled or created for Gaia purposes. Our stellar libraries are based on the Basel (Lejeune et al. 1997) and MARCS (e.g. Gustafsson et al. 2008) libraries plus an OB star library from J.-C. Bouret. Together these span the full range of effective temperature, metallicity and surface gravity. The galaxy library is described by Tsalmantza et al. (2008) (see also Tsalmantza et al. (2007)). It is based on synthetic spectra derived from galaxy evolution models with several astrophysical parameters, in particular those which describe the star formation history and the redshift. The quasar library is also synthetic, with each spectrum uniquely defined by the three parameters continuum slope, α, emission line equivalent width, EW, and redshift, z (Claeskens et al. 2006). Their values are chosen at random from a distribution which is flat in redshift over the range 0–5.5 and flat in α. Interstellar extinction is applied to the stars over the range A_V = 0–10 (with fixed extinction law R_V = 3.1). The key stellar parameters are T_eff and A_V; their distribution is plotted in Fig. 1. The distributions may be a little artificial (the three "steps" reflect the three libraries), but they are sufficient for the purpose of this article.

This data set is obviously rather limited (e.g. purely synthetic data, no extinction in the quasars, no emission line stars, no strong galaxy emission). Our main goal at this stage is not to produce a data set tuned to the real Gaia sky. We discuss the issue of training sets more in section 5.

Figure 2. Random selection of five stars (blue solid lines) and five quasars (red dashed lines) (G=18.5, noise included). The first 46 pixels are the blue channel (Blue Prism, BP), the remaining 40 the red channel (Red Prism, RP), separated by a dashed line.
To build and test the processing pipeline, the DPAC has built a sophisticated data simulator (Luri et al. 2005). For each object in the spectral libraries we simulate its BP/RP spectrum, parallax and proper motion, including all sources of random error. Gaia observes every part of the sky between 40 and 200 times over five years (depending on ecliptic latitude). The average number of spectra for an object in an end-of-mission stack is 72, which we use here. Our simulated data are the so-called cycle 3 data (Sordo & Vallenari 2008).

Gaia prism spectra are obtained in two channels, one for the blue called BP and one for the red called RP, with cut-offs defined by the (silver) mirror response in the blue, CCD QE in the red, and bandpass filters in the middle. After removal of low sensitivity regions, BP ranges from 338 to 998 nm over 46 pixels and RP from 646 to 1082 nm over 40 pixels. We use all 86 pixels as separate inputs to the classifier (we do not attempt to combine the wavelength overlapped region). The dispersion varies strongly with wavelength from 3 nm/pixel (blue end) to 30 nm/pixel (red end). The instrumental point spread function (PSF) is significantly broader than the pixel size, and in one conservative estimate there are only 18 independent measures in BP/RP (Brown 2006). Example spectra are shown in Fig. 2. The dominant noise source is Poisson noise from the source, so the SNR is approximately the square root of the flux. (At G=20, the fluxes are four times smaller and the SNR is about half.) The corresponding library spectra (i.e. prior to being put through the Gaia instrument simulator) are shown in Fig. 3. The original stellar library spectra are calculated at a resolution of 0.1 nm and show many absorption lines, so to avoid confusion, we just plot every 20th pixel. Note how much information is lost – especially concerning the spectral lines – in BP/RP.

Figure 3. The (noise-free) library spectra for the stars (top panel) and quasars (bottom panel) shown in Fig. 2. The photon flux has been normalized in each case to have a maximum of 1.0.
We included the astrometry in our classifiers in the expectation that it could help distinguish galactic and extragalactic objects. Fig. 4 shows the distribution of the simulated astrometry (with noise) for G=18.5. Astrometry for the extragalactic objects is just zero plus noise. Note that this causes negative parallaxes (which are preserved in the Gaia astrometric data processing for model fit assessment purposes). Stellar parallaxes, ̟, are simulated in a way to ensure consistency with the apparent and absolute magnitudes and stellar parameters. The stellar proper motions are drawn at random from a zero mean Gaussian with standard deviation 10.55̟ mas/yr, which simulates a distance-independent space velocity distribution with a standard deviation of 50 km/s. At G=18.5, the Gaia parallax error is typically 0.1 mas and the proper motion error around 0.14 mas/yr (both dominated by source photon statistics). The errors are twice as large at G=20.0.

The input data for our classifiers is BP/RP plus astrometry (although it turns out that inclusion of astrometry hardly improves the results). Models are trained and tested on noisy data and a single G-band magnitude is used in each experiment. Train and test data are drawn at random (without replacement) from a common data set. Unless stated otherwise, the training data set comprises 5000 objects from each of the three classes and the test data 60 000 objects of each class. The reason for the large number in the test set will become clear in section 4.3.

Figure 4. Distribution of the simulated astrometry. Blue/solid = star, red/dashed = quasar, black/dotted = galaxy (which is indistinguishable from the quasars).
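The proper-motion recipe described above can be sketched in a few lines. The function name and structure are ours; the only physical inputs are the 50 km/s velocity dispersion stated in the text and the standard conversion v_t = 4.74 µ/̟ between tangential velocity, proper motion and parallax (which gives the 10.55̟ mas/yr dispersion):

```python
import random

KAPPA = 4.74  # v_t [km/s] = KAPPA * mu [mas/yr] / parallax [mas]

def draw_proper_motion(parallax_mas, sigma_v_kms=50.0, rng=None):
    """Draw one proper-motion component (mas/yr) for a star of given
    parallax, such that the tangential velocity dispersion is
    sigma_v_kms independently of distance."""
    rng = rng or random.Random()
    # sigma_mu = sigma_v * parallax / 4.74 ≈ 10.55 * parallax for 50 km/s
    sigma_mu = sigma_v_kms / KAPPA * parallax_mas
    return rng.gauss(0.0, sigma_mu)

# A star at 1 kpc (parallax 1 mas) gets a proper-motion dispersion of
# about 10.55 mas/yr per component.
mu = draw_proper_motion(1.0, rng=random.Random(42))
```

A nearby star (large parallax) thus receives proportionally larger proper motions, as expected for a fixed velocity dispersion.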
We present results for three experiments:
(1) objects at G=18.5;
(2) as (1), but in which quasars with emission line equivalent widths less than 5000 Å have been removed from the training data set, leaving 2901 quasars (the test set is unchanged);
(3) as (2), but for G=20.0.

We show results for both the "nominal" and "modified" cases. In the nominal cases, the class fractions are those given by the data sets themselves, namely f^train = f^test = (1, 1, 1) for (star, quasar, galaxy) respectively, except for the training data in cases 2 and 3 which has f^train = (1, 0.58, 1). In the modified cases quasars are made very rare, with f^mod set such that quasars have a frequency of only 1 in every 2000 objects.

Table 1.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=18.5.

          galaxy   quasar    star
GALAXY     99.36     0.19    0.45
QUASAR      1.03    96.04    2.94
STAR        0.49     0.88   98.63
Figure 5. Histograms of DSC outputs for each class showing how confident DSC is of identifying each class. Nominal priors, G=18.5.

We illustrate the results in three ways: via the confusion matrix and the histograms of the output probabilities; by imposing a threshold, P_t, on the posterior probabilities in order to build samples (section 2.2); and by showing how the sample completeness and contamination vary with this threshold for each output class. We report only briefly on this first experiment, primarily in order to motivate the removal of the low EW quasars in experiments (2) and (3).
Figure 6. Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number of different types of class in the quasar sample: stars (blue/solid line); quasars (red/dashed line); galaxies (black/dotted line). Nominal priors, G=18.5.
Figure 7. Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. Modified priors, G=18.5.
Figure 8.
Examples of low EW quasars (red/dashed lines) and those stars (blue/solid lines) which are the contaminants in the quasar sample in the modified G=18.5 model.
The confusion matrix (Table 1) shows that good classification accuracy is achieved with the nominal model. The histograms of the posterior probabilities (Fig. 5) further show that these classifications are achieved with high confidence. This translates into a satisfactory trade-off in sample completeness and contamination at thresholds between 0.5 and 0.9 (Fig. 6). The bottom right-hand panel shows that stars rather than galaxies are the main contaminant in the quasar sample. If we now proceed to the modified model, the C&C curves are very different (Fig. 7). While we would expect higher contamination for a given completeness level for quasars (because they are now very rare), we see that it is impossible to get a low contamination quasar sample at all. The bottom right-hand panel of that plot explains why: the effective number of stars in the sample (366) is much higher than the effective number of quasars (63), so the contamination is high, 366/(366 + 63) = 0.85. (To recall: the effective number is the number we would have in a data set which had the modified class fractions.)

Inspection of the contaminants shows that they are predominantly cool, highly-reddened stars, with T_eff between 4000 and 8000 K and A_V mostly between 8 and 10 mag. While the full stellar data set is dominated by stars in this temperature range, the extinction distribution peaks towards lower extinctions (Fig. 1), so this tendency for highly extincted contaminants is significant. Figs. 8 and 9 plot a random selection of these stellar contaminants along with a random selection of quasars with lower values of the emission line EW parameter. Note how the very broad PSF of BP/RP smooths out the sharp features in the quasar spectra, especially at the red end of the two channels where the dispersion is lower. The narrower quasar emission lines are not resolved, and these objects look more like the smoother stellar spectra. We hypothesize that if we remove such quasars from

(The completeness always decreases monotonically from 1 at P_t = 0, where all objects are included in the sample, to 0 at P_t = 1, where no objects are in the sample. The maximum contamination occurs at P_t = 0 with a value depending on the class fractions (for equal class fractions in K classes it will be 1/K) and drops to zero at P_t = 1.)

Figure 9.
The (noise-free) input spectra for the stars (top panel) and quasars (bottom panel) shown in Fig. 8. The photon flux has been normalized in each case to have a maximum of 1.0.
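The effective-number bookkeeping used in the contamination figures can be sketched as follows. The idea is that raw per-class counts in a selected sample are rescaled by the ratio of the target class fractions to the test-set class fractions before computing the contamination. All numbers below are illustrative, not the paper's actual run:

```python
# Sketch of computing a contamination on a test set whose class fractions
# differ from the target population: rescale raw counts by f_mod/f_test
# ("effective numbers"), then take the non-target fraction.
def effective_contamination(raw_counts, f_test, f_mod, target):
    eff = {c: raw_counts[c] * f_mod[c] / f_test[c] for c in raw_counts}
    total = sum(eff.values())
    return (total - eff[target]) / total

raw = {"star": 366, "quasar": 63000}       # hypothetical counts in the sample
f_test = {"star": 0.5, "quasar": 0.5}      # equal fractions in the test set
f_mod = {"star": 0.5, "quasar": 0.0005}    # quasars 1000 times rarer
cont = effective_contamination(raw, f_test, f_mod, target="quasar")  # ~0.85
```

With these made-up inputs the 63 000 raw quasars become an effective 63, giving the kind of high contamination, 366/(366 + 63), that the rescaling is meant to reveal.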
Table 2.
Model-based priors for the nominal, P(C_j|θ_nom), and modified, P(C_j|θ_mod), cases for the full training data and the case in which low EW quasars have been removed ("nlEW"). For comparison we show the class fractions relevant to the nominal models, f_train_i, and the modified models, f_mod_i.

                data    G      star     quasar     galaxy
P(C_j|θ_nom)    full   18.5   0.3380   0.3279     0.3341
f_train_i       full   18.5   0.3333   0.3333     0.3333
P(C_j|θ_mod)    full   18.5   0.4965   0.002514   0.5010
f_mod_i         full   18.5   0.4998   0.000500   0.4998
P(C_j|θ_nom)    nlEW   18.5   0.367    0.283      0.350
P(C_j|θ_nom)    nlEW   20.0   0.368    0.260      0.372
f_train_i       nlEW   both   0.388    0.225      0.388
P(C_j|θ_mod)    nlEW   18.5   0.4983   0.000328   0.5013
P(C_j|θ_mod)    nlEW   20.0   0.4762   0.000277   0.5234
f_mod_i         nlEW   both   0.4998   0.000500   0.4998

our training set, and thus from our definition of quasars, then the SVM should not so readily confuse these stars with our quasar class. We test this in the next experiment (section 4.3).

Table 2 lists the model-based priors (section 2.3). The first line is for the nominal model. The second row gives, for comparison, the fraction of objects in each true class, f_train_i, in the training data. These we may consider as frequentist estimates of the model priors, insofar as the frequency distribution of the classes dictates these. At least for this SVM model with equal class fractions, the model-based priors are close to the class fractions.

The third and fourth lines give the same but for the modified model. Now we see that the modified class fraction, f_mod_i, for the quasars is not a good proxy for the model-based prior. This implies that its use in equation 4 will give poor estimates for the true posterior probabilities. We could attempt to improve this by an iterative procedure: now that we have the model-based priors, we can recalculate the model posteriors directly from Bayes' equation (2), rather than our approximation (equation 4), and then recalculate the model-based priors with equation 3. However, we do not do this because in our main experiments (next), the discrepancy is not as large.

Motivated by the results of the previous experiment, we removed the low equivalent width quasars (EW < 5000 Å) from the training set.

Table 3.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=18.5, no low EW quasars in training data.

            galaxy   quasar    star
GALAXY       99.37     0.00    0.63
QUASAR        4.22    85.59   10.19
STAR          0.68     0.13   99.19

Comparing the confusion matrix (Table 3) to that in the previous experiment, we now see that fewer quasars are correctly classified, with 10% being misclassified as stars. Yet this loss of quasars is balanced by the fact that six times fewer stars are now misclassified as quasars (0.13% rather than 0.88% previously, or 78 stars rather than 528). This is what we wanted to achieve by modifying the training sample. Note how few galaxies are misclassified as stars and quasars, and how this has hardly changed from the previous experiment. In all experiments we have performed with these data (including many not reported here), the galaxies are always classified with high confidence.
As we are not changing the class fractions for galaxies – they are acting mostly to make the classification problem for the SVM harder – we focus on the stars and quasars from now on.

We summarize the model confidence (posterior probabilities) in the histograms in Fig. 10. We can read several things from this: the leading diagonal shows P(class|CLASS), how confident the true positives are; the central row shows P(class|QUASAR), the probabilities assigned to each class for true quasars; the central column shows P(quasar|CLASS), the quasar probabilities assigned to objects of each true class. We see that the confidences for the correct classes
Figure 10.
Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true class (rows). Nominal priors, G=18.5, no low EW quasars in training data.

are still high. However, comparing with the results of the previous experiment (Fig. 5), we now see a set of true quasars which are assigned very low probabilities, P(quasar|QUASAR). These are predominantly (but not exclusively) the low EW quasars that were removed from the training data. These low probability cases are reflected as a dip in the quasar completeness curve (Fig. 11) at low P_t. These quasars show up in the star sample and we correspondingly see a higher contamination for the stars. Yet the positive trade-off from this is a lower quasar sample contamination.

We turn now to the modified model in which our prior is that quasars are 1000 times rarer. The histograms of the posteriors are shown in Fig. 12, which we may compare to the nominal case in Fig. 10. The main difference can be seen in the central plot, where we now observe a significant peak around zero probability for P(quasar|QUASAR): the modified model has a high barrier to classifying anything as a quasar. This is the result of the probability remapping, equation 4 (whereby the impact of the normalization should not be ignored). This provides more discrimination between those quasars which, in the nominal case, were all assigned an output probability of almost 1.0.

Fig. 13 shows the C&C for the modified model. It is significantly different from the nominal case. The quasar completeness is slightly reduced, but we win a much lower contamination. Indeed, at P_t = 0.11 the contamination drops to zero for a sample completeness of 65%. This translates to a contamination of less than 1 in 0.65 × 60 000 ≈ 39 000. At P_t = 0.11 the effective number of quasars in the sample is 54. If we had done this experiment with a test set containing, say, only 2000 quasars, then the effective number of quasars in the resultant sample would only have been 54 × (2000/60 000) = 1.8. This is not enough to draw results of any significance and explains why we need large test sets for evaluating the modified model.

Figure 12.
Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true class (rows). Modified priors, G=18.5, no low EW quasars in training data.
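The probability remapping referred to above can be sketched as follows. Equation 4 itself is not reproduced in this excerpt, so the form below (reweight each nominal posterior by the ratio of target to training class fractions, then renormalize) is our paraphrase; all posteriors and class fractions are illustrative:

```python
# Sketch of remapping nominal posteriors to new class priors: reweight
# by f_mod/f_train and renormalize. With quasars 1000 times rarer, even a
# nominally confident quasar classification is strongly suppressed.
def modify_posteriors(p_nom, f_train, f_mod):
    p = [pi * m / t for pi, m, t in zip(p_nom, f_mod, f_train)]
    s = sum(p)
    return [pi / s for pi in p]

p_nom = [0.10, 0.85, 0.05]     # nominal output for (star, quasar, galaxy)
f_train = [1.0, 1.0, 1.0]      # equal class fractions in training
f_mod = [1.0, 1e-3, 1.0]       # quasars assumed 1000 times rarer
p_mod = modify_posteriors(p_nom, f_train, f_mod)
```

In this toy case the quasar posterior drops from 0.85 to below 0.01, which is the "high barrier" to quasar classification seen in the central histogram.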
Table 4.
Confusion matrix for class assignments achieved by applying a probability threshold. Each row gives the percentage of objects of each true class assigned to the different output classes (columns). As an object may now be assigned to more than one class, the values in a row no longer sum to 100%, and some objects may remain unclassified.

The confusion matrix for this modified case is shown in Table 4. As class assignments are now based on thresholds rather than maximum probability, and because we can set these thresholds independently of one another, a given object may be assigned to more than one class or it may remain unclassified. Therefore the values in a row no longer sum to 100%. Here we use a separate threshold for each class, with P_t = 0.11 for the quasars. Of the stars contaminating the quasar sample at this threshold, one is hot, with T_eff ≃ 40 000 K. The other three are cooler (two around 4000 K, one around 7000 K) but have very high extinctions (8–9 magnitudes in A_V). This gives rise to the much
Figure 11.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number of different types of class in the quasar sample: stars (blue/solid line); quasars (red/dashed line); galaxies (black/dotted line). Nominal priors, G=18.5, no low EW quasars in training data.

larger flux in the red which DSC confuses with a quasar-like broad emission line feature.

We recall that in this experiment we removed the low EW quasars from the training set, but not the test set. We may, therefore, expect this model to perform particularly badly on these objects. However, if we plot Fig. 12 just for these objects it turns out to be broadly similar, although with about half as many objects in the P(quasar|lowEWQUASAR) = 1 bin and twice as many in the P(quasar|lowEWQUASAR) = 0 bin. So while this model does not do as well on the objects omitted from the training data, it is still able to assign many of them a high probability. That is, DSC has been able to extend its recognition of quasars to lower EW cases than it was trained on.

The C&C plots for the modified model are of course just predictions of what the C&C would be if we applied the modified model to a real data set which has the modified population fractions. We can test these predictions by building a new test data set in which the population fractions really are (1, 1/1000, 1). To do this we took the original test data set but retained just a random selection of 1/1000 of the quasars (i.e. 60 quasars). We applied the modified model and calculated the C&C (now using equation 1, of course, not equation 5). As there are only 60 quasars in this set, the predictions have high variance, so we bootstrapped the selection 20 times. Fig. 16 shows that the measured C&C are very close to the predictions, thus validating our approach.

Figure 13.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. There are no galaxy contaminants. Modified priors, G=18.5, no low EW quasars in training data.
Theoretical arguments aside, could we nonetheless use the nominal model to classify data sets in which quasars are rare? To assess this, we applied the nominal model to the few-quasar data set from the above test (section 4.3.3). The measured C&C are plotted in Fig. 17 as the dashed lines. These agree well with the C&C we would predict on the effective test data set (i.e. using equation 5) but using the nominal model, which are shown as the solid lines in Fig. 17. While they agree, this figure shows that the sample contamination is far higher when using the nominal model than the modified model (cf. Fig. 16). That is, the modified model is superior at obtaining pure quasar samples. This demonstrates that we must take into account the expected class fractions (i.e. use a suitable prior) when applying a classifier to a data set in which we expect a class to be rare. Finally for this experiment, we note that the model-based priors agree quite well with the training data class fraction or the modified class fraction for the nominal and modified models respectively (Table 2).
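The model-based priors referred to here can be estimated with a very short computation. Equation 3 is not reproduced in this excerpt, so the sketch below paraphrases it as the model's posterior averaged over a representative set of objects, which is consistent with the observed behaviour (priors close to the class fractions); the toy posteriors are made up:

```python
# Sketch of a model-based prior estimate: average the classifier's
# posterior for each class over a set of objects.
def model_based_priors(posteriors):
    n = len(posteriors)
    k = len(posteriors[0])
    return [sum(row[j] for row in posteriors) / n for j in range(k)]

# Toy posteriors for four objects over (star, quasar, galaxy):
post = [
    [0.9, 0.05, 0.05],
    [0.1, 0.80, 0.10],
    [0.2, 0.10, 0.70],
    [0.8, 0.05, 0.15],
]
priors = model_based_priors(post)   # ~ [0.5, 0.25, 0.25]
```

If the model were perfectly calibrated on data with known class fractions, these averages would reproduce those fractions, which is the comparison made in Table 2.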
At G=20 the noise is higher, so we expect DSC to perform less well. The basic class assignment based on maximum

Figure 14.
Spectra of stars which are most confused as quasars, namely those 7 stars which have the highest values of P_mod(quasar|STAR). G=18.5, modified priors, no low EW quasars in training data.
Figure 15.
The (noise-free) input spectra for the stars shown in Fig. 14. The photon flux has been normalized in each case to have a maximum of 1.0.
Figure 16.
Comparison of the completeness (blue/thick lines) and contamination (red/thin lines) obtained with the modified model at G=18.5 (no low EW quasars in training data). The solid lines are the predictions (same as those in the top right panel of Fig. 13) and the dashed lines are the measurements from a data set in which quasars really are rare (class fractions of (1, 1/1000, 1)).
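The construction of the rare-quasar test sets behind this comparison (keep a random 1/1000 of the test-set quasars and repeat the selection 20 times) can be sketched as follows; the 60 000-quasar test set is a stand-in consistent with the numbers quoted in the text:

```python
import random

# Sketch of repeatedly drawing a rare-quasar test set: retain a random
# `fraction` of the quasars, repeating the selection to estimate the
# variance of the measured completeness and contamination.
def rare_subsample(objects, fraction, rng):
    k = max(1, round(fraction * len(objects)))
    return rng.sample(objects, k)

rng = random.Random(42)              # fixed seed for repeatability
quasars = list(range(60000))         # stand-ins for the test-set quasars
selections = [rare_subsample(quasars, 1 / 1000, rng) for _ in range(20)]
```

Each of the 20 selections contains 60 quasars, matching the repeated-selection procedure described for Fig. 16.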
Figure 17.
As Fig. 16 but now for the nominal model
Table 5.
Confusion matrix for class assignments from maximum probability. Each row corresponds to a true class and sums to 100%. Nominal priors, G=20.0, no low EW quasars in training data.

            galaxy   quasar    star
GALAXY       93.53     0.12    6.66
QUASAR        7.40    77.13   15.47
STAR         13.92     0.21   85.87

posterior class probability is indeed worse (Table 5). Comparing with Table 3 we see that the stars suffer in particular, with only 86% being correctly classified, as opposed to 99% at G=18.5. This is reflected by the lower confidence the model gives to its classifications (Fig. 18). Likewise, the completeness is lower and the contamination higher at a given threshold (Fig. 19).
Figure 18.
Histograms of DSC outputs for each class showing how confident DSC is of identifying each class. Nominal priors, G=20.0, no low EW quasars in training data.
Figure 19.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). Nominal priors, G=20.0, no low EW quasars in training data.
If we now modify the priors and recalculate the posterior probabilities then we see a similar shift in the distributions to that for the G=18.5 case (histograms not shown). The C&C plot with the modified model is shown in Fig. 20. What stands out immediately is that we are still able to get a clean (zero contamination) quasar sample. This occurs once P_t = 0.07, at which point the completeness is 58% (and remains above 50% if we choose to take a larger threshold). This compares to a completeness of 65% at P_t = 0.11 for zero contamination at G=18.5. Thus we see that the higher noise in the data can be translated into a lower completeness (which is acceptable) rather than a raised contamination (which is not). Notice that there is no guarantee that the contamination should drop to zero before P_t = 1 (cf. section 4.2), so this is an important result.

We may expect parallaxes and proper motions to provide a good discriminant between Galactic and extragalactic objects. Yet it turns out that if we remove astrometry from the DSC, the performance hardly degrades (accuracy decreases by no more than 1%). This observation may not be surprising, because the majority of objects with astrometry consistent with zero (within the measurement errors) are actually stars, and that is because our training sample is dominated by distant stars. It is also not specific to the SVM. We also built a two-stage classifier, in which (1) an SVM predicts classes based only on photometry and (2) Gaussian mixtures
Figure 20.
Completeness (blue/thick line) and contamination (red/thin line) of a sample as a function of the probability threshold. The bottom right panel shows the logarithm of the actual number (thick lines) of different types of class in the quasar sample (i.e. in the test set): stars (blue/solid line); quasars (red/dashed line). The thin lines show the corresponding effective number of objects for the modified case. There are no galaxy contaminants. Modified priors, G=20.0, no low EW quasars in training data.

are used to model the density in the 2D astrometric space and predict classes. The probabilities are then combined to give a single posterior (Bailer-Jones & Smith 2008b). The results are no better than an SVM using BP/RP and astrometry (although the two-stage classifier offers more flexibility and interpretability).
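The combination rule actually used in the two-stage classifier is given in Bailer-Jones & Smith (2008b) and is not reproduced here. A common choice, sketched below under the assumption that the two inputs are conditionally independent given the class, multiplies the two posteriors and divides out the shared prior once; all numbers are illustrative:

```python
# Sketch of naive-Bayes style combination of two classifier stages:
# P(C | x1, x2) proportional to P(C | x1) * P(C | x2) / P(C),
# assuming x1 and x2 are conditionally independent given the class.
def combine_posteriors(p1, p2, prior):
    p = [a * b / pr for a, b, pr in zip(p1, p2, prior)]
    s = sum(p)
    return [x / s for x in p]

p_photo = [0.2, 0.7, 0.1]    # hypothetical posterior from the BP/RP stage
p_astro = [0.3, 0.6, 0.1]    # hypothetical posterior from the astrometric stage
prior = [1 / 3, 1 / 3, 1 / 3]  # shared prior, divided out once
combined = combine_posteriors(p_photo, p_astro, prior)
```

Two stages that weakly agree on a class reinforce each other under this rule, which is why a combined posterior can be sharper than either input alone.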
In section 2.3.2 we argued that the model class priors are not necessarily dictated by the class fractions in the training data. Moreover, in attempting to match the training data distribution to that in the target population for a rare object, we may end up with very few objects in the training data, such that the classifier does not learn to classify them well. We test this by varying the fraction of quasars in the training data set, q, from 100% (equal class fractions) down to 0.1% (60 quasars). We then build samples based on maximum probability and measure the C&C (again using equation 5 to account for the modified class fractions). Fig. 21 shows how the completeness and contamination of the quasar sample vary with q. For modest reductions in q the contamination is 5%, which is too high for a pure sample (although the samples are built using maximum probability, so we could reduce this at the cost of reduced completeness). Surprisingly, the contamination actually drops to zero once

Figure 21.
Quasar sample completeness (thick blue line) and contamination from stars (thin red line) and galaxies (dotted black line) as a function of the fraction of quasars in the training data set (the first two values are 0.1% and 1%).

the quasar fraction drops below 5%. However, this comes at the price of greatly reduced completeness: just 0.1% at q = 0.1%. This we can directly compare to the results from the modified model in experiment (2) (section 4.3.2), where the completeness is still 62%. Thus we conclude that pruning the training data set is not the way to achieve a good classifier for a rare class.

Any supervised method is limited by the data it is trained on. Systematic differences between these and the target population can compromise the classifier. It is not the goal of this article to present the final classifier for Gaia, so we have not yet optimized the training sets. Nonetheless, our work does raise some issues relevant to this important next step, which we briefly discuss.

First, we found that removing the low EW quasars was important for achieving clean quasar samples in the modified case. This may reflect a limitation of the SVMs, but it raises the general issue of how to set the training data distribution. We do not expect a supervised method to perform well on regions of the data space not included in the training data (although we did find that the model could correctly classify many low EW quasars). For this reason, DSC will be preceded by an "outlier detector" in the classification pipeline. This assesses how well an observed object is "covered" by the training space and only passes "known" objects to DSC. It could be based on density estimation (in low dimensions) or one-class SVMs.

Second, if we are building a classifier only to look for a specific type of object, then we do not need to aim for completeness in the training sample (although we do need to include potential contaminants).
For example, none of our quasar simulations include interstellar extinction (unlike the stars and galaxies). This could be motivated by saying we do not trust extinction laws and just want to find unreddened quasars. We are free to make this choice at the cost of reduced completeness of real quasars. If we included reddened quasars in the test set, then either they would be classified as quasars anyway (in which case the completeness goes up), or they would be classified as something else. Either way, the quasar contamination would not increase, so our conclusion about building pure samples of unreddened quasars is unaffected. Incidentally, the results in section 4 are very similar if we repeat the experiments but with interstellar extinction applied also to the quasars. The main difference is that the quasar sample completeness is reduced by around 10%, with the more reddened quasars now contaminating the star samples. The quasar contamination is not increased.

Third, how should we define the classes? If we had split the quasars into two classes (e.g. high and low emission line EW), maybe this would help the SVM to retain zero contamination on the high EW class without having to remove the low EW quasars entirely from the training data. This will be the subject of future work.

Gaia does not exist in a void; we of course have information from other catalogues/surveys to help us identify quasars. The full DSC is actually a multi-stage algorithm, in which the present paper just describes the stage based on the BP/RP spectrum. Other stages provide class probabilities based on other data, for example the source magnitude or Galactic latitude or external catalogues, in which case we might call them "priors". We will describe this multi-stage approach in a future paper.
When we build a sample by applying a threshold, all of the objects have a posterior probability which is at least as high as the threshold. For example, with the modified model in section 4.3.2, we were able to set P_t = 0.11 and produce a sample of only quasars. There is of course no guarantee that all samples built using this threshold have no contaminants. On the other hand, this threshold is not a statement about the sample as a whole: it would be wrong to expect only 11% of the sample to be quasars. We need to look at the actual probabilities, not a lower limit. How can we use these? Imagine a case in which we set P_t = 0.90 and thereby obtained a sample of 1000 (supposed) quasars, all of which happen to have P(quasar) = 0.95. We would reasonably expect 5% not to be quasars. Yet we must be careful with such an inference, because posterior probabilities are not frequencies (especially if we have very non-uniform priors, e.g. class fractions of (1, 1/1000, 1)). If P(quasar) = p for all objects in a sample of size n, then the probability that exactly r of them (r ≤ n) are non-quasars is

P(n, r, p) = [n! / (r! (n − r)!)] (1 − p)^r p^(n − r)     (7)

The left panel of Fig. 22 shows how (the logarithm of) this quantity varies with r for n = 1000 and p = 0.95. (It has a maximum of P(n, r, p) = 0.058 at r = 50.) By summing equation 7 over ranges of r we can make more useful statements. For example, summing from r = 0 to r = 50, we can say that the probability that there are 50/1000 = 5% contaminants or fewer is 0.54. Equivalently, the probability that
Figure 22.
The left panel shows the variation with r of P(n = 1000, r, p = 0.95) (equation 7), the probability that there are exactly r contaminants in the quasar sample of n objects, each with posterior quasar probability of p. The right hand panel is 1 minus the cumulation of this (equation 8), and so shows the probability that there will be more than r contaminants (shown as the contamination fraction r/n).

there is more than 5% contamination is 0.46. The right hand panel of Fig. 22 shows this for variable r. Specifically it plots

1 − Σ_{r′=0}^{r} P(n, r′, p)     (8)

against r/n, the contamination fraction. This figure agrees approximately with our intuition of a 5% contamination when all posterior probabilities are 95%.

We can apply this to the case in section 4.3.3, where we applied the modified model to a test set with 60 quasars. Applying a threshold gives a sample of 37 objects. In principle we should use the individual probabilities (the distribution of P(quasar|QUASAR) in Fig. 10), but we approximate by setting p equal to the average for this sample, which is 0.95. Equation 8 tells us (n = 37, r = 1, p = 0.95) that the probability of having more than one contaminant is 0.56. This is consistent with having no contaminants. Of course, this average probability is not very representative, so this does not strictly apply. For example, we could also have taken a higher threshold, for which p = 0.997 and the probability of more than one contaminant is just 0.006.

Despite these approximations, it appears that the inferences about the sample as a whole are consistent with the individual posterior probabilities.
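Equations 7 and 8 are directly computable; the sketch below reproduces the two quantities quoted in the text (the peak of the binomial for n = 1000, p = 0.95, and the more-than-r tail for the 37-object sample):

```python
from math import comb

# Sketch of equations 7 and 8: if each of the n sample members has quasar
# probability p, the number of contaminants r is binomially distributed;
# one minus the cumulative sum gives the chance of more than r contaminants.
def p_exactly(n, r, p):
    return comb(n, r) * (1 - p) ** r * p ** (n - r)

def p_more_than(n, r, p):
    return 1.0 - sum(p_exactly(n, rr, p) for rr in range(r + 1))

peak = p_exactly(1000, 50, 0.95)     # ~0.058, the maximum quoted in the text
tail = p_more_than(37, 1, 0.95)      # ~0.56, for the 37-object sample
```

This makes the point of the section concrete: a single threshold says nothing about the sample as a whole, but the individual posteriors, treated as per-object contamination probabilities, do.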
Ultimately we would like to know whether the modified model is "better" than the nominal one. If we ignored the issue of priors and class fractions, then our predictions of sample completeness and contamination would always be given by Fig. 11 (at G=18.5), regardless of the true class fractions in the target population. If we improve this by instead using equation 5 to accommodate the modified class fractions (but still with the nominal model posteriors), then our predictions would be the solid lines in Fig. 17. Yet both are poor predictions of the actual completeness and contamination in a sample where quasars are rare (dashed lines in Fig. 16).

We demonstrated (in section 4.3.4) that the application of the nominal model to a data set with rare quasars gives poorer results (larger contamination) than the modified model at a given probability threshold. It is not yet clear whether we could just use the nominal model but with much higher probability thresholds to achieve similar results. However, there are two reasons why we should not want to do this. First, as we must not build test data sets with very small numbers of the rare class, we would still have to use the modified class fractions in order to correctly predict the C&C (section 2.5). Second, we want models which deliver actual probabilities, not just numbers between 0.0 and 1.0 to which we apply a meaningless threshold. This is especially important if we later want to update the probabilities, e.g. in a multi-stage classifier. We therefore believe it is better to use the modified model.
We have introduced a method of probability modification which can be used to build clean samples of rare objects for which the completeness and contamination can be reliably predicted. We have demonstrated this using a support vector machine classifier, although it may, of course, be used in conjunction with any classifier which gives probabilities. The main conclusions of our work are as follows.

• To construct a pure sample of objects we should use a probabilistic classifier and only select objects with high probabilities. By varying this probability threshold we can trade off sample completeness and contamination.

• To achieve pure samples of rare objects, we must take into account the expected class fractions in the target population, which act as a prior probability on the classifier. We use these to modify the nominal classifier outputs to give the modified model.

• We applied our modified model to a three class problem in which quasars are simulated to be 1000 times rarer than stars and galaxies. We can achieve a pure quasar sample (zero contamination) yet still reach a sample completeness of 50–65% for magnitudes down to G=20.0. Although the test set is finite in size, this corresponds to an upper limit on the contamination of 1 in 39 000. This is more than adequate for establishing an astrometric reference frame for Gaia: if Gaia observes half a million quasars, we can build a quasar sample of 250 000 with no more than 13 contaminants.

• While achieving this pure quasar sample, we simultaneously achieve very complete galaxy and star samples (both 99%) with low contamination (both 0.7%) (figures for G=18.5).

• These results were achieved after removing quasars with low equivalent width emission lines from the training sample (defined somewhat arbitrarily as EW < 5000 Å). Including these precluded establishing a low contamination sample, because it resulted in cool (4000–8000 K), highly reddened (A_V = 8–10) stars being confused with them. After removing these quasars, the first stars to contaminate a quasar sample (if we set a low threshold) are these reddened cool stars as well as hot stars (T_eff > 40 000 K).

• Including parallax and proper motion (either as additional SVM inputs, or in a separate mixture model classifier) hardly changes the performance. This is not surprising since the majority of objects with astrometry consistent with zero are actually stars.

• Extending the training and testing sets to include quasars with a full range of interstellar extinction does not significantly alter the results (completeness slightly lower, but contamination unaffected).

• All classification models have a prior, but the prior is often not explicit and is sometimes implicitly influenced by the training data distribution. We have introduced a simple method for calculating the implicit priors in a classification model, which we call the model-based priors. In many cases we have experimented with, these priors are close to the true class fractions in the training data (nominal model) or modified class fractions (modified model).

• We recommend that a classifier be trained on roughly equal numbers of objects in each class so that it can properly learn the class distributions or boundaries. By determining the model-based priors and replacing them with something more appropriate to the target population (e.g. quasars being rare), we can produce a modified model with superior performance. In particular, this is far better at producing large, pure samples of the rare class.

• With our approach we can apply the model to any new target population by specifying the appropriate class fractions (priors) without having to change the training data distribution or re-train the model.
ACKNOWLEDGEMENTS
This work makes use of Gaia simulated observations and we thank the members of the Gaia DPAC Coordination Unit 2, in particular Paola Sartoretti and Yago Isasi, for their work. These data simulations were done with the MareNostrum supercomputer at the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (the Spanish National Supercomputing Center). We thank Jean-François Claeskens, Vivi Tsalmantza and Jean-Claude Bouret for use of the quasar, galaxy and OB star spectral libraries (respectively), and Christian Elting for assistance in assembling the data samples. The MPIA Gaia team was supported in part by a grant from the Deutsches Zentrum für Luft- und Raumfahrt (DLR).
REFERENCES
The Three-dimensional Universe with Gaia, C. Turon, K.S. O'Flaherty, M.A.C. Perryman (eds), ESA, SP-576, 393
Ball N.M., Brunner R.J., Myers A.D., Tcheng D., 2006, ApJ 650, 497
Brown A.G.A., 2006, Gaia DPAC Technical note, GAIA-CA-TN-LEI-AB-005
Burges C.J.C., 1998, Data Mining and Knowledge Discovery 2, 121
Chang C.-C., Lin C.-J., 2001, LIBSVM: a library for support vector machines
The Three-dimensional Universe with Gaia, C. Turon, K.S. O'Flaherty, M.A.C. Perryman (eds), ESA, SP-576, 357
O'Mullane W., Lammers U., Bailer-Jones C.A.L., et al., in Astronomical Data Analysis Software and Systems 16, R.A. Shaw, F. Hill, D.J. Bell (eds), ASP Conf. Ser. 376, 99
Platt J., 2000, in A.J. Smola et al. (eds), Advances in Large Margin Classifiers, MIT Press, p. 61, http://citeseer.ist.psu.edu/platt99probabilistic.html
Richards G.T., Fan X., Newberg H., et al., 2002, AJ 123, 2945
Richards G.T., Nichol R.C., Gray A.G., et al., 2004, ApJS 155, 257
Sordo R., Vallenari A., 2008, Gaia DPAC Technical note, GAIA-C8-DA-OAPD-RS-002
Suchkov A.A., Hanisch R.J., Margon B., 2005, AJ 130, 2439
Tsalmantza P., Kontizas M., Bailer-Jones C.A.L., et al., 2007, A&A 470, 761
Tsalmantza P., Kontizas M., Rocca-Volmerange B., et al., 2008, A&A, submitted
Vanden Berk D.E., Schneider D.P., Richards G.T., et al., 2005, ApJ 129, 2047
Wu T.-F., Lin C.-J., Weng R.C., 2004, Journal of Machine Learning Research 5, 975