A classifier for spurious astrometric solutions in Gaia EDR3

Abstract

The Gaia mission is delivering exquisite astrometric data for 1.47 billion sources, which are revolutionizing many fields in astronomy. For a small fraction of these sources the astrometric solutions are poor, and the reported values and uncertainties may not apply. For many analyses it is important to recognize and excise these spurious results, commonly done by means of quality flags in the Gaia catalog. Here we devise and apply a path to separating 'good' from 'bad' astrometric solutions that is an order-of-magnitude cleaner than any single flag: we achieve a purity of 99.7% and a completeness of 97.6% as validated on our test data. We devise an extensive sample of manifestly bad astrometric solutions: sources whose inferred parallax is 'negative' at >= 4.5 sigma; and a corresponding sample of presumably good solutions: the sources in HEALPix patches of the sky that do not contain extremely negative parallaxes. We then train a neural net that uses 14 pertinent Gaia catalog entries to discriminate these two samples, captured in a single 'astrometric fidelity' parameter. An extensive and diverse set of verification tests show that our approach to assessing astrometric fidelity works very cleanly also in the regime where no negative parallaxes are involved; its main limitations are in the very low S/N regime. Our astrometric fidelities for all EDR3 can be queried via the Virtual Observatory. In the spirit of open science, we make our code and training/validation data public, so that our results can be easily reproduced.


MNRAS 000, 1–12 (2020) Preprint 29 January 2021 Compiled using MNRAS LaTeX style file v3.0


Jan Rybizki★, Gregory M. Green, Hans-Walter Rix, Markus Demleitner, Eleonora Zari, Andrzej Udalski, Richard L. Smart, and Andy Gould

Max Planck Institute for Astronomy, Königstuhl 17, D-69117 Heidelberg, Germany
Astronomisches Rechen-Institut, Zentrum für Astronomie der Universität Heidelberg, Mönchhofstrasse 12-14, D-69120 Heidelberg, Germany
Astronomical Observatory, University of Warsaw, Al. Ujazdowskie 4, 00-478 Warszawa, Poland
INAF - Osservatorio Astrofisico di Torino, via Osservatorio 20, 10025 Pino Torinese (TO), Italy
Department of Astronomy, Ohio State University, 4055 McPherson Laboratory, 140 West 18th Avenue, Columbus, Ohio 43210, USA

Accepted XXX. Received YYY; in original form ZZZ

Key words: Galaxy: stellar content, Galaxy: kinematics and dynamics, software: public release, space vehicles: instruments, virtual observatory tools

Parallax measurements contain information about the distance of astrophysical objects, and are critical to anchoring the cosmic distance ladder. At the same time, kinematic measurements – proper motions and radial velocities – provide phase-space information that is key to understanding Milky Way dynamics and external galaxies. The 1.47 billion astrometric measurements reported in Gaia Early Data Release 3 (Gaia Collaboration et al. 2020a, “EDR3”) constitute the largest astrometric dataset ever produced.

While this astrometric catalog is of extremely high quality (Lindegren et al. 2020b), a significant fraction of astrometric solutions are spurious (Fabricius et al. 2020). Spurious astrometric solutions are a distinct issue from negative parallaxes, which are an expected outcome of the normally distributed parallax measurement (Bailer-Jones 2015; Luri et al. 2018). Spurious solutions have a biased parallax value, reported together with a parallax uncertainty that is incompatible with it, due to specific failure modes (Fabricius et al. 2020). This can be a particular concern when looking at sparsely populated portions of the color-magnitude diagram, or at extreme objects, such as the nearest or fastest-moving stars (i.e., those with the largest parallaxes or proper motions, respectively). For example, naively selecting all objects with measured parallaxes greater than 10 mas (corresponding to a distance of less than 100 pc) yields a catalog in which an estimated 50% of the parallax measurements are spurious.

★ E-mail: [email protected]

Gaia EDR3 provides a number of astrometric quality parameters that can be used to exclude such spurious solutions. The “Gaia Catalogue of Nearby Stars” (Gaia Collaboration et al. 2020b, “GCNS”) uses a combination of these parameters to filter out spurious sources, obtaining a highly complete and pure subset of Gaia EDR3 sources lying within 100 pc. In this paper, we use a similar approach to extend this work to the entire Gaia EDR3 catalog.

Gaia EDR3 provides 1.47 billion astrometric measurements, each consisting of a two-dimensional position on the sky, a two-dimensional proper motion and a parallax (in addition, a 7.2M subset also has radial velocity measurements). There are many possible sources of excess noise in these astrometric measurements. Some error modes, such as unmodeled acceleration caused by an unresolved binary companion, typically introduce small residuals into the astrometric solution, which will usually be accounted for in the parallax uncertainty estimate (Lindegren et al. 2020b). However, other error modes, such as incorrect epoch cross-matches with background or spurious sources, and also close source pairs that might be partially resolved (Fabricius et al. 2020), can introduce very large residuals, scattered around the true parallax (Gaia Collaboration et al. 2020b), which are unaccounted for in the reported parallax uncertainty. Spurious astrometric solutions mainly happen in very dense parts of the sky (Fabricius et al. 2020). It is this latter class of “catastrophic” errors in the astrometric solutions (leading to errors in excess of the stated uncertainties) that we will attempt to detect.

One can try to mitigate these spurious astrometric solutions with cuts on ruwe, visibility_periods_used and G magnitude. These cuts are known to exclude many valid sources and also bias the sky coverage. The GCNS approach was to use many quality indicators, all of them astrometric, and to train a random forest on a good and a bad training sample. Since only sources with an observed parallax greater than 8 mas were considered, this was operating in an extremely high parallax SNR regime. For the “bad” examples, the sources with parallax < −8 mas were used, exploiting the fact that spurious astrometric solutions can be expected to scatter randomly around the true parallax.
For the “good” examples, sources in low-density regions of the sky were used that were crossmatched to 2MASS and showed absolute magnitudes in Gaia and 2MASS bands consistent with the main stellar populations.

When trying to classify the spurious parallax solutions for the whole Gaia EDR3 catalog, we also need to make informed decisions for sources with low parallax SNR, as they constitute 85% of the sources (for SNR < …).

This preprint is a work in progress. We have created a classifier that we believe cleanly identifies sources with spurious parallaxes, and which performs significantly better than the simple cuts advocated in the existing literature. However, we are open to suggestions for improvement – in the classifier itself, the creation of the training datasets, and ideas for validating performance. We want to allow the community to use this classifier, and welcome feedback. We are open to offering co-authorship in the final journal-submitted version of this work for any significant contribution.

All of the work in this paper, including the training of the classifier and the various validation tests, can be redone using a Python notebook and data that we have made available. The notebook can be found at https://colab.research.google.com/drive/1d4KCXiCyFzLF1RzTzRGRAnVS0Uc8x3RU?usp=sharing, while the necessary data is stored at https://keeper.mpdl.mpg.de/d/21d3582c0df94e19921d/. As we update the classifier, we will continue to keep the initial version of the astrometric fidelities (v1) in the corresponding Virtual Observatory (VO) table, but will add additional columns with updated probabilities. This will allow astronomers to redo their analysis with upcoming classifiers, and to compare results across different versions of the classifier.

Figure 1. Density distribution of the sources identified as bad for our training sample by their < −4.5σ (negative) parallaxes, shown using an Aitoff projection in Galactic coordinates (orange to black). In contrast, the regions of the sky from which we drew the good training sample, where such strongly negative parallaxes are absent, are shown in blue.

In all of the following we neglect the zero-point parallax offset (Lindegren et al. 2020a). We name the training set for spurious (valid) astrometric solutions "bad" ("good"). We construct our bad training sample by selecting sources with parallax_over_error < −4.5. We use the following query:

SELECT *
FROM gaiaedr3.gaia_source
WHERE parallax_over_error < -4.5

This returns 4.18 million sources. If all of the 1.47 billion sources in Gaia EDR3 with measured parallaxes had a true parallax of zero, and all of the measurement errors were Gaussian, then we would expect approximately 5000 stars – nearly three orders of magnitude fewer – to satisfy the above cut. Since in reality sources have positive parallaxes, the discrepancy is even larger. Thus, even with the most pessimistic assumptions, the contamination rate of our bad training sample by sources with good astrometric solutions is ∼0.1%.

Fig. 1 shows the distribution of bad sources over the sky. Dense areas such as the bulge, disc and the Magellanic Clouds are apparent, but scanning-law patterns are also visible, with regions of the sky that are scanned most often (notably the two rings along ecliptic latitude ≈ ±45°) having higher densities of spurious astrometric solutions. We conjecture that this is due to the many scans along a similar scanning angle, which increase the probability of spurious detections occurring at the same place and therefore reduce the probability of their being filtered out in the downstream processing (Torra et al. 2020), for example due to visibility_periods_used < … .

In order to construct the good training set, we select all sources in regions of the sky that do not contain any sources with significantly negative parallaxes (in the above −4.5σ sense). This means that our good and bad training examples come from disjoint regions of the sky, as can be seen in Fig. 1. The good sample does not come from the Galactic plane (i.e., |b| > …° for all good sources). In detail, we separately query each HEALPix level-6 pixel that contains no sources with parallax_over_error < −4.5. In all, 4197 out of 49152 pixels meet this condition. Our query for a single pixel is as follows:

SELECT *
FROM (
    SELECT dr3.*, tmass.tmass_oid,
           FLOOR(dr3.source_id/140737488355328) AS hpx6
    FROM gaiaedr3.gaia_source AS dr3
    JOIN gaiadr2.tmass_best_neighbour AS tmass
    USING (source_id)
    WHERE source_id BETWEEN 0 AND 562949953421311
) AS subquery
-- Query only first HEALpix of level 6
JOIN gaiadr1.tmass_original_valid AS tm
USING (tmass_oid)
-- only sources with a crossmatch to 2MASS are queried
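The relation between source_id and the level-6 HEALPix index used in the query can be made explicit with a short Python sketch. It assumes the Gaia convention that source_id carries the nested HEALPix index in its most significant bits, so that integer division by 2^47 = 140737488355328 yields the level-6 pixel, as in the FLOOR expression above. The helper names are hypothetical and are not taken from the authors' notebook.

```python
# Sketch only: helper names are hypothetical, not from the authors' notebook.
# Assumption: source_id encodes the nested HEALPix index in its upper bits,
# so source_id // 2**47 is the level-6 pixel (cf. FLOOR(source_id/140737488355328)).
SOURCE_IDS_PER_HPX6 = 2**47          # 140737488355328
N_HPX6 = 12 * 4**6                   # number of level-6 pixels on the sky


def hpx6_of(source_id):
    """Level-6 HEALPix index of a Gaia source_id."""
    return source_id // SOURCE_IDS_PER_HPX6


def source_id_range(hpx6):
    """Inclusive source_id range covered by one level-6 pixel."""
    lo = hpx6 * SOURCE_IDS_PER_HPX6
    return lo, lo + SOURCE_IDS_PER_HPX6 - 1


def where_clause(hpx6):
    """WHERE clause restricting a query to a single level-6 pixel."""
    lo, hi = source_id_range(hpx6)
    return f"WHERE source_id BETWEEN {lo} AND {hi}"


print(N_HPX6)                        # total number of level-6 pixels
print(where_clause(0))               # range of the first level-6 pixel
print(hpx6_of(562949953421311))      # pixel index of the query's upper bound
```

Under this convention the sky has 49152 level-6 pixels, matching the number quoted in the text, and a single pixel spans 2^47 consecutive source_id values. The upper bound 562949953421311 in the query equals 2^49 − 1, which under the same convention corresponds to level-6 indices 0 to 3.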

We obtain a total of 5.24 million sources from the 4197 pixels we query in this manner. The requirement that the source is also visible in 2MASS (Skrutskie et al. 2006) ensures that we do not include spurious sources. It does lower the fraction of faint blue objects (e.g. white dwarfs), but this cut does not seem to propagate into our prediction, as we do not classify using photometric indicators.

We split our good training set into two subsets: a high-SNR subset with parallax_over_error (SNR) > 4.5, and a low-SNR subset with −4.5 < SNR < 4.5. In order to further purify our good training set, we require G − RP < … . A cut on G − RP requires RP photometry, which only excludes 10k sources. The 40k sources removed by this cut are unphysically red, as can be seen in the upper panel of Fig. 2. This extremely red color sometimes coincides with nearby sources and/or a high phot_bp_rp_excess_factor. After these photometric cuts, our good training set contains 5.18 million sources.

As is apparent from the lower panel of Fig. 2, the high-SNR subsample of the good training set (in blue) resembles a low-extinction color-absolute magnitude diagram (CAMD), with many of the known subpopulations clearly distinguishable and almost no unphysical features. The small overdensity of sources between the main sequence (MS) and the white dwarf (WD) sequence is mostly due to erroneously high BP values from faint sources (Riello et al. 2020). It is also important to note that the cut in G − RP does not exclude the reddest asymptotic giant branch (AGB) stars in BP − RP. As expected, the absolute magnitudes of the low-SNR subsample scatter to much brighter absolute magnitudes, due to their poorly constrained distance moduli. The low-SNR sample consists mainly of sources with apparent G between 15 and 20 mag.

In constructing the good sample, we do not cut on any astrometric flags that we use later as potential features in the classifier. Of course, our good and bad training sets probe quite different regimes, with the good training set coming mostly from low-extinction regions and sparse fields. Nevertheless, we hope (and later verify) that in the space of astrometric parameters and quality flags, our training sets cover the relevant feature space and will allow our classifier to have discriminative power over the entire sky, as was the case for the GCNS. However, we acknowledge that it would be desirable to add valid astrometric solutions from the Galactic plane to the good training sample.

Figure 2. Color-absolute magnitude diagram (CAMD) distribution for different versions of the good training sample, showing a naive estimate of the absolute G-band magnitude as a function of two different colors, G − RP in the top panel and BP − RP in the bottom panel. The green points result from the initial query, before sources excessively red in G − RP were removed. The gray and blue points show the low-SNR and high-SNR subsamples, defined by 3 < parallax_over_error (SNR) < 4.5 and parallax_over_error (SNR) > 4.5, respectively. The plotting order is changed such that either the high-SNR subset in blue (bottom) or the low-SNR subset in grey (top) is fully visible.
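The contamination estimate for the bad training sample given earlier in this section can be checked with a few lines of Python. The sketch below only restates the assumptions from the text (all 1.47 billion sources at zero true parallax, purely Gaussian errors) and uses the numbers quoted there; the variable and function names are illustrative.

```python
import math

# Numbers quoted in the text.
N_SOURCES = 1.47e9      # Gaia EDR3 sources with measured parallaxes
N_BAD_SAMPLE = 4.18e6   # sources returned by the parallax_over_error < -4.5 query
THRESHOLD_SIGMA = 4.5


def lower_tail_fraction(n_sigma):
    """P(z < -n_sigma) for a unit normal variable z."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


# Pessimistic case from the text: zero true parallax, Gaussian errors.
expected_by_chance = N_SOURCES * lower_tail_fraction(THRESHOLD_SIGMA)
contamination = expected_by_chance / N_BAD_SAMPLE

print(f"expected by chance: {expected_by_chance:.0f}")
print(f"contamination rate: {100 * contamination:.2f}%")
```

The one-sided tail probability beyond 4.5σ is about 3.4 × 10^-6, which gives roughly 5000 chance outliers and a contamination rate of order 0.1% for the bad sample, consistent with the statement in the text.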

We use the exact same features as in the GCNS (Gaia Collaboration et al. 2020b), marked in grey in their Table A.1. For completeness, we list them here in descending order of importance according to the Gini metric reported by the GCNS: parallax_error, parallax_over_error, astrometric_sigma5d_max, pmra_error, pmdec_error, astrometric_excess_noise, ipd_gof_harmonic_amplitude, ruwe, visibility_periods_used, pmdec, pmra, ipd_frac_odd_win, ipd_frac_multi_peak and astrometric_gof_al. For parallax_over_error, which we abbreviate as SNR, both the GCNS and we use the absolute value as a feature.

We train two different classifiers, intended for use in the regimes |SNR| < 4.5 and |SNR| > 4.5. We will refer to these classifiers as the “low-SNR” and “high-SNR” classifiers, respectively. The most important difference between these two classifiers is that the high-SNR classifier uses |SNR| as a feature, while the low-SNR classifier does not. Recall that our bad training set does not include sources with |SNR| < 4.5. If we were to allow the low-SNR classifier to take |SNR| into account, it would learn that there are no bad sources with |SNR| < 4.5, which is simply an artifact of our method of identifying training data.

To train the low-SNR classifier, we use only training data with |SNR| > 4.5 (i.e., all bad training examples and the high-SNR good training examples). Excluding low-SNR training data while training the low-SNR classifier may seem counterintuitive, but our goal is to prevent the imbalance in coverage of SNR-space in the good and bad training sets from impacting our classifications in the low-SNR regime. In this regime, we end up with 3,964,264 good and 4,180,244 bad training examples.

To train the high-SNR classifier, we use the entire good and bad sets. In this regime, we end up with 5,184,555 good and 4,180,244 bad training examples.

In contrast to the work on GCNS, where spurious sources were identified using a random forest, we employ a feed-forward neural network (NN) here. Our NN model consists of 4 hidden layers, each with 64 neurons and a Rectified Linear Unit (ReLU) activation. The final layer has a single neuron with a sigmoid activation, and represents the probability that a source belongs to the good class.

We use the binary cross-entropy loss function (e.g. Goodfellow et al. 2016), which is closely related to the Kullback-Leibler divergence and which measures how much additional information would be needed to correct the classifier's prediction. Given input features $\vec{x}$, the classifier outputs a probability $P(\vec{x})$ that the source belongs to the good class. Denote the true class (the label) by $y \in \{0, 1\}$ (where $y = 1$ denotes “good”). The binary cross-entropy is then given by

$H = -(1 - y) \ln\left[1 - P(\vec{x})\right] - y \ln P(\vec{x})$. (1)

As the binary cross-entropy is a measure of missing information, it can be expressed in units of bits or nats.

We implement our model in TensorFlow 2 (Abadi et al. 2016) and Keras (Chollet et al. 2015). We train for 100 epochs with an Adam optimizer (Kingma & Ba 2014), using a learning rate of $10^{-…}$ in the first 50 epochs, and a learning rate of $10^{-…}$ in the final 50 epochs.
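To make the architecture and the loss of Eq. (1) concrete, the following NumPy sketch implements the forward pass of such a network and the mean binary cross-entropy in nats. It is not the authors' TensorFlow/Keras code: the weights are random and untrained, the He-style initialisation and the clipping constant eps are assumptions made here for illustration, and training (Adam, learning-rate schedule) is omitted.

```python
import numpy as np

N_FEATURES = 14                   # high-SNR classifier; the low-SNR one omits |SNR|
HIDDEN_UNITS = (64, 64, 64, 64)   # 4 hidden layers with 64 neurons each


def init_params(n_features, rng):
    """Random, untrained weights (He-style scaling is an assumption)."""
    sizes = (n_features, *HIDDEN_UNITS, 1)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def predict_fidelity(x, params):
    """Forward pass: ReLU hidden layers, then one sigmoid output neuron."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)          # ReLU activation
    w, b = params[-1]
    logit = (h @ w + b)[:, 0]
    return 1.0 / (1.0 + np.exp(-logit))         # probability of the good class


def binary_cross_entropy(y, p, eps=1e-7):
    """Mean of Eq. (1) in nats; eps clipping (an assumption) avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(1.0 - y) * np.log(1.0 - p) - y * np.log(p)))


rng = np.random.default_rng(0)
params = init_params(N_FEATURES, rng)
x = rng.normal(size=(5, N_FEATURES))            # stand-in for normalised features
fidelity = predict_fidelity(x, params)
n_weights = sum(w.size + b.size for w, b in params)
print(fidelity.shape, n_weights)
```

With 14 input features this architecture has 13,505 trainable parameters. A classifier that always outputs 0.5 has a binary cross-entropy of ln 2 ≈ 0.693 nats (one bit), which gives a reference scale for the loss values quoted in this paper.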
During training, we apply a dropout rate of 0.1 after each hidden layer in order to prevent over-fitting. The training examples are shuffled and the features normalised (to zero mean and unit variance) prior to the training, and we set 20% of the data aside for validation.

We train our high- and low-SNR classifiers separately. We assess our final performance by applying the low-SNR classifier to our low-SNR test dataset, and our high-SNR classifier to our high-SNR test dataset. On this combined test dataset, we achieve a loss of 0.0405 nats of entropy, with a purity of 99.7% and a completeness of 97.6%.

Figure 3. Histogram of the predicted classifier probabilities (of belonging to the good class) for sources in the test dataset, split by training label (label = good; label = bad; high SNR, label = good). As the good class contains both low- and high-SNR training data, we additionally show the classifier probabilities for the high-SNR good sources. The x-axis is the probability output by the classifier that a given source is good, which we term the “astrometric fidelity”.

We compare the astrometric fidelity predicted by our neural network to analogous quantities obtained using simpler classifiers. First, we evaluate how cleanly simple cuts on ruwe and astrometric_excess_noise separate good and bad sources in the high-SNR test dataset. Fig. 4 shows the binary cross-entropy, purity and completeness of the cut as a function of the threshold value for each feature. For ruwe, we achieve a minimum binary cross-entropy of 1.61 nats using a cut of ruwe < 1.12, corresponding to a purity of 89.5% and a completeness of 89.0%. For astrometric_excess_noise, we achieve a minimum binary cross-entropy of 1.01 nats for a cut of astrometric_excess_noise < … .

Figure 4. Performance of simple cuts on ruwe (top panel, good: ruwe < threshold) and astrometric_excess_noise (bottom panel, good: astrometric_excess_noise < threshold) in differentiating good and bad astrometric solutions. In each panel, we show how binary cross-entropy, purity and completeness depend on the threshold chosen for the cut. Our neural network predicting astrometric fidelity achieves an order of magnitude less contamination than the optimal choices for these cuts (at minimal cross-entropy).

Second, we consider a logistic-regression classifier, which is based on a linear combination of features. This model assigns a probability

$P(\mathrm{good} \mid \vec{x}) = \left[1 + e^{-(\vec{w} \cdot \vec{x} + b)}\right]^{-1}$ (2)

of belonging to the good class to each source, where $\vec{x}$ is a vector containing the features, $\vec{w}$ is a vector containing a weight for each feature, and $b$, the bias, is a scalar. We use the Adam optimizer to find the weights and bias that minimize the binary cross-entropy of the predictions. On the high-SNR test dataset, we obtain a binary cross-entropy of 0.0960 nats, a purity of 96.4% and a completeness of 97.0%. This is better than what we achieve with simple cuts, but still represents more than three times the binary cross-entropy we obtain with the full neural network model.

The full neural network is not significantly more difficult to implement than these simpler classifiers, and it achieves a far more complete and pure separation of the validation data. For these reasons, we strongly favor use of the full neural network classification over simpler alternatives.

We divide the validation for the two models into the regimes |SNR| ≥ 4.5 and |SNR| < 4.5.

We first apply our classifier to all sources in the “Gaia Catalogue of Nearby Stars” (GCNS). All 1.2 million sources with parallax > … are considered, split into |SNR| ≥ 4.5 and |SNR| < 4.5. Taking the GCNS classifications to be correct, our low-SNR model has lower performance than our high-SNR model. However, bad parallax determinations are fundamentally more difficult both to identify and to define in this regime, as the reported measurements are compatible with a very wide range of true parallaxes. Note also that we did not train our classifier on the GCNS sample, so it is not surprising that our low-SNR classifier performs worse on the GCNS dataset than on our own test dataset.
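The purity and completeness of a simple threshold cut, as used in the comparison with ruwe and astrometric_excess_noise above, can be sketched as follows. The definitions adopted here are the usual ones and are an assumption about the authors' convention: purity is the fraction of sources passing the cut that are truly good, and completeness is the fraction of truly good sources that pass the cut. The toy values and the threshold of 1.2 are made up for illustration.

```python
def cut_metrics(values, is_good, threshold):
    """Purity and completeness when `value < threshold` is predicted good."""
    true_pos = false_pos = false_neg = 0
    for value, good in zip(values, is_good):
        predicted_good = value < threshold
        if predicted_good and good:
            true_pos += 1
        elif predicted_good and not good:
            false_pos += 1
        elif good:
            false_neg += 1
    n_selected = true_pos + false_pos
    n_good = true_pos + false_neg
    purity = true_pos / n_selected if n_selected else float("nan")
    completeness = true_pos / n_good if n_good else float("nan")
    return purity, completeness


# Toy ruwe-like values with made-up labels (True = good, False = bad).
toy_ruwe = [0.9, 1.0, 1.05, 1.3, 1.1, 2.0, 3.5, 1.0]
toy_good = [True, True, True, True, False, False, False, False]

purity, completeness = cut_metrics(toy_ruwe, toy_good, threshold=1.2)
print(purity, completeness)
```

In this toy sample three good and two bad sources pass the cut and one good source fails it, giving a purity of 0.6 and a completeness of 0.75. Scanning the threshold over a grid and recording both numbers would produce curves of the kind shown in Fig. 4.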

Open and globular clusters, and the “prior” information on the distance of their likely member stars, offer a great opportunity to validate our parallax classifier. We begin with a catalog of 162,484 sources assigned to 121 clusters, coming from Bailer-Jones et al. (2018). This catalog was compiled using a method similar to that used in Gaia Collaboration et al. (2018). For each individual cluster, we calculate the variance-weighted mean parallax of its member sources, as well as the corresponding uncertainty in the mean parallax:

$\langle\hat{\varpi}\rangle = \sigma_{\langle\hat{\varpi}\rangle}^{2} \sum_i \frac{\hat{\varpi}_i}{\sigma_i^{2}}, \qquad \sigma_{\langle\hat{\varpi}\rangle} = \left(\sum_i \frac{1}{\sigma_i^{2}}\right)^{-1/2}$. (3)

We then select the 41 clusters for which $\sigma_{\langle\hat{\varpi}\rangle} / \langle\hat{\varpi}\rangle < 0.2$. For each member source of these clusters, we calculate the parallax residual and its uncertainty:

$\Delta\hat{\varpi} \equiv \hat{\varpi} - \langle\hat{\varpi}\rangle, \qquad \sigma_{\Delta\hat{\varpi}} = \left(\sigma_{\hat{\varpi}}^{2} + \sigma_{\langle\hat{\varpi}\rangle}^{2}\right)^{1/2}$. (4)

The distribution of these parallax residuals (divided by the corresponding uncertainties) is shown in Fig. 5. Our classifier labels approximately 1% of sources in these clusters as bad. The standardized residuals of sources classified as good roughly follow the expected unit normal distribution, while the distribution of standardized residuals of the sources classified as bad is shifted negative and has much longer tails.

Figure 5. Validation of our astrometric fidelity prediction using open and globular clusters. The figure shows histograms of the standardized parallax residuals, $\Delta\hat{\varpi}/\sigma_{\Delta\hat{\varpi}}$, for good and bad sources, compared with a unit normal distribution N(0, 1). The true parallaxes are estimated using the variance-weighted mean of the parallaxes in each cluster. We restrict this comparison to clusters with distances determined to 20% or better. The good sources closely follow the expected unit normal distribution, in marked contrast to the standardized residuals of the bad sources.
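The cluster test above rests on a variance-weighted mean parallax and on standardized residuals. A minimal Python sketch of both steps is given below; the function names and the three toy member parallaxes are illustrative and not taken from the authors' notebook.

```python
import math


def weighted_mean_parallax(parallaxes, errors):
    """Variance-weighted mean parallax and its uncertainty (cf. Eq. 3)."""
    weights = [1.0 / err**2 for err in errors]
    weight_sum = sum(weights)
    mean = sum(w * plx for w, plx in zip(weights, parallaxes)) / weight_sum
    sigma_mean = weight_sum ** -0.5
    return mean, sigma_mean


def standardized_residual(parallax, error, mean, sigma_mean):
    """Parallax residual divided by its uncertainty (cf. Eq. 4)."""
    return (parallax - mean) / math.sqrt(error**2 + sigma_mean**2)


# Toy cluster: three members with parallaxes and uncertainties in mas.
plx = [1.0, 1.2, 0.8]
err = [0.1, 0.1, 0.1]

mean, sigma_mean = weighted_mean_parallax(plx, err)
keep_cluster = sigma_mean / mean < 0.2     # distance known to 20% or better
residuals = [standardized_residual(p, e, mean, sigma_mean) for p, e in zip(plx, err)]
print(mean, sigma_mean, keep_cluster, residuals)
```

For well-behaved astrometric solutions the standardized residuals should scatter like a unit normal distribution, which is the behaviour shown for the good sources in Fig. 5.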

The Fourth Phase of the Optical Gravitational Lensing Experiment (OGLE-IV, Udalski et al. 2015) began observing the bulge of the Milky Way in 2010. Here, we validate our classifier using sources with proper-motion measurements from both OGLE-IV (OGLE Uranus astrometry project, Udalski et al. 2021, in preparation) and Gaia EDR3. Our assumption is that objects with spurious parallax determinations in Gaia EDR3 are more likely to have spurious proper-motion determinations. This should be reflected in the proper-motion residuals between Gaia EDR3 and OGLE-IV, with sources classified as bad in Gaia EDR3 having systematically higher $\chi^2$ values in this comparison. We begin with a catalog of OGLE-IV sources with proper-motion measurements, lying in a 0.15 deg × 0.15 deg box centered on ($\alpha_{\rm J2000}$, $\delta_{\rm J2000}$) = (….761 deg, −….698 deg). Using a matching radius of 0.…″, we obtain 14125 matching Gaia EDR3 sources with measured proper motions. Our classifier labels 2288 of these sources good.

We calculate the proper-motion residuals, $\Delta\vec{\mu} \equiv \vec{\mu}_{\rm Gaia} - \vec{\mu}_{\rm OGLE}$, as well as the covariance matrix of the residuals, $C_{\Delta\vec{\mu}} = C_{\mu,{\rm Gaia}} + C_{\mu,{\rm OGLE}}$. We then calculate $\chi^2 = \Delta\vec{\mu}^{T} C_{\Delta\vec{\mu}}^{-1} \Delta\vec{\mu}$ for each source. If the uncertainties are well estimated and the residuals follow a Gaussian distribution, then the $\chi^2$ values that we obtain should follow a $\chi^2$ distribution with two degrees of freedom. However, we find that the resulting $\chi^2$ values are significantly larger, on average, than expected, both for sources labeled good and bad, indicating that Gaia EDR3 and/or OGLE-IV proper-motion uncertainties are underestimated in the Galactic Bulge. One could attempt to address this problem by inflating the uncertainties by a constant factor or by introducing a systematic error floor. However, these different methods of “correcting” the proper-motion uncertainties impact the distributions of $\chi^2$ values obtained for the good and bad sources differently, as the good sources tend to have smaller estimated proper-motion uncertainties than the bad sources. In order to avoid these difficulties, we restrict our comparison to sources in a relatively small range of estimated proper-motion uncertainties, for which we assume the true proper-motion uncertainties to be similar. In particular, we select sources in a narrow range of $|C_{\Delta\vec{\mu}}|$, the determinant of the residual covariance matrix, obtaining 1192 sources labeled good and 1978 sources labeled bad. The resulting distributions of $\chi^2$ values are displayed in Fig. 6. We find that sources labeled good by our classifier tend to have significantly lower $\chi^2$ values than those labeled bad, with the median $\chi^2$ per degree of freedom (dof) for the good subsample being 4.4, and the median $\chi^2$/dof of the bad subsample being 7.4.

Figure 6. Validation of our astrometric fidelity classification through proper motion comparison between OGLE and Gaia EDR3. Shown is the distribution of $\chi^2$/dof, based on a comparison of Gaia EDR3 and OGLE-IV proper motions in the Galactic Bulge, for sources labeled good and bad by our classifier. For ideal data, the median $\chi^2$/dof would be ∼0.69. We find that proper-motion uncertainties are underestimated for Gaia EDR3 and/or OGLE-IV, leading to larger $\chi^2$/dof values. However, for sources labeled good by our classifier, Gaia EDR3 and OGLE-IV proper motions match significantly better, as indicated by the lower median $\chi^2$/dof values (4.4 for the good subsample, vs. 7.4 for the bad subsample).

In the direction of the Large Magellanic Cloud (LMC), the vast majority of sources should be at a distance of ∼50 kpc, corresponding to a parallax of 0.02 mas. This affords us another opportunity to validate our classifications, as almost all stars labeled good in this region of the sky should have reported parallaxes consistent with 0.02 mas. We expect the bad sources to have larger than reported residuals, and to scatter equally to positive and negative parallaxes, leading to a widened distribution of reported parallaxes centered on 0.02 mas.

We query a 0.25 deg cone in Gaia EDR3, centered on Galactic coordinates ($\ell$, $b$) = (280.47 deg, −32.88 deg), obtaining 252,115 sources, which we then run through our classifier. In this densely crowded region of the sky, only 11.2% of all sources are classified as good, while only 0.7% of high-SNR sources are classified as good. In order to model the small number of Milky Way foreground stars in this field, we compare to a control field of the same apparent size with the same Galactic latitude, and longitude reflected around $\ell$ = … .

Fig. 7 shows the parallax distribution of good and bad sources with small reported errors (parallax_error < 0.2) in our LMC field. The parallax distribution of good sources is consistent with a distant population of stars with well-measured errors, plus a small foreground population of Milky Way stars at larger parallax (matching the control field). The bad sources are consistent with a distant population of stars with significantly underestimated parallax errors. Our classifier is thus clearly identifying sources with excess parallax residuals, and even in this dense field, is still cleanly identifying foreground stars.

Figure 7. Validation of our astrometric fidelity classification through the parallax distribution of good and bad sources with small parallax uncertainties ($\sigma_{\hat{\varpi}}$ < 0.2 mas) in the LMC field. The distribution of good sources is consistent with a large population of distant ($\hat{\varpi} \approx 0$) sources (approximated by a normal distribution with zero mean and a standard deviation of 0.2 mas), along with an expected population of foreground stars (matching the distribution of parallaxes in a control field). In contrast, the bad parallaxes are consistent with a distant population of stars with parallax uncertainties that are underestimated by ∼… .
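The $\chi^2$ statistic used in the OGLE-IV proper-motion comparison earlier in this section can be written out explicitly for the two-dimensional case. The sketch below is illustrative: the function name and the toy numbers are hypothetical, and the 2×2 matrix inverse is expanded by hand so that the example has no dependencies beyond the standard library.

```python
import math

DOF = 2  # two proper-motion components


def chi2_per_dof(delta_mu, cov_gaia, cov_ogle):
    """chi^2/dof of a 2D proper-motion residual, with C = C_Gaia + C_OGLE."""
    a = cov_gaia[0][0] + cov_ogle[0][0]
    b = cov_gaia[0][1] + cov_ogle[0][1]
    c = cov_gaia[1][0] + cov_ogle[1][0]
    d = cov_gaia[1][1] + cov_ogle[1][1]
    det = a * d - b * c
    dx, dy = delta_mu
    # delta^T C^-1 delta, using C^-1 = [[d, -b], [-c, a]] / det
    chi2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return chi2 / DOF


# For Gaussian residuals with correct uncertainties, chi^2 with 2 dof has
# median 2 ln 2, so the median chi^2/dof is ln 2.
IDEAL_MEDIAN_CHI2_PER_DOF = math.log(2.0)

# Toy example: residual of (1, 1) mas/yr, uncorrelated catalogue covariances.
toy = chi2_per_dof((1.0, 1.0),
                   [[0.5, 0.0], [0.0, 0.5]],
                   [[0.5, 0.0], [0.0, 0.5]])
print(toy, IDEAL_MEDIAN_CHI2_PER_DOF)
```

The ideal median of ln 2 ≈ 0.69 is the reference value quoted in the caption of Fig. 6, against which the measured medians of 4.4 (good) and 7.4 (bad) can be compared.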

The catalog of O-, B-, and A-type (OBA) stars devised by Zari et al. (2021, subm.) offers another opportunity to test our classifier with an ensemble of sources at low Galactic latitudes. Zari et al. (in preparation) select stars brighter than G = 16 mag, with Gaia EDR3 and 2MASS colors consistent with (reddened) OBA-type stars. Zari et al. do not apply any condition on the parallax error, as the sample was designed to be inclusive for spectroscopic follow-up. We run our classifier on the resulting catalog, which consists of ∼… sources. Fig. 8 shows the distribution of good (left, ∼75% of the initial sample) and bad (right) sources in the Galactic plane (|b| < …°). The distribution of sources with good astrometric solutions shows known regions of young stars and traces the spiral arm structure of the Milky Way disk, as discussed in Zari et al. (cf. their Fig. 11). The distribution of sources with bad astrometric solutions shows a ring-like feature between 2 and 3 kpc, which is physically implausible and hence presumably spurious. This is expected, as the parallax distribution of all sources in the OBA catalog peaks at around 0.3 mas (∼…

As a final approach to validation, we visually inspect the projected sky distribution and CAMDs of Gaia EDR3 sources classified as good and bad in narrow bins of (catalog-reported) parallax. We refer to the parallax bins by their corresponding nominal distances. The 100 pc sample consists of the 1.2 million sources with $\hat{\varpi}$ > …; … < $\hat{\varpi}$ < …; … < $\hat{\varpi}$ < …; the 3 kpc sample (0.333 mas < $\hat{\varpi}$ < 0.334 mas) contains 1.3 million sources, the 10 kpc sample (0.… < $\hat{\varpi}$ < 0.101 mas) contains 1.2 million sources, and the 30 kpc sample (0.… < $\hat{\varpi}$ < 0.034 mas) contains 1.4 million sources.
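The parallax-slice selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy parallax values are hypothetical, and the only bin used is the 0.333 mas to 0.334 mas slice quoted in the text. The inverse parallax is used purely as a nominal label for the slice, not as a distance estimate for individual sources.

```python
import numpy as np


def nominal_distance_kpc(parallax_mas):
    """Inverse parallax in kpc for parallaxes in mas (only meaningful for positive values)."""
    return 1.0 / np.asarray(parallax_mas, dtype=float)


def in_parallax_bin(parallax_mas, lower_mas, upper_mas):
    """Boolean mask of sources with lower_mas < parallax < upper_mas."""
    parallax_mas = np.asarray(parallax_mas, dtype=float)
    return (parallax_mas > lower_mas) & (parallax_mas < upper_mas)


# Toy catalog parallaxes in mas (made-up values for illustration).
parallax = np.array([0.3335, 0.3329, 0.1005, 12.0])

# Select the narrow slice 0.333 mas < parallax < 0.334 mas.
mask_3kpc = in_parallax_bin(parallax, 0.333, 0.334)
print(mask_3kpc, nominal_distance_kpc(parallax[mask_3kpc]))
```

Because the bins are defined on the catalog-reported parallax, sources with large parallax errors can land in a slice far from their true distance, which is what the good/bad comparison in each slice is designed to expose.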

Here we only look at sources in each parallax bin that have |SNR| ≥ 4.5. Fig. 9 shows the sky distribution of good sources in four parallax slices (from top to bottom, at nominal distances of 0.3, 1, 3 and 10 kpc). In the closer distance slices, distinct overdensities that correspond to open clusters are visible. At 3 kpc, the Milky Way's overall structure becomes clearly visible, with star-forming regions standing out. Not many sources at very large distances have high SNR, and highly extincted regions of the Galactic plane have no sources (and are therefore colored gray).

When looking at the sources classified as bad in Fig. 10, the bulge and disk dominate in all parallax bins. Even nominally nearby sources with high SNR are concentrated in the region of the sky corresponding to the bulge and disk. Interestingly, even a cut for reasonably high SNR does not remove spurious astrometric solutions, as can be seen by the large number of bad sources.

Fig. 11 shows the CAMD of high-SNR good solutions in each parallax bin. The stellar locus is well populated in each parallax bin. The unphysically large number of seemingly pre-main sequence stars (redder and brighter than the main sequence) is due to photometric excess in the RP photometry of sources in dense regions of the sky (Riello et al. 2020). Another feature that is apparent in these CAMDs is that the red clump becomes increasingly elongated with distance, due to the greater range of dust columns probed at larger distances.

For the high-SNR sources that are classified bad, we see in Fig. 12 a floor of sources near the Gaia magnitude limit for the respective parallax slices. For the 3000 pc bin there might be an indication of AGB sources wrongly classified as bad, although the observational conditions for these extreme objects might make them astrometrically resemble sources with bad solutions.

Now we inspect the result of the low-SNR model on the low-SNR validation data (|SNR| < 4.5). For the good sources in Fig. 13, we see a similarity to the high-SNR sample, though the bulge region is missing and scanning-law patterns are visible. While the structures overlap more across different distance bins (due to the lower parallax SNR), similar structures are visible at 3 and 10 kpc as in the high-SNR sample (Fig. 9). Reassuringly, the bulge is most prominent at 10 kpc, while at 30 kpc, the Magellanic Clouds are more prominent.


Figure 8. Validation of our astrometric fidelity classification through the astrophysical plausibility of the X-Y distribution of young stars in the Galactic plane. Shown is the distribution of good (left) and bad (right) OBA star sources in the Galactic plane, with the Sun located at (X, Y) = (…, …), and the Galactic center at (…, …). We have divided the plane into pixels 100 pc on a side. The color bar shows the number of sources per bin. The dashed circles have radii ranging from 1 to 5 kpc, in steps of 1 kpc. The good sources show concentrations at many known locations of young stars, and show spiral-arm-like morphology. The bad sources show a ring-like structure, exactly centered on the Sun and at the (seeming) distance of the most common parallax; clearly, a far too Ptolemaic distribution to be "real".

Fig. 14 shows the sky distribution of sources classified as bad in the low-SNR sample. Bad sources strongly outnumber the good sources in the 1 kpc slice. In every distance slice, the sky distribution of the bad sources essentially traces the highest density parts of the sky. At 100 pc the scanning law is still visible, but the sparsity is mainly due to most sources having a high SNR in this distance bin.

For the low-SNR sample we only focus on the 10 kpc bin when looking at the CAMD. For the good sources in the upper panel of Fig. 15 we can see massive main sequence stars and turn-off stars, as well as the red clump and maybe sdB stars at the very blue end. For the bad sources in the lower panel, a weak signal of all those populations seems to be present, and again some AGB stars might have been wrongly predicted as bad. Overall, most of the physical structure can be seen in the good sample, making us confident that even at low SNR our astrometric fidelity classifier is useful.
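For readers who want to reproduce a CAMD comparison of this kind on their own sample, the following Python sketch shows one possible way to compute the absolute magnitude axis and to split a sample by fidelity. It is an illustration, not the authors' plotting code: the function names and the toy values are hypothetical, the threshold of 0.5 follows the `fidelity_v1 >= 0.5` convention used for good sources in the catalog queries, and no extinction correction is applied. The absolute magnitude uses the standard relation M = m + 5 log10(parallax / mas) - 10.

```python
import numpy as np


def absolute_g_mag(phot_g_mean_mag, parallax_mas):
    """Absolute G magnitude from apparent magnitude and parallax in mas.

    Sources with non-positive parallax get NaN, since the logarithm is undefined.
    """
    g = np.atleast_1d(np.asarray(phot_g_mean_mag, dtype=float))
    plx = np.atleast_1d(np.asarray(parallax_mas, dtype=float))
    g, plx = np.broadcast_arrays(g, plx)
    abs_mag = np.full(plx.shape, np.nan)
    positive = plx > 0
    abs_mag[positive] = g[positive] + 5.0 * np.log10(plx[positive]) - 10.0
    return abs_mag


def split_by_fidelity(fidelity, threshold=0.5):
    """Return boolean masks (good, bad) for a fidelity threshold."""
    fidelity = np.asarray(fidelity, dtype=float)
    good = fidelity >= threshold
    return good, ~good


# Toy sample (made-up values): apparent G magnitude, parallax in mas, fidelity.
g_mag = np.array([15.0, 15.0, 15.0])
parallax = np.array([10.0, 1.0, -0.5])
fidelity = np.array([0.98, 0.40, 0.01])

abs_mag = absolute_g_mag(g_mag, parallax)
good, bad = split_by_fidelity(fidelity)
print(abs_mag[good], abs_mag[bad])
```

Plotting `abs_mag` against a color such as `bp_rp` separately for the two masks gives the pair of diagrams discussed above.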

Our catalog is hosted at the German Astrophysical Virtual Observatory (GAVO, https://dc.zah.uni-heidelberg.de/), in the table gedr3spur.main (https://dc.zah.uni-heidelberg.de/browse/gedr3spur/q). The simplest way to access the astrometric fidelities for a sample of stars is to crossmatch directly via a Table Access Protocol (TAP) upload join in TOPCAT (Taylor 2005). If the local table has Gaia EDR3 source_ids, one can simply query:

    SELECT src.*
    FROM gedr3spur.main AS src
    JOIN TAP_UPLOAD.t1 AS target
    -- TAP_UPLOAD.tX needs to be the table number in TOPCAT
    USING (source_id)

Because there is a 100 MB upload limit, one can increase the number of sources queried at a time by hiding all columns except source_id. GAVO hosts a light version of Gaia EDR3, containing only the most commonly used columns. One can directly query this light version of Gaia EDR3 and simultaneously crossmatch to our astrometric fidelities. For example,

    SELECT COUNT(*) AS ct,
           ROUND(parallax/parallax_error, 2) AS bin
    FROM gaia.edr3lite -- only contains the most commonly used columns
    JOIN gedr3spur.main USING (source_id)
    WHERE fidelity_v1 >= 0.5
    GROUP BY bin

returns a histogram of the parallax distribution for the 730 million good sources. Requiring fidelity_v1 < 0.5 instead returns the corresponding histogram for the … bad sources. Fig. 16 shows the distributions returned by these two queries. As expected, the bad solutions dominate the negative parallax regime. Interestingly, this is also the case for positive parallaxes in the range 0.… < $\hat{\varpi}$/mas < …. The parallax distribution of the bad sources peaks at $\hat{\varpi}$ = 0.19 mas and is almost symmetrically distributed around this point, with an excess of 35 million sources in the positive wing. This might be partly due to misclassification of sources with good astrometric solutions, but could also be due in part to spurious solutions scattering around the true parallax value (Gaia Collaboration et al. 2020b).

The parallax distribution of the good sources peaks at $\hat{\varpi}$ = 0.26 mas. For comparison, we have also plotted the distribution of observed parallaxes from 1.33 billion GeDR3mock sources (Rybizki et al. 2020), which peaks at $\hat{\varpi}$ = 0.16 mas. (We have adopted the EDR3 corrected parallax_error from the ADQL query in Gaia Collaboration et al. (2020b) and imposed the magnitude limits from table maglim_6 of GeDR3mock: https://dc.zah.uni-heidelberg.de/browse/gedr3mock/q.) The bad astrometric solutions are usually real sources that have anomalously large parallax errors. One can therefore imagine how the excess GeDR3mock sources (compared to the good EDR3 sources) could be randomly scattered to produce the distribution of the bad sources, painting a consistent picture.

Fig. 17 shows the SNR distribution for the good and bad sources in EDR3. We see a jump in the number of sources at the |SNR| = 4.5 transition between our high- and low-SNR classifiers. At SNR = 4.5, the number of bad sources increases by almost 50 %, i.e. the high-SNR classifier seems to have a higher purity. Fabricius et al. (2020) estimate the total contamination of the sources with SNR > … ∼ … < -5). Our classifier finds 12.2 million bad sources in this regime, which constitute 6 % of the 192 million sources with SNR > …

Figure 9. Distribution of good solutions for |SNR| ≥ 4.5.

Figure 10. Distribution of bad solutions for |SNR| ≥ 4.5.

This is a list of ideas that could be applied to improve results and might enter a future version of the classifier.

• correct for the parallax zero point
• use wide binaries for validation; use a wide binary sample with statistically sound parallax uncertainties for training
• make a mock test (using gedr3mock) with photometric cleaning for sample generation
• sources in the good sample HEALpix without a 2MASS crossmatch could still be used if a crossmatch to Pan-STARRS1 (Chambers et al. 2016) is found
• clean classified samples of sources that are obviously wrong and feed them into training
• use different photometric cuts
• use more restrictive cuts when acquiring the good training sample, e.g. only HEALpix with no sources of SNR < −…
• take into account the error coming from true parallax over error vs. measured parallax over error

• cut the samples at different SNR levels
• optional features which might have high predictive power:
  – matched_transits_removed
  – astrometric_params_solved (or maybe train a model for 5p and 6p solutions separately)
  – phot_proc_mode (available for slightly fewer sources than those with an astrometric solution, but more than RP, BP)
  – it might also help to add in astrometric_excess_noise / astrometric_excess_noise_sig as an estimate of the uncertainty of the excess source noise
  – distance to nearest neighbour in the catalogue
  – distance to nearest bright neighbour (e.g. < G) due to their spurious source creation (Fabricius et al. 2016)
  – phot_bp_rp_excess_factor (only available for sources with both colors)

Figure 11. CAMD for good solutions for |SNR| ≥ 4.5.

Figure 12. CAMD for bad solutions for |SNR| ≥ 4.5. … bad sources overwhelmingly have spurious parallaxes.

Figure 13. Sky distribution of good solutions for |SNR| < 4.5.

Figure 14. Sky distribution of bad solutions for |SNR| < 4.5.

Figure 15. CAMD of good (top panel) and bad (bottom panel) solutions for |SNR| < 4.5.

Figure 16. Parallax distribution for good and bad sources in Gaia EDR3 and for the mock observed parallaxes of GeDR3mock.

Figure 17. SNR (parallax_over_error) distribution for good and bad sources in Gaia EDR3. The |SNR| = 4.5 transition, where the classifiers change, is shown by dashed grey lines.

We have extended the classification of valid and spurious astrometric solutions from the Gaia Catalogue of Nearby Stars (Gaia Collaboration et al. 2020b) to all 1.47 billion sources in Gaia EDR3 with astrometry. Our training sample of spurious sources is obtained by taking all sources with parallax_over_error < -4.5. Our training sample of good astrometric solutions is obtained by taking all sources with a 2MASS crossmatch in parts of the sky where no sources with parallax_over_error < -3.5 exist. We train two neural network models, one for high parallax SNR and one for low (divided at |SNR| = 4.5), which take astrometric quality parameters from EDR3 as inputs.

Our validation shows that we outperform not only simple cuts but also logistic models that take into account a linear combination of our features. The parallaxes of our good sources are normally distributed with respect to clusters and also the LMC. Sources classified as good also have significantly lower $\chi^2$ when compared to OGLE proper motions. Sources classified as bad usually occur in high-density regions, e.g. in the bulge and disc region and in the Magellanic Clouds.
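The training-sample selection and the split between the two models summarized above can be written down compactly. The Python sketch below is a minimal illustration and not the released training code: the function names, the boolean `has_2mass` flag and the toy arrays are hypothetical, and the HEALPix index of each source is assumed to be precomputed at whatever resolution is used to define a patch of sky. Only the thresholds (-4.5, -3.5 and |SNR| = 4.5) are taken from the text.

```python
import numpy as np


def label_training_samples(healpix_id, parallax_over_error, has_2mass,
                           bad_snr=-4.5, veto_snr=-3.5):
    """Return boolean masks (good, bad) for the training samples.

    bad:  sources with parallax_over_error below bad_snr.
    good: sources with a 2MASS crossmatch that lie in HEALPix pixels
          containing no source with parallax_over_error below veto_snr.
    """
    healpix_id = np.asarray(healpix_id)
    snr = np.asarray(parallax_over_error, dtype=float)
    has_2mass = np.asarray(has_2mass, dtype=bool)

    bad = snr < bad_snr
    # Pixels that contain at least one strongly negative parallax are excluded.
    vetoed_pixels = np.unique(healpix_id[snr < veto_snr])
    good = has_2mass & ~np.isin(healpix_id, vetoed_pixels)
    return good, bad


def use_high_snr_model(parallax_over_error, snr_split=4.5):
    """True where the high-SNR model applies (|SNR| >= snr_split)."""
    return np.abs(np.asarray(parallax_over_error, dtype=float)) >= snr_split


# Toy example (made-up values): five sources in three pixels.
pixels = np.array([0, 0, 1, 1, 2])
snr = np.array([10.0, -5.0, 3.0, -1.0, 6.0])
tmass = np.array([True, True, True, False, True])

good, bad = label_training_samples(pixels, snr, tmass)
high = use_high_snr_model(snr)
print(good, bad, high)
```

Because the veto threshold (-3.5) is less negative than the bad threshold (-4.5), every bad source vetoes its own pixel, so the two training samples cannot overlap by construction.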

ACKNOWLEDGEMENTS

The authors would like to thank Douglas P. Finkbeiner and Joshua S. Speagle for helpful discussions and suggestions.

This work has made use of data from the European Space Agency (ESA) mission Gaia, processed by the Gaia Data Processing and Analysis Consortium (DPAC). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

This research or product makes use of public auxiliary data provided by ESA/Gaia/DPAC as obtained from the publicly accessible ESA Gaia SFTP.

This work was funded by the DLR (German space agency) via grant 50 QG 1403.

GG acknowledges funding from the Alexander von Humboldt Foundation, through the Sofja Kovalevskaja Award.

The OGLE project has received funding from the National Science Centre, Poland, grant MAESTRO 2014/14/A/ST9/00121 to AU.

JR will not travel anywhere by aeroplane for the purpose of promoting this paper.

Software: TOPCAT (Taylor 2005), HEALpix (Górski et al. 2005).

Data availability

The data underlying this article are available in the article and in its online supplementary material.

REFERENCES

Abadi M., et al., 2016, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (arXiv:1603.04467)
Bailer-Jones C. A. L., 2015, PASP, 127, 994
Bailer-Jones C. A. L., Rybizki J., Fouesneau M., Mantelet G., Andrae R., 2018, AJ, 156, 58
Chambers K. C., et al., 2016, arXiv e-prints, p. arXiv:1612.05560
Chollet F., et al., 2015, Keras, https://keras.io
Fabricius C., et al., 2016, A&A, 595, A3
Fabricius C., et al., 2020, arXiv e-prints, p. arXiv:2012.06242
Gaia Collaboration et al., 2018, A&A, 616, A10
Gaia Collaboration, Brown A. G. A., Vallenari A., Prusti T., de Bruijne J. H. J., Babusiaux C., Biermann M., 2020a, arXiv e-prints, p. arXiv:2012.01533
Gaia Collaboration et al., 2020b, arXiv e-prints, p. arXiv:2012.02061
Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. MIT Press
Górski K. M., Hivon E., Banday A. J., Wandelt B. D., Hansen F. K., Reinecke M., Bartelmann M., 2005, ApJ, 622, 759
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Lindegren L., et al., 2020a, arXiv e-prints, p. arXiv:2012.01742
Lindegren L., et al., 2020b, arXiv e-prints, p. arXiv:2012.03380
Luri X., et al., 2018, A&A, 616, A9
Riello M., et al., 2020, arXiv e-prints, p. arXiv:2012.01916
Rybizki J., et al., 2020, PASP, 132, 074501
Skrutskie M. F., et al., 2006, AJ, 131, 1163
Taylor M. B., 2005, in Shopbell P., Britton M., Ebert R., eds, Astronomical Society of the Pacific Conference Series Vol. 347, Astronomical Data Analysis Software and Systems XIV. p. 29
Torra F., et al., 2020, arXiv e-prints, p. arXiv:2012.06420
Udalski A., Szymański M. K., Szymański G., 2015, Acta Astron., 65, 1

This paper has been typeset from a TeX/LaTeX file prepared by the author.