Miec: A Bayesian hierarchical model for the analysis of nearby young open clusters
J. Olivares, H. Bouy, L. M. Sarro, E. Moraux, A. Berihuete, P.A.B. Galli, N. Miret-Roig
AAstronomy & Astrophysics manuscript no. Miec © ESO 2021February 25, 2021
Miec: A Bayesian hierarchical model for the analysis of nearbyyoung open clusters
J. Olivares , H. Bouy , L. M. Sarro , E. Moraux , A. Berihuete , P.A.B. Galli , and N. Miret-Roig Laboratoire d’astrophysique de Bordeaux, Univ. Bordeaux, CNRS, B18N, allée Geo ff roy Saint-Hilaire, 33615 Pessac, France.e-mail: [email protected] Depto. de Inteligencia Artificial , UNED, Juan del Rosal, 16, 28040 Madrid, Spain Univ. Grenoble Alpes, CNRS, IPAG, 38000 Grenoble, France Depto. Estadística e Investigación Operativa. Universidad de Cádiz, Avda. República Saharaui s / n, 11510 Puerto Real, Cádiz, SpainReceived ; accepted ABSTRACT
Context.
The analysis of luminosity and mass distributions of young stellar clusters is essential to understanding the star-formationprocess. However, the gas and dust left over by this process extinct the light of the newborn stars and can severely bias both the censusof cluster members and its luminosity distribution.
Aims.
We aim to develop a Bayesian methodology to infer, with minimal biases due to photometric extinction, the candidate membersand magnitude distributions of embedded young stellar clusters.
Methods.
We improve a previously published methodology and extend its application to embedded stellar clusters. We validate themethod using synthetically extincted data sets of the Pleiades cluster with varying degrees of extinction.
Results.
Our methodology can recover members from data sets extincted up to A v ∼ Conclusions.
The methodology presented here recovers, with minimal biases, the members and distributions of embedded stellarclusters from data sets with a high percentage of sources with missing values ( > Key words.
Proper motions, Methods: statistical, Galaxy: open clusters and associations: general, Galaxy: open clusters and associ-ations: individual: M45
1. Introduction
Stellar clusters are benchmarks against which the predictions ofcurrent theories of star formation and evolution can be comparedand validated. In these comparisons, the youngest clusters playan important role because they still hold the imprints of the initialconditions of the molecular cloud from which they were formed.Unfortunately, because of the low star formation e ffi ciency (be-tween 5% and 30%, McKee & Ostriker 2007; Fukushima et al.2020), most of the gas and dust of the parent molecular cloudremains in the vicinity of the newborn stars and extincts theirlight. This extinction can bias the census of candidate membersand the population parameters derived from them, particularlythe luminosity and mass distributions. For this reason, the propa-gation of the observational uncertainties and the characterizationof methodological biases are unavoidable steps in the compari-son of model predictions with observations.In the last decade, diverse methodologies have been devisedto determine the stellar census and properties of star-forming re-gions and open clusters (e.g., Sarro et al. 2014; Krone-Martins& Moitinho 2014; Gagné et al. 2014). Following the seconddata release of the Gaia mission (Gaia Collaboration et al.2018), hundreds of authors have used these high-quality mea-surements, in particular the highly discriminant parallax, to iden-tify members in stellar clusters by applying diverse machine-learning techniques. The most popular techniques are Gaussianmixture models, random forest classifiers (Breiman 2001), DB- SCAN (Ester et al. 1996), HDBSCAN (Campello et al. 2013),and density contrast using kernel density estimates. For exam-ple, Cantat-Gaudin et al. (2018) used a modified version ofUPMASK (Krone-Martins & Moitinho 2014) to identify mem-bers in 1229 clusters and Kounkel & Covey (2019) used HDB-SCAN to identify 1900 clusters and comoving groups within1 kpc. Other popular algorithms include
Clusterix (Balaguer-Núñez et al. 2020) and
ASteCA (Perren et al. 2015). While theformer is a fully nonparametric method that determines clustermembership probabilities based on proper motions, the latter isa fully automated software that obtains cluster parameters likecenter coordinates and radius, together with luminosity func-tions and membership probabilities. To the best of our knowl-edge,
ASteCA is the only membership methodology from the lit-erature that can deal with extinction, though based on theoreticalisochrones that are known to face di ffi culties in reproducing theobserved cluster photometric sequences in the low-mass domain(see the discussion in Sections 5.2 of Bouy et al. 2015 and Miret-Roig et al. 2019).Although the censuses of stellar clusters have seen tremen-dous improvements thanks to the Gaia data, the determinationof population parameters based on these censuses is far fromstraight-forward because several biases can appear (Luri et al.2018). Following the recommendations provided by the latterauthors (see their Sect. 4.3), we improve the methodology ofOlivares et al. (2018, hereafter Paper I). The new code that we
Article number, page 1 of 21 a r X i v : . [ a s t r o - ph . I M ] F e b & A proofs: manuscript no. Miec present here, which we call
Miec , is designed to simultaneouslyderive the census of members and the astrometric and photomet-ric population distributions of embedded young stellar clusters.The remainder of this paper is organized as follows. In Sect.2, we review the methodology of Paper I and introduce the im-provements, in particular those concerning the treatment of ex-tinction. Section 3 presents the synthetically extincted Pleiadesclusters that is used in Sect. 4 to validate our methodology andanalyze its biases. Finally, in Sect. 5, we present our conclusionsand perspectives.
2. Methodology
In this section, we describe the treatment of the photometric ex-tinction as an improvement to the methodology of Paper I. Westart by enumerating our assumptions and making a summary ofthe original model. Afterward, we describe the details of the twomajor improvements we present here: the treatment of extinctionand a simplified model for equal-mass binaries. Also, since thepublication of Paper I the code has been subject to a series ofminor methodological and computational improvements that aredescribed in Appendix A.
Assumption 1.
Observed values are independent across starsand Gaussian distributed. The input catalog provides the nec-essary information to reconstruct these Gaussian distributions.
Assumption 2.
The stellar cluster members share a commonorigin and therefore have similar properties but with an intrinsicdispersion. On the contrary, the field population has a heteroge-neous origin, and therefore the cluster members can be proba-bilistically disentangled from it through statistical models con-structed on features (i.e., observed values) of the astrometric andphotometric spaces. The classification quality depends on the de-gree of overlapping between cluster and field populations and onthe information provided by the features used in the models.
Assumption 3.
Given that the field population overwhelminglydominates the input catalog (typically >
98% of it), we assumethat the field model inferred from the initial list of field sourcescan remain fixed during the inference of the cluster model.
Assumption 4.
The photometric and astrometric observed val-ues of a source are independent of each other. Although incor-rect, this assumption results in a moderate complexity model,with n =
66 parameters. If we were to include the correlationsbetween the astrometric and photometric parameters, the result-ing model would be computationally intractable, with ∼
500 freeparameters. Thus, this assumption is a reasonable compromisebetween the model’s complexity and the computational time re-quired to infer it. As a corollary, we assume that the photometricextinction does not a ff ect the astrometric observables. Assumption 5.
The stellar cluster is composed of only singlestars and equal-mass binaries (hereafter EMBs). This assump-tion simplifies the treatment of the full spectra of binary mass-ratios.
Assumption 6.
The input extinction map provides the upperlimit, A v , max , to the true extinction value, A v , of each source, (i.e., A v ∈ [0 , A v , max ]). The Pleiades were known as Miec by the Aztecs (Galindo Trejo2002).
Assumption 7.
The extinction follows the law of Cardelli et al.(1989). We assume a value of R v = .
1, which corresponds to thedi ff use interstellar medium. This assumption neglects the e ff ectsof infrared excess, such as that due to the presence of protoplan-etary disks for example. The aim of Paper I was to develop, test, and characterize aBayesian hierarchical model designed to simultaneously identifymembers of nearby young open clusters and infer their popula-tion distributions (i.e., proper motions, and color and magnitudedistributions). This model proceeds as follows.The likelihood of a source is a mixture model with two com-ponents: the cluster likelihood, L c , and the field likelihood, L f ;see Assumption 2. The weight or amplitude of the field com-ponent is parametrized as π . Because there are only two com-ponents, the cluster likelihood has amplitude 1- π . Each of thesecomponents has its own set of parameters, represented by θ c forthe cluster, and θ f for the field. These parameters together with π , make the full set of model parameters: Θ = { π, θ c , θ f } .We define the N -sources data set as: D = { d i } Ni = = { ˆ µ i , ˆ Σ i } Ni = ,with d i the set of observables of the i -th source, which comprisesthe mean, ˆ µ i , and the covariance matrix ˆ Σ i of the observed quan-tities. Under Assumption 1, the uncertainties are Gaussian andtherefore the full likelihood can be written as L ( D | Θ ) ≡ N (cid:89) i = L ( d i | Θ ) = N (cid:89) i = π · L f ( d i | θ f ) + (1 − π ) · L c ( d i | θ c ) , (1)which is equivalent to Eq. 3 of Paper I.Assumption 4 allows us to factorize the likelihood of the i -thsource into the astrometric and photometric parts as L ( d i | Θ ) = π · L Af ( d i | θ Af ) · L Pf ( d i | θ Pf ) + (1 − π ) · L Ac ( d i | θ Ac ) · L Pc ( d i | θ Pc ) , (2)where the superscripts A and P stand for astrometry and photom-etry, respectively.The cluster photometry is modeled as a mixture of two com-ponents (see Assumption 5): single-stars, denoted with subscript s , and EMB, denoted with subscript b . This mixture’s weightsare given by π s for the single-star population and 1- π s for theEMB population. With these definitions, the likelihood of the i -th source can be written as L ( d i | Θ ) = π · L Af ( d i | θ Af ) · L Pf ( d i | θ Pf ) + (1 − π ) · L Ac ( d i | θ Ac ) · (cid:104) π s L Pc , s ( d i | θ Pc , s ) + (1 − π s ) L Pc , b ( d i | θ Pc , b ) (cid:105) . (3)The photometric likelihoods of both single and EMB pop-ulations are parametrized with the source true color index, CI .However, as the latter is a source-level parameter we marginal-ize it with the aid of a population-level prior, p ( CI | φ ). Theprior parameters, φ , are included in the set of cluster parameters(i.e., φ ∈ θ c ). The marginalization of the CI parameter is done asfollows, Article number, page 2 of 21. Olivares et al.: Miec L ( d i | Θ ) = (cid:90) L ( d i | CI , Θ ) · p ( CI | φ ) · d CI = π · L Af ( d i | θ Af ) · L Pf ( d i | θ Pf ) + (1 − π ) · L Ac ( d i | θ Ac ) · (cid:34) π s (cid:90) L Pc , s ( d i | CI , θ Pc , s ) · p ( CI | φ ) · d CI + (1 − π s ) · (cid:90) L Pc , b ( d i | CI , θ Pc , b ) · p ( CI | φ ) · d CI (cid:35) . (4)The explicit definitions of all terms in the previous equationcan be found in Paper I, together with the prior distribution fortheir parameters. Our treatment of the photometric extinction follows a similar ap-proach to that taken in Paper I for the marginalization of the truecolor index of each source. We include the true extinction, A v ,of each source as a model parameter that we marginalize withthe aid of a uniform prior, p ( A v | ψ ), with ψ its parameters, thesupport of which is provided by the extinction map.Thus, in the extinction module, the likelihood of the i -thsource given the original model parameters, Θ , and the new ex-tinction parameters, ψ , is L ( d i | Θ , ψ ) ≡ (cid:90) L ( d i , A v | Θ ) · d A v = (cid:90) L ( d i | A v , Θ ) · p ( A v | ψ ) · d A v ∝ (cid:90) A v , max L ( d i | A v , Θ ) · d A v , (5)where the term L ( d i | A v , Θ ) is given by an expression similar tothat of Eq. 4. However, in the latter, the true photometry is red-dened according to the assumed extinction law (see Assumption7) using the values of A v in the support of the prior p ( A v | ψ ).These integrals, one for each source, are computed numericallyas in Paper I.We note that the marginalization of the true extinction of thesource di ff ers from that of its true color index in the sense that theprior parameters ψ (i.e., the lower and upper bounds of the uni-form prior) are not inferred by the model but are provided by theextinction map (see Assumption 6). Although using these lim-its rather than inferring them diminishes the complexity of themodel, the marginalization integrals severely increase the com-putation time. The latter grows linearly with the number of eval-uation steps of the integral, with roughly every step taking thesame amount of time as one likelihood evaluation of our basicnonextinction module. We heuristically set the grid steps to 20,which proved to be a good compromise between the accuracy ofthe integral and the total computing time. Stellar clusters are known to host a non-negligible fraction ofstars coupled in binaries or multiple systems. Observationally,these multiple systems are seen as a spread of the cluster pho-tometric sequence in color–magnitude diagrams. In particular,EMBs are located in a usually sharper sequence 0.75 magnitudesbrighter than the single star sequence. In Paper I, the EMBs were modeled with their own parallel photometric sequence and astro-metric Gaussian mixture model (GMM). The new EMB moduleallows us to model the EMBs with a parallel photometric se-quence, but it forces the astrometric GMM to be the same asthat of the single stars. This simplification reduces the numberof astrometric parameters by half and therefore also the modelcomplexity, which helps to decrease the computation time of anexpensive model like the extinction one. Although the simplifi-cation introduced by this module implies discarding ∼
2% of thecandidate members recovered with the original EMB module, itnonetheless recovers the same fraction of EMBs in the clusterpopulation (for more details see Appendix B).
3. Data set
We validate our methodological and computational improve-ments on the Pleiades data set used in Paper I. This data setcorresponds to the 10 most probable candidate members of thePleiades DANCe catalog (Bouy et al. 2015). We use the samerepresentation space as in Paper I, which consists of the propermotions, the color index i − K , and the photometric bands Y , J , H ,and K . Table 1 gives a summary of this data set, which we here-after refer to as the original one.As described in Paper I, the color index in the representa-tion space must be carefully chosen to minimize the number ofmissing values, maximize the wavelength interval, and fulfill theinjective requirement in the splines that model the photometricmagnitudes as functions of the color index. The previous re-quirements prevent us from using, for example, the colors Y − J or J − K , which result in fewer missing values than the i − K color, but have narrower wavelength intervals and produce verti-cal photometric sequences that cannot be modeled with injectivefunctions.To validate the new modules of our methodology (see Sects.2.2 and 2.3) we generate synthetically extincted Pleiades datasets by reddening the original photometry with synthetic extinc-tion maps (more below). Each of the latter provides the upperlimit of the source extinction, A v , max (see Assumption 6). If thesource is a candidate member (according to the results of PaperI), then its true extinction is uniformly sampled between zeroand the upper limit provided by the map. Given the proximity ofthe Pleiades cluster ( (cid:36) = . ± .
08 mas, Galli et al. 2017), thepopulation of foreground contaminants is negligible ( (cid:46)
Gaia
DR2 parallaxes for sources inthe same sky region of our catalog). Thus, if the source is classi-fied as a field contaminant according to Paper I, then we use theupper limit provided by the map as its true extinction. Finally, thephotometry is reddened using the true A v of each source, togetherwith the extinction law of Cardelli et al. (1989) with R v = . ff ective wavelength of each photometric passband (i.e., i , Y , J , H , K = { } ). Synthetic extinction maps
We need synthetic extinction maps that produce A v , max probabil-ity distributions with the following properties. First, they mustresemble those of real cases without being so specific that thevalidation loses generality. Second, they must allow us to con-trol the number of sources within a given interval of extinc-tion, which is needed to ensure good-number statistics over theprobed extinction values. We restrain ourselves from using theextinction maps of real regions because in that case, our vali-dation would lose generality because the resulting A v , max prob- Article number, page 3 of 21 & A proofs: manuscript no. Miec
Table 1.
Properties of the original data set.
Min. Max. Missing [%]Observablespmra [mas · yr − ] -100.0 99.9 0.0pmdec [mas · yr − ] -100.0 100.0 0.0i-K [mag] 0.9 8.0 96.4Y [mag] 8.3 22.2 22.2J [mag] 4.0 20.6 7.2H [mag] 3.0 20.3 7.5K [mag] 2.6 21.0 4.9ability distributions would be highly dependent on the spatialdistributions and extinction maps of those particular regions.We decided to generate synthetic maps using GMMs on thespace of synthetic sky coordinates. On the one hand, the GMMshave enough flexibility (see Kuhn & Feigelson 2017, for a reviewof the use of mixture models in astronomy) to mimic the possi-bly cloudy structures present in nearby star-forming regions, buton the other, the synthetic sky coordinates in combination withthe parameters of the GMM allow us to control the fraction ofsources above a certain value of extinction (i.e., the shape of the A v , max cumulative distribution). We note that our aim is to pro-duce realistic and generic A v , max probability distributions ratherthan realistic 2D extinction maps. Furthermore, as the sky coor-dinates are not part of the representation space of our methodol-ogy, they do not influence our results beyond setting the A v , max of the sources and allowing us to control their numbers within agiven extinction interval. Therefore, we define the A v , max of ourextinction maps as A v , max ( α, δ ) = k (cid:88) i = w i C { µ i , Σ i } · N ( α, δ | µ i , Σ i ) , (6)where α and δ are the synthetic sky coordinates, k is the num-ber of Gaussian components, and w , µ , and Σ are the weights,means, and diagonal covariance matrices of the GMM, respec-tively. The factor C { µ i , Σ i } is a constant that is equal to the densityof the Gaussian distribution at its mean; it allows us to set w i as the maximal extinction for the i th Gaussian component. Thesynthetic sky coordinates of the sources follow a bivariate uni-form distribution, with α, δ in the interval [-10 ◦ ,10 ◦ ] in R.A. andDec.After testing several configurations of parameter values forEq. 6, we found that GMMs with three components and the fol-lowing parameter values produce what we consider are repre-sentative case scenarios for the application of our methodology(more below).MAP : w = { , , } , µ = { [5 , , [5 , − , [ − , − } , Σ = { [3 , , [2 , , [1 , } , MAP : w = { , , } , µ = { [0 , , [5 , − , [ − , − } , Σ = { [20 , , [15 , , [15 , } , MAP : w = { , , } , µ = { [5 , , [5 , − , [ − , − } , Σ = { [12 , , [8 , , [4 , } . With the previous maps, we reddened the original data setand obtained three synthetically extincted data sets. In the fol-lowing, we refer to these data sets as MAP , MAP , and MAP according to the extinction map used to redden each of them.
10 5 0 5 10R.A. [deg]10.07.55.02.50.02.55.07.510.0 D e c . [ d e g ] A v , m a x K FieldClusterField reddenedCluster reddened
Fig. 1.
Top panel: Synthetic 2D extinction map, MAP . Bottom panel:Color–magnitude diagram showing the original photometry of the 10 most probable members from the Pleiades DANCe catalog, the red-dened one, and the extinction vectors as obtained from the map shownin the upper panel. As an example, Fig. 1 shows the extinction MAP as well asthe original, and reddened photometry. In Fig. 2 we show the A v cumulative distributions of the open cluster Ruprecht 147 (Oli-vares et al. 2019), the star-forming regions of Corona-Australis(Galli et al. 2020), Upper Scorpius (Miret-Roig et al. in prepara-tion), and Taurus (Olivares et al. in preparation), as well as thoseof the A v , max obtained by our synthetic extinction maps. The A v values of the members in the previous star-forming regions andstellar clusters were obtained by transforming their Gaia
DR2 a_g_val using the relation A G / A v = . ± .
005 (Wang &Chen 2019).As can be observed in Fig. 2, our synthetic extinction mapsreproduce the A v , max distribution of low-, intermediate-, andhigh-extinction regions. The extinction map MAP mimics alow-extinction region, with extinction values similar to those ex-pected in old open clusters that have expelled the majority ofthe remnant gas and dust. The extinction map MAP mimics anintermediate-extinction region similar to those of Taurus and Up-per Scorpius. We notice that the three Gaussian components ofour synthetic extinction maps are enough to reproduce the over- Article number, page 4 of 21. Olivares et al.: Miec
MAP MAP MAP Fig. 2.
Cumulative distribution of A v for the members of the open clusterRuprecht 147, and the star-forming regions of Corona Australis, UpperScorpius, and Taurus. The A v , max cumulative distributions yielded by oursynthetic extinction maps for the Pleiades members found in Paper I arealso shown with dashed lines. all trend of extinction present in the Taurus region. Althoughthe use of more Gaussian components would certainly result ina closer match to the observed extinction distribution, as men-tioned above, it would imply a loss of generality in the valida-tion. Finally, the extinction map MAP , mimics a high-extinctionregion with values larger than those observed in Taurus or UpperScorpius. We created this extinction map to validate our method-ology beyond the extinction values of candidate members foundwith current membership methodologies that, although takinginto account the photometry, are unable to deal with extinction(see e.g., Sarro et al. 2014; Krone-Martins & Moitinho 2014;Olivares et al. 2019).
4. Validation
We ran our new extinction and simplified EMB modules (here-after extinction + EMB module; see Sect. 2) on the syntheticallyreddened MAP , , data sets (see Sect. 3). Each of these runstakes ∼
10 days on a computing server with 45 CPUs at 2.1GHz and 8 GPUs NVIDIA GForce. We also ran our improvedmethodology on the original nonextincted data set using the sim-plified EMB module but without the extinction one (hereafterbasic + EMB module). This latter solution is used as a bench-mark against which the results of the extinction + EMB moduleare compared. The details of this benchmark solution and thevalidation of the EMB module are given in Appendix B.In the following, we first validate the capability of ourmethodology to recover the extincted members. Afterward, weassess the biases that extinction introduces in the inferred popu-lation parameters and the magnitude distributions in particular.
We measure the quality of our classifier using the confusionmatrices that result from the inferred membership probabilitiesand the optimum probability threshold for classification. As de-scribed in Paper I, our methodology returns, for each source inthe data set and for each combination of model parameters Θ , two membership probabilities: one of belonging to the clusterand another of being an EMB. Therefore, the MCMC samplingof the posterior distribution of the model parameters results indistributions for the cluster and EMB membership probabilities.As in Paper I, here we classify a source as a candidate mem-ber if 86% of its membership probability distribution is abovethe optimum probability threshold p t (i.e., if P µ + P σ > p t ,where the first and second terms on the left-hand side are themean and standard deviation of the distribution of membershipprobability). The optimum probability threshold, p t , correspondsto the probability threshold that maximizes the accuracy of theclassifier (ACC, see Eq. 10 of Paper I) when applied over datawhere the true classes are known (i.e., cluster or field popula-tions, single-stars or EMB). We notice that this classificationthreshold is not part of the model but is found a posteriori basedon a specific criterion: the maximum classification accuracy. Wefind the optimum probability threshold, p t , with three di ff erentstrategies.The first strategy uses the learned cluster and field models togenerate synthetic data where the true classes are known, thenit runs the method over these data and computes the optimumprobability threshold. The second and third strategies instead usethe class labels of the original nonextincted data set. While thesecond strategy uses all sources independently of their extinctionvalue, the third one splits the data into bins of one magnitude ofextinction and computes an optimum probability threshold foreach bin. Appendix C describes the details of these three strate-gies and the results they produce when applied to the syntheticdata sets MAP , , . The three strategies are similarly good at re-covering the cluster members, but the third one results in cleanersamples due to a reduced contamination rate.When used as a classifier, our extinction methodology re-covers candidate members from datasets with extinction up to A v , max ∼ ), the contamination rate measured with ourthird strategy on the subset of sources with observed color in-dex is (cid:46)
4% and increases to ∼
20% for the subset of sourceswith a missing color index. The detailed analysis of classifi-cation thresholds by bins of extinction allowed us to improvethe true positive rate and contamination rate to values that arebetter than 83% and 9%, respectively. Despite this success, ourmethod faces two problems: an increased contamination rate dueto sources with missing values and a reduced true positive ratein sources with high extinction values. In the next section, weanalyze the impact that these missing-value contaminants haveon the population distributions.
In this section, we analyze the ability of our extinction method-ology to recover the parameters of the benchmark solution, par-ticularly those of the magnitude distributions. As the populationdistributions are obtained directly from the inferred parametersof the model, they are independent of the probability thresholdsand strategies to obtain them. We start by comparing the pos-terior distribution of the model parameters, the resulting propermotions, and color index distributions, and finally we analyzethe accuracy of the inferred magnitude distributions.In the models inferred from the MAP , , data sets, between15% and 33% of the parameters have a maximum-a-posteriori Article number, page 5 of 21 & A proofs: manuscript no. Miec value that is discrepant, beyond 3 σ (with σ being the squareroot of the sum of the variances of the parameter posterior dis-tributions), from the value of the benchmark solution. The mostdiscrepant parameters are those associated with the intrinsic dis-persion of the photometric sequence and with the fractions andcovariance matrices of the proper motions GMM. The previousdiscrepancies have their origin in the contaminants introduced inthe extinction model, as explained below.In our methodology, all sources in the data set contribute tothe cluster model proportionally to their membership probabil-ity. Consequently, the contaminants of the model, which are bydefinition the field population, contribute to the broadening ofboth the photometric sequence and the distribution of proper mo-tions (i.e., the most discrepant parameters), and therefore to theshifting of their parameters. As described in the previous sectionand Appendix C, the majority of the contaminants have their ori-gin in a missing color index. Therefore, the broadening of themodel and the shifting of its parameters results from the lackof constraining information produced by the high percentage ofsources with missing values (see Table 1).Figures 3 to 5 show the marginal posterior distributions ofthe proper motions and color index inferred from the MAP , , data sets. These distributions are constructed from samples ofthe posterior distribution of the model parameters, and each lineshows a single sample of the MCMC. For comparison purposes,the figures also display samples of the posterior distributions ofthe benchmark solution (orange lines) and the prior distribution(green lines).As can be observed from these figures, the model learnedfrom the MAP data set is almost indistinguishable from thebenchmark solution in both proper motions and color index, de-spite having a 15% of discrepant parameters, as discussed above.On the contrary, the models learned from the MAP , data setsshow wider wings in the proper motions distributions. Thesewings result from the contaminants shown as the dispersed pop-ulation of false positives in the upper panels of Figs. C.2 andC.3.The lower panels of Figs. 3 to 5 show a gradual trend in thediscrepancy between the color index distribution inferred fromthe MAP , , data sets and that of the benchmark solution, withthe discrepancy increasing with the extinction value. The fea-tures in the color distribution are gradually smoothed until, atthe maximum extinction of A v ∼ dataset), the features at i-K ∼ ff ects are a direct consequence ofthe increased number of contaminants with increasing extinctionvalue. Despite these e ff ects, the bulk of the proper motions andcolor index distributions are recovered without significant shifts,but only with a general broadening.The precision of the inferred population distributions, whichis shown as the width of the posterior samples in Figs. 3 to 5,remains similar to that of the benchmark solution. The only ex-ception is the blue side of the color index distribution inferredfrom the MAP data set, between 0.8 mag and 3 mag, where thelines are less jammed. This loss of precision is a direct conse-quence of the lack of information resulting from a large numberof missing values in the Y band (see Table 1, and the black andblue lines in the upper panel of Fig. B.4) in combination with thehigh values of extinction.The previous analysis indicates that the model inferred withour extinction methodology shows a broadening that increaseswith increasing values of extinction. In the MAP data set, thisbroadening is negligible. In the MAP , it is only observed inthe distribution of proper motions, and in the MAP data set,
10 0 10 20 30 40pmra [mas/yr]0.0000.0250.0500.0750.1000.1250.1500.175 D e n s i t y PriorExtinctionBenchmark D e n s i t y PriorExtinctionBenchmark
Fig. 3.
Comparison of the distributions of proper motions (pmra, upperpanel) and color index (lower panel) inferred from the MAP data set(maroon lines) and those of the benchmark solution (orange lines). Theprior distribution is shown with green lines. it is observed in both the proper motions and the color indexdistribution. Magnitude distributions
In our methodology, the magnitude distributions are obtainedby transforming the color index distribution into magnitude dis-tributions by means of the color-index to magnitude relations,which are modeled as spline functions whose coe ffi cients arealso inferred from the data (see Paper I for more details). Figures Article number, page 6 of 21. Olivares et al.: Miec
10 0 10 20 30 40pmra [mas/yr]0.0000.0250.0500.0750.1000.1250.1500.175 D e n s i t y PriorExtinctionBenchmark D e n s i t y PriorExtinctionBenchmark
Fig. 4.
Same as Fig. 3 but for the MAP data set. , , data sets. For comparison purposes, the figures also ex-hibit the magnitude distributions of the benchmark solution (or-ange lines), as well as the prior (green lines). As can be observedfrom Fig. 6, the magnitude distributions inferred from the MAP data set are almost identical to those of the benchmark solution,except in the region of K ∼ data set underestimate the density in the regionsof K ∼ ∼
15 mag, and overestimate it at K ∼ ∼
10 0 10 20 30 40pmra [mas/yr]0.0000.0250.0500.0750.1000.1250.1500.175 D e n s i t y PriorExtinctionBenchmark D e n s i t y PriorExtinctionBenchmark
Fig. 5.
Same as Fig. 3 but for the MAP data set. tude distributions inferred from the MAP data set show all theprevious e ff ects, together with a loss of precision in the brightend, from 8 mag to 11 mag in the Y band. This latter e ff ect isa direct consequence of the loss of precision in the color indexdistribution discussed in the previous section.This analysis shows that inferred magnitude distributions ex-hibit artifacts whose magnitude increases with the extinctionvalue of the data set. These artifacts are the reflection of thesmoothing of the color index distribution (see Sect. 4.2) ampli-fied by the shifts in the coe ffi cients of the spline functions thatmodel the color index to magnitude relations. As discussed inthe previous section, the smoothing and shifting of the modelparameters have their origin in the contaminants introduced into Article number, page 7 of 21 & A proofs: manuscript no. Miec PriorExtinctionBenchmark
Fig. 6.
Comparison of the Y, J, H, and K magnitude distributions learntfrom the original (orange lines) and MAP (maroon lines) data sets. Theprior distribution is shown with green lines. PriorExtinctionBenchmark
Fig. 7.
Same as Fig. 6 but for the MAP data set. the model. Although we developed strategies to minimize thesecontaminants when our methodology is used as a classifier, theirinfluence in the model parameters still impacts the inferred mag-nitude distributions. Nonetheless, the magnitude distributions in-ferred from data sets with up to A v ∼ PriorExtinctionBenchmark
Fig. 8.
Same as Fig. 6 but for the MAP data set. PriorExtinctionBenchmark
Fig. 9.
Same as Fig. 8 but using an informative astrometric prior (seetext). tribution of the K band, the most infrared of our bands, shows nosystematic errors.Because the majority of the contaminants come from sourceswith missing values (see Sect. 4.1 and Appendix C) the mini-mization of their impact in the population distributions and inparticular in the magnitude distributions requires constraininginformation able to counterbalance the lack of it resulting fromthe high percentage of sources with missing values. The only
Article number, page 8 of 21. Olivares et al.: Miec sources of information in our model are the data set, the extinc-tion map, and the prior distribution. From the statistical model-ing point of view, the information content of both the data setand the extinction map is fixed and cannot be improved until thearrival of new and more constraining data.Luckily, our Bayesian methodology provides a straightfor-ward solution to the low-information problem: the prior distri-bution. The lack of discriminant information originating fromthe sources with missing values can be counterbalanced by theinformation content of the prior. Thus, taking advantage of theBayesian formalism, we provide the model with more constrain-ing prior distributions. To avoid biasing the magnitude distribu-tions by the prior itself, we only modify the prior distribution ofthe astrometric parameters.Given that the results of the MAP data set are the mosta ff ected by the lack of constraining information, we reanalyzethese data with the following modifications. We modify thehyper-parameter α sgl of the Dirichlet distribution acting as priorfor the proper motion GMM fractions (i.e., weights). Until now,we have used the weakly informative value of α sgl = [5 , , , α sgl = [50 , , , β sgl = [2 , , ,
2] mas · yr − (see Appendix A.6), but now wereplace it by β sgl = [4 . , . , . , .
8] mas · yr − , which corre-spond to the values of the standard deviations found after fittinga GMM to the candidate members reported in Paper I.The results of our analysis of the MAP data set with a moreconstraining astrometric prior show that the systematic errors inthe magnitude distributions are considerably reduced. Figure 9shows the magnitude distributions recovered by our methodol-ogy using the constraining astrometric priors described above.As shown by this figure, the systematic shift in the J band mag-nitude distribution is now removed. In addition, the dispersionpresent at the bright end of the distributions, shown as the dis-persed lines between magnitudes 8 and 11 in the K band ofFig. 8, is now considerably reduced, as shown by the morecrowded lines of Fig. 9 in the same magnitude interval. Finally,although the over-density at K ∼
15 mag persists, the discrepancyis nonetheless negligible. On the other hand, the restrictive astro-metric prior has negative consequences when our methodologyis used as a classifier. Although the contamination rate remainslow at 8.7%, the true positive rate drops by ∼
20% to a value of62.5% (compare to the value in the last row of Table C.3).In this section, we shown that the population distributionsinferred from data sets with extinction values up to A v ∼ A v ∼ > ∼
20% drop in thetrue positive rate, which points to the importance of acquiringdata sets with fully observed photometric features, particularlyin highly extincted regions with A v (cid:38)
5. Conclusion
In this work, we improve the methodology of Paper I by extend-ing its application to embedded stellar clusters and validate it us- ing synthetically extincted data sets of the Pleiades cluster. In theinterval of extinctions analyzed here ( A v ∈ [0 ,
6] mag), our newmethodology delivers lists of candidate members with true pos-itive and contamination rates greater than 83% and lower than9%, respectively. This low value of the contamination rate wasachieved thanks to a strategy that obtains the probability clas-sification threshold through a detailed analysis of the classifierquality at di ff erent extinction bins. The variations at these binscan reach up to 16% in the contamination rate and down to 70%in the true positive rate. The application of this strategy to realstellar clusters will require the generation of carefully craftedsynthetically extincted data sets that mimic the properties of thereal one.The inferred magnitude distributions display artifacts whosemagnitude increases with the extinction of the data set. For ex-tinction values up to A v ∼ A v ∼ Article number, page 9 of 21 & A proofs: manuscript no. Miec
Concerning the prior distributions, we use weakly informa-tive ones. However, if neither the information content of the dataset nor that of the extinction map are enough to mitigate the ef-fects of possible contaminants, then more restricting prior distri-butions can be used. To minimize possible biases introduced inthe magnitude distributions by the prior itself, the constraininginformation must be included in the astrometric prior. Luckily,the
Gaia mission is a remarkable source for this constraininginformation because the cluster members found with only theastrometric features can be used to set stronger astrometric priordistributions.
Acknowledgements.
We thank the anonymous referee for the constructive com-ments. This research has received funding from the European Research Council(ERC) under the European Union’s Horizon 2020 research and innovation pro-gram (grant agreement No 682903, P.I. H. Bouy), and from the French Statein the framework of the ”Investments for the future” Program, IdEx Bordeaux,reference ANR-10-IDEX-03-02. The figures presented here were created usingMatplotlib (Hunter 2007).
References
Balaguer-Núñez, L., López del Fresno, M., Solano, E., et al. 2020, MNRAS, 492,5811Bouy, H., Bertin, E., Sarro, L. M., et al. 2015, A&A, 577, A148Bovy, J., Hogg, D. W., & Roweis, S. T. 2011, Annals of Applied Statistics, 5Breiman, L. 2001, Machine Learning, 45, 5Campello, R. J. G. B., Moulavi, D., & Sander, J. 2013, in Advances in Knowl-edge Discovery and Data Mining, ed. J. Pei, V. S. Tseng, L. Cao, H. Motoda,& G. Xu (Berlin, Heidelberg: Springer Berlin Heidelberg), 160–172Cantat-Gaudin, T., Jordi, C., Vallenari, A., et al. 2018, A&A, 618, A93Cardelli, J. A., Clayton, G. C., & Mathis, J. S. 1989, ApJ, 345, 245Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., & Liu, J. 2013, Psychome-trika, 78, 685Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. 1996, in (AAAI Press), 226–231Fukushima, H., Yajima, H., Sugimura, K., et al. 2020, arXiv e-prints,arXiv:2005.13401Gagné, J., Lafrenière, D., Doyon, R., Malo, L., & Artigau, É. 2014, ApJ, 783,121Gaia Collaboration, Brown, A. G. A., Vallenari, A., et al. 2018, A&A, 616, A1Galindo Trejo, J. 2002, in Astronomical Society of the Pacific Conference Se-ries, Vol. 282, Galaxies: the Third Dimension, ed. M. Rosada, L. Binette, &L. Arias, 3Galli, P. A. B., Bouy, H., Olivares, J., et al. 2020, A&A, 634, A98Galli, P. A. B., Moraux, E., Bouy, H., et al. 2017, A&A, 598, A48Gelman, A. 2006, Bayesian Anal., 1, 515Huang, A. & Wand, M. P. 2013, Bayesian Anal., 8, 439Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90Kounkel, M. & Covey, K. 2019, AJ, 158, 122Krone-Martins, A. & Moitinho, A. 2014, A&A, 561, A57Kuhn, M. A. & Feigelson, E. D. 2017, arXiv e-prints, arXiv:1711.11101Lam, S. K., Pitrou, A., & Seibert, S. 2015, in Proceedings of the Second Work-shop on the LLVM Compiler Infrastructure in HPC, LLVM ’15 (New York,NY, USA: Association for Computing Machinery)Luri, X., Brown, A. G. A., Sarro, L. M., et al. 2018, A&A, 616, A9McKee, C. F. & Ostriker, E. C. 2007, ARA&A, 45, 565Miret-Roig, N., Bouy, H., Olivares, J., et al. 2019, A&A, 631, A57Olivares, J., Bouy, H., Sarro, L. M., et al. 2019, A&A, 625, A115Olivares, J., Sarro, L. M., Moraux, E., et al. 2018, A&A, 617, A15Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of MachineLearning Research, 12, 2825Perren, G. I., Vázquez, R. A., & Piatti, A. E. 2015, A&A, 576, A6Sarro, L. M., Bouy, H., Berihuete, A., et al. 2014, A&A, 563, A45Wang, S. & Chen, X. 2019, ApJ, 877, 116
Article number, page 10 of 21. Olivares et al.: Miec
Appendix A: Methodological and computationalimprovements
This Appendix describes the methodological and computationalimprovements to the original methodology of Paper I. Here, wealso provide the details of the hyper-parameters of our model.
Appendix A.1: GPU computation
Our methodology is computationally expensive. The results pre-sented in Paper I took 30 days to run in a computing server with80 CPUs at 3.5 GHz each. To improve this, we translated theCPU computation of the likelihood into GPU code using the numba compiler (Lam et al. 2015). Using the same data set asin Paper I and a computing machine with eight GPUs NvidiaGForce RTX 2080i, our improved computation takes ∼
10 hr torun, which represents a speed-up factor of 72.
Appendix A.2: Simplified intrinsic photometric dispersion
In Paper I (see Sect. 2.1.2), the cluster’s intrinsic photometricdispersion was modeled with a multivariate Gaussian distribu-tion with a full covariance matrix. This distribution was evalu-ated in a color index grid with n =
300 steps, which resulted ina cluster photometric sequence with a hosepipe shape. The samehosepipe shape can be recovered with a simpler photometricmodel in which the o ff -diagonal terms of the covariance matrixare neglected (e.g., in a color–magnitude diagram, the hosepipeshape of the cluster sequence can be obtained by overlapping ei-ther ellipses or circles). Here, we reduce the model complexityby neglecting the o ff -diagonal terms of the covariance matrix.However, to obtain the same value of the original photometriclikelihood within a tolerance of 10 − , we have to increase thenumber of grid evaluations to n = D ph · ( D ph − /
2, where D ph is the photometric dimension of the representation space.This simpler model is sampled more e ffi ciently than the originalone (i.e., the one with the full covariance matrix), and thus thetotal computing time is e ff ectively reduced. Appendix A.3: New prior families
In Paper I, we selected a series of prior families that were the bestchoice according to the literature. After a critical review of ouroriginal choices, we update the following family distributions.Our choice of the Half-Cauchy family as a prior for vari-ance parameters was inspired by the work of Gelman (2006).However, it has been shown that the heavy tails of the Cauchyfamily can seriously hinder the computational performance ofMarkov chain Monte Carlo (MCMC) samplers, which increasesthe computational time required to ensure convergence of thesampler. Therefore, we replace the Half-Cauchy family with theGamma one. The latter is defined over the standard deviationsinstead of the variances, and following the recommendations ofChung et al. (2013) we parameterize it as Gamma( α = , β ), with β a hyper-parameter.Similarly, we also replace the Huang & Wand (2013) priorof the photometric and astrometric covariance matrices with aGamma one (Chung et al. 2013). The β parameters are alsomodel hyper-parameters. https://mc-stan.org/users/documentation/case-studies/weakly_informative_shapes.html Appendix A.4: Field model
In Paper I, the astrometric and photometric field models were as-sumed to be a Gaussian + Uniform mixture model and Gaussianmixture model (GMM), respectively. The parameters of thesemodels were computed with our Expectation-Maximization rou-tines. Here, we compare the results obtained with two additionalliterature algorithms (one for the astrometric model and anotherfor the photometric one) over the original field population of Pa-per I. We still select the best number of components based on theBayesian information criterion (BIC).In the case of the astrometric model, we compare both ouroriginal algorithm and the Gaussian Mixture routine from scikit-learn (Pedregosa et al. 2011) with 1 to 15 components. The BICchooses models with 7 components for both algorithms. Thedi ff erence between the likelihood of the sources resulting fromthese two best models is significantly reduced when the conver-gence tolerance is ≤ − . Furthermore, the solutions obtainedafter running our complete methodology with these two mod-els are indistinguishable given their uncertainties. Nonetheless,the BIC value of the model inferred by our algorithm is slightlylower than that of the model inferred with scikit-learn .In the case of the photometric model, due to the presence ofmissing values, we are only able to test the ExtremeDeconvolu-tion algorithm (Bovy et al. 2011). However, the models inferredwith this algorithm have components with bizarre behaviors likeweights that vanish ( < − ) and means that are located at neg-ative magnitudes. Although these e ff ects are strongly reducedby increasing the value of the split-and-merge steps (see Ap-pendix B of Bovy et al. 2011), the computing time increasesdrastically as well. Our tests with 10 split-and-merge steps putthe computing time at ∼ K ( K − K − / − and the split-and-merge stepsto 10. Then, the BIC chooses the GMM with 13 components.After running our complete methodology with this field model,we obtain increased contamination in the cluster model; ∼ ff ers from the following problem. In-creasing the number of components results in proposed solutionswith nonpositive-semi-definite covariance matrices for the com-ponents with the lowest weights. Although these proposals arerejected, they still increase the computing time ( ∼ · yr − ,100 mas · yr − ], and a vast fraction of sourceswith missing values ( > Article number, page 11 of 21 & A proofs: manuscript no. Miec original convergence tolerance from 10 − to 10 − we obtain anastrometric GMM with 7 components and a photometric GMMwith 19 components. Third, the problem of inferring the param-eters of multivariate GMM from data sets with a high percentageof missing values still presents a computational challenge. Appendix A.5: Synthetic data
In Paper I, the probability threshold of classification (i.e., be-tween field and cluster) was found using synthetic data sets gen-erated from the inferred field and cluster models. Here, we dothis as well but with the following improvements. First, we gen-erate as many synthetic sources as those present in the real dataset. In Paper I, due to computation time reasons, we generatedsynthetic sources based on a model learned from a sample ofonly 10 sources. Then, we assumed that the probability thresh-old found for that model was valid for the model learned in thelarger 10 data set. Here, we find that this assumption was notentirely correct. The contamination and true positive rates re-ported in Paper I were optimistic (more details are given in Ap-pendix B). Second, in Paper I, the uncertainties of the syntheticsources were assigned without consideration about the sourceorigin: cluster or field. As the cluster members tend to have betteruncertainties than those of the field population , our original ap-proach resulted in less realistic simulations than those obtainedby separating the two cases. Here, the uncertainties and masksof missing values of a synthetic source are assigned as those of arandomly chosen real source in the same magnitude bin. For thelatter, we use the most observed photometric band in the real dataset. Third, in Paper I, the observed values were assumed to cor-respond to the synthetic values. Here, the synthetic value and itsassigned uncertainty are used as the mean and covariance matrixof a multivariate Gaussian distribution from which the syntheticobserved values are drawn. The previous improvements result inmore realistic synthetic data sets than those of Paper I. Appendix A.6: Hyper-parameters
The hyper-parameters of our hierarchical model are similar tothose defined in Paper I, except for the prior families that weremodified as part of our methodological improvements (see Sect.A.3). The set of model hyper-parameters is shown in Table A.1.In the latter, the subindices f c , sb , sgl , clr , c f s , and ph , standfor field-cluster, singles-binaries, singles, color, coe ffi cients, andphotometry, respectively. α and β are the hyper-parameters ofthe Dirichlet and Gamma distributions, which are the prior dis-tributions of fractions and standard deviations, respectively. µ and σ are the mean and standard deviation hyper-parameters,respectively, of the univariate and multivariate normal distribu-tions used as prior for the mean proper motions ( µ sgl and σ sgl ),the mean color index ( µ clr and σ clr ), and the mean coe ffi cients ofthe photometric splines ( µ c f s and σ c f s ). Finally, rg clr and knots indicate the interval of the color index and the knots of the pho-tometric splines, respectively. More details are given in Sect. 2and Paper I. The membership probability of a source is the ratio of its cluster like-lihood to the total likelihood times the cluster prior probability. There-fore, members tend to have narrow uncertainties that give them largevalues of the cluster likelihood, which are able to overcome the clusterprior probability, which in our data set is ∼ Table A.1.
Model hyper-parameters.
Name Units Value α fc [980,20] α sb [8,2] α sgl [5,4,1,1] β sgl mas · yr − [[2,2],[2,2],[2,2],[2,2]] µ sgl , mas · yr − [16.13, -39.03] µ sgl , mas · yr − [16.13, -39.03] µ sgl , mas · yr − [16.13, -39.03] µ sgl , mas · yr − [16.13, -39.03] σ sgl , mas · yr − [[ 17.08, -2.63],[ -2.63, 23.65]] σ sgl , mas · yr − [[ 2.02, -0.58],[ -0.58, 4.06]] σ sgl , mas · yr − [[ 143.2, 18.54],[ 18.54, 141.67]] σ sgl , mas · yr − [[2940.01,319.19],[319.19,2633.76]] α clr [1,1,1,1,1] β clr mag 1.0 µ clr mag 4.4 σ clr mag 3.6 rg clr mag [0.8, 8.0] β ph mag [0.1, 0.01, 0.01, 0.01, 0.01] knots mag [0.8, 3.22, 3.23, 5.18, 8.0] σ cf s mag [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] µ cf s , Y mag [ 9.30, 10.81, 12.33, 15.06, 18.04, 21.02, 22.79] µ cf s , J mag [ 8.75, 10.21, 11.66, 14.29, 17.16, 20.02, 21.71] µ cf s , H mag [ 8.39, 9.78, 11.18, 13.69, 16.43, 19.17, 20.80] µ cf s , K mag [ 8.40, 9.68, 10.97, 13.30, 15.83, 18.37, 19.87] Appendix B: The basic and benchmark solutions
In this Appendix, we describe the results of the basic and bench-mark solutions obtained with the original nonextincted data set(see Sect. 3). The basic solution is obtained with the improvedbasic module (i.e., the methodological and computational im-provements described in Appendix A). In contrast, the bench-mark solution is obtained with the improved basic module plusthe simplified model of EMB (see Sect. 2.3). Both basic andbenchmark solutions use the same field model, which we ob-tained after fitting a GMM to the field population (see AppendixA.4).
Appendix B.1: Basic solution
The results of applying the improved basic module to the origi-nal data set are the following. The optimum probability thresh-old (found using the first strategy, that of Paper I) is p t = . = = = p t = .
84, the TPR remains similar, theCR is worst by 3%, and the ACC is 3% better. These di ff erencesare explained by our improved and more realistic modeling ofthe synthetic data (see Appendix A.5). The basic solution finds1990 candidate members, of which 1846 are in common with theresults of Paper I. The number of candidate members is compat-ible, within the Poisson uncertainties, with the 1973 candidatemembers found in Paper I. The new solution rejects 144 of theoriginal candidate members and finds 122 new ones. The newand rejected candidates are randomly distributed over the propermotions and photometric sequence of the cluster and show noevidence of systematic shifts. Moreover, their numbers are com-patible with the original ones given the measured value of CR Article number, page 12 of 21. Olivares et al.: Miec Q u a li t y i n d i c a t o r [ % ] TPRCR
Fig. B.1.
TPR and CR of the benchmark solution as functions of theprobability threshold. (6.9%). Therefore, we conclude that this new solution is statisti-cally indistinguishable from that of Paper I.
Appendix B.2: Benchmark solution
Here, we show the details of the benchmark solution obtainedwith the basic + EMB modules over the original data set. First,we analyze the quality of the classifier in terms of the TPR andCR, and then we describe the inferred posterior distributions ofthe proper motions, the color index, and Y, J, H, and K bands.
Appendix B.2.1: Quality of the classifier
The benchmark solution finds 1943 candidate members witha membership probability larger than the optimum probabilitythreshold of p t = .
68, hereafter we refer to these as the highmembership probability sample (HMPS). Compared to the 1990candidate members found by the basic solution (see Sect. B.1),there are 1903 in common, 87 rejected, and 40 new ones. Thequality indicators of the classifier are shown in Fig. B.1, in par-ticular, at the optimum probability threshold, the TPR is 92.5%,the CR is 6.5%, and the ACC is 99.6%.Although the benchmark solution found 47 less candidatemembers than the basic solution, both numbers are compatiblewithin the Poisson uncertainty. Moreover, under the assumptionthat the members of the basic solution are the true ones, the 1903common members represent 95.6% of the 1990 candidate mem-bers, well above the 92.5% of the TPR reported by the bench-mark solution. The previous values indicate that our quality mea-surements are consistent between solutions and that the bench-mark solution is similar to the basic one. However, contrary towhat is observed when comparing the basic solution to that ofPaper I, in this case, the rejected sources are not randomly dis-tributed.The 87 rejected candidate members are located at a meandistance of 15.1 mas · yr − from the cluster proper motion center,whereas the common candidate members have a mean distanceof 5.2 mas · yr − , showing that the rejected sources are mainly onthe outskirts of the proper motions distributions. Furthermore,72 out of the 87 rejected sources are farther away than the 5.2mas · yr − of typical distance in the common candidate mem- bers. The rejected sources were considered members in the ba-sic solution due to the more extended wings of the EMB propermotions model, but once this model is removed (see Sect. 2.3),the membership probabilities of the rejected sources fall belowthe optimum probability threshold and are therefore no longerconsidered as candidate members. Although 30% of the rejectedcandidate members from the basic solution are EMB, the totalfraction of EMB reported by both solutions is ∼ Appendix B.2.2: Population distributions
10 0 10 20 30 40pmra [mas/yr]706050403020 p m d e c [ m a s / y r ] PriorPosteriorMAPSinglesEMBBasic
Fig. B.2.
Proper motions diagram showing the single and EMB candi-date members found by the benchmark solution, together with samplesfrom the prior (green lines) and posterior (orange lines) distributions ofthe astrometric GMM parameters, and the maximum-a-posteriori solu-tion (red line). For comparison, the candidate members found with thebasic solution are shown with blue circles.
We now make a summary of the population distributionsfound with the benchmark solution. Figures B.2 to B.4 show theproper motions, color–magnitude, and magnitude diagrams ofthe HMPS candidate members, together with samples from theprior (green lines) and posterior (orange lines) distributions ofthe benchmark solution. For comparison, the figures also showthe distributions resulting from the candidate members of thebasic solution (see Sect. B.1). Figure B.2 shows that the ma-jority of the rejected candidate members from the basic solu-tion are located at the outskirts of the proper motion distribution,as already discussed in the previous section. Despite the previ-ous e ff ect, neither the color–magnitude diagram nor the result-ing magnitude distributions exhibit any potential bias (see Figs.B.3 and B.4). Furthermore, the magnitude distributions result-ing from the HMPS are almost indistinguishable from those ob-tained with the candidate members of the basic solution. The dif-ferences between the magnitude distributions obtained with theHMPS and with the posterior distributions of the model param-eters are a direct consequence of the fact that the former is not a Article number, page 13 of 21 & A proofs: manuscript no. Miec K [ m a g ] PriorPosteriorMAPSinglesEMBBasic
Fig. B.3.
Color–magnitude diagram showing the single and EMB can-didate members found by the benchmark solution. Captions as in Fig.B.2. PriorPosteriorMAPHMPSBasic
Fig. B.4.
Magnitude distributions in the Y, J, H, and K bands found bythe benchmark solution. Captions as in Fig. B.2. random sample of the latter, but it only contains the high mem-bership probability sources. In addition, these latter sources havemissing values, which prevent us from drawing them, as clearlyshown in the bright side of the Y band magnitude distribution. In this Appendix, we show that the basic and benchmark so-lutions are almost identical with no evidence of biases. Further-more, our simplified EMB model recovers the same fraction ofEMBs as the basic solution.
Appendix C: Details of the classifier quality
In this Appendix, we analyze the quality of the classifier ob-tained using the extinction + EMB modules (see Sect. 2.2 and 2.3)when applied over the synthetically extincted MAP , , data sets.In particular, we measure the accuracy, true positive rate (TPR),and contamination rate (CR) at the optima probability thresholdsobtained with the following three strategies.In the first strategy, we learn the cluster and field mod-els from the synthetically extincted data set using the extinc-tion + EMB module. Then, with these learned models, we gen-erate a new synthetic data set in which the true class labels areknown (i.e., those of the field and the cluster). We notice thatthanks to our extinction module, the learned cluster model is freeof extinction, and therefore the synthetic data generated from itare also extinction-free. Afterward, we infer the cluster modelfrom the previous data using our basic + EMB module. Finally,we compute the optimum probability threshold using the trueclasses of the synthetic data and the inferred membership proba-bilities. This strategy follows the approach of Paper I in the sensethat it assumes that the learned model is the true one. In the ab-sence of the true labels of the cluster and field populations, thisstrategy is, to the best of our knowledge, the only available oneto obtain the optimum probability threshold. The caveat of it isthat the learned cluster model is free of extinction, and there-fore the resulting classification threshold is independent of theextinction.In the second and third strategies, the optimum probabilitythreshold is computed assuming that the true class labels cor-respond to those obtained by the benchmark solution (i.e., thatfound with our basic + EMB module on the original nonextincteddata set; see Appendix B). While the second strategy computesthe optimum probability threshold using all sources indepen-dently of their extinction value, the third strategy splits the datainto bins of one magnitude of extinction and obtains optimalclassification thresholds for each bin. This latter strategy givesa more detailed analysis of the performance of our extinctionmethodology under di ff erent degrees of extinction. The caveatof these strategies is that they assume that the data set containsthe true class labels.Tables C.1 to C.3 show the true positive rate (TPR) and con-tamination rate (CR) measured with our three strategies on thesynthetic data sets MAP , , . The rows of these tables show theresults of the strategies and are marked as follows. The resultsof the first and second strategies are shown in the first two rows,while the rest of the rows show those of the third strategy. Thislatter is shown at each extinction bin and for the concatenationof the labels in all bins (last row). The columns show the averageextinction value of each bin, the optimum probability threshold p t , and the quality indicators of the classifier. These latter areshown for the entire data set (labeled All) and for subsets con-taining only the sources with observed and missing color index(labeled CI obs and CI na , respectively). As can be observed, ourstrategies report high and similar values of the true positive rate.On the contrary, our second and third strategies report contami-nation rates that are lower than that obtained with the first strat-egy. The di ff erences between the contamination and true positiverates of the first and second strategies result from their di ff erentprobability thresholds. As the latter decreases, more members Article number, page 14 of 21. Olivares et al.: Miec are recovered, but also more contaminants are included. For thisreason, the first strategy reports larger true positive and contam-ination rates than those of the second strategy. As mentionedabove, the first two strategies obtain the optimum probabilitythreshold using all sources from the data set. However, thanksto the partitioning of the data into bins of extinction, the thirdstrategy obtains probability thresholds that are optimal for eachextinction bin. These thresholds monotonically decrease with in-creasing extinction as a consequence of both the optimization ofthe recovered members and the dilution of information acrossthe increasingly wider limits of the marginalization integral ofEq. 5. The previous analysis shows that while our first strategyis similarly good at recovering the cluster members as the othertwo, the detailed analysis in bins of extinction performed by thethird one reduces the contamination rate to values (cid:46)
Table C.1.
Quality indicators found in the MAP data set. A v p t TPR [%] CR [%]All CI obs CI na All CI obs CI na StrategyFirst - 0.70 96.30 96.89 95.01 10.39 3.79 22.14Second - 0.79 93.90 95.94 89.50 5.89 2.98 12.01Bin 1 0.10 0.79 96.32 98.03 92.66 5.39 2.88 10.62Bin 2 1.71 0.67 67.48 71.26 58.33 16.16 4.62 38.24All bins - - 94.39 96.18 90.53 5.97 2.97 12.19
Table C.2.
Quality indicators found in the MAP data set. A v p t TPR [%] CR [%]All CI obs CI na All CI obs CI na StrategyFirst - 0.75 89.55 92.04 84.17 18.00 6.32 36.66Second - 0.87 83.45 87.74 74.18 8.91 3.50 20.33Bin 1 0.61 0.95 90.06 94.34 82.50 4.78 1.96 10.00Bin 2 1.52 0.90 87.81 92.33 78.72 7.28 2.24 17.32Bin 3 2.66 0.64 86.47 89.79 78.39 10.77 3.70 25.95All bins - - 87.53 91.32 79.35 8.64 2.96 20.24
Table C.3.
Quality indicators found in the MAP data set. A v p t TPR [%] CR [%]All CI obs CI na All CI obs CI na StrategyFirst - 0.72 84.76 87.10 79.69 22.27 7.60 43.47Second - 0.87 77.41 82.96 65.40 9.48 3.70 22.29Bin 1 0.31 0.92 86.50 92.58 73.61 5.66 2.30 13.59Bin 2 1.48 0.90 85.71 93.24 71.30 9.70 4.46 20.62Bin 3 2.48 0.74 85.71 91.21 74.73 13.65 5.14 29.17Bin 4 3.47 0.33 84.25 86.34 78.87 9.32 3.07 23.29Bin 5 4.48 0.08 77.78 78.05 77.19 7.89 2.04 18.52Bin 6 5.50 0.03 70.21 70.87 68.42 9.17 8.75 10.34All bins - - 83.83 88.38 74.01 8.61 3.65 19.32
Figures C.1 to C.3 show the proper motions and color–magnitude diagrams of sources classified as true positives, falsepositives, and false negatives in the MAP , , data sets as recov-ered with the optimum probability thresholds of the third strat-egy. As can be observed from the proper motions diagram, thefalse positives and false negatives appear to be more frequent inthe outskirts of the distribution than the true positives. On theother hand, the color–magnitude diagrams show that the falsenegatives tend to have large extinction values, while the falsepositives are evenly spread in extinction and follow the color–magnitude distribution of the true positives. Therefore, we con- clude that our extinction methodology has di ffi culties in recov-ering the most extincted members (i.e., the false negatives with A v ∼ A v ∼ ∼ ∼
70% of the false positives (i.e.,contaminants) come from sources with a missing color index,which unfortunately is the majority of the sources in our originaldata set (see Table 1). Although the contamination rate intro-duced by these sources in the high extinction regions ( A v > ∼ Appendix C.1: Quality as a function of extinction andmagnitude
In this section, we provide further details of the quality of theclassifier as a function of extinction and magnitude. As the firststrategy obtains a cluster model free of extinction, we only ana-lyze its results as functions of magnitude.Figures C.4 to C.6 show the TPR and CR of the classifiersobtained from the MAP , , data sets as functions of the proba-bility threshold. The black lines show the quality indicators ob-tained with the second strategy, that is using all sources, and thecolored lines show those obtained with the third strategy and arecolor-coded with the mean extinction value of the bin. The meanextinction value and the optima probability thresholds of eachcase are shown in the second and third columns of Tables C.1to C.3. Similarly, the quality indicators at each optimum proba-bility threshold are marked with crosses, and their values corre-spond to those shown in the fourth and seventh columns of theaforementioned tables.As can be observed from Fig. C.4, the quality of the clas-sifier resulting from the MAP data set and the second strategy,shown with black crosses (TPR = = = = (cid:38) A v bin that are similar to those of the second strategy. How-ever, the second bin shows poorer results than the first one. Fig-ures C.5 and C.6 show that the quality indicators found in theMAP , data sets follow similar trends, which are a decreasingTPR with increasing value of extinction, and a CR that remains (cid:46) ffi culty in recovering memberswith increasing values of extinction. As shown in Figs. C.4 to Article number, page 15 of 21 & A proofs: manuscript no. Miec
10 0 10 20 30 40 50 [mas/yr]706050403020100 [ m a s / y r ] TPFPFN1 2 3 4 5 6 7 8i-K [mag]810121416 K [ m a g ] TPFPFN0.0 0.5 1.0 1.5 2.0 2.5 A v , max Fig. C.1.
Proper motions and color–magnitude diagrams showing thetrue positives, false positives, and false negatives, from the MAP dataset. C.6 and Tables C.1 to C.3, the TPR steadily decreases as a func-tion of extinction from values ∼
90% at A v ∼ . ∼ A v ∼ . (cid:46)
15% in spite of the extinction value.We now analyze the quality indicators of the classifier for thesubsets of the data with observed and missing color index. Theresults of this analysis are shown in Tables C.1 to C.3 and Figs.C.7 to C.9. The upper and lower panels of these latter figuresshow the TPR and CR, respectively, as a function of the proba-bility threshold, the extinction, and the status of the color index:observed or missing.As can be observed from Figs. C.7 to C.9, the TPR of sourceswith observed color index is for all cases higher than that ofsources with missing color index. The di ff erence between thesevalues decreases with increasing extinction, starting from 20%at A v ∼ .
5, and down to zero at the limit of A v ∼
6, where theTPR of sources with and without a color index are both ∼ ff erence follows a trend similar to that of the TPR.
10 0 10 20 30 40 50 [mas/yr]80706050403020100 [ m a s / y r ] TPFPFN1 2 3 4 5 6 7 8i-K [mag]810121416 K [ m a g ] TPFPFN0.0 0.5 1.0 1.5 2.0 2.5 3.0 A v , max Fig. C.2.
Same as Fig. C.1 but for the MAP data set. Finally, we analyze the dependencies of the TPR and CR onthe photometric magnitude. Figures C.10 to C.12 show the mea-sured TPR and CR of the classifier inferred from the MAP , , data sets as functions of the K magnitude, which is the most ob-served band of our data set (see Table 1). In addition, they showthe TPR and CR measured from the subsets of sources with andwithout an observed color index (marked as CI obs and CI na , re-spectively). As can be observed in these figures, our three strate-gies show similar results of TPR, with the second one show-ing increasingly lower values of TPR with increasing extinction.Moreover, the TPR is relatively stable as a function of magni-tude with slightly poorer values on the bright side (K (cid:46)
11 mag).The latter is a consequence of the high fraction of sources withmissing Y band, an e ff ect discussed in Sect. 4.2 and Appendix C.Regarding the CR, the values obtained with the second and thirdstrategies are consistently lower than those obtained with the firstone across all the magnitude interval and the data sets. In addi-tion, the results from the three data sets show that the CR is lowerin the central magnitude regions than in the faint and bright sides.However, we notice that due to low-number statistics, the resultsof the extreme bins at magnitudes < >
17 mag mustbe interpreted with caution. As discussed in Sect. 4.1, sourceswith missing values hamper our methodology by increasing theCR. Therefore, the higher CR values on the bright and faint do-
Article number, page 16 of 21. Olivares et al.: Miec
10 0 10 20 30 40 50 [mas/yr]80706050403020100 [ m a s / y r ] TPFPFN2 4 6 8i-K [mag]8101214161820 K [ m a g ] TPFPFN0 1 2 3 4 5 6 A v , max Fig. C.3.
Same as Fig. C.1 but for the MAP data set. mains (without taking into account the edges) result from thelarge fraction of sources with missing values produced by thesensitivity limits of the photometric detectors, which saturate atthe bright side and are unable to detect the faintest sources. Par-titioning the data set into the subsets with and without observedcolor index confirms our previous discussion about the negativeimpact that sources with missing values produce in the classifierquality, particularly in the CR. As can be observed in these fig-ures, the CR and TPR measured from the subset of sources withobserved color index are better than those measured in the sub-set with missing color index. Here again, the values at the edgesmust be neglected due to the low-number statistics.In this Appendix, we show that the major caveats of ourmethodology are the contaminants introduced by sources withmissing color index, and the reduced recovery rate with increas-ing extinction value. These two problems are associated with thequality of both the data set and the extinction map. While better-quality data will certainly reduce the contaminants down to thenegligible values shown by the subset with observed color in-dex, a detailed extinction map that provides constraining lowerand upper limits to the extinction will improve the TPR of theclassifier. % TPRCR0.0 0.5 1.0 1.5 2.0 2.5 A v , max Fig. C.4.
Quality indicators of the classifier obtained from the MAP data set. The TPR and CR of the classifier are shown as functions ofthe probability threshold (black line) and for bins of one magnitude ofextinction (color lines). % TPRCR0.0 0.5 1.0 1.5 2.0 2.5 3.0 A v , max Fig. C.5.
Quality indicators of the classifier obtained the MAP data set.Captions as in Fig. C.4. Article number, page 17 of 21 & A proofs: manuscript no. Miec % TPRCR0 1 2 3 4 5 6 A v , max Fig. C.6.
Quality indicators of the classifier obtained the MAP data set.Captions as in Fig. C.4. T P R [ % ] Observed CIMissing CI0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9Probability0102030405060708090100 C R [ % ] Observed CIMissing CI0.0 0.5 1.0 1.5 2.0 2.5 A v , max Fig. C.7.
TPR and CR of the classifier resulting from the MAP dataset. The sources are split into those with a missing (dashed lines) andobserved (solid lines) color index. The black line indicates the resultsindependent of the extinction value of the sources (i.e., the second strat-egy), while the colored lines show the results of di ff erent bins of extinc-tion (i.e., the third strategy).Article number, page 18 of 21. Olivares et al.: Miec T P R [ % ] Observed CIMissing CI0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9Probability0102030405060708090100 C R [ % ] Observed CIMissing CI0.0 0.5 1.0 1.5 2.0 2.5 3.0 A v , max Fig. C.8.
TPR and CR of the classifier resulting from the MAP dataset. Captions as in Fig. C.7. T P R [ % ] Observed CIMissing CI0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9Probability0102030405060708090100 C R [ % ] Observed CIMissing CI0 1 2 3 4 5 6 A v , max Fig. C.9.
TPR and CR of the classifier resulting from the MAP dataset. Captions as in Fig. C.7. Article number, page 19 of 21 & A proofs: manuscript no. Miec T P R [ % ] StrategyFirstSecondThird8 10 12 14 16 18K [mag]0102030405060 C R [ % ] CaseAll CI obs CI na Fig. C.10.
TPR and CR of the classifier resulting from the MAP dataset as functions of the K magnitude. The color shows the strategy toobtain the optimum probability threshold, and the line style shows thethree di ff erent cases: all sources (solid lines), and those with and withoutan observed color index (dashed and dotted lines, respectively). T P R [ % ] StrategyFirstSecondThird8 10 12 14 16 18K [mag]0102030405060 C R [ % ] CaseAll CI obs CI na Fig. C.11.
TPR and CR of the classifier resulting from the MAP dataset as functions of the K magnitude. Captions as in Fig. C.10.Article number, page 20 of 21. Olivares et al.: Miec T P R [ % ] StrategyFirstSecondThird8 10 12 14 16 18K [mag]0102030405060 C R [ % ] CaseAll CI obs CI na Fig. C.12.
TPR and CR of the classifier resulting from the MAP2