Photometric Classifications of Evolved Massive Stars: Preparing for the Era of Webb and Roman with Machine Learning
Trevor Z. Dorn-Wallenstein, James R.A. Davenport, Daniela Huppenkothen, Emily M. Levesque
Draft version February 8, 2021
Typeset using LaTeX twocolumn style in AASTeX63
University of Washington Astronomy Department, Physics and Astronomy Building, 3910 15th Ave NE, Seattle, WA 98105, USA
SRON Netherlands Institute for Space Research, Sorbonnelaan 3, 3584 CA Utrecht, The Netherlands
DIRAC Institute, Department of Astronomy, University of Washington, 3910 15th Ave NE, Seattle, WA 98105, USA
The University of Washington eScience Institute, The Washington Research Foundation Data Science Studio, University of Washington, Seattle, WA 98105, USA
(Received; Revised; Accepted)
Submitted to ApJ

ABSTRACT

In the coming years, next-generation space-based infrared observatories will significantly increase our samples of rare massive stars, representing a tremendous opportunity to leverage modern statistical tools and methods to test massive stellar evolution in entirely new environments. Such work is only possible if the observed objects can be reliably classified. Spectroscopic observations are infeasible for more distant targets, and so we wish to determine whether machine learning methods can classify massive stars using broadband infrared photometry. We find that a Support Vector Machine classifier is capable of coarsely classifying massive stars with labels corresponding to hot, cool, and emission line stars with high accuracy, while rejecting contaminating low mass giants. Remarkably, 76% of emission line stars can be recovered without the need for narrowband or spectroscopic observations. We classify a sample of ∼

Corresponding author: Trevor Z. Dorn-Wallenstein ([email protected])

INTRODUCTION

Evolved massive stars are observed in a menagerie of exotic evolutionary phases. While the challenge of connecting these states with a self-consistent theory of stellar evolution has seen rapid advancement since the original introduction of the “Conti Scenario” (Conti et al. 1983), the effects of rotation, magnetic fields, internal mixing processes, and binary interactions on the evolution of massive stars are still the subject of much theoretical effort (e.g., Ekström et al. 2012; Eldridge et al. 2017). While individual massive stars can be used as precision probes of these processes, ensembles of evolved massive stars can also significantly constrain stellar evolution. This can be done by comparing the integrated spectra of massive stars (e.g. Levesque et al. 2012), or by studying the detailed makeup of resolved populations of massive stars (Dorn-Wallenstein & Levesque 2018, 2020; Stanway et al. 2020).

The latter approach requires large and accurately-classified samples of evolved massive stars that will be achievable in the coming years with the launch of the James Webb Space Telescope (
Webb) and the Nancy Grace Roman Space Telescope (
Roman). Among the instrumentation on
Webb and the proposed instrumentation for
Roman are photometers equipped with filters spanning a broad wavelength baseline from 0.5 to 28 µm. The resolution of Webb will allow us to identify and study in detail individual luminous stars to great distances (e.g. Jones et al. 2017), while the impressive 0.218 deg² field of view of Roman will allow us to efficiently survey nearby galaxies in a small number of pointings (Spergel et al. 2013). Combined, observations from both missions will give astronomers access to precise infrared measurements of vast numbers of evolved massive stars. But without sophisticated methods of identifying and classifying these stars, the science return afforded by such a large increase in expected sample sizes will be significantly reduced.

Classification of stars from broadband photometry is often done by adopting simple linear cuts in color-magnitude space (e.g., Massey et al. 2006, 2009); most critically, such cuts do not include rare emission line objects, whose classification requires dedicated narrowband surveys (sometimes with custom-designed filters, e.g. Neugent et al. 2018b), often accompanied by follow-up spectroscopy, both of which require extensive telescope time. These objects are often the post-main sequence evolved states of massive stars, in which the effects of rotation, binary interactions, and chemical mixing are the most pronounced; the stars that place the most valuable constraints on unknown stellar physics are also the hardest to detect via traditional means. Therefore, it is worthwhile to determine whether there are alternative ways to classify massive stars that avoid traditional and expensive methods.

At present, we can mimic the observing capabilities of
Webb and
Roman by combining data from
Gaia (which has a red-optical bandpass), the Two Micron All Sky Survey (2MASS, Skrutskie et al. 2006, near-infrared), and the Wide Field Infrared Survey Explorer (WISE, Wright et al. 2010, mid-infrared). WISE provides the additional benefit of having scanned the sky approximately every six months, yielding lightcurves spanning a ∼ baseline. Since Roman and
Webb will not be observing the entire sky in this fashion, determining whether variability can aid in the classification of evolved massive stars will determine whether observers should seek repeated observations of a stellar population.

We wish to determine whether we can:

1. Assemble a sample of evolved massive stars with available classifications as a training data set,

2. Construct a machine learning classifier that can reject low mass red contaminants and identify likely emission line objects in order to optimise available telescope time on the most promising targets,

3. Determine whether variability metrics estimated from WISE lightcurves can aid in these tasks, and

4. Determine which photometric bandpasses and variability metrics contribute the most to making accurate classifications.

Here we utilize a support vector machine classifier (SVC) trained only on broad-band photometry and simple metrics derived from WISE lightcurves to classify a large sample of evolved massive stars. We describe our sample selection and labeling method in § .

SAMPLE SELECTION & LABELING

For any machine learning algorithm, a high-quality training set with accurate labels is necessary. The second data release (DR2) of the
Gaia mission (Gaia Collaboration et al. 2018a) contains precise photometry in three bands (G, G_BP, and G_RP) and geometric parallaxes (ϖ) for 1.3 billion stars in the Galaxy and Magellanic Clouds. Because the parallax measurements suffer from some systematics (Lindegren et al. 2018), and many objects have high fractional errors (σ_ϖ/ϖ) or negative measured parallax, Bailer-Jones et al. (2018) calculated Bayesian distance estimates for the majority of stars in Gaia
DR2, using a prior based on the spatial distribution of stars in the Galaxy. Figure 1 shows the difference between the distance inferred by Bailer-Jones et al. (2018), r_est, and a naive distance derived by inverting the reported Gaia measurements of ϖ for ∼ stars, r_est = 1/ϖ. While the two distance estimates are roughly consistent for nearby stars, more distant stars are biased much further away in the naive distance estimates.

We cross-match Gaia
DR2 and the ALLWISE data release. ALLWISE (Cutri et al. 2013) contains photometry in four mid-infrared (MIR) bands — W1 (3.4 µm), W2 (4.6 µm), W3 (12 µm), and W4 (22 µm) — derived from co-added images obtained during the original WISE mission. We select stars with good distance estimates (result_flag = 1 if the distance estimate is the mode of the posterior distribution, 2 if it is the median, and 0 for a failed estimate; see Bailer-Jones et al. 2018 for more details) that satisfy

M_G = G − 5 log₁₀(r_est) + 5 ≤ − . , W1 < .    (1)

Since Bailer-Jones et al. (2018) used a Galactic prior, stars in the Magellanic Clouds have distances that are considerably underestimated. Thus, we also match the catalog in Gaia Collaboration et al. (2018b) to the ALLWISE/Gaia cross-match, and select stars with W1 <
14 and M_G ≤ − .5, assuming distance moduli of 19.05/18.52 for the SMC/LMC respectively (Kovács 2000a,b) and combining the two cross-matches while dropping duplicate stars. This results in a total of 452,283 stars.

Figure 1.
Distance from Bailer-Jones et al. (2018) versus distance inferred via inverting the reported ϖ from
Gaia
DR2 for ∼ stars, r_est = 1/ϖ.

We then estimate the reddening in the
Gaia bandpasses using the published estimate for A_G from Gaia
DR2, and coefficients from Malhan et al. (2018) to calculate E(G_BP − G_RP). For Galactic stars without A_G estimates, we assume A_G = 0, and for stars in the Magellanic Clouds, we assume the average value of A_G and E(G_BP − G_RP) using R_V measurements from Gordon et al. (2003) and E(B − V) from Massey et al. (2007). Using these quantities, we calculate the intrinsic G_BP − G_RP and M_G for all stars.

We can then construct color-magnitude diagrams (CMDs) in the Gaia filters, which we can use to select massive stars — i.e., stars with initial mass M_i ≥ M⊙. We use the MESA Isochrones & Stellar Tracks (MIST, Dotter 2016; Choi et al. 2016; Paxton et al. 2011, 2013, 2015) isochrones with metallicity [Fe/H] = 0, − . , − and v/v_crit = 0.4. We then selected the faintest isochrone point of any age between 10 and 10 . yr in steps of 0.05 dex with M_i ≥ M⊙ in 100 equally-spaced bins in the range − . ≤ G_BP − G_RP ≤ 3. These isochrone points form a boundary in the
Gaia
CMD that represents the faintest luminosities reached by any massive star at any point during its evolution, and no fainter massive stars are expected to be found — note that many isochrone points with M_i < M⊙ lie above this boundary, so our sample is not constructed to be free of contamination.

The left panel of Figure 2 shows the logarithmic density on the sky of all stars selected from the Gaia
DR2 database; the Galactic plane and Magellanic Clouds are clearly visible. The right panel shows the portion of the
Gaia
CMD brighter than M_G = − .75, where blue (orange, green) points are individual stars in the Galaxy (LMC, SMC). The solid (dashed, dotted) black lines show the MIST luminosity threshold for the Galaxy (LMC, SMC). Note that the thresholds accurately capture the slope of the main sequence for all three galaxies, as well as the G_BP − G_RP color corresponding to the Hayashi limit.

We select all stars brighter than the corresponding luminosity threshold for their host galaxy, resulting in 9784 objects. From this sample, we select all stars fainter than the saturation limit in W1, W2, and W3 (we exclude W4, where the signal-to-noise is often poor). We also convert the W , W , and Gaia photometry for these stars. We query
Vizier (Ochsenbein et al. 2000) using astroquery to download
JHKs photometry from the Two Micron All Sky Survey (2MASS, Skrutskie et al. 2006) for all stars. We also query SIMBAD (Wenger et al. 2000) and download the common name (MAIN_ID),
Figure 2.
Left:
Density of stars selected from the
Gaia database on the sky. Intensity of the colormap corresponds to the logarithm of the number of stars in each bin.
Right: Gaia
CMD for stars selected from the
Gaia
DR2 database brighter than M_G = − .5. Galactic stars are in blue, LMC stars are in orange, and SMC stars are in green. The solid, dashed, and dotted lines represent our minimum-luminosity criteria to select massive stars in the Galaxy, LMC, and SMC respectively.

spectral type (contained in the
MK Spectral Type and
SP Type fields), and object type (
OType) for each star, the latter two of which we use to assign labels.

2.1. Label Assignment

Figure 3.
The makeup of our sample of massive stars. Note that the sample is dominated by OBA stars and cool supergiants, and non-OBAe emission line stars are the rarest stars in our sample. For readability, we have used a logarithmic y-axis to display our sample statistics. Note that in practice, differences in the number of stars per class are much larger than they might appear here.
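The sparse classes visible in Figure 3 are addressed in a later section; as an illustration of one standard mitigation (not necessarily the one the authors adopt), per-class weights inversely proportional to class frequency can be computed with the standard library. The helper name is ours:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency ('balanced') class weights: N / (k * N_c) for
    each class c with N_c members, k classes, and N stars total, so rare
    classes (e.g., LBVs) count proportionally more during training.
    Mirrors scikit-learn's class_weight='balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}
```

A classifier trained with these weights penalizes mistakes on rare classes more heavily, which partially compensates for the imbalance without resampling the data.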
For our final sample of ∼

Table 1.
Common names, coordinates, host galaxies, and
Gaia measurements of 6,484 putative massive stars, ordered by Right Ascension. r_est from Bailer-Jones et al. (2018) is given for Galactic stars. Listed values of G and G_BP − G_RP are uncorrected for extinction.

Common Name                R.A. [deg]   Dec [deg]     Host Galaxy  r_est [kpc]  G [mag]     A_G [mag]   G_BP − G_RP [mag]
HD 236270                  0.17442287   55.72245665   Milky Way    2.162        9.073082    0.942       0.24057600
LS I +64 10                0.38838658   64.51219232   Milky Way    5.305        11.545427   1.3305      0.56807750
LS I +60 69                0.55506135   60.43828347   Milky Way    5.866        11.854865   1.234       0.61282300
BD+62 2353                 0.59458742   62.90087875   Milky Way    5.243        9.809305    0.485       0.37020600
HD 73                      1.40408512   43.40139506   Milky Way    1.869        8.1893425   0.1245      -0.14992300
HD 240496                  1.42175475   58.49541068   Milky Way    2.499        9.698675    1.5533      0.68225500
WISE J000559.28-790653.3   1.49713706   -79.11483482  SMC          -            13.9371195  0.21266685  1.04591900
LS I +59 30                1.70503555   59.85955733   Milky Way    4.006        10.856779   1.188       0.50059550
BD+57 2870                 1.82960982   58.33785301   Milky Way    3.893        9.841114    1.4205      0.82356400
BD+62 1                    1.88805102   63.08030731   Milky Way    2.893        10.293984   1.2973      0.52861900

Note — This table is published in its entirety in the machine-readable format. A portion is shown here for guidance regarding its form and content.

peculiar stars or X-ray binaries). Deriving labels for known massive stars using existing sources is not trivial, and our labelling scheme would be entirely different if a large sample of massive stars with well-measured temperatures, surface gravities, and chemical abundances were available.

We first use the common name and
Gaia source_id of the star to determine if the star belongs to the catalog of confirmed Luminous Blue Variables (LBVs) presented in Richardson & Mehner (2018). Non-LBVs are classified as WR stars if “W” is in the spectral type, or the SIMBAD
OType is “*WR”. Non-WRs with “K” or “M” in their spectral type are classified as either Red Supergiants (RSGs) or “C/S/Giant” if their SIMBAD
SP_Type contains “III” — we keep all such low-mass contaminants in our sample, as distinguishing between RSGs and luminous low-mass giants is still a difficult problem (Massey et al. 2009; Yang et al. 2019; Neugent et al. 2020). The resulting sample of RSGs is pure; of the five RSGs that don’t have luminosity class I, all of them are luminosity class Ia-II, Ib-II, or Iab-II, which are consistent with bona fide RSGs (Levesque et al. 2005). This is a good test of the
Gaia
DR2 parallaxes and Bailer-Jones et al. (2018) distances, as cool subgiants and dwarfs with luminosity class IV or V would have been erroneously classified as RSGs using our criteria had they been included in our sample due to bad distance and M_G measurements.

Non-RSGs with “F” or “G” in their spectral type are classified as Yellow Supergiants (YSGs). However, this includes a number of blue stars. Further inspection of these stars reveals a number of objects whose MK Spectral Type field indicates these are hot stars, with spectral type O, B, or A, which we classify as such (see below). Eight low-mass yellow stars are also included in our sample. Stars with “III” or “V” in their spectral types are classified as Yellow Dwarfs. We note that for extragalactic samples, foreground dwarfs can usually be filtered based on proper motions, while dwarfs belonging to the stellar population under study can be excluded just based on their apparent magnitudes. Nonetheless, we retain this class to avoid confusion between these stars and true YSGs. All YSGs that aren’t hot stars or dwarfs keep their YSG label.

Of the objects that have not yet been classified, stars with “[e]” in their spectral type are classified as OB[e] stars, while non-OB[e] stars with spectral types including an “e” (without brackets, and without “pec” in their spectral type) are classified as OBAe stars. If the star is not yet classified and O, B, or A are in the spectral type with no additional information, they are classified as generic OBA stars. OBA stars with “III” or “IV” in their spectral type are classified as evolved OBA stars, stars with “V” in their spectral type are classified as OBA main sequence stars, and finally, stars with “I” in their spectral type are labeled as OBA supergiants.
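The chain of string tests described so far (the remaining fallback classes are covered in the next paragraph) can be sketched as a single function. This is a hypothetical, simplified reimplementation: `assign_label` is our own name, and several edge cases from the text (the actual source_id match against Richardson & Mehner 2018, “OB+” types, compound binary spectral types) are omitted:

```python
def assign_label(sp_type, o_type, in_lbv_catalog=False):
    """Simplified sketch of the labeling flowchart (Figure 4).

    sp_type and o_type are the SIMBAD spectral type and object type
    strings; in_lbv_catalog stands in for the LBV catalog match.
    """
    sp = sp_type or ""
    if in_lbv_catalog:                        # confirmed LBV catalog match
        return "LBV"
    if "W" in sp or o_type == "*WR":          # Wolf-Rayet stars
        return "WR"
    if "K" in sp or "M" in sp:                # cool stars: giant vs. RSG
        return "C/S/Giant" if "III" in sp else "RSG"
    if ("F" in sp or "G" in sp) and not any(c in sp for c in "OBA"):
        # yellow stars that are not secretly hot OBA stars
        return "Yellow Dwarf" if ("III" in sp or "V" in sp) else "YSG"
    if "[e]" in sp:                           # bracketed emission
        return "OB[e]"
    if "e" in sp and "pec" not in sp:         # emission without brackets
        return "OBAe"
    if any(c in sp for c in "OBA"):           # remaining hot stars
        if "III" in sp or "IV" in sp:
            return "Evolved OBA"
        if "V" in sp:
            return "Main Sequence OBA"
        if "I" in sp:
            return "Supergiant OBA"
        return "OBA"
    return "Unknown/Candidate"                # no usable information
```

Note that the order of the luminosity-class tests matters: “IV” must be checked before “V”, and “V” before “I”, since the substrings overlap.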
All stars that have not been assigned a label at this stage are either C/S stars (which are assigned the C/S/Giant class), stars labeled only as Variables in SIMBAD (e.g., LPVs, semi-regular variables, or just variables, without other spectral information), which are assigned the Miscellaneous Variable classification, or stars with no identifying information/no confirmed designation (e.g., the SIMBAD
OType is “Star” or contains “Candidate”), which are classified as Unknown/Candidate.

Finally, we include an “Is Binary” flag for all stars, which is 1 for stars classified as Eclipsing or Spectroscopic Binaries, High Mass X-ray Binaries, or Ellipsoidal Variables, or if they have a compound spectral type (e.g.,
Figure 4.
Flowchart illustrating the process by which stars are assigned labels, as described in the text. Each star begins in the top left and is assigned a label by following a series of binary decisions. This process is complex, and demonstrates the difficulty in deriving useful labels for massive stars.

and 0 otherwise; 102 stars are flagged as binaries. Because photometry of binary systems can be misleading (Neugent et al. 2018a), and binary systems exhibit a broad range of variability that is not intrinsic to the individual components, we exclude these stars from our classifier.

Figure 3 shows the makeup of our sample. Approximately 30% of our sample (2550 stars) belong to the Miscellaneous Variable and Unknown/Candidate classes, which we do not use to train our classifier; instead, we use the classifier to assign tentative classifications in §
4. The rest of the sample is dominated by luminous OBA stars and cool supergiants, with very few LBVs, OB[e] stars, and WR stars. This is unsurprising given these stars’ high luminosity in the
Gaia bandpass, and the expected lifetimes of these evolutionary phases relative to the lifetimes of exotic emission line objects (Ekström et al. 2012). The imbalances in available training data across different classes, along with the extreme sparsity of training data in the rare classes, will impact the performance of the classifier if not properly addressed (Chawla 2010). We discuss this issue for this particular sample in § .

The left panel of Figure 5 shows the M_G vs. G − J CMD for all stars in our sample that aren’t labeled as miscellaneous variables or unknown/candidate, colored by their label. G − J correlates reasonably well with effective temperature in main sequence stars (Davenport & Covey 2018). From this plot, it is clear that many stars are misclassified in SIMBAD (the worst example is one particular red star classified as an OBA star), reducing the effectiveness of any machine learning algorithm, and propagating biases into the results. Yellow supergiants are especially prone to this problem: 81/212 YSGs in the sample (38%) have G − J <
1, consistent with the optical colors of much hotter stars. Indeed, this problem would have been worse had we not corrected for the presence of OBA stars in the initial sample of YSGs. This issue may originate from bad distance estimates for individual stars (which explains why our sample includes F and G dwarfs), bad estimates of reddening (given our usage of monochromatic extinction coefficients), previously unidentified variability, or the fact that many of these spectral types were determined via stellar spectra taken on photographic plates (for example, many spectral types for stars in the LMC come from Ardeberg et al. 1972). We expect this issue to propagate in our results, increasing the confusion between YSGs and hot stars. (This does not include stars with the “OB+” spectral type, which is an outdated class that describes OB stars with weaker absorption lines that would now be classified as OB supergiants.)

Because we expect objects in some classes — especially those with an evolutionary link such as main sequence, evolved, and supergiant OBA stars — to appear similar in the training data set, we also assign all stars a coarse label: all classes of OBA stars excluding OB[e] and OBAe are labeled “Hot”; RSGs and YSGs are labeled “Cool”; WRs, LBVs, and both OB[e] and OBAe stars are labeled “Emission” (EM for short); C/S/Giant stars and Yellow Dwarfs are labeled “Contaminant”; and miscellaneous variables and unknown/candidates are labeled “Unknown/Candidate.” The results of this labeling scheme are summarized in Table 2, which shows the number of stars with a given refined class that are assigned a particular coarse label. The right panel of Figure 5 shows the same CMD, with points colored by their coarse label. This leads to some improvement: each coarse class lies in the approximate region of the CMD that one would expect.
Regardless, it is evident that selecting any one of these classes solely from this optical photometry would be difficult: the “cool” class has significant overlap with the “hot” class — largely driven by the YSGs — emission line stars can be found at a range of G_BP − G_RP colors, and there is significant overlap between low-mass contaminants that will end their lives as white dwarfs and true massive stars that will end their lives in supernova explosions. This point is emphasized by the contours, which correspond to 0.5 and 0.1 times the maximum value of a kernel density estimate of the distribution of each class.

We will use these coarse labels to train a second classifier. While these coarse labels lose some specificity, each coarse class contains more stars, hopefully increasing the performance of a classifier trained on these labels. Furthermore, they still retain physical information: the “cool” label contains stars with convective envelopes, while “hot” stars have radiative envelopes. Meanwhile, emission line stars are notable for their variability. It is our hope that this second classifier will still address two of our stated goals: to identify emission line stars, and to reject contaminating low mass stars.

WISE LIGHTCURVES

Variability in evolved massive stars has been well-characterized at timescales from minutes to decades (e.g. Conroy et al. 2018; Dorn-Wallenstein et al. 2019; Soraisam et al. 2020). In a study of massive stars in the Whirlpool Galaxy (M51), Conroy et al. (2018) found that almost half of the stars brighter than M_I = −
Figure 5. M_G vs. G − J for putative massive stars. Left: Stars are colored by their label.
Right: Stars are colored by their coarse label. Contours for each coarse class correspond to 0.5 and 0.1 times the maximum value of a kernel density estimate of the distribution of each class in the CMD. Both panels illustrate a key difficulty faced by our classifier: both the refined and coarse labels have significant overlap with each other in the CMD. For example, the coolest/warmest YSGs have identical optical photometry to RSGs/OBA supergiants respectively. Even in the coarse labels, the contours for hot stars and emission line stars are nearly identical.
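The coarse relabeling described in the text is a simple many-to-one lookup over the refined classes; a minimal sketch (label strings as they appear in Table 2, with the held-out classes passed through unchanged; the names `COARSE` and `coarse_label` are ours):

```python
# Many-to-one mapping from refined to coarse labels (cf. Table 2).
# Misc. Variable and Unknown/Candidate stars are not used for training
# and simply keep their original label.
COARSE = {
    "Main Sequence OBA": "Hot", "Evolved OBA": "Hot",
    "Supergiant OBA": "Hot", "OBA": "Hot",
    "YSG": "Cool", "RSG": "Cool",
    "WR": "Emission", "LBV": "Emission",
    "OB[e]": "Emission", "OBAe": "Emission",
    "C/S/Giant": "Contaminant", "Yellow Dwarf": "Contaminant",
}

def coarse_label(refined):
    """Return the coarse class for a refined label (pass-through for
    classes that are held out of training)."""
    return COARSE.get(refined, refined)
```

This collapses twelve refined classes into the four coarse classes used to train the second classifier.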
Table 2.
Number of stars in a class that are assigned a given coarse label; we do not show the Miscellaneous Variable or Unknown/Candidate labels.
Refined Label        Hot    Emission   Cool   Contaminant
Main Sequence OBA    187
Evolved OBA          409
Supergiant OBA       798
OBA                  915
OBAe                        383
OB[e]                       12
WR                          37
LBV                         8
YSG                                    212
RSG                                    847
C/S/Giant                                     118
Yellow Dwarf                                  8
Total                2309   440        1059   126

tion of 1. Both red and extremely luminous blue stars exhibited quite high amplitude (∆I ≥ 0.3) variability. For spectral energy distributions (SEDs) dominated by purely stellar light, mid-infrared (MIR) flux measurements (and thus variability) are sensitive to (variations in) the bolometric luminosity. However, for stars with significant circumstellar dust components in their SEDs, MIR variability is correlated with both intrinsic bolometric variability, and with dust creation/destruction processes in the circumstellar medium (for example, in RSGs where it is correlated with the mass loss rate, e.g. Yang et al. 2018).

The WISE mission provides lightcurves from stars in all parts of the sky, observed over a ∼ ∼
180 days. All stars have a ∼ . WISE originally observed in four bands: W1 (3.4 µm), W2 (4.6 µm), W3 (12 µm), and W4 (22 µm); however, it was reduced to using only the two bluest bands in its post-cryogenic survey mode, called “NEOWISE”. The duration of each ∼
180 day visit depends on the spatial geometry of the WISE scanning program; i.e., stars closer to the ecliptic poles have longer duration visits (often exceeding a week) with many epochs per visit, while stars near the equator have very short visits (typically a couple days) with only a few epochs per visit. Because WISE lightcurves possess such non-uniform cadence, extracting detailed physics for most individual stars is difficult. However, the WISE lightcurves place fantastic constraints on MIR variability amplitude on longer timescales, especially for evolved massive stars whose highest-amplitude variability occurs over ∼ year timescales. Such amplitude and timescale estimates are related to the physical parameters of the star, potentially aiding in classification.

For every star selected in § , we use astroquery to pull data in the region within 3 arcsec of the known source location. To ensure high quality data for all recovered epochs, we require the photometric quality flag to be PH_QUAL=A, the contamination flag to be
CC_FLAGS=00, the number of deblended sources flag to be
NB=1, and the PSF photometry fit quality in W1 (defined as the reduced χ²) to satisfy w1rchi2 < .

Only two of our stars did not have usable data from WISE: WISE J074911.48-102000.2 (HD 63554) has no lightcurve available online, and WISE J050128.62-701120.2, which does not have any corresponding object nearby on SIMBAD. When calculating each of the variability metrics below, we instead record a value of NaN (i.e., missing data). For the remaining stars, we ignore the W3 and W4 data and construct W1, W2, and W1 − W2 lightcurves.

Variability Metrics
Amplitude
For each of the three lightcurves of each object, we wish to extract simple metrics that describe the amplitude and timescale of variability. We choose χ² about the median, defined as

χ² = Σ_i [(M_i − M̃)/σ_i]²    (2)

and the reduced χ²,

χ²_red = χ²/(N − 1),    (3)

where M_i is a magnitude measurement, M̃ is the median of the lightcurve, σ_i is the corresponding error on the data point, and N is the number of points in the lightcurve. We also calculate the Median Absolute Deviation (MAD) and Error-Weighted MAD (EWM):

MAD = Median(|M_i − M̃|)    (4)

EWM = Median(|M_i − M̃|/σ_i)    (5)

If the filtered and cleaned lightcurve only contains one good measurement (or no good measurements), we automatically give it χ² = χ²_red = MAD = EWM = NaN. We describe our method for treating missing data below.

The top panels of Figure 7 show the distributions of χ²_red and EWM derived for our sample. Values from the W1, W2, and W1 − W2 lightcurves are shown. The bottom left panel shows χ²_red vs. EWM for all three lightcurves. While the two measures correlate reasonably well with each other, there is a branch of stars whose lightcurves have high χ²_red and low EWM; because the EWM is robust to outliers, χ²_red is an effective probe of lightcurves with sudden brightening/fading events, while EWM is an effective selector for lightcurves that display consistent variability. The bottom right panel shows a scatter plot of χ²_red in W1 and W
2, with each point colored by its coarse label. A distinct branch of stars that are much more variable in W2 than in W1 is apparent. These stars have χ²_red < 10 in W1 but χ²_red > 100 in W2, and have one observation during which the star apparently becomes considerably redder, achieving W1 − W2 ∼ 4. Examining the times at which these extreme reddening events occur shows a preference for times during the cryogenic WISE survey, implying that this behavior is likely instrumental in origin, despite our filtering using the provided quality flags. Nonetheless, we include χ²_red as it does not map perfectly onto EWM, and only 98 stars fall in this regime. We do not yet know whether χ²_red, EWM, both metrics, or neither are useful features for classification, so we keep both with the intent of exploring their importance below.
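Equations 2–5 amount to a few lines of code per lightcurve; a standard-library sketch (the function name is ours, and the upstream quality filtering described above is assumed to have already happened):

```python
from statistics import median

def amplitude_metrics(mags, errs):
    """Amplitude metrics of Eqs. 2-5 for one cleaned lightcurve:
    chi^2 about the median, reduced chi^2, MAD, and error-weighted MAD."""
    n = len(mags)
    if n < 2:  # zero or one good epoch: metrics recorded as NaN
        nan = float("nan")
        return {"chi2": nan, "chi2_red": nan, "mad": nan, "ewm": nan}
    med = median(mags)                                       # M-tilde
    chi2 = sum(((m - med) / s) ** 2 for m, s in zip(mags, errs))
    mad = median(abs(m - med) for m in mags)                 # Eq. 4
    ewm = median(abs(m - med) / s for m, s in zip(mags, errs))  # Eq. 5
    return {"chi2": chi2, "chi2_red": chi2 / (n - 1), "mad": mad, "ewm": ewm}
```

As the text notes, the median-based MAD and EWM are robust to single-epoch outliers, while χ² is not; computing all four per band preserves both kinds of sensitivity as classifier features.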
Figure 6.
Example lightcurve for WISE J000536.97+432405.0.
Top left: Raw lightcurve, with W1 and W2 shown. Bottom left: Variability in W1 − W2. Top right: Binned W1 lightcurve and the spline interpolant, centered at t = 0. Bottom right: First-order derivative of the interpolant. Vertical blue lines show the times where the derivative crosses zero, indicated by the horizontal blue line.
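The visit-binning and zero-crossing bookkeeping illustrated in Figure 6 can be sketched with the standard library. Note that the text fits a 3rd-order B-spline with scipy.interpolate.splrep and differentiates it with splev, whereas this simplified sketch (with our own function names) differentiates the binned points directly:

```python
def bin_visits(times, mags, gap=50.0):
    """Group epochs into visits separated by > `gap` days (the text
    adopts 50 d, set by the ~180 d WISE cadence) and return the mean
    time and mean magnitude of each visit."""
    pairs = sorted(zip(times, mags))
    visits, current = [], [pairs[0]]
    for t, m in pairs[1:]:
        if t - current[-1][0] > gap:   # large time gap: start a new visit
            visits.append(current)
            current = []
        current.append((t, m))
    visits.append(current)
    return [(sum(t for t, _ in v) / len(v), sum(m for _, m in v) / len(v))
            for v in visits]

def zero_crossing_frequency(binned):
    """nu: slope sign changes per day, a crude stand-in for the
    zero-crossing count of the spline derivative described in the text."""
    if len(binned) < 4:                # fewer than four visits: undefined
        return float("nan")
    slopes = [(m2 - m1) / (t2 - t1)
              for (t1, m1), (t2, m2) in zip(binned, binned[1:])]
    crossings = sum(1 for a, b in zip(slopes, slopes[1:]) if a * b < 0)
    return crossings / (binned[-1][0] - binned[0][0])
```

A lightcurve that alternates between bright and faint states on every ∼180 day visit would yield one slope sign change per visit pair, i.e. a high ν; a monotonically brightening star would yield ν = 0.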
Figure 7.
Top row: Distribution of derived χ²_red (left) and EWM (right) values in the W1, W2, and W1 − W2 lightcurves. Bottom row: Scatter plots comparing different amplitude metrics. The left panel shows EWM vs. χ²_red in the W1, W2, and W1 − W2 lightcurves. The right panel shows χ²_red in W2 vs. χ²_red in W1, with each point colored by its coarse label.
Timescale
Many methods exist for estimating dominant timescales in lightcurves. Conroy et al. (2018) use the Lomb-Scargle Periodogram (Lomb 1976; Scargle 1982) to search for periodic variables. However, this approach suffers from numerous, well-known issues (including accurate period recovery at low signal to noise), and false peaks can easily be mistaken for real timescales, especially in highly-irregularly sampled data, as is the case for the WISE lightcurves. Soraisam et al. (2020) use
180 days. Therefore, we adopt 50 daysas the threshold for visits. Two stars in our sample areclose enough to the ecliptic pole to be observed nearlycontinuously such that we erroneously record two “vis-its”: one each during the cryogenic and post-cryogenicsurveys. However, neither star is strongly variable andthus this small edge case does not substantially impactour subsequent analyses.For all observations in a given visit, we calculatethe mean time and W W W − W scipy.interpolate.splrep in Python to findthe 3 rd -order B-spline representation of the binned hotometric Classifications of Evolved Massive Stars s=10 . Thisreturns the knots, B-spline coefficients, and degree ofthe spline. By definition, 3 rd -order splines are differen-tiable, so we use scipy.interpolate.splev to evaluatethe first derivative of the spline interpolant, and find thetimes when the derivative changes sign — i.e., whenthe lightcurve reaches a maximum or minimum. Asmetrics of the characteristic timescale of the lightcurve,we calculate the frequency of zero-crossings of the firstderivative of the spline interpolant, ν — calculatedas the number of times the derivative passes throughzero, divided by the time baseline of the lightcurve— (cid:104) ∆ t (cid:105) , the mean of the differences between succes-sive zero-crossings, and the standard deviation of thedifferences between successive zero-crossings, σ ∆ t . Forstars with fewer than four visits, we automatically as-sign ν = (cid:104) ∆ t (cid:105) = σ ∆ t = NaN . The right panels of Figure6 show this process on the W W MACHINE LEARNING4.1.
4. MACHINE LEARNING

4.1. Classifier Selection
The problem of classification based on broadband photometry has a rich history in the literature. With the advent of large surveys like the Sloan Digital Sky Survey (SDSS, York et al. 2000), optical data could be coupled with space-based MIR data to find the stellar locus in a 10-dimensional color-space (Davenport et al. 2014). Recent efforts to separate stars from quasars, or to perform a regression on effective temperature, with machine learning on photometric data have been successful (Makhija et al. 2019; Bai et al. 2019); however, these studies are often focused on main sequence, low mass stars. This is an understandable choice given the rarity of evolved, high mass stars, the absence of reliable distances to calculate luminosities from which to select putative massive stars, and the fact that follow-up spectroscopy is necessary in order to confirm a star's membership in many important classes.

With the advent of Gaia DR2, luminosities can be easily determined, and putative massive stars can be identified, as we do in §2. We wish to train an algorithm that takes as input the broadband photometry and variability metrics derived for our sample, and outputs spectral type classifications. Many machine learning classifiers exist; of these, we wish to choose a flexible model with well-understood mathematics, while avoiding techniques like neural networks that can be difficult
Table 3.
List of features passed to our machine learning classifiers, as well as clarifying definitions where relevant. WISE photometry used to calculate colors and magnitudes is from the ALLWISE data release (Cutri et al. 2013). All variability metrics are calculated from the WISE W1, W2, and W1−W2 lightcurves.

Feature | Definition
Colors & Magnitudes:
M_G | Absolute magnitude in the Gaia G band.
G−J | From Gaia and 2MASS photometry.
J−H | From 2MASS photometry.
H−K_s | From 2MASS photometry.
K_s−W1 | From 2MASS and WISE photometry.
W1−W2 | From WISE photometry.
W2−W3 | From WISE photometry.
W3−W4 | From WISE photometry.
M_W1 | Absolute magnitude in WISE W1.
Variability Metrics:
log χ²_red | Log of the reduced χ².
log EWM | Log of the error-weighted Median Absolute Deviation.
ν | Frequency of zero-crossings of the first derivative of the spline interpolant.
log ⟨Δt⟩ | Log of the average time between zero-crossings.
log σ_Δt | Log of the standard deviation of zero-crossing times.

to interpret. Of the classifiers available in the sklearn package (https://scikit-learn.org/stable/index.html), we decided to test a Random Forest (RF) classifier (Breiman 2001), which consists of a collection of decision trees trained on random subsets of samples and features; a Support Vector Machine (SVM) classifier (Cortes & Vapnik 1995), which identifies hyperplanes in the feature space that separate different classes; and a Gaussian process (GP) classifier (Rasmussen & Williams 2006). We refer the reader to these publications, as well as to the sklearn documentation, for the mathematics and implementation details of each classifier. In the multi-class case, a collection of classifiers is trained on each possible pair of classes (one-versus-one or "ovo"), generating a total of N_classes(N_classes − 1)/2 classifiers, where N_classes is the number of classes. Labels are assigned to test samples by allowing each classifier to vote, and the label with the most votes is chosen (Knerr et al. 1990).

Each type of classifier has a number of hyperparameters that affect the performance of the classifier. For the RF classifier, n_estimators specifies the number of trees in the forest, max_depth specifies how many branches each decision tree in the forest can have, and max_features specifies the maximum number of features each tree is trained on.
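As a sketch, the three classifier families can be instantiated from sklearn as follows. The hyperparameter values shown here are placeholders for illustration only; the actual values are selected via the cross-validation search described in the text:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier

# Hypothetical hyperparameter values for illustration; the real values
# come from the grid search over n_estimators, max_depth, C, gamma, etc.
rf = RandomForestClassifier(n_estimators=100, max_depth=20,
                            max_features="sqrt", class_weight="balanced")
svc = SVC(kernel="rbf", C=1.0, gamma="auto",   # gamma="auto" = 1/n_features
          class_weight="balanced",
          decision_function_shape="ovo",       # one-versus-one multi-class
          probability=True)
gp = GaussianProcessClassifier(multi_class="one_vs_one")
```

Each object exposes the same fit/predict interface, which is what makes the head-to-head comparison (and the voting ensembles discussed later) straightforward.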
We also set class_weight=balanced, which weighs samples when fitting to account for the different frequencies of each class in the data.

For the SVM classifier (SVC), C is a regularization parameter that governs the tradeoff between maximizing the margin and misclassifications in the training set. Higher values of C will force the SVC to correctly classify every point, resulting in poor generalization (i.e., overfitting). The SVC requires that the distance between two points in the feature space is defined as the inner product of two vectors in the feature space, ⟨X_i, X_j⟩. Because the boundaries between classes in our sample are not guaranteed to be linear, one can project the samples into a much higher dimension space via a mapping function, Φ, where distances between two vectors in this space are calculated as ⟨Z_i, Z_j⟩ = ⟨Φ(X_i), Φ(X_j)⟩. In reality, the transformed feature space can be incredibly high dimensional, and explicitly mapping the data into this high-dimensional space is computationally inefficient. Instead, we can adopt a kernel function, K, that defines distances in the higher dimensional space, e.g., K(X_i, X_j) = ⟨Z_i, Z_j⟩.
Because the kernel only takes the measured features as input and outputs a number, using a kernel function implicitly maps the input feature space into a high dimensional space without specifying Φ.

Common choices of the kernel function include a linear kernel (i.e., the Euclidean distance between individual samples in the feature space) and the "radial basis function" (RBF) kernel:

K(X_i, X_j) = e^(−γ ||X_i − X_j||²)    (6)

where γ governs the influence of the kernel function; lower values result in increasingly linear boundaries, while high values result in the decision function being entirely dependent on individual points, creating small islands of a given class centered on each training point. The advantage of non-linear models like an SVM with an RBF kernel over linear methods is that the decision boundaries can be much more flexible; the tradeoff is that the contribution of individual features to the classifier cannot be easily calculated without the unknown function Φ. As with the RF, we set class_weight=balanced for the SVC, which automatically sets the value of C for class i to C·N_samples/(N_classes·N_i), where N_samples is the size of the sample, N_classes is the number of classes, and N_i is the number of objects in the sample belonging to the class. This serves to weight rarer classes more heavily. The GP classifier's only hyperparameter is a choice of kernel, which defines the covariance function of the Gaussian process.

To fit our classifiers, we first remove all stars labeled as miscellaneous variables or unknown/candidates, as well as known binaries, and one star with bad J photometry, the Be star HD 53032. For features, we use the intrinsic calculated value of M_G, as well as (uncorrected for extinction) G−J, J−H, H−K_s, K_s−W1, W1−W2, W2−W3, W3−W4, and χ²_red, EWM, ν, ⟨Δt⟩, and σ_Δt in all three lightcurves; we indicate the WISE band or color that each variability metric corresponds to with a subscript hereafter. The input features and a brief description where relevant are listed in Table 3. Because χ²_red, EWM, ⟨Δt⟩, and σ_Δt have significant dynamic range, we use the base-10 logarithm of these features. Feature values as well as labels for each star in our sample are given in Table 4.

We now randomly split our sample into a training set with 70% of the samples, and a test set with the remaining 30%, using a stratification strategy to ensure the proportions of the classes in both sets are equal. The test set is withheld until we are ready to assess the performance of the chosen classifier. We then use sklearn.preprocessing.StandardScaler in Python to scale the training data such that each feature has zero mean and unit variance. Because the data have missing values, we then use sklearn.impute.IterativeImputer, which uses a Bayesian ridge regression to predict and replace missing values.

To test the accuracy of the imputer, we select only the rows from the training set with no missing data. For each feature with missing data (log χ²_red,W1−W2, log EWM_W1−W2, log σ_Δt,W1, log σ_Δt,W2, and log σ_Δt,W1−W2) we randomly choose 200 objects, replace the value of the feature with NaN for only these objects, transform the data using the scaler and imputer, and calculate the fractional error between the true value and the imputed value. The returned fractional errors for each feature centered around 0, and had a low scatter, with the exception of log EWM_W1−W2. However, this has little impact on the classifier, as only two objects in the actual training set have missing values for log EWM_W1−W2. We also repeated this procedure for coarse labels: we select 200 random objects with a given coarse label, replace a random feature from the list of features with missing data with NaN for each object, and again calculate the fractional error between the true and imputed values. We find that the imputer performs poorly on Cool and Contaminant stars.
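A minimal sketch of this preprocessing step, using a small hypothetical feature matrix in place of the actual training set (note that IterativeImputer is experimental in sklearn and must be explicitly enabled):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for the training features, with NaNs marking missing
# variability metrics. The scaler and imputer are fit on the training
# set only; the same fitted objects are later applied to the test set.
X_train = np.arange(30, dtype=float).reshape(10, 3)
X_train[2, 1] = np.nan
X_train[5, 2] = np.nan

scaler = StandardScaler()           # NaNs are ignored when fitting
imputer = IterativeImputer()        # Bayesian ridge regression by default

X_scaled = scaler.fit_transform(X_train)
X_imputed = imputer.fit_transform(X_scaled)
```

Applying the already-fit scaler and imputer to held-out data is then `imputer.transform(scaler.transform(X_test))`, which mirrors the train/test separation described above.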
Table 4. Feature values and assigned labels for all stars in our sample, ordered by Right Ascension. Missing values are indicated with "-".
Common Name | M_G [mag] | G−J [mag] | W1−W2 [mag] | log χ²_red,W1 | log ⟨Δt⟩_W1 [d] | Label | Coarse Label
HD 236270 | -3.54342624 | 0.3541 | 0.1420 | 0.389 | - | RSG | Cool
LS I +64 10 | -3.40866054 | 0.7154 | 0.0020 | - | - | OBA | Hot
LS I +60 69 | -3.22094985 | 0.8769 | -0.0340 | 0.061 | - | OBAe | EM
BD+62 2353 | -4.27358395 | 0.4653 | -0.0350 | 0.110 | 2.816 | Supergiant OBA | Hot
HD 73 | -3.29367442 | -0.6567 | -0.0520 | 3.157 | 2.701 | Evolved OBA | Hot
HD 240496 | -3.84380534 | 1.0127 | 0.0130 | -0.064 | 2.764 | OBA | Hot
WISE J000559.28-790653.3 | -5.32554735 | 1.3391 | -0.0390 | 0.072 | 3.162 | Unknown/Candidate | Unknown/Candidate
LS I +59 30 | -3.34464936 | 0.6748 | -0.0430 | 0.234 | - | OBA | Hot
BD+57 2870 | -4.53103041 | 1.1711 | 0.0040 | 0.046 | 3.135 | Supergiant OBA | Hot
BD+62 1 | -3.31029252 | 0.9010 | 0.2660 | 1.622 | 2.813 | Evolved OBA | Hot

Note: This table is published in its entirety in the machine-readable format. A portion is shown here for guidance regarding its form and content.

Given that all of the features with missing data are linked to variability, that a significant fraction of red supergiants display high amplitude variability (Conroy et al. 2018), that the MIR variability of AGB stars (which make up the bulk of the Contaminants) is higher than that of RSGs in a given magnitude range (Yang et al. 2018), and that our sample of Cool and Contaminant stars contains objects in quite different evolutionary states that nevertheless have similar colors and magnitudes, it is unsurprising that the imputer is unable to predict the variability properties of these stars; we return to this point in a later section.

The scaled and imputed data are then passed to a sklearn classifier object (e.g., sklearn.svm.SVC). To settle on the best values for the hyperparameters we use sklearn.model_selection.GridSearchCV to perform a cross-validation search on a grid of hyperparameters, using a stratified K-fold strategy with k = 5 to ensure that each fold has a representative distribution of classes.
For the RF, we search for n_estimators between 10 and 150 in steps of 10, max_depth between 10 and 100 in steps of 10, and allow max_features to be either sqrt, log2, or None (where the maximum number of features individual trees are trained on is the square root of, base-2 logarithm of, or equal to the number of features; see the documentation for details). We also allow max_depth to take on the default value (None), such that individual trees can be grown until each leaf only contains one sample.

For the SVC, we search for values of C on a logarithmic grid with 1 dex spacing between 0.01 and 100, and for the RBF kernel, we search for values of γ on a similar grid between 0.01 and 10. Additionally, we allow γ to take its default value of 1/n_features (where n_features is the number of features). For the GP classifier, we only vary the kernel, as the GaussianProcessClassifier object automatically optimizes the kernel hyperparameters. We let the kernel be either linear, RBF, or the default (a special case of the RBF kernel with the length scale equal to 1).

Each classifier object has a default method to score each set of hyperparameters, e.g., the accuracy of predicted labels compared to true labels. However, the classes in our training set are unbalanced (e.g., Figure 3), so inaccurately classifying every single LBV, for example, would have little impact on the overall accuracy of the classifier. To account for this, we instead use the balanced accuracy (Mosley 2013; Guyon et al. 2015), which weighs each sample by the frequency of that sample's class in the training set. Other options for scoring criteria exist, including some that help maximize the classifier's precision, such as the weighted F1 score and Cohen's kappa (Cohen 1960). We experimented with using these scores, and found that using the balanced accuracy minimizes misclassifications across all classes (reflected in the diagonal in the left panels of Figures 9 and 12). Note that this choice implicitly selects a classifier that performs well across all classes, and is not optimized for specific classes. Future work will explore the possibility of tuning a classifier to find specific classes of rare stars.

Finally, we explore three variations of a voting classifier. Such a classifier consists of an ensemble of individual classifiers, each of which "votes" by assigning a class to a given sample.
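The SVC grid search described above can be sketched as follows. The grid values are the ones quoted in the text, and the balanced accuracy is requested via the scoring argument; X_train and y_train would be the scaled and imputed training set:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Logarithmic grids with 1 dex spacing, as described in the text;
# gamma="auto" is sklearn's name for the 1/n_features default.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1.0, 10.0, "auto"],
}
search = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid,
    scoring="balanced_accuracy",          # class-frequency-weighted accuracy
    cv=StratifiedKFold(n_splits=5),       # each fold keeps the class mix
)
```

Calling `search.fit(X_train, y_train)` then evaluates every combination across the five stratified folds and exposes the winner as `search.best_params_` and `search.best_score_`.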
The final assigned class can either be chosen with a "hard" (the class with the most votes wins) or "soft" (class assignments are weighted by the probabilities output by each classifier) strategy. We construct two voting classifiers that each use a different voting strategy, using RF, SVC, and GP classifiers as the individual components. We refer to these as the Voting (Hard) and Voting (Soft) classifiers. We also make a third voting classifier that also uses a soft voting strategy, but with the votes from each component classifier weighted by the balanced accuracy determined via cross-validation. We refer to this as the Voting (Weighted) classifier. We score each voting classifier by averaging the balanced accuracy taken from five stratified folds of the data. Figure 8 shows the balanced accuracy for the three optimized classifiers, as well as the three voting classifiers; the SVC performs "best." Both the Voting (Soft) and Voting (Weighted) classifiers perform comparably with the worst classifier, the GP. This is due to the fact that, while the SVC often selects one individual class with high probability, both the RF and GP tend to select multiple classes with high probability (with the GP sometimes selecting all classes with roughly equal probability, usually slightly favoring the classes selected by the RF). This can result in both the RF and GP voting for the wrong class with higher probability than the correct vote from the SVC, leading to the poor observed performance.

Figure 8.
Balanced accuracy for each optimized classifier,averaged over five foldings of the data. The three voting clas-sifiers perform roughly identically, with the Voting (Hard)classifier performing best.
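The three voting schemes can be sketched with sklearn's VotingClassifier. The component hyperparameters and the example weights below are placeholders (in the text, the weights are the cross-validated balanced accuracies of the components):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier

# Component classifiers; the SVC needs probability=True so that it can
# contribute class probabilities to the soft-voting schemes.
components = [
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
    ("svc", SVC(class_weight="balanced", probability=True, random_state=0)),
    ("gp", GaussianProcessClassifier()),
]
hard = VotingClassifier(components, voting="hard")
soft = VotingClassifier(components, voting="soft")
# Weighted variant: soft voting with each component's probabilities scaled
# by a per-classifier weight (placeholder values shown here).
weighted = VotingClassifier(components, voting="soft", weights=[0.8, 0.9, 0.7])
```

Because soft voting averages the components' probability vectors, two confidently wrong components can outvote one confidently right one, which is the failure mode described above for the RF+GP combination.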
SVC Performance
The procedure above results in values for the SVC hyperparameters of kernel = linear and C = 0.01. With these hyperparameters, we fit the SVC to the training set, use the StandardScaler and IterativeImputer that were previously fit to the training set to transform the test set, and use the SVC to predict the labels of the test set. The left-hand panel of Figure 9 shows the raw number of stars in the test set with the true label given on the y-axis, and the predicted label given on the x-axis. The center and right-hand panels show this matrix, where each row/column is normalized by the total number of stars in that row/column, yielding the confusion/efficiency matrices, respectively. The i, j entry in the confusion matrix corresponds to the fraction of objects in the test set belonging to class i (shown on the y-axis) that are assigned class j (shown on the x-axis). Entries along the diagonal are the completeness (also called the recall in some contexts), i.e., the percentage of a given class that is accurately recovered by the classifier. The i, j entry in the efficiency matrix is the fraction of objects in the test set classified as j that belong to class i. Entries along the diagonal are equivalent to the precision (equivalent to one minus the contamination), i.e., the percentage of an observed class that is made up of true members of that class. Figure 10 shows the completeness versus the contamination for each class. Completeness is the diagonal of the corresponding row/column of the confusion matrix, and contamination is one minus the diagonal of the corresponding row in the efficiency matrix.

The SVC performs poorly on non-supergiant OBA stars. This is perhaps unsurprising given that both the observed colors and interior structures of OBA stars as they evolve from the zero-age main sequence (ZAMS) to the terminal-age main sequence (TAMS) do not change drastically compared to the much more evolved states that we also consider. The classifier classifies OBAe stars with somewhat lower contamination compared to main sequence and evolved OBA stars, though with comparably low completeness.
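The row and column normalizations described here can be computed directly from a raw count matrix; the toy labels below are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Row-normalizing the raw counts gives the confusion matrix (diagonal =
# completeness/recall); column-normalizing gives the efficiency matrix
# (diagonal = precision = 1 - contamination).
y_true = ["Hot", "Hot", "Hot", "EM", "EM", "Cool"]
y_pred = ["Hot", "Hot", "EM", "EM", "EM", "Cool"]
labels = ["Hot", "EM", "Cool"]

counts = confusion_matrix(y_true, y_pred, labels=labels)
confusion = counts / counts.sum(axis=1, keepdims=True)   # rows sum to 1
efficiency = counts / counts.sum(axis=0, keepdims=True)  # columns sum to 1

completeness = np.diag(confusion)
contamination = 1.0 - np.diag(efficiency)
```

In this toy case one of three true Hot stars is lost to the EM class, so the Hot completeness is 2/3 while the EM precision drops to 2/3 (contamination 1/3).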
True OBAe stars are either misclassified as other types of OBA star or as WRs, while stars falsely labeled as OBAe are mostly true OBA stars, with the exception of one true WR. 75% of WR stars are recovered, but only 6/30 stars identified as WRs in the test set are true WRs; given the importance of WRs for both the physics of mass loss and studying evolved massive stellar populations (Dorn-Wallenstein & Levesque 2018, 2020), future work will focus on developing a classifier specifically for identifying WR stars. All LBVs in the test set are recovered; while such high accuracy is often seen as a sign of overfitting, we choose not to focus on this subclass, given both the disputed evolutionary status of LBVs (Smith & Tombleson 2015; Humphreys et al. 2016; Aadland et al. 2018) and the fact that only two LBVs exist in the training set. Yellow supergiants are only classified with 27% accuracy.
Figure 9.
Left : Matrix showing the number of stars in the test set with true label indicated on the y-axis that are assignedthe label on the x-axis.
Center : Confusion matrix for the SVC, calculated by normalizing each row of the left panel by thetotal number of stars in that row. Values correspond to the fraction of samples in the test set with true label indicated on they-axis that are assigned the label on the x-axis, such that the values along the diagonal are the fraction of each class that iscorrectly classified.
Right: Efficiency matrix for the SVC, calculated by normalizing each column of the left panel by the total number of stars in that column. Values in each box correspond to the fraction of samples in the test set assigned the label on the x-axis that belong to the class on the y-axis, such that the values along the diagonal correspond to the precision (one minus the contamination). Darker colors in all panels correspond to more/a higher fraction of stars.
Figure 10.
Completeness versus contamination of each classin the test set. A high completeness value implies members ofthat class are accurately classified, while a low contaminationvalue implies that an object classified as such is likely tobelong to that class. The figure is roughly divided into fourquadrants; stars with classes in the bottom-right quadrantcan be considered to be well-classified, in the sense that theyhave high completeness and low contamination.
Overall, an SVC trained on these refined labels appears to have little use. With the exception of RSGs and low-mass giants, the remaining classes have low accuracy, high contamination, or both. We nonetheless use the SVC to predict labels for the 2550 stars initially labeled as "Miscellaneous Variable" or "Unknown/Candidate." We identify 79 candidate RSGs and 36 candidate C/S/Giant stars, of which we expect ∼

Table 5.
Common names and coordinates of stars predictedto be RSGs by the SVC trained on refined labels.
Common Name | R.A. [deg] | Dec [deg]
SP77 48-11 | 81.07900941 | -70.43417562
WISE J185608.58-163255.1 | 284.03575762 | -16.54867009
W61 19-14 | 83.07777354 | -67.52941938
OGLE BRIGHT-LMC-MISC-169 | 72.94711333 | -69.32348227
WISE J194127.64+385155.3 | 295.36520609 | 38.86536427
NGC 2004 BBBC 431 | 82.69161818 | -67.29036242
[KWV2015] J045626.51-692350.6 | 74.11062177 | -69.39740804
W61 6-54 | 85.54018822 | -69.21978048
WISE J064232.30-715243.3 | 100.63462144 | -71.87871874
W61 6-34 | 85.51621031 | -69.21870016
Note —This table is published in its entirety in the machine-readableformat. A portion is shown here for guidance regarding its formand content.
Performance on Coarse Labelling
Examining Figures 9 and 10, we see that, while the classifier is not especially accurate except for the coolest stars, it is roughly useful for sorting the test set into broad categories: different types of OBA stars are mostly (mis)classified as other classes of OBA stars; the same is true for emission line stars (OBAe, OB[e], WR, and LBV) and cool stars (YSG, RSG, C/S/Giant). For this reason, we also utilize the coarse labels introduced in §2. We repeat the entire process described above, beginning with the selection of the classifier. Figure 11 shows the balanced accuracy for each of the classifiers discussed above, trained on the coarse labels, using a five-fold cross-validation to optimize the hyperparameters of each classifier. We find that, once again, the SVC yields the highest balanced accuracy (0.876).

Figure 11.
Similar to Figure 8 for classifiers trained on thecoarse labels.
We keep the same scaled and imputed training and testing sets, and perform a five-fold cross-validation as before to find the optimal hyperparameters, which are kernel=rbf, C = 1, and gamma = 1/n_features. With these hyperparameters, we fit the SVC to the training set before predicting labels for the test set. Figure 12 shows the confusion and efficiency matrices similar to Figure 9, while Figure 13 shows the completeness versus the contamination similar to Figure 10. All told, the SVC performs significantly better compared to the classifier trained on the refined labels, recovering all classes with ≥75% completeness and ≤30% contamination.

Of the emission line stars that are correctly identified, 83 are OBAe stars, three are OB[e] stars, eight are WRs, and two are LBVs. This is 73%, 75%, 100%, and 100%, respectively, of these stars that are in the test set, implying that the performance of the SVC on emission line stars is not dominated entirely by OBAe stars (which comprise the majority of emission line stars in the test set). Of the stars mislabeled as contaminants, two are OBA stars and two are RSGs. One true C/S/Giant star and two Yellow Dwarfs are misclassified.

We then use the SVC to predict the coarse labels for the same 2550 stars as above. Figure 14 shows the distribution of these predicted labels. The majority (2472 stars) are labeled as Hot. 63 stars are labeled as Cool, three of which are already identified in SIMBAD as candidate AGBs or RGBs. 14 of these stars are labeled as emission line stars, of which 9-10 are likely to actually be emission line stars, assuming 30% contamination. We list all 2550 stars' common names, coordinates, and predicted coarse labels in Table A1.

4.3.1.
Feature Importance
We can also identify which features contribute most to the overall performance of the classifier on the coarse labels. To do this, we initialize a new SVC object with the same hyperparameters, and perform a greedy search over features. For each feature in the scaled and imputed training set, we train the SVC on just this feature across five stratified folds of the training set, and record the average and standard deviation of the balanced accuracy. We select the feature that yields the highest average balanced accuracy. This has the advantage of ensuring the contribution of each feature to the balanced accuracy is stable across subsets of the data. We then train the SVC on all combinations of this feature and the remaining features, selecting the combination that again yields the highest average balanced accuracy. This process is repeated until all features are used.

Figure 15 illustrates this process. The x-axis shows the feature that is selected at each stage of the greedy search. The y-axis shows the mean balanced accuracy of the SVC at that stage. Errorbars show the standard deviation of the balanced accuracy across the five folds of the training set. The balanced accuracy reaches a maximum after the first seven features: J−H, W1−W2, M_W1, K_s−W1, log EWM_W1, log χ²_red,W1, and log EWM_W1−W2. However, the contribution to the balanced accuracy from all but the first four features is small. This suggests that variability amplitude is a useful, though not critical, metric to obtain, while variability timescales are not necessary. Finally, this suggests that photometry bluer than the J band is also unnecessary.

We can also examine the importance of each feature for classifying individual classes. We perform the same greedy search over the features, instead calculating a performance metric that focuses on the performance on a specific class. One option is the F_β measure:

F_β = (1 + β²) (completeness · precision) / (β² · completeness + precision)    (7)

where β is a free parameter that sets the relative importance of completeness compared to precision. Common choices are β = 1 (i.e., F1, a harmonic mean of completeness and precision), β = 0.5, and β = 2 (Chinchor 1992). We adopt F2 (i.e., β = 2), because we prioritize generating complete samples of rare massive stars.

The left panels of Figure 16 show the F2 measure as a function of successively added features, calculated specifically for hot stars (top), emission line stars (second panel), cool stars (third panel), and contaminants (bottom). The results for both hot and cool stars are mostly similar to the results for the overall classifier in Figure 15, in the sense that the best performance is reached after including a mix of near- and mid-infrared
Figure 12, left panel: counts of test-set stars by true label (rows) versus predicted label (columns), in the order Hot, EM, Cool, Contaminant:
Hot: 626, 40, 4, 2
EM: 21, 96, 10, 0
Cool: 26, 2, 286, 2
Contaminant: 1, 0, 2, 34
Figure 12.
Similar to Figure 9 for the SVC trained using the coarse labels. Note that significantly more stars fall along the diagonal of each plot, reflecting the improved performance of the SVC on the coarse labels.
Figure 13.
Similar to Figure 10 for the SVC trained using the coarse labels. All coarse classes have high completeness and low contamination.

Figure 14.
Distribution of coarse labels assigned to the 2550 stars with no previously known class.

colors and magnitudes. The main difference is that M_G is the fourth most important feature for classifying hot stars.

For emission line stars, a maximum in the mean F2 is reached after 11 features: W1−W2, M_W1, G−J, log ⟨Δt⟩_W1, log ⟨Δt⟩_W2, log σ_Δt,W1, W2−W3, W3−W4, K_s−W1, ν_W1, and J−H. However, given the errorbars, only the first three features contribute meaningfully, with the remaining features consistent with a constant value of F2. Interestingly, compared to its contribution to the overall balanced accuracy of the classifier, bluer photometry (signified by the presence of G−J in the above list) is much more important for identifying emission line stars. While variability metrics are included in the above list, they do not significantly contribute to the F2 score.

For contaminants, a total of 15 features are required in order to maximize F2: J−H, log σ_Δt,W1−W2, log σ_Δt,W1, log ⟨Δt⟩_W1, log ⟨Δt⟩_W1−W2, ν_W1, log σ_Δt,W2, ν_W1−W2, M_W1, H−K_s, K_s−W1, W1−W2, log χ²_red,W1−W2, log EWM_W1−W2, and log χ²_red,W1. Notably, the F2 measure first decreases as features are added, before increasing to the maximum after M_W1. This trend is unintuitive compared to the other panels in the Figure. It may be a result of the fact that increased features improve the precision of the classifier at the cost of completeness, resulting in a decrease in F2 due to the increased weighting of completeness.

To demonstrate the capabilities of the classifier using a limited set of features, we plot the scaled and imputed test set (which was not used in the greedy search algorithm) in the right panels of Figure 16, using only the two most important features in each row. Stars belonging to the corresponding coarse class are plotted as larger, colored points. In all cases, most members of the test set with that label are well-separated from the other stars. As expected from the left panels, most of the separation is along the x-axis, which corresponds to the most important feature, with the second-most important feature plotted on the y-axis providing some additional differentiation, especially for hot stars. In the case of contaminants, the second-most important feature provides little-to-no additional information, consistent with the lack of change in F2 with increased features.

Figure 15.
Mean balanced accuracy of the SVC for coarse labels trained on successively added features, calculated from five stratified folds of the data. The balanced accuracy reaches a maximum after the first seven features. Error bars indicate the standard deviation of the balanced accuracy across folds.
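The procedure behind this figure (greedily adding one feature at a time and scoring each candidate set by the mean balanced accuracy over five stratified folds) can be sketched with scikit-learn, which this work uses. The synthetic feature table, RBF kernel, and median imputation below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the photometric and variability feature table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def score_subset(cols):
    """Mean balanced accuracy over five stratified folds, with
    imputation and scaling refit inside each training fold."""
    model = make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler(),
                          SVC(kernel="rbf", class_weight="balanced"))
    return cross_val_score(model, X[:, cols], y, cv=cv,
                           scoring="balanced_accuracy").mean()

selected, remaining, history = [], list(range(X.shape[1])), []
while remaining:
    # Greedy step: add the single feature that maximizes the CV score.
    best = max(remaining, key=lambda f: score_subset(selected + [f]))
    selected.append(best)
    remaining.remove(best)
    history.append(score_subset(selected))

print(selected)   # feature indices in order of addition
print(history)    # mean balanced accuracy after each addition
```

The `history` list traces out a curve like the one in the figure; in practice one stops adding features once the score plateaus within its fold-to-fold scatter.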
We conclude that while small numbers of features can be used to identify hot, cool, and (remarkably) emission line stars, a large number of features is necessary to maximize the number of accurately identified contaminants. These include time-domain features, in which the drastically different structures of old AGB and RGB stars compared to massive cool supergiants may be imprinted. Extragalactic massive star samples are already contaminated by foreground giants in the Milky Way halo; distant stars that can be resolved by JWST and
Roman will have comparable brightnesses to cool dwarfs that are too faint to be filtered out using astrometry from
Gaia. Developing the infrastructure to reliably remove these contaminating objects from massive star samples will be that much more critical.

DISCUSSION & CONCLUSION

In the coming decades, space-based infrared observatories like
Webb and
Roman will give us access to unprecedentedly large samples of evolved massive stars. Therefore, we need to be prepared to leverage these data to search for stars in the most interesting evolutionary states. Obtaining spectroscopy of individual stars does not scale well to the size of the expected samples, while linear cuts in color-magnitude space are too simplistic and ignore emission line objects. Here we have demonstrated the promising performance of a support vector machine trained on optical through mid-infrared photometry and simple variability metrics. However, with currently available labels, we are not able to construct a classifier that performs well at the level of granularity needed for many science cases. Our main results are summarized as follows:

• We have assembled a large sample of evolved massive stars using distances from
Gaia
DR2 and Bailer-Jones et al. (2018), with high precision infrared photometry from
Gaia, 2MASS, and WISE.

• Using SIMBAD, we assign labels to all stars, and find that the sample contains a number of low mass contaminants.

• We find that, of the classification methods we applied, a support vector machine classifier (SVC) is best at accurately labeling evolved massive stars. The SVC is fast, and has the added benefit that the underlying mathematics are well-understood.

• The SVC trained on refined labels is capable of identifying low mass red giant contaminants with high accuracy. However, the overall performance of this classifier is quite poor, and we do not recommend its use at present.

• The SVC trained using coarse spectral types performs better, as measured with the balanced accuracy score. We find higher completeness and lower contamination (Figures 12 and 13) compared to the SVC trained on the refined labels (Figures 9 and 13). With this classifier, we identify 14 candidate emission line stars from a sample of ∼

• We find that the SVC performs equally well with only a small subset of features. These features are mostly infrared colors and absolute magnitudes — i.e., those least affected by reddening — with small contributions from infrared variability metrics. However, if we change our performance
Figure 16.
Left panels are similar to Figure 15, except using the F measure calculated for hot stars (top), emission line stars (second panel), cool stars (third panel), and contaminants (bottom). The right panel in each row shows a scatter plot of only the first and second most important features (indicated with blue and red text, respectively) drawn from the test set, with stars belonging to the corresponding class in each row highlighted. Note that the features plotted are the scaled and imputed values, not the original values listed in Table 4.

metric to one that focuses on emission line stars, optimal performance of the classifier requires some red-optical photometry. We find that the added benefit of using variability metrics may not be worth the investment in telescope time required to measure them.

Ultimately, the performance of the SVC trained on the refined labels is poor. All stars in the sample are bright (W <

and more that can be used to accurately classify massive stars. In order to prepare ourselves for the era of
Webb and
Roman, we must develop pipelines specifically tuned for evolved massive stars. This is especially true for the classes that are underrepresented in our dataset, i.e., rare emission line stars.

Along with better labels, more data will become available via future data releases of the
Gaia mission. The recent early third
Gaia release contains modest improvements in precision and sample size that are unlikely to affect our results given the high quality of the photometry in our sample. However, the full
Gaia
DR3 will contain low-resolution spectra as well as epoch photometry for a limited number of sources, which have the potential to significantly improve the performance of a machine learning classifier. On the horizon, the Legacy Survey of Space and Time (LSST) conducted at the Vera Rubin Observatory will measure the multi-color variability of massive stars at higher cadence, while its large telescope aperture will help define a much larger sample. As we demonstrate with Figure 16, it is possible to select features that maximize the performance of the SVC for specific classes. With a larger sample, we may be able to optimize the SVC to search for specific classes of evolved massive stars.

ACKNOWLEDGMENTS

The authors acknowledge that the work presented was largely conducted on the traditional land of the first people of Seattle, the Duwamish People past and present, and honor with gratitude the land itself and the Duwamish Tribe.

This research was supported by NSF grant AST 1714285 awarded to EML. JRAD and DH acknowledge support from the DiRAC Institute in the Department of Astronomy at the University of Washington. The DiRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, and the Washington Research Foundation.

DH is supported by the Women In Science Excel (WISE) programme of the Netherlands Organisation for Scientific Research (NWO).

This project was developed in part at the 2018 Gaia Sprint, hosted by the eScience and DiRAC Institutes at the University of Washington, Seattle.

This research has made use of the VizieR catalogue access tool, CDS, Strasbourg, France (DOI: 10.26093/cds/vizier). The original description of the VizieR service was published in A&AS 143, 23. This research has made use of the SIMBAD database, operated at CDS, Strasbourg, France.
This publication makes use of data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation.

This work made use of the following software:
Software:
Astropy v3.2.2 (Astropy Collaboration et al. 2013; The Astropy Collaboration et al. 2018), Astroquery v0.3.10 (Ginsburg et al. 2019), Matplotlib v3.1.1 (Hunter 2007), makecite (Price-Whelan et al. 2018), NumPy v1.17.2 (Van Der Walt et al. 2011), Pandas v0.25.1 (McKinney 2010), Python 3.7.4, Scikit-learn v0.21.3 (Pedregosa et al. 2011), Scipy v1.3.1 (Jones et al. 2001)

APPENDIX

A. COARSE LABELS FOR 2550 STARS

Table A1 lists all 2550 stars with no known label, and predicted labels generated by the SVC trained on coarse labels.

REFERENCES
Aadland, E., Massey, P., Neugent, K. F., & Drout, M. R. 2018, AJ, 156, 294, doi: 10.3847/1538-3881/aaeb96

Ardeberg, A., Brunet, J. P., Maurice, E., & Prevot, L. 1972, Astronomy and Astrophysics Supplement Series, 6, 249

Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33, doi: 10.1051/0004-6361/201322068

Table A1.
Common names, coordinates, and predicted labels of 2550 stars input to the SVC trained on coarse labels, ordered by Right Ascension.
Common Name                R.A. [deg]   Dec [deg]     Predicted Coarse Label
WISE J000559.28-790653.3   1.49713706   -79.11483482  Hot
TYC 4500-1480-1            2.86210879    79.08686958  Hot
BD+61 45                   5.25504760    62.77064970  Hot
NGC 104 LEE 2520           5.41170226   -72.21106679  Hot
WISE J002203.44-693554.7   5.51434821   -69.59851087  Hot
WISE J002207.43-742212.1   5.53102165   -74.37003199  Hot
WISE J002318.05-742326.4   5.82523611   -74.39068759  Hot
WISE J002340.20-750446.9   5.91756693   -75.07972556  Hot
WISE J002758.92-764527.2   6.99552600   -76.75757402  Hot
WISE J002759.32-742119.8   6.99734043   -74.35552728  Hot
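A machine-readable version of Table A1 can be filtered for follow-up candidates with pandas, which this work uses. A minimal sketch, using the first few rows shown above; the column names are assumptions, since the machine-readable table's exact schema is not reproduced here:

```python
import pandas as pd

# First rows of Table A1 as shown above; column names are assumed.
table_a1 = pd.DataFrame({
    "common_name": ["WISE J000559.28-790653.3", "TYC 4500-1480-1",
                    "BD+61 45"],
    "ra_deg": [1.49713706, 2.86210879, 5.25504760],
    "dec_deg": [-79.11483482, 79.08686958, 62.77064970],
    "predicted_label": ["Hot", "Hot", "Hot"],
})

# Select predicted hot stars in the southern hemisphere, sorted by R.A.
south_hot = (table_a1[(table_a1["predicted_label"] == "Hot")
                      & (table_a1["dec_deg"] < 0)]
             .sort_values("ra_deg"))
print(south_hot["common_name"].tolist())
# → ['WISE J000559.28-790653.3']
```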