Photometric Classifications of Evolved Massive Stars: Preparing for the Era of Webb and Roman with Machine Learning
Trevor Z. Dorn-Wallenstein, James R.A. Davenport, Daniela Huppenkothen, Emily M. Levesque
Draft version February 8, 2021
Typeset using LaTeX twocolumn style in AASTeX63
University of Washington Astronomy Department, Physics and Astronomy Building, 3910 15th Ave NE, Seattle, WA 98105, USA
SRON Netherlands Institute for Space Research, Sorbonnelaan 3, 3584 CA Utrecht, The Netherlands
DIRAC Institute, Department of Astronomy, University of Washington, 3910 15th Ave NE, Seattle, WA 98105, USA
The University of Washington eScience Institute, The Washington Research Foundation Data Science Studio, University of Washington, Seattle, WA 98105, USA
(Received; Revised; Accepted)
Submitted to ApJ

ABSTRACT

In the coming years, next-generation space-based infrared observatories will significantly increase our samples of rare massive stars, representing a tremendous opportunity to leverage modern statistical tools and methods to test massive stellar evolution in entirely new environments. Such work is only possible if the observed objects can be reliably classified. Spectroscopic observations are infeasible for more distant targets, and so we wish to determine whether machine learning methods can classify massive stars using broadband infrared photometry. We find that a Support Vector Machine classifier is capable of coarsely classifying massive stars with labels corresponding to hot, cool, and emission line stars with high accuracy, while rejecting contaminating low mass giants. Remarkably, 76% of emission line stars can be recovered without the need for narrowband or spectroscopic observations. We classify a sample of ∼

Corresponding author: Trevor Z. Dorn-Wallenstein ([email protected])

INTRODUCTION

Evolved massive stars are observed in a menagerie of exotic evolutionary phases. While the challenge of connecting these states with a self-consistent theory of stellar evolution has seen rapid advancement since the original introduction of the “Conti Scenario” (Conti et al. 1983), the effects of rotation, magnetic fields, internal mixing processes, and binary interactions on the evolution of massive stars are still the subject of much theoretical effort (e.g., Ekström et al. 2012; Eldridge et al. 2017). While individual massive stars can be used as precision probes of these processes, ensembles of evolved massive stars can also significantly constrain stellar evolution. This can be done by comparing the integrated spectra of massive stars (e.g. Levesque et al. 2012), or by studying the detailed makeup of resolved populations of massive stars (Dorn-Wallenstein & Levesque 2018, 2020; Stanway et al. 2020).

The latter approach requires large and accurately-classified samples of evolved massive stars that will be achievable in the coming years with the launch of the James Webb Space Telescope (
Webb) and the Nancy Grace Roman Space Telescope (
Roman). Among the instrumentation on
Webb and the proposed instrumentation for
Roman are photometers equipped with filters spanning a broad wavelength baseline from 0.5 to 28 µm. The resolution of Webb will allow us to identify and study in detail individual luminous stars to great distances (e.g. Jones et al. 2017), while the impressive 0.218 deg² field of view of Roman will allow us to efficiently survey nearby galaxies in a small number of pointings (Spergel et al. 2013). Combined, observations from both missions will give astronomers access to precise infrared measurements of vast numbers of evolved massive stars. But without sophisticated methods of identifying and classifying these stars, the science return afforded by such a large increase in expected sample sizes will be significantly reduced.

Classification of stars from broadband photometry is often done by adopting simple linear cuts in color-magnitude space (e.g., Massey et al. 2006, 2009); most critically, such cuts do not include rare emission line objects, whose classification requires dedicated narrowband surveys (sometimes with custom-designed filters, e.g. Neugent et al. 2018b), often accompanied by follow-up spectroscopy, both of which require extensive telescope time. These objects are often the post-main sequence evolved states of massive stars, in which the effects of rotation, binary interactions, and chemical mixing are the most pronounced; the stars that place the most valuable constraints on unknown stellar physics are also the hardest to detect via traditional means. Therefore, it is worthwhile to determine whether there are alternative ways to classify massive stars that avoid traditional and expensive methods.

At present, we can mimic the observing capabilities of
Webb and
Roman by combining data from
Gaia (which has a red-optical bandpass), the Two Micron All Sky Survey (2MASS, Skrutskie et al. 2006, near-infrared), and the Wide Field Infrared Survey Explorer (WISE, Wright et al. 2010, mid-infrared). WISE provides the additional benefit of having scanned the sky approximately every six months, yielding lightcurves spanning a ∼ baseline. Since Roman and
Webb will not be observing the entire sky in this fashion, determining whether variability can aid in the classification of evolved massive stars will determine whether observers should seek repeated observations of a stellar population.

We wish to determine whether we can:

1. Assemble a sample of evolved massive stars with available classifications as a training data set,

2. Construct a machine learning classifier that can reject low mass red contaminants and identify likely emission line objects in order to optimise available telescope time on the most promising targets,

3. Determine whether variability metrics estimated from WISE lightcurves can aid in these tasks, and

4. Determine which photometric bandpasses and variability metrics contribute the most to making accurate classifications.

Here we utilize a support vector machine classifier (SVC) trained only on broad-band photometry and simple metrics derived from WISE lightcurves to classify a large sample of evolved massive stars. We describe our sample selection and labeling method in § .

SAMPLE SELECTION & LABELING

For any machine learning algorithm, a high-quality training set with accurate labels is necessary. The second data release (DR2) of the
Gaia mission (Gaia Collaboration et al. 2018a) contains precise photometry in three bands (G, G_BP, and G_RP) and geometric parallaxes (ϖ) for 1.3 billion stars in the Galaxy and Magellanic Clouds. Because the parallax measurements suffer from some systematics (Lindegren et al. 2018), and many objects have high fractional errors (σ_ϖ/ϖ) or negative measured parallax, Bailer-Jones et al. (2018) calculated Bayesian distance estimates for the majority of stars in Gaia
DR2, using a prior based on the spatial distribution of stars in the Galaxy. Figure 1 shows the difference between the distance inferred by Bailer-Jones et al. (2018), r_est, and a naive distance derived by inverting the reported Gaia measurements of ϖ for ∼ stars, r_est = 1/ϖ. While the two distance estimates are roughly consistent for nearby stars, more distant stars are biased much further away in the naive distance estimates.

We cross-match Gaia
DR2 and the ALLWISE data release. ALLWISE (Cutri et al. 2013) contains photometry in four mid-infrared (MIR) bands — W1 (3.4 µm), W2 (4.6 µm), W3 (12 µm), and W4 (22 µm) — derived from co-added images obtained during the original WISE mission. We select stars with good distance estimates (result_flag = 1 if the distance estimate is the mode of the posterior distribution, 2 if it is the median, and 0 for a failed estimate; see Bailer-Jones et al. 2018 for more details) that satisfy

M_G = G − 5 log₁₀(r_est) + 5 ≤ − . , W1 < .    (1)

Since Bailer-Jones et al. (2018) used a Galactic prior, stars in the Magellanic Clouds have distances that are considerably underestimated. Thus, we also match the catalog in Gaia Collaboration et al. (2018b) to the ALLWISE/Gaia cross-match, and select stars with W1 <
14 and M_G ≤ − .5, assuming distance moduli of 19.05/18.52 for the SMC/LMC respectively (Kovács 2000a,b) and combining the two cross-matches while dropping duplicate stars. This results in a total of 452,283 stars.

Figure 1.
Distance from Bailer-Jones et al. (2018) versus distance inferred via inverting the reported ϖ from
Gaia
DR2 for ∼ stars, r_est = 1/ϖ.

We then estimate the reddening in the
Gaia bandpasses using the published estimate for A_G from Gaia
DR2, and coefficients from Malhan et al. (2018) to calculate E(G_BP − G_RP). For Galactic stars without A_G estimates, we assume A_G = 0, and for stars in the Magellanic Clouds, we assume the average value of A_G and E(G_BP − G_RP) using R_V measurements from Gordon et al. (2003) and E(B − V) from Massey et al. (2007). Using these quantities, we calculate the intrinsic G_BP − G_RP and M_G for all stars.

We can then construct color-magnitude diagrams (CMDs) in the Gaia filters, which we can use to select massive stars — i.e., stars with initial mass M_i ≥ M⊙. We use the MESA Isochrones & Stellar Tracks (MIST, Dotter 2016; Choi et al. 2016; Paxton et al. 2011, 2013, 2015) isochrones with metallicity [Fe/H] = 0, − . , − and v/v_crit = 0.4. We then selected the faintest isochrone point of any age between 10 and 10 . yr in steps of 0.05 dex with M_i ≥ M⊙ in 100 equally-spaced bins in the range − . ≤ G_BP − G_RP ≤ 3. These isochrone points form a boundary in the
Gaia
CMD that represents the faintest luminosities reached by any massive star at any point during its evolution, and no fainter massive stars are expected to be found — note that many isochrone points with M_i < M⊙ lie above this boundary, so our sample is not constructed to be free of contamination.

The left panel of Figure 2 shows the logarithmic density on the sky of all stars selected from the Gaia
DR2 database; the Galactic plane and Magellanic Clouds are clearly visible. The right panel shows the portion of the
Gaia
CMD brighter than M_G = − .75, where blue (orange, green) points are individual stars in the Galaxy (LMC, SMC). The solid (dashed, dotted) black lines show the MIST luminosity threshold for the Galaxy (LMC, SMC). Note that the thresholds accurately capture the slope of the main sequence for all three galaxies, as well as the G_BP − G_RP color corresponding to the Hayashi limit.

We select all stars brighter than the corresponding luminosity threshold for their host galaxy, resulting in 9784 objects. From this sample, we select all stars fainter than the saturation limit in W1, W2, and W3 (we exclude W4, where the signal-to-noise is often poor). We also convert the W , W , and Gaia photometry for these stars. We query
Vizier (Ochsenbein et al. 2000) using astroquery to download
JHKs photometry from the Two Micron All Sky Survey (2MASS, Skrutskie et al. 2006) for all stars. We also query SIMBAD (Wenger et al. 2000) and download the common name (MAIN_ID),
Figure 2.
Left:
Density of stars selected from the
Gaia database on the sky. Intensity of the colormap corresponds to the logarithm of the number of stars in each bin.
Right: Gaia
CMD for stars selected from the
Gaia
DR2 database brighter than M_G = − .5. Galactic stars are in blue, LMC stars are in orange, and SMC stars are in green. The solid, dashed, and dotted lines represent our minimum-luminosity criteria to select massive stars in the Galaxy, LMC, and SMC respectively.

spectral type (contained in the
MK Spectral Type and
SP Type fields), and object type (
OType) for each star, the latter two of which we use to assign labels.

2.1. Label Assignment

Figure 3.
The makeup of our sample of massive stars. Note that the sample is dominated by OBA stars and cool supergiants, and non-OBAe emission line stars are the rarest stars in our sample. For readability, we have used a logarithmic y-axis to display our sample statistics. Note that in practice, differences in the number of stars per class are much larger than they might appear here.
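The sparse classes visible in Figure 3 are addressed in a later section; as an illustration of one standard mitigation (not necessarily the one the authors adopt), per-class weights inversely proportional to class frequency can be computed with the standard library. The helper name is ours:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency ('balanced') class weights: N / (k * N_c) for
    each class c with N_c members, k classes, and N stars total, so rare
    classes (e.g., LBVs) count proportionally more during training.
    Mirrors scikit-learn's class_weight='balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}
```

A classifier trained with these weights penalizes mistakes on rare classes more heavily, which partially compensates for the imbalance without resampling the data.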
For our final sample of ∼

Table 1.
Common names, coordinates, host galaxies, and
Gaia measurements of 6,484 putative massive stars, ordered by Right Ascension. r_est from Bailer-Jones et al. (2018) is given for Galactic stars. Listed values of G and G_BP − G_RP are uncorrected for extinction.

Common Name                R.A. [deg]   Dec [deg]     Host Galaxy  r_est [kpc]  G [mag]     A_G [mag]   G_BP − G_RP [mag]
HD 236270                  0.17442287   55.72245665   Milky Way    2.162        9.073082    0.942       0.24057600
LS I +64 10                0.38838658   64.51219232   Milky Way    5.305        11.545427   1.3305      0.56807750
LS I +60 69                0.55506135   60.43828347   Milky Way    5.866        11.854865   1.234       0.61282300
BD+62 2353                 0.59458742   62.90087875   Milky Way    5.243        9.809305    0.485       0.37020600
HD 73                      1.40408512   43.40139506   Milky Way    1.869        8.1893425   0.1245      -0.14992300
HD 240496                  1.42175475   58.49541068   Milky Way    2.499        9.698675    1.5533      0.68225500
WISE J000559.28-790653.3   1.49713706   -79.11483482  SMC          -            13.9371195  0.21266685  1.04591900
LS I +59 30                1.70503555   59.85955733   Milky Way    4.006        10.856779   1.188       0.50059550
BD+57 2870                 1.82960982   58.33785301   Milky Way    3.893        9.841114    1.4205      0.82356400
BD+62 1                    1.88805102   63.08030731   Milky Way    2.893        10.293984   1.2973      0.52861900

Note — This table is published in its entirety in the machine-readable format. A portion is shown here for guidance regarding its form and content.

peculiar stars or X-ray binaries). Deriving labels for known massive stars using existing sources is not trivial, and our labelling scheme would be entirely different if a large sample of massive stars with well-measured temperatures, surface gravities, and chemical abundances were available.

We first use the common name and
Gaia source_id of the star to determine if the star belongs to the catalog of confirmed Luminous Blue Variables (LBVs) presented in Richardson & Mehner (2018). Non-LBVs are classified as WR stars if “W” is in the spectral type, or the SIMBAD
OType is “*WR”. Non-WRs with “K” or “M” in their spectral type are classified as either Red Supergiants (RSGs) or “C/S/Giant” if their SIMBAD
SP_Type contains “III” — we keep all such low-mass contaminants in our sample, as distinguishing between RSGs and luminous low-mass giants is still a difficult problem (Massey et al. 2009; Yang et al. 2019; Neugent et al. 2020). The resulting sample of RSGs is pure; of the five RSGs that don’t have luminosity class I, all of them are luminosity class Ia-II, Ib-II, or Iab-II, which are consistent with bona fide RSGs (Levesque et al. 2005). This is a good test of the
Gaia
DR2 parallaxes and Bailer-Jones et al. (2018) distances, as cool subgiants and dwarfs with luminosity class IV or V would have been erroneously classified as RSGs using our criteria had they been included in our sample due to bad distance and M_G measurements.

Non-RSGs with “F” or “G” in their spectral type are classified as Yellow Supergiants (YSGs). However, this includes a number of blue stars. Further inspection of these stars reveals a number of objects whose MK Spectral Type field indicates these are hot stars, with spectral type O, B, or A, which we classify as such (see below). Eight low-mass yellow stars are also included in our sample. Stars with “III” or “V” in their spectral types are classified as Yellow Dwarfs. We note that for extragalactic samples, foreground dwarfs can usually be filtered based on proper motions, while dwarfs belonging to the stellar population under study can be excluded just based on their apparent magnitudes. Nonetheless, we retain this class to avoid confusion between these stars and true YSGs. All YSGs that aren’t hot stars or dwarfs keep their YSG label.

Of the objects that have not yet been classified, stars with “[e]” in their spectral type are classified as OB[e] stars, while non-OB[e] stars with spectral types including an “e” (without brackets, and without “pec” in their spectral type) are classified as OBAe stars. If the star is not yet classified and O, B, or A are in the spectral type with no additional information, they are classified as generic OBA stars. OBA stars with “III” or “IV” in their spectral type are classified as evolved OBA stars, stars with “V” in their spectral type are classified as OBA main sequence stars, and finally, stars with “I” in their spectral type are labeled as OBA supergiants.
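The chain of string tests described so far (the remaining fallback classes are covered in the next paragraph) can be sketched as a single function. This is a hypothetical, simplified reimplementation: `assign_label` is our own name, and several edge cases from the text (the actual source_id match against Richardson & Mehner 2018, “OB+” types, compound binary spectral types) are omitted:

```python
def assign_label(sp_type, o_type, in_lbv_catalog=False):
    """Simplified sketch of the labeling flowchart (Figure 4).

    sp_type and o_type are the SIMBAD spectral type and object type
    strings; in_lbv_catalog stands in for the LBV catalog match.
    """
    sp = sp_type or ""
    if in_lbv_catalog:                        # confirmed LBV catalog match
        return "LBV"
    if "W" in sp or o_type == "*WR":          # Wolf-Rayet stars
        return "WR"
    if "K" in sp or "M" in sp:                # cool stars: giant vs. RSG
        return "C/S/Giant" if "III" in sp else "RSG"
    if ("F" in sp or "G" in sp) and not any(c in sp for c in "OBA"):
        # yellow stars that are not secretly hot OBA stars
        return "Yellow Dwarf" if ("III" in sp or "V" in sp) else "YSG"
    if "[e]" in sp:                           # bracketed emission
        return "OB[e]"
    if "e" in sp and "pec" not in sp:         # emission without brackets
        return "OBAe"
    if any(c in sp for c in "OBA"):           # remaining hot stars
        if "III" in sp or "IV" in sp:
            return "Evolved OBA"
        if "V" in sp:
            return "Main Sequence OBA"
        if "I" in sp:
            return "Supergiant OBA"
        return "OBA"
    return "Unknown/Candidate"                # no usable information
```

Note that the order of the luminosity-class tests matters: “IV” must be checked before “V”, and “V” before “I”, since the substrings overlap.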
All stars that have not been assigned a label at this stage are either C/S stars (which are assigned the C/S/Giant class), stars labeled only as Variables in SIMBAD (e.g., LPVs, semi-regular variables, or just variables, without other spectral information), which are assigned the Miscellaneous Variable classification, or stars with no identifying information/no confirmed designation (e.g., the SIMBAD
OType is “Star” or contains “Candidate”), which are classified as Unknown/Candidate.

Finally, we include an “Is Binary” flag for all stars, which is 1 for stars classified as Eclipsing or Spectroscopic Binaries, High Mass X-ray Binaries, or Ellipsoidal Variables, or if they have a compound spectral type (e.g.,
Figure 4.
Flowchart illustrating the process by which stars are assigned labels, as described in the text. Each star begins in the top left and is assigned a label by following a series of binary decisions. This process is complex, and demonstrates the difficulty in deriving useful labels for massive stars.

and 0 otherwise; 102 stars are flagged as binaries. Because photometry of binary systems can be misleading (Neugent et al. 2018a), and binary systems exhibit a broad range of variability that is not intrinsic to the individual components, we exclude these stars from our classifier.

Figure 3 shows the makeup of our sample. Approximately 30% of our sample (2550 stars) belong to the Miscellaneous Variable and Unknown/Candidate classes, which we do not use to train our classifier; instead, we use the classifier to assign tentative classifications in §
4. The rest of the sample is dominated by luminous OBA stars and cool supergiants, with very few LBVs, OB[e] stars, and WR stars. This is unsurprising given these stars’ high luminosity in the
Gaia bandpass, and the expected lifetimes of these evolutionary phases relative to the lifetimes of exotic emission line objects (Ekström et al. 2012). The imbalances in available training data across different classes, along with the extreme sparsity of training data in the rare classes, will impact the performance of the classifier if not properly addressed (Chawla 2010). We discuss this issue for this particular sample in § .

The left panel of Figure 5 shows the M_G vs. G − J CMD for all stars in our sample that aren’t labeled as miscellaneous variables or unknown/candidate, colored by their label. G − J correlates reasonably well with effective temperature in main sequence stars (Davenport & Covey 2018). From this plot, it is clear that many stars are misclassified in SIMBAD (the worst example is one particular red star classified as an OBA star), reducing the effectiveness of any machine learning algorithm, and propagating biases into the results. Yellow supergiants are especially prone to this problem: 81/212 YSGs in the sample (38%) have G − J <
1, consistent with the optical colors of much hotter stars. Indeed, this problem would have been worse had we not corrected for the presence of OBA stars in the initial sample of YSGs. This issue may originate from bad distance estimates for individual stars (which explains why our sample includes F and G dwarfs), bad estimates of reddening (given our usage of monochromatic extinction coefficients), previously unidentified variability, or the fact that many of these spectral types were determined via stellar spectra taken on photographic plates (for example, many spectral types for stars in the LMC come from Ardeberg et al. 1972). We expect this issue to propagate in our results, increasing the confusion between YSGs and hot stars. (This does not include stars with the “OB+” spectral type, which is an outdated class that describes OB stars with weaker absorption lines that would now be classified as OB supergiants.)

Because we expect objects in some classes — especially those with an evolutionary link such as main sequence, evolved, and supergiant OBA stars — to appear similar in the training data set, we also assign all stars a coarse label: all classes of OBA stars excluding OB[e] and OBAe are labeled “Hot”; RSGs and YSGs are labeled “Cool”; WRs, LBVs, and both OB[e] and OBAe stars are labeled “Emission” (EM for short); C/S/Giant stars and Yellow Dwarfs are labeled “Contaminant”; and miscellaneous variables and unknown/candidates are labeled “Unknown/Candidate.” The results of this labeling scheme are summarized in Table 2, which shows the number of stars with a given refined class that are assigned a particular coarse label. The right panel of Figure 5 shows the same CMD, with points colored by their coarse label. This leads to some improvement: each coarse class lies in the approximate region of the CMD that one would expect.
Regardless, it is evident that selecting any one of these classes solely from this optical photometry would be difficult: the “cool” class has significant overlap with the “hot” class — largely driven by the YSGs — emission line stars can be found at a range of G_BP − G_RP colors, and there is significant overlap between low-mass contaminants that will end their lives as white dwarfs and true massive stars that will end their lives in supernova explosions. This point is emphasized by the contours, which correspond to 0.5 and 0.1 times the maximum value of a kernel density estimate of the distribution of each class.

We will use these coarse labels to train a second classifier. While these coarse labels lose some specificity, each coarse class contains more stars, hopefully increasing the performance of a classifier trained on these labels. Furthermore, they still retain physical information: the “cool” label contains stars with convective envelopes, while “hot” stars have radiative envelopes. Meanwhile, emission line stars are notable for their variability. It is our hope that this second classifier will still address two of our stated goals: to identify emission line stars, and to reject contaminating low mass stars.

WISE LIGHTCURVES

Variability in evolved massive stars has been well-characterized at timescales from minutes to decades (e.g. Conroy et al. 2018; Dorn-Wallenstein et al. 2019; Soraisam et al. 2020). In a study of massive stars in the Whirlpool Galaxy (M51), Conroy et al. (2018) found that almost half of the stars brighter than M_I = −
Figure 5. M_G vs. G − J for putative massive stars. Left: Stars are colored by their label.
Right: Stars are colored by their coarse label. Contours for each coarse class correspond to 0.5 and 0.1 times the maximum value of a kernel density estimate of the distribution of each class in the CMD. Both panels illustrate a key difficulty faced by our classifier: both the refined and coarse labels have significant overlap with each other in the CMD. For example, the coolest/warmest YSGs have identical optical photometry to RSGs/OBA supergiants respectively. Even in the coarse labels, the contours for hot stars and emission line stars are nearly identical.
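The coarse relabeling described in the text is a simple many-to-one lookup over the refined classes; a minimal sketch (label strings as they appear in Table 2, with the held-out classes passed through unchanged; the names `COARSE` and `coarse_label` are ours):

```python
# Many-to-one mapping from refined to coarse labels (cf. Table 2).
# Misc. Variable and Unknown/Candidate stars are not used for training
# and simply keep their original label.
COARSE = {
    "Main Sequence OBA": "Hot", "Evolved OBA": "Hot",
    "Supergiant OBA": "Hot", "OBA": "Hot",
    "YSG": "Cool", "RSG": "Cool",
    "WR": "Emission", "LBV": "Emission",
    "OB[e]": "Emission", "OBAe": "Emission",
    "C/S/Giant": "Contaminant", "Yellow Dwarf": "Contaminant",
}

def coarse_label(refined):
    """Return the coarse class for a refined label (pass-through for
    classes that are held out of training)."""
    return COARSE.get(refined, refined)
```

This collapses twelve refined classes into the four coarse classes used to train the second classifier.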
Table 2.
Number of stars in a class that are assigned a given coarse label; we do not show the Miscellaneous Variable or Unknown/Candidate labels.
Refined Label        Hot    Emission   Cool   Contaminant
Main Sequence OBA    187
Evolved OBA          409
Supergiant OBA       798
OBA                  915
OBAe                        383
OB[e]                       12
WR                          37
LBV                         8
YSG                                    212
RSG                                    847
C/S/Giant                                     118
Yellow Dwarf                                  8
Total                2309   440        1059   126

tion of 1. Both red and extremely luminous blue stars exhibited quite high amplitude (∆I ≥ 0.3) variability. For spectral energy distributions (SEDs) dominated by purely stellar light, mid-infrared (MIR) flux measurements (and thus variability) are sensitive to (variations in) the bolometric luminosity. However, for stars with significant circumstellar dust components in their SEDs, MIR variability is correlated with both intrinsic bolometric variability, and with dust creation/destruction processes in the circumstellar medium (for example, in RSGs where it is correlated with the mass loss rate, e.g. Yang et al. 2018).

The WISE mission provides lightcurves from stars in all parts of the sky, observed over a ∼ ∼
180 days. All stars have a ∼ . WISE originally observed in four bands: W1 (3.4 µm), W2 (4.6 µm), W3 (12 µm), and W4 (22 µm); however, it was reduced to using only the two bluest bands in its post-cryogenic survey mode, called “NEOWISE”. The duration of each ∼
180 day visit depends on the spatial geometry of the WISE scanning program; i.e., stars closer to the ecliptic poles have longer duration visits (often exceeding a week) with many epochs per visit, while stars near the equator have very short visits (typically a couple days) with only a few epochs per visit. Because WISE lightcurves possess such non-uniform cadence, extracting detailed physics for most individual stars is difficult. However, the WISE lightcurves place fantastic constraints on MIR variability amplitude on longer timescales, especially for evolved massive stars whose highest-amplitude variability occurs over ∼ year timescales. Such amplitude and timescale estimates are related to the physical parameters of the star, potentially aiding in classification.

For every star selected in § , we use astroquery to pull data in the region within 3 arcsec of the known source location. To ensure high quality data for all recovered epochs, we require the photometric quality flag to be PH_QUAL=A, the contamination flag to be
CC_FLAGS=00, the number of deblended sources flag to be
NB=1, and the PSF photometry fit quality in W1 (defined as the reduced χ²) to satisfy w1rchi2 < .

Only two of our stars did not have usable data from WISE: WISE J074911.48-102000.2 (HD 63554) has no lightcurve available online, and WISE J050128.62-701120.2, which does not have any corresponding object nearby on SIMBAD. When calculating each of the variability metrics below, we instead record a value of NaN (i.e., missing data). For the remaining stars, we ignore the W3 and W4 data and construct W1, W2, and W1 − W2 lightcurves.

Variability Metrics
Amplitude
For each of the three lightcurves of each object, we wish to extract simple metrics that describe the amplitude and timescale of variability. We choose χ² about the median, defined as

χ² = Σ_i [(M_i − M̃)/σ_i]²    (2)

and the reduced χ²,

χ²_red = χ²/(N − 1),    (3)

where M_i is a magnitude measurement, M̃ is the median of the lightcurve, σ_i is the corresponding error on the data point, and N is the number of points in the lightcurve. We also calculate the Median Absolute Deviation (MAD) and Error-Weighted MAD (EWM):

MAD = Median(|M_i − M̃|)    (4)

EWM = Median(|M_i − M̃|/σ_i)    (5)

If the filtered and cleaned lightcurve only contains one good measurement (or no good measurements), we automatically give it χ² = χ²_red = MAD = EWM = NaN. We describe our method for treating missing data below.

The top panels of Figure 7 show the distributions of χ²_red and EWM derived for our sample. Values from the W1, W2, and W1 − W2 lightcurves are shown. The bottom left panel shows χ²_red vs. EWM for all three lightcurves. While the two measures correlate reasonably well with each other, there is a branch of stars whose lightcurves have high χ²_red and low EWM; because the EWM is robust to outliers, χ²_red is an effective probe of lightcurves with sudden brightening/fading events, while EWM is an effective selector for lightcurves that display consistent variability. The bottom right panel shows a scatter plot of χ²_red in W1 and W
2, with each point colored by its coarse label. A distinct branch of stars that are much more variable in W2 than in W1 is apparent. These stars have χ²_red < 10 in W1 but χ²_red > 100 in W2, and have one observation during which the star apparently becomes considerably redder, achieving W1 − W2 ∼ 4. Examining the times at which these extreme reddening events occur shows a preference for times during the cryogenic WISE survey, implying that this behavior is likely instrumental in origin, despite our filtering using the provided quality flags. Nonetheless, we include χ²_red as it does not map perfectly onto EWM, and only 98 stars fall in this regime. We do not yet know whether χ²_red, EWM, both metrics, or neither are useful features for classification, so we keep both with the intent of exploring their importance below.
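Equations 2–5 amount to a few lines of code per lightcurve; a standard-library sketch (the function name is ours, and the upstream quality filtering described above is assumed to have already happened):

```python
from statistics import median

def amplitude_metrics(mags, errs):
    """Amplitude metrics of Eqs. 2-5 for one cleaned lightcurve:
    chi^2 about the median, reduced chi^2, MAD, and error-weighted MAD."""
    n = len(mags)
    if n < 2:  # zero or one good epoch: metrics recorded as NaN
        nan = float("nan")
        return {"chi2": nan, "chi2_red": nan, "mad": nan, "ewm": nan}
    med = median(mags)                                       # M-tilde
    chi2 = sum(((m - med) / s) ** 2 for m, s in zip(mags, errs))
    mad = median(abs(m - med) for m in mags)                 # Eq. 4
    ewm = median(abs(m - med) / s for m, s in zip(mags, errs))  # Eq. 5
    return {"chi2": chi2, "chi2_red": chi2 / (n - 1), "mad": mad, "ewm": ewm}
```

As the text notes, the median-based MAD and EWM are robust to single-epoch outliers, while χ² is not; computing all four per band preserves both kinds of sensitivity as classifier features.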
Figure 6.
Example lightcurve for WISE J000536.97+432405.0.
Top left: Raw lightcurve, with W1 and W2 shown. Bottom left: Variability in W1 − W2. Top right: Binned W1 lightcurve and the spline interpolant, centered at t = 0. Bottom right: First-order derivative of the interpolant. Vertical blue lines show the times where the derivative crosses zero, indicated by the horizontal blue line.
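The visit-binning and zero-crossing bookkeeping illustrated in Figure 6 can be sketched with the standard library. Note that the text fits a 3rd-order B-spline with scipy.interpolate.splrep and differentiates it with splev, whereas this simplified sketch (with our own function names) differentiates the binned points directly:

```python
def bin_visits(times, mags, gap=50.0):
    """Group epochs into visits separated by > `gap` days (the text
    adopts 50 d, set by the ~180 d WISE cadence) and return the mean
    time and mean magnitude of each visit."""
    pairs = sorted(zip(times, mags))
    visits, current = [], [pairs[0]]
    for t, m in pairs[1:]:
        if t - current[-1][0] > gap:   # large time gap: start a new visit
            visits.append(current)
            current = []
        current.append((t, m))
    visits.append(current)
    return [(sum(t for t, _ in v) / len(v), sum(m for _, m in v) / len(v))
            for v in visits]

def zero_crossing_frequency(binned):
    """nu: slope sign changes per day, a crude stand-in for the
    zero-crossing count of the spline derivative described in the text."""
    if len(binned) < 4:                # fewer than four visits: undefined
        return float("nan")
    slopes = [(m2 - m1) / (t2 - t1)
              for (t1, m1), (t2, m2) in zip(binned, binned[1:])]
    crossings = sum(1 for a, b in zip(slopes, slopes[1:]) if a * b < 0)
    return crossings / (binned[-1][0] - binned[0][0])
```

A lightcurve that alternates between bright and faint states on every ∼180 day visit would yield one slope sign change per visit pair, i.e. a high ν; a monotonically brightening star would yield ν = 0.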
Figure 7.
Top row: Distribution of derived χ²_red (left) and EWM (right) values in the W1, W2, and W1 − W2 lightcurves. Bottom row: Scatter plots comparing different amplitude metrics. The left panel shows EWM vs. χ²_red in the W1, W2, and W1 − W2 lightcurves. The right panel shows χ²_red in W2 vs. χ²_red in W1, with each point colored by its coarse label.
Timescale
Many methods exist for estimating dominant timescales in lightcurves. Conroy et al. (2018) use the Lomb-Scargle Periodogram (Lomb 1976; Scargle 1982) to search for periodic variables. However, this approach suffers from numerous, well-known issues (including accurate period recovery at low signal to noise), and false peaks can easily be mistaken for real timescales, especially in highly-irregularly sampled data, as is the case for the WISE lightcurves. Soraisam et al. (2020) use
180 days. Therefore, we adopt 50 daysas the threshold for visits. Two stars in our sample areclose enough to the ecliptic pole to be observed nearlycontinuously such that we erroneously record two “vis-its”: one each during the cryogenic and post-cryogenicsurveys. However, neither star is strongly variable andthus this small edge case does not substantially impactour subsequent analyses.For all observations in a given visit, we calculatethe mean time and W W W − W scipy.interpolate.splrep in Python to findthe 3 rd -order B-spline representation of the binned hotometric Classifications of Evolved Massive Stars s=10 . Thisreturns the knots, B-spline coefficients, and degree ofthe spline. By definition, 3 rd -order splines are differen-tiable, so we use scipy.interpolate.splev to evaluatethe first derivative of the spline interpolant, and find thetimes when the derivative changes sign — i.e., whenthe lightcurve reaches a maximum or minimum. Asmetrics of the characteristic timescale of the lightcurve,we calculate the frequency of zero-crossings of the firstderivative of the spline interpolant, ν — calculatedas the number of times the derivative passes throughzero, divided by the time baseline of the lightcurve— (cid:104) ∆ t (cid:105) , the mean of the differences between succes-sive zero-crossings, and the standard deviation of thedifferences between successive zero-crossings, σ ∆ t . Forstars with fewer than four visits, we automatically as-sign ν = (cid:104) ∆ t (cid:105) = σ ∆ t = NaN . The right panels of Figure6 show this process on the W W MACHINE LEARNING4.1.
4. MACHINE LEARNING

4.1. Classifier Selection
The problem of classification based on broadband photometry has a rich history in the literature. With the advent of large surveys like the Sloan Digital Sky Survey (SDSS, York et al. 2000), optical data could be coupled with space-based MIR data to find the stellar locus in a 10-dimensional color-space (Davenport et al. 2014). Recent efforts to separate stars from quasars, or to perform a regression on effective temperature, with machine learning on photometric data have been successful (Makhija et al. 2019; Bai et al. 2019); however, these studies are often focused on main sequence, low mass stars. This is an understandable choice given the rarity of evolved, high mass stars, the absence of reliable distances to calculate luminosities from which to select putative massive stars, and the fact that follow-up spectroscopy is necessary in order to confirm a star's membership in many important classes.

With the advent of Gaia DR2, luminosities can be easily determined, and putative massive stars can be identified, as we do in §2. We wish to train an algorithm that takes as input the broadband photometry and variability metrics derived for our sample, and outputs spectral type classifications. Many machine learning classifiers exist; of these, we wish to choose a flexible model with well-understood mathematics, while avoiding techniques like neural networks that can be difficult
Table 3.
List of features passed to our machine learning classifiers, as well as clarifying definitions where relevant. WISE photometry used to calculate colors and magnitudes is from the ALLWISE data release (Cutri et al. 2013). All variability metrics are calculated from the WISE W1, W2, and W1−W2 lightcurves.

Feature | Definition
Colors & Magnitudes:
M_G | Absolute magnitude in the Gaia G band.
G−J | From Gaia and 2MASS photometry.
J−H | From 2MASS photometry.
H−K_s | From 2MASS photometry.
K_s−W1 | From 2MASS and WISE photometry.
W1−W2 | From WISE photometry.
W2−W3 | From WISE photometry.
W3−W4 | From WISE photometry.
M_W1 | Absolute magnitude in WISE W1.
Variability Metrics:
log χ²_red | Log of the reduced χ².
log EWM | Log of the error-weighted Median Absolute Deviation.
ν | Frequency of zero-crossings of the first derivative of the spline interpolant.
log ⟨Δt⟩ | Log of the average time between zero-crossings.
log σ_Δt | Log of the standard deviation of zero-crossing times.

to interpret. Of the classifiers available in the sklearn package (https://scikit-learn.org/stable/index.html), we decided to test a Random Forest (RF) classifier (Breiman 2001), which consists of a collection of decision trees trained on random subsets of samples and features; a Support Vector Machine (SVM) classifier (Cortes & Vapnik 1995), which identifies hyperplanes in the feature space that separate different classes; and a Gaussian process (GP) classifier (Rasmussen & Williams 2006). We refer the reader to these publications, as well as to the sklearn documentation, for the mathematics and implementation details of each classifier. In the multi-class case, a collection of classifiers is trained on each possible pair of classes (one-versus-one or "ovo"), generating a total of N_classes(N_classes − 1)/2 classifiers, where N_classes is the number of classes. Labels are assigned to test samples by allowing each classifier to vote, and the label with the most votes is chosen (Knerr et al. 1990).

Each type of classifier has a number of hyperparameters that affect the performance of the classifier. For the RF classifier, n_estimators specifies the number of trees in the forest, max_depth specifies how many branches each decision tree in the forest can have, and max_features specifies the maximum number of features each tree is trained on.
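As a sketch, the three classifier families can be instantiated from sklearn as follows. The hyperparameter values shown here are placeholders for illustration only; the actual values are selected via the cross-validation search described in the text:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier

# Hypothetical hyperparameter values for illustration; the real values
# come from the grid search over n_estimators, max_depth, C, gamma, etc.
rf = RandomForestClassifier(n_estimators=100, max_depth=20,
                            max_features="sqrt", class_weight="balanced")
svc = SVC(kernel="rbf", C=1.0, gamma="auto",   # gamma="auto" = 1/n_features
          class_weight="balanced",
          decision_function_shape="ovo",       # one-versus-one multi-class
          probability=True)
gp = GaussianProcessClassifier(multi_class="one_vs_one")
```

Each object exposes the same fit/predict interface, which is what makes the head-to-head comparison (and the voting ensembles discussed later) straightforward.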
We also set class_weight=balanced, which weighs samples when fitting to account for the different frequencies of each class in the data.

For the SVM classifier (SVC), C is a regularization parameter that governs the tradeoff between maximizing the margin and misclassifications in the training set. Higher values of C will force the SVC to correctly classify every point, resulting in poor generalization (i.e., overfitting). The SVC requires that the distance between two points in the feature space is defined as the inner product of two vectors in the feature space, ⟨X_i, X_j⟩. Because the boundaries between classes in our sample are not guaranteed to be linear, one can project the samples into a much higher dimension space via a mapping function, Φ, where distances between two vectors in this space are calculated as ⟨Z_i, Z_j⟩ = ⟨Φ(X_i), Φ(X_j)⟩. In reality, the transformed feature space can be incredibly high dimensional, and explicitly mapping the data into this high-dimensional space is computationally inefficient. Instead, we can adopt a kernel function, K, that defines distances in the higher dimensional space, e.g., K(X_i, X_j) = ⟨Z_i, Z_j⟩.
Because the kernel only takes the measured features as input and outputs a number, using a kernel function implicitly maps the input feature space into a high dimensional space without specifying Φ.

Common choices of the kernel function include a linear kernel (i.e., the Euclidean distance between individual samples in the feature space) and the "radial basis function" (RBF) kernel:

K(X_i, X_j) = e^(−γ ||X_i − X_j||²)    (6)

where γ governs the influence of the kernel function; lower values result in increasingly linear boundaries, while high values result in the decision function being entirely dependent on individual points, creating small islands of a given class centered on each training point. The advantage of non-linear models like an SVM with an RBF kernel over linear methods is that the decision boundaries can be much more flexible; the tradeoff is that the contribution of individual features to the classifier cannot be easily calculated without the unknown function Φ. As with the RF, we set class_weight=balanced for the SVC, which automatically sets the value of C for class i to C·N_samples/(N_classes·N_i), where N_samples is the size of the sample, N_classes is the number of classes, and N_i is the number of objects in the sample belonging to the class. This serves to weight rarer classes more heavily. The GP classifier's only hyperparameter is a choice of kernel, which defines the covariance function of the Gaussian process.

To fit our classifiers, we first remove all stars labeled as miscellaneous variables or unknown/candidates, as well as known binaries, and one star with bad J photometry, the Be star HD 53032. For features, we use the intrinsic calculated value of M_G, as well as (uncorrected for extinction) G−J, J−H, H−K_s, K_s−W1, W1−W2, W2−W3, W3−W4, and χ²_red, EWM, ν, ⟨Δt⟩, and σ_Δt in all three lightcurves; we indicate the WISE band or color that each variability metric corresponds to with a subscript hereafter. The input features and a brief description where relevant are listed in Table 3. Because χ²_red, EWM, ⟨Δt⟩, and σ_Δt have significant dynamic range, we use the base-10 logarithm of these features. Feature values as well as labels for each star in our sample are given in Table 4.

We now randomly split our sample into a training set with 70% of the samples, and a test set with the remaining 30%, using a stratification strategy to ensure the proportions of the classes in both sets are equal. The test set is withheld until we are ready to assess the performance of the chosen classifier. We then use sklearn.preprocessing.StandardScaler in Python to scale the training data such that each feature has zero mean and unit variance. Because the data have missing values, we then use sklearn.impute.IterativeImputer, which uses a Bayesian ridge regression to predict and replace missing values.

To test the accuracy of the imputer, we select only the rows from the training set with no missing data. For each feature with missing data (log χ²_red,W1−W2, log EWM_W1−W2, log σ_Δt,W1, log σ_Δt,W2, and log σ_Δt,W1−W2) we randomly choose 200 objects, replace the value of the feature with NaN for only these objects, transform the data using the scaler and imputer, and calculate the fractional error between the true value and the imputed value. The returned fractional errors for each feature centered around 0, and had a low scatter, with the exception of log EWM_W1−W2. However, this has little impact on the classifier, as only two objects in the actual training set have missing values for log EWM_W1−W2. We also repeated this procedure for coarse labels: we select 200 random objects with a given coarse label, replace a random feature from the list of features with missing data with NaN for each object, and again calculate the fractional error between the true and imputed values. We find that the imputer performs poorly on Cool and Contaminant stars.
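A minimal sketch of this preprocessing step, using a small hypothetical feature matrix in place of the actual training set (note that IterativeImputer is experimental in sklearn and must be explicitly enabled):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for the training features, with NaNs marking missing
# variability metrics. The scaler and imputer are fit on the training
# set only; the same fitted objects are later applied to the test set.
X_train = np.arange(30, dtype=float).reshape(10, 3)
X_train[2, 1] = np.nan
X_train[5, 2] = np.nan

scaler = StandardScaler()           # NaNs are ignored when fitting
imputer = IterativeImputer()        # Bayesian ridge regression by default

X_scaled = scaler.fit_transform(X_train)
X_imputed = imputer.fit_transform(X_scaled)
```

Applying the already-fit scaler and imputer to held-out data is then `imputer.transform(scaler.transform(X_test))`, which mirrors the train/test separation described above.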
Table 4. Feature values and assigned labels for all stars in our sample, ordered by Right Ascension. Missing values are indicated with "-".
Common Name | M_G [mag] | G−J [mag] | W1−W2 [mag] | log χ²_red,W1 | log ⟨Δt⟩_W1 [d] | Label | Coarse Label
HD 236270 | -3.54342624 | 0.3541 | 0.1420 | 0.389 | - | RSG | Cool
LS I +64 10 | -3.40866054 | 0.7154 | 0.0020 | - | - | OBA | Hot
LS I +60 69 | -3.22094985 | 0.8769 | -0.0340 | 0.061 | - | OBAe | EM
BD+62 2353 | -4.27358395 | 0.4653 | -0.0350 | 0.110 | 2.816 | Supergiant OBA | Hot
HD 73 | -3.29367442 | -0.6567 | -0.0520 | 3.157 | 2.701 | Evolved OBA | Hot
HD 240496 | -3.84380534 | 1.0127 | 0.0130 | -0.064 | 2.764 | OBA | Hot
WISE J000559.28-790653.3 | -5.32554735 | 1.3391 | -0.0390 | 0.072 | 3.162 | Unknown/Candidate | Unknown/Candidate
LS I +59 30 | -3.34464936 | 0.6748 | -0.0430 | 0.234 | - | OBA | Hot
BD+57 2870 | -4.53103041 | 1.1711 | 0.0040 | 0.046 | 3.135 | Supergiant OBA | Hot
BD+62 1 | -3.31029252 | 0.9010 | 0.2660 | 1.622 | 2.813 | Evolved OBA | Hot

Note: This table is published in its entirety in the machine-readable format. A portion is shown here for guidance regarding its form and content.

Given that all of the features with missing data are linked to variability, that a significant fraction of red supergiants display high amplitude variability (Conroy et al. 2018), that the MIR variability of AGB stars (which make up the bulk of the Contaminants) is higher than that of RSGs in a given magnitude range (Yang et al. 2018), and that our sample of Cool and Contaminant stars contains objects in quite different evolutionary states that nevertheless have similar colors and magnitudes, it is unsurprising that the imputer is unable to predict the variability properties of these stars; we return to this point in a later section.

The scaled and imputed data are then passed to a sklearn classifier object (e.g., sklearn.svm.SVC). To settle on the best values for the hyperparameters we use sklearn.model_selection.GridSearchCV to perform a cross-validation search on a grid of hyperparameters, using a stratified K-fold strategy with k = 5 to ensure that each fold has a representative distribution of classes.
For the RF, we search for n_estimators between 10 and 150 in steps of 10, max_depth between 10 and 100 in steps of 10, and allow max_features to be either sqrt, log2, or None (where the maximum number of features individual trees are trained on is the square root of, base-2 logarithm of, or equal to the number of features; see the documentation for details). We also allow max_depth to take on the default value (None), such that individual trees can be grown until each leaf only contains one sample.

For the SVC, we search for values of C on a logarithmic grid with 1 dex spacing between 0.01 and 100, and for the RBF kernel, we search for values of γ on a similar grid between 0.01 and 10. Additionally, we allow γ to take its default value of 1/n_features (where n_features is the number of features). For the GP classifier, we only vary the kernel, as the GaussianProcessClassifier object automatically optimizes the kernel hyperparameters. We let the kernel be either linear, RBF, or the default (a special case of the RBF kernel with the length scale equal to 1).

Each classifier object has a default method to score each set of hyperparameters, e.g., the accuracy of predicted labels compared to true labels. However, the classes in our training set are unbalanced (e.g., Figure 3), so inaccurately classifying every single LBV, for example, would have little impact on the overall accuracy of the classifier. To account for this, we instead use the balanced accuracy (Mosley 2013; Guyon et al. 2015), which weighs each sample by the frequency of that sample's class in the training set. Other options for scoring criteria exist, including some that help maximize the classifier's precision, such as the weighted F1 score and Cohen's kappa (Cohen 1960). We experimented with using these scores, and found that using the balanced accuracy minimizes misclassifications across all classes (reflected in the diagonal in the left panels of Figures 9 and 12). Note that this choice implicitly selects a classifier that performs well across all classes, and is not optimized for specific classes. Future work will explore the possibility of tuning a classifier to find specific classes of rare stars.

Finally, we explore three variations of a voting classifier. Such a classifier consists of an ensemble of individual classifiers, each of which "votes" by assigning a class to a given sample.
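The SVC grid search described above can be sketched as follows. The grid values are the ones quoted in the text, and the balanced accuracy is requested via the scoring argument; X_train and y_train would be the scaled and imputed training set:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Logarithmic grids with 1 dex spacing, as described in the text;
# gamma="auto" is sklearn's name for the 1/n_features default.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1.0, 10.0, "auto"],
}
search = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid,
    scoring="balanced_accuracy",          # class-frequency-weighted accuracy
    cv=StratifiedKFold(n_splits=5),       # each fold keeps the class mix
)
```

Calling `search.fit(X_train, y_train)` then evaluates every combination across the five stratified folds and exposes the winner as `search.best_params_` and `search.best_score_`.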
The final assigned class can either be chosen with a "hard" (the class with the most votes wins) or "soft" (class assignments are weighted by the probabilities output by each classifier) strategy. We construct two voting classifiers that each use a different voting strategy, using RF, SVC, and GP classifiers as the individual components. We refer to these as the Voting (Hard) and Voting (Soft) classifiers. We also make a third voting classifier that also uses a soft voting strategy, but with the votes from each component classifier weighted by the balanced accuracy determined via cross-validation. We refer to this as the Voting (Weighted) classifier. We score each voting classifier by averaging the balanced accuracy taken from five stratified folds of the data. Figure 8 shows the balanced accuracy for the three optimized classifiers, as well as the three voting classifiers; the SVC performs "best." Both the Voting (Soft) and Voting (Weighted) classifiers perform comparably with the worst classifier, the GP. This is due to the fact that, while the SVC often selects one individual class with high probability, both the RF and GP tend to select multiple classes with high probability (with the GP sometimes selecting all classes with roughly equal probability, usually slightly favoring the classes selected by the RF). This can result in both the RF and GP voting for the wrong class with higher probability than the correct vote from the SVC, leading to the poor observed performance.

Figure 8.
Balanced accuracy for each optimized classifier,averaged over five foldings of the data. The three voting clas-sifiers perform roughly identically, with the Voting (Hard)classifier performing best.
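The three voting schemes can be sketched with sklearn's VotingClassifier. The component hyperparameters and the example weights below are placeholders (in the text, the weights are the cross-validated balanced accuracies of the components):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier

# Component classifiers; the SVC needs probability=True so that it can
# contribute class probabilities to the soft-voting schemes.
components = [
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
    ("svc", SVC(class_weight="balanced", probability=True, random_state=0)),
    ("gp", GaussianProcessClassifier()),
]
hard = VotingClassifier(components, voting="hard")
soft = VotingClassifier(components, voting="soft")
# Weighted variant: soft voting with each component's probabilities scaled
# by a per-classifier weight (placeholder values shown here).
weighted = VotingClassifier(components, voting="soft", weights=[0.8, 0.9, 0.7])
```

Because soft voting averages the components' probability vectors, two confidently wrong components can outvote one confidently right one, which is the failure mode described above for the RF+GP combination.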
SVC Performance
The procedure above results in values for the SVC hyperparameters of kernel = linear and C = 0.01. With these hyperparameters, we fit the SVC to the training set, use the StandardScaler and IterativeImputer that were previously fit to the training set to transform the test set, and use the SVC to predict the labels of the test set. The left-hand panel of Figure 9 shows the raw number of stars in the test set with the true label given on the y-axis, and the predicted label given on the x-axis. The center and right-hand panels show this matrix, where each row/column is normalized by the total number of stars in that row/column, yielding the confusion/efficiency matrices, respectively. The i, j entry in the confusion matrix corresponds to the fraction of objects in the test set belonging to class i (shown on the y-axis) that are assigned class j (shown on the x-axis). Entries along the diagonal are the completeness (also called the recall in some contexts), i.e., the percentage of a given class that is accurately recovered by the classifier. The i, j entry in the efficiency matrix is the fraction of objects in the test set classified as j that belong to class i. Entries along the diagonal are equivalent to the precision (equivalent to one minus the contamination), i.e., the percentage of an observed class that is made up of true members of that class. Figure 10 shows the completeness versus the contamination for each class. Completeness is the diagonal of the corresponding row/column of the confusion matrix, and contamination is one minus the diagonal of the corresponding row in the efficiency matrix.

The SVC performs poorly on non-supergiant OBA stars. This is perhaps unsurprising given that both the observed colors and interior structures of OBA stars as they evolve from the zero-age main sequence (ZAMS) to the terminal-age main sequence (TAMS) do not change drastically compared to the much more evolved states that we also consider. The classifier classifies OBAe stars with somewhat lower contamination compared to main sequence and evolved OBA stars, though with comparably low completeness.
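The row and column normalizations described here can be computed directly from a raw count matrix; the toy labels below are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Row-normalizing the raw counts gives the confusion matrix (diagonal =
# completeness/recall); column-normalizing gives the efficiency matrix
# (diagonal = precision = 1 - contamination).
y_true = ["Hot", "Hot", "Hot", "EM", "EM", "Cool"]
y_pred = ["Hot", "Hot", "EM", "EM", "EM", "Cool"]
labels = ["Hot", "EM", "Cool"]

counts = confusion_matrix(y_true, y_pred, labels=labels)
confusion = counts / counts.sum(axis=1, keepdims=True)   # rows sum to 1
efficiency = counts / counts.sum(axis=0, keepdims=True)  # columns sum to 1

completeness = np.diag(confusion)
contamination = 1.0 - np.diag(efficiency)
```

In this toy case one of three true Hot stars is lost to the EM class, so the Hot completeness is 2/3 while the EM precision drops to 2/3 (contamination 1/3).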
True OBAe stars are either misclassified as other types of OBA star or as WRs, while stars falsely labeled as OBAe are mostly true OBA stars, with the exception of one true WR. 75% of WR stars are recovered, but only 6/30 stars identified as WRs in the test set are true WRs; given the importance of WRs for both the physics of mass loss and studying evolved massive stellar populations (Dorn-Wallenstein & Levesque 2018, 2020), future work will focus on developing a classifier specifically for identifying WR stars. All LBVs in the test set are recovered; while such high accuracy is often seen as a sign of overfitting, we choose not to focus on this subclass, given both the disputed evolutionary status of LBVs (Smith & Tombleson 2015; Humphreys et al. 2016; Aadland et al. 2018) and the fact that only two LBVs exist in the training set. Yellow supergiants are only classified with 27% accuracy.
Figure 9.
Left : Matrix showing the number of stars in the test set with true label indicated on the y-axis that are assignedthe label on the x-axis.
Center : Confusion matrix for the SVC, calculated by normalizing each row of the left panel by thetotal number of stars in that row. Values correspond to the fraction of samples in the test set with true label indicated on they-axis that are assigned the label on the x-axis, such that the values along the diagonal are the fraction of each class that iscorrectly classified.
Right: Efficiency matrix for the SVC, calculated by normalizing each column of the left panel by the total number of stars in that column. Values in each box correspond to the fraction of samples in the test set assigned the label on the x-axis that belong to the class on the y-axis, such that the values along the diagonal correspond to the precision (one minus the contamination). Darker colors in all panels correspond to more/a higher fraction of stars.
Figure 10.
Completeness versus contamination of each classin the test set. A high completeness value implies members ofthat class are accurately classified, while a low contaminationvalue implies that an object classified as such is likely tobelong to that class. The figure is roughly divided into fourquadrants; stars with classes in the bottom-right quadrantcan be considered to be well-classified, in the sense that theyhave high completeness and low contamination.
Overall, an SVC trained on these refined labels appears to have little use. With the exception of RSGs and low-mass giants, the remaining classes have low accuracy, high contamination, or both. We nonetheless use the SVC to predict labels for the 2550 stars initially labeled as "Miscellaneous Variable" or "Unknown/Candidate." We identify 79 candidate RSGs and 36 candidate C/S/Giant stars, of which we expect ∼

Table 5.
Common names and coordinates of stars predictedto be RSGs by the SVC trained on refined labels.
Common Name | R.A. [deg] | Dec [deg]
SP77 48-11 | 81.07900941 | -70.43417562
WISE J185608.58-163255.1 | 284.03575762 | -16.54867009
W61 19-14 | 83.07777354 | -67.52941938
OGLE BRIGHT-LMC-MISC-169 | 72.94711333 | -69.32348227
WISE J194127.64+385155.3 | 295.36520609 | 38.86536427
NGC 2004 BBBC 431 | 82.69161818 | -67.29036242
[KWV2015] J045626.51-692350.6 | 74.11062177 | -69.39740804
W61 6-54 | 85.54018822 | -69.21978048
WISE J064232.30-715243.3 | 100.63462144 | -71.87871874
W61 6-34 | 85.51621031 | -69.21870016
Note —This table is published in its entirety in the machine-readableformat. A portion is shown here for guidance regarding its formand content.
Performance on Coarse Labelling
Examining Figures 9 and 10, we see that, while the classifier is not especially accurate except for the coolest stars, it is roughly useful for sorting the test set into broad categories: different types of OBA stars are mostly (mis)classified as other classes of OBA stars; the same is true for emission line stars (OBAe, OB[e], WR, and LBV) and cool stars (YSG, RSG, C/S/Giant). For this reason, we also utilize the coarse labels introduced in §2. We repeat the entire process described above, beginning with the selection of the classifier. Figure 11 shows the balanced accuracy for each of the classifiers discussed above, trained on the coarse labels, using a five-fold cross-validation to optimize the hyperparameters of each classifier. We find that, once again, the SVC yields the highest balanced accuracy (0.876).

Figure 11.
Similar to Figure 8 for classifiers trained on thecoarse labels.
We keep the same scaled and imputed training and testing sets, and perform a five-fold cross-validation as before to find the optimal hyperparameters, which are kernel=rbf, C = 1, and gamma = 1/n_features. With these hyperparameters, we fit the SVC to the training set before predicting labels for the test set. Figure 12 shows the confusion and efficiency matrices similar to Figure 9, while Figure 13 shows the completeness versus the contamination similar to Figure 10. All told, the SVC performs significantly better compared to the classifier trained on the refined labels, recovering all classes with ≥75% completeness and ≤30% contamination.

Of the emission line stars that are correctly identified, 83 are OBAe stars, three are OB[e] stars, eight are WRs, and two are LBVs. This is 73%, 75%, 100%, and 100%, respectively, of these stars that are in the test set, implying that the performance of the SVC on emission line stars is not dominated entirely by OBAe stars (which comprise the majority of emission line stars in the test set). Of the stars mislabeled as contaminants, two are OBA stars and two are RSGs. One true C/S/Giant star and two Yellow Dwarfs are misclassified.

We then use the SVC to predict the coarse labels for the same 2550 stars as above. Figure 14 shows the distribution of these predicted labels. The majority (2472 stars) are labeled as Hot. 63 stars are labeled as Cool, three of which are already identified in SIMBAD as candidate AGBs or RGBs. 14 of these stars are labeled as emission line stars, of which 9-10 are likely to actually be emission line stars, assuming 30% contamination. We list all 2550 stars' common names, coordinates, and predicted coarse labels in Table A1.

4.3.1.
Feature Importance
We can also identify which features contribute most to the overall performance of the classifier on the coarse labels. To do this, we initialize a new SVC object with the same hyperparameters, and perform a greedy search over features. For each feature in the scaled and imputed training set, we train the SVC on just this feature across five stratified folds of the training set, and record the average and standard deviation of the balanced accuracy. We select the feature that yields the highest average balanced accuracy. This has the advantage of ensuring the contribution of each feature to the balanced accuracy is stable across subsets of the data. We then train the SVC on all combinations of this feature and the remaining features, selecting the combination that again yields the highest average balanced accuracy. This process is repeated until all features are used.

Figure 15 illustrates this process. The x-axis shows the feature that is selected at each stage of the greedy search. The y-axis shows the mean balanced accuracy of the SVC at that stage. Errorbars show the standard deviation of the balanced accuracy across the five folds of the training set. The balanced accuracy reaches a maximum after the first seven features: J−H, W1−W2, M_W1, K_s−W1, log EWM_W1, log χ²_red,W1, and log EWM_W1−W2. However, the contribution to the balanced accuracy from all but the first four features is small. This suggests that variability amplitude is a useful, though not critical, metric to obtain, while variability timescales are not necessary. Finally, this suggests that photometry bluer than the J band is also unnecessary.

We can also examine the importance of each feature for classifying individual classes. We perform the same greedy search over the features, instead calculating a performance metric that focuses on the performance on a specific class. One option is the F_β measure:

F_β = (1 + β²) (completeness · precision) / (β² · completeness + precision)    (7)

where β is a free parameter that sets the relative importance of completeness compared to precision. Common choices are β = 1 (i.e., F1, a harmonic mean of completeness and precision), β = 0.5, and β = 2 (Chinchor 1992). We adopt F2 (i.e., β = 2), because we prioritize generating complete samples of rare massive stars.

The left panels of Figure 16 show the F2 measure as a function of successively added features, calculated specifically for hot stars (top), emission line stars (second panel), cool stars (third panel), and contaminants (bottom). The results for both hot and cool stars are mostly similar to the results for the overall classifier in Figure 15, in the sense that the best performance is reached after including a mix of near- and mid-infrared
Figure 12, left panel: counts of test-set stars by true label (rows) versus predicted label (columns), in the order Hot, EM, Cool, Contaminant:
Hot: 626, 40, 4, 2
EM: 21, 96, 10, 0
Cool: 26, 2, 286, 2
Contaminant: 1, 0, 2, 34
Figure 12.
Similar to Figure 9 for the SVC trained using the coarse labels. Note that significantly more stars fall along the diagonal of each plot, reflecting the improved performance of the SVC on the coarse labels.
Figure 13.
Similar to Figure 10 for the SVC trained using the coarse labels. All coarse classes have high completeness and low contamination.

Figure 14.
Distribution of coarse labels assigned to the 2550 stars with no previously known class.

colors and magnitudes. The main difference is that M_G is the fourth most important feature for classifying hot stars.

For emission line stars, a maximum in the mean F2 is reached after 11 features: W1−W2, M_W1, G−J, log ⟨Δt⟩_W1, log ⟨Δt⟩_W2, log σ_Δt,W1, W2−W3, W3−W4, K_s−W1, ν_W1, and J−H. However, given the errorbars, only the first three features contribute meaningfully, with the remaining features consistent with a constant value of F2. Interestingly, compared to its contribution to the overall balanced accuracy of the classifier, bluer photometry (signified by the presence of G−J in the above list) is much more important for identifying emission line stars. While variability metrics are included in the above list, they do not significantly contribute to the F2 score.

For contaminants, a total of 15 features are required in order to maximize F2: J−H, log σ_Δt,W1−W2, log σ_Δt,W1, log ⟨Δt⟩_W1, log ⟨Δt⟩_W1−W2, ν_W1, log σ_Δt,W2, ν_W1−W2, M_W1, H−K_s, K_s−W1, W1−W2, log χ²_red,W1−W2, log EWM_W1−W2, and log χ²_red,W1. Notably, the F2 measure first decreases as features are added, before increasing to the maximum after M_W1. This trend is unintuitive compared to the other panels in the Figure. It may be a result of the fact that increased features improve the precision of the classifier at the cost of completeness, resulting in a decrease in F2 due to the increased weighting of completeness.

To demonstrate the capabilities of the classifier using a limited set of features, we plot the scaled and imputed test set (which was not used in the greedy search algorithm) in the right panels of Figure 16, using only the two most important features in each row. Stars belonging to the corresponding coarse class are plotted as larger, colored points. In all cases, most members of the test set with that label are well-separated from the other stars. As expected from the left panels, most of the separation is along the x-axis, which corresponds to the most important feature, with the second-most important feature plotted on the y-axis providing some additional differentiation, especially for hot stars. In the case of contaminants, the second-most important feature provides little-to-no additional information, consistent with the lack of change in F2 with increased features.

Figure 15.
Mean balanced accuracy of the SVC for coarse labels trained on successively added features, calculated from five stratified folds of the data. The balanced accuracy reaches a maximum after the first seven features. Error bars indicate the standard deviation of the balanced accuracy across folds.
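The procedure behind this figure (greedily adding one feature at a time and scoring each candidate set by the mean balanced accuracy over five stratified folds) can be sketched with scikit-learn, which this work uses. The synthetic feature table, RBF kernel, and median imputation below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the photometric and variability feature table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def score_subset(cols):
    """Mean balanced accuracy over five stratified folds, with
    imputation and scaling refit inside each training fold."""
    model = make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler(),
                          SVC(kernel="rbf", class_weight="balanced"))
    return cross_val_score(model, X[:, cols], y, cv=cv,
                           scoring="balanced_accuracy").mean()

selected, remaining, history = [], list(range(X.shape[1])), []
while remaining:
    # Greedy step: add the single feature that maximizes the CV score.
    best = max(remaining, key=lambda f: score_subset(selected + [f]))
    selected.append(best)
    remaining.remove(best)
    history.append(score_subset(selected))

print(selected)   # feature indices in order of addition
print(history)    # mean balanced accuracy after each addition
```

The `history` list traces out a curve like the one in the figure; in practice one stops adding features once the score plateaus within its fold-to-fold scatter.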
We conclude that while small numbers of features can be used to identify hot, cool, and (remarkably) emission line stars, a large number of features is necessary to maximize the number of accurately identified contaminants. These include time-domain features, in which the drastically different structures of old AGB and RGB stars compared to massive cool supergiants may be imprinted. Extragalactic massive star samples are already contaminated by foreground giants in the Milky Way halo; distant stars that can be resolved by JWST and
Roman will have comparable brightnesses to cool dwarfs that are too faint to be filtered out using astrometry from
Gaia. Developing the infrastructure to reliably remove these contaminating objects from massive star samples will be that much more critical.

DISCUSSION & CONCLUSION

In the coming decades, space-based infrared observatories like
Webb and
Roman will give us access to unprecedentedly large samples of evolved massive stars. Therefore, we need to be prepared to leverage these data to search for stars in the most interesting evolutionary states. Obtaining spectroscopy of individual stars does not scale well to the size of the expected samples, while linear cuts in color-magnitude space are too simplistic and ignore emission line objects. Here we have demonstrated the promising performance of a support vector machine trained on optical through mid-infrared photometry and simple variability metrics. However, with currently available labels, we are not able to construct a classifier that performs well at the level of granularity needed for many science cases. Our main results are summarized as follows:

• We have assembled a large sample of evolved massive stars using distances from
Gaia
DR2 and Bailer-Jones et al. (2018), with high precision infrared photometry from
Gaia, 2MASS, and WISE.

• Using SIMBAD, we assign labels to all stars, and find that the sample contains a number of low mass contaminants.

• We find that, of the classification methods we applied, a support vector machine classifier (SVC) is best at accurately labeling evolved massive stars. The SVC is fast, and has the added benefit that the underlying mathematics are well-understood.

• The SVC trained on refined labels is capable of identifying low mass red giant contaminants with high accuracy. However, the overall performance of this classifier is quite poor, and we do not recommend its use at present.

• The SVC trained using coarse spectral types performs better, as measured with the balanced accuracy score. We find higher completeness and lower contamination (Figures 12 and 13) compared to the SVC trained on the refined labels (Figures 9 and 13). With this classifier, we identify 14 candidate emission line stars from a sample of ∼

• We find that the SVC performs equally well with only a small subset of features. These features are mostly infrared colors and absolute magnitudes — i.e., those least affected by reddening — with small contributions from infrared variability metrics. However, if we change our performance
Figure 16.
Left panels are similar to Figure 15, except using the F measure calculated for hot stars (top), emission line stars (second panel), cool stars (third panel), and contaminants (bottom). The right panel in each row shows a scatter plot of only the first and second most important features (indicated with blue and red text, respectively) drawn from the test set, with stars belonging to the corresponding class in each row highlighted. Note that the features plotted are the scaled and imputed values, not the original values listed in Table 4.

metric to one that focuses on emission line stars, optimal performance of the classifier requires some red-optical photometry. We find that the added benefit of using variability metrics may not be worth the investment in telescope time required to measure them.

Ultimately, the performance of the SVC trained on the refined labels is poor. All stars in the sample are bright (W <

and more that can be used to accurately classify massive stars. In order to prepare ourselves for the era of
Webb and
Roman, we must develop pipelines specifically tuned for evolved massive stars. This is especially true for the classes that are underrepresented in our dataset, i.e., rare emission line stars.

Along with better labels, more data will become available via future data releases of the
Gaia mission. The recent early third
Gaia release contains modest improvements in precision and sample size that are unlikely to affect our results given the high quality of the photometry in our sample. However, the full
Gaia
DR3 will contain low-resolution spectra as well as epoch photometry for a limited number of sources, which have the potential to significantly improve the performance of a machine learning classifier. On the horizon, the Legacy Survey of Space and Time (LSST) conducted at the Vera Rubin Observatory will measure the multi-color variability of massive stars at higher cadence, while its large telescope aperture will help define a much larger sample. As we demonstrate with Figure 16, it is possible to select features that maximize the performance of the SVC for specific classes. With a larger sample, we may be able to optimize the SVC to search for specific classes of evolved massive stars.

ACKNOWLEDGMENTS

The authors acknowledge that the work presented was largely conducted on the traditional land of the first people of Seattle, the Duwamish People past and present, and honor with gratitude the land itself and the Duwamish Tribe.

This research was supported by NSF grant AST 1714285 awarded to EML. JRAD and DH acknowledge support from the DiRAC Institute in the Department of Astronomy at the University of Washington. The DiRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, and the Washington Research Foundation.

DH is supported by the Women In Science Excel (WISE) programme of the Netherlands Organisation for Scientific Research (NWO).

This project was developed in part at the 2018 Gaia Sprint, hosted by the eScience and DiRAC Institutes at the University of Washington, Seattle.

This research has made use of the VizieR catalogue access tool, CDS, Strasbourg, France (DOI: 10.26093/cds/vizier). The original description of the VizieR service was published in A&AS 143, 23. This research has made use of the SIMBAD database, operated at CDS, Strasbourg, France.
This publication makes use of data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation.

This work made use of the following software:
Software:
Astropy v3.2.2 (Astropy Collaboration et al. 2013; The Astropy Collaboration et al. 2018), Astroquery v0.3.10 (Ginsburg et al. 2019), Matplotlib v3.1.1 (Hunter 2007), makecite (Price-Whelan et al. 2018), NumPy v1.17.2 (Van Der Walt et al. 2011), Pandas v0.25.1 (McKinney 2010), Python 3.7.4, Scikit-learn v0.21.3 (Pedregosa et al. 2011), Scipy v1.3.1 (Jones et al. 2001)

APPENDIX

A. COARSE LABELS FOR 2550 STARS

Table A1 lists all 2550 stars with no known label, and predicted labels generated by the SVC trained on coarse labels.

REFERENCES
Aadland, E., Massey, P., Neugent, K. F., & Drout, M. R. 2018, AJ, 156, 294, doi: 10.3847/1538-3881/aaeb96

Ardeberg, A., Brunet, J. P., Maurice, E., & Prevot, L. 1972, Astronomy and Astrophysics Supplement Series, 6, 249

Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33, doi: 10.1051/0004-6361/201322068

Table A1.
Common names, coordinates, and predicted labels of 2550 stars input to the SVC trained on coarse labels, ordered by Right Ascension.
Common Name                R.A. [deg]   Dec [deg]     Predicted Coarse Label
WISE J000559.28-790653.3   1.49713706   -79.11483482  Hot
TYC 4500-1480-1            2.86210879    79.08686958  Hot
BD+61 45                   5.25504760    62.77064970  Hot
NGC 104 LEE 2520           5.41170226   -72.21106679  Hot
WISE J002203.44-693554.7   5.51434821   -69.59851087  Hot
WISE J002207.43-742212.1   5.53102165   -74.37003199  Hot
WISE J002318.05-742326.4   5.82523611   -74.39068759  Hot
WISE J002340.20-750446.9   5.91756693   -75.07972556  Hot
WISE J002758.92-764527.2   6.99552600   -76.75757402  Hot
WISE J002759.32-742119.8   6.99734043   -74.35552728  Hot
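A machine-readable version of Table A1 can be filtered for follow-up candidates with pandas, which this work uses. A minimal sketch, using the first few rows shown above; the column names are assumptions, since the machine-readable table's exact schema is not reproduced here:

```python
import pandas as pd

# First rows of Table A1 as shown above; column names are assumed.
table_a1 = pd.DataFrame({
    "common_name": ["WISE J000559.28-790653.3", "TYC 4500-1480-1",
                    "BD+61 45"],
    "ra_deg": [1.49713706, 2.86210879, 5.25504760],
    "dec_deg": [-79.11483482, 79.08686958, 62.77064970],
    "predicted_label": ["Hot", "Hot", "Hot"],
})

# Select predicted hot stars in the southern hemisphere, sorted by R.A.
south_hot = (table_a1[(table_a1["predicted_label"] == "Hot")
                      & (table_a1["dec_deg"] < 0)]
             .sort_values("ra_deg"))
print(south_hot["common_name"].tolist())
# → ['WISE J000559.28-790653.3']
```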