[PDF] Examining the Radius Valley: a Machine Learning Approach

Abstract

The ''radius valley" is a relative dearth of planets between two potential populations of exoplanets, super-Earths and mini-Neptunes. This feature appears in examining the distribution of planetary radii, but has only ever been characterized on small samples. The valley could be a result of photoevaporation, which has been predicted in numerous theoretical models, or a result of other processes. Here, we investigate the relationship between planetary radius and orbital period through 2-dimensional kernel density estimator and various clustering methods, using all known super-Earths ( R<4.0 R E ). With our larger sample, we confirm the radius valley and characterize it as a power law. Using a variety of methods, we find a range of slopes that are consistent with each other and distinctly negative. We average over these results and find the slope to be m=− 0.319 +0.088 −0.116 . We repeat our analysis on samples from previous studies. For all methods we use, the resulting line has a negative slope, which is consistent with models of photoevaporation and core-powered mass loss but inconsistent with planets forming in a gas-poor disk.

Full PDF

MMNRAS , 1–10 (2019) Preprint 30 May 2019 Compiled using MNRAS L A TEX style ﬁle v3.0

Examining the Radius Valley: a Machine LearningApproach

Mariah G. MacDonald, , (cid:63) Department of Astronomy & Astrophysics, The Pennsylvania State University, 525 Davey Lab, State College, PA, 16802, USA Center for Exoplanets and Habitable Worlds, The Pennsylvania State University, 525 Davey Lab, State College, PA, 16802, USA

Accepted XXX. Received YYY; in original form ZZZ

ABSTRACT

The ”radius valley” is a relative dearth of planets between two potential populationsof exoplanets, super-Earths and mini-Neptunes. This feature appears in examiningthe distribution of planetary radii, but has only ever been characterized on smallsamples. The valley could be a result of photoevaporation, which has been predictedin numerous theoretical models, or a result of other processes. Here, we investigatethe relationship between planetary radius and orbital period through 2-dimensionalkernel density estimator and various clustering methods, using all known super-Earths( R < . R E ). With our larger sample, we conﬁrm the radius valley and characterize itas a power law. Using a variety of methods, we ﬁnd a range of slopes that are consistentwith each other and distinctly negative. We average over these results and ﬁnd theslope to be m = − . + . − . . We repeat our analysis on samples from previous studies.For all methods we use, the resulting line has a negative slope, which is consistent withmodels of photoevaporation and core-powered mass loss but inconsistent with planetsforming in a gas-poor disk. Key words: planetary systems – methods: statistical

With the launch of

Kepler in 2009 and the following yearsof analysis, the astronomical community witnessed the num-ber of known exoplanets skyrocket from a couple hundredto a couple thousand (Lissauer et al. 2011, 2014; Fabryckyet al. 2014). Newfound discoveries such as compact systems(Lissauer et al. 2011; Muirhead et al. 2015), resonant chains(Lithwick & Wu 2012; Fabrycky et al. 2014), and low den-sity planets (e.g., Welsh et al. 2015; Weiss & Marcy 2014;Ford 2017) led to numerous studies of the exoplanet popula-tion as a whole, as well as countless studies focused on justone or a few systems. Many of these studies aimed to bettercharacterize exoplanets and estimate the planet parametersto gain insight into planet formation.Several theoretical models predict that the size of plan-ets can be strongly inﬂuenced by the host stars. In particu-lar, formation models ﬁnd that atmospheric erosion, whichmostly aﬀects planets near their star, will result in a so-called ”photoevaporation valley,” a gap in the distribution (cid:63)

E-mail: [email protected] of planetary radii around R E (Owen & Wu 2013; Jin et al.2014; Lopez & Fortney 2014; Chen & Rogers 2016; Lopez& Rice 2016; Owen & Wu 2017). This valley is establishedby the boundary between planets that are massive enoughto keep their gas envelopes and planets whose atmospheresare stripped away. The speciﬁc slope of the valley dependson various factors, such as the composition of the planetsand the physics of the evaporation (e.g., Owen & Wu 2017),but the overall sign of the slope is negative. In particular,the separation appears linear in a log-log plot of planetaryradius vs. orbital period:log R p = m · log P + b (1)where P is the orbital period, R p is the planetary radius, m is the slope in log-log space, and b is the y-intercept inlog-log space. In linear space, this translates to a power law R p = e b · P m .Actually observing this valley can be challenging, asuncertainties in stellar parameters can obfuscate its pres-ence and render characterization of the valley, to aid in un-derstanding planet formation, quite diﬃcult. Fulton et al.(2017) ﬁrst observed the gap in distribution of planetary © a r X i v : . [ a s t r o - ph . E P ] M a y M. G. MacDonald radii using precise radius measurements from the California-Kepler Survey and a sample of 2025 exoplanets. They notethe gap as a dearth of planets with radii between 1.5 and 2.0 R E and suggest that this limit essentially separates super-Earths from mini-Neptunes, meaning that it determineswhether a planet has a substantial H/He envelope. UsingGaia parallaxes, Fulton & Petigura (2018) released an up-dated catalog of ∼ ∼ R p ∼ R E . Bergeret al. (2018) also released an updated catalog with stellarradii derived from a combination of Gaia DR2 parallaxesand the DR25 Kepler Stellar Properties Catalog. They con-ﬁrm a gap in the distribution of radii with their updatedvalues, noting that the gap is mostly limited to large inci-dent ﬂuxes ( > F E ). These three studies did not attemptto constrain the valley as a function of orbital period.Soon after Fulton et al. (2017) released their study, VanEylen et al. (2018) explored the location and shape of theradius valley using 117 planets with accurate parametersdetermined from asteroseismology. Using a linear model (inlog-log space) and an MCMC algorithm, they ﬁrst charac-terize the valley by ﬁtting the absence of data, recovering aslope of m = − . ± . . They then use support vector ma-chines to determine the hyperplane of maximum separationbetween the planets above and below the valley, resulting ina slope of m = − . + . . . They conclude that the negativeslope they measured is consistent with models of photoevap-oration.There are a few possible explanations for this radiusvalley aside from mass loss by photoevaporation, includingsculpting by giant impacts (e.g., Liu et al. 2015; Schlicht-ing et al. 2015; Inamdar & Schlichting 2016), core-poweredmass loss (e.g., Ginzburg et al. 2017; Gupta & Schlichting2018), and planets forming late in a gas-poor disk (e.g., Leeet al. 2014; Lee & Chiang 2016). It is unclear whether im-pacts alone can cause a gap in the distribution of planetaryradii since impact erosion is a highly stochastic process, butatmospheric heating caused by an impact can cause the en-velope to expand, which would make it more susceptible tophotoevaporation. Core-powered mass loss results in a neg-ative slope in the valley and can suﬃciently explain the gapin the distribution of planetary radii (Ginzburg et al. 2018;Gupta & Schlichting 2018). Lopez & Rice (2016) explore theidea that super-Earths near their stars are actually a sepa-rate population of planets that developed late in a gas-poordisk and never accreted signiﬁcant envelopes, instead of ac-creting atmospheres that are then later stripped away. Theyfound that the transition radius between this population andother planets depends on orbital period R ∝ P m , where m isbetween 0.07 and 0.10. Therefore, forming in a gas-poor diskwould result in the slope of a radius valley that is positive.To date, there has been no observational study thatlooks to characterize the radius valley, in both radius andorbital period, without suﬀering from a small sample size.Here, we aim to characterize the valley by measuring itsslope and to identify the dominant mechanism that is re-sponsible for sculpting the radii of exoplanets. We investi-gate the existence and slope of the radius valley using all Data set Sample Size < R σ > ( R E ) RefAll 2066 0.252 –F17 766 0.19 Fulton et al. (2017)V18 81 0.0615 Van Eylen et al. (2018)F18 1334 0.089 Fulton & Petigura (2018) Table 1.

The number of planets and average (median) uncer-tainty in planetary radius for planets in each of the data sets weuse. For data set ”all,” we pull all exoplanets from the exoplanetarchive (April 5, 2019). We only use planets with published un-certainties in both planetary radius and in orbital period. Wealso limit all samples to super-Earths ( R < . R E ) with orbitalperiods P < days. known exoplanets with orbital period and planetary radiusmeasurements of P < , R p < . R E .The outline of this paper is as follows: in Section 2,we describe the data used in this study. In Section 3, wedescribe and explain the theories behind the methods andtests used to examine the radius valley. Following this, weshow and discuss the results from our analysis in Section 4and Section 5 before concluding in Section 6. We use all known exoplanets in our analyses, pullingdata from the exoplanetarchive . Although exoplanetarchivekeeps track of all published parameters for planets, we useonly the most recent parameters for the study presented inthis paper. We include planets from all detection methods,although Kepler contributed the most data, resulting in 3933known exoplanets. We only use planets with published val-ues and uncertainties for radius and orbital period and limitour sample to P < days and R p < . R E , resulting in 2066planets. The Solar System was not included in this study.We perform our analysis on other data sets that havebeen used to study this phenomenon in the past, particu-larly those of Fulton et al. (2017), Van Eylen et al. (2018),and Fulton & Petigura (2018). We refer to these data sets as”F17,” ”V18,” and ”F18” from here on. We present informa-tion about each data set, including number of planets andaverage uncertainty in planetary radius, in Table 1. We analyze the exoplanetary population using 2-D kerneldensity estimator, hierarchical clustering, K -means cluster-ing, K -medians clustering, and density-based clustering. Themethods used here and the theories behind them are brieﬂydescribed below. We perform all of our analysis in the Rlanguage (R Core Team 2017). exoplanetarchive.ipac.caltech.edu, data pulled on April 5, 2019MNRAS000 , 1–10 (2019) adius Valley A kernel density estimation is a nonparametric estimationof probability density functions. The kernel density estimateis deﬁned by: ˆ f ( x ; H ) = n − n (cid:213) i = K H ( x − X i ) (2)where x = ( x , x ) T and X i = ( X i , X i ) T for i =1,2,...,n a bi-variate random sample drawn from a density f , K ( x ) is thekernel which is a symmetric probability density function,and H is the bandwidth matrix which is symmetric andpositive-deﬁnite. Although the choice of the kernel is notvery important, in this analysis we use an axis-aligned bi-variate normal kernel, evaluated on a square grid. A KDEis dependent on the distribution of data given to it, so ourresults will be sensitive to incompleteness and eﬀects frombiases. We also must choose a bandwidth for both orbitalperiod and planetary radius; the results of the KDE are notvery dependent on the bandwidths, but bandwidths that aretoo large will result in a uni-modal distribution and band-widths that are too small will result in multi-modal distri-butions. We choose bandwidths that lead to bi-modal dis-tributions so that we are able to characterize a line betweenthe modes. In our analysis, we use the function kde2d fromCRAN package MASS (Venables & Ripley 2002). For a morein-depth description of KDEs and their applications, we en-courage the reader to explore Silverman (2018).

Hierarchical clustering is a method of cluster analysis thatbuilds a hierarchy of clusters. Originally, each object receivesits own cluster; the algorithm computes the distance be-tween clusters via the Lance-Williams dissimilarity formulaand combines the two most similar clusters. The Lance-Williams algorithms are a family of agglomerative clusteringalgorithms which are represented by a recursive formula. Af-ter two clusters are merged, the updated cluster distance iscomputed recursively: d ( ij ) k = α i d ik + α j d jk + β d ij + γ | d ik − d jk | (3)where i and j indicate the clusters being merged, d is thecluster distance and α i , α j , β , and γ are parameters. We usecomplete linkage method to ﬁnd similar clusters, where thedistance between two clusters is the distance between thetwo elements (one in each cluster) that are farthest awayfrom each other. This process continues until all objects havebeen sorted into one cluster, and a decision tree has beenformed. Simply put:1. Assign every point to its own cluster.2. Compute the distance between each cluster.3. Merge the two closest clusters.4. Update the distances between the new cluster andthe original clusters.5. Repeat steps 3 and 4 until only a single clusterremains.We cut the hierarchical tree at two branches so that wehave two populations (super-Earths and mini-Neptunes).One caveat to this method is we cannot control along which access the resulting two clusters are separated. In our anal-ysis, we use the function hclust from CRAN package MASS (Venables & Ripley 2002). We refer the reader to John-son (1967) for a discussion on various hierarchical clusteringschemes.

Originally from signal processing, K -means clustering is amethod of vector quantization. We use the original algo-rithm by Hartigan & Wong (1979) which aims to separatethe data into K number of groups such that the sum ofsquares from all points to the cluster centers is minimized.Mathematically, the algorithm aims to ﬁnd: arg min ¯ S K (cid:213) i = (cid:213) ¯ x ∈ S i || ¯ x − µ i || = arg min ¯ S K (cid:213) i = | S i | Var S i (4)where ¯ x = { x , x , ..., x n } is a data set, ¯ S = { S , S , ..., S k } isa set of clusters, and µ i is the mean of points in S i . Simplyput: 1. Select K points as the initial centeroids.2. Assign all points to the closest centeroid (mean pointof cluster).3. Recompute the centroid (mean) of each cluster.4. Repeat steps 2 and 3 until the centroids no longerchange.We choose K = for two clusters. In our analysis, we use thefunction kmeans from base R (R Core Team 2017). We referthe reader to MacQueen et al. (1967) or Hartigan & Wong(1979) for a more thorough description of the algorithm. K -medians clustering is another method of vector quanti-zation. We use the same algorithm as presented above inSection 3.3. The only diﬀerence between the two methodsis µ i is now the median of the points in each cluster S i , sothe centeroid is the median point of the cluster. We choose K = for two clusters. In our analysis, we use the function kGmedian from CRAN package Gmedian (Cardot 2017). Weencourage the reader to explore Mulvey & Crowder (1979)for a detailed description and analysis of this method.

Density Based Spatial Clustering of Applications with Noise(DBSCAN) is a clustering technique that was designed todiscover the clusters and the noise in a spatial database (Es-ter et al. 1996). The aim of the algorithm is for each point ofa cluster in the ”neighborhood” of a given radius (cid:15) to containat least a minimum number of points. In other words, thedensity in the ”neighborhood” has to exceed some threshold.The algorithm works as:1. Start with an arbitrary point p in the data andretrieve all points that are ”density-reachable,”i.e. the other point belongs in the ”neighborhood”of the initially selected point p and the”neighborhood” contains the minimum number ofpoints.2. Repeat step 1 until some points are density-reachable. MNRAS , 1–10 (2019)

M. G. MacDonald

3. Calculate the distances between the resultingclusters.4. Merge clusters that are at least as dense as thethinnest cluster if the two clusters are closer thanthe given radius (cid:15) .5. Repeat steps 3 and 4 until no more mergers occur.The choice of distance function that is used will determinethe shape of the ”neighborhood,” but the resulting clusterscan have any shape (Ester et al. 1996). We use the Man-hattan distance in 2D space which results in a rectangularneighborhood. We randomly draw our (cid:15) and minimum num-ber of points from uniform distributions such that the clus-tering results in two populations. DBSCAN speciﬁcally willaccount for noise in the data, where noise is simply the datapoints that do not belong to any clusters. This method istherefore less sensitive to outliers than other methods (Es-ter et al. 1996; Campello et al. 2013). It also means thatthe clusters are smaller (spatially, and contain fewer points)than those from the other methods which force all points intoa cluster. In our analysis, we use the function dbscan fromCRAN packages dbscan (Hahsler & Piekenbrock 2018). Werefer the reader to Ester et al. (1996) for the full mathemat-ical deﬁnition and implementation of the algorithm.

We use a variety techniques so that our resulting slope is notdependent on the technique. However, many computer sci-entists and statisticians have been studying and comparingthese methods for decades. The general consensus is that avariety of clustering techniques should be studied for eachnew data set, given that diﬀerent methods are sensitive todiﬀerent pit-falls (e.g., Ester et al. 1996; Chen et al. 2002;Mingoti & Lima 2006; Sharma et al. 2012).In general, hierarchical clustering or density-based clus-tering are considered to be superior when the number ofclusters is unconstrained or poorly constrained. Partition-ing methods such as K -means or K -medians, however, can bemuch faster and less computationally intensive (e.g., Chenet al. 2002; Sharma et al. 2012). For the problem presentedhere, we know that we want two populations (super-Earthsand mini-Neptunes) so this superiority no longer applies. In-stead, the hierarchical clustering introduces a bit of a prob-lem in our slope determination: given that all of the datahave been sorted into one cluster via a tree, and we simplycut this tree at two branches, the axis of separation betweenthe two clusters can readily be along orbital period insteadof planetary radius. Oftentimes, the two populations will beseparated along R p ∼ R E , but sometimes the populationswill be separated by P ∼ days. This results in lines thatare near-vertical with very negative slopes.Chen et al. (2002) found that, in analyzing ES cell geneexpression data, hierarchical clustering performed worsethan K -means clustering. They attributed its poor perfor-mance to the algorithm’s ”greediness” since it is not possibleto do any reﬁnement or correction once two clusters havemerged. Other studies have found similar results (e.g., Min-goti & Lima 2006).DBSCAN has been noted to be ineﬃcient for high-dimensional data sets, but is often praised for its notion ofnoise and ability to ﬁnd arbitrarily shaped clusters, includ- ing clusters that are completely surrounded by other clusters(Ester et al. 1996; Sharma et al. 2012; Campello et al. 2013).It has also been found to be at least 100 times more eﬃcientthan CLARANS, a K -medoid algorithm similar to K -meansand K -medians (Ester et al. 1996).Both K -means and K -medians clustering are typicallyfaster than hierarchical clustering, especially for large datasets (Ester et al. 1996; Sharma et al. 2012), but do notwork very well with non-globular clusters (Sharma et al.2012). K -means clustering is also aﬀected by the presenceof a large amount of outliers ∼

40% (Mingoti & Lima 2006).The accuracy of K -means and K -medians clustering is alsoquite dependent on the choice of initial centeroids (Milligan1980). To help mitigate this dependence, our bootstrappingtechnique starts with diﬀerent centeroids each time for bothmethods. Our clusters are also fairly globular in shape anddo not contain so many outliers, so these pit-falls do noteﬀect our results. Here we analyze the resulting slopes from our various meth-ods and data sets. We report our bestﬁt slope and its uncer-tainty as the median and 16th and 84th percentiles from acombination of all of our bootstrap realizations from usingall known exoplanets with period P < days and radius R p < . R E . We report the resulting slopes from our diﬀer-ent data sets and methods in Table 2. We aim to recover the slope of the linear separation be-tween the two populations to help determine the cause ofthis observed dichotomy. For the 2-D KDE, we perform aform of bootstrapping to determine the median and uncer-tainty in the slope. For each of our 200 bootstrap samples, wedraw (with replacement) 2000 planets from the data set. Werandomly select our bandwidths for orbital period and forplanetary radius from U[2.0,15.0] days and U[0.2,0.4] R E , re-spectively, for each realization. These bandwidths mark thelimits for a bi-modal distribution. For each realization, weﬁrst ﬁnd the line between the two peaks in the full period-radius distribution, where period is in days and planetaryradius is in R ⊕ . We then take the line that is perpendicularto this line to be our separator.For each of our clustering methods (hierarchical, K -means, K -medians, density-based), we ﬁt a line separatingthe two categories using a linear discriminant analysis, a gen-eralization of Fisher’s linear discriminant (Fisher 1936) thataims to ﬁnd a linear combination of features that separatestwo populations of objects. Although an in-depth derivationof this process is beyond the scope of this work, we referthe reader to Mika et al. (1999) for an excellent review. Weagain perform a form of bootstrapping where we draw (withreplacement) 2000 planets from the data set 200 times. Forthe density-based clustering, we choose a bandwidth (cid:15) fromU[0.315,0.362] for each realization which results in two clus-ters. MNRAS000

2D Kernel Density estimation of planetary radius ver-sus orbital period for all known exoplanets with period P < days and radius R p < . R E . We did not use planets that did nothave published values for period, radius, or uncertainty in radius.We performed a bootstrapping variant, varying the bandwidthfor period between 2.0 and 15.0 days and the bandwidth for ra-dius between 0.2 and 0.4 R E , although the resulting slopes werenot very dependent on the bandwidths. For each permutation,we also draw (with replacement) 1000 planets. We account foruncertainties in radius by drawing the planets’ radii from normaldistributions where the means were the published values and thestandard deviations were the published uncertainties. Here, weplot the resulting 200 slopes overtop a kernel density estimationwith bandwidths of 13.5 days and 0.4 R E . The resulting slope es-timation for ”all” data was m = − . + . − . , suggesting that theradius valley we see is likely an artifact of photo-evaporation orcore-powered mass loss and not of planets forming in a gas-poordisk. We apply a two-dimensional kernel density estimator (2-DKDE) to the data with an axis-aligned bivariate normal ker-nel, evaluated on a square grid. We perform a bootstrappingvariant, sampling our data and measuring the line betweenpopulations 200 times. For each sample, we draw with re-placement 2000 planets from the data set (see Section 2 fora description of each set). We plot the KDE with the result-ing separators in Figure 1. The resulting slope estimationfor ”all” data is m = − . + . − . where our uncertainties markthe 16th and 84th percentiles. We present the slope esti-mates and uncertainties for all methods and data sets inTable 2.We next cluster our data using a variety of methods de-scribed in Section 3: hierarchical clustering, K -means clus-tering, K -medians clustering, and density-based clustering.For each of our clustering methods, we repeat the bootstrap-ing variant as mentioned above, selecting 2000 planets withreplacement from the data set. We then cluster the data intotwo clusters for the K -means and K -medians clustering. Forthe hierarchical clustering, we cut the tree at two branchessince hierarchical clustering clusters the data into an entire l ll ll ll ll ll ll l l lll l l ll ll ll lllll l ll l l llll ll ll ll l ll llll llll l ll l ll ll lllll ll ll l ll ll ll ll ll l lll lll lll lll ll llll ll ll ll l lll lll ll lllll l l l ll l ll llll l ll lll l l lllll l l lll l ll ll ll ll ll lll lllll ll ll ll l lll ll l ll l lll ll lll ll lll lll ll lll llll l lll l l llll ll lll llll ll ll llll l llll ll ll ll ll llll ll lll ll ll l lllll ll ll lll l lll lll l ll llll l ll l l ll ll l lll lll l lll lllll llllll l ll l l lll ll ll lll ll ll lll l llll l ll ll lll ll ll ll l ll ll ll ll lll ll lll l ll l llll ll l lll l l lll lll l llll ll lll ll ll ll ll llll ll ll lll llll ll lll lll llll l ll lllll ll lll lll l lll l ll l ll ll ll l lll ll l lll l l l ll l l lll ll l l l ll l l ll ll l ll l ll ll l llll lllll ll l ll ll l lll llll lll l ll ll l llll ll lll ll ll l l ll ll llll lll ll l ll l ll l ll ll llll l l l lll l ll ll llll l llll ll lll l ll l ll ll ll ll ll ll l ll lllll ll ll l ll ll ll l l llll l l l l l ll ll l l lll ll l llll ll ll l l llll lll l ll l lll ll ll ll ll lll lll ll ll lll l lll ll lll l ll ll l ll l ll ll llll lll l lll ll lll l ll l lllll lll ll lll ll ll l ll l lll ll lll l ll l lll l ll l lll ll l ll ll l l l l l l ll Period (d)0.2 0.5 1 2 5 10 20 50 . R ad i u s ( R E ) . . Figure 2.

The result of K -means clustering, with lines deﬁningthe radius valley from all bootstrap realizations from all methods(2000 lines) for all known exoplanets with period P < daysand radius R p < . R E . The radius valley does not appear tobe much of a valley, since it is not a complete absence of planetsin that region of parameter space, but clustering does split theplanets into two populations, usually about R ∼ R ⊕ . Diﬀerentmethods result in diﬀerent slopes and precisions, but all methodsand data sets are consistent with a negative slope, suggesting thatthe radius valley is likely a result of photoevaporation (e.g., Owen& Wu 2017) or core-powered mass loss (e.g., Gupta & Schlichting2018) and not of planets forming in a gas-poor disk. tree of classiﬁcation. This sometimes led to the clustering be-ing solely period-based, resulting in very large and negativeslopes. For each clustering method and each bootstrappedrealization, we ﬁt a line between the two clusters, as de-scribed above in Section 4.1, and take the median slope tobe our bestﬁt and the 16th and 84th percentiles as our un-certainties in the slope. The resulting slope estimation be-tween the two populations for ”all” data is m = − . + . − . for all clustering methods with a resulting y-intercept of b = . + . − . . We show a realization of clustering with allﬁtted lines plotted on top in Figure 2.We present the slope estimates and uncertainties forall methods and data sets in Table 2. We take our best-ﬁt slope to be the result of ”all” data set and all methods m = − . + . − . . All methods and data sets result in con-sistent and distinctly negative slopes, which are consistentwith both models of photoevaporation (e.g., Owen & Wu2017) and core-powered mass loss (e.g., Gupta & Schlicht-ing 2018), but inconsistent with planets forming late in agas-poor disk. We examine the distributions of planetary radii for each ofthe data sets in an attempt to observe the gap noted byprevious studies (e.g., Fulton et al. 2017; Van Eylen et al.2018). First, we create samples of equal size from each data

MNRAS , 1–10 (2019)

M. G. MacDonald

Data set Num. points Kmeans Kmedians DBSCAN HIER KDE CLUST ALL MAll 2066 − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . -0.319 + . − . All, σ r − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . All P < − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . All P < , σ r − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . F17 766 − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . F17, σ r − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . V18 81 − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . V18, σ r − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . F18 1334 − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . F18, σ r − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . − . + . − . Table 2.

Resulting slope measurements and their uncertainties for our diﬀerent methods and data sets. Here, the ”all” data set refers toall known exoplanets with orbital period and planetary radius estimates and uncertainties and P < days and R P < . R E , σ r refersto data sets where we accounted for uncertainties in the planetary radii by drawing 1000 planets with replacement from the data setand drawing each planet’s radius from a normal distribution centered on the nominal value with standard deviation of the uncertaintyin the measurement, and P < refers to adding a cut to orbital period at 25 days to help mitigate the eﬀects of incompleteness. Theother samples refer to data sets used in F17: Fulton et al. (2017), V18: Van Eylen et al. (2018), and F18: Fulton & Petigura (2018). SeeSection 2 for a description of each data set. ”DBSCAN” is density-based clustering, ”HIER” is hierarchical clustering cut at two branches,”KDE” is 2D kernel density estimation, ”CLUST” is the results from all clustering methods (excluding KDE), and ”ALL M” is the resultsfrom all methods. See Section 3 for a description and comparison of all methods. All measurements from all methods and datasets areconsistent with a negative slope. We take our bestﬁt slope to be the result of ”all” data set and all methods, in bold in the table above, m = − . + . − . . set by drawing with replacement 1000 times. We also incor-porate uncertainties in the planetary radii by drawing eachplanet radius from a normal distribution that is centeredon its nominal value with a standard deviation of the pub-lished uncertainty. Our sample of all exoplanets with P < days and R P < . has a much larger average uncertaintyfor planetary radii than the other samples (see Table 1 for afull comparison of our data sets). In Figure 3, we plot a Ker-nel Density Estimation (KDE) for the radii distributions foreach of our re-sampled data sets. We use a KDE to minimizedependence on bin width, although a KDE does require abandwidth. We note that, once a KDE is used and uncertain-ties are accounted for, the gap in distribution of planetaryradii is barely distinguishable for data set F18, and not atall distinguishable for the F17 and ”all” data sets. The gap isstill prominent in data set V18, which has few planets (81)and the smallest radii uncertainties ( < σ R > = . R E ).Accounting for the uncertainty in planetary radii ob-scures not only the gap in the distribution of radii, but alsothe radius valley as a function of period. When we cluster the”all” data set, the slope of the line between the two popula-tions grows from m = − . + . − . using only nominal valuesfor the radii to m = − . + . − . when we account for uncer-tainties. This large increase in slope is due to the clusteringalgorithms failing. Instead of clustering along R p ∼ R E ,they tend to cluster along P ∼ days, which results in nearvertical lines. This eﬀect becomes even more apparent whenwe only use planets with P < days, as can be seen inTable 2. We repeat our analysis with a cut in orbital period of P < days to help mitigate the eﬀects of incompletenessdue to observational biases. The resulting slope measure-ments are consistent with the measurements from the fullsample, although larger and less precise (see Table 2 for allslope measurements).The diﬀerent methods that we employ result in diﬀer-ent slopes. In Figure 4, we plot the KDE of the resultingslopes from our analysis on all exoplanets using the nomi-nal values for radius (i.e., not accounting for uncertainties).Slopes larger than m ∼ − . are a result of the clusteringmethods failing, which is most apparent in the hierarchicalclustering. This method creates an entire tree of classiﬁca-tion and we only cut it at two branches. For the lines withsteep slopes, a cut at three branches did result in a clusterwith R p < . R E and a cluster with R p > . R E , but weonly include the results of the two-branch cut. K -means and K -medians clustering result in much higher precision thanthe other methods (for the ”all” data set as well as mostof the other data sets). We take the median and uncertain-ties from all slopes from all methods as our bestﬁt slope m = − . + . − . .Our measured slopes are much steeper than those re-ported in Van Eylen et al. (2018) and from theoreticalmodels of photoevaporation and core-powered mass loss, al-though they are still consistent within σ . Van Eylen et al.(2018) ﬁt a line to their absence of data and then usedsupport vector machines to determine the hyperplane ofmaximum separation between the planets above and be- MNRAS000

Figure 3.

Kernel density estimations of the distributions of planetary radii for our four data sets, where ”all” is all known exoplanetswith orbital period and planetary radius estimates and uncertainties and P < days and R P < . R E . We accounted for uncertaintiesin planetary radius and diﬀerent sizes of the data sets by creating samples of 1000 planets drawn with replacement from each data set,and then drawing a radius for each planet from a normal distribution with the published value as the mean and the uncertainty as thestandard deviation. A gap in the distributions of radii was noted by Fulton et al. (2017), Van Eylen et al. (2018), and Fulton & Petigura(2018), but is only distinct in the sample from Van Eylen et al. (2018) and noticeable in the sample from Fulton & Petigura (2018). Agap is not discernible in the other two data sets. low the valley, resulting in slope of m = − . ± . and m = − . + . . , respectively. Given that the other data setshave planets occupying the radius valley, we cannot employtheir methods on other data sets. Our methods (discussedin full in Section 3) result in a steeper slope when even weuse the smaller asteroseismic data set from Van Eylen et al.(2018). This diﬀerence is due to the diﬀerent classiﬁcationtechniques; as can be seen in comparing our classiﬁcation(Figure 2) with the classiﬁcation from support vector ma-chines (Fig. 7 in Van Eylen et al. 2018), the methods em-ployed herein tend to classify planets with periods P > days as part of the mini-Neptune population.There is a dearth of mini-Neptunes at short orbital pe-riods ( P < days), that is predicted by photoevaporationmodels (e.g., Owen & Wu 2013) and noted by many authors (e.g., Lundkvist et al. 2018). This absence of detected plan-ets makes a slope diﬃcult to constrain. Coupled with theobservational bias of super-Earths far from their hosts be-ing diﬃcult to detect, this dearth of planets can confuse theclustering and lead to positive slopes. How much this eﬀectsthe resulting slope is diﬃcult to truly explore, but we donote that the clustering algorithms rarely ( < . ) result ina positive slope. As mentioned above, we partially investi-gate this by repeating our analysis with an orbital periodcut of P < days. With this cut, we ﬁnd slopes that areconsistent with the slopes from using all data.Given that the measured slopes for all data sets andmethods are consistent and distinctly negative, we ﬁnd thatthe radius valley is consistent with both models of photoe-vaporation (e.g., Owen & Wu 2017) and core-powered mass MNRAS , 1–10 (2019)

M. G. MacDonald −0.8 −0.6 −0.4 −0.2 0.0

Slope D en s i t y f un c t i on K- medians HIER

DBSCAN

KDE

K-means

Figure 4.

Kernel density estimations of the slopes measured viaour diﬀerent methods for all known exoplanets with orbital pe-riod and planetary radius estimates and uncertainties and P < days and R P < . R E . Here, ”DBSCAN” is density-based cluster-ing, ”HIER” is hierarchical clustering (cut at two branches), and”KDE” is 2D kernel density estimation (see Section 3 for a descrip-tion of all methods). We take the median and uncertainties fromall slopes from all methods as our bestﬁt slope m = − . + . − . . loss (e.g., Gupta & Schlichting 2018), but inconsistent withplanets forming late in a gas-poor disk (e.g., Lee & Chiang2016). Core-powered mass loss, however, would produce a ra-dius valley that is independent of incident ﬂux, and is there-fore independent of stellar type. Recently, Fulton & Petigura(2018) found that the radius gap shifts to higher incidentstellar ﬂuxes around higher mass stars, showing that the gapis dependent on stellar type. This is consistent with predic-tions from photoevaporation models (Owen & Wu 2017). Wenote that these mechanisms are not mutually exclusive. Itis plausible that both photoevaporation and core-poweredmass loss, as well as other mechanisms, are responsible forthe radius valley simultaneously or at diﬀerent times duringplanetary formation.The precise slope of the radius valley depends on theplanet formation model and the composition of the plan-ets. We compare our resulting slope with diﬀerent modelsof photoevaporation from Owen & Wu (2017) in Figure 5.The photoevaporation models use the maximum radius atthe bottom of the valley, but since the observational valley ismore of a dearth of planets than a complete absence of plan-ets, we do not have a ”bottom” of valley to directly compareto. Instead, we plot our resulting slope from all methods forour ”all” data set with a σ conﬁdence interval, which is anoverestimation for what we are comparing to.Previous studies have found that the location of the ra-dius valley is more consistent with iron-rich cores than withicy cores, assuming that the valley is caused primarily byphotoevaporation (Owen & Wu 2017; Jin & Mordasini 2018;Van Eylen et al. 2018). We ﬁnd that our measurements areconsistent at σ with both the more complex models that . . . . Period (d) R ad i u s ( R E ) . Figure 5.

The resulting slope from all methods for all known ex-oplanets with orbital period and planetary radius estimates anduncertainties and P < days and R P < . R E in black and its σ conﬁdence interval in grey ( m = − . + . − . ). We comparethe measured slope to theoretical models with diﬀerent planet-core compositions from Owen & Wu (2017) that show the largestsuper-Earth at each orbital period. Since our slope is the slope ofthe line separating the two populations and not of the largestsuper-Earth, it is a bit of an over-estimation. The solid linesare for constant, energy-limited eﬃciency models (EL) while thedashed lines are for models with variable eﬃciency (VE, see e.g.Owen & Jackson 2012). The blue lines show planets that consistof 1/3 iron and 2/3 silicates while the red lines show planets thatconsist of 1/3 ice and 2/3 silicates. We ﬁnd that our measure-ments are consistent with all four models until an orbital periodof P ∼ . days and are consistent with both models for icy coresfor the entire period range. Our measurements are consistent withall four models for the range or orbital periods at σ . include recombination and x-ray evaporation and with thesteeper slopes predicted for pure energy-limited evaporationfor both iron-rich and icy cores up until an orbital periodof P ∼ . days (although our measurements are consistentwith all four models from Owen & Wu (2017) at σ ). Ourslopes are more consistent with icy cores than iron-rich coreswith orbital periods shorter than P ∼ . days, possibly in-dicating a need for including planetary migration after theplanets form. We investigate the relationship between planetary radiusand orbital period, conﬁrming the “radius valley” that is sug-gested to separate super-Earths from mini-Neptunes using2-dimensional kernel density estimator and various cluster-ing techniques. Unlike previous studies, we use a large sam-ple of exoplanets. We use all known exoplanets, but limit oursample to planets with orbital period and planetary radiusmeasurements and P < days and R P < . R E , resulting in2066 planets. We repeat our analysis for a sample with plan- MNRAS000 , 1–10 (2019) adius Valley etary radii drawn from normal distributions to account forthe uncertainty in the measurements, which can obfuscatethe radius valley, and for a sample with P < days to helpmitigate the eﬀects of incompleteness due to observationalbiases. Accounting for uncertainties and a smaller period cutboth led to estimates of the slope that were larger and lessprecise than using the full sample with nominal values forthe radii, but still consistent with each other and our otherestimates. We also repeat our analysis on the data sets fromFulton et al. (2017), Van Eylen et al. (2018), and Fulton &Petigura (2018), measuring slopes that are consistent withour other analyses. We present all of our measurements ofthe radius valley for each method and data set in Table 2.We conclude the following results: • Powerful machine-learning techniques allow us to detectand characterize the radius valley using a large sample ofexoplanets. • The transitional radius between super-Earth and mini-Neptune decreases as a function of orbital period. We ﬁnda negative slope which is consistent with models of photoe-vaporation and core-powered mass loss, but is inconsistentwith late formation in a gas-poor disk (characterized by apositive slope). • We measure the location of the radius valley as a powerlaw: log R p = m · log P + b , where m = − . + . − . and b = . + . − . . • Given the lack of a clear valley, our sample consistsof planets with a wide range of core compositions or plan-ets which have formed beyond the snow line. Our measuredslope is consistent at σ with theoretical models of pho-toevaporation for cores consisting of a signiﬁcant fractionof iron and for cores consisting of 1/3 ice, 2/3 silicates forplanets with P > . days. For smaller orbital periods, ourmeasurements are more consistent with icy cores.Given the exoplanetary missions that have recentlylaunched or will soon launch and the missions that have beenproposed for the future, such an analysis as presented hereinshould be repeated after more data have been gathered. Theaddition of data, especially data that are less biased or havediﬀerent biases, could grandly expand the signiﬁcance of thisstudy, providing additional information about the formationand evolution of exoplanetary systems. ACKNOWLEDGMENTS

We thank the referee for helpful comments and suggestionsthat have improved this manuscript. We thank KathrynDisher for her helpful discussion. This material is basedupon work supported by the National Science FoundationGraduate Research Fellowship Program under Grant No.DGE1255832. Any opinions, ﬁndings, and conclusions or rec-ommendations expressed in this material are those of theauthor and do not necessarily reﬂect the views of the Na-tional Science Foundation. This research has made use of theNASA Exoplanet Archive, which is operated by the Califor-nia Institute of Technology, under contract with the NationalAeronautics and Space Administration under the ExoplanetExploration Program.

REFERENCES

Berger T. A., Huber D., Gaidos E., van Saders J. L., 2018, ApJ,866, 99Campello R. J., Moulavi D., Sander J., 2013, in Paciﬁc-Asia con-ference on knowledge discovery and data mining. pp 160–172Cardot H., 2017, Gmedian: Geometric Median, k-Median Clus-tering and Robust Median PCA. https://CRAN.R-project.org/package=Gmedian

Chen H., Rogers L. A., 2016, ApJ, 831, 180Chen G., Jaradat S. A., Banerjee N., Tanaka T. S., Ko M. S.,Zhang M. Q., 2002, Statistica Sinica, pp 241–262Ester M., Kriegel H.-P., Sander J., Xu X., et al., 1996, in Kdd.pp 226–231Fabrycky D. C., et al., 2014, ApJ, 790, 146Fisher R. A., 1936, Annals of eugenics, 7, 179Ford E., 2017, in APS Meeting Abstracts.Fulton B. J., Petigura E. A., 2018, AJ, 156, 264Fulton B. J., et al., 2017, AJ, 154, 109Ginzburg S., Schlichting H. E., Sari R., 2017, preprint,( arXiv:1708.01621 )Ginzburg S., Schlichting H. E., Sari R., 2018, MNRAS, 476, 759Gupta A., Schlichting H. E., 2018, arXiv e-prints,Hahsler M., Piekenbrock M., 2018, dbscan: Density Based Clus-tering of Applications with Noise (DBSCAN) and RelatedAlgorithms. https://CRAN.R-project.org/package=dbscan

Hartigan J. A., Wong M. A., 1979, JSTOR: Applied Statistics,28, 100Inamdar N. K., Schlichting H. E., 2016, ApJ, 817, L13Jin S., Mordasini C., 2018, ApJ, 853, 163Jin S., Mordasini C., Parmentier V., van Boekel R., Henning T.,Ji J., 2014, ApJ, 795, 65Johnson S. C., 1967, Psychometrika, 32, 241Lee E. J., Chiang E., 2016, ApJ, 817, 90Lee E. J., Chiang E., Ormel C. W., 2014, ApJ, 797, 95Lissauer J. J., et al., 2011, ApJS, 197, 8Lissauer J. J., et al., 2014, ApJ, 784, 44Lithwick Y., Wu Y., 2012, ApJ, 756, L11Liu S.-F., Hori Y., Lin D. N. C., Asphaug E., 2015, ApJ, 812, 164Lopez E. D., Fortney J. J., 2014, ApJ, 792, 1Lopez E. D., Rice K., 2016, arXiv e-prints,Lundkvist M. S., Huber D., Silva Aguirre V., Chaplin W. J., 2018,arXiv e-prints,MacQueen J., et al., 1967, in Proceedings of the ﬁfth Berkeleysymposium on mathematical statistics and probability. pp281–297Mika S., Ratsch G., Weston J., Scholkopf B., Mullers K.-R., 1999,in Neural networks for signal processing IX: Proceedings ofthe 1999 IEEE signal processing society workshop (cat. no.98th8468). pp 41–48Milligan G. W., 1980, Psychometrika, 45, 325Mingoti S. A., Lima J. O., 2006, European journal of operationalresearch, 174, 1742Muirhead P. S., et al., 2015, ApJ, 801, 18Mulvey J. M., Crowder H. P., 1979, Management Science, 25, 329Owen J. E., Jackson A. P., 2012, MNRAS, 425, 2931Owen J. E., Wu Y., 2013, ApJ, 775, 105Owen J. E., Wu Y., 2017, ApJ, 847, 29R Core Team 2017, R: A Language and Environment for Sta-tistical Computing. R Foundation for Statistical Computing,Vienna, Austria,

Schlichting H. E., Sari R., Yalinewich A., 2015, Icarus, 247, 81Sharma N., Bajpai A., Litoriya M. R., 2012, facilities, 4, 78Silverman B. W., 2018, Density estimation for statistics and dataanalysis. RoutledgeVan Eylen V., Agentoft C., Lundkvist M. S., Kjeldsen H., OwenJ. E., Fulton B. J., Petigura E., Snellen I., 2018, MNRAS,479, 4786MNRAS , 1–10 (2019) M. G. MacDonald

Venables W. N., Ripley B. D., 2002, Modern Applied Statisticswith S, fourth edn. Springer, New York,

Weiss L. M., Marcy G. W., 2014, ApJ, 783, L6Welsh W. F., et al., 2015, ApJ, 809, 26This paper has been typeset from a TEX/L A TEX ﬁle prepared bythe author. MNRAS000