[PDF] Predicting Exoplanets Mass and Radius: A Nonparametric Approach

Abstract

A fundamental endeavor in exoplanetary research is to characterize the bulk compositions of planets via measurements of their masses and radii. With future sample sizes of hundreds of planets to come from TESS and PLATO, we develop a statistical method that can flexibly yet robustly characterize these compositions empirically, via the exoplanet M-R relation. Although the M-R relation has been explored in many prior works, they mostly use a power-law model, with assumptions that are not flexible enough to capture important features in current and future M-R diagrams. To address these shortcomings, a nonparametric approach is developed using a sequence of Bernstein polynomials. We demonstrate the benefit of taking the nonparametric approach by benchmarking our findings with previous work and showing that a power-law can only reasonably describe the M-R relation of the smallest planets and that the intrinsic scatter can change non-monotonically with different values of a radius. We then apply this method to a larger dataset, consisting of all the Kepler observations in the NASA Exoplanet Archive. Our nonparametric approach provides a tool to estimate the M-R relation by incorporating heteroskedastic measurement errors into the model. As more observations will be obtained in the near future, this approach can be used with the provided R code to analyze a larger dataset for a better understanding of the M-R relation.

Full PDF

SSubmitted to the Astrophysical Journal

Preprint typeset using L A TEX style AASTeX6 v. 1.0

PREDICTING EXOPLANETS MASS AND RADIUS: A NONPARAMETRIC APPROACH

Bo Ning , Angie Wolfgang , and Sujit Ghosh Department of Statistics and Data Science, Yale University, 24 Hillhouse Ave, New Haven, CT 06511 Department of Astronomy and Astrophysics, Pennsylvania State University, University Park, PA 16802 Center for Exoplanets and Habitable Worlds, 525 Davey Laboratory, The Pennsylvania State University, University Park, PA, 16802, USA. NSF Astronomy & Astrophysics Postdoctoral Fellow Department of Statistcs, North Carolina State University, 2311 Stinson Dr., Raleigh, NC 27695

ABSTRACTA fundamental endeavor in exoplanetary research is to characterize the bulk compositions of planetsvia measurements of their masses and radii. With future sample sizes of hundreds of planets to comefrom TESS and PLATO, we develop a statistical method that can ﬂexibly yet robustly characterizethese compositions empirically, via the exoplanet M-R relation. Although the M-R relation has beenexplored in many prior works, they mostly use a power-law model, with assumptions that are notﬂexible enough to capture important features in current and future M-R diagrams. To address theseshortcomings, a nonparametric approach is developed using a sequence of Bernstein polynomials. Wedemonstrate the beneﬁt of taking the nonparametric approach by benchmarking our ﬁndings withprevious work and showing that a power-law can only reasonably describe the M-R relation of thesmallest planets and that the intrinsic scatter can change non-monotonically with diﬀerent values ofa radius. We then apply this method to a larger dataset, consisting of all the Kepler observationsin the NASA Exoplanet Archive. Our nonparametric approach provides a tool to estimate the M-Rrelation by incorporating heteroskedastic measurement errors into the model. As more observationswill be obtained in the near future, this approach can be used with the provided R code to analyze alarger dataset for a better understanding of the M-R relation. Keywords: planets and satellites: composition — methods: statistical INTRODUCTIONAn exoplanet’s mass and radius can constrain its bulkcomposition when models of planetary internal struc-tures are ﬁt to these observed properties (e.g. Valen-cia et al. 2007; Seager et al. 2007; Fortney et al. 2007a;Rogers et al. 2011; Lopez & Fortney 2014). Measur-ing many planets’ masses and radii can therefore illumi-nate: 1) the range of exoplanet compositions producedby planet formation and evolution processes; and 2) howthese compositions trend with planet and stellar prop-erties of interest, such as planet mass, orbital period,host star mass, and host star metallicity. The existenceand nature of these trends, along with the amount of as-trophysical scatter around them, provides observationalconstraints on planet formation theory (e.g. Ida & Lin2010; Lee & Chiang 2015; Dawson et al. 2015; Lopez &Rice 2016; Owen & Wu 2017).The population distributions of these mass and ra- [email protected]; [email protected]; [email protected] dius measurements therefore contain valuable informa-tion about the range of planetary compositions. To ob-tain the population distributions is particularly neces-sary for the thousands of super-Earth- and sub-Neptune-sized planets discovered by the

Kepler mission (Cough-lin et al. 2016), as this population is not represented inour own Solar System.Initially population studies of small, low-mass plan-ets focused on the marginal mass (Howard et al. 2010;Mayor et al. 2011) or marginal radius (Youdin 2011;Howard et al. 2012; Fressin et al. 2013; Petigura et al.2013; Dressing & Charbonneau 2013) distributions; thiswas driven in part by the diﬃculty in obtaining a sampleof planets which both transited their host stars and weresuitable to precision radial velocity (RV) follow-up. Thischanged once the

Kepler follow-up program publishedmass estimates for their chosen subset of

Kepler can-didates (Marcy et al. 2014). Recent analysis of

Kepler planets’ transit timing variations have also produced ro-bust mass determinations for dozens of multiple-planetsystems (Jontof-Hutter et al. 2016; MacDonald et al.2016; Mills et al. 2016; Hadden & Lithwick 2017). How- a r X i v : . [ a s t r o - ph . E P ] N ov Ning, Wolfgang, & Ghosh ever, mass measurements for the majority of individual

Kepler planet candidates are still unavailable. As a re-sult, the mass-radius (M-R) relations estimated basedon the existing mass and radius measurements providesa useful tool for astronomers to predict masses based onobserved radii for those planets which only have radiusmeasurements.1.1.

Previous work on mass-radius relations

There have been several M-R relations proposed in theexoplanet literature; all of them assume that exoplanetmasses and radii follow a power-law: • Lissauer et al. (2011) ﬁt a relation to Earth andSaturn and obtained M = R . , where M and R are in Earth units. • Wu & Lithwick (2013) calculated the masses of22 planet pairs based on the amplitudes of theirsinusoidal transit timing variations (TTVs) andfound that M = 3 R . • Weiss & Marcy (2014) used the 42 planets cho-sen by the

Kepler team to be followed up withradial velocity measurements (Marcy et al. 2014)and found the power law is M = 2 . R . forplanets with 1 . < R < ⊕ . • Hadden & Lithwick (2014) revisited the M-R re-lation for 56 low-eccentricity TTV planets and ﬁt M = 14 . R/ R ⊕ ) . to them. • Wolfgang et al. (2016) (hereafter denoted asWRF16) selected planets with radii < ⊕ fromthe NASA Exoplanet Archive (up to the date1/30/2015) and found two diﬀerent power-law re-lations: M = 2 . R . for < ⊕ planets and M = 1 . R . for < ⊕ planets. Unlike the ear-lier results, which used basic least squares regres-sion, WRF16 built a hierarchical Bayesian modelthat incorporated observational measurement un-certainty, the intrinsic, astrophysical scatter ofplanet masses, and non-detections and upper lim-its. • Mills & Mazeh (2017) ﬁt power laws to varioussubsets of exoplanet masses and radii, testing fordiﬀerences between RV and TTV M-R relations intwo diﬀerent period bins. • Chen & Kipping (2017) built on the hierarchi-cal Bayesian model from WRF16 to ﬁt a continu-ous broken power-law model across a much largerrange of masses and radii. They found that a four-segment M-R relation best described their data,with Terran, Neptunian, Jovian, and Stellar re-gions. They ﬁt a diﬀerent power-law and intrinsicscatter to each segment. 1.2.

Moving to a nonparametric approach

The power-law model is simple to ﬁt to exoplanetmasses and radii and its parameters are easy to inter-pret. However, there are several reasons to move be-yond it. First, we already know that a single power-lawrelation will not be able to describe the M-R relationfor the entire population, as degeneracy pressure causesthe radius-mass relation to ﬂatten out around a Jupitermass. Chen & Kipping (2017) addressed this by as-suming a broken power-law with multiple segments. Assoon as one expands the space of models beyond thosewith a handful of parameters, it is useful to considermore ﬂexible nonparametric models as well . Indeed,there is no particular reason to believe that exoplanetmasses and radii should follow power-laws as a popu-lation, since the processes that dominate planet forma-tion for small rocky planets are diﬀerent than those thatdominate the formation of gas giants. Furthermore, wedo not expect the distribution of masses at a given ra-dius to be Gaussian, as was assumed by WRF16 andChen & Kipping (2017), or even symmetric. There isalready evidence from the hierarchical modeling check-ing performed by WRF16 that a Gaussian distributiondoes not completely reproduce the observed exoplanetmasses and radii.Moving forward, we adopt a nonparametric approachthat allows us to relax these assumptions. There aremany beneﬁts of using nonparametric models, including:1. They can take on variety of shapes to ﬁt the data,which can be advantageous for making predictionsthat are less model-dependent. Therefore, there isno need to impose abrupt breaks for modeling M-R relations across a wide range of sizes.2. They can model the joint distribution of mass andradius, and thus provide a self-consistent way topredict both mass from radius and radius frommass over the entire exoplanet mass and radiusrange.3. Eventually we want to understand masses andradii as a function of many other star and planetproperties like stellar mass, metallicity, orbital pe-riod, planet multiplicity, even eccentricity. We donot have clear guidance from theory about thefunctional form of the dependence in these addi- We note that “nonparametric” is a bit of a misnomer: non-parametric models still do have parameters. In our case, theseparameters are the degree d and the weights w kl in Eqn 7. Whatmakes nonparametric models diﬀerent than parametric ones isthat the dimension of the parameters (in this case d ) is allowed tovary with sample size, which allows for much greater ﬂexibility toadapt to complex shapes. onparametric Exoplanet Mass-Radius Relation THE MODELThere are at least two ways to model the exoplanetmass-radius relation. The ﬁrst is to approach it as a re-gression problem. When one performs regression, suchas using ordinary least squares (OLS) to ﬁt a line todata, the goal is to illuminate the relationship betweenan independent variable and another quantity that de-pends on it—the dependent variable. For exoplanets, weare indeed concerned with how mass and radius are re-lated, as it illuminates how planet compositions changeas a function of the planet’s size or mass. However, itis not so clear which is the independent variable andwhich is the dependent variable. From a theoreticalpoint of view, mass is the more fundamental propertyand so should be the independent variable . On theother hand, the small fraction of Kepler planets thatonly have mass measurements creates a practical needto convert

Kepler radii to masses (or in other words, topredict masses from radii), which requires radius to bethe independent variable.Moreover, there are non-negligible measurement un-certainties associated with both mass and radius. OLSregression, which ignores the uncertainties on the in-dependent variable, will consistently underestimate theslope of the regression line when there are uncertaintiesin both variables (see Isobe et al. 1990). This occursbecause OLS only minimizes the distance between thepoints and the line in the vertical direction, which over-corrects for the additional vertical distance introducedby uncertainties in the dependent variable (for lines withslope (cid:54) = 0, the horizontal scatter produced by uncertain-ties in the dependent variable will also produce someadditional vertical scatter around the line). As a result,the OLS line which ﬁts mass to radius will not be theinverse of the line which ﬁts radius to mass. This is aproblem if our goal is to estimate the underlying funda-mental relationship between radius and mass.As our goal in this paper is to provide mass predic-tions from measured radii as well as radius predictionsfrom mass, we decide to approach the problem diﬀer-ently. WRF16 already showed that a simple regressionline, i.e. a one-to-one function which deterministicallymaps radius to mass, is an insuﬃcient description ofthe existing data: there exists an intrinsic, astrophys-ical scatter in the mass-radius relation. Therefore, wealso want this approach to allow for a distribution of That said, some physical processes which aﬀect exoplanetcompositions, such as photoevaporation, depend on both massand radius. masses at any one radius. We can express this distri-bution as g ( m | r ), the conditional probability distribu-tion of mass given radius, i.e. a function describing theprobability of a planet having a certain mass when itsradius takes a speciﬁc value . Through the deﬁnitionof conditional probability, every conditional probabilitydistribution can be expressed as a ratio of the joint prob-ability distribution g ( m, r ) to the marginal probabilitydistribution of the variable on which the conditioningoccurs ( g ( r )): g ( m | r ) = g ( m, r ) g ( r ) = g ( m, r ) (cid:82) g ( m, r )d m (1)Likewise, g ( r | m ) = g ( m, r ) g ( m ) = g ( m, r ) (cid:82) g ( m, r )d r (2)Therefore, to get both conditional distributions we needonly to model the joint distribution: the probability ofa planet having a certain mass and a certain radius. Itis this joint distribution that we deﬁne in § § g ( m, r ) to discuss general properties of joint mass,radius distributions. We will use f ( m, r ) to refer to ourspeciﬁc formulation of the joint mass, radius distributionas given by Eqn 7.2.1. The Bernstein polynomials model

Our nonparametric model for the joint probability dis-tribution of mass and radius is a bivariate distributionthat consists of a tensor product of sequence of location-scaled and transformed beta densities which can also beviewed as Bernstein polynomials. Each marginal dis-tribution can be expressed as a linear combination ofBernstein polynomials (BPs), leading to mixture of betadensities. When normalized, BPs have the same func-tional form as beta distributions (see Eqn 3). For avisual representation of Bernstein Polynomials, see Fig-ure A1, where we plot BPs as a function of their degree(denoted by k or l in Eqn 7).The properties of BPs and a discussion of their ad-vantages over other choices for basis functions are de-scribed in Appendix A. In addition, we note that we donot use Gaussian process regression, which is a popu-lar nonparametric prediction method in exoplanet sci-ence, for a number of reasons. First, a joint density es-timation approach is more appropriate for our purposesthan a regression approach, as argued above. Second,the marginal distribution of the M-R is not in general WRF16 modeled g ( m | r ) as a normal distribution but providedsome evidence that this assumption was too restrictive; this evi-dence is partly what has motivated us to adopt a nonparametricapproach in this paper. Ning, Wolfgang, & Ghosh normally distributed, as we demonstrate in Section 5.Third, ﬁtting a mixture of Gaussians to describe thejoint density would involve three free parameters percomponent (mixture weight, mean, and variance) ratherthan the one free parameter per component that we usehere (the mixture weight). Fourth, a mixture of Gaus-sians produces an identiﬁability problem, wherein dif-ferent components can be interchangeable, that is notpresent in the Bernstein polynomial model due to theordered nature of the diﬀerent terms (see Eqn 7 and theﬁgures in Appendix A for how the Bernstein polynomialterms, i.e. the diﬀerent β k for a given d , are distinguish-able from each other).Before introducing our model, we ﬁrst deﬁne somemathematical notations that will be used throughoutthe paper. Let M obs i and R obs i be the reported massand radius measurements for i -th planet in our dataset,and σ obs M i and σ obs R i be the reported standard deviationsfor their measurement errors. We denote M i and R i as the true mass and radius for i -th planet that wouldhave been observed if there were no measurement er-rors. We denote their joint distribution as f ( m, r ), andthe M-R relation as the conditional distribution f ( m | r ).We let M obs stand for a vector containing the observa-tions ( M obs1 , . . . , M obs n ); similarly, R obs , σ obs M and σ obs R stand for the set of their respective observations. Welet N ( µ, σ ) stand for a normal distribution with mean µ and standard deviation σ , and denote the probabilitydensity of a Beta distribution with shape parameters i and d − i + 1 by β id ( a ) = d (cid:18) d − i − (cid:19) a i − (1 − a ) d − i , (3)where 0 ≤ a ≤

1, and d > M obs i ind ∼ N ( M i , σ obs M i ) , (4) R obs i ind ∼ N ( R i , σ obs R i ) , (5)( M i , R i ) iid ∼ f ( m, r | w , d, d (cid:48) ) , (6)where the symbol ind ∼ stands for “independently dis-tributed as”, and the symbol iid ∼ stands for “identi-cally and independently distributed as”. Let w =( w , . . . , w dd (cid:48) ) be a set of weights which describe howmuch each corresponding term in the following seriescontribute to the overall joint distribution f ( m, r ): f ( m, r | w , d, d (cid:48) ) = d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl β kd ( m − MM − M ) M − M β ld (cid:48) ( r − RR − R ) R − R . (7)To ensure that Eqn 7 remains a probability distribution,i.e. that it integrates to 1, we impose the constraint (cid:80) dk =1 (cid:80) d (cid:48) l =1 w kl = 1 and that w kl ≥ k and l . The notations M , M , R and R are usedto denote the lower and upper bounds, respectively, formass and radius. The values of the upper and lowerbounds can be determined by a set of observed valuesof masses and radii; for example, one could choose thelower mass bound to be the minimum ( M obs i − σ obs M i /n ) inthe dataset, and the upper mass bound to be the max-imum ( M obs i + σ obs M i /n ). The bounds could also be setas the minimum and maximum mass and radius theo-retically expected for an exoplanet. For regions with noobservations, the BP model reverts to the overall meanof the whole function; this happens when the values of M , M , R and R are chosen to be far away from thenearest observations (see discussion in § B.1).Note that Eqns 4–6 form a multi-level model, in thatthe measurement process is modeled as a separate ran-dom process from the underlying true distribution ofexoplanet masses and radii. It therefore has a similarhierarchical structure as the hierarchical Bayesian mod-els deﬁned in WRF16 and Chen & Kipping (2017).2.2.

Model inference

Although the model may look complex at the ﬁrstglance, the inference is relatively straightforward. Thereare only two types of unknown parameters in the model,the weights w and the degrees d and d (cid:48) . Once d and d (cid:48) are determined using the method we will describe below,the two parameters i and d − i + 1 in each beta distribu-tion (see Eqn 3) are also determined. This is advanta-geous compared to other mixture models; for example, amixture of Gaussians would require that each mean andstandard deviation of component distribution be esti-mated by the data. Moreover, it avoids the commonidentiﬁability issues created by label switching withinthe mixture of Gaussians. Another advantage of ourmixture of beta distributions is that parameter space(which is a simplex) is a bounded convex set, whichhelps to guarantee the existence of the optimally esti-mated parameters. Here we take a maximum likelihoodapproach to estimate the parameters by explicitly deriv-ing the likelihood using one-dimensional numerical inte-gration.The likelihood function of the model 4–6 can be writ-ten as, L ( w , d, d (cid:48) | M obs , R obs , σ obs M , σ obs R )= (cid:90) MM (cid:90) RR f ( M obs , R obs , m, r | w , d, d (cid:48) , σ obs M , σ obs R ) d r d m = n (cid:89) i =1 (cid:90) MM (cid:90) RR f ( M obs i | m, σ obs M i ) f ( R obs i | r, σ obs R i ) × f ( m, r | w , d, d (cid:48) ) d r d m onparametric Exoplanet Mass-Radius Relation n (cid:89) i =1 d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl (cid:90) MM σ obs M i N (cid:16) M obs i − mσ obs M i (cid:17) β kd ( m − MM − M ) M − M d m × (cid:90) RR σ obs R i N (cid:16) R obs i − rσ obs R i (cid:17) β ld (cid:48) ( r − RR − R ) R − R d r. (8)The two integral terms are essentially two constantsand can be calculated numerically by using integrate function in R or by any other Gaussian quadraturemethod available for one-dimensional numerical integra-tion. Therefore, we denote c kl,i ≡ (cid:90) MM σ obs M i N (cid:16) M obs i − mσ obs M i (cid:17) β kd ( m − MM − M ) M − M d m × (cid:90) RR σ obs R i N (cid:16) R obs i − rσ obs R i (cid:17) β ld (cid:48) ( r − RR − R ) R − R d r, and c i = ( c ,i , . . . , c dd (cid:48) ,i ) to simplify the expression.Then for given values d and d (cid:48) , we ﬁnd an estimatorof w which maximizes the log-likelihood,maximize: log L = n (cid:88) i =1 log( c Ti w ) , subject to: d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl = 1 , w kl ≥ , for all k = 1 , . . . , d, l = 1 , . . . , d (cid:48) . (9)The above problem is a convex optimization problemwhich can be readily solved using any standard numeri-cal optimization methods. For our applications, we haveused the R package Rsolnp to solve the above optimiza-tion problem.Now we discuss how we choose d and d (cid:48) , which actas tuning parameters relative to the smoothness of theunderlying density function. We ﬁrst choose a set of can-didate values for d and d (cid:48) , such as 2, 3, . . . , n/ log( n ),where n is the sample size (i.e. number of exoplanets).We use 10-fold cross validation to choose the optimalvalue of degrees d and d (cid:48) from those candidate values.To conduct the 10-fold cross validation, we separate thedataset randomly into 10 disjoint subsets with equalsizes. Then we leave out the s -th subset, denoted by D s ( s = 1 , . . . ,

10) and use the remaining 9 subsets of datato estimate the parameters. Doing this for each D s inturn results in 10 estimated sets of weights, ˆ w ( s ) basedall observations i / ∈ D s . We plug in each ˆ w ( s ) alongwith the corresponding data that was used to estimateeach set of weights to obtain an estimated value for thelog-likelihood. Mathematically, we are calculating thefollowing quantity,log L pred = (cid:88) s =1 (cid:88) i/ ∈ D s log( c Ti ˆ w ( s ) ) for diﬀerent possible values of d and d (cid:48) . The optimumdegrees for the BP model are then the d and d (cid:48) whichgive the largest value of − log L pred .Besides using the cross-validation method, other pop-ular methods such as AIC (Akaike information criterion)and BIC (Bayesian information criterion) are also possi-ble when data samples are abundant. When the samplesize is relatively small, both methods will under-selectthe degrees, as in the example provided in Turnbull &Ghosh (2014). This is the reason why we choose to usethe 10-fold cross validation method.The model may require d and d (cid:48) to be large, and thusthe number of weights which need to be estimated isalso large. In such a situation, one may be concernedwhether we have enough data to estimate the weights.Fortunately, a majority of the estimated weights turnsout to be numerically zero, due to the fact that parame-ter space is a simplex and therefore convex. This drasti-cally reduces the number of eﬀective parameters in ourmodel. For example, mass predictions computed usingthe largest 25 weights of the ﬁt performed in § M ).Once we obtained ˆ d , ˆ d (cid:48) and ˆ w , the estimated values of d , d (cid:48) and w , the M-R relation can be derived followingEqn 1: f ( m | r, ˆ w , ˆ d, ˆ d (cid:48) ) = f ( m, r | ˆ w , ˆ d, ˆ d (cid:48) ) (cid:82) MM f ( m, r | ˆ w , ˆ d, ˆ d (cid:48) )d m = f ( m, r | ˆ w , ˆ d, ˆ d (cid:48) ) f ( r | ˆ w , ˆ d, ˆ d (cid:48) ) , (10)where f ( r | ˆ w , ˆ d, ˆ d (cid:48) ) = ˆ d (cid:88) k =1 ˆ d (cid:48) (cid:88) l =1 ˆ w kl β l ˆ d (cid:48) (cid:16) r − RR − R (cid:17) ( R − R ) . From Eqn 10, mean, variance and prediction intervalscan be obtained analytically.As noted in § w , d and d (cid:48) . For example wecan assign a Dirichlet prior to the weights as follows: π ( w , w . . . , w dd (cid:48) ) ∼ Dir( α , α , . . . , α dd (cid:48) ) , where (cid:80) dk =1 (cid:80) d (cid:48) l =1 α kl = 1, and assign a Poisson or uni-form prior for d and d (cid:48) respectively. The inference can becarried out using a Markov chain Monte Carlo algorithm(MCMC) (i.e. Petrone 1999a,b). However, initial inves-tigations into this approach indicated that the MCMCalgorithm would take a much longer time to run. As aresult, we employ a maximum likelihood method ratherthan a Bayesian method in this paper. Ning, Wolfgang, & Ghosh DATAWe apply the Bernstein polynomial model to obtainthe M-R relations from two datasets. The ﬁrst datasetis taken from WRF16 to enable a direct comparisonbetween their parametric mass-radius relation and ournonparametric one (this comparison is illustrated in Fig-ure 1). Speciﬁcally, we use their extended radial velocitydataset, denoted in WRF16 as “RV only, < ⊕ ” andwhich consists of all planets in their Table 2 except thoselabeled “c”. In this work we also exclude the planetswhose mass measurements are only upper limits, as the R package we use to ﬁnd the maximum likelihood esti-mates of the weights w does not allow censored data.The results from this benchmark dataset ( N = 60) arepresented in § Kepler name, whose mass measurements origi-nated from either radial velocities (RV) or from N-bodydynamical ﬁts to transit timing variations (TTVs). Wenote that TTV planets could have astrophysically dif-ferent densities; however, we see that the inclusion ofhigh-quality TTV masses in the current manuscript as areasonable decision, both because prior work has madethe same decision (Weiss & Marcy 2014 and WRF16)and because a dataset that contains both TTV andRV masses better represents the full range of densitiesthat an M-R relation would need to reproduce. Weobtained this information from the NASA ExoplanetArchive (Akeson et al. 2013) on June 5, 2017; for thisdataset we also excluded planets with only upper lim-its reported on their masses. With these restrictions,our

Kepler -only dataset consists of 127 planets with ro-bust mass measurements; the results from this sciencedataset are presented in § Discussion of selection eﬀects

We choose to restrict our science dataset (results pre-sented in §

5) to

Kepler planets in an eﬀort to minimizethe inﬂuence of survey selection eﬀects. Selection eﬀectsmanifest in two ways for the mass-radius relation:1. The probability of detecting a planet is a non-constant function of either their mass (for RVdetection) or radius (for transit detection), withsmaller or less massive planets being more diﬃ-cult to detect.2. The probability that a known planet has its ra-dius (for RV detections) or its mass (for transitdetections) measured by follow-up observations isa complicated, often unpublished function of theplanet’s discovery mass/radius, the predicted ra-dius/mass, and the host star’s brightness, spectraltype, activity indicators, sky position, etc. These two selection eﬀects impact the inferred mass-radius relation in diﬀerent ways. The ﬁrst aﬀects therelative amount of data at large vs. small radii (or athigh vs. low mass). Correcting for this eﬀect becomesimportant when one’s scientiﬁc goal is a joint mass, ra-dius probability distribution g ( m, r ) that reﬂects theunderlying, true distribution of exoplanet masses andradii. For example, if one tried to use the data fromthe Exoplanet Archive as-is with no correction for de-tection completeness, they would incorrectly concludethat ∼ . Jupiter and ∼ Jupiter is the most probablemass and radius for an exoplanet, as more Jupiter-sizedplanets have had their masses measured than smallerplanets. We already know that this potential conclusionto be incorrect: the numerous occurrence rate studieswhich use

Kepler data to correct for variable detectioneﬃciency across the full range of planet radii (e.g. mostrecently Foreman-Mackey et al. 2014; Fulton et al. 2017;Hsu et al. 2018) have shown that smaller planets aremuch more prevalent than Jupiter-sized ones. There-fore, if one wishes to understand the probability of anexoplanet existing with a speciﬁc mass and radius (i.e.characterize g ( m, r )), they must account for the diﬀerentsurveys’ detection completeness.Fortunately, this is not the goal of this work. Weare scientiﬁcally interested in g ( m | r ), the conditionaldistribution which gives us the M-R relation. In goingfrom g ( m, r ) to g ( m | r ), we marginalize over r , which di-vides out the completeness correction in that dimension.With this correction disappearing from the equation for g ( m | r ), we can safely ignore the eﬀect of the selectionbias due to detectability of diﬀerent radii planet. (seeEqn 1). This leaves only the second concern for our masspredictions: how transiting planets are selected for RVfollow-up.To address this concern, one needs to know how plan-ets were chosen for follow-up. To date, no RV follow-upgroup has published their selection function in a quan-titative way that would allow us to incorporate it intoan analysis like this. Furthermore, given that the targetlist often evolves as RV data are acquired, this selectionfunction is probably intractable to calculate for the cur-rent dataset. Progress is being made toward this end bydesigning RV follow-up campaigns that have deﬁnableselection functions (i.e., Burt et al. 2018 and Montet2018), but for the current dataset this remains practi-cally out of reach. No other papers on the M-R relationhas incorporated corrections for completeness or selec-tion eﬀects into their analysis. We follow this precedent,acknowledging that is a clear area for future work, andchoose to focus instead on the novel development of anonparametric analysis. BENCHMARKING TO WRF16 onparametric Exoplanet Mass-Radius Relation < 8R Earth

Nonparam< 8R

Earth

WRF16

Radius (R

Earth ) M a ss ( M E a r t h ) Mass−Radius Relations

Figure 1 . Two M-R relations are plotted. The blue line is the mean power-law M-R relation estimated by WRF16; the shadedregion is the central 68% of the predicted masses, showing the distribution of the mass predictions around the mean relation. Thedark blue line along with its shaded region is our nonparametric mean M-R relation and the 68% prediction regions estimatedusing the Bernstein polynomial model. Comparing the parametric M-R with the nonparametric M-R, the means of the M-Rrelations noticeably diﬀer in certain radius regimes. The parametric model is optimistic with the assumption that mass andradius follow an increasing power-law relation even in the region where observations are few and poorly constrain the relation.The nonparametric M-R method is more adaptive to the observed data. The red lines denote physical restrictions for planetmasses. Data points are displayed in the background with their measurement errors.

As we look to apply our nonparametric method to awider range of exoplanet masses and radii than consid-ered by WRF16, it is nonetheless prudent to benchmarkour results to their results.The parametric power-law M-R relation obtained byWRF16 is plotted in Figure 1. Quantitatively, it is m | r ∼ N (1 . r . , .

9) for < ⊕ , which assumes thatthe M-R relation follows a normal distribution.In Figure 1, we also plot the mean and 68% predic-tion intervals of M-R relation estimated using the pro-posed nonparametric method (the grey region). Basedon the 10-fold cross-validation method, the values for ˆ d and ˆ d (cid:48) are chosen to be 50 and 20, respectively. Phys-ically motivated boundaries for any M-R relations aredenoted by the red lines. Those boundaries are calcu-lated based on the constraint that 0 < M i ≤ M i, pureFe , i = 1 , . . . , n , where M i, pureFe is calculated using Fortneyet al. (2007a,b)’s rock-iron internal structure models,log( M i, pureFe ) = − b ± (cid:112) b − a ( c − R i )2 a , with a = 0 . b = 0 . c = 0 . f ( m | r, ˆ w , ˆ d, ˆ d (cid:48) ), where ˆ w is the maxi-mum likelihood estimator (MLE) of w . The predictionintervals do not account for variations on parametersˆ w , ˆ d and ˆ d (cid:48) themselves. To account for the variationof the parameters in the model, we use the bootstrapmethod. The bootstrap method is conducted by resam-pling (with replacement) the observations N times (here,we take N = 100). For each resampling, we obtain ˆ d , ˆ d (cid:48) and ˆ w . Then the bootstrap conﬁdence intervals are ob-tained by plugging in N sets of these estimated valuesinto the M-R relation function one-by-one and identi-fying the 16% and 84% quantiles of the mean relationover those N realizations. Note that the Bayesian cred-ible intervals, the prediction intervals and the bootstrapconﬁdence intervals not only have diﬀerent interpreta-tions, but also are derived from diﬀerent models—theBayesian credible intervals are obtained from WRF16’s Ning, Wolfgang, & Ghosh

Bayesian hierarchical power-law model, the other twointervals are obtained from our nonparametric model.Due to these diﬀerences, we will not compare those in-tervals. Instead, we focus on comparing the mean M-Rrelations from each method.Now, let’s go back to the discussion on Figure 1. Atthe ﬁrst glimpse, a distinct feature of this plot from thenonparametric M-R relation is that the uncertainty re-gion is quite large for

R > ⊕ compared to the regionobtained from the parametric M-R relation. This is be-cause there are only a few observations in that region,and the measurement error for each observation is rela-tively large. One may worry that the estimated mean M-R relation may be misleading in this context. However,we view this as a strength of using the nonparametricmodel for at least two reasons: ﬁrst, there is no concretetheory to support that the population-level relationshipbetween mass and radius follows a power-law, and so theparametric M-R relation may be too optimistic with itsprecision in regions with little data. Second, from Fig-ure 2, the constant variance assumed by the power-lawmodel, even in the > ⊕ regime, may not reasonable.The variance estimated using our nonparametric modelbehaves more as expected: the variance is smaller in the < ⊕ region where the observations are abundant, andit becomes larger in the > ⊕ region.In Figure 2, we also plot the uncertainty regions forthe estimated intrinsic scatter from the two models, i.e.the 16% and 84% Bayesian credible intervals using theparametric model, and 16% and 84% bootstrap conﬁ-dence intervals using the nonparametric model. Again,we would like to point out that the intrinsic scatter es-timated by the nonparametric model is clearly not aconstant. In fact, WRF16 also believed that the in-trinsic scatter may change as planets increase the size.However, they only modeled the intrinsic scatter as alinear function of radius, and this model was not ﬂex-ible enough to capture the true behavior: they foundthat the posterior distribution of the slope of that lin-ear function included zero, and thus concluded that theintrinsic scatter was a constant with radius.The normal assumption in the parametric model isalso disfavored by observations. From Figure 1, the pre-diction intervals obtained from the nonparametric modelsuggests that the conditional distribution of mass givenradius is not always normally distributed as there areregions where those intervals are not symmetric aroundthe mean. When WRF16 checked the normal assump-tion after ﬁtting their model, they also found evidencethat this assumption does not hold. THE JOINT EXOPLANET MASS-RADIUSDISTRIBUTION

Radius (R

Earth ) I n t r i n s i c sc a tt e r o f M − R r e l a t i on Figure 2 . The intrinsic scatter is the estimated standard de-viation (s.d.) for f ( m | r ). The parametric model (cyan) isunder-ﬁtting the M-R relation, thus has a smaller estima-tion for the s.d. compared to the nonparametric model (darkblue). The nonparametric model suggests that the intrinsicscatter is not a constant along with radii. In this ﬁgure, thedark blue line plots the mean s.d. across all bootstrapped re-alizations estimated by the nonparametric method, and thecorresponding shaded region denotes the bootstrap 16% and84% conﬁdence intervals. The cyan line denotes the poste-rior median of the intrinsic scatter term in the parametricmodel; the associated shaded area denote the 16% and 84%quantiles of the posterior samples for that parameter. In this section, we repeat our nonparametric M-R re-lation analysis using all robustly measured

Kepler ob-servations provided in the NASA Exoplanet Archive, asdescribed in §

3. Since we consider a larger range ofplanet masses and radii, we will display the M-R rela-tion in log scale and perform our analysis on log m andlog r , rather than m and r . Note that the M-R relationin the original scale of masses and radii can be obtainedby applying the Jacobian method to the joint distribu-tion of log m and log r .Similar to Eqns 4–7, the joint distribution for log m and log r is, f ( log m, log r | w , d, d (cid:48) ) = d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl β kd ( log m − log M log M − log M ) β ld (cid:48) ( log r − log R log R − log R )(log M − log M )(log R − log R ) . (11)Inferring the parameters w, d, d (cid:48) for this model is similarthe inference performed for the previous model.5.1. M-R relation of the full

Kepler dataset

The M-R relation is the conditional distribution of log-scaled mass given log-scaled radius, derived from Eqn 1: onparametric Exoplanet Mass-Radius Relation Radius (R

Earth ) M a ss ( M E a r t h ) Kepler data: Mass−Radius Relations

Figure 3 . Nonparametric M-R relation of

Kepler data. With a more ﬂexible model, we see that the transitions between super-Earths, Neptunes, and Jupiters are not sharply deﬁned, yet there appears to be three contiguous regions between which theM-R relation changes: 0 . ⊕ ≤ r (cid:46) ⊕ , 5R ⊕ (cid:46) r (cid:46) ⊕ and (cid:38) ⊕ . The dark blue curve is the mean M-R relation andthe darker shaded area is the uncertainty region between the 16% and 84% predictive intervals. The light blue band is the theuncertainty region between the bootstrap 16% and 84% conﬁdence intervals and shows large uncertainty in the region wheredata is absent (radius < ⊕ ). Data points are displayed in the background with their measurement errors. f (log m | log r, w , d, d (cid:48) ) = f (log m, log r | w , d, d (cid:48) ) f (log r | w , d, d (cid:48) ) , (12)with f (log m, log r | w , d, d (cid:48) ) is in (11) and f (log r | w , d, d (cid:48) ) = d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl β ld ( log r − log R log R − log R )log R − log R .

We use 10-fold cross-validation (see § d (corresponding to mass) and d (cid:48) (corresponding to ra-dius), and ﬁnd that ˆ d = 55 and ˆ d (cid:48) = 50. After weobtain ˆ d and ˆ d (cid:48) , we then plug in their values to estimate w using the MLE method. Because the values of ˆ d andˆ d (cid:48) may change under diﬀerent realizations of the randomnumber generator, we repeat the 10-fold cross-validationmethod ﬁve times to assess its stability, each time us-ing a diﬀerent random seed. We found that four outof ﬁve times, the cross-validation gives the same choicesfor d and d (cid:48) with the dissenting realization suggestinga slightly larger number for ˆ d . Thus we decide to useˆ d = 55 and ˆ d (cid:48) = 50 to estimate the M-R relation. Afterwe obtain ˆ d , ˆ d (cid:48) and ˆ w , we derive the prediction distri-bution of the M-R relation and employ the bootstrapmethod described in § . ⊕ is unexpectedly similar to a step function.This is because there is only one isolated data point inthat region, which the model tries to connect with thenearby points via smooth polynomials. Fortunately, wesee that the bootstrap conﬁdence intervals in this regionare very wide, indicating that the M-R relation is notwell understood in this region, and that more data isneeded.The M-R relations can be roughly split into threeparts: 0 . ⊕ ≤ r (cid:46) ⊕ , 5R ⊕ (cid:46) r (cid:46) ⊕ and (cid:38) ⊕ . In the region between 0 . ⊕ and 5R ⊕ , thebootstrap conﬁdence intervals narrow as data points be-come abundant for larger radii, which suggests the M-Rrelation is more accurate. The almost linear relationsuggests that the power-law relation may not be a mis-leading assumption in this radius regime. Transitioningfrom the 5R ⊕ region to the 11R ⊕ region, the number ofobservations decreases and we observe both the predic-0 Ning, Wolfgang, & Ghosh

Radius (R

Earth ) I n s t r i n i c sc a tt e r o f M − R r e l a t i on l og ( M a ss ( M E a r t h )) Radius (R

Earth ) M a ss ( M E a r t h ) Figure 4 . The intrinsic scatter of the M-R relation, in log scale (left) and linear mass scale (right). Mathematically, the intrinsicscatter is the estimated standard deviation of f (log m | log r ). We note the behavior of the intrinsic scatter in the three regionsvisually identiﬁed in Fig 3: for 0 . ⊕ ≤ r (cid:46) ⊕ the intrinsic scatter is increasing; for 5R ⊕ (cid:46) r (cid:46) ⊕ it starts to decline;and for (cid:38) ⊕ it is nonlinear. In the ﬁgure, the dark blue curve is the estimated standard deviation of the M-R relation andthe shaded area is the uncertainty region between the 16% and 84% bootstrap conﬁdence intervals. tion intervals and bootstrap conﬁdence intervals becomeslightly wider. In the region > ⊕ the M-R relationis more ﬂat or even decreases as radius becomes larger.These ﬁndings are consistent with well-known featuresof the M-R relation (see discussion in § f (log m | log r ) (seeEqn 12), the intrinsic scatter is technically in units oflog m . We plot this on the left and original scale on theright for a more intuitive comparison. Starting at thesmallest radii, we see that the bootstrap conﬁdence in-tervals are quite wide. This reﬂects the fact that thereis only one data point below ∼ . ⊕ , and so the in-trinsic scatter is quite uncertain in this regime. In theregion between 0 . ⊕ and (cid:46) ⊕ , the intrinsic scatteris an increasing function with radius. The increasingpattern becomes more obvious in the right plot. Thisﬁnding is consistent with the result shown in Figure 2,even though two results are based on diﬀerent datasets.With more data points to constrain this region, we seethat the size of the intrinsic scatter in Figure 4 is smallerthan it in Figure 2. In the > ⊕ region, the intrinsicscatter behaves nonlinearly. It decreases ﬁrst then in-creases at radii dominated by inﬂated Jupiters. As dis-cussed in § Mass predictions for given radii

In Figure 5, we plot ﬁve conditional densities for massgiven radius at diﬀerent radius values: 1R ⊕ , 3R ⊕ , 5R ⊕ ,10R ⊕ , and 15R ⊕ . These curves represent the mass pre-dictions for planets at those radii, i.e. the probabilityof a planet of that size having a certain mass. Theshaded regions represent the uncertainty in that distri-bution and correspond to the 68% bootstrap conﬁdenceregions. While these distributions look roughly Gaus-sian in log scale, when transformed to a linear scale allthe densities are skewed to the right. The two densitieswithin the region > ⊕ are more skewed than the rest.Thus no matter the the size of the planets, the intrinsicscatter is deﬁnitely not Gaussian in linear scale, as wasassumed by WRF16. The lognormal intrinsic scatterassumed by Chen & Kipping (2017) is more appropri-ate for mass given radius, although it fails to capturethe low-mass tail corresponding to super-puﬀy planetsat 7 −

10 R ⊕ and <

30 M ⊕ .5.3. The radius-mass (R-M) relation

As we are modeling the joint distribution of mass andradius (in log scale), the radius-mass relation (R-M re-lation) can also be derived easily. From Eqn 2, the con-ditional distribution for log r given log m is f (log r | log m, w , d, d (cid:48) ) = f (log m, log r | w , d, d (cid:48) ) f (log m | w , d, d (cid:48) ) , (13)with f (log m, log r | w , d, d (cid:48) ) in (11) and f (log m | w , d, d (cid:48) ) = d (cid:88) k =1 d (cid:48) (cid:88) l =1 w kl β ld ( log m − log M log M − log M )log M − log M .

After we plug in ˆ d , ˆ d (cid:48) and ˆ w , we plot this relation alongwith the prediction intervals and bootstrap conﬁdence onparametric Exoplanet Mass-Radius Relation Radius = = = =

10 Radius = Mass (M

Earth ) Figure 5 . The conditional distributions for mass given radius, i.e. the mass predictions that our model produces for planets at1R ⊕ , 3R ⊕ , 5R ⊕ , 10R ⊕ , and 15R ⊕ . In the plotted log scale, these predictive mass distributions are roughly Gaussian, but inlinear scale are skewed to the right. The larger radius is, the more skewed the distribution. For each curve, the shaded regioncorresponds to the uncertainty in the distribution, i.e. the 16% and 84% bootstrap conﬁdence intervals. intervals in Figure 6. Note that the R-M relation is not asimple ﬂip of the M-R relation around the 1:1 line; thisis due to the fact that the M-R relation is actually adistribution that is asymmetric around a mean relationthat is not a straight line.The R-M relation is highly uncertain for < . M ⊕ dueto the single isolated data point in that region; this isreﬂected in the wide bootstrap conﬁdence intervals. Thebootstrap conﬁdence intervals narrow as mass increases,reﬂecting higher certainty in the mean relation given thedata. This mean R-M relation is an increasing functionup to 100 M ⊕ ; above that the R-M relation becomesﬂatter due to degeneracy pressure.We also plot the 68% bootstrap conﬁdence intervalsfor each of the conditional distributions for 1, 10, 50, 100and 500 M ⊕ . Almost all the densities are skewed to theright, especially when the displayed log scale is trans-formed to linear scale. In the 0 . M ⊕ ≤ m ≤ M ⊕ region, the three densities look quite diﬀerent, indicat-ing rapid change in the radius distribution from 1 to 50M ⊕ . 5.4. Predicting K2 planets

In this section, we apply the estimated M-R relationto predict masses of K2 planets given their radius. The Mass (M

Earth ) R ad i u s ( R E a r t h ) Figure 6 . The nonparametric estimation of the R-M rela-tion; note that this is not simply the inverse of the M-Rrelation. The dark blue line is the mean R-M relation andthe gray area is the 16% and 84% prediction intervals. Thelight blue area is the uncertainty region between 16% and84% bootstrap conﬁdence intervals.

K2 planets data are obtained from the NASA ExoplanetArchive on June 5, 2017. In Figure 8, we ﬁrst plot theprediction intervals from the M-R relation for all radii,then plot the K2 planets on top of the band. We see2

Ning, Wolfgang, & Ghosh

Mass = =

10 Mass = = = Radius (R

Earth ) Figure 7 . The conditional distributions for radii givenmasses at 1, 10, 50, 100, 500 M ⊕ . As in Figure 5, the shadedregion corresponds to the uncertainty in the predictive dis-tributions, i.e. the 16% and 84% bootstrap conﬁdence in-tervals. These predictive distributions are more skewed thanthose for mass given radius. l l l ll lllll ll ll l l lll llllllllll Radius (R

Earth ) M a ss ( M E a r t h ) Predicting K2 planets

Figure 8 . The masses of K2 planets with radii < ⊕ arebiased. The shaded area is the same as that shown in Figure3; it is the uncertainty region between 16% and 84% pre-diction intervals of the M-R relation estimated using Kepler data. K2 planets’ masses and radii are plotted in red withtheir measurement errors. that for many of the planets with

R < ⊕ , the pre-diction intervals do not cover their masses. However,for planets with R > ⊕ , the prediction intervals dospan the masses for all the planets. This suggests thatthere is a signiﬁcant bias for K2 planets. One explana-tion for this is that the most massive planets reach ahigh mass detection signiﬁcance threshold ﬁrst, and soare published ﬁrst. Burt et al. (2018) discusses this biasand its implication for population mass-radius analysesin detail. IMPLICATIONS FOR EXOPLANETCOMPOSITIONSThe nonparametric M-R relation that we have ﬁt tothe full

Kepler dataset yields several astrophysical re-sults. Below we discuss them in details.6.1.

Assessing “well-understood” M-R features

A crucial test of any method is how well it reproducesfeatures that are well established by multiple studies us-ing diﬀerent techniques. To this end, we emphasize thatour relation is not just the average line, but also thepredictive distribution around the line. We also notethat our goal was not to reproduce prior mass-radiusrelations, as one of our motivations for applying a non-parametric method was to assess how well the featuresidentiﬁed in more parametric models persist when thestrong assumptions in those models are relaxed.To start with, our nonparametric M-R relation repro-duces the well-established result that more massive plan-ets (cid:46) M ⊕ are on average larger (see Figure 3). Thisresult is not a surprise: more massive planets are ex-pected to form earlier in the protoplanetary disk life-time and thus should able to accrete more gas-rich ma-terial, which quickly increases the planets’ radii. Indeed,all M-R relations in the literature (see § f ( m | r ) (mass given radius) conditional dis-tribution, vertical features like this are collapsed, withtheir extent represented as the probability distributionat that radius. Figure 7 displays these probability dis-tributions, and shows that there is some probability ofmore massive planets at 10 and 15 R ⊕ . In fact, the diﬀer-ence between f ( m | r ) and f ( r | m ) in the Jupiter regimeis an argument for why a two-way M-R relation is neces-sary: there are some features that are easier to captureone way than the other.6.2. Location of transition points

Identifying transitions in the exoplanet mass-radiusrelation is important for our physical understanding ofthe types of planets which exist, based on their compo-sitions.No sharp transitions are visible in Figure 3. Never-theless, there appears to be at least three segments tothe M-R relation: (cid:46) ⊕ , ∼ ∼ ⊕ , and (cid:38) ⊕ onparametric Exoplanet Mass-Radius Relation ⊕ ,then the conditionals can look quite diﬀerent. A rela-tively large spread in radii for planets with 3-10M ⊕ fallsadjacent to a small spread in radii for planets around30M ⊕ ; this is what causes ambiguity in identifying atransition from Neptunes to Saturns.Upon visual inspection, it is also clear that there arenot currently enough mass measurements of extrasolarplanets ≤ ⊕ to justify a super-Earth transition point.This result is at odds with some previously publishedresults on the M-R relation, for example Weiss & Marcy(2014), Rogers (2015), Chen & Kipping (2017) and Ful-ton et al. (2017). In below, we shall discuss each of theirworks one by one.Weiss & Marcy (2014) ﬁxed the transition point at1.5 R ⊕ based on visual inspection of the density vs. ra-dius plane. When their dataset is plotted with the re-ported error bars, it becomes apparent that there is littleempirical evidence for a transition at 1.5 R ⊕ given thesize of the density or mass error bars for the smallestplanets. Indeed, one could draw a straight line throughthe mass and radius points and have it ﬁt the data justas well as the chosen relation with the transition.To our best knowledge, Rogers (2015) provided thestrongest evidence of a transition in the M,R planeamong all four of these studies. This paper parame-terized the transition as the radius below which 100% ofplanets must be rocky (that is, their masses are greaterthan the minimum mass of a rocky planet at that size).This study tested both a step function for the fractionof planets that are rocky and a more gradual transi-tion; given the size of the error bars, a step functionwas suﬃcient to describe this dataset. While useful inits physical interpretation, this parametrization does notguarantee that there is a visible kink or transition in theM-R relation. Indeed, no such kink is visible in theirdataset, displayed in their Figure 1. This is becausethe Rogers parameterization of the transition tests forthe absence of planets in one region of M,R parameter space (that of very low-mass planets in the < R ⊕ ra-dius regime); if their absence does not suﬃciently alterthe average M-R relation, then it will not produce a kinkin the M-R plane. Considering the very large measure-ment uncertainties in this very small planet regime, it isindeed the case that the absence of these planets doesn’tsigniﬁcantly shift the average M-R relation to producea kink.Fulton et al. (2017) looked only at the marginal radiusdistribution. There is no guarantee that the bimodalitythey see in the radius distribution will produce a kink,or a transition, in the M-R plane. Indeed, the transitionis likely gradual with a period dependence. Additionalfollow-up to measure masses for a larger sample of thesesmall planets is needed before a claim of a Fulton-likegap or transition in the M-R plane is warranted.The tension between the existence of a super-Earthtransition in Chen & Kipping (2017) and our lack ofone arises from the fact that we do not include the minorSolar System bodies in our dataset. It is not unreason-able to expect that the compositions of mainly refractorybodies in other planetary systems will be similar to thosein ours. However, including this data without inﬂatingthe uncertainties to match those we obtain for extraso-lar systems leads to misleadingly tight constraints for ∼ ⊕ planets. Indeed, there is only one planet be-tween 1 and 3M ⊕ in their dataset, so most of the infor-mation about the “Terran” transition point comes fromextrapolating up from Solar System minor bodies andextrapolating down from Neptunes, rather from mea-surements at the transition point itself. We choose a lessSolar System-constrained approach to characterize thedistribution of compositions in the 1-3R ⊕ regime, choos-ing instead to develop a ﬂexible method that can best letthe extrasolar data speak for itself as more super-Earthsare discovered and followed up by TESS.6.3. Intrinsic dispersion

The intrinsic dispersion in mass given radius slightlyincreases from 1 to 5 R ⊕ (see Figure 4), but varies muchmore in radius given mass (see the width of the grayregion in Figure 6. This indicates that planet composi-tions in the 3-10M ⊕ range exhibit more diversity thanothers, and that planets with 1-5R ⊕ fall into a relativelynarrow range of masses. The large spread in radius overa narrow mass regime could have a number of phys-ical interpretations: perhaps these super-Earths/sub-Neptunes fall at a transition point where scaled-upterrestrial planet formation co-exists with scaled-downgaseous planet formation. Perhaps planet evolution isparticularly dramatic in this mass regime; photoevap-oration has been shown to be important for such low-density low-mass planets (e.g. Lopez & Rice 2016; Owen& Wu 2017).4 Ning, Wolfgang, & Ghosh

In either case, it would be useful to search for multiplepopulations in this region of the M-R diagram, to testif the bimodal marginal radius distribution unearthedby Fulton et al. (2017) and veriﬁed by Hsu et al. (2018)extends to the joint radius and mass space. While thereis not currently evidence for a bimodal mass and radiusdistribution, our approach of calculating the M-R andR–M relations by ﬁrst estimating the joint distributionis capable of ﬁnding and quantifying such bimodalities.Before the joint distribution can be interpreted as anastrophysical result, however, both selection eﬀects dis-cussed in § Predictive distributions

Given the current dataset, a lognormal distributionis a reasonable approximation for the conditional dis-tribution of mass given radius, in most radius regimes(see Figure 5). On the other hand, there is signiﬁcantskewness in the conditional distributions for log scaledradius given log scaled mass (see Figure 7). This resultdemonstrates that adopting the more ﬂexible model waswarranted, and that diﬀerent mass regimes have diﬀer-ent composition distributions. In particular, the super-puﬀy planets found by TTV analyses are visible as theskewed tail in the 10M ⊕ conditional distribution.6.5. Implications for TESS

As seen in Figure 5, mass predictions are the best con-strained for planets at 3R ⊕ : the conditional distributionfor mass at 3R ⊕ is the most robust to the bootstrappingscheme we used to measure the uncertainty in these dis-tributions. As a result, RV follow-up of TESS will yieldthe most scientiﬁc return for 1-2R ⊕ planets and above ∼ ⊕ where there are intrinsically fewer planets.While few Kepler target stars were suﬃciently brightto enable follow-up, a larger proportion of K2 planet-hosting stars are. One would hope that these brighterstars would enable more mass detections of smaller plan-ets, yet we see in Figure 8 that there is signiﬁcant biasin the K2 planets planetary masses, especially for theplanets with ≤ ⊕ radii. This serves as a word ofwarning for those who aim to perform mass-radius anal-yses with TESS data: to build robust M-R relations,all mass measurements, even upper limits, must be pub-lished and incorporated into the analysis (see Burt etal. (2018) for an in-depth study of this eﬀect). In ad-dition, the second type of selection eﬀect described in § CONCLUSIONIn this paper we provide a nonparametric approach,using the Bernstein polynomial model, to estimate the M-R relation. We applied this approach to analyze twodatasets.The ﬁrst dataset is from WRF16, which we use tobenchmark our results. First, we found that the M-R relations estimated using the parametric model andthe nonparametric model are similar for < ⊕ planets.However, the two relations diﬀer for > ⊕ , in that theprediction intervals of the M-R relation obtained fromthe nonparametric model are wider from a model thatassumed a power-law M-R relation. We also found thatthe conditional distribution of mass given radius is notalways normally distributed. Last, we found that theintrinsic scatter is not a constant function with radius,increasing weakly up to 5R ⊕ .We applied the Bernstein polynomial model to studythe joint exoplanet mass and radius distribution usingall Kepler planets with measured masses. We foundthat there are at least three distinct M-R relations forplanets with radii (cid:46) ⊕ , 5R ⊕ < r < ⊕ and > ⊕ . We also studied the relationship between theintrinsic scatter and the radius. We also found that theintrinsic scatter is a weakly increasing function withradius up to ∼ ⊕ and becomes nonlinear beyond that.Furthermore, we found that the conditional distributionof mass given radius is reasonably approximated by alognormal distribution, which skews to the right in m ,but is nearly symmetric in log m . Last, we studied theR–M relation and applied the estimated M-R relation topredict K2 planets. We found that the bias for K2 planetmasses, especially for those with < ⊕ , is large.A major contribution of this study is that our methodprovides a tool to explore a wider range of ﬁts by in-corporating heteroscedastic measurement errors. Ourmethod provides a new direction for studying theM-R relation when more mass and radius measure-ments are obtained with future missions like TESSand PLATO. To enable these future studies, we haveplaced the input datasets and the R code which weuse to ﬁt the weights, ﬁnd the optimal number of de-grees, and plot the ﬁgures in this paper, at the follow-ing website: https://github.com/Bo-Ning/Predicting-exoplanet-mass-and-radius-relationship.In the future, the model itself can be extended byadding incident ﬂux as a third variable for a better un-derstanding of the M-R relation (see Thorngren & Fort-ney (2017) and Sestovic, Demory & Queloz (2018) forexamples of ﬂux-dependent M-R analysis in the Jupiterregime). It is likely that a mass-radius-ﬂux (M-R-F) re-lation would better describe planets in all mass regimes,given the importance of photoevaporation for low-massplanets. Given the rich and growing literature on thesubject, this is an area where signiﬁcant progress can bemade, with the right tools. As we have demonstratedin this paper, a nonparametric approach provides such onparametric Exoplanet Mass-Radius Relation

15a tool: it can yield numerous astrophysical insights andmodel the data with minimal assumptions, making it apromising option for future mass-radius analyses.The ﬁrst draft was completed when the ﬁrst authorwas a Ph.D. student in Department of Statistics atNorth Carolina State University, Raleigh, North Car-olina, U.S.A..The authors thank the Statistical and Applied Math-ematical Sciences Institute (SAMSI) for bringing theauthors together and providing space and funding forcontinued collaborations. The authors also thank TomLoredo and Eric Ford for their valuable suggestions onthis work while it was in development at SAMSI, EricFord for providing detailed revisions on the ﬁrst versionof the manuscript, Shubham Kanodia and GudmundurStefansson for their work in translating the R code intoPython, and Matthias Yang He and Eric Ford for theirwork in testing the number of eﬀective nonzero weightswhile translating the code into Julia. Furthermore, theauthors thank two referees who provided valuable sug- gestions to improve the paper. This material was basedupon work partially supported by the National ScienceFoundation under Grant DMS-1127914 to the Statisti-cal and Applied Mathematical Sciences Institute. AWacknowledges support from the National Science Foun-dation Astronomy & Astrophysics Postdoctoral Fellow-ship program under Award No. 1501440. The Centerfor Exoplanets and Habitable Worlds is supported bythe Pennsylvania State University, the Eberly Collegeof Science, and the Pennsylvania Space Grant Consor-tium. This research has made use of the NASA Exo-planet Archive, which is operated by the California In-stitute of Technology, under contract with the NationalAeronautics and Space Administration under the Exo-planet Exploration Program. This paper includes datacollected by the

Kepler mission. Funding for the

Ke-pler mission is provided by the NASA Science Missiondirectorate.

Facilities:

NASA Exoplanet Archive

Software: R and pythonAPPENDIX A. BERNSTEIN POLYNOMIALS AND THEIR PROPERTIESBernstein Polynomials (BPs) have a long history in mathematics and statistics since its original publication (Bern-stein, 1912). Readers interested in knowing more about the use and origins of BPs in various scientiﬁc ﬁelds will ﬁndthe article by Farouki (2012) very comprehensive and informative. Here we present a brief overview of the mainproperties and motivations of choosing BPs over other basis functions, such as splines or kernel methods. Given acontinuous function f ( u ) deﬁned on the unit interval [0 , d corresponding to the function f isdeﬁned as B d ( u ; f ) = d (cid:88) k =0 f (cid:18) kd (cid:19) (cid:18) dk (cid:19) u k (1 − u ) d − k for d = 1 , , . . . , where B d ( x, f ) is a polynomial of degree at most d . It was shown by Bernstein that (cid:107) B d ( · , f ) − f ( · ) (cid:107) ∞ ≡ sup u ∈ [0 , | B d ( u, f ) − f ( u ) | → d → ∞ , which constructively demonstrates that a continuous function on aclosed interval can be uniformly approximated by a polynomial. Figure A1 provides plots of the basis function(s)with d = 1 , , , , , , . Ina statistical density estimation problem, we would not know the density function f and we thus work with θ k = f (cid:0) kd (cid:1) for k = 0 , , . . . , d and estimate the parameter vector ( θ , θ , . . . , θ d ) using likelihood and other standard statisticalmethods. The prime motivation for using BPs for estimating densities is that it not only enjoys many of the similarasymptotic properties as some of the other popular density estimation methods do (such as those based on kernels andB-splines), but it also has good asymptotic properties near boundaries. Its connection to a mixture of Beta densitiesis described in Tenbusch (1994). Babu & Chaubey (2006) provides much of the asymptotic properties of the BPsin multi-dimensions and show that densities can be estimated based on dependent sequence of multivariate randomvariables. Moreover, it has been shown that BPs can also be used to make inference when the density or regressionfunctions are subject to shape constraints (see Turnbull & Ghosh 2014; Wang & Ghosh 2012).6 Ning, Wolfgang, & Ghosh . . . . . . Degree d = 1 . . . . . . Degree d = 2 . . . . . . Degree d = 3 . . . . . . Degree d = 5 . . . . . . Degree d = 10 . . . . . . Degree d = 20

Figure A1 . Plot of Bernstein polynomial basic functions, as a function of degree (i.e. the number of terms in the polynomialseries); compare to Eqn 3. Note that the independent variable, i.e. the x-axis, must be rescaled to be between 0 and 1. See § B.1for a discussion of this. B.

MORE ON THE BERNSTEIN POLYNOMIALSS MODELB.1.

On choosing the upper and lower bounds for mass and radius: M , M , R , R . The Bernstein polynomials model requires specifying an upper bound and a lower bound for both mass and radius.Although in theory one could choose the upper bound and the lower bound to be any values, in practice setting theupper bounds and the lower bounds to diﬀerent values will cause the results to be diﬀerent from the results in Figure1 and Figure 3. In particular, when there are no data in a region, the Bernstein polynomials model will ﬁt a smoothline toward the overall mean. This happens when the upper bound is chosen to be too large. To illustrate this, weoﬀer Fig B2 as a comparison to Figure 1.In Figure B2, we set the upper bound of masses to be 8 R ⊕ , which is the same value used in WFR16’s paper, insteadof 6 . R ⊕ . One can immediately tell that the region between 6 . R ⊕ and 8 R ⊕ has a downward trend. This is becausethe overall mean of the dataset’s planet masses is smaller than the mass at 6 . R ⊕ , which is where the last observationlies. Because the model tends toward the overall mean when there is no data, a downward trend is produced. Thistrend might not provide a correct astrophysical interpretation of the M-R relation in this region, and so we stronglycaution against extrapolating this nonparametric relation. That said, the prediction intervals are very wide in thisregion, which suggests that there is signiﬁcant variability in the mean relation, which would be rectiﬁed with moreobservations. B.2. On choosing the degrees d and d (cid:48) . Choosing the degrees d and d (cid:48) is somewhat computationally expensive due to the cross-validation method’s searchingfor the optimal d and d (cid:48) from a large number of potential values. To decrease the computation time, one can force d = d (cid:48) . We adopt this approach to choose d and then re-estimate the parameters in the model. Figure B2 and FigureB3 are the results produced by choosing d = d (cid:48) . The cross-validation method chooses d = d (cid:48) = 35 for the ﬁrst datasetand d = d (cid:48) = 55 for the second. Comparing these two ﬁgures to Figures 1 and 3 respectively, the results do not vary onparametric Exoplanet Mass-Radius Relation Radius (R

Earth ) M a ss ( M E a r t h ) Figure B2 . Plot of the M-R relation with its prediction intervals when R is chosen to be 8 R ⊕ instead of 6 . R ⊕ (as in Figure1). Comparing with Figure 1, the two M-R relations are similar in the region where R ≤ R ⊕ , where a large amount of data isavailable. The current plot shows a downward trend for the M-R relation when R > R ⊕ because where there is no data themodel ﬁts a line smoothly to the overall mean. The large prediction intervals show that this downward trend is highly uncertain.We note that in this plot, ˆ d = ˆ d (cid:48) = 35, which was selected based on the cross-validation method described in § too much. Figure B3 . Plot of the M-R relation with its prediction intervals estimated from the

Kepler data. We chose ˆ d = ˆ d (cid:48) = 55,where the values of d and d (cid:48) are selected based on the cross-validation method discussed in § Ning, Wolfgang, & Ghosh l ll llll ll lll llll ll l l ll lll lll l lll lll ll ll ll ll l l ll l l lll l ll ll ll ll ll l ll ll l ll l ll llll lll l ll llll ll l lll lll ll l

Radius (R

Earth ) M a ss ( M E a r t h ) No measurement errors l ll l lll ll ll l llll ll l l ll lll lll l lll lll ll ll ll ll l l ll l ll ll l ll ll ll ll ll l ll lll ll l ll llll llll ll llll ll l lll lll ll l

Radius (R

Earth ) M a ss ( M E a r t h ) s M = Earth , s R = Earth l lll ll lll ll l ll l ll ll l ll l ll ll l ll ll llll llll l ll l ll lll lll ll lll ll lll lll lll ll lll ll l ll l

Radius (R

Earth ) M a ss ( M E a r t h ) s M = Earth , s R = Earth

Figure C4 . The performance of our nonparametric M-R relation as assessed with simulated data: only for the largest mea-surement uncertainties does the inferred M-R relation deviate from the truth. Each simulation consists of 100 datasets, andeach of those contain 100 data points; all 10,000 simulated planets are plotted as small light blue points, with the larger darkblue points identifying those from one randomly chosen dataset, to show the typical planet-to-planet variation in any individualdataset. The diﬀerent panels were generated with diﬀerent measurement uncertainties; these are denoted above each panel inunits of Earth mass and Earth radii. For each panel, the black dashed line is the “true” M-R relation that we use to generatethe individual dataset (see text for details). The red lines illustrate the variability produced by ﬁtting our model to each ofthe 100 diﬀerent datasets: the solid red line denotes the mean of the most probable mass values (at each radius) across all 100model ﬁts (i.e. the mean of the 100 model ﬁt means), and the dashed red lines denote the 16% and 84% conﬁdence intervals ofthis most probable mass value (i.e. the standard error of the 100 model ﬁt means).C.

A SIMULATION STUDYTo check how missing values and measurement errors could aﬀect the estimation result, we conduct several simulationstudies. We simulate data from a power-law relation and use the Bernstein polynomial model to estimate that function,as we are interested in comparing the M-R relations estimated under diﬀerent settings.In this simulation study, we deﬁne the “true” M-R relation as follows: r ∼ U (0 , ⊕ and 8R ⊕ , and m ∼ N (1 . r . , ⊕ and on masses as 5, 20, 35 M ⊕ .We ﬁt each dataset to the model, choosing the degrees using the same 10-fold cross validation scheme that isdescribed in § onparametric Exoplanet Mass-Radius Relation l ll llll ll lll llll ll l l ll lll lll l lll lll ll ll ll ll l l ll l l lll l ll ll ll ll ll l ll ll l ll l ll llll lll l ll llll ll l lll lll ll l Radius (R

Earth ) M a ss ( M E a r t h ) Missing value

Figure C5 . The performance of our nonparametric method when the radius measurements are not distributed uniformly over0 < r < ⊕ : the M-R relation defaults to a ﬂat line where there is missing data. The points and lines here show the sameinformation as in Figure C4. relation. Information about the M-R relation is lost in this process, and so a less complex nonparametric model isrequired to ﬁt the data. This is reﬂected in a lower degree being recommended by our cross-validation scheme: forthe original simulated dataset with no measurement errors, d = 90, while d = 40 , , R ⊕ and 5 R ⊕ are removed. Comparing to the ﬁrstsubplot in Figure C4, the M-R relation is diﬀerent in the region where the data are missing. The model is trying to ﬁndthe optimal polynomials to estimate the M-R relation. The polynomials may not necessarily to be linear, thus in theregion with no data points, the M-R relation can be nonlinear. When there are missing values, Bernstein polynomialsimpute missing values by taking mean from the nearest region where data are presented. Thus the mean M-R relationis lower than the true relation on the left side of missing data region and higher than the true relation on the right side.While the mean M-R relation may depart from the true relation in the missing data region, the conﬁdence intervalscontains the true M-R relation, which does not rule out the possibility of obtaining the true relation. This gives usconﬁdence in applying our model to real Kepler and K2 data that have non-uniform sampling.REFERENCES