Extensive Studies of the Neutron Star Equation of State from the Deep Learning Inference with the Observational Data Augmentation
Prepared for submission to JHEP
Yuki Fujimoto (a), Kenji Fukushima (a,b), Koichi Murase (c)

(a) Department of Physics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
(b) Institute for Physics of Intelligence (IPI), The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
(c) Center for High Energy Physics, Peking University, Beijing 100871, China
E-mail: [email protected], [email protected], [email protected]

Abstract:
We discuss deep learning inference for the neutron star equation of state (EoS) using the real observational data of the mass and the radius. We make a quantitative comparison between the conventional polynomial regression and the neural network approach for the EoS parametrization. For our deep learning method to incorporate uncertainties in observation, we augment the training data with noise fluctuations corresponding to observational uncertainties. Deduced EoSs can accommodate a weak first-order phase transition, and we make a histogram for likely first-order regions. We also find that our observational data augmentation has a byproduct to tame the overfitting behavior. To check the performance improved by the data augmentation, we set up a toy model as the simplest inference problem to recover a double-peaked function and monitor the validation loss. We conclude that the data augmentation could be a useful technique to evade the overfitting without tuning the neural network architecture such as inserting the dropout.

In the central cores of neutron stars baryonic matter is highly compressed by the gravitational force. The inward force by gravity is balanced by the outward pressure which strongly depends on intrinsic properties of cold and dense matter subject to the strong interaction. The study of inner structures of the neutron star requires a relation between the pressure p and the energy density ε; namely, the equation of state (EoS), p = p(ε), at T = 0 (see, e.g., Refs. [1–5] for recent reviews). At the center the baryon number density n_B may reach 5–10 times as large as the saturation density of nuclear matter, i.e., n_0 ≃ 0.16 (nucleons)/fm³.

Towards the model-independent EoS determination, we should solve quantum chromodynamics (QCD), which is the first-principles theory of the strong interaction. At the moment, however, the model-independent results are only available at limited ranges of densities. At low densities of the order of n_0 we can apply ab initio methods combined with the nuclear force derived from Chiral Effective Field Theory (χEFT) with controlled uncertainty estimates [6–14] (see Ref. [15] for a recent review). At asymptotically high densities of n_B ≫ n_0 perturbative QCD calculations reliably converge [16–21] (see Ref. [22] for a recent review). The intermediate density region of several times n_0, which is relevant for the neutron star structure, still lacks trustable QCD predictions (see also Ref. [23] for a recent attempt to improve the convergence in the resummed perturbation theory). To tackle the intermediate density region from QCD, we need to develop non-perturbative methods such as the Monte-Carlo simulation of QCD on the lattice (lattice QCD), but the lattice-QCD application to finite density systems is terribly hindered by the notorious sign problem (see, e.g., Ref. [24] for a review).
This is why neutron-star-physics studies still rely on phenomenological EoS constructions: rather ab initio approaches [25, 26] near the saturation density, estimates employing the Skyrme interactions [27], the relativistic mean field theories [28], and the functional renormalization group [29], etc.

Meanwhile, we have witnessed significant advances in neutron star observations over the last decades. Now these advances have opened an alternative pathway for the model-independent extraction of the EoS by statistical methods. Observations include the Shapiro delay measurements of massive two-solar-mass pulsars [30–33], the radius determination of quiescent low-mass X-ray binaries and thermonuclear bursters [34–38] (see reviews [2, 39, 40] for discussions on the methods and associated uncertainties), detections of gravitational waves from binary mergers involving neutron stars by the LIGO-Virgo collaboration [41, 42], as well as the X-ray timing measurements of pulsars by the NICER mission [43, 44]. Typical observable quantities of neutron stars involve the mass M, the radius R, the compactness M/R, the tidal deformability Λ (and their variants, e.g., the Love number k_2), the quadrupole moment Q, the moment of inertia I, etc. The gravitational waves from binary neutron star mergers provide us with the information about the tidal deformability. Also, the NICER mission particularly targets the compactness M/R of stars by measuring the gravitational lensing of the thermal emission from the stellar surface. Some of these neutron star quantities are connected through the universal relations that are insensitive to the EoS details. Among such relations, the most well-known example is the I-Love-Q relation [45, 46], which relates the moment of inertia I, the Love number k_2, and the quadrupole moment Q (see Refs. [4, 47] and references therein for other universal relations and usages).

These observations have provided some insights on the EoS and invoked discussions about non-perturbative aspects of QCD: for example, the emergence of quark matter in the neutron star [48] and a certain realization of duality between the confined and the deconfined matter [49–57]. Moreover, these observations, particularly the gravitational wave observation, may provide a unique opportunity for studying the hadron-to-quark phase transition occurring in the binary star merger, and even better sensitivity will be achieved with future detectors [58–60].

For the purpose of extracting the most likely EoS from the observational data, there are diverse statistical technologies. The commonly used method is the Bayesian inference [34–36, 61–66]. There are still other methods such as the one based on the Gaussian processes, which is a variation of the Bayesian inferences with a nonparametric representation of the EoS [67–69], etc. Despite numerous efforts to quantitatively constrain the EoS, it is still inconclusive what the genuine EoS should look like because of uncertainties from the assumed prior distribution in the Bayesian analysis. Therefore, the EoS inference program is in need of complementary tools. As an alternative to these conventional statistical methods, the deep learning, namely, the machine learning employing deep neural networks (NNs), has been successfully applied to a wide range of physics fields [70–76]. Several studies have already put the NN in motion in the context of gravitational wave data analysis of binary neutron star mergers [77–80].
Along these lines, in our previous publications [81, 82] we proposed a novel approach to the EoS extraction based on the deep machine learning (see also Refs. [83–85] for related works).

In the present work we make further comprehensive analyses using the machine learning method developed previously [81, 82]. Here we give a brief synopsis of our machine learning approach to the neutron star EoS problem. Given an EoS, various pairs of stellar mass and radius, M-R, follow from the general relativistic structural equation, i.e., the Tolman-Oppenheimer-Volkoff (TOV) equation [86, 87]. We will express the inverse operation of solving the TOV equation in terms of the deep NN, train the NN with a sufficiently large data set of mock observations, and eventually obtain a regressor that properly infers the most likely EoS corresponding to the observational M-R inputs. While the other approaches to the neutron star EoS explicitly assume a prior distribution for the EoS probability distributions, our method only implicitly depends on the prior assumption. Therefore, in a sense that the implementations are based on different principles and the prior dependences appear differently, we can consider our method as complementary to other conventional approaches. We also note that independent algorithms would help us with gaining some information about systematics. For more detailed discussions on the prior, see Sec. 2.2.

This paper is organized as follows. Section 2 provides a detailed account of our problem setting, i.e., mapping neutron star observables to the EoS. Then, we outline our method by setting forth our parametrization for the EoS and reviewing the machine learning technique in general. Section 3 is devoted to the methodology study using mock data. We describe our data generating procedure and training setups, and then give a brief overview of the evolution of learning processes, which motivates the later analyses in Sec. 5. We showcase typical examples out of the whole results, and then we carry out statistical analyses for the data as a whole. We confront our NN method with a different method, namely, the polynomial regression, and compare the two methods to show that our method surpasses the polynomial regression. In Sec. 4, we present the EoSs deduced from several different sets of real neutron star observables. We explain our treatment of uncertainty quantification along with the possibility of a weak first-order phase transition in the deduced EoS results. In Sec. 5 we study the efficiency of data augmentation in the context of remedying the overfitting problem. Finally, we summarize this work and discuss future outlooks in Sec. 6.
Figure 1. Schematic of the TOV mapping Ψ_TOV and the regression analysis of the inverse TOV mapping from the M-R data.

Throughout this paper we use the natural unit system with G = c = 1 unless otherwise specified.

In this section we explicitly define our problem and summarize the basic strategy of our approach with the supervised machine learning. We will explain the concrete setup in each subsection where we adjust the strategy in accordance with the goal of each subsection. In the present study we want to constrain the EoS from the stellar observables. The EoS and the observables are non-trivially linked by the TOV equation which gives a means to calculate the neutron star structure from the EoS input. Thus, constraining the EoS from the observables is the inverse process of solving the TOV equation, but this inverse problem encounters difficulties from the nature of the observations. In Sec. 2.1 we formulate the inverse problem of the TOV equation and discuss its difficulties. We proceed to our approach to this problem using the supervised machine learning in Sec. 2.2. We closely describe the EoS parametrization and the data generation in Sec. 2.3. Then, we explain the design of the deep NN in Sec. 2.4 and its training procedures in Sec. 2.5.

2.1 M-R relation

In the present analysis we focus on the mass M and the radius R as the neutron star observables. Given a boundary condition of the core pressure p_c, the observables M and R of such a neutron star can be determined by solving the TOV equation [86, 87]:

  dp(r)/dr = −[ε(r) + p(r)][m(r) + 4πr³ p(r)] / {r [r − 2m(r)]},   (2.1)
  m(r) = 4π ∫₀^r r′² dr′ ε(r′),   (2.2)

where r is the radial coordinate which represents a distance from the stellar center. The functions, p(r) and ε(r), are the pressure and the energy density (i.e., the mass density), respectively, at the position r. The function m(r) represents the mass enclosed within the distance r. We can readily integrate these integro-differential equations from r = 0 with the initial condition, p(r = 0) = p_c, towards the outward direction. The radius of the neutron star, R, is defined by the surface condition: p(R) = 0. The mass of the neutron star is the total mass within R, i.e., M = m(R).

The only missing information in the combined TOV equations (2.1) and (2.2) is a relationship between p and ε, which is nothing but the EoS, i.e., p = p(ε). Once an EoS is fixed, we can draw a p_c-parametrized trajectory of (M(p_c), R(p_c)) in the M-R plane. This trajectory is called the M-R relation. Thus, the operation of solving the TOV equation can be considered as a mapping from the functional space of the EoSs onto the functional space of the M-R relations. Here we call this mapping the "TOV mapping" denoted by Ψ_TOV:

  Ψ_TOV : (EoS) ↦ (M-R relation).   (2.3)

It is known that the mapping is invertible, i.e., the inverse mapping Ψ_TOV⁻¹ exists [88]. Thus, ideally, one could uniquely reconstruct the EoS from an observed M-R relation. However, practically, we cannot access the entire continuous curve on the M-R relation (which we will briefly call the M-R curve) from neutron star observations.
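To make the forward mapping Ψ_TOV concrete, the following is a minimal sketch (not the code used in this work) of integrating Eqs. (2.1) and (2.2) for a given EoS; the callable eps_of_p (energy density as a function of pressure, in geometrized units with G = c = 1) and the central pressure are placeholders to be supplied by the user.

```python
import numpy as np
from scipy.integrate import solve_ivp

def solve_tov(eps_of_p, p_c, r_max=30.0):
    """Integrate the TOV equations (2.1)-(2.2) outward from the center.

    eps_of_p : callable, energy density as a function of pressure
               (both expressed in a common geometrized length unit, e.g. km^-2)
    p_c      : central pressure in the same units
    Returns (R, M) in that length unit; M in solar masses is roughly m / 1.4766 km.
    """
    def rhs(r, y):
        p, m = y
        eps = eps_of_p(p)
        dpdr = -(eps + p) * (m + 4.0 * np.pi * r**3 * p) / (r * (r - 2.0 * m))
        dmdr = 4.0 * np.pi * r**2 * eps
        return [dpdr, dmdr]

    # stop when the pressure drops to (numerically) zero: the stellar surface
    surface = lambda r, y: y[0] - 1e-12 * p_c
    surface.terminal = True

    r0 = 1e-6  # start slightly off-center to avoid the 1/r singularity
    sol = solve_ivp(rhs, (r0, r_max), [p_c, 0.0], events=surface,
                    rtol=1e-8, atol=1e-12)
    R = sol.t[-1]       # radius where p ~ 0
    M = sol.y[1][-1]    # enclosed mass at the surface
    return R, M
```

Sweeping the central pressure p_c over a range of values and collecting the resulting (R, M) pairs traces out the M-R curve of Eq. (2.3).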
In reconstructing the EoS from the observational data, we are facing twofold obstacles:

(i) Incompleteness — The M-R data from the observation does not form a continuous curve. Since one neutron star corresponds to one point on the M-R curve, the observation of neutron stars gives a limited number of M-R points and only partial information on the original M-R curve is available. Even with a sufficiently large number of neutron stars, the information on possible unstable branches of the M-R curve, on which the neutron stars do not exist, would be missing.

(ii) Uncertainties — Real data is not a point but a distribution. Each M-R point is accompanied by observational uncertainties represented as a credibility distribution. Moreover, the central peak position of the distribution is not necessarily on the genuine M-R relation.

We can introduce the following symbolic representation of the above complication associated with the observational limitation:

  Π_obs(ω) : (M-R relation) ↦ (M-R data)_ω.   (2.4)

This mapping by Π_obs(ω) is a convolution of a projection from the M-R relation to a set of points along the M-R curve and a probability distribution due to observational uncertainties. The distribution generally depends on observational scenarios (how many and what kinds of neutron stars are detected by which signals) and such dependence is collectively represented by ω. With this mapping, the relation between the EoS and the observed M-R data is more correctly expressed as

  Π_obs(ω) ∘ Ψ_TOV : (EoS) ↦ (M-R data)_ω.   (2.5)

Because of this additional filter by Π_obs(ω), one cannot reconstruct the correct EoS from the observational data just by calculating Ψ_TOV⁻¹. Since a part of the original information is lost through Π_obs(ω), this is an ill-posed problem, i.e., there is no unique well-defined answer due to insufficient information and uncertainties inherent in this problem.

Here, instead of reconstructing the unique EoS, we try to infer the most likely EoS from M-R observations with uncertainty; that is, the regression analysis of the inverse mapping including Π_obs(ω) (see Fig. 1 for a schematic illustration). In particular it is crucial to develop an approach leading to outputs robust against the observational uncertainties of the M-R data.

2.2 General strategy for the machine learning implementation

To solve this ill-posed problem we employ a method of machine learning with deep NNs to infer the neutron star EoS. In particular we consider the supervised machine learning to train the NN with as many and different ω's as possible. The trained NN can then receive the M-R data for one particular ω and return the output of the most likely EoS correspondingly. To put it another way, the "inversion" of the mapping Π_obs ∘ Ψ_TOV is approximated by the regression, particularly by the deep NN in our approach, which would be symbolically written as Reg[Π_obs ∘ Ψ_TOV]⁻¹.
In the supervised learning, we need to prepare the training data and the validation data composed of pairs of the input M-R data and the desired output EoS. To generate the training data efficiently, we make use of the asymmetry between Π_obs ∘ Ψ_TOV and
Reg[Π_obs ∘ Ψ_TOV]⁻¹; that is, we can straightforwardly calculate the forward mapping of Π_obs ∘ Ψ_TOV by modeling the observation Π_obs(ω), while the latter inverse mapping, which is what we currently want to know, is more non-trivial. Thus, first, we randomly generate many possible answer EoSs represented by several parameters. We next generate the corresponding M-R data by applying the forward mapping with various ω's. We will explain technical details of handling and simplifying ω in Sec. 2.3.

Now, we turn to the description of the architecture of our NN model, and we will optimize the model parameters so that the model can infer the answer EoS corresponding to the input training M-R data. During the training, it is important to monitor the training quality by checking the prediction error behavior for the validation data. After the optimization process is complete with good convergence of the validation error, we can test the predictive power using the mock data for which the true answer EoS is known (see Sec. 3 for actual calculations). Once the method passes all these qualification tests, we finally proceed to the application of the real observational data to obtain the most realistic EoS (see Sec. 4 for details).

Here, we comment on an alternative possibility of the inference problem formulation. As already mentioned, the inverse TOV mapping, Ψ_TOV⁻¹, is well-defined by itself. Thus, it is also feasible to decompose the inference as
Reg[Π_obs ∘ Ψ_TOV]⁻¹ = Ψ_TOV⁻¹ ∘ Reg[Π_obs]⁻¹ and train the NN aimed to approximate Reg[Π_obs]⁻¹; that is, the NN model would predict the M-R curve from the input M-R data. We will not adopt this strategy since it is unclear in this approach how to measure the reconstruction performance of the M-R curve. Our target space is the EoS space, but the distance in the M-R space is not uniformly reflected in the distance in the desired EoS space. We are ultimately interested in the EoS, so it is natural to directly model Reg[Π_obs ∘ Ψ_TOV]⁻¹ by maximizing the performance for the EoS reconstruction.

We also make a remark on the discriminative modeling and the generative modeling to clarify a difference between our machine learning method and other statistical approaches such as the Bayesian inference. The approach we are advocating in this paper belongs to the discriminative modeling: our NN model directly predicts the response variable, i.e., the EoS, for a fixed set of explanatory variables, i.e., the M-R data. In contrast, the statistical approaches such as the Bayesian inference for the neutron star EoS are to be categorized as the generative modeling. The Bayesian inference first models occurrence of the combination of the explanatory and response variables by the joint probability, Pr(M-R data, EoS) ≡ Pr(M-R data | EoS) Pr(EoS).
Here, Pr(EoS) models the prior belief on the EoS, and Pr(M-R data | EoS) models the observation process. Then, the Bayes theorem gives a posterior distribution of the EoS, Pr(EoS | M-R data),
from the joint probability, which corresponds to the likely EoS distribution for given M-R data.

These discriminative and generative approaches can be considered to be complementary to each other. They have the pros and cons as follows:

• The advantage in the generative approaches is that the EoS prediction for a given M-R data takes a distribution form, so that the prediction uncertainties can be estimated naturally. In the discriminative model, on the other hand, additional analysis is needed to estimate the prediction uncertainties. We will perform the additional uncertainty analysis within our method in Sec. 4.3. At the same time, we will utilize our uncertainty analysis for a physics implication as discussed in Sec. 4.5.

• In the discriminative modeling, once the NN training is completed, the most likely EoS corresponding to a given M-R data can immediately result from the mapping represented by the NN: the NN models the inference procedure by itself. Thus, the computational cost is much smaller than statistical analyses, which enables us to perform supplementary estimation on observables and/or quality assessments if necessary. In the generative approaches, on the other hand, the posterior calculation is computationally expensive because the EoS parameter space is large. In addition, one needs to calculate the posterior for each M-R data separately. If one applies the Bayesian inference to the real observational data only once, this computational cost would not matter. The computational cost becomes problematic, however, if one wants to examine the statistical nature of the inference using a large number of generated EoSs and M-R data.

2.3 Random EoS generation with parametrization by the speed of sound

The training data consists of pairs of the input M-R data and the output EoS, p(ε), parametrized by a finite number of variables. We first generate EoSs by randomly assigning values to the EoS parameters, solve the TOV equation to find the corresponding M-R curve, and finally generate the M-R data. The M-R data generation involves Π_obs(ω), namely, we sample observational points of neutron stars on the M-R curve, and probabilistically incorporate finite observational uncertainties, which should mimic observations in the scenario ω. The real M-R data is composed of the probability distribution for each neutron star on the M-R plane, but in practice we should simplify ω and need to parametrize the M-R data as well. We will discuss the parametrization later, and in this section we focus on our EoS parametrization and generation.

In designing the EoS parametrization, it is important to consider its affinity to the physical constraints such as the thermodynamic stability and the causality, i.e., p should be a monotonically non-decreasing function of ε, and the speed of sound, c_s, should not be larger than the speed of the light. For this reason, the squared speed of sound, which is derived from

  c_s^2 = ∂p/∂ε,   (2.6)

is the convenient choice as the EoS parameter to impose the physical constraints easily. Also, we need to smoothly connect the parametrized EoS to the empirical EoS known from the low density region. Up to the energy density of the saturated nuclear matter, ε_0, we use a conventional nuclear EoS, specifically SLy4 [27]. Because the EoS is well constrained by the saturation properties and the symmetry energy measurements near the saturation density, this specific choice of SLy4 would not affect our final results.
In the energy region above ε_0 we employ the standard piecewise polytropic parametrization and partition a density range into a certain number of segments. In our present analysis we choose the density range [ε_0, ε_5] and six energy points, ε_i (i = 0, ..., 5) with ε_5 = 8ε_0, equally separated in the logarithmic scale. Then, (ε_{i−1}, ε_i) (i = 1, ..., 5) form 5 segments. To be more specific, ε_0 = 150 MeV/fm³ and thus ε_5 = 1200 MeV/fm³, with the intermediate points fixed by the logarithmic spacing. The EoS parameter is the average speed of sound squared, c_{s,i}^2 ≡ ⟨c_s^2⟩ = ⟨∂p/∂ε⟩ (i = 1, ..., 5), in the i-th segment. From these definitions we see that the pressure values at the i-th segment boundaries, p_i, read

  p_i = p_{i−1} + c_{s,i}^2 (ε_i − ε_{i−1}),   (2.7)

where p_0 is determined by p_0 = p(ε_0) from our chosen nuclear EoS, i.e., SLy4. We make a polytropic interpolation for the EoS by p = k_i ε^{γ_i} whose exponent and coefficient are given by

  γ_i = ln(p_i/p_{i−1}) / ln(ε_i/ε_{i−1}),   k_i = p_i / ε_i^{γ_i}.   (2.8)

One can confirm that c_{s,i}^2 is indeed an average in each segment under this construction as

  ⟨c_s^2⟩ ≡ ∫_{ε_{i−1}}^{ε_i} dε/(ε_i − ε_{i−1}) c_s^2 = ∫_{ε_{i−1}}^{ε_i} dε/(ε_i − ε_{i−1}) ∂p/∂ε = [1/(ε_i − ε_{i−1})] ∫_{p_{i−1}}^{p_i} dp = (p_i − p_{i−1})/(ε_i − ε_{i−1}) = c_{s,i}^2.

For the EoS generation we randomly assign the average sound velocities squared, c_{s,i}^2, from the uniform distribution within δ ≤ c_{s,i}^2 < 1 − δ.
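As an illustration of this construction, here is a minimal sketch of drawing the c_{s,i}^2 parameters and building the piecewise polytrope of Eqs. (2.7) and (2.8); it is not the code used in this work, and the anchor pressure p0 (which should come from SLy4 at ε_0) and the margin delta are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

eps0 = 150.0                                    # MeV/fm^3, matching point to the nuclear EoS
eps_pts = eps0 * 8.0 ** (np.arange(6) / 5.0)    # epsilon_i, log-equally spaced up to 8*eps0
p0 = 2.0                                        # MeV/fm^3, placeholder for p(eps0) from SLy4
delta = 0.01                                    # placeholder margin keeping 0 < c_s^2 < 1

def random_eos():
    """Draw c_{s,i}^2 uniformly and build the piecewise polytrope of Eqs. (2.7)-(2.8)."""
    cs2 = rng.uniform(delta, 1.0 - delta, size=5)
    p_pts = np.empty(6)
    p_pts[0] = p0
    for i in range(1, 6):
        p_pts[i] = p_pts[i - 1] + cs2[i - 1] * (eps_pts[i] - eps_pts[i - 1])      # Eq. (2.7)
    gamma = np.log(p_pts[1:] / p_pts[:-1]) / np.log(eps_pts[1:] / eps_pts[:-1])   # Eq. (2.8)
    k = p_pts[1:] / eps_pts[1:] ** gamma

    def pressure(eps):
        i = np.clip(np.searchsorted(eps_pts, eps) - 1, 0, 4)
        return k[i] * eps ** gamma[i]           # p = k_i * eps^gamma_i within segment i
    return cs2, pressure
```

Feeding each generated pressure(ε) relation to a TOV solver like the one sketched in Sec. 2.1 yields the (EoS, M-R curve) pairs used below.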
The upper bound comes from the causality c_s^2 < c^2 = 1, and the lower bound from the thermodynamic stability ∂p/∂ε ≥ 0. Here, a small margin δ is inserted as a regulator to avoid singular behavior of the TOV equation. We note here that we allow for small c_s^2 corresponding to a (nearly) first-order phase transition. Repeating this procedure, we generate a large number of EoSs. For each generated EoS, we solve the TOV equation to obtain the M-R curve as explained in Sec. 2.1. In Fig. 2 we show the typical EoS data and the corresponding M-R curves used in our later analyses.

Figure 2. Randomly generated EoSs (left) and the corresponding M-R curves (right).

It should be noted here that we excluded artificial biases in the EoS generation and allowed for any possible EoSs including ones disfavored by the model calculations or the observational data (e.g., an EoS whose maximum mass well exceeds the observationally known masses in the right panel of Fig. 2). This is important for the NN to be free from biases coming from specific model preference. In this way our analysis gains an enough generalization ability, covers even exotic scenario cases, and captures the correct underlying relations between the input and the output.

Before closing this part, let us mention a possible improvement for the random EoS generation, though we would not utilize this improvement to keep our present approach as simple as possible. The problem arises from our assumed uniform distribution of c_{s,i}^2. The parametrization and generation algorithm as explained above definitely allows for a first-order phase transition for sufficiently small c_{s,i}^2's; however, due to the uniform distribution, it is a tiny percentage among the whole data for the generated EoSs to accommodate a first-order phase transition, and this may be a part of our "prior" dependence. Since a strong first-order transition scenario has already been ruled out from phenomenology, this prior dependence should be harmless. Nevertheless, if necessary, we can naturally increase the percentage of the EoSs with a first-order phase transition in a very simple way as follows: In Eq. (2.8) we carry out the linear interpolation in the log-log plane. Alternatively, we can use the spline interpolation in the log-log plane. With such a smooth interpolation, there appear the energy density regions with negative ∂p/∂ε, i.e., the EoS can be non-monotonic. We can then replace this non-monotonic part by the first-order phase transition using the Maxwell construction. This is one effective and natural way to enhance a finite fraction of EoSs with a first-order phase transition while keeping the same c_{s,i}^2's. We also note that one more merit in this procedure is that the end points of the first-order energy regions are not restricted to be on the segment boundaries. Thus, the strength of the first-order phase transition is not necessarily discretized by the segment width, but an arbitrarily weak first-order phase transition could be realized with the same parametrization. We say again that we would not adopt this procedure: the present observational precision gives a tighter limitation and for the moment we do not benefit from this EoS improvement. In the future when the quality and the quantity of the observational data ameliorate, the above mentioned procedure could be useful.

2.4 Neural network design: a general introduction

The neural network, NN, is one representation of a function with fitting parameters.
The deep learning, i.e., the machine learning method using a deep NN, is nothing but the optimization procedure of the parameters contained in the function represented by the NN. In particular, the supervised learning can be considered as the fitting problem (namely, regression) for given pairs of the input and the output, i.e., the training data. We often refer to a NN "model" to mean the optimized system of the function given in terms of the NN.

The advantage of the deep learning, as compared to naive fitting methods, lies in the generalization properties of the NN. We need not rely on any preknowledge about reasonable forms of fitting functions: the NN with a sufficient number of neurons is capable of capturing any continuous functions [89, 90]. With a large number of neurons (and layers), there are many fitting parameters, and recent studies of machine learning have developed powerful optimization methods for such complicated NNs.

The model function of a feedforward NN can be expressed as follows:

  y = f(x | {W^(1), b^(1), ..., W^(L), b^(L)}),   (2.9)

where x and y are the input and the output, respectively. The fitting parameters, {W^(ℓ), b^(ℓ)}_{ℓ=1}^L, are given in the form of matrices and vectors, respectively. For calculation steps, we first prepare (L+1) layers (including the input and the output layers). Each layer x^(ℓ) (ℓ = 0, ..., L) consists of nodes called neurons corresponding to the vector components, x_1^(ℓ), ..., x_{n_ℓ}^(ℓ). We note that the number of neurons, n_ℓ, can be different for different ℓ. The input x is set to the first layer x^(0) (which is labeled as ℓ = 0 in this paper), and the subsequent layers are calculated iteratively as

  x^(0) = x,   (2.10)
  x^(ℓ) = σ^(ℓ)(W^(ℓ) x^(ℓ−1) + b^(ℓ))   (ℓ = 1, ..., L).   (2.11)

The L-th layer becomes the final output, y = f(x) = x^(L). Here, σ^(ℓ)(x)'s are called activation functions, whose typical choices include the sigmoid function: σ(x) = 1/(e^{−x} + 1), the ReLU: σ(x) = max{0, x}, the hyperbolic tangent: σ(x) = tanh(x), etc. In the above notation σ^(ℓ) returns a vector by applying the activation to each vector component of the function argument, i.e., σ(x) = (σ(x_1), ..., σ(x_n))^T. The fitting parameters, W^(ℓ) and b^(ℓ), on the ℓ-th layer denote the weights between nodes in the two adjacent layers and the activation offset at each neuron called bias, respectively. These fitting parameters have no physical interpretation at the setup stage. The general architecture is schematically depicted in Fig. 3, in which the calculation proceeds from the left with input x to the right with output y.

Figure 3. Schematic illustration of the feedforward neural network.

The complexity choice of the NN structure, such as the number of layers and neurons, has a tradeoff problem between the fitting power and the overfitting. To capture the essence of the problem, the complexity of layers and neurons should be sufficiently large. At the same time, to avoid the overfitting problem, and to train the NN within reasonable time, the number of layers and neurons should not be too large. There is no universal prescription and the performance depends on the problem to be solved.
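As a concrete illustration of Eqs. (2.10) and (2.11), the forward evaluation can be written in a few lines; the layer sizes below are arbitrary toy values, not the architecture used later in this paper.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def feedforward(x, weights, biases, activations):
    """Evaluate y = f(x | {W, b}) by iterating x^(l) = sigma^(l)(W^(l) x^(l-1) + b^(l))."""
    layer = x                                    # x^(0) = input, Eq. (2.10)
    for W, b, sigma in zip(weights, biases, activations):
        layer = sigma(W @ layer + b)             # Eq. (2.11)
    return layer                                 # x^(L) = output y

# toy example: 30 inputs -> 16 hidden neurons (ReLU) -> 5 outputs (tanh)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(16, 30)), rng.normal(size=(5, 16))]
bs = [np.zeros(16), np.zeros(5)]
y = feedforward(rng.normal(size=30), Ws, bs, [relu, np.tanh])
```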
For our setup we found overall good performance when we chose the neuron number on the first layer greater than the input neuron number. A more sophisticated NN may be possible, but for our present purpose a simple network structure as in Fig. 3 is sufficient.

2.5 Loss function and training the neural network

For the actual procedure of machine learning we need to choose a loss function and minimize it, which is called training. We define a loss function as

  L({W^(ℓ), b^(ℓ)}_ℓ) ≡ ∫ dx Pr(x) ℓ(y, f(x | {W^(ℓ), b^(ℓ)}_ℓ)),   (2.12)

where ℓ(y, y′) quantifies the distance or error between the NN prediction y′ and the answer y provided in the training data. The exact formula for the distribution Pr(x) is not necessarily known in general. Moreover, even if it is known, the multidimensional integration in Eq. (2.12) is practically intractable, so let us approximate it with the sum over the samples in the training data, D = {(x_n, y_n)}_{n=1}^{|D|}, with |D| denoting the sample size:

  L({W^(ℓ), b^(ℓ)}_ℓ) ≈ (1/|D|) Σ_{n=1}^{|D|} ℓ(y_n, f(x_n | {W^(ℓ), b^(ℓ)}_ℓ)).   (2.13)

The optimization problem we deal with here is to minimize the above loss function by tuning the network parameters {W^(ℓ), b^(ℓ)}_ℓ. The minimization is usually achieved with the stochastic gradient descent method or its improved versions. The deterministic methods would be easily trapped in local minima or saddle points of the loss function. In practice, therefore, the stochastic algorithms are indispensable. In the gradient descent method the parameters {W^(ℓ)(t), b^(ℓ)(t)}_ℓ at the iteration step t are updated along the direction of the derivative of the loss function with respect to the parameters:

  W^(ℓ)(t+1) = W^(ℓ)(t) − η ∂L(W^(ℓ)(t))/∂W^(ℓ)(t),   (2.14)

where η is called the learning rate. We can update b^(ℓ) in the same way. We usually evaluate the derivatives by using the mini-batch method. In the mini-batch method, the training data set is first randomly divided into multiple subsets, which are called mini-batches. Then, the derivative of the loss function is estimated within a mini-batch B. Using the explicit expression (2.13), the approximated derivatives read:

  ∂L(W^(ℓ))/∂W^(ℓ) ≈ (1/|B|) Σ_{n=1}^{|B|} ∂ℓ(y_n, f(x_n | {W^(ℓ), b^(ℓ)}_ℓ))/∂W^(ℓ),   (2.15)

where the batch size |B| denotes the number of sample points in B, whose optimal choice differs case-by-case. The epoch counts the number of scans over the entire training data set D. The parameters are updated for each mini-batch, so one epoch amounts to |D|/|B| updates. Usually, the derivative, ∂ℓ/∂W appearing in Eq. (2.15), is evaluated by a method called backpropagation.

For numerics we basically use a Python library, Keras [91] with TensorFlow [92] as a backend. In this work we specifically employ msle, i.e., the mean square logarithmic errors, for the choice of the loss ℓ(y, y′) appearing in Eq. (2.12); that is,

  ℓ_msle(y, y′) ≡ |log y − log y′|².   (2.16)

We use a specific fitting algorithm, Adam [93], with the mini-batch size being 100 for the analysis in Sec. 3 and 1000 for the analysis in Sec. 4.
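A minimal Keras/TensorFlow sketch of such a training setup is given below; it is an illustration under stated assumptions rather than the actual script of this work. The layer sizes anticipate the architecture listed later in Table 1 of Sec. 3, and x_train, y_train, x_val, y_val are placeholders for the generated M-R inputs and c_{s,i}^2 targets.

```python
from tensorflow import keras

def build_model():
    """Feedforward regressor: 30 inputs (15 M-R points) -> 5 outputs (c_{s,i}^2)."""
    model = keras.Sequential([
        keras.layers.Dense(60, activation="relu", input_shape=(30,)),
        keras.layers.Dense(40, activation="relu"),
        keras.layers.Dense(40, activation="relu"),
        keras.layers.Dense(5, activation="tanh"),   # tanh bounds the output below c_s^2 = 1
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="msle")                      # mean squared logarithmic error, Eq. (2.16)
    return model

# Usage with the generated data sets (placeholders):
# model = build_model()
# model.fit(x_train, y_train, batch_size=100, epochs=100,   # epochs value is illustrative
#           validation_data=(x_val, y_val))
```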
We initialize our NN parameters with the Glorot uniform distribution.

This section is an extensive continuation from our previous work on the methodology study [81]. We here present more detailed and complete investigations of various performance tests.

In this section, to put our considerations under theoretical control, we assume ω to be a hypothetical observational scenario that 15 neutron stars are observed and (M_i, R_i) (i = 1, ..., 15) are known with a fixed size of observational uncertainty following the previous setup in Refs. [2, 37, 38, 81]. We design a NN, generate the validation data as well as the training data, and train the NN to predict the EoS parameters, c_{s,i}^2, from the artificially generated M-R data, (M_i, R_i) (i = 1, ..., 15) (i.e., mock data). We discuss the training behavior of the NN, the reconstruction performance of the EoS and the M-R curve, and the distribution of reconstruction errors.
Figure 4 . Schematic flow of data generation procedures for the analysis in Sec. 3.
For the current study in this section, we deal with the input M-R data, (M_i, R_i) (i = 1, ..., 15), and the output EoS parameters, c_{s,i}^2 (i = 1, ..., 5), as described in the previous section. A combination of these input and output data forms a sample point in the training and the validation data, and we need to generate many independent combinations to prepare the training and the validation data.

The procedure for sampling the M-R points is sketched in Fig. 4. We follow the basic strategy described in Sec. 2.3 and introduce minor modifications for the current setup. For each randomly generated EoS, p = p(ε), we solve the TOV equation to get the M-R curve and identify the maximum mass M_max of the neutron star, corresponding to (1) and (2) in Fig. 4. If M_max does not reach the observed maximum, such an EoS is rejected from the ensemble. Then, for each EoS, we randomly sample 15 pairs of (M_i, R_i) on the corresponding M-R curve, as shown in (3) in Fig. 4, assuming a uniform distribution of M_i over the interval [M_⊙, M_max]. If there are multiple values of R corresponding to one M, we take the largest R, discarding others belonging to unstable branches. In this way we select 15 points of (M*_i, R*_i) on the M-R curve.

We also expect the NN to learn that observational data should include uncertainties, ΔM_i and ΔR_i. We randomly generate ΔM_i and ΔR_i according to the normal distribution with assumed variances, σ_M = 0.1 M_⊙ for the mass and σ_R = 0.5 km for the radius. These variances should be adjusted according to the real observational uncertainty as we will treat in Sec. 4, but for the test purpose in this section, we simplify the treatment by fixing σ_M and σ_R by hand. Furthermore, the distributions are not necessarily centered on the genuine M-R curve as in (3) in Fig. 4. So, we shift data points from (M*_i, R*_i) to (M_i = M*_i + ΔM_i, R_i = R*_i + ΔR_i). Importantly, we repeat this procedure to generate an ensemble of randomly shifted data for each EoS, and the repeating number (and thus the number of obtained shifted data) is denoted by n_s in this work. Our typical choice is n_s = 100, whose effects on learning will be carefully examined in later discussions. For studies in this section, we have prepared 1948 EoSs (2000 generated and 52 rejected), so the total size of the training data set is 1948 × n_s. We find that the learning proceeds faster if the data is normalized appropriately to be consistent with the Glorot initialization; we use M_i/M_norm and R_i/R_norm with M_norm = 3 M_⊙ and R_norm = 20 km.

Table 1. Neural network architecture used in Sec. 3. In the input layer, 30 neurons correspond to the input 15 points of the mass and the radius. In the last layer 5 neurons correspond to the 5 output parameters of the EoS.

  Layer index   Neurons   Activation Function
  0             30        N/A
  1             60        ReLU
  2             40        ReLU
  3             40        ReLU
  4             5         tanh

The NN is optimized to fit the training data so as to have a predictive power. To test it, we use the validation loss calculated for the validation data, i.e., independent mock data of the neutron star observation. We separately generate 200 EoSs, among which 196 EoSs pass the two-solar-mass condition, and sample an n_s = 1 observation for each EoS to generate the validation data set.

It is worth mentioning here that calculating M-R curves requires repeated numerical integration of the TOV equation with various boundary conditions, which is computationally expensive.
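In contrast, the sampling-and-shifting step just described costs almost nothing once the M-R curve is known. A minimal sketch of this step (using the Sec. 3 values σ_M = 0.1 M_⊙, σ_R = 0.5 km and a hypothetical helper mr_curve returning the stable-branch radius R(M); the packing of the 30 input values is one possible choice):

```python
import numpy as np

rng = np.random.default_rng(42)
SIGMA_M, SIGMA_R = 0.1, 0.5      # solar masses and km, as in Sec. 3
M_NORM, R_NORM = 3.0, 20.0       # normalization of the NN inputs

def augment(mr_curve, m_max, n_s=100, n_points=15):
    """Generate n_s noisy 'observations' of n_points stars from one M-R curve."""
    samples = []
    for _ in range(n_s):
        m_true = rng.uniform(1.0, m_max, size=n_points)            # uniform in [M_sun, M_max]
        r_true = np.array([mr_curve(m) for m in m_true])
        m_obs = m_true + rng.normal(0.0, SIGMA_M, size=n_points)   # shift by observational noise
        r_obs = r_true + rng.normal(0.0, SIGMA_R, size=n_points)
        samples.append(np.concatenate([m_obs / M_NORM, r_obs / R_NORM]))
    return np.array(samples)     # shape (n_s, 30): each row is one NN input
```

Only the noise injection is repeated; the expensive TOV integration is performed once per EoS.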
Generally speaking, in the machine learning approach, the preparation of the training data is often the most time-consuming part. In contrast, increasing the sample size by repeating the observation procedure does not need further computational cost. This procedure can actually be regarded as a kind of the data augmentation by the noise injection, where the observational errors play the role of the injected noise. Let us call this procedure the observational data augmentation hereafter. This observational data augmentation has actually another virtue, which will be discussed in later sections.

We are constructing a NN that can give us one EoS in the output side in response to one "observation" of 15 neutron stars, (M_i, R_i) (i = 1, ..., 15), in the input side. Thus, the number of neurons in the input and the output layers should be matched with the size of the observational data and the number of EoS parameters, respectively. The input layer with 30 neurons matches 15 M-R points (30 input data). We sorted the 15 M-R pairs by their masses in ascending order, but we found that the sorting does not affect the overall performance. The output neurons in the last layer correspond to the prediction target, i.e., 5 parameters of the speed of sound.

The concrete design of our NN is summarized in Tab. 1. We choose the activation function at the output last layer as σ^(4)(x) = tanh(x) so that the speed of sound is automatically bounded in a causal range c_s^2 < 1. In principle the output could be unphysical: c_s^2 < 0 may occur, and then we readjust it as c_s^2 = δ. For the other layers we choose the ReLU, i.e., σ^(ℓ)(x) = max{0, x} (ℓ = 1, ..., L − 1), which is known to evade the vanishing gradient problem.

The stochastic gradient descent method is an iterative procedure to adjust the NN parameters step by step and to gradually make the loss function smaller. To judge the point where we stop the iteration, we need to monitor the convergence of the loss function in training. For this purpose we plot the loss function as a function of training time (i.e., the number of iteration units), which we call the learning curve in this work.

The actual shape of the learning curve depends on the initial parameters of the NN and also on the choice of mini-batches in the stochastic gradient descent method. Although the learning curve differs every time we start the new training, we can still find some common tendency. In Fig. 5 we show the typical behavior of learning curves for our neutron star problem.

Figure 5. Typical behavior of learning curves in our setup. Solid lines show the validation loss and the dashed lines show the training loss.

It should be noted that the horizontal axis in Fig. 5 is the number of mini-batches. In this study we will compare the training speeds with different n_s but with the fixed mini-batch size. The number of epochs – the number of scans over the whole training data set – is often used as the unit of training time, but in the present study for the training speeds, we need to adopt the number of mini-batches, which is proportional to the actual number of parameter updates and the computational time of the training.

In Fig. 5 we plot two different learning curves for a single training: one for the training loss (dashed line) and the other for the validation loss (solid line).
The training and the validation losses are the loss functions calculated for the training and the validation data sets, respectively, by using Eq. (2.13). The former diagnoses how well the model remembers the training data, and the latter diagnoses how well the model properly predicts unseen data by generalizing the learned information.

Here, we compare the learning curves for two different training data sets with n_s = 100 and n_s = 1. By observing plots like Fig. 5 for many trials on the training, we can read out two general tendencies:
(i) The value of the validation loss of n_s = 1 is larger than that of n_s = 100. A possible interpretation is that the loss function can be more trapped in local minima/saddle points for smaller n_s.

(ii) For a sufficiently large number of mini-batches, for n_s = 1 the training loss keeps decreasing with the increasing number of batches, whilst the validation loss turns to increasing behavior. The loss function of n_s = 1 apparently shows overfitting behavior, but the overfitting is not seen for n_s = 100.

From these results, we may argue that the data augmentation with sufficiently large n_s could be useful to improve the problems of local minimum trapping and overfitting. To justify or falsify this speculation, we will perform further quantitative analyses on the observational data augmentation under a simple setup in Sec. 5. We note in passing that recent developments show a possibility of over-parametrized networks, which have a very large number of parameters, to reconcile a good generalization ability and an elusion of overfitting [94, 95]. Here, our point is that the data augmentation is needed to incorporate the physical uncertainty rather than to improve the learning quality, but it may have such a byproduct.

Once the loss function converges, we can use the trained NN model to infer an EoS from an observation of 15 M-R points. We picked two typical examples represented by orange and blue in Fig. 6. In the left panel of Fig. 6 the dashed lines represent randomly generated EoSs. We note that the two EoSs are identical in the low density region because the SLy4 is employed at ε ≤ ε_0. The corresponding genuine M-R relations are shown by the dashed lines in the right panel of Fig. 6. Randomly sampled mock observations consisting of 15 M-R points are depicted by ellipses where the centers are the "observed" (M, R) and the major and minor axes show observational uncertainties in the R and M directions, respectively.

Figure 6. (Left) Two typical examples of the randomly generated EoSs (dashed lines) and the NN outputs (solid lines) reconstructed from one observation of 15 M-R points [see the right panel for actual (M_i, R_i)]. (Right) Randomly sampled 15 data points (centers of the ellipses) and the M-R curve from the reconstructed EoS (solid lines) and the M-R curve from the original EoS (dashed lines). The orange and blue colors correspond to the two EoSs shown in the same color in the left panel.

The reconstructed EoSs are depicted by solid lines in the left panel of Fig. 6. We can see that the reconstructed EoSs agree quite well with the original EoSs for these examples. It would also be interesting to make a comparison of the M-R relations corresponding to the original and the reconstructed EoSs. The solid lines in the right panel of Fig. 6 represent the M-R relations calculated with the reconstructed EoSs. As is expected from the agreement in the EoSs in the left panel of Fig. 6, the original and the reconstructed M-R relations are close to each other.

Figure 7. The upper and the lower panels show the histograms of prediction errors and values, respectively, of c_{s,i}^2. The vertical axis is the probability density, i.e., the count in each bin is divided by the bin width and the total count.
More details are already reported in our previous publication [81].

We make the detailed error analysis of the reconstructed EoSs in terms of the speed of sound. To increase the statistics, for the present analysis, we set n_s = 1000 (remember that increasing n_s is almost costless). Now, we introduce the prediction error in the speed of sound:

  δc_{s,i}^2 ≡ (c_{s,i}^2)^(rec) − (c_{s,i}^2)*   (i = 1, ..., 5),   (3.1)

where (c_{s,i}^2)^(rec) and (c_{s,i}^2)* are the reconstructed and original values, respectively, for the speed of sound. The first row of Fig. 7 shows the histograms of δc_{s,i}^2 (i = 1, ..., 5). We see that the prediction error at the lower density (i.e., smaller i) is centered at the origin while that at the higher density (i.e., larger i) is distributed widely, which means that the observational data efficiently constrains lower density regions. At the highest density with i = 5, the prediction error roughly follows the uniform distribution within the physical range. This behavior can be understood as follows. The original values of the speed of sound in the validation data are generated by the uniform distribution in the range [0, 1], while the distributions of the predicted c_{s,i}^2 (i = 1, ..., 5) are shown in the second row of Fig. 7. At lower density, the distribution of the prediction roughly follows the original uniform distribution. However, at higher density, the distribution of the prediction peaks around c_{s,i}^2 ∼ 0.5. At the highest density, therefore, the NN always predicts c_{s,i}^2 ∼ 0.5 for any inputs. This is because c_{s,i}^2 ∼ 0.5 is an optimal prediction to minimize the loss function when the constraining information is inadequate. The exact peak position is slightly smaller than 0.5, which is because our choice of the loss function, msle, gives larger weights for smaller values. Thus, the approximate uniform distribution of the prediction errors at the highest density (as seen in the upper rightmost panel of Fig. 7) just reflects the distribution of the original values relative to the fixed prediction.

From this analysis we conclude that the observation data should contain more information on the lower density EoS but not much on the high density EoS around ε ∼ ε_5. This is quite natural, but such a quantitative demonstration in Fig. 7 is instructive. Also, the results in Fig. 7 tell us two important aspects of our NN approach. First, the NN tends to give a moderate prediction, not too far from all possible values, when the uncertainty is large. This is nontrivial: the posterior distribution of the EoS is random, but such information is lost after the NN filtering. This is reasonable behavior in a sense that we want to design the NN to predict the most likely EoS instead of the posterior distribution. Secondly, the moderate prediction for unconstrained cases may explicitly depend on the choice of the loss function and also on the prior distribution, i.e., the distribution of the EoSs in the training data set in our problem. Thus, we should be careful of possible biases when we discuss unconstrained results in the high density regions.

The error analysis so far is separately made for each EoS parameter. In Fig. 8, to extract more detailed information, we plot the marginalized distributions for different pairs of the EoS parameters with corresponding Pearson's correlation coefficients r_{ij}. We see that the speeds of sound of neighboring density segments have negative correlations, which is similar to the behavior seen in the Bayesian analysis [63].
The underlying mechanism for this behavior can be explained in the same way as in Ref. [63]: an overprediction in a segment can be compensated by an underprediction in the adjacent segment, and vice versa. The correlations are clearer in the lower density regions in which the predicted c_{s,i}^2's have better accuracy. We also observe a small and positive correlation, e.g., r > 0, between the parameters separated by two segments. This is also expected from the two negative correlations in between. The other correlations involving parameters in the high density regions are consistent with zero within the statistical error.

Figure 8. Predicted errors marginalized for different pairs of EoS parameters. The first to fourth columns correspond to δc_{s,i}^2 (i = 1, ..., 4). Similarly, the first to fourth rows correspond to δc_{s,i}^2 (i = 2, ..., 5). Colors show the normalized counts in each bin, i.e., the counts in the bin divided by the total count and multiplied by the bin number. Pearson's correlation coefficients r_{ij} for respective parameter pairs are shown in panels with the statistical uncertainty enclosed in parentheses.

It is an interesting question how the machine learning method could be superior to other conventional approaches as a regression tool. In this section we make a detour, introducing a polynomial model, and checking the performance of a traditional fitting method. We can model the regression of the EoS from the M-R data with the following polynomial fitting of degree n:

  y_i = g_i({x_j} | {a_{i j_1 j_2 ⋯ j_N}}) ≡ Σ_{j_k ≥ 0 (k = 1, ..., N), j_1 + j_2 + ⋯ + j_N ≤ n} a_{i j_1 j_2 ⋯ j_N} x_1^{j_1} x_2^{j_2} ⋯ x_N^{j_N},   (3.2)

where {x_i} and {y_i} are the input and the output of the training data set, respectively, with N = 30 being the size of the input vector {x_i}, and {a_{i j_1 j_2 ⋯ j_N}} are the fitting parameters. For the optimization we employ the traditional least-square method with the loss function in Eq. (2.12) specified as mse, i.e., the mean square errors:

  ℓ_mse(y, y′) ≡ |y − y′|².   (3.3)

In this case the mapping is linear with respect to the fitting parameters, which means the problem is essentially the (multivariate) linear regression. Thus, we can find an analytical solution of the exact global minimum of the loss function (3.3) in this traditional method. Here we comment on the number of parameters in the polynomial model, which is given by the multiset coefficient (the number of monomials of N variables with total degree up to n): n_param = C(N+n, n). For N = 30 in the present problem, as long as n ≪ N, we can approximate n_param = O(N^n), which explodes as the degree n gets larger.
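As a reference for what such a polynomial baseline looks like in practice, here is a minimal sketch of the degree-2 polynomial least-squares regression of Eq. (3.2); scikit-learn is an assumption for illustration, not necessarily the implementation used in this work, and x_train, y_train, x_val are placeholders.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# x_train: (n_samples, 30) M-R inputs; y_train: (n_samples, 5) c_{s,i}^2 targets
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=True),  # all monomials up to degree n = 2
    LinearRegression(),                               # exact least-squares minimum of Eq. (3.3)
)
# poly_model.fit(x_train, y_train)
# y_pred = poly_model.predict(x_val)

# number of monomials of 30 variables with total degree <= 2, i.e. C(32, 2) = 496
print(PolynomialFeatures(degree=2).fit(np.zeros((1, 30))).n_output_features_)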
Explicit calculations read: n_param = 31 (n = 1), 496 (n = 2), 5456 (n = 3), 46376 (n = 4), etc. The drawback of the linear regression is a huge computational cost, which grows rapidly with n_param and the data size |D|.

In Sec. 3.3 we already observed two successful examples of the EoS reconstruction. For the other EoSs in the validation data, the reconstructed M-R curves agree well with the genuine ones. In this section we shall quantify the overall agreement, and for this purpose we enlarge the size of the validation data; namely, we randomly generate 1000 EoSs, among which 967 EoSs pass the two-solar-mass condition, and we carry out statistical analyses with these 967 EoSs.

We define the overall reconstruction accuracy by calculating the radius deviation, i.e., the distance between the solid and dashed lines as already shown in the right panel of Fig. 6:

  δR(M) ≡ R^(rec)(M) − R*(M),   (3.4)

where R^(rec)(M) and R*(M) are the reconstructed radius (solid) and the genuine original radius (dashed), respectively, at the mass M. Then, we estimate the root mean square (RMS) of radius deviations (3.4) using the 967 data at masses ranging from 0.6 M_⊙ to 1.8 M_⊙ with an interval of 0.2 M_⊙. We calculate δR(M) for both the NN and the polynomial regression to make a quantitative comparison.

Figure 9. Histograms of δR from the NN (upper) and the polynomial (lower) regression for (a) 0.6 M_⊙, (b) 1.0 M_⊙, and (c) 1.8 M_⊙.

The logarithmic histogram of δR(M) is plotted in Fig. 9 for M = 0.6 M_⊙ [as shown in (a)], 1.0 M_⊙ [as shown in (b)], and 1.8 M_⊙ [as shown in (c)]. In the plot of the logarithmic probability density the Gaussian distribution or the normal distribution takes a quadratic shape, so the δR histograms from both the NN and the polynomial regression have a strong peak at δR = 0 for all M. The tails are wider than the normal distribution, and the polynomial results (lower panels) exhibit even slightly wider tails than the NN results (upper panels).

We can look into this tendency in a more quantitative manner in Fig. 10. In the leftmost panel (a) of Fig. 10 we plot the RMS of δR(M), i.e., ΔR_RMS. We see that ΔR_RMS from the NN stays small, well below the size of the injected fluctuations, for a wide range of masses. This indicates that our NN method works surprisingly well; remember that the data points have random fluctuations by ΔR ∼ 0.5 km.
Figure 10. Overall performance measured with (a) ΔR_RMS, (b) ΔR_1σ, and (c) ΔR_2σ.

It should be noticed that, even without neutron stars below M_⊙ in the mock M-R data of our setup, the RMS of the corresponding radii are still reconstructed with a similar accuracy. One might think that this should be so simply because we use the SLy4 in the low energy region, but in view of Fig. 6, different EoSs lead to significantly different radii even in this low-mass region. Figure 10 (a) clearly shows that the NN results surpass the polynomial outputs with n = 2.

When the distribution has long tails, as seen in Fig. 9, the RMS may not fully characterize the data behavior. To quantify the deviation of the reconstruction more differentially, we can define quantile-based widths, ΔR_1σ and ΔR_2σ, corresponding to ≃ 68% and ≃ 95% of δR contained within the half widths, respectively. They can be explicitly expressed as

  ΔR_{ασ} ≡ (1/2) [ F_{δR}^{−1}((1 + erf(α/√2))/2) − F_{δR}^{−1}((1 − erf(α/√2))/2) ],   (3.5)

where F_{δR}^{−1}(p) is the quantile function of the δR distribution, i.e., the value of δR at the fraction p integrated from small δR. The error function, erf(α/√2) ≡ (1/√(2π)) ∫_{−α}^{α} dx e^{−x²/2}, is used to translate the statistical significance by ασ to the probability, e.g., erf(1/√2) ≃ 0.68 and erf(2/√2) ≃ 0.95. If δR obeys the normal distribution N(0, σ²), in general, ΔR_{ασ} = ασ follows. Therefore, we can use ΔR_{ασ}/α as alternatives to ΔR_RMS = σ. For more general distributions it is important to note that these quantile-based widths have a clear meaning as the 68% and 95% confidence intervals, unlike ΔR_RMS.

In Figs. 10 (b) and (c) we plot ΔR_1σ and ΔR_2σ. It is interesting that the ΔR_2σ results appear very different while the ΔR_1σ results are almost indistinguishable between the NN and the polynomial results. The polynomial results are consistently larger than the NN results. This behavior of ΔR_2σ implies that the polynomial results may be contaminated by outliers of wrong values far from the true answer. We can understand this by relating it to Runge's phenomenon of the polynomial approximation; higher order terms can cause large oscillation in the interpolation and the extrapolation. In particular, the extrapolation may easily break down by higher order terms that blow up for large inputs. In contrast, the NNs with the ReLU activation only contain up to the linear terms so that the outputs are stable.
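Since ΔR_{ασ} in Eq. (3.5) is defined through the quantile function, it is straightforward to evaluate from an ensemble of δR values; a minimal sketch with a toy Gaussian sample (the sample itself is illustrative, not a result of this work):

```python
import numpy as np
from math import erf, sqrt

def quantile_width(delta_r, alpha):
    """Half-width Delta R_{alpha sigma} of Eq. (3.5): equals alpha*sigma for a normal sample."""
    frac = erf(alpha / sqrt(2.0))                  # ~0.68 for alpha=1, ~0.95 for alpha=2
    hi = np.quantile(delta_r, 0.5 * (1.0 + frac))
    lo = np.quantile(delta_r, 0.5 * (1.0 - frac))
    return 0.5 * (hi - lo)

delta_r = np.random.default_rng(0).normal(0.0, 0.3, size=10_000)   # toy sample of delta R
print(np.sqrt(np.mean(delta_r**2)),        # Delta R_RMS
      quantile_width(delta_r, 1.0),        # Delta R_{1 sigma}
      quantile_width(delta_r, 2.0) / 2.0)  # Delta R_{2 sigma} / 2, an alternative to sigma
```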
4 EoS Estimation from the Real Observational Data

In the previous section we studied our methodology to infer the most likely EoS from observational data and quantitatively confirmed satisfactory performance in reproducing the correct results. Now, we shall apply the method to the real observational data and constrain the real neutron star EoS.

For practical purposes it is important to develop a way to estimate uncertainties around the most likely EoS within the framework of the machine learning method. As we have already discussed, the EoS inference from the M-R data is an ill-posed problem and the solution cannot be uniquely determined. Thus, the prediction from the NN method cannot avoid uncertain deviations from the true EoS. To employ a predicted EoS as a physical output, we need to quantify the uncertainty bands on top of the results. In this section we consider two different approaches for the uncertainty quantification.

For the real data we use the M-R relations of neutron stars determined by various X-ray observations. There are three sorts of sources adopted here: 8 neutron stars in quiescent low-mass X-ray binaries (qLMXBs) [2, 37, 38], 6 thermonuclear bursters with better constraints than the qLMXBs [2, 37], as well as a rotation-powered millisecond pulsar, PSR J0030+0451, with thermal emission from hot spots on the stellar surface [43, 44]. The data from the first two sources are tabulated in Refs. [2, 37, 38], especially in Fig. 4 of Ref. [2] (the data are distributed at http://xtreme.as.arizona.edu/NeutronStars/). The data from the last source were recently obtained by the NICER mission, and there are two independent analyses based on the same observation, among which we use the data given in Ref. [43]. All of these M-R relations are provided in the form of the Bayesian posterior distribution, i.e., a two-dimensional probability distribution on the R-M plane (see also Fig. 1 in Ref. [82] for a graphical representation of the probability distribution).

Ideally, with sufficient computational resources, the machine learning would be capable of directly dealing with such a two-dimensional probability distribution using, for example, the convolutional neural network (CNN). In the present work, however, we simplify our analysis by approximately characterizing a two-dimensional probability distribution with four parameters, following the prescription proposed in our previous publication [82]: the means and the variances of the marginal distributions with respect to M and R. More specifically, we project the two-dimensional distribution onto the one-dimensional M-axis (and R-axis) by integrating over R (and M, respectively). In this way we marginalize the two-dimensional distribution with respect to M and R into two one-dimensional distributions, which are eventually fitted by Gaussians with a mean and a variance.
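Reducing a two-dimensional posterior to the four parameters (mean and variance of M and of R) amounts to projecting it onto each axis and fitting a Gaussian. A minimal sketch follows, assuming the posterior is tabulated on an (R, M) grid; the variable names are illustrative, and the Gaussian fit is approximated here by the moments of the marginal.

```python
import numpy as np

def characterize_posterior(prob, r_grid, m_grid):
    """Reduce a 2D posterior P(R, M) to (mean_M, mean_R, sigma_M, sigma_R).

    prob: 2D array of probabilities on the (r_grid, m_grid) mesh,
          prob[i, j] ~ P(R=r_grid[i], M=m_grid[j]).
    """
    prob = prob / prob.sum()
    p_r = prob.sum(axis=1)          # marginalize over M
    p_m = prob.sum(axis=0)          # marginalize over R
    mean_r = np.sum(r_grid * p_r)
    mean_m = np.sum(m_grid * p_m)
    sigma_r = np.sqrt(np.sum((r_grid - mean_r) ** 2 * p_r))
    sigma_m = np.sqrt(np.sum((m_grid - mean_m) ** 2 * p_m))
    return mean_m, mean_r, sigma_m, sigma_r
```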
Figure 11. Schematic flow of the data generation procedure for the analysis in Sec. 4: (1) generate an EoS randomly, (2) obtain the M-R curve by solving the TOV equation, (3) sample 14 points on the M-R curve, each assigned random errors (σ_{M,i}, σ_{R,i}), and (4) shift the points within the assigned errors.

We should revise the data generation procedure previously illustrated in Fig. 4 in Sec. 3.1. Previously, for the mock data analysis, we assumed universal variances, σ_M and σ_R, for all ∆M_i and ∆R_i. In reality, however, they should vary for different i, i.e., σ_{M,i} and σ_{R,i} should correspond to the observational uncertainties of (M_i, R_i). To deal with the real observational data, the revised procedure for sampling the M-R points is sketched in Fig. 11. We need to design the NN with an input that includes σ_{M,i} and σ_{R,i}: the input variables are extended to (M_i, R_i; σ_{M,i}, σ_{R,i}).

We shall recapitulate the data generation scheme as follows. In the same way as in Sec. 3 we prepare 5 EoS parameters c_{s,i} on the output side. In this section the training data comprises 4 × 14 = 56 input parameters, i.e., (M_i, R_i; σ_{M,i}, σ_{R,i}) (i = 1, ..., 14). We note that i runs up to 14, not 15, corresponding to the number of observed neutron stars as explained in Sec. 4.1. We calculate the M-R curve for each EoS, and then select 14 points (M*_i, R*_i) on the M-R curve and add statistical fluctuations ∆M_i and ∆R_i [see Fig. 11 (3)]. Let us go into more detailed procedures now. Unlike σ_M and σ_R in Sec. 3, here we randomly generate σ_{M,i} and σ_{R,i} differently for each i = 1, ..., 14. These variances, σ_{M,i} and σ_{R,i}, are sampled from uniform distributions whose ranges, in view of the observational data, are sufficient to cover the realistic situations. Then, ∆M_i and ∆R_i are sampled according to Gaussian distributions with these variances, σ_{M,i} and σ_{R,i}. Finally we obtain the training data, (M_i = M*_i + ∆M_i, R_i = R*_i + ∆R_i; σ_{M,i}, σ_{R,i}) (i = 1, ..., 14) [see Fig. 11 (4)]. Hereafter we call these 14 tetrads of (M_i, R_i; σ_{M,i}, σ_{R,i}) an observation.

Now we prepare the training data set by taking multiple observations. For each EoS we randomly generate 100 different pairs of (σ_{M,i}, σ_{R,i}), and then we make another 100 observations for each pair of (σ_{M,i}, σ_{R,i}). From the former 100 the NN is expected to learn that the observational uncertainties may vary, and the latter tells the NN that the genuine M-R relation may deviate from the observational data. In total we make n_s = 10000 (= 100 × 100) "observations" per EoS. The size of the whole training data set is thus 100 times larger than before.

Table 2. Neural network architecture used in Sec. 4. In the input layer, 56 neurons correspond to the 14 input points of the mass, the radius, and their variances. The design of the other layers is kept the same as in Tab. 1.
  Layer index   Neurons   Activation function
  0             56        N/A
  1             60        ReLU
  2             40        ReLU
  3             40        ReLU
  4             5         tanh
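The augmented "observations" described above can be generated with a few lines of NumPy. The sketch below follows the structure of the text (per-point uniform variances, Gaussian shifts, 100 × 100 copies per EoS), but the upper limits of the uniform distributions are illustrative assumptions since the actual ranges are tuned to the real data.

```python
import numpy as np

rng = np.random.default_rng(42)
N_STARS = 14          # number of observed neutron stars (Sec. 4.1)
N_SIGMA_SETS = 100    # different uncertainty assignments per EoS
N_SHIFTS = 100        # observations per uncertainty assignment

# Illustrative upper limits of the uniform distributions for the variances.
SIGMA_M_MAX, SIGMA_R_MAX = 0.5, 2.0   # [M_sun], [km] -- assumptions

def make_observations(m_true, r_true):
    """Generate n_s = N_SIGMA_SETS * N_SHIFTS 'observations' for one EoS.

    m_true, r_true: arrays with the 14 genuine points (M*_i, R*_i).
    Returns an array of shape (n_s, 14, 4) with rows (M_i, R_i, sigma_M_i, sigma_R_i).
    """
    obs = []
    for _ in range(N_SIGMA_SETS):
        sig_m = rng.uniform(0.0, SIGMA_M_MAX, size=N_STARS)
        sig_r = rng.uniform(0.0, SIGMA_R_MAX, size=N_STARS)
        for _ in range(N_SHIFTS):
            m = m_true + rng.normal(0.0, sig_m)
            r = r_true + rng.normal(0.0, sig_r)
            obs.append(np.stack([m, r, sig_m, sig_r], axis=1))
    return np.array(obs)
```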
We modify the architecture of the NN used in this section accordingly. The number of neurons in the input layer becomes 56 (= 4 × 14); the performance test in Sec. 3 was done with 15 points, but we have numerically confirmed that the same level of performance is achieved with 14 points as well. Since we already know from the mock data analysis that the mass sorting does not affect the performance, we keep the mass ordering as generated randomly, unlike in Sec. 3.1. We normalize the input data as M_i/M_norm, R_i/R_norm, σ_{M,i}/σ_{M,norm}, and σ_{R,i}/σ_{R,norm} with M_norm = 3 M⊙, R_norm = 20 km, σ_{M,norm} = 1 M⊙, and σ_{R,norm} = 5 km. Aside from the input layer, the NN design of the other layers is chosen to be the same as before, as summarized in Tab. 2. For the NN optimization we adopt Adam, the same as in Sec. 3, but with a different mini-batch size, 1000. Incorporating the observational uncertainties makes the training data set larger than before, and we expect that a larger mini-batch size speeds up the training.
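The architecture of Tab. 2 translates directly into a Keras model. The sketch below is illustrative: only the layer sizes, activations, input normalization, and mini-batch size follow the text, while the loss function and optimizer hyperparameters shown here are placeholders (the actual loss is defined in Sec. 2.5).

```python
import numpy as np
import tensorflow as tf

M_NORM, R_NORM, SIG_M_NORM, SIG_R_NORM = 3.0, 20.0, 1.0, 5.0

def normalize(obs):
    """obs: array (..., 14, 4) of (M, R, sigma_M, sigma_R); returns a flat 56-dim input."""
    scaled = obs / np.array([M_NORM, R_NORM, SIG_M_NORM, SIG_R_NORM])
    return scaled.reshape(*obs.shape[:-2], 56)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(56,)),              # layer 0: 14 x (M, R, sigma_M, sigma_R)
    tf.keras.layers.Dense(60, activation="relu"),    # layer 1
    tf.keras.layers.Dense(40, activation="relu"),    # layer 2
    tf.keras.layers.Dense(40, activation="relu"),    # layer 3
    tf.keras.layers.Dense(5, activation="tanh"),     # layer 4: 5 EoS parameters
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")  # loss is a placeholder
# model.fit(x_train, y_train, batch_size=1000, epochs=...)
```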
Here we prescribe two independent methods to quantify uncertainties in the output EoS, based on different principles.

The first one utilizes the validation data, and the procedure is similar to that in Sec. 3.6. The basic idea is that, once we have trained the NN, we can evaluate the prediction accuracy from the validation data. We generate 100 samples of the validation data set whose input variances are chosen to be in accord with the real observational uncertainties as explained in Sec. 4.1. For the validation data we know the true M-R relation, so we can evaluate the deviation δR(M) as defined in Eq. (3.4). Using the whole validation data set, we calculate the root-mean-square deviation, ∆R_RMS, and regard it as the systematic error inherent in the NN approach. Later we show this uncertainty with the label "validation", as seen in Fig. 13.

The second one is the ensemble method in machine learning. This method is usually used to enhance the stability and the performance of the output predicted by NNs; here we repurpose it to quantify the uncertainty of the predicted output. The idea is concisely summarized in Fig. 12. In this method we set up multiple NNs independently: NN_1, NN_2, ..., NN_N, with N being the number of prepared NNs. We perform random sampling from D to generate different subsets, D_1, D_2, ..., D_N, and train each NN using the corresponding subset. This random sampling is commonly referred to as bootstrapping. After the training, by feeding the input data, each NN predicts output values, which we symbolically denote by ô_1, ô_2, ..., ô_N as in Fig. 12. Finally, aggregating the outputs from these multiple NNs, i.e., averaging all the outputs, we get the overall output ô = ⟨ô_i⟩ ≡ N⁻¹ Σ_i ô_i. Here we can also calculate the variance, ∆ô² = ⟨(ô_i − ô)²⟩, and regard it as the "uncertainty". This whole procedure, comprising bootstrapping and aggregating, is named bagging [96] (in the strict sense the term "bagging" refers to random sampling with replacement of the data; here we use "bagging" in a loose sense even without replacement). In this work we choose N = 10. If some regions of the EoS are insensitive to the M-R observation, the 10 independently trained NN models would lead to different EoSs in such unconstrained regions. From ∆ô, which quantifies how much the 10 output EoSs vary, we can estimate the uncertainty around the output ô. We use this bagging for most of the analyses, as shown by the band labeled "Bagging" in Fig. 13.

Figure 12. Illustration of the ensemble method for uncertainty quantification: the training data D is bootstrapped (randomly sampled) into subsets D_1, ..., D_N, each NN_1, ..., NN_N is trained on its own subset, and the outputs are aggregated by averaging; output ô = ⟨ô_i⟩, uncertainty ∆ô² = ⟨(ô_i − ô)²⟩.

We have introduced these two natural ways to quantify uncertainties, but our working prescriptions are still to be developed; in fact there is no established method yet, and a more systematic way for the uncertainty quantification might be possible. This is an interesting problem left for future research in the general context of machine learning.
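A compact sketch of the bagging procedure follows, assuming a `build_model()` constructor like the one shown earlier; N = 10 and sampling without replacement follow the text, while the subset fraction and training length are illustrative assumptions.

```python
import numpy as np

def bagging_predict(x_train, y_train, x_obs, build_model, n_models=10,
                    subset_fraction=0.8, epochs=10, seed=0):
    """Train n_models NNs on random subsets of the training data and aggregate
    their predictions; the spread of the outputs serves as the uncertainty."""
    rng = np.random.default_rng(seed)
    outputs = []
    n = len(x_train)
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subset_fraction * n), replace=False)  # no replacement
        model = build_model()
        model.fit(x_train[idx], y_train[idx], batch_size=1000, epochs=epochs, verbose=0)
        outputs.append(model.predict(x_obs, verbose=0))
    outputs = np.array(outputs)      # shape (n_models, n_obs, 5)
    o_hat = outputs.mean(axis=0)     # aggregated output
    delta_o = outputs.std(axis=0)    # spread = uncertainty estimate
    return o_hat, delta_o
```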
Figure 13. (a) EoSs deduced from the observational M-R data of qLMXBs and thermonuclear bursters. The shaded blue and hatched orange bands represent our 68% credibility bands from the validation and the bagging estimations. The χEFT prediction and the Bayesian results (Steiner et al. [35, 36] and Özel et al. [2, 37, 38]) are overlaid for reference; the former band represents 68% CL, while the latter shows the contour at e⁻¹ of the maximum likelihood. (b) M-R relations corresponding to the deduced EoSs from this work, with references to other approaches.

Figure 14. (a) Tidal deformability Λ calculated from our EoS, compared with the GW170817 constraint. (b) Correlation of the tidal deformabilities Λ_1 and Λ_2; see the text for details.

In Fig. 13 the orange line is the most likely neutron star EoS deduced from our NN approach. We estimated the uncertainty from the bagging (shown by the band with a hatch pattern labeled "Bagging") and from the validation (shown by the blue band labeled "Validation"). We plot bands from other works in the figure for comparison. The gray band represents an estimate from the χEFT calculation combined with the polytropic extrapolation and the two-solar-mass pulsar constraint [97] (labeled "χEFT+astro"). Because χEFT is an ab initio approach, any reasonable prediction should lie within the gray band, and indeed our results are found to be near the middle of the gray band. The preceding Bayesian analyses [2, 35-38] are also overlaid in Fig. 13. While Özel et al. [2, 37, 38] and our present analysis use the same astrophysical data, Steiner et al. [35, 36] employ a subset of the data, i.e., 8 of the X-ray sources. One may think that our prediction gives a tighter constraint than the others, but the narrowness of the band may be related to the implicit assumption in our EoS parametrization; we will come back to this point in Sec. 4.5 (see Fig. 17). Figure 13 (b) shows the M-R curves corresponding to the EoSs in (a). We see that our EoS (blue curve) certainly supports neutron stars with M > 2 M⊙ [30-33].
Figure 15. (a) EoSs deduced from the observational M-R data of qLMXBs, thermonuclear bursters [2, 37, 38], and the NICER data of PSR J0030+0451 [43]. The shaded blue and the hatched orange regions represent the 68% uncertainty bands (evaluated with the bagging method) for the analyses with and without the NICER data, respectively. (b) M-R relations corresponding to the deduced EoSs shown in (a).

Figures 14 (a) and (b) show the tidal deformability and the correlation of tidal deformabilities, respectively, for the binary neutron star merger GW170817. Once an EoS is given, the dimensionless tidal deformability, Λ, results from a quantity called the Love number k_2, which is derived from the Einstein equation under static linearized perturbations to the Schwarzschild metric due to external tidal fields. Practically, we solve a second-order ordinary differential equation in combination with the TOV equation; see Refs. [98, 99] for the explicit form of the equations. The blue band in Fig. 14 (a) represents Λ from the EoS inferred in the present work, which is consistent with the merger event GW170817 indicated by the red bar. In Fig. 14 (b) we show the correlation of the tidal deformabilities, Λ_1 of star 1 and Λ_2 of star 2, using the relation between Λ and M as given in Fig. 14 (a). The orange lines in Fig. 14 (b) refer to the constraints (solid: 90%, dashed: 50%) for which Λ_1 and Λ_2 are sampled independently [41], while the green lines refer to the constraints for which Λ_1 and Λ_2 are related through Λ_a(Λ_s, q) with Λ_a = (Λ_1 − Λ_2)/2, Λ_s = (Λ_1 + Λ_2)/2, and q = M_2/M_1. In Fig. 14 (b) we clearly see that our predicted band is located within the 90% contours of the LIGO-Virgo data [41, 100].

So far we have used only the observational data from the thermonuclear bursters and qLMXBs, which may contain large systematic errors related to the uncertain atmospheric modeling of neutron stars. The results in Figs. 15 (a) and (b) include the M-R constraint from the NICER mission as well. There are two independent analyses, as spotted in Fig. 15 (b), of the same observation of PSR J0030+0451 [43, 44], and we adopt the one [green bar in (b)] given in Ref. [43]. We see that the uncertainty becomes slightly larger upon the inclusion of the NICER data, which is attributed to the relatively large deviation of the NICER data from the others.

Figure 16. (Left) 100 output EoSs predicted from each bagging predictor, with first-order transitions highlighted by the thick orange lines. (Right) Histogram of the first-order phase transitions in each energy density region of the piecewise polytropes.
In the analyses presented so far we used the piecewise polytrope with 5 segments of density. Here, we change the number of segments from 5 to 7 and repeat the inference with the finer bins. There are mainly two issues argued in this subsection: the possibility of a weak first-order phase transition with the finer bins and its implication for the uncertainty quantification.

To this end we prepared N = 100 NNs in the bagging outlined in Sec. 4.3. We note that each NN is trained so as to predict an EoS in response to the real observational data; here we use the M-R data of qLMXBs and thermonuclear bursters without the NICER data. In the left panel of Fig. 16 we show the 100 EoSs predicted from the 100 independent NNs. We remind the reader that the activation function in the output layer is chosen to be tanh, which takes a value over [−1, 1]. When a predicted value is smaller than the threshold, c_s² < δ, we adjust it to c_s² = δ and identify a first-order phase transition. We found 44 predicted EoSs out of 100 that have a first-order phase transition. We highlight the regions of first-order phase transition with thick orange lines in Fig. 16. From this plot we can understand why we increased the number of segments: with the EoS parametrization of only 5 segments, weak first-order phase transitions are too strongly prohibited by the coarse discretization.

We also make a histogram in the right panel of Fig. 16 to show a breakdown of the EoS regions with a first-order phase transition. This histogram counts the number of first-order-transition EoSs in each energy density region. It is interesting that most of the first-order phase transitions are centered around the intermediate energy density region. On the one hand, in the lower energy region the first-order phase transition is less likely, and this tendency is consistent with the fact that a stronger first-order phase transition in a lower energy region is more disfavored by the two-solar-mass pulsar constraint [101]. In the higher energy region, on the other hand, there are also fewer EoSs with a first-order phase transition. One may think that a first-order phase transition would be more allowed in the higher energy region, but that is not the case in the NN analysis. In Sec. 3.4 we already discussed that the NN model tends to predict the most conservative value of c_s² in the high energy density regions where the constraints are inadequate. Therefore, the correct interpretation of the absence of the first-order transition in the high density regions, as seen in Fig. 16, should be not that our results exclude a first-order transition there, but merely that the observational data analyzed in our NN method do not favor a first-order transition there. Another artificial factor in the high energy density region is that our piecewise polytrope is equally spaced in the log scale, so the higher density segments have larger energy widths in the linear scale; the parametrization then does not have the resolution to represent a weak first-order phase transition in the high energy region. Here we also comment that our NN approach does not take account of the third-family scenario [102, 103], in which separate branches may appear in the M-R relation (see also Refs. [66, 104] for recent discussions).
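Identifying and counting the first-order segments across the 100 bagging predictors reduces to a simple thresholding of the predicted sound-speed parameters. A minimal sketch follows; the threshold value and the array layout are illustrative assumptions.

```python
import numpy as np

DELTA = 0.01  # threshold on c_s^2 below which a segment is treated as first order (assumption)

def first_order_histogram(cs2_predictions, delta=DELTA):
    """cs2_predictions: array (n_models, n_segments) of predicted c_s^2 values,
    here n_models = 100 bagging predictors and n_segments = 7 density bins.

    Returns (clipped values, per-segment counts of first-order transitions,
    number of EoSs with at least one first-order segment)."""
    cs2 = np.asarray(cs2_predictions, dtype=float)
    first_order = cs2 < delta                      # flag (nearly) vanishing sound speed
    cs2_clipped = np.where(first_order, delta, cs2)
    counts_per_segment = first_order.sum(axis=0)   # histogram over density segments
    n_first_order_eos = int(np.any(first_order, axis=1).sum())
    return cs2_clipped, counts_per_segment, n_first_order_eos
```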
Figure 17. (a) EoSs deduced from the NN inference. The red and the blue shaded bands represent the 68% credibility bands evaluated with the bagging method for the piecewise polytropes with 7 (fine binning) and 5 (coarse binning) segments, respectively. The other bands are the same as in Fig. 13. (b) M-R relations corresponding to the deduced EoSs and uncertainty bands in (a).

In Fig. 17 we show the EoS results with 7 segments (fine binning) after bagging. We see that the uncertainty is widened compared with the results with 5 segments (coarse binning). This is partially because the number of output parameters, c_{s,i} (i = 1, ..., 7), is increased while the amount of input data is unchanged, so the corresponding uncertainty should increase. Furthermore, the uncertainty is enhanced by the effect of the first-order phase transition: if there is a first-order phase transition, c_s² becomes (nearly) zero, and these zero values enlarge the variance of the output in the bagging and increase the uncertainty accordingly. Now the uncertainty appears comparable with the other Bayesian methods, as seen in Fig. 17. Our previous results with 5 segments, as shown in Ref. [82] and in Sec. 4.4 of this paper, had smaller uncertainty, which implies that the choice of 5 segments turns out to be optimal.

Finally we plot the speed of sound in Fig. 18. The bare output of the NN model actually comprises these values of c_{s,i}. It is clear from the plots that the speed of sound exceeds the conformal limit c_s² = 1/3; see Refs. [13, 105] for thorough discussions on the conformal limit. The inclusion of the NICER data only slightly widens the uncertainty, and changing the segment number has a more dominant effect on the uncertainty bands, as quantified in the left panel of Fig. 18.
Figure 18. (Left) The speed of sound from our EoSs with and without the NICER data, shown by the shaded blue and the hatched orange regions. (Right) The speed of sound from the fine-binning (shaded blue) and the coarse-binning (hatched orange) estimates.

Figure 19. A concrete shape of f(x) of our choice (blue curve) and typical training data (orange dots) for n_base = 20 and n_s = 5.

In Sec. 3.2 we observed a quantitative difference between the learning curves for the training data sets with and without the data augmentation by n_s = 100, as demonstrated in Fig. 5. It is then a natural anticipation that this observational data augmentation may help to overcome the problems of local-minimum trapping and overfitting that we often meet during the NN training. This section is aimed at numerical experiments to understand the behavior of the learning curve and the role of n_s therein. In particular, we focus on the overfitting problem here (we also tested the performance in escaping from the local-minimum trapping, but the data augmentation does not seem useful to resolve the trapping problem).

Table 3. Literal correspondence between the neutron star calculation (mock data analysis) and the f(x) fitting.
  Neutron star (i = 1, ..., 15)                              f(x) fitting
  Input: (M_i, R_i)                                          x
  Output: EoS                                                f(x)
  Training data (input): (M*_i + ∆M_i, R*_i + ∆R_i)          x* + ∆x
  Number of training data (output): 1948                     n_base = 20
  Number of training data (input): 1948 n_s                  n_base · n_s = 20 n_s

Here we consider a specific and simple problem as a toy model. We choose the following double-peaked function f(x):

  f(x) = (10 x^… − x^… + x^…) e^{−x}   (5.1)

in the domain x ∈ [0, …]. A concrete shape of this f(x) is depicted in Fig. 19 (the choice of f(x) is motivated by spectral functions which often appear in lattice-QCD calculations). In Tab. 3 we make a comparison between the neutron star calculation and this toy-model study of the f(x) fitting.

For the training data, we first generate a set of the "true" inputs, denoted by x*, in the above interval and calculate the corresponding outputs, y* = f(x*). We denote this base data set as T* = {(x*_p, y*_p)}_{p=1}^{n_base} ⊂ R², with n_base = |T*| being the size of the base data set. Then, similarly to the neutron star case, we augment the training data set by duplicating n_s copies of the base data with the input fluctuated by an "observational uncertainty" ∆x. Here, we set the distribution of ∆x to be the normal distribution with the standard deviation fixed as σ(x*) = 0.… x* + 0.… of our choice (with this choice of σ(x*) the uncertainty is smaller in the x region with the narrow peak; for demonstration purposes the toy model is designed such that the problem is not too easy but not intractable). Then, the whole training data set can be expressed as

  T = { (x*_p + ∆x_{p,i}, y*_p) | p ∈ {1, ..., n_base}, i ∈ {1, ..., n_s}, ∆x_{p,i} ∼ N(0, σ(x*_p)²) }.   (5.2)

The size of the training data set T is |T| = n_s n_base. The trivial case with n_s = 1 corresponds to the naive training data set without augmentation, while n_s > 1 augments the training data set. Figure 19 exemplifies the training data set with n_base = 20 and n_s = 5.
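The augmented toy data set of Eq. (5.2) can be generated in a few lines. In the sketch below the stand-in function and the coefficients of σ(x*) are illustrative assumptions; the actual f(x) is the polynomial-times-exponential form of Eq. (5.1), whose exact coefficients are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Stand-in double-peaked function (two Gaussian bumps), only mimicking the
    qualitative shape of the actual toy model f(x) in Eq. (5.1)."""
    return np.exp(-((x - 2.0) / 0.8) ** 2) + 0.6 * np.exp(-((x - 6.0) / 1.2) ** 2)

def sigma(x_true, a=0.05, b=0.05):
    """Input uncertainty growing linearly with x*; the coefficients are illustrative."""
    return a * x_true + b

def make_training_set(n_base=20, n_s=5, x_max=10.0):
    """Augmented training set T of Eq. (5.2): n_s noisy copies of the base set."""
    x_true = rng.uniform(0.0, x_max, size=n_base)
    y_true = f(x_true)
    x_aug = np.concatenate([x_true + rng.normal(0.0, sigma(x_true))
                            for _ in range(n_s)])
    y_aug = np.tile(y_true, n_s)   # outputs stay at the true values y* = f(x*)
    return x_aug, y_aug
```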
Since the task we deal with here is idealized and as simplified as possible, we can reasonably keep the NN architecture simple and relatively small (albeit deep). In the present study we basically use two types, NN1221 and NN1991, as tabulated in Tab. 4. We also choose the simplest loss function here, the mean absolute error (mae):

  ℓ_abs(y, y′) ≡ |y − y′|.   (5.3)

Table 4. NN architectures used in this work: NN1221 (left) and NN1991 (right). We have chosen the leaky ReLU, f(x) = αx (x < 0) and x (x ≥ 0), to avoid the dying-ReLU problem. We fixed the leaky-ReLU parameter α = 0.….
  NN1221                                    NN1991
  Layer index  Neurons  Activation          Layer index  Neurons  Activation
  0            1        N/A                 0            1        N/A
  1            2        Leaky ReLU          1            9        Leaky ReLU
  2            2        Leaky ReLU          2            9        Leaky ReLU
  3            1        Linear              3            1        Linear

To deepen our understanding of the data augmentation, we argue that including noise in the input is equivalent to introducing fluctuations in the NN parameters of the first layer. For the moment we do not limit ourselves to the special toy model of Eq. (5.1) but develop a general formulation using the schematic notation introduced in Sec. 2.4. We consider the case that a noise ẽ is added to the input values: x̃ = x + ẽ. The noise is assumed to be independent of the input value, i.e., Pr(x, ẽ) = Pr(x) Pr(ẽ). In this case the input noise ẽ can be absorbed by a redefinition of the first-layer parameters; that is,

  x^(1) = σ^(1)( W^(1)(x + ẽ) + b^(1) ) = σ^(1)( W^(1) x + b̃^(1) ),   (5.4)

where b̃^(1) ≡ b^(1) + W^(1) ẽ. Therefore, we find

  f(x̃; W^(1), b^(1), ..., W^(L), b^(L)) = f(x; W^(1), b̃^(1), ..., W^(L), b^(L)).   (5.5)

The loss function with the input noise is given by

  L({W^(ℓ), b^(ℓ)}_ℓ) = ∫ dx dẽ Pr(x, ẽ) ℓ( y, f(x̃; W^(1), b^(1), ..., W^(L), b^(L)) )
    = ∫ dẽ Pr(ẽ) ∫ dx Pr(x) ℓ( y, f(x; W^(1), b̃^(1), ..., W^(L), b^(L)) )
    = ∫ dẽ Pr(ẽ) L( W^(1), b^(1) + W^(1) ẽ, ..., W^(L), b^(L) ).   (5.6)

This is nothing but the loss function with fluctuations in the first-layer biases b^(1); in summary, the noise ẽ in the input is translated into the fluctuation W^(1) ẽ in the NN parameter b^(1). Once we realize this correspondence, we can generalize this prescription of fluctuating NN parameters not only to the first-layer b^(1) but to all the parameters {W^(ℓ), b^(ℓ)}_ℓ. We will test this idea of generalization later.

We can further relate the parameter fluctuations to a standard NN technique called dropout. Instead of adding random fluctuations to the parameters (additive parameter noise), we can also think of multiplying random fluctuations onto the parameters (multiplicative parameter noise), i.e., W^(ℓ)_{ij} → W̃^(ℓ)_{ij} = W^(ℓ)_{ij} ẽ^(ℓ)_{ij} and b^(ℓ)_i → b̃^(ℓ)_i = b^(ℓ)_i ẽ^(ℓ)_i for a fixed ℓ, where ẽ^(ℓ)_{ij} and ẽ^(ℓ)_i are random fluctuations following a certain distribution. As a special case, let us assume that ẽ^(ℓ)_{ij} and ẽ^(ℓ)_i follow the simplest discrete probability distribution, the Bernoulli distribution B(1, p), where 0 and 1 appear with probabilities p and 1 − p, respectively. With the additional constraint ẽ^(ℓ)_{ij} = ẽ^(ℓ)_i, applying this multiplicative parameter noise is exactly the procedure commonly referred to as "dropout", with the dropout rate identified as p.

In this way we can understand the parameter noises as a generalization of the data augmentation by noise injection and of the dropout. It is already well known that the dropout is an established tool to avoid the overfitting problem. Thus, we can expect that our observational data augmentation, which was originally needed to handle the observational uncertainties, has an effect similar to the dropout and can tame the overfitting behavior.
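The identity (5.5) can be checked numerically for any small feedforward network: shifting the input by ẽ is the same as shifting the first-layer bias by W^(1) ẽ. A minimal NumPy sketch with leaky-ReLU activations follows; the layer sizes mimic NN1991, and all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def leaky_relu(z, alpha=0.01):
    return np.where(z < 0.0, alpha * z, z)

# A small 1-9-9-1 network with random parameters (NN1991-like layout).
W1, b1 = rng.normal(size=(9, 1)), rng.normal(size=9)
W2, b2 = rng.normal(size=(9, 9)), rng.normal(size=9)
W3, b3 = rng.normal(size=(1, 9)), rng.normal(size=1)

def forward(x, b1_used):
    h1 = leaky_relu(W1 @ np.atleast_1d(x) + b1_used)
    h2 = leaky_relu(W2 @ h1 + b2)
    return W3 @ h2 + b3                         # linear output layer

x, e = 1.3, 0.2                                 # input and input noise
lhs = forward(x + e, b1)                        # noisy input, original bias
rhs = forward(x, b1 + W1 @ np.atleast_1d(e))    # clean input, shifted bias, Eq. (5.5)
print(np.allclose(lhs, rhs))                    # True
```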
Motivated by the relations between the data augmentation and the dropout through parameter noise, as argued in the previous subsection, we shall make a numerical comparison involving all of the data augmentation, the NN parameter noise, and the dropout using the toy model defined in Eq. (5.1). For the parameter noise, we consider additive noise drawn from the normal distribution N(0, σ²) either in the first-layer parameters, (W^(1), b^(1)), or in all the parameters, {W^(ℓ), b^(ℓ)}_ℓ. To manipulate this special type of training, we implement the NN and its optimization from scratch in C++. As an ideal test case we want a "troublesome" NN suffering from the overfitting problem. We have first verified that the NN method with NN1221 in Tab. 4 can reasonably solve the reconstruction problem of f(x), while NN1991 encounters the overfitting problem. In what follows we present the results from NN1991 to analyze the overfitting in detail. We fix n_base = 20 and the mini-batch size to 10. We repeat the independent training procedure with the same setup 100 times and take the average of the loss function to draw the learning curves with their probabilistic distribution.

In Fig. 20 we show the averaged learning curves for several choices of n_s to check the effect of the data augmentation. The mean values of the loss function ⟨L(t)⟩ are represented by the dashed (training) and the solid (validation) curves, along with the standard error band, ∆⟨L(t)⟩, for the validation loss. Here the (population) mean is estimated by the sample mean over the 100 trainings, i.e., ⟨L(t)⟩ ≃ L̄(t) = (1/N) Σ_{i=1}^{N} L_i(t) with N = 100, where t is the training time in units of the net number of processed mini-batches and i is the index of the training. We stress that this process with N = 100 is not the bagging; we do so to quantify the performance on average. We also note that the learning curves of the validation loss fluctuate, and the standard deviation ∆L(t) [as shown in Fig. 20 (b)] is much larger than the standard error band [as read from Fig. 20 (a)], which quantifies the typical deviation of the sample mean L̄(t) from the "true" population mean ⟨L(t)⟩. Here, the standard deviation and the standard error are estimated with the unbiased sample variance:

  ∆L(t) ≡ √( ⟨[L(t) − ⟨L(t)⟩]²⟩ ) ≃ √( (1/(N−1)) Σ_{i=1}^{N} [L_i(t) − L̄(t)]² ),   (5.7)
  ∆⟨L(t)⟩ ≡ √( ⟨[L̄(t) − ⟨L(t)⟩]²⟩ ) = (1/√N) ∆L(t).   (5.8)

The second relation can be intuitively understood from the fact that the distribution of the sample mean L̄(t) narrows with the variance (∆L(t))²/N by the central limit theorem.

Figure 20. (a) Numerical experiment of increasing n_s. The mean values of the loss functions are represented by the dashed and the solid curves for the training and the validation losses, respectively. The bands represent the standard errors. (b) The same plot with the bands representing the standard deviations.

Figure 21. Histogram showing the distribution over the 100 trainings of the final validation loss at t = 10^….
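The statistics of Eqs. (5.7) and (5.8) over the repeated trainings can be computed directly from the stored learning curves. A minimal sketch, assuming the validation losses are collected in a two-dimensional array:

```python
import numpy as np

def learning_curve_statistics(losses):
    """losses: array of shape (N, T) with the validation loss L_i(t) of the
    i-th independent training at each monitoring time t.

    Returns the sample mean (estimate of <L(t)>), the unbiased standard
    deviation Delta L(t) of Eq. (5.7), and the standard error of Eq. (5.8)."""
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[0]
    mean = losses.mean(axis=0)
    std = losses.std(axis=0, ddof=1)   # unbiased estimator, 1/(N-1)
    stderr = std / np.sqrt(n)          # width of the sample-mean distribution
    return mean, std, stderr
```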
As we speculated before, the results in Fig. 20 (a) evidence that incorporating the data augmentation with n_s > 1 reduces the overfitting behavior (i.e., the validation loss increasing while the training loss decreases). Obviously the learning curve for n_s = 1 indicates the overfitting. For a sufficiently large n_s ≳ 10, the validation loss no longer increases and saturates towards a certain value. The improvement for n_s ≳ 10 can be understood from the size of the data set, n_s n_base ≳ 10 · 20 = 200, which is larger than the number of the NN parameters, i.e., 118 for NN1991. When n_s is small, the size of the data set is too small compared to the number of NN parameters, so that the NN parameters cannot be well constrained. Even for n_s = 100 there are still discrepancies between the training and the validation losses, as seen in Fig. 20 (a), which can be explained as follows: the size of the independent training base data with n_base = 20 is too small. The data augmentation does not supplement any additional information on the training data T; the contained information is the same as in the base data set T*. Therefore, the improvement should eventually saturate with increasing n_s.

The standard deviation in Fig. 20 (b) at early times is large, reflecting the variance of the initial NN parameters. The variance of the initial parameters quickly disappears around t ≲ …. An interesting observation is that the standard deviation becomes large again when the overfitting occurs, and its magnitude is of the same order as the loss increase due to the overfitting. This means that in some trials of training the performance does not necessarily get worse even on the overfitting time scale. The actual distribution of the final validation loss at t = 10^… is shown in Fig. 21 in the form of a histogram. We can see that the validation loss spreads to larger values in the overfitting case (n_s = 1), while the validation loss with the data augmentation (n_s = 10 and 100) is well localized around L ≃ 0.….

In Fig. 22 we next check the effect of noise in the NN parameters. We introduce additive parameter noise of the normal distribution N(0, σ²) with several variances, σ = 0, 0.2, and 0.4, in the first-layer parameters (W^(1), b^(1)). From Fig. 22 we can confirm the same trend as for the observational data augmentation, and indeed the overfitting is evaded similarly. We have tried the different values σ = 0.2 and 0.4, but they are consistent with each other within the standard errors. We can also observe behavior of the standard deviation similar to the case of the data augmentation.

Figure 22. (a) Numerical experiment of adding noise to (W^(1), b^(1)). (b) The same plot with the bands representing the standard deviations.

We also compare different ways of introducing noise in the NN parameters. We consider the following types of parameter noise, compared in Fig. 23 (a minimal sketch of these variants follows the list below):
(i) Add random values sampled from the normal distribution N(0, σ²) with σ = 0.2 to the first-layer parameters (W^(1), b^(1)). The parameter noise is resampled for each sample point in the training data ["Pointwise Noised (W^(1), b^(1))" in Fig. 23].

(ii) Add random values in the same way as (i), but resample the parameter noise every time a mini-batch starts; a common set of parameter-noise values is used throughout a mini-batch ["Batchwise Noised (W^(1), b^(1))" in Fig. 23].

(iii) Add random values sampled from the normal distribution N(0, σ²) with σ = 0.2 to all the NN parameters (W^(ℓ), b^(ℓ)). The parameter noise is resampled every time a mini-batch starts, as in (ii) ["Batchwise Noised (W^(ℓ), b^(ℓ))" in Fig. 23].

We find that the performance is slightly improved for the batchwise noise rather than the pointwise noise. Also, better performance, i.e., smaller validation losses, is reached for the all-layer noise as prescribed in (iii) rather than for the first-layer noise as in (i) and (ii).

Figure 23. (a) Numerical experiment of adding noise in different ways; see the text for more details. (b) The same plot with the bands representing the standard deviations.

Figure 24. (a) Numerical experiment of inserting a dropout. (b) The same plot with the bands representing the standard deviations.
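The distinction between the variants above lies only in where the noise is resampled. The schematic fragment below illustrates this for one mini-batch; it does not reproduce the authors' C++ implementation, and `forward(params, x)` is an assumed user-supplied network evaluation.

```python
import numpy as np

rng = np.random.default_rng(3)

def add_noise(params, sigma):
    """Return a noisy copy of a list of parameter arrays (additive Gaussian noise)."""
    return [p + rng.normal(0.0, sigma, size=p.shape) for p in params]

def noisy_minibatch_pass(params, x_batch, forward, mode="batchwise", sigma=0.2):
    """Forward passes over one mini-batch with additive parameter noise.

    mode='batchwise': one noise realization shared by the whole mini-batch (variants ii, iii);
    mode='pointwise': noise resampled for every sample point (variant i).
    """
    if mode == "batchwise":
        noisy = add_noise(params, sigma)
        return np.array([forward(noisy, x) for x in x_batch])
    elif mode == "pointwise":
        return np.array([forward(add_noise(params, sigma), x) for x in x_batch])
    raise ValueError(mode)
```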
Figure 25. (a) Summary of the numerical experiment with the standard error bands. (b) The same plot with the bands representing the standard deviations.

Finally, we plot the results for the case with a dropout in the NN in Fig. 24. Here again, we can validate the role of the dropout in overcoming the overfitting problem. We have tried several different dropout rates, namely 10%, 25%, and 50%, and numerically confirmed that the performance is better for small values of the dropout rate in the current setup.

Figure 25 is the summary plot of the performance comparison between the three different approaches, namely the observational data augmentation, the parameter noise, and the dropout. The parameter-noise and the dropout results are close to each other and almost identical within the errors. Among these three approaches, it turns out that the best performance is realized by the observational data augmentation in the present toy-model setup. The best strategy can be different under different model circumstances; the important point is not which is the best, but the finding that the data augmentation can tame the overfitting problem at the same level as the standard procedure of the dropout. This is a very interesting observation. In physics all observational data have uncertainties, and the data are intrinsically augmented by repeated measurements with limited accuracy. Then, for physics problems, this nature of the "observational data augmentation" can naturally evade the overfitting problem by itself.
In this work, we have performed a comprehensive analysis of inferring the equation of state (EoS) from the observed masses and radii of neutron stars using the previously proposed method of deep machine learning [81, 82]. Following the method elaborated in Sec. 2, we have subdivided our analyses into three parts: the examination of the deep learning methodology using mock data mimicking the real observations (Sec. 3), the application of the method to the actual observational data of neutron stars (Sec. 4), and the discussion of our observational data augmentation by noise fluctuations in a simple setup idealized for theorization (Sec. 5).

In the first part (Sec. 3) we have demonstrated how our method works by presenting two typical examples of the EoS prediction. Then we have quantified the uncertainty, or error, between the prediction and the original value as given in Eq. (3.1) and inspected the prediction reliability for the whole validation data set. The upshot is that, by looking at the histogram and the correlation between the prediction uncertainties in different energy density regions, we have verified that the EoS reconstruction fares well in the lower density regions, while the higher density regions are not sufficiently constrained. We have also compared the NN method to a more conventional approach, i.e., the polynomial regression, and quantified the results in terms of ∆R_RMS, ∆R_{1σ}, and ∆R_{2σ} as given in Eqs. (3.4) and (3.5). We have reached the conclusion that our NN method leads to more robust outputs, while the polynomial regression may have outliers that damage the overall performance.

In the second part (Sec. 4) we have applied the EoS inference to real neutron star data from various sources, namely qLMXBs, thermonuclear bursters, and the pulsar with hot spots on the surface at which the NICER mission aims. We have also introduced natural and intuitive ways of uncertainty quantification. One is based on the validation loss and the other on the procedure called bagging, for which we trained multiple NNs independently and took the average. We have utilized the independent outputs from the multiple NNs to discuss the likelihood of EoSs with a first-order phase transition, and made a histogram counting the rate of EoSs with a first-order transition in each energy density segment.

In the third part (Sec. 5) we have brought to light a byproduct of our prescription for incorporating observational uncertainty. The training data is augmented with fluctuations corresponding to the observational uncertainty, which we call the observational data augmentation. The data augmentation with n_s fluctuating copies has turned out to tame the overfitting problem. We have explicitly confirmed this overfitting reduction by monitoring the time evolution of the validation loss.

Although we have carried out extensive analyses in this work, there is still a lot to be done in the future to improve the applicability of our method to the neutron star EoS inference. Since our NN method reliably predicts the most likely EoS in the lower density region, it would be a useful supplement to provide information on the TOV limiting mass (the heaviest observed mass) directly in the NN input. At present we have incorporated this information only indirectly, by discarding the training data that do not reach the TOV limit.
Also, our NN model currently copes only with pairs of mass-radius data, but it would be an important next step to extend the NN model so as to process other forms of data, e.g., the tidal deformability.

Currently we employ a feedforward network, and it is a natural future extension to upgrade the NN architecture. For instance, we can replace the NN with a convolutional neural network (CNN) as frequently used in image processing. With a CNN we could make full use of the two-dimensional Bayesian posterior distribution for the input observable data. On top of the improvement in the NN part, we can also think of another extension toward a better EoS parametrization. It would be possible to use parametrizations other than the piecewise polytropes, such as the spectral representation [106, 107] and the quarkyonic parametrization [56]. In this work we joined our predicted EoS above the matching density with the SLy4 below it; the dependence on the matching density and/or the choice of the nuclear EoS should be systematically examined. Our naive expectation is that none of these improvements would affect the results drastically.

The uncertainty quantification is another interesting issue. In this work we have used practical but oversimplified techniques, namely the bagging and the validation loss. More rigorous treatments of uncertainties may be possible by adopting the CNN as mentioned above, or by combining the Bayesian method and the NN approach. Using such a Bayesian NN, in which the network weights are sampled from a probability distribution, we can obtain the posterior probabilities of the outputs and thus capture the NN model uncertainties correctly. One efficient way to implement this computation is to set up a dropout as outlined in Refs. [108, 109] (see also Ref. [110] for physics applications). In analogy to the explanation in Sec. 5.2, we can regard a dropout as a Bayesian inference with the likelihood functions set to the Bernoulli distribution B(1, p). All of the above-mentioned extensions deserve further investigations.

Acknowledgments
This work was supported by JSPS KAKENHI Grant No. 20J10506 (YF) and Grant Nos. 18H01211, 19K21874 (KF).
References

[1] J.M. Lattimer,
The nuclear equation of state and neutron star masses , Ann. Rev. Nucl.Part. Sci. (2012) 485 [ ].[2] F. Özel and P. Freire, Masses, Radii, and Equation of State of Neutron Stars , Ann. Rev.Astron. Astrophys. (2016) 401 [ ].[3] G. Baym, T. Hatsuda, T. Kojo, P.D. Powell, Y. Song and T. Takatsuka, From hadrons toquarks in neutron stars: a review , Rept. Prog. Phys. (2018) 056902 [ ].[4] L. Baiotti, Gravitational waves from neutron star mergers and their relation to the nuclearequation of state , Prog. Part. Nucl. Phys. (2019) 103714 [ ].[5] T. Kojo,
QCD equations of state and speed of sound in neutron stars , 11, 2020[ ].[6] K. Hebeler and A. Schwenk,
Chiral three-nucleon forces and neutron matter , Phys. Rev. C (2010) 014314 [ ].[7] S. Gandolfi, J. Carlson and S. Reddy, The maximum mass and radius of neutron stars andthe nuclear symmetry energy , Phys. Rev. C (2012) 032801 [ ].[8] I. Tews, T. Krüger, K. Hebeler and A. Schwenk, Neutron matter atnext-to-next-to-next-to-leading order in chiral effective field theory , Phys. Rev. Lett. (2013) 032504 [ ].[9] J.W. Holt, N. Kaiser and W. Weise,
Nuclear chiral dynamics and thermodynamics, Prog. Part. Nucl. Phys. (2013) 35 [ ].
[10] G. Hagen, T. Papenbrock, A. Ekström, K. Wendt, G. Baardsen, S. Gandolfi et al.,
Coupled-cluster calculations of nucleonic matter , Phys. Rev. C (2014) 014319[ ].[11] A. Roggero, A. Mukherjee and F. Pederiva, Quantum Monte Carlo calculations of neutronmatter with non-local chiral interactions , Phys. Rev. Lett. (2014) 221103 [ ].[12] G. Wlazł owski, J. Holt, S. Moroz, A. Bulgac and K. Roche,
Auxiliary-Field QuantumMonte Carlo Simulations of Neutron Matter in Chiral Effective Field Theory , Phys. Rev.Lett. (2014) 182503 [ ].[13] I. Tews, J. Carlson, S. Gandolfi and S. Reddy,
Constraining the speed of sound insideneutron stars with chiral effective field theory interactions and observations , Astrophys. J. (2018) 149 [ ].[14] C. Drischler, R. Furnstahl, J. Melendez and D. Phillips,
How well do we know theneutron-matter equation of state at the densities inside neutron stars? A Bayesian approachwith correlated uncertainties , .[15] C. Drischler, J. Holt and C. Wellenhofer, Chiral Effective Field Theory and theHigh-Density Nuclear Equation of State , .[16] B.A. Freedman and L.D. McLerran, Fermions and Gauge Vector Mesons at FiniteTemperature and Density. 1. Formal Techniques , Phys. Rev.
D16 (1977) 1130.[17] B.A. Freedman and L.D. McLerran,
Fermions and Gauge Vector Mesons at FiniteTemperature and Density. 3. The Ground State Energy of a Relativistic Quark Gas , Phys.Rev.
D16 (1977) 1169.[18] V. Baluni,
Nonabelian Gauge Theories of Fermi Systems: Chromotheory of HighlyCondensed Matter , Phys. Rev.
D17 (1978) 2092.[19] A. Kurkela, P. Romatschke and A. Vuorinen,
Cold Quark Matter , Phys. Rev.
D81 (2010)105021 [ ].[20] E.S. Fraga, A. Kurkela and A. Vuorinen,
Interacting quark matter equation of state forcompact stars , Astrophys. J. (2014) L25 [ ].[21] T. Gorda, A. Kurkela, P. Romatschke, M. Säppi and A. Vuorinen,
Next-to-Next-to-Next-to-Leading Order Pressure of Cold Quark Matter: Leading Logarithm , Phys. Rev. Lett. (2018) 202701 [ ].[22] J. Ghiglieri, A. Kurkela, M. Strickland and A. Vuorinen,
Perturbative Thermal QCD:Formalism and Applications , Phys. Rept. (2020) 1 [ ].[23] Y. Fujimoto and K. Fukushima,
Equation of state of cold and dense QCD matter inresummed perturbation theory , .[24] G. Aarts, Introductory lectures on lattice QCD at nonzero baryon number , J. Phys. Conf.Ser. (2016) 022004 [ ].[25] A. Akmal, V.R. Pandharipande and D.G. Ravenhall,
The Equation of state of nucleonmatter and neutron star structure , Phys. Rev.
C58 (1998) 1804 [ nucl-th/9804027 ].[26] H. Togashi, K. Nakazato, Y. Takehara, S. Yamamuro, H. Suzuki and M. Takano,
Nuclearequation of state for core-collapse supernova simulations with realistic nuclear forces , Nucl.Phys.
A961 (2017) 78 [ ].
[27] F. Douchin and P. Haensel,
A unified equation of state of dense matter and neutron starstructure , Astron. Astrophys. (2001) 151 [ astro-ph/0111092 ].[28] B.D. Serot and J.D. Walecka,
Recent progress in quantum hadrodynamics , Int. J. Mod.Phys. E (1997) 515 [ nucl-th/9701058 ].[29] M. Drews and W. Weise, Functional renormalization group studies of nuclear and neutronmatter , Prog. Part. Nucl. Phys. (2017) 69 [ ].[30] P. Demorest, T. Pennucci, S. Ransom, M. Roberts and J. Hessels, Shapiro DelayMeasurement of A Two Solar Mass Neutron Star , Nature (2010) 1081 [ ].[31] E. Fonseca et al.,
The NANOGrav Nine-year Data Set: Mass and Geometric Measurementsof Binary Millisecond Pulsars , Astrophys. J. (2016) 167 [ ].[32] J. Antoniadis et al.,
A Massive Pulsar in a Compact Relativistic Binary , Science (2013) 6131 [ ].[33] H.T. Cromartie et al.,
Relativistic Shapiro delay measurements of an extremely massivemillisecond pulsar , Nature Astron. (2019) 72 [ ].[34] F. Özel, G. Baym and T. Guver, Astrophysical Measurement of the Equation of State ofNeutron Star Matter , Phys. Rev.
D82 (2010) 101301 [ ].[35] A.W. Steiner, J.M. Lattimer and E.F. Brown,
The Equation of State from Observed Massesand Radii of Neutron Stars , Astrophys. J. (2010) 33 [ ].[36] A.W. Steiner, J.M. Lattimer and E.F. Brown,
The Neutron Star Mass-Radius Relation andthe Equation of State of Dense Matter , Astrophys. J. (2013) L5 [ ].[37] F. Ozel, D. Psaltis, T. Guver, G. Baym, C. Heinke and S. Guillot,
The Dense MatterEquation of State from Neutron Star Radius and Mass Measurements , Astrophys. J. (2016) 28 [ ].[38] S. Bogdanov, C.O. Heinke, F. Özel and T. Güver,
Neutron Star Mass-Radius Constraints ofthe Quiescent Low-mass X-ray Binaries X7 and X5 in the Globular Cluster 47 Tuc , Astrophys. J. (2016) 184 [ ].[39] M.C. Miller,
Astrophysical Constraints on Dense Matter in Neutron Stars , .[40] M.C. Miller and F.K. Lamb, Observational Constraints on Neutron Star Masses and Radii , Eur. Phys. J. A (2016) 63 [ ].[41] Virgo, LIGO Scientific collaboration,
GW170817: Observation of Gravitational Wavesfrom a Binary Neutron Star Inspiral , Phys. Rev. Lett. (2017) 161101 [ ].[42]
LIGO Scientific, Virgo collaboration,
GW190425: Observation of a Compact BinaryCoalescence with Total Mass ∼ . M (cid:12) , Astrophys. J. Lett. (2020) L3 [ ].[43] T.E. Riley et al., A N ICER
View of PSR J0030+0451: Millisecond Pulsar ParameterEstimation , Astrophys. J. Lett. (2019) L21 [ ].[44] M. Miller et al.,
PSR J0030+0451 Mass and Radius from
N ICER
Data and Implicationsfor the Properties of Neutron Star Matter , Astrophys. J. Lett. (2019) L24 [ ].[45] K. Yagi and N. Yunes,
I-Love-Q , Science (2013) 365 [ ].[46] K. Yagi and N. Yunes,
I-Love-Q Relations in Neutron Stars and their Applications toAstrophysics, Gravitational Waves and Fundamental Physics , Phys. Rev.
D88 (2013) 023009 [ ].
[47] K. Yagi and N. Yunes,
Approximate Universal Relations for Neutron Stars and QuarkStars , Phys. Rept. (2017) 1 [ ].[48] E. Annala, T. Gorda, A. Kurkela, J. Nättilä and A. Vuorinen,
Evidence for quark-mattercores in massive neutron stars , Nature Phys. (2020) [ ].[49] K. Masuda, T. Hatsuda and T. Takatsuka,
Hadron-Quark Crossover and Massive HybridStars with Strangeness , Astrophys. J. (2013) 12 [ ].[50] Y. Fujimoto, K. Fukushima and W. Weise,
Continuity from neutron matter to two-flavorquark matter with S and P superfluidity , Phys. Rev. D (2020) 094009 [ ].[51] L. McLerran and R.D. Pisarski,
Phases of cold, dense quarks at large N(c) , Nucl. Phys. A (2007) 83 [ ].[52] K. Fukushima and T. Kojo,
The Quarkyonic Star , Astrophys. J. (2016) 180[ ].[53] L. McLerran and S. Reddy,
Quarkyonic Matter and Neutron Stars , Phys. Rev. Lett. (2019) 122701 [ ].[54] K.S. Jeong, L. McLerran and S. Sen,
Dynamically generated momentum space shellstructure of quarkyonic matter via an excluded volume model , Phys. Rev. C (2020)035201 [ ].[55] D.C. Duarte, S. Hernandez-Ortiz and K.S. Jeong,
Excluded-volume model for quarkyonicMatter: Three-flavor baryon-quark Mixture , Phys. Rev. C (2020) 025203 [ ].[56] T. Zhao and J.M. Lattimer,
Quarkyonic Matter Equation of State in Beta-Equilibrium , Phys. Rev. D (2020) 023021 [ ].[57] K. Fukushima, T. Kojo and W. Weise,
Hard-core deconfinement and soft-surfacedelocalization from nuclear to quark matter , Phys. Rev. D (2020) 096017 [ ].[58] E.R. Most, L.J. Papenfort, V. Dexheimer, M. Hanauske, S. Schramm, H. Stöcker et al.,
Signatures of quark-hadron phase transitions in general-relativistic neutron-star mergers , Phys. Rev. Lett. (2019) 061101 [ ].[59] A. Bauswein, N.-U.F. Bastian, D.B. Blaschke, K. Chatziioannou, J.A. Clark, T. Fischeret al.,
Identifying a first-order phase transition in neutron star mergers throughgravitational waves , Phys. Rev. Lett. (2019) 061102 [ ].[60] E.R. Most, L. Jens Papenfort, V. Dexheimer, M. Hanauske, H. Stoecker and L. Rezzolla,
On the deconfinement phase transition in neutron-star mergers , Eur. Phys. J. A (2020)59 [ ].[61] D. Alvarez-Castillo, A. Ayriyan, S. Benic, D. Blaschke, H. Grigorian and S. Typel, New classof hybrid EoS and Bayesian M-R data analysis , Eur. Phys. J.
A52 (2016) 69 [ ].[62] C.A. Raithel, F. Özel and D. Psaltis,
From Neutron Star Observables to the Equation ofState: An Optimal Parametrization , Astrophys. J. (2016) 44 [ ].[63] C.A. Raithel, F. Özel and D. Psaltis,
From Neutron Star Observables to the Equation ofState. II. Bayesian Inference of Equation of State Pressures , Astrophys. J. (2017) 156[ ].[64] G. Raaijmakers et al.,
Constraining the dense matter equation of state with joint analysis of NICER and LIGO/Virgo measurements, Astrophys. J. Lett. (2020) L21 [ ].
[65] G. Raaijmakers et al.,
Was GW170817 aCanonical Neutron Star Merger? Bayesian Analysis with a Third Family of Compact Stars , Universe (2020) 81 [ ].[67] P. Landry and R. Essick, Nonparametric inference of the neutron star equation of statefrom gravitational wave observations , Phys. Rev. D (2019) 084049 [ ].[68] R. Essick, P. Landry and D.E. Holz, Nonparametric Inference of Neutron Star Composition,Equation of State, and Maximum Mass with GW170817 , Phys. Rev. D (2020) 063007[ ].[69] R. Essick, I. Tews, P. Landry, S. Reddy and D.E. Holz,
Direct Astrophysical Tests of ChiralEffective Field Theory at Supranuclear Densities , Phys. Rev. C (2020) 055803[ ].[70] L.-G. Pang, K. Zhou, N. Su, H. Petersen, H. Stöcker and X.-N. Wang,
Anequation-of-state-meter of quantum chromodynamics transition from deep learning , NatureCommun. (2018) 210 [ ].[71] Y. Mori, K. Kashiwa and A. Ohnishi, Toward solving the sign problem with pathoptimization method , Phys. Rev. D (2017) 111501 [ ].[72] Y.D. Hezaveh, L. Perreault Levasseur and P.J. Marshall, Fast Automated Analysis of StrongGravitational Lenses with Convolutional Neural Networks , Nature (2017) 555[ ].[73] S. Chang, T. Cohen and B. Ostdiek,
What is the Machine Learning? , Phys. Rev.
D97 (2018) 056009 [ ].[74] Z.M. Niu and H.Z. Liang,
Nuclear mass predictions based on Bayesian neural networkapproach with pairing and shell effects , Phys. Lett.
B778 (2018) 48 [ ].[75] B. Kaspschak and U.-G. Meißner,
How machine learning conquers the unitary limit , .[76] L. Wang, Y. Jiang, L. He and K. Zhou, Continuous-mixture Autoregressive Networks for theEfficient Variational Calculation of Spin System , .[77] D. George and E.A. Huerta, Deep Neural Networks to Enable Real-time MultimessengerAstrophysics , Phys. Rev.
D97 (2018) 044039 [ ].[78] M. Carrillo, M. Gracia-Linares, J.A. González and F.S. Guzmán,
Parameter estimates inbinary black hole collisions using neural networks , Gen. Rel. Grav. (2016) 141[ ].[79] D. George and E.A. Huerta, Deep Learning for Real-time Gravitational Wave Detection andParameter Estimation: Results with Advanced LIGO Data , Phys. Lett.
B778 (2018) 64[ ].[80] D. George, H. Shen and E. Huerta,
Classification and unsupervised clustering of LIGO datawith Deep Transfer Learning , Phys. Rev.
D97 (2018) 101501.[81] Y. Fujimoto, K. Fukushima and K. Murase,
Methodology study of machine learning for theneutron star equation of state , Phys. Rev.
D98 (2018) 023019 [ ].
[82] Y. Fujimoto, K. Fukushima and K. Murase,
Mapping neutron star data to the equation ofstate using the deep neural network , Phys. Rev. D (2020) 054016 [ ].[83] M. Ferreira and C. Providência,
Unveiling the nuclear matter EoS from neutron starproperties: a supervised machine learning approach , .[84] F. Morawski and M. Bejger, Neural network reconstruction of the dense matter equation ofstate derived from the parameters of neutron stars , Astron. Astrophys. (2020) A78[ ].[85] S. Traversi and P. Char,
Structure of Quark Star: A Comparative Analysis of BayesianInference and Neural Network Based Modeling , Astrophys. J. (2020) 9 [ ].[86] R.C. Tolman,
Static solutions of Einstein’s field equations for spheres of fluid , Phys. Rev. (1939) 364.[87] J.R. Oppenheimer and G.M. Volkoff, On Massive neutron cores , Phys. Rev. (1939) 374.[88] L. Lindblom, Determining the nuclear equation of state from neutron-star masses and radii , Astrophys. J. (1992) 569.[89] G. Cybenko,
Approximation by superpositions of a sigmoidal function , Mathematics ofControl, Signals, and Systems (MCSS) (1989) 303.[90] K. Hornik, Approximation capabilities of multilayer feedforward networks , Neural Networks (1991) 251 .[91] F. Chollet, “Keras: Deep learning library for theano and tensorflow.” https://github.com/fchollet/keras , 2015.[92] M. Abadi et al., Tensorflow: A system for large-scale machine learning , .[93] D.P. Kingma and J. Ba, Adam: A method for stochastic optimization , CoRR abs/1412.6980 (2014) [ ].[94] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals,
Understanding deep learningrequires rethinking generalization , Published in ICLR 2017 (2016) arXiv:1611.03530[ ].[95] Z. Allen-Zhu, Y. Li and Y. Liang,
Learning and generalization in overparameterized neuralnetworks, going beyond two layers , in
Advances in Neural Information Processing Systems32 (NeurlPS 2019) , H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox andR. Garnett, eds., vol. 32, pp. 6158–6169, Curran Associates, Inc., 2019,https://proceedings.neurips.cc/paper/2019/file/62dad6e273d32235ae02b7d321578ee8-Paper.pdf.[96] L. Breiman,
Bagging predictors , Mach. Learn. (1996) 123–140.[97] K. Hebeler, J.M. Lattimer, C.J. Pethick and A. Schwenk, Equation of state and neutronstar properties constrained by nuclear physics and observation , Astrophys. J. (2013) 11[ ].[98] T. Hinderer,
Tidal Love numbers of neutron stars , Astrophys. J. (2008) 1216[ ].[99] T. Hinderer, B.D. Lackey, R.N. Lang and J.S. Read,
Tidal deformability of neutron starswith realistic equations of state and their gravitational wave signatures in binary inspiral , Phys. Rev.
D81 (2010) 123016 [ ]. – 44 – LIGO Scientific, Virgo collaboration,
GW170817: Measurements of neutron star radiiand equation of state , Phys. Rev. Lett. (2018) 161101 [ ].[101] M.G. Alford, G. Burgio, S. Han, G. Taranto and D. Zappalà,
Constraining and applying ageneric high-density equation of state , Phys. Rev. D (2015) 083002 [ ].[102] U.H. Gerlach, Equation of State at Supranuclear Densities and the Existence of a ThirdFamily of Superdense Stars , Phys. Rev. (1968) 1325.[103] K. Schertler, C. Greiner, J. Schaffner-Bielich and M.H. Thoma,
Quark phases in neutronstars and a ’third family’ of compact stars as a signature for phase transitions , Nucl. Phys.
A677 (2000) 463 [ astro-ph/0001467 ].[104] M.G. Alford and A. Sedrakian,
Compact stars with sequential QCD phase transitions , Phys.Rev. Lett. (2017) 161104 [ ].[105] P. Bedaque and A.W. Steiner,
Sound velocity bound and neutron stars , Phys. Rev. Lett. (2015) 031103 [ ].[106] L. Lindblom,
Spectral Representations of Neutron-Star Equations of State , Phys. Rev.
D82 (2010) 103011 [ ].[107] L. Lindblom and N.M. Indik,
Spectral Approach to the Relativistic Inverse Stellar StructureProblem II , Phys. Rev.
D89 (2014) 064003 [ ].[108] Y. Gal and Z. Ghahramani,
Dropout as a bayesian approximation: Representing modeluncertainty in deep learning , in
Proceedings of The 33rd International Conference onMachine Learning , M.F. Balcan and K.Q. Weinberger, eds., vol. 48 of
Proceedings ofMachine Learning Research , (New York, New York, USA), pp. 1050–1059, PMLR, 20–22Jun, 2016, http://proceedings.mlr.press/v48/gal16.html.[109] A. Kendall and Y. Gal,
What uncertainties do we need in bayesian deep learning forcomputer vision? , in
Advances in Neural Information Processing Systems , I. Guyon,U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan et al., eds., vol. 30,pp. 5574–5584, Curran Associates, Inc., 2017,https://proceedings.neurips.cc/paper/2017/file/2650d6089a6d640c5e85b2b88265dc2b-Paper.pdf.[110] L. Perreault Levasseur, Y.D. Hezaveh and R.H. Wechsler,
Uncertainties in ParametersEstimated with Neural Networks: Application to Strong Gravitational Lensing , Astrophys. J.Lett. (2017) L7 [ ].].