[PDF] A sparse regression approach to modeling the relation between galaxy stellar masses and their host halos

Abstract

Sparse regression algorithms have been proposed as the appropriate framework to model the governing equations of a system from data, without needing prior knowledge of the underlying physics. In this work, we use sparse regression to build an accurate and explainable model of the stellar mass of central galaxies given properties of their host dark matter (DM) halo. Our data set comprises 9,521 central galaxies from the EAGLE hydrodynamic simulation. By matching the host halos to a DM-only simulation, we collect the halo mass and specific angular momentum at present time and for their main progenitors in 10 redshift bins from z=0 to z=4. The principal component of our governing equation is a third-order polynomial of the host halo mass, which models the stellar-mass halo-mass relation. The scatter about this relation is driven by the halo mass evolution and is captured by second and third-order correlations of the halo mass evolution with the present halo mass. An advantage of sparse regression approaches is that unnecessary terms are removed. Although we include information on halo specific angular momentum, these parameters are discarded by our methodology. This suggests that halo angular momentum has little connection to galaxy formation efficiency. Our model has a root mean square error (RMSE) of 0.167 \log_{10}(M^*/M_\odot), and accurately reproduces both the stellar mass function and central galaxy correlation function of EAGLE. The methodology appears to be an encouraging approach for populating the halos of DM-only simulations with galaxies, and we discuss the next steps that are required.

Full PDF

MMNRAS , 1–19 (2020) Preprint 11 January 2021 Compiled using MNRAS L A TEX style ﬁle v3.0

A sparse regression approach to modeling the relation betweengalaxy stellar masses and their host halos

M. Icaza-Lizaola , ★ , Richard G. Bower , , Peder Norberg , , , Shaun Cole , Stefan Egan Institute for Computational Cosmology, Department of Physics, University of Durham, South Road, Durham DH1 3LE, UK. Institute for Data Science, Department of Physics, University of Durham, South Road, Durham DH1 3LE, UK. Centre for Extragalactic Astronomy, Department of Physics, University of Durham, South Road, Durham DH1 3LE, UK. Procter & Gamble, Newcastle Innovation Centre, Newcastle upon Tyne, UK.

Accepted XXX. Received YYY; in original form ZZZ

ABSTRACT

167 log ( 𝑀 ∗ / 𝑀 (cid:12) ) , and accurately reproduces boththe stellar mass function and central galaxy correlation function of EAGLE. The methodologyappears to be an encouraging approach for populating the halos of DM-only simulations withgalaxies, and we discuss the next steps that are required. Key words: galaxies: evolution – galaxies: haloes – cosmology: dark matter

Gravitational collapse in the expanding Universe leads to the forma-tion of complex, highly non-linear structures. Fortunately, the forceof gravity can be accurately models through N-body cosmologicalsimulations. However, observational probes of the Universe’s struc-ture usually rely on galaxies, bringing in to play the much broaderrange of baryon physics. Unlike dark matter only (DMO) simula-tions, which only allow interactions through gravity, baryon simu-lations need to deal with complicated feedback processes and arestrongly inﬂuenced by processes happening at scales much smallerthan the size of the simulation grid (Schaye et al. 2010). While thiscan be mitigated by including sub-grid models of these processes, inthe form of sources or sinks of energy and matter, the resulting com-putational cost of accurate baryonic simulations remains far greaterthan that of DMO simulations. As a consequence the volume ofthe universe that can be modelled in this way is limited. A hybridapproach is therefore necessary, in which a large volume DMO ★ E-mail: [email protected] simulation is populated with galaxies based on the halo-galaxy re-lationship found in smaller baryonic simulations. This requires amethodology that can extract robust halo-galaxy relationships mak-ing optimal use of the available volume of baryonic simulations. Inthis paper, we explore whether sparse-regression models provide anattractive approach.It is already well established that there is a strong correlationbetween the stellar mass (SM) of a central galaxy and the mass of itshost halo (White & Rees 1978). This relation is known as the Stellar-Mass-Halo Mass (SMHM) relation. We use this relation as thestarting point for this investigation. Although a full reconstructionof the baryonic Universe would require us to also treat satellitegalaxies, these are subject to additional physics such as tidal stripingand heating (Merritt 1983) as well as other environmental processes.We therefore focus on accurate modelling of the SMHM of centralgalaxies and leave extension of the methodology to include satellitegalaxies for future work.We will focus here on the scatter in the SMHM relation (Moreet al. 2010; Zu & Mandelbaum 2015) and investigate whether otherproperties of the DM halo, besides its present-day mass, also play a © a r X i v : . [ a s t r o - ph . GA ] J a n M. Icaza-Lizaola et al. role in determining the stellar mass of the central galaxies. Previouswork has shown that properties that measure the evolutionary historyof the halo mass are well correlated with these residuals, this isknown as assembly bias (Zentner et al. 2014). Matthee et al. (2016)studied the correlation between the residuals in the SMHM relationand diﬀerent DM properties. They found that the parameters thatare most correlated with this residual are those that are determinedby the evolution of the halo mass, in particular the halo formationtime. They found no other parameter strongly correlated with theresidual of the SMHM relation once it was corrected by the haloconcentration correlation. Our aim in this paper is to investigatethe optimal measure of halo formation trajectory and to determinewhether the prediction of stellar mass can be improved by includingadditional halo properties.The speciﬁc angular momentum (sAM) of a galaxy appearscorrelated with its stellar mass (Fall & Romanowsky 2013). How-ever, while there are correlations between the history of the sAMof a galaxy and the history of the sAM of its host halo (Zavalaet al. 2016), Danovich et al. (2015) uses cosmological simulationsto suggest that the sAM of gas and dark matter undergo decou-pled formation histories and that its only the ﬁnal distribution ofspin parameters that is similar between baryons and dark matter.Nevertheless, it remains physical plausible that halo angular mo-mentum and galaxy formation eﬃciency maybe interconnected insome more complex way.The aim of this work is to use a sparse regression algorithm toﬁnd a polynomial equation that relates the stellar mass of a galaxywith the properties of its DM halo. This is a form of MachineLearning (ML). More conventional ML algorithms such as neuralnetworks (Lecun et al. 2015) and random forests (Breiman 2001) aresome of the most powerful tools for parameterising a data set. How-ever, algorithms like neural networks or ensemble models (Roberts& Everson 2001) generate models with virtually no explainability ,so that extracting the physics behind the model would be diﬃcult.While the network could predict galaxy properties, it would be hardto gain conﬁdence that the output was physically reasonable. Al-gorithms like random forests, that work by building a collection ofdecision trees that are designed to be as uncorrelated as possible,and look at the average properties of all trees, are easier to interpretas one can measure how often a variable was used and how dras-tically the entropy decreases in each step. Another potential issuewith machine learning algorithms is the slow evaluation speed of aﬁnal model.Sparse regression methods (SRM) (Hastie et al. 2015) are a setof minimization algorithms that are eﬃcient at discarding unneces-sary free parameters. This makes them very useful at minimizingfunctions for which one suspects that most free parameters are ir-relevant except for a small subset that one is trying to identify. SRMprovide a trade-oﬀ between including very many free parameters(which would result in over-ﬁtting to random artefacts in the data)and eliminating too many parameters (which would result in a poordescription of the data). SRM have been proposed as the appropriateframework to extract the governing equations of a physical systemfrom the data alone with relatively little prior knowledge required ofthe system’s physics (Brunton et al. 2016). A key point advantage ofthe SRM approach is that the small number of retained coeﬃcientsare likely to have a clear physical interpretation. Further more, giventhat the models produced with SRM are polynomial equations, theycan be evaluated virtually immediately.In this paper, We use a sparse regression methodology to modelthe stellar mass of galaxies in the EAGLE 100 cMpc hydrodynam-ical simulation (Schaye et al. 2015; Crain et al. 2015). The EAGLE simulation provides a reasonable description of the observed Uni-verse (Crain et al. 2015; Artale et al. 2017; Furlong et al. 2015;Trayford et al. 2016), and is ideal for our proposes as parameter val-ues of the DM halos are stored at several redshift slices (McAlpineet al. 2016), which permit us to model assembly bias. While allmodels presented here were calibrated on EAGLE data, we expectthat the methodology could be reproduced in other hydrodynamicalsimulations with similar success.ML methods have shown promising developments is the mak-ing of mock catalogs of DM halos. Lucie-Smith et al. (2018) used arandom forest algorithm to predict which DM particle in a simula-tion would end up inside a DM halo of a given mass, while Berger& Stein (2019) used a neural network to build DM halo mocks. Inthis paper we tackle the challenging next step. For mock simulationsto be compared to observations of large-scale structure surveys theyneed to be populated with galaxies in such a way that they repro-duce the stellar mass function (SMF) and the clustering patterns ofgalaxies. A point to bear in mind, however, is that large scale struc-ture surveys include faint galaxies at small redshifts, which wouldrequire us to model the stellar mass of satellite galaxies as well. Wediscuss the challenges that this brings at the end of the paper.This paper is organized as follows: Section 2 introduces thesparse regression methodology used to build our model, it alsoincludes an example model that illustrates the behavior of the al-gorithm. Section 3 introduces the hydrodynamical simulation fromwhich the input data were extracted, it also discusses how the datawas processed to be used by our algorithm. The details on run-ning the algorithm using the data set are presented in Section 4.Section 5 shows the results of the diﬀerent conﬁgurations in whichwe run our code, and we also discuss the physical interpretation ofdiﬀerent terms of our governing equations and compare the stellarmass distribution and clustering statistics to those from EAGLE.Our conclusions and ﬁnal thoughts along with a brief discussion onthe next steps that we aim to take, are presented in Section 6.

We are interested in ﬁnding a function that models a physicalproperty, 𝑦 (cid:48) , that might be determined by a set of M variables (cid:174) 𝑥 (cid:48) = [ 𝑥 (cid:48) , ...., 𝑥 (cid:48) 𝑀 ] (we reserve the symbol 𝑥 for normalised vari-ables - see below). In this work 𝑦 (cid:48) is the stellar mass of a galaxyand (cid:174) 𝑥 (cid:48) a set of present and past properties of its DM halo. We canbuild a data set of values of 𝑦 (cid:48) and their associated (cid:174) 𝑥 (cid:48) by looking atlarge catalogues where the value of both has been measured. In thispaper we use the output of the EAGLE hydrodynamical simulations(see Section 3).We collect a sample of N galaxies to build a vector (cid:174) 𝑦 (cid:48) , where (cid:174) 𝑦 (cid:48) = [ 𝑦 (cid:48) , ..., 𝑦 (cid:48) 𝑁 ] , (1)and an associated matrix X (cid:48) , with each row (cid:174) 𝑥 (cid:48) 𝑖 representing thelist of dependent variables associated with the DM halo of thecorresponding galaxy: X (cid:48) =  (cid:174) 𝑥 (cid:48) . (cid:174) 𝑥 (cid:48) 𝑁  =  𝑥 (cid:48) ... 𝑥 (cid:48) 𝑀 . . .𝑥 (cid:48) 𝑁 ... 𝑥 (cid:48) 𝑁 𝑀  (2)The diﬀerent columns of matrix X (cid:48) correspond to diﬀerentproperties of the DM halo, where each property can have diﬀerentunits and distributions. It is, therefore, necessary to standardize our MNRAS , 1–19 (2020) parse regression modeling of the stellar mass data. We choose to do this using the mean and standard deviationof the distribution, and deﬁne the normalised variable as: (cid:174) 𝑥 𝑖 = (cid:174) 𝑥 (cid:48) 𝑖 − 𝜇 ( (cid:174) 𝑥 𝑖 ) 𝜎 ( (cid:174) 𝑥 𝑖 ) , (3)where 𝜇 and 𝜎 are the mean and standard deviation operators.The same normalization scheme is also applied to our dependentvariable: 𝑦 = ( 𝑦 (cid:48) − 𝜇 ( 𝑦 (cid:48) ))/ 𝜎 ( 𝑦 (cid:48) ) . Note that the primed variablesrefer to natural quantities and non-primed variables to standardisedones that have zero mean and unit variance.The observed values of the M variables of (cid:174) 𝑥 𝛼 will be usedas inputs for a series of functions whose output one hopes to useto predict 𝑦 𝛼 . These functions can in principle have any desiredform, and so we will use a gradient descent algorithm to ﬁt a linearcombination of them to (cid:174) 𝑦 (Section 2.3). Although other approacheslike singular value decomposition Golub (1970) could be used inthe hyperbolic case, we wish to ensure that the method is generic.We consider a library of 𝐷 functions, and their evaluatedvalues for the observed parameters of the 𝛼 𝑡ℎ galaxy (cid:174) 𝑓 ( (cid:174) 𝑥 𝛼 ) = [ 𝑓 ( (cid:174) 𝑥 𝛼 ) , ....., 𝑓 𝐷 ( (cid:174) 𝑥 𝛼 )] . The library of functions that we use in thiswork consists of: • A constant term 𝑓 ( (cid:174) 𝑥 𝛼 ) = • 𝑀 linear terms of the form [ 𝑓 ( (cid:174) 𝑥 𝛼 ) ,..., 𝑓 𝑀 ( (cid:174) 𝑥 𝛼 ) ] =[ 𝑥 𝛼 ,..., 𝑥 𝛼𝑖 ,..., 𝑥 𝛼𝑀 ] where 1 ≤ 𝑖 ≤ 𝑀 . • 𝑀 ( 𝑀 + )/ 𝑓 ( (cid:174) 𝑥 𝛼 ) ,..., 𝑓 𝑀 ( 𝑀 + )/ ( (cid:174) 𝑥 𝛼 ) ] = [ 𝑥 𝛼 ,...., 𝑥 𝛼𝑖 𝑥 𝛼 𝑗 ,..., 𝑥 𝛼𝑀 ] with1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑀 . • 𝑀 ( 𝑀 + )( 𝑀 + )/ 𝑓 ( (cid:174) 𝑥 𝛼 ) ,..., 𝑓 𝑀 ( 𝑀 + ) ( 𝑀 + )/ ( (cid:174) 𝑥 𝛼 ) ] = [ 𝑥 𝛼 , ... , 𝑥 𝛼𝑖 𝑥 𝛼 𝑗 𝑥 𝛼𝑘 ,...., 𝑥 𝛼𝑀 ] with1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑘 ≤ 𝑀 .The total number of functions considered is: 𝐷 = + 𝑀 + 𝑀 ( 𝑀 + ) + 𝑀 ( 𝑀 + )( 𝑀 + ) . (4)The number M of DM halo properties that we use depends on thespeciﬁc parametrisation of the present and past properties of thehalo that we select. We run four diﬀerent models each with diﬀerentvalues of M (Section 3.5).This methodology is able to deal with far more complicatedfunctional forms than the polynomial forms used here. For exam-ple, we experimented with exponential decays and step functions.However including these more complicated functions in our initialtesting did not improve our models, but increased the computationaltime so we excluded them from our ﬁnal ﬁts in this paper.Our goal is to ﬁnd optimized values of the coeﬃcients (cid:174) 𝐶 = [ 𝐶 , ......., 𝐶 𝐷 ] that will make the linear combinations of our 𝐷 functions a suﬃciently accurate model of (cid:174) 𝑦 . Speciﬁcally, we aimto ﬁnd the optimized values of (cid:174) 𝐶 such that (cid:174) 𝐹 ( (cid:174) 𝐶, X ) ≈ (cid:174) 𝑦 , where (cid:174) 𝐹 ( (cid:174) 𝐶, X ) is deﬁned as: (cid:174) 𝐹 ( (cid:174) 𝐶, X ) = F ( X ) (cid:174) 𝐶 𝑇 =  𝑓 ((cid:174) 𝑥 ) ... 𝑓 𝐷 ((cid:174) 𝑥 ) ... ... ...𝑓 ((cid:174) 𝑥 𝑁 ) ... 𝑓 𝐷 ((cid:174) 𝑥 𝑁 )   𝐶 ...𝐶 𝐷  . (5)We discuss the precise meaning of the approximate equality in thefollowing section. Our aim is to achieve a balance between the ac-curacy of ﬁtting the data while keeping the model as simple aspossible. Clearly there is an underlying assumption that the func-tions included can be linearly combined into a suﬃciently accuratemodel. In the absence of a detailed understanding of the physical system our approach is to include a large number of functions in ourlibrary, spanning the possible range of physical interactions. Sparse regression methods are aim to minimising the error term | (cid:174) 𝐹 ( (cid:174) 𝐶, X ) − (cid:174) 𝑦 | while discarding any unnecessary functions by settingtheir associated coeﬃcients 𝐶 𝑗 to a negligibly small values. Thismakes them the appropriate framework for our problem as it allowsus to include a large number of functions while knowing that allof the unnecessary ones that we include will be discarded by themethodology. The fewer surviving coeﬃcients the easier it is tointerpret the solution (i.e. the more explainable it is). The solutionwill also be less susceptible to over-ﬁtting to random ﬂuctuations inthe training data.One of the most common sparse regression algorithms isLASSO (Tibshirani 1996; Tibshirani & Friedman 2017), whereone minimizes 𝐿 = 𝜒 + 𝜆𝑃 ( (cid:174) 𝐶 ) . (6) 𝑃 ( (cid:174) 𝐶 ) is known as the penalty function and its value should increasewith the absolute value and number of coeﬃcients that are not setto zero. The coeﬃcient 𝜆 is a hyperparameter of the model anddetermines the relative magnitude of the penalty term. The value of 𝜆 is determined using a k-fold methodology (Hastie et al. 2015),as described in Section 2.4. 𝜒 is the normal chi-squared func-tion deﬁned as 𝜒 = 𝑁 ∑︁ 𝛼 = ( 𝐹 𝛼 ( (cid:174) 𝐶, X ) − 𝑦 𝛼 ) 𝜎 𝑦 𝛼 , (7)where 𝜎 𝑦 𝛼 is an estimate of the uncertainty in measurement 𝑦 𝛼 and 𝐹 𝛼 ( (cid:174) 𝐶, X ) is the 𝛼 𝑡ℎ element of (cid:174) 𝐹 ( (cid:174) 𝐶, X ) . In the absence of thepenalty term, 𝐿 would be the negative of the logarithmic likelihoodfunction (ie., 𝐿 = − L ).In the standard LASSO approach 𝑃 ( (cid:174) 𝐶 ) is deﬁned as 𝑃 ( (cid:174) 𝐶 ) = 𝐷 ∑︁ 𝑖 = | 𝐶 𝑖 | . (8)We introduce a regularisation term to smooth out the gradient dis-continuities that occur when parameters are close to zero, 𝑃 ( (cid:174) 𝐶 ) = 𝐷 ∑︁ 𝑖 = | 𝐶 𝑖 | 𝑒 −( 𝜖 / 𝐶 𝑖 ) , (9)where 𝜖 is a small constant. Note that exp (− ( 𝜖 / 𝐶 𝑖 ) ) is very closeto zero when | 𝐶 𝑖 | (cid:28) 𝜖 and close to one when | 𝐶 𝑖 | (cid:29) 𝜖 . Therefore 𝜖 determines how close to zero a coeﬃcient 𝐶 𝑗 needs to go beforeits contribution to the penalty is negligible. We adopt a value of 𝜖 = × − , which we show in Section 4 makes unnecessarycoeﬃcients go close enough to zero to be clearly distinguished fromthe ones that are useful, while keeping a reasonable computationalcost (the closer to zero unnecessary coeﬃcients are required to getthe longer the minimizer needs to run). We deﬁne a cutoﬀ value 𝜈 as the threshold between used and discarded parameters: everycoeﬃcient larger than 𝜈 will be used in our model and all smallercoeﬃcients are discarded. The exact value of 𝜈 is presented inSection 4.In equation 9 the contribution of each coeﬃcient 𝐶 𝑖 is inde-pendent of the contribution of all other coeﬃcients. This meansthat there is not a strong penalty for having many small, but largerthan 𝜖 , values of 𝐶 𝑖 . We found that a more eﬃcient approach at MNRAS , 1–19 (2020)

M. Icaza-Lizaola et al. eliminating non-essential coeﬃcients is to consider the contribu-tion of a coeﬃcient, in comparison to all of the other survivingcoeﬃcients. This is achieved by the following penalty function 𝑃 ( (cid:174) 𝐶 ) = (cid:205) 𝐷𝑖 = (cid:2)(cid:205) 𝑗 ≠ 𝑖 | 𝐶 𝑗 | (cid:3) | 𝐶 𝑖 | . Combining both modiﬁcationsour penalty function has the following form 𝑃 ( (cid:174) 𝐶 ) = 𝐷 ∑︁ 𝑖 = ∑︁ 𝑗 ≠ 𝑖 | 𝐶 𝑗 | 𝑒 − ( 𝜖 / 𝐶 𝑗 )  | 𝐶 𝑖 | 𝑒 −( 𝜖 / 𝐶 𝑖 ) . (10)This is the form of the penalty function adopted in our algorithm.The 𝜒 is a measure of the goodness of ﬁt, which decreases asthe model becomes more accurate. Balancing of the goodness of ﬁtstatistic and penalty term makes sparse models robust against over-ﬁtting: an over-ﬁtted model would use many parameters to make aunrealistically good ﬁt, which would make the 𝜒 very small but itwould also make the penalty term large (as it grows with the numberof parameters). The minimum should belong to a model that is assimple as possible, while still being a suﬃciently good ﬁt. This iswhy when using a large library of functions all but a small subsetof the coeﬃcients end up being set to zero.By making some general assumptions, we can estimate that inthe optimised solution 𝑃 ( (cid:174) 𝐶 ) = O( ) . First we note that 𝑃 ( (cid:174) 𝐶 ) ≈ (cid:205) 𝐷𝑖 = (cid:2)(cid:205) 𝑗 ≠ 𝑖 | 𝐶 𝑖 || 𝐶 𝑗 | (cid:3) ≈ ( (cid:205) 𝐷𝑖 = | 𝐶 𝑖 |) , and that the optimisedsolution should satisfy (cid:174) 𝐹 ( (cid:174) 𝐶, X ) = F ( X ) (cid:174) 𝐶 𝑇 = (cid:205) 𝐷𝑖 = 𝐶 𝑖 𝐹 𝑖 ( X ) ≈ (cid:174) 𝑦 .Secondly let us note that 𝐹 𝑖 ( X ) correspond to third order com-binations of elements of (cid:174) 𝑥 𝑖 , with each element standardised tobe of the order of magnitude of the elements of (cid:174) 𝑦 and therefore 𝐹 𝑖𝛼 ( X ) ≈ O( (cid:174) 𝑦 𝛼 ) . From here it should be that (cid:205) 𝐷𝑖 = | 𝐶 𝑖 |≈ O( ) ,and consequently that 𝑃 ( (cid:174) 𝐶 ) ≈ O( ) .The properties of the simulated galaxies do not have formalmeasurement errors, but we still expect a random scatter due to thestochastic nature of the formation process. We therefore estimate 𝜎 𝑦 𝛼 using 𝜎 𝑦 𝛼 = 𝑁 ∑︁ 𝛼 = ( 𝐹 𝛼 ( (cid:174) 𝐶, X ) − 𝑦 𝛼 ) 𝑁 (11)evaluated at the values of 𝐶 𝑖 for which this is a minimum. Theseare easily found by our code that minimizes equation 6 by initiallysetting 𝜆 = 𝜎 𝑦 𝛼 = 𝑁 . A consequence of using this deﬁnitionof 𝜎 𝑦 𝛼 is that if we then minimise equation 6 with no penalty ( 𝜆 = 𝐿 ( 𝜆 = ) = 𝑁. (12)The optimised value of 𝜆 should be such that 𝜒 and 𝜆𝑃 ( (cid:174) 𝐶 ) areof comparable size. Given that by 𝑃 ( (cid:174) 𝐶 ) ≈ O( ) , and that we con-structed 𝜎 𝑦 𝛼 such that 𝜒 ≈ O( 𝑁 ) then 𝜆 ≈ O( 𝑁 ) . This allows usto estimate the sizes of penalty that we should explore. We use a gradient decent algorithm to minimise eq. 6. The processstarts at an initial point in parameter space and iteratively walks inthe direction opposite to the gradient of 𝐿 with respect to 𝐶 𝑖 .We use a variation of a gradient descent algorithm (Arfken1985), which is standard practice for most machine learning method-ologies. The size of each step is determined by a parameter 𝜂 . Atevery step one computes 𝐿 , if it is larger at the new position then 𝜂 is reduced (as it would likely mean that it overshot the minimum).In the opposite case 𝜂 size is increased if 𝐿 is smaller at the newposition as it is likely that we are still far from the minimum. The gradient of 𝐿 from equation 6 is computed with respectto the vector of coeﬃcients 𝐶 𝑖 . In the standard methodology onemakes a step in the direction of the gradient at the current position.However, we found that this did not perform well in the steep-sidedvalleys that characterise 𝐿 . In such valleys, a step will overshootthe minimum, and as a consequence the next step would be in theopposite direction than the previous one but with a slightly smallerstep size. Progress along the valley toward the global minimum isthen slow. This makes convergence ineﬃcient in high dimensionalspaces, as the minimizer tends to jump from one wall of a potentialwell to the opposite wall at each step instead of following a moredirect downwards path.We achieved performance gains by using the following adap-tation of the algorithm for determining the next step of the mini-mization. Deﬁning the position of the 𝑖 𝑡ℎ step as 𝑝 𝑖 = 𝐶 𝑖 , ..., 𝐶 𝑖𝐷 ,the gradient vector −∇( 𝐿 )( 𝑝 𝑖 ) = 𝜕𝐿𝜕𝐶 ( 𝑝 𝑖 ) , .., 𝜕𝐿𝜕𝐶 𝐷 ( 𝑝 𝑖 ) (13)points downhill towards the nearest local minimum. Since we areonly interested in the direction of the gradient and not its magnitudewe can normalize the vector as ∇ 𝐿 ( 𝑝 𝑖 ) = ∇( 𝐿 )( 𝑝 𝑖 ) (cid:46) |∇( 𝐿 )( 𝑝 𝑖 )| , (14)We make a ﬁrst trial step on the downhill direction that takes us tothe following position in parameter space 𝑝 𝑖 / = 𝑝 𝑖 − 𝜂 ∇( 𝐿 )( 𝑝 𝑖 ) . (15)The direction of the next step 𝑝 𝑖 + is given by the mean of thegradients at 𝑝 𝑖 and 𝑝 𝑖 + / , 𝑝 𝑖 + = 𝑝 𝑖 − 𝜂 [∇( 𝐿 )( 𝑝 𝑖 ) + ∇( 𝐿 )( 𝑝 𝑖 + / ) ]/ 𝜂 taken by the minimizer. A very small step sizeindicates that we have not moved far for several steps. Our code willrun until the step size becomes smaller than some tolerance value. Asmaller value of the tolerance means we get closer to the minimum,however, the computational cost of our minimization is stronglydependent on this tolerance value. We found that a tolerance of thestep size of 𝜂 < − produces stable results and manageable lowcomputational cost. We will use a k-fold methodology to ﬁt the hyperparameter 𝜆 . K-fold is a well-known method that is standard practice for ﬁttinghyper-parameters in sparse regression (e.g. Hastie et al. 2015).The method works by randomly dividing data into 𝑘 indepen-dent subsets of roughly the same size. Then the hyperparameter 𝜆 is sampled in 𝑘 independent runs, each time one subset is left outof the minimization and is used to test the model on data it has notseen before. The set left out is called the test set . The rest of thedata points are used for running the minimisation algorithm andare referred as the training set . In this work we will use a value of 𝑘 =

10, which is standard practice.Each run explores 𝜆 with thirty uniformly spread out points inlog 𝜆 between 𝜆 = 𝜆 = 𝑁 , to which the case of 𝜆 = 𝜆 that we explore the more com-putationally expensive our code becomes. We found by testing that MNRAS , 1–19 (2020) parse regression modeling of the stellar mass

30 uniformly spread out points in log 𝜆 was enough to ﬁnd suf-ﬁciently smooth curves without a high computational cost. In prin-ciple, one could explore larger values of 𝜆 , however in our casemodels with 𝜆 = 𝑁 provided already signiﬁcantly worse ﬁts thanmodels with smaller 𝜆 values, which indicates that the penalty wasalready too large at 𝜆 = 𝑁 , this is true for all models presented inthis paper except for the example of Section 2.5 where we neededto run between 𝜆 = 𝜆 = Rootmean square error (RMSE) deﬁned as:RMSE = √︄ (cid:205) 𝑁𝛼 = ( 𝐹 𝛼 ( (cid:174) 𝐶, X ) − 𝑦 𝛼 ) 𝑁 (17)When 𝜆 is close to zero the error in the model of the trainingset would be small as there is no signiﬁcant penalty and the modelis overﬁtted. Such a model is poor at predicting results in data thatit has never seen before and this translates into a large error on thetest set (see Figs. 2 and 6). For very large values of the penalty, themodel becomes too simple as coeﬃcients are heavily penalised, amodel that is too simple will show large error on both the test andthe training set. When 𝜆 is large enough to avoid overﬁtting but nottoo large that models become too simple the RMSE of the test setwill reach its minimum (this is illustrated in Fig. 2 of next section).If 𝜆 𝑘 is the value of 𝜆 where the minimum is for a given k-fold then 𝜆 𝜇 = 𝜇 ( 𝜆 𝑘 ) is an estimate of the optimised value of 𝜆 , where 𝜇 isthe mean operator.It is common practice in sparse regression to choose a valuethat is larger than 𝜆 𝜇 by one standard deviation, this is the one-standard-error rule from Hastie et al. (2015), and is done to avoidover ﬁtting due to inaccuracies in the methodology. In this work weimplement a modiﬁed version of the one-standard-error rule. Let usdeﬁne RMSE 𝑘 ( 𝜆 ) as the RMSE of the 𝑘 𝑡ℎ k-fold as function of 𝜆 . Byconstruction, RMSE 𝑘 is minimised for 𝜆 = 𝜆 𝑘 . If 𝜎 ( RMSE 𝑘 ( 𝜆 𝑘 )) is the standard deviation of the collection of RMSE 𝑘 ( 𝜆 𝑘 ) , then foreach k-fold we deﬁne the optimised value of 𝜆 , 𝜆 min , as: 𝜆 min = 𝜇 ( 𝜆 𝑚 𝑘 ) = 𝜇 ( RMSE 𝑘 ( 𝜆 𝑘 ) + 𝜎 ( RMSE 𝑘 ( 𝜆 𝑘 ))) (18)In order to ﬁnd our surviving coeﬃcients, we run the min-imization algorithm again on the complete data set, setting 𝜆 to 𝜆 min . With 𝑃 , the number of coeﬃcients 𝐶 𝑖 larger than the cutoutvalue 𝜈 , we deﬁne our library of surviving functions 𝐹 𝑆 ( X ) = [ 𝐹 𝑆 ( X ) , .., 𝐹 𝑆𝑃 ( X )] , for which 𝐶 𝑆 𝑗 > 𝜈 .The penalty is useful for selecting which coeﬃcients to discardand keep, but once this is done the presence of a penalty termbiases all coeﬃcients to smaller values. A penalty rewards smallercoeﬃcients over larger ones, due to its size increasing with the sizeof the coeﬃcients. Having this in mind, our ﬁnal model is found byre-running our minimization algorithm using only the functions in 𝐹 𝑆 ( X ) and setting 𝜆 to zero, i.e. without penalty. In this section we introduce a simple example to more clearly illus-trate our methodology. We build a matrix X as in equation 2, whereeach column (cid:174) 𝑥 𝑎 has thirty points (N=30) and each point is a randomnumber between zero and one. We will use three independent vari-ables (M=3), (cid:174) 𝑥 , (cid:174) 𝑥 and (cid:174) 𝑥 . We will also build a dependent variable (cid:174) 𝑦 as: (cid:174) 𝑦 = . + (cid:174) 𝑥 + Noise (19) C C Starting pointTrue ValuePath

Figure 1.

Isocontours of the penalty function deﬁned by Eq. 10 forthe two diﬀerent coeﬃcients 𝐶 and 𝐶 associated with the functions 𝐶 𝑓 ( (cid:174) 𝑥 𝛼 ) = 𝐶 𝑥 ,𝛼 and 𝐶 𝑓 ( (cid:174) 𝑥 𝛼 ) = 𝐶 𝑥 ,𝛼 . The dashed hyperbolicand dotted elliptical lines are the isocontours of our penalty function andof the 𝜒 statistic respectively. Given that the gradient is perpendicular tothe contour lines, the minimisation routine can eﬃciently move toward theorigin of the plot, and also to one of the axes. Hence the code will quicklyreach the minimum if either or both coeﬃcients are zero. where the noise comes from a Gaussian distribution centered onzero and with a width of 10% of the standard deviation of 1 . + (cid:174) 𝑥 .Before running the model, we do not know the shape of equa-tion 19. However, let us suppose that we suspect that (cid:174) 𝑦 shoulddepend on the parameters (cid:174) 𝑥 , (cid:174) 𝑥 and (cid:174) 𝑥 . As we are uncertain onhow to model the dependence between the parameters, we include alarge set of functions. In this example model our library of functionswill include a constant function and linear and quadratic terms only(we will leave cubic terms out for simplicity). In total we end upwith ten functions (D=10).For the purpose of illustration, we focus on 𝐶 and 𝐶 , theparameters associated with the linear functions of 𝑥 and 𝑥 . Fig. 1follows the trajectory of the minimizer for our example model andfor these two coeﬃcients. From equation 19 we know that 𝐶 = 𝐶 =

2. The minimizer starts in an arbitrary position (in the caseof this example in 𝐶 = 𝐶 =

1) and follows the trajectory shown bythe blue line. The dashed lines represent the contours of both the 𝜒 (elliptic dotted contours) and 𝑃 ( (cid:174) 𝐶 ) (dashed contours) of equation10. A gradient descent algorithm will try to move perpendicularly tothese contours, but the modiﬁcations in our algorithm allow the pathto quickly align to the valley around 𝐶 =

0. Apparent deviationsfrom this motion come from the fact that we are looking at the 2dimensional projection of a ten dimensional trajectory.Fig. 2 shows the evolution of the RMSE with respect to 𝜆 forour example model using the k-fold methodology of Section 2.4.For this example, we divide the data in to 𝑘 = 𝜆 between 𝜆 = 𝜆 =

800 since the ratio 𝐷 / 𝑁 MNRAS000

800 since the ratio 𝐷 / 𝑁 MNRAS000 , 1–19 (2020)

M. Icaza-Lizaola et al. R M S E Training setTest set

RMSE k RMSE k + ( RMSE k )( RMSE k ) min Figure 2.

Evolution of the RMSE from the best ﬁts of our example modelat diﬀerent values of the hyperparameter 𝜆 . The blue and green dashed linesrepresent the RMSE of the training and test sets respectively. The solid linesrepresent the median of these curves. When 𝜆 is close to zero the trainingset has a very small error and the test set a comparably larger one, this is dueto overﬁtting of the minimizer and it improves as 𝜆 grows. The black dotsindicate the minimum RMSE for the test set of each individual k-fold, this iswhere overﬁtting was smallest. The black dashed line shows the mean valueof the 𝜆 s of the black dots. The red dots are plotted at the values of 𝜆 givenby our modiﬁed one-standard-error rule. The red line indicates the mean of 𝜆 s of these red dots and is our estimate of 𝜆 min from equation 18. is much larger ( 𝐷 / 𝑁 = /

30) than when running with our fulldata set (where we have hundreds of functions to ﬁt around almost10000 galaxies), making the code is more likely to overﬁt, resultingin the need to allow a larger penalty.When 𝜆 is close to zero the RMSE in the training set is small,the model is overﬁtted and therefore bad at predicting the result indata that it has never seen before. This results in the comparablylarger error on the test set. For the largest values of the penalty, themodel becomes too simple and the error on both the test and thetraining set begins to increase.Around 𝜆 ∼

10, the ﬁt of the test set improves and the RMSEreaches its minimum. Here is where the model is least susceptibleto overﬁtting while still capturing the important features of the dataset. The black dots indicate the minimum RMSE for the test set ofeach individual k-fold and the black dashed line shows the meanvalue of these points. We select an optimal value of 𝜆 to be slightlylarger than this median as deﬁned in equation 18. This is shown asa red line. 𝐶 and 𝐶 are the non-zero coeﬃcients used in building (cid:174) 𝑦 according to equation 19. Fig. 3 shows how the best-ﬁt coeﬃcientsof our example model evolve for diﬀerent values of 𝜆 . Each curveis the mean curve from our 5 diﬀerent folds.As stated in Section 2.2, the code will not set parameters ex-actly to zero but to a very small value which is determined by theparameter 𝜖 of equation 10. Fig. 3 shows that in this example thevalue is ≈ × − (this is true for both the example model and the | C | min C True Value C True Value

Figure 3.

The coloured lines show the value of the mean ﬁt of our indepen-dent k-fold runs for our surviving coeﬃcients, i.e. those coeﬃcients largerthan the cutoﬀ value 𝜈 at the optimised value of the hyper-parameter 𝜆 min .The dashed lines show the true coeﬃcients used to create the data fromequation 19. The gray dashed lines show the values of the coeﬃcients thatwere not chosen as part of the ﬁnal model. The gray shaded area representsour cutout value 𝜈 below which parameters will be taken out of the ﬁt. Thecoeﬃcients are shown as a function of the hyper-parameter 𝜆 and the dottedblack line represents the 𝜆 min chosen by the algorithm. At the chosen 𝜆 allcoeﬃcients are set to zero except 𝐶 and 𝐶 . galaxy data set as it only depends on 𝜖 ). Therefore we select a cutoﬀvalue of 𝜈 = × − . This is is shown as the gray shaded region inFig. 3.The coloured lines correspond to the coeﬃcients that wereabove the cutoﬀ value 𝜈 at 𝜆 min , and therefore survived into theﬁnal model. As expected, the only surviving lines were 𝐶 and 𝐶 .The black dashed lines are the coeﬃcients that were below 𝜈 andtherefore were discarded. The vertical dashed line corresponds tothe optimized value 𝜆 min .At 𝜆 =

0, all coeﬃcients are above the cutout threshold dueto over-ﬁtting. As 𝜆 grows coeﬃcients drops below the thresholdvalue and at 𝜆 min all coeﬃcients other than 𝐶 and 𝐶 have beendiscarded.The ﬁnal model selected by the algorithm is: (cid:174) 𝐹 ( (cid:174) 𝐶, X ) = . + . (cid:174) 𝑥 (20)Considering that this a ﬁt to data generated using equation 19, where 𝑦 = . + 𝑥 + Noise , we can conclude that our algorithm generateda sparse and accurate representation of the data.

Hydrodynamical simulations provide powerful insight into thegalaxy formation process. The resulting database catalogues DMhalos and their connection to baryonic properties such as stellar

MNRAS , 1–19 (2020) parse regression modeling of the stellar mass mass. In this paper, we use the Evolution and Assembly of Galaxiesand their Environments ( EAGLE , Schaye et al. 2015; Crain et al.2015) simulations, a suite of hydrodynamical simulations built in-side cubic periodic volumes. We use the largest of these volumes,corresponding to a box of 100 cMpc of length.The simulation runs using a modiﬁcation of the

GADGET 3 code described in Springel (2005). The code uses Smooth ParticleHydrodynamics methods to model the mechanics of the baryonﬂuid. In order to compute the gravitational potential, the code usesa combination of a Particle Mesh (at large scales) and a hierarchicalTree algorithm (at grid and subgrid scales). The details on themodiﬁcations can be found in Schaller et al. (2015c). Simulationsare built using the Planck cosmology (Planck Collaboration et al.2014).Baryonic physical processes that cannot be solved directly areimplemented into the simulation as sources and sink terms, whereenergy and matter are either absorbed or injected locally into thesimulation. These subgrid models should depend only on the localproperty of the gas. The subgrid models implemented account forradiative cooling (Wiersma et al. 2009a), star formation (Schaye& Dalla Vecchia 2008), star formation feedback (Dalla Vecchia& Schaye 2012), black hole growth (Rosas-Guevara et al. 2015;Springel et al. 2005), Active Galactic Nuclei feedback (Booth &Schaye 2009) and chemical enrichment (Wiersma et al. 2009b). Theuncertain parameters of the subgrid models need to be calibrated,which is done by choosing the values that would reproduce thegalaxy mass function at 𝑧 =0.1, the galaxy size-stellar mass relationand the black hole mass-stellar mass regression. Discussion of thecalibration process can be found in Crain et al. (2015).Haloes are deﬁned using a Friends-of-Friends (FoF) algorithm( FoF Einasto et al. 1984) with a linking length of 𝑏 =0.2, i.e.all particles that can be linked together with an inter-particle dis-tance smaller than 0.2 times the mean inter-particle distance form ahalo. Once the halos have been identiﬁed, the SUBFIND algorithm(Springel et al. 2001) identiﬁes the self-bound local overdensities ofeach FoF group as subhalos. The subhalo that contains the particlewith the lowest value of the potential energy will be deﬁned as thecentral sub-halo.The simulation information is saved at 20 redshift from 𝑧 =20to 𝑧 =0, and is used to build merger trees , which connect a haloto its progenitors at earlier redshifts (Qu et al. 2017). The mainprogenitor of a halo is deﬁned as the progenitor with the largestmass at all earlier outputs. We use these main progenitors to trackthe mass evolution of a DM halo (Section 3.3). Note that whentwo halos pass close to each other without merging they couldmomentarily belong to the same FoF group, as a consequence themass and the subhalo chosen as the central may be inconsistent atthis redshift slice when compared to ones immediately before orafter the interaction (Behroozi et al. 2015). We introduce a schemeto clean such issues from the input data below. Our data set consists of central galaxies inside halos with a masslarger than 𝑀 ,𝐶 > . M (cid:12) .Where 𝑀 ,𝐶 corresponds to the mass inside the radius 𝑅 ,𝐶 of a halo, which is the radius within which the densityis 200 times the critical density of the universe. The stellar mass ofa galaxy is measured as the baryonic mass contained inside a sphereof 30 proper kpc around the centre of potential of the halo.Baryonic processes inside halos can aﬀect their measured DMproperties (Bryan et al. 2013; Schaller et al. 2015b). If we run our code using the properties of the DM found in a hydrodynamicalsimulation, we risk including biases by ﬁtting the stellar mass usinga property that has been modiﬁed by the presence of baryons (thismodiﬁcation would be correlated with the stellar mass as the haloswith more baryons would be more modiﬁed). To avoid this bias, itis common practice to extract all DM input properties from a DMonly simulation generated with the same initial conditions, box sizeand resolution as the full hydrodynamic simulation.The matching between the hydrodynamic and DM-only simu-lations is described in Schaller et al. (2015a). The 50 most boundedDM particles of each halo in the hydrodynamic simulation are found.If a halo in the DM-only simulation has at least half of those mostbound particles it is considered its analogue. Using this method,99% of the halos with 𝑀 ,𝐶 > × M (cid:12) are matched.We collect information about the host DM halo at diﬀerent red-shifts (Section 3.3), therefore we require that our halos are presentin all redshifts slices. With this in mind, we only use galaxies with aprogenitor deﬁned at 𝑧 =4 . Our full sample consists of 9521 galaxies.As stated, inconsistencies between redshift slices are a well-known characteristics of the merger trees (Behroozi et al. 2015)are due to the subhalo ﬁnder running individually on each redshiftslice. When two halos interact some of the particles of one canbe assigned to the other regardless of where they belonged in pastslices. Some of the redshift slices where a given halo classiﬁcationhas been inconsistent can be found by looking for the slices wherethe halo main progenitor was not considered central by the subhaloﬁnder (given that we are only using central galaxies, this wouldmean an inconsistent classiﬁcation at that particular slice). If themain progenitor of our subhalo is not central at some redshift, welook for the mass value of both the nearest earlier and later redshiftswhere the halo was still central. We use these values to do a linearinterpolation and replace the mass that we got from the database forthe one resulting from our interpolation. The nearest earlier redshiftis always well deﬁned (as at z=0 all of our selected subhalos arecentral), however, a small subset of galaxies have a non-centralprogenitor at their oldest redshifts. In these cases we select the twoclosest later subhalos and perform our interpolation with those. Fig.5 shows the mass history of four galaxies from 𝑧 = 𝑧 =

0, the grey dashed lines and the solid black lines correspondsto the values before and after the correction respectively. As theﬁgure shows, diﬀerent galaxies have had very diﬀerent histories,and we will explore whether galaxies that have followed a diﬀerentformation paths will end up having diﬀerent residuals in the SM-HMrelation.

Once we have selected the galaxies in our data set, we deﬁne the 𝑀 parameters of the DM halo that are used to build the matrix 𝑋 ofequation 2.The ﬁrst variable accounted for is the halo mass at redshift zero(or any variable highly correlated with it). The SM-HM relation iswell known and it explains most of the scatter in the stellar mass.Fig. 4 shows where our galaxies reside in the SM-HM plane, andshows that there is a strong correlation between the SM and the HM.We will denote the HM input variable of a galaxy as 𝑀 and deﬁneit as 𝑀 (cid:48) = log ( 𝑀 𝑐 / 𝑀 (cid:12) ) . (21)We use equation 3 to standardize units. We call our standardize units 𝑀 . However, Fig. 4 also shows that there is signiﬁcant scatter MNRAS000

M. Icaza-Lizaola et al.

11 12 13 14log ( M C M )891011 l o g ( M * M ) Figure 4.

Stellar Mass - Dark Matter Host Halo Mass relation at z=0 forcentral galaxies. There is a strong correlation between the two variables,showing that the host dark matter halo mass should be included as one ofour variables. Scatter around the correlation will be modeled by includingadditional variables. around the SM-HM relation, therefore we should also add param-eters that are correlated with the residual between the SM-HMrelation and the actual stellar mass. The features that are more cor-related with the scatter on the SM-HM relation are those that aregood estimators of the mass evolution of the DM host halo (Mattheeet al. 2016). We can include information about the mass evolutioninto our 𝑋 matrix by adding the halo mass of the main progenitorof a galaxy at diﬀerent redshift slices.The Eagle simulation saves information in 19 redshift slicesbetween z=0 and z=4 (McAlpine et al. 2016). However, informationbetween redshift slices that are close to each other is strongly corre-lated as halos have not evolved signiﬁcantly. Keeping this in mindand given that the computational cost of running the minimizer in-creases exponentially with the number of parameters, we only usea subset of the available redshifts. The redshifts slices that we useas inputs are 𝑧 in =[0.18,0.37,0.62,1.0,1.26,1.74,2.48,3.02,3.98].Sparse regression methods work best if variables are indepen-dent, therefore we will use the ratio between the mass at a givenredshift and the mass at redshift zero (so that the correlation of themass at a given redshift and its mas at 𝑧 = 𝑀 𝑧 / 𝑀 and they are deﬁned as: ( 𝑀 𝑧 / 𝑀 ) (cid:48) = log (cid:32) 𝑀 𝑐 ( 𝑧 ) 𝑀 𝑐 ( 𝑧 = ) (cid:33) (22)We then use equation 3 to standardise units and form 𝑀 𝑧 / 𝑀 .An alternative approach to parametrizing halo evolution is theformation time (Lacey & Cole 1994), deﬁned as the time at whicha halo has assembled half of its mass. We generalize this idea todeﬁne ﬁve formation criteria ( FC (cid:48) ) by ﬁnding the redshifts (insteadof times) at which the DM halo has assembled 20%, 30%, 50%,70% and 90% of its mass respectively, the set of all ﬁve formationcriteria for our sample will be referred to as FC (cid:48) 𝑝 , where 𝑝 denotes l o g M C ( z ) M C ( z = ) After CorrectionBefore CorrectionFormation CriteriaRedshift

Figure 5.

The solid black lines show the mass ratios of equation 22 for fourrandomly selected galaxies in our sample. They represent the growth of theDM halo mass. The vertical red dashed lines correspond to value of 1 + 𝑧 for each redshift in 𝑧 in . The column 𝑥 𝑛 of our variables matrix (equation 2)corresponding to the ratio mass at each redshift in 𝑧 in can be visualised asthe intersection of its red line with all the black lines of our galaxy set. Theblue horizontal lines corresponds to a constant mass ratio of log ( . ) ,log ( . ) , log ( . ) , log ( . ) , log ( . ) (from top to bottom).Thecolumn 𝑥 𝑛 of our variables matrix for each formation criteria in FC can bevisualised as the intersection of a blue line and the black lines. the percentage used. Fig. 5 shows a set of horizontal blue linescorresponding to halo mass ratios (equation 22) of 0.9, 0.7, 0.5,0.3 and 0.2, the redshifts at which each formation history curve(black solid lines) intersects this blue horizontal lines is a visualrepresentation of FC (cid:48) 𝑖 . The redshifts that correspond to a givenformation criteria are found by performing a linear interpolationof the halo mass ratios. As with all of our other units a ﬁnal stepis to standarise units into FC 𝑖 using equation 3. We build twoindependent models, the ﬁrst using the mass ratios of 5 and thesecond using our formation criteria FC 𝑖 instead. There is a well known scaling relation, and therefore a correlation,between the angular momentum of a galaxy and their stellar mass(Fall & Romanowsky 2013). It is a matter of discussion how muchof a role does the angular momentum history of a dark matter halosplay on the determination of the speciﬁc angular momentum (sAM)of its hosted galaxy. Zavala et al. (2016) ﬁnds strong correlationsbetween both parameters using the EAGLE simulation. However,Danovich et al. (2015) suggest that the sAM of gas and dark matterundergo diﬀerent formation histories, which would suggest that anycorrelation between them is a by-product of a third correlation withother parameters like the mass formation history. Having this inmind, we will generate models that also include angular momentuminput parameters on top of the mass evolution parameters.Angular momentum evolution is included in our methodology

MNRAS , 1–19 (2020) parse regression modeling of the stellar mass by computing the sAM vector of a halo at each redshift slice. (cid:174) 𝜎 ( 𝑅, 𝑧 ) = (cid:205) 𝑖 𝑚 𝑖 ((cid:174) 𝑟 𝑖 − (cid:174) 𝑟 𝑐 ) × ((cid:174) 𝑣 𝑖 − (cid:174) 𝑣 𝑐 ) (cid:205) 𝑖 𝑚 𝑖 , (23)where (cid:174) 𝑟 𝑖 and (cid:174) 𝑣 𝑖 are the position and velocity vectors of each particleinside a radius 𝑅 around the centre of mass, 𝑚 𝑖 is the mass of theparticle, and (cid:174) 𝑟 𝑐 , (cid:174) 𝑣 𝑐 are the position and velocity of the centre ofmass of the halo. We will use diﬀerent values of 𝑅 in order to capturethe angular momentum evolution of the full halo and of its centreseparately. The values of 𝑅 that are included in our model are 𝑅 𝐶 ,and 𝑅 𝐶 𝑅 𝐶 .The speciﬁc angular momentum parameter deﬁned in equation23 is strongly correlated with the mass of the halo, this is driven bythe scaling relations (cid:174) 𝑟 ∝ 𝑀 / and (cid:174) 𝑣 ∼ √︃ 𝐺𝑀𝑅 ∝ 𝑀 / . In order toavoid having strongly correlated variables in our set, we deﬁne thefollowing speciﬁc angular momentum parameter, ( 𝑆 ( 𝑅, 𝑧 )) (cid:48) = log (| 𝜎 ( 𝑅, 𝑧 )|) −

23 log ( 𝑀 𝐶 / 𝑀 (cid:12) ) , (24)where | (cid:174) 𝜎 ( 𝑅, 𝑧 )| is the norm of (cid:174) 𝜎 ( 𝑅, 𝑧 ) .Given that the Angular momentum is a vector we need twotypes of variables to describe it, one capturing the information onits magnitude and the other one on its direction. Therefore, we willalso include the direction magnitude Θ deﬁned as: ( Θ ( 𝑧 )) (cid:48) = (cid:174) 𝜎 ( 𝑅 𝐶 , 𝑧 ) · (cid:174) 𝜎 ( 𝑅 𝐶 , )| (cid:174) 𝜎 ( 𝑅 𝐶 , 𝑧 )|| (cid:174) 𝜎 ( 𝑅 𝐶 , )| (25)As with all other variables we use equation 3 to standardize units andform the scalars 𝑆 ( 𝑅, 𝑧 ) and Θ ( 𝑧 ) . We form the following library ofsAM parameters for each progenitor halo at each redshift: • 𝑆 ( 𝑅 𝐶 , 𝑧 )• 𝑆 ( 𝑅 𝐶 , 𝑧 )• 𝑆 ( 𝑅 𝐶 , 𝑧 )• Θ ( 𝑧 ) The evolution of our speciﬁc angular momentum parameters hada large amount of statistical noise and so were smoothed acrossdiﬀerent redshifts using a Gaussian Kernel.

Here we present the list of the four models that we build in thiswork:(i)

Mass ratios : The ﬁrst run includes values of the Halo massat redshift zero 𝑀 from equation 21, and the DM ratios 𝑀 𝑧 / 𝑀 from equation 22, that parameterise the DM halo mass evolution.Given that we are using 9 diﬀerent redshift slices this gives a totalof 𝑀 =

10 Input variables. Therefore we will use a total of D=286free parameters (from equation 4).(ii)

Formation criterion : In our second model, the DM ratios arereplaced by the formation criterion FC 𝑖 deﬁned in Section 3.3. Thismodel uses 5 values of 𝐹𝐶 𝑝 (where p=[0.9,0.7,0.5,0.3,0.2])and thehalo mass at redshift zero as input parameters therefore 𝑀 = 𝐷 = Mass ratios and sAM : In this run we add the speciﬁcangular momentum parameters (sAM) 𝑆 ( 𝑅 𝐶 , 𝑧 ) , 𝑆 ( 𝑅 𝐶 , 𝑧 )/ 𝑆 ( 𝑅 𝐶 , 𝑧 )/ Θ ( 𝑧 ) , to the library of free parameters of run (i) Mass ratios . The library of functions contains the linear quadraticand cubic terms of the Halo mass evolution parameters 𝑀 𝑧 / 𝑀 . However only the linear terms of the speciﬁc angular momentum pa-rameters are included. This is because including all of the quadraticand cubic terms would result in 𝐷 = 𝐷 =

326 functions to minimize over.(iv)

Formation criterion and sAM : This run is similar to thelast but we add the terms of the speciﬁc angular momentum (sAM)parameters to the library of functions of run (ii)

Formation cri-terion instead. As with last run we only add the linear terms ofthe speciﬁc angular momentum parameters ending up with D=123parameters over which to minimize (the 84 parameters of run (ii)plus 39 linear sAM terms, one for each parameter).

In this section we present the speciﬁc aspects of applying themethodology presented in Section 2 to the data set from Section 3.Here we present tests of the consistency of the algorithm. The resultsof the runs will be left for Section 5.

The data is randomly divided into two data sets, the training set andthe holdout set . The training set contains 85% of the data and is usedby the algorithm to build the model, the remaining 15% constitutethe holdout set and it is not used until the model is completed. Theﬁnal model is applied to the holdout set to test its accuracy on datathat is not used in the building of the model and therefore is unbiasedto over-ﬁtting.Note that the holdout data set is diﬀerent from the test sets usedfor estimating the optimal value of 𝜆 in the k-fold methodology ofSection 2.4. The latter constitutes sets that are systematically keptout of the minimizations done while exploring the 𝜆 parameter spaceand are used to determine 𝜆 min , they are part of the methodologyfor building our model. The holdout set on the other hand is keptout of this methodology completely and is used to evaluate the ﬁnalmodel once it is built. This section presents tests on the methodology used for optimizingthe hyperparameter 𝜆 . This methodology was introduced in Sec-tion 2.4, and Section 2.5 presents a toy model example. As statedthe optimal value of 𝜆 : 𝜆 min is set using a k-fold method with 𝑘 = Mass ratios from Section 3.5. The green and blue dashedlines show the RMSE as a function of the hyperparameter 𝜆 on thetest and training sets respectively. Each test set (green dashed lines)has around 800 points, and the minimization runs with 286 freeparameters.The green dashed lines show some spread in their amplitudes,which are correlated with their value at 𝜆 =

0. This spread is aconsequence of dividing the subsets randomly. Some subsets willcontain a larger amount of points that are well predicted by themodel and will, therefore, have smaller errors.As we saw with Fig. 2, the RMSE of the training set is smallerwhen 𝜆 ∼ MNRAS000

RMSE k RMSE k + ( RMSE k ) min = 929 Figure 6.

Evolution of the RMSE (equation 17) as a function of 𝜆 for thedata set of run (i) Mass ratios of Section 3.5, which contains 286 inputfunctions. The blue and green dashed lines represent the training and testsets respectively. The solid lines represent the median of these curves. Theblack dots show the minimum of the dashed lines (RMSE 𝑘 ) and the red dotsthe one-standard-error rule correction from equation 18. The red solid linecorresponds to the mean position in 𝜆 of the red dots and is our estimate of 𝜆 min . It is selected by our algorithm at 929. well with the data it used for ﬁtting. In contrast when the model istested on data it has not seen before the RMSE is larger, as shownby the comparatively larger error on the test set. As 𝜆 increases theerror on the test set decreases and eventually, reaches a minimum(RMSE 𝑘 ) around 𝜆 = 𝜆 min from equation 18.Fig. 7 (equivalent to Fig. 3), shows the evolution of the coef-ﬁcients 𝐶 𝑖 from equation 5 as a function of the hyperparamer 𝜆 .These minimisations are done using the data from model (i) Massratios in Section 3.5. The vertical black dashed line shows the valueof 𝜆 min found by our algorithm. Each coloured line correspond to acoeﬃcient that is above the cutout value 𝜈 at 𝜆 min , 𝜈 is representedas the boundary between the white and grey regions of the plot. Theblack dashed lines correspond to the coeﬃcients that are below 𝜈 at 𝜆 min and therefore discarded.Diﬀerent coeﬃcients of 𝐶 𝑖 are ﬁtted by the minimizer withdiﬀerent orders of magnitude . Therefore we need to make sure that Given that our input variables are not Gaussian, several parameter valuesare above one standard deviation. In a standardized space this will mean that | C | ConstantLinearQuadraticCubicChoosen lam=932

Figure 7.

Best ﬁt value of each Coeﬃcient as a function of the hyper-parameter 𝜆 . The coloured lines show the values of the accepted coeﬃcientsand the black dashed lines represent the rejected coeﬃcients. The dashedblack line shows the value of 𝜆 min and the gray shaded area represents theregion bellow the cutout value 𝜈 , therefore all coeﬃcients above the shadedregion at 𝜆 min are accepted and represented by coloured lines. We can seethat there is a distinct separation between the chosen coeﬃcients and thosediscarded. the value of the parameter 𝜖 of equation 10, which determines howclose to zero unnecessary parameters get in the minimization, issuch that discarded coeﬃcients are well below the cutout value 𝜈 ,and are close enough to zero that they can be separated from usefulcoeﬃcients.We use a value of 𝜖 = × − , which, as shown in Fig. 7,corresponds to the minimizer setting unused parameters to a valueas small as ≈ × − . This is the same number that we found in theexample presented in Section 2.5). A very small value of 𝜖 increasesour computational time signiﬁcantly given that parameters need tobe driven further toward zero. Our choice represents a value of 𝜖 that is small enough to get parameters close enough to zero whilenot being so small that the code becomes unfeasible expensive torun. Fig. 8 shows the resulting coeﬃcients after running our fullalgorithm ﬁve times using diﬀerent values of 𝜖 . And using themodel conﬁguration (ii) Formation criterion from Section 3.5, weshow the rest of the example plots on this work in this conﬁgurationas this is the one with less free parameters and therefore the one thatrequires less computational time.The colour lines show the parameters that are above the cutoutvalue in the model built with 𝜖 = × − (our standard value)and represent the variables that where chosen by the algorithm. Theblack lines correspond to the values rejected at 𝜖 = × − . Thecutout value 𝜈 depends on how close parameters get to zero andtherefore it is a function of 𝜖 , for the propose of illustration, we set 𝜈 = 𝜖 . they will be larger than one. As a consequence linear quadratic and cubiccoeﬃcients will require diﬀerent scales to make similar contributions.MNRAS , 1–19 (2020) parse regression modeling of the stellar mass | C | Standard ConstantLinnearQuadraticCubic

Figure 8.

Selected coeﬃcients for ﬁve diﬀerent runs of our methodologyusing diﬀerent values of the 𝜖 parameter (equation 2.3) as the only diﬀerencein the initial setup of the run. This parameter determines how close to zerocoeﬃcients get before their contribution to the penalty is negligible. Thecutoﬀ value 𝜈 is set as 𝜈 = 𝜖 for each run. The black dotted line shows thevalue of 𝜖 used in our standard conﬁguration. When 𝜖 is large all coeﬃcientsare above the cutout value 𝜈 . For lower values of 𝜖 , all rejected coeﬃcientsare well inside the cutout region. The minimisations were done using themodel (ii) Formation criterion from 3.5.

For larger values of 𝜖 there is no clear cut between discardedcoeﬃcients and most of the cubic and quadratic terms that end up inour model. On the opposite end at 𝜖 = × − all rejected coeﬃ-cients are well below the cutout value, as the model sets parameterscloser to zero. While the diﬀerence between useful and useless co-eﬃcients is clearer at 𝜖 = × − , our standard conﬁguration with 𝜖 = − seems to work just as well while being signiﬁcantly fasterto run. Several of our coeﬃcients 𝐶 𝑖 are associated with functions of thesame form but with inputs at diﬀerent redshifts (see Section 3.3).If the mass of halos does not vary signiﬁcantly between contin-uous redshift slices then their corresponding functions 𝑓 𝑖 ( 𝑥 ) canbe correlated between them. 𝑓 𝑖 ( 𝑥 ) are polynomial functions of ourparameters that go up to third order (see Section 2.1). In generaldiﬀerent order combinations of correlated terms will also be corre-lated.Considering the above statements, it is possible that the pa-rameter space 𝐶 𝑖 has several local minima. This could be an issuefor gradient decent algorithms as by construction they will convergeonly toward the closest minimum. In practice we are satisﬁed withany minimum, as we do not have a preference between a featurebeing explained by for example, the mass at one redshift slice or atthe neighbouring slice.This, however, means that there might be slight variations inthe surviving parameters of diﬀerent models. In order to test how Minimization | C i | M FC FC FC FC FC M M × FC M × FC M × FC FC × FC FC × FC M FC FC FC × M FC FC × FC Figure 9.

Values of coeﬃcients 𝐶 𝑖 from equation 5 for diﬀerent minimi-sation runs. The only diﬀerence between each run is the initial point ofthe minimization algorithm. Given that our space is correlated there arediﬀerent local minimum and depending on the starting point the code willﬁnd one or the other.The coloured lines correspond to the parameters thatsurvived in at least one model, the colour coding is used only to diﬀerentiatebetween diﬀerent lines and has no physical interpretation. Blue and greencircles correspond to the coeﬃcients associated with 𝑓 ( (cid:174) 𝑥 ) = 𝑀 × FC and 𝑓 ( (cid:174) 𝑥 ) = 𝑀 × FC in runs three and four respectively, which are shownas an example of correlated variables generating diﬀerent local minimum.The grey area represents the cut-oﬀ value 𝜈 . strong of an eﬀect this has we perform ﬁve diﬀerent minimizationsof conﬁguration (ii) Formation criterion from Section 3.5. We set 𝜆 min = 𝐶 𝑖 is varied to random values between the ﬁveruns and is the only feature that is diﬀerent between runs.Fig. 9 shows the variation of the resulting 𝐶 𝑖 coeﬃcients in theminimisations. All models have equivalent accuracy with an RMSEwithin the range 0 . ± . 𝐶 𝑖 ) are kept constant amongst all runs, similarly there is a largesubset of coeﬃcients that are not necessary in any of the models.However, there is a subset of parameters that are interchangeablebetween diﬀerent models. An example of this is shown by the greenand blue circles, which correspond to the coeﬃcients associatedwith 𝑓 ((cid:174) 𝑥 ) = 𝑀 × FC and 𝑓 ((cid:174) 𝑥 ) = 𝑀 × FC in runs 3 and 4respectively. If we look at the models from runs three and four wecan see that they are very similar in almost every parameter, exceptthat model three gives a very important role to 𝑓 ((cid:174) 𝑥 ) = 𝑀 × FC and almost discards 𝑓 ((cid:174) 𝑥 ) = 𝑀 × FC , while model four does theopposite.To test the variance of our methodology as a whole, we makesix independent runs of our algorithm, varying the data that is keptoutside of the model as the Holdout set. One of the Holdout set is our MNRAS000

Holdout | C i | FC × FC FC × FC M FC FC FC FC M M × FC M × FC M × FC M FC FC FC FC × FC FC FC FC × FC M × FC All Models5/6 Models4/6 Models3/6 Models1/6 Models

Figure 10.

Values of coeﬃcients 𝐶 𝑖 from equation 5 for ﬁve diﬀerent min-imisation runs. In each run a ﬁfth of the training set has been systematicallyleft out of the minimisation, the resulting models select a diﬀerent subsetof surviving coeﬃcients 𝐶 𝑖 (but have similar accuracy). The colour codeshows how often a given coeﬃcient was used as part of the ﬁt. standard set, the other ﬁve correspond to ﬁve independent subsetsof the Test set with the same amount of points that the standardHoldout set (so each of the six independent subsets has 15% of thewhole data set).The algorithms uses conﬁguration (ii) Formation crite-rion from 3.5. The RMSE of the 5 resulting models are[0.167,0.170,0.169,0.162,0.162,0.167] and they have [16, 14, 15,15, 18, 17] surviving coeﬃcients each. Therefore all six modelshave similar accuracy and simplicity.Fig. 10 shows the variations in the resulting 𝐶 𝑖 coeﬃcientsthat survived in at least one of the six models. Solid line representcoeﬃcients that survived in all of our runs and correspond to elevencoeﬃcients, this means than on average two thirds of all coeﬃcientsare the same amongst all models. Of the remaining coeﬃcients twoare present in ﬁve out of six models and two in four out of six. Whichmeans that there are 15 coeﬃcients present in almost all six models,there is also some others that might be correlated with those and aresometimes present but discarded in at least half of the runs. We now present the results of our four runs deﬁned in 3.5, these are:(i)

Mass ratios , (ii)

Formation criterion , (iii)

Mass ratios andsAM and (iv)

Formation criterion and sAM . The speciﬁc surviv-ing coeﬃcients 𝐶 𝑖 selected by each of the models are presented inTable 1.A striking feature of models (iii) Mass ratios and sAM and(iv)

Formation criterion and sAM is that the algorithm does notselect any speciﬁc angular momentum parameters in either of them.In fact the selected parameters of model (ii)

Formation criterion and (iv)

Formation criterion and sAM are almost identical. While there are some small diﬀerences between the coeﬃcients chosen in(i)

Mass ratios , and (iii)

Mass ratios and sAM (Namely model (iii)selects two extra parameters, and the values of some of the commonparameters are slightly diﬀerent). These diﬀerence are consistencewith the variance of the methodology reported in Section 4.3.This indicates that the contribution to the accuracy of the modeldue to including the sAM parameters is negligible in our methodol-ogy: The sparse regression methodology found that no sAM param-eters contributed information that was not already provided usingthe rest of the parameters. This suggests that any correlation be-tween the sAM of a galaxy and the sAM history of its host haloshould be the consequence of a correlation between the mass andsAM formation histories of subhalos.

MNRAS , 1–19 (2020) pa rs e r e g r e ss i on m od e li ngo ft h e s t e ll a r m a ss (i) Mass ratios (ii) Formation criterion (iii) Mass ratios and sAM (iv)Formation criterion and sAM N Coeﬃcient Value Coeﬃcient Value Coeﬃcient Value Coeﬃcient Value1 1 0.12 1 0.2 1 0.139 1 0.22 𝑀 𝑀 𝑀 𝑀 𝑀 . / 𝑀 FC 𝑀 . / 𝑀 FC 𝑀 . / 𝑀 FC 𝑀 . / 𝑀 FC 𝑀 . / 𝑀 FC 𝑀 . / 𝑀 FC 𝑀 / 𝑀 FC 𝑀 / 𝑀 FC 𝑀 . / 𝑀 𝑀 -0.22 𝑀 . / 𝑀 𝑀 -0.228 𝑀 . / 𝑀 𝑀 × FC -0.021 𝑀 . / 𝑀 𝑀 × FC -0.0209 𝑀 . / 𝑀 𝑀 × FC -0.022 𝑀 . / 𝑀 𝑀 × FC -0.02210 𝑀 -0.172 𝑀 × FC -0.03 𝑀 . / 𝑀 𝑀 × FC -0.0311 ( 𝑀 / 𝑀 ) FC × FC -0.044 𝑀 -0.215 FC × FC -0.04312 ( 𝑀 . / 𝑀 ) 𝑀 ( 𝑀 / 𝑀 ) 𝑀 ( 𝑀 . / 𝑀 ) FC ( 𝑀 . / 𝑀 ) FC ( 𝑀 . / 𝑀 ) -0.002 FC ( 𝑀 . / 𝑀 ) FC ( 𝑀 . / 𝑀 ) 𝑀 × FC ( 𝑀 . / 𝑀 ) -0.003 𝑀 × FC 𝑀 × 𝑀 . / 𝑀 -0.017 FC ( 𝑀 . / 𝑀 ) FC 𝑀 × 𝑀 . / 𝑀 -0.011 FC × FC 𝑀 × 𝑀 . / 𝑀 -0.017 FC × FC 𝑀 × 𝑀 . / 𝑀 -0.014 𝑀 × 𝑀 . / 𝑀 -0.01219 𝑀 × 𝑀 . / 𝑀 -0.024 𝑀 × 𝑀 . / 𝑀 -0.01201220 𝑀 × 𝑀 . / 𝑀 -0.004 𝑀 × 𝑀 . / 𝑀 -0.03501221 𝑀 ( 𝑀 ) × 𝑀 . / 𝑀 Table 1.

Parameters selected by each of the four models of Section 3.5. Neither model (iii) Mass ratios and sAM nor (iv)Formation criterion and sAM used any speciﬁc angular momentum (sAM) parameters aspart of their ﬁnal coeﬃcients. Models (ii)Formation criterion and (iv)Formation criterion and sAM are virtually identical with only very minor diﬀerences in some of the parameters. M N R A S , ( ) M. Icaza-Lizaola et al.

Fig. 11 shows the predicted values of the stellar mass comparedto their real values in the EAGLE simulation for all galaxies in theholdout set and for three out of our four models (given that models(ii)

Formation criterion and (iv)

Formation criterion and sAM ended up with the almost identical parameters we only plot one ofthem. The closer a point is to the one-to-one line (black dashedline), the better the model predicted its value. We also include theRMSE (equation 17) of each model.A diﬀerent estimate of the goodness of a ﬁt is the 𝑅 statistic,which determines the amount of the variation in (cid:174) 𝑦 that can beexplained by a model : R = − RMSE 𝜎 𝑦 𝛼 , (26)where 𝜎 𝑦 𝛼 is the standard deviation of (cid:174) 𝑦 . The usefulness of the 𝑅 comes from being intuitive to interpret: the closer to one the 𝑅 ofa model is the more accurate it is.Both the RMSE and 𝑅 statistics show that the three mod-els have very similar accuracy. Model (ii) Formation criterion isslightly simpler than models (i)

Mass ratios and (iii)

Mass ratiosand sAM , as it has 17 free parameters, compared with 22 and 24from the other two. This suggests that the formation criteria param-eters ( FC 𝑖 ), are slightly more eﬃcient at summarising the halo massinformation than the mass ratios ( 𝑀 𝑧 / 𝑀 ) of equation 22. The goal of this work is to make a model that is accurate andalso explainable. With this in mind, we now try to give a physicalinterpretation to some the terms kept in our model.By looking at Table 1 we conclude that in general survivingparameters in all models can be divided into four diﬀerent groups: • (1) Terms forming a third order polynomial of 𝑀 . Namely theterms 1, 𝑀 , 𝑀 , 𝑀 • (2) Terms forming third order polynomials of the other depen-dent variables that are correlated with the mass at 𝑧 >

0. Namely,terms of the form 𝑀 𝑧 / 𝑀 , ( 𝑀 𝑧 / 𝑀 ) and ( 𝑀 𝑧 / 𝑀 ) for models (i) Mass Ratios and (iii) Mass Ratios and sAM . And terms of theform FC 𝑝 , FC 𝑝 and FC 𝑝 for models (ii) Formation Criteria and(iv ) Formation Criteria and sAM • (3) Terms corresponding to the product of 𝑀 and either 𝑀 𝑧 / 𝑀 for models (i) and (iii) or FC 𝑧 for model (ii) and (iv) . • (4) Other terms corresponding to higher order combinationsof crossed terms. It is more challenging to provide a physical inter-pretation of these.The terms in group (1) correspond to a model of the SMHMrelation. Let us call 𝑃 ( 𝑧 = ) to the polynomial built with the termsin group (1) and their associated coeﬃcients. 𝑃 ( 𝑧 = ) = 𝐶 + 𝐶 𝑀 + 𝐶 𝑀 + 𝐶 𝑀 (27)In order to compare our ﬁnal model with the stellar mass we trans-form our model from our standarised units to stellar mass units: 𝑃 (cid:48) ( 𝑧 = ) = 𝑃 ( 𝑧 = ) × 𝜎 ( 𝑀 ∗ ) + 𝜇 ( 𝑀 ∗ ) (28) 𝑅 estimators should be considered with caution as they are easily biasedby inaccurate estimations of 𝜎 𝑦 𝛼 and can have deceivingly small (or large)values. They should be used as reference only. We also include RMSE errorsas goodness of ﬁt estimators, which are far more robust. where as in equation 3 𝜇 and 𝜎 are the mean and standard deviationoperators. 𝑃 (cid:48) ( 𝑧 = ) computed using model (ii) Formation criterion isshown as the black dashed line of the left panel of Fig. 12. The ﬁgureshows that 𝑃 (cid:48) ( 𝑧 = ) is already a good model of the stellar mass,however, there is some scatter around it. We deﬁne the residualbetween each galaxy and 𝑃 (cid:48) ( 𝑧 = ) ﬁt as 𝛿 (cid:48) : 𝛿 (cid:48) = 𝑙𝑜𝑔 ( 𝑀 ∗ ) − 𝑃 (cid:48) ( 𝑧 = ) . (29)Galaxies in Fig. 12 are divided into four 𝛿 (cid:48) bins. The mem-bers of the yellow bin, which is the bin with the largest value of 𝛿 (cid:48) correspond to galaxies for which their stellar masses are the mostunder-predicted by 𝑃 (cid:48) ( 𝑧 = ) , while galaxies in the blue bin corre-spond the the galaxies with the most over predicted stellar masses.The right panel of the ﬁgure shows the average mass of halos ineach of the four bins as a function of redshift. On average galaxiesin the yellow bin live inside host halos that attained their ﬁnal massearly in their evolution when the characteristic density was higher.The deeper potential well of these halos allows the creation of mas-sive galaxies. In contrast, the galaxies in the most over-predicted 𝛿 (cid:48) bin (blue) live inside host halos that only achieved their ﬁnal massvery recently and therefore had a lower characteristic density for aconsiderable period of time, compared to halos of the same massin larger 𝛿 (cid:48) bins. This implies that there is a correlation between 𝛿 (cid:48) and the mass formation history and explains why coeﬃcients ingroup (2) where selected by our model.This conclusion is in agreement with Zentner et al. (2014),where formation time is used to model assembly bias, and withMatthee et al. (2016) where formation time is found to be the mostcorrelated parameter with 𝛿 (cid:48) . We emphasize that we arrive at thisconclusion by using a completely diﬀerent approach, that does notrequire any prior knowledge of the underlying physics correlatingstellar mass with halo mass and formation time.A novel result that our model proposes is that it contains termsof FC 𝑝 for all values in 𝑝 =[20,30,50,70,90] are needed in the ﬁnalﬁt, this suggests that formation time alone is not enough for ourmodel to remove the correlation with 𝛿 (cid:48) , but actually tracking thediﬀerent formation times at which diﬀerent percentages of the ﬁnalhalo mass were formed leads to more accurate models.Our model suggests, that the assembly history dependence isitself a function of the ﬁnal halo mass. In order to explore this, wewrite the polynomial ﬁts to each of the FC 𝑝 terms from (3) , wherep=[20,30,50,70,90], as: 𝑃 ( 𝑝 ) = 𝐶 𝑝 + 𝐶 𝑝 FC 𝑝 + 𝐶 𝑝 FC 𝑝 + 𝐶 𝑝 FC 𝑝 (30)We then deﬁne the residual 𝛿 𝑝 as the leftover residual once wehave removed all terms from groups (1) and (2) : 𝛿 𝑝 = 𝑦 − 𝑃 ( 𝑧 = ) − 𝑃 ( ) − 𝑃 ( ) − 𝑃 ( ) − 𝑃 ( ) − 𝑃 ( ) (31)Fig. 13 shows where galaxies are in the FC vs 𝛿 𝑝 space. Where FC is the redshift when 50% of the mass of a halo has already beenformed. The galaxies are colour coded by the ﬁnal halo mass, 𝑀 .The blue and red solid lines shows the average 𝛿 𝑝 for very massiveand very small halos respectively. We can tell that when the redshift FC is later than the average ( FC <0) galaxies living in massivehost halos tend to be overpredicted by the model (shown by theblue line being above zero) and galaxies living in small halos tendto be under-predicted (shown by the red line being bellow zero).This shows why terms of the form FC 𝑝 × 𝑀 , corresponding tocoeﬃcients in group (3) improve our model. The fact that the modelselected terms of the form FC 𝑝 × 𝑀 suggests that it is not enough MNRAS , 1–19 (2020) parse regression modeling of the stellar mass Model(log ( M * M )) l o g ( M * M ) (i) Mass ratiosRMSE=0.168 8 9 10 11 Model(log ( M * M )) l o g ( M * M ) (ii) Formation criterion R = 0.939 RMSE=0.167 8 9 10 11 Model(log ( M * M )) l o g ( M * M ) (iii) Mass ratios and sAM R = 0.94 RMSE=0.166 Figure 11.

Each coloured dot corresponds to a galaxy in the Holdout set. They represent the value of the stellar mass predicted by the three models comparedto the actual stellar mass measured from the simulation. Left, centre and right panels correspond to models (i)

Mass ratios , (ii)

Formation criterion , (iii)

Mass ratios and sAM from Section 3.5 respectively. The closer each point is to the one-to-one line (black dashed lines) the more accurate they were predictedby the model. The measured value of the RMSE from equation 17 and the 𝑅 statistic of equation 26 are included for comparison. In general the three diﬀerentmodels have equivalent accuracy. Note that as model (iv) Formation criterion and sAM is virtually identical to model (ii)

Formation criterion we onlyincluded one of them. log ( M C ( z = 0) M ) l o g ( M * ) -1.61 < < -0.4-0.4 < < -0.1-0.1 < < 0.30.3 < < 1.32 P ( z = 0) l o g ( M c ( z ) M c ( z = ) ) Figure 12.

The left plot shows the stellar mass/ halo mass relation for galaxies in our Holdout set. The black dashed line shows the polynomial 𝑃 (cid:48) ( 𝑧 = ) from equation 28 computed using model (ii) Formation criterion . This line describes the trend well but there is some scatter around it. The colour codingdivides the galaxies by their residual 𝛿 (cid:48) from equation 29 into four bins. The right plot shows the evolution of the halo mass for each residual bin. The solidlines represent the mean of the mass ratios of equation 22 for all galaxies in each bin. The shaded contours indicate the standard deviation. The ﬁgure showsthat most negative residuals correspond to halos that grew to their ﬁnal mass only recently, while the most positive residuals 𝛿 (cid:48) correspond to halos that growinto their current mass early in their evolution. to model a linear relationship between stellar mass and formationtime (or in our case our formation criteria), one needs to correctthis relation by a factor that is dependent on the ﬁnal halo mass.Assembly bias suggests that the stellar mass of galaxies depends onformation history, our model also suggests that this dependency isin turn dependent on the ﬁnal halo mass. We have shown the models capability to reproduce the stellar massof individual galaxies from the EAGLE simulation. We now discuss our models accuracy at reproducing other statistics from EAGLEsuch as the distribution of galaxy masses, that can be measured witha stellar mass function (SMF), and the clustering of the galaxies,that can be measured with a correlation function.We run our test in six models, one of them is model (ii)

For-mation criterion from Table 1 and the other ﬁve models are builtby systematically leaving a ﬁfth of the training set outside of theminimisation and are the same models used in Section 4.3 (seediscussion of Fig. 10). All of these models are built using the con-ﬁguration (ii)

Formation criterion from 3.5. We now we compare

MNRAS000

MNRAS000 , 1–19 (2020) M. Icaza-Lizaola et al. FC p ( p [ M > 2])( p [ M < 0.8]) 10123456 M Figure 13.

Relation between the redshift at which a halo acquired half ofits mass (in standardised units) FC and the residual 𝛿 𝑝 of equation 31 forall galaxies in our holdout set. The galaxies are colour coded by 𝑀 . Theblue solid line shows the mean residual 𝜇 ( 𝛿 𝑝 ) for galaxies in very massivehalos, i.e. halos where 𝑀 > 𝛿 𝑝 for galaxies living in halos with verylow mass ( 𝑀 < . (3) in the solution.The plot shows that the strength of assembly bias is correlated with the ﬁnalhalo mass. the SMF and the correlation functions of this models to the onesmeasured directly from the EAGLE data set.Furlong et al. (2015) shows that the SMF of the EAGLE hydro-dynamical simulation at redshift zero agrees with the one observedfrom SDSS (Li & White 2009) and GAMA (Baldry et al. 2012).The red dashed line of Fig. 14 shows the central galaxy stellar massfunction obtained from the stellar masses in our EAGLE data set,the green line shows the SMF of a sample that also includes thesatellite galaxies in the host halos of our main sample for compar-ison (as opposed to only the centrals). The red shaded region is anestimate of the error due to Poisson noise within the EAGLE sampleand is computed with the bootstrap method (Efron 1979; Norberget al. 2009).The six blue lines are the SMF computed using the stellarmasses predicted by each of our models. The ﬁgure shows that thefractional diﬀerences are of around ± .

05 the value of the EAGLElog ( SMF ) up to log ( 𝑀 ∗ / 𝑀 (cid:12) ) = ( 𝑀 ∗ / 𝑀 (cid:12) ) = . ( 𝑀 ∗ / 𝑀 (cid:12) ) = .

5. Fig. 12 showsthat at this two values of the stellar masses the scatter of the modelaround its expected value is still not fully symmetric, our modelseems to over-predict the mass of a small but signiﬁcant subsetof galaxies around log ( 𝑀 ∗ / 𝑀 (cid:12) ) = . ( 𝑀 ∗ / 𝑀 (cid:12) ) = .

5. This suggests that the remainingresiduals of our model might be correlated with other parameters not l o g ( d N d l o g ( M ) )( M p c ) Sparse regression modelsEAGLE with subhalosEAGLE log ( M * / M ) l o g ( d N E A G L E d N M o d e l )( M p c ) Figure 14.

SMF of our data set measured from the stellar masses registeredin the EAGLE simulation data-sets (red), and from the stellar mass predictedby six models built with our methodology (blue). The red shaded areas showour estimate of the error in the EAGLE SMF due to Poisson noise and arecomputed using a bootstrap method. The green line shows the SMF of asample that also includes satellite galaxies inside the same host halos of ourmain sample. included in our methodology. These parameters could be either otherHM properties that we did not consider, higher-order correlationsof our input parameters, or artifacts of baryonic processes.The EAGLE hydrodynamical simulation has been shown toaccurately reproduce the two point correlation function of galax-ies from 1 ℎ − Mpc and up to 6 ℎ − Mpc (Artale et al. 2017). Atlarger scales the volume of the simulation becomes too small andthe correlation gets under-predicted. In order to test how well ourmodel reproduces the correlation function of EAGLE we divide thegalaxies in each of our models into four stellar mass bins. We thencompute the two point correlation function of galaxies in each massbin, this is done by assigning to each galaxy the same comovingcoordinates of its host halo.Fig. 15 shows how the correlation functions compares with theone obtained from the EAGLE simulation. Each colour correspondsto a diﬀerent mass bin. The dashed lines show the correlation func-tion computed using the value of the stellar mass from the EAGLEsimulation to divide galaxies into bins. As with the previous plot theshaded areas come from an estimate of the errors using a bootstrapmethod. The correlations functions from each of our six models andfor each stellar mass bin are shown as solid faint lines. The ﬁgureshows that our correlation functions agree within errors with theones from EAGLE, which suggests that our models assign galaxymasses in a way that reproduces the clustering of central galaxiesas a function of mass accurately and up to 10 ℎ − Mpc. It is alsonoticeable that the scatter on the correlation functions from ourmethodology is smaller that the one from bootstrap errors.Hydrodynamical N-body simulations that model both the darkmatter and the baryonic component of the universe are computa-tionally expensive, this limits the volumes in which they can be

MNRAS , 1–19 (2020) parse regression modeling of the stellar mass l o g (( r )) ( M * / M ) < 9.59.5 < log ( M * / M ) < 1010 < log ( M * / M ) < 10.510.5 < log ( M * / M )ModelEAGLE log ( r )[ Mpc ] l o g ( ( r ) M o d e l ( r ) E A G L E ) Figure 15.

Correlation function from EAGLE (dashed lines) in four diﬀerentstellar mass bins compared to those computed with our 6 models (thin solidlines). The spatial position of the galaxies is that of their host halo, and 𝑟 is the distance between galaxy pairs. The errors in the EAGLE correlationfunctions are computed using the bootstrap method. The plot shows thatthe clustering of our model agrees with that of EAGLE within bootstrapuncertainties. computed to a few ( ) . Our models are informed by thephysical processes relating the stellar mass of a galaxy and its hostDM halo. Therefore, by populating DM-only simulations in largevolumes our models could provide new tests of these physics onlarger scales than the ones permitted in hydrodynamical simula-tions. The fact that we can reproduce with accuracy both the stellarmass and the correlation functions of EAGLE, suggests that thisapproach is promising for populating DM only simulations. This issomething that we are interested in exploring in future extensionsof this work. There is a well-known correlation between the stellar mass of agalaxy and the dark matter of its host halo (SMHM relation). How-ever, this relation has signiﬁcant residuals, which suggests that otherproperties are signiﬁcant at determining the stellar mass of a halo.The halo mass evolution history (assembly bias) and the speciﬁcangular momentum have both been proposed to be correlated withthis residuals.We use a sparse regression methodology to model the gov-erning equations relating the stellar mass of central galaxies tothe properties of their host dark matter halos. This method buildsaccurate and explainable models without needing much physicalknowledge of the processes that determine the stellar mass of agalaxy from the halo properties of its host. In sparse regressionmethods, the lack of physical knowledge is substituted by largenumbers of free parameters, where each parameter models diﬀerentbehaviours of the dark matter halo properties. A LASSO algorithmis used to optimize solutions. This method heavily penalizes the number of surviving parameters so that they are as few as possibleare selected without losing accuracy. Here we have modiﬁed theform of the LASSO algorithm to be more eﬃcient when combinedwith a gradient descent minimizer. This is achieved by including aregularisation term that smooths out discontinuities in the gradientthat are present in standard LASSO when parameters are close tozero. This smoothing is characterized by a parameter 𝜖 that limitshow close to zero coeﬃcients need to get before being discarded bythe algorithm. We also modify the method by which the minimizerdecides which path to follow in such a way that we ﬁnd performancegains in large dimensional spaces.The size of the penalty is determined by the parameter 𝜆 , whichis optimized using a k-fold methodology with 𝑘 =

10. We use theone-standard-error rule to select a value of 𝜆 that is larger than thebest-ﬁt and therefore builds a slightly less accurate model with fewerfree parameters and therefore more explainability.The data that we use to build our model comes from the EAGLEsimulation. However, we emphasize that this method should be ableto be calibrated against any dark matter simulation with similarresults. We use a sample of 9521 central galaxies from the 100 cMpcbox EAGLE suite of hydrodynamical simulations. The dark matterproperties are read from a DM only simulation with the same initialconditions and on the same box as our hydrodynamical simulation,both simulations are matched with each other in such a way that apair is found for 99% of the DM halos.We build four diﬀerent models that diﬀer by the independentparameters chosen to model the stellar mass, (i) Mass ratios usesthe ratio between the mass of a halo at a redshift 𝑧 and its massat 𝑧 = Formation Criterion

Uses the redshift at which a halo formed aspeciﬁc percentage of its mass. We include all linear quadratic andcubic correlations of our independent variables as free parametersof the ﬁts.We also run these two models a second time including param-eters related to the speciﬁc angular momentum (sAM) history of thehalos. We include parameters that characterize both the magnitudeand the direction of the sAM vector, we also vary the radius of theDM halo over which to measure the magnitude of the sAM vector.Due to computational restrictions, we only include linear terms ofour sAM free parameters.The computational time of our minimization is correlated withthe value of 𝜖 , a very large value would mean a very fast compu-tational time, but it would be hard to distinguish useful parametersand those that should be discarded. We have shown that a value of 𝜖 = × − selects the same coeﬃcients as slower and more accu-rate runs without being too computationally expensive. Some inputparameters are correlated with each other, for example, the massratio 𝑀 𝑧 at a given redshift and a neighbouring redshift slice. Inprinciple, our answers could be susceptible to the starting point ofthe minimizer; however, we show that neither the explainability northe accuracy of the model changes signiﬁcantly between runs withdiﬀerent starting points. We have also shown that models trained ondiﬀerent subsets of the same data arrive at equivalent models.Our algorithm did not select sAM parameters for either Models(iii) Mass ratios and sAM nor (iv)

Formation criterion and sAM .In fact, all diﬀerences between these models and the equivalentones without sAM parameters are consistent with variations in ourmethodology. This suggests that any correlation between the sAM ofa host halo and the residual of the SMHM relation is the consequenceof correlations between the mass history of the halo and the historyof its sAM. Given that model (ii)

Formation Criterion is slightlysimpler than model (i)

Mass ratios we conclude that the formation

MNRAS000

MNRAS000 , 1–19 (2020) M. Icaza-Lizaola et al. criteria parameters FC 𝑖 are slightly more eﬃcient at summarizingthe halo mass evolution information than the mass ratios 𝑀 𝑧 / 𝑀 .A subset of our surviving terms can be combined into a poly-nomial of 𝑀 and is therefore a model of the SMHM relation. Othersubsets of surviving terms can be combined into polynomials ofeither FC 𝑖 or 𝑀 𝑧 / 𝑀 (depending on the parametrization of thehalo mass evolution history) and therefore model the assembly bias.Terms of the shape 𝑀 × FC i (or 𝑀 × 𝑀 𝑧 / 𝑀 ) add a signiﬁcantcorrection to very small or very large halos. Our models suggestthat a single formation time is not enough to model the variationin the SM-HM relation, and that a better approach is to include thetimes at which diﬀerent percentages of the mass have been formed.This is reﬂected in our model by the similar contribution of terms ofthe form FC 𝑝 for all 𝑝 in 𝑝 = [ , , , , ] . Our model alsosuggests that the relation between the stellar mass and the formationtimes is not the same for all galaxies, but it depends on the halo massat 𝑧 = 𝑀 ∗ except around log ( 𝑀 ∗ / 𝑀 (cid:12) ) = . ( 𝑀 ∗ / 𝑀 (cid:12) ) = . ACKNOWLEDGEMENTS

The data used in this work can be shared if requested from theauthors. The data from the EAGLE simulations has been publiclyreleased, see McAlpine et al. (2016).

REFERENCES

Arfken G., 1985, Mathematical Methods for Physicists, third edn. AcademicPress, Inc., San DiegoArtale M. C., et al., 2017, Monthly Notices of the Royal AstronomicalSociety, 470, 1771Baldry I. K., et al., 2012, MNRAS, 421, 621Behroozi P., et al., 2015, MNRAS, 454, 3020Berger P., Stein G., 2019, MNRAS, 482, 2861Booth C. M., Schaye J., 2009, MNRAS, 398, 53Breiman L., 2001, Machine Learning, 45, 5Brunton S. L., Proctor J. L., Kutz J. N., 2016, Proceedings of the NationalAcademy of Sciences, 113, 3932Bryan S. E., Kay S. T., Duﬀy A. R., Schaye J., Dalla Vecchia C., BoothC. M., 2013, MNRAS, 429, 3316Crain R. A., et al., 2015, MNRAS, 450, 1937Dalla Vecchia C., Schaye J., 2012, MNRAS, 426, 140Danovich M., Dekel A., Hahn O., Ceverino D., Primack J., 2015, MonthlyNotices of the Royal Astronomical Society, 449, 2087–2111Efron B., 1979, Ann. Statist., 7, 1Einasto J., Klypin A. A., Saar E., Shandarin S. F., 1984, MNRAS, 206, 529Fall S. M., Romanowsky A. J., 2013, The Astrophysical Journal, 769, L26Furlong M., et al., 2015, MNRAS, 450, 4486Golub G. H., 1970, Reinsch, C.,Hastie T., Tibshirani R., Wainwright M., 2015, Statistical Learning withSparsity: The Lasso and Generalizations. Chapman & Hall/CRCLacey C., Cole S., 1994, MNRAS, 271, 676Lecun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436Li C., White S. D. M., 2009, MNRAS, 398, 2177Lucie-Smith L., Peiris H. V., Pontzen A., Lochner M., 2018, Monthly Noticesof the Royal Astronomical Society, 479, 3405Matthee J., Schaye J., Crain R. A., Schaller M., Bower R., Theuns T., 2016,Monthly Notices of the Royal Astronomical Society, 465, 2381–2396McAlpine S., et al., 2016, Astronomy and Computing, 15, 72Merritt D., 1983, ApJ, 264, 24More S., van den Bosch F. C., Cacciato M., Skibba R., Mo H. J., Yang X.,2010, Monthly Notices of the Royal Astronomical Society, 410, 210Norberg P., Baugh C. M., Gaztañaga E., Croton D. J., 2009, MNRAS, 396,19Planck Collaboration et al., 2014, A&A, 571, A1Qu Y., et al., 2017, MNRAS, 464, 1659 MNRAS , 1–19 (2020) parse regression modeling of the stellar mass Roberts S., Everson R., 2001, Independent Component Anal-ysis: Principles and Practice. Cambridge University Press,doi:10.1017/CBO9780511624148Rosas-Guevara Y. M., et al., 2015, MNRAS, 454, 1038Schaller M., et al., 2015a, MNRAS, 451, 1247Schaller M., et al., 2015b, Monthly Notices of the Royal AstronomicalSociety, 452, 343Schaller M., Dalla Vecchia C., Schaye J., Bower R. G., Theuns T., CrainR. A., Furlong M., McCarthy I. G., 2015c, Monthly Notices of theRoyal Astronomical Society, 454, 2277Schaye J., Dalla Vecchia C., 2008, MNRAS, 383, 1210Schaye J., et al., 2010, Monthly Notices of the Royal Astronomical Society,402, 1536Schaye J., et al., 2015, MNRAS, 446, 521Springel V., 2005, MNRAS, 364, 1105Springel V., White S. D. M., Tormen G., Kauﬀmann G., 2001, MNRAS,328, 726Springel V., et al., 2005, Nature, 435, 629Tibshirani R., 1996, Journal of the Royal Statistical Society: Series B(Methodological), 58, 267Tibshirani R., Friedman J., 2017, arXiv e-prints, p. arXiv:1712.00484Trayford J. W., Theuns T., Bower R. G., Crain R. A., Lagos C. d. P., SchallerM., Schaye J., 2016, Monthly Notices of the Royal Astronomical Society,460, 3925White S. D. M., Rees M. J., 1978, Monthly Notices of the Royal AstronomicalSociety, 183, 341Wiersma R. P. C., Schaye J., Smith B. D., 2009a, MNRAS, 393, 99Wiersma R. P. C., Schaye J., Theuns T., Dalla Vecchia C., Tornatore L.,2009b, MNRAS, 399, 574Zavala J., et al., 2016, Monthly Notices of the Royal Astronomical Society,460, 4466Zentner A. R., Hearin A. P., van den Bosch F. C., 2014, MNRAS, 443, 3044Zu Y., Mandelbaum R., 2015, Monthly Notices of the Royal AstronomicalSociety, 454, 1161This paper has been typeset from a TEX/L A TEX ﬁle prepared by the author.MNRAS000