MNRAS, 1–18 (2015)    Preprint 20 August 2018    Compiled using MNRAS LaTeX style file v3.0
Machine Learning and Cosmological Simulations I: Semi-Analytical Models
Harshil M. Kamdar,* Matthew J. Turk, and Robert J. Brunner
Department of Physics, University of Illinois, Urbana, IL 61801, USA
Department of Astronomy, University of Illinois, Urbana, IL 61801, USA
Department of Statistics, University of Illinois, Champaign, IL 61820, USA
National Center for Supercomputing Applications, Urbana, IL 61801, USA
Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801, USA
Accepted 2015 October 1. Received 2015 September 30; in original form 2015 July 2
ABSTRACT
We present a new exploratory framework to model galaxy formation and evolution in a hierarchical universe by using machine learning (ML). Our motivations are two-fold: (1) presenting a new, promising technique to study galaxy formation, and (2) quantitatively analyzing the extent of the influence of dark matter halo properties on galaxies in the backdrop of semi-analytical models (SAMs). We use the influential Millennium Simulation and the corresponding Munich SAM to train and test various sophisticated machine learning algorithms (k-Nearest Neighbors, decision trees, random forests and extremely randomized trees). By using only essential dark matter halo physical properties for haloes of M > 10^12 h^-1 M_sun and a partial merger tree, our model predicts the hot gas mass, cold gas mass, bulge mass, total stellar mass, black hole mass and cooling radius at z = 0 for each central galaxy in a dark matter halo for the Millennium run. Our results provide a unique and powerful phenomenological framework to explore the galaxy-halo connection that is built upon SAMs and demonstrably place ML as a promising and computationally efficient tool to study small-scale structure formation.

Key words: galaxies: halo – galaxies: formation – galaxies: evolution – cosmology: theory – large-scale structure of Universe
In recent years, with the introduction of surveys such as SDSS, DES, and LSST, the amount of data available to astronomers has exploded. These massive data sets have enabled astronomers to form and test sophisticated models that explain cosmic structure formation in the universe. Cosmological simulations are a rich subset of these models and have, consequently, also been on the rise; these simulations provide a concrete link between theory and observation. It has been argued that the ΛCDM model (Peebles 1982; Blumenthal et al. 1984; Davis et al. 1985) is as widely accepted as it is today largely due to the emergence of these high-resolution numerical simulations (Springel 2005). However, modeling galaxy formation accurately by using numerical simulations remains an important problem in modern astrophysics, both scientifically and computationally.

* E-mail: [email protected]

The evolution of collisionless dark matter particles at large scales has been studied exhaustively at unprecedentedly high resolutions, given the meteoric rise in computational power and the relative simplicity of these simulations (Springel 2005; Springel et al. 2005; Klypin et al. 2011; Angulo et al. 2012; Skillman et al. 2014). The formation of structure on the scale of galaxies, however, has been incredibly difficult to model (Somerville & Davé 2014); the difficulty arises primarily because baryonic physics at this scale is governed by a wide range of dissipative and/or nonlinear processes, some of which are poorly understood (Kang et al. 2005; Baugh 2006; Somerville & Davé 2014).

Broadly speaking, there are two prevalent techniques used to understand galaxy formation and evolution: semi-analytical modeling (SAM) and simulations that include both hydrodynamics and gravity. The former is a post facto technique that combines dark matter only simulations with approximate physical processes at the scale of a galaxy (Baugh 2006). The SAM used in this work is detailed in Croton et al.
(2006); De Lucia et al. (2006); De Lucia & Blaizot (2007) (hereafter DLB07), and Guo et al. (2011) (hereafter
G11). For a general, exhaustive review of the motivation of SAMs and a comparison of different SAMs, the reader is referred to Baugh (2006); Somerville & Davé (2014) and Knebe et al. (2015). N-body + hydrodynamical simulations (NBHS) evolve baryonic components using fluid dynamics alongside regular dark matter evolution. The biggest advantage of NBHS over SAMs is the self-consistent way in which gaseous interactions are treated in the former. However, NBHS are incredibly computationally expensive to run and also require some approximations at the subgrid level similar to those applied in SAMs. Promising new NBHS are outlined in Vogelsberger et al. (2014) and Schaye et al. (2015). For an extensive comparison of SAMs and NBHS, the reader is referred to Benson et al. (2001); Yoshida et al. (2002); Monaco et al. (2014) and Somerville & Davé (2014).

Dark matter plays an integral role in galaxy formation; broadly speaking, dark matter haloes are 'cradles' of galaxy formation (Baugh 2006). It is well established that gas cools hierarchically in the centers of dark matter haloes through mergers; the evolution of galaxies, however, is dictated by a wide variety of baryonic processes that are discussed later in this paper. While baryonic physics plays a crucial role in the outcome of gaseous interactions, the story always starts with gravitational collapse. However, no simple mapping has been found between the internal dark matter halo properties and the final galaxy properties because of the sheer complexity of the baryonic interactions. For instance, Contreras et al. (2015) perform a systematic study of the relationship between the host halo mass and internal galaxy properties. They conclude that no simple mapping was found between the cold gas mass or the star formation rate and the host halo mass.
The lack of a relatively simple mapping between internal halo properties and the galaxy properties motivates many of the approximations that SAMs and NBHS make. Moreover, the computational costs associated with both standard galaxy formation models are incredibly high. The Illustris simulation (an NBHS) used a total of around 19 million CPU hours to run. SAMs, while significantly faster than NBHS, still require an appreciable amount of computational power. For instance, consider the open source GALACTICUS SAM put forth in Benson (2012); in GALACTICUS, a halo of mass 10^10 M_sun is evolved (with baryonic physics) in around 2 seconds and a halo of mass 10^14 M_sun is evolved in around 1.25 hours. Thus, a very rough order-of-magnitude estimate can be made for the approximate runtime for GALACTICUS. For about 500,000 dark matter haloes, and an average evolution time of approximately 2 minutes (corresponding to about 10^12 M_sun), the time taken for GALACTICUS to build merger trees to z = 0 is O(15,000) CPU hours. The lack of a simple mapping between dark matter haloes and the properties of galaxies, the computational costs associated with the popular galaxy formation models, and the highly nonlinear nature of the problem make galaxy formation incredibly hard to model, leaving room for new exploration.

While SAMs, not limited to DLB07 and G11, have been incredibly successful in reproducing many observations (White & Frenk 1991; Kauffmann et al. 1993; Cole et al. 1994; Somerville & Primack 1999; Cole et al. 2000; Kang et al. 2005; Bower et al. 2006; Monaco et al. 2007; De Lucia & Blaizot 2007; Lagos et al. 2008; Somerville et al. 2008; Weinmann et al. 2010; De La Torre et al. 2011) and produce similar results to NBHS (Benson et al. 2001; Somerville & Davé 2014), there still exist a few deficiencies in the general methodology of SAMs. Most importantly, the degeneracy inherent to most SAMs is concerning (see, e.g., Henriques et al. (2009); Bower et al.
(2010); Neistein & Weinmann (2010)). SAMs (including DLB07 and G11) use simple yet powerful, physically motivated analytical relationships for most processes that play a role in galaxy formation; these processes have several free parameters that are 'tuned' to match up with observations.

An alternative approach to model galaxy formation, one that is physically much more transparent, was employed in Neistein & Weinmann (2010) (hereafter referred to as NW). NW put forth a simple model that includes treatment of feedback, star formation, cooling, smooth accretion, gas stripping in satellite galaxies, and merger-induced starbursts, with one key difference compared to conventional SAMs. In the NW model, the efficiency of each physical process is assumed to depend only on the host halo mass and the redshift, making it a much simpler model than G11, DLB07, and other SAMs. NW produces a population of galaxies with physical properties very similar to those of DLB07 (G11's predecessor). The success of NW raises an interesting question: could we go even further and try to learn more about the halo-galaxy connection using solely the halo environment and merger history, and would we be able to reproduce results found in conventional SAMs? However, attempting to do so is a non-trivial task for several reasons. First, the inputs for the exploratory model aren't exactly clear. Second, the mappings between dark matter halo properties and galaxy properties are incredibly complex, as discussed earlier. NW, G11, and all other SAMs use simplified analytical relationships to capture complex baryonic processes; these relationships have a partial, non-trivial dependence on internal halo properties, but it is not clear how these analytical relationships can be used to build a dark matter-only model to probe galaxy formation and evolution.
Given their non-parametric nature and their ability to successfully model complex phenomena, machine learning algorithms provide an interesting framework to explore this problem.

A variety of statistical techniques, falling under the broad subfield of machine learning (ML), are gaining traction in the physical sciences. The main goal of ML is to build highly efficient, non-parametric algorithms that attempt to learn complex relationships in, and make predictions on, large, high-dimensional data sets. Applications of ML to model highly complex physical models include pattern recognition in meteorological models (Liu & Weisberg 2011), particle identification (Roe et al. 2005), inferring stellar parameters from spectra (Fiorentin et al. 2007), photometric redshift estimation (Carrasco Kind & Brunner 2013) and source classification in photometric surveys (Kim 2015). Machine learning techniques have been shown to be highly effective at picking up complex relationships in high-dimensional data (Witten & Frank 2005; Johnson & Zhang 2011; Graff et al. 2014). As discussed later, we find that ensemble techniques that use a combination of decision trees perform the best in the context of galaxy formation. The relative simplicity of some ML techniques, their high computational efficiency, and their powerful predictive capabilities for complex models make the problem of galaxy formation well suited for machine learning.

In this paper, we present a first exploration into using supervised ML techniques to model galaxy formation. For our analyses, we use the high-resolution Millennium simulation (Springel 2005; Springel et al. 2005) performed by the Virgo Consortium. The Millennium simulation is an extremely influential dark matter simulation that has motivated more than 700 papers in the study of large-scale structure and galaxy evolution.
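As a concrete illustration of the ensemble-of-decision-trees regressors referred to above, the sketch below fits an extremely randomized trees regressor to synthetic halo-like features. The feature construction and the target are invented stand-ins for the real Millennium data used later, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for halo features: log halo mass, spin, velocity
# dispersion and maximum circular velocity (the study itself uses
# Millennium haloes; these distributions are purely illustrative).
n = 5000
log_mhalo = rng.uniform(12.0, 15.0, n)
spin = rng.lognormal(mean=-3.2, sigma=0.5, size=n)
sigma_v = 10 ** (0.33 * log_mhalo - 2.0) * rng.normal(1.0, 0.05, n)
vmax = 1.1 * sigma_v * rng.normal(1.0, 0.05, n)
X = np.column_stack([log_mhalo, spin, sigma_v, vmax])

# Toy 'galaxy property' target: a nonlinear function of the halo
# features plus scatter, standing in for e.g. log stellar mass.
y = 0.3 * log_mhalo + 0.5 * np.log10(sigma_v) + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Extremely randomized trees: an ensemble of decision trees built with
# randomized split thresholds, averaged to reduce variance.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out haloes:", model.score(X_test, y_test))
```

The same fit/predict pattern carries over unchanged when the synthetic arrays are replaced by the extracted Millennium features and SAM outputs.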
The nature of the Millennium simulation makes it ideal for galaxy-scale studies. The internal halo properties and a partial merger history for each dark matter halo at z = 0 in the Millennium simulation are extracted to be used as input features for our algorithms. We use the well-established Munich SAM, G11, for Millennium (Croton et al. 2006; De Lucia et al. 2006; De Lucia & Blaizot 2007; Guo et al. 2011) for our training and testing. We use various ML algorithms to predict the cold gas mass, hot gas mass, stellar mass in the bulge, total stellar mass and the central black hole mass for the central galaxy of each dark matter halo in the Millennium simulation. These components of the mass at z = 0 provide an extensive probe into how effective ML algorithms are at learning baryonic processes prescribed in SAMs by using only dark matter inputs.

It should be emphasized here that absolutely no baryonic processes are explicitly included in our analyses. Only the relevant internal dark matter halo properties (number of dark matter particles, spin, M_crit,200, velocity dispersion σ_v, and maximum circular velocity v_max) and a partial merger history of the dark matter halo in which the galaxy resides are used as the inputs for the algorithms. The results of this study will shed valuable light on the halo-galaxy connection. We can quantitatively, admittedly phenomenologically and not physically, determine the extent of the impact that dark matter has on structure formation at smaller scales in the universe. We can also evaluate how well ML can learn the physical prescriptions that are used in SAMs. To reiterate, our model uses only the 'skeleton' of most galaxy formation models: the merger tree of each dark matter halo.

The paper is organized as follows. In Section 2, we discuss the data extracted from the Millennium simulation and the basics of machine learning.
More specifically, we outline the basics of the Millennium simulation and G11 and our reasons for using the Millennium simulation. We also discuss the basic principles of ML and outline and discuss the best algorithm that was found empirically. In Section 3, we outline our results. In this section, the results for the different components of the central galaxy mass are presented. We also discuss the drawbacks of our model, provide possible explanations for some of the discrepancies in our results, and present an alternative ML approach to correct for these discrepancies. In Section 4, we discuss what our results imply about the halo-galaxy connection and SAMs. Finally, in Section 5, we conclude the paper with a summary of our findings and potential avenues for future research.

In this section, we discuss the data set obtained by using the Millennium simulation and the ML algorithms that were used for the analyses. First, we discuss general details of the Millennium simulation and the cosmogony that was employed in carrying out the simulation. Next, we discuss the SAM used in Millennium (G11) and give a brief overview of how key physical processes are handled in the model. We also discuss our reasons for choosing the Millennium simulation in place of a higher resolution simulation with a more accurate cosmogony; in particular, Millennium's σ_8 value is noticeably off from the most recent Planck results (Planck Collaboration et al. 2015). We also discuss the extraction of the data set and outline the challenges that were faced in constructing it. Finally, we briefly review how ML works and outline the primary algorithms that were used in our analyses.

The data for this project were extracted from the publicly available Millennium simulation (Springel 2005; Springel et al. 2005). The Millennium simulation was run with a custom version of GADGET-2, using the Tree-PM method (Xu 1994) to handle gravitational interactions.
The Millennium halo catalogs were generated by using a friends-of-friends (FoF) algorithm with a linking length of 0.2 times the mean dark matter particle separation. The Millennium simulation was run with 2160^3 dark matter particles in a 500 h^-1 Mpc box from z = 127 to z = 0. The mass of each dark matter particle is 8.6 × 10^8 M_sun h^-1 and the smallest subhalo has at least 20 particles. The cosmological model employed in the Millennium simulation has Ω_m = 0.25, Ω_b = 0.045, Ω_Λ = 0.75, h = 0.73, n_s = 1 and σ_8 = 0.9, where the Hubble constant is parametrized as H_0 = 100 h km s^-1 Mpc^-1.

The raw simulation was sampled in the form of snapshots 64 times, with FoF group catalogs and their substructures identified by using SUBFIND, which is discussed in Springel et al. (2001). With SUBFIND, each FoF group is decomposed into a set of subhaloes by identifying locally overdense, gravitationally bound regions. The merger tree organization of the dark matter haloes is shown in Figure 11 of Springel et al. (2005).

The SAM used to populate the dark matter haloes in Millennium with galaxies is described extensively in Springel et al. (2005); Croton et al. (2006); De Lucia et al. (2006); De Lucia & Blaizot (2007); Guo et al. (2011). We only provide a brief overview here. G11 includes ingredients and methodologies originally introduced by White & Frenk (1991) and later refined by Springel et al. (2001); De Lucia et al. (2004); Croton et al. (2006). G11, like most SAMs, has simple, yet physically powerful prescriptions for gas cooling, star formation, supernova feedback, galaxy mergers, and chemical enrichment that are tuned by using observational data. G11 uses the Chabrier (2003) IMF. Additionally, G11 takes into account the growth and activity of central black holes and their effect on suppressing the cooling and star formation in massive haloes. Morphological transformation of galaxies and processes of metal enrichment are also modeled. For a
more thorough description of the physical prescriptions used in G11, the reader is referred to the set of papers referenced above. Liu et al. (2010); De La Torre et al. (2011); Cucciati et al. (2012) show where DLB07 (G11's predecessor) agrees with observational data and also show some weaknesses of DLB07. Furthermore, Knebe et al. (2015) discuss how L-GALAXIES (the code behind G11; Henriques et al. (2013)) and DLB07 perform against other recent SAMs (e.g. Galacticus, GalICS, etc.). In this section, we provide an overview of how G11 handles the cold gas mass (2.2.1), central black hole mass (2.2.2), total stellar mass (2.2.3), bulge mass (2.2.4) and the hot gas mass (2.2.5).
We outlined in the introduction how the dark matter merger history forms the 'skeleton' of SAMs. However, the structure of dark matter haloes and their internal properties are also important in determining the rate at which gas cools and the dynamics of the galaxies in the halo (Baugh 2006). The cooling of gas in G11 is computed using the growth of the cooling radius r_cool as defined in Croton et al. (2006); Guo et al. (2011), which describes the maximum radius at which the hot gas density is still high enough for cooling to occur within the halo dynamical time t_h, following the simple model presented in Springel et al. (2001). In G11, it is assumed that infalling gas is shock-heated to the virial temperature (T_vir) of the dark matter halo at an accretion shock. The cooling time (t_cool) at each radius is given by:

t_{\rm cool} = \frac{3}{2} \frac{\bar{\mu} m_{\rm p} k T}{\rho_{\rm hot}(r)\,\Lambda(T_{\rm hot}, Z_{\rm hot})} \quad (1)

Here, \bar{\mu} m_{\rm p} denotes the mean particle mass, k is the Boltzmann constant, \rho_{\rm hot}(r) is the hot gas density, and \Lambda(T_{\rm hot}, Z_{\rm hot}) is the cooling function, which depends on temperature and metallicity (Sutherland & Dopita 1993). T_{\rm hot} is given by:

T_{\rm hot} = 35.9 \left(\frac{V_{\rm vir}}{{\rm km\,s^{-1}}}\right)^{2}\,{\rm K} \quad (2)

The cooling radius is defined as the point where the local cooling time is equal to t_h, where:

t_{h} = \frac{R_{\rm vir}}{V_{\rm vir}} = 0.1\,H(z)^{-1} \quad (3)

Therefore, the cooling radius can be written as:

r_{\rm cool} = \left[\frac{t_{\rm dyn}\,m_{\rm hot}\,\Lambda(T_{\rm hot}, Z_{\rm hot})}{6\pi\,\bar{\mu} m_{\rm p}\,k T_{\rm vir}\,R_{\rm vir}}\right]^{1/2} \quad (4)

The cooling rate can then be written through a simple continuity equation, assuming an isothermal distribution:

\dot{M}_{\rm cool} = \frac{1}{2}\,M_{\rm hot}\,\frac{r_{\rm cool}\,V_{\rm vir}}{R_{\rm vir}^{2}} \quad (5)

A major modification to the cooling rate in Equation 5 comes about through 'radio mode' AGN feedback. AGN feedback becomes especially important in haloes with larger masses. G11 follows the prescription laid out in Croton et al. (2006) for the suppression of this cooling rate; the modified rate is given by:

\dot{M}'_{\rm cool} = \dot{M}_{\rm cool} - \frac{\dot{E}_{\rm radio}}{\frac{1}{2} V_{\rm vir}^{2}} \quad (6)

where:

\dot{E}_{\rm radio} = 0.1\,\dot{M}_{\rm BH}\,c^{2} \quad (7)

\dot{M}_{\rm BH} = \kappa \left(\frac{f_{\rm hot}}{0.1}\right) \left(\frac{V_{\rm vir}}{200\,{\rm km\,s^{-1}}}\right)^{3} \left(\frac{M_{\rm BH}}{10^{8}\,M_{\odot}\,h^{-1}}\right) M_{\odot}\,{\rm yr^{-1}} \quad (8)

Here, f_{\rm hot} for a main subhalo is given by the ratio of the hot gas mass to the subhalo mass (M_{\rm hot}/M_{\rm DM}) and \kappa sets the efficiency of the accretion of the hot gas. A more detailed explanation can be found in Guo et al. (2011).

The rate of gas cooling is an integral part of galaxy formation because it determines the rate at which stars form in a galaxy (Baugh 2006). As we can see, G11 has a complex recipe to incorporate gas cooling and, while we can see some halo dependence in Equations 5 and 6, this is not a trivial mapping. In the NW model discussed in the introduction, because of the complexity inherent to gas cooling, the coefficient for the cooling rate was empirically determined by running DLB07 on the milli-Millennium simulation, since no fitting function in terms of the host halo mass and redshift was found.

G11, like Croton et al. (2006), splits AGN activity into 'quasar' mode and 'radio' mode. The formation and evolution of the black hole is dominated by the 'quasar' mode feedback, where the central black hole grows through major and/or gas-rich mergers. During a merger, the central black hole of the larger progenitor absorbs the minor progenitor's black hole, and cold gas is accreted onto the central black hole. The evolution of the black hole mass through 'quasar' mode feedback is given by:

\delta M_{\rm BH} = M_{\rm BH,min} + f \left(\frac{M_{\rm minor}}{M_{\rm major}}\right) \frac{M_{\rm cold}}{1 + \left(280\,{\rm km\,s^{-1}}/V_{\rm vir}\right)^{2}} \quad (9)

Here, M_{\rm BH,min} is the mass of the black hole in the minor progenitor, f is a free parameter that is set to 0.03 to reproduce the observed BH mass-bulge mass relation, M_{\rm minor} and M_{\rm major} are the total baryonic masses of the minor and major progenitors and M_{\rm cold} is the total cold gas mass.

G11 follows the Kauffmann (1996) recipe in assuming that star formation is proportional to the mass in cold gas above a certain threshold.
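Before turning to the star formation threshold, the cooling and radio-mode suppression recipe above (Equations 1 to 8) can be sketched in code. This is a schematic illustration with toy units, a constant cooling function, and function names of our own choosing, not the published implementation:

```python
import math

# Schematic sketch of the G11/Croton cooling chain (Eqs. 1-8).
# All quantities are in arbitrary consistent units; Lambda_cool is a
# constant stand-in for the tabulated cooling function Lambda(T, Z).

def t_dyn(H_z):
    """Halo dynamical time t_h = R_vir / V_vir = 0.1 H(z)^-1 (Eq. 3)."""
    return 0.1 / H_z

def r_cool(m_hot, Lambda_cool, T_vir, R_vir, mu_mp, k_B, H_z):
    """Cooling radius for an isothermal hot-gas profile (Eq. 4)."""
    return math.sqrt(t_dyn(H_z) * m_hot * Lambda_cool /
                     (6.0 * math.pi * mu_mp * k_B * T_vir * R_vir))

def mdot_cool(m_hot, rc, V_vir, R_vir):
    """Continuity-equation cooling rate (Eq. 5), with the cooling
    radius capped at the virial radius."""
    rc = min(rc, R_vir)
    return 0.5 * m_hot * rc * V_vir / R_vir**2

def mdot_cool_suppressed(mdot, mdot_bh, V_vir, c=3.0e8):
    """'Radio mode' suppression (Eqs. 6-7): subtract the heating rate
    Edot_radio / (0.5 V_vir^2), with Edot_radio = 0.1 Mdot_BH c^2,
    never letting the net cooling rate go negative."""
    edot_radio = 0.1 * mdot_bh * c**2
    return max(mdot - edot_radio / (0.5 * V_vir**2), 0.0)
```

With no black hole accretion the suppressed rate reduces to Equation 5; as the accretion rate grows, cooling shuts off entirely, which is how massive haloes quench in this recipe.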
A threshold surface density:

\Sigma_{\rm crit} = 120 \left(\frac{V_{\rm vir}}{200\,{\rm km\,s^{-1}}}\right) \left(\frac{R}{{\rm kpc}}\right)^{-1} M_{\odot}\,{\rm pc^{-2}} \quad (10)

is set for cold gas, below which stars do not form and above which stars do form (Kennicutt 1998). Furthermore, it is assumed that this cold gas mass is distributed uniformly over the disk, giving us the following for the critical mass:

M_{\rm crit} = 11.5 \times 10^{9} \left(\frac{V_{\rm max}}{200\,{\rm km\,s^{-1}}}\right) \left(\frac{r_{\rm disk}}{10\,{\rm kpc}}\right) M_{\odot} \quad (11)

Therefore, when the mass of the cold gas in the galaxy is greater than M_{\rm crit}, stars form at the following rate per unit time:

\dot{M}_{\star} = \alpha_{\rm sf}\,\frac{M_{\rm cold} - M_{\rm crit}}{t_{\rm dyn,disk}} \quad (12)

where the disk dynamical time is given by t_{\rm dyn,disk} = r_{\rm disk}/V_{\rm vir} and \alpha_{\rm sf} is the star formation efficiency, which is manually set between 5 and 15 per cent. In the NW model, a modified star formation rate is used:

\dot{M}_{\star} = f_{\rm s}\,(M_{\rm cold} - M_{\rm crit}) \quad (13)

where f_{\rm s} is the star formation efficiency and has units of Gyr^{-1}. NW found an analytic fitting function for f_{\rm s} that is a power law in the host halo mass and time, with a time exponent that depends logarithmically on the halo mass; schematically,

f_{\rm s} = A\,M_{h}^{\alpha}\,t^{\,\beta + \gamma \log M_{h}} \quad (14)

and M_{\rm crit} was likewise parameterized in terms of the host halo mass and time as:

M_{\rm crit} = f_{\rm s}^{-1}\,B\,M_{h}^{\alpha'}\,t^{\,\beta'} \quad (15)

(the numerical values of the fitted coefficients are given in NW). M_{h} has units of M_{\odot} h^{-1} and t is in Gyr. The results presented in Neistein & Weinmann for stellar mass show good agreement with DLB07 results. The above parametrization of the star formation efficiency and M_{\rm crit} in terms of solely the host halo mass and time, and its success, offer motivation for our study.

In G11, bulge growth is modeled in three ways: minor mergers, major mergers and disk instabilities. In the case of minor mergers (i.e. a satellite merging with a central galaxy), the total stellar mass of the satellite galaxy is added to the bulge of the central galaxy and the disk of the larger progenitor remains intact.
The cold gas of the satellite galaxy is added to the disk of the central galaxy, and a fraction of the combined cold gas from both galaxies is turned into stars as a result of the merger. In the case of major mergers, the disks of both merging galaxies are destroyed to form a spheroid, to which the combined stellar mass of the two progenitors is assigned.

G11 uses energy conservation and the virial theorem to calculate the change in size:
C \frac{G M_{\rm new,bulge}^{2}}{R_{\rm new,bulge}} = C \frac{G M_{1}^{2}}{R_{1}} + C \frac{G M_{2}^{2}}{R_{2}} + \alpha \frac{G M_{1} M_{2}}{R_{1} + R_{2}} \quad (16)

Here, C is a parameter relating the binding energy of a galaxy to its mass and radius, and \alpha is a parameter signifying the effective interaction energy in the stellar components. Furthermore, in G11 a prescription for bulge formation through disk instability is also included. Following the framework set up in Mo et al. (1998), it is assumed that a stellar disk becomes unstable when:

\frac{V_{\rm c}}{\left(G m_{\rm d}/r_{\rm d}\right)^{1/2}} \leq 1 \quad (17)

where V_{\rm c} is approximated as V_{\rm vir}. At each time step this criterion is checked, and if the disk is unstable, some stellar mass is transferred from the disk to the bulge until stability is restored.

The total hot gas mass is a result of various physical processes. If we ignore supernova feedback and gas stripping, we obtain the following amount of hot gas available for cooling for a halo at each snapshot:

M_{\rm hot} = f_{b} M_{\rm vir} - \sum_{i} M^{(i)}_{\rm cold} \quad (18)

Here, the cold gas mass is summed over all the galaxies in the FoF group and f_{b} is the universal baryon fraction, given by 0.17.

Supernova feedback is the main source of reheating incorporated in G11. Supernova feedback, based on Martin (1999), is modeled in G11 as:

\delta M_{\rm reheat} = \epsilon_{\rm disk} \times \delta M_{\star} \quad (19)

where \delta M_{\star} is the stellar mass of the newly formed stars over some finite time interval. Unlike DLB07, G11 has a variable \epsilon_{\rm disk}, modeled as follows:

\epsilon_{\rm disk} = \epsilon \times \left[0.5 + \left(\frac{V_{\rm max}}{70\,{\rm km\,s^{-1}}}\right)^{-\beta}\right] \quad (20)

where both \epsilon and \beta are free parameters which were set in G11 based on the observed stellar mass function.

In most SAMs, when a merger happens, all the hot gas is assumed to be transferred instantaneously from the smaller halo to the larger halo; however, this rapid transfer has been shown to cause a rapid decline in star formation (Baldry et al. 2006; Wang et al. 2007).
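The star formation threshold (Equations 10 to 12) and the supernova reheating recipe (Equations 19 and 20) can be sketched together as below. The parameter defaults (alpha_sf, eps, beta) are illustrative placeholders rather than the tuned G11 values, and the unit handling is schematic:

```python
def m_crit(v_max, r_disk):
    """Critical cold gas mass (Eq. 11), in Msun, with v_max in km/s
    and r_disk in kpc."""
    return 11.5e9 * (v_max / 200.0) * (r_disk / 10.0)

def star_formation_rate(m_cold, v_max, r_disk, v_vir, alpha_sf=0.1):
    """Eq. 12: alpha_sf (M_cold - M_crit) / t_dyn,disk, with
    t_dyn,disk = r_disk / V_vir; zero below the threshold."""
    mc = m_crit(v_max, r_disk)
    if m_cold <= mc:
        return 0.0
    return alpha_sf * (m_cold - mc) / (r_disk / v_vir)

def epsilon_disk(v_max, eps=2.0, beta=3.5):
    """Variable supernova feedback efficiency (Eq. 20), with the
    70 km/s pivot from the text and placeholder eps, beta."""
    return eps * (0.5 + (v_max / 70.0) ** (-beta))

def delta_m_reheat(delta_m_star, v_max):
    """Reheated mass from newly formed stars (Eq. 19)."""
    return epsilon_disk(v_max) * delta_m_star
```

The steep negative power of V_max in epsilon_disk is what makes supernova feedback far more efficient at ejecting gas from low-mass haloes than from massive ones.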
G11 implements a gas stripping model that includes both instantaneous stripping and more gradual tidal stripping in its treatment.

There are two reasons why we chose to use the Millennium simulation over more recent, higher-resolution simulations with a more accurate ΛCDM cosmogony. First, Millennium is a state-of-the-art cosmological simulation that is uniquely linked to the concurrent development and refinement of two cutting-edge SAMs (Croton et al. 2006; Bower et al. 2006). Second, and perhaps more important, the Millennium simulation provides a readily accessible data set. The publicly available simulation data enables reproducibility and consistency. In the growing climate of scientific reproducibility, this approach is becoming increasingly important; we follow this trend and release all our data and code at: https://github.com/ProfessorBrunner/ml-sims. Next, we describe how the data set was obtained.

We used the online SQL database hosted by GAVO (Lemson et al. 2006) to construct our data set. Using the queryable SQL database for the Millennium Simulation, we extracted 365,361 dark matter haloes at z = 0. Only dark matter haloes with masses larger than 10^12 h^-1 M_sun were used in our analysis. For each dark matter halo, we extracted the following physical properties: number of dark matter particles (N), spin, M_crit,200, maximum circular velocity (v_max), and velocity dispersion (σ_v). For haloes at z = 0, we also include the virial mass M_virial, the half-mass radius R_half, virial velocity v_virial, virial radius r_virial and r_crit,200. Furthermore, we extracted the cold gas mass, total stellar mass, stellar mass in the bulge, central black hole mass and the hot
gas mass of the primary galaxy for each dark matter halo and matched them likewise. As discussed in the introduction, the merger history of each dark matter halo plays an important role in how SAMs populate dark matter haloes with galaxies. However, sampling the merger history sufficiently and translating that into a well-defined set of inputs for our ML algorithms turns out to be a difficult task.

The naive way to extract the merger history would be to take all the progenitor haloes of each dark matter halo and use the internal properties of each progenitor halo and the current halo as the inputs to our ML algorithms. However, in the Millennium simulation, some massive dark matter haloes have tens of progenitors just one snapshot back, potentially making the number of inputs in the hundreds. Moreover, going just one snapshot back is not sufficient to truly capture the merger history of a dark matter halo. Consequently, going to higher redshifts would easily result in the number of inputs being in the thousands. Our algorithms' runtimes are directly dependent on the dimension of the input data, and, therefore, employing the naive approach would severely impact the efficiency of our ML algorithms.

The merger tree for the Millennium simulation includes a descendantID and a firstProgenitorID for each halo. We are left with two ways to sample the merger history while retaining computational efficiency. We can either go top-down (i.e., start at a high redshift and proceed to z = 0) by using the descendantID, or we can go bottom-up (i.e., start at z = 0 and proceed to higher redshifts) by using the firstProgenitorID. In the Millennium documentation, the first progenitor is defined as 'the main progenitor of the subhalo'; the first progenitor is simply the most massive of each dark matter halo's progenitors, and thus the firstProgenitorID tracks the main branch of a merger tree. We chose the latter approach for two reasons.
First, the top-down approach results in fewer haloes, and consequently fewer galaxies, being examined at z = 0. Second, the 'main progenitor' approach encodes information about the main branch of each halo's progenitor tree, implying that it may provide more valuable information about the dark matter halo's internal history. Using the firstProgenitorID from z = 0 to z = 5.724 (from snapshot 63 to snapshot 19), we extract the five physical properties mentioned above for 365,361 DM haloes.

Dark matter haloes below 10 M⊙ were not considered in our analyses because the computational cost associated with our technique (given how far back we go in the merger tree) would rise significantly: the number of haloes between just 10 and 10 M⊙ is a factor of O(100) to O(1000) greater than what we have considered in this work. This amounted to a tradeoff between the range of masses explored and how much deeper we can go into the merger tree. Since the main point of the paper is to broadly explore the applicability of ML in reproducing a reasonable population of galaxies, we decided against including haloes of lower masses. However, lower mass haloes are included in our analyses in the next paper in the series (Kamdar et al., submitted), where we apply machine learning to N-body + hydrodynamical simulations. We show there that when haloes of all masses are considered, there is more scatter at lower masses, but a reasonably high amount of information is recovered from the dark matter halo properties using machine learning throughout.

http://gavo.mpa-garching.mpg.de/Millennium/Help/mergertrees

In this subsection, we briefly discuss the basics of ML and outline the best performing algorithms.
Machine learning is a bustling field in computer science, with a wide variety of applications in a number of other areas. The basic idea of ML algorithms is to 'learn' relationships between the input data and the output data without any explicit analytical prescription being used. Supervised learning techniques are provided some training data (X, y) and try to learn the mapping G(X → y) in order to apply this mapping to the test data.

Machine learning has been applied to several subfields in astronomy with a lot of success; see, for example, Ball & Brunner (2010) and Ivezić et al. (2014) for reviews of the applications of machine learning to astronomy. A majority of the applications of ML in astronomy have either been classification problems, such as star-galaxy classification (Ball et al. 2006; Kim 2015) and galaxy morphology classification (Banerji et al. 2010; Dieleman et al. 2015), or regression applications, such as photometric redshift estimation (Ball et al. 2007; Gerdes et al. 2010; Kind & Brunner 2013) and estimation of stellar atmospheric parameters (Fiorentin et al. 2007).

To the best of our knowledge, however, only a few studies have applied ML to the problem of galaxy formation and the galaxy-halo connection. Xu et al. (2013), which inspired this paper, predicted the number of galaxies in a dark matter halo to create mock galaxy catalogs; they used k-Nearest Neighbors (kNN) and Support Vector Machines (SVM) to obtain promising results. Furthermore, Ntampaka et al. (2015) used machine learning for dynamical mass measurements of galaxy clusters, also showing promise. Given their non-parametric nature and incredibly powerful predictive capabilities, machine learning algorithms provide an attractive and intriguing way to study galaxy formation and evolution.

For our study, we used a variety of machine learning algorithms: kNN, decision trees, random forests and extremely randomized trees. To quantify how well the algorithms are doing at learning relationships in the data, we use the mean squared error (MSE) metric. The MSE is defined as:

$$\mathrm{MSE} = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} \left( X_i^{\rm test} - X_i^{\rm predicted} \right)^2 \quad (21)$$

Here, $X_i^{\rm test}$ is the $i$th value of the actual test set and $X_i^{\rm predicted}$ is the $i$th value of the predicted set.
Furthermore, to gauge the relative performance of the algorithms, we also introduce the base MSE, following in the footsteps of Xu et al. (2013), defined as:

$$\mathrm{MSE}_b = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} \left( X_i^{\rm test} - X_{\rm mean,train} \right)^2 \quad (22)$$

Here, $X_{\rm mean,train}$ is, as the name suggests, the mean of the training data set. MSE_b is an extremely naive prediction of the error, since each test point is simply predicted as the mean of the training data set. MSE_b will serve as an extremely useful metric when we want to measure the relative performance of our machine learning algorithms, and the factor MSE_b/MSE will quantitatively show how good our model is at minimizing error. The lower the MSE and, consequently, the higher the factor MSE_b/MSE, the more robust the prediction.
Furthermore, we will also be using the following two metrics to check the robustness of the prediction: the Pearson correlation and the coefficient of determination ('regression score function'). The Pearson correlation is defined as

$$\rho = \frac{\mathrm{cov}(X_{\rm predicted}, X_{\rm test})}{\sigma_{X_{\rm predicted}}\, \sigma_{X_{\rm test}}} \quad (23)$$

and the coefficient of determination is defined as:

$$R^2 = 1 - \frac{\sum_i (X_i^{\rm test} - X_i^{\rm predicted})^2}{\sum_i (X_i^{\rm test} - X_{\rm mean,train})^2} \quad (24)$$

A higher ρ and R^2 imply a more robust prediction. As shown in the results section, extremely randomized trees and random forests consistently outperform the other algorithms; consequently, we now briefly review the basics of these two techniques.
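In code, the four metrics of Equations 21 through 24 can be computed directly. The sketch below assumes 1-D NumPy arrays; the function and argument names are ours for illustration, not anything from the paper's pipeline:

```python
import numpy as np

def evaluation_metrics(y_test, y_pred, y_train_mean):
    """Compute the metrics of Equations 21-24 (illustrative sketch)."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Eq. 21: mean squared error of the prediction.
    mse = np.mean((y_test - y_pred) ** 2)

    # Eq. 22: base MSE -- predict every test point as the training-set mean.
    mse_b = np.mean((y_test - y_train_mean) ** 2)

    # Eq. 23: Pearson correlation between predicted and true values.
    rho = np.corrcoef(y_pred, y_test)[0, 1]

    # Eq. 24: coefficient of determination, normalized by the training mean
    # (note R^2 = 1 - MSE/MSE_b with these definitions).
    r2 = 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_train_mean) ** 2)

    return mse, mse_b, mse_b / mse, rho, r2
```

With these definitions the factor MSE_b/MSE and R^2 carry the same information, since R^2 = 1 − MSE/MSE_b.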
Extremely randomized trees (ERT) is an ensemble learning technique that builds upon the widely used decision tree (for the purposes of regression, decision trees are called regression trees). Therefore, to understand how ERT works, we must first discuss the fundamentals of regression trees. What follows is only a brief overview of both techniques; for a more comprehensive account, the reader is referred to Breiman et al. (1984) and Geurts et al. (2006). Regression trees follow a relatively simple procedure:

• Step 1:
Construct a node containing all the data points and compute $m_c$ and $S$, where $m_c$ and $S$ are defined as:

$$m_c = \frac{1}{n_c} \sum_{i \in c} z_i \quad (25)$$

$$S(M) = \sum_{c \in \mathrm{leaves}(M)} \sum_{i \in c} (z_i - \bar{z}_c)^2 \quad (26)$$

Here, c ranges over the branches of dimension M, $z_i$ gives the target value of each point i on branch c and $\bar{z}_c$ gives the mean target value on that branch c. We can, therefore, rewrite Equation 26 as:

$$S(M) = \sum_{c \in \mathrm{leaves}(M)} n_c V_c \quad (27)$$

$S(M)$ signifies the sum of the squared errors for some node M, where $n_c$ is the number of samples in a leaf c and $V_c$ is the variance in leaf c.

• Step 2:
If all the points in the node have the same value for all the input variables, we stop the algorithm. Otherwise, we scan over all splits of all variables to find the one that will reduce S(M) as much as possible. If the largest decrease in S(M) is less than some threshold ε, we stop the algorithm. Otherwise, we take that split, creating new nodes along the specified dimension.

• Step 3:
Go to Step 1.

Regression trees are usually considered to be weak learners. A technique that is used to turn these weak learners into strong ones involves building an ensemble of weak learners. In the context of regression trees, there are two popular ensemble methods: boosting and randomization. Since we have a multidimensional output, we focus on randomized ensemble techniques. The essence of ERT is to build an ensemble of regression trees where both the attribute and the split-point choice are randomized while splitting a tree node. We provide pseudocode for the full algorithm in Table 1, which closely follows the algorithm outlined in Geurts et al. (2006). In the algorithm, the Score is the relative reduction in the variance. For the two subtrees $S_l$ and $S_r$ corresponding to the split $s^\star$, the Score($s^\star$, S) (abbreviated to Sc($s^\star$, S)) is given by:

$$\mathrm{Sc}(s^\star, S) = \frac{\mathrm{var}(y, S) - \frac{|S_l|}{|S|}\mathrm{var}(y, S_l) - \frac{|S_r|}{|S|}\mathrm{var}(y, S_r)}{\mathrm{var}(y, S)} \quad (28)$$

The estimates produced by the M trees in the ERT ensemble are finally combined by averaging y over all trees in the ensemble. The use of the original training data set in place of a bootstrap sample (as is done for random forests) is intended to minimize bias in the prediction, while the use of both randomization and averaging is aimed at reducing the variance of the prediction (Geurts et al. 2006).

The methodology of random forests (Breiman 2001, RF) is very similar to that of extremely randomized trees. There are two central algorithmic differences between the two methods. First, RF uses a bootstrap replica (Breiman 1996), which consists of selecting a random sample with replacement from the training data (X, y); ERT, on the other hand, uses the original training data set. Second, ERT chooses the split randomly from the range of values in the sample at each split, whereas RF tries to determine the best split at each internal node. We briefly outline the basic steps of the algorithm in Table 2.

For our analyses, we used the implementations of ERT and RF provided in the Python library scikit-learn (Pedregosa et al. 2011). The parameters we used and the runtimes of the techniques for this problem are discussed in the results section. ERT tends to be faster than RF because of the randomization in finding the split, which reduces the training time. The reduced training time lets us build a bigger ensemble of trees and explains why our ERT results are, generally, marginally better than RF's.
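The practical difference between the two scikit-learn estimators can be illustrated on synthetic data. The toy features, targets and hyperparameter values below are our own illustrations, not the paper's Millennium inputs:

```python
# A minimal sketch comparing the two scikit-learn ensemble regressors
# discussed in the text. ERT draws split thresholds at random and uses
# the full training set; RF searches for the best split on bootstrap
# resamples. Data and settings here are illustrative stand-ins.
import time
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(size=(2000, 5))               # stand-in for halo features
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.05 * rng.normal(size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.35, random_state=0)

ert = ExtraTreesRegressor(n_estimators=300, min_samples_split=5, random_state=0)
rf = RandomForestRegressor(n_estimators=300, min_samples_split=5, random_state=0)

scores = {}
for name, model in [("ERT", ert), ("RF", rf)]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)    # held-out R^2
    print(f"{name}: R^2 = {scores[name]:.3f}, fit time = {time.perf_counter() - t0:.2f}s")
```

On most runs the ERT fit is the faster of the two, consistent with the timing argument above, though exact timings depend on the machine and data.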
In this section, we present and discuss the results that were obtained when we applied the algorithms to the Millennium data. Using the dark matter internal halo properties and a partial merger history as our inputs, the following components of the final mass of the galaxy are predicted: stellar mass in the bulge, total stellar mass, cold gas mass, central black hole mass and hot gas mass. In nature, these
Table 1. An outline of the extremely randomized trees regression algorithm.

Extremely Randomized Trees
Inputs: a training set S corresponding to (X, y) input-output vectors, where X = (X_1, X_2, ..., X_N) and y = (y_1, y_2, ..., y_l); M (number of trees in the ensemble); K (number of random splits screened at each node); and n_min,samples (number of samples required to split a node)
Outputs: an ensemble of M trees: T = (t_1, t_2, ..., t_M)
Step 1: Randomly select K inputs (X_1, X_2, ..., X_K), where 1 ≤ K ≤ N.
Step 2: For each selected input variable X_i, i = 1, 2, ..., K:
• Compute the minimal and maximal values of X_i in the set: X_i^min and X_i^max
• Randomly select a cut-point X_c in the interval [X_i^min, X_i^max]
• Return the split X_i ≤ X_c
Step 3: Select the best split s⋆ such that Score(s⋆, S) = max_{i=1,2,...,K} Score(s_i, S).
Step 4: Using s⋆, split S into S_l(X_i) and S_r(X_i).
Step 5: For S_l(X_i) and S_r(X_i), check the following conditions:
• |S_l(X_i)| or |S_r(X_i)| is lower than n_min,samples
• All input attributes (X_1, X_2, ..., X_N) are constant in S_l(X_i) or S_r(X_i)
• The output vector (y_1, y_2, ..., y_l) is constant in S_l(X_i) or S_r(X_i)
Step 6: If any of the conditions in Step 5 is satisfied, stop: we are at a leaf node. If none of the conditions is satisfied, repeat Steps 1 through 5.
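The randomized split choice of Steps 1 through 4, scored with the relative variance reduction of Equation 28, can be sketched from scratch as follows. The function names are ours and this simplified version screens K ≤ N distinct features per node; it is an illustration, not the paper's implementation:

```python
# A from-scratch sketch of the randomized split selection in Table 1
# (Steps 1-4), using the relative variance reduction of Eq. 28 as the
# Score. Illustrative only; a full ERT would recurse on the returned
# split until a Step-5 stopping condition holds.
import numpy as np

def split_score(y, left_mask):
    """Sc(s, S): relative variance reduction of a candidate split (Eq. 28)."""
    n = len(y)
    nl = left_mask.sum()
    nr = n - nl
    var_s = np.var(y)
    if nl == 0 or nr == 0 or var_s == 0:
        return 0.0
    return (var_s
            - (nl / n) * np.var(y[left_mask])
            - (nr / n) * np.var(y[~left_mask])) / var_s

def pick_random_split(X, y, K, rng):
    """Steps 1-3: screen K random (feature, cut-point) pairs, keep the best."""
    best = (-np.inf, None, None)               # (score, feature, cut-point)
    for feat in rng.choice(X.shape[1], size=K, replace=False):
        lo, hi = X[:, feat].min(), X[:, feat].max()
        cut = rng.uniform(lo, hi)              # Step 2: fully random cut-point
        s = split_score(y, X[:, feat] <= cut)
        if s > best[0]:
            best = (s, feat, cut)
    return best
```

Step 4 would then partition the node's samples on the returned (feature, cut-point) pair and recurse into the two children.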
Table 2. An outline of the random forests algorithm.

Random Forests
Inputs: a training set S corresponding to (X, y) input-output vectors, where X = (X_1, X_2, ..., X_N) and y = (y_1, y_2, ..., y_l); M (number of trees in the ensemble); K (number of features to consider when looking for the best split); and n_min (minimum number of samples required to split a node)
Outputs: an ensemble of M trees: T = (t_1, t_2, ..., t_M)
Step 1: Select a new bootstrap sample from the training data set.
Step 2: Grow an unpruned tree on this bootstrap sample.
Step 3: For each node in the tree, randomly select K features and look for the best split using only these features.
Step 4: Save the tree; do not perform pruning.
Step 5: Perform Steps 1 through 4 M times.
Step 6: The overall prediction is the average output from each tree in the ensemble.

attributes are the result of billions of years of dissipative, nonlinear baryonic processes. As discussed earlier, the basic ingredient for large scale structure formation is the ΛCDM model; but, on a smaller scale, the story is incredibly different and vastly more complicated. In this section, we try to draw a link between the two regimes using ML algorithms to explore the halo-galaxy connection. We first discuss the performance of ML in reproducing the simulated properties of the galaxies in G11 and the implications of our results for the halo-galaxy connection. Then, we address some discrepancies in our results (particularly the cold gas mass), discuss why the cold gas mass prediction is not robust, and provide an alternative, significantly more accurate model that includes two baryonic inputs over two snapshots.

Table 3 lists the results we obtained with the different machine learning algorithms for each component of the mass of the central galaxy.
MSE_b and the MSE are listed for each technique. The factor reduction of the MSE is also listed to gauge the relative performance of the algorithms and quantify how much they are learning. Finally, the Pearson correlation between the predicted and the true data sets and the coefficient of determination (R^2) are also listed. Seventeen different plots (Figures 2 through 18) are included to show the best results we obtained using both ERT and RF. A hexbin plot and a violinplot are shown for each component of the mass to compare the predictions to the G11 test data.
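The two comparison figures used throughout this section can be sketched with Matplotlib. For self-containment this sketch uses Matplotlib's built-in violinplot rather than Seaborn's, and the synthetic masses and output file name are our own stand-ins:

```python
# A minimal sketch of the comparison figures: a log-scaled hexbin plot
# of predicted vs. SAM masses (gridsize 30, as in the text) and a
# violinplot of the two distributions. Synthetic data stand in for the
# real catalogue; styling only approximates the paper's Seaborn look.
import numpy as np
import matplotlib
matplotlib.use("Agg")                          # render off-screen
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

rng = np.random.RandomState(0)
m_sam = rng.lognormal(mean=1.0, sigma=0.5, size=5000)      # "true" masses
m_pred = m_sam * np.exp(0.1 * rng.normal(size=5000))       # noisy prediction

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Hexbin: a 2-D histogram with a logarithmically scaled colormap.
hb = ax1.hexbin(m_sam, m_pred, gridsize=30, norm=LogNorm(), mincnt=1)
lim = [min(m_sam.min(), m_pred.min()), max(m_sam.max(), m_pred.max())]
ax1.plot(lim, lim, "k--")                      # perfect-prediction line
fig.colorbar(hb, ax=ax1, label="log(N)")
ax1.set_xlabel("M_SAM")
ax1.set_ylabel("M_predicted")

# Violinplot: a boxplot with a kernel density estimate on the side.
ax2.violinplot([m_sam, m_pred], showmedians=True)
ax2.set_xticks([1, 2])
ax2.set_xticklabels(["M_SAM", "M_predicted"])

fig.savefig("mass_comparison.png")
```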
Furthermore, a stellar mass-halo massrelation plot is shown to compare the physical reasonabilityof our stellar mass results with G11 results. A plot show-ing the cold gas mass fraction as a function of stellar massis also included. Lastly, the G11 and ML BH mass-bulgemass relation for the predicted galaxies and G11 galaxies isshown in Figure 10. All plots were created using Seaborn and Matplotlib. For the hexbin plot, a gridsize of 30 wasused and the colormap was logarithmically scaled. For thekernel density estimate (KDE) in the violinplot, the band-width was chosen using the Silverman (1986) method andthe density is evaluated on 1000 grid points. The violinplotsserves two purposes: providing an insightful look into how http://stanford.edu/ mwaskom/software/seaborn/MNRAS , 1–18 (2015) odeling Galaxy Formation and Evolution by using Machine Learning M h ( M fl ) -3 -2 -1 M / M h ( M fl ) G11Predicted
Figure 1.
The stellar mass-halo mass relation for the predictedtotal stellar mass using machine learning and the total stellarmass in G11 are compared for central galaxies. Both quantities arebinned using the halo mass from Millennium. The two differentshadings (blue for G11 and green for ML) represent the standarddeviation at each binned point for the respective technique. M SAM (10 M fl /h ) M p r e d i c t e d ( M fl / h ) l og ( N ) Figure 2.
A hexbin plot of M SAM,(cid:63) and M predicted,(cid:63) . The blackdashed line corresponds to a perfect prediction. The MSE for theprediction is 5.755, the Pearson correlation between the predictedgalaxy mass and the G11 galaxy mass is 0.876 and the regressionscore is 0.768. good ML is at reproducing a similar population of galaxiesas found in G11 and providing a zoomed in alternative ofwhat the mass distribution looks like.The algorithms that performed the best, ERT and RF,were outlined in section 2. Using scikit-learn’s implementa-tion, we used the following parameters for ERT: n trees = M SAM M predicted M ( M fl / h ) Figure 3.
A violinplot showing the distributions of M_SAM,⋆ and M_predicted,⋆. The median and the interquartile range are shown for both sets of galaxy masses.

Figure 4.
A hexbin plot of M_SAM,bulge and M_predicted,bulge. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 6.305, the Pearson correlation between the predicted galaxy mass and the G11 galaxy mass is 0.881 and the regression score is 0.775.

n_min) = 5. For RF, we used the following parameters: n_trees = 325 and minimum sample split (n_min) = 5. We used 35% of the G11 and Millennium data for training; the rest were used for testing. The entire pipeline for extremely randomized trees (including data preprocessing, training, testing and generating all the plots) using the listed parameters ran on a 2.7 GHz Intel Dual-Core processor in 73 minutes. Likewise, the entire pipeline for random forests using the parameters above ran on the same system in 122 minutes. Note that in both cases, these times are orders of magnitude smaller than SAM computation times.
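The split and estimator settings reported above can be assembled into a minimal pipeline sketch: 35% of the catalogue for training, min_samples_split = 5, and 325 trees for the random forest. Since the ERT tree count did not survive extraction, the RF value of 325 is reused here purely as a placeholder, and a synthetic multi-output catalogue stands in for the Millennium/G11 data:

```python
# A sketch of the training/testing pipeline with the reported settings.
# Synthetic data; the ERT n_estimators value is a placeholder, not the
# paper's (unrecoverable) setting.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.normal(size=(3000, 5))                 # stand-in for halo + merger inputs
W = rng.normal(size=(5, 3))
Y = X @ W + 0.1 * rng.normal(size=(3000, 3))   # three mass components at once

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=0.35, random_state=42)

ert = ExtraTreesRegressor(n_estimators=325, min_samples_split=5,
                          random_state=42, n_jobs=-1)
ert.fit(X_tr, Y_tr)                            # handles multi-output natively
Y_hat = ert.predict(X_te)

# Evaluate against the naive training-mean baseline (Eqs. 21-22).
mse = mean_squared_error(Y_te, Y_hat)
baseline = np.tile(Y_tr.mean(axis=0), (len(Y_te), 1))
mse_b = mean_squared_error(Y_te, baseline)
ratio = mse_b / mse
r2 = r2_score(Y_te, Y_hat)
print(f"MSE_b/MSE = {ratio:.2f}, R^2 = {r2:.3f}")
```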
Table 3. The performance of different machine learning techniques in predicting the different mass components of the central galaxy in each dark matter halo at z = 0. Columns: MSE_b, MSE, Factor (MSE_b/MSE), Pearson correlation, R^2.

M⋆,total:
kNN: 6.661, 3.624, 0.852, 0.724
Decision Trees: 7.448, 3.241, 0.832, 0.691
Random Forests: 5.763, 4.301
Extremely Randomized Trees:

M⋆,bulge:
Random Forests:
Extremely Randomized Trees: 28.165, 6.306, 4.466

M_hot:
kNN: 1243.182, 47.786, 0.991, 0.979
Decision Trees: 144.537, 411.014
Extremely Randomized Trees:

M_cold:
kNN: 0.401, 1.311, 0.487, 0.237
Decision Trees: 0.445, 1.182, 0.393, 0.155
Random Forests:
Extremely Randomized Trees:

M_BH:
kNN: 0.000063, 6.958, 0.926, 0.856
Decision Trees: 0.000081, 5.432, 0.903, 0.815
Random Forests: 0.000068, 6.456, 0.925, 0.856
Extremely Randomized Trees:
A violinplot showing the distributions of M_SAM,bulge and M_predicted,bulge. The median and the interquartile range for both sets of galaxy masses are shown.

The first thing to notice in our results is the total stellar mass-halo mass relation (SMHM). In Behroozi et al. (2010) and Moster et al. (2010), the SMHM is extensively studied and compared to a variety of observational data and prevalent empirical halo-galaxy models. The SMHM provides a very powerful tool to check whether our results are physically meaningful and not just numerically reasonable. We can see in Figure 1 that the SMHM is reconstructed almost perfectly: the curves for the predicted set and the G11 results line up almost exactly. One thing to notice here is that our prediction is slightly off for the higher halo masses. Moreover, there is more noticeable scatter in the Millennium

Figure 6.
A hexbin plot of M_SAM,hot and M_predicted,hot. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 57.536, the Pearson correlation between the predicted galaxy mass and the G11 galaxy mass is 0.999 and the regression score is 0.999.

SMHM than in the reconstructed SMHM. These discrepancies are probably present because the ML algorithms are unable to model extreme cases, a hypothesis which is supported by the hexbin plot in Figure 2 as well; the galaxies with higher masses (M_g > × M⊙) are being underpredicted. The SMHM being reproduced strongly implies that machine learning is able to approximate the mapping between the stellar mass and the halo mass that is prescribed

Figure 7.
A violinplot showing the distributions of M_SAM,hot and M_predicted,hot. The median and the interquartile range for both sets of galaxy masses are shown.

Figure 8.
A hexbin plot of M_SAM,BH and M_predicted,BH. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 0.000066, the Pearson correlation between the predicted galaxy mass and the G11 galaxy mass is 0.927 and the regression score is 0.86.

in G11 very well. A subtlety to note here is that ML does not a priori assume that a direct mapping exists between the stellar mass and the halo mass, as other SMHM studies do; instead, ML is trying to learn the relationship prescribed in G11 for how the galaxies are populated with stellar mass. This point will be important later in the paper when we compare our model with subhalo abundance matching.

Furthermore, we can clearly see in Figure 2 that the stellar mass is being predicted fairly well. The regression score (R^2) is 0.77 and the correlation between the predicted set and the test G11 set is 0.876.

Figure 9.
A violinplot showing the distributions of M_SAM,BH and M_predicted,BH. The median and the interquartile range for both sets of galaxy masses are shown.

Figure 10. The black hole mass-bulge mass relation is plotted on a log scale for the predicted population of galaxies and G11's population of galaxies. For the predicted curve, the M_BH,predicted points are binned by the predicted bulge mass, and for G11, M_BH,SAM is binned by the corresponding G11 bulge mass. The shaded areas correspond to the standard deviation at each binned point (blue for G11 and green for ML).

The distribution of the stellar mass is reproduced perfectly using machine learning. Our model is able to pick up very well on the physical prescriptions that are used in G11 to populate galaxies with stars.
We also predict the stellar mass that is in the bulge of each central galaxy at z = 0. In Figures 4 and 5, we can see that the bulge mass is also being accurately reproduced.
The regression score is 0.775 and the correlation between the predicted set and the G11 set is 0.881.

The hot gas mass prediction, as shown in Figures 6 and 7 and Table 3, is outstanding. The Pearson correlation is 0.999 and the regression score is 0.999. ML is able to pick up incredibly well on the way that hot gas is modeled in G11. As discussed in Section 2.2.5, the amount of hot gas available at each snapshot is directly dependent upon the total virial mass of the dark matter halo. Even though supernova feedback plays an important role in reheating some of the gas found in the halo, ML is still able to pick up on how the prescriptions for the hot gas mass in G11 are set. The distribution of the hot gas mass is reproduced perfectly and the MSE is reduced by a factor of 1137.

As discussed in Section 2.2.5, the main contributors to the hot gas mass in the central galaxy are gas stripping and supernova feedback. Our almost perfect results show that ML is able to model these two physical prescriptions very well. The former involves hot gas being stripped from a satellite galaxy and added to the central galaxy for cooling. The latter, which plays a larger role in the determination of the hot gas mass, involves the reheating of cold gas due to supernova feedback. As outlined in Equations 19 and 20, the amount of mass that is reheated has a partial dependence, both direct (in ε_disk) and indirect (in δM⋆), on halo properties, and ML is able to model both quite well. The amount of hot gas plays an important role in the cooling (Equation 5). The almost perfect predictions for

Figure 11.
A hexbin plot of M_SAM,cold and M_predicted,cold. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 0.319, the Pearson correlation between the predicted galaxy mass and the G11 galaxy mass is 0.632 and the regression score is 0.40.

the hot gas mass are promising and show ML's strength in modeling a mapping that is dominated by a direct analytical relationship (Equation 18).

As discussed in Section 2.2.2, the central black hole mass is mostly accreted through the 'quasar' mode (Equation 9) during major mergers or gas-rich mergers. Our results for the central black hole mass are shown in Table 3 and Figures 8 and 9. The regression score is 0.86 and the correlation is 0.927.

Figure 12.
A violinplot showing the distributions of M_SAM,cold and M_predicted,cold. The median and the interquartile range for both sets of galaxy masses are shown.
Figure 13.
The average cold gas mass fraction as a function of stellar mass is shown for G11 galaxies and ML galaxies. For G11, the M_cold,SAM/M⋆,SAM points are binned by M⋆,SAM and, for ML, M_cold,predicted/M⋆,predicted are binned by M⋆,predicted.

As shown in Figure 11, the cold gas mass is being visibly underpredicted. This underprediction is unfortunately not surprising. The recipe used in G11, outlined in Section 2.2.1, has a partial halo dependence, but the baryonic processes play a far more important role in determining the cold gas mass. Indeed, NW found no easy way to parameterize the efficiency of the cooling rate in terms of the host halo mass and time, and had to estimate this value empirically by running DLB07 on the mini-Millennium simulation. To investigate possible reasons for the underprediction, recall that the cooling rate is heavily dependent upon the cooling radius. By using the same algorithms and the same inputs, we predicted the cooling radius at z = 0 for the central galaxy. Our results are shown in Figures 14 and 15. We obtained a regression score of 0.86 and a correlation of 0.931; thus, the prediction is fairly robust. The cooling radius, as shown in Equation 4, also has only a partial halo dependence. It is remarkable that

Figure 14.
A hexbin plot of the cooling radii R_SAM,cool and R_predicted,cool. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 0.000059, the Pearson correlation between the predicted and the G11 cooling radii is 0.930 and the regression score is 0.87.

Figure 15.
A violinplot showing the distributions of R_SAM,cooling and R_predicted,cooling. The median and the interquartile range for both sets of cooling radii are shown.

our prediction is so accurate, then, using solely dark matter halo information; one would expect baryonic recipes, for instance the cooling function prescribed in Sutherland & Dopita (1993), to play a far larger role in the determination of the cooling radius.

The reproduction of the cooling radius and the robust hot gas mass prediction discussed earlier raise the question: why is the cold gas mass evolution not being captured? ML is able to predict the two basic ingredients of gas cooling well, but the cooled mass itself is not being robustly predicted. We hypothesize that this discrepancy is a result of the variability in the cooling radius. Without some form of explicit inclusion of the time evolution of the hot gas mass and the cooling radius over snapshots, ML is unable to capture the evolution of the cooling radius and, consequently, the accu-
Table 4. Cold gas mass prediction with and without the inclusion of the cooling radius and hot gas mass as inputs. Columns: MSE_b, MSE, Factor (MSE_b/MSE), Pearson correlation, R^2.

M_cold, without baryonic inputs:
Random Forests:
Extremely Randomized Trees:

M_cold, with baryonic inputs:
Random Forests:
Extremely Randomized Trees: 0.527, 0.113, 4.664, 0.892, 0.786
A hexbin plot of M_SAM,cold and M_predicted,cold. The black dashed line corresponds to a perfect prediction. The MSE for the prediction is 0.097, the Pearson correlation between the predicted galaxy mass and the G11 galaxy mass is 0.905 and the regression score is 0.82. The main difference between this figure and Figure 11 is that the cooling radius and the hot gas mass for two snapshots are explicitly included in the inputs for the ML algorithms.
A violinplot showing the distributions of M_SAM,cold and M_predicted,cold. The median and the interquartile range for both sets of galaxy masses are shown. The main difference here is that the cooling radius and the hot gas mass for two snapshots are explicitly included in the inputs for the ML algorithms.

mulated cold gas mass. We tested this hypothesis by including the cooling radius and the hot gas mass over only two snapshots (z = 0 and z = 0.…). R^2 is 0.40. We also see in the hexbin plot that the densest hexbins lie on the straight line, but there is noticeable scatter for higher cold gas masses. Furthermore, in Figure 13, we show the average cold gas fraction as a function of stellar mass for both G11 galaxies and ML galaxies. There are some discrepancies between our results and G11's at the beginning, but the general progression of the two curves matches up quite well. Our relatively poor results for the cold gas mass confirm previous results in the literature by demonstrably and quantitatively showing the absence of a relatively simple mapping between cold gas mass and dark matter halo properties.

While the main point of this study was to explore the halo-galaxy connection in a dark matter-only context, we are able to show the power of ML in modeling the complicated cold gas mass recipe reasonably well when very crude, partial baryonic ingredients are included as inputs. The analysis of cold gas mass accumulation using both dark matter-only inputs and baryonic inputs raises two interesting points: first, ML by itself is unable to pick up on the baryonic evolution involved in gas cooling using solely dark matter inputs; second, with just the addition of four baryonic inputs, the prediction for the cold gas mass is vastly improved.
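The feature augmentation described above, appending the cooling radius and hot gas mass at two snapshots (four extra columns) to the dark-matter-only input matrix before retraining, can be sketched as follows. All array names and the synthetic target are hypothetical illustrations, not the paper's data:

```python
# A sketch of augmenting dark-matter-only inputs with four baryonic
# columns and retraining. The synthetic target depends mostly on the
# baryonic quantities, loosely mimicking the cold gas mass behaviour
# reported in the text. Illustrative stand-ins throughout.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
n = 2000
X_dm = rng.normal(size=(n, 6))                 # dark-matter-only inputs
r_cool = rng.uniform(size=(n, 2))              # cooling radius, 2 snapshots
m_hot = rng.uniform(size=(n, 2))               # hot gas mass, 2 snapshots
y = 2.0 * r_cool[:, 0] * m_hot[:, 0] + 0.1 * X_dm[:, 0] \
    + 0.05 * rng.normal(size=n)

X_aug = np.hstack([X_dm, r_cool, m_hot])       # DM + baryonic inputs

scores = {}
for name, X in [("DM only", X_dm), ("DM + baryonic", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.35,
                                              random_state=1)
    ert = ExtraTreesRegressor(n_estimators=100, min_samples_split=5,
                              random_state=1)
    ert.fit(X_tr, y_tr)
    scores[name] = ert.score(X_te, y_te)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

On this toy problem the dark-matter-only score is near zero while the augmented score is high, echoing (qualitatively, not quantitatively) the improvement reported for the cold gas mass.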
The results above show that there is more room for exploration and that the relatively poor cold gas mass prediction does not undermine the overall usefulness of ML as a solid tool in exploring the problem of galaxy formation.

The results above show that ML is able to recreate a population of galaxies that is strikingly similar to that of G11's in our dark matter-only framework. While the MSEs we found for the different components of the mass are surprisingly low, they are still high enough to merit a discussion of the sources of error. First, and most important, a reason for the relatively high MSE is the absence of any baryonic processes or results being input into our machine learning algorithms. In NW, only the efficiency of the physical processes was modeled by using the host halo mass and redshift; NW did include simplified, but still physically motivated, baryonic processes in the model, with hand-tuned free parameters. Our model, on the other hand, is not grounded on physical motivations related to baryons at all; it is instead an effort to explore the halo-galaxy connection in the framework of SAMs by using solely halo properties and a partial merger history. Another possible reason for the relatively high MSE may be a result of the fact that we are only looking at the mass of the central galaxy of a halo and ignoring the satellite galaxies, while using the inputs for the entire halo. This may also explain some of the scatter we see in our results for the stellar mass and the cold gas mass.

There are certain deficiencies in our model. As mentioned earlier, the model is not motivated by baryonic physics like most SAMs. G11 and other SAMs offer simple, but incredibly illuminating, treatments of interactions at the galactic scale. Machine learning, on the other hand, by its very nature, does not leave room for exploration into the subtleties of baryonic physics and is a purely phenomenological model.
Moreover, the predictions shown above do not imply that machine learning is a replacement for SAMs. Our results imply instead that machine learning offers an interesting and promising avenue for exploration in the domain of galaxy formation, primarily because of its simplicity, efficiency and its ability to provide a unique platform that allows us to probe how much information can be extracted from just dark matter haloes. In the case of the stellar mass, black hole mass and hot gas mass, ML is able to pick up on the physical prescriptions used by G11 very well by using solely a partial merger history and information about the halo environment. In the case of the cold gas mass, ML is not able to pick up on the evolution of gas cooling by itself; with a partial inclusion of a couple of baryonic inputs over two snapshots, however, we are able to double our regression score and make predictions that are vastly more robust. The improved results place confidence in the predictive power of ML and imply that ML may be a useful tool for future studies of NBHS and other problems in theoretical astrophysics.

An interesting point here is the superficial similarity between our model and subhalo abundance matching (SHAM) (Conroy & Wechsler 2009). Both models use halo information to glean physical information about the galaxies residing in the halo. Our study differs from SHAMs in one very key aspect: SHAM involves populating haloes with galaxies assuming that there already exists a monotonic relationship between halo mass and galaxy stellar mass (or luminosity). On the other hand, our model predicts properties of galaxies that have already been populated using a SAM (G11), with no relationship being fed to the algorithms. The results obtained, then, imply that the key assumption in most SHAMs, that observable properties of galaxies are monotonically related to the dynamical properties of dark matter substructures, is partially valid.
The discrepancy in our cold gas mass result implies that the baryonic physics plays a vastly more important role in the cooling rate than the halo environment itself. But the reproduction of the total stellar mass implies that SHAM's assumptions in the context of stellar mass hold true and instills further confidence in the general methodology of SHAM.

The cold gas mass prediction raises several interesting points. First, and most important, similar to Contreras et al. (2015), only a weak mapping between the cold gas mass and the internal halo properties was found. However, the robust prediction of the cooling radius and the hot gas mass does leave the door open for further exploration into modeling cold gas mass using ML. Our results also quantitatively verify what NW found; they were unable to parameterize the cooling efficiency in terms of host halo mass and redshift. By the inclusion of just the cooling radius and hot gas mass over two snapshots, ML is able to vastly improve upon the DM-only predictions. The improved predictions naturally raise the question: can ML be applied to other, more complicated evolutionary models and still reproduce physically and numerically reasonable results?

Our results give a unique and deeper look into the galaxy-halo connection that is grounded upon SAMs. We are able to quantitatively estimate the amount of information that dark matter haloes and merger trees hold about the baryonic processes that drive galaxy formation and evolution, in the context of SAMs and how SAMs populate haloes with galaxies. We have quantitatively shown that the environmental dependence of galaxy evolution on the surrounding dark matter halo is surprisingly strong.
As mentioned earlier, our robust predictions for the total stellar mass, stellar mass in the bulge, black hole mass, and the hot gas mass strongly imply that it is possible to learn the physical processes used in cutting-edge SAMs to evolve these components by using solely dark matter properties, a merger history, and machine learning. The relatively weaker results for the cold gas mass imply that our phenomenological, dark matter-only model fails in reproducing the cold gas mass evolution in galaxies. However, our improved cooling model with baryonic inputs solidifies the viability of ML in future galaxy formation studies.

Overall, we get somewhat surprising results, since one would expect that gaseous interactions play a significantly more important role in predicting the final components of mass of a single galaxy than just the basic dark matter halo model; but we have shown that machine learning provides a unique and fairly robust avenue to quantitatively analyze the role that just the dark matter haloes play in galaxy formation in the context of SAMs. We showed that the SMHM relation is reproduced almost perfectly, the shapes of the predicted and true distributions of the different mass components are very similar, the BH mass-bulge mass relation is reproduced, and the cold gas mass fraction as a function of stellar mass is reasonably reproduced. Machine learning is able to learn an appreciable portion of the physical prescriptions used in G11 for galaxy formation using solely dark matter inputs. Moreover, running the whole pipeline took about three hours, considerably less than the hundreds or thousands of hours a typical SAM would require.
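For concreteness, a pipeline of the kind described above can be sketched with scikit-learn (Pedregosa et al. 2011), which implements all four estimator families used in this paper. The design matrix and target below are illustrative mock stand-ins for the halo catalogue, not the Millennium data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Mock halo features: columns stand in for N_particles, spin, M_crit,
# v_max, sigma_v and a few merger-history snapshots (illustrative only).
rng = np.random.default_rng(42)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 8))
# Mock target: a smooth function of two features plus scatter, standing
# in for the log10 of one galaxy mass component.
y = np.log10(X[:, 0] * X[:, 2]) + rng.normal(0.0, 0.1, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Extremely randomized trees (Geurts et al. 2006), one of the four
# estimator families compared in this paper.
model = ExtraTreesRegressor(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(X_train, y_train)
print(f"test R^2 = {model.score(X_test, y_test):.3f}")
```

The held-out `score` is the same R² regression score quoted for the predictions throughout the paper; the other estimators (kNN, decision trees, random forests) are drop-in replacements for `ExtraTreesRegressor` in this sketch.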
CONCLUSIONS

We have performed an extensive study of the halo-galaxy connection by using novel machine learning techniques in the backdrop of a state-of-the-art SAM. Using G11 to train our ML algorithms, the total stellar mass, stellar mass in the bulge, cold gas mass and hot gas mass in the Millennium simulation are predicted. ML provides a powerful framework to explore the problem of galaxy formation in part due to its relative simplicity, computational efficiency and its ability to model complex physical relationships. The discrepancies in and weaknesses of our phenomenological model were discussed, and the reasons for some of our relatively less robust predictions were also discussed. An improved cooling model with four additional baryonic inputs was implemented, which made the cold gas mass predictions significantly better and solidified ML's position as a model that can be used to probe the halo-galaxy connection in more detail, perhaps with sophisticated NBHS.

Our primary conclusions are as follows:

(1) Exploring the extent of the influence of dark matter haloes and their past environment on galaxy formation and evolution is a non-trivial problem with poorly defined inputs and mappings. Semi-analytic modeling is the prevalent galaxy formation modeling technique that uses simple, yet physically powerful, recipes to populate dark matter haloes with galaxies. However, there is no clear way to explore the extent of the influence of dark matter haloes on the halo-galaxy connection solely by using SAMs. Machine learning, on the other hand, provides an interesting alternative to standard techniques for three main reasons: powerful predictive capabilities, simplicity and efficiency.

(2) By using the Millennium simulation and G11, we set up a model that used internal halo properties (N, spin, M_crit, v_max and σ_v) and a partial merger history to predict different mass components of the central galaxy in each dark matter halo at z = 0. No baryonic processes were incorporated in our initial analysis.
We applied several sophisticated algorithms (kNN, regression trees, random forests and extremely randomized trees) to the Millennium data and were able to reproduce a similar galaxy population.

(3) The total stellar mass and the stellar mass in the bulge are predicted very well. The predicted and true distributions for both are almost identical. ML is able to model the physical prescriptions laid out in G11 for galactic stellar mass evolution. The stellar mass-halo mass relation that G11 found is recreated almost perfectly, with some very minor discrepancies for M_h ≈ 10^{…} M_⊙ h⁻¹. The bulge mass prediction is also fairly robust, with the distributions being remarkably consistent. However, the bulge mass is slightly overpredicted for lower masses. We hypothesize that this may be because our inputs include a partial merger history of the haloes and not the galaxies. Consequently, ML is possibly overpredicting as a result of its inability to fully model the galaxy-galaxy merger timescale using only a halo merger history. The central black hole mass prediction is also very robust and the distribution is recreated almost perfectly. The BH-bulge mass relation for the ML simulated galaxies and G11 galaxies is very consistent. There is a slight overprediction at lower masses for the central black hole mass prediction, which further places confidence in the hypothesis that ML is unable to fully pick up on the galaxy-galaxy merger timescale using only a halo merger history.

(4) The hot gas mass is predicted outstandingly well. ML is demonstrably able to model G11's prescriptions for gas stripping and supernova feedback. The cold gas mass prediction, on the other hand, is relatively weak, with a correlation of only 0.63. However, the robust cooling radius and hot gas predictions imply that the ingredients for cooling are sufficiently modeled by ML. We hypothesized that our poor prediction was a result of the inability of ML to model the cooling radius evolution without any baryonic guidance (i.e. by using solely dark matter inputs and merger history). We tested this hypothesis by including the cooling radius and the hot gas mass for only the last two snapshots and found significantly better predictions, with a correlation of 0.91 and R² of 0.82. The improved cooling predictions place confidence in the predictive power of ML and imply that ML will be a useful tool in future studies of galaxy formation and evolution. The average cold gas mass fraction as a function of stellar mass was also plotted for G11 galaxies and the predicted galaxies. The shapes of the two curves are reasonably similar, with a minor discrepancy at the lowest masses.

(5) Our results provide a unique framework to explore the galaxy-halo connection that is built by using SAMs. We are able to quantitatively estimate the amount of information that dark matter haloes and merger trees hold about the baryonic processes that drive galaxy formation and evolution, in the context of G11. Our robust predictions for the total stellar mass, stellar mass in the bulge, central black hole mass, and the hot gas mass strongly imply that it is possible to successfully model the physical prescriptions used in SAMs to evolve these mass components using solely dark matter properties, a merger history and machine learning. However, ML is unable to find a robust approximate mapping between the internal dark matter halo properties and the cold gas mass, like Neistein & Weinmann (2010), Faucher-Giguère et al. (2011) and Contreras et al. (2015).

(6) ML is a phenomenological model and not a physical one, and, consequently, is not a replacement for SAMs. However, ML offers a solid and intriguing framework to explore the halo-galaxy connection, with solid results comparable to G11, which conventional modeling techniques do not provide.

The results presented in this paper show the usefulness of ML in providing a solid framework to probe the halo-galaxy connection in the backdrop of SAMs. Future work includes exploring more sophisticated ML techniques to probe galaxy formation and evolution in NBHS.
ACKNOWLEDGMENTS
The authors thank Christopher Chan, Rishabh Jain, and Dingcheng Yue for help in gathering data and exploring preliminary machine learning approaches. HMK and RJB acknowledge support from the National Science Foundation Grant No. AST-1313415. HMK has been supported in part by funding from the LAS Honors Council at the University of Illinois and by the Office of Student Financial Aid at the University of Illinois. RJB has been supported in part by the Center for Advanced Studies at the University of Illinois. MJT is supported by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4561. We would like to thank the reviewer for their helpful comments that made this paper better.

The Millennium Simulation databases used in this paper and the web application providing online access to them were constructed as part of the activities of the German Astrophysical Virtual Observatory (GAVO).
REFERENCES
Angulo R., Springel V., White S., Jenkins A., Baugh C., Frenk C., 2012, MNRAS, 426, 2046
Baldry I. K., Balogh M. L., Bower R., Glazebrook K., Nichol R. C., Bamford S. P., Budavari T., 2006, MNRAS, 373, 469
Ball N. M., Brunner R. J., 2010, International Journal of Modern Physics D, 19, 1049
Ball N. M., Brunner R. J., Myers A. D., Tcheng D., 2006, ApJ, 650, 497
Ball N. M., Brunner R. J., Myers A. D., Strand N. E., Alberts S. L., Tcheng D., Llorà X., 2007, ApJ, 663, 774
Banerji M., et al., 2010, MNRAS, 406, 342
Baugh C. M., 2006, Reports on Progress in Physics, 69, 3101
Behroozi P. S., Conroy C., Wechsler R. H., 2010, ApJ, 717, 379
Benson A. J., 2012, New Astronomy, 17, 175
Benson A., Pearce F., Frenk C., Baugh C., Jenkins A., 2001, MNRAS, 320, 261
Blumenthal G. R., Faber S., Primack J. R., Rees M. J., 1984, Nature, 311, 517
Bower R., Benson A., Malbon R., Helly J., Frenk C., Baugh C., Cole S., Lacey C. G., 2006, MNRAS, 370, 645
Bower R., Vernon I., Goldstein M., Benson A., Lacey C. G., Baugh C., Cole S., Frenk C., 2010, MNRAS, 407, 2017
Boylan-Kolchin M., Ma C.-P., Quataert E., 2008, MNRAS, 383, 93
Breiman L., 1996, Machine Learning, 24, 123
Breiman L., 2001, Machine Learning, 45, 5
Breiman L., Friedman J., Stone C. J., Olshen R. A., 1984, Classification and Regression Trees. CRC Press
Chabrier G., 2003, Publications of the Astronomical Society of the Pacific, 115, 763
Cole S., Aragon-Salamanca A., Frenk C. S., Navarro J. F., Zepf S. E., 1994, MNRAS, 271, 781
Cole S., Lacey C. G., Baugh C. M., Frenk C. S., 2000, MNRAS, 319, 168
Planck Collaboration, et al., 2015, preprint (arXiv:1502.01589)
Conroy C., Wechsler R. H., 2009, ApJ, 696, 620
Contreras S., Baugh C., Norberg P., Padilla N., 2015, preprint (arXiv:1502.06614)
Croton D. J., et al., 2006, MNRAS, 365, 11
Cucciati O., et al., 2012, A&A, 548, A108
Davis M., Efstathiou G., Frenk C. S., White S. D., 1985, ApJ, 292, 371
De La Torre S., et al., 2011, A&A, 525, A125
De Lucia G., Blaizot J., 2007, MNRAS, 375, 2
De Lucia G., Kauffmann G., White S. D., 2004, MNRAS, 349, 1101
De Lucia G., Springel V., White S. D., Croton D., Kauffmann G., 2006, MNRAS, 366, 499
De Lucia G., Boylan-Kolchin M., Benson A. J., Fontanot F., Monaco P., 2010, MNRAS, 406, 1533
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Faucher-Giguère C.-A., Kereš D., Ma C.-P., 2011, MNRAS, 417, 2982
Fiorentin P. R., Bailer-Jones C., Lee Y., Beers T., Sivarani T., Wilhelm R., Prieto C. A., Norris J., 2007, A&A, 467, 1373
Gerdes D. W., Sypniewski A. J., McKay T. A., Hao J., Weis M. R., Wechsler R. H., Busha M. T., 2010, ApJ, 715, 823
Geurts P., Ernst D., Wehenkel L., 2006, Machine Learning, 63, 3
Graff P., Feroz F., Hobson M. P., Lasenby A., 2014, MNRAS, 441, 1741
Guo Q., et al., 2011, MNRAS, 413, 101
Henriques B. M., Thomas P. A., Oliver S., Roseboom I., 2009, MNRAS, 396, 535
Henriques B. M., White S. D., Thomas P. A., Angulo R. E., Guo Q., Lemson G., Springel V., 2013, MNRAS, 431, 3373
Hopkins P. F., et al., 2010, ApJ, 715, 202
Ivezić Ž., Connolly A. J., VanderPlas J. T., Gray A., 2014, Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. Princeton University Press
Johnson R., Zhang T., 2011, preprint (arXiv:1109.0887)
Kamdar H., Turk M., Brunner R., submitted, MNRAS
Kang X., Jing Y., Mo H., Börner G., 2005, ApJ, 631, 21
Kauffmann G., 1996, MNRAS, 281, 475
Kauffmann G., White S. D., Guiderdoni B., 1993, MNRAS, 264, 201
Kennicutt Jr R. C., 1998, ApJ, 498, 541
Kim E. J., Brunner R. J., Carrasco Kind M., 2015, MNRAS, 453, 507
Kind M. C., Brunner R. J., 2013, MNRAS, 432, 1483
Klypin A. A., Trujillo-Gomez S., Primack J., 2011, ApJ, 740, 102
Knebe A., et al., 2015, preprint (arXiv:1505.04607)
Lagos C. d. P., Cora S. A., Padilla N. D., 2008, MNRAS, 388, 587
Lemson G., et al., 2006, preprint (astro-ph/0608019)
Liu Y., Weisberg R. H., 2011, A Review of Self-Organizing Map Applications in Meteorology and Oceanography. INTECH Open Access Publisher
Liu L., Yang X., Mo H., Van den Bosch F. C., Springel V., 2010, ApJ, 712, 734
Martin C. L., 1999, ApJ, 513, 156
Mo H., Mao S., White S. D., 1998, MNRAS, 295, 319
Monaco P., Fontanot F., Taffoni G., 2007, MNRAS, 375, 1189
Monaco P., Benson A. J., De Lucia G., Fontanot F., Borgani S., Boylan-Kolchin M., 2014, MNRAS, 441, 2058
Moster B. P., Somerville R. S., Maulbetsch C., Van den Bosch F. C., Macciò A. V., Naab T., Oser L., 2010, ApJ, 710, 903
Neistein E., Weinmann S. M., 2010, MNRAS, 405, 2717
Ntampaka M., Trac H., Sutherland D. J., Battaglia N., Poczos B., Schneider J., 2015, ApJ, 803, 50
Pedregosa F., et al., 2011, The Journal of Machine Learning Research, 12, 2825
Peebles P., 1982, ApJ, 263, L1
Roe B. P., Yang H.-J., Zhu J., Liu Y., Stancu I., McGregor G., 2005, Nuclear Instruments and Methods in Physics Research Section A, 543, 577
Schaye J., et al., 2015, MNRAS, 446, 521
Silverman B. W., 1986, Density Estimation for Statistics and Data Analysis. Vol. 26, CRC Press
Skillman S. W., Warren M. S., Turk M. J., Wechsler R. H., Holz D. E., Sutter P., 2014, preprint (arXiv:1407.2600)
Somerville R. S., Davé R., 2014, preprint (arXiv:1412.2712)
Somerville R. S., Primack J. R., 1999, MNRAS, 310, 1087
Somerville R. S., Hopkins P. F., Cox T. J., Robertson B. E., Hernquist L., 2008, MNRAS, 391, 481
Springel V., 2005, MNRAS, 364, 1105
Springel V., White S. D., Tormen G., Kauffmann G., 2001, MNRAS, 328, 726
Springel V., et al., 2005, Nature, 435, 629
Sutherland R. S., Dopita M. A., 1993, ApJS, 88, 253
Vogelsberger M., et al., 2014, MNRAS, 444, 1518
Wang L., Li C., Kauffmann G., De Lucia G., 2007, MNRAS, 377, 1419
Weinmann S. M., Kauffmann G., Von Der Linden A., De Lucia G., 2010, MNRAS, 406, 2249
White S. D., Frenk C. S., 1991, ApJ, 379, 52
Witten I. H., Frank E., 2005, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann
Xu G., 1994, preprint (astro-ph/9409021)
Xu X., Ho S., Trac H., Schneider J., Poczos B., Ntampaka M., 2013, ApJ, 772, 147
Yoshida N., Stoehr F., Springel V., White S. D., 2002, MNRAS, 335, 762

Figure A1. The relative importance of different halo properties (spin, R_half, number of particles, v_max, v_disp, M_crit) in predicting different properties of the galaxy.
APPENDIX A: FEATURE IMPORTANCE
In the discussion of the halo properties chosen for our analysis, an evaluation of which attributes play a role in determining the galaxy properties was not performed. Here, we provide a feature importance plot that shows the relative importance of the halo properties (at z = 0) in predicting galaxy properties.

For tree-based machine learning techniques, the depth of a feature (i.e. its relative rank) used as a decision node can be used to evaluate how important that particular feature is in the learning process. The expected fraction of the samples a feature contributes to can be used as an estimate of the relative importance of the features. We then average this quantity over all trees in the ensemble to get a less biased estimate for the importance of a particular feature.

As one would expect, the mass of the halo plays an integral role in determining the galaxy properties. Perhaps surprisingly, the spin of the dark matter halo plays a minimal role in the learning process. This analysis of feature importances will guide future work that uses machine learning to extract information from dark matter haloes about the galaxies residing in the halo.

Figure B1.
A hexbin plot showing the black hole mass prediction for Bower et al. (2006) with a KDE on top.
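The impurity-based ranking described in Appendix A is what scikit-learn exposes as the `feature_importances_` attribute of its tree ensembles. A minimal sketch on mock data follows; the feature names and the target are illustrative, and the target is constructed so that one feature dominates:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)
features = ["N_particles", "spin", "M_crit", "v_max", "sigma_v", "R_half"]
X = rng.lognormal(size=(2000, len(features)))
# Mock target driven almost entirely by M_crit, so that column should
# rank first; the other columns are noise with respect to y.
y = np.log10(X[:, 2]) + rng.normal(0.0, 0.1, size=2000)

model = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity, sample-weighted and averaged over all trees
# in the ensemble; the importances are normalized to sum to one.
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:12s} {imp:.3f}")
```

On the real catalogue the same attribute, averaged over the fitted ensemble, yields the rankings plotted in Figure A1.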
APPENDIX B: USING A DIFFERENT SEMI-ANALYTICAL MODEL
An interesting question is whether machine learning techniques perform similarly well using a different SAM. The reason we used G11 for this work in place of, or along with, a Durham SAM (Bower et al. 2006) was simply because more halo parameters were available in the merger trees that were constructed for DLB07 and G11. Using G11 offered the opportunity to explore a bigger parameter space.

Here, we explore the effect of using another SAM with fewer halo parameters. For Bower et al. (2006), only the halo mass is provided through the merger tree. We repeat our analysis using just the halo mass over four snapshots and predict only the stellar mass and the black hole mass. The point of this analysis is to examine whether ML is able to model the same relationships when a different SAM is used with fewer inputs.

As we can see in the two figures attached, the predictions are noticeably more scattered (particularly the stellar mass) but the general trend is still recovered, even when only one feature is used in our prediction (out of necessity) in a SAM where some of the physics is treated differently. In Kamdar et al. (2015), we explore the feasibility of ML in making predictions from an NBHS, where the physics is vastly more complicated.

This paper has been typeset from a TeX/LaTeX file prepared by the author.

Figure B2.
A hexbin plot showing the stellar mass prediction for Bower et al. (2006) with a KDE on top.