MNRAS 000, 1–23 (2020). Preprint 17 June 2020. Compiled using the MNRAS LaTeX style file v3.0.
Time Delay Lens modelling Challenge: II. Results
X. Ding,⋆ T. Treu, S. Birrer, G. C.-F. Chen, J. Coles, P. Denzel, M. Frigo, A. Galan, P. J. Marshall, M. Millon, A. More, A. J. Shajib, D. Sluse, H. Tak, D. Xu, M. W. Auger, V. Bonvin, H. Chand, F. Courbin, G. Despali, C. D. Fassnacht, D. Gilman, S. Hilbert, S. R. Kumar, Y.-Y. Lin, J. W. Park, P. Saha, S. Vegetti, L. Van de Vyvere, L. L. R. Williams

Department of Physics and Astronomy, University of California, Los Angeles, CA 90095-1547, USA
Kavli Institute for Particle Astrophysics and Cosmology and Department of Physics, Stanford University, Stanford, CA 94305, USA
Department of Physics, University of California, Davis, CA 95616, USA
Physik-Department T38, Technische Universität München, James-Franck-Str. 1, D-85748 Garching, Germany
Institute for Computational Science, University of Zurich, 8057 Zurich, Switzerland
Physics Institute, University of Zurich, 8057 Zurich, Switzerland
Max Planck Institute for Astrophysics, Karl-Schwarzschild-Strasse 1, D-85740 Garching, Germany
Institute of Physics, Laboratory of Astrophysics, Ecole Polytechnique Fédérale de Lausanne (EPFL), Observatoire de Sauverny, 1290 Versoix, Switzerland
Kavli Institute for Particle Astrophysics and Cosmology, Stanford University, 452 Lomita Mall, Stanford, CA 94035, USA
Kavli IPMU (WPI), UTIAS, The University of Tokyo, Kashiwa, Chiba 277-8583, Japan
STAR Institute, Quartier Agora, Allée du six Août 19c, B-4000 Liège, Belgium
International CHASC Astrostatistics Collaboration, Harvard University, Cambridge, MA 02138, USA
Center for Astrostatistics, The Pennsylvania State University, University Park, PA 16802, USA
Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
Department of Astronomy and Astrophysics, The Pennsylvania State University, University Park, PA 16802, USA
Institute for Computational and Data Sciences, The Pennsylvania State University, University Park, PA 16802, USA
Department of Astronomy, Tsinghua University, Beijing, 100084, China
Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA, UK
Aryabhatta Research Institute of observational sciencES (ARIES), Nainital 263002, India
Department of Physics and Astronomical Sciences, Central University of Himachal Pradesh, India 176206
Exzellenzcluster Universe, Boltzmannstr. 2, 85748 Garching, Germany
Ludwig-Maximilians-Universität, Universitäts-Sternwarte, Scheinerstr. 1, 81679 München, Germany
Department of Physics, University of Illinois at Urbana-Champaign, 1110 West Green Street, Urbana, IL 61801, USA
School of Physics and Astronomy, University of Minnesota, 116 Church Street SE, Minneapolis, MN 55455, USA
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
In recent years, breakthroughs in methods and data have enabled gravitational time delays to emerge as a very powerful tool to measure the Hubble constant H0. However, published state-of-the-art analyses require of order 1 year of expert investigator time and up to a million hours of computing time per system. Furthermore, as precision improves, it is crucial to identify and mitigate systematic uncertainties. With this time delay lens modelling challenge we aim to assess the level of precision and accuracy of the modelling techniques that are currently fast enough to handle of order 50 lenses, via the blind analysis of simulated datasets presented in paper I. The results in Rung 1 and Rung 2 show that methods that use only the point source positions tend to have lower precision (10-20%) while remaining accurate, whereas the methods that exploit the full information content of the imaging and kinematic data can recover H0 within the target accuracy (|A| < 2%) and precision (< 6% per system), even in the presence of a poorly known point spread function and complex source morphology. A post-unblinding analysis of Rung 3 showed the numerical precision of the ray-traced cosmological simulations to be insufficient to test lens modelling methodology at the percent level, making the results difficult to interpret. A new challenge with improved simulations is needed to make further progress in the investigation of systematic uncertainties. For completeness, we present the Rung 3 results in an appendix, and use them to discuss various approaches to mitigating against similar subtle data generation effects in future blind challenges.
Key words: cosmology: observations — gravitational lensing: strong — methods: data analysis

⋆ E-mail: [email protected]

© 2020 The Authors
1 INTRODUCTION

The flat Λ cold dark matter (ΛCDM) cosmological model has been remarkably successful in explaining the geometry and dynamics of our Universe. It has been able to predict/match the results of a wide range of experiments covering a wide range of physical scales (Planck Collaboration et al. 2014, 2016; Riess et al. 2016; Betoule et al. 2014; Eisenstein et al. 2005; Alam et al. 2017), and the expansion of our Universe (Riess et al. 1998; Perlmutter et al. 1999).

One of the key parameters of the model is the Hubble constant (H0), which determines the age and physical scale of the Universe. Measuring H0 at high precision and accuracy has been one of the main goals of observational cosmology for almost a century (Freedman et al. 2001). In recent years, as the precision of the measurements has improved to the few percent level, a strong tension has emerged between early- and late-universe probes. As far as early-universe probes are concerned, analysis of Planck data yields H0 = 67.4 ± 0.5 km s^-1 Mpc^-1 (Planck Collaboration et al. 2018), assuming a ΛCDM model. In the local universe, the Supernovae, H0, for the Equation of State of dark energy (SH0ES) team, using the traditional "distance ladder" method based on Cepheid calibration of type Ia supernovae, finds H0 = 74.03 ± 1.42 km s^-1 Mpc^-1 (Riess et al. 2019), and H0 = 72.4 ± 2.0 km s^-1 Mpc^-1 based on the tip of the red giant branch (Yuan et al. 2019). The Carnegie-Chicago Hubble Program calibrated the tip of the red giant branch and applied it to type Ia supernovae, finding an intermediate value of H0 = 69.8 ± 0.8 (±1.1% stat) ± 1.7 (±2.4% sys) km s^-1 Mpc^-1 (Freedman et al. 2019). The tension between late- and early-universe probes ranges between 4 and 6σ (see the summary by Verde et al. 2019). If this tension is confirmed, it would imply that flat ΛCDM is not a good description of the universe, and additional ingredients such as new particles or early dark energy might be needed (e.g., Knox & Millea 2020; Arendse et al. 2019). Given the potential implications of this tension, it is crucial to have several independent methods to measure H0, each with sufficient precision to resolve the tension (e.g. 1.6% to resolve the 8% H0 tension at 5σ).

Time-delay cosmography with strong gravitational lensing provides a one-step measurement of H0 together with other cosmological parameters (Refsdal 1966; Treu & Marshall 2016). The strongly lensed source produces multiple images, corresponding to multiple paths followed by the photons through the universe. According to Fermat's principle, the lensed images arrive at the observer at different times, corresponding to the extrema of the arrival time surface. The time delays between the images depend on the absolute value of cosmological distances, chiefly through the so-called "time-delay distance", D_Δt, and can thus be used to infer H0 like any other distance indicator (Schechter et al. 1997; Treu & Koopmans 2002a; Suyu et al. 2010a). Importantly, time delay cosmography is independent of all other probes of H0.

At the time of writing, the H0 Lenses in COSMOGRAIL's Wellspring (H0LiCOW) and SHARP collaborations have finished the analysis of six strongly lensed quasars and obtained a joint inference on the Hubble constant of H0 = 73.3 +1.7/-1.8 km s^-1 Mpc^-1 (Wong et al. 2020). In addition, as part of the STRIDES collaboration, Shajib et al. (2020) analyzed one particularly information-rich strong lens system, DES J0408-5354, alone and constrained H0 at the 3.9% level, in excellent agreement with the Wong et al. (2020) result. (In the rest of the paper, we refer to H0LiCOW/SHARP/COSMOGRAIL/STRIDES collectively as TDCOSMO; Millon et al. 2019.)
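To make the distance dependence concrete, the time-delay distance D_Δt = (1 + z_d) D_d D_s / D_ds can be evaluated in a few lines of pure Python for a flat ΛCDM cosmology. This is an illustrative sketch, not the challenge code: the function names are ours, and a fiducial Ω_m = 0.3 is assumed (real analyses use full cosmology codes).

```python
import math

C_KM_S = 299792.458  # speed of light [km/s]

def comoving_distance(z, h0, om=0.3, n=2000):
    """Line-of-sight comoving distance [Mpc] in flat LambdaCDM (trapezoid rule)."""
    def inv_e(zz):
        return 1.0 / math.sqrt(om * (1.0 + zz) ** 3 + (1.0 - om))
    dz = z / n
    s = 0.5 * (inv_e(0.0) + inv_e(z)) + sum(inv_e(i * dz) for i in range(1, n))
    return (C_KM_S / h0) * s * dz

def angular_diameter_distance(z1, z2, h0, om=0.3):
    """Angular diameter distance between z1 and z2 [Mpc]; valid for a flat universe."""
    return (comoving_distance(z2, h0, om) - comoving_distance(z1, h0, om)) / (1.0 + z2)

def time_delay_distance(zd, zs, h0, om=0.3):
    """D_dt = (1 + zd) * D_d * D_s / D_ds [Mpc]."""
    dd = angular_diameter_distance(0.0, zd, h0, om)
    ds = angular_diameter_distance(0.0, zs, h0, om)
    dds = angular_diameter_distance(zd, zs, h0, om)
    return (1.0 + zd) * dd * ds / dds
```

Since every distance above scales as 1/H0, D_Δt itself is proportional to 1/H0, which is why measured time delays constrain the Hubble constant directly.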
Measurements of H0 using time delay lenses have also been investigated by other collaborations (Paraficz & Hjorth 2010; Ghosh et al. 2020).

Based on the current results, it is predicted that a 1% precision on H0 can be achieved via time-delay cosmography alone using a sample of 40 lensed systems (Shajib et al. 2018). However, two issues need to be addressed before a 1% measurement of H0 can be achieved with time delay cosmography. First, the analysis and computational costs need to be reduced in order to make larger samples tractable. Second, all sources of potential systematic uncertainty must be investigated in order to identify and mitigate any outstanding ones.

The first issue is well illustrated by the current state of the art. At present, the analysis of each system requires approximately one year of full-time effort by an expert investigator. Furthermore, the analysis by Shajib et al. (2020) required approximately 1 million hours of CPU time. Analyzing 40 lenses would thus be prohibitive with current techniques, especially in terms of investigator time. Efforts are underway to automate these modelling efforts so that they can be scaled to large numbers of lenses, reducing the investigator time per lens (Shajib et al. 2019), but much work remains to be done to get to high-precision, low-cost modelling (Schmidt et al. 2020, in prep).

Regarding the second issue, a number of efforts are under way to identify systematic uncertainties (e.g. Millon et al. 2019). All parts of the analysis need to be checked with high quality data and independent analyses, as well as with simulated datasets.

One effective strategy to test for unknown systematic errors is blind analysis. In the implementation followed by the TDCOSMO collaboration, the inferred values of D_Δt and H0 are kept blind until all coauthors agree to freeze the analysis during a collaboration telecon. The inference is then unblinded and published without modification. One of the goals of the blind analysis is to avoid conscious and unconscious confirmation bias. (For an early implementation of a blind challenge, see Williams & Saha 2000.)

Another powerful strategy is to study systematic errors using realistic simulations, possibly analyzed blindly. Blind analysis of simulated datasets was the strategy of the "Time Delay Challenge" (TDC). In the TDC, a so-called "Evil" team first simulated a large number of realistic 'mock' time delay light curves, including anticipated physical and experimental effects. Then, the "Evil" team published the mock data and invited the community to extract the time delay signals blindly using their own methods. Liao et al. (2015) showed that time delays can be measured from realistic datasets with precision within 3% and accuracy within 1%.

The success of the TDC encouraged the community to take the next step of verifying the precision and accuracy of lens models with a time delay lens modelling challenge (TDLMC, Ding et al. 2018; https://tdlmc.github.io/), initiated on 2018 January 8th. The challenge "Evil" team simulated 48 systems of mock strongly lensed quasar data and provided them to the participating teams ("Good" teams) to model, blindly. The "Evil" team produced realistic simulated time-delay lens data including i) HST-like lensed AGN images, ii) lens time delays, iii) line-of-sight velocity dispersions, and iv) external convergence. After the "Good" teams submitted their inferred H0, the performance of each adopted method could be estimated by comparison with the true values in the simulation.

The number of simulated lensed quasars was chosen to have sufficient statistics to assess the performance at the percent level
(7% expected per system gives approximately a ~1% precision on the mean). We stress that this is already a huge sample for current modelling methods, and thus the challenge exclusively tests "fast methods". The computational cost of lens modelling is a major hurdle that will need to be overcome in the future; thus the TDLMC uses large simulated samples, aiming to test the speed and performance of these "fast methods".

We also note that the TDLMC is limited to the study of lens model accuracy. Other sources of uncertainty are not considered. Therefore the ancillary data, including the time delays, line-of-sight velocity dispersions and information on the external convergence, are provided unbiased and with true uncertainties.

This paper provides the details of the challenge design that were hidden in paper I (Ding et al. 2018, hereafter: TDLMC1) and presents an overview of the submission results. We encourage the individual "Good" teams to submit more detailed papers on their methods and results. The paper is structured as follows. In Section 2, we describe the details of the challenge, including the hitherto hidden assumptions adopted in simulating the sample. Section 3 presents the response from the participating teams to this challenge and a brief summary of the method(s) adopted. The analysis of the submissions for Rung 1 and Rung 2 is presented in Section 4. For Rung 3, we discovered post-unblinding that the numerical precision of the ray-traced simulations was insufficient to test lens modelling methodology at the percent level, making the results from this rung difficult to interpret. Therefore we dedicate the whole of Section 5 to the subtleties of Rung 3 that will need to be addressed by any future challenge that wishes to adopt numerical simulations of galaxies as a starting point. The results of Rung 3 are given in Appendix A for completeness, even though they should be interpreted with caution. We draw out some of the implications of the results and discuss our findings in Section 6. Section 7 presents a brief summary of the paper.
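The sample-size arithmetic above is just the standard error of the mean, σ/√N: with ~7% precision per system and the 48 simulated lenses of the challenge, the expected precision on the mean is about 1%. A one-line sketch (the function name is ours):

```python
import math

def precision_on_mean(per_system_precision, n_systems):
    """Standard error of the mean for n independent, equally precise systems."""
    return per_system_precision / math.sqrt(n_systems)

# ~7% per system over 48 lenses -> ~1% on the mean
print(round(precision_on_mean(0.07, 48), 3))  # 0.01
```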
There are three challenge ladders in the TDLMC, called Rung 1, Rung 2 and Rung 3. In addition, an entry-level Rung 0 is designed for training purposes. To ensure that the "Good" teams cannot infer any information from a previous rung, we reset H0 at each rung. We adopt two independent codes, namely Lenstronomy (Birrer & Amara 2018; https://github.com/sibirrer/lenstronomy) and Pylens (Auger et al. 2011; https://github.com/tcollett/pylens), to simulate the HST-like lensed AGN images (the sample being equally split between the two codes). This strategy helps us mitigate the "home advantage", if any, that arises when a "Good" team happens to adopt the same code as the one used to generate the simulated images. The use of two independent codes also allows us to estimate numerical uncertainties related to the implementation of the algorithms, if present.
The TDLMC begins with Rung 0, consisting of two lens systems, one double and one quad. This training rung aims to ensure that the "Good" team members understand the format of the data, and avoids any trivial coding errors or mistakes which could potentially affect the results of the entire challenge.

Considering that the lens modelling process is usually time-consuming, we generated in total 48 lensing systems, spread over three blind rungs (i.e., Rungs 1, 2 and 3, with 16 systems in each rung). The sample size is small enough to be tractable by the "Good" teams and large enough to explore different conditions with sufficient statistics and to uncover potential biases at the percent level. We increase the level of complexity from Rung 1 to Rung 3. We reveal the details of the simulations for each blind rung in the rest of this section, including those which were known only to the "Evil" team before unblinding.

For training purposes, Rung 0 was designed to be as simple as possible. Therefore, simple parametrized forms were adopted to describe the surface brightness of the deflector and the source galaxy (i.e., Sérsic), and the mass profile (elliptical power-law) of the deflector. The true point spread function (PSF) is given and external convergence is not included. The Rung 0 data were released with all the input parameters so that the "Good" teams could validate their analysis.

In Rung 1, the increase in complexity with respect to Rung 0 is that the surface brightness of the AGN host galaxy is realistically complex, rather than described by a simply parameterized model like a Sérsic profile. For the purpose of making realistic source galaxies, we started from high-resolution images of nearby galaxies obtained by HST. The digital images were downloaded from the Hubble Legacy Archive (https://hla.stsci.edu/hlaview.html). In order to obtain clean galaxy images, we first removed isolated stars and background/foreground objects in the field. All the processed galaxy images are shown in Figure 1. Then, we obtained the global properties of these galaxies by using Galfit to fit them with Sérsic profiles, so as to obtain their effective radius (R_eff) in arcsec and their total flux. This information is then used to rescale the galaxy size and magnitude in the source plane, as described in Section 2.3.3. A random external convergence value is also added in Rung 1 (see Section 2.4).

Rung 2 increases the complexity of Rung 1 by providing the "Good" teams with only a guess of the PSF, instead of the actual PSF used to generate the simulations. This added complexity is meant to represent a typical situation where the observer uses a nearby star or a model as an initial guess of the actual PSF and then improves on it using the quasar images themselves. In order to implement this step in a realistic manner, the "Evil" team took one actual star observed by HST WFC3/F160W and constructed a high resolution image by interpolation. This PSF image is used to carry out the simulation process described in Section 2.5. However, PSF information based on a different star was given to the "Good" teams.

Rung 3 was the most ambitious, as we aimed to increase the complexity of the deflector mass distribution, in addition to retaining the complexities of Rung 2. Assessing the effects of the complexity of the deflector mass distribution is crucial to evaluate the performance of modelling methods. For example, the mass sheet degeneracy (MSD, Falco et al. 1985) can be artificially broken by fitting a power-law model to a non power-law lens mass distribution (Schneider & Sluse 2013, 2014). The assumption of any specific mass profile can potentially result in a systematic bias in the measured Hubble constant, the magnitude of which depends on the difference between the model and the true, unknown profile. This effect has been illustrated with cosmological hydrodynamic simulations (Xu et al. 2016; Tagore et al. 2018), suggesting that a potential bias could be introduced
Figure 1. HST images of the realistic galaxies that were used in the TDLMC simulations as lensed AGN host galaxies: ESO498G5, M51, NGC1084, NGC1309, NGC1376, NGC2397, NGC278, NGC3021, NGC3259, NGC3370, NGC3949, NGC3982, NGC4319, NGC4639, NGC4911, NGC5584, NGC5806, NGC6217, NGC6503, NGC6782, NGC7252, NGC7742, PGC55493, UGC12158.

due to the MSD. In an attempt to model this, the deflector galaxies in Rung 3 are based on cosmological numerical simulations. However, this is also the most conceptually difficult step, because we do not have access to the "true" mass distribution in real galaxies. For Rung 3, the "Evil" team examined two options to produce a realistic deflector mass. The first option, following Gilman et al. (2017), is to use the surface brightness distribution of real galaxies, convert it into stellar mass, and add a dark matter component with some prescription. There are challenges to this approach; for example, it is not clear how to obtain self-consistent stellar kinematics. Thus, we discarded this option and decided (following e.g. Xu et al. 2016) to take the results of hydrodynamical simulations as our "realistic" mass distribution (specifically, Illustris (Vogelsberger et al. 2013, 2014) and the 'zoom' simulations of Frigo et al. (2019) were adopted). This method has clear advantages but also limitations. For example, the results are only as good, in terms of interpreting the real universe, as the simulations are accurate, and it is well known that simulating massive elliptical galaxies accurately is a challenge (e.g., Naab & Ostriker 2017). Furthermore, the resolution of state-of-the-art simulations is finite, and the effects of finite resolution are important at the scales relevant for strong lensing (Mukherjee et al. 2019; Enzi et al. 2020). We did not anticipate the additional numerical issues which were discovered post-unblinding and will be discussed later in the paper.

After setting the deflector redshifts as in Section 2.3.1, the Rung 3 deflector providers produced deflector maps at the corresponding redshift. These maps are at very high resolution, which is superior to
HST by a factor of 16 in pixel scale. The following maps were provided:

• mass distribution: the lensing maps include the potential map (f), the deflection angle maps (f′_x and f′_y, i.e., the first-order derivatives of f) and the Hessian maps (f′′_xx, f′′_yy and f′′_xy, i.e., the second-order derivatives of f).
• surface brightness: the "observed" R-band surface brightness is used to render the light of the deflector in the simulation in Section 2.5. This map is also used to calculate the light-weighted line-of-sight stellar velocity dispersion in Section 2.6.2.
• kinematics: the kinematic maps include the line-of-sight averaged velocity map (V_ave), which accounts for the deflector rotation (Figure 2, left panel), and the averaged velocity-dispersion map (σ_ave, see Figure 2, right panel).

In the TDLMC, the "Evil" team intends to provide mock data that are as realistic as possible. An overview of the models/configurations used to simulate the mock data was given in the challenge design paper, i.e., TDLMC1. However, for obvious reasons, some details had to be kept blind and are presented here for the first time.
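The mass products above are related by differentiation: the deflection is the gradient of the potential, and the convergence is half the trace of its Hessian, κ = (f′′_xx + f′′_yy)/2. As an illustration only (not the challenge code; the grid, names, and the toy quadratic potential are ours), these maps can be recovered from a sampled potential by central finite differences:

```python
# Recover deflection and convergence maps from a lensing potential map
# sampled on a square grid with spacing `step`, using central differences.

def lensing_maps(potential, step):
    """Return deflection maps (f_x, f_y) and convergence map
    kappa = 0.5 * (f_xx + f_yy), computed on the interior grid points."""
    n = len(potential)
    fx = [[0.0] * n for _ in range(n)]
    fy = [[0.0] * n for _ in range(n)]
    kappa = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # first derivatives -> deflection angles
            fx[i][j] = (potential[i][j + 1] - potential[i][j - 1]) / (2 * step)
            fy[i][j] = (potential[i + 1][j] - potential[i - 1][j]) / (2 * step)
            # second derivatives -> convergence
            fxx = (potential[i][j + 1] - 2 * potential[i][j] + potential[i][j - 1]) / step ** 2
            fyy = (potential[i + 1][j] - 2 * potential[i][j] + potential[i - 1][j]) / step ** 2
            kappa[i][j] = 0.5 * (fxx + fyy)
    return fx, fy, kappa

# Check against f(x, y) = (x^2 + y^2) / 2, for which the deflection is
# exactly (x, y) and the convergence is exactly 1 everywhere.
step = 0.1
grid = [[((j * step) ** 2 + (i * step) ** 2) / 2 for j in range(5)] for i in range(5)]
fx, fy, kappa = lensing_maps(grid, step)
```

For a quadratic potential the central differences are exact, which makes this toy case a convenient self-check.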
Figure 2. Example of the 2D kinematic maps for Rung 3. Left: line-of-sight mean velocity (V_ave) map, in km/s. Right: velocity dispersion (σ_ave) map, in km/s. The resolution of these maps is 16 times higher than that of HST/WFC3.

The redshifts of the mock lenses are assumed to be distributed as for typical lenses. In Rung 1 and Rung 2, the "Evil" team randomly generated their values from the normal distributions in deflector and source redshift, z_d and z_s, listed in Table 1. In Rung 3, the lensing maps are provided directly by the hydrodynamical simulations, which fixes the redshift of the source, with the deflector redshift set by the corresponding simulation snapshot.

The lensing maps are assumed to be composed of a main deflector, plus external shear and convergence. We describe the mass distribution of the deflector in this section. The deflector mass models are meant to describe typical elliptical galaxies.

In Rung 1 and Rung 2, the main deflector is assumed to follow a typical elliptical power-law mass distribution (see also Section 2.3.1 of TDLMC1), with the parameter distributions listed in Table 1. In the simulations, we first draw the SIS (singular isothermal sphere) velocity dispersion from the distribution in Table 1. Then, the corresponding Einstein radius can be calculated as R_Ein = 4π (v_d²/c²) (D_ds/D_s), where D_ds and D_s are the angular diameter distances from the deflector to the source and from us to the source, respectively.

In Rung 3, the deflector mass information is provided by the two simulating teams (X.D., M.F., and S.V.) as described above. They also provide the velocity map of the deflector, which is used to calculate the aperture velocity dispersion in Section 2.6.2, and its surface brightness (see the next section).

The surface brightness in an image comprises light from both the deflector and the lensed source.
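The SIS Einstein radius relation can be sketched numerically. The snippet below is illustrative only (function names and the example numbers are ours; in practice the distance ratio D_ds/D_s comes from the adopted cosmology and redshifts):

```python
import math

C_KM_S = 299792.458                    # speed of light [km/s]
RAD_TO_ARCSEC = 180.0 / math.pi * 3600.0

def einstein_radius_arcsec(v_d_km_s, d_ds_over_d_s):
    """SIS Einstein radius: R_Ein = 4*pi*(v_d/c)^2 * (D_ds/D_s) in radians,
    converted to arcsec. `d_ds_over_d_s` is the distance ratio D_ds/D_s."""
    theta_rad = 4.0 * math.pi * (v_d_km_s / C_KM_S) ** 2 * d_ds_over_d_s
    return theta_rad * RAD_TO_ARCSEC

# e.g. a 250 km/s deflector with D_ds/D_s = 0.7
print(round(einstein_radius_arcsec(250.0, 0.7), 2))  # 1.26
```

This shows why deflectors with velocity dispersions of a few hundred km/s produce the arcsecond-scale image separations typical of galaxy-scale lenses.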
The main deflector surfacebrightness in Rung 1 and Rung 2 is described with the widelyused Sérsic profile (as described in Section 2.2.1 in TDLMC1) Lensing maps include the potential map, the deflection map and the con-vergence map, i.e., f , f (cid:48) x , f (cid:48) y , f (cid:48)(cid:48) xx , f (cid:48)(cid:48) yy and f (cid:48)(cid:48) xy . with parameters distributed as Table 1. In Rung 3, the (relative) R-band luminosity of stellar particles as deflector light were computedbased on their age and metallicity, using the Bruzual & Charlot(2003) model. We only assume the distribution of the deflector’smagnitude given in Table 1 to normalize its total flux for the purposeof achieving a realistic signal to noise ratio.To define the realistic surface brightness distribution of theAGN host galaxy, we adopt a true high-resolution image taken from HST archive for all the blind rungs,. We first rescale the image byprojecting it on the source plane, so that it has an apparent Sérsiceffective radius drawn from Table 1. The magnitude of the sourcehost galaxy is then rescaled from the observed according to theredshift of the source.In order to obtain images similar to those used for cosmo-graphic measurements, we assume that the active nuclei have acomparable flux to that of their host galaxy, see Table 1.We vary the position of the source AGN so as to generate thelensing image in a range of configurations (including cusp , fold , cross and double ). All the mass along the line-of-sight (LOS) contributes to the deflec-tion of photons. 
In current state-of-the art analyses, this problem ismade tractable by modelling the main deflector and the most mas-sive nearby perturbers explicitly, while describing the remainingeffects to first order as external shear and convergence ( κ ext ).For simplicity, in this challenge we do not include massiveperturbers, so there are just two components, the main deflector(described above) and the LOS external shear and convergence.In Rung 1 and Rung 2, we add an external shear to the lensingpotential with typical strength and random orientation, as shown inTable 1. External shear is not added in Rung 3 in order to keep thelensing potential self-consistent with the mass. More important isthe effect of the external convergence ( κ ext ), since it affects the rela-tive Fermat potential and time delay. As mentioned in TDLMC1, weconsider the effect of κ ext by drawing from a Gaussian distributionN ( ± . ) for all the three Rungs. MNRAS , 1–23 (2020)
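The reason κ_ext matters for H0 can be made explicit with the standard first-order correction used in time-delay cosmography: a model that ignores a uniform sheet of external convergence overestimates H0 by a factor 1/(1 − κ_ext), so H0_true = (1 − κ_ext) × H0_model. The sketch below is ours, not the challenge code, and only illustrates that the scatter in κ_ext propagates almost linearly into H0:

```python
# First-order external-convergence correction for a model-inferred H0.
# H0_true = (1 - kappa_ext) * H0_model, so a 2.5% scatter in kappa_ext
# maps onto roughly 2.5% scatter in H0.

def corrected_h0(h0_model, kappa_ext):
    """Apply the external-convergence correction to a model-inferred H0."""
    return (1.0 - kappa_ext) * h0_model

# A +2.5% external convergence (one sigma of the N(0, 0.025) distribution
# used in the challenge) shifts a model value of 74.0 down by 2.5%:
print(round(corrected_h0(74.0, 0.025), 3))  # 72.15
```

This is why the challenge provides the "Good" teams with an unbiased estimate of κ_ext: without it, line-of-sight structure alone would set a few-percent floor on the H0 accuracy.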
X. Ding et al.
Table 1. Parameter distribution.

Simulation ingredient            model and parameter values

A): redshift
  deflector redshift             z_d ~ N(… ± …)
  source redshift                z_s ~ N(… ± …)

B): deflector (image plane)
  lensing galaxy mass            elliptical power law
    SIS velocity dispersion      v_d ~ N(… ± …) km/s
    Einstein radius (a)          R_Ein = 4π (v_d / c)² D_ds / D_s
    mass slope                   s ~ N(… ± …)
    ellipticity                  q ~ U(… − …)
    elliptical axis angle        φ_m ~ U(… − π)
  lensing galaxy SB              Sérsic profile
    total magnitude (b)          mag ~ U(… − …) magnitude
    effective radius             R_eff = R_Ein × U(… − …)
    Sérsic index                 n ~ U(… − …)
    ellipticity                  q ~ U(… − …)
    elliptical axis angle (c)    φ = φ_m × U(… − …)

C): AGN (source plane)
  host galaxy SB                 realistic galaxy
    total magnitude              mag ~ U(… − …) magnitude
    effective radius (d)         R_eff ~ U(…″, …″) for … < z_s < …
                                 R_eff ~ U(…″, …″) for … < z_s < …
                                 R_eff ~ U(…″, …″) for … < z_s < …
                                 R_eff ~ U(…″, …″) for … < z_s < …
  active nuclear light           scaled point source
    source-plane total flux      f_AGN = f_host × U(… − …)

external shear
  amplitude                      γ ~ U(… − …)
  shear axis angle               φ ~ U(… − π)
external convergence
  external kappa (e)             κ_ext ~ N(0 ± 0.025)

Note: The table lists the assumptions that were used to distribute the parameters for the TDLMC simulation. In Rung 3, non-parametric deflectors (i.e., lensing galaxy mass and surface brightness) are adopted; thus part B of the table does not apply to that rung. "N" denotes a normal distribution and "U" a uniform distribution. Among all the parameters shown in the table, only the redshifts (with zero observational error) and an unbiased estimate of the external convergence, κ_ext = 0 ± 0.025, were provided to the "Good" teams.
(a): With our definition, the Einstein radius lies in the range [1.″…, …″].
(b): mag = −2.5 × log10(flux) + zp, where zp is the filter zero point in the AB system; for filter WFC3/F160W, zp = … .
(e): κ_ext is randomly generated to calculate the time delay data. The parent distribution was provided to the "Good" teams, but not the actual value, to mimic real analyses.

HST-like data
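Note (a) of Table 1 defines the Einstein radius through the singular-isothermal-sphere relation θ_E = 4π (v_d / c)² D_ds / D_s. A minimal numerical sketch (the function names and the example dispersion and distance ratio are hypothetical):

```python
import math

C_KM_S = 299792.458  # speed of light in km/s

def einstein_radius_sis(sigma_v, ds_ratio):
    """SIS Einstein radius in radians: 4 pi (sigma_v / c)^2 * D_ds / D_s.

    sigma_v: velocity dispersion in km/s; ds_ratio: the dimensionless
    angular-diameter-distance ratio D_ds / D_s."""
    return 4.0 * math.pi * (sigma_v / C_KM_S) ** 2 * ds_ratio

def rad_to_arcsec(theta):
    return math.degrees(theta) * 3600.0

# hypothetical example: sigma_v = 250 km/s, D_ds / D_s = 0.6
theta_E = rad_to_arcsec(einstein_radius_sis(250.0, 0.6))
```

With these toy inputs the Einstein radius comes out near 1″, of the order expected for the simulated systems.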
Having defined the ingredients of the simulations, we adopt two independent codes to build the pipeline that produces the mock HST imaging data. We aim to simulate the image quality of typical state-of-the-art datasets, i.e. WFC3/F160W with individual exposures of 1200 s and typical background. We use astrodrizzle to co-add eight single dithered exposures to obtain the final image, with the pixel sampling improved from 0.″13 to 0.″08.

The simulation starts from actual HST images sampled at 4 times the HST resolution (i.e., 0.″13/4), as illustrated in Figure 2 of TDLMC1. To numerically define the surface brightness of these actual images, Pylens uses interpolation and Lenstronomy uses shapelet decomposition (Refregier 2003b; Birrer et al. 2015). We then rescale the image to the desired size, and the distortion by lensing is computed from the deflection angles. We convolve the image-plane surface brightness with the PSF, and add scaled PSFs at the positions of the point sources, to mimic the instrumental resolution. In Rung 1 the PSF is generated with TinyTim (Krist et al. 2011), while in Rung 2 and Rung 3 the PSFs are extracted from real HST images, and we use interpolation to obtain the PSF image at higher resolution. The pipeline is illustrated in Figure 3. Note that at this step, the images are still sampled at 4 times the HST resolution, i.e., at 0.″13/4. We select eight different patterns to rebin the image, so as to mimic the dither process. In the next step, we add the noise to the data; see Figure 1 and Figure 2 in Ding et al. (2017) for details. Finally, we use the drizzling process to co-add the eight dithered images and obtain the final drizzled image at 0.″08 sampling. We present the 48 simulated images of the three rungs in Figure 4.

In the TDLMC, the eight dithered HST images and the final drizzled images are all provided to the "Good" teams, including the science images, noise level maps and a sampled PSF image.
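The core of the pipeline (convolve the lensed host with the PSF, add scaled point sources, rebin to the detector sampling, add noise) can be sketched as follows. This is a toy version under stated assumptions: a circular Gaussian PSF and made-up fluxes stand in for the TinyTim/empirical PSFs and the real noise model, and delta-function point sources stand in for the scaled PSF images.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_psf(size, fwhm_pix):
    """Circular Gaussian PSF, normalized to unit flux (toy stand-in
    for a TinyTim or empirical PSF)."""
    sigma = fwhm_pix / 2.3548
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    psf = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return psf / psf.sum()

def convolve_fft(image, psf):
    """Circular FFT convolution; the PSF is zero-padded and recentred
    on pixel (0, 0) so the output is not shifted."""
    kernel = np.zeros_like(image)
    s = psf.shape[0]
    kernel[:s, :s] = psf
    kernel = np.roll(kernel, (-(s // 2), -(s // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

def rebin(image, factor):
    """Sum factor x factor blocks: fine sampling -> detector sampling."""
    n = image.shape[0] // factor
    return image.reshape(n, factor, n, factor).sum(axis=(1, 3))

# high-resolution lensed host (toy: a Gaussian blob), plus two AGN images
fine = np.zeros((128, 128))
yy, xx = np.mgrid[:128, :128]
fine += 50.0 * np.exp(-((xx - 70) ** 2 + (yy - 60) ** 2) / (2 * 6.0 ** 2))
fine = convolve_fft(fine, gaussian_psf(25, fwhm_pix=4.0))
for (y0, x0, flux) in [(40, 40, 400.0), (90, 95, 300.0)]:
    fine[y0, x0] += flux  # in the real pipeline, scaled PSF images are added
obs = rebin(fine, 4)                            # 4x coarser detector sampling
obs = obs + rng.normal(0.0, 0.5, obs.shape)     # toy background + read noise
```

The block-sum rebinning conserves flux, so the total counts in `obs` match the fine-grid image up to the added noise.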
In addition to the HST imaging data, the "Evil" team provides time delays and aperture stellar velocity dispersions, computed as described in this section.
The true time delays between the lensed AGN images are calculated, once the values of the simulated parameters are given, using

Δt_ij = (D_Δt / c) [φ(θ_i) − φ(θ_j)], (1)

where θ_i and θ_j are the coordinates of images i and j in the image plane, φ(θ_i) is the Fermat potential at image i, and D_Δt is the so-called time-delay distance, defined through

φ(θ_i) = (θ_i − β)² / 2 − ψ(θ_i), (2)

D_Δt ≡ (1 + z_d) D_d D_s / D_ds, (3)

where D_d, D_s and D_ds are respectively the angular diameter distances from the observer to the deflector, from the observer to the source, and from the deflector to the source. We consider the effect of κ_ext on the observed time delay
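Equations (1)–(3) translate directly into code. The sketch below uses hypothetical image positions, potential values and D_Δt, with the lens potential ψ supplied as a pre-computed number at each image:

```python
import math

C_KM_S = 299792.458          # speed of light in km/s
MPC_KM = 3.0857e19           # kilometres per megaparsec
DAY_S = 86400.0
ARCSEC_RAD = math.pi / (180.0 * 3600.0)

def fermat_potential(theta, beta, psi):
    """Eq. (2): phi = |theta - beta|^2 / 2 - psi, with angles in
    arcsec and the lens potential psi given in arcsec^2."""
    return 0.5 * ((theta[0] - beta[0]) ** 2 + (theta[1] - beta[1]) ** 2) - psi

def time_delay_days(D_dt_mpc, phi_i, phi_j):
    """Eq. (1): dt = (D_dt / c) * (phi_i - phi_j), converted to days."""
    dphi = (phi_i - phi_j) * ARCSEC_RAD ** 2       # arcsec^2 -> rad^2
    return D_dt_mpc * MPC_KM / C_KM_S * dphi / DAY_S

# hypothetical numbers: D_dt = 5000 Mpc, source at beta = (0.1, 0)
phi_A = fermat_potential((1.2, 0.0), (0.1, 0.0), psi=0.9)
phi_B = fermat_potential((-1.0, 0.0), (0.1, 0.0), psi=0.8)
dt_AB = time_delay_days(5000.0, phi_A, phi_B)
```

With these toy inputs the delay comes out at the level of a couple of weeks, i.e. the "dozens of days" scale quoted later for the simulated systems.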
Figure 3. Illustration of the generation of a mock lensed AGN image, using Lenstronomy (top) and Pylens (bottom); the panels show the host in the source plane, the host in the image plane, the convolved image, and the image with the point sources added. The image is sampled based on HST WFC3/F160W at 4 times higher resolution (i.e., 0.″13/4).

by rescaling the true values as

Δt_obs = (1 − κ_ext) Δt_true. (4)

Note that the true value of κ_ext is assumed to be zero; it is the measured value of κ_ext that is scattered as N(0, 0.025). The effect of adding such a κ_ext is equivalent to adding a perturbation to the observed time delays through Eq. (4). In principle, the external convergence should also shift the Einstein radius, which we did not consider in the TDLMC for simplicity. That is, κ_ext is taken only as a pure scatter effect on the time delay, and hence on H_0. Assuming zero bias on the time delay, we add a random error equal to the larger of 1% and 0.25 days. We deliberately keep the uncertainties on the time delays as small as in the very best cases, in order not to obfuscate lens modelling errors.

The aperture stellar velocity dispersion is helpful to break the mass-sheet degeneracy (Falco et al. 1985; Treu & Koopmans 2002b). The integrated line-of-sight velocity dispersion is computed as the second moment of the velocity distribution, weighted by surface brightness, in a square aperture of 1.″0 × 1.″0, similar to the standard aperture used for real systems. Seeing conditions are also chosen to mimic the best current ground-based systems, idealized as a Gaussian kernel with a full width at half maximum (FWHM) of 0.″6.

In Rung 1 and Rung 2 the deflector mass distribution is simply parameterized. Following current practice (e.g., Shajib et al. 2018), we assume that the mass distribution is related to the velocity dispersion profile through the spherical Jeans equation:

(1 / l(r)) d(l σ_r²)/dr + 2 β_ani(r) σ_r² / r = −G M(≤ r) / r², (5)

where l(r) is the luminosity density of the deflector galaxy, σ_r is the radial velocity dispersion and β_ani(r) is the anisotropy profile, defined as

β_ani(r) = 1 − σ_t² / σ_r², (6)

where σ_t is the tangential velocity dispersion. The observed line-of-sight velocity dispersion is surface-brightness weighted, and can thus be calculated by solving the equation of Mamon & Łokas (2005),

I(R) σ_los²(R) = 2G ∫_R^∞ k(r/R, r_ani/R) l(r) M(r) dr / r, (7)

where I(R) is the deflector surface brightness. We adopt the Osipkov-Merritt parametrization of the anisotropy, β_ani(r) = r² / (r² + r_ani²) (Osipkov 1979; Merritt 1985c,a), with the function k(u, u_ani) given by

k(u, u_ani) = (u_ani² + 1/2) / (u_ani² + 1)^{3/2} × ((u² + u_ani²) / u) × tan⁻¹ √((u² − 1) / (u_ani² + 1)) − (1/2) / (u_ani² + 1) × √(1 − 1/u²). (8)

The anisotropy radius r_ani is usually considered a free parameter with a size comparable to the effective radius. In the simulation, we assume r_ani = R_eff to calculate the velocity dispersion.

In Rung 3, the velocity dispersion is provided by the hydrodynamical simulations via high-resolution maps (16 times higher than HST); see Section 2.2. The aperture stellar velocity dispersion is then a combination of the two kinematic maps, V_aper = √(V_ave² + σ_ave²), where V_ave and σ_ave are the line-of-sight (LOS) mean velocity and velocity dispersion shown in Figure 2. The "Evil" team calculates the 2D surface-brightness-weighted line-of-sight dispersion and convolves it with a FWHM 0.″6 Gaussian kernel.
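The Rung 3 aperture combination, V_aper = √(V_ave² + σ_ave²) with surface-brightness weighting, can be sketched as below. The maps are toy inputs (flat brightness, solid-body rotation, constant dispersion), and the seeing convolution is omitted for brevity:

```python
import numpy as np

def aperture_dispersion(I, v_los, sigma_los, mask):
    """Surface-brightness-weighted LOS second moment inside an
    aperture: sqrt(<v^2 + sigma^2>_I), i.e. the mean velocity and the
    dispersion added in quadrature."""
    w = I[mask]
    second = np.sum(w * (v_los[mask] ** 2 + sigma_los[mask] ** 2)) / np.sum(w)
    return np.sqrt(second)

# toy maps: flat brightness, +/-50 km/s rotation, sigma = 200 km/s
n = 64
I = np.ones((n, n))
x = np.linspace(-1.0, 1.0, n)
v = np.tile(50.0 * x, (n, 1))        # solid-body rotation along one axis
sig = np.full((n, n), 200.0)
yy, xx = np.mgrid[:n, :n]
mask = (xx - n / 2) ** 2 + (yy - n / 2) ** 2 < (n / 3) ** 2  # round aperture
v_aper = aperture_dispersion(I, v, sig, mask)
```

Because the rotation enters in quadrature, a modest mean velocity barely inflates the aperture dispersion above σ itself, which is why the two maps can be combined this way.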
(Panels (a)-(c): Rung 1, Rung 2 and Rung 3 imaging data; a 1″ scale bar is shown in each cutout.)
Figure 4.
Mock data provided for Rung 1, Rung 2 and Rung 3. The configurations from left to right are cross, cusp, fold and double. The images belonging to the same rung are shown with the same stretch to facilitate visual comparisons.

The surface-brightness-averaged velocity dispersion in the aperture was then computed. Note that, in principle, the surface-brightness weighting should be applied before the convolution and aperture selection. However, the velocity map and the surface brightness map are both convolved with the same Gaussian kernel, making the order of these operations irrelevant. For illustration, the velocity dispersion as a function of aperture size is shown in Figure 5. A random Gaussian noise with 5% standard deviation is added to the model velocity dispersion to represent high-quality measurement errors.

The "Good" teams submitted their modelled H_0 for each lens system in the three rungs, and the "Evil" team defined four standard metrics to estimate the performance of the submissions: efficiency (f), goodness (χ²), precision (P) and accuracy (A). They are defined as follows:

f = N / N_total, (9)

χ² = (1/N) Σ_i ((H̃_0,i − H_0) / δ_i)², (10)

P = (1/N) Σ_i δ_i / H_0, (11)

A = (1/N) Σ_i (H̃_0,i − H_0) / H_0, (12)

where N is the number of successfully modelled systems in each submission, N_total = 16, and δ_i is the uncertainty (1σ level) of H_0 for each system in the submission.
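Equations (9)-(12) are straightforward to evaluate for a submission; the example numbers below are hypothetical:

```python
import numpy as np

def tdlmc_metrics(H0_est, H0_err, H0_true, N_total):
    """Efficiency f, goodness chi^2, precision P and accuracy A for one
    submission (Eqs. 9-12).

    H0_est, H0_err: submitted values and 1-sigma errors for the N
    successfully modelled systems; H0_true: the truth; N_total: the
    number of systems in the rung."""
    H0_est = np.asarray(H0_est, dtype=float)
    H0_err = np.asarray(H0_err, dtype=float)
    N = H0_est.size
    f = N / N_total
    chi2 = np.mean(((H0_est - H0_true) / H0_err) ** 2)
    P = np.mean(H0_err / H0_true)
    A = np.mean((H0_est - H0_true) / H0_true)
    return f, chi2, P, A

# toy submission: 4 of 16 systems modelled, truth H0 = 70 (hypothetical)
f, chi2, P, A = tdlmc_metrics([68.0, 71.0, 70.0, 73.0],
                              [2.0, 2.0, 2.0, 2.0], 70.0, 16)
```

Note that A is a signed mean, so positive and negative deviations can cancel; χ² is what flags over- or under-estimated error bars.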
Figure 5.
Surface-brightness-weighted line-of-sight stellar velocity dispersion as a function of aperture radius, based on a lens system in Rung 3. The stellar velocity dispersion is composed of the LOS mean velocity (i.e., v_ave) and the LOS velocity dispersion (i.e., σ_ave), added in quadrature. The Einstein radius and the effective aperture size are also shown as blue and red lines, respectively.

We identified the following targets for the metrics, based on the current state of the art of the analyses:

0.… < χ² < …, (13)

P < …, (14)

|A| < … . (15)

The χ² metric target aims to ensure that the estimated errors are a reasonable measure of the deviation from the truth. The P metric target is chosen to represent the precision of the best current measurements. The A metric target is set to investigate whether the fast methods can keep biases below the precision currently reported by state-of-the-art analyses of samples of a few lenses. We do not set a metric target for f, since deciding which systems can be analyzed with sufficient confidence depends on the methodology employed, and we thus expect it to vary widely across submissions.

The TDLMC mock data were released on 2018 January 8th. The deadlines of the blind submissions for the three rungs were: 2018 September 8th for Rung 1; 2019 April 8th for Rung 2; 2019 September 8th for Rung 3. Each rung was unblinded a few days after the submission deadline, to give teams a chance to learn in real time during the challenge. The "Evil" team was especially mindful to help the "Good" teams detect bugs and glitches that could invalidate the subsequent blind rungs and prevent the teams from learning about their ability to tackle increased complexity.

Prior to the Rung 3 deadline, the "Evil" team received in total 15, 17 and 24 submissions for Rung 1, Rung 2 and Rung 3, respectively, from five different participating teams ("Good" teams). We describe the method adopted by each team in the rest of this section.
H. Tak
This team proposes the following posterior density of H_0, designed to combine information from multiple lens systems in a simple but statistically principled way:

π(H_0 | D) ∝ L(H_0) h(H_0). (16)

The notation D denotes the set of time delay estimates, Fermat potential difference estimates, and their standard errors for all unique pairs of lensed images in the 16 systems. Also, L(H_0) represents the likelihood function and h(H_0) indicates a Uniform(20, 120) prior density. This proper uniform prior guarantees propriety of the resulting posterior (Tak et al. 2018). The team derives the likelihood function from a Gaussian assumption on the Fermat potential difference estimate (Marshall et al., 2020, in preparation):

φ_est,ijk | Δ_ijk, H_0 ~ N( φ_ijk = c Δ_ijk / D_Δt(H_0), σ²(φ_est,ijk) ), (17)

where φ_est,ijk denotes the Fermat potential difference estimate of the i-th and j-th lensed images in the k-th lens system and σ(φ_est,ijk) indicates its standard error (1σ uncertainty). The notation Δ_ijk is the time delay between the i-th and j-th lensed images in the k-th system (Δ_ijk = −Δ_jik). The time-delay distance D_Δt(H_0) is treated as a function of H_0 only, because all other information is completely given in the TDLMC.

On top of this Gaussian assumption on φ_est,ijk, the team adopts another Gaussian distribution for the time delay Δ_ijk, with mean equal to Δ_est,ijk and standard error σ(Δ_est,ijk), i.e.,

Δ_ijk ~ N( Δ_est,ijk, σ²(Δ_est,ijk) ). (18)

The team also assumes that Δ_ijk and H_0 are independent a priori, in the sense that Δ_ijk is typically inferred from light curves of multiply-lensed images without any information about H_0 (Tak et al. 2017).

The Gaussian assumptions in (17) and (18) make it simple to integrate out Δ_ijk analytically from their joint distribution, leading to the Gaussian distribution of φ_est,ijk given only H_0:

φ_est,ijk | H_0 ~ N( c Δ_est,ijk / D_Δt(H_0), c² σ²(Δ_est,ijk) / D_Δt(H_0)² + σ²(φ_est,ijk) ). (19)

The team also assumes conditional independence among the Fermat potential difference estimates, within and across lens systems, given the Hubble constant H_0. The likelihood function of H_0 is then the product of the Gaussian densities specified in (19) over every unique pair of gravitationally lensed images i and j across the 16 lensed systems.

Since the posterior density function of H_0 in (16) is a function of H_0 only, it is easy to draw an i.i.d. sample from this posterior via grid sampling (Chapter 5, Gelman et al. 2013).

On top of the posterior π(H_0 | D), the team models κ_ext using the relationship H_0^ext = (1 − κ_ext) H_0, where H_0^ext is the Hubble constant with κ_ext considered and H_0 is the one without κ_ext considered (Rusu et al. 2017). The team puts a N(0, 0.025) prior on κ_ext for simplicity, which is assumed to be independent of the data. Finally, the posterior distribution of H_0^ext is derived as

π(H_0^ext | D) = ∫ π(H_0^ext | D, κ_ext) g(κ_ext) dκ_ext, (20)

where g denotes the N(0, 0.025) density of κ_ext. The posterior distribution of H_0^ext in (20) is sampled via Monte Carlo integration: (i) draw a random sample of κ_ext from N(0, 0.025); (ii) sample H_0 from (16); (iii) lastly, set H_0^ext = (1 − κ_ext) H_0. A Jacobian term is not needed for a deterministic transformation within a Bayesian sampling framework (Tak et al. 2020). The proposed framework does not account for the lens velocity dispersion of each lens system.
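The grid-sampling step and the κ_ext Monte Carlo correction (Eqs. 16, 19 and 20) can be sketched as follows. Since all cosmological parameters except H_0 are fixed in the challenge, D_Δt(H_0) = K/H_0 for some constant shape factor, and that scaling is the only property the toy below relies on; all numbers (K, the estimates and their errors) are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: D_dt(H0) = K / H0 with K a fixed (hypothetical) shape factor.
K = 7.0e4

H0_true = 72.0                                  # synthetic truth
dt_est = np.array([20.0, 35.0, -14.0])          # time-delay 'estimates'
dt_err = np.array([0.25, 0.35, 0.25])
phi_err = np.array([5e-4, 5e-4, 5e-4])
phi_est = dt_est * H0_true / K + rng.normal(0.0, phi_err)

# grid sampling of Eq. (16): Uniform(20, 120) prior, Eq. (19) likelihood
# (mean dt * H0 / K, variance inflated by the delay uncertainty)
grid = np.linspace(20.0, 120.0, 2001)
mean = np.outer(grid, dt_est) / K
var = (np.outer(grid, dt_err) / K) ** 2 + phi_err ** 2
logL = -0.5 * np.sum((phi_est - mean) ** 2 / var + np.log(var), axis=1)
post = np.exp(logL - logL.max())
post /= post.sum()

# Eq. (20) via Monte Carlo: draw H0 from the grid posterior, then set
# H0_ext = (1 - kappa_ext) * H0 with kappa_ext ~ N(0, 0.025)
H0_draws = rng.choice(grid, size=20000, p=post)
H0_ext = (1.0 - rng.normal(0.0, 0.025, H0_draws.size)) * H0_draws
```

Because the posterior lives on a one-dimensional grid, sampling reduces to a weighted draw over grid points, which is what makes this combination scheme cheap.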
The key to the proposed approach is to obtain D, to be used as the conditioning information of the posterior distribution in (20), because given D it is simple to draw a random sample of H_0. The team notes again that D is composed of time delay estimates Δ_est,ijk, their standard errors σ(Δ_est,ijk), Fermat potential difference estimates φ_est,ijk, and their standard errors σ(φ_est,ijk). The first two components are fully known in the TDLMC, and thus the remaining ingredients for sampling H_0^ext from (20) are the φ_est,ijk and σ(φ_est,ijk).

For this purpose, the team uses Lenstronomy (version 0.4.3, Birrer & Amara 2018). In Rung 1, the team uses the elliptical Sérsic profile for the source light model and adopts one, two, or three elliptical Sérsic profiles for the lens light model. In Rungs 2-3, the team utilizes a superposition of a smooth power-law elliptical mass density profile (SPEMD) with external shear for the lens mass model. An elliptical Sérsic profile with shapelets (Birrer et al. 2015) is adopted for the source light model, and an elliptical Sérsic profile is used for the lens light model. The team fixed n_max = 10 as the order of the shapelets basis for the baseline model. Also, the team makes use of the PSF iteration to correct the PSF model (Shajib et al. 2019). In addition, the team manually boosts the noise level by adopting one of seven different PSF error inflation rates (1%, 5%, 10%, 15%, 20%, 25%, 30%) to deal with additional errors in the given PSF. This means that for each unique pair of lensed images, the team fits the model with Lenstronomy seven times, each time with one of the seven PSF error inflation rates.

For each of the seven fits, Lenstronomy produces a posterior sample of φ_ijk that is possibly non-Gaussian. Thus, to obtain φ_est,ijk and σ(φ_est,ijk), the team summarizes the posterior distribution in two ways: posterior mean and standard deviation (Summary 1); posterior median and quantile-based standard error (Summary 2). This is because the posterior mean and standard deviation can be misleading if the posterior distribution of φ_ijk is not Gaussian.

Consequently, for each pair of lensed images the team obtains seven pairs (φ_est(l),ijk, σ(φ_est(l),ijk)) for l = 1, ..., 7, according to each type of summary. Since D requires only one representative pair (φ_est,ijk, σ(φ_est,ijk)) for each pair of lensed images, the team averages these seven pairs in three ways. The first is a Fisher-type weighted average of the φ_est(l),ijk, weighted by 1/σ²(φ_est(l),ijk) (Average 1); this puts more weight on the pairs with smaller standard errors. The second simply takes an arithmetic mean over the seven estimates and over the seven variances (Average 2); this puts equal weight on all seven pairs regardless of their different standard errors. Finally, the third uses the same arithmetic mean as Average 2 but sets σ²(φ_est,ijk) to the sample variance of the seven estimates φ_est(l),ijk (Average 3); this one does not use the information about the standard errors at all. The team briefly describes the details of each submission in Table 2. Due to space limitations, the detailed information on the lens modelling settings will be presented in a separate paper (Tak et al., in prep).

M. Millon, A. Galan, F. Courbin, V. Bonvin
Table 2. The details of the submissions of the Student-T team. Summaries 1, 2 and Averages 1, 2, 3 are defined in Section 3.1.

Rung  Algorithm  Details
1     1          Summary 1 and Average 1
      2          Summary 1 and Average 2
      3          Summary 1 and Average 3
      4          The same as Algorithm 1, except that three pairs are intentionally removed for consistency
      5          The same as Algorithm 2, except the three pairs
      6          The same as Algorithm 3, except the three pairs
2     1          Summary 1 and Average 1
      2          Summary 1 and Average 2
      3          Summary 2 and Average 1
      4          Summary 2 and Average 2
      5          An independent replication of Algorithm 1
3     1          Summary 1 and Average 1
      2          Summary 1 and Average 2
      3          Summary 2 and Average 1
      4          Summary 2 and Average 2
      5          The same as Algorithm 1 with three times more repetitions (i.e., 21 pairs instead of 7)
      6          The same as Algorithm 2 with 21 pairs
      7          The same as Algorithm 3 with 21 pairs
      8          The same as Algorithm 4 with 21 pairs
      9          The same as Algorithm 5 but without considering κ_ext, i.e., sampling from (16) instead of (20)
      10         The same as Algorithm 6 but sampling from (16)
      11         The same as Algorithm 7 but sampling from (16)
      12         The same as Algorithm 8 but sampling from (16)

The EPFL team followed a streamlined version of current modelling practices applied to time-delay cosmography. The main difference with respect to the analyses described by Birrer et al. (2019) and Shajib et al. (2020) is that the challenge is known to be free of significant perturbers besides the main deflector and the line of sight. Taking advantage of this information, and to reduce computational costs, a smaller number of model choices was considered in the challenge than for real systems. In addition, in order to reduce human investigator time, the modelling was standardized, as opposed to tailored to the specifics of each individual lens. For this purpose, a partly automated modelling pipeline was developed by the team. A more detailed description of the pipeline may be the subject of a future paper.
The standardization is a necessary step towards modelling large numbers of systems, but it may result in failures if the one-size-fits-all approach is not (yet) sufficiently accurate.

For the modelling part, the team used the publicly available software Lenstronomy (Birrer & Amara 2018). This software is well validated and has been previously used for the modelling and cosmographic analysis of real time-delay strong lens systems (Birrer et al. 2016, 2019; Shajib et al. 2020). The entire challenge data set was used to constrain the models, including the provided drizzled image, noise maps and PSF; the measured time delays at the lensed AGN positions, Δt_measu; the measured LOS velocity dispersion of stars in the lens galaxy, σ_los,measu; and the estimate of the external convergence, κ_ext. The models are described by linear (surface brightness amplitudes) and non-linear parameters, depending on the type of profiles (see Birrer et al. 2015, for details). The team chose to add the time-delay distance D_Δt as a free non-linear parameter.
For a single system, the generic workflow, from lens modelling up to the H_0 inference, can be divided into the following steps.
1) Parameter optimization and sampling
First, linear and non-linear parameters are optimized by alternating Particle Swarm Optimization (PSO) runs and increments of the complexity of the lens models. Parameters are sampled from uniform priors, ensuring that all lenses can be modelled from the same initial set of priors. The time-delay distance D_Δt, considered as a free non-linear parameter of the model, is constrained by the measured time delays Δt_ij,measu, by enforcing the modelled time delays to be compatible with the measured ones. Modelled time delays Δt_ij,model are computed as follows:

Δt_ij,model = (1 + z_d) (D_Δt / c) ΔΦ_ij,model, (21)

where z_d is the lens redshift, Φ_model is the model Fermat potential, c is the speed of light, and "ij" denotes the difference of the indicated quantity evaluated at the positions of the two lensed AGN images i and j. This procedure gives best-fit estimates of the linear and non-linear parameters, which are then used as the starting point of an MCMC sampling. Both the PSO and MCMC routines are implemented in Lenstronomy, based on the CosmoHammer package (Akeret et al. 2013) and emcee (Foreman-Mackey et al. 2012).
2) Kinematics and angular diameter distances
For each MCMC sample, the team derived in a post-processing step the LOS velocity dispersion σ_los,model from the model parameters. The team used the Osipkov-Merritt model to solve the spherical Jeans equation, again following current practice (e.g. Suyu et al. 2010a; Shajib et al. 2018), with routines implemented in Lenstronomy. The team computed angular diameter distances from both kinematics and time delays. The sampled time-delay distance directly gives the distance ratio D_d D_s / D_ds. The modelled LOS velocity dispersion, along with the model parameters ξ_model, is used to compute the distance ratio D_s / D_ds from the following relation (Birrer et al. 2016):

σ²_los,model = (D_s / D_ds) c² J(ξ_model, r_ani), (22)

where J captures all dependencies on the model parameters and kinematic anisotropy, moving any dependence on cosmological parameters into the distance ratio. The external convergence was also sampled as κ_ext ~ N(0, 0.025), to simulate a correction to the time-delay distance by any mass external to the main deflector, through D_Δt,eff = D_Δt / (1 − κ_ext). From the two distance ratios described above, it is straightforward to extract the angular diameter distance to the deflector, namely D_d.
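Inverting Eq. (22) for D_s/D_ds and combining it with the definition D_Δt = (1 + z_d) D_d D_s/D_ds yields D_d. A sketch with hypothetical values for σ_los, the dimensionless J term and D_Δt:

```python
C_KM_S = 299792.458  # speed of light in km/s

def ds_dds_from_kinematics(sigma_los, J):
    """Eq. (22) inverted: D_s / D_ds = sigma_los^2 / (c^2 J).
    J is the dimensionless model-and-anisotropy term (toy value below)."""
    return sigma_los ** 2 / (C_KM_S ** 2 * J)

def D_d_from_ratios(D_dt, z_d, ds_dds):
    """D_dt = (1 + z_d) * D_d * (D_s / D_ds)  =>  solve for D_d."""
    return D_dt / ((1.0 + z_d) * ds_dds)

# hypothetical numbers: sigma_los = 250 km/s, J = 5e-7, D_dt = 5000 Mpc
ratio = ds_dds_from_kinematics(250.0, 5.0e-7)
D_d = D_d_from_ratios(5000.0, 0.5, ratio)
```

This is why the kinematics matter: D_Δt alone only fixes the product D_d D_s/D_ds, while the dispersion supplies the second ratio needed to isolate D_d.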
3) Cosmography inference for an individual system
Following Birrer et al. (2019), the inference of the Hubble constant is performed in the 2D plane defined by the angular distances D_Δt,eff and D_d. This plane encodes the joint constraints from the imaging data, time delays, external convergence and lens kinematics. In order to approximate the full covariance between the D_Δt,eff and D_d posteriors, both distributions are used to evaluate the likelihood when inferring H_0. Since Ω_m is fixed in this challenge, the only cosmological parameter being sampled is the Hubble constant.
4) Joint cosmology inference for an entire rung
The team computed the final inferred H_0 value and associated uncertainty estimate for an entire rung in two steps. First, an outlier rejection scheme was applied, according to the following criteria, which were found to be good markers of poor models:

- each individual H_0 median value must be inside the prior bounds defined by the TDLMC, i.e. inside […, …] km s⁻¹ Mpc⁻¹;
- the sampled time-delay distance D_Δt (a free parameter constrained by the lens model and time delays) and the modelled time-delay distance D_Δt,model (obtained through inversion of Eq. 21) must be consistent with each other at the ≲ σ level;
- the modelled lens velocity dispersion σ_los,model must be consistent at the ≲ σ level with the measured value;
- each individual H_0 posterior must be consistent with the others at the ≲ σ level.

When all the above criteria were fulfilled, the team kept the model for the joint inference over the rung, for a given model family. This leads to a set of pairs of D_Δt and D_d posteriors. The team then performed two joint inferences using:

- only the time-delay information: H_0 is sampled according to the ensemble of D_Δt posteriors only;
- both the time-delay and kinematics information: following the approach described in Birrer et al. (2019), H_0 is sampled in the 2D plane over the set of D_Δt and D_d posteriors. This last option is the standard procedure used for the joint inference of real lenses (e.g., Wong et al. 2020).

Note that even in the first case, inferring H_0 from D_Δt only, knowledge of the kinematics still plays a (smaller) role, because model selection steps are performed before the inference. The joint H_0 posteriors described above are computed under the assumption that the systems do not share systematic errors. If this assumption breaks down, then one should marginalize over the individual distributions instead of performing the joint inference. For this reason, the team also submitted H_0 posteriors that are marginalized over the selected models.
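The per-system rejection cuts can be sketched as a simple filter. The prior bounds and the σ threshold below are placeholders (the exact values are not spelled out here), and the fourth criterion (mutual consistency of the H_0 posteriors) needs the whole ensemble, so it is only noted in a comment:

```python
def passes_cuts(H0_median, D_dt_sampled, D_dt_model, D_dt_err,
                sigma_model, sigma_measu, sigma_err,
                prior_lo=20.0, prior_hi=120.0, n_sigma=2.0):
    """Sketch of the first three per-system rejection cuts; prior_lo,
    prior_hi and n_sigma are hypothetical placeholders. The fourth cut
    (each H0 posterior consistent with the others) must be applied to
    the full set of systems, outside this per-system function."""
    if not (prior_lo < H0_median < prior_hi):
        return False  # H0 median outside the challenge prior bounds
    if abs(D_dt_sampled - D_dt_model) > n_sigma * D_dt_err:
        return False  # sampled vs modelled time-delay distance disagree
    if abs(sigma_model - sigma_measu) > n_sigma * sigma_err:
        return False  # modelled vs measured velocity dispersion disagree
    return True
```

Only systems passing every cut contribute their (D_Δt, D_d) posterior pair to the joint inference.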
Additional details specific to each rung are given in the following subsections.

In Rung 1, lens mass and light profiles are simply parametrized. Hence the team used a singular power-law elliptical mass distribution (SPEMD; Barkana 1998) with an external shear profile to describe the projected mass distribution, and a single Sérsic profile for the lens surface brightness. For the source, the team used a Sérsic profile superimposed on a set of shapelets (Refregier 2003a; Birrer et al. 2015). The team chose an initial shapelet order n_max and slightly increased it when required by the source complexity, typically up to n_max = 14. The source-galaxy centroid (Sérsic + shapelets) was fixed to the position of the quasar, itself modelled as a single point source constrained by enforcing the lensed images to trace back to the same position in the source plane.

The "Evil" team kept secret any details related to the kinematics modelling assumptions, including the anisotropy model they used for computing velocity dispersions. As stated above, the EPFL team used Osipkov-Merritt modelling for computing velocity dispersions (Osipkov 1979; Merritt 1985b). This model assumes a parametrized anisotropy profile β_ani(r) = r² / (r² + r²_ani), where r_ani is the anisotropy radius, which defines the radius at which stellar orbits go from being isotropic (equally radial and tangential) near the centre to radial at large radii. Standard practice is to sample the anisotropy space through a uniform prior on the anisotropy radius, see e.g. Suyu et al. (2012); Shajib et al. (2018). In Rung 1, the team used a uniform prior on r_ani in units of r_eff, the half-light radius of the lens.

The unblinding of Rung 1 revealed that the team's submitted inference was strongly affected by one (or several) systematic error(s), as quantified by the accuracy metric A. The simulated image separations are of the order of an arcsecond, and the time delays are of the order of dozens of days with a precision of 0.25 days. Typical lensed systems modelled by the TDCOSMO collaboration have, on average, image separations of a similar order, but time-delay precisions of a couple of days. A particularly high precision is therefore required when modelling the position of each lensed image in the setting of the challenge, which is not the case for all real systems analysed so far. A lack of precision can propagate to a significant bias on the Hubble constant. The bias observed in the initial Rung 1 submission highlighted this requirement, which has been the topic of a dedicated paper by Birrer & Treu (2019).
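The Osipkov-Merritt anisotropy profile used above is straightforward to evaluate; a minimal sketch (the function name is illustrative, not from the team's pipeline):

```python
import numpy as np

def beta_ani(r, r_ani):
    """Osipkov-Merritt anisotropy profile: beta(r) = r^2 / (r^2 + r_ani^2).

    beta -> 0 (isotropic orbits) for r << r_ani and
    beta -> 1 (purely radial orbits) for r >> r_ani.
    """
    r = np.asarray(r, dtype=float)
    return r**2 / (r**2 + r_ani**2)
```

The anisotropy radius r_ani thus sets the transition scale; sampling a prior on r_ani in units of r_eff, as the team did, simply rescales this transition relative to the light distribution.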
The authors introduced simple formulae that, given an expected precision on the Hubble constant, can be used at first order to estimate the astrometric requirements that must be fulfilled, from the image separations and time-delay precisions. They refer the reader to that paper for the consequences of such requirements and quantitative examples. As discussed in Section 4.2, the problem was solved by the EPFL team by introducing in Lenstronomy a nuisance parameter to describe the unknown difference between true and measured image positions and marginalizing over it. For Rung 1, the team submitted a single sample of models, and the related joint Hubble-constant value, following the description above.

In Rung 2, only a guess of the PSF was provided, in order to test PSF-reconstruction algorithms. The team used the iterative PSF-reconstruction routine originally implemented in Lenstronomy. For a set of baseline models, the team incorporated this routine during parameter optimization, effectively alternating between PSO and PSF reconstructions. Having noticed that the PSF was degraded the same way for each of the 16 lenses of Rung 2, the team computed a median-stacked PSF kernel from their best reconstructed kernels. This reconstructed PSF was then used for all of their subsequent Rung 2 modelling attempts. Based on Rung 1 knowledge, the team took into consideration the astrometric requirements described in the previous subsection, in order to mitigate a potential bias on the inferred Hubble constant. The team allowed extra degrees of freedom to model any unknown uncertainty on the positions of the AGN images (a.k.a. point sources), by introducing into the parameter space two new "offset" parameters, δx and δy, for each of the 2 or 4 images independently.
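The median stacking of reconstructed kernels described above can be sketched as follows (a simplified illustration, not the team's Lenstronomy routine; array names are hypothetical):

```python
import numpy as np

def median_stack_psf(kernels):
    """Median-combine a set of reconstructed PSF kernels.

    kernels: array of shape (n_lenses, ny, nx), one reconstructed
    kernel per lens, all on the same pixel grid. Each kernel is
    normalized to unit flux before the pixelwise median, and the
    stacked kernel is re-normalized afterwards. The median is robust
    against reconstruction artefacts present in only a few kernels.
    """
    kernels = np.asarray(kernels, dtype=float)
    kernels = kernels / kernels.sum(axis=(1, 2), keepdims=True)
    stacked = np.median(kernels, axis=0)
    return stacked / stacked.sum()
```

Because the PSF degradation was identical across the 16 lenses, the pixelwise median averages down noise in each individual reconstruction while rejecting lens-specific residuals.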
These offsets represent the error between the (modelled) positions of the point sources on the image and the (predicted) positions at which the Fermat potential is evaluated for the time-delay computation. These additional parameters are sampled as non-linear parameters, and constrained by the time delays and imaging data. The team regularly checked that those offsets were correctly constrained, with amplitudes expected to be below the image pixel scale. After careful analysis post-unblinding of Rung 1, the team realized that the most consistent results were obtained when r_ani ≈ r_eff. Consequently, in Rung 2, the team fixed the anisotropy radius r_ani to be equal to the lens half-light radius for all the remaining submissions. The remaining volume of the parameter space (mass and light profiles of the lens galaxy, light profile of the source galaxy, and quasar model) was identical to that of the previous rung.

The team submitted 4 model samples and corresponding joint values for Rung 2:
• DdDdt: the inferred H_0 was obtained through joint inference in the 2D plane {D_Δt,eff, D_d};
• margDdDdt: same as DdDdt, except that the inferred final value was obtained by marginalization over the individual H_0 posteriors, as opposed to a joint inference;
• Ddtonly: same as DdDdt, except that the H_0 values were inferred only from the time-delay distance D_Δt,eff;
• margDdtonly: same as Ddtonly, except that the inferred final value was obtained by marginalization over the individual H_0 posteriors.

For Rung 3, the team used the exact same PSF-reconstruction method as for Rung 2. For the lens models, they followed the practices of the TDCOSMO collaboration, in the sense that they chose two families of models: power-law and composite. The former consists of an elliptical power-law mass distribution with external shear, whereas the latter distinguishes the baryonic mass and the dark matter, in addition to the external shear. For the baryonic matter they used a double Chameleon profile (see Suyu et al. 2014, for a definition) to fit the lens surface brightness, and converted it to surface mass density through a constant mass-to-light ratio, introduced as a free parameter. They modelled the dark-matter component as a single elliptical NFW profile. In order to improve their efficiency in modelling Rung 3 with two model families, which required a significant amount of work, they also used double Chameleon profiles to describe the lens light in their power-law models. This allowed them to extract best-fitting lens-light parameters from their power-law models, and properly initialise the corresponding composite models, for a given lens. Note that this differs from the usual TDCOSMO procedure, where the surface brightness of the lens galaxy is fitted with a double Sérsic profile for power-law mass models. They checked that no systematic errors were introduced when using double Sérsic instead of double Chameleon profiles, which is expected as the latter is designed to be a good approximation of the former. The rest of the procedure was similar to their earlier submissions, in terms of selection criteria and joint inference.
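Marginalizing over the two model families for a given lens, prior to the joint inference, can be sketched as an equal-weight mixture of their posterior samples. This is an illustrative simplification: the team's actual weighting between power-law and composite models is not specified here.

```python
import numpy as np

def marginalize_families(samples_powerlaw, samples_composite, rng=None):
    """Equal-weight marginalization over two model families.

    Draws the same number of posterior samples (e.g. of D_Delta_t)
    from each family and concatenates them, so that the resulting
    mixture assigns 50% prior weight to each family.
    """
    rng = np.random.default_rng(rng)
    n = min(len(samples_powerlaw), len(samples_composite))
    a = rng.choice(np.asarray(samples_powerlaw), size=n, replace=False)
    b = rng.choice(np.asarray(samples_composite), size=n, replace=False)
    return np.concatenate([a, b])
```

The mixture posterior per lens is then fed into the same joint H_0 inference as a single-family submission would be.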
The selection was performed independently for the two model families described above, meaning that the composite and power-law submissions did not necessarily consist of the same modelled lenses, nor the same number of lenses. For each model family, they submitted two submission pairs, with H_0 inferred from: 1) the joint {D_Δt,eff, D_d} inference; 2) D_Δt,eff only. Additionally, they submitted a third pair of submissions with a subset of lenses whose models happened to be accepted with both model families, which enabled them to combine the inferences from the power-law and composite models. More precisely, for a given lens, they marginalised over the two model families prior to the final joint H_0 inference among the different lenses. To summarize, the team ended up with 6 submissions for this rung.

P. Denzel, J. Coles, P. Saha, L. L.R. Williams
The lenses were reconstructed with the codes GLASS by Coles et al. (2014) and its precursor PixeLens by Saha & Williams (2004), which are based on the free-form modelling technique. In contrast to other methods, free-form lens reconstructions are not restricted to a parametrized family of models, but rather build a lens as a superposition of a large number of mass components, e.g. mass tiles or pixels, with minimal assumptions about the form of the full lens. The price to pay for the flexibility is that the free parameters outnumber the constraints, and thus regularization needs to be imposed to avoid overfitting the data.

While GLASS and PixeLens are completely separate codes, implemented in different languages and using different Monte Carlo sampling engines, they both share the same approach to free-form lenses. Represented as a discrete grid of pixels, the lens potential takes the following form:
ψ(θ) = Σ_n κ_n Q_n(θ),   (23)
where κ_n is the density of the n-th mass tile and Q_n(θ) is the shape integral over the n-th pixel. Each tile is a square, and its contribution κ_n Q_n(θ) to the potential at θ can be worked out analytically (AbdelSalam et al. 1998). In both GLASS and PixeLens the tiles cover a circular area centred on the lensing galaxy. The radius of this area, r_p, in pixels, determines the resolution of a model. The central pixel can additionally be subdivided into sp × sp sub-pixels; a central pixel with no subdivision is equivalent to sp = 1. Both codes ensure a small region of "pixel rings" outside the outermost image.

Quasar image positions, time delays, and redshifts are the only data input for the models. Image parities are also given, but are determined solely from experience and by generating test models to verify the image-parity assignment. As is well known, images are located at extrema of ∇ψ, and the sign of ∇∇ψ determines the parity. This input is used to create a system of equations which are linear in the source position β and the mass tiles κ_n. The intrinsic and well-known problem of lensing arises from the fact that there are infinitely many solutions to these linear equations. Free-form techniques usually sample from that solution space according to a few reasonable priors. Most notably, they require non-negative mass tiles, limited to twice the average of all neighbouring tiles, and the local density gradient to point typically within 45° of the centre; additionally, the azimuthally averaged mass profile must not increase with radius, which still allows for twisting isodensity contours and significantly varying ellipticities with radius. These priors ensure some minimum level of physical correctness, where the density of the reasonably smooth lensing mass increases towards the centre. From the information provided by the "Evil" team for each rung, further physical parameters and priors could be included:
• Redshifts set the distance scales (assuming a standard cosmology of Ω_m = 0.27 and Ω_Λ = 0.73).
• The models allowed for external shear.
• Time delays were constrained, for GLASS with uncertainties of ±0.25 days, for PixeLens without.
• The range of H_0 was limited to 50-90 km s⁻¹ Mpc⁻¹.
The velocity-dispersion information was not used to constrain the models, but can be derived from the models following Leier (2009). A free-form lens model consists of an ensemble of models. When a group of lenses is modelled jointly, H_0 must be the same for all systems.

An ensemble usually includes many different convergence maps, some of which are unphysical at times. Generally this is not a problem, as the ensemble average washes out these outliers. Nevertheless, the ensemble can be filtered according to different criteria in order to optimize the ensemble average. In Rung 2, for instance, we applied such a post-processing filter based on a simplified version of the source-mapping algorithm described in Denzel et al. (2020). Instead of only using quasar image positions, the entire photometric information was used to select the most probable models in the following manner. A χ² value was computed for each lens model of the ensembles by fitting a synthetic image using the drizzled image data (including science images, noise-level maps, and a sampled PSF image, while masking out the lensing galaxies in the centre). For each ensemble, the 300 models with the best values were retained to estimate H_0. This ensured that only the models which best fit the entire image data were used to infer H_0. Despite slight improvements on H_0, the filter was abandoned again for Rung 3 because, at the time, the methods were computationally too intensive. It is important to note that the distributions of H_0 of all ensemble models were far from Gaussian, but were fitted as such nonetheless, as this was the demanded submission format of the challenge.

For each rung, model ensembles were generated for all 16 single lenses and for groups of multiple lenses (four sets of four lenses) using GLASS and PixeLens. These submissions have the suffixes Single and
Multi, respectively.

In Rung 1, GLASS single-lens models use the standard r_p = 8 and sp = 5, and multi-lens models use sp = 1. In Rung 2, GLASS single-lens models have a higher resolution, using r_p = 10 and sp = 5, while multi-lens models use the standard r_p and sp = 1. For Rung 3, the resolution of the GLASS models was increased as high as was computationally feasible, to r_p = 12, for the submission glassSingleHiRes. The submission glassSingleLowRes used the standard r_p = 8. Both submissions further resolved the central pixel with sp > 1. glassCherrypick is a multi-lens analysis using a subset of four lenses for which the individually modelled arrival-time surfaces and mass maps subjectively appeared to be unproblematic (e.g., no additional images and a clean arrival-time surface). In Rung 2, glassSynthFiltered used the aforementioned source-mapping algorithm to select models from the glassMulti ensemble which best reproduced the lensed images.

S. R. Kumar, H. Chand
The main motivation of the team was to understand to what accuracy and precision H_0 can be constrained through simple analytical modelling, constrained by point-image positions and flux ratios. To this end, the team modelled the TDLMC Rung 0, Rung 1 and Rung 2 systems using the Glafic software (Oguri 2010). In general, the mass distribution of the lensing galaxy was modelled as a singular isothermal ellipsoid along with a shear component (SIE + γ). (Note that, due to the linear nature of the lens equation, a superposition of solutions is also a solution.) In Rung 1, some double lens systems were found to be overfitted (χ² ≪ 1). All the Rung 2 systems were modelled as SIE + γ, except for one system for which this model was found to result in catastrophic failure. The exceptional case was modelled as a singular isothermal ellipsoid without any shear component (SIE only).

The astrometry of the lensed quasar images and the centre of the lensing galaxy were measured from the provided HST drizzled image for each system using the 'imexam' task in IRAF. The astrometric coordinates were assigned an uncertainty of 0.02″. The fluxes of the lensed quasar images were also measured through aperture photometry, using the same IRAF task, from the HST drizzled image. From these fluxes, the absolute flux ratio was computed for each lensed quasar image with respect to the brightest image. These flux ratios were each assigned a sufficiently large uncertainty. A cosmology with Ω_m = 0.27, Ω_Λ = 0.73, and w = -1 was assumed. The source and lens redshifts were fixed for each system according to the provided values. The measured H_0 for each system was taken to be that corresponding to the best-fitting model. The 1σ uncertainty on H_0 was inferred by fixing it at different values around the measured value, marginalizing over all the model parameters to minimize χ², and noting the range where Δχ² < 1 with respect to the value for the best-fitting model. The error bars in the positive and negative directions were averaged. To include the line-of-sight effects for the Rung 1 and Rung 2 systems, 2.5% was added in quadrature to the H_0 uncertainty. The team submitted only the results for those systems where H_0 was constrained to better than 20 km s⁻¹ Mpc⁻¹. The remaining systems were flagged as failures. The team also submitted results filtered according to cutoff values of 15 km s⁻¹ Mpc⁻¹ and 10 km s⁻¹ Mpc⁻¹ to see what effect these selections have on the TDLMC performance metrics. In order to combine all the H_0 estimates from the individual systems into one global value for a rung, the team did a simple weighted average.

J. W. Park, Y.-Y. Lin
The H0rton team automated the lens modelling using a Bayesian neural network (BNN), a method pioneered by Hezaveh et al. (2017). The BNN-inferred lens-model posterior was then propagated into the H_0 inference. Readers are referred to the accompanying method paper (Park et al., in prep.) for more details. The implementation of the H0rton pipeline is available in the form of the open-source Python package H0rton (https://github.com/jiwoncpark/h0rton).

Given the drizzled image of each lens system, the BNN predicted the posterior PDF over the power-law elliptical mass model (PEMD) parameters, the source position, and the half-light radius of the Sérsic lens light (for computing the velocity-dispersion likelihood). The posterior PDF was parameterized as a mixture of two Gaussians with full covariance matrices, informed by the results of Wagner-Carena et al. (in prep.) that the parameter recovery improved with this form of the posterior in comparison to the single uncorrelated Gaussian originally adopted by Hezaveh et al. (2017). The training set for the BNN consisted of 200,000 images. The assumed lens mass and lens light profiles were identical to those used to generate the TDLMC data of Rungs 1 & 2, i.e. PEMD and elliptical Sérsic, respectively. The AGN host light, however, was assumed to follow an elliptical Sérsic profile in order to keep the parameterization simple. The model parameters in the training set were assumed to be independently distributed, aside from requiring the magnification to be greater than 2 in order to ensure a significant lensing signal. The approximate range of each parameter was inferred from the Rung 1 dataset and confirmed by visual inspection of the Rung 3 images. For the PSF convolution, the simulation rotated among the 16 drizzled PSF maps provided in Rung 1. The PSF information was fed to the BNN only via the convolved image, and the network was expected to handle the deconvolution internally. Non-drizzled images or PSF maps were not used.
The training set was generated using the team's open-source Python package Baobab, which wraps around the Lenstronomy package (Birrer & Amara 2018).

The combined cosmographic likelihood was the product of the likelihoods of the time delays and the line-of-sight velocity dispersion, with the nuisance parameters, i.e. the external convergence, the kinematic anisotropy, and the BNN-inferred model parameters, marginalized out. The velocity dispersion was modelled assuming a spherical power-law mass profile and a Hernquist lens light to solve the spherical Jeans equation, as done by Suyu et al. (2010b). The kinematic computations were performed with Lenstronomy. Samples from the cosmographic likelihood were obtained via MCMC sampling with Emcee (Foreman-Mackey et al. 2012). Note that, in contrast to the traditional forward-modelling approach, the pixelwise image likelihood was never directly modelled. Instead, the BNN-inferred posterior entered the MCMC integration as a prior over the lens-model parameters at the H_0 inference stage.

It was discovered during the analysis procedure that, when the BNN-predicted source position and lens model were directly used to solve the lens equation, the predicted number of images often did not agree with the data. These cases were traced to sources very close to the caustic, for which the precision requirements on the source position tended to be very high (see e.g. Birrer & Treu 2019). The BNN-inferred posterior was placing significant weight on models that did not produce the correct number of images. To alleviate this discrepancy, the image positions were manually estimated from the images and fed in as additional data to the MCMC sampling pipeline. A Gaussian likelihood of the image positions, when appended to the MCMC sampling objective, iteratively brought the BNN-inferred lens model closer to one that yielded the observed image positions.

The H0rton team joined the challenge late and only made a blind submission to Rung 3.
The open-box datasets of Rungs 1 and 2 that were available at the time, however, informed the team's approach. (Baobab: https://github.com/jiwoncpark/baobab)
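The way the BNN posterior enters the sampling can be sketched as follows: the BNN mixture density acts as a prior over a lens-model parameter inside an MCMC over {lens parameter, H_0}. This is a deliberately reduced toy, not H0rton's pipeline: a one-dimensional stand-in parameter, made-up mixture numbers, a toy 1/H_0 time-delay scaling, and a plain Metropolis sampler in place of Emcee, all chosen so the example is self-contained.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def log_bnn_prior(theta, w=0.6, mu1=1.9, s1=0.05, mu2=2.1, s2=0.08):
    """Two-Gaussian mixture standing in for the BNN-inferred posterior
    over a lens-model parameter. All numbers are illustrative."""
    a = np.log(w) + log_gauss(theta, mu1, s1)
    b = np.log(1.0 - w) + log_gauss(theta, mu2, s2)
    return np.logaddexp(a, b)

def log_posterior(params, dt_obs, dt_err):
    """BNN mixture as prior, plus a toy time-delay likelihood in which
    the predicted delay scales as theta / H0 (delays scale inversely
    with H0 at fixed lens parameters)."""
    theta, h0 = params
    if not (50.0 < h0 < 90.0):          # challenge H0 prior range
        return -np.inf
    dt_model = theta * 70.0 / h0
    return log_bnn_prior(theta) + log_gauss(dt_obs, dt_model, dt_err)

def metropolis(logp, x0, steps=10000, scale=(0.05, 1.0), seed=0):
    """Minimal Metropolis sampler (illustrative replacement for Emcee)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    lp = logp(x)
    chain = np.empty((steps, x.size))
    for i in range(steps):
        prop = x + rng.normal(0.0, scale, size=x.size)
        lp_new = logp(prop)
        if np.log(rng.uniform()) < lp_new - lp:   # accept/reject
            x, lp = prop, lp_new
        chain[i] = x
    return chain
```

The key structural point carried over from the text: the image likelihood never appears; the BNN density plays the role the pixelwise likelihood would play in forward modelling.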
Table 3. Summary table of input data.

Team        | point sources | extended source | kinematics
Student-T   | Yes           | Yes             | No
EPFL        | Yes           | Yes             | Yes
Freeform    | Yes           | No              | No
Rathnakumar | Yes           | No              | Yes
H0rton      | Yes           | Yes             | Yes

Note: the table summarizes the input data as used by each "Good" team. In addition, all teams use time delays and redshifts, and simulated HST images to constrain the deflector.
Table 4. Summary of computation and investigator time.

Team        | CPU time (hours) | Investigator time (hours)
Student-T   | 15,400           | 48
EPFL        | 500,000          | 1,000
Rathnakumar | -                | -
H0rton      | -                | -

Note: estimated CPU and investigator time spent on the TDLMC by the teams who provided them.
To summarize the input data used by each "Good" team, we present the information in Table 3. A summary of the computation and investigator time invested in the challenge is given in Table 4. A brief analysis of the results of the submissions is presented in this section.
In this section, we give an overview of the performance of the blind submissions to Rung 1 and Rung 2. As described in Section 2.7, four metrics are used to perform a synthetic evaluation of the submissions, even though we encourage teams to carry out more detailed studies. The metrics of each submission for Rungs 1 and 2 are shown in Table 5. Note that the "Good" teams were allowed to adopt multiple methods based on different algorithms and submit multiple results for each rung. The metrics plots for each submission are shown in Figures 6 and 7.

"Good" teams including Student-T, EPFL and Rathnakumar also estimated and submitted an overall H_0, which is their best estimate using the combination of the lens systems analyzed in each rung. The Freeform team also submitted overall H_0 values after unblinding, although these are based on a straightforward average of the blind inferences. Following Eqs. (11) and (12), we calculate the metrics of precision and accuracy using these overall H_0 values and show them in Figure 8. Note that the overall H_0 is a joint inference from the combination of multiple lens systems; thus, the precision metric should, in principle, decrease by the square root of the number of analyzed lensed systems (i.e., √N), compared to Figures 6 and 7. The combination of multiple systems could also in principle allow teams to flag and reject outliers, thus reducing the impact of overly complicated systems, i.e. those for which the modelling tool or data quality is insufficient.

Furthermore, we investigated whether there is "wisdom in the crowd" by considering metrics combined across the H_0 submissions for Rung 1 and Rung 2. We considered the following strategies:
• Direct average: of all the submissions of H_0, without weighting;
• Bagging: for each lens in a rung, we compute the mean H_0 across all the submissions and estimate the uncertainty via bootstrap resampling. The result is taken as the H_0 inference for that lens system. Then, we combine the H_0 inferences across all the lens systems in the rung to compute the metrics;
• Rejection σ-median: we combine the entire set of H_0 submissions in a rung and bootstrap-resample them, removing outliers before computing the averaged metrics. In each bootstrap seed, we calculate the median H_0 (H_0,median) and reject the outliers for which |H_0,median - H̃_i|/δ_i exceeds a fixed threshold;
• Rejection σ-mean: similar to rejection σ-median, but we remove the outliers in each bootstrap seeding using the weighted mean (H_0,mean), rejecting those for which |H_0,mean - H̃_i|/δ_i exceeds the threshold;
• Rejection widths-median: similar to the previous rejection methods, but we use the width of the H_0 distribution in each bootstrap drawing (i.e., W_H, the half width of the 16%-84% confidence interval of the H_0 distribution) and remove from the bootstrapped sample the outliers for which |H_0,mean - H̃_i|/W_H exceeds the threshold.

These averaged metrics are only introduced to help "guide the eye" in evaluating whether there is "wisdom in the crowd". This is not a common practice in current research on this topic. Furthermore, the combined metrics are not representative and overweight certain methods, since different teams had different numbers of submissions.

A few trends emerge from these plots, discarding the Student-T submission to Rung 2 and the EPFL submission to Rung 1, for reasons discussed in the next subsections. First, most methods seem to have a realistic assessment of their uncertainties, landing on or close to the χ² target. Second, the methods constrained only by the point-source positions and fluxes tend to produce significantly larger uncertainties than the target precision. Only the methods using the full extent of the surface brightness of the host galaxy and the ancillary data hit the precision target. This trend is confirmed by Table 6, in which the combined metrics of precision and accuracy are calculated for Rung 2 based on the algorithms using different levels of information. This finding is encouraging even though not surprising: using more data yields more precise results. Also encouraging is that, even in the more challenging Rung 2, all the methods, including Student-T post-blind, hit the accuracy target. Unexpectedly, the accuracy in Rung 1 seemed to be worse than in Rung 2. The improved accuracy in Rung 2 is likely due to the fact that the "Good" teams learned from the Rung 1 results to improve their algorithms and identify bugs in their codes.

To understand whether the performance of the lens modelling differs between lens configurations (i.e., cross, cusp, fold and double) and simulating codes (i.e., Lenstronomy and Pylens), we categorize the entire set of submissions and compare their metrics directly by plotting them together in Figure 9.
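The σ-median rejection strategy above can be sketched as follows. This is a simplified illustration: the rejection threshold (set to 3 here) and the fallback when every submission is rejected are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def sigma_median_rejection(h0_values, h0_errors, n_boot=1000,
                           thresh=3.0, seed=0):
    """Bootstrap the set of H0 submissions; in each draw, reject the
    outliers with |H0_median - H0_i| / delta_i > thresh, then average
    the survivors. Returns the mean and scatter of the per-draw
    averages across all bootstrap draws."""
    rng = np.random.default_rng(seed)
    h0 = np.asarray(h0_values, dtype=float)
    err = np.asarray(h0_errors, dtype=float)
    means = np.empty(n_boot)
    for k in range(n_boot):
        idx = rng.integers(0, len(h0), len(h0))  # resample with replacement
        h, e = h0[idx], err[idx]
        keep = np.abs(np.median(h) - h) / e <= thresh
        if not keep.any():                        # degenerate draw: keep all
            keep = np.ones_like(keep)
        means[k] = h[keep].mean()
    return means.mean(), means.std()
```

The σ-mean and widths-median variants differ only in the centre (weighted mean instead of median) and the denominator (W_H instead of the individual δ_i).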
Interestingly, there is no significant evidence of a difference between the different configurations (e.g., doubles and quads), which echoes the recent study by Birrer et al. (2019) finding that the precision of the cosmographic measurement with doubly imaged AGNs can be comparable to that of quadruply imaged ones. Of course, this result should not be overinterpreted, as the additional information content of the quads may just not be apparent in the configurations and regimes studied here, but relevant in other situations where, for example, the mass distribution is more complicated, the data quality is not as good, or the uncertainties are smaller. One potential explanation for the similarity is that the quads considered here are more symmetric than the quads of the TDCOSMO collaboration, likely as a result of the selection function that favours systems with large ellipticity and shear, since they have the highest cross-section for quads vs. doubles. Symmetric quads typically have shorter time delays and less radial leverage than more asymmetric ones, and thus provide weaker constraints on the Hubble constant. For all these reasons, the similarity between quads and doubles found in this challenge does not imply that they are equally efficient in reality. Also, the metrics are the same for the Lenstronomy and Pylens samples, consistent with the difference between the simulated images from Lenstronomy and Pylens being below the noise level (Section 2.5). Due to the limitations of Rung 3, as discussed in Section 5, we present the Rung 3 results in Appendix A.
The first important lesson is that the independent teams have come up with several independent techniques, including novel ones. As described above, the underlying assumptions of the techniques vary greatly, and so do the amount of information used by each technique and the flexibility of the models. As is often the case in astrophysics, finding the right balance between too little and too much flexibility in the models is difficult yet vital to obtain accuracy and precision. Too little flexibility may lead to bias or underestimated error bars. Too much flexibility may lead to unphysical solutions or unnecessary inflation of the error bars. The level of flexibility ties directly to another major obstacle to precision: lensing degeneracies. One way in which degeneracies can be quantified is by pulling multiple solutions from different families of models and analyzing the variance within that ensemble (see e.g., Gomer & Williams 2019; Saha 2000).

The second important lesson is that most methods seem to produce reasonable estimates of their uncertainties. In Rung 1 virtually all methods produced acceptable χ metric distributions, while in Rung 2 the submissions that returned an answer for every system (i.e., high efficiency) sometimes paid a price, in the sense that they underestimated their uncertainties.

The third important lesson is that more information translates into higher precision. Therefore, if one wishes to extract high precision from time delay measurements, it is crucial to use all the information available, not just the positions of the point sources (or their flux). However, an important caveat is that information content by itself does not necessarily guarantee accuracy if the modelling technique is not sufficiently flexible, as discussed above.
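One way to make the ensemble-variance idea concrete is to compare the scatter of H0 within each model family to the scatter between the family means; the family names and values below are invented for illustration.

```python
import numpy as np

# Hypothetical H0 samples (km/s/Mpc) from two model families,
# e.g. power-law vs. composite mass profiles; numbers are made up.
families = {
    "power_law": np.array([72.8, 73.5, 71.9, 73.1]),
    "composite": np.array([70.2, 69.8, 71.0, 70.5]),
}

within = np.mean([v.var(ddof=1) for v in families.values()])
between = np.var([v.mean() for v in families.values()], ddof=1)

# If the between-family scatter dominates, the choice of model family
# (a degeneracy direction) is a major part of the error budget and a
# single-family error bar would be too optimistic.
degeneracy_dominated = bool(between > within)
```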
Rung 1 and Rung 2 provide a useful test, but much remains to be done to explore the right degree of flexibility.

After the unblinding of Rung 1, the EPFL team discovered that small systematic uncertainties in the positions of the multiply imaged quasars, at the level of a fraction of a pixel, could introduce a noticeable bias in the inference given the precision of the time delays. Thus, in Rung 2, the EPFL team introduced nuisance parameters to describe this uncertainty and marginalized over it.

Table 5.
Metrics of the blind submissions for Rung 1 and Rung 2.
Team / algorithm            f       log(χ)    P (%)     A (%)

Metrics of Rung 1:
Student-T algorithm1        0.688    0.771     4.834     1.049
Student-T algorithm2        0.688    0.615     5.374     1.752
Student-T algorithm3        0.688    0.493     8.237     2.492
Student-T algorithm4        0.688    0.541     6.533     0.293
Student-T algorithm5        0.688    0.324     7.019     1.005
Student-T algorithm6        0.688    0.094    10.036     1.825
EPFL submission             0.688    0.411     6.169     7.512
Freeform glassCherrypick    0.250    1.193     5.785   -22.847
Freeform glassMulti         1.000    0.406     9.002    -4.570
Freeform glassSingle        1.000    0.264    13.812    -8.516
Freeform pixelensMulti      1.000    0.349     9.299    -7.220
Freeform pixelensSingle     1.000    0.790    13.123    -5.632
Rathnakumar cutoff10        0.125    0.024     8.429     4.112
Rathnakumar cutoff15        0.250   -0.164    12.137     6.337
Rathnakumar cutoff20        0.375   -0.339    15.419     3.932

Rung 1 combined metrics:
Direct average
Bagging                             -0.199     9.646    -1.644
Rejection σ-median
Rejection σ-mean
Rejection widths-median

Rung 2 combined metrics:
Direct average
Bagging                             -0.343    10.768     0.372
Rejection σ-median                  -0.040    14.041     1.481
Rejection σ-mean
Rejection widths-median
Note: The table summarizes the metrics of the blind submissions for Rung 1 and Rung 2, together with the post-blind submissions by the Student-T team (see Section 4.3).

The effect is evident when comparing their blind results in Rung 1 and Rung 2. This is an example of the importance of modelling-technique flexibility to ensure accuracy, and the fourth key lesson from Rung 1 and Rung 2 is that astrometric precision needs to be commensurate with the time delay precision. As discussed by Birrer & Treu (2019), the requirements can be at the level of milli-arcseconds if the time delay is known to percent precision. For HST-like images, the requirements correspond to a small fraction of a pixel, a challenging requirement for point sources superimposed on an extended and unknown source. It is thus important to consider this source of uncertainty explicitly and marginalize over it, transforming a potential source of bias into a decrease in precision.
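The trade of bias for precision can be illustrated with a deliberately oversimplified one-dimensional toy, in which an observable depends on a parameter (standing in for H0) through a measured position carrying a small systematic offset. The linear model, the 0.02 offset, and the Gaussian prior width are all our assumptions for illustration, not the EPFL team's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: observable y = p * x, with p standing in for H0 and x for
# an image position measured with a small systematic offset.
p_true, x_true, offset = 70.0, 1.00, 0.02
y_obs = p_true * x_true
x_obs = x_true + offset            # astrometry off by a fraction of a pixel

def infer_p(x_obs, marginalize, sigma_ast=0.02, n_draw=4000):
    """Estimate p = y/x.  If `marginalize`, propagate an astrometric
    nuisance offset drawn from a Gaussian prior of width sigma_ast."""
    if not marginalize:
        return y_obs / x_obs, 0.0
    draws = y_obs / (x_obs - rng.normal(0.0, sigma_ast, n_draw))
    return draws.mean(), draws.std()

p_naive, _ = infer_p(x_obs, marginalize=False)    # biased, no extra error
p_marg, p_err = infer_p(x_obs, marginalize=True)  # bias absorbed into error
```

In the naive case the ~2% positional error appears as a ~2% bias on p with no corresponding error bar; after marginalization the truth is recovered within the (now larger) uncertainty.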
Figure 6.
Results for TDLMC Rung 1, showing the 4 metrics for all the submissions using different algorithms, together with the combined metrics shown as yellow points. The f, χ, P and A metrics are defined in Section 2.7. The gray regions in each plot bracket the expected performance of the metrics. Note that we did not set a target performance for the efficiency (f) metric; the gray regions in the left three panels are drawn only for the other metrics. The last four combined metrics either reconstruct the sample or reject outliers, so the efficiency metric is likewise not considered for them.

After unblinding, it was discovered that in Rung 2 (and Rung 3) the Student-T team used the non-drizzled PSF, drizzled lens image, and drizzled noise map, owing to clerical errors. The team's post-unblinding analyses show that this mismatch was the main source of bias in the blinded analysis. In Figure 10, we find that the Rung 2 results are much improved after using the correct files. The corresponding metrics of the post-blind analysis are also given in Table 5. We stress that these post-blind submissions only correct the input files; the modelling algorithms remain unchanged. These post-blind submissions are not used when calculating the combined (i.e., averaged) metrics.
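For reference, the four metrics can be computed as follows. This is our paraphrase of the Section 2.7 definitions from TDLMC1 (Ding et al. 2018); consult that paper for the authoritative forms. The submission numbers below are invented.

```python
import numpy as np

def tdlmc_metrics(H_est, H_err, H_true, n_total):
    """Efficiency f, goodness log10(chi^2), precision P and accuracy A
    for one submission: per-system H0 estimates `H_est` with 1-sigma
    uncertainties `H_err`, out of `n_total` systems in the rung."""
    H_est = np.asarray(H_est, dtype=float)
    H_err = np.asarray(H_err, dtype=float)
    f = H_est.size / n_total                          # efficiency
    chi2 = np.mean(((H_est - H_true) / H_err) ** 2)   # goodness of fit
    P = 100.0 * np.mean(H_err / H_true)               # precision (%)
    A = 100.0 * np.mean((H_est - H_true) / H_true)    # accuracy (%)
    return f, np.log10(chi2), P, A

# A mock submission covering 11 of 16 systems:
f, log_chi2, P, A = tdlmc_metrics(
    H_est=[71.0] * 11, H_err=[3.5] * 11, H_true=70.0, n_total=16)
```

The targets quoted in the text (log χ near 0, P below 6%, |A| below 2%) correspond to the gray regions in Figures 6 and 7.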
Rung 3 was inconclusive because of the limitations of the procedure used to construct the lenses for this rung. We discuss here some of the limitations of the hydrodynamical simulations used to construct Rung 3. The "Evil" team was aware of some of them while constructing the challenge, while others only became apparent post-unblinding. We introduce them in the following subsections.
The main known limitations of the simulations pre-unblinding are twofold. First, the resolution of the simulations we used is insufficient to describe the inner regions of early-type galaxies. This is illustrated in Figure 11, where we show a typical mass profile, decomposed into dark and total mass. The total mass profile has a core of approximately 0″.1, about half a kpc at the redshift of our sources. We also note that the adopted numerical simulations have softening lengths of 200-700 pc, which have partially contributed to the core sizes in these simulated galaxies. Although some cored massive elliptical galaxies have been found (Thomas et al. 2016) and could be produced in highly accurate dynamical simulations (Rantala et al. 2018), they are unlikely to be present in real lens galaxies with masses like the Rung 3 ones. The main evidence against such cores comes from the study of gravitational lenses themselves. The central slope of the mass density profile controls the magnification of the central image. Currently, the existence of galaxy cores is still an open question. The fact that the central image is almost always absent in galaxy-scale lenses (though not in cluster-scale lenses) is an argument against cores. For example, radio observation studies (e.g., Rusin & Ma 2001; Keeton 2003; Winn et al. 2004; Boyce et al. 2006; Zhang
Figure 7.
Same as Figure 6, but for Rung 2's results. To demonstrate the improvement of the Student-T team's results after using the correct files (see Section 4.3), the figure also shows the post-blind submissions, labeled by hollow markers. We note that the combined metrics, i.e., the yellow points, do not include the results of the post-blind algorithms.
Table 6.
Summary of the precision and accuracy obtained by combining algorithms based on the different levels of information used to constrain the models in Rung 2.
Combined fitting algorithm        Precision (%)    Accuracy (%)
Everything
  blind submissions only
  blind + post-blind
  only post-blind for Student-T

Note: "Everything" calculates the metrics combining the algorithms that adopted point sources, extended source, and kinematics. "Extended Source" combines the results of the algorithms that utilized the lensed-arc information in the lens modelling. "Point Sources" combines the ones which use only point sources but not lensed arcs. For cases with post-blind submissions explained in the text, we report all the permutations of blind and post-blind combinations.

et al. 2007; Quinn et al. 2016) usually report a non-detection of the 'central' lensed image, which gives an upper limit on the core size (≲
100 pc). Moreover, for such cases, the measured kinematics would be different from the modelled ones. Second, since simulations do not perfectly match the mass profile of real massive elliptical galaxies, as shown by Figure 11, generalizing the results of such a test is always complicated. For example, if the modelers were to assume the mass density profile to be cuspy in the inner regions, and thus did not match the cores in the simulations, would this be a problem when analyzing real galaxies, which should be cuspy? The recent study by Enzi et al. (2020) shows that without kinematic information, departures from a single power law (in this case in the form of a core) can lead to a bias on the inference of H0 of up to 25%. A similar concern about the realism of simulations is illustrated by Xu et al. (2017), who analyzed Illustris simulations and showed that the simulations do not exactly match the detailed properties of real galaxies in terms of central dark matter fraction and slope of the mass density profile.

These limitations were known to the "Evil" team while designing the challenge. The "Evil" team considered these limitations a "necessary evil", to be kept in mind in the interpretation of the results: a complexity worth paying in order to produce complex galaxy potentials in the absence of empirical information of compa-
Figure 8.
Results for TDLMC Rung 1 (left) and Rung 2 (right), based on the overall H0 submissions by each algorithm. The Freeform team submitted their overall H0 values after unblinding, although these are based on a straightforward average of the blind submissions.

Figure 9.
The figure illustrates the metrics of Rung 1 and Rung 2 according to different categories of lens systems, using the entire set of submissions. Note that in Rung 2 the goodness (i.e., χ) is overwhelmingly dominated by the four double systems in the Freeform glassMulti submission. Because these four double systems are simulated by Lenstronomy and Pylens evenly, the corresponding log(χ) in Rung 2 is significantly larger than the other ones. The large goodness value of the Freeform team is an artifact of the prior that H0 lies between 50 and 90 km s^-1 Mpc^-1: values close to 50 have low error estimates (cut off at 50), which results in very high χ values.

rable resolution. Future challenges may want to pursue some form of empirically-driven models (perhaps based on observations of local massive elliptical galaxies) until the fidelity of simulations improves significantly.

Additional limitations were discovered post-unblinding thanks to collaborative efforts by the "Evil" and "Good" teams. However, these limitations do not necessarily introduce a major bias on the inference of H0.

In 12 of the 16 Rung 3 simulations, dynamically bound substructures (i.e., satellite halos) were identified and removed before producing the lensing quantities. This procedure renders the kinematics inconsistent with the lensing quantities, because the motion of the stars and gas was precomputed based on the full mass distribution, including substructure. Substructure accounts for approximately 1% of the total mass at the relevant scales, so this is not a large effect, but it can potentially introduce a bias at the percent level when combining lensing and kinematic tracers.
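The size of the resulting lensing-versus-kinematics inconsistency can be bounded with a rough scaling argument (ours, not the paper's calculation):

```python
# The lensing maps had bound substructure (~1% of the mass) removed,
# while the particle kinematics were computed with that mass still in
# place.  Since sigma^2 scales roughly as G*M/R, a fractional mass
# mismatch dM/M implies a fractional velocity-dispersion mismatch of
# about 0.5 * dM/M.

def kinematic_mismatch(dM_over_M):
    """Fractional sigma mismatch implied by a fractional mass mismatch."""
    return 0.5 * dM_over_M

mismatch = kinematic_mismatch(0.01)   # ~0.5 per cent in sigma
# A systematic of this size, folded into a joint lensing+dynamics
# inference, is consistent with the percent-level bias quoted above.
```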
For computational reasons, only the particles within the virial radius (R_vir) or twice the virial radius were considered when projecting the mass distribution to calculate the lensing quantities. This has two main consequences. First, not taking into account mass beyond R_vir may introduce a negative mass-sheet transform, biasing H0 at below the percent level. Second, the spherical truncation at R_vir does not follow the isodensity contours of the mass profile, introducing an artificial shear (Van de Vyvere et al. 2020, in prep.). At this radius, the artificial shear created by the truncation is small and may bias H0 by less than 1 percent. Both truncation effects (i.e., artificial shear and negative mass-sheet bias) have low amplitude for truncation at the virial radius. They may therefore introduce a small bias on the H0 inference but should not be the major cause of bias in the Rung 3 results.
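Both points can be made concrete. An unmodelled uniform external convergence κ_ext rescales the time delays, so H0,true = H0,model × (1 − κ_ext); mass missing beyond the truncation acts like a small negative κ_ext. The non-locality follows from the logarithmic kernel, ψ(x) = (1/π) ∫ κ(x′) ln|x − x′| d²x′, and can be demonstrated by direct summation on a toy convergence map (a circular profile plus a small clump beyond the truncation radius; all numbers are illustrative and this is not the Rung 3 machinery):

```python
import numpy as np

def potential(pt, xs, ys, kappa, dA):
    """psi(pt) = (1/pi) * sum kappa * ln|pt - x'| * dA (direct sum)."""
    r = np.hypot(pt[0] - xs, pt[1] - ys)
    return np.sum(kappa * np.log(r)) * dA / np.pi

n, half = 200, 10.0
g = np.linspace(-half, half, n)     # even n: no node falls on the images
xs, ys = np.meshgrid(g, g)
dA = (g[1] - g[0]) ** 2

kappa_main = 1.0 / (1.0 + xs**2 + ys**2)                      # circular lens
kappa_clump = 0.5 * np.exp(-((xs - 7.0)**2 + ys**2) / (2 * 0.5**2))
kappa = kappa_main + kappa_clump

imgA, imgB = (1.0, 0.0), (0.0, 1.0)     # two mock image positions
inside = np.hypot(xs, ys) <= 5.0        # circular truncation at r = 5

dpsi_full = (potential(imgA, xs, ys, kappa, dA)
             - potential(imgB, xs, ys, kappa, dA))
dpsi_trunc = (potential(imgA, xs, ys, kappa * inside, dA)
              - potential(imgB, xs, ys, kappa * inside, dA))
# The circular lens cancels in the image-to-image difference, so the
# distant clump alone sets dpsi_full; truncating at r = 5 erases it:
# the logarithmic kernel keeps the Fermat potential sensitive to mass
# far outside the image region.
```

In this toy, dpsi_trunc is consistent with zero while dpsi_full matches the analytic point-mass value for the clump, i.e. the truncation wipes out the entire potential difference between the two image positions.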
Figure 10.
Post-blind improvement of the Student-T team's results using the correct PSF file for the Rung 2 submissions, without changing any code or algorithm. This correction removes the bias in the inference and slightly improves the precision.
Figure 11.
Mass profile of a typical deflector in Rung 3, illustrating the unphysical core in the central regions and the departure of the dark matter halo from a standard (Navarro et al. 1997) form.
First of all, a positive outcome of the challenge is that several teams were able to analyze a sample of 48 lenses, the sample size needed to reach sub-percent precision (Shajib et al. 2018). Analyzing this large sample within the time constraints of the challenge required the "Good" teams to apply fast methods, as opposed to the more time- and resource-consuming approaches of state-of-the-art analyses of real data. These fast methods are necessary to make progress, and it is essential to test them as we did in the challenge. We note that even with the fast methods, participation in the challenge was labor intensive, and the "Evil" team extended the original deadlines set in TDLMC1 by a few months in order to allow more "Good" teams to participate.

Rung 1 and Rung 2 demonstrate that current fast lens modelling technology is able to obtain precise and accurate estimates of H0 starting from a best guess of the point spread function, when using the information content of HST-like images. The expected complexity of the lensed host galaxy of the quasar is not an obstacle to the inference, provided that sufficiently flexible models are used to describe the source. The common practice of reconstructing the PSF starting from an empirical or theoretical best guess, and the use of a flexible source description, are validated by the two rungs and should become the standard in future work.

Astrometry of the point sources from
HST-like images can be a source of bias at the few percent level for extremely precise time delays. Mitigation strategies include adding nuisance parameters to describe the astrometric noise arising from poor sampling, or using higher resolution images, e.g. from adaptive optics or radio interferometers.

The conclusions about modelling the gravitational potential of the deflector are not so clear cut. Encouragingly, the teams performed well when the deflector was described by simply-parametrized analytic forms, as in Rung 1 and Rung 2, with no evidence of inaccuracy. As discussed above, and as expected, the fast methods using more information performed better in terms of precision than the ones which used only AGN positions and flux ratios. Rung 3 was helpful in unveiling subtle effects that need to be considered if one wishes to use simulations to test gravitational lens modelling techniques for cosmological inference to high precision. Unfortunately, these limitations, together with the limitations in resolution and realism known at the beginning of the challenge, make it difficult to draw conclusions based on it. More work is needed on this front, and it will require either much higher resolution simulations than the ones adopted here or more advanced computational techniques to calculate the lensing quantities. Alternatively, a future challenge could find a way to generate high-precision and realistic models, perhaps inspired by empirical data on local massive elliptical galaxies.
In this paper, we described the main results of the time delay lens modelling challenge. We first revealed some of the details of the construction of the simulated datasets that were kept blind during the challenge. Second, we gave a brief description of the methods followed by the "Good" teams to do the inference. Third, we described a number of limitations of Rung 3, including some numerical effects discovered post-unblinding that preclude inferences at the percent level required for this challenge. These limitations make Rung 3 difficult to interpret, but they are reported here with the aim of informing future challenges. Finally, we presented an overview of the performance of the methods against 4 metrics (precision, accuracy, efficiency, goodness of fit).
The main conclusions, based on Rungs 1 and 2, can be summarized as follows:

• Each team came up with fundamentally different methods to study a large sample of systems: methods constrained only by point-like images using either analytic or free-form models, a novel Bayesian technique assuming a locally Gaussian Fermat potential, and modelling similar to current cosmographic analyses. A Bayesian Neural Network approach has also been applied to unblinded data. Several teams developed fast methods that allowed them to analyze 48 lenses within the duration of this challenge.

• The fast methods applied to this challenge estimate their uncertainty appropriately, yielding error bars that are statistically comparable with the departure from the truth.

• The fast methods that exploit the full information content of the data achieve higher precision than the ones that only utilize the lensed quasar positions and fluxes to constrain the models.

• The fast methods based on full image reconstruction can meet the target precision (6% per system) and accuracy (2%) when analyzing mock images based on complex sources and starting with a guess of the point spread function.

• Astrometric requirements on the position of the point sources can be stringent and difficult to meet for high precision time delay measurements, given the Hubble Space Telescope point spread function and pixel size. Biases arising from poor sampling of the PSF can be avoided by modelling the astrometric noise explicitly.

As far as Rung 3 is concerned, one generic problem was known before the challenge: if simulations do not reproduce real galaxies at percent-level precision in the gravitational potential, it is difficult to generalize the outcome of the challenge. A good example of this issue is the finite resolution of cosmological hydrodynamical simulations, which introduces features like cores that are unlikely to be present in real systems.
A spherical redistribution of cusp to core would not itself affect the lensing observables, but it would change the kinematic and other properties. If modelers assume that galaxies are cuspy, and do not detect the core in the simulations, what does it mean for real galaxies? The following additional and more subtle effects were identified post-unblinding:

• The kinematics of the particles in the simulations must be consistent to sub-percent level with the gravitational potential generating the lensing data products given to the "Good" teams. Removing substructures or other parts of the simulation when generating the lensing data may cause internal tension in the data, so that the lensing and dynamical probes cannot be combined without bias.

• The standard practice of truncating simulated halos at the virial radius may lead to inconsistencies between the actual Fermat potential and the one computed from truncated maps. Lensing quantities such as the Fermat potential are non-local, and the kernel mapping convergence into potential is logarithmic. Therefore, in order to avoid biases in the Fermat potential at the few percent level, one has to include all particles well beyond the virial radius and carefully consider the shape of the truncation.

In recent years, a number of works have investigated the systematic uncertainties in time-delay cosmography (e.g., Schneider & Sluse 2013; Sonnenfeld 2018; Kochanek 2020; Millon et al. 2019). However, it is difficult to make a quantitative comparison between our results and those in the literature, because the uncertainties depend strongly on the assumptions and methods used.

To conclude, this work shows that blind challenges on simulated data are a powerful tool to study and characterize a method, alongside blind and independent analyses of real datasets (Millon et al. 2019).
The results obtained from this first time delay lens modelling challenge are encouraging, in the sense that accurate and precise H0 can be derived blindly even in the presence of complex sources and an unknown PSF. However, our results also demonstrate that much work remains to be done before we can have conclusive end-to-end tests based on simulations. First, state-of-the-art modelling methods exploiting the full information content of the data need to speed up so that even larger simulated datasets can be analyzed within a practical time frame, to explore a variety of more complicated configurations. For example, the EPFL team that used all the information employed 500,000 CPU hours and 1,700 hours of investigator time, almost a full year equivalent. This is significantly less time than currently employed per lens by H0LiCOW or STRIDES. However, the challenge was single plane and by design simpler in terms of satellites and perturbers along the line of sight than real lenses. So, in order to analyze samples of order 100-1000 lenses with increased complexity, further speed-ups are necessary. Second, improvements in the numerical simulations of massive elliptical galaxies and in the calculation of their lensing properties are needed before they can be used to perform lens modelling challenges at percent-level precision.

ACKNOWLEDGMENTS
We thank Kenneth C. Wong and Sherry H. Suyu for useful suggestions and support. T.T. acknowledges support by the Packard Foundation in the form of a Packard Research Fellowship. T.T. and C.D.F. acknowledge support by the NSF through grant "Collaborative Research: Toward a 1% Measurement of The Hubble Constant with Gravitational Time Delays" AST-1906976. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 787886). C.D.F. and G.C.-F.C. acknowledge support for this work from the National Science Foundation under Grant Numbers AST-1312329 and AST-1907396. This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (COSMICLENS: grant agreement No 787886) and the Swiss National Science Foundation (SNSF). S.V. has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 758853). Hyungsuk Tak acknowledges Simon Birrer for his sincere and tireless support in implementing Lenstronomy, which has enabled the team's participation in the challenge. Hyungsuk Tak also appreciates Xuheng Ding for his thoughtful comments on the team's Python code submitted to the Evil team, which have significantly improved the methodology (including the discovery of the clerical error). In addition, Hyungsuk Tak acknowledges computational support from the Institute for Computational and Data Sciences at Pennsylvania State University. S. Rathna Kumar and Hum Chand acknowledge financial support from SERB, DST, Govt. of India through grant PDF/2016/003848 during the course of this project. S. Hilbert acknowledges support by the DFG cluster of excellence 'Origin and Structure of the Universe'.
DATA AVAILABILITY
The data underlying this article are available on the TDLMC website, at https://tdlmc.github.io/
REFERENCES
AbdelSalam H. M., Saha P., Williams L. L. R., 1998, MNRAS, 294, 734Akeret J., Seehars S., Amara A., Refregier A., Csillaghy A., 2013, Astron-omy and Computing, 2, 27Alam S., et al., 2017, MNRAS, 470, 2617Arendse N., et al., 2019, arXiv e-prints, p. arXiv:1909.07986Auger M. W., Treu T., Brewer B. J., Marshall P. J., 2011, MNRAS, 411, L6Barkana R., 1998, ApJ, 502, 531Betoule M., et al., 2014, A&A, 568, A22Birrer S., Amara A., 2018, Physics of the Dark Universe, 22, 189Birrer S., Treu T., 2019, MNRAS, 489, 2097Birrer S., Amara A., Refregier A., 2015, ApJ, 813, 102Birrer S., Amara A., Refregier A., 2016, Journal of Cosmology and As-troparticle Physics, 2016, 020Birrer S., et al., 2019, MNRAS, 484, 4726Boyce E. R., Winn J. N., Hewitt J. N., Myers S. T., 2006, ApJ, 648, 73Bruzual G., Charlot S., 2003, MNRAS, 344, 1000Choi E., Ostriker J. P., Naab T., Johansson P. H., 2012, ApJ, 754, 125Coles J. P., Read J. I., Saha P., 2014, MNRAS, 445, 2181Denzel P., Mukherjee S., Coles J. P., Saha P., 2020, Monthly Notices of theRoyal Astronomical Society, 492, 3885?3903Ding X., et al., 2017, MNRAS, 465, 4634Ding X., et al., 2018, arXiv e-prints, p. arXiv:1801.01506Eisenstein D. J., et al., 2005, ApJ, 633, 560Enzi W., Vegetti S., Despali G., Hsueh J.-W., Metcalf R. B., 2020, MNRAS,Falco E. E., Gorenstein M. V., Shapiro I. I., 1985, ApJ, 289, L1Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2012, eprint arXiv,1202, 3665Freedman W. L., et al., 2001, ApJ, 553, 47Freedman W. L., et al., 2019, ApJ, 882, 34Frigo M., Naab T., Hirschmann M., Choi E., Somerville R. S., KrajnovicD., Davé R., Cappellari M., 2019, MNRAS, 489, 2702Gelman A., Carlin J. B., Stern H. S., Dunson D. B., Vehtari A., Rubin D. B.,2013, Bayesian Data Analysis. CRC Press, Boca Raton, FL, USAGhosh A., Williams L. L. R., Liesenborgs J., 2020, MNRAS, 494, 3998Gilman D., Agnello A., Treu T., Keeton C. R., Nierenberg A. M., 2017,MNRAS, 467, 3970Gomer M. R., Williams L. L. R., 2019, arXiv e-prints, p. 
AbdelSalam H. M., Saha P., Williams L. L. R., 1998, MNRAS, 294, 734
Akeret J., Seehars S., Amara A., Refregier A., Csillaghy A., 2013, Astronomy and Computing, 2, 27
Alam S., et al., 2017, MNRAS, 470, 2617
Arendse N., et al., 2019, arXiv e-prints, p. arXiv:1909.07986
Auger M. W., Treu T., Brewer B. J., Marshall P. J., 2011, MNRAS, 411, L6
Barkana R., 1998, ApJ, 502, 531
Betoule M., et al., 2014, A&A, 568, A22
Birrer S., Amara A., 2018, Physics of the Dark Universe, 22, 189
Birrer S., Treu T., 2019, MNRAS, 489, 2097
Birrer S., Amara A., Refregier A., 2015, ApJ, 813, 102
Birrer S., Amara A., Refregier A., 2016, Journal of Cosmology and Astroparticle Physics, 2016, 020
Birrer S., et al., 2019, MNRAS, 484, 4726
Boyce E. R., Winn J. N., Hewitt J. N., Myers S. T., 2006, ApJ, 648, 73
Bruzual G., Charlot S., 2003, MNRAS, 344, 1000
Choi E., Ostriker J. P., Naab T., Johansson P. H., 2012, ApJ, 754, 125
Coles J. P., Read J. I., Saha P., 2014, MNRAS, 445, 2181
Denzel P., Mukherjee S., Coles J. P., Saha P., 2020, MNRAS, 492, 3885
Ding X., et al., 2017, MNRAS, 465, 4634
Ding X., et al., 2018, arXiv e-prints, p. arXiv:1801.01506
Eisenstein D. J., et al., 2005, ApJ, 633, 560
Enzi W., Vegetti S., Despali G., Hsueh J.-W., Metcalf R. B., 2020, MNRAS
Falco E. E., Gorenstein M. V., Shapiro I. I., 1985, ApJ, 289, L1
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2012, eprint arXiv, 1202, 3665
Freedman W. L., et al., 2001, ApJ, 553, 47
Freedman W. L., et al., 2019, ApJ, 882, 34
Frigo M., Naab T., Hirschmann M., Choi E., Somerville R. S., Krajnovic D., Davé R., Cappellari M., 2019, MNRAS, 489, 2702
Gelman A., Carlin J. B., Stern H. S., Dunson D. B., Vehtari A., Rubin D. B., 2013, Bayesian Data Analysis. CRC Press, Boca Raton, FL, USA
Ghosh A., Williams L. L. R., Liesenborgs J., 2020, MNRAS, 494, 3998
Gilman D., Agnello A., Treu T., Keeton C. R., Nierenberg A. M., 2017, MNRAS, 467, 3970
Gomer M. R., Williams L. L. R., 2019, arXiv e-prints, p. arXiv:1907.08638
Hezaveh Y. D., Levasseur L. P., Marshall P. J., 2017, Nature, 548, 555
Hilbert S., White S. D. M., Hartlap J., Schneider P., 2007, MNRAS, 382, 121
Hilbert S., Hartlap J., White S. D. M., Schneider P., 2009, A&A, 499, 31
Hu C.-Y., Naab T., Walch S., Moster B. P., Oser L., 2014, MNRAS, 443, 1173
Keeton C. R., 2003, ApJ, 582, 17
Knox L., Millea M., 2020, Phys. Rev. D, 101, 043533
Kochanek C. S., 2020, MNRAS, 493, 1725
Krist J. E., Hook R. N., Stoehr F., 2011, 20 years of Hubble Space Telescope optical modeling using Tiny Tim. p. 81270J, doi:10.1117/12.892762
Leier D., 2009, MNRAS, 400, 875
Liao K., et al., 2015, ApJ, 800, 11
Mamon G. A., Łokas E. L., 2005, MNRAS, 363, 705
Merritt D., 1985a, AJ, 90, 1027
Merritt D., 1985b, AJ, 90, 1027
Merritt D., 1985c, MNRAS, 214, 25P
Metcalf R. B., Petkova M., 2014, MNRAS, 445, 1942
Millon M., et al., 2019, arXiv e-prints, p. arXiv:1912.08027
Mukherjee S., Koopmans L. V. E., Metcalf R. B., Tortora C., Schaller M., Schaye J., Vernardos G., Bellagamba F., 2019, arXiv e-prints, p. arXiv:1901.01095
Naab T., Ostriker J. P., 2017, ARA&A, 55, 59
Navarro J. F., Frenk C. S., White S. D. M., 1997, ApJ, 490, 493
Oguri M., 2010, PASJ, 62, 1017
Oser L., Ostriker J. P., Naab T., Johansson P. H., Burkert A., 2010, ApJ, 725, 2312
Osipkov L. P., 1979, Pisma v Astronomicheskii Zhurnal, 5, 77
Paraficz D., Hjorth J., 2010, ApJ, 712, 1378
Perlmutter S., et al., 1999, ApJ, 517, 565
Petkova M., Metcalf R. B., Giocoli C., 2014, MNRAS, 445, 1954
Planck Collaboration et al., 2014, A&A, 571, A16
Planck Collaboration et al., 2016, A&A, 594, A13
Planck Collaboration et al., 2018, arXiv e-prints, p. arXiv:1807.06209
Quinn J., et al., 2016, MNRAS, 459, 2394
Rantala A., Johansson P. H., Naab T., Thomas J., Frigo M., 2018, ApJ, 864, 113
Refregier A., 2003a, MNRAS, 338, 35
Refregier A., 2003b, MNRAS, 338, 35
Refsdal S., 1966, MNRAS, 132, 101
Riess A. G., et al., 1998, AJ, 116, 1009
Riess A. G., et al., 2016, ApJ, 826, 56
Riess A. G., Casertano S., Yuan W., Macri L. M., Scolnic D., 2019, ApJ, 876, 85
Rusin D., Ma C.-P., 2001, ApJ, 549, L33
Rusu C. E., et al., 2017, MNRAS, 467, 4220
Saha P., 2000, AJ, 120, 1654
Saha P., Williams L. L. R., 2004, AJ, 127, 2604
Schechter P. L., et al., 1997, ApJ, 475, L85
Schneider P., Sluse D., 2013, A&A, 559, A37
Schneider P., Sluse D., 2014, A&A, 564, A103
Shajib A. J., Treu T., Agnello A., 2018, MNRAS, 473, 210
Shajib A. J., et al., 2019, MNRAS, 483, 5649
Shajib A. J., et al., 2020, MNRAS, 494, 6072
Sonnenfeld A., 2018, MNRAS, 474, 4648
Springel V., 2005, MNRAS, 364, 1105
Suyu S. H., Marshall P. J., Auger M. W., Hilbert S., Blandford R. D., Koopmans L. V. E., Fassnacht C. D., Treu T., 2010a, ApJ, 711, 201
Suyu S., Marshall P., Auger M., Hilbert S., Blandford R., Koopmans L., Fassnacht C., Treu T., 2010b, ApJ, 711, 201
Suyu S. H., et al., 2012, preprint (arXiv:1202.4459)
Suyu S. H., et al., 2014, ApJ, 788, L35
Tagore A. S., Barnes D. J., Jackson N., Kay S. T., Schaller M., Schaye J., Theuns T., 2018, MNRAS, 474, 3403
Tak H., Mandel K., van Dyk D. A., Kashyap V. L., Meng X.-L., Siemiginowska A., 2017, The Annals of Applied Statistics, 11, 1309
Tak H., Ghosh S. K., Ellis J. A., 2018, MNRAS, 481, 277
Tak H., You K., Ghosh S. K., Su B., Kelly J., 2020, Journal of Computational and Graphical Statistics
Thomas J., Ma C.-P., McConnell N. J., Greene J. E., Blakeslee J. P., Janish R., 2016, Nature, 532, 340
Treu T., Koopmans L. V. E., 2002a, MNRAS, 337, L6
Treu T., Koopmans L. V. E., 2002b, ApJ, 575, 87
Treu T., Marshall P. J., 2016, A&ARv, 24, 11
Verde L., Treu T., Riess A. G., 2019, Nature Astronomy, 3, 891
Vogelsberger M., Genel S., Sijacki D., Torrey P., Springel V., Hernquist L., 2013, MNRAS, 436, 3031
Vogelsberger M., et al., 2014, Nature, 509, 177
Williams L. L. R., Saha P., 2000, AJ, 119, 439
Winn J. N., Rusin D., Kochanek C. S., 2004, Nature, 427, 613
Wong K. C., et al., 2020, MNRAS
Xu D., Sluse D., Schneider P., Springel V., Vogelsberger M., Nelson D., Hernquist L., 2016, MNRAS, 456, 739
Xu D., Springel V., Sluse D., Schneider P., Sonnenfeld A., Nelson D., Vogelsberger M., Hernquist L., 2017, MNRAS, 469, 1824
Yuan W., Riess A. G., Macri L. M., Casertano S., Scolnic D. M., 2019, ApJ, 886, 61
Zhang M., Jackson N., Porcas R. W., Browne I. W. A., 2007, MNRAS, 377, 1623
APPENDIX A: DETAILS OF RUNG 3

A1 Illustris simulations
The first set of simulated galaxies is selected from the Illustris simulation (Vogelsberger et al. 2013, 2014): six galaxies at redshifts z = … and z = …. All have total dark matter halo masses between … and … M⊙, and velocity dispersions ranging from 250 km/s to 320 km/s. In this challenge we did not intend to test biases in the most severe cases, where the true profiles deviate significantly from power-law models. For this reason, we required the selected galaxies to lie fairly close to the best-fit general mass–velocity dispersion relation. As a result, the majority of the selected galaxies are not classified as extreme cases of deviation from a power-law mass distribution; the most severe such case would result in an underestimate of the Hubble constant by 15% (see Xu et al. 2016).

The convergence and potential maps (as well as the first and second derivatives of the potential) were calculated with a nested-mesh method using FFTs with an isolated boundary condition. The matter distribution of each selected galaxy halo is truncated at R200 with a spherical aperture. The results were cross-checked with the public software GLAMER, a ray-tracing code for the simulation of gravitational lenses (Metcalf & Petkova 2014; Petkova et al. 2014). In addition, we calculated the same maps using a mesh-based FFT algorithm, adopting a smoothed-particle hydrodynamics (SPH) kernel to smooth the simulated particles onto the mesh. The two sets of results were consistent within the numerical uncertainties.

The velocity maps were calculated on the desired meshes, with no smoothing applied. The pixel values of the mean velocity and velocity dispersion were weighted by the rest-frame SDSS r-band luminosities of the stellar particles projected onto each pixel.

A2 Zoom simulations
The second set of simulations is a sample of 'zoom' cosmological simulations, previously used in Frigo et al. (2019). A 'zoom' simulation is a higher-resolution re-run of a small part of the cosmological box of a large-scale simulation (like Illustris), called the 'parent' simulation. In the set we employed, the parent simulation is a 100 Mpc wide cosmological box simulated with dark matter only (Oser et al. 2010), and each zoom simulation covers the volume of a dark matter halo (at z = …, of mass … M⊙). This is higher resolution than Illustris, but not high enough to avoid the issues presented in Section 5. The simulations run from z = 43 to z = 0. The sample of simulated galaxies varies in mass, size, dynamical, and stellar-population properties. For the TDLMC project we used snapshots at different redshifts (0.… < z < ….5) of the four most massive AGN galaxies, which have arcsec-size Einstein radii. More details on the simulation code and on this sample can be found in Frigo et al. (2019).

The maps of convergence, lensing potential, and its derivatives were calculated with the post-processing ray-tracing code Hilbert (Hilbert et al. 2007, 2009). The whole high-resolution region of each simulation, roughly reaching out to twice the virial radius of the galaxy, was fed into the code and used to calculate the lensing maps. The 3D orientation of the galaxy was chosen randomly before the analysis. The kinematic maps were calculated on the same grid as the lensing maps, weighting the line-of-sight velocity of each particle by its R-band luminosity.

Table A1. Metrics of the blind submissions for Rung 3.

Team       algorithm                   f      log(χ²)   P (%)     A (%)
Student-T  algorithm1                  0.750   0.117    15.616    -3.803
Student-T  algorithm2                  0.812  -0.583    26.226     6.221
Student-T  algorithm3                  0.938   0.459     8.472     1.677
Student-T  algorithm4                  1.000   0.213    12.869     2.512
Student-T  algorithm5                  0.875   0.402    11.998   -11.998
Student-T  algorithm6                  0.938  -0.932    26.515     3.986
Student-T  algorithm7                  1.000   0.718     4.885    -5.415
Student-T  algorithm8                  1.000   0.027    12.587    -2.786
Student-T  algorithm9                  0.875   0.532     8.247    -7.373
Student-T  algorithm10                 0.938  -0.848    15.369     4.401
Student-T  algorithm11                 1.000   1.132     3.923    -5.065
Student-T  algorithm12                 1.000   0.115     9.728    -1.195
EPFL       Combined                    0.438   0.893     4.276    -9.963
EPFL       CombinedDdtOnly             0.438   0.879     4.584    -9.944
EPFL       Composite                   0.500   1.515     2.612   -11.302
EPFL       CompositeDdtOnly            0.500   1.500     2.559   -11.403
EPFL       Powerlaw                    0.812   0.938     2.941    -7.016
EPFL       PowerlawDdtonly             0.812   0.955     3.001    -6.973
Freeform   glassMulti                  1.000   2.464     5.106   -16.041
Freeform   glassSingleHiRes            1.000   1.954     5.809   -17.267
Freeform   glassSingleLowRes           1.000   1.401     9.632   -11.441
Freeform   pixelensMulti               1.000  -0.695    18.866     7.626
Freeform   pixelensSingle              1.000  -0.226    21.637     0.542
H0rton     Bayesian neural network     0.312   0.637     9.056     3.356

Note: the table summarizes the metrics of the blind submissions for Rung 3.
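Both simulation sets thus reduce to the same two map-making steps: deriving the lensing potential from the projected mass, and luminosity-weighted binning of particle line-of-sight velocities. As an illustration only (not the authors' actual pipelines, which use nested meshes, isolated boundary conditions, and the GLAMER and Hilbert ray-tracing codes), the two steps can be sketched with a periodic-boundary FFT Poisson solve and simple 2D histogramming:

```python
import numpy as np

def potential_from_convergence(kappa, pixel_scale):
    """Solve the 2D Poisson equation  laplacian(psi) = 2*kappa  via FFT.

    Periodic boundaries are assumed here; the codes used in the paper
    apply an isolated boundary condition instead.
    """
    ny, nx = kappa.shape
    ky = 2.0 * np.pi * np.fft.fftfreq(ny, d=pixel_scale)
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=pixel_scale)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    k2[0, 0] = 1.0                      # avoid division by zero for the mean mode
    psi_hat = -2.0 * np.fft.fft2(kappa) / k2
    psi_hat[0, 0] = 0.0                 # potential is defined up to a constant
    return np.real(np.fft.ifft2(psi_hat))

def kinematic_maps(x, y, v_los, lum, nbins, extent):
    """Luminosity-weighted mean velocity and dispersion maps from particles."""
    edges = np.linspace(extent[0], extent[1], nbins + 1)
    L, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=lum)
    Lv, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=lum * v_los)
    Lv2, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=lum * v_los ** 2)
    with np.errstate(invalid="ignore", divide="ignore"):
        v_mean = Lv / L                           # luminosity-weighted first moment
        sigma = np.sqrt(Lv2 / L - v_mean ** 2)    # luminosity-weighted second moment
    return v_mean, sigma
```

The deflection angles and the second derivatives of the potential (used for the lensing maps above) would then follow from spectral or finite differences of psi.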
A3 Rung 3 results
For completeness, we report here the full results of Rung 3. We caution the reader that the interpretation of these results is difficult because of the limitations and numerical issues described in Section 5. The metrics of each submission for Rung 3 are listed in Table A1 and plotted in Figures A1 and A2.
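For reference, the four quantities reported in Table A1 are the challenge metrics: efficiency f (fraction of lenses with a submitted H0), goodness log(χ²), precision P, and accuracy A. A schematic implementation, following our reading of the challenge definitions and assuming each submission consists of point estimates H0_i with 1σ uncertainties δ_i (the official evaluation code may differ in details), is:

```python
import numpy as np

def tdlmc_metrics(h0_est, h0_err, h0_true, n_total):
    """Challenge metrics for a set of submitted H0 estimates.

    h0_est, h0_err : submitted values and 1-sigma uncertainties
                     (only for the lenses the team actually modelled)
    h0_true        : the true H0 used to simulate the rung
    n_total        : total number of lenses in the rung
    Returns (efficiency f, log10 chi^2, precision P in %, accuracy A in %).
    """
    h0_est = np.asarray(h0_est, dtype=float)
    h0_err = np.asarray(h0_err, dtype=float)
    f = len(h0_est) / n_total                                  # efficiency
    chi2 = np.mean(((h0_est - h0_true) / h0_err) ** 2)         # goodness of fit
    precision = 100.0 * np.mean(h0_err / h0_true)              # claimed precision
    accuracy = 100.0 * np.mean((h0_est - h0_true) / h0_true)   # bias
    return f, np.log10(chi2), precision, accuracy
```

With these conventions, an unbiased submission with well-calibrated errors would score log(χ²) near 0 and accuracy near 0, as the best-performing teams in Table A1 approximately do.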
Figure A1. Metrics of Rung 3 blind submissions (efficiency, log(χ²), precision (%), and accuracy (%) for each submission). Note that Rung 3 was affected by the issues described in Section 5, and thus great caution should be taken in interpreting these results.

Figure A2. The left and right panels are the same as Figures 8 and 9, respectively, but for Rung 3.

This paper has been typeset from a TEX/LaTeX file prepared by the author.