GREAT3 results I: systematic errors in shear estimation and the impact of real galaxy morphology
Rachel Mandelbaum, Barnaby Rowe, Robert Armstrong, Deborah Bard, Emmanuel Bertin, James Bosch, Dominique Boutigny, Frederic Courbin, William A. Dawson, Annamaria Donnarumma, Ian Fenech Conti, Raphael Gavazzi, Marc Gentile, Mandeep S. S. Gill, David W. Hogg, Eric M. Huff, M. James Jee, Tomasz Kacprzak, Martin Kilbinger, Thibault Kuntzer, Dustin Lang, Wentao Luo, Marisa C. March, Philip J. Marshall, Joshua E. Meyers, Lance Miller, Hironao Miyatake, Reiko Nakajima, Fred Maurice Ngole Mboula, Guldariya Nurbaeva, Yuki Okura, Stephane Paulin-Henriksson, Jason Rhodes, Michael D. Schneider, Huanyuan Shan, Erin S. Sheldon, Melanie Simet, Jean-Luc Starck, Florent Sureau, Malte Tewes, Kristian Zarb Adami, Jun Zhang, Joe Zuntz
Mon. Not. R. Astron. Soc. 000, 000–000 (0000) Printed 10 August 2018 (MN LaTeX style file v2.2)
Affiliations: McWilliams Center for Cosmology, Department of Physics, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA; Department of Physics & Astronomy, University College London, Gower Street, London, WC1E 6BT, UK; Department of Astrophysical Sciences, Princeton University, Peyton Hall, Princeton, NJ 08544, USA; Kavli Institute for Particle Astrophysics and Cosmology, Department of Physics, Stanford University, Stanford, CA 94305, USA; SLAC National Accelerator Laboratory, 2575 Sand Hill Road, Menlo Park, CA 94025, USA; Institut d'Astrophysique de Paris, UMR 7095 CNRS – Université Pierre et Marie Curie, 98bis Bd Arago, 75014 Paris, France; Centre de Calcul de l'IN2P3, USR 6402 du CNRS-IN2P3, 43 Bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France; Laboratoire d'astrophysique, Ecole Polytechnique Fédérale de Lausanne (EPFL), Observatoire de Sauverny, CH-1290 Versoix, Switzerland; Lawrence Livermore National Laboratory, P.O. Box 808 L-210, Livermore, CA 94551, USA; Institute of Space Sciences & Astronomy (ISSA), University of Malta, Msida, MSD 2080, Malta; Center for Cosmology and Particle Physics, Department of Physics, New York University, 4 Washington Pl; Center for Cosmology and AstroParticle Physics (CCAPP) and Department of Physics, The Ohio State University, 191 W. Woodruff Ave., Columbus, OH 43210, USA; Department of Physics, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA; Institut für Astronomie, ETH Zürich, Wolfgang-Pauli-Str. 27, 8093 Zürich, Switzerland; Laboratoire AIM, UMR CEA-CNRS-Paris 7, Irfu, SAp SEDI, Service d'Astrophysique, CEA Saclay, F-91191 Gif-sur-Yvette Cedex, France; Key Laboratory for Research in Galaxies and Cosmology, Shanghai Astronomical Observatory, Nandan Road 80, Shanghai 200030, China; David Rittenhouse Laboratory, University of Pennsylvania, 209 South 33rd Street, Philadelphia, PA 19104, USA; Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford OX1 3RH, UK; Kavli Institute for the Physics and Mathematics of the Universe (Kavli IPMU, WPI), The University of Tokyo, Kashiwa, Chiba 277-8582, Japan; Argelander-Institut für Astronomie, Auf dem Hügel 71, D-53121 Bonn, Germany; National Astronomical Observatory of Japan, Tokyo 181-8588, Japan; Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA; California Institute of Technology, Pasadena, CA 91125, USA; Brookhaven National Laboratory, Bldg 510, Upton, NY 11973, USA; Center for Astronomy and Astrophysics, Department of Physics and Astronomy, Shanghai Jiao Tong University, 955 Jianchuan Road, Shanghai 200240, China; Jodrell Bank Centre for Astrophysics, University of Manchester, Manchester M13 9PL, UK
ABSTRACT
We present first results from the third GRavitational lEnsing Accuracy Testing (GREAT3) challenge, the third in a sequence of challenges for testing methods of inferring weak gravitational lensing shear distortions from simulated galaxy images. GREAT3 was divided into experiments to test three specific questions, and included simulated space- and ground-based data with constant or cosmologically varying shear fields. The simplest (control) experiment included parametric galaxies with a realistic distribution of signal-to-noise ratio, size, and ellipticity, and a complex point spread function (PSF). The other experiments tested the additional impact of realistic galaxy morphology, multiple-exposure imaging, and uncertainty about a spatially varying PSF; the last two questions will be explored in Paper II. The 24 participating teams competed to estimate lensing shears to within the systematic error tolerances for upcoming Stage-IV dark energy surveys, making 1525 submissions overall. GREAT3 saw considerable variety and innovation in the types of methods applied. Several teams now meet or exceed the targets in many of the tests conducted (to within the statistical errors). We conclude that the presence of realistic galaxy morphology in simulations changes shear calibration biases at the per-cent level for a wide range of methods. Other effects, such as truncation biases due to finite galaxy postage stamps and the impact of galaxy type as measured by the Sérsic index, are quantified for the first time. Our results generalize previous studies regarding sensitivities to galaxy size and signal-to-noise ratio, and to PSF properties such as seeing and defocus. Almost all methods' results support the simple model in which additive shear biases depend linearly on PSF ellipticity.

Key words: gravitational lensing: weak — methods: data analysis — techniques: image processing — cosmology: observations.
Weak gravitational lensing, the small but coherent deflections of light from distant objects due to the gravitational field of more nearby matter (for a review, see Bartelmann & Schneider 2001; Refregier 2003; Schneider 2006; Hoekstra & Jain 2008; Massey, Kitching & Richard 2010), has emerged in the past two decades as a promising way to constrain cosmological models, to study the relationship between visible and dark matter, and even to constrain the theory of gravity on cosmological scales (e.g., Hu 2002; Huterer 2002; Abazajian & Dodelson 2003; Zhang et al. 2007). Because of this promise, gravitational lensing has already been measured in many datasets, and there are several large surveys planned for the next few decades to measure weak lensing even more precisely, including Euclid (Laureijs et al. 2011), LSST (LSST Science Collaborations & LSST Project 2009), and WFIRST-AFTA (Spergel et al. 2013), all of which are Stage IV dark energy experiments according to the Dark Energy Task Force (Albrecht et al. 2006) definitions.

The most common type of weak lensing measurement involves measuring coherent distortions ("shear") in the shapes of galaxies. In order for the aforementioned surveys to make the most of their ability to measure these distortions with sub-per cent statistical errors, they must ensure adequate control of systematic errors.
While a full systematic error budget for weak lensing includes both astrophysical and instrumental systematic errors, a problem that has occupied much attention in the community for over a decade is ensuring accurate measurements of the shear distortions of galaxies given that they have been convolved with a point spread function (PSF) and rendered into noisy images.

(⋆ E-mail: [email protected]. † E-mail: [email protected], [email protected]. Survey websites: http://sci.esa.int/euclid/, http://wfirst.gsfc.nasa.gov.)

With the rapid proliferation of shear estimation methods, the weak lensing community began a series of blind community challenges, with simulations that included a lensing shear (known only to the organizers) that participants must measure. This served as a way to benchmark different shear estimation methods. The earliest of these challenges were the first Shear TEsting Programme (STEP1: Heymans et al. 2006) and its successor (STEP2: Massey et al. 2007a). Then it became apparent that many complex aspects of the process of shear estimation would benefit from simpler and more controlled simulations, which led to the GRavitational lEnsing Accuracy Testing (GREAT08) challenge (Bridle et al. 2009, 2010), followed by the GREAT10 challenge (Kitching et al. 2010, 2012, 2013).

Each of these challenges has been informative in its own way, illuminating important issues in shear estimation while also generating significant improvement in the accuracy of weak lensing shear estimation. For example, both the GREAT08 and GREAT10 challenges highlighted the role played by pixel noise in biasing shear estimates. While this S/N- and resolution-dependent "noise bias" was studied in specific contexts before GREAT08 and GREAT10 (e.g., Bernstein & Jarvis 2002; Hirata et al. 2004), the landscape changed after GREAT08, with several more general studies (Kacprzak et al. 2012; Melchior & Viola 2012; Refregier et al. 2012), some of which used the GREAT10 simulations as a test for calibration schemes.
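To make the notion of noise bias concrete, here is a small illustrative Monte Carlo of our own (not taken from any of the cited studies): because an ellipticity estimate is a non-linear function of noisy pixel data, its expectation value is biased even though the noise itself has zero mean. We use a ratio of noisy "moments" as a toy stand-in for a moment-based ellipticity:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy model: an "ellipticity" e = (Q11 - Q22) / (Q11 + Q22) built from
# noisy measurements of two moments with true values 2.0 and 1.0.
Q11_true, Q22_true = 2.0, 1.0
e_true = (Q11_true - Q22_true) / (Q11_true + Q22_true)  # = 1/3

sigma = 0.3       # noise per moment; larger noise mimics lower S/N
n = 400_000       # number of Monte Carlo realizations
Q11 = Q11_true + sigma * rng.normal(size=n)
Q22 = Q22_true + sigma * rng.normal(size=n)

# The estimator is unbiased in Q11 and Q22, but not in their ratio:
e_hat = (Q11 - Q22) / (Q11 + Q22)
bias = e_hat.mean() - e_true
print(f"true e = {e_true:.4f}, mean estimate = {e_hat.mean():.4f}, "
      f"bias = {bias:+.4f}")
```

The mean of the noisy estimates comes out systematically above the true value, even though the noise on each moment averages to zero; the size of the offset grows with the noise variance, which is the qualitative behavior of noise bias.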
However, despite the progress encouraged by these challenges, there remained a number of outstanding issues in shear estimation that needed to be addressed for the community to ensure its ability to measure weak lensing in near-term and future surveys. These issues include the impact of realistic galaxy morphology: a number of studies have convincingly demonstrated that when estimating shears in a way that assumes a particular galaxy model, the shears can be biased if the galaxy light profiles are not correctly described by that model (termed "model bias": Voigt & Bridle 2010; Melchior et al. 2010). More generally, any method based on the use of second moments to estimate shears cannot be completely independent of the details of the galaxy light profiles, such as the overall galaxy morphology and presence of detailed substructure (Massey et al. 2007b; Bernstein 2010; Zhang & Komatsu 2011). Thus, the question of the impact of realistic galaxy morphology (and the way that galaxies deviate from simple parametric models) on shear estimation is important to address in a community-wide challenge. This is one of the key questions of the GREAT3 challenge.

The GREAT3 challenge was also designed to address two additional questions. One of these is the combination of multiple exposures, which is necessary to analyze the data from nearly any current or upcoming weak lensing survey. For Nyquist-sampled data this is relatively straightforward, but for data that are not Nyquist-sampled (such as some images from space telescopes), the problem is more challenging (e.g. Lauer 1999; Rowe, Hirata & Rhodes 2011; Fruchter 2011). The final problem addressed in GREAT3 is the impact of PSF estimation from stars and interpolation to the positions of the galaxies. However, this paper will focus predominantly on the question of shear estimation in general and realistic galaxy morphology in particular, leaving the other questions for Paper II.

In Sec.
2, we describe how the challenge was designed and run, how submissions were evaluated, and give a basic summary of the submissions that were made. We discuss the methods used by participants to analyze the simulated data in Sec. 3. For certain methods for which the teams made many submissions, we derive lessons related to those methods in Sec. 4. We then present the overall results for all teams in Sec. 5. Sec. 6 describes some lessons learned about shear estimation from GREAT3, and we conclude in Sec. 7. Finally, there are appendices with some further technical details related to the challenge simulations, and lengthier descriptions of the methods used by each team.

Gravitational lensing distorts the images of distant galaxies. When this distortion can be described as a locally linear transformation, the lensing effect is described as "weak". In this case, it relates the unlensed coordinates (x_u, y_u; with the origin at the center of the distant light source) and the observed, lensed coordinates (x_l, y_l; with the origin at the center of the observed image) via

\begin{pmatrix} x_u \\ y_u \end{pmatrix} = \begin{pmatrix} 1 - \kappa - \gamma_1 & -\gamma_2 \\ -\gamma_2 & 1 - \kappa + \gamma_1 \end{pmatrix} \begin{pmatrix} x_l \\ y_l \end{pmatrix}. \quad (1)

The two components of the lensing shear (γ_1, γ_2) describe the stretching of galaxy images due to lensing, whereas the convergence κ describes a change in apparent size and brightness for lensed objects. This transformation is often recast as

\begin{pmatrix} x_u \\ y_u \end{pmatrix} = (1 - \kappa) \begin{pmatrix} 1 - g_1 & -g_2 \\ -g_2 & 1 + g_1 \end{pmatrix} \begin{pmatrix} x_l \\ y_l \end{pmatrix} \quad (2)

in terms of the reduced shear, g_i = γ_i/(1 − κ) ≃ γ_i in most cosmological applications. Typically it is the stretching described by the reduced shear that is actually observed. We often encode the two components of shear (reduced shear) as a single complex number, γ = γ_1 + iγ_2 (g = g_1 + ig_2).

The lensing shear causes a change in estimates of the ellipticity of distant galaxies.
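As a quick numerical check of Eq. (2) (our own sketch, not part of the challenge machinery), the reduced-shear mapping from lensed to unlensed coordinates can be written directly:

```python
import numpy as np

def lensed_to_unlensed(x_l, y_l, g1, g2, kappa=0.0):
    """Map observed (lensed) coordinates to unlensed ones via Eq. (2)."""
    A = (1.0 - kappa) * np.array([[1.0 - g1, -g2],
                                  [-g2, 1.0 + g1]])
    return A @ np.array([x_l, y_l])

# With zero shear and convergence the mapping is the identity.
print(lensed_to_unlensed(1.0, 2.0, g1=0.0, g2=0.0))

# A small positive g1 shrinks x and stretches y in the unlensed frame,
# i.e. the observed image appears stretched along the x-axis.
print(lensed_to_unlensed(1.0, 0.0, g1=0.03, g2=0.0))
```

Note that κ only rescales coordinates isotropically, so in the weak limit the shape change is driven entirely by (g_1, g_2), consistent with the statement that only the reduced shear is observable from galaxy shapes.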
In practice, the effect is estimated statistically by measuring galaxy properties that transform in simple ways under a shear. One method is to model the galaxy image using a profile with a well-defined ellipticity, written as ε = ε_1 + iε_2, with magnitude

|\varepsilon| = \frac{1 - b/a}{1 + b/a} \quad (3)

for semi-minor and semi-major axis lengths b and a, and orientation angle determined by the major axis direction. For a population of randomly oriented source intrinsic ellipticities, the ensemble average ellipticity after lensing gives an unbiased estimate of the shear: ⟨ε⟩ ≃ g.

Another common choice of shape parametrization is based on second brightness moments of the galaxy image,

Q_{ij} = \frac{\int \mathrm{d}^2x \, I(\mathbf{x}) \, W(\mathbf{x}) \, x_i x_j}{\int \mathrm{d}^2x \, I(\mathbf{x}) \, W(\mathbf{x})}, \quad (4)

where (x_1, x_2) correspond to the (x, y) directions, I(x) denotes the galaxy image light profile, W(x) is an optional weight function (see, e.g., Schneider 2006), and the coordinate origin is placed at the galaxy image center. (The weight function is optional for the purpose of this definition, but in practice, for images with noise, some weight function that reduces the contribution from the wings of the galaxy is necessary to avoid the moments being dominated by noise.) A second ellipticity definition (sometimes called the distortion, to distinguish it from the ellipticity that satisfies Eq. 3) can be written as

e = e_1 + \mathrm{i} e_2 = \frac{Q_{11} - Q_{22} + 2\mathrm{i} Q_{12}}{Q_{11} + Q_{22}}. \quad (5)

The ellipticity ε can also be related to the moments by replacing the denominator in Eq. (5) with Q_{11} + Q_{22} + 2(Q_{11} Q_{22} - Q_{12}^2)^{1/2}. If the weight function W is constant or brightness-dependent, an image with elliptical isophotes has

|e| = \frac{1 - b^2/a^2}{1 + b^2/a^2}. \quad (6)

For a randomly oriented population of source distortions, the ensemble average e after lensing gives an unbiased estimate of shear that depends on the population root mean square (RMS) distortion ⟨(e^(s))^2⟩ as ⟨e⟩ ≃ 2[1 − ⟨(e^(s))^2⟩] g. See e.g. Bernstein & Jarvis (2002) for further details on commonly used shear and ellipticity definitions.
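The moment-based definitions can be verified numerically. The following sketch (ours; it uses an unweighted version of Eq. 4, i.e. W = 1, on a noise-free image) measures the second moments of an elliptical Gaussian with axis ratio b/a = 1/2 and recovers the distortion |e| = (1 − 1/4)/(1 + 1/4) = 0.6 expected from Eq. (6):

```python
import numpy as np

def distortion_from_image(image):
    """Unweighted second moments (Eq. 4 with W=1) and distortion (Eq. 5)."""
    ny, nx = image.shape
    y, x = np.mgrid[0:ny, 0:nx].astype(float)
    flux = image.sum()
    # Centroid of the light distribution.
    xc = (image * x).sum() / flux
    yc = (image * y).sum() / flux
    dx, dy = x - xc, y - yc
    Q11 = (image * dx * dx).sum() / flux
    Q22 = (image * dy * dy).sum() / flux
    Q12 = (image * dx * dy).sum() / flux
    return complex(Q11 - Q22, 2.0 * Q12) / (Q11 + Q22)

# Elliptical Gaussian, twice as extended along x as along y (b/a = 1/2),
# major axis along the x direction, centered on a 101x101 pixel grid.
ny = nx = 101
y, x = np.mgrid[0:ny, 0:nx].astype(float)
sigma_x, sigma_y = 6.0, 3.0
image = np.exp(-0.5 * (((x - 50) / sigma_x) ** 2 + ((y - 50) / sigma_y) ** 2))

e = distortion_from_image(image)
print(e.real, abs(e))
```

Because the major axis lies along x and the profile is symmetric, e_2 vanishes and e_1 carries the full distortion, illustrating how the two components encode orientation as well as elongation.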
Here we describe how the GREAT3 challenge was structured; more details are given in the handbook, Mandelbaum et al. (2014).

The GREAT3 challenge was designed to address how three issues affect shear estimation: (a) the impact of realistic galaxy morphology, (b) the impact of the image combination process, and (c) the effect of errors due to estimation and interpolation of the PSF. To this end, the challenge consisted of five experiments:

(1) Control: Parametric (single or double Sérsic) galaxy models based on fits (Lackner & Gunn 2012) to HST data from the COSMOS (Koekemoer et al. 2007; Scoville et al. 2007b,a) survey, meant to represent the galaxy population in a typical weak lensing survey, including appropriate size vs. galaxy flux signal-to-noise (S/N) relations, morphology distributions, and so on. In each image, the non-trivially complex PSF was provided for the participants as a set of nine images with different centroid offsets.

(2) Real galaxy: Differed from the control experiment only in the use of the actual images from the HST COSMOS dataset instead of the best-fitting parametric models.

(3) Multiepoch: Differed from the control experiment only in that each field contained six images (representing observations that must be combined) instead of one. For the space branches, the six images were not Nyquist sampled.

(4) Variable PSF: Differed from the control experiment only in that the PSF varied across the image in a realistic way, and had to be estimated from star images.

(5) Full: Included the complications of the real galaxy, multiepoch, and variable PSF experiments all together.

In all cases, the goal was to estimate the lensing shear. (This is not the same as testing the ability to measure a per-galaxy shape: two different methods can recover a different per-galaxy shape, while still estimating the overall shear accurately.) For each experiment, there were four branches, which came from the combination of two types of simulated data (ground, space) and two types of shear fields (constant, variable). For convenience, we will refer to branches by their combinations of {experiment}-{observation type}-{shear type}, e.g., control-ground-constant, and will use the unique abbreviations CGC, CGV, and so on. Of the 20 branches (five experiments × two data types × two shear types), participants could submit results for as many or as few as they chose (see Mandelbaum et al. 2014, figure 5). A given branch included 200 subfields, each with galaxies on grids. To reduce statistical errors on the shear biases, galaxies were arranged such that the intrinsic noise due to non-circular galaxy shapes ('shape noise') was nearly cancelled out.

Submissions to the challenge were evaluated according to metrics described in Sec. 2.3. Within a branch, teams were ranked based on their best submission in that branch. Per-branch rankings were used to award teams points, which were then added up across multiple branches to give an overall leaderboard ranking. While the leaderboard ranking was necessary for the purpose of carrying out a challenge, the
Thelast point is particularly relevant for how pixel noise shouldaffect shear estimates in the challenge.Finally, the GREAT3 Executive Committee (EC) dis-tributed example scripts to automatically process the chal-lenge data, including shear estimation, coaddition of multi-epoch data, and variable PSF estimation. While the lattertwo will be discussed in Paper II, we describe the algorithmsin the shear estimation example script in Appendix B. Here we describe the diagnostics used to quantify the perfor-mance of each submission to the challenge. The metrics forconstant- and variable-shear branches, discussed in detail inMandelbaum et al. (2014), were used to rank submissions.Here we briefly define the equations used. https://github.com/barnabytprowe/great3-public https://github.com/GalSim-developers/GalSim The Executive Committee created the simulations, rankingscheme, and other aspects of the challenge, and had access toprivileged information about the simulations. Because of this ac-cess, teams to which they made significant contributions did notreceive points in the challenge, and were not ranked. Those teamsappear on the leaderboard with an asterisk for their score.c (cid:13) , 000–000
For constant-shear simulations, each field has a particular value of shear applied to all galaxies (App. A2). Participants submitted estimated ("observed") shears for each constant shear value in the branch. We relate biases in observed shears g^obs to the true shear g^true using a linear model in each component:

g_i^{\rm obs} - g_i^{\rm true} = m_i \, g_i^{\rm true} + c_i, \quad (7)

where i denotes the shear component, and m_i and c_i are the multiplicative and additive biases, respectively. From user-submitted estimates of all g_i^obs in a branch, the metric calculation begins with an unweighted least-squares linear regression to provide estimates of m_i, c_i given the true shears (in Sec. 4.8 we discuss the role of outliers in affecting the m_i and c_i estimates). The regression is done in a coordinate frame rotated to be aligned with the mean PSF ellipticity in each field, so that c values will properly reflect the contamination of galaxy shapes by the PSF anisotropy.

Having estimated m_i and c_i, we constructed the metric, Q_c, by comparison with 'target' values m_target, c_target. These come from requirements for upcoming weak lensing experiments; we use m_target = 2 × 10⁻³ and c_target = 2 × 10⁻⁴, motivated by a recent estimate of requirements (Cropper et al. 2013; Massey et al. 2013) for the Euclid space mission. The constant-shear metric is then defined as

Q_c = \frac{2000 \times \eta_c}{\sqrt{\sigma_{\min,c}^2 + \sum_{i=+,\times} \left[ \left( \frac{m_i}{m_{\rm target}} \right)^2 + \left( \frac{c_i}{c_{\rm target}} \right)^2 \right]}}. \quad (8)

The indices +, × refer to the two shear components in the rotated reference frame described above. We adopt σ_min,c = 1 (4) for space (ground) branches, corresponding to the typical dispersion in the quadrature sum of m_i/m_target and c_i/c_target due to pixel noise. This metric is normalized by η_c such that methods that meet our chosen targets on m_i and c_i in space-based data should achieve Q_c ≃ 1000.
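The fit of Eq. (7) and the metric of Eq. (8) can be sketched in a few lines (our illustration; eta_c = 1 and sigma_min_c = 1 are placeholders here, not the challenge's calibrated normalization):

```python
import numpy as np

def fit_bias(g_true, g_obs):
    """Least-squares fit of g_obs - g_true = m * g_true + c (Eq. 7),
    returning (m, c) for one shear component."""
    m, c = np.polyfit(g_true, g_obs - g_true, deg=1)
    return m, c

def q_const(m_list, c_list, m_target=2e-3, c_target=2e-4,
            sigma_min_c=1.0, eta_c=1.0):
    """Constant-shear metric in the style of Eq. (8); eta_c here is a
    placeholder normalization, not the value used in the challenge."""
    s = sum((m / m_target) ** 2 + (c / c_target) ** 2
            for m, c in zip(m_list, c_list))
    return 2000.0 * eta_c / np.sqrt(sigma_min_c ** 2 + s)

# Toy example: a method with a 1 per cent multiplicative bias and a small
# additive bias in one component, and no bias in the other.
rng = np.random.default_rng(0)
g_true = rng.uniform(-0.05, 0.05, size=200)
g_obs = 1.01 * g_true + 1e-4       # m = 0.01, c = 1e-4 by construction

m, c = fit_bias(g_true, g_obs)
q = q_const([m, 0.0], [c, 0.0])
print(m, c, q)
```

Because m here exceeds m_target by a factor of five, the recovered Q value falls well below the target-level score, which is the intended penalty behavior of the metric.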
In the ground branches Q_c is slightly lower for submissions reaching target bias levels, reflecting their larger σ_min,c due to greater uncertainty in individual shear estimates for ground data. However, Q_c scores are consistent between space and ground branches where biases are significant. Given the nature of this metric definition, the uncertainty in Q_c is larger at high Q_c than at small Q_c. For the level of pixel noise in the simulations from ground (space), the effective uncertainty on Q_c for Q_c values of [100, …] is [3, …] ([2, …]).

For variable-shear simulations, the key test is the reconstruction of the shear correlation function. Submission of results for these branches begins with calculation of correlation functions by the participant (software for this purpose was distributed publicly at https://github.com/barnabytprowe/great3-public). The submission consists of estimates of the aperture mass dispersion (e.g., Schneider 2006; Schneider et al. 1998), which are constructed from two-point correlation function estimates, and allow a separation into contributions from E and B modes. We label these E- and B-mode aperture mass dispersions M_E and M_B. The submissions were estimates of M_{E,j} for each of ten fields labelled by index j; this estimate is constructed using twenty subfields in a given field. This choice provides a large dynamic range of spatial scales in the correlation function, and thereby probes a greater range of shear signals. The M_{E,j} are estimated in N_bins logarithmically spaced annular bins of galaxy pair separation θ_k, from the smallest available angular scales in the field to the largest.

The metric Q_v for the variable-shear branches was constructed by comparison to the known, true value of the aperture mass dispersion for the realization of E-mode shears in each field. These we label M_{E,true,j}(θ_k).
The variable-shear branch metric is then calculated as

Q_v = \frac{1000 \times \eta_v}{\sigma_{\min,v} + \frac{1}{N_{\rm norm}} \sum_{k=1}^{N_{\rm bins}} \left| \sum_{j=1}^{N_{\rm fields}} \left[ M_{E,j}(\theta_k) - M_{E,{\rm true},j}(\theta_k) \right] \right|}, \quad (9)

where N_norm = N_fields N_bins, σ_min,v = 4 (9) × 10⁻⁸ for space (ground) branches, and η_v is a normalization factor designed to yield Q_v ≃ 1000 for a method achieving m_1 = m_2 = m_target and c_1 = c_2 = c_target.

The primary source of noise in the M_{E,j}(θ_k) is pixel noise, with some residual shape noise playing a role despite the shape noise cancellation scheme. After the end of the challenge we found that a small additional source of noise comes from the interplay between the θ_k bin size, the galaxy grid configuration, and approximations used in the calculation of the correlation function and aperture mass dispersion in corr2. While this is a subdominant source of noise (a small fraction of that due to measurement error), it does mean that participants will find that their Q_v results depend slightly on the ordering of galaxies in their catalog. For the level of pixel noise in the simulations from ground (space), the effective uncertainty on Q_v for Q_v values of [100, …] is [6, …] ([5, …]).

For the constant-shear branches, we have a clean way to directly study additive and multiplicative biases in the form of m_i and c_i, where i = +, × (defined in the frame aligned with the PSF ellipticity, and at 45 degree angles with respect to that direction). However, also of interest are the m_i and c_i defined in the frame defined by the pixel coordinates, for i = 1, 2. In the STEP2 challenge (Massey et al. 2007a), many methods exhibited coherent differences in shear systematics along the pixel axes and at 45 degrees with respect to them, presumably due to the different effective sampling of the galaxy and PSF profiles.
Since the PSF ellipticity direction has a random orientation with respect to the pixel axes, differences between m_1 and m_2 will average out, giving m_+ ≈ m_×. Since differences between m_1 and m_2 may be interesting in understanding the performance of a method, we will use m_1 and m_2 for some of our plots. (For more discussion of the limitations on E- and B-mode separation in GREAT3, please see Mandelbaum et al. 2014. The corr2 software is available at https://code.google.com/p/mjarvis/.)

In addition, c_1 and c_2 may be of interest. While c_+ shows the influence of PSF anisotropy, additive systematics due to PSF anisotropy will have a random sign and direction for each subfield in the pixel coordinate frame, so c_1 and c_2 have an expectation value of zero. Nonzero values may indicate selection biases with respect to the pixel direction, or asymmetric numerical artifacts.

Given the more fundamental nature of m_1 and m_2, and the need to use c_+ to identify additive PSF systematics, we also consider what we will call a "mixed metric", Q_mix, defined in analogy to Q_c (Eq. 8) as

Q_{\rm mix} = \frac{2000 \times \eta_c}{\sqrt{\sigma_{\min,c}^2 + \sum_{i=1,2} \left( \frac{m_i}{m_{\rm target}} \right)^2 + \sum_{i=+,\times} \left( \frac{c_i}{c_{\rm target}} \right)^2}}. \quad (10)

During the challenge period, there were 1525 submissions with nonzero score, from 24 distinct teams. Of these, two teams were actually members of the GREAT3 EC making submissions based on simple test scripts to validate the simulations or submission process; sixteen were teams of participants; and six were teams that included at least one member of the GREAT3 EC, and were thus excluded from winning any points or the challenge itself.

Fig. 1 shows the number of submissions to the challenge as a function of time, expressed in terms of weeks until the deadline. The first entries were submitted near the beginning of the challenge period, which ran from mid-October 2013 until April 30 2014.
The submission rate was an increasing function of time, particularly in the last month; the spike in entries in the last week was partly due to a relaxation of the rules on the number of entries per team per day.

Two teams entered all twenty branches, and 7/24 (30 per cent) of the teams entered more than half the branches. Not surprisingly, many teams chose to focus on the control and realistic galaxy branches, which required the least amount of software infrastructure to participate.

Table 1 shows the results for each branch, including the winning team, the winning score (defined in Sec. 2.3), the number of participating teams, and the number of entries. As shown, a variety of teams with different methods won individual branches, rather than one team dominating everything. For all but two branches, VGV and FGV, the winning scores were high enough that, within the ability of the simulations to determine shear systematic errors, the winning submissions were effectively unbiased. Not only the winning team but also typically several other teams had scores in this range, representing an unprecedented quality of submissions in a weak lensing community challenge. We will discuss why the combination of variable PSF and variable shear was more difficult in Paper II.

Table 1. For each branch, this table shows the winning team and its score, the number of teams that submitted to that branch (with the number having scores above 500 for the submissions analyzed in Sec. 5 shown in parenthesis), and the total number of entries in the branch.

Figure 1. Number of submissions to the GREAT3 challenge as a function of time, expressed in terms of weeks until the deadline. The rules for the number of submissions per team per day were relaxed in the final week of the challenge.

To motivate the approach we take for the analysis, Fig. 2 shows a scatter plot of metric Q (either Q_c or Q_v as appropriate) as a function of time, for all submissions across all branches. (The leaderboard website shows 1532 submissions, but seven had an incorrect submission format, giving Q = 0.) Point styles indicate the team; the legend has been suppressed because our purpose is only to show that (1) there are a huge number of submissions with a wide range of performance, and (2) sometimes even within a given team, the results varied a great deal.

Figure 2. Q for all submissions as a function of time, expressed in terms of weeks until the deadline. Later submissions by the same team that appear to perform worse than earlier submissions typically went to more challenging branches.

We thus approach the analysis in two stages. Our first step, in Sec. 4, is to analyze the results for specific teams that made many submissions, to understand the trends for each method and identify a fair subset of their submissions (one per branch) to compare with those from other teams. Then, in Sec. 5, we use this
indicating differences in how they work. The “Partially Bayesian” label for MaltaOx is meant to indicate a Bayesian marginalization over nuisance parameters combined with mean-likelihood estimation, rather than a fully Bayesian approach.

(3) Moments: there are eight methods that work by combining estimates of galaxy and PSF moments in some way. Of these, six are real-space moments methods (called “Moments”) and two are Fourier-space moments methods (“Fourier moments”). Of the six real-space moments methods, one involves as a key aspect of the method a self-calibration scheme (“Moments + self-calibration”), and that self-calibration could be extended to non-moments-based methods.

(4) Stacking: a single team used image stacking.

(5) Neural network and supervised machine learning (ML): three methods rely heavily on machine learning.

(A few teams listed on the GREAT3 challenge website are not in this table, either because they did not make any submissions, because the team solely existed to demonstrate the use of the example scripts (Appendix B) distributed by the GREAT3 EC (team “GREAT3_EC”), or because the team was created by a GREAT3 EC member only to check the GREAT3 simulations as part of the validation process (team “miyatake-test”).)

The table also lists the weighting scheme that was used. Here there are a few options. Several teams used constant (equal) weighting, in some cases allowing optional rejection using certain selection criteria (“Constant + rejection”). Many teams used inverse-variance weighting, where the variance is a combination of shape noise and measurement error due to pixel noise. In the Bayesian methods, the weights are often implicit rather than explicitly assigned.
Some teams experimented with multiple weighting schemes, in which case their entry in the table is “Various”, and details are in the Appendix.

Another important entry in Table 2 is “Calibration philosophy”, which relates to how or whether a team tries to calibrate out systematic errors, versus attempting to be unbiased a priori. Here there are a few options:

(a) None: These teams apply no calibration corrections.

(b) External simulations: These teams generate their own simulations in order to calibrate their shears. In one case (sFIT), these are produced iteratively until they are found to sufficiently match the data that are being analyzed (“External simulations (iterative)”).

(c) Ellipticity penalty term: One team, rather than applying calibrations after the fact, uses a penalty term on high ellipticity to reduce certain calibration biases. This penalty term must be calibrated in some way, making it somewhat different in nature from the next option.

(d) p(ε) from deep data: Some methods require an input intrinsic ellipticity distribution from deep data (or, more precisely, for BAMPenn, the full distribution of unnormalized moments). This is qualitatively different from requiring external simulations, since many surveys will have a deeper subset of the data that could be used to derive this prior.

(e) Inferred p(ε): One team tried to hierarchically infer the p(ε) and the shear from the data itself.

(f) Self-calibration: Finally, two teams (MetaCalibration and MaltaOx) implemented a self-calibration scheme to derive calibration corrections from the data itself.

Table 2 also lists other useful pieces of information about these methods, as described in the caption.

Before exploring the overall results of the challenge, we first consider several methods in detail. For methods with many submissions, it is important to understand the overall behavior of the method before comparing with others.
For this reason, we carry out two types of tests:

(1) Controlled tests of the performance of the method as a function of the various initial settings and parameter values that determine its performance, for multiple submissions in a given branch.
Team | Class | Weighting scheme | Calibration philosophy | Limitations | N_branch | Rank | Exact PSF? | New software | Time per galaxy
Amalgam@IAP | Maximum likelihood | Inverse variance | Ellipticity penalty | None | 16 | 2 | Yes | Some | 0.1–1 s
BAMPenn | Bayesian Fourier | Implicit | p(ε) from deep data | Variable shear | 2 | - | Yes | Yes | < s
EPFL_gfit | Maximum likelihood | Constant + rejection | None | None | 8 | 6 | Yes | Yes | 1–3 s
CEA-EPFL | Maximum likelihood | Various | None | None | 20 | 3 | Yes | Yes | 1–3 s
CEA_denoise | Moments | Constant | None | None | 8 | - | Yes | No | 0.03 s
CMU experimenters | Stacking | Constant | External simulations | Variable shear | 2 | N/A | Yes | Some | 0.03 s
COGS (im3shape) | Maximum likelihood | Constant | External simulations | None | 12 | N/A | Yes | Yes | 1 s
E-HOLICS | Moments | Constant + rejection | External simulations | None | 12 | 8 | Yes | No | 1–3 s
EPFL_HNN | Neural network | Constant | None | None | 7 | - | Yes | Yes | 2–3 s
EPFL_KSB | Moments | Inverse variance | None | None | 4 | - | Yes | No | 0.001–0.002 s
EPFL_MLP / EPFL_MLP_FIT | Neural network | Constant | None | None | 5 | - | Yes | Yes | 2–3 s
FDNT | Fourier moments | Inverse variance | External simulations | None | 12 | N/A | Yes | Some | ∼ s
Fourier_Quad | Fourier moments | Various | None | None | 6 | 5 | Yes | No | 0.001–0.002 s
HSC/LSST-HSM | Moments | Inverse variance | External simulations | None | 4 | N/A | Yes | Some | 0.05 s
MBI | Bayesian hierarchical | Implicit | Inferred p(ε) | Variable shear, PSF | 4 | 9 | No | Some | 10 s
MaltaOx (LensFit) | Partially Bayesian | Inverse variance | Self-calibration | None | 3 | 7 | Yes | Some | 0.05 s
MegaLUT | Supervised ML | Constant + rejection | External simulations | None | 16 | 4 | Yes | Some | 0.02 s
MetaCalibration | Moments + self-calibration | Inverse variance | Self-calibration | Variable shear | 1 | N/A | Yes | Yes | 0.3 s
Wentao_Luo | Moments | Inverse variance | None | None | 4 | - | Yes | Yes | 1–2 s
ess | Bayesian model-fitting | Implicit | p(ε) from deep data | Variable shear | 2 | - | No | Yes | 1 s
sFIT | Maximum likelihood | Inverse variance | External simulations (iterative) | None | 20 | 1 | Yes | Yes | 0.8 s

Table 2.
Table summarizing the methods used by teams that participated in the challenge, including basic information such as team name; class (overall type of method); weighting scheme; calibration philosophy (discussed in the text); and number of branches entered in the challenge (N_branch). “Limitations” refers to types of data to which the implementation used here is not applicable without significant further development. “Rank” is the leaderboard ranking for those that received points (“-” for those that did not, and “N/A” for those that were ineligible due to participation of a GREAT3 EC member). “Exact PSF?” indicates whether they used the exact PSF or an approximation to it (e.g., sums of Gaussians). “New software” indicates whether the software used to analyze the GREAT3 simulations was newly developed (“yes”), included some existing infrastructure with new software of non-trivial complexity (“some”), or was entirely pre-existing (“no”). Finally, we show the approximate processing time per galaxy per exposure (on a single core) for science-quality shear estimates. Several fields are discussed in detail in Sec. 3.

(2) A comparison of submissions for that method across multiple branches, while holding its initial settings and parameters fixed (instead of using those that happened to give the best metric score in each branch).

These results then serve as a basis for the fair comparison between methods and across branches, which will be performed later in the paper. For all the methods discussed, see Appendix C for a more detailed description.

gfit

gfit parameters

In this section, we show results of a more detailed exploration of the gfit software used by the EPFL_gfit and CEA-EPFL teams (see method descriptions in Appendices C3 and C4). In particular, we investigate the dependence of the results on choices made in the course of estimating the per-object shears, or in the weighting used to estimate an average shear for the entire field.
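The multiplicative and additive biases used throughout this comparison (defined in Sec. 2.3) quantify how estimated shears respond to true shears. As a rough illustration of the idea, not the actual GREAT3 metric code, one can fit g_est = (1 + m) g_true + c per component by linear regression; the function name and synthetic inputs below are hypothetical:

```python
import numpy as np

def fit_shear_bias(g_true, g_est):
    """Fit g_est = (1 + m) * g_true + c; return (m, c)."""
    slope, intercept = np.polyfit(g_true, g_est, 1)
    return slope - 1.0, intercept

# Synthetic example: fields with known true shears, a 2 per cent
# multiplicative bias, a small additive offset, and measurement noise.
rng = np.random.default_rng(42)
g_true = rng.uniform(-0.05, 0.05, size=500)
g_est = 1.02 * g_true + 1.0e-3 + rng.normal(0.0, 1.0e-4, size=500)

m, c = fit_shear_bias(g_true, g_est)
```

With the noise level shown, the fit recovers the injected m = 0.02 and c = 10^-3 to within their statistical errors.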
Our comparison focuses on the constant-shear branches, where we have additional diagnostics such as the multiplicative and additive biases (see Sec. 2.3 for definitions).

This comparison uses the submissions from EPFL_gfit, but the results are also applicable to CEA-EPFL submissions. The factors that were considered in the comparison are the galaxy model, the postage stamp size, the precision on the total flux and centroid, the maximum half-light radii of the bulge and disk, the filtering of the galaxy catalog, constraints on positivity of the bulge and disk flux, and occasional other experiments, such as stacking the 9 PSFs in the star field images, or running a denoising scheme.

We begin by analyzing the fourteen submissions in RGC. Correlating the Q_c values with the settings that vary for these submissions, we find that the parameter that most directly predicts Q_c is the postage stamp size used for the model fitting (see top panel of Fig. 3). As shown, using the full × postage stamp maximizes the Q_c score.

To understand this correlation, we consider the multiplicative bias as a function of postage stamp size (middle panel of Fig. 3). As shown, except for a few outliers, the multiplicative biases m_+ and m_× that contribute to Q_c increase from being consistent with zero to . ± . and . ± . per cent, respectively, as the postage stamp is reduced to half of its (linear) size. The statistical significance of the difference between the results with the maximum and minimum stamp size is greater than the σ that it appears to be in Fig. 3; given the high (∼ .) correlation coefficient between the submissions, the change in m is detected at approximately σ significance.

For maximum-likelihood fitting methods, we expect a calibration bias due to the effects of noise (“noise bias”).
Figure 3. Q_c and Q_mix (top), and the bias components m_i (middle) and c_i (bottom), for the gfit method as a function of the postage stamp size used for modeling the galaxy images in the RGC branch. The target regions are shown as a grey shaded region, within which the vertical axis has a linear scaling; outside of the shaded region, the scaling is logarithmic. Multiple submissions with the same stamp size have slight horizontal offsets for clarity. The errorbars are correlated between the submissions, so the figure cannot be used to assess the statistical significance of differences between them. See the discussion in the text for quantitative calculations of statistical significance. The m_i and c_i panels only show errors on a single quantity (i = +), for clarity.

One interpretation of the RGC results at the maximal postage stamp size is therefore a (cancelling) combination of noise bias with other potential biases, such as those expected due to an imperfect galaxy model.

As the postage stamp size is reduced, the likelihood surface for the shear estimate changes due to reduced information about the light profile, and this change will generally depend on the galaxy size and shape, the postage stamp size, and the noise level. This change in the likelihood surface will in general change the location of the maximum likelihood,
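The significance statements above must account for the correlation between submissions: the same galaxies, noise realizations, and shears enter every run, so differences between submissions are measured far more precisely than the individual errorbars suggest. A minimal sketch of that arithmetic, with illustrative numbers rather than the measured values:

```python
import math

def diff_significance(x1, s1, x2, s2, rho):
    """z-score of (x1 - x2) for two estimates with 1-sigma errors
    s1, s2 and correlation coefficient rho between their errors."""
    var = s1 ** 2 + s2 ** 2 - 2.0 * rho * s1 * s2
    return (x1 - x2) / math.sqrt(var)

# Two m estimates that look ~1.4 sigma apart if treated as independent...
z_indep = diff_significance(0.02, 0.01, 0.0, 0.01, 0.0)
# ...become a >4 sigma difference once a 0.9 correlation is included.
z_corr = diff_significance(0.02, 0.01, 0.0, 0.01, 0.9)
```

This is why the change in m with stamp size is detected much more strongly than a visual reading of the errorbars in Fig. 3 would imply.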
causing a potential bias for such methods. We refer to the resulting bias on ensemble shear estimates as “truncation bias”. For this method, the sign of the effect is apparently increasingly positive as stamp sizes decrease, though that does not necessarily have to be the case for all methods.

We can also see signs that m_1 and m_2, the calibration biases defined in the pixel coordinate system, may be related as m_1 ≈ m_2 + 0. (at . σ significance). A difference between the calibration bias along the pixel directions (m_1) and along the diagonals (m_2) would be consistent with the results of previous work (Massey et al. 2007a; High et al. 2007), and could plausibly be explained either by the different effective sampling of the galaxy and PSF profiles along those directions, or by the fact that the postage stamp itself extends further in the diagonal directions. For the maximal postage stamp size, m_1 and m_2 have opposite signs, which yields m_+ and m_× near zero. For this reason, Q_c > Q_mix for the maximal postage stamp; in this case, Q_mix is a better estimator of the level of systematics in gfit.

We also investigated the additive bias and its variation with postage stamp size in the bottom panel of Fig. 3. Results consistent with zero, c_+ = (− ± ) × 10^−, are achieved at the maximal postage stamp size, but the additive bias becomes steadily more negative until it exceeds our target value for the smallest postage stamp sizes, where c_+ = (− ± ) × 10^−. This result suggests that additive systematics also exhibit truncation bias (with σ significance after accounting for the correlation between submissions). However, the best-fitting values of c_1, c_2, and c_× are within the target region and statistically consistent with zero.

Fig. 3 also shows that a few submissions with large postage stamp sizes had worse than typical results.

Figure 4. Q_c and Q_mix for gfit as a function of the postage stamp size used for modeling the galaxy images in the CSC branch.
For the largest postage stamp size, these variations in Q_c are due to variations in the amount of filtering imposed on the output catalog before averaging to get a mean shear for the field. (Note that with perfect models and in the absence of noise, truncation should not in general cause a bias. Truncation bias could therefore be seen as a modulation of the model and/or noise biases as the weighting of the pixels changes.)
The filtering typically involves the value of the best-fit radii, the sum of the fit residuals (related to fit quality), and the
S/N, and usually involves removing several per cent of the galaxies in each field. For the next-largest stamp size (44), the submissions with worse results involved experimenting with fit settings (e.g., allowing components with negative flux), with the use of denoised images, and with stacking the nine provided PSF images instead of using just one.

Among the space branches, CSC has many gfit entries with different postage stamp sizes, though the maximum is × (out of a possible ×). As for the ground branches, the postage stamp size is the most important factor, with Q_c shown as a function of this parameter in Fig. 4. In this case, the best postage stamp size of × does non-negligibly truncate the light profiles of a fair fraction of the galaxies, whereas the largest postage stamp size used (×) has a substantially lower Q_c due to its multiplicative calibration bias of m_+ = −. ± . per cent and m_× = −. ± . per cent. These biases are reduced to m_+ = −. ± . per cent and m_× = +0. ± . per cent for the best stamp size, a > σ change when accounting for the strong correlation between the submissions.

The natural interpretation is that the various sources of bias in the space simulations for the largest stamp size result in a negative multiplicative bias of ⟨m⟩ ≃ −. ± . per cent (where ⟨m⟩ = [m_+ + m_×]/2), but a positive truncation bias cancels this out for smaller postage stamp sizes. The fact that the bias becomes more positive for smaller stamp sizes is consistent across the ground and space simulations.

The potential sources of bias in the × case include noise bias, some truncation bias compared to the full × case, and model bias due to an inexact match between the parametric models in the simulations and those used by gfit. In all cases, there is a detection of additive systematics, with c_+ ranging from (7 ± ) × 10^− for the × stamp size, to (3 ± ) × 10^− for stamps smaller than ×. The decrease in c_+ due to truncation bias is significant at the σ level.
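The geometric origin of truncation bias can be illustrated with a toy unweighted-moments calculation on an elliptical Gaussian: shrinking the stamp clips the major axis relatively more than the minor axis, biasing the measured ellipticity. This sketch is purely schematic, not the gfit pipeline; note that for this toy estimator the measured ellipticity is biased low, whereas the text shows the net multiplicative effect for a model-fitting method can instead be positive.

```python
import numpy as np

def measured_e1(sigma_x, sigma_y, stamp):
    """Unweighted-moments e1 of an elliptical Gaussian on a stamp x stamp grid."""
    half = stamp / 2.0
    y, x = np.mgrid[-half + 0.5:half + 0.5, -half + 0.5:half + 0.5]
    img = np.exp(-0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2))
    norm = img.sum()
    q11 = (img * x * x).sum() / norm   # second moment along x
    q22 = (img * y * y).sum() / norm   # second moment along y
    return (q11 - q22) / (q11 + q22)

sx, sy = 3.3, 2.7                          # true distortion e1 ~ 0.198
e1_true = (sx ** 2 - sy ** 2) / (sx ** 2 + sy ** 2)
e1_big = measured_e1(sx, sy, 48)           # stamp >> galaxy: nearly unbiased
e1_small = measured_e1(sx, sy, 12)         # heavy truncation: |e1| biased low
```

Because the truncation depends on galaxy size, shape, and noise level, the resulting ensemble shear bias is not a simple constant, consistent with the branch-dependent behavior seen here.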
The best results from the gfit team used quite different postage stamp sizes for each branch. Since the galaxy populations are, in a statistical sense, consistent when comparing across all ground branches and all space branches, a fair cross-branch comparison would use consistent settings for all ground branches and for all space branches. Here we present the results of this comparison.

For ground branches, all branches except for CGC had a submission with a stamp size of ×, and CGC has one with ×, which is close enough for this comparison. Fig. 5 shows the Q values for all gfit submissions in all ground branches, particularly indicating those submissions that are part of the fair cross-branch comparison. Note that the Q_c and Q_v values do not relate to shear systematics in quite the same way, so we cannot directly compare across constant and variable shear branches. However, it is clear in general that the submissions in this fair comparison sample perform respectably ( . Q . ) but do not typically include the best submission in each branch. The results for the mixed metric Q_mix in that figure (top right) for constant-shear branches actually show consistency across branches for the selected submissions, with . Q_mix . .

Figure 5. Top left: Histogram of the Q value (either Q_c or Q_v depending on the branch) for the gfit method for all submissions in ground branches from the CEA-EPFL and EPFL_gfit teams. The large dots located on the histograms indicate the submissions that are part of the fair cross-branch comparison, with the same choice of postage stamp size. Top right, bottom left, bottom right: The same, but for Q_mix, ⟨m⟩, and c_+ (respectively), for constant-shear branches. In the bottom plots, the points have horizontal errorbars indicating their statistical uncertainty, and the shaded regions indicate the target values of ⟨m⟩ and c_+. Outliers have been removed from the bottom two panels so that the main part of the distribution can be clearly seen.

The bottom row of Fig. 5 shows the distribution of multiplicative biases averaged over both components, ⟨m⟩ = (m_+ + m_×)/2, and additive biases aligned with the PSF (c_+; no significant c_× was detected for this or any method) for all submissions in CGC and RGC. For ⟨m⟩, given the fixed gfit analysis settings, the differences between the red points in CGC and RGC indicate an additional multiplicative model bias due to real galaxy morphology of ⟨m⟩_RGC − ⟨m⟩_CGC = 1. ± . per cent. There may also be model bias in CGC due to the parametric models used by gfit not precisely matching the ones in the GREAT3 simulations. The CGC vs. RGC comparison therefore reflects only the additional model bias due to real galaxy morphology, rather than all sources of model bias.

When considering the points that indicate the submissions in the fair comparison sample, the additive biases are consistent with zero for CGC, but a significant detection is seen for RGC, suggesting that model bias due to realistic morphology can result in additive errors from imperfect PSF deconvolution. However, it is worth bearing in mind that the postage stamps used in this cross-branch comparison are significantly truncated. In all these ground branch submissions there will thus be some truncation bias that might interact with other biases such as model biases.
The individual effects cannot be wholly isolated, but the compound effects are clear.

For space branches, the “fair comparison” submissions had postage stamp sizes of ×, representing significant truncation compared to the full size of ×. The fair comparison results do not exhibit the very high Q values of the best submissions (> ) but are, however, in the range < Q < . Comparing CSC and RSC suggests a multiplicative model bias due to realistic galaxy morphology of ⟨m⟩_RSC − ⟨m⟩_CSC = 0. ± . per cent, but no additive model bias.
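Several of the pipelines compared in this section reduce per-galaxy ellipticities to a field shear with an inverse-variance weight that combines shape noise and measurement noise (cf. the weighting column of Table 2). A generic sketch of that reduction, with made-up variable names and not any specific team's code:

```python
import numpy as np

def weighted_field_shear(e, sigma_meas, sigma_sn):
    """Inverse-variance weighted mean ellipticity for one field.

    e          : per-galaxy ellipticity estimates (one component)
    sigma_meas : per-galaxy measurement errors from pixel noise
    sigma_sn   : effective shape-noise dispersion (sigma_s in the text)
    """
    w = 1.0 / (sigma_sn ** 2 + np.asarray(sigma_meas) ** 2)
    return np.sum(w * np.asarray(e)) / np.sum(w)

# Example: two noisy faint galaxies and one well-measured bright one.
e = [0.10, -0.02, 0.04]
sigma_meas = [0.30, 0.25, 0.01]
g_hat = weighted_field_shear(e, sigma_meas, sigma_sn=0.25)
```

As sigma_sn → 0 the best-measured galaxies dominate the average, and as sigma_sn → ∞ the weights become equal; these are the two limits explored by the σ_s study in the next section.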
In summary, gfit results are significantly affected by the postage stamp size used for modeling, with small stamp sizes resulting in what we call truncation bias. This (generally positive) truncation bias can offset the negative noise bias that is a natural consequence of using a maximum-likelihood fitting method. The next most interesting factor is the filtering of the catalog to exclude galaxies on the basis of fit quality or fit parameters, with typically a few per cent of galaxies being excluded.

Results for a consistent choice of stamp size suggest differences in ⟨m⟩ between the control and realistic galaxy experiments of order ∆⟨m⟩ ≃ – per cent (greater for ground than for space) due to model bias from realistic galaxy morphology. This conclusion is based on the fact that the galaxy and data properties in these branches are the same, except for the way of representing the light profiles (parametric models vs. HST images). Thus truncation, noise, and other biases should be consistent between the two sets of results. Differences in c_+ for the control and realistic galaxy experiments depend on whether the simulated data represent a space survey or a ground survey.

We note the general point that, using this dataset, we cannot cleanly separate model bias in true isolation, as compounding interplays may exist between model bias, truncation bias, noise bias, and other biases. This would be an interesting subject for future study. For the purpose of controlling for the effects found in this analysis of gfit results, in the general analysis in Sec. 5 we will use a set of gfit submissions with consistent postage stamp sizes (one set for ground, and another for space). These will be the same submissions used in Sec. 4.1.2.

Amalgam@IAP

The Amalgam@IAP analysis pipeline (see Appendix C1) has a significant number of parameters that can change.
These include the postage stamp size, subpixel resolution, and order of interpolation used to combine star images for PSF estimation; the type of filtering of the galaxy catalogs; the modeling window (the maximum allowed region to use for modeling, which was either fixed to the postage stamp size or was permitted to vary with a maximum value equal to the postage stamp size); the use of a regular vs. modified χ² to mitigate the effects of galaxy blends (see Appendix C1); the use of an additional penalty term on the Sérsic index and/or aspect ratio, see Eq. (C3); and the choice of effective shape noise σ_s in the weighting used to combine individual galaxy shape estimates (see Appendix C1.3).

Early in the challenge it was found that increasing the sampling density of both the PSF and the galaxy models (≈ . × on each axis compared to the values that would automatically be set by the regular versions of PSFEx and SExtractor) significantly improved the scores, at the price of increasing the computing time by an order of magnitude.

In RGC, we carried out a multi-factor ANOVA to understand the most important factors determining the performance of the Amalgam@IAP team. Unfortunately, even with nearly 40 submissions, the 8-dimensional parameter space was not sampled well enough to get a clear answer. The results suggest that σ_s was the most important factor determining performance, with the choice of form for the χ² (regular preferred over modified) and the use of a penalty term (penalty on aspect ratio preferred over not) being important with marginal significance.

Given the importance of σ_s, Fig. 6 shows the variation of our metrics with this parameter. As shown in the top panel, Q_c sharply decreases for very small σ_s, and reaches a maximum for σ_s ≈ .
For infinite σ_s (constant weighting), there are two submissions with quite different Q_c values, which we discuss in more detail below.

The decrease in Q_c for very low σ_s is quite interesting. As σ_s approaches zero, the weighting scheme gives a strong preference to very high S/N galaxies. In real data, there is no advantage to giving such a preference, because of shape noise. However, in GREAT3, we have canceled out the shape noise by including 90° rotated pairs, so in principle, a perfect shear estimate for just the two highest S/N galaxies would perfectly determine the shear for the whole field. The low Q_c in this case implies that either the covariance matrix used for the weighting is poorly determined or has some correlation with shear direction, or that the shear estimates for high-S/N galaxies are poor. The high-
S/N galaxies should have little noise bias, but may have model bias due to a mismatch between the input parametric models and the ones fitted by the Amalgam@IAP team. Another possible explanation relates to the adaptive selection of the modeling window size (up to, but not beyond, the size of the input postage stamps). If the algorithm chooses too-small postage stamps for the highest-
S/N galaxies, it could introduce truncation biases as seen in the gfit results (see Sec. 4.1). Since a similar trend in Q_c was seen in CGC, the problem is not plausibly due solely to realistic galaxy morphology. Unfortunately, given the data that we have, we are unable to tease apart these effects.

The other panels in Fig. 6 show the m_i and c_i values as a function of σ_s, to explain the trends in the Q_c plot. For very low σ_s (upweighting the high S/N galaxies), the multiplicative biases can be as bad as −. ± . per cent, with a very high detection significance for the trends in m_i. For constant weighting, the submission with near-zero m_i and c_i includes a penalty term on the aspect ratio, whereas the poorly-performing submission does not (giving a σ change in m_i). In the bottom panel, as σ_s goes from 0.05 up to 1 and finally to ∞ (corresponding to strong S/N upweighting, weighting with a substantial shape noise term, and constant weighting, respectively), c_+ goes from (3. ± .) × 10^−, to consistent with zero, to negative values, (− ± ) × 10^−. The statistical significance of these changes is > σ. This suggests that c_+ for this method is positive (negative) for the high- (low-) S/N galaxies.

We now address the issue of the penalty term on aspect ratio, another parameter of interest that causes highly significant changes in the multiplicative and additive biases, as discussed above. The idea of the penalty term is that for galaxies that have low
S/N and poor resolution, the ellipticity is so poorly determined that there is a very large tail to high ellipticity (which is a manifestation of noise bias). Hence the idea is to penalize high ellipticity values by adding a term to the χ², which will have little effect on high-ellipticity objects with high S/N. This was important

Figure 6.
From top to bottom, we show Q_c, m_i, and c_i for the Amalgam@IAP team submissions as a function of the σ_s used in the weighting scheme, for all submissions in RGC. The target regions are shown as a grey shaded region, within which the vertical axis has a linear scaling; outside of the shaded region, the scaling is logarithmic. Note that the entries shown at σ_s = 10 actually had σ_s = ∞, i.e., completely equal weighting for all galaxies. Multiple submissions with the same σ_s have slight horizontal offsets for clarity. The m_i and c_i panels only show errors on a single quantity, for clarity.

Figure 7.
Fitted ellipticity distributions for the Amalgam@IAP team for a good-seeing (blue) and poor-seeing (red) subfield in GREAT3, in the CGC branch. The top (bottom) panel shows the results without (with) a penalty term on aspect ratio.

particularly for fields with poor seeing and/or substantial defocus that enlarged the PSF. An example is shown in Fig. 7. The top panel shows the fitted ellipticity distribution in a good-seeing (blue) and poor-seeing (red) field in GREAT3 without the penalty term, and the bottom panel shows the same when using the penalty term on aspect ratio. The distribution for the poor-seeing image has a pronounced high-ellipticity tail that is nearly removed by the penalty term, yet the shape of the distribution in the good-seeing image is less altered by the addition of this term.

In some sense, the addition of the term to the χ² is equivalent to multiplying the likelihood, i.e., imposing a prior on the ellipticity. It seems that this is a way to remove or reduce noise bias in all fields (with a stronger impact on those that have poor seeing), eliminating the need for explicit calibration factors. For GREAT3, the best value of the shape noise σ_s to use in the weighting scheme and the form of the penalty term to use in the χ² were clearly shown to do an excellent job at shear estimation for the particular p(ε) and galaxy property distributions used here. However, it is unclear whether these results would necessarily be consistently reproducible for other datasets with a different intrinsic p(ε), or those with a p(ε) that correlates with other galaxy properties in a way that is not reproduced here. For this reason, further simulations would be needed to evaluate the generality of this procedure for real data with a variety of properties, and to confirm that the exact σ_s and form of the penalty term give similar results in cancelling out noise bias.

For the Amalgam@IAP team, it was difficult to identify a single group of settings used for all branches.
Instead, four groups of settings with submissions in a few branches were identified:

1. χ² penalty term on aspect ratio θ_aspect; σ_s = 0. for weighting.
2. χ² penalty term on θ_aspect; uniform weighting (σ_s = ∞).
3. χ² penalty term on θ_aspect and Sérsic index n_s; σ_s = 0. .
4. No priors on model parameters; σ_s = 0. .

The settings also differed in minor ways that have little impact on performance.

Fig. 8 shows histograms of the Q_c, ⟨m⟩, and c_+ values for all Amalgam@IAP submissions in all constant shear branches, also indicating with points those submissions with the aforementioned consistent settings. As shown, for branches that include submissions with setting 1, that submission is typically among the best in the branch, with RSC being the exception to this rule. This is consistent with our previous results indicating that σ_s ∼ . and the χ² penalty term on aspect ratio were important factors affecting the results.

Comparing the results for settings 1 and 2 in RGC, the only constant shear branch to include submissions with both settings, their performance seems quite consistent. However, in variable shear branches (not shown), setting 1 leads to better performance, confirming the importance of the weight including both shape and measurement noise rather than using equal weighting.

Comparing settings 1 and 3, we see that for CGC, setting 1 leads to better performance due to a substantially smaller calibration bias. This suggests that the use of a Sérsic n penalty term is unimportant or perhaps even harmful, though its impact is somewhat smaller on variable shear branches (not shown). This finding may simply reflect the fact that the variable shear metric is less sensitive to multiplicative bias m.

Finally, settings 1 and 4 gave similar results, with comparable m_i and c_i.
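The effect of such a penalty can be seen in a one-dimensional toy model: adding a quadratic penalty λe² to the χ² shrinks the maximum a posteriori ellipticity by a factor that grows with the measurement noise, suppressing the noisy high-ellipticity tail while leaving high-S/N objects nearly untouched. All numbers below are illustrative, not the Amalgam@IAP values, and the closed form follows from minimizing the quadratic:

```python
def penalized_ellipticity(e_obs, sigma, lam):
    """Minimize (e - e_obs)^2 / sigma^2 + lam * e^2 over e (closed form)."""
    return e_obs / (1.0 + lam * sigma ** 2)

e_obs = 0.6                     # apparent ellipticity, inflated by noise
e_bright = penalized_ellipticity(e_obs, sigma=0.02, lam=5.0)  # barely shifted
e_faint = penalized_ellipticity(e_obs, sigma=0.5, lam=5.0)    # strongly shrunk
```

Because the shrinkage scales with σ², the penalty acts most strongly exactly where noise bias is worst, which is consistent with its pronounced effect on the poor-seeing fields in Fig. 7.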
While the use of penalty terms on θ_aspect is helpful, that is especially true for higher σ_s than the value used here.

In general, the results for these fairly chosen sets of submissions are worse in CGC than in RGC. The primary reason is an average multiplicative bias of ⟨m⟩ = 0. ± . per cent in CGC, while ⟨m⟩ is consistent with zero in RGC. Since the simulation designs in the control and realistic galaxy experiments correspond apart from galaxy morphology, this difference between CGC and RGC suggests a model bias due to realistic galaxy morphology that is of that order. This bias may be canceled out by some other bias in RGC (perhaps noise bias, truncation bias, or residual model bias due to a mismatch between the input and output parametric models). In contrast, the additive systematics for CGC vs. RGC (setting 1) are consistent within the errors. For the space branches, the multiplicative biases differ for RSC and CSC by ⟨m⟩_RSC − ⟨m⟩_CSC = 0. ± . per cent, suggesting that model bias due to realistic galaxy morphology has a similar magnitude for both space and ground data.

Here we summarize the key lessons from the analysis of the Amalgam@IAP results. First, the main factors that determine performance are the magnitude of the shape noise used in the weighting scheme (σ_s) and the use of a penalty term on the aspect ratio to reduce the incidence of spurious highly elliptical, lower S/N and resolution objects. Using the best
Figure 8. Top: Histogram of Q_c values for all submissions from the Amalgam@IAP team for constant shear branches. The colored points indicate submissions that are part of the fair cross-branch comparisons with consistent settings, with the four settings described in the text indicated with different shaped points. Middle, bottom: The same, but for ⟨m⟩ and c_+. The points have horizontal errorbars indicating their statistical uncertainty, and the shaded regions indicate the target values of ⟨m⟩ and c_+. Outliers have been removed from the bottom two panels so that the main part of the distribution can be clearly seen.
choices for these parameters in all branches resulted in overall good performance, though with hints of a model bias for ground and space data due to realistic galaxy morphology that is slightly below a per cent. Also, the strong variation in c_+ with the weighting scheme suggests that the additive systematics are a strong function of the galaxy S/N.

Because of the importance of σ_s and the penalty terms in determining performance, for the overall analysis and comparison with other methods we use a set of submissions with the fiducial value of σ_s and a penalty term on the aspect ratio, with small variations in other, less important parameters.

The MegaLUT team (see Appendix C17) made many submissions with varying choices related to the learning sample generation, the shape measurement, the input parameters for the artificial neural network (ANN), the architecture of the ANN, and finally the rejection of faint or unresolved galaxies. Here we explore the dependence of their results on these choices.

First, we consider the filtering of the catalogs, comparing four submissions to CGC that used the same settings for all parameters except the filtering. The m_i and c_i values for these four submissions are shown in Fig. 9, with the Q_c values indicated in the legend. As shown, the top three options (all with the default filtering for positive flux and increasing profile) give very similar results, regardless of other choices such as rejection based on maximum shear values, or clipping large shears (setting them to a maximum value).
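The filtering and clipping options being compared can be sketched schematically. This is a hypothetical illustration only; the thresholds and the helper name are assumptions, not the MegaLUT defaults.

```python
import numpy as np

def filter_and_clip(g1, g2, flux, g_max=3.0, g_clip=None):
    """Reject clearly unphysical measurements, then optionally clip.

    Galaxies with non-positive flux or |g_i| > g_max are dropped (a
    stand-in for the default filtering discussed in the text); if
    g_clip is set, surviving shears with |g| > g_clip are rescaled to
    |g| = g_clip.  All threshold values here are assumed placeholders.
    """
    g1, g2, flux = map(np.asarray, (g1, g2, flux))
    keep = (flux > 0) & (np.abs(g1) < g_max) & (np.abs(g2) < g_max)
    g1, g2 = g1[keep], g2[keep]
    if g_clip is not None:
        g = np.hypot(g1, g2)
        scale = np.where(g > g_clip, g_clip / np.maximum(g, 1e-12), 1.0)
        g1, g2 = g1 * scale, g2 * scale
    return g1, g2
```

As the text goes on to show, the rejection step mainly protects against catastrophic outliers, while aggressive clipping systematically shrinks the recovered shear and so biases the calibration low.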
However, removing the default filtering and rejecting only galaxies with |g_1| or |g_2| above a threshold gives a significantly worse Q_c. This is due to both m_i and c_+ increasing in magnitude. This submission is only mildly correlated with the others, and the changes in m_i and c_i are only marginally significant. On a minor note, there is a weak hint of non-zero c_1 and c_2, which (if real) may reflect an asymmetry in the selection criteria. Note that the default filtering option typically removes only a small fraction, at the per-cent level, of the galaxies.

The next test was on CSC, comparing two otherwise similar submissions with different choices at the training stage. The training sample shears were uniformly distributed in |g|, with two different upper limits. The Q_c values differed between the two, primarily because of a larger magnitude of the (negative) calibration bias in the latter case. This change in m is not very statistically significant, which is interesting because it suggests a lack of sensitivity to this aspect of the training.

Also in CSC, we compare two submissions that used different statistics of the image to describe the shape. In one submission, the adaptive moments routines in GalSim were used, effectively fitting the image with an elliptical Gaussian; the other submission used the moments of the autocorrelation function (ACF; van Waerbeke et al. 1997) of the image. (For three variable-shear branches, there were no submissions with the fiducial σ_s. To enable comparison in those branches, the Amalgam@IAP team made submissions after the end of the challenge using the same catalogs as during the challenge, reweighted using the fiducial σ_s.)

Figure 9.
Top: m_i values for four MegaLUT submissions in CGC with different choices for how the catalogs were filtered, but otherwise the same settings. Bottom: c_i values.

The results are shown in Fig. 10. The Q_c values differ because of a difference in the m_i values (the significance is larger than it appears on the plot due to correlations between the submissions). Use of the ACF gives a more negative calibration bias ⟨m⟩ than is obtained without it. Apparently the ACF is not an unbiased way of compressing the information in the image, consistent with what was seen for two methods using the ACF in GREAT10 (Kitching et al. 2012).

A final study performed in CSC relates to other ways of filtering the catalogs after the shear estimates have been made, comparing the results of the default filtering with two other options: excluding small objects, and using convex hull peeling (Eddy 1982). The exclusion of small objects changes m_i and c_i only slightly. However, convex hull peeling gives substantially worse results that are also noisier, with a sharply reduced Q_c and a significant shift in ⟨m⟩.

For RSC, we compared two submissions with different training options. In one case ("half noise"), the training set images had noise at half the level in the GREAT3 images; in the other ("low noise") case, the noise was well below the GREAT3 level. Fig. 11 shows that the latter gives significantly better performance, Q_c = 221 instead of a substantially lower value. The "half noise" case has slightly worse m_i values, and substantially worse additive systematics c_+ than the low-noise case. The increase in m_i with increasing noise in the training sample images could be due to the resulting noisiness in the input features of the ANN training. This noisiness smears out any
Figure 10. Top: m_i values for two MegaLUT submissions in CSC with different methods of measuring galaxy shapes, but otherwise the same settings. Bottom: c_i values.

Figure 11.
Top: m_i values for two MegaLUT submissions in RSC with different choices for how much noise to include in the training sample. Bottom: c_i values.

sharp structures that the ANN regression should fit, leading to biased ANN predictions. We speculate that the effects on c_+ may relate to errors in centroiding that are larger along the PSF direction somehow being amplified if the training sample is also noisy, but this effect requires further study to understand fully.

The final test in RSC relates to the use of clipping of the shears, meaning that galaxies with estimated |g| > g_clip are assigned |g| = g_clip instead of the estimated value. We compare two submissions with different values of g_clip and otherwise similar settings, and find Q_c values of 76 and 105, respectively. While the additive systematics are virtually identical, the submission with stronger clipping has a worse calibration bias (⟨m⟩ is more negative, a significant difference given the high correlation between the submissions). Not surprisingly, aggressive clipping of the shear magnitudes biases the estimated cosmological shear low.

We also show results for a fair cross-branch comparison using similar settings for the training set, filtering, and other parameters of interest. In this case, g in the simulations was uniformly distributed in a disc; the simulations had half the noise of the GREAT3 simulations; galaxies with estimated g_1 or g_2 larger in magnitude than a threshold were rejected, but no other rejection scheme was used; and the shears were clipped to a maximum value. As for the other cross-branch comparisons, we show histograms of all submissions, with points indicating the ones in the fair comparison set, in Fig. 12.

As shown in the top panel of Fig. 12, the MegaLUT submissions in the cross-branch comparison set are typically among their top submissions. The top values of Q_c lie in a lower range than the top values of Q_v.
For all combinations of (experiment, observation type) to which this team submitted, the results for variable shear were better than for constant shear. This may reflect the fact that the best results typically had per-cent-level multiplicative calibration biases, to which Q_c is substantially more sensitive than Q_v. For constant-shear branches, the results for the mixed metric Q_mix were very similar to those for the standard Q_c. Another interesting trend across branches is that, with the exception of RSV, MegaLUT did better in the control experiment than in the realistic galaxy experiment, perhaps reflecting a preference for the parametric models used to generate the training sample (which we explore in detail in Sec. 5.3).

The middle panel of Fig. 12 shows ⟨m⟩ averaged over components. As shown, for both the control and realistic galaxy experiments, the multiplicative bias ⟨m⟩ is typically positive for ground-based data and negative for space-based data. The magnitude of the bias is slightly larger for the realistic galaxy experiment than for the control experiment, with ⟨m⟩_RGC − ⟨m⟩_CGC positive for ground and ⟨m⟩_RSC − ⟨m⟩_CSC negative for space. This may reflect differences in model bias from realistic galaxy morphology.

Finally, the bottom panel of Fig. 12 shows the additive bias c_+. We see a statistically significant difference in additive biases for the control and realistic galaxy experiments, which is partly responsible for the worse performance in the realistic galaxy branches compared to the control branches. This is a manifestation of model bias due to realistic galaxy morphology.

Figure 12. Top: Histograms of Q values (either Q_c or Q_v depending on the branch) for the MegaLUT team. The colored points indicate the submissions that are part of the fair cross-branch comparison with consistent settings. Middle, bottom: The same, but for ⟨m⟩ and c_+ (respectively), which involves using the constant-shear branches only. The points have horizontal errorbars indicating their statistical uncertainty, and the shaded regions indicate the target values of ⟨m⟩ and c_+. Outliers have been removed from the bottom two panels so that the main part of the distribution can be clearly seen.

To summarize the MegaLUT results, we find that good results required identification and rejection of a small fraction of problematic galaxies. Use of the image autocorrelation function led to substantially worse performance than use of the adaptive moments (from a fit to an elliptical Gaussian, using code in GalSim). An attempt to use convex hull peeling led to substantial calibration biases and overall noisiness. Use of training images with a noise level well below (rather than half) that of GREAT3 reduced the additive systematic errors. Finally, clipping the shears aggressively led to negative calibration biases and overall worse performance. The MegaLUT method had overall better performance in the variable-shear branches due to pervasive per-cent-level calibration biases, which hurt their performance preferentially in the constant-shear branches. This multiplicative calibration bias has opposite signs for ground and space data (but similar magnitude). We saw hints of additive and multiplicative model bias due to realistic galaxy morphology, which we explore in more detail in Sec. 5.3.

While using low-noise training data led to improved performance, there was not a fair set of submissions across all branches that used low noise. Thus, for the overall analysis in Sec. 5, we use a set of submissions with half noise. However, it is important to bear in mind that this degrades the performance of the method.

For Fourier_Quad (see Appendix C13), the key difference between submissions relates to the weighting scheme used when combining per-galaxy shear estimates. Three options were used for GREAT3:

• No explicit weighting: since the galaxy light profile amplitudes scale with the flux, if this is not divided out, a lack of explicit weighting corresponds to implicit weighting by (S/N)². In GREAT3 this improves performance given our use of shape noise cancellation, in a way that is not viable in real data, where shape noise does not cancel.

• Identifying the pairs of 90°-rotated galaxies and dividing the G_1, G_2, and N for each object (see Appendix C13) in the pair by the squared galaxy flux. This weighting scheme is also not viable for real data.
GalSim ). An attempt to use convex hull peel-ing led to substantial calibration biases and overall noisi-ness. Use of training images with / (rather than / )the noise level of GREAT3 reduced the additive system-atic errors. Finally, clipping the shears substantially (to amaximum value of | g | = 0 . ) led to negative calibration bi-ases and overall worse performance. The MegaLUT methodhad overall better performance in variable-shear branchesdue to the pervasive ∼ per cent calibration biases, whichhurt their performance preferentially on the constant-shearbranches. This multiplicative calibration bias has oppositesigns for ground and space data (but similar magnitude). Wesaw hints of additive and multiplicative model bias due torealistic galaxy morphology, which we will explore in moredetail in Sec. 5.3.While using low noise training data led to improved per-formance, there was not a fair set of submissions across allbranches that used low noise. Thus, for the overall analysisin Sec. 5, we will use a set of submissions with half noise.However, it is important to bear in mind that this degradesthe performance of the method. For Fourier_Quad (see Appendix C13), the key differencebetween submissions relates to the weighting scheme usedwhen combining per-galaxy shear estimates. Three optionswere used for GREAT3: • No explicit weighting: Since the galaxy light profile am-plitudes scale with the flux, if this is not divided out, a lackof explicit weighting corresponds to implicit weighting by ( S/N ) . In GREAT3 this improves performance given ouruse of shape noise cancellation, in a way that is not viablein real data where shape noise does not cancel. • Identifying the pairs of 90 ◦ -degree rotated galaxiesand dividing the G , G and N for each object (see Ap-pendix C13) in the pair by the squared galaxy flux. Thisweighting scheme is also not viable for real data. 
• Dividing the power spectrum of the galaxy image by thesquare of the galaxy flux, which corresponds to effectivelyunweighted per-galaxy shear estimates .For the constant-shear branches, higher scores wereachieved using the first weighting scheme, followed closelyby the second. For example, in CGC, the top Q c scores us-ing the first two weighting schemes were 1202 and 1122, re-spectively; in RGC, 888 and 764; in CSC, 1318 and 1245; inRSC, 1919 and 1726. Clearly the performance was excellent c (cid:13) , 000–000 Mandelbaum, Rowe, et al. with both weighting schemes, with m i and c i values at ornear the target range. However, since these are not a viableapproach in real data, all comparisons with other methods(in Sec. 5) will use the third weighting scheme .For two reasons, Fourier_Quad did not get high scoresin variable-shear branches. First, unlike most of the othermethods, the shear estimators of Fourier_Quad do not di-rectly correspond to galaxy ellipticities, so the method doesnot get the full advantage of having zero intrinsic E -modecorrelation in variable-shear branches. Second, the way ofcalculating shear correlation functions in Fourier_Quad isstill sub-optimal, as described in App. C13. Since we wishto use results that correspond to what would be used in realdata, we do not use their variable-shear submissions for ouroverall analysis. For the sFIT team (see Appendix C21), multiple submis-sions in each branch reflect more complete or sophisticatedsets of simulations from which to derive multiplicative andadditive calibration factors to apply to per-galaxy shear es-timates. Thus, it is generally the case that the most fairsubmission to use in each branch is the one that was sub-mitted last, except in a few branches with some experimentalsubmissions at the end.However, comparing the results for individual submis-sions within a branch provides information about the sizes ofvarious biases. 
For example, in CGC, the Q_c value changed from 579 to 974 from the first to the last submission. The initial attempt applied a calibration based on simulations that approximately matched the distributions of size, Sérsic index, and noise level, but with Gaussian PSFs rather than the real PSFs. Despite the simplifications in the initial simulations used to derive the calibration factors, the best-fitting m_i values were modest in each component, and c_+ was consistent with zero. It is likely that the calibration correction in this branch is dominated by noise bias corrections. Later improvements involved oversampling the Sérsic profiles, a better PSF model (a double Gaussian, which is still not as complex as the real PSF model), and an improved Sérsic n distribution based on CSC, which primarily improved the score by reducing the multiplicative bias. (Due to the computational expense of rendering images with a full optics and atmospheric PSF, the simulations used by the sFIT team to derive the calibrations for the ground branches did not use the full PSF model.) For the final submission, the average multiplicative calibration factor over all the subfields (with a different calibration value depending on the PSF FWHM) and the typical additive bias correction (which depends on the PSF FWHM and its ellipticity) were both modest. The final results in CGV, with Q_v = 841, resulted from directly applying the calibration factors from the final submission to CGC, as is appropriate given the similarities in branch design.

In CSC, the initial basic calibration (derived in a rough way as for CGC) led to Q_c = 698. Further iterations involved narrowing the distributions of Sérsic n and S/N (because the original ones, from fits to the GREAT3 data, had an unphysical tail due to noise), and ultimately achieved Q_c = 920. The processing used a reduced postage stamp size, not the full one, which should result in truncation bias as in Sec. 4.1. However, since the calibration simulations also used small postage stamps, the truncation bias should be automatically corrected. The magnitude of the total multiplicative bias correction for this final submission was approximately 1.02, with a small additive bias correction. (The Fourier_Quad submissions with the third weighting scheme, discussed above, were made after the end of the challenge, but in the interest of making a fair comparison with other methods, we use them.)

In the realistic galaxy experiment, we first consider RGC. Interestingly, the first submission (with Q_c = 305) used calibrations derived from simulations with real galaxy images in GalSim. However, the next attempt directly used the calibrations from CGC, which do not include realistic galaxy morphology, and achieved Q_c = 806. This change tells us that for the sFIT method of fitting Sérsic profiles, the model bias due to realistic galaxy morphology is not very large in ground-based data, because it is in principle uncorrected in these results. After modifying the simulation inputs to better match the p(ε) and size distribution in RGC, the results improved further, with small ⟨m⟩ and c_+. This suggests that residual model bias due to realistic galaxy morphology is only important at the 10^-3 level for this method, compared to 10^-2 for the methods discussed previously. The best-scoring submissions in RGV used the calibration from RGC.

In RSC, interestingly, simulations based on real galaxy images were necessary to improve Q_c substantially. Use of COSMOS images led to an immediate boost of Q_c in the first attempt, a statistically significant change arising from a shift in ⟨m⟩ with nearly the same additive bias. The significance of the change in ⟨m⟩ is high due to the very high correlation between the submissions.
This suggests a statistically significant, sub-per-cent model bias due to realistic galaxy morphology for this method in space data. Further attempts to improve the simulated p(ε) to match the GREAT3 simulations led to additional improvements, to Q_c = 825, with the additive bias remaining unchanged. One possible cause for this residual bias is that the calibration simulations did not use a fully realistic PSF, which could result in slightly incorrect additive bias corrections.

As described in Appendix C15, the MBI team made submissions using a few variations of their method. For the Optimal Tractor and Sample Tractor variants, they used the maximum-likelihood estimate of the lensed ellipticity and the average of samples from the posterior PDFs (respectively) to derive the mean shear for the field, typically with similar performance. For example, in CGC, the scores for the Optimal Tractor submissions were 15 and 53, reflecting significant multiplicative biases and non-negligible additive systematics. The results for the Sample Tractor submissions were in the same range. The results in RGC for these two cases were worse than in CGC.

However, hierarchically inferring the intrinsic ellipticity distribution, using importance sampling from the posterior PDF for the mean shear with a Gaussian p(ε) in each component, improved the scores substantially, with a reduction in the additive systematics to within the target range. The exception to this trend is CSC, where the use of hierarchical inference did not yield significant improvement (Q_c scores were typically in the range 90–200 regardless of method).
However, there the assumption that the PSF can be described as a sum of three Gaussian components is more dubious than in the ground branches, so PSF modelling may be the key limitation in that branch.

The results in the ground branches for the Important Tractor (hierarchical inference) submissions suggest that this new method may indeed be able to reduce some intrinsic limitations of maximum-likelihood fitting methods (e.g., noise bias). Noise bias primarily arises when transforming a probability distribution for a galaxy shape estimate into a single point estimator. Combining the probability distributions for all galaxies (resulting in increased S/N) and applying a hierarchically inferred prior p(ε) yields improved results.

The submissions from MBI included several variants of the hierarchical inference. The first, called "multi-baling" (hierarchically inferring the p(ε) common to five subfields), led to some improvements in scores, up to a factor of two. In contrast, using the deep fields to infer the p(ε) did not result in an improvement in Q_c over hierarchical inference assuming an uninformative hyper-prior. Finally, the MBI team made submissions with informative prior PDFs on the lensing shear, with four different values that seem to bracket a peak in Q_c in the CGC branch. The highest Q_c scores obtained this way (in CGC and MGC) were around a factor of four higher than that for an asserted uniform prior PDF for the shear components. For their wide, default, narrow, and narrower assumed values of σ_g, the Q_c values and multiplicative biases varied systematically, with the multiplicative bias changing sign as the assumed prior narrowed. In GREAT3 constant-shear simulations, the true p_true(|g|) ∝ |g| (see App. A2), whereas the MBI team used Rayleigh distributions.
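The shape mismatch between the two distributions can be illustrated by sampling both. This is a toy comparison; the truncation radius g_max and the mean-matching convention are assumptions made for illustration, not the challenge or MBI values.

```python
import numpy as np

rng = np.random.default_rng(0)
g_max = 0.05  # assumed truncation radius for illustration

# p(|g|) proportional to |g| on [0, g_max] (uniform in a disc):
# inverse-CDF sampling with CDF(x) = (x / g_max)**2, so x = g_max*sqrt(u).
u = rng.random(200_000)
g_true = g_max * np.sqrt(u)

# A Rayleigh distribution matched to the same mean, for comparison
# (Rayleigh mean = sigma * sqrt(pi/2)).
mean = g_true.mean()  # analytically (2/3) * g_max
sigma = mean / np.sqrt(np.pi / 2.0)
g_rayleigh = rng.rayleigh(sigma, 200_000)

# Matched means, but different shapes: the true distribution cuts off
# sharply at g_max, while the Rayleigh has an unbounded tail.
print(mean, g_rayleigh.mean(), (g_rayleigh > g_max).mean())
```

Even with matched means, a substantial fraction of the Rayleigh probability mass lies beyond the hard cutoff of the true distribution, which is the kind of shape mismatch that makes a narrow or wide Rayleigh prior perform poorly here.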
Their "wide" and "narrower" distributions are particularly mismatched in shape to the true one, so the poor Q_c scores are not surprising.

After the challenge, the MBI team investigated inferring the optimal value of σ_g directly from the data (as opposed to from Q_c). This yields a factor of two improvement over an asserted uniform prior PDF for the shear components. It is unclear how much better one can do in this way on GREAT3 simulations because of the unusual p(|g|), which differs from the functional form chosen by the MBI team (a form that makes sense for real data).

For the overall analysis in Sec. 5, we use the MBI results with hierarchical inference. While multi-baling and using the deep fields to get the p(ε) may become helpful in future, they were not fully explored in GREAT3, so we do not use them for the overall comparison.

The COGS team made a number of submissions, using the im3shape algorithm (Zuntz et al. 2013), that are described in Appendix C7. The submissions that used input settings and methodology suitable for scientific analysis are labelled
Figure 13. Averaged multiplicative bias ⟨m⟩ = (m_+ + m_×)/2 for COGS submissions to CGC and RGC, under differing schemes for the removal of noise bias (see Sec. 4.7).

u7, c1, c2, and c3. The labels c1–c3 denote three different schemes used to calibrate for the multiplicative biases that are expected in maximum-likelihood shape estimation. No correction was applied for additive bias.

In Sec. 5 and thereafter, where we wish to draw fair comparisons between branches and between methods, only the COGS submissions that used the c3 calibration are used. This choice is made because c3 comes closest to the approach that would be adopted when applying im3shape to real data (see Appendix C7).

The different submissions make it possible to test for the effect of different choices made in the noise bias calibrations, and to test for model bias due to realistic galaxy morphology by comparing CGC and RGC. Fig. 13 shows the significant impact of the noise bias calibrations on ⟨m⟩ = (m_+ + m_×)/2 for the COGS submissions to CGC and RGC. The c3 calibration, derived from the deep data in CGV but with some outlier rejection in the deep-field fits (see Appendix C7), controls the multiplicative biases in CGC to within the statistical uncertainties. For u7, i.e. without any attempt to calibrate the multiplicative bias, we find ⟨m⟩ of approximately 2 per cent for CGC and 1 per cent for RGC. These results represent a combination of noise, model, and other biases in the uncalibrated COGS submissions.

For each pair of submissions grouped by calibration strategy, we also find a consistent difference in the level of multiplicative bias between the CGC and RGC results, with ⟨m⟩_RGC − ⟨m⟩_CGC negative and sub-per-cent in magnitude. This difference in ⟨m⟩ can be interpreted as a difference in model bias due to realistic galaxy morphology, for the im3shape galaxy model chosen by the COGS team. It is similar in magnitude to the effect found for the other model-fitting methods.

Several teams identified images with particularly challenging PSFs.
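The calibration strategies discussed in this and the preceding subsections (sFIT's simulation-derived factors, COGS's c1–c3 schemes) all amount to undoing an estimated linear bias in the measured shear. A generic sketch, assuming the standard convention g_obs = (1 + m) g_true + c; the bias values below are placeholders, not the measured ones.

```python
import numpy as np

def calibrate_shear(g_obs, m, c):
    """Undo a linear shear bias g_obs = (1 + m) * g_true + c.

    In practice m and c would come from calibration simulations or
    deep-field fits; the values used below are illustrative only.
    """
    return (np.asarray(g_obs, dtype=float) - c) / (1.0 + m)

g_true = 0.03
m, c = 0.02, 1e-3                  # placeholder multiplicative/additive biases
g_obs = (1.0 + m) * g_true + c     # simulate the biased measurement
print(calibrate_shear(g_obs, m, c))
```

The correction is exact only if m and c are estimated on simulations that match the data; any mismatch (e.g. in PSF complexity or galaxy morphology) leaves a residual bias, which is the model-bias effect quantified throughout this section.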
Here we consider the role played by outliers in the challenge results, given that our metrics Q_c and Q_v (Sec. 2.3) allow teams to weight galaxies within subfields, but not to assign weights to the per-field shears before construction of the metric. The rationale behind this choice was that, with each subfield having fairly similar pixel noise and the same number of galaxies, the shear statistics should be determined equally well for each subfield. However, if a method has a systematic problem with a subfield, the team cannot indicate this by giving it a low (or zero) weight, unlike in real data, where they could choose to discard a subset of the data.

As an example, the top panel of Fig. 14 shows the per-subfield submitted shear from RGC for the "ess" team. The plotted quantities are used to derive the m_i and c_i for the Q_c metric. This team has several subfields with highly discrepant submitted shears, well beyond the expected standard deviation per subfield. This branch is the worst case for this team, which had fewer outliers in CGC.

To explore the effect of outliers, we carried out a systematic test for outliers in the submitted shears, identifying (for each branch and team) those fields for which the submitted shears were discrepant by more than a threshold |Δg| in more than 75 per cent of submissions. In general, these subfields were consistent across methods; that is, if two teams had a certain number of outlier fields in a given branch, they were almost always the same set of subfields. Those subfields were commonly ones with higher values of the PSF defocus (or, for the "ess" team, higher values of trefoil); we defer a more detailed exploration of the impact of defocus on shear systematics to Sec. 5.5. For the "ess" team, the reason for the outliers shown in Fig. 14 is fairly clear: they used a sum of three Gaussian components to describe the PSF, which makes it particularly difficult to model PSFs with defocus or trefoil.
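The per-subfield fit that yields m_i and c_i (regressing submitted-minus-true shear against true shear across subfields) can be written compactly. A minimal sketch with a toy submission; the function name and the toy bias values are illustrative, not from the challenge.

```python
import numpy as np

def shear_bias(g_sub, g_true):
    """Fit g_sub - g_true = m * g_true + c over subfields.

    Returns (m, c); an unbiased method gives m = c = 0 in this
    convention.
    """
    g_sub = np.asarray(g_sub, dtype=float)
    g_true = np.asarray(g_true, dtype=float)
    m, c = np.polyfit(g_true, g_sub - g_true, 1)
    return m, c

# A toy submission with 2 per cent multiplicative and 1e-3 additive bias.
rng = np.random.default_rng(1)
g_true = rng.uniform(-0.05, 0.05, 200)
g_sub = 1.02 * g_true + 1e-3
m, c = shear_bias(g_sub, g_true)
print(m, c)
```

Because every subfield enters the regression with equal weight, a handful of strongly discrepant subfields can pull both the slope and the intercept, which is exactly the outlier sensitivity examined next.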
In contrast, the middle panel of that figure shows a comparable plot for the Amalgam@IAP team, which modeled the full PSF and does not show significant outliers. Finally, the bottom panel shows images of the PSF for the eleven subfields in RGC for which the "ess" results were seriously discrepant. As shown, in about three cases the PSF has the characteristic "donut" shape of highly out-of-focus images; such data would likely be eliminated from a shear analysis of a real dataset. These subfields were problematic for several other methods. In other subfields, there is a triangular shape characteristic of trefoil, which seems to have been less problematic for the other methods, which have a more flexible representation of the PSF.

For those teams and branches for which outlier fields were identified, we recalculated the m_i, c_i, and Q_c values after excluding the outlier fields. We found that while the errors from the linear regression on m_i and c_i decreased substantially (sometimes by tens of per cent after excluding only a few per cent of the subfields), the changes in m_i, c_i, and Q_c were in general not coherent. In many cases, results for different submissions from the same team in the same branch would change in different directions. There were three combinations of branch and team with coherent changes in the results after excluding outliers (in two cases the results were almost always worse, and in one case they were almost always better).

Several other teams had problems with outliers that were not identified by the previous test. (Identifying them as outliers would require a smaller threshold on |Δg| and on the number of times a field must be discrepant for it to officially be called an outlier.) These include MBI, which (like ess) used a sum of Gaussians to describe the PSF, and MegaLUT. We recalculated the results for these teams after excluding the fields with the 10 per cent worst defocus in CGC.
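The outlier test described above, which flags subfields discrepant in more than 75 per cent of a team's submissions, can be sketched as follows. The |Δg| threshold used here is an assumed placeholder, as is the function name.

```python
import numpy as np

def flag_outlier_subfields(delta_g, delta_max=0.05, frac=0.75):
    """Flag subfields that are discrepant in most submissions.

    delta_g: array of shape (n_submissions, n_subfields) holding
    |g_submitted - g_true| per subfield.  A subfield is flagged when
    more than `frac` of submissions exceed `delta_max` there.
    delta_max = 0.05 is an assumed placeholder threshold.
    """
    delta_g = np.asarray(delta_g, dtype=float)
    discrepant = delta_g > delta_max
    return discrepant.mean(axis=0) > frac

# Three toy submissions, four subfields; subfield 2 is bad in all of them.
dg = np.array([[0.01, 0.02, 0.30, 0.01],
               [0.02, 0.01, 0.25, 0.06],
               [0.01, 0.01, 0.40, 0.01]])
print(flag_outlier_subfields(dg))
```

Requiring agreement across most submissions, rather than flagging any single large residual, is what makes the flagged subfields consistent across teams in the way described in the text.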
For MBI, the results of excluding the subfields with the worst defocus did have a coherent effect, but with opposite signs in the control and realistic galaxy experiments, increasing Q_c in the former by as much as a factor of two and

Figure 14.
Top: The difference between the submitted and true shears vs. the true shears for each subfield in RGC, for both shear components. The best-fitting line is also shown on the plot, along with the m, c, and Q_c values. These results are for the best submission from the "ess" team. Middle: The same, for the best submission in that branch from Amalgam@IAP. Bottom: Images of the PSFs for the 11 subfields for which the "ess" results are discrepant at the level of |Δγ| above the threshold in at least one shear component in more than 75 per cent of their submissions. The subfield indices are shown on the plot. The images are shown with a self-consistent linear flux scaling and with the total PSF flux normalized to 1, so subfields with worse seeing will generally have a lower peak flux value.
lowering it in the latter by a similar amount. We speculate that the difficulty in modeling these PSFs may not lead to a systematic overall effect because the hierarchical inference of p(ε) might partially compensate for imperfect PSF model fits by adjusting the galaxy model fits accordingly. For MegaLUT, the changes in the results after excluding the high-defocus subfields were substantially smaller than for MBI.

Due to the generally inconclusive results of excluding outliers, with no team showing a strong trend towards improved overall results, for the rest of this paper we do not exclude outlier fields. However, in Appendix D we also tabulate the outlier-rejected estimates of ⟨m⟩ and c_+ for the ess, MBI, and MegaLUT submissions, these being the teams most affected by outliers.

We note that in future challenges it may be a good idea to permit participants to assign weights to their submitted per-subfield shears, so that they can indicate regimes in which their PSF modeling or shear estimation does not work. Our results also suggest that PSF modeling with a low-order decomposition into sums of Gaussians may be inadequate to describe realistic PSFs, and can significantly affect the shear estimates.

In this section, we present results for the control and realistic galaxy experiments for all teams.
To avoid showing many submissions from each team in each branch, we adopt a fair and consistent way to select a single submission per branch from each team. For the teams discussed in Sec. 4, we have already stated which submissions are used here. For the remaining teams, the selection was done as follows:

• FDNT: We use FDNT v1.3, with a self-consistent set of resolution and SNR cuts (submissions with names that include "r12_sn15").
• E-HOLICs: We use their "snfixed200" submissions, which have a self-consistent set of noise bias corrections.
• MaltaOx: We use the best results for LensFit with oversampled PSFs and self-calibration included.
• ess: We only use their RGC results, with the priors on p(ε) derived from the deep fields (submission name "nfit-rgc-06-nfit-flags-02").
• CMU experimenters: Only one submission per branch.
• CEA_denoise, MetaCalibration, EPFL_MLP / EPFL_MLP_FIT, EPFL_KSB, EPFL_HNN, Wentao Luo: Best submission in each branch.
• GREAT3-EC (or re-Gaussianization): These results used the shear estimation example script described in Appendix B. As noted there, for several reasons the results are not science-quality shear estimates, and therefore they have no bearing on science papers that use this algorithm. However, since it is a stable algorithm in the public domain, and one of the few moments-based methods, we include it in this section to provide a basic point of comparison.
Figure 15. Q_c (top) and Q_v (bottom) for constant- and variable-shear branches in the control and realistic galaxy experiments. The errorbars show the possible range of Q values for a submission with shear calibration biases that would nominally give a particular Q value. As shown, the sizes of these ranges depend strongly on Q, and are smaller for space than for ground branches.

Results for the following teams are not shown in this section: miyatake-test (for reasons described in Sec. 3), BAMPenn, and HSC/LSST-HSM. The BAMPenn results included some bugs that mean the results do not correctly reflect the real performance of the method. The HSC/LSST-HSM submissions used the HSC/LSST software pipeline with the same shear estimation method as in the GREAT3 example scripts, purely as a sanity check of the pipeline.

Q results

In this subsection, we present the Q results for all teams. Fig. 15 shows Q_c and Q_v for the control and realistic galaxy experiments. Several trends from Sec. 4 are evident here. For example, the results for sFIT are quite consistent across all branches shown here. The MegaLUT results are consistently better for variable shear than for constant shear, presumably because of a low-level m-type bias, to which Q_c is more sensitive than Q_v. The results for Amalgam@IAP and CEA-EPFL are good in many branches, but exhibit significant fluctuations due to partial cancellations of biases. The results for Fourier_Quad with a realistic weighting scheme are quite good, but degraded compared to the results with the unrealistic weighting schemes.

The errorbars in Fig. 15 show that for lower Q values, the uncertainty in Q is very small. However, near the target Q values, small uncertainties in m and c become large uncertainties in Q. These errorbars are quite non-Gaussian, so, for example, the difference between Q = 500 and larger values for the control space branches is significantly greater than the errorbars on the plot would suggest. It is apparent that in many branches, 2–3 teams performed well enough that the differences between their Q values (and between the target of ∼1000) are not statistically significant.

One basic question is whether the results in the constant and variable shear branches are consistent. We cannot directly compare Q_c and Q_v, because they respond to systematic errors in different ways. However, for a given constant-shear submission, we can use the recovered m and c values to predict Q_v by simulating variable-shear submissions with those m and c, and then checking their Q_v. Comparing the predicted Q_v with the actual one (for the same experiment and observation type) is a valid consistency check. We show this comparison for CGC and CGV in Fig. 16, with a reasonable level of consistency within the relatively large errors on the Q_v, and at most a marginally significant discrepancy for one team. The plots for the other experiments and observation types show similar constant vs. variable shear consistency.

Figure 16. Comparison between the Q_v predicted from the constant-shear branch results (CGC) and the actual Q_v results for variable shear (CGV).

Figure 17. Multiplicative and additive biases for constant-shear branches in the control (left) and realistic galaxy (right) experiments, for ground (top) and space (bottom) branches. For each branch, we show the averaged (over components) multiplicative bias ⟨m⟩ vs. c+, the additive bias term defined in the coordinate system set by the PSF anisotropy. The axes are linear within the target region (shaded grey) and logarithmic outside that region.

This section will focus on Fig. 17, which shows the multiplicative and additive shear biases (m and c) for the constant-shear branches in the control and realistic galaxy experiments. All m and c values are also tabulated in Appendix D. Unlike Q_c, m and c have well-understood errorbars. On these plots, the errorbars are different sizes for different methods. In some cases, this is only an apparent difference (due to the mixed linear and logarithmic axes), but there is some variation in the scatter in the shears that we will explore in Sec. 5.6.

We begin by discussing the top left panel of Fig. 17, which shows ⟨m⟩ (averaged over components) vs. c+ for CGC. Not surprisingly, the teams that are located near the center of this plot (small |m| and |c|) are the ones with high Q_c factors for this branch (Fig. 15). A few methods (COGS, MegaLUT, MetaCalibration) are notable in having multiplicative biases consistent with being in the target region, but highly significant detections of additive bias. Both COGS and MetaCalibration include multiplicative bias corrections, but no additive bias corrections were implemented by the end of the challenge period.

Comparing the left and right sides of Fig. 17 would reveal the impact of realistic galaxy morphology. However, to facilitate an easier comparison, Fig. 18 explicitly compares the ⟨m⟩ (averaged over components) and c+ values for the control vs. realistic galaxy experiments, with results tabulated in Table D3. For ground-based simulations, the ⟨m⟩ comparison is in the top left panel. Many methods are consistent with the 1:1 line, meaning that the calibration bias does not show any detectable impact from realistic galaxy morphology. Moderate differences in model bias due to realistic galaxy morphology can be seen for many teams, with typically a ∼1 per cent impact on the multiplicative calibration biases, although the sign of the change in ⟨m⟩ depended on the method.

Figure 18. Left and right columns show results for ⟨m⟩ (top; averaged over components) and c+ (bottom) for ground and space branches, respectively. Each panel compares results for the control vs. realistic galaxy experiments. The axes are linear within the target region (shaded grey) and logarithmic outside that region. The black dashed line is the 1:1 line.

The top right panel of Fig. 18 shows how ⟨m⟩ changes from the control to the realistic galaxy experiment for space-based simulations. Again, some methods exhibit no significant model bias due to realistic galaxy morphology (but note that sFIT included this effect in their simulations, and explicitly calibrated it out), while others show per-cent-level calibration changes. The bottom left panel of Fig. 18 shows c+ for CGC vs. RGC, with everything from complete consistency to strong differences in c+ in these branches, implying that realistic galaxy morphology can in some cases cause additive biases. Finally, in the bottom right panel of Fig. 18, the c+ are consistent between the control and realistic galaxy experiments for space-based simulations for most methods. It seems that for space simulations, removing the PSF anisotropy is similarly difficult for both parametric and realistic galaxy models.

Figure 19. Shear biases for CGC, similar to Fig. 17 but using m1 and m2 (defined using the pixel coordinate system).

Comparing the top and bottom rows of Fig. 17 reveals the effects of using a space-based PSF rather than a ground-based PSF. Note that the numerical values of the c+ and ⟨m⟩ changes are shown in Table D3. Focusing first on the control experiment (left side), the c+ values shifted to the right (more positive) in space data for the majority of the methods. Note that if c+ scales linearly with the PSF ellipticity (a model that we will validate in Sec. 5.4), then c+ for the space branches should be larger than in the ground branches by the corresponding ratio of mean PSF ellipticities.
This may explain the changes in c+ for several teams, but not all, implying that in some cases the additive systematics have some additional dependence on the form of the PSF beyond its ellipticity.

Comparing the multiplicative biases for CGC and CSC, they are either statistically consistent between space and ground or more negative for the space branches; curiously, they did not become more positive for any team. Given the wide diversity of methods, and the apparent lack of commonality between many that exhibit similar behavior between ground and space data, it is difficult to draw conclusions, but the pattern is indeed interesting.

These results were for the control experiment. If we compare RGC vs. RSC (right panels), we see that the differences in c+ and ⟨m⟩ between space and ground simulations in the realistic galaxy experiment are similar to what was seen for the control experiment for all teams except CEA_denoise. This finding suggests that the effect of the type of PSF (space vs. ground) on additive and multiplicative biases does not typically depend on whether the galaxies have realistic morphology or are simple parametric models.

The top left panel of Fig. 17 shows m vs. c for CGC in the coordinate system defined by the PSF anisotropy, whereas Fig. 19 shows the same in the pixel coordinate system. In a few cases (e.g., CEA-EPFL, Fourier_Quad, and MetaCalibration to some degree, though it is noisier), m1 and m2 have opposite signs, and thus average out to something closer to zero for m+ and m× (after rotating to the PSF anisotropy coordinate frame), resulting in Q_mix < Q_c.

In this section, we explore the linear model for shear systematics, Eq. (7), by considering some alternative models of shear measurement bias. It is commonly assumed that the main source of c-type biases is leakage from PSF anisotropy into the galaxy shear estimates, which should be proportional to the amplitude of the PSF ellipticity.
However, there are physical models that violate this assumption, and it is not completely obvious that it holds for all methods. If the assumption is correct, we can write an alternative model:

g_i^obs − g_i^true = m_i g_i^true + a_i g_i^PSF    (11)

Here, the a_i prefactors are average values across an entire galaxy population that likely depend on the distribution of SNR, resolution, morphology, and PSF type. In the coordinate system defined by the PSF anisotropy, g+^PSF = |g^PSF| and g×^PSF = 0. We can therefore fit to this new model, and if the additive errors are proportional to the PSF anisotropy, then we should find c+ ∝ a+, where the constant of proportionality is an effective mean |g^PSF| for that branch.

Fig. 20 compares c+ and a+ for CGC (top) and RSC (bottom), though the results are quite similar for CSC and RGC as well. The best-fitting line relating c+ and a+ goes through nearly all the points, indicating that the linear model works well (except for EPFL_HNN) for a wide variety of shear estimation methods. The slopes of the best-fitting lines for CGC, RGC, CSC, and RSC are 0.025, 0.016, 0.039, and 0.037, respectively, corresponding to the effective mean per-branch |g^PSF|.

a+ is essentially the fraction of the PSF anisotropy that leaks into the galaxy shear estimates. For the methods that have c+ within the target region, the a+ values indicate that typically < 1 per cent of the PSF shear contaminates the galaxy shears. Several methods are in the range of 1–10 per cent leakage, and the worst cases involve leakage of tens of per cent. For data with a narrower (wider) range of PSF anisotropies but otherwise similar properties (so that a+ is the same), the additive bias c+ will be better (worse) than is shown here. (Note that the histogram of PSF shears in GREAT3 is in Appendix A3.)

In real data, selection biases that correlate with the PSF direction also induce additive systematics.
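The fits of Eqs. (7) and (11) amount to ordinary linear least squares over per-subfield shears. The following is a minimal synthetic sketch (not the challenge analysis code; all input numbers are invented for illustration) of how the additive offset c from Eq. (7) relates to a times the branch's mean |g^PSF| when the additive bias really is PSF leakage:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-subfield inputs: true shears, PSF ellipticities
# (the "+" component defined by the PSF anisotropy direction), and
# noisy "submitted" shears with known injected biases.
n_sub = 200
g_true = rng.uniform(-0.05, 0.05, n_sub)
g_psf = rng.uniform(0.0, 0.08, n_sub)          # |g_PSF| per subfield
m_in, a_in = 0.02, 0.05                        # injected biases
g_obs = (1 + m_in) * g_true + a_in * g_psf + rng.normal(0, 1e-3, n_sub)

resid = g_obs - g_true

# Eq. (7): resid = m * g_true + c   (constant additive offset c)
A7 = np.column_stack([g_true, np.ones(n_sub)])
m7, c7 = np.linalg.lstsq(A7, resid, rcond=None)[0]

# Eq. (11): resid = m * g_true + a * g_psf   (additive term tracks the PSF)
A11 = np.column_stack([g_true, g_psf])
m11, a11 = np.linalg.lstsq(A11, resid, rcond=None)[0]

# If the additive bias is pure PSF leakage, c from Eq. (7) should be
# close to a times the effective mean |g_PSF| of the branch.
print(m7, c7, m11, a11, a11 * g_psf.mean())
```

With leakage as the only additive effect, c7 tracks a11 multiplied by the mean |g^PSF|, which is the sense in which the slopes quoted above correspond to effective per-branch PSF ellipticities.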
While such selection biases operate at some level in GREAT3, due to different weights being assigned to galaxies depending on their orientation with respect to the PSF, in real data selection biases should be more important given the need to identify galaxies. In that case, this simple linear model may no longer be valid. It seems reasonable that selection biases will cause c+ to scale with |g^PSF|, but it is not obvious that the scaling should be linear.

The success of the simple linear PSF contamination model of Eq. (11) in describing additive bias in GREAT3, evidenced by Fig. 20, is striking. However, we note that the GREAT3 simulations were designed without many effects found in real data that potentially cause additive bias (see Sec. 2.2 for a list) but are not directly related to the PSF. These may cause additive biases to show more complex dependencies in real data.

Another question about the linear model for shear calibration biases is whether these methods have a nonlinear response to shear. This question was already addressed in the STEP2 challenge (Massey et al. 2007a). In that case, the shears were positive in the CCD coordinate system, and the nonlinearity test involved a term proportional to g². In GREAT3, the per-component shears can be positive or negative, so the simplest low-order nonlinear terms are proportional to g³ or sign(g^true) g². We can think of these as being the next order beyond linear of a series expansion of some unknown function representing the shear response. We carried out fits with an additional term defined in either of these two ways, and checked for nonzero prefactors for the nonlinear terms. In general, the results for all methods are consistent with zero. When considering the constant shear branches in the control and realistic galaxy experiments, there are 81 submissions (across all branches and teams) that we use in this section, and therefore 162 fits when we use both shear components.
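The nonlinearity test amounts to adding one extra column to the design matrix of the linear fit. This is a toy sketch with fabricated subfield values, not the actual challenge fits; with a purely linear input response, the recovered nonlinear prefactor should come out consistent with zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical subfield shears with a purely linear response.
n_sub = 200
g_true = rng.uniform(-0.05, 0.05, n_sub)
g_obs = 1.01 * g_true + 2e-4 + rng.normal(0, 1e-3, n_sub)

resid = g_obs - g_true

# Design matrix: linear term (m), constant (c), and one low-order
# nonlinear term, here sign(g) g^2 (g^3 would be the other choice).
A = np.column_stack([g_true, np.ones(n_sub), np.sign(g_true) * g_true**2])
coef, res_ss, *_ = np.linalg.lstsq(A, resid, rcond=None)

# Rough 1-sigma errors from the fit covariance (Gaussian noise assumed)
dof = n_sub - A.shape[1]
sigma2 = res_ss[0] / dof
cov = sigma2 * np.linalg.inv(A.T @ A)
q, q_err = coef[2], np.sqrt(cov[2, 2])
print(f"nonlinear prefactor q = {q:.3f} +/- {q_err:.3f}")
```

Note that sign(g) g² is strongly collinear with g over a small shear range, so the prefactor's uncertainty is inflated; this is another way of seeing why a maximum |g^true| well below the cluster regime limits sensitivity to nonlinear response.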
Regardless of which form we use for the nonlinear term, its prefactor differs from zero at > 2σ for nine of the 162 fits, or 5.6 per cent, which is consistent with what we expect if no methods have a nonlinear response. Moreover, these deviations are not consistently found in any particular team, but occur for a range of teams. We conclude that the GREAT3 results show no sign of a nonlinear shear response for any method. However, given the small maximum value of |g^true| in the simulations, we are not very sensitive to nonlinear shear response, and studies that go into the cluster shear regime may need to redo this test.

Figure 20. For CGC (top) and RSC (bottom), we compare the additive bias c+ in the standard linear bias model, Eq. (7), against a+ for the alternative model in Eq. (11). a+ is a constant of proportionality relating additive shear systematics to the PSF ellipticity. The axes are linear within the target region for additive systematics (shown in grey) and logarithmic outside that region; we use vertical lines to indicate the linear-logarithmic boundary in a+. The best-fitting slope relating c+ and a+ is shown as a dashed magenta line. It only appears curved because we show combined log and linear axes with an unequal aspect ratio.

In this section, we check how the results for each method depend on the PSF properties within the branch. Note that the PSF properties in the control and realistic galaxy experiments are discussed and shown in Appendix A3 and Fig. A3. For this test, we split the subfields within a branch into those with atmospheric PSF FWHM, defocus, or |g^PSF| above and below the median values. Then we refit the submitted shears for those subsets to estimate m_i and c_i values. We can compare the m_i and c_i for those with better vs. worse values of seeing, defocus, and PSF shear, and compare with the overall m_i and c_i for the branch.
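The subfield-splitting test can be sketched as follows. The data and the seeing-dependent bias here are invented for illustration, and the fit_mc helper is hypothetical, not part of the GREAT3 tools:

```python
import numpy as np

def fit_mc(g_true, g_obs):
    """Least-squares fit of the linear bias model g_obs - g_true = m*g_true + c."""
    A = np.column_stack([g_true, np.ones(len(g_true))])
    m, c = np.linalg.lstsq(A, g_obs - g_true, rcond=None)[0]
    return m, c

rng = np.random.default_rng(2)

# Hypothetical per-subfield data with a seeing-dependent multiplicative bias.
n_sub = 200
fwhm = rng.uniform(0.5, 1.0, n_sub)              # atmospheric PSF FWHM (arcsec)
g_true = rng.uniform(-0.05, 0.05, n_sub)
m_true = 0.01 * (fwhm - 0.75) / 0.25             # bias grows with worse seeing
g_obs = (1 + m_true) * g_true + rng.normal(0, 5e-4, n_sub)

# Split the subfields at the median FWHM and refit each half, as in the text.
better = fwhm <= np.median(fwhm)
m_good, c_good = fit_mc(g_true[better], g_obs[better])
m_bad, c_bad = fit_mc(g_true[~better], g_obs[~better])
print(m_good, m_bad)  # the two halves bracket the overall (near-zero) bias
```

The same split-and-refit pattern applies unchanged when the subfields are divided by defocus or |g^PSF| rather than seeing.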
Figure 21. For CGC, we show how ⟨m⟩ (top) and c+ (bottom) change when we split the subfields in the branch into the 50 per cent above and below the median atmospheric PSF FWHM. The axes are linear within the target regions for m and c, and logarithmic outside them, with the target regions shaded in grey. The thick blue arrows point towards the direction of reduced shear systematics, i.e., towards the center. In all panels, any points that represent a significant change in the plotted quantity for better or worse seeing subfields also have an arrow showing how the results have changed compared to using the whole field. The legend gives the Q_c value for the original submission from each team.

We begin in CGC, splitting into samples with better or worse atmospheric PSF FWHM (seeing). The results are in Fig. 21, in which the top panel compares the ⟨m⟩ values for the better and worse seeing halves of the subfields (with numerical values tabulated in Table D4). The teams for which ⟨m⟩ differs for better vs. worse seeing have a more negative (positive) calibration bias for better (worse) seeing.

The bottom panel of Fig. 21 shows that many teams have consistent c+ for better and worse seeing, with the rest having a more strongly positive c+ for the better seeing subfields. The worse c+ values for better seeing subfields may come from the fact that the optical PSF (which is often more elongated than the atmospheric component) dominates. Indeed, the correlation coefficient between the PSF FWHM and |g^PSF| in CGC (RGC) is -0.23 (-0.25), in both cases statistically significant. Thus, the worse seeing subfields have a consistently rounder PSF, which can reduce additive systematics. The results for both the ⟨m⟩ and c+ trends were similar in RGC to what we have shown here for CGC, which is a point that we will revisit in some of our later tests.

Figure 22. For CGC, we compare the additive biases c+ when splitting the subfields into those with defocus above or below the median. The axes are linear within the target region (shaded grey), and logarithmic outside it. The thick blue arrows point in the direction of reduced shear systematics. Any points that represent a significant change compared to the results for the entire branch have an arrow showing that change, as well.

In Fig. 22, we show how c+ in CGC changes when we split at the median absolute value of defocus (with results tabulated in Table D4). The results for many methods exhibit a more strongly nonzero c+ for stronger defocus. It is not surprising that additive systematic errors are worse when out of focus, because defocus amplifies the effect of other aberrations like coma and astigmatism on the PSF (Schechter & Levinson 2011), giving a noticeably more elliptical PSF. Appendix A3 shows that we allowed a relatively wide range of defocus values in the ground branches, which explains why its effects are noticeable despite the fact that the atmospheric PSF is normally thought of as being dominant.

The multiplicative biases m+ and m× (not shown) do not typically change when splitting by defocus, except for sFIT and MetaCalibration, with smaller changes for MBI and Amalgam@IAP. For MBI, the representation of the PSF as the sum of three Gaussians may be the limiting factor in describing out-of-focus PSFs. For sFIT, the problem may arise from the use of simple PSFs (rather than a range of complex PSFs with varying defocus) for the simulations used to calibrate the shears. Explicitly deriving calibrations for different PSFs may ameliorate this problem.
A similar issue is likely at play for MetaCalibration, which derived an average shear response using all subfields, rather than one for each PSF. It is unclear why the calibration changes with defocus for Amalgam@IAP, but it may be because of difficulties in finding a well-defined maximum likelihood for many galaxies in the more strongly defocused cases.

In the space simulations (CSC), splitting by defocus had qualitatively similar effects on the shear systematics as for CGC. However, the shifts are smaller in magnitude for the space simulations, likely because the range of defocus is much smaller for space simulations than for ground ones (see App. A3). Our findings are similar in the realistic galaxy experiment, suggesting that the dependence of shear systematics on defocus is independent of realistic galaxy morphology.

When splitting the subfields by |g^PSF|, the results are consistent with those of Sec. 5.4, where additive systematics were shown to scale linearly with the PSF ellipticity.

Here we explore the effective noise level of the estimated shears. In principle, the galaxy shapes were arranged in a way that cancels out shape noise, so that the dominant source of error in the estimated shears is measurement error due to pixel noise. However, the shape noise cancellation is imperfect at low S/N, so that the submitted shears include some shape noise as well. Fig. 23 shows the per-galaxy, per-component scatter (σ_g) in the estimated shears for CGC, estimated by fitting the model of Eq. (7), finding the scatter in the shear estimates for the subfields, and then scaling by the square root of the number of galaxies per subfield to obtain a per-galaxy value. This scatter thus includes both the measurement error and any residual shape noise due to noise in the weights, which can be seen as an additional manifestation of susceptibility to pixel noise. There is a weak relationship between σ_g and Q_c, with all methods above a moderate Q_c having σ_g within a fairly narrow range. Methods with lower Q_c scores have higher scatter by as much as a factor of 40; the exceptions to this rule are re-Gaussianization and EPFL_KSB, which notably are fairly simple moments-based methods. In a few cases, outliers are an issue, but even with outlier clipping, the trend at low Q_c is quite evident. This figure for RGC looks very similar. For the space simulations, the effective per-galaxy S/N was slightly higher, reducing the σ_g values slightly, but the overall trend is the same.

The straightforward interpretation of these results is that for the methods with higher Q_c, the per-object measurement error is typically subdominant to shape noise, whereas some methods with lower Q_c allow significant leakage of pixel noise into the estimated shears.

For several teams, we carried out catalog-level tests that involve using subsets of the galaxies. For example, we split the galaxies into subsamples with
S/N above and below the median, and likewise for the resolution factor (defined as in Hirata & Seljak 2003 using the adaptive second moments) and for Sérsic index n. These splits use the true (not estimated) values of these parameters, to preserve the shape noise cancellation. (For galaxies that were represented in GREAT3 as a two-component model, a single-component Sérsic fit was used for this split.) The methods used for this test are CEA-EPFL, MegaLUT, Fourier_Quad, re-Gaussianization, and sFIT, which include a range of shear estimation methods. For Fourier_Quad, we re-estimated ensemble shears for the galaxy subsets as in Eq. (C11).

Figure 23. Scatter in the estimated shears (per galaxy and per component) vs. Q_c for each method in CGC. The horizontal line indicates a typical level of shape noise in realistic galaxy samples.

In general, biases such as noise bias depend on both the flux-based S/N and the resolution. Thus, a split by a single galaxy property may not isolate a particular bias. Instead, these splits are a way to estimate how much the shear systematics might change for a particular method when dividing the galaxy sample in a way that changes the mean S/N, resolution, or Sérsic n.

Fig. 24 shows the results for ⟨m⟩ (left) and c+ (right) after dividing the galaxy sample in CGC in these three ways. In each case, we plot the results for the subsamples against each other, so a method that is robust to changes in these quantities would lie on the 1:1 line. Methods that are not on that line must by definition move either to the upper left or lower right. We consider each method in turn.

The ⟨m⟩ and c+ results for CEA-EPFL show only a mild dependence on S/N, but a much greater dependence on resolution and on Sérsic n. MegaLUT has less statistically significant trends, the clearest ones being the change in ⟨m⟩ with S/N and the change in c+ with Sérsic n. The multiplicative bias ⟨m⟩ for Fourier_Quad is quite robust to splitting by any of the three parameters, but c+ shows significant changes for S/N splits, with the change for Sérsic n being less significant. re-Gaussianization exhibits a significant dependence on all of S/N, resolution, and Sérsic n, qualitatively consistent with the findings in Mandelbaum et al. (2012), but little change in c+. (The magnitude of the trends is not consistent with that earlier work, but this could be because of the ways in which the example script used for this test differs from a science-quality measurement; see Appendix B.) Finally, for sFIT, both ⟨m⟩ and c+ change when splitting by all three parameters, though the changes with S/N are marginal in significance. Given that this team explicitly derived calibration factors to remove additive and multiplicative biases from the entire population (not as a function of galaxy properties), these trends are not surprising. There is no reason to expect the calibration factors to be valid for subsamples. This exercise
merely emphasizes the necessity of rederiving them when using subsamples, or even when changing weighting schemes.

If we check in RGC whether the changes in these methods are consistent across the control and realistic galaxy experiments, then we cannot do the comparison for sFIT, due to a lack of catalogs for that submission. However, of the remaining methods used, only the re-Gaussianization results when splitting the galaxy sample are the same in CGC and RGC, which is interesting given the significant model bias due to realistic galaxy morphology seen for this method in Sec. 5.3. The results for CEA-EPFL when splitting by resolution and Sérsic n are the same in RGC as in CGC, but the change in ⟨m⟩ when splitting by S/N has the opposite sign as in CGC. The MegaLUT method shows much stronger trends in both ⟨m⟩ and c+ in RGC when splitting by all three parameters than in CGC. Finally, for Fourier_Quad, the sign of the c+ changes when splitting by resolution and Sérsic n is reversed in RGC compared to CGC.

In CSC, we can check how the use of space simulations changes the results when dividing the galaxy sample (for all but re-Gaussianization, which only has results on ground branches). The CEA-EPFL and sFIT team results show different signs and/or magnitudes of changes in shear systematics when splitting by galaxy properties in CSC vs. in CGC. For MegaLUT, the value of c+ changes when splitting by S/N and resolution more significantly in CSC than in CGC. For Fourier_Quad, the difference in c+ between subsamples in resolution and Sérsic n changes in sign in CSC compared to CGC. These findings suggest that essentially all of the teams considered here have trends in shear systematics with galaxy properties that differ between space and ground data.

In this section we compare quantitatively with the results from GREAT08 (Bridle et al. 2010) and the GREAT10 galaxy challenge (Kitching et al. 2012), to the limited extent that is possible given the different challenge designs and the lack of error analysis in the previous challenge results. (Given the previous challenge data volumes and SNR levels, the uncertainties cannot be significantly smaller than the uncertainties in GREAT3.)

For GREAT08, the fairest comparison is between GREAT3 CGC and the GREAT08 RealNoise_Blind bulge + disk galaxy results. We cannot compare Q values, since they are defined in different ways, so instead we compare ⟨m⟩ and c+, bearing in mind that even this comparison is complicated by the broader, more realistic distributions of galaxy properties in GREAT3. From the left column, middle row of figure C3 in Bridle et al. (2010), one can read off the number of methods with |⟨m⟩| below each of three successively smaller thresholds; in GREAT3, the corresponding numbers are 12, 10, and 6, using only the fair-comparison sample results used throughout Sec. 5 rather than the best submission per team for this branch. We are also ignoring the uncertainty on these ⟨m⟩ values, for consistency with how we did the calculation for GREAT08 given its lack of error estimates. The upper left panel of figure C4 (b + d) in Bridle et al. (2010) likewise indicates how many methods fall below each of two |c+| thresholds, whereas in GREAT3 CGC these numbers are 9 and 3. The latter comparison is particularly complicated by the different choices for the PSF ellipticity distribution in these challenges, since we showed in Sec. 5.4 that for essentially all methods, c+ is linearly proportional to the PSF ellipticity.

For GREAT10, the simplest comparison is with the inferred m and c values in table 3 of Kitching et al. (2012), again ignoring noise, because no uncertainties are quoted there. However, two of the better-performing submissions in that table have no m or c estimates, since they used a power spectrum analysis. In the absence of more information we will include them in the best category that we consider.
Given this choice, the number of methods in GREAT10 within each of the same three |⟨m⟩| thresholds should again be compared with the 12, 10, and 6 in GREAT3. All 12 methods in table 3 of Kitching et al. (2012) had c values below the threshold considered, with the range of PSF ellipticities being different from that in GREAT3, but not to a very large extent.

The GREAT3 results show that significant progress has been made in controlling multiplicative biases since GREAT08 and GREAT10, with the situation for additive biases being less clear. However, additive biases are easier to identify in real data (for example, using star-galaxy cross-correlations), so this situation fairly reflects the community's focus on the more pernicious multiplicative biases. Given that, as discussed in Appendix A4, the GREAT3 simulations have a realistic S/N distribution with an effective lower cutoff, this improvement in the control of multiplicative biases is a significant achievement reflecting tremendous progress in the weak lensing community as a whole.

In this section, we discuss lessons learned about shear estimation based on the analyses in Sections 4 and 5. Our focus is on results that are more general than just a single method; conclusions for individual methods can be drawn from the earlier plots and discussion.
Many teams that participated in GREAT3 used model-fitting methods, which must make choices about which pixels to use for the fitting. The results in Sec. 4.1 highlight the importance of truncation bias due to the use of overly small modeling windows. Truncation bias can potentially be several per cent (multiplicative bias), and is also a source of additive bias; its magnitude makes it relevant for present-day surveys, and it could potentially be worse in the case of blends (which might lead to the choice of a more restricted modeling window). These model-fitting methods also make choices about which models to use, with two popular options being a single Sérsic model (Amalgam@IAP, sFIT, MBI) and a sum of bulge and disk Sérsic models with fixed n (COGS, gfit). The good performance of these methods suggests that the use of Sérsic profiles can reduce the model bias that is observed with, e.g., shapelets or other models that do not describe galaxy light profiles as well as Sérsic profiles.

Figure 24. For CGC, we compare the ⟨m⟩ (left) and c+ (right) values that we get by splitting the galaxies at the median value of S/N (top), resolution (middle), and Sérsic n (bottom). The legend indicates the team and the Q_c value for the original submission using all galaxies. Arrows are shown for those teams for which the results for the subsamples differ from the overall results by more than 10 per cent, and are drawn from the overall value to the results for the subsamples.

Several methods of calibration were successful for model-fitting methods: external simulations for which the inputs were iteratively updated until the output galaxy properties matched those in the GREAT3 data (sFIT), derivation of calibration corrections from a deep subset of the same data (COGS), and the addition of a penalty term to the χ² to reduce noise bias (Amalgam@IAP). External simulations are always limited by their realism, though the use of iterative methods seems to be helpful. Calibration corrections from deep data do not, in principle, require external validation. Addition of a penalty term to the χ² does require external simulations to check that the penalty term really removes the noise bias.

Our results in Sec. 5.4 confirm the applicability of the linear model for shear calibration biases in the weak shear regime probed here, for all methods that participated in GREAT3. Several methods showed tendencies for multiplicative biases defined in the pixel coordinate system to differ between the components along the pixel axes and along the diagonals, similar to what was seen in, e.g., Massey et al. (2007a). In all cases, the additive biases c+ were linearly proportional to the amplitude of the PSF ellipticity (of order 0.1 per cent of the PSF ellipticity for the best methods, and more typically 1–5 per cent).
It is possible that some biases present in real surveys but not in GREAT3 would violate this pattern (e.g., selection biases that depend on the PSF anisotropy).

The results for many methods show a dependence on PSF properties such as the FWHM, defocus, and ellipticity. In some cases, the results seem to have been calibrated to work on average, so that they are worse for data of better or worse quality than the challenge overall. Defocus tends to produce primarily additive (not multiplicative) systematics. Some methods are particularly sensitive to outliers in defocus, which result in more complicated-looking PSFs; it is difficult to assess to what extent that sensitivity is intrinsic to the PSF correction method (because those PSFs violate one of its assumptions) versus arising from how the PSFs are modeled (because of limitations of the PSF modeling software). Some future surveys will have additional diagnostic data regarding PSFs; these results suggest that it may be helpful to incorporate this information into the PSF modeling and shear estimation process.

When splitting galaxy samples by S/N, resolution, or Sérsic n, we observe statistically significant trends for the five methods that were considered; these trends are sensitive to real galaxy morphology (control versus realistic galaxy experiment) and to the type of data (space versus ground). In contrast, the variation in shear systematic errors due to data properties like atmospheric PSF FWHM or defocus was fairly robust to realistic galaxy morphology.

Comparing ground versus space data, additive systematics seem to be more important for the latter. In space branches, several teams saw their c+ become significantly more positive, which contributed towards there being almost entirely positive c+ submissions in space branches.
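The sample-splitting test described above can be sketched generically: split a catalogue at the median of some property (here a toy S/N with an artificial noise-bias-like trend imposed by hand) and ask whether the mean shear residual differs significantly between the two halves. All numbers below are illustrative assumptions, not challenge values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Synthetic catalogue: a galaxy property (standing in for S/N) and per-object
# shear residuals with a toy noise-bias-like trend plus shape/measurement noise.
snr = rng.lognormal(mean=3.0, sigma=0.5, size=n)
residual = 0.02 / np.sqrt(snr) + rng.normal(0.0, 0.02, size=n)

lo = snr < np.median(snr)
mean_lo, mean_hi = residual[lo].mean(), residual[~lo].mean()
err = residual.std() * np.sqrt(2.0 / (n / 2))  # standard error of the difference

significance = (mean_lo - mean_hi) / err
print(mean_lo, mean_hi, significance)
```

With the trend assumed here, the low-S/N half shows a significantly larger mean residual; a real analysis would instead fit m and c separately within each subsample, as done for Fig. 24.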
However, not all the teams with negative c+ in the ground branches submitted to the space branches.

Finally, the effective noise level of the shear estimates (measurement error due to pixel noise) showed a weak inverse relationship with Q. For the majority of the methods (especially the higher-Q_c ones), the values of σ_g per component were fairly consistent across methods. This confirms the general tendency to select shear estimation methods based on their multiplicative and additive biases, rather than separately considering their measurement errors.

Many methods, including some that performed extremely well, show a small but statistically significant change in model bias due to realistic galaxy morphology, with order of magnitude 1 per cent. Realistic galaxy morphology can also result in additive systematics. Our findings for the order of magnitude of this effect for multiple methods are consistent with the finding for the im3shape software (Kacprzak et al. 2014). For some methods, realistic galaxy morphology was more important for space branches than for ground branches (e.g., the sFIT team had to explicitly calibrate out the bias due to realistic galaxy morphology only for space).

One key limitation in the lessons learned about realistic galaxy morphology in GREAT3 is that, since its impact is relatively small (typically detected at modest significance), it is hard to distinguish between space and ground results or to clearly identify trends with other data properties.
However, this in itself is good news for future surveys, since it provides an indication that model bias due to realistic galaxy morphology may rank behind other effects, such as noise bias, in terms of its direct impact on shear measurements.

In real data with a substantially deeper source population than is represented in the sample of galaxies from COSMOS used as the basis for the GREAT3 simulations, these results will have to be revisited due to the larger fraction of irregular galaxies at higher redshift (e.g., Bundy, Ellis & Conselice 2005).

We have presented results for the control and realistic galaxy experiments of the GREAT3 challenge, the goal of which was to test ensemble shear estimation given a galaxy population with a realistic distribution of size,
S/N, ellipticity, and morphology, and with a (known) fairly complicated PSF. A key result is that, within the ability of the simulations to determine systematics at this level, and bearing in mind that some effects are not included in them, a range of methods can now carry out shear estimation with systematic errors around the level required by Stage IV dark energy surveys. We have explored how the results for each team depend on the galaxy and PSF properties, and explored the impact of realistic galaxy morphology by comparing the control and realistic galaxy branches. Our conclusions on these points are summarized in Sec. 6, with the main one being that shear systematic errors due to realistic galaxy morphology are, for those methods for which we have a clear detection, typically of order 1 per cent. While significant enough that future surveys must take these effects into account, this source of model bias is subdominant when compared to the level of noise bias expected for galaxy populations similar to those in GREAT3 (e.g., Kacprzak et al. 2012; Melchior & Viola 2012; Refregier et al. 2012). In Paper II, we will use the other branches of the challenge to explore whether these overall results from Sec. 6 carry over to the case where the PSF is not known.

Treating the participants as a fair subset of the community, it seems that model-fitting methods now dominate the field in both popularity and (broadly) performance. Some differences between methods may relate to implementation details rather than true issues with a method. Unlike a decade ago, moments methods are now a minority.
However, there are some highly interesting alternative methods whose introduction and/or evident maturity we have seen in GREAT3 (some based on Bernstein & Armstrong (2014); MetaCalibration; self-calibration for LensFit as carried out by the MaltaOx team; hierarchical inference as done by the MBI team; machine-learning-based methods like MegaLUT; and Fourier_Quad), adding variety and quality to the field. This includes the introduction of some teams that infer only ensemble shears (MBI, BAMPenn, ess, Fourier_Quad) rather than per-object shears; however, a demonstration of these methods on variable shear data will be crucial for their more general acceptance.

Choices related to calibration of shears were quite varied, with some teams aiming for an unbiased measurement (e.g. BAMPenn, ess, MBI) and others applying calibrations in a variety of ways. Aside from external simulation-based calibrations, which are subject to the limitation that the calibrations are only as good as the simulations, a few more sophisticated options were tried. These include iterative external simulations that get updated until the outputs match those in the dataset being analyzed (sFIT), analysis of a deep subset of the same data (COGS), and self-calibration using manipulations of the images themselves (MaltaOx, MetaCalibration). These alternatives appear promising, and avoid some of the objections to the most basic brute-force calibration. The utility of the deeper data to several teams, either for calibrations or for deriving galaxy property distributions, suggests that future surveys may find it useful to have a deeper subsurvey, as indeed many already intend to do. Several teams used self-calibration methods (MetaCalibration and MaltaOx) and hierarchical-inference methods (MBI) that in principle could be used to remove the biases in many other shear estimation methods.
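As a flavour of the self-calibration idea: MetaCalibration-type methods estimate an estimator's shear response numerically, by re-applying small artificial shears to the data and differencing the measurements. The toy sketch below does this for an unweighted-moments ellipticity of an analytic Gaussian profile; the real method must additionally deconvolve and re-convolve the PSF and handle noise, so this is an illustration of the finite-difference idea only:

```python
import numpy as np

def gaussian_image(g1, sigma=3.0, n=64):
    """Render a circular Gaussian sheared by (g1, 0) via sheared coordinates."""
    y, x = np.mgrid[:n, :n] - (n - 1) / 2.0
    xs = (1 - g1) * x  # first-order inverse shear of the coordinates
    ys = (1 + g1) * y
    return np.exp(-(xs**2 + ys**2) / (2 * sigma**2))

def e1_moments(img):
    """Unweighted-quadrupole ellipticity (distortion definition), e1 component."""
    n = img.shape[0]
    y, x = np.mgrid[:n, :n] - (n - 1) / 2.0
    f = img / img.sum()
    qxx, qyy = (f * x**2).sum(), (f * y**2).sum()
    return (qxx - qyy) / (qxx + qyy)

g_true = 0.02
e_raw = e1_moments(gaussian_image(g_true))

# Finite-difference response: re-shear the object by +/- dg and difference.
dg = 0.01
R = (e1_moments(gaussian_image(g_true + dg))
     - e1_moments(gaussian_image(g_true - dg))) / (2 * dg)

g_est = e_raw / R
print(R, g_est)  # R close to 2 for a distortion-type estimator; g_est near 0.02
```

Dividing the raw measurement by the empirically measured response calibrates the estimator without any reference to external simulations, which is the appeal of this class of methods.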
These newer methods were not among the very top performers, but did impressively well for new implementations, so it will be interesting to follow their future development.

We also have a number of conclusions about GREAT-type challenges based on the GREAT3 challenge process. Unfortunately, the variable shear simulations were less powerful than originally intended at detecting systematic biases in the shear fields. Despite our best efforts in attempting to define a metric with a reasonably small variance, Q_v was noisier than Q_c, the constant-shear metric. However, for the methods that submitted results to both constant and variable shear branches, the results were consistent with the estimated shears having the same underlying biases (within the errors), as we would expect. Future challenges that want to determine biases with variable shear fields may require substantially larger data volumes than in GREAT3. Future challenges may also want to allow participants to assign weights to downweight data that they do not want to use, rather than requiring shear estimates for all fields.

After the end of the challenge, we found that use of a metric based on systematics in the coordinate system defined by the PSF anisotropy resulted in accidental preference for methods with calibration biases in the coordinate system defined by the pixel frame that were related as m+ ≈ −m×. While this had little effect on the challenge itself, it highlights the fact that a challenge with a public leaderboard including Q values (even without any multiplicative and additive biases) cannot be considered truly blind. Participants sometimes made choices based on feedback from the leaderboard, which at times was useful in helping them avoid completely futile pathways, but at times may have involved tuning to low levels of noise rather than drawing real conclusions.
Thus, if the goal is a truly blind challenge (which helps evaluate existing methods rather than assisting the development of new ones), then we recommend that future challenges consider some change to the public leaderboard. For example, the public leaderboard could use a subset of the data, with the real leaderboard that uses all the data being released only after the end of the challenge. An alternative would be to tell participants only the range in which their Q values fall. Both options would give participants a basic idea of their results (allowing them to check, e.g., shear conventions and avoid submitting junk by accident) while not encouraging them to potentially tune to the noise.

A final point for future challenges, and even for planning of future surveys, relates to the importance of the S/N definition. It is quite common to use galaxies above some
S/N limit, but in GREAT3 we found that, depending on the S/N definition, the effective S/N can vary by nearly a factor of two. For example, as stated in the handbook, we initially set a minimum S/N limit to ensure that most teams would be able to compute shears for all galaxies, with shape noise effectively canceled. The disadvantage of this limit was that we would not dig too deeply into the noise bias-dominated regime. However, we found in practice (see Appendix A4) that our S/N estimator was so optimal as to be completely unachievable in practice, given that it assumes perfect knowledge of the light profile. Our tests showed that the corresponding lower S/N limit using more practical estimators is around 12. On the positive side, this meant that the results have a more realistic level of noise bias; on the negative side, it meant that the simulations were less powerful in constraining shear systematics. This finding highlights the importance of how S/N is defined, both for future challenges and for parameter forecasts and mission specifications for future lensing surveys.

In conclusion, GREAT3 has led to substantial progress in quantifying shear systematics for a wide range of methods, including traditionally recognized effects like noise bias and model bias due to mismatch between assumed and real galaxy light profiles in the control branch, but also newer effects like truncation bias and model bias due to realistic morphology, the latter of which was enabled by the use of HST data for the simulations. The results show that the field has made significant advances in the years since the end of the GREAT10 challenge, particularly in controlling multiplicative biases, and that community challenges can be beneficial by inspiring the creation or development of new shear estimation methods. Within this field, there are both new and established methods that are now capable of handling weak lensing data from upcoming Stage III surveys, provided adequate care is taken over identified sources of bias. Although development will be needed in many areas, the GREAT3 results provide new reasons to be optimistic about delivering reliably accurate shear estimates at Stage IV survey accuracy.
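To illustrate how strongly the quoted depth depends on the S/N definition, the sketch below compares a matched-filter ("optimal") S/N, which assumes the true profile is known, with a simple aperture-sum S/N on a noise-free Gaussian profile. Neither is the exact GREAT3 or SExtractor definition, and all parameter values are illustrative; the point is only that the two definitions assign different S/N values to the same object:

```python
import numpy as np

sigma_noise = 1.0          # per-pixel noise RMS (toy value)
n, sig, flux = 64, 3.0, 100.0
y, x = np.mgrid[:n, :n] - (n - 1) / 2.0
r2 = x**2 + y**2
img = flux * np.exp(-r2 / (2 * sig**2)) / (2 * np.pi * sig**2)  # noise-free profile

# "Optimal" S/N: matched filter using the true (noise-free) profile as template.
snr_opt = np.sqrt((img**2).sum()) / sigma_noise

# Practical S/N: flux summed in an r < 2*sig aperture over the aperture noise.
ap = r2 < (2 * sig)**2
snr_ap = img[ap].sum() / (sigma_noise * np.sqrt(ap.sum()))

print(snr_opt, snr_ap)  # the matched-filter value is systematically higher
```

With realistic estimators the gap widens further, because the template itself must be measured from noisy data, which is the effect discussed above.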
ACKNOWLEDGMENTS
We thank Gary Bernstein and Mike Jarvis for providing helpful feedback on this paper, Peter Freeman for providing guidance on the statistical interpretation of results, and the anonymous referee for making suggestions that improved the presentation of results in the paper. We thank the PASCAL-2 network for its sponsorship of the challenge. This project was supported in part by NASA via the Strategic University Research Partnership (SURP) Program of the Jet Propulsion Laboratory, California Institute of Technology; and by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This article only reflects the authors' views. This work was supported in part by the National Science Foundation under Grant No. PHYS-1066293 and the hospitality of the Aspen Center for Physics.

RM was supported during the development of the GREAT3 challenge in part by program HST-AR-12857.01-A, provided by NASA through a grant from the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Incorporated, under NASA contract NAS5-26555, and in part through an Alfred P. Sloan Fellowship from the Sloan Foundation; her work on the final analysis of results was supported by the Department of Energy Early Career Award Program. BR, JZuntz, and TKacprzak acknowledge support from the European Research Council in the form of a Starting Grant with number 240672. HM acknowledges support from Japan Society for the Promotion of Science (JSPS) Postdoctoral Fellowships for Research Abroad and JSPS Research Fellowships for Young Scientists. The Amalgam@IAP team (AD, EB, RG) acknowledges the Agence Nationale de la Recherche (ANR Grant "AMALGAM") and Centre National des Etudes Spatiales (CNES) for financial support. MT acknowledges support from the Deutsche Forschungsgemeinschaft (DFG) grant Hi 1495/2-1.
TKuntzer, MGentile, HYS, and FC acknowledge support from the Swiss National Science Foundation (SNSF) under grants CRSII2_147678, 200020_146813, and 200021_146770. Part of the work carried out by the MBI team was performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under contract number DE-AC52-07NA27344 and SLAC National Accelerator Laboratory under contract number DE-AC02-76SF00515. HYS acknowledges support by a Marie Curie International Incoming Fellowship within the European Community Framework Programme, and NSFC of China under grant 11103011. JEM was supported by National Science Foundation grant PHY-0969487. JZhang is supported by the National Science Foundation of China (Grant No. 11273018, 11433001) and the National Basic Research Program of China (Grant No. 2013CB834900, 2015CB857001). J-LS, MK, FS, and FMNM were supported by the European Research Council grant SparseAstro (ERC-228261). EMH is grateful to Christopher Hirata for insightful discussion and feedback on the MetaCalibration idea.

Contributions: RM and BR were co-leaders of the challenge itself, and coordinated the analysis presented in this paper. In the rest of this listing, people whose names are given as initials or first initial-last name (when initials are ambiguous) are co-authors on the paper, and those who are not have their names listed in full. JB, Chihway Chang, FC, MGill, Mike Jarvis, HM, RN, JR, MS, and JZuntz were members of the GREAT3 Executive Committee, which helped to design the simulations and run the challenge.
The other co-authors were members of teams that participated in the challenge:
• Amalgam@IAP: EB, AD, RG
• BAMPenn: RA, Gary Bernstein, MM
• EPFL_gFIT: MGentile, FC
• CEA-EPFL: MGentile, FS, MK, J-LS, FNMN, SP-H, FC
• CEA_denoise: MK
• CMU_experimenters: RM
• COGS: BR, JZuntz, TKacprzak, Sarah Bridle
• E-HOLICs: YO
• EPFL_HNN: GN, FC
• EPFL_KSB: HYS
• EPFL_MLP: GN
• FDNT: RN
• Fourier_Quad: JZhang
• HSC-LSST-HSM: JB, RM
• MBI: DBard, DBoutigny, WAD, DWH, DL, PJM, JEM, MDS
• MaltaOx: LM, IFC, KZA
• MegaLUT: TKuntzer, MT, FC
• MetaCalibration: EMH, RM
• Wentao_Luo: WL
• ess: ESS
• sFIT: MJJ

REFERENCES
Abazajian K., Dodelson S., 2003, Physical Review Letters, 91, 41301
Albrecht A. et al., 2006, ArXiv e-prints (astro-ph/0609591)
Bartelmann M., Schneider P., 2001, Phys. Rep., 340, 291
Bernstein G. M., 2010, MNRAS, 406, 2793
Bernstein G. M., Armstrong R., 2014, MNRAS, 438, 1880
Bernstein G. M., Jarvis M., 2002, AJ, 123, 583
Bertin E., 2011, in Astronomical Society of the Pacific Conference Series, Vol. 442, Astronomical Data Analysis Software and Systems XX, Evans I. N., Accomazzi A., Mink D. J., Rots A. H., eds., p. 435
Bertin E., Arnouts S., 1996, A&AS, 117, 393
Bridle S. et al., 2009, Annals of Applied Statistics, 3, 6
Bridle S. et al., 2010, MNRAS, 405, 2044
Bundy K., Ellis R. S., Conselice C. J., 2005, ApJ, 625, 621
Cropper M. et al., 2013, MNRAS, 431, 3103
Eddy W., 1982, in COMPSTAT 1982 5th Symposium held at Toulouse 1982, Caussinus H., Ettinger P., Tomassone R., eds., Physica-Verlag HD, pp. 42–47
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2013, PASP, 125, 306
Fruchter A. S., 2011, PASP, 123, 497
Gentile M., Courbin F., Meylan G., 2012, ArXiv e-prints (1211.4847)
Graff P., Feroz F., Hobson M. P., Lasenby A., 2014, MNRAS, 441, 1741
Haykin S., 2009, Neural Networks and Learning Machines. Prentice Hall
Heymans C. et al., 2013, MNRAS, 432, 2433
Heymans C., Van Waerbeke L., Bacon D., Berge J., Bernstein G., Bertin E., Bridle S., et al., 2006, MNRAS, 368, 1323
High F. W., Rhodes J., Massey R., Ellis R., 2007, PASP, 119, 1295
Hirata C., Seljak U., 2003, MNRAS, 343, 459
Hirata C. M. et al., 2004, MNRAS, 353, 529
Hoekstra H., Franx M., Kuijken K., 2000, ApJ, 532, 88
Hoekstra H., Franx M., Kuijken K., Squires G., 1998, ApJ, 504, 636
Hoekstra H., Jain B., 2008, Annual Review of Nuclear and Particle Science, 58, 99
Hogg D. W., Lang D., 2013, PASP, 125, 719
Hu W., 2002, Phys. Rev. D, 65, 023003
Huterer D., 2002, Phys. Rev. D, 65, 63001
Jee M. J., Tyson J. A., Schneider M. D., Wittman D., Schmidt S., Hilbert S., 2013, ApJ, 765, 74
Kacprzak T., Bridle S., Rowe B., Voigt L., Zuntz J., Hirsch M., MacCrann N., 2014, MNRAS, 441, 2528
Kacprzak T., Zuntz J., Rowe B., Bridle S., Refregier A., Amara A., Voigt L., Hirsch M., 2012, MNRAS, 427, 2711
Kaiser N., 2000, ApJ, 537, 555
Kaiser N., Squires G., Broadhurst T., 1995, ApJ, 449, 460
Kitching T., Balan S., Bernstein G., Bethge M., Bridle S., Courbin F., Gentile M., et al., 2010, AOAS, 5, 2231
Kitching T. D. et al., 2012, MNRAS, 423, 3163
Kitching T. D. et al., 2013, ApJS, 205, 12
Koekemoer A. M. et al., 2007, ApJS, 172, 196
Lackner C. N., Gunn J. E., 2012, MNRAS, 421, 2277
Lauer T. R., 1999, PASP, 111, 227
Laureijs R. et al., 2011, ArXiv e-prints (1110.3193)
LSST Science Collaborations, LSST Project, 2009, ArXiv e-prints (0912.0201)
Luppino G. A., Kaiser N., 1997, ApJ, 475, 20
Mandelbaum R. et al., 2005, MNRAS, 361, 1287
Mandelbaum R., Hirata C. M., Leauthaud A., Massey R. J., Rhodes J., 2012, MNRAS, 420, 1518
Mandelbaum R. et al., 2014, ApJS, 212, 5
Massey R. et al., 2007a, MNRAS, 376, 13
Massey R., Rowe B., Refregier A., Bacon D. J., Bergé J., 2007b, MNRAS, 380, 229
Massey R., Kitching T., Richard J., 2010, Reports on Progress in Physics, 73, 086901
Massey R. et al., 2013, MNRAS, 429, 661
Melchior P., Böhnert A., Lombardi M., Bartelmann M., 2010, A&A, 510, A75
Melchior P., Viola M., 2012, MNRAS, 424, 2757
Melchior P., Viola M., Schäfer B. M., Bartelmann M., 2011, MNRAS, 412, 1552
Miller L. et al., 2013, MNRAS, 429, 2858
Nissen S., 2003, Implementation of a fast artificial neural network library (fann). Tech. rep., Department of Computer Science, University of Copenhagen (DIKU), http://fann.sf.net
Nurbaeva G., Courbin F., Gentile M., Meylan G., 2011, A&A, 531, A144
Nurbaeva G., Tewes M., Courbin F., Meylan G., 2014, ArXiv e-prints (1411.3193)
Okura Y., Futamase T., 2011, ApJ, 730, 9
Okura Y., Futamase T., 2012, ApJ, 748, 112
Okura Y., Futamase T., 2013, ApJ, 771, 37
Refregier A., 2003, ARA&A, 41, 645
Refregier A., Kacprzak T., Amara A., Bridle S., Rowe B., 2012, MNRAS, 425, 1951
Reyes R., Mandelbaum R., Gunn J. E., Nakajima R., Seljak U., Hirata C. M., 2012, MNRAS, 425, 2610
Rhodes J., Refregier A., Groth E. J., 2000, ApJ, 536, 79
Rojas R., 1996, Neural Networks: A Systematic Introduction. Springer-Verlag New York, Inc., New York, NY, USA
Rowe B., Hirata C., Rhodes J., 2011, ApJ, 741, 46
Rowe B. et al., 2014, ArXiv e-prints (1407.7676)
Schechter P. L., Levinson R. S., 2011, PASP, 123, 812
Schneider M. D., Hogg D. W., Marshall P. J., Dawson W. A., Meyers J., Bard D. J., Lang D., 2014, ArXiv e-prints (1411.2608)
Schneider P., 2006, in Saas-Fee Advanced Course 33: Gravitational Lensing: Strong, Weak and Micro, Meylan G., Jetzer P., North P., Schneider P., Kochanek C. S., Wambsganss J., eds., pp. 269–451
Schneider P., van Waerbeke L., Jain B., Kruse G., 1998, MNRAS, 296, 873
Scoville N. et al., 2007a, ApJS, 172, 38
Scoville N. et al., 2007b, ApJS, 172, 1
Sheldon E. S., 2014, MNRAS, 444, L25
Spergel D. et al., 2013, ArXiv e-prints (1305.5422)
Starck J.-L., Pires S., Réfrégier A., 2006, A&A, 451, 1139
Tewes M., Cantale N., Courbin F., Kitching T., Meylan G., 2012, A&A, 544, A8
van Waerbeke L., Mellier Y., Schneider P., Fort B., Mathez G., 1997, A&A, 317, 303
Voigt L. M., Bridle S. L., 2010, MNRAS, 404, 458
Zhang J., 2008, MNRAS, 383, 113
Zhang J., 2010, MNRAS, 403, 673
Zhang J., 2011, JCAP, 11, 41
Zhang J., Komatsu E., 2011, MNRAS, 414, 1047
Zhang J., Luo W., Foucaud S., 2013, ArXiv e-prints (1312.5514)
Zhang P., Liguori M., Bean R., Dodelson S., 2007, Physical Review Letters, 99, 141302
Zuntz J., Kacprzak T., Voigt L., Hirsch M., Rowe B., Bridle S., 2013, MNRAS, 434, 1604
APPENDIX A: GREAT3 CHALLENGE DETAILS
In this appendix, we summarize some details of the GREAT3 challenge that were not included in the handbook.
A1 Galaxy intrinsic ellipticity distribution
The galaxy intrinsic ellipticity distribution, or p(ε), is important, since many methods make assumptions about or try to infer it. We measure this distribution for the GREAT3 galaxy samples using parametric fits to COSMOS galaxies. The galaxy selection in each subfield has three goals: first, it should roughly preserve the joint size, S/N, morphology, and ellipticity distributions of real galaxy samples; second, each subfield should have a similar
S/N cutoff (which depends on the PSF as well as the pixel noise); and finally, the galaxies should be sufficiently resolved that essentially all methods can measure them. In ground branches, where the PSF size varies substantially from subfield to subfield, it is not obvious that the galaxy population will have the same p(ε) in each subfield after these cuts.

Figure A1. The intrinsic ellipticity distribution p(ε) for CGC (top) and CSC (bottom), for three subfields. For CGC, the legend shows the subfield index and atmospheric PSF FWHM. Both panels show intrinsic ellipticity distributions for disk and bulge galaxies from Miller et al. (2013).

In Fig. A1, we show the p(ε) for several subfields in CGC and CSC, with several apparent trends. First, the p(ε) are similar for space and ground branches. Second, within different subfields in CSC, there are small fluctuations in the p(ε), but these appear consistent with noise. For ground branches, the PSF FWHM results in quite different populations being represented in each subfield. For this figure, we deliberately show one subfield with atmospheric PSF FWHM around the median, along with the subfields with the minimum and maximum values of PSF FWHM. Thus, we have maximized population differences due to our FWHM-dependent galaxy selection process. However, ⟨ε⟩ is only slightly smaller in the worst-seeing subfield than for the more typical and best subfields, and part of the difference here is due to statistical fluctuations. The results are similar for the realistic galaxy experiment, and for variable shear branches. Thus, the p(ε) are largely stable within and across branches. Moreover, they are reasonably consistent with a linear combination of observationally motivated distributions for bulges and disks derived in a completely different way and used in Miller et al. (2013), as shown on the plot.

The resolution cut is slightly ellipticity-dependent for the smallest galaxies, as shown in Fig. A2 (the 2D distribution of half-light radius and ellipticity). In general, only a small fraction of the galaxies are small enough to be affected by this problem. Also, this effect is irrelevant in space branches, where the cuts remove very few galaxies.

Figure A2. The 2D histogram of galaxy half-light radius r_1/2 and ellipticity magnitude |ε| for subfield 51 in CGC, which has atmospheric PSF FWHM around the median value.

A2 Lensing shears
Here we describe the distributions from which the lensing shears were drawn.

In constant-shear branches, the lensing shears had random orientations, with magnitudes |g| spanning a limited range. The distribution of magnitudes within this range is p(|g|) ∝ |g|, which emphasizes higher shear values and thus increases our sensitivity to systematic errors in the shear.

In variable shear branches, each galaxy had an applied shear and magnification according to a shear power spectrum. The shear power spectrum came from interpolation between tabulated spectra for a particular cosmological model at three median redshifts z_med. However, the power spectrum was altered in two ways. First, the amplitude was doubled, to increase our sensitivity to multiplicative biases. Second, to make the power spectrum one that could not be guessed by participants, we added a term corresponding to a sum of shapelets with randomly chosen amplitudes (of order 10 per cent of the original power spectrum amplitude). For more details, see the publicly available simulation scripts on the GREAT3 GitHub page.

A3 Atmospheric and optical PSF properties
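Several of the consistency checks in this subsection rely on two-sided Kolmogorov-Smirnov comparisons between branches. As a minimal numpy sketch of the two-sample KS statistic, with synthetic Gaussian draws standing in for per-subfield PSF properties (all values illustrative):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max distance between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(2)
# Stand-ins for, e.g., per-subfield PSF FWHM values in different branches.
fwhm_1 = rng.normal(0.7, 0.05, size=200)
fwhm_2 = rng.normal(0.7, 0.05, size=200)  # same parent distribution
fwhm_3 = rng.normal(0.8, 0.05, size=200)  # shifted distribution

d_same = ks_statistic(fwhm_1, fwhm_2)
d_shift = ks_statistic(fwhm_1, fwhm_3)
print(d_same, d_shift)  # small for matching branches, large for shifted ones
```

In practice one converts the statistic to a p-value (e.g., via the Kolmogorov distribution) before declaring two branches consistent.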
While the handbook contained details on many inputs to the PSF models, here we show the outputs that are relevant for the tests carried out in this paper, especially in Sec. 5.5.

Fig. A3 shows the distributions of the seeing (atmospheric PSF FWHM) in two branches; the defocus for ground and space-based simulations; and finally the effective PSF ellipticities including all components. As shown (top left), the seeing distributions in CGC and RGC are consistent, modulo small noise fluctuations. This consistency is important for the comparison between the control and realistic galaxy experiments, since consistency in PSF properties leads to consistency in the simulated galaxy populations.

The top right panel of Fig. A3 shows the distribution of defocus values for the optical PSF in the ground-based simulations. CGC and RGC are again consistent, with most subfields having a small maximum defocus, but with a tail to higher values. The subfields that seemed most problematic in Sec. 4.8 are those with higher defocus values, which suggests that identifying and removing such data could be advantageous. The bottom left panel shows the defocus distribution for simulated space-based data; as expected, the simulated distribution is roughly a factor of ten narrower than for ground data. Moreover, CSC and RSC are consistent, which facilitates comparison between the control and realistic galaxy experiments.

Finally, Fig. A3 (bottom right) shows the distributions of effective PSF shear for four branches. Typically this quantity is small, consistent with real data; two-sided KS tests show that the PSF shears are consistent between pairs of branches that are meant to represent the same data type (e.g., CSC and RSC, CGC and RGC). In both ground and space simulations, there is a positive correlation of similar strength between the absolute value of defocus and g_PSF, with very small p-values in both cases.

A4 Galaxy S/N distributions
The galaxy S/N distribution in the GREAT3 simulations is important because it determines the level of noise bias, an important systematic error for shear estimation. The handbook states a minimum S/N for the galaxies, which is higher than the cutoff that is used by many methods in real data. However, the S/N estimator used to impose that cutoff is an optimal one that assumes perfect knowledge of the galaxy profile (which is unachievable in real data). Thus, to relate the quoted S/N cutoff to what is used in real data, we must use a more realistic S/N estimator.

For this purpose, we considered two S/N estimators. One is the S/N within an elliptical Gaussian aperture determined using the best-fitting elliptical Gaussian model for the PSF-convolved galaxy. The other is the ratio of SExtractor outputs FLUX_AUTO / FLUXERR_AUTO. Fig. A4 shows S/N distributions using the second definition for several subfields in ground (top) and space (bottom) branches.

As shown, the S/N distribution is quite uniform across subfields in space branches, with a 5th percentile of S/N ∼ 12. In contrast, the S/N distribution for ground branches varies with the subfield; the ones shown here are the same as in Fig. A1, with maximal variation in the atmospheric PSF FWHM. Subfields with worse seeing typically have higher average galaxy S/N. The 5th percentile S/N value is 11.3, 12.0, and 13.5 for the subfields with the best, median, and worst atmospheric PSF FWHM, respectively. If we use the elliptical Gaussian-based S/N estimate, then the distributions shift slightly to the right (higher S/N), with a correspondingly higher lower limit, instead of 12, for space branches. This is still a far cry from the nominal handbook limit using the optimal estimator, which highlights the need for care in comparing predictions made with different estimators.

APPENDIX B: EXAMPLE SCRIPTS
The GREAT3 Executive Committee distributed a shear estimation example script (called simple_shear.py) on the GREAT3 GitHub page. This example script estimates per-galaxy shears for all galaxies and outputs them as catalogs in the format expected by the publicly available presubmission scripts. Teams could take this code to do the bookkeeping while substituting their per-galaxy shear estimation routine in place of the one in the example script.

The example script uses the GalSim (Rowe et al. 2014) implementation of the re-Gaussianization (Hirata & Seljak 2003) PSF correction method; see those papers for more details of the algorithm and implementation. Because the script is a simple and fast example (not meant to get a science-quality shear estimate), it applies only a simply-derived calibration correction that does not include all known systematics. For the "shear responsivity" (Bernstein & Jarvis 2002) describing how galaxies with a particular distortion respond to a lensing shear, the script uses an overly simplistic expression rather than a more accurate one (both available in the above reference). It also uses a simple but inaccurate way of estimating the RMS distortion of the galaxy population, rather than more accurate but more complicated methods that are available in the literature (e.g., Reyes et al. 2012), as an input to the responsivity calculation. Finally, the default settings for the initial guess of object size lead to convergence to a local minimum for the space branches that cuts out the outer parts of the PSF, resulting in very wrong shear estimates (but accurate centroid estimates). Fine-tuning the initial guesses is necessary for this script to give reasonable results on space simulations.

Figure A4. The distribution of galaxy S/N using the second S/N estimator described in the text, for three subfields in CGC (top) and CSC (bottom).

Figure A3. Distributions of PSF properties across all subfields in various branches. Top: Seeing (left) and defocus distributions (right) for CGC and RGC. Bottom left: Defocus distributions for CSC and RSC; note the smaller dynamic range compared to the ground branches. Bottom right: Distribution of PSF shear in the four constant-shear branches in the control and realistic galaxy experiments.
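As a toy illustration of the responsivity step described above (not the actual simple_shear.py code; the distortion values and e_rms below are invented for the example), a minimal numpy sketch:

```python
import numpy as np

def shear_from_distortions(e1, e2, e_rms):
    """Estimate a constant shear from per-galaxy distortions using the
    simplistic responsivity R = 1 - e_rms**2 mentioned above; Bernstein &
    Jarvis (2002) give more accurate expressions."""
    R = 1.0 - e_rms**2
    return np.mean(e1) / (2.0 * R), np.mean(e2) / (2.0 * R)

# invented toy data: distortions e ~= 2*R*g plus symmetric shape noise
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.25, 10000)
e1 = 2.0 * (1.0 - 0.4**2) * 0.03 + np.concatenate([noise, -noise])
e2 = np.zeros_like(e1)
g1, g2 = shear_from_distortions(e1, e2, e_rms=0.4)
```

With perfectly paired shape noise, the recovered g1 is the input 0.03; on real data the accuracy of R limits the calibration, which is the point made above.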
APPENDIX C: SHEAR ESTIMATION METHODS

C1 Amalgam@IAP
C1.1 PSF modelling
The PSF modelling was performed using the PSFEx package (Bertin 2011; http://astromatic.net/software/psfex) to compute the PSF model from the star postage stamps. The PSF modelling procedure starts by normalising and re-centering point-source images to a common "PSF grid" using a regular image resampling technique. The coefficients of a set of basis functions of point-source coordinates X_c(θ) (simple polynomials) are adjusted in the χ² sense to every PSF "pixel" to compute a coarse PSF model and its spatial variation, in the form of a set of tabulated PSF components φ_c.

The model is further refined by adding corrections Δφ_c by minimising the following cost function over all pixels i ∈ D_s from all point sources s:

E(\Delta\phi_1, \Delta\phi_2, \ldots) = \sum_s \sum_{i \in \mathcal{D}_s} \frac{\left[ p_i - f_s \sum_c X_c(\theta) \left( \phi'_{ci}(x_s) + \Delta\phi'_{ci}(x_s) \right) \right]^2}{\sigma_i^2} + \sum_c \frac{\| \Delta\phi_c \|^2}{\sigma_\phi^2},   (C1)

where p_i is the value of pixel i, with uncertainty σ_i, and f_s the flux of point source s. σ_φ sets the amplitude of the regularisation term. In practice, a small value of σ_φ represents a good compromise between fidelity and robustness of the solution.

The prime indicates a resampled version of the PSF components; e.g., the value of pixel i with coordinates x_i in the image of PSF φ resampled at the point-source position x_s with PSF sampling step η is

\phi'_i(x_s) = \sum_j h_s\!\left( x_j - \eta (x_i - x_s) \right) \phi_j,

where h_s(x) is the interpolation function.

The version of
PSFEx used for the GREAT3 challenge is identical to v3.17.1 except for the interpolation function, which is either a Lanczos-4 or Lanczos-5 kernel instead of the default Lanczos-3. Support for measurement vectors as PSF dependency parameters (PSFVAR_KEYS) was added early in the challenge to allow PSFEx to map PSF variations as a function of any set of columns in an ASCII list, through SExtractor's ASSOC mechanism.

The PSFEx configuration used for GREAT3 differs from the default one in a few minor ways. The first difference is in the use of super-resolution, adopting a constant sampling step η of 0.6 image pixels for all branches. This sampling step offers the best compromise between robustness and accuracy given the limited number of PSF images for branches with a constant PSF. Also, the full star postage stamp size is used for each branch. PSF variations are modelled using 0th and 5th degree polynomials of star coordinates for constant and variable PSF branches, respectively. Finally, the noise on point source images is assumed to be purely additive, setting PSF_ACCURACY to 0.
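The Lanczos interpolation used for the φ′ resampling step above can be sketched in one dimension. This is a hedged toy (not PSFEx code), assuming the standard kernel definition sinc(x)·sinc(x/a):

```python
import numpy as np

def lanczos(x, a=4):
    """Lanczos-a kernel; PSFEx's default is a=3, while a=4 or 5 was used
    for GREAT3."""
    x = np.asarray(x, dtype=float)
    out = np.sinc(x) * np.sinc(x / a)
    out[np.abs(x) >= a] = 0.0      # compact support of the kernel
    return out

def resample_1d(samples, shift, a=4):
    """Evaluate tabulated values at grid positions offset by `shift`
    pixels: a 1-D analogue of the PSF component resampling above."""
    idx = np.arange(len(samples), dtype=float)
    out = np.empty_like(samples, dtype=float)
    for i in range(len(samples)):
        w = lanczos(idx - (i - shift), a)   # kernel centred on the shifted position
        out[i] = np.sum(w * samples)
    return out
```

A zero shift reproduces the input samples (the kernel is 1 at 0 and vanishes at other integers), and an integer shift translates them, which is the sanity check one expects of an interpolating kernel.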
C1.2 Galaxy shape measurement
Galaxy shapes are measured using SExtractor v2.19.15 (Bertin & Arnouts 1996; Bertin 2011; http://astromatic.net/software/sextractor). The measurement process involves independently fitting each galaxy image with a Sérsic model convolved with the local PSF model from PSFEx. To avoid galaxy detection problems, the Amalgam@IAP team used a detection image to explicitly tell SExtractor about the gridded galaxy positions.

The vector of Sérsic model parameters θ includes the (x, y) centroid position, amplitude, effective radius, aspect ratio, position angle and Sérsic index. Physically meaningful constraints (e.g., amplitude > 0) are imposed on all parameters except position angle through a change of variables θ → θ'. For instance, for the aspect ratio (parameter θ_aspect) the Amalgam@IAP team instead constrain the transformed parameter θ'_aspect defined as

\theta'_{\rm aspect} = \ln\!\left( \frac{\ln\theta_{\rm aspect} - \ln\theta_{\rm min}}{-\ln\theta_{\rm aspect}} \right),   (C2)

where θ_min is the lower bound imposed on the aspect ratio. Individual ellipticities from the aspect ratio and position angle of the best-fitting galaxy model are used directly. SExtractor also extracts the associated uncertainties and their correlation coefficient from the covariance matrix of the fitted parameters.

The fit itself is achieved by minimising a quadratic cost function with the Levenberg-Marquardt algorithm using the LevMar library (http://users.ics.forth.gr/~lourakis/levmar/). The cost function is the weighted sum of squared residuals plus a quadratic penalty term

E(\theta) = \chi^2(\theta') + \sum_i \frac{(\theta'_i - \mu_{\theta_i})^2}{\sigma_{\theta_i}^2},   (C3)

where the sum is over galaxy model parameters i. The version of SExtractor used by default has σ_{θ_i} ≡ ∞ for all parameters (no penalty).

The fitting process typically converges in 50-100 iterations. Compared to the latest publicly available version of the package, the following changes were made to the SExtractor code for GREAT3:
• Fitting area (normally set automatically) is limited to the size of the GREAT3 galaxy images to avoid overlapping with neighbouring galaxies.
• Sampling of the model is forced to 0.3 image pixel, instead of the default which depends on the input PSF model.
• The step used in the difference approximation to the Jacobian in LevMar is set to a fixed small value.
• Penalty parameters for the aspect ratio are set to μ_{θ_aspect} = 0 and σ_{θ_aspect} = 1 to disfavour very large ellipticities for the most poorly resolved objects, without significantly affecting the results for more resolved galaxies.
• The default, modified χ² (which is more robust for partially overlapping objects) is replaced with a regular χ².

Finally, the SExtractor configuration used by the Amalgam@IAP team reflects the details of the GREAT3 simulations: the background is set to 0 ADU; the GAIN is set to 0 (equivalent to infinite); and the MASK_TYPE detection masking parameter is usually NONE.
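The change-of-variables trick in Eq. (C2) can be illustrated generically. The sketch below is a hedged toy (the actual SExtractor transform and bounds differ): a logit-style map that lets an unconstrained minimizer explore a bounded parameter such as the aspect ratio.

```python
import numpy as np

def to_unbounded(theta, lo, hi):
    """Logit-style map from theta in (lo, hi) to the whole real line,
    in the spirit of the theta -> theta' change of variables above."""
    return np.log((theta - lo) / (hi - theta))

def to_bounded(theta_p, lo, hi):
    """Inverse map: any real theta' lands strictly inside (lo, hi),
    so the minimizer can run without explicit constraints."""
    return lo + (hi - lo) / (1.0 + np.exp(-theta_p))
```

The fit then proceeds in θ′ space, and the physical parameter recovered by the inverse map automatically satisfies its bounds.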
C1.3 Galaxy weighting
The Amalgam@IAP team used a modified inverse-variance weighting scheme based on the full covariance matrix from SExtractor (approximated by the Hessian calculated by the LevMar minimization engine) to account for possible covariance between parameters and for differences in the recovery of the e_1 and e_2 components. This covariance matrix forms the basis for the per-galaxy shear covariance matrix. To avoid giving too much weight to high-S/N objects, the Amalgam@IAP team added a constant σ_s² to the diagonal entries. For constant-shear branches, they used the full per-object covariance C_i to estimate the shear as

\hat{\gamma} = \left( \sum_k C_k^{-1} \right)^{-1} \sum_j C_j^{-1} e_j,

using the 2-vector e_j and 2×2 matrix C_j^{-1}. In practice, the difference between using the full covariance matrix and its isotropic approximation was small.

For variable-shear branches, the Amalgam@IAP team used the provided corr2 code with isotropized scalar weights defined as

w_i = \frac{2}{\sigma_{i,1}^2 + \sigma_{i,2}^2 + 2\sigma_s^2},

where the denominator represents the quadrature sum of measurement error and shape noise.

C2 BAMPenn

This team used the Bayesian Fourier Domain (BFD) method from Bernstein & Armstrong (2014), which relies on weighted moments calculated in Fourier space and a prior for the noiseless distribution of galaxy moments (e.g., from deep data). Weighting is implicit rather than explicit in this Bayesian calculation. The ensemble shear from the mean of the Bayesian posterior should be unbiased in the limit that many galaxies are used for shear estimation, potentially avoiding noise biases that can plague maximum-likelihood methods. It does not result in a per-object shear estimate.

The submissions made during the challenge period came from an immature software pipeline and had errors that were identified after the fact. Currently, the machinery is in place only for a constant-shear analysis, not variable shear.
C3 EPFL_gfit
All submissions by the EPFL_gfit team used the gfit method. A few submissions also used a wavelet-based DWT Wiener denoising code from Nurbaeva et al. (2011), integrated into gfit. The gfit method uses a maximum-likelihood, forward model-fitting algorithm to measure galaxy shapes. An earlier version of gfit, used in the GREAT10 galaxy challenge (Kitching et al. 2010, 2012), was described in Gentile, Courbin & Meylan (2012). The version used in GREAT3 is completely new, written in Python, and relies on the NumPy, SciPy and PyFits libraries. The software has a modular design, so that additional galaxy models and minimizers can be plugged in fairly easily. The behavior of gfit is controlled through configuration files.

gfit requires catalogs generated via an automated process from input galaxy and PSF mosaic images by SExtractor (Bertin & Arnouts 1996). The following galaxy models, for which images are generated using GalSim, are currently supported:

(a) A pure disk Sérsic model.
(b) A sum of an exponential Sérsic profile (Sérsic n = 1) to model the disk and a de Vaucouleurs Sérsic profile (n = 4) to model the bulge. The disk and bulge share the same centroid and ellipticity.
(c) A model similar to the previous one but with a varying disk Sérsic index.

Almost all GREAT3 submissions used the second galaxy model, with the following eight parameters: galaxy centroid, total flux, flux fraction of the disk, bulge and disk radii, and ellipticity.

Fitting can be performed with two minimizers, using input SExtractor catalogs to get initial guesses for galaxy centroids, fluxes and sizes. The first minimizer is the SciPy Levenberg-Marquardt non-linear least-squares implementation. The second is a simple coordinate descent minimizer (SCDM), a loose implementation of the Coordinate Descent algorithm. In the SCDM, the model parameters are sequentially varied in a cycle, to explore all directions in parameter space. After each cycle, the change in the objective function is measured and the sense of variation maintained or reversed. The step size for each parameter is dynamically adjusted based on previous iterations. The algorithm is, by nature, slow but quite robust, with a failure rate below 1/1000 on GREAT3 images. Several stopping conditions are available and can be combined.

The EPFL_gfit submissions used a simple weighting scheme that was one of the options used by CEA-EPFL (below), involving constant weighting for all galaxies except those that have unusually large fit residuals, which are rejected entirely (typically a small percentage of the galaxies).

C4 CEA-EPFL
The CEA-EPFL team used an object-oriented framework written in Python and usable in contexts other than GREAT3 with minimal changes, including:
• Galaxy shape measurement (gfit from Sec. C3).
• Weight calculation (sfilter).
• PSF estimation (star shape measurement, PSF interpolation, PCA decomposition and reconstruction).
• Image coaddition routines.
• Wavelet-based tools for deconvolution, denoising, coaddition, and super-resolution.

gfit was described in Sec. C3; the remaining pipeline elements used in GREAT3 are described below.
C4.1 Weighting scheme
The sfilter tool uses catalogs produced by gfit to assign a weight to each galaxy. In GREAT3, two weighting schemes were used. The simpler scheme involved eliminating entries with large fit residuals by giving them weights of zero. The more complex scheme involved assigning weights based on a PCA analysis of the RMS difference between ellipticities fitted by gfit on GREAT3 data and those obtained after running gfit on GREAT3-like simulated data. The galaxy simulations were created using GalSim with GREAT3-like PSF, noise and S/N; the galaxy parameters were motivated by the outputs from a gfit analysis of the RSV branch. A PCA decomposition was then performed on a vector with first component

|\Delta e| = \sqrt{ (e_{1,\rm out} - e_{1,\rm in})^2 + (e_{2,\rm out} - e_{2,\rm in})^2 }.

The other PCA components were either (a) flux, disk and bulge radii, disk fraction, and gfit output parameters, or (b) SExtractor FWHM, size, S/N, flux, and gfit disk and bulge radii.

The first component, |Δe|, was plotted against the various PCA components to select a cut-off value v_0 that separated regions of low and high |Δe|. A weight w_low was assigned to all galaxies with v < v_0, with different values of w_low for choices (a) and (b).

C4.2 PSF estimation
For the three experiments with constant PSFs that were provided to the participants, the CEA-EPFL team used the provided PSFs directly. The spredict tool was used to estimate the PSF at the positions of galaxies in the variable-PSF and full experiments. The version of spredict used in GREAT3 supports two PSF models:
• An elliptical Moffat profile, based on maximum-likelihood fitting using GalSim to generate images. This was used in a few submissions to the ground branches.
• A data-driven model based on PCA decomposition of selected PSF images (with sufficiently high S/N) into either 10, 15, or 20 PCA components.

More details of these algorithms will appear in Paper II.
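The data-driven PCA model can be sketched with plain numpy. This is a hedged toy stand-in (the actual spredict selection, centring, and spatial-interpolation details differ): star images are stacked as vectors, an SVD provides the principal components, and each star is reconstructed from the leading ones.

```python
import numpy as np

def pca_psf_reconstruct(stars, n_components=10):
    """Fit a PCA basis to a stack of star images (n_star, ny, nx) and
    reconstruct each star from the leading components: a toy stand-in
    for a PCA-based, data-driven PSF model."""
    n, ny, nx = stars.shape
    X = stars.reshape(n, ny * nx)
    mean = X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal components
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    coeffs = (X - mean) @ basis.T      # per-star coefficients
    return (mean + coeffs @ basis).reshape(n, ny, nx)
```

In a full pipeline the per-star coefficients would then be interpolated across the field to predict the PSF at galaxy positions.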
The differences between submissions in a given branch arose mainly from the size of the postage stamps used for the fits; constraints placed on galaxy model parameters; minimizer options; weight functions; choice of galaxy models (though most used the second one in Sec. C3); and occasionally attempts to include wavelet-based denoising.

C5 CEA_denoise
The CEA_denoise team denoised the GREAT3 galaxy images using a publicly available, multi-scale wavelet-based code, mr_filter, based on Starck, Pires & Réfrégier (2006). They then measured unweighted second moments of the denoised galaxy images and noiseless PSF images using SExtractor. Finally, they corrected for PSF convolution by subtracting the PSF moments from the galaxy moments, as proposed by Rhodes, Refregier & Groth (2000) and Melchior et al. (2011).

The CEA_denoise team varied the denoising options (such as using 2 vs. 3 wavelet scales), and selected the denoising methods by comparing the original and filtered galaxies by eye. Strong denoising often resulted in blurry galaxies with correlated noise features around the galaxies. No weighting was applied to the measured shears.
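The moment-subtraction correction can be sketched as follows. This is a toy version (the subtraction is exact only for Gaussian profiles, and real pipelines use noise-robust moment estimators), not the CEA_denoise code:

```python
import numpy as np

def second_moments(img):
    """Unweighted, flux-normalised second central moments of an image."""
    y, x = np.indices(img.shape, dtype=float)
    flux = img.sum()
    xc, yc = (x * img).sum() / flux, (y * img).sum() / flux
    return (((x - xc) ** 2 * img).sum() / flux,
            ((y - yc) ** 2 * img).sum() / flux,
            ((x - xc) * (y - yc) * img).sum() / flux)

def moment_subtraction_e(gal_img, psf_img):
    """Subtract PSF second moments from observed galaxy moments, then
    form the ellipticity (e1, e2) from the corrected quadrupoles."""
    gxx, gyy, gxy = second_moments(gal_img)
    pxx, pyy, pxy = second_moments(psf_img)
    qxx, qyy, qxy = gxx - pxx, gyy - pyy, gxy - pxy
    return (qxx - qyy) / (qxx + qyy), 2.0 * qxy / (qxx + qyy)
```

For Gaussians, convolution adds second moments, so the subtraction recovers the pre-seeing ellipticity; for realistic profiles it is only approximate, which motivates the calibration discussions elsewhere in this appendix.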
C6 CMU_experimenters
The stacking method used by CMU_experimenters was a simple modification of the example script described in App. B. The basic steps were galaxy registration, stacking, and PSF correction of the stacked image.

First, CMU_experimenters measured the weighted first moments (centroids) for all galaxy images. They used the default GalSim interpolation routines to shift each galaxy so the centroid would be at the exact center of the postage stamp. Next, they stacked all galaxies in a single GREAT3 image using a simple unweighted average. Finally, they used GalSim routines for PSF correction (re-Gaussianization) to estimate the PSF-corrected distortion ê. The shear estimate for the field is simply ĝ = ê/2, since the stacked object is effectively round in the absence of a shear. There is a calibration factor of 1.02 for the intrinsic limitations of re-Gaussianization (Mandelbaum et al. 2012).

C7 COGS
All submissions from the COGS team used the im3shape galaxy model fitting code described in Zuntz et al. (2013).
C7.1 Galaxy model fitting
The COGS team used a two-component galaxy model, with a de Vaucouleurs bulge (Sérsic n = 4) and an exponential disk (n = 1). The two components were constrained to have the same half-light radius, centroid, and ellipticity. The seven free parameters in the fit were therefore total flux, bulge-to-total flux ratio, radius, centroid (x, y), and ellipticity.

The best-fitting model was identified by minimizing the squared residual between data and model image, using the LevMar implementation of the Levenberg-Marquardt algorithm (see Zuntz et al. 2013 for details). The parameter settings and optimizer termination criteria are given in the im3shape initialization file used for GREAT3 (https://github.com/barnabytprowe/great3-public/wiki/COGS-.ini-file). The full galaxy postage stamps were used for all fits.

One important parameter for im3shape is the upsampling, the internal super-resolution at which profiles are drawn and FFT convolutions performed. For speed, early submissions used the native resolution, which causes artifacts in the modeling and increases biases. Later COGS submissions set upsampling = 7. These submissions required similarly upsampled PSF images, which were generated via bicubic interpolation across the noise-free PSF images provided with the GREAT3 data. These entries with upsampling = 7 can be considered to be the baseline set of COGS submissions with high-precision input settings. These submissions are referenced by their label u7 in this paper.

C7.2 Noise bias calibration
Some im3shape submissions include a multiplicative calibration factor to correct for expected noise biases in maximum-likelihood shape estimation. These can be grouped under the following three labels:
• c1: A correction for an isotropic multiplicative bias ⟨m⟩ is applied. This expected noise bias was estimated in simulations performed by Kacprzak et al. (2014, table 2) using a galaxy population that differs somewhat from that in GREAT3.
• c2: A correction for an isotropic multiplicative bias ⟨m⟩ is applied. This bias was estimated using the CGV deep data. The ellipticity of galaxies in the CGV deep fields was measured using im3shape (with upsampling = 7). These images were then degraded by adding noise to match the regular (non-deep) GREAT3 images, and the ellipticities re-measured. By fitting a polynomial including a constant, linear, and cubic term to ε_deep vs. ε_degraded, the COGS team estimated a calibration factor m(ε) and then calculated an expected calibration bias ⟨m⟩ based on p(ε_deep). The CGV deep data was used since it exhibited less variation in the image properties than the deep CGC data. This possibly relates to the relatively strong seeing variation identified in the deep CGC data, discussed in Appendix C20.
• c3: A correction for an isotropic multiplicative bias ⟨m⟩ is applied. This factor was estimated using the CGV deep data in a similar manner to c2, but using only galaxy models with best-fitting |ε_deep| below a cut in the deep data to estimate m(ε). This removal of outliers was found to provide a better fit to the (most numerous) galaxies with lower ellipticity values.

No calibration was made for additive biases due to noise, although these are expected where PSFs are anisotropic (Kacprzak et al. 2012).

C7.3 Differences between GREAT3 submissions
The main differences between submissions were the correction of bugs in the interface between im3shape and the GREAT3 data format, the upsampling, and the noise bias calibrations applied. Early submissions used low-accuracy settings for rapid basic validation of the GREAT3 data, and are unsuitable as a basis for careful scientific analysis. However, the later set of submissions (with labels u7, c1, c2, and c3 as described above) can be used for fair scientific comparison, and to explore systematic errors in the im3shape approach more generally. All galaxies were given uniform statistical weights when generating submissions.

C8 E-HOLICS
The E-HOLICs method (Okura & Futamase 2011, 2012, 2013) is a moment-based method based on the KSB method (Kaiser, Squires & Broadhurst 1995). One important improvement of the E-HOLICs method compared to KSB is its use of an elliptical (not circular) weight function.

In the E-HOLICs analysis of GREAT3 data, all galaxies that were used for the analysis were uniformly weighted. However, galaxies with estimated ellipticities above a cutoff were rejected (i.e., given zero weight). The E-HOLICs team applied a correction for systematic error due to pixel noise as derived in the above references, with different submissions having different corrections.

C9 EPFL_HNN
The EPFL_HNN method deconvolves the data by the given PSF, represented in linear-algebra form as a Toeplitz matrix. This allows solution of the convolution equation by applying the Hopfield Neural Network (HNN) forward recurrent algorithm. At each iteration, the selected neurons of the network (image pixels) are updated to minimize the energy function. To measure the ellipticity of galaxies in deconvolved images, the second-order moments of the image autocorrelation function are used (Nurbaeva et al. 2014).

HNN is an unsupervised neural network, so input galaxy stamps could be initialized to zero. To reduce CPU time, the observed data was used as input. The output consists of reconstructed images of the deconvolved galaxies, their autocorrelation functions, and an ellipticity catalog. All galaxies received equal weighting when calculating the average shears, and no calibration correction was applied.

Differences between submissions in each branch include:
• the size of the effective galaxy postage stamp;
• the pixel updating value (a smaller number gives finer reconstruction, while increasing the iteration number and CPU time); and
• filtering (removing the galaxies for which the HNN algorithm failed to converge).

C10 EPFL_KSB
The EPFL_KSB team used an implementation of the KSB method (Kaiser, Squires & Broadhurst 1995; Luppino & Kaiser 1997; Hoekstra et al. 1998) based on the KSBf90 pipeline (Heymans et al. 2006). The KSB method parametrises galaxies and stars according to their weighted quadrupole moments. In the standard KSB method, a Gaussian filter of scale length r_g is used, where r_g is the galaxy size. The EPFL_KSB team also tried other weighting functions.

The main assumption of the KSB method is that the PSF can be described as a small but highly anisotropic distortion convolved with a large circularly symmetric function. With that assumption, the shear can be recovered to first order from the observed ellipticity of each galaxy via

\gamma = (P^\gamma)^{-1} \left( e^{\rm obs} - \frac{P^{\rm sm}}{P^{\rm sm*}}\, e^* \right),   (C4)

where asterisks indicate quantities that should be measured from the PSF model at that galaxy position, P^{sm} is the smear polarisability (see Heymans et al. 2006 for definitions), and P^γ is the correction to the shear polarisability that includes the smearing with the isotropic component of the PSF. The ellipticities are constructed from weighted quadrupole moments, and the other quantities involve higher-order moments. All definitions are taken from Luppino & Kaiser (1997). The shear contribution from each galaxy is weighted according to the quadrature sum of shape noise and measurement error, calculated as in appendix A of Hoekstra, Franx & Kuijken (2000).

Submissions from this team fall into two categories: those using the standard KSB Gaussian filter, and those using a combination of KSBf90 and a multiresolution Wiener filter with a B-spline wavelet transform (mr_filter; Starck, Pires & Réfrégier 2006). The latter submissions tended to perform better. Among the first type of submissions, the better-performing ones use a polynomial fitting formula for P^γ based on the galaxy size and S/N, and rejection of galaxies with extremely large values of P^γ.

C11 EPFL_MLP
The EPFL_MLP team's method involved training a Multilayer Perceptron (MLP) neural network to measure galaxy shapes. The MLP is a feedforward neural network with one hidden layer (Haykin 2009; Rojas 1996). The arctangent function is used as the activation function. The input data are the set of neurons, represented by the galaxy image pixels. The output is the ellipticity catalog. The MLP is trained on simulated data with the standard back-propagation algorithm.

MLP works in two passes. During the forward pass, the weight matrix is applied to the training set, and the output is compared to the desired result to obtain the error gradients, which are averaged over the batch set. During the backward pass, the weight updates Δw are calculated from the gradient descent method using the learning rate.

This method uses a batch learning scheme, where the input data is a batch of galaxy stamps and the weights are updated based on the error rate averaged over the batch.

For each submission, the EPFL_MLP team varied the following parameters: the number of neurons in the hidden level, the learning rate, the epoch number (an epoch corresponds to one forward pass and one backward pass), the batch number (batch learning improves stability by averaging), and the momentum rate (μ indicates the relative importance of the previous weight change in the new weight increment).

The training set consists of galaxy images, simulated using GalSim with the following parameters: disk and bulge half-light radii, ellipticity modulus |e|, orientation angle, galaxy total flux, bulge ratio, and signal-to-noise ratio. Both bulge and disk have the same centroid and ellipticity. No weighting was applied to shear estimators, and no calibration factors were applied.
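The forward/backward passes described above can be made concrete with a minimal one-hidden-layer MLP using the arctangent activation and full-batch gradient descent. This is an invented toy regression (the real inputs were galaxy image pixels and the outputs ellipticities; momentum is omitted here):

```python
import numpy as np

rng = np.random.default_rng(4)

# invented toy task: regress a smooth target from 2-D inputs
X = rng.uniform(-1.0, 1.0, (256, 2))
y = (0.5 * X[:, 0] - 0.3 * X[:, 1])[:, None]

n_hidden, lr = 16, 0.05
W1 = rng.normal(0.0, 0.5, (2, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 1)); b2 = np.zeros(1)

for epoch in range(5000):
    # forward pass
    z = X @ W1 + b1
    h = np.arctan(z)
    err = h @ W2 + b2 - y
    # backward pass: gradients of the batch-averaged squared error
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) / (1.0 + z ** 2)   # d(arctan z)/dz = 1/(1+z^2)
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse = float(np.mean((np.arctan(X @ W1 + b1) @ W2 + b2 - y) ** 2))
```

Adding a momentum term would mix each update with the previous one, which is the stabilising role of the μ parameter mentioned above.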
C12 FDNT
This team used an implementation of the Fourier-domain nulling technique (FDNT; Bernstein 2010). This method estimates a per-galaxy shear in the Fourier domain after PSF effects have been removed by Fourier division (equivalent to deconvolution in real space). This team's approach was then to apply bias corrections based on image simulations. The bias is a function of (1) S/N, (2) resolution, (3) PSF shape, (4) the radial flux distribution of the galaxy, and (5) the radial flux distribution of the PSF. Additive bias was found to be directly proportional to the PSF shape.

In some cases, galaxies were weighted according to the combination of shape noise (determined from the deep data) and shape measurement uncertainty. All FDNT submissions (v0.1 through v1.3) have the wrong bias corrections applied, and hence the results submitted during the challenge period are not indicative of the real performance of this method once this error is corrected.
C13 Fourier_Quad
This team used Fourier-space methods described in a sequence of papers (Zhang 2008, 2010; Zhang & Komatsu 2011; Zhang 2011; Zhang, Luo & Foucaud 2013). The shear estimators for the two components of the reduced shear, g_1 and g_2, are defined based on the Fourier transform of the galaxy image. There are three quantities, G_1, G_2, and N, based on multipole moments of the spectral density distribution of the galaxy image in Fourier space:

G_1 = -\frac{1}{2} \int {\rm d}^2 k\, (k_x^2 - k_y^2)\, T(\mathbf{k})\, M(\mathbf{k}),   (C5)
G_2 = -\int {\rm d}^2 k\, k_x k_y\, T(\mathbf{k})\, M(\mathbf{k}),
N = \int {\rm d}^2 k \left[ k^2 - \frac{\beta^2}{2} k^4 \right] T(\mathbf{k})\, M(\mathbf{k}),

where

T(\mathbf{k}) = \left| \tilde{W}_\beta(\mathbf{k}) \right|^2 / \left| \tilde{W}_{\rm PSF}(\mathbf{k}) \right|^2,   (C6)
M(\mathbf{k}) = \left| \tilde{f}_S(\mathbf{k}) \right|^2 - \left| \tilde{f}_B(\mathbf{k}) \right|^2,

and \tilde{f}_S(\mathbf{k}), \tilde{f}_B(\mathbf{k}), \tilde{W}_{\rm PSF}(\mathbf{k}), and \tilde{W}_\beta(\mathbf{k}) are the Fourier transforms of the galaxy image, an image of background noise, the PSF image, and an isotropic Gaussian function of scale radius β, respectively. The latter is defined as

W_\beta(\mathbf{x}) = \frac{1}{2\pi\beta^2} \exp\left( -\frac{|\mathbf{x}|^2}{2\beta^2} \right).   (C7)

The factor T(k) is used to convert the form of the PSF to an isotropic Gaussian function. The value of β should be at least slightly larger than that of the original PSF W_PSF to avoid singularities in the conversion. If the intrinsic galaxy images are statistically isotropic, the ensemble averages of the shear estimators defined above recover the shear to second order in accuracy, i.e.,

\frac{\langle G_j \rangle}{\langle N \rangle} = g_j + O(g_{1,2}^3)   (C8)

for j = 1, 2. Note that ensemble averages are taken for G_1, G_2, and N separately; these should be weighted averages, as we will discuss in Sec. C13.1. In practice, G_1, G_2, and N are calculated using discrete Fourier transforms.

In the presence of source Poisson noise, the method is modified/extended by adding more terms to the shear estimators to keep them unbiased.
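The integrals in Eq. (C5) can be approximated as direct sums over a discrete k-grid. A hedged numpy sketch (a toy discretisation, not the team's code; it assumes the T(k) reweighting has already been folded into the supplied M(k) array):

```python
import numpy as np

def fourier_quad_moments(M, beta):
    """Approximate G1, G2, N of Eq. (C5) by sums over the k-grid of a
    2-D array M(k), laid out in standard FFT frequency ordering."""
    ny, nx = M.shape
    ky = 2.0 * np.pi * np.fft.fftfreq(ny)
    kx = 2.0 * np.pi * np.fft.fftfreq(nx)
    KX, KY = np.meshgrid(kx, ky)           # shape (ny, nx), matching M
    k2 = KX ** 2 + KY ** 2
    G1 = -0.5 * np.sum((KX ** 2 - KY ** 2) * M)
    G2 = -np.sum(KX * KY * M)
    N = np.sum((k2 - 0.5 * beta ** 2 * k2 ** 2) * M)
    return G1, G2, N
```

For an isotropic M(k) the anisotropy moments G1 and G2 vanish by symmetry, consistent with the statement that an unsheared, statistically isotropic population yields zero mean shear.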
Statistically, the Poisson noise has a scale-independent spectral density in Fourier space. Its amplitude can be estimated in the large-wavenumber limit, at which the source spectrum is subdominant due to filtering by the PSF. The estimated Poisson noise spectrum can then be subtracted from the spectral density of the image on all scales. This operation is particularly suitable for these shear estimators, as the ensemble averages are taken directly on the spectral density. Finally, the same procedure should be repeated on the neighbouring image of background noise, as the Poisson noise in the source image is partly due to the background photons. Removing the source Poisson noise effect requires modifying the definition of M(k) to

M(\mathbf{k}) = \left| \tilde{f}_S(\mathbf{k}) \right|^2 - F_S - \left| \tilde{f}_B(\mathbf{k}) \right|^2 + F_B   (C9)

with

F_{S,B} = \frac{ \sum_{|\mathbf{k}_j| > k_c} \left| \tilde{f}_{S,B}(\mathbf{k}_j) \right|^2 }{ \sum_{|\mathbf{k}_j| > k_c} 1 },   (C10)

where k_c is a value at which the Poisson noise amplitude dominates over the source signal, typically a fixed fraction of the Nyquist wavenumber.

C13.1 GREAT3 Experience
In GREAT3, the PSF for constant-shear branches was determined by stacking the spectral densities of the nine provided PSF images. Several different weighting schemes were used, for each of which the weight is a function of the total source flux F (rather than the shape parameters) to avoid introducing systematic biases. Shear estimation for the jth component was carried out via

\frac{\sum_i G_{j,i} W_i}{\sum_i N_i W_i} = g_j.   (C11)

Since the background noise in GREAT3 images is uncorrelated, its power spectrum in Fourier space is scale-independent. Thus, its contamination can be directly removed using the source image itself, without using a neighbouring background image, rewriting Eq. (C9) as

M(\mathbf{k}) = \left| \tilde{f}_S(\mathbf{k}) \right|^2 - F_S.   (C12)

Three weighting options were tried:

(i) W = 1, for which the contribution to the shear signal scales as (S/N)², guaranteeing equal weights for the galaxies in each 90° rotated pair and maximizing shape noise cancellation. However, in terms of contribution to the ensemble shear signal, the bright galaxy pairs are much more important than the faint ones.
(ii) W = (S/N)^{-2} for galaxies that can be easily identified as 90° rotated pairs by sorting the galaxy luminosity distribution. For two galaxies in a pair, their average flux is used for calculating W. For galaxy pairs that are too faint to be identified, W = (S_min/N)^{-2}, where S_min is the minimum galaxy flux from the identified pairs.
(iii) W = (S/N)^{-2} for all galaxies, without identifying pairs.

The first two weighting options are effective in GREAT3 due to its shape noise cancellation, which is not relevant for real data. The last weighting scheme is applicable to real data, though it is not yet optimal.

To calculate the shear-shear correlation function using the shear estimator defined in Eq.
(C5), the Fourier_Quad team would ideally use (Zhang & Komatsu 2011)

\langle \gamma_j(\mathbf{x}) \gamma_j(\mathbf{x}+\Delta\mathbf{x}) \rangle = \frac{ \sum_i G_j(\mathbf{x}_i)\, G_j(\mathbf{x}_i+\Delta\mathbf{x}) }{ \sum_i N(\mathbf{x}_i)\, N(\mathbf{x}_i+\Delta\mathbf{x}) }.   (C13)

The above formula is similar (but not equivalent) to the usual shear-shear correlation calculation using ellipticities ε_{1,2} and weights W:

\langle \gamma_j(\mathbf{x}) \gamma_j(\mathbf{x}+\Delta\mathbf{x}) \rangle = \frac{ \sum_i \varepsilon_j(\mathbf{x}_i)\, \varepsilon_j(\mathbf{x}_i+\Delta\mathbf{x})\, W(\mathbf{x}_i)\, W(\mathbf{x}_i+\Delta\mathbf{x}) }{ \sum_i W(\mathbf{x}_i)\, W(\mathbf{x}_i+\Delta\mathbf{x}) }.   (C14)

To use the GREAT3 presubmission script, the Fourier_Quad team converted G_1, G_2, N to per-galaxy ε_1, ε_2, W via ε_j = G_j/N and W = N. This choice had several drawbacks, the main one being that for lower-S/N sources, G_1, G_2, and N can take both positive and negative values, due to the subtraction of the background noise contribution in Eq. (C12). As a result, ε_{1,2} can be extremely noisy (|ε_{1,2}| ≫ 1), which is not a problem if the shear correlation is calculated using Eq. (C13). The proof of concept for variable shear estimation using this method is the subject of ongoing work.

C14 HSC-LSST-HSM
The HSC/LSST-HSM team attempted to reproduce the results of the publicly released shear estimation example script, but using the HSC/LSST pipeline for the bookkeeping and a slightly older version of the re-Gaussianization method (Hirata & Seljak 2003). From a scientific perspective the results should be the same, so this is primarily a sanity check that the HSC pipeline has no bugs that would cause re-Gaussianization to perform differently.

The HSC/LSST pipeline was used for the preliminary parts of the data processing, which in this case was mostly just bookkeeping. Only the first PSF image in the constant-PSF branches was used, after shifting it (using 5th-order Lanczos interpolation) to match the conventions of the HSC/LSST pipeline. Objects were selected by cutting out postage stamps according to the provided galaxy catalog; the HSC/LSST pipeline object detection routines were not used. Then, an early implementation of re-Gaussianization that is part of the HSC pipeline was run. Shear responsivity, weighting, and an additional calibration factor of 0.98 were all applied in a way identical to the publicly released example script that uses the GalSim implementation of re-Gaussianization.
C15 MBI
The MBI team carried out a hierarchical (multi-level) Bayesian joint inference (MBI) of the shear and the intrinsic ellipticity distribution given the image pixel data, assuming simply-parametrised galaxy models, simply-parametrised PSF models, and a simply-parametrised $p(\varepsilon)$. The team's goal was to begin the exploration of this new approach to shear measurement in a realistic setting, without expecting to be competitive given the simplicity of its PSF and galaxy models, but hoping to learn something by comparing various hierarchical inferences with the standard maximum likelihood estimates. A paper describing this method (Schneider et al. 2014) gives an overall picture of the MBI framework and several ideas for improvements beyond the implementation used in the GREAT3 challenge.

The MBI team modeled the PSF with a mixture of three Gaussians using the star image data. Galaxies are modeled as elliptical Sérsic profiles (using constrained Gaussian mixtures; Hogg & Lang 2013) with six parameters: position, effective radius, Sérsic index, and two (lensed) ellipticity components. The Tractor software developed by Lang and Hogg (Lang et al. in prep.) was used for these low-level individual galaxy inferences: the posterior PDF for each galaxy's model parameters is sampled using the ensemble MCMC sampler emcee (Foreman-Mackey et al. 2013), starting near the mode of the posterior found by a simple non-linear least squares optimizer. These individual galaxy model inferences are carried out in embarrassingly parallel fashion.

The intrinsic (pre-lensing) galaxy $p(\varepsilon)$ is modeled as a Gaussian in both components, centred on zero and with width $\sigma_\varepsilon$. This parameter is inferred jointly for both shear components for each field by importance sampling the emcee outputs with a flat hyperprior on $\log \sigma_\varepsilon$ (assuming an uninformative prior on lensed ellipticity), using the standard relation between shear, intrinsic, and observed ellipticity.
The best results use this simple Gaussian prior; a double Gaussian did not improve accuracy. For GREAT3 submissions, the MBI team reported the posterior mean estimates of the shear components. They only entered the constant-shear and constant-PSF branches of the challenge, where their simple assumptions are valid and no PSF interpolation is required. Of the six branches fitting this description, they did not submit to two (RSC and MSC) due to lack of time.

MBI team submissions are labeled as follows:
• Optimal Tractor: The shear estimator is the mean of the maximum likelihood galaxy lensed ellipticity estimates for all galaxies in the field.
• Sample Tractor: The shear estimator is the mean of all samples from all galaxies' lensed ellipticity posterior PDFs.
• Important Tractor: Submissions derived from the importance sampling analysis, assuming an independent Gaussian $p(\varepsilon)$ in each field.

Some submissions experimented with other aspects of the method. For example, those labeled "multi-baling" involved inferring a $p(\varepsilon)$ common to five fields at a time. Generally, importance sampling is a process for estimating the properties of some distribution despite only having samples generated from a different distribution. Because of this difference, the samples that are drawn must be reweighted.
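The importance-sampling reweighting just described can be sketched generically. The following is a minimal illustration of the idea only; the target and proposal densities here are arbitrary choices for demonstration, not the MBI team's actual $p(\varepsilon)$ hyperparameter inference.

```python
import math
import random

# Generic importance-sampling sketch: estimate the mean of a target
# distribution p using samples drawn from a different proposal q,
# by reweighting each sample with p(x)/q(x).
random.seed(1)

# Proposal: uniform on [-1, 1] (constant density, so it cancels in the ratio)
xs = [random.uniform(-1.0, 1.0) for _ in range(200000)]

def p(x, mu=0.3, sigma=0.2):
    """Unnormalised target density: Gaussian with mean 0.3, width 0.2
    (illustrative values only)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Self-normalised importance weights; normalisation constants cancel.
w = [p(x) for x in xs]
mean_est = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)  # approximately 0.3
```

The same mechanism lets one evaluate a new prior (here the Gaussian `p`) using posterior samples that were generated under a different, e.g. flat, prior.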
Submissions labeled "deep" used the deep fields to obtain a hyperprior on the $p(\varepsilon)$ width parameter $\sigma_\varepsilon$, which was then asserted during the importance sampling of the wide fields. The MBI team additionally experimented with informative prior PDFs for the lensing shear, asserting the shear components to have been drawn from a Gaussian distribution centred on zero with width $\sigma_g$.

The MBI team attempted no explicit calibration of any kind. Finally, we note that their approach is general, and can easily be attached to other shape measurement algorithms.

C16 MaltaOx
The Malta-Oxford team based their measurements on the lensfit algorithm (Miller et al. 2013). This method measures the likelihood of PSF-convolved galaxy models fitted to the pixel data for individual galaxies, adopting a Bayesian marginalisation over nuisance parameters but using a frequentist likelihood estimate of ellipticity for each galaxy. Shear for the constant-shear branches was estimated from the weighted mean of galaxy ellipticity values.

The galaxy models were two-component exponential disk plus de Vaucouleurs bulge, with fixed relative ellipticity and scale length. The galaxy position, scale length, total flux, and bulge fraction were nuisance parameters. For GREAT3, the priors for the marginalisation over galaxy scale length were obtained by running lensfit on the GREAT3 deep data, and fitting a lognormal distribution to the measured scale lengths, accounting for the ellipticity-dependent size cut (App. A1) in the fitting process. The ellipticity prior was similarly derived from lensfit fits to the GREAT3 deep data, although it only enters the final shear estimate as part of the weight function. The individual galaxy weight is an inverse-variance weight, defining the variance as the quadrature sum of ellipticity measurement variance and shape noise (see Miller et al. 2013 for details).

For CGC and RGC, where noise-free PSFs were provided, the MaltaOx team used a modified version of the lensfit PSF modelling code to convert the nine images for each subfield into a single oversampled PSF model in a pixel basis set. In the one variable PSF branch that they entered, they used the most recent lensfit PSF modelling code without modification. However, the data format required many modifications to work with this code, so they lacked time to optimise the assumed scale length of variation of the PSF.

When used for CFHTLenS (Heymans et al. 2013), noise bias was calibrated using simulations that matched the observations. For GREAT3, the MaltaOx team wanted to test a new self-calibration method (to be described in a future paper), integral to the likelihood measurement process, that does not rely on external data or simulations. The final MaltaOx submissions used this self-calibration method. A final post-measurement step to isotropise the weights, to remove $S/N$-dependent orientation bias, was also applied.
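The inverse-variance weighting described above can be sketched as follows. This is a minimal illustration only: the shape-noise value `sigma_shape = 0.25` and the per-galaxy measurement errors are assumed for demonstration, not taken from Miller et al. (2013).

```python
# Sketch of lensfit-style inverse-variance weighting: the per-galaxy
# variance is the quadrature sum of the ellipticity measurement variance
# and the intrinsic shape-noise variance. sigma_shape = 0.25 is an
# assumed, illustrative value.
def shear_weights(sigma_meas, sigma_shape=0.25):
    """Return w_i = 1 / (sigma_meas_i**2 + sigma_shape**2) for each galaxy."""
    return [1.0 / (s ** 2 + sigma_shape ** 2) for s in sigma_meas]

def weighted_mean_ellipticity(e, w):
    """Weighted mean of per-galaxy ellipticities: a constant-shear estimator."""
    return sum(ei * wi for ei, wi in zip(e, w)) / sum(w)

# Toy data: four galaxies with different measurement errors.
e_obs = [0.05, -0.02, 0.10, 0.01]
w = shear_weights([0.05, 0.30, 0.10, 0.20])
g_hat = weighted_mean_ellipticity(e_obs, w)
```

Galaxies with larger measurement error receive lower weight, so noisy shapes contribute less to the shear estimate without being discarded outright.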
C17 MegaLUT
MegaLUT uses a supervised machine learning technique to estimate galaxy shape parameters by measuring the PSF-convolved, noisy galaxy images. The method can be seen as a detailed empirical calibration of a priori inaccurate shape measurement algorithms, such as raw moments of the observed galaxy image. The distinctive feature of MegaLUT is to completely leave it to the machine learning algorithm to "deconvolve" and correct crude shape measurements for the effects of the PSF and for noise bias, instead of calibrating only the residual biases of a priori more accurate techniques. In this way, the input to the machine learning algorithm is close to the recorded information of each galaxy, avoiding potential information loss from deconvolutions. A further advantage of this approach is its very low computational cost, due to the use of simple shape measurements.
C17.1 MegaLUT implementation for GREAT3
To build the learning samples on which MegaLUT is trained for GREAT3, the MegaLUT team used simple Sérsic profiles to represent the galaxies. They can therefore train the algorithm to directly predict the Sérsic profile parameters, in particular the ellipticity. For branches with constant known PSFs, this training was performed separately for each PSF. The measurements are based on SExtractor (Bertin & Arnouts 1996), the adaptive moments implemented in GalSim (Rowe et al. 2014; Hirata & Seljak 2003), and, for some submissions, on moments of the discrete autocorrelation function (ACF; van Waerbeke et al. 1997). The most fundamental change in MegaLUT with respect to its implementation for GREAT10 (described in Tewes et al. 2012) is the machine learning itself. MegaLUT now uses feed-forward artificial neural networks (ANNs), which are trained interchangeably via the SkyNet (Graff et al. 2014) or FANN (Nissen 2003) implementations. The method works in effectively the same way for control and realistic galaxy branches, and for ground- and space-based branches.

For multiepoch branches, the MegaLUT team coadded the images with the stacking algorithm provided by the GREAT3 EC. For the pre-deadline submissions, the coaddition process was not simulated in the learning sample, and MegaLUT could therefore not learn about related biases, which will be the subject of further work. Regarding the variable PSF branches, the MegaLUT team developed an approach that incorporates PSF interpolation into the machine learning. In essence, the galaxy position is included as an input to the ANN, which is trained using PSFs at various locations. Prior to the deadline, this treatment of variable PSF branches was not sufficiently mature to be used as a proof of concept of this novel approach.

The MegaLUT team submissions do not use the deep datasets, and do not weight the per-galaxy shear estimators, aside from rejections following simple criteria. The time per galaxy listed in Table 2 for this method, 20 ms, includes the overhead involved in generating a typical-sized training dataset as well as the training of the ANN. However, once the ANN has been trained, the shear estimation per galaxy takes roughly 3 ms.
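The supervised-calibration idea at the heart of MegaLUT can be illustrated with a drastically simplified stand-in: learn a mapping from a crude, biased shape measurement to the true ellipticity on a training set of simulated galaxies, then apply it to new measurements. The actual method uses feed-forward ANNs and many input features; a one-dimensional linear least-squares fit is used here purely to show the principle, and all numbers are illustrative.

```python
import random

# Toy stand-in for MegaLUT's machine-learning step. The "crude measurement"
# is multiplicatively biased (e.g. PSF dilution) and noisy; the regression
# learns to undo the bias from simulations with known true ellipticities.
def fit_linear(x, y):
    """Least-squares fit y ~ a*x + b via the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

random.seed(0)
# Training sample: simulated galaxies with known true ellipticity.
e_true = [random.uniform(-0.6, 0.6) for _ in range(2000)]
# Crude measurement: 30 per cent multiplicative bias plus pixel-noise scatter.
e_obs = [0.7 * e + random.gauss(0.0, 0.05) for e in e_true]

a, b = fit_linear(e_obs, e_true)           # "training" on the simulations
e_corrected = [a * e + b for e in e_obs]   # apply the learned calibration
```

Note that, as discussed in Appendix C17.2, the training distributions need not match the observations exactly; they only need to cover the region of parameter space where the regression must be accurate.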
C17.2 Differences between submissions
Multiple submissions within a branch differ in the learning sample generation, the shape measurement, the selection of ANN input parameters, the ANN architecture, and the rejection of faint or unresolved galaxies. The distribution of shape parameters of the learning sample does not have to closely mimic the "observations", as it does not act as a prior. For those parameters that do affect the shape measurement output, the distributions used to generate the learning sample merely define the region in parameter space over which the machine learning can perform an accurate regression.
C18 MetaCalibration
The philosophy behind the MetaCalibration method is that since shear systematics depend on the galaxy population and PSF model, all shear systematics corrections should be determined directly from the images themselves (rather than from independently-generated simulations). In practical terms, the method involves constructing a model of the image with the shear as a parameter. Varying the shear parameter allows a direct measurement of the shear response from the difference between pipeline outputs with and without the additional shear. For GREAT3, this team used re-Gaussianization as the shear estimation method, but in principle MetaCalibration could be used for any method.

In detail, inspired by Kaiser (2000), the MetaCalibration team constructs the model sheared image by deconvolving the original image by the PSF model, applying a small shear to the deconvolved image, and then convolving the result with a slightly enlarged version of the original PSF. The final, lossy step is required because the applied shear moves noisy modes inside the PSF kernel window; reconstructing a sheared version of the original image would require access to information on scales hidden by the original PSF. The measured sensitivity is correct for the version of the image with the enlarged PSF, so the final shear measurements are performed on the reconvolved image, with an enlarged PSF but no applied shear. This procedure should allow us to measure shear calibration biases for any shear measurement pipeline; for GREAT3, the MetaCalibration team used the GalSim implementation of re-Gaussianization, but the approach could be applied to self-calibrate any other shear estimation method.

Since the per-object response is quite noisy, using a per-object response or even a per-image mean over 10000 galaxies proved unstable. The entire set of images for a given branch was used to model the shape of the likelihood curve and derive the shear response.

This approach was used to directly calibrate out multiplicative systematics from the data. An extension of the method to remove additive bias was not implemented before the end of the challenge. Also, the anisotropic correlated noise in the images with added shear was not whitened or made four-fold symmetric; there are plans to test the effects of this limitation as well, with an updated version of GalSim that can impose symmetry on the final noise field.
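The response-measurement step can be illustrated numerically with a toy pipeline. The sketch below replaces the actual image operations (deconvolution, shearing, reconvolution with an enlarged PSF, re-Gaussianization) with a stand-in measurement whose response is known to be 0.8; only the finite-difference calibration logic is demonstrated.

```python
# Minimal numerical sketch of the MetaCalibration idea: estimate the shear
# response R = d<e>/dgamma by re-measuring shapes after applying small extra
# shears +/-dg, then divide the biased measurement by R.
def toy_measure(gamma):
    """Stand-in shape-measurement pipeline with a known response of 0.8
    (i.e. a 20 per cent multiplicative bias)."""
    return 0.8 * gamma

def metacal_response(measure, gamma, dg=0.01):
    """Central finite difference of the measurement w.r.t. applied shear."""
    return (measure(gamma + dg) - measure(gamma - dg)) / (2.0 * dg)

true_shear = 0.03
e_mean = toy_measure(true_shear)               # biased pipeline output
R = metacal_response(toy_measure, true_shear)  # measured response, ~0.8
gamma_hat = e_mean / R                         # calibrated shear estimate
```

In the real method the two re-measurements are made on images with and without the additional shear, and (as noted above) the response must be averaged over many galaxies because the per-object estimate is noisy.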
C19 Wentao_Luo
This team used an independent implementation of the re-Gaussianization method (Hirata & Seljak 2003). Given the choice of applying the PSF dilution correction to the re-Gaussianized image or the version after application of a rounding kernel, they used the latter, as it was found during tests on STEP2 (Massey et al. 2007a) images to give better performance. The rounding kernel was constructed following Bernstein & Jarvis (2002).

Due to convergence issues, only ∼ per cent of the galaxies had estimated shapes, and a further size cut reduced the number to ∼ per cent, resulting in quite noisy submitted results.

Submissions were made using two weighting schemes. The first, from Mandelbaum et al. (2005), is inverse-variance weighting using the quadrature sum of shape noise and measurement error due to pixel noise. The second is an ellipticity-dependent weight from Bernstein & Jarvis (2002) ($w = 1/\sqrt{e^2 + 2.\,\sigma_e^2}$, using the measurement error due to pixel noise). The former led to better results than the latter, by roughly a factor of ∼ in Q score.

The shear responsivity (to convert from distortion to shear) was calculated as in Bernstein & Jarvis (2002), and no additional calibration factors were applied.

C20 ess
The ess team implemented the Bayesian model-fitting (BMF) shear measurement algorithm introduced by Bernstein & Armstrong (2014). For general details about the implementation (https://github.com/esheldon/ngmix), see Sheldon (2014). The only details of importance that are not in Sheldon (2014) are about PSF fitting, prior determination, and choice of models.

For constant PSF branches, the ess team fit three unconstrained Gaussians to one of the provided PSF images using an Expectation Maximization (EM) algorithm, chosen for its high level of stability. For subfields without strong defocus, the residual of the model with respect to the PSF was typically either consistent with random noise or had a triangular shape, perhaps due to trefoil in the PSF (which cannot easily be represented by the adopted PSF model). In fields with strong defocus, the residuals were quite bad; see Sec. 4.8 for a further discussion of this point.

A number of different galaxy models were used, including full Sérsic profiles, but the best performing on the realistic galaxy branches was a simple exponential disk. The fits were carried out using the full postage stamps. Fits to the deep field images were used to estimate priors on the size and flux. The joint size–flux distribution averaged over all deep fields in the branch was then parametrized by sums of Gaussians, again fit using an EM algorithm.

For ellipticity, the ess team tried fitting the deep fields and using the galaxy model fits provided by the GREAT3 team based on fitting the COSMOS HST data at full resolution to a Sérsic model (Lackner & Gunn 2012). The latter approach led to better results than the former.

Because the Bernstein & Armstrong (2014) algorithm breaks down at high shear, the ess team iterated the solution on the constant-shear fields, expanding the Taylor series about the result from the previous iteration. In the absence of additive errors, this iteration converges in three iterations even for ∼ per cent shears, but since the results did have some additive bias, full convergence was not possible.

The ess team worked primarily with the realistic galaxy branch because performance on the control branch was rather poor. Their estimates of galaxy properties on the deep fields for the CGC branch suggested a strong variation in their statistical properties both within the branch and compared to RGC. Priors are crucial for Bernstein & Armstrong (2014), and this variation may have resulted in poor performance. The RGC deep fields seemed more uniform in their properties according to this team's analysis. An analysis after the fact using the truth tables showed that the atmospheric PSF FWHM for the deep fields in both branches had the same mean value, but a dispersion of 0.12′′ vs. 0.08′′ for CGC and RGC, supporting the claim that the deep fields in CGC exhibited more variation than in RGC.

C21 sFIT
The sFIT (shapes from iterative training) method is a set of principles to use simulations to characterize systematic errors in shear estimation. The principles of the method are:
(a) Shear estimation consists of two steps: initial ellipticity estimation (which must be highly repeatable) and application of calibration.
(b) Shear calibration is derived via image simulation.
(c) Simulated galaxies must have properties matching those in real data (in this case, the GREAT3 data).
(d) Each step in image processing affects the calibration factor. This includes image coaddition, PSF estimation and interpolation, handling of under-sampling, etc.
A more detailed description will be presented in Jee & Tyson (in prep.).
C21.1 Implementation of the sFIT Method
For shear calibration using image simulations, the three important questions are: 1) How well does the simulation match reality? 2) How far can the galaxy model be simplified (i.e., how far can the number of calibration parameters be minimized)? 3) What are the requirements for the initial ellipticity measurement method?
Initial ellipticity measurement:
The sFIT team uses forward-modeling to obtain the initial ellipticity estimate for each galaxy, by convolving the galaxy model with the PSF and minimizing the difference between the simulated and actual galaxy image. The choice of galaxy model is important. The sFIT team experimented with a wide range of galaxy models, examining their stability (convergence rate), speed, bias, and measurement noise. Perhaps the simplest parametrization is an elliptical Gaussian, as used in the Deep Lens Survey (DLS; Jee et al. 2013). The strengths of this model include the high convergence rate, speed, and small measurement error. The drawback is that it requires rather a large calibration factor, of order 10 per cent. Although in principle a calibration factor can be derived for this choice, it is preferable if the corrections that are being applied are small. Another option is the bulge + disk model, which may be regarded as the opposite extreme to the elliptical Gaussian approach. This sophisticated representation of galaxy profiles reduces the bias, but with an unacceptably poor convergence rate (it fails for ∼ per cent of the GREAT3 galaxies) and slow speed (∼10 sec per object). The increase in the number of parameters also increases noise bias for faint galaxies. The compromise that was adopted for GREAT3 is a single Sérsic representation, which is a one-parameter extension to the elliptical Gaussian model used for DLS. Without any external calibration, the model introduces a reasonably small multiplicative bias (∼ per cent). The model converges ∼ per cent of the time, and takes ∼ second per galaxy.

Image simulation method:

The sFIT team used GalSim to perform its image simulations. Although the team already has a high-fidelity image simulator used for DLS, there are merits in using GalSim for the GREAT3 challenge. First, the GREAT3 data are generated with GalSim. Were GalSim to make some unknown systematic error when representing galaxies under shear, the potential impact on competitive performance is best minimized by using the same simulator to make images (while the scientific value in identifying a discrepancy is, unfortunately, sacrificed).

Second, for the real galaxy branches, it is important to match the galaxy properties. This team's DLS image simulator uses galaxy images in the Ultra Deep Field (UDF), which detects faint galaxies down to 30th mag at the σ level. Clearly, these galaxies are different from those in GREAT3. The sFIT team used Sérsic fits to the GREAT3 data to estimate distributions of galaxy sizes, ellipticity, Sérsic indices, PSF properties, and noise level. Then, they ran GalSim with input parameters based on these measurements, by drawing values from parametrized distributions. It is not trivial to guess the input parameters that will generate images that closely match the GREAT3 data, since the noise in the GREAT3 data means that the observed distributions deviate from the true inputs (they are wider than the inputs, with shifted means). Several iterations were required before the mean, width, and tail shape of the distributions agreed well with the observed ones.
Calibration:
Many details, such as properties of galaxies and PSFs, method of image reduction, implementation details of ellipticity measurement, noise level, etc., all affect shear calibration. However, for practical purposes, the number of parameters in the calibration process must be limited. The sFIT team avoided calibration against implementation details by keeping the size of the postage stamp images, the over-sampling ratio, the centroid constraint method, etc. fixed throughout the challenge.

The galaxy properties are important parameters. However, individual measurements are noisy. Thus, instead of a per-galaxy correction based on each galaxy's properties, shear calibrations were derived based on aggregate statistics and applied to the entire population (an exception is made for variable shears; see below).

The most important parameters are the PSF properties such as ellipticity, size, kurtosis, etc. Even with perfect knowledge of the PSF, galaxy ellipticities still have both additive and multiplicative bias, which increases with the size of the PSF. In their GREAT3 analysis, the sFIT team ignored kurtosis and characterized the PSF in terms of its ellipticity and FWHM. They modeled the variation of both additive and multiplicative errors as a function of PSF FWHM using second-order polynomials. Variable shear branches do require a per-galaxy correction using the PSF properties at the galaxy location to estimate the correction factors (but not using the individual galaxy-fitting results).
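The polynomial calibration model just described can be sketched as follows. This is an illustration of the fitting step only: the FWHM grid and the bias coefficients below are invented for demonstration and are not the sFIT team's measured values.

```python
# Sketch of the sFIT-style calibration model: the multiplicative bias m
# (and likewise the additive bias c) is modelled as a second-order
# polynomial of the PSF FWHM, fitted to simulation results.
def fit_quadratic(x, y):
    """Least-squares fit y ~ c0 + c1*x + c2*x**2 via the normal equations,
    solved with Gauss-Jordan elimination."""
    S = [sum(v ** k for v in x) for k in range(5)]       # power sums of x
    T = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    A = [[S[i + j] for j in range(3)] for i in range(3)]  # normal matrix
    for i in range(3):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        T[i] /= p
        for r in range(3):
            if r != i:
                f = A[r][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
                T[r] -= f * T[i]
    return T  # [c0, c1, c2]

# Toy "simulation results": m measured at several PSF FWHM values
# (noiseless, generated from known illustrative coefficients).
fwhm = [0.6, 0.7, 0.8, 0.9, 1.0]
m_sim = [0.01 + 0.02 * f + 0.05 * f ** 2 for f in fwhm]
c0, c1, c2 = fit_quadratic(fwhm, m_sim)
```

In use, the fitted polynomial is evaluated at each field's (or, for variable shear branches, each galaxy's) PSF FWHM to obtain the correction to apply.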
Weighting Scheme:
The ellipticity measurement code used by the sFIT team outputs ellipticity uncertainties by evaluating the Hessian matrix. Unfortunately, these ellipticity uncertainties are somewhat correlated with galaxy shapes, so if the ellipticity uncertainties were used directly to evaluate individual weights, the shapes would be correlated with the weights. To avoid this problem, the sFIT team derived average $S/N$ vs. ellipticity uncertainty relations, and converted per-galaxy $S/N$ values into ellipticity uncertainties. Then, the weights are evaluated from the equation $w = 1/(\sigma_e^2 + \sigma_{SN}^2)$, where $\sigma_e$ is the ellipticity uncertainty derived from the $S/N$ value, and $\sigma_{SN}$ is the intrinsic ellipticity dispersion per component.

C21.2 GREAT3 submission policy
To avoid tuning to the GREAT3 simulations in too much detail, the sFIT team tried to minimize the number of submissions. Submissions were made in the following cases:
• When obvious mistakes were found, such as applying calibration factors to the wrong branch.
• When better calibrations became available. Since shear calibration requires significant computing time, occasionally the sFIT team took shortcuts to reduce computing time. However, if a shortcut resulted in poor performance, they revisited the problem and performed brute-force simulations to obtain calibration parameters directly.
• For many variable shear branches, results improved when galaxies were unweighted. Thus, the sFIT team experimented with their weighting scheme (by turning it on/off) for almost every variable shear branch (except for VSV, where they achieved the highest score with just one submission).
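The sFIT weighting scheme described in Appendix C21.1 can be sketched as follows. The $S/N$-to-$\sigma_e$ relation used here ($\sigma_e \propto 1/(S/N)$) and the value $\sigma_{SN} = 0.25$ are illustrative assumptions, not the team's measured relation.

```python
# Hedged sketch of the sFIT weighting: per-galaxy S/N is converted to an
# ellipticity uncertainty via an *average* S/N-to-sigma_e relation (so that
# weights do not correlate with individual galaxy shapes), and then
# w = 1 / (sigma_e**2 + sigma_SN**2).
def sigma_e_from_snr(snr, scale=2.0):
    """Assumed mean relation between S/N and ellipticity uncertainty
    (illustrative; the real relation was derived empirically)."""
    return scale / snr

def sfit_weight(snr, sigma_sn=0.25):
    """w = 1/(sigma_e^2 + sigma_SN^2), sigma_SN = intrinsic dispersion
    per ellipticity component (assumed value)."""
    se = sigma_e_from_snr(snr)
    return 1.0 / (se ** 2 + sigma_sn ** 2)

weights = [sfit_weight(snr) for snr in (10.0, 20.0, 50.0)]
```

Because the uncertainty is taken from the average relation rather than from each galaxy's own fit, the weight depends on the galaxy only through its $S/N$, breaking the weight-shape correlation described above.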
APPENDIX D: CROSS-BRANCH COMPARISON OF SUBMISSIONS
Tables D1 and D2 provide estimates of $c_+$ and the component-averaged $\langle m \rangle$ for all submissions described in Sec. 5.1 in branches CGC, RGC, CSC, and RSC. Tables D3 and D4 show the changes in $c_+$ and $\langle m \rangle$ when comparing across branches and within branches while splitting by PSF properties, respectively.

Table D1.
Additive bias $c_+$ and component-averaged multiplicative bias $\langle m \rangle$ for the submissions selected for the fair cross-branch comparison (see Sec. 5.1) in ground branches CGC and RGC.

Team | CGC $c_+$ | CGC $\langle m \rangle$ | RGC $c_+$ | RGC $\langle m \rangle$
Amalgam@IAP | . ± . | . ± . | . ± . | . ± .
CEA_denoise | . ± . | − . ± . | . ± . | − . ± .
CEA-EPFL | . ± . | − . ± . | − . ± . | . ± .
CMU experimenters | . ± . | . ± . | . ± . | . ± .
COGS | − . ± . | − . ± . | − . ± . | − . ± .
E-HOLICs | . ± . | . ± . | — | —
EPFL_HNN | . ± . | . ± . | . ± . | − . ± .
EPFL_KSB | . ± . | . ± . | — | —
EPFL_MLP | . ± . | − . ± . | − . ± . | − . ± .
ess | — | — | . ± . | − . ± .
ess (outlier clipped) | — | — | . ± . | . ± .
Fourier_Quad | . ± . | . ± . | . ± . | − . ± .
FDNT | . ± . | − . ± . | . ± . | − . ± .
MaltaOx | . ± . | − . ± . | . ± . | − . ± .
MBI | − . ± . | . ± . | − . ± . | . ± .
MBI (outlier clipped) | . ± . | . ± . | − . ± . | . ± .
MegaLUT | − . ± . | . ± . | − . ± . | . ± .
MegaLUT (outlier clipped) | − . ± . | . ± . | − . ± . | . ± .
MetaCalibration | . ± . | . ± . | — | —
re-Gaussianization | − . ± . | . ± . | − . ± . | . ± .
sFIT | − . ± . | . ± . | . ± . | . ± .
Wentao Luo | − . ± . | − . ± . | − . ± . | − . ± .

Outlying values in the submitted shears were removed from the submission and scores recalculated, as described in Sec. 4.8. The worst 10 per cent of fields by PSF defocus value were removed and scores recalculated, as described in Sec. 4.8.
Table D2.
Additive bias $c_+$ and component-averaged multiplicative bias $\langle m \rangle$ for the submissions selected for the fair cross-branch comparison (see Sec. 5.1) in space branches CSC and RSC.

Team | CSC $c_+$ | CSC $\langle m \rangle$ | RSC $c_+$ | RSC $\langle m \rangle$
Amalgam@IAP | − . ± . | − . ± . | . ± . | − . ± .
CEA_denoise | . ± . | − . ± . | . ± . | − . ± .
CEA-EPFL | . ± . | − . ± . | . ± . | . ± .
E-HOLICs | . ± . | − . ± . | . ± . | . ± .
EPFL_HNN | . ± . | − . ± . | . ± . | − . ± .
EPFL_KSB | . ± . | − . ± . | — | —
EPFL_MLP | . ± . | − . ± . | — | —
Fourier_Quad | − . ± . | . ± . | . ± . | . ± .
MBI | − . ± . | − . ± . | — | —
MegaLUT | − . ± . | − . ± . | . ± . | − . ± .
sFIT | . ± . | . ± . | . ± . | − . ± .
Wentao Luo | . ± . | − . ± . | . ± . | − . ± .
Table D3.
Change in additive bias $\Delta c_+$ and component-averaged multiplicative bias $\Delta \langle m \rangle$ across branches, for the submissions selected for the fair cross-branch comparison. The ordering of branch labels indicates the order in which the bias results are subtracted.

Team | RGC−CGC $\Delta c_+$ | RGC−CGC $\Delta \langle m \rangle$ | RSC−CSC $\Delta c_+$ | RSC−CSC $\Delta \langle m \rangle$ | CSC−CGC $\Delta c_+$ | CSC−CGC $\Delta \langle m \rangle$
Amalgam@IAP | − . ± . | − . ± . | . ± . | − . ± . | − . ± . | − . ± .
CEA_denoise | − . ± . | . ± . | − . ± . | . ± . | . ± . | − . ± .
CEA-EPFL | − . ± . | . ± . | . ± . | . ± . | . ± . | . ± .
CMU experimenters | − . ± . | . ± . | — | — | — | —
COGS | . ± . | − . ± . | — | — | — | —
E-HOLICs | — | — | − . ± . | . ± . | . ± . | − . ± .
EPFL_HNN | − . ± . | − . ± . | − . ± . | − . ± . | . ± . | − . ± .
EPFL_KSB | — | — | — | — | . ± . | − . ± .
EPFL_MLP | − . ± . | − . ± . | — | — | − . ± . | − . ± .
Fourier_Quad | − . ± . | − . ± . | . ± . | . ± . | − . ± . | . ± .
FDNT | . ± . | . ± . | — | — | — | —
MaltaOx | − . ± . | . ± . | — | — | — | —
MBI | − . ± . | . ± . | — | — | − . ± . | − . ± .
MBI (outlier clipped) | − . ± . | . ± . | — | — | − . ± . | − . ± .
MegaLUT | − . ± . | . ± . | . ± . | − . ± . | . ± . | − . ± .
MegaLUT (outlier clipped) | − . ± . | . ± . | . ± . | − . ± . | . ± . | − . ± .
re-Gaussianization | . ± . | − . ± . | — | — | — | —
sFIT | . ± . | − . ± . | . ± . | − . ± . | . ± . | − . ± .
Wentao Luo | − . ± . | − . ± . | . ± . | . ± . | . ± . | − . ± .

Table D4.
Change in additive bias $\Delta c_+$ and component-averaged multiplicative bias $\Delta \langle m \rangle$ within CGC, when splitting by atmospheric PSF FWHM and optical PSF defocus, for the submissions selected for the fair cross-branch comparison.

Team | better−worse atmospheric PSF FWHM $\Delta c_+$ | $\Delta \langle m \rangle$ | better−worse optical PSF defocus $\Delta c_+$ | $\Delta \langle m \rangle$
Amalgam@IAP | . ± . | − . ± . | − . ± . | . ± .
CEA_denoise | − . ± . | . ± . | − . ± . | . ± .
CEA-EPFL | . ± . | − . ± . | − . ± . | . ± .
CMU experimenters | − . ± . | − . ± . | . ± . | − . ± .
COGS | − . ± . | . ± . | . ± . | . ± .
E-HOLICs | . ± . | − . ± . | − . ± . | − . ± .
EPFL_HNN | . ± . | − . ± . | . ± . | − . ± .
EPFL_KSB | . ± . | − . ± . | . ± . | − . ± .
EPFL_MLP | . ± . | − . ± . | − . ± . | . ± .
Fourier_Quad | . ± . | − . ± . | − . ± . | − . ± .
FDNT | . ± . | − . ± . | − . ± . | − . ± .
MaltaOx | . ± . | − . ± . | − . ± . | − . ± .
MBI | . ± . | . ± . | − . ± . | . ± .
MegaLUT | − . ± . | − . ± . | . ± . | . ± .
MetaCalibration | . ± . | − . ± . | − . ± . | . ± .
re-Gaussianization | − . ± . | . ± . | . ± . | − . ± .
sFIT | . ± . | − . ± . | − . ± . | . ± .
Wentao Luo | − . ± . | . ± . | . ± . | . ± .