Variational Inference for Deblending Crowded Starfields
Runjing Liu∗, Department of Statistics, University of California, Berkeley
Jon D. McAuliffe, The Voleon Group; Department of Statistics, University of California, Berkeley
Jeffrey Regier, Department of Statistics, University of Michigan
The LSST Dark Energy Science Collaboration
Abstract
In the image data collected by astronomical surveys, stars and galaxies often overlap. Deblending is the task of distinguishing and characterizing individual light sources from survey images. We propose StarNet, a fully Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions and the wake-sleep algorithm. Wake-sleep, which minimizes a forward KL divergence, has significant benefits compared to traditional variational inference, which minimizes a reverse KL divergence. In our experiments with SDSS images of the M2 globular cluster, StarNet is substantially more accurate than two competing methods: Probabilistic Cataloging (PCAT), a method that uses MCMC for inference, and DAOPHOT, a software pipeline employed by SDSS for deblending. In addition, StarNet is as much as 100,000 times faster than PCAT, exhibiting the scaling characteristics necessary to perform fully Bayesian inference on modern astronomical surveys.
Keywords: Bayesian methods, amortized inference, wake-sleep algorithm, approximate inference, astronomical surveys, cataloging.

∗ Runjing Liu (e-mail: [email protected]) gratefully acknowledges support from the NSF Graduate Research Fellowship Program. This paper has been approved by the LSST Dark Energy Science Collaboration following an internal review. The internal reviewers were Bastien Arcelin, François Lanusse, and Peter Melchior. The authors also thank Derek Hansen, Ismael Mendoza, and Zhe Zhao for reviewing this manuscript and for continuing the development of our software.

Introduction
Astronomical images record the arrival of photons from distant light sources. Astronomical catalogs are constructed from these images. Catalogs label light sources as stars, galaxies, or other objects; they also list the physical characteristics of light sources such as flux, color, and morphology. These catalogs are the starting point for many downstream analyses. For example, Bayestar used a catalog of stellar fluxes and colors to construct the 3D distribution of interstellar dust (Green et al., 2019). Catalogs of galaxy morphologies have been used to validate theoretical models of dark matter and dark energy (Abbott et al., 2018).

A light source, be it a star or a galaxy, produces a peak intensity of brightness in an image. When light sources are well separated, catalog construction is straightforward: characteristics of each light source, such as flux, can be estimated by analyzing intensities at the peak and surrounding pixels. However, in images crowded with many light sources, observed intensities may result from the combined light of multiple sources. Source separation, or deblending, is the task of differentiating and characterizing individual light sources from a mixture of intensities on an image. A key challenge in deblending is inferring whether an observed intensity is in fact blended, that is, whether it is composed of a single bright source or multiple dimmer sources.

Deblending is challenging for several reasons. First, it is an unsupervised problem without ground truth labeled data. Second, it is a problem with a sample size of one: there is only one night sky, which is imaged many times, and the collected survey images capture overlapping regions of it. Third, for blended fields, the properties of light sources are ambiguous; therefore, providing calibrated uncertainties for catalog construction is as important as making accurate predictions. Finally, the scale of the data is immense. The upcoming Rubin Observatory Legacy Survey of Space and Time (LSST), scheduled to begin data collection in 2022, is expected to produce 60 petabytes of astronomical images over its lifetime (LSST, 2020).

As more powerful telescopes are developed, and their ability to detect more distant light sources improves, the density of light sources in the images they capture will only increase. For instance, Bosch et al. (2018) estimates that 58% of light sources are blended in images captured by the Subaru Telescope's Hyper Suprime-Cam, and that percentage is expected to increase for LSST. Therefore, developing a method that reliably characterizes light sources, even in cases of significant blending, advances any astronomical research in which conclusions about the physical universe are derived from estimated catalogs.

We focus on cataloging applications where all light sources are well modeled as points without spatial extent. Point-source-only models are applicable to surveys such as DECam (Schlafly et al., 2018), which imaged the center of the Milky Way, and WISE (Wright et al., 2010), whose telescope resolution did not allow for differentiation between stars and galaxies. In this work, we use the globular cluster M2, a region imaged by the Sloan Digital Sky Survey (SDSS) that is densely populated with stars, as a test bed for our method.
From software pipelines to probabilistic cataloging
Traditionally, most cataloging has been performed using software pipelines. These pipelines are algorithms that usually involve the following stages: locating the brightest peaks, estimating fluxes, and subtracting the estimated light sources. These stages may be performed iteratively. Pipelines do not normally produce statistically calibrated error estimates that propagate the uncertainty that accumulates in each of the steps. Failure to properly accumulate error at each step results in unreliable catalogs for images in which ambiguity exists in identifying sources and estimating their characteristics. For example, PHOTO (Lupton et al., 2001), the default cataloging pipeline used by the Sloan Digital Sky Survey (SDSS), failed to produce a catalog on the globular cluster M2 (Portillo et al., 2017).

In contrast, probabilistic cataloging posits a statistical model consisting of a likelihood for the observed image given a catalog and a prior distribution over possible catalogs (Portillo et al., 2017; Brewer et al., 2013; Feder et al., 2020). Instead of deriving a single catalog, probabilistic cataloging produces a posterior distribution over the set of all possible catalogs. Uncertainties are quantified by the posterior distribution. For example, in an image with an ambiguously blended bright peak, some catalogs sampled from the posterior would contain multiple dim light sources while others would contain one bright source. The relative density the posterior distribution places on one explanation over another represents the statistical confidence in that explanation. Moreover, a distribution over the set of all catalogs induces a distribution on any estimate derived from a catalog. Therefore, calibrated uncertainties can be propagated to downstream analyses.

Previous work on probabilistic cataloging employed Markov chain Monte Carlo (MCMC) to sample from the posterior distribution. The MCMC procedure in Portillo et al. (2017) and Feder et al. (2020) was called PCAT, short for "Probabilistic CATaloging." (We use "probabilistic cataloging" to refer to any method that produces a posterior over possible catalogs, whereas "PCAT" refers specifically to the MCMC procedure in Portillo et al. (2017) and Feder et al. (2020).) Because the posterior is defined over the set of all catalogs and the number of sources in a catalog is unknown and random, the latent variable space is transdimensional. PCAT used reversible jump MCMC (Green, 1995) to sample transdimensional catalogs. In reversible jump MCMC, auxiliary variables are added to encode the "birth" of new light sources or the "death" of existing ones in the Markov chain.

The computational cost of MCMC for this model is problematic for large-scale astronomical surveys. Early implementations of PCAT required a day to process a 100 × 100 pixel image (Feder et al., 2020). In any case, a 100 × 100 pixel image covers only a 0.66 × 0.66 arcminute patch of the sky. For comparison, in one night, SDSS scans a region on the order of 100 ×.

Our contribution
We propose StarNet, an approach to deblending that employs several recent VI innovations (Zhang et al., 2019; Le et al., 2020). Unlike Regier et al. (2019), our VI approach is able to handle arbitrary probabilistic models, including a transdimensional model with an unknown number of sources. Section 2 introduces the statistical model, which is similar to the model used in PCAT.

Secondly, again unlike Regier et al. (2019), we employ amortization, which enables StarNet to scale inference to large astronomical surveys. In amortized variational inference, a neural network maps input images to an approximate posterior. Following a one-time cost to fit the neural network, inference on new images requires just a single forward pass. Rapid inference is available without the need to re-run MCMC or numerically optimize VI for each new image. For StarNet, the forward pass on a 100 × 100 pixel image takes a fraction of a second.

Thirdly, we fit the approximate posterior with the wake-sleep algorithm rather than with traditional VI. Traditional VI minimizes the "reverse" KL divergence between the approximate posterior q and the exact posterior p. Reverse KL is defined as the q-weighted average difference between log q and log p. Wake-sleep instead fits the approximate posterior using the "forward" KL divergence, which weights the difference between log p and log q by p. The forward KL is minimized using stochastic gradient descent (SGD). At each iteration, catalogs are sampled from the prior distribution, then used to generate images from the likelihood model. The neural network is fit to map the generated images to distributions that place probability mass on their corresponding catalogs. Section 4 details the wake-sleep procedure.

In this application, optimizing the forward KL produces more reliable approximate posteriors than optimizing the traditional reverse KL (Section 5). In particular, by taking advantage of complete data (the sampled images and their corresponding catalogs), wake-sleep avoids shallow local minima where the approximate posterior returned by the neural network is far from the exact posterior in terms of KL divergence.

The wake-sleep algorithm has been used in previous research to train deep generative models (Hinton et al., 1995; Bornschein and Bengio, 2014; Le et al., 2020). However, to the best of our knowledge, this is the first application of wake-sleep for scientific purposes. Specifically, we use wake-sleep for inference in a latent space that is interpretable: it is the set of all possible astronomical catalogs.

We applied StarNet to the M2 globular cluster as imaged by SDSS. We used the M2 catalog from the ACS Globular Cluster Survey (Sarajedini et al., 2007) as ground truth. The ACS survey used the Hubble Space Telescope, which has approximately 20 times the angular resolution and 30 times the exposure of the ground-based Sloan telescope. StarNet was more accurate than the MCMC-based cataloger PCAT and traditional cataloging approaches while running 100,000 times faster than the former (Section 6). Although the MCMC posterior approximation converges to the exact posterior asymptotically, over finite horizons an MCMC sampler may not mix well. The sleep-phase objective from the wake-sleep algorithm allows StarNet to circumvent many of the challenges of non-convex optimization (Section 7). Code to reproduce our results is publicly available at https://github.com/Runjing-Liu120/DeblendingStarfields.

The statistical model

In crowded starfields, such as globular clusters and the Milky Way, the vast majority of light sources are stars. An astronomical image records the number of photons that reached a telescope and arrived at each pixel. Typically, photons must pass through one of several filters, each selecting photons from a specified band of wavelengths, before being recorded.

For a given H × W pixel image with B filter bands, our goal is to infer a catalog of stars. The catalog specifies the number of stars in an image; for each such star, the catalog records its location and its flux, or brightness, in each band. The space of latent variables $\mathcal{Z}$ is the collection of all possible catalogs of the form

$$z := \{ N, (\ell_i, f_{i,1}, \ldots, f_{i,B})_{i=1}^{N} \}, \tag{1}$$

where the number of stars in the catalog is $N \in \mathbb{N}$, the location of the $i$th star is $\ell_i \in \mathbb{R}^2$, and the flux of the $i$th star in the $b$th band is $f_{i,b} \in \mathbb{R}_+$.

A Bayesian approach requires specification of a prior over catalog space $\mathcal{Z}$ and a likelihood for the observed images. Our likelihood and prior, detailed below, are similar to previous approaches (Brewer et al., 2013; Portillo et al., 2017; Regier et al., 2019; Feder et al., 2020).

The prior

The prior over $\mathcal{Z}$ is a marked spatial Poisson process. To sample the prior, first draw the number of stars contained in the $H \times W$ image as

$$N \sim \mathrm{Poisson}(\mu HW), \tag{2}$$

where $\mu$ is a hyperparameter specifying the average number of sources per pixel. Next, draw locations

$$\ell_1, \ldots, \ell_N \mid N \overset{iid}{\sim} \mathrm{Uniform}([0, H] \times [0, W]). \tag{3}$$
The fluxes in the first band are drawn from a power law distribution with slope $\alpha$:

$$f_{1,1}, \ldots, f_{N,1} \mid N \overset{iid}{\sim} \mathrm{Pareto}(f_{\min}, \alpha). \tag{4}$$

Fluxes in other bands are described relative to the first band. Like Feder et al. (2020), we define the log-ratio of flux relative to the first band as "color." Colors are drawn from a Gaussian distribution:

$$c_{1,b}, \ldots, c_{N,b} \mid N \overset{iid}{\sim} \mathcal{N}(\mu_c, \sigma_c^2), \quad b = 2, \ldots, B. \tag{5}$$

Given the flux in the first band $f_{i,1}$ and color $c_{i,b}$, the flux in band $b$ is $f_{i,b} = f_{i,1} \times 10^{c_{i,b}/2.5}$. Also like Feder et al. (2020), we fix the power law slope $\alpha$ and use a standard Gaussian prior on colors ($\mu_c = 0$, $\sigma_c = 1$). Appendix C evaluates the sensitivity of the resulting catalog to choices of the prior parameters.

The likelihood

Let $x_{bhw}$ denote the observed number of photoelectrons at pixel $(h, w)$ in band $b$. For each band, at every pixel, the expected number of photoelectron arrivals is $\lambda_{bhw}(z)$, a deterministic function of the catalog $z$. Motivated by the Poissonian nature of photon arrivals and the large photon arrival rate in SDSS and LSST images, observations are drawn as

$$x_{bhw} \mid z \overset{ind}{\sim} \mathcal{N}(\lambda_{bhw}, \lambda_{bhw}), \quad b = 1, \ldots, B; \; h = 1, \ldots, H; \; w = 1, \ldots, W, \tag{6}$$

where

$$\lambda_{bhw} = I_b(h, w) + \sum_{i=1}^{N} f_{i,b} \, \mathcal{P}_b(h - \ell_{i,1}, w - \ell_{i,2}). \tag{7}$$

Here, $\mathcal{P}_b$ is the point spread function (PSF) for band $b$ and $I_b$ is the background intensity. The PSF is a function

$$\mathcal{P}_b : \mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}_+, \tag{8}$$

describing the appearance of a stellar point source at any 2D position of the image (but ignoring pixelation). Our PSF model is a weighted average between a Gaussian "core" and a power-law "wing," as described in Xin et al. (2018). For each band, the PSF has the form

$$\mathcal{P}(u, v) = \frac{\exp\!\left(-\frac{u^2 + v^2}{2\sigma_1^2}\right) + \zeta \exp\!\left(-\frac{u^2 + v^2}{2\sigma_2^2}\right) + \rho \left(1 + \frac{u^2 + v^2}{\gamma \sigma_P^2}\right)^{-\gamma/2}}{1 + \zeta + \rho}. \tag{9}$$

The PSF parameters vary by band.
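To make the generative model concrete, the following is a minimal numpy sketch of the prior (2)-(5) and the likelihood (6)-(9). All numeric settings (image size, `mu`, `f_min`, the PSF parameters, the background level) are illustrative placeholders, not the paper's fitted values, and a single PSF is shared across bands for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings; the paper's actual values may differ.
H, W, B = 100, 100, 2    # image size and number of bands
mu = 0.1                 # mean number of stars per pixel, eq. (2)
f_min, alpha = 1e3, 0.5  # Pareto flux prior, eq. (4)
sigma1, sigma2, sigma_p, gamma, zeta, rho = 1.0, 2.0, 2.5, 3.0, 0.2, 0.1

def sample_catalog():
    """Draw a catalog z = (N, locations, fluxes) from the prior, eqs. (2)-(5)."""
    n = rng.poisson(mu * H * W)
    locs = rng.uniform(0.0, [H, W], size=(n, 2))        # eq. (3)
    f1 = f_min * (1 + rng.pareto(alpha, size=n))        # eq. (4)
    colors = rng.normal(0.0, 1.0, size=(n, B - 1))      # eq. (5)
    fluxes = np.column_stack(
        [f1] + [f1 * 10 ** (colors[:, b] / 2.5) for b in range(B - 1)])
    return n, locs, fluxes

def psf(u, v):
    """Gaussian core plus power-law wing, eq. (9)."""
    r2 = u ** 2 + v ** 2
    core = np.exp(-r2 / (2 * sigma1 ** 2)) + zeta * np.exp(-r2 / (2 * sigma2 ** 2))
    wing = rho * (1 + r2 / (gamma * sigma_p ** 2)) ** (-gamma / 2)
    return (core + wing) / (1 + zeta + rho)

def render(n, locs, fluxes, background=100.0):
    """Mean image lambda, eq. (7), and a Gaussian observation, eq. (6)."""
    hh, ww = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lam = np.full((B, H, W), background)
    for i in range(n):
        for b in range(B):
            lam[b] += fluxes[i, b] * psf(hh - locs[i, 0], ww - locs[i, 1])
    return lam, rng.normal(lam, np.sqrt(lam))
```

Note the affine background of eq. (10) is replaced here by a constant, and the PSF is evaluated on the pixel grid without the sub-pixel integration a production renderer would need.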
Let the collection of PSF parameters across all bands be denoted $\pi := (\sigma_1^{(b)}, \sigma_2^{(b)}, \sigma_P^{(b)}, \gamma^{(b)}, \zeta^{(b)}, \rho^{(b)})_{b=1}^{B}$.

The background intensity at pixel $(h, w)$ is modeled with an affine function:

$$I_b(h, w) = \beta_{b,0} + \beta_{b,1} \times h + \beta_{b,2} \times w. \tag{10}$$

The background parameters are specific to the band. StarNet estimates these parameters jointly with the approximate posterior (Section 4). Prior work on probabilistic cataloging relied on estimates from the SDSS software pipeline and found the PSF estimates to be suboptimal in crowded starfields (Feder et al., 2020).

Variational inference

The central quantity in Bayesian statistics is the posterior distribution $p(z \mid x)$. However, in many nontrivial probabilistic models, including our own, the posterior distribution is intractable to calculate. Calculation of the posterior requires us to compute the marginal likelihood, $p(x)$, which involves integrating over the latent variable $z$. In our model, the latent variable space is high dimensional: it is the set of all catalogs. Approximate methods such as MCMC and variational inference are therefore required.

Variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008; Blei et al., 2017) posits a family of distributions $\mathcal{Q}$ and seeks the distribution $q^* \in \mathcal{Q}$ that is closest to the exact posterior in KL divergence. $\mathcal{Q}$ is chosen such that $q^*$ will not be too difficult to find via optimization. We index the distributions in $\mathcal{Q}$ using a real-valued vector $\eta$, then seek $\eta^*$ satisfying

$$\eta^* = \arg\min_{\eta} \, \mathrm{KL}\left[ q_\eta(z \mid x) \,\|\, p(z \mid x) \right]. \tag{11}$$

Minimizing the KL divergence in (11) is equivalent to maximizing the evidence lower bound (ELBO) (Blei et al., 2017):

$$\mathcal{L}_{\mathrm{elbo}}(\eta) = \mathbb{E}_{q_\eta(z \mid x)}\left[ \log p(x, z) - \log q_\eta(z \mid x) \right]. \tag{12}$$
Computing the ELBO does not require computing the marginal distribution $p(x)$, which is intractable, or the posterior distribution $p(z \mid x)$, which would be circular.

The variational distribution

Traditionally in variational inference, the posterior approximation $q_\eta$ depends on the data $x$ only implicitly, in that $\eta^*$ is chosen according to (11). In this case, $q_\eta(z \mid x)$ is usually written $q_\eta(z)$, suppressing the dependence on $x$.

The generative model in Section 2 does not assume any hierarchical structure over replicated images. Therefore, given a new image $x^{new}$, the posterior factorizes; in other words, $p(z^{new}, z \mid x^{new}, x) = p(z^{new} \mid x^{new}) \, p(z \mid x)$. To find a variational approximation to the posterior $p(z^{new} \mid x^{new})$, the optimization problem (11) must be solved again with $x = x^{new}$. Even in models with hierarchical structure, the variational distribution will generally need to be updated via optimization for every newly observed data point.

On the other hand, in amortized variational inference (Kingma and Welling, 2013; Rezende et al., 2014), $q_\eta$ explicitly depends on the data. A flexible, parameterized function maps input $x$, in this case an observed image, to the parameters of a variational distribution on the latent space $\mathcal{Z}$. Typically, the function is a neural network, in which case the variational parameters $\eta$ are the neural network weights. After the neural network is fitted with (11) using a collection of observed $x$'s, the approximate posterior $q_\eta(z^{new} \mid x^{new})$ for a new data point $x^{new}$ can be evaluated with a single forward pass through the neural network. No additional run of an iterative optimization routine is needed for a new data point $x^{new}$.

The following subsections detail the construction of our variational distribution. To make the objective in (11) tractable, the family $\mathcal{Q}$ is normally restricted to probability distributions without conditional dependencies between some latent variables.
In the most extreme case, known as mean-field variational inference, the variational distribution completely factorizes across all latent variables.

Our factorization instead has a spatial structure. First, we partition the full $H \times W$ image into disjoint $R \times R$ tiles. $R$ will be chosen such that the probability of having three or more stars in one tile is small. In this way, the cataloging problem decomposes to inferring only a few stars at a time (Section 3.1.3).

Let $S = H/R$ and $T = W/R$ and assume without loss of generality that $H$ and $W$ are multiples of $R$. For $s = 1, \ldots, S$ and $t = 1, \ldots, T$, the tile $\tilde{x}^{(s,t)}$ is composed of the following pixels:

$$\tilde{x}^{(s,t)} = \{ x_{hw} : Rs \le h \le R(s+1) \text{ and } Rt \le w \le R(t+1) \}. \tag{13}$$

Figure 1 gives an example with $R = 2$.

[Figure 1: Tiling a 10 × 10 pixel image into 2 × 2 tiles.]

Let $\tilde{N}^{(s,t)}$ be the number of stars in tile $(s,t)$. Because $\tilde{N}^{(s,t)}$ is random, the cardinality of the set of locations and fluxes in each tile is also random. To handle the trans-dimensionality, we define a triangular array of latent variables on each tile:

$$\tilde{\ell}^{(s,t)} = (\tilde{\ell}^{(s,t)}_{N,i} : i = 1, \ldots, N; \; N = 1, 2, \ldots); \tag{14}$$
$$\tilde{f}^{(s,t)} = (\tilde{f}^{(s,t)}_{N,i} : i = 1, \ldots, N; \; N = 1, 2, \ldots), \tag{15}$$

where $\tilde{\ell}^{(s,t)}_{N,i}$ and $\tilde{f}^{(s,t)}_{N,i}$ are the elements of the triangular array corresponding to locations and fluxes, respectively. Tile locations $\tilde{\ell}^{(s,t)}_{N,i} \in [0, R] \times [0, R]$ give the location of stars within a tile. The fluxes $\tilde{f}^{(s,t)}_{N,i}$ are vectors in $\mathbb{R}^B_+$ (one flux for each band).

Call $(\tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)})_{s=1,t=1}^{S,T}$ the tile latent variables; succinctly denote the tile latent variables as $\tilde{z}$. The ultimate latent variable of interest is $z = \{N, (\ell_i, f_{i,1}, \ldots, f_{i,B})_{i=1}^{N}\}$, the catalog for the full image. A distribution on $z$ is obtained by first constructing a mapping from $\tilde{z}$ to $z$. We then define a distribution on the tile latent variables $\tilde{z}$, which in turn induces a distribution on $z$.

We now detail the mapping $\tilde{z} \mapsto z$ (see Figure 2 for a schematic). The number of stars in the catalog is

$$N = \sum_{s=1}^{S} \sum_{t=1}^{T} \tilde{N}^{(s,t)}. \tag{16}$$

For every tile $(s,t)$, we index into the $\tilde{N}^{(s,t)}$-th row of the triangular array of tile latent variables: the fluxes in the catalog are

$$\{f_i\}_{i=1}^{N} = \left\{ \tilde{f}^{(s,t)}_{\tilde{N}^{(s,t)},i} : i = 1, \ldots, \tilde{N}^{(s,t)}; \; s = 1, \ldots, S; \; t = 1, \ldots, T \right\}, \tag{17}$$

and the corresponding locations are

$$\{\ell_i\}_{i=1}^{N} = \left\{ \tilde{\ell}^{(s,t)}_{\tilde{N}^{(s,t)},i} + (Rs, Rt) : i = 1, \ldots, \tilde{N}^{(s,t)}; \; s = 1, \ldots, S; \; t = 1, \ldots, T \right\}. \tag{18}$$
The tile location is shifted by $(Rs, Rt)$ to obtain the location in the full image.

Given this mapping from tile latent variables to the catalog of interest,

$$\left( \tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)} \right)_{s=1,t=1}^{S,T} \mapsto \{ N, (\ell_i, f_{i,1}, \ldots, f_{i,B})_{i=1}^{N} \}, \tag{19}$$

a distribution on the tile latent variables induces a distribution on catalogs.

Our variational distribution on $\tilde{z}$ factorizes over image tiles:

$$\tilde{q}_\eta\left( (\tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)})_{s=1,t=1}^{S,T} \mid x \right) = \prod_{s=1}^{S} \prod_{t=1}^{T} \tilde{q}_\eta\left( \tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)} \mid x \right). \tag{20}$$

Within each tile $(s,t)$, the distribution also fully factorizes:

$$\tilde{q}_\eta\left( \tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)} \mid x \right) = \tilde{q}_\eta\left( \tilde{N}^{(s,t)} \mid x \right) \prod_{n=1}^{\infty} \prod_{i=1}^{n} \tilde{q}_\eta\left( \tilde{\ell}^{(s,t)}_{n,i} \mid x \right) \tilde{q}_\eta\left( \tilde{f}^{(s,t)}_{n,i} \mid x \right). \tag{21}$$

If $\tau$ is the mapping in (19), then the variational distribution on catalogs $z$ is

$$q_\eta(z \mid x) := \tilde{q}_\eta(\tau^{-1}(z) \mid x), \tag{22}$$

where $\tau^{-1}(z)$ is the pre-image of $z$ under $\tau$.

Evaluating the variational distribution
Evaluating the ELBO requires computing the probability $q_\eta(z \mid x)$ for any given catalog $z = \{N, (\ell_i, f_{i,1}, \ldots, f_{i,B})_{i=1}^{N}\}$. By (22), it suffices to evaluate $\tilde{q}_\eta(\tau^{-1}(z) \mid x)$. Here, $\tau^{-1}(z)$ is a set of tile latent variables because the mapping from tile latent variables to catalogs $z$ is not injective, as we now explain.

Locations in the catalog $\{\ell_i\}_{i=1}^{N}$ determine the number of stars on tile $(s,t)$. The number of stars $\tilde{N}^{(s,t)}$ is simply the count of the locations that reside within that tile:

$$\tilde{N}^{(s,t)} = \sum_{i=1}^{N} \mathbb{1}\left\{ \ell_i \in [Rs, R(s+1)] \times [Rt, R(t+1)] \right\}, \tag{23}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, equal to one if true and zero if false.

Now, consider $\tilde{\ell}^{(s,t)}$ and $\tilde{f}^{(s,t)}$, the triangular arrays of locations and fluxes on tile $(s,t)$. For each $(s,t)$, the $\tilde{N}^{(s,t)}$-th row of the triangular array of fluxes and locations is determined by the locations and fluxes of stars imaged in tile $(s,t)$; they are determined by the catalog $z$. However, the other rows of the triangular arrays are not determined by the catalog $z$; they are free to take any value in their domain. Therefore, the mapping $\tau$ is not injective.

[Figure 2: An example image with four tiles illustrating the relationship between tile latent variables and the full-image latent variables. The upper-left tile has two stars; the upper-right and lower-left have one; the bottom right has none. To construct the full-image catalog, we index into the appropriate row of the triangular array on each tile.]

Thus, evaluating the probability of $\tau^{-1}(z)$ under $\tilde{q}_\eta$ requires marginalizing over the rows of the triangular arrays $\tilde{\ell}^{(s,t)}$ and $\tilde{f}^{(s,t)}$ that are not determined by $z$. However, because $\tilde{q}_\eta$ fully factorizes, the terms in (21) where $n \ne \tilde{N}^{(s,t)}$ do not enter the product after marginalization. Applying this observation and combining (21) and (20), $\tilde{q}(\tau^{-1}(z) \mid x)$ becomes

$$\tilde{q}(\tau^{-1}(z) \mid x) = \prod_{s=1}^{S} \prod_{t=1}^{T} \left\{ \tilde{q}_\eta(\tilde{N}^{(s,t)} \mid x) \prod_{i=1}^{\tilde{N}^{(s,t)}} \tilde{q}_\eta\left( \tilde{\ell}^{(s,t)}_{\tilde{N}^{(s,t)},i} \mid x \right) \tilde{q}_\eta\left( \tilde{f}^{(s,t)}_{\tilde{N}^{(s,t)},i} \mid x \right) \right\}. \tag{24}$$

In words, given a catalog $z$, first convert $z$ to tile random variables; to compute $q_\eta(z \mid x)$, it suffices to evaluate $\tilde{q}_\eta$ only at the rows of the triangular arrays determined by the number of stars falling in each tile.

Variational distributions on image tiles

We describe the variational distribution on each tile, $\tilde{q}_\eta(\tilde{N}^{(s,t)}, \tilde{\ell}^{(s,t)}, \tilde{f}^{(s,t)} \mid x)$. Dropping the index $(s,t)$ in this subsection,

$$\tilde{N} \sim \mathrm{Categorical}(\omega; 0, \ldots, N_{\max}); \tag{25}$$
$$\tilde{\ell}_{\tilde{N},i} / R \sim \mathrm{LogitNormal}\left( \mu^{\ell}_{\tilde{N},i}, \mathrm{diag}(\nu^{\ell}_{\tilde{N},i}) \right); \tag{26}$$
$$\tilde{f}^{\,b}_{\tilde{N},i} \sim \mathrm{LogNormal}\left( \mu^{f_b}_{\tilde{N},i}, \sigma^{f_b}_{\tilde{N},i} \right), \tag{27}$$

for $i = 1, \ldots, \tilde{N}$; $\tilde{N} = 1, \ldots, N_{\max}$. $\omega$ is an $(N_{\max} + 1)$-dimensional vector on the simplex. $\mu^{\ell}_{\tilde{N},i}$ and $\nu^{\ell}_{\tilde{N},i}$ are two-dimensional vectors; the covariance on locations is diagonal. The latent variables also fully factorize within each tile. Note that in the exact posterior, $\tilde{N}$ has support on the nonnegative integers; in the variational distribution, we truncate at some large $N_{\max}$.

These distributions were chosen for convenience: fluxes are positive and right skewed, so we place a normal distribution on log-fluxes; locations are between zero and $R$, so we place a normal distribution on the logit of the location scaled by $1/R$.

In each tile, the distributional parameters in (25), (26), and (27) are the output of a neural network. The input to the neural network is an $R \times R$ tile, padded with surrounding pixels.
Padding enables the neural network to produce better predictions inside the tile. For example, a bright source in the vicinity of but outside the tile will affect the pixel values inside the tile. Padding the tiles allows the neural network access to this information. The appropriate amount of padding will depend on the PSF width in the analyzed image. To catalog the crowded starfield M2 (Section 6), we set $R = 2$ and padded the tile with a three-pixel-wide boundary. Thus, while the distribution on tile latent variables factorizes over tiles, the neural network is able to use information from neighboring tiles in producing the distributional parameters.

Let $\hat{x}^{(s,t)}$ denote the padded tile pixel intensities (which include all $B$ bands) and $h_\eta$ be the neural network, which returns the collection of distributional parameters

$$h_\eta(\hat{x}^{(s,t)}) = \left( \omega^{(s,t)}, \mu^{(s,t)}_{\ell}, \nu^{(s,t)}_{\ell}, \mu^{(s,t)}_{f}, \sigma^{(s,t)}_{f} \right). \tag{28}$$

The same neural network is evaluated for all tiles $(s,t)$. Our variational parameters are the neural network weights, here denoted $\eta$. The architecture consists of several convolutional layers followed by a series of fully connected layers (Figure 3). This architecture has been successful on image classification challenges such as ImageNet (Russakovsky et al., 2015). The optimization of the architecture is left for future work; our focus in this paper is the inference procedure.

[Figure 3: The neural network architecture. For cataloging M2, the input image is an 8 × 8 padded tile.]

For each tile $(s,t)$, the categorical parameter $\omega^{(s,t)}$ lies on the simplex and has dimension $N_{\max} + 1$. Furthermore, each index of the triangular array $i = 1, \ldots, \tilde{N}^{(s,t)}$, $\tilde{N}^{(s,t)} = 1, \ldots, N_{\max}$ describes a star. The star has a mean and variance for each location coordinate, and a mean and variance for its flux in each band.
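The tiling-with-padding scheme above can be sketched in a few lines of numpy. The function name, the zero-padding at image edges, and the defaults ($R = 2$, three-pixel padding, giving the 8 × 8 network input used for M2) are illustrative; the paper's implementation may handle edges differently.

```python
import numpy as np

def padded_tiles(image, R=2, pad=3):
    """Split a (B, H, W) image into R x R tiles, each padded on every side.

    Returns an array of shape (S, T, B, R + 2*pad, R + 2*pad). Tiles near the
    image edge are zero-padded here; this is an illustrative choice.
    """
    B, H, W = image.shape
    S, T = H // R, W // R
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((S, T, B, R + 2 * pad, R + 2 * pad))
    for s in range(S):
        for t in range(T):
            # Tile (s, t) starts at (R*s, R*t) in image coordinates; padding
            # shifts all indices by `pad`, so the crop starts at the same
            # index in the padded array.
            out[s, t] = padded[:, R * s: R * s + R + 2 * pad,
                                  R * t: R * t + R + 2 * pad]
    return out
```

Each `(B, 8, 8)` slice of the result is one input $\hat{x}^{(s,t)}$ to the network $h_\eta$; the same network weights are applied to every tile.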
Thus, for each star $(\tilde{N}^{(s,t)}, i)$, the neural network outputs $2 \times (B + 2)$ parameters. In total, the neural network has output dimension $(N_{\max} + 1) + (B + 2) \times (N_{\max} + N_{\max}^2)$.

Because the output dimension is quadratic in $N_{\max}$, factorizing the variational distribution spatially keeps the output dimension of the neural network manageable. On a crowded starfield with $H = W = 100$, the number of imaged stars is on the order of $10^3$. If the neural network were to return a variational distribution on the full 100 × 100 pixel image, the output dimension would be on the order of $(10^3)^2$. On the 2 × 2 tiles, we set $N_{\max} = 3$, and the output dimension of the neural network with two bands is 53. With this factorization, the network can be used for inference on an image of any size.

We emphasize that while the variational distribution factorizes over 2 × 2 tiles, the padded input allows the neural network to draw on information from neighboring tiles.

The wake-sleep algorithm

In mean-field variational inference, the ELBO (12) is often optimized by coordinate ascent in the parameters of the variational distribution. The coordinate ascent updates can be derived in closed form in the special case of exponential family models that are conditionally conjugate (Blei et al., 2017).

In our setting of amortized variational inference, stochastic optimization procedures have been employed with modern auto-differentiation tools to avoid the need for deriving analytic updates. Examples include black-box variational inference (BBVI) (Ranganath et al., 2013) and automatic-differentiation variational inference (ADVI) (Kucukelbir et al., 2017). The latter is closely related to the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014) proposed to train deep generative models using the KL objective. These approaches all sample latent variables from $q_\eta$ and produce an unbiased estimate for the gradient of the KL; the optimization is done with stochastic gradient descent.

However, the reparameterization trick does not apply when a latent variable, in our case the number of stars $N$, is discrete. The REINFORCE estimator (Williams, 1992), which BBVI adapts, produces an unbiased stochastic gradient for both continuous and discrete latent variables. However, these gradients often suffer from high variance in practice, and so the resulting stochastic optimization is slow. We find this to be true in our application as well (Section 5).
See Appendix A for details concerning both the reparameterizationand REINFORCE estimators.The key difficulty in constructing stochastic gradients of the ELBO is that the inte-grating distribution depends on the optimization parameter η . The wake-sleep algorithm,originally proposed by Hinton et al. (1995), replaces the ELBO objective with L sleep ( η ) := − E x ∼ p ( x ) (cid:104) KL( p ( z | x ) (cid:107) q η ( z | x )) (cid:105) , (29)known as the sleep objective . Section 4.1 details a simple gradient estimator for (29) thatdoes not require reparameterization or REINFORCE.There are two key differences between the sleep objective (29) and the ELBO (12).First, the KL arguments are transposed. Recall that maximizing the ELBO is equivalentto minimizing KL( q (cid:107) p ), and therefore, the minimization of KL( q (cid:107) p ) does not depend on theintractable data likelihood p ( x ). However, the minimization of KL with arguments reverseddoes. As will be detailed in Section 4.1, the outer expectation over the data in (29) makesthis optimization objective tractable without dependence on p ( x ).Second, the expectation over the data also gives different meaning to the sleep objective.The ELBO objective seeks η to minimize the KL between q η ( z | x ) and p ( z | x ) for fixed,observed data x , in this case the H × W image. In contrast, the sleep objective seeks tominimize the KL on average over all possible data x , as weighted by the generative model p ( x ). In other words, the target is no longer an approximate posterior for the observeddata, but rather an approximate posterior that is “good on average” over possible data.Moreover, “possible data” is defined under the generative model from Section 2. There-fore, it is imperative that the generative model p ( x ) approximates the true underlyingdata-generating mechanism well. Thus, in addition to the sleep phase, the wake-sleep algo-14ithm also incorporates a “wake-phase” to estimate model parameters. 
In our application, these model parameters include the PSF parameters π and background parameters β . Let φ := ( π, β ) be the concatenation of all model parameters, and denote the dependence of the generative model on φ using subscripts, p φ .

To estimate model parameters, one would ideally optimize the marginal log-likelihood log p φ ( x ). However, since log p φ ( x ) is intractable, the wake phase optimizes φ using the ELBO objective (12) as a surrogate for the intractable log-likelihood. The ELBO is a lower bound of log p φ ( x ); it is equal to log p φ ( x ) when q η ( z | x ) = p φ ( z | x ).

The wake-sleep algorithm thus alternates between the two objectives:

Sleep phase: η t = arg max η − E x ∼ p φ t−1 ( x ) [ KL( p φ t−1 ( z | x ) ∥ q η ( z | x )) ] ; (30)

Wake phase: φ t = arg max φ E q η t ( z | x ) [ log p φ ( x, z ) − log q η t ( z | x ) ] , (31)

for iterations t = 1 , ..., T . Stochastic gradients of the expectation in the wake phase are simple to compute. Because the integrating distribution does not depend on the optimization parameter φ in the wake phase, unbiased stochastic gradients are simply computed as

∇ φ log p φ ( x, z ) for z ∼ q η .
(32)

Section 4.1 shows that a similarly simple gradient estimator exists for the sleep objective.

The wake-sleep algorithm is closely related to variational EM (Jordan et al., 1999; Neal and Hinton, 2000; Beal and Ghahramani, 2002), which alternates between an expectation step (E-step) and a maximization step (M-step):

E-step: η t = arg max η E q η ( z | x ) [ log p φ t−1 ( x, z ) − log q η ( z | x ) ] ; (33)

M-step: φ t = arg max φ E q η t ( z | x ) [ log p φ ( x, z ) − log q η t ( z | x ) ] , (34)

for iterations t = 1 , ..., T . Variational EM can be viewed as block coordinate ascent on the ELBO objective, with the E-step optimizing the variational parameters η and the M-step optimizing the model parameters φ .

The wake phase of the wake-sleep algorithm is equivalent to the M-step of variational EM. As discussed above, the optimization of variational parameters η using the ELBO objective is not conducive to simple stochastic gradient estimators. Thus, the sleep phase replaces the ELBO objective of the E-step with the expected forward KL in (29).

In this section, we decompose the sleep objective in (29) for closer study. We take φ as fixed in this section and drop the explicit dependence of p on φ . First, observe that optimizing the sleep objective does not require computing the intractable term p ( x ):

arg max η L sleep ( η ) = arg min η E x ∼ p ( x ) [ KL( p ( z | x ) ∥ q η ( z | x )) ] (35)
= arg min η E p ( x ) [ E p ( z | x ) ( log p ( z | x ) − log q η ( z | x ) ) ] (36)
= arg min η E p ( x ) [ E p ( z | x ) ( − log q η ( z | x ) ) ] (37)
= arg min η E p ( x,z ) [ − log q η ( z | x ) ] . (38)

Crucially, the integrating distribution, p ( x, z ), does not depend on the optimization parameter η .
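The practical upshot of this decomposition is that the sleep phase reduces to supervised learning on complete data simulated from the generative model. The sketch below uses a deliberately tiny stand-in model (one latent indicator z, one "pixel" x, and a logistic-regression "inference network"; none of these choices come from the paper) to show the recipe: simulate (z, x) from p, then run SGD on the loss −log q η (z | x).

```python
import math
import random

random.seed(1)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_complete_data():
    # "catalog": is a star present? "image": a single noisy pixel.
    z = 1 if random.random() < 0.5 else 0
    x = 5.0 * z + random.gauss(0.0, 1.0)
    return z, x

# Inference "network" q_eta(z = 1 | x) = sigmoid(a + b * x), with eta = (a, b).
a, b, lr = 0.0, 0.0, 0.05
for _ in range(20_000):
    z, x = simulate_complete_data()
    p1 = sigmoid(a + b * x)
    # Gradient of the sleep loss -log q_eta(z | x) w.r.t. (a, b) is (p1 - z) * (1, x).
    a -= lr * (p1 - z)
    b -= lr * (p1 - z) * x

# After training, q_eta approximates the exact posterior p(z | x).
q_bright = sigmoid(a + b * 5.0)   # a bright pixel: posterior mass near 1
q_dark = sigmoid(a + b * 0.0)     # a dark pixel: posterior mass near 0
```

No gradient of the generative model, no reparameterization, and no REINFORCE score term is needed: because the simulated pairs come from p(x, z), which does not involve η, the loss gradient is an ordinary supervised-learning gradient.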
In the E-step of variational EM, the integrating distribution is q η , resulting in the need for reparameterization or other adjustments to compute stochastic gradients. Here, unbiased stochastic gradients can be obtained simply as

g = −∇ η log q η ( z | x ) for ( z, x ) ∼ p ( x, z ) . (39)

In other words, the sleep phase simulates complete data ( z, x ) from the generative model and evaluates the loss − log q η ( z | x ). Here, “complete data” refers to the image along with its catalog. This loss encourages the neural network to map an image x to a distribution q η ( · | x ) that places large mass on the image’s catalog z .

Furthermore, recall that q η factorizes over image tiles. Having sampled the catalog z and the H × W image x from p ( x, z ), convert z to its tile parameterization ( ˜N ( s,t ) , ˜ℓ ( s,t ) , ˜f ( s,t ) ) for s = 1, ..., S and t = 1, ..., T , as detailed in Section 3.1.1. For a given image tile, the latent variables ˜N ( s,t ) , ˜ℓ ( s,t ) , and ˜f ( s,t ) also factorize in q , so

− log q η ( ˜N ( s,t ) , ˜ℓ ( s,t ) , ˜f ( s,t ) | x ) = − log q η ( ˜N ( s,t ) | x ) − log q η ( ˜ℓ ( s,t ) | x ) − log q η ( ˜f ( s,t ) | x ) . (40)

Each term can be interpreted separately. In tile ( s, t ), the number of stars ˜N ( s,t ) is categorical with parameter ω ( s,t ) . The loss function for the number of stars becomes

− log q η ( ˜N ( s,t ) | x ) = − Σ_{n=0}^{˜N max} 1{ ˜N ( s,t ) = n } log ω ( s,t ) n , (41)

the usual cross-entropy loss for a multi-class classification problem.

Recall that location coordinates are logit-normal and fluxes are log-normal in the last two terms in (40). For a given index ( n, i ), let y n,i generically denote either the logit-location or log-flux for that star, and let µ n,i and σ n,i generically denote the mean and standard deviation of its Gaussian variational distribution. Thus,

− log q η ( y n,i | x ) = (1 / 2σ² n,i ) ( y n,i − µ n,i )² + log σ n,i + (1/2) log(2 π ) .
(42)

By our discussion in Section 3.1.1, only the N = ˜N ( s,t ) -th row of the triangular array of latent variables ˜ℓ ( s,t ) and ˜f ( s,t ) needs to be evaluated. Therefore, the losses in the last two terms of (40) are of the form

− log q η ( y | x ) = − Σ_{i=1}^{˜N ( s,t )} log q η ( y ˜N ( s,t ) ,i | x ) . (43)

To interpret (42), observe that y n,i is the true (simulated) latent variable in the catalog and µ n,i is the predicted value for that latent variable from the neural network. σ n,i is also output by the neural network, representing uncertainty – the second term encourages small uncertainties, but this is balanced by the scaling of the squared error ( y n,i − µ n,i )² in the first term.

The losses in (41) and (42) show that the sleep objective results in a supervised learning problem on complete data simulated from our generative model: the objective function for the number of stars is the usual cross-entropy loss for classification, while the objective functions for log-fluxes and logit-locations are L2 losses in the mean parameters.

A simple example demonstrates that there exist shallow local optima in the ELBO where the fitted approximate posterior is far in KL divergence from the exact posterior. These local optima result in unreliable catalogs. The sleep objective, by taking advantage of complete data, appears to better avoid these local optima. For this example, the data were simulated with known PSF and background, and the wake phase is not needed.

The simulated 20 ×
20 single-band image x test is shown in Figure 4. The image has four stars, each with the same flux. We compare three approaches to deblending. The first two approaches directly optimize the test ELBO,

L elbo ( η ; x test ) = E q η ( z | x test ) [ log p ( x test , z ) − log q η ( z | x test ) ] , (44)

while the third approach optimizes the sleep objective (29). In each case, q η is the inference network from Section 3.1.3; the input to the network is a 10 ×
10 tile with no padding. Note that the sleep objective does not depend on x test . Optimizing the sleep objective only requires sampling catalogs from the prior and simulating images conditional on each catalog. The prior on the number of stars was set to Poisson with mean µ = 4.

Figure 4: The 20 × 20 pixel synthetic image x test . It contains four stars and is partitioned into 10 × 10 tiles.

Figure 5 (top row) charts the test ELBO (44) as the optimization proceeds in our three approaches. The first approach optimizes the ELBO with stochastic gradient descent and the REINFORCE gradient estimator. This optimization did not converge, likely due to the high variance of the REINFORCE estimator (Figure 5a). For a lower-variance gradient estimator, the second approach employed the reparameterized gradient. To employ this gradient estimator, we analytically integrated the ELBO with respect to the number of stars N to remove the discrete random variable. See Appendix A for details about the gradient estimators. Using reparameterized gradients instead of REINFORCE gradients enabled the optimization to converge to stationary points (Figure 5b). However, for two of the six randomly initialized restarts, the optimization found local optima where the negative ELBO is higher than in other restarts.

In contrast, optimizing the sleep objective consistently converged to a similar ELBO across all restarts and appeared to avoid shallow local optima (Figure 5c). Recall that sleep-phase optimization does not directly optimize the test ELBO. However, the test ELBO increases nonetheless, because the variational posterior better approximates the exact posterior as the optimization proceeds.

Shallow local optima in the ELBO result in unreliable catalogs. The bottom row of Figure 5 displays the estimated locations, defined as the mode of the fitted variational distribution. The bottom left shows these locations after converging to a shallow local optimum. In this local optimum, the upper left tile was correctly estimated to have two stars; however, both estimated stars were placed at the same location. For correct detections, one of the locations should be placed on the second star. However, in order to move one of the estimated locations to the second star, the optimization path must traverse a region where the log-likelihood is lower than at the current configuration (Figure 6). The displayed configuration is a local optimum where the gradient with respect to its locations is approximately zero.
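This vanishing gradient can be reproduced in one dimension. The sketch below is an illustrative 1-D analogue with a Gaussian PSF (our own toy, not the paper's image model): two estimated locations sit on the first of two well-separated sources, so the likelihood gradient for moving the duplicate toward the second source is numerically zero, while a quadratic sleep-style loss (written directly in the location, ignoring the logit transform) retains a large gradient.

```python
import math

def psf(u, sigma=0.5):
    # Toy Gaussian point-spread function (an assumption for illustration).
    return math.exp(-0.5 * (u / sigma) ** 2)

grid = [0.1 * i for i in range(101)]    # a 1-D "image" on [0, 10]
true_locs = (2.0, 8.0)
image = [psf(t - true_locs[0]) + psf(t - true_locs[1]) for t in grid]

def nll(l1, l2):
    # Gaussian-noise negative log-likelihood, up to constants.
    return sum((y - psf(t - l1) - psf(t - l2)) ** 2 for t, y in zip(grid, image))

# Both estimated locations sit on the first star; try to move the duplicate.
l1, l2, h = 2.0, 2.0, 1e-3
grad_nll = (nll(l1, l2 + h) - nll(l1, l2 - h)) / (2 * h)   # numerically ~ 0

# The sleep loss is quadratic in the location estimate, so its gradient survives.
grad_sleep = 2.0 * (l2 - true_locs[1])
```

Because the PSF has effectively compact support, the likelihood surface is flat between the two sources, whereas the squared-error sleep loss always points the duplicate location toward the missed star.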
In contrast, the sleep-phase variational distribution consistently placed its mode around the four true stars. The sleep objective is quadratic in the logit-location estimate µ ℓ (Equation 42), and the gradient does not vanish in the sleep objective. An example of correct detection after sleep-phase optimization is shown in the bottom right image of Figure 5.

Figure 5: (Top row) The ELBO as the optimization progresses for six random restarts. (Bottom row) Modal locations from two variational posteriors in red. On the left, a randomly initialized run where optimizing the ELBO with the reparameterized gradient resulted in a shallow local optimum: the estimated locations split a single source into two sources, while missing another source entirely. On the right, the variational posterior was optimized using the sleep phase and correctly identified all four sources.

Also, note that low-variance gradients of the ELBO for this simple example were constructed by analytically integrating out N . Only then could the reparameterization trick be applied to the ELBO, as the remaining latent variables are continuous. In this example, the variational distribution has support over only 0, 1, or 2 stars for each tile. Since the variational distribution factorizes over the four tiles, integrating out N is a summation of 3⁴ = 81 terms. On larger images with more tiles, analytically integrating out N would be computationally infeasible, and the simple reparameterization trick would not apply as it does in this small illustrative example.

Figure 7 displays our results on a larger example: a simulated 100 ×
100 pixel image with fifty stars. The tiles again consisted of 10 ×
10 pixel regions. Optimizing the ELBO does not produce accurate location estimates and appears to be hindered by regions with little gradient information; optimizing the sleep objective produced much improved location estimates.

Figure 6: An illustration of local optima in the ELBO objective. To move an estimated location to a correct location, the optimization path must traverse a region where the negative ELBO is larger than at the current configuration. In contrast, the sleep objective is quadratic in the estimated location, and the gradient does not vanish.

Figure 7: (Left) Detections produced by ELBO optimization on a 100 ×
100 pixel test image. Correct locations are blue and estimated locations are red. (Middle) Detections produced by the sleep-phase optimization. (Right) The negative conditional log-likelihood, − log p ( x | ẑ ), where ẑ is the mode of the variational posterior and x the 100 ×
100 pixel test image.
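The combinatorics behind analytically integrating out N, as described above, can be checked directly. The sketch below enumerates every configuration of per-tile star counts (support {0, 1, 2} over four tiles, matching the toy example) and weights each by hypothetical per-tile categorical probabilities; the ω values are made up for illustration.

```python
from itertools import product

support = (0, 1, 2)            # per-tile star counts in the toy example
n_tiles = 4
configs = list(product(support, repeat=n_tiles))   # all 3^4 = 81 configurations

omega = (0.5, 0.3, 0.2)        # hypothetical q(N = n) for a single tile

def config_prob(cfg):
    # q factorizes over tiles, so a configuration's mass is a product.
    prob = 1.0
    for n in cfg:
        prob *= omega[n]
    return prob

total_mass = sum(config_prob(c) for c in configs)  # exact marginalization sums to 1
cost_100_tiles = len(support) ** 100               # enumeration is hopeless at scale
```

Eighty-one terms are trivial, but the count grows as 3^T in the number of tiles T, which is why exact marginalization is infeasible on realistic images and why the sleep objective's sampling-based training is attractive.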
We validated StarNet using an SDSS image of the Messier 2 (M2) globular cluster, a crowded starfield found in field 136 of camera column 2 in run 2583. M2 was also imaged in the ACS Globular Cluster Survey (Sarajedini et al., 2007) using the Hubble Space Telescope (HST), which has greater resolution than the Sloan telescope (0.05 arcseconds per pixel and 0.4 arcseconds per pixel, respectively; see ESAHubble (2021) and SDSS (2020)). The reported catalog from this Hubble survey serves as the ground truth for validating our results.

We consider the 100 ×
100 pixel subimage of M2 that Portillo et al. (2017) and Feder et al. (2020) analyzed with PCAT. This subimage is located approximately two arcseconds from the heavily saturated core of the cluster; even in this subimage, the HST catalog contains over 1000 stars with F606W-band magnitudes less than 22.

Both StarNet and PCAT are able to model multiband images. We include two bands in our analysis, the SDSS r-band and i-band. The SDSS r-band and the Hubble F606W band are centered at roughly the same wavelength, but the wavelength range of the Hubble F606W band is slightly broader. We factorized our variational distribution into 2 × × × ×
100 pixel image of M2 took 30 milliseconds. By comparison, the reported runtime of PCAT, which uses MCMC, is 30 minutes on the same 100 ×
100 pixel image (Feder et al., 2020). In other words, after the initial fit, StarNet provided nearly a 10⁵-fold speed increase.

The speed at inference time (which excludes training time) gives StarNet the scaling characteristics necessary for processing large astronomical surveys. A single SDSS image is 1489 × 2048 pixels. Extrapolating from PCAT's runtime on the 100 × 100 pixel subimage, we project that the runtime to process the full image would be 30 min × 14 × 20 = 8400 minutes, or almost six days. The SDSS survey consists of nearly one million images. Scaling PCAT to the entire SDSS survey would be infeasible. The upcoming LSST survey will be 300 times larger than SDSS.

In contrast, if we assume the PSF and background are homogeneous across the full SDSS image (which is also assumed in PCAT), we can fit StarNet using wake-sleep on a small 100 ×
100 pixel subimage (while simultaneously obtaining estimates of the PSF and background), a one-time computational cost of 18.2 minutes. Producing a catalog with the full 1489 × 2048 pixel image would then take 30 ms × 14 × 20 = 8.4 seconds.

The cataloging accuracy of StarNet is compared with PCAT and DAOPHOT (Stetson, 1987). DAOPHOT is an algorithmic routine for detecting stars in crowded starfields which does not use a generative model. This software convolves the observed image with a Gaussian kernel and scans for peaks above a given threshold. The DAOPHOT catalog of M2 was reported in An et al. (2008).

To evaluate the three methods, the HST catalog was used as ground truth. We filtered the HST catalog to stars with magnitudes smaller than 22.5 in the Hubble F606W band, because stars with lower apparent brightness cannot be detected in SDSS images.

Estimated catalogs are evaluated on three metrics: the true positive rate (TPR), the positive predictive value (PPV), and the F1 score. The TPR is the proportion of stars in the HST catalog matched with a star in the estimated catalog; the PPV is the proportion of stars in the estimated catalog matched with a star in the HST catalog. The F1 score summarizes the two metrics as the harmonic mean of the PPV and the TPR.

Like Portillo et al. (2017) and Feder et al. (2020), a “match” between an estimated star and an HST star is defined as follows: (1) the estimated location and the HST location are within 0.5 SDSS pixels, and (2) the estimated SDSS r-band flux and the HST F606W-band flux are within half a magnitude.

In probabilistic cataloging (PCAT and StarNet), the posterior defines a distribution over catalogs. For StarNet, the TPR, PPV, and F1 score were computed for the catalog corresponding to the mode of the variational distribution (henceforth, the StarNet catalog). For PCAT, 300 catalogs were sampled using MCMC; the metrics were computed for each sampled catalog and averaged.

Fitting StarNet using wake-sleep (StarNet-WS) resulted in a catalog that outperforms DAOPHOT and PCAT in terms of its F1 score (Table 1).
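The matching criteria above translate directly into code. The sketch below implements a greedy nearest-neighbor matcher under the two thresholds (0.5 pixels, 0.5 magnitudes); the greedy tie-breaking and the tiny example catalogs are our own illustrative choices, not the exact procedure of Portillo et al. (2017).

```python
def catalog_metrics(true_cat, est_cat, max_dist=0.5, max_dmag=0.5):
    """Each catalog is a list of (x, y, mag). A true star matches at most one
    estimated star, requiring distance <= max_dist and |dmag| <= max_dmag."""
    unused = list(range(len(est_cat)))
    n_match = 0
    for tx, ty, tm in true_cat:
        best_j, best_d = None, None
        for j in unused:
            ex, ey, em = est_cat[j]
            d = ((tx - ex) ** 2 + (ty - ey) ** 2) ** 0.5
            if d <= max_dist and abs(tm - em) <= max_dmag:
                if best_j is None or d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            unused.remove(best_j)
            n_match += 1
    tpr = n_match / len(true_cat)              # recall of true stars
    ppv = n_match / len(est_cat)               # precision of estimated stars
    f1 = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv > 0 else 0.0
    return tpr, ppv, f1

# Tiny hypothetical catalogs: the third estimated star matches nothing.
hubble = [(1.0, 1.0, 20.0), (5.0, 5.0, 21.0), (9.0, 9.0, 22.0)]
estimated = [(1.1, 1.0, 20.1), (5.2, 5.1, 21.2), (3.0, 3.0, 25.0)]
tpr, ppv, f1 = catalog_metrics(hubble, estimated)
```

In the example, two of three true stars are recovered and one estimated star is spurious, so TPR, PPV, and F1 all equal 2/3; an over-detecting method drives TPR up and PPV down, and vice versa, which is exactly the PCAT/DAOPHOT trade-off described below.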
Fitting StarNet without the wake phase and using only the default SDSS background and PSF (StarNet-S) produced a catalog with a TPR similar to that of StarNet-WS but with a smaller PPV. PCAT estimated the most stars of all methods; it therefore had a large TPR but a small PPV. On the other hand, DAOPHOT estimated less than half the number of stars when compared to the other methods. It therefore had a large PPV but a small TPR. The StarNet-WS catalog had about the same PPV as DAOPHOT while having nearly the same TPR as PCAT.

Table 1 also prints the number of stars inferred by each method. For probabilistic methods (StarNet and PCAT), we display the mean number of stars under the approximate posterior, along with the 5th and 95th quantiles. Neither PCAT nor StarNet captured the true number of stars in their 90% credible interval, though StarNet-WS came the closest.

Table 1: Performance metrics on M2. In the bottom row, “Hubble” refers to the catalog which we use as ground truth. For probabilistic methods (StarNet and PCAT) the “ r-band magnitude. Smaller magnitudes correspond to brighter stars.

The StarNet credible intervals were three times wider than the PCAT intervals. The small PCAT credible intervals may be indicative of an MCMC sampler that failed to mix well.

The improvement of StarNet-WS in PPV was most pronounced for dim stars (Figure 8). The TPR for StarNet-WS was uniformly better than that of DAOPHOT for stars of all magnitudes. Of all methods, the StarNet-WS catalog best approximated the Hubble flux distribution (Figures 9 and 10). PCAT overestimated the number of dim stars with magnitudes greater than 21. In contrast, DAOPHOT failed to find dim stars; its flux distribution assigned zero probability to stars with magnitudes greater than 21.

The difference between the source-magnitude histograms (Figure 10) produced by StarNet-WS and PCAT is likely due to the fact that StarNet-WS estimates the background with maximum likelihood, while the background in PCAT is fixed.
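The 90% credible interval referenced above is simply the 5th–95th quantile range of posterior samples of the star count. A sketch of the computation (every number here is a hypothetical stand-in; the Hubble count of 1114 for subimage x is taken from Appendix B):

```python
import random

random.seed(2)

# Hypothetical posterior samples of the total number of stars (not real output).
samples = sorted(round(random.gauss(1050, 25)) for _ in range(300))

def quantile(sorted_xs, q):
    # nearest-rank quantile on a pre-sorted list
    idx = int(round(q * (len(sorted_xs) - 1)))
    return sorted_xs[idx]

mean_n = sum(samples) / len(samples)
lo, hi = quantile(samples, 0.05), quantile(samples, 0.95)

hubble_count = 1114                       # ground-truth count for subimage x
covered = lo <= hubble_count <= hi        # does the 90% interval cover it?
```

With these made-up samples the interval misses the true count, mirroring the failure mode reported for both probabilistic methods; an interval that is too narrow (as with PCAT's) makes such misses more likely.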
Figure 9: True versus estimated magnitudes on the r-band of M2. Every estimated star was matched with the nearest Hubble star in L2 distance between locations. Red line is the identity, y = x.

Figure 10: Source magnitude histograms for the r-band observations of M2. The magnitude distribution of the Hubble catalog in grey. Magnitude distributions for the DAOPHOT, PCAT, and StarNet-WS catalogs overlaid. For PCAT, the magnitude histogram plotted is from a single catalog sampled by MCMC. PCAT overestimates the number of dim sources, while DAOPHOT fails to detect dim sources.

As discussed below, the SDSS estimate of the background (which PCAT uses) is too dim, and PCAT compensates for the model mis-match by estimating more dim stars. The StarNet-WS estimate increases the intensity of the background, and hence does not over-estimate dim stars.

Figure 11 shows the StarNet-WS catalog alongside the PCAT, DAOPHOT, and Hubble catalogs.

Figure 11: Estimated catalogs on four 10 × 10 subimages from M2. Blue dots are stars from the HST catalog used as ground truth. StarNet-WS, PCAT, and DAOPHOT estimated stars are shown as red, cyan, and orange crosses, respectively.

StarNet-WS returns a well-defined distribution over the set of all catalogs and is thus able to capture uncertainties in catalog construction. Samples from the StarNet-WS variational distribution are compared with samples from PCAT in Figure 12. To examine the uncertainty calibration of StarNet-WS, we evaluated the approximate posterior conditional on the true number of stars in the Hubble catalog. Then, each star in the StarNet-WS catalog was matched with exactly one Hubble star by finding the permutation of Hubble stars that had the largest log-likelihood under our variational distribution q η. For each star, we computed the z-score ( y − ŷ ) / σ̂, where y is the Hubble log-flux or logit-location; ŷ and σ̂ are the mean and the standard deviation, respectively, of the Gaussian variational distribution for the log-flux or logit-location. If the uncertainties were perfectly calibrated, the distribution of these z-scores would be a standard Gaussian. The empirical z-score distributions are close to a standard Gaussian with some discrepancy in the tails and some evidence of unmodeled skewness, suggesting the uncertainties are not too mis-estimated (Figure 13).

Finally, we evaluate the quality of the wake-phase estimates for the PSF and background. Let z h denote the Hubble catalog, and given some model parameters φ, let

L Hubble ( φ ) := − log p φ ( x ( r ) | z h ) (45)

be the negative log-likelihood of the SDSS r-band image conditional on the Hubble catalog (recall that the Hubble F606W band most closely resembles the SDSS r-band). The log-likelihood of model parameters estimated by the wake phase is nearly four times larger than the log-likelihood of parameters estimated by SDSS (Table 2). The largest improvement in log-likelihood came from the wake-estimated background, which was brighter than the SDSS background, suggesting that the SDSS background was too dim for this starfield.

Appendix B presents the results of StarNet, PCAT, and DAOPHOT on a neighboring 100 ×
100 pixel subimage of M2. The results remain qualitatively similar in that StarNet has the best F1 score of all methods. Importantly, StarNet was applied to the new region without further wake-sleep optimization; StarNet produced a catalog on this new region in ≈ 30 milliseconds. On the other hand, PCAT required a new 30-minute run of MCMC.

Figure 12: Four 10 ×
10 subimages from M2. Blue dots are stars from the HST catalog used as ground truth. (Left) Posterior samples from StarNet-WS and (right) posterior samples from the MCMC chain of PCAT.

Table 2: The negative log-likelihood (45) for various model estimates. PHOTO provides estimates of background and PSF for every SDSS data release. We compare the PHOTO estimates with StarNet estimates obtained after two cycles of wake-sleep.

Background estimate   PSF estimate   Neg. log-lik.
PHOTO                 PHOTO          8.671e+05
PHOTO                 StarNet        8.665e+05
StarNet               PHOTO          3.651e+05
StarNet               StarNet        3.395e+05
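The calibration check described in Section 6 is easy to sketch: if the reported uncertainties are right, the z-scores (y − ŷ)/σ̂ should look standard normal. The data below are synthetic stand-ins for the matched Hubble/StarNet pairs, generated to be well calibrated by construction; only the check itself carries over.

```python
import random

random.seed(3)

# Synthetic stand-ins for matched (Hubble value, variational mean, variational SD).
pairs = []
for _ in range(1000):
    y = random.gauss(0.0, 1.0)            # "true" log-flux, standardized
    sd = 0.3                              # reported posterior SD
    y_hat = y + random.gauss(0.0, sd)     # calibrated by construction
    pairs.append((y, y_hat, sd))

z_scores = [(y - y_hat) / sd for y, y_hat, sd in pairs]
z_mean = sum(z_scores) / len(z_scores)
z_var = sum((z - z_mean) ** 2 for z in z_scores) / (len(z_scores) - 1)
# For well-calibrated uncertainties, z_mean ~ 0 and z_var ~ 1.
```

A sample variance well above 1 would indicate over-confident posteriors (σ̂ too small), well below 1 under-confident ones; skewness or heavy tails, as observed in Figure 13, point to unmodeled non-Gaussianity rather than a mis-scaled σ̂.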
StarNet employs variational inference and outperforms both an MCMC-based probabilistic cataloger and a non-model-based approach in terms of accuracy and runtime. Under the framework of probabilistic modeling, StarNet produces catalogs in which uncertainties are captured by a Bayesian posterior over the set of all catalogs. Importantly, unlike MCMC, StarNet also has the capacity to scale probabilistic cataloging to process large astronomical surveys.

Figure 13: The empirical distribution of z-scores for logit-locations (left) and log-fluxes (right) under the StarNet-WS variational posterior.

The quality of StarNet detections is the result of optimizing the forward KL, a different objective than the one traditionally used in variational inference. Optimizing the forward KL allows the variational posterior to be fit on large amounts of complete data – the image along with its latent catalog – generated from the statistical model. While labeled data from simulators has been used in other astronomy applications to train deep neural networks (see for example Lanusse et al. (2017) and Huang et al. (2020)), StarNet is the first training procedure to employ simulated data in a statistical framework: the neural network specifies an approximate Bayesian posterior.

This variational approach, unlike previous MCMC approaches, enables StarNet to estimate model parameters such as the PSF and sky background. While the current work focuses on PSF models, our methodology can be extended to more general sources such as galaxies. Unsurprisingly, the current performance of StarNet is sub-optimal for cataloging regions of the sky that contain both stars and galaxies, due to model misfit (Appendix D).

One promising direction is using a neural network for the generative model and fitting a deep generative model for galaxies (Regier et al., 2015; Reiman and Göhre, 2019; Lanusse et al., 2020; Arcelin et al., 2021).
Here, a neural network encodes a conditional likelihood of galaxy images given a low-dimensional galaxy representation. Using a neural network to encode a likelihood extends the flexibility of galaxy models beyond the simple models used here.

The statistical framework in this research lays the foundation for building flexible models to incorporate the cataloging of all celestial objects. Future astronomical surveys will only expand in terms of the volume of data they are able to amass. As telescopes peer deeper into space, fields will reveal more sources and images will become more crowded. The uncertainties in crowded fields necessitate a probabilistic approach. Our method holds the promise of providing a scalable inference tool that can meet the challenges of these future surveys.

References
Abbott, T. M., Abdalla, F. B., Alarcon, A., et al. (2018), “Dark Energy Survey Year 1 Results: Cosmological Constraints from Galaxy Clustering and Weak Lensing,” Physical Review D, 98.
An, D., Johnson, J. A., Clem, J. L., et al. (2008), “Galactic Globular and Open Clusters in the Sloan Digital Sky Survey. I. Crowded-Field Photometry and Cluster Fiducial Sequences in ugriz,” The Astrophysical Journal Supplement Series, 179, 326–354.
Arcelin, B., Doux, C., Aubourg, E., Roucelle, C., and LSST Dark Energy Science Collaboration (2021), “Deblending Galaxies with Variational Autoencoders: a Joint Multiband, Multi-instrument Approach,” Monthly Notices of the Royal Astronomical Society, 500, 531–547.
Beal, M. and Ghahramani, Z. (2002), “The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures,” Bayesian Statistics.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), “Variational Inference: A Review for Statisticians,” Journal of the American Statistical Association, 112, 859–877.
Bornschein, J. and Bengio, Y. (2014), “Reweighted Wake-Sleep,” https://arxiv.org/abs/1406.2751.
Bosch, J., Armstrong, R., Bickerton, S., et al. (2018), “The Hyper Suprime-Cam Software Pipeline,” Publications of the Astronomical Society of Japan, 70, 1–39.
Brewer, B. J., Foreman-Mackey, D., and Hogg, D. W. (2013), “Probabilistic Catalogs for Crowded Stellar Fields,” The Astronomical Journal, 146, 7–15.
ESAHubble (2021), “Hubble’s instruments: ACS – Advanced Camera for Surveys,” https://esahubble.org/about/general/instruments/acs/, [Accessed: 2021-02-21].
Feder, R. M., Portillo, S. K. N., Daylan, T., and Finkbeiner, D. (2020), “Multiband Probabilistic Cataloging: a Joint Fitting Approach to Point-source Detection and Deblending,” The Astronomical Journal, 159, 163–188.
Green, G. M., Schlafly, E., Zucker, C., Speagle, J. S., and Finkbeiner, D. (2019), “A 3D Dust Map Based on Gaia, Pan-STARRS 1, and 2MASS,” The Astrophysical Journal, 887, 93–120.
Green, P. J. (1995), “Reversible Jump Markov chain Monte Carlo Computation and Bayesian Model Determination,” Biometrika, 82, 711–732.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995), “The Wake-sleep Algorithm for Unsupervised Neural Networks,” Science, 268, 1158–1161.
Huang, X., Domingo, M., Pilon, A., et al. (2020), “Finding Strong Gravitational Lenses in the DESI DECam Legacy Survey,” The Astrophysical Journal, 894, 78–106.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. I., and Saul, L. K. (1999), “An Introduction to Variational Methods for Graphical Models,” Machine Learning, 37, 183–233.
Kingma, D. P. and Ba, J. (2014), “Adam: a Method for Stochastic Optimization,” https://arxiv.org/abs/1412.6980.
Kingma, D. P. and Welling, M. (2013), “Auto-Encoding Variational Bayes,” https://arxiv.org/abs/1312.6114.
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017), “Automatic Differentiation Variational Inference,” Journal of Machine Learning Research, 18, 1–45.
Lanusse, F., Ma, Q., Li, N., Collett, T. E., Li, C.-L., Ravanbakhsh, S., Mandelbaum, R., and Póczos, B. (2017), “CMU DeepLens: deep learning for automatic image-based galaxy–galaxy strong lens finding,” Monthly Notices of the Royal Astronomical Society, 473, 3895–3906.
Lanusse, F., Mandelbaum, R., Ravanbakhsh, S., Li, C.-L., Freeman, P., and Poczos, B. (2020), “Deep Generative Models for Galaxy Image Simulations,” https://arxiv.org/abs/2008.03833.
Le, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. (2020), “Revisiting Reweighted Wake-sleep for Models with Stochastic Control Flow,” in Uncertainty in Artificial Intelligence, pp. 1039–1049.
LSST (2020), “About LSST,” , [Accessed: 2020-12-01].
Lupton, R., Gunn, J. E., Ivezic, Z., Knapp, G. R., Kent, S., and Yasuda, N. (2001), “The SDSS Imaging Pipelines,” https://arxiv.org/abs/astro-ph/0101420.
Neal, R. and Hinton, G. (2000), “A View Of The EM Algorithm That Justifies Incremental, Sparse, And Other Variants,” Learning in Graphical Models, 89, 355–368.
Portillo, S. K. N., Lee, B. C. G., Daylan, T., and Finkbeiner, D. P. (2017), “Improved Point-source Detection in Crowded Fields Using Probabilistic Cataloging,” The Astronomical Journal, 154, 132–156.
Ranganath, R., Gerrish, S., and Blei, D. M. (2013), “Black Box Variational Inference,” https://arxiv.org/abs/1401.0118.
Regier, J., McAuliffe, J. D., and Prabhat (2015), “A deep generative model for astronomical images of galaxies,” in NIPS Workshop on Advances in Approximate Bayesian Inference.
Regier, J., Miller, A. C., Schlegel, D., Adams, R. P., McAuliffe, J. D., and Prabhat (2019), “Approximate Inference for Constructing Astronomical Catalogs from Images,” The Annals of Applied Statistics, 13, 1884–1926.
Reiman, D. M. and Göhre, B. E. (2019), “Deblending Galaxy Superpositions with Branched Generative Adversarial Networks,” Monthly Notices of the Royal Astronomical Society, 485, 2617–2627.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014), “Stochastic Backpropagation and Approximate Inference in Deep Generative Models,” https://arxiv.org/abs/1401.4082.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015), “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 115, 211–252.
Sarajedini, A., Bedin, L. R., Chaboyer, B., Dotter, A., Siegel, M., Anderson, J., Aparicio, A., King, I., Majewski, S., Marín-Franch, A., et al. (2007), “The ACS Survey of Galactic Globular Clusters,” The Astronomical Journal, 133, 1658–1672.
Schlafly, E. F., Green, G. M., Lang, D., et al. (2018), “The DECam Plane Survey: Optical Photometry of Two Billion Objects in the Southern Galactic Plane,” The Astrophysical Journal Supplement Series, 234, 39–58.
SDSS (2020), “SDSS Scope,” , [Accessed: 2020-12-23].
Stetson, P. B. (1987), “DAOPHOT: A Computer Program for Crowded-Field Stellar Photometry,” Publications of the Astronomical Society of the Pacific, 99, 191–222.
Wainwright, M. J. and Jordan, M. I. (2008), “Graphical Models, Exponential Families, and Variational Inference,” Foundations and Trends in Machine Learning, 1, 1–305.
Williams, R. J. (1992), “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Machine Learning, 8, 229–256.
Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. (2010), “The Wide-field Infrared Survey Explorer (WISE): Mission Description and Initial On-orbit Performance,” The Astronomical Journal, 140, 1868–1881.
Xin, B., Ivezić, Ž., Lupton, R. H., Peterson, J. R., Yoachim, P., Jones, R. L., Claver, C. F., and Angeli, G. (2018), “A Study of the Point-spread Function in SDSS Images,” The Astronomical Journal, 156, 222–232.
Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. (2019), “Advances in Variational Inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 2008–2026.
A Reparameterized and REINFORCE gradients
The ELBO objective (12) is of the form

L ( η ) = E q η ( z ) [ f η ( z ) ] . (46)

The parameter η is to be optimized, and z is the latent variable. The integrating distribution q and the function f depend on η .

The REINFORCE estimator (Williams, 1992) is a general-purpose unbiased estimator for the gradient of (46). It is given by

g rf ( z ) = ∇ η f η ( z ) + f η ( z ) ∇ η log q η ( z ) for z ∼ q η ( z ) . (47)

The REINFORCE estimate is unbiased for the true gradient:

E q η ( z ) [ g rf ( z ) ] = ∫ q η ( z ) ∇ η f η ( z ) dz + ∫ q η ( z ) f η ( z ) ∇ η log q η ( z ) dz (48)
= ∫ q η ( z ) ∇ η f η ( z ) dz + ∫ f η ( z ) ∇ η q η ( z ) dz (49)
= ∫ ∇ η [ q η ( z ) f η ( z ) ] dz (50)
= ∇ η ∫ q η ( z ) f η ( z ) dz = ∇ η E q η ( z ) [ f η ( z ) ] , (51)

assuming that f is well-behaved so that integration and differentiation can be interchanged.

Alternatively, the reparameterized gradient (Rezende et al., 2014; Kingma and Welling, 2013) can be used when there exists some distribution F not involving η and a differentiable mapping h η such that

w ∼ F ⟹ h η ( w ) ∼ q η . (52)

For example, if q η ( z ) = N ( z ; 0, η ), that is, a Gaussian with zero mean and variance η , one possibility is to let F be the standard Gaussian and h η ( w ) = w √η .

The gradient of L ( η ) can be written as

∇ η E q η ( z ) [ f η ( z ) ] = ∇ η E w ∼ F [ f η ( h η ( w )) ] = E w ∼ F [ ∇ η f η ( h η ( w )) ] , (53)

again assuming the interchangeability of integrals and derivatives. Unbiased gradients are computed as

g rp = ∇ η f η ( h η ( w )) = ∇ z f η ( z ) | z = h η ( w ) ∇ η h η ( w ) for w ∼ F . (54)

The reparameterized gradient includes gradient information ∇ z f η ( z ) while the REINFORCE gradient does not. Taking into account the structure of f through its gradient lowers the variance of the reparameterized gradient in comparison to the REINFORCE gradient.
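The N(0, η) example above can be checked numerically. The sketch below fixes f(z) = z² (our choice, purely for illustration), so L(η) = E[z²] = η and the exact gradient is 1; both estimators are unbiased, but the REINFORCE draws are far more variable.

```python
import math
import random

random.seed(4)

eta = 1.0          # q_eta = N(0, eta); with f(z) = z^2, dL/d_eta = 1 exactly
n = 100_000

# Reparameterized: z = h_eta(w) = w * sqrt(eta), so d/d_eta f(h_eta(w)) = w^2.
g_rp = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

# REINFORCE: g = f(z) * d/d_eta log q_eta(z)
#          = z^2 * (z^2 / (2 eta^2) - 1 / (2 eta)) for z ~ N(0, eta).
g_rf = []
for _ in range(n):
    z = random.gauss(0.0, math.sqrt(eta))
    g_rf.append(z * z * (z * z / (2 * eta ** 2) - 1 / (2 * eta)))

mean_rp = sum(g_rp) / n
mean_rf = sum(g_rf) / n
var_rp = sum((g - mean_rp) ** 2 for g in g_rp) / (n - 1)   # analytically 2
var_rf = sum((g - mean_rf) ** 2 for g in g_rf) / (n - 1)   # analytically 18.5
```

Both sample means land near the exact gradient of 1, while the REINFORCE variance (18.5 in closed form, from the fourth and eighth Gaussian moments) is roughly nine times the reparameterized variance of 2, illustrating why the reparameterized estimator is preferred whenever it applies.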
However, if $z$ contains discrete components, there cannot be a differentiable mapping $h_\eta$, and the reparameterization trick will not apply.

B Results on a test M2 image
We evaluate StarNet on another subregion of M2. The initial 100 × 100 subregion of M2 considered in our main paper was located at pixel coordinates (630, 310) in SDSS run 2583, field 136, camera column 6. After fitting StarNet on this initial region x, we evaluate StarNet on a neighboring region x_test. See Figure A.1 for the locations of the considered subregions.

The subregion x_test has approximately 25% more stars than x (1413 Hubble detections in x_test versus 1114 detections in x). Due to the increased density, all methods suffered an ≈ 10 percentage point decrease in F1 score on x_test compared to the F1 score on x (compare Tables 1 and A.1).

Table A.1: Performance metrics on the M2 test subregion. StarNet-init is the network fit on the original M2 subregion (the same network as StarNet-WS in the main text). StarNet-refit ran two further wake-sleep cycles on the M2 test subregion.

Method | TPR | PPV | F1 score

Using StarNet-init and its fitted background and PSF as an initialization, we ran two further cycles of wake-sleep on x_test (StarNet-refit). StarNet-refit improved the PPV over StarNet-init by two percentage points. The TPR appeared to be nearly identical across all magnitudes (Figure A.2).

These results suggest that StarNet-init extrapolates well to neighboring regions, and re-running wake-sleep is not necessary. Evaluating StarNet-init on x_test took 30 milliseconds; re-running PCAT, by contrast, takes another 30 minutes. Even if StarNet did require refitting, the subsequent wake-sleep cycles take only an additional three minutes. Amortization gives StarNet inference much better scaling characteristics than PCAT.

Figure A.1: The SDSS field containing M2. In red, the subregion x considered in the main text; in blue, the subregion x_test.

C Sensitivity to prior parameters
We examine the sensitivity of StarNet to the prior parameters µ and α on the M2 image. Recall that µ is the prior mean number of stars per pixel (2), and α is the power law slope on the r-band fluxes (4). In the results of Section 6, µ = 0.15 and α = 0.5. As µ increases, the TPR increases while the PPV decreases – the prior encourages more detections (Figure A.3).

Figure A.2: True positive rate (left) and positive predictive value (right) of various methods on the M2 test subregion. StarNet-init is the network fit on the original M2 subregion (the same network as StarNet-WS in the main text). StarNet-refit ran two further wake-sleep cycles on the M2 test subregion.

As α increases, StarNet estimates more dim sources: the prior distribution on fluxes places more mass near f_min (Figure A.4). While the TPR increases across α values, the PPV suffers at α = 0.75 (Figure A.5). At α = 0.75, we see that the TPR improves at dimmer sources at the expense of brighter sources, suggesting that the brighter sources become over-split. The stronger prior on dimmer stars also manifests in the flux distribution of the resulting StarNet catalog (Figure A.6).
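The effect of α on the prior can be checked directly by sampling. The sketch below (illustrative only; f_min = 1.0 is an arbitrary value, not StarNet's) draws fluxes from the power law p(f) ∝ f^(-(α+1)) for f > f_min and measures how much prior mass falls near f_min:

```python
import numpy as np

rng = np.random.default_rng(0)
f_min = 1.0  # hypothetical flux floor, chosen for illustration

for alpha in (0.25, 0.50, 0.75):
    # (1 + Lomax draw) * f_min is a classical Pareto sample whose
    # density is proportional to f^{-(alpha + 1)} on f > f_min.
    fluxes = (1.0 + rng.pareto(alpha, size=200_000)) * f_min
    frac_dim = np.mean(fluxes < 2.0 * f_min)  # prior mass on dim sources
    print(f"alpha = {alpha:.2f}: P(f < 2 f_min) = {frac_dim:.3f}")
```

Analytically, P(f < 2 f_min) = 1 − 2^(−α), so the dim-source mass rises from about 0.16 at α = 0.25 to about 0.41 at α = 0.75, matching the behavior described above.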
D Results on Stripe-82
Most regions of the sky are much less densely populated than M2. We test StarNet on run 94, camcol 1, field 12 of SDSS, an image with a light source density more typical of SDSS images. After 10 minutes of sleep training, StarNet produced a catalog on the full 1489 × 2048 pixel image, divided into ≈ … × 50 tiles; N_max on tiles is three. Because this region of the sky also contains galaxies, only the sleep phase was employed to fit StarNet; the wake phase would optimize the PSF to explain both stars and galaxies.

This image is contained in Stripe 82, a region of the sky repeatedly imaged by SDSS. Averaging images from different runs boosts the signal-to-noise ratio, resulting in a "co-added" image. We compare the performance of StarNet and PHOTO on the non-co-added image. The PHOTO catalog of the co-added image was used as ground truth.

The TPR of StarNet is comparable with the TPR of the PHOTO catalog on the original (non-co-added) image (Figure A.7). StarNet accrues false detections, namely galaxies, and thus we cannot compare the PPV. On some tiles, missed detections occur when a large galaxy within a tile causes all N_max detections to be placed around the galaxy – the remaining stars in the image go undetected. Incorporating a galaxy model would boost our performance.

Figure A.3: Sensitivity of performance metrics to the Poisson mean parameter on the number of stars (µ in Equation 2).

µ | TPR | PPV
0.10 | 0.46 | 0.66
0.15 | 0.51 | 0.60
0.20 | 0.51 | 0.49

Figure A.4: The prior density of r-band magnitude for various power law slopes α.

Figure A.5: Sensitivity of performance metrics to flux prior parameter α.

α | TPR | PPV
0.25 | 0.46 | 0.60
0.50 | 0.51 | 0.60
0.75 | 0.52 | 0.54

Figure A.6: Sensitivity of the estimated flux distribution to flux prior parameter α.