Real-Time Likelihood-Free Inference of Roman Binary Microlensing Events with Amortized Neural Posterior Estimation
Draft version February 18, 2021
Typeset using LaTeX twocolumn style in AASTeX62
Keming Zhang (张可名),¹ Joshua S. Bloom,¹,² B. Scott Gaudi,³ François Lanusse,⁴ Casey Lam,¹ and Jessica R. Lu¹

¹Department of Astronomy, University of California, Berkeley, CA 94720-3411, USA
²Lawrence Berkeley National Laboratory, 1 Cyclotron Road, MS 50B-4206, Berkeley, CA 94720-3411, USA
³Department of Astronomy, The Ohio State University, Columbus, OH 43210, USA
⁴AIM, CEA, CNRS, Université Paris-Saclay, Université Paris Diderot, Sorbonne Paris Cité, F-91191, Gif-sur-Yvette, France
ABSTRACT

Fast and automated inference of binary-lens, single-source (2L1S) microlensing events with sampling-based Bayesian algorithms (e.g., Markov Chain Monte Carlo; MCMC) is challenged on two fronts: the high computational cost of likelihood evaluations with microlensing simulation codes, and a pathological parameter space where the negative-log-likelihood surface can contain a multitude of local minima that are narrow and deep. Analysis of 2L1S events usually involves grid searches over some parameters to locate approximate solutions as a prerequisite to posterior sampling, an expensive process that often requires human-in-the-loop domain expertise. As the next-generation, space-based microlensing survey with the Roman Space Telescope is expected to yield thousands of binary microlensing events, a new fast and automated method is desirable. Here, we present a likelihood-free inference (LFI) approach named amortized neural posterior estimation, where a neural density estimator (NDE) learns a surrogate posterior p̂(θ|x) as an observation-parametrized conditional probability distribution, from pre-computed simulations over the full prior space. Trained on 291,012 simulated Roman-like 2L1S light curves, the NDE produces accurate and precise posteriors within seconds for any observation within the prior support, without requiring a domain expert in the loop, thus allowing for real-time and automated inference. We show that the NDE also captures expected posterior degeneracies. The NDE posterior can then be refined into the exact posterior with a downstream MCMC sampler with minimal burn-in steps.
Keywords: Binary lens microlensing (2136), Gravitational microlensing exoplanet detection (2147)

Corresponding author: Keming Zhang ([email protected])

1. INTRODUCTION

When the apparent trajectory of a foreground lens star passes close to a more distant source star, the gravitational field of the lens will perturb the light rays from the source, which results in a time-variable magnification. Such are single-lens, single-source (1L1S) microlensing events. Binary microlensing events occur when the lens is a system of two masses: either a binary star system or a star-planet configuration. Observation of such events provides a unique opportunity for exoplanet discovery, as the planet-to-star mass ratio may be inferred from the light curve without having to detect light from the star-planet lens itself (see Gaudi 2010 for a review). A next-generation microlensing survey with the Roman Space Telescope (Spergel et al. 2015; hereafter Roman) is estimated to discover thousands of binary microlensing events over the duration of the 5-year mission span, many with planetary-mass companions (Penny et al. 2019). This expected discovery yield is roughly an order of magnitude more than the number of events previously discovered (see Gaudi 2012 for a review).

While single-lens microlensing events are described by a simple analytic expression (the "Paczyński light curve"), binary microlensing events require numerical forward models that are computationally expensive. In addition, binary microlensing light curves exhibit extraordinary phenomenological diversity, owing to the different geometrical configurations for which magnification can take place. This translates to a pathological parameter space for which the likelihood surface suffers from a multitude of local minima that are disconnected, narrow, and deep; this issue significantly hampers attempts at direct sampling-based inference such as MCMC, where the chains are initialized from a broad prior. As a result, binary microlensing events have thus far generally been analyzed on a case-by-case basis.

For some planetary-mass-ratio events, heuristics can be used to "read off" an approximate solution from the planetary anomaly in the light curve (Gaudi & Gould 1997; Gould & Loeb 1992). Khakpash et al. (2019) applied the heuristics described in Gaudi & Gould (1997) to simulated
Roman light curves and found that the projected binary separation can be recovered very well for low-mass-ratio events, and the binary mass ratios to within an order of magnitude for events with wide and close caustic topologies.

More generally, an expensive grid search is usually conducted over a subset of parameters to which the magnification pattern is hyper-sensitive: i.e., binary separation, mass ratio, and the source trajectory angle of approach (e.g., Herrera-Martín et al. 2020). At each grid point, the remaining parameters are searched for with simple Nelder-Mead optimization (Nelder & Mead 1965) or MCMC. The fixed-grid solutions are then used to seed full MCMC samplings to refine solutions and sample the posteriors. This status quo approach, which is both computationally expensive and requires domain expertise in the loop, thus presents a great challenge for analyzing the thousands of binary microlensing events expected to be discovered by
Roman.

Recent progress in deep learning provides a promising framework for a solution. In particular, both Convolutional (CNN; LeCun et al. 2015) and Recurrent Neural Networks (RNN; Hochreiter & Schmidhuber 1997; Cho et al. 2014) have emerged as powerful alternatives to feature engineering of astronomical time series (e.g., Naul et al. 2018). Given sufficient training data, CNNs/RNNs can learn to compress "high-dimensional" raw observations into "low-dimensional" feature vectors, automatically learning to produce features that are useful for downstream tasks such as classification or regression. In addition to advances in this "representation learning," neural networks have also enjoyed significant progress in modeling probability distributions, otherwise known as neural density estimation, where the fundamental task is to learn distributions from samples of those distributions. Both autoregressive models (Germain et al. 2015; Oord et al. 2016) and flow-based models (Papamakarios et al. 2017; Dinh et al. 2017) are highly capable of modeling complicated and multi-modal distributions; they can not only evaluate probability densities, but also sample from the distribution.

The advancements in feature learning and NDE have allowed for accelerated progress in the field of likelihood-free inference (LFI), also known as simulation-based inference (SBI), which is motivated by inference problems without a tractable likelihood. LFI is an umbrella term that encompasses a wide range of inference algorithms that do not require explicit evaluation of the likelihood. Under our particular LFI approach, called amortized neural posterior estimation, an NDE learns a surrogate posterior as an observation-parametrized conditional probability distribution, from pre-computed simulations over the full prior space. A "featurizer" neural network is employed to compress the raw observation into a feature vector which parametrizes the NDE. Inference is amortized in that all of the computational cost of simulation is paid upfront: likelihood evaluation with the slow forward simulator is no longer required, thus allowing for fast inference. In other neural LFI instances, neural networks can learn the likelihood (Papamakarios et al. 2019) or the likelihood ratio (Thomas et al. 2020) as surrogates to accelerate sampling-based inference algorithms like MCMC (see Cranmer et al. 2020 for an overview).

In this paper, we present a likelihood-free inference approach for binary microlensing where an NDE learns a surrogate posterior p̂(θ|x) as an observation-parametrized conditional distribution from (xᵢ, θᵢ) samples of simulated microlensing light curves and the associated microlensing parameters. After training, the NDE can automatically generate posterior samples for future observations effectively in real time. Because of its speed and performance without supervision by domain experts, the approach we introduce here has great potential for batch inference tasks such as those posed by Roman. While machine learning (ML) has previously been applied to discover and classify microlensing events (Wyrzykowski et al. 2015; Godines et al. 2019; Mróz 2020), our work represents the first application of ML to direct parameter inference in the general binary-lens case. Our preliminary results were reported as an extended abstract in Zhang et al. (2020); the work herein supersedes and expands upon that work.

We first lay out our inference framework in Section 2. Training set construction in the context of
Roman is discussed in Section 3. In Section 4, we demonstrate the ability of the NDE to capture degenerate solutions and also present a systematic evaluation of the NDE performance over a large number of test events. In Section 5, we suggest future directions, including the potential addition of a downstream MCMC algorithm to refine the NDE posterior into the exact posterior with minimal additional computation time.
2. METHOD

We train a conditional NDE p̂_φ(θ|x) to approximate the true posterior p(θ|x), where θ are the physical parameters and x ∈ R^N is the light curve with N data points. The objective is to minimize the Kullback-Leibler (KL) divergence (D_KL) between the two distributions:

    φ = argmin_φ D_KL( p(θ|x) || p̂_φ(θ|x) )                                   (1)
      = argmin_φ E_{θ∼p(θ), x∼p(x|θ)} [ log p(θ|x) − log p̂_φ(θ|x) ]           (2)
      = argmax_φ E_{θ∼p(θ), x∼p(x|θ)} [ log p̂_φ(θ|x) ],                        (3)

where φ represents the neural network parameters, and E denotes the mathematical expectation over the specified distribution. The NDE is therefore trained with Maximum Likelihood Estimation (MLE) on a training set with physical parameters drawn from the prior p(θ) and light curves drawn from the likelihood function, which is the Poisson measurement noise model on top of the noise-free microlensing light curve g(θ) (in number of photons) and, for simplicity, is assumed to be in the Gaussian limit:

    p(x|θ) = N( μ = g(θ), σ = sqrt(g(θ)) ).                                   (4)

The noise-free light curve, in turn, is determined by the baseline source flux (F_source), the magnification time series produced by the microlensing physical forward model A(θ), and the constant blend flux, which is the flux from the lens star and any other star that is unresolved from the source star:

    g(θ) = A(θ) · F_source + F_blend.

We use a 20-block Masked Autoregressive Flow (MAF; Papamakarios et al. 2017) to model p̂(θ|x), and a ResNet-GRU network to extract features (h) from the light curve (x). We do not distinguish between p̂(θ|x) and p̂(θ|h), where the former refers to the "featurizer + NDE" model and the latter refers to the NDE model alone, explicitly conditioned on h. Figure 1 presents a schema of our neural posterior estimation framework. The ResNet-GRU network is comprised of an 18-layer 1D ResNet (Residual Convolutional Network; He et al. 2016) and a 2-layer GRU (Gated Recurrent Network; Cho et al. 2014). We describe the neural networks in detail below.
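As a concrete illustration of this measurement model, the following minimal sketch (our own illustration, not code from this work; names such as `realize_light_curve` are hypothetical) draws one noisy light curve from a magnification time series according to Equation (4):

```python
import numpy as np

def realize_light_curve(magnification, f_source, f_blend, rng=None):
    """Realize a noisy light curve under the Gaussian-limit Poisson
    noise model of Equation (4): x ~ N(g(theta), sqrt(g(theta))).

    magnification : array, A(theta) from the 2L1S forward model
    f_source      : baseline source flux, in photon counts
    f_blend       : constant blend flux, in photon counts
    """
    rng = rng or np.random.default_rng()
    g = magnification * f_source + f_blend   # noise-free curve g(theta)
    return rng.normal(loc=g, scale=np.sqrt(g))
```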
2.1. Masked Autoregressive Flow

The masked autoregressive flow (MAF) belongs to a class of NDEs called normalizing flows, which model the conditional distribution p̂(θ|x) as an invertible transformation f from a base distribution π_z(z) to the target distribution p̂(θ|x). The base density π_z(z) is required to be fast to evaluate and is typically chosen to be either a standard Gaussian or a mixture of Gaussians for the MAF. The basic idea is that if the MAF, conditioned on the observation x, can learn to map the posterior to a standard Gaussian, then the inverse transformation enables sampling of the posterior by simply sampling from that standard Gaussian.

As binary microlensing events often exhibit degenerate, multi-modal solutions, we use a mixture of eight standard multivariate Gaussians, each with 8 dimensions, as the base distribution. The posterior probability density p̂(θ|x) is evaluated by applying the inverse transformation f⁻¹ from θ to z:

    p̂(θ|x) = π_z( f⁻¹(θ) ) |det( ∂f⁻¹/∂θ )|,                                  (5)

where π_z(f⁻¹(θ)) represents the probability density of the base distribution (π_z) evaluated at f⁻¹(θ), while the second term (the determinant of the Jacobian) corresponds to the "compression" of probability space.

The MAF is built upon blocks of affine transformations where the shifting and scaling factors for each dimension are computed with a Masked Autoencoder for Distribution Estimation (MADE; Germain et al. 2015). For a simple 1-block case, the inverse transformation from θ to z is expressed as:

    z_i = (θ_i − μ_i) · exp(−α_i).                                             (6)

In the above equation, μ_i = f_{μ_i}(θ_{1:i−1}; x) and α_i = f_{α_i}(θ_{1:i−1}; x) are the shifting and scaling factors modeled by MADE, subject to the autoregressive condition that the transformation of any dimension can only depend on those prior to it according to a predetermined ordering. This allows the Jacobian of f⁻¹ to be triangular, whose absolute determinant can be easily calculated as:

    |det( ∂f⁻¹/∂θ )| = exp( −Σ_i α_i ),                                        (7)

where α_i = f_{α_i}(θ_{1:i−1}; x). To sample from the posterior, the forward transformation θ = f(z), where z ∼ π_z, is applied:

    θ_i = z_i · exp(α_i) + μ_i,                                                (8)

where μ_i and α_i are computed in the same manner as in the inverse transformation.
The MAF is built by stacking many such affine transformation blocks, M_1, M_2, ..., M_K, where M_K models the invertible transformation f_K between the posterior (z_K) and the intermediate random variable z_{K−1}; M_{K−1} models that between the intermediate random variables z_{K−2} and z_{K−1}; and so on. Finally, the base random variable z_0 is modeled with the mixture-of-Gaussians distribution. M_1 also computes the mixture weights. The composite transformation can be written as f = f_1 ∘ f_2 ∘ ... ∘ f_K, and the posterior probability density is now:

    p̂(θ|x) = π_z( f⁻¹(θ) ) Π_{i=1}^{K} |det( ∂f_i⁻¹/∂z_{i−1} )|,               (9)

where it is understood that z_K := θ. The log-probability of the posterior is then, by Equation (7),

    log p̂(θ|x) = log π_z( f⁻¹(θ) ) + Σ_{i=1}^{K} log |det( ∂f_i⁻¹/∂z_{i−1} )|
               = log π_z( f⁻¹(θ) ) − Σ_{i=1}^{K} Σ_j α_{ij},                   (10)

where α_{ij} is the j-th component of the scale factor in M_i, as in Equation (6), with j running over the dim(θ) = 8 parameter dimensions. This serves as the optimization objective (see Section 3.3).

Autoregressive models are sensitive to the order of the variables. The original MAF paper uses the default order for the autoregressive layer closest to θ and reverses the order for each successive layer. In this work, we instead adopt fixed random orderings for each MAF block, which we find to allow for better expressibility. The random seed of the ordering serves as a hyper-parameter to be optimized on.

Figure 1. Schematic illustration of the inference framework based on a conditional NDE. The bottom left shows a microlensing light curve in arbitrary units, which is abstracted into the length-7200 vector (x) above. The featurizer, composed of a combination of ResNet and GRU (shown in the trapezoid), compresses the light curve into a low-dimensional feature vector h. The masked autoregressive flow (MAF), composed of K blocks of masked autoencoder for density estimation (MADE), is shown in the dashed box. Each MADE block takes in the feature vector h and predicts scaling (α) and shifting (μ) factors, which parametrize an invertible affine transformation between adjacent random variables (e.g., z_1 and z_2), shown in the dotted box. The left-most random variable is the mixture-of-Gaussians base distribution, whereas the right-most random variable (z_K) is the posterior (θ).
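The affine block of Equations (6)-(8) and the stacked log-probability of Equation (10) can be sketched in a few lines; this is a schematic of the computation only, with the `conditioners` callables standing in for the trained, feature-conditioned MADE networks:

```python
import numpy as np

def made_inverse(theta, conditioners, h):
    """One MADE block: z = (theta - mu) * exp(-alpha) (Eq. 6), plus the
    log |det Jacobian| = -sum(alpha) (Eq. 7). `conditioners` returns the
    autoregressive shift/scale factors given theta and the features h."""
    mu, alpha = conditioners(theta, h)
    z = (theta - mu) * np.exp(-alpha)
    return z, -alpha.sum()

def maf_log_prob(theta, blocks, h, log_base_density):
    """Stack K blocks (Eq. 10): log p(theta|x) is the base log-density of
    z0 = f^{-1}(theta) plus the accumulated log-determinants."""
    z, log_det_total = theta, 0.0
    for conditioners in blocks:          # M_K, ..., M_1
        z, log_det = made_inverse(z, conditioners, h)
        log_det_total += log_det
    return log_base_density(z) + log_det_total
```

Training maximizes this quantity over simulated (θ, x) pairs, as described in Section 3.3.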
2.2. Featurizer Network

A custom 1D ResNet with a downstream 2-layer GRU is used as the light curve featurizer, which takes the pre-processed light curve (x) as input and outputs a low-dimensional feature vector (h). The ResNet used in this study shares the same architecture as Zhang & Bloom (2020) (except for hyper-parameters) and consists of 9 identical residual blocks, each of which is composed of two convolutions followed by layer normalization (Ba et al. 2016). A residual connection is added between each pair of adjacent residual blocks, acting as a "gradient highway" to assist network optimization. A max-pooling layer is applied between every two ResNet layers, halving the sequence length and doubling the feature dimension until a specified maximum. This results in an output feature map of length L = 56 and dimension D = 256, which is then fed into the GRU network; the GRU sequentially processes information across the temporal dimension and outputs a single vector of D = 256, which then serves as the conditional input to the MAF.
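A simplified PyTorch sketch of such a featurizer is shown below. The block count, channel schedule, and normalization are illustrative stand-ins (the actual network has 9 residual blocks and tuned hyper-parameters), chosen so that a length-7200 input is reduced to a length-56 feature map of dimension 256, as stated above:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 1D convolutions with normalization and a residual connection;
    a simplified stand-in for the paper's residual block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, c_out)           # layer-norm-like
        self.skip = nn.Conv1d(c_in, c_out, kernel_size=1)
    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = self.norm(self.conv2(h))
        return torch.relu(h + self.skip(x))

class Featurizer(nn.Module):
    """ResNet front end that halves the sequence length and doubles the
    channel count at each stage, followed by a 2-layer GRU whose final
    hidden state is the feature vector h."""
    def __init__(self, n_stages=7, c0=8, c_max=256, h_dim=256):
        super().__init__()
        blocks, c = [], 1
        for _ in range(n_stages):
            c_next = min(max(c * 2, c0), c_max)
            blocks += [ResBlock(c, c_next), nn.MaxPool1d(2)]
            c = c_next
        self.resnet = nn.Sequential(*blocks)
        self.gru = nn.GRU(input_size=c, hidden_size=h_dim,
                          num_layers=2, batch_first=True)
    def forward(self, x):                  # x: (batch, 7200) light curve
        f = self.resnet(x.unsqueeze(1))    # -> (batch, 256, 56)
        _, h_n = self.gru(f.transpose(1, 2))
        return h_n[-1]                     # (batch, 256) feature vector h
```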
3. DATA

We simulate 10⁶ 2L1S magnification sequences with MulensModel (Poleski & Yee 2019); each sequence contains 144 days at a cadence of 0.01 day, corresponding to the planned Roman cadence of 15 minutes (Penny et al. 2019). These sequences are chosen to have twice the length of the 72-day Roman observation window to facilitate sampling from a t_0 ∼ Uniform(0, 72) prior (see Section 3.1). We then fit each simulated magnification time series with a Paczyński single-lens, single-source (1L1S) model (assuming S/N_base = 200 and f_s = 1; see Section 3.2) and discard those that are consistent with 1L1S (χ²_{1L1S}/dof < 1).
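For reference, a single magnification sequence of this kind can be computed with MulensModel along the following lines; the parameter values are illustrative, and the exact API (e.g., `get_magnification`) may differ between MulensModel versions:

```python
import numpy as np
import MulensModel as mm

# 144 days at the 0.01-day (15-minute) Roman-like cadence
times = np.arange(0.0, 144.0, 0.01)

model = mm.Model({'t_0': 72.0, 'u_0': 0.1, 't_E': 25.0, 'rho': 1e-3,
                  's': 1.0, 'q': 1e-3, 'alpha': 120.0})

# Finite-source (VBBL) magnification near the peak, point-source elsewhere
model.set_magnification_methods([60.0, 'VBBL', 84.0])

magnification = model.get_magnification(times)   # A(theta) time series
```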
3.1. Prior

Assuming rectilinear relative motion of the observer, lens, and source, binary microlensing (2L1S) events are characterized by eight parameters: binary lens separation (s), mass ratio (q), angle of the source trajectory with respect to the projected binary lens axis (α), impact parameter (u_0), time of closest approach (t_0), Einstein ring crossing timescale (t_E), finite source size (ρ), and source flux fraction (f_s). α is the angle between the vector pointing from the primary to the secondary and the source trajectory vector, measured counterclockwise in degrees. u_0 and t_0 are defined with respect to the binary lens center of mass (COM). Where applicable, the parameters are normalized to the Einstein ring length scale or the Einstein ring crossing timescale of the total mass of the lens system. t_0 and t_E are in units of days. We simulate 2L1S events based on the following analytic priors:

    s   ∼ LogUniform(0.2, 5)
    q   ∼ LogUniform(10⁻⁶, 1)
    α   ∼ Uniform(0, 360)
    u_0 ∼ Uniform(0, 2)
    t_0 ∼ Uniform(0, 72)
    t_E ∼ TruncLogNorm(1, 100, μ = 10^{1.15}, σ = 10^{0.45})
    ρ   ∼ LogUniform(10⁻⁴, 10⁻²)
    f_s ∼ LogUniform(0.1, 1).                                                  (11)

Because of the χ²_{1L1S}/dof ≥ 1 cut, the effective training-set priors differ from the analytic priors above: large q and small u_0, which otherwise have flat priors, are strongly preferred (Figure 2).
Figure 2. Fraction of the 10⁶ simulations passing the χ²_{1L1S}/dof ≥ 1 cut, as a function of each parameter (gray histograms). The original priors are shown as red-dashed lines, up to a normalization factor. For parameters with a flat original prior, the gray histogram is also the effective training set prior, up to a normalization factor. The t_0 and f_s distributions follow the original priors, as they are sampled on the fly during training.

During training, a random 72-day segment is chosen on the fly from each 144-day magnification sequence, equivalent to prescribing a uniform prior on t_0. The truncated log-normal distribution for t_E is an approximation to a statistical analysis based on OGLE-IV data (Mróz et al. 2017). The lower limit of q = 10⁻⁶ corresponds to the mass ratio between Mercury and a low-mass (M ∼ 0.1 M_⊙) M-dwarf star, highlighting the superb sensitivity of Roman. The source flux fraction is defined as the ratio between the source flux and the total baseline flux:

    f_s = F_source / (F_source + F_blend).                                     (12)
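A minimal sketch of drawing parameters from these analytic priors, with the truncated log-normal for t_E sampled by simple rejection (the prior bounds follow Equation (11) as reconstructed above):

```python
import numpy as np

rng = np.random.default_rng(42)

def trunc_lognorm(n, mu, sigma, lo, hi):
    """Rejection-sample a log-normal (base-10 parameters) truncated to [lo, hi]."""
    out = np.empty(0)
    while out.size < n:
        draw = 10 ** rng.normal(np.log10(mu), np.log10(sigma), 2 * n)
        out = np.concatenate([out, draw[(draw > lo) & (draw < hi)]])
    return out[:n]

def sample_prior(n):
    """Draw n parameter sets from the analytic priors of Section 3.1."""
    return {
        's':     10 ** rng.uniform(np.log10(0.2), np.log10(5.0), n),
        'q':     10 ** rng.uniform(-6.0, 0.0, n),
        'alpha': rng.uniform(0.0, 360.0, n),
        'u_0':   rng.uniform(0.0, 2.0, n),
        't_0':   rng.uniform(0.0, 72.0, n),
        't_E':   trunc_lognorm(n, mu=10**1.15, sigma=10**0.45, lo=1.0, hi=100.0),
        'rho':   10 ** rng.uniform(-4.0, -2.0, n),
        'f_s':   10 ** rng.uniform(np.log10(0.1), 0.0, n),
    }
```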
3.2. Light-curve Realization

The magnification sequences are converted into light curves on the fly during training by multiplying by the baseline pre-magnification source flux and then adding the constant blend flux and applying measurement noise. For simplicity, we only consider photon-counting noise from the lensed source and the fixed blend flux, assumed to be in the Gaussian limit of the Poisson noise (Equation 4), where the standard deviation of each photometric measurement is the square root of the flux measurement in photon counts. Studies of the bulge star population show that the apparent magnitude largely lies within the range of 20 mag to 25 mag (Penny et al. 2019, Figure 5). The Roman/WFIRST Cycle 7 design has the zero-point magnitude (1 count/s) at 27.615 mag for the W149 filter.
Table 1. Solutions for the example central-caustic passing event, listing the ground truth alongside the "close" and "wide" solutions for log(s), log(q), α, u_0, t_0, log(t_E), log(ρ), and f_s. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
With an exposure time of 46.8 s, the aforementioned magnitude range corresponds to signal-to-noise ratios (S/N_base) between 230 and 23 for the baseline flux, which we randomly and uniformly sample during training. On-the-fly sampling of S/N_base and f_s also serves as data augmentation, which refers to the process of expanding the effective size of the training set.
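The quoted S/N_base range follows directly from the zero point and exposure time, since the photon-limited S/N of the baseline flux is the square root of the collected counts; a short worked check:

```python
import numpy as np

ZP_W149 = 27.615   # W149 zero-point magnitude (1 count/s)
T_EXP = 46.8       # exposure time in seconds

def baseline_snr(mag):
    """Photon-limited S/N of the baseline flux at apparent magnitude `mag`."""
    counts = T_EXP * 10 ** (-0.4 * (mag - ZP_W149))
    return np.sqrt(counts)

print(baseline_snr(20.0))   # ~228, i.e., S/N_base ~ 230 at the bright end
print(baseline_snr(25.0))   # ~23 at the faint end
```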
3.3. Pre-processing and Training

Network optimization is performed with Adam (Kingma & Ba 2015) at an initial learning rate of 0.001 and a batch size of 512; the learning rate decays to 0 according to a cosine annealing schedule (Loshchilov & Hutter 2017) over 250 epochs, at which point training terminates. To ensure that there is no over-fitting, we first reserved 20% of the training set as a validation set; after confirming the absence of over-fitting, we then proceeded with the full training set. We apply data augmentation on α by changing the direction of the source trajectory: the temporal order of each sequence is reversed and α becomes −(α + 180) mod 360.
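A minimal PyTorch sketch of this optimization setup; `model` (featurizer plus MAF, with a `log_prob` method implementing Equation (10)) and the data `loader` are assumed to be defined elsewhere:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250)

for epoch in range(250):
    for theta, x in loader:                        # batches of 512 pairs
        optimizer.zero_grad()
        loss = -model.log_prob(theta, x).mean()    # MLE objective, Eq. (10)
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine-anneal the learning rate toward 0
```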
Table 2. Solutions for the example resonant-caustic passing event, listing the ground truth alongside the "close" and "resonant" solutions for log(s), log(q), α, u_0, t_0, log(t_E), log(ρ), and f_s. Same units as Table 1. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
4. RESULTS

The trained model generates accurate and precise posterior samples effectively in real time on one GPU; this compares to the substantially longer time required to compute even a single 2L1S light curve with MulensModel on one CPU core. In this section, we first highlight the ability of the NDE to capture multi-modal solutions by providing NDE posteriors of representative events, for which we set the baseline S/N_base = 200. Then, the quality of the NDE posteriors is systematically analyzed by examining the accuracy and calibration properties on a test set of 14,551 simulated light curves.
4.1. Central-Caustic Passing Event
Figure 3a shows the NDE posterior for an example central-caustic-passing event, where the classic "close-wide" degeneracy is clearly exhibited by the s ↔ 1/s behavior (Griest & Safizadeh 1998; Dominik 1999). Table 1 presents the ground-truth 2L1S parameters of this event, as well as the "close" and "wide" solutions calculated as the modes of their respective distributions. Although the source is expected to pass the caustic center at the same distance in both cases, Figure 3a shows a bimodal solution for u_0 as well, because u_0 is defined with respect to the center of mass (COM) rather than the caustic center. While the caustic center coincides with the COM for close-separation events, for wide-separation events there is an offset from the COM of

    δ = q / [s(1 + q)] − sq / (1 + q),                                         (13)

where the first term accounts for the offset of the caustic center from the location of the primary (Han 2008), and the second term for the offset of the primary from the center of mass. Positive offsets are directed toward the companion and vice versa. Plugging in the wide solution, the expected offset Δu_0 is consistent with the separation between the two u_0 modes in Figure 3a.
Figure 3. (a) NDE posterior for a central-caustic passing event. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Filled contours show 1/2/3σ regions. The ground-truth close solution is marked with orange cross-hairs. The close and wide solutions are marked with a red cross and a blue diamond, respectively. (b) Close-up view of the light-curve realizations for both solutions, normalized to their minimum fluxes, in the same color coding as (a). The 0.01-day cadence and the measurement noise are negligibly small on the scale of the figure, and are therefore not shown. (c) Caustic structures as well as source trajectories for the two solutions in the same color coding, centered on the caustic center.
Figure 4. Resonant-caustic-passing event; same figure caption as Figure 3. Here, a degenerate solution is seen at s < 1, whose two triangular caustics cause a suppression pattern similar to that of the resonant caustic.
Figure 5. Example event exhibiting a blunt and flat light curve near the peak, which has a 5-fold degenerate NDE posterior; same figure caption as Figure 3. Color coding is shared across the three panels.
Table 3. Degenerate solutions for the binary-planetary degenerate event shown in Figure 5, listing the ground truth alongside the "planetary A/B" and "binary A/B/C" solutions for log(s), log(q), α, u_0, t_0, log(t_E), and f_s. Same units as Table 1. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
4.2. Resonant-Caustic Passing Event
We also highlight an example of a resonant-caustic passing event, whose parameters and solutions are shown in Table 2. As illustrated in Figure 4, the NDE finds an additional solution at s < 1, whose triangular caustics cause a weak de-magnification similar to that of the resonant caustic (also see Figure 7 in Gaudi 2010). This type of degeneracy has been previously observed in the microlensing event OGLE-2018-BLG-0677Lb (Herrera-Martín et al. 2020). Strong covariances are seen among u_0, t_E, and f_s, as in the previous example (Section 4.1), a degeneracy first observed by Woźniak & Paczyński (1997) in the f_s ≪ 1 regime.

4.3. Binary-Planetary Degeneracy
We also provide a fascinating 5-fold-degenerate example that is similar to the degeneracy reported in Choi et al. (2012), where a light curve that is blunt and flat near the peak can be explained by either a binary or a planetary lens. Here, we simulate a close-topology event with a planetary mass ratio whose source trajectory passes through the negative-perturbation region toward the back end of the arrowhead-shaped central caustic (Figure 5: "planetary A/B" caustic). Choi et al. (2012) noted that a similar perturbation can occur in the binary case when the source trajectory passes through the negative-perturbation region between two adjacent cusps of the astroid-shaped central caustic (Figure 5: "binary A/B/C" caustic; also see Figure 1 in their paper).

As shown in Figure 5, all five degenerate solutions produce magnification patterns that are hardly distinguishable. The two planetary solutions exhibit a close-wide degeneracy. For the three binary solutions, the "binary B/C" solutions suggest two possible trajectories (∼90/270 deg) for the same lens-system configuration, whereas the "binary A" solution exhibits a smaller mass ratio and a wider binary separation than "binary B/C". We note that this additional degeneracy in the mass ratio for the binary case was not reported in Choi et al. (2012). It is not clear whether this is a discrete or continuous degeneracy, nor whether it is an "accidental" degeneracy that arises because of the relatively weak perturbation, or is due to some underlying symmetry in the binary lens equation (e.g., Dominik 1999).

On the other hand, wide solutions for the binary case are largely absent from the NDE posterior, apart from an inkling of density near log(s) ∼ 0.69, which points to the expected close-wide degeneracy for the binary solution. The reason those degenerate solutions are excluded is that, because of the offset between the COM and the central caustic (Equation 13), wide-binary solutions would require t_0 < 0, which has a prior probability of zero.
4.4. Evaluating Performance
We present a systematic evaluation of all 14,551 test-set events in the form of predicted-vs-truth scatter plots (Figure 6). Each test-event light curve is realized in the same fashion as at training time. As the NDE returns potentially multi-modal posteriors of arbitrary shape, we compute the mode(s) of the marginal 1D posterior distributions and consider the mode closest to the ground truth as the "predicted" value. The mode(s) are computed by first fitting each marginal distribution with a 1D histogram of 100 bins and then searching for local maxima, defined as any bin count higher than those of the 20 adjacent bins; this limits the number of modes to 5. Considering that the purpose of the NDE posterior is to allow ultra-fast convergence of a downstream sampling-based algorithm like MCMC to determine the exact posterior, as long as the correct solution has substantial density in the NDE posterior, it should not raise alarm if an alternative mode is mistakenly favored: any degeneracies can be easily resolved downstream. It is therefore sensible to use the mode closest to the truth as the predicted value, even if another, degenerate mode is incorrectly preferred.
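A sketch of this mode-finding procedure follows; we read "the 20 adjacent bins" as the 20 bins on either side, the interpretation under which a 100-bin histogram indeed admits at most 5 such maxima:

```python
import numpy as np

def find_modes(samples, n_bins=100, window=20, max_modes=5):
    """Locate the mode(s) of a 1D marginal posterior: bin the samples
    into a histogram and flag any bin whose count exceeds those of the
    `window` bins on either side."""
    counts, edges = np.histogram(samples, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    modes = []
    for i, c in enumerate(counts):
        lo, hi = max(0, i - window), min(n_bins, i + window + 1)
        neighbors = np.delete(counts[lo:hi], i - lo)   # exclude bin i itself
        if neighbors.size and c > neighbors.max():
            modes.append((c, centers[i]))
    modes.sort(reverse=True)                           # strongest peaks first
    return [center for _, center in modes[:max_modes]]
```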
As shown in the upper-left corner of each subplot in Figure 6, all parameters are constrained at a rate close to 100%, except for the finite-source effect, for which only 14.2% are constrained, as the source trajectory is required to either cross or pass close to a caustic for ρ to be determined.¹ We consider a parameter to be constrained if the probability density of the 1D marginal distribution at the global mode is > 2× the prior probability density. The second row in each upper-left corner shows the frequency with which the correct mode is preferred by the NDE, that is, how often the ground truth is closer to the major mode than to the minor mode(s), if any. If the ground truth is closer to a minor mode, the correct minor mode is plotted in orange while the incorrect major mode is plotted in blue. We see clear degeneracy patterns in log(s) and α. For log(s), the "wide-close" degeneracy is exhibited by the cluster around the upper-left to lower-right diagonal.

¹Formally, effects on the light curve due to the finite size of the source are only significant if the gradient of the magnification across the source has a significant second derivative. In practice, this condition is only satisfied if the source passes within a few angular source radii of a caustic.
Figure 6. Predicted vs. ground-truth 2L1S parameters for 14,551 test-set 2L1S events. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Single-mode NDE posteriors are shown as black dots. For multi-modal NDE posteriors, we color-code the solutions as follows: those for which the global mode is closest to the ground truth are plotted in black; for cases where a minor mode is closest to the true value, this correct, minor mode is plotted in orange, whereas the incorrect global mode is plotted in blue. Red shadows indicate the 32-68th percentile (1σ) and 5-95th percentile (2σ) regions. Red-dashed lines show the diagonal. In the upper left of each subplot, "constrain" refers to the percentage of events whose NDE posterior poses a sufficient constraint: the peak posterior probability must be at least twice the prior probability. "Correct" refers to the percentage of constrained events whose true parameter lies closest to the global mode.

For α, there is also a cluster of events along the same diagonal, indicating a degeneracy between α and −α. Such a degeneracy may happen for nearly symmetrical central caustics, along the direction perpendicular to the lens axis.

The 1σ and 2σ ranges of prediction, shown as red shadows, are clustered closely around the diagonal for most parameters. We emphasize that the loose 1L1S-fitting cutoff (χ²_{1L1S}/dof ∼
1) means that many of the test-set light curves are only weakly perturbed by the binary nature of the lens, which should explain a number of cases in which the mass ratio is poorly constrained. Interestingly, we find that there is a tendency to overestimate the mass ratio in these cases. In addition, we notice that u_0 and f_s are underestimated for a large number of cases while t_E is correspondingly overestimated, though this is hardly visible in Figure 6. This bias can be explained by the combined effect of a known degeneracy for 1L1S events and a distribution mismatch.

First, there exists a well-known degeneracy between u_0, f_s, and t_E for single-lens events, which in our case applies to events that are only weakly perturbed by the binary nature of the lens. As demonstrated by Woźniak & Paczyński (1997), this degeneracy is most severe for low-magnification events (u_0 ≫ 1); when restricted to events with u_0 < 0.
15, the bias in f_s and t_E is largely removed. Figure 7 shows the NDE posterior for an example u_0 ≈ 1 event, illustrating the strong degeneracy among u_0, f_s, and t_E. In the presence of such strong degeneracies, the effective likelihood implicitly provided by the featurizer is only marginally informative. In other words, the featurizer cannot distinguish among solutions within the continuous degeneracy, and only prescribes a region in parameter space where the observation is about equally likely. The posterior is therefore essentially in a prior-dominated regime, where small u_0 and f_s are strongly favored, as seen in Figure 7. Had the parameters for these weakly perturbed events been drawn from the effective prior (Figure 2), there would be little bias (under/over-estimation) at all. Quite the contrary, however: the distribution of the weakly perturbed events is weighted toward the exact opposite direction of the effective prior, e.g., toward large u_0 and small log(q), which are more likely to be excluded by the χ²_{1L1S}/dof ≥ 1 cut.
Figure 7. Corner plot for the marginal NDE posterior of a 1L1S event, showing the strong degeneracy among the three 1L1S parameters: u_0 in units of θ_E, t_E in units of days, and f_s. Filled contours show 1/2/3/4σ regions. Small u_0 and f_s are strongly favored because of the effective priors (Section 3.1) and a marginally informative likelihood.

Because of this distribution mismatch, large u_0 and small log(q) occur much more often than expected by prior belief, thus resulting in the under/overestimation bias. And because of the strong covariances among u_0, f_s, and t_E, the underestimation of u_0 translates into an underestimation of f_s and an overestimation of t_E (Figure 7), which explains the biases seen in Figure 6.
4.5. Calibration Properties

A perfectly calibrated posterior knows how often it is right or wrong. In other words, the quantile of the ground-truth parameter under the NDE posterior should be distributed uniformly. Figure 8 shows the quantile distributions of the 1D marginal NDE posteriors for the same 14,551 test-set inferences. The quantile distributions for log(q), α, and t_0 are concave-up, indicating that the NDE uncertainty is overestimated and the true value lies close to the center of the posterior more often than expected. This suggests that the NDE finds it hard to contract the posterior in those dimensions, possibly due to numerical optimization difficulties or insufficient neural-network expressibility. On the other hand, the distributions for the three parameters in the second row (log(t_E), u_0, and f_s) demonstrate the systematic under/over-estimation seen in Figure 6, where log(t_E) is systematically overestimated and u_0 and f_s are underestimated. The quantile distributions for log(s) and log(ρ) are consistent with uniform distributions and are thus well calibrated.
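Such a check reduces to computing, for each test event, the quantile of the truth under the marginal posterior, i.e., the fraction of posterior samples below the true value; a minimal sketch:

```python
import numpy as np

def truth_quantiles(posterior_samples, truths):
    """Quantile of each ground-truth value under its 1D marginal posterior.

    posterior_samples : (n_events, n_samples) marginal posterior draws
    truths            : (n_events,) ground-truth parameter values

    For a perfectly calibrated posterior, the returned quantiles are
    uniform on [0, 1]; concave-up histograms indicate overestimated
    uncertainties, and skewed histograms indicate systematic bias.
    """
    return (posterior_samples < truths[:, None]).mean(axis=1)
```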
Figure 8. Calibration plot showing the test-set distributions of the ground-truth quantiles for the 1D marginal NDE posteriors. Dashed lines indicate the uniform distribution expected for a perfectly calibrated posterior.

5. DISCUSSION AND CONCLUSIONS

We have demonstrated that amortized neural posterior estimation, a likelihood-free inference method which uses a conditional NDE to learn a surrogate posterior p̂(θ|x), greatly accelerates binary microlensing inference: an approximate posterior can be produced in seconds without the need for an expert in the loop. Our new approach is capable of capturing a variety of degeneracies. For future work, it is straightforward to extend the method to higher-order effects such as parallax and binary motion by introducing additional parameters. Application to more complex systems such as 3L1S may be fruitful, where the physical forward model is orders of magnitude slower. In addition, the photometric noise model adopted in our study is somewhat simplistic, and future work can explore how to adapt models trained with ideal noise properties to fully realistic data, with the help of image-based simulation pipelines such as the one used in Penny et al. (2019). We discuss two additional aspects of our work below.

5.1. A hybrid NDE-MCMC framework
The NDE posterior can be refined into the exact posterior with a downstream MCMC sampler whose chains are initialized from NDE posterior samples, requiring minimal burn-in for high-quality Roman data. Applied to much noisier and more sparsely sampled ground-based data, we expect that data uncertainties will dominate over model uncertainties, thus allowing the NDE posterior to converge towards the exact posterior.
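Such a hybrid scheme might look as follows with the emcee sampler; `nde.sample` and `log_post` are hypothetical stand-ins for the trained NDE and the exact 2L1S log-posterior (e.g., the χ² from MulensModel plus the log-prior):

```python
import emcee

# Initialize walkers directly at NDE posterior samples for light curve x
walkers = nde.sample(x, n=64)                       # shape (64, 8)

sampler = emcee.EnsembleSampler(nwalkers=64, ndim=8,
                                log_prob_fn=log_post, args=(x,))
sampler.run_mcmc(walkers, nsteps=1000, progress=True)

# With well-placed initial walkers, only minimal burn-in is discarded
chain = sampler.get_chain(discard=100, flat=True)
```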
5.2. Choice of Coordinate System
For all events in this work, we have adopted the center-of-mass (COM) coordinate system, which is the defaultin
MulensModel, but it is not the most efficient reference frame, in the sense that more than 70% of the 1 million simulations turn out to be consistent with a 1L1S model. For example, most 2L1S configurations with large u_0 do not pass close to either the central caustics or the planetary caustics. For parts of the parameter space, alternative reference coordinates may be more descriptive or useful. For example, the caustic-center frame is preferred for binary and/or wide-separation events, for which there is an offset of the caustic center from the COM; adopting it recovers the missing wide binary solution in Section 4.3 without the need to expand the prior to include negative t_0. Additionally, planetary-caustic passing events are rare: for source trajectories far from the central caustics, most do not pass close to the planetary caustic, and as a result the magnification is frequently indistinguishable from 1L1S. For future work, a hybrid and self-consistent coordinate system could be used.

Software: numpy (van der Walt et al. 2011), scipy (Jones et al. 2001), matplotlib (Hunter 2007), pytorch (Paszke et al. 2017),
MulensModel (Poleski & Yee 2018),
Jupyter (Kluyver et al. 2016).

ACKNOWLEDGEMENTS

K.Z. and J.S.B. are supported by a Gordon and Betty Moore Foundation Data-Driven Discovery grant. J.S.B. is partially sponsored by a faculty research award from Two Sigma. K.Z. thanks the LSSTC Data Science Fellowship Program, which is funded by LSSTC, NSF Cybertraining Grant 1829740, the Brinson Foundation, and the Moore Foundation; his participation in the program has benefited this work. K.Z. thanks Shude Mao and Tsinghua University for their hospitality during the COVID-19 pandemic. B.S.G. is supported by NASA grant NNG16PJ32C and the Thomas Jefferson Chair for Discovery and Space Exploration. This work is supported by the AWS Cloud Credits for Research program. We thank Yang Gao and Yu Sun for helpful discussions, and Weicheng Zang for pointing out issues with the initial figure presentation.

REFERENCES