Real-Time Likelihood-Free Inference of Roman Binary Microlensing Events with Amortized Neural Posterior Estimation
Draft version February 18, 2021
Typeset using LaTeX twocolumn style in AASTeX62
Keming Zhang (张可名),¹ Joshua S. Bloom,¹,² B. Scott Gaudi,³ François Lanusse,⁴ Casey Lam,¹ and Jessica R. Lu¹

¹Department of Astronomy, University of California, Berkeley, CA 94720-3411, USA
²Lawrence Berkeley National Laboratory, 1 Cyclotron Road, MS 50B-4206, Berkeley, CA 94720-3411, USA
³Department of Astronomy, The Ohio State University, Columbus, OH 43210, USA
⁴AIM, CEA, CNRS, Université Paris-Saclay, Université Paris Diderot, Sorbonne Paris Cité, F-91191, Gif-sur-Yvette, France
ABSTRACT

Fast and automated inference of binary-lens, single-source (2L1S) microlensing events with sampling-based Bayesian algorithms (e.g., Markov Chain Monte Carlo; MCMC) is challenged on two fronts: the high computational cost of likelihood evaluations with microlensing simulation codes, and a pathological parameter space where the negative-log-likelihood surface can contain a multitude of local minima that are narrow and deep. Analysis of 2L1S events usually involves grid searches over some parameters to locate approximate solutions as a prerequisite to posterior sampling, an expensive process that often requires human-in-the-loop domain expertise. As the next-generation, space-based microlensing survey with the Roman Space Telescope is expected to yield thousands of binary microlensing events, a new fast and automated method is desirable. Here, we present a likelihood-free inference (LFI) approach named amortized neural posterior estimation, where a neural density estimator (NDE) learns a surrogate posterior p̂(θ|x) as an observation-parametrized conditional probability distribution, from pre-computed simulations over the full prior space. Trained on 291,012 simulated Roman-like 2L1S light curves, the NDE produces accurate and precise posteriors within seconds for any observation within the prior support, without requiring a domain expert in the loop, thus allowing for real-time and automated inference. We show that the NDE also captures expected posterior degeneracies. The NDE posterior can then be refined into the exact posterior with a downstream MCMC sampler with minimal burn-in steps.
Keywords: Binary lens microlensing (2136), Gravitational microlensing exoplanet detection (2147)

Corresponding author: Keming Zhang ([email protected])

1. INTRODUCTION

When the apparent trajectory of a foreground lens star passes close to a more distant source star, the gravitational field of the lens will perturb the light rays from the source, which results in a time-variable magnification. Such are single-lens, single-source (1L1S) microlensing events. Binary microlensing events occur when the lens is a system of two masses: either a binary star system or a star-planet configuration. Observation of such events provides a unique opportunity for exoplanet discovery, as the planet-to-star mass ratio may be inferred from the light curve without having to detect light from the star-planet lens itself (see Gaudi 2010 for a review). A next-generation microlensing survey with the Roman Space Telescope (Spergel et al. 2015; hereafter Roman) is estimated to discover thousands of binary microlensing events over the duration of the 5-year mission span, many with planetary-mass companions (Penny et al. 2019). This expected discovery yield is roughly an order of magnitude more than the number of events previously discovered (see Gaudi 2012 for a review).

While single-lens microlensing events are described by a simple analytic expression (the "Paczyński light curve"), binary microlensing events require numerical forward models that are computationally expensive. In addition, binary microlensing light curves exhibit extraordinary phenomenological diversity, owing to the different geometrical configurations for which magnification can take place. This translates to a pathological parameter space for which the likelihood surface suffers from a multitude of local minima that are disconnected, narrow, and deep; this issue significantly hampers attempts at direct sampling-based inference such as MCMC, where the chains are initialized from a broad prior. As a result, binary microlensing events have thus far generally been analyzed on a case-by-case basis.

For some planetary-mass-ratio events, heuristics can be used to "read off" an approximate solution from the planetary anomaly in the light curve (Gaudi & Gould 1997; Gould & Loeb 1992). Khakpash et al. (2019) applied the heuristics described in Gaudi & Gould (1997) to simulated
Roman light curves and found that the projected binary separation can be recovered very well for low-mass-ratio events, and the binary mass ratios to within an order of magnitude for events with wide and close caustic topologies.

More generally, an expensive grid search is usually conducted over a subset of parameters to which the magnification pattern is hyper-sensitive: i.e., binary separation, mass ratio, and the source trajectory angle of approach (e.g., Herrera-Martín et al. 2020). At each grid point, the remaining parameters are searched for with simple Nelder-Mead optimization (Nelder & Mead 1965) or MCMC. The fixed-grid solutions are then used to seed full MCMC samplings to refine solutions and sample the posteriors. This status quo approach, which is both computationally expensive and requires domain expertise in the loop, thus presents a great challenge for analyzing the thousands of binary microlensing events expected to be discovered by
Roman.

Recent progress in deep learning provides a promising framework for a solution. In particular, both Convolutional (CNN; LeCun et al. 2015) and Recurrent Neural Networks (RNN; Hochreiter & Schmidhuber 1997; Cho et al. 2014) have emerged as powerful alternatives to feature engineering of astronomical time series (e.g., Naul et al. 2018). Given sufficient training data, CNNs/RNNs can learn to compress "high-dimensional" raw observations into "low-dimensional" feature vectors, automatically learning to produce features that are useful for downstream tasks such as classification or regression. In addition to advances in this "representation learning," neural networks have also enjoyed significant progress in modeling probability distributions, otherwise known as neural density estimation, where the fundamental task is to learn distributions from samples of those distributions. Both autoregressive models (Germain et al. 2015; Oord et al. 2016) and flow-based models (Papamakarios et al. 2017; Dinh et al. 2017) are highly capable of modeling complicated and multi-modal distributions; they can not only evaluate probability densities, but also sample from the distribution.

The advancements in feature learning and NDE have allowed for accelerated progress in the field of likelihood-free inference (LFI), also known as simulation-based inference (SBI), which is motivated by inference problems without a tractable likelihood. LFI is an umbrella term that encompasses a wide range of inference algorithms that do not require explicit evaluation of the likelihood. Under our particular LFI approach, called amortized neural posterior estimation, an NDE learns a surrogate posterior as an observation-parametrized conditional probability distribution, from pre-computed simulations over the full prior space. A "featurizer" neural network is employed to compress the raw observation into a feature vector which parametrizes the NDE. Inference is amortized in that all of the computational cost of simulation is paid upfront: likelihood evaluation with the slow forward simulator is no longer required, thus allowing for fast inference. In other neural LFI instances, neural networks can learn the likelihood (Papamakarios et al. 2019) or the likelihood ratio (Thomas et al. 2020) as surrogates to accelerate sampling-based inference algorithms like MCMC (see Cranmer et al. 2020 for an overview).

In this paper, we present a likelihood-free inference approach for binary microlensing where an NDE learns a surrogate posterior p̂(θ|x) as an observation-parametrized conditional distribution from (xᵢ, θᵢ) samples of simulated microlensing light curves and the associated microlensing parameters. After training, the NDE can automatically generate posterior samples for future observations effectively in real time. Because of its speed and performance without supervision by domain experts, the approach we introduce here has great potential for batch inference tasks such as those posed by Roman. While machine learning (ML) has previously been applied to discover and classify microlensing events (Wyrzykowski et al. 2015; Godines et al. 2019; Mróz 2020), our work represents the first application of ML to direct parameter inference in the general binary-lens case. Our preliminary results were reported as an extended abstract in Zhang et al. (2020); the work herein supersedes and expands upon that work.

We first lay out our inference framework in Section 2. Training set construction in the context of
Roman is discussed in Section 3. In Section 4, we demonstrate the ability of the NDE to capture degenerate solutions and also present a systematic evaluation of the NDE performance over a large number of test events. In Section 5, we suggest future directions, including the potential addition of a downstream MCMC algorithm to refine the NDE posterior into the exact posterior with minimal additional computation time.
2. METHOD

We train a conditional NDE p̂_φ(θ|x) to approximate the true posterior p(θ|x), where θ are the physical parameters and x ∈ R^N is the light curve with N data points. The objective is to minimize the Kullback-Leibler (KL) divergence (D_KL) between the two distributions:

    φ = argmin_φ D_KL( p(θ|x) || p̂_φ(θ|x) )                                   (1)
      = argmin_φ E_{θ∼p(θ), x∼p(x|θ)} [ log p(θ|x) − log p̂_φ(θ|x) ]           (2)
      = argmax_φ E_{θ∼p(θ), x∼p(x|θ)} [ log p̂_φ(θ|x) ],                        (3)

where φ represents the neural network parameters, and E denotes the mathematical expectation over the specified distribution. The NDE is therefore trained with Maximum Likelihood Estimation (MLE) on a training set with physical parameters drawn from the prior p(θ) and light curves drawn from the likelihood function, which is the Poisson measurement noise model on top of the noise-free microlensing light curve g(θ) (in number of photons) and, for simplicity, is assumed to be in the Gaussian limit:

    p(x|θ) = N( μ = g(θ), σ = sqrt(g(θ)) ).                                   (4)

The noise-free light curve, in turn, is determined by the baseline source flux (F_source), the magnification time series produced by the microlensing physical forward model A(θ), and the constant blend flux, which is the flux from the lens star and any other star that is unresolved from the source star:

    g(θ) = A(θ) · F_source + F_blend.

We use a 20-block Masked Autoregressive Flow (MAF; Papamakarios et al. 2017) to model p̂(θ|x), and a ResNet-GRU network to extract features (h) from the light curve (x). We do not distinguish between p̂(θ|x) and p̂(θ|h), where the former refers to the "featurizer + NDE" model and the latter refers to the NDE model alone, explicitly conditioned on h. Figure 1 presents a schema of our neural posterior estimation framework. The ResNet-GRU network is comprised of an 18-layer 1D ResNet (Residual Convolutional Network; He et al. 2016) and a 2-layer GRU (Gated Recurrent Network; Cho et al. 2014). We describe the neural networks in detail below.
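As a concrete illustration of this measurement model, the following minimal sketch (our own illustration, not code from this work; names such as `realize_light_curve` are hypothetical) draws one noisy light curve from a magnification time series according to Equation (4):

```python
import numpy as np

def realize_light_curve(magnification, f_source, f_blend, rng=None):
    """Realize a noisy light curve under the Gaussian-limit Poisson
    noise model of Equation (4): x ~ N(g(theta), sqrt(g(theta))).

    magnification : array, A(theta) from the 2L1S forward model
    f_source      : baseline source flux, in photon counts
    f_blend       : constant blend flux, in photon counts
    """
    rng = rng or np.random.default_rng()
    g = magnification * f_source + f_blend   # noise-free curve g(theta)
    return rng.normal(loc=g, scale=np.sqrt(g))
```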
2.1. Masked Autoregressive Flow

The masked autoregressive flow (MAF) belongs to a class of NDEs called normalizing flows, which model the conditional distribution p̂(θ|x) as an invertible transformation f from a base distribution π_z(z) to the target distribution p̂(θ|x). The base density π_z(z) is required to be fast to evaluate and is typically chosen to be either a standard Gaussian or a mixture of Gaussians for the MAF. The basic idea is that if the MAF, conditioned on the observation x, can learn to map the posterior to a standard Gaussian, then the inverse transformation enables sampling of the posterior by simply sampling from that standard Gaussian.

As binary microlensing events often exhibit degenerate, multi-modal solutions, we use a mixture of eight standard multivariate Gaussians, each with 8 dimensions, as the base distribution. The posterior probability density p̂(θ|x) is evaluated by applying the inverse transformation f⁻¹ from θ to z:

    p̂(θ|x) = π_z( f⁻¹(θ) ) |det( ∂f⁻¹/∂θ )|,                                  (5)

where π_z(f⁻¹(θ)) represents the probability density of the base distribution (π_z) evaluated at f⁻¹(θ), while the second term (the determinant of the Jacobian) corresponds to the "compression" of probability space.

The MAF is built upon blocks of affine transformations where the shifting and scaling factors for each dimension are computed with a Masked Autoencoder for Distribution Estimation (MADE; Germain et al. 2015). For a simple 1-block case, the inverse transformation from θ to z is expressed as:

    z_i = (θ_i − μ_i) · exp(−α_i).                                             (6)

In the above equation, μ_i = f_{μ_i}(θ_{1:i−1}; x) and α_i = f_{α_i}(θ_{1:i−1}; x) are the shifting and scaling factors modeled by MADE, subject to the autoregressive condition that the transformation of any dimension can only depend on those prior to it according to a predetermined ordering. This allows the Jacobian of f⁻¹ to be triangular, whose absolute determinant can be easily calculated as:

    |det( ∂f⁻¹/∂θ )| = exp( −Σ_i α_i ),                                        (7)

where α_i = f_{α_i}(θ_{1:i−1}; x). To sample from the posterior, the forward transformation θ = f(z), where z ∼ π_z, is applied:

    θ_i = z_i · exp(α_i) + μ_i,                                                (8)

where μ_i and α_i are computed in the same manner as in the inverse transformation.
The MAF is built by stacking many such affine transformation blocks, M_1, M_2, ..., M_K, where M_K models the invertible transformation f_K between the posterior (z_K) and the intermediate random variable z_{K−1}; M_{K−1} models that between the intermediate random variables z_{K−2} and z_{K−1}; and so on. Finally, the base random variable z_0 is modeled with the mixture-of-Gaussians distribution. M_1 also computes the mixture weights. The composite transformation can be written as f = f_1 ∘ f_2 ∘ ... ∘ f_K, and the posterior probability density is now:

    p̂(θ|x) = π_z( f⁻¹(θ) ) Π_{i=1}^{K} |det( ∂f_i⁻¹/∂z_{i−1} )|,               (9)

where it is understood that z_K := θ. The log-probability of the posterior is then, by Equation (7),

    log p̂(θ|x) = log π_z( f⁻¹(θ) ) + Σ_{i=1}^{K} log |det( ∂f_i⁻¹/∂z_{i−1} )|
               = log π_z( f⁻¹(θ) ) − Σ_{i=1}^{K} Σ_j α_{ij},                   (10)

where α_{ij} is the j-th component of the scale factor in M_i, as in Equation (6), with j running over the dim(θ) = 8 parameter dimensions. This serves as the optimization objective (see Section 3.3).

Autoregressive models are sensitive to the order of the variables. The original MAF paper uses the default order for the autoregressive layer closest to θ and reverses the order for each successive layer. In this work, we instead adopt fixed random orderings for each MAF block, which we find to allow for better expressibility. The random seed of the ordering serves as a hyper-parameter to be optimized on.

Figure 1. Schematic illustration of the inference framework based on a conditional NDE. The bottom left shows a microlensing light curve in arbitrary units, which is abstracted into the length-7200 vector (x) above. The featurizer, composed of a combination of ResNet and GRU (shown in the trapezoid), compresses the light curve into a low-dimensional feature vector h. The masked autoregressive flow (MAF), composed of K blocks of masked autoencoder for density estimation (MADE), is shown in the dashed box. Each MADE block takes in the feature vector h and predicts scaling (α) and shifting (μ) factors, which parametrize an invertible affine transformation between adjacent random variables (e.g., z_1 and z_2), shown in the dotted box. The left-most random variable is the mixture-of-Gaussians base distribution, whereas the right-most random variable (z_K) is the posterior (θ).
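The affine block of Equations (6)-(8) and the stacked log-probability of Equation (10) can be sketched in a few lines; this is a schematic of the computation only, with the `conditioners` callables standing in for the trained, feature-conditioned MADE networks:

```python
import numpy as np

def made_inverse(theta, conditioners, h):
    """One MADE block: z = (theta - mu) * exp(-alpha) (Eq. 6), plus the
    log |det Jacobian| = -sum(alpha) (Eq. 7). `conditioners` returns the
    autoregressive shift/scale factors given theta and the features h."""
    mu, alpha = conditioners(theta, h)
    z = (theta - mu) * np.exp(-alpha)
    return z, -alpha.sum()

def maf_log_prob(theta, blocks, h, log_base_density):
    """Stack K blocks (Eq. 10): log p(theta|x) is the base log-density of
    z0 = f^{-1}(theta) plus the accumulated log-determinants."""
    z, log_det_total = theta, 0.0
    for conditioners in blocks:          # M_K, ..., M_1
        z, log_det = made_inverse(z, conditioners, h)
        log_det_total += log_det
    return log_base_density(z) + log_det_total
```

Training maximizes this quantity over simulated (θ, x) pairs, as described in Section 3.3.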
2.2. Featurizer Network

A custom 1D ResNet with a downstream 2-layer GRU is used as the light curve featurizer, which takes the pre-processed light curve (x) as input and outputs a low-dimensional feature vector (h). The ResNet used in this study shares the same architecture as Zhang & Bloom (2020) (except for hyper-parameters) and consists of 9 identical residual blocks, each of which is composed of two convolutions followed by layer normalization (Ba et al. 2016). A residual connection is added between each pair of adjacent residual blocks, acting as a "gradient highway" to assist network optimization. A max-pooling layer is applied between every two ResNet layers, halving the sequence length and doubling the feature dimension until a specified maximum. This results in an output feature map of length L = 56 and dimension D = 256, which is then fed into the GRU network; the GRU sequentially processes information across the temporal dimension and outputs a single vector of D = 256, which then serves as the conditional input to the MAF.
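A simplified PyTorch sketch of such a featurizer is shown below. The block count, channel schedule, and normalization are illustrative stand-ins (the actual network has 9 residual blocks and tuned hyper-parameters), chosen so that a length-7200 input is reduced to a length-56 feature map of dimension 256, as stated above:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 1D convolutions with normalization and a residual connection;
    a simplified stand-in for the paper's residual block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, c_out)           # layer-norm-like
        self.skip = nn.Conv1d(c_in, c_out, kernel_size=1)
    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = self.norm(self.conv2(h))
        return torch.relu(h + self.skip(x))

class Featurizer(nn.Module):
    """ResNet front end that halves the sequence length and doubles the
    channel count at each stage, followed by a 2-layer GRU whose final
    hidden state is the feature vector h."""
    def __init__(self, n_stages=7, c0=8, c_max=256, h_dim=256):
        super().__init__()
        blocks, c = [], 1
        for _ in range(n_stages):
            c_next = min(max(c * 2, c0), c_max)
            blocks += [ResBlock(c, c_next), nn.MaxPool1d(2)]
            c = c_next
        self.resnet = nn.Sequential(*blocks)
        self.gru = nn.GRU(input_size=c, hidden_size=h_dim,
                          num_layers=2, batch_first=True)
    def forward(self, x):                  # x: (batch, 7200) light curve
        f = self.resnet(x.unsqueeze(1))    # -> (batch, 256, 56)
        _, h_n = self.gru(f.transpose(1, 2))
        return h_n[-1]                     # (batch, 256) feature vector h
```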
3. DATA

We simulate 10⁶ 2L1S magnification sequences with MulensModel (Poleski & Yee 2019); each sequence contains 144 days at a cadence of 0.01 day, corresponding to the planned Roman cadence of 15 minutes (Penny et al. 2019). These sequences are chosen to have twice the length of the 72-day Roman observation window to facilitate sampling from a t_0 ∼ Uniform(0, 72) prior (see Section 3.1). We then fit each simulated magnification time series with a Paczyński single-lens, single-source (1L1S) model (assuming S/N_base = 200 and f_s = 1; see Section 3.2) and discard those that are consistent with 1L1S (χ²_{1L1S}/dof < 1).
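For reference, a single magnification sequence of this kind can be computed with MulensModel along the following lines; the parameter values are illustrative, and the exact API (e.g., `get_magnification`) may differ between MulensModel versions:

```python
import numpy as np
import MulensModel as mm

# 144 days at the 0.01-day (15-minute) Roman-like cadence
times = np.arange(0.0, 144.0, 0.01)

model = mm.Model({'t_0': 72.0, 'u_0': 0.1, 't_E': 25.0, 'rho': 1e-3,
                  's': 1.0, 'q': 1e-3, 'alpha': 120.0})

# Finite-source (VBBL) magnification near the peak, point-source elsewhere
model.set_magnification_methods([60.0, 'VBBL', 84.0])

magnification = model.get_magnification(times)   # A(theta) time series
```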
3.1. Prior

Assuming rectilinear relative motion of the observer, lens, and source, binary microlensing (2L1S) events are characterized by eight parameters: binary lens separation (s), mass ratio (q), angle of the source trajectory with respect to the projected binary lens axis (α), impact parameter (u_0), time of closest approach (t_0), Einstein ring crossing timescale (t_E), finite source size (ρ), and source flux fraction (f_s). α is the angle between the vector pointing from the primary to the secondary and the source trajectory vector, measured counterclockwise in degrees. u_0 and t_0 are defined with respect to the binary lens center of mass (COM). Where applicable, the parameters are normalized to the Einstein ring length scale or the Einstein ring crossing timescale of the total mass of the lens system. t_0 and t_E are in units of days. We simulate 2L1S events based on the following analytic priors:

    s   ∼ LogUniform(0.2, 5)
    q   ∼ LogUniform(10⁻⁶, 1)
    α   ∼ Uniform(0, 360)
    u_0 ∼ Uniform(0, 2)
    t_0 ∼ Uniform(0, 72)
    t_E ∼ TruncLogNorm(1, 100, μ = 10^{1.15}, σ = 10^{0.45})
    ρ   ∼ LogUniform(10⁻⁴, 10⁻²)
    f_s ∼ LogUniform(0.1, 1).                                                  (11)

Because of the χ²_{1L1S}/dof ≥ 1 cut, the effective training-set priors differ from the analytic priors above: large q and small u_0, which otherwise have flat priors, are strongly preferred (Figure 2).
Figure 2. Fraction of the 10⁶ simulations passing the χ²_{1L1S}/dof ≥ 1 cut, as a function of each parameter (gray histograms). The original priors are shown as red-dashed lines, up to a normalization factor. For parameters with a flat original prior, the gray histogram is also the effective training set prior, up to a normalization factor. The t_0 and f_s distributions follow the original priors, as they are sampled on the fly during training.

During training, a random 72-day segment is chosen on the fly from each 144-day magnification sequence, equivalent to prescribing a uniform prior on t_0. The truncated log-normal distribution for t_E is an approximation to a statistical analysis based on OGLE-IV data (Mróz et al. 2017). The lower limit of q = 10⁻⁶ corresponds to the mass ratio between Mercury and a low-mass (M ∼ 0.1 M_⊙) M-dwarf star, highlighting the superb sensitivity of Roman. The source flux fraction is defined as the ratio between the source flux and the total baseline flux:

    f_s = F_source / (F_source + F_blend).                                     (12)
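A minimal sketch of drawing parameters from these analytic priors, with the truncated log-normal for t_E sampled by simple rejection (the prior bounds follow Equation (11) as reconstructed above):

```python
import numpy as np

rng = np.random.default_rng(42)

def trunc_lognorm(n, mu, sigma, lo, hi):
    """Rejection-sample a log-normal (base-10 parameters) truncated to [lo, hi]."""
    out = np.empty(0)
    while out.size < n:
        draw = 10 ** rng.normal(np.log10(mu), np.log10(sigma), 2 * n)
        out = np.concatenate([out, draw[(draw > lo) & (draw < hi)]])
    return out[:n]

def sample_prior(n):
    """Draw n parameter sets from the analytic priors of Section 3.1."""
    return {
        's':     10 ** rng.uniform(np.log10(0.2), np.log10(5.0), n),
        'q':     10 ** rng.uniform(-6.0, 0.0, n),
        'alpha': rng.uniform(0.0, 360.0, n),
        'u_0':   rng.uniform(0.0, 2.0, n),
        't_0':   rng.uniform(0.0, 72.0, n),
        't_E':   trunc_lognorm(n, mu=10**1.15, sigma=10**0.45, lo=1.0, hi=100.0),
        'rho':   10 ** rng.uniform(-4.0, -2.0, n),
        'f_s':   10 ** rng.uniform(np.log10(0.1), 0.0, n),
    }
```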
3.2. Light-curve Realization

The magnification sequences are converted into light curves on the fly during training by multiplying by the baseline pre-magnification source flux and then adding the constant blend flux and applying measurement noise. For simplicity, we only consider photon-counting noise from the lensed source and the fixed blend flux, assumed to be in the Gaussian limit of the Poisson noise (Equation 4), where the standard deviation of each photometric measurement is the square root of the flux measurement in photon counts. Studies of the bulge star population show that the apparent magnitude largely lies within the range of 20 mag to 25 mag (Penny et al. 2019, Figure 5). The Roman/WFIRST Cycle 7 design has the zero-point magnitude (1 count/s) at 27.615 mag for the W149 filter.
Table 1. Solutions for the example central-caustic passing event, listing the ground truth alongside the "close" and "wide" solutions for log(s), log(q), α, u_0, t_0, log(t_E), log(ρ), and f_s. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
With an exposure time of 46.8 s, the aforementioned magnitude range corresponds to signal-to-noise ratios (S/N_base) between 230 and 23 for the baseline flux, which we randomly and uniformly sample during training. On-the-fly sampling of S/N_base and f_s also serves as data augmentation, which refers to the process of expanding the effective size of the training set.
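The quoted S/N_base range follows directly from the zero point and exposure time, since the photon-limited S/N of the baseline flux is the square root of the collected counts; a short worked check:

```python
import numpy as np

ZP_W149 = 27.615   # W149 zero-point magnitude (1 count/s)
T_EXP = 46.8       # exposure time in seconds

def baseline_snr(mag):
    """Photon-limited S/N of the baseline flux at apparent magnitude `mag`."""
    counts = T_EXP * 10 ** (-0.4 * (mag - ZP_W149))
    return np.sqrt(counts)

print(baseline_snr(20.0))   # ~228, i.e., S/N_base ~ 230 at the bright end
print(baseline_snr(25.0))   # ~23 at the faint end
```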
3.3. Pre-processing and Training

Network optimization is performed with Adam (Kingma & Ba 2015) at an initial learning rate of 0.001 and a batch size of 512; the learning rate decays to 0 according to a cosine annealing schedule (Loshchilov & Hutter 2017) over 250 epochs, at which point training terminates. To ensure that there is no over-fitting, we first reserved 20% of the training set as a validation set; after confirming the absence of over-fitting, we then proceeded with the full training set. We apply data augmentation on α by changing the direction of the source trajectory: the temporal order of each sequence is reversed and α becomes −(α + 180) mod 360.
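A minimal PyTorch sketch of this optimization setup; `model` (featurizer plus MAF, with a `log_prob` method implementing Equation (10)) and the data `loader` are assumed to be defined elsewhere:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250)

for epoch in range(250):
    for theta, x in loader:                        # batches of 512 pairs
        optimizer.zero_grad()
        loss = -model.log_prob(theta, x).mean()    # MLE objective, Eq. (10)
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine-anneal the learning rate toward 0
```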
Table 2. Solutions for the example resonant-caustic passing event, listing the ground truth alongside the "close" and "resonant" solutions for log(s), log(q), α, u_0, t_0, log(t_E), log(ρ), and f_s. Same units as Table 1. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
4. RESULTS

The trained model generates accurate and precise posterior samples effectively in real time on one GPU; this compares to the substantially longer time required to compute even a single 2L1S light curve with MulensModel on one CPU core. In this section, we first highlight the ability of the NDE to capture multi-modal solutions by providing NDE posteriors of representative events, for which we set the baseline S/N_base = 200. Then, the quality of the NDE posteriors is systematically analyzed by examining the accuracy and calibration properties on a test set of 14,551 simulated light curves.
4.1. Central-Caustic Passing Event
Figure 3a shows the NDE posterior for an example central-caustic-passing event, where the classic "close-wide" degeneracy is clearly exhibited by the s ↔ 1/s behavior (Griest & Safizadeh 1998; Dominik 1999). Table 1 presents the ground-truth 2L1S parameters of this event, as well as the "close" and "wide" solutions calculated as the modes of their respective distributions. Although the source is expected to pass the caustic center at the same distance in both cases, Figure 3a shows a bimodal solution for u_0 as well, because u_0 is defined with respect to the center of mass (COM) rather than the caustic center. While the caustic center coincides with the COM for close-separation events, for wide-separation events there is an offset from the COM of

    δ = q / [s(1 + q)] − sq / (1 + q),                                         (13)

where the first term accounts for the offset of the caustic center from the location of the primary (Han 2008), and the second term for the offset of the primary from the center of mass. Positive offsets are directed toward the companion and vice versa. Plugging in the wide solution, the expected offset Δu_0 is consistent with the separation between the two u_0 modes in Figure 3a.
Figure 3. (a) NDE posterior for a central-caustic passing event. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Filled contours show 1/2/3σ regions. The ground-truth close solution is marked with orange cross-hairs. The close and wide solutions are marked with a red cross and a blue diamond, respectively. (b) Close-up view of the light-curve realizations for both solutions, normalized to their minimum fluxes, in the same color coding as (a). The 0.01-day cadence and the measurement noise are negligibly small on the scale of the figure, and are therefore not shown. (c) Caustic structures as well as source trajectories for the two solutions in the same color coding, centered on the caustic center.
Figure 4. Resonant-caustic-passing event; same figure caption as Figure 3. Here, a degenerate solution is seen at s < 1, whose two triangular caustics cause a suppression pattern similar to that of the resonant caustic.
Figure 5. Example event exhibiting a blunt and flat light curve near the peak, which has a 5-fold degenerate NDE posterior; same figure caption as Figure 3. Color coding is shared across the three panels.
Table 3. Degenerate solutions for the binary-planetary degenerate event shown in Figure 5, listing the ground truth alongside the "planetary A/B" and "binary A/B/C" solutions for log(s), log(q), α, u_0, t_0, log(t_E), and f_s. Same units as Table 1. Uncertainties are 1σ marginal uncertainties. [Numerical entries not recoverable.]
4.2. Resonant-Caustic Passing Event
We also highlight an example of a resonant-caustic passing event, whose parameters and solutions are shown in Table 2. As illustrated in Figure 4, the NDE finds an additional solution at s < 1, whose triangular caustics cause a weak de-magnification similar to that of the resonant caustic (also see Figure 7 in Gaudi 2010). This type of degeneracy has been previously observed in the microlensing event OGLE-2018-BLG-0677Lb (Herrera-Martín et al. 2020). Strong covariances are seen among u_0, t_E, and f_s, as in the previous example (Section 4.1), a degeneracy first observed by Woźniak & Paczyński (1997) in the f_s ≪ 1 regime.

4.3. Binary-Planetary Degeneracy
We also provide a fascinating 5-fold-degenerate example that is similar to the degeneracy reported in Choi et al. (2012), where a light curve that is blunt and flat near the peak can be explained by either a binary or a planetary lens. Here, we simulate a close-topology event with a planetary mass ratio whose source trajectory passes through the negative-perturbation region toward the back end of the arrowhead-shaped central caustic (Figure 5: "planetary A/B" caustic). Choi et al. (2012) noted that a similar perturbation can occur in the binary case when the source trajectory passes through the negative-perturbation region between two adjacent cusps of the astroid-shaped central caustic (Figure 5: "binary A/B/C" caustic; also see Figure 1 in their paper).

As shown in Figure 5, all five degenerate solutions produce magnification patterns that are hardly distinguishable. The two planetary solutions exhibit a close-wide degeneracy. For the three binary solutions, the "binary B/C" solutions suggest two possible trajectories (∼90/270 deg) for the same lens-system configuration, whereas the "binary A" solution exhibits a smaller mass ratio and a wider binary separation than "binary B/C". We note that this additional degeneracy in the mass ratio for the binary case was not reported in Choi et al. (2012). It is not clear whether this is a discrete or continuous degeneracy, nor whether it is an "accidental" degeneracy that arises because of the relatively weak perturbation, or is due to some underlying symmetry in the binary lens equation (e.g., Dominik 1999).

On the other hand, wide solutions for the binary case are largely absent from the NDE posterior, apart from an inkling of density near log(s) ∼ 0.69, which points to the expected close-wide degeneracy for the binary solution. The reason those degenerate solutions are excluded is that, because of the offset between the COM and the central caustic (Equation 13), wide-binary solutions would require t_0 < 0, which has a prior probability of zero.
4.4. Evaluating Performance
We present a systematic evaluation of all 14,551 test-set events in the form of predicted-vs-truth scatter plots (Figure 6). Each test-event light curve is realized in the same fashion as at training time. As the NDE returns potentially multi-modal posteriors of arbitrary shape, we compute the mode(s) of the marginal 1D posterior distributions and consider the mode closest to the ground truth as the "predicted" value. The mode(s) are computed by first fitting each marginal distribution with a 1D histogram of 100 bins and then searching for local maxima, defined as any bin count higher than those of the 20 adjacent bins; this limits the number of modes to 5. Considering that the purpose of the NDE posterior is to allow ultra-fast convergence of a downstream sampling-based algorithm like MCMC to determine the exact posterior, as long as the correct solution has substantial density in the NDE posterior, it should not raise alarm if an alternative mode is mistakenly favored: any degeneracies can be easily resolved downstream. It is therefore sensible to use the mode closest to the truth as the predicted value, even if another, degenerate mode is incorrectly preferred.
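A sketch of this mode-finding procedure follows; we read "the 20 adjacent bins" as the 20 bins on either side, the interpretation under which a 100-bin histogram indeed admits at most 5 such maxima:

```python
import numpy as np

def find_modes(samples, n_bins=100, window=20, max_modes=5):
    """Locate the mode(s) of a 1D marginal posterior: bin the samples
    into a histogram and flag any bin whose count exceeds those of the
    `window` bins on either side."""
    counts, edges = np.histogram(samples, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    modes = []
    for i, c in enumerate(counts):
        lo, hi = max(0, i - window), min(n_bins, i + window + 1)
        neighbors = np.delete(counts[lo:hi], i - lo)   # exclude bin i itself
        if neighbors.size and c > neighbors.max():
            modes.append((c, centers[i]))
    modes.sort(reverse=True)                           # strongest peaks first
    return [center for _, center in modes[:max_modes]]
```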
As shown in the upper-left corner of each subplot in Figure 6, all parameters are constrained at a rate close to 100%, except for the finite-source effect, for which only 14.2% are constrained, as the source trajectory is required to either cross or pass close to a caustic for ρ to be determined.¹ We consider a parameter to be constrained if the probability density of the 1D marginal distribution at the global mode is > 2× the prior probability density. The second row in each upper-left corner shows the frequency with which the correct mode is preferred by the NDE, that is, how often the ground truth is closer to the major mode than to the minor mode(s), if any. If the ground truth is closer to a minor mode, the correct minor mode is plotted in orange while the incorrect major mode is plotted in blue. We see clear degeneracy patterns in log(s) and α. For log(s), the "wide-close" degeneracy is exhibited by the cluster around the upper-left to lower-right diagonal.

¹Formally, effects on the light curve due to the finite size of the source are only significant if the gradient of the magnification across the source has a significant second derivative. In practice, this condition is only satisfied if the source passes within a few angular source radii of a caustic.
Figure 6. Predicted vs. ground-truth 2L1S parameters for 14,551 test-set 2L1S events. t_E and t_0 are in units of days, α is in degrees, and u_0, s, and ρ are in units of θ_E. Single-mode NDE posteriors are shown as black dots. For multi-modal NDE posteriors, we color-code the solutions as follows: those for which the global mode is closest to the ground truth are plotted in black; for cases where a minor mode is closest to the true value, this correct, minor mode is plotted in orange, whereas the incorrect global mode is plotted in blue. Red shadows indicate the 32-68th percentile (1σ) and 5-95th percentile (2σ) regions. Red-dashed lines show the diagonal. In the upper left of each subplot, "constrain" refers to the percentage of events whose NDE posterior poses a sufficient constraint: the peak posterior probability must be at least twice the prior probability. "Correct" refers to the percentage of constrained events whose true parameter lies closest to the global mode.

For α, there is also a cluster of events along the same diagonal, indicating a degeneracy between α and −α. Such a degeneracy may happen for nearly symmetrical central caustics, along the direction perpendicular to the lens axis.

The 1σ and 2σ ranges of prediction, shown as red shadows, are clustered closely around the diagonal for most parameters. We emphasize that the loose 1L1S-fitting cutoff (χ²_{1L1S}/dof ∼
1) means that many of the test-set light curves are only weakly perturbed by the binary nature of the lens, which should explain a number of cases in which the mass ratio is poorly constrained. Interestingly, we find that there is a tendency to overestimate the mass ratio in these cases. In addition, we notice that u_0 and f_s are underestimated for a large number of cases while t_E is correspondingly overestimated, though this is hardly visible in Figure 6. This bias can be explained by the combined effect of a known degeneracy for 1L1S events and a distribution mismatch.

First, there exists a well-known degeneracy between u_0, f_s, and t_E for single-lens events, which in our case applies to events that are only weakly perturbed by the binary nature of the lens. As demonstrated by Woźniak & Paczyński (1997), this degeneracy is most severe for low-magnification events (u_0 ≫ 1); when restricted to events with u_0 < 0.
15, the bias in f_s and t_E is largely removed. Figure 7 shows the NDE posterior for an example u_0 ≈ 1 event, illustrating the strong degeneracy among u_0, f_s, and t_E. In the presence of such strong degeneracies, the effective likelihood implicitly provided by the featurizer is only marginally informative. In other words, the featurizer cannot distinguish among solutions within the continuous degeneracy, and only prescribes a region in parameter space where the observation is about equally likely. The posterior is therefore essentially in a prior-dominated regime, where small u_0 and f_s are strongly favored, as seen in Figure 7. Had the parameters for these weakly perturbed events been drawn from the effective prior (Figure 2), there would be little bias (under/over-estimation) at all. Quite the contrary, however: the distribution of the weakly perturbed events is weighted toward the exact opposite direction of the effective prior, e.g., toward large u_0 and small log(q), which are more likely to be excluded by the χ²_{1L1S}/dof ≥ 1 cut.
Figure 7. Corner plot for the marginal NDE posterior of a 1L1S event, showing the strong degeneracy among the three 1L1S parameters: u_0 in units of θ_E, t_E in units of days, and f_s. Filled contours show 1/2/3/4σ regions. Small u_0 and f_s are strongly favored because of the effective priors (Section 3.1) and a marginally informative likelihood.

Because of this distribution mismatch, large u_0 and small log(q) occur much more often than expected by prior belief, thus resulting in the under/overestimation bias. And because of the strong covariances among u_0, f_s, and t_E, the underestimation of u_0 translates into an underestimation of f_s and an overestimation of t_E (Figure 7), which explains the biases seen in Figure 6.
4.5. Calibration Properties

A perfectly calibrated posterior knows how often it is right or wrong. In other words, the quantile of the ground-truth parameter under the NDE posterior should be distributed uniformly. Figure 8 shows the quantile distributions of the 1D marginal NDE posteriors for the same 14,551 test-set inferences. The quantile distributions for log(q), α, and t_0 are concave-up, indicating that the NDE uncertainty is overestimated and the true value lies close to the center of the posterior more often than expected. This suggests that the NDE finds it hard to contract the posterior in those dimensions, possibly due to numerical optimization difficulties or insufficient neural-network expressibility. On the other hand, the distributions for the three parameters in the second row (log(t_E), u_0, and f_s) demonstrate the systematic under/over-estimation seen in Figure 6, where log(t_E) is systematically overestimated and u_0 and f_s are underestimated. The quantile distributions for log(s) and log(ρ) are consistent with uniform distributions and are thus well calibrated.
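Such a check reduces to computing, for each test event, the quantile of the truth under the marginal posterior, i.e., the fraction of posterior samples below the true value; a minimal sketch:

```python
import numpy as np

def truth_quantiles(posterior_samples, truths):
    """Quantile of each ground-truth value under its 1D marginal posterior.

    posterior_samples : (n_events, n_samples) marginal posterior draws
    truths            : (n_events,) ground-truth parameter values

    For a perfectly calibrated posterior, the returned quantiles are
    uniform on [0, 1]; concave-up histograms indicate overestimated
    uncertainties, and skewed histograms indicate systematic bias.
    """
    return (posterior_samples < truths[:, None]).mean(axis=1)
```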
Figure 8. Calibration plot showing the test-set distributions of the ground-truth quantiles for the 1D marginal NDE posteriors. Dashed lines indicate the uniform distribution expected for a perfectly calibrated posterior.

5. DISCUSSION AND CONCLUSIONS

We have demonstrated that amortized neural posterior estimation, a likelihood-free inference method which uses a conditional NDE to learn a surrogate posterior p̂(θ|x), greatly accelerates binary microlensing inference: an approximate posterior can be produced in seconds without the need for an expert in the loop. Our new approach is capable of capturing a variety of degeneracies. For future work, it is straightforward to extend the method to higher-order effects such as parallax and binary motion by introducing additional parameters. Application to more complex systems such as 3L1S may be fruitful, where the physical forward model is orders of magnitude slower. In addition, the photometric noise model adopted in our study is somewhat simplistic, and future work can explore how to adapt models trained with ideal noise properties to fully realistic data, with the help of image-based simulation pipelines such as the one used in Penny et al. (2019). We discuss two additional aspects of our work below.

5.1. A hybrid NDE-MCMC framework
The NDE posterior can be refined into the exact posterior with a downstream MCMC sampler whose chains are initialized from NDE posterior samples, requiring minimal burn-in for high-quality Roman data. Applied to much noisier and more sparsely sampled ground-based data, we expect that data uncertainties will dominate over model uncertainties, thus allowing the NDE posterior to converge towards the exact posterior.
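Such a hybrid scheme might look as follows with the emcee sampler; `nde.sample` and `log_post` are hypothetical stand-ins for the trained NDE and the exact 2L1S log-posterior (e.g., the χ² from MulensModel plus the log-prior):

```python
import emcee

# Initialize walkers directly at NDE posterior samples for light curve x
walkers = nde.sample(x, n=64)                       # shape (64, 8)

sampler = emcee.EnsembleSampler(nwalkers=64, ndim=8,
                                log_prob_fn=log_post, args=(x,))
sampler.run_mcmc(walkers, nsteps=1000, progress=True)

# With well-placed initial walkers, only minimal burn-in is discarded
chain = sampler.get_chain(discard=100, flat=True)
```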
5.2. Choice of Coordinate System
For all events in this work, we have adopted the center-of-mass (COM) coordinate system, which is the defaultin
MulensModel, but it is not the most efficient reference frame, in the sense that more than 70% of the 1 million simulations turn out to be consistent with a 1L1S model. For example, most 2L1S configurations with large u_0 do not pass close to either the central caustics or the planetary caustics. For parts of the parameter space, alternative reference coordinates may be more descriptive or useful. For example, the caustic-center frame is preferred for binary and/or wide-separation events, for which there is an offset of the caustic center from the COM; adopting it recovers the missing wide binary solution in Section 4.3 without the need to expand the prior to include negative t_0. Additionally, planetary-caustic passing events are rare: for source trajectories far from the central caustics, most do not pass close to the planetary caustic, and as a result the magnification is frequently indistinguishable from 1L1S. For future work, a hybrid and self-consistent coordinate system could be used.

Software: numpy (van der Walt et al. 2011), scipy (Jones et al. 2001), matplotlib (Hunter 2007), pytorch (Paszke et al. 2017),
MulensModel (Poleski & Yee 2018),
Jupyter (Kluyver et al. 2016).

ACKNOWLEDGEMENTS

K.Z. and J.S.B. are supported by a Gordon and Betty Moore Foundation Data-Driven Discovery grant. J.S.B. is partially sponsored by a faculty research award from Two Sigma. K.Z. thanks the LSSTC Data Science Fellowship Program, which is funded by LSSTC, NSF Cybertraining Grant 1829740, the Brinson Foundation, and the Moore Foundation; his participation in the program has benefited this work. K.Z. thanks Shude Mao and Tsinghua University for their hospitality during the COVID-19 pandemic. B.S.G. is supported by NASA grant NNG16PJ32C and the Thomas Jefferson Chair for Discovery and Space Exploration. This work is supported by the AWS Cloud Credits for Research program. We thank Yang Gao and Yu Sun for helpful discussions, and Weicheng Zang for pointing out issues with the initial figure presentation.

REFERENCES