Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting
Niccolò Dalmasso, Rafael Izbicki, Ann B. Lee

Abstract

Parameter estimation, statistical tests and confidence sets are the cornerstones of classical statistics that allow scientists to make inferences about the underlying process that generated the observed data. A key question is whether one can still construct hypothesis tests and confidence sets with proper coverage and high power in a so-called likelihood-free inference (LFI) setting; that is, a setting where the likelihood is not explicitly known but one can forward-simulate observable data according to a stochastic model. In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI that first formulates the classical likelihood ratio test (LRT) as a parametrized classification problem, and then uses the equivalence of tests and confidence sets to build confidence regions for parameters of interest. We also present a goodness-of-fit procedure for checking whether the constructed tests and confidence regions are valid. ACORE is based on the key observation that the LRT statistic, the rejection probability of the test, and the coverage of the confidence set are conditional distribution functions which often vary smoothly as a function of the parameters of interest. Hence, instead of relying solely on samples simulated at fixed parameter settings (as is the convention in standard Monte Carlo solutions), one can leverage machine learning tools and data simulated in the neighborhood of a parameter to improve estimates of quantities of interest. We demonstrate the efficacy of ACORE with both theoretical and empirical results. Our implementation is available on GitHub.

Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, USA. Department of Statistics, Federal University of São Carlos, São Paulo, Brazil. Correspondence to: Niccolò Dalmasso <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
1. Introduction
Parameter estimation, statistical tests and confidence sets are the cornerstones of classical statistics that relate observed data to properties of the underlying statistical model. Most frequentist procedures with good statistical performance (e.g., high power) require explicit knowledge of a likelihood function. However, in many science and engineering applications, complex phenomena are modeled by forward simulators that implicitly define a likelihood function: For example, given input parameters θ, a statistical model of our environment, climate or universe may combine deterministic dynamics with random fluctuations to produce synthetic data X. Simulation-based inference without an explicit likelihood is called likelihood-free inference (LFI).

The literature on LFI is vast. Traditional LFI methods, such as Approximate Bayesian Computation (ABC; Beaumont et al. 2002; Marin et al. 2012; Sisson et al. 2018), estimate posteriors by using simulations sufficiently close to the observed data, hence bypassing the likelihood. More recently, several approaches that leverage machine learning algorithms have been proposed; these either directly estimate the posterior distribution (Marin et al., 2016; Chen & Gutmann, 2019; Izbicki et al., 2019; Greenberg et al., 2019) or the likelihood function (Izbicki et al., 2014; Thomas et al., 2016; Price et al., 2018; Ong et al., 2018; Lueckmann et al., 2019; Papamakarios et al., 2019). We refer the reader to Cranmer et al. (2019) for a recent review of the field.

A question that has not received much attention so far is whether one, in an LFI setting, can construct inference techniques with good frequentist properties. Frequentist procedures have nevertheless played an important role in many fields. In high energy physics, for instance, classical statistical techniques (e.g., hypothesis testing for outlier detection) have resulted in discoveries of new physics and other successful applications (Feldman & Cousins, 1998; Cranmer, 2015; Cousins, 2018).
Even though controlling type I error probabilities is important in these applications, most LFI methods do not have guarantees on validity or power. Ideally, a unified LFI approach should

• be computationally efficient in terms of the number of required simulations,
• handle high-dimensional data from different sources (without, e.g., predefined summary statistics),
• produce hypothesis tests and confidence sets that are valid; that is, have the nominal type I error or confidence level,
• produce hypothesis tests with high power or, equivalently, confidence sets with a small expected size,
• provide diagnostics for checking empirical coverage or for checking how well the estimated likelihood fits simulated data.

In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI, which addresses the above-mentioned concerns. Figure 2 summarizes the ACORE work structure: ACORE first compares synthetic data from the simulator F_θ to a reference distribution G by computing an "Odds Ratio". The odds ratio can be learnt with a probabilistic classifier, such as a neural network with a softmax layer, suitable for the data at hand. As we shall see, the estimated odds ratio is an approximation of the likelihood ratio statistic (Proposition 3.1). The ACORE test statistic (Equation 4), together with an estimate of the "Critical Value" (Algorithms 1 and 2), can be used for hypothesis testing or for finding a confidence set for θ. ACORE also includes "Diagnostics" (Section 3.3) for computing the empirical coverage of the constructed confidence set for θ.

At the heart of ACORE is the key observation that the likelihood ratio statistic, the critical value of the test, and the coverage of the confidence set are conditional distribution functions which often vary smoothly as a function of the (unknown) parameters of interest. Hence, instead of relying solely on samples simulated at fixed parameter settings (as is the convention in standard Monte Carlo solutions), one can leverage machine learning tools and data simulated in the neighborhood of a parameter to improve estimates of quantities of interest and decrease the total number of simulated data points. Our contribution is three-fold:

1. a new procedure for estimating the likelihood ratio statistic, which uses probabilistic classifiers and does not require repeated sampling at each θ or a separate interpolation or calibration step;
2. an efficient procedure for estimating the critical value that guarantees valid tests and confidence sets, based on quantile regression without repeated sampling at each θ;
3. a new goodness-of-fit technique for computing empirical coverage of constructed confidence sets as a function of the unknown parameters θ.

Finally, ACORE is simple and modular by construction.
One can easily switch out, generalize or take pieces of the framework and apply it to any similar machine-learning-based LFI setting. The theoretical results of Section 3.1 hold for a general setting. In addition, given the vast arsenal of existing probabilistic classifiers developed in the literature, ACORE can be applied to many different types of complex data X (e.g., images, time series and functional data). In Section 4, we show empirical results connecting the power of the constructed hypothesis tests to the performance of the classifier. Note also that Algorithms 1 and 2 for estimating parametrized critical values apply to any hypothesis test on θ of the form of Equation 2 for any test statistic τ. The goodness-of-fit procedure in Section 3.3 for checking empirical coverage as a function of θ is also not tied to odds ratios.

The problem of constructing confidence intervals with good frequentist properties has a long history in statistics (Neyman, 1937; Feldman & Cousins, 1998; Chuang & Lai, 2000). One of the earlier simulation-based approaches was developed in high energy physics (HEP) by Diggle & Gratton (1984); their proposed scheme of estimating the likelihood and likelihood ratio statistic nonparametrically by histograms of photon counts would later become a key component in the discovery of the Higgs boson (Aad et al., 2012). However, traditional approaches for building confidence regions and hypothesis tests in LFI rely on a series of Monte Carlo samples at each parameter value θ (Barlow & Beeston, 1993; Weinzierl, 2000; Schafer & Stark, 2009). Thus, these approaches quickly become inefficient with large or continuous parameter spaces. Traditional nonparametric approaches also have difficulties handling high-dimensional data without losing key information.

LFI has recently benefited from using powerful machine learning tools like deep learning to estimate likelihood functions and likelihood ratios for complex data. Successful application areas include HEP (Guest et al., 2018), astronomy (Alsing et al., 2019) and neuroscience (Gonçalves et al., 2019). ACORE has some similarities to the work of Cranmer et al. (2015), which also uses machine learning methods for frequentist inference in an LFI setting. Other elements of ACORE, such as leveraging the ability of ML algorithms to smooth over parameter space and turning a density ratio estimate into a supervised classification problem, have also previously been used in LFI settings: Works that smooth over parameter space include, e.g., Gaussian processes (Frate et al., 2017; Leclercq, 2018) and neural networks (Baldi et al., 2016). Works that turn a density ratio into a classification problem include applications to generative models (see Mohamed & Lakshminarayanan 2016 for a review) and Bayesian LFI (Thomas et al., 2016; Gutmann et al., 2018; Dinev & Gutmann, 2018; Hermans et al., 2019). Finally, like ACORE, Thornton et al. (2017) explores frequentist guarantees of confidence regions; however, those regions are built under a Bayesian framework.
Novelty. What distinguishes the ACORE approach from other related work is that it uses an efficient procedure for estimating the (i) likelihood ratio, (ii) critical values and (iii) coverage of confidence sets across the entire parameter space, without the need for an extra interpolation or calibration step (as in traditional Monte Carlo solutions and more recent ML approaches). To the best of our knowledge, (ii) and (iii) are entirely novel in the LFI literature. In contrast to other methods that estimate (i), ACORE does not make parametric assumptions or require an additional calibration step, and it can accommodate all types of hypotheses. We provide theoretical guarantees on our procedures in terms of power and validity (Section 3.1; proofs in Supplementary Material C). We also offer a scheme for how to choose ML algorithms and the number of simulations so as to have good power properties and valid inference in practice.
Notation. Let F_θ with density f_θ represent the stochastic forward simulator for a sample point X ∈ 𝒳 at parameter θ ∈ Θ. We denote an i.i.d. "observable" sample from F_θ by D = {X_1^obs, ..., X_n^obs}, and the actually observed or measured data by D^obs = {x_1^obs, ..., x_n^obs}. The likelihood function is L(D; θ) = ∏_{i=1}^n f_θ(X_i^obs).
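To make the notation concrete, here is a minimal sketch with a hypothetical one-dimensional Gaussian simulator F_θ = N(θ, 1); this example is purely illustrative, and any stochastic forward simulator with a density plays the role of F_θ:

```python
import numpy as np
from scipy import stats

def simulate(theta, n, rng):
    """Hypothetical forward simulator F_theta: X ~ N(theta, 1)."""
    return rng.normal(loc=theta, scale=1.0, size=n)

def likelihood(D, theta):
    """L(D; theta) = prod_i f_theta(X_i^obs) for an i.i.d. sample D."""
    return np.prod(stats.norm.pdf(D, loc=theta, scale=1.0))

rng = np.random.default_rng(0)
D = simulate(theta=2.0, n=10, rng=rng)   # observable data of size n = 10
# The likelihood is larger near the true parameter than far away from it.
assert likelihood(D, 2.0) > likelihood(D, 10.0)
```

In an LFI setting the simulator is available but `likelihood` is not; the rest of the paper is about working around exactly that.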
2. Statistical Inference in a Traditional Setting
We begin by reviewing elements of traditional statistical inference that play a key role in ACORE.

Equivalence of tests and confidence sets. A classical approach to constructing a confidence set for an unknown parameter θ ∈ Θ is to invert a series of hypothesis tests (Neyman, 1937): Suppose that for each possible value θ_0 ∈ Θ, there is a level-α test δ_{θ_0} of

H_{0,θ_0}: θ = θ_0 versus H_{1,θ_0}: θ ≠ θ_0; (1)

that is, a test δ_{θ_0} where the type I error (the probability of erroneously rejecting a true null hypothesis H_{0,θ_0}) is no larger than α. For observed data D = D^obs, now define R(D^obs) as the set of all parameter values θ_0 ∈ Θ for which the test δ_{θ_0} does not reject H_{0,θ_0}. Then, by construction, the random set R(D) satisfies P[θ_0 ∈ R(D) | θ = θ_0] ≥ 1 − α for all θ_0 ∈ Θ. That is, R(D) defines a (1 − α) confidence set for θ. Similarly, we can define a test with a desired significance level from a confidence set with a certain coverage.

Likelihood ratio test. A general form of hypothesis tests that often leads to high power is the likelihood ratio test (LRT). Consider testing

H_0: θ ∈ Θ_0 versus H_1: θ ∈ Θ_1, (2)

where Θ_1 = Θ \ Θ_0. For the likelihood ratio (LR) statistic,

Λ(D; Θ_0) = log [ sup_{θ ∈ Θ_0} L(D; θ) / sup_{θ ∈ Θ} L(D; θ) ], (3)

the LRT of hypotheses (2) rejects H_0 when Λ(D; Θ_0) < C for some constant C.

Figure 1 illustrates the construction of confidence sets for θ from level-α likelihood ratio tests (1). The critical value for each such test δ_{θ_0} is C_{θ_0} = {C : P[Λ(D; θ_0) < C | θ = θ_0] = α}.

Figure 1.
Constructing confidence intervals from hypothesis tests. Left: For each θ_0 ∈ Θ, we find the critical value C_{θ_0} that rejects the null hypothesis H_{0,θ_0} at level α; that is, C_{θ_0} is the α-quantile of the distribution of the likelihood ratio statistic Λ(D; θ_0) under the null. Right: The horizontal lines represent the acceptance region for each θ_0 ∈ Θ. Suppose we observe data D = D^obs. The confidence set for θ (indicated with the red line) consists of all θ_0-values for which the observed test statistic Λ(D^obs; θ_0) (indicated with the black curve) falls in the acceptance region.

3. ACORE: Approximate Computation via Odds Ratio Estimation
In a likelihood-free inference setting, we cannot directly evaluate the likelihood ratio statistic. Here we describe the details of how a simulation-based approach (ACORE, Figure 2) can lead to hypothesis tests and confidence sets with good frequentist properties. We start by simulating a labeled sample for computing odds ratios. The estimated odds ratio then defines a new test statistic that we use in place of the unknown likelihood ratio statistic.
Simulating a labeled sample. Let G be a distribution with larger support than F_θ for all θ ∈ Θ. The distribution G could, for example, be a dominating distribution which is easy to sample from. We use F_θ and G to simulate a labeled training sample T_B = {(θ_i, x_i, y_i)}_{i=1}^B for estimating odds ratios. The random sample T_B = {(θ_i, X_i, Y_i)}_{i=1}^B is identically distributed as (θ, X, Y), where the parameter θ ∼ r_Θ (a fixed proposal distribution over Θ), the "label" Y ∼ Ber(p) (a Bernoulli distribution with known p, with Y independent of θ), X | θ, Y = 1 ∼ F_θ and X | θ, Y = 0 ∼ G. That is, the label Y_i is the indicator that the sample point X_i was generated from F_θ rather than G. We call G a "reference distribution" as we are comparing F_θ for different θ with this distribution. For all our experiments in this work we use p = 1/2; other choices could account for computational differences in sampling from F_θ versus G. (Algorithm 3 in Supplementary Material A summarizes our procedure.)

Figure 2. Schematic diagram of ACORE. The simulator provides synthetic observable data T_B for learning a parametrized odds ratio via probabilistic classification. The simulator also generates a separate sample T′_{B′} for learning critical values as a function of θ ∈ Θ. Once data D^obs are observed, the odds ratio can be used to construct hypothesis tests or confidence sets for θ. ACORE provides diagnostics for computing the empirical coverage of constructed confidence sets as a function of the (unknown) parameter θ. The three main parts of ACORE (critical value, odds ratio, diagnostics) are separate modules. Each module leverages machine learning methods in the training phase and is amortized, i.e., they perform inference on new data without having to be retrained.

Odds ratios.
For fixed x, we define the odds at θ as

O(x; θ) := P(Y = 1 | θ, x) / P(Y = 0 | θ, x),

and the odds ratio at θ_0, θ_1 ∈ Θ as

OR(x; θ_0, θ_1) := O(x; θ_0) / O(x; θ_1).

One way of interpreting the odds O(x; θ) is to regard them as a measure of the chance that X was generated from F_θ. That is, large odds O(x; θ) reflect the fact that it is plausible that x was generated from F_θ (rather than G). Thus, OR(x; θ_0, θ_1) measures the plausibility that x was generated from θ_0 rather than θ_1. When testing (2), we therefore reject H_0 if

sup_{θ_0 ∈ Θ_0} inf_{θ_1 ∈ Θ} Σ_{i=1}^n log OR(X_i^obs; θ_0, θ_1) < C,

for some constant C. By Bayes' rule, this is just the likelihood ratio test of (2).

Hypothesis testing in an LFI setting.
In an LFI setting, we cannot directly evaluate the likelihood ratio statistic (3). The advantage of rewriting the LRT in terms of odds ratios is that we can forward-simulate a labeled training sample T_B, as described above, and then use a probabilistic classifier (suitable for the data at hand) to efficiently estimate the odds ratios OR(x; θ_0, θ_1) for all θ_0, θ_1 ∈ Θ: The probabilistic classifier compares data from the forward simulator F_θ with data from the reference distribution G and returns a parametrized odds estimate Ô(x; θ), which is a function of θ ∈ Θ. We can directly compute the odds ratio estimate ÔR(x; θ_0, θ_1) at any two values θ_0, θ_1 ∈ Θ from Ô(x; θ). There is no need for a separate training step. We reject H_0 if the ACORE test statistic, defined as

τ(D; Θ_0) := sup_{θ_0 ∈ Θ_0} inf_{θ_1 ∈ Θ} Σ_{i=1}^n log ÔR(X_i^obs; θ_0, θ_1), (4)

is small enough for observed data D = D^obs. If the probabilities learned by the classifier are well estimated, τ is exactly the likelihood ratio statistic:

Proposition 3.1 (Fisher Consistency). If P̂(Y = 1 | θ, x) = P(Y = 1 | θ, x) for every θ and x, then the ACORE test statistic (4) is the likelihood ratio statistic (Equation 3).
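Concretely, the pipeline above — simulate the labeled sample T_B, fit a probabilistic classifier for P(Y = 1 | θ, x), and turn its output into the statistic (4) — can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: it assumes a hypothetical one-dimensional simulator F_θ = N(θ, 1), a reference G = N(0, 4²), a discretized parameter space, and a quadratic logistic regression as the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
B, n = 20000, 10
theta_grid = np.linspace(-3.0, 3.0, 61)       # discretized parameter space Theta

# Labeled sample T_B: theta_i ~ r, Y_i ~ Ber(1/2), X_i ~ F_theta (Y=1) or G (Y=0)
theta = rng.uniform(-3.0, 3.0, size=B)
y = rng.binomial(1, 0.5, size=B)
x = np.where(y == 1, rng.normal(theta, 1.0), rng.normal(0.0, 4.0, size=B))

# Probabilistic classifier for P(Y = 1 | theta, x); quadratic features happen to
# make the logistic model well-specified for this Gaussian toy pair.
clf = make_pipeline(PolynomialFeatures(degree=2),
                    LogisticRegression(C=1e3, max_iter=1000))
clf.fit(np.column_stack([theta, x]), y)

def sum_log_odds(D, th):
    """sum_i log O_hat(X_i; theta), with O_hat = P_hat / (1 - P_hat)."""
    feats = np.column_stack([np.full_like(D, th), D])
    p = clf.predict_proba(feats)[:, 1]
    return np.sum(np.log(p) - np.log1p(-p))

def acore_stat(D, theta0_set):
    """tau(D; Theta_0): sup over the null of the summed log odds,
    minus the sup over the full grid (i.e., the inf over theta_1 in (4))."""
    best_alt = max(sum_log_odds(D, th) for th in theta_grid)
    return max(sum_log_odds(D, th0) for th0 in theta0_set) - best_alt

D_obs = rng.normal(1.0, 1.0, size=n)          # observed sample, true theta = 1
assert acore_stat(D_obs, [1.0]) >= acore_stat(D_obs, [-3.0])  # truth is favored
```

With the exact class probabilities, τ reduces to the LR statistic (Proposition 3.1); the quadratic feature map is a convenience of this particular Gaussian setting, not a general requirement — any well-calibrated probabilistic classifier can be plugged in.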
Estimating the critical value. A key question is how to efficiently estimate the critical value of a test. In this section we consider a single composite null hypothesis H_0: θ ∈ Θ_0. (The setting for constructing confidence sets by testing (1) for all θ_0 ∈ Θ is discussed in Section 3.2.) Suppose that we reject the null hypothesis if the test statistic (4) is smaller than some constant C. To achieve a test with a desired level of significance α, we need (for maximum power) the largest C that satisfies

sup_{θ ∈ Θ_0} P(τ(D; Θ_0) < C | θ) ≤ α. (5)

However, we cannot explicitly compute the critical value C or the rejection probability, as we do not know the distribution of the test statistic τ.

Simulation-based approaches are often used to compute rejection probabilities and critical values in lieu of large-sample theory approximations. Typically, such simulations compute a separate Monte Carlo simulation at each fixed θ ∈ Θ_0 on, e.g., a fine enough grid on θ. That is, the convention is to rely solely on sample points generated at fixed θ to estimate the rejection probabilities P(τ(D; Θ_0) < C | θ).

Algorithm 1 Estimate the critical value C for a level-α test of composite hypotheses H_0: θ ∈ Θ_0 vs. H_1: θ ∈ Θ_1
Require: stochastic forward simulator F_θ; sample size B′ for training quantile regression estimator; r_{Θ_0} (a fixed proposal distribution over the null region Θ_0); test statistic τ; quantile regression estimator; desired level α ∈ (0, 1)
Ensure: estimated critical value Ĉ
  Set T′ ← ∅
  for i in 1, ..., B′ do
    Draw parameter θ_i ∼ r_{Θ_0}
    Draw sample X_{i,1}, ..., X_{i,n} i.i.d. ∼ F_{θ_i}
    Compute test statistic τ_i ← τ((X_{i,1}, ..., X_{i,n}); Θ_0)
    T′ ← T′ ∪ {(θ_i, τ_i)}
  end for
  Use T′ to learn the parametrized function ĉ_α(θ) := F̂⁻¹_{τ|θ}(α | θ) via quantile regression of τ on θ
  return Ĉ ← inf_{θ ∈ Θ_0} ĉ_α(θ)

Theoretical guarantees.
We denote convergence in probability and in distribution by →_P and →_D, respectively. We start by showing that our procedure leads to valid hypothesis tests (that is, tests that control the type I error probability) as long as B′ in Algorithm 1 is large enough. In order to do so, we assume that the quantile regression estimator used in Algorithm 1 to estimate the critical values is consistent in the following sense:

Assumption 3.2. Let F̂_{B′}(· | θ) be the estimated cumulative distribution function of the test statistic τ conditional on θ, based on a sample of size B′, and let F(· | θ) be the true conditional distribution. For every θ ∈ Θ, assume that the quantile regression estimator is such that

sup_{t ∈ ℝ} |F̂_{B′}(t | θ) − F(t | θ)| →_P 0 as B′ → ∞.

Under some conditions, Assumption 3.2 holds, for instance, for quantile regression forests (Meinshausen, 2006).

Next we show that, for every fixed training sample size B in Algorithm 3, Algorithm 1 yields a valid hypothesis test as B′ → ∞. The result holds even if the likelihood ratio statistic is not well estimated.

Theorem 3.3. Let C_{B,B′} ∈ ℝ be the critical value of the test based on the statistic τ = τ_B for a training sample size B, with the critical value chosen according to Algorithm 1 for a fixed α ∈ (0, 1). If the quantile estimator satisfies Assumption 3.2 and |Θ_0| < ∞, then

C_{B,B′} →_P C*_B as B′ → ∞,

where C*_B is such that sup_{θ ∈ Θ_0} P(τ_B ≤ C*_B | θ) = α.

Finally, we show that as long as the probabilistic classifier is consistent and the critical values are well estimated (which holds for large B′ according to Theorem 3.3), the power of the ACORE test converges to the power of the LRT as B grows.

Theorem 3.4. Let φ̂_{B,C_B}(D) be the test based on the statistic τ = τ_B for a labeled sample size B with critical value C_B ∈ ℝ. Moreover, let φ_{C*}(D) be the likelihood ratio test with critical value C* ∈ ℝ.
If, for every θ ∈ Θ,

P̂(Y = 1 | θ, X) →_P P(Y = 1 | θ, X) as B → ∞,

where |Θ| < ∞, and Ĉ_B is such that Ĉ_B →_D C* as B → ∞, then, for every θ ∈ Θ,

P(φ̂_{B,Ĉ_B}(D) = 1 | θ) → P(φ_{C*}(D) = 1 | θ) as B → ∞.

(Here φ̂_{B,C_B}(D) = 1 ⟺ τ_B(D; Θ_0) < C_B, and φ_{C*}(D) = 1 ⟺ Λ(D; Θ_0) < C*.)

Constructing confidence sets. To construct a confidence set for θ, we use the equivalence of tests and confidence sets (Section 2): Suppose that for every θ_0 ∈ Θ we can find the critical value C_{θ_0} of a test of (1) with type I error no larger than α. The random set

R(D) = {θ_0 ∈ Θ | τ(D; θ_0) ≥ C_{θ_0}}

then defines a (1 − α) confidence region for θ. However, rather than repeatedly running Algorithm 1 for each null hypothesis Θ_0 = {θ_0} separately, we estimate all critical values C_{θ_0} (for different θ_0 ∈ Θ) simultaneously. Algorithm 2 outlines our procedure. Again, we use quantile regression to learn a parametrized function C_{θ_0}. The whole procedure for computing confidence sets via ACORE is summarized in Algorithm 4 in Supplementary Material B. Theorem 3.3 implies that the constructed confidence set has the nominal (1 − α) confidence level as B′ → ∞. The size of the confidence set depends on the training sample size B and the classifier.

Algorithm 2 [Many Simple Null Hypotheses] Estimate the critical values C_{θ_0} for a level-α test of H_{0,θ_0}: θ = θ_0 vs. H_{1,θ_0}: θ ≠ θ_0 for all θ_0 ∈ Θ simultaneously
Require: stochastic forward simulator F_θ; sample size B′ for training quantile regression estimator; r (a fixed proposal distribution over the full parameter space Θ); test statistic τ; quantile regression estimator; desired level α ∈ (0, 1)
Ensure: estimated critical values Ĉ_{θ_0} for all θ_0 ∈ Θ
  Set D′ ← ∅
  for i in 1, ..., B′ do
    Draw parameter θ_i ∼ r
    Draw sample X_{i,1}, ..., X_{i,n} i.i.d. ∼ F_{θ_i}
    Compute test statistic τ_i ← τ((X_{i,1}, ..., X_{i,n}); θ_i)
    D′ ← D′ ∪ {(θ_i, τ_i)}
  end for
  Use D′ to learn the parametrized function Ĉ_{θ_0} := F̂⁻¹_{τ|θ}(α | θ_0) via quantile regression of τ on θ
  return Ĉ_{θ_0} for all θ_0 ∈ Θ

Diagnostics. After the parametrized ACORE statistic and the critical values have been estimated, it is important to check whether the resulting confidence sets indeed are valid or, equivalently, whether the resulting hypothesis tests have the nominal significance level. We also want to identify regions in parameter space where we clearly overcover. That is, the two main questions are: (i) do the constructed confidence sets satisfy

P[θ_0 ∈ R(D) | θ = θ_0] ≥ 1 − α for every θ_0 ∈ Θ,

and (ii) how close is the actual coverage to the nominal confidence level 1 − α? To answer these questions, we propose a goodness-of-fit procedure where we draw B′′ new samples from the simulator given θ, construct a confidence set for each sample, and then check which computed regions include the "true" θ. More specifically: we generate a set T′′_{B′′} = {(θ′_1, D′_1), ..., (θ′_{B′′}, D′_{B′′})}, where θ′_i ∼ r_Θ and D′_i is a sample of size n of i.i.d. observable data from F_{θ′_i}. We then define

W_i := I(θ′_i ∈ R(D′_i)),

where R(D′_i) is the confidence set for θ for data D′_i. If R has the correct coverage, then P(W_i = 1 | θ_i) ≥ 1 − α. We can estimate the probability P(W_i = 1 | θ_i) using any probabilistic classifier; some methods also provide confidence bands that assess the uncertainty in estimating this quantity (Eubank & Speckman, 1993; Claeskens et al., 2003; Krivobokova et al., 2010).
By comparing the estimated probability to 1 − α, we have a diagnostic tool for checking how close we are to the nominal confidence level over the entire parameter space Θ. See Figure 3 for an example. Finally, note that our procedure parametrizes the coverage of the confidence set as a function of the true parameter value. This is in contrast to other goodness-of-fit techniques (e.g., Cook et al. 2006; Bordoloi et al. 2010; Talts et al. 2018; Schmidt et al. 2019) that only check for marginal coverage, i.e., n⁻¹ Σ_{i=1}^n W_i ≥ 1 − α.

4. Toy Examples

We consider two examples where the true likelihood is known. In the first example, the forward simulator F_θ follows a Poisson(100 + θ) distribution, similar to the signal-background model in Section 5. In the second example, we consider a Gaussian mixture model (GMM) with two unit-variance Gaussians centered at −θ and θ, respectively. In both examples, n = 10, the proposal distribution r_Θ is a uniform distribution, and the reference distribution G is a normal distribution. Table 1 summarizes the set-up.

            Poisson Example      GMM Example
  r_Θ       Unif(0, 20)          Unif(0, ·)
  F_θ       Poisson(100 + θ)     N(−θ, 1) + N(θ, 1)
  G         N(110, ·)            N(0, ·)
  True θ    θ = 10               θ = 5

Table 1. Set-up for the two toy examples.

First we investigate how the power of ACORE and the size of the derived confidence sets depend on the performance of the classifier used in the odds ratio estimation (Section 3.1). We consider three classifiers: multilayer perceptron (MLP), nearest neighbor (NN) and quadratic discriminant analysis (QDA). For different values of B (the sample size for estimating odds ratios), we compute the binary cross entropy (a measure of classifier performance), the power as a function of θ, and the size of the constructed confidence set. Table 2 summarizes results based on 100 repetitions. (To compute the critical values in Algorithm 2, we use quantile gradient boosted trees and a large enough sample size B′ = 5000 to guarantee valid confidence sets; see Supplementary Material D.) The last row of the table shows the best attainable cross entropy loss (Supplementary Material F), the confidence set size and power for the true likelihood function. For all 18 settings, the computation of one ACORE confidence set takes between 10 and 30 seconds on a single CPU. A full breakdown of the runtime of ACORE confidence sets can be found in Supplementary Material I.

Table 2. Results for the Poisson example (left) and GMM example (right). The tables show the cross entropy loss, power (averaged over θ) and size of ACORE confidence sets for different values of B and for different classifiers. These results are based on 100 repetitions; the numbers represent the mean and one standard deviation. The best results in each setting are marked in bold face; we see that the classifier with the lowest cross entropy loss (a quantity that is easily computed in practice) is linked with the highest average power and the smallest confidence set. As B increases, the best ACORE values approach the values for the exact LRT, listed in the bottom row in red. (The QDA for the GMM example does not improve with increasing B because the quadratic classifier cannot separate F_θ and G in a mixed distribution with three modes, hence breaking the assumption of Theorem 3.4.) All nine probabilistic classifiers yield valid 90% confidence regions according to our diagnostics; see Table 3.

For each setting with fixed B, the best classifier according to cross entropy loss achieves the highest power and the smallest confidence set.
Moreover, as B increases, the best values (marked in bold face) get closer to those of the true likelihood (marked in red). The cross-entropy loss is easy to compute in practice. Our results indicate that minimizing the cross-entropy loss is a good rule of thumb for achieving ACORE inference results with desirable statistical properties. (The reported runtimes refer to an 8-core Intel Xeon 3.33GHz X5680 CPU. In traditional settings, high power has been shown to lead to a small expected interval size under certain distributional assumptions; Pratt, 1961; Ghosh, 1961.)

Next we illustrate our goodness-of-fit procedure (Section 3.3) for checking the coverage of the constructed confidence sets across the parameter space Θ. To pass our goodness-of-fit test, we require the nominal coverage to be within two standard deviations of the estimated coverage for all parameter values. Figure 3 shows the estimated coverage with logistic regression for the Poisson example with B = 1000 (the training sample size for estimating odds via QDA) and three different values of B′ (the training sample size for estimating the critical value C via gradient boosted quantile regression). As expected (Theorem 3.3), the estimated coverage gets closer to the nominal confidence level as B′ increases. We can use these diagnostic plots to choose B′. For instance, here B′ = 500 is large enough for ACORE to achieve good coverage. (See Supplementary Material D for a detailed analysis of this example.)

Figure 3. Estimated coverage as a function of θ in the Poisson example for ACORE with different values of B′ (top to bottom: B′ = 100, average coverage 0.85; B′ = 500, average coverage 0.88; B′ = 1000, average coverage 0.89). The mean and one standard deviation prediction intervals are estimated via logistic regression. Our diagnostics show that B′ = 500 is large enough to achieve the nominal confidence level 1 − α = 0.90. (We here use n = 10, a QDA classifier with B = 1000 and gradient boosted quantile regression.)

Figure 4. Signal detection HEP example. Left: 90% confidence sets computed with the exact likelihood ratio statistic. Estimating critical values can, however, be challenging, as highlighted by the differences in the results for two different quantile regression (QR) algorithms and sample sizes: Random Forest QR at B′ = 1000 (green dotted) versus Deep QR at B′ = 25000 (blue dashed). Our goodness-of-fit procedure can be used to select the best method in a principled way. (The red contour shows the exact LR confidence set, and the red star is at the true parameter setting.) Center: confidence sets when using ACORE to estimate both odds ratios and critical values. This is the LFI setting. Our proposed strategy for choosing ACORE components selects a 5-layer deep neural network with B = 100000; this yields a confidence set (dashed blue) close to the exact LR set (solid red). Increasing B does not show a noticeable improvement (dash-dotted purple), whereas decreasing B makes estimates worse (dotted green). Right: Heat map of the estimated coverage over (ν, b) space for a confidence set that did not pass our goodness-of-fit diagnostic. The overall coverage of the confidence set matches the nominal confidence level, but the set clearly undercovers in low-signal and high-background regions.
Supplementary Materials G and H include a comparison between ACORE and Monte Carlo Gaussian process (MC GP) interpolation (Frate et al., 2017) and calibrated neural network classifiers (CARL; Cranmer et al., 2015), respectively. Our results show that MC-based GP interpolation provides a better approximation of the likelihood ratio when the simulated data are approximately Gaussian (as in the Poisson example). However, when the parametric assumptions are not valid (as in the GMM example), MC-based GP fails to approximate the likelihood ratio regardless of the number of available simulations. For both examples, CARL leads to lower power and larger confidence intervals than ACORE. See Tables 4 and 5 for details.

5. Signal Detection in High Energy Physics

In order to apply ACORE, we need to choose four key components: (i) a probabilistic classifier, (ii) a training sample size B for learning odds ratios, (iii) a quantile regression algorithm, and (iv) a training sample size B′ for estimating critical values. We propose the following practical strategy to choose these components:

1. Use the cross-entropy loss to select the classifier and B (as seen in Section 4, a small cross-entropy loss corresponds to higher power and a smaller confidence set);
2. Then use our goodness-of-fit procedure (Section 3.3) to select the quantile regression method and B′.

We illustrate ACORE and this strategy on a model described in Rolke et al. (2005) and Sen et al. (2009) for a high energy physics (HEP) experiment. In this model, particle collision events are counted under the presence of a background process b. The goal is to assess the intensity ν of a signal (i.e., an event which is not part of the background process). The observed data D consist of n = 10 realizations of X = (N, M), where N ∼ Poisson(b + ν) is the number of events in the signal region, and M ∼ Poisson(b) is the number of events in the background (control) region.
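A minimal sketch of this forward model (the parameter values passed in the usage line are illustrative, not the true settings used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_hep(nu, b, n=10, rng=rng):
    """Draw n realizations of X = (N, M): N ~ Poisson(b + nu) counts in the
    signal region, M ~ Poisson(b) counts in the background (control) region."""
    N = rng.poisson(b + nu, size=n)
    M = rng.poisson(b, size=n)
    return np.column_stack([N, M])

D = simulate_hep(nu=5.0, b=10.0)  # one observed dataset at illustrative (nu, b)
```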
(We use a uniform proposal distribution r_Θ and a Gaussian reference distribution G.) This model is a simplified version of a real particle physics experiment where the true likelihood function is not known.

Figure 4 illustrates the role of B, B′, and our goodness-of-fit procedure when estimating confidence sets. (For details, see Supplementary Material E.) In the left panel, we use the true LR statistic to show that, even if the LR is available, estimating the critical value C well still matters. Our goodness-of-fit diagnostic provides a principled way of choosing the best quantile regression (QR) method and the best sample size B′ for estimating C. In this example, random forest QR does not pass our goodness-of-fit test; it also leads to a confidence region quite different from the exact one. Deep QR, which passes our test, gives a more accurate region estimate. In the center panel, we use ACORE to estimate both the odds ratio and the critical value C (this is the LFI setting). If we choose B by identifying when the cross-entropy loss levels off, we would choose B = 100000. Decreasing B leads to a worse cross-entropy loss and, as the figure shows, also a larger confidence region. Increasing B beyond our selected sample size does not lead to substantial gains. The right panel illustrates how our goodness-of-fit procedure can be used to identify regions in parameter space where a constructed confidence set is not valid. The heat map refers to an example which did not pass our goodness-of-fit procedure. While the overall (marginal) coverage is at the right value, our diagnostic procedure (for estimating coverage as a function of ν and b) is able to identify undercoverage in low-signal and high-background regions.
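As a sketch of the left panel's computation, the code below evaluates the exact likelihood ratio statistic for this model over a coarse (ν, b) grid and inverts the test. The grid, the "true" parameters, and the flat cutoff are illustrative choices only; ACORE instead estimates a parameter-dependent cutoff via quantile regression.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

def loglik(D, nu, b):
    """Exact log-likelihood of the signal model:
    D[:, 0] = N ~ Poisson(b + nu), D[:, 1] = M ~ Poisson(b)."""
    return poisson.logpmf(D[:, 0], b + nu).sum() + poisson.logpmf(D[:, 1], b).sum()

def tau_exact(D, nu0, b0, grid):
    """Exact LR statistic for the simple null (nu0, b0) vs. the grid maximum."""
    return loglik(D, nu0, b0) - max(loglik(D, nu, b) for nu, b in grid)

# Invert the test over a coarse grid (flat, illustrative cutoff of -3.0).
grid = [(nu, b) for nu in np.linspace(0.0, 20.0, 21)
                for b in np.linspace(5.0, 15.0, 11)]
D = np.column_stack([rng.poisson(10.0 + 5.0, 10), rng.poisson(10.0, 10)])
taus = {pt: tau_exact(D, *pt, grid) for pt in grid}
conf_set = [pt for pt, t in taus.items() if t > -3.0]
```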
That is, for a valid confidence set, one needs to better estimate the critical value C by, e.g., using a different quantile regression estimator or by increasing B′ (either uniformly over the parameter space or by an active learning scheme which increases the number of simulations at parameter settings where one undercovers).

6. Conclusions

In this paper we introduce ACORE, a framework for carrying out frequentist inference in LFI settings. ACORE is well suited for settings with costly simulations, as it efficiently estimates test statistics and critical values across the entire parameter space. We provide a new goodness-of-fit procedure for estimating coverage of constructed confidence sets for all possible parameter settings. Even if the likelihood ratio is not well estimated, ACORE provides valid inference as long as hypothesis tests and confidence sets pass our goodness-of-fit procedure (albeit at the cost of having less power and larger sets). We provide practical guidance on how to choose the smallest number of simulations that guarantees powerful and valid procedures.

Future studies will investigate the effect of G and r_Θ on performance, as well as how ACORE scales with increasing (a) feature space dimension and (b) parameter space dimension. Because we utilize ML methods to efficiently estimate odds ratios and critical values (Algorithms 3 and 1), performance in (a) will depend on the convergence rates of the chosen probabilistic classifier and quantile regression method. For (b), scaling relies on having an efficient search algorithm; this search is challenging for all likelihood-based methods. Common solutions include gradient-free optimization methods, such as Nelder-Mead (Nelder & Mead, 1965) and Bayesian optimization (Snoek et al., 2012), and approximation techniques, such as profile likelihoods (Murphy & Vaart, 2000) and hybrid resampling (Chuang & Lai, 2000; Sen et al., 2009). Such approaches can potentially be integrated into ACORE.
The ACORE framework can also be adapted to accommodate test statistics such as the Bayes factor (Kass & Raftery, 1995). In addition to Bayes factors, we will investigate choosing the number of simulations B via sequential testing and likelihood goodness-of-fit tests such as Dalmasso et al. (2020). We will also consider extending the ACORE framework to include other statistical quantities in the likelihood ratio estimation process by, for example, adapting the regression on likelihood ratio (ROLR) score in Brehmer et al. (2020b). Finally, we will include a theoretical study of how the power of the ACORE test relates to classifier performance.

Acknowledgments

We thank the anonymous reviewers for their thoughtful comments and suggestions. ND is grateful to Tudor Manole, Alan Mishler, Aleksandr Podkopaev and the STAMPS research group for insightful discussions. RI is grateful for the financial support of FAPESP (2019/11321-9) and CNPq (306943/2017-4). The early stages of this research were supported in part by the National Science Foundation under DMS-1520786.

References

Aad, G., Abajyan, T., Abbott, B., Abdallah, J., Abdel Khalek, S., Abdelalim, A., Abdinov, O., Aben, R., Abi, B., Abolins, M., et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012. doi: 10.1016/j.physletb.2012.08.020.

Alsing, J., Charnock, T., Feeney, S., and Wandelt, B. Fast likelihood-free cosmology with neural density estimators and active learning. Monthly Notices of the Royal Astronomical Society, 488(3):4440–4458, 2019. doi: 10.1093/mnras/stz1960.

Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. Parameterized neural networks for high-energy physics. The European Physical Journal C, 76(5):235, 2016. doi: 10.1140/epjc/s10052-016-4099-4.
Barlow, R. and Beeston, C. Fitting using finite Monte Carlo samples. Computer Physics Communications, 77(2):219–228, 1993. doi: 10.1016/0010-4655(93)90005-W.

Beaumont, M. A., Zhang, W., and Balding, D. J. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.

Bordoloi, R., Lilly, S. J., and Amara, A. Photo-z performance for precision cosmology. Monthly Notices of the Royal Astronomical Society, 406(2):881–895, 2010.

Brehmer, J., Kling, F., Espejo, I., and Cranmer, K. MadMiner, 2019. URL https://github.com/diana-hep/madminer.

Brehmer, J., Kling, F., Espejo, I., and Cranmer, K. MadMiner: Machine learning-based inference for particle physics. Comput. Softw. Big Sci., 4(1):3, 2020a. doi: 10.1007/s41781-020-0035-2.

Brehmer, J., Louppe, G., Pavez, J., and Cranmer, K. Mining gold from implicit models to improve likelihood-free inference. Proc. Nat. Acad. Sci., 117(10):5242–5249, 2020b. doi: 10.1073/pnas.1915980117.

Chen, Y. and Gutmann, M. U. Adaptive Gaussian copula ABC. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89, pp. 1584–1592. PMLR, 2019. URL http://proceedings.mlr.press/v89/chen19d.html.

Chuang, C.-S. and Lai, T. L. Hybrid resampling methods for confidence intervals. Statistica Sinica, 10(1):1–33, 2000.

Claeskens, G., Van Keilegom, I., et al. Bootstrap confidence bands for regression curves and their derivatives. The Annals of Statistics, 31(6):1852–1884, 2003.

Cook, S. R., Gelman, A., and Rubin, D. B. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3):675–692, 2006. doi: 10.1198/106186006X136976.

Cousins, R. D.
Lectures on Statistics in Theory: Prelude to Statistics in Practice. arXiv e-prints, arXiv:1807.05996, 2018.

Cranmer, K. Practical Statistics for the LHC. arXiv e-prints, arXiv:1503.07622, 2015.

Cranmer, K., Pavez, J., and Louppe, G. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169, 2015.

Cranmer, K., Brehmer, J., and Louppe, G. The frontier of simulation-based inference. arXiv e-prints, arXiv:1911.01429, 2019.

Dalmasso, N., Lee, A., Izbicki, R., Pospisil, T., Kim, I., and Lin, C.-A. Validation of approximate likelihood and emulator models for computationally intensive simulations. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3349–3361. PMLR, 2020. URL http://proceedings.mlr.press/v108/dalmasso20a.html.

Diggle, P. J. and Gratton, R. J. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, Series B (Methodological), 46(2):193–227, 1984.

Dinev, T. and Gutmann, M. U. Dynamic Likelihood-free Inference via Ratio Estimation (DIRE). arXiv e-prints, arXiv:1810.09899, 2018.

Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification. Wiley, New York, 2nd edition, 2001.

Eubank, R. L. and Speckman, P. L. Confidence bands in nonparametric regression. Journal of the American Statistical Association, 88(424):1287–1301, 1993.

Feldman, G. J. and Cousins, R. D. Unified approach to the classical statistical analysis of small signals. Physical Review D, 57(7):3873–3889, 1998. doi: 10.1103/PhysRevD.57.3873.

Frate, M., Cranmer, K., Kalia, S., Vandenberg-Rodes, A., and Whiteson, D. Modeling Smooth Backgrounds and Generic Localized Signals with Gaussian Processes.
arXiv e-prints, arXiv:1709.05681, 2017.

Ghosh, J. K. On the relation among shortest confidence intervals of different types. Calcutta Statistical Association Bulletin, 10(4):147–152, 1961. doi: 10.1177/0008068319610404.

Gonçalves, P. J., Lueckmann, J.-M., Deistler, M., Nonnenmacher, M., Öcal, K., Bassetto, G., Chintaluri, C., Podlaski, W. F., Haddad, S. A., Vogels, T. P., Greenberg, D. S., and Macke, J. H. Training deep neural density estimators to identify mechanistic models of neural dynamics. bioRxiv, 2019. doi: 10.1101/838383.

Greenberg, D., Nonnenmacher, M., and Macke, J. Automatic posterior transformation for likelihood-free inference. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2404–2414. PMLR, 2019. URL http://proceedings.mlr.press/v97/greenberg19a.html.

Guest, D., Cranmer, K., and Whiteson, D. Deep Learning and Its Application to LHC Physics. Annual Review of Nuclear and Particle Science, 68(1):161–181, 2018. doi: 10.1146/annurev-nucl-101917-021019.

Gutmann, M. U., Dutta, R., Kaski, S., and Corander, J. Likelihood-free inference via classification. Statistics and Computing, 28(2):411–425, 2018. doi: 10.1007/s11222-017-9738-6.

Hermans, J., Begy, V., and Louppe, G. Likelihood-free MCMC with amortized approximate likelihood ratios, 2019.

Izbicki, R., Lee, A., and Schafer, C. High-dimensional density ratio estimation with extensions to approximate likelihood computation. In Artificial Intelligence and Statistics, pp. 420–429, 2014.

Izbicki, R., Lee, A. B., and Pospisil, T. ABC-CDE: Toward approximate Bayesian computation with complex high-dimensional data and limited simulations. Journal of Computational and Graphical Statistics, pp.
1–20, 2019. doi: 10.1080/10618600.2018.1546594.

Kass, R. E. and Raftery, A. E. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995. doi: 10.1080/01621459.1995.10476572.

Krivobokova, T., Kneib, T., and Claeskens, G. Simultaneous confidence bands for penalized spline estimators. Journal of the American Statistical Association, 105(490):852–863, 2010.

Leclercq, F. Bayesian optimization for likelihood-free cosmological inference. Physical Review D, 98(6):063511, 2018. doi: 10.1103/PhysRevD.98.063511.

Lueckmann, J.-M., Bassetto, G., Karaletsos, T., and Macke, J. H. Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 32–53, 2019.

MacKay, D. J. C. Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA, 2002.

Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.

Marin, J.-M., Raynal, L., Pudlo, P., Ribatet, M., and Robert, C. ABC random forests for Bayesian parameter inference. Bioinformatics (Oxford, England), 35, 2016. doi: 10.1093/bioinformatics/bty867.

Meinshausen, N. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.

Mohamed, S. and Lakshminarayanan, B. Learning in Implicit Generative Models. arXiv e-prints, arXiv:1610.03483, 2016.

Murphy, S. A. and van der Vaart, A. W. On profile likelihood. Journal of the American Statistical Association, 95(450):449–465, 2000. doi: 10.1080/01621459.2000.10474219.

Nelder, J. A. and Mead, R. A Simplex Method for Function Minimization. The Computer Journal, 7(4):308–313, 1965. doi: 10.1093/comjnl/7.4.308.

Neyman, J. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London.
Series A, Mathematical and Physical Sciences, 236(767):333–380, 1937.

Ong, V. M. H., Nott, D. J., Tran, M.-N., Sisson, S. A., and Drovandi, C. C. Variational Bayes with synthetic likelihood. Statistics and Computing, 28(4):971–988, 2018. doi: 10.1007/s11222-017-9773-3.

Papamakarios, G., Sterratt, D., and Murray, I. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 837–848, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Pratt, J. W. Length of confidence intervals. Journal of the American Statistical Association, 56(295):549–567, 1961. doi: 10.1080/01621459.1961.10480644.

Price, L. F., Drovandi, C. C., Lee, A., and Nott, D. J. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11, 2018. doi: 10.1080/10618600.2017.1302882.

Rolke, W. A., López, A. M., and Conrad, J. Limits and confidence intervals in the presence of nuisance parameters.
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 551(2):493–503, 2005. doi: 10.1016/j.nima.2005.05.068.

Schafer, C. M. and Stark, P. B. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487):1080–1089, 2009. doi: 10.1198/jasa.2009.tm07420.

Schmidt, S., Malz, A., et al. Evaluation of probabilistic photometric redshift estimation approaches for LSST. (In preparation) Under internal review by LSST-DESC, 2019.

Sen, B., Walker, M., and Woodroofe, M. On the unified method with nuisance parameters. Statistica Sinica, 19(1):301–314, 2009.

Sisson, S. A., Fan, Y., and Beaumont, M. Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC, 2018.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959. Curran Associates, Inc., 2012.

Talts, S., Betancourt, M., Simpson, D., Vehtari, A., and Gelman, A. Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788, 2018.

Thomas, O., Dutta, R., Corander, J., Kaski, S., and Gutmann, M. U. Likelihood-free inference by ratio estimation. arXiv e-prints, arXiv:1611.10242, 2016.

Thornton, S., Li, W., and Xie, M.-g. An effective likelihood-free approximate computing method with statistical inferential guarantees, 2017.

Weinzierl, S. Introduction to Monte Carlo methods, 2000.

Supplementary Material: Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting

A.
Algorithm for Simulating Labeled Sample for Estimating Odds Ratios

Algorithm 3 provides details on how to create the training sample T_B for estimating odds ratios. Out of the total training sample size B, a proportion p is generated by the stochastic forward simulator F_θ at different parameter values θ, while the remainder is sampled from a reference distribution G.

Algorithm 3: Generate a labeled sample of size B for estimating odds ratios
Require: stochastic forward simulator F_θ; reference distribution G; proposal distribution r_Θ over parameter space; training sample size B; parameter p of Bernoulli distribution
Ensure: labeled training sample
1: Draw parameter values θ_1, ..., θ_B iid ~ r_Θ
2: Assign labels Y_1, ..., Y_B iid ~ Ber(p)
3: for i = 1, ..., B do
4:   if Y_i = 1, draw sample from forward simulator, X_i ~ F_{θ_i}
5:   if Y_i = 0, draw sample from reference distribution, X_i ~ G
6: end for
7: return T_B = {(θ_i, X_i, Y_i)}_{i=1}^{B}

B. Algorithm for Constructing Confidence Set for θ

Algorithm 4 summarizes the ACORE procedure for constructing confidence sets. First, we estimate parametrized odds to compute the ACORE test statistic τ (Eq. 4). Then, we compute a parametrized estimate of the critical values as a conditional distribution function of τ. Finally, we compute a confidence set for θ by the Neyman inversion technique (Neyman, 1937).

C. Proofs

Proof of Proposition 3.1. If $\hat{P}(Y=1\mid\theta,x) = P(Y=1\mid\theta,x)$, then $\widehat{\mathrm{OR}}(x;\theta_0,\theta_1) = \mathrm{OR}(x;\theta_0,\theta_1)$. By Bayes' rule and construction (Algorithm 3),

$$O(x;\theta) := \frac{P(Y=1\mid\theta,x)}{P(Y=0\mid\theta,x)} = \frac{f(x\mid\theta)\,p}{g(x)\,(1-p)}.$$
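Algorithm 3 can be sketched in code for a one-dimensional toy simulator; the Poisson(θ) simulator, the Uniform(0, 20) proposal, and the Poisson(10) reference G below are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

B, p = 1000, 0.5
thetas = rng.uniform(0.0, 20.0, B)   # theta_1..theta_B iid ~ r_Theta (assumed)
labels = rng.binomial(1, p, B)       # Y_i iid ~ Ber(p)
x_sim = rng.poisson(thetas)          # X_i ~ F_theta (toy Poisson simulator)
x_ref = rng.poisson(10.0, B)         # X_i ~ G (assumed reference distribution)
# For compactness, draw from both sources and keep the one matching Y_i:
x = np.where(labels == 1, x_sim, x_ref)
T_B = np.column_stack([thetas, x, labels])  # T_B = {(theta_i, X_i, Y_i)}
```

Rows with Y = 1 come from the simulator at their own θ_i, rows with Y = 0 from the reference; a probabilistic classifier trained on (θ, X) → Y then yields the estimated odds.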
Algorithm 4: Construct confidence set for θ with coefficient γ = 1 − α
Require: stochastic forward simulator F_θ; reference distribution G; proposal distribution r_Θ over Θ; parameter p of Bernoulli distribution; sample size B (for estimating odds ratios); sample size B′ (for estimating critical values); probabilistic classifier; observed data D = {x_1^obs, ..., x_n^obs}; α ∈ (0, 1)
Ensure: θ-values in confidence set
1: // Estimate odds ratios:
2: Generate labeled sample T_B according to Algorithm 3
3: Apply probabilistic classifier to T_B to learn class posterior probabilities P̂(Y = 1 | θ, X) for all θ ∈ Θ and X ∈ X
4: Let the estimated odds Ô(θ, X) = P̂(Y = 1 | θ, X) / P̂(Y = 0 | θ, X)
5: Let the estimated odds ratios ÔR(X; θ_0, θ_1) = Ô(θ_0, X) / Ô(θ_1, X), for all θ_0, θ_1 ∈ Θ and X ∈ X
6: // Estimate the critical value for a test δ_{θ0} that rejects θ = θ_0 at significance level α:
7: Construct parametrized function Ĉ_{θ0} := F̂⁻¹_{τ|θ0}(α | θ_0) for θ_0 ∈ Θ according to Algorithm 2
8: // Find parameter set for which the test δ_{θ0} does not reject θ = θ_0:
9: ThetaGrid ← grid of parameter values in Θ
10: n_grid ← length(ThetaGrid)
11: Set S ← ∅
12: for Theta_0 ∈ ThetaGrid do
13:   Cutoff ← F̂⁻¹_{τ|θ0}(α | Theta_0)
14:   sumLogOR ← array with length n_grid
15:   for j = 1, ..., n_grid do
16:     Theta_1 ← ThetaGrid[j]
17:     sumLogOR[j] ← Σ_{k=1}^{n} log ÔR(x_k^obs; Theta_0, Theta_1)
18:   end for
19:   TauObs ← min(sumLogOR)
20:   if TauObs > Cutoff then S ← S ∪ {Theta_0}
21: end for
22: return S

Thus, the odds ratio at θ_0, θ_1 ∈ Θ is given by

$$\mathrm{OR}(x;\theta_0,\theta_1) = \frac{f(x\mid\theta_0)}{f(x\mid\theta_1)},$$

and therefore

$$\tau(\mathcal{D};\Theta_0) = \sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\widehat{\mathrm{OR}}(x^{\mathrm{obs}}_i;\theta_0,\theta_1) = \sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\frac{f(x^{\mathrm{obs}}_i\mid\theta_0)}{f(x^{\mathrm{obs}}_i\mid\theta_1)} = \sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\log\frac{\mathcal{L}(\mathcal{D};\theta_0)}{\mathcal{L}(\mathcal{D};\theta_1)} = \Lambda(\mathcal{D};\Theta_0).$$

Proof of Theorem 3.3. The union bound and Assumption 3.2 imply that

$$\sup_{\theta\in\Theta}\sup_{t\in\mathbb{R}}\big|\hat{F}_{B'}(t\mid\theta) - F(t\mid\theta)\big| \xrightarrow[B'\to\infty]{P} 0.$$

It follows that

$$\sup_{\theta\in\Theta}\big|\hat{F}^{-1}_{B'}(\alpha\mid\theta) - F^{-1}(\alpha\mid\theta)\big| \xrightarrow[B'\to\infty]{P} 0.$$

The result follows from the fact that

$$0 \le |C_{B,B'} - C^*_B| = \Big|\sup_{\theta\in\Theta}\hat{F}^{-1}_{B'}(\alpha\mid\theta) - \sup_{\theta\in\Theta}F^{-1}(\alpha\mid\theta)\Big| \le \sup_{\theta\in\Theta}\big|\hat{F}^{-1}_{B'}(\alpha\mid\theta) - F^{-1}(\alpha\mid\theta)\big|,$$

and thus $|C_{B,B'} - C^*_B| \xrightarrow[B'\to\infty]{P} 0$.

Lemma C.1. If $(\hat{P}(Y=1\mid\theta,X))_{\theta\in\Theta} \xrightarrow[B\to\infty]{P} (P(Y=1\mid\theta,X))_{\theta\in\Theta}$ and $|\Theta| < \infty$, then

$$\tau(\mathcal{D};\Theta_0) \xrightarrow[B\to\infty]{P} \sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\mathrm{OR}(X^{\mathrm{obs}}_i;\theta_0,\theta_1).$$

Proof. For every θ_0, θ_1 ∈ Θ, it follows directly from the properties of convergence in probability that

$$\sum_{i=1}^{n}\log\widehat{\mathrm{OR}}(X^{\mathrm{obs}}_i;\theta_0,\theta_1) \xrightarrow[B\to\infty]{P} \sum_{i=1}^{n}\log\mathrm{OR}(X^{\mathrm{obs}}_i;\theta_0,\theta_1).$$

The conclusion of the lemma follows from the continuous mapping theorem.

Proof of Theorem 3.4. Lemma C.1 implies that $\tau_B(\mathcal{D};\Theta_0)$ converges in distribution to $\sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\mathrm{OR}(X^{\mathrm{obs}}_i;\theta_0,\theta_1)$.
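Proposition 3.1 can be probed numerically: train a probabilistic classifier on an Algorithm-3 style sample and compare the resulting odds ratio with the true likelihood ratio. Everything below (the Poisson(θ) simulator, the Poisson(10) reference, the QDA classifier, and the evaluation points) is an illustrative assumption, and the QDA odds are only an approximation here since the true log-odds of this model are not quadratic in (θ, x).

```python
import numpy as np
from scipy.stats import poisson
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(7)

# Algorithm-3 style labeled sample (p = 1/2, so the odds need no extra factor).
B = 200_000
theta = rng.uniform(0.0, 20.0, B)
y = rng.binomial(1, 0.5, B)
x = np.where(y == 1, rng.poisson(theta), rng.poisson(10.0, B))
clf = QuadraticDiscriminantAnalysis().fit(np.column_stack([theta, x]), y)

def odds_hat(th, xx):
    """Estimated odds O-hat(theta, x) from the classifier's class posterior."""
    p1 = np.clip(clf.predict_proba([[th, xx]])[0, 1], 1e-6, 1 - 1e-6)
    return p1 / (1.0 - p1)

# Estimated odds ratio vs. the true likelihood ratio at x = 8:
or_hat = odds_hat(7.0, 8) / odds_hat(12.0, 8)
or_true = np.exp(poisson.logpmf(8, 7.0) - poisson.logpmf(8, 12.0))
```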
Now, from Slutsky's theorem,

$$\tau_B(\mathcal{D};\Theta_0) - \hat{C}_B \xrightarrow[B\to\infty]{Dist} \sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\mathrm{OR}(X^{\mathrm{obs}}_i;\theta_0,\theta_1) - C^*.$$

It follows that

$$P\big(\hat{\phi}_{B,\hat{C}_B}(\mathcal{D}) = 1 \mid \theta\big) = P\big(\tau_B(\mathcal{D};\Theta_0) - \hat{C}_B \le 0 \mid \theta\big) \xrightarrow[B\to\infty]{} P\Big(\sup_{\theta_0\in\Theta_0}\inf_{\theta_1\in\Theta}\sum_{i=1}^{n}\log\mathrm{OR}(X^{\mathrm{obs}}_i;\theta_0,\theta_1) - C^* \le 0 \,\Big|\, \theta\Big) = P\big(\phi_{C^*}(\mathcal{D}) = 1 \mid \theta\big),$$

where the last equality follows from Proposition 3.1.

D. Toy Examples

This section provides details on the toy examples of Section 4. We use the sklearn ecosystem (Pedregosa et al., 2011) implementation of the following probabilistic classifiers:

• multi-layer perceptron (MLP) with default parameters, but no L2 regularization (α = 0);
• quadratic discriminant analysis (QDA) with default parameters;
• nearest neighbors (NN) classifier, with the number of neighbors equal to the rounded square root of the number of data points available (as per Duda et al. (2001)).

Table 3 reports the observed coverage for the settings of Tables 1 and 2. Critical values C are estimated with quantile gradient boosted trees, a training sample size B′ = 5000, observed data D of sample size n = 10, a nominal coverage of 90%, and averaging over 100 repetitions. The table shows that we in all cases achieve results in line with the nominal confidence level.

Our goodness-of-fit procedure shown in Figure 3 uses a set T″ of size B″ = 250 (as defined in Section 3.3); Figure 5 shows the goodness-of-fit plot for the Gaussian mixture model example, where the coverage is estimated via logistic regression and the critical values are estimated via quantile gradient boosted trees. (The two-standard-deviation interval of a binomial proportion with p = 0.9 over 100 repetitions includes the observed coverages listed in Table 3.) For the Poisson example
a training sample size of B′ = 500 seems to be enough to achieve correct coverage, whereas the Gaussian mixture model example requires B′ = 1000.

Table 3. Observed coverage of the toy examples in Tables 1 and 2. These values are consistent with what we would expect for 100 trials with a nominal confidence level of 90%; see text.

B     Classifier  Poisson coverage  GMM coverage
100   MLP         0.91              0.87
100   NN          0.91              0.91
100   QDA         0.90              0.88
500   MLP         0.91              0.91
500   NN          0.93              0.95
500   QDA         0.94              0.92
1000  MLP         0.91              0.92
1000  NN          0.89              0.88
1000  QDA         0.91              0.93

Next we compare our goodness-of-fit diagnostic with diagnostics obtained via standard Monte Carlo sampling. Figure 6 shows the MC coverage as a function of θ for the Poisson example (left) and the Gaussian mixture model example (right). In both cases, 100 MC samples are drawn at each of 100 parameter values chosen uniformly, and the empirical ACORE coverage is computed over the MC samples at each chosen θ. This MC procedure is expensive: it requires 100 × 100 = 10,000 Monte Carlo draws, far more than the B″ = 250 simulations used in our goodness-of-fit procedure. The observed coverage of the Poisson example (Figure 6, left) indicates that B′ = 500 is sufficient to achieve the nominal coverage of 90%. For the Gaussian mixture model example (Figure 6, right), we detect undercoverage for very small values of θ. This discrepancy is due to the fact that, at θ = 0, the mixture collapses into a single Gaussian, structurally different from the GMM at any other θ > 0 and closer to the N(0, 1) reference distribution.

Our goodness-of-fit procedure is able to identify that the actual coverage is far from the nominal coverage at small values of θ when the training sample size B′ for estimating C is too small. More specifically, Figure 5 shows a noticeable tilt in the prediction bands for B′ = 100 and 500.
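The brute-force Monte Carlo check can be sketched for a toy Poisson(θ) model with an exact-LR confidence interval. The Wilks chi-square cutoff, the θ-grid, and the repetition count are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)

def lr_covers(theta_true, n=10, chi2_90=2.7055, rng=rng):
    """One MC repetition: does the exact-LR interval for a toy Poisson(theta)
    model cover theta_true? Uses the Wilks chi-square(1) cutoff (assumed)."""
    D = rng.poisson(theta_true, n)
    theta_grid = np.linspace(0.5, 20.0, 200)
    ll = poisson.logpmf(D[None, :], theta_grid[:, None]).sum(axis=1)
    tau = ll - ll.max()                       # log-LR against the grid MLE
    accepted = theta_grid[tau > -0.5 * chi2_90]
    return accepted.min() <= theta_true <= accepted.max()

# Empirical coverage at one theta; repeating this over a theta-grid reproduces
# the brute-force diagnostic of Figure 6, at a far higher simulation cost.
covered = np.mean([lr_covers(9.0) for _ in range(500)])
# `covered` should be near the nominal 0.90, up to MC noise.
```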
However, as B′ increases, the estimation of critical values becomes more precise, and the estimated confidence intervals pass our goodness-of-fit diagnostic at, for example, B′ = 1000. Future studies will provide a more detailed account of how such boundary effects depend on the method for estimating the coverage.

Figure 5. Estimated coverage as a function of θ in the Gaussian mixture model example for ACORE with different values of B′ (panels: B′ = 100, average coverage 0.87; B′ = 500, average coverage 0.86; B′ = 1000, average coverage 0.90). Logistic regression is used to estimate mean coverage and one-standard-deviation prediction bands. (We here use n = 10, an MLP classifier with B = 1000, and quantile gradient boosted trees.)

E. Signal Detection in High Energy Physics

Here we consider the signal detection example in Section 5. We describe the details of the construction of ACORE confidence sets, which used the strategy in Section 5 to choose ACORE components and parameters. For learning the odds ratio, we compared the following classifiers:

• logistic regression;
• quadratic discriminant analysis (QDA) classifier;
• nearest neighbor classifier;
• gradient boosted trees with varying numbers of trees and maximum depths;
• Gaussian process classifiers with radial basis function kernels of varying variance (used only with smaller sample sizes B, as the matrix inversion quickly becomes computationally infeasible for larger values of B);
• feed-forward deep neural networks with varying numbers of layers and neurons, and either ReLU or hyperbolic tangent activations.

For estimating the critical values, we considered the following quantile regression algorithms:

• gradient boosted trees with varying numbers of trees and maximum depths;
• random forest quantile regression with varying numbers of trees;
• deep quantile regression with varying numbers of layers and neurons and ReLU activations (using the PyTorch implementation; Paszke et al., 2019).

All computations were performed on 8-core Intel Xeon X5680 CPUs at 3.33 GHz.

Figure 6. Observed ACORE coverage across the parameter space for the Poisson example (left) and the Gaussian mixture model example (right); panels show B = 100, 500, 1000 with average coverage 0.85, 0.88, 0.89 (Poisson) and 0.86, 0.88, 0.90 (GMM). The coverage is computed with Monte Carlo samples of size 100, each sampled at a θ chosen uniformly over the parameter space. Odds ratios are computed with a QDA classifier for the Poisson example and an MLP classifier for the GMM example (as in Figures 3 and 5). We observe undercoverage at small θ for the GMM (right) due to the mixture collapsing into a single Gaussian as θ → 0.

Figure 7 illustrates the two steps in identifying the four components of ACORE.
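Steps (iii)-(iv) can be sketched with sklearn's gradient boosted quantile regression on a toy statistic; the Normal(−θ, 1) stand-in for the conditional distribution of τ given θ is an assumption purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)

# Training sample for the critical value: (theta_i, tau_i) pairs, where the
# distribution of tau given theta is a toy Normal(-theta, 1) stand-in.
alpha = 0.1
B_prime = 5000
theta = rng.uniform(0.0, 10.0, B_prime)
tau = rng.normal(-theta, 1.0)

# Gradient boosted quantile regression of the alpha-quantile of tau on theta,
# i.e. an estimate of the critical value C(theta) = F^{-1}_{tau|theta}(alpha).
qr = GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=200)
qr.fit(theta.reshape(-1, 1), tau)

grid = np.linspace(1.0, 9.0, 9).reshape(-1, 1)
C_hat = qr.predict(grid)  # should roughly track -theta - 1.28 for this toy model
```

The same fit-predict pattern applies with random forest or deep quantile regression; the goodness-of-fit diagnostic then decides which estimate of C to trust.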
We first use a validation set of simulations to determine which probabilistic classifier and training sample size B minimize the cross-entropy loss. Figure 7 (left) shows the cross-entropy loss of the best four classifiers as a function of B. The minimum is achieved by a 5-layer deep neural network (DNN) at B = 100,000, closely followed by QDA at B = 50,000. Given how similar the loss values are, we select both classifiers to follow up on. In Figure 7 (right), the "estimated correct coverage" represents the proportion of the parameter space that passes our diagnostic procedure. The lowest B′ with correct coverage is achieved by the five-layer DNN classifier (for estimating odds ratios) at B′ = 25,000, with critical values estimated via a two-layer deep quantile regression algorithm. None of the quantile regression algorithms pass a diagnostic test with a nominal coverage of 90% at the one-standard-deviation level when using the QDA classifier. We therefore do not use QDA in Section 5.

Based on the analysis above, we choose the following ACORE components: (i) a five-layer DNN for learning odds ratios, (ii) B = 100,000, (iii) a two-layer deep quantile regression for estimating critical values, and (iv) B′ = 25,000. Figure 4 shows the confidence sets computed with this choice.

F. Cross Entropy Loss Analysis

In this work, we use the cross-entropy loss to measure the accuracy of the probabilistic predictions of the classifier. That is, we calibrate the estimated odds function g(θ, x) := P̂(Y = 1 | θ, x) / P̂(Y = 0 | θ, x) as follows. Consider a sample point {θ, x, y} generated according to Algorithm 3. Let p be a Ber(y) distribution, and let q be a Ber(P̂(Y = 1 | θ, x)) = Ber(g(θ, x) / (1 + g(θ, x))) distribution.
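Step 1 of this selection can be sketched by comparing held-out cross-entropy (log) loss across candidate classifiers on a toy labeled sample. The Poisson(θ) simulator, the Poisson(10) reference, and the two candidate models are illustrative assumptions, not the paper's actual candidates.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(6)

def labeled_sample(B, rng):
    """Algorithm-3 style sample for a toy Poisson(theta) simulator with a
    Poisson(10) reference distribution (both assumed for illustration)."""
    theta = rng.uniform(0.0, 20.0, B)
    y = rng.binomial(1, 0.5, B)
    x = np.where(y == 1, rng.poisson(theta), rng.poisson(10.0, B))
    return np.column_stack([theta, x]), y

X_tr, y_tr = labeled_sample(5000, rng)
X_val, y_val = labeled_sample(2000, rng)

# Pick the classifier with the smallest held-out cross-entropy loss.
models = {"logistic": LogisticRegression(max_iter=1000),
          "qda": QuadraticDiscriminantAnalysis()}
losses = {name: log_loss(y_val, m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1])
          for name, m in models.items()}
best = min(losses, key=losses.get)
```

In practice, the same loop would also sweep the training sample size B and plot the loss curves, as in Figure 7 (left).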
The cross entropy between p and q is given by

L_CE(g; {θ, x, y}) = −y log( g(θ, x) / (1 + g(θ, x)) ) − (1 − y) log( 1 / (1 + g(θ, x)) )
                   = −y log(g(θ, x)) + log(1 + g(θ, x)).

For every x and θ, the expected cross entropy E[L_CE(g; {θ, x, Y})] is minimized by g(θ, x) = O(θ, x). Thus we can measure the performance of an estimator g of the odds by the risk R_CE(g) = E[L_CE(g; {θ, X, Y})].

The cross entropy loss is not the only loss function that is minimized by the true odds function, but it is usually easy to compute in practice. It is also well known that minimizing the cross entropy loss between the estimated distribution q and the true distribution p during training is equivalent to minimizing the Kullback-Leibler (KL) divergence between the two distributions, since KL(p || q) = H(p, q) − H(p), where H(p, q) is the cross entropy and H(p) is the entropy of the true distribution. By Gibbs' inequality (MacKay, 2002), KL(p || q) ≥ 0; hence the entropy H(p) of the true distribution lower-bounds the cross entropy, with the minimum achieved when p = q. This connects the cross entropy loss to the ACORE statistic.

Proposition F.1. If the probabilistic classifier in ACORE achieves the minimum of the cross entropy loss, then the constructed ACORE statistic (4) is equal to the likelihood ratio statistic (3).

Proof of Prop F.1. The proof follows from Proposition 3.1, since the expected cross entropy loss is minimized if and only if Ô(θ, x) = O(θ, x).
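The identity above can be checked numerically: for Bernoulli labels with a known class posterior, the expected cross entropy loss is minimized at the true odds, and the minimum equals the entropy of the true distribution. A small self-contained sketch (the function names are ours):

```python
import numpy as np

def cross_entropy_loss(g, y):
    """L_CE(g; y) = -y*log(g/(1+g)) - (1-y)*log(1/(1+g))
                  = -y*log(g) + log(1+g)."""
    return -y * np.log(g) + np.log1p(g)

def expected_loss(g, p):
    """Expectation of the loss over Y ~ Ber(p) for a fixed odds estimate g."""
    return p * cross_entropy_loss(g, 1.0) + (1 - p) * cross_entropy_loss(g, 0.0)

p = 0.7
true_odds = p / (1 - p)  # O(theta, x) when P(Y = 1 | theta, x) = p

# The expected loss is strictly larger for any perturbed odds estimate ...
loss_at_truth = expected_loss(true_odds, p)
loss_too_low = expected_loss(0.5 * true_odds, p)
loss_too_high = expected_loss(2.0 * true_odds, p)

# ... and the minimum equals the entropy H(p) of the true distribution,
# consistent with KL(p||q) = H(p, q) - H(p) >= 0 (Gibbs' inequality).
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)
```

This is the sense in which a lower validation cross entropy indicates a better odds estimate, which motivates using it to select the classifier and B.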
Figure 7. Using the strategy in Section 5 to choose ACORE components for the signal detection example. Left: The cross entropy loss of the best four classifiers, shown as a function of B. In order of increasing loss: 5-layer DNN ([512, 256, 64, 32, 32] neurons, ReLU activations), QDA classifier, 3-layer DNN ([64, 32, 32] neurons, ReLU activations) and gradient boosted trees ( trees with maximum depth 5). Because the first two classifiers (the 5-layer DNN and QDA) achieve a very similar minimum loss, we consider both classifiers in the follow-up step. Right: Proportion of the (ν, b) parameter space where the best two classifiers pass our goodness-of-fit procedure at the nominal coverage. Both the mean value curves and the ± one standard deviation prediction bands are computed via logistic regression. Critical values are estimated via a two-layer deep quantile regression ([64, 64] neurons, ReLU activations), which passed the diagnostic at the lowest sample size (B′ = 25,000, with the 5-layer DNN). Based on the results, we choose the 5-layer DNN with B′ = 25,000.

In addition, we show that the convergence of the class posterior implies the convergence of the cross entropy to the entropy of the true distribution. This supports our decision to use the cross entropy loss when selecting the probabilistic classifier and the sample size B.

Lemma F.2. If for every θ ∈ Θ,

q := P̂(Y = 1 | θ, X) → p := P(Y = 1 | θ, X) in probability as B → ∞,

then H(p, q) → H(p) in probability as B → ∞.

Proof of Lemma F.2.
We can rewrite the cross entropy H(p, q) and the entropy H(p) as

H(p, q) = − Σ_{y ∈ {0,1}} ∫_{X×Θ} p log(q) dP(x, θ),
H(p)    = − Σ_{y ∈ {0,1}} ∫_{X×Θ} p log(p) dP(x, θ).

In addition, for any (X, θ), it also holds that |q| ≤ 1. The lemma follows by combining the dominated convergence theorem with the continuous mapping theorem for the logarithm.

G. Comparison with Monte Carlo Synthetic Likelihood-Based Methods

In this section we compare the performance of ACORE with Monte Carlo (MC) synthetic likelihood-based methods, more specifically Gaussian process (GP) interpolation (Frate et al., 2017). The latter method first simulates multiple sample points for a few different values of θ. For each fixed θ, one fits a Gaussian synthetic likelihood function. The GP likelihood model is then used to smoothly interpolate across the parameter space by fitting a mean function m(θ) and a covariance function Σ(θ). As a note, Cranmer et al. (2019) point out that such MC methods are less efficient than methods that estimate the likelihood ratio directly, because of the need to first estimate the entire likelihood.

For our comparison, we use the two toy examples described in Section 4 and Table 1. To allocate the B sample points for the GP interpolation, we use the following strategy: For q ∈ {5, 10, 25}, first choose θ_1, ..., θ_q on an evenly spaced grid across the parameter space. Then, generate N = B/q sample points X_1, ..., X_N at each location θ.

Table 4 summarizes the results. Unlike Table 2, we do not report the cross-entropy loss because GP interpolation is not a classification algorithm; instead we report the mean squared error in estimating the likelihood ratio across the parameter space.
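The budget allocation for GP interpolation described above can be sketched as follows (a minimal illustration; the function name and the Poisson stand-in simulator are ours, and q = 25 matches one of the G.P. settings in Table 4):

```python
import numpy as np

def allocate_gp_budget(simulate, theta_min, theta_max, B, q, seed=0):
    """Split a simulation budget of B points across q evenly spaced parameter
    values, drawing N = B // q sample points at each theta; the per-theta
    samples are what the Gaussian synthetic likelihoods are fit to."""
    rng = np.random.default_rng(seed)
    thetas = np.linspace(theta_min, theta_max, q)
    n_per_theta = B // q
    return {t: simulate(t, n_per_theta, rng) for t in thetas}

# Hypothetical forward model standing in for the Poisson toy example.
def simulate(theta, n, rng):
    return rng.poisson(lam=theta, size=n)

samples = allocate_gp_budget(simulate, theta_min=1.0, theta_max=10.0,
                             B=1000, q=25)
# 25 grid points, 40 simulations at each
```

Note the trade-off this scheme forces: larger q gives a finer θ grid but fewer samples per synthetic likelihood fit, whereas ACORE uses the full budget jointly across the parameter space.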
Our results show that when the simulated data at each θ are approximately Gaussian, as in the Poisson example, MC-based GP interpolation provides a better approximation of the likelihood ratio due to its parametric assumptions. However, when the parametric assumptions are not valid, as in the GMM example, MC-based GP fails to approximate the likelihood ratio regardless of how large N or B are. In such settings, we do better with a fully nonparametric approach. As a note, MC-based GP uses the asymptotic χ² approximation by Wilks' theorem to determine the critical values of the confidence sets. In our experiments, using quantile regression for critical values instead (as in ACORE) led to a significant increase in power for the GP likelihood models: from ≈ . to ≈ . for the Poisson example, and from ≈ . to ≈ . for the GMM example.

Table 4. Results for ACORE (MLP, NN, QDA) and Gaussian process interpolation (G.P. for q = 5, 10, 25; see text) for the two toy examples, Poisson example (left) and GMM example (right), of Section 4.
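For reference, the Wilks-based cutoff used by the MC-based GP approach is easy to compute. A sketch assuming a scalar parameter of interest, so that −2 log Λ is compared against a χ² quantile with one degree of freedom (standard library only; the function name is ours):

```python
from statistics import NormalDist

def wilks_critical_value_1d(alpha):
    """Asymptotic cutoff for -2*log(LR) with one parameter of interest.
    A chi-square(1) variable is the square of a standard normal, so its
    (1 - alpha) quantile equals z_{1 - alpha/2} squared."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

cutoff = wilks_critical_value_1d(alpha=0.10)  # 90% sets, as in Table 4
print(round(cutoff, 3))  # ~2.706 for one degree of freedom
```

When the finite-sample distribution of the statistic departs from the χ² limit, this fixed cutoff miscalibrates the test, which is why estimating critical values by quantile regression (as in ACORE) can improve power while keeping coverage.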
The tables list the mean squared error (MSE) between the estimated and true likelihood ratio, the power (averaged over θ) and the size of the confidence sets, for different values of B and for different classifiers. We report a 90% confidence interval for the MSE, together with the mean and standard deviation of the size of the estimated 90% confidence set for θ. The best results for each training sample size B are marked in boldface. All fitted classifiers produce valid confidence sets for θ according to our diagnostics.

H. Comparison with Calibrated Approximate Ratio of Likelihood Classifiers

In this section we compare the performance of ACORE with the calibrated approximate ratio of likelihoods (CARL) estimator by Cranmer et al. (2015). CARL approximates the likelihood ratio Λ(D; Θ₀) = L(D; θ₀)/L(D; θ₁) by turning density ratio estimation into a supervised classification problem, where a probabilistic classifier is trained to separate samples from F_{θ₀} and F_{θ₁}. As such, CARL classifiers are "doubly parameterized" by θ₀ and θ₁, whereas the ACORE classifier is parameterized by a single parameter θ in the definition of the odds of F_θ versus G.

In our study, we include three different CARL classifiers, implemented with the MADMINER neural network-based software (Brehmer et al., 2020a; 2019): (a) a shallow perceptron with 100 neurons (equivalent to the MLP used in Section 4), (b) a 2-layer deep network with 20 neurons per layer, and (c) a 2-layer deep network with 20 and 50 neurons in the two layers, respectively. Changing the number of neurons per layer did not seem to make a significant difference in performance for the 2-layer deep networks; the number of epochs and the learning rate were manually tuned (with a search in the range [20, ] for the epochs and over negative powers of 10 for the learning rate).

To allocate the B sample points we devised two sampling schemes: (i) uniform sampling and (ii) Monte Carlo sampling over the parameter space. For (i), we uniformly sample B parameters and then generate a sample point X at each parameter value. For (ii), we first select evenly spaced parameters θ₀,₁, ..., θ₀,q and θ₁,₁, ..., θ₁,q for the numerator and the denominator, respectively. We set q ∈ { , , }, resulting in N = B/q sample points X₁, ..., X_N at each θ location. Because the χ² approximation by Wilks' theorem did not yield valid confidence sets for the CARL classifiers, we computed critical values as in ACORE Algorithm 1.

Table 5 shows the results of ACORE and CARL for the synthetic data in Section 4 and Table 1. For both the Poisson and GMM examples, the CARL classifiers yield a higher mean squared error in estimating the likelihood ratio, as well as lower power and larger confidence intervals.

I. Runtime Analysis

In this section we provide a runtime analysis for constructing one ACORE confidence set for the two examples in Section 4 and Table 2. We also provide a running time comparison with the two methods described in Sections G and H. This analysis was performed on an 8-Core Intel Xeon X5680 3.33GHz CPU.

The procedure for constructing confidence sets with ACORE is outlined in Algorithm 4. In this analysis we break the computation into four steps: (i) odds ratio training as described by Algorithm 3, (ii) computing the test statistic (4) for the observed data, (iii) computing the test statistic (4) on the B′ sample as described by Algorithm 2, and (iv) quantile regression algorithm training.
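Steps (ii)–(iv) combine into a test inversion over a parameter grid. A schematic sketch (a deliberate simplification with our own names; in ACORE the statistic is the estimated odds-based statistic (4) and the cutoff function comes from the trained quantile regression):

```python
import numpy as np

def confidence_set(theta_grid, test_statistic, critical_value):
    """Invert the test: keep every theta whose statistic does not fall
    below its estimated critical value (step (ii) vs. steps (iii)-(iv))."""
    return [t for t in theta_grid
            if test_statistic(t) >= critical_value(t)]

# Toy stand-ins: a quadratic "statistic" peaked at theta = 3 and a flat cutoff.
stat = lambda t: -(t - 3.0) ** 2
cutoff = lambda t: -4.0
grid = np.linspace(0.0, 10.0, 101)
conf_set = confidence_set(grid, stat, cutoff)
# here the accepted thetas form (approximately) the interval [1, 5]
```

The grid evaluation in the list comprehension is why step (iii) dominates the runtime: the statistic must be computed at every θ in the B′ sample, not just at the observed data.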
Table 6 summarizes our running time results. ACORE constructs one confidence set in less than and seconds for the Poisson and GMM examples, respectively. The main computational bottleneck is step (iii), while the computation time of step (i) increases with the sample size B.

Table 5. Results for ACORE (MLP, NN, QDA) and CARL for uniform (U) and Monte Carlo (MC) sampling schemes in the Poisson example (left) and GMM example (right) settings of Section 4. The tables list the mean squared error (MSE) between the estimated and true likelihood ratio, the power (averaged over θ) and the size of the confidence sets, for different values of B and for different classifiers. We report a 90% confidence interval for the MSE, together with the mean and standard deviation of the size of the estimated 90% confidence set for θ. The best results for each training sample size B are marked in boldface.

Figure 8 shows the results of comparing confidence set construction runtimes with the MC GP and CARL classifiers. For both the Poisson and the GMM examples, we only consider the best performing ACORE classifiers, and the two CARL classifiers with 20 hidden units in both layers.
Results show that the ACORE classifiers are comparable with GP interpolation in terms of running times, while the CARL classifiers tend to have significantly longer runtimes.

Table 6. Runtimes in seconds for constructing a confidence set with ACORE for the Poisson example (top) and GMM example (bottom). The procedure for constructing confidence sets is outlined in Algorithm 4 and is split into four steps (see text): the columns report odds ratio training, odds ratio prediction, calculating the test statistic (4) for the B′ samples, and quantile regression training; the rightmost column shows total runtimes.

Figure 8. Runtimes in seconds for constructing a confidence set for the Poisson example (left panels) and GMM example (right panels). The best ACORE classifier runtime is compared with Gaussian process interpolation (GP) for q = {5, 10, 25} (top panels), and with the CARL classifiers under uniform (U) and Monte Carlo (MC) sampling (bottom panels).