Prepaid parameter estimation without likelihoods
Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, Francis Tuerlinckx
KU Leuven – University of Leuven, Belgium
*[email protected]
+These authors contributed equally to this work.
December 27, 2018
Abstract
In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed-up) and accuracy, and is able to handle previously quasi-inestimable models.
Interesting nonlinear models are often analytically intractable. As a result, statistical inference has to rely on massive, time-intensive simulations. The main idea of our method is to avoid the redundancy of similar computations that typically occurs when different researchers independently fit the same model to their particular data sets. Instead, we propose to pool computational resources across the researchers interested in any given model. The prepaid method starts with an extensive simulation of data sets across the parameter space. The simulated data are compressed into summary statistics, and the relation to the parameters is learned using machine learning techniques. This results in a parameter estimation machine that produces accurate estimates very quickly (a 23,000 to 100,000-fold speed-up compared to traditional methods).
Models without an analytical likelihood are increasingly used in various disciplines, such as genetics [2], ecology [31, 6], economics [19, 7], and neuroscience [28]. For models without an analytical or easily computable likelihood, parameter estimation is a major challenge for which a variety of solutions have been proposed [31, 2, 12]. All these methods have in common that they rely on extensive Monte Carlo simulations and that their convergence can be painstakingly slow. As a result, the current methods can be very time consuming.

To date, the practice is to analyze each data set separately. However, considering all the calculations that have ever been performed during parameter estimation of a particular type of model, for each different data set, one cannot help but notice an incredible waste of resources. Indeed, simulations performed while estimating one data set may also be relevant for the estimation of another. Currently, each researcher estimating the same model with different data will start from scratch and not benefit from all the possibly relevant calculations that have already been performed in earlier analyses by other researchers, in other locations, on different hardware, and for other data sets, but concerning the same model.

Therefore, our proposal is to combine resources and create a giant online database of parameter values coupled to simulated data. Consequently, (cloud-based) interpolation techniques and global optimization methods can be used on the previously created (hence, prepaid) database for accurate and fast parameter estimation on any device. Statistical analyses that currently take up hours to days of computation time on dedicated hardware are now available to everyone within seconds.

In Figure 1 we present a graphical illustration of the prepaid parameter estimation method. First (panel A), for a representative number of parameter vectors θ, large data sets are simulated, compressed into summary statistics (i.e., s_sim), and saved, creating the prepaid grid. This prepaid grid is computed beforehand and the results are stored at a central location. Second (panel B1), the observed summary statistics s_obs are compared to the simulated summary statistics s_sim using an appropriate objective loss function d(s_sim, s_obs), and a number of nearest neighbor simulated summary statistics are selected. This loss function is related to the loss functions used in the generalized method of moments [10] and the method of simulated moments [8]. Third (panel B2), interpolation methods are used to learn the relation s = f(θ) between the parameter values and the summary statistics for the points selected in the previous step [9, 20]. In this paper, we use tuned least squares support vector machines (LS-SVM) [25]. Finally (panel B3), the objective loss function d(s_pred, s_obs), now using predicted summary statistics s_pred, is minimized as a function of the unknown parameter values using an optimizer.

Three important aspects of the prepaid method deserve special mention. First, the parameter space is required to be bounded. If this is unnatural for a given parametrization, the parameters have to be appropriately transformed to a bounded space. Second, we typically start from a uniform distribution of parameter vectors in the final parameter space. This choice determines the uniformity of the grid's resolution, but has no further implications provided the grid is sufficiently dense.
Bayesian priors can be implemented without recreating the prepaid grid, since the prior can be taken into account in the loss function. Third, often a user is not interested in a single instance of a model, but rather has data from several experimental conditions that share some parameters while others differ between conditions. In these cases too, the prepaid grid does not need to be recreated, as the parameter constraints can be included through priors with tuning parameters (i.e., penalties).

The performance of the prepaid method can be studied theoretically in simple situations (see Methods 5.1). In what follows, the prepaid method will be applied to three more complicated, realistic scenarios.

Figure 1: Graphical illustration of the prepaid parameter estimation method. Panel A: the prepaid phase (simulate and save). Panel B: the estimation phase — (1) nearest neighbors, (2) local learning, (3) method of predicted moments.
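To make the three estimation steps concrete, the following is a minimal sketch of the estimation phase (panel B) in Python. It assumes a precomputed grid of parameter vectors and summary statistics; the weight matrix W, the use of scikit-learn's SVR in place of the paper's tuned LS-SVM, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.svm import SVR

def prepaid_estimate(theta_grid, s_grid, s_obs, W, n_neighbors=100):
    """Estimate parameters from a prepaid grid (cf. panel B of Figure 1).

    theta_grid : (n_grid, K) prepaid parameter vectors
    s_grid     : (n_grid, R) simulated summary statistics
    s_obs      : (R,) observed summary statistics
    W          : (R, R) weight matrix of the quadratic loss
    """
    # B1: quadratic loss d(s_sim, s_obs) and nearest-neighbor selection
    resid = s_grid - s_obs
    d = np.einsum('ij,jk,ik->i', resid, W, resid)
    nn = np.argsort(d)[:n_neighbors]

    # B2: local learning, one interpolator per summary statistic
    models = [SVR().fit(theta_grid[nn], s_grid[nn, j])
              for j in range(s_grid.shape[1])]

    def s_pred(theta):
        return np.array([m.predict(theta[None, :])[0] for m in models])

    # B3: minimize d(s_pred, s_obs) over the local parameter box
    bounds = list(zip(theta_grid[nn].min(0), theta_grid[nn].max(0)))

    def loss(theta):
        r = s_pred(theta) - s_obs
        return r @ W @ r

    return differential_evolution(loss, bounds).x
```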
Results
Example 1: The Ricker model
In a first example, we apply our prepaid method to the Ricker model [27, 31], which describes the dynamics of the number of individuals y_t in a species over time (with t = 1 to T_obs):

y_t ∼ Pois(φ N_t),   N_{t+1} = r N_t e^{−N_t + e_t},   (1)

where e_t ∼ N(0, σ²). The variables N_t (i.e., the expected number of individuals at time t) and e_t are hidden states. Given an observed time series {y_t}_{t=1}^{T_obs}, we want to estimate the parameters θ = {r, σ, φ}, where r is the growth rate, σ the process noise, and φ a scaling parameter. The Ricker model can exhibit near-chaotic or chaotic behavior, and no explicit likelihood formula is available.

Wood [31] used the synthetic likelihood to estimate the model's parameters. In the original synthetic likelihood approach (denoted as SL_Orig), the summary statistics are assumed to follow a multivariate normal distribution, which is used to create a synthetic likelihood. The mean and covariance matrix of this normal distribution are functions of the unknown parameters and are calculated using a large number of model simulations. The synthetic likelihood is proportional to the posterior distribution, from which samples are drawn using MCMC, and a posterior mean is computed.
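For concreteness, Equation 1 can be simulated in a few lines. The sketch below is a direct transcription of the model equations; the initial hidden state N_0 = 1 is an assumption for illustration, as the paper does not specify it here.

```python
import numpy as np

def simulate_ricker(r, sigma, phi, T_obs, N0=1.0, rng=None):
    """Simulate one Ricker time series (Equation 1).

    Hidden state: N_{t+1} = r * N_t * exp(-N_t + e_t), e_t ~ N(0, sigma^2).
    Observation:  y_t ~ Poisson(phi * N_t).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = N0
    y = np.empty(T_obs, dtype=int)
    for t in range(T_obs):
        y[t] = rng.poisson(phi * N)
        e = rng.normal(0.0, sigma)
        N = r * N * np.exp(-N + e)
    return y
```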
Wood's SL_Orig approach is compared to the prepaid method, where we create a prepaid grid of the mean and the covariance matrix of a similar set of summary statistics. Prepaid estimation comes in multiple variants, depending on the use of an interpolation method. The first, which uses only the prepaid grid points and chooses the nearest neighbor (maximum synthetic likelihood) as the final estimate, will be called SL_GridML. The second, SL_SVMML, uses LS-SVM to interpolate between the parameters in the prepaid grid to increase accuracy. The differential evolution algorithm (a global optimizer; [24]) is used to maximize this interpolated synthetic (log)likelihood. Additional details on the implementation of the synthetic likelihood can be found in Methods 5.2.

Figure 2 shows both the accuracy of parameter recovery (as measured with the RMSE) and the computation time for the three methods under comparison: (1) SL_Orig as in [31], and the prepaid method (2) with (SL_SVMML) and (3) without (SL_GridML) interpolation. As can be seen in Figure 2, the prepaid estimation techniques lead to better results than the synthetic likelihood for T_obs = 1,000, both in accuracy and speed. The SL_Orig method leads to some clear outliers (see Methods 5.2), which testifies to possible convergence problems (probably due to local minima). The prepaid method suffers much less from this problem. Most striking is the speed-up of the prepaid method: the SL_GridML version of the prepaid estimation is finished before a single one of the 30,000 iterations in the synthetic likelihood method has been completed, making it 100,000 times faster. In addition, it is demonstrated that the coverages of the prepaid method confidence intervals are very close or exactly equal to the nominal value (we look at 95% bootstrap-based confidence intervals). SVM interpolation is mainly helpful for large T_obs, where one expects a higher accuracy of the estimates and the grid is too coarse. The analyses with large T_obs could only be completed in a reasonable time using the prepaid method (see Methods 5.2 for more detailed information).

In the application above, the tacitly assumed prior on the parameter space is uniform. In addition, there is only one data set, for which a single triplet of parameters (r, σ, φ) needs to be estimated. In Methods 5.2, we show how both limitations can be relaxed. First, it is explained how different priors for the Ricker model can be implemented. Second, it is discussed what can be done if there are two data sets (i.e., conditions) for which it holds that r_1 = r_2 and σ_1 = σ_2, but φ_1 and φ_2 are not related.

Finally, we also tested our estimation process on the population dynamics of Chilo partellus, extracted from Figure 1 in Taneja and Leuschner [26, 32]. Here we found that r = 1.· (95% confidence interval 1.06–1.34), σ = 0.· (95% confidence interval 0.30–0.54), and φ = 140.· (95% confidence interval 43.94–208.19). We found similar results using the synthetic likelihood method (see Methods 5.2), but our estimation was 4,000 times faster.
Figure 2: The RMSE versus the time needed for the estimation of the three parameters of the Ricker model (see Equation 1). The RMSE and time are based on 100 test data sets with T_obs = 1000. The three colors represent the three parameters (blue for r, red for σ, and yellow for φ). Solid lines represent the SL_Orig approach, dashed lines the SL_GridML approach (using only nearest neighbors), and dotted lines the SL_SVMML approach (using interpolation). The stars and the dots represent the time needed for the SL_GridML and the SL_SVMML estimation, respectively. The estimates for SL_Orig are posterior means, based on the second half of the finished MCMC iterations.
Example 2: A stochastic model of community dynamics
A second example we use to illustrate the prepaid inference method is a trait model of community dynamics [14], used to model the dispersion of species. For this model (see also the Methods section), there are four parameters to be estimated: I, A, h, and σ. As with the first application, there is no analytical expression for the likelihood [14].

As an established benchmark procedure for this trait model, we apply the widely used Approximate Bayesian Computation (ABC) method [4, 30, 23, 1] as implemented in the EasyABC package, denoted here as ABC_OrigPM (PM stands for posterior mean, which will be used as the point estimate) [16]. As priors, we use uniform distributions on bounded intervals for log(I), log(A), h, and log(σ) (see Methods 5.3 for the exact specifications), but this can easily be changed, as explained for the first example.

To allow for a direct comparison with the ABC method, and to illustrate the versatility of the prepaid method, we have also implemented three Bayesian versions of the prepaid method. The first, SL_GridPM, creates a posterior proportional to the prepaid synthetic likelihood. The second, ABC_GridPM, saves not only the mean and covariance matrix of the summary statistics for every parameter vector in the prepaid grid, but also a large set of uncompressed summary statistics; using these statistics, we are able to approximate an ABC approach. The third, ABC_SVMPM, again interpolates between the grid points to achieve a higher accuracy.

Table 1: The RMSE of the estimates of the test set of the trait model. T_obs refers to the number of observations (i.e., vectors with species frequencies) and Ω is the number of prepaid points. [Apart from one legible row for ABC_OrigPM (RMSE 0.17 for log(I), 0.67 for log(A), 7.45 for h, and 0.74 for log(σ)), the numeric entries could not be recovered from the source.]

The prepaid ABC_GridPM method is about 23,000 times faster than traditional ABC. For small sample sizes, all ABC-based methods achieve good coverage. However, for large sample sizes, ABC_OrigPM can no longer be used (because of the unduly long computation time). For the prepaid versions, it is then necessary to use SVM interpolation between the grid points to get accurate results.
Example 3: The Leaky Competing Accumulator for choice response times
In a third example, we apply our method to stochastic accumulation models for elementary decision making. In this paradigm, a person has to choose, as quickly and accurately as possible, the correct response given a stimulus (e.g., is a collection of points moving to the left or to the right?). Task difficulty is manipulated by applying different levels of stimulus ambiguity.

A popular neurally inspired model of decision making is the Leaky Competing Accumulator (LCA [29]). For two response options, two noisy evidence accumulators (stochastic differential equations, see Methods section) race each other until one of them reaches the required amount of evidence for the corresponding option to be chosen. The time that is required to reach that option's threshold is interpreted as the associated choice response time. For different levels of stimulus difficulty, the model produces different levels of accuracy and choice response time distributions. The evidence accumulation process leading up to these choices and response times is assumed to be indicative of the activation levels of neural populations involved in the decision making.

As in the first two examples, there is no analytical likelihood available that can be used to estimate the parameters of the LCA. Moreover, the LCA is an extremely difficult model to estimate. To the best of our knowledge, only [21] systematically investigated the recovery of the LCA parameters, but for a slightly different model (with three choice options) and with a method that is impractically slow for very large sample sizes, making it difficult to show near-asymptotic recovery properties.

Figure 3: The mean absolute error of the estimates of four central parameters of the LCA (common input v, leakage γ, mutual inhibition κ, evidence threshold a) as a function of sample size (abscissa) and for three different methods: (1) choosing the nearest neighbor grid point in the space of summary statistics (CHISQ_GridNN, triangles); (2) using the average of a set of nearest neighbor grid points based on bootstrap samples (CHISQ_GridBS, open circles); and (3) using SVM interpolation between the 100 nearest neighbors (CHISQ_SVMBS, crosses).

For an experiment with four stimulus difficulty levels, the LCA model has nine parameters. However, after a reparametrization of the model (but without a reduction in complexity), it is possible to reduce the prepaid space to four dimensions (see Methods 5.4) and conditionally estimate the remaining subset of the parameters with a less computationally intensive method. Three variants of the prepaid method have been implemented: taking the nearest neighboring parameter set on the prepaid grid, based on a symmetrized χ² distance between distributions (CHISQ_GridNN; a code sketch of such a distance is given at the end of this example); averaging over the grid's nearest neighboring parameter sets of 100 non-parametric bootstrap samples (CHISQ_GridBS); and using SVM interpolation for every bootstrap estimate (CHISQ_SVMBS). A nearest neighbor or bootstrap-averaged estimate completes in about a second on a Dell Precision T3600 (4 cores at 3.60 GHz); an SVM-interpolated estimate requires a couple of minutes extra.

Figure 3 displays the mean absolute error (MAE) of the estimates for four of the nine parameters as a function of sample size, separately for the three estimation methods. The results for the other parameters are similar and can be consulted in the Methods section. It can be seen that with increasing sample size, the MAE decreases. The SVM method pays off especially for larger samples. Figure 4 shows detailed recovery scatter plots for a subset of the parameters for 1,200 observed trials, which is the typical size of decision experiments. To get better recovery, larger sample sizes have to be considered (see Methods section). In general, recovery is much better than what has been reported in [21]. The coverage of the method, based on non-parametric bootstrapping, is satisfactory for all sample sizes, provided SVM-interpolated estimates are used for T_obs > 10^·. In addition, we do not find evidence for a fundamental identification issue with the two-option LCA, as has been stated in [21].

Figure 4: Parameter recovery for the LCA model with 1,200 observations (300 in each of the four difficulty conditions); the true value on the abscissa and the estimated value on the ordinate. The same parameters as in Figure 3 are shown. The method used to produce these estimates is the averaged bootstrap approach (CHISQ_GridBS, see Methods 5.4 for details).
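The symmetrized χ² distance used to match prepaid and observed choice response time distributions can take the following generic form. The exact binning and normalization used for the LCA are described in Methods 5.4; this sketch only illustrates one common definition of such a distance.

```python
import numpy as np

def sym_chisq(p, q, eps=1e-12):
    """Symmetrized chi-square distance between two binned distributions.

    p, q : histogram proportions over the same bins (e.g., choice-by-RT-bin
    cells for one difficulty condition), each summing to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```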
Methods

A toy example: theoretical analysis of the prepaid method
For a very simple setting, we want to study the performance of the prepaid method analytically. Assume y_i ∼ N(μ, s²) (i = 1, ..., T_obs), with the mean μ unknown (and to be estimated) and the standard deviation s known (so the number of parameters is K = 1). The observed mean is denoted as ȳ. We will explore two situations. In the first situation, ȳ will be our summary statistic s_obs (hence the number of summary statistics is R = 1) to estimate μ (ȳ is also a sufficient statistic for μ). In the second situation, we will study what happens if s_obs = ȳ² is chosen as the summary statistic.

Situation 1: s_obs = ȳ

As a prepaid grid, we take N_r evenly spaced μ-values with spacing or gap size ∆ = μ_{j+1} − μ_j. For each value μ_j, T_sim values of y are simulated and the sample average is computed (i.e., ȳ_j^sim). Typically, T_sim = 1000 or larger. Hence, every value of μ_j is paired with a particular ȳ_j^sim: (μ_j, ȳ_j^sim).

Given an observed ȳ, the N nearest neighbors among the simulated statistics ȳ_j^sim are selected: (μ_(1), ȳ_(1)^sim), (μ_(2), ȳ_(2)^sim), ..., (μ_(N), ȳ_(N)^sim), such that |ȳ_(1)^sim − ȳ| ≤ |ȳ_(2)^sim − ȳ| ≤ ... ≤ |ȳ_(N)^sim − ȳ|. Typically, N = 100. Because of the linearity of the problem, we can safely assume that if T_sim is large enough, the N selected μ values are all consecutive or nearly consecutive (because of noise in the prepaid simulation of ȳ^sim, it can happen that the N selected μ values are not exactly consecutive). We denote the average of these N μ-values as M_μ, and define μ_(1) ≡ min_{i∈{1,...,N}} μ_(i). If all values are exactly consecutive, M_μ can be expressed as

M_μ = (1/N) Σ_{j=1}^{N} μ_(j) = (1/N) Σ_{j=0}^{N−1} (μ_(1) + j∆) = μ_(1) + (∆/N) Σ_{j=1}^{N−1} j = μ_(1) + ∆(N − 1)/2.

In addition (again assuming that all values are exactly consecutive), their variance V_μ is given by

V_μ = (1/N) Σ_{j=1}^{N} (μ_(j) − M_μ)² = (1/N) Σ_{j=0}^{N−1} (μ_(1) + j∆)² − M_μ²
    = ∆²(N − 1)(2N − 1)/6 − ∆²(N − 1)²/4 = ∆²(N² − 1)/12 ≈ ∆²N²/12,

so that S_μ ≈ ∆N/√12.

Using the N pairs, we assume as a linear interpolator in this example a linear regression model that links the simulated statistics to the true underlying μ: ȳ_j^sim = β_0 + β_1 μ_j + ε_j, with ε_j ∼ N(0, s²/T_sim). Obviously, β_0 = 0 and β_1 = 1. Given ȳ, the N selected prepaid points, and the fitted linear regression model, we know from linear regression theory that (β̂_0, β̂_1)′ is bivariate normal with mean (0, 1)′ (the true β_0 and β_1) and

σ_0² = Var(β̂_0 | ȳ) ≈ (s²/T_sim) (1/N + 12 M_μ²/(∆²N³)),
σ_1² = Var(β̂_1 | ȳ) ≈ 12 s²/(T_sim ∆²N³),
σ_01 = Cov(β̂_0, β̂_1 | ȳ) = −M_μ σ_1² ≈ −12 s² M_μ/(T_sim ∆²N³).

This distribution is assumed to hold for repeated simulations of the replicated statistics in the prepaid grid. Because we work with linear regression, the optimization problem is simple: the optimal value of μ for a given ȳ can be found by inverting the regression line,

μ̂ = (ȳ − β̂_0)/β̂_1.

Next, we can study the properties of μ̂. We begin by calculating the conditional mean E(μ̂ | ȳ) and conditional variance Var(μ̂ | ȳ). Hence, we treat the observed data (or sample average) as given and fixed; these expectations are taken over different simulations of the ȳ_j^sim's in the prepaid grid. Before giving the expressions, it is useful to note that (ȳ − β̂_0, β̂_1)′ is bivariate normal with mean (ȳ, 1)′, variances (σ_0², σ_1²), and covariance −σ_01. A second-order expansion of the ratio then gives

E(μ̂ | ȳ) ≈ E(ȳ − β̂_0 | ȳ)/E(β̂_1 | ȳ) − Cov(ȳ − β̂_0, β̂_1 | ȳ)/E(β̂_1 | ȳ)² + E(ȳ − β̂_0 | ȳ) Var(β̂_1 | ȳ)/E(β̂_1 | ȳ)³
         ≈ ȳ − 12 s² M_μ/(T_sim ∆²N³) + ȳ · 12 s²/(T_sim ∆²N³)
         = ȳ (1 + 12 s²/(T_sim ∆²N³)) − 12 M_μ s²/(T_sim ∆²N³),

and

Var(μ̂ | ȳ) ≈ σ_0² + ȳ² σ_1² − 2 ȳ M_μ σ_1² = (s²/T_sim) (1/N + 12 (M_μ − ȳ)²/(∆²N³)).

Invoking the double expectation theorem to arrive at the unconditional expectations, we have

E(μ̂) = E[E(μ̂ | ȳ)] ≈ μ (1 + 12 s²/(T_sim ∆²N³)) − 12 M_μ s²/(T_sim ∆²N³) = μ − 12 α s²/(T_sim ∆²N³),   (2)

where α = M_μ − μ, that is, the deviation of the mean of the selected nearest neighbor μ's from the true value. Likewise, we can derive

Var(μ̂) = E[Var(μ̂ | ȳ)] + Var[E(μ̂ | ȳ)]
        ≈ s²/T_obs + s²/(T_sim N) + 12 s² (μ − M_μ)²/(T_sim ∆²N³) + 36 s⁴/(T_sim T_obs ∆²N³) + 144 s⁶/(T_sim² T_obs ∆⁴N⁶).   (3)

From Equation 2, we learn that if there is no systematic deviation in the selection of μ-grid points (α = 0), the prepaid estimator is unbiased. In the other case, the bias decreases with T_sim but is proportional to s². In Equation 3, the leading term of the variance is s²/T_obs, which is the same as in classical estimation theory. The other terms all have T_sim (or a power of it) in the denominator; because T_sim is usually quite large, these terms tend to be of lesser importance in general. However, some terms also have both N (the number of selected nearest neighbor grid points) and ∆ (the gap size) in the denominator.
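The moments derived above can be checked with a small Monte Carlo sketch of Situation 1. All numeric settings below (μ, s, T_obs, T_sim, ∆, N, and the grid range) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, s = 0.0, 1.0              # true mean, known standard deviation
T_obs, T_sim = 100, 1000           # data sample size, prepaid simulation size
delta, N = 0.05, 100               # grid gap and number of nearest neighbors
grid = np.arange(-10, 10, delta)   # prepaid mu-grid

est = []
for _ in range(2000):
    ybar = rng.normal(mu_true, s / np.sqrt(T_obs))   # observed statistic
    ysim = rng.normal(grid, s / np.sqrt(T_sim))      # prepaid statistics
    nn = np.argsort(np.abs(ysim - ybar))[:N]         # N nearest neighbors
    b1, b0 = np.polyfit(grid[nn], ysim[nn], 1)       # local linear fit
    est.append((ybar - b0) / b1)                     # invert the regression line

est = np.array(est)
var_pred = s**2 / T_obs + s**2 / (T_sim * N)         # leading terms of Equation 3
print(est.mean(), est.var(), var_pred)
```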
It is worthwhile to note that increasing the resolution (i.e., decreasing ∆) while keeping N constant will increase the additional terms and thus add to the error. The reason for this is that the interpolation is then defined on a too small grid, leading to uncertainty in the estimated regression. This effect is illustrated in the left panel of Figure 5, in which the root mean square error (RMSE) is shown for the estimation of μ for different values of N and ∆. The plot is constructed by means of a simulation study, but confirms our analytical results.

Situation 2: s_obs = ȳ²

In the second situation, we will again estimate μ (the unknown mean of a unit variance normal), but in this case s_obs = ȳ² is used as the statistic. Thus, the relation between the simulated statistics (ȳ^sim)² and μ is quadratic (and thus nonlinear). Again we use a local linear approximation. Clearly, this approximation will only be approximately valid if we do not choose the area of approximation too large. However, unlike in the first situation, we do expect an additional effect of the approximation error.

No analytical derivations were made for this case, but we conducted a similar simulation study as in Situation 1. The results (in terms of RMSE) are shown in the right panel of Figure 5. As can be seen, there is a clear optimality trade-off between ∆ and N. This can be explained as follows: fix N and then consider the gap size ∆. If ∆ is too small, we get a similar phenomenon as in the left panel, that is, a large RMSE. However, if we take ∆ too large, then the approximation error will dominate (because the linear interpolation misfits the quadratic relation). The optimal point will be different for different N.

Figure 5: RMSE (based on a simulation study) of the toy example estimation as a function of the gap size (∆) and the number of nearest neighbors selected to carry out the interpolation (N). The left panel shows Situation 1, in which s_obs = ȳ, and the right panel Situation 2 (s_obs = ȳ²). For the second situation, the trade-off between ∆ and N is clearly visible.

This toy example demonstrates the sound theoretical foundations of the prepaid method in well-behaved situations. However, the question is how well the method performs in real-life examples. Three hard problems will be studied next.

The Ricker model

The basic model equations of the Ricker model are given in Equation 1.
Synthetic likelihood estimation
For the synthetic likelihood estimation (SL_Orig), we made use of the synlik package [5]. The synthetic likelihood l_s for a data set with summary statistics s_obs and a certain parameter vector θ = (r, σ, φ) is given by

l_s(θ) = −½ (s_obs − μ̂_θ)^T Σ̂_θ⁻¹ (s_obs − μ̂_θ) − ½ log|Σ̂_θ|,   (4)

where μ̂_θ and Σ̂_θ are the estimated mean and covariance of the summary statistics when Equation 1 is simulated multiple times with parameter θ.

The statistics used by the synthetic likelihood function were the average population size, the number of zeros, the autocovariances up to lag 5, the coefficients of a quadratic autoregression of y_t^0.3, and the coefficients of the cubic regression of the ordered differences y_t − y_{t−1} on the observed values.

For each data set, we used the synthetic likelihood Markov chain Monte Carlo (MCMC) method with 30,000 iterations, a burn-in of 3 time steps, and 500 simulations to compute each μ̂_θ and Σ̂_θ [5]. We used the following prior:

r ∼ U(1, 90),  σ ∼ U(0.·, 0.·),  φ ∼ U(0, 20).   (5)

The synlik package generates the MCMC chain on a logarithmic scale; we estimated the parameters as the exponential of the posterior mean. To ensure convergence, only the last half of the chain is used (the last 15,000 iterations).
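A generic implementation of Equation 4 takes the following form; the simulate callback and its signature are assumptions for illustration (the paper itself uses the synlik package).

```python
import numpy as np

def synthetic_loglik(s_obs, simulate, theta, n_sim=500, rng=None):
    """Synthetic log-likelihood (Equation 4) for one parameter vector.

    simulate(theta, rng) must return one vector of summary statistics.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = np.array([simulate(theta, rng) for _ in range(n_sim)])  # n_sim x R
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False)
    resid = s_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * resid @ np.linalg.solve(Sigma, resid) - 0.5 * logdet
```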
Creation of the prepaid grid

For the prepaid estimation, we used the same summary statistics as for the traditional synthetic likelihood, except for two differences. First, the coefficients of the cubic regression of the ordered differences y_t − y_{t−1} on the observed values could not be used, because the observed values are not available when creating the prepaid grid. Second, we changed the number of zeros to the percentage of zeros, to make this statistic independent of T_obs (as this may change depending on the observation).

We filled the prepaid grid with 100,000 parameter sets using the priors of Equation 5. To cover this grid as evenly as possible (and avoid too large gaps), the uniform distribution was approximated using Halton sequences [18, 17]. For each parameter set in the prepaid grid, we simulated a time series of length 10^7 and used the summary statistics of this long time series as μ̂_θ.

Each time series was then split into shorter series of length T_prepaid to obtain covariance matrices Σ̂_{θ,T_prepaid} for the statistics computed on data of these lengths. This means, for example, that we had 100,000 series of length 100 to compute the covariance matrix for a certain parameter set for time series of length 100. If we need to estimate parameters of a time series with T_obs not equal to one of the T_prepaid lengths, we use the covariance matrix created with the T_prepaid closest to T_obs on a logarithmic scale and adapt the covariance matrix into

Σ̂_{θ,T_obs} = (T_prepaid / T_obs) Σ̂_{θ,T_prepaid}.   (6)

For the real-life data set, we used the prior

log(r) ∼ U(log(1), log(200)),  σ ∼ U(0.·, 0.·),  log(φ) ∼ U(−·, ·),   (7)

and filled the prepaid grid with 100,000 parameter sets.
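The covariance look-up and rescaling of Equation 6 amounts to a few lines; the dictionary-based storage below is an assumed design, not the paper's data structure.

```python
import numpy as np

def cov_for_T_obs(Sigma_prepaid, T_obs):
    """Pick the stored covariance with T_prepaid closest to T_obs on a
    log scale and rescale it as in Equation 6.

    Sigma_prepaid : dict mapping T_prepaid -> (R, R) covariance matrix
    """
    Ts = np.array(sorted(Sigma_prepaid))
    T_near = Ts[np.argmin(np.abs(np.log(Ts) - np.log(T_obs)))]
    return (T_near / T_obs) * Sigma_prepaid[T_near]
```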
Prepaid estimation

Four variants of prepaid estimation were implemented for this example. All use the negative synthetic likelihood as the distance d(s_sim, s_obs) (as defined in the main text and Figure 1).

First, we do a nearest neighbor estimation, SL_GridML, without using any interpolation between the grid points of the prepaid data set. We compute the synthetic likelihood of all the prepaid parameters for the summary statistics of the test data set. The parameter vector with the highest likelihood, the so-called nearest neighbor, may already be a good estimate. For a low number of time points T_obs, it is to be expected that the error on the parameter estimate is much larger than the gaps in the prepaid grid, and in such a case the SL_GridML estimation approach suffices.

Second, a more accurate estimate can be acquired by interpolating between the parameter values in the prepaid grid (SL_SVMML). Therefore, we learn the relation between the parameters and the summary statistics, f̂_svm: θ → s. However, we only learn this relation in the region of interest, that is, the 100 nearest neighbors according to the synthetic likelihood. For each summary statistic, we create, on the fly, a separate least squares support vector machine (LS-SVM) [25] using the 100 nearest neighbors. This machine learning technique is chosen because it is a fast non-linear method that generalizes well. We limit the predictions to the possible range of the summary statistics (e.g., to prevent a percentage of zeros, one of the statistics, larger than 1). We then use the differential evolution global optimizer [24] to find the maximum of

l_s^PP(θ) = −½ (s_obs − f̂_svm(θ))^T Σ̂_{θ,T_obs}⁻¹ (s_obs − f̂_svm(θ)) − ½ log|Σ̂_{θ,T_obs}|,   (8)

where Σ̂_{θ,T_obs} is the covariance matrix of the statistics of the nearest neighbor, as defined in Equation 6. The superscript "PP" denotes that we use the prepaid version of the synthetic likelihood, and not the traditional version as used by [31] (see Equation 4). The optimization process is constrained: we use the minima and maxima for each parameter of the 100 nearest neighbors as effective bounds.

The SL_SVMML approach makes use of a non-linear black box interpolator. However, we may also consider using a much faster linear regression (see also the toy example in Section 5.1). Therefore, we will also compare the SL_SVMML (and SL_GridML) approach to a third option where we predict the summary statistics using a linear regression (called the SL_LinML approach).

Third, we can easily implement a prior for the likelihood in Equation 4. This leads to a posterior given by

p(θ | s_obs) ∝ p(θ) L_s(θ).   (9)

The parameters are then estimated as the maximum a posteriori (MAP); note that maximum likelihood estimation is a maximum a posteriori with a uniform prior. Here we apply this extension to the nearest neighbor estimation: SL_GridMAP.

Lastly, we will show that our prepaid method can also be used to cover multiple experimental set-ups. Each experimental set-up involves multiple conditions and may have varying constraints on the parameters of the model over these conditions. If, for one experimental set-up, the conditions c are independent, the likelihood of the whole experiment is

L_s,experiment(θ_1, θ_2, ..., θ_C) = ∏_{c=1}^{C} L_s,c(θ_c),   (10)

where L_s,c(θ_c) is the synthetic likelihood for condition c. This is equivalent to estimating each parameter set θ_c individually for each condition c. If the conditions are not independent and we assume that some parameters are the same over conditions, we adapt the prior to mimic these assumptions. For example, if we assume that the first parameter θ_1 is the same over all C conditions, we formulate this as

p(θ_1, θ_2, ..., θ_C) = ∏_{c=1}^{C} N((θ_{1c} − θ̄_1)/σ_prior),   (11)

where N is the standard normal density and θ̄_1 is the average of all θ_{1c}. The smaller the tuning parameter σ_prior, the more all θ_{1c} will be forced to be equal. If σ_prior is too large, the estimation will not take into account the interdependence between the conditions. However, if it is too small, we run into trouble with the sparsity of the prepaid grid: in the limit where σ_prior goes to zero, one point is chosen from the prepaid grid, leading to equal parameters not only for the first parameter but also for the others. σ_prior can easily be tuned by simulating the experimental set-up for a certain prepaid grid.

To further illustrate, we will apply this method to the Ricker model, assuming two conditions across which r and σ stay the same, such that the prior is given by

p(θ_1, θ_2) = N((r_1 − r̄)/σ_prior) N((r_2 − r̄)/σ_prior) N((σ_1 − σ̄)/σ_prior) N((σ_2 − σ̄)/σ_prior).   (12)

We first use the nearest neighbor approach SL_GridML to find the 1,000 nearest neighbors for conditions one and two separately, and then we refine the parameters using the prior in Equation 12. As we assume r and σ to be constant over the conditions, we take r̄ and σ̄ as the final estimates for these parameters in each condition.
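On the log scale, the constraint prior of Equations 11 and 12 corresponds to a quadratic penalty added to the negative joint synthetic log-likelihood. The sketch below shows this objective for the two-condition Ricker example; the function names and signatures are assumptions for illustration.

```python
def neg_log_posterior_two_cond(theta1, theta2, loglik1, loglik2, sigma_prior):
    """Joint objective for two conditions with r and sigma shared
    (cf. Equations 10-12). theta = (r, sigma, phi); loglik_c returns the
    synthetic log-likelihood for condition c."""
    r_bar = 0.5 * (theta1[0] + theta2[0])
    s_bar = 0.5 * (theta1[1] + theta2[1])
    # Gaussian prior terms become a soft equality constraint on r and sigma
    penalty = ((theta1[0] - r_bar) ** 2 + (theta2[0] - r_bar) ** 2 +
               (theta1[1] - s_bar) ** 2 + (theta2[1] - s_bar) ** 2) / (2 * sigma_prior ** 2)
    return -(loglik1(theta1) + loglik2(theta2)) + penalty
```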
Test set

As a test set, we first used 100 random parameter sets created with the prior of Equation 5. To avoid problems with the borders, we deleted parameter sets that were within a 1% range of the bounds. We simulated data sets for T_obs = {·, ·, 10^3, 10^4, 10^5} (see Table 3). For each data set, we estimated parameters using the nearest neighbor (SL_GridML) and the SL_SVMML approach. For T_obs = 10^5, we also estimated the parameters using the SL_LinML approach. Due to time constraints, we only estimated parameters for the data with T_obs ≤ 10^3 using the traditional synthetic likelihood approach.

Next, we also created test data sets from different priors for T_obs = 100. Prior P_1 from Equation 5 can also be written as

(r − 1)/(90 − 1) ∼ Beta(1, 1),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(1, 1),  φ/20 ∼ Beta(1, 1),   (13)

where Beta is a beta distribution, here with parameters α = 1 and β = 1. Similarly, we created a test set from prior P_2,

(r − 1)/(90 − 1) ∼ Beta(10, ·),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(10, ·),  φ/20 ∼ Beta(10, ·),   (14)

and prior P_3,

(r − 1)/(90 − 1) ∼ Beta(2, ·),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(10, ·),  φ/20 ∼ Beta(2, ·).   (15)

We will test whether SL_GridMAP performs best when the correct prior is used in the estimation process. Last, we also created a test set for T_obs = 100 for an experimental set-up with two conditions where r and σ are equal over the conditions.

Results
For the results, we will evaluate the methods on the following criteria: accuracy, speed, and coverage.
Accuracy
To start off, we look at the recoveries for T_obs = 10^3 for all 100 simulated data sets and the three methods (SL_Orig, SL_GridML, and SL_SVMML). Scatter plots are shown in Figure 6. It can be seen that the synthetic likelihood estimation leads to some clear outliers. One possible reason for the absence of outliers in the prepaid estimation is the fact that the prepaid estimation examines the whole grid from the start and therefore has fewer problems with getting stuck in local optima.
Figure 6: Estimated versus true parameters of the Ricker model for 100 data sets with T_obs = 1000. The SL_Orig estimation has some problems with outliers.

More generally, we plotted the accuracy of each of the methods as a function of the time series length T_obs in Figure 7. The left panel shows the root mean square error (RMSE), while the right panel shows the median absolute error (MAE). We decided to also look at the MAE because the few outliers for SL_Orig (which were shown in Figure 6) may inflate the RMSE of the synthetic likelihood disproportionally, which happens to a certain extent. However, very similar conclusions can be drawn for both performance measures. In general, accuracy increases when T_obs increases (i.e., both RMSE and MAE decrease). For RMSE, our SVM prepaid method clearly outperforms the traditional synthetic likelihood method SL_Orig for every T_obs and every parameter. For T_obs = {·, 10^3}, the SL_GridML prepaid approach also leads, for every parameter, to a lower RMSE than the synthetic likelihood. For all T_obs, the SL_SVMML prepaid method leads to a higher accuracy than the SL_GridML prepaid method, and this difference becomes larger for larger T_obs. For MAE, the SL_SVMML prepaid method and the original synthetic likelihood SL_Orig show a very similar accuracy (for T_obs ≤ 10^3). Both outperform the SL_GridML prepaid method.
Figure 7: The accuracy of all estimation methods versus the number of time points T_obs. The left panel shows the root mean square error, while the right panel shows the median absolute error. The three colors represent the three parameters: blue lines refer to the parameter r, red lines to the parameter σ, and yellow lines to the parameter φ. The solid line represents the original synthetic likelihood approach SL_Orig (stopping at T_obs = 10^3), the dashed line the SL_GridML prepaid approach, and the dotted line the SL_SVMML prepaid approach.

The largest attainable accuracy for the SL_GridML prepaid approach is limited by the spacing of the prepaid grid. If we had created an equally spaced grid of 10^5 points using the prior in Equation 5, we would have the following gaps in each of the three parameter dimensions:

∆r = (90 − 1)/(10^5)^(1/3) = 1.·,  ∆σ = (0.· − 0.·)/(10^5)^(1/3) = 0.·,  ∆φ = (20 − 0)/(10^5)^(1/3) = 0.·.   (16)

We do not have an equally spaced grid, but it is expected that the quasi-Monte Carlo distribution of points creates expected gaps close to the ones in Equation 16. Therefore, it is no coincidence that the best attainable RMSE using the SL_GridML prepaid approach has the same order of magnitude as the gap size ∆, as can be seen in Table 2 for the case of T_obs = 10^5. However, Table 2 also shows that the SL_SVMML prepaid approach reaches a clearly better accuracy. The difference between the SL_GridML and the SL_SVMML prepaid approach for T_obs = 10^5 is further visualized in Figure 8.

Table 2: RMSE for the estimation of the parameters of the Ricker model for T_obs = 10^5 using the SL_GridML, SL_SVMML, and SL_LinML prepaid methods. [The numeric entries could not be recovered from the source.]

             r    σ    φ
SL_GridML    ·    ·    ·
SL_SVMML     ·    ·    ·
SL_LinML     ·    ·    ·

The results in Table 2 also show the need for a non-linear interpolator for the prepaid method: the RMSE of the linear regression interpolator (SL_LinML) is much larger than that of the SVM prepaid method.
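As a quick check of the arithmetic behind Equation 16: for an evenly spaced three-dimensional grid of Ω points, each dimension contains about Ω^(1/3) points, so the expected gap is the prior range divided by Ω^(1/3). A minimal sketch (the r and φ ranges are taken from Equation 16; the σ bounds are garbled in the source):

```python
# Expected gap per dimension for an evenly spaced 3-D grid of Omega
# points (cf. Equation 16): range / Omega^(1/3).
Omega = 100_000
per_dim = Omega ** (1 / 3)        # about 46.4 grid points per dimension
gap = lambda lo, hi: (hi - lo) / per_dim
print(gap(1.0, 90.0))             # r dimension:   about 1.9
print(gap(0.0, 20.0))             # phi dimension: about 0.43
```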
Figure 8: The estimation of the three parameters of the Ricker model for 100 data sets with T_obs = 10^5. The SL_SVMML estimation clearly outperforms the SL_GridML estimation.

In sum, we can conclude that the prepaid estimation methods lead to better, or at least similar, results as the traditional synthetic likelihood.
Speed
The largest improvement of the prepaid method over the synthetic likelihood is in computational speed: the prepaid method is many times faster. Consider Figure 2 in the main text, where it is shown that the SL_GridML prepaid method is finished before a single one of the 30,000 iterations of the SL_Orig method is done. While the SL_GridML and the SL_SVMML prepaid methods finish in 0.044 and 3.7 seconds respectively, independent of the time series length T_obs, the computation time of the SL_Orig method grows with T_obs: in each SL_Orig iteration, one needs to simulate multiple time series of length T_obs, so the larger T_obs, the slower the estimation. While the synthetic likelihood needs approximately one and a half hours to estimate the parameters for a time series of length T_obs = 10^3, the SL_GridML prepaid estimation still finishes in 0.044 s, which is more than 100,000 times faster. The speed-up factors are presented in Table 3, and as can be seen from Figure 7, there is no loss of accuracy. The speed-up would reach millions if we had the time to run the synthetic likelihood method for longer time series.

Coverage
Next, we look at the coverage rates of the confidence intervals as obtained with the bootstrap in combination with the prepaid method. To estimate a confidence interval of the estimates for the prepaid method, a parametric bootstrap with B = 1000 bootstrap samples was used. For the prepaid version, the estimate for the observed data set was obtained using the SL_SVMML approach, and the bootstrap estimates were commonly obtained using the SL_GridML prepaid method applied to the bootstrap data sets. However, if in the first 100 bootstraps only half of the nearest neighbors were unique points, the bootstrap distribution could be considered questionable. This behavior is to be expected for larger sample sizes T_obs, because the true bootstrap distribution is very peaked, so that every bootstrap sample will have the same nearest neighbor grid point. When this occurred, we estimated the parameters of each bootstrap sample using differential evolution, using the SVM created by the original 100 nearest neighbors. Alternatively, for the synthetic likelihood approach (using MCMC), we computed the confidence interval by calculating the 0.025 and 0.975 quantiles of the last half of the posterior samples.

The coverage results for the test set of 100 parameters are shown for three different values of T_obs in Table 4. It can be seen that for both methods the coverage is close to the nominal level of 95%, but the coverage of the prepaid method is slightly better.
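A parametric bootstrap of this kind can be sketched as follows; the simulate and estimate callbacks (e.g., the SL_GridML estimator) and their signatures are assumptions for illustration.

```python
import numpy as np

def bootstrap_ci(theta_hat, simulate, estimate, T_obs, B=1000, rng=None):
    """Parametric bootstrap 95% confidence intervals.

    simulate(theta, T_obs, rng) generates one synthetic data set;
    estimate(data) returns a parameter vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    boot = np.array([estimate(simulate(theta_hat, T_obs, rng))
                     for _ in range(B)])
    return np.percentile(boot, [2.5, 97.5], axis=0)
```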
Prior

In this paragraph, we show how we can benefit from using the correct prior. We estimated the parameters of three test sets for T_obs = 100, created with the uniform prior P_1 from Equation 13 and the beta distribution priors P_2 and P_3 from Equations 14 and 15. We estimated all three test sets using maximum a posteriori estimation (SL_GridMAP) under all three priors. The results are shown in Table 6. Using the correct prior leads, as expected, to the best results.
Parameter constraints across conditions
We estimated the parameters for a two-condition experimental set-up with equal r and σ, with and without the prior from Equation 12 (the parameter σ_prior was tuned on 100 similar simulated data sets). The results are shown in Table 7. Using the prior from Equation 12, which implements the parameter constraints of the experimental set-up, leads, as expected, to better results for each parameter. Even for φ, which is absent in the prior, we find better results.
Table 3: Average time in seconds needed for the SL_Orig estimation for multiple T_obs, and the speed-up factors for the SL_GridML and SL_SVMML methods. The times for the two largest T_obs were not measured; those values are estimated and given between brackets. (Figure 7 shows the corresponding accuracies.)

T_obs                       ·        ·        10^3     10^4         10^5
time SL_Orig                716 s    3549 s   5841 s   (50000 s)    (500000 s)
SL_GridML (times faster)    16273    80659    132750   (1000000)    (10000000)
SL_SVMML (times faster)     194      959      1578     (10000)      (100000)

Table 4: The effective coverages of the test set for different T_obs. [Rows for SL_Orig and the prepaid method, for the parameters r, σ, and φ at three values of T_obs; the numeric entries could not be recovered from the source.]

Real life data set

Table 5: Population dynamics of Chilo partellus [32, 26]. We show the estimates, the 95% confidence intervals, and the computation time of the prepaid (SL_GridML, SL_SVMML) and synthetic likelihood (SL_Orig) estimation techniques. [The numeric entries could not be recovered from the source.]

Table 6: RMSE for the SL_GridMAP estimation of test sets with T_obs = 100, created with priors P_1, P_2, and P_3 and estimated using priors P_1, P_2, and P_3. For each test set and parameter, the best result is shown in bold. [Only fragments of the numeric entries survive in the source: for the test set created with P_1, two legible triplets are 10, 0.12, 0.82 and 16, 0.17, 0.94; for the test set created with P_2, one legible triplet is 10, 0.13, 0.55; for the test set created with P_3, one legible triplet is 11, 0.12, 0.60. Their column positions could not be recovered.]

Table 7: RMSE for Ricker model data with T_obs = 100 for an experimental set-up with two conditions where r and σ are equal over the conditions. Parameters are estimated using SL_GridMAP with a flat prior (same as SL_GridML) and with the prior from Equation 12.

prior                 r     σ      φ
flat prior            88    0.17   0.42
prior Equation 12     61    0.11   0.36

A stochastic model of community dynamics

A second model we will apply our prepaid modeling technique to is a stochastic dispersal-limited trait-based model of community dynamics [14]. The data that will be modeled are the abundances of species (hence a vector of frequencies, in which each component is a different species). Each species in the local environment is assumed to have a competitive value dependent on its trait u, given by the filtering function

F(u) = 1 + A e^{−(u − h)² / (2σ²)}.   (17)

Here A is the maximal competitive advantage, h is the optimal trait value in the local environment, and σ describes the width of the filtering function. At each time step, one individual from the local community dies. It is then replaced, with probability 1 − I/(I + J + 1), by a random descendant from the local community; here J is the size of the local community and I is the fourth parameter to estimate, related to the amount of immigration from the regional pool into the local community. The probability that this descendant comes from a certain individual in the local community is proportional to the competitiveness of that individual. With probability I/(I + J + 1), the dead individual is replaced by an immigrant from the regional pool. The distribution of traits u of the individuals in the regional pool is assumed to be uniform over u. It is noteworthy that Jabot saw the necessity of reusing ABC simulations to reduce computation time in his recovery study [14].

The model was simulated using the C++ code from the EasyABC package [16], where a regional pool of S = 1000 species was defined evenly spaced on the trait axis (i.e., the resolution) and J = 500 was the size of the local community.
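A direct forward simulation of this death-replacement process can be sketched as follows. The trait-axis range, the initialization of the local community, and the default number of steps are assumptions for illustration; the paper uses the C++ implementation from the EasyABC package.

```python
import numpy as np

def simulate_community(I, A, h, sigma, J=500, S=1000, steps=500_000, rng=None):
    """One forward simulation of the trait model (Equation 17).

    Returns the species index of every individual in the local community.
    The regional pool has S species evenly spaced on the trait axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    traits = np.linspace(0.0, 100.0, S)     # regional trait axis (assumed range)
    F = 1.0 + A * np.exp(-(traits - h) ** 2 / (2.0 * sigma ** 2))
    comm = rng.integers(0, S, size=J)       # initial local community (assumed)
    p_immigrant = I / (I + J + 1.0)
    for _ in range(steps):
        dead = rng.integers(J)              # one individual dies
        if rng.random() < p_immigrant:      # replaced by a regional immigrant
            comm[dead] = rng.integers(0, S)
        else:                               # local descendant, weighted by F
            w = F[comm]
            comm[dead] = comm[rng.choice(J, p=w / w.sum())]
    return comm
```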
ABC estimation

We compare our prepaid method estimation with the EasyABC package (ABC_Orig) [15, 16]. Because we work in a Bayesian framework, we first have to specify priors. As in Jabot et al. [16], we use the following priors:

log(I) ∼ U(3, ·),  log(A) ∼ U(log(0.·), log(5)),  h ∼ U(−·, ·),  log(σ) ∼ U(log(0.·), log(25)).   (18)

In this application, the parameter vector θ is defined as follows: θ = (log(I), log(A), h, log(σ)). To get the ABC algorithm to work, we compute four summary statistics: the richness of the community (the number of living species), Shannon's index (which measures the entropy of the community), and the mean and the skewness of the trait distribution of the community.

The ABC algorithm we use applies a sequential parameter sampling scheme [3]. The sequence of tolerance bounds is given by ρ = {·, ·, ·, ·, 0.·, 0.·, 0.·}, and the algorithm proceeds to the next tolerance after 200 simulations have led to summary statistics within the bounds. The last 200 simulations within the bounds represent the posterior, and the estimate of the parameters is given by the posterior mean.
Creation of the prepaid grid

For the prepaid estimation, we used exactly the same summary statistics as the EasyABC package. We filled the prepaid grid with 5·10^5 parameter vectors using the priors of Equation 18, but for most examples we will use a grid with only 10^5 parameter vectors. To cover this grid as evenly as possible, the uniform distribution was approximated using Halton sequences [18, 17] (in order to avoid gaps that may appear when Monte Carlo samples are used). The creation of the prepaid grid with 5·10^5 parameter vectors took approximately 3 days on a 3.4 GHz 20-core processor.

For the community dynamics model of Equations 17 and 18, there are several ways to simulate an almost infinitely large data set in order to achieve stable summary statistics. The first way is to increase the number of species S and the size of the local pool J. Unfortunately, some summary statistics (the richness and the entropy) depend in some unknown way on these parameters. As a result, the summary statistics of a simulation with J = 5000 cannot be used to estimate the parameters for a setting where J = 500. Therefore, we chose to fix the size of the local pool J and the number of species S. It is very well possible that there are summary statistics that do not have this problem, which would make the prepaid grid much more universal. For the sake of comparison with the EasyABC package, however, we kept using these settings.

A second way to simulate data with a very large sample size is to increase the number of time steps. By computing the summary statistics after each time step, when one individual from the local community dies and is replaced by another individual, we create a time series of summary statistics. Averaging the summary statistics over a sufficiently large number of time points will lead to stable average values of these summary statistics. In our simulations, we applied some thinning by calculating the summary statistics only each time after 500 individuals have died (the size of the community), the reason being that there is not enough variation in the summary statistics computed after a single death. Next, we created time series of length T = 100,000 (so 5·10^7 individuals will have been replaced) for the prepaid grid and used the average of these summary statistics as μ̂_θ. Using these time series, we also computed Σ̂_{θ,T_prepaid} for T_prepaid = {1, ·, ·, ·}. T_prepaid = 1 is of course the setting for which the original trait model is described and for which the EasyABC algorithm is tested. Additionally, we also saved 1000 samples of time series of length T_prepaid = {1, ·, ·, ·}.
Prepaid estimation

Contrary to the first application (the Ricker model), where we used a frequentist approach, for this community dynamics model we follow a Bayesian approach. In Bayesian statistics, the focus is on the posterior distribution of the parameters p(θ | data), which is defined as follows:

p(θ | data) ∝ p(data | θ) × p(θ),   (19)

where p(data | θ) is the likelihood and p(θ) the prior. As the likelihood, we will use the synthetic likelihood p(data | θ) ≈ L_s(θ) = exp(l_s(θ)), where l_s(θ) is the synthetic log-likelihood as defined in Equation 4 (based on the vector of summary statistics s_obs). Because we compress the data into summary statistics, the posterior we work with is actually an approximation to the true posterior: p(θ | s_obs) ≈ p(θ | data) (in case the summary statistics are sufficient statistics for θ, the approximation sign becomes an equality sign). The distributions from Equation 18 are the priors for the parameters.

We have studied three variants of a Bayesian version of the prepaid method. These three versions will be discussed here in increasing order of complexity. We denote the three variants as follows: SL_GridPM, ABC_GridPM, and ABC_SVMPM.

SL_GridPM

Because the priors are all uniform (and our prepaid grid is distributed following this prior), the posterior for a data set with summary statistics s_obs at parameter θ_p of the prepaid grid is proportional to

p(θ_p | s_obs) ∝ L_s^PP(θ_p),   (20)

where L_s^PP(θ) is the prepaid synthetic likelihood (i.e., with the mean statistics computed for a very large sample and an approximate covariance matrix given by Equation 6). The posterior mean (PM) using the prepaid synthetic likelihood can be estimated as

θ̂ | s_obs = Σ_p L_s^PP(θ_p) × θ_p / Σ_p L_s^PP(θ_p).   (21)
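Equation 21 is a likelihood-weighted average over the grid; in code (with the log-likelihoods shifted for numerical stability before exponentiation):

```python
import numpy as np

def grid_posterior_mean(theta_grid, loglik):
    """Posterior mean over a uniform prepaid grid (Equation 21).

    theta_grid : (n_grid, K) prepaid parameter vectors
    loglik     : (n_grid,) prepaid synthetic log-likelihoods
    """
    w = np.exp(loglik - loglik.max())   # stabilized likelihood weights
    w /= w.sum()
    return w @ theta_grid
```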
ABC_GridPM

The prepaid synthetic likelihood approach works best if the assumption of normally distributed summary statistics is not too far off. However, as can be seen in Figure 9, this is not always the case for the trait model defined in Equation 17. Therefore, as an alternative procedure, we propose an Approximate Bayesian Computation (ABC) approach. First, we select a subset of nearest neighbors S from the prepaid set, consisting of the grid points θ_q with the highest synthetic likelihood values L_s^PP(θ_q), such that

Σ_{q∈S} L_s^PP(θ_q) / Σ_p L_s^PP(θ_p) < 0.·,   (22)

where the sum in the denominator runs across all grid points. In a sense, these are all the prepaid points in the 0.· expected coverage region according to the posterior of Equation 20. We denote the cardinality of S as Q.

In a next step, we basically perform ABC with all the grid points belonging to the selected subset S. However, there is an important issue we cannot overlook. When doing ABC, for a given parameter vector, new data are simulated of the same size as the observed data. Unfortunately, our prepaid grid contains only very large data sets. To rectify this problem, so that ABC can be applied without problems, we simulated, during the construction of the prepaid grid, a set of M = 1000 prepaid samples for several designated sample sizes (i.e., T_prepaid = {1, ·, ·, ·}). Let us denote by s_{q,i,T_prepaid} the vector of statistics for prepaid grid point q, the i-th simulation (with i = 1, ..., M), and sample size T_prepaid.

Now we can apply ABC to arrive at the posterior for θ; the method will be denoted as ABC_GridPM. For now, we will assume that T_obs is equal to one of the T_prepaid lengths. We select the 1000 samples from this Q × M samples set that have the smallest Mahalanobis distance to the observed set of statistics s_obs:

ε_{q,i,T_prepaid} = (s_{q,i,T_prepaid} − s_obs)^T W_Q⁻¹ (s_{q,i,T_prepaid} − s_obs),   (23)

where W_Q is given by the covariance over all grid points in S and over all 1000 replications (thus, over Q × 1000 samples). The finally selected 1000 samples are then considered as a sample from the posterior. Note that the ABC_GridPM method does not require us to progressively strengthen the tolerances, as in traditional ABC_Orig (governed by the tolerance parameter ρ). If the observed sample size T_obs is not equal to one of the T_prepaid lengths, one can use the samples for the length T_prepaid that is closest to T_obs on a logarithmic scale, and afterwards adjust the posterior samples such that the posterior mean stays the same, but the posterior covariance matrix changes to

Σ̂_{posterior,T_obs} = (T_prepaid / T_obs) Σ̂_{posterior,T_prepaid}.   (24)

We advise saving samples for enough different T_prepaid such that this correction is only marginal.
SVMPM
The ABC
GridPM is only based on the raw prepaid grid points. But again, a more accurate es-timation can be found by interpolating between the parameters in the prepaid grid. Therefore, we learnthe relation between the parameters and the summary statistics using LS-SVM: ˆ f svm : θ s . We onlylearn this relation in the region of interest, that is, only the 100 nearest neighbors according to the ABC GridPM approach or more specifically, the 100 prepaid points for which the most samples lead to a small enough ǫ q,i,T ABC .Before we use machine learning to infer the relation ˆ f svm : θ s we cluster these 100 nearest neighborsusing hierarchical clustering such that no cluster has more than 50 prepaid points. This is necessary as these100 nearest neighbors may come from totally different areas in the prepaid grid. This is illustrated in Figure10. For each cluster, we first make sure that at least 20 points are included (if not, we add points fromthe prepaid grid which are closest). Then we estimate the ˆ f svm : θ s using LS-SVM for each cluster c separately, giving rise to ˆ f svm,c . Next, we find the minimum volume ellipse encompassing all the pointsin each cluster. These ellipses inform us about the areas for which the relation holds. Subsequently weresample parameters in each ellipse to zoom in more and more to the regions of interests. In detail, we dothe following in every cluster c : 32. Uniformly sample 1000 points θ j,c in the minimum volume ellipse for cluster c . We create a finer gridfor each elipse.2. Find the summary statistics based on the LS-SVM in cluster c : ˆ s j,c = ˆ f svm,c ( θ j,c )
3. Find for each point $\theta_{j,c}$ the nearest point $\theta_p$ among the prepaid points with which this particular cluster was created.
4. Translate the 1000 samples from the nearest point $\theta_p$ to the newly sampled point $\theta_{j,c}$ by adding to each sample the difference in summary statistics, $d = \hat{s}_{j,c} - s_p$. In this step we approximate a distribution of statistics for $\theta_{j,c}$ centered around $\hat{s}_{j,c}$.
5. Keep the points $\theta_{j,c}$ for which $\epsilon_{j,i}$ from Equation 23 is among the 5000 smallest distances and remove all others.
6. Recalculate the minimum volume ellipse with the new points.
7. Go back to step 1, until the worst $\epsilon_{j,i}$ no longer decreases.

Broadly speaking, in step 1 we sample parameters $\theta_{j,c}$, in steps 2 to 4 we approximate the summary statistics distribution for each $\theta_{j,c}$ using LS-SVM, and in steps 5 to 7 we trim this set of parameters so as to keep only the parameters with a high posterior probability (a schematic version of this loop is sketched below).

In the end, we combine all the samples and build the posterior from the parameters of the 1000 best samples over all clusters according to Equation 23. Note that some parameters may show up several times in this posterior sample. To compute the posterior mean, we use a weighted version of these samples, with weights given by the volume of the ellipse of the cluster in which they were created. This is necessary to ensure the correct use of the equal prior across all clusters.
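A schematic version of the per-cluster loop, with `fit_svm`, `fit_ellipse` and `sample_ellipse` as stand-ins for the LS-SVM surrogate and the minimum-volume-ellipse routines (the keep rule of step 5 is simplified to the best distance per sampled point):

```python
import numpy as np
from scipy.spatial.distance import cdist

def refine_cluster(theta_c, samples_c, s_obs, W_inv,
                   fit_svm, fit_ellipse, sample_ellipse, n_new=1000):
    """Schematic version of the per-cluster zoom loop (steps 1-7).

    theta_c   : (P, d)    prepaid parameters of one cluster
    samples_c : (P, M, K) stored statistic samples of those points
    """
    f_svm = fit_svm(theta_c)                    # surrogate theta -> s
    ellipse = fit_ellipse(theta_c)
    worst_prev = np.inf
    while True:
        theta_new = sample_ellipse(ellipse, n_new)         # step 1
        s_pred = f_svm(theta_new)                          # step 2
        near = cdist(theta_new, theta_c).argmin(axis=1)    # step 3
        # step 4: recentre the nearest point's samples on the prediction
        shift = (s_pred - f_svm(theta_c)[near])[:, None, :]
        diff = samples_c[near] + shift - s_obs
        eps = np.einsum('nmk,kl,nml->nm', diff, W_inv, diff)  # Eq. 23
        best = eps.min(axis=1)
        keep = np.argsort(best)[: n_new // 2]              # step 5
        ellipse = fit_ellipse(theta_new[keep])             # step 6
        worst = best[keep].max()
        if worst >= worst_prev:                            # step 7: stop
            return theta_new[keep], eps[keep]
        worst_prev = worst
```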
Test set

To generate the test set, we follow the same logic as in [14]. We use the prior in Equation 18 to generate 1000 random parameter sets, except for $h$, for which we changed the prior to the following generating distribution:

$$h \sim U(0, 100), \qquad (25)$$

such that 0 and 100 are the true minimum and maximum optimal trait values for communities. By taking the prior for $h$ as in Equation 18, we avoid boundary effects. To exclude other problems at the borders of the parameter space, we deleted parameter sets that were within a 1% range of the bounds. We simulated data sets for both $T_{\text{obs}} = 1$ and $T_{\text{obs}} = 1000$.
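A sketch of the test-set construction; since Equation 18 is not reproduced here, the prior bounds below (other than $h \sim U(0, 100)$) are placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(2018)

# Placeholder prior bounds; only h ~ U(0, 100) is taken from the text.
BOUNDS = {'log_I': (0.0, 5.0), 'log_A': (-2.0, 2.0),
          'h': (0.0, 100.0), 'log_sigma': (-2.0, 3.0)}

def draw_test_parameters(n=1000, margin=0.01):
    """Draw n test parameter sets, dropping draws within 1% of a bound."""
    out = []
    while len(out) < n:
        draw = {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        if all(lo + margin * (hi - lo) < draw[k] < hi - margin * (hi - lo)
               for k, (lo, hi) in BOUNDS.items()):
            out.append(draw)
    return out
```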
Results

Accuracy

Let us first look at the results for $T_{\text{obs}} = 1$. We have used traditional ABC (ABC$_{\text{Orig}}$), the prepaid Bayes approach based on the synthetic likelihood (SL$_{\text{GridPM}}$), and prepaid ABC based on separately generated samples at the grid points (ABC$_{\text{GridPM}}$ and ABC$_{\text{SVMPM}}$), each with two sizes $\Omega$ of the prepaid grid. The RMSE and MAE can be found in Tables 1 and 8. All methods attain accuracies of the same order. For 3 out of 4 parameters (all except $h$), the prepaid method outperforms ABC$_{\text{Orig}}$ with respect to RMSE. For MAE, the prepaid method uniformly outperforms the EasyABC package (ABC$_{\text{Orig}}$) [15, 16]. Overall, the difference between the smaller and the larger prepaid grid is very small for the prepaid methods.

We have refrained from interpolating with the LS-SVM because the expected coverage set of Equation 22 includes on average more than 1000 points. This is perfectly logical: $T_{\text{obs}} = 1$ does not provide a lot of information and, as a consequence, there is a lot of uncertainty (which translates into a large number of parameter points with a reasonably large synthetic likelihood value). As a result, creating a posterior based on only 100 nearest neighbors (even after interpolation) would not suffice, because too many parameter points with high posterior density would be missed.

For $T_{\text{obs}} = 1000$ (see again Tables 1 and 8), the accuracy increases, as expected (this can be seen in both the RMSE and the MAE). In this case, both increasing the number of grid points $\Omega$ and using LS-SVM interpolation increase accuracy. No results are given for ABC$_{\text{Orig}}$, because it is impossible to fit the model with this sample size in an acceptable time.
Speed
For $T_{\text{obs}} = 1$, the estimation time of ABC$_{\text{Orig}}$ is 3865 s. In contrast, the estimation time of ABC$_{\text{GridPM}}$ is 0.167 s. This means that the prepaid ABC method is approximately 23,000 times faster than traditional ABC.
Coverage
For both ABC$_{\text{Orig}}$ and the prepaid versions we end up with a posterior sample. We computed the coverage by calculating the 0.025 and 0.975 quantiles of the posterior samples and then checking whether the true parameter was inside this interval or not. Note that when we use clustering during ABC$_{\text{SVMPM}}$, we weigh each point proportionally to the volume of its originating cluster's ellipse. For the SL$_{\text{GridPM}}$ approach, we use the whole prepaid set as posterior, with weights according to Equation 20.

For $T_{\text{obs}} = 1$ and $T_{\text{obs}} = 1000$, coverage results can be found in Table 9. For $T_{\text{obs}} = 1$, ABC$_{\text{Orig}}$ leads to better coverages than SL$_{\text{GridPM}}$. The ABC$_{\text{GridPM}}$ method also gives good coverages (around the nominal level of 0.95) for $T_{\text{obs}} = 1$, but these coverages deteriorate for $T_{\text{obs}} = 1000$ if no interpolation is used (coverage is a bit better for the larger grid). When the LS-SVM interpolation is applied (i.e., ABC$_{\text{SVMPM}}$), coverages become very good again, certainly for the largest number of grid points. A sketch of the coverage computation follows below.
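A minimal sketch of this coverage check for a single parameter, allowing for the weighted posterior samples used by ABC$_{\text{SVMPM}}$ and SL$_{\text{GridPM}}$:

```python
import numpy as np

def covered(theta_true, post_samples, weights=None, alpha=0.05):
    """Check whether theta_true lies in the central 95% posterior interval.

    post_samples : (n,) posterior draws for one parameter
    weights      : optional per-draw weights (e.g., ellipse volumes)
    """
    if weights is None:
        lo, hi = np.quantile(post_samples, [alpha / 2, 1 - alpha / 2])
    else:
        order = np.argsort(post_samples)
        cdf = np.cumsum(weights[order]) / np.sum(weights)
        lo = post_samples[order][np.searchsorted(cdf, alpha / 2)]
        hi = post_samples[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo <= theta_true <= hi
```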
Table 8: The MAE of the estimates on the test set of the trait model.

    T_obs   version       Ω    log(I)   log(A)   h      log(σ)
    1       ABC_Orig      /    0.11     0.45     1.4    0.45
    …       SL_GridPM     …    …        …        …      …
    …       ABC_GridPM    …    …        …        …      …
    …       ABC_SVMPM     …    …        …        …      …

Table 9: The 95% coverage of the estimates on the test set of the trait model.

    T_obs   version       Ω    log(I)   log(A)   h      log(σ)
    1       ABC_Orig      /    0.97     0.97     0.99   0.96
    …       SL_GridPM     …    …        …        …      …
    …       ABC_GridPM    …    …        …        …      …
    …       ABC_SVMPM     …    …        …        …      …
Application 3: The Leaky Competing Accumulator
Elementary decision making has been studied intensively in humans and animals [13]. A common example of an experimental paradigm is the random-motion dot task: the participant has to decide whether a collection of dots (of which only a fraction moves coherently; the others move randomly) is moving to the left or to the right. The stimuli typically have varying levels of difficulty, determined by the fraction of dots moving coherently.

Assuming there are two response options (e.g., left and right), the Leaky Competing Accumulator (LCA; [29]) consists of two evidence accumulators, $x_1(t)$ and $x_2(t)$ (where $t$ denotes time), each associated with one response option. The evolution of evidence across time in a single trial is described by the following system of two stochastic differential equations:

$$\begin{aligned} dx_1 &= (v + \Delta v_i - \gamma x_1 - \kappa x_2)\,dt + c\,dW_1 \\ dx_2 &= (v - \Delta v_i - \gamma x_2 - \kappa x_1)\,dt + c\,dW_2, \end{aligned} \qquad (26)$$

where $dW_1$ and $dW_2$ are uncorrelated white noise processes. To avoid negative values, the evidence is set to 0 whenever it becomes negative: $x_1 = \max(x_1, 0)$ and $x_2 = \max(x_2, 0)$. The initial values (at $t = 0$) are $(x_1, x_2) = (0, 0)$.

The evidence accumulation process continues until one of the accumulators crosses a boundary $a$ (with $a > 0$). The accumulator that crosses its decision boundary first determines the choice that is made, and the time of crossing is taken as the decision time. The observed choice response time is the sum of the decision time and a non-decision time $T_{er}$, which accounts for the time needed to encode the stimulus and emit the response.

Equation 26 describes the evolution of information accumulation for a two-option choice RT task, given the presentation of a single stimulus. For all stimuli, the total evidence equals $v$, but the differential evidence for option 1 compared to option 2 is $\Delta v_i$, which is stimulus dependent and reflects the stimulus difficulty. In this example, we assume the stimuli can be categorized into four levels of difficulty, hence $i = 1, \ldots, 4$.

The model gives rise to two separate choice response time probability densities, $p_{i1}(t)$ and $p_{i2}(t)$, each representing the response time distribution conditional on the choice that was made. Integrating the densities over time gives the probability of choosing each response option: $\int_0^\infty p_{i1}(t)\,dt = \Pr(\text{option 1 for stimulus } i)$ and $\int_0^\infty p_{i2}(t)\,dt = \Pr(\text{option 2 for stimulus } i)$. Obviously, taken together, $p_{i1}$ and $p_{i2}$ integrate to one.

All parameters in the parameter vector $\theta = (v, \Delta v_1, \ldots, \Delta v_4, \kappa, \gamma, a, T_{er})$ can take values from 0 to $\infty$. This parametrization is known to have one redundant parameter [21], so we choose to fix the noise scale $c$.
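For intuition, a single trial of Equation 26 can be simulated with a plain Euler-Maruyama scheme; the step size, horizon, and fixed noise scale below are illustrative values:

```python
import numpy as np

def simulate_lca_trial(v, dv, gamma, kappa, a, t_er, c=0.1,
                       dt=0.001, t_max=10.0, rng=None):
    """Simulate one trial of the two-accumulator LCA (Equation 26).

    Returns (choice, response_time); choice is 1 or 2.
    c is the fixed noise scale (one parameter is redundant [21]).
    """
    rng = rng or np.random.default_rng()
    x1 = x2 = 0.0
    sd = c * np.sqrt(dt)            # sd of the Euler noise increment
    t = 0.0
    while t < t_max:
        dW1, dW2 = rng.normal(0.0, sd, 2)
        nx1 = x1 + (v + dv - gamma * x1 - kappa * x2) * dt + dW1
        nx2 = x2 + (v - dv - gamma * x2 - kappa * x1) * dt + dW2
        x1, x2 = max(nx1, 0.0), max(nx2, 0.0)   # truncate at zero
        t += dt
        if x1 >= a or x2 >= a:                  # boundary crossing
            return (1 if x1 >= x2 else 2), t + t_er
    return None, None  # no crossing within t_max
```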
The re-parametrization

The prepaid method will not be applied to the model as presented in Equation 26, but rather to a re-parametrized formulation:

$$\begin{aligned} dx_{1t} &= D\,\left(v'(1 + C_i) - \gamma' x_{1t} - \kappa' x_{2t}\right)dt + \sqrt{D}\,dW_{1t} \\ dx_{2t} &= D\,\left(v'(1 - C_i) - \gamma' x_{2t} - \kappa' x_{1t}\right)dt + \sqrt{D}\,dW_{2t}, \end{aligned} \qquad (27)$$

again with the additional restriction that $x_{1t} = \max(x_{1t}, 0)$ and $x_{2t} = \max(x_{2t}, 0)$. The new parameters are defined as follows in terms of the original ones:

$$D = c^2, \qquad v' = \frac{v}{D}, \qquad C_i = \frac{\Delta v_i}{v}, \qquad \gamma' = \frac{\gamma}{D}, \qquad \kappa' = \frac{\kappa}{D}.$$

This new parametrization has the advantage that $D$ can be interpreted as an inverse time scale, because doubling $D$ makes all choice response times twice as fast. This property will allow us to reduce the dimensionality of the prepaid grid (see below). The parameter $v' > 0$ denotes the general stimulus strength scaled by $D$, while the parameter $C_i$ (for coherence) denotes the amount of relative evidence encoded in stimulus $i$: $-1 < C_i < 1$. It is commonly assumed for these evidence accumulator models that different stimuli lead to different coherences $C_i$ without affecting the other parameters. The non-decision time $T_{er}$ is not transformed.

Creation of the prepaid grid

For the delineation of the parameter space, we will follow the specifications of [21]. Because this parameter space is rather restrictive (a consequence of the recommendation of [21] to improve parameter recovery), we will extend it through the use of a time scale parameter. This extension will be further discussed when introducing the test set.

First, we create a prepaid grid on a four-dimensional space in the original parametrization by drawing from the following distributions (with bounds as specified by [21]):

$$a \sim U(\cdot\,, \cdot), \qquad v \sim U(\cdot\,, \cdot), \qquad \gamma \sim U(1, \cdot), \qquad \kappa \sim U(1, \cdot). \qquad (28)$$

We select 10000 grid points from this distribution using Halton sequences [18, 17]. When working in the re-parametrized version defined in Equation 27, this space can be transformed into a four-dimensional space of $v'$, $\gamma'$, $\kappa'$ and $D$.

However, because $D$ acts as an inverse time scale on the response time distributions, we may also consider the three-dimensional space formed by $v'$, $\gamma'$, and $\kappa'$ and, for each grid point, choose the parameter $D$ in such a way that the RT distributions for options 1 and 2 are scaled to fit nicely between 0 and 3 seconds (with a resolution of 1 ms and 3000 time points, so that only a small fraction of the tail mass is clipped at 3 seconds when $C = 0$). Effectively, this brings all RT distributions to the same scale (denoted as $s = 1$). This process of scaling is illustrated in Figure 11. It reduces both the number of simulations and the storage load (without it, we would have to simulate and store a separate set of distributions for each value of $D$). Note that the scaling is done jointly for all RT distributions associated with a particular $g$. The diffusion constant corresponding to the rescaled distributions is denoted as $D^g$. In addition, this construction effectively removes one parameter from the prepaid grid, as illustrated in Figure 12.

To include the coherence parameter, we extend each grid point with a set of predefined coherences. For each point $g = (v', \gamma', \kappa')$ in the grid, we take 50 equally spaced coherences $C^g_k$ (with $k = 1, \ldots, 50$), from 0 up to the maximum coherence for which option 2 still has a non-negligible probability of being selected. Finally, we simulate for each combination of $g = (v', \gamma', \kappa')$ and $C^g_k$ a large number of choice response time data (choices and response times).
This is illustrated in Figure 11. In a last step, grid points are eliminated from the prepaid grid if the simulations result in too many simultaneous arrivals (i.e., trajectories that end at or very close to the intersection point of the two absorbing boundaries in the upper right corner, located at $(a, a)$). More specifically, we drop grid points with more than 0.1 percent simultaneous arrivals. Creating the prepaid database took less than a day on an NVIDIA GeForce GTX 780 GPU.
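The grid construction described above can be sketched with SciPy's quasi-Monte Carlo module; the numeric bounds below stand in for the (elided) ranges of Equation 28 and are not the paper's values:

```python
import numpy as np
from scipy.stats import qmc

# Draw 10000 low-discrepancy grid points with a Halton sequence [17].
# Placeholder bounds for a, v, gamma, kappa (illustrative only).
lower = np.array([0.05, 0.1, 1.0, 1.0])
upper = np.array([0.30, 0.4, 10.0, 10.0])

sampler = qmc.Halton(d=4, scramble=False)
unit = sampler.random(n=10000)        # points in [0, 1)^4
grid = qmc.scale(unit, lower, upper)  # map to the parameter box

# Transform each point to the new parametrization (Equation 27).
c = 0.1                               # illustrative fixed noise scale
D = c ** 2
a, v, gamma, kappa = grid.T
v_p, gamma_p, kappa_p = v / D, gamma / D, kappa / D
```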
Prepaid estimation

To explain how the prepaid estimation of the LCA works, let us start with a prototypical experimental design. Assume a choice RT experiment with four stimulus difficulty levels (e.g., four coherences in the random dot motion task). Each difficulty level is administered $N$ times to a single participant. A particular trial in this experiment results in $(c_{ij}, t_{ij})$, where $i$ is the stimulus difficulty level ($i = 1, \ldots, 4$) and $j$ is the sequence number within that difficulty level ($j = 1, \ldots, N$). The data resulting from this experiment are responses $c_{ij}$ (referring to choice 1 or choice 2) and response times $t_{ij}$. Each pair $(c_{ij}, t_{ij})$ is considered to originate from an unknown parameter set $(v', \gamma', \kappa', D, T_{er})$ and coherences $C_i$ ($i = 1, \ldots, 4$).

Our first aim is to establish a local net of prepaid points that lead to data close to the observed data set. If necessary, we can further zoom in with the help of support vector machines. Conditional on each prepaid parameter set $g$ in the basic grid, a number of the remaining parameters can be integrated out beforehand. First, conditional on grid point $g$, we have simulated accuracies and response time distributions for 50 predetermined coherences $C^g_k$ (see Figure 11). The coherences of the observed data can be estimated solely from the observed accuracies, using simple linear interpolation; the estimated coherence for stimulus (or condition) $i$ is denoted as $\hat{C}_i$. Corresponding to each of the 50 coherences $C^g_k$ for grid point $g$, there is a pair of simulated RT densities $p^g_{ic}(t)$ (with $c = 1, 2$). As before, $p^g_{ic}(t)$ is scaled to the $[0, 3]$ seconds window, and we can use a combination of translating (estimating $\hat{T}_{er}$), scaling (estimating $\hat{D}$) and interpolating. Specifically, we first calculate $\hat{s}$ as the optimal time scale factor to match the data with the model on grid point $g$:

$$\hat{s} = \sqrt{\frac{\frac{1}{4N}\sum_{ij}\left(t_{ij} - \hat{\mu}_t\right)^2}{\frac{1}{4}\sum_{ic}\int p^g_{ic}(t)\left(t - \mu^g_t\right)^2 dt}}, \quad \text{in which} \quad \hat{\mu}_t = \frac{1}{4N}\sum_{ij} t_{ij}, \qquad \mu^g_t = \frac{1}{4}\sum_{ic}\int p^g_{ic}(t)\,t\,dt.$$

This formula capitalizes on the fact that the variance of a distribution does not change when it is simply shifted to the right by a constant. Hence, the ratio of the observed total response time variance (presumably shifted by some $T_{er}$) to the model's decision time variance (without $T_{er}$) is still an estimator of the squared scale factor between them. Using this information, we can estimate the optimal $\hat{D}$ and $\hat{T}_{er}$ for grid point $g$ as follows:

$$\hat{D} = \frac{D^g}{\hat{s}}, \qquad \hat{T}_{er} = \hat{\mu}_t - \hat{s}\,\mu^g_t,$$

with $D^g$ being the scaling diffusion constant used for optimal storage in the database. This gives us a final effective parameter vector $\left(v', \gamma', \kappa', \hat{C}_1, \hat{C}_2, \hat{C}_3, \hat{C}_4, \hat{D}, \hat{T}_{er}\right)$. Note that the last 6 elements of this vector are estimates conditional on the grid point $g = (v', \gamma', \kappa')$.
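The scale-and-shift step conditional on one grid point $g$ reduces to a few moment computations; a minimal sketch, assuming the model mean and variance at $g$ have been precomputed from the stored densities:

```python
import numpy as np

def match_scale_and_shift(t_obs, mu_g, var_g, D_g):
    """Estimate (s_hat, D_hat, T_er_hat) for one grid point g.

    t_obs : (4, N) observed response times (4 stimuli, N trials each)
    mu_g, var_g : mean and variance of the model decision times at g,
                  computed from the stored, scaled densities p_ic^g
    D_g   : the diffusion constant used to store g at scale s = 1
    """
    mu_t = t_obs.mean()
    # Shifting by T_er leaves the variance untouched, so the variance
    # ratio estimates the squared time-scale factor data vs. model.
    s_hat = np.sqrt(t_obs.var() / var_g)
    D_hat = D_g / s_hat
    T_er_hat = mu_t - s_hat * mu_g
    return s_hat, D_hat, T_er_hat
```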
Next, we have to determine the single optimal parameter set (and thus also the optimal $v'$, $\gamma'$, and $\kappa'$). For this we need an objective function that compares the model-based PDFs with those of the data. For this purpose, we use a symmetrized chi-square distance based on a set of bin statistics. For each stimulus's observed set of choice RTs, $t_i = (t_{i1}, t_{i2})$ (with $t_{i1}$ the RTs for option 1 and $t_{i2}$ those for option 2), we calculate 20 data quantiles $q_u$ (with $u = 1, \ldots, 20$) at equally spaced probability masses. The set of quantiles is appended with one extra quantile $q_0$ at a small probability mass, to obtain a more detailed representation of the leading edge of the distribution.

Based on the binning edges $(0, q_0, q_1, \ldots, q_{20}, +\infty)$, we create $4 \times 2 \times 22$ bin frequencies $\hat{b}_{icw}$, with $w = 1, \ldots, 22$. The corresponding probability masses $m^g_{icw}$ can easily be extracted from the prepaid PDFs $p^g_{ic}(t)$ as well. Observed and theoretical quantities are then combined in a symmetrized chi-square distance:

$$d(g, \{c_{ij}, t_{ij}\}) = \sum_{icw} \frac{\left(\hat{b}_{icw} - m^g_{icw}\right)^2}{\hat{b}_{icw} + m^g_{icw}}, \qquad (29)$$

which defines a distance between every grid point $g$ in the database and any data set.
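A direct transcription of Equation 29, treating $\hat{b}_{icw}$ and $m^g_{icw}$ as arrays of bin proportions (the small constant only guards against empty bins):

```python
import numpy as np

def symmetrized_chisq(b_obs, m_grid):
    """Symmetrized chi-square distance of Equation 29.

    b_obs  : (4, 2, 22) observed bin proportions per stimulus/choice/bin
    m_grid : (4, 2, 22) model bin masses for one grid point g
    """
    num = (b_obs - m_grid) ** 2
    den = b_obs + m_grid
    return np.sum(np.where(den > 0, num / np.maximum(den, 1e-12), 0.0))
```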
In the following paragraphs we present three ways of using this distance to calculate LCA estimates, each a bit more complicated (but also more accurate) than the previous one: CHISQ$_{\text{GridNN}}$, CHISQ$_{\text{GridBS}}$, and CHISQ$_{\text{SVMBS}}$.
CHISQ$_{\text{GridNN}}$
The grid point closest to the data set (as measured by the symmetrized chi-square distance function) can be used as a first nearest neighbor estimate.
CHISQ$_{\text{GridBS}}$
Not all parameters are treated equally in the estimation procedure. The parameters $C_i$, $D$ and $T_{er}$ are estimated conditionally on each grid point $g$, and the other parameters are then estimated conditionally on $\hat{C}_i$, $\hat{D}$ and $\hat{T}_{er}$. Moreover, these conditional estimates are chosen in such a way that a specific aspect of the data (e.g., the proportion of choices for option 1) is fitted perfectly (i.e., the coherence is chosen to produce choice probabilities exactly equal to the proportions observed in the data). This would be no problem for an infinite amount of data. For finite data, however, the major disadvantage of this way of working is that any error induced in this precursor step propagates through the estimation of $v'$, $\gamma'$ and $\kappa'$: the observed accuracies will typically not coincide exactly with the accuracies implied by the best model estimates, yet the estimates $\hat{C}_i$ are, on each grid point, fit exactly to the observed accuracies, so all effective grid points share this exact accuracy. We tackle this estimator bias by non-parametrically bootstrapping the data and repeating the nearest neighbor estimate for every bootstrapped data set. Taking the mean of this set of estimates (a method known as bagging [11]) gives us a more accurate estimate. Additionally, we now have a standard error of the estimate (and a confidence interval); a sketch follows below.
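In outline, the bagging step looks as follows; `nearest_neighbor_fit` is a stand-in for the CHISQ$_{\text{GridNN}}$ routine above, and the bootstrap is drawn over all trials rather than stratified per stimulus, which is a simplification:

```python
import numpy as np

def bagged_estimate(data, nearest_neighbor_fit, n_boot=200, rng=None):
    """CHISQ-GridBS: bag the nearest-neighbor estimate over bootstraps [11].

    data : (n_trials, 2) array of (choice, rt) pairs
    nearest_neighbor_fit : callable, data -> parameter vector
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    boots = np.array([nearest_neighbor_fit(data[rng.integers(0, n, n)])
                      for _ in range(n_boot)])
    est = boots.mean(axis=0)                    # bagged point estimate
    se = boots.std(axis=0, ddof=1)              # bootstrap standard error
    ci = np.quantile(boots, [0.025, 0.975], axis=0)
    return est, se, ci
```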
CHISQ$_{\text{SVMBS}}$
If we apply the bootstrap procedure, it may turn out that the grid points selected as nearest neighbors are not very diverse (this may happen with large sample sizes). In such a situation, it can be worthwhile to use an interpolator. We therefore learn a support vector machine for the bin statistics from the few unique bootstrap grid points that are available, together with the best overall unique grid points; we propose a training set of 100 grid points in total. The SVM can then be used as an approximation to the bin statistics in the space between the grid points, and hence to the objective function. We subsequently minimize the approximate SVM-based objective function for every bootstrap sample, using differential evolution [24] (as outlined above for the other applications).

Obviously, the quality of the SVM-based estimate is limited by the quality of the SVMs that are trained to learn the relation between parameters and statistics. In addition, the same SVMs are used for all bootstrap samples, which may introduce an unwanted distortion in the uncertainty assessment. To account for the systematic bias that might have been introduced by the SVMs, we add some random noise to each bootstrap estimate; the amount of random deviation added equals the size of the prediction error of the SVM. In this way, low-quality SVMs are prevented from biasing all bootstraps in the same way, and the uncertainty of the SVMs is incorporated in the final bootstrapped results.
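A sketch of the surrogate-based minimization; scikit-learn's SVR is used here as a stand-in for the tuned LS-SVM of the paper, and SciPy's differential evolution [24] minimizes the approximate objective:

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.svm import SVR

def svm_surrogate_fit(theta_train, stats_train, b_obs, bounds):
    """CHISQ-SVMBS core: minimize a surrogate of Equation 29.

    theta_train : (n, d) parameters of the ~100 selected grid points
    stats_train : (n, k) their flattened bin masses
    b_obs       : (k,)   observed flattened bin proportions
    bounds      : list of (low, high) per parameter dimension
    """
    # One SVR per bin statistic: theta -> predicted bin mass.
    models = [SVR().fit(theta_train, stats_train[:, k])
              for k in range(stats_train.shape[1])]

    def objective(theta):
        m = np.array([mod.predict(theta[None, :])[0] for mod in models])
        den = np.maximum(b_obs + m, 1e-12)
        return np.sum((b_obs - m) ** 2 / den)   # Eq. 29 on predictions

    return differential_evolution(objective, bounds).x
```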
Test set
The test set is created by uniformly sampling parameters according to Equation 28. The input differences $\Delta v_i$ are chosen to produce typical accuracies of 0.6, 0.7, 0.8, and 0.9. As in [21], excessively long PDFs (with a maximum RT larger than 5000 ms) and excessively short PDFs (with a range below 400 ms) are removed from the test set. Apart from the fact that such PDFs are deemed unrealistic for typical choice RT data [21], this part of the parameter space suffers from inherently poor parameter identifiability, with very large confidence intervals and less meaningful estimates as a consequence. Because the new parametrization analytically integrates out scale (i.e., $D$) and shift ($T_{er}$), and is positively unbounded in these dimensions, we can expand the test set to cover a broader range of distributions than those covered in [21]. To broaden the range of the test, the distributions are scaled by a random factor ranging from 0.2 to 5. We will use this broadened test set to determine the method's accuracy and coverage.

Results

Accuracy
The recoveries of the original LCA parameters are displayed in Figures 4, 13, and 14. It can be concluded that recovery is acceptable for all sample sizes, and that it improves considerably for larger sample sizes. In all cases, the recovery is dramatically better than that reported in [21]. Figures 16 and 15 show the RMSE and MAE, respectively, as a function of sample size for the three methods (for all parameters). Accuracy improves with sample size for all parameters for the single best nearest neighbor and for the bootstrap method, up to some point after which it stabilizes or deteriorates. For the SVM-based estimation, however, there is still considerable improvement at higher sample sizes.
Coverage
Figure 17 shows the coverages for different numbers of observations. Nearest neighbor bootstrap coverage appears adequate for sample sizes up to 10000; for higher sample sizes, SVMs are needed to ensure good coverage.

Acknowledgements
The research leading to the results reported in this paper was sponsored in part by the Belgian Federal Science Policy within the framework of the Interuniversity Attraction Poles program (IAP/P7/06), as well as by grant GOA/15/003 from the KU Leuven, a grant for M.M., and grant G.0806.13 from the Fund of Scientific Research Flanders.
Author contributions statement
M.M. and S.V. conceived the method; M.M., S.V., and K.M. implemented the method. T.L. and F.T.studied the method theoretically. M.M., S.V., and F.T. wrote the manuscript.
Additional information
The author(s) declare no competing interests.
References

[1] M. A. Beaumont. Approximate Bayesian Computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379–406, 2010. doi: 10.1146/annurev-ecolsys-102209-144621.

[2] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian Computation in population genetics. Genetics, 162(4):2025–2035, Dec. 2002.

[3] M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.

[4] K. Csilléry, M. G. B. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, July 2010. doi: 10.1016/j.tree.2010.04.001.

[5] M. Fasiolo and S. Wood. An introduction to synlik (2014). R package version 0.1.1. http://XXX.org.

[6] M. Fasiolo, N. Pya, and S. N. Wood. A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology. Statistical Science, 31(1):96–118, Feb. 2016. doi: 10.1214/15-STS534. URL https://projecteuclid.org/euclid.ss/1455115916.

[7] J.-D. Fermanian and B. Salanié. A nonparametric simulated maximum likelihood estimation method. Econometric Theory, 20(4):701–734, Aug. 2004. doi: 10.1017/S0266466604204054.

[8] C. Gourieroux and A. Monfort. Simulation-Based Econometric Methods. Oxford University Press, 1996.

[9] M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016. URL http://jmlr.org/papers/v17/15-017.html.

[10] A. R. Hall. Generalized Method of Moments. Oxford University Press, 2005.

[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media, Aug. 2009. ISBN 978-0-387-84858-7.

[12] D. Heard, G. Dent, T. Schifeling, and D. Banks. Agent-based models and microsimulation. Annual Review of Statistics and Its Application, 2:259–272, 2015.

[13] A. C. Huk, L. N. Katz, and J. L. Yates. The role of the lateral intraparietal area in (the study of) decision making. Annual Review of Neuroscience, 40, 2017.

[14] F. Jabot. A stochastic dispersal-limited trait-based model of community dynamics. Journal of Theoretical Biology, 262(4):650–661, Feb. 2010. doi: 10.1016/j.jtbi.2009.11.004.

[15] F. Jabot, T. Faure, and N. Dumoulin. EasyABC: performing efficient approximate Bayesian computation sampling schemes using R. Methods in Ecology and Evolution, 4(7):684–687, July 2013. doi: 10.1111/2041-210X.12050.

[16] F. Jabot, T. Faure, N. Dumoulin, and C. Albert. EasyABC: Efficient Approximate Bayesian Computation Sampling Schemes. 2015. URL https://CRAN.R-project.org/package=EasyABC.

[17] L. Kocis and W. J. Whiten. Computational investigations of low-discrepancy sequences. ACM Transactions on Mathematical Software, 23(2):266–294, June 1997. doi: 10.1145/264029.264064.

[18] MATLAB. Version 9.1.0.441655 (R2016b). The MathWorks Inc., Natick, Massachusetts, 2016.

[19] D. McFadden. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57(5):995–1026, 1989. doi: 10.2307/1913621.

[20] M. Mestdagh, S. Verdonck, K. Duisters, and F. Tuerlinckx. Fingerprint resampling: A generic method for efficient resampling. Scientific Reports, 5:16970, Nov. 2015. doi: 10.1038/srep16970.

[21] S. Miletić, B. M. Turner, B. U. Forstmann, and L. van Maanen. Parameter recovery for the Leaky Competing Accumulator model. Journal of Mathematical Psychology, 76:25–50, 2017.

[22] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill, Singapore, 1974.

[23] A. Siepel, B. Gulko, C. G. Danko, I. Gronau, and M. J. Hubisz. Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031, Oct. 2011. doi: 10.1038/ng.937.

[24] R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, Dec. 1997. doi: 10.1023/A:1008202821328.

[25] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Publishing Company, River Edge, NJ, Nov. 2002. ISBN 978-981-238-151-4.

[26] S. L. Taneja and K. Leuschner. Methods of rearing, infestations, and evaluation for Chilo partellus resistance in sorghum. ICRISAT, 1985. URL http://agris.fao.org/agris-search/search.do?recordID=QX8600091.

[27] P. Turchin. Complex Population Dynamics. Princeton University Press, 2003. ISBN 978-0-691-09021-4. URL http://press.princeton.edu/titles/7436.html.

[28] B. M. Turner, P. B. Sederberg, and J. L. McClelland. Bayesian analysis of simulation-based models. Journal of Mathematical Psychology, 72:191–199, June 2016. doi: 10.1016/j.jmp.2014.10.001.

[29] M. Usher and J. L. McClelland. The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3):550–592, 2001.

[30] B. F. Voight, C. Wijmenga, D. Wegmann, D. G. R. a. M.-a. Consortium, E. A. Stahl, F. A. S. Kurreeman, G. Trynka, H. J. Kallberg, J. Worthington, J. Gutierrez-Achury, K. A. Siminovitch, L. Alfredsson, M. I. G. Consortium, P. I. W. de Bakker, P. K. Gregersen, P. Kraft, R. Chen, R. M. Plenge, R. Do, S. Kathiresan, and S. Raychaudhuri. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics, 44(5):483, May 2012. doi: 10.1038/ng.2232.

[31] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, Aug. 2010. doi: 10.1038/nature09319.

[32] T. Yonow, D. J. Kriticos, N. Ota, J. Van Den Berg, and W. D. Hutchison. The potential global distribution of Chilo partellus, including consideration of irrigation and cropping patterns. Journal of Pest Science, 90(2):459–477, Mar. 2017. doi: 10.1007/s10340-016-0801-4.
Figure 9: Samples for $T_{\text{obs}} = 1$ of the summary statistics (entropy, average, skewness, richness) of the trait model for a single parameter set $(\log(I), \log(A), h, \log(\sigma))$.

Figure 10: Scatter plot matrix of the clustering of the 100 nearest neighbors of the summary statistics for $T_{\text{obs}} = 1000$ for a single parameter set. The four clusters are shown separately; the red cross shows the true parameter value.

Figure 11: Illustration of how different coherences are incorporated. The gray plane is a simplified representation of the three-dimensional $(v', \gamma', \kappa')$-space. For each point $g$, 50 coherences are chosen. Corresponding to each coherence, there is a pair of RT distributions (which each integrate to the probability of selecting the corresponding option).

Figure 12: Illustration of the transformation of the original parameter space (called $A$) to a new one (called $B$) in which $D$ is one of the parameters. The projections of the three parameter points onto the axis governing the width of the $B$ area are denoted by open circles; these are the parameter points $g$. For each of these open-circle points, the RT distribution scale is set to 1 (i.e., $s = 1$) by choosing an appropriate diffusion coefficient (denoted as $D^g$), and any parameter point in $B$ can be reached by selecting an appropriate $g$ and then adjusting the scale up- or downwards (indicated by the dotted lines along the length direction of the new parameter space $B$).

Figure 13: Recovery of the original parameters of the LCA model with $T_{\text{obs}} = 1000$ observations per stimulus. See Figure 4 for detailed information.

Figure 14: Recovery of the original parameters of the LCA model with $T_{\text{obs}} = 10000$ observations per stimulus. See Figure 4 for detailed information.

Figure 15: The MAE of the estimates of the parameters of the LCA as a function of sample size (abscissa) and for the different methods. More details can be found in the caption of Figure 3.

Figure 16: The RMSE of the estimates of the parameters of the LCA as a function of sample size (abscissa) and for the different methods. More details can be found in the caption of Figure 3.

Figure 17: The coverage of LCA estimates for different numbers of observations $T_{\text{obs}}$ (panels for $T_{\text{obs}} = 300$, 10000, 100000, and 1000000). Each line represents one of the nine LCA parameters and plots the fraction of estimates between the $[\alpha, 1 - \alpha]$