Prepaid parameter estimation without likelihoods
Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, Francis Tuerlinckx
KU Leuven – University of Leuven, Belgium
*[email protected]
+These authors contributed equally to this work.
December 27, 2018
Abstract
In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed-up) and accuracy, and is able to handle previously quasi-inestimable models.
Interesting nonlinear models are often analytically intractable. As a result, statistical inference has to rely on massive, time-intensive simulations. The main idea of our method is to avoid the redundancy of similar computations that typically occurs when different researchers independently fit the same model to their particular data sets. Instead, we propose to pool computational resources across the researchers interested in any given model. The prepaid method starts with an extensive simulation of data sets across the parameter space. The simulated data are compressed into summary statistics, and the relation to the parameters is learned using machine learning techniques. This results in a parameter estimation machine that produces accurate estimates very quickly (a 23,000 to 100,000-fold speed-up compared to traditional methods).
Models without an analytical likelihood are increasingly used in various disciplines, such as genetics [2], ecology [31, 6], economics [19, 7], and neuroscience [28]. For models without an analytical or easily computable likelihood, parameter estimation is a major challenge for which a variety of solutions have been proposed [31, 2, 12]. All these methods have in common that they rely on extensive Monte Carlo simulations and that their convergence can be painstakingly slow. As a result, the current methods can be very time consuming.

To date, the practice is to analyze each data set separately. However, considering all the calculations that have ever been performed during parameter estimation of a particular type of model, for each different data set, one cannot help but notice an incredible waste of resources. Indeed, simulations performed while estimating one data set may also be relevant for the estimation of another. Currently, each researcher estimating the same model with different data will start from scratch and not benefit from all the possibly relevant calculations that have already been performed in earlier analyses by other researchers, in other locations, on different hardware, and for other data sets, but concerning the same model.

Therefore, our proposal is to combine resources and create a giant online database of parameter values coupled to simulated data. Consequently, (cloud-based) interpolation techniques and global optimization methods can be used on the previously created (hence, prepaid) database for accurate and fast parameter estimation on any device. Statistical analyses that currently take up hours to days of computation time on dedicated hardware are now available to everyone within seconds.

In Figure 1 we present a graphical illustration of the prepaid parameter estimation method. First (panel A), for a representative number of parameter vectors θ, large data sets are simulated, compressed into summary statistics (i.e., s_sim), and saved, creating the prepaid grid. This prepaid grid is computed beforehand and the results are stored at a central location. Second (panel B1), the observed summary statistics s_obs are compared to the simulated summary statistics s_sim using an appropriate objective loss function d(s_sim, s_obs), and a number of nearest neighbor simulated summary statistics are selected. This loss function is related to the loss functions used in the generalized method of moments [10] and the method of simulated moments [8]. Third (panel B2), interpolation methods are used to learn the relation s = f(θ) between the parameter values and the summary statistics for the points selected in the previous step [9, 20]. In this paper, we use tuned least squares support vector machines (LS-SVM) [25]. Finally (panel B3), the objective loss function d(s_pred, s_obs), now using predicted summary statistics s_pred, is minimized as a function of the unknown parameter values using an optimizer.

Three important aspects of the prepaid method deserve special mention. First, the parameter space is required to be bounded. If this is unnatural for a given parametrization, the parameters have to be appropriately transformed to a bounded space. Second, we typically start from a uniform distribution of parameter vectors in the final parameter space. This choice determines the uniformity of the grid's resolution, but has no further implications provided the grid is sufficiently dense.
Bayesian priors can be implemented without recreating the prepaid grid, since the prior can be taken into account in the loss function. Third, often a user is not interested in a single instance of a model, but rather has data from several experimental conditions that share some parameters while others differ between conditions. In these cases too, the prepaid grid does not need to be recreated, as the parameter constraints can be included through priors with tuning parameters (i.e., penalties).

The performance of the prepaid method can be studied theoretically in simple situations (see Methods 5.1). In what follows, the prepaid method will be applied to three more complicated, realistic scenarios.

Figure 1: Graphical illustration of the prepaid parameter estimation method. Panel A: the prepaid phase (simulate and save). Panel B: the estimation phase — (1) nearest neighbors, (2) local learning, (3) method of predicted moments.
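To make the three estimation steps concrete, the following is a minimal sketch of the estimation phase (panel B) in Python. It assumes a precomputed grid of parameter vectors and summary statistics; the weight matrix W, the use of scikit-learn's SVR in place of the paper's tuned LS-SVM, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.svm import SVR

def prepaid_estimate(theta_grid, s_grid, s_obs, W, n_neighbors=100):
    """Estimate parameters from a prepaid grid (cf. panel B of Figure 1).

    theta_grid : (n_grid, K) prepaid parameter vectors
    s_grid     : (n_grid, R) simulated summary statistics
    s_obs      : (R,) observed summary statistics
    W          : (R, R) weight matrix of the quadratic loss
    """
    # B1: quadratic loss d(s_sim, s_obs) and nearest-neighbor selection
    resid = s_grid - s_obs
    d = np.einsum('ij,jk,ik->i', resid, W, resid)
    nn = np.argsort(d)[:n_neighbors]

    # B2: local learning, one interpolator per summary statistic
    models = [SVR().fit(theta_grid[nn], s_grid[nn, j])
              for j in range(s_grid.shape[1])]

    def s_pred(theta):
        return np.array([m.predict(theta[None, :])[0] for m in models])

    # B3: minimize d(s_pred, s_obs) over the local parameter box
    bounds = list(zip(theta_grid[nn].min(0), theta_grid[nn].max(0)))

    def loss(theta):
        r = s_pred(theta) - s_obs
        return r @ W @ r

    return differential_evolution(loss, bounds).x
```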
Results
Example 1: The Ricker model
In a first example, we apply our prepaid method to the Ricker model [27, 31], which describes the dynamics of the number of individuals y_t in a species over time (with t = 1 to T_obs):

y_t ∼ Pois(φ N_t),   N_{t+1} = r N_t e^{−N_t + e_t},   (1)

where e_t ∼ N(0, σ²). The variables N_t (i.e., the expected number of individuals at time t) and e_t are hidden states. Given an observed time series {y_t}_{t=1}^{T_obs}, we want to estimate the parameters θ = {r, σ, φ}, where r is the growth rate, σ the process noise, and φ a scaling parameter. The Ricker model can exhibit near-chaotic or chaotic behavior, and no explicit likelihood formula is available.

Wood [31] used the synthetic likelihood to estimate the model's parameters. In the original synthetic likelihood approach (denoted as SL_Orig), the summary statistics are assumed to follow a multivariate normal distribution, which is used to create a synthetic likelihood. The mean and covariance matrix of this normal distribution are functions of the unknown parameters and are calculated using a large number of model simulations. The synthetic likelihood is proportional to the posterior distribution, from which samples are drawn using MCMC, and a posterior mean is computed.
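For concreteness, Equation 1 can be simulated in a few lines. The sketch below is a direct transcription of the model equations; the initial hidden state N_0 = 1 is an assumption for illustration, as the paper does not specify it here.

```python
import numpy as np

def simulate_ricker(r, sigma, phi, T_obs, N0=1.0, rng=None):
    """Simulate one Ricker time series (Equation 1).

    Hidden state: N_{t+1} = r * N_t * exp(-N_t + e_t), e_t ~ N(0, sigma^2).
    Observation:  y_t ~ Poisson(phi * N_t).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = N0
    y = np.empty(T_obs, dtype=int)
    for t in range(T_obs):
        y[t] = rng.poisson(phi * N)
        e = rng.normal(0.0, sigma)
        N = r * N * np.exp(-N + e)
    return y
```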
Wood's SL_Orig approach is compared to the prepaid method, where we create a prepaid grid of the mean and the covariance matrix of a similar set of summary statistics. Prepaid estimation comes in multiple variants, depending on the use of an interpolation method. The first, which uses only the prepaid grid points and chooses the nearest neighbor (maximum synthetic likelihood) as the final estimate, will be called SL_GridML. The second, SL_SVMML, uses LS-SVM to interpolate between the parameters in the prepaid grid to increase accuracy. The differential evolution algorithm (a global optimizer; [24]) is used to maximize this interpolated synthetic (log)likelihood. Additional details on the implementation of the synthetic likelihood can be found in Methods 5.2.

Figure 2 shows both the accuracy of parameter recovery (as measured with the RMSE) and the computation time for the three methods under comparison: (1) SL_Orig as in [31], and the prepaid method (2) with (SL_SVMML) and (3) without (SL_GridML) interpolation. As can be seen in Figure 2, the prepaid estimation techniques lead to better results than the synthetic likelihood for T_obs = 1,000, both in accuracy and speed. The SL_Orig method leads to some clear outliers (see Methods 5.2), which testifies to possible convergence problems (probably due to local minima). The prepaid method suffers much less from this problem. Most striking is the speed-up of the prepaid method: the SL_GridML version of the prepaid estimation is finished before a single one of the 30,000 iterations in the synthetic likelihood method has been completed, making it 100,000 times faster. In addition, it is demonstrated that the coverages of the prepaid method confidence intervals are very close or exactly equal to the nominal value (we look at 95% bootstrap-based confidence intervals). SVM interpolation is mainly helpful for large T_obs, where one expects a higher accuracy of the estimates and the grid is too coarse. The analyses with large T_obs could only be completed in a reasonable time using the prepaid method (see Methods 5.2 for more detailed information).

In the application above, the tacitly assumed prior on the parameter space is uniform. In addition, there is only one data set, for which a single triplet of parameters (r, σ, φ) needs to be estimated. In Methods 5.2, we show how both limitations can be relaxed. First, it is explained how different priors for the Ricker model can be implemented. Second, it is discussed what can be done if there are two data sets (i.e., conditions) for which it holds that r_1 = r_2 and σ_1 = σ_2, but φ_1 and φ_2 are not related.

Finally, we also tested our estimation process on the population dynamics of Chilo partellus, extracted from Figure 1 in Taneja and Leuschner [26, 32]. Here we found that r = 1.· (95% confidence interval 1.06–1.34), σ = 0.· (95% confidence interval 0.30–0.54), and φ = 140.· (95% confidence interval 43.94–208.19). We found similar results using the synthetic likelihood method (see Methods 5.2), but our estimation was 4,000 times faster.
Figure 2: The RMSE versus the time needed for the estimation of the three parameters of the Ricker model (see Equation 1). The RMSE and time are based on 100 test data sets with T_obs = 1000. The three colors represent the three parameters (blue for r, red for σ, and yellow for φ). Solid lines represent the SL_Orig approach, dashed lines the SL_GridML approach (using only nearest neighbors), and dotted lines the SL_SVMML approach (using interpolation). The stars and the dots represent the time needed for the SL_GridML and the SL_SVMML estimation, respectively. The estimates for SL_Orig are posterior means, based on the second half of the finished MCMC iterations.
Example 2: A stochastic model of community dynamics
A second example we use to illustrate the prepaid inference method is a trait model of community dynamics [14], used to model the dispersion of species. For this model (see also the Methods section), there are four parameters to be estimated: I, A, h, and σ. As with the first application, there is no analytical expression for the likelihood [14].

As an established benchmark procedure for this trait model, we apply the widely used Approximate Bayesian Computation (ABC) method [4, 30, 23, 1] as implemented in the EasyABC package, denoted here as ABC_OrigPM (PM stands for posterior mean, which will be used as the point estimate) [16]. As priors, we use uniform distributions on bounded intervals for log(I), log(A), h, and log(σ) (see Methods 5.3 for the exact specifications), but this can easily be changed, as explained for the first example.

To allow for a direct comparison with the ABC method, and to illustrate the versatility of the prepaid method, we have also implemented three Bayesian versions of the prepaid method. The first, SL_GridPM, creates a posterior proportional to the prepaid synthetic likelihood. The second, ABC_GridPM, saves not only the mean and covariance matrix of the summary statistics for every parameter vector in the prepaid grid, but also a large set of uncompressed summary statistics; using these statistics, we are able to approximate an ABC approach. The third, ABC_SVMPM, again interpolates between the grid points to achieve a higher accuracy.

Table 1: The RMSE of the estimates of the test set of the trait model. T_obs refers to the number of observations (i.e., vectors with species frequencies) and Ω is the number of prepaid points. [Apart from one legible row for ABC_OrigPM (RMSE 0.17 for log(I), 0.67 for log(A), 7.45 for h, and 0.74 for log(σ)), the numeric entries could not be recovered from the source.]

The prepaid ABC_GridPM method is about 23,000 times faster than traditional ABC. For small sample sizes, all ABC-based methods achieve good coverage. However, for large sample sizes, ABC_OrigPM can no longer be used (because of the unduly long computation time). For the prepaid versions, it is then necessary to use SVM interpolation between the grid points to get accurate results.
Example 3: The Leaky Competing Accumulator for choice response times
In a third example, we apply our method to stochastic accumulation models for elementary decision making. In this paradigm, a person has to choose, as quickly and accurately as possible, the correct response given a stimulus (e.g., is a collection of points moving to the left or to the right?). Task difficulty is manipulated by applying different levels of stimulus ambiguity.

A popular neurally inspired model of decision making is the Leaky Competing Accumulator (LCA [29]). For two response options, two noisy evidence accumulators (stochastic differential equations, see Methods section) race each other until one of them reaches the required amount of evidence for the corresponding option to be chosen. The time that is required to reach that option's threshold is interpreted as the associated choice response time. For different levels of stimulus difficulty, the model produces different levels of accuracy and choice response time distributions. The evidence accumulation process leading up to these choices and response times is assumed to be indicative of the activation levels of neural populations involved in the decision making.

As in the first two examples, there is no analytical likelihood available that can be used to estimate the parameters of the LCA. Moreover, the LCA is an extremely difficult model to estimate. To the best of our knowledge, only [21] systematically investigated the recovery of the LCA parameters, but for a slightly different model (with three choice options) and with a method that is impractically slow for very large sample sizes, making it difficult to show near-asymptotic recovery properties.

Figure 3: The mean absolute error of the estimates of four central parameters of the LCA (common input v, leakage γ, mutual inhibition κ, evidence threshold a) as a function of sample size (abscissa) and for three different methods: (1) choosing the nearest neighbor grid point in the space of summary statistics (CHISQ_GridNN, triangles); (2) using the average of a set of nearest neighbor grid points based on bootstrap samples (CHISQ_GridBS, open circles); and (3) using SVM interpolation between the 100 nearest neighbors (CHISQ_SVMBS, crosses).

For an experiment with four stimulus difficulty levels, the LCA model has nine parameters. However, after a reparametrization of the model (but without a reduction in complexity), it is possible to reduce the prepaid space to four dimensions (see Methods 5.4) and conditionally estimate the remaining subset of the parameters with a less computationally intensive method. Three variants of the prepaid method have been implemented: taking the nearest neighboring parameter set on the prepaid grid, based on a symmetrized χ² distance between distributions (CHISQ_GridNN; a code sketch of such a distance is given at the end of this example); averaging over the grid's nearest neighboring parameter sets of 100 non-parametric bootstrap samples (CHISQ_GridBS); and using SVM interpolation for every bootstrap estimate (CHISQ_SVMBS). A nearest neighbor or bootstrap-averaged estimate completes in about a second on a Dell Precision T3600 (4 cores at 3.60 GHz); an SVM-interpolated estimate requires a couple of minutes extra.

Figure 3 displays the mean absolute error (MAE) of the estimates for four of the nine parameters as a function of sample size, separately for the three estimation methods. The results for the other parameters are similar and can be consulted in the Methods section. It can be seen that with increasing sample size, the MAE decreases. The SVM method pays off especially for larger samples. Figure 4 shows detailed recovery scatter plots for a subset of the parameters for 1,200 observed trials, which is the typical size of decision experiments. To get better recovery, larger sample sizes have to be considered (see Methods section). In general, recovery is much better than what has been reported in [21]. The coverage of the method, based on non-parametric bootstrapping, is satisfactory for all sample sizes, provided SVM-interpolated estimates are used for T_obs > 10^·. In addition, we do not find evidence for a fundamental identification issue with the two-option LCA, as has been stated in [21].

Figure 4: Parameter recovery for the LCA model with 1,200 observations (300 in each of the four difficulty conditions); the true value on the abscissa and the estimated value on the ordinate. The same parameters as in Figure 3 are shown. The method used to produce these estimates is the averaged bootstrap approach (CHISQ_GridBS, see Methods 5.4 for details).
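The symmetrized χ² distance used to match prepaid and observed choice response time distributions can take the following generic form. The exact binning and normalization used for the LCA are described in Methods 5.4; this sketch only illustrates one common definition of such a distance.

```python
import numpy as np

def sym_chisq(p, q, eps=1e-12):
    """Symmetrized chi-square distance between two binned distributions.

    p, q : histogram proportions over the same bins (e.g., choice-by-RT-bin
    cells for one difficulty condition), each summing to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```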
Methods

A toy example: theoretical analysis of the prepaid method
For a very simple setting, we want to study the performance of the prepaid method analytically. Assume y_i ∼ N(μ, s²) (i = 1, ..., T_obs), with the mean μ unknown (and to be estimated) and the standard deviation s known (so the number of parameters is K = 1). The observed mean is denoted as ȳ. We will explore two situations. In the first situation, ȳ will be our summary statistic s_obs (hence the number of summary statistics is R = 1) to estimate μ (ȳ is also a sufficient statistic for μ). In the second situation, we will study what happens if s_obs = ȳ² is chosen as the summary statistic.

Situation 1: s_obs = ȳ

As a prepaid grid, we take N_r evenly spaced μ-values with spacing or gap size ∆ = μ_{j+1} − μ_j. For each value μ_j, T_sim values of y are simulated and the sample average is computed (i.e., ȳ_j^sim). Typically, T_sim = 1000 or larger. Hence, every value of μ_j is paired with a particular ȳ_j^sim: (μ_j, ȳ_j^sim).

Given an observed ȳ, the N nearest neighbors among the simulated statistics ȳ_j^sim are selected: (μ_(1), ȳ_(1)^sim), (μ_(2), ȳ_(2)^sim), ..., (μ_(N), ȳ_(N)^sim), such that |ȳ_(1)^sim − ȳ| ≤ |ȳ_(2)^sim − ȳ| ≤ ... ≤ |ȳ_(N)^sim − ȳ|. Typically, N = 100. Because of the linearity of the problem, we can safely assume that if T_sim is large enough, the N selected μ values are all consecutive or nearly consecutive (because of noise in the prepaid simulation of ȳ^sim, it can happen that the N selected μ values are not exactly consecutive). We denote the average of these N μ-values as M_μ, and define μ_(1) ≡ min_{i∈{1,...,N}} μ_(i). If all values are exactly consecutive, M_μ can be expressed as

M_μ = (1/N) Σ_{j=1}^{N} μ_(j) = (1/N) Σ_{j=0}^{N−1} (μ_(1) + j∆) = μ_(1) + (∆/N) Σ_{j=1}^{N−1} j = μ_(1) + ∆(N − 1)/2.

In addition (again assuming that all values are exactly consecutive), their variance V_μ is given by

V_μ = (1/N) Σ_{j=1}^{N} (μ_(j) − M_μ)² = (1/N) Σ_{j=0}^{N−1} (μ_(1) + j∆)² − M_μ²
    = ∆²(N − 1)(2N − 1)/6 − ∆²(N − 1)²/4 = ∆²(N² − 1)/12 ≈ ∆²N²/12,

so that S_μ ≈ ∆N/√12.

Using the N pairs, we assume as a linear interpolator in this example a linear regression model that links the simulated statistics to the true underlying μ: ȳ_j^sim = β_0 + β_1 μ_j + ε_j, with ε_j ∼ N(0, s²/T_sim). Obviously, β_0 = 0 and β_1 = 1. Given ȳ, the N selected prepaid points, and the fitted linear regression model, we know from linear regression theory that (β̂_0, β̂_1)′ is bivariate normal with mean (0, 1)′ (the true β_0 and β_1) and

σ_0² = Var(β̂_0 | ȳ) ≈ (s²/T_sim) (1/N + 12 M_μ²/(∆²N³)),
σ_1² = Var(β̂_1 | ȳ) ≈ 12 s²/(T_sim ∆²N³),
σ_01 = Cov(β̂_0, β̂_1 | ȳ) = −M_μ σ_1² ≈ −12 s² M_μ/(T_sim ∆²N³).

This distribution is assumed to hold for repeated simulations of the replicated statistics in the prepaid grid. Because we work with linear regression, the optimization problem is simple: the optimal value of μ for a given ȳ can be found by inverting the regression line,

μ̂ = (ȳ − β̂_0)/β̂_1.

Next, we can study the properties of μ̂. We begin by calculating the conditional mean E(μ̂ | ȳ) and conditional variance Var(μ̂ | ȳ). Hence, we treat the observed data (or sample average) as given and fixed; these expectations are taken over different simulations of the ȳ_j^sim's in the prepaid grid. Before giving the expressions, it is useful to note that (ȳ − β̂_0, β̂_1)′ is bivariate normal with mean (ȳ, 1)′, variances (σ_0², σ_1²), and covariance −σ_01. A second-order expansion of the ratio then gives

E(μ̂ | ȳ) ≈ E(ȳ − β̂_0 | ȳ)/E(β̂_1 | ȳ) − Cov(ȳ − β̂_0, β̂_1 | ȳ)/E(β̂_1 | ȳ)² + E(ȳ − β̂_0 | ȳ) Var(β̂_1 | ȳ)/E(β̂_1 | ȳ)³
         ≈ ȳ − 12 s² M_μ/(T_sim ∆²N³) + ȳ · 12 s²/(T_sim ∆²N³)
         = ȳ (1 + 12 s²/(T_sim ∆²N³)) − 12 M_μ s²/(T_sim ∆²N³),

and

Var(μ̂ | ȳ) ≈ σ_0² + ȳ² σ_1² − 2 ȳ M_μ σ_1² = (s²/T_sim) (1/N + 12 (M_μ − ȳ)²/(∆²N³)).

Invoking the double expectation theorem to arrive at the unconditional expectations, we have

E(μ̂) = E[E(μ̂ | ȳ)] ≈ μ (1 + 12 s²/(T_sim ∆²N³)) − 12 M_μ s²/(T_sim ∆²N³) = μ − 12 α s²/(T_sim ∆²N³),   (2)

where α = M_μ − μ, that is, the deviation of the mean of the selected nearest neighbor μ's from the true value. Likewise, we can derive

Var(μ̂) = E[Var(μ̂ | ȳ)] + Var[E(μ̂ | ȳ)]
        ≈ s²/T_obs + s²/(T_sim N) + 12 s² (μ − M_μ)²/(T_sim ∆²N³) + 36 s⁴/(T_sim T_obs ∆²N³) + 144 s⁶/(T_sim² T_obs ∆⁴N⁶).   (3)

From Equation 2, we learn that if there is no systematic deviation in the selection of μ-grid points (α = 0), the prepaid estimator is unbiased. In the other case, the bias decreases with T_sim but is proportional to s². In Equation 3, the leading term of the variance is s²/T_obs, which is the same as in classical estimation theory. The other terms all have T_sim (or a power of it) in the denominator; because T_sim is usually quite large, these terms tend to be of lesser importance in general. However, some terms also have both N (the number of selected nearest neighbor grid points) and ∆ (the gap size) in the denominator.
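The moments derived above can be checked with a small Monte Carlo sketch of Situation 1. All numeric settings below (μ, s, T_obs, T_sim, ∆, N, and the grid range) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, s = 0.0, 1.0              # true mean, known standard deviation
T_obs, T_sim = 100, 1000           # data sample size, prepaid simulation size
delta, N = 0.05, 100               # grid gap and number of nearest neighbors
grid = np.arange(-10, 10, delta)   # prepaid mu-grid

est = []
for _ in range(2000):
    ybar = rng.normal(mu_true, s / np.sqrt(T_obs))   # observed statistic
    ysim = rng.normal(grid, s / np.sqrt(T_sim))      # prepaid statistics
    nn = np.argsort(np.abs(ysim - ybar))[:N]         # N nearest neighbors
    b1, b0 = np.polyfit(grid[nn], ysim[nn], 1)       # local linear fit
    est.append((ybar - b0) / b1)                     # invert the regression line

est = np.array(est)
var_pred = s**2 / T_obs + s**2 / (T_sim * N)         # leading terms of Equation 3
print(est.mean(), est.var(), var_pred)
```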
It is worthwhile to note that increasing the resolution (i.e., decreasing ∆) while keeping N constant will increase the additional terms and thus add to the error. The reason for this is that the interpolation is then defined on a too small grid, leading to uncertainty in the estimated regression. This effect is illustrated in the left panel of Figure 5, in which the root mean square error (RMSE) is shown for the estimation of μ for different values of N and ∆. The plot is constructed by means of a simulation study, but confirms our analytical results.

Situation 2: s_obs = ȳ²

In the second situation, we will again estimate μ (the unknown mean of a unit variance normal), but in this case s_obs = ȳ² is used as the statistic. Thus, the relation between the simulated statistics (ȳ^sim)² and μ is quadratic (and thus nonlinear). Again we use a local linear approximation. Clearly, this approximation will only be approximately valid if we do not choose the area of approximation too large. However, unlike in the first situation, we do expect an additional effect of the approximation error.

No analytical derivations were made for this case, but we conducted a similar simulation study as in Situation 1. The results (in terms of RMSE) are shown in the right panel of Figure 5. As can be seen, there is a clear optimality trade-off between ∆ and N. This can be explained as follows: fix N and then consider the gap size ∆. If ∆ is too small, we get a similar phenomenon as in the left panel, that is, a large RMSE. However, if we take ∆ too large, then the approximation error will dominate (because the linear interpolation misfits the quadratic relation). The optimal point will be different for different N.

Figure 5: RMSE (based on a simulation study) of the toy example estimation as a function of the gap size (∆) and the number of nearest neighbors selected to carry out the interpolation (N). The left panel shows Situation 1, in which s_obs = ȳ, and the right panel Situation 2 (s_obs = ȳ²). For the second situation, the trade-off between ∆ and N is clearly visible.

This toy example demonstrates the sound theoretical foundations of the prepaid method in well-behaved situations. However, the question is how well the method performs in real-life examples. Three hard problems will be studied next.

The Ricker model

The basic model equations of the Ricker model are given in Equation 1.
Synthetic likelihood estimation
For the synthetic likelihood estimation (SL_Orig), we made use of the synlik package [5]. The synthetic likelihood l_s for a data set with summary statistics s_obs and a certain parameter vector θ = (r, σ, φ) is given by

l_s(θ) = −½ (s_obs − μ̂_θ)^T Σ̂_θ⁻¹ (s_obs − μ̂_θ) − ½ log|Σ̂_θ|,   (4)

where μ̂_θ and Σ̂_θ are the estimated mean and covariance of the summary statistics when Equation 1 is simulated multiple times with parameter θ.

The statistics used by the synthetic likelihood function were the average population size, the number of zeros, the autocovariances up to lag 5, the coefficients of a quadratic autoregression of y_t^0.3, and the coefficients of the cubic regression of the ordered differences y_t − y_{t−1} on the observed values.

For each data set, we used the synthetic likelihood Markov chain Monte Carlo (MCMC) method with 30,000 iterations, a burn-in of 3 time steps, and 500 simulations to compute each μ̂_θ and Σ̂_θ [5]. We used the following prior:

r ∼ U(1, 90),  σ ∼ U(0.·, 0.·),  φ ∼ U(0, 20).   (5)

The synlik package generates the MCMC chain on a logarithmic scale; we estimated the parameters as the exponential of the posterior mean. To ensure convergence, only the last half of the chain is used (the last 15,000 iterations).
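A generic implementation of Equation 4 takes the following form; the simulate callback and its signature are assumptions for illustration (the paper itself uses the synlik package).

```python
import numpy as np

def synthetic_loglik(s_obs, simulate, theta, n_sim=500, rng=None):
    """Synthetic log-likelihood (Equation 4) for one parameter vector.

    simulate(theta, rng) must return one vector of summary statistics.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = np.array([simulate(theta, rng) for _ in range(n_sim)])  # n_sim x R
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False)
    resid = s_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * resid @ np.linalg.solve(Sigma, resid) - 0.5 * logdet
```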
Creation of the prepaid grid

For the prepaid estimation, we used the same summary statistics as for the traditional synthetic likelihood, except for two differences. First, the coefficients of the cubic regression of the ordered differences y_t − y_{t−1} on the observed values could not be used, because the observed values are not available when creating the prepaid grid. Second, we changed the number of zeros to the percentage of zeros, to make this statistic independent of T_obs (as this may change depending on the observation).

We filled the prepaid grid with 100,000 parameter sets using the priors of Equation 5. To cover this grid as evenly as possible (and avoid too large gaps), the uniform distribution was approximated using Halton sequences [18, 17]. For each parameter set in the prepaid grid, we simulated a time series of length 10^7 and used the summary statistics of this long time series as μ̂_θ.

Each time series was then split into shorter series of length T_prepaid to obtain covariance matrices Σ̂_{θ,T_prepaid} for the statistics computed on data of these lengths. This means, for example, that we had 100,000 series of length 100 to compute the covariance matrix for a certain parameter set for time series of length 100. If we need to estimate parameters of a time series with T_obs not equal to one of the T_prepaid lengths, we use the covariance matrix created with the T_prepaid closest to T_obs on a logarithmic scale and adapt the covariance matrix into

Σ̂_{θ,T_obs} = (T_prepaid / T_obs) Σ̂_{θ,T_prepaid}.   (6)

For the real-life data set, we used the prior

log(r) ∼ U(log(1), log(200)),  σ ∼ U(0.·, 0.·),  log(φ) ∼ U(−·, ·),   (7)

and filled the prepaid grid with 100,000 parameter sets.
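The covariance look-up and rescaling of Equation 6 amounts to a few lines; the dictionary-based storage below is an assumed design, not the paper's data structure.

```python
import numpy as np

def cov_for_T_obs(Sigma_prepaid, T_obs):
    """Pick the stored covariance with T_prepaid closest to T_obs on a
    log scale and rescale it as in Equation 6.

    Sigma_prepaid : dict mapping T_prepaid -> (R, R) covariance matrix
    """
    Ts = np.array(sorted(Sigma_prepaid))
    T_near = Ts[np.argmin(np.abs(np.log(Ts) - np.log(T_obs)))]
    return (T_near / T_obs) * Sigma_prepaid[T_near]
```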
Prepaid estimation

Four variants of prepaid estimation were implemented for this example. All use the negative synthetic likelihood as the distance d(s_sim, s_obs) (as defined in the main text and Figure 1).

First, we do a nearest neighbor estimation, SL_GridML, without using any interpolation between the grid points of the prepaid data set. We compute the synthetic likelihood of all the prepaid parameters for the summary statistics of the test data set. The parameter vector with the highest likelihood, the so-called nearest neighbor, may already be a good estimate. For a low number of time points T_obs, it is to be expected that the error on the parameter estimate is much larger than the gaps in the prepaid grid, and in such a case the SL_GridML estimation approach suffices.

Second, a more accurate estimate can be acquired by interpolating between the parameter values in the prepaid grid (SL_SVMML). Therefore, we learn the relation between the parameters and the summary statistics, f̂_svm: θ → s. However, we only learn this relation in the region of interest, that is, the 100 nearest neighbors according to the synthetic likelihood. For each summary statistic, we create, on the fly, a separate least squares support vector machine (LS-SVM) [25] using the 100 nearest neighbors. This machine learning technique is chosen because it is a fast non-linear method that generalizes well. We limit the predictions to the possible range of the summary statistics (e.g., to prevent a percentage of zeros, one of the statistics, larger than 1). We then use the differential evolution global optimizer [24] to find the maximum of

l_s^PP(θ) = −½ (s_obs − f̂_svm(θ))^T Σ̂_{θ,T_obs}⁻¹ (s_obs − f̂_svm(θ)) − ½ log|Σ̂_{θ,T_obs}|,   (8)

where Σ̂_{θ,T_obs} is the covariance matrix of the statistics of the nearest neighbor, as defined in Equation 6. The superscript "PP" denotes that we use the prepaid version of the synthetic likelihood, and not the traditional version as used by [31] (see Equation 4). The optimization process is constrained: we use the minima and maxima for each parameter of the 100 nearest neighbors as effective bounds.

The SL_SVMML approach makes use of a non-linear black box interpolator. However, we may also consider using a much faster linear regression (see also the toy example in Section 5.1). Therefore, we will also compare the SL_SVMML (and SL_GridML) approach to a third option where we predict the summary statistics using a linear regression (called the SL_LinML approach).

Third, we can easily implement a prior for the likelihood in Equation 4. This leads to a posterior given by

p(θ | s_obs) ∝ p(θ) L_s(θ).   (9)

The parameters are then estimated as the maximum a posteriori (MAP); note that maximum likelihood estimation is a maximum a posteriori with a uniform prior. Here we apply this extension to the nearest neighbor estimation: SL_GridMAP.

Lastly, we will show that our prepaid method can also be used to cover multiple experimental set-ups. Each experimental set-up involves multiple conditions and may have varying constraints on the parameters of the model over these conditions. If, for one experimental set-up, the conditions c are independent, the likelihood of the whole experiment is

L_s,experiment(θ_1, θ_2, ..., θ_C) = ∏_{c=1}^{C} L_s,c(θ_c),   (10)

where L_s,c(θ_c) is the synthetic likelihood for condition c. This is equivalent to estimating each parameter set θ_c individually for each condition c. If the conditions are not independent and we assume that some parameters are the same over conditions, we adapt the prior to mimic these assumptions. For example, if we assume that the first parameter θ_1 is the same over all C conditions, we formulate this as

p(θ_1, θ_2, ..., θ_C) = ∏_{c=1}^{C} N((θ_{1c} − θ̄_1)/σ_prior),   (11)

where N is the standard normal density and θ̄_1 is the average of all θ_{1c}. The smaller the tuning parameter σ_prior, the more all θ_{1c} will be forced to be equal. If σ_prior is too large, the estimation will not take into account the interdependence between the conditions. However, if it is too small, we run into trouble with the sparsity of the prepaid grid: in the limit where σ_prior goes to zero, one point is chosen from the prepaid grid, leading to equal parameters not only for the first parameter but also for the others. σ_prior can easily be tuned by simulating the experimental set-up for a certain prepaid grid.

To further illustrate, we will apply this method to the Ricker model, assuming two conditions across which r and σ stay the same, such that the prior is given by

p(θ_1, θ_2) = N((r_1 − r̄)/σ_prior) N((r_2 − r̄)/σ_prior) N((σ_1 − σ̄)/σ_prior) N((σ_2 − σ̄)/σ_prior).   (12)

We first use the nearest neighbor approach SL_GridML to find the 1,000 nearest neighbors for conditions one and two separately, and then we refine the parameters using the prior in Equation 12. As we assume r and σ to be constant over the conditions, we take r̄ and σ̄ as the final estimates for these parameters in each condition.
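On the log scale, the constraint prior of Equations 11 and 12 corresponds to a quadratic penalty added to the negative joint synthetic log-likelihood. The sketch below shows this objective for the two-condition Ricker example; the function names and signatures are assumptions for illustration.

```python
def neg_log_posterior_two_cond(theta1, theta2, loglik1, loglik2, sigma_prior):
    """Joint objective for two conditions with r and sigma shared
    (cf. Equations 10-12). theta = (r, sigma, phi); loglik_c returns the
    synthetic log-likelihood for condition c."""
    r_bar = 0.5 * (theta1[0] + theta2[0])
    s_bar = 0.5 * (theta1[1] + theta2[1])
    # Gaussian prior terms become a soft equality constraint on r and sigma
    penalty = ((theta1[0] - r_bar) ** 2 + (theta2[0] - r_bar) ** 2 +
               (theta1[1] - s_bar) ** 2 + (theta2[1] - s_bar) ** 2) / (2 * sigma_prior ** 2)
    return -(loglik1(theta1) + loglik2(theta2)) + penalty
```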
Test set

As a test set, we first used 100 random parameter sets created with the prior of Equation 5. To avoid problems with the borders, we deleted parameter sets that were within a 1% range of the bounds. We simulated data sets for T_obs = {·, ·, 10^3, 10^4, 10^5} (see Table 3). For each data set, we estimated parameters using the nearest neighbor (SL_GridML) and the SL_SVMML approach. For T_obs = 10^5, we also estimated the parameters using the SL_LinML approach. Due to time constraints, we only estimated parameters for the data with T_obs ≤ 10^3 using the traditional synthetic likelihood approach.

Next, we also created test data sets from different priors for T_obs = 100. Prior P_1 from Equation 5 can also be written as

(r − 1)/(90 − 1) ∼ Beta(1, 1),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(1, 1),  φ/20 ∼ Beta(1, 1),   (13)

where Beta is a beta distribution, here with parameters α = 1 and β = 1. Similarly, we created a test set from prior P_2,

(r − 1)/(90 − 1) ∼ Beta(10, ·),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(10, ·),  φ/20 ∼ Beta(10, ·),   (14)

and prior P_3,

(r − 1)/(90 − 1) ∼ Beta(2, ·),  (σ − 0.·)/(0.· − 0.·) ∼ Beta(10, ·),  φ/20 ∼ Beta(2, ·).   (15)

We will test whether SL_GridMAP performs best when the correct prior is used in the estimation process. Last, we also created a test set for T_obs = 100 for an experimental set-up with two conditions where r and σ are equal over the conditions.

Results
For the results, we will evaluate the methods on the following criteria: accuracy, speed, and coverage.
Accuracy
To start off, we look at the recoveries for T_obs = 10^3 for all 100 simulated data sets and the three methods (SL_Orig, SL_GridML, and SL_SVMML). Scatter plots are shown in Figure 6. It can be seen that the synthetic likelihood estimation leads to some clear outliers. One possible reason for the absence of outliers in the prepaid estimation is the fact that the prepaid estimation examines the whole grid from the start and therefore has fewer problems with getting stuck in local optima.
Figure 6: Estimated versus true parameters of the Ricker model for 100 data sets with T_obs = 1000. The SL_Orig estimation has some problems with outliers.

More generally, we plotted the accuracy of each of the methods as a function of the time series length T_obs in Figure 7. The left panel shows the root mean square error (RMSE), while the right panel shows the median absolute error (MAE). We decided to also look at the MAE because the few outliers for SL_Orig (which were shown in Figure 6) may inflate the RMSE of the synthetic likelihood disproportionally, which happens to a certain extent. However, very similar conclusions can be drawn for both performance measures. In general, accuracy increases when T_obs increases (i.e., both RMSE and MAE decrease). For RMSE, our SVM prepaid method clearly outperforms the traditional synthetic likelihood method SL_Orig for every T_obs and every parameter. For T_obs = {·, 10^3}, the SL_GridML prepaid approach also leads, for every parameter, to a lower RMSE than the synthetic likelihood. For all T_obs, the SL_SVMML prepaid method leads to a higher accuracy than the SL_GridML prepaid method, and this difference becomes larger for larger T_obs. For MAE, the SL_SVMML prepaid method and the original synthetic likelihood SL_Orig show a very similar accuracy (for T_obs ≤ 10^3). Both outperform the SL_GridML prepaid method.
Figure 7: The accuracy of all estimation methods versus the number of time points T_obs. The left panel shows the root mean square error, while the right panel shows the median absolute error. The three colors represent the three parameters: blue lines refer to the parameter r, red lines to the parameter σ, and yellow lines to the parameter φ. The solid line represents the original synthetic likelihood approach SL_Orig (stopping at T_obs = 10^3), the dashed line the SL_GridML prepaid approach, and the dotted line the SL_SVMML prepaid approach.

The largest attainable accuracy for the SL_GridML prepaid approach is limited by the spacing of the prepaid grid. If we had created an equally spaced grid of 10^5 points using the prior in Equation 5, we would have the following gaps in each of the three parameter dimensions:

∆r = (90 − 1)/(10^5)^(1/3) = 1.·,  ∆σ = (0.· − 0.·)/(10^5)^(1/3) = 0.·,  ∆φ = (20 − 0)/(10^5)^(1/3) = 0.·.   (16)

We do not have an equally spaced grid, but it is expected that the quasi-Monte Carlo distribution of points creates expected gaps close to the ones in Equation 16. Therefore, it is no coincidence that the best attainable RMSE using the SL_GridML prepaid approach has the same order of magnitude as the gap size ∆, as can be seen in Table 2 for the case of T_obs = 10^5. However, Table 2 also shows that the SL_SVMML prepaid approach reaches a clearly better accuracy. The difference between the SL_GridML and the SL_SVMML prepaid approach for T_obs = 10^5 is further visualized in Figure 8.

Table 2: RMSE for the estimation of the parameters of the Ricker model for T_obs = 10^5 using the SL_GridML, SL_SVMML, and SL_LinML prepaid methods. [The numeric entries could not be recovered from the source.]

             r    σ    φ
SL_GridML    ·    ·    ·
SL_SVMML     ·    ·    ·
SL_LinML     ·    ·    ·

The results in Table 2 also show the need for a non-linear interpolator for the prepaid method: the RMSE of the linear regression interpolator (SL_LinML) is much larger than that of the SVM prepaid method.
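As a quick check of the arithmetic behind Equation 16: for an evenly spaced three-dimensional grid of Ω points, each dimension contains about Ω^(1/3) points, so the expected gap is the prior range divided by Ω^(1/3). A minimal sketch (the r and φ ranges are taken from Equation 16; the σ bounds are garbled in the source):

```python
# Expected gap per dimension for an evenly spaced 3-D grid of Omega
# points (cf. Equation 16): range / Omega^(1/3).
Omega = 100_000
per_dim = Omega ** (1 / 3)        # about 46.4 grid points per dimension
gap = lambda lo, hi: (hi - lo) / per_dim
print(gap(1.0, 90.0))             # r dimension:   about 1.9
print(gap(0.0, 20.0))             # phi dimension: about 0.43
```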
Figure 8: The estimation of the three parameters of the Ricker model for 100 data sets with T_obs = 10^5. The SL_SVMML estimation clearly outperforms the SL_GridML estimation.

In sum, we can conclude that the prepaid estimation methods lead to better, or at least similar, results as the traditional synthetic likelihood.
Speed
The largest improvement of the prepaid method over the synthetic likelihood is in computational speed: the prepaid method is many times faster. Consider Figure 2 in the main text, where it is shown that the SL_GridML prepaid method is finished before a single one of the 30,000 iterations of the SL_Orig method is done. While the SL_GridML and the SL_SVMML prepaid methods finish in 0.044 and 3.7 seconds respectively, independent of the time series length T_obs, the computation time of the SL_Orig method grows with T_obs: in each SL_Orig iteration, one needs to simulate multiple time series of length T_obs, so the larger T_obs, the slower the estimation. While the synthetic likelihood needs approximately one and a half hours to estimate the parameters for a time series of length T_obs = 10^3, the SL_GridML prepaid estimation still finishes in 0.044 s, which is more than 100,000 times faster. The speed-up factors are presented in Table 3, and as can be seen from Figure 7, there is no loss of accuracy. The speed-up would reach millions if we had the time to run the synthetic likelihood method for longer time series.

Coverage
Next, we look at the coverage rates of the confidence intervals as obtained with the bootstrap in combination with the prepaid method. To estimate a confidence interval of the estimates for the prepaid method, a parametric bootstrap with B = 1000 bootstrap samples was used. For the prepaid version, the estimate for the observed data set was obtained using the SL_SVMML approach, and the bootstrap estimates were commonly obtained using the SL_GridML prepaid method applied to the bootstrap data sets. However, if in the first 100 bootstraps only half of the nearest neighbors were unique points, the bootstrap distribution could be considered questionable. This behavior is to be expected for larger sample sizes T_obs, because the true bootstrap distribution is very peaked, so that every bootstrap sample will have the same nearest neighbor grid point. When this occurred, we estimated the parameters of each bootstrap sample using differential evolution, using the SVM created by the original 100 nearest neighbors. Alternatively, for the synthetic likelihood approach (using MCMC), we computed the confidence interval by calculating the 0.025 and 0.975 quantiles of the last half of the posterior samples.

The coverage results for the test set of 100 parameters are shown for three different values of T_obs in Table 4. It can be seen that for both methods the coverage is close to the nominal level of 95%, but the coverage of the prepaid method is slightly better.
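A parametric bootstrap of this kind can be sketched as follows; the simulate and estimate callbacks (e.g., the SL_GridML estimator) and their signatures are assumptions for illustration.

```python
import numpy as np

def bootstrap_ci(theta_hat, simulate, estimate, T_obs, B=1000, rng=None):
    """Parametric bootstrap 95% confidence intervals.

    simulate(theta, T_obs, rng) generates one synthetic data set;
    estimate(data) returns a parameter vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    boot = np.array([estimate(simulate(theta_hat, T_obs, rng))
                     for _ in range(B)])
    return np.percentile(boot, [2.5, 97.5], axis=0)
```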
Prior

In this paragraph, we show how we can benefit from using the correct prior. We estimated the parameters of three test sets for T_obs = 100, created with the uniform prior P_1 from Equation 13 and the beta distribution priors P_2 and P_3 from Equations 14 and 15. We estimated all three test sets using maximum a posteriori estimation (SL_GridMAP) under all three priors. The results are shown in Table 6. Using the correct prior leads, as expected, to the best results.
Parameter constraints across conditions
We estimated the parameters for a two-condition experimental set-up with equal r and σ, with and without the prior from Equation 12 (the parameter σ_prior was tuned on 100 similar simulated data sets). The results are shown in Table 7. Using the prior from Equation 12, which implements the parameter constraints of the experimental set-up, leads, as expected, to better results for each parameter. Even for φ, which is absent in the prior, we find better results.
Table 3: Average time in seconds needed for the SL_Orig estimation for multiple T_obs, and the speed-up factors for the SL_GridML and SL_SVMML methods. The times for the two largest T_obs were not measured; those values are estimated and given between brackets. (Figure 7 shows the corresponding accuracies.)

T_obs                       ·        ·        10^3     10^4         10^5
time SL_Orig                716 s    3549 s   5841 s   (50000 s)    (500000 s)
SL_GridML (times faster)    16273    80659    132750   (1000000)    (10000000)
SL_SVMML (times faster)     194      959      1578     (10000)      (100000)

Table 4: The effective coverages of the test set for different T_obs. [Rows for SL_Orig and the prepaid method, for the parameters r, σ, and φ at three values of T_obs; the numeric entries could not be recovered from the source.]

Real life data set

Table 5: Population dynamics of Chilo partellus [32, 26]. We show the estimates, the 95% confidence intervals, and the computation time of the prepaid (SL_GridML, SL_SVMML) and synthetic likelihood (SL_Orig) estimation techniques. [The numeric entries could not be recovered from the source.]

Table 6: RMSE for the SL_GridMAP estimation of test sets with T_obs = 100, created with priors P_1, P_2, and P_3 and estimated using priors P_1, P_2, and P_3. For each test set and parameter, the best result is shown in bold. [Only fragments of the numeric entries survive in the source: for the test set created with P_1, two legible triplets are 10, 0.12, 0.82 and 16, 0.17, 0.94; for the test set created with P_2, one legible triplet is 10, 0.13, 0.55; for the test set created with P_3, one legible triplet is 11, 0.12, 0.60. Their column positions could not be recovered.]

Table 7: RMSE for Ricker model data with T_obs = 100 for an experimental set-up with two conditions where r and σ are equal over the conditions. Parameters are estimated using SL_GridMAP with a flat prior (same as SL_GridML) and with the prior from Equation 12.

prior                 r     σ      φ
flat prior            88    0.17   0.42
prior Equation 12     61    0.11   0.36

A stochastic model of community dynamics

A second model we will apply our prepaid modeling technique to is a stochastic dispersal-limited trait-based model of community dynamics [14]. The data that will be modeled are the abundances of species (hence a vector of frequencies, in which each component is a different species). Each species in the local environment is assumed to have a competitive value dependent on its trait u, given by the filtering function

F(u) = 1 + A e^{−(u − h)² / (2σ²)}.   (17)

Here A is the maximal competitive advantage, h is the optimal trait value in the local environment, and σ describes the width of the filtering function. At each time step, one individual from the local community dies. It is then replaced, with probability 1 − I/(I + J + 1), by a random descendant from the local community; here J is the size of the local community and I is the fourth parameter to estimate, related to the amount of immigration from the regional pool into the local community. The probability that this descendant comes from a certain individual in the local community is proportional to the competitiveness of that individual. With probability I/(I + J + 1), the dead individual is replaced by an immigrant from the regional pool. The distribution of traits u of the individuals in the regional pool is assumed to be uniform over u. It is noteworthy that Jabot saw the necessity of reusing ABC simulations to reduce computation time in his recovery study [14].

The model was simulated using the C++ code from the EasyABC package [16], where a regional pool of S = 1000 species was defined evenly spaced on the trait axis (i.e., the resolution) and J = 500 was the size of the local community.
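A direct forward simulation of this death-replacement process can be sketched as follows. The trait-axis range, the initialization of the local community, and the default number of steps are assumptions for illustration; the paper uses the C++ implementation from the EasyABC package.

```python
import numpy as np

def simulate_community(I, A, h, sigma, J=500, S=1000, steps=500_000, rng=None):
    """One forward simulation of the trait model (Equation 17).

    Returns the species index of every individual in the local community.
    The regional pool has S species evenly spaced on the trait axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    traits = np.linspace(0.0, 100.0, S)     # regional trait axis (assumed range)
    F = 1.0 + A * np.exp(-(traits - h) ** 2 / (2.0 * sigma ** 2))
    comm = rng.integers(0, S, size=J)       # initial local community (assumed)
    p_immigrant = I / (I + J + 1.0)
    for _ in range(steps):
        dead = rng.integers(J)              # one individual dies
        if rng.random() < p_immigrant:      # replaced by a regional immigrant
            comm[dead] = rng.integers(0, S)
        else:                               # local descendant, weighted by F
            w = F[comm]
            comm[dead] = comm[rng.choice(J, p=w / w.sum())]
    return comm
```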
ABC estimation

We compare our prepaid method estimation with the EasyABC package (ABC_Orig) [15, 16]. Because we work in a Bayesian framework, we first have to specify priors. As in Jabot et al. [16], we use the following priors:

log(I) ∼ U(3, ·),  log(A) ∼ U(log(0.·), log(5)),  h ∼ U(−·, ·),  log(σ) ∼ U(log(0.·), log(25)).   (18)

In this application, the parameter vector θ is defined as follows: θ = (log(I), log(A), h, log(σ)). To get the ABC algorithm to work, we compute four summary statistics: the richness of the community (the number of living species), Shannon's index (which measures the entropy of the community), and the mean and the skewness of the trait distribution of the community.

The ABC algorithm we use applies a sequential parameter sampling scheme [3]. The sequence of tolerance bounds is given by ρ = {·, ·, ·, ·, 0.·, 0.·, 0.·}, and the algorithm proceeds to the next tolerance after 200 simulations have led to summary statistics within the bounds. The last 200 simulations within the bounds represent the posterior, and the estimate of the parameters is given by the posterior mean.
Creation of the prepaid grid

For the prepaid estimation, we used exactly the same summary statistics as the EasyABC package. We filled the prepaid grid with 5·10^5 parameter vectors using the priors of Equation 18, but for most examples we will use a grid with only 10^5 parameter vectors. To cover this grid as evenly as possible, the uniform distribution was approximated using Halton sequences [18, 17] (in order to avoid gaps that may appear when Monte Carlo samples are used). The creation of the prepaid grid with 5·10^5 parameter vectors took approximately 3 days on a 3.4 GHz 20-core processor.

For the community dynamics model of Equations 17 and 18, there are several ways to simulate an almost infinitely large data set in order to achieve stable summary statistics. The first way is to increase the number of species S and the size of the local pool J. Unfortunately, some summary statistics (the richness and the entropy) depend in some unknown way on these parameters. As a result, the summary statistics of a simulation with J = 5000 cannot be used to estimate the parameters for a setting where J = 500. Therefore, we chose to fix the size of the local pool J and the number of species S. It is very well possible that there are summary statistics that do not have this problem, which would make the prepaid grid much more universal. For the sake of comparison with the EasyABC package, however, we kept using these settings.

A second way to simulate data with a very large sample size is to increase the number of time steps. By computing the summary statistics after each time step, when one individual from the local community dies and is replaced by another individual, we create a time series of summary statistics. Averaging the summary statistics over a sufficiently large number of time points will lead to stable average values of these summary statistics. In our simulations, we applied some thinning by calculating the summary statistics only each time after 500 individuals have died (the size of the community), the reason being that there is not enough variation in the summary statistics computed after a single death. Next, we created time series of length T = 100,000 (so 5·10^7 individuals will have been replaced) for the prepaid grid and used the average of these summary statistics as μ̂_θ. Using these time series, we also computed Σ̂_{θ,T_prepaid} for T_prepaid = {1, ·, ·, ·}. T_prepaid = 1 is of course the setting for which the original trait model is described and for which the EasyABC algorithm is tested. Additionally, we also saved 1000 samples of time series of length T_prepaid = {1, ·, ·, ·}.
Prepaid estimation

Contrary to the first application (the Ricker model), where we used a frequentist approach, for this community dynamics model we follow a Bayesian approach. In Bayesian statistics, the focus is on the posterior distribution of the parameters p(θ | data), which is defined as follows:

p(θ | data) ∝ p(data | θ) × p(θ),   (19)

where p(data | θ) is the likelihood and p(θ) the prior. As the likelihood, we will use the synthetic likelihood p(data | θ) ≈ L_s(θ) = exp(l_s(θ)), where l_s(θ) is the synthetic log-likelihood as defined in Equation 4 (based on the vector of summary statistics s_obs). Because we compress the data into summary statistics, the posterior we work with is actually an approximation to the true posterior: p(θ | s_obs) ≈ p(θ | data) (in case the summary statistics are sufficient statistics for θ, the approximation sign becomes an equality sign). The distributions from Equation 18 are the priors for the parameters.

We have studied three variants of a Bayesian version of the prepaid method. These three versions will be discussed here in increasing order of complexity. We denote the three variants as follows: SL_GridPM, ABC_GridPM, and ABC_SVMPM.

SL_GridPM

Because the priors are all uniform (and our prepaid grid is distributed following this prior), the posterior for a data set with summary statistics s_obs at parameter θ_p of the prepaid grid is proportional to

p(θ_p | s_obs) ∝ L_s^PP(θ_p),   (20)

where L_s^PP(θ) is the prepaid synthetic likelihood (i.e., with the mean statistics computed for a very large sample and an approximate covariance matrix given by Equation 6). The posterior mean (PM) using the prepaid synthetic likelihood can be estimated as

θ̂ | s_obs = Σ_p L_s^PP(θ_p) × θ_p / Σ_p L_s^PP(θ_p).   (21)
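Equation 21 is a likelihood-weighted average over the grid; in code (with the log-likelihoods shifted for numerical stability before exponentiation):

```python
import numpy as np

def grid_posterior_mean(theta_grid, loglik):
    """Posterior mean over a uniform prepaid grid (Equation 21).

    theta_grid : (n_grid, K) prepaid parameter vectors
    loglik     : (n_grid,) prepaid synthetic log-likelihoods
    """
    w = np.exp(loglik - loglik.max())   # stabilized likelihood weights
    w /= w.sum()
    return w @ theta_grid
```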
ABC_GridPM

The prepaid synthetic likelihood approach works best if the assumption of normally distributed summary statistics is not too far off. However, as can be seen in Figure 9, this is not always the case for the trait model defined in Equation 17. Therefore, as an alternative procedure, we propose an Approximate Bayesian Computation (ABC) approach. First, we select a subset of nearest neighbors S from the prepaid set, consisting of the grid points θ_q with the highest synthetic likelihood values L_s^PP(θ_q), such that

Σ_{q∈S} L_s^PP(θ_q) / Σ_p L_s^PP(θ_p) < 0.·,   (22)

where the sum in the denominator runs across all grid points. In a sense, these are all the prepaid points in the 0.· expected coverage region according to the posterior of Equation 20. We denote the cardinality of S as Q.

In a next step, we basically perform ABC with all the grid points belonging to the selected subset S. However, there is an important issue we cannot overlook. When doing ABC, for a given parameter vector, new data are simulated of the same size as the observed data. Unfortunately, our prepaid grid contains only very large data sets. To rectify this problem, so that ABC can be applied without problems, we simulated, during the construction of the prepaid grid, a set of M = 1000 prepaid samples for several designated sample sizes (i.e., T_prepaid = {1, ·, ·, ·}). Let us denote by s_{q,i,T_prepaid} the vector of statistics for prepaid grid point q, the i-th simulation (with i = 1, ..., M), and sample size T_prepaid.

Now we can apply ABC to arrive at the posterior for θ; the method will be denoted as ABC_GridPM. For now, we will assume that T_obs is equal to one of the T_prepaid lengths. We select the 1000 samples from this Q × M samples set that have the smallest Mahalanobis distance to the observed set of statistics s_obs:

ε_{q,i,T_prepaid} = (s_{q,i,T_prepaid} − s_obs)^T W_Q⁻¹ (s_{q,i,T_prepaid} − s_obs),   (23)

where W_Q is given by the covariance over all grid points in S and over all 1000 replications (thus, over Q × 1000 samples). The finally selected 1000 samples are then considered as a sample from the posterior. Note that the ABC_GridPM method does not require us to progressively strengthen the tolerances, as in traditional ABC_Orig (governed by the tolerance parameter ρ). If the observed sample size T_obs is not equal to one of the T_prepaid lengths, one can use the samples for the length T_prepaid that is closest to T_obs on a logarithmic scale, and afterwards adjust the posterior samples such that the posterior mean stays the same, but the posterior covariance matrix changes to

Σ̂_{posterior,T_obs} = (T_prepaid / T_obs) Σ̂_{posterior,T_prepaid}.   (24)

We advise saving samples for enough different T_prepaid such that this correction is only marginal.
SVMPM
The ABC
GridPM is only based on the raw prepaid grid points. But again, a more accurate es-timation can be found by interpolating between the parameters in the prepaid grid. Therefore, we learnthe relation between the parameters and the summary statistics using LS-SVM: ˆ f svm : θ s . We onlylearn this relation in the region of interest, that is, only the 100 nearest neighbors according to the ABC GridPM approach or more specifically, the 100 prepaid points for which the most samples lead to a small enough ǫ q,i,T ABC .Before we use machine learning to infer the relation ˆ f svm : θ s we cluster these 100 nearest neighborsusing hierarchical clustering such that no cluster has more than 50 prepaid points. This is necessary as these100 nearest neighbors may come from totally different areas in the prepaid grid. This is illustrated in Figure10. For each cluster, we first make sure that at least 20 points are included (if not, we add points fromthe prepaid grid which are closest). Then we estimate the ˆ f svm : θ s using LS-SVM for each cluster c separately, giving rise to ˆ f svm,c . Next, we find the minimum volume ellipse encompassing all the pointsin each cluster. These ellipses inform us about the areas for which the relation holds. Subsequently weresample parameters in each ellipse to zoom in more and more to the regions of interests. In detail, we dothe following in every cluster c : 32. Uniformly sample 1000 points θ j,c in the minimum volume ellipse for cluster c . We create a finer gridfor each elipse.2. Find the summary statistics based on the LS-SVM in cluster c : ˆ s j,c = ˆ f svm,c ( θ j,c )
3. Find for each point $\theta_{j,c}$ the nearest point $\theta_p$ among the prepaid points with which this particular cluster was created.
4. Translate the 1000 samples from the nearest point $\theta_p$ to the newly sampled point $\theta_{j,c}$ by adding to each sample the difference in summary statistics, $d = \hat{s}_{j,c} - s_p$. In this step we approximate a distribution of statistics for $\theta_{j,c}$ centered around $\hat{s}_{j,c}$.
5. Keep the points $\theta_{j,c}$ for which $\epsilon_{j,i}$ from Equation 23 is among the 5000 smallest distances and remove all others.
6. Recalculate the minimum volume ellipse with the new points.
7. Go back to step 1, until the worst $\epsilon_{j,i}$ no longer decreases.

Broadly speaking, in step 1 we sample parameters $\theta_{j,c}$, in steps 2 to 4 we approximate the summary statistics distribution for each $\theta_{j,c}$ using LS-SVM, and in steps 5 to 7 we trim this set of parameters so as to keep only the parameters with a high posterior probability (a schematic version of this loop is sketched below).

In the end, we combine all the samples and build the posterior from the parameters of the 1000 best samples over all clusters according to Equation 23. Note that some parameters may show up several times in this posterior sample. To compute the posterior mean, we use a weighted version of these samples, with weights given by the volume of the ellipse of the cluster in which they were created. This is necessary to ensure the correct use of the equal prior across all clusters.
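A schematic version of the per-cluster loop, with `fit_svm`, `fit_ellipse` and `sample_ellipse` as stand-ins for the LS-SVM surrogate and the minimum-volume-ellipse routines (the keep rule of step 5 is simplified to the best distance per sampled point):

```python
import numpy as np
from scipy.spatial.distance import cdist

def refine_cluster(theta_c, samples_c, s_obs, W_inv,
                   fit_svm, fit_ellipse, sample_ellipse, n_new=1000):
    """Schematic version of the per-cluster zoom loop (steps 1-7).

    theta_c   : (P, d)    prepaid parameters of one cluster
    samples_c : (P, M, K) stored statistic samples of those points
    """
    f_svm = fit_svm(theta_c)                    # surrogate theta -> s
    ellipse = fit_ellipse(theta_c)
    worst_prev = np.inf
    while True:
        theta_new = sample_ellipse(ellipse, n_new)         # step 1
        s_pred = f_svm(theta_new)                          # step 2
        near = cdist(theta_new, theta_c).argmin(axis=1)    # step 3
        # step 4: recentre the nearest point's samples on the prediction
        shift = (s_pred - f_svm(theta_c)[near])[:, None, :]
        diff = samples_c[near] + shift - s_obs
        eps = np.einsum('nmk,kl,nml->nm', diff, W_inv, diff)  # Eq. 23
        best = eps.min(axis=1)
        keep = np.argsort(best)[: n_new // 2]              # step 5
        ellipse = fit_ellipse(theta_new[keep])             # step 6
        worst = best[keep].max()
        if worst >= worst_prev:                            # step 7: stop
            return theta_new[keep], eps[keep]
        worst_prev = worst
```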
Test set

To generate the test set, we follow the same logic as in [14]. We use the prior in Equation 18 to generate 1000 random parameter sets, except for $h$, for which we changed the prior to the following generating distribution:

$$h \sim U(0, 100), \qquad (25)$$

such that 0 and 100 are the true minimum and maximum optimal trait values for communities. By taking the prior for $h$ as in Equation 18, we avoid boundary effects. To exclude other problems at the borders of the parameter space, we deleted parameter sets that were within a 1% range of the bounds. We simulated data sets for both $T_{\text{obs}} = 1$ and $T_{\text{obs}} = 1000$.
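A sketch of the test-set construction; since Equation 18 is not reproduced here, the prior bounds below (other than $h \sim U(0, 100)$) are placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(2018)

# Placeholder prior bounds; only h ~ U(0, 100) is taken from the text.
BOUNDS = {'log_I': (0.0, 5.0), 'log_A': (-2.0, 2.0),
          'h': (0.0, 100.0), 'log_sigma': (-2.0, 3.0)}

def draw_test_parameters(n=1000, margin=0.01):
    """Draw n test parameter sets, dropping draws within 1% of a bound."""
    out = []
    while len(out) < n:
        draw = {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        if all(lo + margin * (hi - lo) < draw[k] < hi - margin * (hi - lo)
               for k, (lo, hi) in BOUNDS.items()):
            out.append(draw)
    return out
```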
Results

Accuracy

Let us first look at the results for $T_{\text{obs}} = 1$. We have used traditional ABC (ABC$_{\text{Orig}}$), the prepaid Bayes approach based on the synthetic likelihood (SL$_{\text{GridPM}}$), and prepaid ABC based on separately generated samples at the grid points (ABC$_{\text{GridPM}}$ and ABC$_{\text{SVMPM}}$), each with two sizes $\Omega$ of the prepaid grid. The RMSE and MAE can be found in Tables 1 and 8. All methods attain accuracies of the same order. For 3 out of 4 parameters (all except $h$), the prepaid method outperforms ABC$_{\text{Orig}}$ with respect to RMSE. For MAE, the prepaid method uniformly outperforms the EasyABC package (ABC$_{\text{Orig}}$) [15, 16]. Overall, the difference between the smaller and the larger prepaid grid is very small for the prepaid methods.

We have refrained from interpolating with the LS-SVM because the expected coverage set of Equation 22 includes on average more than 1000 points. This is perfectly logical: $T_{\text{obs}} = 1$ does not provide a lot of information and, as a consequence, there is a lot of uncertainty (which translates into a large number of parameter points with a reasonably large synthetic likelihood value). As a result, creating a posterior based on only 100 nearest neighbors (even after interpolation) would not suffice, because too many parameter points with high posterior density would be missed.

For $T_{\text{obs}} = 1000$ (see again Tables 1 and 8), the accuracy increases, as expected (this can be seen in both the RMSE and the MAE). In this case, both increasing the number of grid points $\Omega$ and using LS-SVM interpolation increase accuracy. No results are given for ABC$_{\text{Orig}}$, because it is impossible to fit the model with this sample size in an acceptable time.
Speed
For $T_{\text{obs}} = 1$, the estimation time of ABC$_{\text{Orig}}$ is 3865 s. In contrast, the estimation time of ABC$_{\text{GridPM}}$ is 0.167 s. This means that the prepaid ABC method is approximately 23,000 times faster than traditional ABC.
Coverage
For both ABC$_{\text{Orig}}$ and the prepaid versions we end up with a posterior sample. We computed the coverage by calculating the 0.025 and 0.975 quantiles of the posterior samples and then checking whether the true parameter was inside this interval or not. Note that when we use clustering during ABC$_{\text{SVMPM}}$, we weigh each point proportionally to the volume of its originating cluster's ellipse. For the SL$_{\text{GridPM}}$ approach, we use the whole prepaid set as posterior, with weights according to Equation 20.

For $T_{\text{obs}} = 1$ and $T_{\text{obs}} = 1000$, coverage results can be found in Table 9. For $T_{\text{obs}} = 1$, ABC$_{\text{Orig}}$ leads to better coverages than SL$_{\text{GridPM}}$. The ABC$_{\text{GridPM}}$ method also gives good coverages (around the nominal level of 0.95) for $T_{\text{obs}} = 1$, but these coverages deteriorate for $T_{\text{obs}} = 1000$ if no interpolation is used (coverage is a bit better for the larger grid). When the LS-SVM interpolation is applied (i.e., ABC$_{\text{SVMPM}}$), coverages become very good again, certainly for the largest number of grid points. A sketch of the coverage computation follows below.
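A minimal sketch of this coverage check for a single parameter, allowing for the weighted posterior samples used by ABC$_{\text{SVMPM}}$ and SL$_{\text{GridPM}}$:

```python
import numpy as np

def covered(theta_true, post_samples, weights=None, alpha=0.05):
    """Check whether theta_true lies in the central 95% posterior interval.

    post_samples : (n,) posterior draws for one parameter
    weights      : optional per-draw weights (e.g., ellipse volumes)
    """
    if weights is None:
        lo, hi = np.quantile(post_samples, [alpha / 2, 1 - alpha / 2])
    else:
        order = np.argsort(post_samples)
        cdf = np.cumsum(weights[order]) / np.sum(weights)
        lo = post_samples[order][np.searchsorted(cdf, alpha / 2)]
        hi = post_samples[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo <= theta_true <= hi
```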
Table 8: The MAE of the estimates on the test set of the trait model.

    T_obs   version       Ω    log(I)   log(A)   h      log(σ)
    1       ABC_Orig      /    0.11     0.45     1.4    0.45
    …       SL_GridPM     …    …        …        …      …
    …       ABC_GridPM    …    …        …        …      …
    …       ABC_SVMPM     …    …        …        …      …

Table 9: The 95% coverage of the estimates on the test set of the trait model.

    T_obs   version       Ω    log(I)   log(A)   h      log(σ)
    1       ABC_Orig      /    0.97     0.97     0.99   0.96
    …       SL_GridPM     …    …        …        …      …
    …       ABC_GridPM    …    …        …        …      …
    …       ABC_SVMPM     …    …        …        …      …
Application 3: The Leaky Competing Accumulator
Elementary decision making has been studied intensively in humans and animals [13]. A common example of an experimental paradigm is the random-motion dot task: the participant has to decide whether a collection of dots (of which only a fraction moves coherently; the others move randomly) is moving to the left or to the right. The stimuli typically have varying levels of difficulty, determined by the fraction of dots moving coherently.

Assuming there are two response options (e.g., left and right), the Leaky Competing Accumulator (LCA; [29]) consists of two evidence accumulators, $x_1(t)$ and $x_2(t)$ (where $t$ denotes time), each associated with one response option. The evolution of evidence across time in a single trial is described by the following system of two stochastic differential equations:

$$\begin{aligned} dx_1 &= (v + \Delta v_i - \gamma x_1 - \kappa x_2)\,dt + c\,dW_1 \\ dx_2 &= (v - \Delta v_i - \gamma x_2 - \kappa x_1)\,dt + c\,dW_2, \end{aligned} \qquad (26)$$

where $dW_1$ and $dW_2$ are uncorrelated white noise processes. To avoid negative values, the evidence is set to 0 whenever it becomes negative: $x_1 = \max(x_1, 0)$ and $x_2 = \max(x_2, 0)$. The initial values (at $t = 0$) are $(x_1, x_2) = (0, 0)$.

The evidence accumulation process continues until one of the accumulators crosses a boundary $a$ (with $a > 0$). The accumulator that crosses its decision boundary first determines the choice that is made, and the time of crossing is taken as the decision time. The observed choice response time is the sum of the decision time and a non-decision time $T_{er}$, which accounts for the time needed to encode the stimulus and emit the response.

Equation 26 describes the evolution of information accumulation for a two-option choice RT task, given the presentation of a single stimulus. For all stimuli, the total evidence equals $v$, but the differential evidence for option 1 compared to option 2 is $\Delta v_i$, which is stimulus dependent and reflects the stimulus difficulty. In this example, we assume the stimuli can be categorized into four levels of difficulty, hence $i = 1, \ldots, 4$.

The model gives rise to two separate choice response time probability densities, $p_{i1}(t)$ and $p_{i2}(t)$, each representing the response time distribution conditional on the choice that was made. Integrating the densities over time gives the probability of choosing each response option: $\int_0^\infty p_{i1}(t)\,dt = \Pr(\text{option 1 for stimulus } i)$ and $\int_0^\infty p_{i2}(t)\,dt = \Pr(\text{option 2 for stimulus } i)$. Obviously, taken together, $p_{i1}$ and $p_{i2}$ integrate to one.

All parameters in the parameter vector $\theta = (v, \Delta v_1, \ldots, \Delta v_4, \kappa, \gamma, a, T_{er})$ can take values from 0 to $\infty$. This parametrization is known to have one redundant parameter [21], so we choose to fix the noise scale $c$.
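For intuition, a single trial of Equation 26 can be simulated with a plain Euler-Maruyama scheme; the step size, horizon, and fixed noise scale below are illustrative values:

```python
import numpy as np

def simulate_lca_trial(v, dv, gamma, kappa, a, t_er, c=0.1,
                       dt=0.001, t_max=10.0, rng=None):
    """Simulate one trial of the two-accumulator LCA (Equation 26).

    Returns (choice, response_time); choice is 1 or 2.
    c is the fixed noise scale (one parameter is redundant [21]).
    """
    rng = rng or np.random.default_rng()
    x1 = x2 = 0.0
    sd = c * np.sqrt(dt)            # sd of the Euler noise increment
    t = 0.0
    while t < t_max:
        dW1, dW2 = rng.normal(0.0, sd, 2)
        nx1 = x1 + (v + dv - gamma * x1 - kappa * x2) * dt + dW1
        nx2 = x2 + (v - dv - gamma * x2 - kappa * x1) * dt + dW2
        x1, x2 = max(nx1, 0.0), max(nx2, 0.0)   # truncate at zero
        t += dt
        if x1 >= a or x2 >= a:                  # boundary crossing
            return (1 if x1 >= x2 else 2), t + t_er
    return None, None  # no crossing within t_max
```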
The re-parametrization

The prepaid method will not be applied to the model as presented in Equation 26, but rather to a re-parametrized formulation:

$$\begin{aligned} dx_{1t} &= D\,\left(v'(1 + C_i) - \gamma' x_{1t} - \kappa' x_{2t}\right)dt + \sqrt{D}\,dW_{1t} \\ dx_{2t} &= D\,\left(v'(1 - C_i) - \gamma' x_{2t} - \kappa' x_{1t}\right)dt + \sqrt{D}\,dW_{2t}, \end{aligned} \qquad (27)$$

again with the additional restriction that $x_{1t} = \max(x_{1t}, 0)$ and $x_{2t} = \max(x_{2t}, 0)$. The new parameters are defined as follows in terms of the original ones:

$$D = c^2, \qquad v' = \frac{v}{D}, \qquad C_i = \frac{\Delta v_i}{v}, \qquad \gamma' = \frac{\gamma}{D}, \qquad \kappa' = \frac{\kappa}{D}.$$

This new parametrization has the advantage that $D$ can be interpreted as an inverse time scale, because doubling $D$ makes all choice response times twice as fast. This property will allow us to reduce the dimensionality of the prepaid grid (see below). The parameter $v' > 0$ denotes the general stimulus strength scaled by $D$, while the parameter $C_i$ (for coherence) denotes the amount of relative evidence encoded in stimulus $i$: $-1 < C_i < 1$. It is commonly assumed for these evidence accumulator models that different stimuli lead to different coherences $C_i$ without affecting the other parameters. The non-decision time $T_{er}$ is not transformed.

Creation of the prepaid grid

For the delineation of the parameter space, we will follow the specifications of [21]. Because this parameter space is rather restrictive (a consequence of the recommendation of [21] to improve parameter recovery), we will extend it through the use of a time scale parameter. This extension will be further discussed when introducing the test set.

First, we create a prepaid grid on a four-dimensional space in the original parametrization by drawing from the following distributions (with bounds as specified by [21]):

$$a \sim U(\cdot\,, \cdot), \qquad v \sim U(\cdot\,, \cdot), \qquad \gamma \sim U(1, \cdot), \qquad \kappa \sim U(1, \cdot). \qquad (28)$$

We select 10000 grid points from this distribution using Halton sequences [18, 17]. When working in the re-parametrized version defined in Equation 27, this space can be transformed into a four-dimensional space of $v'$, $\gamma'$, $\kappa'$ and $D$.

However, because $D$ acts as an inverse time scale on the response time distributions, we may also consider the three-dimensional space formed by $v'$, $\gamma'$, and $\kappa'$ and, for each grid point, choose the parameter $D$ in such a way that the RT distributions for options 1 and 2 are scaled to fit nicely between 0 and 3 seconds (with a resolution of 1 ms and 3000 time points, so that only a small fraction of the tail mass is clipped at 3 seconds when $C = 0$). Effectively, this brings all RT distributions to the same scale (denoted as $s = 1$). This process of scaling is illustrated in Figure 11. It reduces both the number of simulations and the storage load (without it, we would have to simulate and store a separate set of distributions for each value of $D$). Note that the scaling is done jointly for all RT distributions associated with a particular $g$. The diffusion constant corresponding to the rescaled distributions is denoted as $D^g$. In addition, this construction effectively removes one parameter from the prepaid grid, as illustrated in Figure 12.

To include the coherence parameter, we extend each grid point with a set of predefined coherences. For each point $g = (v', \gamma', \kappa')$ in the grid, we take 50 equally spaced coherences $C^g_k$ (with $k = 1, \ldots, 50$), from 0 up to the maximum coherence for which option 2 still has a non-negligible probability of being selected. Finally, we simulate for each combination of $g = (v', \gamma', \kappa')$ and $C^g_k$ a large number of choice response time data (choices and response times).
This is illustrated in Figure 11. In a last step, grid points are eliminated from the prepaid grid if the simulations result in too many simultaneous arrivals (i.e., trajectories that end at or very close to the intersection point of the two absorbing boundaries in the upper right corner, located at $(a, a)$). More specifically, we drop grid points with more than 0.1 percent simultaneous arrivals. Creating the prepaid database took less than a day on an NVIDIA GeForce GTX 780 GPU.
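The grid construction described above can be sketched with SciPy's quasi-Monte Carlo module; the numeric bounds below stand in for the (elided) ranges of Equation 28 and are not the paper's values:

```python
import numpy as np
from scipy.stats import qmc

# Draw 10000 low-discrepancy grid points with a Halton sequence [17].
# Placeholder bounds for a, v, gamma, kappa (illustrative only).
lower = np.array([0.05, 0.1, 1.0, 1.0])
upper = np.array([0.30, 0.4, 10.0, 10.0])

sampler = qmc.Halton(d=4, scramble=False)
unit = sampler.random(n=10000)        # points in [0, 1)^4
grid = qmc.scale(unit, lower, upper)  # map to the parameter box

# Transform each point to the new parametrization (Equation 27).
c = 0.1                               # illustrative fixed noise scale
D = c ** 2
a, v, gamma, kappa = grid.T
v_p, gamma_p, kappa_p = v / D, gamma / D, kappa / D
```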
Prepaid estimation

To explain how the prepaid estimation of the LCA works, let us start with a prototypical experimental design. Assume a choice RT experiment with four stimulus difficulty levels (e.g., four coherences in the random dot motion task). Each difficulty level is administered $N$ times to a single participant. A particular trial in this experiment results in $(c_{ij}, t_{ij})$, where $i$ is the stimulus difficulty level ($i = 1, \ldots, 4$) and $j$ is the sequence number within that difficulty level ($j = 1, \ldots, N$). The data resulting from this experiment are responses $c_{ij}$ (referring to choice 1 or choice 2) and response times $t_{ij}$. Each pair $(c_{ij}, t_{ij})$ is considered to originate from an unknown parameter set $(v', \gamma', \kappa', D, T_{er})$ and coherences $C_i$ ($i = 1, \ldots, 4$).

Our first aim is to establish a local net of prepaid points that lead to data close to the observed data set. If necessary, we can further zoom in with the help of support vector machines. Conditional on each prepaid parameter set $g$ in the basic grid, a number of the remaining parameters can be integrated out beforehand. First, conditional on grid point $g$, we have simulated accuracies and response time distributions for 50 predetermined coherences $C^g_k$ (see Figure 11). The coherences of the observed data can be estimated solely from the observed accuracies, using simple linear interpolation; the estimated coherence for stimulus (or condition) $i$ is denoted as $\hat{C}_i$. Corresponding to each of the 50 coherences $C^g_k$ for grid point $g$, there is a pair of simulated RT densities $p^g_{ic}(t)$ (with $c = 1, 2$). As before, $p^g_{ic}(t)$ is scaled to the $[0, 3]$ seconds window, and we can use a combination of translating (estimating $\hat{T}_{er}$), scaling (estimating $\hat{D}$) and interpolating. Specifically, we first calculate $\hat{s}$ as the optimal time scale factor to match the data with the model on grid point $g$:

$$\hat{s} = \sqrt{\frac{\frac{1}{4N}\sum_{ij}\left(t_{ij} - \hat{\mu}_t\right)^2}{\frac{1}{4}\sum_{ic}\int p^g_{ic}(t)\left(t - \mu^g_t\right)^2 dt}}, \quad \text{in which} \quad \hat{\mu}_t = \frac{1}{4N}\sum_{ij} t_{ij}, \qquad \mu^g_t = \frac{1}{4}\sum_{ic}\int p^g_{ic}(t)\,t\,dt.$$

This formula capitalizes on the fact that the variance of a distribution does not change when it is simply shifted to the right by a constant. Hence, the ratio of the observed total response time variance (presumably shifted by some $T_{er}$) to the model's decision time variance (without $T_{er}$) is still an estimator of the squared scale factor between them. Using this information, we can estimate the optimal $\hat{D}$ and $\hat{T}_{er}$ for grid point $g$ as follows:

$$\hat{D} = \frac{D^g}{\hat{s}}, \qquad \hat{T}_{er} = \hat{\mu}_t - \hat{s}\,\mu^g_t,$$

with $D^g$ being the scaling diffusion constant used for optimal storage in the database. This gives us a final effective parameter vector $\left(v', \gamma', \kappa', \hat{C}_1, \hat{C}_2, \hat{C}_3, \hat{C}_4, \hat{D}, \hat{T}_{er}\right)$. Note that the last 6 elements of this vector are estimates conditional on the grid point $g = (v', \gamma', \kappa')$.
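The scale-and-shift step conditional on one grid point $g$ reduces to a few moment computations; a minimal sketch, assuming the model mean and variance at $g$ have been precomputed from the stored densities:

```python
import numpy as np

def match_scale_and_shift(t_obs, mu_g, var_g, D_g):
    """Estimate (s_hat, D_hat, T_er_hat) for one grid point g.

    t_obs : (4, N) observed response times (4 stimuli, N trials each)
    mu_g, var_g : mean and variance of the model decision times at g,
                  computed from the stored, scaled densities p_ic^g
    D_g   : the diffusion constant used to store g at scale s = 1
    """
    mu_t = t_obs.mean()
    # Shifting by T_er leaves the variance untouched, so the variance
    # ratio estimates the squared time-scale factor data vs. model.
    s_hat = np.sqrt(t_obs.var() / var_g)
    D_hat = D_g / s_hat
    T_er_hat = mu_t - s_hat * mu_g
    return s_hat, D_hat, T_er_hat
```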
Next, we have to determine the single optimal parameter set (and thus also the optimal $v'$, $\gamma'$, and $\kappa'$). For this we need an objective function that compares the model-based PDFs with those of the data. For this purpose, we use a symmetrized chi-square distance based on a set of bin statistics. For each stimulus's observed set of choice RTs, $t_i = (t_{i1}, t_{i2})$ (with $t_{i1}$ the RTs for option 1 and $t_{i2}$ those for option 2), we calculate 20 data quantiles $q_u$ (with $u = 1, \ldots, 20$) at equally spaced probability masses. The set of quantiles is appended with one extra quantile $q_0$ at a small probability mass, to obtain a more detailed representation of the leading edge of the distribution.

Based on the binning edges $(0, q_0, q_1, \ldots, q_{20}, +\infty)$, we create $4 \times 2 \times 22$ bin frequencies $\hat{b}_{icw}$, with $w = 1, \ldots, 22$. The corresponding probability masses $m^g_{icw}$ can easily be extracted from the prepaid PDFs $p^g_{ic}(t)$ as well. Observed and theoretical quantities are then combined in a symmetrized chi-square distance:

$$d(g, \{c_{ij}, t_{ij}\}) = \sum_{icw} \frac{\left(\hat{b}_{icw} - m^g_{icw}\right)^2}{\hat{b}_{icw} + m^g_{icw}}, \qquad (29)$$

which defines a distance between every grid point $g$ in the database and any data set.
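A direct transcription of Equation 29, treating $\hat{b}_{icw}$ and $m^g_{icw}$ as arrays of bin proportions (the small constant only guards against empty bins):

```python
import numpy as np

def symmetrized_chisq(b_obs, m_grid):
    """Symmetrized chi-square distance of Equation 29.

    b_obs  : (4, 2, 22) observed bin proportions per stimulus/choice/bin
    m_grid : (4, 2, 22) model bin masses for one grid point g
    """
    num = (b_obs - m_grid) ** 2
    den = b_obs + m_grid
    return np.sum(np.where(den > 0, num / np.maximum(den, 1e-12), 0.0))
```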
In the following paragraphs we present three ways of using this distance to calculate LCA estimates, each a bit more complicated (but also more accurate) than the previous one: CHISQ$_{\text{GridNN}}$, CHISQ$_{\text{GridBS}}$, and CHISQ$_{\text{SVMBS}}$.
CHISQ$_{\text{GridNN}}$
The grid point closest to the data set (as measured by the symmetrized chi-square distance function) can be used as a first nearest neighbor estimate.
CHISQ$_{\text{GridBS}}$
Not all parameters are treated equally in the estimation procedure. The parameters $C_i$, $D$ and $T_{er}$ are estimated conditionally on each grid point $g$, and the other parameters are then estimated conditionally on $\hat{C}_i$, $\hat{D}$ and $\hat{T}_{er}$. Moreover, these conditional estimates are chosen in such a way that a specific aspect of the data (e.g., the proportion of choices for option 1) is fitted perfectly (i.e., the coherence is chosen to produce choice probabilities exactly equal to the proportions observed in the data). This would be no problem for an infinite amount of data. For finite data, however, the major disadvantage of this way of working is that any error induced in this precursor step propagates through the estimation of $v'$, $\gamma'$ and $\kappa'$: the observed accuracies will typically not coincide exactly with the accuracies implied by the best model estimates, yet the estimates $\hat{C}_i$ are, on each grid point, fit exactly to the observed accuracies, so all effective grid points share this exact accuracy. We tackle this estimator bias by non-parametrically bootstrapping the data and repeating the nearest neighbor estimate for every bootstrapped data set. Taking the mean of this set of estimates (a method known as bagging [11]) gives us a more accurate estimate. Additionally, we now have a standard error of the estimate (and a confidence interval); a sketch follows below.
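In outline, the bagging step looks as follows; `nearest_neighbor_fit` is a stand-in for the CHISQ$_{\text{GridNN}}$ routine above, and the bootstrap is drawn over all trials rather than stratified per stimulus, which is a simplification:

```python
import numpy as np

def bagged_estimate(data, nearest_neighbor_fit, n_boot=200, rng=None):
    """CHISQ-GridBS: bag the nearest-neighbor estimate over bootstraps [11].

    data : (n_trials, 2) array of (choice, rt) pairs
    nearest_neighbor_fit : callable, data -> parameter vector
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    boots = np.array([nearest_neighbor_fit(data[rng.integers(0, n, n)])
                      for _ in range(n_boot)])
    est = boots.mean(axis=0)                    # bagged point estimate
    se = boots.std(axis=0, ddof=1)              # bootstrap standard error
    ci = np.quantile(boots, [0.025, 0.975], axis=0)
    return est, se, ci
```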
CHISQ$_{\text{SVMBS}}$
If we apply the bootstrap procedure, it may turn out that the grid points selected as nearest neighbors are not very diverse (this may happen with large sample sizes). In such a situation, it can be worthwhile to use an interpolator. We therefore learn a support vector machine for the bin statistics from the few unique bootstrap grid points that are available, together with the best overall unique grid points; we propose a training set of 100 grid points in total. The SVM can then be used as an approximation to the bin statistics in the space between the grid points, and hence to the objective function. We subsequently minimize the approximate SVM-based objective function for every bootstrap sample, using differential evolution [24] (as outlined above for the other applications).

Obviously, the quality of the SVM-based estimate is limited by the quality of the SVMs that are trained to learn the relation between parameters and statistics. In addition, the same SVMs are used for all bootstrap samples, which may introduce an unwanted distortion in the uncertainty assessment. To account for the systematic bias that might have been introduced by the SVMs, we add some random noise to each bootstrap estimate; the amount of random deviation added equals the size of the prediction error of the SVM. In this way, low-quality SVMs are prevented from biasing all bootstraps in the same way, and the uncertainty of the SVMs is incorporated in the final bootstrapped results.
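A sketch of the surrogate-based minimization; scikit-learn's SVR is used here as a stand-in for the tuned LS-SVM of the paper, and SciPy's differential evolution [24] minimizes the approximate objective:

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.svm import SVR

def svm_surrogate_fit(theta_train, stats_train, b_obs, bounds):
    """CHISQ-SVMBS core: minimize a surrogate of Equation 29.

    theta_train : (n, d) parameters of the ~100 selected grid points
    stats_train : (n, k) their flattened bin masses
    b_obs       : (k,)   observed flattened bin proportions
    bounds      : list of (low, high) per parameter dimension
    """
    # One SVR per bin statistic: theta -> predicted bin mass.
    models = [SVR().fit(theta_train, stats_train[:, k])
              for k in range(stats_train.shape[1])]

    def objective(theta):
        m = np.array([mod.predict(theta[None, :])[0] for mod in models])
        den = np.maximum(b_obs + m, 1e-12)
        return np.sum((b_obs - m) ** 2 / den)   # Eq. 29 on predictions

    return differential_evolution(objective, bounds).x
```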
Test set
The test set is created by uniformly sampling parameters according to Equation 28. The input differences $\Delta v_i$ are chosen to produce typical accuracies of 0.6, 0.7, 0.8, and 0.9. As in [21], excessively long PDFs (with a maximum RT larger than 5000 ms) and excessively short PDFs (with a range below 400 ms) are removed from the test set. Apart from the fact that such PDFs are deemed unrealistic for typical choice RT data [21], this part of the parameter space suffers from inherently poor parameter identifiability, with very large confidence intervals and less meaningful estimates as a consequence. Because the new parametrization analytically integrates out scale (i.e., $D$) and shift ($T_{er}$), and is positively unbounded in these dimensions, we can expand the test set to cover a broader range of distributions than those covered in [21]. To broaden the range of the test, the distributions are scaled by a random factor ranging from 0.2 to 5. We will use this broadened test set to determine the method's accuracy and coverage.

Results

Accuracy
The recoveries of the original LCA parameters are displayed in Figures 4, 13, and 14. It can be concluded that recovery is acceptable for all sample sizes, and that it improves considerably for larger sample sizes. In all cases, the recovery is dramatically better than that reported in [21]. Figures 16 and 15 show the RMSE and MAE, respectively, as a function of sample size for the three methods (for all parameters). Accuracy improves with sample size for all parameters for the single best nearest neighbor and for the bootstrap method, up to some point after which it stabilizes or deteriorates. For the SVM-based estimation, however, there is still considerable improvement at higher sample sizes.
Coverage
Figure 17 shows the coverages for different numbers of observations. Nearest neighbor bootstrap coverage appears adequate for sample sizes up to 10000; for higher sample sizes, SVMs are needed to ensure good coverage.

Acknowledgements
The research leading to the results reported in this paper was sponsored in part by the Belgian Federal Science Policy within the framework of the Interuniversity Attraction Poles program (IAP/P7/06), as well as by grant GOA/15/003 from the KU Leuven, a grant for M.M., and grant G.0806.13 from the Fund of Scientific Research Flanders.
Author contributions statement
M.M. and S.V. conceived the method; M.M., S.V., and K.M. implemented the method. T.L. and F.T.studied the method theoretically. M.M., S.V., and F.T. wrote the manuscript.
Additional information
The author(s) declare no competing interests.
References

[1] M. A. Beaumont. Approximate Bayesian Computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379–406, 2010. doi: 10.1146/annurev-ecolsys-102209-144621.

[2] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian Computation in population genetics. Genetics, 162(4):2025–2035, Dec. 2002.

[3] M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.

[4] K. Csilléry, M. G. B. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, July 2010. doi: 10.1016/j.tree.2010.04.001.

[5] M. Fasiolo and S. Wood. An introduction to synlik (2014). R package version 0.1.1. http://XXX.org.

[6] M. Fasiolo, N. Pya, and S. N. Wood. A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology. Statistical Science, 31(1):96–118, Feb. 2016. doi: 10.1214/15-STS534. URL https://projecteuclid.org/euclid.ss/1455115916.

[7] J.-D. Fermanian and B. Salanié. A nonparametric simulated maximum likelihood estimation method. Econometric Theory, 20(4):701–734, Aug. 2004. doi: 10.1017/S0266466604204054.

[8] C. Gourieroux and A. Monfort. Simulation-Based Econometric Methods. Oxford University Press, 1996.

[9] M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016. URL http://jmlr.org/papers/v17/15-017.html.

[10] A. R. Hall. Generalized Method of Moments. Oxford University Press, 2005.

[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media, Aug. 2009. ISBN 978-0-387-84858-7.

[12] D. Heard, G. Dent, T. Schifeling, and D. Banks. Agent-based models and microsimulation. Annual Review of Statistics and Its Application, 2:259–272, 2015.

[13] A. C. Huk, L. N. Katz, and J. L. Yates. The role of the lateral intraparietal area in (the study of) decision making. Annual Review of Neuroscience, 40, 2017.

[14] F. Jabot. A stochastic dispersal-limited trait-based model of community dynamics. Journal of Theoretical Biology, 262(4):650–661, Feb. 2010. doi: 10.1016/j.jtbi.2009.11.004.

[15] F. Jabot, T. Faure, and N. Dumoulin. EasyABC: performing efficient approximate Bayesian computation sampling schemes using R. Methods in Ecology and Evolution, 4(7):684–687, July 2013. doi: 10.1111/2041-210X.12050.

[16] F. Jabot, T. Faure, N. Dumoulin, and C. Albert. EasyABC: Efficient Approximate Bayesian Computation Sampling Schemes. 2015. URL https://CRAN.R-project.org/package=EasyABC.

[17] L. Kocis and W. J. Whiten. Computational investigations of low-discrepancy sequences. ACM Transactions on Mathematical Software, 23(2):266–294, June 1997. doi: 10.1145/264029.264064.

[18] MATLAB. Version 9.1.0.441655 (R2016b). The MathWorks Inc., Natick, Massachusetts, 2016.

[19] D. McFadden. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57(5):995–1026, 1989. doi: 10.2307/1913621.

[20] M. Mestdagh, S. Verdonck, K. Duisters, and F. Tuerlinckx. Fingerprint resampling: A generic method for efficient resampling. Scientific Reports, 5:16970, Nov. 2015. doi: 10.1038/srep16970.

[21] S. Miletić, B. M. Turner, B. U. Forstmann, and L. van Maanen. Parameter recovery for the Leaky Competing Accumulator model. Journal of Mathematical Psychology, 76:25–50, 2017.

[22] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill, Singapore, 1974.

[23] A. Siepel, B. Gulko, C. G. Danko, I. Gronau, and M. J. Hubisz. Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031, Oct. 2011. doi: 10.1038/ng.937.

[24] R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, Dec. 1997. doi: 10.1023/A:1008202821328.

[25] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Publishing Company, River Edge, NJ, Nov. 2002. ISBN 978-981-238-151-4.

[26] S. L. Taneja and K. Leuschner. Methods of rearing, infestations, and evaluation for Chilo partellus resistance in sorghum. ICRISAT, 1985. URL http://agris.fao.org/agris-search/search.do?recordID=QX8600091.

[27] P. Turchin. Complex Population Dynamics. Princeton University Press, 2003. ISBN 978-0-691-09021-4. URL http://press.princeton.edu/titles/7436.html.

[28] B. M. Turner, P. B. Sederberg, and J. L. McClelland. Bayesian analysis of simulation-based models. Journal of Mathematical Psychology, 72:191–199, June 2016. doi: 10.1016/j.jmp.2014.10.001.

[29] M. Usher and J. L. McClelland. The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3):550–592, 2001.

[30] B. F. Voight, C. Wijmenga, D. Wegmann, D. G. R. a. M.-a. Consortium, E. A. Stahl, F. A. S. Kurreeman, G. Trynka, H. J. Kallberg, J. Worthington, J. Gutierrez-Achury, K. A. Siminovitch, L. Alfredsson, M. I. G. Consortium, P. I. W. de Bakker, P. K. Gregersen, P. Kraft, R. Chen, R. M. Plenge, R. Do, S. Kathiresan, and S. Raychaudhuri. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics, 44(5):483, May 2012. doi: 10.1038/ng.2232.

[31] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, Aug. 2010. doi: 10.1038/nature09319.

[32] T. Yonow, D. J. Kriticos, N. Ota, J. Van Den Berg, and W. D. Hutchison. The potential global distribution of Chilo partellus, including consideration of irrigation and cropping patterns. Journal of Pest Science, 90(2):459–477, Mar. 2017. doi: 10.1007/s10340-016-0801-4.
Figure 9: Samples for $T_{\text{obs}} = 1$ of the summary statistics (entropy, average, skewness, richness) of the trait model for a single parameter set $(\log(I), \log(A), h, \log(\sigma))$.

Figure 10: Scatter plot matrix of the clustering of the 100 nearest neighbors of the summary statistics for $T_{\text{obs}} = 1000$ for a single parameter set. The four clusters are shown separately; the red cross shows the true parameter value.

Figure 11: Illustration of how different coherences are incorporated. The gray plane is a simplified representation of the three-dimensional $(v', \gamma', \kappa')$-space. For each point $g$, 50 coherences are chosen. Corresponding to each coherence, there is a pair of RT distributions (which each integrate to the probability of selecting the corresponding option).

Figure 12: Illustration of the transformation of the original parameter space (called $A$) to a new one (called $B$) in which $D$ is one of the parameters. The projections of the three parameter points onto the axis governing the width of the $B$ area are denoted by open circles; these are the parameter points $g$. For each of these open-circle points, the RT distribution scale is set to 1 (i.e., $s = 1$) by choosing an appropriate diffusion coefficient (denoted as $D^g$), and any parameter point in $B$ can be reached by selecting an appropriate $g$ and then adjusting the scale up- or downwards (indicated by the dotted lines along the length direction of the new parameter space $B$).

Figure 13: Recovery of the original parameters of the LCA model with $T_{\text{obs}} = 1000$ observations per stimulus. See Figure 4 for detailed information.

Figure 14: Recovery of the original parameters of the LCA model with $T_{\text{obs}} = 10000$ observations per stimulus. See Figure 4 for detailed information.

Figure 15: The MAE of the estimates of the parameters of the LCA as a function of sample size (abscissa) and for the different methods. More details can be found in the caption of Figure 3.

Figure 16: The RMSE of the estimates of the parameters of the LCA as a function of sample size (abscissa) and for the different methods. More details can be found in the caption of Figure 3.

Figure 17: The coverage of LCA estimates for different numbers of observations $T_{\text{obs}}$ (panels for $T_{\text{obs}} = 300$, 10000, 100000, and 1000000). Each line represents one of the nine LCA parameters and plots the fraction of estimates between the $[\alpha, 1 - \alpha]$