Bayesian inference with tmbstan for a state-space model with VAR(1) state equation
Yihan Cao∗  Jarle Tufto∗

Both frequentist and Bayesian statistical inference have been used for investigating ecological processes. In the frequentist framework, Template Model Builder (TMB, Kristensen et al., 2016), an R package developed for fast fitting of complex linear or nonlinear mixed models, has recently gained popularity, especially in ecology, where modeling complicated ecological processes is common (for example Cadigan, 2015; Albertsen et al., 2016; Auger-Méthé et al., 2017). The combination of reverse-mode automatic differentiation and the Laplace approximation for high-dimensional integrals makes parameter estimation with TMB very efficient, even for non-Gaussian and complex hierarchical models. TMB provides a flexible framework for model formulation and can be used even for statistical models in which the predictor is nonlinear in the parameters and random effects. However, its inability to work within the Bayesian framework has hindered its adoption by Bayesians. Within the Bayesian framework, the software package
Stan (Gelman et al., 2015), a probabilistic programming language for statistical inference written in C++, has attracted attention. It uses the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2014), an adaptive extension of Hamiltonian Monte Carlo (Neal et al., 2011), itself a generalization of the familiar Metropolis algorithm, to sample more efficiently from the posterior distribution by performing multiple steps per iteration. Stan is a valuable tool for many ecologists utilizing Bayesian inference, particularly for problems where BUGS (Lunn et al., 2000) is prohibitively slow (Monnahan et al., 2017). As such, it can extend the boundaries of feasible models for applied problems, leading to a better understanding of ecological processes. Fields that would likely benefit include estimation of individual and population growth rates, meta-analyses and cross-system comparisons, among many others.

Combining the merits of TMB and Stan, the new software package tmbstan (Monnahan & Kristensen, 2018), which provides MCMC sampling for TMB models, was developed. This package offers ADMB and TMB users the possibility of Bayesian statistical analysis when prior information on the unknown parameters is available. From the user's perspective, it implements NUTS sampling from a target density proportional to the product of the marginal likelihood (computed by TMB or Stan) and the prior density specified by the user. The user has the flexibility to decide which random effects are integrated out via the Laplace approximation in TMB; the TMB model is then passed to the stan function in the RStan package so that the rest of the parameters are integrated by Stan. This methodology might therefore be more computationally efficient than using MCMC alone to integrate out all parameters.

∗ Centre for Biodiversity Dynamics, Department of Mathematical Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway
Monnahan and Kristensen (2018) introduced the tmbstan package, applied it to simulation studies and compared its capabilities (computational efficiency and the accuracy of the Laplace approximation) with other platforms such as ADMB and TMB.

However, it is unclear whether Bayesian inference with arbitrary prior distributions implemented with Stan performs comparably with frequentist inference when modeling complex ecological processes. It is also unclear whether, when using tmbstan, using the Laplace approximation to integrate out the latent variables is more computationally efficient than handling all latent variables via MCMC. In the case studies in Monnahan and Kristensen (2018), the Laplace approximation turned out to reduce the computational efficiency of MCMC. Another issue that arose in those case studies is that the Laplace approximation to the integral over the random effects is not fully accurate, and this could lead to biased parameter estimates or misstated uncertainties. To gain more insight into these issues, in this paper we conduct simulation studies and a case study in the context of modeling fluctuating and auto-correlated selection with state-space models (SSM). Such models are increasingly used in ecology to model time series such as animal movement paths and population dynamics (for example Cadigan, 2015; Albertsen et al., 2016; Auger-Méthé et al., 2017). Furthermore, following Cao, Visser, and Tufto (2019), we use an order-1 vector autoregressive model (VAR(1)) to model the unobserved states, which in our study are the temporally fluctuating and potentially auto-correlated height, width and location of a Gaussian fitness function. This also allows us to investigate further the underestimation of the auto-correlation parameter in auto-regressive models shown in Chevin et al. (2015) and Cao et al.
(2019).

Through the simulation and empirical studies, our paper aims to (1) compare estimates between frequentist and Bayesian inference under different simulation schemes; (2) investigate how the choice of prior influences Bayesian inference; and (3) compare the computational efficiency of MCMC with and without integrating out some of the random effects via the Laplace approximation.

We consider a typical ecological process, fluctuating selection in a bird species, the great tit (
Parus major). We conduct the study in the context of temporally changing selection on laying date, with the number of fledglings as the fitness component, but it can be generalized to any episode of viability or fertility selection, or to overall selection through lifetime fitness. The discrete non-negative variable, the number of fledglings, is best modelled by distributions such as the Poisson or zero-inflated Poisson (for example Chevin et al., 2015; Cao et al., 2019). Within the framework of generalized linear models, the expected value of the response variable is commonly linked to the linear predictors of biological interest by the logarithm. When both linear and quadratic effects of the traits are included, this leads to a Gaussian model of stabilizing selection. In this study, the number of fledglings in a specific brood is assumed to be Poisson distributed, X_i | w_i ∼ Poisson(w_i), where i indicates the breeding event. The fitness (the expected number of fledglings, w_i) of individuals with phenotype z_i is then given by

    ln w_i = η_t^(α) − (z_i − η_t^(θ))² / (2 e^{η_t^(ω)}),    (1)

where η_t^(α), η_t^(θ) and e^{η_t^(ω)} (e-based to guarantee positivity) are parameters determining the logarithm of the maximum fitness, the optimum laying date and the width of the fitness function in year t, respectively. We further model the three stochastic processes η_t^(α), η_t^(θ) and η_t^(ω) as follows:

    η_t^(α) = µ_α + σ_α α_t,
    η_t^(θ) = µ_θ + σ_θ θ_t,
    η_t^(ω) = µ_ω + σ_ω ω_t.    (2)

The elements of the vector µ = (µ_α, µ_θ, µ_ω)ᵀ are the means of the three processes. The stochastic processes α_t, θ_t, ω_t are assumed to be multivariate normally distributed, (α_t, θ_t, ω_t)ᵀ ∼ N(0, Γ), with

    Γ = ( 1        ρ_{α,θ}   ρ_{α,ω} )
        ( ρ_{α,θ}  1         ρ_{θ,ω} )
        ( ρ_{α,ω}  ρ_{θ,ω}   1       ),

where ρ_{α,θ}, ρ_{α,ω} and ρ_{θ,ω} indicate the correlations between the three processes.
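To make the observation model concrete, the following sketch simulates fledgling counts for one year, reading the log-fitness as η_t^(α) − (z_i − η_t^(θ))²/(2e^{η_t^(ω)}) as above. This is a Python illustration (the paper's implementation is in R/TMB), and the state values and phenotype distribution used here are purely illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_fitness(z, eta_alpha, eta_theta, eta_omega):
    # eq. (1): log expected fledgling number under Gaussian stabilizing selection;
    # exp(eta_omega) plays the role of the squared width, positive by construction
    return eta_alpha - (z - eta_theta) ** 2 / (2.0 * np.exp(eta_omega))

# illustrative year-t values of the three state processes of eq. (2)
eta_alpha, eta_theta, eta_omega = 2.0, 20.0, 7.0

z = rng.normal(loc=eta_theta, scale=20.0, size=100)          # phenotypes (laying dates)
w = np.exp(log_fitness(z, eta_alpha, eta_theta, eta_omega))  # expected fledglings w_i
x = rng.poisson(w)                                           # X_i | w_i ~ Poisson(w_i)
```

Fitness is maximized at z = η_t^(θ), where ln w equals η_t^(α), so no expected count can exceed e^{η_t^(α)}.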
(α_t, θ_t, ω_t)ᵀ is further assumed to follow a first-order vector autoregressive (VAR(1)) process:

    (α_t, θ_t, ω_t)ᵀ = Φ (α_{t−1}, θ_{t−1}, ω_{t−1})ᵀ + w_t,    (3)

where Φ is a 3 × 3 transition matrix and w_t is a 3-dimensional vector of white noise. The covariance matrix of w_t is calculated as Γ − ΦΓΦᵀ. Correlations between the elements of w_t are determined by both ρ = (ρ_{α,θ}, ρ_{α,ω}, ρ_{θ,ω}) and Φ. If ρ is a zero vector and Φ is diagonal, then w_t reduces to three independent white noise processes; in this case, α_t, θ_t and ω_t simplify to three independent first-order autoregressive (AR(1)) processes. If ρ is zero and all entries of Φ are zero, both (α_t, θ_t, ω_t)ᵀ and w_t reduce to three independent and identically distributed white noise processes. In any case, our non-centered parameterization implies that the standard deviations of η_t^(α), η_t^(θ) and η_t^(ω) are determined only by σ_α, σ_θ and σ_ω respectively. We expect the non-centered parameterization to yield simpler posterior geometries (Betancourt & Girolami, 2015) and to be much more efficient in terms of effective sample size when there are not much data (Stan Development Team, 2018b, chapter 20).

It is worth mentioning that one objective of this study is to provide another case study beyond those in Monnahan and Kristensen (2018). Therefore, even though α_t, θ_t and ω_t are assumed to follow a VAR(1) process in the model, in the simulation study we consider only an AR(1) θ_t and white-noise α_t and ω_t. Alternative simulation studies in which α_t, θ_t and ω_t are formulated as other possible stochastic processes could be conducted similarly and exhaustively, but that would be an enormous amount of work for a single study. When α_t, θ_t and ω_t are assumed to follow a VAR(1) process, one caution to be taken is that all the eigenvalues of Φ must lie inside the unit circle to guarantee that the VAR(1) process is stationary (Wei, 2006).
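The stationarity condition and the noise covariance Γ − ΦΓΦᵀ can be sketched as follows (a Python illustration with made-up Γ and Φ; the diagonal Φ below mirrors the simulation design with AR(1) θ_t and white-noise α_t and ω_t):

```python
import numpy as np

def var1_noise_cov(Gamma, Phi):
    # a stationary state covariance Gamma implies noise covariance Gamma - Phi Gamma Phi^T
    return Gamma - Phi @ Gamma @ Phi.T

def is_stationary(Phi):
    # a VAR(1) process is stationary iff all eigenvalues of Phi lie inside the unit circle
    return bool(np.all(np.abs(np.linalg.eigvals(Phi)) < 1.0))

Gamma = np.eye(3)               # rho = 0: uncorrelated states
Phi = np.diag([0.0, 0.4, 0.0])  # diagonal Phi: AR(1) theta_t, white-noise alpha_t, omega_t
Q = var1_noise_cov(Gamma, Phi)

# iterating Sigma <- Phi Sigma Phi^T + Q from zero recovers Gamma, confirming that
# this choice of noise covariance makes Gamma the stationary covariance
Sigma = np.zeros((3, 3))
for _ in range(200):
    Sigma = Phi @ Sigma @ Phi.T + Q
```

With ρ = 0 and diagonal Φ, Q is diagonal and the three state processes are indeed independent, as stated above.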
Finally, in the simulation study we assume that the model structure is known, i.e., we already know that θ_t is an AR(1) process, since the aim of the study is not to explore the structure of the true model.

Prior distribution

The priors are assumed to be independent of each other, π(µ, Φ, Σ) = π(µ) π(Φ) π(Σ). We take a normal N(m, qI) prior distribution for the process mean vector µ = (µ_α, µ_θ, µ_ω) and input weak prior information on the process mean by taking m = 0 and q = 100. Since in this study we assume constant η_t^(α) and η_t^(ω), φ_{θ,θ} is the only non-zero entry in Φ. We used a truncated normal prior on φ_{θ,θ}, since it outperforms Jeffreys' prior (Jeffreys & Jeffreys, 1961), the g-prior (Zellner, 1986) and the natural conjugate prior (Schlaifer & Raiffa, 1961) in terms of posterior sensitivity under the Highest Posterior Density Region (HPDR) criterion, as concluded from the simulation study in Karakani et al. (2016). Lei et al. (2011) also use a truncated normal distribution as a subjective prior for the auto-regressive parameter in their AR(1) model. The mean and standard deviation of the truncated normal distribution are arbitrarily set to 0 and 0.5 respectively.

For the scale of the error term, σ_θ (σ_α and σ_ω are assumed to be zero), two priors are used:
(1) a half-Cauchy(0, 10) prior on σ_θ (Prior1);
(2) a lognormal(1, 0.5) prior on σ_θ (Prior2).
These two priors are referred to as Prior1 and Prior2 respectively in the rest of this paper. It is worth mentioning that we also tested a uniform prior on log(σ_θ) (a non-informative improper prior, equivalent to a 1/σ prior on σ (Gelman et al., 2006)) and an inverse-gamma(1, 1) prior on σ_θ (a non-informative proper prior, also illustrated in Gelman et al. (2006)), but both of them lead to an issue where the sampler becomes trapped in a subspace of the whole parameter space of log(σ_θ), resulting in numerous divergent transitions.
This was potentially caused by the posterior becoming improper, consisting of a mode and an infinite low-posterior-density ridge extending to infinity, as illustrated in Tufto et al. (2012). In this study we therefore only consider the two proper informative priors (Prior1 and Prior2), while more information on the MCMC with the inverse-gamma(1, 1) prior on σ_θ is given in Supporting Information.

Note also that the scale parameter is declared in the TMB template on the logarithmic scale, log(σ_θ), but the half-Cauchy and lognormal priors contribute to the total likelihood with the log density in terms of σ_θ, and for the inverse-gamma prior it is in terms of σ_θ², where σ_θ is the positive transform σ_θ = e^{log σ_θ}. Therefore, a Jacobian adjustment (see chapter 20.3 in Stan Development Team (2018b)) was conducted by adding log σ_θ to the total likelihood when the half-Cauchy or lognormal prior is used. When testing the inverse-gamma prior, log 2 + 2 log σ_θ was added to the total likelihood.

The model is formulated in C++ and passed to TMB for frequentist inference. The model objective (fn) and gradient (gr) functions are fed to the optimization function nlminb with default settings to optimize the objective function.

For Bayesian inference, the TMB model objective and gradient functions are passed to tmbstan, which uses the stan function and executes the No-U-Turn Sampler (NUTS) algorithm by default; the other options currently available are "HMC" (Hamiltonian Monte Carlo) and "Fixed_param". We ran the simulation study on a multicore computing server with enough RAM to avoid swapping to disk. The number of warmup iterations excluded when computing the summaries is set to 1000, and the total sample length is 3000. We thin each chain to every second sample and set adapt_delta, the average proposal acceptance probability Stan aims for during the adaptation (warmup) period, to 0.95.
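The Jacobian adjustment described above can be checked numerically: when sampling in x = log σ_θ, the density of x is the density of σ_θ multiplied by dσ/dx = e^x, i.e. log σ_θ is added on the log scale. The sketch below (Python for illustration; the paper's adjustment lives in a TMB C++ template) verifies that the half-Cauchy(0, 10) density with the Jacobian factor integrates to one over the log scale, whereas without it probability mass is not conserved:

```python
import numpy as np

def half_cauchy_pdf(sigma, scale=10.0):
    # half-Cauchy(0, scale) density on sigma > 0
    return 2.0 / (np.pi * scale * (1.0 + (sigma / scale) ** 2))

x = np.linspace(-40.0, 40.0, 400001)   # x = log(sigma)
dx = x[1] - x[0]

with_jac = half_cauchy_pdf(np.exp(x)) * np.exp(x)  # includes Jacobian dsigma/dx = e^x
without_jac = half_cauchy_pdf(np.exp(x))           # Jacobian forgotten

# trapezoidal integration over the log scale
mass_with = np.sum(0.5 * (with_jac[1:] + with_jac[:-1])) * dx       # ~ 1.0
mass_without = np.sum(0.5 * (without_jac[1:] + without_jac[:-1])) * dx
```

Without the Jacobian, the "density" in x tends to the constant 2/(πs) as x → −∞, so its integral grows without bound as the grid widens: exactly the kind of impropriety that produces ridges and divergent transitions.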
We set a seed for each simulation, covering both the data set and tmbstan, to make sure all the simulation results are reproducible.

Divergent transitions during sampling may occur due to a large step size in the sampler or a poorly parameterized model, meaning that an iteration of the MCMC sampler runs into numerical instabilities (Carpenter et al., 2017), and inferences will thus be biased. The RStan team suggests that the problem may be alleviated by increasing the adapt_delta parameter (which gives a smaller step size), especially when the number of divergent transitions is small (Stan Development Team, 2018a). In our simulation studies, we found it difficult to completely avoid divergent transitions across all data sets even with adapt_delta increased to 0.95. Similar to Fuglstad, Hem, Knight, Rue, and Riebler (2019), we therefore removed simulations in which 0.1% or more of the post-warmup iterations were divergent transitions, to avoid reporting biased results.

It is worth mentioning that the Markov chains can be executed in parallel. While the default in RStan is to use 1 core, the RStan team recommends setting it to as many processors as the hardware and RAM allow, using at most one core per chain (Stan Development Team, 2018a). The simulations were run on a server with 28 available cores; we thus set the number of cores to 4 for the 4 Markov chains. However, since R is single-threaded and, for frequentist inference, the optimization algorithm used in the R function nlminb only uses one CPU core, we only compare the computational efficiency between tmbstan with and without the Laplace approximation, and leave out nlminb to ensure a fair comparison.
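The sampler settings and the 0.1% exclusion rule imply the following bookkeeping per model fit (a small Python arithmetic check, consistent with the 4000 retained draws, 1000 per chain, referred to in the figure captions later):

```python
total_iter = 3000   # iterations per chain
warmup = 1000       # warmup iterations dropped from the summaries
thin = 2            # every second sample kept
chains = 4

post_warmup = total_iter - warmup        # 2000 post-warmup iterations per chain
kept_per_chain = post_warmup // thin     # 1000 retained draws per chain
kept_total = chains * kept_per_chain     # 4000 retained draws in total
divergence_cap = 0.001 * post_warmup     # 0.1% of 2000 = 2 divergences tolerated
```

A fit whose post-warmup divergence count reaches the cap is discarded from the analysis.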
All the simulated data are in natural units and considered to be biologically realistic according to empirical studies of natural bird populations (e.g. Grant & Grant, 2002; Vedder et al., 2013). Samples were simulated from a population undergoing stabilizing selection with AR(1) θ_t and fixed η_t^(α) and η_t^(ω). The mean vector µ = (µ_α, µ_θ, µ_ω)ᵀ is set such that µ_α = 2 and µ_θ = 20. The autocorrelation φ_{θ,θ} is set to 0.1, 0.4 and 0.7 (only positive values are considered, since reported estimates of the auto-correlation in the temporal optimal laying date are positive, for example 0.3029 in Chevin et al. (2015) and 0.524 in Cao et al. (2019)), and the scale of the fluctuating optimal laying date, σ_θ, is set to 20.

For each value of φ_{θ,θ}, tmax = 25 or 50 time points were simulated, and for each time point the sample size was drawn from a Poisson distribution with mean n = 25, 50 or 100 individuals. We considered four combinations of tmax and n: (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100). These four combinations are referred to as simulation settings 1, 2, 3 and 4 respectively in the following sections. Similar to Cao et al. (2019), we neglected the response to selection and used the same normal distribution for simulating individual phenotypes each year. The phenotypic standard deviation before selection, σ_z, was set to 20, such that the strength of stabilizing selection S = σ_z² / (e^{η_t^(ω)} + σ_z²) (e.g. Chevin et al., 2015) was 0.267. For each individual, its fitness was computed from its phenotype using equation (1), and its actual number of offspring was then drawn from a Poisson distribution with mean w_t(z).

Frequentist vs. Bayesian estimates

The results of a single simulation obtained by maximum likelihood in the frequentist framework are compared with those from tmbstan. The summaries of the tmbstan estimates are computed after dropping the warmup iterations and merging the draws from all four chains.
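Reading the strength of stabilizing selection as S = σ_z²/(e^{η_t^(ω)} + σ_z²), the reported value can be back-checked, and the implied (squared) width e^{η^(ω)} recovered. This is a Python sketch; the implied width is derived here and is not stated in the text:

```python
import math

sigma_z = 20.0   # phenotypic standard deviation before selection
S = 0.267        # reported strength of stabilizing selection

# invert S = sigma_z^2 / (width + sigma_z^2) for the implied squared width e^{eta^(omega)}
width = sigma_z ** 2 * (1.0 - S) / S             # ~ 1.1e3
S_check = sigma_z ** 2 / (width + sigma_z ** 2)  # recovers 0.267
```

The wider the fitness function relative to the phenotypic variance, the closer S gets to zero (weak selection).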
The frequentist and Bayesian estimates with different sample sizes and φ_{θ,θ} = 0.4 are shown in Table 1; the estimates for the other values of the auto-correlation in θ_t (φ_{θ,θ} = 0.1 and 0.7) can be found in Supporting Information.

From Table 1 we find that both frequentist and Bayesian inference produce good estimates of µ_α and µ_ω. Interestingly, the auto-correlation of θ_t is not under-estimated under all settings (for example (tmax = 25, n = 50)); this can also be seen from the tables of parameter estimates in Supporting Information. Bayesian inference with Prior1 (the half-Cauchy prior) generally reports smaller estimates of µ_θ than the MLE and Prior2 (the lognormal prior), but larger estimates of φ_{θ,θ} and log σ_θ. The estimates with the MLE and Prior2 are close to each other, while the estimates with Prior2 show less uncertainty for φ_{θ,θ} and log σ_θ, as implied by the smaller standard errors in brackets. Prior2 also reports smaller estimates of log σ_θ compared with the MLE and Prior1, since it puts very large weight on small values of the scale parameter, as will be graphically demonstrated in section 3.4. We also find that φ_{θ,θ} and log σ_θ are difficult parameters to estimate, since none of the three techniques estimates them accurately across all cases. However, as the estimates are based on one realization of the simulation, the discrepancy between the estimates and the true values will vary from simulation to simulation.

We also compare the estimates across the different sample sizes.
Specifically, we compare the estimates between settings (tmax = 25, n = 50) and (tmax = 25, n = 100), between (tmax = 50, n = 25) and (tmax = 50, n = 100), and between (tmax = 25, n = 100) and (tmax = 50, n = 100). We find that increasing the mean sample size at each time point does not necessarily increase the certainty of the estimates, but the data set with more time points, (tmax = 50, n = 100), contains more information on the parameters of interest and thus yields more certain estimates than the data set with (tmax = 25, n = 100). The same conclusion can be drawn by making similar comparisons among the estimates in Tables S1 and S2 in Supporting Information.

We can also see from Table 1, and Tables S1 and S2 in Supporting Information, that Bayesian inference with Prior1 in some cases reports 1 or 2 divergent transitions, while with Prior2 no divergent transitions are reported. This implies that the geometry of the posterior with Prior1 is more challenging to sample, probably due to light tails, and thus potentially leads to an incomplete exploration of the target distribution.

The comparison between the estimates in the last section is based on one realization of the simulation. To compare estimates over more realizations, the simulation was repeated 50 times under the setting (tmax = 50, n = 25). Due to divergent transitions, only 44 of the 50 replicates were kept; the replications with more than 0.1% divergent transitions (in 2000 iterations) were excluded from the analysis. For the estimates of φ_{θ,θ} and log σ_θ in each replication, the bias was calculated in a frequentist sense as the absolute difference between the true value and the mean estimate from each inference technique. The absolute biases for φ_{θ,θ} and log σ_θ are graphically displayed in the upper and lower plots of Fig. 1 respectively.
From the upper plot we find that in most replications, Bayesian inference with Prior1 slightly outperforms frequentist inference and Bayesian inference with Prior2; the latter two report very close estimates of φ_{θ,θ}. Strikingly, the bias is close to or even larger than 0.4 in some replications, implying that the inferences sometimes report negative estimates of φ_{θ,θ}; it again proves a difficult parameter. In the lower plot, no single inference technique stands out in estimating the scale parameter log σ_θ.

Table 1: Frequentist and Bayesian estimates (standard errors) from the model with AR(1) θ_t, autocorrelation φ_{θ,θ} = 0.4 in θ_t, and different sample sizes ((tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100)) from one realization of the simulation. For each sample-size setting, the number of divergent transitions in the MCMC is also reported and is used as a measure of the stability of the inference scheme. MLE stands for maximum likelihood estimate; Prior1 and Prior2 represent the half-Cauchy(0, 10) and lognormal(1, 0.5) priors respectively.

φ_{θ,θ} = 0.4, tmax = 25, n = 50
Parameter | True value | MLE | Prior1 | Prior2
no. divergent transitions | NA | NA | 1 | 0
µ_α | | | |
µ_θ | 20 | 18.5 (3.7) | 18.3 (5.1) | 18.5 (3.7)
µ_ω | | | |
φ_{θ,θ} | 0.4 | | |
log σ_θ | | | |

φ_{θ,θ} = 0.4, tmax = 25, n = 100
Parameter | True value | MLE | Prior1 | Prior2
no. divergent transitions | NA | NA | 2 | 0
µ_α | | | |
µ_θ | 20 | 20.2 (8.7) | 18.3 (17.5) | 20.1 (7.4)
µ_ω | | | |
φ_{θ,θ} | 0.4 | | |
log σ_θ | | | |

φ_{θ,θ} = 0.4, tmax = 50, n = 25
Parameter | True value | MLE | Prior1 | Prior2
no. divergent transitions | NA | NA | 0 | 0
µ_α | | | |
µ_θ | 20 | 20.0 (3.8) | 19.8 (4.9) | 20.1 (4.2)
µ_ω | | | |
φ_{θ,θ} | 0.4 | | |
log σ_θ | | | |

φ_{θ,θ} = 0.4, tmax = 50, n = 100
Parameter | True value | MLE | Prior1 | Prior2
no. divergent transitions | NA | NA | 0 | 0
µ_α | | | |
µ_θ | 20 | 20.7 (3.9) | 20.0 (5.0) | 20.7 (4.1)
µ_ω | | | |
φ_{θ,θ} | 0.4 | | |
log σ_θ | | | |

Figure 1: Absolute bias for the autocorrelation φ_{θ,θ} (upper plot) and for the scale parameter log σ_θ (lower plot) under the setting with time-series length tmax = 50, average annual sample size n = 25, autocorrelation φ_{θ,θ} = 0.4 in θ_t and 44 replications (50 replications were conducted, among which 6 reported 3 or more divergent transitions in the MCMC of the Bayesian inference and were thus removed from the analysis).

Fig. 2 shows histograms of posterior samples of the scale parameter σ_θ from models with the two different prior distributions, half-Cauchy(0, 10) and log-normal(1, 0.5), whose densities are represented by solid lines in the left and right plot of each subplot respectively. The true value of σ_θ is indicated by a solid red line. Plots (a), (b), (c) and (d) correspond to settings (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100) respectively. We can see from plot (a) that the priors are quite informative and pull the posteriors towards small values away from the true value; this prior-domination is clearer with the log-normal prior, whose density peaks sharply at 2. The domination is not mitigated even though the mean annual sample size is increased to 100, as shown in plot (b). With the same total sample size in plot (c) (tmax = 50, n = 25) as in plot (a) (tmax = 25, n = 50), the posteriors in plot (c) are, however, not dominated by the priors.
The prior-domination is also mitigated in plot (d) compared with plot (b) by increasing the number of time points from 25 to 50.

Figure 2: Histograms of posterior samples of the scale parameter σ_θ from models with two different prior distributions. Plots (a), (b), (c) and (d) correspond to sample-size settings (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100) respectively. In each subplot, the left panel shows the histogram of posterior samples given the half-Cauchy(0, 10) prior on σ_θ, and the right panel the histogram given the log-normal(1, 0.5) prior on σ_θ. Overlain on each panel (solid black lines) is the corresponding prior density function. The red lines indicate the true value of σ_θ. Only φ_{θ,θ} = 0.4 was considered in the simulations.

Altogether, the informative log-normal prior pulls more of the posterior towards a narrower range of smaller parameter values, especially when the number of time points in the data is small; the posterior samples are less dominated by the half-Cauchy prior in this case. Increasing the annual mean sample size does not necessarily lead to better identification of the relevant region of parameter space: only the number of time points matters for the likelihood to overwhelm the prior and dominate the posterior distribution.

In tmbstan, sampling can be performed with or without the Laplace approximation for the random effects. It is possible to mix the Laplace approximation with MCMC by specifying laplace=TRUE, such that the random effects are integrated out with the Laplace approximation in TMB and the other parameters (such as fixed effects and hyperparameters specifying the distribution of the random effects) are handled by NUTS in Stan.
In the case studies in Monnahan and Kristensen (2018), Bayesian inference with the Laplace approximation was less computationally efficient than without it, where efficiency is defined as the minimum effective sample size per second. Following that definition, we calculated the efficiency of tmbstan with and without the Laplace approximation on simulated data. Unlike Monnahan and Kristensen (2018), we did not consider the computational efficiency of frequentist inference with the Laplace approximation, as explained in the last section.

In Fig. 3, plot (a) displays violin plots of the computational efficiency without (orange) and with (green) the Laplace approximation (la) for Bayesian inference with Prior1 under the different sample-size settings. Settings 1, 2, 3 and 4 on the x axis stand for (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100) respectively. Only φ_{θ,θ} = 0.4 was considered, and divergent transitions were not taken into account. Inside the violin plots are box plots showing the quantiles of the 50 realized computational efficiencies. Similarly, the violin plots of computational efficiency with Prior2 are shown in plot (b). From both plots we find that Bayesian inference without the Laplace approximation is generally more efficient under settings 1, 2 and 3; the outperformance is most manifest when the sample size is small (tmax = 25, n = 50).
However, when the sample size is increased to (tmax = 50, n = 100), inference with the Laplace approximation turns out to be slightly more efficient than without it, and the box plots and violin plots also become more compact under this setting.

Even though the scheme in which the random effects are integrated out by the Laplace approximation in TMB turns out to be less efficient in most settings, setting 4 thus provides a counterexample to the case studies in Monnahan and Kristensen (2018), in which enabling the Laplace approximation was always less computationally efficient.

By comparing the Bayesian posteriors with and without the Laplace approximation, we can check how well the Laplace approximation works. Fig. 4 shows pair plots of posterior samples with and without the Laplace approximation in TMB under the different sample-size settings with Prior2. Only the autocorrelation φ_{θ,θ} = 0.4 in θ_t was considered.

Figure 3: Violin plots of computational efficiency (minimum effective sample size per second) without (orange) and with (green) the Laplace approximation (la). The four settings on the x axis correspond to sample-size settings (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100) respectively. Plot (a) shows the computational efficiency of Bayesian inference with Prior1 and plot (b) with Prior2. Only φ_{θ,θ} = 0.4 was used in the simulations. Inside the violin plots are box plots showing the quantiles of the 50 realized computational efficiencies. For each realization among the 50 simulations, and across the settings, the same tmbstan specifications are used.

Figure 4: Pair plots of posterior samples for the Laplace approximation check from one realization of the simulation with Prior2.
The four plots (a), (b), (c) and (d) correspond to the four sample-size settings of the simulation: (tmax = 25, n = 50), (tmax = 25, n = 100), (tmax = 50, n = 25) and (tmax = 50, n = 100) respectively. The random effects in the TMB model can be integrated out by two techniques: (1) full MCMC integration via NUTS and (2) the Laplace approximation. To check the accuracy of the Laplace approximation to the posterior density, the posterior samples of all the fixed effects in the model without (yellow dots) and with (green dots) the Laplace approximation are shown pairwise on the same plot. Columns and rows on the lower diagonal correspond to pairs of parameters, while the diagonal shows a QQ-plot of posterior samples from Bayesian inference without (yellow) and with (green) the Laplace approximation for that parameter, including a 1:1 line in yellow. The large red circles on the off-diagonal plots represent the pairwise means. On each off-diagonal plot there are 4000 yellow dots, corresponding to 1000 samples retained from each of the four chains without the Laplace approximation, and likewise for the green dots with the Laplace approximation. Rows of the posterior draws were randomized to prevent consistent overplotting of one integration technique. Overlap of the two colors of dots suggests that the Laplace approximation is accurate.
Even though the posterior points are densely packed, the overlap of the red circles from the two techniques shows good alignment of the two versions of the posterior, suggesting that the Laplace approximation to the marginal likelihood, in which the random effects are integrated out, works well. Similar pair plots for the Laplace approximation check with Prior1 can be found in Supporting Information.

Having established the utility of our modeling approach and of frequentist and Bayesian inference on simulated data, we also applied the same statistical model to the analysis of a real great tit data set of practical interest. The observed data were collected from a Dutch great tit (
Parus major) population at the Hoge Veluwe National Park in the Netherlands (52°02'-52°07'N, 5°51'-5°32'E). The recorded variables include, for each brood, the number of chicks, the number of fledglings, mother ID and brood laying date, among others. Laying dates are given as the number of days after March 31 (day 1 = April 1, day 31 = May 1). Similar to Reed et al. (2013), only broods with one or more chicks were considered in our analysis, due to the high proportion (15.7%) of zero observations in the number of fledglings among the broods. The number of fledglings was taken as the fitness component and assumed to be Poisson distributed. The analyzed data set consists of brood records from 61 breeding years, 1955 to 2015; the sample size in a given year ranges from 10 to 164, with an average of 81 across the study years. See Reed et al. (2013) for more details on the study population and fieldwork procedures.

The focus of this empirical study is to compare the computational efficiency of Bayesian inference with and without the Laplace approximation and to check the accuracy of the Laplace approximation. However, since the true structure of the model is unknown, we first conducted model selection in the frequentist framework; the candidate models differ from each other only in the model structure of the stochastic processes α_t, θ_t and ω_t. Details of all the candidate models, including the best model, are given in Supporting Information. We then carried out Bayesian inference with the two different priors, as in the simulation study, using the selected model. For each prior distribution, we ran tmbstan with and without the Laplace approximation to check its accuracy.

Table 2 lists the estimates of the model parameters from maximum likelihood (MLE) and the Bayesian estimates with the half-Cauchy(0, 10) prior (Prior1) and the log-normal(1, 0.5) prior (Prior2). The best model indicates a VAR(1) structure for α_t and θ_t and a non-zero correlation ρ̂_{α,θ}.
The width of the stabilizing fitness function turned out to be constant over the study years, as implied by the zero estimates ω̂_t. Frequentist inference and Bayesian inference with Prior2 report close estimates of φ_θ,θ, but the estimates with Prior2 again show less uncertainty for most parameters, with the exception of ρ_α,θ.

Table 2: Frequentist and Bayesian estimates of parameters in the selected model for the great tit dataset. The Bayesian estimates (columns Prior1 and Prior2) are obtained without the Laplace approximation in TMB.

parameter   MLE               Prior1            Prior2
µ_α
µ_θ
µ_ω
φ_α,α
φ_θ,θ
log σ_α     -1.72 (0.14)      -1.63 (0.152)     -1.76 (0.126)
log σ_θ
ρ_α,θ       -0.728 (0.0825)   -0.715 (0.0895)   -0.661 (0.0987)

In terms of log σ_θ, Bayesian inference with Prior1

Table 3: Comparison of computational efficiency between Bayesian inference without ("Full MCMC" rows) and with the Laplace approximation ("Laplace approximation" rows) for the random effects, for the great tit case study.

Model    Inference                Time (s)    min. ESS    Efficiency (ESS/t)
Prior1   Full MCMC                1542.215    186.7651    0.1211019
Prior1   Laplace approximation    15491.85    1004.643    0.06484975
Prior2   Full MCMC                1266.096    291.0717    0.229897
Prior2   Laplace approximation    7815.218    1111.257    0.1421914

reports the largest estimate and the greatest uncertainty compared with the other two approaches. The close resemblance between the maximum likelihood and Bayesian estimates of log σ_θ suggests that the data contain a good amount of information on log σ_θ, so that the likelihood overwhelms the log-normal prior and dominates the posterior.

Table 3 shows the computational efficiency of Bayesian inference without and with the Laplace approximation. The efficiency with the Laplace approximation turns out to be approximately half of that without it, under both Prior1 and Prior2.

As in Fig. 4, Figs. 5 and 6 display pair plots of posterior samples checking the accuracy of the Laplace approximation with Prior1 and Prior2, respectively. Both figures suggest good mixing of the posterior samples with and without the Laplace approximation for all parameters in the selected model, indicating that the assumptions underlying the Laplace approximation are met.

In this study, we have investigated frequentist inference and Bayesian inference with two different priors. The inferences were implemented with a state-space model estimating temporally fluctuating selection, using simulated biological data under four different simulation settings. A state-of-the-art R package for fast fitting of statistical models (tmbstan) was used for Bayesian inference with the Laplace approximation turned on or off. The simulation studies show that the choice of prior can have an important impact on the geometric

Figure 5: Pair plots of posterior samples for the Laplace approximation check for the great tit case study with Prior1. The random effects in the great tit TMB model can be integrated out with two techniques: (1) full MCMC integration via NUTS and (2) the Laplace approximation.
To check the accuracy of the Laplace approximation to the posterior density, the posterior samples for all fixed effects in the model without (yellow dots) and with (green dots) the Laplace approximation are shown pairwise on the same plot. The off-diagonal panels correspond to parameter pairs, while the diagonal shows a QQ-plot of the posterior samples from Bayesian inference without (yellow) and with (green) the Laplace approximation for that parameter, including a 1:1 line in yellow. The large red circles in the off-diagonal panels represent the pairwise means. Each off-diagonal panel contains 4000 yellow dots, corresponding to 1000 samples retained from each of four chains without the Laplace approximation, and likewise for the green dots with the Laplace approximation. Posterior rows were randomized to prevent consistent overplotting of one integration technique. Overlap of the two colors suggests that the assumptions underlying the Laplace approximation are met.

shape of the posterior distributions of the model parameters, and that a non-informative prior (here, a uniform prior and an inverse-gamma prior on the scale parameter) may lead to unstable inference, since the Markov chains may fail to converge or may get stuck on part of a ridge of the posterior. With unobserved states following a VAR(1) process, we also found that the autoregressive parameters and the scale parameters in the variance-covariance matrix of the states are difficult to estimate accurately. An increased sample size at each time point does not necessarily provide more information about

Figure 6: Pair plots of posterior samples for the Laplace approximation check for the great tit case study with Prior2. The random effects in the great tit TMB model can be integrated out with two techniques: (1) full MCMC integration via NUTS and (2) the Laplace approximation. To check the accuracy of the Laplace approximation to the posterior density, the posterior samples for all fixed effects in the model without (yellow dots) and with (green dots) the Laplace approximation are shown pairwise on the same plot. The off-diagonal panels correspond to parameter pairs, while the diagonal shows a QQ-plot of the posterior samples from Bayesian inference without (yellow) and with (green) the Laplace approximation for that parameter, including a 1:1 line in yellow. The large red circles in the off-diagonal panels represent the pairwise means. Each off-diagonal panel contains 4000 yellow dots, corresponding to 1000 samples retained from each of four chains without the Laplace approximation, and likewise for the green dots with the Laplace approximation. Posterior rows were randomized to prevent consistent overplotting of one integration technique. Overlap of the two colors suggests that the assumptions underlying the Laplace approximation are met.

the transition parameters and the scale parameters; only more time points in the data could make the likelihood dominate the posterior and thus lead to better estimates of these parameters. The half-Cauchy prior on the scale parameter leads to less stable inference than the log-normal prior, as indicated by the number of divergent transitions in the Markov chains. The Laplace approximation for the random effects turns out to be accurate, as suggested by the pair plots of posterior samples with and without the Laplace approximation, in both the simulation studies and the great tit case study. Turning on the Laplace approximation in tmbstan would probably reduce computational efficiency, but it is worth trying when there is a good amount of data, in which case the Laplace approximation is more likely to be accurate and may also improve the computational efficiency of the MCMC.

In our study, we used arbitrary prior distributions; however, prior information can be obtained from different sources.
For example, in our great tit case study, the timing and width of the caterpillar peak can provide a clue to the time window of optimal laying dates, and this information can be used to choose the prior for the scale parameter of the optimal laying dates. Prior information can also be derived from previous studies on the same species and from more general ecological knowledge of related species (Tufto et al., 2000).

We conducted simulation studies with only an AR(1) process for the optimal laying dates, but the model is formulated and coded in a way that can be effortlessly extended to an order-1 vector autoregression (VAR(1)). It can therefore be used widely for modeling ecological processes in which autocorrelation and cross-correlation arise from shared environmental variables at temporal or spatial scales. We expect more ecologists to adopt these two estimation tools, TMB and tmbstan, given their flexibility for both frequentist and Bayesian inference in a wide range of models, including models where the unobserved ecological processes are treated as latent variables and assumed to follow VAR processes. A drawback of Bayesian VAR (BVAR) methods, however, is that they usually require estimating a large number of parameters, and this over-parameterization may lead to unstable inference and inaccurate out-of-sample forecasts. Shrinkage methods (for example Sims & Zha, 1998; Koop et al., 2010; Giannone et al., 2015; Sørbye & Rue, 2017) were developed for this reason; in these, Bayesian priors provide a logical and consistent way of imposing parameter restrictions that could potentially be applied to ecological data.

References
Albertsen, C. M., Nielsen, A., & Thygesen, U. H. (2016). Choosing the observational likelihood in state-space stock assessment models. Canadian Journal of Fisheries and Aquatic Sciences, (5), 779–789.

Auger-Méthé, M., Albertsen, C. M., Jonsen, I. D., Derocher, A. E., Lidgard, D. C., Studholme, K. R., ... Flemming, J. M. (2017). Spatiotemporal modelling of marine movement data using Template Model Builder (TMB). Marine Ecology Progress Series, 237–249.

Betancourt, M., & Girolami, M. (2015). Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 30.

Cadigan, N. G. (2015). A state-space stock assessment model for northern cod, including under-reported catches and variable natural mortality rates. Canadian Journal of Fisheries and Aquatic Sciences, (2), 296–308.

Cao, Y., Visser, M. E., & Tufto, J. (2019). A time-series model for estimating temporal variation in phenotypic selection on laying dates in a Dutch great tit population. Methods in Ecology and Evolution, (9), 1401–1411.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., ... Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, (1).

Chevin, L.-M., Visser, M. E., & Tufto, J. (2015). Estimating the variation, autocorrelation, and environmental sensitivity of phenotypic selection. Evolution, (9), 2319–2332.

Fuglstad, G.-A., Hem, I. G., Knight, A., Rue, H., & Riebler, A. (2019). Intuitive principle-based priors for attributing variance in additive model structures. arXiv preprint arXiv:1902.00242.

Gelman, A., Lee, D., & Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, (5), 530–543.

Gelman, A., et al. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, (3), 515–534.

Giannone, D., Lenza, M., & Primiceri, G. E. (2015). Prior selection for vector autoregressions. Review of Economics and Statistics, (2), 436–451.

Grant, P. R., & Grant, B. R. (2002). Unpredictable evolution in a 30-year study of Darwin's finches. Science, (5568), 707–711.

Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, (1), 1593–1623.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford.

Karakani, H. M., van Niekerk, J., & van Staden, P. (2016). Bayesian analysis of AR(1) model. arXiv preprint arXiv:1611.08747.

Koop, G., Korobilis, D., et al. (2010). Bayesian multivariate time series methods for empirical macroeconomics. Foundations and Trends® in Econometrics, (4), 267–358.

Kristensen, K., Nielsen, A., Berg, C. W., Skaug, H., & Bell, B. M. (2016). TMB: Automatic differentiation and Laplace approximation. Journal of Statistical Software, (5), 1–21.

Lei, G., Boys, R., Gillespie, C., Greenall, A., & Wilkinson, D. (2011). Bayesian inference for sparse VAR(1) models, with application to time course microarray data. Journal of Biometrics and Biostatistics.

Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, (4), 325–337.

Monnahan, C. C., & Kristensen, K. (2018). No-U-turn sampling for fast Bayesian inference in ADMB and TMB: Introducing the adnuts and tmbstan R packages. PloS One, (5), e0197954.

Monnahan, C. C., Thorson, J. T., & Branch, T. A. (2017). Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution, (3), 339–348.

Neal, R. M., et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, (11), 2.

Reed, T. E., Jenouvrier, S., & Visser, M. E. (2013). Phenological mismatch strongly affects individual fitness but not population demography in a woodland passerine. Journal of Animal Ecology, (1), 131–144.

Schlaifer, R., & Raiffa, H. (1961). Applied Statistical Decision Theory. Wiley, Cambridge.

Sims, C. A., & Zha, T. (1998). Bayesian methods for dynamic multivariate models. International Economic Review, 949–968.

Sørbye, S. H., & Rue, H. (2017). Penalised complexity priors for stationary autoregressive processes. Journal of Time Series Analysis, (6), 923–935.

Stan Development Team. (2018a). RStan: the R interface to Stan. R package version 2.17.3. http://mc-stan.org.

Stan Development Team. (2018b). Stan modeling language users guide and reference manual. Version 2.18.0. http://mc-stan.org.

Tufto, J., Lande, R., Ringsby, T.-H., Engen, S., Sæther, B.-E., Walla, T. R., & DeVries, P. J. (2012). Estimating Brownian motion dispersal rate, longevity and population density from spatially explicit mark–recapture data on tropical butterflies. Journal of Animal Ecology, (4), 756–769.

Tufto, J., Sæther, B.-E., Engen, S., Arcese, P., Jerstad, K., Røstad, O. W., & Smith, J. N. (2000). Bayesian meta-analysis of demographic parameters in three small, temperate passerines. Oikos, (2), 273–281.

Vedder, O., Bouwhuis, S., & Sheldon, B. C. (2013). Quantitative assessment of the importance of phenotypic plasticity in adaptation to climate change in wild bird populations. PLoS Biology, (7), e1001605.

Wei, W. (2006). Time Series Analysis: Univariate and Multivariate Methods (2nd ed.). Pearson Addison Wesley.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions.