Efficient Selection Between Hierarchical Cognitive Models: Cross-validation With Variational Bayes
Viet-Hung Dao, David Gunawan, Minh-Ngoc Tran, Robert Kohn, Guy E. Hawkins, Scott D. Brown
1: Australian School of Business, University of New South Wales, Sydney, Australia
2: School of Mathematics and Applied Statistics, University of Wollongong
3: Discipline of Business Analytics, University of Sydney Business School
4: School of Psychology, University of Newcastle, Australia

Abstract

Model comparison is the cornerstone of theoretical progress in psychological research. Common practice overwhelmingly relies on tools that evaluate competing models by balancing in-sample descriptive adequacy against model flexibility, with modern approaches advocating the use of marginal likelihood for hierarchical cognitive models. Cross-validation is another popular approach, but its implementation has remained out of reach for cognitive models evaluated in a Bayesian hierarchical framework, with the major hurdle being prohibitive computational cost. To address this issue, we develop novel algorithms that make variational Bayes (VB) inference for hierarchical models feasible and computationally efficient for complex cognitive models of substantive theoretical interest. It is well known that VB produces good estimates of the first moments of the parameters, which in turn yields good estimates of predictive densities. We thus develop a novel VB algorithm with Bayesian prediction as a tool to perform model comparison by cross-validation, which we refer to as CVVB. In particular, CVVB can be used as a model screening device that quickly identifies bad models. We demonstrate the utility of CVVB by revisiting a classic question in decision making research: what latent components of processing drive the ubiquitous speed-accuracy tradeoff? We demonstrate that CVVB strongly agrees with model comparison via marginal likelihood yet achieves the outcome in much less time. Our approach brings cross-validation within reach of theoretically important psychological models, and makes it feasible to compare much larger families of hierarchically specified cognitive models than has previously been possible.
Keywords:
LBA model, marginal likelihood, model screening.
Introduction
Progress in psychological science can be made by choosing between competing theories: Does sleep deprivation cause attentional lapses? Does alcohol impair the speed of information processing or reduce cautiousness, or both? Does the forgetting curve follow a power or exponential function? When these theories are quantitative models that can be estimated from observed data (i.e., "fitted"), the problem is known as model selection. Model selection continues to be a thorny problem for psychological researchers, even after many decades of progress (e.g., Gronau & Wagenmakers, 2019; Myung, 2000; Navarro, 2019; Roberts & Pashler, 2000). The key difficulty in model selection is to balance goodness of fit against model flexibility; that is, to balance the degree to which each model accounts for the patterns observed in data against its ability to predict arbitrary data patterns. Model flexibility is often defined as the range of data patterns that a model can predict, which includes patterns that were observed as well as patterns that were not observed. Overly-flexible models are theoretically non-informative because they can "predict" almost any pattern that could be observed.

Many approaches have been developed to tackle this problem. These include likelihood ratio tests, various information criteria (e.g., Akaike, Bayesian and Deviance Information Criteria; AIC, BIC, and DIC, respectively), minimum description length, and marginal likelihood (i.e., Bayes factors). Among these, cross-validation is the most popular (Browne, 2000; Efron & Gong, 1983; Vehtari & Gelman, 2014). A key strength of cross-validation is that it directly asks the question that scientists are often interested in: how well will this model predict new data? The simplest version of cross-validation divides observed data into two disjoint and approximately equal parts.
The first, the "estimation" subset, is used to estimate the model, while the second, the "validation" subset, is held out. The procedure is repeated with the second subset used to estimate the model and the first subset used for validation. The average of the validation performance measures, such as mean squared error (MSE), is then used to compare different models. The model is evaluated on its ability to predict the held-out data, treating them as new observations.

While cross-validation is widely agreed to be a desirable method for model selection, it is not used very widely in psychological science. A principal reason for this is its computational cost. Cross-validation is usually carried out repeatedly, using many different ways of splitting the observed data into the estimation and validation subsets (this is important in order to reduce the sampling error associated with the subsetting). Leave-one-out cross-validation (LOO-CV) leaves out one data point at a time and uses the rest of the data to estimate the model. LOO-CV is closest to actual prediction but it is computationally extremely expensive. A more practical version is K-fold cross-validation (K-fold CV), in which the data is partitioned into K folds (a common choice is K = 5 or 10). It is implemented with one fold left out as the validation subset while the model is estimated on the other folds. This effectively requires estimating the model on a "new" subset of estimation data K times, which can be particularly time consuming in modern quantitative psychology, given the emphasis on using hierarchical Bayesian methods. Hierarchical models include random effects to account for the ubiquitous differences between human participants.
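The split-estimate-validate loop described above can be sketched in a few lines. This is a generic illustration only, not the hierarchical procedure developed later in the paper; `model_fit` and `model_predict` are hypothetical placeholders for any estimation and prediction routines.

```python
import numpy as np

def kfold_mse(model_fit, model_predict, X, y, K=5, seed=0):
    """Estimate out-of-sample MSE by K-fold cross-validation.

    `model_fit` and `model_predict` are stand-ins for an arbitrary
    estimation and prediction routine; the split-and-average logic
    is the point of the sketch.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)      # K roughly equal validation subsets
    scores = []
    for k in range(K):
        val = folds[k]
        est = np.concatenate([folds[j] for j in range(K) if j != k])
        params = model_fit(X[est], y[est])       # estimate on the other K-1 folds
        pred = model_predict(params, X[val])     # predict the held-out fold
        scores.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(scores))                # average validation MSE
```

A model that predicts the held-out data perfectly attains an average validation MSE of zero; flexible models that merely memorize the estimation folds are penalized because they are scored only on data they never saw.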
With complex psychological theories, evaluating such a model in a Bayesian framework can take days of computer time, which makes repeated evaluation for cross-validation impractical. For example, one approach to understanding which elements of a particular psychological theory may be critical to explaining observed data is to enumerate a large family of model variants. These model variants are formed from all the combinations produced by including or excluding different elements that have been hypothesized to be important, leading to dozens or even hundreds of models to consider. Using cross-validation to choose between such a large set of models is even less practical. As long as this approach to cross-validation remains out of reach, an unresolved issue is that model selection is subject to researcher bias: researchers select and then compare the subset of models they believe to be a priori most reasonable, since we cannot enumerate and feasibly compare all possible models.

We propose a solution to this problem, allowing cross-validation to be used very efficiently with complex psychological theories which include random effects. Our approach maintains the hierarchical Bayesian structure of the models, but employs variational Bayes to increase the computational efficiency so greatly that cross-validation becomes practical. Variational Bayes (VB; also known as variational approximation, VA) methods provide an alternative to more widely-used methods based on Markov chain Monte Carlo (MCMC). VB methods have become increasingly prominent for approximate Bayesian inference in a wide range of challenging statistical models (for reviews, see, e.g., Blei, Kucukelbir, & McAuliffe, 2017; Ormerod & Wand, 2010). With VB, the problem of estimating the posterior distribution is re-formulated as an optimization problem.
The (usually very complex) posterior distribution of interest is approximated by a simpler, more tractable distribution that is selected to balance accuracy and computational cost. The parameters of the approximating distribution are then identified by an optimization which minimizes the Kullback-Leibler distance between the approximating distribution and the posterior distribution. With careful choices of the approximating distribution and optimization algorithm, VB methods can produce results 10 or 100 times faster than exact methods such as MCMC. However, unlike MCMC, variational methods are approximate.

Despite their strengths, VB methods are still not widely used in psychological research (see, however, Galdo, Bahg, & Turner, 2019). One reason is that VB methods have certain limitations which make drawing model-based inferences difficult. The quality of the approximation is not always well-known; the methods have a tendency to underestimate the variability of the posterior distribution, and this can be problematic for parameter inference, such as setting credible intervals, as well as for model selection based on the marginal likelihood. A key insight underpinning our work is that VB methods are quite accurate at estimating the posterior means (see, for example, the discussion in Blei et al., 2017), even though VB methods often underestimate the posterior variances. This is a crucial distinction for application to cross-validation. In cross-validation, a model's performance is evaluated on how well it predicts held-out data, and here the role of the posterior variances is of second-order importance at most. We show in a simulation study that the predictive densities estimated by MCMC and VB are very close, even when VB underestimates some of the posterior variances of the model parameters.

Following the above discussion, and building on recent work using VB methods in cognitive science by Galdo et al. (2019), we propose to combine cross-validation and VB, which we call "CVVB".
By employing modern VB methods, we show that CVVB can handle complex psychological theories with random effects in hierarchical Bayesian frameworks. Even with such challenging models, CVVB is sufficiently fast to make it practical to use when searching a large number of competing models, as described above (in one example below, we illustrate selection amongst 256 competing models). Alternatively, for those who prefer to use exact Bayesian model selection approaches, such as marginal likelihood, CVVB may be used as an efficient model screening tool. That is, when given a very large set of models to evaluate, one can use CVVB to screen out the poorest-performing models. This reduces the number of candidate models to a manageable size, and slower exact Bayesian methods (such as the marginal likelihood) can then be used on the remaining models.

The article first outlines the VB method and then develops the novel VB algorithms. These algorithms are presented in a general way to make their implementation apparent for a range of psychological models. The performance of the novel VB algorithms is demonstrated in a cognitive model for decision-making. Following this, the CVVB procedure is developed through a detailed example of the model selection approach, continuing the analysis of the cognitive model for decision-making. The example enables us to revisit a theoretical question about the speed-accuracy tradeoff in decision-making (Lee, 2008; Rae, Heathcote, Donkin, Averell, & Brown, 2014; Starns, Ratcliff, & McKoon, 2012); the question was previously addressed by statistical model selection methods with the shortcomings described above. Using CVVB, we are able to address the question of key scientific interest using more advanced model selection methods.

Variational Bayes
This section introduces the basic ideas behind VB methods. We focus on the "fixed form" method, also known as stochastic VB, which is currently widely used in the machine learning and statistics literatures. We then introduce particular applications of the method, which we will focus on in this article. These methods are particularly well-suited to applications in psychology, where almost all models include random effects (for participants) and have correlated parameters (due to the overlapping and inter-dependent nature of the underlying psychological constructs being modelled).

Bayesian model selection involves choosing between competing models (including priors). The basic model is defined by its likelihood function $p(y|\theta)$, which gives the probability density for observing data $y$ given parameters $\theta = (\theta_1, \dots, \theta_p)$. In the Bayesian approach, the model parameters are governed by a prior distribution $p(\theta)$ which encodes existing knowledge about plausible values for those parameters. The goal of inference is to estimate the posterior distribution $p(\theta|y)$, which encodes the plausibility of different parameter values, given the data. Closed-form solutions for the posterior distribution are rarely available, so Bayesian analysis requires methods for approximating the posterior distribution. Markov chain Monte Carlo (MCMC) produces simulation-consistent Bayesian inference, i.e., we obtain exact answers as the number of MCMC iterates increases. A key disadvantage of MCMC methods for psychological models is that they can be very inefficient computationally when the posterior distribution is high-dimensional, i.e., the model has many parameters, or when the model's parameters are strongly correlated (Turner, Sederberg, Brown, & Steyvers, 2013).

Variational Bayes (VB) is an approximate method to estimate the posterior.
It is based on optimization: an easy-to-use distribution is chosen to approximate the posterior distribution, and then parameters for the approximating distribution are found by optimizing the fit to the posterior distribution. Let $q_\lambda(\theta)$ denote the approximating distribution for $\theta$, which has parameters $\lambda$ called the variational parameters. The best choice for these parameters is identified by minimizing the Kullback-Leibler (KL) divergence between the approximating distribution and the posterior distribution:
\[
\mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta|y)) := E_{q_\lambda(\theta)}\left[ \log \frac{q_\lambda(\theta)}{p(\theta|y)} \right].
\]
The notation $E_f$ denotes the expectation with respect to the distribution $f$. The KL divergence has the property that $\mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta|y)) \ge 0$, with equality if and only if $q_\lambda(\theta) = p(\theta|y)$. It follows that
\[
\begin{aligned}
0 \le \mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta|y)) &= E_{q_\lambda(\theta)}[\log q_\lambda(\theta) - \log p(\theta|y)] \\
&= E_{q_\lambda(\theta)}[\log q_\lambda(\theta) - \log p(y, \theta) + \log p(y)] \\
&= E_{q_\lambda(\theta)}[\log q_\lambda(\theta) - \log p(y, \theta)] + \log p(y);
\end{aligned}
\]
hence
\[
\log p(y) \ge \mathcal{L}(\lambda) := E_{q_\lambda(\theta)}[\log p(y, \theta) - \log q_\lambda(\theta)].
\]
Hence, minimizing the KL divergence between $q_\lambda(\theta)$ and $p(\theta|y)$ is equivalent to maximizing $\mathcal{L}(\lambda)$, which is called the lower bound. This allows optimization of the fit between the approximating and posterior distributions to proceed by searching on the parameters $\lambda$ to maximize the quantity $\mathcal{L}(\lambda)$. The search can be computationally difficult if the approximating distribution has many parameters or is chosen poorly. Our approach relies on recent developments in the statistical literature to simplify the optimization. We apply stochastic-gradient search methods (Robbins & Monro, 1951), and improve their precision using the reparameterization "trick" of Kingma and Welling (2013) and Rezende, Mohamed, and Wierstra (2014). We further simplify the problem by reducing the dimension of $\lambda$, using a factor structure for some of its parameters.
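For intuition, the lower bound can be estimated by simple Monte Carlo. The sketch below does this for a diagonal Gaussian $q_\lambda$ using the reparameterization trick mentioned above; it is a toy illustration only (the approximation developed later in the paper uses a factor covariance structure), and `log_joint`, standing in for $\log p(y, \theta)$, is a hypothetical placeholder.

```python
import numpy as np

def elbo_estimate(log_joint, mu, log_sd, n_samples=100, seed=1):
    """Monte Carlo estimate of the lower bound
        L(lambda) = E_q[ log p(y, theta) - log q_lambda(theta) ]
    for a diagonal Gaussian q_lambda = N(mu, diag(exp(log_sd)^2)),
    using the reparameterization theta = mu + sd * eps, eps ~ N(0, I).

    `log_joint(theta)` is a user-supplied stand-in for log p(y, theta).
    """
    rng = np.random.default_rng(seed)
    sd = np.exp(log_sd)
    eps = rng.standard_normal((n_samples, mu.size))
    theta = mu + sd * eps                          # reparameterization "trick"
    # log q_lambda(theta) for each draw (diagonal Gaussian log density)
    log_q = -0.5 * np.sum(eps**2 + 2 * log_sd + np.log(2 * np.pi), axis=1)
    log_p = np.array([log_joint(t) for t in theta])
    return float(np.mean(log_p - log_q))           # estimate of L(lambda)
```

Because the draws are written as a deterministic function of $\lambda$ and noise $\epsilon$, the same construction lets a stochastic-gradient optimizer differentiate the estimate with respect to $\lambda$, which is the basis of the search described in the text.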
Finally, we automate the problem of identifying separate step sizes for the elements of the vector $\lambda$ using the adaptive learning rate and stopping rule developed by Zeiler (2012). Appendix A gives the details.

Gaussian Variational Bayes with a Factor Covariance Structure
Gaussian VB is the most common VB approach; here the variational distribution $q_\lambda(\theta) = N_p(\theta|\mu, \Sigma)$ is Gaussian. Gaussian VB is often motivated by the observation that the posterior can be well approximated by a normal distribution under general conditions, when there are sufficient data (Bernardo & Smith, 2009). For a Gaussian approximating distribution, the dimension of $\lambda$ is $p + p(p+1)/2$. This means that the dimension of the parameters to be searched over in the approximation step increases quadratically with the number of model parameters, due to all the covariance elements in the matrix $\Sigma$. One way to simplify the optimization problem is to set $\Sigma$ to a diagonal matrix, but this is unsatisfactory for psychological models because it makes the very restrictive assumption of posterior independence between the components (as in Turner et al., 2013).

Following Ong, Nott, and Smith (2018), we make the covariance matrix parsimonious by using a standard factor structure; i.e., we assume that $\Sigma = BB^\top + D$, where $B$ is a $p \times r$ matrix and $D$ is a diagonal matrix with diagonal elements $d = (d_1, \dots, d_p)$. By choosing the number of factors $r \ll p$, the factor approximation is simpler and the VB optimization is more tractable. The approximating distribution is a normal distribution with mean $\mu$ and variance matrix $\Sigma$, which means that the size of the search problem is much smaller; the vector to be searched over is $\lambda = (\mu^\top, \mathrm{vec}(B)^\top, d^\top)^\top$. (Here $N_p(\mu, \Sigma)$ denotes a $p$-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$; $N_p(\theta|\mu, \Sigma)$ denotes the corresponding multivariate normal density with argument $\theta$; $\mathrm{vec}(A)$ is the column vector obtained by stacking the columns of the matrix $A$ under each other from left to right.) Approximating the posterior distribution by searching over $\lambda$ is made even more efficient by applying the reparameterization trick to reduce the variance of the gradient estimate of the lower bound, leading to fast and accurate approximations of the gradient during the search (see Appendix A).

Variational Bayes for Psychological Models with Random Effects
This section develops the Gaussian VB method presented in the previous section for Bayesian inference with hierarchical psychological models. In a hierarchical model, participants are allowed to have different values for one or more of the model parameters, and such parameters are called random effects. These random effects capture the important psychological differences between participants, and avoid many of the problems associated with averaging across people. We make the model estimation more tractable by assuming that the random effects follow some group-level distribution, rather than being independent across people. Here, we assume that the distribution of random effects in the population is multivariate normal, possibly after an appropriate parameter transformation.

The application of simple Gaussian VB to a generic cognitive model that is defined by some arbitrary density function is first illustrated. The approximation is then improved by exploiting the structure of hierarchical cognitive models.

Suppose there are $J$ participants who all perform a cognitive task, with each subject completing multiple trials; on each trial, a stimulus is presented and the subject produces a response. For participant $j$, the observed response on trial $i$ is denoted $y_{ji}$, with $y_{ji}$ generated by $p(y_{ji}|\alpha_j)$, the density function of the observations according to the cognitive model, where $\alpha_j = (\alpha_{j1}, \dots, \alpha_{jD_\alpha})$ is the vector of $D_\alpha$ parameters. The $n_j$ responses from participant $j$ are denoted $y_j = (y_{j1}, \dots, y_{jn_j})$ and the collection of responses from the sample of $J$ participants is $y = (y_1, \dots, y_J)$. With the usual assumptions of independence between trials, the conditional density of all the observations is
\[
p(y|\alpha) = \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ji}|\alpha_j). \tag{1}
\]
We assume the elements of $\alpha_j$ have support on the real line (possibly after transformation). This assumption makes it possible to assume a multivariate normal distribution for the group-level distribution of the random effects. The full model for the data is:

1. Conditional density: $y_{ji}|\alpha_j \overset{\text{i.i.d.}}{\sim} p(y_{ji}|\alpha_j)$ for $j = 1, \dots, J$; $i = 1, \dots, n_j$.

2. A multivariate normal distribution for the random effects:
\[
\alpha_j | \mu_\alpha, \Sigma_\alpha \overset{\text{i.i.d.}}{\sim} N(\mu_\alpha, \Sigma_\alpha). \tag{2}
\]

3. Priors for the model parameters: We follow Gunawan, Hawkins, Tran, Kohn, and Brown (2020) and use a normal prior for $\mu_\alpha$ and the marginally non-informative prior for $\Sigma_\alpha$ suggested by Huang, Wand, et al. (2013):
\[
\begin{aligned}
\mu_\alpha &\sim N(0, I), \\
\Sigma_\alpha | a_1, \dots, a_{D_\alpha} &\sim \mathrm{IW}(D_\alpha + 1, \Psi), \quad \Psi = 4\,\mathrm{diag}\!\left(\frac{1}{a_1}, \dots, \frac{1}{a_{D_\alpha}}\right), \\
a_1, \dots, a_{D_\alpha} &\sim \mathrm{IG}\!\left(\frac{1}{2}, 1\right). \tag{3}
\end{aligned}
\]
The notation $\mathrm{IW}(\nu, A)$ denotes an inverse Wishart distribution with degrees of freedom $\nu$ and scale matrix $A$, and $\mathrm{IG}(1/2, 1)$ denotes an inverse gamma distribution with shape parameter $1/2$ and scale parameter 1.

Gaussian Variational Bayes
The parameter vector of the psychological model, $\theta$, includes random effects for every subject ($\alpha_{1:J}$), the group-level mean ($\mu_\alpha$) and variance ($\Sigma_\alpha$) parameters, as well as the hyperparameters $a = (a_1, \dots, a_{D_\alpha})$ of the prior. The random effects ($\alpha_{1:J}$) and the group-level means ($\mu_\alpha$) have support on the real line, but the covariance parameters ($\Sigma_\alpha$) are restricted to form a positive definite covariance matrix, and the hyperparameters $a$ are strictly positive. These constraints make it unreasonable to approximate the posterior distribution by a Gaussian distribution. To obtain a useful Gaussian variational approximation, we transform the parameters, where necessary, so that all the elements have support on the full real line. Let $\Sigma_\alpha = C_\alpha C_\alpha^\top$ be the Cholesky decomposition of the group-level covariance matrix, with $C_\alpha$ a lower triangular matrix with positive elements on the diagonal. We can therefore reparametrize $\Sigma_\alpha$ by an unconstrained vector lying on the real line, consisting of the strict lower triangle of $C_\alpha$ and the logarithms of the diagonal elements of $C_\alpha$. We similarly log-transform the hyperparameters $a$. The working parameters are $\tilde\theta = (\alpha_1^\top, \dots, \alpha_J^\top, \mu_\alpha^\top, \mathrm{vech}(C_\alpha^*)^\top, \log(a_1), \dots, \log(a_{D_\alpha}))^\top$, with $C_\alpha^*$ indicating the lower triangle of the matrix $C_\alpha$. Appendix B gives the technical details.

Hybrid Gaussian Variational Bayes
We now develop a novel extension to Gaussian VB for hierarchical models with random effects, which exploits the structure of the posterior distribution. In the hierarchical models we consider, the posterior distribution can be factored as
\[
p(\alpha_{1:J}, \mu_\alpha, \Sigma_\alpha, a | y) = p(\alpha_{1:J}, \mu_\alpha, a | y)\, p(\Sigma_\alpha | \alpha_{1:J}, \mu_\alpha, a, y).
\]
It is not difficult to show that the conditional density $p(\Sigma_\alpha | \alpha_{1:J}, \mu_\alpha, a, y)$ is the density of $\mathrm{IW}(\Sigma_\alpha | \nu, \Psi)$ with $\nu = D_\alpha + J + 1$ and $\Psi = \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)(\alpha_j - \mu_\alpha)^\top + 4\,\mathrm{diag}(1/a_1, \dots, 1/a_{D_\alpha})$ (Appendix C, Lemma 1).

This suggests that it is only necessary to approximate the joint posterior of the random effects vectors ($\alpha_{1:J}$), the group-level mean parameters ($\mu_\alpha$), and the hyperparameters ($a$). That is, we use a VB approximating distribution, $q_\lambda(\theta)$, of the form
\[
q_\lambda(\alpha_{1:J}, \mu_\alpha, a, \Sigma_\alpha) = q_\lambda(\alpha_{1:J}, \mu_\alpha, a)\, \mathrm{IW}(\Sigma_\alpha | \nu, \Psi).
\]
(Here $\mathrm{vech}(A_{n \times n})$ is the $(n(n+1)/2)$-vector obtained from $\mathrm{vec}(A)$ by omitting all the upper triangular elements of $A$.) This hybrid variational distribution takes into account the posterior dependence between $\Sigma_\alpha$ and the other parameters, which allows for a more accurate approximation to the posterior. The set of parameters is now $\tilde\theta = (\alpha_{1:J}, \mu_\alpha, \log a, \Sigma_\alpha)$ and the data-parameter joint density becomes
\[
p(y, \alpha_{1:J}, \mu_\alpha, \Sigma_\alpha, \log a) = \prod_{j=1}^{J} p(y_j|\alpha_j)\, N(\alpha_j|\mu_\alpha, \Sigma_\alpha)\, N(\mu_\alpha|0, I)\, \mathrm{IW}(\Sigma_\alpha|\nu, \Psi) \times \prod_{d=1}^{D_\alpha} \mathrm{IG}(a_d|1/2, 1)\, |J_{a_d \to \log a_d}|,
\]
where $J_{a_d \to \log a_d} = a_d$ is the Jacobian of the transformation. If the parameters are separated as $\theta_1 = (\alpha_{1:J}, \mu_\alpha, \log a)$ and $\theta_2 = \Sigma_\alpha$, and $q_\lambda(\theta_1)$ is parameterized by a Gaussian density that assumes a reduced factor structure for its covariance matrix, then the variational distribution has the parametric form
\[
q_\lambda(\theta_1, \Sigma_\alpha) = N(\theta_1|\mu, BB^\top + D)\, \mathrm{IW}(\Sigma_\alpha|\nu, \Psi),
\]
with the variational parameters $\lambda = (\mu, B, d)$ (recall $D$ is a diagonal matrix with the diagonal vector $d$).
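A draw from this hybrid variational distribution can be sketched in two steps: $\theta_1$ comes from the factor-structured Gaussian, while the conditional inverse Wishart for $\Sigma_\alpha$ only requires computing $\nu$ and $\Psi$ from the current $\alpha_{1:J}$, $\mu_\alpha$ and $a$. This is an illustrative sketch, not the full algorithm; the function names are our own, and the degrees of freedom follow the reconstruction $\nu = D_\alpha + J + 1$ used above.

```python
import numpy as np

def sample_theta1(mu, B, d, rng):
    """Draw theta_1 = (alpha_{1:J}, mu_alpha, log a) from the Gaussian
    variational factor N(mu, B B^T + D) via theta_1 = mu + B eps1 + d * eps2."""
    p, r = B.shape
    eps1 = rng.standard_normal(r)          # factor-level noise
    eps2 = rng.standard_normal(p)          # idiosyncratic noise
    return mu + B @ eps1 + d * eps2

def conditional_iw_params(alpha, mu_alpha, a):
    """Parameters (nu, Psi) of the exact conditional
    p(Sigma_alpha | alpha_{1:J}, mu_alpha, a) = IW(nu, Psi),
    with nu = D_alpha + J + 1 (reconstructed degrees of freedom) and
    Psi = sum_j (alpha_j - mu_alpha)(alpha_j - mu_alpha)^T + 4 diag(1/a)."""
    J, D_alpha = alpha.shape
    nu = D_alpha + J + 1
    centred = alpha - mu_alpha             # subtract the group-level mean
    Psi = centred.T @ centred + 4.0 * np.diag(1.0 / a)
    return nu, Psi
```

Given `nu` and `Psi`, $\Sigma_\alpha$ would then be drawn from an inverse Wishart sampler; keeping that draw exact, rather than Gaussian-approximated, is precisely what makes the scheme "hybrid".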
We refer to this approach as Hybrid Gaussian VB. We can write $\theta_1 = u(\epsilon; \lambda) := \mu + B\epsilon_1 + d \odot \epsilon_2$, with $\epsilon = (\epsilon_1^\top, \epsilon_2^\top)^\top \sim N(0, I)$. Using the reparameterization trick, the lower bound can be written as
\[
\mathcal{L}(\lambda) = E_{(\epsilon, \theta_2)}\left[ \log p(y, u(\epsilon; \lambda), \theta_2) - \log q_\lambda(u(\epsilon; \lambda)) - \log p(\theta_2 | u(\epsilon; \lambda), y) \right].
\]
The idea of hybrid VB was also explored recently by Loaiza-Maya, Smith, Nott, and Danaher (2020); however, they do not include the term $\nabla_{\theta_1} \log p(\theta_2 | \theta_1, y)$ in their calculation of the lower bound gradient. Appendix A gives details for the gradient function of this lower bound, including efficient estimation methods based on the work of Loaiza-Maya et al. (2020).

CVVB: Model Selection by Variational Bayes with K-Fold Cross-validation

The aim of cross-validation (CV) is to assess how well a model will predict out of sample. There are several versions of CV (Arlot, Celisse, et al., 2010). The popular K-fold CV divides the data into K approximately equal parts called 'folds'. The model is first estimated using folds 2 to K (the "estimation data") and the estimated model is then used to predict the data in the first fold (the "validation data"). This is then repeated with folds 2 to K successively left out of the estimation and used for model validation. CV can be computationally expensive as the process must be repeated many times, holding out a different fold each time.

This section describes a strategy for speeding up K-fold cross-validation based on VB, which we refer to as cross-validation variational Bayes (CVVB). Our approach is based on two key observations. First, VB is very fast and is also good for prediction (Blei et al., 2017). Second, when the data are randomly split into folds of similar sizes, the VB approximations should not differ much across the data folds.
Because of this, we can initialize the VB search algorithm for every fold after the first one using the results of the first estimation. Good initialization is important in VB optimization and helps to significantly speed up convergence.

CVVB can be used as a model selection method by choosing the best model based on predictive performance on the held-out data. Alternatively, for those who prefer exact Bayesian methods, CVVB may be used as a model screening tool. That is, when given a very large set of models to evaluate, one can use CVVB to efficiently screen out the poorest-performing models. This reduces the set of candidate models to a manageable size, and it is then possible to use slower exact Bayesian methods (such as the marginal likelihood) on the remaining models.

An important choice in K-fold CV is the choice of loss function for the validation fold. In principle, almost any statistic which summarizes the discrepancy between the model's predictions and the held-out data is adequate. In Bayesian statistics, predictive performance is most commonly measured by the expected log predictive density (ELPD) (Gelman et al., 2013):
\[
\mathrm{ELPD} := \int \log p(\tilde y | y)\, p^*(\tilde y)\, d\tilde y;
\]
$p^*(\tilde y)$ is the unknown true distribution of future observations $\tilde y$, and $p(\tilde y | y)$ is the posterior predictive density. This is the density of the future observations, integrated over the posterior distribution of the parameters:
\[
p(\tilde y | y) = \int p(\tilde y | \theta)\, p(\theta | y)\, d\theta.
\]
It is straightforward to estimate the ELPD by CV. The data are partitioned into $K$ folds of similar sizes, $y^{(k)}$, $k = 1, \dots, K$ (a typical choice of $K$ is 5 or 10). Let $y^{(-k)}$ be the data after fold $k$ is left out. For random effects models, we partition the data at the subject level, i.e., the data from each subject is randomly split into $K$ disjoint subsets, hence $y^{(k)} = (y_1^{(k)}, \dots, y_J^{(k)})$ consists of observations from all subjects for fold $k$ (Appendix E gives the details of CVVB applied to random effects models). The $K$-fold cross-validation estimate for the ELPD is
\[
\widehat{\mathrm{ELPD}}_{K\text{-CV}} := \frac{1}{K} \sum_{k=1}^{K} \log p(y^{(k)} | y^{(-k)}).
\]
The term $p(y^{(k)} | y^{(-k)})$ is the posterior predictive density for the $k$th fold, and represents the log score when the data in that fold are treated as unseen, and predicted using the posterior distribution estimated from the other folds. Using VB methods, this posterior predictive density can be estimated by drawing $S$ samples from the variational distribution as
\[
p(y^{(k)} | y^{(-k)}) = \int p(y^{(k)} | \theta)\, p(\theta | y^{(-k)})\, d\theta \approx \int p(y^{(k)} | \theta)\, q_{\lambda^{(k)}}(\theta)\, d\theta \approx \frac{1}{S} \sum_{s=1}^{S} p(y^{(k)} | \theta^{(s)}), \quad \theta^{(s)} \sim q_{\lambda^{(k)}}(\theta),\ s = 1, \dots, S.
\]
Here, $q_{\lambda^{(k)}}(\theta)$ is the VB posterior approximation for the leave-$k$th-fold-out posterior $p(\theta | y^{(-k)})$. By replacing the posterior predictive density $p(y^{(k)} | y^{(-k)})$ with the VB approximation, the $K$-fold CVVB estimate for the ELPD is obtained as
\[
\widehat{\mathrm{ELPD}}_{K\text{-CVVB}} := \frac{1}{K} \sum_{k=1}^{K} \log\left( \frac{1}{S} \sum_{s=1}^{S} p(y^{(k)} | \theta^{(s)}) \right).
\]
Although it is necessary to run the VB algorithm $K$ times for $K$-fold CV, the warm-up initialization strategy discussed above means that the time taken to run all $K$ repetitions is almost the same as running VB once on the full data set. Using the samples from the VB approximating distribution ($q_{\lambda^{(k)}}(\theta)$) rather than from the exact posterior ($p(\theta | y^{(-k)})$) means that we only obtain approximate inference. However, this loss is offset by a very large gain in computational efficiency, making the CVVB approach very attractive for quickly screening a large set of competing models.

An Illustrative Application of Variational Bayes: Decision-Making by Evidence Accumulation
We now apply the novel VB methods to an evidence accumulation model (EAM) of decision making. EAMs propose that decisions between competing alternative outcomes are made by accumulating evidence in favour of each possible response. The accumulation continues until a pre-defined threshold level of evidence is exceeded, after which the response corresponding to the winning accumulator is executed. While all EAMs share this basic structure, they differ in the specific details of the accumulation process and threshold setting. EAMs have been used to address important theoretical and applied questions in psychology (for reviews, see Donkin & Brown, 2018; Ratcliff, Smith, Brown, & McKoon, 2016). For example, EAMs helped to resolve theoretical debates about the mechanisms which underpin the cognitive slowdown observed during healthy ageing. It has long been known that older adults respond more slowly in many cognitive tasks than younger adults. For many decades, age-related slowing was attributed to a decrease in the rate of information processing (the famous "generalized slowdown" hypothesis; Salthouse, 1996). By applying EAMs to the data of older and younger adults, it was observed that a large proportion of the age-related slowdown effect was caused by increased caution rather than a decreased rate of processing (Forstmann et al., 2011; Ratcliff & Smith, 2004; Starns & Ratcliff, 2010; Thapar, Ratcliff, & McKoon, 2003). This kind of result typifies the benefit of using cognitive models to address applied questions, sometimes known as "cognitive psychometrics" (Batchelder, in press). Important psychological insights are supported by choosing between competing theories, which are represented by different model variants; e.g., comparing an EAM in which processing rate differs between younger and older groups vs.
an EAM in which caution differs.

We focus on the linear ballistic accumulator (LBA; Brown & Heathcote, 2008), which is simpler than many other EAMs in that it assumes no competition between alternatives (Brown & Heathcote, 2005), no passive decay of evidence (Usher & McClelland, 2001) and no within-trial variability (Ratcliff, 1978; Stone, 1960). This simplicity permits closed-form expressions for the likelihood function for the model parameters, which supports advanced statistical techniques including Bayesian methods based on MCMC and particle algorithms (Gunawan et al., 2020; Tran et al., in press; Turner et al., 2013; Wall et al., in press).

Most modern applications of the LBA model include a hierarchical random effects structure for individual differences. Bayesian methods for inference with the hierarchical LBA were first developed by Turner et al. (2013). Recent developments have increased the efficiency of these exact methods, and extended them to allow for correlation between random effects (Gunawan et al., 2020). Even though these newer MCMC methods are more efficient than earlier methods, the computation time can still be quite costly. For example, for an experiment with 100 subjects, each of whom contributes 1,000 decisions, it can take several hours to estimate the model on a high-performance computer. This computational cost is one of the primary motivations for exploring VB methods.

We use the VB methods developed above to explore LBA models of decision-making in three data sets, as well as in a simulation study. We then demonstrate that addressing model selection among a large class of competing models is both feasible and practical with the CVVB approach. The CVVB approach is then used to address, more comprehensively than previous analyses, a debate about the effects of caution vs. urgency on decision-making (Rae et al., 2014).
The LBA Model of Decision Making
The LBA model (Brown & Heathcote, 2008) represents a choice between several alternatives as a race between different evidence accumulators, one for each response (see Figure 1); however, see van Ravenzwaaij, Brown, Marley, and Heathcote (2019) for more flexible extensions. Each evidence accumulator begins the decision trial with a starting amount of evidence k that increases at a speed given by the "drift rate" d. Accumulation continues until a response threshold b is reached. The first accumulator to reach the threshold determines the response, and the time taken to reach the threshold is the response time (RT), plus some extra constant time for non-decision processes, τ.

To explain the observed variability in the data, the model assumes that the starting points for evidence accumulators are random values drawn from a uniform distribution on the interval [0, A], and the drift rates are drawn from normal distributions with means v_1, v_2, ... for the different response accumulators. It is usual to assume a common standard deviation s for all accumulators (but see also Donkin, Brown, & Heathcote, 2009). All random values are drawn independently for each accumulator, and are independent across decision trials. With these assumptions, Brown and Heathcote (2008) and Terry et al. (2015) derive expressions for the distribution of the time to reach threshold, which we denote by F_c and f_c for the cumulative distribution function and probability density function, respectively. The joint density over response time RT = t and response choice RE = c is

LBA(c, t | b, A, v, s, τ) = f_c(t) × ∏_{k ≠ c} (1 − F_k(t)),

with v = (v_1, v_2, ...). Note that it is also possible to have parameters other than v change between accumulators. For example, strategic decision biases may be represented by allowing different response thresholds (b) between accumulators.
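As a concrete sketch, the single-accumulator F and f derived by Brown and Heathcote (2008), together with the race density above, can be coded as follows. This is a minimal illustration with our own function names, not code from the paper:

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT2PI = math.sqrt(2.0 * math.pi)

def _phi(x):   # standard normal pdf
    return math.exp(-0.5 * x * x) / SQRT2PI

def _Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / SQRT2))

def lba_cdf(t, b, A, v, s):
    """F(t): probability that one accumulator has reached threshold b by time t."""
    z1 = (b - A - t * v) / (t * s)
    z2 = (b - t * v) / (t * s)
    return (1.0
            + ((b - A - t * v) / A) * _Phi(z1)
            - ((b - t * v) / A) * _Phi(z2)
            + (t * s / A) * _phi(z1)
            - (t * s / A) * _phi(z2))

def lba_pdf(t, b, A, v, s):
    """f(t): density of a single accumulator's threshold-crossing time."""
    z1 = (b - A - t * v) / (t * s)
    z2 = (b - t * v) / (t * s)
    return (1.0 / A) * (-v * _Phi(z1) + s * _phi(z1) + v * _Phi(z2) - s * _phi(z2))

def lba_density(c, t, b, A, v, s, tau):
    """Joint density of choice c and RT t: f_c(t - tau) * prod over k != c of (1 - F_k(t - tau))."""
    td = t - tau
    if td <= 0:
        return 0.0
    dens = lba_pdf(td, b, A, v[c], s)
    for k in range(len(v)):
        if k != c:
            dens *= 1.0 - lba_cdf(td, b, A, v[k], s)
    return max(dens, 0.0)
```

The race density multiplies the winning accumulator's crossing-time density by each losing accumulator's survivor function, evaluated at the decision time (RT minus non-decision time).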
In these cases, the expression above generalizes in the obvious way, e.g., replacing the scalar parameter b with a vector b.

Figure 1. An illustration of the LBA model for a binary choice with two evidence accumulators, one for "Response A" (left panel) and one for "Response B" (right panel). Evidence accumulates for each response until one reaches a threshold (b). The speed of evidence accumulation (drift rate d) and starting points (k) are random from decision to decision and between accumulators.

The observed data from a single decision is represented by the vector of response time and choice, which we denote y_i^⊤ = (RE_i, RT_i). If a participant provides a sequence of n decisions, the vector of all data for the participant is denoted by y^⊤ = (y_1^⊤, ..., y_n^⊤). Assuming independence across decision trials, the density for the data set is given by

p(y | b, A, v, s, τ) = ∏_{i=1}^{n} LBA(y_i | b, A, v, s, τ).

For VB with the LBA model, this term replaces the generic model p(y | α).

Hierarchical LBA Models
We illustrate the generalization of the LBA model from how one person makes decisions to how a group of people makes decisions with an example typical of the literature. Forstmann et al. (2008) collected data from 19 participants who performed a simple perceptual decision-making task. The participants were asked to decide, repeatedly, whether a cloud of semi-randomly moving dots appeared to move to the left or to the right. In addition, each participant was asked on some trials to respond very urgently, on other trials to respond very carefully, and on others to respond neutrally. These three speed-accuracy tradeoff conditions were of primary interest in the Forstmann et al. analysis.

To capture the differences between the subjects, as well as the differences between the three conditions, Gunawan et al. (2020) proposed a hierarchical LBA model with three different threshold parameters b^(a), b^(n) and b^(s) for the accuracy, neutral and speed conditions, respectively. They also proposed two parameters for the means of the drift rate distributions: one for drift rates in the accumulator corresponding to the correct response on each trial (v_c) and the other for the error response (v_e). Gunawan et al. assumed that the standard deviation of the drift rate distribution was always s = 1. With these assumptions, each subject j has the vector of random effects

z_j = (b_j^(a), b_j^(n), b_j^(s), A_j, v_j = (v_jc, v_je), τ_j).

Let J be the total number of subjects (J = 19 in this case); let n_j^(t) be the number of trials (decisions) made by participant j in condition t; denote by y_ji^(t) the i-th decision from subject j under condition t.
With the usual independence assumptions, the conditional density of all the observations is

p(y | b, A, v, τ) = ∏_{j=1}^{J} ∏_{t ∈ {a,n,s}} ∏_{i=1}^{n_j^(t)} LBA(y_ji^(t) | b_j^(t), A_j, v_j, τ_j),

which replaces the generic form in Equation (1) with the LBA density of all the observations.

Our article makes a small change to the parameterization proposed by Gunawan et al. (2020). To take into account the constraint that thresholds (b) must always be higher than the top of the start point distribution (A), we parameterize c_j^(t) = b_j^(t) − A_j for j = 1, ..., J; t ∈ {a, n, s}. We follow Gunawan et al. (2020) and log-transform all the random effects, which gives them support on the entire real line, and in many cases also leads to approximately normal distributions of the random effects across subjects. For each subject j = 1, ..., J, we define the vector of log-transformed random effects

α_j = (α_j1, ..., α_j7) := log(c_j^(a), c_j^(n), c_j^(s), A_j, v_j^⊤ = (v_jc, v_je), τ_j).

Let D_α be the dimension of α_j (in this case, D_α = 7). Then, the conditional density of the hierarchical LBA model is defined by y_ji^(t) | α_j i.i.d. ∼ LBA(y_ji^(t) | c_j^(t), A_j, v_j, τ_j) for j = 1, ..., J; t ∈ {a, n, s}; i = 1, ..., n_j^(t). The prior for the random effects (that is, the group-level distribution) and the priors for the model parameters are as specified in Equations (2) and (3).

Applying Variational Bayes to the Hierarchical LBA Model
We first demonstrate the Gaussian VB and Hybrid Gaussian VB methods by using them to estimate the hierarchical LBA model from the data reported by Forstmann et al. (2008). This experiment is small enough to make exact Bayesian inference using MCMC feasible. To assess the quality of the VB approximations, we compare the VB results to the exact posterior estimated using the Particle Metropolis within Gibbs sampler (PMwG: Gunawan et al., 2020).

The posterior was approximated using Gaussian VB and Hybrid Gaussian VB, in each case using 20 factors to reduce the dimension of the approximating distribution. This represents a substantial simplification from the full model, which has p = 161 parameters (7 group-level mean parameters, 21 parameters for the covariance matrix of those means, and 19 × 7 = 133 subject-level random effects). The gradient estimates used N = 10 Monte-Carlo samples. The step sizes are set by using the adaptive learning rate algorithm ADADELTA, with ξ = 10^{-6} and v = 0.95; see Appendix A. The computation times for the Gaussian VB and Hybrid Gaussian VB methods were both less than 5 minutes on an average desktop computer (Intel(R) Core(TM) i5-6500 CPU, 3.20GHz and 8 GB of RAM). By comparison, the run time for the PMwG method on the same system was approximately 2 hours.
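The fixed-form updates just described can be illustrated with a stripped-down sketch: a diagonal-Gaussian approximating distribution and a plain fixed step size, rather than the factor structure and ADADELTA steps used in the paper. The function names and the toy target are our own:

```python
import numpy as np

def gvb_step(mu, log_sd, log_post_grad, step=0.01, S=10, rng=None):
    """One stochastic-gradient ascent step on the ELBO for a fixed-form
    q(theta) = N(mu, diag(exp(log_sd))^2), using the reparameterization trick
    with S Monte-Carlo samples."""
    rng = np.random.default_rng() if rng is None else rng
    sd = np.exp(log_sd)
    g_mu = np.zeros_like(mu)
    g_ls = np.zeros_like(log_sd)
    for _ in range(S):
        eps = rng.standard_normal(mu.shape)
        g = log_post_grad(mu + sd * eps)        # grad of log joint at theta = mu + sd*eps
        g_mu += g / S
        g_ls += (g * eps * sd + 1.0) / S        # "+1" is the Gaussian entropy gradient
    return mu + step * g_mu, log_sd + step * g_ls

# Toy run: approximate a standard normal target, log p(theta) = -theta^2 / 2.
rng = np.random.default_rng(2)
mu, log_sd = np.array([2.0]), np.array([1.0])
for _ in range(3000):
    mu, log_sd = gvb_step(mu, log_sd, lambda th: -th, rng=rng)
```

After the loop, mu is near 0 and exp(log_sd) near 1, the exact posterior moments for this toy target. The same reparameterization-gradient idea underlies the Gaussian VB methods in the paper, with a richer factor-structured covariance in place of the diagonal one here.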
Figure 2. Comparing the means and standard deviations of the marginal posterior distributions estimated by VB (vertical axis) against the exact values calculated using PMwG (horizontal axis). The top panels show the means and standard deviations of the group-level parameters. The bottom panels show the means and standard deviations of the random effects. The Gaussian VB (GVB) and Hybrid GVB methods accurately recover the mean of the posterior, but underestimate the standard deviation.

Hybrid Gaussian VB provides a better approximation to the posterior distribution, as indicated by a greater lower bound than Gaussian VB (7,275 vs. 7,242). To assess the quality of the marginal inference, the two left panels of Figure 2 compare the posterior means estimated by the VB methods against the exact posterior means calculated using PMwG. Both Gaussian and Hybrid Gaussian VB capture the posterior means quite precisely, for both the group-level mean parameters (top left panel) and the individual-subject random effects (lower left panel). The right panels of Figure 2 show the corresponding comparison for the estimated standard deviations of the posterior distribution. The standard deviation of the posterior is underestimated by both methods, which is typical of VB. However, Hybrid Gaussian VB provides much more accurate estimates of the posterior standard deviations of the group-level parameters than Gaussian VB (top right panel); this demonstrates a clear advantage of the Hybrid Gaussian VB method.

We now compare the predictive densities estimated using PMwG with those obtained using the hybrid VB approximation.

Figure 3. The top panels show the posterior predictive correct response time densities for subject 2 under the accuracy (leftmost panel), neutral (central panel) and speed emphasis (rightmost panel) conditions. The posterior predictive incorrect response time densities for this participant are shown in the bottom panels.

Figures 3 and 4 show these posterior predictive densities for subjects 2 and 9, respectively. The results for the other subjects are similar. The fact that the posterior predictive densities are very well approximated by VB supports the claim that VB gives very good predictions. Appendix D gives the algorithm to obtain the predictive densities for the hierarchical LBA models.
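The basic Monte Carlo recipe behind such predictive densities is to average the conditional density over posterior draws. A generic sketch, using a stand-in exponential density rather than the LBA density (the function and variable names are ours):

```python
import math

def predictive_density(t, posterior_draws, cond_density):
    """Monte Carlo posterior predictive density at t:
    (1/M) * sum over m of p(t | theta_m), averaged over M posterior draws."""
    return sum(cond_density(t, theta) for theta in posterior_draws) / len(posterior_draws)

# Stand-in example: exponential conditional density, with three posterior
# draws of its rate parameter standing in for an MCMC or VB posterior sample.
draws = [0.8, 1.0, 1.2]
dens = predictive_density(0.5, draws, lambda t, r: r * math.exp(-r * t))
```

For the hierarchical LBA, `cond_density` would be the LBA density and the draws would be posterior samples of a subject's random effects; the full algorithm is given in Appendix D.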
CVVB in Action: A More Thorough Evaluation of Selective Influence in LBA Models of Decision-Making
The notion of "selective influence" has been important in evaluating psychological models, including evidence accumulation models (Ratcliff & Rouder, 1998; Voss, Rothermund, & Voss, 2004). An experimental manipulation (e.g., changing the brightness of a perceptual decision stimulus) is said to selectively influence a particular model parameter (e.g., drift rate) if the model can account for differences in observed data caused by the manipulation via adjustments in only that one parameter.

Rae et al. (2014) and Starns, Ratcliff, and White (2012) identified an important violation of selective influence in both the LBA model and the related diffusion decision model. When decision-makers were asked to adjust their speed-accuracy tradeoff strategies, the models required more than just changes in threshold parameters to explain the observed data. Instead, the models required changes in both threshold parameters and drift rate parameters; contrary to expectation, the speed-accuracy tradeoff manipulation did not selectively influence threshold parameters.
Figure 4. The top panels show the posterior predictive correct response time densities for subject 9 under the accuracy (leftmost panel), neutral (central panel) and speed emphasis (rightmost panel) conditions. The posterior predictive incorrect response time densities for this participant are shown in the bottom panels.

Rae et al. (2014) and Starns, Ratcliff, and White (2012) carried out inference about the model parameters using statistical methods which were available to them at the time. The methods presented here allow these results to be improved in important ways. Firstly, the models can be treated using a random effects structure, which allows for person-to-person variation. Secondly, using the CVVB method, a much more complete set of candidate model parameterizations can be investigated. This reduces the dangers posed by experimenter bias. Below, we update those earlier findings by reanalysing three previously-reported data sets, using three very different decision-making tasks. In each case, we investigate the question of selective influence by enumerating a comprehensive set of models for comparison, using CVVB to choose between them. Before reanalysing the real data, we present a simulation study which shows the properties of our methods.
Case Study 1: The Speed-Accuracy Tradeoff in Perceptual Decisions
As the first demonstration, we reconsider the experiment conducted by Forstmann et al. (2008). In our earlier application of VB methods to this data set, we made the standard selective influence assumption: the effect of the speed-accuracy tradeoff manipulation is entirely explained by separate response threshold settings (c) for the speed, neutral and accuracy emphasis conditions, with all remaining random effects, i.e., subject-level parameters, estimated as common values across conditions. Whether selective influence of this manipulation holds in the LBA model parameters can be tested by investigating whether different threshold settings are required for the different conditions, and/or whether other random effects are also required to differ across those conditions, particularly the drift rates, v. We investigated a set of 27 different models, ranging in complexity from a null model (the random effects are the same across conditions) through to a very complex model with three random effects for τ, three for threshold c, and three pairs of drift rates v. Each model is denoted by the number of random effects for c, v and τ. For instance, model 3-2-1 denotes an LBA model with 3 random effects for thresholds (c^(n), c^(s), c^(a)), 2 random effects for drift rates (v_c, v_e), and only 1 random effect for non-decision time (τ).

Simulation Study.
We first conducted a simulation study to investigate the performance of the CVVB procedure, and in particular its ability to detect the data-generating model. The simulation design is based on the Forstmann et al. (2008) experiment, with 19 participants and 1,000 trials per participant, where the data generating process is an LBA model. The data generating ("true") model parameters μ_α and Σ_α are set to the values estimated from the data using PMwG for model 3-1-1, which is the selective influence model, i.e., three threshold settings for the three conditions, but no change in the other parameters. We ran 100 independent replications, and in each replication we repeated the following steps for each of the j = 1, ..., 19 simulated participants:

1. Sample α_j ∼ N(μ_α, Σ_α).
2. Transform α_j back to the natural parameterization (b_j^(a), b_j^(n), b_j^(s), A_j, v_j, τ_j).
3. Simulate 1,000 trials for subject j as follows:
   • Sample 350 pairs (RT_ij, RE_ij) ∼ LBA(t, c | b_j^(a), A_j, v_j, τ_j).
   • Sample 350 pairs (RT_ij, RE_ij) ∼ LBA(t, c | b_j^(n), A_j, v_j, τ_j).
   • Sample 300 pairs (RT_ij, RE_ij) ∼ LBA(t, c | b_j^(s), A_j, v_j, τ_j).

For each of the 100 simulated data sets, we used 5-fold CVVB to estimate all 27 candidate LBA models and then ranked the models using ELPD. Figure 5 shows the sensitivity of the CVVB procedure: the number of times out of 100 replications that the data-generating model was ranked in the top r models (x-axis). For example, the data-generating model was ranked amongst the top 3 candidates in 94 of the 100 replications, and was correctly ranked as the most likely model over 75% of the time. Given the small size of the simulated data sample (n = 19 subjects) and the approximate nature of the CVVB algorithm, we consider this good performance. Of particular importance is that the data-generating model was quite simple relative to some of the candidate models, indicating that the CVVB procedure appropriately manages the flexibility of the set of models under consideration.

Analysis of the Real Data.
The performance of all 27 candidate models in the Forstmann et al. (2008) data was evaluated using CVVB with Hybrid Gaussian VB, 5 folds, and 15 factors to reduce the dimension of the approximating distribution. We compared the ELPD estimated by CVVB with the marginal likelihood estimated by the Importance Sampling Squared (IS²) method of Tran et al. (in press). Table 1 compares the estimated marginal likelihood for each model (right-most column) against the estimated ELPD, denoted ELPD_K-CVVB (second-to-right column). The left-most column gives each model an index number, which we use in the plots below.

Figure 5. Sensitivity of the CVVB procedure for data simulated from the Forstmann et al. (2008) design. The y-axis shows the frequency (from 100 replications) with which the data-generating model is ranked in the top r models (x-axis). The best model is ranked 1, the second best model is ranked 2, and so on through to the worst model, which is ranked 27. A procedure with high sensitivity has large f(r) (y-axis) for small model ranks (x-axis).

There is general agreement between the CVVB method and the corresponding marginal likelihood estimate from the exact method. For example, both methods place the same three models (11, 22, and 23) among their top four best models. The 12 worst-ranked models by the two methods are also the same.

Figure 6 compares the ranking of the set of 27 models implied by CVVB with the ranking implied by marginal likelihood. While there are some differences evident in the rankings given to middle-ranked models, overall the agreement is quite good. The Spearman rank correlation of the rankings implied by the two model selection methods is ρ = . 

Case Study 2: The Speed-Accuracy Tradeoff in Recognition Memory
Rae et al. (2014) reported a new experiment to test selective influence in a decision-making task based on recognition memory (as opposed to perceptual discrimination, as above). For this, 47 participants were asked to study lists of words and then repeatedly decide whether given prompt words were old (from the studied lists) or new (not). For some decisions, participants were instructed to respond very urgently (speed emphasis) and for others to respond very carefully (accuracy emphasis).

Table 1
Model selection via CVVB and marginal likelihood for the 27 LBA models fitted to data reported by Forstmann et al. (2008). The last column lists the log-marginal likelihood estimated by the IS² method, with standard errors in brackets.

Index  Model (c-v-τ)  ELPD_K-CVVB  log p̂(y) (IS² method)
1      1-1-1          1,060.4      5,199.5 (0.1)
2      1-1-2          1,413.7      6,947.0 (0.1)
3      1-1-3          1,433.2      7,028.9 (0.1)
4      1-2-3          1,530.6      7,536.4 (0.2)
5      1-2-2          1,516.7      7,465.9 (0.2)
6      1-2-1          1,462.7      7,220.8 (0.1)
7      1-3-1          1,473.9      7,279.1 (0.3)
8      1-3-2          1,527.1      7,541.6 (0.5)
9      1-3-3          1,527.4      7,537.3 (0.5)
10     2-3-3          1,536.5      7,571.4 (0.3)
11     2-3-2          1,548.9      7,584.9 (0.2)
12     2-3-1          1,545.7      7,580.1 (0.6)
13     2-2-1          1,522.2      7,500.5 (0.1)
14     2-2-2          1,523.1      7,508.4 (0.3)
15     2-2-3          1,535.6      7,573.6 (0.5)
16     2-1-3          1,525.2      7,510.8 (0.2)
17     2-1-2          1,512.6      7,444.0 (0.1)
18     2-1-1          1,493.2      7,359.0 (0.1)
19     3-1-1          1,516.2      7,461.8 (0.7)
20     3-1-2          1,528.6      7,529.9 (0.1)
21     3-1-3          1,530.4      7,527.8 (0.1)
22     3-2-3          1,548.6      7,591.2 (0.2)
23     3-2-2          1,548.0      7,595.3 (0.2)
24     3-2-1          1,540.5      7,591.2 (0.2)
25     3-3-1          1,536.2      7,573.4 (1.4)
26     3-3-2          1,547.4      7,572.1 (0.3)
27     3-3-3          1,547.2      7,556.7 (0.4)

Figure 6. CVVB model ranks (x-axis) plotted against marginal likelihood model ranks (y-axis) for Forstmann et al. (2008) data.

To evaluate the selective influence of the speed/accuracy manipulation on the threshold parameters, we investigated a large set of LBA models. We allowed the random effects for the threshold (c) to vary between response accumulators ("old" vs. "new") in order to capture the biases in different subjects' responding patterns. We also allowed drift rates (v) to vary between accumulators and according to whether the stimulus was actually an old or new word, which captures the basic ability of subjects to do the memory task. This investigation compares the 16 models given in Table 2. In the table, models are numbered from 1 (the simplest) to 16 (the most complex). For this data set, and the following one, we have adopted a notation based on the experimental manipulations to describe the models. For example, the notation E∗R in the second column indicates that the corresponding parameter for that column (c) is allowed to vary with both the response accumulator (R) and the speed vs. accuracy emphasis manipulation (E). The letter "S" indicates the manipulation of studied (old) vs. not studied stimulus words, and the letter "M" indicates the match between the stimulus class and the response.
A parameter is indicated by 1 if it is common across conditions. For example, in model 1, we allow c to vary with the response accumulator R, and v to vary with the stimulus S and the stimulus-response match M; s is affected only by the match M, and both A and τ are common across accumulators and conditions.

Table 2 compares ELPD (estimated using CVVB) with marginal likelihood (estimated using IS²). The two model selection methods are quite consistent in this example, agreeing on the same set of five best-ranked models and four out of the five worst-ranked models. Figure 7 compares the rankings implied by the two methods and, once again, the agreement is quite good (Spearman rank correlation of ρ = . ).

Table 2
Model selection via CVVB and marginal likelihood for the 16 LBA models fitted to the data reported by Rae et al. (2014). The last column lists the log-marginal likelihood estimated by the IS² method, with the standard errors in brackets.

Index  c    A  v      s  τ  ELPD_K-CVVB  log p̂(y) (IS² method)
1      R    1  S∗M    M  1  …            …
2      R    1  S∗M    M  E  …            …
3      R    1  E∗S∗M  M  1  …            …
4      R    1  E∗S∗M  M  E  …            …
5      R    E  S∗M    M  1  …            …
6      R    E  S∗M    M  E  …            …
7      R    E  E∗S∗M  M  1  …            …
8      R    E  E∗S∗M  M  E  …            …
9      E∗R  1  S∗M    M  1  …            …
10     E∗R  1  S∗M    M  E  …            …
11     E∗R  1  E∗S∗M  M  1  …            …
12     E∗R  1  E∗S∗M  M  E  …            …
13     E∗R  E  S∗M    M  1  …            …
14     E∗R  E  S∗M    M  E  …            …
15     E∗R  E  E∗S∗M  M  1  …            …
16     E∗R  E  E∗S∗M  M  E  …            …
Figure 7. CVVB model ranks (x-axis) plotted against marginal likelihood model ranks (y-axis) for Rae et al. (2014) data.

Case Study 3: The Speed-Accuracy Tradeoff in Lexical Decisions
The first two case studies investigated the selective influence of speed/accuracy manipulations on the threshold parameter of the LBA model in perceptual decisions (Forstmann et al., 2008) and mnemonic decisions (Rae et al., 2014). The third case study extends the analysis to a different decision-making domain: lexical decisions. In addition, this third case study emphasizes the benefit of VB methods because the set of models to be compared is much larger (256). Model comparison using exact methods such as MCMC with such a large class of models is very expensive.

The lexical decision task is a commonly used method for studying highly-practiced processes in reading. Participants are required to rapidly decide whether strings of letters are either valid English words (e.g., "WORD") or non-words (e.g., "WERD"). We analyze data from Experiment 1 of Wagenmakers, Ratcliff, Gomez, and McKoon (2008). In this experiment, 17 native English speakers made lexical decisions and were sometimes instructed to respond as quickly as possible (speed emphasis) and sometimes to respond as accurately as possible (accuracy emphasis). In addition, there were three different kinds of words used, which changed the difficulty of the decision. Some words were very common (high frequency), such as "CARS". Others were uncommon (low frequency), such as "COMB", and others were very-low frequency words, such as "DALE". Participants find it more difficult to distinguish between very low frequency words and non-words.

We use E to represent the speed/accuracy conditions, C for the responses (error (e) or correct (c)), and W for the four levels of word frequency (high frequency, low frequency, very low frequency or non-word). The performance of 256 models was evaluated.
The simplest model allows only the mean drift rate to differ between correct and error accumulators (c ∼ 1, A ∼ 1, v ∼ C, s ∼ 1, τ ∼ 1), while the most complex model is (c ∼ C∗E, A ∼ C∗E, v ∼ E∗W∗C, s ∼ 1, τ ∼ E).

With a large number of competing models, model selection based on the log marginal likelihood is extremely costly; this is one of the primary reasons for using VB methods. Therefore, we did not estimate the marginal likelihood for all the models. Instead, we propose a mixed approach in which we use CVVB to quickly screen through all the models. This results in an approximate ranking of all the models in approximately 16 hours. From this ranking, we selected a small subset (the best 10 and the worst 10) for follow-up using slower exact methods to estimate the posterior distributions and marginal likelihood.

Table 3 lists the results for just these selected models, comparing ELPD (estimated using CVVB) with marginal likelihood (estimated using IS²). Figure 8 compares the ELPD_K-CVVB with the log marginal likelihood, both in absolute terms (lower panels) and in rankings (upper panels). The figure shows the comparison for both the 10 best models according to ELPD_K-CVVB (left panels) and the 10 worst (right panels). For the 10 best models, the two methods closely agree on the relative ranking of the models (Spearman rank correlation of ρ = 0. ).

Table 3
Model selection via CVVB and marginal likelihood for the 10 best models (above the solid line) and the 10 worst models (below the solid line) fitted to the data reported by Wagenmakers et al. (2008). The last column lists the log-marginal likelihood estimated by the IS² method, with the standard errors in brackets.

Index  c    A    v      τ  ELPD_K-CVVB  log p̂(y) (IS² method)
252    C    C*E  C*W*E  E  1,647.2      8,140.0 (3.9)
236    C    C    C*W*E  E  1,611.6      8,111.1 (4.0)
240    C*E  C    C*W*E  E  1,610.1      8,101.6 (1.3)
239    C*E  C    C*W*E  1  1,609.5      8,086.7 (1.1)
255    C*E  C*E  C*W*E  1  1,604.8      8,081.2 (6.1)
232    C*E  1    C*W*E  E  1,597.4      8,064.1 (4.5)
248    C*E  E    C*W*E  E  1,602.5      8,050.6 (4.6)
256    C*E  C*E  C*W*E  E  1,564.4      8,046.2 (3.4)
184    C*E  E    C*W    E  1,590.9      7,926.5 (1.0)
191    C*E  C*E  C*W    1  1,586.7      7,914.4 (1.1)
--------------------------------------------------------
1      1    1    1      1  -1,969.9     -9,932.4 (0.2)
97     1    1    W      1  -1,867.4     -9,444.7 (0.3)
2      1    1    1      E  -1,274.3     -6,343.5 (0.05)
98     1    1    W      E  -1,144.2     -5,654.4 (0.2)
17     1    E    1      1  -1,085.0     -5,448.2 (0.1)
5      E    1    1      1  -1,065.0     -5,346.0 (0.1)
18     1    E    1      E  -1,028.0     -5,223.9 (0.1)
21     E    E    1      1  -1,017.8     -5,165.9 (0.6)
6      E    1    1      E  -1,007.0     -5,094.4 (0.2)
22     E    E    1      E  -1,015.3     -5,043.2 (0.6)

These results again indicate that the speed/accuracy manipulation does not selectively influence the threshold parameters. All of the 10 best models (top half of Table 3) include effects of the speed/accuracy manipulation ("E") on parameters other than the threshold (column c).

Discussion
This paper proposes the Hybrid Gaussian VB method for approximate Bayesian inference with psychological models; it is more efficient than previous VB methods for such models. The performance of the VB method is demonstrated with applications in decision making. An important development from our work is the coupling of VB methods for model estimation with cross-validation methods for model selection. The combined CVVB approach is a computationally efficient method for model selection. This method is particularly useful when the number of models to be compared is large, which can make exact methods (such as MCMC) infeasible. Our simulation study shows that CVVB accurately identifies the data-generating model, and our analyses of real data repeatedly demonstrate that the CVVB results agree closely with model selection by marginal likelihood, estimated by exact (i.e., simulation consistent) algorithms. However, some users may still want to base their final conclusions on exact methods, and for that situation we propose using CVVB as a model screening tool. CVVB can be used to efficiently "screen" a large set of models, and quickly identify a much smaller number of candidates for follow-up by slower, exact methods. The CVVB method allows a more thorough investigation of an important question about "selective influence" in the speed-accuracy tradeoff than previous approaches.

Figure 8. ELPD_K-CVVB and marginal likelihood estimates for the 10 best models (left panels) and the 10 worst models (right panels) for the data reported by Wagenmakers et al. (2008). The lower panels plot the ELPD_K-CVVB (x-axes) against the marginal likelihood estimate (± two standard errors; y-axes). The upper panels show the corresponding model ranks from the two methods.

VB methods have already been used in other domains of psychological research as a fast alternative to MCMC, but mostly with much simpler models than here. For instance, VB methods have been used to study the impact of three prior distributions on Bayesian parameter recovery in very simple models, with just one or two parameters. In most of these simple cases the authors found VB to be both fast and highly accurate, and recommend VB for use with hierarchical models in particular because the method is computationally effective, quick, and accurate. Beyond parameter recovery exercises, VB has also been used to investigate probabilistic cognitive models of how people represent temporal structure in the world (Marković, Reiter, & Kiebel, 2019), and to approximate solutions to the inverse Bayesian decision theory problem in the context of learning and decision-making (Daunizeau et al., 2010).

While these applications of VB are interesting and effective, they all employ the so-called "mean field VB", which assumes a simplified factorization for the variational distribution q. Mean field VB ignores the posterior dependence between the blocks of model parameters, and requires analytical calculation of model-specific expectations (Ormerod & Wand, 2010). These can be challenging to compute, or simply unavailable, for many interesting psychological models.
This has been a major hurdle to the uptake of VB for substantively interesting psychological models.

By contrast, the "fixed form" VB method we have used is more flexible and widely applicable. It takes into account the posterior dependence between the model parameters and does not require any calculation of model-specific expectations. In recent work promoting the use of VB, Galdo et al. (2019) also proposed fixed form VB; their methods adopt an optimization strategy called differential evolution to bypass the need to compute analytical expectations. Nevertheless, Galdo et al. still assume a simplified factorization structure for the variational distribution q, and thus do not account for posterior dependence between blocks of parameters. Galdo et al. (2019) test their approach using two benchmark cognitive models, a non-hierarchical (single subject) LBA model and a hierarchical version of Signal Detection Theory. Our work extends that of Galdo et al. in at least two important respects. First, it examines hierarchical LBA models with a more complete parameterization. The multivariate Gaussian group-level distribution accounts for between-subject differences and also for the correlation of the random effects, and therefore provides a more realistic representation of prior knowledge. Second, our fixed-form VB approach takes into account the dependence structure of the posterior and incorporates some of the latest advances in the fixed form VB literature.

We hope that the VB methods developed in this article will be taken up and extended by other researchers. To assist with this, we have shared the code and data to replicate our analyses online, at https://github.com/Henry-Dao/CVVB . The methods are quite general and are not limited to the LBA model or even decision-making paradigms.
Our approach will translate most easily to other models for which the group-level assumptions can be maintained: a multivariate Gaussian distribution for random effects, with relatively uninformative priors. This structure is sufficiently generic that it will apply to a very large range of psychological models.

Acknowledgements
The research of Viet Hung Dao, Minh Ngoc Tran, Robert Kohn and Scott Brown was partially supported by ARC Discovery grant DP180102195.

References
Arlot, S., Celisse, A., et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 40–79.
Batchelder, W. H. (in press). Cognitive psychometrics: Using multinomial processing tree models as measurement tools. In S. Embretson & J. Roberts (Eds.), New directions in psychological measurement with model based approaches. Washington, DC: American Psychological Association Books.
Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, (518), 859–877.
Brown, S. D., & Heathcote, A. (2005). A ballistic model of choice response time. Psychological Review, (1), 117.
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, (3), 153–178.
Browne, M. (2000). Cross-validation methods. Journal of Mathematical Psychology, 108–132.
Daunizeau, J., Den Ouden, H. E., Pessiglione, M., Kiebel, S. J., Stephan, K. E., & Friston, K. J. (2010). Observing the observer (I): Meta-Bayesian models of learning and decision-making. PLoS One, (12), e15554.
Donkin, C., & Brown, S. D. (2018). Response times and decision-making. Stevens' Handbook of Experimental Psychology and Cognitive Neuroscience, 1–33.
Donkin, C., Brown, S. D., & Heathcote, A. (2009). The overconstraint of response time models: Rethinking the scaling problem. Psychonomic Bulletin and Review, (6), 1129–1135.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 36–48.
Forstmann, B. U., Dutilh, G., Brown, S., Neumann, J., Von Cramon, D. Y., Ridderinkhof, K. R., & Wagenmakers, E.-J. (2008). Striatum and pre-SMA facilitate decision-making under time pressure. Proceedings of the National Academy of Sciences, (45), 17538–17542.
Forstmann, B. U., Tittgemeyer, M., Wagenmakers, E.-J., Derrfuss, J., Imperati, D., & Brown, S. (2011). The speed-accuracy tradeoff in the elderly brain: A structural model-based approach. Journal of Neuroscience, (47), 17242–17249.
Galdo, M., Bahg, G., & Turner, B. M. (2019). Variational Bayesian methods for cognitive science. Psychological Methods.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.
Gronau, Q. F., & Wagenmakers, E.-J. (2019). Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain & Behavior, (1), 1–11.
Gunawan, D., Hawkins, G. E., Tran, M.-N., Kohn, R., & Brown, S. (2020). New estimation approaches for the hierarchical linear ballistic accumulator model. Journal of Mathematical Psychology.
Huang, A., Wand, M. P., et al. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, (2), 439–452.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Lee, M. D. (2008). Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 1–15.
Loaiza-Maya, R., Smith, M. S., Nott, D. J., & Danaher, P. J. (2020). Fast and accurate variational inference for models with many latent variables. arXiv:2005.07430.
Marković, D., Reiter, A. M., & Kiebel, S. J. (2019). Predicting change: Approximate inference under explicit representation of temporal structure in changing environments. PLoS Computational Biology, (1), e1006707.
Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 190–204.
Navarro, D. J. (2019). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, (1), 28–34.
Ong, V. M.-H., Nott, D. J., & Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, (3), 465–478.
Ormerod, J. T., & Wand, M. P. (2010). Explaining variational approximations. The American Statistician, (2), 140–153.
Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook.
Rae, B., Heathcote, A., Donkin, C., Averell, L., & Brown, S. (2014). The hare and the tortoise: Emphasizing speed can change the evidence used to make decisions. Journal of Experimental Psychology: Learning, Memory, and Cognition, (5), 1226.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, (2), 59.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for two-choice decisions. Psychological Science, 347–356.
Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, (2), 333.
Ratcliff, R., Smith, P. L., Brown, S. D., & McKoon, G. (2016). Diffusion decision model: Current issues and history. Trends in Cognitive Sciences, (4), 260–281.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing in psychology. Psychological Review, 358–367.
Salthouse, T. A. (1996). The processing-speed theory of adult age differences in cognition. Psychological Review, (3), 403.
Starns, J. J., & Ratcliff, R. (2010). The effects of aging on the speed–accuracy compromise: Boundary optimality in the diffusion model. Psychology and Aging, (2), 377.
Starns, J. J., Ratcliff, R., & McKoon, G. (2012). Evaluating the unequal-variability and dual-process explanations of zROC slopes with response time data and the diffusion model. Cognitive Psychology, 1–34.
Starns, J. J., Ratcliff, R., & White, C. N. (2012). Diffusion model drift rates can be influenced by decision processes: An analysis of the strength-based mirror effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 1137–1151.
Stone, M. (1960). Models for choice-reaction time. Psychometrika, (3), 251–260.
Terry, A., Marley, A., Barnwal, A., Wagenmakers, E.-J., Heathcote, A., & Brown, S. D. (2015). Generalising the drift rate distribution for linear ballistic accumulators. Journal of Mathematical Psychology, 49–58.
Thapar, A., Ratcliff, R., & McKoon, G. (2003). A diffusion model analysis of the effects of aging on letter discrimination. Psychology and Aging, 415–429.
Tran, M.-N., Nott, D. J., & Kohn, R. (2017). Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, (4), 873–882.
Tran, M.-N., Scharth, M., Gunawan, D., Kohn, R., Brown, S. D., & Hawkins, G. E. (in press). Robustly estimating the marginal likelihood for cognitive models via importance sampling. Behavior Research Methods.
Turner, B. M., Sederberg, P. B., Brown, S. D., & Steyvers, M. (2013). A method for efficiently sampling from distributions with correlated dimensions. Psychological Methods, (3), 368.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, (3), 550.
van Ravenzwaaij, D., Brown, S. D., Marley, A., & Heathcote, A. (2019). Accumulating advantages: A new conceptualization of rapid multiple choice. Psychological Review.
Vehtari, A., & Gelman, A. (2014). WAIC and cross-validation in Stan.
Voss, A., Rothermund, K., & Voss, J. (2004). Interpreting the parameters of the diffusion model: An empirical validation. Memory & Cognition, 1206–1220.
Wagenmakers, E.-J., Ratcliff, R., Gomez, P., & McKoon, G. (2008). A diffusion model account of criterion shifts in the lexical decision task. Journal of Memory and Language, (1), 140–159.
Wall, L., Gunawan, D., Brown, S. D., Tran, M.-N., Kohn, R., & Hawkins, G. E. (in press). Identifying relationships between cognitive processes across tasks, contexts, and time. Behavior Research Methods.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Appendix A
Variational Bayes Details

Details of the Optimization Methods
We use gradient-based search methods to maximize the lower bound, which requires computing $\nabla_\lambda \mathcal{L}(\lambda)$, the gradient of $\mathcal{L}(\lambda)$ with respect to the variational parameters $\lambda$. In most cases it is impossible to compute $\nabla_\lambda \mathcal{L}(\lambda)$ analytically, but it can be estimated unbiasedly. For this reason, stochastic gradient ascent methods (Robbins & Monro, 1951) are often used to optimize $\mathcal{L}(\lambda)$. These methods start from some initial value $\lambda^{(0)}$ for $\lambda$ and update it recursively by following the gradient vector "uphill":

$$\lambda^{(t+1)} = \lambda^{(t)} + \rho_t \odot \widehat{\nabla_\lambda \mathcal{L}}(\lambda^{(t)}), \qquad (4)$$

where $\rho_t$ is a vector of step sizes, $\odot$ denotes the element-wise product of two vectors, and $\widehat{\nabla_\lambda \mathcal{L}}(\lambda)$ is an unbiased estimate of the gradient $\nabla_\lambda \mathcal{L}(\lambda)$.

A "reparameterization trick".
The performance of stochastic gradient ascent depends greatly on the variance of the noisy gradient estimate $\widehat{\nabla_\lambda \mathcal{L}}(\lambda)$. Performance can therefore be improved by employing variance reduction methods. A popular variance reduction method is the so-called "reparameterization trick" (Kingma & Welling, 2013; Rezende et al., 2014). If we can write $\theta \sim q_\lambda(\theta)$ as $\theta = u(\epsilon; \lambda)$ with $\epsilon \sim f_\epsilon$, which does not depend on $\lambda$, then the lower bound and its gradient can be written as the expectations

$$\mathcal{L}(\lambda) = E_{f_\epsilon}\big[\log p(y, u(\epsilon; \lambda)) - \log q_\lambda(u(\epsilon; \lambda))\big],$$
$$\nabla_\lambda \mathcal{L}(\lambda) = E_{f_\epsilon}\Big[\nabla_\lambda u(\epsilon; \lambda)\,\big[\nabla_\theta \log p(y, \theta) - \nabla_\theta \log q_\lambda(\theta)\big]\Big]. \qquad (5)$$

By sampling $\epsilon \sim f_\epsilon$, it is straightforward to obtain unbiased estimates of the lower bound and its gradient,

$$\widehat{\mathcal{L}}(\lambda) := \frac{1}{N}\sum_{i=1}^{N}\Big[\log p\big(y, u(\epsilon^{(i)}; \lambda)\big) - \log q_\lambda\big(u(\epsilon^{(i)}; \lambda)\big)\Big],$$
$$\widehat{\nabla_\lambda \mathcal{L}}(\lambda) := \frac{1}{N}\sum_{i=1}^{N}\Big[\nabla_\lambda u(\epsilon^{(i)}; \lambda)\,\big(\nabla_\theta \log p(y, \theta^{(i)}) - \nabla_\theta \log q_\lambda(\theta^{(i)})\big)\Big], \qquad (6)$$

with $\epsilon^{(i)} \sim f_\epsilon$, $i = 1, \ldots, N$. We used $N = 10$ in our applications.

Learning rates and stopping rule.
The elements of the vector $\lambda$ may need very different step sizes (learning rates) during the search, to account for the scale or the geometry of the space. We set the step sizes adaptively using the ADADELTA method (Zeiler, 2012), with a different step size for each element of $\lambda$. At iteration $t+1$, the $i$th element $\lambda_i$ of $\lambda$ is updated as $\lambda_i^{(t+1)} = \lambda_i^{(t)} + \Delta\lambda_i^{(t)}$. The step size is $\Delta\lambda_i^{(t)} := \rho_i^{(t)} g^{(t)}_{\lambda_i}$, where $g^{(t)}_{\lambda_i}$ denotes the $i$th component of $\widehat{\nabla_\lambda \mathcal{L}}(\lambda^{(t)})$ and

$$\rho_i^{(t)} := \frac{\sqrt{E(\Delta\lambda_i^2)^{(t-1)} + \xi}}{\sqrt{E(g_{\lambda_i}^2)^{(t)} + \xi}},$$

where $\xi$ is a small positive constant, with

$$E(\Delta\lambda_i^2)^{(t)} = v\,E(\Delta\lambda_i^2)^{(t-1)} + (1-v)\,(\Delta\lambda_i^{(t)})^2, \qquad E(g_{\lambda_i}^2)^{(t)} = v\,E(g_{\lambda_i}^2)^{(t-1)} + (1-v)\,(g^{(t)}_{\lambda_i})^2.$$

The ADADELTA default settings are $\xi = 10^{-6}$ and $v = 0.95$, initializing $E(\Delta\lambda_i^2)^{(0)} := E(g_{\lambda_i}^2)^{(0)} = 0$. However, in our experiments we obtained better results with a different choice of $\xi$.

A popular stopping criterion for the search algorithm is to stop when the moving average of the lower bound estimates, $\overline{LB}_t = \frac{1}{m}\sum_{i=t-m+1}^{t} \widehat{\mathcal{L}}(\lambda^{(i)})$, does not improve after $k$ consecutive iterations (Tran, Nott, & Kohn, 2017). Our article uses $m = k = 200$.

Details for the Gaussian VB approach.
Using the factor-based approximation, we can write $\theta = \mu + B\epsilon_1 + d \odot \epsilon_2$, with $\epsilon = (\epsilon_1^\top, \epsilon_2^\top)^\top \sim N(0, I_{r+p})$. Using the reparameterization trick from (5), and noting that $\nabla_\theta \log q_\lambda(\theta) = -(BB^\top + D^2)^{-1}(\theta - \mu)$ with $D = \mathrm{diag}(d)$, the gradient of the lower bound $\nabla_\lambda \mathcal{L}(\lambda)$ is

$$\nabla_\mu \mathcal{L}(\lambda) = E_f\Big[\nabla_\theta \log h(\mu + B\epsilon_1 + d \odot \epsilon_2) + (BB^\top + D^2)^{-1}(B\epsilon_1 + d \odot \epsilon_2)\Big],$$
$$\nabla_B \mathcal{L}(\lambda) = E_f\Big[\nabla_\theta \log h(\mu + B\epsilon_1 + d \odot \epsilon_2)\,\epsilon_1^\top + (BB^\top + D^2)^{-1}(B\epsilon_1 + d \odot \epsilon_2)\,\epsilon_1^\top\Big],$$
$$\nabla_d \mathcal{L}(\lambda) = E_f\Big[\mathrm{diag}\big(\nabla_\theta \log h(\mu + B\epsilon_1 + d \odot \epsilon_2)\,\epsilon_2^\top + (BB^\top + D^2)^{-1}(B\epsilon_1 + d \odot \epsilon_2)\,\epsilon_2^\top\big)\Big],$$

where $h(\theta) = p(\theta)\,p(y\,|\,\theta)$ and $f$ represents $N(0, I_{r+p})$. From this, unbiased estimates of the lower bound gradient can be obtained by sampling from $f$. It is necessary to obtain the inverse of the $p \times p$ matrix $BB^\top + D^2$, which is computationally expensive when the dimension $p$ of $\theta$ is high. Normally, the number of factors we use should be much less than the dimension of $\theta$, i.e., $r \ll p$. We can then use the Woodbury formula to compute the inverse (Petersen & Pedersen, 2012):

$$(BB^\top + D^2)^{-1} = D^{-2} - D^{-2}B\,(I + B^\top D^{-2}B)^{-1}B^\top D^{-2}.$$

This is computationally more efficient because it only requires finding the inverses of the diagonal matrix $D^2$ and of $I + B^\top D^{-2}B$, which is a much smaller $r \times r$ matrix.

Details for the Hybrid Gaussian VB
The gradient of the lower bound with respect to the variational parameters is

$$\nabla_\lambda \mathcal{L}(\lambda) = E_{\epsilon}\bigg[\nabla_\lambda u(\epsilon; \lambda)\,\nabla_{\theta_1} \log\Big(\frac{p(\theta_1, y)}{q_\lambda(\theta_1)}\Big)\bigg]
= E_{(\epsilon, \theta_2)}\bigg[\nabla_\lambda u(\epsilon; \lambda)\,\nabla_{\theta_1} \log\Big(\frac{p(\theta_1, y)\,p(\theta_2\,|\,\theta_1, y)}{q_\lambda(\theta_1)\,p(\theta_2\,|\,\theta_1, y)}\Big)\bigg]$$
$$= E_{(\epsilon, \theta_2)}\bigg[\nabla_\lambda u(\epsilon; \lambda)\,\nabla_{\theta_1} \log\Big(\frac{p(\theta_1, \theta_2, y)}{q_\lambda(\theta_1)\,p(\theta_2\,|\,\theta_1, y)}\Big)\bigg]$$
$$= E_{(\epsilon, \theta_2)}\bigg[\nabla_\lambda u(\epsilon; \lambda)\,\Big(\nabla_{\theta_1} \log p(y\,|\,\theta_1, \theta_2) + \nabla_{\theta_1} \log p(\theta_1, \theta_2) - \nabla_{\theta_1} \log q_\lambda(\theta_1) - \nabla_{\theta_1} \log p(\theta_2\,|\,\theta_1, y)\Big)\bigg].$$

Appendix B gives the gradients $\nabla_{\theta_1} \log p(y\,|\,\theta_1, \theta_2)$, $\nabla_{\theta_1} \log p(\theta_1, \theta_2)$, $\nabla_{\theta_1} \log q_\lambda(\theta_1)$ and $\nabla_{\theta_1} \log p(\theta_2\,|\,\theta_1, y)$. We note that, because

$$E_{(\epsilon, \theta_2)}\Big[\nabla_{\theta_1} \log p(\theta_2\,|\,\theta_1, y)\Big] = E_{\epsilon}\bigg[\int \nabla_{\theta_1} p(\theta_2\,|\,\theta_1, y)\,d\theta_2\bigg] = 0,$$

we can remove the term $\nabla_{\theta_1} \log p(\theta_2\,|\,\theta_1, y)$ from the calculation of $\nabla_\lambda \mathcal{L}(\lambda)$. However, this term also plays the role of a control variate and is useful in reducing the variance of the gradient estimate in finite sample sizes (recall we use $N = 10$). We therefore include this term in all computations reported in the paper.

Appendix B
Deriving the Gradients in the Gaussian VB for Approximating the Hierarchical LBA Models
For the hierarchical LBA model, the joint density of the data $y$ and the model parameters $\theta = (\alpha_1^\top, \ldots, \alpha_J^\top, \mu_\alpha^\top, \mathrm{vech}(\Sigma_\alpha)^\top, a^\top)^\top$ is

$$p(y, \theta) = p(y\,|\,\theta)\,p(\theta) = \prod_{j=1}^{J} \mathrm{LBA}(y_j\,|\,\alpha_j)\,N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha)\; N(\mu_\alpha\,|\,0, I)\,\mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) \times \prod_{d=1}^{D_\alpha} \mathrm{IG}(a_d\,|\,\alpha_d, \beta_d).$$

As mentioned previously, in order to use Gaussian VB it is necessary to transform the parameters so that all the elements have support on the full real line. The working parameters are

$$\tilde\theta = (\alpha_1^\top, \ldots, \alpha_J^\top, \mu_\alpha^\top, \mathrm{vech}(C^*_\alpha)^\top, \log a^\top)^\top, \quad \text{where } \log a := (\log a_1, \ldots, \log a_{D_\alpha})^\top.$$

In order to approximate $p(\tilde\theta\,|\,y)$ using the Gaussian VB method, it is necessary to have the gradient $\nabla_{\tilde\theta} \log p(y, \tilde\theta)$, or equivalently $\nabla_{\tilde\theta} \log p(y\,|\,\tilde\theta)$ and $\nabla_{\tilde\theta} \log p(\tilde\theta)$.

Computing $\nabla_{\tilde\theta} \log p(y\,|\,\tilde\theta)$. Clearly, $\nabla_{\mu_\alpha} \log p(y\,|\,\tilde\theta) = 0$, $\nabla_{\mathrm{vech}(C^*_\alpha)} \log p(y\,|\,\tilde\theta) = 0$ and $\nabla_{\log a} \log p(y\,|\,\tilde\theta) = 0$, since $p(y\,|\,\tilde\theta) = \prod_{j=1}^{J} \mathrm{LBA}(y_j\,|\,\alpha_j)$ does not depend on the group-level parameters. For the random effects,

$$\nabla_{\alpha_j} \log p(y\,|\,\theta) = \frac{\partial}{\partial \alpha_j} \log \mathrm{LBA}(y_j\,|\,\alpha_j) = \sum_{i=1}^{n_j} \frac{\partial}{\partial \alpha_j} \log \mathrm{LBA}(y_{ji}\,|\,\alpha_j) = \sum_{i=1}^{n_j} \Big(\frac{\partial z_j}{\partial \alpha_j^\top}\Big)^{\!\top} \frac{\frac{\partial}{\partial z_j} \mathrm{LBA}(y_{ji}\,|\,z_j)}{\mathrm{LBA}(y_{ji}\,|\,z_j)},$$

and, by the product rule (omitting the subject subscript $j$ for simplicity),

$$\frac{\partial}{\partial z} \mathrm{LBA}(y_i\,|\,z) = \frac{\partial f_c(t)}{\partial z}\,\big(1 - F_{k \neq c}(t)\big) - \frac{\partial F_{k \neq c}(t)}{\partial z}\,f_c(t),$$

where $c$ indexes the accumulator associated with the observed response. Writing $\omega_1 = \big(b - (t-\tau)v_c\big)/\big((t-\tau)s\big)$ and $\omega_2 = A/\big((t-\tau)s\big)$, and using $\phi'(x) = -x\,\phi(x)$ for the derivative of the standard normal density, the partial derivatives of $f_c(t)$ with respect to $z$ are

$$\frac{\partial}{\partial b} f_c(t) = \frac{1}{A}\Big[-v_c\,\phi(\omega_1 - \omega_2) + s\,\phi'(\omega_1 - \omega_2) + v_c\,\phi(\omega_1) - s\,\phi'(\omega_1)\Big]\frac{\partial \omega_1}{\partial b};$$
$$\frac{\partial}{\partial A} f_c(t) = -\frac{1}{A}\Big\{f_c(t) + \big[-v_c\,\phi(\omega_1 - \omega_2) + s\,\phi'(\omega_1 - \omega_2)\big]\frac{\partial \omega_2}{\partial A}\Big\};$$
$$\frac{\partial}{\partial v_c} f_c(t) = \frac{1}{A}\Big[-\Phi(\omega_1 - \omega_2) + \Phi(\omega_1)\Big] + \frac{1}{A}\Big[-v_c\,\phi(\omega_1 - \omega_2) + s\,\phi'(\omega_1 - \omega_2) + v_c\,\phi(\omega_1) - s\,\phi'(\omega_1)\Big]\frac{\partial \omega_1}{\partial v_c};$$
$$\frac{\partial}{\partial \tau} f_c(t) = \frac{1}{A}\Big[-v_c\,\phi(\omega_1 - \omega_2) + s\,\phi'(\omega_1 - \omega_2)\Big]\Big(\frac{\partial \omega_1}{\partial \tau} - \frac{\partial \omega_2}{\partial \tau}\Big) + \frac{1}{A}\Big[v_c\,\phi(\omega_1) - s\,\phi'(\omega_1)\Big]\frac{\partial \omega_1}{\partial \tau}.$$
The partial derivatives of $F_c(t)$ with respect to $z$ are:

$$\frac{\partial}{\partial b} F_c(t) = \bigg[\frac{1}{A}\Phi(\omega_1 - \omega_2) + \frac{b - A - (t-\tau)v_c}{A}\,\phi(\omega_1 - \omega_2)\,\frac{\partial \omega_1}{\partial b}\bigg] + \frac{1}{\omega_2}\Big[\phi'(\omega_1 - \omega_2) - \phi'(\omega_1)\Big]\frac{\partial \omega_1}{\partial b} + \bigg[-\frac{1}{A}\Phi(\omega_1) - \frac{b - (t-\tau)v_c}{A}\,\phi(\omega_1)\,\frac{\partial \omega_1}{\partial b}\bigg];$$

$$\frac{\partial}{\partial A} F_c(t) = -\frac{b - (t-\tau)v_c}{A^2}\Big[\Phi(\omega_1 - \omega_2) - \Phi(\omega_1)\Big] - \bigg[\frac{b - A - (t-\tau)v_c}{A}\,\phi(\omega_1 - \omega_2) + \frac{\phi'(\omega_1 - \omega_2)}{\omega_2} + \frac{\phi(\omega_1 - \omega_2) - \phi(\omega_1)}{\omega_2^2}\bigg]\frac{\partial \omega_2}{\partial A};$$

$$\frac{\partial}{\partial v_c} F_c(t) = -\frac{t-\tau}{A}\Phi(\omega_1 - \omega_2) + \frac{b - A - (t-\tau)v_c}{A}\,\phi(\omega_1 - \omega_2)\,\frac{\partial \omega_1}{\partial v_c} + \frac{t-\tau}{A}\Phi(\omega_1) - \frac{b - (t-\tau)v_c}{A}\,\phi(\omega_1)\,\frac{\partial \omega_1}{\partial v_c} + \frac{1}{\omega_2}\Big(\phi'(\omega_1 - \omega_2) - \phi'(\omega_1)\Big)\frac{\partial \omega_1}{\partial v_c};$$

$$\frac{\partial}{\partial \tau} F_c(t) = \frac{v_c}{A}\Phi(\omega_1 - \omega_2) + \frac{b - A - (t-\tau)v_c}{A}\,\phi(\omega_1 - \omega_2)\Big(\frac{\partial \omega_1}{\partial \tau} - \frac{\partial \omega_2}{\partial \tau}\Big) - \frac{v_c}{A}\Phi(\omega_1) - \frac{b - (t-\tau)v_c}{A}\,\phi(\omega_1)\,\frac{\partial \omega_1}{\partial \tau} + \bigg[\phi'(\omega_1 - \omega_2)\,\omega_2\Big(\frac{\partial \omega_1}{\partial \tau} - \frac{\partial \omega_2}{\partial \tau}\Big) - \phi(\omega_1 - \omega_2)\,\frac{\partial \omega_2}{\partial \tau}\bigg]\Big/\omega_2^2 - \bigg[\phi'(\omega_1)\,\omega_2\,\frac{\partial \omega_1}{\partial \tau} - \phi(\omega_1)\,\frac{\partial \omega_2}{\partial \tau}\bigg]\Big/\omega_2^2.$$
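As a numerical cross-check of these expressions, the sketch below implements the finishing-time density $f_c$ and distribution function $F_c$ of a single LBA accumulator under the standard parameterization (start point uniform on $[0, A]$, threshold $b$, drift rate mean $v_c$ and standard deviation $s$, non-decision time $\tau$; Brown & Heathcote, 2008), and verifies that $f_c$ equals the time-derivative of $F_c$ by central finite differences. The function names and parameter values are illustrative, not taken from the authors' code.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def lba_F(t, b, A, v, s, tau):
    """CDF of one LBA accumulator's finishing time (Brown & Heathcote, 2008)."""
    u = t - tau                      # decision time (must be positive)
    w1 = (b - u * v) / (u * s)       # omega_1 in the text
    w2 = A / (u * s)                 # omega_2 in the text
    return (1.0 + (b - A - u * v) / A * Phi(w1 - w2)
                - (b - u * v) / A * Phi(w1)
                + (phi(w1 - w2) - phi(w1)) / w2)

def lba_f(t, b, A, v, s, tau):
    """Density of one LBA accumulator's finishing time."""
    u = t - tau
    w1 = (b - u * v) / (u * s)
    w2 = A / (u * s)
    return (1.0 / A) * (-v * Phi(w1 - w2) + s * phi(w1 - w2)
                        + v * Phi(w1) - s * phi(w1))

# Check that f_c is the time-derivative of F_c at a few points.
b, A, v, s, tau = 1.2, 0.8, 2.5, 1.0, 0.2
h = 1e-6
for t in (0.5, 0.7, 1.0):
    dFdt = (lba_F(t + h, b, A, v, s, tau) - lba_F(t - h, b, A, v, s, tau)) / (2 * h)
    assert abs(dFdt - lba_f(t, b, A, v, s, tau)) < 1e-6
```

The same finite-difference idea applies to the partial derivatives in $b$, $A$, $v_c$ and $\tau$ above, which is a useful sanity check when implementing the analytic gradients.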
Computing $\nabla_{\tilde\theta} \log p(\tilde\theta)$. To get the prior for the transformed parameters $\tilde\theta$, multiply the prior density by the Jacobians:

$$p(\tilde\theta) = p(\alpha_{1:J}\,|\,\mu_\alpha, \Sigma_\alpha) \times p(\mu_\alpha\,|\,\mu_0, \Sigma_0) \times p(\mathrm{vech}(C^*_\alpha)\,|\,\log a) \times p(\log a)$$
$$= \prod_{j=1}^{J} N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha) \times N(\mu_\alpha\,|\,\mu_0, \Sigma_0) \times \mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) \times \big|J_{\Sigma_\alpha \to \mathrm{vech}(C^*_\alpha)}\big| \times \prod_{d=1}^{D_\alpha} \mathrm{IG}(a_d\,|\,\alpha_d, \beta_d) \times \big|J_{a \to \log a}\big|,$$

with the prior hyperparameters $\mu_0 = 0$, $\Sigma_0 = I_{D_\alpha}$, $\nu = \nu_\alpha + D_\alpha - 1$, $\Psi = 2\nu_\alpha\,\mathrm{diag}(1/a_1, \ldots, 1/a_{D_\alpha})$, $\alpha_d = 1/A_d^2$ and $\beta_d = 1/2$. The Jacobian terms are:

• $\big|J_{\Sigma_\alpha \to \mathrm{vech}(C^*_\alpha)}\big| = 2^{D_\alpha} \prod_{d=1}^{D_\alpha} C_{d,d}^{D_\alpha - d + 2}$, where $C_{d,d}$ is the element in position $(d,d)$ of the matrix $C_\alpha$; and
• $\big|J_{a \to \log a}\big| = \det\big(\mathrm{diag}(a_1, \ldots, a_{D_\alpha})\big) = a_1 \times \cdots \times a_{D_\alpha}$.

The gradients are

$$\nabla_{\alpha_j} \log p(\tilde\theta) = \nabla_{\alpha_j} \log N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha) = -\Sigma_\alpha^{-1}(\alpha_j - \mu_\alpha);$$
$$\nabla_{\mu_\alpha} \log p(\tilde\theta) = \sum_{j=1}^{J} \nabla_{\mu_\alpha} \log N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha) + \nabla_{\mu_\alpha} \log N(\mu_\alpha\,|\,\mu_0, \Sigma_0) = \sum_{j=1}^{J} \Sigma_\alpha^{-1}(\alpha_j - \mu_\alpha) - \Sigma_0^{-1}(\mu_\alpha - \mu_0);$$
$$\nabla_{\mathrm{vech}(C^*_\alpha)} \log p(\tilde\theta) = \frac{\partial}{\partial\,\mathrm{vech}(C^*_\alpha)}\bigg[\sum_{j=1}^{J} \log N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha) + \log \mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) + \log \big|J_{\Sigma_\alpha \to \mathrm{vech}(C^*_\alpha)}\big|\bigg]$$
$$= -\frac{J + \nu + p + 1}{2}\,\frac{\partial \log|\Sigma_\alpha|}{\partial\,\mathrm{vech}(C^*_\alpha)} - \frac12\,\frac{\partial}{\partial\,\mathrm{vech}(C^*_\alpha)} \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)^\top \Sigma_\alpha^{-1} (\alpha_j - \mu_\alpha) - \frac12\,\frac{\partial}{\partial\,\mathrm{vech}(C^*_\alpha)}\,\mathrm{trace}(\Psi\Sigma_\alpha^{-1}) + \frac{\partial}{\partial\,\mathrm{vech}(C^*_\alpha)} \log\Big(2^{D_\alpha}\prod_{d=1}^{D_\alpha} C_{d,d}^{D_\alpha - d + 2}\Big)$$
$$= -(J + \nu + p + 1)\,\mathrm{vech}(I_{D_\alpha}) - \frac12\,\mathrm{vech}(C_1) - \frac12\,\mathrm{vech}(C_2) + \mathrm{vech}\big(\mathrm{diag}(D_\alpha + 1, D_\alpha, \ldots, 2)\big),$$

where $C_1$ and $C_2$ are matrices whose elements are

$$(C_1)_{ij} = \begin{cases} M_{ij} & \text{if } i \neq j \\ M_{ii} \times (C_\alpha)_{ii} & \text{if } i = j \end{cases} \qquad \text{and} \qquad (C_2)_{ij} = \begin{cases} H_{ij} & \text{if } i \neq j \\ H_{ii} \times (C_\alpha)_{ii} & \text{if } i = j, \end{cases}$$

with $M = -C_\alpha^{-\top} C_\alpha^{-1} \Psi C_\alpha^{-\top}$ and $H = -C_\alpha^{-\top} C_\alpha^{-1} \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)(\alpha_j - \mu_\alpha)^\top C_\alpha^{-\top}$. Finally,

$$\nabla_{\log a_d} \log p(\tilde\theta) = \frac{\partial}{\partial \log a_d}\bigg[\frac{\nu}{2}\log|\Psi| - \frac12\,\mathrm{trace}(\Psi\Sigma_\alpha^{-1}) - (\alpha_d + 1)\log a_d - \frac{\beta_d}{a_d} + \log a_d\bigg] = -\frac{\nu}{2} - \frac{1}{A_d^2} + \frac{1}{2a_d} + \frac{\nu_\alpha(\Sigma_\alpha^{-1})_{dd}}{a_d}, \quad \text{for } d = 1, \ldots, D_\alpha.$$

Appendix C
Deriving the Gradients in the Hybrid Gaussian VB for approximating the Hierarchical LBA Models
Recall that the set of working parameters is $\tilde\theta = (\alpha_{1:J}, \mu_\alpha, \log a, \Sigma_\alpha)$, which can be partitioned into $\theta_1 = (\alpha_{1:J}, \mu_\alpha, \log a)$ and $\theta_2 = \Sigma_\alpha$. The data-parameter joint density is

$$p(y, \alpha_{1:J}, \mu_\alpha, \Sigma_\alpha, \log a) = \prod_{j=1}^{J} \mathrm{LBA}(y_j\,|\,\alpha_j)\,N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha)\; N(\mu_\alpha\,|\,0, I)\,\mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) \times \prod_{d=1}^{D_\alpha} \mathrm{IG}(a_d\,|\,\alpha_d, \beta_d)\,\big|J_{a_d \to \log a_d}\big|,$$

where $J_{a_d \to \log a_d} = a_d$ is the Jacobian of the transformation.

Lemma 1.
For models with random effects given by (1)–(3), $p(\Sigma_\alpha\,|\,\alpha_{1:J}, \mu_\alpha, a, y)$ is the density of $\mathrm{IW}(\Sigma_\alpha\,|\,\nu_1, \Psi_1)$, with $\nu_1 = 2D_\alpha + J + 1$ and $\Psi_1 = \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)(\alpha_j - \mu_\alpha)^\top + 4\,\mathrm{diag}(1/a_1, \ldots, 1/a_{D_\alpha})$.

Proof.

$$p(\Sigma_\alpha\,|\,\alpha_{1:J}, \mu_\alpha, a, y) \propto p(\alpha_{1:J}, \mu_\alpha, \Sigma_\alpha, a, y) \propto \prod_{j=1}^{J} p(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha)\,p(\Sigma_\alpha\,|\,a) = \prod_{j=1}^{J} N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha)\,\mathrm{IW}(\Sigma_\alpha\,|\,D_\alpha + 1, \Psi_0) \quad (\text{where } \Psi_0 = 4\,\mathrm{diag}(1/a_1, \ldots, 1/a_{D_\alpha}))$$
$$\propto |\Sigma_\alpha|^{-J/2} \exp\bigg(-\frac12 \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)^\top \Sigma_\alpha^{-1} (\alpha_j - \mu_\alpha)\bigg)\,|\Sigma_\alpha|^{-(2D_\alpha + 2)/2} \exp\Big(-\frac12\,\mathrm{trace}(\Psi_0 \Sigma_\alpha^{-1})\Big)$$
$$\propto |\Sigma_\alpha|^{-(2D_\alpha + J + 2)/2} \exp\bigg(-\frac12\,\mathrm{trace}\bigg(\bigg[\sum_{j=1}^{J} (\alpha_j - \mu_\alpha)(\alpha_j - \mu_\alpha)^\top + \Psi_0\bigg]\Sigma_\alpha^{-1}\bigg)\bigg).$$

It is now straightforward to see that $p(\Sigma_\alpha\,|\,\alpha_{1:J}, \mu_\alpha, a, y)$ is the density of the inverse Wishart distribution with degrees of freedom $\nu_1 = 2D_\alpha + J + 1$ and scale matrix $\Psi_1 = \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)(\alpha_j - \mu_\alpha)^\top + 4\,\mathrm{diag}(1/a_1, \ldots, 1/a_{D_\alpha})$.

See Appendix B for $\nabla_{\theta_1} \log p(y\,|\,\theta_1, \theta_2)$. For the other terms, we first note that

$$p(\theta_1, \Sigma_\alpha) = \prod_{j=1}^{J} N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha)\; N(\mu_\alpha\,|\,0, I)\,\mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) \prod_{d=1}^{D_\alpha} \mathrm{IG}(a_d\,|\,\alpha_d, \beta_d)\,\big|J_{a_d \to \log a_d}\big|;$$

hence,

$$\log p(\theta_1, \Sigma_\alpha) = \sum_{j=1}^{J} \log N(\alpha_j\,|\,\mu_\alpha, \Sigma_\alpha) + \log N(\mu_\alpha\,|\,0, I) + \log \mathrm{IW}(\Sigma_\alpha\,|\,\nu, \Psi) + \sum_{d=1}^{D_\alpha} \big(\log \mathrm{IG}(a_d\,|\,\alpha_d, \beta_d) + \log a_d\big)$$
$$= -\frac12 \sum_{j=1}^{J} (\alpha_j - \mu_\alpha)^\top \Sigma_\alpha^{-1} (\alpha_j - \mu_\alpha) - \frac12\,\mu_\alpha^\top \mu_\alpha + \frac{\nu}{2}\log|\Psi| - \frac12\,\mathrm{trace}(\Psi\Sigma_\alpha^{-1}) + \sum_{d=1}^{D_\alpha} \Big\{-(\alpha_d + 1)\log a_d - \frac{\beta_d}{a_d} + \log a_d\Big\} + \text{constant},$$

where the constant collects all terms that do not depend on $\theta_1$. It follows that

$$\frac{\partial}{\partial \alpha_j} \log p(\theta_1, \Sigma_\alpha) = -\Sigma_\alpha^{-1}(\alpha_j - \mu_\alpha), \qquad \frac{\partial}{\partial \mu_\alpha} \log p(\theta_1, \Sigma_\alpha) = \sum_{j=1}^{J} \Sigma_\alpha^{-1}(\alpha_j - \mu_\alpha) - \mu_\alpha,$$

and

$$\frac{\partial}{\partial \log a_d} \log p(\theta_1, \Sigma_\alpha) = -\frac{\nu}{2} - \alpha_d + \frac{\beta_d}{a_d} + \frac{\nu_\alpha(\Sigma_\alpha^{-1})_{dd}}{a_d}.$$

Appendix D
Estimating the posterior predictive densities for the Hierarchical LBA model
Recall that the parameters are $\theta = (\alpha_1^\top, \ldots, \alpha_J^\top, \mu_\alpha^\top, \mathrm{vech}(\Sigma_\alpha)^\top, a^\top)^\top$.

1. Suppose there are $S$ draws $\theta^{(s)}_{\mathrm{MCMC}}$, $s = 1, \ldots, S$, from the posterior $p(\theta\,|\,y)$.
2. For each draw $\theta^{(s)}_{\mathrm{MCMC}}$, simulate a sample for subject $j$: $\tilde y_{js} := (\mathrm{rt}_{js}, \mathrm{re}_{js}) \sim \mathrm{LBA}(y\,|\,\alpha_j^{(s)})$.
3. Estimate the posterior predictive densities based on the simulated samples $\tilde y_1 = (\tilde y_{11}, \ldots, \tilde y_{1S}), \ldots, \tilde y_J = (\tilde y_{J1}, \ldots, \tilde y_{JS})$. In particular:
• The posterior predictive density of the response time for correct decisions by subject $j$ under the accuracy condition is estimated based only on the simulated response times of correct responses performed under the accuracy condition. A similar approach is used to obtain the posterior predictive density of correct response times for subject $j$ under the neutral and speed conditions.
• The posterior predictive densities of the response time for incorrect decisions under the accuracy, neutral, and speed conditions are generated similarly.

The VB posterior predictive densities are estimated in the same way, except that here we simulate $\theta^{(s)}_{VB} \sim q_\lambda(\theta)$, $s = 1, \ldots, S$, instead of using the MCMC draws $\theta^{(s)}_{\mathrm{MCMC}}$.

Appendix E
The K-fold CVVB applied to hierarchical LBA models

Input: A set of LBA models $\{\mathcal{M}_m\}_{m=1}^{M}$.
Output: The models ranked by their predictive power, as measured by $\widehat{\mathrm{ELPD}}_{K\text{-CVVB}}$.

1. The data is randomly split into $K$ folds. For subject $j$, this is done by splitting their observations into $K$ disjoint subsets of approximately equal length,

$$y_j = y_j^{(1)} \cup y_j^{(2)} \cup \cdots \cup y_j^{(K)}, \qquad j = 1, \ldots, J.$$

Denote by $I_j^k$ and $I_j^{-k}$ the sets of indices of the observations of subject $j$ that are in fold $k$ and are not in fold $k$, respectively; i.e., the observations in fold $k$ belonging to subject $j$ are $y_j^{(k)} = \{y_{ji}\,|\,i \in I_j^k\}$. Thus, fold $k$ consists of

$$y^{(k)} = \bigcup_{j=1}^{J} y_j^{(k)}, \qquad k = 1, \ldots, K.$$
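Step 1 can be sketched as follows: the helper below shuffles each subject's trial indices and deals them out round-robin, so the sets $I_j^k$ are disjoint and of near-equal size within every subject. The function name and the round-robin scheme are our illustration, not the authors' implementation.

```python
import random

def make_folds(n_obs_per_subject, K, seed=0):
    """Split each subject's trial indices into K roughly equal folds.

    Returns folds[k][j] = sorted list of trial indices of subject j that
    fall in fold k (the sets I_j^k in the text)."""
    rng = random.Random(seed)
    J = len(n_obs_per_subject)
    folds = [[[] for _ in range(J)] for _ in range(K)]
    for j, n_j in enumerate(n_obs_per_subject):
        idx = list(range(n_j))
        rng.shuffle(idx)                       # random assignment of trials
        for pos, i in enumerate(idx):
            folds[pos % K][j].append(i)        # deal indices out round-robin
    for k in range(K):
        for j in range(J):
            folds[k][j].sort()
    return folds

# Example: three subjects with 103, 97 and 100 trials, K = 5 folds.
folds = make_folds([103, 97, 100], K=5)
sizes = [len(folds[k][0]) for k in range(5)]   # subject 0's fold sizes
```

Because 103 = 5 × 20 + 3, subject 0's folds have sizes 21, 21, 21, 20, 20, and every trial index appears in exactly one fold.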
2. For each model $\mathcal{M}_m$ ($m = 1, \ldots, M$):
• Leave fold $k$ out and approximate the leave-$k$th-fold-out posterior $\pi(\tilde\theta\,|\,y^{(-k)})$. Denote the VB approximation by $q_{\lambda^{(k)}}(\tilde\theta)$.
• Estimate the leave-$k$th-fold-out posterior predictive density

$$p(y^{(k)}\,|\,y^{(-k)}) \approx \frac{1}{S}\sum_{s=1}^{S} p(y^{(k)}\,|\,\tilde\theta^{(s)}_{VB}), \quad \text{where } \tilde\theta^{(s)}_{VB} \sim q_{\lambda^{(k)}}(\tilde\theta), \; s = 1, \ldots, S, \quad \text{and} \quad p(y^{(k)}\,|\,\tilde\theta^{(s)}_{VB}) = \prod_{j=1}^{J} \prod_{i \in I_j^k} \mathrm{LBA}(y_{ji}\,|\,\alpha_j^{(s)}).$$

• The computed K-fold cross-validation estimate of the ELPD is

$$\widehat{\mathrm{ELPD}}_{K\text{-CVVB}} = \frac{1}{K}\sum_{k=1}^{K} \log\bigg(\frac{1}{S}\sum_{s=1}^{S} p(y^{(k)}\,|\,\tilde\theta^{(s)}_{VB})\bigg).$$
3. Models are ranked according to the computed K-fold cross-validation estimate. The model with the largest $\widehat{\mathrm{ELPD}}_{K\text{-CVVB}}$ is ranked highest.
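Given a matrix of log predictive densities $\log p(y^{(k)}\,|\,\tilde\theta^{(s)}_{VB})$, the estimate in step 2 can be computed in a numerically stable way with the log-sum-exp trick, since the per-draw predictive densities of a whole fold are typically far too small to exponentiate directly. The sketch below illustrates this and the ranking in step 3; the model names and numbers are purely illustrative, not results from the paper.

```python
import math

def log_mean_exp(vals):
    """Stably compute log( (1/S) * sum_s exp(vals[s]) )."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals)) - math.log(len(vals))

def elpd_k_cvvb(log_pred):
    """K-fold CVVB estimate of the ELPD.

    log_pred[k][s] = log p(y^(k) | theta_VB^(s)) for fold k and VB draw s."""
    K = len(log_pred)
    return sum(log_mean_exp(row) for row in log_pred) / K

# Rank two hypothetical models by their estimated ELPD (larger is better).
scores = {"M1": elpd_k_cvvb([[-500.2, -501.0], [-498.7, -499.1]]),
          "M2": elpd_k_cvvb([[-510.5, -509.9], [-507.3, -508.8]])}
ranked = sorted(scores, key=scores.get, reverse=True)  # best model first
```

Working on the log scale throughout is important here: `exp(-500)` underflows to zero in double precision, while `log_mean_exp` handles such values without loss of accuracy.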