A Bayesian perspective on sampling of alternatives
13 January 2021
Thijs Dekker (corresponding author), Institute for Transport Studies, University of Leeds, [email protected]
Prateek Bansal
Department of Civil and Environmental Engineering, Imperial College London, [email protected]

Abstract

In this paper, we apply a Bayesian perspective to sampling of alternatives for multinomial logit (MNL) and mixed multinomial logit (MMNL) models. We find three theoretical results: i) McFadden's correction factor under the uniform sampling protocol can be transferred to the Bayesian context in MNL; ii) the uniform sampling protocol minimises the loss in information on the parameters of interest (i.e. the kernel of the posterior density) and thereby has desirable small sample properties in MNL; and iii) our theoretical results extend to Bayesian MMNL models using data augmentation. Notably, sampling of alternatives in Bayesian MMNL models does not require the inclusion of the additional correction factor identified by Guevara and Ben-Akiva (2013a) in classical settings. Accordingly, due to desirable small and large sample properties, uniform sampling is the recommended sampling protocol in MNL and MMNL, irrespective of the estimation framework selected.
Keywords:
Multinomial logit, Mixed Multinomial logit, Sampling of alternatives, Bayesian estimation

1. Introduction
Recent works in transportation, environmental economics and marketing (Daly et al., 2014; Guevara and Ben-Akiva, 2013a,b; Von Haefen and Domanski, 2018; Sinha et al., 2018; Keane and Wasi, 2012) have renewed interest in sampling of alternatives in the context of discrete choice modelling. The major benefit of sampling of alternatives is that significant reductions in the estimation time of large-scale choice models, characterised by a large number of available alternatives, can be attained at relatively low cost. McFadden (1978) already proved that, after applying a sampling correction to the utility function of sampled alternatives, consistent parameter estimates can be obtained by estimating a multinomial logit (MNL) model over a subset of randomly sampled alternatives.

Most discrete choice applications nowadays apply model specifications beyond MNL. Models from the Multivariate Extreme Value (MEV) family, such as the nested logit model (Daly, 1987), and mixed multinomial logit (MMNL) models (Revelt and Train, 1998) have become state of the art. Guevara and Ben-Akiva (2013b) and Guevara and Ben-Akiva (2013a) proved that McFadden (1978)'s result forms only a part of the solution to implement sampling of alternatives within these advanced models. Consistent parameter estimates can be obtained when, in addition to the McFadden correction term, the analyst also corrects for the imperfect representation of the LogSum in MEV and of the individual-level parameters in MMNL in the sampled model.

Guevara and Ben-Akiva (2013a) highlight in their conclusions the need to study sampling of alternatives in the context of finite sample sizes and alternative estimation strategies. Von Haefen and Domanski (2018) establish consistency of parameter estimates and welfare measures of latent class models using the EM algorithm in combination with sampling of alternatives. The key feature is that the EM algorithm sequentially estimates weighted MNL models (Train, 2009) to which McFadden (1978)'s result applies. Von Haefen and Domanski (2018) show good performance for relatively small sizes of sampled choice sets, up to 5% of the full choice set size, but do not address the issue of small sample sizes. The purpose of our paper is to provide a theoretical analysis of the small sample performance of McFadden (1978)'s and Guevara and Ben-Akiva (2013a)'s results in the context of Bayesian estimation for MNL and MMNL models.

With the advancements in computational resources and approximate inference, Bayesian estimation has gained popularity in the discrete choice modelling literature (Bansal et al., 2020). Bayesian estimation of discrete choice models is particularly fruitful when latent constructs are included in the likelihood function. Significant reductions in estimation time over classical maximum simulated likelihood approaches are typically obtained due to augmenting latent constructs (Tanner and Wong, 1987; Train, 2009), such as individual-level parameters in mixed logit models or class membership in latent class models. The joint implementation of sampling of alternatives and Bayesian estimation may therefore lead to additional time savings and spur the implementation of more flexible discrete choice models in (travel- or recreational-) demand models with large choice sets.

With the application of the Bernstein-von Mises theorem (Train, 2009), consistency (i.e., a large-sample property) of the Bayesian parameter estimates under sampling of alternatives can be established.
The interest of this paper is, however, in the performance of McFadden's correction factor in the case of finite sample sizes. First, this paper proves that the McFadden correction under the uniform sampling protocol minimises the expected information loss in the non-normalised posterior density, i.e. the shape and location of the kernel density of the parameters of interest, when moving from the 'true' MNL likelihood to the 'sampled' likelihood, irrespective of sample size. The proof draws extensively on the work of Keane and Wasi (2012). Second, the paper extends the proof to MMNL models and shows that the additional correction factor developed by Guevara and Ben-Akiva (2013a) becomes obsolete in a Bayesian MMNL model with augmented individual-level parameter estimates.

Our results are particularly interesting in the context of Azaiez (2010); Keane and Wasi (2012); Lemp and Kockelman (2012); Guevara and Ben-Akiva (2013b); Von Haefen and Domanski (2018). All these studies argue that a naive approach to sampling of alternatives in MMNL performs relatively well, but that empirical efficiency depends on the sampling protocol. The naive approach is characterised by solely applying McFadden (1978)'s correction factor even though the required Independence of Irrelevant Alternatives criterion is not satisfied by MMNL models. From a Bayesian perspective, this result does not come as a surprise given that data augmentation requires the evaluation of conditional MNL probabilities for which the McFadden (1978) correction term should provide an accurate approximation asymptotically. This paper, however, shows that this result also extends to small samples.

The paper is structured as follows: Section 2 discusses the consistency of McFadden's correction factor in the context of MNL models. Section 3 studies the small sample properties of McFadden's correction factor in the context of the Bayesian MNL model. Section 4 explains the consistency requirements for mixed logit models using Guevara and Ben-Akiva (2013a) and examines the need for the additional correction factor in the context of Bayesian MMNL models. Section 5 concludes with key takeaways.
2. Sampling of alternatives in multinomial logit models and consistency
This paper follows the conventional micro-econometric approach to individual decision making. Individual $n$ is assumed to select alternative $i$ from choice set $C_n$ when it generates the highest level of indirect utility $U_{in}$. Indirect utility in Equation (1) comprises a deterministic part $V_{in}$ and an additive unobserved stochastic part $\varepsilon_{in}$. The deterministic part is a function of explanatory variables $X_{in}$ and associated parameters $\beta$, and typically takes a linear form. The stochastic term $\varepsilon_{in}$ is assumed to be independently and identically distributed across alternatives and individuals.

$$U_{in} = V_{in} + \varepsilon_{in} = V(X_{in}; \beta) + \varepsilon_{in} \quad (1)$$

Throughout the paper it is assumed that the 'true' data generating process takes the form of the logit model. In that case, $\varepsilon_{in}$ follows a Type 1 Extreme Value distribution such that $P(i|\beta, C_n)$ in Equation (2) describes the probability that individual $n$ will select alternative $i$.[1] The denominator highlights the computational challenge of applying MNL to large choice sets.

$$P(i|\beta, C_n) = \frac{\exp(V_{in})}{\sum_{j \in C_n} \exp(V_{jn})} \quad (2)$$

[Footnote 1: To improve the clarity of notation, we remove the conditioning on $X_n$ from all probability statements. $X_n$ typically represents a set of exogenous variables not associated with any form of stochasticity. If any such stochasticity is present, for example in a latent variable model, our Bayesian results still apply as long as the stochasticity is independent of $\beta$.]

Sampling of alternatives reduces the computational burden by specifying a quasi-likelihood function based on the smaller choice set $D_n$ approximating Equation (2). $D_n$ is a subset of $C_n$ and includes a number of randomly sampled alternatives. The interested reader is referred to Ben-Akiva and Lerman (1985) for a description of various sampling protocols. Let $P(i|\beta, D_n)^\dagger$ in Equation (3) define the uncorrected sampled choice probability. By using a smaller number of alternatives in the denominator, the sampled quasi-likelihood function (without correction) will overestimate the true choice probability for a given value of $\beta$.

$$P(i|\beta, D_n)^\dagger = \frac{\exp(V_{in})}{\sum_{j \in D_n} \exp(V_{jn})} \quad (3)$$

McFadden (1978) showed that consistent parameter estimates for $\beta$ can be obtained by adding the correction factor $\ln(\pi(D_n|j))$ to the indirect utility function of alternative $j$, $\forall j \in D_n$. By definition, $D_n$ has to include the chosen alternative $i$. The correction factor describes the natural log transformation of $\pi(D_n|j)$, which denotes the probability of sampling the restricted choice set $D_n$ conditional on $j$ being the chosen alternative. Ben-Akiva and Lerman (1985) derive McFadden's correction factor by means of Bayes' rule (see Equations 4-6).

$$P(i|\beta, D_n) = \frac{\pi(D_n|i) P(i|\beta, C_n)}{\pi(D_n)} = \frac{\pi(D_n|i) P(i|\beta, C_n)}{\sum_{j \in C_n} \pi(D_n|j) P(j|\beta, C_n)} \quad (4)$$

$$= \frac{\pi(D_n|i) P(i|\beta, C_n)}{\sum_{j \in D_n} \pi(D_n|j) P(j|\beta, C_n)} = \frac{\pi(D_n|i) \frac{\exp(V_{in})}{\sum_{k \in C_n} \exp(V_{kn})}}{\sum_{j \in D_n} \pi(D_n|j) \frac{\exp(V_{jn})}{\sum_{k \in C_n} \exp(V_{kn})}} \quad (5)$$

$$= \frac{\pi(D_n|i) \exp(V_{in})}{\sum_{j \in D_n} \pi(D_n|j) \exp(V_{jn})} = \frac{\exp(V_{in} + \ln(\pi(D_n|i)))}{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))} \quad (6)$$

In the above, $P(i|\beta, D_n)$ denotes the corrected probability of selecting alternative $i$ from the sampled choice set $D_n$. Moreover, $\pi(D_n)$ represents the unconditional probability of sampling the set $D_n$, which can be rewritten as $\pi(D_n) = \sum_{j \in C_n} \pi(D_n|j) P(j|\beta, C_n)$ and further simplified to $\pi(D_n) = \sum_{j \in D_n} \pi(D_n|j) P(j|\beta, C_n)$, since $D_n$ must include the chosen alternative $j$ and therefore $\pi(D_n|j) = 0$ for $j \notin D_n$.

The above proof requires the positive conditioning assumption (McFadden, 1978), i.e. $\pi(D_n|j) > 0 \ \forall j \in D_n$. Intuitively, the assumption requires a positive probability of obtaining the sampled set $D_n$ given the choice of any alternative included in that set. When all alternatives in the true choice set are assigned the same sampling probability irrespective of the chosen alternative, then $\pi(D_n|i) = \pi(D_n|j) = \pi(D_n) \ \forall j, i$. In this particular case, McFadden's correction factor cancels out and Equation (3) provides consistent parameter estimates because $P(i|\beta, D_n) = P(i|\beta, D_n)^\dagger$. This specific sampling protocol is frequently labelled as uniform conditioning or the uniform sampling protocol.[2]

[Footnote 2: Similar to previous studies (Guevara and Ben-Akiva, 2013a), throughout the paper we assume that the sampling protocol $\pi(D_n|j)$ is independent of $\beta$. Any such dependence requires re-sampling $D_n$ at every iteration during estimation and limits the computational benefits of the proposed method.]
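To make the mechanics of Equations (2), (3) and (6) concrete, the following Python sketch (ours, not from the paper; set sizes and utilities are arbitrary illustrative values) evaluates the true, uncorrected and corrected choice probabilities for a single observation under uniform conditioning, where the correction factor is a constant that cancels.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
J, k = 50, 10                        # full and sampled choice set sizes (illustrative)
V = rng.normal(size=J)               # deterministic utilities V_jn for one individual
i = 3                                # chosen alternative; D_n must contain it

P_true = np.exp(V) / np.exp(V).sum()                      # Eq. (2)

# Uniform conditioning: keep the chosen alternative, draw k-1 others uniformly.
others = rng.choice(np.delete(np.arange(J), i), size=k - 1, replace=False)
D = np.concatenate(([i], others))

# pi(D_n|j) = 1 / C(J-1, k-1) for every j in D_n under this protocol, so
# ln(pi(D_n|j)) is a constant that cancels out of Eq. (6).
log_pi = np.full(k, -np.log(comb(J - 1, k - 1)))

u = np.exp(V[D])
P_uncorrected = u[0] / u.sum()                            # Eq. (3), overestimates Eq. (2)
u_corr = np.exp(V[D] + log_pi)
P_corrected = u_corr[0] / u_corr.sum()                    # Eq. (6)

print(P_true[i], P_uncorrected, P_corrected)  # last two equal up to floating point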
2.2. Maximising the ex ante corrected likelihood and consistency

We are looking for the optimal correction factors $c_{jn}$ ($\forall j \in D_n$, $n \in \{1, 2, \ldots, N\}$) and parameters $\beta$ that maximise the likelihood. Following Keane and Wasi (2012), we define the sampled and corrected log-likelihood of MNL by:

$$LL^+ = \sum_{n=1}^{N} \ln\left(\frac{\exp(V_{in} + c_{in})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (7)$$

Ex ante, the chosen alternative and the sampled choice set are unknown. Hence, following Keane and Wasi (2012) and Guevara and Ben-Akiva (2013a), we take expectations over the chosen alternative and the sampled set of alternatives in Equation (8).

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} \pi(D_n, k|\beta^*) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (8)$$

Note that the above expectation is taken conditional on the true parameter $\beta^*$. Keane and Wasi (2012) and Guevara and Ben-Akiva (2013a) make use of different representations of the joint distribution $\pi(D_n, k|\beta^*)$, as denoted in Equations (9) and (10), respectively. We present the equivalence of Keane and Wasi (2012) and Guevara and Ben-Akiva (2013a) for MNL in Appendix A. Results in the context of the MNL model are indifferent to the representation of the joint distribution, but the formulation adopted by Guevara and Ben-Akiva (2013a) is more convenient in the context of the proof of consistency for the mixed logit model in Section 4.

$$\pi(D_n, i|\beta^*) = \pi(D_n|i) \cdot P(i|\beta^*, C_n) \quad (9)$$

$$= P(i|\beta^*, D_n) \cdot \pi(D_n) \quad (10)$$

Using (9), the expected log-likelihood of the MNL model is defined by:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} P(k|\beta^*, C_n) \cdot \pi(D_n|k) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (11)$$

Multiplying $P(k|\beta^*, C_n)$ in Equation (11) by $\frac{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))}{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))}$ and defining $R_n = \frac{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))}{\sum_{j \in C_n} \exp(V_{jn})}$ allows the expected log-likelihood to be rewritten as:[3]

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} R_n \cdot \frac{\exp(V_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))} \cdot \pi(D_n|k) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (12)$$

Noting that i) $R_n$ is independent of $k$, ii) $\pi(D_n|k) = 0$ for $k \notin D_n$, and iii) the second and third term jointly arrive at McFadden's corrected choice probability, we obtain:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} R_n \sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (13)$$

[Footnote 3: Keane and Wasi (2012) define $R_n = \frac{\sum_{j \in D_n} \exp(V_{jn})}{\sum_{j \in C_n} \exp(V_{jn})}$ and thereby rely on the uniform sampling protocol. Also note the correspondence between $R_n$ and what Daly et al. (2014) label as the coverage of the sampled choice set.]

The inner sum $\sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln(\cdot)$ reaches its maximum when $\beta = \beta^*$ and $c_{jn} = \ln(\pi(D_n|j)) \ \forall j \in D_n$ because of the entropy inequality: $\sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln(q_{kn})$ is maximised over probability vectors $q_n$ at $q_{kn} = P(k|\beta^*, D_n)$, i.e. at $\sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln(P(k|\beta^*, D_n))$. In words, under mild regularity conditions, the maximum of the corrected log-likelihood converges in probability to the maximum of the true likelihood at $\beta = \beta^*$ (Newey and McFadden, 1994) only when McFadden's correction factor applies. This proof of consistency holds for both uniform and non-uniform sampling protocols.
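The consistency argument can be illustrated numerically. The sketch below (ours; the attribute vector X, the weight vector w and the non-uniform protocol are illustrative assumptions) enumerates Equation (8) for a tiny choice set and locates the maximiser of the expected sampled log-likelihood over a grid of beta values, with and without the McFadden correction.

import numpy as np
from itertools import combinations

X = np.array([0.5, -1.0, 1.5, 0.2])   # one attribute per alternative (illustrative)
beta_star = 1.0                       # true parameter
w = np.array([0.1, 0.2, 0.3, 0.4])    # non-uniform weights for the extra alternative

def P_full(beta):                     # Eq. (2)
    u = np.exp(beta * X)
    return u / u.sum()

def pi_D_given_k(D, k):               # prob. of sampling D = {k, other} given chosen k
    other = [j for j in D if j != k][0]
    return w[other] / (w.sum() - w[k])

def expected_LL(beta, correct):       # Eq. (8) by full enumeration, |D_n| = 2
    total, P_star = 0.0, P_full(beta_star)
    for D in combinations(range(4), 2):
        for k in D:
            weight = P_star[k] * pi_D_given_k(D, k)
            c = np.array([np.log(pi_D_given_k(D, j)) for j in D]) if correct else 0.0
            u = np.exp(beta * X[list(D)] + c)
            total += weight * np.log(u[list(D).index(k)] / u.sum())
    return total

grid = np.linspace(0.0, 2.0, 201)
for correct in (True, False):
    best = grid[np.argmax([expected_LL(b, correct) for b in grid])]
    print(f"correction={correct}: argmax over beta = {best:.2f}")
# The corrected expected log-likelihood peaks at beta* = 1.0; without the
# correction the maximiser is generally shifted away from beta*.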
Keane and Wasi (2012) continue by looking at the ex ante difference in the log-likelihood between the 'sampled' and the 'true' likelihood under the uniform sampling protocol (see Equations 14-17). In these Equations, the $\dagger$ refers to the case of the uniform sampling protocol. All the correction factors drop out of $P(i|\beta^*, D_n)^\dagger$ and $R_n^\dagger = \frac{\sum_{j \in D_n} \exp(V_{jn})}{\sum_{j \in C_n} \exp(V_{jn})}$. We also know that $0 < R_n^\dagger < 1$ because $\sum_{j \in C_n} \exp(V_{jn}) > \sum_{j \in D_n} \exp(V_{jn})$. Therefore, the term in square brackets in Equation (15) is positive and so is $\mathbb{E}(LL^\dagger - LL)$. Equation (17) again takes the convenient entropy expression, but this time with respect to $R_n^\dagger$, and the summation is over all possible sampled choice sets. Due to the negative sign outside, the positive information divergence is therefore minimised at the McFadden correction term under uniform sampling. The result does not directly transfer to the case of a non-uniform sampling protocol.

$$\mathbb{E}(LL^\dagger - LL) \quad (14)$$

$$= \sum_{n=1}^{N} \sum_{D_n \in C_n} R_n^\dagger \sum_{k \in D_n} P(k|\beta^*, D_n)^\dagger \left[\ln\left(\frac{\exp(V_{kn})}{\sum_{j \in D_n} \exp(V_{jn})}\right) - \ln\left(\frac{\exp(V_{kn})}{\sum_{j \in C_n} \exp(V_{jn})}\right)\right] \quad (15)$$

$$= \sum_{n=1}^{N} \sum_{D_n \in C_n} R_n^\dagger \sum_{k \in D_n} P(k|\beta^*, D_n)^\dagger \ln\left(\frac{\sum_{j \in C_n} \exp(V_{jn})}{\sum_{j \in D_n} \exp(V_{jn})}\right) \quad (16)$$

$$= -\sum_{n=1}^{N} \sum_{D_n \in C_n} R_n^\dagger \ln\left(R_n^\dagger\right) \quad (17)$$

The above proof offers a strong case for using uniform sampling when we look at the small sample performance of sampling of alternatives in Section 3. The positive divergence $\mathbb{E}(LL^\dagger - LL)$ in itself is, however, not informative for model selection under sampling of alternatives. Although the expected divergence may be minimised using uniform sampling, the size and behaviour of the resulting difference is unknown when using different specifications for $V$. For example, take two non-nested model specifications, such that Wald tests (based on the consistent parameter estimates) cannot be used to test whether one model is better than the other. We thus only know that the (expected) difference in the log-likelihood is minimised for both models. Quantifying the actual difference still requires evaluating the true model, and thus one might just as well evaluate the true model under non-uniform sampling and use the respective model fits for model selection purposes. Therefore, neither the uniform nor the non-uniform sampling protocol has a clear informational advantage in the context of model selection under sampling of alternatives.
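A small enumeration (ours, with arbitrary utilities for a single observation) makes Equation (15) concrete: the expected divergence is positive and shrinks towards zero as the sampled choice set under uniform conditioning grows towards the full set.

import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(1)
J = 6
V = rng.normal(size=J)                # arbitrary utilities for one observation
P_true = np.exp(V) / np.exp(V).sum()

for k in range(2, J + 1):
    m = comb(J - 1, k - 1)            # number of sets D of size k containing a given i
    div = 0.0
    for i in range(J):                # chosen alternative, weighted by P(i | beta*, C_n)
        for rest in combinations([j for j in range(J) if j != i], k - 1):
            sum_D = np.exp(V[list((i,) + rest)]).sum()
            # bracketed term in Eq. (15): ln(sum over C_n / sum over D_n) > 0
            div += P_true[i] * (1.0 / m) * np.log(np.exp(V).sum() / sum_D)
    print(f"|D_n| = {k}: expected divergence = {div:.4f}")  # positive, 0 at k = J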
3. Small sample performance of McFadden's correction factor in MNL models - a Bayesian perspective

Instead of working with point estimates, Bayesian models derive the posterior distribution of the parameters of interest. The posterior $p(\beta|Y, C)$ in Equation (18) is described as a function of the prior $p(\beta)$, the true likelihood $p(Y|\beta, C)$ and the marginal likelihood $p(Y|C)$, where $C$ denotes the set of full choice sets across all observations. Given our interest in using a quasi-likelihood function to approximate the true likelihood function, containing the same $\beta$ parameters, there is no reason to assume that the prior differs between the 'true' and 'sampled' model. The marginal likelihood is independent of $\beta$ and only operates as a scalar ensuring that the respective posterior density function integrates to unity. Any difference in the shape and location between the 'true' and 'sampled' posterior for $\beta$ therefore arises in the likelihood function.

$$p(\beta|Y, C) = \frac{p(\beta) p(Y|\beta, C)}{p(Y|C)} = \frac{p(\beta) \prod_{n=1}^{N} P(i|\beta, C_n)}{\int_\beta p(\beta) \prod_{n=1}^{N} P(i|\beta, C_n) \, d\beta} \quad (18)$$

For completeness, the sampled posterior is described by Equation (19), where $D$ denotes the set of sampled choice sets across all observations. Here, we are evaluating the properties of McFadden's correction factor in a Bayesian context and directly implement it in Equation (20).

$$p(\beta|Y, D) = \frac{p(\beta) p(Y|\beta, D)}{p(Y|D)} = \frac{p(\beta) \prod_{n=1}^{N} P(i|\beta, D_n)}{\int_\beta p(\beta) \prod_{n=1}^{N} P(i|\beta, D_n) \, d\beta} \quad (19)$$

$$P(i|\beta, D_n) = \frac{\exp(V_{in} + \ln(\pi(D_n|i)))}{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))} \quad (20)$$

The Kullback-Leibler divergence criterion (Kullback and Leibler, 1951) is frequently used in Bayesian statistics as a measure of the information loss when approximating the 'true' distribution with an alternative distribution. Under the assumption of identical prior densities, the Kullback-Leibler divergence criterion ($D_{KL}$) reduces to the expected log-likelihood ratio between the two models plus the log of the ratio of the marginal likelihoods (see Equations 21-22). The latter term is also known as the (inverse) Bayes Factor, which is independent of $\beta$ and can accordingly be placed outside of the integral. Note that the $D_{KL}$ measure takes expectations with respect to $\beta$, but the measure is conditional on the observed choices $Y$ and the sampled choice sets $D$.

$$D_{KL} = \int_\beta p(\beta|Y, C) \ln\left(\frac{p(\beta|Y, C)}{p(\beta|Y, D)}\right) d\beta = \int_\beta p(\beta|Y, C) \ln\left(\frac{p(\beta) p(Y|\beta, C) / p(Y|C)}{p(\beta) p(Y|\beta, D) / p(Y|D)}\right) d\beta \quad (21)$$

$$D_{KL} = \int_\beta p(\beta|Y, C) \ln\left(\frac{p(Y|\beta, C)}{p(Y|\beta, D)}\right) d\beta + \ln\left(\frac{p(Y|D)}{p(Y|C)}\right) \quad (22)$$

Following the same principles as under classical estimation, we aim to derive an ex ante expression for $D_{KL}$ and additionally take expectations with respect to $Y$ and $D$. Without loss of generality, the other explanatory variables $X$, $C$ are assumed to be non-stochastic. Define the joint probability of observing the vector of parameters $\beta$, the vector of choices $Y$ and the set of sampled choice sets $D$ by:

$$\pi(\beta, Y, D) = p(\beta) p(Y|\beta, C) \pi(D|Y) = \pi(D) p(\beta) p(Y|\beta, D) \quad (23)$$

Define the ex ante version of $D_{KL}$ by:

$$\mathbb{E}(D_{KL})_{Y,D} = \int_\beta p(\beta) \sum_Y \sum_D P(Y|\beta, C) \pi(D|Y) \ln\left(\frac{P(Y|\beta, C)}{P(Y|\beta, D)}\right) d\beta + \left[\int_\beta p(\beta) \sum_D \sum_Y \pi(D) P(Y|\beta, D) \, d\beta\right] \ln\left(\frac{P(Y|D)}{P(Y|C)}\right) \quad (24)$$

Following the same logic as in Equations (11)-(13), under the assumption of independent choice and sampling probabilities across observations, we can rewrite the above as:

$$\mathbb{E}(D_{KL})_{Y,D} = \int_\beta p(\beta) \sum_D R \sum_Y P(Y|\beta, D) \ln\left(\frac{P(Y|\beta, C)}{P(Y|\beta, D)}\right) d\beta + \left[\int_\beta p(\beta) \sum_D \sum_Y \pi(D) P(Y|\beta, D) \, d\beta\right] \ln\left(\frac{P(Y|D)}{P(Y|C)}\right) \quad (25)$$

where $R = \prod_{n=1}^{N} R_n = \prod_{n=1}^{N} \frac{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))}{\sum_{j \in C_n} \exp(V_{jn})}$.

The two parts in (25) serve different purposes. The first part captures the extent to which the quasi-likelihood loses information with respect to the parameters of interest, i.e. the $\beta$'s. The inverse Bayes Factor captures the extent to which each model is able to describe the data.

Consider the first part: due to its similarity to $\mathbb{E}(LL^\dagger - LL)$ in Section 2.2, we adopt the uniform sampling protocol.[4] Accordingly, $\ln\left(\frac{P(Y|\beta, C)}{P(Y|\beta, D)^\dagger}\right) = \ln(R^\dagger)$, which does not depend on the chosen alternative, and $R_n$ reduces to $R_n^\dagger$. Given that $0 < R_n^\dagger < 1$, McFadden's correction factor under uniform sampling maximises the first term for every value of $\beta$ (see Equations 26 to 28). That is, it brings the term $\ln\left(\frac{P(Y|\beta, C)}{P(Y|\beta, D)^\dagger}\right)$, which is always negative, closest to zero. Thus, the information loss with respect to $\beta$ in the sampled posterior relative to the true model is minimised by McFadden's correction factor under uniform sampling. The importance of this result is that the performance of McFadden's correction factor holds for any value of $\beta$ and for any sample size, but only when uniform sampling is applied.

[Footnote 4: We still assume that the sampling protocol is independent of $\beta$.]

$$\int_\beta p(\beta) \sum_D R^\dagger \sum_Y P(Y|\beta, D)^\dagger \ln\left(\frac{P(Y|\beta, C)}{P(Y|\beta, D)^\dagger}\right) d\beta \quad (26)$$

$$= \int_\beta p(\beta) \sum_D R^\dagger \ln\left(R^\dagger\right) \sum_Y P(Y|\beta, D)^\dagger \, d\beta \quad (27)$$

$$= \int_\beta p(\beta) \sum_D R^\dagger \ln\left(R^\dagger\right) d\beta \quad (28)$$

The second term can be rewritten as:

$$\left[\int_\beta p(\beta) \sum_D \sum_Y \pi(D) P(Y|\beta, D) \, d\beta\right] \ln\left(\frac{P(Y|D)}{P(Y|C)}\right) = \sum_D \pi(D) \sum_Y P(Y|D) \ln\left(\frac{P(Y|D)}{P(Y|C)}\right) \quad (29)$$

Similar to the classical case, McFadden's correction factor maximises the second term and thus provides the best ex ante model fit. Whereas maximising the quasi-likelihood was crucial to establish consistency of the parameter estimates in classical settings, the notions of fit and maximisation are irrelevant in Bayesian estimation. It is also worth noting that the normalising constant is generally ignored in Bayesian estimation and only the kernel of the posterior matters, i.e. $p(\beta|Y) \propto p(\beta) p(Y|\beta)$. Hence, the second term can even be ignored here.

Conclusively, McFadden's correction factor under uniform sampling provides us with the best ex ante approximation of the non-normalised posterior, i.e. the kernel (or location and shape) of the posterior distribution for $\beta$. Accordingly, uniform sampling ensures that sampling of alternatives works for small sample sizes and that it can be applied under alternative estimation procedures, such as Bayesian estimation. The presented result thus provides a strong case for using uniform sampling as the sampling protocol in the context of the MNL model.

Despite minimising the expected loss in information on $\beta$, McFadden's correction factor under uniform sampling still introduces some degree of inaccuracy between the 'true' and 'sampled' posterior density. The severity of this loss in information for parameter estimates should be addressed empirically, including the search for optimal sampling rates (Daly et al., 2014). This is, however, no different from what happens in classical estimation.
4. Sampling of alternatives in mixed logit models
We now switch our attention to the mixed logit (MMNL) model. Guevara and Ben-Akiva (2013a) and Keane and Wasi (2012) both analytically show that McFadden's result does not extend to the mixed logit model. Namely, classical estimation methods treat $\beta_n$, the individual-level preference parameters, as a latent construct which is integrated out of the likelihood function, typically by means of simulation methods (Train, 2009). The mixing density $f(\beta_n|\theta)$ takes a central role in these simulation methods as it describes the distribution of preferences over the population of interest. Here, $\theta$ describes the set of hyper-parameters characterising the mixing density. The primary interest in classical estimation methods is in estimating the hyper-parameters. Bayesian estimation is also interested in these specific parameters, but simultaneously estimates $\beta_n$ using the principle of data augmentation (Tanner and Wong, 1987). In what follows, we will show that applying the notion of data augmentation enables extending our results for MNL to MMNL.

In this section, we focus on Guevara and Ben-Akiva (2013a)'s consistency proof. The Keane and Wasi (2012) proof for MMNL can be found in Appendix B. Unlike Keane and Wasi (2012), Guevara and Ben-Akiva (2013a) do set out the necessary conditions for consistency of parameter estimates in the MMNL model. For ease of exposition, we assume a single choice per respondent, but the result extends to the case of the panel MMNL model. Let us define:

$$\pi(i|D_n) = \frac{\pi(D_n, i)}{\pi(D_n)} = \frac{\int_{\beta_n} \pi(D_n|\beta_n) \cdot P(i|\beta_n, D_n) f(\beta_n|\theta) \, d\beta_n}{\pi(D_n)} = \int_{\beta_n} W_n P(i|\beta_n, D_n) f(\beta_n|\theta) \, d\beta_n \quad (30)$$

where $P(i|\beta_n, D_n)$ is the corrected choice probability as defined in (6). Important here is the distinction between $\pi(D_n)$ and $\pi(D_n|\beta_n)$:

$$\pi(D_n) = \int_{\beta_n} \pi(D_n|\beta_n) f(\beta_n|\theta) \, d\beta_n = \sum_{j \in D_n} \pi(D_n|j) \int_{\beta_n} P(j|\beta_n, C_n) f(\beta_n|\theta) \, d\beta_n = \sum_{j \in D_n} \pi(D_n|j) P(j|\theta, C_n) \quad (31)$$

such that in the numerator of $W_n$ the individual-level parameters $\beta_n$ are observed, but they are integrated out in the denominator:

$$W_n = \frac{\pi(D_n|\beta_n)}{\pi(D_n)} = \frac{\sum_{j \in D_n} \pi(D_n|j) P(j|\beta_n, C_n)}{\sum_{j \in D_n} \pi(D_n|j) P(j|\theta, C_n)} \quad (32)$$

Under uniform sampling, $\pi(D_n|j) = c \ \forall j \in D_n$, such that $\pi(D_n|\beta_n) = c \cdot \sum_{j \in D_n} P(j|\beta_n, C_n)$ and $\pi(D_n) = c \cdot \sum_{j \in D_n} P(j|\theta, C_n)$, and $\pi(D_n|\beta_n) \neq \pi(D_n)$. Consequently, $W_n$ remains present in $\pi(i|D_n)$ even under uniform sampling. Now, again taking expectations over the chosen alternatives and the sampled choice sets, we can define the maximum expected log-likelihood at the true set of hyper-parameters $\theta^*$ by:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} \pi(D_n) \sum_{k \in C_n} \pi(k|D_n, \theta^*) \cdot \ln(\pi(k|D_n, \theta)) \quad (33)$$

Thus, Guevara and Ben-Akiva (2013a) show that consistent estimation of $\theta$ in the MMNL model under sampling of alternatives requires, in addition to the standard McFadden correction factor, the incorporation of $W_n$ in the quasi-likelihood function. This is significant progress over Keane and Wasi (2012), but the resulting quasi-likelihood function is not feasible because the term $W_n$ still depends on the full choice set in both its numerator and denominator (see Equation 32).
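The role of $W_n$ can be made explicit numerically. The sketch below (ours; the normal mixing density and all values are illustrative assumptions) evaluates Equation (32) under uniform sampling, approximating $P(j|\theta, C_n)$ in the denominator by Monte Carlo, and shows that $W_n$ generally differs from 1.

import numpy as np

rng = np.random.default_rng(3)
J, k, mu, sigma = 20, 5, 1.0, 0.8            # illustrative values
X = rng.normal(size=J)
D = np.arange(k)                             # a sampled set containing the chosen alt.

def P_full(beta):                            # MNL probabilities over the full set C_n
    u = np.exp(beta * X)
    return u / u.sum()

# Denominator of Eq. (32): P(j | theta, C_n), with beta_n integrated out (Monte Carlo).
draws = mu + sigma * rng.normal(size=5000)
P_theta = np.mean([P_full(b) for b in draws], axis=0)

# Numerator of Eq. (32): P(j | beta_n, C_n) at one augmented draw of beta_n.
beta_n = mu + sigma * rng.normal()
W_n = P_full(beta_n)[D].sum() / P_theta[D].sum()   # uniform: pi(D_n|j) = c cancels

print("W_n =", round(float(W_n), 3))         # generally != 1, even under uniform sampling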
To circumvent the dependency of $W_n$ on $C_n$, Guevara and Ben-Akiva (2013a) develop a feasible estimator. The estimator approximates $W_n$ using only elements in the sampled choice set $D_n$ whilst retaining the consistency property. The necessary expansion factors can be determined by using i) population shares, ii) observed choices by the individual, and iii) the naive method, which sets $W_n$ to 1 and thus reduces to applying only McFadden's correction factor. Since the naive approximation of $W_n$ provides consistent estimates and good results in Monte Carlo simulations and real-world examples, it is the approach recommended by Guevara and Ben-Akiva (2013a).

Bayesian estimation using data augmentation is interested in the joint posterior $p(\beta, \theta|Y, C)$.[5] The model requires a prior distribution $p(\theta)$ for the hyper-parameters of the mixing density.

[Footnote 5: Data augmentation is the traditional approach to Bayesian estimation of MMNL models, also known as hierarchical Bayes (e.g. Train (2009)).]
Again, it is assumed that the prior is identical between the true and sampled model. The mixing density $f(\beta_n|\theta)$ is also assumed to be of identical parametric form between the 'true' and 'sampled' model and effectively acts as a 'prior' on the individual-level parameters $\beta_n$. The only difference introduced by sampling of alternatives between the 'true' and 'sampled' model therefore arises in the conditional choice probability $p(Y_n|\beta_n, C_n)$. The posteriors for both models are given by:

$$p(\beta, \theta|Y, C) = \frac{p(\theta) \prod_{n=1}^{N} f(\beta_n|\theta) p(Y_n|\beta_n, C_n)}{p(Y|C)} \quad (34)$$

$$p(\beta, \theta|Y, D) = \frac{p(\theta) \prod_{n=1}^{N} f(\beta_n|\theta) p(Y_n|\beta_n, D_n)}{p(Y|D)} \quad (35)$$

The Kullback-Leibler divergence for the MMNL model under data augmentation is defined by Equation (36). Note that since $\beta_n$ is no longer latent, the log of the likelihood ratio is independent of $\theta$.

$$D_{KL} = \mathbb{E}_{\beta, \theta}\left[\ln\left(\frac{p(Y|\beta, C)}{p(Y|\beta, D)}\right)\right] + \ln\left(\frac{p(Y|D)}{p(Y|C)}\right) \quad (36)$$

Similar to Section 3, we only need to focus on the first part of $D_{KL}$, because that is the only part that provides information on the loss of information with respect to $\beta$ and $\theta$. The structure of Equation (36) is identical to our proof used for the Bayesian MNL model:

$$\mathbb{E}_{\beta, \theta, Y, D}\left[\ln\left(\frac{p(Y|\beta, C)}{p(Y|\beta, D)}\right)\right] = \int_\theta p(\theta) \prod_{n=1}^{N} \int_{\beta_n} f(\beta_n|\theta) \sum_{D_n} R_n \sum_{Y_n} P(Y_n|\beta_n, D_n) \ln\left(\frac{P(Y_n|\beta_n, C_n)}{P(Y_n|\beta_n, D_n)}\right) d\beta_n \, d\theta \quad (37)$$

For each individual, the structure reduces to an MNL model for which, following Equations (26) to (28), McFadden's correction factor under uniform sampling provides the minimum loss of information on $\beta_n$ and thus on $\theta$. This completes the proof:

$$\mathbb{E}_{\beta, \theta, Y, D}\left[\ln\left(\frac{p(Y|\beta, C)}{p(Y|\beta, D)^\dagger}\right)\right] = \int_\theta p(\theta) \prod_{n=1}^{N} \int_{\beta_n} f(\beta_n|\theta) \sum_{D_n} R_n^\dagger \ln\left(R_n^\dagger\right) d\beta_n \, d\theta \quad (38)$$

This result is particularly encouraging as it enables researchers to combine the computational benefits of Bayesian estimation through data augmentation with the computational benefits of sampling of alternatives. Accordingly, significant reductions in estimation time can be foreseen for mixed logit models with large choice sets.
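To indicate how the result can be exploited in practice, the following sketch (ours, not the paper's implementation; the priors, proposal scale and update steps are illustrative assumptions) runs a Metropolis-within-Gibbs sampler for a scalar random coefficient in which every evaluation of the augmented $\beta_n$ conditional only touches the sampled set $D_n$; under uniform sampling neither the McFadden correction nor $W_n$ enters.

import numpy as np

rng = np.random.default_rng(4)
N, J, k = 300, 40, 8
mu_true, sigma_true = 1.0, 0.5

X = rng.normal(size=(N, J))                        # one attribute per alternative
beta_true = mu_true + sigma_true * rng.normal(size=N)
y = (beta_true[:, None] * X + rng.gumbel(size=(N, J))).argmax(axis=1)

# Uniform conditioning: chosen alternative (stored first) plus k-1 uniform draws.
D = np.array([np.concatenate(([y[n]],
      rng.choice(np.delete(np.arange(J), y[n]), size=k - 1, replace=False)))
      for n in range(N)])

def loglik_n(n, b):                                # sampled MNL log-likelihood for n
    v = b * X[n, D[n]]
    return v[0] - np.log(np.exp(v).sum())

mu, sigma, beta = 0.0, 1.0, np.zeros(N)
trace = []
for it in range(2000):
    # 1) Random-walk Metropolis step for each augmented beta_n | mu, sigma, y_n;
    #    only the sampled set D_n is ever evaluated.
    prop = beta + 0.3 * rng.normal(size=N)
    for n in range(N):
        logr = (loglik_n(n, prop[n]) - loglik_n(n, beta[n])
                - 0.5 * ((prop[n] - mu) ** 2 - (beta[n] - mu) ** 2) / sigma ** 2)
        if np.log(rng.uniform()) < logr:
            beta[n] = prop[n]
    # 2) Conjugate draws for mu, sigma | beta under the improper prior
    #    p(mu, sigma^2) proportional to 1/sigma^2 (an illustrative choice).
    mu = beta.mean() + sigma / np.sqrt(N) * rng.normal()
    sigma = np.sqrt(((beta - mu) ** 2).sum() / rng.chisquare(N))
    if it >= 500:
        trace.append((mu, sigma))

mu_hat, sigma_hat = np.mean(trace, axis=0)
print(round(float(mu_hat), 2), round(float(sigma_hat), 2))
# mu_hat should be near 1.0; sigma is only weakly identified by one choice per person.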
5. Conclusions
In this paper we apply a Bayesian perspective to sampling of alternatives for multinomial logit (MNL) and mixed multinomial logit (MMNL) models. In the context of the MNL model, we find three important results. Firstly, McFadden's correction factor under uniform sampling (McFadden, 1978) can be transferred to the Bayesian context because it minimises the expected loss in information on the parameters of interest between the sampled and the true model. Note that under uniform sampling all alternatives, except for the chosen alternative which has to be included in the sampled choice set, have the same probability of being sampled. Secondly, the ability to implement sampling of alternatives in Bayesian MNL models implies that uniform sampling has good small sample properties. Namely, our proofs apply for any given sample size. Thirdly, because of the desirable small and large sample properties, uniform sampling should be considered the recommended sampling protocol irrespective of the estimation framework selected.

Our results extend to Bayesian MMNL models using data augmentation, also known as Hierarchical Bayes (HB). Instead of integrating out the latent individual-level parameters, HB models treat these as observed quantities, where the mixing density acts as a prior. Because, conditional on the individual-level parameters, choice probabilities take the form of the MNL model, a similar set of results can be obtained. Uniform sampling under data augmentation minimises the loss in information on the individual-level parameters and accordingly on the hyper-parameters of the mixing density. This result holds for any number of choices made by an individual and any sample size. Again, uniform sampling is the recommended sampling protocol in MMNL due to its small sample properties.
Our contributions regarding data augmentation and small sample performance do not explain the good performance of the feasible naive estimator (i.e., $W_n = 1$) in Guevara and Ben-Akiva (2013a) and related papers in a classical setting. Its feasibility entirely arises as a result of assumptions related to asymptotic normality and consistency.[6] Our small sample results for the MNL model do, however, provide the necessary information to explain the good performance of small sampled choice sets under the EM algorithm in the context of the latent class models of Von Haefen and Domanski (2018). The EM algorithm for estimating latent class models reduces to sequential estimation of weighted MNL models (Train, 2009). Moreover, Section 2 has shown that for any sample size and size of the sampled choice set, McFadden's correction factor under uniform sampling minimises the loss in information on the parameters of interest, in this case the class-specific marginal utilities. Therefore, given that Von Haefen and Domanski (2018) apply uniform sampling in their Monte Carlo simulations and empirical work, good small sample performance is as expected. Our proof can easily be extended to the context of the latent class model, where class memberships, instead of individual-level parameters, would be augmented. Conditional on class membership, within-class choice probabilities again reduce to MNL specifications, allowing for a similar exposition on minimum information loss.

Our contributions do not (yet) extend to the context of Multivariate Extreme Value (MEV) models (Guevara and Ben-Akiva, 2013b) and Random Regret Minimisation models (Guevara et al., 2015). The additional correction factors required in these model specifications are a direct result of no longer satisfying the axiom of Independence of Irrelevant Alternatives (IIA). They cannot be circumvented by the use of data augmentation since there are no latent variables driving the need for these correction factors. Indeed, one could approximate MEV model structures with error components models. Alternatively, the performance of the referred correction factors can be studied in future research.

[Footnote 6: When the number of choices per individual grows, consistent estimates of the individual-level parameters can be obtained following Section 2. Additionally, when the number of individuals grows, consistent estimates of the individual-level parameters and hyper-parameters will be obtained. The essential difference, explaining the need for the additional expansion factors in Guevara and Ben-Akiva (2013a), is that Guevara and Ben-Akiva (2013a)'s proof only requires the number of individuals to grow for the purposes of consistency.]

References

Azaiez, I. (2010). Sampling of alternatives for logit mixture models. Master's thesis, EPFL Lausanne.

Bansal, P., Krueger, R., Bierlaire, M., Daziano, R. A., and Rashidi, T. H. (2020). Bayesian estimation of mixed multinomial logit models: Advances and simulation-based evaluations. Transportation Research Part B: Methodological, 131:124-142.
Ben-Akiva, M. and Lerman, S. (1985). Discrete Choice Analysis. MIT Press.

Daly, A. (1987). Estimating 'tree' logit models. Transportation Research Part B: Methodological, 21(4):251-267.

Daly, A., Hess, S., and Dekker, T. (2014). Practical solutions for sampling alternatives in large scale models. Transportation Research Record, 2429(1):148-156.

Guevara, C. A. and Ben-Akiva, M. E. (2013a). Sampling of alternatives in logit mixture models. Transportation Research Part B: Methodological, 58:185-198.

Guevara, C. A. and Ben-Akiva, M. E. (2013b). Sampling of alternatives in multivariate extreme value (MEV) models. Transportation Research Part B: Methodological, 48:31-52.

Guevara, C. A., Chorus, C. G., and Ben-Akiva, M. E. (2015). Sampling of alternatives in random regret minimization models. Transportation Science, 0(0):1-16.

Keane, M. P. and Wasi, N. (2012). Estimation of discrete choice models with many alternatives using random subsets of the full choice set: With an application to demand for frozen pizza.

Kullback, S. and Leibler, R. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.

Lemp, J. D. and Kockelman, K. M. (2012). Strategic sampling for large choice sets in estimation and application. Transportation Research Part A: Policy and Practice, 46(3):602-613.

McFadden, D. (1978). Modeling the choice of residential location. Transportation Research Record, (673).

Newey, K. and McFadden, D. (1994). Large sample estimation and hypothesis. Handbook of Econometrics, IV, edited by R. F. Engle and D. L. McFadden, pages 2112-2245.

Revelt, D. and Train, K. (1998). Mixed logit with repeated choices: Households' choices of appliance efficiency level. The Review of Economics and Statistics, 80(4):647-657.

Sinha, P., Caulkins, M. L., and Cropper, M. L. (2018). Household location decisions and the value of climate amenities. Journal of Environmental Economics and Management, 92:608-637.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528-540.

Train, K. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.

Von Haefen, R. H. and Domanski, A. (2018). Estimation and welfare analysis from mixed logit models with large choice sets. Journal of Environmental Economics and Management, 90:101-118.

A. Equivalence of Guevara and Ben-Akiva (2013a) to Keane and Wasi (2012) for MNL
Whereas Keane and Wasi (2012) relied on $\pi(D_n|i) \cdot P(i|\beta^*, C_n)$, Guevara and Ben-Akiva (2013a) made use of $P(i|\beta^*, D_n) \cdot \pi(D_n)$ to represent the joint distribution $\pi(D_n, i)$. It is worth noting that the unconditional density $\pi(D_n)$ still depends on the full choice set due to the involvement of $P(k|\beta^*, C_n)$, as illustrated below:

$$\pi(D_n) = \sum_{k \in D_n} \pi(D_n|k) \cdot P(k|\beta^*, C_n) \quad (39)$$

To show the equivalence between the results of Keane and Wasi (2012) and Guevara and Ben-Akiva (2013a), we write the expected likelihood of MNL using Guevara and Ben-Akiva (2013a)'s specification:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} P(k|\beta^*, D_n) \cdot \pi(D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (40)$$

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} \pi(D_n) \sum_{k \in C_n} P(k|\beta^*, D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (41)$$

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} \left(\sum_{j \in D_n} P(j|\beta^*, C_n) \pi(D_n|j)\right) \sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (42)$$

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} \left(\frac{\sum_{j \in D_n} \exp(V_{jn} + \ln(\pi(D_n|j)))}{\sum_{j \in C_n} \exp(V_{jn})}\right) \sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (43)$$

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{D_n \in C_n} R_n \sum_{k \in D_n} P(k|\beta^*, D_n) \cdot \ln\left(\frac{\exp(V_{kn} + c_{kn})}{\sum_{j \in D_n} \exp(V_{jn} + c_{jn})}\right) \quad (44)$$

The equivalence between Equations (44) and (13) is evident, and therefore we end up with the optimal value $\beta = \beta^*$ and the optimal correction factor being McFadden's correction factor.
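The equivalence of the two representations of $\pi(D_n, k)$ can also be verified numerically; the sketch below (ours, reusing the illustrative non-uniform protocol from the Section 2.2 sketch) confirms that the Equation (9) and Equation (10) weightings produce identical expectations.

import numpy as np
from itertools import combinations

X = np.array([0.5, -1.0, 1.5, 0.2])          # as in the Section 2.2 sketch
beta_star, w = 1.0, np.array([0.1, 0.2, 0.3, 0.4])
u = np.exp(beta_star * X)
P_C = u / u.sum()                            # P(k | beta*, C_n)

def pi_D_given_k(D, k):
    other = [j for j in D if j != k][0]
    return w[other] / (w.sum() - w[k])

lhs = rhs = 0.0
for D in combinations(range(4), 2):
    pi_D = sum(pi_D_given_k(D, j) * P_C[j] for j in D)          # Eq. (39)
    for k in D:
        P_D = (pi_D_given_k(D, k) * u[k]
               / sum(pi_D_given_k(D, j) * u[j] for j in D))     # Eq. (6)
        term = np.log(P_D)                   # any function of (D_n, k) would do
        lhs += P_C[k] * pi_D_given_k(D, k) * term               # Eq. (9) weighting
        rhs += P_D * pi_D * term                                # Eq. (10) weighting

print(lhs, rhs)                              # identical up to floating point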
B. MMNL model for Keane and Wasi (2012)

Following the same logic, Keane and Wasi (2012) move to the case of the mixed multinomial logit (MMNL) model. Assuming a single choice per respondent, the true expected probability can be defined by:

$$\mathbb{E}(P_{in})_{\beta_n} = \int_{\beta_n} P_{in} f(\beta_n) \, d\beta_n \quad (45)$$

where expectations are taken over $\beta_n$. Similarly, the expectation of the uncorrected choice probability can be defined by:

$$\mathbb{E}(P_{in}^+)_{\beta_n} = \int_{\beta_n} P_{in}^+ f(\beta_n) \, d\beta_n \quad (46)$$

We can then formulate the same expected log-likelihood over the chosen alternatives and sampled choice sets:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} \mathbb{E}(P_{kn})_{\beta_n} \cdot \pi(D_n|k) \cdot \ln\left(\mathbb{E}(P_{kn}^+)_{\beta_n}\right) \quad (47)$$

Keane and Wasi (2012) point out that the trick applied with $R_n$ under the MNL model, i.e. multiplying by $\frac{\sum_{j \in D_n} \exp(V_{jn}) \cdot \pi(D_n|j)}{\sum_{j \in D_n} \exp(V_{jn}) \cdot \pi(D_n|j)}$, does not apply in the context of the MMNL model since $\exp(V_{jn})$ depends on $\beta_n$. The best that can be achieved is the following:

$$\mathbb{E}(LL^+) = \sum_{n=1}^{N} \sum_{k \in C_n} \sum_{D_n \in C_n} \int_{\beta_n} R_n \cdot P_{kn} \cdot f(\beta_n) \, d\beta_n \cdot \ln\left(\mathbb{E}(P_{kn}^+)_{\beta_n}\right) \quad (48)$$

Since $R_n$ cannot be taken outside of the integral, the necessary symmetry does not hold in Equation (48) for the entropy argument to apply and identify McFadden's correction factor as the optimal correction factor in the context of the MMNL model. Keane and Wasi (2012) do point out that as the size of the sampled choice set increases and approaches the size of $C_n$, $R_n$ approaches 1 and the degree of error (under uniform sampling) vanishes. In practice, they find that the finite sample bias is negligible even for modest sized subsets $D_n$.