A Neural-embedded Choice Model: TasteNet-MNL, Modeling Taste Heterogeneity with Flexibility and Interpretability
Yafei Han a, Christopher Zegras b, Francisco Camara Pereira c, Moshe Ben-Akiva a
a MIT, Civil and Environmental Engineering; b MIT, Department of Urban Studies and Planning; c Technical University of Denmark, School of Management
Corresponding author. Email: [email protected]
ABSTRACT
Discrete choice models (DCMs) and neural networks (NNs) can complement each other. We propose a neural network embedded choice model, TasteNet-MNL, to improve the flexibility in modeling taste heterogeneity while keeping model interpretability. The hybrid model consists of a TasteNet module, a feed-forward neural network that learns taste parameters as flexible functions of individual characteristics, and a choice module, a multinomial logit model (MNL) with a manually specified utility. TasteNet and MNL are fully integrated and jointly estimated. By embedding a neural network into a DCM, we exploit a neural network's function approximation capacity to reduce specification bias. Through special structure and parameter constraints, we incorporate expert knowledge to regularize the neural network and maintain interpretability.

On synthetic data, we show that TasteNet-MNL can recover the underlying nonlinear utility function and provide predictions and interpretations as accurate as the true model's, while examples of logit and random coefficient logit models with misspecified utility functions result in large parameter bias and low predictability. In a case study of Swissmetro mode choice, TasteNet-MNL outperforms the benchmark MNLs in predictability, and discovers a wider spectrum of taste variations within the population and higher average values of time. This study takes an initial step towards a framework that combines theory-based and data-driven approaches to discrete choice modeling.
KEYWORDS
Neural Network; Interpretability; Flexible Utility; Taste Heterogeneity; Discrete Choice Model
1. Introduction
Discrete choice models (DCMs) provide a powerful econometric framework to understand and predict choice behaviors. The majority of DCMs are Random Utility Models (RUM), derived under the utility-maximization decision rule (McFadden, 1973). Rooted in theory, DCMs have the advantage of interpretability: they can explain why and how individuals choose among a set of alternatives, and provide credible answers to "what-if" scenario questions. DCMs have been the predominant approach for consumer choice analysis and are widely applied in areas such as transportation planning and marketing.

A DCM requires the model specification to be known a priori. The utility function is a primary component of the model specification. The systematic part of the utility describes how a choice-maker values each attribute of an alternative ("taste"), and how tastes vary systematically across choice-makers ("taste heterogeneity"). When the underlying relationships are nonlinear, coming up with an accurate utility specification can be difficult. Misspecified utility functions lead to biased parameter estimates, lower predictability, and wrong interpretations (Bentz and Merunka, 2000, Torres et al., 2011, van der Pol et al., 2014).

Although nonlinear functions (e.g. higher-order polynomials, semi-log transforms, piecewise linear forms) can be employed, they also require correct assumptions about the functional form. Statistical tests are routinely used to select models. However, it is difficult to test all possible specifications with a fair number of covariates, and the true specification may not be covered. Model uncertainty has been a persistent concern for model developers and users, which motivates research on data-driven approaches to learning the utility specification.

Machine learning (ML), often viewed as a collection of data-driven methods, can exploit the rich information in large raw datasets. ML requires fewer a priori theories. Its primary focus is prediction accuracy rather than interpretability. Neural networks, a popular class of ML algorithms, have recently achieved remarkable breakthroughs in various domains, such as computer vision, natural language processing, and speech recognition. On complicated tasks, deep neural networks that require no domain knowledge surpass traditional ML methods that rely heavily on feature engineering.

The success of neural networks is attributed to their capacity to learn highly complex functions, enabled by large datasets, advanced optimization techniques, and increased computational capacity. Given our limited knowledge of the true utility function, could we utilize a neural network to unravel the complexity in the data? Can we bring neural networks to DCMs in ways that enhance the flexibility of model specification, reduce potential bias, and improve predictability?

Current neural network applications to discrete choice problems focus on prediction, with some exceptions (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000, Hruschka et al., 2002, Sifringer et al., 2018, van Cranenburgh and Alwosheel, 2019). A majority of the studies find that neural networks can outperform DCMs in various contexts regarding prediction accuracy. A major criticism of the neural network approach is its lack of interpretability.

Interpretability is crucial for high-stakes decisions in transportation planning, such as infrastructure investment and congestion pricing. Planners rely on models to give reliable answers to "what-if" questions at the disaggregate level; for example, how will a specific market segment respond to a toll increase or a new subway line? To support policy decisions, prediction accuracy alone is not enough: a model must represent the true relationships between explanatory variables and choice outcomes.

Although a neural network can provide utility interpretations and economic indicators (e.g. elasticity, willingness-to-pay) equivalent to a DCM's, its estimation results suffer from large variances across runs, and a particular model run can generate unrealistic behavioral indicators (Wang and Zhao, 2018).
How can we make neural networks learn interpretable results that can support planning decisions?

We propose to integrate neural networks and DCMs to benefit from both: the flexibility of neural networks and the interpretability of DCMs. We name such models neural embedded choice models. Extending the work by Sifringer et al. (2018), we propose a neural network embedded multinomial logit (MNL) model, TasteNet-MNL. Specifically, we employ a neural network (TasteNet) to model tastes as flexible functions of individual characteristics. The taste parameters predicted by TasteNet are embedded in a parametric MNL to compute choice probabilities and the likelihood. Parameters of the two parts are jointly estimated by maximum likelihood.

By embedding a neural network in an MNL, we enhance the flexibility of the model to represent systematic taste heterogeneity, which can reduce bias from manual specification. By bringing a parametric MNL to a neural network, we incorporate expert knowledge and constrain the neural network to generate outputs with designated meanings. Using both synthetic and real datasets, we demonstrate the effectiveness of TasteNet-MNL. The source code is made publicly available at https://github.com/YafeiHan-MIT/TasteNet-MNL.

The rest of this paper is organized as follows. Section 2 reviews previous neural network applications to discrete choice and the challenges. Section 3 describes our model structure and estimation method. Section 4 reports experiments and results on synthetic data. Section 5 applies TasteNet-MNL to the Swissmetro dataset and compares it with MNL benchmarks. Lastly, we summarize the key contributions and discuss limitations and future work.
2. Literature Review
Neural Networks for Choice Prediction
Empirical studies have compared DCMs with neural networks (NNs) on various choice problems, such as travel mode choice (De Carvalho et al., 1998, Hensher and Ton, 2000, Cantarella and de Luca, 2005, Nam et al., 2017, Lee et al., 2018, Zhao et al., 2018), vehicle ownership choice (Mohammadian and Miller, 2002), and brand choice (Agrawal and Schorling, 1996, Bentz and Merunka, 2000, Hruschka et al., 2002, 2004). Most of the early applications choose a feed-forward network (FFN) with one or two hidden layers, because more layers cause over-fitting and computational challenges. FFNs are compared to DCM structures, including logit (Agrawal and Schorling, 1996, West et al., 1997, Omrani, 2015, Lee et al., 2018), nested logit (Hensher and Ton, 2000, Mohammadian and Miller, 2002, Cantarella and de Luca, 2005), cross-nested logit (Cantarella and de Luca, 2005), and mixed logit (Zhao et al., 2018). Shallow FFNs achieve higher predictability than DCMs in most cases (Agrawal and Schorling, 1996, West et al., 1997, De Carvalho et al., 1998, Mohammadian and Miller, 2002, Cantarella and de Luca, 2005, Omrani, 2015, Lee et al., 2018).

Inspired by the success of deep learning in other domains, recent studies attempt deep neural networks (DNNs) for discrete choice (Nam et al., 2017, Wang and Zhao, 2019). The results are somewhat disappointing. Nam et al. (2017) apply several deep learning techniques (drop-out, initialization, stochastic gradient descent) to train an FFN with 4 hidden layers (they call it a "DNN" because it applies deep learning techniques, not because the network is deep). Surprisingly, their DNN gives almost the same predicted log-likelihood as a nested logit and a cross-nested logit model. Wang and Zhao (2019) compare a DNN with a nested logit model for mode choice using a stated-preference survey. Despite an extensive hyper-parameter search, the best DNN does not match the nested logit model, and the conventional FFN performs the worst; however, the optimal hyper-parameters, such as the hidden layer size, may not be found without a hyper-parameter search, so it remains inconclusive which type of model predicts better. The authors highlight the importance of finding the right hyper-parameters for a DNN to predict as well as, if not better than, DCMs.

So far, DNNs have not worked as effectively for discrete choice as expected, perhaps due to small data, over-fitting, or the difficulty of finding the right set of hyper-parameters. As we show later, with special structure and parameter constraints, model predictability can be improved even with only one hidden layer.
Learning Nonlinear Utility with Neural Networks
While most studies focus on comparing prediction performance with only a brief explanation of why, a few dig into how and under what circumstances (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000). These studies seek to understand, from a behavioral perspective, whether a neural network can discover the true behaviors, which can be different from or more complex than our assumptions, and if so, how to derive such knowledge.

A series of studies conduct Monte-Carlo experiments to show that a neural network can capture nonlinearity in utility functions (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000). Nonlinearity may reflect saturation or threshold effects of attributes on utility, or non-compensatory decision rules. For example, West et al. (1997) find that NNs consistently outperform logit and discriminant analysis when predicting the outcome of a non-compensatory choice rule. Bentz and Merunka (2000) show the analogy between an NN and MNL, with a hidden-layer NN being a more general version of MNL. With synthetic data and an empirical study, they show that an NN can detect interaction and threshold effects in utility, and can therefore be used as a diagnostic tool to improve MNL utility specification. This sequential approach requires manual analysis of NN results to identify the nonlinear effects, and thus applies only to simple problems. Nevertheless, their idea inspires a recent study by Sifringer et al. (2018) to integrate the two.

Hruschka et al. (2002) compare an NN with an MNL and a Latent Class Logit (LCL) model in an empirical study of brand choice. They find the NN model can identify interaction effects, threshold effects, saturation effects and other nonlinear forms (like an inverse S-shape) of attributes on brand utility. The NN also implies elasticities different from the MNL or LCL; the MNL sometimes gives wrong signs for elasticity due to its simplistic linear form. The NN predicts better on hold-out data than the MNL or LCL. A follow-up study by Hruschka et al. (2004) compares an NN with two other MNLs with flexible systematic utility, and draws similar conclusions.

To summarize, these studies show that an NN can outperform an MNL when the nonlinearity in attributes is neglected or mistaken. However, these studies have not addressed nonlinearity in taste, nor compared NNs with more advanced DCMs, e.g. random coefficient logit. They consider the NN either as an alternative to MNL, or as a diagnostic tool to improve the utility specification of an MNL, which works only for simple problems. Our study complements previous work in that we focus on modeling nonlinearity in taste with neural networks. Also, beyond an either-or or a sequential approach, our integrated model achieves both flexibility and partial interpretability.
The Interpretability Challenge
Being able to predict well and capture nonlinear utility is not sufficient for policy-scenario analysis, which is a great advantage of behavioral models based on theory and prior knowledge. A major criticism of neural networks is their lack of interpretability. As this popular term is not clearly defined in the literature despite its wide usage, we summarize the common understandings of interpretability in the following aspects.

The first is parameter-level interpretability. Clearly, individual weights of a neural network do not carry specific meanings (Agrawal and Schorling, 1996, Shmueli et al., 1996). In contrast, the parameters of logit models can be directly interpreted as the marginal effect of an attribute (or "taste").

Another, stricter definition has to do with how a model is specified: based on external knowledge, or learned from data. Statistical choice models clearly map the relationship between input and output with a theory behind it. In neural networks, the relationships are learned from data by arbitrary functions. Even if an NN mimics the true functions, this by itself does not provide a theory of why inputs lead to choice outcomes. By either of the first two criteria, an NN model is not interpretable. However, these two definitions are not meaningful measures of model usability.

The third view of interpretability, by Sifringer et al. (2018), is "the ability of the model to recover the true parameters' values of the variables that enter the interpretable part of the utility functions". This definition focuses on obtaining unbiased model estimates for the interpretable part. However, the unknown part of the utility, modeled by a black box, can still give uninterpretable answers to "what-if" questions, and the division between the interpretable and uninterpretable parts of the utility function is subjective.

Perhaps the most popular view of interpretability is the model's ability to derive behavioral indicators, such as elasticity, willingness-to-pay (WTP), and marginal rate of substitution (MRS). Studies that claim ML or NN model interpretability are mostly based on this criterion (Wang and Zhao, 2018, Sifringer et al., 2018, Zhao et al., 2018). Extracting behavioral indicators from neural networks is simple. Bentz and Merunka (2000) show the similarity between an MNL and a feed-forward neural network with no hidden layer and Softmax activation: the systematic utilities correspond to the output values before the Softmax activation. We can plot utility versus inputs to obtain marginal effects (Bentz and Merunka, 2000, Hruschka et al., 2004). Choice elasticities and other economic indicators can be computed analytically (Hruschka et al., 2002, 2004) or numerically by simulation (Wang and Zhao, 2018).

We consider this definition insufficient, because a model that gives unreasonable behavioral indicators is not interpretable. Wang and Zhao (2018) show that an individual NN estimation can generate unreasonable economic indicators: a choice probability may fail to decrease monotonically as cost increases and can be highly sensitive to a particular model run; the derivatives of choice probabilities with respect to cost and time can be positive; and values of time can be negative, zero, arbitrarily large, or infinite. They conclude that neural-based choice models generate reasonable economic information only at the aggregate level, either through model ensembles or population averages, due to the challenge of irregular probability fields and large estimation errors.
However, scenario analysis and policy decisions depend on answers to "what-if" questions at the disaggregated level (e.g. a particular market segment).

The definition of interpretability is to some extent subjective and ultimately a philosophical question. We propose a definition close to the popular view but with extra conditions:
A model is interpretable if, at a disaggregated level, it is able to give credible answers to "what will happen if" and "but for" questions.
Compared to the popular definition, we emphasize the credibility of the economic indicators and interpretability at the disaggregated (both model and choice-maker) level. By "credible", we mean the answer should conform with a set of prior knowledge, for example, non-positive choice elasticity with respect to cost and non-negative values of time. Prior knowledge, however, can change over time and vary across application contexts.

A fundamental challenge for a neural network to be interpretable is that many networks may exist that fit the data equally well, but not all of them support reasonable behavioral insights. We propose imposing special structure and constraints that reflect expert knowledge on the neural network. We show that the proposed model obtains reasonable behavioral indicators at the disaggregated level, and that predictability does not necessarily come at the cost of interpretability.
Direct Precedents: Integrating Neural Networks with Discrete Choice Models
Recent studies attempt to create a synergy between statistical DCMs and NNs through a hybrid structure. The idea of a hybrid approach dates back to Bentz and Merunka (2000), who propose using an NN as a diagnostic tool to detect nonlinear effects. The main drawbacks of this approach are its sequential nature and its ineffectiveness for large problems.

The Learning-MNL (L-MNL) proposed by Sifringer et al. (2018) is, as far as we know, the first example of a neural embedded choice model. In an L-MNL, the systematic utility is divided into an "interpretable" part, which is manually specified, and a "representation" part, a nonlinear representation learned by a neural network. The unknown part captures the effects of the unused features. This model structure is inspiring but has some limitations. First, the variables in the interpretable part and the representation part are mutually exclusive sets. The authors' motive is to make the interpretable utility obtain stable estimates, as the NN can overpower the logit model and cause unstable estimates. Second, this model assumes that variables in the representation part have no interactions with those in the interpretable part, since the two parts of the utility are added with no overlapping variables. Essentially, the gain of an L-MNL comes from a flexible representation of the alternative specific constants (ASCs): L-MNL models the ASCs by a neural network as a flexible function of all the unused features. This assumption is too restrictive, since the unused features can affect not only the ASCs but also other taste parameters in the interpretable utility, such as the time coefficient. Similarly, features in the interpretable part may have unspecified nonlinear effects and can affect the ASCs. Third, the selection of which covariates enter which part of the utility function is arbitrary.

Inspired by L-MNL, we propose a more general framework to model taste heterogeneity. The proposed TasteNet-MNL differs from L-MNL and a traditional FFN in three aspects. First, we allow all or a subset of the taste parameters to be modeled by an NN as flexible functions, not just the ASCs. This enhances the flexibility to model taste heterogeneity. Second, we impose constraints on the taste parameters predicted by the neural network, as a strategy to regularize the network and obtain interpretable results. Third, we model taste parameters instead of utilities by an NN, different from a direct application of an FFN. The key idea is to assign the more complex or less known task to an NN, and keep the well-known part parametric.

3. Model Structure
For a given choice task, suppose each person n makes a single choice from a choice set C_n. (This setup can be generalized to a repeated-choice scenario; we consider a single choice to avoid cluttering the notation.) For each person n, the observed data include individual characteristics z_n, attributes x_in of each alternative i, and the chosen alternative y_n.

V_in denotes the systematic utility of alternative i for choice-maker n. If tastes are homogeneous, V_in is a function of attributes only; a simple example is the linear function in Eqn. 1. Systematic taste heterogeneity is usually specified as a group of interactions between attributes and characteristics (Eqn. 2), with the interaction effects specified according to prior assumptions and verified by statistical tests.

V_{in} = \beta_i + \sum_{k=1}^{K_i} \beta_{ki} x_{kin}  (1)

V_{in} = \beta_i + \sum_{k=1}^{K_i} \beta_{ki} x_{kin} + \sum_{(p,q) \in I_i} \gamma_{pqi} x_{pin} z_{qn}  (2)

Since how taste varies across choice-makers may not be known a priori, we propose a data-driven approach to represent systematic taste heterogeneity: a neural network (TasteNet) models the taste parameters as flexible functions of the characteristics (Eqn. 3).

\beta_n^{TN} = \mathrm{TasteNet}(z_n; w)  (3)

The inputs of TasteNet are choice-maker n's characteristics z_n. The neural network weights w are unknown parameters to estimate. The outputs of the network, β_n^TN, correspond to a full set or a subset of the coefficients of an ordinary MNL utility (e.g. Eqn. 2). The meaning of each element of β_n^TN is defined by the MNL module in which TasteNet is embedded.

We can divide the systematic utility into a parametric part and a flexible part (Eqn. 4). The parametric part is a manually specified utility function; it can include interactions and nonlinear transformations. The flexible part is a sum of alternative attributes weighted by the taste coefficients predicted by TasteNet. We can model all, a subset, or none of the taste coefficients by the neural network; each taste coefficient (e.g. the time coefficient) is either learned by the neural network or manually specified.

V_{in} = \beta_i^{TN}(z_n; w)' x_{in}^{TN} + \beta_i^{MNL\prime} f(x_{in}^{MNL}, z_n)  (4)

P(y_n = i \mid x_n, z_n, w, \beta^{MNL}) = \frac{e^{\beta_i^{TN}(z_n; w)' x_{in}^{TN} + \beta_i^{MNL\prime} f(x_{in}^{MNL}, z_n)}}{\sum_{j \in C_n} e^{\beta_j^{TN}(z_n; w)' x_{jn}^{TN} + \beta_j^{MNL\prime} f(x_{jn}^{MNL}, z_n)}}  (5)

The TasteNet module maps the characteristics z_n to the taste coefficients β_n^TN. The MNL module takes in β_n^TN and the corresponding attributes, along with the parametric utility, to compute the choice probabilities (Eqn. 5) and the likelihood.

This integrated structure achieves two goals. First, the utility specification becomes more flexible in representing systematic taste heterogeneity. Second, model interpretability is partially maintained, since each output unit of the neural network carries a behavioral meaning. Some coefficients may be subject to parameter constraints according to prior knowledge. Below we provide more details on the neural network architecture, parameter constraints, and estimation procedure.
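To make the embedding concrete, the following is a minimal PyTorch sketch of Eqns. 3-5. The class and variable names are ours, not from the released code; for simplicity, every taste coefficient is produced by TasteNet, the parametric part of Eqn. 4 is omitted, and the output transform discussed below is left out.

```python
import torch
import torch.nn as nn

class TasteNetMNL(nn.Module):
    """Sketch of Eqns. 3-5: an MLP (TasteNet) maps characteristics z_n to
    taste coefficients; the MNL module turns utilities into probabilities."""

    def __init__(self, n_char, n_taste, n_hidden=10):
        super().__init__()
        # TasteNet: one hidden layer, as in Eqn. 6 below
        self.tastenet = nn.Sequential(
            nn.Linear(n_char, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_taste),
        )

    def forward(self, z, x, avail):
        # z: (N, n_char) characteristics; x: (N, J, n_taste) attributes;
        # avail: (N, J) availability mask (1 = alternative available)
        beta = self.tastenet(z)               # beta_n^TN = TasteNet(z_n; w), Eqn. 3
        v = (beta.unsqueeze(1) * x).sum(-1)   # V_in = beta(z_n)' x_in, Eqn. 4
        v = v.masked_fill(avail == 0, -1e9)   # exclude unavailable alternatives
        return torch.log_softmax(v, dim=1)    # log P(y_n = i | .), Eqn. 5
```

A manually specified parametric utility would simply be added to `v` before the softmax, and sign constraints (see the subsection on parameter constraints) would be applied to `beta`.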
TasteNet
We choose a feed-forward neural network, also called a multi-layer perceptron (MLP), for TasteNet. An MLP consists of an input layer, one or more hidden layers, and an output layer. Essentially, an MLP is a composition of linear and nonlinear functions that maps inputs to outputs.

In an MLP with one hidden layer of H hidden units, the k-th output of the network, β_k^TN, can be written as Eqn. 6, where D is the input dimension, A^{(1)} is the hidden layer activation function, and T is the output activation function. The neural network parameters w^{(1)} and w^{(2)} are the weights from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively.

\beta_k^{TN}(z, w) = T\left[\sum_{h=1}^{H} w_{kh}^{(2)} A^{(1)}\left(\sum_{i=1}^{D} w_{hi}^{(1)} z_i + w_{h0}^{(1)}\right) + w_{k0}^{(2)}\right]  (6)

An MLP with multiple hidden layers can be denoted as MLP(L, [H_1, ..., H_L], [A^{(1)}, ..., A^{(L)}], T). We need to specify the number of hidden layers L, the size of each hidden layer H_l, the activation function of each hidden layer A^{(l)}, and the output transform function T. These hyper-parameters are selected based on the model's prediction performance on the development set.
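Written out directly, Eqn. 6 for a single person is just two affine maps separated by a nonlinearity; a numpy sketch under our own naming:

```python
import numpy as np

def tastenet_forward(z, W1, b1, W2, b2, A=np.tanh, T=lambda u: u):
    """Eqn. 6: beta_k = T( sum_h w2_kh * A( sum_i w1_hi z_i + w1_h0 ) + w2_k0 ).
    z: (D,) characteristics; W1: (H, D), b1: (H,); W2: (K, H), b2: (K,)."""
    h = A(W1 @ z + b1)      # hidden layer activations, shape (H,)
    return T(W2 @ h + b2)   # taste parameters beta^TN, shape (K,)
```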
Parameter constraints

Over-parameterization is common for neural networks, especially when the sample size is small relative to the model complexity. Adding constraints is one way to regularize a neural network. We impose constraints on the taste parameters not only to improve the model's generalization ability, but also to ensure that the taste parameters fall into a reasonable range based on expert knowledge.

A typical constraint is on the signs of parameters. For example, the coefficient of travel time or waiting time is usually negative. We incorporate sign constraints through the output transform function T. For a taste parameter β with a non-negative sign constraint, T can be the rectified linear function ReLU(β) or the exponential function exp(β). For a β with a non-positive sign constraint, T can be -ReLU(-β) or -exp(-β). For a β without constraints, T is the identity function. Such transformations redistribute the parameters into the desired range through continuous functions, resembling the exponential transform applied to scale or time coefficients in the utility specification of a DCM.

An advantage of using a transform function for sign constraints is that the constraints are strictly enforced. Other methods, such as adding a penalty for constraint violations to the learning objective, cannot enforce the constraints on unseen data.
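As an illustration (the function names are ours), note that both non-positivity transforms map any real-valued network output into (-inf, 0], so the constraint holds by construction, even on unseen inputs:

```python
import torch

def nonpositive_relu(beta_raw):
    # -ReLU(-beta): positive raw outputs are clipped to 0, negative ones pass through
    return -torch.relu(-beta_raw)

def nonpositive_exp(beta_raw):
    # -exp(-beta): strictly negative and smooth for any real input
    return -torch.exp(-beta_raw)

raw = torch.tensor([-1.5, 0.0, 2.0])
print(nonpositive_relu(raw))  # [-1.5, 0, 0]
print(nonpositive_exp(raw))   # [-4.4817, -1.0, -0.1353]
```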
Estimation

The model is estimated by optimizing a learning objective with stochastic gradient descent. The objective is to minimize a loss function: the average negative log-likelihood plus a regularization term on the p-norm of the neural network weights (Eqn. 7), which prevents over-fitting.

\min_{w,\, \beta^{MNL}} \; -\sum_n \log P(y_n \mid z_n, x_n, w, \beta^{MNL}) + \lambda_p \|w\|_p  (7)

TasteNet-MNL is trained in an integrated fashion through back-propagation. The unknown parameters to estimate include the neural network weights w and the unknown coefficients of the MNL module, β^MNL.
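A minimal training loop for Eqn. 7 could look as follows, reusing the TasteNetMNL sketch above. Adam with an L2 penalty is our concrete choice for illustration; the paper only requires stochastic-gradient optimization of the penalized negative log-likelihood.

```python
import torch

def train(model, loader, n_epochs=50, lr=1e-3, lam=1e-3):
    """Minimize average NLL + lam * ||w||_2^2 over the network weights (Eqn. 7)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    nll = torch.nn.NLLLoss()  # expects log-probabilities, as the model returns
    for _ in range(n_epochs):
        for z, x, avail, y in loader:              # mini-batches of observations
            log_p = model(z, x, avail)             # (N, J) log choice probabilities
            penalty = sum(w.pow(2).sum() for w in model.tastenet.parameters())
            loss = nll(log_p, y) + lam * penalty
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```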
4. Experiments
We generate a synthetic dataset with an underlying logit model whose utility function contains higher-order interactions between characteristics and attributes. On this synthetic data, we compare TasteNet-MNL with benchmark MNLs and random coefficient logit models (RCLs). We expect that TasteNet-MNL improves predictability, reduces bias in parameter estimates, and provides more accurate behavioral interpretations, compared to MNLs and RCLs with misspecified systematic utility.
4.1. Synthetic data
The data generation model is a binary logit, with the systematic utility of alternative i for person n defined in Eqn. 8. Explanatory variables include three characteristics: income (inc), a full-time employment dummy (full), and a flexible work schedule dummy (flex); and two alternative attributes: travel cost (cost) and travel time (time) (see Table A1 in Appendix A for details). Coefficient values are chosen to carry realistic meanings: income has a positive effect on the value of time (VOT), full-time workers have higher VOT, and people with a flexible schedule have lower VOT. The cost coefficient is fixed to -1 for both alternatives, so that the VOT can be read directly from the time coefficient. The alternative specific constant (ASC) is -0.1 for alternative 1 and 0 for alternative 0. The random component of each utility follows an Extreme Value distribution. The generated synthetic data has 14,000 observations, randomly split into training (10,000), development (2,000) and test (2,000) sets. Details about the input distributions and synthetic data generation are presented in Appendix A.

V_{in} = ASC_i - cost_{in} + (-0.1 - 0.5\, inc_n - 0.1\, full_n + 0.05\, flex_n - 0.2\, inc_n \cdot full_n + 0.05\, inc_n \cdot flex_n + 0.1\, full_n \cdot flex_n) \cdot time_{in}  (8)
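For reference, the data-generating process of Eqn. 8 takes only a few lines of numpy. This is a sketch with our own variable names; the exact input distributions are given in Appendix A of the paper, so the uniform draws below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 14_000
inc = rng.uniform(0.0, 1.0, N)             # placeholder distributions; see Appendix A
full = rng.integers(0, 2, N).astype(float)
flex = rng.integers(0, 2, N).astype(float)
time = rng.uniform(0.0, 2.0, (N, 2))       # travel times of alternatives 0 and 1
cost = rng.uniform(0.0, 2.0, (N, 2))       # travel costs of alternatives 0 and 1

# The true individual-specific time coefficient (the nonlinear taste in Eqn. 8)
b_time = (-0.1 - 0.5 * inc - 0.1 * full + 0.05 * flex
          - 0.2 * inc * full + 0.05 * inc * flex + 0.1 * full * flex)

asc = np.array([0.0, -0.1])                # ASC_0 = 0, ASC_1 = -0.1
V = asc - cost + b_time[:, None] * time    # systematic utilities, shape (N, 2)
U = V + rng.gumbel(size=V.shape)           # add i.i.d. Extreme Value noise
y = U.argmax(axis=1)                       # observed choices
```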
4.2. Models in comparison

4.2.1. Logit models

MNL-I's utility functions include only first-order interactions between characteristics and time (Eqn. 9). Compared to MNL-I, the utilities of MNL-II have one additional interaction, inc*full*time (Eqn. 10). MNL-TRUE is an MNL with the true utility specification; it differs from the ground truth only due to sampling error. In all MNLs, the alternative specific constant is fixed to 0 for alternative 0 (ASC_0 = 0).
4.2.2. Random coefficient logit models

Two random coefficient logit (RCL) benchmarks are included to test whether modeling unobserved heterogeneity can compensate for specification errors in the systematic utility. We assume the time coefficient is randomly distributed, following a Normal distribution with mean equal to a linear function of characteristics and standard deviation σ. RCL-I and RCL-II represent two variations in specifying the mean of the time coefficient (Eqn. 11 and 12), and correspond to MNL-I's and MNL-II's time coefficients, respectively.

V_i^{MNL-I} = ASC_i - cost_i + (b_0 + b_1\, inc + b_2\, full + b_3\, flex) \cdot time_i  (9)

V_i^{MNL-II} = ASC_i - cost_i + (b_0 + b_1\, inc + b_2\, full + b_3\, flex + b_4\, inc \cdot full) \cdot time_i  (10)

V_{in}^{RCL-I} = ASC_i - cost_{in} + \beta_n \cdot time_{in}, \quad \beta_n \sim N(b_0 + b_1\, inc_n + b_2\, full_n + b_3\, flex_n,\ \sigma)  (11)

V_{in}^{RCL-II} = ASC_i - cost_{in} + \beta_n \cdot time_{in}, \quad \beta_n \sim N(b_0 + b_1\, inc_n + b_2\, full_n + b_3\, flex_n + b_4\, inc_n \cdot full_n,\ \sigma)  (12)

Figure 2: Diagram of the TasteNet-MNL for the Synthetic Data

4.2.3. TasteNet-MNL

The structure of the TasteNet-MNL for the synthetic data is shown in Figure 2. The time coefficient (β_vot) is modeled by an MLP. The hyper-parameters to decide include the number of hidden layers L, the size(s) of the hidden layer(s) H_1, ..., H_L, the type of regularizer (norm p), the regularization strength λ_p, the hidden layer activation functions A_1, ..., A_L, and the output transform function T.

We train TasteNet-MNL on the training set with different combinations of hyper-parameters. We choose one hidden layer, since it is enough to reach the true model's prediction accuracy. We vary the number of hidden units from 5 to 30. For each hidden layer size, we apply an L2 regularization penalty in [0, 0.0001, 0.001, 0.01]. For the hidden layer activation function, we try ReLU and Tanh. For the output transformation, we experiment with -ReLU(-β) and -e^{-β}, to impose the non-positivity constraint on the time coefficient β_vot. For each scenario, we train the model 5 times with different random initializations.

The best hyper-parameter scenario is selected based on the lowest average negative log-likelihood (NLL) on the development set. The best TasteNet-MNL has 1 hidden layer with 7 hidden units, ReLU for the hidden layer activation, -ReLU(-β) for the output transformation, and an L2 penalty of 0.001.

Table 1: Average Negative Log-likelihood (NLL) and Prediction Accuracy (ACC) for Synthetic Data

Model                          NLL train  NLL dev  NLL test  ACC train  ACC dev  ACC test
MNL-I                          0.54102    0.55699  0.54572   0.719      0.703    0.722
MNL-II                         0.53755    0.55479  0.54695   0.717      0.706    0.724
RCL-I                          0.52591    0.54594  0.52758   0.718      0.703    0.724
RCL-II                         0.52298    0.54323  0.52808   0.719      0.701    0.723
TasteNet-MNL (H=7, λ=0.001)    0.45433    0.46803  0.46562   0.785      0.775    0.786
MNL-TRUE                       0.45459    0.47268  0.45979   0.786      0.773    0.785
Data generation model          0.45502    0.47186  0.45877   0.786      0.772    0.787

4.3. Results
We compare MNLs, RCLs and TasteNet-MNL regarding predictability, parameter bias, and interpretability.

4.3.1. Predictability

We measure model predictability by the average negative log-likelihood (NLL) and prediction accuracy (ACC) on the training, development and test data. NLL is the total negative log-likelihood in Eqn. 7 divided by the number of observations; the higher the NLL, the poorer the model fit. Prediction accuracy is the percentage of correct predictions. Table 1 summarizes the prediction performance of the different models.

The MNL with the correct utility specification (MNL-TRUE) achieves the same NLL and ACC as the data generation model. MNL-I and MNL-II result in a higher NLL (0.54-0.56) than MNL-TRUE (0.45-0.47), and lower prediction accuracy (70%-72%) than MNL-TRUE (77%-79%). Compared to MNL-I, MNL-II's utility includes one more interaction, inc*full*time, which has the largest effect (-0.2) among the three interaction terms missing from MNL-I. However, MNL-II's model fit and prediction accuracy do not improve significantly. This indicates that prediction accuracy can be sensitive to the systematic utility specification: a seemingly small misspecification can cause significant prediction errors, and poor predictability can be a sign of model misspecification.

Compared to MNL-I and MNL-II, RCL-I and RCL-II both achieve a better log-likelihood fit, because part of the missing terms is absorbed as random heterogeneity. However, the RCLs do not improve choice prediction accuracy.

The best TasteNet-MNL achieves the same predictability as the data generation model. We give minimal instructions to the model: 1) choice-makers make trade-offs between time and cost; and 2) the value of time depends on individual characteristics. We do not specify in detail how the value of time varies across individuals; instead, we let the neural network learn the value of time as a function of individual characteristics. You may wonder: does TasteNet-MNL recover the true utility function? Is the prediction performance a result of learning the correct utility function?
4.3.2. Parameter bias

Table 2: Parameter Estimates by MNLs, RCLs and TasteNet-MNL Compared to the Truth

Coef             MNL-I    MNL-II   RCL-I    RCL-II   TasteNet-MNL a  MNL-TRUE  Truth
ASC              -0.1484  -0.1085  -0.141   -0.141   -0.1055         -0.1003   -0.1
time             -0.0998  -0.1233  -0.0914  -0.139   -0.1056         -0.0927   -0.1
inc*time         -0.5983  -0.5058  -0.636   -0.447   -0.4829         -0.5277   -0.5
full*time        -0.1154  -0.0651  -0.109   -0.0434  -0.1093         -0.1051   -0.1
flex*time        0.1113   0.1120   0.114    0.115    0.060           0.0458    0.05
inc*full*time             -0.1470           -0.223   -0.1904         -0.1741   -0.2
inc*flex*time                                        0.0182          0.0695    0.05
full*flex*time                                       0.1046          0.0932    0.1
σ(time)                            0.0528   0.0504
RMSE             0.093    0.051    0.098    0.058    0.014           0.016
MAE              0.072    0.042    0.076    0.053    0.012           0.012
MAPE             63%      52%      64%      61%      15%             11%

a: Estimated through regression. RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

For TasteNet-MNL, we regress the predicted β_vot values against the characteristics (inc, full, flex) and their interactions to obtain the coefficients of the utility function; the ASC is directly estimated in the MNL module. We compute the errors in the parameter estimates, including the root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) (Table 2).

The parameter errors of TasteNet-MNL are close to those of the true model MNL-TRUE, which means the neural network recovers the correct form of the taste function in this case. MNL-I, MNL-II, RCL-I and RCL-II have large biases in their parameter estimates, with MAPE from 52% to 64%.

It is worth noting that RCL-I and RCL-II both have statistically significant standard deviations for the random time coefficient (σ(time)): the missing systematic effects can be misinterpreted as random heterogeneity in the value of time. This example also shows that RCLs do not necessarily reduce bias in parameter estimates; their parameter errors are similar to, and even slightly higher than, those of the corresponding MNLs. RCLs do, however, improve the fitted log-likelihood (Table 1).

These results imply that without a sufficiently flexible function to capture nonlinearity in systematic taste variation, we may mistake systematic heterogeneity for random heterogeneity, and obtain biased estimates and interpretations. Neural networks can be utilized to exhaust the capacity of the systematic utility function, and so separate systematic effects from random effects.

4.3.3. Interpretability

We expect TasteNet-MNL to provide more accurate economic indicators than the misspecified MNLs. We compare the value of time (VOT), choice elasticity and choice probability derived from the different models against the ground truth.

Table 3: Errors in Estimated Values of Time (Unit: $/Hour)

Input data      Error metric  MNL-I   MNL-II  MNL-TRUE  TasteNet-MNL
Synthetic data  RMSE          1.805   1.730   0.098     0.111
                MAE           1.700   1.696   0.080     0.056
                MAPE          10.1%   10.1%   0.5%      0.3%
New input       RMSE          2.710   1.707   0.351     0.408
                MAE           2.274   1.573   0.244     0.280
                MAPE          13.9%   9.9%    1.5%      1.6%

RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

a) Value of time
We estimate the VOT for each individual in the synthetic data. Table 3 shows the errors in the estimated VOT by the different models. The MAE of the VOT estimated by MNL-I and MNL-II is 1.7$ per hour, about 10% of the true values. TasteNet-MNL's MAE is much lower: 0.056$ per hour, or 0.3% of the true VOTs. TasteNet-MNL's accuracy in individual VOT estimates matches MNL-TRUE's.

To test the models' generalization performance, we create a dataset with 200 individuals whose characteristics are drawn from a uniform distribution. There are 50 individuals in each of the four groups defined by the combinations of full-time (yes/no) and flexible schedule (yes/no). Income within each group is evenly distributed in the range of 0 to 60$ per hour with an interval size of 1.2. On this new input, MNL-I and MNL-II produce an MAE of 2.3$ per hour (14%) and 1.6$ per hour (10%), respectively, compared to TasteNet-MNL's error of 0.3$ per hour (1.6%).

We plot the predicted VOTs against income for the four categories of individuals (Figure 3). MNL-I cannot distinguish the difference in VOTs (at a fixed income) between the (full, flex) group and the (nofull, noflex) group. Adding one higher-order interaction in MNL-II helps, but a large bias persists. TasteNet-MNL gives more accurate VOT estimates at the individual level: the RMSE of its individual VOT estimates is 0.41, close to the true model MNL-TRUE (0.35) and much lower than MNL-I (2.71) and MNL-II (1.71); its MAPE is 1.6%, similar to MNL-TRUE and better than MNL-I (14%) and MNL-II (10%). To summarize, TasteNet-MNL provides more accurate estimates of VOT at the individual level, while misspecified MNLs can result in large bias.

Figure 3: Estimated Values of Time and the Ground Truth for New Input
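Since the cost coefficient is fixed at -1, each person's VOT is simply the negated time coefficient predicted by TasteNet. A sketch, with `model` the fitted TasteNet-MNL from Section 3 and `z_all` a hypothetical tensor of everyone's characteristics:

```python
import torch

# Utility is ... - cost + beta_time(z) * time, with the cost coefficient at -1,
# so the value of time is VOT(z) = -beta_time(z), in cost units per time unit.
with torch.no_grad():
    beta_time = model.tastenet(z_all)   # (N, 1) predicted time coefficients
vot = -beta_time.squeeze(-1)            # (N,) individual VOT estimates
```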
b) Elasticity and choice probability

Elasticities are useful economic indicators derived from a choice model; they measure the effect of a change in one of the variables (e.g. income, cost) on the choice probability. We compare disaggregated point elasticities across models. The general definition of the elasticity of demand with respect to alternative attribute x_kin is given in Eqn. 13, where P_n(i) is the probability that person n chooses alternative i, and x_kin is the k-th attribute of alternative i for person n. The elasticity E_{x_kin}^{P_n(i)} measures the percentage change in the choice probability P_n(i) per one percent change in the attribute x_kin. The elasticity formulas for a linear MNL and for TasteNet-MNL are shown in Eqn. 14 and 15; the major difference is that the taste parameter β_k in the TasteNet-MNL case becomes a function of the characteristics z.

E_{x_{kin}}^{P_n(i)} = \frac{\partial P_n(i)}{\partial x_{kin}} \cdot \frac{x_{kin}}{P_n(i)}  (13)

E_{x_{kin}}^{P_n(i)} = (1 - P_n(i))\, x_{kin}\, \beta_k  (14)

E_{x_{kin}}^{P_n(i)} = (1 - P_n(i))\, x_{kin}\, \beta_k(z)  (15)
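In code, Eqns. 14 and 15 share one formula; the TasteNet-MNL case only differs in where the taste comes from. A sketch, where the probability and taste arrays are assumed to come from the fitted models:

```python
def point_elasticity(p_i, x_ki, beta_k):
    """Eqns. 14-15: elasticity of the choice probability p_i w.r.t. attribute
    x_ki. For a linear MNL, beta_k is a constant; for TasteNet-MNL, pass
    beta_k = beta_k(z_n), predicted per person."""
    return (1.0 - p_i) * x_ki * beta_k
```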
Table 4: Errors in Estimated Elasticities and Probabilities by Different Models

                Choice Elasticity                           Choice Probability
Error metric    MNL-I  MNL-II  MNL-TRUE  TasteNet-MNL       MNL-I  MNL-II  MNL-TRUE  TasteNet-MNL
RMSE            5.30   5.22    0.34      0.32               0.21   0.21    0.012     0.011
MAE             3.03   3.10    0.16      0.11               0.16   0.16    0.0079    0.0057
MAPE            55%    56%     3%        2%                 61%    62%     3%        2%

RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

In the first analysis, we estimate the elasticity and choice probability for each observation in the synthetic data, choosing the elasticity and choice probability of alternative 1 with respect to the time of alternative 1. Table 4 shows the errors in the estimated elasticities by the different models. TasteNet-MNL achieves the same level of accuracy as the true model MNL-TRUE, while the misspecified logit models result in 55% to 56% errors. Similar results hold for the predicted choice probabilities.

The second analysis is performed on a selected individual with a 60$ hourly wage, a full-time job and a flexible schedule. Alternative 0's time and cost are fixed at 20 minutes and 2$, and alternative 1's cost is fixed at 8$. We vary time_1 (the time of alternative 1) from 0.2 to 20, and compute this person's choice elasticity and probability for each value of time_1. Figure 4 shows the estimated elasticity and choice probability against time_1 for each model. Among the models, TasteNet-MNL gives the functions closest to the ground truth.

Figure 4: Elasticity and Choice Probability against time_1 for the Selected Person

The third analysis compares the elasticities and probabilities predicted by the different models across four types of individuals, defined by the combinations of full-time (yes/no) and flexible schedule (yes/no), with income fixed at 30$ per hour. The time and cost of alternative 0 are given at 20 minutes and 2$, and cost_1 is 8$. We plot the elasticity and probability as functions of time_1 for each group as predicted by the different models (Figure 5 and Figure 6). MNL-I can barely distinguish the difference between the full-flex and nofull-noflex groups, while TasteNet-MNL can, and gives more accurate estimates than the misspecified MNLs.

Figure 5: Elasticity v.s. time_1 for 4 Types of Individuals (inc = 30$/hr, time_0 = 20 min, cost_0 = 2$, cost_1 = 8$)

Figure 6: Choice Probability v.s. time_1 for 4 Types of Individuals (inc = 30$/hr, time_0 = 20 min, cost_0 = 2$, cost_1 = 8$)

4.3.4. Understanding the Neural Network

To understand how the neural network learns the effects of the input variables and their interactions, we visualize the activation values of the hidden units for simulated individual characteristics. We also show the estimated weights of TasteNet: the weights of the linear layer from input to hidden layer, and the weights of the linear layer from hidden layer to output layer (Table 5). Interestingly, hidden unit 6 is not used, since its associated weights are all zeros.

We generate four types of individuals with income varying from 0 to 60$ per hour. We pass the individual characteristics through the trained TasteNet and obtain the activation value of each hidden unit. Figure 7 displays the activation values; darker color indicates stronger activation. All activation values are non-negative, since the activation function used is ReLU. By observing how a neuron is activated as the input varies, we can understand the role of each neuron in approximating the true taste function.
Table 5: Estimated Weights of TasteNet-MNL

               Input-to-Hidden                                      Hidden-to-Output
Hidden unit    z_income   z_fulltime   z_flexible   intercept
1              -0.4257    0.3917       0.0479       -0.0315        0.462
2              0.3907     -0.0925      -0.114       -0.0543        -0.1536
3              -0.0001    0.4637       -0.0764      0.484          -0.3641
4              0.8034     -0.2261      -0.2055      0.4987         -0.3595
5              0.812      -0.317       -0.2252      0.5003         -0.1944
6              0          0            0            0              0
7              0.0396     -0.5551      -0.1817      0.5089         0.4905
intercept                                                          0.0921

Hidden units 4 and 5 apparently capture the income effect, since they become more activated as income increases in all four groups. Hidden units 4 and 5 also capture the non-flexible effect: individuals with non-flexible schedules tend to have higher activation values for hidden units 4 and 5, all else equal (left vs. right in Figure 7). In this case, higher activation of units 4 and 5 leads to higher values of time (a more negative β_vot), because their coefficients in the linear hidden-to-output layer are negative (-0.3595 and -0.1944, see Table 5). Hidden unit 3 captures the full-time effect: full-time individuals tend to have a higher activation value for unit 3, which makes β_vot more negative, since the hidden-to-output coefficient for unit 3 is negative (-0.3641). Hidden units 1, 2 and 7 represent the three interaction effects: income * full-time, income * not-flexible, and not-full-time * not-flexible, respectively. Again, we see that hidden unit 6 is never activated.

Figure 7: Activation of the Hidden Layer in TasteNet

Through Monte-Carlo experiments, we have shown TasteNet-MNL's ability to capture nonlinear taste functions and uncover the true utility form. Misspecified systematic utility in MNLs or RCLs can lead to large bias in parameter estimates; TasteNet-MNL can be used to identify specification errors in the utility and reduce potential biases. TasteNet-MNL's prediction accuracy matches the true model's (77% to 79%), higher than the misspecified MNLs' and RCLs' (70% to 72%). TasteNet-MNL also provides interpretable economic indicators, such as values of time and demand elasticities, close to the ground truth, while MNLs and RCLs with misspecified utility can produce unreliable interpretations.
5. Model Application: Swissmetro Mode Choice
We apply TasteNet-MNL to a publicly available dataset, Swissmetro, to model mode choice for inter-city travel. The purpose of this application is to examine 1) whether TasteNet-MNL predicts more accurately than a manually specified, relatively sophisticated MNL; and 2) whether TasteNet-MNL draws reasonable behavioral interpretations and, if so, how its interpretations differ from the MNLs'. For comparison with TasteNet-MNL, we set up three benchmark MNL models with increasing complexity in the utility function.
5.1. Data
Swissmetro is a proposed revolutionary mag-lev underground system. To assess potential demand, the Swissmetro Stated Preference (SP) survey collected data from 1,192 respondents (441 rail-based travellers and 751 car users), with 9 choice situations per respondent. Each respondent is asked to choose one mode out of a set of alternatives for inter-city travel, given the attributes of each mode (e.g. travel time, headway and cost). The universal choice set includes train (TRAIN), Swissmetro (SM), and car (CAR); for individuals without a car, the choice set includes only TRAIN and SM. Table 6 provides a description of the variables. For more information, readers can refer to Bierlaire (2018).

The original data, downloaded in January 2019 (data link: https://biogeme.epfl.ch/data.html), has 10,728 observations. After removing observations with unknown age, "other" trip purpose or unknown choice, we retain 10,692 observations. We randomly split the data into training ("train"), development ("dev") and test ("test") sets with 7,484, 1,604 and 1,604 observations, respectively.

5.2. Models
The three benchmarks are logit models.
MNL-A is similar to Bierlaire et al. (2001)'s MNL specification but with some enhancements: 1) the value of travel time and the value of headway are made mode-specific; 2) all levels of the age and luggage categories are included; and 3) the cost coefficients are fixed to -1.0 so that the VOTs can be read directly from the time coefficients (Table 7). In the benchmark MNL-B, we add the interaction terms time*age, time*income and time*purpose (Table 8). The third benchmark, MNL-C, is an MNL with all pairs of first-order interactions between characteristics and attributes (Table 9); this model is equivalent to a TasteNet-MNL with all taste coefficients modeled by a neural network without hidden layers.

The TasteNet-MNL structure for the Swissmetro data is shown in Figure 8. We specify the utility functions for each alternative in the MNL module.
Table 6: Description of Variables

Alternative        Alternative attributes                                          Availability
TRAIN              time, headway, cost (train tt, train hw, train co)              train av
SM (Swissmetro)    time, headway, seats a, cost (sm tt, sm hw, sm seats, sm co)    sm av
CAR                time, cost (car tt, car co)                                     car av

Person/Trip variable                Variable levels
AGE                                 0: age ≤ 24, 1: 24 < age ≤ 39, 2: 39 < age ≤ 54, 3: 54 < age ≤ 65, 4: 65 < age
MALE                                0: female, 1: male
INCOME (thousand CHF per year)      0: under 50, 1: between 50 and 100, 2: over 100, 3: unknown
FIRST (first-class traveler)        0: no, 1: yes
GA (Swiss annual season ticket)     0: no GA, 1: owns a GA
PURPOSE                             0: Commute, 1: Shopping, 2: Business, 3: Leisure
WHO (who pays)                      0: self, 1: employer, 2: half-half
LUGGAGE                             0: none, 1: one piece, 2: several pieces

a. Seat configuration in Swissmetro: seats = 1 if airline seats, 0 otherwise.

Table 7: Estimated Coefficients of MNL-A

Variable description           Train       Swissmetro  Car
Constant                       0.1227      0.5726
Travel time (minutes)          -1.3376     -1.4011     -1.0177
Headway (minutes)              -0.4509     -0.8171
Seats (airline seating = 1)                0.1720
Cost (CHF)                     -1 (fixed)  -1 (fixed)  -1 (fixed)
GA (annual ticket = 1)         2.0656      0.5319
Age (base 0: age ≤ 24):        1: 24 < age ≤ 39 = -0.7548; 2: 39 < age ≤ 54 = -0.9457; 3: 54 < age ≤ 65 = -0.4859; 4: 65 < age = 0.6995
Luggage (base 0: none):        1: one piece = -0.1538; 2: several pieces = -0.9230

Table 8: Estimated Coefficients of MNL-B

Variable description           Train       Swissmetro  Car
Constant                       0.0056      0.4674
Travel time (minutes)          -0.5006     -0.4010     -0.5600
Travel time * Age (base 0: age ≤ 24)
  1: 24 < age ≤ 39             -0.6354     -0.3307     -0.5696
  2: 39 < age ≤ 54             -0.8475     -0.6101     -0.6105
  3: 54 < age ≤ 65             -0.1566     0.1419      -0.0915
  4: 65 < age                  0.3265      -0.243      -0.0234
Travel time * Income (base 0: under 50)
  1: 50 to 100                 -0.2688     0.1739      0.1623
  2: over 100                  -1.0181     -0.436      -0.4093
  3: unknown                   0.0852      0.2828      -0.0923
Travel time * Purpose (base 0: Commute)
  1: Shopping                  -0.2081     -0.6192     -0.6062
  2: Business                  -0.1574     -0.8688     -0.1833
  3: Leisure                   -0.59       -0.9706     -0.0162
Headway (minutes)              -0.6158     -0.7011
Seats (airline seating = 1)                0.189
Cost (CHF)                     -1 (fixed)  -1 (fixed)  -1 (fixed)
GA (annual ticket = 1)         1.6162      0.2988
Luggage (base 0: none):        1: one piece = -0.1714; 2: several pieces = -0.6718

Table 9: Estimated Coefficients of MNL-C (coefficients for alternative attributes, by characteristic z)

z (characteristics)    TRAIN TT  SM TT    CAR TT   TRAIN HE  SM HE    SM SEATS  TRAIN ASC  SM ASC
Intercept              -0.0671   0.1455   0.0059   0.1713    0.0646   0.3064    0.2953     0.2067
Male                   -0.1526   -0.0477  0.0742   -0.2384   0.0706   -0.1016   0.0671     0.149
Age 1: (24,39]         -0.0965   -0.2422  -0.1093  0.0044    0.5682   0.0517    -0.1634    0.4285
Age 2: (39,54]         -0.1467   -0.2022  -0.195   -0.2397   -0.0105  -0.2135   -0.2692    0.0959
Age 3: (54,65]         0.0256    0.1201   0.0251   -0.2379   -0.0807  0.1619    -0.0861    -0.0344
Age 4: (65,)           -0.1712   0.1435   0.1105   0.6032    -0.1488  -0.1529   0.618      -0.351
Income 1: 50-100       0.0494    -0.039   0.0098   -0.1884   -0.2972  0.2349    -0.1776    0.1944
Income 2: over 100     -0.2825   -0.1697  -0.2662  0.1393    0.0372   0.5288    -0.0406    -0.0789
Income 3: unknown      0.0289    0.1467   -0.2037  0.1484    -0.0721  -0.4196   0.1621     -0.0459
First class            -0.1927   -0.0807  -0.3297  -0.4768   0.1183   0.1302    0.2228     -0.2085
Who pays 1: employer   -0.2154   -0.1668  0.1231   0.028     -0.0045  0.0882    0.1191     0.3986
Who pays 2: half-half  0.1537    0.4771   0.4391   -0.0311   0.3917   0.3114    -0.2414    -0.0332
Purpose 1: Shopping    0.2339    -0.219   0.19     0.1509    0.0493   0.1994    0.4238     0.6996
Purpose 2: Business    -0.0872   -0.3524  -0.181   -0.0544   -0.0195  -0.0647   0.0605     -0.2941
Purpose 3: Leisure     -0.2678   -0.2778  -0.0043  0.3245    -0.4552  -0.0289   -0.302     -0.4739
Luggage 1: one piece   -0.0375   0.0861   0.2525   0.58      -0.1993  0.0413    0.3364     0.3239
Luggage 2: several     0.022     -0.1785  -0.2731  -0.2946   0.0814   -0.1225   -0.0041    0.2158
Annual ticket (GA)     0.5912    -0.0075  -0.3181  0.2652    -0.2032  -0.5815   0.3576     0.1351
Table 10: Options of Hyper-parameters & Activation Functions

Options                                             Values
Hidden activation                                   relu, tanh
Output activation (for non-positive parameters)     -relu(-β), -exp(-β)
Hidden layer size                                   [10, 20, ..., 100]
Regularization weight λ a                           [0, 0.0001, 0.001, 0.01]

a. Either an L1 or an L2 penalty.

The cost coefficients are fixed to -1 so that each time coefficient equals the negative of the corresponding value of time. There are 8 coefficients in the MNL utilities, including the alternative specific constants (see Table 12). We assume all MNL coefficients (taste parameters) are functions of individual characteristics, and model them as the outputs of TasteNet. This is a special case of the general structure: the set β^MNL is empty and all taste parameters are modeled by TasteNet as β^TN (Eqn. 4 and 5).

The TasteNet module consists of a linear layer from the input z to the hidden layer h^(1), a nonlinear activation A^(1) for the hidden layer, followed by a linear layer from the hidden layer to the output layer, and an output activation function T. We choose only one hidden layer, since the predicted log-likelihood on hold-out data does not improve with more hidden layers. The input z includes all characteristics: age, gender, income, first class, who pays for the travel cost, trip purpose, and luggage. We experiment with various sets of hyper-parameters and activation functions (Table 10).

Figure 8: Diagram of TasteNet-MNL for the Swissmetro Dataset
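The grid in Table 10 amounts to a small exhaustive search over 2 x 2 x 10 x 4 configurations; schematically, where `make_model`, `train`, `train_loader` and `dev_nll` are hypothetical helpers for building, fitting and scoring a model:

```python
from itertools import product

results = {}
for act, out_t, h, lam in product(
        ["relu", "tanh"],             # hidden activation
        ["neg_relu", "neg_exp"],      # output transform for non-positive tastes
        range(10, 101, 10),           # hidden layer size
        [0.0, 1e-4, 1e-3, 1e-2]):     # L1 or L2 regularization weight
    model = train(make_model(act, out_t, h), train_loader, lam=lam)
    results[(act, out_t, h, lam)] = dev_nll(model)   # average NLL on the dev set

best_config = min(results, key=results.get)          # lowest dev NLL wins
```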
5.3. Results

The estimated coefficients of MNL-A, MNL-B and MNL-C are shown in Tables 7, 8 and 9. Among all TasteNet-MNL scenarios, the one with 80 hidden units, relu for the hidden layer activation, the negative exponential for the non-positive output activation, and no regularization achieves the best prediction performance on the development set.

5.3.1. Prediction Performance

TasteNet-MNL significantly outperforms the MNL benchmarks in prediction accuracy. We use the average negative log-likelihood (NLL) and prediction accuracy (ACC) to measure predictability (Table 11). From MNL-A to MNL-C, more interactions between attributes and individual characteristics are added, with MNL-C containing a full set of interactions between attributes and characteristics. Surprisingly, the predicted log-likelihood shows only marginal improvements: the NLL decreases from 0.728 (MNL-A) to 0.708 (MNL-B) and 0.691 (MNL-C). With TasteNet-MNL, we see a substantial improvement in prediction performance: the NLL on the development data drops from 0.691 to 0.646. This is attributed to the flexibility enabled by the hidden layer with nonlinear transformation in TasteNet. The improved log-likelihood implies the existence of nonlinear effects in the utility specification. The neural network automatically learns taste as a nonlinear function of individual characteristics; because it captures a more accurate relationship between characteristics and tastes, it outperforms the MNLs with linear utilities.

5.3.2. Individual Taste Estimates

We want to understand how the estimated tastes, such as the value of time by mode, differ across models. We apply each model to obtain the taste parameters of each individual in the Swissmetro dataset. The population averages of the taste coefficients are displayed in Table 12. Since the cost coefficients are fixed to -1.0, all taste coefficients are in willingness-to-pay space, measured in Swiss Francs (CHF).

Table 11: Average Negative Log-likelihood (NLL), Prediction Accuracy (ACC) and F1 Score by Model
          NLL                       ACC                       F1
Model     train   dev     test     train   dev     test      train   dev     test
MNL-A     0.762   0.728   0.755    0.662   0.691   0.66      0.535   0.557   0.534
MNL-B     0.73    0.708   0.72     0.678   0.69    0.678     0.578   0.585   0.573
MNL-C     0.704   0.691   0.698    0.685   0.706   0.678     0.588   0.611   0.574
TasteNet  0.607   0.646   0.645    0.737   0.718   0.703     0.668   0.634   0.620
Table 12: Population Mean of Taste Parameters Estimated by Different Models

Mode   Taste   MNL-A   MNL-B   MNL-C   TasteNet-MNL  (vs. MNL-C)
TRAIN  TT      -1.338  -1.710  -1.846  -2.327        26%
       HE      -0.451  -0.616  -0.880  -1.102        25%
       ASC     -0.198  0.234   0.368   0.801         117%
SM     TT      -1.401  -1.514  -1.505  -1.764        17%
       HE      -0.817  -0.701  -1.039  -1.733        67%
       SEATS   0.172   0.189   0.420   0.266         -37%
       ASC     0.648   0.510   0.512   0.669         31%
CAR    TT      -1.018  -1.251  -1.354  -1.685        24%
TT: travel time. HE: headway. ASC: alternative specific constant.

We find that from MNL-A to MNL-C, the average values of travel time (VOT) and values of headway (VOHE) increase as more interaction terms are added to the utility function. For example, the train VOT increases from 1.34 to 1.85 CHF per minute, the Swissmetro VOT increases from 1.40 to 1.51 CHF per minute, and the car VOT rises from 1.02 to 1.35 CHF per minute. Both MNL-B and MNL-C suggest that the VOT of train is higher than that of Swissmetro or car. MNL-C also gives higher average VOHEs for both train and Swissmetro than MNL-B or MNL-A.

TasteNet-MNL gives the largest average VOT and VOHE among all models (Table 12). Its average VOT estimates for train, Swissmetro and car are 26%, 17% and 24% higher, respectively, than those predicted by MNL-C. Its average VOHE estimates for train and Swissmetro are 25% and 67% higher than those estimated by MNL-C.

We further investigate where the higher average VOTs come from. We plot histograms of each type of taste parameter for each model (Figure 9). As interactions are incrementally added from MNL-A to MNL-C, the model captures more taste variation. Compared to the MNLs with linear utilities, TasteNet-MNL discovers a wider range of taste variation; in particular, the VOTs and VOHEs for all travel modes have longer tails on the high end of WTP. Based on the synthetic data results, we have reason to believe that TasteNet-MNL's superior predictability is a result of its more accurate estimates of individual tastes.

Table 13: Example Person Selected for Comparing Taste Functions by Model

Characteristics          Value
z fixed:   MALE          Male
           AGE           (39, 54]
           PURPOSE       Commute
           WHO           Self
           LUGGAGE       One piece
           GA            Yes
           FIRST         No
z varied:  INCOME        0: under 50, 1: 50 to 100, 2: over 100

5.3.3. Taste Functions

An interpretable model should provide a reasonable functional relationship between inputs and choice outcomes at the disaggregate level. We propose a diagnostic tool for model interpretability: visualizing the taste functions.

Each model provides a taste function that maps individual characteristics to a taste value (e.g. VOT, VOHE). We want to check whether TasteNet-MNL learns sensible taste functions at the individual level, in comparison to the benchmark MNLs. Since the function input z is multi-dimensional, we cannot visualize the functions directly. Instead, we pick an individual with characteristics z, vary one dimension z_i while keeping the other dimensions z_{j≠i} fixed, and plot a particular taste parameter as a function of z_i: β_k = f_model(z_i; z_{j≠i}).

For example, we pick a person with the characteristics shown in Table 13. We vary this person's income and ask each model a question, e.g., what is the VOT for such a person as his income varies? We compare the answers given by the different models. Figure 10 shows the VOTs and VOHEs estimated by the different models versus income.

Compared with the benchmark MNLs, the VOTs and VOHEs estimated by TasteNet-MNL all fall within credible ranges, although the estimates differ to varying degrees. The Swissmetro VOT estimates are not very different between TasteNet-MNL and MNL-C. For the train VOT, TasteNet-MNL gives smaller estimates than MNL-C for all three income groups. The car VOT estimated by TasteNet-MNL is higher for the higher income groups and lower for the lowest income group. With respect to the VOHEs, TasteNet-MNL gives higher estimates for train and lower estimates for Swissmetro at all income levels.
MNL-C shows a monotonic relationship between VOT and income only for train VOT, while TasteNet-MNL identifies monotonicity for Swissmetro VOT and car VOT.

As we do not know the ground truth, the interpretability and credibility of the model inevitably depend on expert knowledge and judgment. We draw many individual cases, visualize the taste functions, and compare across models. Overall, the taste parameters given by TasteNet-MNL fall within a similar range as the MNLs'. Yet a particular taste of a specific individual given by TasteNet-MNL can agree with or differ from the MNLs'. Based on TasteNet-MNL's better predictability, we trust that it gives more accurate taste parameters for individuals.

5.3.4. Elasticity

To evaluate model interpretability, we also compare elasticities derived from different models. First, we apply each model to calculate the disaggregate point elasticity of Swissmetro mode choice with respect to Swissmetro travel time for each observation (Eqns. 14 and 15). TasteNet-MNL's individual elasticity estimates differ from MNL-C's by 0.2287 on average.

With the individual elasticities, we compute the aggregate elasticity, which measures a group of decision-makers' response to an incremental change in a variable. It is defined in Eqn. 16 as the percentage change in the expected share of the group choosing alternative i, W(i), with respect to a one-percent change in variable x_ki. It is equivalent to a weighted average of the individual elasticities using the choice probabilities as weights:

E^{W(i)}_{x_{ki}} = \frac{\partial W(i)}{\partial x_{ki}} \cdot \frac{x_{ki}}{W(i)} = \frac{\sum_n P_n(i)\, E^{P_n(i)}_{x_{kin}}}{\sum_n P_n(i)}    (16)

The aggregate elasticities of Swissmetro mode share with respect to Swissmetro travel time are -0.43, -0.45, and -0.41 for MNL-A, MNL-B, and MNL-C, compared to -0.437 for TasteNet-MNL. We further compare aggregate elasticities by group, such as income (Table 14). TasteNet-MNL suggests higher elasticities for the low-income and high-income groups than MNL-C. Overall, TasteNet-MNL gives choice elasticities close to the MNLs' and within a reasonable range.

Table 14: Aggregate Choice Elasticity of Swissmetro w.r.t. Time by Income Group

               INCOME0   INCOME1   INCOME2
MNL-A          -0.3765   -0.4297   -0.4706
MNL-B          -0.3975   -0.3923   -0.5329
MNL-C          -0.3759   -0.3706   -0.4653
TasteNet-MNL   -0.4200   -0.3982   -0.4810
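As a concrete reading of Eqn. 16, the following sketch computes the aggregate elasticity as the probability-weighted average of individual elasticities. The point-elasticity helper uses the standard logit result for a linear-in-attributes utility; it stands in for the paper's Eqns. 14 and 15, which are not reproduced in this section, and the names in the usage comment are hypothetical.

```python
import numpy as np

def mnl_point_elasticity(beta_k, x_kin, p_ni):
    """Disaggregate point elasticity of P_n(i) w.r.t. attribute x_kin
    for a logit model with linear-in-attributes utility (the textbook
    result, used here as a stand-in for Eqns. 14-15)."""
    return beta_k * x_kin * (1.0 - p_ni)

def aggregate_elasticity(p_i, e_i):
    """Eqn (16): elasticity of the expected share W(i), computed as
    the choice-probability-weighted average of individual elasticities.

    p_i : predicted probabilities P_n(i), one per decision-maker
    e_i : individual point elasticities E^{P_n(i)}_{x_kin}
    """
    p_i, e_i = np.asarray(p_i), np.asarray(e_i)
    return np.sum(p_i * e_i) / np.sum(p_i)

# Usage sketch for Table 14's group-level figures (names hypothetical):
# mask = income_group == 0
# agg_sm_time = aggregate_elasticity(p_sm[mask], e_sm_time[mask])
```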
6. Conclusions & Discussions
In this paper, we embed a neural network into a logit model to flexibly represent taste heterogeneity while keeping model interpretability. Departing from a traditional either-or approach, we integrate neural networks and DCMs to take advantage of both.

On synthetic data, we show that TasteNet-MNL can learn the true nonlinear taste function. As a result, it reaches the same level of accuracy as the true model in predicting choices and economic indicators. Exemplary MNLs and random coefficient logits (RCLs) with misspecified utilities result in large parameter bias, less accurate predictions, and misleading interpretations. In an application to the Swissmetro dataset, TasteNet-MNL not only predicts more accurately on unseen data, but also provides interpretable indicators for policy analysis: individual-level VOTs and elasticities derived from TasteNet-MNL are comparable to the results of the benchmarking MNLs. TasteNet-MNL discovers a greater range of taste variation in the population than the benchmarking MNLs. The average VOT estimates by TasteNet-MNL are higher than the MNLs', due to the longer tails on the high end of willingness-to-pay. Based on its superior predictability, we believe that TasteNet-MNL provides more accurate estimates of tastes and elasticities.

Through this case, we show that neural networks and DCMs can complement each other well. A neural network can learn complex functions from data and reduce the bias of a manual specification. TasteNet-MNL can be used alongside DCMs to detect potential misspecification. The theory and domain knowledge of a DCM can guide the neural network to output meaningful results. A high-level idea behind TasteNet-MNL is to assign the more complex or unknown part of the model (e.g., taste heterogeneity) to a neural network (data-driven), and keep the well-understood part (e.g., trade-offs between alternative attributes) parametric (theory-driven). This idea can be applied to other settings. For example, in a latent class choice model, we usually lack good prior knowledge of the class membership model specification; a neural network can be utilized to learn the latent class structure.

TasteNet-MNL is distinguished from previous studies in several ways. First, it generalizes the L-MNL (Sifringer et al., 2018): instead of learning only the residual of the utility, the neural network learns the complicated interactions between characteristics and attributes. Second, unlike the majority of neural network applications to discrete choice, TasteNet learns a representation of taste rather than utility. This gives us direct control over parameters that carry behavioral meanings, such as values of time, as they are no longer parameters to estimate but intermediate outputs to predict. Third, we recognize the necessity of incorporating domain knowledge to obtain interpretable results from neural networks. We introduce parameter constraints as a regularization strategy to combat over-parameterization, a common issue with insufficient data and a cause of large estimation variance.
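To make this division of labor concrete, here is a minimal sketch of the neural-embedded structure in PyTorch. The layer sizes, the softplus sign constraint, and the cost-normalized utility are illustrative assumptions rather than the paper's exact specification; the point is that the network outputs taste parameters, which then enter a manually specified MNL utility, and the whole model is estimated jointly by maximum likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TasteNetMNL(nn.Module):
    """Sketch: a feed-forward TasteNet maps characteristics z to taste
    parameters, which enter a hand-specified MNL utility over the
    alternative attributes. Dimensions and utility form illustrative."""

    def __init__(self, n_char, n_alt, hidden=16):
        super().__init__()
        self.tastenet = nn.Sequential(
            nn.Linear(n_char, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_alt),  # one time coefficient per mode
        )

    def forward(self, z, time, cost):
        # Data-driven part: tastes as flexible functions of z; the
        # -softplus is a parameter constraint (expert knowledge) that
        # keeps the time coefficient negative.
        beta_time = -F.softplus(self.tastenet(z))      # (batch, n_alt)
        # Theory-driven part: linear-in-attributes utility with the
        # cost coefficient normalized to -1, so |beta_time| reads as VOT.
        v = beta_time * time - cost                    # (batch, n_alt)
        return F.log_softmax(v, dim=1)                 # log P(choice)

# Joint estimation minimizes the negative log-likelihood of choices:
# model = TasteNetMNL(n_char=8, n_alt=3)
# loss = F.nll_loss(model(z, time, cost), chosen_alt)
```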
There are several limitations and open questions for future research. First, the current TasteNet-MNL model only accommodates systematic taste variation. Random taste heterogeneity is an important source of heterogeneity, and how to model distributions of taste parameters with neural networks is an intriguing question for future research. Han (2019) proposes a neural-network-embedded latent class choice model as one way to represent random heterogeneity. Future work can develop a neural-embedded continuous mixed logit model.

Second, TasteNet-MNL focuses on modeling taste heterogeneity. Nonlinear effects of attributes have been observed empirically (Monroe, 1973, Gupta and Cooper, 1992, Kalyanaram and Little, 1994). Nonlinear effects, such as saturation and threshold effects, are explained by prospect theory and assimilation-contrast theory (Kahneman and Tversky, 1979, Winer, 1986, 1988). Future work may extend the TasteNet-MNL model to reflect nonlinearity in attributes.

Third, more synthetic data scenarios and benchmark models can be examined. Other forms of nonlinearity can be used to test whether TasteNet-MNL can capture them. More benchmarks, such as the latent class choice model and random coefficient logits with other distributional assumptions and systematic utilities, can be compared against. It is inconclusive whether DCMs that incorporate random heterogeneity predict better or worse than a TasteNet-MNL with only systematic taste variation. Most likely, this varies across cases, depending on the magnitude of systematic versus random taste variation in the data. Future work can conduct a comprehensive comparison of TasteNet-MNL and DCMs.

Lastly, we suggest comparing TasteNet-MNL with DCMs in various empirical settings, in terms of prediction performance and behavioral interpretations. As we find in the Swissmetro case study, TasteNet-MNL suggests higher average VOTs and greater variety in tastes. Future research can determine whether this holds in general, and how it would affect aggregate forecasts and scenario analysis. We expect that the integrated model can help discover new insights about behavior and improve model predictability.

Figure 9: Population Taste Distributions by Models (Swissmetro Dataset)

References

D. Agrawal and C. Schorling. Market share forecasting: An empirical comparison of artificial neural networks and multinomial logit model. Journal of Retailing, 72(4):383–407, 1996.
Y. Bentz and D. Merunka. Neural networks and the multinomial logit for brand choice modelling: a hybrid approach. Journal of Forecasting, 19(3):177–200, 2000.
M. Bierlaire. Swissmetro, 2018. URL http://transp-or.epfl.ch/documents/technicalReports/CS_SwissmetroDescription.pdf.
G. E. Cantarella and S. de Luca. Multilayer feedforward networks for transportation mode choice analysis: An analysis and a comparison with random utility models. Transportation Research Part C: Emerging Technologies, 13(2):121–155, 2005.
M. De Carvalho, M. Dougherty, A. Fowkes, and M. Wardman. Forecasting travel demand: a comparison of logit and artificial neural network methods. Journal of the Operational Research Society, 49(7):717–722, 1998.
S. Gupta and L. G. Cooper. The discounting of discounts and promotion thresholds. Journal of Consumer Research, 19(3):401–411, 1992.
Y. Han. Neural-embedded Choice Models. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, August 2019.
D. A. Hensher and T. T. Ton. A comparison of the predictive potential of artificial neural networks and nested logit models for commuter mode choice. Transportation Research Part E: Logistics and Transportation Review, 36(3):155–172, 2000.
H. Hruschka, W. Fettes, M. Probst, and C. Mies. A flexible brand choice model based on neural net methodology: a comparison to the linear utility multinomial logit model and its latent class extension. OR Spectrum, 24(2):127–143, 2002.
H. Hruschka, W. Fettes, and M. Probst. An empirical comparison of the validity of a neural net based multinomial logit choice model to alternative model specifications. European Journal of Operational Research, 159(1):166–180, 2004.
D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
G. Kalyanaram and J. D. C. Little. An empirical analysis of latitude of price acceptance in consumer package goods. Journal of Consumer Research, 21(3):408–418, 1994.
D. Lee, S. Derrible, and F. C. Pereira. Comparison of four types of artificial neural network and a multinomial logit model for travel mode choice modeling. Transportation Research Record, 2672(49):101–112, 2018. URL https://doi.org/10.1177/0361198118796971.
D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, 1973.
A. Mohammadian and E. J. Miller. Nested logit models and artificial neural networks for predicting household automobile choices: Comparison of performance. Transportation Research Record, 1807(1):92–100, 2002. URL https://doi.org/10.3141/1807-12.
K. B. Monroe. Buyers' subjective perceptions of price. Journal of Marketing Research, 10(1):70–80, 1973. URL https://doi.org/10.1177/002224377301000110.
D. Nam, H. Kim, J. Cho, and R. Jayakrishnan. A model based on deep learning for predicting travel mode choice. In Proceedings of the Transportation Research Board 96th Annual Meeting, Transportation Research Board, Washington, DC, USA, pages 8–12, 2017.
H. Omrani. Predicting travel mode of individuals by machine learning. Transportation Research Procedia, 10:840–849, 2015.
D. Shmueli, I. Salomon, and D. Shefer. Neural network analysis of travel behavior: evaluating tools for prediction. Transportation Research Part C: Emerging Technologies, 4(3):151–166, 1996.
B. Sifringer, V. Lurkin, and A. Alahi. Let me not lie: Learning multinomial logit. arXiv preprint arXiv:1812.09747, 2018.
C. Torres, N. Hanley, and A. Riera. How wrong can you be? Implications of incorrect utility function specification for welfare measurement in choice experiments. Journal of Environmental Economics and Management, 62(1):111–121, 2011.
S. van Cranenburgh and A. Alwosheel. An artificial neural network based approach to investigate travellers' decision rules. Transportation Research Part C: Emerging Technologies, 98:152–166, 2019.
M. van der Pol, G. Currie, S. Kromm, and M. Ryan. Specification of the utility function in discrete choice experiments. Value in Health, 17(2):297–301, 2014.
S. Wang and J. Zhao. Using deep neural network to analyze travel mode choice with interpretable economic information: An empirical example. arXiv preprint arXiv:1812.04528, 2018.
S. Wang and J. Zhao. Multitask learning deep neural network to combine revealed and stated preference data. arXiv preprint arXiv:1901.00227, 2019.
P. M. West, P. L. Brockett, and L. L. Golden. A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Science, 16(4):370–391, 1997.
R. S. Winer. A reference price model of brand choice for frequently purchased products. Journal of Consumer Research, 13(2):250–256, 1986.
R. S. Winer. Behavioral perspective on pricing. In Issues in Pricing, pages 35–57. Lexington Books, 1988.
X. Zhao, X. Yan, A. Yu, and P. Van Hentenryck. Modeling stated preference for mobility-on-demand transit: A comparison of machine learning and logit models. arXiv preprint arXiv:1811.01315, 2018.

Appendix A. Synthetic Data Generation

We first draw input characteristics z according to the assumed input distributions in Table A1. Alternative attributes cost and time are drawn from the ranges described in Table A1. With the true model, we compute choice probabilities for each individual. Finally, we draw a chosen alternative for each individual according to the choice probabilities predicted by the true model. We generate 6000, 2000, and 2000 examples for the training, development, and test data, respectively. The training data is used for model estimation. The development set is used for selecting hyper-parameters. The test set is not used in training or selection; it evaluates model generalization ability.

Table A1: Description of Input Variables in Synthetic Data

Variable            Description                        Distribution
Characteristics z
  z_inc             Income ($ per minute)              LogNormal(log(0.5), 0.25) for full-time;
                                                       LogNormal(log(0.25), 0.2) for not full-time
  z_full            Full-time worker (1=yes, 0=no)     Bern(0.5)
  z_flex            Flexible schedule (1=yes, 0=no)    Bern(0.5)
Attributes x
  x_cost            Cost ($)                           0.2 to 40 $
  x_time            Travel time (minutes)              1 to 90 minutes
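The generation procedure can be summarized in a short script. This is a sketch under stated assumptions: `true_utility` is a trivial placeholder for the paper's actual nonlinear utility (defined in the main text), and a two-alternative design is assumed for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(inc, full, flex, cost, time):
    # Trivial placeholder; the paper's true nonlinear utility (with
    # its taste interactions) is specified in the main text.
    vot = 0.5 + 0.25 * full + 0.25 * flex        # hypothetical $/min
    return -(cost + vot[:, None] * time)

def draw_dataset(n, n_alt=2):
    """Appendix A procedure: draw z and x per Table A1, compute true
    choice probabilities, then sample one chosen alternative each."""
    full = rng.binomial(1, 0.5, n)               # z_full ~ Bern(0.5)
    flex = rng.binomial(1, 0.5, n)               # z_flex ~ Bern(0.5)
    inc = np.where(full == 1,                    # z_inc per Table A1
                   rng.lognormal(np.log(0.5), 0.25, n),
                   rng.lognormal(np.log(0.25), 0.2, n))
    cost = rng.uniform(0.2, 40.0, (n, n_alt))    # x_cost in $
    time = rng.uniform(1.0, 90.0, (n, n_alt))    # x_time in minutes

    v = true_utility(inc, full, flex, cost, time)
    p = np.exp(v - v.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)            # logit probabilities
    choice = np.array([rng.choice(n_alt, p=pi) for pi in p])
    return inc, full, flex, cost, time, choice

# 6000 / 2000 / 2000 draws for the training / development / test sets:
# train, dev, test = draw_dataset(6000), draw_dataset(2000), draw_dataset(2000)
```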