A Neural-embedded Choice Model: TasteNet-MNL, Modeling Taste Heterogeneity with Flexibility and Interpretability
Yafei Han a, Christopher Zegras b, Francisco Camara Pereira c, Moshe Ben-Akiva a
a MIT, Civil and Environmental Engineering; b MIT, Department of Urban Studies and Planning; c Technical University of Denmark, School of Management
Corresponding author. Email: [email protected]
ABSTRACT
Discrete choice models (DCMs) and neural networks (NNs) can complement each other. We propose a neural network embedded choice model, TasteNet-MNL, to improve the flexibility in modeling taste heterogeneity while keeping model interpretability. The hybrid model consists of a TasteNet module, a feed-forward neural network that learns taste parameters as flexible functions of individual characteristics, and a choice module, a multinomial logit model (MNL) with a manually specified utility. TasteNet and MNL are fully integrated and jointly estimated. By embedding a neural network into a DCM, we exploit a neural network's function approximation capacity to reduce specification bias. Through special structure and parameter constraints, we incorporate expert knowledge to regularize the neural network and maintain interpretability.

On synthetic data, we show that TasteNet-MNL can recover the underlying nonlinear utility function and provide predictions and interpretations as accurate as the true model's, while examples of logit and random coefficient logit models with misspecified utility functions result in large parameter bias and low predictability. In a case study of Swissmetro mode choice, TasteNet-MNL outperforms the benchmark MNLs in predictability, and discovers a wider spectrum of taste variations within the population and higher average values of time. This study takes an initial step towards a framework that combines theory-based and data-driven approaches to discrete choice modeling.
KEYWORDS
Neural Network; Interpretability; Flexible Utility; Taste Heterogeneity; Discrete Choice Model
1. Introduction
Discrete choice models (DCMs) provide a powerful econometric framework to understand and predict choice behaviors. The majority of DCMs are Random Utility Models (RUM), derived under the utility-maximization decision rule (McFadden, 1973). Rooted in theory, DCMs have the advantage of interpretability: they can explain why and how individuals choose among a set of alternatives, and provide credible answers to "what-if" scenario questions. DCMs have been the predominant approach for consumer choice analysis and are widely applied in areas such as transportation planning and marketing.

A DCM requires the model specification to be known a priori. The utility function is a primary component of the model specification. The systematic part of the utility describes how a choice-maker values each attribute of an alternative ("taste"), and how tastes vary systematically across choice-makers ("taste heterogeneity"). When the underlying relationships are nonlinear, coming up with an accurate utility specification can be difficult. Misspecified utility functions lead to biased parameter estimates, lower predictability, and wrong interpretations (Bentz and Merunka, 2000, Torres et al., 2011, van der Pol et al., 2014).

Although nonlinear functions (e.g. higher-order polynomials, semi-log transforms, piecewise linear forms) can be employed, they also require correct assumptions about the functional form. Statistical tests are routinely used to select models. However, it is difficult to test all possible specifications with a fair number of covariates, and the true specification may not be covered. Model uncertainty has been a persistent concern for model developers and users, which motivates research on data-driven approaches to learning the utility specification.

Machine learning (ML), often viewed as a collection of data-driven methods, can exploit the rich information in large raw datasets. ML requires fewer a priori theories. Its primary focus is prediction accuracy rather than interpretability. Neural networks, a popular class of ML algorithms, have recently achieved remarkable breakthroughs in various domains, such as computer vision, natural language processing, and speech recognition. On complicated tasks, deep neural networks that require no domain knowledge surpass traditional ML methods that rely heavily on feature engineering.

The success of neural networks is attributed to their capacity to learn highly complex functions, enabled by large datasets, advanced optimization techniques, and increased computational capacity. Given our limited knowledge of the true utility function, could we utilize a neural network to unravel the complexity in the data? Can we bring neural networks to DCMs in ways that enhance the flexibility of model specification, reduce potential bias, and improve predictability?

Current neural network applications to discrete choice problems focus on prediction, with some exceptions (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000, Hruschka et al., 2002, Sifringer et al., 2018, van Cranenburgh and Alwosheel, 2019). A majority of the studies find that neural networks can outperform DCMs in various contexts regarding prediction accuracy. A major criticism of the neural network approach is its lack of interpretability.

Interpretability is crucial for high-stakes decisions in transportation planning, such as infrastructure investment and congestion pricing. Planners rely on models to give reliable answers to "what-if" questions at the disaggregate level; for example, how will a specific market segment respond to a toll increase or a new subway line? To support policy decisions, prediction accuracy alone is not enough: a model must represent the true relationships between explanatory variables and choice outcomes.

Although a neural network can provide utility interpretations and economic indicators (e.g. elasticity, willingness-to-pay) equivalent to a DCM's, its estimation results suffer from large variances across runs, and a particular model run can generate unrealistic behavioral indicators (Wang and Zhao, 2018).
How can we make neural networks learn interpretable results that can support planning decisions?

We propose to integrate neural networks and DCMs to benefit from both: the flexibility of neural networks and the interpretability of DCMs. We name such models neural embedded choice models. Extending the work by Sifringer et al. (2018), we propose a neural network embedded multinomial logit (MNL) model, TasteNet-MNL. Specifically, we employ a neural network (TasteNet) to model tastes as flexible functions of individual characteristics. The taste parameters predicted by TasteNet are embedded in a parametric MNL to compute choice probabilities and the likelihood. Parameters of the two parts are jointly estimated by maximum likelihood.

By embedding a neural network in an MNL, we enhance the flexibility of the model to represent systematic taste heterogeneity, which can reduce bias from manual specification. By bringing a parametric MNL to a neural network, we incorporate expert knowledge and constrain the neural network to generate outputs with designated meanings. Using both synthetic and real datasets, we demonstrate the effectiveness of TasteNet-MNL. The source code is made publicly available at https://github.com/YafeiHan-MIT/TasteNet-MNL.

The rest of this paper is organized as follows. Section 2 reviews previous neural network applications to discrete choice and the challenges. Section 3 describes our model structure and estimation method. Section 4 reports experiments and results on synthetic data. Section 5 applies TasteNet-MNL to the Swissmetro dataset and compares it with MNL benchmarks. Lastly, we summarize the key contributions and discuss limitations and future work.
2. Literature Review
Neural Networks for Choice Prediction
Empirical studies have compared DCMs with neural networks (NNs) on various choice problems, such as travel mode choice (De Carvalho et al., 1998, Hensher and Ton, 2000, Cantarella and de Luca, 2005, Nam et al., 2017, Lee et al., 2018, Zhao et al., 2018), vehicle ownership choice (Mohammadian and Miller, 2002), and brand choice (Agrawal and Schorling, 1996, Bentz and Merunka, 2000, Hruschka et al., 2002, 2004). Most of the early applications choose a feed-forward network (FFN) with one or two hidden layers, because more layers cause over-fitting and computational challenges. FFNs are compared to DCM structures, including logit (Agrawal and Schorling, 1996, West et al., 1997, Omrani, 2015, Lee et al., 2018), nested logit (Hensher and Ton, 2000, Mohammadian and Miller, 2002, Cantarella and de Luca, 2005), cross-nested logit (Cantarella and de Luca, 2005), and mixed logit (Zhao et al., 2018). Shallow FFNs achieve higher predictability than DCMs in most cases (Agrawal and Schorling, 1996, West et al., 1997, De Carvalho et al., 1998, Mohammadian and Miller, 2002, Cantarella and de Luca, 2005, Omrani, 2015, Lee et al., 2018).

Inspired by the success of deep learning in other domains, recent studies attempt deep neural networks (DNNs) for discrete choice (Nam et al., 2017, Wang and Zhao, 2019). The results are somewhat disappointing. Nam et al. (2017) apply several deep learning techniques (drop-out, initialization, stochastic gradient descent) to train an FFN with 4 hidden layers (they call it a "DNN" because it applies deep learning techniques, not because the network is deep). Surprisingly, their DNN gives almost the same predicted log-likelihood as a nested logit and a cross-nested logit model. Wang and Zhao (2019) compare a DNN with a nested logit model for mode choice using a stated-preference survey. Despite an extensive hyper-parameter search, the best DNN does not match the nested logit model, and the conventional FFN performs the worst; however, the optimal hyper-parameters, such as the hidden layer size, may not be found without a hyper-parameter search, so it remains inconclusive which type of model predicts better. The authors highlight the importance of finding the right hyper-parameters for a DNN to predict as well as, if not better than, DCMs.

So far, DNNs have not worked as effectively for discrete choice as expected, perhaps due to small data, over-fitting, or the difficulty of finding the right set of hyper-parameters. As we show later, with special structure and parameter constraints, model predictability can be improved even with only one hidden layer.
Learning Nonlinear Utility with Neural Networks
While most studies focus on comparing prediction performance with only a brief explanation of why, a few dig into how and under what circumstances (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000). These studies seek to understand, from a behavioral perspective, whether a neural network can discover the true behaviors, which can be different from or more complex than our assumptions, and if so, how to derive such knowledge.

A series of studies conduct Monte-Carlo experiments to show that a neural network can capture nonlinearity in utility functions (West et al., 1997, De Carvalho et al., 1998, Bentz and Merunka, 2000). Nonlinearity may reflect saturation or threshold effects of attributes on utility, or non-compensatory decision rules. For example, West et al. (1997) find that NNs consistently outperform logit and discriminant analysis when predicting the outcome of a non-compensatory choice rule. Bentz and Merunka (2000) show the analogy between an NN and MNL, with a hidden-layer NN being a more general version of MNL. With synthetic data and an empirical study, they show that an NN can detect interaction and threshold effects in utility, and can therefore be used as a diagnostic tool to improve MNL utility specification. This sequential approach requires manual analysis of NN results to identify the nonlinear effects, and thus applies only to simple problems. Nevertheless, their idea inspires a recent study by Sifringer et al. (2018) to integrate the two.

Hruschka et al. (2002) compare an NN with an MNL and a Latent Class Logit (LCL) model in an empirical study of brand choice. They find the NN model can identify interaction effects, threshold effects, saturation effects and other nonlinear forms (like an inverse S-shape) of attributes on brand utility. The NN also implies elasticities different from the MNL or LCL; the MNL sometimes gives wrong signs for elasticity due to its simplistic linear form. The NN predicts better on hold-out data than the MNL or LCL. A follow-up study by Hruschka et al. (2004) compares an NN with two other MNLs with flexible systematic utility, and draws similar conclusions.

To summarize, these studies show that an NN can outperform an MNL when the nonlinearity in attributes is neglected or mistaken. However, these studies have not addressed nonlinearity in taste, nor compared NNs with more advanced DCMs, e.g. random coefficient logit. They consider the NN either as an alternative to MNL, or as a diagnostic tool to improve the utility specification of an MNL, which works only for simple problems. Our study complements previous work in that we focus on modeling nonlinearity in taste with neural networks. Also, beyond an either-or or a sequential approach, our integrated model achieves both flexibility and partial interpretability.
The Interpretability Challenge
Being able to predict well and capture nonlinear utility is not sufficient for policy-scenario analysis, which is a great advantage of behavioral models based on theory and prior knowledge. A major criticism of neural networks is their lack of interpretability. As this popular term is not clearly defined in the literature despite its wide usage, we summarize the common understandings of interpretability in the following aspects.

The first is parameter-level interpretability. Clearly, individual weights of a neural network do not carry specific meanings (Agrawal and Schorling, 1996, Shmueli et al., 1996). In contrast, the parameters of logit models can be directly interpreted as the marginal effect of an attribute (or "taste").

Another, stricter definition has to do with how a model is specified: based on external knowledge, or learned from data. Statistical choice models clearly map the relationship between input and output with a theory behind it. In neural networks, the relationships are learned from data by arbitrary functions. Even if an NN mimics the true functions, this by itself does not provide a theory of why inputs lead to choice outcomes. By either of the first two criteria, an NN model is not interpretable. However, these two definitions are not meaningful measures of model usability.

The third view of interpretability, by Sifringer et al. (2018), is "the ability of the model to recover the true parameters' values of the variables that enter the interpretable part of the utility functions". This definition focuses on obtaining unbiased model estimates for the interpretable part. However, the unknown part of the utility, modeled by a black box, can still give uninterpretable answers to "what-if" questions, and the division between the interpretable and uninterpretable parts of the utility function is subjective.

Perhaps the most popular view of interpretability is the model's ability to derive behavioral indicators, such as elasticity, willingness-to-pay (WTP), and marginal rate of substitution (MRS). Studies that claim ML or NN model interpretability are mostly based on this criterion (Wang and Zhao, 2018, Sifringer et al., 2018, Zhao et al., 2018). Extracting behavioral indicators from neural networks is simple. Bentz and Merunka (2000) show the similarity between an MNL and a feed-forward neural network with no hidden layer and Softmax activation: the systematic utilities correspond to the output values before the Softmax activation. We can plot utility versus inputs to obtain marginal effects (Bentz and Merunka, 2000, Hruschka et al., 2004). Choice elasticities and other economic indicators can be computed analytically (Hruschka et al., 2002, 2004) or numerically by simulation (Wang and Zhao, 2018).

We consider this definition insufficient, because a model that gives unreasonable behavioral indicators is not interpretable. Wang and Zhao (2018) show that an individual NN estimation can generate unreasonable economic indicators: a choice probability may fail to decrease monotonically as cost increases and can be highly sensitive to a particular model run; the derivatives of choice probabilities with respect to cost and time can be positive; and values of time can be negative, zero, arbitrarily large, or infinite. They conclude that neural-based choice models generate reasonable economic information only at the aggregate level, either through model ensembles or population averages, due to the challenge of irregular probability fields and large estimation errors.
However, scenario analysis and policy decisions depend on answers to "what-if" questions at the disaggregated level (e.g. a particular market segment).

The definition of interpretability is to some extent subjective and ultimately a philosophical question. We propose a definition close to the popular view but with extra conditions:
A model is interpretable if, at a disaggregated level, it is able to give credible answers to "what will happen if" and "but for" questions.
Compared to the popular definition, we emphasize the credibility of the economic indicators and interpretability at the disaggregated (both model and choice-maker) level. By "credible", we mean the answer should conform with a set of prior knowledge, for example, non-positive choice elasticity with respect to cost and non-negative values of time. Prior knowledge, however, can change over time and vary across application contexts.

A fundamental challenge for a neural network to be interpretable is that many networks may exist that fit the data equally well, but not all of them support reasonable behavioral insights. We propose imposing special structure and constraints that reflect expert knowledge on the neural network. We show that the proposed model obtains reasonable behavioral indicators at the disaggregated level, and that predictability does not necessarily come at the cost of interpretability.
Direct Precedents: Integrating Neural Networks with Discrete Choice Models
Recent studies attempt to create a synergy between statistical DCMs and NNs through a hybrid structure. The idea of a hybrid approach dates back to Bentz and Merunka (2000), who propose using an NN as a diagnostic tool to detect nonlinear effects. The main drawbacks of this approach are its sequential nature and its ineffectiveness for large problems.

The Learning-MNL (L-MNL) proposed by Sifringer et al. (2018) is, as far as we know, the first example of a neural embedded choice model. In an L-MNL, the systematic utility is divided into an "interpretable" part, which is manually specified, and a "representation" part, a nonlinear representation learned by a neural network. The unknown part captures the effects of the unused features. This model structure is inspiring but has some limitations. First, the variables in the interpretable part and the representation part are mutually exclusive sets. The authors' motive is to make the interpretable utility obtain stable estimates, as the NN can overpower the logit model and cause unstable estimates. Second, this model assumes that variables in the representation part have no interactions with those in the interpretable part, since the two parts of the utility are added with no overlapping variables. Essentially, the gain of an L-MNL comes from a flexible representation of the alternative specific constants (ASCs): L-MNL models the ASCs by a neural network as a flexible function of all the unused features. This assumption is too restrictive, since the unused features can affect not only the ASCs but also other taste parameters in the interpretable utility, such as the time coefficient. Similarly, features in the interpretable part may have unspecified nonlinear effects and can affect the ASCs. Third, the selection of which covariates enter which part of the utility function is arbitrary.

Inspired by L-MNL, we propose a more general framework to model taste heterogeneity. The proposed TasteNet-MNL differs from L-MNL and a traditional FFN in three aspects. First, we allow all or a subset of the taste parameters to be modeled by an NN as flexible functions, not just the ASCs. This enhances the flexibility to model taste heterogeneity. Second, we impose constraints on the taste parameters predicted by the neural network, as a strategy to regularize the network and obtain interpretable results. Third, we model taste parameters instead of utilities by an NN, different from a direct application of an FFN. The key idea is to assign the more complex or less known task to an NN, and keep the well-known part parametric.

3. Model Structure
For a given choice task, suppose each person n makes a single choice from a choice set C_n. (This setup can be generalized to a repeated-choice scenario; we consider a single choice to avoid cluttering the notation.) For each person n, the observed data include individual characteristics z_n, attributes x_in of each alternative i, and the chosen alternative y_n.

V_in denotes the systematic utility of alternative i for choice-maker n. If tastes are homogeneous, V_in is a function of attributes only; a simple example is the linear function in Eqn. 1. Systematic taste heterogeneity is usually specified as a group of interactions between attributes and characteristics (Eqn. 2), with the interaction effects specified according to prior assumptions and verified by statistical tests.

V_{in} = \beta_i + \sum_{k=1}^{K_i} \beta_{ki} x_{kin}  (1)

V_{in} = \beta_i + \sum_{k=1}^{K_i} \beta_{ki} x_{kin} + \sum_{(p,q) \in I_i} \gamma_{pqi} x_{pin} z_{qn}  (2)

Since how taste varies across choice-makers may not be known a priori, we propose a data-driven approach to represent systematic taste heterogeneity: a neural network (TasteNet) models the taste parameters as flexible functions of the characteristics (Eqn. 3).

\beta_n^{TN} = \mathrm{TasteNet}(z_n; w)  (3)

The inputs of TasteNet are choice-maker n's characteristics z_n. The neural network weights w are unknown parameters to estimate. The outputs of the network, β_n^TN, correspond to a full set or a subset of the coefficients of an ordinary MNL utility (e.g. Eqn. 2). The meaning of each element of β_n^TN is defined by the MNL module in which TasteNet is embedded.

We can divide the systematic utility into a parametric part and a flexible part (Eqn. 4). The parametric part is a manually specified utility function; it can include interactions and nonlinear transformations. The flexible part is a sum of alternative attributes weighted by the taste coefficients predicted by TasteNet. We can model all, a subset, or none of the taste coefficients by the neural network; each taste coefficient (e.g. the time coefficient) is either learned by the neural network or manually specified.

V_{in} = \beta_i^{TN}(z_n; w)' x_{in}^{TN} + \beta_i^{MNL\prime} f(x_{in}^{MNL}, z_n)  (4)

P(y_n = i \mid x_n, z_n, w, \beta^{MNL}) = \frac{e^{\beta_i^{TN}(z_n; w)' x_{in}^{TN} + \beta_i^{MNL\prime} f(x_{in}^{MNL}, z_n)}}{\sum_{j \in C_n} e^{\beta_j^{TN}(z_n; w)' x_{jn}^{TN} + \beta_j^{MNL\prime} f(x_{jn}^{MNL}, z_n)}}  (5)

The TasteNet module maps the characteristics z_n to the taste coefficients β_n^TN. The MNL module takes in β_n^TN and the corresponding attributes, along with the parametric utility, to compute the choice probabilities (Eqn. 5) and the likelihood.

This integrated structure achieves two goals. First, the utility specification becomes more flexible in representing systematic taste heterogeneity. Second, model interpretability is partially maintained, since each output unit of the neural network carries a behavioral meaning. Some coefficients may be subject to parameter constraints according to prior knowledge. Below we provide more details on the neural network architecture, parameter constraints, and estimation procedure.
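To make the embedding concrete, the following is a minimal PyTorch sketch of Eqns. 3-5. The class and variable names are ours, not from the released code; for simplicity, every taste coefficient is produced by TasteNet, the parametric part of Eqn. 4 is omitted, and the output transform discussed below is left out.

```python
import torch
import torch.nn as nn

class TasteNetMNL(nn.Module):
    """Sketch of Eqns. 3-5: an MLP (TasteNet) maps characteristics z_n to
    taste coefficients; the MNL module turns utilities into probabilities."""

    def __init__(self, n_char, n_taste, n_hidden=10):
        super().__init__()
        # TasteNet: one hidden layer, as in Eqn. 6 below
        self.tastenet = nn.Sequential(
            nn.Linear(n_char, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_taste),
        )

    def forward(self, z, x, avail):
        # z: (N, n_char) characteristics; x: (N, J, n_taste) attributes;
        # avail: (N, J) availability mask (1 = alternative available)
        beta = self.tastenet(z)               # beta_n^TN = TasteNet(z_n; w), Eqn. 3
        v = (beta.unsqueeze(1) * x).sum(-1)   # V_in = beta(z_n)' x_in, Eqn. 4
        v = v.masked_fill(avail == 0, -1e9)   # exclude unavailable alternatives
        return torch.log_softmax(v, dim=1)    # log P(y_n = i | .), Eqn. 5
```

A manually specified parametric utility would simply be added to `v` before the softmax, and sign constraints (see the subsection on parameter constraints) would be applied to `beta`.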
TasteNet
We choose a feed-forward neural network, also called a multi-layer perceptron (MLP), for TasteNet. An MLP consists of an input layer, one or more hidden layers, and an output layer. Essentially, an MLP is a composition of linear and nonlinear functions that maps inputs to outputs.

In an MLP with one hidden layer of H hidden units, the k-th output of the network, β_k^TN, can be written as Eqn. 6, where D is the input dimension, A^{(1)} is the hidden layer activation function, and T is the output activation function. The neural network parameters w^{(1)} and w^{(2)} are the weights from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively.

\beta_k^{TN}(z, w) = T\left[\sum_{h=1}^{H} w_{kh}^{(2)} A^{(1)}\left(\sum_{i=1}^{D} w_{hi}^{(1)} z_i + w_{h0}^{(1)}\right) + w_{k0}^{(2)}\right]  (6)

An MLP with multiple hidden layers can be denoted as MLP(L, [H_1, ..., H_L], [A^{(1)}, ..., A^{(L)}], T). We need to specify the number of hidden layers L, the size of each hidden layer H_l, the activation function of each hidden layer A^{(l)}, and the output transform function T. These hyper-parameters are selected based on the model's prediction performance on the development set.
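Written out directly, Eqn. 6 for a single person is just two affine maps separated by a nonlinearity; a numpy sketch under our own naming:

```python
import numpy as np

def tastenet_forward(z, W1, b1, W2, b2, A=np.tanh, T=lambda u: u):
    """Eqn. 6: beta_k = T( sum_h w2_kh * A( sum_i w1_hi z_i + w1_h0 ) + w2_k0 ).
    z: (D,) characteristics; W1: (H, D), b1: (H,); W2: (K, H), b2: (K,)."""
    h = A(W1 @ z + b1)      # hidden layer activations, shape (H,)
    return T(W2 @ h + b2)   # taste parameters beta^TN, shape (K,)
```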
Parameter constraints

Over-parameterization is common for neural networks, especially when the sample size is small relative to the model complexity. Adding constraints is one way to regularize a neural network. We impose constraints on the taste parameters not only to improve the model's generalization ability, but also to ensure that the taste parameters fall into a reasonable range based on expert knowledge.

A typical constraint is on the signs of parameters. For example, the coefficient of travel time or waiting time is usually negative. We incorporate sign constraints through the output transform function T. For a taste parameter β with a non-negative sign constraint, T can be the rectified linear function ReLU(β) or the exponential function exp(β). For a β with a non-positive sign constraint, T can be -ReLU(-β) or -exp(-β). For a β without constraints, T is the identity function. Such transformations redistribute the parameters into the desired range through continuous functions, resembling the exponential transform applied to scale or time coefficients in the utility specification of a DCM.

An advantage of using a transform function for sign constraints is that the constraints are strictly enforced. Other methods, such as adding a penalty for constraint violations to the learning objective, cannot enforce the constraints on unseen data.
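As an illustration (the function names are ours), note that both non-positivity transforms map any real-valued network output into (-inf, 0], so the constraint holds by construction, even on unseen inputs:

```python
import torch

def nonpositive_relu(beta_raw):
    # -ReLU(-beta): positive raw outputs are clipped to 0, negative ones pass through
    return -torch.relu(-beta_raw)

def nonpositive_exp(beta_raw):
    # -exp(-beta): strictly negative and smooth for any real input
    return -torch.exp(-beta_raw)

raw = torch.tensor([-1.5, 0.0, 2.0])
print(nonpositive_relu(raw))  # [-1.5, 0, 0]
print(nonpositive_exp(raw))   # [-4.4817, -1.0, -0.1353]
```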
Estimation

The model is estimated by optimizing a learning objective with stochastic gradient descent. The objective is to minimize a loss function: the average negative log-likelihood plus a regularization term on the p-norm of the neural network weights (Eqn. 7), which prevents over-fitting.

\min_{w,\, \beta^{MNL}} \; -\sum_n \log P(y_n \mid z_n, x_n, w, \beta^{MNL}) + \lambda_p \|w\|_p  (7)

TasteNet-MNL is trained in an integrated fashion through back-propagation. The unknown parameters to estimate include the neural network weights w and the unknown coefficients of the MNL module, β^MNL.
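A minimal training loop for Eqn. 7 could look as follows, reusing the TasteNetMNL sketch above. Adam with an L2 penalty is our concrete choice for illustration; the paper only requires stochastic-gradient optimization of the penalized negative log-likelihood.

```python
import torch

def train(model, loader, n_epochs=50, lr=1e-3, lam=1e-3):
    """Minimize average NLL + lam * ||w||_2^2 over the network weights (Eqn. 7)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    nll = torch.nn.NLLLoss()  # expects log-probabilities, as the model returns
    for _ in range(n_epochs):
        for z, x, avail, y in loader:              # mini-batches of observations
            log_p = model(z, x, avail)             # (N, J) log choice probabilities
            penalty = sum(w.pow(2).sum() for w in model.tastenet.parameters())
            loss = nll(log_p, y) + lam * penalty
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```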
4. Experiments
We generate a synthetic dataset with an underlying logit model whose utility function contains higher-order interactions between characteristics and attributes. On this synthetic data, we compare TasteNet-MNL with benchmark MNLs and random coefficient logit models (RCLs). We expect that TasteNet-MNL improves predictability, reduces bias in parameter estimates, and provides more accurate behavioral interpretations, compared to MNLs and RCLs with misspecified systematic utility.
4.1. Synthetic data
The data generation model is a binary logit, with the systematic utility of alternative i for person n defined in Eqn. 8. Explanatory variables include three characteristics: income (inc), a full-time employment dummy (full), and a flexible work schedule dummy (flex); and two alternative attributes: travel cost (cost) and travel time (time) (see Table A1 in Appendix A for details). Coefficient values are chosen to carry realistic meanings: income has a positive effect on the value of time (VOT), full-time workers have higher VOT, and people with a flexible schedule have lower VOT. The cost coefficient is fixed to -1 for both alternatives, so that the VOT can be read directly from the time coefficient. The alternative specific constant (ASC) is -0.1 for alternative 1 and 0 for alternative 0. The random component of each utility follows an Extreme Value distribution. The generated synthetic data has 14,000 observations, randomly split into training (10,000), development (2,000) and test (2,000) sets. Details about the input distributions and synthetic data generation are presented in Appendix A.

V_{in} = ASC_i - cost_{in} + (-0.1 - 0.5\, inc_n - 0.1\, full_n + 0.05\, flex_n - 0.2\, inc_n \cdot full_n + 0.05\, inc_n \cdot flex_n + 0.1\, full_n \cdot flex_n) \cdot time_{in}  (8)
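For reference, the data-generating process of Eqn. 8 takes only a few lines of numpy. This is a sketch with our own variable names; the exact input distributions are given in Appendix A of the paper, so the uniform draws below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 14_000
inc = rng.uniform(0.0, 1.0, N)             # placeholder distributions; see Appendix A
full = rng.integers(0, 2, N).astype(float)
flex = rng.integers(0, 2, N).astype(float)
time = rng.uniform(0.0, 2.0, (N, 2))       # travel times of alternatives 0 and 1
cost = rng.uniform(0.0, 2.0, (N, 2))       # travel costs of alternatives 0 and 1

# The true individual-specific time coefficient (the nonlinear taste in Eqn. 8)
b_time = (-0.1 - 0.5 * inc - 0.1 * full + 0.05 * flex
          - 0.2 * inc * full + 0.05 * inc * flex + 0.1 * full * flex)

asc = np.array([0.0, -0.1])                # ASC_0 = 0, ASC_1 = -0.1
V = asc - cost + b_time[:, None] * time    # systematic utilities, shape (N, 2)
U = V + rng.gumbel(size=V.shape)           # add i.i.d. Extreme Value noise
y = U.argmax(axis=1)                       # observed choices
```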
4.2. Models in comparison

4.2.1. Logit models

MNL-I's utility functions include only first-order interactions between characteristics and time (Eqn. 9). Compared to MNL-I, the utilities of MNL-II have one additional interaction, inc*full*time (Eqn. 10). MNL-TRUE is an MNL with the true utility specification; it differs from the ground truth only due to sampling error. In all MNLs, the alternative specific constant is fixed to 0 for alternative 0 (ASC_0 = 0).
4.2.2. Random coefficient logit models

Two random coefficient logit (RCL) benchmarks are included to test whether modeling unobserved heterogeneity can compensate for specification errors in the systematic utility. We assume the time coefficient is randomly distributed, following a Normal distribution with mean equal to a linear function of characteristics and standard deviation σ. RCL-I and RCL-II represent two variations in specifying the mean of the time coefficient (Eqn. 11 and 12), and correspond to MNL-I's and MNL-II's time coefficients, respectively.

V_i^{MNL-I} = ASC_i - cost_i + (b_0 + b_1\, inc + b_2\, full + b_3\, flex) \cdot time_i  (9)

V_i^{MNL-II} = ASC_i - cost_i + (b_0 + b_1\, inc + b_2\, full + b_3\, flex + b_4\, inc \cdot full) \cdot time_i  (10)

V_{in}^{RCL-I} = ASC_i - cost_{in} + \beta_n \cdot time_{in}, \quad \beta_n \sim N(b_0 + b_1\, inc_n + b_2\, full_n + b_3\, flex_n,\ \sigma)  (11)

V_{in}^{RCL-II} = ASC_i - cost_{in} + \beta_n \cdot time_{in}, \quad \beta_n \sim N(b_0 + b_1\, inc_n + b_2\, full_n + b_3\, flex_n + b_4\, inc_n \cdot full_n,\ \sigma)  (12)

Figure 2: Diagram of the TasteNet-MNL for the Synthetic Data

4.2.3. TasteNet-MNL

The structure of the TasteNet-MNL for the synthetic data is shown in Figure 2. The time coefficient (β_vot) is modeled by an MLP. The hyper-parameters to decide include the number of hidden layers L, the size(s) of the hidden layer(s) H_1, ..., H_L, the type of regularizer (norm p), the regularization strength λ_p, the hidden layer activation functions A_1, ..., A_L, and the output transform function T.

We train TasteNet-MNL on the training set with different combinations of hyper-parameters. We choose one hidden layer, since it is enough to reach the true model's prediction accuracy. We vary the number of hidden units from 5 to 30. For each hidden layer size, we apply an L2 regularization penalty in [0, 0.0001, 0.001, 0.01]. For the hidden layer activation function, we try ReLU and Tanh. For the output transformation, we experiment with -ReLU(-β) and -e^{-β}, to impose the non-positivity constraint on the time coefficient β_vot. For each scenario, we train the model 5 times with different random initializations.

The best hyper-parameter scenario is selected based on the lowest average negative log-likelihood (NLL) on the development set. The best TasteNet-MNL has 1 hidden layer with 7 hidden units, ReLU for the hidden layer activation, -ReLU(-β) for the output transformation, and an L2 penalty of 0.001.

Table 1: Average Negative Log-likelihood (NLL) and Prediction Accuracy (ACC) for Synthetic Data

Model                          NLL train  NLL dev  NLL test  ACC train  ACC dev  ACC test
MNL-I                          0.54102    0.55699  0.54572   0.719      0.703    0.722
MNL-II                         0.53755    0.55479  0.54695   0.717      0.706    0.724
RCL-I                          0.52591    0.54594  0.52758   0.718      0.703    0.724
RCL-II                         0.52298    0.54323  0.52808   0.719      0.701    0.723
TasteNet-MNL (H=7, λ=0.001)    0.45433    0.46803  0.46562   0.785      0.775    0.786
MNL-TRUE                       0.45459    0.47268  0.45979   0.786      0.773    0.785
Data generation model          0.45502    0.47186  0.45877   0.786      0.772    0.787

4.3. Results
We compare MNLs, RCLs and TasteNet-MNL regarding predictability, parameter bias, and interpretability.

4.3.1. Predictability

We measure model predictability by the average negative log-likelihood (NLL) and prediction accuracy (ACC) on the training, development and test data. NLL is the total negative log-likelihood in Eqn. 7 divided by the number of observations; the higher the NLL, the poorer the model fit. Prediction accuracy is the percentage of correct predictions. Table 1 summarizes the prediction performance of the different models.

The MNL with the correct utility specification (MNL-TRUE) achieves the same NLL and ACC as the data generation model. MNL-I and MNL-II result in a higher NLL (0.54-0.56) than MNL-TRUE (0.45-0.47), and lower prediction accuracy (70%-72%) than MNL-TRUE (77%-79%). Compared to MNL-I, MNL-II's utility includes one more interaction, inc*full*time, which has the largest effect (-0.2) among the three interaction terms missing from MNL-I. However, MNL-II's model fit and prediction accuracy do not improve significantly. This indicates that prediction accuracy can be sensitive to the systematic utility specification: a seemingly small misspecification can cause significant prediction errors, and poor predictability can be a sign of model misspecification.

Compared to MNL-I and MNL-II, RCL-I and RCL-II both achieve a better log-likelihood fit, because part of the missing terms is absorbed as random heterogeneity. However, the RCLs do not improve choice prediction accuracy.

The best TasteNet-MNL achieves the same predictability as the data generation model. We give minimal instructions to the model: 1) choice-makers make trade-offs between time and cost; and 2) the value of time depends on individual characteristics. We do not specify in detail how the value of time varies across individuals; instead, we let the neural network learn the value of time as a function of individual characteristics. You may wonder: does TasteNet-MNL recover the true utility function? Is the prediction performance a result of learning the correct utility function?
4.3.2. Parameter bias

Table 2: Parameter Estimates by MNLs, RCLs and TasteNet-MNL Compared to the Truth

Coef             MNL-I    MNL-II   RCL-I    RCL-II   TasteNet-MNL a  MNL-TRUE  Truth
ASC              -0.1484  -0.1085  -0.141   -0.141   -0.1055         -0.1003   -0.1
time             -0.0998  -0.1233  -0.0914  -0.139   -0.1056         -0.0927   -0.1
inc*time         -0.5983  -0.5058  -0.636   -0.447   -0.4829         -0.5277   -0.5
full*time        -0.1154  -0.0651  -0.109   -0.0434  -0.1093         -0.1051   -0.1
flex*time        0.1113   0.1120   0.114    0.115    0.060           0.0458    0.05
inc*full*time             -0.1470           -0.223   -0.1904         -0.1741   -0.2
inc*flex*time                                        0.0182          0.0695    0.05
full*flex*time                                       0.1046          0.0932    0.1
σ(time)                            0.0528   0.0504
RMSE             0.093    0.051    0.098    0.058    0.014           0.016
MAE              0.072    0.042    0.076    0.053    0.012           0.012
MAPE             63%      52%      64%      61%      15%             11%

a: Estimated through regression. RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

For TasteNet-MNL, we regress the predicted β_vot values against the characteristics (inc, full, flex) and their interactions to obtain the coefficients of the utility function; the ASC is directly estimated in the MNL module. We compute the errors in the parameter estimates, including the root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) (Table 2).

The parameter errors of TasteNet-MNL are close to those of the true model MNL-TRUE, which means the neural network recovers the correct form of the taste function in this case. MNL-I, MNL-II, RCL-I and RCL-II have large biases in their parameter estimates, with MAPE from 52% to 64%.

It is worth noting that RCL-I and RCL-II both have statistically significant standard deviations for the random time coefficient (σ(time)): the missing systematic effects can be misinterpreted as random heterogeneity in the value of time. This example also shows that RCLs do not necessarily reduce bias in parameter estimates; their parameter errors are similar to, and even slightly higher than, those of the corresponding MNLs. RCLs do, however, improve the fitted log-likelihood (Table 1).

These results imply that without a sufficiently flexible function to capture nonlinearity in systematic taste variation, we may mistake systematic heterogeneity for random heterogeneity, and obtain biased estimates and interpretations. Neural networks can be utilized to exhaust the capacity of the systematic utility function, and so separate systematic effects from random effects.

4.3.3. Interpretability

We expect TasteNet-MNL to provide more accurate economic indicators than the misspecified MNLs. We compare the value of time (VOT), choice elasticity and choice probability derived from the different models against the ground truth.

Table 3: Errors in Estimated Values of Time (Unit: $/Hour)

Input data      Error metric  MNL-I   MNL-II  MNL-TRUE  TasteNet-MNL
Synthetic data  RMSE          1.805   1.730   0.098     0.111
                MAE           1.700   1.696   0.080     0.056
                MAPE          10.1%   10.1%   0.5%      0.3%
New input       RMSE          2.710   1.707   0.351     0.408
                MAE           2.274   1.573   0.244     0.280
                MAPE          13.9%   9.9%    1.5%      1.6%

RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

a) Value of time
We estimate the VOT for each individual in the synthetic data. Table 3 shows the errors in the estimated VOT by the different models. The MAE of the VOT estimated by MNL-I and MNL-II is 1.7$ per hour, about 10% of the true values. TasteNet-MNL's MAE is much lower: 0.056$ per hour, or 0.3% of the true VOTs. TasteNet-MNL's accuracy in individual VOT estimates matches MNL-TRUE's.

To test the models' generalization performance, we create a dataset with 200 individuals whose characteristics are drawn from a uniform distribution. There are 50 individuals in each of the four groups defined by the combinations of full-time (yes/no) and flexible schedule (yes/no). Income within each group is evenly distributed in the range of 0 to 60$ per hour with an interval size of 1.2. On this new input, MNL-I and MNL-II produce an MAE of 2.3$ per hour (14%) and 1.6$ per hour (10%), respectively, compared to TasteNet-MNL's error of 0.3$ per hour (1.6%).

We plot the predicted VOTs against income for the four categories of individuals (Figure 3). MNL-I cannot distinguish the difference in VOTs (at a fixed income) between the (full, flex) group and the (nofull, noflex) group. Adding one higher-order interaction in MNL-II helps, but a large bias persists. TasteNet-MNL gives more accurate VOT estimates at the individual level: the RMSE of its individual VOT estimates is 0.41, close to the true model MNL-TRUE (0.35) and much lower than MNL-I (2.71) and MNL-II (1.71); its MAPE is 1.6%, similar to MNL-TRUE and better than MNL-I (14%) and MNL-II (10%). To summarize, TasteNet-MNL provides more accurate estimates of VOT at the individual level, while misspecified MNLs can result in large bias.

Figure 3: Estimated Values of Time and the Ground Truth for New Input
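Since the cost coefficient is fixed at -1, each person's VOT is simply the negated time coefficient predicted by TasteNet. A sketch, with `model` the fitted TasteNet-MNL from Section 3 and `z_all` a hypothetical tensor of everyone's characteristics:

```python
import torch

# Utility is ... - cost + beta_time(z) * time, with the cost coefficient at -1,
# so the value of time is VOT(z) = -beta_time(z), in cost units per time unit.
with torch.no_grad():
    beta_time = model.tastenet(z_all)   # (N, 1) predicted time coefficients
vot = -beta_time.squeeze(-1)            # (N,) individual VOT estimates
```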
b) Elasticity and choice probability

Elasticities are useful economic indicators derived from a choice model; they measure the effect of a change in one of the variables (e.g. income, cost) on the choice probability. We compare disaggregated point elasticities across models. The general definition of the elasticity of demand with respect to alternative attribute x_kin is given in Eqn. 13, where P_n(i) is the probability that person n chooses alternative i, and x_kin is the k-th attribute of alternative i for person n. The elasticity E_{x_kin}^{P_n(i)} measures the percentage change in the choice probability P_n(i) per one percent change in the attribute x_kin. The elasticity formulas for a linear MNL and for TasteNet-MNL are shown in Eqn. 14 and 15; the major difference is that the taste parameter β_k in the TasteNet-MNL case becomes a function of the characteristics z.

E_{x_{kin}}^{P_n(i)} = \frac{\partial P_n(i)}{\partial x_{kin}} \cdot \frac{x_{kin}}{P_n(i)}  (13)

E_{x_{kin}}^{P_n(i)} = (1 - P_n(i))\, x_{kin}\, \beta_k  (14)

E_{x_{kin}}^{P_n(i)} = (1 - P_n(i))\, x_{kin}\, \beta_k(z)  (15)
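In code, Eqns. 14 and 15 share one formula; the TasteNet-MNL case only differs in where the taste comes from. A sketch, where the probability and taste arrays are assumed to come from the fitted models:

```python
def point_elasticity(p_i, x_ki, beta_k):
    """Eqns. 14-15: elasticity of the choice probability p_i w.r.t. attribute
    x_ki. For a linear MNL, beta_k is a constant; for TasteNet-MNL, pass
    beta_k = beta_k(z_n), predicted per person."""
    return (1.0 - p_i) * x_ki * beta_k
```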
Table 4: Errors in Estimated Elasticities and Probabilities by Different Models

                Choice Elasticity                           Choice Probability
Error metric    MNL-I  MNL-II  MNL-TRUE  TasteNet-MNL       MNL-I  MNL-II  MNL-TRUE  TasteNet-MNL
RMSE            5.30   5.22    0.34      0.32               0.21   0.21    0.012     0.011
MAE             3.03   3.10    0.16      0.11               0.16   0.16    0.0079    0.0057
MAPE            55%    56%     3%        2%                 61%    62%     3%        2%

RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.

In the first analysis, we estimate the elasticity and choice probability for each observation in the synthetic data, choosing the elasticity and choice probability of alternative 1 with respect to the time of alternative 1. Table 4 shows the errors in the estimated elasticities by the different models. TasteNet-MNL achieves the same level of accuracy as the true model MNL-TRUE, while the misspecified logit models result in 55% to 56% errors. Similar results hold for the predicted choice probabilities.

The second analysis is performed on a selected individual with a 60$ hourly wage, a full-time job and a flexible schedule. Alternative 0's time and cost are fixed at 20 minutes and 2$, and alternative 1's cost is fixed at 8$. We vary time_1 (the time of alternative 1) from 0.2 to 20, and compute this person's choice elasticity and probability for each value of time_1. Figure 4 shows the estimated elasticity and choice probability against time_1 for each model. Among the models, TasteNet-MNL gives the functions closest to the ground truth.

Figure 4: Elasticity and Choice Probability against time_1 for the Selected Person

The third analysis compares the elasticities and probabilities predicted by the different models across four types of individuals, defined by the combinations of full-time (yes/no) and flexible schedule (yes/no), with income fixed at 30$ per hour. The time and cost of alternative 0 are given at 20 minutes and 2$, and cost_1 is 8$. We plot the elasticity and probability as functions of time_1 for each group as predicted by the different models (Figure 5 and Figure 6). MNL-I can barely distinguish the difference between the full-flex and nofull-noflex groups, while TasteNet-MNL can, and gives more accurate estimates than the misspecified MNLs.

Figure 5: Elasticity v.s. time_1 for 4 Types of Individuals (inc = 30$/hr, time_0 = 20 min, cost_0 = 2$, cost_1 = 8$)

Figure 6: Choice Probability v.s. time_1 for 4 Types of Individuals (inc = 30$/hr, time_0 = 20 min, cost_0 = 2$, cost_1 = 8$)

4.3.4. Understanding the Neural Network

To understand how the neural network learns the effects of the input variables and their interactions, we visualize the activation values of the hidden units for simulated individual characteristics. We also show the estimated weights of TasteNet: the weights of the linear layer from input to hidden layer, and the weights of the linear layer from hidden layer to output layer (Table 5). Interestingly, hidden unit 6 is not used, since its associated weights are all zeros.

We generate four types of individuals with income varying from 0 to 60$ per hour. We pass the individual characteristics through the trained TasteNet and obtain the activation value of each hidden unit. Figure 7 displays the activation values; darker color indicates stronger activation. All activation values are non-negative, since the activation function used is ReLU. By observing how a neuron is activated as the input varies, we can understand the role of each neuron in approximating the true taste function.
Table 5: Estimated Weights of TasteNet-MNL

               Input-to-Hidden                                      Hidden-to-Output
Hidden unit    z_income   z_fulltime   z_flexible   intercept
1              -0.4257    0.3917       0.0479       -0.0315        0.462
2              0.3907     -0.0925      -0.114       -0.0543        -0.1536
3              -0.0001    0.4637       -0.0764      0.484          -0.3641
4              0.8034     -0.2261      -0.2055      0.4987         -0.3595
5              0.812      -0.317       -0.2252      0.5003         -0.1944
6              0          0            0            0              0
7              0.0396     -0.5551      -0.1817      0.5089         0.4905
intercept                                                          0.0921

Hidden units 4 and 5 apparently capture the income effect, since they become more activated as income increases in all four groups. Hidden units 4 and 5 also capture the non-flexible effect: individuals with non-flexible schedules tend to have higher activation values for hidden units 4 and 5, all else equal (left vs. right in Figure 7). In this case, higher activation of units 4 and 5 leads to higher values of time (a more negative β_vot), because their coefficients in the linear hidden-to-output layer are negative (-0.3595 and -0.1944, see Table 5). Hidden unit 3 captures the full-time effect: full-time individuals tend to have a higher activation value for unit 3, which makes β_vot more negative, since the hidden-to-output coefficient for unit 3 is negative (-0.3641). Hidden units 1, 2 and 7 represent the three interaction effects: income * full-time, income * not-flexible, and not-full-time * not-flexible, respectively. Again, we see that hidden unit 6 is never activated.

Figure 7: Activation of the Hidden Layer in TasteNet

Through Monte-Carlo experiments, we have shown TasteNet-MNL's ability to capture nonlinear taste functions and uncover the true utility form. Misspecified systematic utility in MNLs or RCLs can lead to large bias in parameter estimates; TasteNet-MNL can be used to identify specification errors in the utility and reduce potential biases. TasteNet-MNL's prediction accuracy matches the true model's (77% to 79%), higher than the misspecified MNLs' and RCLs' (70% to 72%). TasteNet-MNL also provides interpretable economic indicators, such as values of time and demand elasticities, close to the ground truth, while MNLs and RCLs with misspecified utility can produce unreliable interpretations.
5. Model Application: Swissmetro Mode Choice
We apply TasteNet-MNL to a publicly available dataset, Swissmetro, to model mode choice for inter-city travel. The purpose of this application is to examine 1) whether TasteNet-MNL predicts more accurately than a manually specified, relatively sophisticated MNL; and 2) whether TasteNet-MNL draws reasonable behavioral interpretations and, if so, how its interpretations differ from the MNLs'. For comparison with TasteNet-MNL, we set up three benchmark MNL models with increasing complexity in the utility function.
5.1. Data
Swissmetro is a proposed revolutionary mag-lev underground system. To assess potential demand, the Swissmetro Stated Preference (SP) survey collected data from 1,192 respondents (441 rail-based travellers and 751 car users), with 9 choice situations per respondent. Each respondent is asked to choose one mode out of a set of alternatives for inter-city travel, given the attributes of each mode (e.g. travel time, headway and cost). The universal choice set includes train (TRAIN), Swissmetro (SM), and car (CAR); for individuals without a car, the choice set includes only TRAIN and SM. Table 6 provides a description of the variables. For more information, readers can refer to Bierlaire (2018).

The original data, downloaded in January 2019 (data link: https://biogeme.epfl.ch/data.html), has 10,728 observations. After removing observations with unknown age, "other" trip purpose or unknown choice, we retain 10,692 observations. We randomly split the data into training ("train"), development ("dev") and test ("test") sets with 7,484, 1,604 and 1,604 observations, respectively.

5.2. Models
The three benchmarks are logit models.
MNL-A is similar to Bierlaire et al. (2001)'s MNL specification but with some enhancements: 1) the value of travel time and the value of headway are made mode-specific; 2) all levels of the age and luggage categories are included; and 3) the cost coefficients are fixed to -1.0 so that the VOTs can be read directly from the time coefficients (Table 7). In the benchmark MNL-B, we add the interaction terms time*age, time*income and time*purpose (Table 8). The third benchmark, MNL-C, is an MNL with all pairs of first-order interactions between characteristics and attributes (Table 9); this model is equivalent to a TasteNet-MNL with all taste coefficients modeled by a neural network without hidden layers.

The TasteNet-MNL structure for the Swissmetro data is shown in Figure 8. We specify the utility functions for each alternative in the MNL module.
Table 6: Description of Variables

Alternative        Alternative attributes                                          Availability
TRAIN              time, headway, cost (train tt, train hw, train co)              train av
SM (Swissmetro)    time, headway, seats a, cost (sm tt, sm hw, sm seats, sm co)    sm av
CAR                time, cost (car tt, car co)                                     car av

Person/Trip variable                Variable levels
AGE                                 0: age ≤ 24, 1: 24 < age ≤ 39, 2: 39 < age ≤ 54, 3: 54 < age ≤ 65, 4: 65 < age
MALE                                0: female, 1: male
INCOME (thousand CHF per year)      0: under 50, 1: between 50 and 100, 2: over 100, 3: unknown
FIRST (first-class traveler)        0: no, 1: yes
GA (Swiss annual season ticket)     0: no GA, 1: owns a GA
PURPOSE                             0: Commute, 1: Shopping, 2: Business, 3: Leisure
WHO (who pays)                      0: self, 1: employer, 2: half-half
LUGGAGE                             0: none, 1: one piece, 2: several pieces

a. Seat configuration in Swissmetro: seats = 1 if airline seats, 0 otherwise.

Table 7: Estimated Coefficients of MNL-A

Variable description           Train       Swissmetro  Car
Constant                       0.1227      0.5726
Travel time (minutes)          -1.3376     -1.4011     -1.0177
Headway (minutes)              -0.4509     -0.8171
Seats (airline seating = 1)                0.1720
Cost (CHF)                     -1 (fixed)  -1 (fixed)  -1 (fixed)
GA (annual ticket = 1)         2.0656      0.5319
Age (base 0: age ≤ 24):        1: 24 < age ≤ 39 = -0.7548; 2: 39 < age ≤ 54 = -0.9457; 3: 54 < age ≤ 65 = -0.4859; 4: 65 < age = 0.6995
Luggage (base 0: none):        1: one piece = -0.1538; 2: several pieces = -0.9230

Table 8: Estimated Coefficients of MNL-B

Variable description           Train       Swissmetro  Car
Constant                       0.0056      0.4674
Travel time (minutes)          -0.5006     -0.4010     -0.5600
Travel time * Age (base 0: age ≤ 24)
  1: 24 < age ≤ 39             -0.6354     -0.3307     -0.5696
  2: 39 < age ≤ 54             -0.8475     -0.6101     -0.6105
  3: 54 < age ≤ 65             -0.1566     0.1419      -0.0915
  4: 65 < age                  0.3265      -0.243      -0.0234
Travel time * Income (base 0: under 50)
  1: 50 to 100                 -0.2688     0.1739      0.1623
  2: over 100                  -1.0181     -0.436      -0.4093
  3: unknown                   0.0852      0.2828      -0.0923
Travel time * Purpose (base 0: Commute)
  1: Shopping                  -0.2081     -0.6192     -0.6062
  2: Business                  -0.1574     -0.8688     -0.1833
  3: Leisure                   -0.59       -0.9706     -0.0162
Headway (minutes)              -0.6158     -0.7011
Seats (airline seating = 1)                0.189
Cost (CHF)                     -1 (fixed)  -1 (fixed)  -1 (fixed)
GA (annual ticket = 1)         1.6162      0.2988
Luggage (base 0: none):        1: one piece = -0.1714; 2: several pieces = -0.6718

Table 9: Estimated Coefficients of MNL-C (coefficients for alternative attributes, by characteristic z)

z (characteristics)    TRAIN TT  SM TT    CAR TT   TRAIN HE  SM HE    SM SEATS  TRAIN ASC  SM ASC
Intercept              -0.0671   0.1455   0.0059   0.1713    0.0646   0.3064    0.2953     0.2067
Male                   -0.1526   -0.0477  0.0742   -0.2384   0.0706   -0.1016   0.0671     0.149
Age 1: (24,39]         -0.0965   -0.2422  -0.1093  0.0044    0.5682   0.0517    -0.1634    0.4285
Age 2: (39,54]         -0.1467   -0.2022  -0.195   -0.2397   -0.0105  -0.2135   -0.2692    0.0959
Age 3: (54,65]         0.0256    0.1201   0.0251   -0.2379   -0.0807  0.1619    -0.0861    -0.0344
Age 4: (65,)           -0.1712   0.1435   0.1105   0.6032    -0.1488  -0.1529   0.618      -0.351
Income 1: 50-100       0.0494    -0.039   0.0098   -0.1884   -0.2972  0.2349    -0.1776    0.1944
Income 2: over 100     -0.2825   -0.1697  -0.2662  0.1393    0.0372   0.5288    -0.0406    -0.0789
Income 3: unknown      0.0289    0.1467   -0.2037  0.1484    -0.0721  -0.4196   0.1621     -0.0459
First class            -0.1927   -0.0807  -0.3297  -0.4768   0.1183   0.1302    0.2228     -0.2085
Who pays 1: employer   -0.2154   -0.1668  0.1231   0.028     -0.0045  0.0882    0.1191     0.3986
Who pays 2: half-half  0.1537    0.4771   0.4391   -0.0311   0.3917   0.3114    -0.2414    -0.0332
Purpose 1: Shopping    0.2339    -0.219   0.19     0.1509    0.0493   0.1994    0.4238     0.6996
Purpose 2: Business    -0.0872   -0.3524  -0.181   -0.0544   -0.0195  -0.0647   0.0605     -0.2941
Purpose 3: Leisure     -0.2678   -0.2778  -0.0043  0.3245    -0.4552  -0.0289   -0.302     -0.4739
Luggage 1: one piece   -0.0375   0.0861   0.2525   0.58      -0.1993  0.0413    0.3364     0.3239
Luggage 2: several     0.022     -0.1785  -0.2731  -0.2946   0.0814   -0.1225   -0.0041    0.2158
Annual ticket (GA)     0.5912    -0.0075  -0.3181  0.2652    -0.2032  -0.5815   0.3576     0.1351
Table 10: Options of Hyper-parameters & Activation Functions

Options                                             Values
Hidden activation                                   relu, tanh
Output activation (for non-positive parameters)     -relu(-β), -exp(-β)
Hidden layer size                                   [10, 20, ..., 100]
Regularization weight λ a                           [0, 0.0001, 0.001, 0.01]

a. Either an L1 or an L2 penalty.

The cost coefficients are fixed to -1 so that each time coefficient equals the negative of the corresponding value of time. There are 8 coefficients in the MNL utilities, including the alternative specific constants (see Table 12). We assume all MNL coefficients (taste parameters) are functions of individual characteristics, and model them as the outputs of TasteNet. This is a special case of the general structure: the set β^MNL is empty and all taste parameters are modeled by TasteNet as β^TN (Eqn. 4 and 5).

The TasteNet module consists of a linear layer from the input z to the hidden layer h^(1), a nonlinear activation A^(1) for the hidden layer, followed by a linear layer from the hidden layer to the output layer, and an output activation function T. We choose only one hidden layer, since the predicted log-likelihood on hold-out data does not improve with more hidden layers. The input z includes all characteristics: age, gender, income, first class, who pays for the travel cost, trip purpose, and luggage. We experiment with various sets of hyper-parameters and activation functions (Table 10).

Figure 8: Diagram of TasteNet-MNL for the Swissmetro Dataset
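The grid in Table 10 amounts to a small exhaustive search over 2 x 2 x 10 x 4 configurations; schematically, where `make_model`, `train`, `train_loader` and `dev_nll` are hypothetical helpers for building, fitting and scoring a model:

```python
from itertools import product

results = {}
for act, out_t, h, lam in product(
        ["relu", "tanh"],             # hidden activation
        ["neg_relu", "neg_exp"],      # output transform for non-positive tastes
        range(10, 101, 10),           # hidden layer size
        [0.0, 1e-4, 1e-3, 1e-2]):     # L1 or L2 regularization weight
    model = train(make_model(act, out_t, h), train_loader, lam=lam)
    results[(act, out_t, h, lam)] = dev_nll(model)   # average NLL on the dev set

best_config = min(results, key=results.get)          # lowest dev NLL wins
```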
5.3. Results

The estimated coefficients of MNL-A, MNL-B and MNL-C are shown in Tables 7, 8 and 9. Among all TasteNet-MNL scenarios, the one with 80 hidden units, relu for the hidden layer activation, the negative exponential for the non-positive output activation, and no regularization achieves the best prediction performance on the development set.

5.3.1. Prediction Performance

TasteNet-MNL significantly outperforms the MNL benchmarks in prediction accuracy. We use the average negative log-likelihood (NLL) and prediction accuracy (ACC) to measure predictability (Table 11). From MNL-A to MNL-C, more interactions between attributes and individual characteristics are added, with MNL-C containing a full set of interactions between attributes and characteristics. Surprisingly, the predicted log-likelihood shows only marginal improvements: the NLL decreases from 0.728 (MNL-A) to 0.708 (MNL-B) and 0.691 (MNL-C). With TasteNet-MNL, we see a substantial improvement in prediction performance: the NLL on the development data drops from 0.691 to 0.646. This is attributed to the flexibility enabled by the hidden layer with nonlinear transformation in TasteNet. The improved log-likelihood implies the existence of nonlinear effects in the utility specification. The neural network automatically learns taste as a nonlinear function of individual characteristics; because it captures a more accurate relationship between characteristics and tastes, it outperforms the MNLs with linear utilities.

5.3.2. Individual Taste Estimates

We want to understand how the estimated tastes, such as the value of time by mode, differ across models. We apply each model to obtain the taste parameters of each individual in the Swissmetro dataset. The population averages of the taste coefficients are displayed in Table 12. Since the cost coefficients are fixed to -1.0, all taste coefficients are in willingness-to-pay space, measured in Swiss Francs (CHF).

Table 11: Average Negative Log-likelihood (NLL), Prediction Accuracy (ACC) and F1 Score by Model
          NLL                       ACC                       F1
Model     train   dev     test     train   dev     test      train   dev     test
MNL-A     0.762   0.728   0.755    0.662   0.691   0.66      0.535   0.557   0.534
MNL-B     0.73    0.708   0.72     0.678   0.69    0.678     0.578   0.585   0.573
MNL-C     0.704   0.691   0.698    0.685   0.706   0.678     0.588   0.611   0.574
TasteNet  0.607   0.646   0.645    0.737   0.718   0.703     0.668   0.634   0.620
Table 12: Population Mean of Taste Parameters Estimated by Different Models

Mode   Taste   MNL-A   MNL-B   MNL-C   TasteNet-MNL  (vs. MNL-C)
TRAIN  TT      -1.338  -1.710  -1.846  -2.327        26%
       HE      -0.451  -0.616  -0.880  -1.102        25%
       ASC     -0.198  0.234   0.368   0.801         117%
SM     TT      -1.401  -1.514  -1.505  -1.764        17%
       HE      -0.817  -0.701  -1.039  -1.733        67%
       SEATS   0.172   0.189   0.420   0.266         -37%
       ASC     0.648   0.510   0.512   0.669         31%
CAR    TT      -1.018  -1.251  -1.354  -1.685        24%
TT: travel time. HE: headway. ASC: alternative specific constant.

We find that from MNL-A to MNL-C, the average values of travel time (VOT) and values of headway (VOHE) increase as more interaction terms are added to the utility function. For example, the train VOT increases from 1.34 to 1.85 CHF per minute, the Swissmetro VOT increases from 1.40 to 1.51 CHF per minute, and the car VOT rises from 1.02 to 1.35 CHF per minute. Both MNL-B and MNL-C suggest that the VOT of train is higher than that of Swissmetro or car. MNL-C also gives higher average VOHEs for both train and Swissmetro than MNL-B or MNL-A.

TasteNet-MNL gives the largest average VOT and VOHE among all models (Table 12). Its average VOT estimates for train, Swissmetro and car are 26%, 17% and 24% higher, respectively, than those predicted by MNL-C. Its average VOHE estimates for train and Swissmetro are 25% and 67% higher than those estimated by MNL-C.

We further investigate where the higher average VOTs come from. We plot histograms of each type of taste parameter for each model (Figure 9). As interactions are incrementally added from MNL-A to MNL-C, the model captures more taste variation. Compared to the MNLs with linear utilities, TasteNet-MNL discovers a wider range of taste variation; in particular, the VOTs and VOHEs for all travel modes have longer tails on the high end of WTP. Based on the synthetic data results, we have reason to believe that TasteNet-MNL's superior predictability is a result of its more accurate estimates of individual tastes.

Table 13: Example Person Selected for Comparing Taste Functions by Model

Characteristics          Value
z fixed:   MALE          Male
           AGE           (39, 54]
           PURPOSE       Commute
           WHO           Self
           LUGGAGE       One piece
           GA            Yes
           FIRST         No
z varied:  INCOME        0: under 50, 1: 50 to 100, 2: over 100

5.3.3. Taste Functions

An interpretable model should provide a reasonable functional relationship between inputs and choice outcomes at the disaggregate level. We propose a diagnostic tool for model interpretability: visualizing the taste functions.

Each model provides a taste function that maps individual characteristics to a taste value (e.g. VOT, VOHE). We want to check whether TasteNet-MNL learns sensible taste functions at the individual level, in comparison to the benchmark MNLs. Since the function input z is multi-dimensional, we cannot visualize the functions directly. Instead, we pick an individual with characteristics z, vary one dimension z_i while keeping the other dimensions z_{j≠i} fixed, and plot a particular taste parameter as a function of z_i: β_k = f_model(z_i; z_{j≠i}).

For example, we pick a person with the characteristics shown in Table 13. We vary this person's income and ask each model a question, e.g., what is the VOT for such a person as his income varies? We compare the answers given by the different models. Figure 10 shows the VOTs and VOHEs estimated by the different models versus income.

Compared with the benchmark MNLs, the VOTs and VOHEs estimated by TasteNet-MNL all fall within credible ranges, although the estimates differ to varying degrees. The Swissmetro VOT estimates are not very different between TasteNet-MNL and MNL-C. For the train VOT, TasteNet-MNL gives smaller estimates than MNL-C for all three income groups. The car VOT estimated by TasteNet-MNL is higher for the higher income groups and lower for the lowest income group. With respect to the VOHEs, TasteNet-MNL gives higher estimates for train and lower estimates for Swissmetro at all income levels.
MNL-C shows a monotonic relationship between VOT and income only for train VOT, while TasteNet-MNL identifies monotonicity for Swissmetro VOT and car VOT.

As we do not know the ground truth, the interpretability and credibility of the model inevitably depend on expert knowledge and judgment. We draw many individual cases, visualize the taste functions, and compare across models. Overall, the taste parameters given by TasteNet-MNL fall within a similar range as the MNLs'. Yet a particular taste of a specific individual given by TasteNet-MNL can agree with or differ from the MNLs'. Based on TasteNet-MNL's better predictability, we trust that it gives more accurate taste parameters for individuals.

5.3.4. Elasticity

To evaluate model interpretability, we also compare elasticities derived from different models. First, we apply each model to calculate the disaggregate point elasticity of Swissmetro mode choice with respect to Swissmetro travel time for each observation (Eqns. 14 and 15). TasteNet-MNL's individual elasticity estimates differ from MNL-C's by 0.2287 on average.

With the individual elasticities, we compute the aggregate elasticity, which measures a group of decision-makers' response to an incremental change in a variable. It is defined in Eqn. 16 as the percentage change in the expected share of the group choosing alternative i, W(i), with respect to a one-percent change in variable x_ki. It is equivalent to a weighted average of the individual elasticities using the choice probabilities as weights:

E^{W(i)}_{x_{ki}} = \frac{\partial W(i)}{\partial x_{ki}} \cdot \frac{x_{ki}}{W(i)} = \frac{\sum_n P_n(i)\, E^{P_n(i)}_{x_{kin}}}{\sum_n P_n(i)}    (16)

The aggregate elasticities of Swissmetro mode share with respect to Swissmetro travel time are -0.43, -0.45, and -0.41 for MNL-A, MNL-B, and MNL-C, compared to -0.437 for TasteNet-MNL. We further compare aggregate elasticities by group, such as income (Table 14). TasteNet-MNL suggests higher elasticities for the low-income and high-income groups than MNL-C. Overall, TasteNet-MNL gives choice elasticities close to the MNLs' and within a reasonable range.

Table 14: Aggregate Choice Elasticity of Swissmetro w.r.t. Time by Income Group

               INCOME0   INCOME1   INCOME2
MNL-A          -0.3765   -0.4297   -0.4706
MNL-B          -0.3975   -0.3923   -0.5329
MNL-C          -0.3759   -0.3706   -0.4653
TasteNet-MNL   -0.4200   -0.3982   -0.4810
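As a concrete reading of Eqn. 16, the following sketch computes the aggregate elasticity as the probability-weighted average of individual elasticities. The point-elasticity helper uses the standard logit result for a linear-in-attributes utility; it stands in for the paper's Eqns. 14 and 15, which are not reproduced in this section, and the names in the usage comment are hypothetical.

```python
import numpy as np

def mnl_point_elasticity(beta_k, x_kin, p_ni):
    """Disaggregate point elasticity of P_n(i) w.r.t. attribute x_kin
    for a logit model with linear-in-attributes utility (the textbook
    result, used here as a stand-in for Eqns. 14-15)."""
    return beta_k * x_kin * (1.0 - p_ni)

def aggregate_elasticity(p_i, e_i):
    """Eqn (16): elasticity of the expected share W(i), computed as
    the choice-probability-weighted average of individual elasticities.

    p_i : predicted probabilities P_n(i), one per decision-maker
    e_i : individual point elasticities E^{P_n(i)}_{x_kin}
    """
    p_i, e_i = np.asarray(p_i), np.asarray(e_i)
    return np.sum(p_i * e_i) / np.sum(p_i)

# Usage sketch for Table 14's group-level figures (names hypothetical):
# mask = income_group == 0
# agg_sm_time = aggregate_elasticity(p_sm[mask], e_sm_time[mask])
```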
6. Conclusions & Discussions
In this paper, we embed a neural network into a logit model to flexibly represent taste heterogeneity while keeping model interpretability. Departing from a traditional either-or approach, we integrate neural networks and DCMs to take advantage of both.

On synthetic data, we show that TasteNet-MNL can learn the true nonlinear taste function. As a result, it reaches the same level of accuracy as the true model in predicting choices and economic indicators. Exemplary MNLs and random coefficient logits (RCLs) with misspecified utilities result in large parameter bias, less accurate predictions, and misleading interpretations. In an application to the Swissmetro dataset, TasteNet-MNL not only predicts more accurately on unseen data, but also provides interpretable indicators for policy analysis: individual-level VOTs and elasticities derived from TasteNet-MNL are comparable to the results of the benchmarking MNLs. TasteNet-MNL discovers a greater range of taste variation in the population than the benchmarking MNLs. The average VOT estimates by TasteNet-MNL are higher than the MNLs', due to the longer tails on the high end of willingness-to-pay. Based on its superior predictability, we believe that TasteNet-MNL provides more accurate estimates of tastes and elasticities.

Through this case, we show that neural networks and DCMs can complement each other well. A neural network can learn complex functions from data and reduce the bias of a manual specification. TasteNet-MNL can be used alongside DCMs to detect potential misspecification. The theory and domain knowledge of a DCM can guide the neural network to output meaningful results. A high-level idea behind TasteNet-MNL is to assign the more complex or unknown part of the model (e.g., taste heterogeneity) to a neural network (data-driven), and keep the well-understood part (e.g., trade-offs between alternative attributes) parametric (theory-driven). This idea can be applied to other settings. For example, in a latent class choice model, we usually lack good prior knowledge of the class membership model specification; a neural network can be utilized to learn the latent class structure.

TasteNet-MNL is distinguished from previous studies in several ways. First, it generalizes the L-MNL (Sifringer et al., 2018): instead of learning only the residual of the utility, the neural network learns the complicated interactions between characteristics and attributes. Second, unlike the majority of neural network applications to discrete choice, TasteNet learns a representation of taste rather than utility. This gives us direct control over parameters that carry behavioral meanings, such as values of time, as they are no longer parameters to estimate but intermediate outputs to predict. Third, we recognize the necessity of incorporating domain knowledge to obtain interpretable results from neural networks. We introduce parameter constraints as a regularization strategy to combat over-parameterization, a common issue with insufficient data and a cause of large estimation variance.
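To make this division of labor concrete, here is a minimal sketch of the neural-embedded structure in PyTorch. The layer sizes, the softplus sign constraint, and the cost-normalized utility are illustrative assumptions rather than the paper's exact specification; the point is that the network outputs taste parameters, which then enter a manually specified MNL utility, and the whole model is estimated jointly by maximum likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TasteNetMNL(nn.Module):
    """Sketch: a feed-forward TasteNet maps characteristics z to taste
    parameters, which enter a hand-specified MNL utility over the
    alternative attributes. Dimensions and utility form illustrative."""

    def __init__(self, n_char, n_alt, hidden=16):
        super().__init__()
        self.tastenet = nn.Sequential(
            nn.Linear(n_char, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_alt),  # one time coefficient per mode
        )

    def forward(self, z, time, cost):
        # Data-driven part: tastes as flexible functions of z; the
        # -softplus is a parameter constraint (expert knowledge) that
        # keeps the time coefficient negative.
        beta_time = -F.softplus(self.tastenet(z))      # (batch, n_alt)
        # Theory-driven part: linear-in-attributes utility with the
        # cost coefficient normalized to -1, so |beta_time| reads as VOT.
        v = beta_time * time - cost                    # (batch, n_alt)
        return F.log_softmax(v, dim=1)                 # log P(choice)

# Joint estimation minimizes the negative log-likelihood of choices:
# model = TasteNetMNL(n_char=8, n_alt=3)
# loss = F.nll_loss(model(z, time, cost), chosen_alt)
```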
There are several limitations and open questions for future research. First, the current TasteNet-MNL model only accommodates systematic taste variation. Random taste heterogeneity is an important source of heterogeneity, and how to model distributions of taste parameters with neural networks is an intriguing question for future research. Han (2019) proposes a neural-network-embedded latent class choice model as one way to represent random heterogeneity. Future work can develop a neural-embedded continuous mixed logit model.

Second, TasteNet-MNL focuses on modeling taste heterogeneity. Nonlinear effects of attributes have been observed empirically (Monroe, 1973, Gupta and Cooper, 1992, Kalyanaram and Little, 1994). Nonlinear effects, such as saturation and threshold effects, are explained by prospect theory and assimilation-contrast theory (Kahneman and Tversky, 1979, Winer, 1986, 1988). Future work may extend the TasteNet-MNL model to reflect nonlinearity in attributes.

Third, more synthetic data scenarios and benchmark models can be examined. Other forms of nonlinearity can be used to test whether TasteNet-MNL can capture them. More benchmarks, such as the latent class choice model and random coefficient logits with other distributional assumptions and systematic utilities, can be compared against. It is inconclusive whether DCMs that incorporate random heterogeneity predict better or worse than a TasteNet-MNL with only systematic taste variation. Most likely, this varies across cases, depending on the magnitude of systematic versus random taste variation in the data. Future work can conduct a comprehensive comparison of TasteNet-MNL and DCMs.

Lastly, we suggest comparing TasteNet-MNL with DCMs in various empirical settings, in terms of prediction performance and behavioral interpretations. As we find in the Swissmetro case study, TasteNet-MNL suggests higher average VOTs and greater variety in tastes. Future research can determine whether this holds in general, and how it would affect aggregate forecasts and scenario analysis. We expect that the integrated model can help discover new insights about behavior and improve model predictability.

Figure 9: Population Taste Distributions by Models (Swissmetro Dataset)

References

D. Agrawal and C. Schorling. Market share forecasting: An empirical comparison of artificial neural networks and multinomial logit model. Journal of Retailing, 72(4):383–407, 1996.
Y. Bentz and D. Merunka. Neural networks and the multinomial logit for brand choice modelling: a hybrid approach. Journal of Forecasting, 19(3):177–200, 2000.
M. Bierlaire. Swissmetro, 2018. URL http://transp-or.epfl.ch/documents/technicalReports/CS_SwissmetroDescription.pdf.
G. E. Cantarella and S. de Luca. Multilayer feedforward networks for transportation mode choice analysis: An analysis and a comparison with random utility models. Transportation Research Part C: Emerging Technologies, 13(2):121–155, 2005.
M. De Carvalho, M. Dougherty, A. Fowkes, and M. Wardman. Forecasting travel demand: a comparison of logit and artificial neural network methods. Journal of the Operational Research Society, 49(7):717–722, 1998.
S. Gupta and L. G. Cooper. The discounting of discounts and promotion thresholds. Journal of Consumer Research, 19(3):401–411, 1992.
Y. Han. Neural-embedded Choice Models. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, August 2019.
D. A. Hensher and T. T. Ton. A comparison of the predictive potential of artificial neural networks and nested logit models for commuter mode choice. Transportation Research Part E: Logistics and Transportation Review, 36(3):155–172, 2000.
H. Hruschka, W. Fettes, M. Probst, and C. Mies. A flexible brand choice model based on neural net methodology: a comparison to the linear utility multinomial logit model and its latent class extension. OR Spectrum, 24(2):127–143, 2002.
H. Hruschka, W. Fettes, and M. Probst. An empirical comparison of the validity of a neural net based multinomial logit choice model to alternative model specifications. European Journal of Operational Research, 159(1):166–180, 2004.
D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
G. Kalyanaram and J. D. C. Little. An empirical analysis of latitude of price acceptance in consumer package goods. Journal of Consumer Research, 21(3):408–418, 1994.
D. Lee, S. Derrible, and F. C. Pereira. Comparison of four types of artificial neural network and a multinomial logit model for travel mode choice modeling. Transportation Research Record, 2672(49):101–112, 2018. URL https://doi.org/10.1177/0361198118796971.
D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, 1973.
A. Mohammadian and E. J. Miller. Nested logit models and artificial neural networks for predicting household automobile choices: Comparison of performance. Transportation Research Record, 1807(1):92–100, 2002. URL https://doi.org/10.3141/1807-12.
K. B. Monroe. Buyers' subjective perceptions of price. Journal of Marketing Research, 10(1):70–80, 1973. URL https://doi.org/10.1177/002224377301000110.
D. Nam, H. Kim, J. Cho, and R. Jayakrishnan. A model based on deep learning for predicting travel mode choice. In Proceedings of the Transportation Research Board 96th Annual Meeting, Transportation Research Board, Washington, DC, USA, pages 8–12, 2017.
H. Omrani. Predicting travel mode of individuals by machine learning. Transportation Research Procedia, 10:840–849, 2015.
D. Shmueli, I. Salomon, and D. Shefer. Neural network analysis of travel behavior: evaluating tools for prediction. Transportation Research Part C: Emerging Technologies, 4(3):151–166, 1996.
B. Sifringer, V. Lurkin, and A. Alahi. Let me not lie: Learning multinomial logit. arXiv preprint arXiv:1812.09747, 2018.
C. Torres, N. Hanley, and A. Riera. How wrong can you be? Implications of incorrect utility function specification for welfare measurement in choice experiments. Journal of Environmental Economics and Management, 62(1):111–121, 2011.
S. van Cranenburgh and A. Alwosheel. An artificial neural network based approach to investigate travellers' decision rules. Transportation Research Part C: Emerging Technologies, 98:152–166, 2019.
M. van der Pol, G. Currie, S. Kromm, and M. Ryan. Specification of the utility function in discrete choice experiments. Value in Health, 17(2):297–301, 2014.
S. Wang and J. Zhao. Using deep neural network to analyze travel mode choice with interpretable economic information: An empirical example. arXiv preprint arXiv:1812.04528, 2018.
S. Wang and J. Zhao. Multitask learning deep neural network to combine revealed and stated preference data. arXiv preprint arXiv:1901.00227, 2019.
P. M. West, P. L. Brockett, and L. L. Golden. A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Science, 16(4):370–391, 1997.
R. S. Winer. A reference price model of brand choice for frequently purchased products. Journal of Consumer Research, 13(2):250–256, 1986.
R. S. Winer. Behavioral perspective on pricing. In Issues in Pricing, pages 35–57. Lexington Books, 1988.
X. Zhao, X. Yan, A. Yu, and P. Van Hentenryck. Modeling stated preference for mobility-on-demand transit: A comparison of machine learning and logit models. arXiv preprint arXiv:1811.01315, 2018.

Appendix A. Synthetic Data Generation

We first draw input characteristics z according to the assumed input distributions in Table A1. Alternative attributes cost and time are drawn from the ranges described in Table A1. With the true model, we compute choice probabilities for each individual. Finally, we draw a chosen alternative for each individual according to the choice probabilities predicted by the true model. We generate 6000, 2000, and 2000 examples for the training, development, and test data, respectively. The training data is used for model estimation. The development set is used for selecting hyper-parameters. The test set is not used in training or selection; it evaluates model generalization ability.

Table A1: Description of Input Variables in Synthetic Data

Variable            Description                        Distribution
Characteristics z
  z_inc             Income ($ per minute)              LogNormal(log(0.5), 0.25) for full-time;
                                                       LogNormal(log(0.25), 0.2) for not full-time
  z_full            Full-time worker (1=yes, 0=no)     Bern(0.5)
  z_flex            Flexible schedule (1=yes, 0=no)    Bern(0.5)
Attributes x
  x_cost            Cost ($)                           0.2 to 40 $
  x_time            Travel time (minutes)              1 to 90 minutes
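The generation procedure can be summarized in a short script. This is a sketch under stated assumptions: `true_utility` is a trivial placeholder for the paper's actual nonlinear utility (defined in the main text), and a two-alternative design is assumed for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(inc, full, flex, cost, time):
    # Trivial placeholder; the paper's true nonlinear utility (with
    # its taste interactions) is specified in the main text.
    vot = 0.5 + 0.25 * full + 0.25 * flex        # hypothetical $/min
    return -(cost + vot[:, None] * time)

def draw_dataset(n, n_alt=2):
    """Appendix A procedure: draw z and x per Table A1, compute true
    choice probabilities, then sample one chosen alternative each."""
    full = rng.binomial(1, 0.5, n)               # z_full ~ Bern(0.5)
    flex = rng.binomial(1, 0.5, n)               # z_flex ~ Bern(0.5)
    inc = np.where(full == 1,                    # z_inc per Table A1
                   rng.lognormal(np.log(0.5), 0.25, n),
                   rng.lognormal(np.log(0.25), 0.2, n))
    cost = rng.uniform(0.2, 40.0, (n, n_alt))    # x_cost in $
    time = rng.uniform(1.0, 90.0, (n, n_alt))    # x_time in minutes

    v = true_utility(inc, full, flex, cost, time)
    p = np.exp(v - v.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)            # logit probabilities
    choice = np.array([rng.choice(n_alt, p=pi) for pi in p])
    return inc, full, flex, cost, time, choice

# 6000 / 2000 / 2000 draws for the training / development / test sets:
# train, dev, test = draw_dataset(6000), draw_dataset(2000), draw_dataset(2000)
```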