Theory-based residual neural networks: A synergy of discrete choice models and deep neural networks
Shenhao Wang∗, Baichuan Mo, Jinhua Zhao
Massachusetts Institute of Technology, Cambridge, MA 02139
October 2020
Abstract
Researchers often treat data-driven and theory-driven models as two disparate or even conflicting methods in travel behavior analysis. However, the two methods are highly complementary because data-driven methods are more predictive but less interpretable and robust, while theory-driven methods are more interpretable and robust but less predictive. Using their complementary nature, this study designs a theory-based residual neural network (TB-ResNet) framework, which synergizes discrete choice models (DCMs) and deep neural networks (DNNs) based on their shared utility interpretation. The TB-ResNet framework is simple, as it uses a (δ, 1−δ) weighting to take advantage of DCMs' simplicity and DNNs' richness, and to prevent underfitting from the DCMs and overfitting from the DNNs. This framework is also flexible: three instances of TB-ResNets are designed based on the multinomial logit model (MNL-ResNets), prospect theory (PT-ResNets), and hyperbolic discounting (HD-ResNets), which are tested on three data sets. Compared to pure DCMs, the TB-ResNets provide greater prediction accuracy and reveal a richer set of behavioral mechanisms owing to the utility function augmented by the DNN component in the TB-ResNets. Compared to pure DNNs, the TB-ResNets can modestly improve prediction and significantly improve interpretation and robustness, because the DCM component in the TB-ResNets stabilizes the utility functions and input gradients. Overall, this study demonstrates that it is both feasible and desirable to synergize DCMs and DNNs by combining their utility specifications under a TB-ResNet framework. Although some limitations remain, this TB-ResNet framework is an important first step to create mutual benefits between DCMs and DNNs for travel behavior modeling, with joint improvement in prediction, interpretation, and robustness.

∗ Corresponding author: [email protected]

1. Introduction

As machine learning (ML) is increasingly used in the transportation field, we observe a tension between data-driven ML methods and classical theory-driven methods. Take travel behavior research as an example: researchers can analyze travel mode choice by using discrete choice models (DCMs) under the framework of random utility maximization (RUM), or by using data-driven methods such as ML classifiers without any substantial behavioral understanding. This tension creates practical difficulty in choosing one method over the other, and prevents scholars from tackling travel behavior problems under a unified framework. This tension even delineates a form of partisan line within the transportation research community: researchers using data-driven methods focus on computational perspectives and prediction accuracy, while researchers using theory-driven methods focus on interpretation, economic information, and behavioral foundations.

However, a closer examination reveals that the two methods are complementary in terms of prediction, interpretation, and robustness, prompting us to ask how to synergize them rather than treating them as disparate or even conflicting methods.
As summarized in Table 1, deep neural networks (DNNs) and DCMs can be complementary because the former are more predictive but less interpretable and robust, while the latter are less predictive but more interpretable and robust. While DNNs are widely known as highly predictive [35, 32, 36, 39, 20], researchers often contend that DNNs lack interpretability [35, 42, 15], which is crucial in analyzing individual behavior for reasons such as safety in autonomous vehicles, knowledge distillation in research, and transparency in governance [42, 15, 18]. DNNs are also found to lack robustness, creating a brittle system vulnerable to small random noises or adversarial attacks. On the other hand, parsimonious DCMs are believed to be more interpretable and robust, although their predictive power can be low owing to their misspecification errors. Therefore, it appears to be a natural question whether these two complementary methods can be synergized to retain the strengths of both sides. However, since DNNs and DCMs emerged from two different research communities (computer science and economics), it is unclear whether this synergy is even possible, let alone mutually beneficial.

Table 1: Comparison of DNNs, DCMs, and TB-ResNets
Models                                               Prediction  Interpretability  Robustness
Deep neural networks (DNNs)                          High        Low               Low
Discrete choice models (DCMs)                        Low         High              High
Theory-based residual neural networks (TB-ResNets)   High        High              High
To address the aforementioned challenge, this study designs a theory-based residual neural network (TB-ResNet) that synergizes DNNs and DCMs, demonstrating that this synergy is not only feasible but also desirable, leading to a simultaneous improvement in prediction, interpretation, and robustness. This study consists of three main components. We first demonstrate that DNNs align with the RUM framework by briefly recounting McFadden (1974) and Wang et al. (2020) [48, 74]. Second, we present the TB-ResNet framework, which augments DCMs with DNNs to fit the utility residuals with a (δ, 1−δ) formulation, resembling the essence of the standard residual network (ResNet) [25]. This TB-ResNet framework is further elaborated with six interwoven perspectives: architecture design, model ensemble, gradient boosting, regularization, flexible function approximation, and theory diagnosis. The regularization perspective is formally demonstrated by using the state-of-the-art statistical learning theory to illustrate the intuition that DNNs tend to be too complex to capture reality and DCMs tend to be too simple to do so. Then we design three instances of TB-ResNets using multinomial logit models (MNL-ResNets), prospect theory (PT-ResNets) for risk preference, and hyperbolic discounting (HD-ResNets) for time preference, showing that the simple TB-ResNet framework can incorporate a wide range of DCMs that are part of the utility maximization framework. Lastly, we use empirical testing to determine whether the three instances of TB-ResNets are effective in three datasets, one collected in Singapore and two from Tanaka et al. (2010) [69]. We found that (with some exceptions) the three instances of TB-ResNets can generally improve the overall prediction, interpretability, and robustness relative to the pure DCMs and DNNs.

The next section reviews related studies. Section 3 introduces the TB-ResNet and its three instances. Section 4 discusses the design of experiments. Section 5 presents the results, and Section 6 concludes and discusses our findings.
2. Literature Review
Individual decision-making has been a classical research question in economics, transportation, marketing, and many other social science and engineering fields. At least three predominant types of DCMs exist: the multinomial logit (MNL) model describing the trade-off between multiple alternatives, prospect theory (PT) models that analyze decision-making under risk and uncertainty, and hyperbolic discounting models that analyze temporal decision-making. McFadden (1974) developed the seminal MNL model based on random utility maximization and applied the model to travel behavior analysis [48]. After McFadden (1974), several generations of researchers refined the MNL model by incorporating heterogeneity, endogeneity, and more complicated substitution patterns [70, 71, 4]. In terms of risk preference, von Neumann and Morgenstern [52] created the expected utility model to analyze how individuals make decisions with risky inputs. Kahneman and Tversky [31, 72] created prospect theory (PT), which addresses anomalies that cannot be explained by the initial expected utility models [52, 61, 1, 67]. In the last two decades, researchers gradually improved these models by specifying the formulation of reference points or adding more interactions between attributes and probabilities [69, 13, 34]. With regard to time preference, important models include exponential discounting [64], hyperbolic discounting (HD) [45], quasi-hyperbolic discounting [55], and many others. Given the ubiquity of individual decision-making across a massive number of fields, these three types of theories have been widely applied to analyze travel behavior, technology adoption, fuel economy, policy decisions, insurance premiums, procrastination, and self-control [8, 53, 43, 54, 33].

Recently, researchers have started to use DNNs to predict travel behavior, demonstrating that DNNs can outperform DCMs in terms of prediction accuracy, although these studies often fail to connect DNNs to DCMs in a deeper manner. Individual decision-making can be treated as an ML classification task because the target variables are often discrete. Researchers have used DNNs to predict travel mode choice [9], car ownership [59], travel accidents [84], travelers' decision rules [10], driving behaviors [30], trip distribution [50], hierarchical demand structure [79], queue lengths [40], parking occupancy [82], metro passenger flows [24], and traffic flows [60, 44, 80, 83, 14, 46]. DNNs are also used to complement smartphone-based surveys [81], improve survey efficiency [66], synthesize new populations [6], and impute survey data [16]. Studies typically found that ML classifiers, including DNNs, support vector machines, decision trees, and random forests, can achieve higher predictive performance than the classical DCMs [62, 56, 65, 23, 9]. However, these investigations are mainly limited to a comparative perspective, implicitly intensifying the tension between the data-driven and the theory-driven methods. While several recent studies have started to explore the interaction between DNNs and DCMs [78, 74, 76], the exploration is still inadequate. Given the predominant use of DCMs and DNNs in travel modeling, it is imperative to demonstrate how to adopt the DNN perspective to analyze individual decision-making beyond prediction.

Prediction accuracy should not be the only focus, because interpretability and robustness are both important criteria [42, 18, 15].
Although many recent studies have focused on DNN interpretation [27, 63, 17, 3, 68], DNNs are still largely perceived as lacking interpretability [35]. This is not surprising given that DNNs were initially designed to maximize predictive power. DNNs and DCMs respectively focus on prediction and interpretation, or equivalently, on predicting $\hat{y}$ and estimating $\hat{\beta}$, as argued by Mullainathan and Spiess (2017) [51]. In the transportation field, only a small number of studies have touched upon the interpretability issue of DNNs for choice modeling. For example, researchers extracted full economic information from DNNs [75], ranked the importance of DNN input variables [23], or visualized the input-output relationship to improve the understanding of DNN models [5]. The challenge of analyzing model interpretability is partially caused by the ambiguity of its definition, as pointed out by Lipton (2016) [42]. For example, interpretability can be defined as "simulatability": whether researchers can easily simulate the model in their mind. It can also be defined as the capacity to approximate the true probabilistic behavioral mechanism in the choice modeling context [75, 78]. This work adopts the latter definition, recognizing the importance of behavioral realism in demand modeling.

In the choice modeling context, robustness represents the local stability of economic information and the regularity of the behavioral mechanism, which is formally measured by the prediction invariance to random noises or adversarial attacks. When a choice model is robust, a small perturbation of the inputs, such as a small change in travel cost, should lead to only a small change in the predicted choice probabilities.
3. Theory
Section 3.1 demonstrates that DNNs have an implicit utility interpretation by recounting the results from McFadden (1974) [48] and Wang et al. (2020) [74]. Section 3.2 presents the TB-ResNet framework and introduces six pertinent ML and behavioral perspectives. Section 3.3 formally develops the regularization perspective to illustrate the rationale underlying the design of TB-ResNets. Section 3.4 substantiates the TB-ResNet framework by creating three instances for three choice scenarios.
The choice analysis includes two types of inputs: alternative-specific variables $x_{ik}$ and individual-specific variables $z_i$, with $i \in \{1, 2, ..., N\}$ representing the individual index and $k \in \{1, 2, ..., K\}$ the alternative index. Let $B = \{1, 2, ..., K\}$ and $\tilde{x}_i = [x_{i1}^T, ..., x_{iK}^T]^T$. The output is individual $i$'s choice, denoted as $y_i = [y_{i1}, y_{i2}, ..., y_{iK}]$, with each $y_{ik} \in \{0, 1\}$ and $\sum_k y_{ik} = 1$. The RUM framework assumes that individuals maximize the sum of a deterministic utility $v_{ik}$ and a random utility $\epsilon_{ik}$:

$$u_{ik} = v_{ik} + \epsilon_{ik} = V_k(z_i, \tilde{x}_i) + \epsilon_{ik} \quad (1)$$

in which $v_{ik}$ represents the deterministic utility value and $V_k$ the utility function. The probability of individual $i$ choosing alternative $k$ is

$$P_{ik} = Prob(v_{ik} + \epsilon_{ik} > v_{ij} + \epsilon_{ij}, \ \forall j \in B, j \neq k) \quad (2)$$

Assuming $\epsilon_{ik}$ is independent and identically distributed across individuals and alternatives with cumulative distribution function $F(\epsilon_{ik})$, then

$$P_{ik} = \int \prod_{j \neq k} F_{\epsilon_{ij}}(v_{ik} - v_{ij} + \epsilon_{ik}) \, dF(\epsilon_{ik}) \quad (3)$$

and the two following propositions hold:
Proposition 1. Suppose $\epsilon_{ik}$ follows a Gumbel distribution, with probability density function $f(\epsilon_{ik}) = e^{-\epsilon_{ik}} e^{-e^{-\epsilon_{ik}}}$ and cumulative distribution function $F(\epsilon_{ik}) = e^{-e^{-\epsilon_{ik}}}$. Then the choice probability $P_{ik}$ takes the form of the Softmax activation function:

$$P_{ik} = \frac{e^{v_{ik}}}{\sum_j e^{v_{ij}}} \quad (4)$$
Proposition 2. Suppose Equation 3 holds and the choice probability $P_{ik}$ takes the form of the Softmax function as in Equation 4. If the distribution of $\epsilon_{ik}$ has the transition complete property, then $\epsilon_{ik}$ follows a Gumbel distribution, with $F(\epsilon_{ik}) = e^{-\alpha e^{-\epsilon_{ik}}}$.

Propositions 1 and 2 jointly demonstrate that DNNs have an implicit RUM interpretation. Specifically, the Softmax activation function, which is used as the last layer in nearly all DNN classification architectures, implies random utility terms with a Gumbel distribution under the RUM framework. When a fully connected feedforward DNN is applied to inputs $\tilde{x}_i$ and $z_i$, the implicit assumption is RUM with the random utility term following a Gumbel distribution. Therefore, the inputs into the Softmax activation function in DNNs can be interpreted as the utilities of the alternatives: the Softmax function itself is a process of comparing utility scores, and the DNN transformation prior to the Softmax function is a process of specifying utilities. Proposition 1 can be found in nearly all textbooks of choice modeling [71, 4], and Proposition 2 is from Lemma 2 in McFadden (1974). By taking advantage of the RUM interpretation of DNNs, researchers can design novel DNN architectures to improve model performance [74]. A sketch of the proof of the two propositions is in Appendix I.

DNNs and DCMs share the utility maximization framework, but they parameterize their utility functions in different ways. DNNs can automatically learn utility functions owing to the strong approximation power of complex DNN model families [29, 28, 11], while DCMs rely on much more parsimonious parametric assumptions. For example, the utility function of DNNs ($V_{DNN,k}$) can be parameterized by millions of parameters, while that of DCMs ($V_{T,k}$) often by fewer than ten parameters. The similar utility interpretation shared by DNNs and DCMs enables us to design the TB-ResNet framework, and their differences in model complexity are an opportunity for complementarity.
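To make Proposition 1 concrete, the following simulation is a minimal sketch of ours (not from the paper): it draws i.i.d. Gumbel noise around fixed deterministic utilities, applies utility maximization, and confirms that the empirical choice frequencies match the Softmax probabilities. The utility values are arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([1.0, 0.5, -0.3])                # deterministic utilities v_k
eps = rng.gumbel(size=(1_000_000, 3))         # random utilities eps_ik ~ Gumbel
choices = (v + eps).argmax(axis=1)            # utility-maximizing choices
empirical = np.bincount(choices, minlength=3) / len(choices)
softmax = np.exp(v) / np.exp(v).sum()
print(empirical.round(3), softmax.round(3))   # both approx [0.532 0.323 0.145]
```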
Leveraging the similar utility interpretation in DCMs and DNNs, we design the TB-ResNet framework, which consists of a DCM utility function $V_{T,k}(z_i, \tilde{x}_i)$ and a DNN utility function $V_{DNN,k}(z_i, \tilde{x}_i)$ weighted by 1−δ and δ:

$$v_{TB\text{-}ResNet,ik} = (1-\delta)v_{T,ik} + \delta v_{DNN,ik} = (1-\delta)V_{T,k}(z_i, \tilde{x}_i) + \delta V_{DNN,k}(z_i, \tilde{x}_i) \quad (5)$$

where $V_{T,k}$ represents the utility function from DCMs, $V_{DNN,k}$ represents that from DNNs, and (δ, 1−δ) adjusts the weighting between them. This TB-ResNet can be seen as a linear combination of the two types of utility functions with a flexible weighting controlled by δ. The utility specification of a feedforward DNN $V_{DNN,ik}$ in Equation 5 can be parameterized as

$$V_{DNN,k}(z_i, \tilde{x}_i) = w_{m,k}^T \Phi(z_i, \tilde{x}_i) = w_{m,k}^T (g_{m-1} \circ ... \circ g_2 \circ g_1)(z_i, \tilde{x}_i) \quad (6)$$

in which $m$ is the number of layers of the DNN and $g_l(t) = ReLU(W_l^T t + b_l)$, with $ReLU(t) = \max(0, t)$. $W_l$ represents the DNN parameters of layer $l$, and ReLU is one activation function among many alternatives (e.g., Tanh and Sigmoid). The DCM utility specification can be parameterized by various utility theories, which we will discuss in Section 3.4.

Fig. 1. Architecture of TB-ResNet. Both the DCM and DNN blocks are flexible: the DNN block uses seven layers as an example, but it can be any depth or width; the DCM block can take any utility specification under the RUM framework.

Figure 1 represents the architecture of the TB-ResNet, consisting of the shallow (1−δ)DCM and the deep δDNN blocks for joint specification of the deterministic utility term. Typically the DCM block $(1-\delta)V_{T,k}$ can be represented by a shallow neural network with a one-layer transformation, while the DNN block $\delta V_{DNN,k}$ is represented by a deep structure capable of automatic learning. The DCM and DNN blocks transform the inputs $(z_i, \tilde{x}_i)$ into deterministic utilities, which are further converted into choice probabilities and outputs through the Softmax activation function. This TB-ResNet framework can be understood from six interwoven ML and behavioral perspectives as follows.

First and most intuitively, the TB-ResNet can be seen as a new DNN architecture, because the DCM part in the TB-ResNet represents a shallow neural network and the DNN part represents a deep one. In fact, the name of the TB-ResNet arises from the standard ResNet architecture, which consists of an identity feature mapping and a feedforward DNN architecture: $v_{ResNet,ik} = V_{I,k}(z_i, \tilde{x}_i) + V_{DNN,k}(z_i, \tilde{x}_i)$, where $V_{I,k}(x) = x$. When the true model is close to linear, the ResNet can approximate the true model better than a standard feedforward DNN. This reasoning can be similarly applied to TB-ResNets. When 1−δ is close to one, the TB-ResNet consists of a main DCM part and a small DNN part fitting the utility residual, resembling the essence of the standard ResNet architecture.

Second, the TB-ResNet framework with the (δ, 1−δ) weighting can be seen as an ensemble model of DCMs and DNNs with scale adjustment. The weighting is controlled by the ratio (1−δ)/δ, which can span all possible positive values under a logarithmic scale of δ ∈ (0, 1). (Here the logarithmic scale refers to δ taking values of the form $10^{-x}$ and $1 - 10^{-x}$ as it approaches zero or one; this design maximizes the span of the magnitude of the scale ratio (1−δ)/δ.) When δ → 0, such as δ = $10^{-10}$, the utility ratio (1−δ)/δ converges to +∞; when δ → 1, such as δ = $1 - 10^{-10}$, the utility ratio (1−δ)/δ converges to 0; when δ = 0.5, the ratio equals one.
Therefore, this (δ, 1−δ) weighting allows us to explore all possible utility ratios of the DCM and the DNN. In fact, the flexible scaling is closely related to the randomness discussion in classical choice modeling. For example, to combine revealed and stated preference data, researchers need to adjust a scale factor to reflect the different randomness in the two types of data sets [76, 26, 7].

Third, the TB-ResNet framework is similar to the gradient boosting method [19], although differences still exist. They are similar since both seek to achieve higher performance by adding multiple models; in particular, the TB-ResNets with a sequential training procedure are similar to the boosting method with multiple stages. However, they are also different, since gradient boosting is typically used to combine multiple weak classifiers, while TB-ResNets combine a relatively weak classifier (DCMs) and a strong one (DNNs). As a result, the regularization perspective becomes essential in the TB-ResNets, particularly when δ is small. In addition, the TB-ResNets combine the DCMs and DNNs through the shared utility interpretation, while the boosting method connects multiple classifiers through the multi-stage optimization of the loss function. The shared utility interpretation in TB-ResNets is helpful in obtaining not only lower prediction losses, but also improvements in local regularity, robustness, and utility-based economic interpretation. Nonetheless, the authors acknowledge that it is a fine line that differentiates model ensembles, gradient boosting, and our TB-ResNet framework. It is also possible to improve the TB-ResNet framework by incorporating the other two perspectives in the future.
Fourth, when δ → 0, the TB-ResNet is dominated by the DCM component, which becomes a skeleton utility function that localizes and stabilizes the TB-ResNet system, and the small δ regularizes the complex DNN component to address overfitting. With a small δ, the model complexity between $(1-\delta)V_{T,k}$ and $\delta V_{DNN,k}$ is more balanced, since the DNN component is typically much more complex than the DCM component. Intuitively, when δ becomes smaller, the TB-ResNet framework is increasingly localized around the DCM component, and the training of the DNN resembles a search within a small neighborhood of the DCM. A small δ is most effective when simple DCMs can successfully capture much of the true behavioral mechanism. In fact, when the DCM component can perfectly capture the true behavioral mechanism, the best δ should be close to zero.
Fifth, when δ → 1, the TB-ResNet is dominated by the DNN component, which allows the TB-ResNet system to use the outstanding approximation power of DNNs to approximate the true data generating process. When δ is closer to one, the TB-ResNet trends towards a large regional search by the δDNN component around the small (1−δ)DCM component. A large δ is the most effective when simple DCMs capture little information about the decision-making mechanism. In the worst scenario, when the DCMs capture zero information, the TB-ResNet reduces to a DNN model.

Sixth, the optimum δ value becomes a metric to diagnose the completeness of the DCMs. An optimum δ that is small suggests that the current DCM is highly effective, since only a small portion of the DNN component is needed to fit the utility residuals. An optimum δ that is large suggests that the current DCM is far from complete, since the TB-ResNet mainly uses the DNN component to fit the true behavior. The optimum δ value can only be identified empirically, since modelers cannot evaluate the completeness of a theory a priori. The results section will compare the optimum δ values of our three instances, shedding light on the theoretical completeness of the MNL, PT, and HD models.

In summary, simple DCMs tend to underfit the true behavioral mechanism, while rich DNNs tend to overfit. The TB-ResNets are formulated with a flexible (δ, 1−δ) weighting, taking advantage of both the simplicity of DCMs and the richness of DNNs, and guarding against problems from both sides. A large δ enables the DNN component to fit the utility residual of the DCM component to address the underfitting problem, and a small δ controls the scale of the DNN component as a regularization tool to address the overfitting problem. This is the key intuition underlying the design of TB-ResNets.
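The following PyTorch sketch is ours (the paper does not include code); it implements the weighted utility of Equation 5, with a linear-in-parameters DCM block as the shallow network and the feedforward ReLU network of Equation 6 as the deep block. The class name, default width, and depth are assumptions loosely matching the experiment setup in Section 4.

```python
import torch
import torch.nn as nn

class TBResNet(nn.Module):
    """Utility v = (1 - delta) * V_T(x) + delta * V_DNN(x), as in Equation 5."""
    def __init__(self, n_inputs, n_alternatives, delta, width=100, depth=3):
        super().__init__()
        self.delta = delta
        # DCM block: a linear-in-parameters utility, i.e., a shallow network.
        self.dcm_utility = nn.Linear(n_inputs, n_alternatives)
        # DNN block: feedforward ReLU layers, the parameterization of Eq. 6.
        layers, d = [], n_inputs
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, n_alternatives))
        self.dnn_utility = nn.Sequential(*layers)

    def forward(self, x):
        # Weighted combination of the two utility specifications.
        return ((1 - self.delta) * self.dcm_utility(x)
                + self.delta * self.dnn_utility(x))
```

Choice probabilities follow by applying torch.softmax to the returned utilities, mirroring Propositions 1 and 2; training against observed choices with cross-entropy corresponds to the likelihood formulation in Section 4.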
Formally, the problems of underfitting and overfitting can be framed as the challenge of balancing approximation and estimation errors. This subsection uses the state-of-the-art statistical learning theory to illustrate the importance of the δ term in balancing the model complexities of DCMs and DNNs.

The out-of-sample performance of TB-ResNets can be decomposed into approximation and estimation errors, and the analysis of the latter illustrates the importance of controlling model complexity. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ denote the model families of DCMs and DNNs. Empirical risk minimization (ERM) is used for model training:

$$\hat{f} = \mathop{argmin}_{f \in (1-\delta)\mathcal{F}_1 + \delta\mathcal{F}_2} \frac{1}{N}\sum_{i=1}^{N} l(y_i, f(x_i)) \quad (7)$$

in which $x_i$ is a vector representing both the alternative-specific inputs $x_{ik}$ and the individual-specific inputs $z_i$. Let $f^*$ denote the true data generating process. The excess error is defined and decomposed as follows:

$$E_S[L(\hat{f}) - L(f^*)] = E_S[L(\hat{f}) - L(f^*_{\mathcal{F}})] + E_S[L(f^*_{\mathcal{F}}) - L(f^*)] \quad (8)$$

where $L = E_{x,y}[l(y, f(x))]$ is the expected loss function and $S$ represents the sample $\{x_i, y_i\}_1^N$; $f^*_{\mathcal{F}} = \mathop{argmin}_{f \in \mathcal{F}} L(f)$ is the best function in the function class $\mathcal{F} := (1-\delta)\mathcal{F}_1 + \delta\mathcal{F}_2$ to approximate $f^*$. The excess error measures the average difference in out-of-sample performance between the estimated function $\hat{f}$ and the true model $f^*$. It is decomposed into an estimation error $E_S[L(\hat{f}) - L(f^*_{\mathcal{F}})]$ and an approximation error $E_S[L(f^*_{\mathcal{F}}) - L(f^*)]$. The approximation error is deterministic and thus irrelevant to the training procedure, so this study does not address approximation errors in detail. A simple model family (e.g., DCMs) will usually approximate the true data generating process worse than a complex model family (e.g., DNNs), which can also be inferred from the universal approximator theorem of DNNs [29]. The key upper bound is on the estimation error $E_S[L(\hat{f}) - L(f^*_{\mathcal{F}})]$, which is provided by using the Rademacher complexity from statistical learning theory.
Definition 1. The empirical Rademacher complexity of a function class $\mathcal{F}$ is defined as

$$\hat{R}_N(\mathcal{F}|S) = E_\epsilon \sup_{f \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \epsilon_i f(x_i) \quad (9)$$

where $\epsilon_i$ is the Rademacher random variable, taking values $\{-1, +1\}$ with equal probabilities.
Proposition 3. The estimation error of an estimator $\hat{f}$ can be bounded by the Rademacher complexity of $\mathcal{F}$:

$$E_S[L(\hat{f}) - L(f^*_{\mathcal{F}})] \leq E_S \hat{R}_N(\mathcal{F}|S) \quad (10)$$

Definition 1 and Proposition 3 jointly provide the intuition that the upper bound of the estimation error of any estimate $\hat{f}$ can be approximated by the complexity of the model family $\mathcal{F}$. Note that $\hat{R}_N(\mathcal{F}|S)$ in Definition 1 measures how complex the model family $\mathcal{F}$ is, and the averaged Rademacher complexity on the right side of Proposition 3 is the upper bound on the estimation error. In other words, it is important to limit the complexity of the model family $\mathcal{F}$ to achieve a tight upper bound on the estimation error. The proof of Proposition 3 can be found in [78], and the key technique is the symmetrization lemma [73].

In the case of TB-ResNets, $\mathcal{F}$ is designed to be $(1-\delta)\mathcal{F}_1 + \delta\mathcal{F}_2$. The following three propositions provide the intuition as to why the δ weighting is important:
Proposition 4. The estimation error of the TB-ResNet estimator $\hat{f}$ can be bounded by the weighted Rademacher complexities of $\mathcal{F}_1$ and $\mathcal{F}_2$:

$$E_S[L(\hat{f}) - L(f^*_{\mathcal{F}})] \leq E_S[(1-\delta)\hat{R}_N(\mathcal{F}_1|S) + \delta\hat{R}_N(\mathcal{F}_2|S)] \quad (11)$$

Proposition 5. The Rademacher complexity of the DCM model family $\mathcal{F}_1$ can be bounded by

$$\hat{R}_N(\mathcal{F}_1|S) \lesssim O\left(\sqrt{\frac{v}{N}}\right) \quad (12)$$

in which $v$ is the VC dimension of the function class $\mathcal{F}_1$ and $N$ is the sample size.
Proposition 6. The Rademacher complexity of the DNN model family $\mathcal{F}_2$ can be bounded by

$$\hat{R}_N(\mathcal{F}_2|S) \lesssim \frac{\left(\sqrt{2\log(2)D}+1\right)\sqrt{\frac{1}{N}\sum_{i=1}^{N}\|x_i\|^2}}{\sqrt{N}} \times \prod_{j=1}^{D} M_F(j) \quad (13)$$

in which $D$ represents the depth of the DNN, $\|x_i\|$ represents the norm of the input variables, and $M_F(j)$ is the upper bound of the Frobenius norm of the parameter matrix $W_j$. Here the DNN model uses ReLU activation functions.

Propositions 4, 5, and 6 demonstrate the importance of using δ to balance the estimation errors between $\mathcal{F}_1$ and $\mathcal{F}_2$. Since the DNN model family $\mathcal{F}_2$ is much more complicated than the DCM model family $\mathcal{F}_1$, a strong regularization needs to be imposed on $\mathcal{F}_2$ to guarantee a balanced complexity between $(1-\delta)\hat{R}_N(\mathcal{F}_1|S)$ and $\delta\hat{R}_N(\mathcal{F}_2|S)$. Specifically, the DNN complexity, as shown in Proposition 6, is exponential in the depth $D$ and also depends on the size of the input variables $x_i$. On the other hand, the DCM complexity of $\mathcal{F}_1$ only depends on the VC dimension, which is often linearly related to the number of parameters. Therefore, the DNN complexity is typically much larger than the DCM complexity ($\hat{R}_N(\mathcal{F}_2|S) \gg \hat{R}_N(\mathcal{F}_1|S)$). In this case, a small δ weighting on the DNN part can limit the total complexity of the TB-ResNets ($(1-\delta)\mathcal{F}_1 + \delta\mathcal{F}_2$), thus improving the model performance. The proof of Proposition 5 can be found in Wang et al. (2020) [75], and that of Proposition 6 is in Golowich et al. (2017) [21]. We provide a sketch proof for Proposition 4 in Appendix IV.

The TB-ResNet framework is flexible because the DCM utility $V_{T,k}$ can take different forms depending on the context. In the MNL setting, we use the linear utility specification as the theory part of TB-ResNets. Replacing the $V_{T,k}$ in Equation 5 by $V_{MNL,k}$ and removing index $i$ for simplicity, the MNL-ResNet is formulated as

$$v_{MNL\text{-}ResNet,k} = (1-\delta)v_{MNL,k} + \delta v_{DNN,k} = (1-\delta)V_{MNL,k}(z, \tilde{x}) + \delta V_{DNN,k}(z, \tilde{x}) \quad (14)$$

$$V_{MNL,k}(z, \tilde{x}) = w_{0,k} + w_{x_k}' x_k + w_z' z \quad (15)$$

where $w_{x_k}$ represents the parameters for the alternative-specific variables and $w_z$ represents the parameters for the individual-specific variables. This linear-in-parameters specification of $V_{MNL,k}$ is widely used in choice modeling.
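A minimal sketch (ours) of the linear utility in Equation 15, separating the alternative-specific variables x_k from the individual-specific variables z; all parameter values below are hypothetical.

```python
import numpy as np

def mnl_utility(x_k, z, w0_k, w_xk, w_z):
    # V_MNL,k = w_{0,k} + w_{x_k}' x_k + w_z' z  (Equation 15)
    return w0_k + w_xk @ x_k + w_z @ z

# Two attributes of one alternative plus one socioeconomic variable:
v = mnl_utility(x_k=np.array([2.5, 30.0]),    # e.g., cost ($), time (min)
                z=np.array([1.0]),            # e.g., an income indicator
                w0_k=0.5, w_xk=np.array([-0.3, -0.02]), w_z=np.array([0.1]))
print(v)
```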
In the risk preference setting, we use prospect theory as the theory part of TB-ResNets. Replacing $V_{T,k}$ in Equation 5 by $V_{PT,k}$, the utility specification of the PT-ResNet is formulated as

$$v_{PT\text{-}ResNet,k} = (1-\delta)v_{PT,k} + \delta v_{DNN,k} = (1-\delta)V_{PT,k}(z, \tilde{x}) + \delta V_{DNN,k}(z, \tilde{x}) \quad (16)$$

$$V_{PT,k}(z, \tilde{x}) = \sum_j c(x_{kj})\pi(p_{kj}) \quad (17)$$

The value function $c(x_{kj})$ and probability weighting function $\pi(p_{kj})$ are further parameterized as

$$c(x_{kj}) = \begin{cases} x_{kj}^r & x_{kj} \geq 0 \\ -\lambda(-x_{kj})^r & x_{kj} < 0 \end{cases} \quad (18)$$

$$\pi(p_{kj}) = e^{-(-\ln p_{kj})^\alpha} \quad (19)$$

$$\alpha = \alpha(z) = \alpha_0 + z'w_{\alpha z} \quad (20)$$

$$r = r(z) = r_0 + z'w_{rz} \quad (21)$$

$$\lambda = \lambda(z) = \lambda_0 + z'w_{\lambda z} \quad (22)$$

In the equations above, $j$ is the index of uncertain monetary payoffs; $x_{kj}$ is the monetary payoff for alternative $k$ at the value indexed by $j$; $p_{kj}$ is the winning probability of $x_{kj}$; $\alpha$ represents the probability weighting factor; $r$ represents the concavity of the value function; $\lambda$ represents the loss aversion factor; $\alpha$, $r$, and $\lambda$ are individual specific and can be partially explained by the socioeconomic variables $z$. The specification in Equations 16 to 22 is basically the same as Tanaka et al. (2010) [69]. (There are two slight differences between our PT model and Tanaka et al. (2010): the initial paper used a non-parametric method to estimate individuals' risk preference parameters and sequentially estimated the coefficients $w_{\alpha z}$, $w_{rz}$, and $w_{\lambda z}$, while our PT model follows a parametric method and simultaneously estimates all the coefficients.) PT is widely used for travel behavior analysis because travel decisions often involve time uncertainty [12, 2].
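The PT value and probability-weighting functions in Equations 17-19 can be transcribed directly; the following sketch is ours, and the parameter values in the example are illustrative rather than the paper's estimates.

```python
import numpy as np

def pt_value(x, r, lam):
    # Equation 18: concave over gains, convex and loss-averse over losses.
    return np.where(x >= 0, np.abs(x) ** r, -lam * np.abs(x) ** r)

def pt_weight(p, alpha):
    # Equation 19: pi(p) = exp(-(-ln p)^alpha), the probability weighting.
    return np.exp(-(-np.log(p)) ** alpha)

def pt_utility(payoffs, probs, r, lam, alpha):
    # Equation 17: V = sum_j c(x_j) * pi(p_j).
    return np.sum(pt_value(payoffs, r, lam) * pt_weight(probs, alpha))

# e.g., a 40% chance of winning $10 against a 60% chance of losing $5:
print(pt_utility(np.array([10.0, -5.0]), np.array([0.4, 0.6]),
                 r=0.6, lam=2.25, alpha=0.7))
```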
Another instance of TB-ResNets is the hyperbolic discounting residual network (HD-ResNet) for time preferences. The utility function of the HD-ResNet is formulated as follows:

$$v_{HD\text{-}ResNet,k} = (1-\delta)v_{HD,k} + \delta v_{DNN,k} = (1-\delta)V_{HD,k}(z, \tilde{x}) + \delta V_{DNN,k}(z, \tilde{x}) \quad (23)$$

$$V_{HD,k}(z, \tilde{x}) = \sum_j x_{kj}\beta e^{-r t_{kj}} \quad (24)$$

$\beta$ and $r$ can be further parameterized as

$$\beta = \beta(z) = \beta_0 + z'w_{\beta z} \quad (25)$$

$$r = r(z) = r_0 + z'w_{rz} \quad (26)$$

In the equations above, $x_{kj}$ is the monetary payoff; $t_{kj}$ is the associated time; $r$ is the conventional time discounting factor; $\beta$ is the present-bias factor; both $r$ and $\beta$ can be partially explained by the socioeconomic variables $z$. The specifications in Equations 23 to 26 are the same as Tanaka et al. (2010) [69].
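A direct transcription (ours) of the HD utility in Equation 24; the payoffs, times, and parameter values below are illustrative. Note that, as written in Equation 24, β multiplies every payoff, whereas quasi-hyperbolic variants apply it only to delayed payoffs.

```python
import numpy as np

def hd_utility(payoffs, times, beta, r):
    # Equation 24: V = sum_j x_j * beta * exp(-r * t_j).
    return np.sum(payoffs * beta * np.exp(-r * times))

# $120 received in 30 days, with beta = 0.9 and r = 0.005 per day:
print(hd_utility(np.array([120.0]), np.array([30.0]), beta=0.9, r=0.005))
```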
4. Experiment Setup
The experiments use three datasets for the three instances. The first dataset (SG) is a stated preference survey collected by the authors in Singapore in 2017, focusing on the travel mode choice between walking, buses, ridesharing, driving, and autonomous vehicles. The second and third datasets are stated preference surveys collected from Tanaka et al. (2010) [69], which focused on the risk and time preferences over two monetary alternatives.

The summary statistics of the three data sets are provided in Appendix II. The sample sizes of the three datasets are respectively 8,…, …,335, and 5,…; in the SG data, 38% of the respondents chose to walk. The attributes in the SG data set were designed by using a standard orthogonal experimental design based on the average travel times and costs of the travel alternatives in Singapore. The attributes in the PT and HD data sets were designed by using simulations based on the PT and HD theories. The survey for the SG data set was conducted with the help of the online survey company Qualtrics; the surveys for the PT and HD data sets were collected through interviews with the help of local officials. For data collection details of the three data sets, readers can refer to [77] and [69].
The three data sets are divided into training and testing sets, with the former used to train the TB-ResNets and the latter to demonstrate the effects of different δ values. (For applications, the authors suggest dividing the datasets into training, validation, and testing sets; the validation set can be used to choose the optimum δ value and the testing set for final evaluation. Since this work focuses on demonstrating the effects of δ, we simplified the data processing into a division of training and testing sets.) The sample sizes of the training and testing sets in the SG, PT, and HD data sets are (5050, 1648), (4199, 1050), and (4272, 1068), respectively. To concentrate our focus on the δ factor, we largely simplified the hyperparameter search for the DNN component by using only two simple DNN architectures. The first DNN architecture is designed with depth = 3, width = 100, number of iterations = 5,000, and mini-batch size = 100; the second DNN architecture is the same as the first except with depth = 5. The two DNN architectures are compared to reveal that the TB-ResNet performance partially depends on the configuration of the DNN part. However, this work mainly seeks to demonstrate the benefits of the synergistic TB-ResNet framework rather than improvements within either the DCM or the DNN component, because either component can be improved by a vast number of methods, given the past four decades of work on DCMs and the recent popularity of DNNs, which are beyond the scope of this work. Our empirical results seek to demonstrate that, conditioning on any given DCM or DNN, the TB-ResNets can generate mutual benefits for both components.

The effects of the δ values are demonstrated in the testing sets and explored on a logarithmic scale. (The δ spans the list of values [1e-10, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.002, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999, 1].) A small δ such as $10^{-x}$ postulates a large utility ratio $(1-\delta)/\delta \approx 10^{x}$ between the DCM and the DNN parts, while a large δ such as $1-10^{-x}$ postulates a small ratio of about $10^{-x}$. The range of δ values can assist in empirically evaluating the completeness of the DCM theory, which is impossible to know a priori. An empirically optimum and small δ suggests the relative completeness of the DCM theory, while an optimum and large δ suggests that the DCM model is far from complete and needs further development. The optimality of the δ values will be discussed for all three instances of the TB-ResNets in our result section.
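A sketch (ours) of the δ sweep: the δ grid is the list quoted above, while train_fn and score_fn are hypothetical placeholders for model training and validation scoring.

```python
# The logarithmic delta grid used in the paper.
DELTAS = [1e-10, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.002, 0.004, 0.005,
          0.006, 0.007, 0.008, 0.009, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5,
          0.8, 0.9, 0.95, 0.99, 0.999, 0.9999, 1.0]

def select_delta(train_fn, score_fn):
    """train_fn(delta) -> fitted model; score_fn(model) -> validation score."""
    scored = [(score_fn(train_fn(d)), d) for d in DELTAS]
    return max(scored)[1]  # delta achieving the best validation score
```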
The three instances of TB-ResNets can be trained in a simultaneous or a sequential manner. Simultaneous training means training $V_{DNN,k}$ and $V_{T,k}$ at the same time with stochastic gradient descent. Sequential training means training $V_{T,k}$ in a first stage and $V_{DNN,k}$ in a second. While both methods are consistent with our ML and behavioral perspectives on the TB-ResNet framework, we tend to believe that sequential training is more intuitive than simultaneous training. This is because when $V_{T,k}$ and $V_{DNN,k}$ are trained simultaneously, the DNN component in the TB-ResNet might easily capture all the valuable information that could have been explained by $V_{T,k}$; in this case, simultaneous training might damage the capacity of $V_{T,k}$ to stabilize the utility function. Nonetheless, we provide the results of both sequential and simultaneous training in our results section and appendices.

Using $w_T$ to represent the parameters in the DCM part and $w_{DNN}$ to represent the parameters in the DNN part, the sequential training procedure is formulated as follows. In the first stage, the empirical risk minimization (ERM) is

$$\min_{V_{T,k} \in \mathcal{F}_1} L(\tilde{x}_i, z_i; w_T) = \min_{V_{T,k} \in \mathcal{F}_1} -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log\frac{e^{(1-\delta)V_{T,k}(\tilde{x}_i, z_i; w_T)}}{\sum_{j=1}^{K} e^{(1-\delta)V_{T,j}(\tilde{x}_i, z_i; w_T)}} \quad (27)$$

which is the same as the maximum likelihood estimation used in classical choice models. The second stage is another training conditioned on $\hat{w}_T$. With $\mathcal{F}_2$ representing the model family of DNNs,

$$\min_{V_{DNN,k} \in \mathcal{F}_2} L(\hat{w}_T, \tilde{x}_i, z_i; w_{DNN}) = \quad (28)$$

$$\min_{V_{DNN,k} \in \mathcal{F}_2} -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log\frac{e^{(1-\delta)\hat{V}_{T,k}(\tilde{x}_i, z_i) + \delta V_{DNN,k}(\tilde{x}_i, z_i; w_{DNN})}}{\sum_{j=1}^{K} e^{(1-\delta)\hat{V}_{T,j}(\tilde{x}_i, z_i) + \delta V_{DNN,j}(\tilde{x}_i, z_i; w_{DNN})}} \quad (29)$$

To evaluate predictive performance, this work uses three metrics: prediction accuracy, cross-entropy loss, and F1 score. Prediction accuracy is formulated as

$$Accuracy = 1 - \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{\hat{y}_i \neq y_i\} \quad (30)$$

in which $\hat{y}_i$ represents the vector of the predicted choice, $\mathbb{1}\{\cdot\}$ is an indicator function, and the inequality sign implies that the two vectors ($\hat{y}_i$ and $y_i$) are different. The cross-entropy loss is formulated as

$$\text{Cross-entropy loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log\hat{P}_{ik} \quad (31)$$

in which $\hat{P}_{ik}$ is the predicted choice probability from DCMs, DNNs, or TB-ResNets. This cross-entropy loss is also known as the log-loss, the same as the negative value of the log-likelihood. Lastly, the F1 score is

$$F_1\ score = \sum_{k=1}^{K} W_k \cdot \frac{2 \times Precision_k \times Recall_k}{Precision_k + Recall_k} \quad (32)$$

where $W_k$ is the share of label $k$ in the sample ($W_k = \frac{1}{N}\sum_{i=1}^{N} y_{ik}$). In the equation above, $Precision_k$ and $Recall_k$ are class-specific and formulated as

$$Precision_k = \frac{\sum_{i=1}^{N} \mathbb{1}\{\hat{y}_{ik} = y_{ik} = 1\}}{\sum_{i=1}^{N} \hat{y}_{ik}} \quad (33)$$

$$Recall_k = \frac{\sum_{i=1}^{N} \mathbb{1}\{\hat{y}_{ik} = y_{ik} = 1\}}{\sum_{i=1}^{N} y_{ik}} \quad (34)$$

Intuitively, $Precision_k$ and $Recall_k$ measure the column- and row-specific performance of the confusion matrix. The F1 score combines the column- and row-specific perspectives with a class-specific weighting. Our three metrics (accuracy, cross-entropy loss, and F1 score) take into account both the deterministic and probabilistic decision rules for both balanced and unbalanced outputs.
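The two-stage procedure of Equations 27-29 can be sketched as follows, reusing the TBResNet module from the earlier sketch. This is our illustration (full-batch gradient descent rather than the paper's mini-batches), with X a float tensor of inputs and y a tensor of integer choice indices.

```python
import torch
import torch.nn.functional as F

def train_sequential(model, X, y, epochs=500, lr=1e-3):
    # Stage 1 (Eq. 27): maximum likelihood on the (1-delta)-scaled DCM utility.
    opt = torch.optim.Adam(model.dcm_utility.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy((1 - model.delta) * model.dcm_utility(X), y)
        loss.backward()
        opt.step()
    # Stage 2 (Eqs. 28-29): freeze the DCM block, fit the DNN residual utility.
    for p in model.dcm_utility.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(model.dnn_utility.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(X), y)  # full (1-delta)V_T + delta*V_DNN
        loss.backward()
        opt.step()
    return model
```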
5. Results

The result section starts with a visualization of the utility functions of DCMs, DNNs, and TB-ResNets, providing intuition into why DCMs and DNNs are complementary and how TB-ResNets strike a middle ground that retains benefits from both sides. We then successively delve into interpretability, prediction, and robustness for comparison and evaluation.

Figures 2, 3, and 4 visualize how utilities vary with input values in the three choice scenarios. Take Figure 2 as an example. The five graphs on the upper row represent how the utility of taking buses varies jointly with two dimensions, the monetary cost (x-axis) and the in-vehicle travel time (y-axis), and the ten graphs on the lower row visualize how utility varies with a single dimension, the monetary cost and the in-vehicle travel time respectively, holding all the other variables constant. Each pair of graphs on the lower row corresponds to the one graph directly above them on the upper row. On the upper row of Figure 2, the graph on the right end visualizes the utility function of the DNN; on the left is the utility function of the MNL; the three in the middle visualize the utility functions of MNL-ResNets with different δ values. The format of Figures 3 and 4 is similar to Figure 2. In Figure 3 (PT), the monetary payoff and winning probabilities are the x- and y-axes on the upper row and the x-axes on the lower row. In Figure 4 (HD), the monetary payoff and time are the x- and y-axes on the upper row and the x-axis on the lower row. All the models in the three figures are trained by using the sequential training method and with the three-layer DNN part. The results of simultaneous training are attached in Appendix III and those of the five-layer DNN are attached in Appendix V. Both yield findings similar to the results in this subsection.
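The 1D utility plots described above can be produced by evaluating the utility function over a grid of one input while holding the others fixed; the following sketch is ours, with a toy linear utility standing in for a trained TB-ResNet.

```python
import numpy as np

def utility_slice(utility_fn, x_base, dim, grid):
    """utility_fn maps an input vector to a vector of utilities; vary one dim."""
    slices = []
    for value in grid:
        x = x_base.copy()
        x[dim] = value
        slices.append(utility_fn(x))
    return np.array(slices)

# Toy linear utility; in the paper, utility_fn would be the trained
# (1-delta)*V_T + delta*V_DNN utility of a TB-ResNet.
w = np.array([[-0.8, -0.5], [0.2, -0.1]])       # 2 alternatives x 2 inputs
curves = utility_slice(lambda x: w @ x, x_base=np.zeros(2), dim=0,
                       grid=np.linspace(0, 5, 50))
print(curves.shape)  # (50, 2): utility of each alternative along the grid
```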
Fig. 2. Utility functions of MNL-ResNets, MNL, and DNNs. Panels: (a) MNL; (b) MNL-ResNet, δ = 10^{-5}; (c) MNL-ResNet, δ = 0.008; (d) MNL-ResNet, δ = 0.05; (e) DNN. Upper row: visualization of 2D utility functions, with the percentages in the parentheses representing prediction accuracy. Lower row: visualization of 1D utility functions; every pair of figures on the lower row corresponds to the figure directly above on the upper row.
Fig. 3. Utility functions of PT-ResNets, PT, and DNNs (same format as above). Panels: (a) PT; (b) PT-ResNet, δ = 10^{-5}; (c) PT-ResNet, δ = 0.9; (d) PT-ResNet, δ = 0.99; (e) DNN.
Fig. 4. Utility functions of HD-ResNets, HD, and DNNs (same format as above). Panels: (a) HD; (b) HD-ResNet, δ = 10^{-5}; (c) HD-ResNet, δ = 0.05; (d) HD-ResNet, δ = 0.99; (e) DNN.

First, we can observe the complementary nature of DNNs and DCMs by comparing only the two graphs of DNNs and DCMs on the right and left ends of Figures 2, 3, and 4. On one hand, the utility functions of the MNL, PT, and HD models are very regular and intuitive, as shown by subfigures 2a, 3a, and 4a. In subfigures 2f and 2g, the utility of choosing the bus mode linearly decreases as bus costs and in-vehicle travel time increase. In subfigures 3f and 3g, the utility of taking the risky alternative increases as the monetary payoff and winning probabilities increase. These highly regular utility functions in DCMs are interpretable, although it is also likely that the true utility functions are much more complex than the smooth and regular MNL, PT, and HD ones, leading to misspecification errors and underfitting. On the other hand, the utility functions of DNNs in the MNL, PT, and HD scenarios are very irregular and highly counter-intuitive, although they have higher prediction accuracy, as shown by subfigures 2e, 3e, and 4e. For example, in subfigure 2n, the DNN predicts that the utility of using buses first increases as the travel cost increases, violating a basic principle of economic theory. The same type of counter-intuitive results also arises from DNNs in the PT and HD scenarios, as shown in subfigures 3e and 4e. These highly irregular utility functions in DNNs are not very interpretable, although the overly complex functions of DNNs may capture more behavioral mechanisms than DCMs, leading to higher prediction accuracy. In fact, DNNs do outperform DCMs in all three scenarios: the prediction accuracy of the DNN in the MNL setting is 55.8%, higher than the 50.6% of the MNL; that of the DNN in the PT setting is 88.3%, higher than the 69.2% of the PT; and that of the DNN in the HD setting is 71.2%, higher than the 56.7% of the HD. Overall, it is critical to observe the complementary nature of DNNs and DCMs: DCMs might be too simple and regular to capture reality, while DNNs might be too complex and irregular to do so.

TB-ResNets achieve a flexible compromise between DCMs and DNNs, the degree of which is controlled by δ. Take the MNL setting (Figure 2) as an example. As δ increases, the utility function of MNL-ResNets becomes more irregular and similar to the DNN model, and as δ decreases, the utility function becomes more regular and thus similar to the MNL model. This compromise also happens in the PT and HD settings: PT-ResNets and HD-ResNets resemble a continuum between the highly regular DCMs and the irregular DNNs, with δ as the weighting factor.

This perspective of TB-ResNets acting as a flexible compromise between DCMs and DNNs is tied to the six perspectives introduced previously. It should be obvious now that this TB-ResNet framework seeks to strike a weighted ensemble between DCMs and DNNs through the shared utility interpretation, as shown by the continuous change in Figures 2, 3, and 4.
When δ → 0, the TB-ResNets increasingly resemble the theory-driven models: the DCM part becomes the skeleton utility function that stabilizes the full TB-ResNet utility function, and the DNN part is strongly regularized around the DCM part.
When δ → 1, the TB-ResNets can gain higher prediction accuracy, and the behavioral patterns become more similar to the irregular DNNs. Only in the middle ground can the optimum δ values be found to construct optimum TB-ResNets.

DCMs tend to be too simple to capture reality, while DNNs tend to be too complex to do so. As a result, TB-ResNets are more interpretable than pure DCMs because TB-ResNets enrich the overly simple utility functions of DCMs with the DNN component, and also more interpretable than pure DNNs because TB-ResNets regularize the overly complex DNNs with a small δ. For example, the MNL-ResNet model (δ = 0.008 in Figure 2c) is similar to the MNL model, since it has a relatively regular utility contour and the overall utility values decrease as the cost and in-vehicle travel time increase; but unlike the MNL model, it also reveals richer nonlinear details in the utility function. Similar compromises appear in the PT-ResNet (δ = 0.9 in Figure 3c) and the HD-ResNet: the HD-ResNet (δ = 0.05 in Figure 4c) retains these overall patterns with richer details, and is more reasonable than the DNN model (Figures 4e, 4n, and 4o), which produced the counter-intuitive result that the utility of a future payoff increases with longer waiting time.

Elasticity is another important metric used to interpret choice models. Although the utility functions of DNNs and TB-ResNets are highly nonlinear, the most practical elasticity values describe how the aggregate choice probabilities (market shares) respond to a 1% change in the cost, which can simulate policy scenarios of imposing a gasoline tax or increasing bus fares. The elasticity can be computed by the formula

$$\sum_{i=1}^{N} \frac{\partial P_{i,k_1}/P_{i,k_1}}{\partial x_{i,k_2}/x_{i,k_2}} = \sum_i \frac{\partial P_{i,k_1}}{\partial x_{i,k_2}} \times \frac{x_{i,k_2}}{P_{i,k_1}} \quad (35)$$

in which $k_1$ and $k_2$ are the indices of two alternatives and $i$ is the index of individuals. When $k_1 = k_2$, this equation computes the self-elasticity; when $k_1 \neq k_2$, it computes the cross-elasticity. With this formula, the elasticity can be computed for every output alternative with regard to every input.

Table 2 summarizes the elasticity coefficients of MNL, DNN, and MNL-ResNet with respect to the alternative-specific variables, demonstrating again that MNL-ResNets achieve a reasonable compromise between MNL and DNN. As shown in Panel 1, the elasticities of MNL follow the independence of irrelevant alternatives (IIA) assumption, as the coefficients of the self-elasticities are negative and those of the cross-elasticities are positive. This pattern is simple and intuitive, but is often criticized as being too restrictive. On the other side, the DNN model reveals a much more irregular elasticity pattern, in which the coefficients of the self-elasticities and many cross-elasticities are of a large magnitude and are negative. The elasticity coefficients in the MNL-ResNet achieve a certain compromise between MNL and DNN because the magnitude of the coefficients in the MNL-ResNet is much smaller than in the DNN and shrinks toward the magnitude of the MNL model, while the MNL-ResNet still retains a relatively flexible substitution pattern as in the DNN.

Table 2: Elasticity coefficients of MNL, DNN, and MNL-ResNet (δ = 0.008)
Panel 1: MNL (columns: Walk, Bus, Ridesharing, Drive, AV)
Walk: walk time: -1.778 -0.577 -0.431 -0.192 -0.784 -0.760 -0.418 -0.408
AV: in-vehicle time: 0.058 0.058 0.058 0.058 -0.540

Panel 2: DNN (columns: Walk, Bus, Ridesharing, Drive, AV)
Walk: walk time: -6.298 -0.563 -1.842 -0.193 -2.152 -0.494 -0.184
Ridesharing: cost: 0.604 0.514 -2.553 -0.534 -0.832 -5.236 -2.133 -1.643 -0.828 -0.320 -5.919
AV: in-vehicle time: -0.478 -4.594

Panel 3: MNL-ResNet (columns: Walk, Bus, Ridesharing, Drive, AV)
Walk: walk time: -2.476 -0.262 -0.936 -0.017 -0.785 -0.108 -1.003 -0.117 -0.109 -1.472 -0.908 -0.880 -0.010 -1.670
AV: in-vehicle time: -0.342 -1.783
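As an illustration of Equation 35, the aggregate elasticity can be computed with automatic differentiation for any utility-based model such as the TBResNet sketch above; this is our sketch, not the paper's implementation.

```python
import torch

def aggregate_elasticity(model, X, k1, dim):
    """Sum over i of (dP_{i,k1}/dx_{i,dim}) * (x_{i,dim} / P_{i,k1})."""
    X = X.clone().requires_grad_(True)
    P = torch.softmax(model(X), dim=1)[:, k1]      # choice prob of alt. k1
    grads = torch.autograd.grad(P.sum(), X)[0]     # per-row dP_i / dx_i
    return (grads[:, dim] * X[:, dim] / P).sum().item()

# Example with the earlier (untrained) TBResNet sketch:
# model = TBResNet(n_inputs=10, n_alternatives=5, delta=0.008)
# print(aggregate_elasticity(model, torch.randn(100, 10), k1=1, dim=0))
```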
However, the concept of interpretability is ambiguous, since it has multiple definitions, including simulatability, decomposability, algorithmic transparency, and post-hoc interpretability [42]. Our evaluation above has defined interpretability as accurately approximating the true behavioral mechanism and being consistent with accepted behavioral knowledge [78, 75], which belongs to one type of post-hoc interpretability [63, 49, 27]. Mathematically, it is defined as the distance between the true and estimated choice probability functions [78], since an accurately estimated choice probability function can provide the complete economic information in demand modeling [75]. However, a different definition of interpretability can lead to a different conclusion. For example, suppose that interpretability is defined as simulatability, which means the ease with which modelers can "simulate" the model structure in their mind. The simplest DCMs are then fully interpretable while complex DNNs are not, and the TB-ResNet, as a mixture of DCMs and DNNs, can hardly be more interpretable than simple DCMs. Although intuitive, this simulatability definition can be misleading, since it encourages an overly simplified model that fails to recognize the rich behavioral reality. For example, a constant model stripped of all content (e.g., y = 0) would be evaluated as the most interpretable, although it is not practically useful. Nonetheless, it is important to recognize that many definitions of interpretability can co-exist, of which we adopted only a single relevant one for this work.
Table 3 summarizes the predictive performance in the three choice scenarios in three panels. It presents three metrics (prediction accuracy, cross-entropy loss, and F1 score) to measure the performance of two TB-ResNets with different DNN architectures, one with three layers and the other with five. Each panel includes the DCMs, DNNs, and TB-ResNets with varying δ values. Columns 2-7 present the prediction accuracy, cross-entropy loss, and F1 score in the testing sets for the two architectures, and the last column reports the largest share in the choice set as a baseline performance metric. To provide intuition for the predictive performance, Figure 5 visualizes how model performance varies with δ in the three choice scenarios, with points representing individual model performance, fitted by smooth curves, and red dashed lines marking the optimum δ values.

Table 3: Performance of DCMs, DNNs, and two TB-ResNets in testing sets (sequential training)
Each row reports: prediction accuracy, cross-entropy loss, F1 score (3-layer); prediction accuracy, cross-entropy loss, F1 score (5-layer); baseline (largest share).
Panel 1. Performance of MNL models
MNL: 50.6%, 1.254, 0.439; 50.6%, 1.254, 0.439; 44.8%
MNL-ResNet (δ = 1e-5): 53.1%, …, …; …, …, …; 44.8%
MNL-ResNet (δ = 0.008): …, …, …; …, …, …; 44.8%
MNL-ResNet (δ = 0.05): 56.1%, 1.572, 0.559; 57.8%, 3.342, 0.576; 44.8%
DNN: 55.8%, 2.861, 0.555; 55.3%, 4.175, 0.548; 44.8%

Panel 2. Performance of PT models
PT: 69.2%, 0.602, 0.693; 69.2%, 0.602, 0.693; 53.8%
PT-ResNet (δ = 1e-05): 75.6%, 0.502, 0.762; 81.3%, 0.477, 0.772; 53.8%
PT-ResNet (δ = 0.9): …, …, …; …, …, …; 53.8%
PT-ResNet (δ = 0.99): 88.6%, …, …; …, …, …; 53.8%
DNN: 88.3%, …, …; …, …, …; 53.8%

Panel 3. Performance of HD models
HD: 56.7%, 0.684, 0.568; 56.7%, 0.684, 0.568; 50.0%
HD-ResNet (δ = 1e-05): 68.9%, 0.523, 0.689; 73.7%, 0.471, 0.737; 50.0%
HD-ResNet (δ = 0.05): …, …, …; …, …, …; 50.0%
HD-ResNet (δ = 0.99): 76.0%, 0.444, 0.763; 74.9%, 0.453, 0.760; 50.0%
DNN: 71.2%, 0.909, 0.722; 72.6%, 0.958, 0.726; 50.0%

Fig. 5. Model performance and δ (TB-ResNets with the 3-layer DNN). Panels: (a) MNL; (b) PT; (c) HD.

MNL-ResNets, PT-ResNets, and HD-ResNets outperform both DNNs and DCMs in prediction accuracy, cross-entropy loss, and F1 score, as shown in Table 3 and Figure 5. In Figure 5, the curves of the prediction accuracy are always concave and those of the cross-entropy loss are always convex. This suggests that TB-ResNets can outperform both the right and left ends in predicting individual choices, evaluated under the deterministic rule (accuracy) and the probabilistic rule (cross-entropy). The curves in Figure 5 are smooth when approaching the right and left ends, which suggests the smooth convergence of TB-ResNets towards DCMs as δ decreases and towards DNNs as δ increases. The TB-ResNet framework can outperform DCMs because the DNN part in the TB-ResNets relaxes the stringent structural constraints embedded in DCMs, which nearly always lead to misspecification errors and underfitting. The TB-ResNet framework can outperform DNNs because of the localization by the DCM part and the regularization by the small δ. Overall, the results clearly demonstrate that the TB-ResNets can outperform both DCMs and DNNs, incorporate the two model families as two special cases, and converge to the two ends as δ approaches zero or one.

The optimum δ values in Table 3 suggest that the MNL and HD theories are more complete than the PT. As marked by the middle (third) model in each panel, the optimum δ values equal 0.008 and 0.05 in the MNL and HD scenarios, while that in the PT scenario is around 0.9. As introduced in Section 3.3, an optimum and small δ strongly regularizes the DNN part, suggesting relative completeness of the DCM part, while an optimum and large δ uses the DNN part to search in a larger function space, suggesting that the DCM part is less complete. Therefore, our results imply that the MNL and HD theories are relatively complete in capturing the main behavioral mechanisms, while the PT has more room for improvement. This finding can also be validated by the gap in predictive performance between DCMs and DNNs in the three scenarios. A larger prediction gap is related to a larger optimum δ value: the optimum δ values are 0.008, 0.05, and 0.9 in the MNL, HD, and PT scenarios, corresponding to prediction accuracy gaps of 5.2% (= 55.8% - 50.6%), 14.5% (= 71.2% - 56.7%), and 19.1% (= 88.3% - 69.2%). Intuitively, a larger predictive gap between the DCMs and the DNNs suggests more incompleteness of the DCMs. Therefore, the empirical results align closely with our theoretical discussions: the empirical optimum δ value creates an optimum TB-ResNet model and functions as a diagnostic tool to evaluate the completeness of the theories.

Although Table 3 presents the optimum δ as exact values, the optimum δ is more likely to take a range of values, caused by the potential inconsistency of the multiple predictive metrics [76]. Model selections based on multiple predictive metrics can be inconsistent with each other. For example, in Table 3 and Figure 5, while DNNs outperform PT in both prediction accuracy and cross-entropy loss, DNNs outperform the MNL and HD models in prediction accuracy only and underperform in cross-entropy loss.
In selecting the optimum TB-ResNets, the optimum δ based on the prediction accuracy and that based on the cross-entropy loss are always slightly different. The authors would argue that this optimum range is actually preferable to an exact optimum value, as it presents flexibility for further designs of δ to enrich the TB-ResNet framework.

The TB-ResNets' predictive performance varies with the design of the DNN part. Table 3 presents the TB-ResNets constructed with two DNN parts with three and five layers, reported respectively in columns 2-4 and 5-7. The TB-ResNet with the three-layer DNN architecture slightly outperforms that with the five-layer DNN part in the MNL scenario, while the former underperforms the latter in the PT and HD scenarios. Therefore, it is possible to improve the TB-ResNets by adopting more effective DNN architectures, given the vast number of available DNN architectures. However, this work seeks to demonstrate that, regardless of the specific DNN architecture, the TB-ResNets can always improve the performance of DNNs through the combination with the DCMs. Hence, an in-depth exploration of either the DNN or the DCM part is beyond the scope of our work.

Although robustness is not a traditional topic in travel demand modeling, it is increasingly important in both general ML discussions and the specific setting of demand modeling, since it can measure the local regularity of economic information. (For more details, readers can refer to Wang, Wang, and Zhao (2020) on the relationship between robustness and the local regularity of economic information.) A model lacking robustness tends to predict an irregular behavioral pattern in which a slight perturbation of the input leads to a large change in the output, which is behaviorally unrealistic, since a tiny price increase should not dramatically change a decision-maker's choice. In practice, perturbations can arise from measurement noises in the data or randomness in the data collection process.

This type of behavioral regularity can be formally measured by robustness, or more specifically, by the change in predictive performance under small perturbations of the inputs. This study uses three perturbations, including two adversarial attacks (FGSM and TGSM) and one random noise (Gaussian noise): (1) the fast gradient sign method (FGSM), $x_{adv} = x + \epsilon \times sign(\nabla_x L(y, \hat{y}))$; (2) the targeted gradient sign method (TGSM), $x_{adv} = x - \epsilon \times sign(\nabla_x L(y_{target}, \hat{y}))$ [22, 38]; and (3) the Gaussian noise (GN), $x_{adv} = x + \epsilon \times \Delta x$, where $x$ is already standardized and $\Delta x \sim N(0, 1)$. The models are trained with the initial $x$; the testing inputs are then perturbed into $x_{adv}$ under the three mechanisms, and the models are evaluated by using $x_{adv}$ to predict the $y$ corresponding to the initial $x$. A robust model should respond modestly to the perturbation, since small input perturbations should lead to only a small change in the outputs. The Gaussian noise perturbation simulates random noises in data sets, while the FGSM and TGSM methods simulate malicious attacks on the system. Adversarial attacks are common methods to evaluate system robustness in the ML literature, while Gaussian noise is more realistic in approximating random noise in the context of choice modeling.
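The three perturbations can be sketched as follows (ours, not the authors' code); model maps inputs to utilities, y holds observed choice indices, and y_target is an assumed tensor of target classes for TGSM.

```python
import torch
import torch.nn.functional as F

def perturb(model, X, y, eps, kind="fgsm", y_target=None):
    X = X.clone().requires_grad_(True)
    if kind == "gaussian":                # x_adv = x + eps * N(0, 1) noise
        return X.detach() + eps * torch.randn_like(X)
    if kind == "fgsm":                    # x_adv = x + eps * sign(grad L(y, y_hat))
        loss = F.cross_entropy(model(X), y)
    else:                                 # tgsm: x_adv = x - eps * sign(grad L(y_target, y_hat))
        loss = F.cross_entropy(model(X), y_target)
    sign = torch.autograd.grad(loss, X)[0].sign()
    return (X + eps * sign).detach() if kind == "fgsm" else (X - eps * sign).detach()
```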
Fig. 6. Prediction accuracy with three perturbations (Gaussian noise, FGSM, and TGSM) using the perturbed testing sets; panels: (a)-(c) FGSM, (d)-(f) TGSM, and (g)-(i) Gaussian noise, for the MNL, PT, and HD scenarios respectively.

MNL-ResNets and HD-ResNets are more robust than DNNs under the two adversarial attacks, as shown in Figures 6a, 6c, 6d, and 6f. First, DCMs are more robust than DNNs. For example, in Figure 6a, while the prediction accuracy of the MNL model is lower than that of the DNN model at ε = 0, the accuracy of the MNL model becomes much higher than that of the DNN when ε > 0.03.
Since the MNL-ResNet and the HD-ResNet models are combinations of DCMs and DNNs, the two TB-ResNets are more robust than DNNs, as shown by the orange curves lying above the blue curves in Figures 6a, 6c, 6d, and 6f. The reason is that the DCMs function as an anchoring skeleton in the TB-ResNet system, so the local utility patterns of TB-ResNets are less irregular than those of DNNs. This robustness intuition is closely related to the previous discussions about utility patterns: when utility patterns are more regular, the system tends to be more robust.

However, the robustness pattern under the PT scenario is quite different from MNL and HD, because of the particularity of PT and the different optimum δ value in PT-ResNets. While the MNL and HD theories are designed to have smooth and well-bounded input gradients, the probability weighting function in PT can have an input gradient that is close to infinity when the probabilities approach zero. The underlying behavioral intuition is that people tend to significantly exaggerate very small winning chances, such as in gambling, leading to a large overestimation of the utility gain associated with the alternative. In other words, PT is designed to be sensitive to perturbations, thus lacking robustness, particularly around the region of small probabilities. This can be seen in Figures 6b and 6e: the PT models not only have lower prediction accuracy than the DNNs at ε = 0, but their accuracy also decreases much more quickly than that of the DNNs. On the other hand, the declines in prediction accuracy of the PT-ResNets and the DNNs largely align with each other, because of the large optimum δ value in the PT-ResNet. When δ is close to one, the PT-ResNet largely resembles the DNN part, so their robustness performance tends to be similar. Nonetheless, the pattern of an infinite gradient appears to be specific to PT, because a DCM should generally have regular and bounded gradients, thus stabilizing the TB-ResNets rather than destabilizing them.

Under the Gaussian noise, changes in the predictive performance are much more modest, although the pattern still appears similar to that from the adversarial attacks. As shown in Figures 6g, 6h, and 6i, all of the curves are nearly flat with a slight downward slope, suggesting that the prediction accuracy does not vary much under Gaussian noise. The difference between Gaussian noise and adversarial attacks is no surprise, since the adversarial attacks target the most vulnerable part of the input space while Gaussian noise is random. However, even with the relatively flat curves, the findings here appear similar to those from the adversarial attacks. In the MNL and HD scenarios, the MNL and HD models are slightly more robust than the DNNs, while the PT model seems slightly less so than the DNNs. As a result, the MNL-ResNet and the HD-ResNet appear slightly more robust than the DNNs in the MNL and HD scenarios, while the PT-ResNet and the DNN in the PT scenario nearly overlap.
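The accuracy-versus-ε curves in Figure 6 can be reproduced by a simple sweep, reusing the perturb sketch above (again an illustrative sketch, with the target class for TGSM chosen arbitrarily):

```python
import torch

def accuracy_curve(model, x, y, eps_grid, mode="fgsm"):
    """Prediction accuracy on perturbed inputs against the original labels."""
    n_alt = model(x).shape[-1]
    accs = []
    for eps in eps_grid:
        # For TGSM, attack toward an arbitrary wrong class (here: the next one).
        y_tgt = (y + 1) % n_alt if mode == "tgsm" else None
        x_adv = perturb(model, x, y, eps, mode=mode, y_target=y_tgt)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=-1)
        # A robust model keeps predicting the label of the unperturbed input.
        accs.append((pred == y).float().mean().item())
    return accs
```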
6. Conclusions and Discussions
This study introduces a TB-ResNet framework to analyze individual decision-making by synergizing theory-driven and data-driven methods, based on the utility interpretation shared by DNNs and DCMs. The TB-ResNet framework can be understood from the perspectives of architecture design, model ensemble, gradient boosting, regularization, function approximation, and theory diagnosis. Three instances of TB-ResNets, including MNL-ResNets, PT-ResNets, and HD-ResNets, are created and tested empirically on three data sets to evaluate model prediction, interpretability, and robustness.

As summarized in Table 4, our empirical results demonstrate that TB-ResNets are overall more predictive, interpretable, and robust than pure DCMs and DNNs, although several exceptions exist. Compared to DNNs, TB-ResNets are more predictive, interpretable, and robust because the DCM component in TB-ResNets can stabilize the utility functions and regularize the DNN component. Compared to DCMs, TB-ResNets are more predictive and interpretable because richer utility functions are augmented to the skeleton DCM by the DNN component in TB-ResNets. The TB-ResNets are formulated with a flexible (δ, 1−δ) weighting, thus taking advantage of both the simplicity of the DCMs and the richness of the DNNs, preventing the underfitting of the DCMs and the overfitting of the DNNs, and providing insights into the completeness of the DCM theories. Our findings are consistent across the three scenarios (MNL, PT, and HD). While exceptions exist in the PT scenario and in the robustness comparison to DCMs, our main findings hold from both theoretical and empirical perspectives.
Table 4: Comparison of TB-ResNets to DCMs and DNNs

Models | Prediction | Interpretability | Robustness
Compared to DNNs | Marginal improvement (by stabilization and regularization) | Significant improvement (by stabilization and regularization) | Significant improvement (by stabilization and regularization)
Compared to DCMs | Significant improvement (by augmenting and enriching the utility function) | Significant improvement (by augmenting and enriching the utility function) | No improvement
The TB-ResNet is a method to reconcile handcrafted and automatic utility specification, which resonates with the broader discussion about how to combine general-purpose ML and domain-specific models. DNNs achieve dominant predictive performance because they can automatically learn the utility specification, forming an "end-to-end" system that can "learn from scratch". This power of automation is treated as their main strength over traditional methods that rely on domain knowledge to handcraft utility functions [51, 39]. On the other hand, researchers have argued that automatic feature learning with zero prior knowledge does not appear to be a viable approach. Liao and Poggio [41] contended that "being lazy (automatic learning) is good, but being too lazy is not". Our TB-ResNet is a tangible example of integrating handcrafted and automatic learning systems in the context of individual decision-making.

The TB-ResNet framework also reveals the particularity of DNNs among ML classifiers, since DNNs are closely related to DCMs through the implicit utility interpretation and the probabilistic behavioral perspective. DNNs can model choice probabilities using the Softmax activation function and are thus highly compatible with classical probabilistic behavioral modeling. By contrast, many other ML classifiers adopt a deterministic approach to modeling choice outputs, such as K nearest neighbors and support vector machines, and are therefore somewhat more difficult to synergize with DCMs. The TB-ResNets take advantage of the utility interpretation for this synergy, so this approach resonates better with classical utility theory than the perspectives of model ensemble and gradient boosting. From the architecture design perspective, the TB-ResNet framework can be further improved by incorporating a large number of DNN architectures from the ML community [25, 36] or by designing new DNN architectures with behavioral knowledge [74]. The architecture design perspective will become even more important when modelers start to use high-dimensional inputs such as imagery and natural language for demand modeling.

Overall, this study introduces a simple but flexible TB-ResNet framework. It is simple because it combines DCMs and DNNs neatly into the TB-ResNet architecture, and it is flexible because it can incorporate any DCM as the theory-driven part and any DNN as the data-driven part. The three instances of the TB-ResNets provide evidence that the TB-ResNet framework is malleable enough to be applied to diverse decision scenarios with many benefits. Despite its strengths, many questions still remain. Although the combination of DNNs and DCMs can generate mutual benefits, the TB-ResNet is not the only possible approach to improve prediction, interpretation, and robustness, simply because a large number of other methods have been developed to improve DCMs and DNNs in each community. DCMs can be made more predictive with richer utility specifications, and DNNs can be made more predictive with better designs of architectures, hyperparameters, and training algorithms. DNNs can be made more interpretable and robust by adopting activation maximization (AM) [17], LIME [63], and minimax training procedures [47]. Future studies can further enrich our TB-ResNet framework by using other DCMs for the DCM component and other DNN architectures for the DNN component.
This study highlights a synergetic perspective and a comprehensive model evaluation based on three criteria, so future studies could take the complementarity of data-driven and theory-driven methods beyond simple prediction comparison. We hope that this work can pave the way for future studies to create more links between the data-driven and theory-driven methods, because their complementary nature provides immense opportunities, their underlying perspectives are interwoven, and their synergy can overcome their respective weaknesses.
Acknowledgement
The research is supported by the National Research Foundation (NRF), Prime Minister's Office, Singapore, under its CREATE programme, Singapore-MIT Alliance for Research and Technology (SMART) Centre, Future Urban Mobility (FM) IRG. We thank Nick Caros for his careful proofreading.
Author Contributions - CRediT
Shenhao Wang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing - Original Draft, Writing - Review & Editing, Project administration.
Baichuan Mo: Software, Data Curation, Visualization.
Jinhua Zhao: Supervision, Funding acquisition, Resources. All authors discussed the results and contributed to the final manuscript.

References

[1] Kenneth Joseph Arrow. Aspects of the theory of risk-bearing. Yrjo Jahnssonin Saatio, 1965.
[2] Erel Avineri and Piet Bovy. "Identification of parameters for a prospect theory model for travel choice analysis". In: Transportation Research Record: Journal of the Transportation Research Board.
[3] In: Journal of Machine Learning Research.
[4] Moshe Ben-Akiva and Steven R Lerman. Discrete choice analysis: theory and application to travel demand. Vol. 9. MIT Press, 1985.
[5] Yves Bentz and Dwight Merunka. "Neural networks and the multinomial logit for brand choice modelling: a hybrid approach". In: Journal of Forecasting.
[6] In: Transportation Research Part C: Emerging Technologies 106 (2019), pp. 73–97. issn: 0968-090X.
[7] Mark A Bradley and Andrew J Daly. "Estimation of logit choice models using mixed stated preference and revealed preference information". In: Understanding travel behaviour in an era of change (1997), pp. 209–232.
[8] Colin F Camerer and Howard Kunreuther. "Decision processes for low probability events: Policy implications". In: Journal of Policy Analysis and Management.
[9] In: Transportation Research Part C: Emerging Technologies.
[10] In: Transportation Research Part C: Emerging Technologies 98 (2019), pp. 152–166.
[11] George Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals and Systems.
[12] In: Marketing Letters.
[13] Sanjit Dhami. The Foundations of Behavioral Economic Analysis. Oxford University Press, 2016.
[14] Loan NN Do et al. "An effective spatial-temporal attention based neural network for traffic flow prediction". In: Transportation Research Part C: Emerging Technologies 108 (2019), pp. 12–28. issn: 0968-090X.
[15] Finale Doshi-Velez and Been Kim. "Towards a rigorous science of interpretable machine learning". In: arXiv preprint arXiv:1702.08608 (2017).
[16] Yanjie Duan et al. "An efficient realization of deep learning for traffic data imputation". In: Transportation Research Part C: Emerging Technologies 72 (2016), pp. 168–181.
[17] Dumitru Erhan et al. "Visualizing higher-layer features of a deep network". In: University of Montreal 1341 (2009).
[18] In: ACM SIGKDD Explorations Newsletter.
[19] Jerome H Friedman. "Greedy function approximation: a gradient boosting machine". In: Annals of Statistics (2001), pp. 1189–1232.
[20] Edward L Glaeser et al. "Big data and big cities: The promises and limitations of improved measures of urban life". In: Economic Inquiry.
[21] In: arXiv preprint arXiv:1712.06541 (2017).
[22] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples". In: arXiv preprint arXiv:1412.6572 (2015).
[23] Julian Hagenauer and Marco Helbich. "A comparative study of machine learning classifiers for modeling travel mode choice". In: Expert Systems with Applications 78 (2017), pp. 273–282.
[24] Siyu Hao, Der-Horng Lee, and De Zhao. "Sequence to sequence learning with attention mechanism for short-term passenger flow prediction in large-scale metro system". In: Transportation Research Part C: Emerging Technologies 107 (2019), pp. 287–300. issn: 0968-090X.
[25] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[26] David A Hensher and Mark Bradley. "Using stated response choice data to enrich revealed preference discrete choice models". In: Marketing Letters.
[27] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network". In: arXiv preprint arXiv:1503.02531 (2015).
[28] Kurt Hornik. "Approximation capabilities of multilayer feedforward networks". In: Neural Networks.
[29] In: Neural Networks.
[30] In: Transportation Research Part C: Emerging Technologies 95 (2018), pp. 346–362.
[31] Daniel Kahneman and Amos Tversky. "Prospect theory: An analysis of decision under risk". In: Econometrica: Journal of the Econometric Society (1979), pp. 263–291.
[32] Matthew G Karlaftis and Eleni I Vlahogianni. "Statistical methods versus neural networks in transportation research: Differences, similarities and some insights". In: Transportation Research Part C: Emerging Technologies.
[33] In: Journal of Political Economy.
[34] In: The Quarterly Journal of Economics (2006), pp. 1133–1165.
[35] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. "Supervised machine learning: A review of classification techniques". In: Emerging Artificial Intelligence Applications in Computer Engineering 160 (2007), pp. 3–24.
[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "Imagenet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[37] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. "Adversarial examples in the physical world". In: arXiv preprint arXiv:1607.02533 (2017).
[38] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. "Adversarial machine learning at scale". In: arXiv preprint arXiv:1611.01236 (2016).
[39] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. "Deep learning". In: Nature.
[40] In: Transportation Research Part C: Emerging Technologies 109 (2019), pp. 117–136. issn: 0968-090X.
[41] Qianli Liao and Tomaso Poggio. When Is Handcrafting Not a Curse? Tech. rep. 2018.
[42] Zachary C Lipton. "The mythos of model interpretability". In: arXiv preprint arXiv:1606.03490 (2016).
[43] Elaine M Liu. "Time to change what to sow: Risk preferences and technology adoption decisions of cotton farmers in China". In: Review of Economics and Statistics.
[44] In: Transportation Research Part C: Emerging Technologies 84 (2017), pp. 74–91.
[45] George Loewenstein and Drazen Prelec. "Anomalies in intertemporal choice: Evidence and an interpretation". In: The Quarterly Journal of Economics.
[46] In: Transportation Research Part C: Emerging Technologies 111 (2020), pp. 352–372. issn: 0968-090X.
[47] Aleksander Madry et al. "Towards deep learning models resistant to adversarial attacks". In: arXiv preprint arXiv:1706.06083 (2017).
[48] Daniel McFadden. "Conditional logit analysis of qualitative choice behavior". In: Frontiers in Econometrics (1974).
[49] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. "Methods for interpreting and understanding deep neural networks". In: Digital Signal Processing 73 (2018), pp. 1–15.
[50] Mikhail Mozolin, J-C Thill, and E Lynn Usery. "Trip distribution forecasting with multilayer perceptron neural networks: A critical evaluation". In: Transportation Research Part B: Methodological.
[51] In: Journal of Economic Perspectives.
[52] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. 1944.
[53] Walter Nicholson and Christopher Snyder. "Uncertainty and Strategy". In: Microeconomic Theory: Basic Principles and Extensions. 2012.
[54] Ted O'Donoghue and Matthew Rabin. "Choice and procrastination". In: The Quarterly Journal of Economics.
[55] Ted O'Donoghue and Matthew Rabin. "Doing it now or later". In: American Economic Review (1999), pp. 103–124.
[56] Hichem Omrani. "Predicting travel mode of individuals by machine learning". In: Transportation Research Procedia 10 (2015), pp. 840–849.
[57] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples". In: arXiv preprint arXiv:1605.07277 (2016).
[58] Nicolas Papernot et al. "Distillation as a defense to adversarial perturbations against deep neural networks". In: 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 582–597.
[59] Miguel Paredes et al. "Machine learning or discrete choice models for car ownership demand estimation and prediction?" In: Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2017 5th IEEE International Conference on. IEEE, 2017, pp. 780–785.
[60] Nicholas G Polson and Vadim O Sokolov. "Deep learning for short-term traffic flow prediction". In: Transportation Research Part C: Emerging Technologies 79 (2017), pp. 1–17.
[61] John W Pratt. "Risk aversion in the small and in the large". In: Econometrica: Journal of the Econometric Society (1964), pp. 122–136.
[62] Sarada Pulugurta, Ashutosh Arun, and Madhu Errampalli. "Use of artificial intelligence for mode choice analysis and comparison with traditional multinomial logit model". In: Procedia - Social and Behavioral Sciences 104 (2013), pp. 583–592.
[63] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?: Explaining the predictions of any classifier". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144.
[64] Paul A Samuelson. "A note on measurement of utility". In: The Review of Economic Studies.
[65] In: Transportation Research Procedia 17 (2016), pp. 644–652.
[66] Toru Seo et al. "Interactive online machine learning approach for activity-travel survey". In: Transportation Research Part B: Methodological (2017).
[67] Justin Sydnor. "(Over)insuring modest risks". In: American Economic Journal: Applied Economics.
[68] Christian Szegedy et al. "Intriguing properties of neural networks". In: arXiv preprint arXiv:1312.6199 (2014).
[69] Tomomi Tanaka, Colin F Camerer, and Quang Nguyen. "Risk and time preferences: linking experimental and household survey data from Vietnam". In: American Economic Review.
[70] In: The Review of Economic Studies.
[71] Kenneth Train. Discrete choice methods with simulation. Cambridge University Press, 2009.
[72] Amos Tversky and Daniel Kahneman. "Advances in prospect theory: Cumulative representation of uncertainty". In: Journal of Risk and Uncertainty.
[73] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Vol. 48. Cambridge University Press, 2019.
[74] Shenhao Wang, Baichuan Mo, and Jinhua Zhao. "Deep neural networks for choice analysis: Architecture design with alternative-specific utility functions". In: Transportation Research Part C: Emerging Technologies 112 (2020), pp. 234–251. issn: 0968-090X.
[75] Shenhao Wang, Qingyi Wang, and Jinhua Zhao. "Deep neural networks for choice analysis: Extracting complete economic information for interpretation". In: Transportation Research Part C: Emerging Technologies 118 (2020), p. 102701. issn: 0968-090X.
[76] Shenhao Wang, Qingyi Wang, and Jinhua Zhao. "Multitask learning deep neural networks to combine revealed and stated preference data". In: Journal of Choice Modelling (2020), p. 100236. issn: 1755-5345.
[77] Shenhao Wang and Jinhua Zhao. "Risk preference and adoption of autonomous vehicles". In: Transportation Research Part A: Policy and Practice 126 (2019), pp. 215–229. issn: 0965-8564.
[78] Shenhao Wang et al. "Deep Neural Networks for Choice Analysis: A Statistical Learning Theory Perspective". In: arXiv preprint arXiv:1810.10465 (2018).
[79] Xin Wu et al. "Hierarchical travel demand estimation using multiple data sources: A forward and backward propagation algorithmic framework on a layered computational graph". In: Transportation Research Part C: Emerging Technologies 96 (2018), pp. 321–346. issn: 0968-090X.
[80] Yuankai Wu et al. "A hybrid deep learning based traffic flow prediction method and its understanding". In: Transportation Research Part C: Emerging Technologies 90 (2018), pp. 166–180.
[81] Guangnian Xiao, Zhicai Juan, and Chunqin Zhang. "Detecting trip purposes from smartphone-based travel surveys with artificial neural networks and particle swarm optimization". In: Transportation Research Part C: Emerging Technologies 71 (2016), pp. 447–463.
[82] Shuguan Yang et al. "A deep learning approach to real-time parking occupancy prediction in transportation networks incorporating multiple spatio-temporal data sources". In: Transportation Research Part C: Emerging Technologies 107 (2019), pp. 248–265. issn: 0968-090X.
[83] Junbo Zhang et al. "Predicting citywide crowd flows using deep spatio-temporal residual networks". In: Artificial Intelligence 259 (2018), pp. 147–166. issn: 0004-3702.
[84] Zhenhua Zhang et al. "A deep learning approach for detecting traffic accidents from social media data". In: Transportation Research Part C: Emerging Technologies 86 (2018), pp. 580–596.

Appendix I: Proof of Propositions 1 and 2
Proposition 1
Proof. This proof can be found in many textbooks [71, 4]. Under the Gumbel distributional assumption, Equation 3 can be solved analytically:

$$
\begin{aligned}
P_{ik} &= \int_{-\infty}^{+\infty} \prod_{j \neq k} e^{-e^{-(V_{ik} - V_{ij} + \epsilon_{ik})}} f(\epsilon_{ik}) \, d\epsilon_{ik}
= \int \prod_{j} e^{-e^{-(V_{ik} - V_{ij} + \epsilon_{ik})}} e^{-\epsilon_{ik}} \, d\epsilon_{ik} \\
&= \int \exp\Big(-e^{-\epsilon_{ik}} \sum_{j} e^{-(V_{ik} - V_{ij})}\Big) e^{-\epsilon_{ik}} \, d\epsilon_{ik}
= \int_{0}^{\infty} \exp\Big(-t \sum_{j} e^{-(V_{ik} - V_{ij})}\Big) \, dt
= \frac{e^{V_{ik}}}{\sum_{j} e^{V_{ij}}}
\end{aligned}
\tag{36}
$$

in which the fourth equality uses the substitution $t = e^{-\epsilon_{ik}}$. Note that the formula in Equation 36 is the Softmax function in DNNs: $V_{ik}$ is both the deterministic utility in RUM and the input into the Softmax function in a DNN.
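Equation 36 can also be checked numerically: simulating random-utility-maximizing choices with i.i.d. Gumbel errors reproduces the softmax probabilities. Below is a minimal sketch with illustrative utility values (not data from this study):

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([1.0, 0.5, -0.2])              # illustrative deterministic utilities V_ik
eps = rng.gumbel(size=(1_000_000, 3))       # i.i.d. Gumbel(0, 1) error terms
choices = (V + eps).argmax(axis=1)          # random-utility-maximizing choices
sim_prob = np.bincount(choices, minlength=3) / len(choices)
softmax = np.exp(V) / np.exp(V).sum()       # closed form from Equation 36
print(sim_prob.round(3), softmax.round(3))  # the two vectors agree to about 3 decimals
```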
Proposition 2

Proof. The detailed proof of Proposition 2 can be found in Lemma 2 of McFadden's seminal paper [48]. Here is a brief summary. Suppose that an individual $i$ first chooses between alternative $k$ and $T$ alternatives with identical deterministic utility $V_{ij}$. Then, according to Equations 3 and 36,

$$
P_{ik} = \frac{e^{V_{ik}}}{e^{V_{ik}} + T e^{V_{ij}}} = \int F(\epsilon_{ik} + V_{ik} - V_{ij})^{T} \, dF(\epsilon_{ik})
\tag{37}
$$

Suppose that the individual $i$ chooses between alternative $k$ and alternative $l$ in another choice scenario, where alternative $l$ is constructed such that $T e^{V_{ij}} = e^{V_{il}}$. Then

$$
P_{ik} = \frac{e^{V_{ik}}}{e^{V_{ik}} + e^{V_{il}}} = \int F(\epsilon_{ik} + V_{ik} - V_{il}) \, dF(\epsilon_{ik}) = \int F(\epsilon_{ik} + V_{ik} - V_{ij} - \log T) \, dF(\epsilon_{ik})
\tag{38}
$$

By construction, Equations 37 and 38 are equal:

$$
\int \big[ F(\epsilon_{ik} + V_{ik} - V_{ij} - \log T) - F(\epsilon_{ik} + V_{ik} - V_{ij})^{T} \big] \, dF(\epsilon_{ik}) = 0
$$

Since $F(\epsilon)$ is translation complete, meaning that if $\mathbb{E}\, h(\epsilon + a) = 0$ for all $a$ then $h(\epsilon) = 0$ for all $\epsilon$, it follows that

$$
F(V_{ik} - \log T) = F(V_{ik})^{T}, \quad \forall\, V_{ik}, T
$$

Taking $V_{ik} = 0$ implies $F(-\log T) = e^{-\alpha T}$ (writing $F(0) = e^{-\alpha}$). Taking $V_{ik} = \log T - \log L$ implies $F(-\log L) = F(\log(T/L))^{T}$, hence $F(\log(T/L)) = F(-\log L)^{1/T} = e^{-\alpha L / T}$. Therefore, $F(\epsilon) = e^{-\alpha e^{-\epsilon}}$, which is the distribution function of the Gumbel distribution when $\alpha = 1$.
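As a quick consistency check (our addition, not part of McFadden's original argument), the recovered distribution indeed satisfies the functional equation derived above:

$$
F(V_{ik} - \log T)
= \exp\!\big(-\alpha e^{-(V_{ik} - \log T)}\big)
= \exp\!\big(-\alpha T e^{-V_{ik}}\big)
= \Big[\exp\!\big(-\alpha e^{-V_{ik}}\big)\Big]^{T}
= F(V_{ik})^{T}.
$$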
Appendix II: Summary Statistics of Three Data Sets

Table 5: Summary Statistics of SG data set

Variables (Name: Mean, Std.):
Male (Yes = 1): 0.383, 0.486 | Age < 35 (Yes = 1): 0.329, 0.470
Age > 60 (Yes = 1): 0.075, 0.263 | Low education (Yes = 1): 0.331, 0.471
High education (Yes = 1): 0.480, 0.500 | Low income (Yes = 1): 0.035, 0.184
High income (Yes = 1): 0.606, 0.489 | Full job (Yes = 1): 0.602, 0.490
Walk: walk time (min): 60.50, 54.88 | Bus: cost ($SG): 2.070, 1.266
Bus: walk time (min): 11.96, 10.78 | Bus: waiting time (min): 7.732, 5.033
Bus: in-vehicle time (min): 25.06, 18.91 | RideSharing: cost ($SG): 14.48, 11.64
RideSharing: waiting time (min): 7.108, 4.803 | RideSharing: in-vehicle time (min): 18.28, 13.39
AV: cost ($SG): 16.08, 14.60 | AV: waiting time (min): 7.249, 5.674
AV: in-vehicle time (min): 20.11, 16.99 | Drive: cost ($SG): 10.49, 10.57
Drive: walk time (min): 3.968, 4.176 | Drive: in-vehicle time (min): 17.43, 14.10

Statistics:
Number of samples: 8418
Number of choices: Walk: 874 (10.38%); Bus: 1951 (23.18%); RideSharing: 904 (10.74%); Drive: 3774 (44.83%); AV: 915 (10.87%)
Table 6: Summary Statistics of PT data set

Variables (Name: Mean, Std.):
Male (Yes = 1): 0.619, 0.485 | Age: 47.46, 12.89
Num of years in school: 6.746, 3.821 | Household annual income (1 million dong): 20.27, 21.15
Chinese (Yes = 1): 0.055, 0.228 | Distance to the nearest local market (km): 1.482, 1.840
Living in Southern Vietnam (Yes = 1): 0.541, 0.498 | Reward 1 in option A (1 million dong): 0.032, 0.016
Prob of reward 1 in option A: 0.638, 0.263 | Reward 2 in option A (1 million dong): 0.016, 0.015
Prob of reward 2 in option A: 0.362, 0.263 | Reward 1 in option B (1 million dong): 0.076, 0.038
Prob of reward 1 in option B: 0.486, 0.252 | Reward 2 in option B (1,000 dong): -0.340, 9.640
Prob of reward 2 in option B: 0.514, 0.252

Statistics:
Number of samples: 5249
Number of choices: Option A: 2823 (53.78%); Option B: 2426 (46.22%)
Table 7: Summary Statistics of HD data set

Variables (Name: Mean, Std.):
Male (Yes = 1): 0.618, 0.486 | Age: 47.51, 12.94
Num of years in school: 6.764, 3.843 | Household income (1 million dong): 20.71, 21.23
Chinese (Yes = 1): 0.055, 0.228 | Distance to the nearest local market (km): 1.506, 1.846
Living in Southern Vietnam (Yes = 1): 0.534, 0.499 | Trusted agent (Yes = 1): 0.028, 0.165
Payment received in the risk experiment (1 million dong): 20.97, 21.17 | Amount of immediate reward (1 million dong): 0.075, 0.078
Amount of delayed reward (1 million dong): 0.150, 0.104 | Days of delay: 35.67, 32.33

Statistics:
Number of samples: 5340
Number of choices: Immediate reward: 2670 (50.0%); Future reward: 2670 (50.0%)

Note: Trusted agents are people who would keep the money until the delayed delivery date, to ensure that subjects believed the money would be delivered. The selected trusted persons were usually village heads or presidents of women's associations.

Appendix III: Results of Simultaneous Training

Fig. 7. Prediction accuracy with perturbations (Gaussian noise, FGSM, and TGSM attacks, simultaneous training); panels (a)-(i): MNL, PT, and HD under FGSM, TGSM, and Gaussian noise.

Fig. 8. Utility functions of MNL-ResNets, MNL, and DNNs (simultaneous training).

Fig. 9. Utility functions of PT-ResNets, PT, and DNNs (simultaneous training).

Fig. 10. Utility functions of HD-ResNets, HD, and DNNs (simultaneous training).

Table 8: Performance of DCMs, DNNs, and TB-ResNets in testing sets (sequential and simultaneous training)
Columns: Model | Prediction accuracy, Cross-entropy loss, F1 score (Sequential) | Prediction accuracy, Cross-entropy loss, F1 score (Simultaneous) | Baseline (largest share)

Panel 1. Performance of MNL models
MNL: 50.6%, 1.254, 0.439 | 50.6%, 1.254, 0.439 | 44.8%
MNL ResNet (δ = 1e-5): 53.1%, 1.207, 0.485 | 52.1%, 1.224, 0.468 | 44.8%
MNL ResNet (δ = 0.008): 57.0%, 1.237, 0.559 | 56.6%, 1.150, 0.542 | 44.8%
MNL ResNet (δ = 0.05): 56.1%, 1.572, 0.559 | 56.1%, 1.213, 0.550 | 44.8%
DNN: 55.8%, 2.861, 0.555 | 55.8%, 2.861, 0.555 | 44.8%

Panel 2. Performance of PT models
PT: 69.2%, 0.602, 0.693 | 69.2%, 0.602, 0.693 | 53.8%
PT ResNet (δ = 1e-05): 75.6%, 0.502, 0.762 | 76.7%, 0.477, 0.772 | 53.8%
PT ResNet (δ = 0.9): 89.2%, 0.343, 0.887 | 88.7%, 0.347, 0.882 | 53.8%
PT ResNet (δ = 0.99): 88.6%, 0.318, 0.882 | 88.6%, 0.335, 0.884 | 53.8%
DNN: 88.3%, 0.353, 0.885 | 88.3%, 0.353, 0.885 | 53.8%

Panel 3. Performance of HD models
HD: 56.7%, 0.684, 0.568 | 56.7%, 0.684, 0.568 | 50.0%
HD ResNet (δ = 1e-05): 68.9%, 0.523, 0.689 | 68.7%, 0.517, 0.686 | 50.0%
HD ResNet (δ = 0.05): 77.6%, 0.437, 0.764 | 77.3%, 0.439, 0.774 | 50.0%
HD ResNet (δ = 0.99): 76.0%, 0.444, 0.763 | 76.3%, 0.440, 0.774 | 50.0%
DNN: 71.2%, 0.909, 0.722 | 71.2%, 0.909, 0.722 | 50.0%

Appendix IV: Proof of Proposition 4
Proof. By using the definition of Rademacher complexity and Proposition 3, the left-hand side can be rewritten as:

$$
\begin{aligned}
\mathbb{E}_S \big[ L(\hat{f}) - L(f^{*}_{\bar{\mathcal{F}}}) \big]
&\leq 2\, \mathbb{E}_S \hat{R}_n(\bar{\mathcal{F}} \,|\, S) && (39) \\
&= 2\, \mathbb{E}_S \hat{R}_n\big( (1-\delta)\mathcal{F} + \delta \mathcal{F}_N \,\big|\, S \big) && (40) \\
&= 2\, \mathbb{E}_S \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F};\; f_N \in \mathcal{F}_N} \big\langle \epsilon, (1-\delta) f + \delta f_N \big\rangle && (41) \\
&\leq 2\, \mathbb{E}_S \Big[ \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \big\langle \epsilon, (1-\delta) f \big\rangle + \mathbb{E}_{\epsilon} \sup_{f_N \in \mathcal{F}_N} \big\langle \epsilon, \delta f_N \big\rangle \Big] && (42) \\
&= 2\, \mathbb{E}_S \big[ (1-\delta)\, \hat{R}_n(\mathcal{F} \,|\, S) + \delta\, \hat{R}_n(\mathcal{F}_N \,|\, S) \big] && (43)
\end{aligned}
$$

in which the first line uses Proposition 3; the second line uses the definition of $\bar{\mathcal{F}}$; the third line uses the definition $\bar{f} := (1-\delta) f + \delta f_N$; the fourth line uses the subadditivity of the sup operator; and the last line uses the definition of Rademacher complexity again.
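The bound makes the regularizing role of a small $\delta$ concrete. As an illustrative calculation (the numbers are ours, purely for intuition): if $\hat{R}_n(\mathcal{F}\,|\,S) = 0.01$ for the parsimonious DCM class and $\hat{R}_n(\mathcal{F}_N\,|\,S) = 1.0$ for the rich DNN class, then $\delta = 0.05$ gives

$$
2\big[(1-\delta)\, \hat{R}_n(\mathcal{F}\,|\,S) + \delta\, \hat{R}_n(\mathcal{F}_N\,|\,S)\big]
= 2\,(0.95 \times 0.01 + 0.05 \times 1.0) = 0.119,
$$

far below the corresponding bound $2\, \hat{R}_n(\mathcal{F}_N\,|\,S) = 2.0$ for the pure DNN class. The combined class thus generalizes almost as tightly as the DCM while still having access to the DNN's richer functions.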
Appendix V: Utility Intuition for Five-Layer DNN Architecture

Fig. 11. Utility functions of MNL-ResNets, MNL, and DNNs; upper row: visualization of 2D utility functions; lower row: visualization of 1D utility functions; percentages on the upper row are the prediction accuracy of the five models.

Fig. 12. Utility functions of PT-ResNets, PT, and DNNs; upper row: visualization of 2D utility functions; lower row: visualization of 1D utility functions; percentages on the upper row are the prediction accuracy of the five models.

Fig. 13. Utility functions of HD-ResNets, HD, and DNNs; upper row: visualization of 2D utility functions; lower row: visualization of 1D utility functions (panels alternate between values and time); percentages on the upper row are the prediction accuracy of the five models.