# Duality in dynamic discrete-choice models


KHAI X. CHIONG§, ALFRED GALICHON†, AND MATT SHUM♣

Abstract.

Using results from convex analysis, we investigate a novel approach to identification and estimation of discrete choice models which we call the “Mass Transport Approach” (MTA). We show that the conditional choice probabilities and the choice-specific payoffs in these models are related in the sense of conjugate duality, and that the identification problem is a mass transport problem. Based on this, we propose a new two-step estimator for these models; interestingly, the first step of our estimator involves solving a linear program which is identical to the classic assignment (two-sided matching) game of Shapley and Shubik (1971). The application of convex-analytic tools to dynamic discrete choice models, and the connection with two-sided matching models, is new in the literature.

Date: First draft: April 2013. This version: May 2015. arXiv: [econ.EM].

The authors thank the Editor, three anonymous referees, as well as Benjamin Connault, Thierry Magnac, Emerson Melo, Bob Miller, Sergio Montero, John Rust, Sorawoot (Tang) Srisuma, and Haiqing Xu for useful comments. We are especially grateful to Guillaume Carlier for providing decisive help with the proof of Theorem 5. We also thank audiences at Michigan, Northwestern, NYU, Pittsburgh, UCSD, the CEMMAP conference on inference in game-theoretic models (June 2013), the UCLA econometrics mini-conference (June 2013), the Boston College Econometrics of Demand Conference (December 2013) and the Toulouse conference on “Recent Advances in Set Identification” (December 2013) for helpful comments. Galichon’s research has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement n°.

§ Division of the Humanities and Social Sciences, California Institute of Technology; [email protected]

† Department of Economics, Sciences Po; [email protected]

♣ Division of the Humanities and Social Sciences, California Institute of Technology; [email protected]

## 1. Introduction

Empirical research utilizing dynamic discrete choice models of economic decision-making has flourished in recent decades, with applications in all areas of applied microeconomics including labor economics, industrial organization, public finance, and health economics. The existing literature on the identification and estimation of these models has recognized a close link between the conditional choice probabilities (hereafter, CCP, which can be observed and estimated from the data) and the payoffs (or choice-specific value functions, which are unobservable to the researcher); indeed, most estimation procedures contain an “inversion” step in which the choice-specific value functions are recovered given the estimated choice probabilities.

This paper has two contributions. First, we explicitly characterize this duality relationship between the choice probabilities and choice-specific payoffs. Specifically, in discrete choice models, the social surplus function (McFadden (1978)) provides us with the mapping from payoffs to the probabilities with which a choice is chosen at each state (conditional choice probabilities). Recognizing that the social surplus function is convex, we develop the idea that the convex conjugate of the social surplus function gives us the inverse mapping, from choice probabilities to utility indices. More precisely, the subdifferential of the convex conjugate is a correspondence that maps from the observed choice probabilities to an identified set of payoffs. In short, the choice probabilities and utility indices are related in the sense of conjugate duality. The discovery of this relationship allows us to succinctly characterize the empirical content of discrete choice models, both static and dynamic.

Not only is the convex conjugate of the social surplus function a useful theoretical object; it also provides a new and practical way to “invert” from a given vector of choice probabilities back to the underlying utility indices which generated these probabilities. This is the second contribution of this paper. We show how the conjugate along with its set of subgradients can be efficiently computed by means of linear programming. This linear programming formulation has the structure of an optimal assignment problem (as in Shapley-Shubik’s (1971) classic work). This surprising connection enables us to apply insights developed in the optimal transport literature, e.g. Villani (2003, 2009), to discrete choice models. We call this new methodology the “Mass Transport Approach” to CCP inversion.


This paper focuses on the estimation of dynamic discrete-choice models via two-step estimation procedures in which conditional choice probabilities are estimated in the initial stage; this estimation approach was pioneered in Hotz and Miller (HM, 1993) and Hotz, Miller, Sanders, Smith (1994). Our use of tools and concepts from convex analysis to study identification and estimation in this dynamic discrete choice setting is novel in the literature. Based on our findings, we propose a new two-step estimator for DDC models. A nice feature of our estimator is that it works for practically any assumed distribution of the utility shocks. Thus, our estimator would make possible the task of evaluating the robustness of estimation to different distributional assumptions.

Section 2 contains our main results regarding duality between choice probabilities and payoffs in discrete choice models. Based on these results, we propose, in Section 3, a two-step estimation approach for these models. We also emphasize here the surprising connection between dynamic discrete-choice and optimal matching models. In Section 4 we discuss computational details for our estimator, focusing on the use of linear programming to compute (approximately) the convex conjugate function from the dynamic discrete-choice model. Monte Carlo experiments (in Section 5) show that our estimator performs well in practice, and we apply the estimator to Rust’s (1987) bus engine replacement data (Section 6). Section 7 concludes. The Appendix contains proofs and also a brief primer on relevant results from convex analysis. Sections 2.2 and 2.3, as well as Section 4, are not specific to dynamic discrete choice problems but are also true for any (static) discrete choice model.

Footnote: Subsequent contributions include Aguirregabiria and Mira (2002, 2007), Magnac and Thesmar (2002), Pesendorfer and Schmidt-Dengler (2008), Bajari et al. (2009), Arcidiacono and Miller (2011), and Norets and Tang (2013).

Footnote: While existing identification results for dynamic discrete choice models allow for quite general specifications of the additive choice-specific utility shocks, many applications of these two-step estimators maintain the restrictive assumption that the utility shocks are distributed i.i.d. type I extreme value, independently of the state variables, leading to choice probabilities which take the multinomial logit form.

Footnote: While they are not the focus in this paper, many applications of dynamic choice models do not utilize HM-type two-step estimation procedures, and they allow for quite flexible distributions of the utility shocks, and also for serial correlation in these shocks (examples include Pakes (1986) and Keane and Wolpin (1997)). This literature typically employs simulated method of moments, or simulated maximum likelihood for estimation (see Rust (1994, section 3.3)).

## 2. Basic Model

### 2.1. The framework

In this section we review the basic dynamic discrete-choice setup, as encapsulated in Rust’s (1987) seminal paper. The state variable is $x \in \mathcal{X}$, which we assume to take only a finite number of values. Agents choose actions $y \in \mathcal{Y}$ from a finite space $\mathcal{Y} = \{0, 1, \ldots, D\}$. The single-period utility flow which an agent derives from choosing $y$ in a given period is

$$\bar{u}_y(x) + \varepsilon_y$$

where $\varepsilon_y$ denotes the utility shock pertaining to action $y$, which differs across agents. Across agents and time periods, the set of utility shocks $\varepsilon \equiv (\varepsilon_y)_{y \in \mathcal{Y}}$ is distributed according to a joint distribution function $Q(\cdot\,; x)$ which can depend on the current values of the state variable $x$. We assume that this distribution $Q$ is known to the researcher.

Throughout, we consider a stationary setting in which the agent’s decision environment remains unchanged across time periods; thus, for any given period, we use primes ($'$) to denote next-period values. Following Rust (1987), and most of the subsequent papers in this literature, we maintain the following conditional independence assumption (which rules out serially persistent forms of unobserved heterogeneity):

Assumption 1 (Conditional Independence). $(x, \varepsilon)$ evolves across time periods as a controlled first-order Markov process, with transition

$$\Pr(x', \varepsilon' \mid y, x, \varepsilon) = \Pr(\varepsilon' \mid x', y, x, \varepsilon) \cdot \Pr(x' \mid y, x, \varepsilon) = \Pr(\varepsilon' \mid x') \cdot \Pr(x' \mid y, x).$$

The discount rate is $\beta$. Agents are dynamic optimizers whose choices each period satisfy

$$y \in \arg\max_{\tilde{y} \in \mathcal{Y}} \left\{ \bar{u}_{\tilde{y}}(x) + \varepsilon_{\tilde{y}} + \beta E\left[ \bar{V}(x', \varepsilon') \mid x, \tilde{y} \right] \right\}, \tag{1}$$

where the value function $\bar{V}$ is recursively defined via Bellman’s equation as

$$\bar{V}(x, \varepsilon) = \max_{\tilde{y} \in \mathcal{Y}} \left\{ \bar{u}_{\tilde{y}}(x) + \varepsilon_{\tilde{y}} + \beta E\left[ \bar{V}(x', \varepsilon') \mid x, \tilde{y} \right] \right\}.$$

Footnote (on serially persistent unobserved heterogeneity): See Norets (2009), Kasahara and Shimotsu (2009), Arcidiacono and Miller (2011), and Hu and Shum (2012).

Footnote: We have used Assumption 1 to eliminate $\varepsilon$ as a conditioning variable in the expectation in Eq. (1).

Footnote (on Bellman’s equation): See, e.g., Bertsekas (1987, chap. 5) for an introduction and derivation of this equation.

$V(x)$, the ex-ante value function, is defined as:

$$V(x) = E\left[ \bar{V}(x, \varepsilon) \mid x \right].$$

The expectation above is conditional on the current state $x$. In the literature, $V(x)$ is called the ex-ante (or integrated) value function, because it measures the continuation value of the dynamic optimization problem before the agent observes his shocks $\varepsilon$, so that the optimal action is still stochastic from the agent’s point of view.

Next we define the choice-specific value functions as consisting of two terms: the per-period utility flow and the discounted continuation payoff:

$$w_y(x) \equiv \bar{u}_y(x) + \beta E\left[ V(x') \mid x, y \right].$$

In this paper, the utility flows $\{\bar{u}_y(x);\ \forall y \in \mathcal{Y}, \forall x \in \mathcal{X}\}$, and subsequently also the choice-specific value functions $\{w_y(x),\ \forall y, x\}$, will be treated as unknown parameters; and we will study the identification and estimation of these parameters. For this reason, in the initial part of the paper, we will suppress the explicit dependence of $w_y$ on $x$ for convenience.

Given these preliminaries, we derive the duality which is central to this paper.

### 2.2. The social surplus function and its convex conjugate

We start by introducing the expected indirect utility of a decision maker facing the $|\mathcal{Y}|$-dimensional vector of choice-specific values $w \equiv \{w_y, y \in \mathcal{Y}\}'$:

$$\mathcal{G}(w; x) = E\left[ \max_{y \in \mathcal{Y}} (w_y + \varepsilon_y) \mid x \right] \tag{2}$$

where the expectation is assumed to be finite and is taken over the distribution of the utility shocks, $Q(\cdot\,; x)$. This function $\mathcal{G}(\cdot\,; x) : \mathbb{R}^{|\mathcal{Y}|} \to \mathbb{R}$ is called the “social surplus function” in McFadden’s (1978) random utility framework, and can be interpreted as the expected welfare of a representative agent in the dynamic discrete-choice problem.

Footnote: There is a difference between the definition of $V(x)$ and the last terms in Equation (1) above. Here, we are considering the expectation of the value function $\bar{V}(x, \varepsilon)$ taken over the distribution of $\varepsilon \mid x$ (i.e. holding the first argument fixed). In the last term of Eq. (1), however, we are considering the expectation over the joint distribution of $(x', \varepsilon') \mid x$ (i.e. holding neither argument fixed).
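As a rough numerical illustration of Eq. (2), the sketch below approximates the social surplus function by simulation. It assumes i.i.d. standard type I extreme value shocks purely so that the simulated value can be compared with the well-known log-sum-exp closed form for that case; the function names and the test vector `w` are hypothetical and not part of the paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def gumbel_shocks(rng, n):
    """Draw n i.i.d. standard type I extreme value shocks by inverse-CDF sampling."""
    return [-math.log(-math.log(rng.random())) for _ in range(n)]


def simulate_social_surplus(w, draw_shocks, n_draws=100_000, seed=0):
    """Monte Carlo estimate of G(w) = E[max_y (w_y + eps_y)], as in Eq. (2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        eps = draw_shocks(rng, len(w))
        total += max(w_y + e_y for w_y, e_y in zip(w, eps))
    return total / n_draws


def logit_social_surplus(w):
    """Closed form for the assumed logit case: log-sum-exp plus Euler's constant."""
    m = max(w)
    return m + math.log(sum(math.exp(v - m) for v in w)) + EULER_GAMMA


w = [0.5, -0.2, 1.0]  # hypothetical choice-specific values
g_sim = simulate_social_surplus(w, gumbel_shocks)
g_exact = logit_social_surplus(w)
print(f"simulated G(w) = {g_sim:.3f}, closed form = {g_exact:.3f}")
```

Swapping `gumbel_shocks` for a sampler from another distribution $Q$ would leave `simulate_social_surplus` unchanged, which is the sense in which Eq. (2) does not depend on the logit assumption.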

For convenience in what follows, we introduce the notation $Y(w, \varepsilon)$ to denote an agent’s optimal choice given the vector of choice-specific value functions $w$ and the vector of utility shocks $\varepsilon$; that is, $Y(w, \varepsilon) = \arg\max_{y \in \mathcal{Y}} (w_y + \varepsilon_y)$. This notation makes explicit the randomness in the optimal alternative (arising from the utility shocks $\varepsilon$). We get

$$\mathcal{G}(w; x) = E\left[ w_{Y(w,\varepsilon)} + \varepsilon_{Y(w,\varepsilon)} \mid x \right] = \sum_{y \in \mathcal{Y}} \underbrace{\Pr(Y(w, \varepsilon) = y \mid x)}_{\equiv\, p_y(x)} \left( w_y + E\left[ \varepsilon_y \mid Y(w, \varepsilon) = y, x \right] \right) \tag{3}$$

which shows an alternative expression for the social surplus function as a weighted average, where the weights are the components of the vector of conditional choice probabilities $p(x)$.

For the remainder of this section, we suppress the dependence of all quantities on $x$ for convenience. In later sections, we will reintroduce this dependence when it is necessary.

In the case when the social surplus function $\mathcal{G}(w)$ is differentiable (which holds for most discrete-choice model specifications considered in the literature), we obtain a well-known fact that the vector of choice probabilities $p$ compatible with rational choice coincides with the gradient of $\mathcal{G}$ at $w$:

Proposition 1 (The Williams-Daly-Zachary (WDZ) Theorem). $p = \nabla \mathcal{G}(w)$.

This result, which is analogous to Roy’s Identity in discrete choice models, is expounded in McFadden (1978) and Rust (1994, Thm. 3.1). It characterizes the vector of choice probabilities corresponding to optimal behavior in a discrete choice model as the gradient of the social surplus function. For completeness, we include a proof in the Appendix.

The WDZ theorem provides a mapping from the choice-specific value functions (which are unobserved by researchers) to the observed choice probabilities $p$. However, the identification problem is the reverse problem, namely to determine the set of $w$ which would lead to a given vector of choice probabilities.
This problem is exactly solved by convex duality and the introduction of the convex conjugate of $\mathcal{G}$, which we denote as $\mathcal{G}^*$:

Definition 1 (Convex Conjugate). We define $\mathcal{G}^*$, the Legendre-Fenchel conjugate function of $\mathcal{G}$ (a convex function), by

$$\mathcal{G}^*(p) = \sup_{w \in \mathbb{R}^{\mathcal{Y}}} \left\{ \sum_{y \in \mathcal{Y}} p_y w_y - \mathcal{G}(w) \right\}. \tag{4}$$

Equation (4) above has the property that if $p$ is not a probability, that is, if either of the conditions $p_y \geq 0$ or $\sum_{y \in \mathcal{Y}} p_y = 1$ does not hold, then $\mathcal{G}^*(p) = +\infty$. Because the choice-specific value functions $w$ and the choice probabilities $p$ are, respectively, the arguments of the function $\mathcal{G}$ and its convex conjugate function $\mathcal{G}^*$, we say that $w$ and $p$ are related in the sense of conjugate duality. The theorem below states an implication of this duality, and provides an “inverse” correspondence from the observed choice probabilities back to the unobserved $w$, which is a necessary step for identification and estimation.

Theorem 1. The following pair of equivalent statements capture the empirical content of the DDC model:

(i) $p$ is in the subdifferential of $\mathcal{G}$ at $w$:

$$p \in \partial \mathcal{G}(w), \tag{5}$$

(ii) $w$ is in the subdifferential of $\mathcal{G}^*$ at $p$:

$$w \in \partial \mathcal{G}^*(p). \tag{6}$$

The definition and properties of the subdifferential of a convex function are provided in Appendix A. Part (i) is, of course, connected to the WDZ theorem above; indeed, it is the WDZ theorem when $\mathcal{G}(w)$ is differentiable at $w$. Hence, it encapsulates an optimality requirement that the vector of observed choice probabilities $p$ be derived from optimal discrete-choice decision making for some unknown vector $w$ of choice-specific value functions.

Part (ii) of this theorem, which describes the “inverse” mapping from conditional choice probabilities to choice-specific value functions, does not appear to have been exploited in the literature on dynamic discrete choice. It relates to Galichon and Salanié (2012), who use convex analysis to estimate matching games with transferable utilities. It specifically states that the vector of choice-specific value functions can be identified from the corresponding vector of observed choice probabilities $p$ as the subgradient of the convex conjugate function $\mathcal{G}^*(p)$. Eq. (6) is also constructive, and suggests a procedure for computing the choice-specific value functions corresponding to observed choice probabilities. We will fully elaborate this procedure in subsequent sections.

Appendix A contains additional derivations related to the subgradient of a convex function. Specifically, it is known (Eq. (25)) that $\mathcal{G}(w) + \mathcal{G}^*(p) = \sum_{y \in \mathcal{Y}} p_y w_y$ if and only if $p \in \partial \mathcal{G}(w)$. Combining this with Eq. (3), we obtain an alternative expression for the convex conjugate function $\mathcal{G}^*$:

$$\mathcal{G}^*(p) = - \sum_{y} p_y E\left[ \varepsilon_y \mid Y(w, \varepsilon) = y \right], \tag{7}$$

corresponding to the weighted expectations of the utility shocks $\varepsilon_y$ conditional on choosing the option $y$. It is also known that the subdifferential $\partial \mathcal{G}^*(p)$ corresponds to the set of maximizers in the program (4) which defines the conjugate function $\mathcal{G}^*(p)$; that is,

$$w \in \partial \mathcal{G}^*(p) \iff w \in \arg\max_{w \in \mathbb{R}^{\mathcal{Y}}} \left\{ \sum_{y \in \mathcal{Y}} p_y w_y - \mathcal{G}(w) \right\}. \tag{8}$$

Footnote: We use $w$ and $\varepsilon$ (and also $p$ below) to denote vectors, while $w_y$ and $\varepsilon_y$ (and $p_y$) denote the $y$-th component of these vectors.

Footnote (on differentiability of $\mathcal{G}$): This includes logit, nested logit, multinomial probit, etc., in which the distribution of the utility shocks is absolutely continuous and $w$ is bounded, cf. Lemma 1 in Shi, Shum and Wong (2014).

Footnote: Details of convex conjugates are expounded in the Appendix. Convex conjugates are also encountered in classic producer and consumer theory. For instance, when $f$ is the convex cost function of the firm (decreasing returns to scale in production), then the convex conjugate of the cost function, $f^*$, is in fact the firm’s optimal profit function.

Footnote: $\mathcal{G}$ is differentiable at $w$ if and only if $\partial \mathcal{G}(w)$ is single-valued. In that case, part (i) of Th. 1 reduces to $p = \nabla \mathcal{G}(w)$, which is the WDZ theorem. If, in addition, $\nabla \mathcal{G}$ is one-to-one, then we immediately get $w = (\nabla \mathcal{G})^{-1}(p)$, or $\nabla \mathcal{G}^*(p) = (\nabla \mathcal{G})^{-1}(p)$, which is the case of the classical Legendre transform. However, as we show below, $\nabla \mathcal{G}(w)$ is not typically one-to-one in discrete choice models, so that the statement in part (ii) of Th. 1 is more suitable.

Footnote: Clearly, Theorem 1 also applies to static random utility discrete-choice models, with the $w(x)$ being interpreted as the utility indices for each of the choices. As such, Eq. (6) relates to results regarding the invertibility of the mapping from utilities to choice probabilities in static discrete choice models (e.g. Berry (1994); Haile, Hortacsu, and Kosenok (2008); Berry, Gandhi, and Haile (2013)). Similar results have also arisen in the literature on stochastic learning in games (Hofbauer and Sandholm (2002); Cominetti, Melo and Sorin (2010)).
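To make the variational characterization in Eq. (8) concrete, the following minimal sketch recovers $w$ from $p$ by maximizing $\sum_y p_y w_y - \mathcal{G}(w)$ with plain gradient ascent. It assumes logit shocks, so that by the WDZ theorem the gradient of $\mathcal{G}$ is the familiar softmax map; the helper names, step size and iteration count are illustrative choices, not the paper's procedure (which relies on linear programming).

```python
import math


def softmax(w):
    """Logit choice probabilities, i.e. the gradient of G under the assumed logit shocks."""
    m = max(w)
    e = [math.exp(v - m) for v in w]
    s = sum(e)
    return [v / s for v in e]


def invert_ccp(p, choice_prob=softmax, step=1.0, n_iter=5000):
    """Gradient ascent on sum_y p_y w_y - G(w); the gradient is p - grad G(w)."""
    w = [0.0] * len(p)
    for _ in range(n_iter):
        q = choice_prob(w)
        w = [w_y + step * (p_y - q_y) for w_y, p_y, q_y in zip(w, p, q)]
    return w


w_true = [0.3, -1.0, 0.8]  # hypothetical choice-specific values
p = softmax(w_true)        # choice probabilities they generate
w_hat = invert_ccp(p)      # one element of the subdifferential of G* at p

# The recovered vector differs from w_true only by a common additive constant.
shift = [a - b for a, b in zip(w_hat, w_true)]
print([round(s, 6) for s in shift])
```

The common shift printed at the end reflects the fact that the objective in Eq. (8) is unchanged when the same constant is added to every component of $w$, so the maximizer is only determined up to such a constant.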

Later, we will exploit this variational representation of the subdifferential $\partial \mathcal{G}^*(p)$ for computational purposes; cf. Section 4 below.

Example 1 (Logit). Before proceeding, we discuss the logit model, for which the functions and relations above reduce to familiar expressions. When the distribution $Q$ of $\varepsilon$ obeys an extreme value type I distribution, it follows from Extreme Value theory that $\mathcal{G}$ and $\mathcal{G}^*$ can be obtained in closed form: $\mathcal{G}(w) = \log\left( \sum_{y \in \mathcal{Y}} \exp(w_y) \right) + \gamma$, while $\mathcal{G}^*(p) = \sum_{y \in \mathcal{Y}} p_y \log p_y - \gamma$ if $p$ belongs to the interior of the simplex, and $\mathcal{G}^*(p) = +\infty$ otherwise ($\gamma \approx 0.5772$ is Euler’s constant). Hence in this case, $\mathcal{G}^*$ is the entropy of distribution $p$ (see Anderson, de Palma, Thisse (1988) and references therein).

The subdifferential of $\mathcal{G}^*$ is characterized as follows: $w \in \partial \mathcal{G}^*(p)$ if and only if $w_y = \log p_y - K$, for some $K \in \mathbb{R}$. In this logit case the convex conjugate function $\mathcal{G}^*$ is the entropy of distribution $p$, which explains why it can be called a generalized entropy function even in non-logit contexts. ∎

### 2.3. Identification

It follows from Theorem 1 that the identification of systematic utilities boils down to the problem of computing the subgradient of a generalized entropy function. However, from examining the social surplus function $\mathcal{G}$, we see that if $w \in \partial \mathcal{G}^*(p)$, then it is also true that $w - K \in \partial \mathcal{G}^*(p)$, where, with a slight abuse of notation, $K \in \mathbb{R}^{|\mathcal{Y}|}$ denotes the vector whose $|\mathcal{Y}|$ components all equal the scalar $K$. Indeed, the choice probabilities are only affected by the differences in the levels offered by the various alternatives. In what follows, we shall tackle this indeterminacy problem by isolating a particular $w^0$ among those satisfying $w \in \partial \mathcal{G}^*(p)$, where we choose

$$\mathcal{G}(w^0) = 0. \tag{9}$$

We will impose the following assumption on the heterogeneity.

Assumption 2 (Full Support). Assume the distribution $Q$ of the vector of utility shocks $\varepsilon$ is such that the distribution of the vector $(\varepsilon_y - \varepsilon_1)_{y \neq 1}$ has full support.

Footnote (to the closed forms in Example 1): Relatedly, Arcidiacono and Miller (2011, pp. 1839-1841) discuss computational and analytical solutions for the $\mathcal{G}^*$ function in the generalized extreme value setting.

Under this assumption, Theorem 2 below shows that Eq. (9) defines $w^0$ uniquely. Theorem 3 will then show that the knowledge of $w^0$ allows for easy recovery of all vectors $w$ satisfying $p \in \partial \mathcal{G}(w)$.

Theorem 2. Under Assumption 2, let $p$ be in the interior of the simplex $\Delta^{|\mathcal{Y}|}$ (i.e. $p_y > 0$ for each $y$ and $\sum_y p_y = 1$). Then there exists a unique $w^0 \in \partial \mathcal{G}^*(p)$ such that $\mathcal{G}(w^0) = 0$.

The proof of this theorem is in the Appendix. Moreover, even when Assumption 2 is not satisfied, $w^0$ will still be set-identified; Theorem 4 below describes the identified set of $w^0$ corresponding to a given vector of choice probabilities $p$.

Our next result is our main tool for identification; it shows that our choice of $w^0(x)$, as defined in Eq. (9), is without loss of generality; it is not an additional model restriction, but merely a convenient way of representing all $w(x)$ in $\partial \mathcal{G}^*(p(x))$ with respect to a natural and convenient reference point.

Theorem 3. Maintain Assumption 2, and let $K$ denote any scalar $K \in \mathbb{R}$. The set of conditions

$$w \in \partial \mathcal{G}^*(p) \quad \text{and} \quad \mathcal{G}(w) = K$$

is equivalent to

$$w_y = w^0_y + K, \quad \forall y \in \mathcal{Y}.$$

This theorem shows that any vector within the set $\partial \mathcal{G}^*(p)$ can be characterized as the sum of the vector $w^0$ (uniquely determined, by Theorem 2) and a constant $K \in \mathbb{R}$. As we will see below, this is our invertibility result for dynamic discrete choice problems, as it will imply unique identification of the vector of choice-specific value functions corresponding to any observed vector of conditional choice probabilities.

Footnote (on the indeterminacy and the normalization in Eq. (9)): This indeterminacy issue has been resolved in the existing literature on dynamic discrete choice models (e.g. Hotz and Miller (1993), Rust (1994), Magnac and Thesmar (2002)) by focusing on the differences between choice-specific value functions, which is equivalent to setting $w_{y_0}(x)$, the choice-specific value function for a benchmark choice $y_0$, equal to zero. Compared to this, our choice of $w^0(x)$ satisfying $\mathcal{G}(w^0(x)) = 0$ is more convenient in our context, as it leads to a simple expression for the constant $K$ (see Section 2.4).

Footnote (on invertibility): See Berry (1994), Chiappori and Komunjer (2010), Berry, Gandhi, and Haile (2012), among others, for conditions ensuring the invertibility or “univalence” of demand systems stemming from multinomial choice models, under settings more general than the random utility framework considered here.
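For the logit case of Example 1, the normalization in Eq. (9) and the statement of Theorem 3 can be checked by hand: since $\mathcal{G}(w) = \log \sum_y \exp(w_y) + \gamma$, the vector $w^0_y = \log p_y - \gamma$ satisfies $\mathcal{G}(w^0) = 0$. The short sketch below verifies this numerically; it is only an illustration under the logit assumption, and the probability vector and the constant `K` are hypothetical values.

```python
import math

EULER_GAMMA = 0.5772156649015329


def logit_G(w):
    """Social surplus function under i.i.d. type I extreme value shocks."""
    m = max(w)
    return m + math.log(sum(math.exp(v - m) for v in w)) + EULER_GAMMA


def logit_choice_probs(w):
    """Gradient of logit_G, i.e. the multinomial logit choice probabilities."""
    m = max(w)
    e = [math.exp(v - m) for v in w]
    s = sum(e)
    return [v / s for v in e]


def logit_w0(p):
    """Normalized inverse for the logit case: w0_y = log p_y - gamma, so G(w0) = 0."""
    return [math.log(p_y) - EULER_GAMMA for p_y in p]


p = [0.5, 0.3, 0.2]              # hypothetical interior choice probabilities
w0 = logit_w0(p)
K = 2.5                          # arbitrary constant, as in Theorem 3
w_shifted = [v + K for v in w0]

print(logit_G(w0))                    # zero up to floating-point error
print(logit_G(w_shifted))             # equals K up to floating-point error
print(logit_choice_probs(w_shifted))  # reproduces p
```

Shifting $w^0$ by $K$ moves the value of the social surplus function to $K$ but leaves the implied choice probabilities untouched, which is exactly the one-dimensional indeterminacy that the normalization removes.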


### 2.4. Empirical Content of Dynamic Discrete Choice Model

To summarize the empirical content of the model, we recall the fact that the ex-ante value function $V$ solves the following equation

$$V(x) = \sum_{y \in \mathcal{Y}} p_y(x) \left( \bar{u}_y(x) + E\left[ \varepsilon_y \mid Y(w, \varepsilon) = y, x \right] + \beta \sum_{x'} p(x' \mid x, y) V(x') \right)$$

(derived in Pesendorfer and Schmidt-Dengler (2008), among others), where we write $p(x' \mid x, y) = \Pr(x_{t+1} = x' \mid x_t = x, y_t = y)$. Noting that the choice-specific value function is just

$$w_y(x) = \bar{u}_y(x) + \beta \sum_{x'} p(x' \mid x, y) V(x'), \tag{10}$$

and comparing with Eq. (3), we obtain

$$V(x) = \mathcal{G}(w(x); x) \quad \text{and} \quad p(x) \in \partial \mathcal{G}(w(x); x).$$

Hence, by Theorem 3, the true $w(x)$ will differ from $w^0(x)$ by a constant term $V(x)$:

$$w(x) = w^0(x) + V(x)$$

where $w^0(x)$ is defined in Theorem 2. This result is also convenient for identification purposes, as it separates identification of $w$ into two subproblems, the determination of $w^0$ and the determination of $V$. Once $w^0$ and $V$ are known, the utility flows are determined from Eq. (10). This motivates our two-step estimation procedure, which we describe next.

## 3. Estimation using the Mass Transport Approach (MTA)

Based upon the derivations in the previous section, we present a two-step estimation procedure. In the first step, we use the results from Theorem 3 to recover the vector of choice-specific value functions $w^0(x)$ corresponding to each observed vector of choice probabilities $p(x)$. In the second step, we recover the utility flow functions $\bar{u}_y(x)$ given the $w^0(x)$ obtained from the first step.

### 3.1. First step

In the first step, the goal is to recover the vector of choice-specific value functions $w^0(x) \in \partial \mathcal{G}^*(p(x))$ corresponding to the vector of observed choice probabilities $p(x)$ for each value of $x$. In doing this, we use Theorem 1 above and Proposition 2 below, which show how $w^0(x)$ belongs to the subdifferential of the conjugate function $\mathcal{G}^*(p(x))$.

We delay discussing these details until Section 4. There, we will show how this problem of obtaining $w^0(x)$ can be reformulated in terms of a class of mathematical programming problems, the Monge-Kantorovich mass transport problems, which leads to convenient computational procedures. Since this is the central component of our estimation procedure, we have named it the mass transport approach (MTA).

### 3.2. Second step

From the first step, we obtained $w^0(x)$ such that $w(x) = w^0(x) + V(x)$. Now in the second step, we use the recursive structure of the dynamic model, along with fixing one of the utility flows, to jointly pin down the values of $w(x)$ and $V(x)$. Finally, once $w(x)$ and $V(x)$ are known, the utility flows can be obtained from $\bar{u}_y(x) = w_y(x) - \beta E[V(x') \mid x, y]$.

In order to nonparametrically identify $\bar{u}_y(x)$, we need to fix some values of the utility flows. Following Bajari, Chernozhukov, Hong, and Nekipelov (2009), we fix the utility flow corresponding to a benchmark choice $y_0$ to be constant at zero:

Assumption 3 (Fix utility flow for benchmark choice). $\forall x$, $\bar{u}_{y_0}(x) = 0$.

With this assumption, we get

$$0 = w^0_{y_0}(x) + V(x) - \beta E\left[ V(x') \mid x, y = y_0 \right]. \tag{11}$$

Let $W$ be the column vector whose general term is $\left( w^0_{y_0}(x) \right)_{x \in \mathcal{X}}$, let $V$ be the column vector whose general term is $(V(x))_{x \in \mathcal{X}}$, and let $\Pi$ be the $|\mathcal{X}| \times |\mathcal{X}|$ matrix whose general term $\Pi_{ij}$ is $\Pr(x_{t+1} = j \mid x_t = i, y = y_0)$. Equation (11), rewritten in matrix notation, is

$$W = \beta \Pi V - V$$

and for $\beta < 1$, matrix $I - \beta \Pi$ is strictly diagonally dominant, because each row of $\Pi$ is nonnegative and sums to one. Hence, it is invertible and Equation (11) becomes

$$V = (\beta \Pi - I)^{-1} W. \tag{12}$$

Footnote (to Assumption 3): In a static discrete-choice setting (i.e. $\beta = 0$), this assumption would be a normalization, and without loss of generality. In a dynamic discrete-choice setting, however, this entails some loss of generality because different values for the utility flows imply different values for the choice-specific value functions, which leads to differences in the optimal choice behavior. Norets and Tang (2013) discuss this issue in greater detail.
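The second step is a small linear-algebra exercise, and the sketch below shows one way it could be coded with NumPy. It combines Eq. (12) with the relations $w(x) = w^0(x) + V(x)$ and $\bar{u}_y(x) = w_y(x) - \beta E[V(x') \mid x, y]$ stated above. The array layout, the function names and the logit forward solver used to generate test inputs are all assumptions made for illustration; in an application, `w0` would come from the first step and `trans` from estimated transition frequencies.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329


def second_step(w0, trans, beta, y0=0):
    """Recover V and the flow utilities from the normalized values w0.

    w0:    (n_states, n_actions) array with G(w0(x)) = 0 for every state x
    trans: (n_actions, n_states, n_states) array, trans[y, i, j] = Pr(x'=j | x=i, y)
    """
    n_states = w0.shape[0]
    W = w0[:, y0]
    # Eq. (12): V = (beta * Pi - I)^{-1} W, with Pi the transition matrix under y0.
    V = np.linalg.solve(beta * trans[y0] - np.eye(n_states), W)
    # u_y(x) = w0_y(x) + V(x) - beta * E[V(x') | x, y].
    EV = np.einsum("yij,j->iy", trans, V)
    u = w0 + V[:, None] - beta * EV
    return V, u


def solve_logit_ccp(u, trans, beta, n_iter=2000):
    """Hypothetical forward solver (logit shocks), used only to create test inputs."""
    V = np.zeros(u.shape[0])
    for _ in range(n_iter):
        w = u + beta * np.einsum("yij,j->iy", trans, V)
        V = np.log(np.exp(w).sum(axis=1)) + EULER_GAMMA
    w = u + beta * np.einsum("yij,j->iy", trans, V)
    e = np.exp(w - w.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
beta = 0.9
trans = rng.random((2, 4, 4))
trans /= trans.sum(axis=2, keepdims=True)   # rows sum to one
u_true = rng.normal(size=(4, 2))
u_true[:, 0] = 0.0                          # Assumption 3 with y0 = 0

p = solve_logit_ccp(u_true, trans, beta)
w0 = np.log(p) - EULER_GAMMA                # logit first step (Example 1)
V_hat, u_hat = second_step(w0, trans, beta)
print(np.round(u_hat - u_true, 8))
```

In this round trip the recovered flow utilities match the ones used to generate the choice probabilities, and the column for the benchmark choice comes back as zero, in line with Assumption 3.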

The right-hand side of Eq. (12) is uniquely estimated from the data. After obtaining $V(x)$, $\bar{u}_y(x)$ can be nonparametrically identified by

$$\bar{u}_y(x) = w^0_y(x) + V(x) - \beta E\left[ V(x') \mid x, y \right], \tag{13}$$

where $w^0(x)$ is as in Theorem 3, and $V$ is given by (12).

As a sanity check, one recovers $\bar{u}_{y_0}(\cdot) = W + V - \beta \Pi V = 0$. Also, when $\beta \to 0$, one recovers $\bar{u}_y(x) = w^0_y(x) - w^0_{y_0}(x)$, which is the case in standard static discrete choice.

Moreover, since our approach to identifying the utility flows is nonparametric, the MTA does not leverage any known restrictions on the flow utility (including parametric or shape restrictions) in identifying or estimating the flow utilities.

Eqs. (12) and (13) above, showing how the per-period utility flows can be recovered from the choice-specific value functions via a system of linear equations, echo similar derivations in the existing literature (e.g. Aguirregabiria and Mira (2007), Pesendorfer and Schmidt-Dengler (2008), Arcidiacono and Miller (2011, 2013)). Hence, the innovative aspect of our MTA estimator lies not in the second step, but rather in the first step. In the next section, we delve into computational aspects of this first step.

Existing procedures for estimating DDC models typically rely on a small class of distributions for the utility shocks (primarily those in the extreme-value family, as in Example 1 above) because these distributions yield analytical (or near-analytical) formulas for the choice probabilities and $\{E[\varepsilon_y \mid Y(w, \varepsilon) = y, x]\}_y$, the vector of conditional expectations of the utility shocks for the optimal choices, which is required in order to recover the utility flows. Our approach, however, which is based on computing the $\mathcal{G}^*$ function, easily accommodates different choices for $Q_\varepsilon$, the (joint) distribution of the utility shocks conditional on $X$. Therefore, our findings expand the set of dynamic discrete-choice models suitable for applied work far beyond those with extreme-value distributed utility shocks.

Footnote (on shape restrictions): To ensure that the inverted $w$ satisfies certain shape restrictions, the linkage between $w$ and the CCP will no longer be stipulated by the subdifferential of the convex conjugate function. It is possible that there exists a modification of the convex conjugate function that is equivalent to imposing certain shape restrictions on utilities. This is an interesting avenue for future research.

Footnote (on existing procedures): Related papers include Hotz and Miller (1993), Hotz, Miller, Sanders, Smith (1994), Aguirregabiria and Mira (2007), Pesendorfer and Schmidt-Dengler (2008), Arcidiacono and Miller (2011). Norets and Tang (2013) propose another estimation approach for binary dynamic choice models in which the choice probability function is not required to be known.

## 4. Computational details for the MTA estimator

In Section 4.1, we show that the problem of identification in DDC models can be formulated as a mass transport problem, and also how this may be implemented in practice. In showing how to compute $\mathcal{G}^*$, we exploit the connection, alluded to above, between this function and the assignment game, a model of two-sided matching with transferable utility which has been used to model marriage and housing markets (such as Shapley and Shubik (1971) and Becker (1973)).

### 4.1. Mass Transport formulation

Much of our computational strategy will be based on the following proposition, which was derived in Galichon and Salanié (2012, Proposition 2). It characterizes the $\mathcal{G}^*$ function as the optimum of a well-studied mathematical program, the "mass transport" problem; see Villani (2003).

Proposition 2 (Galichon and Salanié). Given Assumption 2, the function $\mathcal{G}^*(p)$ is the value of the mass transport problem in which the distribution $Q$ of vectors of utility shocks $\varepsilon$ is matched optimally to the distribution of actions $y$ given by the multinomial distribution $p$, when the cost associated to a match of $(\varepsilon, y)$ is given by
$$c(y, \varepsilon) = -\varepsilon_y,$$
where $\varepsilon_y$ is the utility shock from taking the $y$-th action. That is,
$$\mathcal{G}^*(p) = \sup_{\substack{w, z \\ \text{s.t. } w_y + z(\varepsilon) \le c(y, \varepsilon)}} \left\{ E_p[w_Y] + E_Q[z(\varepsilon)] \right\}, \tag{14}$$
where the supremum is taken over the pair $(w, z)$, where $w$ is a vector of dimension $|\mathcal{Y}|$ and $z(\cdot)$ is a $Q$-measurable random variable. By Monge-Kantorovich duality, (14) coincides with its dual
$$\mathcal{G}^*(p) = \min_{Y \sim p,\ \varepsilon \sim Q} E[c(Y, \varepsilon)], \tag{15}$$
where the minimum is taken over the joint distributions of $(Y, \varepsilon)$ such that the first margin $Y$ has distribution $p$ and the second margin $\varepsilon$ has distribution $Q$. Moreover, $w \in \partial \mathcal{G}^*(p)$ if and only if there exists $z$ such that $(w, z)$ solves (14). Finally, $w \in \partial \mathcal{G}^*(p)$ and $\mathcal{G}(w) = 0$ if and only if there exists $z$ such that $(w, z)$ solves (14) and $z$ is such that $E_Q[z(\varepsilon)] = 0$.

Footnote: This remark is also relevant for static discrete choice models. In fact, the random-coefficients multinomial demand model of Berry, Levinsohn, and Pakes (1995) does not have a closed-form expression for the choice probabilities, thus necessitating a simulation-based inversion procedure. In ongoing work (Chiong, Galichon, Shum (2013)), we are exploring the estimation of random-coefficients discrete-choice demand models using our approach.

In Eq. (15) above, the minimum is taken across all joint distributions of $(Y, \varepsilon)$ with marginal distributions equal to, respectively, $p$ and $Q$. It follows from the proposition that the main problem of identification of the choice-specific value functions $w$ can be recast as a mass transport problem (Villani (2003)), in which the set of optimizers of Eq. (14) yields vectors of choice-specific value functions $w \in \partial \mathcal{G}^*(p)$.

Moreover, the mass transport problem can be interpreted as an optimal matching problem. Using a marriage market analogy, consider a setting in which a matched couple consisting of a "man" (with characteristics $y \sim p$) and a "woman" (with characteristics $\varepsilon \sim Q$) obtain a joint marital surplus $-c(y, \varepsilon) = \varepsilon_y$. Accordingly, Eq. (15) is an optimal matching problem in which the joint distribution of characteristics $(y, \varepsilon)$ of matched couples is chosen to maximize the aggregate marital surplus.

In the case when $Q$ is a discrete distribution, the mass transport problem in the above proposition reduces to a linear-programming problem which coincides with the assignment game of Shapley and Shubik (1971). This connection suggests a convenient way of efficiently computing the $\mathcal{G}^*$ function (along with its subgradient). Specifically, we will show how the dual problem (Eq. (15)) takes the form of a linear programming problem or assignment game, for which some of the associated Lagrange multipliers correspond to the subgradient $\partial \mathcal{G}^*$, and hence to the choice-specific value functions. These computational details are the focus of Section 4.2 below. We include the proof of Proposition 2 in the Appendix for completeness.

4.2. Linear programming computation.

Let $\hat{Q}$ be a discrete approximation to the distribution $Q$. Specifically, consider an $S$-point approximation to $Q$, where the support is $\mathrm{Supp}(\hat{Q}) = \{\varepsilon^1, \ldots, \varepsilon^S\}$. Let $\Pr(\hat{Q} = \varepsilon^s) = q_s$. The best $S$-point approximation is such that the support points are equally weighted, $q_s = 1/S$, i.e. the best $\hat{Q}$ is a uniform distribution; see Kennan (2006). Therefore, let $\hat{Q}$ be a uniform distribution whose support can be constructed by drawing $S$ points from the distribution $Q$. Moreover, $\hat{Q}$ converges to $Q$ uniformly as $S \to \infty$, so that the approximation error from this discretization will vanish when $S$ is large. Under these assumptions, Problem (14)-(15) has a linear programming formulation as
$$\max_{\pi \ge 0} \sum_{y, s} \pi_{ys}\, \varepsilon^s_y \tag{16}$$
$$\sum_{s=1}^{S} \pi_{ys} = p_y, \quad \forall y \in \mathcal{Y} \tag{17}$$
$$\sum_{y \in \mathcal{Y}} \pi_{ys} = q_s, \quad \forall s \in \{1, \ldots, S\}. \tag{18}$$

Footnote: Because $\hat{Q}$ is constructed from i.i.d. draws from $Q$, this uniform convergence follows from the Glivenko-Cantelli Theorem.

For this discretized problem, the set of $w \in \partial \mathcal{G}^*(p)$ is the set of vectors $w$ of Lagrange multipliers corresponding to constraints (17). To see how we recover $w$, the specific element in $\partial \mathcal{G}^*(p)$ as defined in Theorem 1, we begin with the dual problem
$$\min_{\lambda, z} \sum_{y \in \mathcal{Y}} p_y \lambda_y + \sum_{s=1}^{S} q_s z_s \tag{19}$$
$$\text{s.t. } \lambda_y + z_s \ge \varepsilon^s_y.$$

Footnote: Because the two linear programs (16) and (19) are dual to each other, the Lagrange multipliers of interest $\lambda_y$ can be obtained by computing either program. In practice, for the simulations and empirical application below, we computed the primal problem (16).

Consider $(\lambda, z)$ a solution to (19). By duality, $\lambda$ and $z$ are, respectively, vectors of Lagrange multipliers associated to constraints (17) and (18). We have $\mathcal{G}^*(p) = \sum_{y \in \mathcal{Y}} p_y \lambda_y + \sum_{s=1}^{S} q_s z_s$, which implies that $\mathcal{G}(\lambda) = -\sum_{s=1}^{S} q_s z_s$. Also, for any two elements $\lambda, w \in \partial \mathcal{G}^*(p)$, we have $\sum_{y \in \mathcal{Y}} p_y \lambda_y - \mathcal{G}(\lambda) = \sum_{y \in \mathcal{Y}} p_y w_y - \mathcal{G}(w)$. Hence, because $\mathcal{G}(w) = 0$, we get
$$w_y = \lambda_y - \mathcal{G}(\lambda) = \lambda_y + \sum_{s=1}^{S} q_s z_s. \tag{20}$$

Footnote: This uses Eq. (25) in Appendix A, which (in our setup) states that $\mathcal{G}^*(p) + \mathcal{G}(\lambda) = p \cdot \lambda$, for all Lagrange multiplier vectors $\lambda \in \partial \mathcal{G}^*(p)$.

In Theorem 5 below, we establish the consistency of this estimate of $w$.

4.3. Discretization of $Q$ and a second type of indeterminacy issue.

Thus far, we have proposed a procedure for computing $\mathcal{G}^*$ (and the choice-specific value functions $w$) by discretizing the otherwise continuous distribution $Q$. However, because the support of $\varepsilon$ is discrete, $w_y$ will generally not be unique. This is due to the non-uniqueness of the solution to the dual of the LP problem in Eq. (16), and corresponds to Shapley and Shubik's (1971) well-known results on the multiplicity of the core in the finite assignment game. Applied to discrete-choice models, it implies that when the support of the utility shocks is finite, the utilities from the discrete-choice model will only be partially identified. In this section, we discuss this partial identification, or indeterminacy, problem further.

Recall that
$$\mathcal{G}^*(p) = \sup_{w_y + z(\varepsilon) \le c(y, \varepsilon)} \left\{ E_p[w_Y] + E_Q[z(\varepsilon)] \right\} \tag{21}$$
where $c(y, \varepsilon) = -\varepsilon_y$. In Proposition 2, this problem was shown to be the dual formulation of an optimal assignment problem.

We call identified set of payoff vectors, denoted by $\mathcal{I}(p)$, the set of vectors $w$ such that
$$\Pr\left( w_y + \varepsilon_y \ge \max_{y'} \{ w_{y'} + \varepsilon_{y'} \} \right) = p_y \tag{22}$$
and we denote by $\mathcal{I}_0(p)$ the normalized identified set of payoff vectors, that is, the set of $w \in \mathcal{I}(p)$ such that $\mathcal{G}(w) = 0$. If $Q$ were to have full support, $\mathcal{I}_0(p)$ would contain only the singleton $\{w\}$ as in Theorem 3. Instead, when the distribution $Q$ is discrete, the set $\mathcal{I}_0(p)$ contains a multiplicity of vectors $w$ which satisfy (5). One has:

Theorem 4.

The following holds:

(i) The set $\mathcal{I}(p)$ coincides with the set of $w$ such that there exists $z$ such that $(w, z)$ is a solution to (21). Thus
$$\mathcal{I}(p) = \left\{ w : \exists z,\ w_y + z_\varepsilon \le c(y, \varepsilon),\ E_p[w_Y] + E_Q[z_\varepsilon] = \mathcal{G}^*(p) \right\}.$$

(ii) The set $\mathcal{I}_0(p)$ is determined by the following set of linear inequalities:
$$\mathcal{I}_0(p) = \left\{ w : \exists z,\ w_y + z_\varepsilon \le c(y, \varepsilon),\ E_p[w_Y] = \mathcal{G}^*(p),\ E_Q[z_\varepsilon] = 0 \right\}.$$

Footnote: Note that Theorem 1 requires $\varepsilon$ to have full support.

This result allows us to easily derive bounds on the individual components of $w$ using the characterization of the identified set by linear inequalities. Indeed, for each $y \in \mathcal{Y}$, we can obtain upper (resp. lower) bounds on $w_y$ by maximizing (resp. minimizing) $w_y$ subject to the linear inequalities characterizing $\mathcal{I}_0(p)$, which is a linear programming problem. Furthermore, when the dimensionality of the discretization, $S$, is high, the core shrinks to the singleton $\{w\}$. This is a consequence of our next theorem, which is a consistency result. In our Monte Carlo experiments below, we provide evidence for the magnitude of this indeterminacy problem under different levels of discretization.

Footnote: However, letting $\bar{w}_y$ (resp. $\underline{w}_y$) denote the upper (resp. lower) bound on $w_y$, we note that typically the vector $(\bar{w}_y, y \in \mathcal{Y})' \notin \mathcal{I}_0(p)$.

Footnote: Moreover, partial identification in $w$ (due to discretization of the shock distribution $Q(\varepsilon)$) will naturally also imply partial identification in the utility flows $u$. For a given identified vector $w$ (and also given the choice probabilities $p$ and transition matrix $\Pi$ from the data), we can recover the corresponding $u$ using Eqs. (12)-(13).

Footnote: Gretsky, Ostroy, and Zame (1999) also discuss this phenomenon in their paper.

4.4. Consistency of MTA estimator.

Here we show (strong) consistency for our MTA estimator of $w$, the normalized choice-specific value functions. In our proof, we accommodate two types of error: (i) approximation error from discretizing the distribution $Q$ of $\varepsilon$, and (ii) sampling error from our finite-sample observations of the choice probabilities. We use $Q^n$ to denote the discretized distributions of $\varepsilon$, and $p^n$ to denote the sample estimates of the choice probabilities. The limiting vector of choice probabilities is denoted $p$. For a given $(Q^n, p^n)$, let $w^n_y$ denote the choice-specific value functions estimated using our MTA approach.

Theorem 5. Assume:
(i) The sequence of vectors $\{p^n_y\}_{y \in \mathcal{Y}}$, viewed as the multinomial distribution of $y$, converges weakly to $p$;
(ii) The discretized distributions of $\varepsilon$ converge weakly to $Q$: $Q^n \xrightarrow{d} Q$;
(iii) The second moments of $Q^n$ are uniformly bounded.
Then the convergence $w^n_y \to w_y$ for each $y \in \mathcal{Y}$ holds almost surely.

The proof, which is in the appendix, may be of independent interest, as the main argument relies on approximation results from mass transport theory. We believe this to be the first use of such results for proving consistency in an econometrics context.

5. Monte Carlo Evidence

In this section, we illustrate our estimation framework using a dynamic model of resource extraction. To illustrate how our method can tractably handle any general distribution of the unobservables, we use a distribution in which shocks to different choices are correlated. We begin by describing the setup.

At each time $t$, let x t ∈ { , , . . . , } be the state variable denoting the size of the resource pool. There are three choices:

$y_t = 0$: The pool of resources is extracted fully. $x_{t+1} \mid x_t, y_t = 0$ follows a multinomial distribution on { , , , } with parameter π = ( π , π , π , π ). The utility flow is ū( y t = 0 , x t ) = 0 . √ x t − ε .

$y_t = 1$: The pool of resources is extracted partially. $x_{t+1} \mid x_t, y_t = 1$ follows a multinomial distribution on { max { , x t − } , max { , x t − } , max { , x t − } , max { , x t − } } with parameter π. The utility flow is ū( y t = 1 , x t ) = 0 . √ x t − ε .

$y_t = 2$: The agent waits for the pool to grow and does not extract. $x_{t+1} \mid x_t, y_t = 2$ follows a multinomial distribution on $\{x_t, x_t + 1, x_t + 2, x_t + 3\}$ with parameter π. We normalize the utility flow to be ū( y t = 2 , x t ) = ε .

The joint distribution of the unobserved state variables is given by ( ε − ε , ε − ε ) ∼ N ( ( ) , ( . . . ) ). Other parameters that we fix and hold constant for the Monte Carlo study are the discount rate, β = 0 . , and π = (0 . , . , . , . ).

5.1. Asymptotic performance.

As a preliminary check of our estimation procedure, we show that we are able to recover the utility flows using the actual conditional choice probabilities implied by the underlying model. We discretized the distribution of $\varepsilon$ using $S = 5000$ support points. As is clear from Figure 1, the estimated utility flows (plotted as dots) as a function of states match the actual utility functions very well.

Figure 1. Comparison between the estimated and true utility flows. (Horizontal axis: $x$, state; series: nonparametric estimates and the true utility flows $\bar{u}(y = 0, x)$ and $\bar{u}(y = 1, x)$.)

5.2. Finite sample performance.

To test the performance of our estimation procedure when there is sampling error in the CCPs, we generate simulated panel data of the following form: $\{y_{it}, x_{it} : i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T\}$, where $y_{it} \in \{0, 1, 2\}$ is the dynamically optimal choice at $x_{it}$ after the realization of simulated shocks. We vary the number of cross-section observations $N$ and the number of periods $T$, and for each combination of $(N, T)$, we generate 100 independent datasets. For each replication or simulated dataset, the root-mean-square error (RMSE) and $R^2$ are calculated, showing how well the estimated $\bar{u}_y(x)$ fits the true utility function for each $y$. The averages are reported in Table 1.

| Design | RMSE($y = 0$) | RMSE($y = 1$) | $R^2$($y = 0$) | $R^2$($y = 1$) |
|---|---|---|---|---|
| $N = 100$, $T = 100$ | 0.5586 | 0.2435 | 0.3438 | 0.7708 |
| $N = 100$, $T = 500$ | 0.1070 | 0.1389 | 0.7212 | 0.9119 |
| $N = 100$, $T = 1000$ | 0.0810 | 0.1090 | 0.8553 | 0.9501 |
| $N = 200$, $T = 100$ | 0.1244 | 0.1642 | 0.5773 | 0.8736 |
| $N = 200$, $T = 200$ | 0.1177 | 0.1500 | 0.7044 | 0.9040 |
| $N = 500$, $T = 100$ | 0.0871 | 0.1162 | 0.8109 | 0.9348 |
| $N = 500$, $T = 500$ | 0.0665 | 0.0829 | 0.8899 | 0.9678 |
| $N = 1000$, $T = 100$ | 0.0718 | 0.0928 | 0.8777 | 0.9647 |
| $N = 1000$, $T = 1000$ | 0.0543 | 0.0643 | 0.9322 | 0.9820 |

Table 1. Average fit across all replications. Standard deviations are reported in the Appendix.

5.3. Size of the identified set of payoffs.

As mentioned in Section 4.3, using a discrete approximation to the distribution of the unobserved state variable introduces a partial identification problem: the identified choice-specific value functions might not be unique. Using simulations, we next show that the identified set of choice-specific value functions (which we will simply refer to as "payoffs") shrinks to a singleton as $S$ increases, where $S$ is the number of support points in the discrete approximation of the continuous error distribution. For $S$ ranging from 100 to 1000, we plot in Figure 2 the differences between the largest and smallest choice-specific value function for $y = 2$ across all values of $p \in \Delta$ (using the linear programming procedures described in Section 4.3).

Footnote: In each dataset, we initialized $x_i$ with a random state in $\mathcal{X}$.

Footnote: When calculating RMSE and $R^2$, we restrict to states where the probability is in the interior of the simplex $\Delta$; otherwise utilities are not identified and the estimates are meaningless.
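The linear programs of Sections 4.2 and 4.3 can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the implementation used for the results in this paper: it relies on SciPy's `linprog` with the HiGHS backend, solves the discretized version of problem (21) directly with the normalization $E_{\hat{Q}}[z] = 0$ imposed as a constraint, and then bounds one component of $w$ over the normalized identified set of Theorem 4(ii). The function names `solve_mta` and `payoff_bounds`, the tolerance `slack`, and the four-point toy shock distribution are hypothetical, and $p$ is assumed to lie in the interior of the simplex so that the bounds are finite.

```python
import numpy as np
from scipy.optimize import linprog


def _constraints(eps):
    """Rows encode w_y + z_s <= c(y, eps^s) = -eps[s, y] for every pair (s, y)."""
    S, Y = eps.shape
    A_w = np.tile(np.eye(Y), (S, 1))        # selects w_y in row s*Y + y
    A_z = np.repeat(np.eye(S), Y, axis=0)   # selects z_s in row s*Y + y
    return np.hstack([A_w, A_z]), -eps.reshape(-1)


def solve_mta(p, eps, q=None):
    """Return (G*(p), w) for the discretized problem, with w normalized so G(w) = 0.

    p   : (Y,) choice probabilities
    eps : (S, Y) support points of the discretized shock distribution
    q   : (S,) weights of the support points (uniform if omitted)
    """
    p = np.asarray(p, dtype=float)
    S, Y = eps.shape
    q = np.full(S, 1.0 / S) if q is None else np.asarray(q, dtype=float)
    A_ub, b_ub = _constraints(eps)
    cost = -np.concatenate([p, q])                     # linprog minimizes
    A_eq = np.concatenate([np.zeros(Y), q])[None, :]   # E_Q[z] = 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
                  bounds=(None, None), method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    return -res.fun, res.x[:Y]


def payoff_bounds(p, eps, y, q=None, slack=1e-9):
    """Lower and upper bound on w_y over the normalized identified set."""
    p = np.asarray(p, dtype=float)
    S, Y = eps.shape
    q = np.full(S, 1.0 / S) if q is None else np.asarray(q, dtype=float)
    g_star, _ = solve_mta(p, eps, q)
    A_ub, b_ub = _constraints(eps)
    # E_p[w] = G*(p), imposed as E_p[w] >= G*(p) - slack; the reverse
    # inequality already holds on the feasible set once E_Q[z] = 0.
    A_ub = np.vstack([A_ub, np.concatenate([-p, np.zeros(S)])])
    b_ub = np.append(b_ub, -(g_star - slack))
    A_eq = np.concatenate([np.zeros(Y), q])[None, :]
    bounds = []
    for sign in (1.0, -1.0):                 # minimize w_y, then maximize w_y
        cost = np.zeros(Y + S)
        cost[y] = sign
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
                      bounds=(None, None), method="highs")
        if res.status != 0:
            raise RuntimeError(res.message)
        bounds.append(res.x[y])
    return tuple(bounds)


# Hypothetical toy example: two choices, four equally weighted support points.
p_toy = np.array([0.5, 0.5])
eps_toy = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, -1.0], [0.0, -2.0]])
g_toy, w_toy = solve_mta(p_toy, eps_toy)
lo_toy, hi_toy = payoff_bounds(p_toy, eps_toy, y=1)
```

With so few support points the interval `[lo_toy, hi_toy]` is wide; repeating the calculation with more draws from a continuous distribution is one way to see the shrinkage that Figure 2 reports.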

Figure 2. The identified set of payoffs shrinks to a singleton across $\Delta$. (Boxplot titled "Size of the identified set of payoff for choice y=3"; horizontal axis: number of discretized points, $S$, from 100 to 1000; vertical axis: upper bound, from 0 to 0.16.)

For each value of $S$, we plot the values of the differences $\max_{w \in \partial \mathcal{G}^*(p)} w - \min_{w \in \partial \mathcal{G}^*(p)} w$ across all values of $p \in \Delta$. In the boxplot, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

As is evident, even at small $S$, the identified payoffs are very close to each other in magnitude. At $S = 1000$, where computation is near-instantaneous, for most of the values in the discretized grid of $\Delta$, the core is a singleton; when it is not, the difference in the estimated payoff is less than 0.01. Similar results hold for the choice-specific value functions for choices $y = 0$ and $y = 1$, which are plotted in Figures 5 and 6 in the Appendix. To sum up, it appears that this indeterminacy issue in the payoffs is not a worrisome problem even for very modest values of $S$.

5.4. Comparison: MTA vs. Simulated Maximum Likelihood.

One common technique used in the literature to estimate dynamic discrete choice models with a non-standard distribution of unobservables is Simulated Maximum Likelihood (SML). Our MTA method has a distinct advantage over SML: while MTA allows the utility flows $\bar{u}_y(x)$ for different choices $y$ and states $x$ to be nonparametric, the SML approach typically requires parameterizing these utility flows as a function of a low-dimensional parameter vector. This makes comparison of these two approaches awkward. Nevertheless, here we undertake a comparison of the nonparametric MTA vs. the parametric SML approach. First we compare the performance of the two alternative approaches in terms of computational time. The computations were performed on a Quad Core Intel Xeon 2.93GHz UNIX workstation, and the results are presented in Table 2.

From a computational point of view, the disadvantage of SML is that the dynamic programming problem must be solved (via Bellman function iteration) for each trial parameter vector, whereas the MTA requires solving a large-scale linear programming problem, but only once. Table 2 shows that our MTA procedure significantly outperforms SML in terms of computational speed. This finding, along with the results in Table 1, shows that MTA has the desirable properties of speed and accuracy, and also allows for nonparametric specification of the utility flows $\bar{u}_y(x)$.

Table 2. Comparison: MTA vs. Simulated Maximum Likelihood (SML)

| $S$ discretized points | SML (+): avg. seconds | MTA (++): avg. seconds |
|---|---|---|
| 2000 | 19.8 | 2.6 |
| 3000 | 24.5 | 4.4 |
| 4000 | 26.5 | 6.6 |
| 5000 | 40.9 | 9.6 |
| 6000 | 70.5 | 13.4 |
| 7000 | 105.0 | 17.5 |
| 8000 | 129.4 | 21.5 |

(+): In this column we report the time it takes to estimate the parameters θ = ( θ , θ , θ , θ ) as a local maximum of a simulated maximum likelihood, where θ corresponds to ū y=0 ( x ) = θ + θ √ x , and ū y=1 ( x ) = θ + θ √ x .
(++): In this column we report the time it takes to nonparametrically estimate the per-period utility flow.
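To make the cost comparison concrete, the inner fixed-point computation that SML must repeat for every trial parameter vector can be sketched as follows. This is a generic illustration under stated assumptions, not the code behind Table 2: it iterates the ex-ante Bellman operator $V(x) = E_\varepsilon \max_y \{ \bar{u}_y(x) + \varepsilon_y + \beta \sum_{x'} \Pi_y(x' \mid x) V(x') \}$ with the expectation replaced by an average over simulated shocks. The function name `solve_ddc`, the five-state two-choice toy model, and all numerical values are hypothetical.

```python
import numpy as np


def solve_ddc(u, Pi, beta, eps, tol=1e-10, max_iter=10_000):
    """Bellman iteration with simulated shocks.

    u   : (Y, X) flow utilities for one trial parameter vector
    Pi  : (Y, X, X) transition matrices, Pi[y, x, x'] = Pr(x' | x, y)
    eps : (S, Y) simulated utility shocks
    Returns the ex-ante value V (X,), the choice-specific values w (Y, X),
    and the simulated choice probabilities ccp (X, Y).
    """
    V = np.zeros(u.shape[1])
    for _ in range(max_iter):
        w = u + beta * (Pi @ V)                                   # (Y, X)
        V_new = (w.T[None, :, :] + eps[:, None, :]).max(axis=2).mean(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    w = u + beta * (Pi @ V)
    best = (w.T[None, :, :] + eps[:, None, :]).argmax(axis=2)     # (S, X)
    ccp = np.stack([(best == y).mean(axis=0) for y in range(u.shape[0])], axis=1)
    return V, w, ccp


# Hypothetical toy model: choice 0 lets the state drift up, choice 1 resets it.
rng = np.random.default_rng(0)
X, Y, S = 5, 2, 2000
u = np.vstack([-0.3 * np.arange(X), np.full(X, -1.0)])
Pi = np.zeros((Y, X, X))
for x in range(X):
    Pi[0, x, min(x + 1, X - 1)] = 1.0
    Pi[1, x, 0] = 1.0
eps = rng.multivariate_normal(np.zeros(Y), [[1.0, 0.5], [0.5, 1.0]], size=S)
V, w, ccp = solve_ddc(u, Pi, beta=0.9, eps=eps)
```

An SML routine would call a function like `solve_ddc` once per likelihood evaluation, which is the repeated cost that the MTA avoids by solving a single linear program.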

Furthermore, as confirmed in our computations, the nonlinear optimization routines typically used to implement SML have trouble finding the global optimum; in contrast, the MTA estimator, by virtue of its being a linear programming problem, always finds the global optimum. Indeed, under the logistic assumption on unobservables and linear-in-parameters utility, one advantage of the Hotz-Miller estimator for DDC models (vs. SML) is that the system of equations defining the estimator has a unique global solution; in their discussion of this, Aguirregabiria and Mira (2010, pg. 48) remark that "extending the range of applicability of ... CCP methods to models which do not impose the CLOGIT [logistic] assumption is a topic for further research." This paper fills the gap: our MTA estimator shares the computational advantages of the CLOGIT setup, but works for non-logistic models. In this sense, the MTA estimator is a generalized CCP estimator.

6. Empirical Application: Revisiting Harold Zurcher

In this section, we apply our estimation procedure to the bus engine replacement dataset first analyzed in Rust (1987). In each week $t$, Harold Zurcher (bus depot manager) chooses $y_t \in \{0, 1\}$ after observing the mileage $x_t \in \mathcal{X}$ and the realized shocks $\varepsilon_t$. If $y_t = 0$, then he chooses not to replace the bus engine, and $y_t = 1$ means that he chooses to replace the bus engine. The state space is X = { , , . . . }, that is, we divided the mileage space into 30 states, each representing a 12,500 increment in mileage since the last engine replacement. Harold Zurcher manages a fleet of 104 identical buses, and we observe the decisions that he made, as well as the corresponding bus mileage at each time period $t$. The duration between $t + 1$ and $t$ is a quarter of a year, and the dataset spans 10 years. Figures 7 and 8 in the Appendix summarize the frequencies and mileage at which replacements take place in the dataset.

Firstly, we can directly estimate the probability of choosing to replace and not to replace the engine for each state in $\mathcal{X}$. Also directly obtained from the data are the Markov transition probabilities for the observed state variable $x_t \in \mathcal{X}$, estimated as:

Footnote: This grid is coarser than in Rust's (1987) original analysis of this data, in which he divided the mileage space into increments of 5,000 miles. However, because replacement of engines occurred so infrequently (there were only 61 replacements in the entire ten-year sample period), using such a fine grid size leads to many states that have zero probability of choosing replacement. Our procedure, like all other CCP-based approaches, fails when the vector of conditional choice probabilities lies on the boundary of the simplex.

P̂r( x t+1 = j | x t = i, y t = 0 ) = { . j = i ; . j = i + 1 ; 0 otherwise }

P̂r( x t+1 = j | x t = i, y t = 1 ) = { . j = 0 ; 0 . j = 1 ; 0 otherwise }

Figure 3. Estimates of utility flows $\bar{u}_{y=0}(x)$, across values of mileage $x$. (Horizontal axis: $x$, mileage since last replacement (per 12,500 miles); vertical axis: $\hat{u}(y = 0, x)$; one series for each of the two normal-mixture specifications of $\varepsilon$.)

For this analysis, we assumed a normal mixture distribution of the error term, specifically, ε t − ε t ∼ N (0 , 1) + N (0 , . x ). We chose this mixture distribution in order to allow the utility shocks to depend on mileage, which accommodates, for instance, operating costs which may be more volatile and unpredictable at different levels of mileage. At the same time, these specifications for the utility shock distribution showcase the flexibility of our procedure in estimating dynamic discrete choice models for any general error distribution. For comparison, we repeat this exercise using an error distribution that is homoskedastic, i.e., its variance does not depend on the state variable $x_t$. The result appears to be robust to using different distributions of ε t − ε t .

Footnote: In this paper, we restrict attention to the case where the researcher fully knows the distribution of the unobservables $Q_{\vec{\varepsilon}}$, so that there are no unknown parameters in these distributions. In principle, the two-step procedure proposed here can be nested inside an additional "outer loop" in which unknown parameters of $Q_{\vec{\varepsilon}}$ are considered, but identification and estimation in this case must rely on model restrictions in addition to those considered in this paper. We are currently exploring such a model in the context of the simpler static discrete choice setting (Chiong, Galichon and Shum (2014, work in progress)).

We set the discount rate β = 0 . ū y=0 ( x ), we fixed $\bar{u}_{y=1}(x)$ to 0 for all $x \in \mathcal{X}$. Hence, our estimates of $\bar{u}_{y=0}(x)$ should be interpreted as the magnitude of operating costs relative to replacement costs, with positive values implying that replacement costs exceed operating costs. The estimated utility flows from choosing $y = 0$ (don't replace) relative to $y = 1$ (replace engine) are plotted in Figure 3.

Footnote: Operating costs include maintenance, fuel, insurance costs, plus Zurcher's estimate of the costs of lost ridership and goodwill due to unexpected breakdowns.

Footnote: To be pedantic, this also includes the operating cost at $x = 0$.

We only present estimates for mileage within the range x ∈ [9 , ū y=0 ( x ) using our procedure. The results are plotted in Figure 4. The evidence suggests that we are able to obtain fairly tight

cost estimates for states where there is at least one replacement, i.e. for x ≥ x ≥ , x ≤ 22 ( x ≤ , 000 miles).

Figure 4. Bootstrapped estimates of utility flows $\bar{u}_{y=0}(x)$. (Horizontal axis: $x$, mileage since last replacement (per 12,500 miles), from 0 to 30; vertical axis: $\hat{u}(y = 0, x)$.) We plot the values of the bootstrapped resampled estimates of $\bar{u}_{y=0}(x)$. In each boxplot, the central mark is the median, the edges of the box are the 25th and 75th percentiles, and the whiskers extend to the 5th and 95th percentiles.

7. Conclusion

In this paper, we have shown how results from convex analysis can be fruitfully applied to study identification in dynamic discrete choice models; modulo the use of these tools, a large class of dynamic discrete choice problems with quite general utility shocks becomes no more difficult to compute and estimate than the Logit model encountered in most empirical applications. This has allowed us to provide a natural and holistic framework encompassing the papers of Rust (1987), Hotz and Miller (1993), and Magnac and Thesmar (2002). While the identification results in this paper are comparable to other results in the literature, the approach we take, based on the convexity of the social surplus function $\mathcal{G}$ and the resulting duality between choice probabilities and choice-specific value functions, appears new. Far more than providing a mere reformulation, this approach is powerful, and has significant implications in several dimensions.

First, by drawing the (surprising) connection between the computation of the $\mathcal{G}^*$ function and the computation of optimal matchings in the classical assignment game, we can apply the powerful tools developed to compute optimal matchings to dynamic discrete-choice models. Moreover, by reformulating the problem as an optimal matching problem, all existence and uniqueness results are inherited from the theory of optimal transport. For instance, the uniqueness of a systematic utility rationalizing the consumer's choices follows from the uniqueness of a potential in the Monge-Kantorovich theorem.

We believe the present paper opens a more flexible way to deal with discrete choice models. While identification is exact for a fixed structure of the unobserved heterogeneity, one may wish to parameterize the distribution of the utility shocks and do inference on that parameter. The results and methods developed in this paper may also extend to dynamic discrete games, with the utility shocks reinterpreted as players' private information. However, we leave these directions for future exploration.

Footnote: While the present paper has used standard Linear Programming algorithms such as the Simplex algorithm, other, more powerful matching algorithms such as the Hungarian algorithm may be efficiently put to use when the dimensionality of the problem grows.

Footnote: See, e.g., Aguirregabiria and Mira (2007) or Pesendorfer and Schmidt-Dengler (2008).

References

[1] V. Aguirregabiria and P. Mira. Swapping the nested fixed point algorithm: A class of estimators for discrete Markov decision models. Econometrica, 70:1519-1543, 2002.
[2] V. Aguirregabiria and P. Mira. Sequential estimation of dynamic discrete games. Econometrica, 75:1-53, 2007.
[3] V. Aguirregabiria and P. Mira. Dynamic discrete choice structural models: a survey. Journal of Econometrics, 156:38-67, 2010.
[4] S. Anderson, A. de Palma, and J.-F. Thisse. A Representative Consumer Theory of the Logit Model. International Economic Review, 29(3):461-466, 1988.
[5] P. Arcidiacono and R. Miller. Conditional Choice Probability Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity. Econometrica, 79:1823-1867, 2011.
[6] P. Arcidiacono and R. Miller. Identifying Dynamic Discrete Choice Models off Short Panels. Working paper, 2013.
[7] C. Aliprantis and K. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer-Verlag, 2006.
[8] P. Bajari, V. Chernozhukov, H. Hong, and D. Nekipelov. Nonparametric and semiparametric analysis of a dynamic game model. Preprint, 2009.
[9] S. Berry, A. Gandhi, and P. Haile. Connected Substitutes and Invertibility of Demand. Econometrica, 81:2087-2111, 2013.
[10] S. Berry. Estimating Discrete-Choice Models of Product Differentiation. RAND Journal of Economics, 25:242-262, 1994.
[11] S. Berry, J. Levinsohn, and A. Pakes. Automobile prices in market equilibrium. Econometrica, 63:841-890, July 1995.
[12] D. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.
[13] P. Chiappori and I. Komunjer. On the Nonparametric Identification of Multiple Choice Models. Working paper, 2010.
[14] K. Chiong, A. Galichon, and M. Shum. Simulation and Partial Identification in Random Coefficient Discrete Choice Demand Models. Work in progress, 2014.
[15] R. Cominetti, E. Melo, and S. Sorin. A payoff-based learning procedure and its application to traffic games. Games and Economic Behavior, 70:71-83, 2010.
[16] A. Galichon and B. Salanié. Cupid's invisible hand: Social surplus and identification in matching models. Preprint, 2012.
[17] N. Gretsky, J. Ostroy, and W. Zame. Perfect Competition in the Continuous Assignment Model. Journal of Economic Theory, 85:60-118, 1999.
[18] P. Haile, A. Hortacsu, and G. Kosenok. On the Empirical Content of Quantal Response Models. American Economic Review, 98:180-200, 2008.
[19] J. Hofbauer and W. Sandholm. On the Global Convergence of Stochastic Fictitious Play. Econometrica, 70:2265-2294, 2002.
[20] J. Hotz and R. Miller. Conditional choice probabilities and the estimation of dynamic models. Review of Economic Studies, 60:497-529, 1993.
[21] J. Hotz, R. Miller, S. Sanders, and J. Smith. A Simulation Estimator for Dynamic Models of Discrete Choice. Review of Economic Studies, 61:265-289, 1994.
[22] Y. Hu and M. Shum. Nonparametric Identification of Dynamic Models with Unobserved Heterogeneity. Journal of Econometrics, 171:32-44, 2012.
[23] H. Kasahara and K. Shimotsu. Nonparametric Identification of Finite Mixture Models of Dynamic Discrete Choice. Econometrica, 77:135-175, 2009.
[24] M. Keane and K. Wolpin. The career decisions of young men. Journal of Political Economy, 105:473-522, 1997.
[25] J. Kennan. A Note on Discrete Approximations of Continuous Distributions. Mimeo, University of Wisconsin at Madison, 2006.
[26] T. Magnac and D. Thesmar. Identifying dynamic discrete decision processes. Econometrica, 70:801-816, 2002.
[27] D. McFadden. Modeling the choice of residential location. In A. Karlquist et al., editors, Spatial Interaction Theory and Residential Location. North Holland Pub. Co., 1978.
[28] D. McFadden. Economic Models of Probabilistic Choice. In C. Manski and D. McFadden, editors, Structural Analysis of Discrete Data with Econometric Applications, 1981.
[29] A. Norets. Inference in dynamic discrete choice models with serially correlated unobserved state variables. Econometrica, 77:1665-1682, 2009.
[30] A. Norets and S. Takahashi. On the Surjectivity of the Mapping Between Utilities and Choice Probabilities. Quantitative Economics [...]
[...] Econometrica, 54:1027-1057, 1986.
[33] M. Pesendorfer and P. Schmidt-Dengler. Asymptotic least squares estimators for dynamic games. Review of Economic Studies, 75:901-928, 2008.
[34] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[35] J. Rust. Structural Estimation of Markov Decision Processes. Handbook of Econometrics, Volume 4 (ed. R. Engle and D. McFadden). North-Holland, 1994.
[36] J. Rust. Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55:999-1033, 1987.
[37] X. Shi, M. Shum, and W. Song. Estimating Multinomial Models using Cyclic Monotonicity. Caltech Social Science Working Paper 1397, 2014.
[38] L. Shapley and M. Shubik. The assignment game I: The core. International Journal of Game Theory, 1(1):111-130, 1971.
[39] C. Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics, Vol. 58. American Mathematical Society, 2003.
[40] C. Villani. Optimal Transport, Old and New. Springer, 2009.

Background results

Convex Analysis for Discrete-choice Models.

Here, we give a brief review of the main notions and results used in the paper. We keep an informal style and do not give proofs, but we refer to Rockafellar (1970) for an extensive treatment of the subject.

Let $u \in \mathbb{R}^{|\mathcal{Y}|}$ be a vector of utility indices. For utility shocks $\{\varepsilon_y\}_{y \in \mathcal{Y}}$ distributed according to a joint distribution function $Q$, we define the social surplus function as
$$\mathcal{G}(u) = E\left[\max_y \{u_y + \varepsilon_y\}\right], \tag{23}$$
where $u_y$ is the $y$-th component of $u$. If $E(\varepsilon_y)$ exists and is finite, then the function $\mathcal{G}$ is a proper convex function that is continuous everywhere. Moreover, assuming that $Q$ is sufficiently well-behaved (for instance, if it has a density with respect to the Lebesgue measure), $\mathcal{G}$ is differentiable everywhere.

Define the Legendre-Fenchel conjugate, or convex conjugate, of $\mathcal{G}$ as $\mathcal{G}^*(p) = \sup_{u \in \mathbb{R}^{|\mathcal{Y}|}} \{p \cdot u - \mathcal{G}(u)\}$. Clearly, $\mathcal{G}^*$ is a convex function, as it is the supremum of affine functions. Note that the inequality
$$\mathcal{G}(u) + \mathcal{G}^*(p) \ge p \cdot u \tag{24}$$
holds in general. The domain of $\mathcal{G}^*$ consists of $p \in \mathbb{R}^{|\mathcal{Y}|}$ for which the supremum is finite. In the case when $\mathcal{G}$ is defined by (23), it follows from Norets and Takahashi (2013) that the domain of $\mathcal{G}^*$ contains the simplex $\Delta^{|\mathcal{Y}|}$, which is the set of $p \in \mathbb{R}^{|\mathcal{Y}|}$ such that $p_y \ge 0$ and $\sum_{y \in \mathcal{Y}} p_y = 1$. This means that our convex conjugate function is always well-defined.

The subgradient $\partial \mathcal{G}(u)$ of $\mathcal{G}$ at $u$ is the set of $p \in \mathbb{R}^{|\mathcal{Y}|}$ such that
$$p \cdot u - \mathcal{G}(u) \ge p \cdot u' - \mathcal{G}(u')$$
holds for all $u' \in \mathbb{R}^{|\mathcal{Y}|}$. Hence $\partial \mathcal{G}$ is a set-valued function or correspondence. $\partial \mathcal{G}(u)$ is a singleton if and only if $\mathcal{G}$ is differentiable at $u$; in this case, $\partial \mathcal{G}(u) = \nabla \mathcal{G}(u)$.

One sees that $p \in \partial \mathcal{G}(u)$ if and only if $p \cdot u - \mathcal{G}(u) = \mathcal{G}^*(p)$, that is, if equality is reached in inequality (24):
$$\mathcal{G}(u) + \mathcal{G}^*(p) = p \cdot u. \tag{25}$$
This equation is itself of interest, and is known in the literature as "Fenchel's equality". By symmetry in (25), one sees that $p \in \partial \mathcal{G}(u)$ if and only if $u \in \partial \mathcal{G}^*(p)$.
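Inequality (24) and Fenchel's equality (25) can be checked numerically in the one case where both functions have closed forms. The sketch below is an illustration under stated assumptions, not part of the paper's procedure: it assumes i.i.d. Gumbel shocks recentered to have mean zero, for which $\mathcal{G}(u) = \log \sum_y e^{u_y}$ and, on the interior of the simplex, $\mathcal{G}^*(p) = \sum_y p_y \log p_y$. The function names `G` and `G_star` and the numerical values are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp, softmax


def G(u):
    """Social surplus under mean-zero Gumbel shocks (logit case)."""
    return float(logsumexp(u))


def G_star(p):
    """Convex conjugate of G on the interior of the simplex (negative entropy)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p)))


u = np.array([0.3, -1.2, 0.8])
p = softmax(u)                            # p is the gradient of G at u

# Fenchel's equality (25) holds at p, up to floating-point error.
gap_at_p = G(u) + G_star(p) - p @ u

# Inequality (24) is strict at any other point of the simplex.
p_other = np.array([0.2, 0.5, 0.3])
gap_other = G(u) + G_star(p_other) - p_other @ u
```

In this case the inversion from $p$ back to $u$ is explicit: $\log p_y$ recovers $u_y$ up to an additive constant, which is the logit instance of $u \in \partial \mathcal{G}^*(p)$.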
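In particular, when both $\mathcal{G}$ and $\mathcal{G}^*$ are differentiable, the equivalence between $p \in \partial \mathcal{G}(u)$ and $u \in \partial \mathcal{G}^*(p)$ gives $\nabla \mathcal{G}^* = (\nabla \mathcal{G})^{-1}$.

Proofs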

Proof of Proposition 1.

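Consider the $y$-th component, corresponding to $\partial \mathcal{G}(w) / \partial w_y$:

\begin{align}
\frac{\partial \mathcal{G}(w)}{\partial w_y} &= \frac{\partial}{\partial w_y} \int \max_y \left[w_y + \varepsilon_y\right] dQ \tag{26} \\
&= \int \frac{\partial}{\partial w_y} \max_y \left[w_y + \varepsilon_y\right] dQ \tag{27} \\
&= \int \mathbf{1}\left(w_y + \varepsilon_y \geq w_{y'} + \varepsilon_{y'}, \ \forall y' \neq y\right) dQ = p(y). \tag{28}
\end{align}

(We have suppressed the dependence on $x$ for convenience.)

Proof of Theorem 1.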

This follows directly from Fenchel's equality (see Rockafellar (1970), Theorem 23.5, see also Appendix 8.1), which states that $p \in \partial \mathcal{G}(w)$ is equivalent to $\mathcal{G}(w) + \mathcal{G}^*(p) = \sum_y p_y w_y$, which is equivalent in turn to $w \in \partial \mathcal{G}^*(p)$.

Proof of Theorem 2.

Because $\varepsilon$ has full support, the choice probabilities $p$ will lie strictly in the interior of the simplex $\Delta^{|\mathcal{Y}|}$. Let $\tilde{w} \in \partial \mathcal{G}^*(p)$, and let $w_y = \tilde{w}_y - \mathcal{G}(\tilde{w})$. One has $\mathcal{G}(w) = 0$, and an immediate calculation shows that $\partial \mathcal{G}(w) = p$.

Let us now show that $w$ is unique. Consider $w$ and $w'$ such that $\mathcal{G}(w) = \mathcal{G}(w') = 0$, and $p \in \partial \mathcal{G}(w)$ and $p \in \partial \mathcal{G}(w')$. Assume $w \neq w'$ to get a contradiction; then there exist two distinct $y_0$ and $y_1$ such that $w_{y_0} - w_{y_1} \neq w'_{y_0} - w'_{y_1}$; without loss of generality one may assume $w_{y_0} - w_{y_1} > w'_{y_0} - w'_{y_1}$. Let $S$ be the set of $\varepsilon$'s such that

$$\begin{aligned}
w_{y_0} - w_{y_1} &> \varepsilon_{y_1} - \varepsilon_{y_0} > w'_{y_0} - w'_{y_1}, \\
w_{y_0} + \varepsilon_{y_0} &> \max_{y \neq y_0, y_1} \{w_y + \varepsilon_y\}, \\
w'_{y_1} + \varepsilon_{y_1} &> \max_{y \neq y_0, y_1} \{w'_y + \varepsilon_y\}.
\end{aligned}$$

Because $\varepsilon$ has full support, $S$ has positive probability.

Let $\bar{w} = (w + w')/2$. Because $p \in \partial \mathcal{G}(w)$ and $p \in \partial \mathcal{G}(w')$, $\mathcal{G}$ is linear on the segment $[w, w']$, thus $\mathcal{G}(\bar{w}) = 0$, thus

$$\begin{aligned}
0 &= E\left[\bar{w}_{Y(\bar{w},\varepsilon)} + \varepsilon_{Y(\bar{w},\varepsilon)}\right] \\
&= \tfrac{1}{2} E\left[w_{Y(\bar{w},\varepsilon)} + \varepsilon_{Y(\bar{w},\varepsilon)}\right] + \tfrac{1}{2} E\left[w'_{Y(\bar{w},\varepsilon)} + \varepsilon_{Y(\bar{w},\varepsilon)}\right] \\
&\leq \tfrac{1}{2} E\left[w_{Y(w,\varepsilon)} + \varepsilon_{Y(w,\varepsilon)}\right] + \tfrac{1}{2} E\left[w'_{Y(w',\varepsilon)} + \varepsilon_{Y(w',\varepsilon)}\right] \\
&= \tfrac{1}{2} \left(\mathcal{G}(w) + \mathcal{G}(w')\right) = 0.
\end{aligned}$$

Hence equality holds term by term, and

$$\begin{aligned}
w_{Y(w,\varepsilon)} + \varepsilon_{Y(w,\varepsilon)} &= w_{Y(\bar{w},\varepsilon)} + \varepsilon_{Y(\bar{w},\varepsilon)}, \\
w'_{Y(w',\varepsilon)} + \varepsilon_{Y(w',\varepsilon)} &= w'_{Y(\bar{w},\varepsilon)} + \varepsilon_{Y(\bar{w},\varepsilon)}.
\end{aligned}$$

For $\varepsilon \in S$, $Y(w, \varepsilon) = Y(\bar{w}, \varepsilon) = y_0$ and $Y(w', \varepsilon) = Y(\bar{w}, \varepsilon) = y_1$, and we get the desired contradiction. Hence $w = w'$, and the uniqueness of $w$ follows.

Proof of Theorem 3.

From $\mathcal{G}(w^0) = 0$ and $\partial \mathcal{G}(w - \mathcal{G}(w)) = \partial \mathcal{G}(w)$, and by the uniqueness result in Theorem 2, it follows that $w^0 = w - \mathcal{G}(w)$.

Proof of Proposition 2.

The proof is in Galichon and Salanié (2012), but we include it here for self-containedness. This connection between the $\mathcal{G}^*$ function and a matching model follows from manipulation of the variational problem in the definition of $\mathcal{G}^*$:

\begin{align}
\mathcal{G}^*(p) &= \sup_{w \in \mathbb{R}^{\mathcal{Y}}} \left\{ \sum_y p_y w_y - E_Q\left[\max_{y \in \mathcal{Y}} (w_y + \varepsilon_y)\right] \right\} \tag{29} \\
&= \sup_{w \in \mathbb{R}^{\mathcal{Y}}} \left\{ \sum_y p_y w_y + E_Q\Big[\underbrace{\min_{y \in \mathcal{Y}} (-w_y - \varepsilon_y)}_{\equiv z(\varepsilon)}\Big] \right\}. \notag
\end{align}

Defining $c(y, \varepsilon) \equiv -\varepsilon_y$, one can rewrite the above as

$$\mathcal{G}^*(p) = \sup_{w_y + z(\varepsilon) \leq c(y,\varepsilon)} \left\{ E_p[w_Y] + E_Q[z(\varepsilon)] \right\}. \tag{30}$$

As is well-known from the results of Monge-Kantorovich (Villani (2003), Thm. 1.3), this is the dual problem for a mass transport problem. The corresponding primal problem is

$$\mathcal{G}^*(p) = \min_{Y \sim p, \ \varepsilon \sim \hat{Q}} E[c(Y, \varepsilon)],$$

which is equivalent to (16)-(18). Comparing Eqs. (29) and (30), we see that the subdifferential $\partial \mathcal{G}^*(p)$ is identified with those elements $w$ such that $(w, z)$, for some $z$, solves the dual problem (30).

Proof of Theorem 4.

(i) follows from Proposition 2 and the fact that if $w_y + z(\varepsilon) \leq c(y, \varepsilon)$, then $E_p[w_Y] + E_Q[z(\varepsilon)] = \mathcal{G}^*(p)$ if and only if $(w, z)$ is a solution to the dual problem.

(ii) follows from the fact that $-z(\varepsilon) = \sup_y \{w_y - c(y, \varepsilon)\} = \sup_y \{w_y + \varepsilon_y\}$, thus $E_Q[z(\varepsilon)] = 0$ is equivalent to $E_Q\left[\sup_y \{w_y + \varepsilon_y\}\right] = 0$, that is, $\mathcal{G}(w) = 0$.

Proof of Theorem 5.

We shall show that the vector of choice-specific value functions derived from the MTA estimation procedure, denoted $w^n$, converges to the true vector $w^0$. In our procedure, there are two sources of estimation error. The first is the sampling error in the vector of choice probabilities, denoted $p^n$. The second is the simulation error involved in the discretization of the distribution of $\varepsilon$; we let $Q^n$ denote this discretized distribution.

A distinctive aspect of our proof is that it utilizes the theory of mass transport; namely, convergence results for sequences of mass transport problems. For $y \in \mathcal{Y}$, let $\iota_y$ denote the $|\mathcal{Y}|$-dimensional row vector with all zeros except a 1 in the $y$-th column. The discretized mass transport problem from which we obtain $w^n$ is:

$$\sup_{\gamma \in \mathcal{M}(Q^n, p^n)} \int_{\mathbb{R}^d \times \mathbb{R}^d} (\iota \cdot \varepsilon) \, \gamma(d\varepsilon, d\iota) \tag{31}$$

where $\mathcal{M}(Q^n, p^n)$ denotes the set of joint (discrete) probability measures with marginal distributions $Q^n$ and $p^n$. In the above, $\iota$ denotes a random vector which is equal to $\iota_y$ with probability $p^n_y$, for $y \in \mathcal{Y}$. The dual problem used in the MTA procedure is

\begin{align}
\inf_{z, w} \ & \int z(\varepsilon) \, dQ^n(\varepsilon) + \sum_y w_y p^n_y \tag{32} \\
\text{s.t. } & z(\varepsilon) \geq \iota_y \cdot \varepsilon - w_y, \quad \forall y, \forall \varepsilon \tag{33} \\
& \mathcal{G}_n(w^n) = 0, \tag{34}
\end{align}

where $\mathcal{G}_n(w) \equiv E_{Q^n}\left[\max_y (w_y + \varepsilon_y)\right]$, in line with definition (23). We let $(z^n, w^n)$ denote solutions to this discretized dual problem (32). Recall (from the discussion in Section 2.3) that the extra constraint (34) in the dual problem just selects among the many dual optimizing arguments $(w^n, z^n)$ corresponding to the optimal primal solution $\gamma^n$, and so does not affect the primal problem.

Next we derive a more manageable representation of this constraint (34). From Fenchel's equality (Eq. (25)), we have $\sum_y p^n_y w^n_y = \mathcal{G}_n(w^n) + \mathcal{G}^*_n(p^n) = \mathcal{G}^*_n(p^n)$ (with $\mathcal{G}^*_n$ defined as the convex conjugate function of $\mathcal{G}_n$). Moreover, from Proposition 2, we know that $\mathcal{G}^*_n(p^n)$ can be characterized as the optimized dual objective function in (32). Hence, we see that the constraint $\mathcal{G}_n(w^n) = 0$ is equivalent to $\int z^n(\varepsilon) \, dQ^n(\varepsilon) = 0$. We introduce this latter constraint directly and rewrite the dual program

\begin{align}
\inf_{z, w} \ & \sum_y w_y p^n_y + \int z(\varepsilon) \, dQ^n(\varepsilon) \tag{35} \\
\text{s.t. } & z(\varepsilon) \geq \iota_y \cdot \varepsilon - w_y, \quad \forall y, \forall \varepsilon \tag{36} \\
& \int z(\varepsilon) \, dQ^n(\varepsilon) = 0. \tag{37}
\end{align}

(Footnote: We note that, as discussed before, the discreteness of $Q^n$ implies that $(z^n, w^n)$ will not be uniquely determined, as the core of the assignment game for a finite market is not a singleton. But this does not affect the proof, as our arguments below hold for any sequence of selections $\{z^n, w^n\}_n$.)
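For binary choice, the discretized problem can be solved by sorting, which gives a small sketch of why the dual solution is an interval rather than a point. The code below is an illustration under stated assumptions, not the MTA implementation used in the paper: it uses the convention of Proposition 2 (alternative $y$ is chosen when it maximizes $w_y + \varepsilon_y$), equally weighted centered Gumbel draws as the discretized distribution, a share `p1` such that `p1 * S` is an integer, and hypothetical function names.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel draw


def draw_shocks(S, seed=0):
    """S equally weighted atoms (eps_0, eps_1); centered Gumbel is an ASSUMPTION."""
    rng = random.Random(seed)

    def gumbel():
        # -log(Exp(1)) is standard Gumbel; max() guards against log(0).
        return -math.log(max(rng.expovariate(1.0), 1e-300)) - EULER_GAMMA

    return [(gumbel(), gumbel()) for _ in range(S)]


def dual_interval(p1, eps):
    """Interval of delta = w_1 - w_0 for which exactly a share p1 of the atoms
    prefers alternative 1, i.e. satisfies delta + eps_1 - eps_0 > 0."""
    S = len(eps)
    k = round(p1 * S)
    if not 0 < k < S or abs(k - p1 * S) > 1e-9:
        raise ValueError("this sketch needs p1 * S to be an integer in (0, S)")
    d = sorted((e1 - e0 for e0, e1 in eps), reverse=True)
    # The k atoms with the largest eps_1 - eps_0 must choose alternative 1.
    return -d[k - 1], -d[k]


def normalize(delta, eps):
    """Pick (w_0, w_1) with w_1 - w_0 = delta and discretized surplus G_n(w) = 0."""
    g = sum(max(e0, delta + e1) for e0, e1 in eps) / len(eps)
    return -g, delta - g


eps = draw_shocks(1000)
lo, hi = dual_interval(0.3, eps)   # every delta in [lo, hi] fits the data
delta = 0.5 * (lo + hi)            # one arbitrary selection: the midpoint
w0, w1 = normalize(delta, eps)
```

Any `delta` between `lo` and `hi` reproduces the same choice shares on the atoms, which is the non-uniqueness mentioned in the footnote above; the normalization then pins down the level of `(w0, w1)` but not the selection within the interval.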

We will demonstrate consistency by showing that $(z^n, w^n)$ converge a.s. to the dual optimizers in the "limit" dual problem, given by

\begin{align}
\inf_{z, w} \ & \sum_y w_y p_y \tag{38} \\
\text{s.t. } & z(\varepsilon) \geq \iota_y \cdot \varepsilon - w_y, \quad \forall y, \forall \varepsilon \tag{39} \\
& \int z(\varepsilon) \, dQ = 0. \tag{40}
\end{align}

We denote the optimizers in this limit problem by $(w^0, z^0)$, where, by construction, $w^0$ are the "true" values of the choice-specific value functions. The difference between the discretized and limit dual problems is that $Q^n$ in the former has been replaced by $Q$, the continuous distribution of $\varepsilon$, and the estimated choice probabilities $p^n$ have been replaced by the limit $p$.

We proceed in two steps. First, we argue that the sequence of optimized dual programs (35) converges to the optimized limit dual program (38), a.s. Based upon this, we then argue that the sequence of dual optimizers, $(w^n, z^n)$, necessarily converges to the unique limit optimizers, $(w^0, z^0)$, a.s.

First step.

By the Kantorovich duality theorem, we know that the optimized values for the limit primal and dual programs coincide:

$$\sup_{\gamma \in \Pi(Q, p)} \int_{\mathbb{R}^d \times \mathbb{R}^d} (\iota \cdot \varepsilon) \, \gamma(d\varepsilon, d\iota) = \inf \sum_y w_y p_y + \int z(\varepsilon) \, dQ. \tag{41}$$

Moreover, both the primal and dual problems in the discretized case are finite-dimensional linear programming problems, and by the usual LP duality, the optimal primal and dual values for the discretized case also coincide:

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} (\iota \cdot \varepsilon) \, \gamma^n(d\varepsilon, d\iota) = \sum_y w^n_y p^n_y + \int z^n(\varepsilon) \, dQ^n.$$

Given Assumption 1, and by Theorem 5.20 in Villani (2009), p. 77, we have that, up to a subsequence extraction, $\gamma^n$ (the optimizing argument of (31)) converges weakly. In addition, by Theorem 5.30 in Villani (2009), the left-hand side of (41) has a unique solution $\gamma^0$; hence, the sequence $\gamma^n$ must converge generally to $\gamma^0$. This implies a.s. convergence of the value of the primal problems:

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} (\iota \cdot \varepsilon) \, \gamma^n(d\varepsilon, d\iota) \to \int_{\mathbb{R}^d \times \mathbb{R}^d} (\iota \cdot \varepsilon) \, \gamma^0(d\varepsilon, d\iota), \quad \text{a.s.},$$

and, by duality, we must also have a.s. convergence of the discretized dual problem to the limit problem:

$$\sum_y w^n_y p^n_y + \int z^n(\varepsilon) \, dQ^n \to \sum_y w^0_y p_y + \int z^0(\varepsilon) \, dQ, \quad \text{a.s.} \tag{42}$$

Second step.

Next, we show that the discretized dual minimizers $(z^n, w^n)$ converge a.s. For convenience, in what follows we suppress the qualifier "a.s." from all statements. Let

$$\underline{w}^n = \min_y w^n_y. \tag{43}$$

From examination of the dual problem (35), we see that $z^n$ is the piecewise affine function

$$z^n(\varepsilon) = \max_y \{\iota_y \cdot \varepsilon - w^n_y\}, \tag{44}$$

thus $z^n$ is $M$-Lipschitz with $M := \max_y |\iota_y| = 1$. Now observe that

$$z^n(\varepsilon) + \underline{w}^n = \max_y \{\iota_y \cdot \varepsilon - w^n_y + \underline{w}^n\} \leq \max_y \{\iota_y \cdot \varepsilon\} =: \bar{z}(\varepsilon) \tag{45}$$

and, letting $y'$ be the argument of the minimum in (43),

$$z^n(\varepsilon) + \underline{w}^n \geq \iota_{y'} \cdot \varepsilon - w^n_{y'} + \underline{w}^n = \iota_{y'} \cdot \varepsilon \geq \min_y \{\iota_y \cdot \varepsilon\} =: \underline{z}(\varepsilon) \tag{46}$$

thus, by a combination of (45) and (46),

$$\underline{z}(\varepsilon) \leq z^n(\varepsilon) + \underline{w}^n \leq \bar{z}(\varepsilon). \tag{47}$$

Integrating (47) with respect to $Q^n$ and using $\int z^n(\varepsilon) \, dQ^n(\varepsilon) = 0$ gives $E_{Q^n}[\underline{z}] \leq \underline{w}^n \leq E_{Q^n}[\bar{z}]$, so that $\underline{w}^n$ is uniformly bounded. Combined with (47), this shows that $z^n$ is uniformly sublinear: for some constant $C$, $|z^n(\varepsilon)| \leq C(1 + |\varepsilon|)$ for every $n$ and every $\varepsilon$. Hence the sequence $z^n$ is uniformly equicontinuous, and converges locally uniformly up to a subsequence extraction by Ascoli's theorem. Let this limit function be denoted $z^*$. By (42) and Theorem 2, we deduce that $z^0$, the optimizer in the limit dual problem, is unique, so that it must coincide with the limit function $z^*$.

By the definition of $(w^n, z^n)$ as optimizing arguments for (35), we have $\sum_y w^n_y p^n_y \leq \sum_y \underline{w}^n p^n_y + \int \bar{z}(\varepsilon) \, dQ^n(\varepsilon)$, or

$$\sum_y \left(w^n_y - \underline{w}^n\right) p^n_y \leq \int \bar{z}(\varepsilon) \, dQ^n(\varepsilon) = E_{Q^n} \bar{z}.$$

The second moment restrictions on $Q^n$ (condition (ii) in the theorem) imply that $E_{Q^n} \bar{z}(\varepsilon)$ exists and converges to $E_Q \bar{z}$. Hence, the nonnegative vectors $\left(w^n_y - \underline{w}^n\right)$ are bounded; accordingly, the vectors $\left(w^n_y\right)$ are themselves bounded. This implies that $w^n$ converges up to a subsequence to some limit point $w^*$, using the Bolzano-Weierstrass theorem. This implies that $\sum_y w^n_y p^n_y \to \sum_y w^*_y p_y$ by bounded convergence. By Theorem 2, we know that the limit point $w^*$ must coincide with $w^0$, which is the unique optimizer in the dual limit problem (38). Thus, we have shown that $w^n$ converges to $w^0$, a.s.

(Footnote: Although the support of $\varepsilon$ is not bounded, the locally uniform convergence of $z^n$ and the fact that the second moments of $Q^n$ are uniformly bounded are enough to conclude.)
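The convergence statement can be checked numerically in a case where the truth is known in closed form. The sketch below is only an illustration and makes several assumptions that are not part of the proof: binary choice, centered Gumbel shocks (so that the normalized true values are $\log p_y$), choice probabilities held fixed at their true value (so only the discretization error is visible), and the midpoint of the dual interval as the selection. The function name `mta_binary` is hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel draw


def mta_binary(p1, S, seed=0):
    """Normalized (w_0, w_1) from S equally weighted simulated shock draws."""
    rng = random.Random(seed)

    def gumbel():
        # Centered Gumbel draw (an ASSUMPTION of this sketch).
        return -math.log(max(rng.expovariate(1.0), 1e-300)) - EULER_GAMMA

    eps = [(gumbel(), gumbel()) for _ in range(S)]
    d = sorted((e1 - e0 for e0, e1 in eps), reverse=True)
    k = max(1, min(S - 1, round(p1 * S)))   # atoms assigned to alternative 1
    delta = -0.5 * (d[k - 1] + d[k])        # midpoint selection of w_1 - w_0
    g = sum(max(e0, delta + e1) for e0, e1 in eps) / S
    return -g, delta - g                    # shifted so that G_n(w) = 0


p1 = 0.3
truth = (math.log(1.0 - p1), math.log(p1))  # closed form under the logit assumption
errors = {}
for S in (100, 1000, 10000, 100000):
    estimate = mta_binary(p1, S)
    errors[S] = max(abs(a - b) for a, b in zip(estimate, truth))
```

The entries of `errors` need not decrease monotonically for a single seed, but they should become small as `S` grows, in line with the theorem.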

Additional Figures

| Design | RMSE ($y = 0$) | RMSE ($y = 1$) | $R^2$ ($y = 0$) | $R^2$ ($y = 1$) |
|---|---|---|---|---|
| $N = 100$, $T = 100$ | 0.5586 (3.7134) | 0.2435 (0.1155) | 0.3438 (0.7298) | 0.7708 (0.2073) |
| $N = 100$, $T = 500$ | 0.1070 (0.0541) | 0.1389 (0.0638) | 0.7212 (0.2788) | 0.9119 (0.0820) |
| $N = 100$, $T = 1000$ | 0.0810 (0.0376) | 0.1090 (0.0425) | 0.8553 (0.1285) | 0.9501 (0.0352) |
| $N = 200$, $T = 100$ | 0.1244 (0.0594) | 0.1642 (0.0628) | 0.5773 (0.6875) | 0.8736 (0.1112) |
| $N = 200$, $T = 200$ | 0.1177 (0.0736) | 0.1500 (0.0816) | 0.7044 (0.2813) | 0.9040 (0.0842) |
| $N = 500$, $T = 100$ | 0.0871 (0.0375) | 0.1162 (0.0430) | 0.8109 (0.2468) | 0.9348 (0.0650) |
| $N = 500$, $T = 500$ | 0.0665 (0.0261) | 0.0829 (0.0290) | 0.8899 (0.1601) | 0.9678 (0.0374) |
| $N = 1000$, $T = 100$ | 0.0718 (0.0340) | 0.0928 (0.0344) | 0.8777 (0.1320) | 0.9647 (0.0314) |
| $N = 1000$, $T = 1000$ | 0.0543 (0.0176) | 0.0643 (0.0162) | 0.9322 (0.0577) | 0.9820 (0.0101) |

Table 3


[Figure: box plot titled "Size of the identified set of payoff for choice y=1". Horizontal axis: number of discretized points, S. Vertical axis: upper bound.]

Figure 5. For each value of $S$, we plot the values of the differences $\max_{w \in \partial \mathcal{G}^*(p)} w - \min_{w \in \partial \mathcal{G}^*(p)} w$ across all values of $p \in \Delta$. In the box-plot, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

[Figure: box plot titled "Size of the identified set of payoff for choice y=2". Horizontal axis: number of discretized points, S. Vertical axis: upper bound.]

Figure 6. For each value of $S$, we plot the values of the differences $\max_{w \in \partial \mathcal{G}^*(p)} w - \min_{w \in \partial \mathcal{G}^*(p)} w$ across all values of $p \in \Delta$. In the box-plot, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

[Figure: frequency of engine replacement plotted against x, mileage since last replacement (per 12,500 miles).]

Figure 7

[Figure: probability of engine replacement plotted against x, mileage since last replacement (per 12,500 miles).]