Ensemble Learning with Statistical and Structural Models∗

Jiaming Mao† Jingzhi Xu‡

June 11, 2020
Abstract
Statistical and structural modeling represent two distinct approaches to data analysis. In this paper, we propose a set of novel methods for combining statistical and structural models for improved prediction and causal inference. Our first proposed estimator has the double robustness property in that it requires only the correct specification of either the statistical or the structural model. Our second proposed estimator is a weighted ensemble that has the ability to outperform both models when they are both misspecified. Experiments demonstrate the potential of our estimators in various settings, including first-price auctions, dynamic models of entry and exit, and demand estimation with instrumental variables.

∗ We thank Panle Jia Barwick, Whitney Newey, and seminar audiences for many helpful discussions and suggestions. Mao acknowledges financial support from the National Natural Science Foundation of China.
† Corresponding author. Xiamen University. Email: [email protected]
‡ Xiamen University. Email: [email protected]

Introduction
In economics as well as many other scientific disciplines, statistical and structural modeling represent two distinct approaches to data analysis (Heckman, 2000). The structural approach draws a direct link between data and theory. It estimates structural models, or scientific models (Shalizi, 2013), that specify the causal mechanisms generating the observed data. A complete structural model in economics describes economic and social phenomena as the outcomes of individual behavior in specific economic and social environments (Heckman and Vytlacil, 2007; Reiss and Wolak, 2007). Once estimated, these models can be used for making predictions, evaluating causal effects, and conducting normative welfare analyses (Low and Meghir, 2017).

In contrast to the structural approach, the statistical approach to data analysis relies on the use of statistical models for prediction and causal inference. While recent advances in machine learning have focused on predictive tasks (Athey, 2017), a large literature in causal inference across multiple disciplines has proposed statistical methods for estimating causal effects from experimental and observational data (Imbens and Rubin, 2015). In economics, this statistical approach to causal inference is informally referred to as the reduced-form approach (Chetty, 2009). Methods such as controlling for observed confounding and instrumental variables regression are widely used in applied economic analyses (Athey and Imbens, 2017).

Which approach should be preferred – the statistical or the structural – has been the subject of a long-standing debate within the economics profession (Angrist and Pischke, 2010).

e.g. the social sciences, the biomedical sciences, statistics, and computer science.

In the statistical approach to causal inference, causal knowledge is used not to specify a complete structural model, but to inform research designs that can identify the causal effects of interest by exploiting exogenous variations in the data.
As Chetty (2009) pointed out, the term "reduced-form" is largely a misnomer, whose meaning in the econometrics literature today has departed from its historical root. Historically, a reduced-form model is an alternative representation of a structural model. Given a structural model M(x, y, ε) = 0, where x is exogenous, y is endogenous, and ε is unobserved, if we write y as a function of x and ε, y = f(x, ε), then f is the reduced form of M (Reiss and Wolak, 2007). Today, however, applied economists typically refer to nonstructural, statistical treatment effect models as "reduced-form" models. Perhaps reflecting the informal nature of the terminology today, Rust (2014) gave the following definitions of the two approaches: "At the risk of oversimplifying, empirical work that takes theory 'seriously' is referred to as structural econometrics, whereas empirical work that avoids a tight integration of theory and empirical work is referred to as reduced-form econometrics."

The statistical approach is generally stronger at in-domain prediction, where the training and the test data have the same distribution. On the other hand, a main advantage of structural estimation lies in its ability to make out-of-domain predictions. As long as the same causal mechanism governs data generation, a correctly specified structural model provides a way to extrapolate from the training data to the test data even if the distributions have changed. Similarly, in causal inference, reduced-form methods that exploit credible sources of identifying information deliver estimates of causal effects with high internal validity, while structural estimates may have more claims to external validity.

The relative strengths of the two approaches point to a complementarity that provides the motivation for this paper.
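The in-domain versus out-of-domain contrast can be made concrete with a toy sketch (ours, not from the paper): a flexible statistical model fit on a source domain can match the data closely there, yet extrapolate poorly to a shifted target domain, while the true mechanism extrapolates by construction. The functional form and domain ranges below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    """The stable causal mechanism (a stand-in 'structural' truth)."""
    return np.exp(-x)

# Source domain: x in [0, 1]; target domain: x in [2, 3] (shifted)
x_src = rng.uniform(0.0, 1.0, 300)
y_src = f_true(x_src) + rng.normal(scale=0.01, size=300)
x_tgt = rng.uniform(2.0, 3.0, 300)
y_tgt = f_true(x_tgt)

# A flexible statistical model: cubic polynomial, fit in-domain
coef = np.polyfit(x_src, y_src, deg=3)
mse_in = np.mean((np.polyval(coef, x_src) - y_src) ** 2)
mse_out = np.mean((np.polyval(coef, x_tgt) - y_tgt) ** 2)
# mse_in sits near the noise floor; mse_out is far larger, because the
# polynomial has no reason to track exp(-x) outside the training range.
```

The point is not that polynomials are bad, but that a purely statistical fit carries no information about behavior outside the source domain.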
Of course, the reason that any approach may outperform the other in certain aspects of data analysis is fundamentally due to model misspecification – if any model captures the true distributions governing the source and the target domains, then no improvement is possible. Indeed, one can argue that researchers on both sides of the methodological debate are motivated by a shared concern over model misspecification. Proponents of the statistical approach are concerned about misspecifications due to the often strong and unrealistic assumptions – both causal and parametric – made in structural models, while those advocating for the structural approach are concerned about misspecifications due to not incorporating theoretical insight – functional forms such as constant elasticity of substitution (CES) aggregation and the gravity equation of trade, for example.

Using the terminology of transfer learning, a domain is a joint distribution governing the input and output variables (Muandet et al., 2013). A key limitation of most statistical and machine learning models is that they require the distributions governing the training data (the source domain) and the test data (the target domain) to be the same in order to guarantee performance (Ben-David et al., 2010).

In this paper, we distinguish between the notions of out-of-domain and out-of-sample. Out-of-sample data are test data drawn from the same distribution as the training data.

Traditionally, economists emphasize the ability of structural models to make counterfactual predictions. We note that counterfactual predictions can be viewed as a special type of out-of-domain prediction.

Angrist and Pischke (2010) offered an account of what they call "the credibility revolution" – the increasing popularity of quasi-experimental methods that seek natural experiments as sources of identifying information.
Our definition of reduced-form methods includes both quasi-experimental and more traditional, non-quasi-experimental statistical methods that use expert knowledge to locate exogenous sources of variation.

By model misspecification, we refer to both incorrect functional form and distributional assumptions and, in the case of causal inference, incorrect causal assumptions.

In this paper, we propose a set of methods for combining the statistical and structural approaches for improved prediction and causal inference. Our first proposed estimator, which we call the doubly robust statistical-structural (DRSS) estimator, provides a consistent in-domain estimate as long as either the structural or the (reduced-form) statistical model is correctly specified. Our second proposed estimator, which we call the ensemble statistical-structural (ESS) estimator, is a weighted ensemble that has the ability to outperform both the structural and the (reduced-form) statistical model, both in-domain and out-of-domain, when both are misspecified.

Our methods build on several intuitions. First, statistically speaking, a structural model is a generative model (Jebara, 2012). Given a structural model that specifies the data-generating mechanism of (x_1, . . . , x_p) ∈ O, we can generate predictions of discriminative relationships E[x_j | x_i] or E[x_j^{x_i = a}] for any (x_i, x_j) ⊂ (x_1, . . . , x_p), where x_j^{x_i = a} denotes the potential outcome of x_j under the intervention x_i = a. These structurally derived relationships can then be considered as competitors to (reduced-form) statistical models that explicitly model these relationships. This allows us to leverage the large statistical literature on dealing with competing models. One popular method used in causal inference is the doubly robust estimator, which combines an outcome regression model with a treatment assignment model in the estimation of causal effects (Bang and Robins, 2005).
The doubly robust estimator is consistent if either of the two models is correctly specified, thus providing insurance against model misspecification. Lewbel et al. (2019) generalized the classic doubly robust method to allow the combination of any parametric models. Their method provides a basis for our DRSS estimator.

Second, the complementary properties of statistical and structural models suggest that a model combination approach may yield superior results (Kellogg et al., 2020). In the [...]

Rust (2014): "Notice the huge difference in world views. The primary concern of Leamer, Manski, Pischke, and Angrist is that we rely too much on assumptions that could be wrong, and which could result in incorrect empirical conclusions and policy decisions. Wolpin argues that assumptions and models could be right, or at least they may provide reasonable first approximations to reality."

In this paper, we mainly adopt the notation of the Rubin causal model (Rubin, 1974) in discussing causal inference. Equivalently, using the notation of Pearl (2009), E[x_j^{x_i = a}] can be expressed as E[x_j | do(x_i = a)].

As Breiman (1996b) pointed out, ensemble methods benefit the most from the use of diverse and dissimilar models, which is exactly the case when we combine statistical and structural models.

In this paper, we provide two ensemble estimators. The first, which we call ESS-LN, is a linear ensemble based on the method of stacking (Wolpert, 1992), or jackknife averaging (Hansen and Racine, 2012), which produces an optimal linear combination of a set of models by minimizing a cross-validated loss criterion such as expected mean squared error. We show how to use the method both for prediction and causal inference. Our second ensemble estimator,
ESS-NP, goes beyond linear combinations and builds a nonparametric ensemble of statistical and structural models. For conditional mean estimation, it employs the random forest algorithm introduced by Breiman (2001), which allows for the modeling of nonlinear relationships and complex interactions by building a large number of regression trees that adaptively partition the input space and combining them through bootstrap aggregation. The method can be viewed as an adaptive locally weighted estimator (Athey et al., 2019), allowing us to assign different weights to different regions of the input space depending on which model – the statistical or the structural – performs better in that region. The resulting ensemble has the ability to combine the strengths of statistical and structural models while defending against their weaknesses.

When the models being combined are complex and high-dimensional, for which global optima are hard to obtain, the ensemble approach also produces gains by averaging local optima produced by local search (Dietterich, 2000).
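The stacking step that underlies ESS-LN can be illustrated with a minimal two-model sketch. This is our toy code, not the paper's implementation: with two candidate models, the simplex-constrained least-squares problem for the ensemble weight has a closed form (project the unconstrained optimum onto [0, 1]). In practice the predictions fed in should be out-of-fold (cross-validated) predictions.

```python
import numpy as np

def stack_two_models(y, pred_a, pred_b):
    """Optimal simplex weight for a two-model linear ensemble.

    y:       (n,) outcomes
    pred_a:  (n,) out-of-fold predictions of model A
    pred_b:  (n,) out-of-fold predictions of model B

    Minimizes sum((y - (w*pred_a + (1-w)*pred_b))**2) over w in [0, 1].
    """
    d = pred_a - pred_b
    denom = np.dot(d, d)
    if denom == 0.0:                      # identical predictions
        return 0.5
    w = np.dot(d, y - pred_b) / denom     # unconstrained optimum
    return float(np.clip(w, 0.0, 1.0))

# Toy usage: model A is accurate, model B is biased.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
pred_a = y + rng.normal(scale=0.1, size=200)   # good model
pred_b = y + 1.0                                # biased model
w = stack_two_models(y, pred_a, pred_b)
ensemble = w * pred_a + (1 - w) * pred_b
```

Because w minimizes the stacked squared error, the ensemble can never do worse (on the data used to choose w) than either endpoint w = 0 or w = 1.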
[Six panels plotting quantity q against price p: (a) the data; (b) the statistical fit; (c) the statistical and structural fits; (d) the DRSS, statistical, and structural fits; (e) the DRSS, statistical, structural, and true curves; (f) the ESS-NP, statistical, structural, and true curves.]
Figure 1:
Demand Estimation. Filled circles represent training data. Unfilled circles represent out-of-domain test data.

Example. To illustrate our methods, consider the setting of a simple demand estimation problem. We observe the prices and quantities sold of a good x, as plotted in Figure 1a. Suppose the data are generated by the consumption decisions of n consumers who purchased x at different prices. Each consumer had fixed income I and decided how much to purchase by solving the problem:

max_{q_i, q_i^o} u_i(q_i, q_i^o) subject to p_i q_i + p_i^o q_i^o ≤ I   (1)

where (p_i, q_i, p_i^o, q_i^o) denote respectively the price and quantity of good x and of an outside good o. The consumer utility function is given by the following CES function:

u_i(q_i, q_i^o) = [α_i q_i^ρ + (1 − α_i)(q_i^o)^ρ]^{1/ρ}   (2)

where ρ = [...], suggesting an elasticity of substitution of [...]. We can fit the following statistical model to the data:

q_i = β_0 + β_1 p_i + β_2 p_i^2 + ε_i   (3)

The result is plotted in Figure 1b. Under the causal assumption that prices are exogenous to the consumers, (3) represents a reduced-form estimate of the individual demand curve. The model appears to fit the data quite well. However, once we extrapolate beyond the observed range of prices, its predictions become very bad (Figure 1c). On the other hand, structurally estimating the parameters of model (1) would yield a demand curve that has both internal and external validity (Figure 1c). This is not surprising, as (1) describes the true data-generating mechanism. In practice, given two competing models, the (reduced-form) statistical model (3) and the structural model (1), we may not know which one is correctly specified. The DRSS resolves this issue by combining the two models and providing a consistent estimate as long as one of them is correctly specified. Figure 1d plots the DRSS [estimate].

The {α_i} are generated as follows: α_i = exp(ξ_i) / (1 + exp(ξ_i)), ξ_i ∼ N(0, [...]).

Now suppose instead that the structural model is misspecified, imposing ρ = −[...].
The resulting structural fit now deviates pronouncedly from the true model, highlighting the fact that the validity of the structural approach hinges crucially on the model being correct. The DRSS estimator that combines this misspecified structural model with the (reduced-form) statistical model (3) now puts most of its weight on the latter and is no longer consistent (Figure 1e). Note, however, that compared to (3), the misspecified structural model has a worse fit in-domain, but still performs significantly better out-of-domain. This provides the motivation for our ensemble approach. Intuitively, although we misspecify the utility function, the theory of consumer utility maximization subject to budget constraints still provides important prior information on the likely shape of the demand curve – such as its downward slope – that can be used to regulate the behavior of statistical models. In Figure 1f, we show the results of our ESS-NP estimator based on a random forest ensemble of the misspecified structural model and the (reduced-form) statistical model. The ESS-NP fit is closer to the true model and performs well both in-domain and out-of-domain. Thus in this example, the ensemble approach is able to deliver optimal performance when both the structural and the (reduced-form) statistical models are incorrect.

In Section 3, we demonstrate the effectiveness of our methods using a set of simulation experiments under a variety of more realistic settings in applied economic analyses, including first-price auctions and dynamic models of entry and exit. We also revisit this demand estimation problem and show how to apply our methods to estimating the demand curve with the help of instrumental variables when prices are endogenous. For each experiment, we report the performance of the DRSS and ESS estimators when either or both of a structural model and a (reduced-form) statistical model are misspecified.
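For readers who want to reproduce the flavor of this example, the sketch below simulates CES demand from the utility maximization problem (1)-(2) and fits the quadratic reduced-form model (3). The demand function follows from the first-order condition of (1)-(2) together with the budget constraint. The specific parameter values (ρ, income, the outside-good price, the scale of ξ_i, the price range) are our own assumptions for illustration; the paper's actual values are elided in this copy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed values for illustration only; the paper's values are elided here.
rho, income, p_o, n = 0.5, 100.0, 1.0, 200

def ces_demand(p, alpha, rho=rho, income=income, p_o=p_o):
    """Demand for good x implied by CES utility maximization.

    FOC:    alpha * q**(rho-1) / ((1-alpha) * q_o**(rho-1)) = p / p_o
    =>      q_o / q = (alpha * p_o / ((1-alpha) * p)) ** (1 / (rho - 1))
    Budget: p*q + p_o*q_o = income  =>  q = income / (p + p_o * (q_o/q)).
    """
    ratio = (alpha * p_o / ((1 - alpha) * p)) ** (1.0 / (rho - 1.0))
    return income / (p + p_o * ratio)

# Heterogeneous shares alpha_i = exp(xi)/(1+exp(xi)), xi ~ N(0, s^2)
xi = rng.normal(0.0, 0.5, size=n)          # s = 0.5 is an assumption
alpha = np.exp(xi) / (1.0 + np.exp(xi))
p = rng.uniform(10.0, 25.0, size=n)        # observed (training) price range
q = ces_demand(p, alpha)

# Reduced-form model (3): quadratic regression of q on p
X = np.column_stack([np.ones(n), p, p**2])
beta, *_ = np.linalg.lstsq(X, q, rcond=None)
fitted = X @ beta
```

Extrapolating the fitted quadratic beyond the training price range reproduces the pathology of Figure 1c, while the CES demand function itself remains downward sloping everywhere.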
That is, instead of estimating both (α_i, ρ) from the data, we estimate α_i only while treating ρ = −[...] as an assumption of the model. The assumption, of course, is incorrect in this case.

The ESS-LN method produces results similar to the ESS-NP in this example.

Related Literature. This paper is related to several strands of literature. The doubly robust estimator was proposed by Robins et al. (1994); Robins and Rotnitzky (1995); Scharfstein et al. (1999) as a means of estimating the average treatment effect by combining an outcome regression model with a treatment assignment model, so that the estimator remains consistent as long as one of the models is correctly specified. In general, an estimator is said to have the double robustness property if it is consistent for the target parameter when any one of two nuisance parameters is consistently estimated (Benkeser et al., 2017). Subsequent developments in doubly robust estimation include Bang and Robins (2005); Tan (2010); Okui et al. (2012); Farrell (2015); Vermeulen and Vansteelandt (2015); Benkeser et al. (2017); Arkhangelsky and Imbens (2019). Chernozhukov et al. (2016, 2017) showed that the doubly robust estimator can be viewed as being based on Neyman-orthogonal moment conditions that are first-order robust to errors in nuisance parameter estimation. More recently, Lewbel et al. (2019) proposed the general doubly robust (GDR) method, a general technique for constructing a doubly robust combination out of any parametric models, which forms the basis of our DRSS estimator.

Our paper is also related to the literature on model averaging and ensemble methods. Model averaging provides a natural response to model uncertainty in the Bayesian framework and has long been considered an alternative to model selection. See Hoeting et al. (1999) for a comprehensive review of Bayesian model averaging methods. In machine learning, Wolpert (1992) proposed the method of stacking, or stacked generalization.
Breiman (1996a) proposed bagging, or bootstrap aggregation. Freund and Schapire (1996) introduced boosting. These ensemble methods are constructed with the explicit goal of maximizing predictive accuracy and achieve their effectiveness by incorporating model uncertainty, averaging local optima, and enriching the model space (Dietterich, 2000). More recently, there has also been a growing body of research in the statistics and econometrics literature on asymptotically optimal frequentist model averaging. See Claeskens and Hjort (2003); Hjort and Claeskens (2003); Hansen (2007); Hansen and Racine (2012); Kitagawa and Muris (2016); Zhang et al. [...]

Also see Breiman (1996b). When weights are restricted under a simplex constraint, stacking can be considered a frequentist model averaging technique. Van der Laan et al. (2007) and Hansen and Racine (2012) provided theory on its asymptotic optimality. These authors also gave different names to the method: super learning (Van der Laan et al., 2007) and jackknife model averaging (Hansen and Racine, 2012).

[...] See Pan and Yang (2010) for a survey on transfer learning and Ben-David et al. (2010) for theory on learning from different domains. A majority of research on transfer learning so far has focused on domain adaptation, where the marginal distributions of the input variables vary across domains and are observed, but the conditional outcome distribution is assumed to be the same. Methods that have been proposed aim to reduce the difference in input distributions either by sample reweighting (Zadrozny, 2004; Huang et al., 2007; Jiang and Zhai, 2007; Sugiyama et al., 2008) or by finding a domain-invariant transformation (Pan et al., 2010; Gopalan et al., 2011). Our methods, however, can be viewed as tackling the more difficult problem of domain generalization, where the target domain is unknown at the time of training and where both the marginal and the conditional distributions are allowed to vary.
Intuitively, we achieve this by incorporating theory into statistical modeling. The effectiveness of our approach hinges on the stability of the underlying causal mechanism and on the availability of a structural model that is informative, if not correctly specified.

A main contribution of this paper is to the literature on combining structural and reduced-form estimation. Many authors in economics have called for combining these two approaches to harness their respective strengths. Early efforts include Chetty (2009); Heckman (2010). Their solution is to use structural models to derive sufficient statistics for the intended analysis and then use reduced-form methods to estimate them. In comparison, we offer a set of general algorithms rather than relying on ad hoc derivations. More recently, Fessler and Kasy (2019); Mao and Zheng (2020) proposed shrinkage methods that combine statistical and structural models by shrinking the former toward the latter. Their methods can be viewed as complementary to ours. Indeed, there is a connection between shrinkage and model averaging (Hansen, 2007). By combining models of different complexities, a model averaging procedure effectively shrinks the more complex models toward the less complex ones.

Compared to Fessler and Kasy (2019); Mao and Zheng (2020), our approach arguably also has several advantages. First, their methods are asymmetric with respect to the complexities of statistical and structural models. Specifically, they require the specification of complex statistical models to be regularized with structural models. In contrast, our approach is symmetric, allowing researchers to combine structural models with the simple linear reduced-form models frequently used in applied research. Second, when the structural models are complex and high-dimensional, our ensemble methods can provide effective regularization. This can be seen most easily in the case of the stacking estimator ESS-LN. When the structural model is more complex than the statistical model, the ESS-LN effectively regularizes the former with the latter by averaging the two. This is relevant since many structural models used in empirical applications today are highly complicated and prone to overfitting as researchers strive for ever more "realistic" models.

The problem of transfer learning is closely related to the problem of sampling bias or the sample selection problem – a general problem that arises when we try to make inference, whether statistical or causal, about a population using data collected from another population.

This includes the more recent deep domain adaptation literature that employs deep neural networks for domain adaptation. See Glorot et al. (2011); Chopra et al. (2013); Ganin and Lempitsky (2014); Tzeng et al. (2014); Long et al. (2015). Wang and Deng (2018) provides an overview of this literature in the context of computer vision.

Transfer learning has also been referred to as knowledge transfer (Pan and Yang, 2010). We note, however, that true knowledge transfer must involve causal knowledge as encapsulated in theory.

Rojas-Carulla et al. (2018); Kuang et al. (2020) also proposed methods for domain generalization by assuming stability in causal relationships. Both studies rely on the assumption that a subset of the input variables v ⊆ x has a causal relation with the outcome y and that the conditional probability p(y | v) is invariant across domains. However, it is not true that having a causal relationship implies p(y | v) is domain-invariant. Let w = x \ v. The assumption only holds under very limited and untestable conditions, namely that y ⊥⊥ w | v and that the causal effect of v on y is homogeneous.

Chetty (2009): "The structural and statistical methods can be combined to address the shortcomings of each strategy ...
By combining the two methods in this manner, researchers can pick a point in the interior of the continuum between reduced-form and structural estimation, without being pinned to one endpoint or the other."

Mirroring the debate in economics on structural vs. reduced-form estimation, there has long been a debate in the machine learning literature on generative vs. discriminative models, as well as efforts to combine them. See Ng and Jordan (2002); Bishop and Lasserre (2007).

However, our method cannot be used to conduct welfare analysis, which is the focus of Chetty (2009).

Importantly, the best model to describe a given data set may not be the model that truthfully describes the data-generating mechanism. This is because the true model may well be too complex for the amount of data we have, in which case the model will be poorly fit on the limited sample and generate unreliable predictions. We therefore echo Hansen (2015): "it remains an important challenge for econometricians to devise methods for infusing empirical credibility into 'highly stylized' models of dynamical economic systems. Dismissing this problem through advocating only the analysis of more complicated 'empirically realistic'
The DRSS builds on the GDR method of Lewbel et al. (2019). In this section, we discuss the estimator first in the context of statistical prediction and then in causal inference. In both contexts, we first assume that we have access to a representative data set, i.e. the target domain on which we wish to make inference is the same as the source domain from which the data are drawn. We then consider the case where our data are non-representative and discuss the implications for the external validity, or out-of-domain performance, of our algorithms.

Statistical Prediction
Given variables (x, y) ∈ X × R, assume first that our goal is to learn the conditional expectation function µ(x) = E[y | x]. We have at our disposal two parametric models for µ(x): h(x; θ_h) and g(x; θ_g), where θ_h ∈ R^{p_h} and θ_g ∈ R^{p_g}. One of these models is correctly specified, but we do not know which one. Let f ∈ {h, g} index the correct model. Suppose the true parameter θ_f is identified by a set of ℓ_f × 1, ℓ_f > p_f, moment conditions E[ψ_f(x, y; θ_f)] = 0. Given a sample of n i.i.d. observations, we can then construct the following (adjusted) moment distance functions:

Q_m(θ_m) = κ_m^{-1} ψ_m(θ_m)' Ω_m ψ_m(θ_m),  m ∈ {h, g}   (4)

where ψ_m(θ_m) := n^{-1} Σ_{i=1}^n ψ_m(x_i, y_i; θ_m), Ω_m is an ℓ_m × ℓ_m positive definite weight matrix, and κ_m = ℓ_m − p_m is the degrees of freedom of the χ² statistic that the unadjusted Q_m equals

models will likely leave econometrics and statistics on the periphery of important applied research."

Lewbel et al. (2019) recommend the use of
Ω = Ê[ψ(θ)ψ(θ)']^{-1}, the (estimated) efficient GMM weight of Hansen (1982). However, it may not be the optimal weight for the GDR or for our DRSS. We leave the characterization of the optimal weight matrix to future work.

when m is the true model. Let θ̂_m = arg min_{θ_m} Q_m(θ_m), m ∈ {h, g}. A doubly robust estimator for µ(x) can be constructed as follows:

µ̂(x) = w_h h(x; θ̂_h) + w_g g(x; θ̂_g)   (5)

where

w_h = Q_g(θ̂_g) / (Q_h(θ̂_h) + Q_g(θ̂_g)),  w_g = 1 − w_h   (6)

Under regularity conditions, as long as one of the two models, h or g, is correctly specified, it can be shown that µ̂(x) →_p µ(x). The proof is based on Theorem 1 of Lewbel et al. (2019) (see Appendix A.1). The intuition is simple: if one of the models, say h, is correctly specified but g is not, then Q_h(θ̂_h) →_p 0 while Q_g(θ̂_g) will have a nonzero limit. Thus in the limit, w_h will be 1 and µ̂(x) becomes h(x; θ̂_h) – the consistently estimated correct model for µ(x).

Adapting the doubly robust estimator (5) to combining statistical and structural models is straightforward: let M(x, y; θ_M) be a structural model that specifies the data-generating mechanism of (x, y). From this generative structural model, we can derive its prediction of the discriminative function µ(x). Let g(x; θ_M) = E_M[y | x] be the implied conditional mean of y according to M. We can then combine g(x; θ_M) with any statistical model h(x; θ_h) according to (5). The resulting estimator is the DRSS estimator for µ(x).

In practice, there are two ways to construct ψ_g(x, y; θ_M) for the structurally derived discriminative model g(x; θ_M).
If M is the true model and θ_M is the true parameter value, ψ_g needs to satisfy E[ψ_g(x, y; θ_M)] = 0. Therefore, we can either directly specify a set of moment conditions that identify M or let ψ_g(x, y; θ_M) = φ(x)(y − g(x; θ_M)) for any function φ(·). We can then construct Q_g(θ_M) based on ψ_g(x, y; θ_M) and compute (w_h, w_g) based on (Q_h(θ̂_h), Q_g(θ̂_M)), where (θ̂_h, θ̂_M) are obtained from separate first-stage estimation of the statistical model h and the structural model M.

Sample Splitting
The DRSS method as outlined above is a two-stage procedure, where (θ̂_h, θ̂_M) are obtained in a first stage and the estimator is constructed according to (5) in a second stage. If both stages are conducted on the same sample of data, however, finite-sample bias from the first stage will be carried over to the second stage, especially when complex statistical or structural models, prone to overfitting, are estimated in the first stage. To avoid bias from overfitting and ensure good statistical behavior, we can use separate data sets for the two stages of the procedure. This can be accomplished by, for example, splitting the observed data randomly into two parts. This is known as sample splitting (Angrist and Krueger, 1995). This way, from the perspective of the second stage, (θ̂_h, θ̂_M) are exogenously given, so that when we evaluate the moment distance functions Q_h and Q_g – critical for computing the DRSS weights – we do not suffer an optimistic bias due to (θ̂_h, θ̂_M) being obtained from the same data.

There is an efficiency cost involved in sample splitting, as half of the data are wasted in each stage. The results can also be highly variable due to the whims of a single random split. To improve efficiency, we can perform sample splitting multiple times and average the results. This is the idea behind cross-validation and cross-fitting (Chernozhukov et al., 2016, 2017) and can be described as follows for our DRSS estimator: randomly partition the data into K equal-sized parts. For k = 1, . . . , K, let D_k denote the data of the k-th partition and let D_{−k} denote the data not in D_k. We use D_{−k} for the first-stage estimation of θ_h and θ_M. This gives us (θ̂_h^{(−k)}, θ̂_M^{(−k)}).
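The split-sample procedure just described can be sketched numerically. This is a toy version with our own stand-in models and moment functions, identity weight matrices Ω_m, and κ_m = 1: the "statistical" model h is a correctly specified linear regression, while the "structural" stand-in g is a misspecified constant mean, so the averaged weight on h should approach one.

```python
import numpy as np

def moment_distance(phi, resid, kappa=1):
    """Adjusted moment distance Q = kappa^{-1} * psibar' Omega psibar,
    taking Omega to be the identity matrix for simplicity."""
    psibar = (phi * resid[:, None]).mean(axis=0)
    return float(psibar @ psibar) / kappa

def fit_h(x, y):
    """'Statistical' model h: linear in x, fit by OLS (p_h = 2)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_g(x, y):
    """Stand-in 'structural' model g: constant mean (p_g = 1),
    deliberately misspecified for this demonstration."""
    return np.array([y.mean()])

def drss_weight_cross_fit(x, y, K=5, seed=0):
    """Cross-fitted DRSS weight on h: fit each model on D_{-k},
    evaluate the moment distances on D_k, form the per-fold weight
    Q_g / (Q_h + Q_g), and average over the K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    weights = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        bh, bg = fit_h(x[train], y[train]), fit_g(x[train], y[train])
        xt, yt = x[test], y[test]
        # phi chosen so that l_m > p_m (hence kappa_m = l_m - p_m = 1)
        Qh = moment_distance(np.column_stack([np.ones_like(xt), xt, xt**2]),
                             yt - (bh[0] + bh[1] * xt))
        Qg = moment_distance(np.column_stack([np.ones_like(xt), xt]),
                             yt - bg[0])
        weights.append(Qg / (Qh + Qg))   # per-fold weight on h
    return float(np.mean(weights))

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 500)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=500)
w_h = drss_weight_cross_fit(x, y)   # close to 1: h is correct, g is not
```

Because h is estimated on held-out data, Q_h carries no optimistic fitting bias, while the misspecified g produces a moment distance bounded away from zero, driving its weight toward zero.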
We then use D_k to evaluate Q_h and Q_g at (θ̂_h^{(−k)}, θ̂_M^{(−k)}). This gives us (Q_h^{(k)}(θ̂_h^{(−k)}), Q_g^{(k)}(θ̂_M^{(−k)})). Finally, for cross-validation, w is determined as

w_h = Q_g / (Q_h + Q_g),  w_g = 1 − w_h   (7)

where Q_m := K^{-1} Σ_{k=1}^K Q_m^{(k)}(θ̂_m^{(−k)}), m ∈ {h, g/M}, are cross-validated moment distances. For cross-fitting, let w_h^{(k)} be constructed from (Q_h^{(k)}(θ̂_h^{(−k)}), Q_g^{(k)}(θ̂_M^{(−k)})) according to (6). Then the cross-fitted weights are

w_h = K^{-1} Σ_{k=1}^K w_h^{(k)},  w_g = 1 − w_h   (8)

The idea of sample splitting is of course closely related to the idea of using separate training and validation data sets for fitting model parameters and hyperparameters in machine learning. Indeed, the weights (w_h, w_g) can be viewed as the hyperparameters of the DRSS model.

Both methods are consistent. See Li (1987); Chernozhukov et al. (2016), although to our knowledge their asymptotic efficiency and finite-sample performance have not been compared in existing studies.

Causal Inference

We now discuss the problem of causal effect estimation under unconfoundedness. Let the observed variables be (y, d, v) ∈ R × R × V, where y is the outcome variable, d is the treatment variable, and v is a set of control variables. We are interested in the causal effect of d on y. Specifically, let our target be the average treatment effect (ATE), denoted by τ. We allow τ to be fully nonlinear and heterogeneous, i.e. τ = τ(d, v). Then

τ(d, v) = ∂/∂d E[y^d | v]   (9)

where y^d is the potential outcome of y under treatment d. Under the unconfoundedness assumption of Rosenbaum and Rubin (1983), E[y^d | v] = E[y | d, v]. Let x = (d, v).
The task of estimating $\tau(d, v)$ is thus equivalent to the task of estimating $\mathbb{E}[y \mid x]$. Suppose now that we have a reduced-form model $h(x; \theta_h)$ for $\mathbb{E}[y \mid x]$ and a structural model $M(x, y; \theta_M)$, both supporting the unconfoundedness condition. Then we can use the DRSS to produce an estimate of $\mathbb{E}[y \mid x]$ by combining these two models, from which we can derive $\hat{\tau}(d, v)$.

When the unconfoundedness condition does not hold, so that $d$ is endogenous conditional on $v$, one of the most widely used strategies in reduced-form inference is to rely on instrumental variables, which are auxiliary sources of randomness that can be used to identify causal effects. Let $h(x; \theta_h)$, $x = (d, v)$, be a reduced-form model for $\mathbb{E}\left[\left. y^d \,\right| v\right]$. We can write $y = h(x; \theta_h) + \epsilon$, where $\epsilon$ is defined as $y - h(x; \theta_h)$ and may be correlated with $d$. If we have access to a variable $z$ that is correlated with $d$ (conditional on $v$) and

Suppose the treatment variable $d$ takes on a discrete set of values, $d \in \{1, \ldots, D\}$. Then the unconfoundedness – or conditional exchangeability – assumption can be stated as $d \perp\!\!\!\perp \left(y^{d=1}, \ldots, y^{d=D}\right) \mid v$. This assumption is satisfied if $d$ is not associated with any other causes of $y$ conditional on $v$, in which case we say $d$ is exogenous to $y$ conditional on $v$. A more precise statement of the sufficient conditions for satisfying this assumption, made in the language of causal graphical models based on directed acyclic graphs (DAGs), is that $v$ satisfies the back-door criterion (Pearl, 2009). That is, (1) the design of $h$ is based on the unconfoundedness condition; (2) in the causal structure assumed by $M$, $v$ satisfies the back-door criterion. Technically, $\tau(d, v)$ is the conditional ATE. With a slight abuse of notation, the population ATE is $\tau(d) = \mathbb{E}_v[\tau(d, v)]$.
By definition, when $\mathbb{E}[d\epsilon] \neq 0$, the received treatment $d$ is related to unobserved factors that affect the potential outcomes $y^d$, thus violating the unconfoundedness condition. $\mathbb{E}[z\epsilon] = 0$, then $z$ can serve as an instrument for $d$. In general, given $\theta_h \in \mathbb{R}^{p_h}$, let $\psi_h(x, y, z; \theta_h) = \phi(z)\left(y - h(x; \theta_h)\right)$ be a set of $\ell_h > p_h$ functions, where $\phi(z)$ is any function of $z$. If $h$ is the true model and $\theta_h$ is the true parameter, then $\theta_h$ can be identified via the following moment conditions:

$$\mathbb{E}\left[\psi_h(x, y, z; \theta_h)\right] = 0 \quad (10)$$

Now let $M(x, y, z; \theta_M)$ be a structural model for the data-generating mechanism of the observed variables. Let $g(x; \theta_M) = \mathbb{E}_M\left[\left. y^d \,\right| v\right]$ be the model-derived conditional expectation of the potential outcome under treatment $d$. Let $\psi_g(x, y, z; \theta_M)$ be either a set of moment functions for $M$, or let $\psi_g(x, y, z; \theta_M) = \phi(z)\left(y - g(x; \theta_M)\right)$. We can then construct $Q_h(\theta_h)$ and $Q_g(\theta_M)$ based on $\psi_h(x, y, z; \theta_h)$ and $\psi_g(x, y, z; \theta_M)$, and combine $h(x; \theta_h)$ and $g(x; \theta_M)$ according to (5) to produce a DRSS estimate of $\mathbb{E}\left[\left. y^d \,\right| v\right]$, from which we can obtain $\hat{\tau}(d, v)$.

Discussion
The goal of doubly robust estimation is to ensure consistency when one of two candidate models is correctly specified but we do not know which one. When both models are misspecified, however, doubly robust estimators can perform poorly (Kang and Schafer, 2007). This is not surprising, as these estimators are not constructed to optimize performance based on a loss criterion such as expected mean squared error. In fact, the DRSS estimator can be viewed as a weighted average of its candidate models (see (5)) and bears a close resemblance to Bayesian model averaging, which is known to be flawed in $\mathcal{M}$-open settings in which none of the candidate models is true (Clyde and Iversen, 2013; Yao et al., 2018).

On a causal graph, this translates into the requirement that $z$ is correlated with $d$ and that every open path connecting $z$ and $y$ has an arrow pointing into $d$. $M$ does not have to contain $z$; see e.g. Section (??) for an example. If $M$ does contain $z$, then $z$ needs to satisfy the IV requirement in the causal structure of $M$, i.e. $z$ is correlated with $d$ and every open path connecting $z$ and $y$ has an arrow pointing into $d$. If $M$ is a model for $(x, y)$ only, then in the case that it is the true model, the DRSS estimator for $\mathbb{E}\left[\left. y^d \,\right| v\right]$ will be based both on the causal assumptions in $M$ and on the additional assumption that $z$ is a variable satisfying the IV requirement. The difference is that in (5), by combining $h$ and $g$, we get $\hat{\mathbb{E}}[y \mid d, v]$; here we get $\hat{\mathbb{E}}\left[\left. y^d \,\right| v\right]$. More precisely, Bayesian model averaging is appropriate for $\mathcal{M}$-closed settings rather than $\mathcal{M}$-complete or $\mathcal{M}$-open settings. Following the definitions of Bernardo and Smith (2009), given a list of candidate models, the $\mathcal{M}$-closed setting is the one in which the true model is in the list.
In the $\mathcal{M}$-complete setting, the true model can be specified, but for tractability of computation or other reasons is not included in the model list. The $\mathcal{M}$-open setting refers to the situation in which we know the true model is not in the list and have no idea what it looks like.
In our presentation so far, we have also assumed that we have access to a representative sample drawn from the population of interest, i.e. the source domain is the same as the target domain. In practice, however, this is often not the case. In particular, we are often interested in making inference on populations that are much larger than the population from which we draw our sample, i.e. we care about the external validity or out-of-domain performance of our estimators. The DRSS, however, assures only in-domain consistency if one of its candidate models is correctly specified. In general, no similar guarantees on out-of-domain consistency can be obtained without further assumptions.

If our goal is not to achieve consistency on a target population, but rather to improve predictive accuracy as much as possible, then note that simply averaging a statistical model that fits well in-domain with an approximately correct structural model could improve the in-domain fit of the latter and the out-of-domain fit of the former. This observation applies to the DRSS as well, as it is also a weighted average method. The weights of the DRSS, however, are not constructed to optimize a performance criterion. This brings us to the ensemble estimators that we introduce in the next section, which are explicitly constructed to do so. As we will see, even though the criteria are evaluated on observed data, the ensemble estimators often produce superior in-domain and out-of-domain results relative to both of their candidate models and the DRSS approach, especially when both individual models are misspecified.

Given variables $(x, y) \in \mathcal{X} \times \mathbb{R}$, again assume that our goal is to learn the conditional expectation function $\mu(x) = \mathbb{E}[y \mid x]$ and that we have at our disposal two parametric models $h(x; \theta_h)$ and $g(x; \theta_g)$. Let $\hat{h}(x) := h\left(x; \hat{\theta}_h\right)$ and $\hat{g}(x) := g\left(x; \hat{\theta}_g\right)$ be their fitted values on the observed sample.
The linear ensemble, ESS-LN, combines the two linearly to form an estimate of $\mu(x)$:

This can be readily seen by considering two models that produce the same fit in-domain but behave completely differently out-of-domain. Without further assumptions, there is no way to tell them apart using observed data.

$$\hat{\mu}(x) = w_0 + w_1 \hat{h}(x) + w_2 \hat{g}(x) \quad (11)$$

To choose the optimal weights $w = (w_0, w_1, w_2)$, we can simply run a least squares regression of $y$ on $\hat{h}(x)$ and $\hat{g}(x)$. At the population level, combining models this way never makes things worse (Hastie et al., 2009). In finite samples, however, we need to take into consideration differences in model complexity and avoid carrying over any biases in the first stage estimation of $(\hat{\theta}_h, \hat{\theta}_g)$ into the choice of $w$. To this end, one can use the method of stacking (Wolpert, 1992) and obtain $w$ via leave-one-out cross-validation:

$$\hat{w} = \arg\min_w \left\{ \sum_{i=1}^{n} \left( y_i - w_0 - w_1 \hat{h}_{-i}(x_i) - w_2 \hat{g}_{-i}(x_i) \right)^2 \right\} \quad (12)$$

where $\hat{h}_{-i}(x_i)$ and $\hat{g}_{-i}(x_i)$ are respectively the predictions at $x_i$ using $h$ and $g$ estimated on the training data with the $i$th observation removed. The cross-validated error gives a better approximation of the expected error, allowing an optimal combination. In practice, one can also account for model complexity via the use of sample-splitting or cross-fitting, or use $K$-fold instead of leave-one-out cross-validation.

To adapt the stacking method to combining statistical and structural models, as in the construction of the DRSS estimator, we let $g(x; \theta_M) = \mathbb{E}_M[y \mid x]$ be the implied conditional mean of $y$ according to the structural model $M(x, y; \theta_M)$. We then combine $g(x; \theta_M)$ with the statistical model $h(x; \theta_h)$ according to (11).
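The stacking step in (12) can be sketched as follows; the candidate models are supplied as generic fitting routines (`fit_h` and `fit_g` are illustrative placeholders, each returning a prediction function).

```python
import numpy as np

def stack_weights(x, y, fit_h, fit_g):
    """Stacking weights (w0, w1, w2) via leave-one-out cross-validation, eq. (12).

    fit_h(x_train, y_train) and fit_g(x_train, y_train) each return a
    prediction function; K-fold CV can be substituted for speed.
    """
    n = len(y)
    H, G = np.empty(n), np.empty(n)
    for i in range(n):                              # leave-one-out loop
        keep = np.arange(n) != i
        H[i] = fit_h(x[keep], y[keep])(x[i])        # h_{-i}(x_i)
        G[i] = fit_g(x[keep], y[keep])(x[i])        # g_{-i}(x_i)
    # least squares regression of y on (1, h_{-i}(x_i), g_{-i}(x_i))
    X = np.column_stack([np.ones(n), H, G])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w                     # unrestricted weights, as in Wolpert (1992)

def ess_ln_predict(w, h_hat, g_hat):
    """ESS-LN prediction, eq. (11), using full-sample fits h_hat(x), g_hat(x)."""
    return w[0] + w[1] * h_hat + w[2] * g_hat
```

For instance, stacking a correctly specified quadratic model with a misspecified linear model places weight near one on the former.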
With regard to the choice of $w$, in Wolpert (1992) no restrictions are placed on the weights, and $\hat{w}$ is given by a least squares regression of $y_i$ on $\hat{h}_{-i}(x_i)$ and $\hat{g}_{-i}(x_i)$. Hansen and Racine (2012) proved the asymptotic optimality of stacking for linear models under a model averaging constraint that $w_0 = 0$, $w_1, w_2 \geq 0$, $w_1 + w_2 = 1$. Ando and Li (2017) proved asymptotic optimality for generalized linear models with the weight restrictions relaxed to $w_0 = 0$, $w_1, w_2 \in [0, 1]$. In this paper, we follow the original stacking method and do not place restrictions on $w$.

The stacking method as proposed by Wolpert (1992) is therefore a general model combination or ensemble method rather than a model averaging method. In particular, both Hansen and Racine (2012) and Ando and Li (2017) assumed individual (generalized) linear models with intercept terms, so that their prediction errors have mean zero. In our case, we do not require misspecified structural models to generate predictions of $y$ that have mean-zero error. We thus need an additional intercept term $w_0$.
We now discuss the use of ESS-LN for causal effect estimation. As discussed in Section 2.1, given treatment variable $d$, outcome variable $y$, and control variables $v$, the task of estimating the conditional ATE under unconfoundedness is equivalent to the task of estimating the conditional expectation $\mathbb{E}[y \mid d, v]$. Procedurally, the causal inference problem is thus the same as the statistical prediction problem in this case.

In general, however, without assuming unconfoundedness, our goal is to produce an estimate of $\mathbb{E}\left[\left. y^d \,\right| v\right]$ based on a reduced-form model $\hat{h}(x) = h\left(x; \hat{\theta}_h\right)$ and a structurally-derived model $\hat{g}(x) = g\left(x; \hat{\theta}_M\right)$:

$$\mathbb{E}\left[\left. y^d \,\right| v\right] = w_0 + w_1 \hat{h}(x) + w_2 \hat{g}(x), \qquad x = (d, v) \quad (13)$$

from which we can obtain $\hat{\tau}(d, v) = \partial\, \hat{\mathbb{E}}\left[\left. y^d \,\right| v\right] / \partial d$.

When $d$ is endogenous – when there is unmeasured confounding – if we observe a variable $z$ that can serve as a valid instrument for $d$, then we can specify the following $\ell \times 1$, $\ell \geq 3$, moment conditions:

$$\mathbb{E}\left[\phi(z)\left(y - \left(w_0 + w_1 \hat{h}(x) + w_2 \hat{g}(x)\right)\right)\right] = 0 \quad (14)$$

where $\phi(z)$ is any function of $z$ and $w = (w_0, w_1, w_2)$ are the true values of the weights. Let $\psi(x, y, z; w) := \phi(z)\left(y - \left(w_0 + w_1 \hat{h}(x) + w_2 \hat{g}(x)\right)\right)$ and let $\bar{\psi}(w) := \frac{1}{n}\sum_{i=1}^{n} \psi(x_i, y_i, z_i; w)$.

Technically, the conditional ATE is $\tau(d, v) = \partial\, \mathbb{E}[y \mid d, v] / \partial d$ under unconfoundedness. When the unconfoundedness condition does not hold, a number of reduced-form strategies are often employed to identify causal effects. In addition to the use of instrumental variables, which we detail below, these methods include difference-in-differences (DID) and regression discontinuity (RD).
Statistically, both DID and RD can be cast as conditional mean estimation problems given specific designs and thus can be combined with their structurally-derived counterparts using the ensemble method we have described. We note that in current practice, the goal of causal inference is typically to produce an unbiased estimate of the treatment effect, while in predictive modeling the goal is often to minimize an expected $L_2$ loss. However, whether causal effect estimation should aim for unbiasedness or precision remains an unsettled question. Importantly, in the case of ensemble estimators, even if the ensemble model estimates causal effects based on the unconfoundedness assumption, the structural model in the ensemble does not have to support the assumption. Whatever causal assumptions are made by the structural model, we use its derived functional form for $\mathbb{E}[y \mid d, v]$ as an input into the ensemble. Thus, the final ensemble estimate is still based on the unconfoundedness assumption. If this assumption holds true but is unsupported by a member model in the ensemble, then that model is simply misspecified. This assumes that (13) is the true model. The structural model $M$ from which $\hat{g}(x)$ is derived does not have to contain $z$, and if it does, $z$ does not need to satisfy the IV requirement in the causal structure assumed by $M$; see footnote 41.

Let $Q(w) := \bar{\psi}(w)' \,\Omega\, \bar{\psi}(w)$, where $\Omega$ is an $\ell \times \ell$ positive definite weight matrix. The optimal $w$ can then be obtained by minimizing the GMM objective function:

$$\hat{w} = \arg\min_w Q(w) \quad (15)$$

In practice, as in the case of conditional mean modeling, given a finite sample, we want to account for model complexity and avoid carrying any bias in the first stage estimation of $\hat{h}$ and $\hat{g}$ into the determination of $w$. This can be accomplished by using the strategies of either sample-splitting, cross-validation, or cross-fitting. The ESS-LN is a linear ensemble.
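Because the moment conditions (14) are linear in $w$, the GMM problem (15) has a closed-form solution. A minimal sketch, with the illustrative instrument set $\phi(z) = (1, z, z^2)$ (so $\ell = 3$) and an identity weight matrix:

```python
import numpy as np

def ess_ln_iv_weights(y, h_hat, g_hat, z, Omega=None):
    """GMM estimate of w = (w0, w1, w2) from the moment conditions (14).

    Uses phi(z) = (1, z, z^2) as an illustrative instrument set.  Since the
    moments are linear in w, the minimizer of Q(w) in (15) has a closed form
    and no numerical optimizer is needed.
    """
    X = np.column_stack([np.ones_like(y), h_hat, g_hat])
    Phi = np.column_stack([np.ones_like(z), z, z ** 2])   # phi(z)
    n = len(y)
    A = Phi.T @ X / n          # Jacobian of the sample moment psi_bar(w)
    b = Phi.T @ y / n
    if Omega is None:
        Omega = np.eye(Phi.shape[1])                      # identity weight matrix
    # w_hat = argmin (b - A w)' Omega (b - A w)
    return np.linalg.solve(A.T @ Omega @ A, A.T @ Omega @ b)
```

In the simulated check used to test this sketch, the treatment is endogenous (it shares an error component with the outcome), so an uninstrumented least squares regression of $y$ on the model predictions would be biased, while the instrumented weights recover the truth.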
Our ESS-NP estimator goes one step further and allows nonlinear combinations of the individual models. In conditional mean estimation, let

$$\hat{\mu}(x) = f\left(\hat{h}(x), \hat{g}(x); w\right) \quad (16)$$

where $f(\cdot, \cdot)$ is any function. Statistically, this amounts to regressing the outcome $y$ nonparametrically on the predictions obtained from the individual models $h$ and $g$.

While a large class of nonparametric models can be used for $f$, in this paper we adopt the random forest model of Breiman (2001). The random forest is based on decision tree models. A decision tree is constructed by repeatedly splitting or partitioning the predictor space into different regions in order to maximize fit. In each region, a constant model is fit, so that the predicted value is simply the mean of the observed outcomes in that region. Thus, in its simplest form, with a predetermined number of splits (such as in the case of a stump), a decision tree is a piecewise-constant model. When splits are adaptively chosen to minimize prediction error, the decision tree becomes a nonparametric model whose complexity grows with the data. It is related to kernels and nearest-neighbor methods in that its predictions are based on the values of neighboring observations, except that it chooses the neighborhoods (regions) in a data-driven way (Athey et al., 2019).

e.g. the efficient GMM weight of Hansen (1982).
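To fix ideas, here is a toy CART-style regression tree grown on the two model predictions. The ESS-NP uses a full random forest; this stripped-down single tree is only meant to illustrate the piecewise-constant partitioning of the $(\hat{h}, \hat{g})$ space.

```python
import numpy as np

def fit_tree(F, y, depth=2, min_leaf=5):
    """Tiny regression tree on the feature matrix F = [h_hat, g_hat].

    Greedily picks the split minimizing the sum of squared errors; each leaf
    predicts the mean outcome of its region, as in the simplest CART.
    """
    node = {"value": float(y.mean())}
    if depth == 0 or len(y) < 2 * min_leaf:
        return node
    best = None
    for j in range(F.shape[1]):                    # candidate split variables
        for s in np.unique(F[:, j])[1:]:           # candidate split points
            left = F[:, j] < s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s, left)
    if best is None:
        return node
    _, j, s, left = best
    node.update(var=j, split=s,
                left=fit_tree(F[left], y[left], depth - 1, min_leaf),
                right=fit_tree(F[~left], y[~left], depth - 1, min_leaf))
    return node

def predict_tree(node, f):
    """Follow the splits down to a leaf and return its mean."""
    while "var" in node:
        node = node["left"] if f[node["var"]] < node["split"] else node["right"]
    return node["value"]
```

A forest would average many such trees grown on bootstrap samples, with random split selection to de-correlate them.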
In contrast to conventional trees, in the ESS-NP the predictor space is formed by $\hat{h}(x)$ and $\hat{g}(x)$ – the predictions obtained from the statistical model $h$ and the structurally-derived model $g$. A tree constructed out of $\hat{h}(x)$ and $\hat{g}(x)$ carves up the space formed by $\hat{h}(x)$ and $\hat{g}(x)$, which in turn implies a partition of the underlying input space $x$. The ESS-NP can therefore be viewed as allowing us to adaptively assign different weights to different regions of the input space, depending on which model – the statistical or the structural – performs better.

While decision trees are powerful tools for capturing nonlinear relations and complex interactions, they tend to suffer from high variance and instability. Random forests improve upon decision trees by building and combining a large number of trees through bootstrap aggregation, thereby reducing variance and increasing predictive accuracy. Additional randomness can be introduced to further de-correlate individual trees via random split selection, which restricts the variables available for consideration at each split. In the ESS-NP estimator (16), $f$ is therefore based on the random forest model.

The conditional mean ESS-NP estimator can be used for prediction and for causal effect estimation under unconfoundedness. When there is unmeasured confounding, as in the case of the ESS-LN, it is conceptually possible to adapt the ESS-NP to perform instrumental variables estimation based on the following conditional moment restrictions:

$$\mathbb{E}\left[\left. y - f\left(\hat{h}(x), \hat{g}(x); w\right) \,\right| z\right] = 0 \quad (17)$$

where $f(\cdot, \cdot)$ is again any function. The type of nonparametric IV regression defined by (17), however, is known to suffer from poor statistical performance due to the ill-posed inverse problem (Newey, 2013). Applying the random forest method to this task is also not straightforward.
Therefore, in this paper, we do not propose an ESS-NP method for IV estimation.

The random forest is an ensemble of individual trees. In our ESS-NP estimator, each tree is in turn an ensemble of $h$ and $g$. The ESS-NP is therefore an "ensemble of ensembles". See Loh (2014); Biau and Scornet (2016) for overviews of decision trees and forest-based methods. Consistency results on random forests are obtained in Biau (2012); Scornet et al. (2015); Scornet (2016). The estimator can also be used to combine structural models with reduced-form models based on statistical designs such as DID and RD when there is unmeasured confounding. Methods for estimating heterogeneous causal effects with semiparametric IV regression based on random forests have recently been proposed in Athey et al. (2019).

Experiments
In this section, we demonstrate the effectiveness of our methods and compare their finite-sample performance using three sets of simulated experiments. Taken together, these exercises cover prediction and causal inference problems, static and dynamic settings, and individual behavior that deviates in various ways from perfect rationality.
A First-Price Auction
In our first experiment, we consider first-price sealed-bid auctions. Auctions are among the most important market allocation mechanisms. Empirical analysis of auction data has been transformed in recent years by structural estimation of auction models based on games of incomplete information. Structural analysis of auction data views the observed bids as equilibrium outcomes and attempts to recover the distribution of bidders' private values by estimating relationships derived directly from equilibrium bid functions. This approach, while offering a tight integration of theory and observations, relies on a set of strong assumptions on the information structure and rationality of bidders (Bajari and Hortacsu, 2005).

In this exercise, we conduct three experiments by simulating auction data with varying numbers of participants under three scenarios. The first scenario features rational bidders with independent private values drawn from a uniform distribution. The second scenario features rational bidders whose values are drawn from a beta distribution. The third scenario features boundedly-rational bidders whose bids deviate from the optimal bidding strategies. In each experiment, we are interested in the effect of the number of bidders $n$ on the winning bid $b^*$, i.e. $\mathbb{E}[b^* \mid n]$. We estimate this target function using (a) a statistical model, (b) a structural model, (c) the DRSS estimator, and (d) the ESS estimators (ESS-LN, ESS-NP), and compare their performances. For all experiments, we use a structural model that assumes rational bidders with uniformly distributed values. The model is thus correctly specified for experiment 1, but is misspecified in experiments 2 and 3. Table 1 summarizes this setup.

See Paarsch and Hong (2006); Athey and Haile (2007); Hickman et al. (2012); Perrigne and Vuong (2019) for surveys on econometric analysis of auction data.

Table 1 a

Experiment   True Mechanism                                           Structural Model                              Statistical Model
1            $v_i \overset{i.i.d.}{\sim} U(0,1)$, $b_i = b(v_i)$              $v_i \overset{i.i.d.}{\sim} U(0,1)$, $b_i = b(v_i)$   see (21)
2            $v_i \overset{i.i.d.}{\sim} Beta(2, \cdot)$, $b_i = b(v_i)$          $v_i \overset{i.i.d.}{\sim} U(0,1)$, $b_i = b(v_i)$   see (21)
3            $v_i \overset{i.i.d.}{\sim} U(0,1)$, $b_i = \eta_i \cdot b(v_i)$     $v_i \overset{i.i.d.}{\sim} U(0,1)$, $b_i = b(v_i)$   see (21)

a $b(v_i)$ is the equilibrium bid function (19). $\eta_i \overset{i.i.d.}{\sim} TN(0, \cdot, 1, \infty)$.

Below, we detail the data-generating models of the three experiments.

Setup
Consider a first-price sealed-bid auction with $n$ risk-neutral bidders with independent private values $v_i \overset{i.i.d.}{\sim} F(v)$. Each bidder submits a bid $b_i$ to maximize her expected return

$$\pi_i = (v_i - b_i) \times \Pr\left(b_i > \max\{b_{-i}\}\right) \quad (18)$$

where $b_{-i}$ denotes the other submitted bids. In Bayesian-Nash equilibrium, each bidder's bidding strategy is given by

$$b(v) = v - \frac{1}{F(v)^{n-1}} \int_0^{v} F(x)^{n-1}\, dx \quad (19)$$

For experiments 1 and 3, we let $F$ be $U(0,1)$. In this case the equilibrium bid function simplifies to:

$$b(v) = \frac{n-1}{n}\, v \quad (20)$$

For experiment 2, we let $F$ be $Beta(2, \cdot)$. In each experiment, we simulate repeated auctions with a varying number of bidders. For experiments 1 and 2, the observed bids $b_i$ are the equilibrium outcomes, i.e. $b_i = b(v_i)$. For experiment 3, we let $b_i = \eta_i \cdot b(v_i)$, where $\eta_i$ follows a normal distribution left-truncated at 1, $\eta_i \overset{i.i.d.}{\sim} TN(0, \cdot, 1, \infty)$. Bidders in experiment 3 thus "overbid" relative to the Bayesian-Nash equilibrium.

Simulation
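The data-generating processes just described can be sketched in code. The truncated-normal parameters for the overbidding shock below are illustrative placeholders (a standard normal left-truncated at 1, drawn by rejection sampling).

```python
import numpy as np

def simulate_auctions(n_bidders, M, rng, overbid=False):
    """Simulate M first-price auctions with v_i ~ U(0,1) (experiments 1 and 3).

    Under U(0,1), the Bayesian-Nash bid (20) is b(v) = (n-1)/n * v.  With
    overbid=True, bids are scaled by a shock eta >= 1, as in experiment 3.
    Returns the winning bid b* of each auction.
    """
    winning = np.empty(M)
    for m in range(M):
        v = rng.uniform(0.0, 1.0, n_bidders)            # private values
        b = (n_bidders - 1) / n_bidders * v             # equilibrium bids, eq. (20)
        if overbid:
            eta = np.empty(n_bidders)
            for i in range(n_bidders):
                draw = rng.normal()                     # N(0,1) left-truncated at 1,
                while draw < 1.0:                       # via rejection sampling
                    draw = rng.normal()
                eta[i] = draw
            b = eta * b                                 # boundedly-rational overbidding
        winning[m] = b.max()                            # winning bid b*
    return winning
```

Averaging the simulated winning bids over many auctions approximates $\mathbb{E}[b^* \mid n]$; under $U(0,1)$ values this converges to $(n-1)/(n+1)$, the structural prediction in (22).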
For each experiment, we simulate $M = 500$ auctions with the number of bidders $n_m$ varying between 5 and 25. The observed data thus consist of $D = \left\{\{b_{mi}\}_{i=1}^{n_m}\right\}_{m=1}^{M}$. In each experiment, our target is $\mathbb{E}[b^* \mid n]$, the relationship between the number of bidders and the winning bid. To assess the performance of the various estimators, we use the true data-generating models to compute $\mathbb{E}[b^* \mid n]$ over a grid of $n$ extending beyond the observed domain, so that we can compare the predictions of each method with the true values both in-domain and out-of-domain.

Assuming the same object is being repeatedly auctioned.

Statistical Model
To estimate $\mathbb{E}[b^* \mid n]$ using a statistical model, the data we need are $\{(n_m, b^*_m)\}_{m=1}^{M}$, where $b^*_m$ is the winning bid of auction $m$. We adopt the following second-degree polynomial as the model for $\mathbb{E}[b^* \mid n]$:

$$b^*_m = \beta_0 + \beta_1 n_m + \beta_2 n_m^2 + e_m \quad (21)$$

Structural Model
Our structural model assumes that bidders are rational, risk-neutral, and have independent private values drawn from a $U(0,1)$ distribution. Under these assumptions, the bidders' private values can be easily identified from the observed bids in each auction by $v_i = \frac{n}{n-1} b_i$. The structural model makes it even easier to make predictions on the winning bid. The model implies that:

$$\mathbb{E}[b^* \mid n] = \frac{n-1}{n+1} \quad (22)$$

No estimation is necessary.

Results
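Before turning to the results, the two structural formulas, the value-recovery map $v_i = \frac{n}{n-1} b_i$ and the prediction (22), can be checked by direct Monte Carlo simulation.

```python
import numpy as np

def structural_prediction(n):
    """Model-implied expected winning bid, eq. (22): E[b*|n] = (n-1)/(n+1)."""
    return (n - 1) / (n + 1)

def recover_values(bids, n):
    """Invert the U(0,1) equilibrium bid function (20): v_i = n/(n-1) * b_i."""
    return n / (n - 1) * bids
```

Simulating many $U(0,1)$ auctions, the average winning bid converges to `structural_prediction(n)`, and `recover_values` maps equilibrium bids back to the private values exactly.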
Figures 2a and 2b show the results of the first experiment. In Figure 2a, we plot the number of participants $n$ against the winning bid $b^*$, the true relationship $\mathbb{E}[b^* \mid n]$, and the predictions obtained from five models: statistical, structural, DRSS, ESS-LN, and ESS-NP. Since the structural model is the true model in this experiment, it predicts the true relationship exactly.

Since $n$ is exogenous, $\mathbb{E}[b^* \mid n]$ is also a causal relationship, and (21) can also be thought of as a reduced-form model of the effect of the number of bidders on the winning bid. In general, if we do not impose the assumption that $v_i \overset{i.i.d.}{\sim} U(0,1)$ and instead assume $v_i \overset{i.i.d.}{\sim} F(v)$ with $F$ unknown, then we can identify and estimate $v_i$ using the following strategy based on Guerre et al. (2000): let $G(b)$ and $g(b)$ be the distribution and density of the bids. (19) implies

$$v_i = b_i + \frac{1}{n-1} \frac{G(b_i)}{g(b_i)}$$

Thus, by nonparametrically estimating $G(b)$ and $g(b)$ from the observed bids, we can obtain an estimate of $v_i$.
Figure 2: First-price Auction - The relationship between the number of bidders and the winning bid. Panels: (a) Experiment 1: Training Sample; (b) Experiment 1: Test Sample; (c) Experiment 2: Training Sample; (d) Experiment 2: Test Sample; (e) Experiment 3: Training Sample; (f) Experiment 3: Test Sample. Each panel plots the true relationship together with the DRSS, ESS-LN, ESS-NP, statistical, and structural fits.

Table 2 a

              In-Domain                     Out-of-Domain
              MSE      Bias      Var        MSE       Bias      Var
Experiment 1
Structural      0.00      0.00     0.00       0.00      0.00      0.00
Statistical     1.27     86.36     0.29     871.37   2320.12     31.56
DRSS            0.17     20.72     0.11     126.83    566.38     77.66
ESS-LN          0.38     41.70     0.36     123.18    730.98    115.23
ESS-NP          1.89    104.09     1.88     123.69    965.51      2.08
Experiment 2
Structural   1311.47   3617.50     0.00    1252.17   3537.75      0.00
Statistical     0.47     53.58     0.21     326.01   1392.57     28.52
DRSS            0.47     53.41     0.21     324.93   1389.32     28.48
ESS-LN          0.37     46.91     0.24     138.98    908.81     20.92
ESS-NP          1.37     91.36     1.32      98.16    836.85      3.86
Experiment 3
Structural    214.41   1394.45     0.00     602.32   2443.56      0.00
Statistical     3.70    144.71     1.66    1245.99   2630.90    156.97
DRSS            3.63    143.92     1.69    1227.77   2624.21    151.48
ESS-LN          2.88    130.51     1.98     460.74   1480.48    132.90
ESS-NP         13.95    290.35    13.16     323.67   1483.44     24.96

a Results are based on 100 simulation trials. All numbers are on the scale of $10^{-\cdot}$. Since the structural model predicts $\mathbb{E}[b^* \mid n] = (n-1)/(n+1)$, its predictions have zero variance and are the true values in experiment 1.

In Figure 2b, we extend the domain from $n \in [5, 25]$ to $n \in [2, 50]$. While the structural predictions still hold true, the statistical fit becomes very bad, as can be expected. Because the structural model is correctly specified while the statistical model is not, the DRSS puts most of the weight on the structural model and closely approximates its performance. The two ensemble estimators, ESS-LN and ESS-NP, are also able to significantly outperform the statistical model out-of-domain. In the first panel of Table 2, we report the bias, variance, and mean squared error of all the estimators over 100 simulation runs. In-domain, compared to the true structural model, the DRSS provides the best fit, followed by the ESS-LN. The statistical and ESS-NP models also fit well.
Out-of-domain, the statistical model has by far the worst performance. The three proposed estimators all have similar MSEs and achieve significant gains in performance over the statistical model. Of the three, the DRSS has the smallest bias. Thus, the DRSS estimator appears to work best in this experiment. This is not surprising, as one of its candidate models is correctly specified, satisfying the condition for DRSS consistency. Figures 2c–
2f show the results of experiments 2 and 3. The results tell a similar story. In both experiments, the structural model is misspecified. In experiment 2, it misspecifies the private value distribution. In experiment 3, it assumes that bidders are rational and that the observed bids are Bayesian-Nash equilibrium outcomes when they are not. As a consequence, in both cases, the structural fit deviates from the true model significantly. The statistical model, as in experiment 1, fits well in-domain but poorly out-of-domain. Since both of its candidate models are misspecified in these experiments, the DRSS does not perform well. As the statistical model has better in-domain fit relative to the misspecified structural model, the DRSS puts the majority of its weight on the statistical model. In comparison,

Given an estimator $f$, let $f^{(r)}(n)$ denote the estimator's prediction of the winning bid in simulation $r$. Then

$$bias(f) = \mathbb{E}_n\left[\mathbb{E}_r\left[\left|f^{(r)}(n) - \mathbb{E}[b^* \mid n]\right|\right]\right]$$
$$var(f) = \mathbb{E}_n\left[\mathbb{E}_r\left[\left(f^{(r)}(n) - \mathbb{E}_r\left[f^{(r)}(n)\right]\right)^2\right]\right]$$
$$mse(f) = \mathbb{E}_n\left[\mathbb{E}_r\left[\left(f^{(r)}(n) - \mathbb{E}[b^* \mid n]\right)^2\right]\right]$$

Reported are their empirical estimates.

the second and third panels of Table 2 report results over 100 simulation runs. In both experiments, the ESS-LN produces the best in-domain fit, while the ESS-NP produces the best out-of-domain fit. Intuitively, the ensemble methods are able to achieve these performance gains due to a complementarity that exists between the statistical and the structural models in these two experiments: the statistical model fits well in-domain, while the structural model, though misspecified, provides useful guidance on the functional form of $\mathbb{E}[b^* \mid n]$ when we extrapolate beyond the observed domain, as evidenced in Figures 2d and 2f.

B Dynamic Entry and Exit
Our second application concerns the modeling and estimation of firm entry and exit dynamics. Structural analysis of dynamic firm behavior based on dynamic discrete choice (DDC) and dynamic game models has been an important part of empirical industrial organization. These dynamic structural models capture the path dependence and forward-looking behavior of agents, but pay the price of imposing strong behavioral and parametric assumptions for tractability and computational convenience.

In this exercise, we focus our attention on the rational expectations assumption, which has been a key building block of dynamic structural models in macro- and microeconomic analyses. The assumption and its variants state that agents have expectations that do not systematically differ from the realized outcomes. Despite having long been criticized as unrealistic, the rational expectations paradigm has remained dominant due to a lack of tractable alternatives and the fact that economists still know precious little about belief formation.

We conduct three experiments in the context of the dynamic entry and exit of firms in competitive markets in non-stationary environments. Our data-generating models are DDC models of entry and exit with entry costs and exogenously evolving economic conditions.

See Aguirregabiria and Mira (2010); Bajari et al. (2013) for surveys on structural estimation of dynamic discrete choice and dynamic game models. More precisely, rational expectations are mathematical expectations based on information and probabilities that are model-consistent (Muth, 1961).
Setup
Consider a market with $N$ firms. In each period, the market structure consists of $n_t$ incumbent firms and $N - n_t$ potential entrants. The profit to operating in the market at time $t$ is $R_t$, which we assume to be exogenous and time-varying. At the beginning of each period, both incumbents and potential entrants observe the current period payoff $R_t$, and each draws an idiosyncratic utility shock $\epsilon_{it}$. Incumbent firms then decide whether to remain in or exit the market by weighing the expected present values of each option, while potential entrants decide whether or not to enter the market, which incurs a one-time entry cost $c$. Specifically, let the entry status of a firm be represented by $\{0, 1\}$. The time-$t$ flow utility of a firm who is in state $j \in \{0, 1\}$ in time $t-1$ and state $k \in \{0, 1\}$ in time $t$ is given by

$$u_{it}^{jk} = \pi_t^{jk} + \epsilon_{it}^k \quad (23)$$

where

$$\pi_t^{jk} = \left(\mu + \alpha \cdot R_t - c \cdot I(j = 0)\right) \cdot I(k = 1) \quad (24)$$

is the deterministic payoff function and $\epsilon_{it} = (\epsilon_{it}^0, \epsilon_{it}^1)$ are idiosyncratic shocks, which we assume are i.i.d. type-I extreme value distributed.
The parameter $\alpha$ measures the importance of operating profits to entry-exit decisions relative to the idiosyncratic utility shocks. The ex-ante value function of a firm at the beginning of a period is given by

$$V_t^j(\epsilon_{it}) = \max_{k \in \{0,1\}} \left\{ \pi_t^{jk} + \epsilon_{it}^k + \beta \cdot \mathbb{E}_t\left[V_{t+1}^k\right] \right\} \quad (25)$$
$$= \max_{k \in \{0,1\}} \left\{ V_t^{jk} + \epsilon_{it}^k \right\} \quad (26)$$

where $j$ is the firm's state in $t-1$, $\beta$ is the discount factor, $V_t^j := \mathbb{E}_\epsilon\left[V_t^j(\epsilon_{it})\right]$ is the expected value integrated over idiosyncratic shocks, and $V_t^{jk} := \pi_t^{jk} + \beta \cdot \mathbb{E}_t\left[V_{t+1}^k\right]$ is the choice-specific conditional value function.

At the beginning of each period, after idiosyncratic shocks are realized, each firm thus chooses its action $a_{it} \in \{0, 1\}$ by solving the following problem:

$$a_{it} = \arg\max_{k \in \{0,1\}} \left\{ V_t^{jk} + \epsilon_{it}^k \right\} \quad (27)$$

which gives rise to the conditional choice probability (CCP) function:

$$p_t(k \mid j) := \Pr\left(a_{it} = k \mid a_{i,t-1} = j\right) = \frac{e^{V_t^{jk}}}{\sum_{\ell=0}^{1} e^{V_t^{j\ell}}} \quad (28)$$

which follows from the extreme value distribution assumption.

Since the value function involves the continuation values $\mathbb{E}_t\left[V_{t+1}^k\right]$, which require expectations of the future profits $(R_{t+1}, R_{t+2}, \ldots)$, its solution requires us to specify how such expectations are formed. In experiment 1, we assume firms have perfect foresight on $R_t$. This is a stronger form of rational expectations that assumes individuals know the future realized values. Firms can then compute $V_t^j = \mathbb{E}_\epsilon\left[V_t^j(\epsilon_{it})\right]$, $j \in \{0, 1\}$, in a model-consistent way, i.e. based on the distributional assumption on $\epsilon_{it}$. In experiment 2, we assume firms have a form of adaptive expectations, according to which beliefs about the future are formed based on past values. Here, for simplicity, we assume that firms expect future profits to always be the same as in the current period, i.e.
$R_t = R_{t+1} = R_{t+2} = \cdots$.

Figure 3: Dynamic Entry and Exit – Exogenous Operating Profit.

Finally, in experiment 3, we allow firms to be myopic, so that they do not care about the future and only maximize current payoffs.
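To make the recursion in (25)–(28) concrete, the following sketch (our own illustration with hypothetical parameter values, not the paper's code) computes the choice-specific values and CCPs by backward induction under perfect foresight, using the log-sum form of the expected value implied by the type-I extreme value assumption and a zero terminal continuation value:

```python
import numpy as np

def solve_ccps(R, mu=1.0, alpha=0.5, c=2.0, beta=0.95):
    """Backward induction for choice-specific values V[t, j, k] and
    CCPs p[t, j, k] = Pr(a_t = k | state j at t-1), as in (25)-(28).

    The flow payoff of k = 1 is mu + alpha*R[t] - c*1{j = 0}; the payoff
    of k = 0 is normalized to 0.  Type-I EV shocks give the log-sum form
    of the expected value (the Euler-Mascheroni constant is omitted: it
    shifts both choices equally and leaves the CCPs unchanged) and the
    logit CCP formula (28).  The terminal continuation value is zero.
    """
    T = len(R)
    V = np.zeros((T, 2, 2))      # choice-specific conditional values
    p = np.zeros((T, 2, 2))      # conditional choice probabilities
    Vbar = np.zeros(2)           # expected value at t+1, one per state
    for t in range(T - 1, -1, -1):
        for j in (0, 1):
            pi1 = mu + alpha * R[t] - c * (j == 0)
            V[t, j, 0] = beta * Vbar[0]          # flow payoff of k=0 is 0
            V[t, j, 1] = pi1 + beta * Vbar[1]
        Vbar = np.log(np.exp(V[t]).sum(axis=1))  # log-sum expected value
        p[t] = np.exp(V[t] - Vbar[:, None])      # eq. (28)
    return V, p

R = np.linspace(1.0, 3.0, 50)    # hypothetical rising profit path
V, p = solve_ccps(R)
assert np.allclose(p.sum(axis=2), 1.0)   # CCPs are proper probabilities
```

Under perfect foresight the entire path R is known, so a single backward pass suffices; under adaptive expectations or myopia, only the continuation-value line would change.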
Simulation
For each experiment, we simulate N = 10,000 firms for T = 1000 periods. The first T_0 = 500 periods are used for training and the last T − T_0 = 500 periods are used to assess the out-of-domain performance of our estimators. The training data thus consist of $\mathcal{D} = \left\{ \{a_{it}\}_{i=1}^{N}, R_t \right\}_{t=1}^{T_0}$. We simulate R_t to follow an autoregressive process with a time trend, so that the environment is non-stationary. Figure 3 shows a realized path of R_t. A different R_t process is chosen for each experiment so that the entry and exit dynamics over the first T_0 periods are significantly different from those over the last T − T_0 periods, allowing us to better distinguish the performance of the estimators. Appendix B.1 reports the parameter values we use as well as other details of the simulation.

Statistical Model
To predict the number of firms operating in the market each period, n_t, based on observed exogenous operating profits, R_t, we adopt the following ARX model:

$n_t = \gamma_0 + \gamma_1 R_t + \rho_1 n_{t-1} + \rho_2 n_{t-2} + e_t$    (29)

Structural Model

We estimate the DDC model given by (23)–(28) assuming rational expectations. Our estimation strategy builds on Arcidiacono and Miller (2011) and estimates an Euler-type equation constructed out of CCPs. Here we sketch the strategy while presenting its details in Appendix B.1. A key to our strategy is the assumption that because agents have rational expectations, their expected continuation values do not deviate systematically from the realized values, i.e. $V_{t+1}^{j} = \mathbb{E}_t[V_{t+1}^{j}] + \xi_t^{j}$, where $\xi_t^{j}$ is a time-t expectational error with $\mathbb{E}(\xi_t^{j}) = 0$. Given this assumption, and since our model has the finite dependence property of Arcidiacono and Miller (2011), the solution to (25) can be written in the form of the following Euler equation:

$\ln \frac{p_t(k \mid j)}{p_t(j \mid j)} = \left( \pi_t^{j,k} - \pi_t^{j,j} + \beta \left( \pi_{t+1}^{k,k} - \pi_{t+1}^{j,k} \right) \right) - \beta \ln \frac{p_{t+1}(k \mid k)}{p_{t+1}(k \mid j)} + \epsilon_t^{j,k}$,    (30)

where $\epsilon_t^{j,k} = \beta \left( \xi_t^{k} - \xi_t^{j} \right)$.

Replacing the CCPs with their sample analogues, i.e. letting $\hat{p}_t(k \mid j)$ be the observed percentage of firms that are in state j in t − 1 and state k in t, we obtain the following estimating equations: for all j ≠ k,

$\ln \frac{\hat{p}_t(k \mid j)}{\hat{p}_t(j \mid j)} + \beta \ln \frac{\hat{p}_{t+1}(k \mid k)}{\hat{p}_{t+1}(k \mid j)} = \begin{cases} \mu + \alpha R_t - (1 - \beta)\,c + e_t^{0}, & (j, k) = (0, 1) \\ -\mu - \alpha R_t + e_t^{1}, & (j, k) = (1, 0) \end{cases}$    (31)

where $e_t = (e_t^{0}, e_t^{1})$ is an error term that captures both the expectational errors in $\epsilon_t^{j,k}$ and the approximation errors in $\hat{p}_t(k \mid j)$.

We assume that the value of the discount factor β is known. Estimating (31) gives us an estimate of the model parameters (µ, α, c).
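As an illustration of how (31) maps into a regression (a hedged sketch with made-up inputs, not the authors' code), stacking the two estimating equations over t yields a linear system in (µ, α, c) that can be solved by least squares given empirical CCPs and the known β:

```python
import numpy as np

def estimate_euler(phat, R, beta=0.95):
    """Least-squares estimation of (mu, alpha, c) from the stacked
    estimating equations (31).  phat[t, j, k] is the observed share of
    firms in state j at t-1 and state k at t; R[t] is operating profit.
    """
    T = len(R)
    rows, y = [], []
    for t in range(T - 1):
        # (j, k) = (0, 1): RHS is mu + alpha*R_t - (1 - beta)*c
        y.append(np.log(phat[t, 0, 1] / phat[t, 0, 0])
                 + beta * np.log(phat[t + 1, 1, 1] / phat[t + 1, 0, 1]))
        rows.append([1.0, R[t], -(1.0 - beta)])
        # (j, k) = (1, 0): RHS is -mu - alpha*R_t
        y.append(np.log(phat[t, 1, 0] / phat[t, 1, 1])
                 + beta * np.log(phat[t + 1, 0, 0] / phat[t + 1, 1, 0]))
        rows.append([-1.0, -R[t], 0.0])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    return theta   # (mu_hat, alpha_hat, c_hat)

# Sanity check in the static limit beta = 0, where (31) reduces to a
# logit regression with a known closed form (hypothetical parameters).
R = np.linspace(1.0, 3.0, 40)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
p_entry = sig(1.0 + 0.5 * R - 2.0)   # p_t(1|0) when mu=1, alpha=0.5, c=2
p_stay = sig(1.0 + 0.5 * R)          # p_t(1|1)
phat = np.stack([np.stack([1 - p_entry, p_entry], axis=-1),
                 np.stack([1 - p_stay, p_stay], axis=-1)], axis=1)
mu_hat, alpha_hat, c_hat = estimate_euler(phat, R, beta=0.0)
# recovers (mu, alpha, c) = (1.0, 0.5, 2.0) up to numerical precision
```

With β > 0 and noisy CCP frequencies, the same least-squares step applies; only the inputs change.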
These estimates are consistent for a model that assumes rational expectations. Our structural model is therefore correctly specified for experiment 1, but misspecified in experiments 2 and 3.

Results
See Arcidiacono and Ellickson (2011) for a review of related CCP estimators. For empirical implementations, see, e.g., Artuc et al. (2010); Scott (2014).

Figure 4 shows the results of the first experiment. Figure 4a plots the expected percentage of firms in the market, E[n_t], for the entire period t = 1–1000, including both the in-domain periods of
Figure 4: Dynamic Entry and Exit – Experiment 1. Plotted are the true expected percentage of firms in the market along with model predictions. Training data are not plotted for clarity. In (a), the entire period t = 1–1000 is plotted, covering both the in-domain periods t = 1–500 and the out-of-domain periods t = 501–1000. (b) and (c) zoom in on the in-domain and out-of-domain periods, respectively, to show a more detailed picture.
Figure 5: Dynamic Entry and Exit – Experiment 2. Plotted are the true expected percentage of firms in the market along with model predictions. Training data are not plotted for clarity. In (a), the entire period t = 1–1000 is plotted, covering both the in-domain periods t = 1–500 and the out-of-domain periods t = 501–1000. (b) and (c) zoom in on the in-domain and out-of-domain periods, respectively, to show a more detailed picture.
Figure 6: Dynamic Entry and Exit – Experiment 3. Plotted are the true expected percentage of firms in the market along with model predictions. Training data are not plotted for clarity. In (a), the entire period t = 1–1000 is plotted, covering both the in-domain periods t = 1–500 and the out-of-domain periods t = 501–1000. (b) and (c) zoom in on the in-domain and out-of-domain periods, respectively, to show a more detailed picture.

Table 4: Dynamic Entry and Exit – Bias, Variance, and MSE of the Estimators

                        In-Domain                    Out-of-Domain
                MSE      Bias     Var        MSE      Bias     Var
Panel 1: Experiment 1
  Structural    10.13    133.01   80.50      55.86    562.82   276.50
  Statistical   5.53     160.32   56.79      1620.59  3271.40  13.99
  DRSS          4.12     135.05   57.87      1197.93  2707.97  105.57
  ESS-LN        0.44     38.57    54.49      110.22   631.67   185.57
  ESS-NP        0.12     14.09    53.95      379.99   1254.47  134.72
Panel 2: Experiment 2
  Structural    144.36   376.53   116.18     1199.94  2514.04  605.93
  Statistical   3.22     67.78    7.26       1744.25  2569.35  2.97
  DRSS          4.00     10.11    12.30      1502.54  2350.33  67.57
  ESS-LN        1.45     35.32    7.50       1332.06  2126.37  16.39
  ESS-NP        0.38     74.47    7.55       1146.09  1926.19  48.50
Panel 3: Experiment 3
  Structural    361.75   685.71   196.53     2670.64  4378.27  499.04
  Statistical   1.89     78.72    7.36       890.14   1952.56  3.09
  DRSS          1.88     78.35    8.14       849.67   1891.20  6.35
  ESS-LN        0.99     49.24    6.78       762.69   1689.28  6.56
  ESS-NP        0.24     14.24    6.85       628.23   1470.74  18.64

Note: Results are based on 100 simulation trials. All numbers are on a common scale.

t = 1–500 and the out-of-domain periods of t = 501–1000, together with the predictions of the five estimators. Predictions are made using one-step-ahead forecasting. A closer look at the in-domain and out-of-domain results is presented in Figures 4b and 4c for selected periods.

All estimators fit relatively well in-domain. However, out-of-domain, the time series model is unable to capture the rising number of firms as R_t increases. This is partly by design: as we have discussed, we intentionally choose parameter values so that out-of-domain dynamics differ markedly from those in-domain. A statistical model that fits the in-domain data is unable to extrapolate well in this case. On the other hand, the structural model, which is correctly specified in this experiment, extrapolates very well, as expected. Since one of its candidate models is correctly specified, the DRSS is also expected to perform well. Here, the DRSS model successfully allocates most of its weight to the structural model. However, because some weight is still put on the statistical model, it systematically underestimates the number of firms in out-of-domain periods as well. This is also expected, as the inability to distinguish between competing models based on limited data is what motivates doubly robust and model averaging approaches in the first place. Like the DRSS, the two ensemble estimators are able to largely capture the rising number of firms in out-of-domain periods, offering significantly better predictions than the statistical model.
Out of the two ensemble models, the ESS-LN performs particularly well, matching the true model closely.

In Table 4, Panel 1, we report the bias, variance, and mean squared error of all the estimators with respect to the true E[n_t] over 100 trials. Somewhat surprisingly, the structural model, albeit correctly specified, has the worst in-domain MSE of the five estimators. This is perhaps due to a loss of efficiency associated with our Euler-equation approach to estimating the model (Aguirregabiria and Magesan, 2013). Out of domain, though, it predictably delivers the best performance. Of the remaining four estimators, the ESS-NP produces the best in-domain fit, while the ESS-LN produces the best out-of-domain fit.

Given an estimated model, in each period t, we predict n_t based on {(n_{t−1}, n_{t−2}, ...), (R_t, R_{t−1}, ...)}. To generate predictions for the structural model, we also assume agents have perfect foresight regarding (R_{t+1}, R_{t+2}, ...).

Figure 5 shows the results of the second experiment. In Experiment 2, agents have adaptive expectations: they expect R_{t'} = R_t for all t' > t. Since in our simulations R_t follows a rising trend, this means that agents systematically underestimate future profits. The realized dynamics show that for most of the in-domain periods there are few firms in the market; the number of firms increases significantly during the out-of-domain periods. This marked difference between in-domain and out-of-domain dynamics poses significant challenges. Looking at the model fits, the time series model again fits relatively well in-domain but is completely unable to extrapolate out-of-domain. The structural model, though misspecified, is able to capture the rising entries, but tends to fluctuate more than the true model. This can be explained by the fact that agents in the structural model assume that future profits will be the same as current profits, thus reacting more dramatically to any changes in R_t.
As both the statistical and the structural model are misspecified, the DRSS does not perform well. It puts most of its weight on the statistical model, leading to poor extrapolation performance. The ensemble models, ESS-LN and ESS-NP, are both able to fit well in-domain and to capture part of the rising trend out-of-domain. Compared to the structural model, they tend to underfit rather than overfit the true expected number of firms in out-of-domain periods.

Looking at Panel 2 of Table 4, we see that the ESS-NP achieves the smallest MSE both in-domain and out-of-domain, making it the winner in this experiment. The structural model is a close second in out-of-domain performance but is by far the worst in-domain. Indeed, the DRSS, the ESS-LN, and the ESS-NP all achieve significantly smaller MSEs in-domain. This experiment illustrates a scenario in which the complementarity between the structural and the statistical model is especially pronounced, with the former fitting relatively badly in-domain and the latter completely unable to extrapolate. By combining the two, our ensemble models mainly rely on the former to guide out-of-domain prediction and on the latter to regulate in-domain fit.

Figure 6 shows the results of the third experiment. In this experiment, agents are myopic in that they only care about current-period returns when making entry and exit decisions. The data-generating model is therefore static in nature.
Looking at estimator performance, the story is broadly similar to that of experiment 2, with the difference that, in this experiment, the true model exhibits a less dramatic difference between its in-domain and out-of-domain dynamics, and the misspecified structural model tends to overestimate the number of firms in the market more significantly. As a consequence, according to Panel 3 of Table 4, the structural model is the worst performer both in-domain and out-of-domain in this experiment. On the other hand, both ensemble estimators perform better than the other estimators both in-domain and out-of-domain, with the ESS-NP the clear winner. Thus, as in the auction experiments, our ensemble methods are able to consistently outperform the other estimators when both the structural and the statistical model are misspecified.

Demand Estimation

Table 5: Demand Estimation – Setup

Experiment   True Mechanism                                Reduced-Form     Structural Model
1            linear demand, optimal monopoly pricing       linear demand    linear demand, optimal monopoly pricing
2            linear demand, non-optimal monopoly pricing   linear demand    linear demand, optimal monopoly pricing
3            linear demand, optimal monopoly pricing       log-log demand   linear demand, optimal monopoly pricing
4            linear demand, non-optimal monopoly pricing   log-log demand   linear demand, optimal monopoly pricing
In our final application, we revisit the demand estimation problem under a different setting. Suppose now that, instead of observing consumer demand under exogenously varying prices, the prices we observe are set by a monopolist. In this case, changes in prices are endogenous and the relationship between price and quantity sold is confounded. We are interested in learning the true demand curve. To this end, if we have access to a variable that shifts the cost of production for the monopoly firm but does not affect demand directly, then it can be used as an instrumental variable to help identify the demand curve. This is the reduced-form approach. Alternatively, we can estimate a structural model that fully specifies monopoly pricing behavior. This is the structural approach. Finally, we can combine the two using the DRSS and the ESS-LN.

For instrumental variable estimation, we do not offer an ESS-NP estimator.

In this exercise, we conduct four experiments. In all four experiments, we assume that
we have access to a valid instrument, so that the demand curve is identified. However, the functional form of the reduced-form model may still be misspecified. On the other hand, using the structural approach, we estimate a model that assumes the observed prices are optimally set by a profit-maximizing monopoly firm. When this assumption is violated, for example when the firm's pricing is not optimal or it does not have monopoly power, the structural model will also be misspecified. The four experiments are thus arranged as follows: in the first experiment, both the reduced-form and the structural models are correctly specified. In experiments 2 and 3, only one of the two is correctly specified. In experiment 4, both are misspecified. Table 5 summarizes this setup. For each experiment, we also simulate both a slightly confounded data set, in which the relationship between price and quantity does not deviate too much from the demand curve, and a highly confounded data set, in which they look nothing alike.

In contrast to the previous two exercises, here we focus on comparisons of in-domain performance. We show that when either the reduced-form or the structural model is misspecified, the DRSS and the ESS-LN will have better in-domain performance – more internal validity – than the misspecified model. When both are misspecified, the ESS-LN outperforms them both.
Setup
Consider M geographical markets in which a product is sold. The equilibrium price and quantity sold in market m are (p_m, q_m). Assume that all markets share the same aggregate demand function Q^d(p):

$q_m = Q^d(p_m) = \alpha - \beta \cdot p_m + \epsilon_m$    (32)

In experiments 1 and 3, we assume the product is sold by a monopoly firm that sets the price in each market to maximize its profit. The firm has a different marginal cost c_m for operating in each market. Hence it sets

$p_m = \arg\max_{p > 0} \left\{ (p - c_m) \, Q^d(p) \right\}$    (33)
$\;\;\;\;\, = c_m + \frac{1}{\beta} q_m$    (34)

Assume that we also observe a cost shifter z_m, e.g. transportation costs, such that

$c_m = a + b \cdot z_m$;    (35)

then z_m can serve as an instrument for p_m for identifying the demand curve.

In experiments 2 and 4, we assume the monopoly firm fails to set optimal prices or does not have complete monopoly power. Its pricing decisions are given by

$p_m = c_m + \frac{\lambda}{\beta} q_m$,    (36)

where λ ∈ (0, 1). The firm thus earns a lower markup than an optimally price-setting monopoly.
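To illustrate the confounding in (32)–(36) and how the cost shifter restores identification, the following sketch (our own, with hypothetical parameter values rather than those of Appendix B.2) simulates markets under the pricing rule (36) with λ = 1 (the optimal-pricing case (34)) and compares OLS of q on p with 2SLS using z as the instrument:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical parameter values (not those of Appendix B.2)
M, alpha, beta_d, a, b, lam = 1000, 100.0, 2.0, 10.0, 1.5, 1.0

z = rng.uniform(0.0, 10.0, M)        # cost shifter, eq. (35)
c = a + b * z                        # marginal cost
eps = rng.normal(0.0, 5.0, M)        # demand shock
# equilibrium implied by (32) and (36): q = (alpha - beta*c + eps)/(1 + lam)
q = (alpha - beta_d * c + eps) / (1.0 + lam)
p = c + (lam / beta_d) * q           # pricing rule (36); lam = 1 is optimal

# OLS of q on p is confounded: p is correlated with eps through q
X = np.column_stack([np.ones(M), p])
ols = np.linalg.lstsq(X, q, rcond=None)[0]

# 2SLS: first stage regresses p on z, then q on the fitted prices
Z = np.column_stack([np.ones(M), z])
p_fit = Z @ np.linalg.lstsq(Z, p, rcond=None)[0]
iv = np.linalg.lstsq(np.column_stack([np.ones(M), p_fit]), q, rcond=None)[0]
# iv[1] is close to the true demand slope -beta_d = -2, while ols[1]
# concentrates near -1 under this parameterization
```

The upward bias of the OLS slope reflects the positive feedback from the demand shock to the price through the pricing rule; 2SLS removes it because z shifts c_m but is independent of ε_m.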
For each experiment, we simulate two data sets. Each data set consists of prices, quantities, and cost shifters in M = 1000 markets, i.e. $\mathcal{D} = \{(p_m, q_m, z_m)\}_{m=1}^{M}$. One data set is only slightly confounded, so that E[q_m | p_m] is close to the demand relation (32). The other is highly confounded, so that the two are completely different. See Appendix B.2 for the parameter values we use in the simulation.

Reduced-Form Model
Because p_m is now endogenous – p_m and ε_m are correlated through (34) – the statistical relation between p_m and q_m is confounded and no longer represents the demand function. To estimate the demand curve using the reduced-form approach, we avail ourselves of the instrumental variable z_m and estimate Q^d(p) by two-stage least squares (2SLS). In experiments 1 and 2, our reduced-form model is correctly specified, i.e. we fit (32) to the data by 2SLS. In experiments 3 and 4, however, we assume the demand function takes a log-log form:

$\log q_m = \alpha - \beta \cdot \log p_m + \epsilon_m$,    (37)

and the reduced-form model is therefore misspecified in these two experiments.

Structural Model
We fit a structural model featuring the linear demand function (32) and the price-setting function (34). This structural model is correctly specified for experiments 1 and 3, but misspecified for experiments 2 and 4. The structural parameters are (α, β, a, b) and
can be estimated as follows: from (32) and (34), we obtain

$p_m = a + b \cdot z_m + \frac{1}{\beta} q_m$    (38)

If our model is correct, (38) is a deterministic linear equation system from which we can solve directly for $(\hat{a}, \hat{b}, \hat{\beta})$. Substituting $\hat{\beta}$ into (32), we then obtain $\hat{\alpha} = \frac{1}{M} \sum_{m=1}^{M} \left( q_m + \hat{\beta} p_m \right)$.

Figure 7: Demand Estimation – Slightly Confounded Data. Panels (a)–(d) correspond to Experiments 1–4.

Results
In Figures 7 and 8, we plot the results of the four experiments for the slightly and highly confounded scenarios, respectively. In the latter case, the observed data (p_m, q_m) are so confounded that fitting a least squares model to the data would produce an upward-sloping curve. Regardless of the level of confounding, however, the two groups of plots tell a similar story. When correctly specified, both reduced-form and structural estimation are able to identify the true demand curve (Figures 7a, 8a). When only one of them is correctly specified, the misspecified model produces fits that, while still capturing the downward-sloping nature of the demand curve, can deviate significantly from the true relationship (Figures 7b, 8b, 7c, 8c). In this case, the ESS-LN generally still performs well, while the DRSS fits the demand curve well in Figures 7b and 8b but not in 7c and 8c. Finally, when both the reduced-form and the structural models are misspecified, the ESS-LN becomes the only method that fits the true demand curve well (Figures 7d, 8d).

Table 6 reports the bias, variance, and mean squared error of the estimators with respect to the true demand curve over 100 trials. In both the slightly and highly confounded scenarios, the reduced-form and the structural models exhibit low biases when they are correctly specified. The structural model, by virtue of imposing more structure on the data, attains a lower variance. When misspecified, both types of models exhibit large biases and MSEs. The DRSS is able to outperform the misspecified model in experiments 2 and 3, while the ESS-LN consistently achieves the lowest MSE – often significantly lower than those of the other estimators – regardless of which model, the reduced-form or the structural or even both, is misspecified. Note, however, that for all experiments, the DRSS and the ESS-LN perform better on the slightly confounded data. This is not surprising.
In particular, as Figure 8 reveals, when the data are highly confounded, the structural and the reduced-form models can behave similarly on the observed data even when their predicted demand curves are actually very different due to one or both of them being misspecified, making it difficult for the DRSS method to distinguish between them and for the ESS-LN to leverage their differences in functional form. More confounding thus presents more challenges for our methods to work well.

This is because both use correctly specified models and z is a valid instrument.
Figure 8: Demand Estimation – Highly Confounded Data. Panels (a)–(d) correspond to Experiments 1–4.

Table 6: Demand Estimation – Bias, Variance, and MSE of the Estimators

                  Slightly Confounded           Highly Confounded
                 MSE      Bias    Var       MSE      Bias    Var
Experiment 1
  Structural     0.90     0.75    0.90      1.01     0.81    1.00
  Statistical    2.34     1.16    2.36      14.37    2.70    14.43
  DRSS           1.52     0.94    1.53      6.34     1.71    6.37
  ESS-LN         2.06     1.07    2.05      13.99    2.53    13.93
Experiment 2
  Structural     2394.80  42.04   3.45      767.90   23.80   1.52
  Statistical    1.377    0.898   1.38      17.83    2.93    17.99
  DRSS           1.696    0.987   1.57      143.20   7.12    103.33
  ESS-LN         1.701    0.990   1.72      18.11    2.97    18.29
Experiment 3
  Structural     0.85     0.76    0.86      1.01     0.81    1.00
  Statistical    8062.98  50.29   99.73     329.33   13.60   2.43
  DRSS           36.82    2.21    26.76     141.50   8.47    16.12
  ESS-LN         11.25    1.97    10.96     137.87   7.07    139.20
Experiment 4
  Structural     2394.80  42.40   3.45      767.90   23.80   1.52
  Statistical    1395.50  30.37   10.00     447.78   16.27   3.30
  DRSS           1100.62  25.72   20.94     375.08   14.90   233.92
  ESS-LN         3.55     1.40    3.53      168.19   8.41    169.70

Note: Results are based on 100 simulation trials. All numbers are on a common scale.

Conclusion
In this paper, we propose a set of methods for combining statistical and structural models for improved prediction and causal inference. We demonstrate the effectiveness of our methods in a number of economic applications, including first-price auctions, dynamic models of entry and exit, and demand estimation with instrumental variables. Our methods offer a way to bridge the gap between the (reduced-form) statistical approach and the structural approach in economic analysis and have potentially wide applications in addressing problems for which significant concerns about model misspecification exist.
References
Aguirregabiria, V. and Magesan, A. (2013). Euler equations for the estimation of dynamic discrete choice structural models. Advances in Econometrics, 31:3–44.

Aguirregabiria, V. and Mira, P. (2010). Dynamic discrete choice structural models: A survey. Journal of Econometrics, 156(1):38–67.

Ando, T. and Li, K.-C. (2017). A weight-relaxed model averaging approach for high-dimensional generalized linear models. The Annals of Statistics, 45(6):2654–2679.

Angrist, J. D. and Krueger, A. B. (1995). Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2):225–235.

Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30.

Arcidiacono, P. and Ellickson, P. B. (2011). Practical methods for estimation of dynamic discrete choice models. Annual Review of Economics, 3(1):363–394.

Arcidiacono, P. and Miller, R. A. (2011). Conditional choice probability estimation of dynamic discrete choice models with unobserved heterogeneity. Econometrica, 79(6):1823–1867.

Arkhangelsky, D. and Imbens, G. W. (2019). Double-robust identification for causal panel data models. arXiv preprint arXiv:1909.09412.

Artuc, E., Chaudhuri, S., and McLaren, J. (2010). Trade shocks and labor adjustment: A structural empirical approach. American Economic Review, 100(3):1008–1045.

Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324):483–485.

Athey, S. and Haile, P. A. (2007). Nonparametric approaches to auctions. Handbook of Econometrics, 6:3847–3965.

Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32.

Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2):1148–1178.

Bajari, P., Hong, H., and Nekipelov, D. (2013). Game theory and econometrics: A survey of some recent research. In Advances in Economics and Econometrics, 10th World Congress, volume 3, pages 3–52.

Bajari, P. and Hortacsu, A. (2005). Are structural estimates of auction models reasonable? Evidence from experimental data. Journal of Political Economy, 113(4):703–741.

Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1):151–175.

Benkeser, D., Carone, M., van der Laan, M., and Gilbert, P. B. (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880.

Bernardo, J. M. and Smith, A. F. (2009). Bayesian Theory, volume 405. John Wiley & Sons.

Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13(38):1063–1095.

Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.

Bishop, C. M. and Lasserre, J. (2007). Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8(3):3–24.

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (1996b). Stacked regressions. Machine Learning, 24(1):49–64.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–65.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper.

Chetty, R. (2009). Sufficient statistics for welfare analysis: A bridge between structural and reduced-form methods. Annual Review of Economics, 1(1):451–488.

Chopra, S., Balakrishnan, S., and Gopalan, R. (2013). DLID: Deep learning for domain adaptation by interpolating between domains. In ICML Workshop on Challenges in Representation Learning, volume 2.

Claeskens, G. and Hjort, N. L. (2003). The focused information criterion. Journal of the American Statistical Association, 98(464):900–916.

Clyde, M. and Iversen, E. S. (2013). Bayesian model averaging in the M-open framework. In Bayesian Theory and Applications, pages 483–498. Oxford University Press.

Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2):424–55.

Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23.

Fessler, P. and Kasy, M. (2019). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. The Review of Economics and Statistics, 101(4):681–698.

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156.

Ganin, Y. and Lempitsky, V. (2014). Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach.

Gopalan, R., Li, R., and Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. Pages 999–1006. IEEE.

Guerre, E., Perrigne, I., and Vuong, Q. (2000). Optimal nonparametric estimation of first-price auctions. Econometrica, 68(3):525–574.

Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75(4):1175–1189.

Hansen, B. E. and Racine, J. S. (2012). Jackknife model averaging. Journal of Econometrics, 167(1):38–46.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, pages 1029–1054.

Hansen, L. P. (2015). Method of moments and generalized method of moments. In Wright, J. D., editor, International Encyclopedia of the Social & Behavioral Sciences (Second Edition), pages 294–301. Elsevier, Oxford.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Heckman, J. J. (2000). Causal parameters and policy analysis in economics: A twentieth century retrospective. The Quarterly Journal of Economics, 115(1):45–97.

Heckman, J. J. (2010). Building bridges between structural and program evaluation approaches to evaluating policy. Journal of Economic Literature, 48(2):356–98.

Heckman, J. J. and Vytlacil, E. J. (2007). Econometric evaluation of social programs, part I: Causal models, structural models and econometric policy evaluation. In Heckman, J. J. and Leamer, E. E., editors, Handbook of Econometrics, volume 6, pages 4779–4874. Elsevier.

Hickman, B. R., Hubbard, T. P., and Saglam, Y. (2012). Structural econometric methods in auctions: A guide to the literature. Journal of Econometric Methods, 1(1):67–106.

Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98(464):879–899.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–401.

Huang, J., Gretton, A., Borgwardt, K., Scholkopf, B., and Smola, A. J. (2007). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, pages 601–608. MIT Press.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Jebara, T. (2012). Machine Learning: Discriminative and Generative, volume 755. Springer Science & Business Media.

Jiang, J. and Zhai, C. (2007). Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271.

Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.

Keane, M. P. (2010a). A structural perspective on the experimentalist school. Journal of Economic Perspectives, 24(2):47–58.

Keane, M. P. (2010b). Structural vs. atheoretic approaches to econometrics. Journal of Econometrics, 156(1):3–20.

Kellogg, M., Mogstad, M., Pouliot, G., and Torgovitsky, A. (2020). Combining matching and synthetic controls to trade off biases from extrapolation and interpolation. Technical report, National Bureau of Economic Research.

Kitagawa, T. and Muris, C. (2016). Model averaging in semiparametric estimation of treatment effects. Journal of Econometrics, 193(1):271–289.

Kuang, K., Xiong, R., Cui, P., Athey, S., and Li, B. (2020). Stable prediction with model misspecification and agnostic distribution shift. arXiv:2001.11713.

Lewbel, A., Choi, J.-Y., and Zhou, Z. (2019). General doubly robust identification and estimation. Working paper.

Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, pages 958–975.

Loh, W.-Y. (2014). Fifty years of classification and regression trees. International Statistical Review, 82(3):329–348.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. (2015). Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791.

Low, H. and Meghir, C. (2017). The use of structural models in econometrics. Journal of Economic Perspectives, 31(2):33–58.

Mao, J. and Zheng, Z. (2020). Structural regularization. arXiv:2004.12601.

Minka, T. P. (2000). Bayesian model averaging is not model combination. Pages 1–2.

Moral-Benito, E. (2015). Model averaging in economics: An overview. Journal of Economic Surveys, 29(1):46–75.

Muandet, K., Balduzzi, D., and Scholkopf, B. (2013). Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages I-10–I-18. JMLR.org.

Muth, J. F. (1961). Rational expectations and the theory of price movements.
Econometrica:Journal of the Econometric Society , pages 315–335. Publisher: JSTOR.51evo, A. and Whinston, M. D. (2010). Taking the dogma out of econometrics: Structuralmodeling and credible inference.
Journal of Economic Perspectives , 24(2):69–82.Newey, W. K. (2013). Nonparametric Instrumental Variables Estimation.
American Eco-nomic Review , 103(3):550–556.Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative classifiers: A compar-ison of logistic regression and naive bayes. In
Advances in neural information processingsystems , pages 841–848.Okui, R., Small, D. S., Tan, Z., and Robins, J. M. (2012). Doubly robust instrumentalvariable regression.
Statistica Sinica , pages 173–205. Publisher: JSTOR.Paarsch, H. J. and Hong, H. (2006). An introduction to the structural econometrics ofauction data.
MIT Press Books , 1. Publisher: The MIT Press.Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2010). Domain adaptation via transfercomponent analysis.
IEEE Transactions on Neural Networks , 22(2):199–210.Pan, S. J. and Yang, Q. (2010). A Survey on Transfer Learning.
IEEE Transactions onKnowledge and Data Engineering , 22(10):1345–1359.Pearl, J. (2009).
Causality . Cambridge university press.Perrigne, I. and Vuong, Q. (2019). Econometrics of Auctions and Nonlinear Pricing.
AnnualReview of Economics , 11(1):27–54. _eprint: https://doi.org/10.1146/annurev-economics-080218-025702.Reiss, P. C. and Wolak, F. A. (2007). Structural Econometric Modeling: Rationales andExamples from Industrial Organization. In Heckman, J. J. and Leamer, E. E., editors,
Handbook of Econometrics , volume 6, pages 4277–4415. Elsevier.Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regressionmodels with missing data.
Journal of the American Statistical Association , 90(429):122–129. Publisher: Taylor & Francis. 52obins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficientswhen some regressors are not always observed.
Journal of the American statistical Asso-ciation , 89(427):846–866. Publisher: Taylor & Francis.Rojas-Carulla, M., Scholkopf, B., Turner, R., and Peters, J. (2018). Invariant models forcausal transfer learning.
The Journal of Machine Learning Research , 19(1):1309–1342.Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in obser-vational studies for causal effects.
Biometrika , 70(1):41–55. Publisher: Oxford UniversityPress.Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonran-domized studies.
Journal of educational Psychology , 66(5):688. Publisher: AmericanPsychological Association.Rust, J. (2014). The Limits of Inference with Theory: A Review of Wolpin (2013).
Journalof Economic Literature , 52(3):820–850.Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorabledrop-out using semiparametric nonresponse models.
Journal of the American StatisticalAssociation , 94(448):1096–1120. Publisher: Taylor & Francis Group.Scornet, E. (2016). On the asymptotics of random forests.
Journal of Multivariate Analysis ,146:72–83.Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests.
Annals ofStatistics , 43(4):1716–1741. Publisher: Institute of Mathematical Statistics.Scott, P. (2014). Dynamic discrete choice estimation of agricultural land use. Publisher:TSE Working Paper.Shalizi, C. (2013).
Advanced data analysis from an elementary point of view . CambridgeUniversity Press Cambridge.Steel, M. F. (2019). Model averaging and its use in economics. arXiv preprintarXiv:1709.08221 . 53ugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., and Kawanabe, M. (2008). Di-rect Importance Estimation with Model Selection and Its Application to Covariate ShiftAdaptation. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors,
Advancesin Neural Information Processing Systems 20 , pages 1433–1440. Curran Associates, Inc.Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting.
Biometrika , 97(3):661–682. Publisher: Oxford University Press.Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. (2014). Deep domain confu-sion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 .Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner.
Statisticalapplications in genetics and molecular biology , 6(1). Publisher: De Gruyter.Vermeulen, K. and Vansteelandt, S. (2015). Bias-reduced doubly robust estimation.
Journalof the American Statistical Association , 110(511):1024–1036. Publisher: Taylor & Francis.Wang, M. and Deng, W. (2018). Deep visual domain adaptation: A survey.
Neurocomputing ,312:135–153.Wolpert, D. H. (1992). Stacked generalization.
Neural networks , 5(2):241–259. Publisher:Elsevier.Wolpin, K. I. (2013).
The Limits of Inference without Theory . MIT Press. Google-Books-ID:ueXxCwAAQBAJ.Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Using Stacking to AverageBayesian Predictive Distributions (with Discussion).
Bayesian Analysis , 13(3):917–1007.Publisher: International Society for Bayesian Analysis.Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In
Proceedings of the twenty-first international conference on Machine learning , ICML ’04,page 114, Banff, Alberta, Canada. Association for Computing Machinery.54hang, X., Yu, D., Zou, G., and Liang, H. (2016). Optimal model averaging estimationfor generalized linear models and generalized linear mixed-effects models.