Estimating Production Functions with Partially Latent Inputs
Using Monotonicity Restrictions to Identify Models with Partially Latent Covariates∗

Minji Bang, Wayne Gao, Andrew Postlewaite, and Holger Sieg

University of Pennsylvania

January 18, 2021
Abstract
This paper develops a new method for identifying econometric models with partially latent covariates. Such data structures arise naturally in industrial organization and labor economics settings where data are collected using an "input-based sampling" strategy, e.g., if the sampling unit is one of multiple labor input factors. We show that the latent covariates can be nonparametrically identified if they are functions of a common shock satisfying some plausible monotonicity assumptions. With the latent covariates identified, semiparametric estimation of the outcome equation proceeds within a standard IV framework that accounts for the endogeneity of the covariates. We illustrate the usefulness of our method using two applications. The first focuses on pharmacies: we find that production function differences between chains and independent pharmacies may partially explain the observed transformation of the industry structure. Our second application investigates education achievement functions and illustrates important differences in child investments between married and divorced couples.

Keywords: production functions, latent variables, endogeneity, semiparametric estimation, instrumental variables, matching.

∗ We would like to thank Xu Cheng, Aureo de Paula, Ulrich Doraszelski, Amit Gandhi, Claudia Goldin, Aviv Nevo, Dan Silverman, Petra Todd, and seminar participants at numerous universities for comments and suggestions. Postlewaite and Sieg acknowledge support from the National Science Foundation.

Introduction
This paper develops a new method for identifying econometric models with partially latent covariates. We show that a broad class of econometric models that play a large role in industrial organization and labor economics can be nonparametrically identified if the partially latent covariates satisfy certain monotonicity assumptions. Examples that fall into this class of models are a variety of different production, skill formation, and achievement functions. It is often plausible to assume that the different inputs or explanatory variables are functions of a common unobserved random shock, and we consider models in which it is natural to impose strict monotonicity in this common shock. The monotonicity assumption imposes some strong functional dependencies on the explanatory variables, as pointed out in the context of production function estimation by Ackerberg, Caves, and Frazer (2015). The key insight of this paper is that we can leverage the functional dependence between inputs to achieve identification within a partially latent covariate framework. In that sense, we turn the functional dependence problem on its head to impute the partially latent covariates. Broadly speaking, our imputation is in the spirit of matching algorithms (Rubin, 1973). In contrast to traditional matching algorithms, we propose to match on the expected dependent variable to impute missing covariates.

The partially latent data structure that we study in this paper arises quite naturally in many potential applications of our technique if one employs an "input-based sampling" strategy, i.e., if the sampling unit is one of multiple labor input factors. These types of data sets are becoming more prevalent in modern econometrics since researchers have come to rely on unstructured or semi-structured data sets. Consider, for example, a production team in which team members perform different tasks. Let us assume that the researcher interviews one member from each team to provide the data.
It is plausible that this person knows the team's output, but does not have complete information about the other team members' input choices. Other potential applications in applied microeconomics are discussed in the conclusions.

Note that this monotonicity assumption is commonly used, for example, in the production function literature, as discussed by Olley and Pakes (1996). In particular, this assumption does not require that inputs are "optimally" chosen by competitive firms and is consistent with a broad class of strategic and non-strategic models that may describe the agents' behavior. Note also that we do not apply the matching approach within the standard potential outcome framework of program evaluation, which is based on the potential outcome model developed by Fisher (1935). For a discussion of the properties of matching estimators in that context see, among others, Rosenbaum and Rubin (1983), Heckman, Ichimura, Smith, and Todd (1998), and Abadie and Imbens (2006).

We show that we can combine our identification results with a variety of linear, nonlinear, and semiparametric estimation strategies. In that sense our approach is flexible and allows researchers to make appropriate functional form assumptions if necessary. To illustrate the key issues that are encountered in estimation, we consider the scenario in which researchers only have access to a single cross-section of data and rely on instrumental variables for estimation. For example, production function estimation relies on the assumption that differences in local input prices give rise to differences in input choices that are uncorrelated with productivity shocks at the local level. Similarly, skill formation and achievement function estimation requires the choice of suitable instruments for parental inputs.

Estimation proceeds in two steps. In finite samples, we first nonparametrically estimate the latent input functions. Plugging the estimators into our production, skill formation, or achievement function, we then estimate the outcome equation within a standard IV framework that accounts for the endogeneity of the input choices.

In the context of production function estimation, this endogeneity problem is referred to as the transmission bias problem, since inputs are correlated with unobserved productivity shocks (Marschak and Andrews, 1944). Hence we cannot address this endogeneity problem using panel data with fixed effects, first advocated by Hoch (1955, 1962) and Mundlak (1961, 1963). Nor can we use more sophisticated timing assumptions within control function or IV frameworks as discussed, for example, in Olley and Pakes (1996), Blundell and Bond (1998, 2000), Levinsohn and Petrin (2003), and Ackerberg, Caves, and Frazer (2015). We discuss the extension of our methods to this scenario in the conclusions.

Local input prices can serve as valid instruments for endogenous input choices. See Griliches and Mairesse (1998) for a critical discussion of the assumption that these input prices are exogenous. For a more general discussion of the issues encountered in estimating achievement and skill formation functions see, among others, Todd and Wolpin (2003) and Cunha, Heckman, and Schennach (2010).
Identification of Partially Latent Covariates
Consider the following cross-sectional econometric model

y_i = F(x_{i1}, x_{i2}, u_i) + ε_i,   (1)

where i = 1, ..., N indexes a generic observation from a random sample, y_i denotes an observable scalar-valued outcome variable, and x_i := (x_{i1}, x_{i2}) denotes a two-dimensional vector of covariates. Both u_i and ε_i are scalar-valued unobserved errors, with u_i taken to be a "structural error" that is endogenous with respect to x_i, while ε_i is a "measurement error" that is assumed to be exogenous. The unknown outcome function F may be either parametric or nonparametric.

First, we need to define what we mean by partially latent covariates, a key data structure that we explore in this paper.

Assumption 1 (Partially Latent Covariates). For each observation i, the econometrician either observes x_{i1} or x_{i2}, but never both.

Essentially, one of the two covariates (x_{i1}, x_{i2}) is latent in each observation in the data. In the following, it will be convenient to write

d_i := 1, if x_{i1} is observed and x_{i2} is latent,
d_i := 2, if x_{i2} is observed and x_{i1} is latent,

so that effectively (d_i, (2 − d_i) x_{i1}, (d_i − 1) x_{i2}) is observed for each i. Such data structures often arise when the data is collected at the individual level while we are interested in some firm, household, or team level outcome variable that also depends on other individuals who are not surveyed in the data. These types of unstructured data sets are becoming increasingly prevalent in empirical work, as we discuss in detail below. In this section we provide just one application that we use as the leading example to illustrate the main concepts.

Example (Team Production Functions). Our first application, studied in Section 4, focuses on team production in pharmacies.

See Corollary 1 for the extension of our identification method to settings with covariates of higher dimensions.
For simplicity, let us assume a log-linear Cobb-Douglas specification:

y_i = α_0 + α_1 x_{i1} + α_2 x_{i2} + u_i + ε_i,   (2)

where y_i is the logarithm of the team's output, x_{i1} is the logarithm of hours worked by the first team member (a manager), and x_{i2} is the logarithm of hours worked by the second team member (an employee). The data structure described in Assumption 1 arises if the researcher interviews only one member, and not both members, of the team. We also refer to this technique as an "input-based sampling" approach. It is plausible that the interviewed team member knows the team's output, but does not have complete information about the other team member's input choices. Hence, the surveyed person provides the output level, y_i, and her own hours worked, x_{i1} or x_{i2}, leading to the problem of partially latent inputs as defined in Assumption 1.

The next assumption imposes a monotonicity condition on the outcome function.

Assumption 2 (Monotonicity of the Outcome Function). F is nondecreasing in all of its arguments and is strictly increasing in at least one of its arguments.

This assumption essentially states that the covariates or inputs (x_{i1}, x_{i2}) and the structural error u_i have nonnegative effects on the outcome variable y_i. Moreover, the monotonicity is strict in at least one of the three arguments x_{i1}, x_{i2}, and u_i. The restriction of monotonicity with respect to (x_{i1}, x_{i2}) is substantive: it requires that the covariates cannot negatively affect the outcome variable holding everything else fixed. In contrast, the restriction of monotonicity with respect to u_i is largely innocuous given the interpretation of u_i as a (weakly) "positive shock".

Example (Team Production Functions Continued).
Assumption 2 is satisfied in the linear additive model in equation (2) provided that the model satisfies the additional parameter restriction that α_1, α_2 ≥ 0.

We use the term "team production function" since we largely focus on different types of labor inputs and abstract from capital or other inputs that may be subject to dynamics and adjustment costs. The team production concept is also related to the concept of task production functions, which are surveyed by Acemoglu and Autor (2011). Haanwinckel (2018) estimates a task production function in which each team member specializes in a single task.

We next impose assumptions on the unobserved errors u_i and ε_i in equation (1). First, we assume that the endogenous covariates x_i are strictly monotone functions of the scalar structural error u_i, potentially after conditioning on a set of observed covariates z_i that may affect the covariates x_i.

Assumption 3 (Strict Monotonicity of the Covariates in the Structural Error). There exists a vector of additional observed covariates z_i and two deterministic, real-valued functions h_1, h_2, such that

x_{i1} = h_1(u_i, z_i),   x_{i2} = h_2(u_i, z_i),

with both h_1(u_i, z_i) and h_2(u_i, z_i) strictly increasing in their first argument u_i for every realization of z_i.

We note that the functions h_1 and h_2 can be unknown and nonparametric. Moreover, Assumption 3 does not require z_i to be exogenous; in other words, z_i and u_i are allowed to be statistically dependent. The only requirement here is that, after conditioning on z_i, the covariates x_{i1} and x_{i2} can be written as deterministic monotone functions of the error u_i. Such a "monotonicity-in-a-scalar-error" assumption has been widely used in the econometric literature on identification analysis.

Example (Team Production Functions Continued). In the IO literature, u_i is typically interpreted as a "productivity shock" that enters into the choices of inputs x_i.
In contrast, ε_i captures either a measurement error or a productivity shock that does not affect inputs, since it is not observed by the firms when input choices are made. Assumption 3 requires that the input choice functions are strictly increasing in the "productivity shock" u_i, conditional on any additional observed covariates z_i that may influence input choices, as suggested, for example, by Olley and Pakes (1996) and others. For concreteness, we take z_i to be local wages for managers and employees. The monotonicity of input choices in the unobserved productivity shock can be further micro-founded in a variety of settings based on efficiency or equilibrium criteria. For example, Assumption 3 is automatically satisfied if competitive firms choose inputs optimally: h_1 and h_2 are then characterized by the relevant first-order conditions and have simple closed-form formulas that are linear and increasing in u_i and decreasing in z_i. More generally, one may use the theory of monotone comparative statics to obtain more primitive conditions for input monotonicity, which typically involve various forms of increasing-difference or single-crossing conditions: see, for example, Milgrom and Shannon (1994) and Vives (2000) for formal statements.

See Matzkin (2007) for a general survey, and see Ackerberg, Caves, and Frazer (2015) in the specific context of production function identification, which fits into our working example (2). This is a standard assumption that underlies most, if not all, existing approaches to production function estimation in one way or another: see, for example, Griliches and Mairesse (1998) and Ackerberg, Caves, and Frazer (2015) for reviews of the relevant literature.
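To make the first-order-condition argument in the example concrete, consider a price-taking firm with the Cobb-Douglas technology underlying equation (2). This is an illustrative sketch only: the output price p and the interpretation of z_i as the wage pair (w_1, w_2) are assumptions for the example, together with decreasing returns α_1 + α_2 < 1.

```latex
% Profit maximization of a price-taking firm:
\max_{L_1,L_2}\; p\,e^{u_i}L_1^{\alpha_1}L_2^{\alpha_2} - w_1 L_1 - w_2 L_2 .
% The two first-order conditions are log-linear in (\log L_1, \log L_2);
% solving them gives the manager's log input demand:
x_{i1} = \log L_1
  = \frac{u_i + (1-\alpha_2)\log\left(\alpha_1 p/w_1\right)
        + \alpha_2\log\left(\alpha_2 p/w_2\right)}{1-\alpha_1-\alpha_2},
\qquad
\frac{\partial x_{i1}}{\partial u_i} = \frac{1}{1-\alpha_1-\alpha_2} > 0 .
```

The symmetric expression holds for x_{i2}; both input demands are linear and strictly increasing in u_i and decreasing in log wages, exactly the shape of h_1 and h_2 invoked in the text.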
Essentially, in settings where input choices are made by a single decision maker, such as under perfect competition or monopsony, we would need the marginal values of inputs to be increasing in the productivity shock u_i, which is a mild condition to impose given our interpretation of u_i as a "productivity shock". In settings where the input choices are generated as equilibria of a strategic game between two decision makers, an additional assumption of strategic complementarity is typically sufficient for monotonicity. For games with strategic substitutability, we would further need a condition to ensure that the extent of strategic substitutability is not overwhelming: see Roy and Sabarwal (2010) for general results, and our Appendix D for an example where Assumption 3 is satisfied under strategic substitutability.

Next we formalize the required exogeneity condition on the measurement error ε_i.

Assumption 4 (Exogeneity of the Measurement Error). E[ε_i | x_i, z_i, d_i] = 0.

Note that, under Assumption 3, conditioning on (x_i, z_i, d_i) is equivalent to conditioning on (u_i, z_i, d_i). In the production function estimation literature without the partial latency problem, E[ε_i | u_i, z_i] = 0 is a standard assumption imposed on ε_i. In our current setting, we are requiring that ε_i is furthermore exogenous with respect to the partial latency indicator variable d_i.

It is worth noting that this paper is both conceptually and technically different from previous work on missing data in linear regression and, more generally, GMM estimation settings, such as Rubin (1976), Little (1992), Robins, Rotnitzky, and Zhao (1994), Wooldridge (2007), Graham (2011), Chaudhuri and Guilkey (2016), Abrevaya and Donald (2017), and McDonough and Millimet (2017). This line of literature typically relies on some form of "missing-at-random" assumption. See Appendix A for details.
We note that the problem of partially latent inputs is less relevant in that case, since the "reduced-form" regression of the observed inputs on the exogenous wages w_i will indirectly recover the production function parameters α. This corresponds to the "duality approach" to production function estimation, as discussed in detail in Griliches and Mairesse (1998). However, an attractive feature of our approach is that we can test whether inputs are optimally chosen. If we reject the null hypothesis that inputs are optimal, our estimator is still feasible while duality estimators are not.

In our setting, by contrast, d_i is allowed to be correlated with other observables as well as the unobserved productivity shock. Instead, we will be relying on monotonicity conditions to identify and impute the latent input. Specifically, Assumption 4 simply requires that ε_i is a "measurement error" term that is exogenous with respect to the observables and consequently the "productivity shock" u_i, but it does not impose any restriction on the dependence structure between the partial latency indicator d_i and the other structural components of the model (u_i, x_i, z_i).

However, we do require the following very mild condition on the variable d_i.

Assumption 5 (Nondegenerate Latency Probabilities). 0 < P{d_i = 1 | u_i, z_i} < 1.

Assumption 5 guarantees that, conditioning on realizations of (u_i, z_i), we observe x_{i1} and x_{i2} each with strictly positive probability. Again, this assumption is much weaker than "missing-at-random" assumptions, which would usually require that P{d_i = 1 | u_i, z_i} is constant in u_i, z_i, or some other variables. In contrast, here we do not impose any restrictions on the dependence of P{d_i = 1 | u_i, z_i} on (u_i, z_i) beyond non-degeneracy.

We are now ready to present our main identification result.

Theorem 1.
Under Assumptions 1-5, for each observation i, the latent covariate, x_{i2} if d_i = 1 or x_{i1} if d_i = 2, is point identified.

Next, we provide a detailed explanation of our identification strategy. The starting point is the reduced form of our model with the measurement error term:

y_i = F̄(u_i, z_i) + ε_i,   (3)

where

F̄(u_i, z_i) := F(h_1(u_i, z_i), h_2(u_i, z_i), u_i).   (4)

Clearly, F̄(u_i, z_i) is strictly increasing in u_i given Assumptions 2 and 3.

Consider two firms i and j with z_i = z_j. In the context of our working example, we are effectively considering two firms i and j operating in the same local labor market with the same local wages. For concreteness, suppose that (x_{i1}, x_{j1}) are observed, while (x_{i2}, x_{j2}) are unobserved. Since these firms have the same value of managerial inputs, x_{i1} = x_{j1}, by Assumption 3 it must also be true that they have the same value of the productivity shock:

u_i = h_1^{-1}(x_{i1}; z_i) = h_1^{-1}(x_{j1}; z_j) = u_j,

where h_1^{-1}(·; z_i) is the inverse of h_1(·, z_i), which is well-defined by Assumption 3. This further implies that F̄(u_i, z_i) = F̄(u_j, z_j). Taking an average of y_i and y_j,

(1/2)(y_i + y_j) = F̄(u_i, z_i) + (1/2)(ε_i + ε_j),   (5)

we are essentially averaging out the variations in ε. Intuitively, if we average over outcomes of all observations that share the same x_{i1} and the same z_i, and thus the same value of u_i, then we can identify F̄(u_i, z_i).

Formally, define γ_1(c; z) as the expected output of firm i conditional on the event that x_{i1} is observed (d_i = 1) and takes a given value c, i.e.,

γ_1(c; z) := E[y_i | z_i = z, d_i = 1, x_{i1} = c].   (6)

Clearly, γ_1 is directly identified from the data given Assumptions 1 and 5, and can be nonparametrically estimated later on.
Taking a closer look at γ_1, we have, by equation (3), Assumption 3, and Assumption 4,

γ_1(c; z) = E[F̄(u_i, z_i) + ε_i | z_i = z, d_i = 1, h_1(u_i, z_i) = c]
          = F̄(h_1^{-1}(c; z), z) + E[ε_i | z_i = z, d_i = 1, u_i = h_1^{-1}(c; z)]
          = F(c, h_2(h_1^{-1}(c; z), z), h_1^{-1}(c; z)).   (7)

In fact, we could directly "match" on output y_i if there were no measurement error ε_i in output. Assumption 5 ensures that the conditioning event occurs with strictly positive probability.

In other words, by conditioning on z_i and a particular observed value of x_{i1} = c, we are effectively conditioning on the unobserved productivity shock u_i. Aggregating across observations allows us to average out the measurement errors and obtain a quantity that is implicitly a function of the productivity shock u_i = h_1^{-1}(c; z_i).

Next, we observe that γ_1(c; z) is strictly increasing in c, since

∂/∂c γ_1(c; z) = F_1 + [F_2 · ∂h_2/∂u(h_1^{-1}(c; z), z) + F_3] / [∂h_1/∂u(h_1^{-1}(c; z), z)] > 0,   (8)

because ∂h_1/∂u, ∂h_2/∂u > 0 by Assumption 3, while the partial derivatives F_1, F_2, F_3 of F are all nonnegative with at least one being strictly positive by Assumption 2. Similarly, we can define

γ_2(c; z) := E[y_i | z_i = z, d_i = 2, x_{i2} = c],

which is strictly increasing in c.

Now, the basic idea behind our identification strategy is to conditionally "match" observations on the event that

γ_1(c_1; z) = γ_2(c_2; z)   (9)

for some c_1, c_2, and z.

Example (Team Production Functions Continued). Let us consider production teams within the same local market so that wages (z_i) are constant. Equation (9) then involves two separate conditional expected output levels, one (γ_1) for teams whose manager input (x_{i1}) is observed, and the other (γ_2) for teams whose employee input (x_{i2}) is observed.
When these two expected output levels are equalized as in equation (9), we can infer that the underlying productivity shock (u_i) must be the same across all teams with either x_{i1} = c_1 observed or x_{i2} = c_2 observed. By equations (5) and (7) we know

h_1^{-1}(c_1; z_i) = h_2^{-1}(c_2; z_i) =: u,

so that

x_{i2} = h_2(u, z_i), for d_i = 1,
x_{i1} = h_1(u, z_i), for d_i = 2.

(The partial derivatives F_1, F_2, F_3 of F in equation (8) are evaluated at (c, h_2(h_1^{-1}(c; z), z), h_1^{-1}(c; z)).)

Formally, the latent covariates can be identified via a composition of γ_1, γ_2 and their inverses,

x_{i2} = γ_2^{-1}(γ_1(x_{i1}; z_i); z_i), for d_i = 1,
x_{i1} = γ_1^{-1}(γ_2(x_{i2}; z_i); z_i), for d_i = 2,   (10)

since on the right-hand side x_{i1}, x_{i2} are observed for d_i = 1, 2, respectively, and γ_1, γ_2 are nonparametrically identified functions. This completes the description of our key identification strategy as well as the proof of Theorem 1.

Remark. We have thus far focused on the case with two covariates. It is straightforward to see that our model, assumptions, and the main identification result can be easily generalized to the case with covariates of an arbitrary finite dimension D. This result is summarized by the following Corollary.

Corollary 1.
Consider the model y_i := F(x_{i1}, ..., x_{iD}, u_i) + ε_i along with Assumptions 2 and 4 unchanged, and the following modifications of the other assumptions:

(i) Assumption 1: for each i, at least one of the D covariates is observed.
(ii) Assumption 3: all D covariates are strictly increasing in u_i given z_i.
(iii) Assumption 5: all D covariates are observed with strictly positive probabilities.

Then the latent covariates are identified.

Remark. If Condition (i) in Corollary 1 is strengthened so that more than one covariate is simultaneously observed in a given observation (with positive probability), then we would also obtain over-identification, and the input-monotonicity restriction in Assumption 3 becomes empirically refutable. Alternatively, with two or more covariates simultaneously observed, we would be able to accommodate higher dimensions of unobserved shocks, provided that the dimension of the unobserved shock u_i is strictly smaller than the dimension D of the covariates. Since such an extension would be more involved and move farther away from the applications we consider in this paper, we leave it as a direction for future research.

Identification and Estimation of the Outcome Function

With the latent inputs already identified in Theorem 1, we are back to equation (1),

y_i = F(x_{i1}, x_{i2}, u_i) + ε_i,

but now we can effectively regard both x_{i1} and x_{i2} as being known, at least for identification purposes. Researchers may proceed to identify the output function F under appropriate application-specific assumptions, as in a "standard" setting without the partial latency problem. Hence, the identification of F or other objects of interest is largely "separable" from the partial latency problem, which is the key problem we are solving in this paper.
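As a concrete illustration, the matching formula in equation (10) can be implemented on simulated data. The sketch below is not the authors' code: it assumes the linear specification (2) with hypothetical parameters, a constant z_i (a single local labor market), and a simple Gaussian-kernel regression for γ̂_1, γ̂_2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the linear team-production model (2) with hypothetical parameters,
# holding z_i fixed (a single local labor market).
N = 20_000
a0, a1, a2 = 1.0, 0.4, 0.3
u = rng.uniform(0.0, 1.0, N)          # common productivity shock
x1 = 0.5 + u                          # h_1(u): strictly increasing in u
x2 = 0.2 + 2.0 * u                    # h_2(u): strictly increasing in u
y = a0 + a1 * x1 + a2 * x2 + u + rng.normal(0.0, 0.1, N)
d = rng.integers(1, 3, N)             # d_i = 1: x1 observed; d_i = 2: x2 observed

def gamma_hat(x_obs, y_obs, grid, bw=0.05):
    """Kernel estimate of E[y | x = grid] (Nadaraya-Watson, Gaussian kernel)."""
    w = np.exp(-0.5 * ((grid[:, None] - x_obs[None, :]) / bw) ** 2)
    g = w @ y_obs / w.sum(axis=1)
    return np.maximum.accumulate(g)   # enforce monotonicity so inversion is valid

grid1 = np.linspace(x1.min(), x1.max(), 200)
grid2 = np.linspace(x2.min(), x2.max(), 200)
g1 = gamma_hat(x1[d == 1], y[d == 1], grid1)   # gamma_1(c) = E[y | d=1, x1=c]
g2 = gamma_hat(x2[d == 2], y[d == 2], grid2)   # gamma_2(c) = E[y | d=2, x2=c]

# Equation (10): impute the latent input by matching expected outputs.
# gamma_k is increasing, so its inverse is interpolation with axes swapped.
x2_hat = np.interp(np.interp(x1[d == 1], grid1, g1), g2, grid2)
x1_hat = np.interp(np.interp(x2[d == 2], grid2, g2), g1, grid1)

# In this design the truth is x2 = 2*x1 - 0.8, so the imputation can be checked.
err = np.abs(x2_hat - (2.0 * x1[d == 1] - 0.8))
print(round(float(np.median(err)), 3))
```

Away from the boundary of the support, the imputed x̂_{i2} tracks the true latent input closely; the boundary bias of the kernel regression shows up at the edges, which is one reason the formal results below impose regularity conditions on the first-stage estimator.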
That said, we note that the estimation of the latent covariates will affect the estimation of (the parameters of) F based on the "plugged-in" latent covariate estimates. This section discusses how to identify and estimate F, and analyzes the impact of the "first-stage" estimation of latent inputs on the final estimator of F. While we cannot cover all relevant specifications of F, in this section we provide both identification and estimation results for the linear case, which is arguably the workhorse model, or at least a natural benchmark, in various empirical applications. We also discuss how our method can be applied in more general settings.

The Linear Model

In this subsection we focus on the linear parametric specification of F as in (2):

y_i = α_0 + α_1 x_{i1} + α_2 x_{i2} + u_i + ε_i,

where our goal is to identify and estimate the unknown parameters α := (α_0, α_1, α_2).

Identification

In the presence of the endogeneity problem between x_i := (x_{i1}, x_{i2}) and u_i, we will need instrumental variables for the identification of α. For illustrational simplicity, we impose the following standard IV assumption.

Assumption 6 (Instrumental Variables). Write z_i := (z_{i1}, z_{i2}), z̄_i := (1, z_{i1}, z_{i2})' and x̄_i := (1, x_{i1}, x_{i2})'. Assume

(i) Relevance: Σ_zx := E[z̄_i x̄_i'] has full rank.
(ii) Exogeneity: E[u_i | z_i] = 0.

Corollary 2 (Identification of Linear Parameters). Under Assumptions 1-6, α is point identified.

Example (Team Production Function Continued). In the context of our working example, we are essentially following a strategy discussed in Griliches and Mairesse (1998) and assume that we have access to instrumental variables (such as local wages) that affect input choices.
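The identification argument behind Corollary 2 is the standard IV one; with the latent covariates recovered by Theorem 1, the moment conditions can be written out explicitly (a sketch in the notation of Assumption 6):

```latex
\mathbb{E}\left[\bar z_i\left(y_i - \bar x_i'\alpha\right)\right]
  = \mathbb{E}\left[\bar z_i\,(u_i + \varepsilon_i)\right] = 0
\;\;\Longrightarrow\;\;
\Sigma_{zx}\,\alpha = \mathbb{E}\left[\bar z_i\, y_i\right]
\;\;\Longrightarrow\;\;
\alpha = \Sigma_{zx}^{-1}\,\mathbb{E}\left[\bar z_i\, y_i\right],
```

where the second equality uses E[u_i | z_i] = 0 (Assumption 6(ii)) together with the exogeneity of ε_i (Assumption 4), and the last step uses the full rank of Σ_zx (Assumption 6(i)).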
Estimation

We now turn to the more interesting problem of estimation, propose semiparametric estimators for α, and characterize their asymptotic distributions. We first describe our proposed estimator. Since the identification of the latent inputs via equation (10) is constructive, it suggests a natural estimation procedure:

Step 1 (Nonparametric Regression): obtain an estimator γ̂_1 of γ_1 by nonparametrically regressing y_i on x_{i1} and z_i among firms with d_i = 1, i.e., those with x_{i1} observed. Similarly, obtain an estimator γ̂_2 of γ_2.

Step 2 (Imputation): impute the latent inputs by plugging the nonparametric estimators γ̂_1, γ̂_2 into equation (10), i.e.,

x̂_{i2} = γ̂_2^{-1}(γ̂_1(x_{i1}; z_i); z_i), for d_i = 1,
x̂_{i1} = γ̂_1^{-1}(γ̂_2(x_{i2}; z_i); z_i), for d_i = 2.

Step 3 (IV Regression): estimate equation (2) with z̄_i as IVs for x̃_i, i.e.,

α̂ := ((1/n) Σ_{i=1}^n z̄_i x̃_i')^{-1} ((1/n) Σ_{i=1}^n z̄_i y_i),

where

x̃_i := (1, x_{i1}, x̂_{i2})', for d_i = 1,
x̃_i := (1, x̂_{i1}, x_{i2})', for d_i = 2.

In Appendix B.4, we also propose an alternative estimator α̂* that features a slightly different Step 3, leading to an asymptotic efficiency gain over α̂. Since the asymptotic theories for α̂ and α̂* are very similar, we defer the results on α̂* to the appendix.

We now establish the consistency and asymptotic normality of α̂ under the following regularity assumptions.

Assumption 7 (Finite Error Variances). E[u_i² | z_i] < ∞ and E[ε_i² | x_i, z_i, d_i] < ∞.

Assumption 8 (Strong Monotonicity). The first derivative of γ_k(·, z) is uniformly bounded away from zero, i.e., there exists c̲ > 0 such that for any c, z, ∂/∂c γ_k(c; z) ≥ c̲ > 0.

In view of equation (8), Assumption 8 is satisfied if α_1, α_2 > 0, or if ∂h_1/∂u and ∂h_2/∂u are uniformly bounded above by a finite constant. Assumption 8 is needed to ensure that γ̂_k^{-1}(·, z) is a good estimator of γ_k^{-1}(·, z) provided that the first-stage nonparametric estimator γ̂_k is consistent for γ_k.
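The three steps above can be sketched compactly. For brevity, the sketch below takes the Step 2 imputation as given (it feeds in the true inputs where the imputed values x̂ would go) and focuses on the Step 3 IV formula; the design, parameters, and instruments z_i = (z_{i1}, z_{i2}) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: wages z are exogenous (E[u|z] = 0) and shift input choices,
# while the productivity shock u makes the inputs endogenous.
N = 100_000
a = np.array([1.0, 0.4, 0.3])                    # (alpha_0, alpha_1, alpha_2)
z1, z2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
u = rng.normal(0.0, 0.3, N)
x1 = 0.5 - 0.8 * z1 + u                          # h_1(u, z): increasing in u
x2 = 0.8 - 0.6 * z2 + u                          # h_2(u, z): increasing in u
y = a[0] + a[1] * x1 + a[2] * x2 + u + rng.normal(0.0, 0.1, N)

# Step 3 (IV regression): alpha_hat = (mean z_bar x_tilde')^{-1} (mean z_bar y).
# Here x_tilde uses the true inputs in place of the Step 2 imputations.
Z = np.column_stack([np.ones(N), z1, z2])        # z_bar_i
X = np.column_stack([np.ones(N), x1, x2])        # x_tilde_i
alpha_iv = np.linalg.solve(Z.T @ X / N, Z.T @ y / N)

# For comparison, OLS is inconsistent because u enters both inputs and the error.
alpha_ols = np.linalg.solve(X.T @ X / N, X.T @ y / N)
print(np.round(alpha_iv, 2), np.round(alpha_ols, 2))
```

The IV estimate recovers (1.0, 0.4, 0.3) up to sampling noise, while the OLS slopes are biased upward by the transmission of u into the input choices.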
Assumption 9 (First-Stage Estimation).

(i) Donsker property: γ_1, γ_2 ∈ Γ, which is a Donsker class of functions with uniformly bounded first and second derivatives, and γ̂_1, γ̂_2 ∈ Γ with probability approaching 1.
(ii) First-stage convergence: ‖γ̂_k − γ_k‖_∞ = o_p(N^{-1/4}) for k = 1, 2.

Assumption 9(i) is satisfied if γ_1, γ_2 satisfy certain smoothness conditions, e.g., if γ_k possesses uniformly bounded derivatives up to a sufficiently high order. Assumption 9(ii) requires that the first-stage estimator converges at a rate faster than N^{-1/4}, which is satisfied by various types of nonparametric estimators under certain regularity conditions. This is required so that the final estimator of the production function parameters α can converge at the standard parametric (√N) rate despite the slower first-step nonparametric estimation of γ_1, γ_2.

Finally, we state another technical assumption that captures how the first-stage nonparametric estimation of γ_1, γ_2 influences the final semiparametric estimator α̂ through the functional derivatives of the residual function with respect to γ_1, γ_2. Assumption 10 below, based on Newey (1994), yields an explicit formula for the asymptotic variance of α̂ that does not depend on the particular forms of the first-stage nonparametric estimators.

Formally, write w_i := (y_i, x_i, z_i, d_i) and γ := (γ_1, γ_2), and suppress the conditioning variables z_i in γ for notational simplicity. Define the residual functions

g(w_i, α̃, γ̃) := z̄_i (y_i − α̃_0 − α̃_1 x_{i1} − α̃_2 γ̃_2^{-1}(γ̃_1(x_{i1}))), for d_i = 1,
g(w_i, α̃, γ̃) := z̄_i (y_i − α̃_0 − α̃_2 x_{i2} − α̃_1 γ̃_1^{-1}(γ̃_2(x_{i2}))), for d_i = 2,

for generic α̃, γ̃, and g(w_i, γ̃) := g(w_i, α, γ̃) at the true α. Define the pathwise functional derivative of g at γ along direction τ by

G(w_i, τ) := lim_{t→0} (1/t)[g(w_i, γ + tτ) − g(w_i, γ)].
Then, following Newey (1994), the so-called "influence function" can be derived analytically based on G and takes the form ϕ(w_i) z̄_i ε_i with

ϕ(w_i) := −(λ_2 α_2 / γ_2' − λ_1 α_1 / γ_1')(1{d_i = 1} − 1{d_i = 2}),

where γ_k' denotes ∂/∂h_k γ_k(x_{ik}; z_i), λ_1 stands for λ_1(x_i, z_i) := E[1{d_i = 1} | x_i, z_i], i.e., the conditional probability of observing x_{i1}, and λ_2 := 1 − λ_1. See the proof of Theorem 2 for details of the calculation. In particular, the formula for ϕ given above will be the same regardless of the specific forms of the first-step estimators used, provided that some suitable regularity conditions are satisfied.

Assumption 10 (Asymptotic Linearity). Suppose

∫ G(w, γ̂ − γ) dP(w) = (1/N) Σ_{i=1}^N ϕ(w_i) z̄_i ε_i + o_p(N^{-1/2}).

We emphasize that Assumptions 9 and 10 are standard assumptions widely imposed in the semiparametric estimation literature, which can be satisfied by many kernel or sieve first-stage estimators under a variety of conditions. See Newey (1994), Newey and McFadden (1994), and Chen, Linton, and Van Keilegom (2003) for references. In Assumption 11 below, we also provide an example of lower-level conditions that replace Assumptions 9 and 10 when we use the Nadaraya-Watson kernel estimator in the first-stage nonparametric regression.

The next theorem establishes the asymptotic normality of α̂.

Theorem 2 (Asymptotic Normality). Under Assumptions 1-10,

√N (α̂ − α) →d N(0, Σ),

where Σ := Σ_zx^{-1} Ω Σ_xz^{-1} and Ω := E[z̄_i z̄_i' (u_i + [1 + ϕ(w_i)] ε_i)²].

We note that, if the latent inputs were observed and the first-step nonparametric regression were not required, the asymptotic variance of the standard IV estimator of α would be given by Σ_zx^{-1} Var(z̄_i (u_i + ε_i)) Σ_xz^{-1}.
Hence, the presence of the additional term ϕ(w_i) ε_i in Ω captures the effect of the first-step nonparametric regression on the asymptotic variance of α̂.

To obtain consistent variance estimators, define

Ω̂ := (1/N) Σ_{i=1}^N z̄_i z̄_i' [y_i − x̃_i'α̂ + ϕ̂(w_i)(y_i − ỹ_i)]²,

where

ỹ_i := γ̂_1(x_{i1}, z_i), for d_i = 1,
ỹ_i := γ̂_2(x_{i2}, z_i), for d_i = 2,

and with

ϕ̂(w_i) := −(λ̂_2 α̂_2 / γ̂_2' − λ̂_1 α̂_1 / γ̂_1')(1{d_i = 1} − 1{d_i = 2}),

where λ̂_1 is any consistent nonparametric estimator of λ_1. Then the variance estimator can be obtained as Σ̂ := S_{zx̃}^{-1} Ω̂ S_{x̃z}^{-1} with S_{zx̃} := (1/N) Σ_{i=1}^N z̄_i x̃_i'.

Proposition 1.
In addition to Assumptions 1-8 and 11, suppose that λ̂ is any consistent nonparametric estimator of λ. Then Ω̂ →_p Ω and Ω̂* →_p Ω*. If, furthermore, λ(x_i, z_i) ≡ λ ∈ (0, 1) is assumed, then we may use the sample proportion λ̂ := (1/N) Σ_i 1{d_i = 1}.

Finally, we present a set of lower-level conditions that replace Assumptions 9 and 10 when we use the canonical Nadaraya-Watson kernel estimator for the nonparametric regression in Step 1. We emphasize that this subsection simply serves as an illustration of Assumptions 9-10 and Theorem 2, as our method does not require a specific form of first-step nonparametric estimator. For sieve (series) first-step estimators, similar results can be derived based on, for example, Newey (1994), Chen (2007), and Chen and Liao (2015).
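To illustrate, the plug-in variance estimator Σ̂ defined above can be computed directly from the second-stage residuals. The sketch below uses hypothetical array names and treats ϕ̂(w_i) and the imputed values ỹ_i as given; it illustrates the sandwich formula, and is not the paper's own code.

```python
import numpy as np

def matched_iv_variance(z, x_tilde, y, alpha_hat, phi_hat, y_tilde):
    """Plug-in sandwich variance for the matched IV estimator (a sketch):
      r_i    = y_i - x~_i' alpha^ + phi^(w_i) (y_i - y~_i)
      Omega^ = (1/N) sum_i z_i z_i' r_i^2
      Sigma^ = S^{-1} Omega^ (S')^{-1},  S = (1/N) sum_i z_i x~_i'.
    """
    n = len(y)
    r = y - x_tilde @ alpha_hat + phi_hat * (y - y_tilde)  # adjusted residual
    omega = (z.T * r**2) @ z / n        # (1/N) sum z_i z_i' r_i^2
    s = z.T @ x_tilde / n               # (1/N) sum z_i x~_i'
    s_inv = np.linalg.inv(s)
    return s_inv @ omega @ s_inv.T      # sandwich formula
```

With ϕ̂ ≡ 0 and ỹ_i = y_i, the formula reduces to the usual heteroskedasticity-robust IV sandwich variance.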
Assumption 11 (Example of Lower-Level Conditions with Kernel First Step). Let N_k := Σ_{i=1}^N 1{d_i = k} denote the number of firms for which h_ik is observed, and let γ̂_k be the Nadaraya-Watson kernel estimator of γ_k defined by

γ̂_k(v) := [ Σ_{i: d_i = k} K( (v − v_ik)/b ) y_i ] / [ Σ_{i: d_i = k} K( (v − v_ik)/b ) ],

where v_ik := (x_ik, z_i1, z_i2) for all i such that d_i = k. Suppose the following conditions:

(i) λ(x_i, z_i) ∈ (ε, 1 − ε) for all (x_i, z_i) for some ε > 0.
(ii) (x_i, z_i) has compact support with joint density f that is uniformly bounded both above and below away from zero.
(iii) E[y_i^2] < ∞ and E[y_i | x_i, z_i] f(x_i, z_i) is bounded.
(iv) γ_k has uniformly bounded derivatives up to order p ≥ 2.
(v) K(u) has uniformly bounded derivatives up to order p, K(u) is zero outside a bounded set, ∫ K(u) du = 1, ∫ u^t K(u) du = 0 for t = 1, ..., p − 1, and ∫ ‖u‖^p |K(u)| du < ∞.
(vi) b is chosen such that √(log N)/√(Nb) = o(N^{−1/4}) and √N b^p → 0.

Assumption 11(i) essentially requires that the proportion of observations with x_i1 observed and the proportion with x_i2 observed are both strictly positive, or in other words, that the numbers of both types of observations tend to infinity at the same rate as N. This guarantees that we can estimate both γ_1, based on observations with x_i1 observed, and γ_2, based on observations with x_i2 observed, well enough asymptotically. Assumption 11(iv) is the key smoothness condition that helps establish the Donsker property (and a consequent stochastic equicontinuity condition) in Assumption 9(i).
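For concreteness, the first-step Nadaraya-Watson estimator in Assumption 11 is a ratio of kernel-weighted sums over the subsample with d_i = k. A minimal sketch with hypothetical names follows; a second-order Gaussian product kernel is used purely for illustration, whereas the asymptotics above call for a higher-order kernel and an undersmoothed bandwidth.

```python
import numpy as np

def nw_estimate(v0, V, y, b):
    """Nadaraya-Watson estimate of E[y | v = v0] with bandwidth b:
    a ratio of kernel-weighted sums (Gaussian kernel for illustration)."""
    u = (V - v0) / b                           # scaled distances to v0
    w = np.exp(-0.5 * np.sum(u**2, axis=1))    # product Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)           # weighted average of y_i
```

Evaluating this at each observed point yields the imputation inputs γ̂_k used in the second stage.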
Assumptions 11(v) and 11(vi) are concerned with the choice of the kernel function K and the bandwidth parameter b: (v) requires that a "higher-order" kernel function (of order p) be used, while (vi) requires that the bandwidth be set (in a so-called "under-smoothed" way) so that the kernel estimator γ̂_k converges at a rate faster than N^{−1/4}, as required in Assumption 9(ii).

Proposition 2 (Asymptotic Distributions with Kernel First Step). Under Assumptions 1-8 and 11, the conclusions of Theorem 2 hold.

Generalizations

Additional Instrumental Variables
If additional instruments are available, it is straightforward to incorporate them in the second-stage regression, which then takes the form of a two-stage least squares estimator instead of an exactly identified IV regression. Our results carry over with suitable changes in notation. For example, the asymptotic variance formula for α̂ needs to be adapted as

Σ := ( Σ_xz Σ_zz^{−1} Σ_zx )^{−1} Σ_xz Σ_zz^{−1} Ω Σ_zz^{−1} Σ_zx ( Σ_xz Σ_zz^{−1} Σ_zx )^{−1}.

Other Parametric Outcome Function
Consider a potentially nonlinear parametric production function of the form

y_i = F_α(x_i1, x_i2) + u_i + ε_i.

After the identification of the partially latent inputs via Theorem 1, the second stage boils down to the estimation of α based on the moment condition E[ z_i ( y_i − F_α(x_i1, x_i2) ) ] = 0, which can be obtained via GMM estimation. Technically, since GMM estimators are Z-estimators, the corresponding asymptotic theory in Newey and McFadden (1994), on which the proof of Theorem 2 is mainly based, still applies with proper changes in notation.

Nonparametric Outcome Function
More generally, with any nonparametric production function that is additively separable in u_i and ε_i, of the form

y_i = F(x_i1, x_i2) + u_i + ε_i,

where F is an unknown function that satisfies Assumption 2, the only thing that changes is the second-stage nonparametric estimation of F with the imputed covariates x̃_i (or, more precisely, with one component known and one component imputed) based on the moment condition E[ z_i ( y_i − F(x_i1, x_i2) ) ] = 0. The asymptotic theory for this case can be similarly obtained based on the theory of nonparametric two-step estimation (e.g., Ai and Chen, 2007, and Hahn, Liao, and Ridder, 2018).

In the more general specification (1),

y_i = F(x_i1, x_i2, u_i) + ε_i,

where there is no longer additive separability in u_i, one way to obtain identification and implement IV estimation is by adapting Chernozhukov, Imbens, and Newey (2007) to our current context. Essentially, we would need to impose strict monotonicity of F in u_i, impose independence of u_i from z_i, normalize the distribution of u_i to be uniform, and then exploit a quantile-based residual condition as described in Chernozhukov, Imbens, and Newey (2007).

Monte Carlo Experiments

Here we report the findings of some Monte Carlo experiments. Table 1 reports the parameter specifications of the Cobb-Douglas production function that we use in our experiments. We assume that inputs are optimally chosen by a profit-maximizing firm, as discussed in detail in Appendix A. These parameters were chosen so that the simulated data are broadly consistent with the descriptive statistics of our first application, which we discuss in detail in the next section. For each specification, market size, denoted by L, and the number of firms in each market, denoted by I, can vary. In particular, we consider the following scenarios: L = 50, 100, 500 and I = 1, 50, 100. For each experiment, we compute the difference between the true parameter value and the sample average of the estimates using 1,000 replications.
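The imputation step at the core of the matched estimator can be sketched as follows: with the conditional-expectation functions strictly increasing in the input, the latent input is imputed by matching on expected output, i.e., x̃_2 = γ_2^{-1}(γ_1(x_1)). The helper below is hypothetical (γ_1, γ_2 stand in for the first-step estimates) and inverts γ_2 by monotone interpolation on a grid.

```python
import numpy as np

def impute_latent_input(x1, g1, g2, grid):
    """Impute the unobserved input x2 by matching on expected output:
    solve g2(x2) = g1(x1) for x2, with g2 strictly increasing.
    g1 and g2 are hypothetical stand-ins for the first-step estimates."""
    target = g1(x1)                        # expected output at the observed input
    vals = g2(grid)                        # g2 on an increasing grid (so vals increase)
    return np.interp(target, vals, grid)   # monotone inverse: g2^{-1}(target)
```

With nearly linear γ functions, as found in the Monte Carlo experiments below, the imputed values are essentially a linear transformation of the observed input.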
This is a measure of the bias of our estimator. We also estimate the root mean squared error (RMSE) using the sample standard deviation of our estimates.

Note that our data generating process mechanically implies that x_i1 and x_i2 have a linear relationship with y_i. We estimate γ_1(·, z_i) and γ_2(·, z_i) using second-degree polynomials. Not surprisingly, we find that the estimated coefficients on the quadratic terms are almost 0. The interpolated functions γ_1^{−1} and γ_2^{−1} are also almost linear.

Table 2 summarizes the performance of two different estimators: TSLS when all inputs are observed, as well as our version of TSLS when inputs are imputed. We refer to our version of the TSLS estimator as the "matched" TSLS estimator. As we would expect given our asymptotic results, the matched TSLS performs almost as well as the standard TSLS estimator under these ideal sampling conditions. This finding holds for all three different specifications and several choices for the number of firms within a market and the number of local markets.

Table 1: Monte Carlo Parameter Specification
(Constant across specifications: α_0 = 4, α_1 = 0.35, α_2 = 0.25, along with μ_z, σ_z, and κ; Specifications 1-3 differ in σ_u, σ_ε, and σ_η.)

Next, we investigate how our estimator performs when we have a relatively small number of observations in each market. Considering an extreme case, we simulate data for L = 500 and I = 1. As we only have a single firm in each market, we cannot impute the missing input variable using within-market information. Instead, we pool observations across markets and estimate conditional expectations conditional on x_1 (or x_2), z_1, and z_2. Table 2 also summarizes the bias and RMSE for L = 500 and I = 1. We find that the matched TSLS estimator performs almost as well as the standard TSLS estimator that assumes that both inputs are observed.

Finally, we consider the case in which the wage for type j is observed only when we observe the input for type j, i.e., we assume that:

(z_i1, z_i2) = (z*_i1, missing) if x_i1 is observed; (missing, z*_i2) if x_i2 is observed.    (11)

Since we need to impute missing wages, we assume that true wages are functions of demand shifters D_m ∈ R^2 for the local labor market m and a random error η_i which is assumed to be independent of the demand shifters. Note that this specification allows for correlation between z_1m(i) and z_2m(i) through D_m.

Table 2: Monte Carlo: Different Markets, Observed Wages

Param   Markets   Firms   Spec   TSLS Bias   TSLS RMSE   Matched Bias   Matched RMSE
α_0      50    50    1     0.001     0.001      0.000     0.001
α_0     100   100    1    -0.000     0.000     -0.000     0.000
α_0      50    50    2     0.001     0.002     -0.000     0.002
α_0     100   100    2    -0.000     0.000      0.000     0.001
α_0      50    50    3     0.001     0.002      0.001     0.002
α_0     100   100    3    -0.000     0.000      0.001     0.001
α_0     500     1    1    -0.004     0.003     -0.004     0.003
α_0     500     1    2    -0.014     0.011     -0.015     0.011
α_0     500     1    3    -0.013     0.010     -0.014     0.010
α_1      50    50    1     0.004     0.003      0.003     0.004
α_1     100   100    1     0.000     0.001      0.000     0.001
α_1      50    50    2     0.007     0.010      0.006     0.013
α_1     100   100    2     0.001     0.002      0.001     0.003
α_1      50    50    3     0.006     0.008      0.032     0.015
α_1     100   100    3     0.001     0.002      0.020     0.003
α_1     500     1    1    -0.002     0.015     -0.001     0.016
α_1     500     1    2    -0.000     0.048      0.001     0.052
α_1     500     1    3    -0.007     0.040     -0.006     0.043
α_2      50    50    1    -0.005     0.005     -0.004     0.006
α_2     100   100    1    -0.001     0.001     -0.000     0.001
α_2      50    50    2    -0.010     0.014     -0.010     0.017
α_2     100   100    2    -0.002     0.003     -0.002     0.004
α_2      50    50    3    -0.007     0.011     -0.046     0.021
α_2     100   100    3    -0.001     0.002     -0.029     0.005
α_2     500     1    1    -0.004     0.020     -0.004     0.022
α_2     500     1    2    -0.020     0.068     -0.022     0.073
α_2     500     1    3    -0.009     0.051     -0.010     0.055

Specifically, we simulate wages as follows:

z*_i1 = z_1m(i) = κ_11 D_1m + κ_12 D_2m + η_i1,    (12)
z*_i2 = z_2m(i) = κ_21 D_1m + κ_22 D_2m + η_i2.

To impute the missing wages, we regress the observed wages (z_i1, z_i2) on the demand shifters (D_1m, D_2m). Using the estimated parameters from the regression, we then impute the missing wages.

Table 3: Monte Carlo: Small Markets with Partially Latent Wages

Param   Markets   Firms   Spec   Standard TSLS Bias   RMSE   Matched TSLS Bias   RMSE
α_0     500     1    1    -0.004     0.003     -0.004     0.003
α_0     500     1    2    -0.008     0.010     -0.007     0.010
α_0     500     1    3    -0.008     0.010     -0.007     0.010
α_1     500     1    1    -0.002     0.015     -0.001     0.016
α_1     500     1    2     0.005     0.054      0.008     0.055
α_1     500     1    3     0.004     0.053      0.008     0.054
α_2     500     1    1    -0.004     0.020     -0.004     0.022
α_2     500     1    2    -0.021     0.072     -0.023     0.075
α_2     500     1    3    -0.020     0.070     -0.023     0.074

Table 3 summarizes the performance of our new estimator together with the TSLS estimator. Even if we have a relatively large variance of the imputation errors, such as in Specification 3, our new estimator performs reasonably well.

Figure 1 plots the empirical distributions for the case of Specification 2. Overall, we find that the matched TSLS estimator performs almost as well as the standard TSLS estimator.

Figure 1: Histograms of Estimated Coefficients with Imputed Wages (TSLS and matched TSLS; Nmarket = 500, Nfirms = 1, Parameter Spec = II)

We conclude that our estimator performs well in all Monte Carlo experiments, even in scenarios that are more general than those considered in Section 3 of the paper. In particular, we do not need to observe both sets of instruments in the data, i.e., we can impute the missing instrument. Next, we evaluate the performance of our estimator in two applications. The first application focuses on pharmacies and studies differences in technology across different types of firms. The second application studies education production functions.
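Before turning to the applications, note that the wage-imputation step in (12) amounts to an auxiliary OLS regression of observed wages on the demand shifters, followed by prediction for the markets where that wage type is missing. A minimal sketch, with hypothetical names and an intercept added for illustration:

```python
import numpy as np

def impute_wages(D, z_obs, observed):
    """Impute missing wages from local demand shifters: fit OLS on the
    markets where the wage is observed, predict for all markets, and
    keep the actual wage wherever it is observed (a sketch)."""
    X = np.column_stack([np.ones(len(D)), D])                 # intercept + shifters
    kappa, *_ = np.linalg.lstsq(X[observed], z_obs[observed], rcond=None)
    z_hat = X @ kappa                                         # fitted wages
    return np.where(observed, z_obs, z_hat)                   # fill only the gaps
```

The same regression run with the roles of the two wage types reversed imputes the other missing wage.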
Our first application focuses on the industrial organization of pharmacies. This industry has undergone a dramatic change over the past decades. An industry that used to be primarily dominated by local independent pharmacies has been transformed by the entry of large chains that operate in multiple markets. An important question is the extent to which this transformation has been driven by technological change that has benefited large chains over smaller independently operated pharmacies. If this is in fact the case, these technological changes may help to explain why this profession has become so popular with females (Goldin and Katz, 2016).

The main data set that we use is the National Pharmacist Workforce Survey of 2000, which was collected by Midwestern Pharmacy Research. The data come from a cross-sectional survey answered by randomly selected individual pharmacists with active licenses. The data set is composed of two types of information: information about pharmacists and information about the pharmacy each pharmacist works at. Information at the pharmacy level includes the type of pharmacy (Independent or Chain), the hours of operation per week, the number of pharmacists employed, and the typical number of prescriptions dispensed at the pharmacy per week. The store-level information is provided by an individual pharmacist who works at the pharmacy; thus the quality of the responses may depend on how knowledgeable the person is about the pharmacy. However, considering that most of the pharmacists in our sample are observed to be full-time pharmacists, the quality of the firm-level data is likely to be high. The number of prescriptions dispensed at the pharmacy is our measure of output.
As a consequence, we do not have to use revenue-based output measures, which could bias our analysis as discussed, for example, in Epple, Gordon, and Sieg (2010).

Table 4: Summary Statistics at the Firm Level: Pharmacies
(Columns: Firm Type, Number of Pharmacists, Employment Size, Operating Hours, Prescriptions per Week, Prescriptions per Hour, Proportion Urban, Number of Observations; rows report independent and chain pharmacies by employment-size category.)

We explore these issues in more detail below and test whether the different types of pharmacies have access to the same technology.

The survey also collects various information about pharmacists, including hours of work, demographics, and household characteristics. Most importantly, we observe the position at the pharmacy (Owner/Manager or Employee). We treat hours of the manager and hours of the employees as the two input factors in our analysis.

Information related to the individual pharmacists is summarized in Table 5. Employee pharmacists at independent pharmacies work fewer hours than the employee pharmacists at chain pharmacies, and hourly earnings are lower than those of the employees at the chains. Pharmacists in managerial positions at independent pharmacies work more hours than do managers at chain pharmacies, but they have lower hourly earnings on average.

We observe only one pharmacy in each local labor market, which is defined as the 5-digit zip code area. Hence, we need to use the version of our estimator that averages across local markets, as discussed in Section 3.3.

We test whether the observed labor inputs are indeed the optimal choice of firms. If the inputs are optimally chosen, the coefficients can be directly estimated from equation (16) in Appendix A. Under the assumption of Cobb-Douglas production, we can test optimality by jointly testing the null hypothesis of equality of both coefficients. Table 6 shows the results. A formal Wald test rejects the null hypothesis of optimality.
Thus, the direct inversion of the optimality conditions cannot be applied to estimate the parameters of the production function, whereas our new estimator is feasible.

We implement two versions of our "matched" TSLS estimator: the first estimator uses the observed outputs, while the second one uses expected outputs. Since the observed output is subject to measurement error, the semi-parametric estimator that uses expected outputs corrects for this. (Most pharmacies in our sample have one manager pharmacist and one employee pharmacist, but there are a few pharmacies with a larger employment size. See Appendix C for details on how to compute employees' hours worked for the pharmacies with multiple employees. We only observe the wage for the observed type. Thus, wages are imputed for the unobserved type using local demand shifters at the 5-digit zip code level and pharmacists' characteristics. We use actual wages for the observed position and imputed wages for both positions, together with principal components of local demand shifters, as instruments.)

Table 7 shows that we estimate most of the parameters of the production function with good precision. Correcting for potential measurement error by using the expected output as the dependent variable, we achieve similar, maybe even slightly more plausible, estimates.

Table 7: Estimation Results
(Columns: Independent and Chain pharmacies, each estimated with observed and with expected outputs as the dependent variable; rows report the production function coefficients and controls.)

As a robustness check, we also explored a different matching algorithm, which estimates the expectation of output conditional on local demand shifters rather than wages. The results are consistent, although the matching algorithm with local demand shifters gives slightly larger point estimates with slightly less precision. Appendix C provides some additional robustness checks.
Table 8: Tests of Differences between Independent and Chain Pharmacies

                          α        V(u)
Independent              0.163     0.010
Chain                    0.687     0.006
Difference or Ratio     -0.524     1.532
Test Statistic         122.841    -1.913    1.532
Test                     Wald        t        F
p-value                 (0.000)   (0.028)  (0.003)

Second, our findings also suggest that managers may be more effective in chains than in independents. A formal one-sided t-test reported in Table 8 rejects the null hypothesis that the two coefficients that characterize managerial efficiency are the same.

Finally, we find that chains have a significantly lower residual variance than independents. A formal F-test reported in Table 8 rejects the null hypothesis that the residual variance of independents is less than or equal to the residual variance of chains. Note that all the tests are based on the estimation results with the expected outputs as the dependent variable.

We thus conclude that chains have different production functions than independent pharmacies, which may partially explain the change in the observed market structure of that industry. However, more research is needed to fully address this important research question.
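The comparisons in Table 8 can be illustrated with simple one-sided tests. The sketch below is not the paper's exact test construction: the standard errors and sample sizes are hypothetical, and a normal approximation is used for the t-test.

```python
import numpy as np
from scipy import stats

def chain_vs_independent_tests(a_ind, se_ind, a_chain, se_chain,
                               v_ind, v_chain, n_ind, n_chain):
    """Illustrative one-sided tests (hypothetical inputs, not the
    paper's statistics): (a) the managerial coefficient is larger for
    chains; (b) the residual variance is smaller for chains."""
    t = (a_ind - a_chain) / np.sqrt(se_ind**2 + se_chain**2)
    p_t = stats.norm.cdf(t)                        # H1: a_ind < a_chain
    F = v_ind / v_chain
    p_F = stats.f.sf(F, n_ind - 1, n_chain - 1)    # H1: v_ind > v_chain
    return t, p_t, F, p_F
```

Applied to the point estimates in Table 8 (with made-up standard errors and sample sizes), both one-sided tests reject at conventional levels.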
Our second application focuses on the estimation of education achievement functions. Here we assume that a child's achievement y_i is a function of the mother's and the father's time inputs, denoted by x_im and x_if. Again, we consider a log-linear Cobb-Douglas specification given by

y_i = α_i + α_m x_im + α_f x_if + u_i    (13)

where heterogeneity in the intercept is given by:

α_i = x_i′ α    (14)

Hence, we assume that the baseline productivity α_i varies with family characteristics, such as family income. As before, we can estimate the education production function using TSLS with wages as instruments for inputs, as well as our "matched" TSLS estimator if some inputs are partially latent.

Our data is based on the four available waves of the Child Development Supplement (CDS). These are the cohorts interviewed in 1997, 2002, 2007, and 2014. For these children, we have detailed time-usage information of their parents on two days, each of which is randomly selected among weekdays and weekends, respectively. Based on this time-diary information, we can construct time inputs for mothers and fathers. The CDS can be linked to the original PSID survey using the family ID. Hence, we have detailed parental information such as education level, household income, and the number of children.

The CDS collects multiple measures of child development, including both cognitive and non-cognitive skills. We focus on two important cognitive tests. First, we study the passage comprehension test, which assesses reading comprehension and vocabulary among children aged between 6 and 17. Second, we analyze the applied problems test, which assesses mathematics reasoning, achievement, and knowledge for children aged between 6 and 17.

We begin by estimating an education production function using the subsample of children who live in married households. Hence, we observe the mother's and the father's inputs in the data set.
We observe 3,236 children with complete inputs and applied problems scores, as well as 2,789 children with complete inputs and reading scores. (The CDS 1997 cohort consists of up-to-12-year-old children and follows them for three waves (1997, 2001, 2007). The CDS 2014 cohort consists of children that were up to 17 years old in 2013. We exclude families with stepmothers and stepfathers from our sample. We also analyzed the letter word test, which assesses symbolic learning and reading identification skills. There are also two non-cognitive measures: the externalizing behavioral problem index measures disruptive, aggressive, or destructive behavior; the internalizing behavioral problem index measures expressions of withdrawn, sad, fearful, or anxious feelings.)

We then turn to the subsample of children from divorced households, where the father's time input is latent and must be imputed. Note that the standard TSLS is no longer feasible in this subsample because of the latent variable problem. Table 11 summarizes our findings.

Table 11 shows that the time inputs for mothers are positive, statistically significant, and economically meaningful. Moreover, the point estimates for the applied problems test are similar to the ones we obtained for the married sample reported in Table 10. The main difference is that the mother's time inputs are slightly less productive for children from divorced families, and the father's time inputs are not statistically different from zero. In summary, our estimator works well in this application and yields plausible and accurate point estimates for most coefficients of interest. Most importantly, we find that the inputs of divorced fathers into the skill formation function of their children seem to be negligible. (Missing instruments for the unobserved spouse are imputed using standard techniques based on the observed spouse's information.)

Concluding Remarks
We have developed a new method for identifying econometric models with partially latent covariates. We have shown that a broad class of econometric models that play a large role in industrial organization and labor economics can be nonparametrically identified if the partially latent covariates are monotonic functions of a common shock. Examples that fall into this class of models are production and skill formation functions. The partially latent data structure arises quite naturally in these settings if we employ an "input-based sampling" strategy, i.e., if the sampling unit is one of multiple labor input factors. It is plausible that the sampling unit will only have incomplete information about the other labor inputs that affect output. Our proofs of identification are constructive and imply a sequential, two-step semi-parametric estimation strategy. We have discussed the key problems encountered in estimation and characterized the rate of convergence and the asymptotic distribution of our estimators.

We also presented two applications of our technique. Our first application focuses on estimating team production functions. Using a national survey of pharmacists, we have found some convincing evidence that chains have different technologies than independently operated pharmacies. In particular, managers appear to be more productive in chains. Our second application focuses on the estimation of skill formation functions, which play a large role in labor and family economics. We have shown that our matched TSLS estimator produces similar results to the feasible TSLS estimator in a sample of children in married households, where both parental inputs are observed. We have also considered a sample of children from divorced households, where the father's inputs must be imputed. We find that the inputs of divorced fathers into the skill formation function of their children are negligible.

There is substantial scope for future research in areas other than the two applications that we provided above.
At the heart of the applications discussed thus far is the relationship between multiple inputs that are combined to produce a single output. It is easy to imagine questions that ask about relationships that fit this structure and that do not fall into the frameworks we have considered thus far.

To illustrate this idea, consider the problem of inter vivos gifts. It is common for parents, while still alive, to give money to their children, often to help with a down payment on a house or to reduce the taxes the parents will pay. When a couple makes a gift to their married child, however, they risk that the child divorces and a portion of the gift will accrue to the child's spouse. The concern is real, since approximately 40% of marriages in the US end in divorce. A natural question is how well parents can predict how long a child's marriage will last at the time they contemplate making a gift. One could address this question with a data set that includes inter vivos gifts from parents to married children and, in addition, how long the child's marriage survives. Such data sets exist, for example the PSID, which documents these for family lines that stretch over half a century.

There is a problem, however: multigenerational data sets such as the PSID have quite detailed information about the choices of individuals who are descendants of the initial respondents, but substantially less information about the choices of individuals who "marry into" the data set. For each married couple in the PSID, one of the two has the "PSID gene" (that is, is a descendant of an initial respondent), and we have substantially more information about that individual and, importantly, about that individual's parents than we have about the spouse. In particular, we know the inter vivos gifts to the couple from the parents of the PSID-gene child but not the inter vivos gifts to the couple from the spouse's parents.
Note that this design of the PSID gives rise to a data structure that mimics the "input-based sampling" approach that we have studied in this paper. As we show in Appendix D, it is straightforward to write down a non-cooperative model of intergenerational transfers, where the transfers of each set of parents are monotonically increasing in the probability that the marriage survives. This potential application is an example of the interesting problems that arise in trying to understand intergenerational effects. We would like to know how the choices or characteristics of individuals in one generation affect the outcomes of their descendants. We conjecture that the methods developed in this paper can be fruitfully applied to study a variety of questions related to intergenerational linkages.

Finally, our research provides ample scope for future research in econometric methodology. We have restricted ourselves to applications in which our method of identification can be combined with standard IV techniques to estimate the functions of interest. Much of the recent panel data literature has focused on dynamic inputs in the presence of adjustment costs. More research is clearly needed to evaluate whether the ideas presented in this paper can be extended and applied to dynamic panel data frameworks. We have also restricted ourselves to systems of inputs with a single common shock. (Other multigenerational data sets, such as the NLSY79, NLSY97, and NCDS, share the partially latent variable problem.)

References
Abadie, A. and G. Imbens (2006): "Large Sample Properties of Matching Estimators for Average Treatment Effects," Econometrica, 74(1), 235–267.

Abrevaya, J. and S. G. Donald (2017): "A GMM Approach for Dealing with Missing Data on Regressors," Review of Economics and Statistics, 99, 657–662.

Acemoglu, D. and D. Autor (2011): "Skills, Tasks and Technologies: Implications for Employment and Earnings," in Handbook of Labor Economics, ed. by D. Card and O. Ashenfelter, Elsevier, 1043–1171.

Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic Efficiency of Semiparametric Two-Step GMM," Review of Economic Studies, 81, 919–943.

Ackerberg, D. A., K. Caves, and G. Frazer (2015): "Identification Properties of Recent Production Function Estimators," Econometrica, 83, 2411–2451.

Ai, C. and X. Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics, 141, 5–43.

Bergstrom, T., L. Blume, and H. Varian (1986): "On the Private Provision of Public Goods," Journal of Public Economics, 29, 25–49.

Blundell, R. and S. Bond (1998): "Initial Conditions and Moment Restrictions in Dynamic Panel Data Models," Journal of Econometrics, 87, 115–143.

——— (2000): "GMM Estimation with Persistent Panel Data: An Application to Production Functions," Econometric Reviews, 19, 321–340.

Chaudhuri, S. and D. K. Guilkey (2016): "GMM with Multiple Missing Variables," Journal of Applied Econometrics, 31, 678–706.

Chen, X. (2007): "Large Sample Sieve Estimation of Semi-Nonparametric Models," Handbook of Econometrics, 6, 5549–5632.

Chen, X. and Z. Liao (2015): "Sieve Semiparametric Two-Step GMM under Weak Dependence," Journal of Econometrics, 189, 163–186.

Chen, X., O. Linton, and I. Van Keilegom (2003): "Estimation of Semiparametric Models when the Criterion Function Is Not Smooth," Econometrica, 71, 1591–1608.

Chernozhukov, V., G. W. Imbens, and W. K. Newey (2007): "Instrumental Variable Estimation of Nonseparable Models," Journal of Econometrics, 139, 4–14.

Cunha, F., J. Heckman, and S. Schennach (2010): "Estimating the Technology of Cognitive and Non-Cognitive Skill Formation," Econometrica, 78, 883–931.

Doraszelski, U. and J. Jaumandreu (2013): "R&D and Productivity: Estimating Endogenous Productivity," Review of Economic Studies, 80, 1338–1383.

Epple, D., B. Gordon, and H. Sieg (2010): "A New Approach to Estimating the Production Function for Housing," American Economic Review, 100, 905–924.

Fisher, R. (1935): Design of Experiments, New York: Hafner.

Gandhi, A., S. Navarro, and D. Rivers (2020): "On the Identification of Gross Output Production Functions," Journal of Political Economy, 128, 2973–3016.

Goldin, C. and L. F. Katz (2016): "A Most Egalitarian Profession: Pharmacy and the Evolution of a Family-Friendly Occupation," Journal of Labor Economics, 34, 705–746.

Graham, B. S. (2011): "Efficiency Bounds for Missing Data Models with Semiparametric Restrictions," Econometrica, 79, 437–452.

Griliches, Z. and J. Mairesse (1998): "Production Functions: The Search for Identification," in Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium, ed. by S. Strøm, Cambridge University Press, 169–203.

Haanwinckel, D. (2018): "Supply, Demand, Institutions, and Firms: A Theory of Labor Market Sorting and the Wage Distribution," Working Paper.

Hahn, J., Z. Liao, and G. Ridder (2018): "Nonparametric Two-Step Sieve M Estimation and Inference," Econometric Theory, 34, 1281–1324.

Hansen, B. E. (2008): "Uniform Convergence Rates for Kernel Estimation with Dependent Data," Econometric Theory, 24, 726–748.

Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998): "Characterizing Selection Bias Using Experimental Data," Econometrica, 66(2), 315–331.

Hoch, I. (1955): "Estimation of Production Function Parameters and Testing for Efficiency," Econometrica, 23, 325–326.

——— (1962): "Estimation of Production Function Parameters Combining Time-Series and Cross-Section Data," Econometrica, 34–53.

Levinsohn, J. and A. Petrin (2003): "Estimating Production Functions Using Inputs to Control for Unobservables," Review of Economic Studies, 70, 317–341.

Little, R. J. (1992): "Regression with Missing X's: A Review," Journal of the American Statistical Association, 87, 1227–1237.

Marschak, J. and W. H. Andrews (1944): "Random Simultaneous Equations and the Theory of Production," Econometrica, 12, 143–205.

Matzkin, R. L. (2007): "Nonparametric Identification," Handbook of Econometrics, 6, 5307–5368.

McDonough, I. K. and D. L. Millimet (2017): "Missing Data, Imputation, and Endogeneity," Journal of Econometrics, 199, 141–155.

Milgrom, P. and C. Shannon (1994): "Monotone Comparative Statics," Econometrica, 62, 157–180.

Mundlak, Y. (1961): "Empirical Production Function Free of Management Bias," Journal of Farm Economics, 43, 44–56.

——— (1963): "Specification and Estimation of Multiproduct Production Functions," Journal of Farm Economics, 45, 433–443.

Newey, W. K. and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, IV, ed. by R. F. Engle and D. L. McFadden, 2112–2245.

Newey, W. K. (1994): "The Asymptotic Variance of Semiparametric Estimators," Econometrica, 62, 1349–1382.

Olley, G. S. and A. Pakes (1996): "The Dynamics of Productivity in the Telecommunications Equipment Industry," Econometrica, 64, 1263–1297.

Ridder, G. and R. Moffitt (2007): "The Econometrics of Data Combination," Handbook of Econometrics, 6, 5469–5547.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): "Estimation of Regression Coefficients when Some Regressors Are Not Always Observed," Journal of the American Statistical Association, 89, 846–866.

Rosenbaum, P. and D. Rubin (1983): "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Roy, S. and T. Sabarwal (2010): "Monotone Comparative Statics for Games with Strategic Substitutes," Journal of Mathematical Economics, 46, 793–806.

Rubin, D. (1973): "Matching to Remove Bias in Observational Studies," Biometrics, 29, 159–183.

Rubin, D. B. (1976): "Inference and Missing Data," Biometrika, 63, 581–592.

Todd, P. and K. Wolpin (2003): "On the Specification and Estimation of the Production Function for Cognitive Achievement," Economic Journal, 113, F3–F33.

Van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

Vives, X. (2000): Oligopoly Pricing: Old Ideas and New Trends, Cambridge, MA: MIT Press.

Wooldridge, J. M. (2007): "Inverse Probability Weighted Estimation for General Missing Data Problems," Journal of Econometrics, 141, 1281–1301.
A The Cobb-Douglas Case with Optimal Inputs
Suppose that firm $i$ chooses inputs optimally by solving the following (expected) profit-maximization problem:
\[
\max_{X_{1i},X_{2i}} \; e^{\alpha_0+u_i}\, X_{1i}^{\alpha_1} X_{2i}^{\alpha_2}\, E\!\left[e^{\epsilon_i}\right] \;-\; Z_{1i}X_{1i} - Z_{2i}X_{2i}, \tag{15}
\]
where $X_{1i}, X_{2i}, Z_{1i}, Z_{2i}$ denote the exponentials of $x_{1i}, x_{2i}, z_{1i}, z_{2i}$ and $E[e^{\epsilon_i}]$ is normalized to one. By the first-order conditions,
\begin{align*}
X_{1i} &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{\alpha_2-1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{-\alpha_2}{1-\alpha_1-\alpha_2}}, \\
X_{2i} &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{-\alpha_1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{\alpha_1-1}{1-\alpha_1-\alpha_2}}, \\
Y_i &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{-\alpha_1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{-\alpha_2}{1-\alpha_1-\alpha_2}} \\
&= e^{\alpha_0+u_i}\left(\frac{\alpha_2 Z_{1i}}{\alpha_1 Z_{2i}}\right)^{\alpha_2} X_{1i}^{\alpha_1+\alpha_2}
 \;=\; e^{\alpha_0+u_i}\left(\frac{\alpha_1 Z_{2i}}{\alpha_2 Z_{1i}}\right)^{\alpha_1} X_{2i}^{\alpha_1+\alpha_2}.
\end{align*}
In log form,
\begin{align*}
x_{1i} = h_1(u_i, z_i) &= \frac{\alpha_0 + (1-\alpha_2)\log\alpha_1 + \alpha_2\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{1-\alpha_2}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{\alpha_2}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i, \\
x_{2i} = h_2(u_i, z_i) &= \frac{\alpha_0 + \alpha_1\log\alpha_1 + (1-\alpha_1)\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{\alpha_1}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{1-\alpha_1}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i, \\
y_i = \bar y(u_i, z_i) &= \frac{\alpha_0 + \alpha_1\log\alpha_1 + \alpha_2\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{\alpha_1}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{\alpha_2}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i \\
&= \alpha_0 + \alpha_2\log(\alpha_2/\alpha_1) + (\alpha_1+\alpha_2)\, h_1(u_i, z_i) + \alpha_2 z_{1i} - \alpha_2 z_{2i} + u_i \\
&= \alpha_0 + \alpha_1\log(\alpha_1/\alpha_2) + (\alpha_1+\alpha_2)\, h_2(u_i, z_i) - \alpha_1 z_{1i} + \alpha_1 z_{2i} + u_i.
\end{align*}
Taking inverses,
\begin{align*}
u_i &= h_1^{-1}(x_{1i}, z_i) := -\left[\alpha_0 + (1-\alpha_2)\log\alpha_1 + \alpha_2\log\alpha_2\right] + (1-\alpha_1-\alpha_2)\, x_{1i} + (1-\alpha_2)\, z_{1i} + \alpha_2\, z_{2i}, \\
u_i &= h_2^{-1}(x_{2i}, z_i) := -\left[\alpha_0 + \alpha_1\log\alpha_1 + (1-\alpha_1)\log\alpha_2\right] + (1-\alpha_1-\alpha_2)\, x_{2i} + \alpha_1\, z_{1i} + (1-\alpha_1)\, z_{2i},
\end{align*}
so that
\begin{align*}
\gamma_1(x_{1i}, z_i) &= \bar y\big(h_1^{-1}(x_{1i}, z_i), z_i\big) = -\log\alpha_1 + x_{1i} + z_{1i}, \\
\gamma_2(x_{2i}, z_i) &= \bar y\big(h_2^{-1}(x_{2i}, z_i), z_i\big) = -\log\alpha_2 + x_{2i} + z_{2i},
\end{align*}
and
\begin{align*}
y_i &= \gamma_1(x_{1i}, z_i) + \epsilon_i = -\log\alpha_1 + x_{1i} + z_{1i} + \epsilon_i \\
&= \gamma_2(x_{2i}, z_i) + \epsilon_i = -\log\alpha_2 + x_{2i} + z_{2i} + \epsilon_i. \tag{16}
\end{align*}
It is then evident that $\alpha_1$ or $\alpha_2$ can be estimated directly from (16) on the corresponding subsample where $x_{1i}$ or $x_{2i}$ is observed. Furthermore, we may test input optimality based on equation (16).

B Proofs
B.1 Additional Notation and Lemmas
Notation
For each $i$, we use $x_{ij}$ to denote the observed input and $x_{ik}$ to denote the latent input variable for firm $i$, i.e.,
\[
x_{ij} = x_{1i},\; x_{ik} = x_{2i} \quad \text{for } d_i = 1, \qquad
x_{ij} = x_{2i},\; x_{ik} = x_{1i} \quad \text{for } d_i = 2.
\]
We write $d_{1i} := 1\{d_i = 1\}$ and $d_{2i} := 1\{d_i = 2\}$, so that $x_{ij} = d_{1i}x_{1i} + d_{2i}x_{2i}$ while $x_{ik} := d_{1i}x_{2i} + d_{2i}x_{1i}$. We write $x_i := (1, x_{1i}, x_{2i})'$ to denote the true regressor vector. (Recall that $\tilde x_i$ denotes the same regressor vector with the imputed latent input $\hat x_{ik}$ in place of $x_{ik}$.) Moreover, we suppress the instrumental variables $z_i$ in functions, such as $\gamma_k(\cdot, z_i)$, unless it becomes necessary to emphasize the dependence of such functions on $z_i$.

Lemma 1. Under Assumption 8, if $\|\hat\gamma_k - \gamma_k\|_\infty = O_p(a_n)$, then $\|\hat\gamma_k^{-1} - \gamma_k^{-1}\|_\infty = O_p(a_n)$ and $|\hat x_{ik} - x_{ik}| = O_p(a_n)$.

Proof. By Assumption 8 we have
\[
c\,|u - u'| \leq |\gamma_k(u) - \gamma_k(u')|.
\]
For any $v \in \mathrm{Range}(\gamma_k)$,
\begin{align*}
\left|\hat\gamma_k^{-1}(v) - \gamma_k^{-1}(v)\right|
&\leq \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - \gamma_k\big(\gamma_k^{-1}(v)\big)\right|
 = \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - v\right| \\
&= \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - \hat\gamma_k\big(\hat\gamma_k^{-1}(v)\big)\right|
 \leq \frac{1}{c}\,\|\hat\gamma_k - \gamma_k\|_\infty = O_p(a_n).
\end{align*}
Furthermore, observing that
\[
c\left|\gamma_k^{-1}(v) - \gamma_k^{-1}(v')\right| \leq \left|\gamma_k\big(\gamma_k^{-1}(v)\big) - \gamma_k\big(\gamma_k^{-1}(v')\big)\right| = |v - v'|,
\]
we have, by Assumption 8 and the first part of the lemma,
\begin{align*}
|\hat x_{ik} - x_{ik}| &= \left|\hat\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\gamma_j(x_{ij})\big)\right| \\
&\leq \left|\hat\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big)\right| + \left|\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\gamma_j(x_{ij})\big)\right| \\
&\leq \left\|\hat\gamma_k^{-1} - \gamma_k^{-1}\right\|_\infty + \frac{1}{c}\left|\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\right| \\
&\leq \left\|\hat\gamma_k^{-1} - \gamma_k^{-1}\right\|_\infty + \frac{1}{c}\,\|\hat\gamma_j - \gamma_j\|_\infty = O_p(a_n). \tag{17}
\end{align*}

Lemma 2.
Under Assumption 8:

(i) The pathwise derivative of $\gamma_k^{-1}$ w.r.t. $\gamma_k$ along $\tau_k \in \Gamma$ is given by
\[
\nabla_{\gamma_k}\gamma_k^{-1}[\tau_k](v) := \lim_{t \searrow 0}\frac{(\gamma_k + t\tau_k)^{-1}(v) - \gamma_k^{-1}(v)}{t} = -\frac{\tau_k\big(\gamma_k^{-1}(v)\big)}{\gamma_k'\big(\gamma_k^{-1}(v)\big)}.
\]

(ii) The pathwise derivative of $\gamma_k^{-1}(\gamma_j(\cdot))$ w.r.t. $\gamma_j$ along $\tau_j \in \Gamma$ is given by
\[
\nabla_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j](x) := \lim_{t \searrow 0}\frac{\gamma_k^{-1}\big(\gamma_j(x)+t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)}{t} = \big(\gamma_k^{-1}\big)'\big(\gamma_j(x)\big)\,\tau_j(x) = \frac{\tau_j(x)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

(iii) The second-order pathwise derivatives have bounded norms:
\[
\left\|\nabla^2_{\gamma_k}\gamma_k^{-1}[\tau_k][\tau_k]\right\| \leq M\|\tau_k\|^2, \qquad
\left\|\nabla^2_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j][\tau_j]\right\| \leq M\|\tau_j\|^2.
\]

Proof. (i) and (ii) follow immediately from the definition of pathwise derivatives. See, e.g., Lemmas 3.9.20 and 3.9.25 in Van Der Vaart and Wellner (1996) for reference. For (iii),
\[
\nabla^2_{\gamma_k}\gamma_k^{-1}[\tau_k][\nu_k]
= \frac{\tau_k'\big(\gamma_k^{-1}\big)\,\nu_k\big(\gamma_k^{-1}\big) + \tau_k\big(\gamma_k^{-1}\big)\,\nu_k'\big(\gamma_k^{-1}\big)}{\big[\gamma_k'\big(\gamma_k^{-1}\big)\big]^2}
- \frac{\tau_k\big(\gamma_k^{-1}\big)\,\gamma_k''\big(\gamma_k^{-1}\big)\,\nu_k\big(\gamma_k^{-1}\big)}{\big[\gamma_k'\big(\gamma_k^{-1}\big)\big]^3}
\leq M\|\tau_k\|\,\|\nu_k\|,
\]
since $\gamma_k' \geq c > 0$ while $\gamma_k''$, $\tau_k'$ and $\nu_k'$ are uniformly bounded above by Assumption 9(i). The bound for $\nabla^2_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)$ follows similarly.
Writing $\gamma := (\gamma_1, \gamma_2)$, the pathwise derivative of $\gamma_k^{-1}\circ\gamma_j$ w.r.t. $\gamma$ along $\tau = (\tau_j, \tau_k)$ is given by
\[
\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau](x) := \lim_{t \searrow 0}\frac{(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)}{t}
= \frac{\tau_j(x) - \tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

Proof. By Lemma 2,
\begin{align*}
&\frac{1}{t}\Big[(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)\Big] \\
&= \frac{1}{t}\Big[(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x) + t\tau_j(x)\big)\Big]
 + \frac{1}{t}\Big[\gamma_k^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)\Big] \\
&\to \nabla_{\gamma_k}\gamma_k^{-1}[\tau_k]\big(\gamma_j(x)\big) + \nabla_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j](x) \\
&= -\frac{\tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)} + \frac{\tau_j(x)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}
= \frac{\tau_j(x) - \tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

B.2 Proof of Theorem 2
Proof.
We verify the conditions in Lemma 5.4 of Newey (1994), or equivalently, Theorem 8.11 of Newey and McFadden (1994). Recall that $w_i := (y_i, x_i, z_i, d_i)$, $\gamma := (\gamma_1, \gamma_2)$, and
\begin{align*}
g(w_i, \hat\alpha, \hat\gamma) &= z_i\Big(y_i - \hat\alpha_0 - \big(x_{1i}\hat\alpha_1 + \hat\gamma_2^{-1}(\hat\gamma_1(x_{1i}))\hat\alpha_2\big)d_{1i} - \big(x_{2i}\hat\alpha_2 + \hat\gamma_1^{-1}(\hat\gamma_2(x_{2i}))\hat\alpha_1\big)d_{2i}\Big) \\
&= z_i\big(y_i - \hat\alpha_0 - x_{ij}\hat\alpha_j - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\hat\alpha_k\big), \\
g(w_i, \hat\gamma) &= z_i\big(y_i - \alpha_0 - x_{ij}\alpha_j - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\alpha_k\big) \\
&= z_i\big(u_i + \epsilon_i + \big[x_{ik} - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big]\alpha_k\big).
\end{align*}
Clearly, $E[g(w_i, \gamma)] = E[z_i(u_i + \epsilon_i)] = 0$ by Assumptions 6 and 4. Moreover, $\frac{1}{N}\sum_{i=1}^N g(w_i, \hat\alpha, \hat\gamma) = 0$ by the definition of $\hat\alpha$.

Now, define
\begin{align*}
G(w_i, \hat\gamma - \gamma) &:= \nabla_\gamma g(w_i, \gamma)[\hat\gamma - \gamma] = -\alpha_k z_i\,\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\hat\gamma - \gamma] \\
&= -\frac{\alpha_k z_i}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x_{ij}))\big)}\Big[(\hat\gamma_j - \gamma_j)(x_{ij}) - (\hat\gamma_k - \gamma_k)\big(\gamma_k^{-1}(\gamma_j(x_{ij}))\big)\Big] \\
&= -\frac{\alpha_k z_i}{\gamma_k'(x_{ik})}\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij}) - \hat\gamma_k(x_{ik}) + \gamma_k(x_{ik})\big] \qquad \text{since } \gamma_k^{-1}(\gamma_j(x_{ij})) = x_{ik} \\
&= d_{1i}\, z_i\left(-\frac{\alpha_2}{\gamma_2'}\right)(1, -1)\begin{pmatrix}\hat\gamma_1 - \gamma_1\\ \hat\gamma_2 - \gamma_2\end{pmatrix}
 + d_{2i}\, z_i\left(-\frac{\alpha_1}{\gamma_1'}\right)(-1, 1)\begin{pmatrix}\hat\gamma_1 - \gamma_1\\ \hat\gamma_2 - \gamma_2\end{pmatrix} \\
&= -z_i\left(\frac{d_{1i}\alpha_2}{\gamma_2'} - \frac{d_{2i}\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma). \tag{18}
\end{align*}
By Lemma 2(iii) and Lemma 3, we deduce
\[
\|g(w, \hat\gamma) - g(w, \gamma) - G(w, \hat\gamma - \gamma)\| = O_p\big(\|\hat\gamma - \gamma\|_\infty^2\big) = o_p\Big(\tfrac{1}{\sqrt N}\Big),
\]
given our assumption that $\|\hat\gamma - \gamma\|_\infty = o_p\big(N^{-1/4}\big)$.

Next, the stochastic equicontinuity condition
\[
\frac{1}{\sqrt N}\sum_{i=1}^N\left(G(w_i, \hat\gamma - \gamma) - \int G(w_i, \hat\gamma - \gamma)\,dP(w_i)\right) = o_p(1) \tag{19}
\]
is guaranteed by Assumptions 8 and 9. Specifically, $\hat\gamma - \gamma$ belongs to a Donsker class of functions by the smoothness assumption, while $1/\gamma_k'(x_{ik}) \leq 1/c$ guarantees that $G(w_i, \cdot)$ is square-integrable, so that $G(w_i, \cdot)$ is also Donsker and thus (19) holds.

Now, write $\zeta_i := (x_i, z_i)$ so that $w_i = (y_i, \zeta_i, d_i)$. Then we have
\begin{align*}
\int G(w_i, \hat\gamma - \gamma)\,dP(w_i) &= \int -z_i\left(\frac{d_{1i}\alpha_2}{\gamma_2'} - \frac{d_{2i}\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i, d_i) \\
&= \int -z_i\left(\left[\int d_{1i}\,dP(d_i|\zeta_i)\right]\frac{\alpha_2}{\gamma_2'} - \left[\int d_{2i}\,dP(d_i|\zeta_i)\right]\frac{\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i) \\
&= \int -z_i\left(\frac{\lambda_1(\zeta_i)\alpha_2}{\gamma_2'} - \frac{\lambda_2(\zeta_i)\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i).
\end{align*}
By Proposition 4 of Newey (1994), with
\[
\varphi(w_i) := -\left(\frac{\lambda_1\alpha_2}{\gamma_2'} - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)(d_{1i} - d_{2i}),
\]
we have
\[
-z_i\left(\frac{\lambda_1\alpha_2}{\gamma_2'} - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)(1, -1)\begin{pmatrix} d_{1i}\big(y_i - \gamma_1(x_{1i})\big) \\ d_{2i}\big(y_i - \gamma_2(x_{2i})\big)\end{pmatrix} \equiv \varphi(w_i)\, z_i\, \epsilon_i,
\]
and by Assumption 10,
\[
\int G(w, \hat\gamma - \gamma)\,dP(w) = \frac{1}{N}\sum_{i=1}^N \varphi(w_i)\, z_i\, \epsilon_i + o_p\Big(\tfrac{1}{\sqrt N}\Big).
\]
Hence, by Lemma 5.4 of Newey (1994),
\[
\frac{1}{\sqrt N}\sum_{i=1}^N g(w_i, \hat\gamma) = \frac{1}{\sqrt N}\sum_{i=1}^N\big[g(w_i, \gamma) + \varphi(w_i)z_i\epsilon_i\big] + o_p(1) \xrightarrow{d} N(0, \Omega),
\]
where
\[
\Omega := \mathrm{Var}\big[g(w_i, \gamma) + \varphi(w_i)z_i\epsilon_i\big] = E\Big[z_iz_i'\big(u_i + [1 + \varphi(w_i)]\epsilon_i\big)^2\Big].
\]
Lastly, by Lemma 1,
\[
\left|\frac{1}{N}\sum_{i=1}^N z_i(\hat x_{ik} - x_{ik})\right| \leq \frac{1}{N}\sum_{i=1}^N |z_i|\,|\hat x_{ik} - x_{ik}| \leq O_p(a_N)\cdot\frac{1}{N}\sum_{i=1}^N|z_i| = O_p(a_N) = o_p(1),
\]
and thus
\[
\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i' = E[z_ix_i'] + \frac{1}{N}\sum_{i=1}^N z_i(\tilde x_i - x_i)' + \frac{1}{N}\sum_{i=1}^N\big(z_ix_i' - E[z_ix_i']\big) = E[z_ix_i'] + O_p(a_N) + O_p\Big(\tfrac{1}{\sqrt N}\Big) \xrightarrow{p} \Sigma_{zx} := E[z_ix_i'].
\]
Therefore,
\[
\sqrt N(\hat\alpha - \alpha) = \left(\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N g(w_i, \hat\gamma) \xrightarrow{d} N\big(0,\; \Sigma_{zx}^{-1}\,\Omega\,\Sigma_{zx}'^{-1}\big).
\]
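The imputation object $\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))$ that drives the proof above can be illustrated with a minimal numerical sketch. The design below is an illustrative log-linear specification (not the paper's empirical model), and for clarity the functions $\gamma_1, \gamma_2$ are treated as known rather than estimated nonparametrically in Steps 1-2; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative design: both log inputs are strictly increasing in the common
# shock u, so they are functionally dependent, as required for the matching.
a1, a2 = 0.3, 0.4
u = rng.normal(size=n)
x1 = 0.8 * u                          # log input 1
x2 = 0.8 * u + np.log(a2 / a1)        # log input 2 (tied to x1 via optimality)
d = rng.integers(1, 3, size=n)        # input-based sampling: which input observed

# With gamma_j(x_j) = -log(alpha_j) + x_j (instruments z_i suppressed), the
# latent input is imputed by matching on the conditionally expected outcome:
# x_ik = gamma_k^{-1}(gamma_j(x_ij)).
g1 = lambda x: -np.log(a1) + x
g2 = lambda x: -np.log(a2) + x
g1_inv = lambda v: v + np.log(a1)
g2_inv = lambda v: v + np.log(a2)

x2_imputed = np.where(d == 1, g2_inv(g1(x1)), x2)   # x2 is latent when d_i = 1
x1_imputed = np.where(d == 2, g1_inv(g2(x2)), x1)   # x1 is latent when d_i = 2

# With the true gammas, the imputation recovers the latent inputs exactly;
# with estimated gammas, Lemma 1 bounds the error by the rate O_p(a_n).
assert np.allclose(x1_imputed, x1)
assert np.allclose(x2_imputed, x2)
```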
Proof.
Assumption 11(i) guarantees that N ∼ N ∼ N so that (cid:107) ˆ γ − γ (cid:107) ∞ ∼ (cid:107) ˆ γ − γ (cid:107) ∞ = O p ( a N )where, by Assumption 11(ii)-(v) and Theorem 8 of Hansen (2008), a N = b p + √ log N √ N b . With b chosen according to Assumption 11(vi) so that √ log N √ Nb = o (cid:16) N − (cid:17) and √ N b p →
0, implying that a N = o (cid:16) N − (cid:17) + o (cid:16) N − (cid:17) = o (cid:16) N − (cid:17) , verifying Assumption 9(ii). Assumption 10 (and consequently Proposition 2) followsfrom Theorem 8.11 of Newey and McFadden (1994).Since ˆ ϕ p −→ ϕ and ˆ ϕ ∗ p −→ ϕ ∗ , Proposition 1 then follows from Theorem 8.13 ofNewey and McFadden (1994). B.4 An Alternative and More Efficient Estimator ˆ α ∗ The estimator ˆ α proposed in the main text is defined by an IV estimator of theregression equation y i = α + α x i + α x i + u i + (cid:15) i , E [ u i + (cid:15) i | z i ] = 0in Step 3, where the left-hand side is the raw outcome variable y i . Alternatively, withSteps 1 and 2 unchanged, we may construct a slightly different estimator ˆ α ∗ for α based on the conditionally expected outcome as described below.51 tep 3* : Estimate the following equation y i = α + α x i + α x i + u i , E [ u i | z i ] = 0 , (20)with the outcome variable given by y i := F ( u i , z i ) = γ ( x i , z i ) = γ ( x i , z i ) , replaced by its plug-in estimator˜ y i := ˆ γ ( x i , z i ) , for d i = 1 , ˆ γ ( x i , z i ) , for d i = 2 , Again using z i as IVs, estimate α byˆ α ∗ := (cid:32) n n (cid:88) i =1 z i ˜ x i (cid:33) − (cid:32) n n (cid:88) i =1 z i ˜ y i (cid:33) . The difference between ˆ α and ˆ α ∗ lies in the outcome variable being used for theIV regression: ˆ α is based on the raw output y i , while ˆ α ∗ is based on the estimatedconditionally expected output y i . As we will show below, ˆ α ∗ is in fact asymptoticallymore efficient than ˆ α . Theorem 3 (Asymptotic Normality of ˆ α ∗ ) . Define g ∗ ( w i , ˜ α, ˜ γ ) := z i (cid:0) ˜ γ ( x i ) − ˜ α − ˜ α x i − ˜ α ˜ γ − (˜ γ ( x i )) (cid:1) for d i = 1 ,z i (cid:0) ˜ γ ( x i ) − ˜ α − ˜ α x i − ˜ α ˜ γ − (˜ γ ( x i )) (cid:1) for d i = 2 , and g ∗ ( w i , ˜ γ ) as well as G ∗ similarly as in Section 3.1.3. 
Define
\[
\hat\varphi^*(w_i) := \left[\hat\lambda_1\left(1 - \frac{\hat\alpha_2}{\hat\gamma_2'}\right) + \hat\lambda_2\frac{\hat\alpha_1}{\hat\gamma_1'}\right]1\{d_i = 1\}
 + \left[\hat\lambda_1\frac{\hat\alpha_2}{\hat\gamma_2'} + \hat\lambda_2\left(1 - \frac{\hat\alpha_1}{\hat\gamma_1'}\right)\right]1\{d_i = 2\}.
\]
Under Assumptions 1-10 with
$G, \varphi$ replaced by $G^*, \varphi^*$ whenever applicable,
\[
\sqrt N\big(\hat\alpha^* - \alpha\big) \xrightarrow{d} N(0, \Sigma^*),
\]
where $\Sigma^* := \Sigma_{zx}^{-1}\,\Omega^*\,\Sigma_{xz}^{-1}$ and $\Omega^* := E\big[z_iz_i'\big(u_i + \varphi^*(w_i)\epsilon_i\big)^2\big]$.

The proof is very similar to that of Theorem 2 and is presented in Appendix B.5. Next, we compare the asymptotic variances of $\hat\alpha^*$ and $\hat\alpha$, and show that $\hat\alpha^*$ is in fact asymptotically more efficient.

Theorem 4 ($\hat\alpha^*$ is Asymptotically More Efficient than $\hat\alpha$). $\Omega - \Omega^*$ is positive definite, i.e., $\hat\alpha^*$ is asymptotically more efficient than $\hat\alpha$.

The proof is in Appendix B.6. Here we discuss the intuition of Theorem 4. The error term for the IV regression with the raw outcome $y_i$ as the left-hand-side variable is $u_i + \epsilon_i$, which has a larger variance than the corresponding error term $u_i$ if the conditionally expected outcome $\bar y_i$ is used instead. Even though we do not observe $\bar y_i$ and must use an estimator $\tilde y_i = \hat\gamma_1(x_{1i})$ or $\tilde y_i = \hat\gamma_2(x_{2i})$, the impact of the first-stage estimation error (which can be loosely thought of as an average of $\epsilon_i$ across $i$) is smaller than the impact of $\epsilon_i$ itself.

To see this more clearly, first consider the multiplier "$1 + \varphi(w_i)$" in $\Omega$: the "1" comes from the one "raw" share of error $\epsilon_i$ embedded in each $y_i$ that we use as the outcome variable, while "$\varphi(w_i)$" essentially captures the share of influence of the first-step estimation error $\hat\gamma - \gamma$ due to $\epsilon_i$. Together, we have
\[
1 + \varphi = \left(1 - \frac{\lambda_1\alpha_2}{\gamma_2'} + \frac{\lambda_2\alpha_1}{\gamma_1'}\right)1\{d_i = 1\}
 + \left(\frac{\lambda_1\alpha_2}{\gamma_2'} + 1 - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)1\{d_i = 2\},
\]
while the corresponding multiplier $\varphi^*$ on $\epsilon_i$ in $\Omega^*$ is essentially the same, except that "$1 - \frac{\lambda_1\alpha_2}{\gamma_2'}$" becomes "$\lambda_1 - \frac{\lambda_1\alpha_2}{\gamma_2'}$" and "$1 - \frac{\lambda_2\alpha_1}{\gamma_1'}$" becomes "$\lambda_2 - \frac{\lambda_2\alpha_1}{\gamma_1'}$". Since $\lambda_1, \lambda_2 < 1$, the multiplier on $\epsilon_i$ becomes smaller in magnitude. (Note that $\alpha_1/\gamma_1' \leq 1$ and $\alpha_2/\gamma_2' \leq 1$.) Essentially, by using the estimated conditional expected output $\tilde y_i$, the raw "1" share of $\epsilon_i$ in $y_i$ is moved into the first-stage estimation error of $\bar y_i$, which is then "averaged" and reduced in magnitude to $\lambda_1$ or $\lambda_2$, thus leading to a smaller overall variance.

Lastly, we emphasize that the efficiency comparison in Theorem 4 does not directly relate to the theory of semiparametric efficiency bounds, such as in Ackerberg et al. (2014): estimators based on $y_i$ and $\tilde y_i$ may each attain their corresponding semiparametric efficiency bounds with respect to their different criterion functions $g$ and $g^*$. Theorem 4, however, is a comparison across the two criterion functions $g$ and $g^*$: it essentially states that the asymptotically efficient estimator under $g^*$ is even more efficient than the efficient estimator under $g$.

B.5 Proof of Theorem 3
Proof.
We adapt the proof of Theorem 2 above with
\begin{align*}
g^*(w_i, \hat\alpha, \hat\gamma) &:= z_i\big(\hat\gamma_j(x_{ij}) - \hat\alpha_0 - \hat\alpha_j x_{ij} - \hat\alpha_k\,\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big), \\
g^*(w_i, \hat\gamma) &:= z_i\big(\hat\gamma_j(x_{ij}) - \alpha_0 - \alpha_j x_{ij} - \alpha_k\,\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big),
\end{align*}
with
\[
E[g^*(w_i, \gamma)] = E\big[z_i\big(\gamma_j(x_{ij}) - \alpha_0 - \alpha_j x_{ij} - \alpha_k\,\gamma_k^{-1}(\gamma_j(x_{ij}))\big)\big] = E[z_iu_i] = 0
\]
and $\frac{1}{N}\sum_{i=1}^N g^*(w_i, \hat\alpha^*, \hat\gamma) = 0$.

By the chain rule,
\begin{align*}
G^*(w_i, \hat\gamma - \gamma) &:= \nabla_\gamma g^*(w_i, \gamma)[\hat\gamma - \gamma]
 = z_i\Big(\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\big] - \alpha_k\,\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\hat\gamma - \gamma]\Big) \\
&= z_i\left(1 - \frac{\alpha_k}{\gamma_k'(x_{ik})}\right)\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\big] + z_i\,\frac{\alpha_k}{\gamma_k'(x_{ik})}\big[\hat\gamma_k(x_{ik}) - \gamma_k(x_{ik})\big] \\
&= z_i\left[d_{1i}\left(1 - \frac{\alpha_2}{\gamma_2'},\; \frac{\alpha_2}{\gamma_2'}\right) + d_{2i}\left(\frac{\alpha_1}{\gamma_1'},\; 1 - \frac{\alpha_1}{\gamma_1'}\right)\right](\hat\gamma - \gamma),
\end{align*}
and
\[
\int G^*(w_i, \hat\gamma - \gamma)\,dP(w_i) = \int z_i\left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'},\;\; \lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)(\hat\gamma - \gamma)\,dP(\zeta_i).
\]
By Proposition 4 of Newey (1994), with
\[
\varphi^*(w_i) := \left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'}\right)d_{1i}
 + \left(\lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)d_{2i},
\]
we have
\[
z_i\left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'},\;\; \lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)\begin{pmatrix} d_{1i}\big(y_i - \gamma_1(x_{1i})\big) \\ d_{2i}\big(y_i - \gamma_2(x_{2i})\big)\end{pmatrix} \equiv \varphi^*(w_i)\, z_i\, \epsilon_i,
\]
and by Assumption 10,
\[
\int G^*(w, \hat\gamma - \gamma)\,dP(w) = \frac{1}{N}\sum_{i=1}^N \varphi^*(w_i)\, z_i\, \epsilon_i + o_p\Big(\tfrac{1}{\sqrt N}\Big).
\]
Hence, we have
\[
\frac{1}{\sqrt N}\sum_{i=1}^N g^*(w_i, \hat\gamma) = \frac{1}{\sqrt N}\sum_{i=1}^N\big[g^*(w_i, \gamma) + \varphi^*(w_i)z_i\epsilon_i\big] + o_p(1) \xrightarrow{d} N(0, \Omega^*),
\]
where
\[
\Omega^* := \mathrm{Var}\big[g^*(w_i, \gamma) + \varphi^*(w_i)z_i\epsilon_i\big] = E\Big[z_iz_i'\big(u_i + \varphi^*(w_i)\epsilon_i\big)^2\Big],
\]
giving
\[
\sqrt N\big(\hat\alpha^* - \alpha\big) = \left(\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N g^*(w_i, \hat\gamma) \xrightarrow{d} N\big(0,\; \Sigma_{zx}^{-1}\,\Omega^*\,\Sigma_{zx}'^{-1}\big).
\]
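The algebraic relation between the two error multipliers, $1 + \varphi = \varphi^* + (1-\lambda_1)d_{1i} + (1-\lambda_2)d_{2i}$, which drives the efficiency ranking in Theorem 4 below, can be checked numerically. The values of $\lambda_1, \lambda_2$ and of the ratios $\alpha_j/\gamma_j'$ (written `r1`, `r2`) in this sketch are purely illustrative.

```python
import numpy as np

# Numerical check of the multiplier identity behind the efficiency comparison:
# 1 + phi = phi* + (1 - lambda_1) d_1 + (1 - lambda_2) d_2, with lambda_j < 1
# and 0 < alpha_j / gamma_j' < 1 (denoted r1, r2 here).
lam1, lam2 = 0.6, 0.4          # illustrative sampling probabilities
r1, r2 = 0.35, 0.45            # illustrative alpha_1/gamma_1', alpha_2/gamma_2'

for d1, d2 in [(1, 0), (0, 1)]:
    phi = -(lam1 * r2 - lam2 * r1) * (d1 - d2)
    phi_star = (lam1 * (1 - r2) + lam2 * r1) * d1 \
             + (lam1 * r2 + lam2 * (1 - r1)) * d2
    # The identity and the ordering 0 < phi* < 1 + phi hold for both d_i cases,
    # hence (1 + phi)^2 > (phi*)^2, which is the content of Theorem 4.
    assert np.isclose(1 + phi, phi_star + (1 - lam1) * d1 + (1 - lam2) * d2)
    assert 0 < phi_star < 1 + phi
```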
Proof.
By (7), we have ∂∂c γ j ( c ; z ) = α j + α k x (cid:48) k x (cid:48) j + 1 x (cid:48) j > α j , and thus 0 < α j /γ (cid:48) j <
1, which implies λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) > , λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) > . ϕ ∗ = (cid:18) λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) (cid:19) d i + (cid:18) λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) (cid:19) d i >
01 + ϕ = 1 − (cid:18) α γ (cid:48) λ − α γ (cid:48) λ (cid:19) ( d i − d i )= (cid:18) − λ α γ (cid:48) + λ α γ (cid:48) (cid:19) d i + (cid:18) − λ α γ (cid:48) + λ α γ (cid:48) (cid:19) d i = ϕ ∗ + (1 − λ ) d i + (1 − λ ) d i > ϕ ∗ > . Hence, (1 + ϕ ) > ϕ ∗ > − Ω ∗ = E (cid:104) z i z (cid:48) i (cid:2) (1 − ϕ ( x i , d i )) − ϕ ∗ ( x i , d i ) (cid:3) (cid:15) i (cid:105) is positive definite. C Robustness Check for First Application
Although most pharmacies in our sample have one manager and one pharmacist, there are a few pharmacies with more than one employee pharmacist. For this subset of pharmacies, we compute the total hours worked by employee pharmacists by multiplying the reported hours worked of one employee by the number of employees. The second imputation step is then applied based on the total hours worked by all employees. In this process, we implicitly assume that the labor hours of two different employees are perfect substitutes. As a robustness check, we also estimate a version of the production function in which the elasticity of substitution between the hours worked by different employees is equal to one. Table 12 summarizes this version of the estimation results. The estimated parameters show that employees become slightly less productive at both independents and chains compared to our baseline estimation, but in general our estimation results are robust to how we treat employee inputs from pharmacies with more than one employee.

Table 12: Using $N \ast \log(x)$ instead of $\log(N \ast H)$. [Estimates of $\alpha_0$, $\alpha_1$, $\alpha_2$ for independent and chain pharmacies, each based on observed and expected outputs.]

D Inter Vivos Gifts
Consider an example with a married couple and two parental households, $j = 1, 2$, with wealth levels $m_1$ and $m_2$, which is based on Bergstrom, Blume, and Varian (1986). Parents are altruistic toward their married offspring but not toward that offspring's spouse. Parental household $j$ has utility
\[
u_j(g_j) = \ln(m_j - g_j) + \mu\ln(g_1 + g_2),
\]
where $g_j$ is the married couple's gift from parental household $j$ and $\mu$ is the probability that both parental households think the children's marriage will endure. This leads to a noncooperative game between the two parental households, since the incentive for either household to give to the offspring couple diminishes as the other parental household gives more. This is a game of strategic substitutes. The Nash equilibrium of this game between the two parental households is
\[
g_1^* = \frac{(1+\mu)m_1 - m_2}{2+\mu}, \qquad g_2^* = \frac{(1+\mu)m_2 - m_1}{2+\mu}.
\]
There is a unique Nash equilibrium for any $\mu$ and for any wealth levels of the two households that are not "too" different. Both $g_1^*$ and $g_2^*$ are strictly increasing in the shock $\mu$, and hence the outcome is strictly increasing in $\mu$.
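The equilibrium formulas and the comparative statics in $\mu$ can be verified numerically: each household's first-order condition is $1/(m_j - g_j) = \mu/(g_1 + g_2)$, and total gifts equal $\mu(m_1 + m_2)/(2 + \mu)$. The wealth levels and values of $\mu$ below are illustrative.

```python
import numpy as np

# Illustrative wealths and shock; wealths are not "too" different, so the
# equilibrium is interior (both gifts strictly positive).
mu, m1, m2 = 0.8, 10.0, 8.0
g1 = ((1 + mu) * m1 - m2) / (2 + mu)
g2 = ((1 + mu) * m2 - m1) / (2 + mu)

assert g1 > 0 and g2 > 0                           # interior equilibrium
assert np.isclose(1 / (m1 - g1), mu / (g1 + g2))   # FOC of household 1
assert np.isclose(1 / (m2 - g2), mu / (g1 + g2))   # FOC of household 2

# Total gifts g1 + g2 = mu * (m1 + m2) / (2 + mu) are strictly increasing
# in the shock mu, as claimed in the text.
mus = np.linspace(0.1, 2.0, 20)
total = mus * (m1 + m2) / (2 + mus)
assert np.all(np.diff(total) > 0)
```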