Estimating Production Functions with Partially Latent Inputs
Using Monotonicity Restrictions to Identify Models with Partially Latent Covariates∗

Minji Bang, Wayne Gao, Andrew Postlewaite, and Holger Sieg

University of Pennsylvania

January 18, 2021
Abstract
This paper develops a new method for identifying econometric models with partially latent covariates. Such data structures arise naturally in industrial organization and labor economics settings where data are collected using an "input-based sampling" strategy, e.g., if the sampling unit is one of multiple labor input factors. We show that the latent covariates can be nonparametrically identified if they are functions of a common shock satisfying some plausible monotonicity assumptions. With the latent covariates identified, semiparametric estimation of the outcome equation proceeds within a standard IV framework that accounts for the endogeneity of the covariates. We illustrate the usefulness of our method using two applications. The first focuses on pharmacies: we find that production function differences between chains and independent pharmacies may partially explain the observed transformation of the industry structure. Our second application investigates education achievement functions and illustrates important differences in child investments between married and divorced couples.

Keywords: production functions, latent variables, endogeneity, semiparametric estimation, instrumental variables, matching.

∗ We would like to thank Xu Cheng, Aureo de Paula, Ulrich Doraszelski, Amit Gandhi, Claudia Goldin, Aviv Nevo, Dan Silverman, Petra Todd, and seminar participants at numerous universities for comments and suggestions. Postlewaite and Sieg acknowledge support from the National Science Foundation.

Introduction
This paper develops a new method for identifying econometric models with partially latent covariates. We show that a broad class of econometric models that play a large role in industrial organization and labor economics can be nonparametrically identified if the partially latent covariates satisfy certain monotonicity assumptions. Examples that fall into this class of models are a variety of different production, skill formation, and achievement functions. It is often plausible to assume that the different inputs or explanatory variables are functions of a common unobserved random shock, and we consider models in which it is natural to impose strict monotonicity in this common shock. The monotonicity assumption imposes some strong functional dependencies on the explanatory variables, as pointed out in the context of production function estimation by Ackerberg, Caves, and Frazer (2015). The key insight of this paper is that we can leverage the functional dependence between inputs to achieve identification within a partially latent covariate framework. In that sense, we turn the functional dependence problem on its head to impute the partially latent covariates. Broadly speaking, our imputation is in the spirit of matching algorithms (Rubin, 1973). In contrast to traditional matching algorithms, we propose to match on the expected dependent variable to impute missing covariates.

The partially latent data structure that we study in this paper arises quite naturally in many potential applications of our technique if one employs an "input-based sampling" strategy, i.e., if the sampling unit is one of multiple labor input factors. These types of data sets are becoming more prevalent in modern econometrics since researchers have come to rely on unstructured or semi-structured data sets. Consider, for example, a production team in which team members perform different tasks. Let us assume that the researcher interviews one member from each team to provide the data.
It is plausible that this person knows the team's output, but does not have complete information about the other team members' input choices. Other potential applications in applied microeconomics are discussed in the conclusions.

Note that this monotonicity assumption is commonly used, for example, in the production function literature, as discussed by Olley and Pakes (1996). In particular, this assumption does not require that inputs are "optimally" chosen by competitive firms and is consistent with a broad class of strategic and non-strategic models that may describe the agents' behavior. Note also that we do not apply the matching approach within the standard potential outcome framework of program evaluation, which is based on the potential outcome model developed by Fisher (1935). For a discussion of the properties of matching estimators in that context see, among others, Rosenbaum and Rubin (1983), Heckman, Ichimura, Smith, and Todd (1998), and Abadie and Imbens (2006).

We show that we can combine our identification results with a variety of linear, nonlinear, and semiparametric estimation strategies. In that sense our approach is flexible and allows researchers to make appropriate functional form assumptions if necessary. To illustrate the key issues that are encountered in estimation, we consider the scenario in which researchers only have access to a single cross-section of data and rely on instrumental variables for estimation. For example, production function estimation relies on the assumption that differences in local input prices give rise to differences in input choices that are uncorrelated with productivity shocks at the local level. Similarly, skill formation and achievement function estimation requires the choice of suitable instruments for parental inputs.

Estimation proceeds in two steps. In finite samples, we first nonparametrically estimate the latent input functions. Plugging the estimators into our production, skill formation, or achievement function, we then estimate the outcome equation within a standard IV framework that accounts for the endogeneity of the input choices.

In the context of production function estimation, this endogeneity problem is referred to as the transmission bias problem, since inputs are correlated with unobserved productivity shocks (Marschak and Andrews, 1944). Hence we cannot address this endogeneity problem using panel data with fixed effects, first advocated by Hoch (1955, 1962) and Mundlak (1961, 1963). Nor can we use more sophisticated timing assumptions within control function or IV frameworks as discussed, for example, in Olley and Pakes (1996), Blundell and Bond (1998, 2000), Levinsohn and Petrin (2003), and Ackerberg, Caves, and Frazer (2015). We discuss the extension of our methods to this scenario in the conclusions.

Local input prices can serve as valid instruments for endogenous input choices. See Griliches and Mairesse (1998) for a critical discussion of the assumption that these input prices are exogenous. For a more general discussion of the issues encountered in estimating achievement and skill formation functions see, among others, Todd and Wolpin (2003) and Cunha, Heckman, and Schennach (2010).
Identification of Partially Latent Covariates
Consider the following cross-sectional econometric model

y_i = F(x_{i1}, x_{i2}, u_i) + ε_i,   (1)

where i = 1, ..., N indexes a generic observation from a random sample, y_i denotes an observable scalar-valued outcome variable, and x_i := (x_{i1}, x_{i2}) denotes a two-dimensional vector of covariates. Both u_i and ε_i are scalar-valued unobserved errors, with u_i taken to be a "structural error" that is endogenous with respect to x_i, while ε_i is a "measurement error" that is assumed to be exogenous. The unknown outcome function F may be either parametric or nonparametric.

First, we need to define what we mean by partially latent covariates, a key data structure that we explore in this paper.

Assumption 1 (Partially Latent Covariates). For each observation i, the econometrician either observes x_{i1} or x_{i2}, but never both.

Essentially, one of the two covariates (x_{i1}, x_{i2}) is latent in each observation in the data. In the following, it will be convenient to write

d_i := 1, if x_{i1} is observed and x_{i2} is latent,
d_i := 2, if x_{i2} is observed and x_{i1} is latent,

so that effectively (d_i, (2 − d_i) x_{i1}, (d_i − 1) x_{i2}) is observed for each i. Such data structures often arise when the data is collected at the individual level while we are interested in some firm, household, or team level outcome variable that also depends on other individuals who are not surveyed in the data. These types of unstructured data sets are becoming increasingly prevalent in empirical work, as we discuss in detail below. In this section we provide just one application that we use as the leading example to illustrate the main concepts.

Example (Team Production Functions). Our first application, studied in Section 4, focuses on team production in pharmacies.

See Corollary 1 for the extension of our identification method to settings with covariates of higher dimensions.
For simplicity, let us assume a log-linear Cobb-Douglas specification:

y_i = α_0 + α_1 x_{i1} + α_2 x_{i2} + u_i + ε_i,   (2)

where y_i is the logarithm of the team's output, x_{i1} is the logarithm of hours worked by the first team member (a manager), and x_{i2} is the logarithm of hours worked by the second team member (an employee). The data structure described in Assumption 1 arises if the researcher interviews only one member, and not both members, of the team. We also refer to this technique as an "input-based sampling" approach. It is plausible that the interviewed team member knows the team's output, but does not have complete information about the other team member's input choices. Hence, the surveyed person provides the output level, y_i, and her own hours worked, x_{i1} or x_{i2}, leading to the problem of partially latent inputs as defined in Assumption 1.

The next assumption imposes a monotonicity condition on the outcome function.

Assumption 2 (Monotonicity of the Outcome Function). F is nondecreasing in all of its arguments and is strictly increasing in at least one of its arguments.

This assumption essentially states that the covariates or inputs (x_{i1}, x_{i2}) and the structural error u_i have nonnegative effects on the outcome variable y_i. Moreover, the monotonicity is strict in at least one of the three arguments x_{i1}, x_{i2}, and u_i. The restriction of monotonicity with respect to (x_{i1}, x_{i2}) is substantive: it requires that the covariates cannot negatively affect the outcome variable holding everything else fixed. In contrast, the restriction of monotonicity with respect to u_i is largely innocuous given the interpretation of u_i as a (weakly) "positive shock".

Example (Team Production Functions Continued).
Assumption 2 is satisfied in the linear additive model in equation (2) provided that the model satisfies the additional parameter restriction that α_1, α_2 ≥ 0.

We use the term "team production function" since we largely focus on different types of labor inputs and abstract from capital or other inputs that may be subject to dynamics and adjustment costs. The team production concept is also related to the concept of task production functions, which are surveyed by Acemoglu and Autor (2011). Haanwinckel (2018) estimates a task production function in which each team member specializes in a single task.

We next impose assumptions on the unobserved errors u_i and ε_i in equation (1). First, we assume that the endogenous covariates x_i are strictly monotone functions of the scalar structural error u_i, potentially after conditioning on a set of observed covariates z_i that may affect the covariates x_i.

Assumption 3 (Strict Monotonicity of the Covariates in the Structural Error). There exists a vector of additional observed covariates z_i and two deterministic, real-valued functions h_1, h_2, such that

x_{i1} = h_1(u_i, z_i),   x_{i2} = h_2(u_i, z_i),

with both h_1(u_i, z_i) and h_2(u_i, z_i) strictly increasing in their first argument u_i for every realization of z_i.

We note that the functions h_1 and h_2 can be unknown and nonparametric. Moreover, Assumption 3 does not require z_i to be exogenous; in other words, z_i and u_i are allowed to be statistically dependent. The only requirement here is that, after conditioning on z_i, the covariates x_{i1} and x_{i2} can be written as deterministic monotone functions of the error u_i. Such a "monotonicity-in-a-scalar-error" assumption has been widely used in the econometric literature on identification analysis.

Example (Team Production Functions Continued). In the IO literature, u_i is typically interpreted as a "productivity shock" that enters into the choices of inputs x_i.
In contrast, ε_i captures either a measurement error or a productivity shock that does not affect inputs, since it is not observed by the firms when input choices are made. Assumption 3 requires that the input choice functions are strictly increasing in the "productivity shock" u_i, conditional on any additional observed covariates z_i that may influence input choices, as suggested, for example, by Olley and Pakes (1996) and others. For concreteness, we take z_i to be local wages for managers and employees. The monotonicity of input choices in the unobserved productivity shock can be further micro-founded in a variety of settings based on efficiency or equilibrium criteria. For example, Assumption 3 is automatically satisfied if competitive firms choose inputs optimally: h_1 and h_2 are then characterized by the relevant first-order conditions and have simple closed-form formulas that are linear and increasing in u_i and decreasing in z_i. More generally, one may use the theory of monotone comparative statics to obtain more primitive conditions for input monotonicity, which typically involve various forms of increasing-difference or single-crossing conditions: see, for example, Milgrom and Shannon (1994) and Vives (2000) for formal statements.

See Matzkin (2007) for a general survey, and see Ackerberg, Caves, and Frazer (2015) in the specific context of production function identification, which fits into our working example (2). This is a standard assumption that underlies most, if not all, existing approaches to production function estimation in one way or another: see, for example, Griliches and Mairesse (1998) and Ackerberg, Caves, and Frazer (2015) for reviews of the relevant literature.
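To make the first-order-condition argument in the example concrete, consider a price-taking firm with the Cobb-Douglas technology underlying equation (2). This is an illustrative sketch only: the output price p and the interpretation of z_i as the wage pair (w_1, w_2) are assumptions for the example, together with decreasing returns α_1 + α_2 < 1.

```latex
% Profit maximization of a price-taking firm:
\max_{L_1,L_2}\; p\,e^{u_i}L_1^{\alpha_1}L_2^{\alpha_2} - w_1 L_1 - w_2 L_2 .
% The two first-order conditions are log-linear in (\log L_1, \log L_2);
% solving them gives the manager's log input demand:
x_{i1} = \log L_1
  = \frac{u_i + (1-\alpha_2)\log\left(\alpha_1 p/w_1\right)
        + \alpha_2\log\left(\alpha_2 p/w_2\right)}{1-\alpha_1-\alpha_2},
\qquad
\frac{\partial x_{i1}}{\partial u_i} = \frac{1}{1-\alpha_1-\alpha_2} > 0 .
```

The symmetric expression holds for x_{i2}; both input demands are linear and strictly increasing in u_i and decreasing in log wages, exactly the shape of h_1 and h_2 invoked in the text.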
Essentially, in settings where input choices are made by a single decision maker, such as under perfect competition or monopsony, we would need the marginal values of inputs to be increasing in the productivity shock u_i, which is a mild condition to impose given our interpretation of u_i as a "productivity shock". In settings where the input choices are generated as equilibria of a strategic game between two decision makers, an additional assumption of strategic complementarity is typically sufficient for monotonicity. For games with strategic substitutability, we would further need a condition to ensure that the extent of strategic substitutability is not overwhelming: see Roy and Sabarwal (2010) for general results, and our Appendix D for an example where Assumption 3 is satisfied under strategic substitutability.

Next we formalize the required exogeneity condition on the measurement error ε_i.

Assumption 4 (Exogeneity of the Measurement Error). E[ε_i | x_i, z_i, d_i] = 0.

Note that, under Assumption 3, conditioning on (x_i, z_i, d_i) is equivalent to conditioning on (u_i, z_i, d_i). In the production function estimation literature without the partial latency problem, E[ε_i | u_i, z_i] = 0 is a standard assumption imposed on ε_i. In our current setting, we are requiring that ε_i is furthermore exogenous with respect to the partial latency indicator variable d_i.

It is worth noting that this paper is both conceptually and technically different from previous work on missing data in linear regression and, more generally, GMM estimation settings, such as Rubin (1976), Little (1992), Robins, Rotnitzky, and Zhao (1994), Wooldridge (2007), Graham (2011), Chaudhuri and Guilkey (2016), Abrevaya and Donald (2017), and McDonough and Millimet (2017). This line of literature typically relies on some form of "missing-at-random" assumption. See Appendix A for details.
We note that the problem of partially latent inputs is less relevant in that case, since the "reduced-form" regression of the observed inputs on the exogenous wages w_i will indirectly recover the production function parameters α. This corresponds to the "duality approach" to production function estimation, as discussed in detail in Griliches and Mairesse (1998). However, an attractive feature of our approach is that we can test whether inputs are optimally chosen. If we reject the null hypothesis that inputs are optimal, our estimator is still feasible while duality estimators are not.

In our setting, by contrast, d_i is allowed to be correlated with other observables as well as the unobserved productivity shock. Instead, we will be relying on monotonicity conditions to identify and impute the latent input. Specifically, Assumption 4 simply requires that ε_i is a "measurement error" term that is exogenous with respect to the observables and consequently the "productivity shock" u_i, but it does not impose any restriction on the dependence structure between the partial latency indicator d_i and the other structural components of the model (u_i, x_i, z_i).

However, we do require the following very mild condition on the variable d_i.

Assumption 5 (Nondegenerate Latency Probabilities). 0 < P{d_i = 1 | u_i, z_i} < 1.

Assumption 5 guarantees that, conditioning on realizations of (u_i, z_i), we observe x_{i1} and x_{i2} each with strictly positive probability. Again, this assumption is much weaker than "missing-at-random" assumptions, which would usually require that P{d_i = 1 | u_i, z_i} is constant in u_i, z_i, or some other variables. In contrast, here we do not impose any restrictions on the dependence of P{d_i = 1 | u_i, z_i} on (u_i, z_i) beyond non-degeneracy.

We are now ready to present our main identification result.

Theorem 1.
Under Assumptions 1-5, for each observation i, the latent covariate, x_{i2} if d_i = 1 or x_{i1} if d_i = 2, is point identified.

Next, we provide a detailed explanation of our identification strategy. The starting point is the reduced form of our model with the measurement error term:

y_i = F̄(u_i, z_i) + ε_i,   (3)

where

F̄(u_i, z_i) := F(h_1(u_i, z_i), h_2(u_i, z_i), u_i).   (4)

Clearly, F̄(u_i, z_i) is strictly increasing in u_i given Assumptions 2 and 3.

Consider two firms i and j with z_i = z_j. In the context of our working example, we are effectively considering two firms i and j operating in the same local labor market with the same local wages. For concreteness, suppose that (x_{i1}, x_{j1}) are observed, while (x_{i2}, x_{j2}) are unobserved. Since these firms have the same value of managerial inputs, x_{i1} = x_{j1}, by Assumption 3 it must also be true that they have the same value of the productivity shock:

u_i = h_1^{-1}(x_{i1}; z_i) = h_1^{-1}(x_{j1}; z_j) = u_j,

where h_1^{-1}(·; z_i) is the inverse of h_1(·, z_i), which is well-defined by Assumption 3. This further implies that F̄(u_i, z_i) = F̄(u_j, z_j). Taking an average of y_i and y_j,

(1/2)(y_i + y_j) = F̄(u_i, z_i) + (1/2)(ε_i + ε_j),   (5)

we are essentially averaging out the variations in ε. Intuitively, if we average over outcomes of all observations that share the same x_{i1} and the same z_i, and thus the same value of u_i, then we can identify F̄(u_i, z_i).

Formally, define γ_1(c; z) as the expected output of firm i conditional on the event that x_{i1} is observed (d_i = 1) and takes a given value c, i.e.,

γ_1(c; z) := E[y_i | z_i = z, d_i = 1, x_{i1} = c].   (6)

Clearly, γ_1 is directly identified from the data given Assumptions 1 and 5, and can be nonparametrically estimated later on.
Taking a closer look at γ_1, we have, by equation (3), Assumption 3, and Assumption 4,

γ_1(c; z) = E[F̄(u_i, z_i) + ε_i | z_i = z, d_i = 1, h_1(u_i, z_i) = c]
          = F̄(h_1^{-1}(c; z), z) + E[ε_i | z_i = z, d_i = 1, u_i = h_1^{-1}(c; z)]
          = F(c, h_2(h_1^{-1}(c; z), z), h_1^{-1}(c; z)).   (7)

In fact, we could directly "match" on output y_i if there were no measurement error ε_i in output. Assumption 5 ensures that the conditioning event occurs with strictly positive probability.

In other words, by conditioning on z_i and a particular observed value of x_{i1} = c, we are effectively conditioning on the unobserved productivity shock u_i. Aggregating across observations allows us to average out the measurement errors and obtain a quantity that is implicitly a function of the productivity shock u_i = h_1^{-1}(c; z_i).

Next, we observe that γ_1(c; z) is strictly increasing in c, since

∂/∂c γ_1(c; z) = F_1 + [F_2 · ∂h_2/∂u(h_1^{-1}(c; z), z) + F_3] / [∂h_1/∂u(h_1^{-1}(c; z), z)] > 0,   (8)

because ∂h_1/∂u, ∂h_2/∂u > 0 by Assumption 3, while the partial derivatives F_1, F_2, F_3 of F are all nonnegative with at least one being strictly positive by Assumption 2. Similarly, we can define

γ_2(c; z) := E[y_i | z_i = z, d_i = 2, x_{i2} = c],

which is strictly increasing in c.

Now, the basic idea behind our identification strategy is to conditionally "match" observations on the event that

γ_1(c_1; z) = γ_2(c_2; z)   (9)

for some c_1, c_2, and z.

Example (Team Production Functions Continued). Let us consider production teams within the same local market so that wages (z_i) are constant. Equation (9) then involves two separate conditional expected output levels, one (γ_1) for teams whose manager input (x_{i1}) is observed, and the other (γ_2) for teams whose employee input (x_{i2}) is observed.
When these two expected output levels are equalized as in equation (9), we can infer that the underlying productivity shock (u_i) must be the same across all teams with either x_{i1} = c_1 observed or x_{i2} = c_2 observed. By equations (5) and (7) we know

h_1^{-1}(c_1; z_i) = h_2^{-1}(c_2; z_i) =: u,

so that

x_{i2} = h_2(u, z_i), for d_i = 1,
x_{i1} = h_1(u, z_i), for d_i = 2.

(The partial derivatives F_1, F_2, F_3 of F in equation (8) are evaluated at (c, h_2(h_1^{-1}(c; z), z), h_1^{-1}(c; z)).)

Formally, the latent covariates can be identified via a composition of γ_1, γ_2 and their inverses,

x_{i2} = γ_2^{-1}(γ_1(x_{i1}; z_i); z_i), for d_i = 1,
x_{i1} = γ_1^{-1}(γ_2(x_{i2}; z_i); z_i), for d_i = 2,   (10)

since on the right-hand side x_{i1}, x_{i2} are observed for d_i = 1, 2, respectively, and γ_1, γ_2 are nonparametrically identified functions. This completes the description of our key identification strategy as well as the proof of Theorem 1.

Remark. We have thus far focused on the case with two covariates. It is straightforward to see that our model, assumptions, and the main identification result can be easily generalized to the case with covariates of an arbitrary finite dimension D. This result is summarized by the following Corollary.

Corollary 1.
Consider the model y_i := F(x_{i1}, ..., x_{iD}, u_i) + ε_i along with Assumptions 2 and 4 unchanged, and the following modifications of the other assumptions:

(i) Assumption 1: for each i, at least one of the D covariates is observed.
(ii) Assumption 3: all D covariates are strictly increasing in u_i given z_i.
(iii) Assumption 5: all D covariates are observed with strictly positive probabilities.

Then the latent covariates are identified.

Remark. If Condition (i) in Corollary 1 is strengthened so that more than one covariate is simultaneously observed in a given observation (with positive probability), then we would also obtain over-identification, and the input-monotonicity restriction in Assumption 3 becomes empirically refutable. Alternatively, with two or more covariates simultaneously observed, we would be able to accommodate higher dimensions of unobserved shocks, provided that the dimension of the unobserved shock u_i is strictly smaller than the dimension D of the covariates. Since such an extension would be more involved and move farther away from the applications we consider in this paper, we leave it as a direction for future research.

Identification and Estimation of the Outcome Function

With the latent inputs already identified in Theorem 1, we are back to equation (1),

y_i = F(x_{i1}, x_{i2}, u_i) + ε_i,

but now we can effectively regard both x_{i1} and x_{i2} as being known, at least for identification purposes. Researchers may proceed to identify the output function F under appropriate application-specific assumptions, as in a "standard" setting without the partial latency problem. Hence, the identification of F or other objects of interest is largely "separable" from the partial latency problem, which is the key problem we are solving in this paper.
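As a concrete illustration, the matching formula in equation (10) can be implemented on simulated data. The sketch below is not the authors' code: it assumes the linear specification (2) with hypothetical parameters, a constant z_i (a single local labor market), and a simple Gaussian-kernel regression for γ̂_1, γ̂_2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the linear team-production model (2) with hypothetical parameters,
# holding z_i fixed (a single local labor market).
N = 20_000
a0, a1, a2 = 1.0, 0.4, 0.3
u = rng.uniform(0.0, 1.0, N)          # common productivity shock
x1 = 0.5 + u                          # h_1(u): strictly increasing in u
x2 = 0.2 + 2.0 * u                    # h_2(u): strictly increasing in u
y = a0 + a1 * x1 + a2 * x2 + u + rng.normal(0.0, 0.1, N)
d = rng.integers(1, 3, N)             # d_i = 1: x1 observed; d_i = 2: x2 observed

def gamma_hat(x_obs, y_obs, grid, bw=0.05):
    """Kernel estimate of E[y | x = grid] (Nadaraya-Watson, Gaussian kernel)."""
    w = np.exp(-0.5 * ((grid[:, None] - x_obs[None, :]) / bw) ** 2)
    g = w @ y_obs / w.sum(axis=1)
    return np.maximum.accumulate(g)   # enforce monotonicity so inversion is valid

grid1 = np.linspace(x1.min(), x1.max(), 200)
grid2 = np.linspace(x2.min(), x2.max(), 200)
g1 = gamma_hat(x1[d == 1], y[d == 1], grid1)   # gamma_1(c) = E[y | d=1, x1=c]
g2 = gamma_hat(x2[d == 2], y[d == 2], grid2)   # gamma_2(c) = E[y | d=2, x2=c]

# Equation (10): impute the latent input by matching expected outputs.
# gamma_k is increasing, so its inverse is interpolation with axes swapped.
x2_hat = np.interp(np.interp(x1[d == 1], grid1, g1), g2, grid2)
x1_hat = np.interp(np.interp(x2[d == 2], grid2, g2), g1, grid1)

# In this design the truth is x2 = 2*x1 - 0.8, so the imputation can be checked.
err = np.abs(x2_hat - (2.0 * x1[d == 1] - 0.8))
print(round(float(np.median(err)), 3))
```

Away from the boundary of the support, the imputed x̂_{i2} tracks the true latent input closely; the boundary bias of the kernel regression shows up at the edges, which is one reason the formal results below impose regularity conditions on the first-stage estimator.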
That said, we note that the estimation of the latent covariates will affect the estimation of (the parameters of) F based on the "plugged-in" latent covariate estimates. This section discusses how to identify and estimate F, and analyzes the impact of the "first-stage" estimation of latent inputs on the final estimator of F. While we cannot cover all relevant specifications of F, in this section we provide both identification and estimation results for the linear case, which is arguably the workhorse model, or at least a natural benchmark, in various empirical applications. We also discuss how our method can be applied in more general settings.

The Linear Model

In this subsection we focus on the linear parametric specification of F as in (2):

y_i = α_0 + α_1 x_{i1} + α_2 x_{i2} + u_i + ε_i,

where our goal is to identify and estimate the unknown parameters α := (α_0, α_1, α_2).

Identification

In the presence of the endogeneity problem between x_i := (x_{i1}, x_{i2}) and u_i, we will need instrumental variables for the identification of α. For illustrational simplicity, we impose the following standard IV assumption.

Assumption 6 (Instrumental Variables). Write z_i := (z_{i1}, z_{i2}), z̄_i := (1, z_{i1}, z_{i2})' and x̄_i := (1, x_{i1}, x_{i2})'. Assume

(i) Relevance: Σ_zx := E[z̄_i x̄_i'] has full rank.
(ii) Exogeneity: E[u_i | z_i] = 0.

Corollary 2 (Identification of Linear Parameters). Under Assumptions 1-6, α is point identified.

Example (Team Production Function Continued). In the context of our working example, we are essentially following a strategy discussed in Griliches and Mairesse (1998) and assume that we have access to instrumental variables (such as local wages) that affect input choices.
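The identification argument behind Corollary 2 is the standard IV one; with the latent covariates recovered by Theorem 1, the moment conditions can be written out explicitly (a sketch in the notation of Assumption 6):

```latex
\mathbb{E}\left[\bar z_i\left(y_i - \bar x_i'\alpha\right)\right]
  = \mathbb{E}\left[\bar z_i\,(u_i + \varepsilon_i)\right] = 0
\;\;\Longrightarrow\;\;
\Sigma_{zx}\,\alpha = \mathbb{E}\left[\bar z_i\, y_i\right]
\;\;\Longrightarrow\;\;
\alpha = \Sigma_{zx}^{-1}\,\mathbb{E}\left[\bar z_i\, y_i\right],
```

where the second equality uses E[u_i | z_i] = 0 (Assumption 6(ii)) together with the exogeneity of ε_i (Assumption 4), and the last step uses the full rank of Σ_zx (Assumption 6(i)).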
Estimation

We now turn to the more interesting problem of estimation, propose semiparametric estimators for α, and characterize their asymptotic distributions. We first describe our proposed estimator. Since the identification of the latent inputs via equation (10) is constructive, it suggests a natural estimation procedure:

Step 1 (Nonparametric Regression): obtain an estimator γ̂_1 of γ_1 by nonparametrically regressing y_i on x_{i1} and z_i among firms with d_i = 1, i.e., those with x_{i1} observed. Similarly, obtain an estimator γ̂_2 of γ_2.

Step 2 (Imputation): impute the latent inputs by plugging the nonparametric estimators γ̂_1, γ̂_2 into equation (10), i.e.,

x̂_{i2} = γ̂_2^{-1}(γ̂_1(x_{i1}; z_i); z_i), for d_i = 1,
x̂_{i1} = γ̂_1^{-1}(γ̂_2(x_{i2}; z_i); z_i), for d_i = 2.

Step 3 (IV Regression): estimate equation (2) with z̄_i as IVs for x̃_i, i.e.,

α̂ := ((1/n) Σ_{i=1}^n z̄_i x̃_i')^{-1} ((1/n) Σ_{i=1}^n z̄_i y_i),

where

x̃_i := (1, x_{i1}, x̂_{i2})', for d_i = 1,
x̃_i := (1, x̂_{i1}, x_{i2})', for d_i = 2.

In Appendix B.4, we also propose an alternative estimator α̂* that features a slightly different Step 3, leading to an asymptotic efficiency gain over α̂. Since the asymptotic theories for α̂ and α̂* are very similar, we defer the results on α̂* to the appendix.

We now establish the consistency and asymptotic normality of α̂ under the following regularity assumptions.

Assumption 7 (Finite Error Variances). E[u_i² | z_i] < ∞ and E[ε_i² | x_i, z_i, d_i] < ∞.

Assumption 8 (Strong Monotonicity). The first derivative of γ_k(·, z) is uniformly bounded away from zero, i.e., there exists c̲ > 0 such that for any c, z, ∂/∂c γ_k(c; z) ≥ c̲ > 0.

In view of equation (8), Assumption 8 is satisfied if α_1, α_2 > 0, or if ∂h_1/∂u and ∂h_2/∂u are uniformly bounded above by a finite constant. Assumption 8 is needed to ensure that γ̂_k^{-1}(·, z) is a good estimator of γ_k^{-1}(·, z) provided that the first-stage nonparametric estimator γ̂_k is consistent for γ_k.
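The three steps above can be sketched compactly. For brevity, the sketch below takes the Step 2 imputation as given (it feeds in the true inputs where the imputed values x̂ would go) and focuses on the Step 3 IV formula; the design, parameters, and instruments z_i = (z_{i1}, z_{i2}) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: wages z are exogenous (E[u|z] = 0) and shift input choices,
# while the productivity shock u makes the inputs endogenous.
N = 100_000
a = np.array([1.0, 0.4, 0.3])                    # (alpha_0, alpha_1, alpha_2)
z1, z2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
u = rng.normal(0.0, 0.3, N)
x1 = 0.5 - 0.8 * z1 + u                          # h_1(u, z): increasing in u
x2 = 0.8 - 0.6 * z2 + u                          # h_2(u, z): increasing in u
y = a[0] + a[1] * x1 + a[2] * x2 + u + rng.normal(0.0, 0.1, N)

# Step 3 (IV regression): alpha_hat = (mean z_bar x_tilde')^{-1} (mean z_bar y).
# Here x_tilde uses the true inputs in place of the Step 2 imputations.
Z = np.column_stack([np.ones(N), z1, z2])        # z_bar_i
X = np.column_stack([np.ones(N), x1, x2])        # x_tilde_i
alpha_iv = np.linalg.solve(Z.T @ X / N, Z.T @ y / N)

# For comparison, OLS is inconsistent because u enters both inputs and the error.
alpha_ols = np.linalg.solve(X.T @ X / N, X.T @ y / N)
print(np.round(alpha_iv, 2), np.round(alpha_ols, 2))
```

The IV estimate recovers (1.0, 0.4, 0.3) up to sampling noise, while the OLS slopes are biased upward by the transmission of u into the input choices.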
Assumption 9 (First-Stage Estimation).

(i) Donsker property: γ_1, γ_2 ∈ Γ, which is a Donsker class of functions with uniformly bounded first and second derivatives, and γ̂_1, γ̂_2 ∈ Γ with probability approaching 1.
(ii) First-stage convergence: ‖γ̂_k − γ_k‖_∞ = o_p(N^{-1/4}) for k = 1, 2.

Assumption 9(i) is satisfied if γ_1, γ_2 satisfy certain smoothness conditions, e.g., if γ_k possesses uniformly bounded derivatives up to a sufficiently high order. Assumption 9(ii) requires that the first-stage estimator converges at a rate faster than N^{-1/4}, which is satisfied by various types of nonparametric estimators under certain regularity conditions. This is required so that the final estimator of the production function parameters α can converge at the standard parametric (√N) rate despite the slower first-step nonparametric estimation of γ_1, γ_2.

Finally, we state another technical assumption that captures how the first-stage nonparametric estimation of γ_1, γ_2 influences the final semiparametric estimator α̂ through the functional derivatives of the residual function with respect to γ_1, γ_2. Assumption 10 below, based on Newey (1994), yields an explicit formula for the asymptotic variance of α̂ that does not depend on the particular forms of the first-stage nonparametric estimators.

Formally, write w_i := (y_i, x_i, z_i, d_i) and γ := (γ_1, γ_2), and suppress the conditioning variables z_i in γ for notational simplicity. Define the residual functions

g(w_i, α̃, γ̃) := z̄_i (y_i − α̃_0 − α̃_1 x_{i1} − α̃_2 γ̃_2^{-1}(γ̃_1(x_{i1}))), for d_i = 1,
g(w_i, α̃, γ̃) := z̄_i (y_i − α̃_0 − α̃_2 x_{i2} − α̃_1 γ̃_1^{-1}(γ̃_2(x_{i2}))), for d_i = 2,

for generic α̃, γ̃, and g(w_i, γ̃) := g(w_i, α, γ̃) at the true α. Define the pathwise functional derivative of g at γ along direction τ by

G(w_i, τ) := lim_{t→0} (1/t)[g(w_i, γ + tτ) − g(w_i, γ)].
Then, following Newey (1994), the so-called "influence function" can be derived analytically based on G and takes the form ϕ(w_i) z̄_i ε_i with

ϕ(w_i) := −(λ_2 α_2 / γ_2' − λ_1 α_1 / γ_1')(1{d_i = 1} − 1{d_i = 2}),

where γ_k' denotes ∂/∂h_k γ_k(x_{ik}; z_i), λ_1 stands for λ_1(x_i, z_i) := E[1{d_i = 1} | x_i, z_i], i.e., the conditional probability of observing x_{i1}, and λ_2 := 1 − λ_1. See the proof of Theorem 2 for details of the calculation. In particular, the formula for ϕ given above will be the same regardless of the specific forms of the first-step estimators used, provided that some suitable regularity conditions are satisfied.

Assumption 10 (Asymptotic Linearity). Suppose

∫ G(w, γ̂ − γ) dP(w) = (1/N) Σ_{i=1}^N ϕ(w_i) z̄_i ε_i + o_p(N^{-1/2}).

We emphasize that Assumptions 9 and 10 are standard assumptions widely imposed in the semiparametric estimation literature, which can be satisfied by many kernel or sieve first-stage estimators under a variety of conditions. See Newey (1994), Newey and McFadden (1994), and Chen, Linton, and Van Keilegom (2003) for references. In Assumption 11 below, we also provide an example of lower-level conditions that replace Assumptions 9 and 10 when we use the Nadaraya-Watson kernel estimator in the first-stage nonparametric regression.

The next theorem establishes the asymptotic normality of α̂.

Theorem 2 (Asymptotic Normality). Under Assumptions 1-10,

√N (α̂ − α) →d N(0, Σ),

where Σ := Σ_zx^{-1} Ω Σ_xz^{-1} and Ω := E[z̄_i z̄_i' (u_i + [1 + ϕ(w_i)] ε_i)²].

We note that, if the latent inputs were observed and the first-step nonparametric regression were not required, the asymptotic variance of the standard IV estimator of α would be given by Σ_zx^{-1} Var(z̄_i (u_i + ε_i)) Σ_xz^{-1}.
Hence, the presence of the additional term ϕ(w_i) ε_i in Ω captures the effect of the first-step nonparametric regression on the asymptotic variance of α̂.

To obtain consistent variance estimators, define

Ω̂ := (1/N) Σ_{i=1}^N z̄_i z̄_i' [y_i − x̃_i'α̂ + ϕ̂(w_i)(y_i − ỹ_i)]²,

where

ỹ_i := γ̂_1(x_{i1}, z_i), for d_i = 1,
ỹ_i := γ̂_2(x_{i2}, z_i), for d_i = 2,

and with

ϕ̂(w_i) := −(λ̂_2 α̂_2 / γ̂_2' − λ̂_1 α̂_1 / γ̂_1')(1{d_i = 1} − 1{d_i = 2}),

where λ̂_1 is any consistent nonparametric estimator of λ_1. Then the variance estimator can be obtained as Σ̂ := S_{zx̃}^{-1} Ω̂ S_{x̃z}^{-1} with S_{zx̃} := (1/N) Σ_{i=1}^N z̄_i x̃_i'.

Proposition 1.
In addition to Assumptions 1-8 and 11, suppose that λ̂ is any consistent nonparametric estimator of λ. Then Ω̂ →_p Ω and Ω̂* →_p Ω*. If, furthermore, λ(x_i, z_i) ≡ λ ∈ (0, 1) is assumed, then we may use the sample proportion λ̂ := (1/N) Σ_i 1{d_i = 1}.

Finally, we present a set of lower-level conditions that replace Assumptions 9 and 10 when we use the canonical Nadaraya-Watson kernel estimator for the nonparametric regression in Step 1. We emphasize that this subsection simply serves as an illustration of Assumptions 9-10 and Theorem 2, as our method does not require a specific form of first-step nonparametric estimator. For sieve (series) first-step estimators, similar results can be derived based on, for example, Newey (1994), Chen (2007), and Chen and Liao (2015).
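To illustrate, the plug-in variance estimator Σ̂ defined above can be computed directly from the second-stage residuals. The sketch below uses hypothetical array names and treats ϕ̂(w_i) and the imputed values ỹ_i as given; it illustrates the sandwich formula, and is not the paper's own code.

```python
import numpy as np

def matched_iv_variance(z, x_tilde, y, alpha_hat, phi_hat, y_tilde):
    """Plug-in sandwich variance for the matched IV estimator (a sketch):
      r_i    = y_i - x~_i' alpha^ + phi^(w_i) (y_i - y~_i)
      Omega^ = (1/N) sum_i z_i z_i' r_i^2
      Sigma^ = S^{-1} Omega^ (S')^{-1},  S = (1/N) sum_i z_i x~_i'.
    """
    n = len(y)
    r = y - x_tilde @ alpha_hat + phi_hat * (y - y_tilde)  # adjusted residual
    omega = (z.T * r**2) @ z / n        # (1/N) sum z_i z_i' r_i^2
    s = z.T @ x_tilde / n               # (1/N) sum z_i x~_i'
    s_inv = np.linalg.inv(s)
    return s_inv @ omega @ s_inv.T      # sandwich formula
```

With ϕ̂ ≡ 0 and ỹ_i = y_i, the formula reduces to the usual heteroskedasticity-robust IV sandwich variance.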
Assumption 11 (Example of Lower-Level Conditions with Kernel First Step). Let N_k := Σ_{i=1}^N 1{d_i = k} denote the number of firms for which h_ik is observed, and let γ̂_k be the Nadaraya-Watson kernel estimator of γ_k defined by

γ̂_k(v) := [ Σ_{i: d_i = k} K( (v − v_ik)/b ) y_i ] / [ Σ_{i: d_i = k} K( (v − v_ik)/b ) ],

where v_ik := (x_ik, z_i1, z_i2) for all i such that d_i = k. Suppose the following conditions:

(i) λ(x_i, z_i) ∈ (ε, 1 − ε) for all (x_i, z_i) for some ε > 0.
(ii) (x_i, z_i) has compact support with joint density f that is uniformly bounded both above and below away from zero.
(iii) E[y_i^2] < ∞ and E[y_i | x_i, z_i] f(x_i, z_i) is bounded.
(iv) γ_k has uniformly bounded derivatives up to order p ≥ 2.
(v) K(u) has uniformly bounded derivatives up to order p, K(u) is zero outside a bounded set, ∫ K(u) du = 1, ∫ u^t K(u) du = 0 for t = 1, ..., p − 1, and ∫ ‖u‖^p |K(u)| du < ∞.
(vi) b is chosen such that √(log N)/√(Nb) = o(N^{−1/4}) and √N b^p → 0.

Assumption 11(i) essentially requires that the proportion of observations with x_i1 observed and the proportion with x_i2 observed are both strictly positive, or in other words, that the numbers of both types of observations tend to infinity at the same rate as N. This guarantees that we can estimate both γ_1, based on observations with x_i1 observed, and γ_2, based on observations with x_i2 observed, well enough asymptotically. Assumption 11(iv) is the key smoothness condition that helps establish the Donsker property (and a consequent stochastic equicontinuity condition) in Assumption 9(i).
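For concreteness, the first-step Nadaraya-Watson estimator in Assumption 11 is a ratio of kernel-weighted sums over the subsample with d_i = k. A minimal sketch with hypothetical names follows; a second-order Gaussian product kernel is used purely for illustration, whereas the asymptotics above call for a higher-order kernel and an undersmoothed bandwidth.

```python
import numpy as np

def nw_estimate(v0, V, y, b):
    """Nadaraya-Watson estimate of E[y | v = v0] with bandwidth b:
    a ratio of kernel-weighted sums (Gaussian kernel for illustration)."""
    u = (V - v0) / b                           # scaled distances to v0
    w = np.exp(-0.5 * np.sum(u**2, axis=1))    # product Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)           # weighted average of y_i
```

Evaluating this at each observed point yields the imputation inputs γ̂_k used in the second stage.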
Assumptions 11(v) and 11(vi) are concerned with the choice of the kernel function K and the bandwidth parameter b: (v) requires that a "higher-order" kernel function (of order p) be used, while (vi) requires that the bandwidth be set (in a so-called "under-smoothed" way) so that the kernel estimator γ̂_k converges at a rate faster than N^{−1/4}, as required in Assumption 9(ii).

Proposition 2 (Asymptotic Distributions with Kernel First Step). Under Assumptions 1-8 and 11, the conclusions of Theorem 2 hold.

Generalizations

Additional Instrumental Variables
If additional instruments are available, it is straightforward to incorporate them in the second-stage regression, which then takes the form of a two-stage least squares estimator instead of an exactly identified IV regression. Our results carry over with suitable changes in notation. For example, the asymptotic variance formula for α̂ needs to be adapted as

Σ := ( Σ_xz Σ_zz^{−1} Σ_zx )^{−1} Σ_xz Σ_zz^{−1} Ω Σ_zz^{−1} Σ_zx ( Σ_xz Σ_zz^{−1} Σ_zx )^{−1}.

Other Parametric Outcome Function
Consider a potentially nonlinear parametric production function of the form

y_i = F_α(x_i1, x_i2) + u_i + ε_i.

After the identification of the partially latent inputs via Theorem 1, the second stage boils down to the estimation of α based on the moment condition E[ z_i ( y_i − F_α(x_i1, x_i2) ) ] = 0, which can be obtained via GMM estimation. Technically, since GMM estimators are Z-estimators, the corresponding asymptotic theory in Newey and McFadden (1994), on which the proof of Theorem 2 is mainly based, still applies with proper changes in notation.

Nonparametric Outcome Function
More generally, with any nonparametric production function that is additively separable in u_i and ε_i, of the form

y_i = F(x_i1, x_i2) + u_i + ε_i,

where F is an unknown function that satisfies Assumption 2, the only thing that changes is the second-stage nonparametric estimation of F with the imputed covariates x̃_i (or, more precisely, with one component known and one component imputed) based on the moment condition E[ z_i ( y_i − F(x_i1, x_i2) ) ] = 0. The asymptotic theory for this case can be similarly obtained based on the theory of nonparametric two-step estimation (e.g., Ai and Chen, 2007, and Hahn, Liao, and Ridder, 2018).

In the more general specification (1),

y_i = F(x_i1, x_i2, u_i) + ε_i,

where there is no longer additive separability in u_i, one way to obtain identification and implement IV estimation is by adapting Chernozhukov, Imbens, and Newey (2007) to our current context. Essentially, we would need to impose strict monotonicity of F in u_i, impose independence of u_i from z_i, normalize the distribution of u_i to be uniform, and then exploit a quantile-based residual condition as described in Chernozhukov, Imbens, and Newey (2007).

Monte Carlo Experiments

Here we report the findings of some Monte Carlo experiments. Table 1 reports the parameter specifications of the Cobb-Douglas production function that we use in our experiments. We assume that inputs are optimally chosen by a profit-maximizing firm, as discussed in detail in Appendix A. These parameters were chosen so that the simulated data are broadly consistent with the descriptive statistics of our first application, which we discuss in detail in the next section. For each specification, market size, denoted by L, and the number of firms in each market, denoted by I, can vary. In particular, we consider the following scenarios: L = 50, 100, 500 and I = 1, 50, 100. For each experiment, we compute the difference between the true parameter value and the sample average of the estimates using 1,000 replications.
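The imputation step at the core of the matched estimator can be sketched as follows: with the conditional-expectation functions strictly increasing in the input, the latent input is imputed by matching on expected output, i.e., x̃_2 = γ_2^{-1}(γ_1(x_1)). The helper below is hypothetical (γ_1, γ_2 stand in for the first-step estimates) and inverts γ_2 by monotone interpolation on a grid.

```python
import numpy as np

def impute_latent_input(x1, g1, g2, grid):
    """Impute the unobserved input x2 by matching on expected output:
    solve g2(x2) = g1(x1) for x2, with g2 strictly increasing.
    g1 and g2 are hypothetical stand-ins for the first-step estimates."""
    target = g1(x1)                        # expected output at the observed input
    vals = g2(grid)                        # g2 on an increasing grid (so vals increase)
    return np.interp(target, vals, grid)   # monotone inverse: g2^{-1}(target)
```

With nearly linear γ functions, as found in the Monte Carlo experiments below, the imputed values are essentially a linear transformation of the observed input.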
This is a measure of the bias of our estimator. We also estimate the root mean squared error (RMSE) using the sample standard deviation of our estimates.

Note that our data generating process mechanically implies that x_i1 and x_i2 have a linear relationship with y_i. We estimate γ_1(·, z_i) and γ_2(·, z_i) using second-degree polynomials. Not surprisingly, we find that the estimated coefficients on the quadratic terms are almost 0. The interpolated functions γ_1^{−1} and γ_2^{−1} are also almost linear.

Table 2 summarizes the performance of two different estimators: TSLS when all inputs are observed, as well as our version of TSLS when inputs are imputed. We refer to our version of the TSLS estimator as the "matched" TSLS estimator. As we would expect given our asymptotic results, the matched TSLS performs almost as well as the standard TSLS estimator under these ideal sampling conditions. This finding holds for all three different specifications and several choices for the number of firms within a market and the number of local markets.

Table 1: Monte Carlo Parameter Specification
(Constant across specifications: α_0 = 4, α_1 = 0.35, α_2 = 0.25, along with μ_z, σ_z, and κ; Specifications 1-3 differ in σ_u, σ_ε, and σ_η.)

Next, we investigate how our estimator performs when we have a relatively small number of observations in each market. Considering an extreme case, we simulate data for L = 500 and I = 1. As we only have a single firm in each market, we cannot impute the missing input variable using within-market information. Instead, we pool observations across markets and estimate conditional expectations conditional on x_1 (or x_2), z_1, and z_2. Table 2 also summarizes the bias and RMSE for L = 500 and I = 1. We find that the matched TSLS estimator performs almost as well as the standard TSLS estimator that assumes that both inputs are observed.

Finally, we consider the case in which the wage for type j is observed only when we observe the input for type j, i.e., we assume that:

(z_i1, z_i2) = (z*_i1, missing) if x_i1 is observed; (missing, z*_i2) if x_i2 is observed.    (11)

Since we need to impute missing wages, we assume that true wages are functions of demand shifters D_m ∈ R^2 for the local labor market m and a random error η_i which is assumed to be independent of the demand shifters. Note that this specification allows for correlation between z_1m(i) and z_2m(i) through D_m.

Table 2: Monte Carlo: Different Markets, Observed Wages

Param   Markets   Firms   Spec   TSLS Bias   TSLS RMSE   Matched Bias   Matched RMSE
α_0      50    50    1     0.001     0.001      0.000     0.001
α_0     100   100    1    -0.000     0.000     -0.000     0.000
α_0      50    50    2     0.001     0.002     -0.000     0.002
α_0     100   100    2    -0.000     0.000      0.000     0.001
α_0      50    50    3     0.001     0.002      0.001     0.002
α_0     100   100    3    -0.000     0.000      0.001     0.001
α_0     500     1    1    -0.004     0.003     -0.004     0.003
α_0     500     1    2    -0.014     0.011     -0.015     0.011
α_0     500     1    3    -0.013     0.010     -0.014     0.010
α_1      50    50    1     0.004     0.003      0.003     0.004
α_1     100   100    1     0.000     0.001      0.000     0.001
α_1      50    50    2     0.007     0.010      0.006     0.013
α_1     100   100    2     0.001     0.002      0.001     0.003
α_1      50    50    3     0.006     0.008      0.032     0.015
α_1     100   100    3     0.001     0.002      0.020     0.003
α_1     500     1    1    -0.002     0.015     -0.001     0.016
α_1     500     1    2    -0.000     0.048      0.001     0.052
α_1     500     1    3    -0.007     0.040     -0.006     0.043
α_2      50    50    1    -0.005     0.005     -0.004     0.006
α_2     100   100    1    -0.001     0.001     -0.000     0.001
α_2      50    50    2    -0.010     0.014     -0.010     0.017
α_2     100   100    2    -0.002     0.003     -0.002     0.004
α_2      50    50    3    -0.007     0.011     -0.046     0.021
α_2     100   100    3    -0.001     0.002     -0.029     0.005
α_2     500     1    1    -0.004     0.020     -0.004     0.022
α_2     500     1    2    -0.020     0.068     -0.022     0.073
α_2     500     1    3    -0.009     0.051     -0.010     0.055

Specifically, we simulate wages as follows:

z*_i1 = z_1m(i) = κ_11 D_1m + κ_12 D_2m + η_i1,    (12)
z*_i2 = z_2m(i) = κ_21 D_1m + κ_22 D_2m + η_i2.

To impute the missing wages, we regress the observed wages (z_i1, z_i2) on the demand shifters (D_1m, D_2m). Using the estimated parameters from the regression, we then impute the missing wages.

Table 3: Monte Carlo: Small Markets with Partially Latent Wages

Param   Markets   Firms   Spec   Standard TSLS Bias   RMSE   Matched TSLS Bias   RMSE
α_0     500     1    1    -0.004     0.003     -0.004     0.003
α_0     500     1    2    -0.008     0.010     -0.007     0.010
α_0     500     1    3    -0.008     0.010     -0.007     0.010
α_1     500     1    1    -0.002     0.015     -0.001     0.016
α_1     500     1    2     0.005     0.054      0.008     0.055
α_1     500     1    3     0.004     0.053      0.008     0.054
α_2     500     1    1    -0.004     0.020     -0.004     0.022
α_2     500     1    2    -0.021     0.072     -0.023     0.075
α_2     500     1    3    -0.020     0.070     -0.023     0.074

Table 3 summarizes the performance of our new estimator together with the TSLS estimator. Even if we have a relatively large variance of the imputation errors, such as in Specification 3, our new estimator performs reasonably well.

Figure 1 plots the empirical distributions for the case of Specification 2. Overall, we find that the matched TSLS estimator performs almost as well as the standard TSLS estimator.

Figure 1: Histograms of Estimated Coefficients with Imputed Wages (TSLS and matched TSLS; Nmarket = 500, Nfirms = 1, Parameter Spec = II)

We conclude that our estimator performs well in all Monte Carlo experiments, even in scenarios that are more general than those considered in Section 3 of the paper. In particular, we do not need to observe both sets of instruments in the data, i.e., we can impute the missing instrument. Next, we evaluate the performance of our estimator in two applications. The first application focuses on pharmacies and studies differences in technology across different types of firms. The second application studies education production functions.
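Before turning to the applications, note that the wage-imputation step in (12) amounts to an auxiliary OLS regression of observed wages on the demand shifters, followed by prediction for the markets where that wage type is missing. A minimal sketch, with hypothetical names and an intercept added for illustration:

```python
import numpy as np

def impute_wages(D, z_obs, observed):
    """Impute missing wages from local demand shifters: fit OLS on the
    markets where the wage is observed, predict for all markets, and
    keep the actual wage wherever it is observed (a sketch)."""
    X = np.column_stack([np.ones(len(D)), D])                 # intercept + shifters
    kappa, *_ = np.linalg.lstsq(X[observed], z_obs[observed], rcond=None)
    z_hat = X @ kappa                                         # fitted wages
    return np.where(observed, z_obs, z_hat)                   # fill only the gaps
```

The same regression run with the roles of the two wage types reversed imputes the other missing wage.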
Our first application focuses on the industrial organization of pharmacies. This industry has undergone a dramatic change over the past decades. An industry that used to be primarily dominated by local independent pharmacies has been transformed by the entry of large chains that operate in multiple markets. An important question is the extent to which this transformation has been driven by technological change that has benefited large chains over smaller independently operated pharmacies. If this is in fact the case, these technological changes may help to explain why this profession has become so popular with females (Goldin and Katz, 2016).

The main data set that we use is the National Pharmacist Workforce Survey of 2000, which was collected by Midwestern Pharmacy Research. The data come from a cross-sectional survey answered by randomly selected individual pharmacists with active licenses. The data set is composed of two types of information: information about pharmacists and information about the pharmacy each pharmacist works at. Information at the pharmacy level includes the type of pharmacy (Independent or Chain), the hours of operation per week, the number of pharmacists employed, and the typical number of prescriptions dispensed at the pharmacy per week. The store-level information is provided by an individual pharmacist who works at the pharmacy; thus the quality of the responses may depend on how knowledgeable the person is about the pharmacy. However, considering that most of the pharmacists in our sample are observed to be full-time pharmacists, the quality of the firm-level data is likely to be high. The number of prescriptions dispensed at the pharmacy is our measure of output.
As a consequence, we do not have to use revenue-based output measures, which could bias our analysis as discussed, for example, in Epple, Gordon, and Sieg (2010).

Table 4: Summary Statistics at the Firm Level: Pharmacies
(Columns: Firm Type, Number of Pharmacists, Employment Size, Operating Hours, Prescriptions per Week, Prescriptions per Hour, Proportion Urban, Number of Observations; rows report independent and chain pharmacies by employment-size category.)

We explore these issues in more detail below and test whether the different types of pharmacies have access to the same technology.

The survey also collects various information about pharmacists, including hours of work, demographics, and household characteristics. Most importantly, we observe the position at the pharmacy (Owner/Manager or Employee). We treat hours of the manager and hours of the employees as the two input factors in our analysis.

Information related to the individual pharmacists is summarized in Table 5. Employee pharmacists at independent pharmacies work fewer hours than the employee pharmacists at chain pharmacies, and hourly earnings are lower than those of the employees at the chains. Pharmacists in managerial positions at independent pharmacies work more hours than do managers at chain pharmacies, but they have lower hourly earnings on average.

We observe only one pharmacy in each local labor market, which is defined as the 5-digit zip code area. Hence, we need to use the version of our estimator that averages across local markets, as discussed in Section 3.3.

We test whether the observed labor inputs are indeed the optimal choice of firms. If the inputs are optimally chosen, the coefficients can be directly estimated from equation (16) in Appendix A. Under the assumption of Cobb-Douglas production, we can test optimality by jointly testing the null hypothesis of equality of both coefficients. Table 6 shows the results. A formal Wald test rejects the null hypothesis of optimality.
Thus, the direct inversion of the optimality conditions cannot be applied to estimate the parameters of the production function, whereas our new estimator is feasible.

We implement two versions of our "matched" TSLS estimator: the first estimator uses the observed outputs, while the second one uses expected outputs. Since the observed output is subject to measurement error, the semi-parametric estimator that uses expected outputs corrects for this. (Most pharmacies in our sample have one manager pharmacist and one employee pharmacist, but there are a few pharmacies with a larger employment size. See Appendix C for details on how to compute employees' hours worked for the pharmacies with multiple employees. We only observe the wage for the observed type. Thus, wages are imputed for the unobserved type using local demand shifters at the 5-digit zip code level and pharmacists' characteristics. We use actual wages for the observed position and imputed wages for both positions, together with principal components of local demand shifters, as instruments.)

Table 7 shows that we estimate most of the parameters of the production function with good precision. Correcting for potential measurement error by using the expected output as the dependent variable, we achieve similar, maybe even slightly more plausible, estimates.

Table 7: Estimation Results
(Columns: Independent and Chain pharmacies, each estimated with observed and with expected outputs as the dependent variable; rows report the production function coefficients and controls.)

As a robustness check, we also explored a different matching algorithm, which estimates the expectation of output conditional on local demand shifters rather than wages. The results are consistent, although the matching algorithm with local demand shifters gives slightly larger point estimates with slightly less precision. Appendix C provides some additional robustness checks.
Table 8: Tests of Differences between Independent and Chain Pharmacies

                          α        V(u)
Independent              0.163     0.010
Chain                    0.687     0.006
Difference or Ratio     -0.524     1.532
Test Statistic         122.841    -1.913    1.532
Test                     Wald        t        F
p-value                 (0.000)   (0.028)  (0.003)

Second, our findings also suggest that managers may be more effective in chains than in independents. A formal one-sided t-test reported in Table 8 rejects the null hypothesis that the two coefficients that characterize managerial efficiency are the same.

Finally, we find that chains have a significantly lower residual variance than independents. A formal F-test reported in Table 8 rejects the null hypothesis that the residual variance of independents is less than or equal to the residual variance of chains. Note that all the tests are based on the estimation results with the expected outputs as the dependent variable.

We thus conclude that chains have different production functions than independent pharmacies, which may partially explain the change in the observed market structure of that industry. However, more research is needed to fully address this important research question.
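The comparisons in Table 8 can be illustrated with simple one-sided tests. The sketch below is not the paper's exact test construction: the standard errors and sample sizes are hypothetical, and a normal approximation is used for the t-test.

```python
import numpy as np
from scipy import stats

def chain_vs_independent_tests(a_ind, se_ind, a_chain, se_chain,
                               v_ind, v_chain, n_ind, n_chain):
    """Illustrative one-sided tests (hypothetical inputs, not the
    paper's statistics): (a) the managerial coefficient is larger for
    chains; (b) the residual variance is smaller for chains."""
    t = (a_ind - a_chain) / np.sqrt(se_ind**2 + se_chain**2)
    p_t = stats.norm.cdf(t)                        # H1: a_ind < a_chain
    F = v_ind / v_chain
    p_F = stats.f.sf(F, n_ind - 1, n_chain - 1)    # H1: v_ind > v_chain
    return t, p_t, F, p_F
```

Applied to the point estimates in Table 8 (with made-up standard errors and sample sizes), both one-sided tests reject at conventional levels.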
Our second application focuses on the estimation of education achievement functions. Here we assume that a child's achievement y_i is a function of the mother's and the father's time inputs, denoted by x_im and x_if. Again, we consider a log-linear Cobb-Douglas specification given by

y_i = α_i + α_m x_im + α_f x_if + u_i    (13)

where heterogeneity in the intercept is given by:

α_i = x_i′ α    (14)

Hence, we assume that the baseline productivity α_i varies with family characteristics, such as family income. As before, we can estimate the education production function using TSLS with wages as instruments for inputs, as well as our "matched" TSLS estimator if some inputs are partially latent.

Our data is based on the four available waves of the Child Development Supplement (CDS). These are the cohorts interviewed in 1997, 2002, 2007, and 2014. For these children, we have detailed time-usage information of their parents on two days, each of which is randomly selected among weekdays and weekends, respectively. Based on this time-diary information, we can construct time inputs for mothers and fathers. The CDS can be linked to the original PSID survey using the family ID. Hence, we have detailed parental information such as education level, household income, and the number of children.

The CDS collects multiple measures of child development, including both cognitive and non-cognitive skills. We focus on two important cognitive tests. First, we study the passage comprehension test, which assesses reading comprehension and vocabulary among children aged between 6 and 17. Second, we analyze the applied problems test, which assesses mathematics reasoning, achievement, and knowledge for children aged between 6 and 17.

We begin by estimating an education production function using the subsample of children who live in married households. Hence, we observe the mother's and the father's inputs in the data set.
We observe 3,236 children with complete inputs and applied problems scores, as well as 2,789 children with complete inputs and reading scores. (The CDS 1997 cohort consists of up-to-12-year-old children and follows them for three waves (1997, 2001, 2007). The CDS 2014 cohort consists of children that were up to 17 years old in 2013. We exclude families with stepmothers and stepfathers from our sample. We also analyzed the letter word test, which assesses symbolic learning and reading identification skills. There are also two non-cognitive measures: the externalizing behavioral problem index measures disruptive, aggressive, or destructive behavior; the internalizing behavioral problem index measures expressions of withdrawn, sad, fearful, or anxious feelings.)

We then turn to the subsample of children from divorced households, where the father's time input is latent and must be imputed. Note that the standard TSLS is no longer feasible in this subsample because of the latent variable problem. Table 11 summarizes our findings.

Table 11 shows that the time inputs for mothers are positive, statistically significant, and economically meaningful. Moreover, the point estimates for the applied problems test are similar to the ones we obtained for the married sample reported in Table 10. The main difference is that the mother's time inputs are slightly less productive for children from divorced families, and the father's time inputs are not statistically different from zero. In summary, our estimator works well in this application and yields plausible and accurate point estimates for most coefficients of interest. Most importantly, we find that the inputs of divorced fathers into the skill formation function of their children seem to be negligible. (Missing instruments for the unobserved spouse are imputed using standard techniques based on the observed spouse's information.)

Concluding Remarks
We have developed a new method for identifying econometric models with partially latent covariates. We have shown that a broad class of econometric models that play a large role in industrial organization and labor economics can be nonparametrically identified if the partially latent covariates are monotonic functions of a common shock. Examples that fall into this class of models are production and skill formation functions. The partially latent data structure arises quite naturally in these settings if we employ an "input-based sampling" strategy, i.e., if the sampling unit is one of multiple labor input factors. It is plausible that the sampling unit will only have incomplete information about the other labor inputs that affect output. Our proofs of identification are constructive and imply a sequential, two-step semi-parametric estimation strategy. We have discussed the key problems encountered in estimation and characterized the rate of convergence and the asymptotic distribution of our estimators.

We also presented two applications of our technique. Our first application focuses on estimating team production functions. Using a national survey of pharmacists, we have found some convincing evidence that chains have different technologies than independently operated pharmacies. In particular, managers appear to be more productive in chains. Our second application focuses on the estimation of skill formation functions, which play a large role in labor and family economics. We have shown that our matched TSLS estimator produces similar results to the feasible TSLS estimator in a sample of children in married households, where both parental inputs are observed. We have also considered a sample of children from divorced households, where the father's inputs must be imputed. We find that the inputs of divorced fathers into the skill formation function of their children are negligible.

There is substantial scope for future research in areas other than the two applications that we provided above.
At the heart of the applications discussed thus far is the relationship between multiple inputs that are combined to produce a single output. It is easy to imagine questions that ask about relationships that fit this structure and that do not fall into the frameworks we have considered thus far.

To illustrate this idea, consider the problem of inter vivos gifts. It is common for parents, while still alive, to give money to their children, often to help with a down payment on a house or to reduce the taxes the parents will pay. When a couple makes a gift to their married child, however, they risk that the child divorces and a portion of the gift will accrue to the child's spouse. The concern is real, since approximately 40% of marriages in the US end in divorce. A natural question is how well parents can predict how long a child's marriage will last at the time they contemplate making a gift. One could address this question with a data set that includes inter vivos gifts from parents to married children and, in addition, how long the child's marriage survives. Such data sets exist, for example the PSID, which documents these for family lines that stretch over half a century.

There is a problem, however: multigenerational data sets such as the PSID have quite detailed information about the choices of individuals who are descendants of the initial respondents, but substantially less information about the choices of individuals who "marry into" the data set. For each married couple in the PSID, one of the two has the "PSID gene" (that is, is a descendant of an initial respondent), and we have substantially more information about that individual and, importantly, about that individual's parents than we have about the spouse. In particular, we know the inter vivos gifts to the couple from the parents of the PSID-gene child but not the inter vivos gifts to the couple from the spouse's parents.
Note that this design of the PSID gives rise to a data structure that mimics the "input-based sampling" approach that we have studied in this paper. As we show in Appendix D, it is straightforward to write down a non-cooperative model of intergenerational transfers, where the transfers of each set of parents are monotonically increasing in the probability that the marriage survives. This potential application is an example of the interesting problems that arise in trying to understand intergenerational effects. We would like to know how the choices or characteristics of individuals in one generation affect the outcomes of their descendants. We conjecture that the methods developed in this paper can be fruitfully applied to study a variety of questions related to intergenerational linkages.

Finally, our research provides ample scope for future research in econometric methodology. We have restricted ourselves to applications in which our method of identification can be combined with standard IV techniques to estimate the functions of interest. Much of the recent panel data literature has focused on dynamic inputs in the presence of adjustment costs. More research is clearly needed to evaluate whether the ideas presented in this paper can be extended and applied to dynamic panel data frameworks. We have also restricted ourselves to systems of inputs with a single common shock. (Other multigenerational data sets, such as the NLSY79, NLSY97, and NCDS, share the partially latent variable problem.)

References
Abadie, A. and G. Imbens (2006): "Large Sample Properties of Matching Estimators for Average Treatment Effects," Econometrica, 74(1), 235–267.

Abrevaya, J. and S. G. Donald (2017): "A GMM Approach for Dealing with Missing Data on Regressors," Review of Economics and Statistics, 99, 657–662.

Acemoglu, D. and D. Autor (2011): "Skills, Tasks and Technologies: Implications for Employment and Earnings," in Handbook of Labor Economics, ed. by D. Card and O. Ashenfelter, Elsevier, 1043–1171.

Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic Efficiency of Semiparametric Two-Step GMM," Review of Economic Studies, 81, 919–943.

Ackerberg, D. A., K. Caves, and G. Frazer (2015): "Identification Properties of Recent Production Function Estimators," Econometrica, 83, 2411–2451.

Ai, C. and X. Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics, 141, 5–43.

Bergstrom, T., L. Blume, and H. Varian (1986): "On the Private Provision of Public Goods," Journal of Public Economics, 29, 25–49.

Blundell, R. and S. Bond (1998): "Initial Conditions and Moment Restrictions in Dynamic Panel Data Models," Journal of Econometrics, 87, 115–143.

——— (2000): "GMM Estimation with Persistent Panel Data: An Application to Production Functions," Econometric Reviews, 19, 321–340.

Chaudhuri, S. and D. K. Guilkey (2016): "GMM with Multiple Missing Variables," Journal of Applied Econometrics, 31, 678–706.

Chen, X. (2007): "Large Sample Sieve Estimation of Semi-Nonparametric Models," Handbook of Econometrics, 6, 5549–5632.

Chen, X. and Z. Liao (2015): "Sieve Semiparametric Two-Step GMM under Weak Dependence," Journal of Econometrics, 189, 163–186.

Chen, X., O. Linton, and I. Van Keilegom (2003): "Estimation of Semiparametric Models when the Criterion Function Is Not Smooth," Econometrica, 71, 1591–1608.

Chernozhukov, V., G. W. Imbens, and W. K. Newey (2007): "Instrumental Variable Estimation of Nonseparable Models," Journal of Econometrics, 139, 4–14.

Cunha, F., J. Heckman, and S. Schennach (2010): "Estimating the Technology of Cognitive and Non-Cognitive Skill Formation," Econometrica, 78, 883–931.

Doraszelski, U. and J. Jaumandreu (2013): "R&D and Productivity: Estimating Endogenous Productivity," Review of Economic Studies, 80, 1338–1383.

Epple, D., B. Gordon, and H. Sieg (2010): "A New Approach to Estimating the Production Function for Housing," American Economic Review, 100, 905–924.

Fisher, R. (1935): Design of Experiments, New York: Hafner.

Gandhi, A., S. Navarro, and D. Rivers (2020): "On the Identification of Gross Output Production Functions," Journal of Political Economy, 128, 2973–3016.

Goldin, C. and L. F. Katz (2016): "A Most Egalitarian Profession: Pharmacy and the Evolution of a Family-Friendly Occupation," Journal of Labor Economics, 34, 705–746.

Graham, B. S. (2011): "Efficiency Bounds for Missing Data Models with Semiparametric Restrictions," Econometrica, 79, 437–452.

Griliches, Z. and J. Mairesse (1998): "Production Functions: The Search for Identification," in Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium, ed. by S. Strøm, Cambridge University Press, 169–203.

Haanwinckel, D. (2018): "Supply, Demand, Institutions, and Firms: A Theory of Labor Market Sorting and the Wage Distribution," Working Paper.

Hahn, J., Z. Liao, and G. Ridder (2018): "Nonparametric Two-Step Sieve M Estimation and Inference," Econometric Theory, 34, 1281–1324.

Hansen, B. E. (2008): "Uniform Convergence Rates for Kernel Estimation with Dependent Data," Econometric Theory, 24, 726–748.

Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998): "Characterizing Selection Bias Using Experimental Data," Econometrica, 66(2), 315–331.

Hoch, I. (1955): "Estimation of Production Function Parameters and Testing for Efficiency," Econometrica, 23, 325–326.

——— (1962): "Estimation of Production Function Parameters Combining Time-Series and Cross-Section Data," Econometrica, 34–53.

Levinsohn, J. and A. Petrin (2003): "Estimating Production Functions Using Inputs to Control for Unobservables," Review of Economic Studies, 70, 317–341.

Little, R. J. (1992): "Regression with Missing X's: A Review," Journal of the American Statistical Association, 87, 1227–1237.

Marschak, J. and W. H. Andrews (1944): "Random Simultaneous Equations and the Theory of Production," Econometrica, 12, 143–205.

Matzkin, R. L. (2007): "Nonparametric Identification," Handbook of Econometrics, 6, 5307–5368.

McDonough, I. K. and D. L. Millimet (2017): "Missing Data, Imputation, and Endogeneity," Journal of Econometrics, 199, 141–155.

Milgrom, P. and C. Shannon (1994): "Monotone Comparative Statics," Econometrica, 62, 157–180.

Mundlak, Y. (1961): "Empirical Production Function Free of Management Bias," Journal of Farm Economics, 43, 44–56.

——— (1963): "Specification and Estimation of Multiproduct Production Functions," Journal of Farm Economics, 45, 433–443.

Newey, W. K. and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, IV, ed. by R. F. Engle and D. L. McFadden, 2112–2245.

Newey, W. K. (1994): "The Asymptotic Variance of Semiparametric Estimators," Econometrica, 62, 1349–1382.

Olley, G. S. and A. Pakes (1996): "The Dynamics of Productivity in the Telecommunications Equipment Industry," Econometrica, 64, 1263–1297.

Ridder, G. and R. Moffitt (2007): "The Econometrics of Data Combination," Handbook of Econometrics, 6, 5469–5547.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): "Estimation of Regression Coefficients when Some Regressors Are Not Always Observed," Journal of the American Statistical Association, 89, 846–866.

Rosenbaum, P. and D. Rubin (1983): "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Roy, S. and T. Sabarwal (2010): "Monotone Comparative Statics for Games with Strategic Substitutes," Journal of Mathematical Economics, 46, 793–806.

Rubin, D. (1973): "Matching to Remove Bias in Observational Studies," Biometrics, 29, 159–183.

Rubin, D. B. (1976): "Inference and Missing Data," Biometrika, 63, 581–592.

Todd, P. and K. Wolpin (2003): "On the Specification and Estimation of the Production Function for Cognitive Achievement," Economic Journal, 113, F3–F33.

Van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

Vives, X. (2000): Oligopoly Pricing: Old Ideas and New Trends, Cambridge, MA: MIT Press.

Wooldridge, J. M. (2007): "Inverse Probability Weighted Estimation for General Missing Data Problems," Journal of Econometrics, 141, 1281–1301.
A The Cobb-Douglas Case with Optimal Inputs
Suppose that firm $i$ chooses inputs optimally by solving the following (expected) profit-maximization problem:
\[
\max_{X_{1i},X_{2i}} \; e^{\alpha_0+u_i}\, X_{1i}^{\alpha_1} X_{2i}^{\alpha_2}\, E\!\left[e^{\epsilon_i}\right] \;-\; Z_{1i}X_{1i} - Z_{2i}X_{2i}, \tag{15}
\]
where $X_{1i}, X_{2i}, Z_{1i}, Z_{2i}$ denote the exponentials of $x_{1i}, x_{2i}, z_{1i}, z_{2i}$ and $E[e^{\epsilon_i}]$ is normalized to one. By the first-order conditions,
\begin{align*}
X_{1i} &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{\alpha_2-1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{-\alpha_2}{1-\alpha_1-\alpha_2}}, \\
X_{2i} &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{-\alpha_1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{\alpha_1-1}{1-\alpha_1-\alpha_2}}, \\
Y_i &= e^{\frac{\alpha_0+u_i}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{1i}}{\alpha_1}\right)^{\frac{-\alpha_1}{1-\alpha_1-\alpha_2}} \left(\frac{Z_{2i}}{\alpha_2}\right)^{\frac{-\alpha_2}{1-\alpha_1-\alpha_2}} \\
&= e^{\alpha_0+u_i}\left(\frac{\alpha_2 Z_{1i}}{\alpha_1 Z_{2i}}\right)^{\alpha_2} X_{1i}^{\alpha_1+\alpha_2}
 \;=\; e^{\alpha_0+u_i}\left(\frac{\alpha_1 Z_{2i}}{\alpha_2 Z_{1i}}\right)^{\alpha_1} X_{2i}^{\alpha_1+\alpha_2}.
\end{align*}
In log form,
\begin{align*}
x_{1i} = h_1(u_i, z_i) &= \frac{\alpha_0 + (1-\alpha_2)\log\alpha_1 + \alpha_2\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{1-\alpha_2}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{\alpha_2}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i, \\
x_{2i} = h_2(u_i, z_i) &= \frac{\alpha_0 + \alpha_1\log\alpha_1 + (1-\alpha_1)\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{\alpha_1}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{1-\alpha_1}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i, \\
y_i = \bar y(u_i, z_i) &= \frac{\alpha_0 + \alpha_1\log\alpha_1 + \alpha_2\log\alpha_2}{1-\alpha_1-\alpha_2} - \frac{\alpha_1}{1-\alpha_1-\alpha_2}\, z_{1i} - \frac{\alpha_2}{1-\alpha_1-\alpha_2}\, z_{2i} + \frac{1}{1-\alpha_1-\alpha_2}\, u_i \\
&= \alpha_0 + \alpha_2\log(\alpha_2/\alpha_1) + (\alpha_1+\alpha_2)\, h_1(u_i, z_i) + \alpha_2 z_{1i} - \alpha_2 z_{2i} + u_i \\
&= \alpha_0 + \alpha_1\log(\alpha_1/\alpha_2) + (\alpha_1+\alpha_2)\, h_2(u_i, z_i) - \alpha_1 z_{1i} + \alpha_1 z_{2i} + u_i.
\end{align*}
Taking inverses,
\begin{align*}
u_i &= h_1^{-1}(x_{1i}, z_i) := -\left[\alpha_0 + (1-\alpha_2)\log\alpha_1 + \alpha_2\log\alpha_2\right] + (1-\alpha_1-\alpha_2)\, x_{1i} + (1-\alpha_2)\, z_{1i} + \alpha_2\, z_{2i}, \\
u_i &= h_2^{-1}(x_{2i}, z_i) := -\left[\alpha_0 + \alpha_1\log\alpha_1 + (1-\alpha_1)\log\alpha_2\right] + (1-\alpha_1-\alpha_2)\, x_{2i} + \alpha_1\, z_{1i} + (1-\alpha_1)\, z_{2i},
\end{align*}
so that
\begin{align*}
\gamma_1(x_{1i}, z_i) &= \bar y\big(h_1^{-1}(x_{1i}, z_i), z_i\big) = -\log\alpha_1 + x_{1i} + z_{1i}, \\
\gamma_2(x_{2i}, z_i) &= \bar y\big(h_2^{-1}(x_{2i}, z_i), z_i\big) = -\log\alpha_2 + x_{2i} + z_{2i},
\end{align*}
and
\begin{align*}
y_i &= \gamma_1(x_{1i}, z_i) + \epsilon_i = -\log\alpha_1 + x_{1i} + z_{1i} + \epsilon_i \\
&= \gamma_2(x_{2i}, z_i) + \epsilon_i = -\log\alpha_2 + x_{2i} + z_{2i} + \epsilon_i. \tag{16}
\end{align*}
It is then evident that $\alpha_1$ or $\alpha_2$ can be estimated directly from (16) on the corresponding subsample where $x_{1i}$ or $x_{2i}$ is observed. Furthermore, we may test input optimality based on equation (16).

B Proofs
B.1 Additional Notation and Lemmas
Notation
For each $i$, we use $x_{ij}$ to denote the observed input and $x_{ik}$ to denote the latent input variable for firm $i$, i.e.,
\[
x_{ij} = x_{1i},\; x_{ik} = x_{2i} \quad \text{for } d_i = 1, \qquad
x_{ij} = x_{2i},\; x_{ik} = x_{1i} \quad \text{for } d_i = 2.
\]
We write $d_{1i} := 1\{d_i = 1\}$ and $d_{2i} := 1\{d_i = 2\}$, so that $x_{ij} = d_{1i}x_{1i} + d_{2i}x_{2i}$ while $x_{ik} := d_{1i}x_{2i} + d_{2i}x_{1i}$. We write $x_i := (1, x_{1i}, x_{2i})'$ to denote the true regressor vector. (Recall that $\tilde x_i$ denotes the same regressor vector with the imputed latent input $\hat x_{ik}$ in place of $x_{ik}$.) Moreover, we suppress the instrumental variables $z_i$ in functions, such as $\gamma_k(\cdot, z_i)$, unless it becomes necessary to emphasize the dependence of such functions on $z_i$.

Lemma 1. Under Assumption 8, if $\|\hat\gamma_k - \gamma_k\|_\infty = O_p(a_n)$, then $\|\hat\gamma_k^{-1} - \gamma_k^{-1}\|_\infty = O_p(a_n)$ and $|\hat x_{ik} - x_{ik}| = O_p(a_n)$.

Proof. By Assumption 8 we have
\[
c\,|u - u'| \leq |\gamma_k(u) - \gamma_k(u')|.
\]
For any $v \in \mathrm{Range}(\gamma_k)$,
\begin{align*}
\left|\hat\gamma_k^{-1}(v) - \gamma_k^{-1}(v)\right|
&\leq \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - \gamma_k\big(\gamma_k^{-1}(v)\big)\right|
 = \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - v\right| \\
&= \frac{1}{c}\left|\gamma_k\big(\hat\gamma_k^{-1}(v)\big) - \hat\gamma_k\big(\hat\gamma_k^{-1}(v)\big)\right|
 \leq \frac{1}{c}\,\|\hat\gamma_k - \gamma_k\|_\infty = O_p(a_n).
\end{align*}
Furthermore, observing that
\[
c\left|\gamma_k^{-1}(v) - \gamma_k^{-1}(v')\right| \leq \left|\gamma_k\big(\gamma_k^{-1}(v)\big) - \gamma_k\big(\gamma_k^{-1}(v')\big)\right| = |v - v'|,
\]
we have, by Assumption 8 and the first part of the lemma,
\begin{align*}
|\hat x_{ik} - x_{ik}| &= \left|\hat\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\gamma_j(x_{ij})\big)\right| \\
&\leq \left|\hat\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big)\right| + \left|\gamma_k^{-1}\big(\hat\gamma_j(x_{ij})\big) - \gamma_k^{-1}\big(\gamma_j(x_{ij})\big)\right| \\
&\leq \left\|\hat\gamma_k^{-1} - \gamma_k^{-1}\right\|_\infty + \frac{1}{c}\left|\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\right| \\
&\leq \left\|\hat\gamma_k^{-1} - \gamma_k^{-1}\right\|_\infty + \frac{1}{c}\,\|\hat\gamma_j - \gamma_j\|_\infty = O_p(a_n). \tag{17}
\end{align*}

Lemma 2.
Under Assumption 8:

(i) The pathwise derivative of $\gamma_k^{-1}$ w.r.t. $\gamma_k$ along $\tau_k \in \Gamma$ is given by
\[
\nabla_{\gamma_k}\gamma_k^{-1}[\tau_k](v) := \lim_{t \searrow 0}\frac{(\gamma_k + t\tau_k)^{-1}(v) - \gamma_k^{-1}(v)}{t} = -\frac{\tau_k\big(\gamma_k^{-1}(v)\big)}{\gamma_k'\big(\gamma_k^{-1}(v)\big)}.
\]

(ii) The pathwise derivative of $\gamma_k^{-1}(\gamma_j(\cdot))$ w.r.t. $\gamma_j$ along $\tau_j \in \Gamma$ is given by
\[
\nabla_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j](x) := \lim_{t \searrow 0}\frac{\gamma_k^{-1}\big(\gamma_j(x)+t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)}{t} = \big(\gamma_k^{-1}\big)'\big(\gamma_j(x)\big)\,\tau_j(x) = \frac{\tau_j(x)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

(iii) The second-order pathwise derivatives have bounded norms:
\[
\left\|\nabla^2_{\gamma_k}\gamma_k^{-1}[\tau_k][\tau_k]\right\| \leq M\|\tau_k\|^2, \qquad
\left\|\nabla^2_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j][\tau_j]\right\| \leq M\|\tau_j\|^2.
\]

Proof. (i) and (ii) follow immediately from the definition of pathwise derivatives. See, e.g., Lemmas 3.9.20 and 3.9.25 in Van Der Vaart and Wellner (1996) for reference. For (iii),
\[
\nabla^2_{\gamma_k}\gamma_k^{-1}[\tau_k][\nu_k]
= \frac{\tau_k'\big(\gamma_k^{-1}\big)\,\nu_k\big(\gamma_k^{-1}\big) + \tau_k\big(\gamma_k^{-1}\big)\,\nu_k'\big(\gamma_k^{-1}\big)}{\big[\gamma_k'\big(\gamma_k^{-1}\big)\big]^2}
- \frac{\tau_k\big(\gamma_k^{-1}\big)\,\gamma_k''\big(\gamma_k^{-1}\big)\,\nu_k\big(\gamma_k^{-1}\big)}{\big[\gamma_k'\big(\gamma_k^{-1}\big)\big]^3}
\leq M\|\tau_k\|\,\|\nu_k\|,
\]
since $\gamma_k' \geq c > 0$ while $\gamma_k''$, $\tau_k'$ and $\nu_k'$ are uniformly bounded above by Assumption 9(i). The bound for $\nabla^2_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)$ follows similarly.
Writing $\gamma := (\gamma_1, \gamma_2)$, the pathwise derivative of $\gamma_k^{-1}\circ\gamma_j$ w.r.t. $\gamma$ along $\tau = (\tau_j, \tau_k)$ is given by
\[
\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau](x) := \lim_{t \searrow 0}\frac{(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)}{t}
= \frac{\tau_j(x) - \tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

Proof. By Lemma 2,
\begin{align*}
&\frac{1}{t}\Big[(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)\Big] \\
&= \frac{1}{t}\Big[(\gamma_k + t\tau_k)^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x) + t\tau_j(x)\big)\Big]
 + \frac{1}{t}\Big[\gamma_k^{-1}\big(\gamma_j(x) + t\tau_j(x)\big) - \gamma_k^{-1}\big(\gamma_j(x)\big)\Big] \\
&\to \nabla_{\gamma_k}\gamma_k^{-1}[\tau_k]\big(\gamma_j(x)\big) + \nabla_{\gamma_j}\big(\gamma_k^{-1}\circ\gamma_j\big)[\tau_j](x) \\
&= -\frac{\tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)} + \frac{\tau_j(x)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}
= \frac{\tau_j(x) - \tau_k\big(\gamma_k^{-1}(\gamma_j(x))\big)}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x))\big)}.
\]

B.2 Proof of Theorem 2
Proof.
We verify the conditions in Lemma 5.4 of Newey (1994), or equivalently, Theorem 8.11 of Newey and McFadden (1994). Recall that $w_i := (y_i, x_i, z_i, d_i)$, $\gamma := (\gamma_1, \gamma_2)$, and
\begin{align*}
g(w_i, \hat\alpha, \hat\gamma) &= z_i\Big(y_i - \hat\alpha_0 - \big(x_{1i}\hat\alpha_1 + \hat\gamma_2^{-1}(\hat\gamma_1(x_{1i}))\hat\alpha_2\big)d_{1i} - \big(x_{2i}\hat\alpha_2 + \hat\gamma_1^{-1}(\hat\gamma_2(x_{2i}))\hat\alpha_1\big)d_{2i}\Big) \\
&= z_i\big(y_i - \hat\alpha_0 - x_{ij}\hat\alpha_j - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\hat\alpha_k\big), \\
g(w_i, \hat\gamma) &= z_i\big(y_i - \alpha_0 - x_{ij}\alpha_j - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\alpha_k\big) \\
&= z_i\big(u_i + \epsilon_i + \big[x_{ik} - \hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big]\alpha_k\big).
\end{align*}
Clearly, $E[g(w_i, \gamma)] = E[z_i(u_i + \epsilon_i)] = 0$ by Assumptions 6 and 4. Moreover, $\frac{1}{N}\sum_{i=1}^N g(w_i, \hat\alpha, \hat\gamma) = 0$ by the definition of $\hat\alpha$.

Now, define
\begin{align*}
G(w_i, \hat\gamma - \gamma) &:= \nabla_\gamma g(w_i, \gamma)[\hat\gamma - \gamma] = -\alpha_k z_i\,\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\hat\gamma - \gamma] \\
&= -\frac{\alpha_k z_i}{\gamma_k'\big(\gamma_k^{-1}(\gamma_j(x_{ij}))\big)}\Big[(\hat\gamma_j - \gamma_j)(x_{ij}) - (\hat\gamma_k - \gamma_k)\big(\gamma_k^{-1}(\gamma_j(x_{ij}))\big)\Big] \\
&= -\frac{\alpha_k z_i}{\gamma_k'(x_{ik})}\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij}) - \hat\gamma_k(x_{ik}) + \gamma_k(x_{ik})\big] \qquad \text{since } \gamma_k^{-1}(\gamma_j(x_{ij})) = x_{ik} \\
&= d_{1i}\, z_i\left(-\frac{\alpha_2}{\gamma_2'}\right)(1, -1)\begin{pmatrix}\hat\gamma_1 - \gamma_1\\ \hat\gamma_2 - \gamma_2\end{pmatrix}
 + d_{2i}\, z_i\left(-\frac{\alpha_1}{\gamma_1'}\right)(-1, 1)\begin{pmatrix}\hat\gamma_1 - \gamma_1\\ \hat\gamma_2 - \gamma_2\end{pmatrix} \\
&= -z_i\left(\frac{d_{1i}\alpha_2}{\gamma_2'} - \frac{d_{2i}\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma). \tag{18}
\end{align*}
By Lemma 2(iii) and Lemma 3, we deduce
\[
\|g(w, \hat\gamma) - g(w, \gamma) - G(w, \hat\gamma - \gamma)\| = O_p\big(\|\hat\gamma - \gamma\|_\infty^2\big) = o_p\Big(\tfrac{1}{\sqrt N}\Big),
\]
given our assumption that $\|\hat\gamma - \gamma\|_\infty = o_p\big(N^{-1/4}\big)$.

Next, the stochastic equicontinuity condition
\[
\frac{1}{\sqrt N}\sum_{i=1}^N\left(G(w_i, \hat\gamma - \gamma) - \int G(w_i, \hat\gamma - \gamma)\,dP(w_i)\right) = o_p(1) \tag{19}
\]
is guaranteed by Assumptions 8 and 9. Specifically, $\hat\gamma - \gamma$ belongs to a Donsker class of functions by the smoothness assumption, while $1/\gamma_k'(x_{ik}) \leq 1/c$ guarantees that $G(w_i, \cdot)$ is square-integrable, so that $G(w_i, \cdot)$ is also Donsker and thus (19) holds.

Now, write $\zeta_i := (x_i, z_i)$ so that $w_i = (y_i, \zeta_i, d_i)$. Then we have
\begin{align*}
\int G(w_i, \hat\gamma - \gamma)\,dP(w_i) &= \int -z_i\left(\frac{d_{1i}\alpha_2}{\gamma_2'} - \frac{d_{2i}\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i, d_i) \\
&= \int -z_i\left(\left[\int d_{1i}\,dP(d_i|\zeta_i)\right]\frac{\alpha_2}{\gamma_2'} - \left[\int d_{2i}\,dP(d_i|\zeta_i)\right]\frac{\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i) \\
&= \int -z_i\left(\frac{\lambda_1(\zeta_i)\alpha_2}{\gamma_2'} - \frac{\lambda_2(\zeta_i)\alpha_1}{\gamma_1'}\right)(1, -1)(\hat\gamma - \gamma)\,dP(\zeta_i).
\end{align*}
By Proposition 4 of Newey (1994), with
\[
\varphi(w_i) := -\left(\frac{\lambda_1\alpha_2}{\gamma_2'} - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)(d_{1i} - d_{2i}),
\]
we have
\[
-z_i\left(\frac{\lambda_1\alpha_2}{\gamma_2'} - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)(1, -1)\begin{pmatrix} d_{1i}\big(y_i - \gamma_1(x_{1i})\big) \\ d_{2i}\big(y_i - \gamma_2(x_{2i})\big)\end{pmatrix} \equiv \varphi(w_i)\, z_i\, \epsilon_i,
\]
and by Assumption 10,
\[
\int G(w, \hat\gamma - \gamma)\,dP(w) = \frac{1}{N}\sum_{i=1}^N \varphi(w_i)\, z_i\, \epsilon_i + o_p\Big(\tfrac{1}{\sqrt N}\Big).
\]
Hence, by Lemma 5.4 of Newey (1994),
\[
\frac{1}{\sqrt N}\sum_{i=1}^N g(w_i, \hat\gamma) = \frac{1}{\sqrt N}\sum_{i=1}^N\big[g(w_i, \gamma) + \varphi(w_i)z_i\epsilon_i\big] + o_p(1) \xrightarrow{d} N(0, \Omega),
\]
where
\[
\Omega := \mathrm{Var}\big[g(w_i, \gamma) + \varphi(w_i)z_i\epsilon_i\big] = E\Big[z_iz_i'\big(u_i + [1 + \varphi(w_i)]\epsilon_i\big)^2\Big].
\]
Lastly, by Lemma 1,
\[
\left|\frac{1}{N}\sum_{i=1}^N z_i(\hat x_{ik} - x_{ik})\right| \leq \frac{1}{N}\sum_{i=1}^N |z_i|\,|\hat x_{ik} - x_{ik}| \leq O_p(a_N)\cdot\frac{1}{N}\sum_{i=1}^N|z_i| = O_p(a_N) = o_p(1),
\]
and thus
\[
\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i' = E[z_ix_i'] + \frac{1}{N}\sum_{i=1}^N z_i(\tilde x_i - x_i)' + \frac{1}{N}\sum_{i=1}^N\big(z_ix_i' - E[z_ix_i']\big) = E[z_ix_i'] + O_p(a_N) + O_p\Big(\tfrac{1}{\sqrt N}\Big) \xrightarrow{p} \Sigma_{zx} := E[z_ix_i'].
\]
Therefore,
\[
\sqrt N(\hat\alpha - \alpha) = \left(\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N g(w_i, \hat\gamma) \xrightarrow{d} N\big(0,\; \Sigma_{zx}^{-1}\,\Omega\,\Sigma_{zx}'^{-1}\big).
\]
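The imputation object $\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))$ that drives the proof above can be illustrated with a minimal numerical sketch. The design below is an illustrative log-linear specification (not the paper's empirical model), and for clarity the functions $\gamma_1, \gamma_2$ are treated as known rather than estimated nonparametrically in Steps 1-2; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative design: both log inputs are strictly increasing in the common
# shock u, so they are functionally dependent, as required for the matching.
a1, a2 = 0.3, 0.4
u = rng.normal(size=n)
x1 = 0.8 * u                          # log input 1
x2 = 0.8 * u + np.log(a2 / a1)        # log input 2 (tied to x1 via optimality)
d = rng.integers(1, 3, size=n)        # input-based sampling: which input observed

# With gamma_j(x_j) = -log(alpha_j) + x_j (instruments z_i suppressed), the
# latent input is imputed by matching on the conditionally expected outcome:
# x_ik = gamma_k^{-1}(gamma_j(x_ij)).
g1 = lambda x: -np.log(a1) + x
g2 = lambda x: -np.log(a2) + x
g1_inv = lambda v: v + np.log(a1)
g2_inv = lambda v: v + np.log(a2)

x2_imputed = np.where(d == 1, g2_inv(g1(x1)), x2)   # x2 is latent when d_i = 1
x1_imputed = np.where(d == 2, g1_inv(g2(x2)), x1)   # x1 is latent when d_i = 2

# With the true gammas, the imputation recovers the latent inputs exactly;
# with estimated gammas, Lemma 1 bounds the error by the rate O_p(a_n).
assert np.allclose(x1_imputed, x1)
assert np.allclose(x2_imputed, x2)
```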
Proof.
Assumption 11(i) guarantees that N ∼ N ∼ N so that (cid:107) ˆ γ − γ (cid:107) ∞ ∼ (cid:107) ˆ γ − γ (cid:107) ∞ = O p ( a N )where, by Assumption 11(ii)-(v) and Theorem 8 of Hansen (2008), a N = b p + √ log N √ N b . With b chosen according to Assumption 11(vi) so that √ log N √ Nb = o (cid:16) N − (cid:17) and √ N b p →
0, implying that a N = o (cid:16) N − (cid:17) + o (cid:16) N − (cid:17) = o (cid:16) N − (cid:17) , verifying Assumption 9(ii). Assumption 10 (and consequently Proposition 2) followsfrom Theorem 8.11 of Newey and McFadden (1994).Since ˆ ϕ p −→ ϕ and ˆ ϕ ∗ p −→ ϕ ∗ , Proposition 1 then follows from Theorem 8.13 ofNewey and McFadden (1994). B.4 An Alternative and More Efficient Estimator ˆ α ∗ The estimator ˆ α proposed in the main text is defined by an IV estimator of theregression equation y i = α + α x i + α x i + u i + (cid:15) i , E [ u i + (cid:15) i | z i ] = 0in Step 3, where the left-hand side is the raw outcome variable y i . Alternatively, withSteps 1 and 2 unchanged, we may construct a slightly different estimator ˆ α ∗ for α based on the conditionally expected outcome as described below.51 tep 3* : Estimate the following equation y i = α + α x i + α x i + u i , E [ u i | z i ] = 0 , (20)with the outcome variable given by y i := F ( u i , z i ) = γ ( x i , z i ) = γ ( x i , z i ) , replaced by its plug-in estimator˜ y i := ˆ γ ( x i , z i ) , for d i = 1 , ˆ γ ( x i , z i ) , for d i = 2 , Again using z i as IVs, estimate α byˆ α ∗ := (cid:32) n n (cid:88) i =1 z i ˜ x i (cid:33) − (cid:32) n n (cid:88) i =1 z i ˜ y i (cid:33) . The difference between ˆ α and ˆ α ∗ lies in the outcome variable being used for theIV regression: ˆ α is based on the raw output y i , while ˆ α ∗ is based on the estimatedconditionally expected output y i . As we will show below, ˆ α ∗ is in fact asymptoticallymore efficient than ˆ α . Theorem 3 (Asymptotic Normality of ˆ α ∗ ) . Define g ∗ ( w i , ˜ α, ˜ γ ) := z i (cid:0) ˜ γ ( x i ) − ˜ α − ˜ α x i − ˜ α ˜ γ − (˜ γ ( x i )) (cid:1) for d i = 1 ,z i (cid:0) ˜ γ ( x i ) − ˜ α − ˜ α x i − ˜ α ˜ γ − (˜ γ ( x i )) (cid:1) for d i = 2 , and g ∗ ( w i , ˜ γ ) as well as G ∗ similarly as in Section 3.1.3. 
Define
\[
\hat\varphi^*(w_i) := \left[\hat\lambda_1\left(1 - \frac{\hat\alpha_2}{\hat\gamma_2'}\right) + \hat\lambda_2\frac{\hat\alpha_1}{\hat\gamma_1'}\right]1\{d_i = 1\}
 + \left[\hat\lambda_1\frac{\hat\alpha_2}{\hat\gamma_2'} + \hat\lambda_2\left(1 - \frac{\hat\alpha_1}{\hat\gamma_1'}\right)\right]1\{d_i = 2\}.
\]
Under Assumptions 1-10 with
$G, \varphi$ replaced by $G^*, \varphi^*$ whenever applicable,
\[
\sqrt N\big(\hat\alpha^* - \alpha\big) \xrightarrow{d} N(0, \Sigma^*),
\]
where $\Sigma^* := \Sigma_{zx}^{-1}\,\Omega^*\,\Sigma_{xz}^{-1}$ and $\Omega^* := E\big[z_iz_i'\big(u_i + \varphi^*(w_i)\epsilon_i\big)^2\big]$.

The proof is very similar to that of Theorem 2 and is presented in Appendix B.5. Next, we compare the asymptotic variances of $\hat\alpha^*$ and $\hat\alpha$, and show that $\hat\alpha^*$ is in fact asymptotically more efficient.

Theorem 4 ($\hat\alpha^*$ is Asymptotically More Efficient than $\hat\alpha$). $\Omega - \Omega^*$ is positive definite, i.e., $\hat\alpha^*$ is asymptotically more efficient than $\hat\alpha$.

The proof is in Appendix B.6. Here we discuss the intuition of Theorem 4. The error term for the IV regression with the raw outcome $y_i$ as the left-hand-side variable is $u_i + \epsilon_i$, which has a larger variance than the corresponding error term $u_i$ if the conditionally expected outcome $\bar y_i$ is used instead. Even though we do not observe $\bar y_i$ and must use an estimator $\tilde y_i = \hat\gamma_1(x_{1i})$ or $\tilde y_i = \hat\gamma_2(x_{2i})$, the impact of the first-stage estimation error (which can be loosely thought of as an average of $\epsilon_i$ across $i$) is smaller than the impact of $\epsilon_i$ itself.

To see this more clearly, first consider the multiplier "$1 + \varphi(w_i)$" in $\Omega$: the "1" comes from the one "raw" share of error $\epsilon_i$ embedded in each $y_i$ that we use as the outcome variable, while "$\varphi(w_i)$" essentially captures the share of influence of the first-step estimation error $\hat\gamma - \gamma$ due to $\epsilon_i$. Together, we have
\[
1 + \varphi = \left(1 - \frac{\lambda_1\alpha_2}{\gamma_2'} + \frac{\lambda_2\alpha_1}{\gamma_1'}\right)1\{d_i = 1\}
 + \left(\frac{\lambda_1\alpha_2}{\gamma_2'} + 1 - \frac{\lambda_2\alpha_1}{\gamma_1'}\right)1\{d_i = 2\},
\]
while the corresponding multiplier $\varphi^*$ on $\epsilon_i$ in $\Omega^*$ is essentially the same, except that "$1 - \frac{\lambda_1\alpha_2}{\gamma_2'}$" becomes "$\lambda_1 - \frac{\lambda_1\alpha_2}{\gamma_2'}$" and "$1 - \frac{\lambda_2\alpha_1}{\gamma_1'}$" becomes "$\lambda_2 - \frac{\lambda_2\alpha_1}{\gamma_1'}$". Since $\lambda_1, \lambda_2 < 1$, the multiplier on $\epsilon_i$ becomes smaller in magnitude. (Note that $\alpha_1/\gamma_1' \leq 1$ and $\alpha_2/\gamma_2' \leq 1$.) Essentially, by using the estimated conditional expected output $\tilde y_i$, the raw "1" share of $\epsilon_i$ in $y_i$ is moved into the first-stage estimation error of $\bar y_i$, which is then "averaged" and reduced in magnitude to $\lambda_1$ or $\lambda_2$, thus leading to a smaller overall variance.

Lastly, we emphasize that the efficiency comparison in Theorem 4 does not directly relate to the theory of semiparametric efficiency bounds, such as in Ackerberg et al. (2014): estimators based on $y_i$ and $\tilde y_i$ may each attain their corresponding semiparametric efficiency bounds with respect to their different criterion functions $g$ and $g^*$. Theorem 4, however, is a comparison across the two criterion functions $g$ and $g^*$: it essentially states that the asymptotically efficient estimator under $g^*$ is even more efficient than the efficient estimator under $g$.

B.5 Proof of Theorem 3
Proof.
We adapt the proof of Theorem 2 above with
\begin{align*}
g^*(w_i, \hat\alpha, \hat\gamma) &:= z_i\big(\hat\gamma_j(x_{ij}) - \hat\alpha_0 - \hat\alpha_j x_{ij} - \hat\alpha_k\,\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big), \\
g^*(w_i, \hat\gamma) &:= z_i\big(\hat\gamma_j(x_{ij}) - \alpha_0 - \alpha_j x_{ij} - \alpha_k\,\hat\gamma_k^{-1}(\hat\gamma_j(x_{ij}))\big),
\end{align*}
with
\[
E[g^*(w_i, \gamma)] = E\big[z_i\big(\gamma_j(x_{ij}) - \alpha_0 - \alpha_j x_{ij} - \alpha_k\,\gamma_k^{-1}(\gamma_j(x_{ij}))\big)\big] = E[z_iu_i] = 0
\]
and $\frac{1}{N}\sum_{i=1}^N g^*(w_i, \hat\alpha^*, \hat\gamma) = 0$.

By the chain rule,
\begin{align*}
G^*(w_i, \hat\gamma - \gamma) &:= \nabla_\gamma g^*(w_i, \gamma)[\hat\gamma - \gamma]
 = z_i\Big(\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\big] - \alpha_k\,\nabla_\gamma\big(\gamma_k^{-1}\circ\gamma_j\big)[\hat\gamma - \gamma]\Big) \\
&= z_i\left(1 - \frac{\alpha_k}{\gamma_k'(x_{ik})}\right)\big[\hat\gamma_j(x_{ij}) - \gamma_j(x_{ij})\big] + z_i\,\frac{\alpha_k}{\gamma_k'(x_{ik})}\big[\hat\gamma_k(x_{ik}) - \gamma_k(x_{ik})\big] \\
&= z_i\left[d_{1i}\left(1 - \frac{\alpha_2}{\gamma_2'},\; \frac{\alpha_2}{\gamma_2'}\right) + d_{2i}\left(\frac{\alpha_1}{\gamma_1'},\; 1 - \frac{\alpha_1}{\gamma_1'}\right)\right](\hat\gamma - \gamma),
\end{align*}
and
\[
\int G^*(w_i, \hat\gamma - \gamma)\,dP(w_i) = \int z_i\left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'},\;\; \lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)(\hat\gamma - \gamma)\,dP(\zeta_i).
\]
By Proposition 4 of Newey (1994), with
\[
\varphi^*(w_i) := \left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'}\right)d_{1i}
 + \left(\lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)d_{2i},
\]
we have
\[
z_i\left(\lambda_1\left(1 - \frac{\alpha_2}{\gamma_2'}\right) + \lambda_2\frac{\alpha_1}{\gamma_1'},\;\; \lambda_1\frac{\alpha_2}{\gamma_2'} + \lambda_2\left(1 - \frac{\alpha_1}{\gamma_1'}\right)\right)\begin{pmatrix} d_{1i}\big(y_i - \gamma_1(x_{1i})\big) \\ d_{2i}\big(y_i - \gamma_2(x_{2i})\big)\end{pmatrix} \equiv \varphi^*(w_i)\, z_i\, \epsilon_i,
\]
and by Assumption 10,
\[
\int G^*(w, \hat\gamma - \gamma)\,dP(w) = \frac{1}{N}\sum_{i=1}^N \varphi^*(w_i)\, z_i\, \epsilon_i + o_p\Big(\tfrac{1}{\sqrt N}\Big).
\]
Hence, we have
\[
\frac{1}{\sqrt N}\sum_{i=1}^N g^*(w_i, \hat\gamma) = \frac{1}{\sqrt N}\sum_{i=1}^N\big[g^*(w_i, \gamma) + \varphi^*(w_i)z_i\epsilon_i\big] + o_p(1) \xrightarrow{d} N(0, \Omega^*),
\]
where
\[
\Omega^* := \mathrm{Var}\big[g^*(w_i, \gamma) + \varphi^*(w_i)z_i\epsilon_i\big] = E\Big[z_iz_i'\big(u_i + \varphi^*(w_i)\epsilon_i\big)^2\Big],
\]
giving
\[
\sqrt N\big(\hat\alpha^* - \alpha\big) = \left(\frac{1}{N}\sum_{i=1}^N z_i\tilde x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N g^*(w_i, \hat\gamma) \xrightarrow{d} N\big(0,\; \Sigma_{zx}^{-1}\,\Omega^*\,\Sigma_{zx}'^{-1}\big).
\]
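The algebraic relation between the two error multipliers, $1 + \varphi = \varphi^* + (1-\lambda_1)d_{1i} + (1-\lambda_2)d_{2i}$, which drives the efficiency ranking in Theorem 4 below, can be checked numerically. The values of $\lambda_1, \lambda_2$ and of the ratios $\alpha_j/\gamma_j'$ (written `r1`, `r2`) in this sketch are purely illustrative.

```python
import numpy as np

# Numerical check of the multiplier identity behind the efficiency comparison:
# 1 + phi = phi* + (1 - lambda_1) d_1 + (1 - lambda_2) d_2, with lambda_j < 1
# and 0 < alpha_j / gamma_j' < 1 (denoted r1, r2 here).
lam1, lam2 = 0.6, 0.4          # illustrative sampling probabilities
r1, r2 = 0.35, 0.45            # illustrative alpha_1/gamma_1', alpha_2/gamma_2'

for d1, d2 in [(1, 0), (0, 1)]:
    phi = -(lam1 * r2 - lam2 * r1) * (d1 - d2)
    phi_star = (lam1 * (1 - r2) + lam2 * r1) * d1 \
             + (lam1 * r2 + lam2 * (1 - r1)) * d2
    # The identity and the ordering 0 < phi* < 1 + phi hold for both d_i cases,
    # hence (1 + phi)^2 > (phi*)^2, which is the content of Theorem 4.
    assert np.isclose(1 + phi, phi_star + (1 - lam1) * d1 + (1 - lam2) * d2)
    assert 0 < phi_star < 1 + phi
```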
Proof.
By (7), we have ∂∂c γ j ( c ; z ) = α j + α k x (cid:48) k x (cid:48) j + 1 x (cid:48) j > α j , and thus 0 < α j /γ (cid:48) j <
1, which implies λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) > , λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) > . ϕ ∗ = (cid:18) λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) (cid:19) d i + (cid:18) λ (cid:18) − α γ (cid:48) (cid:19) + λ α γ (cid:48) (cid:19) d i >
01 + ϕ = 1 − (cid:18) α γ (cid:48) λ − α γ (cid:48) λ (cid:19) ( d i − d i )= (cid:18) − λ α γ (cid:48) + λ α γ (cid:48) (cid:19) d i + (cid:18) − λ α γ (cid:48) + λ α γ (cid:48) (cid:19) d i = ϕ ∗ + (1 − λ ) d i + (1 − λ ) d i > ϕ ∗ > . Hence, (1 + ϕ ) > ϕ ∗ > − Ω ∗ = E (cid:104) z i z (cid:48) i (cid:2) (1 − ϕ ( x i , d i )) − ϕ ∗ ( x i , d i ) (cid:3) (cid:15) i (cid:105) is positive definite. C Robustness Check for First Application
Although most pharmacies in our sample have one manager and one pharmacist, there are a few pharmacies with more than one employee pharmacist. For this subset of pharmacies, we compute the total hours worked by employee pharmacists by multiplying the reported hours worked of one employee by the number of employees. The second imputation step is then applied based on the total hours worked by all employees. In this process, we implicitly assume that the labor hours of two different employees are perfect substitutes. As a robustness check, we also estimate a version of the production function in which the elasticity of substitution between the hours worked by different employees is equal to one. Table 12 summarizes this version of the estimation results. The estimated parameters show that employees become slightly less productive at both independents and chains compared to our baseline estimation, but in general our estimation results are robust to how we treat employee inputs from pharmacies with more than one employee.

Table 12: Using $N \ast \log(x)$ instead of $\log(N \ast H)$. [Estimates of $\alpha_0$, $\alpha_1$, $\alpha_2$ for independent and chain pharmacies, each based on observed and expected outputs.]

D Inter Vivos Gifts
Consider an example with a married couple and two parental households, $j = 1, 2$, with wealth levels $m_1$ and $m_2$, which is based on Bergstrom, Blume, and Varian (1986). Parents are altruistic toward their married offspring but not toward that offspring's spouse. Parental household $j$ has utility
\[
u_j(g_j) = \ln(m_j - g_j) + \mu\ln(g_1 + g_2),
\]
where $g_j$ is the married couple's gift from parental household $j$ and $\mu$ is the probability that both parental households think the children's marriage will endure. This leads to a noncooperative game between the two parental households, since the incentive for either household to give to the offspring couple diminishes as the other parental household gives more. This is a game of strategic substitutes. The Nash equilibrium of this game between the two parental households is
\[
g_1^* = \frac{(1+\mu)m_1 - m_2}{2+\mu}, \qquad g_2^* = \frac{(1+\mu)m_2 - m_1}{2+\mu}.
\]
There is a unique Nash equilibrium for any $\mu$ and for any wealth levels of the two households that are not "too" different. Both $g_1^*$ and $g_2^*$ are strictly increasing in the shock $\mu$, and hence the outcome is strictly increasing in $\mu$.
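The equilibrium formulas and the comparative statics in $\mu$ can be verified numerically: each household's first-order condition is $1/(m_j - g_j) = \mu/(g_1 + g_2)$, and total gifts equal $\mu(m_1 + m_2)/(2 + \mu)$. The wealth levels and values of $\mu$ below are illustrative.

```python
import numpy as np

# Illustrative wealths and shock; wealths are not "too" different, so the
# equilibrium is interior (both gifts strictly positive).
mu, m1, m2 = 0.8, 10.0, 8.0
g1 = ((1 + mu) * m1 - m2) / (2 + mu)
g2 = ((1 + mu) * m2 - m1) / (2 + mu)

assert g1 > 0 and g2 > 0                           # interior equilibrium
assert np.isclose(1 / (m1 - g1), mu / (g1 + g2))   # FOC of household 1
assert np.isclose(1 / (m2 - g2), mu / (g1 + g2))   # FOC of household 2

# Total gifts g1 + g2 = mu * (m1 + m2) / (2 + mu) are strictly increasing
# in the shock mu, as claimed in the text.
mus = np.linspace(0.1, 2.0, 20)
total = mus * (m1 + m2) / (2 + mus)
assert np.all(np.diff(total) > 0)
```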