Gaussian Transforms Modeling and the Estimation of Distributional Regression Functions
GGAUSSIAN TRANSFORMS MODELING AND THE ESTIMATIONOF DISTRIBUTIONAL REGRESSION FUNCTIONS
RICHARD H. SPADY † AND SAMI STOULI § Abstract.
Conditional distribution functions are important statistical objects forthe analysis of a wide class of problems in econometrics and statistics. We proposeflexible Gaussian representations for conditional distribution functions and give aconcave likelihood formulation for their global estimation. We obtain solutions thatsatisfy the monotonicity property of conditional distribution functions, includingunder general misspecification and in finite samples. A Lasso-type penalized versionof the corresponding maximum likelihood estimator is given that expands the scopeof our estimation analysis to models with sparsity. Inference and estimation resultsfor conditional distribution, quantile and density functions implied by our represen-tations are provided and illustrated with an empirical example and simulations. Introduction
The modeling and estimation of conditional distribution functions are important forthe analysis of various econometric and statistical problems. Conditional distributionfunctions are core building blocks in the identification and estimation of nonseparablemodels with endogeneity (e.g., Imbens and Newey, 2009; Chernozhukov, Fernandez-Val, Newey, Stouli, and Vella, 2020), in counterfactual distributional analysis (e.g.,DiNardo, Fortin, and Lemieux, 1996; Chernozhukov, Fernandez-Val, and Melly, 2013),or in the construction of prediction intervals for a stationary time series (e.g., Hall,Wolff, and Yao, 1999; Chernozhukov, Wutrich, and Zhu, 2019). Conditional distribu-tion functions are also a fruitful starting point for the formulation of flexible estimationmethods for other objects of interest (Spady and Stouli, 2018a), such as conditionalquantile functions (CQF).
Date : November 13, 2020. † Nuffield College, Oxford, and Department of Economics, Johns Hopkins University,[email protected]. § Department of Economics, University of Bristol, [email protected] are grateful to Whitney Newey for his encouragements and useful comments, and to seminarparticipants at Bristol, UC San Diego, Oxford, Lehigh, LSE, Johns Hopkins University and theEconometric Society World Congress 2020. We thank Xiaoran Liang for excellent research assistance. a r X i v : . [ ec on . E M ] N ov or a continuous outcome variable Y and a vector of covariates X , three main diffi-culties arise in the formulation of a flexible model and in the choice of a loss functionfor the estimation of the conditional distribution and quantile functions of Y given X . A first main difficulty is the specification of a model that allows for the shape ofthe distribution of Y to vary across values of X , while being characterized by a lossfunction that preserves monotonicity in Y at each value of a potentially large num-ber of explanatory variables X in estimation. Because a valid maximum likelihood(ML) characterization would require this monotonicity property to hold, a second andrelated difficulty is the formulation of a loss function that characterizes an approxi-mate model with a clear information-theoretic interpretation under misspecification.A third difficulty is that nonconcave likelihoods naturally arise in the context of non-separable models, even in the simplest case of a Gaussian location-scale specification. One approach is to discard the monotonicity requirement in estimation and use lossfunctions that characterize quantile or distribution functions pointwise, while specify-ing a functional form that allows for the shape of the distribution of Y to vary acrossvalues of X . Quantile regression (Koenker and Basset, 1978) specifies each CQF as alinear combination of the components of X . The CQF is then estimated at each quan-tile by a sequence of linear programming problems. Distribution regression (Foresi andPerrachi, 1995; Chernozhukov, Fernandez-Val, and Melly, 2013) specifies each level ofthe cumulative distribution function (CDF) of Y conditional on X as a known CDFtransformation of a linear combination of the components of X . The conditional CDFis then estimated at each Y value by a sequence of binary outcome ML estimators.Another approach is to insist on the monotonicity requirement and use loss functionsthat characterize both quantile and distribution functions globally, but do not havean ML interpretation. Dual regression (Spady and Stouli, 2018a) specifies monotonerepresentations for Y given X as linear combinations of known functions of X and astochastic element. The conditional CDF is then estimated globally by the empiricaldistribution of the estimated sample values of the stochastic element.In this paper we take a different approach by formulating Gaussian representationsfor conditional CDFs, instead of modeling conditional CDFs or CQFs directly. Theserepresentations are specified as linear combinations of known functions of X and Y ,and the implied distributional regression models allow for the shape of the distributionof Y to vary across values of X . We give a concave likelihood characterization that Cf. Owen (2007) and Spady and Stouli (2018b) for a discussion in the context of simultaneousestimation of location and scale parameters in a linear regression model. ules out nonmonotone solutions. 
Under general misspecification, this formulation alsocharacterizes quasi-Gaussian representations that satisfy the monotonicity propertyof conditional CDFs by construction. The corresponding distributional models areoptimal approximations to the true data probability distribution according to theKullback-Leibler Information Criterion (KLIC) (White, 1982).For estimation we derive the properties of the corresponding ML estimator and extendour analysis to a two-step penalized ML estimation strategy, where the unpenalizedestimator is used as a first step for an adaptive Lasso (Zou, 2006) ML estimator whichpreserves the concavity of the objective function. We derive asymptotic properties ofthe corresponding estimators for conditional distribution, quantile and density func-tions. The penalized estimator is selection consistent, asymptotically normal andoracle, where the selection is based on the pseudo-true values of the parameter esti-mators. Under correct specification the estimator is also efficient. We also give thedual formulation of our estimators that we use for implementation.This paper makes five main contributions to the existing literature. First, we introducea new class of Gaussian representations in linear form for the flexible estimation ofdistributional regression models. Second, we demonstrate that our models and the cor-responding loss function characterize globally monotone conditional CDFs and CQFsunder general misspecification, both in finite samples with probability approachingone and in the population. Quantile and distribution regression can result in bothfinite sample estimates and population approximations under misspecification that donot satisfy the monotonicity property of conditional quantile and distribution func-tions. Third, we establish that the resulting approximations are KLIC optimal undergeneral misspecification. Compared to dual regression, we find that the monotonicityproperty can be obtained jointly with the KLIC optimality property, and we establishexistence and uniqueness of solutions under general misspecification. Fourth, we useduality theory to show that our formulation has considerable computational advan-tages. Compared to dual regression, we find in particular that the dual ML problemhas the important advantage of being a convex programming problem (Boyd andVandenberghe, 2004) with linear constraints. Fifth, our estimation analysis allows forsparsity, thereby giving an asymptotically valid characterization of sparse, globallymonotone, and KLIC optimal representations for conditional CDFs and CQFs. Cf. Chernozhukov, Fernandez-Val, and Galichon (2010) for a discussion in the context of quantileregression. n Section 2 we introduce Gaussian transforms modeling. In Section 3 we give resultsunder misspecification. Section 4 contains estimation and inference results, and du-ality theory is derived in Section 5. Section 6 illustrates our methods, and Section7 concludes. The proofs of all results are given in the Appendix. The online Ap-pendix Spady and Stouli (2020) contains supplemental material, including results ofnumerical simulations calibrated to the empirical illustration.2. Gaussian Transforms Modeling
Let Y be a continuous outcome variable and X a vector of explanatory variables. Atransformation to Gaussianity of the conditional CDF F Y | X ( Y | X ) of Y given X occursby application of the Gaussian quantile function Φ − ,(2.1) e = Φ − ( F Y | X ( Y | X )) ≡ g ( Y, X ) , where the resulting Gaussian transform (GT) e is a zero mean and unit varianceGaussian random variable and is independent from X , by construction. With y (cid:55)→ F Y | X ( y | X ) strictly increasing, the corresponding map y (cid:55)→ g ( y, X ) is also strictlyincreasing, with well-defined inverse denoted e (cid:55)→ h ( X, e ).Important statistical objects such as the conditional distribution, quantile and den-sity functions of Y given X can be expressed as known functionals of g ( Y, X ). Theconditional CDF of Y given X can be expressed as F Y | X ( Y | X ) = Φ( g ( Y, X )) , the CQF of Y given X as Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , and the conditional probability density function (PDF) of Y given X as f Y | X ( Y | X ) = φ ( g ( Y, X )) { ∂ y g ( Y, X ) } , ∂ y g ( Y, X ) ≡ ∂g ( Y, X ) ∂y , where e (cid:55)→ φ ( e ) is the Gaussian PDF and we denote partial derivatives as ∂ y g ( y, x ) ≡ ∂g ( y, x ) /∂y . The GT g ( Y, X ) thus constitutes a natural modeling target in the contextof distributional regression models for F Y | X ( Y | X ), Q Y | X ( u | X ), and f Y | X ( Y | X ). Werefer to these objects as the ‘distributional regression functions’.In this paper we consider the class of conditional CDFs with Gaussian representa-tion e = g ( Y, X ) in linear form, where g ( Y, X ) is specified as a linear combination of nown transformations of Y and X . The implied models for the distributional regres-sion functions are flexible, parsimonious, and able to capture complex features of theentire statistical relationship between Y and X . In particular, these models allow fornonlinearity and nonseparability of this relationship.2.1. Gaussian representations in linear form.
Let W ( X ) be a K × X and S ( Y ) a J × Y . Assume that W ( X ) includes an intercept, i.e., has first component 1, and that S ( Y ) has first twocomponents (1 , Y ) (cid:48) and derivative dS ( Y ) /dy = s ( Y ), a vector of functions continuouson R . We denote the marginal support of Y and X by Y and X , respectively, andtheir joint support by YX .Given a random vector ( Y, X (cid:48) ) (cid:48) with support YX = Y × X where Y = R , for some b ∈ R JK a GT regression model takes the form(2.2) e = b (cid:48) T ( X, Y ) , e | X ∼ N (0 , , T ( X, Y ) ≡ W ( X ) ⊗ S ( Y ) , with derivative function,(2.3) ∂ y { b (cid:48) T ( X, Y ) } = b (cid:48) t ( X, Y ) > , t ( X, Y ) ≡ W ( X ) ⊗ s ( Y ) , and where we use the Kronecker product ⊗ to define the dictionary formed with W ( X ), S ( Y ) and their interactions as T ( X, Y ), and the corresponding derivative vec-tor as t ( X, Y ). The GT g ( Y, X ) in (2.1) is specified as a linear combination of theknown functions T ( X, Y ), and hence of the components W ( X ), S ( Y ) and their inter-actions. The linear form of e is preserved by the derivative function b (cid:48) t ( X, Y ) whichis simultaneously specified as a linear combination of the known functions t ( X, Y ).This linear specification can be viewed as an approximation to the general Gaussiantransformation (2.1) when, for a specified dictionary T ( X, Y ), there is no b ∈ R JK such that (2.2)-(2.3) hold. We analyze this case in Section 3.An interpretation of model (2.2)-(2.3) as a varying coefficients model arises from speci-fying e and its derivative function as a linear combination of the known functions S ( Y )and s ( Y ), respectively,(2.4) e = β ( X ) (cid:48) S ( Y ) , ∂ y { β ( X ) (cid:48) S ( Y ) } = β ( X ) (cid:48) s ( Y ) > , with the vector of varying coefficients β ( X ) = ( β ( X ) , . . . , β J ( X )) (cid:48) specified as(2.5) β j ( X ) = b (cid:48) j W ( X ) , j ∈ { , . . . , J } , ith b j = ( b j , . . . , b jK ) (cid:48) , j ∈ { , . . . , J } . Together (2.4)-(2.5) give the linear form J (cid:88) j =1 β j ( X ) S j ( Y ) = J (cid:88) j =1 { b (cid:48) j W ( X ) } S j ( Y ) = b (cid:48) [ W ( X ) ⊗ S ( Y )] = b (cid:48) T ( X, Y ) , with derivative b (cid:48) t ( X, Y ) >
0, which has the form of (2.2)-(2.3). Since the derivativecondition requires β ( X ) (cid:48) s ( Y ) >
0, it is necessary to formulate β ( X ) and s ( Y ) so thatthis is at least possible. A sufficient condition is that both vectors be nonnegativewith probability one. This requirement will for instance be satisfied with b > W ( X ) and s ( Y ) are specified as nonnegative splinefunctions (Curry and Schoenberg, 1966; Ramsay, 1988). In that particular case, werefer to the resulting Gaussian representations as ‘Spline-Spline models’.With J = 2, the important special case of a Gaussian location-scale representationcan be expressed in terms of representation (2.4) as e = β ( X ) + β ( X ) Y, e | X ∼ N (0 , , β j ( X ) ≡ b (cid:48) j W ( X ) , j ∈ { , } , with derivative function β ( X ) >
0, which is of the form (2.2)-(2.3) with S ( Y ) =(1 , Y ) (cid:48) . With β ( X ) = b (cid:48) W ( X ) and β ( X ) ≡ b ∈ R , this specification specializesto the Gaussian location representation e = b (cid:48) W ( X ) + b Y , where b > Y given X implied by (2.2)-(2.3) are(2.6) F Y | X ( y | X ) = Φ( b (cid:48) T ( X, y )) , f Y | X ( y | X ) = φ ( b (cid:48) T ( X, y )) { b (cid:48) t ( X, y ) } , y ∈ R , respectively, and the CQF of Y given X is(2.7) Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , where e (cid:55)→ h ( X, e ) is the well-defined inverse of y (cid:55)→ Φ( b (cid:48) T ( X, y )). With J = 2, theconditional distribution of Y is restricted to Gaussianity for all values of X since theJacobian term b (cid:48) t ( X, y ) = W ( X ) (cid:48) b in (2.6) does not depend on Y . Theorem 1.
For model (2.2)-(2.3), the distributional regression functions take theform (2.6)-(2.7).
Theorem 1 demonstrates that model (2.2)-(2.3) corresponds to a well-defined probabil-ity distribution for Y given X with Gaussian representation in linear form. Therefore,model (2.2)-(2.3) gives a valid representation for the distributional regression func-tions (2.6)-(2.7). Upon setting W ( X ) = 1, Theorem 1 implies that model (2.2)-(2.3)admits distributional models for marginal distribution, quantile and density functions f Y as a particular case. We note that Theorem 1 also implies that the conditionallog density function of Y given X takes the form:log f Y | X ( Y | X ) = −
12 [log(2 π ) + { b (cid:48) T ( X, Y ) } ] + log( b (cid:48) t ( X, Y )) . We use this formulation to give an ML characterization of b , and hence of b (cid:48) T ( X, Y )and the corresponding distributional regression functions.
Remark . Our modeling framework also applies when Y is bounded since Y canalways be monotonically transformed to a random variable with support the real line,e.g., with e = Φ − ( F Y ( Y )) ≡ g ( Y ), where F Y ( Y ) is the marginal distribution of Y .For the GT regression model e = (cid:101) b (cid:48) T ( X, g ( Y )) ≡ (cid:101) g ( g ( Y ) , X ), e | X ∼ N (0 , ∂ y { (cid:101) g ( g ( Y ) , X ) } >
0, the corresponding conditional CDF of Y given X isPr[ Y ≤ y | X ] = Pr[ (cid:101) g ( g ( Y ) , X ) ≤ (cid:101) g ( g ( y ) , X ) | X ] = Φ( (cid:101) g ( g ( y ) , X )), y ∈ R . (cid:3) Remark . With multiple outcomes ( Y , . . . , Y M ) (cid:48) ≡ Y , M ≥
2, writing Y m ≡ ( Y , . . . , Y m ) (cid:48) , a compact generalization of (2.2)-(2.3) is the recursive formulation e m = T m ( X, Y m ) (cid:48) b ,m , e m | X, Y m − ∼ N (0 , , m ∈ { , . . . , M } ,e = T ( X, Y ) (cid:48) b , , e | X ∼ N (0 , , where T m ( X, Y m ) ≡ T m − ( X, Y m − ) ⊗ S m ( Y m ) and T ( X, Y ) ≡ W ( X ) ⊗ S ( Y ), withderivative functions, ∂ y m { T m ( X, Y m ) (cid:48) b ,m } = t m ( X, Y m ) (cid:48) b ,m > , m ∈ { , . . . , M } , where t m ( X, Y m ) ≡ t m − ( X, Y m − ) ⊗ s m ( Y m ), m ∈ { , . . . , M } , and t ( X, Y ) ≡ W ( X ) ⊗ s ( Y ). By construction, the Gaussian representations e , . . . , e M are jointlyGaussian and mutually independent, with variance-covariance the identity matrix,i.e., ( e , . . . , e M ) ∼ (cid:81) Mm =1 Φ( e m ). This is a Gaussian version of Rosenblatt (1952)’smultivariate probability transformation. By recursive application of Theorem 1, theimplied conditional CDF of Y given X is F Y | X ( y , . . . , y M | X ) = ˆ y −∞ . . . ˆ y M −∞ f Y | X ( t , . . . , t M | X ) dt . . . dt M where the PDF of Y given X takes the form f Y | X ( y , . . . , y M | X ) = M (cid:89) m =1 φ ( T m ( X, y m ) (cid:48) b ,m ) { t m ( X, y m ) (cid:48) b ,m } , y m ≡ ( y , . . . , y m ) , for all y , . . . , y M ∈ R . The implied distributional regression functions of Y m given( X, Y (cid:48) m − ) (cid:48) are defined analogously to (2.6)-(2.7), for each m ∈ { , . . . , M } . (cid:3) .2. Characterization and identification.
For the set of parameter values thatsatisfy the derivative condition (2.3),Θ = (cid:8) b ∈ R JK : Pr [ b (cid:48) t ( X, Y ) >
0] = 1 (cid:9) , we define the population objective function(2.8) Q ( b ) = E (cid:20) − (cid:0) log(2 π ) + { b (cid:48) T ( X, Y ) } (cid:1) + log ( b (cid:48) t ( X, Y )) (cid:21) , b ∈ Θ . This criterion introduces a natural logarithmic barrier function (e.g., Boyd and Van-denberghe, 2004) in the form of the log of the Jacobian term ∂ y { b (cid:48) T ( X, Y ) } . Thisis important because the derivative function b (cid:48) t ( X, Y ) enters the log term and themonotonicity requirement for the conditional CDF and CQF is thus imposed directlyby the objective in the definition of the effective domain of Q ( b ), i.e., the region in R JK where Q ( b ) > −∞ . An equivalent interpretation is that the effective domain of Q ( b ) contains the set of parameter values that are admissible for GT regression mod-els with strictly positive conditional PDF, by virtue of the presence and properties ofboth the Gaussian density function and the logarithmic barrier function in (2.8).We characterize the shape and properties of Q ( b ) under the following main assumption. Assumption 1. E [ || T ( X, Y ) || ] < ∞ , E [ || t ( X, Y ) || ] < ∞ , and the smallest eigen-value of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] is bounded away from zero.These conditions restrict the set of dictionaries we allow for, as well as the probabil-ity distribution of Y conditional on X . In particular, because T ( X, Y ) includes Y ,Assumption 1 requires Y to have finite second moment. The moment conditions inAssumption 1 are also sufficient for the second-derivative matrix of Q ( b ),(2.9)Γ( b ) ≡ E [ γ ( Y, X, b )] , γ ( Y, X, b ) ≡ − T ( X, Y ) T ( X, Y ) (cid:48) − t ( X, Y ) t ( X, Y ) (cid:48) { b (cid:48) t ( X, Y ) } , b ∈ Θ , to exist. Nonsingularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] then guarantees that Γ( b ) is negativedefinite, and hence that Q ( b ) is strictly concave and admits a unique maximum. Non-singularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] is thus sufficient for identification of b , and the GT g ( Y, X ) is identified as a known linear combination of the known functions T ( X, Y ),and hence the distributional regression functions also are identified.
Theorem 2.
For model (2.2)-(2.3), if Assumption 1 holds then Q ( b ) has a uniquemaximum in Θ at b . Consequently, b , the GT g ( Y, X ) and the distributional regres-sion functions are identified. y Theorem 2, b is the only solution to the first-order conditions(2.10) E [ ψ ( Y, X, b )] = 0 , ψ ( Y, X, b ) ≡ − T ( X, Y )( b (cid:48) T ( X, Y )) + t ( X, Y ) b (cid:48) t ( X, Y ) , b ∈ Θ . For the baseline case where Y has a zero mean and unit variance Gaussian distributionand is independent from X , we have that b (cid:48) T ( X, Y ) = Y and b (cid:48) t ( X, Y ) = 1 satisfythe conditions of model (2.2)-(2.3). Theorem 2 then implies that conditions (2.10) areuniquely satisfied by b = (0 , , JK − ) (cid:48) . This fact can be directly verified: E [ ψ ( Y, X, b )] = E [ − T ( X, Y ) Y + t ( X, Y )] = E [ W ( X ) ⊗ {− S ( Y ) Y + s ( Y ) } ]= E [ W ( X )] ⊗ E [ − S ( Y ) Y + s ( Y )] = 0 , since E [ − S ( Y ) Y + s ( Y )] = 0 has the form of the Stein equation for a standardGaussian random variable (e.g., Lemma 2.1 in Chen, Goldstein, and Shao, 2010),and hence holds for any vector of continuously differentiable functions S ( Y ) with E [ | s j ( Y ) | ] < ∞ , j ∈ { , . . . , J } . In contrast, conditions (2.10) holding with b (cid:54) =(0 , , JK − ) (cid:48) will indicate deviations of Y from Gaussianity and independence from X , thereby characterizing a transformation to Gaussianity of Y for almost everyvalue of X since b satisfies (2.2)-(2.3). Hence, we have the following direct testableimplications of Theorem 2. Corollary 1.
If there exists b such that model (2.2)-(2.3) holds then, for any vec-tors of functions (cid:102) W ( X ) and of continuously differentiable functions (cid:101) S ( e ) such that (cid:101) T ( X, e ) ≡ (cid:102) W ( X ) ⊗ (cid:101) S ( e ) and (cid:101) t ( X, e ) ≡ ∂ e (cid:101) T ( X, e ) satisfy Assumption 1 with T = (cid:101) T , t = (cid:101) t and Y = e , the following hold: (i) (0 , , JK − ) (cid:48) is the unique solution to max b ∈ (cid:101) Θ E [ − (log(2 π ) + { b (cid:48) (cid:101) T ( X, e ) } ) / b (cid:48) (cid:101) t ( X, e ))] , where (cid:101) Θ ≡ { b ∈ R JK : Pr[ b (cid:48) (cid:101) t ( X, e ) >
0] = 1 } , and (ii) the ‘Stein score’ conditions E [ − (cid:101) T ( X, e ) e + (cid:101) t ( X, e )] = 0 hold.
Discussion.
The general modeling of F Y | X ( Y | X ) can be done indirectly by spec-ifying a representation for Y given X ,(2.11) Y = H ( X, e ) , e | X ∼ F e , where the function H ( X, e ) is strictly increasing in its second argument e , a scalarrandom variable with distribution F e and independent of X . The specification of both he function H and the distribution F e then determines the form of F Y | X ( Y | X ):(2.12) F Y | X ( y | X ) = F e ( H − ( y, X )) , y ∈ R , where y (cid:55)→ H − ( y, X ) denotes the inverse function of e (cid:55)→ H ( X, e ). In this approach,while in our context the statistical target of the analysis is F Y | X ( Y | X ), for a specifieddistribution F e the object of modeling is the function H ( X, e ).In Econometrics relation (2.11) is often characterized as ‘nonlinear and nonseparable’in order to draw attention to the potentially complex X – Y structure at constant e and the lack of additive structure in e (e.g., Chesher, 2003; Matzkin, 2003). Theseare essential features of H that allow for the shape of the conditional distribution of Y to vary across values of X . An alternative approach to (2.11)-(2.12) that preservesnonlinearity and nonseparability is to model F Y | X ( Y | X ) directly as(2.13) F Y | X ( Y | X ) = F e ( g ( Y, X )) , for some strictly increasing function y (cid:55)→ g ( y, X ). In the approach we propose in thispaper, with F − e denoting the inverse function of F e , for a specified distribution F e the object of modeling is the quantile transform g ( X, Y ) = F − e ( F Y | X ( Y | X )), whichby construction has distribution F e and is independent of X .The modeling of the statistical relationship between X and Y through representation(2.12) or representation (2.13) is not innocuous. In particular, with f e denoting thePDF of e , the definition of the conditional PDF of Y given X according to the indirectapproach (2.12),(2.14) f Y | X ( y | X ) = f e ( H − ( y, X )) { ∂ y H − ( y, X ) } , y ∈ R , involves the inverse function of the modeling object H . In general this inverse functiondoes not have a closed-form expression, except for some simple cases like the locationmodel H ( X, e ) ≡ X (cid:48) β + σe with σ >
0, and the location-scale model H ( X, e ) ≡ X (cid:48) β +( X (cid:48) β ) e with X (cid:48) β >
0. Furthermore, expression (2.14) gives rise to a nonconcavelikelihood for even the simplest specifications of H and F e , including the locationand location-scale models with Gaussian e (Owen, 2007; Spady and Stouli, 2018b).In contrast, a major advantage of representation (2.13) is that the correspondingexpression for f Y | X ( Y | X ) circumvents the inversion step since f Y | X ( Y | X ) = f e ( g ( Y, X )) { ∂ y g ( Y, X ) } . his formulation allows for the direct specification of flexible models for g ( Y, X ) thatare characterized by a concave likelihood. Hence, considerable computational advan-tages accrue in estimation when e = g ( Y, X ) can be computed in closed-form, asfurther demonstrated by the duality analysis in Section 6. Moreover, we show inthe next section that this formulation allows for the characterization of well-definedrepresentations for F Y | X ( Y | X ) under misspecification.3. Quasi-Gaussian Representations under Misspecification
In this section we study the properties of quasi-Gaussian representations for F Y | X ( Y | X )that are generated by maximization of the objective Q ( b ) under general misspecifica-tion, i.e., when there is no representation of the form (2.2)-(2.3) that satisfies eitherthe Gaussianity or the independence properties, or both. We establish existence anduniqueness of such quasi-Gaussian representations and we find that the implied rep-resentations for distributional regression functions are well-defined and KLIC optimalapproximations for the true distributional regression functions.3.1. Existence and uniqueness.
Assumption 1 is sufficient for characterizing thesmoothness properties and the shape of Q ( b ) on Θ. The objective function is con-tinuous and strictly concave over the parameter space, and hence admits at mostone maximizer. Existence of a maximizer, on the other hand, requires an additionalregularity condition. Assumption 2.
The joint density function f Y X ( Y, X ) of Y and X is bounded awayfrom zero with probability one.Assumptions 1 and 2 allow for the characterization of the behavior of Q ( b ) on theboundary of Θ. Under these assumptions, the level sets of Q ( b ) are compact. Com-pactness of the level sets is a sufficient condition for existence of a maximizer, and is aconsequence of the explosive behavior of the objective function at the boundary of Θ.By the quadratic term −{ b (cid:48) T ( X, Y ) } being negative, as b approaches the boundaryof Θ the log Jacobian term diverges to −∞ , and hence so does −{ b (cid:48) T ( X, Y ) } / { b (cid:48) t ( X, Y ) } ) on a set with positive probability. Under Assumption 2, this is suf-ficient to conclude that the objective function Q ( b ) diverges to −∞ , and hence thatthere exists at least one maximizer to Q ( b ) in Θ, denoted b ∗ . nder misspecification, to the maximizer b ∗ corresponds the quasi-Gaussian repre-sentation e ∗ = T ( X, Y ) (cid:48) b ∗ ≡ g ∗ ( Y, X ), where g ∗ ( Y, X ) is an element of the set offinite-dimensional representations
E ≡ { m : Pr[ m ( Y, X ) = b (cid:48) T ( X, Y )] = 1 } with b ∈ Θ. By definition of Θ, y (cid:55)→ b (cid:48) T ( X, y ) is strictly increasing for each b ∈ Θwith probability one, and hence each m ∈ E has a well-defined inverse function. Wenote that nonsingularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] implies that g ∗ ( Y, X ) is unique in E ,i.e., there is no m = g ∗ in E with m ( Y, X ) = b (cid:48) T ( Y, X ) and b (cid:54) = b ∗ .Define the range of y (cid:55)→ Φ( m ( y, x )) as U x ( m ) ≡ { u ∈ (0 ,
1) : Φ( m ( y, x )) = u for some y ∈ R } , for m ∈ E and x ∈ X . To the quasi-Gaussian representation g ∗ ( Y, X ) correspond flexible approximations for the conditional CDF and CQF of Y given X , defined as F ∗ ( Y, X ) ≡ Φ( g ∗ ( Y, X )) , Q ∗ ( u, X ) ≡ h ∗ ( X, Φ − ( u )) , u ∈ U X ( g ∗ ) , where e (cid:55)→ h ∗ ( X, e ) denotes the inverse of y (cid:55)→ g ∗ ( y, X ), and for the conditional PDFof Y given X , defined as(3.1) f ∗ ( Y, X ) ≡ φ ( g ∗ ( Y, X )) { ∂ y g ∗ ( Y, X ) } . These representations are unique in, respectively, the following spaces
F ≡ { F : Pr[ F ( Y, X ) = Φ( m ( Y, X ))] = 1 }Q ≡ (cid:8) Q : Pr[ Q ( u, X ) = q ( X, Φ − ( u )) for all u ∈ U X ( m )] = 1 (cid:9) D ≡ { f : Pr[ f ( Y, X ) = φ ( m ( Y, X )) { ∂ y m ( Y, X ) } ] = 1 } with m ∈ E , and where e (cid:55)→ q ( X, e ) denotes the inverse of y (cid:55)→ m ( y, X ). Therefore,the approximations for the distributional regression functions are well-defined, andthe conditional CDF and CQF approximations satisfy global monotonicity. Theorem 3.
If Assumptions 1-2 hold then there exists a unique maximum b ∗ to Q ( b ) in Θ . Consequently, the quasi-Gaussian representation g ∗ ( Y, X ) and the correspondingapproximations for the distributional regression functions are unique. KLIC optimality.
When the elements of D are proper conditional probabilitydistributions that integrate to one, a further motivation for the use of the proposedloss function Q ( b ) is the information-theoretic optimality of the implied distributionalregression functions under misspecification (White, 1982). ince each f ∈ D satisfies f > f ∈ D is a properconditional PDF if it satisfies ´ R f ( y, X ) dy = 1 with probability one. A necessary andsufficient condition for this to hold is that the boundary conditions(3.2) lim y →−∞ b (cid:48) T ( X, y ) = −∞ , lim y →∞ b (cid:48) T ( X, y ) = ∞ , hold with probability one, for all b ∈ Θ. Given a specified dictionary such that(3.2) holds, Theorem 3 implies that the approximation f ∗ ( Y, X ) in (3.1) is the uniquemaximum selected by the population criterion in D , i.e., f ∗ = arg max f ∈D E [log f ( Y, X )] , and hence that f ∗ ( Y, X ) is the KLIC closest probability distribution to f Y | X ( Y | X ).The corresponding F ∗ and Q ∗ are then the KLIC optimal conditional CDF and CQFapproximations for F Y | X ( Y | X ) and Q Y | X ( u | X ), respectively. Theorem 4. If E [ | log f Y | X ( Y | X ) | ] < ∞ and the boundary conditions (3.2) hold withprobability one for all b ∈ Θ , then f ∗ is the KLIC closest probability distribution to f Y | X ( Y | X ) in D , i.e., f ∗ = arg min f ∈D E (cid:20) log (cid:18) f Y | X ( Y | X ) f ( Y, X ) (cid:19)(cid:21) , where each f ∈ D is a proper conditional PDF. Moreover, f ∗ is related to the KLICoptimal conditional CDF F ∗ in F by F ∗ ( y, X ) = ˆ y −∞ f ∗ ( t, X ) dt, y ∈ R , and to the well-defined inverse of y (cid:55)→ F ∗ ( y, X ) , the KLIC optimal CQF u (cid:55)→ Q ∗ ( X, u ) in Q with derivative ∂Q ∗ ( X, u ) ∂u = 1 f ∗ ( Q ∗ ( X, u ) , X ) > , u ∈ (0 , , with probability one. Under the boundary conditions (3.2), the set F is the space of conditional CDFs withGaussian representation in linear form, and the set Q is the space of correspondingwell-defined CQFs. A necessary and sufficient condition for (3.2) is obtained, forinstance, if the limits lim y →±∞ | S j ( y ) | are finite, j ∈ { , . . . , J } . Under this maintained ondition, the varying coefficients representation e = β ( X ) (cid:48) S ( Y ) in (2.4), written as e = β ( X ) (cid:48) S ( Y ) = β ( X ) + β ( X ) Y + J (cid:88) j =3 β j ( X ) S j ( Y ) , implies that β ( X ) > y →∞ β ( X ) (cid:48) S ( y ) would be finite or −∞ , and lim y →−∞ β ( X ) (cid:48) S ( y ) would befinite or ∞ . The support of Y being the entire real line, β ( X ) > β ( X ) > β ( X ) (cid:48) s ( Y ) = β ( X ) + (cid:80) Jj =3 β j ( X ) s j ( Y ) > s j ( Y ), j ∈ { , . . . , J } , are specified to be zero outside some compact region of R , since thederivative then reduces to β ( X ) outside this region. The boundary conditions (3.2)then effectively hold under a location-scale restriction in the tails of the distributionof Y given X . We also note that (3.2) always holds for J = 2 since the derivativecondition is β ( X ) > Remark . Another interpretation arises for the quasi-Gaussian representation e ∗ = g ∗ ( Y, X ) by writing e ∗ = [ W ( X ) ⊗ S ( Y )] (cid:48) b ∗ = K (cid:88) k =1 W k ( X ) { S ( Y ) (cid:48) b ∗ k } = K (cid:88) k =1 W k ( X ) β ∗ k ( Y ) = W ( X ) (cid:48) β ∗ ( Y ) , with β ∗ ( Y ) = ( β ∗ ( Y ) , . . . , β ∗ K ( Y )) (cid:48) a vector of varying coefficients specified as β ∗ k ( Y ) ≡ S ( Y ) (cid:48) b ∗ k where b ∗ k = ( b ∗ k , . . . , b ∗ kJ ) (cid:48) , k ∈ { , . . . , K } . 
Under the conditions of Theorem4, F ∗ ( Y, X ) = Φ( W ( X ) (cid:48) β ∗ ( Y )) , is the KLIC optimal conditional CDF in F for a distribution regression model ofthe form F Y | X ( Y | X ) = Φ( W ( X ) (cid:48) β ( Y )) (Foresi and Perrachi, 1995; Chernozhukov,Fernandez-Val, and Melly, 2013), where β ( Y ) is a vector of unknown functions. (cid:3) Remark . If some component x (cid:55)→ W k ( x ) of W ( X ) has range the entire real line,then the corresponding varying coefficient β ∗ k ( Y ) must be zero with probability onesince b ∗ ∈ Θ and there is no b ∗ ∈ Θ such that b ∗ k (cid:54) = 0 if x (cid:55)→ W k ( x ) has range R . (cid:3) This and the maintained assumption that lim y →±∞ | S j ( y ) | < ∞ are satisfied for instance if, for each j ∈ { , . . . , J } , the transformations S j ( Y ) are defined as S j ( y ) ≡ ´ y −∞ s j ( t ) dt , for nonnegative splinefunctions s j ( Y ) (cid:54) = 0 on a compact subset of R , as s j ( Y ) = 0 outside this region and S j ( Y ) is then aCDF over the entire real line (Curry and Schoenberg, 1966; Ramsay, 1988). . Estimation, Inference, and Model Specification
Our characterization of GT regression models and of KLIC optimal approximationshas a natural finite sample counterpart. We use the sample analog of the popula-tion objective function (2.8) to propose an ML estimator for GT regression models,which is also asymptotically valid for quasi-Gaussian representations under misspeci-fication. We establish the asymptotic properties of the estimator, and extend the MLformulation in order to allow for potentially sparse Gaussian representations by usingthe ML estimator as a first step for an adaptive Lasso (Zou, 2006) ML estimator.This formulation serves as a model selection procedure, and we derive the asymptoticdistribution of the corresponding estimators for the selected distributional regressionmodel.4.1.
Maximum Likelihood estimation.
We assume that we observe a sample of n independent and identically distributed realizations { ( y i , x i ) } ni =1 of the random vector( Y, X (cid:48) ) (cid:48) . The sample analog of Q ( b ) defines the GT regression empirical loss function: Q n ( b ) ≡ n − n (cid:88) i =1 (cid:26) −
12 [log(2 π ) + { b (cid:48) T ( x i , y i ) } ] + log( b (cid:48) t ( x i , y i )) (cid:27) , b ∈ Θ . The GT regression estimator is(4.1) (cid:98) b ≡ arg max b ∈ Θ Q n ( b ) . We derive the asymptotic properties of ˆ b under the following assumptions. Assumption 3. (i) { ( y i , x i ) } ni =1 are identically and independently distributed, and(ii) E [ || T ( X, Y ) || ] < ∞ .Assumption 3(i) can be replaced with the condition that { ( y i , x i ) } ni =1 is stationaryand ergodic (Newey and McFadden, 1994). Assumption 3(ii) is needed for consistentestimation of the asymptotic variance-covariance matrix of (cid:98) b .Recalling the definitions of γ ( Y, X, b ) and Γ( b ) in (2.9) and ψ ( Y, X, b ) in (2.10),the variance-covariance matrix of ˆ b is Γ − ΨΓ − /n , where Γ ≡ Γ( b ∗ ) and Ψ ≡ E [ ψ ( Y, X, b ∗ ) ψ ( Y, X, b ∗ ) (cid:48) ]. The corresponding estimators of Γ and Ψ are defined as (cid:98) Γ = n − (cid:80) ni =1 γ ( y i , x i , ˆ b ) and (cid:98) Ψ = n − (cid:80) ni =1 ψ ( y i , x i , ˆ b ) ψ ( y i , x i , ˆ b ) (cid:48) , respectively. Thenext theorem states the asymptotic properties of the GT regression estimator. heorem 5. If Assumptions 1-3 hold, then (i) there exists ˆ b in Θ with probabilityapproaching one; (ii) ˆ b → p b ∗ ; and (iii) n / (ˆ b − b ∗ ) → d N (0 , Γ − ΨΓ − ) . Moreover, (cid:98) Γ − (cid:98) Ψ (cid:98) Γ − → p Γ − ΨΓ − . Theorem 5(i) demonstrates existence of a globally monotone representation ˆ b (cid:48) T ( Y, X )with ˆ b (cid:48) t ( Y, X ) > b ∗ such that e ∗ = T ( X, Y ) (cid:48) b ∗ is either not Gaussian or not independent from X , or both. Un-der correct specification, the information matrix equality (e.g., Newey and McFad-den, 1994) implies that Γ = − Ψ and that the estimator is efficient, with asymptoticvariance-covariance matrix − Γ − . The information matrix equality provides a testableimplication of the validity of model (2.2)-(2.3) and forms the basis of a specificationtest in finite samples (White, 1982; Chesher and Spady, 1991). Penalized estimation.
In general the components of a specified dictionary T ( X, Y ) that are sufficient for g ∗ ( Y, X ) to be Gaussian and independent from X are not known. The components of T ( X, Y ) that do not improve the quality of theGT approximation, as measured by the KLIC, have zero coefficients. For selection ofcomponents with nonzero coefficients, we use a penalized ML procedure based on theadaptive Lasso (Lu, Goldberg, and Fine, 2012; Horowitz and Nesheim, 2020) that pre-serves ML KLIC optimality and strict concavity of the objective function. Horowitzand Nesheim (2020) also find that ML adaptive Lasso leads to asymptotic mean-squareerror improvements for nonzero coefficients. Under misspecification, adaptive LassoGT regression selects the KLIC optimal sparse approximation for g ( Y, X ). We notethat we do not assume that the true or pseudo-true parameter vector is sparse.The adaptive Lasso GT regression estimator is defined as(4.2) ˆ b AL ≡ arg max b ∈ Θ n Q n ( b ) − λ n JK (cid:88) l =1 (cid:98) w l | b l | , (cid:98) w l ≡ (cid:40) | ˆ b l | if ˆ b l (cid:54) = 00 if ˆ b l = 0 , where λ n > (cid:98) w l are obtained from afirst-step estimate (4.1). Alternatively, a bootstrap-based specification test can be formulated such as the conditional Kol-mogorov specification test of Andrews (1997) where critical values are obtained using a parametricbootstrap procedure. e write b ∗ = ( b ∗ (cid:48) A , b ∗ (cid:48) A c ) (cid:48) , where b ∗A is a p -dimensional vector of nonzero parametersand b ∗A c is a ( J K − p )-dimensional vector of zero parameters, with p ≤ J K . Thevector (cid:98) b AL = ( (cid:98) b (cid:48)A , (cid:98) b (cid:48)A c ) (cid:48) is written similarly. We state the asymptotic properties of (cid:98) b AL . Theorem 6.
Suppose that Assumptions 1-3 hold, and that λ n → ∞ and n − / λ n → as n → ∞ . Then (i) Pr[ (cid:98) b A c = 0] → , and (ii) n / (ˆ b A − b ∗A ) → d N (0 , Γ − A Ψ A Γ − A ) , where Γ A and Ψ A are the upper left p × p blocks of Γ and Ψ , respectively. Estimation of distributional regression functions.
Estimators of the dis-tributional regression functions are formed as known functionals of an estimator for b ∗ . Let T A ( X, Y ) denote the subvector of T ( X, Y ) corresponding to the componentsof (cid:98) b A , and define T A c ( X, Y ) analogously. Let (cid:98) b † denote either the ML estimator (cid:98) b orthe penalized ML estimator ( (cid:98) b (cid:48)A , JK − p ) (cid:48) , and let T † ( X, Y ) = T ( X, Y ) if (cid:98) b † = (cid:98) b , and T † ( X, Y ) = ( T A ( X, Y ) (cid:48) , T A c ( X, Y ) (cid:48) ) (cid:48) otherwise. The estimators for the GT g ∗ ( y, x )are formed as (cid:98) g ∗ ( y, x ) ≡ T † ( x, y ) (cid:48) (cid:98) b † , ( y, x ) ∈ YX . The corresponding estimators forthe distributional regression functions are defined as (cid:98) F ∗ ( y, x ) ≡ Φ( (cid:98) g ∗ ( y, x )) , (cid:98) f ∗ ( y, x ) ≡ φ ( (cid:98) g ∗ ( y, x )) { ∂ y (cid:98) g ∗ ( y, x ) } , ( y, x ) ∈ YX , and (cid:98) Q ∗ ( x, u ) ≡ { y ∈ R : Φ( (cid:98) g ∗ ( y, x )) = u, ∂ y (cid:98) g ∗ ( y, x ) > } , x ∈ X , u ∈ U x ( g ∗ ) . The asymptotic distribution of both ML and adaptive Lasso estimators for distribu-tional regression functions follows by application of the Delta method.
Theorem 7.
Suppose that Ξ ≡ Γ †− Ψ † Γ †− is positive definite. Under Assumptions1-3 we have: (i) for ( y, x ) ∈ YX , n ( (cid:98) F ∗ ( y, x ) − F ∗ ( y, x )) → d N (cid:0) , φ ( g ∗ ( y, x )) T ( x, y ) (cid:48) Ξ T ( x, y ) (cid:1) , and n ( (cid:98) f ∗ ( y, x ) − f ∗ ( y, x )) → d N (cid:0) , φ ( g ∗ ( y, x )) ∆( x, y ) (cid:48) Ξ∆( x, y ) (cid:1) , where ∆( x, y ) ≡ − g ∗ ( y, x ) { ∂ y g ∗ ( y, x ) } T ( x, y ) + t ( x, y ) ; (ii) for x ∈ X and u ∈ U x ( g ∗ ) , n ( (cid:98) Q ∗ ( x, u ) − Q ∗ ( x, u )) → d N (cid:18) , { ∂ y g ∗ ( y , x ) } T ( x, y ) (cid:48) Ξ T ( x, y ) (cid:19) , where y = Q ∗ ( u, x ) . he asymptotic variance of both the unpenalized and the penalized estimators dependson the asymptotic variance-covariance matrix Ξ of (cid:98) b † , and is computed by substitutingthe corresponding estimator according to Theorems 5 or 6, respectively. Remark . For implementation of the unpenalized estimator (4.1) we expand theoriginal parameter space Θ to Θ n = { b ∈ R JK : b (cid:48) t ( x i , y i ) > , i ∈ { , . . . , n }} , theeffective domain of Q n ( b ). This implies that there exists b ∈ Θ n such that b (cid:48) t ( X, Y ) ≤ (cid:98) b ∈ Θ holds after estimation by checking thequasi-global monotonicity (QGM) property (cid:98) b (cid:48) t ( x, (cid:98) Q ∗ ( x, u )) > X , for each quantile level u of interest. If QGM is violated for some ( x, u )in this grid, then (cid:98) b is reestimated repeatedly by adding an increasing number of linearinequality constraints of the form b (cid:48) T ( x, y ) ≥ (cid:15) on a coarse grid covering Y × X , forsome small constant (cid:15) >
0, until QGM is satisfied. (cid:3)
Remark . For implementation of the penalized estimator (4.2) we also expand theoriginal parameter space Θ to Θ n but do not consider adding monotonicity constraints.Instead, we rule out penalization parameter values λ n for which the QGM propertydoes not hold. (cid:3) Duality Theory
Considerable computational advantages accrue from the concave likelihood formula-tion we propose, where the GT is expressed in closed-form. To the GT regressionproblem (4.1) corresponds a dual formulation that can be cast into the modern con-vex programming framework (Boyd and Vandenberghe, 2004). We derive this dualformulation and establish the properties of the corresponding dual solutions.
Theorem 8.
If Assumptions 1-3 are satisfied then the following hold.(i) The dual of (4.1) is min ( u,v ) ∈ R n × ( −∞ , n − n (cid:18)
12 log(2 π ) + 1 (cid:19) + n (cid:88) i =1 (cid:26) u i − log ( − v i ) (cid:27) (5.1) subject to − n (cid:88) i =1 { T ( x i , y i ) u i + t ( x i , y i ) v i } = 0(5.2) the dual GT regression problem, with solution (cid:98) α = ( (cid:98) u (cid:48) , (cid:98) v (cid:48) ) (cid:48) . ii) The dual GT regression program (5.1)-(5.2) admits the method-of-moments rep-resentation n (cid:88) i =1 (cid:26) − T ( x i , y i ) { b (cid:48) T ( x i , y i ) } + t ( x i , y i ) b (cid:48) t ( x i , y i ) (cid:27) = 0 , the first-order conditions of (4.1).(iii) With probability approaching one we have: (a) existence and uniqueness, i.e.,there exists a unique pair ( (cid:98) b (cid:48) , (cid:98) α (cid:48) ) (cid:48) that solves (4.1) and (5.1)-(5.2), and (5.3) (cid:98) u i = (cid:98) b (cid:48) T ( x i , y i ) , ˆ v i = − (cid:98) b (cid:48) t ( x i , y i ) , i ∈ { , . . . , n } ; (b) strong duality, i.e., the value of (4.1) equals the value of (5.1)-(5.2). The dual formulation established in Theorem 8 demonstrates important computationalproperties of GT regression. The Hessian matrix of the dual problem (5.1)-(5.2) is (cid:34) I n n × n n × n diag(1 /v i ) (cid:35) , a positive definite diagonal matrix for all v ∈ ( −∞ , n , with I n denoting the n × n iden-tity matrix and diag(1 /v i ) the n × n diagonal matrix with elements (1 /v , . . . , /v n ).Thus the dual problem is a strictly convex mathematical program with sparse Hessianmatrix and J K linear constraints. This computationally convenient formulation isexploited by state-of-the-art convex programming solvers like ECOS (Domahidi, Chu,and Boyd, 2013) and SCS (O’Donoghue, Chu, Parikh, and Boyd, 2016) that we usein our implementation.In addition to KLIC optimality of the solution and the presence of a logarithmic barrierfor global monotonicity in the objective, linearity of the constraints is an importantadvantage of the dual formulation (5.1)-(5.2) relative to the alternative generalizeddual regression characterization of CQFs and conditional CDFs (Spady and Stouli,2018a) for which the mathematical program is of the form(5.4) max e ∈ R n (cid:40) y (cid:48) e : n (cid:88) i =1 T ( x i , e i ) = 0 (cid:41) , where T ( x i , e i ) is a specified vector of known functions of x i and e i including e i and( e i − /
2, so that the parameter vector e enters nonlinearly into the constraints. Thefirst-order conditions of (5.4) are(5.5) y i = (cid:98) b (cid:48) { ∂ e i T ( x i , e i ) } , i ∈ { , . . . , n } , here (cid:98) b is the Lagrange multiplier vector for the constraints in (5.4), but where thesolution is now determined by a system of n nonlinear equations instead of havinga closed-form expression as in (5.3). This is a further illustration of the importantbenefits accruing from closed-form modeling of the GT e = g ( Y, X ) and its derivativefunction, compared to direct modeling of the outcome y i in (5.5).The dual formulation extends to the penalized estimator (4.2). Theorem 9.
The dual of (4.2) is min ( u,v ) ∈ R n × ( −∞ , n − n (cid:18)
12 log(2 π ) + 1 (cid:19) + n (cid:88) i =1 (cid:26) u i − log( − v i ) (cid:27) , subject to (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) i =1 { T i,l u i + t i,l v i } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ λ n (cid:98) w l , l ∈ { , . . . , J K } . the dual adaptive Lasso GT regression problem.Remark . For (cid:98) w l = 1 for each l ∈ { , . . . , J K } , the dual adaptive Lasso GT re-gression problem reduces to the dual Lasso GT regression problem, with constraints (cid:107) (cid:80) ni =1 { T i u i + t i v i }(cid:107) ∞ ≤ λ n . (cid:3) An Illustrative Example
In this section we illustrate our framework with the estimation of a distributionalAR(1) model for daily temperatures in Melbourne, Australia. The dataset consists of3,650 consecutive daily maximum temperatures, and was originally analyzed by Hyn-dman, Bashtannyk, and Grunwald (1996). The estimation of distributional regressionfunctions for this dataset is challenging because the shape of the outcome distribution,today’s temperatures Y t , given yesterday’s temperature, Y t − , varies across values of Y t − . Applying quantile regression to this data set, Koenker (2000) finds that tem-peratures following very hot days are bimodally distributed, with the lower modecorresponding to a break in the temperature, that is, a much cooler temperature,whereas temperatures of days following cool days are unimodally distributed. Com-pared to Koenker (2000), we obtain CQFs that are well-behaved across the entiresupport of the data, we estimate the corresponding conditional PDFs and CDFs, andwe provide confidence bands for all distributional regression functions. e illustrate the main features of the GT regression methodology by implementingboth unpenalized and penalized estimation for four different classes of model specifi-cations for e ∗ = g ∗ ( Y t , Y t − ) = [ W ( Y t − ) ⊗ S ( Y t )] (cid:48) b ∗ and its derivative function:(1) Linear-Linear: we set s ( Y t ) = (0 , (cid:48) , S ( Y t ) = (1 , Y t ) (cid:48) and W ( Y t − ) = (1 , Y t − ) (cid:48) .(2) Linear- Y and Spline- X : we set s ( Y t ) = (0 , (cid:48) , S ( Y t ) = (1 , Y t ) (cid:48) and W ( Y t − ) =(1 , (cid:102) W ( Y t − ) (cid:48) ) (cid:48) , with (cid:102) W ( Y t − ) a vector of K − Y and Linear- X : we set s ( Y t ) = (0 , , (cid:101) s ( Y t ) (cid:48) ) (cid:48) , with (cid:101) s ( Y t ) a vector of J − S ( Y t ) = (1 , Y t , (cid:101) S ( Y t ) (cid:48) ) (cid:48) where (cid:101) S j ( y t ) = ´ y t −∞ (cid:101) s ( r ) dr , j ∈ { , . . . , J − } , and W ( Y t − ) = (1 , Y t − ) (cid:48) .(4) Spline-Spline: we set s ( Y t ) = (0 , , (cid:101) s ( Y t ) (cid:48) ) (cid:48) , S ( Y t ) = (1 , Y t , (cid:101) S ( Y t ) (cid:48) ) (cid:48) and W ( Y t − ) = (1 , (cid:102) W ( Y t − ) (cid:48) ) (cid:48) .For specification classes 2 and 4, we consider a set of models including cubic B-splinetransformations in W ( Y t − ) with K ∈ { , . . . , } and equispaced knots. For classes3 and 4 we consider a set of models including quadratic B-spline transformationsin s ( Y t ) with J ∈ { , } and of models including cubic B-splines with J ∈ { , } ,with equispaced knots. In total, we thus consider 50 different model specifications.Spline functions satisfy the conditions of our modeling framework and have beendemonstrated to be remarkably effective when applied to the related problems of logdensity estimation (Kooperberg and Stone, 2001) or monotone regression functionestimation (Ramsay, 1988).For each model specification, we implement three steps. First, we run the penalizedestimator for each of 5 λ n values in a small logarithmically spaced grid in [0 . , . Y coefficients, i.e., we set (cid:98) w = (cid:98) w J +1 = 0. Second,following the literature on adaptive Lasso ML (Lu, Goldberg, and Fine, 2012; Horowitzand Nesheim, 2020), we select the value of λ n that minimizes the Bayes informationcriterion (BIC) among penalized estimates that satisfy QGM (cf. Remark 5). Third,we record the BIC value of the corresponding selected estimate. 
In the SupplementaryMaterial we describe in detail the implementation of the QGM property, and allcomputational procedures can be implemented in the software R (R DevelopmentCore Team, 2020) using open source software packages for convex optimization suchas CVX, and its R implementation CVXR (Fu, Narasimhan, and Boyd, 2017).Figure 6.1 shows CQFs for the models with smallest recorded BIC within each ofthe specification classes 1-3, illustrating the different features of the data that each Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (a) Spec. 1.
Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (b) Spec. 2.
Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (c) Spec. 3.
Figure 6.1.
CQF with scatterplot, for u ∈ { . , . , . . . , . } .specification class captures, as well as the corresponding restrictions on the implieddistribution of Y t given Y t − . For both classes 1 and 2, this implied distribution isrestricted to Gaussianity across all values of Y t − . Figure 6.1(A) shows that specifica-tion class 1 also strongly restricts the shape of the CQFs across values of Y t − , but isable to capture some nonlinearity in Y t − . Figure 6.1(B) shows that specification class2 further allows for nonmonotonicity of the CQFs in Y t − , while capturing substantialheteroskedasticity in the data, a reflection of the more flexible functional forms for theconditional first and second moments of Y t given Y t − . In contrast with specificationclasses 1-2, for class 3 the GT g ∗ ( Y t , Y t − ) is nonlinear in Y t which allows for devia-tions of the conditional distribution of Y t given Y t − from Gaussianity, through thedependence of the derivative function on both Y t and Y t − . Figure 6.1(C) illustratesthe ability of specification class 3 to capture asymmetry of the distribution of Y t given Y t − , as well as changes in the mode location of this distribution across values of Y t − ,in addition to allowing for nonlinearity of the CQF and heteroskedasticity.Figures 6.2-6.3 show distributional regression functions for the model specificationwith smallest BIC within the Spline-Spline specification class 4. The selected modelhas smallest BIC among all specification classes 1-4, and features quadratic splines in s ( Y t ) with J = 5, and cubic splines in W ( Y t − ) with K = 7. In total, this parametriza-tion includes 35 parameters, of which 25 are estimated to be nonzero after penalization.This parsimonious Spline-Spline model is able to simultaneously capture all important Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (a) CQF with scatterplot, for u ∈ { . , . , . . . , . } . Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (b) CQF with confidence bands, for u ∈ { . , . , . , . . , . } . Figure 6.2.
Unpenalized (left) and penalized (right) CQF.data features described above, including nonlinearity in Y t − and the varying shape ofthe conditional distribution of Y t . In particular, for the unpenalized CQF in Figure6.2, the uneven spacing of the quantiles at higher values of lagged temperature sug-gests that the conditional PDF of current temperature is bimodal at such values. Thetwo modes are especially apparent from the unpenalized PDFs displayed in 6.3(B),and are also reflected by the two inflection points in the corresponding CDF in Fig-ure 6.3(A). The right panels of Figures 6.2-6.3 show the penalized versions of the .000.250.500.751.00 10 20 30 40 Today's Temperature C ond i t i ona l C u m u l a t i v e D i s t r i bu t i on F un c t i on Today's Temperature C ond i t i ona l C u m u l a t i v e D i s t r i bu t i on F un c t i on (a) Conditional CDF.
Today's Temperature C ond i t i ona l D en s i t y F un c t i on Today's Temperature C ond i t i ona l D en s i t y F un c t i on (b) Conditional PDF.
Figure 6.3.
Unpenalized (left) and penalized (right) conditional PDFand CDF with confidence bands, for y t − ∈ { . , . , . , . , . } .distributional regression functions. In this example, the penalized estimator yieldsvery similar conclusions, the main differences being the somewhat less pronouncedbimodality for days with high temperatures, as well as the tighter confidence bandsat the CQF boundaries.Overall, we find that parsimonious Gaussian representations are able to capture com-plex features of the data, such as nonmonotonicity and conditional distributions with arying shapes, while providing complete estimates of distributional regression func-tions and their confidence bands. Importantly, the corresponding CQF estimates areendowed with the no-crossing property of quantiles over the full data support. In theSupplementary Material we assess the robustness of the selected Spline-Spline modeland find that its main features are well-preserved across specifications with similar BICvalues. Thus, although establishing the BIC properties in our context is an impor-tant topic for future research, we find that GT regression estimates of distributionalregression functions exhibit reassuring stability within a given specification class.7. Conclusion
The formulation of distributional regression models through the specification of aGT e = g ( Y, X ) leads to a unifying framework for the global estimation of statisti-cal objects of general interest, such as conditional quantile, density and distributionfunctions. The implied convex programming formulation is easy to implement andallows for estimation of sparse models. The linear form of the proposed GT regres-sion models also constitutes a good starting point for nonparametric estimation ofdistributional regression functions. In this paper we have considered a few extensionsto our original formulation such as misspecification, multiple outcomes and penalizedestimation. Our framework can also be extended to allow for outcomes with discreteor mixed discrete-continuous distributions by appropriately modifying the form of thelog-likelihood function. An important further extension we will consider in future workis the generalization of our results to distributional regression models with endogenousregressors.
Appendix A. Proof of Theorem 1
For the conditional CDF of Y given X , for all y ∈ R ,(A.1) Φ( b (cid:48) T ( X, y )) = Pr[ b (cid:48) T ( X, Y ) ≤ b (cid:48) T ( X, y ) | X ] = Pr[ Y ≤ y | X ] = F Y | X ( y | X ) , where the first equality follows from e = b (cid:48) T ( X, Y ) and e | X ∼ N (0 , y (cid:55)→ b (cid:48) T ( X, y ) strictly increasing with probability one by Lemma 1below, and the last equality is by definition of F Y | X ( y | X ). For the conditional PDF,upon differentiating y (cid:55)→ Φ( b (cid:48) T ( X, y )) and y (cid:55)→ F Y | X ( y | X ) in (A.1), we obtain φ ( b (cid:48) T ( X, y )) { b (cid:48) t ( X, y ) } = f Y | X ( y | X ) , y ∈ R , ith probability one . For the CQF, the result in (A.1) and strict monotonicity of both y (cid:55)→ b (cid:48) T ( X, y ) and e (cid:55)→ Φ( e ) together imply Φ( b (cid:48) T ( X, Q Y | X ( u | X ))) = u . Therefore,recalling that e (cid:55)→ h ( X, e ) is the inverse of y (cid:55)→ b (cid:48) T ( X, y ), we obtain Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , with probability one . (cid:3) Lemma 1.
Lemma 1. For each $b \in \Theta$, the mapping $y \mapsto \Phi(b'T(X,y))$ is strictly increasing in $y \in \mathbb{R}$ with probability one.

Proof. We note that $\partial_y \Phi(b'T(X,y)) = \phi(b'T(X,y))\{b't(X,y)\}$ for all $y \in \mathbb{R}$, with $y \mapsto \phi(b'T(X,y))\{b't(X,y)\}$ continuous, with probability one. Hence, for any $\alpha, \beta \in \mathbb{R}$, $\alpha < \beta$, by the Fundamental Theorem of Calculus,

$\Phi(b'T(X,\beta)) - \Phi(b'T(X,\alpha)) = \int_{\alpha}^{\beta} \phi(b'T(X,y))\{b't(X,y)\}\,dy > 0, \quad b \in \Theta,$

with probability one, since $b't(X,Y) > 0$ and $\phi(e) > 0$ for all $e \in \mathbb{R}$, which implies that $y \mapsto b'T(X,y)$ is strictly increasing on $\mathbb{R}$, with probability one. □
B.1. Definitions and notation.
Define

$L(Y,X,b) \equiv -\frac{1}{2}\left[\log(2\pi) + (b'T(X,Y))^2\right] + \log(b't(X,Y)), \quad b \in \Theta,$

and $f(Y,X,b) \equiv \phi(b'T(X,Y))\{b't(X,Y)\}$, $b \in \Theta$, and note that $Q(b) = E[L(Y,X,b)] = E[\log f(Y,X,b)]$, $b \in \Theta$. In Appendix B.2 we establish the main properties of $Q(b)$ used in Appendices B.3 and B.5 for the proofs of Theorems 2 and 3, respectively.
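Since the sample analogue $Q_n(b) = n^{-1}\sum_{i=1}^n L(y_i,x_i,b)$ is concave in $b$, the GT regression problem can be handed to a disciplined convex programming tool; the CVXR package of Fu, Narasimhan and Boyd (2017), cited in the references, is one option. A minimal sketch, assuming design matrices Tmat and tmat whose rows hold the basis evaluations $T(x_i,y_i)'$ and $t(x_i,y_i)'$ (both names are ours, not the paper's):

library(CVXR)  # disciplined convex optimization (Fu, Narasimhan and Boyd, 2017)

## Maximize n*Q_n(b) = sum_i { -0.5*log(2*pi) - 0.5*(b'T_i)^2 + log(b't_i) }.
gt_fit <- function(Tmat, tmat) {
  n <- nrow(Tmat)
  b <- Variable(ncol(Tmat))
  e <- Tmat %*% b     # e_i = b'T_i
  eta <- tmat %*% b   # eta_i = b't_i; log() keeps eta > 0 implicitly
  obj <- -0.5 * n * log(2 * pi) - 0.5 * sum_squares(e) + sum_entries(log(eta))
  res <- solve(Problem(Maximize(obj)))   # concave objective: global maximum
  as.numeric(res$getValue(b))
}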
B.2. Auxiliary lemmas.

Lemma 2. If Assumption 1 holds then $E[|L(Y,X,b)|] < \infty$ and $Q(b)$ is continuous over $\Theta$.

Proof. By the triangle inequality,

$E[|L(Y,X,b)|] \le \frac{1}{2}E[(b'T(X,Y))^2] + E[|\log(b't(X,Y))|] + \frac{1}{2}\log(2\pi).$

The first term $E[(b'T(X,Y))^2]/2$ is finite by the Cauchy-Schwarz inequality and by $E[\|T(X,Y)\|^2] < \infty$. For the second term, applying a mean-value expansion around $\bar{b} = (b_1, 0_{JK-1}')'$, $b_1 > 0$, gives for some intermediate value $\tilde{b}$,

$|\log(b't(X,Y))| = |\log(b_1) + (\tilde{b}'t(X,Y))^{-1}[(b-\bar{b})'t(X,Y)]| \le |\log(b_1)| + |(\tilde{b}'t(X,Y))^{-1}|\,\|b-\bar{b}\|\,\|t(X,Y)\|.$

Thus $E[|\log(b't(X,Y))|] < \infty$, since we have that $\tilde{b}'t(X,Y) > 0$ and $E[\|t(X,Y)\|] < \infty$. Therefore $E[|L(Y,X,b)|] < \infty$. Continuity of $Q(b)$ then follows from continuity of $b \mapsto L(Y,X,b)$ and dominated convergence. □
Lemma 3.
If Assumption 1 holds then $Q(b)$ is twice continuously differentiable over any compact subset $\Theta_0 \subset \Theta$, and $\nabla_{bb}E[L(Y,X,b)] = E[\nabla_{bb}L(Y,X,b)]$.

Proof. By Lemma 2, $E[|L(Y,X,b)|] < \infty$. Moreover, for $b \in \Theta_0$,

(B.1) $\|\nabla_b L(Y,X,b)\| = \|-T(X,Y)(b'T(X,Y)) + (b't(X,Y))^{-1}t(X,Y)\| \le \|T(X,Y)(b'T(X,Y))\| + |(b't(X,Y))^{-1}|\,\|t(X,Y)\| \le C\{\|T(X,Y)\|^2 + \|t(X,Y)\|\},$

for some finite constant $C > 0$. Therefore, $E[\|T(X,Y)\|^2] < \infty$ and $E[\|t(X,Y)\|] < \infty$ imply that $E[\sup_{b\in\Theta_0}\|\nabla_b L(Y,X,b)\|] < \infty$ under Assumption 1. Lemma 3.6 in Newey and McFadden (1994) then implies that $Q(b)$ is continuously differentiable in $b$, and that the order of differentiation and integration can be interchanged. Continuous differentiability of $\nabla_b Q(b)$ in $b \in \Theta_0$ follows from applying steps similar to (B.1). By $\|\nabla_{bb}L(Y,X,b)\| \le \|T(X,Y)\|^2 + C\|t(X,Y)\|^2$, for some finite constant $C > 0$, we have that $E[\|T(X,Y)\|^2] < \infty$ and $E[\|t(X,Y)\|^2] < \infty$ imply that $E[\sup_{b\in\Theta_0}\|\nabla_{bb}L(Y,X,b)\|] < \infty$ under Assumption 1. Lemma 3.6 in Newey and McFadden (1994) then implies that $\nabla_b Q(b)$ is continuously differentiable in $b$, and that the order of differentiation and integration can be interchanged. □

Lemma 4. If Assumption 1 holds then, for any compact subset $\Theta_0 \subset \Theta$, we have that $-\nabla_{bb}Q(b)$ exists for $b \in \Theta_0$, with smallest eigenvalue bounded away from zero uniformly in $b \in \Theta_0$.

Proof. By Lemma 3, $Q(b)$ is twice continuously differentiable over $\Theta_0$ and the order of differentiation and integration can be interchanged. Therefore,

$\nabla_{bb}\{-Q(b)\} = \Gamma_1 + \Gamma_2(b), \quad \Gamma_1 \equiv E[T(X,Y)T(X,Y)'], \quad \Gamma_2(b) \equiv E\left[\frac{t(X,Y)t(X,Y)'}{(b't(X,Y))^2}\right],$

exists for all $b \in \Theta_0$ under Assumption 1. Denoting the smallest eigenvalue of a matrix $A$ by $\lambda_{\min}(A)$, the result then follows from Weyl's Monotonicity Theorem (e.g., Corollary 4.3.12 in Horn and Johnson, 2012), which implies

$\lambda_{\min}(\Gamma_1 + \Gamma_2(b)) \ge \lambda_{\min}(\Gamma_1) \ge B, \quad b \in \Theta_0,$

for some constant $B > 0$, by $\Gamma_2(b)$ being positive semidefinite for all $b \in \Theta_0$ and the smallest eigenvalue of $E[T(X,Y)T(X,Y)']$ being bounded away from zero. □
B.3. Proof of Theorem 2.
B.3.1. Uniqueness.
We show that $b_0$ is a point of maximum of $Q(b)$ in $\Theta$. For $b \ne b_0$, $b \in \Theta$, by $E[\log f(Y,X,b_0)] = E[\log f_{Y|X}(Y|X)]$ and Jensen's inequality, we obtain

$E\left[\log\left(\frac{f(Y,X,b_0)}{f(Y,X,b)}\right)\right] = E\left[-\log\left(\frac{f(Y,X,b)}{f_{Y|X}(Y|X)}\right)\right] \ge -\log E\left[\frac{f(Y,X,b)}{f_{Y|X}(Y|X)}\right] = -\log E\left[\int_{\mathbb{R}} f(y,X,b)\,dy\right] \ge 0,$

since $\int_{\mathbb{R}} f(y,X,b)\,dy = \lim_{y\to\infty}\Phi(b'T(X,y)) - \lim_{y\to-\infty}\Phi(b'T(X,y)) \in (0,1]$, by $y \mapsto \Phi(b'T(X,y))$ being strictly increasing by Lemma 1. Therefore, $b_0$ is a point of maximum. Strict concavity in Lemma 4 then implies that $Q(b)$ admits at most one maximizer in every compact subset of $\Theta$, and in particular every compact subset that contains $b_0$. Hence there is no $\tilde{b} \ne b_0$ that maximizes $Q(b)$ in $\Theta$, and $b_0$ uniquely maximizes $Q(b)$ in $\Theta$. □
B.3.2. Identification.
By uniqueness of the point of maximum, for $b \ne b_0$, $b \in \Theta$, we have $E[\log f(Y,X,b_0)] - E[\log f(Y,X,b)] > 0$, which implies that $f(Y,X,b) \ne f(Y,X,b_0) = f_{Y|X}(Y|X)$, and hence that $b_0$ is identified. Identification of $g(Y,X)$ and the distributional regression functions then follows by the fact that they are known functions of $b_0$, by Theorem 1. □
B.4. Proof of Corollary 1.
The proof follows by application of Theorem 2 and by the argument in the main text, using that $e = b_0'T(X,Y) \mid X \sim N(0,1)$. □

B.5. Proof of Theorem 3.
B.5.1. Proof of existence of $b^*$. We first show that the level sets $B_\alpha = \{b \in \Theta : -Q(b) \le \alpha\}$, $\alpha \in \mathbb{R}$, of $-Q(b)$ are closed and bounded, hence compact, and then use the fact that $-Q(b)$ is continuous over $\Theta$, which implies existence of a minimizer.

Step 1. This step shows that $B_\alpha$ is bounded.

Given $b_1, b_2 \in B_\alpha$, let $t = \|b_2 - b_1\|$ and $u = (b_2 - b_1)/\|b_2 - b_1\|$, so that $\|u\| = 1$ and $b_2 = b_1 + tu$. By Lemma 3, $Q(b)$ is twice continuously differentiable for $b \in B_\alpha$. Thus, by definition of $B_\alpha$, a second-order Taylor expansion of $t \mapsto -Q(b_1 + tu)$ around $t = 0$ yields, for some $\bar{b}$ on the line connecting $b_1$ and $b_2$ and some constant $B > 0$,

$\alpha \ge -Q(b_2) = -Q(b_1 + tu) = -Q(b_1) - t\nabla_bQ(b_1)'u - \frac{t^2}{2}u'\nabla_{bb}Q(\bar{b})u \ge -Q(b_1) - t\nabla_bQ(b_1)'u + \frac{Bt^2}{2} \ge -Q(b_1) - t\|\nabla_bQ(b_1)\| + \frac{Bt^2}{2},$

where the penultimate inequality follows by Lemma 4. Fixing $b_1 \in B_\alpha$, the above inequality implies that $t$ is bounded and therefore $B_\alpha$ is bounded.

Step 2. This step shows that $B_\alpha$ is closed.

Define the boundary $\partial\Theta$ of $\Theta$ as $\partial\Theta = \{b \in \mathbb{R}^{JK} : \Pr[b't(X,Y) = 0] > 0\}$. For $b \in \partial\Theta$ with $b't(X,Y) \le 0$ on a set of positive probability, $\log(b't(X,Y))$ takes on the value $-\infty$ on that set (e.g., Section 11.2.1 in Boyd and Vandenberghe, 2004). Consider a sequence $(b_n)$ in $B_\alpha$ such that $b_n \to \check{b} \in \partial\Theta$ as $n \to \infty$. Steps 2.1 and 2.2 below show that $-Q(b_n) = E[-L(Y,X,b_n)] \to \infty$ as $n \to \infty$, and hence that $B_\alpha$ is closed.

Step 2.1. This step shows that $E[\lim_{n\to\infty}-L(Y,X,b_n)] \le \lim_{n\to\infty}E[-L(Y,X,b_n)]$.

By $B_\alpha$ being bounded, there exists a constant $C > 0$ such that $\log(b't(X,Y)) \le C\|t(X,Y)\|$ with probability one for all $b \in B_\alpha$, and hence such that

$-L(Y,X,b) = \frac{1}{2}[\log(2\pi) + (b'T(X,Y))^2] - \log(b't(X,Y)) \ge -C\|t(X,Y)\|, \quad b \in B_\alpha,$

with probability one. Therefore,

$\varphi(Y,X,b) \equiv -L(Y,X,b) + \delta(Y,X) \ge 0, \quad \delta(Y,X) \equiv C\|t(X,Y)\|, \quad b \in B_\alpha,$

with probability one, and where $E[|\delta(Y,X)|] < \infty$ under Assumption 1. Moreover, by definition of $\partial\Theta$, we have that $\lim_{n\to\infty}\log(b_n't(X,Y)) = -\infty$ on a subset $\widetilde{\mathcal{YX}}$ of the joint support of $(Y,X)$ with positive probability, and hence

(B.2) $\lim_{n\to\infty}-L(Y,X,b_n) = \frac{1}{2}[\log(2\pi) + \lim_{n\to\infty}\{b_n'T(X,Y)\}^2] - \lim_{n\to\infty}\log(b_n't(X,Y)) = \infty,$

on $\widetilde{\mathcal{YX}}$, by $\{b'T(X,Y)\}^2/2 \ge 0$ for all $b \in \mathbb{R}^{JK}$. Letting $\chi_{\widetilde{\mathcal{YX}}}(Y,X) \equiv 1\{(Y,X) \in \widetilde{\mathcal{YX}}\}$ and $\chi_{\widetilde{\mathcal{YX}}^c}(Y,X) \equiv 1\{(Y,X) \in \widetilde{\mathcal{YX}}^c\}$, with $\widetilde{\mathcal{YX}}^c$ denoting the complement of $\widetilde{\mathcal{YX}}$, we have

(B.3) $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] = E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] = E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\delta(Y,X)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\delta(Y,X)] = E[\lim_{n\to\infty}-L(Y,X,b_n)] + E[\delta(Y,X)],$

where the second equality follows from $\lim_{n\to\infty}-L(Y,X,b_n)$ and $\delta(Y,X)$ being nonnegative functions on $\widetilde{\mathcal{YX}}$ (e.g., Proposition 5.2.6(ii) in Rana, 2002), and $\delta(Y,X)$ and $\lim_{n\to\infty}-L(Y,X,b_n)$ having finite expectation on $\widetilde{\mathcal{YX}}^c$, since $\lim_{n\to\infty}b_n't(X,Y) > 0$ on $\widetilde{\mathcal{YX}}^c$ and $E[|L(Y,X,b)|] < \infty$ for all $b \in \Theta$ by Lemma 2. By $\varphi(Y,X,b_n)$ being nonnegative, Fatou's lemma implies that

(B.4) $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] \le \lim_{n\to\infty}E[\varphi(Y,X,b_n)],$

with

(B.5) $\lim_{n\to\infty}E[\varphi(Y,X,b_n)] = \lim_{n\to\infty}E[-L(Y,X,b_n)] + E[\delta(Y,X)],$

by $E[|\delta(Y,X)|] < \infty$ and $E[|L(Y,X,b_n)|] < \infty$ for $b_n \in \Theta$ by Lemma 2. Therefore,

(B.6) $E[\lim_{n\to\infty}-L(Y,X,b_n)] \le \lim_{n\to\infty}E[-L(Y,X,b_n)]$

follows by (B.3), (B.4) and (B.5).

Step 2.2. This step shows that $E[\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$, and hence $B_\alpha$ is closed.

The limit in (B.2) and the fact that $f_{YX}(Y,X)$ is bounded away from 0 with probability one together imply that $E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$, and hence that $E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] = \infty$ by $E[|\delta(Y,X)|] < \infty$. Moreover, $E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] < \infty$, and hence $E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] < \infty$ by $E[|\delta(Y,X)|] < \infty$. Therefore, $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] = \infty$, and (B.3) now implies that $E[\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$. This fact and the bound (B.6) together imply that $E[-L(Y,X,b_n)] = -Q(b_n) \to \infty$ as $n \to \infty$.

We have established that the limit $\check{b}$ of a convergent sequence $(b_n)$ in $B_\alpha$ is in $\Theta$. By continuity of $-Q(b)$ over $\Theta$, we then have that $-Q(\check{b}) = \lim_{n\to\infty}-Q(b_n) \le \alpha$, and hence $\check{b} \in B_\alpha$ and $B_\alpha$ is closed.

Step 3. This step concludes.

Pick $\alpha \in \mathbb{R}$ such that $B_\alpha$ is nonempty. From Steps 1-2, $B_\alpha$ is compact by the Heine-Borel theorem. Since $-Q(b)$ is continuous over $B_\alpha$, there is at least one minimizer of $-Q(b)$ in $B_\alpha$ by the Weierstrass theorem. The existence result follows. □
B.5.2. Proof of uniqueness of $b^*$. The uniqueness result follows by strict concavity of $Q(b)$ in Lemma 4. □
B.5.3. Proof of uniqueness of $g^*(Y,X)$, $F^*(Y,X)$, $Q^*(u,X)$ and $f^*(Y,X)$. For $\tilde{b} \ne b^*$, by nonsingularity of $E[T(X,Y)T(X,Y)']$ we have that

$E[\{(\tilde{b}-b^*)'T(X,Y)\}^2] = (\tilde{b}-b^*)'E[T(X,Y)T(X,Y)'](\tilde{b}-b^*) > 0,$

which implies $(\tilde{b}-b^*)'T(X,Y) \ne 0$. Therefore, $g^*(Y,X) \ne \tilde{m}(Y,X)$ for $\tilde{m} \in \mathcal{E}$ with $\tilde{b} \ne b^*$, by definition of $\mathcal{E}$. By strict monotonicity of $e \mapsto \Phi(e)$, this also implies that $\Phi(g^*(Y,X)) \ne \Phi(\tilde{m}(Y,X))$, and hence $F^*(Y,X) \ne \tilde{F}(Y,X)$ for $\tilde{F} \in \mathcal{F}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{F}$. For $m \in \mathcal{E}$, let $\widetilde{\mathcal{Y}}_x(m) \equiv \{y \in \mathcal{Y}_x : F^*(y,x) \ne \Phi(m(y,x))\}$, where $\mathcal{Y}_x$ denotes the conditional support of $Y$ given $X = x$, and $\widetilde{\mathcal{U}}_x(m) \equiv \{u \in (0,1) : F^*(y,x) = u$ for some $y \in \widetilde{\mathcal{Y}}_x(m)\}$. With probability one, by strict monotonicity of $y \mapsto m(y,X)$ for all $m \in \mathcal{E}$, the composition $y \mapsto \Phi(\tilde{m}(y,X))$ is also strictly monotone, and hence $Q^*(\Phi^{-1}(u),X) \ne \tilde{q}(X,\Phi^{-1}(u))$, $u \in \widetilde{\mathcal{U}}_X(\tilde{m})$, for $\tilde{q} \in \mathcal{Q}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{Q}$. Finally, by $b^*$ being the unique maximizer of $Q(b)$ in $\Theta$, we have that $E[\log f^*(Y,X)] > E[\log(\phi(\tilde{m}(Y,X))\{\partial_y\tilde{m}(Y,X)\})]$ for $\tilde{m} \in \mathcal{E}$ with $\tilde{b} \ne b^*$, and hence $f^*(Y,X) \ne \tilde{f}(Y,X)$ for $\tilde{f} \in \mathcal{D}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{D}$. □

Appendix C. Proof of Theorem 4
C.1. Auxiliary lemma.

Lemma 5.
If the boundary conditions (3.2) hold for all $b \in \Theta$ with probability one, then the sets $\Theta$ and $\mathcal{D}$ are equivalent.

Proof. Recall that two sets $A$ and $B$ are equivalent if there is a one-to-one correspondence between them, i.e., if there exists some function $\varphi: A \to B$ that is both one-to-one and onto. The two sets then have the same cardinality (Dudley, 2002). We note that by nonsingularity of $E[T(X,Y)T(X,Y)']$ the two sets $\Theta$ and $\mathcal{E}$ are equivalent. Hence it suffices to show that $\mathcal{E}$ and $\mathcal{D}$ are equivalent. For each $f \in \mathcal{D}$, $m \in \mathcal{E}$, and $(y,x) \in \mathcal{YX}$, we define

$(\varphi(f))(y,x) \equiv \Phi^{-1}\left(\int_{-\infty}^{y} f(t,x)\,dt\right), \qquad (\psi(m))(y,x) \equiv \partial_y\Phi(m(y,x)).$

We first verify that $\varphi: \mathcal{D} \to \mathcal{E}$ and $\psi: \mathcal{E} \to \mathcal{D}$, and then establish that $\varphi$ is one-to-one and onto by showing that $\varphi$ and $\psi$ are inverse functions of each other. By definition of $f \in \mathcal{D}$, the Fundamental Theorem of Calculus, and the boundary conditions (3.2), we have

$(\varphi(f))(y,X) = \Phi^{-1}\left(\int_{-\infty}^{y} \phi(T(X,v)'b)\{b't(X,v)\}\,dv\right) = \Phi^{-1}\left(\Phi(T(X,y)'b) - \lim_{\alpha\to-\infty}\Phi(T(X,\alpha)'b)\right) = T(X,y)'b$

for some $b \in \Theta$ and all $y \in \mathcal{Y}$, and hence $T(X,Y)'b \in \mathcal{E}$. Therefore $\varphi: \mathcal{D} \to \mathcal{E}$. By definition of $m \in \mathcal{E}$ we have

$(\psi(m))(y,X) = \partial_y\Phi(T(X,y)'b) = \phi(T(X,y)'b)\{t(X,y)'b\}, \quad y \in \mathcal{Y},$

for some $b \in \Theta$, and hence $\phi(T(X,Y)'b)\{t(X,Y)'b\} \in \mathcal{D}$. Therefore $\psi: \mathcal{E} \to \mathcal{D}$. The conclusion then follows from $\psi$ being both the left-inverse of $\varphi$, since

$(\psi(\varphi(f)))(y,X) = \partial_y\left\{\Phi\left(\Phi^{-1}\left(\int_{-\infty}^{y} f(t,X)\,dt\right)\right)\right\} = \partial_y\left\{\int_{-\infty}^{y} f(t,X)\,dt\right\} = f(y,X)$

for all $y \in \mathcal{Y}$, and the right-inverse of $\varphi$, since

$(\varphi(\psi(m)))(y,X) = \Phi^{-1}\left(\int_{-\infty}^{y} \partial_y\Phi(m(t,X))\,dt\right) = m(y,X), \quad y \in \mathcal{Y}.$

Therefore, $\psi$ is the inverse function of $\varphi$ and the result follows. □
C.2. Proof of Theorem 4.
By Theorem 3, $b^* = \arg\max_{b\in\Theta} E[\log(\phi(T(X,Y)'b)\{t(X,Y)'b\})]$. Thus, $f(Y,X) = \phi(T(X,Y)'b)\{t(X,Y)'b\} \in \mathcal{D}$ for each $b \in \Theta$, and the fact that $\Theta$ and $\mathcal{D}$ are equivalent by Lemma 5, together imply that $f^*$ is the well-defined point of maximum of $E[\log f(Y,X)]$ in $\mathcal{D}$, and hence

(C.1) $f^* = \arg\max_{f\in\mathcal{D}} E[\log f(Y,X)] = \arg\min_{f\in\mathcal{D}} -E[\log f(Y,X)] = \arg\min_{f\in\mathcal{D}} E\left[\log\left(\frac{f_{Y|X}(Y|X)}{f(Y,X)}\right)\right].$

Moreover, by the boundary conditions (3.2), each $f \in \mathcal{D}$ satisfies

(C.2) $\int_{\mathbb{R}} f(y,X)\,dy = \lim_{y\to\infty}\Phi(b'T(X,y)) - \lim_{y\to-\infty}\Phi(b'T(X,y)) = 1$

for some $b \in \Theta$ with probability one. Therefore, (C.1) implies that $f^*(Y,X)$ is the KLIC-closest probability distribution to $f_{Y|X}(Y|X)$ in $\mathcal{D}$.

By $F^*(Y,X) = \Phi(g^*(Y,X))$ and $f^*(Y,X) = \phi(g^*(Y,X))\{\partial_y g^*(Y,X)\}$, we have

$\partial_y F^*(Y,X) = \phi(g^*(Y,X))\{\partial_y g^*(Y,X)\} = f^*(Y,X).$

Since $y \mapsto f^*(y,X)$ is continuous, we obtain $F^*(y,X) = \int_{-\infty}^{y} f^*(t,X)\,dt$ for all $y \in \mathbb{R}$ by the Fundamental Theorem of Calculus, with $\lim_{y\to-\infty}F^*(y,X) = 0$ and $\lim_{y\to\infty}F^*(y,X) = 1$ by definition of $F^*(y,X)$ and (C.2).

By $f^* \in \mathcal{D}$ we have that $f^*(Y,X) > 0$, and by Lemma 1 that $y \mapsto F^*(y,X)$ is strictly increasing, with probability one. Hence, the inverse function of $y \mapsto F^*(y,X)$ is well-defined, denoted $u \mapsto Q^*(X,u)$, with

$\frac{\partial Q^*(X,u)}{\partial u} = \frac{1}{f^*(Q^*(X,u),X)} > 0, \quad u \in (0,1),$

with probability one, by continuous differentiability of $y \mapsto F^*(y,X)$ and the Inverse Function Theorem. □
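In practice, $Q^*(X,u)$ can be computed from the estimated CDF by one-dimensional root-finding, exploiting the strict monotonicity just established. A sketch in R, reusing the hypothetical Tb() basis function and coefficient from the sketch following Theorem 1 (the bracketing interval is an assumption on the support):

## Quantile by inverting y -> pnorm(b'T(x, y)) with uniroot().
Q_star <- function(u, x, b, lower = -10, upper = 10) {
  uniroot(function(y) pnorm(Tb(b, x, y)) - u, lower = lower, upper = upper)$root
}
## Example: Q_star(0.9, x = 1, b = b) returns the 0.9-quantile of Y given X = 1.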
Appendix D. Asymptotic Theory

D.1. Proof of Theorem 5.
Part (i).
We verify the conditions of Theorem 2.7 in Newey and McFadden (1994). By Theorem 3, $b^* \in \Theta$ is the unique minimizer of $-Q(b)$, and their Condition (i) is verified. By $\Theta$ convex and open, existence of $b^* \in \Theta$ established in Theorem 3 and concavity of $Q_n(b)$ together imply that their Condition (ii) is satisfied. Finally, since the sample is i.i.d. by Assumption 3(i), pointwise convergence of $Q_n(b)$ to $Q(b)$ follows from $Q(b)$ bounded (established in the proof of Theorem 2) and application of Khinchine's law of large numbers. Hence, all conditions of Newey and McFadden's Theorem 2.7 are satisfied. Therefore, there exists $\hat{b} \in \Theta$ with probability approaching one, and $\hat{b} \to_p b^*$. □

Part (ii).
The asymptotic normality result $n^{1/2}(\hat{b} - b^*) \to_d N(0,\Gamma^{-1}\Psi(\Gamma^{-1})')$ follows from verifying the assumptions of Theorem 3.1 in Newey and McFadden (1994), for instance. Symmetry and nonsingularity of $\Gamma$ then implies that $V = \Gamma^{-1}\Psi\Gamma^{-1}$. By Theorem 3, $b^*$ is in the interior of $\Theta$ so that their Condition (i) is satisfied. Condition (ii) holds by inspection. Condition (iii) holds by $E[\psi(Y,X,b^*)] = 0$, existence of $\Gamma$ and the Lindeberg-Levy central limit theorem. For their Condition (iv), we apply Lemma 2.4 in Newey and McFadden (1994) with $a(Y,X,b) \equiv \nabla_{bb}L(Y,X,b)$. Let $\Theta_0$ denote a compact subset of $\Theta$ containing $b^*$ in its interior. By the proof of Lemma 3 we have that $E[\sup_{b\in\Theta_0}\|\nabla_{bb}L(Y,X,b)\|] < \infty$. In addition, by Assumption 3(i) the data are i.i.d., and $\nabla_{bb}L(Y,X,b)$ is continuous at each $b \in \Theta_0$ by inspection. The conditions of Lemma 2.4 in Newey and McFadden (1994) are verified, and therefore their Condition (iv) in Theorem 3.1 also is. Finally, $\Gamma$ is nonsingular by Lemma 4, which verifies their Condition (v). The result follows.

In order to show that $\hat{\Gamma}^{-1}\hat{\Psi}\hat{\Gamma}^{-1} \to_p \Gamma^{-1}\Psi\Gamma^{-1}$, we verify the conditions given in the discussion of Theorem 4.4 in Newey and McFadden (1994, bottom of page 2158). First, by Theorem 5 we have $\hat{b} \to_p b^*$. Second, with probability one, by inspection $\log f(Y,X,b)$ is twice continuously differentiable and $f(Y,X,b) > 0$ for $b \in \Theta$. Moreover, $\Gamma$ exists and is nonsingular by Lemma 4. Thus Conditions (ii) and (iv) of Theorem 3.3 in Newey and McFadden (1994) are verified. Third,

$\|\psi(Y,X,b)\| = \|-T(X,Y)(b'T(X,Y)) + (b't(X,Y))^{-1}t(X,Y)\| \le \|T(X,Y)(b'T(X,Y))\| + 2|(b't(X,Y))^{-1}|\,\|t(X,Y)\| \le C\{\|T(X,Y)\|^2 + \|t(X,Y)\|\},$

so that $E[\sup_{b\in\Theta_0}\|\psi(Y,X,b)\|] < \infty$, by Assumptions 1 and 3(ii). Hence, for a neighborhood $\mathcal{N}$ of $b^*$, we have that $E[\sup_{b\in\mathcal{N}}\|\psi(Y,X,b)\|] < \infty$. Moreover, $b \mapsto \psi(Y,X,b)$ is continuous at $b^*$ with probability one. The result follows. □
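The plug-in estimator $\hat{\Gamma}^{-1}\hat{\Psi}\hat{\Gamma}^{-1}$ is straightforward to compute once the per-observation log-likelihood is available. A sketch using numerical derivatives from the numDeriv package, where loglik_i is an assumed user-supplied wrapper returning $L(y_i,x_i,b)$:

library(numDeriv)  # numerical gradients and Hessians

## Sandwich estimator of V = Gamma^{-1} Psi Gamma^{-1}; the sign convention
## for Gamma_hat is immaterial since it enters the sandwich twice.
sandwich_var <- function(b_hat, loglik_i, n) {
  scores <- t(sapply(seq_len(n), function(i) grad(function(b) loglik_i(b, i), b_hat)))
  Psi_hat <- crossprod(scores) / n   # sample analogue of E[psi psi']
  Gamma_hat <- -Reduce(`+`, lapply(seq_len(n), function(i)
    hessian(function(b) loglik_i(b, i), b_hat))) / n
  solve(Gamma_hat) %*% Psi_hat %*% solve(Gamma_hat)
}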
D.2. Proof of Theorem 6.
The proof builds on the proof strategy in Zou (2006) and Lu, Goldberg, and Fine (2012). Define

$D_n(u) \equiv Q_n(b^* + n^{-1/2}u) - Q_n(b^*),$

where $u$ is defined by $b = b^* + n^{-1/2}u$. Also let $\hat{u}_n = \arg\max_u D_n(u)$, so that $\hat{u}_n = \sqrt{n}(\hat{b}_{AL} - b^*)$. By a mean-value expansion,

$D_n(u) = n^{-1/2}\sum_{i=1}^n \psi(y_i,x_i,b^*)'u + (2n)^{-1}u'\left\{\sum_{i=1}^n \nabla_b\psi(y_i,x_i,\bar{b})\right\}u + n^{-1/2}\lambda_n\sum_{l=1}^{JK}\hat{w}_l\,n^{1/2}(|b_l^*| - |b_l^* + n^{-1/2}u_l|) \equiv D_n^{(1)}(u) + D_n^{(2)}(u) + D_n^{(3)}(u),$

for some intermediate values $\bar{b}$. Under Assumptions 1-3, $D_n^{(1)}(u) \to_d N(0, u'\Psi u)$ and $D_n^{(2)}(u) \to_p \frac{1}{2}u'\Gamma u$ by the results in Theorem 5 and the Law of Large Numbers. For $D_n^{(3)}(u)$, Zou (2006, proof of Theorem 2) shows

$n^{-1/2}\lambda_n\hat{w}_l\,n^{1/2}(|b_l^*| - |b_l^* + n^{-1/2}u_l|) \to_p \begin{cases} 0, & b_l^* \ne 0 \\ 0, & b_l^* = 0,\ u_l = 0 \\ -\infty, & b_l^* = 0,\ u_l \ne 0. \end{cases}$

Therefore $D_n(u) \to_d D(u)$ for every $u$, where

$D(u) = \begin{cases} \frac{1}{2}u_{\mathcal{A}}'\Gamma_{\mathcal{A}}u_{\mathcal{A}} + u_{\mathcal{A}}'W, & u_l = 0,\ l \notin \mathcal{A}, \\ -\infty & \text{otherwise}, \end{cases}$

with $W \sim N(0,\Psi_{\mathcal{A}})$. Moreover, steps similar to those of Lu, Goldberg, and Fine (2012, proof of Theorem 2) show that $\hat{u}_n \to_d \hat{u} = \arg\max_u D(u)$, upon using that $\hat{w}_l \to_p 1/|b_l^*|$ when $b_l^* \ne 0$ and $n^{1/2}\hat{b}_l = O_p(1)$ when $b_l^* = 0$ by Theorem 5(ii), and the fact that the Hessian matrix $\Gamma$ is negative definite by Lemma 4. This yields part (ii), i.e., $n^{1/2}(\hat{b}_{\mathcal{A}} - b_{\mathcal{A}}^*) \to_d N(0,\Gamma_{\mathcal{A}}^{-1}\Psi_{\mathcal{A}}\Gamma_{\mathcal{A}}^{-1})$. Steps similar to Lu, Goldberg, and Fine (2012, proof of Theorem 2) also show that $\Pr[\hat{b}_{\mathcal{A}^c} = 0] \to 1$, upon substituting $Q_n(b)$ for their objective function, which establishes part (i). □
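The adaptive Lasso GT objective remains concave, since a nonnegatively weighted L1 penalty is convex, so penalized estimation fits the same convex programming framework. A sketch extending the earlier gt_fit(); Tmat, tmat and the first-step weights are assumptions carried over from the previous sketches:

## Adaptive Lasso GT regression: Q_n(b) minus a weighted L1 penalty.
gt_fit_lasso <- function(Tmat, tmat, w_hat, lambda) {
  b <- Variable(ncol(Tmat))
  obj <- -0.5 * sum_squares(Tmat %*% b) + sum_entries(log(tmat %*% b)) -
    lambda * sum_entries(multiply(w_hat, abs(b)))  # requires w_hat >= 0
  res <- solve(Problem(Maximize(obj)))
  as.numeric(res$getValue(b))
}
## First-step weights, e.g.: b_init <- gt_fit(Tmat, tmat); w_hat <- 1 / abs(b_init)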
D.3. Proof of Theorem 7.
By Theorems 5 and 6 we have $n^{1/2}(\hat{b}^{\dagger} - b^*) \to_d N(0,\Xi)$ with $\Xi = \Gamma^{\dagger-1}\Psi^{\dagger}\Gamma^{\dagger-1}$ positive definite by assumption. Moreover, for $(y,x) \in \mathcal{YX}$, $b \mapsto \Phi(b'T(x,y)) \equiv F(y,x,b)$ and $b \mapsto f(y,x,b)$ are continuously differentiable, with derivative functions $\nabla_bF(y,x,b) = \phi(b'T(x,y))T(x,y)$ and

$\nabla_bf(y,x,b) = -\{b'T(x,y)\}\phi(b'T(x,y))\{b't(x,y)\}T(x,y) + \phi(b'T(x,y))t(x,y) = \phi(b'T(x,y))[-\{b'T(x,y)\}\{b't(x,y)\}T(x,y) + t(x,y)],$

respectively, by the properties of the normal PDF. For all $(y,x) \in \mathcal{YX}$ with $f(y,x,b) > 0$, $b \in \Theta$, we have that $y \mapsto F(y,x,b)$ is invertible, and its inverse function $u \mapsto F^{-1}(u,x,b)$ is continuously differentiable with derivative $1/f(F^{-1}(u,x,b),x,b)$ for all $x \in \mathcal{X}$ and $u \in \mathcal{U}_x(m)$, $m(y,x,b) \equiv b'T(x,y)$, by the Inverse Function Theorem. Hence, by $F^{-1}(\Phi(b'T(x,y)),x,b) = y$, we have for $(y,x) \in \mathcal{YX}$,

$\nabla_bF^{-1}(\Phi(b'T(x,y)),x,b) = \frac{\phi(b'T(x,y))}{f(F^{-1}(u_0,x,b),x,b)}T(x,y) + \nabla_bF^{-1}(u_0,x,b) = 0,$

with $u_0 = \Phi(b'T(x,y))$, and hence, for $x \in \mathcal{X}$ and $u \in \mathcal{U}_x(m)$,

$\nabla_bF^{-1}(u,x,b) = -\frac{\phi(b'T(x,y_0))}{f(y_0,x,b)}T(x,y_0) = -\frac{1}{b't(x,y_0)}T(x,y_0),$

where $y_0 = F^{-1}(u,x,b)$, which is continuous in $b$ on $\Theta$, so that $b \mapsto F^{-1}(u,x,b)$ is continuously differentiable on $\Theta$. Parts (i) and (ii) in the statement of Theorem 7 then follow by application of the Delta method (e.g., Lemma 3.9 in Wooldridge, 2010). □
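For a concrete use of these derivative formulas, pointwise confidence bands for the conditional CDF follow from the Delta method: the asymptotic variance of $F(y,x,\hat{b}^{\dagger})$ is $\nabla_bF'(\Xi/n)\nabla_bF$ with $\nabla_bF = \phi(b'T(x,y))T(x,y)$. A sketch, where Tvec (the basis vector $T(x,y)$) and Xi_hat (a variance estimate) are assumed inputs:

## Delta-method band for F(y, x, b_hat) = pnorm(b'T(x, y)).
cdf_band <- function(Tvec, b_hat, Xi_hat, n, level = 0.95) {
  index <- sum(b_hat * Tvec)          # b'T(x, y)
  gr <- dnorm(index) * Tvec           # gradient of b -> pnorm(b'T)
  se <- sqrt(drop(t(gr) %*% Xi_hat %*% gr) / n)
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = pnorm(index) - z * se, upper = pnorm(index) + z * se)
}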
Appendix E. Duality Theory

E.1. Auxiliary lemma.
In this section, we write $T_i \equiv T(y_i,x_i)$ and $t_i \equiv t(y_i,x_i)$, for $i \in \{1,\ldots,n\}$. We first show the following result used in the proof of Theorem 8.

Lemma 6. If $\{(y_i,x_i)\}_{i=1}^n$ is i.i.d. and $E[T(X,Y)T(X,Y)']$ is nonsingular then $\sum_{i=1}^n T_iT_i'$ is nonsingular with probability approaching one.

Proof. We note that $\sum_{i=1}^n T_iT_i'$ is nonsingular if for all $\lambda \ne 0$ we have $\lambda'(\sum_{i=1}^n T_iT_i')\lambda = \sum_{i=1}^n (\lambda'T_i)^2 > 0$, and hence if, for some $i \in \{1,\ldots,n\}$, we have $\lambda'T_i \ne 0$ for all $\lambda \ne 0$. By nonsingularity of $E[T(X,Y)T(X,Y)']$, for all $\lambda \ne 0$ we have $\lambda'T(X,Y) \ne 0$ on a set $\widetilde{\mathcal{YX}}$ with $\Pr[\widetilde{\mathcal{YX}}] > 0$. Hence for $\{(y_i,x_i)\}_{i=1}^n$ i.i.d.,

$\Pr[\cap_{i\in\{1,\ldots,n\}}\{(y_i,x_i) \notin \widetilde{\mathcal{YX}}\}] = \prod_{i=1}^n \Pr[(y_i,x_i) \notin \widetilde{\mathcal{YX}}] = \prod_{i=1}^n (1 - \Pr[\widetilde{\mathcal{YX}}]) = (1 - \Pr[\widetilde{\mathcal{YX}}])^n \to 0,$

as $n \to \infty$. Since the complement of the event $\cap_{i\in\{1,\ldots,n\}}\{(y_i,x_i) \notin \widetilde{\mathcal{YX}}\}$ is the event $\{(y_i,x_i) \in \widetilde{\mathcal{YX}}$ for some $i \in \{1,\ldots,n\}\}$, we obtain

$\Pr[(y_i,x_i) \in \widetilde{\mathcal{YX}}$ for some $i \in \{1,\ldots,n\}] = 1 - (1 - \Pr[\widetilde{\mathcal{YX}}])^n \to 1,$

as $n \to \infty$. The result now follows from the definition of $\widetilde{\mathcal{YX}}$. □
Proof of Theorem 8.
Part (i).
Let R − ≡ ( −∞ , R + ≡ (0 , + ∞ ). Introducing the variables e i = b (cid:48) T i , η i = b (cid:48) t i , an equivalent formulation for the GT regression problem ismax ( b,e,η ) ∈ Θ × R n × R n + nκ − n (cid:88) i =1 (cid:26) e i − log( η i ) (cid:27) , κ ≡ −
12 log(2 π ) , subject to e i = b (cid:48) T i , η i = b (cid:48) t i , i ∈ { , . . . , n } . For all ( u, v ) ∈ R n × R n − , define the Lagrange function for this problem as L ( b, e, η, u, v ) = nκ − n (cid:88) i =1 (cid:26) e i − log( η i ) (cid:27) + n (cid:88) i =1 u i { e i − b (cid:48) T i } + n (cid:88) i =1 v i { η i − b (cid:48) t i } , and the Lagrange dual function (Boyd and Vandenberghe (2004), Chapter 5) as g ( u, v ) ≡ sup ( b,e,η ) ∈ Θ × R n × R n + L ( b, e, η, u, v )= sup ( e,η ) ∈ R n × R n + n (cid:88) i =1 (cid:26) u i e i + v i η i − (cid:20) − nκ + e i − log( η i ) (cid:21)(cid:27) + sup b ∈ Θ (cid:40) − n (cid:88) i =1 u i ( b (cid:48) T i ) − n (cid:88) i =1 v i ( b (cid:48) t i ) (cid:41) ≡ I + I . In order to derive g ( u, v ) we first show that for all ( u, v ) ∈ R n × R n − the maximum ofthe mapping ( b, e, η ) (cid:55)→ L ( b, e, η, u, v ) is attained and is unique, and we then evaluate( b, e, η ) (cid:55)→ L ( b, e, η, u, v ) at this value. he first term I in the dual function g ( u, v ) is the convex conjugate of the negativelog-likelihood function, defined as a function of the n -vectors e and η . Define D ( e, η, u, v ) ≡ n (cid:88) i =1 { u i e i + v i η i } − n (cid:88) i =1 L ( e i , η i ) , L ( e i , η i ) ≡ − nκ + e i − log( η i ) . We first show that, for all ( u, v ) ∈ R n × R n − , the map ( e, η ) (cid:55)→ D ( e, η, u, v ) admits atleast one maximum in R n × R n + . For i ∈ { , . . . , n } , the first-order conditions are ∂ e i D ( e, η, u, v ) = u i − n (cid:88) i =1 ∂ e i L ( e i , η i ) = u i − e i = 0 ∂ η i D ( e, η, u, v ) = v i − n (cid:88) i =1 ∂ η i L ( e i , η i ) = v i + 1 η i = 0 , and upon solving for e i and η i , we obtain e i = u i , η i = − v i , i ∈ { , . . . , n } . Clearly, for all ( u, v ) ∈ R n × R n − there exists ( e, η ) ∈ R n × R n + such that the n first-orderconditions hold.We now show that, for all ( u, v ) ∈ R n × R n − , the map ( e, η ) (cid:55)→ D ( e, η, u, v ) admits atmost one maximum in R n × R n + . For i ∈ { , . . . , n } , the second-order conditions are ∂ e i ,e i D ( e, η, u, v ) = − , ∂ e i ,η i D ( e, η, u, v ) = 0 ∂ η i ,e i D ( e, η, u, v ) = 0 , ∂ η i ,η i D ( e, η, u, v ) = − η i . Therefore the Hessian matrix of ( e, η ) (cid:55)→ D ( e, η, u, v ) is negative definite for all ( u, v ) ∈ R n × R n − . Hence, ( e, η ) (cid:55)→ D ( e, η, u, v ) is strictly concave with unique maximum( e i , η i ) = ( u i , − /v i ), i ∈ { , . . . , n } , for all ( u, v ) ∈ R n × R n − . Evaluating ( e, η ) (cid:55)→D ( e, η, u, v ) at the maximum yields, for all ( u, v ) ∈ R n × R n − ,sup ( e,η ) ∈ R n × R n + D ( e, η, u, v ) = n (cid:88) i =1 u i + n (cid:88) i =1 v i (cid:18) − v i (cid:19) − n (cid:88) i =1 (cid:26) − κ + u i − log (cid:18) − v i (cid:19)(cid:27) = − n (1 − κ ) + n (cid:88) i =1 (cid:26) u i − log ( − v i ) (cid:27) , (E.1)the conjugate function of the negative log-likelihood. e now consider the second term I in the definition of the dual function g ( u, v ). Forall ( b, u, v ) ∈ Θ × R n × R n − , define the penalty function P ( b, u, v ) = n (cid:88) i =1 {− u i ( b (cid:48) T i ) − v i ( b (cid:48) t i ) } . The map b (cid:55)→ P ( b, u, v ) is linear with partial derivative − (cid:80) ni =1 { u i T i + v i t i } . Thevalue of sup b ∈ Θ P ( b, u, v ) is thus determined by the set of all ( u, v ) ∈ R n × R n − suchthat the first-order conditions,(E.2) ∇ b P ( b, u, v ) = − n (cid:88) i =1 { u i T i + v i t i } = 0 , hold. 
For all such $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ and any solution $b \in \Theta$, we have that

$\sup_{b\in\Theta}P(b,u,v) = \sum_{i=1}^n\{-u_i(b'T_i) - v_i(b't_i)\} = -b'\sum_{i=1}^n\{u_iT_i + v_it_i\} = 0.$

Therefore, for all $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ such that $\nabla_bP(b,u,v) = 0$, the optimal value of $P(b,u,v)$ is 0. Combining the expression for the likelihood conjugate (E.1) and the first-order conditions (E.2) gives the Lagrange dual function $g(u,v)$ for all $(u,v)$ such that $\nabla_bP(b,u,v) = 0$. The form of the dual GT regression problem (5.2) follows.
Part (ii). The Lagrangian for (5.2) is

$L(u,v,b) = -n(1-\kappa) + \sum_{i=1}^n\left\{\frac{u_i^2}{2} - \log(-v_i)\right\} - b'\sum_{i=1}^n\{T_iu_i + t_iv_i\},$

with first-order conditions

$\partial_{u_i}L(u,v,b) = u_i - b'T_i = 0, \qquad \partial_{v_i}L(u,v,b) = -\frac{1}{v_i} - b't_i = 0, \quad i \in \{1,\ldots,n\},$

equivalently, upon solving for $u_i$ and $v_i$,

(E.3) $u_i = b'T_i, \qquad v_i = -\frac{1}{b't_i}, \quad i \in \{1,\ldots,n\}.$

Upon substituting in the constraints of (5.2) we obtain the method-of-moments representation of (5.2).

Part (iii).
Existence of a solution $\hat{b} \in \Theta$ is shown in the proof of Theorem 5(i). The sample Hessian matrix is $-\sum_{i=1}^n\{T_iT_i' + t_it_i'/(b't_i)^2\}$, which is negative definite with probability approaching one by Lemma 6. Therefore there exists a unique solution $\hat{b}$ to the GT regression problem (4.1), with probability approaching one.

Existence of a solution $(\hat{u}',\hat{v}')'$ to program (5.2) follows from existence of a solution $\hat{b} \in \Theta$ to the first-order conditions of the ML problem (4.1) and the method-of-moments representation of the dual problem (5.2), upon setting $\hat{u}_i = \hat{b}'T_i$, $\hat{v}_i = -1/(\hat{b}'t_i)$, for $i \in \{1,\ldots,n\}$. We now show that, for all $b \in \Theta$, the map $(u,v) \mapsto L(u,v,b)$ admits at most one minimum in $\mathbb{R}^n\times\mathbb{R}^n_-$. For all $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ and $i \in \{1,\ldots,n\}$, the second-order conditions for the dual problem (5.2) are

$\partial^2_{u_iu_i}L(u,v,b) = 1, \quad \partial^2_{u_iv_i}L(u,v,b) = 0, \quad \partial^2_{v_iu_i}L(u,v,b) = 0, \quad \partial^2_{v_iv_i}L(u,v,b) = \frac{1}{v_i^2}.$

Therefore, the Hessian matrix of $(u,v) \mapsto L(u,v,b)$ is positive definite for all $b \in \Theta$. Hence, the map $(u,v) \mapsto L(u,v,b)$ is strictly convex with unique solution $(\hat{u}',\hat{v}')'$.

Part (iv).
Upon using (E.3) and with $\hat{e}_i = \hat{b}'T_i$ and $\hat{\eta}_i = \hat{b}'t_i$, $i \in \{1,\ldots,n\}$, the value of program (5.2) is

$L(\hat{u},\hat{v},\hat{b}) = -n(1-\kappa) + \sum_{i=1}^n\left\{\frac{\hat{e}_i^2}{2} + \log(\hat{\eta}_i)\right\} - \sum_{i=1}^n\{\hat{e}_i^2 - 1\} = n\kappa - \sum_{i=1}^n\left\{\frac{\hat{e}_i^2}{2} - \log(\hat{\eta}_i)\right\},$

the value of the ML problem (4.1) at a solution. □
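Theorem 8 suggests a simple numerical check of the primal-dual relationship: recover $(\hat{u},\hat{v})$ from the primal fit, then verify the moment condition (E.2) and the equality of primal and dual values. A sketch reusing gt_fit(), Tmat and tmat from the earlier sketches:

b_hat <- gt_fit(Tmat, tmat)
u_hat <- drop(Tmat %*% b_hat)       # u_hat_i = b'T_i
eta_hat <- drop(tmat %*% b_hat)
v_hat <- -1 / eta_hat               # v_hat_i = -1/(b't_i)
moment <- colSums(u_hat * Tmat + v_hat * tmat)  # should vanish, cf. (E.2)
n <- nrow(Tmat); kappa <- -0.5 * log(2 * pi)
primal <- n * kappa - sum(u_hat^2 / 2 - log(eta_hat))
dual <- -n * (1 - kappa) + sum(u_hat^2 / 2 - log(-v_hat))
## max(abs(moment)) and abs(primal - dual) should be near zero (solver tolerance)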
E.3. Proof of Theorem 9.

Let $\|b\|_{1,\hat{w}} = \sum_{l=1}^{JK}\hat{w}_l|b_l|$. Analogously to the proof of Theorem 8(i), an equivalent formulation for the adaptive Lasso GT regression problem is

$\max_{(b,e,\eta)\in\Theta\times\mathbb{R}^n\times\mathbb{R}^n_+} n\kappa - \sum_{i=1}^n\left\{\frac{e_i^2}{2} - \log(\eta_i)\right\} - \lambda_n\|b\|_{1,\hat{w}}$

subject to $e_i = b'T_i$, $\eta_i = b't_i$, $i \in \{1,\ldots,n\}$, and, letting $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ denote Lagrange multiplier vectors, the corresponding Lagrange dual function can be written as

$g(u,v) = \sup_{(e,\eta)\in\mathbb{R}^n\times\mathbb{R}^n_+}\sum_{i=1}^n\left\{u_ie_i + v_i\eta_i - \left[-\kappa + \frac{e_i^2}{2} - \log(\eta_i)\right]\right\} + \sup_{b\in\Theta}\left\{-\sum_{i=1}^n u_i(b'T_i) - \sum_{i=1}^n v_i(b't_i) - \lambda_n\|b\|_{1,\hat{w}}\right\},$

where the first term is the convex conjugate of $\sum_{i=1}^n\{-\kappa + e_i^2/2 - \log(\eta_i)\}$ derived in the proof of Theorem 8(i).

For the second term, define

$F(b) = \sum_{i=1}^n u_i(b'T_i) + \sum_{i=1}^n v_i(b't_i) + \lambda_n\|b\|_{1,\hat{w}}, \quad b \in \mathbb{R}^{JK},$

which is convex in $b$ but not smooth. In order to compute the subgradients of $F(b)$, we first compute the subgradients of $\|b\|_{1,\hat{w}}$. Recalling that the weights satisfy $\hat{w}_l > 0$ if $\hat{b}_l \ne 0$ and $\hat{w}_l = 0$ otherwise, the weighted norm $\|b\|_{1,\hat{w}}$ can be written as the maximum of $2^{JK}$ linear functions:

$\|b\|_{1,\hat{w}} = \max\{s'b : s_l \in \{-\hat{w}_l,\hat{w}_l\}\}.$

The functions $s'b$ are differentiable and have a unique subgradient $s$. The subdifferential of $\|b\|_{1,\hat{w}}$ is given by all convex combinations of gradients of the active functions at $b$ (Boyd and Vandenberghe, 2008). We first identify an active function $s'b$, by finding an $s = (s_1,\ldots,s_{JK})'$, $s_l \in \{-\hat{w}_l,\hat{w}_l\}$, such that $s'b = \|b\|_{1,\hat{w}}$. Choose $s_l = \hat{w}_l$ if $b_l > 0$ and $s_l = -\hat{w}_l$ if $b_l < 0$, for each $l$. If $b_l = 0$, choose either $s_l = -\hat{w}_l$ or $s_l = \hat{w}_l$. We can therefore take

$z_l = \begin{cases} \hat{w}_l & \text{if } b_l > 0 \\ -\hat{w}_l & \text{if } b_l < 0 \\ -\hat{w}_l \text{ or } \hat{w}_l & \text{if } b_l = 0, \end{cases} \quad l \in \{1,\ldots,JK\}.$

The subdifferential of $\|b\|_{1,\hat{w}}$ is:

$\partial\|b\|_{1,\hat{w}} = \{z : |z_l| \le \hat{w}_l,\ l \in \{1,\ldots,JK\},\ z'b = \|b\|_{1,\hat{w}}\}.$

Therefore, the subgradient of $F(b)$ is:

$\partial F(b) = \left\{\sum_{i=1}^n u_iT_{i,l} + \sum_{i=1}^n v_it_{i,l} + \lambda_nz_l,\ l \in \{1,\ldots,JK\}\right\},$

where $|z_l| \le \hat{w}_l$, $l \in \{1,\ldots,JK\}$, and $z'b = \|b\|_{1,\hat{w}}$, i.e., $z$ is the subgradient of $\|b\|_{1,\hat{w}}$. The subgradient optimality condition is that there exists $b$ such that $0 \in \partial F(b)$. Thus $b, z$ should satisfy

$z_l = -\sum_{i=1}^n\{u_iT_{i,l} + v_it_{i,l}\}/\lambda_n, \quad |z_l| \le \hat{w}_l, \quad z'b = \|b\|_{1,\hat{w}}, \quad l \in \{1,\ldots,JK\},$

which is equivalent to

(E.4) $\left|\sum_{i=1}^n\{T_{i,l}u_i + t_{i,l}v_i\}\right| \le \lambda_n\hat{w}_l, \quad l \in \{1,\ldots,JK\}.$

Upon substituting into $F(b)$, this gives

$F(b) = \inf_{b\in\Theta}F(b) = \sum_{i=1}^n u_i(b'T_i) + \sum_{i=1}^n v_i(b't_i) + \lambda_n\sum_{l=1}^{JK}\frac{-\sum_{i=1}^n\{u_iT_{i,l} + v_it_{i,l}\}}{\lambda_n}b_l = \sum_{i=1}^n\{u_i(b'T_i) + v_i(b't_i)\} - \sum_{i=1}^n\left\{u_i\sum_{l=1}^{JK}T_{i,l}b_l + v_i\sum_{l=1}^{JK}t_{i,l}b_l\right\} = 0.$

Hence the optimal value of $F(b)$ is 0, and combining the expression for the likelihood conjugate (E.1) and the optimality conditions (E.4) gives the Lagrange dual function $g(u,v)$ for all $(u,v)$ such that (E.4) holds. The form of (4.2) follows. □
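The optimality conditions (E.4) can likewise be verified at a penalized solution, which is a useful diagnostic for solver output. A sketch reusing gt_fit_lasso() and the assumed inputs from the earlier sketches:

b_al <- gt_fit_lasso(Tmat, tmat, w_hat, lambda)
u_al <- drop(Tmat %*% b_al)
v_al <- -1 / drop(tmat %*% b_al)
## (E.4): |sum_i {T_il u_i + t_il v_i}| <= lambda * w_hat_l for every l,
## with equality (up to tolerance) at coordinates where b_al[l] != 0.
slack <- lambda * w_hat - abs(colSums(u_al * Tmat + v_al * tmat))
## all(slack >= -1e-6) should hold at an optimum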
References

Andrews, D. (1997). A conditional Kolmogorov test. Econometrica, 65, pp. 1097–1128.

Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Boyd, S. P. and Vandenberghe, L. (2008). Subgradients. Notes for EE364b, Stanford University, Winter 2006-07, pp. 1–7.

Chen, L. H., Goldstein, L. and Shao, Q. M. (2010). Normal Approximation by Stein's Method. Springer Science & Business Media.

Chernozhukov, V., Fernandez-Val, I., and Galichon, A. (2010). Quantile and probability curves without crossing. Econometrica.

Chernozhukov, V., Fernandez-Val, I., and Melly, B. (2013). Inference on counterfactual distributions. Econometrica.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. (2019). Distributional conformal prediction. eprint arXiv:1909.07889.

Chernozhukov, V., Fernandez-Val, I., Newey, W., Stouli, S., and Vella, F. (2020). Semiparametric estimation of structural functions in nonseparable triangular models. Quantitative Economics, 11, pp. 503–533.

Chesher, A. (2003). Identification in nonseparable models. Econometrica, 71, pp. 1405–1441.

Chesher, A. and Spady, R. H. (1991). Asymptotic expansions of the information matrix test statistic. Econometrica, 59, pp. 787–815.

Curry, H. B. and Schoenberg, I. J. (1966). On Polya frequency functions IV: The fundamental spline functions and their limits. J. Analyse Math., 17, pp. 71–107.

DiNardo, J., Fortin, N. M. and Lemieux, T. (1996). Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach. Econometrica.

Domahidi, A., Chu, E., and Boyd, S. (2013). ECOS: an SOCP solver for embedded systems. In Proceedings of the European Control Conference, pp. 3071–3076.

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, 2nd Edition.

Foresi, S. and Peracchi, F. (1995). The conditional distribution of excess returns: An empirical analysis. Journal of the American Statistical Association, 90(430), pp. 451–466.

Fu, A., Narasimhan, B. and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582.

Hall, P., Wolff, R. C. and Yao, Q. (1999). Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445), pp. 154–163.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. 2nd ed., Cambridge University Press.

Horowitz, J. and Nesheim, L. (2020). Using penalized likelihood to select parameters in a random coefficients multinomial logit model. Journal of Econometrics, forthcoming.

Hyndman, R. J., Bashtannyk, D. M. and Grunwald, G. K. (1996). Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, 5(4), pp. 315–336.

Imbens, G. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica.

Koenker, R. (2000). Galton, Edgeworth, Frisch, and prospects for quantile regression in econometrics. Journal of Econometrics.

Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46, pp. 33–50.

Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation. Computational Statistics & Data Analysis.

Lu, W., Goldberg, Y., and Fine, J. P. (2012). On the robustness of the adaptive lasso to model misspecification. Biometrika, 99, pp. 717–731.

Matzkin, R. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71, pp. 1339–1375.

Newey, W. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, vol. 4, ch. 36, 1st ed., pp. 2111–2245. Amsterdam: Elsevier.

O'Donoghue, B., Chu, E., Parikh, N. and Boyd, S. (2016). Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3), pp. 1042–1068.

Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathematics.

R Development Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science.

Rana, I. K. (2002). An Introduction to Measure and Integration. Vol. 45. American Mathematical Society.

Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics.

Spady, R. H. and Stouli, S. (2018a). Dual regression. Biometrika.

Spady, R. H. and Stouli, S. (2018b). Simultaneous mean-variance regression. eprint arXiv:1804.01631.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, pp. 1–25.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), pp. 1418–1429.