[PDF] Nonparametric Estimation of the Random Coefficients Model: An Elastic Net Approach

Abstract

This paper investigates and extends the computationally attractive nonparametric random coefficients estimator of Fox, Kim, Ryan, and Bajari (2011). We show that their estimator is a special case of the nonnegative LASSO, explaining its sparse nature observed in many applications. Recognizing this link, we extend the estimator, transforming it to a special case of the nonnegative elastic net. The extension improves the estimator's recovery of the true support and allows for more accurate estimates of the random coefficients' distribution. Our estimator is a generalization of the original estimator and therefore, is guaranteed to have a model fit at least as good as the original one. A theoretical analysis of both estimators' properties shows that, under conditions, our generalized estimator approximates the true distribution more accurately. Two Monte Carlo experiments and an application to a travel mode data set illustrate the improved performance of the generalized estimator.

Full PDF

NNonparametric Estimation of the Random CoeﬃcientsModel: An Elastic Net Approach

Florian Heiss † , Stephan Hetzenecker ‡ , and Maximilian Osterhaus ∗†∗ Heinrich Heine University D¨usseldorf ‡ Ruhr Graduate School in Economics &University of Duisburg-EssenSeptember 2019

Abstract

This paper investigates and extends the computationally attractive nonparamet-ric random coeﬃcients estimator of Fox, Kim, Ryan, and Bajari (2011). We showthat their estimator is a special case of the nonnegative LASSO, explaining its sparsenature observed in many applications. Recognizing this link, we extend the estima-tor, transforming it to a special case of the nonnegative elastic net. The extensionimproves the estimator’s recovery of the true support and allows for more accurateestimates of the random coeﬃcients’ distribution. Our estimator is a generalizationof the original estimator and therefore, is guaranteed to have a model ﬁt at least asgood as the original one. A theoretical analysis of both estimators’ properties showsthat, under conditions, our generalized estimator approximates the true distribu-tion more accurately. Two Monte Carlo experiments and an application to a travelmode data set illustrate the improved performance of the generalized estimator.

JEL codes:

C14, C25, L

Keywords:

Random Coeﬃcients, Mixed Logit, Nonparametric Estimation, ElasticNet

Financial support by the Deutsche Forschungsgemeinschaft (DFG) project 235577387/GRK1974and the Ruhr Graduate School in Economics is gratefully acknowledged. † Heinrich Heine University D¨usseldorf, Universit¨atsstr. 1, 40225 D¨usseldorf, Germany. Email:ﬂ[email protected] ‡ Ruhr Graduate School in Economics, RWI - Leibniz Institute for Economic Research, Hohen-zollernstr. 1-3, 45128 Essen, Germany and University of Duisburg-Essen, Universit¨atsstr. 12,45117 Essen, Germany. Email: [email protected] ∗ Heinrich Heine University D¨usseldorf, Universit¨atsstr. 1, 40225 D¨usseldorf, Germany. Email:[email protected] a r X i v : . [ ec on . E M ] S e p Introduction

Adequately modeling unobserved heterogeneity across agents is a common challenge inmany empirical economic studies. A popular approach to address unobserved heterogene-ity are random coeﬃcient models, which allow the coeﬃcients of the economic model tovary across agents. The aim of the researcher is to estimate the distribution of the randomcoeﬃcients.Fox et al. (2011), hereafter FKRB, propose a simple and computationally fast esti-mator that can approximate distributions of any shape. The estimator uses a ﬁxed gridwhere every grid point is a prespeciﬁed vector of random coeﬃcients. The distributionfunction is obtained from the probability weights at the grid points, which are estimatedwith constrained least squares. In principle, the approach can approximate any distribu-tion arbitrarily closely if the grid of random coeﬃcients is suﬃciently dense (McFadden& Train, 2000).Applications of the estimator indicate, however, that it tends to estimate only fewpositive weights and that it sets the weights at many grid points to zero. As a consequence,the estimator lacks the ability to estimate smooth distribution functions but insteadapproximates potentially continuous distributions through step functions with only fewsteps. Our ﬁrst contribution is to show that the estimator of FKRB is NonnegativeLASSO (Wu, Yang, & Liu, 2014) (NNL) with a ﬁxed tuning parameter to explain itssparse nature.NNL, which was ﬁrst mentioned in the seminal work of Efron, Hastie, Johnstone,Tibshirani, et al. (2004) as positive LASSO, is a popular model selection method typicallyused in applications with supposedly sparse models. It is applied in various researchﬁelds, e.g., in vaccine design (Hu, Follmann, & Miura, 2015), nuclear material detection(Kump, Bai, sik Chan, Eichinger, & Li, 2012), document classiﬁcation (El-Arini, Xu,Fox, & Guestrin, 2013), and index tracking in stock markets (Wu et al., 2014). NNLshares the property of LASSO (Tibshirani, 1996) that it regularizes the coeﬃcients ofthe model and shrinks some to zero. This property is observed for the FKRB estimatorin diﬀerent Monte Carlo studies (e.g., Fox et al., 2011 and Fox, Kim, & Yang, 2016)and applications to real data (e.g., Nevo, Turner, & Williams, 2016, Illanes & Padi,2019, Blundell, Gowrisankaran, & Langer, 2018 and Houde & Myers, 2019). Nevo et al.(2016) study the demand for residential broadband and estimate that there are only 53out of 8,626 potentially heterogeneous consumer types. Illanes and Padi (2019) use theapproach to estimate the demand for private pension plans in Chile and assign positiveweights to only 194 of 83,251 grid points. Blundell et al. (2018) analyze ﬁrms’ reactionto the regulation of air pollution and recover no more than ﬁve of the 10,001 potentialpoints.In addition to its sparse nature, the connection of the FKRB estimator to NNL revealsthe estimator’s potentially incorrect selection of grid points under strong correlation. Theestimator “randomly” selects one out of a group of highly correlated points and sets the1emaining weights to zero (see Zou & Hastie, 2005, and Hastie, Tibshirani, & Friedman,2009, for the random behavior of LASSO).The estimator’s sparsity and “random” selection behavior can cause inaccurate ap-proximations of the true distribution through non-smooth distributions with the estimatedsupport possibly deviating from the true distribution’s support. The latter can lead tomisleading conclusions with respect to the heterogeneity of agents in the population. Foxet al. (2016) prove that the estimator identiﬁes the true distribution if the grid of ran-dom coeﬃcients becomes suﬃciently dense. However, in practice, the correlation tendsto increases with the density of the grid and can become so strong that the optimizationproblem to the FKRB estimator cannot be solved due to singularity (Nevo et al., 2016,Online Supplement). Therefore, the high correlation of a dense grid in combination withthe incorrect grid point selection of the estimator under strong correlation can have adrastic impact on the identiﬁcation of the model.Our second contribution is to provide a generalization of the FKRB estimator thatis able to accurately approximate continuous distributions even under strong correlation.Recognizing the link to NNL, we add a quadratic constraint on the probability weights.The constraint transforms the estimator to a special case of nonnegative elastic net (Wu& Yang, 2014). The extension mitigates the sparsity and improves the selection of thegrid points. Due to the additional ﬂexibility that is introduced with the extension, theestimator adjusts to the degree of correlation among grid points. Note that our gener-alization always includes the FKRB estimator as a special case such that the model ﬁtcannot be worse for our estimator than the FKRB estimator.We analyze theoretically, under conditions, that our estimator provides more accurateestimates of the true underlying distribution. For that purpose, we show the selectionconsistency and derive an error bound on the estimated distributions. The analysis ofthe selection consistency examines the estimator’s ability to estimate positive probabilityweights at grid points that lie inside the true distributions support, and zero weights atpoints outside the true support. The selection consistency is necessary to approximatethe true distribution as accurately as possible. Since the estimated distribution revealsthe existing heterogeneity in the population, i.e., agents’ varying preferences, recoveringthe true support points is also important for the correct interpretation of the model.The analysis shows that our generalized estimator correctly selects the grid pointsunder less restrictive conditions than the FKRB estimator. The error bounds on the esti-mated distribution functions illustrate the positive impact of our extension on the overallapproximation accuracy. Two Monte Carlo experiments in which we estimate a randomcoeﬃcients logit model conﬁrm the superior properties of our generalized estimator.Other nonparametric estimators for the random coeﬃcients model include Train (2008),Train (2016), Burda, Harding, and Hausman (2008) and Rossi, Allenby, and McCulloch(2012). Train (2008) introduces three diﬀerent estimators that are, in principle, sim-ilar to the general approach of FKRB but employ a log-likelihood criterion instead of2onstrained least squares. Train (2016) suggests approximating the random coeﬃcients’distribution with polynomials, splines or step functions instead of with a ﬁxed grid of pref-erence vectors. The approach substantially reduces the number of required ﬁxed pointsif the researcher speciﬁes overlapping splines and step functions. Due to the lower num-ber of required ﬁxed points, the approach reduces the curse of dimensionality, which is ashortcoming of the ﬁxed grid approach if the economic model includes a large number ofrandom coeﬃcients. However, both Train (2008) and Train (2016) estimate the respectivemodel with the EM algorithm, which is sensitive to its starting values and is not guaran-teed to converge to a global optimum. Burda et al. (2008) and Rossi et al. (2012) employa Bayesian hierarchical model to approximate the random coeﬃcients’ distribution with amixture of Normal distributions. Even though the estimator potentially has better ﬁnitesample properties, it uses a Markov Chain Monte Carlo technique with a multivariateDirichlet Process prior on the coeﬃcients, which is computationally more demanding.The remainder of the paper is organized as follows. Section 2 describes the FKRBestimator and introduces our generalized version. Section 3 derives the condition on theestimators’ sign consistency and an error bound on the estimated distribution functions.We present two Monte Carlo experiments in Section 4 that investigate the performanceof our generalized estimator in comparison to the FKRB estimator. Section 5 applies theestimators to the

Mode Canada data set from the R package mlogit (Croissant, 2019). 6concludes.

For the introduction of our estimator, we consider the framework of a random coeﬃcientdiscrete choice model. The approach, however, is not restricted to discrete choice modelsbut can be applied to any model with unobserved heterogeneous parameters. Let therebe an i.i.d. sample of N observations, each confronted with a set of J mutually exclu-sive potential outcomes. The researcher observes a K -dimensional real-valued vector ofexplanatory variables x i,j for every observation unit i and potential outcome j , and abinary vector y i whose entries are equal to one whenever she observes outcome j for the i th observation, and zero otherwise. The goal is to estimate the unknown distribution ofheterogeneous parameters F ( β ) in the model P i,j ( x ) = (cid:90) g ( x i,j , β ) dF ( β ) (1)where g ( x i,j , β ) denotes the probability of outcome j conditional on the random co-eﬃcients β and covariates x i,j . The researcher speciﬁes the functional form of g ( x i,j , β ).A prominent example of Equation (1) is the multinomial mixed logit model, the state-of-the-art model for demand estimation. In this model, consumer i realizes utility u i,j = x Ti,j β i + ω i,j from alternative j , given product characteristics x i,j and unobserved consumer-speciﬁc preferences β i . ω i,j denotes an additive, consumer- and choice-speciﬁc error term.3onsumer i chooses alternative j of J alternatives (and an outside good with utility u i, = ω i, ) if u i,j > u i,l for all l (cid:54) = j . Under the assumption that ω i,j follows a type Iextreme value distribution, the unconditional choice probabilities, P i,j ( x ), are of the form P i,j ( x ) = (cid:90) exp (cid:0) x Ti,j β (cid:1) J (cid:80) l =1 exp (cid:0) x Ti,l β (cid:1) dF ( β ) . (2) F ( β ) represents the distribution of heterogeneous consumer preferences in the populationand is to be estimated. In most applications, researchers place restrictive assumptions onthe functional form of F ( β ) in advance, and estimate its parameters from the data. FKRB propose a simple and fast mixture approach to estimate the underlying randomcoeﬃcients’ distribution without restrictive assumptions on its shape. The estimator usesa ﬁnite grid of ﬁxed random coeﬃcient vectors as mixture components to construct thedistribution from the estimated probability weight of every component. The underlyingidea of this ﬁxed grid estimator is the transformation of the unconditional choice probabil-ities in Equation (1) into a probability model in which F ( β ) enters linearly. FKRB derivethe linear probability model in two steps: they transform Equation (1) into a regressionmodel with the random coeﬃcients’ distribution as the only unknown term. Adding y i,j to both sides and moving P i,j to the right results in the probability model y i,j = (cid:90) g ( x i,j , β ) dF ( β ) + ( y i,j − P i,j ( x )) . (3)To exploit linearity in parameters, they use a sieve space approximation to the inﬁnite-dimensional parameter F ( β ). The sieve space approximation divides the support of therandom coeﬃcients β into R ﬁxed vectors. Each vector has length K , the number ofrandom coeﬃcients included in the model. The location of these vectors is speciﬁed bythe researcher. With the sieve space approximation, Equation (3) becomes a simple linearprobability model with unknown parameters θ = ( θ , . . . , θ R ) T y i,j ≈ R (cid:88) r =1 g ( x i,j , β r ) θ r + ( y i,j − P i,j ( x )) (4)where g ( x i,j , β r ) denotes the conditional choice probability evaluated at grid point r .Given the ﬁxed grid of random coeﬃcients, B R = ( β , . . . , β R ), the researcher estimatesthe probability weight θ r at every point r = 1 , . . . , R . The linear relationship between theoutcome variable and the unknown parameters θ allows to estimate the mixture weightswith the least squares estimator. The linear regression, which regresses the binary depen-dent variable y i,j on the choice probabilities evaluated at B R , in total has N J observations, J “regression observations” for every statistical observation unit i = 1 , . . . , N and R co-4ariates z i,j = ( g ( x i,j , β ) , . . . , g ( x i,j , β R )). By the deﬁnition of choice probabilities, theexpected value of the composite error term y i,j − P i,j ( x i,j ) conditional on x i,j is zero. Thus,the regression model satisﬁes the mean-independence assumption of the least squares ap-proach.The estimator of the random coeﬃcients’ joint distribution is constructed from theestimated weights ˆ F ( β ) = R (cid:88) r =1 ˆ θ r β r ≤ β ] (5)where β is an evaluation point chosen by the researcher and the indicator function 1[ β r ≤ β ] is equal to one whenever β r ≤ β , and zero otherwise.To ensure that ˆ F ( β ) is a valid distribution function, FKRB suggest estimating theweights with the least squares estimator subject to the constraints that the weights aregreater than or equal to zero, and sum to oneˆ θ F KRB = arg min θ N J N (cid:88) i =1 J (cid:88) j =1 (cid:32) y i,j − R (cid:88) r =1 θ r z ri,j (cid:33) s.t. θ r ≥ ∀ r and R (cid:88) r =1 θ r = 1 . (6)Key to an accurate approximation of F ( β ) is the precise estimation of the probabilityweights at every grid point. Basis to a precise estimation of the probability weights isthe consistent selection of the relevant grid points. This requires the constrained leastsquares estimator to estimate positive weights at all grid points at which F ( β ) has apositive probability mass, and zero weights otherwise. While zero weights at grid pointsinside F ( β )’s support cause inaccurate approximations through step functions with onlyfew steps, positive estimates at grid points outside F ( β )’s support lead to unreliableestimates of the random coeﬃcients’ distribution. To provide a more accurate non-parametric estimator with similar computational advan-tages, we suggest a simple generalization of the FKRB estimator. Our adjusted versionincludes the baseline estimator as a special case but allows for smoother estimates of F ( β )when necessary. To derive our estimator, we extend the optimization problem formulatedin Equation (6) by a constraint on the sum of the squared probability weights. Thisadditional constraint provides a straightforward way to mitigate the estimator’s sparsenature. Our generalized estimator is still simple and computationally fast.5 .2.1 Connection to Nonnegative LASSO We ﬁrst illustrate the source of the FKRB estimator’s sparsity, which helps to understandits behavior and the intuition behind our extension.One explanation of the potential sparsity of the estimates is the eﬀect of the nonnega-tivity constraint. Slawski and Hein (2013) show that nonnegative least squares estimatorsexhibit a self-regularizing property that yields sparse solutions. The FKRB estimator re-stricts the weights not only to be the nonnegative but also to sum up to one.Taking both constraints into account, we recognize that the FKRB estimator is aspecial case of the nonnegative LASSO (NNL) (Wu et al., 2014).To show the relation of the FKRB estimator to NNL, we transform the equality con-strained problem formulated in Equation (6) into its inequality constrained form. Theconstraint that the probability weights sum to one allows us to reparametrize the op-timization problem in terms of R − R unknown parameters. Without lossof generality, one can rewrite the R th weight as θ R = 1 − (cid:80) R − r =1 θ r . Substituting θ R inEquation (4) with 1 − (cid:80) R − r =1 θ r gives the inequality constrained optimization problemˆ θ FKRB = arg min θ N J N (cid:88) i =1 J (cid:88) j =1 (cid:18) ˜ y i,j − R − (cid:88) r =1 θ r ˜ z ri,j (cid:19) s.t. θ r ≥ ∀ r and R − (cid:88) r =1 θ r ≤ y i,j = y i,j − z Ri,j and ˜ z ri,j = z ri,j − z Ri,j for every r = 1 , . . . , R −

1. Because Equation(7) is an equivalent form of the optimization problem in Equation (6), the objectivefunctions are minimized by the same vector of probability weights. The only diﬀerence inthe inequality constrained problem is the estimation of the R th weight, which is calculatedafter optimization as θ R = 1 − (cid:80) R − r =1 θ r , and is not explicitly part of the optimization. Bythe constraints θ r ≥ ∀ r and (cid:80) R − r =1 θ r ≤

1, the R th weight satisﬁes the property of aprobability weight, 1 ≥ θ R ≥ θ NNL = arg min θ N J N (cid:88) i =1 J (cid:88) j =1 (cid:18) ˜ y i,j − R − (cid:88) r =1 θ r ˜ z ri,j (cid:19) s.t. θ r ≥ ∀ r and R − (cid:88) r =1 θ r ≤ s, (8)reveals that the baseline estimator is a special case of NNL with ﬁxed tuning parameter6 = 1. The constraint that the probability weights sum to one resembles an (cid:96) penaltythat regularizes the parameter estimates and shrinks some weights to zero if the sum ofunrestricted weights exceeds one.The amount of regularization depends on the size of the unrestricted estimates. Themore the sum of the R − R − B R is strong. It tends to select one out of a group of highly correlated grid points at randomand estimates the remaining to zero (see Zou & Hastie, 2005, and Hastie et al., 2009,for the random behavior of LASSO). Because the correlation is particularly strong in adense grid among neighboring grid points, the random selection behavior is especiallysevere for dense random coeﬃcient grids. This property conﬂicts with the requirement ofa suﬃciently ﬁne grid for accurate approximations of F ( β ). Extending the FKRB estimator’s optimization problem formulated in Equation (7) by aquadratic constraint on the probability weights alleviates the sparse nature and randomselection behavior. The additional constraint is known from ridge regression (Hoerl &Kennard, 1970) and transforms the FKRB estimator into the nonnegative elastic net (Wu& Yang, 2014) with ﬁxed constraint on the (cid:96) -penalty. Thus, our adjusted estimatorminimizes ˆ θ ENET = arg min θ N J N (cid:88) i =1 J (cid:88) j =1 (cid:18) ˜ y i,j − R − (cid:88) r =1 θ r ˜ z ri,j (cid:19) s.t. θ r ≥ ∀ r and R − (cid:88) r =1 θ r ≤ R − (cid:88) r =1 θ r ≤ t (9)where t is a nonnegative tuning parameter speciﬁed by the researcher. Having a linearand quadratic constraint on the probability weights ensures a more reliable selection of gridpoints: the quadratic constraint encourages a grouping eﬀect, which allows us to recoverhighly correlated points inside the true support of F ( β ) together and, hence, reduces theestimator’s sparsity. The linear constraint, in turn, retains the LASSO property, whichmakes it possible to select weights inside the support of the true distribution and to7stimate zero weights at points outside the true support.In addition to the improved selection consistency, the quadratic constraint has thedesirable property that it allows the speciﬁcation of a substantially ﬁner grid of randomcoeﬃcients. While the FKRB estimator runs into almost perfect collinearity problemsif the grid becomes ﬁner (Fox et al., 2016), the quadratic constraint ensures that theoptimization problem for our adjusted estimator always has a solution. The non-sparsesolutions together with the possibility of a ﬁner grid endow our estimator with the abilityto provide more accurate and reliable estimated distribution functions.The speciﬁcation of the tuning parameter allows adjusting the estimator to the levelof correlation among grid points. Smaller values of t give more weight to the quadraticconstraint, which enables the joint recovery of grid points if the correlation is strong and,hence, reduces the sparsity of the estimator. For decreasing values of t , the estimatorshrinks the probability weights of highly correlated grid points toward each other and in-duces an averaging of the estimated weights. For any t ≥

1, the quadratic constraint doesnot bind, such that the adjusted estimator simpliﬁes to the baseline estimator. Therefore,our estimator is a generalization of the FKRB estimator given in Equation (7), includingit as a special case. We recommend choosing the tuning parameter with cross-validationand the one standard error rule based on the mean squared error (MSE) criterion. Thisapproach ensures that our estimator achieves a model ﬁt that is at least as high as theFKRB estimator. If the model ﬁt is highest for t ≥

1, the outcome of our adjusted es-timator is the same as that for the estimator by FKRB, while it performs better if themodel ﬁt is lowest for some t < F ( β ) at these points. Due to the constraint that the estimated weights sum toone, the incorrect zero weights lead to downward biased estimates at points with positiveweights. The FKRB estimator reallocates the probability mass from the points with in-correct zero weights to other points, which imposes an upward bias at these points. Thequadratic constraint potentially reduces the described distortions through its improved se-lection consistency. As a result of more correct positive probability weights, the quadraticconstraint diminishes the reallocation of probability caused by the linear constraint and,therefore, reduces the bias both at points with incorrect zero weights and positive weights.The results of the two Monte Carlo studies presented in Section 4 demonstrate thatthe quadratic constraint reduces both the sparsity and the bias. Moreover, we derivean error bound on the estimated probability weights in Section 3 which, under certainconditions, is tighter for our generalized estimator than for the FKRB estimator if thecorrelation between grid points is strong. 8 Theoretical Analysis of the Estimators Properties

The requirement of a suﬃciently ﬁne grid, which potentially includes points outside thetrue support, transforms the ﬁxed grid estimator into a high dimensional regression prob-lem with potentially sparse solutions and highly correlated covariates. Recall that insuch a context, an important element of an accurate estimation of F ( β ) is the consistentselection of grid points. It guarantees the correct recovery of F ( β )’s support, and isfundamental to an undistorted estimation of the probability weights. Subsection 3.1 aimsto analyze both estimators’ ability to select the correct weights. To evaluate the overallapproximation accuracy of the estimators presented in Section 2, we derive an error boundfor the estimated probability weights in Subsection 3.2.Suppose θ ∗ = ( θ ∗ , . . . , θ ∗ R − ) T speciﬁes the vector of probability weights that yields themost accurate discrete approximation, F ∗ ( β ) = (cid:80) Rr =1 θ ∗ r [ β r ≤ β ] with θ ∗ R = 1 − (cid:80) R − r =1 θ ∗ r ,of F ( β ) which can be obtained with the estimators for a given ﬁxed grid B R . Further-more, assume that F ∗ ( β ) converges to F ( β ) for R going to inﬁnity. We use F ∗ ( β ) asa benchmark to compare the estimated distribution function, ˆ F ( β ) = (cid:80) Rr =1 ˆ θ r [ β r ≤ β ]with ˆ θ R = 1 − (cid:80) R − r =1 ˆ θ r , to the true underlying distribution. The introduction of F ∗ ( β )allows us to study the selection consistency and the distance between ˆ θ and θ ∗ .The focus of our analysis is on the impact of the correlation among the grid points onthe estimators. We show that our generalized estimator is selection consistent under lessrestrictive conditions on the design matrix.Due to the relation of the estimators to the NNL and nonnegative elastic net, respec-tively, we build on the literature on regularized regression. Our proof of the selectionconsistency mainly follows Jia and Yu (2010), who analyze selection consistency of theelastic net under i.i.d. Gaussian errors. Similarly to Jia and Yu (2010), Wu et al. (2014)and Wu and Yang (2014) derive selection consistency of the nonnegative LASSO, and thenonnegative elastic net for i.i.d. Gaussian errors.We extend their proof to sub-Gaussian errors and allow for correlation among the J errors that belong to the same observation unit i . Thereby, we contribute to the literatureon the nonnegative elastic net. Neither Jia and Yu (2010) nor Wu and Yang (2014)calculate error bounds on the deviation between the estimated and the true coeﬃcients.Our proof of the error bound on the estimated weights draws from Takada, Suzuki, andFujisawa (2017), who analyze a generalization of the elastic net. We adjust their proofsuch that it is in line with the probability model in Section 2.For any B R , denote the linear probability model corresponding to F ∗ ( β ) by y i,j = R (cid:88) r =1 θ ∗ r z ri,j + (cid:15) i,j (10)where (cid:15) i,j is the linear probability error. For our analysis of the selection consistencyand for the error bound on the estimated weights, we make the following assumptions on9he linear probability model in Equation (10), and on the data generating process. Assumption 1. (i) (cid:16) (cid:15) i = ( (cid:15) i, , ..., (cid:15) i,J ) (cid:17) Ni =1 are independent.(ii) (cid:15) i,j is sub-Gaussian: E [exp ( t(cid:15) i,j )] ≤ exp (cid:16) σ t (cid:17) ( ∀ t ∈ R ) for σ > .(iii) (cid:0) ˜ Z i (cid:1) Ni =1 are i.i.d. with a density bounded from above and each ˜ z ri,j ∈ [ − , .(iv) E (cid:104) (cid:15) i | ˜ Z , ..., ˜ Z N (cid:105) = 0 . ˜ Z refers to the regressor matrix of the transformed model in Equation (7) and ˜ Z i tothe corresponding J × R − i . Assumption 1 (i)imposes independence across the vectors of errors for each observation unit. It does notassume independence of elements within each vector of errors. Assumption 1 (ii) assumesthat the errors are sub-Gaussian with variance proxy σ . The variance proxy σ serves as anupper bound of the variance of the errors and allows for (conditional) heteroscedasticity.Note that the error term in the linear probability model in Equation (10) is sub-Gaussianwith variance proxy σ ≤

1. This follows from the fact that the error term in the linearprobability model is bounded between -1 and 1 since y i,j is either 0 or 1, the weights θ r arenonnegative and by Assumption 1 (iii) ˜ z ri,j is also bounded between -1 and 1. ˜ z ri,j ∈ [ − , For our analysis of the selection consistency, we adapt the deﬁnition of Zhao and Yu(2006). An estimator is deﬁned as equal in sign if ˆ θ r and θ ∗ r have the same sign for every r = 1 , . . . , R −

1. Due to the nonnegativity of the estimates, the deﬁnition implies thatˆ θ must be positive at all points in B R for which θ ∗ r >

0, and zero at those where θ ∗ r = 0.Therefore, the estimation of the correct signs is equivalent to the correct selection of gridpoints. If an estimate ˆ θ of the true weights θ is equal in sign, we write ˆ θ = s θ .Our deﬁnition only includes R − R − θ R = 1 − (cid:80) R − r =1 θ r has the correct sign. Deﬁnition 1.

An estimate ˆ θ is sign consistent if lim N →∞ P (cid:16) ˆ θ = s θ ∗ (cid:17) = 1 . According to Deﬁnition 1, an estimator is sign consistent if it estimates a positiveweight at every grid point at which θ ∗ >

0, and zero weights otherwise with probabilityapproaching one as the number of observation units N goes to inﬁnity.10o derive the condition under which our generalized estimator is sign consistent, wemake the following notations on the design matrix and probability weights. We assumethat B R includes both grid points inside the support of F ( β ), i.e., points at which θ ∗ > θ ∗ = 0. Let S = { r ∈ { , . . . , R − }| θ ∗ r > } deﬁne the index set of grid points at which θ ∗ >

0, and let S C = { r ∈ { , . . . , R − }| θ ∗ r = 0 } denote its complement. The corresponding cardinalities are deﬁned as s := | S | and s C := | S C | . We refer to grid points in S as active grid points and to grid points in S C as inactive grid points. ˜ Z S and ˜ Z S C denote the sub-matrices of all columns of ˜ Z thatare in S and S C , respectively.Let λ denote the ﬁxed LASSO parameter which corresponds to the Lagrange param-eter for s = 1 in Equation (9) and µ the Lagrange version of the ridge tuning parameter t in Equation (9).Following Wu and Yang (2014), we then obtain the subsequent condition for the signconsistency of the generalized estimator: Nonnegative Elastic Irrepresentable Condition (NEIC).

There exists a positiveconstant η > (independent of N ) such that max r ∈ S C N J ˜ Z TS C ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:16) ι S + µλ θ ∗ S (cid:17) ≤ − η where ι S is a vector of s ones and I S is the identity matrix. The NEIC is a condition for the correct recovery of support points through our gen-eralized estimator. The term ˜ Z TS C ˜ Z S restricts the linear dependency between active andinactive grid points. The term ˜ Z TS ˜ Z S measures the linear dependency among active gridpoints. In addition to the linear dependence of the regressor matrix, the magnitude ofthe ﬁxed LASSO parameter and the tuning parameter µ is taken into account by theNEIC. For µ = 0, the NEIC reverts to the Nonnegative Irrepresentable Condition (NIC),the corresponding condition for selection consistency through the estimator proposed byFKRB. In contrast to the NEIC, the NIC requires that the inverse of ˜ Z TS ˜ Z S exists, whichis not a necessary condition for the NEIC to hold.We exploit the special structure of our data by incorporating the fact that all θ arebetween zero and one, and all elements of ˜ Z between minus one and one.In line with Fox et al. (2016), we allow R ( N ) to depend on the sample size N . Thatis, the larger N , the more grid points R ( N ) can be included into the grid. If R ( N )increases, we typically expect the number of positive weights s ( N ) to increase if the truedistribution F ( β ) is suﬃciently smooth. The next condition restricts the rate at which s ( N ) and R ( N ) can increase with N . For convenience, we write s and R instead of s ( N )and R ( N ) in the subsequent analyses. 11 ate Condition on Density of Grid (RCDG). lim N →∞ sJ exp (cid:16) − Nξ S min ( µ ) ρ s (cid:17) = 0 .2. lim N →∞ R − J exp (cid:18) − N η λ (cid:16) ξ S min ( µ ) s √ s + ξ S min ( µ ) (cid:17) (cid:14) (cid:19) = 0 , where ξ S min ( µ ) denotes the (unrestricted) minimal eigenvalue of / ( N J ) ˜ Z TS ˜ Z S + µ I S and ρ := min i ∈ S (cid:12)(cid:12)(cid:12)(cid:12) (cid:16) / ( N J ) ˜ Z TS ˜ Z S + µ I S (cid:17) − (cid:16) / ( N J ) ˜ Z TS ˜ Z S θ ∗ S − λι S (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) . RCDG requires that ξ S min ( µ ) >

0. Otherwise, the condition can never be satisﬁed.This is only restrictive for the FKRB estimator and always holds for its generalization aslong as µ > / ( N J ) ˜ Z TS ˜ Z S + µ I S is positive deﬁnite for µ > µ = 0. The assumption ξ S min ( µ ) > Theorem 1.

Suppose Assumption 1 holds. Suppose further that NEIC and RCDG hold.Then lim N →∞ P (cid:16) ˆ θ = s θ ∗ (cid:17) = 1 . Proof.

See Appendix C.2.Theorem 1 relies on suﬃcient conditions for the estimators to select the true weights.These conditions are more restrictive for the FKRB estimator than for our generalization.Since ξ S min ( µ ) = ξ S min (0)+ µ , the minimal eigenvalue ξ S min ( µ ) is higher for the elastic net thanfor the LASSO estimator. Furthermore, the NEIC holds whenever the NIC is satisﬁed.This implies that our estimator consistently selects the true support whenever the FKRBestimator does.The converse is not true since the NEIC might hold even though NIC does not. Thus,Theorem 1 reveals that our estimator can select the true weights in cases in which theFKRB estimator cannot. A key requirement for an accurate estimation of F ( β ) - in addition to the correct supportrecovery discussed in Subsection 3.1 - is the precise estimation of the probability weights.In this section, we derive the error bound for the estimated probability weights and theweights that yield the best discrete approximation of F ( β ).Let H denote the set of vectors of length R in [ − , R for which the (cid:96) -norm is nogreater than 2 H := (cid:110) x ∈ [ − , R (cid:12)(cid:12)(cid:12) (cid:13)(cid:13) x (cid:13)(cid:13) ≤ (cid:111) . The set H contains all possible values of ∆ˆ θ := ˆ θ − θ ∗ since ˆ θ and θ ∗ are vectors of weightswhich sum up to 1. Therefore, it is suﬃcient to consider elements in H when analyzingthe potential error ∆ˆ θ . 12eﬁne the restricted minimum eigenvalue of the real symmetric R × R matrix1 / ( N J ) ˜ Z T ˜ Z + µ I R over the set of vectors H as ξ min ( µ ) := inf v ∈B v T (cid:2) NJ ˜ Z T ˜ Z + µ I R (cid:3) v (cid:13)(cid:13) v (cid:13)(cid:13) . Because the restricted minimal eigenvalue is greater than or equal to the unrestrictedminimal eigenvalue, we use the restricted eigenvalue to derive a tighter error bound. Westill assume ξ min ( µ ) > ξ min ( µ ) > µ > ξ min ( µ ) > R − Theorem 2.

Let < δ ≤ . Deﬁne γ ( N, δ ) := (cid:114) (cid:16) R − Jδ (cid:17) (cid:14) N . Suppose As-sumption 1 holds, and that ξ min ( µ ) > for µ ≥ . Then, for any positive k such that γ ( N, δ ) ≤ kλ , it holds with probability − δ that (cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13) ≤ √ R − kλ + 2 µ √ s (cid:13)(cid:13) θ ∗ S (cid:13)(cid:13) ∞ ξ min ( µ ) . Proof.

See Appendix C.3.Theorem 2 holds with probability approaching one as δ →

0. Because γ ( N, δ ) decreasesin N , the error bound becomes tighter if the number of observation units increases. Thiscan be seen from the condition γ ( N, δ ) ≤ kλ which requires a smaller constant k for alarger N (and ﬁxed λ ).The number of grid points leads to a direct increase of the error bound, both through R and s , which is expected to increase with R , e.g., if the true distribution is continu-ous. The number of grid points also has an indirect eﬀect attributable to the strongercorrelation typically associated with an increase in the number of grid points. This eﬀectis captured through the restricted minimum eigenvalue ξ min ( µ ), which decreases if thecorrelation increases. Hence, an increase in the number of grid points typically leads to awider error bound on the estimated weights (for a ﬁxed µ ).For µ = 0, the bound in Theorem 2 simpliﬁes to the error bound for the FKRBestimator. A comparison of the bound for µ = 0 and µ > s ,the indirect eﬀect is especially relevant if the correlation among grid points is strong.In that case, the extension leads to an increase of ξ min ( µ ) and hence, to a tighter error13ound. The indirect eﬀect becomes particularly important if the design matrix tends to bealmost singular, in which case the restricted minimum eigenvalue of the FKRB estimatorapproaches zero (and the error bound its maximum possible value 2). Also note that theestimation error for the weight θ R , which is not included in the bound in Theorem (2)and calculated as θ R = 1 − (cid:80) R − r =1 θ r , will approach zero whenever (cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13) is close tozero.Corollary 1 establishes the condition under which our extension provides a tightererror bound on the estimated weights than the FKRB estimator. Corollary 1.

When √ s (cid:13)(cid:13) θ ∗ S (cid:13)(cid:13) ∞ ξ min (0) < √ R − kλ , then the error bound for (cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13) in Theorem 2 is tighter for the generalized estimator than for the FKRB estimator.Proof. See Appendix C.3.Using the error bound on the estimated and true probability weights in Theorem 2,we derive a bound on the error of the estimated distribution function ˆ F ( β ) and the bestdiscrete distribution F ∗ ( β ). Theorem 3.

Under the assumptions and conditions in Theorem 2, it holds at any point β ∈ R K with probability − δ that | ˆ F ( β ) − F ∗ ( β ) | ≤ R − kλ + 4 µ (cid:112) s ( R − (cid:13)(cid:13) θ ∗ S (cid:13)(cid:13) ∞ ξ min ( µ ) . Proof.

See Appendix C.3.The bound on the diﬀerence between the estimated distribution and the best discreteapproximation of F ( β ) increases in R and decreases in ξ min ( µ ). Similarly to Theorem 2,the diﬀerence in the distributions decreases in N since k may decrease when N increases.Additionally, Fox et al. (2016) show that, under some regularity conditions, it holdsthat | F ( β ) − F ∗ ( β ) | = O ( R − ¯ s/K ) where ¯ s ≥ F ( β ) and K refers to the number of random coeﬃcients. This explains the relevance ofTheorem 3 since the diﬀerence of F ( β ) and F ∗ ( β ) becomes negligibly small as R increasesand the estimation error can then be well captured by | ˆ F ( β ) − F ∗ ( β ) | . We conduct two Monte Carlo experiments to examine the selection consistency and theapproximation accuracy of our generalized estimator. The Monte Carlo simulation on theselection consistency uses a discrete distribution with a subset of grid points as supportpoints.The second experiment generates the random coeﬃcients from a mixture of two nor-mal distributions. This allows us to study the estimators’ ability to estimate smooth The density function of β is assumed to be ¯ s -times continuously diﬀerentiable. i choosesamong J = 4 mutually exclusive alternatives and an outside option. For every alternative j and observation unit i , we draw the two-dimensional covariate vector x i,j = ( x i,j, , x i,j, )from U (0 ,

5) and U ( − , R and N

200 times to compare the performance of our estimator withthe FKRB estimator in terms of selection consistency and accuracy for every setup. Allcalculations are conducted with the statistical software R (R Core Team, 2018).

To study the estimators’ selection consistency, we generate the random coeﬃcients β froma discrete probability mass function. The estimator successfully recovers the true supportfrom the data if it estimates a positive weight at every support point of F ( β ), and zeroweights at all points outside its support.For the support points of F ( β ), we select a subset of the grid points from the ﬁxedgrid we use for the estimation. The grid covers the range [ − . , . × [ − . , .

5] with R = { , , } uniformly allocated grid points. We specify the support of our discrete datagenerating distribution on [ − . , − . × [ − . , . − . , . × [ − . , . β from a discrete mass function with S = { , , } support points, each drawn with uniform probability weight θ s = 1 /S .In this setup, the data generating process exactly matches the underlying probabilitymodel of the ﬁxed grid estimator. This way, we abstract from any approximation errorsthat can arise from the sieve space approximation of the true underlying distribution.Therefore, the experiment studies the estimators’ selection consistency in the most simpleframework possible.The two areas of the discrete distribution with positive probability mass simulate twoheterogeneous groups of preferences in the population. We estimate every distribution forsample sizes N = { , } .Figure 1 illustrates the setup of the Monte Carlo experiment for the three data gen-erating distributions. The blue shaded area indicates the support of the discrete massfunctions, and the ﬁlled blue points inside this area the active grid points. The hollowblack points outside the blue shared areas are the inactive grid points that are not usedfor data generation.We choose the optimal tuning parameter µ for the generalized estimator with 10-foldcross-validation from a sequence of 101 potential values. For 100 of these values, we use15igure 1: Grid of Monte Carlo Study with Discrete Mass Points (a) R = 25, S = 17 (b) R = 81, S = 49 (c) R = 289, S = 161 the sequence suggested by the R package glmnet for ridge regression with nonnegativecoeﬃcients. We also include µ = 0 in the range of possible values to allow our estimatorto simplify to the FKRB estimator if the model ﬁt in the cross-validation is highest for µ = 0. The selection of the optimal tuning parameter is based on the mean squared error(MSE) criterion. In addition to the tuning parameter with the lowest MSE, we report thetuning parameter that follows from the one-standard-error rule (OneSe). As robustness-checks, we consider the prediction accuracy of the predicted choice ofevery observation and the log-likelihood as a measure of ﬁt in the cross-validation. Wechoose the µ based on the smallest average out-of-sample prediction error and based onthe highest log-likelihood, respectively. The results of the Monte Carlo study for the log-likelihood and predicted choices as selection criteria can be found in Appendix A. Theyindicate that the MSE and the one-standard-error rule give the best results.To evaluate the estimators’ selection consistency, we calculate the average share of signconsistent estimates. An estimate is sign consistent if it is positive at active grid points,and zero otherwise. A weight is deﬁned as positive if it is greater than 10 − . To illustratethe sparsity of the estimators’ solutions, we report the average number of positive weightsand the average share of true positive weights.Beyond selection consistency, the discrete setup of the Monte Carlo experiment allowsus to study the bias of the estimated probability weights. Denote the estimated weightat grid point r in Monte Carlo run m by ˆ θ r,m . We calculate the L norm L = 1 M M (cid:88) m =1 R R (cid:88) r =1 (cid:12)(cid:12)(cid:12) θ r − ˆ θ r,m (cid:12)(cid:12)(cid:12) (11)to measure the average absolute bias of ˆ θ in comparison to the true weights θ overall Monte Carlo runs M . In addition, we adopt the root mean integrated squared error(RMISE) from Fox et al. (2011) to provide a metric on the approximation accuracy of We observe that the curve of the MSE in dependency of µ tends to be ﬂat and that the µ chosen byOneSe often corresponds to the largest element of the sequence of tuning parameters suggested by the glmnet package. Therefore, a possible strategy is to choose the largest µ given by the glmnet package toobtain the µ of OneSe if one wants to avoid cross-validation. (cid:118)(cid:117)(cid:117)(cid:116) M M (cid:88) m =1 (cid:34) E E (cid:88) e =1 (cid:16) (cid:98) F m ( β e ) − F ( β e ) (cid:17) (cid:35) , (12)where (cid:98) F m ( β e ) denotes the estimated distribution function in Monte Carlo run m evalu-ated at grid point β e . For the evaluation, we use E = 10 ,

000 points uniformly distributedover the range [ − . , . × [ − . , . N , the number of grid points R , and the number of true supportpoints S . The upper part of the table presents the measures on the accuracy of theestimated weights, and the lower part the shares of positive, true positive, and signconsistent estimated weights. The ﬁnal column in the upper part reports the third quantileof the absolute values of the correlation ρ among grid points. Table 1: Summary Statistics of 200 Monte Carlo Runs with Discrete Distribution.RMISE L µ ρN R S FKRB MSE OneSe FKRB MSE OneSe MSE OneSe 3rd Qu.1000 25 17 0.067 0.04 0.034 0.035 0.017 0.014 56.05 67.95 0.8081000 81 49 0.08 0.046 0.038 0.019 0.008 0.007 58.90 70.06 0.8191000 289 161 0.088 0.057 0.045 0.006 0.004 0.003 54.87 71.20 0.82210000 25 17 0.042 0.026 0.023 0.02 0.012 0.011 61.15 66.78 0.80910000 81 49 0.05 0.031 0.027 0.015 0.008 0.007 59.31 69.16 0.81810000 289 161 0.057 0.037 0.033 0.006 0.004 0.003 61.90 70.50 0.822Pos. % True Pos. % Sign

N R S

FKRB MSE OneSe FKRB MSE OneSe FKRB MSE OneSe1000 25 17 13.3 20.82 22.36 68.44 94.53 99.71 71.88 77.28 78.141000 81 49 15.47 49.58 54.67 27.02 82.12 90.4 53.1 77.65 81.381000 289 161 16.24 103.13 123.8 8.62 55.31 66.39 48.27 70.24 75.4210000 25 17 17.17 19.39 19.73 90.32 98.12 99.53 86.16 87.86 88.4610000 81 49 23.32 44.84 48.26 42.29 81.07 87.14 61.88 82.24 85.3610000 289 161 24.88 97.39 105.84 13.53 55.07 59.94 50.76 71.96 74.46

Note:

The table reports the average summary statistics over all Monte Carlo replicates for theFKRB estimator (FKRB), and for our generalized estimator with tuning parameter µ from a 10-fold cross-validation and the M SE criterion (MSE) and the one-standard-error rule (OneSe). In addition, we also considered the mean and median to summarize the absolute correlation amonggrid points. We focus on the third quantile since it best illustrates the strong correlation in this setup. N and R , in particular when the tuning parameter µ is chosen basedon the one-standard-error rule. With respect to the selection consistency, the generalizedestimator recovers more true positive and sign consistent probability weights from thedata than the FKRB estimator. While the decrease in these shares is moderate for thegeneralized estimator when the discrete distribution becomes more complex, the correctrecovery through the FKRB estimator becomes signiﬁcantly worse.This is best illustrated by the small number of positive weights, which changes onlyslightly alongside the increasing complexity. In the extreme case of R = 289, the FKRBestimator estimates positive weights at no more than 16 /

25 of the grid points for N =1 , / ,

000 (in comparison to 124 /

106 for the generalized estimator).In addition to its improved selection consistency, all measures on the estimated weightsindicate that our generalized version provides substantially more accurate estimates of theprobability weights than the FKRB estimator. The bias reduction persists for small andlarge sample sizes.Figure 2: Correlation Matrix for N = 10 ,

000 and R = 81The plot of the correlation matrix in Figure 2 and the third quantile of the values ofabsolute correlation in Table 1 both illustrate that correlation among many grid points isstrong. The second Monte Carlo experiment considers a mixture of two bivariate normal distri-butions for F ( β ) to analyze how our generalized estimator accommodates more complex18ontinuous distributions. This way, we can assess its ability to recover distributions thatcannot be estimated with parametric techniques.For the estimation, we use a ﬁxed grid with points spread on [ − . , . × [ − . , . F ( β ) for an increasing number of grid points, we estimate the model with R = { , , , } . The number of observation units N varies from 1,000 to 10,000.The variance-covariance matrices of the two normals are Σ = Σ = (cid:2) . . .

15 0 . (cid:3) . Wegenerate the random coeﬃcient vectors β from the following two-component bivariatemixture 0 . N (cid:18) [ − . , − . , Σ (cid:19) + 0 . N (cid:18) [1 . , . , Σ (cid:19) The left panel in Figure 3 displays the bimodal joint density of the mixture of the twonormals, and the right panel the joint distribution function.Figure 3: True Density and Distribution Function of Mixture of two Normals (a) PDF (b) CDF

For the calculation of the RMISE, we use E = 10 ,

000 evaluation points uniformlydistributed over the range of the ﬁxed grid. In addition, we report the average numberof positive, true positive, and sign consistent estimated weights. For the number of truepositive and sign consistent weights, we calculate the true density at every grid point anddeﬁne a true weight as positive if the density is greater 10 − .Table 2 summarizes the average results over the M = 200 Monte Carlo replicatesfor the FKRB estimator and our generalized estimator when µ is chosen with 10-foldcross-validation and the MSE and one-standard error rule, respectively. Results for theprediction accuracy of the predicted choices and the log-likelihood as criteria are reported19n Appendix A.Table 2: Summary Statistics of 200 Monte Carlo Runs with Mixture of Two BivariateNormals.RMISE Pos. µ ρN R S FKRB MSE OneSe FKRB MSE OneSe MSE OneSe 3rd Qu.1000 25 17 0.085 0.072 0.055 9.65 12.94 17.78 21.2 74.01 0.8231000 50 33 0.09 0.068 0.058 12.57 26.82 32.4 48.09 74 0.821000 100 67 0.095 0.07 0.061 13.65 47.09 54.85 58.6 74.37 0.8221000 250 163 0.102 0.077 0.063 14.2 79.28 103.98 50.5 74.52 0.82410000 25 17 0.063 0.061 0.057 11.65 12.57 14.89 18.3 73.94 0.82310000 50 33 0.058 0.051 0.047 17.52 24.94 28.3 48.71 73.94 0.8210000 100 67 0.06 0.048 0.043 19.87 39.59 46.7 51.63 74.04 0.82310000 250 163 0.063 0.045 0.04 21.21 76.43 87.92 59.98 74.68 0.824% True Pos. % Sign

N R S

FKRB MSE OneSe FKRB MSE OneSe1000 25 17 49.06 66.35 88.68 60.12 70.48 81.481000 50 33 33.39 70.33 84.48 52.93 73.19 80.721000 100 67 18.01 63.46 74.1 43.48 70.95 77.441000 250 163 7.37 44.38 58.17 38.73 60.96 69.0610000 25 17 57.94 63.35 77.24 64.2 67.86 77.4810000 50 33 47.24 68.59 78.05 61.32 74.66 80.4210000 100 67 26.57 54.72 64.84 48.74 66.73 73.1910000 250 163 11.39 43.78 50.56 41.17 61.32 65.56

Note:

The RMISE shows that our generalized estimator provides more accurate estimates ofthe true underlying random coeﬃcients’ distribution than the FKRB estimator for everycombination of N and R . For N = 10 ,

000 the generalized version becomes more accuratewith increasing number of grid points and approximates F ( β ) quite well for R = 250.However, the FKRB estimator does not result in a lower RMISE for N = 10 ,

000 when R increases.The improved performance of our estimator for every combination of N and R can beexplained with the larger number of true positive and sign consistent estimated probabil-20ty weights. Independently of the number of (relevant) grid points, the FKRB estimatorestimates only a small number of positive weights and, hence, recovers only few relevantgrid points. The share of true positive and sign consistent estimated weights is sub-stantially higher for our estimator. Figure 4 plots an example of the joint distributionfunctions estimated with the FKRB estimator (Panel (a)) and our generalized estimator(Panel (b)). Figure 5 shows the corresponding estimated and true marginal distributionsof β and β . The distribution functions are estimated for N = 10 ,

000 and R = 250.Figure 4: Estimated Joint Distribution Functions for N = 10 ,

000 and R = 250 (a) FKRB (b) Generalized with OneSe Figure 5: True and Estimated Marginal Distribution Functions for N = 10 ,

000 and R = 250The plots illustrate the impact of the FKRB estimator’s sparse nature on the estimatedmarginal and joint distribution functions. Visual inspection shows that it approximates F ( β ) through a step function with only few steps due to the small number of positiveweights. In contrast, our generalized estimator provides a smooth estimate that is closeto the true underlying distribution function.21 Application

To study the performance of our generalized estimator with real data, we apply it to the

ModeCanda data set from the

R package mlogit . Originally, the Canadian National RailCarrier VIA Rail assembled the data in 1989 to analyze the demand for future intercitytravel in the Toronto-Montr´eal corridor. The data contains information on travelers whocan choose among the four intercity travel mode options car, bus, train, and air. Due tothe small number of bus users (18), we follow Bhat (1997b) and drop bus as an alternative.Furthermore, we only consider travelers in our analysis that can choose among all threeoptions. Thus, the analyzed data consists of 3 ,

593 business travelers who can chooseamong airplane, train, and car. In addition to the observed choices, the data includesinformation on traveler’s income, the trip distance, the frequency of the service, totaltravel cost, an indicator that is one if either the city of arrival or departure is a big cityand zero otherwise, and the in- and out-of-vehicle travel time. We construct the traveltime variable by summing up in-vehicle travel time and out-of-vehicle time. This is donefor two reasons: ﬁrst, the data on out-of-vehicle time is always zero for car users and wouldtherefore only capture the preferences of airplane and train users. Second, we think it isplausible that individuals care more about total travel time than the travel time insideand outside of a vehicle separately.A detailed description of the data can be found in Marwick and Koppelman (1990).Among others, the data set has been studied by Bhat (1995, 1997a, 1997b, 1998), Koppelmanand Wen (2000), Wen and Koppelman (2001). The only paper that analyzes the datawith a random coeﬃcients logit model is the study by Hess, Bierlaire, and Polak (2005).However, they only use the explanatory variables as input for a Monte Carlo study andsimulate travelers’ mode choices.We estimate a mixed logit model with a random coeﬃcient on the travel time andﬁxed coeﬃcient on all other variables to study the preferred travel mode of businesstravelers. We include all the above variables into the utility speciﬁcation along withmode speciﬁc constants, where we specify car as the reference alternative. To applythe ﬁxed grid approach to a model with ﬁxed and random coeﬃcients, we follow therecommendation of Fox et al. (2016) and Houde and Myers (2019) who suggest a two-step estimator to estimate the model with ﬁxed and random coeﬃcients. In the ﬁrststep, all coeﬃcients are estimated using a parametric mixed logit. We assume that therandom coeﬃcient is normally distributed. In the second step, the ﬁxed variables andtheir estimated coeﬃcients from the ﬁrst stage are treated as data and only the randomcoeﬃcient of travel time is estimated with the FKRB and elastic net estimator. Houdeand Myers (2019) justify the procedure with the argument that a mixed logit can recoverthe means of a distribution fairly well despite the incorrect assumptions on the random We also provide an algorithm to update both the ﬁxed and random coeﬃcients in Appendix B. Thealgorithm is a modiﬁcation of the ﬂexible grid estimator in Train (2008). Unfortunately, the algorithmseems to be very slow and we do not include its results in our comparison here. and add three standard deviations to each side. We estimatethe second step with diﬀerent numbers of grid points. The preferred speciﬁcation uses R = 100 uniformly spread points on the range [ − . , . Furthermore, it can easily beseen that the estimated mass function obtained by the elastic net estimator does not seemto be normally distributed but rather looks like a mixture of two normal distributions.That is, specifying a normal or any other parametric distribution function does not seemappropriate in this example. A quite unexpected result is that there are positive weightsat positive grid points implying that some people appreciate longer trips. Even though,one might argue that this might be the case if such travelers accept additional traveltime for, say, additional comfort when traveling, this might also be a sign of a misspeci-ﬁed model. For the FKRB estimator these weights sum up 9 .

5% and for the elastic netto 10 .

1% which is lower than 12 .

6% for the mixed logit with normal distribution. Theweighted mean of the coeﬃcient of travel time for the FKRB estimator is − . − . − . . . The estimated coeﬃcients of the ﬁrst stage are provided in Appendix A. We again deﬁne a weight as positive if it is greater than 10 − . R = 100 (a) Mass Function for FKRB(b) Mass Function for Elastic Net(c) CDFs for FKRB (red) and Elastic Net (blue) Conclusion

We extend the simple and computationally attractive nonparametric estimator of Fox etal. (2011). We illustrate that their estimator is a special case of NNL, explaining its sparsesolutions. The connection to NNL reveals that the estimator tends to randomly selectamong highly correlated grid points. This behavior gives reason to doubt the preciseestimation of the true distribution through the estimator.To mitigate its undesirable sparsity and random selection behavior, we add a quadraticconstraint on the probability weights to the optimization problem of the FKRB estimator.This simple and straightforward extension transforms the estimator to a special caseof nonnegative elastic net. The combination of the linear and quadratic constraint onthe probability weights enables a more reliable selection of the relevant grid points. Asa consequence, our generalized estimator provides more accurate estimates of the trueunderlying random coeﬃcients’ distribution without increasing computational speed andsimplicity substantially. We derive conditions for selection consistency and an error boundon the estimated distribution function to verify the improved properties of our estimator.Two Monte Carlo studies illustrate the attractive theoretical properties of our es-timator. They show that our generalized version estimates considerably more positiveprobability weights and recovers more grid points correctly. In addition to the improvedselection consistency, the estimator provides more accurate estimates of the true under-lying distributions.Applying the FKRB and the elastic net estimator to a data set of travel choicesmade in the Toronto-Montr´eal corridor conﬁrms the sparsity of the FKRB estimator. Incontrast, the elastic net estimator selects substantially more grid points, resulting in asmooth distribution function. This illustrates the fact that the elastic net estimator isable to approximate continuous distribution functions.25 eferences

Bhat, C. R. (1995). A heteroscedastic extreme value model of intercity travel mode choice.

Transportation Research Part B: Methodological , (6), 471–483.Bhat, C. R. (1997a). Covariance heterogeneity in nested logit models: econometric struc-ture and application to intercity travel. Transportation Research Part B: Method-ological , (1), 11–21.Bhat, C. R. (1997b). An endogenous segmentation mode choice model with an applicationto intercity travel. Transportation science , (1), 34–48.Bhat, C. R. (1998). Accommodating variations in responsiveness to level-of-service mea-sures in travel mode choice modeling. Transportation Research Part A: Policy andPractice , (7), 495–507.Blundell, W., Gowrisankaran, G., & Langer, A. (2018). Escalation of scrutiny: The gainsfrom dynamic enforcement of environmental regulations (Tech. Rep.). NationalBureau of Economic Research.Burda, M., Harding, M., & Hausman, J. (2008). A bayesian mixed logitprobit model formultinomial choice.

Journal of Econometrics , (2), 232–246.Croissant, Y. (2019). mlogit: Multinomial logit models [Computer software manual]. Re-trieved from https://CRAN.R-project.org/package=mlogit (R package version0.4-1)Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annals of statistics , (2), 407–499.El-Arini, K., Xu, M., Fox, E. B., & Guestrin, C. (2013). Representing documents throughtheir readers. In Proceedings of the 19th acm sigkdd international conference onknowledge discovery and data mining (pp. 14–22). New York, NY, USA: ACM.Fox, J. T., Kim, K., Ryan, S., & Bajari, P. (2011). A simple estimator for the distributionof random coeﬃcients.

Quantitative Economics , (3), 381–418.Fox, J. T., Kim, K., & Yang, C. (2016). A simple nonparametric approach to estimatingthe distribution of random coeﬃcients in structural models. Journal of Economet-rics , (2), 236–254.Gentle, J. E. (2007). Matrix algebra: Theory, computations, and applications in statistics (1st ed.). Springer Publishing Company, Incorporated.Hastie, T., Tibshirani, R., & Friedman, J. (2009).

The elements of statistical learning:data mining, inference and prediction (2nd ed.). Springer.Hess, S., Bierlaire, M., & Polak, J. W. (2005). Estimation of value of travel-time savingsusing mixed logit models.

Transportation Research Part A: Policy and Practice , (2-3), 221–236.Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation fornonorthogonal problems. Technometrics , (1), 55–67.Houde, S., & Myers, E. (2019, April). Heterogeneous (mis-) perceptions of energy costs:Implications for measurement and policy design (Working Paper No. 25722). Na-26ional Bureau of Economic Research.Hu, Z., Follmann, D. A., & Miura, K. (2015). Vaccine design via nonnegative lasso-basedvariable selection.

Statistics in medicine , (10), 1791–1798.Illanes, G., & Padi, M. (2019). Competition, asymmetric information, and the annuitypuzzle: Evidence from a government-run exchange in chile (Tech. Rep.). Center forRetirement Research.Jia, J., & Yu, B. (2010). On model selection consistency of the elastic net when p (cid:29) n. Statistica Sinica , (2), 595–611.Koppelman, F. S., & Wen, C.-H. (2000). The paired combinatorial logit model: properties,estimation and application. Transportation Research Part B: Methodological , (2),75–89.Kump, P., Bai, E.-W., sik Chan, K., Eichinger, B., & Li, K. (2012). Variable selectionvia rival (removing irrelevant variables amidst lasso iterations) and its applicationto nuclear material detection. Automatica , (9), 2107–2115.Marwick, K. P., & Koppelman, F. (1990). Proposals for analysis of the market demandfor high speed rail in the quebec/ontario corridor. Submitted to Ontario/QuebecRapid Train Task Force .McFadden, D., & Train, K. (2000). Mixed mnl models for discrete response.

Journal ofApplied Econometrics , (5), 447–470.Nevo, A., Turner, J. L., & Williams, J. W. (2016). Usage-based pricing and demand forresidential broadband. Econometrica , (2), 411–443.R Core Team. (2018). R: A language and environment for statistical computing [Computersoftware manual]. Vienna, Austria.Rossi, P. E., Allenby, G. M., & McCulloch, R. (2012). Bayesian statistics and marketing .John Wiley & Sons.Slawski, M., & Hein, M. (2013). Non-negative least squares for high-dimensional lin-ear models: Consistency and sparse recovery without regularization.

Electron. J.Statist. , , 3004–3056.Takada, M., Suzuki, T., & Fujisawa, H. (2017). Independently interpretable lasso: Anew regularizer for sparse regression with uncorrelated variables. arXiv preprintarXiv:1711.01796 .Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of theRoyal Statistical Society: Series B (Methodological) , (1), 267–288.Train, K. (2008). Em algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling , (1), 40–69.Train, K. (2016). Mixed logit with a ﬂexible mixing distribution. Journal of ChoiceModelling , , 40–53.Wen, C.-H., & Koppelman, F. S. (2001). The generalized nested logit model. Transporta-tion Research Part B: Methodological , (7), 627–641.Wu, L., & Yang, Y. (2014). Nonnegative elastic net and application in index tracking.27 pplied Mathematics and Computation , , 541–552.Wu, L., Yang, Y., & Liu, H. (2014). Nonnegative-lasso and application in index tracking. Computational Statistics and Data Analysis , , 116–126.Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machinelearning research , (Nov), 2541–2563.Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Statistical Methodology) , (2),301–320. 28 ppendixA Supplementary Tables Table A.1: Detailed Summary Statistics of 200 Monte Carlo Runs with Discrete Distribution.

RMISE L µ ρN R S FKRB MSE OneSe LL PredOut FKRB MSE OneSe LL PredOut MSE OneSe LL PredOut 3rd Qu.1000 25 17 0.067 0.04 0.034 0.055 0.045 0.035 0.017 0.014 0.027 0.021 56.05 67.95 13.14 35.03 0.8081000 81 49 0.08 0.046 0.038 0.064 0.055 0.019 0.008 0.007 0.014 0.011 58.90 70.06 20.66 36.96 0.8191000 289 161 0.088 0.057 0.045 0.068 0.06 0.006 0.004 0.003 0.005 0.004 54.87 71.20 30.71 35.75 0.82210000 25 17 0.042 0.026 0.023 0.037 0.032 0.02 0.012 0.011 0.018 0.015 61.15 66.78 13.93 31.02 0.80910000 81 49 0.05 0.031 0.027 0.044 0.037 0.015 0.008 0.007 0.013 0.011 59.31 69.16 13.85 30.90 0.81810000 289 161 0.057 0.037 0.033 0.049 0.043 0.006 0.004 0.003 0.005 0.005 61.90 70.50 19.02 30.14 0.822Pos. % True Pos. % Sign

N R S

FKRB MSE OneSe LL PredOut FKRB MSE OneSe LL PredOut FKRB MSE OneSe LL PredOut1000 25 17 13.3 20.82 22.36 16.23 19.35 68.44 94.53 99.71 80.91 91.15 71.88 77.28 78.14 77.14 78.561000 81 49 15.47 49.58 54.67 31.07 40.88 27.02 82.12 90.4 53.58 69.17 53.1 77.65 81.38 65.97 72.721000 289 161 16.24 103.13 123.8 70.4 84.15 8.62 55.31 66.39 38.02 45.3 48.27 70.24 75.42 62.3 65.6410000 25 17 17.17 19.39 19.73 17.81 18.64 90.32 98.12 99.53 92.94 96.35 86.16 87.86 88.46 87.16 88.4810000 81 49 23.32 44.84 48.26 29.88 37.23 42.29 81.07 87.14 54.5 67.84 61.88 82.24 85.36 68.56 75.6110000 289 161 24.88 97.39 105.84 53.06 69.47 13.53 55.07 59.94 29.93 39.3 50.76 71.96 74.46 59.28 64.04

Note:

FKRB MSE OneSe LL PredOut FKRB MSE OneSe LL PredOut MSE OneSe LL PredOut 3rd Qu.1000 25 17 0.085 0.072 0.055 0.081 0.066 9.65 12.94 17.78 10.38 14.61 21.2 74.01 2.69 34.44 0.8231000 50 33 0.09 0.068 0.058 0.081 0.069 12.57 26.82 32.4 17.05 25.52 48.09 74 6.65 34.74 0.821000 100 67 0.095 0.07 0.061 0.084 0.075 13.65 47.09 54.85 24.59 36.9 58.6 74.37 11.37 29.57 0.8221000 250 163 0.102 0.077 0.063 0.09 0.078 14.2 79.28 103.98 38.05 64.86 50.5 74.52 11.94 31.47 0.82410000 25 17 0.063 0.061 0.057 0.063 0.06 11.65 12.57 14.89 11.78 13.41 18.3 73.94 1.2 29.19 0.82310000 50 33 0.058 0.051 0.047 0.054 0.051 17.52 24.94 28.3 20.03 24.01 48.71 73.94 7.74 32.09 0.8210000 100 67 0.06 0.048 0.043 0.054 0.048 19.87 39.59 46.7 27.61 35.56 51.63 74.04 10.91 32.55 0.82310000 250 163 0.063 0.045 0.04 0.055 0.048 21.21 76.43 87.92 43.45 61.62 59.98 74.68 16.05 36.16 0.824% True Pos. % Sign

N R S

FKRB MSE OneSe LL PredOut FKRB MSE OneSe LL PredOut1000 25 17 49.06 66.35 88.68 53.09 74.59 60.12 70.48 81.48 62.66 751000 50 33 33.39 70.33 84.48 45.65 67.71 52.93 73.19 80.72 60.16 72.341000 100 67 18.01 63.46 74.1 33.56 50.03 43.48 70.95 77.44 53.38 63.151000 250 163 7.37 44.38 58.17 21.11 36.25 38.73 60.96 69.06 47.1 56.1310000 25 17 57.94 63.35 77.24 58.68 68.21 64.2 67.86 77.48 64.68 71.1210000 50 33 47.24 68.59 78.05 54.62 66.05 61.32 74.66 80.42 66.04 73.1610000 100 67 26.57 54.72 64.84 37.76 49.11 48.74 66.73 73.19 55.98 63.2410000 250 163 11.39 43.78 50.56 24.5 35.12 41.17 61.32 65.56 49.37 55.94

Note:

Mode ChoiceIntercept Train − . ∗∗∗ (0 . − . ∗∗∗ (0 . . ∗∗∗ (0 . − . . − . ∗∗∗ (0 . . ∗∗∗ (0 . . ∗ (0 . . ∗∗∗ (0 . . ∗∗∗ (0 . . ∗∗∗ (0 . − . ∗∗∗ (0 . . ∗∗∗ (0 . ∗∗∗ (df = 12) (p = 0.000) Note:

The table reports the mean estimates and standard errors(in brackets) obtained by the mlogit package for the semipara-metric mixed logit model with normally distributed travel time. ∗ p < ∗∗ p < ∗∗∗ p < Elasticities estimated with FKRB:

Car Air TrainCar -0.8992 (-0.8444) (0.6692) (0.129)

Air 0.5895 (0.5943) -1.2267 (-0.5079) (0.1589)

Train -0.1622 (0.0346) (0.1352) -0.6712 (-0.8861)

Elasticities estimated with ENet:

Car Air TrainCar -0.8382 (-0.7731) (0.682) (0.1009)

Air 0.5312 (0.5034) -1.2581 (-0.5704) (0.1339)

Train -0.0887 (0.036) (0.1118) -0.6285 (-0.7691)

Elasticities estimated semiparametrically:

Car Air TrainCar -1.3362 (-1.2584) (0.9975) (0.6846)

Air 0.6194 (0.6093) -1.3744 (-1.4473) (0.2281)

Train 0.2772 (0.1824) (0.1563) -1.6449 (-1.7289)

Note:

The table reports the mean and the median (in brackets) overindividuals’ own- and cross-travel time elasticities for the FKRBestimator, the elastic net estimator, and the semiparametric mixedlogit with normal distribution. The reported numbers correspondto the percentage change of the choice probability of an alternativein a column after a one percent increase in the travel time of analternative in a row.

Estimated Elasticities of FKRB divided by those of ENet:

Car Air TrainCar 1.0728 (1.0922) (0.9813) (1.2783)

Air 1.1099 (1.1804) (0.8905) (1.1864)

Train 1.8291 (0.9611) (1.2098) (1.1521)

Semiparametrically estimated Elasticities divided by those of ENet:

Car Air TrainCar 1.5941 (1.6277) (1.4627) (6.7854)

Air 1.1662 (1.2103) (2.5375) (1.7035)

Train -3.1268 (5.0686) (1.398) (2.2478)

Note:

The table reports the ratio of the mean and the median (in brackets) overindividuals’ own- and cross-travel time elasticities reported in Table A.4 for (1) theFKRB estimator and elastic net estimator and (2) the semiparametric mixed logit withnormal distribution and the elastic net estimator.

B Algorithm to Update Fixed and Random Coeﬃ-cients

The algorithm to update the ﬁxed coeﬃcients uses a modiﬁcation of the ﬂexible gridestimator in Train (2008).Let F denote the set of indices corresponding to the ﬁxed coeﬃcients and M to theset of indices corresponding to the random coeﬃcients. The goal is to maximize withrespect to the ﬁxed coeﬃcients β F and the weights θ = ( θ , . . . , θ R ) corresponding to β M .Therefore, deﬁne the vector which is to be maximized as π = { β F , θ } .Then, rewrite z ri,j more explicitly: z ri,j := z i,j ( β F , β Mr ) = g ( x i,j , β F , β Mr ) = exp (cid:0) x Fi,j β F + x Mi,j β Mr (cid:1) J (cid:80) l =1 exp (cid:0) x Fi,l β F + x Mi,l β Mr (cid:1) . (B.1)The likelihood criterion given in Train (2008) is LL ( β F , β M ) = 1 N N (cid:88) i =1 log (cid:32) R (cid:88) r =1 θ r z ri,y i (cid:33) = 1 N N (cid:88) i =1 log (cid:32) R (cid:88) r =1 θ r z i,y i ( β F , β Mr ) (cid:33) . (B.2)The probability of agent i having coeﬃcients π conditional on her observed choice y i r is h i,r ( π ) = θ r z i,y i ( β F , β Mr ) R (cid:80) r =1 θ r z i,y i ( β F , β Mr ) . (B.3)Based on Equation (B.3) one can derive the iterative EM update scheme which up-dates π t +1 = { β F , θ } t +1 = { β F , ( θ , . . . , θ R ) } t +1 by using a previous estimated trial π t tomaximize π t +1 = arg max π Q (cid:0) π | π t (cid:1) = arg max π N (cid:88) i =1 R (cid:88) r =1 h i,r (cid:0) π t (cid:1) log (cid:0) θ r z i,y i ( β F , β Mr ) (cid:1) . (B.4)Since log (cid:0) θ r z i,j ( β F , β Mr ) (cid:1) = log( θ r ) + log( z i,y i ( β F , β Mr )) one can maximize Equation (B.4)separately for β F and θ . Since we use our generalized estimator given in Equation (9),we only maximize Equation (B.4) over β F : { β F } t +1 = arg max β F N (cid:88) i =1 R (cid:88) r =1 h i,r (cid:0) π t (cid:1) log (cid:0) z i,y i ( β F , β Mr ) (cid:1) . (B.5)Plugging Equation (B.1) into Equation (B.5) gives { β F } t +1 = arg max β F N (cid:88) i =1 R (cid:88) r =1 h i,r (cid:0) π t (cid:1) log  exp (cid:0) x Fi,y i β F + x Mi,y i β Mr (cid:1) J (cid:80) l =1 exp (cid:0) x Fi,l β F + x Mi,l β Mr (cid:1)  (B.6)or equivalently { β F } t +1 = arg max β F N (cid:88) i =1 J (cid:88) j =1 R (cid:88) r =1 y i,j h i,r (cid:0) π t (cid:1) log  exp (cid:0) x Fi,j β F + x Mi,j β Mr (cid:1) J (cid:80) l =1 exp (cid:0) x Fi,l β F + x Mi,l β Mr (cid:1)  . (B.7)This is is the formula of a weighted (standard) logit model where only the coeﬃcients β F are to be maximized and the coeﬃcients β M are treated as constants. The weights h i,r ( π t ), calculated as given in Equation (B.3), do not depend on the product j , but diﬀerfor diﬀerent observations i and grid points r .The whole update scheme is given by the following steps34 eneralized Estimator of Equation (9) with ﬁxed andrandom coeﬃcients

1. Estimate semi-parametric model with all regressors and store the coeﬃcients ofthe ﬁxed parameters β F .2. Choose the grid points β Mr , r = 1 , ..., R .3. Calculate the logit kernel, z i,j ( β F , β Mr ), for each agent at each point.4. Estimate θ using the Generalized Estimator in Equation (9).5. Calculate weights for each agent at each point with π = { β F , θ } as h i,r ( π ) = θ r z i,y i ( β F , β Mr ) R (cid:80) r =1 θ r z i,y i ( β F , β Mr ) .

6. Update the ﬁxed coeﬃcients β F = β F by estimating a weighted standard logitas speciﬁed in Equation (B.7) .7. Repeat steps 3 and 6 until convergence, using the updated coeﬃcients π = π ,where θ = θ is updated in step 4.8. Use these estimated weights (cid:98) θ to calculate the estimated distributionˆ F ( β ) = R (cid:88) r =1 ˆ θ r β r ≤ β ] . Proofs of Results in Section 3

Below, we provide the proofs of the results presented in Section 3. For that purpose, weﬁrst introduce some additional notation.Let A be a m × n matrix and x be a n × (cid:107) A (cid:107) ∞ normrefers to the matrix norm induced by the maximum norm of vectors. Then (cid:107) A (cid:107) ∞ := max || x || ∞ =1 (cid:107) Ax (cid:107) ∞ = max ≤ i ≤ m n (cid:88) j =1 | a ij | denotes the maximum row sum of matrix A . (cid:107) x (cid:107) ∞ refers to the largest absolute elementof vector x .Similarly, (cid:107) A (cid:107) is deﬁned as the matrix norm induced by the euclidean vector norm.That is, (cid:107) A (cid:107) := max || x || =1 (cid:107) Ax (cid:107) , is called spectral norm. It can be shown that (cid:107) A (cid:107) = max ≤ i ≤ n (cid:112) ψ i ( A T A ) where ψ i ( A T A )denotes the eigenvalues of A T A . C.1 Proof of Probability Bound

Lemma 1 uses Hoeﬀding’s inequality to derive a probability bound for sub-Gaussian ran-dom variables. We use the lemma in the proofs of Theorems 1 - 3.

Lemma 1.

Suppose Assumption 1 holds. Then, for γ ≥ P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z T (cid:15) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ γ (cid:19) ≤ R − J exp (cid:18) − N γ (cid:19) . Proof.

Notice that P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z T (cid:15) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ γ (cid:19) = P (cid:32) max ≤ r ≤ R − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 ˜ Z rTi (cid:15) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) (C.1)where (cid:15) i = ( (cid:15) i, , . . . , (cid:15) i,J ) denotes a random vector of J dependent variables such thatEquation (C.1) can equivalently be written as P (cid:32) max ≤ r ≤ R − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 ˜ Z rTi (cid:15) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) = P (cid:32) max ≤ r ≤ R − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 J (cid:88) j =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) = P (cid:32) (cid:91) ≤ r ≤ R − (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 J (cid:88) j =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:41)(cid:33) . (cid:80) Ni =1 (cid:80) Jj =1 ˜ z ri,j (cid:15) i,j ≤ J max ≤ j ≤ J (cid:80) Ni =1 ˜ z ri,j (cid:15) i,j , we obtain the upper bound P (cid:32) (cid:91) ≤ r ≤ R − (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 J (cid:88) j =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:41)(cid:33) ≤ P (cid:32) (cid:91) ≤ r ≤ R − (cid:40) J max ≤ j ≤ J (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:41)(cid:33) ≤ R − (cid:88) r =1 P (cid:32) max ≤ j ≤ J (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) = R − (cid:88) r =1 P (cid:32) (cid:91) ≤ j ≤ J (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:41)(cid:33) ≤ R − (cid:88) r =1 J (cid:88) j =1 P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) ≤ ( R − J max ≤ r ≤ R − ≤ j ≤ J P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) . Recall from Assumption 1 (iii) and Equation (10) that − ≤ ˜ z ri,j ≤ − ≤ (cid:15) i,j ≤

1. Therefore, ξ := (˜ z r ,j (cid:15) ,j , . . . , ˜ z rN,j (cid:15) N,j ) is a vector of independent uniformly boundedrandom variables since for every i = 1 , . . . , N it holds that − ≤ ˜ z ri,j (cid:15) i,j ≤

1. It followsfrom the assumption of conditional exogeneity (Assumption 1 (iv)) that E [ ξ ] = 0. Due tothe boundedness of ξ , its moment generating function satisﬁes E [exp( sξ )] ≤ exp (cid:18) σ s (cid:19) . For any s ∈ R , ξ is said to be sub-Gaussian with variance proxy σ . Thus, usingHoeﬀding’s inequality,max ≤ r ≤ R − ≤ j ≤ J P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) ≤ (cid:18) − N γ σ (cid:19) . (C.2)It follows from ξ ∈ [ − ,

1] that σ = 1. Therefore, P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z T (cid:15) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ γ (cid:19) ≤ ( R − J max ≤ r ≤ R − ≤ j ≤ J P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 ˜ z ri,j (cid:15) i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ γ (cid:33) ≤ R − J exp (cid:18) − N γ (cid:19) . (C.3)37 .2 Proof of Selection Consistency In the following, we provide the proof of Theorem 1. We ﬁrst derive two suﬃcient condi-tions in Lemma 3 that ensure that the estimated weights are equal in sign, i.e. ˆ θ = s θ ∗ .Lemma 4 provides a bound on the probability of the ﬁrst suﬃcient condition and Lemma5 a bound on the probability of the second suﬃcient condition. Finally, we use Lemma 4and Lemma 5 to prove Theorem 1. Both Lemma 4 and Lemma 5 employ Lemma 2. Lemma 2.

It holds that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ √ s ξ S min ( µ ) . Proof.

Using Singular Value Decomposition (SVD), rewrite ˜ Z S as1 √ N J ˜ Z S = ADM T (C.4)where A is a N J × s matrix with orthogonal columns, i.e. A T A = I S . M is a s × s orthogonal matrix satisfying M T M = M M T = I S . D is a diagonal s × s matrix consisting of the singular values of (1 / √ N J ) ˜ Z S on its diagonal. We apply theSVD in Equation (C.4) to rewrite (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − = (cid:0) M D T A T ADM T + µ I S (cid:1) − = (cid:0) M D M T + µM M T (cid:1) − = M (cid:0) D + µ I S (cid:1) − M T (C.5)Therefore, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) M (cid:0) D + µ I S (cid:1) − M T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ √ s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) M (cid:0) D + µ I S (cid:1) − M T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (C.6)= √ s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:0) D + µ I S (cid:1) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = √ s max i ∈ S (cid:112) ψ i = √ s max i ∈ S d ii + µ = √ s i ∈ S d ii + µ = √ s ξ S min ( µ )where ψ i denotes the eigenvalues of (cid:16) ( D + µ I S ) − (cid:17) T ( D + µ I S ) − = ( D + µ I S ) − . Thus, ψ i = ( d ii + µ ) − , as the eigenvalues of a diagonal matrix are its diagonal entries. The (un-restricted) eigenvalues of 1 / ( N J ) ˜ Z TS ˜ Z S + µ I S are deﬁned as ξ S ( µ ). ξ S min ( µ ) corresponds tothe minimal eigenvalue of the matrix. The ﬁrst inequality in Equation (C.6) holds by therelation of the absolute row sum norm and the spectral norm. The transformation from38he ﬁrst to the second line follows from the invariance of the spectral norm to orthogonaltransformations (Gentle, 2007, pp. 130-131). The equality in the second line follows fromthe spectral norm. The last equality in Equation (C.6) holds by the relation of singularvalues to eigenvalues. Lemma 3.

Suﬃcient conditions for ˆ θ = s θ ∗ are M ( V ) := (cid:26) max j ∈ S C V j ≤ λ (cid:27) , M ( U ) := (cid:26) max i ∈ S | U i | < ρ (cid:27) where V := 1 N J ˜ Z TS C (cid:20) ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:18) λι S + µθ ∗ S − N J ˜ Z TS (cid:15) (cid:19) + (cid:15) (cid:21) ,U := (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − N J ˜ Z TS (cid:15),ρ := min i ∈ S (cid:12)(cid:12)(cid:12) (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:18) N J ˜ Z TS ˜ Z S θ ∗ S − λι S (cid:19) (cid:12)(cid:12)(cid:12) . Proof.

The Lagrangian of our adjusted estimator that follows from the transformed opti-mization problem in Equation (9) is L ( θ ) := 12 N J || ˜ y − ˜ Zθ || + λ n (cid:0) ι T θ − (cid:1) + 12 µ θ T θ − ν T θ (C.7)which is minimized with respect to θ , i.e. θ = arg min θ L ( θ ). λ and ν are Lagrangianmultipliers that enforce that the estimated weights sum to one and that they are non-negative respectively. µ > µ = 0,Equation (C.7) corresponds to the objective function of the estimator by Fox et al. (2011).To analyze the support recovery of our estimator, we follow the proof in Jia and Yu(2010). The estimator recovers the true support of the distribution if every estimatedprobability weight ˆ θ has the same sign as the true weights θ ∗ , i.e. ˆ θ = s θ ∗ .This is the case if the Karush-Kuhn-Tucker (KKT) conditions to the optimizationproblem in Equation (C.7) are satisﬁed. The KKT conditions are given by39 N J ˜ Z T (cid:16) ˜ y − ˜ Z ˆ θ (cid:17) + λι + µ ˆ θ − ν = 0 , (C.8) λ (cid:16) ι T ˆ θ − (cid:17) = 0 , (C.9) ν r ˆ θ r = 0 , (C.10) λ ≥ , ν r ≥ ∀ r = 1 , . . . , R − . (C.11)Denote the set of grid points where the true distribution has positive probability massby S = { r ∈ { , . . . , R − }| θ ∗ r > } and let S C = { r ∈ { , . . . , R − }| θ ∗ r = 0 } denote itscomplement set. The corresponding cardinalities are deﬁned as s := | S | and s C := | S C | .We refer to grid points in S as active grid points and to grid points in S C as inactive gridpoints. Splitting ˆ θ , ˜ Z and ν over S and S C into two blocks gives − N J (cid:104) ˜ Z S ˜ Z S C (cid:105) T (cid:32) ˜ y − (cid:104) ˜ Z S ˜ Z S C (cid:105) (cid:32) ˆ θ S ˆ θ S C (cid:33)(cid:33) + λι + µ (cid:32) ˆ θ S ˆ θ S C (cid:33) − (cid:32) ν S ν S C (cid:33) = 0 . Recall that θ ∗ r = 0 for all grid points outside S , so that ˜ Zθ ∗ = ˜ Z S θ ∗ S . In order torecover the active grid points, it must hold that ˆ θ = s θ ∗ which implies ˆ θ S C = 0. The twoconditions that follow from Equation (C.8) require − N J ˜ Z TS (cid:16) ˜ y − ˜ Z S ˆ θ S (cid:17) + λι S + µ ˆ θ S − ν S = 0 , (C.12) − N J ˜ Z TS C (cid:16) ˜ y − ˜ Z S ˆ θ S (cid:17) + λι S C − ν S C = 0 . (C.13)Note that ˆ θ S > θ S C = 0 imply ν r = 0 ∀ r ∈ S, (C.14) ν r ≥ ∀ r (cid:54)∈ S. (C.15)It follows from Condition (C.14) that Condition (C.12) simpliﬁes to − N J ˜ Z TS (cid:16) ˜ y − ˜ Z S ˆ θ S (cid:17) + λι S + µ ˆ θ S = 0 . (C.16)Substituting the true model ˜ y = ˜ Zθ ∗ + (cid:15) , we can re-express the required conditions as40 N J ˜ Z TS ˜ Z S (cid:16) θ ∗ S − ˆ θ S (cid:17) − N J ˜ Z TS (cid:15) + λι S + µ ˆ θ S = 0 (C.17)and − N J ˜ Z TS C ˜ Z S (cid:16) θ ∗ S − ˆ θ S (cid:17) − N J ˜ Z TS C (cid:15) + λι S C − ν S C = 0 . (C.18)Reformulating Condition (C.17) givesˆ θ S = (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:18) N J ˜ Z TS (cid:15) (cid:124) (cid:123)(cid:122) (cid:125) =: U + 1 N J ˜ Z TS ˜ Z S θ ∗ S − λι S (cid:19) > θ S .Plugging Equation (C.19) into Equation (C.18) and using Condition (C.15) yields1 N J ˜ Z TS C (cid:20) ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:18) λι S + µθ ∗ S − N J ˜ Z TS (cid:15) (cid:19) + (cid:15) (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) =: V ≤ λι S C . (C.20) U and V are deﬁned in Equation (C.19) and Equation (C.20), respectively. The vector U consists of s elements U i , i ∈ S , and is constructed from the conditions on the positiveweights, and vector V from the condition on the zero weights. Therefore, V has R − s elements V j , j ∈ S C . Condition (C.20) is equivalent to the event M ( V ) := (cid:26) max j ∈ S C V j ≤ λ (cid:27) . The event M ( U ) deﬁnes a condition for the positive weights M ( U ) := (cid:26) max i ∈ S | U i | < ρ (cid:27) where ρ := min i ∈ S | g i | with g i := (cid:104) (cid:16) NJ ˜ Z TS ˜ Z S + µ I S (cid:17) − (cid:16) NJ ˜ Z TS ˜ Z S θ ∗ S − λι S (cid:17) (cid:105) i .Therefore, the event M ( U ) implies0 < ρ − max i ∈ S | U i | < ρ − | U i | < | g i | − | U i | < | g i + U i | = | ˆ θ S i | = ˆ θ S i , ∀ i ∈ S where g i , U i and ˆ θ S i denote the i th element of the respective vectors g , U and ˆ θ S . Thesecond last equality holds by deﬁnition of g i and U i (see Equation (C.19)) and the lastinequality by the reverse triangle inequality. Because the weights are constrained to benonnegative by the KKT conditions, the absolute value | ˆ θ S i | can be omitted. Conse-quently, M ( U ) is a suﬃcient condition for Equation (C.19) to hold and thus for ˆ θ S > emma 4. Suppose Assumption (1) holds. Suppose further that the NEIC holds. Let M C ( V ) denote the complement of M ( V ) . Then, P (cid:0) M C ( V ) (cid:1) ≤ R − J exp  − N η λ (cid:16) ξ S min ( µ ) s √ s + ξ S min ( µ ) (cid:17)  . Proof. V j is sub-Gaussian with mean V := E ( V ) = 1 N J ˜ Z TS C ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − ( λι S + µθ ∗ S ) . Recall the Nonnegative Elastic Net Irrepresentable Condition (NEIC) ismax r ∈ S C N J ˜ Z TS C ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:16) ι S + µλ θ ∗ S (cid:17) ≤ − η. Therefore, V j ≤ (1 − η ) λ . Let ˜ V := NJ ˜ Z TS C (cid:20) − ˜ Z S (cid:16) NJ ˜ Z TS ˜ Z S + µ I S (cid:17) − NJ ˜ Z TS + I NJ (cid:21) (cid:15) such that V = V + ˜ V .Consequently, it holds for the complement of M ( V ) that λ < max j ∈ S C V j = max j ∈ S C ( V j + ˜ V j ) ≤ max j ∈ S C V j +max j ∈ S C ˜ V j ⇐⇒ max j ∈ S C ˜ V j > λ − max j ∈ S C V j ≥ λ − (1 − η ) λ = ηλ. We use the last inequality to derive an upper bound on M C ( V ):42 (cid:0) M C ( V ) (cid:1) = P (cid:18) max j ∈ S C V j > λ (cid:19) ≤ P (cid:18) max j ∈ S C ˜ V j > ηλ (cid:19) ≤ P (cid:18) max j ∈ S C | ˜ V j | > ηλ (cid:19) = P (cid:32) max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:20) − ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − N J ˜ Z TS + I (cid:21) (cid:15) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) ≤ P (cid:32) max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − N J ˜ Z TS (cid:15) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) = P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS C ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − N J ˜ Z TS (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) ≤ P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS C ˜ Z S (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) . The last inequality holds due the property of the absolute row sum norm that (cid:107)

ABx (cid:107) ∞ ≤(cid:107) A (cid:107) ∞ (cid:107) B (cid:107) ∞ (cid:107) x (cid:107) ∞ for arbitrary matrices A , B and a vector x .By Lemma 2 and (cid:13)(cid:13)(cid:13) NJ ˜ Z TS C ˜ Z S (cid:13)(cid:13)(cid:13) ∞ ≤ s (since every entry in ˜ Z is at most 1 in absolutevalue, and thus the absolute row sum of NJ ˜ Z TS C ˜ Z S at most NJ sN J = s ), we obtain P (cid:0) M C ( V ) (cid:1) ≤ P (cid:18) s √ s ξ S min ( µ ) max j ∈ S (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) + max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:19) ≤ P (cid:18) s √ s ξ S min ( µ ) max j ∈ R (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z T (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) + max j ∈ R (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z T (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:19) = P (cid:18)(cid:16) s √ s ξ S min ( µ ) + 1 (cid:17) max j ∈ R (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z T (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:19) ≤ P (cid:32) max j ∈ R (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z T (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ s √ s ξ S min ( µ ) + 1 (cid:33) . Applying Hoeﬀding’s inequality with γ = ηλ s √ s ξS min( µ ) +1 as outlined in Lemma 1 gives P (cid:0) M C ( V ) (cid:1) ≤ R − J exp  − N (cid:18) ηλ s √ s ξS min( µ ) +1 (cid:19) σ  = 2( R − J exp  − N (cid:16) ηλ ξ S min ( µ ) s √ s + ξ S min ( µ ) (cid:17) σ 

43 2( R − J exp  − N η λ (cid:16) ξ S min ( µ ) s √ s + ξ S min ( µ ) (cid:17)  . Remark 1.

The above calculations can be simpliﬁed to for the baseline estimator, i.e. if µ = 0 . Assume that the NIC condition for LASSO holds (NEIC with µ = 0 ). Additionally,note that it holds for µ ≥ that (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − ˜ Z TS = ˜ Z TS (cid:18) N J ˜ Z S ˜ Z TS + µ I N (cid:19) − . Using the above equality for µ = 0 , we obtain P (cid:18) max j ∈ S C V j > λ (cid:19) ≤ P (cid:18) max j ∈ S C ˜ V j > ηλ (cid:19) ≤ P (cid:18) max j ∈ S C | ˜ V j | > ηλ (cid:19) = P (cid:32) max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:20) − ˜ Z S (cid:18) N J ˜ Z TS ˜ Z S (cid:19) − N J ˜ Z TS + I S (cid:21) (cid:15) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) = P (cid:32) max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:20) − N J ˜ Z S ˜ Z TS (cid:18) N J ˜ Z S ˜ Z TS (cid:19) − + I S (cid:21) (cid:15) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:33) = P (cid:18) max j ∈ S C (cid:12)(cid:12)(cid:12)(cid:12) N J ˜ Z TS C (cid:20) − I S + I S (cid:21) (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) > ηλ (cid:19) = P (0 > ηλ ) = 0 since ηλ > . Lemma 5.

Suppose Assumption (1) holds. Let M C ( U ) denote the complement of M ( U ) .Then, P (cid:0) M C ( U ) (cid:1) ≤ sJ exp (cid:18) − N ξ S min ( µ ) ρ s (cid:19) . Proof.

Because U is sub-Gaussian with mean 0, the probability of the complement of M ( U ) corresponds to 44 (cid:0) M C ( U ) (cid:1) = P (cid:18) max i ∈ S | U i | ≥ ρ (cid:19) = P (cid:32) max i ∈ S (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − N J ˜ Z TS (cid:15) ≥ ρ (cid:33) ≤ P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:18) N J ˜ Z TS ˜ Z S + µ I S (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ ρ (cid:33) . In the next step Lemma 2 is applied again. P (cid:0) M C ( U ) (cid:1) ≤ P (cid:32) √ s ξ S min ( µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ ρ (cid:33) ≤ P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z TS (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ ξ S min ( µ ) 1 √ s ρ (cid:33) ≤ sJ exp  − N (cid:16) ξ S min ( µ ) √ s ρ (cid:17) σ  = 2 sJ exp (cid:18) − N ξ S min ( µ ) ρ sσ (cid:19) = 2 sJ exp (cid:18) − N ξ S min ( µ ) ρ s (cid:19) where the last inequality follows from Hoeﬀding’s inequality in Lemma 1 with γ = ξ S min ( µ ) √ s ρ .We use the above lemmata to prove Theorem 1. Proof of Theorem 1.

It holds that P (cid:16) ˆ θ = s θ (cid:17) ≥ P (cid:0) M ( V ) ∩ M ( U ) (cid:1) since M ( U ) is a suﬃcient condition for the selection of the true weights according toLemma 3.Under the condition that RCDG holds, applying Lemma 4 and Lemma 5 gives lim N →∞ P (cid:0) M C ( V ) (cid:1) =0 and lim N →∞ P (cid:0) M C ( U ) (cid:1) = 0. 45hus, lim N →∞ P (cid:16) ˆ θ = s θ (cid:17) ≥ lim N →∞ P (cid:0) M ( V ) ∩ M ( U ) (cid:1) ≥ lim N →∞ (cid:8) − P (cid:0) M C ( V ) (cid:1) − P (cid:0) M C ( U ) (cid:1)(cid:9) = 1 . C.3 Proof of Error Bounds

In the following, we ﬁrst provide the proof of the error bound of the estimated weightspresented in Theorem 2 and the proof of Corollary 1. We then use the derived bound toproof the error bound of the estimated random coeﬃcients’ distribution in Theorem 3. Inthe proofs of Theorem 2 and Theorem 3, we apply Lemma 1.

Proof of Theorem 2.

Note that if ˆ θ is the solution to the Lagrangian in Equation (C.7), it must hold that itminimizes (C.7), i.e. L (ˆ θ ) ≤ L ( θ ) for any θ . Thus, it holds that L (ˆ θ ) ≤ L ( θ ∗ ) where θ ∗ are the true weights. Applying this to the objective function in (C.7), we obtain12 N J (cid:13)(cid:13)(cid:13) ˜ y − ˜ Z ˆ θ (cid:13)(cid:13)(cid:13) + λ (cid:16) ι T ˆ θ − (cid:17) + µ θ T ˆ θ ≤ N J (cid:13)(cid:13)(cid:13) ˜ y − ˜ Zθ ∗ (cid:13)(cid:13)(cid:13) + λ (cid:0) ι T θ ∗ − (cid:1) + µ θ ∗ T θ ∗ . Substituting the true model ˜ y = ˜ Zθ ∗ + (cid:15) into the above condition and simplifying gives12 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17) + (cid:15) (cid:13)(cid:13)(cid:13) + λ (cid:16) ι T ˆ θ − (cid:17) + µ θ T ˆ θ ≤ N J (cid:107) (cid:15) (cid:107) + λ (cid:0) ι T θ ∗ − (cid:1) + µ θ ∗ T θ ∗ . Taking into account that (cid:13)(cid:13)(cid:13) ˜ Z ( θ ∗ − ˆ θ ) + (cid:15) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ˜ Z ( θ ∗ − ˆ θ ) (cid:13)(cid:13)(cid:13) + (cid:107) (cid:15) (cid:107) + 2 (cid:15) T ( ˜ Z ( θ ∗ − ˆ θ ))we obtain 12 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17)(cid:13)(cid:13)(cid:13) + λ (cid:16) ι T ˆ θ − (cid:17) + µ θ T ˆ θ ≤ N J (cid:15) T ˜ Z (cid:16) ˆ θ − θ ∗ (cid:17) + λ (cid:0) ι T θ ∗ − (cid:1) + µ θ ∗ T θ ∗ . (C.21)46ote that (cid:15) T ˜ Z (ˆ θ − θ ∗ ) ≤ (cid:13)(cid:13)(cid:13) ˜ Z T (cid:15) (cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) .Applying Lemma 1 with γ ≡ γ ( N, δ ) := (cid:115) (cid:16) R − Jδ (cid:17) (cid:30) N we obtain P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) N J ˜ Z T (cid:15) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≥ γ (cid:19) ≤ R − J exp  − N (cid:118)(cid:117)(cid:117)(cid:116) (cid:16) R − Jδ (cid:17) N  (cid:30)  = 2( R − J exp (cid:32) log (cid:32)(cid:18) R − Jδ (cid:19) − (cid:33)(cid:33) = δ. (C.22)In the following, we assume that { (1 / ( N J )) || ˜ Z T (cid:15) || ∞ ≤ γ } , which happens with prob-ability at least 1 − δ according to Equation (C.22). Therefore, the rest of the proof holdswith probability 1 − δ . Using that the event { (1 / ( N J )) || ˜ Z T (cid:15) || ∞ ≤ γ } occurs, we canbound the the right hand side in Equation (C.21) from above by12 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17)(cid:13)(cid:13)(cid:13) + λ (cid:16) ι T ˆ θ − (cid:17) + µ θ T ˆ θ ≤ γ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + λ (cid:0) ι T θ ∗ − (cid:1) + µ θ ∗ T θ ∗ . (C.23)We split ˆ θ , ˜ Z and ν over S and S C into two blocks, whereby S again denotes the set ofrelevant grid points for which the true weights θ ∗ > S C the set of points for which θ ∗ = 0. It follows that ι T θ = ι TS θ S + ι TS C θ S C = || θ S || + || θ S C || and θ T θ = θ TS θ S + θ TS C θ S C . Thus, we can reformulate Equation (C.23) as12

N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17)(cid:13)(cid:13)(cid:13) + λ (cid:16)(cid:13)(cid:13)(cid:13) ˆ θ S (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ˆ θ S C (cid:13)(cid:13)(cid:13) − (cid:17) + µ (cid:16) ˆ θ TS ˆ θ S + θ ∗ TS C θ ∗ S C (cid:17) ≤ γ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + λ (cid:16)(cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) θ ∗ S C (cid:13)(cid:13)(cid:13) − (cid:17) + µ (cid:0) θ ∗ TS θ ∗ + θ ∗ TS C θ ∗ S C (cid:1) . It follows from θ ∗ S C = 0 that || ˆ θ − θ ∗ || = || ˆ θ S − θ ∗ S || + || ˆ θ S C || such that after somesimple manipulations we obtain 472 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17)(cid:13)(cid:13)(cid:13) + λ (cid:16)(cid:13)(cid:13)(cid:13) ˆ θ S (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ˆ θ S C (cid:13)(cid:13)(cid:13) − (cid:17) + µ (cid:16) ˆ θ TS ˆ θ S − θ ∗ TS θ ∗ S + ˆ θ TS C ˆ θ S C (cid:17) ≤ γ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + λ (cid:16)(cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) − (cid:17) . (C.24)Note that the terms in (C.24) that are multiplied by the Langrangian parameter λ drop out. Recall that by the deﬁnition of a linear probability model, || θ ∗ S || − λ ( || ˆ θ S || + || ˆ θ S C || − (cid:80) Rr =1 θ r ≤

1: (1) the estimated proba-bility weights sum to one (the constraint is binding), and (2) the sum of the estimatedprobability weights is less than one (the constraint is not binding). In the former case, || ˆ θ S || + || ˆ θ S C || − λ = 0. Thus,Condition (C.24) simpliﬁes to12 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:16) θ ∗ − ˆ θ (cid:17)(cid:13)(cid:13)(cid:13) + µ (cid:16) ˆ θ TS ˆ θ S − θ ∗ TS θ ∗ S + ˆ θ TS C ˆ θ S C (cid:17) ≤ γ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) . (C.25)It follows from || ˆ θ S − θ ∗ S || = ˆ θ TS ˆ θ S − θ ∗ TS ˆ θ S + θ ∗ TS θ ∗ S thatˆ θ TS ˆ θ S − θ ∗ TS θ ∗ S + ˆ θ TS C ˆ θ S C = (cid:13)(cid:13)(cid:13) ˆ θ S − θ ∗ S (cid:13)(cid:13)(cid:13) + 2 θ ∗ TS ˆ θ S − θ ∗ TS θ ∗ + (cid:13)(cid:13)(cid:13) ˆ θ S C (cid:13)(cid:13)(cid:13) and from θ ∗ S C = 0 that || ˆ θ S C || p = || ˆ θ S C − θ ∗ S C || p for p = 1 , S and S C to || ˆ θ S − θ ∗ S || + || ˆ θ S C || = || ˆ θ − θ ∗ || and || ˆ θ S − θ ∗ S || + || ˆ θ S C || = || ˆ θ − θ ∗ || .This yields ˆ θ TS ˆ θ S − θ ∗ TS θ ∗ S + ˆ θ TS C ˆ θ S C = (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + 2 θ ∗ TS ˆ θ S − θ ∗ TS θ ∗ . Therefore, Equation (C.25) can be equivalently expressed as12

N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:0) θ ∗ − ˆ θ (cid:1)(cid:13)(cid:13)(cid:13) + µ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ≤ γ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + µ (cid:18) θ ∗ TS θ ∗ S − θ ∗ TS ˆ θ S (cid:19) . (C.26)Next, because θ ∗ S > || ˆ θ S − θ ∗ S || ≤ √ s || ˆ θ S − θ ∗ S || it holds that θ ∗ TS (cid:16) θ ∗ S − ˆ θ S (cid:17) ≤ θ ∗ TS (cid:12)(cid:12)(cid:12) ˆ θ S − θ ∗ S (cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ˆ θ S − θ ∗ S (cid:13)(cid:13)(cid:13) ≤ √ s (cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ˆ θ S − θ ∗ S (cid:13)(cid:13)(cid:13) (C.27)where | ˆ θ S − θ ∗ S | takes the absolute value of each element of the vector ˆ θ S − θ ∗ S .Substituting Condition (C.27) back into the error bound in Equation (C.26) and using48he the fact that || ˆ θ − θ ∗ || ≤ (cid:112) ( R − || ˆ θ − θ ∗ || , for γ ≤ kλ , we can rewrite Equation(C.26) as12 N J (cid:13)(cid:13)(cid:13) ˜ Z (cid:0) θ ∗ − ˆ θ (cid:1)(cid:13)(cid:13)(cid:13) + µ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ≤ kλ (cid:112) ( R − (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) + µ √ s (cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ˆ θ S − θ ∗ S (cid:13)(cid:13)(cid:13) . (C.28)Recall that (cid:13)(cid:13)(cid:13) ˜ Z (cid:0) ˆ θ − θ ∗ (cid:1)(cid:13)(cid:13)(cid:13) = (cid:0) ˆ θ − θ ∗ (cid:1) T ˜ Z T ˜ Z (cid:0) ˆ θ − θ ∗ (cid:1) and that the left-hand-side in Condition (C.28) can be summarized as12 (cid:0) ˆ θ − θ ∗ (cid:1) T (cid:20) N J ˜ Z T ˜ Z + µ I (cid:21)(cid:0) ˆ θ − θ ∗ (cid:1) ≤ (cid:18) kλ (cid:112) ( R −

1) + µ √ s (cid:13)(cid:13)(cid:13) θ ∗ S (cid:13)(cid:13)(cid:13) ∞ (cid:19) (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) . (C.29)Recall that ξ min ( µ ) deﬁnes the minimum eigenvalue of the real symmetric matrix1 / ( N J ) ˜ Z T ˜ Z + µ I over the set of vectors H (see Subsection (3.2)).It holds that ξ min ( µ ) > µ > ξ min ≥ µ = 0. In the following, weassume ξ min ( µ ) > || ˆ θ − θ ∗ || / || ˆ θ − θ ∗ || andusing the restricted minimum eigenvalue deﬁnition gives the upper (cid:96) -error bound betweenthe estimated and true probability weights: ξ min ( µ )2 (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ≤ (cid:18) kλ (cid:112) ( R −

1) + µ √ s (cid:107) θ ∗ S (cid:107) ∞ (cid:19) (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ⇒ (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ≤ (cid:112) ( R − kλ + 2 µ √ s (cid:107) θ ∗ S (cid:107) ∞ ξ min ( µ ) . Proof of Corollary 1.

By assumption, it holds that (cid:16)(cid:112) ( R − kλ + µ √ s (cid:107) θ ∗ S (cid:107) ∞ (cid:17) ξ min (0) ≤ (cid:112) ( R − kλξ min (0) + µ (cid:112) ( R − kλ = (cid:112) ( R − kλ ( ξ min (0) + µ ) . Using ξ min ( µ ) = ξ min (0) + µ gives (cid:16)(cid:112) ( R − kλ + µ √ s (cid:107) θ ∗ S (cid:107) ∞ (cid:17) ξ min (0) ≤ (cid:112) ( R − kλξ min ( µ )49hich is equivalent to2 (cid:112) ( R − kλ + 2 µ √ s (cid:107) θ ∗ S (cid:107) ∞ ξ min ( µ ) ≤ (cid:112) ( R − kλξ min (0) . Proof of Theorem 3.

It holds that the diﬀerence of ˆ F ( β ) and F ∗ ( β ) in any point β ∈ R K can be bounded by (cid:12)(cid:12)(cid:12) ˆ F ( β ) − F ∗ ( β ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) R (cid:88) r =1 ˆ θ r β r ≤ β ] − R (cid:88) r =1 θ ∗ r β r ≤ β ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) R (cid:88) r =1 (cid:16) ˆ θ r − θ ∗ r (cid:17) β r ≤ β ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R (cid:88) r =1 (cid:12)(cid:12)(cid:12) ˆ θ r − θ ∗ r (cid:12)(cid:12)(cid:12) = R − (cid:88) r =1 (cid:12)(cid:12)(cid:12) ˆ θ r − θ ∗ r (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) ˆ θ R − θ ∗ R (cid:12)(cid:12)(cid:12) where the last inequality holds by the triangle inequality.Then, (cid:12)(cid:12)(cid:12) ˆ F ( β ) − F ∗ ( β ) (cid:12)(cid:12)(cid:12) ≤ R − (cid:88) r =1 (cid:12)(cid:12)(cid:12) ˆ θ r − θ ∗ r (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) − R − (cid:88) r =1 ˆ θ r − R − (cid:88) r =1 θ ∗ r (cid:12)(cid:12)(cid:12) = R − (cid:88) r =1 (cid:12)(cid:12)(cid:12) ˆ θ r − θ ∗ r (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) R − (cid:88) r =1 (cid:16) θ ∗ r − ˆ θ r (cid:17) (cid:12)(cid:12)(cid:12) ≤ R − (cid:88) r =1 (cid:12)(cid:12)(cid:12) ˆ θ r − θ ∗ r (cid:12)(cid:12)(cid:12) = 2 (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) ≤ (cid:112) ( R − (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) , which, by Theorem 2, can be bounded by | ˆ F ( β ) − F ∗ ( β ) | ≤ (cid:112) ( R −

1) 2 (cid:112) ( R − kλ + 2 µ √ s (cid:107) θ ∗ S (cid:107) ∞ ξ min ( µ ) ..