[PDF] Non-Manipulable Machine Learning: The Incentive Compatibility of Lasso

Abstract

We consider situations where a user feeds her attributes to a machine learning method that tries to predict her best option based on a random sample of other users. The predictor is incentive-compatible if the user has no incentive to misreport her covariates. Focusing on the popular Lasso estimation technique, we borrow tools from high-dimensional statistics to characterize sufficient conditions that ensure that Lasso is incentive compatible in large samples. In particular, we show that incentive compatibility is achieved if the tuning parameter is kept above some threshold. We present simulations that illustrate how this can be done in practice.

Full PDF

aa r X i v : . [ ec on . E M ] J a n Non-Manipulable Machine Learning:The Incentive Compatibility of Lasso

Mehmet Caner ∗ Kfir Eliaz † January 5, 2021

Abstract

We consider situations where a user feeds her attributes to a machine learning method that tries topredict her best option based on a random sample of other users. The predictor is incentive-compatible ifthe user has no incentive to misreport her covariates. Focusing on the popular Lasso estimation technique,we borrow tools from high-dimensional statistics to characterize suﬃcient conditions that ensure thatLasso is incentive compatible in large samples. In particular, we show that incentive compatibility isachieved if the tuning parameter is kept above some threshold. We present simulations that illustratehow this can be done in practice.

Rapid advances in machine learning methods for analyzing big data have given rise to automated systems thatemploy these methods to predict the best ﬁtting outcomes for users based on their personal characteristics.For example, many online platforms try to predict which content - a song, a video, a post, or an article- is the best ﬁt for each user. Medical providers have also begun using machine learning techniques toautomate check-ups and test appointments for patients based on their medical history. Typically, theseautomated systems use data from past users to estimate a model that relates the best ﬁt for a user (such asthe most preferred content or the appropriate medical test) to her characteristics. These estimates are thenapplied to a new user’s characteristics, which she discloses either actively or passively via her past onlinebehavior (which may be reﬂected in his cookies or collected by his browser). Given the growing interactionof users with such automated systems, it is only natural to ask whether a user should truthfully disclose hercharacteristics? ∗ North Carolina State University, Nelson Hall, Department of Economics, NC 27695. Email:[email protected]. We thankSimon Fraser University Department of Economics for the virtual seminar, and Anders Kock and Ran Spiegler for comments.We also thank Columbia University, Department of Economics for the hospitality where this research is initiated, when bothauthors were visitors in 2018-2019. † School of Economics, Tel-Aviv University and Eccles School of Business, the University of Utah. Email: kﬁ[email protected].

1f the information the user discloses is also used to exploit her (say, by providing it to third parties foradvertising or price discrimination), then the user has an obvious reason not to reveal her private information.The question is whether special features of some popular machine learning methods introduce an incentiveto misreport one’s personal characteristics even when this information will be used solely for predicting herbest outcome? This question is of crucial importance: If individuals submit false reports to systems thatrely on these reports for estimation and predictions, then the conclusions drawn from such estimates andpredictions will be wrong and may lead to quite undesirable outcomes (e.g., think of an automated medicalplatform that schedules tests for patients based on false reports on attributes such as smoking, drinking andphysical exercise).To address the above question, we consider a stylized environment where each user i ’s ideal option is alinear function f of her privately observed attributes X i = ( X i, , ..., X i,p ) ′ such that f ( X i ) = X ′ i β . A usermay not know the values of the coeﬃcients β , in which case she would have some (possibly degenerate)prior beliefs over them. A “statistician”, who represents some automated prediction platform has a sampleof the attributes of n users and noisy observations on their ideal options. For instance, suppose f ( X i ) isthe optimal dosage of some medication when taken immediately at the onset of symptoms, conditional onthe patient’s medical history X i , but the statistician observes the dosage that was given after some delay.Similarly, f ( X i ) may be the mix of news and reality shows that a user with attributes X i actually watches,but the statistician observes only self reports by a user who may have forgotten exactly what he watched.The statistician uses her sample to estimate the function f by computing an estimate ˆ β of the truecoeﬃcients β . The statistician wishes to apply these estimates to predict the ideal option of a new user, n + 1 , whose true attributes X n +1 are not observed by the statistician. This new user must decide whatvector of attributes ˜ X n +1 (which may diﬀer from the truth) to report to the statistician. In making thisdecision, the new user takes into account her beliefs about the statistician’s sample (the new user onlyknows the distribution from which the sample is drawn, but she does not observe its realization), and herbeliefs about the true parameters β . The statistician then plugs the new user’s reported attributes into theestimated function and gives the user the option ˜ X ′ n +1 ˆ β, which is the statistician’s estimate of the user’sideal option based her report. The new user’s expected loss from a report ˜ X n +1 is given by the mean squareerror between her expectation of the ideal option X ′ n +1 β and her assigned option ˜ X ′ n +1 ˆ β. The statistician’sestimator is (ex-ante) incentive-compatible , if in expectation, the new user has no incentive to deviate fromtruthful reporting for any prior belief on β : i.e., if for every possible value of β , the expected value of( X ′ n +1 β − ˜ X ′ n +1 ˆ β ) is minimized at the truth ˜ X n +1 = X n +1 , where the expectation is taken with respectto the statistician’s sample and the possible realizations of the user’s attributes.Intuition suggests that an individual cannot beneﬁt from lying to a procedure that is meant to predict the In a recent interview of Brian Christian, the author of

The Alignment Problem , he notes that “computers may one day beable not only to learn our behavior but also intuit our values - ﬁgure out from our actions what it is we’re trying to optimize.... What if an algorithm intuits the ‘wrong’ values, based on its best read of who we currently are but not of who we aspire tobe? Do we really want our computers inferring our values form browser histories? See Shaywitz (2020) for this interview. binary , the statistician has the same (ﬁxed) ﬁnite number of observationson each possible combination of attribute values, and the penalty parameter is ﬁxed and does not adjust tothe sample size. Hence, these papers leave open the following important question: For a general environment,are there conditions ensuring that a penalized regression model is incentive compatible in large samples?Answering this question can potentially allow platforms, like those discussed above, to use machine-learning methods to predict users’ most preferred options without worrying that their data is “contaminated”by non-truthful users. Put bluntly, estimates and predictions made by methods that are not incentive-compatible are possibly unreliable since they may be based on false data.This paper addresses the above open question by focusing on the most popular form of penalized re-gressions - the

Lasso estimator. Borrowing tools from high-dimensional statistics, we establish suﬃcientconditions for incentive compatibility of the Lasso estimator in large samples. We show that to achieve in-centive compatibility, the tuning parameter must be large enough (i.e., it must remain above some thresholdas sample size increases) so as to avoid overﬁtting, which is the main reason why a user may want to lie.This potential to lie implies that the standard way of choosing small enough tuning parameters to ensureconsistency may lead to incentive compatibility violations We provide simulation results that illustrate howthe tuning parameter can be chosen in practice to ensure incentive compatibility. Incentive compatibilitymay therefore be viewed as an additional important property that should be imposed on estimators on top ofconsistency and unbiasedness. We also oﬀer a new technical contribution by extending the oracle inequalitiesof Jankova and van de Geer (2018) to non-sub Gaussian data and providing a diﬀerent proof. The motivation to focus on the Lasso estimator stems from the fact that this estimator is the bench-mark among all high dimensional statistical estimators that predict large scale models when the numberof regressors exceeds the sample size. Following its original proposal by Tibshirani (1996), econometriciansand statisticians have used Lasso-based estimators to push the boundaries of economics and ﬁnance. Oneof the most critical issues facing these Lasso type estimators is post-inference after estimation and modelselection, which require uniformly valid conﬁdence intervals. In a seminal series of papers, Belloni et al.(2012) and Belloni et al. (2014) solved these issues by introducing the idea of “partialling out” the regres-sors. A diﬀerent but complementary approach, via debiasing-desparsifying is proposed by van de Geer et al.(2014). Caner and Kock (2018) extended the debiasing of van de Geer et al. (2014) to heteroskedastic-non-sub-Gaussian data with strong oracle optimality property, thereby proposing a high dimensional estimatorthat is robust to heteroskedasticity, and with uniformly valid conﬁdence intervals. Lasso-based debiasing are Our results can be extended to apply to the debiased lasso estimator, but this involves a diﬀerent proof technique, andhence, is beyond the scope of the current paper. Even though we prove the main theorems with the bounded signal to noise ratio as in Jankova and van de Geer (2018),we can relax this ratio constraint as shown in the Appendix.

None of these papers consider penalized regression methods, and none ofthem characterize conditions guaranteeing incentive compatibility of regression techniques when the statis-tician and users have aligned interests (as is the case in our model).The remainder of the paper is organized as follows. Section 2 considers the model and assumptions. Sec-tion 3 provides new oracle inequalities. Section 4 shows under what conditions lasso is incentive compatible,and Section 5 provides a simulation. Appendix A and B provide proofs of the results when p > n , and p ≤ n ,respectively. We will use the following notational conventions. For any vector ν ∈ R d , let k ν k , k ν k , k ν k ∞ denote its l , l , l ∞ norm respectively, and k ν k be the l norm, which means the total number of nonzero entries. Fora set S ⊆ { , , · · · , d } , let | S | = s be the cardinality of the set. Let ν S be the modiﬁed ν such that we put0 when the index does not belong to S (i.e., say S = { , , } for a 10 × ν , this means that ν ismodiﬁed such that now all elements are zero except elements 1 , , k A k l be the maximum absolutecolumn-sum norm of a matrix of dimensions m × l , i.e., k A k l = max ≤ k ≤ l P mi =1 | A ik | which is also calledinduced l norm of A .Our environment consists of users who are characterized by a set of p personal characteristics. Forinstance, in the context of medical decision making, a characteristic can represent a risk factor (obesity,smoking, etc.). For each user i, these characteristics are modeled as p explanatory variables, X i, , ..., X i,p , drawn from some distribution over a subset of R p . These attributes determine the ideal option for a useraccording to the function f ( X i, , ..., X i,p ) = p X k =1 X i,k β ,k This function applies to all users, who diﬀer only in the values of their characteristics. The realized valuesof ( X i, , ..., X i,p ) are privately observed by user i. A user may or may not know the value of the coeﬃcients4 β , , ..., β ,p ) . In the latter case, she has some (possibly degenerate) prior beliefs over their values.A statistician (representing the automated prediction systems described in the introduction) has private access to a sample of n observations. Each observation i = 1 , ..., n consists of the true attributes X i =( X i, , ..., X i,p ) of user i and a noisy signal y i of that user’s ideal option, y i = X ′ i β + u i , (1)where u i is random noise that is drawn i.i.d from some distribution with zero mean. The X i ’ s are also i.i.d. across i , and exogenous, and will be discussed in detail in Assumption 1 in thenext subsection. β is a p × f . We let S = { j : β ,j = 0 } denote the set of relevant regressors with s being the cardinality of the set S . (i.e., s of the elements of β are nonzero, and the rest are zero). s is a nondecreasing function of n . These facts are known to an“oracle” but not to the statistician (and possibly not to a user).Using her (privately observed) sample, the statistician estimate the function f , or equivalently, sheestimates the coeﬃcients β , , ..., β ,p . When p > n , the least squares estimator is infeasible due to singularityof the empirical Gram matrix. Hence, the statistician uses Lasso, the penalized regression procedure thatassigns costs to including explanatory variables in the regression. Speciﬁcally, she solves the followingminimization problem ˆ β = argmin β ∈ R p P ni =1 ( y i − X ′ i β ) n + 2 λ n k β k , (2)where λ n > λ n = O ( p lnp/n ) (an explicit expression for the sequence λ n is given in equation (A.14) inAppendix A). Given her estimates ˆ β, the statistician must take an action a ∈ R on behalf of a new user, j = n + 1. Thisaction is just the statistician’s prediction of the ideal option of that user. The new user’s payoﬀ from action a is − ( a − f ( X n +1 )) , where f ( X n +1 ) is the true ideal option associated with her personal attributes X n +1 .The distribution of X n +1 may be diﬀerent from that of ( X , ..., X n ), and we do not impose any restrictionon its correlation with the sample distribution.Since the statistician does not observe X n +1 , in order to make her prediction of f ( X n +1 ) , she asks the n + 1 user to report a p × X n +1 , which is interpreted as that user’s attributes. The statisticianthen plugs ˜ X n +1 into her estimated model and chooses the action a = ˜ X n +1 ˆ β. When the n + 1 user decideswhat attribute values to report, she takes into account that she does not observe the statistician’s sample,and hence, does not know the values of the estimated coeﬃcients ˆ β. She only knows the distribution fromwhich the statistician’s sample is drawn, and that given her sample, the statistician chooses ˆ β according to(2). Given this, the user chooses the report ˜ X n +1 that minimize her expected loss E β , ˆ β ( ˜ X n +1 ˆ β − X ′ n +1 β ) , Access to such observations is a necessary condition for any platform that tries to learn about users (say, Netﬂix, Spotify).In the introduction, we gave a couple of examples for such data, which may be obtained from a third party, or from marketingsurveys. We established this rate in Lemma A.2 in Appendix A. β , and hisbeliefs about the estimate ˆ β. Hence, the new user may decide to lie and report ˜ X n +1 = X n +1 . In particular,she may decide to “opt out” and submit a vector of zeros. Our objective is to understand under whatconditions it is in the user’s best interest to be truthful regardless of her prior beliefs on β . We say that an estimator is ex-ante incentive compatible if for any belief over the true model parameters, theuser’s expected payoﬀ from truth-telling is at least as high as her expected payoﬀ from any misreport, wherethe expectation is taken with respect to the user’s realized covariates, and with respect to the statistician’ssample.

Deﬁnition 1.

An estimator is (ex-ante) incentive-compatible if for every ˜ X ′ n +1 and every β , E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] ≥ E [ X ′ n +1 ˆ β − X ′ n +1 β ] . (3) where the expectation E is taken with respect to the possible realizations of X n +1 and the possible realizationsof the statistician’s sample. An alternative notion of incentive-compatibility would be deﬁned ex-post with respect to the realizationof the user’s covariates, such that inequality (3) would be required to hold for every realization of X n +1 , andthe expectation operator would only be with respect to the statistician’s sample. The suﬃcient conditionfor ex-ante incentive-compatibility of the Lasso estimator, which we establish in Section 4, also guaranteesex-post incentive-compatibility. Furthermore, the proof of ex-post incentive compatibility follows from ourproof of ex-ante incentive-compatibility. In light of this, we shall focus on the ex-ante notion henceforth.Incentive compatibility means that the user is unable to perform better by misreporting her personalcharacteristics, regardless of her beliefs over the true model’s parameters in mean squared sense. Howshould we interpret this requirement, given that we do not necessarily want to think of the user as beingsophisticated enough to think in these terms? One interpretation is that lack of incentive compatibility ismerely a normative statement about the user’s welfare - namely, given our model of how the statisticiantakes actions on the user’s behalf, it would be advisable for her to misrepresent her personal characteris-tics. Furthermore, there are opportunities for new ﬁrms to enter and oﬀer the user paid advice for how tomanipulate the procedure - in analogy to the industry of “search engine optimization”. Incentive compati-bility theoretically eliminates the need for such an industry. In the context of the online content provisionstory, some misreporting strategies take the form of “deleting cookies”. This deviation is straightforward toimplement, and the user can check if it makes her better oﬀ in the long run.Note that incentive-compatibility is not a property that can be tested statistically. To see this, supposeeach user is characterized by only a single covariate that is uniformly distributed on { , } . If users are In the case in which the individual’s attributes are collected “passively” from her browsing history, then reporting a vectorof zero attributes can be interpreted as the act of deleting cookies. true attributes of n users. The idea is that the data onthese users is obtained through a diﬀerent process than the way the statistician obtains the data from the n + 1 user. For instance, as mentioned earlier, this data may be obtained from a marketing survey wherethere is no incentive to lie. Alternatively, one may interpret our incentive compatibility requirement as arequirement that truth-telling is a Nash equilibrium among all participants - such that given that everyoneelse is telling the truth, no user has an incentive to lie.

In this subsection, we introduce a number of restrictions on the statistician’s data. To describe theserestrictions, we shall make use of the following notation. Deﬁne an l ball B l ( s ) = {k β k l ≤ s } . DenoteΣ := EX i X ′ i for i = 1 , , · · · , n and let ˆΣ := X ′ X/n be the sample counterpart. Our ﬁrst requirementextends the sub-Gaussian data assumption used in statistics:

Assumption 1. (i). E ( u i | X i ) = 0 , X i , u i are identical and independent across i = 1 , · · · , n, and max ≤ j ≤ p E | X ij | , E | u i | l , l = max (2 k, for all k ≥ are uniformly bounded from above (across n ). (ii).The minimal eigenvalue of Σ is bounded away from zero uniformly in n . Our second set of restrictions applies to the ﬁrst and second moments. These will guarantee the consis-tency of the Lasso estimator, but will not ensure incentive compatibility (suﬃcient conditions for incentivecompatibility will be introduced in Section 4). We start by deﬁning the maximal value of certain crossproducts, which will be related to the behavior of moments in high dimensions in our next assumption. M := max ≤ i ≤ n max ≤ j ≤ p | X ij u i | ,M := max ≤ i ≤ n max ≤ j ≤ p max ≤ l ≤ p | X il X ij − EX il X ij | . Note that M is the maximal covariance between the regressors and errors in a high dimensional context.Roughly speaking, when this covariance is small, it captures exogeneity of the regressors in the sample. M is the maximal variance of the regressors in the sample. With large p and n , these covariance and varianceterms can grow arbitrarily large; hence, we need a condition that restricts the growth rate of their moments.Because we are allowing for heteroskedastic data and unbounded regressors, we need to consider the growthrate of higher-order moments. Assumption 2. (i). √ lnp √ n [ max (( EM ) / , ( EM ) / )] → . Alternatively, we could strengthen Assumption 2 using boundedness of individual moments of

X, u . ii). s ( lnpn ) / → . (iii). k β k = O (1) . Assumption 2(i) and 2(ii) are standard in high dimensional econometrics. 2(i) is used in Chernozhukov et al.(2017) allowing them to apply a concentration inequality, and 2(ii) is a standard sparsity condition. As-sumption 2(iii) ensures that the signal to noise ratio is bounded (see p.2343 of Jankova and van de Geer(2018)). To see this clearly, set σ u := var ( u i ), which is the variance of the errors, and σ u ≥ c >

0, where c is a generic positive constant. Hence, var ( y i ) var ( u i ) = β ′ Σ β σ u + 1 , under E ( u i | X i ) = 0 in Assumption 1 and Σ := EX i X ′ i . But β ′ Σ β σ u + 1 ≥ k β k φ min (Σ) σ u + 1 . where φ min (Σ) ≥ c > var ( y i ) /var ( u i ) ≥ C + 1 >

0, with C beinga positive constant, and deﬁned as C := k β k φ min (Σ) σ u . The empirical implication of this is that only a ﬁxednumber of nonzero coeﬃcients can be constants, and the other nonzero coeﬃcients have to be local to zero.To see this implication, note that k β k = vuut p X j =1 β ,j = s X j ∈ S β ,j = O (1) . But this last point can be achieved, in the case of s growing with n , with s X j ∈ S β ,j = s X j ∈ F β ,j + X j ∈ S − F β ,j ≤ s f C + ( s − f ) C s − f = O (1) , where F := { j : | β ,j | = C } with | F | = f being a ﬁxed number, C is a generic positive constant and F := { j : | β ,j | = C √ s − f } with | F | = s − f . For ease of exposition, we set all coeﬃcients in F and F tobe the same constants C, C/ √ s − f respectively. F contains indices of all local to zero coeﬃcients. Thiscan easily be generalized without aﬀecting our results.In Appendix B we take a more ﬂexible approach compared with Assumption 2(iii). There, we assumethat k β k = O ( √ s ). In this case, all nonzero coeﬃcients can be large (i.e., none of them are local to zero,as in set F above). In other words, there is no index set F as above, but all nonzero coeﬃcients (theirindices) are in the set F above.As p and n grow large, the total number of nonzero coeﬃcients s (also known as the sparsity index ) cangrow arbitrarily large. To guarantee consistency and unbiasedness, it is typically assumed that the productof the sparsity index and the tuning parameter should go to zero. However, this standard condition doesnot guarantee the incentive compatibility of the Lasso estimator as can be seen in the proof of Theorem 3below. 8 New oracle inequalities

Oracle inequalities in high dimensional statistics are upper bounds on prediction and estimation errors. Werequire moment bounds on the Lasso estimator’s error in l norm for our main result. By taking the samplesize to be large, we can show that the upper bound on the mean of higher-order moments of Lasso estimationerrors tend to zero. We then use this asymptotic result to establish the incentive compatibility of the Lassoestimator in large samples. To illustrate this, we note that from the proof of Theorem 3 in Appendix A.2.4,the incentive compatibility constraint is tied to the following expression E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ ( ˜ X n +1 − X n +1 )( ˜ X n +1 − X n +1 ) ′ ˆ β ] (4)+ E [ ˆ β ′ ( ˜ X n +1 − X n +1 ) X ′ n +1 ( ˆ β − β )] (5)+ E [( ˆ β − β ) ′ X n +1 ( ˜ X ′ n +1 − X ′ n +1 ) ˆ β ] . (6)For incentive compatibility to hold in large samples, we need the sum of the right-hand side terms to begreater than or equal to zero. The ﬁrst term on the right side (4) is always non-negative. Hence, if we provethat (5) and (6) converge to zero, we establish asymptotic incentive compatibility. However, the size of termsin (5) and (6) will depend on the mean of higher-order estimation error of Lasso.To bound these error, we prove new oracle inequalities, which are diﬀerent from those that are givenin the literature for k ˆ β − β k . These inequalities will serve an important role in proving our main resultin the next section (Theorem 3). Besides, they are of independent interest as they extend previous resultson sub-Gaussian data to heteroskedastic (conditionally) data sets that are commonly used in econometrics.Our proof technique will also look at a less conservative bound compared with Jankova and van de Geer(2018). Hence, our new inequalities contribute to the literature on high-dimensional econometrics wherethey can be used for proving generalized semiparametric eﬃciency of Lasso-type-estimators (as, e.g., inJankova and van de Geer (2018)).Our ﬁrst result in this section is a k -th moment bound for the l norm of the Lasso bias. A key conceptused in this result is the exception probability for the event F := {A ∩ A } , where A and A are deﬁnedin (A.6) and (A.9), which represent the empirical process-noise, and the eigenvalue condition, respectively.The exception probability is the complement of the event F , and is denoted by P ( F c ). An explicit upperbound for the exception probability is calculated in Lemma A.4. Theorem 1.

Under Assumptions 1-2, if n is suﬃciently large and λ n ≥ P ( F c ) / k s / , then [ E k ˆ β − β k k ] /k = O ( s λ n ) . This result is valid uniformly over B l ( s ) = {k β k l ≤ s } . If we set k = 1 we can learn whether the Lasso estimator is unbiased. By the above Theorem, Assumption9 and (A.15) imply s λ n →

0. Hence, in large samples, we have unbiasedness in the large λ n case. Next, weprovide the k -th moment bound for l norm for the Lasso estimator. Theorem 2.

Under Assumptions 1-2, if n is suﬃciently large n and λ n ≥ P ( F c ) / k /s / , then [ E k ˆ β k k ] /k = O ( s / ) . This result is valid uniformly over B l ( s ) = {k β k l ≤ s } . This is a new result and a simple extension of Theorem 1 above. The rate in Theorem diverges to inﬁnityif s → ∞ as n → ∞ . Our main result, which is new in the literature on penalized regressions, establishes that the Lasso estimatoris incentive compatible for a suﬃciently large sample size. In other words, we show that when n → ∞ E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] ≥ E [ X ′ n +1 ˆ β − X ′ n +1 β ] . for all ˜ X ′ n +1 and for every β , where the expectation is taken with respect to the reporting user’s attributes X ′ n +1 (this is our ex-ante notion of incentive-compatibility that we explained in Section 2 .

1) and with respectto the statistician’s realized sample (since the reporting user does not observe this sample).The next theorem is our main result, which provides suﬃcient conditions for incentive compatibility. Itsproof makes use of the following notation. M := max ≤ j ≤ p | X n +1 ,j | ,M : max ≤ j ≤ p | ˜ X n +1 ,j − X n +1 ,j | . Note that M is nothing more than the absolute magnitude of the misreport on a given variable j by the n + 1 user. Theorem 3.

Under Assumptions 1 and 2, the Lasso estimator is incentive compatible in large samples( n → ∞ ) if the following conditions hold: λ n ≥ P ( F c ) / /s / (7) and s / r lnpn [ EM ] / [ EM ] / → . (8) Furthermore, incentive compatibility is valid uniformly over B l ( s ) = {k β k l ≤ s } . emarks . Theorem 3 establishes that a suﬃcient condition for incentive compatibility is that the tuning parameter λ n needs to be large “enough”. A simple way to choose λ n to satisfy (7) is to use the upper bound of theexception probability λ n := upperbound ( P ( F c ) / ) , in Lemma A.4. The simulations in the next section address the issue of whether such a bound is feasible. The typical concern with Lasso is the consistency of the estimator ( k ˆ β − β k = o p (1)), which can beachieved by making sure that λ n goes to zero at a relatively fast rate (as Lemma A.1 in Appendix A shows,this rate is s λ n → λ n gets too small, the Lasso estimator may admit many nonzero variablesincorrectly (i.e., it creates an overﬁt ). Consequently, when the number of regressors p is very large, theexpectation of the sum of l errors ( E k ˆ β − β k ) can grow arbitrarily large, and incentive compatibility maybe violated. Put diﬀerently, consistency does not imply incentive compatibility in large samples .To see this formally, let { A } denote the event A . Then for all ǫ > E k ˆ β − β k = P ( k ˆ β − β k ≤ ǫ ) {k ˆ β − β k ≤ ǫ } + P ( k ˆ β − β k > ǫ ) {k ˆ β − β k > ǫ }≤ P ( k ˆ β − β k ≤ ǫ ) ǫ + P ( k ˆ β − β k > ǫ ) {k ˆ β − β k > ǫ } , (9) By consistency, the ﬁrst term in (9) will go to zero when ǫ approaches zero. However, the second termmay be large and can dominate all the expectation in large dimensions (this is also discussed on p.2339of Jankova and van de Geer (2018)) In other words, just using the l estimator bound on its own does not imply a bound for the expectation of l error. It requires a non-trivial proof. Consistency does not implyunbiasedness, and hence, it does not imply incentive compatibility.Why is overﬁtting a signiﬁcant issue for incentive compatibility? The intuition is as follows. Suppose thetuning parameter is suﬃciently small so that given the user’s prior on the true coeﬃcients, she expects thatmany irrelevant variables will be included in the estimator. To correct this bias, she can report that thesevariables are equal to zero. The second suﬃcient condition (8) allows the distance between the user’s report ˜ X n +1 and the truth X n +1 to be of any magnitude since EM ≡ E k ˜ X n +1 − X n +1 k ∞ can be arbitrarily large. Since the aboveconditions are suﬃcient but not necessary, it remains an open question whether incentive compatibility canbe achieved with a tuning parameter that is lower than the threshold in (7) without restricting the magnitudeof the deviation between the user’s reported and true attributes. Note that (8) requires stricter sparsity than Assumption 2. If EM = O (1) and EM = O (1), thencondition (8) amounts to s / q lnpn →

0, which is a sparsity requirement still stronger than Assumption2(ii). In addition, if we let EM = O ( lnn ) and EM = O ( lnn ) , then s / q lnpn ( lnn ) / ≤ s / q lnpn → n ≤ p . A natural question that arises is whether condition (7) is compatible with the l norm consistency ofLasso? In other .words, consistency requires a small λ n , but incentive compatibility requires a large λ n , so11re they compatible with each other? When we select a large λ n to satisfy incentive compatibility, we shouldnot sacriﬁce consistency, i.e. we need s λ n →

0. To verify whether this is possible, we can take the lowerbound on the tuning parameter in (7) and see whether we can achieve consistency. Note that s λ n = s P ( F c ) / s / = s / P ( F c ) / , (10)From (A.22) in the Appendix, an upper bound on this exception probability is: P ( F c ) ≤ p C + K [ EM + EM ] nlnp , (11)where C and K are positive constants. With l = 1 ,

2, it therefore follows from (10) and (11) that we need s /p C → , s max l EM l /nlnp → , to have consistency. These two conditions are not unreasonable in the sense that they are consistent with( n, p ) increasing to inﬁnity. Also they are compatible with moments satisfying condition (8) in Theorem 3. Finally, note that λ n = O ( q lnpn ) represents an upper bound in terms of rates for λ n , whereas (7)represents a lower bound. We can then take for a positive constant C > C √ lnp √ n ≥ λ n ≥ P ( F c ) / s / . The question is, are there suitable combinations of n and p that satisfy these inequalities? By using algebraand the upper bound for exception probability (A.22), we obtain the requirement that, Cs / ≥ " np C + K [ EM + EM ] nlnp / √ n √ lnp , which is plausible for p > n and large n since the left hand side may diverge and the right side may go tozero. This may be the case for example when p is exponential in n . This section has two objectives. First, it illustrates how in practice the tuning parameter can be chosento ensure incentive compatibility of the Lasso estimator. Second, it demonstrates that by appropriatelychoosing the tuning parameter (in line with the conditions in Theorem 3), incentive compatibility is satisﬁedregardless of the magnitude of the “lie” (i.e., the distance between the true and reported attributes)We provide a simple simulation setup. We model y i = X ′ i β + u i , where β = (1 , ′ p − s , ′ s − ) ′ , 0 p − s is a p − s column vector of all zero elements, and 1 s − is a s − s represent the sparsity of the above model and set s = 20.12e present three diﬀerent simulations. In Design 1, we choose X i to be a p × t distributionwith ﬁve degrees of freedom. The new n + 1 user has the same distribution for her attributes but isindependent of the ﬁrst n users. The errors u i are also chosen from a t distribution but independentlyof the regressors. Tables 1-3 displays these results. For the second simulation (Design 2), we only changethe distribution of the attributes for the n + 1 user to a t distribution with three degrees of freedom. InDesign 2, we keep the same distribution for the errors and the same attribute distribution for the ﬁrst n users from Design 1. The results are displayed in Tables 4-6. In the third simulation (Design 3), we changeonly the following in Design 2. We introduce a multivariate normal distribution for the attributes of users i = 1 , · · · , n , such that the covariance between the j and m -th random variables are governed byΣ j,m = 0 . | j − m | , for j = 1 , . . . , p and m = 1 , · · · , p . Thus, the correlation between the adjacent random variables is 0.5, andthis declines when the random variables are further apart. This Toeplitz type structure is commonly usedin the high dimensional literature (see Caner and Kock (2018)). In Design 3, we keep the distribution of the n + 1 user and the errors from Design 2. The results are presented in Tables 6-9.We aim to demonstrate that with a “large” tuning parameter as in Theorem 3, incentive compatibilitycan be achieved when the sample size n is large enough. As mentioned in the previous section, one possiblechoice of a tuning parameter that satisﬁes Theorem 3 is the upper bound on the exception probability, λ n ≥ upperbound ( P ( F c ) / ) . The issue is to make the exception probability, P ( F c ) operational and usable. Note that an upper boundon this probability is (with positive constants C > , C > , K > P ( F c ) ≤ p C + K [ EM + EM ] nlnp ≤ p C + C ( lnp ) , (12)by observing that for l = 1 , K max l EM l nlnp = " K / p max l EM l √ n √ lnp = " K / p max l EM l √ lnp √ n ( 1 lnp ) ≤ C ( lnp ) , where we use Assumption 2(i). Hence, we can write the upper bound of the exception probability by using p ≥ p C + C ( lnp ) ≤ C ( lnp ) . The tuning parameter is as follows 13 n := [2 + C ( lnp ) ] / , (13)where C can start from a small positive value and stop at a large positive value, and we select the optimal C and λ n according to the Generalized Information Criterion as in Caner and Kock (2018), which givesconsistent model selection with weighted Lasso choices in the least squares framework. Therefore, our choiceof λ n is above a lower bound, which prevents overﬁtting (this is the novel insight of Theorem 3). On theother hand, to prevent a very large λ n and ensure consistency of Lasso, the lower bound inversely dependson p .Deﬁne λ ∗ n := argmin λ n ∈ Λ [ ln (ˆ σ ( λ n )) + ˆ s ( λ n ) n ln ( n ) ln ( ln ( p ))] , where ˆ s ( λ n ) is the number of nonzero elements in the Lasso estimator, given a choice of λ n in a grid Λ, andˆ σ ( λ n ) is the mean squared residuals from the Lasso regression, given a choice of λ n in a grid Λ. We formΛ as follows: We take C in a grid of values [2 + C ( lnp ) ] as in (13). Let C := [0 . , . , , , , , , λ n depending on C . The number of iterations is 1,000.The “Report” column in Tables 1-3 display E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] as the mean squared error from a falsereport by the user. “Truth” refers to E [ X ′ n +1 ( ˆ β − β )] . The diﬀerence between ˜ X n +1 − X n +1 is kept atthree levels: 5, 2 and 0.2 (for all p variables), which represent large, medium, and small deviations from thetruth. We have p = 100 , , , and for each p level we analyze n = 100 , , p = 500 and n = 400 . In Table 1, which correspondsto a large magnitude of a lie, the user’s disutility from reporting the truth is 24.36, while the disutilityfrom lying is 334.01. Hence, the n + 1 user prefers to be truthful. In Table 2, for a medium magnitudeof lies, truth-telling induces a disutility of 25.03, while lying induces a higher disutility of 71.63. Finally,in Table 3, if the lie is “close” to the truth, the disutility from truth-telling is 24.36, while the disutilityfrom lying is 24.73. Similar comparisons hold in the tables’ remaining cells, suggesting that that Lasso’sincentive compatibility is achieved. In Tables 4-9, the same message from Tables 1-3 carries over: Lasso isincentive-compatible with our tuning parameter choice. Tables 6 and 9 show that with a minor lie, Lasso isstill incentive-compatible and the diﬀerence between the truth and lie in MSE sense is larger compared withTable 3 of Design 1. The growing reliance on machine learning in automating decisions previously made by people raises thequestion of how people would interact with these automated systems. In particular, would people have an14able 1: Design 1-Incentive Compatibility Scenarios: Diﬀerence 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 25.93 385.25 23.61 338.91 25.89 321.55 p = 250 27.75 353.90 25.25 353.08 26.41 337.91 p = 500 26.71 305.53 25.55 333.97 24.36 334.01Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 2: Design 1-Incentive Compatibility Scenarios: Diﬀerence 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 26.12 80.11 25.88 76.10 23.56 71.87 p = 250 25.88 75.64 24.25 77.89 25.20 72.40 p = 500 28.02 72.21 25.20 76.73 25.03 71.63Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 3: Design 1-Incentive Compatibility Scenarios: Diﬀerence 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 25.60 25.90 25.73 25.93 24.61 25.03 p = 250 26.06 26.87 23.90 24.27 25.43 25.84 p = 500 28.34 28.98 24.94 25.62 24.36 24.73Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 4: Design 2-Incentive Compatibility Scenarios: Diﬀerence 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 45.62 392.71 39.25 368.91 40.22 345.82 p = 250 47.11 374.02 45.55 355.38 46.59 342.67 p = 500 45.08 326.11 43.77 370.69 47.70 350.90Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.15able 5: Design 2-Incentive Compatibility Scenarios: Diﬀerence 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 48.63 97.92 59.58 94.09 58.65 85.68 p = 250 46.19 89.30 44.12 98.91 43.25 95.22 p = 500 55.57 105.84 41.61 90.43 40.95 92.95Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 6: Design 2-Incentive Compatibility Scenarios: Diﬀerence 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 50.48 51.03 44.02 45.72 44.42 45.07 p = 250 45.98 47.33 48.88 49.47 42.56 43.68 p = 500 48.84 48.92 44.08 44.79 41.45 42.18Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 7: Design 3-Incentive Compatibility Scenarios: Diﬀerence 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 26.22 2766.29 18.84 3004.57 16.77 3136.86 p = 250 24.77 2725.52 21.41 3000.16 20.35 3130.54 p = 500 34.32 2722.20 19.68 2981.42 17.98 3139.44Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.Table 8: Design 3-Incentive Compatibility Scenarios: Diﬀerence 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 27.43 464.41 19.82 493.65 17.16 515.18 p = 250 23.49 454.13 19.76 488.40 16.14 507.25 p = 500 25.25 448.99 38.48 503.34 14.56 509.49Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.16able 9: Design 3-Incentive Compatibility Scenarios: Diﬀerence 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 24.16 28.51 16.77 20.07 14.35 19.63 p = 250 25.80 29.93 19.41 25.04 16.14 22.63 p = 500 27.11 30.47 19.97 25.11 18.24 22.52Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Deﬁnition. Smaller values of these average squared errors are desirable.incentive to act strategically in order to manipulate such automated systems? This strategic interactionwill become particularly important when these automated systems start playing a more prominent role inmedical decision-making or even in driving.This paper takes only a small preliminary step towards addressing this question by studying whether auser would want to lie to an automated system that uses Lasso to predict that user’s ideal outcome basedon her reported attributes. Our main contribution is showing that truthful reporting can be ensured byappropriately adjusting the tuning parameter to be larger than what is required for consistency. Our resultis also signiﬁcant from a pure econometrics point of view: Just concentrating on oracle inequalities and post-selection inference can lead to a small tuning parameter, which in turn, can lead to model overﬁtting, whichthen introduces an incentive to misreport. If users have an incentive to provide false input to algorithms usedfor estimation and prediction, then it is no longer clear that one can rely on the output of these algorithms.17n the next part, Appendix A considers the proofs when p > n , and Appendix B considers the case p ≤ n ,and relaxing Assumption 2(iii). A Appendix A

A.1 Notation

In this section, we show some results that will help us in proofs. Deﬁne random vector of variables F i :=( F i , · · · , F ij , · · · , F ip ) ′ . Also deﬁne σ F := n ( max ≤ j ≤ p varF ij ), and M F := max ≤ i ≤ n max ≤ j ≤ p | F ij − EF ij | .Note that ˆ µ j := n − P ni =1 F ij , and µ j := EF ij . A.2 Maximal Inequalities

We use two assumptions that will provide us maximal inequalities.

Assumption A.1 . Assume F i are iid random vectors across i = 1 , , · · · , n with max ≤ j ≤ p varF ij bounded away from inﬁnity uniformly in n . Assumption A.2 . Assume p EM F √ lnp √ n → . We use the following maximal inequality. With Assumption A.1, Lemma E.2(ii) of Chernozhukov et al.(2017) is: (see (A.2) of Caner and Kock (2019)) P (cid:20) max ≤ j ≤ p | ˆ µ j − µ j | ≥ E max ≤ j ≤ p | ˆ µ j − µ j | + tn (cid:21) ≤ exp ( − t / σ F ) + K EM F t , (A.1)for a constant K >

0. With Assumptions A.1-A.2 here, Caner and Kock (2019) or Lemma E.1 of Chernozhukov et al.(2017) provides E max ≤ j ≤ p | ˆ µ j − µ j | ≤ K [ √ lnp √ n + p EM F lnpn ]= O ( √ lnp √ n ) . (A.2)Deﬁne the sequence κ n = lnp . Set t = t n = ( nκ n ) / to have (A.1) as P (cid:20) max ≤ j ≤ p | ˆ µ j − µ j | ≥ E max ≤ j ≤ p | ˆ µ j − µ j | + √ κ n √ n (cid:21) ≤ exp ( − C κ n ) + K EM F nκ n = 1 p C + KEM F nlnp (A.3)where C >

0, is a positive constant.Now combine (A.2) with (A.3) to have 18 ( max ≤ j ≤ p | ˆ µ j − µ j | ≥ K [ √ lnp √ n + ( EM F ) / lnpn ] + √ lnp √ n ) ≤ p C + KEM F n ( lnp ) = o (1) , (A.4)by Assumptions A1-A.2. This shows also that, since EM F is nondecreasing in n max ≤ j ≤ p | ˆ µ j − µ j | = O p ( p lnp/ √ n ) . (A.5) A.2.1 Events

Before the assumptions, we need to deﬁne events that will be helpful. The ﬁrst event is: A = ( (cid:13)(cid:13)(cid:13)(cid:13) u ′ Xn (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ λ n ) , (A.6)which controls the noise. This is the maximal correlation between regressors and errors. We want this tobe bounded with probability approaching one, and this upper bound, λ n , itself is converging to zero in ourproofs. We show that in Lemma A.2. So in large samples, this proof technique amounts to veriﬁcation ofexogeneity of regressors. This is standard in high dimensional econometrics, for a recent analysis see LemmaA.4 of Caner and Kock (2018).We start with deﬁning ﬁrst population counterparts of restricted eigenvalue conditions and then showthe empirical version also. These are standard in high dimensional econometrics and statistics and can beseen from Assumption 1 of Caner and Kock (2018).We deﬁne the population adaptive restricted eigenvalue of Σ φ ( s ) = min ( δ ′ Σ δ k δ S k : δ ∈ R p − { } , k δ S c k ≤ √ s k δ S k , | S | ≤ s ) . (A.7)Note that if Σ = EX i X ′ i has full rank, the population adaptive restricted eigenvalue being positive is satisﬁedby Assumption 1. Also instead of minimizing all over R p , we minimize vectors that satisfy k δ cS k ≤ k δ S k .Even in the cases that Σ does not have full rank, it is possible that minimal adaptive restricted eigenvaluecondition is satisﬁed due to optimization over a restricted set. The parameter δ will be related to structuralparameter β in the proofs.First deﬁne the empirical adaptive restricted eigenvalue condition, which is empirical counterpart of thepopulation version in Assumption 1:ˆ φ ( s ) = min ( δ ′ ˆΣ δ k δ S k : δ ∈ R p − { } , k δ S c k ≤ √ s k δ S k , | S | ≤ s ) . (A.8)We are interested in behavior of the minimal empirical adaptive restricted eigenvalue condition evaluatedfor set S at cardinality s . The second event is: A = n ˆ φ ( s ) ≥ φ ( s ) / o . (A.9)19mpirical adaptive restricted eigenvalue condition is needed since in case of p > n , X ′ X is singular and theminimal eigenvalue of X ′ X is zero. Empirical adaptive eigenvalue is over a restricted set which we proveto be positive, with probability approaching one, in Lemma A.3. This is also standard in high dimensionaleconometrics, see Lemma A.6 of Caner and Kock (2018). Set F = A ∩ A , and the complement event as F c . A.2.2 Proofs of Lemmata

The following four Lemmata are the intermediate results that are used for Theorems.

Lemma A.1.

Under the joint event F := {A ∩ A } we have k ˆ β − β k ≤ λ n s φ ( s ) . This is also valid uniformly over B l ( s ) = {k β k l ≤ s } . Proof of Lemma A.1 . Using ˆ β deﬁnition k Y − X ˆ β k n + 2 λ n p X j =1 | ˆ β j | ≤ k Y − Xβ k n + 2 λ n p X j =1 | β ,j | . Use the model Y = Xβ + u on the ﬁrst left side term as well as the ﬁrst right side term to simplify theinequality above combining with Holder’s Inequality k X ( ˆ β − β ) k n + 2 λ n p X j =1 | ˆ β j | ≤ (cid:12)(cid:12)(cid:12)(cid:12) u ′ Xn ( ˆ β − β ) (cid:12)(cid:12)(cid:12)(cid:12) + 2 λ n p X j =1 | β ,j |≤ k u ′ Xn k ∞ k ˆ β − β k + 2 λ n p X j =1 | β ,j | On the right side assuming we are on the event A k u ′ Xn k ∞ k ˆ β − β k ≤ λ n k ˆ β − β k . So we have k X ( ˆ β − β ) k n + 2 λ n p X j =1 | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n p X j =1 | β ,j | . Use k ˆ β k = k ˆ β S k + k ˆ β S c k on the second term for the left side of the inequality immediately above k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n p X j =1 | β ,j | − λ n X j ∈ S | ˆ β j | . By assumption of sparsity P j ∈ S c | β ,j | = 0, and using the reverse triangle inequality we have k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n X j ∈ S | ˆ β j − β ,j | . k ˆ β − β k = k ˆ β S − β ,S k + k ˆ β S c k for the ﬁrst term on the right side of the inequalityimmediately above k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n X j ∈ S | ˆ β j − β ,j | . Use k ˆ β S − β ,S k ≤ √ s k ˆ β − β ,S k above on the right side to have k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n √ s k ˆ β S − β ,S k . (A.10)Ignoring the ﬁrst term on the left of (A.10), (A.10) shows that we satisfy the restricted set condition inempirical adaptive restricted eigenvalue condition, so we have k ˆ β S c k ≤ √ s k ˆ β S − β ,S k . Using δ = ˆ β − β in the empirical adaptive restricted eigenvalue condition (A.8) in (A.10) k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n √ s k X ′ ( ˆ β − β ) k n ˆ φ ˆΣ ( s ) . Then use 3 uv ≤ u / v / u = λ n √ s / ˆ φ ˆΣ ( s ), v = k X ( ˆ β − β ) k n to get k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ k X ( ˆ β − β ) k n λ n s ˆ φ ( s ) . Simplify above k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n s ˆ φ ( s ) . Use the event A we get the following k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n s φ ( s ) . This implies the oracle inequality k X ( ˆ β − β ) k n ≤ λ n s φ ( s ) . (A.11)To get to the l bound ignore the ﬁrst term in (A.10) and add both sides λ n k ˆ β S − β ,S k to have λ n X j ∈ S c | ˆ β j | + λ n X j ∈ S | ˆ β j − β ,j | = λ n k ˆ β − β k ≤ λ n k ˆ β S − β ,S k + 3 λ n √ s k ˆ β S − β ,S k , by seeing also P j ∈ S c | β ,j | = 0. Now use the norm inequality k ˆ β S − β ,S k ≤ √ s k ˆ β S − β ,S k to have λ n k ˆ β − β k ≤ λ n √ s k ˆ β S − β ,S k . Use the empirical adaptive restricted eigenvalue condition with δ = ˆ β − β k ˆ β − β k ≤ √ s k X ( ˆ β − β ) k n ˆ φ ˆΣ ( s ) . A to have k ˆ β − β k ≤ √ s " √ λ n √ s φ Σ ( s ) φ ˆΣ ( s ) ≤ λ n s φ ( s ) . (A.12)Note that uniformity over B l ( s ) follows since the upper bound in (A.12) depends on β only through s . Q.E.DLemma A.2. (i). Under Assumption 1, and since κ n = lnpP ( A ) ≥ − exp ( − C κ n ) − KEM ( nκ n ) = 1 − p C − KEM nlnp (ii). Under added Assumption 2 to Assumption 1, P ( A ) → .(iii). Under added Assumption 2 to Assumption 1, λ n = O ( p lnp/n ) . Proof of Lemma A.2 . (i). Establish the probability bound on A via Assumption 1, using (A.3)(A.4)with F i = X i u i there and κ n = lnp , we have P ( A ) ≥ − exp ( − C κ n ) − K EM ( nκ n ) = 1 − p C − KEM nlnp , (A.13)with λ n = K [ r lnpn + p EM lnpn ] + r lnpn . (A.14)(ii). By Assumption 2, we have the proof.(iii). By Assumption 2, we have λ n = O ( p lnp/n ) . (A.15) Q.E.D.Lemma A.3.

Under Assumptions 1, 2, κ n = lnpP ( A ) ≥ − exp ( − C κ n ) − KEM ( nκ n ) = 1 − p C − KEM nlnp = 1 − o (1) . Proof of Lemma A.3 . Start with (cid:12)(cid:12)(cid:12)(cid:12) δ ′ X ′ Xn δ (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) δ ′ ( X ′ Xn − Σ + Σ) δ (cid:12)(cid:12)(cid:12)(cid:12) ≥ | δ ′ Σ δ | − | δ ′ ( ˆΣ − Σ) δ | . (A.16)The second term on the right side of (A.16) can be bounded by repeated application of Holders inequality | δ ′ ( ˆΣ − Σ) δ | ≤ k δ k k ˆΣ − Σ k ∞ . So (A.16) becomes | δ ′ ˆΣ δ | ≥ | δ ′ Σ δ | − k δ k k ˆΣ − Σ k ∞ . (A.17)22ow we digress a bit to simplify (A.17). Note that we have the restriction set deﬁnition k δ S c k ≤ √ s k δ S k , where we add k δ S k to both sides k δ k ≤ √ s k δ S k + k δ S k ≤ √ s k δ S k + √ s k δ S k = 4 √ s k δ S k , where we used the norm inequality k δ S k ≤ √ s k δ S k in the second inequality above. So we get k δ k k δ S k ≤ s . Now divide (A.17) by k δ S k > | δ ′ ˆΣ δ |k δ S k ≥ | δ ′ Σ δ |k δ S k − s k ˆΣ − Σ k ∞ . Minimize over δ on the both sides ˆ φ ( s ) ≥ φ ( s ) − s k ˆΣ − Σ k ∞ . (A.18)So if we can prove that with probability approaching one, 16 s k ˆΣ − Σ k ∞ ≤ φ ( s ) /

2, that will imply ofˆ φ ( s ) ≥ φ ( s ) / ǫ n = 16 s t , where t = K [ r lnp n + p EM lnp n ] + r lnpn . (A.19)By (A.3)(A.4), via Assumption 1 P [16 s k ˆΣ − Σ k ∞ > ǫ n ] = P [ k ˆΣ − Σ k ∞ > t ] ≤ exp ( − C lnp ) + KEM ( nlnp ) → , (A.20)where we use Assumption 2 for the probability tail converging to zero. Also see that by Assumption 2, ǫ n → s p lnp/n →

0. So we get, with probability approaching one, 16 s k ˆΣ − Σ k ∞ ≤ ǫ n ≤ φ ( s ) / P [ ˆ φ ( s ) ≥ φ ( s ) / ≥ − exp ( − C κ n ) − KEM ( nκ n )= 1 − p C − KEM nlnp = 1 − o (1) . (A.21) Q.E.D.

We need the following Lemma for the exception set F c := { A ∩ A } c upper bound probability.23 emma A.4. Under Assumptions 1, 2, with κ n = lnpP ( F c ) ≤ exp ( − C κ n ) + K [ EM + EM ]( nκ n )= 2 p C + K ( EM + EM ) nlnp = o (1) . Proof of Lemma A.4 .Now we provide an upper bound for the probability P ( F c ) in our case under Assumptions 1, 2, by usingLemmata A.2-A.3 P ( F c ) = P ( A ∩ A ) c = P ( A c ∪ A c ) ≤ P ( A c ) + P ( A c ) ≤ exp ( − C κ n ) + K [ EM + EM ]( nκ n )= 2 p C + K [ EM + EM ] nlnp → . (A.22) Q.E.D.A.2.3 New Oracle Inequality Proofs

We start with proof of Theorems 1-2, where they are used as inputs to proof of Theorem 3. Theorems 1-2consider the new oracle inequalities.

Proof of Theorem 1 . We proceed in several steps.Denote the joint event F = {A ∩ A } . F c is F ’s complement. See that E k ˆ β − β k k = E k ˆ β − β k k {F} + E k ˆ β − β k k {F c } . (A.23)We want to form rates for the right side terms in (A.23).Step 1. Note that by Lemma A.1, the ﬁrst term on the right side of (A.23) is: E k ˆ β − β k k {F} = O ( s k λ kn ) . (A.24)Now we want to evaluate the second term on the right side of (A.23). But before that we need thefollowing intermediate step.Step 2. Use Nemirowski’s moment inequality, Lemma 14.24 in Buhlmann and van de Geer (2011), withfor all k ≥

1, for the ﬁrst inequality, and for the second inequality by Loeve’s c r inequality, and for theequality we use u i being iid, also the deﬁnition of σ := Eu i , E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P ni =1 u i − σ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ [8 ln (2)] k/ E " P ni =1 ( u i ) n k/ ≤ Cn ( k/ − n k n X i =1 Eu ki = C [ Eu ki ] n − k/ = O ( n − k/ ) = o (1) ,

24y Assumption 1. Before the next result we provide the inequality, | x + y | k ≤ k − ( | x | k + | y | k ) , (A.25)for k ≥

1, and x, y being generic scalars, and σ being bounded above by Assumption 1 and using (A.25) E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 u i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k = E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( u i − σ ) + σ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ k −  E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( u i − σ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k + ( σ ) k  = O ( n − k/ ) + O (1) = O (1) . (A.26)Step 3. Now we have to form another l expectation bound for lasso that will be key to the sec-ond right side term analysis in (A.23). This step 3 modiﬁes the proof of Theorem 1, supplement, p.4 ofJankova and van de Geer (2018). We extend their proof to non-sub-Gaussian case and show that theirbound is very conservative, and we provide a new less conservative bound. Start with the deﬁnition of lasso. k Y − X ˆ β k n + 2 λ n k ˆ β k ≤ k Y − Xβ k n + 2 λ n k β k . Ignore the ﬁrst term and use the model u = Y − Xβ to have k ˆ β k ≤ k u k n λ n + k β k . Then use triangle inequality and then the inequality above k ˆ β − β k ≤ k ˆ β k + k β k ≤ k u k n λ n + 2 k β k . (A.27)Next taking the k th moment of the sampling error in l norm, and using (A.25) by taking expectationsthere for the second inequality below E k ˆ β − β k k ≤ E " k u k n λ n + 2 k β k k ≤ k − { E " k u k n λ n k + 2 k β k k } (A.28)We use the assumption k β k = O (1) to have k β k k ≤ ( √ s k β k ) k = O ( s k/ ) . (A.29)Then use the last equation with (A.26) in (A.28) to have E " k u k n λ n k + 2 k β k k = O ( λ − kn ) + O ( s k/ ) = O (max( s k/ , λ − kn )) . (A.30)Note that proof of Jankova and van de Geer (2018) use s k/ λ − kn but this is very conservative upper boundsince both two terms in multiplication is diverging with n . But a better bound is max( s k/ , λ − kn ).25e get the rough bound for expectation using (A.30) in (A.28) E k ˆ β − β k k = O (max( s k/ , λ − kn )) . (A.31)Note that rates in (A.24)(A.31) are diﬀerent and the last rate in this step is a rough bound which will behelpful in the next step. The rate in (A.31) is diverging to inﬁnity.Step 4. Rewrite the expectation using event F , F c . E k ˆ β − β k k = E k ˆ β − β k k {F} + E k ˆ β − β k k {F c } ≤ O ( s k λ kn ) + q E k ˆ β − β k k q E {F c } = O ( s k λ kn ) + O (max( s k/ , λ − kn )) p P ( F c ) (A.32)where we use (A.24) and Cauchy-Schwartz inequality for the ﬁrst inequality, and the second equality is by(A.31).First possibility of a rate is (jointly holding): s k λ kn ≥ s k/ P ( F c ) / . (A.33) s k λ kn ≥ λ − kn P ( F c ) / . (A.34)By (A.32)(A.33)(A.34) E k ˆ β − β k k = O ( s k λ kn ) . We can simplify further (A.33)(A.34), respectively they are λ n ≥ P ( F c ) / k /s / , (A.35)and λ n ≥ P ( F c ) / k /s / . (A.36)Since P ( F c ) / k ≥ P ( F c ) / k , k ≥ λ n ≥ P ( F c ) / k /s / then E k ˆ β − β k k = O ( s k λ kn ) . (A.37)Of course there is another possibility-subcase that provides the rate in (A.37). That is when s k λ kn ≥ λ − kn P ( F c ) / , (A.38)jointly holding with λ − kn P ( F c ) / ≥ s k/ P ( F c ) / . (A.39)26his results in the same suﬃcient condition (A.36) via only (A.38) since by Assumption 2, s λ n → ≥ ( s λ n ) k /s k/ . So (A.39) is always satisﬁed with suﬃciently large n . Notealso that joint inequalities s k λ kn ≥ s k/ P ( F c ) / jointly holding with s k/ P ( F c ) / ≥ λ − kn P ( F c ) / (A.40)is not possible since (A.40) implies s k/ λ kn ≥ s λ n ) k / ( s k/ ) ≥ s λ n → n .To combine all the results for the k th moment of the estimation error, for values of λ n ≥ P ( F c ) / k /s / , E k ˆ β − β k k = O ( s k λ kn ) . The uniformity over B l ( s ) follows since the rates in (A.24)(A.31)-(A.34) depends on β only by s . Q.E.D.

Remark. Proof of Theorem 1 in Jankova and van de Geer (2018), in their appendix, p.5, shows that theyuse assumption: λ n ≥ P ( F c ) / k s / , (A.41)which is equivalent to the following condition as shown in p.3 of proof of Theorem 1 in Jankova and van de Geer(2018) τ > kln [( √ s λ n ) − ] /lnp, given that λ n ≥ Cτ p lnp/n and C > , τ > P ( F c ) ≤ p ) τ / (A.42)by Lemma 7 in appendix of Jankova and van de Geer (2018). Our result and theirs are not comparable interms of λ n since they assume sub-Gaussian data, and ours is more general. Proof of Theorem 2 .We start with E k ˆ β k k = E k ˆ β k k {F} + E k ˆ β k k {F c } ≤ E k ˆ β k k {F} + q E k ˆ β k k p P ( F c ) , (A.43)by using Cauchy-Schwartz inequality. Then use triangle inequality on set F and by Lemma A.1, and norminequality to have k ˆ β k ≤ k ˆ β − β k + k β k ≤ λ n s φ ( s ) + √ s k β k = O p ( √ s ) , by Assumptions 1, 2. This last rate shows that E k ˆ β k k {F} = O ( s k/ ) . (A.44)27o handle the second right side term in (A.43) we start with the second inequality in (A.27) and ignore k β k in the middle to have k ˆ β k ≤ k u k n λ n + k β k . then follow (A.30) to get q E k ˆ β k k P ( F c ) / = O (max( s k/ , λ − kn )) P ( F c ) / = O ( λ − kn P ( F c ) / ) , (A.45)and to get the second equality by Assumption 2(ii) ( s λ n ) k /s k/ ≤ s k/ ≤ λ − kn with suﬃciently large n .Now use (A.44) with (A.45) in (A.43) E k ˆ β k k = O ( s k/ ) + O ( λ − kn P ( F c ) / ) . (A.46)If λ n ≥ P ( F c ) / k /s / it is clear that s k/ ≥ λ − kn P ( F c ) / , (A.47)So by (A.47) in (A.46) we have the desired result. Q.E.D.Q.E.D.A.2.4 Main Theorem Proof: Incentive CompatibilityProof of Theorem 3 .By Theorem 1 and 2 we can choose the larger of λ n in those theorems, with s ≥

1, and since it isnondecreasing with n , λ n ≥ P ( F c ) / k s / ≥ P ( F c ) / k s / (A.48)Add and subtract X ′ n +1 ˆ β inside the right hand side of the incentive compatibility deﬁnition: E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˜ X ′ n +1 ˆ β − X ′ n +1 ˆ β + X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˜ X ′ n +1 ˆ β − X ′ n +1 ˆ β ] + E [ X ′ n +1 ˆ β − X ′ n +1 β ] + E [ ˆ β ′ ( ˜ X n +1 − X n +1 ) X ′ n +1 ( ˆ β − β )]+ E [( ˆ β − β ) ′ X n +1 ( ˜ X ′ n +1 − X ′ n +1 ) ˆ β ] . (A.49)Using the deﬁnition of incentive compatibility, with deﬁning D n +1 := ˜ X n +1 − X n +1 , we have E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ D n +1 D ′ n +1 ˆ β ] (A.50)+ E [ ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β )] (A.51)+ E [( ˆ β − β ) ′ X n +1 D ′ n +1 ˆ β ] . (A.52)28ow analyze (A.51), the analysis of (A.52) is the same and thus omitted. See thatˆ β ′ D n +1 X ′ n +1 ( ˆ β − β ) ≤ | ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β ) |≤ | ˆ β ′ D n +1 || X ′ n +1 ( ˆ β − β ) |≤ k ˆ β k k D n +1 k ∞ k X n +1 k ∞ k ˆ β − β k , (A.53)where we use Holder’s inequality. Then E [ ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β )] ≤ E h k ˆ β k k D n +1 k ∞ k X n +1 k ∞ k ˆ β − β k i (A.54) ≤ [ E k ˆ β ] / [ E k D n +1 k ∞ ] / [ E k X n +1 k ∞ ] / [ E k ˆ β − β k ] / (A.55)= [ E k ˆ β ] / [ EM ] / [ EM ] / [ E k ˆ β − β k ] / (A.56) where we apply (A.53) for the ﬁrst inequality and Holder’s Inequality in the second inequality above, andthe last equality comes from M , M deﬁnitions. Then we apply Theorems 1-2 with k = 4. We assume λ n ≥ P ( F c ) / /s / and if s / r lnpn [ EM ] / [ EM ] / → , (A.57)we see that (A.56) goes to zero, by Theorems 1-2, and λ n = O ( q lnpn ).So looking at incentive compatibility deﬁnition and (A.50)-(A.52) E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ D n +1 D ′ n +1 ˆ β ] + o (1) , (A.58)where the ﬁrst right side term in (A.58) is nonnegative and the other terms are negligible in large samplesby (A.57).The uniformity over B l ( s ) goes through since Theorems 1, 2 depend on β only through s , and theyare the main ingredient in the proof. Q.E.D.

B Appendix B

Here we consider results when p ≤ n , and relaxing Assumption 2(iii). B.1 When p ≤ n There are minor modiﬁcations in the proofs compared to p > n . We consider them here. One major changeis since p ≤ n , we set κ n = lnn . Change Assumption 2(ii) so that s p ln/n → p ≤ n , and combine (A.2) with (A.3) tohave with κ n = lnn in that case P ( max ≤ j ≤ p | ˆ µ j − µ j | ≥ K [ √ lnp √ n + ( EM F ) / lnpn ] + √ lnn √ n ) ≤ n C + EM F n ( lnn ) = o (1) , (B.1)29y Assumptions A1-A.2. To see this point EM F nlnn =  ( EM F ) / √ lnp √ n ! √ lnn √ lnp  = o (1) . (B.2)This shows also that max ≤ j ≤ p | ˆ µ j − µ j | = O p ( √ lnn/ √ n ) . (B.3)Lemma A.1 will be the same. Lemma A.2(i) lower bound probability has κ n = lnn now. Lemma A.2(ii)is the same. Lemma A.2(iii) will change to λ n = O ( √ lnn/ √ n ) . Lemma A.3 use κ n = lnn , so (A.19) becomes t = K [ p lnp √ n + p EM lnp n ] + √ lnnn . Lemma A.4 is the same with κ n = lnn .Given these results, the proof of Theorem 1 is the same with λ n = O ( q lnnn ). Theorem 2 does not change.Theorem 3 condition will be changing to s / r lnnn [ EM ] / [ EM ] / → , B.2 Relaxing Assumption 2(iii)

In this subsection we relax Assumption 2(iii) from k β k = O (1) to k β k = O ( √ s ) and we explain thelogic and meaning of this new assumption. Assumption 2(iv) . k β k = O ( √ s ) . Assumption 2(iii) which is suggested by Jankova and van de Geer (2018) and simpliﬁes their paper insemiparametric eﬃcient estimators. Our Assumption 2(iv) here generalizes that assumption and in the caseof s being constant becomes Assumption 2(iii). The implication of Assumption 2(iv) is that all nonzerocoeﬃcients can be constant and none of them has to be local to zero. k β k = vuut p X j =1 β ,j = s X j ∈ S β ,j = O ( √ s ) . In terms of Section 2 discussion after Assumption 2, this implies S = F , and F is an empty set. SoAssumption 2(iv) can simultaneously allow s increasing with n , and all large nonzero coeﬃcients in S .Previously in Assumption 2(iii), there can be only a ﬁxed number of large coeﬃcients, and increasing ( s − f )number of local to zero (small) coeﬃcients.We proceed in a way that we only change the proofs in Appendix A, when necessary. All lemmata inAppendix A goes through, there is no usage of Assumption 2(iii) there. The ﬁrst change comes in step 330f Theorem 1 proof. First (A.29) changes to k β k k = O ( s k ) under Assumption 2(iv) instead of Assumption2(iii). Then (A.30) becomes E " k u k n λ n k + 2 k β k k = O (max( s k , λ − kn )) . (B.4)Then (A.32) changes to following E k ˆ β − β k k = O ( s k λ kn ) + O (max( s k , λ − kn ) p P ( F c ) . (B.5)Instead of (A.33)(A.34) we have the following conditions, to establish the rate for the oracle inequality (i.e.mean l norm bound to k th order) s k λ kn ≥ s k P ( F c ) / . (B.6) s k λ kn ≥ λ − kn P ( F c ) / . (B.7)Using (B.5)-(B.7) E k ˆ β − β k k = O ( s k λ kn ) . (B.8)The conditions (B.6)(B.7) can be written as λ n ≥ max ( P ( F c ) / k , P ( F c ) / k /s / ) , (B.9)where the tuning parameter choice under Assumption 2(iv) which is (B.9) is larger than or equal to choiceby Assumption 2(iii), which is the second component in the max on the right side of (B.9). The discussionafter this in step 4 is the same, given Assumption 2(i)-(ii). So we have the following result: Corollary B.1 . Under Assumptions 1, 2(i)(ii)(iv), with λ n ≥ max ( P ( F c ) / k , P ( F c ) / k /s / ) . we have [ E k ˆ β − β k k ] /k = O ( s λ n ) . The result is also uniform over l ball B l Now we modify the proof of Theorem 2. In that respect, by Assumption 2(iv) the rate after (A.43)becomes k ˆ β k = O p ( s ) . (B.10)Then (A.46) changes to E k ˆ β k k = O ( s k ) + O ( λ − kn P ( F c ) / ) . (B.11)We can show that s k ≥ λ − kn P ( F c ) / , (B.12)if we have λ n ≥ P ( F c ) / k /s . (B.13)31hen given (B.13), using (B.12) in (B.11) we have E k ˆ β k k = O ( s k ) . So we established the following Corollary to Theorem 2. The result is diﬀerent from Theorem 2 and the kth moment of l error grows faster here in Corollary B.2 if s increases with n . So relaxed assumption comeswith a cost that will aﬀect main incentive compatibility condition. Corollary B.2 . Under Assumptions 1, 2(i)(ii)(iv), with λ n ≥ P ( F c ) / k /s . we have [ E k ˆ β k k ] /k = O ( s ) . The result is also uniform over l ball B l Now we follow the proof of Theorem 3 and substitute Assumption 2(iv) instead of Assumption 2(iii).Note that our λ n choice must choose the maximum of the ones in Corollary B.1 and B.2. Clearly CorollaryB.1 tuning parameter is larger than the one in Corollary B.2. The only place we have to change there is(A.57). Given λ n ≥ max ( P ( F c ) / , P ( F c ) / s / ) we need s r lnpn [ EM ] / [ EM ] / → , to have Incentive Compatibility in large samples. So we have the following counterpart to Theorem 3. Corollary B.3 . Under Assumptions 1, 2(i)(ii)(iv) and λ n ≥ max ( P ( F c ) / , P ( F c ) / s / ) , and s r lnpn [ EM ] / [ EM ] / → , lasso is Incentive Compatible. The result is also uniform over l ball B l . Clearly, there are two diﬀerences between Theorem 3 and Corollary B.3 here. First, we need a tuningparameter in Corollary B.3 which may be larger than or equal to the one in Theorem 3. Then, incentivecompatibility of lasso is more diﬃcult to achieve, due to sparsity, s , having exponent of 2 here instead of3/2 in Theorem 3. References

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimalinstruments with an application to eminent domain.

Econometrica 80 , 2369–2429.32elloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment eﬀects after selection amonghigh dimensional controls.

Review of Economic Studies 81 , 608–650.Buhlmann, P. and S. van de Geer (2011). Statistics for high-dimensional data.

Springer Verlag .Cai, Y., C. Daskalakis, and C. Papadimitrou (2015). Optimum statistical estimation with strategic datasources.

Proceedings of the 28 th Conference on Learning Theory 40 , 1–40.Caner, M. and A. B. Kock (2018). Asymptotically honest conﬁdence regions for high dimensional parametersby the desparsiﬁed conservative lasso.

Journal of Econometrics 203 , 143–168.Caner, M. and A. B. Kock (2019). High dimensional linear gmm. arXiv:1811.08779 .Chernozhukov, V., D. Chetverikov, and K. Kato (2017). Central limit theorems and bootstrap in highdimensions.

Annals of Probability 45 , 2309–2452.Chernozhukov, V., M. Goldman, V. Semenova, and M. Taddy (2018). Orthogonal machine learning fordemand estimation: High dimensional causal inference in dynamic panels. arXiv:1712.09988 .Chiang, H. (2020). Many average partial eﬀects: with an application to text regression.

Working Paper .Chiang, H. and Y. Sasaki (2019). Causal inference by quantile regression kink designs.

Journal of Econo-metrics 210 , 405–433.Cummings, R., S. Ioannidis, and K. Ligett (2015). Truthful linear regression.

Conference on LearningTheory 40 , 448–483.Dekel, O., F. Fischer, and A. Procaccia (2010). Incentive compatible regression learning.

Journal of ComputerSystem and Sciences 76 , 759–77.Eliaz, K. and R. Spiegler (2019). The model selection curse.

American Economic Review-Insights 1 , 127–140.Eliaz, K. and R. Spiegler (2020). On incentive compatible estimators.

Working Paper-Tel Aviv University .Gao, C., A. Van der Vaart, and H. Zhou (2015). A general framework for bayes structured linear models. arXiv:1506.02174 .Hardt, M., N. Megiddo, C. Papadimitrou, and M. Wooters (2016). Strategic classiﬁcation.

Proceedings. ofthe ACM Conference on Innovations in. Theoretical Computer Science , 111–122.Jankova, J. and S. van de Geer (2018). Semi-parametric eﬃciency bounds for high-dimensional models.

Annals of Statistics 46 , 2336–2359.Kock, A. (2016). Oracle inequalities, variable selection and uniform inference in high-dimensional correlatedrandom eﬀects panel data models.

Journal of Econometrics 195 , 71–85.33ock, A. and H. Tang (2019). Inference in high-dimensional dynamic panel data models.

EconometricTheory 35 , 295–359.Meir, R., A. Procaccia, and J. Rosenschein (2012). Algorithms for strategyproof classiﬁcation.

ArtiﬁcialIntelligence 186 , 123–156.Perte, J. and J. Perote-Pena (2004). Strategy-proof estimators for simple regression.

Mathemitical SocialSciences 47 , 153–176.Shaywitz, D. (2020). ”the alignment problem” review: When machines miss the point.

The Wall StreetJournal , A25,25 October.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.

Journal of The Royal StatisticalSociety Series B 58 , 267–288.van de Geer, S., P. B¨uhlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimal conﬁdenceregions and tests for high-dimensional models.