Non-Manipulable Machine Learning: The Incentive Compatibility of Lasso
aa r X i v : . [ ec on . E M ] J a n Non-Manipulable Machine Learning:The Incentive Compatibility of Lasso
Mehmet Caner ∗ Kfir Eliaz † January 5, 2021
Abstract
We consider situations where a user feeds her attributes to a machine learning method that tries topredict her best option based on a random sample of other users. The predictor is incentive-compatible ifthe user has no incentive to misreport her covariates. Focusing on the popular Lasso estimation technique,we borrow tools from high-dimensional statistics to characterize sufficient conditions that ensure thatLasso is incentive compatible in large samples. In particular, we show that incentive compatibility isachieved if the tuning parameter is kept above some threshold. We present simulations that illustratehow this can be done in practice.
Rapid advances in machine learning methods for analyzing big data have given rise to automated systems thatemploy these methods to predict the best fitting outcomes for users based on their personal characteristics.For example, many online platforms try to predict which content - a song, a video, a post, or an article- is the best fit for each user. Medical providers have also begun using machine learning techniques toautomate check-ups and test appointments for patients based on their medical history. Typically, theseautomated systems use data from past users to estimate a model that relates the best fit for a user (such asthe most preferred content or the appropriate medical test) to her characteristics. These estimates are thenapplied to a new user’s characteristics, which she discloses either actively or passively via her past onlinebehavior (which may be reflected in his cookies or collected by his browser). Given the growing interactionof users with such automated systems, it is only natural to ask whether a user should truthfully disclose hercharacteristics? ∗ North Carolina State University, Nelson Hall, Department of Economics, NC 27695. Email:[email protected]. We thankSimon Fraser University Department of Economics for the virtual seminar, and Anders Kock and Ran Spiegler for comments.We also thank Columbia University, Department of Economics for the hospitality where this research is initiated, when bothauthors were visitors in 2018-2019. † School of Economics, Tel-Aviv University and Eccles School of Business, the University of Utah. Email: kfi[email protected].
1f the information the user discloses is also used to exploit her (say, by providing it to third parties foradvertising or price discrimination), then the user has an obvious reason not to reveal her private information.The question is whether special features of some popular machine learning methods introduce an incentiveto misreport one’s personal characteristics even when this information will be used solely for predicting herbest outcome? This question is of crucial importance: If individuals submit false reports to systems thatrely on these reports for estimation and predictions, then the conclusions drawn from such estimates andpredictions will be wrong and may lead to quite undesirable outcomes (e.g., think of an automated medicalplatform that schedules tests for patients based on false reports on attributes such as smoking, drinking andphysical exercise).To address the above question, we consider a stylized environment where each user i ’s ideal option is alinear function f of her privately observed attributes X i = ( X i, , ..., X i,p ) ′ such that f ( X i ) = X ′ i β . A usermay not know the values of the coefficients β , in which case she would have some (possibly degenerate)prior beliefs over them. A “statistician”, who represents some automated prediction platform has a sampleof the attributes of n users and noisy observations on their ideal options. For instance, suppose f ( X i ) isthe optimal dosage of some medication when taken immediately at the onset of symptoms, conditional onthe patient’s medical history X i , but the statistician observes the dosage that was given after some delay.Similarly, f ( X i ) may be the mix of news and reality shows that a user with attributes X i actually watches,but the statistician observes only self reports by a user who may have forgotten exactly what he watched.The statistician uses her sample to estimate the function f by computing an estimate ˆ β of the truecoefficients β . The statistician wishes to apply these estimates to predict the ideal option of a new user, n + 1 , whose true attributes X n +1 are not observed by the statistician. This new user must decide whatvector of attributes ˜ X n +1 (which may differ from the truth) to report to the statistician. In making thisdecision, the new user takes into account her beliefs about the statistician’s sample (the new user onlyknows the distribution from which the sample is drawn, but she does not observe its realization), and herbeliefs about the true parameters β . The statistician then plugs the new user’s reported attributes into theestimated function and gives the user the option ˜ X ′ n +1 ˆ β, which is the statistician’s estimate of the user’sideal option based her report. The new user’s expected loss from a report ˜ X n +1 is given by the mean squareerror between her expectation of the ideal option X ′ n +1 β and her assigned option ˜ X ′ n +1 ˆ β. The statistician’sestimator is (ex-ante) incentive-compatible , if in expectation, the new user has no incentive to deviate fromtruthful reporting for any prior belief on β : i.e., if for every possible value of β , the expected value of( X ′ n +1 β − ˜ X ′ n +1 ˆ β ) is minimized at the truth ˜ X n +1 = X n +1 , where the expectation is taken with respectto the statistician’s sample and the possible realizations of the user’s attributes.Intuition suggests that an individual cannot benefit from lying to a procedure that is meant to predict the In a recent interview of Brian Christian, the author of
The Alignment Problem , he notes that “computers may one day beable not only to learn our behavior but also intuit our values - figure out from our actions what it is we’re trying to optimize.... What if an algorithm intuits the ‘wrong’ values, based on its best read of who we currently are but not of who we aspire tobe? Do we really want our computers inferring our values form browser histories? See Shaywitz (2020) for this interview. binary , the statistician has the same (fixed) finite number of observationson each possible combination of attribute values, and the penalty parameter is fixed and does not adjust tothe sample size. Hence, these papers leave open the following important question: For a general environment,are there conditions ensuring that a penalized regression model is incentive compatible in large samples?Answering this question can potentially allow platforms, like those discussed above, to use machine-learning methods to predict users’ most preferred options without worrying that their data is “contaminated”by non-truthful users. Put bluntly, estimates and predictions made by methods that are not incentive-compatible are possibly unreliable since they may be based on false data.This paper addresses the above open question by focusing on the most popular form of penalized re-gressions - the
Lasso estimator. Borrowing tools from high-dimensional statistics, we establish sufficientconditions for incentive compatibility of the Lasso estimator in large samples. We show that to achieve in-centive compatibility, the tuning parameter must be large enough (i.e., it must remain above some thresholdas sample size increases) so as to avoid overfitting, which is the main reason why a user may want to lie.This potential to lie implies that the standard way of choosing small enough tuning parameters to ensureconsistency may lead to incentive compatibility violations We provide simulation results that illustrate howthe tuning parameter can be chosen in practice to ensure incentive compatibility. Incentive compatibilitymay therefore be viewed as an additional important property that should be imposed on estimators on top ofconsistency and unbiasedness. We also offer a new technical contribution by extending the oracle inequalitiesof Jankova and van de Geer (2018) to non-sub Gaussian data and providing a different proof. The motivation to focus on the Lasso estimator stems from the fact that this estimator is the bench-mark among all high dimensional statistical estimators that predict large scale models when the numberof regressors exceeds the sample size. Following its original proposal by Tibshirani (1996), econometriciansand statisticians have used Lasso-based estimators to push the boundaries of economics and finance. Oneof the most critical issues facing these Lasso type estimators is post-inference after estimation and modelselection, which require uniformly valid confidence intervals. In a seminal series of papers, Belloni et al.(2012) and Belloni et al. (2014) solved these issues by introducing the idea of “partialling out” the regres-sors. A different but complementary approach, via debiasing-desparsifying is proposed by van de Geer et al.(2014). Caner and Kock (2018) extended the debiasing of van de Geer et al. (2014) to heteroskedastic-non-sub-Gaussian data with strong oracle optimality property, thereby proposing a high dimensional estimatorthat is robust to heteroskedasticity, and with uniformly valid confidence intervals. Lasso-based debiasing are Our results can be extended to apply to the debiased lasso estimator, but this involves a different proof technique, andhence, is beyond the scope of the current paper. Even though we prove the main theorems with the bounded signal to noise ratio as in Jankova and van de Geer (2018),we can relax this ratio constraint as shown in the Appendix.
None of these papers consider penalized regression methods, and none ofthem characterize conditions guaranteeing incentive compatibility of regression techniques when the statis-tician and users have aligned interests (as is the case in our model).The remainder of the paper is organized as follows. Section 2 considers the model and assumptions. Sec-tion 3 provides new oracle inequalities. Section 4 shows under what conditions lasso is incentive compatible,and Section 5 provides a simulation. Appendix A and B provide proofs of the results when p > n , and p ≤ n ,respectively. We will use the following notational conventions. For any vector ν ∈ R d , let k ν k , k ν k , k ν k ∞ denote its l , l , l ∞ norm respectively, and k ν k be the l norm, which means the total number of nonzero entries. Fora set S ⊆ { , , · · · , d } , let | S | = s be the cardinality of the set. Let ν S be the modified ν such that we put0 when the index does not belong to S (i.e., say S = { , , } for a 10 × ν , this means that ν ismodified such that now all elements are zero except elements 1 , , k A k l be the maximum absolutecolumn-sum norm of a matrix of dimensions m × l , i.e., k A k l = max ≤ k ≤ l P mi =1 | A ik | which is also calledinduced l norm of A .Our environment consists of users who are characterized by a set of p personal characteristics. Forinstance, in the context of medical decision making, a characteristic can represent a risk factor (obesity,smoking, etc.). For each user i, these characteristics are modeled as p explanatory variables, X i, , ..., X i,p , drawn from some distribution over a subset of R p . These attributes determine the ideal option for a useraccording to the function f ( X i, , ..., X i,p ) = p X k =1 X i,k β ,k This function applies to all users, who differ only in the values of their characteristics. The realized valuesof ( X i, , ..., X i,p ) are privately observed by user i. A user may or may not know the value of the coefficients4 β , , ..., β ,p ) . In the latter case, she has some (possibly degenerate) prior beliefs over their values.A statistician (representing the automated prediction systems described in the introduction) has private access to a sample of n observations. Each observation i = 1 , ..., n consists of the true attributes X i =( X i, , ..., X i,p ) of user i and a noisy signal y i of that user’s ideal option, y i = X ′ i β + u i , (1)where u i is random noise that is drawn i.i.d from some distribution with zero mean. The X i ’ s are also i.i.d. across i , and exogenous, and will be discussed in detail in Assumption 1 in thenext subsection. β is a p × f . We let S = { j : β ,j = 0 } denote the set of relevant regressors with s being the cardinality of the set S . (i.e., s of the elements of β are nonzero, and the rest are zero). s is a nondecreasing function of n . These facts are known to an“oracle” but not to the statistician (and possibly not to a user).Using her (privately observed) sample, the statistician estimate the function f , or equivalently, sheestimates the coefficients β , , ..., β ,p . When p > n , the least squares estimator is infeasible due to singularityof the empirical Gram matrix. Hence, the statistician uses Lasso, the penalized regression procedure thatassigns costs to including explanatory variables in the regression. Specifically, she solves the followingminimization problem ˆ β = argmin β ∈ R p P ni =1 ( y i − X ′ i β ) n + 2 λ n k β k , (2)where λ n > λ n = O ( p lnp/n ) (an explicit expression for the sequence λ n is given in equation (A.14) inAppendix A). Given her estimates ˆ β, the statistician must take an action a ∈ R on behalf of a new user, j = n + 1. Thisaction is just the statistician’s prediction of the ideal option of that user. The new user’s payoff from action a is − ( a − f ( X n +1 )) , where f ( X n +1 ) is the true ideal option associated with her personal attributes X n +1 .The distribution of X n +1 may be different from that of ( X , ..., X n ), and we do not impose any restrictionon its correlation with the sample distribution.Since the statistician does not observe X n +1 , in order to make her prediction of f ( X n +1 ) , she asks the n + 1 user to report a p × X n +1 , which is interpreted as that user’s attributes. The statisticianthen plugs ˜ X n +1 into her estimated model and chooses the action a = ˜ X n +1 ˆ β. When the n + 1 user decideswhat attribute values to report, she takes into account that she does not observe the statistician’s sample,and hence, does not know the values of the estimated coefficients ˆ β. She only knows the distribution fromwhich the statistician’s sample is drawn, and that given her sample, the statistician chooses ˆ β according to(2). Given this, the user chooses the report ˜ X n +1 that minimize her expected loss E β , ˆ β ( ˜ X n +1 ˆ β − X ′ n +1 β ) , Access to such observations is a necessary condition for any platform that tries to learn about users (say, Netflix, Spotify).In the introduction, we gave a couple of examples for such data, which may be obtained from a third party, or from marketingsurveys. We established this rate in Lemma A.2 in Appendix A. β , and hisbeliefs about the estimate ˆ β. Hence, the new user may decide to lie and report ˜ X n +1 = X n +1 . In particular,she may decide to “opt out” and submit a vector of zeros. Our objective is to understand under whatconditions it is in the user’s best interest to be truthful regardless of her prior beliefs on β . We say that an estimator is ex-ante incentive compatible if for any belief over the true model parameters, theuser’s expected payoff from truth-telling is at least as high as her expected payoff from any misreport, wherethe expectation is taken with respect to the user’s realized covariates, and with respect to the statistician’ssample.
Definition 1.
An estimator is (ex-ante) incentive-compatible if for every ˜ X ′ n +1 and every β , E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] ≥ E [ X ′ n +1 ˆ β − X ′ n +1 β ] . (3) where the expectation E is taken with respect to the possible realizations of X n +1 and the possible realizationsof the statistician’s sample. An alternative notion of incentive-compatibility would be defined ex-post with respect to the realizationof the user’s covariates, such that inequality (3) would be required to hold for every realization of X n +1 , andthe expectation operator would only be with respect to the statistician’s sample. The sufficient conditionfor ex-ante incentive-compatibility of the Lasso estimator, which we establish in Section 4, also guaranteesex-post incentive-compatibility. Furthermore, the proof of ex-post incentive compatibility follows from ourproof of ex-ante incentive-compatibility. In light of this, we shall focus on the ex-ante notion henceforth.Incentive compatibility means that the user is unable to perform better by misreporting her personalcharacteristics, regardless of her beliefs over the true model’s parameters in mean squared sense. Howshould we interpret this requirement, given that we do not necessarily want to think of the user as beingsophisticated enough to think in these terms? One interpretation is that lack of incentive compatibility ismerely a normative statement about the user’s welfare - namely, given our model of how the statisticiantakes actions on the user’s behalf, it would be advisable for her to misrepresent her personal characteris-tics. Furthermore, there are opportunities for new firms to enter and offer the user paid advice for how tomanipulate the procedure - in analogy to the industry of “search engine optimization”. Incentive compati-bility theoretically eliminates the need for such an industry. In the context of the online content provisionstory, some misreporting strategies take the form of “deleting cookies”. This deviation is straightforward toimplement, and the user can check if it makes her better off in the long run.Note that incentive-compatibility is not a property that can be tested statistically. To see this, supposeeach user is characterized by only a single covariate that is uniformly distributed on { , } . If users are In the case in which the individual’s attributes are collected “passively” from her browsing history, then reporting a vectorof zero attributes can be interpreted as the act of deleting cookies. true attributes of n users. The idea is that the data onthese users is obtained through a different process than the way the statistician obtains the data from the n + 1 user. For instance, as mentioned earlier, this data may be obtained from a marketing survey wherethere is no incentive to lie. Alternatively, one may interpret our incentive compatibility requirement as arequirement that truth-telling is a Nash equilibrium among all participants - such that given that everyoneelse is telling the truth, no user has an incentive to lie.
In this subsection, we introduce a number of restrictions on the statistician’s data. To describe theserestrictions, we shall make use of the following notation. Define an l ball B l ( s ) = {k β k l ≤ s } . DenoteΣ := EX i X ′ i for i = 1 , , · · · , n and let ˆΣ := X ′ X/n be the sample counterpart. Our first requirementextends the sub-Gaussian data assumption used in statistics:
Assumption 1. (i). E ( u i | X i ) = 0 , X i , u i are identical and independent across i = 1 , · · · , n, and max ≤ j ≤ p E | X ij | , E | u i | l , l = max (2 k, for all k ≥ are uniformly bounded from above (across n ). (ii).The minimal eigenvalue of Σ is bounded away from zero uniformly in n . Our second set of restrictions applies to the first and second moments. These will guarantee the consis-tency of the Lasso estimator, but will not ensure incentive compatibility (sufficient conditions for incentivecompatibility will be introduced in Section 4). We start by defining the maximal value of certain crossproducts, which will be related to the behavior of moments in high dimensions in our next assumption. M := max ≤ i ≤ n max ≤ j ≤ p | X ij u i | ,M := max ≤ i ≤ n max ≤ j ≤ p max ≤ l ≤ p | X il X ij − EX il X ij | . Note that M is the maximal covariance between the regressors and errors in a high dimensional context.Roughly speaking, when this covariance is small, it captures exogeneity of the regressors in the sample. M is the maximal variance of the regressors in the sample. With large p and n , these covariance and varianceterms can grow arbitrarily large; hence, we need a condition that restricts the growth rate of their moments.Because we are allowing for heteroskedastic data and unbounded regressors, we need to consider the growthrate of higher-order moments. Assumption 2. (i). √ lnp √ n [ max (( EM ) / , ( EM ) / )] → . Alternatively, we could strengthen Assumption 2 using boundedness of individual moments of
X, u . ii). s ( lnpn ) / → . (iii). k β k = O (1) . Assumption 2(i) and 2(ii) are standard in high dimensional econometrics. 2(i) is used in Chernozhukov et al.(2017) allowing them to apply a concentration inequality, and 2(ii) is a standard sparsity condition. As-sumption 2(iii) ensures that the signal to noise ratio is bounded (see p.2343 of Jankova and van de Geer(2018)). To see this clearly, set σ u := var ( u i ), which is the variance of the errors, and σ u ≥ c >
0, where c is a generic positive constant. Hence, var ( y i ) var ( u i ) = β ′ Σ β σ u + 1 , under E ( u i | X i ) = 0 in Assumption 1 and Σ := EX i X ′ i . But β ′ Σ β σ u + 1 ≥ k β k φ min (Σ) σ u + 1 . where φ min (Σ) ≥ c > var ( y i ) /var ( u i ) ≥ C + 1 >
0, with C beinga positive constant, and defined as C := k β k φ min (Σ) σ u . The empirical implication of this is that only a fixednumber of nonzero coefficients can be constants, and the other nonzero coefficients have to be local to zero.To see this implication, note that k β k = vuut p X j =1 β ,j = s X j ∈ S β ,j = O (1) . But this last point can be achieved, in the case of s growing with n , with s X j ∈ S β ,j = s X j ∈ F β ,j + X j ∈ S − F β ,j ≤ s f C + ( s − f ) C s − f = O (1) , where F := { j : | β ,j | = C } with | F | = f being a fixed number, C is a generic positive constant and F := { j : | β ,j | = C √ s − f } with | F | = s − f . For ease of exposition, we set all coefficients in F and F tobe the same constants C, C/ √ s − f respectively. F contains indices of all local to zero coefficients. Thiscan easily be generalized without affecting our results.In Appendix B we take a more flexible approach compared with Assumption 2(iii). There, we assumethat k β k = O ( √ s ). In this case, all nonzero coefficients can be large (i.e., none of them are local to zero,as in set F above). In other words, there is no index set F as above, but all nonzero coefficients (theirindices) are in the set F above.As p and n grow large, the total number of nonzero coefficients s (also known as the sparsity index ) cangrow arbitrarily large. To guarantee consistency and unbiasedness, it is typically assumed that the productof the sparsity index and the tuning parameter should go to zero. However, this standard condition doesnot guarantee the incentive compatibility of the Lasso estimator as can be seen in the proof of Theorem 3below. 8 New oracle inequalities
Oracle inequalities in high dimensional statistics are upper bounds on prediction and estimation errors. Werequire moment bounds on the Lasso estimator’s error in l norm for our main result. By taking the samplesize to be large, we can show that the upper bound on the mean of higher-order moments of Lasso estimationerrors tend to zero. We then use this asymptotic result to establish the incentive compatibility of the Lassoestimator in large samples. To illustrate this, we note that from the proof of Theorem 3 in Appendix A.2.4,the incentive compatibility constraint is tied to the following expression E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ ( ˜ X n +1 − X n +1 )( ˜ X n +1 − X n +1 ) ′ ˆ β ] (4)+ E [ ˆ β ′ ( ˜ X n +1 − X n +1 ) X ′ n +1 ( ˆ β − β )] (5)+ E [( ˆ β − β ) ′ X n +1 ( ˜ X ′ n +1 − X ′ n +1 ) ˆ β ] . (6)For incentive compatibility to hold in large samples, we need the sum of the right-hand side terms to begreater than or equal to zero. The first term on the right side (4) is always non-negative. Hence, if we provethat (5) and (6) converge to zero, we establish asymptotic incentive compatibility. However, the size of termsin (5) and (6) will depend on the mean of higher-order estimation error of Lasso.To bound these error, we prove new oracle inequalities, which are different from those that are givenin the literature for k ˆ β − β k . These inequalities will serve an important role in proving our main resultin the next section (Theorem 3). Besides, they are of independent interest as they extend previous resultson sub-Gaussian data to heteroskedastic (conditionally) data sets that are commonly used in econometrics.Our proof technique will also look at a less conservative bound compared with Jankova and van de Geer(2018). Hence, our new inequalities contribute to the literature on high-dimensional econometrics wherethey can be used for proving generalized semiparametric efficiency of Lasso-type-estimators (as, e.g., inJankova and van de Geer (2018)).Our first result in this section is a k -th moment bound for the l norm of the Lasso bias. A key conceptused in this result is the exception probability for the event F := {A ∩ A } , where A and A are definedin (A.6) and (A.9), which represent the empirical process-noise, and the eigenvalue condition, respectively.The exception probability is the complement of the event F , and is denoted by P ( F c ). An explicit upperbound for the exception probability is calculated in Lemma A.4. Theorem 1.
Under Assumptions 1-2, if n is sufficiently large and λ n ≥ P ( F c ) / k s / , then [ E k ˆ β − β k k ] /k = O ( s λ n ) . This result is valid uniformly over B l ( s ) = {k β k l ≤ s } . If we set k = 1 we can learn whether the Lasso estimator is unbiased. By the above Theorem, Assumption9 and (A.15) imply s λ n →
0. Hence, in large samples, we have unbiasedness in the large λ n case. Next, weprovide the k -th moment bound for l norm for the Lasso estimator. Theorem 2.
Under Assumptions 1-2, if n is sufficiently large n and λ n ≥ P ( F c ) / k /s / , then [ E k ˆ β k k ] /k = O ( s / ) . This result is valid uniformly over B l ( s ) = {k β k l ≤ s } . This is a new result and a simple extension of Theorem 1 above. The rate in Theorem diverges to infinityif s → ∞ as n → ∞ . Our main result, which is new in the literature on penalized regressions, establishes that the Lasso estimatoris incentive compatible for a sufficiently large sample size. In other words, we show that when n → ∞ E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] ≥ E [ X ′ n +1 ˆ β − X ′ n +1 β ] . for all ˜ X ′ n +1 and for every β , where the expectation is taken with respect to the reporting user’s attributes X ′ n +1 (this is our ex-ante notion of incentive-compatibility that we explained in Section 2 .
1) and with respectto the statistician’s realized sample (since the reporting user does not observe this sample).The next theorem is our main result, which provides sufficient conditions for incentive compatibility. Itsproof makes use of the following notation. M := max ≤ j ≤ p | X n +1 ,j | ,M : max ≤ j ≤ p | ˜ X n +1 ,j − X n +1 ,j | . Note that M is nothing more than the absolute magnitude of the misreport on a given variable j by the n + 1 user. Theorem 3.
Under Assumptions 1 and 2, the Lasso estimator is incentive compatible in large samples( n → ∞ ) if the following conditions hold: λ n ≥ P ( F c ) / /s / (7) and s / r lnpn [ EM ] / [ EM ] / → . (8) Furthermore, incentive compatibility is valid uniformly over B l ( s ) = {k β k l ≤ s } . emarks . Theorem 3 establishes that a sufficient condition for incentive compatibility is that the tuning parameter λ n needs to be large “enough”. A simple way to choose λ n to satisfy (7) is to use the upper bound of theexception probability λ n := upperbound ( P ( F c ) / ) , in Lemma A.4. The simulations in the next section address the issue of whether such a bound is feasible. The typical concern with Lasso is the consistency of the estimator ( k ˆ β − β k = o p (1)), which can beachieved by making sure that λ n goes to zero at a relatively fast rate (as Lemma A.1 in Appendix A shows,this rate is s λ n → λ n gets too small, the Lasso estimator may admit many nonzero variablesincorrectly (i.e., it creates an overfit ). Consequently, when the number of regressors p is very large, theexpectation of the sum of l errors ( E k ˆ β − β k ) can grow arbitrarily large, and incentive compatibility maybe violated. Put differently, consistency does not imply incentive compatibility in large samples .To see this formally, let { A } denote the event A . Then for all ǫ > E k ˆ β − β k = P ( k ˆ β − β k ≤ ǫ ) {k ˆ β − β k ≤ ǫ } + P ( k ˆ β − β k > ǫ ) {k ˆ β − β k > ǫ }≤ P ( k ˆ β − β k ≤ ǫ ) ǫ + P ( k ˆ β − β k > ǫ ) {k ˆ β − β k > ǫ } , (9) By consistency, the first term in (9) will go to zero when ǫ approaches zero. However, the second termmay be large and can dominate all the expectation in large dimensions (this is also discussed on p.2339of Jankova and van de Geer (2018)) In other words, just using the l estimator bound on its own does not imply a bound for the expectation of l error. It requires a non-trivial proof. Consistency does not implyunbiasedness, and hence, it does not imply incentive compatibility.Why is overfitting a significant issue for incentive compatibility? The intuition is as follows. Suppose thetuning parameter is sufficiently small so that given the user’s prior on the true coefficients, she expects thatmany irrelevant variables will be included in the estimator. To correct this bias, she can report that thesevariables are equal to zero. The second sufficient condition (8) allows the distance between the user’s report ˜ X n +1 and the truth X n +1 to be of any magnitude since EM ≡ E k ˜ X n +1 − X n +1 k ∞ can be arbitrarily large. Since the aboveconditions are sufficient but not necessary, it remains an open question whether incentive compatibility canbe achieved with a tuning parameter that is lower than the threshold in (7) without restricting the magnitudeof the deviation between the user’s reported and true attributes. Note that (8) requires stricter sparsity than Assumption 2. If EM = O (1) and EM = O (1), thencondition (8) amounts to s / q lnpn →
0, which is a sparsity requirement still stronger than Assumption2(ii). In addition, if we let EM = O ( lnn ) and EM = O ( lnn ) , then s / q lnpn ( lnn ) / ≤ s / q lnpn → n ≤ p . A natural question that arises is whether condition (7) is compatible with the l norm consistency ofLasso? In other .words, consistency requires a small λ n , but incentive compatibility requires a large λ n , so11re they compatible with each other? When we select a large λ n to satisfy incentive compatibility, we shouldnot sacrifice consistency, i.e. we need s λ n →
0. To verify whether this is possible, we can take the lowerbound on the tuning parameter in (7) and see whether we can achieve consistency. Note that s λ n = s P ( F c ) / s / = s / P ( F c ) / , (10)From (A.22) in the Appendix, an upper bound on this exception probability is: P ( F c ) ≤ p C + K [ EM + EM ] nlnp , (11)where C and K are positive constants. With l = 1 ,
2, it therefore follows from (10) and (11) that we need s /p C → , s max l EM l /nlnp → , to have consistency. These two conditions are not unreasonable in the sense that they are consistent with( n, p ) increasing to infinity. Also they are compatible with moments satisfying condition (8) in Theorem 3. Finally, note that λ n = O ( q lnpn ) represents an upper bound in terms of rates for λ n , whereas (7)represents a lower bound. We can then take for a positive constant C > C √ lnp √ n ≥ λ n ≥ P ( F c ) / s / . The question is, are there suitable combinations of n and p that satisfy these inequalities? By using algebraand the upper bound for exception probability (A.22), we obtain the requirement that, Cs / ≥ " np C + K [ EM + EM ] nlnp / √ n √ lnp , which is plausible for p > n and large n since the left hand side may diverge and the right side may go tozero. This may be the case for example when p is exponential in n . This section has two objectives. First, it illustrates how in practice the tuning parameter can be chosento ensure incentive compatibility of the Lasso estimator. Second, it demonstrates that by appropriatelychoosing the tuning parameter (in line with the conditions in Theorem 3), incentive compatibility is satisfiedregardless of the magnitude of the “lie” (i.e., the distance between the true and reported attributes)We provide a simple simulation setup. We model y i = X ′ i β + u i , where β = (1 , ′ p − s , ′ s − ) ′ , 0 p − s is a p − s column vector of all zero elements, and 1 s − is a s − s represent the sparsity of the above model and set s = 20.12e present three different simulations. In Design 1, we choose X i to be a p × t distributionwith five degrees of freedom. The new n + 1 user has the same distribution for her attributes but isindependent of the first n users. The errors u i are also chosen from a t distribution but independentlyof the regressors. Tables 1-3 displays these results. For the second simulation (Design 2), we only changethe distribution of the attributes for the n + 1 user to a t distribution with three degrees of freedom. InDesign 2, we keep the same distribution for the errors and the same attribute distribution for the first n users from Design 1. The results are displayed in Tables 4-6. In the third simulation (Design 3), we changeonly the following in Design 2. We introduce a multivariate normal distribution for the attributes of users i = 1 , · · · , n , such that the covariance between the j and m -th random variables are governed byΣ j,m = 0 . | j − m | , for j = 1 , . . . , p and m = 1 , · · · , p . Thus, the correlation between the adjacent random variables is 0.5, andthis declines when the random variables are further apart. This Toeplitz type structure is commonly usedin the high dimensional literature (see Caner and Kock (2018)). In Design 3, we keep the distribution of the n + 1 user and the errors from Design 2. The results are presented in Tables 6-9.We aim to demonstrate that with a “large” tuning parameter as in Theorem 3, incentive compatibilitycan be achieved when the sample size n is large enough. As mentioned in the previous section, one possiblechoice of a tuning parameter that satisfies Theorem 3 is the upper bound on the exception probability, λ n ≥ upperbound ( P ( F c ) / ) . The issue is to make the exception probability, P ( F c ) operational and usable. Note that an upper boundon this probability is (with positive constants C > , C > , K > P ( F c ) ≤ p C + K [ EM + EM ] nlnp ≤ p C + C ( lnp ) , (12)by observing that for l = 1 , K max l EM l nlnp = " K / p max l EM l √ n √ lnp = " K / p max l EM l √ lnp √ n ( 1 lnp ) ≤ C ( lnp ) , where we use Assumption 2(i). Hence, we can write the upper bound of the exception probability by using p ≥ p C + C ( lnp ) ≤ C ( lnp ) . The tuning parameter is as follows 13 n := [2 + C ( lnp ) ] / , (13)where C can start from a small positive value and stop at a large positive value, and we select the optimal C and λ n according to the Generalized Information Criterion as in Caner and Kock (2018), which givesconsistent model selection with weighted Lasso choices in the least squares framework. Therefore, our choiceof λ n is above a lower bound, which prevents overfitting (this is the novel insight of Theorem 3). On theother hand, to prevent a very large λ n and ensure consistency of Lasso, the lower bound inversely dependson p .Define λ ∗ n := argmin λ n ∈ Λ [ ln (ˆ σ ( λ n )) + ˆ s ( λ n ) n ln ( n ) ln ( ln ( p ))] , where ˆ s ( λ n ) is the number of nonzero elements in the Lasso estimator, given a choice of λ n in a grid Λ, andˆ σ ( λ n ) is the mean squared residuals from the Lasso regression, given a choice of λ n in a grid Λ. We formΛ as follows: We take C in a grid of values [2 + C ( lnp ) ] as in (13). Let C := [0 . , . , , , , , , λ n depending on C . The number of iterations is 1,000.The “Report” column in Tables 1-3 display E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] as the mean squared error from a falsereport by the user. “Truth” refers to E [ X ′ n +1 ( ˆ β − β )] . The difference between ˜ X n +1 − X n +1 is kept atthree levels: 5, 2 and 0.2 (for all p variables), which represent large, medium, and small deviations from thetruth. We have p = 100 , , , and for each p level we analyze n = 100 , , p = 500 and n = 400 . In Table 1, which correspondsto a large magnitude of a lie, the user’s disutility from reporting the truth is 24.36, while the disutilityfrom lying is 334.01. Hence, the n + 1 user prefers to be truthful. In Table 2, for a medium magnitudeof lies, truth-telling induces a disutility of 25.03, while lying induces a higher disutility of 71.63. Finally,in Table 3, if the lie is “close” to the truth, the disutility from truth-telling is 24.36, while the disutilityfrom lying is 24.73. Similar comparisons hold in the tables’ remaining cells, suggesting that that Lasso’sincentive compatibility is achieved. In Tables 4-9, the same message from Tables 1-3 carries over: Lasso isincentive-compatible with our tuning parameter choice. Tables 6 and 9 show that with a minor lie, Lasso isstill incentive-compatible and the difference between the truth and lie in MSE sense is larger compared withTable 3 of Design 1. The growing reliance on machine learning in automating decisions previously made by people raises thequestion of how people would interact with these automated systems. In particular, would people have an14able 1: Design 1-Incentive Compatibility Scenarios: Difference 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 25.93 385.25 23.61 338.91 25.89 321.55 p = 250 27.75 353.90 25.25 353.08 26.41 337.91 p = 500 26.71 305.53 25.55 333.97 24.36 334.01Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 2: Design 1-Incentive Compatibility Scenarios: Difference 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 26.12 80.11 25.88 76.10 23.56 71.87 p = 250 25.88 75.64 24.25 77.89 25.20 72.40 p = 500 28.02 72.21 25.20 76.73 25.03 71.63Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 3: Design 1-Incentive Compatibility Scenarios: Difference 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 25.60 25.90 25.73 25.93 24.61 25.03 p = 250 26.06 26.87 23.90 24.27 25.43 25.84 p = 500 28.34 28.98 24.94 25.62 24.36 24.73Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 4: Design 2-Incentive Compatibility Scenarios: Difference 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 45.62 392.71 39.25 368.91 40.22 345.82 p = 250 47.11 374.02 45.55 355.38 46.59 342.67 p = 500 45.08 326.11 43.77 370.69 47.70 350.90Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.15able 5: Design 2-Incentive Compatibility Scenarios: Difference 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 48.63 97.92 59.58 94.09 58.65 85.68 p = 250 46.19 89.30 44.12 98.91 43.25 95.22 p = 500 55.57 105.84 41.61 90.43 40.95 92.95Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 6: Design 2-Incentive Compatibility Scenarios: Difference 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 50.48 51.03 44.02 45.72 44.42 45.07 p = 250 45.98 47.33 48.88 49.47 42.56 43.68 p = 500 48.84 48.92 44.08 44.79 41.45 42.18Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 7: Design 3-Incentive Compatibility Scenarios: Difference 5 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 26.22 2766.29 18.84 3004.57 16.77 3136.86 p = 250 24.77 2725.52 21.41 3000.16 20.35 3130.54 p = 500 34.32 2722.20 19.68 2981.42 17.98 3139.44Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.Table 8: Design 3-Incentive Compatibility Scenarios: Difference 2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 27.43 464.41 19.82 493.65 17.16 515.18 p = 250 23.49 454.13 19.76 488.40 16.14 507.25 p = 500 25.25 448.99 38.48 503.34 14.56 509.49Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.16able 9: Design 3-Incentive Compatibility Scenarios: Difference 0.2 n = 100 n = 200 n = 400Dimension Truth Report Truth Report Truth Report p = 100 24.16 28.51 16.77 20.07 14.35 19.63 p = 250 25.80 29.93 19.41 25.04 16.14 22.63 p = 500 27.11 30.47 19.97 25.11 18.24 22.52Note: ”Truth” refers to E [ X ′ n +1 ( ˆ β − β )] and ”Report” refers to E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] in Incentive Compatibility Definition. Smaller values of these average squared errors are desirable.incentive to act strategically in order to manipulate such automated systems? This strategic interactionwill become particularly important when these automated systems start playing a more prominent role inmedical decision-making or even in driving.This paper takes only a small preliminary step towards addressing this question by studying whether auser would want to lie to an automated system that uses Lasso to predict that user’s ideal outcome basedon her reported attributes. Our main contribution is showing that truthful reporting can be ensured byappropriately adjusting the tuning parameter to be larger than what is required for consistency. Our resultis also significant from a pure econometrics point of view: Just concentrating on oracle inequalities and post-selection inference can lead to a small tuning parameter, which in turn, can lead to model overfitting, whichthen introduces an incentive to misreport. If users have an incentive to provide false input to algorithms usedfor estimation and prediction, then it is no longer clear that one can rely on the output of these algorithms.17n the next part, Appendix A considers the proofs when p > n , and Appendix B considers the case p ≤ n ,and relaxing Assumption 2(iii). A Appendix A
A.1 Notation
In this section, we show some results that will help us in proofs. Define random vector of variables F i :=( F i , · · · , F ij , · · · , F ip ) ′ . Also define σ F := n ( max ≤ j ≤ p varF ij ), and M F := max ≤ i ≤ n max ≤ j ≤ p | F ij − EF ij | .Note that ˆ µ j := n − P ni =1 F ij , and µ j := EF ij . A.2 Maximal Inequalities
We use two assumptions that will provide us maximal inequalities.
Assumption A.1 . Assume F i are iid random vectors across i = 1 , , · · · , n with max ≤ j ≤ p varF ij bounded away from infinity uniformly in n . Assumption A.2 . Assume p EM F √ lnp √ n → . We use the following maximal inequality. With Assumption A.1, Lemma E.2(ii) of Chernozhukov et al.(2017) is: (see (A.2) of Caner and Kock (2019)) P (cid:20) max ≤ j ≤ p | ˆ µ j − µ j | ≥ E max ≤ j ≤ p | ˆ µ j − µ j | + tn (cid:21) ≤ exp ( − t / σ F ) + K EM F t , (A.1)for a constant K >
0. With Assumptions A.1-A.2 here, Caner and Kock (2019) or Lemma E.1 of Chernozhukov et al.(2017) provides E max ≤ j ≤ p | ˆ µ j − µ j | ≤ K [ √ lnp √ n + p EM F lnpn ]= O ( √ lnp √ n ) . (A.2)Define the sequence κ n = lnp . Set t = t n = ( nκ n ) / to have (A.1) as P (cid:20) max ≤ j ≤ p | ˆ µ j − µ j | ≥ E max ≤ j ≤ p | ˆ µ j − µ j | + √ κ n √ n (cid:21) ≤ exp ( − C κ n ) + K EM F nκ n = 1 p C + KEM F nlnp (A.3)where C >
0, is a positive constant.Now combine (A.2) with (A.3) to have 18 ( max ≤ j ≤ p | ˆ µ j − µ j | ≥ K [ √ lnp √ n + ( EM F ) / lnpn ] + √ lnp √ n ) ≤ p C + KEM F n ( lnp ) = o (1) , (A.4)by Assumptions A1-A.2. This shows also that, since EM F is nondecreasing in n max ≤ j ≤ p | ˆ µ j − µ j | = O p ( p lnp/ √ n ) . (A.5) A.2.1 Events
Before the assumptions, we need to define events that will be helpful. The first event is: A = ( (cid:13)(cid:13)(cid:13)(cid:13) u ′ Xn (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ λ n ) , (A.6)which controls the noise. This is the maximal correlation between regressors and errors. We want this tobe bounded with probability approaching one, and this upper bound, λ n , itself is converging to zero in ourproofs. We show that in Lemma A.2. So in large samples, this proof technique amounts to verification ofexogeneity of regressors. This is standard in high dimensional econometrics, for a recent analysis see LemmaA.4 of Caner and Kock (2018).We start with defining first population counterparts of restricted eigenvalue conditions and then showthe empirical version also. These are standard in high dimensional econometrics and statistics and can beseen from Assumption 1 of Caner and Kock (2018).We define the population adaptive restricted eigenvalue of Σ φ ( s ) = min ( δ ′ Σ δ k δ S k : δ ∈ R p − { } , k δ S c k ≤ √ s k δ S k , | S | ≤ s ) . (A.7)Note that if Σ = EX i X ′ i has full rank, the population adaptive restricted eigenvalue being positive is satisfiedby Assumption 1. Also instead of minimizing all over R p , we minimize vectors that satisfy k δ cS k ≤ k δ S k .Even in the cases that Σ does not have full rank, it is possible that minimal adaptive restricted eigenvaluecondition is satisfied due to optimization over a restricted set. The parameter δ will be related to structuralparameter β in the proofs.First define the empirical adaptive restricted eigenvalue condition, which is empirical counterpart of thepopulation version in Assumption 1:ˆ φ ( s ) = min ( δ ′ ˆΣ δ k δ S k : δ ∈ R p − { } , k δ S c k ≤ √ s k δ S k , | S | ≤ s ) . (A.8)We are interested in behavior of the minimal empirical adaptive restricted eigenvalue condition evaluatedfor set S at cardinality s . The second event is: A = n ˆ φ ( s ) ≥ φ ( s ) / o . (A.9)19mpirical adaptive restricted eigenvalue condition is needed since in case of p > n , X ′ X is singular and theminimal eigenvalue of X ′ X is zero. Empirical adaptive eigenvalue is over a restricted set which we proveto be positive, with probability approaching one, in Lemma A.3. This is also standard in high dimensionaleconometrics, see Lemma A.6 of Caner and Kock (2018). Set F = A ∩ A , and the complement event as F c . A.2.2 Proofs of Lemmata
The following four Lemmata are the intermediate results that are used for Theorems.
Lemma A.1.
Under the joint event F := {A ∩ A } we have k ˆ β − β k ≤ λ n s φ ( s ) . This is also valid uniformly over B l ( s ) = {k β k l ≤ s } . Proof of Lemma A.1 . Using ˆ β definition k Y − X ˆ β k n + 2 λ n p X j =1 | ˆ β j | ≤ k Y − Xβ k n + 2 λ n p X j =1 | β ,j | . Use the model Y = Xβ + u on the first left side term as well as the first right side term to simplify theinequality above combining with Holder’s Inequality k X ( ˆ β − β ) k n + 2 λ n p X j =1 | ˆ β j | ≤ (cid:12)(cid:12)(cid:12)(cid:12) u ′ Xn ( ˆ β − β ) (cid:12)(cid:12)(cid:12)(cid:12) + 2 λ n p X j =1 | β ,j |≤ k u ′ Xn k ∞ k ˆ β − β k + 2 λ n p X j =1 | β ,j | On the right side assuming we are on the event A k u ′ Xn k ∞ k ˆ β − β k ≤ λ n k ˆ β − β k . So we have k X ( ˆ β − β ) k n + 2 λ n p X j =1 | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n p X j =1 | β ,j | . Use k ˆ β k = k ˆ β S k + k ˆ β S c k on the second term for the left side of the inequality immediately above k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n p X j =1 | β ,j | − λ n X j ∈ S | ˆ β j | . By assumption of sparsity P j ∈ S c | β ,j | = 0, and using the reverse triangle inequality we have k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n k ˆ β − β k + 2 λ n X j ∈ S | ˆ β j − β ,j | . k ˆ β − β k = k ˆ β S − β ,S k + k ˆ β S c k for the first term on the right side of the inequalityimmediately above k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n X j ∈ S | ˆ β j − β ,j | . Use k ˆ β S − β ,S k ≤ √ s k ˆ β − β ,S k above on the right side to have k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n √ s k ˆ β S − β ,S k . (A.10)Ignoring the first term on the left of (A.10), (A.10) shows that we satisfy the restricted set condition inempirical adaptive restricted eigenvalue condition, so we have k ˆ β S c k ≤ √ s k ˆ β S − β ,S k . Using δ = ˆ β − β in the empirical adaptive restricted eigenvalue condition (A.8) in (A.10) k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ λ n √ s k X ′ ( ˆ β − β ) k n ˆ φ ˆΣ ( s ) . Then use 3 uv ≤ u / v / u = λ n √ s / ˆ φ ˆΣ ( s ), v = k X ( ˆ β − β ) k n to get k X ( ˆ β − β ) k n + λ n X j ∈ S c | ˆ β j | ≤ k X ( ˆ β − β ) k n λ n s ˆ φ ( s ) . Simplify above k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n s ˆ φ ( s ) . Use the event A we get the following k X ( ˆ β − β ) k n + 2 λ n X j ∈ S c | ˆ β j | ≤ λ n s φ ( s ) . This implies the oracle inequality k X ( ˆ β − β ) k n ≤ λ n s φ ( s ) . (A.11)To get to the l bound ignore the first term in (A.10) and add both sides λ n k ˆ β S − β ,S k to have λ n X j ∈ S c | ˆ β j | + λ n X j ∈ S | ˆ β j − β ,j | = λ n k ˆ β − β k ≤ λ n k ˆ β S − β ,S k + 3 λ n √ s k ˆ β S − β ,S k , by seeing also P j ∈ S c | β ,j | = 0. Now use the norm inequality k ˆ β S − β ,S k ≤ √ s k ˆ β S − β ,S k to have λ n k ˆ β − β k ≤ λ n √ s k ˆ β S − β ,S k . Use the empirical adaptive restricted eigenvalue condition with δ = ˆ β − β k ˆ β − β k ≤ √ s k X ( ˆ β − β ) k n ˆ φ ˆΣ ( s ) . A to have k ˆ β − β k ≤ √ s " √ λ n √ s φ Σ ( s ) φ ˆΣ ( s ) ≤ λ n s φ ( s ) . (A.12)Note that uniformity over B l ( s ) follows since the upper bound in (A.12) depends on β only through s . Q.E.DLemma A.2. (i). Under Assumption 1, and since κ n = lnpP ( A ) ≥ − exp ( − C κ n ) − KEM ( nκ n ) = 1 − p C − KEM nlnp (ii). Under added Assumption 2 to Assumption 1, P ( A ) → .(iii). Under added Assumption 2 to Assumption 1, λ n = O ( p lnp/n ) . Proof of Lemma A.2 . (i). Establish the probability bound on A via Assumption 1, using (A.3)(A.4)with F i = X i u i there and κ n = lnp , we have P ( A ) ≥ − exp ( − C κ n ) − K EM ( nκ n ) = 1 − p C − KEM nlnp , (A.13)with λ n = K [ r lnpn + p EM lnpn ] + r lnpn . (A.14)(ii). By Assumption 2, we have the proof.(iii). By Assumption 2, we have λ n = O ( p lnp/n ) . (A.15) Q.E.D.Lemma A.3.
Under Assumptions 1, 2, κ n = lnpP ( A ) ≥ − exp ( − C κ n ) − KEM ( nκ n ) = 1 − p C − KEM nlnp = 1 − o (1) . Proof of Lemma A.3 . Start with (cid:12)(cid:12)(cid:12)(cid:12) δ ′ X ′ Xn δ (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) δ ′ ( X ′ Xn − Σ + Σ) δ (cid:12)(cid:12)(cid:12)(cid:12) ≥ | δ ′ Σ δ | − | δ ′ ( ˆΣ − Σ) δ | . (A.16)The second term on the right side of (A.16) can be bounded by repeated application of Holders inequality | δ ′ ( ˆΣ − Σ) δ | ≤ k δ k k ˆΣ − Σ k ∞ . So (A.16) becomes | δ ′ ˆΣ δ | ≥ | δ ′ Σ δ | − k δ k k ˆΣ − Σ k ∞ . (A.17)22ow we digress a bit to simplify (A.17). Note that we have the restriction set definition k δ S c k ≤ √ s k δ S k , where we add k δ S k to both sides k δ k ≤ √ s k δ S k + k δ S k ≤ √ s k δ S k + √ s k δ S k = 4 √ s k δ S k , where we used the norm inequality k δ S k ≤ √ s k δ S k in the second inequality above. So we get k δ k k δ S k ≤ s . Now divide (A.17) by k δ S k > | δ ′ ˆΣ δ |k δ S k ≥ | δ ′ Σ δ |k δ S k − s k ˆΣ − Σ k ∞ . Minimize over δ on the both sides ˆ φ ( s ) ≥ φ ( s ) − s k ˆΣ − Σ k ∞ . (A.18)So if we can prove that with probability approaching one, 16 s k ˆΣ − Σ k ∞ ≤ φ ( s ) /
2, that will imply ofˆ φ ( s ) ≥ φ ( s ) / ǫ n = 16 s t , where t = K [ r lnp n + p EM lnp n ] + r lnpn . (A.19)By (A.3)(A.4), via Assumption 1 P [16 s k ˆΣ − Σ k ∞ > ǫ n ] = P [ k ˆΣ − Σ k ∞ > t ] ≤ exp ( − C lnp ) + KEM ( nlnp ) → , (A.20)where we use Assumption 2 for the probability tail converging to zero. Also see that by Assumption 2, ǫ n → s p lnp/n →
0. So we get, with probability approaching one, 16 s k ˆΣ − Σ k ∞ ≤ ǫ n ≤ φ ( s ) / P [ ˆ φ ( s ) ≥ φ ( s ) / ≥ − exp ( − C κ n ) − KEM ( nκ n )= 1 − p C − KEM nlnp = 1 − o (1) . (A.21) Q.E.D.
We need the following Lemma for the exception set F c := { A ∩ A } c upper bound probability.23 emma A.4. Under Assumptions 1, 2, with κ n = lnpP ( F c ) ≤ exp ( − C κ n ) + K [ EM + EM ]( nκ n )= 2 p C + K ( EM + EM ) nlnp = o (1) . Proof of Lemma A.4 .Now we provide an upper bound for the probability P ( F c ) in our case under Assumptions 1, 2, by usingLemmata A.2-A.3 P ( F c ) = P ( A ∩ A ) c = P ( A c ∪ A c ) ≤ P ( A c ) + P ( A c ) ≤ exp ( − C κ n ) + K [ EM + EM ]( nκ n )= 2 p C + K [ EM + EM ] nlnp → . (A.22) Q.E.D.A.2.3 New Oracle Inequality Proofs
We start with proof of Theorems 1-2, where they are used as inputs to proof of Theorem 3. Theorems 1-2consider the new oracle inequalities.
Proof of Theorem 1 . We proceed in several steps.Denote the joint event F = {A ∩ A } . F c is F ’s complement. See that E k ˆ β − β k k = E k ˆ β − β k k {F} + E k ˆ β − β k k {F c } . (A.23)We want to form rates for the right side terms in (A.23).Step 1. Note that by Lemma A.1, the first term on the right side of (A.23) is: E k ˆ β − β k k {F} = O ( s k λ kn ) . (A.24)Now we want to evaluate the second term on the right side of (A.23). But before that we need thefollowing intermediate step.Step 2. Use Nemirowski’s moment inequality, Lemma 14.24 in Buhlmann and van de Geer (2011), withfor all k ≥
1, for the first inequality, and for the second inequality by Loeve’s c r inequality, and for theequality we use u i being iid, also the definition of σ := Eu i , E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P ni =1 u i − σ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ [8 ln (2)] k/ E " P ni =1 ( u i ) n k/ ≤ Cn ( k/ − n k n X i =1 Eu ki = C [ Eu ki ] n − k/ = O ( n − k/ ) = o (1) ,
24y Assumption 1. Before the next result we provide the inequality, | x + y | k ≤ k − ( | x | k + | y | k ) , (A.25)for k ≥
1, and x, y being generic scalars, and σ being bounded above by Assumption 1 and using (A.25) E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 u i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k = E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( u i − σ ) + σ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ k − E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( u i − σ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k + ( σ ) k = O ( n − k/ ) + O (1) = O (1) . (A.26)Step 3. Now we have to form another l expectation bound for lasso that will be key to the sec-ond right side term analysis in (A.23). This step 3 modifies the proof of Theorem 1, supplement, p.4 ofJankova and van de Geer (2018). We extend their proof to non-sub-Gaussian case and show that theirbound is very conservative, and we provide a new less conservative bound. Start with the definition of lasso. k Y − X ˆ β k n + 2 λ n k ˆ β k ≤ k Y − Xβ k n + 2 λ n k β k . Ignore the first term and use the model u = Y − Xβ to have k ˆ β k ≤ k u k n λ n + k β k . Then use triangle inequality and then the inequality above k ˆ β − β k ≤ k ˆ β k + k β k ≤ k u k n λ n + 2 k β k . (A.27)Next taking the k th moment of the sampling error in l norm, and using (A.25) by taking expectationsthere for the second inequality below E k ˆ β − β k k ≤ E " k u k n λ n + 2 k β k k ≤ k − { E " k u k n λ n k + 2 k β k k } (A.28)We use the assumption k β k = O (1) to have k β k k ≤ ( √ s k β k ) k = O ( s k/ ) . (A.29)Then use the last equation with (A.26) in (A.28) to have E " k u k n λ n k + 2 k β k k = O ( λ − kn ) + O ( s k/ ) = O (max( s k/ , λ − kn )) . (A.30)Note that proof of Jankova and van de Geer (2018) use s k/ λ − kn but this is very conservative upper boundsince both two terms in multiplication is diverging with n . But a better bound is max( s k/ , λ − kn ).25e get the rough bound for expectation using (A.30) in (A.28) E k ˆ β − β k k = O (max( s k/ , λ − kn )) . (A.31)Note that rates in (A.24)(A.31) are different and the last rate in this step is a rough bound which will behelpful in the next step. The rate in (A.31) is diverging to infinity.Step 4. Rewrite the expectation using event F , F c . E k ˆ β − β k k = E k ˆ β − β k k {F} + E k ˆ β − β k k {F c } ≤ O ( s k λ kn ) + q E k ˆ β − β k k q E {F c } = O ( s k λ kn ) + O (max( s k/ , λ − kn )) p P ( F c ) (A.32)where we use (A.24) and Cauchy-Schwartz inequality for the first inequality, and the second equality is by(A.31).First possibility of a rate is (jointly holding): s k λ kn ≥ s k/ P ( F c ) / . (A.33) s k λ kn ≥ λ − kn P ( F c ) / . (A.34)By (A.32)(A.33)(A.34) E k ˆ β − β k k = O ( s k λ kn ) . We can simplify further (A.33)(A.34), respectively they are λ n ≥ P ( F c ) / k /s / , (A.35)and λ n ≥ P ( F c ) / k /s / . (A.36)Since P ( F c ) / k ≥ P ( F c ) / k , k ≥ λ n ≥ P ( F c ) / k /s / then E k ˆ β − β k k = O ( s k λ kn ) . (A.37)Of course there is another possibility-subcase that provides the rate in (A.37). That is when s k λ kn ≥ λ − kn P ( F c ) / , (A.38)jointly holding with λ − kn P ( F c ) / ≥ s k/ P ( F c ) / . (A.39)26his results in the same sufficient condition (A.36) via only (A.38) since by Assumption 2, s λ n → ≥ ( s λ n ) k /s k/ . So (A.39) is always satisfied with sufficiently large n . Notealso that joint inequalities s k λ kn ≥ s k/ P ( F c ) / jointly holding with s k/ P ( F c ) / ≥ λ − kn P ( F c ) / (A.40)is not possible since (A.40) implies s k/ λ kn ≥ s λ n ) k / ( s k/ ) ≥ s λ n → n .To combine all the results for the k th moment of the estimation error, for values of λ n ≥ P ( F c ) / k /s / , E k ˆ β − β k k = O ( s k λ kn ) . The uniformity over B l ( s ) follows since the rates in (A.24)(A.31)-(A.34) depends on β only by s . Q.E.D.
Remark. Proof of Theorem 1 in Jankova and van de Geer (2018), in their appendix, p.5, shows that theyuse assumption: λ n ≥ P ( F c ) / k s / , (A.41)which is equivalent to the following condition as shown in p.3 of proof of Theorem 1 in Jankova and van de Geer(2018) τ > kln [( √ s λ n ) − ] /lnp, given that λ n ≥ Cτ p lnp/n and C > , τ > P ( F c ) ≤ p ) τ / (A.42)by Lemma 7 in appendix of Jankova and van de Geer (2018). Our result and theirs are not comparable interms of λ n since they assume sub-Gaussian data, and ours is more general. Proof of Theorem 2 .We start with E k ˆ β k k = E k ˆ β k k {F} + E k ˆ β k k {F c } ≤ E k ˆ β k k {F} + q E k ˆ β k k p P ( F c ) , (A.43)by using Cauchy-Schwartz inequality. Then use triangle inequality on set F and by Lemma A.1, and norminequality to have k ˆ β k ≤ k ˆ β − β k + k β k ≤ λ n s φ ( s ) + √ s k β k = O p ( √ s ) , by Assumptions 1, 2. This last rate shows that E k ˆ β k k {F} = O ( s k/ ) . (A.44)27o handle the second right side term in (A.43) we start with the second inequality in (A.27) and ignore k β k in the middle to have k ˆ β k ≤ k u k n λ n + k β k . then follow (A.30) to get q E k ˆ β k k P ( F c ) / = O (max( s k/ , λ − kn )) P ( F c ) / = O ( λ − kn P ( F c ) / ) , (A.45)and to get the second equality by Assumption 2(ii) ( s λ n ) k /s k/ ≤ s k/ ≤ λ − kn with sufficiently large n .Now use (A.44) with (A.45) in (A.43) E k ˆ β k k = O ( s k/ ) + O ( λ − kn P ( F c ) / ) . (A.46)If λ n ≥ P ( F c ) / k /s / it is clear that s k/ ≥ λ − kn P ( F c ) / , (A.47)So by (A.47) in (A.46) we have the desired result. Q.E.D.Q.E.D.A.2.4 Main Theorem Proof: Incentive CompatibilityProof of Theorem 3 .By Theorem 1 and 2 we can choose the larger of λ n in those theorems, with s ≥
1, and since it isnondecreasing with n , λ n ≥ P ( F c ) / k s / ≥ P ( F c ) / k s / (A.48)Add and subtract X ′ n +1 ˆ β inside the right hand side of the incentive compatibility definition: E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˜ X ′ n +1 ˆ β − X ′ n +1 ˆ β + X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˜ X ′ n +1 ˆ β − X ′ n +1 ˆ β ] + E [ X ′ n +1 ˆ β − X ′ n +1 β ] + E [ ˆ β ′ ( ˜ X n +1 − X n +1 ) X ′ n +1 ( ˆ β − β )]+ E [( ˆ β − β ) ′ X n +1 ( ˜ X ′ n +1 − X ′ n +1 ) ˆ β ] . (A.49)Using the definition of incentive compatibility, with defining D n +1 := ˜ X n +1 − X n +1 , we have E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ D n +1 D ′ n +1 ˆ β ] (A.50)+ E [ ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β )] (A.51)+ E [( ˆ β − β ) ′ X n +1 D ′ n +1 ˆ β ] . (A.52)28ow analyze (A.51), the analysis of (A.52) is the same and thus omitted. See thatˆ β ′ D n +1 X ′ n +1 ( ˆ β − β ) ≤ | ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β ) |≤ | ˆ β ′ D n +1 || X ′ n +1 ( ˆ β − β ) |≤ k ˆ β k k D n +1 k ∞ k X n +1 k ∞ k ˆ β − β k , (A.53)where we use Holder’s inequality. Then E [ ˆ β ′ D n +1 X ′ n +1 ( ˆ β − β )] ≤ E h k ˆ β k k D n +1 k ∞ k X n +1 k ∞ k ˆ β − β k i (A.54) ≤ [ E k ˆ β ] / [ E k D n +1 k ∞ ] / [ E k X n +1 k ∞ ] / [ E k ˆ β − β k ] / (A.55)= [ E k ˆ β ] / [ EM ] / [ EM ] / [ E k ˆ β − β k ] / (A.56) where we apply (A.53) for the first inequality and Holder’s Inequality in the second inequality above, andthe last equality comes from M , M definitions. Then we apply Theorems 1-2 with k = 4. We assume λ n ≥ P ( F c ) / /s / and if s / r lnpn [ EM ] / [ EM ] / → , (A.57)we see that (A.56) goes to zero, by Theorems 1-2, and λ n = O ( q lnpn ).So looking at incentive compatibility definition and (A.50)-(A.52) E [ ˜ X ′ n +1 ˆ β − X ′ n +1 β ] − E [ X ′ n +1 ˆ β − X ′ n +1 β ] = E [ ˆ β ′ D n +1 D ′ n +1 ˆ β ] + o (1) , (A.58)where the first right side term in (A.58) is nonnegative and the other terms are negligible in large samplesby (A.57).The uniformity over B l ( s ) goes through since Theorems 1, 2 depend on β only through s , and theyare the main ingredient in the proof. Q.E.D.
B Appendix B
Here we consider results when p ≤ n , and relaxing Assumption 2(iii). B.1 When p ≤ n There are minor modifications in the proofs compared to p > n . We consider them here. One major changeis since p ≤ n , we set κ n = lnn . Change Assumption 2(ii) so that s p ln/n → p ≤ n , and combine (A.2) with (A.3) tohave with κ n = lnn in that case P ( max ≤ j ≤ p | ˆ µ j − µ j | ≥ K [ √ lnp √ n + ( EM F ) / lnpn ] + √ lnn √ n ) ≤ n C + EM F n ( lnn ) = o (1) , (B.1)29y Assumptions A1-A.2. To see this point EM F nlnn = ( EM F ) / √ lnp √ n ! √ lnn √ lnp = o (1) . (B.2)This shows also that max ≤ j ≤ p | ˆ µ j − µ j | = O p ( √ lnn/ √ n ) . (B.3)Lemma A.1 will be the same. Lemma A.2(i) lower bound probability has κ n = lnn now. Lemma A.2(ii)is the same. Lemma A.2(iii) will change to λ n = O ( √ lnn/ √ n ) . Lemma A.3 use κ n = lnn , so (A.19) becomes t = K [ p lnp √ n + p EM lnp n ] + √ lnnn . Lemma A.4 is the same with κ n = lnn .Given these results, the proof of Theorem 1 is the same with λ n = O ( q lnnn ). Theorem 2 does not change.Theorem 3 condition will be changing to s / r lnnn [ EM ] / [ EM ] / → , B.2 Relaxing Assumption 2(iii)
In this subsection we relax Assumption 2(iii) from k β k = O (1) to k β k = O ( √ s ) and we explain thelogic and meaning of this new assumption. Assumption 2(iv) . k β k = O ( √ s ) . Assumption 2(iii) which is suggested by Jankova and van de Geer (2018) and simplifies their paper insemiparametric efficient estimators. Our Assumption 2(iv) here generalizes that assumption and in the caseof s being constant becomes Assumption 2(iii). The implication of Assumption 2(iv) is that all nonzerocoefficients can be constant and none of them has to be local to zero. k β k = vuut p X j =1 β ,j = s X j ∈ S β ,j = O ( √ s ) . In terms of Section 2 discussion after Assumption 2, this implies S = F , and F is an empty set. SoAssumption 2(iv) can simultaneously allow s increasing with n , and all large nonzero coefficients in S .Previously in Assumption 2(iii), there can be only a fixed number of large coefficients, and increasing ( s − f )number of local to zero (small) coefficients.We proceed in a way that we only change the proofs in Appendix A, when necessary. All lemmata inAppendix A goes through, there is no usage of Assumption 2(iii) there. The first change comes in step 330f Theorem 1 proof. First (A.29) changes to k β k k = O ( s k ) under Assumption 2(iv) instead of Assumption2(iii). Then (A.30) becomes E " k u k n λ n k + 2 k β k k = O (max( s k , λ − kn )) . (B.4)Then (A.32) changes to following E k ˆ β − β k k = O ( s k λ kn ) + O (max( s k , λ − kn ) p P ( F c ) . (B.5)Instead of (A.33)(A.34) we have the following conditions, to establish the rate for the oracle inequality (i.e.mean l norm bound to k th order) s k λ kn ≥ s k P ( F c ) / . (B.6) s k λ kn ≥ λ − kn P ( F c ) / . (B.7)Using (B.5)-(B.7) E k ˆ β − β k k = O ( s k λ kn ) . (B.8)The conditions (B.6)(B.7) can be written as λ n ≥ max ( P ( F c ) / k , P ( F c ) / k /s / ) , (B.9)where the tuning parameter choice under Assumption 2(iv) which is (B.9) is larger than or equal to choiceby Assumption 2(iii), which is the second component in the max on the right side of (B.9). The discussionafter this in step 4 is the same, given Assumption 2(i)-(ii). So we have the following result: Corollary B.1 . Under Assumptions 1, 2(i)(ii)(iv), with λ n ≥ max ( P ( F c ) / k , P ( F c ) / k /s / ) . we have [ E k ˆ β − β k k ] /k = O ( s λ n ) . The result is also uniform over l ball B l Now we modify the proof of Theorem 2. In that respect, by Assumption 2(iv) the rate after (A.43)becomes k ˆ β k = O p ( s ) . (B.10)Then (A.46) changes to E k ˆ β k k = O ( s k ) + O ( λ − kn P ( F c ) / ) . (B.11)We can show that s k ≥ λ − kn P ( F c ) / , (B.12)if we have λ n ≥ P ( F c ) / k /s . (B.13)31hen given (B.13), using (B.12) in (B.11) we have E k ˆ β k k = O ( s k ) . So we established the following Corollary to Theorem 2. The result is different from Theorem 2 and the kth moment of l error grows faster here in Corollary B.2 if s increases with n . So relaxed assumption comeswith a cost that will affect main incentive compatibility condition. Corollary B.2 . Under Assumptions 1, 2(i)(ii)(iv), with λ n ≥ P ( F c ) / k /s . we have [ E k ˆ β k k ] /k = O ( s ) . The result is also uniform over l ball B l Now we follow the proof of Theorem 3 and substitute Assumption 2(iv) instead of Assumption 2(iii).Note that our λ n choice must choose the maximum of the ones in Corollary B.1 and B.2. Clearly CorollaryB.1 tuning parameter is larger than the one in Corollary B.2. The only place we have to change there is(A.57). Given λ n ≥ max ( P ( F c ) / , P ( F c ) / s / ) we need s r lnpn [ EM ] / [ EM ] / → , to have Incentive Compatibility in large samples. So we have the following counterpart to Theorem 3. Corollary B.3 . Under Assumptions 1, 2(i)(ii)(iv) and λ n ≥ max ( P ( F c ) / , P ( F c ) / s / ) , and s r lnpn [ EM ] / [ EM ] / → , lasso is Incentive Compatible. The result is also uniform over l ball B l . Clearly, there are two differences between Theorem 3 and Corollary B.3 here. First, we need a tuningparameter in Corollary B.3 which may be larger than or equal to the one in Theorem 3. Then, incentivecompatibility of lasso is more difficult to achieve, due to sparsity, s , having exponent of 2 here instead of3/2 in Theorem 3. References
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimalinstruments with an application to eminent domain.
Econometrica 80 , 2369–2429.32elloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment effects after selection amonghigh dimensional controls.
Review of Economic Studies 81 , 608–650.Buhlmann, P. and S. van de Geer (2011). Statistics for high-dimensional data.
Springer Verlag .Cai, Y., C. Daskalakis, and C. Papadimitrou (2015). Optimum statistical estimation with strategic datasources.
Proceedings of the 28 th Conference on Learning Theory 40 , 1–40.Caner, M. and A. B. Kock (2018). Asymptotically honest confidence regions for high dimensional parametersby the desparsified conservative lasso.
Journal of Econometrics 203 , 143–168.Caner, M. and A. B. Kock (2019). High dimensional linear gmm. arXiv:1811.08779 .Chernozhukov, V., D. Chetverikov, and K. Kato (2017). Central limit theorems and bootstrap in highdimensions.
Annals of Probability 45 , 2309–2452.Chernozhukov, V., M. Goldman, V. Semenova, and M. Taddy (2018). Orthogonal machine learning fordemand estimation: High dimensional causal inference in dynamic panels. arXiv:1712.09988 .Chiang, H. (2020). Many average partial effects: with an application to text regression.
Working Paper .Chiang, H. and Y. Sasaki (2019). Causal inference by quantile regression kink designs.
Journal of Econo-metrics 210 , 405–433.Cummings, R., S. Ioannidis, and K. Ligett (2015). Truthful linear regression.
Conference on LearningTheory 40 , 448–483.Dekel, O., F. Fischer, and A. Procaccia (2010). Incentive compatible regression learning.
Journal of ComputerSystem and Sciences 76 , 759–77.Eliaz, K. and R. Spiegler (2019). The model selection curse.
American Economic Review-Insights 1 , 127–140.Eliaz, K. and R. Spiegler (2020). On incentive compatible estimators.
Working Paper-Tel Aviv University .Gao, C., A. Van der Vaart, and H. Zhou (2015). A general framework for bayes structured linear models. arXiv:1506.02174 .Hardt, M., N. Megiddo, C. Papadimitrou, and M. Wooters (2016). Strategic classification.
Proceedings. ofthe ACM Conference on Innovations in. Theoretical Computer Science , 111–122.Jankova, J. and S. van de Geer (2018). Semi-parametric efficiency bounds for high-dimensional models.
Annals of Statistics 46 , 2336–2359.Kock, A. (2016). Oracle inequalities, variable selection and uniform inference in high-dimensional correlatedrandom effects panel data models.
Journal of Econometrics 195 , 71–85.33ock, A. and H. Tang (2019). Inference in high-dimensional dynamic panel data models.
EconometricTheory 35 , 295–359.Meir, R., A. Procaccia, and J. Rosenschein (2012). Algorithms for strategyproof classification.
ArtificialIntelligence 186 , 123–156.Perte, J. and J. Perote-Pena (2004). Strategy-proof estimators for simple regression.
Mathemitical SocialSciences 47 , 153–176.Shaywitz, D. (2020). ”the alignment problem” review: When machines miss the point.
The Wall StreetJournal , A25,25 October.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of The Royal StatisticalSociety Series B 58 , 267–288.van de Geer, S., P. B¨uhlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimal confidenceregions and tests for high-dimensional models.