Minimax rate of testing in sparse linear regression
Alexandra Carpentier, Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov, Yuhao Wang
Alexandra Carpentier (University of Magdeburg), Olivier Collier (Modal'X, Université Paris-Nanterre; CREST, ENSAE), Laëtitia Comminges (CEREMADE, Université Paris-Dauphine; CREST, ENSAE), Alexandre B. Tsybakov (CREST, ENSAE), Yuhao Wang (LIDS-IDSS, MIT)

Abstract:
We consider the problem of testing the hypothesis that the parameter of a linear regression model is 0 against an $s$-sparse alternative separated from 0 in the $\ell_2$-distance. We show that, in the Gaussian linear regression model with $p < n$, where $p$ is the dimension of the parameter and $n$ is the sample size, the non-asymptotic minimax rate of testing has the form $\sqrt{(s/n)\log(1+\sqrt{p}/s)}$. We also show that this is the minimax rate of estimation of the $\ell_2$-norm of the regression parameter.
Keywords and phrases: linear regression, sparsity, signal detection.
1. Introduction
This paper deals with testing hypotheses on the parameter of a linear regression model under sparse alternatives. This problem has various applications in genetics, signal transmission and detection, and compressed sensing. A detailed description of these applications can be found, for example, in Arias-Castro, Candes and Plan (2011). It is important to find optimal methods of testing in such a framework, and a natural approach is to define the notion of optimality in a minimax sense. The problem of testing under sparse alternatives in a minimax framework was first studied by Ingster (1997) and Donoho and Jin (2004), who considered the Gaussian mean model. These papers were dealing with an asymptotic setting under the assumption that the sparsity index scales as a power of the dimension. The non-asymptotic setting for the Gaussian mean model was analyzed by Baraud (2002), who established bounds on the minimax rate of testing up to a logarithmic factor. Finally, the exact non-asymptotic minimax testing rate for the Gaussian mean model is derived in Collier, Comminges and Tsybakov (2017). In this paper, we present an extension of the results of Collier, Comminges and Tsybakov (2017) to the linear regression model with Gaussian noise. Note that the problem of minimax testing for linear regression under sparse alternatives was already studied in Ingster, Tsybakov and Verzelen (2010), Arias-Castro, Candes and Plan (2011), and Verzelen (2012). Namely, Ingster, Tsybakov and Verzelen (2010) and Arias-Castro, Candes and Plan (2011) deal with an asymptotic setting under additional assumptions on the parameters of the problem, while Verzelen (2012) obtains non-asymptotic bounds up to a logarithmic factor in the spirit of Baraud (2002). Our aim here is to derive the non-asymptotic minimax rate of testing in the Gaussian linear regression model with no specific assumptions on the parameters of the problem. We give a solution to this problem when $p < n$, where $p$ is the dimension and $n$ is the sample size.

We consider the model
$$Y = X\theta + \sigma\xi, \qquad (1)$$
where $\sigma > 0$, $\xi \in \mathbb{R}^n$ is a vector of Gaussian white noise, i.e., $\xi \sim \mathcal{N}(0, I_n)$, $X$ is an $n \times p$ matrix with random entries, $I_n$ is the $n \times n$ identity matrix, and $\theta \in \mathbb{R}^p$ is an unknown parameter. In what follows, we assume everywhere that $X$ is independent of $\xi$.

The following notation will be used below. For $u = (u_1, \dots, u_p) \in \mathbb{R}^p$, we denote by $\|\cdot\|$ the $\ell_2$-norm, i.e., $\|u\|^2 = \sum_{i=1}^p u_i^2$, and by $\|\cdot\|_0$ the $\ell_0$ semi-norm, i.e., $\|u\|_0 = \sum_{i=1}^p \mathbf{1}_{u_i \neq 0}$, where $\mathbf{1}_{\{\cdot\}}$ is the indicator function. We denote by $\langle u, v\rangle = u^T v$ the inner product of $u, v \in \mathbb{R}^p$, and by $\lambda_{\min}(M)$ and $\mathrm{tr}[M]$ the minimal eigenvalue and the trace of a matrix $M \in \mathbb{R}^{p \times p}$. For an integer $s \in \{1, \dots, p\}$, we consider the set $B(s)$ of all $s$-sparse vectors in $\mathbb{R}^p$:
$$B(s) := \{u \in \mathbb{R}^p : \|u\|_0 \le s\}.$$
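For readers who wish to experiment, the following minimal Python sketch draws data from model (1) with a Gaussian design and an $s$-sparse parameter. The helper name simulate_model and the choice of equal nonzero entries are our illustrative assumptions (the latter anticipates the prior used in the lower bound of Section 4); they are not prescribed by the paper.

```python
import numpy as np

def simulate_model(n, p, s, tau, sigma=1.0, rng=None):
    """Draw (X, Y) from model (1) with an s-sparse theta of l2-norm tau.

    X has i.i.d. standard Gaussian entries (the design used in Section 2);
    theta puts equal mass tau/sqrt(s) on s randomly chosen coordinates.
    """
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))
    theta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    theta[support] = tau / np.sqrt(s)
    xi = rng.standard_normal(n)          # Gaussian white noise, xi ~ N(0, I_n)
    Y = X @ theta + sigma * xi
    return X, Y, theta
```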
Given the observations $(X, Y)$, we consider the problem of testing the hypothesis
$$H_0: \theta = 0 \quad \text{against the alternative} \quad H_1: \theta \in \Theta(s, \tau), \qquad (2)$$
where $\Theta(s, \tau) = \{\theta \in B(s) : \|\theta\| \ge \tau\}$ for some $s \in \{1, \dots, p\}$ and $\tau > 0$.

Let $\Delta = \Delta(X, Y)$ be a statistic with values in $\{0, 1\}$. We define the risk of the test based on $\Delta$ as the sum of the first type error and the maximum second type error:
$$\mathbf{P}_0(\Delta = 1) + \sup_{\theta \in \Theta(s,\tau)} \mathbf{P}_\theta(\Delta = 0),$$
where $\mathbf{P}_\theta$ denotes the joint distribution of $(X, Y)$ satisfying (1). The smallest possible value of this risk is equal to the minimax risk
$$\mathcal{R}_{s,\tau} := \inf_{\Delta} \Big\{ \mathbf{P}_0(\Delta = 1) + \sup_{\theta \in \Theta(s,\tau)} \mathbf{P}_\theta(\Delta = 0) \Big\},$$
where $\inf_\Delta$ is the infimum over all $\{0,1\}$-valued statistics. We define the minimax rate of testing on the class $B(s)$ with respect to the $\ell_2$-distance as a value $\lambda > 0$ for which the following two properties hold:

(i) (upper bound) for any $\varepsilon \in (0,1)$ there exists $A_\varepsilon > 0$ independent of $p, n, s, \sigma$ such that, for all $A > A_\varepsilon$,
$$\mathcal{R}_{s, A\lambda} \le \varepsilon, \qquad (3)$$

(ii) (lower bound) for any $\varepsilon \in (0,1)$ there exists $a_\varepsilon > 0$ independent of $p, n, s, \sigma$ such that, for all $0 < A < a_\varepsilon$,
$$\mathcal{R}_{s, A\lambda} \ge 1 - \varepsilon. \qquad (4)$$

Note that the rate $\lambda$ defined in this way is a non-asymptotic minimax rate of testing, as opposed to the classical asymptotic definition that can be found, for example, in Ingster and Suslina (2003). It is shown in Collier, Comminges and Tsybakov (2017) that when $X$ is the identity matrix and $p = n$ (which corresponds to the Gaussian sequence model), the non-asymptotic minimax rate of testing on the class $B(s)$ with respect to the $\ell_2$-distance has the following form:
$$\lambda = \begin{cases} \sigma\sqrt{s \log(1 + p/s^2)} & \text{if } s < \sqrt{p}, \\ \sigma p^{1/4} & \text{if } s \ge \sqrt{p}. \end{cases} \qquad (5)$$

For the regression model with random $X$ satisfying some strong assumptions, the asymptotic minimax rate of testing was derived in Ingster, Tsybakov and Verzelen (2010) when $n$, $p$, and $s$ tend to $\infty$ in such a way that $s = p^a$ for some $0 < a < 1$. In particular, when $X$ has i.i.d. standard normal entries, the asymptotic rate has the form
$$\lambda = \sigma \min\Big( \sqrt{\frac{s \log(p)}{n}},\; n^{-1/4},\; \frac{p^{1/4}}{\sqrt{n}} \Big). \qquad (6)$$
A similar result for a somewhat differently defined alternative $H_1$ is obtained in Arias-Castro, Candes and Plan (2011).

Below we show that non-asymptotically, and with no specific restriction on the parameters $n$, $p$ and $s$, the lower bound (ii) for the minimax rate of testing is valid with
$$\lambda = \sigma \min\Big( \sqrt{\frac{s \log(2 + p/s^2)}{n}},\; n^{-1/4},\; \frac{p^{1/4}}{\sqrt{n}} \Big) \qquad (7)$$
whenever $X$ is a matrix with isotropic distribution and independent subgaussian rows (the definitions of subgaussian and isotropic distributions will be given in Section 3). Furthermore, we show that the matching upper bound holds when $X$ is a matrix with i.i.d. standard Gaussian entries and $p < n$. Note that for $p < n$ the expression (7) takes the form
$$\lambda = \sigma \min\Big( \sqrt{\frac{s \log(2 + p/s^2)}{n}},\; \frac{p^{1/4}}{\sqrt{n}} \Big). \qquad (8)$$
It will also be useful to note that, since for $s \le \sqrt{p}$ the function $s \mapsto s \log(2 + p/s^2)$ is increasing and satisfies $\log(2 + p/s^2) \le 2\log(1 + p/s^2)$, the rate (8) can be equivalently (to within an absolute constant factor) written as
$$\lambda = \begin{cases} \sigma\sqrt{\frac{s \log(1 + p/s^2)}{n}} & \text{if } s < \sqrt{p}, \\ \sigma\, \frac{p^{1/4}}{\sqrt{n}} & \text{if } s \ge \sqrt{p}. \end{cases} \qquad (9)$$
This expression is analogous to (5). Finally, note that the rate can be written in the following more compact form:
$$\sigma \min\Big( \sqrt{\frac{s \log(2 + p/s^2)}{n}},\; \frac{p^{1/4}}{\sqrt{n}} \Big) \asymp \sigma \sqrt{\frac{s \log(1 + \sqrt{p}/s)}{n}}, \qquad (10)$$
where $\asymp$ denotes equivalence up to an absolute constant factor.
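The rate expressions (8) and (10) are easy to compare numerically. The snippet below is our own illustration, not part of the paper: it evaluates both forms and checks that their ratio stays bounded above and below by absolute constants over the whole range of $s$.

```python
import numpy as np

def testing_rate(s, n, p, sigma=1.0):
    """Non-asymptotic rate (8): sigma * min(sqrt(s log(2+p/s^2)/n), p^{1/4}/sqrt(n))."""
    return sigma * min(np.sqrt(s * np.log(2 + p / s**2) / n), p**0.25 / np.sqrt(n))

def compact_rate(s, n, p, sigma=1.0):
    """Equivalent compact form (10): sigma * sqrt(s log(1 + sqrt(p)/s) / n)."""
    return sigma * np.sqrt(s * np.log(1 + np.sqrt(p) / s) / n)

n, p = 10_000, 5_000
ratios = [testing_rate(s, n, p) / compact_rate(s, n, p) for s in range(1, p + 1)]
# The ratio stays within absolute constant bounds, as claimed in (10):
print(min(ratios), max(ratios))
```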
2. Upper bounds on the minimax rates
In this section, we assume that $X$ is a matrix with i.i.d. standard Gaussian entries and $p < n$, and we establish an upper bound on the minimax rate of testing in the form (9). This will be done by using a connection between testing and estimation of functionals. We first introduce an estimator $\hat{Q}$ of the quadratic functional $\|\theta\|^2$ and establish an upper bound on its risk. Then, we deduce from this result an upper bound for the risk of the estimator $\hat{N}$ of the norm $\|\theta\|$ defined as follows:
$$\hat{N} = \sqrt{\max(\hat{Q}, 0)}.$$
Finally, using $\hat{N}$ to define a test statistic, we obtain an upper bound on the minimax rate of testing.

Introduce the notation
$$\alpha_s = \mathbf{E}\big( Z^2 \,\big|\, Z^2 > 2\log(1 + p/s^2) \big),$$
where $Z$ is a standard normal random variable, and set
$$y_i = \{(X^T X)^{-1} X^T Y\}_i,$$
where $\{(X^T X)^{-1} X^T Y\}_i$ is the $i$th component of the least squares estimator $(X^T X)^{-1} X^T Y$. Note that the inverse $(X^T X)^{-1}$ exists almost surely since we assume in this section that $X$ is a matrix with i.i.d. standard Gaussian entries and $p < n$, so that $X$ is almost surely of full rank. We consider the following estimator of the quadratic functional $\|\theta\|^2$:
$$\hat{Q} := \begin{cases} \displaystyle\sum_{i=1}^p y_i^2 - \sigma^2 \mathrm{tr}[(X^T X)^{-1}] & \text{if } s \ge \sqrt{p}, \\[2mm] \displaystyle\sum_{i=1}^p \big[ y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 > 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} & \text{if } s < \sqrt{p}. \end{cases}$$
Here and below $(X^T X)^{-1}_{ij}$ denotes the $(i,j)$th entry of the matrix $(X^T X)^{-1}$. For any integers $n, p, s$ such that $s \le p$, set
$$\psi(s, p) = \begin{cases} \frac{s \log(1 + p/s^2)}{n} & \text{if } s < \sqrt{p}, \\ \frac{\sqrt{p}}{n} & \text{if } s \ge \sqrt{p}. \end{cases}$$
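To make the construction concrete, here is a minimal numerical sketch of $\hat{Q}$ and $\hat{N}$ (Python with NumPy/SciPy). The function names alpha_s, Q_hat, N_hat are ours, not from the paper; the code transcribes the definitions above directly and assumes $p < n$ so that $X^T X$ is invertible.

```python
import numpy as np
from scipy.stats import norm

def alpha_s(s, p):
    """alpha_s = E(Z^2 | Z^2 > 2 log(1 + p/s^2)) for Z ~ N(0,1), in closed form."""
    x = np.sqrt(2 * np.log(1 + p / s**2))
    q = norm.sf(x)                      # P(Z > x)
    return 1 + x * norm.pdf(x) / q      # E[Z^2 | |Z| > x] = 1 + x*phi(x)/(1 - Phi(x))

def Q_hat(X, Y, s, sigma):
    """Estimator of ||theta||^2 from Section 2 (requires p < n)."""
    n, p = X.shape
    G_inv = np.linalg.inv(X.T @ X)
    y = G_inv @ X.T @ Y                 # least squares estimator
    d = np.diag(G_inv)                  # diagonal entries (X^T X)^{-1}_{ii}
    if s >= np.sqrt(p):
        return np.sum(y**2) - sigma**2 * np.trace(G_inv)
    thr = 2 * sigma**2 * d * np.log(1 + p / s**2)
    keep = y**2 > thr                   # thresholding of the small coordinates
    return np.sum((y**2 - sigma**2 * d * alpha_s(s, p))[keep])

def N_hat(X, Y, s, sigma):
    """Estimator of the norm ||theta||: sqrt(max(Q_hat, 0))."""
    return np.sqrt(max(Q_hat(X, Y, s, sigma), 0.0))
```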
Theorem 1. Let $n, p, s$ be integers such that $s \le p$, $n \ge 9$, and $p \le \min(\gamma n, n - 8)$ for some constant $0 < \gamma < 1$. Let $r > 0$, $\sigma > 0$. Assume that all entries of the matrix $X$ are i.i.d. standard Gaussian random variables. Then there exists a constant $c > 0$ depending only on $\gamma$ such that
$$\sup_{\theta : \|\theta\|_0 \le s,\, \|\theta\| \le r} \mathbf{E}_\theta\big[ (\hat{Q} - \|\theta\|^2)^2 \big] \le c\Big( \frac{\sigma^2 r^2}{n} + \sigma^4 \psi^2(s, p) \Big).$$

The proof of Theorem 1 is given in Section 4.2. Arguing exactly in the same way as in the proof of Theorem 8 in Collier, Comminges and Tsybakov (2017), we deduce from Theorem 1 the following upper bound on the squared risk of the estimator $\hat{N}$.
Theorem 2. Let the assumptions of Theorem 1 be satisfied. Then there exists a constant $c' > 0$ depending only on $\gamma$ such that
$$\sup_{\theta \in B(s)} \mathbf{E}_\theta\big[ (\hat{N} - \|\theta\|)^2 \big] \le c' \sigma^2 \psi(s, p).$$

Theorem 2 implies that the test $\Delta^* = \mathbf{1}_{\{\hat{N} > A\lambda/2\}}$, where $\lambda = \sigma\sqrt{\psi(s, p)}$ (i.e., the same $\lambda$ as in (9)), satisfies
$$\mathbf{P}_0(\Delta^* = 1) + \sup_{\theta \in \Theta(s, A\lambda)} \mathbf{P}_\theta(\Delta^* = 0) \le \mathbf{P}_0(\hat{N} > A\lambda/2) + \sup_{\theta \in B(s)} \mathbf{P}_\theta(\hat{N} - \|\theta\| \le -A\lambda/2) \le \frac{2 \sup_{\theta \in B(s)} \mathbf{E}_\theta[(\hat{N} - \|\theta\|)^2]}{(A/2)^2 \lambda^2} \le C_* A^{-2}
$$
for some constant $C_* > 0$. Using this remark and choosing $A_\varepsilon = (C_*/\varepsilon)^{1/2}$ leads to the upper bound (i) that we defined in the previous section. We state this conclusion in the next theorem.
Theorem 3. Let the assumptions of Theorem 1 be satisfied and let $\lambda$ be defined by (9). Then, for any $\varepsilon \in (0,1)$, there exists $A_\varepsilon > 0$ depending only on $\varepsilon$ and $\gamma$ such that, for all $A > A_\varepsilon$, $\mathcal{R}_{s, A\lambda} \le \varepsilon$.
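As an illustration, the test $\Delta^*$ of Theorem 3 can be sketched as follows, reusing N_hat from the earlier sketch. The value A=8.0 is an arbitrary placeholder for a constant $A > A_\varepsilon$, which the paper does not make explicit.

```python
import numpy as np

def psi(s, p, n):
    """The function psi(s, p) from Section 2."""
    return s * np.log(1 + p / s**2) / n if s < np.sqrt(p) else np.sqrt(p) / n

def test_sparse_alternative(X, Y, s, sigma, A=8.0):
    """Test Delta* = 1{ N_hat > A*lambda/2 } with lambda = sigma*sqrt(psi(s,p)).

    A is a tuning constant playing the role of A > A_eps in Theorem 3;
    8.0 is our placeholder, not a value prescribed by the paper.
    """
    n, p = X.shape
    lam = sigma * np.sqrt(psi(s, p, n))
    return int(N_hat(X, Y, s, sigma) > A * lam / 2)   # 1 = reject H_0: theta = 0
```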
3. Lower bounds on the minimax rates
In this section, we assume that the distribution of the matrix $X$ is isotropic and has independent $\sigma_X$-subgaussian rows for some $\sigma_X > 0$. The isotropy of $X$ means that $\mathbf{E}_X(X^T X/n) = I_p$, where $\mathbf{E}_X$ denotes the expectation with respect to the distribution $\mathbf{P}_X$ of $X$.
Definition 1. Let $b > 0$. A real-valued random variable $\zeta$ is called $b$-subgaussian if
$$\mathbf{E}\exp(t\zeta) \le \exp(b^2 t^2/2), \quad \forall\, t \in \mathbb{R}.$$
A random vector $\eta$ with values in $\mathbb{R}^d$ is called $b$-subgaussian if all inner products $\langle \eta, v\rangle$ with vectors $v \in \mathbb{R}^d$ such that $\|v\| = 1$ are $b$-subgaussian random variables.
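A quick Monte Carlo sanity check of Definition 1 (our illustration): for $b = 1$ the bound is tight for $\mathcal{N}(0,1)$ and strict for Rademacher variables.

```python
import numpy as np

def mgf_ratio(samples, b, ts=np.linspace(-2, 2, 41)):
    """Empirically compare E exp(t*zeta) with exp(b^2 t^2 / 2) on a grid of t."""
    emp = np.array([np.exp(t * samples).mean() for t in ts])
    return np.max(emp / np.exp(b**2 * ts**2 / 2))

rng = np.random.default_rng(0)
print(mgf_ratio(rng.standard_normal(10**6), b=1.0))        # ~1: N(0,1) is 1-subgaussian
print(mgf_ratio(rng.choice([-1.0, 1.0], 10**6), b=1.0))    # < 1: Rademacher is 1-subgaussian
```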
The following theorem on the lower bound is non-asymptotic and holds with no restriction on the parameters $n, p, s$ except for the inevitable condition $s \le p$.

Theorem 4. Let $\varepsilon \in (0,1)$, $\sigma > 0$, and let the integers $n, p, s$ be such that $s \le p$. Assume that the distribution of the matrix $X$ is isotropic and $X$ has independent $\sigma_X$-subgaussian rows for some $\sigma_X > 0$. Then, there exists $a_\varepsilon > 0$ depending only on $\varepsilon$ and $\sigma_X$ such that, for
$$\tau = A\sigma \min\Big( \sqrt{\frac{s \log(2 + p/s^2)}{n}},\; n^{-1/4},\; \frac{p^{1/4}}{\sqrt{n}} \Big) \qquad (11)$$
with any $A$ satisfying $0 < A < a_\varepsilon$, we have $\mathcal{R}_{s,\tau} \ge 1 - \varepsilon$.

The proof of Theorem 4 is given in Section 4.4. The next corollary is an immediate consequence of Theorems 3 and 4.
Corollary 1
Let the assumptions of Theorem 1 be satisfied. Then the minimax rate of testing on the class $B(s)$ with respect to the $\ell_2$-distance is given by (8).

In addition, from Theorem 4, we get the following lower bound on the minimax risk of estimation of the $\ell_2$-norm $\|\theta\|$.
Theorem 5. Let the assumptions of Theorem 4 be satisfied, and let $\lambda$ be defined in (7). Then there exists a constant $c_* > 0$ depending only on $\sigma_X$ such that
$$\inf_{\hat{T}} \sup_{\theta \in B(s)} \mathbf{E}_\theta\big[ (\hat{T} - \|\theta\|)^2 \big] \ge c_* \lambda^2,$$
where $\inf_{\hat{T}}$ denotes the infimum over all estimators.

The result of Theorem 5 follows from Theorem 4 by noticing that, for $\tau$ in (11) and $\lambda$ in (7), we have $\tau = A\lambda$, and for any estimator $\hat{T}$,
$$\sup_{\theta \in B(s)} \mathbf{E}_\theta\big[ (\hat{T} - \|\theta\|)^2 \big] \ge \frac{1}{2}\Big[ \mathbf{E}_0[\hat{T}^2] + \sup_{\theta \in \Theta(s,\tau)} \mathbf{E}_\theta\big[ (\hat{T} - \|\theta\|)^2 \big] \Big] \ge \frac{\tau^2}{8}\Big[ \mathbf{P}_0(\hat{T} > \tau/2) + \sup_{\theta \in \Theta(s,\tau)} \mathbf{P}_\theta(\hat{T} \le \tau/2) \Big] \ge \frac{(A\lambda)^2}{8}\, \mathcal{R}_{s,\tau}.$$
Corollary 2. Let the assumptions of Theorem 1 be satisfied and let $\lambda$ be defined in (8). Then the minimax rate of estimation of the norm $\|\theta\|$ under the mean squared risk on the class $B(s)$ is equal to $\lambda$, that is,
$$c_* \lambda^2 \le \inf_{\hat{T}} \sup_{\theta \in B(s)} \mathbf{E}_\theta\big[ (\hat{T} - \|\theta\|)^2 \big] \le c' \lambda^2,$$
where $c_* > 0$ is an absolute constant and $c' > 0$ is a constant depending only on $\gamma$. This corollary is an immediate consequence of Theorems 2 and 5.
Remark 1
Inspection of the proofs in the Appendix reveals that the results of this section remain valid if we replace the $\ell_0$-ball $B(s)$ by the $\ell_0$-sphere $\bar{B}(s) = \{u \in \mathbb{R}^p : \|u\|_0 = s\}$.
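Theorems 1 and 2 lend themselves to simulation. The sketch below is ours and reuses simulate_model, N_hat and psi from the earlier sketches: it estimates by Monte Carlo the normalized risk $\mathbf{E}_\theta[(\hat{N} - \|\theta\|)^2]/(\sigma^2 \psi(s,p))$, which Theorem 2 bounds by a constant $c'$. The parameter values are arbitrary placeholders.

```python
import numpy as np

def mc_norm_estimation_risk(n, p, s, tau, sigma=1.0, reps=200, seed=0):
    """Monte Carlo estimate of E_theta[(N_hat - ||theta||)^2] / (sigma^2 psi(s,p))."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        X, Y, theta = simulate_model(n, p, s, tau, sigma, rng)
        errs.append((N_hat(X, Y, s, sigma) - np.linalg.norm(theta))**2)
    return np.mean(errs) / (sigma**2 * psi(s, p, n))

# Illustrative configuration satisfying p <= min(gamma*n, n-8) and s < sqrt(p):
print(mc_norm_estimation_risk(n=500, p=100, s=5, tau=0.5))
```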
4. Appendix
This section treats two main technical issues for the proof of Theorem 1. The first one is to control the expectation of a power of the smallest eigenvalue of the inverse empirical covariance matrix. The second issue is to control the errors for identifying non-zero entries in the sparse setting. For this, we need accurate bounds on the correlations between centred thresholded transformations of two correlated $\chi^2$ random variables. We first recall two general facts that we will use to solve the first issue. In what follows, we will denote by $C$ positive constants that can vary from line to line.

4.1. Lemmas for the proof of Theorem 1

Lemma 1 [Davidson and Szarek (2001), see also Vershynin (2012)]. Let $X$ satisfy the assumptions of Theorem 1. Let $\lambda_{\min}(\hat{\Sigma})$ denote the smallest eigenvalue of the sample covariance matrix $\hat{\Sigma} = \frac{1}{n} X^T X$. Then for any $t > 0$, with probability at least $1 - 2e^{-t^2/2}$ we have
$$1 - \sqrt{\frac{p}{n}} - \frac{t}{\sqrt{n}} \le \sqrt{\lambda_{\min}(\hat{\Sigma})} \le 1 + \sqrt{\frac{p}{n}} + \frac{t}{\sqrt{n}}.$$

Lemma 2 [(Tao and Vu, 2010, Lemma A4), see also (Bordenave and Chafaï, 2012, Lemma 4.14)]. Let $1 \le p \le n$, let $R_i$ be the $i$-th column of the matrix $X \in \mathbb{R}^{n \times p}$ and $R_{-i} = \mathrm{span}\{R_j : j \neq i\}$. If $X$ has full rank, then
$$(X^T X)^{-1}_{ii} = \mathrm{dist}(R_i, R_{-i})^{-2},$$
where $\mathrm{dist}(R_i, R_{-i})$ is the Euclidean distance of the vector $R_i$ to the space $R_{-i}$.
Lemma 3. Let $n \ge 9$ and $p \le \min(\gamma n, n - 8)$ for some constant $\gamma$ such that $0 < \gamma < 1$. Assume that all entries of the matrix $X \in \mathbb{R}^{n \times p}$ are i.i.d. standard Gaussian random variables. Then there exists a constant $c > 0$ depending only on $\gamma$ such that
$$\mathbf{E}[\lambda_{\min}^{-2}(\hat{\Sigma})] \le c. \qquad (12)$$

Proof.
Set $\beta = \frac{1 + \sqrt{\gamma}}{2}$. From the inequality $p \le \gamma n$ and Lemma 1 we have
$$\mathbf{P}\Big( \sqrt{\lambda_{\min}(\hat{\Sigma})} < 1 - \beta - \frac{t}{\sqrt{n}} \Big) \le \mathbf{P}\Big( \sqrt{\lambda_{\min}(\hat{\Sigma})} < 1 - \sqrt{\frac{p}{n}} - \frac{t}{\sqrt{n}} \Big) \le 2e^{-t^2/2}.$$
Taking here $t = \sqrt{n}(1-\beta)/2$ we get
$$\mathbf{P}\Big( \lambda_{\min}(\hat{\Sigma}) < \Big( \frac{1-\beta}{2} \Big)^2 \Big) \le 2\exp\Big( -\frac{n(1-\beta)^2}{8} \Big).$$
Using this inequality we obtain
$$\mathbf{E}[\lambda_{\min}^{-2}(\hat{\Sigma})] \le \Big( \frac{1-\beta}{2} \Big)^{-4} + \sqrt{\mathbf{E}[\lambda_{\min}^{-4}(\hat{\Sigma})]}\, \sqrt{2} \exp\Big( -\frac{n(1-\beta)^2}{16} \Big). \qquad (13)$$
We now bound the expectation $\mathbf{E}[\lambda_{\min}^{-4}(\hat{\Sigma})]$. Clearly,
$$\lambda_{\min}^{-4}(\hat{\Sigma}) \le \big( \mathrm{tr}[\hat{\Sigma}^{-1}] \big)^4. \qquad (14)$$
Lemma 2 implies that, almost surely,
$$\big( \mathrm{tr}[\hat{\Sigma}^{-1}] \big)^4 = n^4 \Big[ \sum_{i=1}^p \mathrm{dist}(R_i, R_{-i})^{-2} \Big]^4 \le n^4 p^3 \sum_{i=1}^p \mathrm{dist}(R_i, R_{-i})^{-8}.$$
Since the random variables $\mathrm{dist}(R_i, R_{-i})$ are identically distributed and $p \le n$, we have
$$\mathbf{E}\big[ \big( \mathrm{tr}[\hat{\Sigma}^{-1}] \big)^4 \big] \le n^8\, \mathbf{E}[\mathrm{dist}(R_1, R_{-1})^{-8}]. \qquad (15)$$
Finally we only need to bound $\mathbf{E}[\mathrm{dist}(R_1, R_{-1})^{-8}]$. If $S$ is a $(p-1)$-dimensional subspace of $\mathbb{R}^n$, then the random variable $\mathrm{dist}^2(R_1, S)$ has the chi-square distribution $\chi^2_{n-p+1}$ with $n - p + 1$ degrees of freedom. Hence, as $R_{-1}$ is a span of random vectors independent of $R_1$ and $R_{-1}$ is almost surely $(p-1)$-dimensional,
$$\mathbf{E}[\mathrm{dist}(R_1, R_{-1})^{-8}] = \mathbf{E}\Big[ \big( \chi^2_{n-p+1} \big)^{-4} \Big] = \frac{1}{(n-p-1)(n-p-3)(n-p-5)(n-p-7)} \le 1. \qquad (16)$$
Combining (13), (14), (15) and (16) we get
$$\mathbf{E}[\lambda_{\min}^{-2}(\hat{\Sigma})] \le \Big( \frac{1-\beta}{2} \Big)^{-4} + \sqrt{2}\, n^4 \exp\Big( -\frac{n(1-\beta)^2}{16} \Big),$$
which implies the lemma.

We now turn to the second issue of this section, that is, bounds on the correlations. We will use the following lemma about the tails of the standard normal distribution.
Lemma 4. For $\eta \sim \mathcal{N}(0,1)$ and any $x > 0$ we have
$$\frac{2\sqrt{2}}{\sqrt{\pi}\,(x + \sqrt{x^2 + 4})}\, e^{-x^2/2} \le \mathbf{P}(|\eta| > x) \le \frac{2\sqrt{2}}{\sqrt{\pi}\,(x + \sqrt{x^2 + 2})}\, e^{-x^2/2}, \qquad (17)$$
$$\mathbf{E}\big[ \eta^2 \mathbf{1}_{\{|\eta| > x\}} \big] \le \sqrt{\frac{2}{\pi}}\Big( x + \frac{2}{x} \Big) e^{-x^2/2}, \qquad (18)$$
$$\mathbf{E}\big[ \eta^4 \mathbf{1}_{\{|\eta| > x\}} \big] \le \sqrt{\frac{2}{\pi}}\Big( x^3 + 3x + \frac{3}{x} \Big) e^{-x^2/2}. \qquad (19)$$
Moreover, if $x \ge 1$, then
$$x^2 < \mathbf{E}\big[ \eta^2 \,\big|\, |\eta| > x \big] \le 5x^2. \qquad (20)$$
Inequalities (17)–(19) are given, e.g., in (Collier, Comminges and Tsybakov, 2017, Lemma 4), and (20) follows easily from (17) and (18).
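The inequalities of Lemma 4 are straightforward to verify numerically. The following check is our illustration, using SciPy, at a few values of $x$; all printed booleans should be True.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

for x in [1.0, 2.0, 4.0]:
    tail = 2 * norm.sf(x)                                        # P(|eta| > x)
    lo = 2 * np.sqrt(2/np.pi) / (x + np.sqrt(x**2 + 4)) * np.exp(-x**2/2)
    hi = 2 * np.sqrt(2/np.pi) / (x + np.sqrt(x**2 + 2)) * np.exp(-x**2/2)
    m2 = 2 * quad(lambda z: z**2 * norm.pdf(z), x, np.inf)[0]    # E[eta^2 1{|eta|>x}]
    bound2 = np.sqrt(2/np.pi) * (x + 2/x) * np.exp(-x**2/2)
    cond = m2 / tail                                             # E[eta^2 | |eta| > x]
    print(lo <= tail <= hi, m2 <= bound2, x**2 < cond <= 5 * x**2)
```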
Lemma 5. Let $(\eta, \zeta)$ be a Gaussian vector with mean 0 and covariance matrix
$$\Gamma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \quad 0 < \rho < 1.$$
Set $\alpha = \mathbf{E}[\eta^2 \,|\, |\eta| > x]$. Then there exists an absolute constant $C > 0$ such that, for any $x \ge 1$,
$$\mathbf{E}\big[ (\eta^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\eta| > x\}} \mathbf{1}_{\{|\zeta| > x\}} \big] \le C\rho^2 x^4 e^{-x^2/2}.$$

Proof.
From (20) we get that $\alpha \le 5x^2$. Thus, using (19) and the fact that $x \ge 1$,
$$\mathbf{E}\big[ (\zeta^2 - \alpha)^2 \mathbf{1}_{\{|\zeta| > x\}} \big] \le \mathbf{E}\big[ (\zeta^4 + \alpha^2) \mathbf{1}_{\{|\zeta| > x\}} \big] \le 26\, \mathbf{E}\big[ \zeta^4 \mathbf{1}_{\{|\zeta| > x\}} \big] \le C x^3 e^{-x^2/2}. \qquad (21)$$
Therefore,
$$\mathbf{E}\big[ (\eta^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\eta| > x\}} \mathbf{1}_{\{|\zeta| > x\}} \big] \le \frac{1}{2}\mathbf{E}\big[ (\eta^2 - \alpha)^2 \mathbf{1}_{\{|\eta| > x\}} \big] + \frac{1}{2}\mathbf{E}\big[ (\zeta^2 - \alpha)^2 \mathbf{1}_{\{|\zeta| > x\}} \big] \le C x^3 e^{-x^2/2}.$$
This proves the lemma for $\rho \ge 1/\sqrt{5}$, since then $x^3 \le 5\rho^2 x^4$. It remains to consider the case $0 < \rho < 1/\sqrt{5}$.
Note that, since $\alpha \le 5x^2$, for $0 < \rho < 1/\sqrt{5}$ we have $\rho < x/\sqrt{\alpha}$. The symmetry of the distribution of $(\eta, \zeta)$ implies
$$\mathbf{E}\big[ (\eta^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\eta| > x\}} \mathbf{1}_{\{|\zeta| > x\}} \big] = 2\, \mathbf{E}\big[ (\eta^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\eta| > x\}} \mathbf{1}_{\{\zeta > x\}} \big]. \qquad (22)$$
Now, we use the fact that $(\eta, \zeta) \stackrel{d}{=} (\rho\zeta + \sqrt{1 - \rho^2}\, Z, \zeta)$, where $\stackrel{d}{=}$ means equality in distribution and $Z$ is a standard Gaussian random variable independent of $\zeta$. Thus,
$$\mathbf{E}\big[ (\eta^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\eta| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] = \rho^2\, \mathbf{E}\big[ (\zeta^2 - \alpha)^2 \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] + 2\rho\sqrt{1 - \rho^2}\, \mathbf{E}\big[ \zeta Z (\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] + (1 - \rho^2)\, \mathbf{E}\big[ (Z^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big]. \qquad (23)$$
We now bound separately the three summands on the right-hand side of (23). For the first summand, using (21) we get the bound
$$\rho^2\, \mathbf{E}\big[ (\zeta^2 - \alpha)^2 \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] \le \rho^2\, \mathbf{E}\big[ (\zeta^2 - \alpha)^2 \mathbf{1}_{\{|\zeta| > x\}} \big] \le C\rho^2 x^3 e^{-x^2/2}. \qquad (24)$$
To bound the second summand, we first write
$$\mathbf{E}\big[ \zeta Z (\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] = \mathbf{E}\big[ \zeta (\zeta^2 - \alpha) g(\zeta) \mathbf{1}_{\{\zeta > x\}} \big], \qquad (25)$$
where $g(\zeta) := \mathbf{E}\big[ Z \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \,\big|\, \zeta \big]$. It is straightforward to check that
$$g(\zeta) = \frac{1}{\sqrt{2\pi}}\Big[ \exp\Big( -\frac{(x - \rho\zeta)^2}{2(1 - \rho^2)} \Big) - \exp\Big( -\frac{(x + \rho\zeta)^2}{2(1 - \rho^2)} \Big) \Big].$$
Thus $g(\zeta)$ is positive when $\zeta > x$. Therefore we have
$$\mathbf{E}\big[ \zeta (\zeta^2 - \alpha) g(\zeta) \mathbf{1}_{\{\zeta > x\}} \big] \le \mathbf{E}\big[ \zeta^3 g(\zeta) \mathbf{1}_{\{\zeta > x\}} \big]. \qquad (26)$$
In addition,
$$g(\zeta) = \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{(x - \rho\zeta)^2}{2(1 - \rho^2)} \Big)\Big( 1 - \exp\Big( -\frac{2x\rho\zeta}{1 - \rho^2} \Big) \Big) \le 1 - \exp\Big( -\frac{2x\rho\zeta}{1 - \rho^2} \Big) \le \frac{2x\rho\zeta}{1 - \rho^2}. \qquad (27)$$
Combining (25)–(27) with (19) and the fact that $\rho \le 1/\sqrt{5}$, we get
$$2\rho\sqrt{1 - \rho^2}\, \mathbf{E}\big[ \zeta Z (\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] \le C\rho^2 x^4 e^{-x^2/2}. \qquad (28)$$
We now consider the third summand on the right-hand side of (23). We will prove that
$$\mathbf{E}\big[ (Z^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] \le 0. \qquad (29)$$
We have
$$\mathbf{E}\big[ (Z^2 - \alpha)(\zeta^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \mathbf{1}_{\{\zeta > x\}} \big] = \mathbf{E}\big[ (\zeta^2 - \alpha) f(\zeta) \mathbf{1}_{\{\zeta > x\}} \big],$$
where
$$f(\zeta) := \mathbf{E}\big[ (Z^2 - \alpha) \mathbf{1}_{\{|\rho\zeta + \sqrt{1-\rho^2} Z| > x\}} \,\big|\, \zeta \big] = \frac{1}{\sqrt{2\pi}} \int_{\frac{x - \rho\zeta}{\sqrt{1-\rho^2}}}^{\infty} (z^2 - \alpha)\, e^{-z^2/2}\, dz + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-\frac{x + \rho\zeta}{\sqrt{1-\rho^2}}} (z^2 - \alpha)\, e^{-z^2/2}\, dz.$$
Note that $x < \sqrt{\alpha}$ by (20). In order to prove (29), it is enough to show that
$$\forall\, \zeta \in [x, \sqrt{\alpha}], \quad f(\zeta) \ge f(\sqrt{\alpha}), \qquad (30)$$
and
$$\forall\, \zeta \in [\sqrt{\alpha}, \infty), \quad f(\zeta) \le f(\sqrt{\alpha}). \qquad (31)$$
Indeed, assume that (30) and (31) hold. Then we have
$$\mathbf{E}\big[ (\zeta^2 - \alpha) f(\zeta) \mathbf{1}_{\{x < \zeta \le \sqrt{\alpha}\}} \big] \le \mathbf{E}\big[ (\zeta^2 - \alpha) f(\sqrt{\alpha}) \mathbf{1}_{\{x < \zeta \le \sqrt{\alpha}\}} \big] = -\mathbf{E}\big[ (\zeta^2 - \alpha) f(\sqrt{\alpha}) \mathbf{1}_{\{\zeta > \sqrt{\alpha}\}} \big] \le -\mathbf{E}\big[ (\zeta^2 - \alpha) f(\zeta) \mathbf{1}_{\{\zeta > \sqrt{\alpha}\}} \big],$$
where the equality is due to the fact that, by the symmetry of the normal distribution and the definition of $\alpha$,
$$\mathbf{E}\big[ (\zeta^2 - \alpha) \mathbf{1}_{\{\zeta > x\}} \big] = \frac{1}{2}\mathbf{E}\big[ (\zeta^2 - \alpha) \mathbf{1}_{\{|\zeta| > x\}} \big] = 0.$$
Thus, to finish the proof of the lemma, it remains to prove (30) and (31). We first establish (30), for which it is sufficient to show that $f'(\zeta) < 0$ for $\zeta \in [x, \sqrt{\alpha}]$. Since $0 < \rho < x/\sqrt{\alpha}$ and $x < \sqrt{\alpha}$, we have
$$\frac{(x - \rho y)^2}{1 - \rho^2} < \alpha \quad \text{for all } y \in [x, \sqrt{\alpha}]. \qquad (32)$$
Using (32) we obtain, for all $\zeta \in [x, \sqrt{\alpha}]$,
$$f'(\zeta) = \frac{\rho}{\sqrt{2\pi}\sqrt{1 - \rho^2}} \exp\Big( -\frac{(x + \rho\zeta)^2}{2(1 - \rho^2)} \Big)\Big[ \Big( \frac{(x - \rho\zeta)^2}{1 - \rho^2} - \alpha \Big) \exp\Big( \frac{2\rho x \zeta}{1 - \rho^2} \Big) - \Big( \frac{(x + \rho\zeta)^2}{1 - \rho^2} - \alpha \Big) \Big] \le \frac{\rho}{\sqrt{2\pi}\sqrt{1 - \rho^2}} \exp\Big( -\frac{(x + \rho\zeta)^2}{2(1 - \rho^2)} \Big)\Big[ \Big( \frac{(x - \rho\zeta)^2}{1 - \rho^2} - \alpha \Big) - \Big( \frac{(x + \rho\zeta)^2}{1 - \rho^2} - \alpha \Big) \Big] = -\frac{\rho}{\sqrt{2\pi}\sqrt{1 - \rho^2}} \exp\Big( -\frac{(x + \rho\zeta)^2}{2(1 - \rho^2)} \Big) \frac{4x\rho\zeta}{1 - \rho^2} < 0.$$
This implies (30). Finally, we prove (31).
To do this, it is enough to establish the following three facts:

(i) $f'$ is continuous and $f'(\sqrt{\alpha}) < 0$;
(ii) the equation $f'(y) = 0$ has at most one solution on $[\sqrt{\alpha}, +\infty)$;
(iii) $f(\infty) = \lim_{y \to \infty} f(y) \le f(\sqrt{\alpha})$.

Property (i) is already proved above. To prove (ii), we first observe that any solution of the equation $f'(y) = 0$ is also a solution of the equation $h(y) = 0$, where
$$h(y) := \Big( \frac{(x - \rho y)^2}{1 - \rho^2} - \alpha \Big)\Big( \exp\Big( \frac{2\rho x y}{1 - \rho^2} \Big) - 1 \Big) - \frac{4\rho x y}{1 - \rho^2}.$$
Next, let $y_1$ and $y_2$ be the solutions of the quadratic equation $\frac{(x - \rho y)^2}{1 - \rho^2} = \alpha$:
$$y_1 = \frac{x - \sqrt{\alpha(1 - \rho^2)}}{\rho} \quad \text{and} \quad y_2 = \frac{x + \sqrt{\alpha(1 - \rho^2)}}{\rho}.$$
Due to (32) we have $y_1 < \sqrt{\alpha} < y_2$. Thus, $h(y) < 0$ for $y \in (\sqrt{\alpha}, y_2]$. Next, on the interval $(y_2, +\infty)$ the function $h$ is strictly convex and $h(y_2) < 0$.
It follows that $h$ vanishes at most once on $(y_2, +\infty)$. Thus, (ii) is proved. It remains to show that
$$f(\sqrt{\alpha}) \ge f(\infty) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (z^2 - \alpha)\, e^{-z^2/2}\, dz.$$
Rewriting $f(\sqrt{\alpha})$ as
$$f(\sqrt{\alpha}) = f(\infty) - \frac{1}{\sqrt{2\pi}} \int_{-\frac{x + \rho\sqrt{\alpha}}{\sqrt{1-\rho^2}}}^{\frac{x - \rho\sqrt{\alpha}}{\sqrt{1-\rho^2}}} (z^2 - \alpha)\, e^{-z^2/2}\, dz,$$
we see that the inequality $f(\infty) \le f(\sqrt{\alpha})$ follows from (32). This proves item (iii) and thus (31). Therefore, the proof of (29) is complete. Combining (22), (23), (24), (28) and (29) yields the lemma.

4.2. Proof of Theorem 1

We consider separately the cases $s \ge \sqrt{p}$ and $s < \sqrt{p}$.

Case $s \ge \sqrt{p}$. From (1) we get that, almost surely,
$$(X^T X)^{-1} X^T Y = \theta + \tilde{\epsilon}, \quad \text{where } \tilde{\epsilon} = \sigma (X^T X)^{-1} X^T \xi.$$
Thus, we have
$$\mathbf{E}_\theta\big[ (\hat{Q} - \|\theta\|^2)^2 \big] = \mathbf{E}_\theta\big( 2\theta^T \tilde{\epsilon} + \|\tilde{\epsilon}\|^2 - \sigma^2 \mathrm{tr}\big[ (X^T X)^{-1} \big] \big)^2 \le 8\, \mathbf{E}_\theta\big( \theta^T \tilde{\epsilon} \big)^2 + 2\, \mathbf{E}_\theta\big( \|\tilde{\epsilon}\|^2 - \sigma^2 \mathrm{tr}\big[ (X^T X)^{-1} \big] \big)^2. \qquad (33)$$
Note that, conditionally on $X$, the random vector $\tilde{\epsilon}$ is normal with mean 0 and covariance matrix $\sigma^2 (X^T X)^{-1}$. Thus, conditionally on $X$, the random variable $\theta^T \tilde{\epsilon}$ is normal with mean 0 and variance $\sigma^2 \theta^T (X^T X)^{-1} \theta$. It follows that $\mathbf{E}_\theta\big( \theta^T \tilde{\epsilon} \big)^2 \le \sigma^2 r^2 \sqrt{\mathbf{E}\big[ \lambda_{\min}^{-2}(X^T X) \big]}$. Hence, applying Lemma 3 we have, for some constant $C$ depending only on $\gamma$,
$$\mathbf{E}_\theta\big( \theta^T \tilde{\epsilon} \big)^2 \le \frac{C\sigma^2 r^2}{n}. \qquad (34)$$
Consider now the second term on the right-hand side of (33). Denote by $(\lambda_i, u_i)$, $i = 1, \dots, p$, the eigenvalues and the corresponding orthonormal eigenvectors of $(X^T X)^{-1}$. Set $v_i = \sqrt{\lambda_i}\, u_i^T X^T \xi$. We have
$$\mathbf{E}_\theta\big( \|\tilde{\epsilon}\|^2 - \sigma^2 \mathrm{tr}\big[ (X^T X)^{-1} \big] \big)^2 = \sigma^4\, \mathbf{E}\Big( \sum_{i=1}^p \lambda_i [v_i^2 - 1] \Big)^2.$$
Conditionally on $X$, the random variables $v_1, \dots, v_p$ are i.i.d. standard Gaussian. Using this fact and Lemma 3 we get that, for some constant $C$ depending only on $\gamma$,
$$\mathbf{E}_\theta\big( \|\tilde{\epsilon}\|^2 - \sigma^2 \mathrm{tr}\big[ (X^T X)^{-1} \big] \big)^2 = 2\sigma^4\, \mathbf{E}\Big( \sum_{i=1}^p \lambda_i^2 \Big) \le 2p\sigma^4\, \mathbf{E}\big[ \lambda_{\min}^{-2}(X^T X) \big] \le \frac{C\sigma^4 p}{n^2}. \qquad (35)$$
Combining (33), (34) and (35) we obtain the result of the theorem for $s \ge \sqrt{p}$.

Case $s < \sqrt{p}$. Set $S = \{i : \theta_i \neq 0\}$. We have
$$\mathbf{E}_\theta\big( \hat{Q} - \|\theta\|^2 \big)^2 \le 3\, \mathbf{E}_\theta\Big( \sum_{i \in S} \big( y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s - \theta_i^2 \big) \Big)^2 + 3\, \mathbf{E}_\theta\Big( \sum_{i \in S} \big[ y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 \le 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} \Big)^2 + 3\, \mathbf{E}_\theta\Big( \sum_{i \notin S} \big[ \tilde{\epsilon}_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 > 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} \Big)^2, \qquad (36)$$
where $\tilde{\epsilon}_i$ denotes the $i$th component of $\tilde{\epsilon}$. We now establish upper bounds for the three terms on the right-hand side of (36). For the first term, observe that
$$\mathbf{E}_\theta\Big( \sum_{i \in S} \big( y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s - \theta_i^2 \big) \Big)^2 \le 8\, \mathbf{E}_\theta\Big( \sum_{i \in S} \theta_i \tilde{\epsilon}_i \Big)^2 + 2\, \mathbf{E}_\theta\Big( \sum_{i \in S} \big( \tilde{\epsilon}_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big) \Big)^2. \qquad (37)$$
The second summand on the right-hand side of (37) satisfies
$$\mathbf{E}_\theta\Big( \sum_{i \in S} \big( \tilde{\epsilon}_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big) \Big)^2 \le \sigma^4 (\alpha_s + 3)^2\, \mathbf{E}\Big[ \sum_{i \in S} \sum_{j \in S} (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj} \Big] \le \sigma^4 (\alpha_s + 3)^2 s^2\, \mathbf{E}\big[ \lambda_{\min}^{-2}(X^T X) \big]. \qquad (38)$$
From (20) we obtain
$$\alpha_s \le 10 \log(1 + p/s^2). \qquad (39)$$
Thus, using (37), (38) and (39) together with Lemma 3 and (34) we find
$$\mathbf{E}_\theta\Big( \sum_{i \in S} \big( y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s - \theta_i^2 \big) \Big)^2 \le C\Big( \frac{\sigma^2 r^2}{n} + \frac{\sigma^4 s^2 \log^2(1 + p/s^2)}{n^2} \Big), \qquad (40)$$
where the constant $C$ depends only on $\gamma$. The second term on the right-hand side of (36) is smaller, up to an absolute constant factor, than
$$\sigma^4\, \mathbf{E}\Big[ \sum_{i \in S} \sum_{j \in S} (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj} \Big] \big( \alpha_s^2 + 4\log^2(1 + p/s^2) \big).$$
Arguing as in (38) and applying Lemma 3 and (39) we get that, for some constant $C$ depending only on $\gamma$,
$$\mathbf{E}_\theta\Big( \sum_{i \in S} \big[ y_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 \le 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} \Big)^2 \le \frac{C\sigma^4 s^2 \log^2(1 + p/s^2)}{n^2}. \qquad (41)$$
For the third term on the right-hand side of (36), we have
$$\mathbf{E}_\theta\Big( \sum_{i \notin S} \big[ \tilde{\epsilon}_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 > 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} \Big)^2 = \sigma^4 \sum_{i \notin S} \sum_{j \notin S} \mathbf{E}\Big( (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj} (\tilde{\xi}_i^2 - \alpha_s)(\tilde{\xi}_j^2 - \alpha_s) \mathbf{1}_{\{|\tilde{\xi}_i| > x\}} \mathbf{1}_{\{|\tilde{\xi}_j| > x\}} \Big), \qquad (42)$$
where
$$x = \sqrt{2\log(1 + p/s^2)}, \qquad \tilde{\xi}_i = \frac{\tilde{\epsilon}_i}{\sigma\sqrt{(X^T X)^{-1}_{ii}}}.$$
Note that $\mathbf{E}(\tilde{\xi}_i^2 \,|\, X) = \mathbf{E}(\tilde{\xi}_j^2 \,|\, X) = 1$ and, conditionally on $X$, $(\tilde{\xi}_i, \tilde{\xi}_j) \in \mathbb{R}^2$ is a centered Gaussian vector with covariance
$$\rho_{ij} = \frac{(X^T X)^{-1}_{ij}}{\sqrt{(X^T X)^{-1}_{ii}}\sqrt{(X^T X)^{-1}_{jj}}}.$$
Using Lemma 5 we obtain that, for some absolute positive constant $C$,
$$\sum_{i \notin S} \sum_{j \notin S} \mathbf{E}\Big( (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj} (\tilde{\xi}_i^2 - \alpha_s)(\tilde{\xi}_j^2 - \alpha_s) \mathbf{1}_{\{|\tilde{\xi}_i| > x\}} \mathbf{1}_{\{|\tilde{\xi}_j| > x\}} \Big) = \sum_{i \notin S} \sum_{j \notin S} \mathbf{E}\Big( (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj}\, \mathbf{E}\big[ (\tilde{\xi}_i^2 - \alpha_s)(\tilde{\xi}_j^2 - \alpha_s) \mathbf{1}_{\{|\tilde{\xi}_i| > x\}} \mathbf{1}_{\{|\tilde{\xi}_j| > x\}} \,\big|\, X \big] \Big) \le C \sum_{i,j=1}^p \mathbf{E}\big[ (X^T X)^{-1}_{ii} (X^T X)^{-1}_{jj} \rho_{ij}^2 \big]\, x^4 e^{-x^2/2} = C\, \mathbf{E}\big[ \|(X^T X)^{-1}\|_F^2 \big]\, x^4 e^{-x^2/2} \le C\, \mathbf{E}\big[ \|(X^T X)^{-1}\|_F^2 \big]\, \frac{s^2}{p} \log^2(1 + p/s^2) \le C\, \mathbf{E}\big[ \lambda_{\min}^{-2}(X^T X) \big]\, s^2 \log^2(1 + p/s^2),$$
where $\|(X^T X)^{-1}\|_F$ is the Frobenius norm of the matrix $(X^T X)^{-1}$. Finally, Lemma 3, (42) and the last display imply that, for some constant $C$ depending only on $\gamma$,
$$\mathbf{E}_\theta\Big( \sum_{i \notin S} \big[ \tilde{\epsilon}_i^2 - \sigma^2 (X^T X)^{-1}_{ii} \alpha_s \big] \mathbf{1}_{\{y_i^2 > 2\sigma^2 (X^T X)^{-1}_{ii} \log(1 + p/s^2)\}} \Big)^2 \le \frac{C\sigma^4 s^2 \log^2(1 + p/s^2)}{n^2}. \qquad (43)$$
The proof is completed by combining (36), (40), (41) and (43).

4.3. Lemmas for the proof of Theorem 4

We first recall some general facts about lower bounds for the risks of tests. Let $\Theta$ be a measurable set, not necessarily the set $\Theta(s, \tau)$, and let $\mu$ be a probability measure on $\Theta$. Consider any family of probability measures $\mathbf{P}_\theta$ indexed by $\theta \in \Theta$. Denote by $\mathbf{P}_\mu$ the mixture probability measure
$$\mathbf{P}_\mu = \int_\Theta \mathbf{P}_\theta\, \mu(d\theta).$$
Let
$$\chi^2(P', P) = \int \Big( \frac{dP'}{dP} \Big)^2 dP - 1$$
denote the chi-square divergence between two probability measures $P'$ and $P$ if $P' \ll P$, and set $\chi^2(P', P) = +\infty$ otherwise. The following lemma is a key tool in Le Cam's method of proving lower bounds (see, e.g., (Collier, Comminges and Tsybakov, 2017, Lemma 3)).
Lemma 6. Let $\mu$ be a probability measure on $\Theta$, and let $\{\mathbf{P}_\theta : \theta \in \Theta\}$ be a family of probability measures indexed by $\theta \in \Theta$ on a measurable space $\mathcal{X}$. Then, for any probability measure $Q$ on $\mathcal{X}$,
$$\inf_\Delta \Big\{ Q(\Delta = 1) + \sup_{\theta \in \Theta} \mathbf{P}_\theta(\Delta = 0) \Big\} \ge 1 - \frac{1}{2}\sqrt{\chi^2(\mathbf{P}_\mu, Q)},$$
where $\inf_\Delta$ is the infimum over all $\{0,1\}$-valued statistics.

Applying Lemma 6 with $Q = \mathbf{P}_0$, we see that it suffices to choose a suitable measure $\mu$ and to bound $\chi^2(\mathbf{P}_\mu, \mathbf{P}_0)$ from above by a small enough value in order to obtain the desired lower bound on $\mathcal{R}_{s,\tau}$. The following lemma is useful to evaluate $\chi^2(\mathbf{P}_\mu, \mathbf{P}_0)$.
Lemma 7. Let $\mu$ be a probability measure on $\Theta$, and let $\{\mathbf{P}_\theta : \theta \in \Theta\}$ be a family of probability measures indexed by $\theta \in \Theta$ on $\mathcal{X}$. Let $Q$ be a probability measure on $\mathcal{X}$ such that $\mathbf{P}_\theta \ll Q$ for all $\theta \in \Theta$. Then,
$$\chi^2(\mathbf{P}_\mu, Q) = \mathbf{E}_{(\theta, \theta') \sim \mu}\Big( \int \frac{d\mathbf{P}_\theta\, d\mathbf{P}_{\theta'}}{dQ} \Big) - 1.$$
Here, $\mathbf{E}_{(\theta, \theta') \sim \mu}$ denotes the expectation with respect to the distribution of the pair $(\theta, \theta')$, where $\theta$ and $\theta'$ are independent and each of them is distributed according to $\mu$.

Proof.
It suffices to note that
$$\chi^2(\mathbf{P}_\mu, Q) = \int \frac{(d\mathbf{P}_\mu)^2}{dQ} - 1, \quad \text{and} \quad \int \frac{(d\mathbf{P}_\mu)^2}{dQ} = \int \frac{\int_\Theta d\mathbf{P}_\theta\, \mu(d\theta) \int_\Theta d\mathbf{P}_{\theta'}\, \mu(d\theta')}{dQ} = \int_\Theta \int_\Theta \mu(d\theta)\, \mu(d\theta') \int \frac{d\mathbf{P}_\theta\, d\mathbf{P}_{\theta'}}{dQ}.$$

We now specify the expression for the $\chi^2$ divergence in Lemma 7 when $\mathbf{P}_\theta$ is the probability distribution generated by model (1) and $Q = \mathbf{P}_0$.
Lemma 8. Let $\mathbf{P}_\theta$ be the distribution of $(X, Y)$ satisfying (1). Then,
$$\chi^2(\mathbf{P}_\mu, \mathbf{P}_0) = \mathbf{E}_{(\theta, \theta') \sim \mu}\, \mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle / \sigma^2 \big) - 1.$$

Proof.
We apply Lemma 7 and notice that, for any $(\theta, \theta') \in \Theta \times \Theta$,
$$\int \frac{d\mathbf{P}_\theta\, d\mathbf{P}_{\theta'}}{d\mathbf{P}_0} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, \mathbf{E}_X \int_{\mathbb{R}^n} \exp\Big( -\frac{1}{2\sigma^2}\big( \|y - X\theta\|^2 + \|y - X\theta'\|^2 - \|y\|^2 \big) \Big)\, dy = \frac{1}{(2\pi\sigma^2)^{n/2}}\, \mathbf{E}_X \int_{\mathbb{R}^n} \exp\Big( -\frac{1}{2\sigma^2}\big( \|y\|^2 - 2\langle y, X(\theta + \theta') \rangle + \|X(\theta + \theta')\|^2 - 2\langle X\theta, X\theta' \rangle \big) \Big)\, dy = \mathbf{E}_X\Big( \frac{\exp(\langle X\theta, X\theta' \rangle / \sigma^2)}{(2\pi\sigma^2)^{n/2}} \int_{\mathbb{R}^n} \exp\Big( -\frac{1}{2\sigma^2}\|y - X(\theta + \theta')\|^2 \Big)\, dy \Big) = \mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle / \sigma^2 \big).$$
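Lemma 8 reduces the $\chi^2$ divergence to a double expectation that is easy to approximate by simulation. The following crude Monte Carlo sketch is ours: it evaluates the divergence for the sparse prior $\mu_\tau$ used in Section 4.4 and a Gaussian design; when $\tau$ is a small multiple of the rate (11), the printed value should be close to 0.

```python
import numpy as np

def chi2_mixture_vs_null(n, p, s, tau, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo of chi^2(P_mu, P_0) via Lemma 8:
    E_{(theta,theta') ~ mu_tau} E_X exp(<X theta, X theta'>/sigma^2) - 1,
    with mu_tau putting s nonzero entries equal to tau/sqrt(s)."""
    rng = np.random.default_rng(seed)
    vals = np.empty(reps)
    for r in range(reps):
        th, th2 = np.zeros(p), np.zeros(p)
        th[rng.choice(p, s, replace=False)] = tau / np.sqrt(s)
        th2[rng.choice(p, s, replace=False)] = tau / np.sqrt(s)
        X = rng.standard_normal((n, p))
        vals[r] = np.exp((X @ th) @ (X @ th2) / sigma**2)
    return vals.mean() - 1.0

n, p, s = 200, 100, 3
tau = 0.1 * min(np.sqrt(s * np.log(2 + p/s**2) / n), n**-0.25, p**0.25 / np.sqrt(n))
print(chi2_mixture_vs_null(n, p, s, tau))   # small for small multiples of the rate
```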
Lemma 9. Let $a \in \mathbb{R}$ be a constant and let $W$ be a random variable. Then,
$$\mathbf{E}\exp(W) \le \exp(a)\Big( 1 + \int_0^\infty e^t\, p(t)\, dt \Big), \quad \text{where } p(t) = \mathbf{P}\big( |W - a| \ge t \big).$$

Proof.
We have
$$\mathbf{E}\exp(W) \le \exp(a)\, \mathbf{E}\exp(|W - a|) = \exp(a) \int_0^\infty \mathbf{P}\big( \exp(|W - a|) \ge x \big)\, dx = \exp(a)\Big[ 1 + \int_1^\infty \mathbf{P}\big( \exp(|W - a|) \ge x \big)\, dx \Big] = \exp(a)\Big[ 1 + \int_0^\infty e^t\, p(t)\, dt \Big],$$
where the last equality follows from the change of variables $x = e^t$.
Lemma 10. Assume that the matrix $X$ has an isotropic distribution with independent $\sigma_X$-subgaussian rows for some $\sigma_X > 0$. Then, for all $x > 0$ and all $\theta, \theta' \in \mathbb{R}^p$ we have
$$\mathbf{P}_X\Big( \big| \langle X\theta, X\theta' \rangle - n\langle \theta, \theta' \rangle \big| \ge \|\theta\| \|\theta'\| x \Big) \le 6\exp\big( -C_0 \min(x, x^2/n) \big),$$
where the constant $C_0 > 0$ depends only on $\sigma_X$.

Proof.
By homogeneity, it is enough to consider the case $\|\theta\| = \|\theta'\| = 1$, which will be assumed in the rest of the proof. Then we have
$$2\langle X\theta, X\theta' \rangle = \|X\theta\|^2 + \|X\theta'\|^2 - \|X(\theta - \theta')\|^2, \qquad 2\langle \theta, \theta' \rangle = 2 - \|\theta - \theta'\|^2,$$
which implies
$$\Big| \frac{1}{n}\langle X\theta, X\theta' \rangle - \langle \theta, \theta' \rangle \Big| \le \frac{1}{2}\Big( \Big| \frac{1}{n}\|X\theta\|^2 - 1 \Big| + \Big| \frac{1}{n}\|X\theta'\|^2 - 1 \Big| + \Big| \frac{1}{n}\|X(\theta - \theta')\|^2 - \|\theta - \theta'\|^2 \Big| \Big). \qquad (44)$$
By renormalizing, the third summand on the right-hand side of (44) is reduced to the same form as the first two summands. Thus, to prove the lemma it suffices to show that
$$\mathbf{P}_X\Big( \Big| \frac{1}{n}\|X\theta\|^2 - 1 \Big| \ge v \Big) \le 2\exp\big( -C' \min(v, v^2)\, n \big), \quad \forall\, v > 0,\ \|\theta\| = 1, \qquad (45)$$
where the constant $C' > 0$ depends only on $\sigma_X$. Denote by $x_i$ the $i$th row of the matrix $X$. Then
$$\frac{1}{n}\|X\theta\|^2 - 1 = \frac{1}{n}\sum_{i=1}^n (Z_i^2 - 1),$$
where $Z_i = x_i^T \theta$ are independent $\sigma_X$-subgaussian random variables such that $\mathbf{E}(Z_i^2) = 1$ for $i = 1, \dots, n$. Therefore, $Z_i^2 - 1$, $i = 1, \dots, n$, are independent centered sub-exponential random variables, and (45) immediately follows from Bernstein's inequality for sub-exponential random variables (cf., e.g., Vershynin (2012), Corollary 5.17).
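A simple simulation (ours) illustrates Lemma 10: for a Gaussian, hence isotropic and subgaussian, design, the deviations $|\langle X\theta, X\theta' \rangle - n\langle \theta, \theta' \rangle|$ are of order $\|\theta\|\|\theta'\|\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 400, 50, 5000
theta = rng.standard_normal(p);  theta /= np.linalg.norm(theta)
theta2 = rng.standard_normal(p); theta2 /= np.linalg.norm(theta2)
devs = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))      # isotropic design with subgaussian rows
    devs[r] = (X @ theta) @ (X @ theta2) - n * (theta @ theta2)
# Deviations concentrate at scale sqrt(n), so these quantiles are O(1):
print(np.quantile(np.abs(devs), [0.5, 0.99]) / np.sqrt(n))
```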
Lemma 11. Assume that the matrix $X$ has an isotropic distribution with independent $\sigma_X$-subgaussian rows for some $\sigma_X > 0$. Then, there exists $u_0 > 0$ depending only on $\sigma_X$ such that, for all $\theta, \theta'$ with $\|\theta\|, \|\theta'\| \le u n^{-1/4}$ and $u \in (0, u_0)$, we have
$$\mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle \big) \le \exp\big( n\langle \theta, \theta' \rangle \big)\big( 1 + C_1 u^2 \big),$$
where the constant $C_1 > 0$ depends only on $\sigma_X$.

Proof.
By Lemma 10, for any $x > 0$, with $\mathbf{P}_X$-probability at least $1 - 6e^{-C_0 \min(x, x^2/n)}$ we have
$$\big| \langle X\theta, X\theta' \rangle - n\langle \theta, \theta' \rangle \big| \le \|\theta\| \|\theta'\| x \le u^2 n^{-1/2} x.$$
Therefore, for any $t > 0$, with $\mathbf{P}_X$-probability at least $1 - 6e^{-C_0 \min(\sqrt{n} t/u^2,\, t^2/u^4)}$ we have
$$\big| \langle X\theta, X\theta' \rangle - n\langle \theta, \theta' \rangle \big| \le t.$$
This and Lemma 9 imply that, for all $u \le u_0 := (C_0/2)^{1/2}$,
$$\mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle \big) \le \exp\big( n\langle \theta, \theta' \rangle \big)\Big( 1 + 6\int_0^\infty e^{t - C_0 \min(\sqrt{n} t/u^2,\, t^2/u^4)}\, dt \Big) \le \exp\big( n\langle \theta, \theta' \rangle \big)\Big( 1 + 6\int_0^\infty e^{t(1 - C_0\sqrt{n}/u^2)}\, dt + 6\int_0^\infty e^{t - C_0 t^2/u^4}\, dt \Big) \le \exp\big( n\langle \theta, \theta' \rangle \big)\Big( 1 + 6\int_0^\infty e^{-C_0\sqrt{n}\, t/(2u^2)}\, dt + 6\int_0^\infty e^{-t(C_0 t/u^4 - 1)}\, dt \Big) \quad (\text{since } C_0\sqrt{n}/u^2 \ge 2)$$
$$\le \exp\big( n\langle \theta, \theta' \rangle \big)\Big( 1 + \frac{12 u^2}{C_0\sqrt{n}} + \frac{12 u^4}{C_0} e^{2u^4/C_0} + 6\int_{2u^4/C_0}^\infty e^{-C_0 t^2/(2u^4)}\, dt \Big) \le \exp\big( n\langle \theta, \theta' \rangle \big)\big( 1 + C_1 u^2 \big), \qquad (46)$$
where the constant $C_1 > 0$ depends only on $C_0$, and thus only on $\sigma_X$.

4.4. Proof of Theorem 4
For an integer $s$ such that $1 \le s \le p$ and $\tau > 0$, we denote by $\mu_\tau$ the uniform distribution on the set of vectors in $\mathbb{R}^p$ having exactly $s$ nonzero coefficients, all equal to $\tau/\sqrt{s}$. Note that the support of the measure $\mu_\tau$ is contained in $\Theta(s, \tau)$.

We now take $\tau = \tau(s, n, p)$ defined by (11) and set $\mu = \mu_\tau$. In view of Lemmas 6–8, to prove Theorem 4 it is enough to show that
$$\mathbf{E}_{(\theta, \theta') \sim \mu_\tau}\, \mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle / \sigma^2 \big) \le 1 + o_A(1), \qquad (47)$$
where $o_A(1)$ tends to 0 as $A \to 0$. Observe that for $\tau$ defined by (11) the left-hand side of (47) does not depend on $\sigma$. Thus, in what follows we set $\sigma = 1$ without loss of generality. Next, notice that it is enough to prove the theorem for the case $s \le \sqrt{p}$. Indeed, for $s > \sqrt{p}$ we can use the inclusions
$$\Theta(s, \tau(s, n, p)) \supseteq \Theta(s', \tau(s, n, p)) \supseteq \Theta(s', \tau(s', n, p)),$$
where $s'$ is the greatest integer smaller than or equal to $\sqrt{p}$. Since $\tau(s', n, p) \asymp \min\big( p^{1/4}/\sqrt{n},\, n^{-1/4} \big)$ and the rate (11) is also of this order for $s > \sqrt{p}$, it suffices to prove the lower bound for $s \le s'$, and thus for $s \le \sqrt{p}$. Taking into account these simplifications, we assume in what follows without loss of generality that $s \le \sqrt{p}$, $\sigma = 1$, and
$$\tau := A \min\Big( \sqrt{\frac{s \log(1 + p/s^2)}{n}},\; n^{-1/4} \Big). \qquad (48)$$
We now prove (47) under these assumptions. By Lemma 11, for any $0 < A < u_0$ we have
$$\mathbf{E}_{(\theta, \theta') \sim \mu_\tau}\, \mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle \big) \le \mathbf{E}_{(\theta, \theta') \sim \mu_\tau} \exp\big( n\langle \theta, \theta' \rangle \big)\big( 1 + C_1 A^2 \big). \qquad (49)$$
Assume that $A < 1$. Arguing exactly as in the proof of Lemma 1 in Collier, Comminges and Tsybakov (2017), we find
$$\mathbf{E}_{(\theta, \theta') \sim \mu_\tau} \exp\big( n\langle \theta, \theta' \rangle \big) = \mathbf{E}_{(\theta, \theta') \sim \mu_\tau} \exp\Big( n\tau^2 s^{-1} \sum_{j=1}^p \mathbf{1}_{\theta_j \neq 0} \mathbf{1}_{\theta'_j \neq 0} \Big) \le \Big( 1 - \frac{s}{p} + \frac{s}{p}\exp(n\tau^2 s^{-1}) \Big)^s \le \Big( 1 - \frac{s}{p} + \frac{s}{p}\Big( 1 + \frac{p}{s^2} \Big)^{A^2} \Big)^s \le \Big( 1 + \frac{A^2}{s} \Big)^s \le \exp(A^2), \qquad (50)$$
where we have used the inequality $(1 + x)^{A^2} - 1 \le A^2 x$, valid for $0 < A < 1$ and $x > 0$.
Combining (49) and (50) we obtain that, for all $0 < A < \min(1, u_0)$,
$$\mathbf{E}_{(\theta, \theta') \sim \mu_\tau}\, \mathbf{E}_X \exp\big( \langle X\theta, X\theta' \rangle \big) \le \exp(A^2)\big( 1 + C_1 A^2 \big)$$
with some $u_0 > 0$ and $C_1 > 0$ depending only on $\sigma_X$. The right-hand side tends to 1 as $A \to 0$, which proves (47). This completes the proof of the theorem.
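The key combinatorial step (50) can also be checked by simulation: the overlap $\sum_j \mathbf{1}_{\theta_j \neq 0}\mathbf{1}_{\theta'_j \neq 0}$ of two independent draws from $\mu_\tau$ is hypergeometric, and the following sketch (ours) compares the Monte Carlo value of $\mathbf{E}\exp(n\langle \theta, \theta' \rangle)$ with the bound $\exp(A^2)$.

```python
import numpy as np

def mean_exp_overlap(p, s, A, reps=50000, seed=2):
    """Monte Carlo of E exp(n<theta,theta'>) for theta, theta' ~ mu_tau with
    tau^2 = A^2 s log(1+p/s^2)/n; note n cancels, since n tau^2/s = A^2 log(1+p/s^2).
    Returns the Monte Carlo value and the bound exp(A^2) from (50)."""
    rng = np.random.default_rng(seed)
    t = A**2 * np.log(1 + p / s**2)     # exponent per overlapping coordinate
    overlaps = np.array([
        np.intersect1d(rng.choice(p, s, replace=False),
                       rng.choice(p, s, replace=False)).size
        for _ in range(reps)])
    return np.exp(t * overlaps).mean(), np.exp(A**2)

print(mean_exp_overlap(p=1000, s=10, A=0.5))   # first value <= second, as in (50)
```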
The work of Alexandra Carpentier is supported by the Emmy Noether grant MuSyAD CA 1488/1-1, by the GRK 2297 MathCoRe "Mathematical Complexity Reduction" (project 314838170), and by the SFB 1294 "Data Assimilation: The seamless integration of data and models", Project A03, all funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), and by the Deutsch-Französisches Doktorandenkolleg / Collège doctoral franco-allemand Potsdam-Toulouse CDFA 01-18, funded by the French-German University. Olivier Collier's research has been conducted as part of the project Labex MME-DII (ANR11-LBX-0023-01). The work of A.B. Tsybakov was supported by GENES and by the French National Research Agency (ANR) under the grants IPANEMA (ANR-13-BSH1-0004-02) and Labex Ecodec (ANR-11-LABEX-0047).
References
Arias-Castro, E., Candes, E. and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist.

Baraud, Y. (2002). Non-asymptotic minimax rates of testing in signal detection. Bernoulli.

Bordenave, C. and Chafaï, D. (2012). Around the circular law. Probability Surveys.

Collier, O., Comminges, L. and Tsybakov, A.B. (2017). Minimax estimation of linear and quadratic functionals under sparsity constraints. Ann. Statist.

Davidson, K.R. and Szarek, S.J. (2001). Local operator theory, random matrices and Banach spaces. In: Handbook of the Geometry of Banach Spaces.

Donoho, D.L. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist.

Ingster, Y.I. (1997). Some problems of hypothesis testing leading to infinitely divisible distributions. Math. Methods Statist.

Ingster, Y.I. and Suslina, I.A. (2003). Nonparametric Goodness-of-Fit Testing under Gaussian Models. Springer, New York.

Ingster, Y.I., Tsybakov, A.B. and Verzelen, N. (2010). Detection boundary in sparse regression. Electron. J. Stat.

Tao, T. and Vu, V. (2010). Random matrices: universality of ESDs and the circular law. The Annals of Probability, 2023–2065.

Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing, 210–268. Cambridge Univ. Press, Cambridge.

Verzelen, N. (2012). Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electron. J. Stat., 38–90.