Sharper Sub-Weibull Concentrations: Non-asymptotic Bai-Yin Theorem
Huiming Zhang* and Haoyu Wei*†
1. Department of Mathematics, University of Macau, Taipa, Macau, China. 2. UMacau Zhuhai Research Institute, Zhuhai, China. 3. Guanghua School of Management, Peking University, Beijing, China.
* The two authors contributed equally to this work. Huiming Zhang's e-mail: [email protected]; Haoyu Wei's e-mail: [email protected].
† Corresponding author.
Abstract:
Non-asymptotic concentration inequalities play an essential role in the finite-sample theory of machine learning and high-dimensional statistics. In this article, we obtain a sharper and constants-specified concentration inequality for the summation of independent sub-Weibull random variables, which leads to a mixture of two tails: sub-Gaussian for small deviations and sub-Weibull for large deviations from the mean. These bounds improve existing bounds with sharper constants. As an application to random matrices, we derive non-asymptotic versions of the Bai-Yin theorem for sub-Weibull entries, extending the previous result for sub-Gaussian entries. As an application to negative binomial regressions, we give the $\ell_2$-error of the estimated coefficients when the covariate vector $X$ is sub-Weibull distributed with a sparse structure, which is a new result for negative binomial regressions.

Key words: large deviation inequalities, random matrices, sub-Weibull random variables, heavy-tailed distributions, lower bounds on the least singular value.
In the last two decades, with the development of modern data collection methods in science and technology, scientists and engineers can access and load a huge number of variables in their experiments. Probability theory lays the mathematical foundation of statistics. Arising from data-driven problems, various recent advances in statistical research also contribute new and challenging problems in probability for further study. For example, in recent years the rapid development of high-dimensional statistics and machine learning has promoted the development of probability theory and even pure mathematics, especially random matrices, large deviation inequalities, and geometric functional analysis; see Vershynin (2018).

Motivated by sample covariance matrices, a random matrix is a matrix $A_{p\times p}$ with its entries $A_{jk}$ drawn from some distributions. As $p \to \infty$, random matrix theory mainly focuses on studying the properties of the $p$ eigenvalues of $A_{p\times p}$, which turn out to obey certain limit laws. Several famous limit laws in random matrix theory differ from the CLT for the summation of independent random variables, since the $p$ eigenvalues are dependent and interact with each other. For convergence in distribution, the pioneering results are Wigner's semicircle law for the eigenvalues of symmetric Gaussian matrices, the Marchenko-Pastur law for Wishart-distributed random matrices (sample covariance matrices), and the Tracy-Widom law for the limiting distribution of the maximum eigenvalue of Wishart matrices.
All three laws can be regarded as random matrix versions of the CLT. However, the limit law for the empirical spectral density is a circular-type distribution, which sheds light on the non-commutative behavior of random matrices, while the classical limit law in the CLT is the normal distribution or an infinitely divisible distribution. For strong convergence, the Bai-Yin law complements the Marchenko-Pastur law by characterizing the almost sure limits of the smallest and largest eigenvalues of a sample covariance matrix. The monographs Bai and Silverstein (2010) and Yao et al. (2015) provide a thorough introduction to limit laws in random matrices.

Classical statistical models involve fixed-dimensional variables only. However, contemporary data science motivates statisticians to pay more attention to $p \times p$ (with $p \to \infty$) random Hessian matrices (or sample covariance matrices) arising from the likelihood functions of high-dimensional regressions; see Vershynin (2018). When the model dimension increases with the sample size, obtaining asymptotic results for the estimator is potentially more challenging than in the fixed-dimensional case. In statistical machine learning, concentration inequalities (large deviation inequalities) play an essential role in deriving non-asymptotic error bounds for the proposed estimator; see Wainwright (2019). Over recent decades, researchers have developed remarkable matrix concentration inequalities, which focus on non-asymptotic upper and lower bounds for the largest eigenvalue of a finite sum of random matrices. For a fascinating introduction, please refer to the book by Tropp (2015).

In this work, we aim to extend non-asymptotic results from sub-Gaussian to sub-Weibull in terms of concentration inequalities and the Bai-Yin law of extreme eigenvalues of random matrices. The contributions are: (i) we derive some new results for sub-Weibull r.v.s, including sharp concentration inequalities for weighted summations of independent sub-Weibull r.v.s and negative binomial r.v.s, which are useful in many statistical applications; (ii) based on the generalized Bernstein-Orlicz norm, a sharper concentration inequality for sub-Weibull summations is obtained; here we circumvent Stirling's approximation and derive the inequalities in a more subtle way, so our result is sharper and more accurate than those of Kuchibhotla and Chakrabortty (2018) and Hao et al. (2019); (iii) using these results, we offer two applications: first, a non-asymptotic version of the Bai-Yin theorem for sub-Weibull random matrices in terms of the extreme eigenvalues; second, from the proposed negative binomial concentration inequalities, the $\ell_2$-error of the estimated coefficients in negative binomial regressions under the increasing-dimension framework.

As summarized in Wainwright (2019), concentration inequalities are powerful in high-dimensional statistical inference, yielding explicit non-asymptotic error bounds as functions of the sample size, sparsity level, and dimension. In this section, we present concentration inequalities for sub-Weibull random variables.
In empirical process theory, the sub-Weibull norm (or other Orlicz-type norms) is crucial for deriving tail probabilities both for a single sub-Weibull random variable and for sums of random variables (via Chernoff's inequality). A benefit of Orlicz-type norms is that the concentration does not require a zero-mean assumption.
Definition 1 (Sub-Weibull norm). For $\theta > 0$, the sub-Weibull norm of $X$ is defined as
$$\|X\|_{\psi_\theta} := \inf\big\{C \in (0,\infty) : \mathrm{E}[\exp(|X|^\theta/C^\theta)] \le 2\big\}.$$
The $\|\cdot\|_{\psi_\theta}$ is also called the $\psi_\theta$-norm. We call $X$ a sub-Weibull random variable with sub-Weibull index $\theta$ if it has a bounded $\psi_\theta$-norm (denoted $X \sim \mathrm{subW}(\theta)$). In fact, the sub-Weibull norm is a special case of the Orlicz norms below.

Definition 2 (Orlicz norms). Let $g: [0,\infty) \to [0,\infty)$ be a non-decreasing convex function with $g(0) = 0$. The $g$-Orlicz norm of a real-valued r.v. $X$ is given by
$$\|X\|_g := \inf\{\eta > 0 : \mathrm{E}[g(|X|/\eta)] \le 1\}. \quad (1)$$

Example 1 ($\psi_\theta$-norm of a bounded r.v.). For a r.v. with $|X| \le M < \infty$, we have $\mathrm{E}\,e^{(|X|/t)^\theta} \le e^{(M/t)^\theta} \le 2$ whenever $t \ge M/(\log 2)^{1/\theta}$. By the definition of $\|X\|_{\psi_\theta}$, this gives $\|X\|_{\psi_\theta} \le M/(\log 2)^{1/\theta}$. Since $\mathrm{E}\,e^{(|X|/t)^\theta}$ is continuous in $t$ whenever it is finite for $1/t$ in a neighbourhood of zero, equality holds when $|X| = M$ almost surely; in particular, a constant $c$ satisfies $\|c\|_{\psi_\theta} = |c|/(\log 2)^{1/\theta}$.

In general, the following corollary determines $\|X\|_{\psi_\theta}$ from the moment generating function (MGF); it is useful for statistical inference on the $\psi_\theta$-norm.

Corollary 1. Write $m_Z(t) := \mathrm{E}\,e^{tZ}$ for the MGF of a r.v. $Z$. If $\|X\|_{\psi_\theta} < \infty$, then $\|X\|_{\psi_\theta} = \big(m_{|X|^\theta}^{-1}(2)\big)^{-1/\theta}$.

Proof. The MGF of $|X|^\theta$ is continuous in a neighbourhood of zero, and by the definition of the $\psi_\theta$-norm, $2 \ge \mathrm{E}\,e^{(|X|/\|X\|_{\psi_\theta})^\theta} = m_{|X|^\theta}\big(\|X\|_{\psi_\theta}^{-\theta}\big)$. Since $|X|^\theta > 0$, the MGF $m_{|X|^\theta}(t)$ is monotonically increasing. Hence the inverse function $m_{|X|^\theta}^{-1}(t)$ exists and satisfies $\|X\|_{\psi_\theta}^{-\theta} = m_{|X|^\theta}^{-1}(2)$. So $\|X\|_{\psi_\theta} = \big(m_{|X|^\theta}^{-1}(2)\big)^{-1/\theta}$. $\square$

Remark 1. If we observe i.i.d. $X_1, \ldots, X_n$ from some sub-Weibull distribution, we can use the empirical moment generating function [EMGF, Gbur and Collins (1989)] to estimate the sub-Weibull norm of $X$. Since the EMGF $\hat m_{|X|^\theta}(t) = n^{-1}\sum_{i=1}^n \exp\{t|X_i|^\theta\}$ converges to the MGF $m_{|X|^\theta}(t)$ in probability for $t$ in a neighbourhood of zero, the value of the inverse EMGF at 2, namely $\big(\hat m_{|X|^\theta}^{-1}(2)\big)^{-1/\theta}$, is a consistent estimate of $\|X\|_{\psi_\theta}$.
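A minimal numerical sketch of this EMGF-based estimator (our illustration, not part of the original derivation; the helper name `psi_norm_emgf` is ours), shown here for Gaussian data with $\theta = 2$:

```python
import numpy as np
from scipy.optimize import brentq

def psi_norm_emgf(x, theta):
    """Estimate ||X||_{psi_theta} by inverting the empirical MGF of |X|^theta
    at level 2, following Remark 1: ||X||_{psi_theta} = (m^{-1}(2))^{-1/theta}."""
    z = np.abs(x) ** theta
    emgf = lambda t: np.mean(np.exp(t * z)) - 2.0  # \hat m_{|X|^theta}(t) - 2
    hi = 1e-6
    while emgf(hi) < 0:  # \hat m is increasing with \hat m(0) = 1, so bracket the root
        hi *= 2.0
    t_star = brentq(emgf, 0.0, hi)
    return t_star ** (-1.0 / theta)

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)        # N(0,1) is subW(2)
print(psi_norm_emgf(x, theta=2.0))  # ~ sqrt(8/3) = 1.633 for standard Gaussian
```

For $X \sim N(0,1)$, solving $\mathrm{E}\,e^{tX^2} = (1-2t)^{-1/2} = 2$ gives $t = 3/8$, so $\|X\|_{\psi_2} = \sqrt{8/3} \approx 1.633$, which the estimate should approach as the sample grows.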
In particular, taking $\theta = 1$ gives the sub-exponential norm: $\|X\|_{\psi_1} = \inf\{t > 0 : \mathrm{E}\exp(|X|/t) \le 2\}$. If $\mathrm{E}X_i = 0$ and $\|X_i\|_{\psi_1} < \infty$, then by Proposition 4.2 in Zhang and Chen (2021), for all $t \ge 0$,
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n X_i\Big| \ge t\Big) \le 2\exp\Big\{-\Big(\frac{t^2}{4\sum_{i=1}^n\|X_i\|_{\psi_1}^2} \wedge \frac{t}{2\max_{1\le i\le n}\|X_i\|_{\psi_1}}\Big)\Big\}. \quad (2)$$
An explicit calculation of the sub-exponential norm is given in Götze et al. (2019), who show that a Poisson r.v. $X \sim \mathrm{Poisson}(\lambda)$ has sub-exponential norm $\|X\|_{\psi_1} \le [\log(\log(2)\lambda^{-1} + 1)]^{-1}$. Example 1 with the triangle inequality then implies
$$\|X - \mathrm{E}X\|_{\psi_1} \le \|X\|_{\psi_1} + \|\mathrm{E}X\|_{\psi_1} = \|X\|_{\psi_1} + \frac{\lambda}{\log 2} \le [\log(\log(2)\lambda^{-1} + 1)]^{-1} + \frac{\lambda}{\log 2}.$$
We can also obtain useful results for weighted sums of independent heterogeneous negative binomial variables $\{Y_i\}_{i=1}^n$ with probability mass functions
$$\mathrm{P}(Y_i = y) = \frac{\Gamma(y + k_i)}{\Gamma(k_i)\,y!}(1 - q_i)^{k_i}q_i^{\,y} \quad \big(q_i \in (0,1),\ y \in \mathbb{N}\big), \quad (3)$$
where $\{k_i\}_{i=1}^n \subset (0,\infty)$ are variance-dependence parameters. Here the mean and variance of $Y_i$ are $\mathrm{E}Y_i = \frac{k_iq_i}{1-q_i}$ and $\mathrm{Var}\,Y_i = \frac{k_iq_i}{(1-q_i)^2}$, respectively, and the MGF is $\mathrm{E}\,e^{sY_i} = \big(\frac{1-q_i}{1-q_ie^s}\big)^{k_i}$ for $i = 1, \ldots, n$.
Corollary 2. For any independent r.v.s $Y_1, \ldots, Y_n$ satisfying $\|Y_i\|_{\psi_1} < \infty$, any $t \ge 0$, and any non-random weight vector $w = (w_1, \ldots, w_n)^\top$, we have
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_i(Y_i - \mathrm{E}Y_i)\Big| \ge t\Big) \le 2\exp\Big\{-\Big(\frac{t^2}{4\sum_{i=1}^n w_i^2(\|Y_i\|_{\psi_1} + |\mathrm{E}Y_i|/\log 2)^2} \wedge \frac{t}{2\max_{1\le i\le n}|w_i|(\|Y_i\|_{\psi_1} + |\mathrm{E}Y_i|/\log 2)}\Big)\Big\}$$
and
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_i(Y_i - \mathrm{E}Y_i)\Big| > 2\Big(t\sum_{i=1}^n w_i^2\|Y_i - \mathrm{E}Y_i\|_{\psi_1}^2\Big)^{1/2} + 2t\max_{1\le i\le n}\big(|w_i|\,\|Y_i - \mathrm{E}Y_i\|_{\psi_1}\big)\Big) \le 2e^{-t}.$$
In particular, if $Y_i$ is independently distributed as $\mathrm{NB}(\mu_i, k_i)$, we have
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_i(Y_i - \mathrm{E}Y_i)\Big| \ge t\Big) \le 2\exp\Big\{-\Big(\frac{t^2}{4\sum_{i=1}^n w_i^2\,a^2(\mu_i, k_i)} \wedge \frac{t}{2\max_{1\le i\le n}|w_i|\,a(\mu_i, k_i)}\Big)\Big\}, \quad (4)$$
where $a(\mu_i, k_i) := \Big[\log\dfrac{1 - (1-q_i)/\sqrt[k_i]{2}}{q_i}\Big]^{-1} + \dfrac{\mu_i}{\log 2}$ with $q_i := \dfrac{\mu_i}{k_i + \mu_i}$.
Proof. The first inequality is a direct application of (2), observing that for any constant $a \in \mathbb{R}$ and r.v. $Y$ with $\|Y\|_{\psi_1} < \infty$ we have $\|aY\|_{\psi_1} = |a|\,\|Y\|_{\psi_1}$ and $\|Y + a\|_{\psi_1} \le \|Y\|_{\psi_1} + \|a\|_{\psi_1} = \|Y\|_{\psi_1} + |a|/\log 2$, so that $\|Y_i - \mathrm{E}Y_i\|_{\psi_1} \le \|Y_i\|_{\psi_1} + |\mathrm{E}Y_i|/\log 2$. The second inequality is obtained from (2) by considering the two rates in the minimum separately. For (4), it only remains to note that
$$\|Y_i\|_{\psi_1} = \inf\{t > 0 : \mathrm{E}\exp(Y_i/t) \le 2\} = \inf\Big\{t > 0 : \Big(\frac{1-q_i}{1 - q_ie^{1/t}}\Big)^{k_i} \le 2\Big\} = \Big[\log\frac{1 - (1-q_i)/\sqrt[k_i]{2}}{q_i}\Big]^{-1};$$
the third inequality then follows from the first inequality and the definition of $a(\mu_i, k_i)$. $\square$
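As a quick numerical sanity check of the closed-form $\psi_1$-norm derived above (a sketch we add for illustration; the parameter values are arbitrary), the NB moment generating function evaluated at $1/\|Y\|_{\psi_1}$ should equal 2 exactly, and a Monte Carlo average should agree:

```python
import numpy as np

def nb_psi1_norm(mu, k):
    """Closed-form ||Y||_{psi_1} for Y ~ NB(mu, k), from the proof of Corollary 2."""
    q = mu / (k + mu)
    return 1.0 / np.log((1.0 - (1.0 - q) * 2.0 ** (-1.0 / k)) / q)

mu, k = 3.0, 2.0
q = mu / (k + mu)
t_star = nb_psi1_norm(mu, k)

# Closed-form MGF at s = 1/t_star: equals 2 by construction.
print(((1 - q) / (1 - q * np.exp(1.0 / t_star))) ** k)

# Monte Carlo check of E exp(Y / t_star) ~ 2 (numpy's p is the success prob 1 - q).
rng = np.random.default_rng(1)
y = rng.negative_binomial(k, 1 - q, size=2_000_000)
print(np.mean(np.exp(y / t_star)))
```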
Similarly to the sub-exponential case, a sub-Weibull r.v. $X$ satisfies the following properties, as shown in Lemma 2.1 of Zajkowski (2019).

Proposition 1 (Properties of the sub-Weibull norm). If $\|X\|_{\psi_\theta} < \infty$, then $\mathrm{P}\{|X| > t\} \le 2e^{-(t/\|X\|_{\psi_\theta})^\theta}$ for all $t \ge 0$, and consequently $\mathrm{E}|X|^k \le 2\|X\|_{\psi_\theta}^k\,\Gamma(\tfrac{k}{\theta} + 1)$ for all $k \ge 1$.

In particular, when $\theta = 1$ or $2$, sub-Weibull r.v.s reduce to sub-exponential or sub-Gaussian r.v.s, respectively. Obviously, the smaller $\theta$ is, the heavier the tail of the r.v. A r.v. is called heavy-tailed if its distribution function fails to be bounded by a decreasing exponential function, i.e. $\int e^{\lambda x}\,dF(x) = \infty$ for all $\lambda > 0$; sub-Weibull r.v.s with index $\theta \in (0,1)$ may therefore be heavy-tailed. The next corollary relates the $\psi$-norms with indices $\theta$, $\theta/r$ and $r\theta$; it is similar to Lemma 2.7.6 of Vershynin (2018) for the sub-exponential norm.

Corollary 3. For any $\theta, r \in (0,\infty)$, if $X \sim \mathrm{subW}(\theta)$, then $|X|^r \sim \mathrm{subW}(\theta/r)$. Moreover,
$$\big\||X|^r\big\|_{\psi_{\theta/r}} = \|X\|_{\psi_\theta}^r. \quad (5)$$
Conversely, if $X \sim \mathrm{subW}(r\theta)$, then $X^r \sim \mathrm{subW}(\theta)$ with $\|X^r\|_{\psi_\theta} = \|X\|_{\psi_{r\theta}}^r$.

Proof. By the definition of the $\psi_\theta$-norm, $\mathrm{E}\exp\{|X/\|X\|_{\psi_\theta}|^\theta\} \le 2$, i.e. $\mathrm{E}\exp\{(|X|^r/\|X\|_{\psi_\theta}^r)^{\theta/r}\} \le 2$. The claim $|X|^r \sim \mathrm{subW}(\theta/r)$ follows by the definition of the $\psi_{\theta/r}$-norm. Moreover,
$$\|X\|_{\psi_\theta} = \inf\{C \in (0,\infty) : \mathrm{E}[\exp(|X|^\theta/C^\theta)] \le 2\} = \big[\inf\{C^r \in (0,\infty) : \mathrm{E}[\exp\{(|X|^r/C^r)^{\theta/r}\}] \le 2\}\big]^{1/r} = \big\||X|^r\big\|_{\psi_{\theta/r}}^{1/r},$$
which verifies (5). If $X \sim \mathrm{subW}(r\theta)$, then $\mathrm{E}\exp\{|X^r/\|X\|_{\psi_{r\theta}}^r|^\theta\} = \mathrm{E}\exp\{|X/\|X\|_{\psi_{r\theta}}|^{r\theta}\} \le 2$, so $X^r \sim \mathrm{subW}(\theta)$, and the same rescaling argument gives
$$\|X\|_{\psi_{r\theta}} = \big[\inf\{C^r \in (0,\infty) : \mathrm{E}[\exp\{(|X|^r/C^r)^{\theta}\}] \le 2\}\big]^{1/r} = \big\||X|^r\big\|_{\psi_\theta}^{1/r}. \qquad \square$$

By Corollary 3 with $r = 1/d$, the $d$-th root of the absolute value of a sub-Gaussian r.v. is $\mathrm{subW}(2d)$. Corollary 3 can also be extended to products of r.v.s; from Proposition D.2 in Kuchibhotla and Chakrabortty (2018), with the equality replaced by an inequality, we state it as the following proposition.

Proposition 2. If $\{W_i\}_{i=1}^d$ are (possibly dependent) r.v.s satisfying $\|W_i\|_{\psi_{\alpha_i}} < \infty$ for some $\alpha_i > 0$, then $\big\|\prod_{i=1}^d W_i\big\|_{\psi_\beta} \le \prod_{i=1}^d \|W_i\|_{\psi_{\alpha_i}}$, where $1/\beta := \sum_{i=1}^d 1/\alpha_i$.

For multi-armed bandit problems in reinforcement learning, Hao et al. (2019) move beyond sub-Gaussianity and consider rewards with sub-Weibull distributions, which have much weaker tails. The corresponding concentration inequality (Theorem 3.1 in Hao et al. (2019)) for the sum of independent sub-Weibull r.v.s will be used repeatedly.
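Before stating it, a small Monte Carlo sketch (our illustration, reusing the hypothetical `psi_norm_emgf` helper from Remark 1) checks the norm identity (5) in the Gaussian case, where both sides equal $8/3$:

```python
import numpy as np
from scipy.optimize import brentq

def psi_norm_emgf(x, theta):
    """EMGF-based estimate of ||X||_{psi_theta} (same sketch as in Remark 1)."""
    z = np.abs(x) ** theta
    f = lambda t: np.mean(np.exp(t * z)) - 2.0
    hi = 1e-6
    while f(hi) < 0:
        hi *= 2.0
    return brentq(f, 0.0, hi) ** (-1.0 / theta)

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
lhs = psi_norm_emgf(x ** 2, theta=1.0)  # || |X|^2 ||_{psi_1}
rhs = psi_norm_emgf(x, theta=2.0) ** 2  # || X ||_{psi_2}^2
print(lhs, rhs)                         # both approach 8/3 for Gaussian X
```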
Proposition 3 (Concentration inequality for sub-Weibull r.v.s). Suppose $\{y_i\}_{i=1}^n$ are independent sub-Weibull r.v.s with $\|y_i\|_{\psi_\theta} \le \sigma$. Then there exists an absolute constant $C_\theta$, depending only on $\theta$, such that for any $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}^n$ and $0 < \alpha < 1/e$,
$$\Big|\sum_{i=1}^n a_iy_i - \mathrm{E}\Big(\sum_{i=1}^n a_iy_i\Big)\Big| \le C_\theta\,\sigma\big(\|a\|_2(\log\alpha^{-1})^{1/2} + \|a\|_\infty(\log\alpha^{-1})^{1/\theta}\big)$$
with probability at least $1 - 2\alpha$.

The weakness of Proposition 3 is that the upper bound on $S_n^a := \sum_{i=1}^n a_iy_i - \mathrm{E}(\sum_{i=1}^n a_iy_i)$ involves an unknown constant $C_\theta$, and $\alpha$ cannot tend to $0$. In what follows, we give a constants-specified tail probability upper bound for $|S_n^a|$ which is sharper than Theorem 3.1 in Kuchibhotla and Chakrabortty (2018).

The Chernoff inequality trick used in deriving sub-exponential concentration is not valid for sub-Weibull distributions, since the exponential moment condition for sub-Weibull r.v.s concerns the absolute value $|X|$, while a random sum is not a sum of absolute values. Thanks to Bernstein's moment condition, which is an exponential moment condition on the absolute value, an alternative method is given by Kuchibhotla and Chakrabortty (2018), who define the so-called generalized Bernstein-Orlicz (GBO) norm. The GBO norm helps us derive tail behaviours of sub-Weibull r.v.s.

Definition 3 (GBO norm). Fix $\theta > 0$ and $L \ge 0$. Define the function $\Psi_{\theta,L}(\cdot)$ via its inverse $\Psi_{\theta,L}^{-1}(t) := \sqrt{\log(t+1)} + L(\log(t+1))^{1/\theta}$ for all $t \ge 0$. The GBO norm of a r.v. $X$ is then given by $\|X\|_{\Psi_{\theta,L}} := \inf\{\eta > 0 : \mathrm{E}[\Psi_{\theta,L}(|X|/\eta)] \le 1\}$.

The monotone function $\Psi_{\theta,L}(\cdot)$ is motivated by the classical Bernstein inequality for sub-exponential r.v.s. Like the sub-Weibull norm property in Corollary 1, the following proposition in Kuchibhotla and Chakrabortty (2018) allows us to obtain a concentration inequality for a r.v. with finite GBO norm.

Proposition 4. If $\|X\|_{\Psi_{\theta,L}} < \infty$, then $\mathrm{P}\big(|X| \ge \|X\|_{\Psi_{\theta,L}}\{\sqrt{t} + Lt^{1/\theta}\}\big) \le 2e^{-t}$ for all $t \ge 0$.

With an upper bound on the GBO norm, we can easily derive the concentration inequality for a single sub-Weibull r.v., or even for sums of independent sub-Weibull r.v.s. A sharper upper bound on the GBO norm of sub-Weibull sums is obtained below, which refines the constant in the sub-Weibull concentration inequality. Let $\|X\|_k := (\mathrm{E}|X|^k)^{1/k}$ for every integer $k \ge 1$.
First, by truncating more precisely, we obtain a sharper upper bound than Proposition C.1 in Kuchibhotla and Chakrabortty (2018).

Corollary 4. If $\|X\|_p \le C_1\sqrt{p} + C_2p^{1/\theta}$ for all $p \ge 2$ and constants $C_1, C_2 > 0$, then $\|X\|_{\Psi_{\theta,K}} \le \gamma eC_1$, where $K = \gamma^{1/\theta}C_2/(\gamma C_1)$ and $\gamma$ is the minimal solution of
$$\Big\{k > 1 : e^{2/k} - 1 + \frac{e^{-2(k-1)/k}}{k-1} \le 1\Big\}.$$

Proof. Set $\Delta := \sup_{p\ge2}\frac{\|X\|_p}{\sqrt{p} + Lp^{1/\theta}}$, so that $\|X\|_p \le \Delta\sqrt{p} + L\Delta p^{1/\theta}$ holds for all $p \ge 2$. By Markov's inequality applied to the $t$-th moment ($t \ge 2$),
$$\mathrm{P}\big(|X| \ge e\Delta\sqrt{t} + eL\Delta t^{1/\theta}\big) \le \Big(\frac{\|X\|_t}{e\Delta[\sqrt{t} + Lt^{1/\theta}]}\Big)^t \le e^{-t} \quad [\text{by the definition of }\Delta].$$
So, for any $t \ge 2$,
$$\mathrm{P}\big(|X| \ge e\Delta\sqrt{t} + eL\Delta t^{1/\theta}\big) \le e^{-t}. \quad (6)$$
The definition of $\Delta$ shows that $\|X\|_t \le \Delta\sqrt{t} + L\Delta t^{1/\theta}$ for all $t \ge 2$, while by assumption $\|X\|_t \le C_1\sqrt{t} + C_2t^{1/\theta}$ for all $t \ge 2$; taking $L = C_2/C_1$ gives $e\Delta\sqrt{t} + eL\Delta t^{1/\theta} \le eC_1\sqrt{t} + eC_2t^{1/\theta}$. This inequality together with (6) gives
$$\mathrm{P}\big(|X| \ge eC_1\sqrt{t} + eC_2t^{1/\theta}\big) \le \mathbb{1}\{0 < t < 2\} + e^{-t}\mathbb{1}\{t \ge 2\}, \quad \forall t > 0. \quad (7)$$
Take $K = k^{1/\theta}C_2/(kC_1)$ and define $\delta_k := keC_1$ for a constant $k > 1$. Then
$$\mathrm{E}\Big[\Psi_{\theta,K}\Big(\frac{|X|}{\delta_k}\Big)\Big] = \int_0^\infty \mathrm{P}\big(|X| \ge \delta_k\Psi_{\theta,K}^{-1}(s)\big)\,ds = \int_0^\infty \mathrm{P}\Big(|X| \ge keC_1\sqrt{\log(1+s)} + keC_1K[\log(1+s)]^{1/\theta}\Big)\,ds$$
$$\le \int_0^\infty \mathrm{P}\Big(|X| \ge eC_1\sqrt{k\log(1+s)} + eC_2[k\log(1+s)]^{1/\theta}\Big)\,ds \quad [\text{since } k \ge \sqrt{k}\text{ and }keC_1K = ek^{1/\theta}C_2]$$
$$\le \int_0^\infty \Big[\mathbb{1}\{k\log(1+s) < 2\} + (1+s)^{-k}\mathbb{1}\{k\log(1+s) \ge 2\}\Big]ds \quad [\text{by (7) with } t = k\log(1+s)]$$
$$\le \big(e^{2/k} - 1\big) + \int_{e^{2/k}-1}^\infty (1+s)^{-k}\,ds = e^{2/k} - 1 + \frac{e^{-2(k-1)/k}}{k-1}.$$
For $k = \gamma$ the right-hand side is at most $1$, so $\mathrm{E}[\Psi_{\theta,K}(|X|/\delta_\gamma)] \le 1$ and hence $\|X\|_{\Psi_{\theta,K}} \le \delta_\gamma = \gamma eC_1$. $\square$
Theorem 1 (Concentration for sub-Weibull summation). If $X_1, \ldots, X_n$ are independent centered r.v.s such that $\|X_i\|_{\psi_\theta} < \infty$ for all $1 \le i \le n$ and some $\theta > 0$, then for any weight vector $w = (w_1, \ldots, w_n) \in \mathbb{R}^n$, the following bounds hold true, with $b := (w_1\|X_1\|_{\psi_\theta}, \ldots, w_n\|X_n\|_{\psi_\theta})^\top \in \mathbb{R}^n$ and $\gamma$ as in Corollary 4.

(a) The GBO-norm estimate for the summation: $\big\|\sum_{i=1}^n w_iX_i\big\|_{\Psi_{\theta,L_n(\theta,b)}} \le \gamma eC(\theta)\|b\|_2$, where
$$C(\theta) := \begin{cases} 2\Big[(\log 2)^{1/\theta} + e\Big(\Gamma^{1/2}\big(\tfrac{2}{\theta}+1\big) + 3^{\frac{1-\theta}{\theta}}\sup_{p\ge2}p^{-1/\theta}\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\Big)\Big], & \text{if } \theta \le 1,\\[2pt] 8e + 2(\log 2)^{1/\theta}, & \text{if } \theta > 1, \end{cases}$$
and
$$L_n(\theta, b) = \frac{\gamma^{1/\theta}}{\gamma}\Big[A(\theta)\frac{\|b\|_\infty}{\|b\|_2}\mathbb{1}\{0 < \theta \le 1\} + B(\theta)\frac{\|b\|_\beta}{\|b\|_2}\mathbb{1}\{\theta > 1\}\Big],$$
with
$$A(\theta) := \frac{2e\,3^{\frac{1-\theta}{\theta}}\inf_{p\ge2}p^{-1/\theta}\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)}{C(\theta)}, \qquad B(\theta) := \frac{8e\,\theta^{-1/\theta}(1-\theta^{-1})^{1/\beta}}{C(\theta)}.$$
For the case $\theta > 1$, $\beta$ is the Hölder conjugate satisfying $1/\theta + 1/\beta = 1$.

(b) Concentration for the sub-Weibull summation:
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge \gamma eC(\theta)\|b\|_2\{\sqrt{t} + L_n(\theta,b)\,t^{1/\theta}\}\Big) \le 2e^{-t}. \quad (8)$$

(c) Another form of (b). Write $L_n := L_n(\theta, b)$; then for $\theta < 2$,
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge s\Big) \le \begin{cases} 2\exp\big\{-s^2/\big[4\gamma^2e^2C^2(\theta)\|b\|_2^2\big]\big\}, & \text{if } s \le 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)},\\[2pt] 2\exp\big\{-s^\theta/\big[2\gamma eC(\theta)\|b\|_2L_n\big]^\theta\big\}, & \text{if } s > 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}, \end{cases}$$
and for $\theta > 2$,
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge s\Big) \le \begin{cases} 2\exp\big\{-s^\theta/\big[2\gamma eC(\theta)\|b\|_2L_n\big]^\theta\big\}, & \text{if } s < 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)},\\[2pt] 2\exp\big\{-s^2/\big[4\gamma^2e^2C^2(\theta)\|b\|_2^2\big]\big\}, & \text{if } s \ge 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}. \end{cases}$$
Proof. The main idea of the proof is a sharper estimate of the GBO norm of a sum of symmetric r.v.s.

(a) Without loss of generality, assume $\|X_i\|_{\psi_\theta} = 1$. Define $Y_i := \big(|X_i| - (\log 2)^{1/\theta}\big)_+$; it is easy to check that $\mathrm{P}(|X_i| \ge t) \le 2e^{-t^\theta}$ implies $\mathrm{P}(Y_i \ge t) \le e^{-t^\theta}$. For independent Rademacher r.v.s $\{\varepsilon_i\}_{i=1}^n$, the symmetrization inequality gives $\|\sum_{i=1}^n w_iX_i\|_p \le 2\|\sum_{i=1}^n \varepsilon_iw_iX_i\|_p$. Since $\varepsilon_iX_i$ is identically distributed as $\varepsilon_i|X_i|$,
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_p \le 2\Big\|\sum_{i=1}^n \varepsilon_iw_i|X_i|\Big\|_p \le 2\Big\|\sum_{i=1}^n \varepsilon_iw_iY_i\Big\|_p + 2(\log 2)^{1/\theta}\Big\|\sum_{i=1}^n \varepsilon_iw_i\Big\|_p,$$
and by the Khinchin-Kahane inequality together with the independence of $\{\varepsilon_i\}_{i=1}^n$,
$$\Big\|\sum_{i=1}^n \varepsilon_iw_i\Big\|_p \le \sqrt{p-1}\,\Big(\mathrm{E}\Big(\sum_{i=1}^n \varepsilon_iw_i\Big)^2\Big)^{1/2} < \sqrt{p}\,\|w\|_2.$$
Hence
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_p \le 2\Big\|\sum_{i=1}^n \varepsilon_iw_iY_i\Big\|_p + 2(\log 2)^{1/\theta}\sqrt{p}\,\|w\|_2. \quad (9)$$
By Lemma 2, the first term in (9) can be handled as a sum of symmetric r.v.s: since $\mathrm{P}(Y_i \ge t) \le e^{-t^\theta}$, we have $\|\sum_{i=1}^n \varepsilon_iw_iY_i\|_p \le \|\sum_{i=1}^n w_iZ_i\|_p$ for symmetric independent r.v.s $\{Z_i\}_{i=1}^n$ satisfying $\mathrm{P}(|Z_i| \ge t) = e^{-t^\theta}$ for all $t \ge 0$. Next, we proceed by checking the moment condition of Corollary 4.

Case $\theta \le 1$: $N(t) = t^\theta$ is concave. From Lemma 2 and Lemma 3(a), for $p \ge 2$, using $\|Z_i\|_p^p = \int_0^\infty pt^{p-1}\mathrm{P}(|Z_i| \ge t)\,dt = \int_0^\infty pt^{p-1}e^{-t^\theta}\,dt = \Gamma\big(\tfrac{p}{\theta}+1\big)$, one obtains
$$\Big\|\sum_{i=1}^n w_iZ_i\Big\|_p \le e\Big[\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\|w\|_p + \sqrt{p}\,\Gamma^{1/2}\big(\tfrac{2}{\theta}+1\big)\|w\|_2\Big],$$
and therefore, by (9),
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_p \le 2e\,\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\|w\|_p + 2\big[(\log 2)^{1/\theta} + e\,\Gamma^{1/2}\big(\tfrac{2}{\theta}+1\big)\big]\sqrt{p}\,\|w\|_2.$$
By homogeneity we may assume $\sqrt{p}\|w\|_2 + p^{1/\theta}\|w\|_\infty = 1$, so that $\|w\|_2 \le p^{-1/2}$ and $\|w\|_\infty \le p^{-1/\theta}$. Therefore, for $p \ge 2$,
$$\|w\|_p \le \Big(\sum_{i=1}^n w_i^2\,\|w\|_\infty^{p-2}\Big)^{1/p} \le \big(p^{-1}(p^{-1/\theta})^{p-2}\big)^{1/p} = p^{-1/\theta}\,p^{\frac{2-\theta}{p\theta}} \le 3^{\frac{1-\theta}{\theta}}p^{-1/\theta} = 3^{\frac{1-\theta}{\theta}}p^{-1/\theta}\{\sqrt{p}\|w\|_2 + p^{1/\theta}\|w\|_\infty\},$$
where the last inequality follows from the fact that $p^{1/p} \le 3^{1/3}$ for integers $p \ge 2$. Hence
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_p \le 2e\,3^{\frac{1-\theta}{\theta}}p^{-1/\theta}\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\,p^{1/\theta}\|w\|_\infty + 2\Big[(\log 2)^{1/\theta} + e\Big(\Gamma^{1/2}\big(\tfrac{2}{\theta}+1\big) + 3^{\frac{1-\theta}{\theta}}p^{-1/\theta}\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\Big)\Big]\sqrt{p}\,\|w\|_2.$$
Following Corollary 4, $\big\|\sum_{i=1}^n w_iX_i\big\|_{\Psi_{\theta,L_n(\theta,p)}} \le \gamma eD_1(\theta)$, where $L_n(\theta,p) = \gamma^{1/\theta}D_2(\theta,p)/(\gamma D_1(\theta))$, $D_1(\theta) := C(\theta)\|w\|_2 < \infty$, and $D_2(\theta,p) := 2e\,3^{\frac{1-\theta}{\theta}}p^{-1/\theta}\Gamma^{1/p}\big(\tfrac{p}{\theta}+1\big)\|w\|_\infty$. Finally, take $L_n(\theta) := \inf_{p\ge2}L_n(\theta,p) > 0$; the positivity of this limit can be argued by (2.2) in Alzer (1997). Then, by the monotonicity property of the GBO norm,
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_{\Psi_{\theta,L_n(\theta)}} \le \Big\|\sum_{i=1}^n w_iX_i\Big\|_{\Psi_{\theta,L_n(\theta,p)}} \le \gamma eD_1(\theta).$$

Case $\theta > 1$: now $N(t) = t^\theta$ is convex with convex conjugate $N^*(t) = \frac{\theta-1}{\theta}\,\theta^{-\frac{1}{\theta-1}}\,t^{\frac{\theta}{\theta-1}}$. By Lemmas 2 and 3(b), for $p \ge 2$,
$$\Big\|\sum_{i=1}^n w_iZ_i\Big\|_p \le 4e\big[\sqrt{p}\,\|w\|_2 + (p/\theta)^{1/\theta}(1-\theta^{-1})^{1/\beta}\|w\|_\beta\big],$$
with $\beta$ as in the statement. Therefore, for $p \ge 2$, (9) implies
$$\Big\|\sum_{i=1}^n w_iX_i\Big\|_p \le \big[8e + 2(\log 2)^{1/\theta}\big]\sqrt{p}\,\|w\|_2 + 8e(p/\theta)^{1/\theta}(1-\theta^{-1})^{1/\beta}\|w\|_\beta.$$
By Corollary 4, $\big\|\sum_{i=1}^n w_iX_i\big\|_{\Psi_{\theta,L_n'(\theta)}} \le \gamma eD_1'(\theta)$, where $L_n'(\theta) = \gamma^{1/\theta}D_2'(\theta)/(\gamma D_1'(\theta))$, $D_1'(\theta) := [8e + 2(\log 2)^{1/\theta}]\|w\|_2$ and $D_2'(\theta) := 8e\,\theta^{-1/\theta}(1-\theta^{-1})^{1/\beta}\|w\|_\beta$. Noting that $w_iX_i = (w_i\|X_i\|_{\psi_\theta})(X_i/\|X_i\|_{\psi_\theta})$, we conclude (a).

(b) This follows from Proposition 4 and (a).

(c) For ease of notation, put $L_n := L_n(\theta,b)$. When $\theta < 2$, by the inequality $a + b \le 2(a \vee b)$ for $a, b > 0$, (8) gives
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge 2\gamma eC(\theta)\|b\|_2\sqrt{t}\Big) \le 2e^{-t}, \quad \text{if } \sqrt{t} \ge L_nt^{1/\theta}.$$
Putting $s := 2\gamma eC(\theta)\|b\|_2\sqrt{t}$, we have
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge s\Big) \le 2\exp\Big\{-\frac{s^2}{4\gamma^2e^2C^2(\theta)\|b\|_2^2}\Big\}, \quad \text{if } s \le 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}.$$
For $\sqrt{t} \le L_nt^{1/\theta}$, we get $\mathrm{P}(|\sum_{i=1}^n w_iX_i| \ge 2\gamma eC(\theta)\|b\|_2L_nt^{1/\theta}) \le 2e^{-t}$; letting $s := 2\gamma eC(\theta)\|b\|_2L_nt^{1/\theta}$ gives
$$\mathrm{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge s\Big) \le 2\exp\Big\{-\frac{s^\theta}{[2\gamma eC(\theta)\|b\|_2L_n]^\theta}\Big\}, \quad \text{if } s > 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}.$$
Similarly, for $\theta > 2$ the two regimes swap: the Weibull-type bound holds for $s < 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}$ and the sub-Gaussian bound holds for $s \ge 2\gamma eC(\theta)\|b\|_2L_n^{\theta/(\theta-2)}$. $\square$
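Since every constant in Theorem 1 is explicit, the deviation level in (8) is directly computable. The following sketch is our numerical illustration of the constants as reconstructed above (the sup and inf over $p \ge 2$ are approximated on a finite grid, and `GAMMA` $\approx 3.2$ is a numerical solution of the inequality defining $\gamma$ in Corollary 4); it also prints the regime crossover point of Theorem 1(c):

```python
import numpy as np
from scipy.special import gammaln

GAMMA = 3.2  # approx. minimal k > 1 with e^{2/k} - 1 + e^{-2(k-1)/k}/(k-1) <= 1

def _g(p, theta):
    # p^{-1/theta} * Gamma(p/theta + 1)^{1/p}, computed stably via gammaln
    return np.exp(-np.log(p) / theta + gammaln(p / theta + 1.0) / p)

P_GRID = np.arange(2.0, 2000.0)  # finite grid standing in for p >= 2

def C_theta(theta):
    if theta <= 1.0:
        sup_term = 3.0 ** ((1.0 - theta) / theta) * np.max(_g(P_GRID, theta))
        return 2.0 * (np.log(2.0) ** (1.0 / theta)
                      + np.e * (np.exp(0.5 * gammaln(2.0 / theta + 1.0)) + sup_term))
    return 8.0 * np.e + 2.0 * np.log(2.0) ** (1.0 / theta)

def L_n(theta, b):
    b = np.asarray(b, dtype=float)
    if theta <= 1.0:
        A = (2.0 * np.e * 3.0 ** ((1.0 - theta) / theta)
             * np.min(_g(P_GRID, theta)) / C_theta(theta))
        ratio = A * np.max(np.abs(b)) / np.linalg.norm(b)
    else:
        beta = theta / (theta - 1.0)  # Hoelder conjugate of theta
        B = (8.0 * np.e * theta ** (-1.0 / theta)
             * (1.0 - 1.0 / theta) ** (1.0 / beta) / C_theta(theta))
        ratio = B * np.linalg.norm(b, ord=beta) / np.linalg.norm(b)
    return GAMMA ** (1.0 / theta) / GAMMA * ratio

def deviation_bound(theta, b, t):
    """Level d with P(|sum_i w_i X_i| >= d) <= 2 e^{-t}, per (8)."""
    b2 = np.linalg.norm(b)
    return GAMMA * np.e * C_theta(theta) * b2 * (np.sqrt(t) + L_n(theta, b) * t ** (1.0 / theta))

theta = 0.5
b = np.full(100, 0.1)  # b_i = w_i * ||X_i||_{psi_theta}
print(deviation_bound(theta, b, t=np.log(1 / 0.01)))  # level for tail prob. 2 * 0.01
s_star = (2 * GAMMA * np.e * C_theta(theta) * np.linalg.norm(b)
          * L_n(theta, b) ** (theta / (theta - 2.0)))
print("sub-Gaussian/Weibull crossover s* =", s_star)
```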
Remark 2. Theorem 1(b) generalizes the sub-Gaussian and sub-exponential concentration inequalities, as well as Bernstein-type inequalities under Bernstein's moment condition. For $\theta < 2$ in Theorem 1(c), the tail behaviour of the sum is akin to a sub-Gaussian tail for small $s$ and resembles a sub-Weibull tail for large $s$; for $\theta > 2$, the tail behaves like a Weibull r.v. with tail parameter $\theta$ for small $s$ and matches the sub-Gaussian tail for large $s$. The intuition is that the sum concentrates around zero by the law of large numbers: the convergence rate is faster for small deviations from the mean and slower for large deviations.
Remark 3. Theorem 1(b) also implies a potential empirical upper bound for $\sum_{i=1}^n w_iX_i$ for any sub-Weibull variables $X_1, \ldots, X_n$, because the only unknown quantity in $\gamma eC(\theta)\|b\|_2\{\sqrt{t} + L_n(\theta,b)t^{1/\theta}\}$ is $b$. By Remark 1, estimating $b$ is possible for i.i.d. observations $X_1, \ldots, X_n$.

In machine learning, non-asymptotic results for an estimator are especially crucial for evaluating finite-sample performance. The key ingredient of non-asymptotic theory is the concentration of measures or r.v.s, which provides sharp probabilistic upper bounds on the desired estimators as functions of the sample size $n$ and dimension $p$.

Let $A = A_{n,p}$ be an $n \times p$ random matrix whose entries are independent copies of a r.v. with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions $n$ and $p$ both grow to infinity while the aspect ratio $p/n$ converges to a constant in $[0,1]$. Then the Bai-Yin law states that, for the extreme singular values of $A$,
$$\frac{1}{\sqrt{n}}\lambda_{\min}(A) = 1 - \sqrt{\frac{p}{n}} + o\Big(\sqrt{\frac{p}{n}}\Big), \qquad \frac{1}{\sqrt{n}}\lambda_{\max}(A) = 1 + \sqrt{\frac{p}{n}} + o\Big(\sqrt{\frac{p}{n}}\Big) \quad \text{a.s.}$$

Next we introduce a special counting measure for the complexity of a set in some space. A set $\mathcal{N}_\varepsilon$ is called an $\varepsilon$-net of $K \subset \mathbb{R}^n$ if $K$ can be covered by balls with centers in $\mathcal{N}_\varepsilon$ and radii $\varepsilon$ (under the Euclidean distance). The covering number $N(K,\varepsilon)$ is defined as the smallest number of closed balls with centers in $K$ and radii $\varepsilon$ whose union covers $K$.

For the purpose of studying random matrices, we extend the definition of sub-Weibull r.v.s to sub-Weibull random vectors. The $n$-dimensional unit Euclidean sphere is denoted by $S^{n-1} := \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$. We say that a random vector $X$ in $\mathbb{R}^n$ is sub-Weibull if the one-dimensional marginals $\langle X, a\rangle$ are sub-Weibull r.v.s for all $a \in \mathbb{R}^n$; the sub-Weibull norm of a random vector $X$ is defined as $\|X\|_{\psi_\theta} := \sup_{a\in S^{n-1}}\|\langle X, a\rangle\|_{\psi_\theta}$.

For simplicity, we assume that the rows of the random matrix are isotropic random vectors: a random vector $Y$ in $\mathbb{R}^p$ is called isotropic if $\mathrm{Var}(Y) = I_p$, or equivalently if $\mathrm{E}\langle Y, a\rangle^2 = \|a\|_2^2$ for all $a \in \mathbb{R}^p$. In the non-asymptotic regime, Theorem 4.6.1 in Vershynin (2018) studies upper and lower bounds for the extreme eigenvalues of random matrices with independent sub-Gaussian rows sampled from high-dimensional distributions. As an extension of that result, the following theorem is a non-asymptotic version of the Bai-Yin law for sub-Weibull entries, which is useful for estimating covariance matrices from heavy-tailed data.
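Before stating the theorem, a quick simulation sketch (ours, for illustration) of the asymptotic Bai-Yin law with heavy-tailed entries: the extreme singular values of $A/\sqrt{n}$ land near $1 \pm \sqrt{p/n}$ even for symmetrized Weibull entries with shape $\theta = 1/2$:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(3)
n, p, theta = 4000, 400, 0.5

# Symmetrized Weibull(theta) entries, standardized to zero mean and unit variance.
w = rng.weibull(theta, size=(n, p)) * rng.choice([-1.0, 1.0], size=(n, p))
A = w / np.sqrt(gamma(1.0 + 2.0 / theta))  # Var(W) = Gamma(1 + 2/theta) here

s = np.linalg.svd(A / np.sqrt(n), compute_uv=False)
print("observed extremes :", s.min(), s.max())
print("Bai-Yin prediction:", 1 - np.sqrt(p / n), 1 + np.sqrt(p / n))
```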
Theorem 2 (Non-asymptotic Bai-Yin law). Let $A$ be an $n \times p$ matrix whose rows $A_i$ are independent isotropic sub-Weibull random vectors in $\mathbb{R}^p$ with covariance matrix $I_p$ and $\max_{1\le i\le n}\|A_i\|_{\psi_\theta} \le K$. Then for every $s \ge 0$ we have
$$\mathrm{P}\Big\{\Big\|\frac{1}{n}A^\top A - I_p\Big\| \le H(cp + s, n; \theta)\Big\} \ge 1 - 2e^{-s},$$
where
$$H(t, n; \theta) := 2\gamma eC\big(\tfrac{\theta}{2}\big)\big(K^2 + (\log 2)^{-2/\theta}\big)\Big[\sqrt{\frac{t}{n}} + \gamma^{\frac{2}{\theta}-1}\Big(\frac{A(\theta/2)\,t^{2/\theta}}{n}\mathbb{1}\{\theta \le 2\} + \frac{B(\theta/2)\,t^{2/\theta}}{n^{2/\theta}}\mathbb{1}\{\theta > 2\}\Big)\Big],$$
with $A(\theta/2)$, $B(\theta/2)$, $C(\theta/2)$ and $\gamma$ defined as in Theorem 1(a). Moreover, the concentration inequality for the extreme singular values holds for $c \ge \log 9$: provided $H(cp+s,n;\theta) \le 1$,
$$\mathrm{P}\Big\{\sqrt{1 - H(cp+s,n;\theta)} \le \frac{\lambda_{\min}(A)}{\sqrt{n}} \le \frac{\lambda_{\max}(A)}{\sqrt{n}} \le \sqrt{1 + H(cp+s,n;\theta)}\Big\} \ge 1 - 2e^{-s}. \quad (10)$$
Proof. For convenience, the proof is divided into three steps.

Step 1. We adopt the following lemma.

Lemma 4 (Computing the spectral norm on a net; Lemma 5.4 in Vershynin (2018)). Let $B$ be a $p \times p$ symmetric matrix, and let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of $S^{p-1}$ for some $\varepsilon \in [0, 1/2)$. Then
$$\|B\| := \max_{\|x\|_2=1}\|Bx\|_2 = \sup_{x\in S^{p-1}}|\langle Bx, x\rangle| \le (1 - 2\varepsilon)^{-1}\sup_{x\in\mathcal{N}_\varepsilon}|\langle Bx, x\rangle|.$$
We show that $\|\frac{1}{n}A^\top A - I_p\| \le 2\max_{x\in\mathcal{N}_{1/4}}\big|\frac{1}{n}\|Ax\|_2^2 - 1\big|$. Indeed, note that $\langle(\frac{1}{n}A^\top A - I_p)x, x\rangle = \frac{1}{n}\|Ax\|_2^2 - 1$ for $x \in S^{p-1}$, so setting $\varepsilon = 1/4$ in Lemma 4 gives
$$\Big\|\frac{1}{n}A^\top A - I_p\Big\| \le (1 - 2\varepsilon)^{-1}\sup_{x\in\mathcal{N}_\varepsilon}\Big|\Big\langle\Big(\frac{1}{n}A^\top A - I_p\Big)x, x\Big\rangle\Big| = 2\max_{x\in\mathcal{N}_{1/4}}\Big|\frac{1}{n}\|Ax\|_2^2 - 1\Big|.$$

Step 2. Fix any $x \in S^{p-1}$ and set $Z_i := \langle A_i, x\rangle$, so that $\|Ax\|_2^2 = \sum_{i=1}^n\langle A_i, x\rangle^2 = \sum_{i=1}^n Z_i^2$. The $\{Z_i\}_{i=1}^n$ are $\mathrm{subW}(\theta)$ with $\mathrm{E}Z_i^2 = 1$ and $\max_{1\le i\le n}\|Z_i\|_{\psi_\theta} \le K$. Then, by Corollary 3, the $Z_i^2$ are independent $\mathrm{subW}(\theta/2)$ r.v.s with $\max_{1\le i\le n}\|Z_i^2\|_{\psi_{\theta/2}} \le K^2$. Since $\|1\|_{\psi_{\theta/2}} = (\log 2)^{-2/\theta}$ by Example 1, the triangle inequality gives
$$\max_{1\le i\le n}\|Z_i^2 - 1\|_{\psi_{\theta/2}} \le \max_{1\le i\le n}\|Z_i^2\|_{\psi_{\theta/2}} + \|1\|_{\psi_{\theta/2}} \le K^2 + (\log 2)^{-2/\theta}. \quad (11)$$
Apply Theorem 1 with weights $w_i = 1/n$ and $b := \frac{1}{n}\big(\|Z_1^2 - 1\|_{\psi_{\theta/2}}, \ldots, \|Z_n^2 - 1\|_{\psi_{\theta/2}}\big)^\top$. With (11), we have $\|b\|_2 = \frac{1}{n}\sqrt{\sum_{i=1}^n\|Z_i^2 - 1\|_{\psi_{\theta/2}}^2} \le \frac{K^2 + (\log 2)^{-2/\theta}}{\sqrt{n}}$ and $\|b\|_\infty \le \frac{K^2 + (\log 2)^{-2/\theta}}{n}$. For $\beta := \frac{\theta}{\theta-2} > 1$ (the Hölder conjugate of $\theta/2$ when $\theta > 2$), we get $\|b\|_\beta = \frac{1}{n}\{\sum_{i=1}^n\|Z_i^2 - 1\|_{\psi_{\theta/2}}^\beta\}^{1/\beta} \le n^{1/\beta - 1}\big[K^2 + (\log 2)^{-2/\theta}\big] = n^{-2/\theta}\big[K^2 + (\log 2)^{-2/\theta}\big]$. Writing $L_n(\theta/2, b)$ for the constant defined in Theorem 1(a),
$$\|b\|_2L_n(\tfrac{\theta}{2}, b) = \gamma^{\frac{2}{\theta}-1}\begin{cases}A(\theta/2)\,\|b\|_\infty, & \theta \le 2,\\ B(\theta/2)\,\|b\|_\beta, & \theta > 2,\end{cases} \le \big[K^2 + (\log 2)^{-2/\theta}\big]\gamma^{\frac{2}{\theta}-1}\begin{cases}A(\theta/2)/n, & \theta \le 2,\\ B(\theta/2)/n^{2/\theta}, & \theta > 2.\end{cases}$$
Hence $\gamma eC(\tfrac{\theta}{2})\{\|b\|_2\sqrt{t} + \|b\|_2L_n(\tfrac{\theta}{2},b)\,t^{2/\theta}\} \le \frac{1}{2}H(t, n;\theta)$ by the definition of $H$, so Theorem 1(b) yields $\mathrm{P}\big(\big|\frac{1}{n}\sum_{i=1}^n(Z_i^2 - 1)\big| \ge \frac{1}{2}H(t,n;\theta)\big) \le 2e^{-t}$. Taking $t = cp + s$ for a constant $c$,
$$\mathrm{P}\Big\{\Big|\frac{1}{n}\|Ax\|_2^2 - 1\Big| \ge \frac{1}{2}H(cp+s, n;\theta)\Big\} \le 2e^{-(cp+s)}.$$

Step 3. We use the following covering-number bound from Vershynin (2018).

Lemma 5 (Covering numbers of the sphere). For the unit Euclidean sphere $S^{p-1}$, the covering number satisfies $N(S^{p-1}, \varepsilon) \le (1 + 2/\varepsilon)^p$ for every $\varepsilon > 0$.

Combining Steps 1 and 2 with a union bound over $\mathcal{N}_{1/4}$,
$$\mathrm{P}\Big\{\Big\|\frac{1}{n}A^\top A - I_p\Big\| \ge H(cp+s,n;\theta)\Big\} \le \mathrm{P}\Big\{\max_{x\in\mathcal{N}_{1/4}}\Big|\frac{1}{n}\|Ax\|_2^2 - 1\Big| \ge \frac{1}{2}H(cp+s,n;\theta)\Big\} \le N(S^{p-1}, \tfrac{1}{4})\cdot 2e^{-(cp+s)} \le 2\cdot 9^pe^{-(cp+s)},$$
where the last inequality follows from Lemma 5 with $\varepsilon = 1/4$. When $c \ge \log 9$, we have $2\cdot 9^pe^{-(cp+s)} \le 2e^{-s}$, which proves the first claim. Moreover, on the event $\{\|\frac{1}{n}A^\top A - I_p\| \le H\}$, writing $H := H(cp+s,n;\theta)$, every $x \in S^{p-1}$ satisfies $\big|\frac{1}{n}\|Ax\|_2^2 - 1\big| \le H$, i.e. $1 - H \le \frac{1}{n}\|Ax\|_2^2 \le 1 + H$. Taking minima and maxima over $x \in S^{p-1}$ gives
$$\sqrt{1 - H} \le \frac{\lambda_{\min}(A)}{\sqrt{n}} \le \frac{\lambda_{\max}(A)}{\sqrt{n}} \le \sqrt{1 + H},$$
which is exactly the second conclusion (10) of the theorem. $\square$
In statistical regression analysis, the responses $\{Y_i\}_{i=1}^n$ in linear regression are assumed to be continuous Gaussian variables. However, the categories in classification or grouping may be indexed by the non-negative integers, and their number may even be infinite. Categorical variables are then treated as countable responses distinguishing categories or groups. In practice, random count responses include the number of patients, of bacteria in a unit region, or of stars in the sky, and so on. Responses $\{Y_i\}_{i=1}^n$ with covariates $\{X_i\}_{i=1}^n$ of this type belong to generalized linear regressions.

We consider i.i.d. random variables $\{(X_i, Y_i)\}_{i=1}^n \sim (X, Y) \in \mathbb{R}^p \times \mathbb{N}$. By the method of maximum likelihood or M-estimation, the estimator $\hat\beta_n$ is given by
$$\hat\beta_n := \arg\min_{\beta\in\mathbb{R}^p}\frac{1}{n}\sum_{i=1}^n\ell(X_i^\top\beta, Y_i), \quad (12)$$
where the loss function $\ell(\cdot,\cdot)$ is convex and twice differentiable in the first argument. In high-dimensional regressions, the dimension of $\beta$ may grow with the sample size $n$. When $\{Y_i\}_{i=1}^n$ belongs to the exponential family, Portnoy (1988) studied the asymptotic behavior of $\hat\beta_n$ in generalized linear models (GLMs) as $p_n := \dim(X)$ increases. The target vector $\beta^* := \arg\min_{\beta\in\mathbb{R}^p}\mathrm{E}\,\ell(X^\top\beta, Y)$ is the minimizer of the loss under the population expectation, in analogy with (12). Let $\dot\ell(u,y) := \frac{\partial}{\partial t}\ell(t,y)\big|_{t=u}$, $\ddot\ell(u,y) := \frac{\partial}{\partial t}\dot\ell(t,y)\big|_{t=u}$, and $C(u,y) := \sup_{|s-t|\le u}\frac{\ddot\ell(s,y)}{\ddot\ell(t,y)}$. Finally, define the score function and Hessian matrix of the empirical loss as $\hat Z_n(\beta) := \frac{1}{n}\sum_{i=1}^n\dot\ell(X_i^\top\beta, Y_i)X_i$ and $\hat Q_n(\beta) := \frac{1}{n}\sum_{i=1}^n\ddot\ell(X_i^\top\beta, Y_i)X_iX_i^\top$, respectively. The population version of the Hessian matrix is $Q(\beta) := \mathrm{E}[\ddot\ell(X^\top\beta, Y)XX^\top]$. The following so-called deterministic inequality guarantees an $\ell_2$-error bound for the smooth M-estimator defined in (12).

Lemma 6 (Corollary 3.1 in Kuchibhotla (2018)). Let $\delta_n(\beta) := \|[\hat Q_n(\beta)]^{-1}\hat Z_n(\beta)\|_2$ for $\beta \in \mathbb{R}^p$. If $\ell(\cdot,\cdot)$ is twice differentiable and convex in the first argument, and for some $\beta^* \in \mathbb{R}^p$,
$$\max_{1\le i\le n}C\big(\|X_i\|_2\,\delta_n(\beta^*), Y_i\big) \le \frac{4}{3},$$
then there exists a vector $\hat\beta_n \in \mathbb{R}^p$ satisfying $\hat Z_n(\hat\beta_n) = 0$ (the estimating equation of (12)) such that
$$\frac{1}{2}\delta_n(\beta^*) \le \|\hat\beta_n - \beta^*\|_2 \le \frac{3}{2}\delta_n(\beta^*).$$

Applications of Lemma 6 in regression analysis are of special interest when $X$ is heavy-tailed, i.e. when the sub-Weibull index satisfies $\theta < 1$. For the negative binomial regression (NBR) with known dispersion parameter $k > 0$, the loss function is
$$\ell(u, y) = -yu + (y + k)\log(k + e^u). \quad (13)$$
Thus we have
$$\dot\ell(u,y) = -\frac{k(y - e^u)}{k + e^u}, \qquad \ddot\ell(u,y) = \frac{k(y+k)e^u}{(k + e^u)^2};$$
see Zhang and Jia (2022) for details. A further computation gives $C(u,y) = \sup_{|s-t|\le u}\frac{e^s(k+e^t)^2}{(k+e^s)^2e^t}$, which implies $C(u,y) \le e^u$. Therefore, the condition $\max_{1\le i\le n}C(\|X_i\|_2\delta_n(\beta^*), Y_i) \le 4/3$ in Lemma 6 is implied by
$$\max_{1\le i\le n}\|X_i\|_2\,\delta_n(\beta^*) \le \log(4/3),$$
which requires an assumption on the design space through $\max_{1\le i\le n}\|X_i\|_2$. In NBR with loss (13), one has
$$\hat Q_n(\beta^*) = \frac{1}{n}\sum_{i=1}^n\frac{(Y_i + k)ke^{X_i^\top\beta^*}}{(k + e^{X_i^\top\beta^*})^2}X_iX_i^\top \qquad\text{and}\qquad \hat Z_n(\beta^*) = -\frac{1}{n}\sum_{i=1}^n\frac{k(Y_i - e^{X_i^\top\beta^*})}{k + e^{X_i^\top\beta^*}}X_i.$$
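As a concrete companion to the loss (13), the following sketch (our illustration; the function names and data-generating choices are ours) implements $\dot\ell$, $\ddot\ell$ and solves the estimating equation $\hat Z_n(\hat\beta_n) = 0$ by Newton's method:

```python
import numpy as np

def nbr_derivatives(u, y, k):
    """First and second derivatives of the NBR loss (13) in u = x' beta."""
    eu = np.exp(u)
    dot = -k * (y - eu) / (k + eu)           # \dot{ell}(u, y)
    ddot = k * (y + k) * eu / (k + eu) ** 2  # \ddot{ell}(u, y)
    return dot, ddot

def fit_nbr(X, y, k, n_iter=50, tol=1e-10):
    """Newton-Raphson for hat{beta}_n solving hat{Z}_n(beta) = 0."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        dot, ddot = nbr_derivatives(X @ beta, y, k)
        Z = X.T @ dot / n                    # score   hat{Z}_n(beta)
        Q = (X * ddot[:, None]).T @ X / n    # Hessian hat{Q}_n(beta)
        step = np.linalg.solve(Q, Z)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta

rng = np.random.default_rng(4)
n, p, k = 5000, 5, 2.0
X = rng.normal(scale=0.3, size=(n, p))
beta_star = np.linspace(-0.5, 0.5, p)
mu = np.exp(X @ beta_star)
y = rng.negative_binomial(k, k / (k + mu))   # NB with mean mu and dispersion k
print(np.round(fit_nbr(X, y, k), 3), beta_star)
```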
To guarantee that $\hat\beta_n$ approximates $\beta^*$ well, some regularity conditions are needed.

(C.1): For constants $M_Y > M_X > 0$, assume that $\max_{1\le i\le n}\|Y_i\|_{\psi_1} \le M_Y$ and that the covariates $\{X_{ik}\}$ are uniformly sub-Weibull with $\max_{1\le i\le n,\,1\le k\le p}\|X_{ik}\|_{\psi_\theta} \le M_X$ for some $0 < \theta < 1$.

(C.2): The vector $X_i$ is sparse, in the sense that, with $\mathcal{F}_Y := \{\max_{1\le i\le n}\mathrm{E}Y_i = \max_{1\le i\le n}e^{X_i^\top\beta^*} \le B,\ \max_{1\le i\le n}\|X_i\|_2 \le I_n\}$ for some slowly increasing sequence $I_n$, we have $\mathrm{P}\{\mathcal{F}_Y^c\} = \varepsilon_n \to 0$. For $\max_{1\le i\le n,\,1\le k\le p}|X_{ik}|$, the sub-Weibull tail bound in Proposition 1 determines
$$\mathrm{P}\Big(\max_{1\le i\le n,\,1\le k\le p}|X_{ik}| > t\Big) \le np\max_{i,k}\mathrm{P}(|X_{ik}| > t) \le 2np\,e^{-(t/M_X)^\theta} \le \delta \quad\text{for } t = M_X\log^{1/\theta}\Big(\frac{2np}{\delta}\Big).$$
Hence we define the event for the maximum design,
$$\mathcal{F}_{\max} := \Big\{\max_{1\le i\le n,\,1\le k\le p}|X_{ik}| \le M_X\log^{1/\theta}\Big(\frac{2np}{\delta}\Big)\Big\} \cap \mathcal{F}_Y.$$

To make sure that the optimization in (12) has a unique solution, we also require a minimal-eigenvalue condition.

(C.3): Suppose that $v^\top\mathrm{E}(\hat Q_n(\beta^*))v \ge C_{\min} > 0$ for all $v \in S^{p-1}$.

In the proof, to ensure that the random Hessian matrix has non-singular eigenvalues, we define the events
$$\mathcal{F}_1 = \bigg\{\max_{k,j}\bigg|\frac{1}{n}\sum_{i=1}^n\bigg[\frac{Y_ike^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2} - \mathrm{E}\bigg(\frac{Y_ike^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2}\bigg)\bigg]\bigg| \le \frac{C_{\min}}{4}\bigg\},$$
$$\mathcal{F}_2 = \bigg\{\max_{k,j}\bigg|\frac{k}{n}\sum_{i=1}^n\bigg[\frac{ke^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2} - \mathrm{E}\bigg(\frac{ke^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2}\bigg)\bigg]\bigg| \le \frac{C_{\min}}{4}\bigg\}.$$

Theorem 3 (Upper bound for the $\ell_2$-error). In the NBR with loss (13), under (C.1)-(C.3), let $M_{BY} := M_Y + \frac{B}{\log 2}$,
$$R_n := \frac{6M_{BY}M_X}{C_{\min}}\bigg[\sqrt{\frac{p}{n}\log\Big(\frac{2p}{\delta}\Big)} + \frac{\sqrt{p}}{n}\log\Big(\frac{2p}{\delta}\Big)\bigg]\log^{1/\theta}\Big(\frac{2np}{\delta}\Big),$$
and $b := \frac{k}{n}M_X^2(1, \ldots, 1)^\top \in \mathbb{R}^n$. Define
$$c_n := 2\exp\bigg\{-\bigg(\frac{nt^2}{4M_X^4\log^{4/\theta}(\frac{2np}{\delta})M_{BY}^2} \wedge \frac{nt}{2M_X^2\log^{2/\theta}(\frac{2np}{\delta})M_{BY}}\bigg)\bigg\} + 2\exp\bigg\{-\bigg(\frac{t^{\theta/2}}{[2\gamma eC(\frac{\theta}{2})\|b\|_2L_n(\frac{\theta}{2},b)]^{\theta/2}} \wedge \frac{t^2}{4\gamma^2e^2C^2(\frac{\theta}{2})\|b\|_2^2}\bigg)\bigg\}$$
with $t = C_{\min}/4$. Then, for any $0 < \delta < 1$, if the sample size $n$ satisfies
$$R_nI_n \le \log(4/3), \quad (14)$$
we have $\mathrm{P}(\|\hat\beta_n - \beta^*\|_2 \le R_n) \ge 1 - p^2c_n - \delta - \varepsilon_n$.
A few comments on this theorem are in order. First, in order to get $\|\hat\beta_n - \beta^*\|_2 \stackrel{\mathrm{p}}{\longrightarrow} 0$, we need $p = o(n)$ under the sample size restriction (14) with $I_n = o\big(\log^{-1/\theta}(np)\cdot[n^{-1}p\log p]^{-1/2}\big)$. Second, note that the $\varepsilon_n$ in the probability $1 - p^2c_n - \delta - \varepsilon_n$ depends on the model size and the fluctuation of the design through the event $\mathcal{F}_{\max}$.
Proof. For any $v \in S^{p-1}$, it holds that
$$v^\top\hat Q_n(\beta^*)v - v^\top\mathrm{E}(\hat Q_n(\beta^*))v \ge -\|v\|_1^2\max_{k,j}\big|[\hat Q_n(\beta^*) - \mathrm{E}\hat Q_n(\beta^*)]_{kj}\big|. \quad (15)$$
Consider the entrywise decomposition
$$\frac{1}{n}\sum_{i=1}^n\bigg[\frac{(Y_i + k)ke^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2} - \mathrm{E}(\cdots)\bigg] = \frac{1}{n}\sum_{i=1}^n\bigg[\frac{Y_ike^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2} - \mathrm{E}(\cdots)\bigg] + \frac{k}{n}\sum_{i=1}^n\bigg[\frac{ke^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k + e^{X_i^\top\beta^*})^2} - \mathrm{E}(\cdots)\bigg].$$
For the first term, under $\mathcal{F}_{\max}$ and with $t = C_{\min}/4$, Corollary 2 gives
$$\mathrm{P}\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n\bigg[\frac{Y_ike^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k+e^{X_i^\top\beta^*})^2} - \mathrm{E}(\cdots)\bigg]\bigg| \ge t,\ \mathcal{F}_{\max}\bigg) \le 2\exp\bigg\{-\bigg(\frac{nt^2}{4M_X^4\log^{4/\theta}(\frac{2np}{\delta})M_{BY}^2} \wedge \frac{nt}{2M_X^2\log^{2/\theta}(\frac{2np}{\delta})M_{BY}}\bigg)\bigg\},$$
where we use $ke^{X_i^\top\beta^*}(k + e^{X_i^\top\beta^*})^{-2} \le 1$, $\|Y_i\|_{\psi_1} + |\mathrm{E}Y_i|/\log 2 \le M_Y + B/\log 2 = M_{BY}$ on $\mathcal{F}_Y$, and $|X_{ik}X_{ij}| \le M_X^2\log^{2/\theta}(\frac{2np}{\delta})$ on $\mathcal{F}_{\max}$. For the second term, since $\|X_{ik}X_{ij}\|_{\psi_{\theta/2}} \le \|X_{ik}\|_{\psi_\theta}\|X_{ij}\|_{\psi_\theta} \le M_X^2$ by Proposition 2, Theorem 1(c) applied with $b = \frac{k}{n}M_X^2(1,\ldots,1)^\top$ yields
$$\mathrm{P}\bigg(\bigg|\frac{k}{n}\sum_{i=1}^n\bigg[\frac{ke^{X_i^\top\beta^*}X_{ik}X_{ij}}{(k+e^{X_i^\top\beta^*})^2} - \mathrm{E}(\cdots)\bigg]\bigg| \ge t\bigg) \le 2\exp\bigg\{-\bigg(\frac{t^{\theta/2}}{[2\gamma eC(\frac{\theta}{2})\|b\|_2L_n(\frac{\theta}{2},b)]^{\theta/2}} \wedge \frac{t^2}{4\gamma^2e^2C^2(\frac{\theta}{2})\|b\|_2^2}\bigg)\bigg\}.$$
By (C.3) and (15), under $\mathcal{F}_1$ and $\mathcal{F}_2$ we have $v^\top\hat Q_n(\beta^*)v \ge C_{\min} - \frac{C_{\min}}{4} - \frac{C_{\min}}{4} = \frac{C_{\min}}{2}$, so that, by a union bound over the $p^2$ index pairs $(k,j)$,
$$\mathrm{P}\Big\{\lambda_{\min}(\hat Q_n(\beta^*)) \le \frac{C_{\min}}{2}\Big\} \le \mathrm{P}\{\mathcal{F}_1^c, \mathcal{F}_{\max}\} + \mathrm{P}\{\mathcal{F}_2^c, \mathcal{F}_{\max}\} + \mathrm{P}(\mathcal{F}_{\max}^c) \le p^2c_n + \mathrm{P}(\mathcal{F}_{\max}^c). \quad (16)\text{--}(17)$$
Conditioning on $\mathcal{F}_1 \cap \mathcal{F}_2$, we then have
$$\delta_n(\beta^*) := \big\|[\hat Q_n(\beta^*)]^{-1}\hat Z_n(\beta^*)\big\|_2 \le \frac{2}{C_{\min}}\|\hat Z_n(\beta^*)\|_2.$$
Since $k/(k + e^{X_i^\top\beta^*}) \le 1$, Corollary 2 implies, for each $1 \le k \le p$,
$$\mathrm{P}\bigg[\sqrt{\frac{p}{n}}\bigg|\sum_{i=1}^n\frac{k(Y_i - e^{X_i^\top\beta^*})X_{ik}}{k + e^{X_i^\top\beta^*}}\bigg| > 2\bigg(\frac{tp}{n}\sum_{i=1}^nX_{ik}^2\|Y_i - \mathrm{E}Y_i\|_{\psi_1}^2\bigg)^{1/2} + 2t\sqrt{\frac{p}{n}}\max_{1\le i\le n}|X_{ik}|\,\|Y_i - \mathrm{E}Y_i\|_{\psi_1}\bigg] \le 2e^{-t}. \quad (18)$$
Let
$$\lambda_n(t, X) := 2\bigg(\frac{tp}{n}\max_{1\le k\le p}\sum_{i=1}^nX_{ik}^2\|Y_i - \mathrm{E}Y_i\|_{\psi_1}^2\bigg)^{1/2} + 2t\sqrt{\frac{p}{n}}\max_{1\le i\le n,\,1\le k\le p}\big(|X_{ik}|\,\|Y_i - \mathrm{E}Y_i\|_{\psi_1}\big).$$
On $\mathcal{F}_{\max}$ we may bound $\max_{i,k}|X_{ik}| \le M_X\log^{1/\theta}(\frac{2np}{\delta})$ and $\max_k\frac{1}{n}\sum_{i=1}^nX_{ik}^2 \le M_X^2\log^{2/\theta}(\frac{2np}{\delta})$, while (C.1) and (C.2) give $\|Y_i - \mathrm{E}Y_i\|_{\psi_1} \le M_{BY}$. Hence
$$\lambda_n(t, X) \le 2M_{BY}M_X\bigg(\sqrt{tp} + t\sqrt{\frac{p}{n}}\bigg)\log^{1/\theta}\Big(\frac{2np}{\delta}\Big) =: \lambda_n(t).$$
Since $\sqrt{n}\|\hat Z_n(\beta^*)\|_2 \le \sqrt{p/n}\max_{1\le k\le p}\big|\sum_{i=1}^nk(Y_i - e^{X_i^\top\beta^*})X_{ik}/(k + e^{X_i^\top\beta^*})\big|$, (18) and a union bound over $k$ show
$$\mathrm{P}\{\sqrt{n}\|\hat Z_n(\beta^*)\|_2 > \lambda_n(t)\} \le 2pe^{-t} + \mathrm{P}(\mathcal{F}_{\max}^c) = \delta + \mathrm{P}(\mathcal{F}_{\max}^c),$$
where $t := \log(2p/\delta)$. Then, via Lemma 6, $\|\hat\beta_n - \beta^*\|_2 \le \frac{3}{2}\delta_n(\beta^*) \le \frac{3}{C_{\min}}\|\hat Z_n(\beta^*)\|_2 \le \frac{3\lambda_n(t)}{C_{\min}\sqrt{n}}$. Under $\mathcal{F}_1 \cap \mathcal{F}_2 \cap \mathcal{F}_{\max}$, we conclude
$$\|\hat\beta_n - \beta^*\|_2 \le \frac{6M_{BY}M_X}{C_{\min}}\bigg[\sqrt{\frac{p}{n}\log\Big(\frac{2p}{\delta}\Big)} + \frac{\sqrt{p}}{n}\log\Big(\frac{2p}{\delta}\Big)\bigg]\log^{1/\theta}\Big(\frac{2np}{\delta}\Big) = R_n.$$
Besides, under $\mathcal{F}_1 \cap \mathcal{F}_2 \cap \mathcal{F}_{\max}$, the sample size condition (14) guarantees the requirement of Lemma 6, since $\max_{1\le i\le n}\|X_i\|_2\,\delta_n(\beta^*) \le I_nR_n \le \log(4/3)$. $\square$

Concentration inequalities are of far-reaching usefulness in high-dimensional statistical inference and machine learning: they facilitate explicit non-asymptotic confidence intervals as functions of the sample size and model dimension. For future research, an MGF-based estimation procedure for the unknown GBO norm would be crucial for constructing non-asymptotic and data-driven confidence intervals for the sample mean. Although we have obtained sharper upper bounds for sub-Weibull concentrations, lower bounds on tail probabilities are also important in some statistical applications (Zhang and Zhou, 2020). Developing non-asymptotic and sharp lower tail bounds for Weibull r.v.s is left for further study.
References
Alzer, H. (1997). On some inequalities for the gamma and psi functions. Mathematics of Computation, 66(217), 373-389.
Bai, Z. D., & Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275-1294.
Bai, Z., & Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices (Vol. 20). Springer.
De la Pena, V., & Gine, E. (2012). Decoupling: From Dependence to Independence. Springer.
Foss, S., Korshunov, D., & Zachary, S. (2011). An Introduction to Heavy-Tailed and Subexponential Distributions. Springer.
Gbur, E. E., & Collins, R. A. (1989). Estimation of the moment generating function. Communications in Statistics - Simulation and Computation, 18(3), 1113-1134.
Götze, F., Sambale, H., & Sinulis, A. (2019). Concentration inequalities for polynomials in α-sub-exponential random variables. arXiv preprint arXiv:1903.05964.
Hao, B., Abbasi-Yadkori, Y., Wen, Z., & Cheng, G. (2019). Bootstrapping upper confidence bound. NeurIPS.
Kuchibhotla, A. K. (2018). Deterministic inequalities for smooth M-estimators. arXiv preprint arXiv:1809.05172.
Kuchibhotla, A. K., & Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605.
Latala, R. (1997). Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3), 1502-1513.
Oliveira, R. I. (2016). The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4), 1175-1194.
Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, 16(1), 356-366.
Tropp, J. A. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2), 1-230.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science (Vol. 47). Cambridge University Press.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Vol. 48). Cambridge University Press.
Yao, J., Zheng, S., & Bai, Z. D. (2015). Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press.
Yaskov, P. (2015). Sharp lower bounds on the least singular value of a random matrix without the fourth moment condition. Electronic Communications in Probability, 44, 1-9.
Zajkowski, K. (2019). On norms in some class of exponential type Orlicz spaces of random variables. Positivity, 1-10.
Zhang, A. R., & Zhou, Y. (2020). On the non-asymptotic and sharp lower tail bounds of random variables. Stat, 9(1), e314.
Zhang, H., & Chen, S. X. (2021). Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1), 1-85.
Zhang, H., & Jia, J. (2022). Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection. Statistica Sinica, 32(1). https://doi.org/10.5705/ss.202019.0315