Fast Non-Asymptotic Testing And Support Recovery For Large Sparse Toeplitz Covariance Matrices
Nayel Bettache¹*, Cristina Butucea, and Marianne Sorba
CREST, ENSAE Paris, 5 Avenue Le Chatelier, 91120 Palaiseau, FRANCE
{nayel.bettache, cristina.butucea}@ensae.fr, [email protected]
*Corresponding author
February 16, 2021

Abstract
We consider $n$ independent $p$-dimensional Gaussian vectors with covariance matrix having Toeplitz structure. We test that these vectors have independent components against a stationary distribution with sparse Toeplitz covariance matrix, and also select the support of non-zero entries. We assume that the non-zero values can occur in the recent past (time-lag less than $p/2$). We build test procedures that combine a sum and a scan-type procedure, but are computationally fast, and show their non-asymptotic behaviour for one-sided (only positive correlations) and two-sided alternatives, respectively. We also exhibit a selector of significant lags and bound the Hamming-loss risk of the estimated support. These results can be extended to the case of nearly Toeplitz covariance structure and to sub-Gaussian vectors. Numerical results illustrate the excellent behaviour of both test procedures and support selectors: the larger the dimension $p$, the faster the rates.

Keywords: Covariance matrix, High-dimensional vectors, Hypothesis testing, Sparsity, Support recovery, Time series
Introduction

Covariance matrices of high-dimensional vectors appear in machine learning, signal processing and statistical procedures. In these fields, e.g. in the test phase of an algorithm or in the validation step of a statistical model, the quality of the residuals (the difference between the observed and the predicted values) is a good indicator of the performance of the procedure. More precisely, the closer the residuals are to a white-noise distribution, the less information was lost by the predictor or the model at hand. It is therefore natural to look for very weak, sparse information in the covariance matrix of such residuals.

Goodness-of-fit tests are designed to assess whether the underlying (unknown) covariance matrix of high-dimensional vectors is the identity (which defines the null hypothesis), or is far from it with respect to some distance (the alternative hypothesis). The separation radius is a measure of how far the covariance matrix needs to be from the identity matrix in order to be distinguishable given the observations. Another important question is to recover the support of the covariance matrix, i.e. the set where the non-null values can be found. As in high-dimensional regression, this support is used to reduce the dimension of the problem, produce unbiased estimators of the non-null entries, and so on. A selector is a vector with coordinates taking value 1 when the covariance value is non-null, respectively 0 when it is null. We assess the quality of a selector in Hamming loss, which counts the number of misclassified coordinates. Our main interests are both testing the covariance matrix and recovering the support of significant covariance elements under the alternative hypothesis of weak, sparse covariance values.

We consider the $p$-dimensional observations $X_1, \dots, X_n$, independent, with Gaussian probability distribution $N_p(0, \Sigma)$, where $\Sigma = [\sigma_{ij}]_{1 \le i,j \le p}$ belongs to the set $\mathcal S_p^{++}$ of positive definite symmetric matrices. Let us denote by $X$ a generic vector with the same Gaussian $N_p(0,\Sigma)$ distribution.
More particularly, when the vector $X$ is issued from a stationary process, its covariance matrix $\Sigma$ has a Toeplitz structure, that is, its elements are constant along each diagonal, and we denote $\sigma_{i,j} = \mathrm{Cov}(X_i, X_j) = \sigma_{|i-j|}$ for all $i, j$ from 1 to $p$. As mentioned in [10], stationary time series are used as approximations of geometrically ergodic time series (whose transition probabilities converge exponentially fast to the stationary distribution). The information in the Toeplitz matrix is fully contained in the vector $(\sigma_0, \sigma_1, \dots, \sigma_{p-1})$ of its diagonal values. More generally, we may study similarly any covariance matrix by looking at the energy of each diagonal of the covariance matrix, that is, its squared Euclidean norm $\sigma_k^2 = \|(\sigma_{1,k+1}, \dots, \sigma_{p-k,p})\|^2$. Here, we devote our efforts to quantifying the benefits of the Toeplitz structure in terms of rates for testing and for support recovery.

Contributions
In this paper, we first give a new variant of a concentration inequality for quadratic forms of large Gaussian vectors and specify these bounds for covariance matrices that are Toeplitz with few non-null diagonals. We show non-asymptotic separation rates for testing large sparse Toeplitz covariance matrices, which are remarkably fast due to the structure of the matrix. We test here whether the covariance matrix is the identity matrix $I_p$, or whether there exists a number $s$ of covariance elements among $\sigma_1, \dots, \sigma_{p-1}$ that are significantly positive (one-sided alternative), respectively significantly different from zero (two-sided alternative). The test procedure combines a sum and a scan procedure in order to detect relatively small but numerous non-null entries and very few but sufficiently large entries, respectively. This is analogous to, but more general than, the detection of sparse Gaussian means ([14, 15], [11]), where observations have the same variance, whereas our model is heteroscedastic.

Moreover, we propose a selector of the diagonals with non-null entries — a lag selector — which is constructed by universal thresholding of some linear estimators. We provide fast non-asymptotic bounds for the expected value of its loss.

Experimental results show the excellent behaviour of these procedures with small values of $n$ (non-asymptotic character of our results) and large values of $p$. Indeed, by exploiting the Toeplitz structure, the matrix size $p$ no longer acts as a nuisance parameter, but improves the convergence rates. All test procedures and the lag selector are computationally trivial to implement. Note that the scan procedure is performed on a vector as well and is therefore computationally fast, in contrast with the scan procedure on matrices, see e.g. [4] or [1].

Bibliography
Previously, Cai and Ma [9] considered the same goodness-of-fit test with an alternative characterized by covariance values that belong to an $\ell_2$ ball of fixed radius. Tests for sparse covariance matrices were given by Arias-Castro, Bubeck and Lugosi [2] and [1]. They considered alternative covariance matrices having at most $s$ significant values, and also the structured alternative of a clique of size $s$ producing a small submatrix of significant values. Our testing rates are faster, but they are difficult to compare, as the Toeplitz structure does not allow for the block or clique sparsity structure of their paper. Butucea and Zgheib [6] and [5] considered the test problem with alternatives that generalize the $\ell_2$-ball of [9] to dense ellipsoids, for Toeplitz and not necessarily Toeplitz covariance matrices, respectively. More precisely, it was assumed there that $\sigma_k$ decreases slowly, as a polynomial (Sobolev ellipsoids), or faster, as an exponential of $k$. The test procedure involved an optimal banding parameter, specific for testing and different from the optimal parameter for estimation of the matrix. It was thus noticed that the minimax rates for goodness-of-fit testing of large covariance matrices are faster for Toeplitz matrices than for non-Toeplitz ones, and that they are faster for testing than for estimation of the covariance matrix. In this paper, we consider an alternative class where at most $s$ significant values appear sparsely.

Cai and Liu [7] and Cai, Liu and Xia [8] considered the problem of support recovery in the sense that the estimated set $\hat{\mathcal S}_n$ differs from the true set $\mathcal S$ with probability tending to 0. To the best of our knowledge, no quantitative rates were given for support recovery in the covariance matrix setup. In the context of Toeplitz covariance matrices, we call this problem lag selection.

Our bounds for testing and lag selection are non-asymptotic; thus $n$ can be equal to 1 when we cannot observe repeated measurements. However, an important remark is that the rates are faster when the significant covariance values have lags in the recent past: $k \le S$, for some $S < p$. Indeed, the rates depend on $p - S$. From an asymptotic point of view, $s$ can tend to infinity as $p$ tends to infinity; thus we allow a nonparametric model (in the sense that the number of parameters increases). Such models have only been considered in nonparametric estimation of the spectral density of stationary time series; see Kreiss, Paparoditis and Politis [16], who use thresholded empirical covariance coefficients.

We define $\varphi_A$, the linear functional of the covariance matrix $\Sigma$ associated to a matrix $A$ belonging to $\mathcal S_p$ (the set of symmetric $p \times p$ matrices), as $\varphi_A(\Sigma) = \mathrm{Tr}(A\Sigma)$. Recall that $\mathrm{Tr}(A^2)$ is also denoted by $\|A\|_F^2$, the squared Frobenius
norm, for any $A$ in $\mathcal S_p$. We denote by $\|A\|_\infty$ the largest eigenvalue of the matrix $A$. We recall that a centered real-valued random variable $Z$ is sub-exponential with positive parameters $(\nu^2, b)$ if
\[
E[\exp(tZ)] \le \exp\Big(\frac{\nu^2 t^2}{2}\Big), \quad \text{for all } |t| \le \frac{1}{b}. \tag{1}
\]
The sample covariance matrix is denoted
\[
\widehat\Sigma_n = \frac{1}{n} \sum_{k=1}^n X_k X_k^T.
\]
The next theorem states that for $X_1, \dots, X_n$ independent multivariate Gaussian $N_p(0, \Sigma)$ vectors, the random variable $Z = \varphi_A(\widehat\Sigma_n - \Sigma)$, for $A$ in $\mathcal S_p$, is sub-exponential with explicit values of the parameters $(\nu^2, b)$. We first recall the Bernstein inequality that holds for sub-exponential random variables [19].

Proposition 2.1. If $Z$ is a sub-exponential random variable with parameters $(\nu^2, b)$, then
\[
P[Z \ge t] \le \begin{cases} \exp\big(-\frac{t^2}{2\nu^2}\big), & \text{if } 0 \le t \le \nu^2/b, \\[2pt] \exp\big(-\frac{t}{2b}\big), & \text{if } t > \nu^2/b. \end{cases}
\]
Equivalently, $Z$ satisfies $P[Z \ge t_u] \le \exp(-u/2)$, for all $u > 0$, where $t_u = \max(\nu\sqrt u,\, bu)$.

Thus, the concentration inequality for the plug-in estimator $\varphi_A(\widehat\Sigma_n)$ of $\varphi_A(\Sigma)$ follows immediately.
Theorem 2.2. The random variable $\varphi_A(\widehat\Sigma_n - \Sigma)$ (respectively $\varphi_A(\Sigma - \widehat\Sigma_n)$) is centered and sub-exponential with parameters
\[
\nu^2 = \frac{2\|A\Sigma\|_F^2}{n(1-K)}, \qquad b = \frac{2\|A\Sigma\|_\infty}{nK},
\]
for some arbitrary $K$ in $]0,1[$. Therefore, we have
\[
P\big[\varphi_A(\widehat\Sigma_n - \Sigma) \ge t_u\big] \le \exp(-u/2), \quad \text{for all } u > 0, \tag{2}
\]
where
\[
t_u = \max\left\{ \sqrt{\frac{2u}{n(1-K)}}\,\|A\Sigma\|_F,\ \frac{2u}{nK}\,\|A\Sigma\|_\infty \right\}.
\]

Previous concentration inequalities were given for such functionals. The closest to our case is the chi-square-type concentration inequality in Spokoiny and Zhilova [18] for standardized Gaussian vectors, generalized to sub-Gaussian vectors. They generalized Hsu, Kakade and Zhang [13], who assumed finite exponential moments of any order for the vector $X$. Let us also mention Giurcanu and Spokoiny [12], who gave a Bernstein inequality for the empirical covariance element of a stationary centered Gaussian process and generalized it to locally stationary Gaussian processes. Let us finally mention the Hanson-Wright inequality, which is stated for more general sub-Gaussian vectors but with independent components, i.e. a diagonal covariance matrix (see Rudelson and Vershynin [17] and its improvement under a Bernstein condition on moments by Bellec [3]).

The concentration inequality (2) is the main tool in the applications that we consider hereafter to study stationary time series. In this context, we assume that $X_1, \dots, X_n$ are repeated, independent observations of length $p$ of an underlying stationary process $X = \{X_1, \dots, X_p, \dots\}$. Note that our results are non-asymptotic, thus $n$ can be equal to 1. Without loss of generality, we assume that the process is centered. The covariance matrix of a stationary process is a Toeplitz covariance matrix, and we denote $\sigma_{|i-j|} = \mathrm{Cov}(X_i, X_j)$. Let us denote by $\mathcal T_p$ the set of $p \times p$ Toeplitz matrices and by $|\mathcal S|$ the cardinality of a set $\mathcal S$.
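To make the bound concrete, here is a minimal numerical sketch of Theorem 2.2, assuming the parameters $(\nu^2, b)$ and constants as reconstructed above; the test matrix, the choice $K = 1/2$ and all variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, u = 200, 50, 0.5, 4.0

Sigma = np.eye(p)                              # population covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n                        # sample covariance

# An illustrative symmetric test matrix A: normalized first off-diagonal.
A = np.zeros((p, p))
i = np.arange(p - 1)
A[i, i + 1] = A[i + 1, i] = 1.0 / (2 * (p - 1))

phi = np.trace(A @ (Sigma_hat - Sigma))        # phi_A(Sigma_hat_n - Sigma)
ASigma = A @ Sigma
nu2 = 2 * np.linalg.norm(ASigma, "fro") ** 2 / (n * (1 - K))
b = 2 * np.linalg.norm(ASigma, 2) / (n * K)
t_u = max(np.sqrt(nu2 * u), b * u)             # Bernstein threshold of Proposition 2.1
print(phi, t_u, np.exp(-u / 2))                # phi exceeds t_u with prob. <= exp(-u/2)
```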
Definition 1. We define $\mathcal F_+(s, S, \sigma)$, for $\sigma > 0$ a real number and $s \le S$ integers between 1 and $p-1$, as the set of sparse Toeplitz covariance matrices $\Sigma$ such that there are $s$ significantly positive covariance elements with lags no larger than $S$:
\[
\mathcal F_+(s,S,\sigma) = \Big\{ \Sigma \in \mathcal S_p^{++} \cap \mathcal T_p : \text{there exists } \mathcal S \subseteq \{1,\dots,S\} \text{ with } |\mathcal S| = s,\ \sigma_j \ge \sigma > 0 \text{ for all } j \in \mathcal S,\ \sigma_j = 0 \text{ for all } j \in \{1,\dots,p-1\} \setminus \mathcal S \Big\}.
\]
Similarly, we define the two-sided set $\mathcal F(s,S,\sigma)$:
\[
\mathcal F(s,S,\sigma) = \Big\{ \Sigma \in \mathcal S_p^{++} \cap \mathcal T_p : \text{there exists } \mathcal S \subseteq \{1,\dots,S\} \text{ with } |\mathcal S| = s,\ |\sigma_j| \ge \sigma > 0 \text{ for all } j \in \mathcal S,\ \sigma_j = 0 \text{ for all } j \in \{1,\dots,p-1\} \setminus \mathcal S \Big\}.
\]
Let us apply Theorem 2.2 to several choices of the matrix $A$. First, the covariance element $\sigma_j$, $j \ge 1$, can be written as $\sigma_j = E[X^T A_j X] = \mathrm{Tr}(A_j \Sigma)$, with $[A_j]_{k\ell} = \frac{1}{2(p-j)} I(|k-\ell| = j)$, a matrix whose entries vanish except on the $j$th upper and lower diagonals. Note that we use the notation $A_j$ instead of $A_{\{j\}}$. The empirical estimator of $\sigma_j$ can be written as
\[
\hat\sigma_j = \frac{1}{n} \sum_{k=1}^n X_k^T A_j X_k = \mathrm{Tr}(A_j \widehat\Sigma_n).
\]
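In practice $\hat\sigma_j$ never requires forming $A_j$ explicitly: $\mathrm{Tr}(A_j \widehat\Sigma_n)$ is the average of the products of coordinates at lag $j$. A minimal sketch (the helper name is ours):

```python
import numpy as np

def sigma_hat(X, j):
    """Empirical estimator sigma_hat_j = Tr(A_j Sigma_hat_n): the products
    X_k X_{k+j} averaged over the p - j pairs and the n samples."""
    n, p = X.shape
    return np.mean(np.sum(X[:, :-j] * X[:, j:], axis=1) / (p - j))
```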
Remark. It is useful to note that our results can be generalized to time series that are "nearly" stationary, by considering
\[
\tilde\sigma_j = \mathrm{Tr}(A_j \Sigma) = \frac{1}{2(p-j)} \sum_{i,k=1,\ |i-k|=j}^{p} \sigma_{i,k}.
\]
In that case, we consider slightly different sets of sparse covariance matrices, $\widetilde{\mathcal F}_+(s,S,\sigma)$ and $\widetilde{\mathcal F}(s,S,\sigma)$, of not necessarily Toeplitz matrices with $s$ of the first $S$ diagonal average values $\tilde\sigma_j$ being significant. Taking into account that all methods we study in the sequel for testing and lag selection are based exclusively on the concentration of the mean empirical correlations around their expected values $\tilde\sigma_j$, the following results remain valid provided that we control $\|A\Sigma\|_F$ and $\|A\Sigma\|_\infty$.

Let $W \subseteq \{1,\dots,S\}$ be a set of $w$ values between 1 and $S$. We estimate
\[
\sum_{j \in W} \sigma_j = \mathrm{Tr}(A_W \Sigma), \quad \text{where } A_W = \sum_{j \in W} A_j,
\]
by $\mathrm{Tr}(A_W \widehat\Sigma_n)$.
Proposition 2.3. Let $W \subseteq \{1,\dots,S\}$ contain $w$ elements and let $A_W = \sum_{j\in W} A_j$. We have:

1. $\|A_W\|_\infty \le \dfrac{w}{p-S}$ and $\|A_W\|_F^2 \le \dfrac{w}{2(p-S)}$;
2. for any covariance matrix $\Sigma$ belonging to $\mathcal F(s,S,\sigma)$, $\|A_W\Sigma\|_\infty \le \sigma_0\, \dfrac{w(2s+1)}{p-S}$ and
\[
\|A_W\Sigma\|_F^2 \le \sigma_0^2 \cdot \begin{cases} \dfrac{K(2s+1)}{p-S}, & \text{if } w = 1,\\[4pt] \dfrac{w(2s+1)^2}{2(p-S)}, & \text{if } w > 1, \end{cases}
\qquad \text{where } K = \begin{cases} 3/2, & \text{if } W \subseteq \{1,\dots,p/2-1\},\\ 1/2, & \text{if } W \subseteq \{p/2,\dots,p-1\}. \end{cases}
\]
The next Corollary specifies the concentration inequality of Theorem 2.2 using the bounds of Proposition 2.3 above.
Corollary 2.4.
Let $X_1,\dots,X_n$ be i.i.d. $N_p(0_p,\Sigma)$, with $\Sigma$ belonging to $\mathcal F_+(s,S,\sigma)$ or $\mathcal F(s,S,\sigma)$, and let $W \subseteq \{1,\dots,S\}$, with $S < p/2$, have $w$ elements. Then, for some arbitrary $K$ in $]0,1[$,
\[
P_{I_p}\big[\varphi_{A_W}(\widehat\Sigma_n - I_p) \ge \sigma_0\, t\big] \le \exp(-u/2), \quad \text{for all } u>0, \tag{3}
\]
where
\[
t = \max\left\{ \sqrt{\frac{u}{1-K}}\, \sqrt{\frac{w}{n(p-S)}},\ \frac{2u}{K}\cdot \frac{w}{n(p-S)} \right\}.
\]
Moreover, for any $\Sigma$ in $\mathcal F(s,S,\sigma)$,
\[
P_\Sigma\big[\varphi_{A_W}(\widehat\Sigma_n - \Sigma) \ge \sigma_0\, \tilde t\,\big] \le \exp(-u/2), \quad \text{for all } u>0, \tag{4}
\]
where
\[
\tilde t = \max\left\{ \sqrt{\frac{3u}{1-K}}\, \sqrt{\frac{2s+1}{n(p-S)}},\ \frac{2u}{K}\cdot \frac{2s+1}{n(p-S)} \right\} \ \text{if } w=1, \qquad \tilde t = (2s+1)\, t \ \text{if } w>1.
\]
Similar inequalities hold for $|\varphi_{A_W}(\widehat\Sigma_n - I_p)|$ and $|\varphi_{A_W}(\widehat\Sigma_n - \Sigma)|$, which multiplies the exponential term by a factor of two in (3) and (4). If $W = \{1,\dots,S\}$, it is enough to replace $w$ by $S$ in the previous results; if $W = \{j\}$ for some $j \le S$, the previous results hold with $w$ replaced by 1.

From now on, we assume that $S < p/2$, so that $K = 3/2$ in the previous Proposition. Indeed, in the context of time series, it is natural to look for significant correlations in the recent past. From now on, we also assume for simplicity that $\sigma_0 = 1$, thus dealing with correlation matrices only. The one-sided test problem is
\[
H_0 : \Sigma = I_p, \quad \text{vs.} \quad H_1 : \Sigma \in \mathcal F_+(s,S,\sigma).
\]
The following two-sided test problem will also be discussed as a generalization:
\[
H_0 : \Sigma = I_p, \quad \text{vs.} \quad H_1 : \Sigma \in \mathcal F(s,S,\sigma).
\]
Recall that a test procedure $\Delta_n$ is a binary-valued random variable $\Delta_n : (\mathbb R^p)^{\otimes n} \to \{0,1\}$. It separates the set of possible outcomes of some random event into two contiguous sets: we decide to reject $H_0$ whenever $\Delta_n = 1$ and to accept $H_0$ whenever $\Delta_n = 0$. The maximal testing risk is defined as
\[
R(\Delta_n, \mathcal F_+) = P_{I_p}(\Delta_n = 1) + \sup_{\Sigma \in \mathcal F_+} P_\Sigma(\Delta_n = 0),
\]
that is, the sum of the type I and the maximal type II error probabilities over the set in the alternative hypothesis. A separation rate is the least possible value of $\sigma > 0$ such that the maximal testing risk stays below some prescribed value.

We proceed by considering successively two measures of the separation between $I_p$ and $\Sigma$ under the alternative hypothesis $H_1$. We choose successively the sets $W = \{1,\dots,S\}$ and $W = \mathcal S$, an arbitrary subset of $\{1,\dots,S\}$ with $s$ elements. For testing over $\mathcal F_+(s,S,\sigma)$, we consider $\mathrm{Tr}(A_{\{1,\dots,S\}}\Sigma)$ and $\max_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s} \mathrm{Tr}(A_{\mathcal S}\Sigma)$. Correspondingly, over $\mathcal F(s,S,\sigma)$ we consider
\[
\sum_{j=1}^S |\sigma_j| = \sum_{j=1}^S |\mathrm{Tr}(A_j\Sigma)| \quad \text{and} \quad \max_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s}\ \sum_{j \in \mathcal S} |\mathrm{Tr}(A_j\Sigma)|.
\]
By analogy with the vector case, we distinguish moderately sparse and highly sparse covariance structures. In the first case, the sum of all $S$ values allows one to test, whereas in the latter a search over subsets of size $s$ is necessary. This is called a scan procedure, and it is computationally fast for vectors. Note that, if the sparsity $s$ is unknown, a second search over different possible values of $s$ produces an aggregated procedure, free of $s$.

When the alternative hypothesis is $\mathcal F_+(s,S,\sigma)$, we consider, for some threshold $t^{MS+}_{n,p}$, the test procedure
\[
\Delta^{MS+}_n = I\big( \varphi_{A_{\{1,\dots,S\}}}(\widehat\Sigma_n - I_p) \ge t^{MS+}_{n,p} \big). \tag{5}
\]
Theorem 3.1.
The test $\Delta^{MS+}_n$ defined in (5), with
\[
t^{MS+}_{n,p} = \max\left\{ \sqrt{\frac{2u\,S}{n(p-S)}},\ \frac{4u\,S}{n(p-S)} \right\} \quad \text{for } u > 0,
\]
is such that $R(\Delta^{MS+}_n, \mathcal F_+) \le 2\exp(-u/2)$, provided that $\sigma \ge \dfrac{2(s+1)}{s}\, t^{MS+}_{n,p}$.
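A minimal sketch of the moderately sparse one-sided test (5), with the threshold of Theorem 3.1 as reconstructed above (the numerical constants are our reading of the garbled source):

```python
import numpy as np

def ms_plus_test(X, S, u=4.0):
    """One-sided moderately sparse test: reject when the sum of the first S
    lag statistics exceeds t^{MS+}_{n,p} (the A_j have zero diagonal, so the
    statistic equals the sum of the empirical sigma_hat_j)."""
    n, p = X.shape
    stat = sum(np.mean(np.sum(X[:, :-j] * X[:, j:], axis=1) / (p - j))
               for j in range(1, S + 1))
    t = max(np.sqrt(2 * u * S / (n * (p - S))), 4 * u * S / (n * (p - S)))
    return int(stat >= t)                       # 1 = reject H_0
```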
When the alternative hypothesis is $\mathcal F(s,S,\sigma)$, we consider, for some threshold $t^{MS}_{n,p}$, the test procedure
\[
\Delta^{MS}_n = I\Big( \sum_{i=1}^S |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \ge t^{MS}_{n,p} \Big). \tag{6}
\]
Theorem 3.2.
The test $\Delta^{MS}_n$ defined in (6), with
\[
t^{MS}_{n,p} = S \max\left\{ \sqrt{\frac{2u \log(2S)}{n(p-S)}},\ \frac{4u \log(2S)}{n(p-S)} \right\} \quad \text{for } u > 0,
\]
is such that
\[
R(\Delta^{MS}_n, \mathcal F) \le 4 \exp\Big( -\frac{(u-1)\log(2S)}{2} \Big),
\]
provided that
\[
\sigma \ge t^{MS}_{n,p} + \max\left\{ \sqrt{\frac{6(u-1)(2s+1)\log(2S)}{n(p-S)}},\ \frac{4(u-1)(2s+1)\log(2S)}{n(p-S)} \right\}.
\]

Let us now consider, for some threshold $t^{HS+}_{n,p}$, the test procedure
\[
\Delta^{HS+}_n = \max_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s} I\big( \varphi_{A_{\mathcal S}}(\widehat\Sigma_n - I_p) \ge t^{HS+}_{n,p} \big). \tag{7}
\]
The test $\Delta^{HS+}_n$ successively tries all possible sets $\mathcal S$ of $s$ diagonals among the first $S$ diagonal values. If any of these tests decides to reject $H_0$, then $\Delta^{HS+}_n$ also rejects $H_0$; otherwise $\Delta^{HS+}_n$ accepts the null hypothesis $H_0$.

Theorem 3.3.
The test $\Delta^{HS+}_n$ defined in (7), with
\[
t^{HS+}_{n,p} = \max\left\{ \sqrt{\frac{2u\, s \log\binom{S}{s}}{n(p-S)}},\ \frac{4u\, s \log\binom{S}{s}}{n(p-S)} \right\} \quad \text{for } u > 0,
\]
is such that
\[
R(\Delta^{HS+}_n, \mathcal F_+) \le \exp\Big( -\frac{u-1}{2} \log\binom{S}{s} \Big) + \exp(-u/2),
\]
provided that
\[
\sigma \ge \frac{1}{s}\left( t^{HS+}_{n,p} + (2s+1) \max\left\{ \sqrt{\frac{2u\, s}{n(p-S)}},\ \frac{4u\, s}{n(p-S)} \right\} \right).
\]

When the alternative set of hypotheses is $\mathcal F(s,S,\sigma)$, consider, for some threshold $t^{HS}_{n,p} > 0$,
\[
\Delta^{HS}_n = \max_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s} I\Big( \sum_{j \in \mathcal S} |\varphi_{A_j}(\widehat\Sigma_n - I_p)| \ge t^{HS}_{n,p} \Big). \tag{8}
\]
Theorem 3.4.
The test $\Delta^{HS}_n$ defined in (8), with
\[
t^{HS}_{n,p} = s \max\left\{ \sqrt{\frac{2u \log\big(2s\binom{S}{s}\big)}{n(p-S)}},\ \frac{4u \log\big(2s\binom{S}{s}\big)}{n(p-S)} \right\} \quad \text{for } u > 0,
\]
is such that
\[
R(\Delta^{HS}_n, \mathcal F) \le 4\exp\Big[ -\frac{u-1}{2} \log\Big(2s\binom{S}{s}\Big) \Big],
\]
provided that
\[
\sigma \ge t^{HS}_{n,p} + \max\left\{ \sqrt{\frac{2(u-1)\log\big(2s(2s+1)\binom{S}{s}\big)}{n(p-S)}},\ \frac{4(u-1)\log\big(2s(2s+1)\binom{S}{s}\big)}{n(p-S)} \right\}.
\]

Remark. When the separation is measured by $\max_{\mathcal S} \sum_{j \in \mathcal S} \sigma_j$, its estimator is known as the scan statistic. Note that the computations are not very involved. Indeed, after computing $\xi_1 = \varphi_{A_1}(\widehat\Sigma_n - I_p), \dots, \xi_S = \varphi_{A_S}(\widehat\Sigma_n - I_p)$, we sort these values in decreasing order, $\xi_{(1)} \ge \xi_{(2)} \ge \dots \ge \xi_{(S)}$, and then
\[
\max_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s}\ \sum_{j \in \mathcal S} \varphi_{A_j}(\widehat\Sigma_n - I_p) = \xi_{(1)} + \dots + \xi_{(s)}.
\]
Similar calculations hold for $\max_{\mathcal S} \sum_{j \in \mathcal S} |\sigma_j|$ and $|\xi|_{(1)} \ge |\xi|_{(2)} \ge \dots \ge |\xi|_{(S)}$. We thus exploit the Toeplitz structure, which reduces the matrix structure to a vector and makes the scan statistic computationally efficient; a sketch of this computation is given below.
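A minimal sketch of the scan computation of the remark above (names are ours):

```python
import numpy as np

def scan_statistics(xi, s):
    """xi: the S lag statistics xi_j = phi_{A_j}(Sigma_hat_n - I_p).
    Returns the one-sided and two-sided scan statistics for sparsity s."""
    one_sided = np.sort(xi)[::-1][:s].sum()          # xi_(1) + ... + xi_(s)
    two_sided = np.sort(np.abs(xi))[::-1][:s].sum()  # |xi|_(1) + ... + |xi|_(s)
    return one_sided, two_sided
```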
Remark.
Note that the previous tests must be aggregated over a set of possible values of $s$ in order to be free of the sparsity $s$: $\widetilde\Delta^{HS}_n = \max_s \Delta^{HS}_n(s)$ rejects whenever at least one test rejects.
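A sketch of this aggregation, assuming a hypothetical helper hs_test(X, s) that returns the decision of $\Delta^{HS}_n$ at sparsity $s$:

```python
def aggregated_hs_test(X, s_grid, hs_test):
    # Reject as soon as one test on the grid of plausible sparsities rejects.
    return max(hs_test(X, s) for s in s_grid)
```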
Discussion. a) If $S \asymp \log(p)$, so that $p - S \asymp p$, the series has short memory. We get $t^{MS+}_{n,p} \asymp \sqrt{\log(p)/(np)}$, giving a testing rate smaller than $\sqrt{\log(p)/(np)}$, and, with Stirling's approximation, $t^{HS+}_{n,p} \asymp s\sqrt{\log(\log(p)/s)/(np)}$, giving the following bound on the testing rate:
\[
\sqrt{\frac{\log(\log(p)/s)}{np}} + \sqrt{\frac{s}{np}}.
\]
We see that $\Delta^{HS+}_n$ detects smaller values of $\sigma$ than $\Delta^{MS+}_n$ when $s \le \log(p)$, hence our choice to name the procedures MS and HS respectively.

b) If the stationary time series has longer memory, for example $S = p/2 - 1$, this gives $p - S \asymp p/2$ and $S/(p-S) \asymp 1$. In this case, $t^{MS+}_{n,p} \asymp 1/\sqrt n$ and $\sigma \gtrsim 1/\sqrt n$, while
\[
t^{HS+}_{n,p} \asymp s\sqrt{\frac{\log(p/s)}{np}} + \sqrt{\frac{s}{np}}.
\]
Again, if $s/p \to 0$, the test $\Delta^{HS+}_n$ detects smaller values of $\sigma$ than $\Delta^{MS+}_n$. However, if $s = S \asymp p$, it is sufficient to use only $\Delta^{MS+}_n$.

Table 1 summarizes our results, where $C_1$, $C_2$, $C_1^*$ and $C_2^*$ denote constants depending only on $u$.

Table 1: Thresholds $t$ and separation rates for moderately and highly sparse tests

One-sided tests ($MS+$ and $HS+$):
\[
t^{MS+} = \max\left\{ C_1\sqrt{\tfrac{S}{n(p-S)}},\ C_2\tfrac{S}{n(p-S)} \right\}, \qquad t^{HS+} = \max\left\{ C_1\sqrt{\tfrac{s\log\binom{S}{s}}{n(p-S)}},\ C_2\tfrac{s\log\binom{S}{s}}{n(p-S)} \right\},
\]
with separation rates
\[
\sigma \ge \tfrac{2(s+1)}{s}\, t^{MS+}, \qquad \sigma \ge \tfrac{t^{HS+}}{s} + \tfrac{2s+1}{s}\max\left\{ C_1\sqrt{\tfrac{s}{n(p-S)}},\ C_2\tfrac{s}{n(p-S)} \right\}.
\]
Two-sided tests ($MS$ and $HS$):
\[
t^{MS} = S\max\left\{ C_1\sqrt{\tfrac{\log(2S)}{n(p-S)}},\ C_2\tfrac{\log(2S)}{n(p-S)} \right\}, \qquad t^{HS} = s\max\left\{ C_1\sqrt{\tfrac{\log(2s\binom{S}{s})}{n(p-S)}},\ C_2\tfrac{\log(2s\binom{S}{s})}{n(p-S)} \right\},
\]
with separation rates
\[
\sigma \ge t^{MS} + \max\left\{ C_1^*\sqrt{\tfrac{(2s+1)\log(2S)}{n(p-S)}},\ C_2^*\tfrac{(2s+1)\log(2S)}{n(p-S)} \right\}, \qquad \sigma \ge t^{HS} + \max\left\{ C_1^*\sqrt{\tfrac{\log(2s(2s+1)\binom{S}{s})}{n(p-S)}},\ C_2^*\tfrac{\log(2s(2s+1)\binom{S}{s})}{n(p-S)} \right\}.
\]

Experimental results
A more detailed numerical study is included in the Simulation results section below, including an example of a sparse MA($\lfloor p/2 \rfloor$) series with increasing $p$. We want to give a fast glimpse of the graphs of the power function, $P_\Sigma(\Delta_n = 1)$, for the tests $\Delta^{MS}_n$ and $\Delta^{HS}_n$, for different values of $\Sigma$. Here $S = \sqrt p$ and $s = (S-1)/2$. Figures 1 and 2 display, for several values of $p$ and $n$, the power as a function of $\sum_{j=1}^S |\sigma_j|$ and $\sum_{j \in \mathcal S} |\sigma_j|$, on a logarithmic scale that allows a better reading of the graphics. The plots show very steep power functions, indicating a narrow band where the decision is hard to make. The power goes from small values near $\alpha = 10\%$ to values close to 1 in a fast, increasing way. There are few differences between the behaviour of the moderately and highly sparse tests. We note an improvement as $p$ grows (the tests detect matrices closer to the identity), in agreement with the theoretical rates, which indicated that $p$ is not a nuisance parameter here. All figures should be printed in color.
[Figure 1: Power of the $\Delta^{MS}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]
[Figure 2: Power of the $\Delta^{HS}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]
The objective here is to properly select the non-null correlation coefficients. We define a (two-sided) lag-selection problem as the estimation of $\eta$, a vector with entries $\eta_j = I(|\varphi_{A_j}(\Sigma)| > 0)$. We want to find a selector $\hat\eta$, with $\hat\eta_j = I(|\varphi_{A_j}(\widehat\Sigma_n)| > \tau_n)$, that is consistent in the sense that the risk
\[
R_{LS}(\hat\eta, \mathcal F) = \sum_{j=1}^S E_\Sigma[|\hat\eta_j - \eta_j|]
\]
stays bounded (is small). This Hamming loss counts the number of misclassified elements.

Theorem 4.1. If $\Sigma$ belongs to $\mathcal F(s,S,\sigma)$, with $\sigma \ge 2\tau_n$, the selector $\hat\eta$ with
\[
\tau_n = \max\left\{ \big(\sqrt{\log(s)} + \sqrt{\log(S-s)}\big) \sqrt{\frac{2u(2s+1)}{n(p-S)}},\ \ 2u \log(s(S-s))\, \frac{2s+1}{n(p-S)} \right\} \quad \text{for } u > 0,
\]
is such that
\[
R_{LS}(\hat\eta, \mathcal F) \le 2\exp\Big( -\frac{(u-1)\log(s)}{4} \Big) + 2\exp\Big( -\frac{(u-1)\log(S-s)}{4} \Big).
\]
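A minimal sketch of the selector of Theorem 4.1, with $\tau_n$ as reconstructed above ($s$ and $S$ are taken as known; constants are our reading of the source):

```python
import numpy as np

def lag_selector(X, S, s, u=4.0):
    """Universal thresholding of the lag statistics: hat_eta_j = 1 iff
    |phi_{A_j}(Sigma_hat_n)| exceeds tau_n (assumes 1 < s < S)."""
    n, p = X.shape
    xi = np.array([np.mean(np.sum(X[:, :-j] * X[:, j:], axis=1) / (p - j))
                   for j in range(1, S + 1)])
    tau = max((np.sqrt(np.log(s)) + np.sqrt(np.log(S - s)))
              * np.sqrt(2 * u * (2 * s + 1) / (n * (p - S))),
              2 * u * np.log(s * (S - s)) * (2 * s + 1) / (n * (p - S)))
    return (np.abs(xi) > tau).astype(int)       # the selector hat_eta
```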
Remark. If we only consider the class $\mathcal F_+$, with $\sigma > 2\tau_n$, we define a one-sided selection by $\eta^+_j = I(\varphi_{A_j}(\Sigma) > 0)$ and consider $\hat\eta^+_j = I(\varphi_{A_j}(\widehat\Sigma_n) > \tau_n)$. Then
\[
R_{LS}(\hat\eta^+, \mathcal F_+) \le \exp\Big( -\frac{(u-1)\log(s)}{4} \Big) + \exp\Big( -\frac{(u-1)\log(S-s)}{4} \Big).
\]
Take for example $S = p/2 - 1$, and assume that $s/p = p^{-\beta}$ for some $\beta$ in $(0,1)$. This implies that $\log(s) \sim (1-\beta)\log(p)$, and the asymptotic order of $\tau_n$ as $p$ tends to infinity is
\[
\tau_n \sim \big(1 + \sqrt{1-\beta}\big) \sqrt{\frac{8u\log(p)}{np^{\beta}}}, \quad u > 0.
\]
Figure 3 shows the good behaviour of our lag selector under the hypothesis $\Sigma \in \mathcal F(s,S,\sigma)$. We plot the Hamming loss between $\eta$ and $\hat\eta$, averaged over 1000 repetitions, as a function of $n$, for numerous values of $p$, taking $S = \sqrt p$. We note the fast decrease to 0 of the Hamming loss both for $s = S-1$ and for $s = (S-1)/2$, despite the small values $\sigma \asymp \tau_n$ to detect.

[Figure 3: Hamming loss of the lag selector; panels (a) $s = S-1$, (b) $s = (S-1)/2$.]
Simulation results. We include several examples to illustrate the numerical behaviour of our test procedures. First, we highlight that the plots are drawn on a logarithmic scale. We estimate the power of the four test procedures $\Delta^{MS+}_n$, $\Delta^{MS}_n$, $\Delta^{HS+}_n$ and $\Delta^{HS}_n$ for testing the null hypothesis $\Sigma = I_p$. We choose the number of non-null entries $s$ and the non-null support $\mathcal S \subset \{1,\dots,S\}$ with
\[
s = (S-1)/2, \qquad S = \sqrt p.
\]
The location of the non-zero entries is randomly chosen. We define the common value of the non-null entries as growing fractions of $\sigma$. The threshold of the test procedure is defined as $t = t_{n,p,\alpha}$, the empirical $(1-\alpha)$-quantile of the test statistic under the null hypothesis. In order to determine its value empirically, we generate 5000 repeated samples under the null hypothesis. The plots represent the power of the tests against the measure of separation, namely $\sum_{j=1}^S \sigma_j$ for the one-sided tests and $\sum_{j=1}^S |\sigma_j|$ for the two-sided tests. To generate the plots, we sample 5000 times under the alternative hypothesis and plot the mean value of the power of the tests. The value of $\alpha$ is always 0.1.
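A sketch of this empirical calibration, assuming a hypothetical stat_fn computing the chosen test statistic from an n × p sample:

```python
import numpy as np

def null_quantile(stat_fn, n, p, alpha=0.1, reps=5000, seed=0):
    """Empirical (1 - alpha)-quantile t_{n,p,alpha} of the statistic under
    H_0: Sigma = I_p, estimated from repeated standard Gaussian samples."""
    rng = np.random.default_rng(seed)
    stats = [stat_fn(rng.standard_normal((n, p))) for _ in range(reps)]
    return np.quantile(stats, 1 - alpha)
```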
[Figure 4: Impact of the x-axis scale ($\Delta^{MS+}_n$ test); panels (a) logarithmic scale, (b) identity scale.]

Figure 4 shows that the logarithmic scale should be preferred, as it helps to better understand the behaviour of the test procedure when the measure of separation increases.

We now represent the power of the $\Delta^{MS+}_n$ test procedure as a function of the measure of separation for numerous values of $n$ and $p$. The best power function rises the fastest from low values above $\alpha = 0.1$ to values close to 1. The change happens around the theoretical value of the separation rate.

[Figure 5: Power of the $\Delta^{MS+}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]

Figure 5 shows that, for $p$ smaller than, equal to or larger than $n$, the $\Delta^{MS+}_n$ test presents similar behaviour as the measure of separation increases. However, the performance is better in high dimension, that is, the power curves are shifted to the left. This is in agreement with our theoretical rates and indicates that $p$ is not a nuisance parameter. The $\Delta^{MS+}_n$ test is not only robust but also more efficient in high dimension. Let us now consider the two-sided $\Delta^{MS}_n$ test and plot its estimated power curve.
[Figure 6: Power of the $\Delta^{MS}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]

Figure 6 shows that the $\Delta^{MS}_n$ test behaves similarly to the $\Delta^{MS+}_n$ test. However, the two-sided test benefits more from the high dimension $p$ than the one-sided version, in the sense that the curves shift more to the left, towards small values of the measure of separation, when $p$ is large.

Let us consider the $\Delta^{HS+}_n$ test.

[Figure 7: Power of the $\Delta^{HS+}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]

Figure 7 shows that the $\Delta^{HS+}_n$ test behaves similarly to the $\Delta^{MS+}_n$ and $\Delta^{MS}_n$ tests. Finally, we consider the two-sided HS test.

[Figure 8: Power of the $\Delta^{HS}_n$ test; panels (a) n = 100, (b) n = 500, (c) n = 1000.]
Figure 8 shows that the $\Delta^{HS}_n$ test also behaves like the previous ones. The high dimension improves the efficiency of the tests. We can also notice that the power of the tests increases rapidly around −3 on the logarithmic scale of the measure of separation.
In the previous section, we plotted numerical simulations of the four tests presented in the paper. We now want to understand in more detail the impact of the different choices that can be made in these procedures, namely the impact of the number of non-null entries $s$, and the impact of the location of the non-null entries (close to the main diagonal or far from it). In this subsection we focus our study on the $\Delta^{MS+}_n$ test, as we can extrapolate its behaviour to the other three tests. The underlying covariance matrix belongs to the class $\mathcal F_+(s,S,\sigma)$, for some $s \in \{1,\dots,S\}$.

First, we study the impact of the number of non-null entries. In all the previous graphs, $s$ was fixed and set to $(S-1)/2$. The objective is to observe how the value of $s$ impacts the behaviour of the test. For this purpose we plot side by side the power of the $\Delta^{MS+}_n$ test with $s = S-1$ and $s = (S-1)/2$, for $n = 100$ and different values of $p$.

[Figure 9: Impact of the number of non-null entries on $\Delta^{MS+}_n$; panels (a) $s = S-1$, (b) $s = (S-1)/2$.]

Figure 9 shows that the number of non-null entries has no major impact on the power of the test procedure $\Delta^{MS+}_n$.

Second, we look at the impact of the randomness in the location of the non-null entries. In all previous graphs the non-null entries were randomly located. The objective is to observe how the location of the non-null entries impacts the behaviour of the test. To this end we plot the power function of the $\Delta^{MS+}_n$ test with $s = (S-1)/2$, for $n = 100$ and different values of $p$. The non-null entries are (a) randomly located, or (b) located next to the main diagonal. Plot (c) shows simultaneously the power functions of the $\Delta^{MS+}_n$ test for $p = 10$ and $n = 100$, with non-null entries randomly chosen, i.e. $\mathcal S \subset \{1,\dots,S\}$ with $|\mathcal S| = s$ (red), fixed next to the main diagonal, i.e. $\mathcal S = \{1,\dots,s\}$ (blue), and fixed on the last values of the support, i.e. $\mathcal S = \{S-s,\dots,S\}$ (magenta).
[Figure 10: Impact of the position of the non-null entries on $\Delta^{MS+}_n$; panels (a) randomly chosen, (b) next to the main diagonal, (c) on the same graph.]

Figure 10 shows that the location of the non-null entries has no impact on the performance of the $\Delta^{MS+}_n$ test. In conclusion, the tests are sensitive neither to the number of non-null entries nor to their location.

$\Delta^{MS}_n$ vs $\Delta^{HS}_n$
The four test procedures $\Delta^{MS+}_n$, $\Delta^{MS}_n$, $\Delta^{HS+}_n$ and $\Delta^{HS}_n$ present very similar power curves. However, for high sparsity levels of the covariance matrix, $\Delta^{HS+}_n$ and $\Delta^{HS}_n$ were designed to be more efficient than $\Delta^{MS+}_n$ and $\Delta^{MS}_n$, respectively. The objective is to observe the difference in their behaviours under such a high-sparsity assumption. In this subsection we illustrate our study on the two-sided $\Delta^{MS}_n$ and $\Delta^{HS}_n$ tests only, as they are analogous to their one-sided versions.

In order to observe the difference in the impact of sparsity on these two tests, we plot their power curves against the number of non-null entries $s$. The parameters are set as follows: $n = 100$, $p = 100$ and $S = \sqrt p = 10$. The plot is repeated for two common values of the non-null entries, defined as fractions of $t_{n,p,\alpha}$. As the $\Delta^{HS}_n$ test requires a value of $s$, the true value is given in Figure 11.

[Figure 11: $\Delta^{MS}_n$ vs $\Delta^{HS}_n$ with $s$ known; the two panels correspond to the two values of $\sigma$.]

Figure 11 shows that the $\Delta^{HS}_n$ test procedure with known sparsity $s$ indeed has better detection power than $\Delta^{MS}_n$ at higher sparsity, as expected. It can also be noticed that larger significant values of the non-null correlations improve the power of $\Delta^{HS}_n$ over $\Delta^{MS}_n$ even more.

We now build a new $\Delta^{HS}_n$ procedure that is free of the knowledge of $s$ by aggregating several procedures $\Delta^{HS}_n(s)$ for different values of $s$, and then compare it to $\Delta^{MS}_n$. Consider a grid of plausible values of $s$ from 1 to $S$, build all $\Delta^{HS}_n(s)$, and decide according to $\Delta^{HS}_n = \max_s \Delta^{HS}_n(s)$, that is, reject whenever at least one of the tests rejects and accept otherwise.
Let us confront the aggregated high-sparsity test and the moderate-sparsity test procedures. The two test procedures have been run in the same setup: $n = 100$, $p = 100$ and $S = \sqrt p = 10$. The true values of $s$ are set to $s = 4$ and $s = 7$, respectively. We plot the power curves of the two procedures against the measure of separation on a log scale; the latter rises because of growing values of $\sigma$. In both cases, the grid of plausible sparsity levels has been fixed to two values, 2 and 10, which means that $\Delta^{HS}_n = \max\{\Delta^{HS}_n(2), \Delta^{HS}_n(10)\}$, even though the true underlying sparsity value is not on the grid. This does not seem to be a drawback.

[Figure 12: $\Delta^{MS}_n$ vs $\Delta^{HS}_n$ with $s$ unknown; panels (a) $s = 4$, (b) $s = 7$.]

Figure 12 shows that, even with an unknown value of $s$, the $\Delta^{HS}_n$ test procedure performs better than $\Delta^{MS}_n$. It can be noticed that the curves show larger differences for lower values of the measure of separation. In conclusion, the theoretical improvements of highly sparse over moderately sparse procedures show up in the very extreme cases where the underlying signal is very close to white noise, either because of very weak correlations or because of very few non-null values.
MA series
Let us construct a stationary process belonging to our set of sparse covariance matrices. Consider the stationary process $X_t$ defined by the following moving average (MA) model:
\[
X_t = \sum_{i=0}^{\lfloor p/4 \rfloor} \varphi^{i}\, \varepsilon_{t-2i},
\]
with $\{\varepsilon_t\}_{t \in \mathbb N}$ a Gaussian white noise and $|\varphi| < 1$. The autocovariance function of this series is
\[
\mathrm{Cov}(X_{t+h}, X_t) = \begin{cases} 0, & \text{if } h \text{ is odd, or } h \ge p/2, \\[4pt] \varphi^{h/2}\, \dfrac{1 - \varphi^{2(\lfloor p/4 \rfloor - h/2 + 1)}}{1 - \varphi^2}, & \text{otherwise.} \end{cases}
\]
In this example, the $p$-dimensional Gaussian vector $X = (X_t, \dots, X_{t+p-1})$ has a covariance matrix belonging to the class $\mathcal F(s, S, \sigma)$, with $s \ge p/4 - 1$ tending to infinity with $p$, $S \le p/2$, and
\[
\sigma = \varphi^{-\lfloor p/4 \rfloor} \left( \frac{\varphi^{2\lfloor p/4 \rfloor} - \varphi^{2(\lfloor p/4 \rfloor + 1)}}{1 - \varphi^2} \right) = \varphi^{\lfloor p/4 \rfloor}.
\]
We plot the power of the $\Delta^{MS}_n$ test on the $y$-axis against the value of $\varphi < 1$ on the $x$-axis.
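A sketch of a sampler for this process, under our reconstructed reading of the model (only even lags up to $2\lfloor p/4 \rfloor$ carry signal):

```python
import numpy as np

def sample_ma(n, p, phi, seed=0):
    """n independent length-p trajectories of X_t = sum_i phi^i eps_{t-2i}."""
    rng = np.random.default_rng(seed)
    q = p // 4
    X = np.zeros((n, p))
    for row in range(n):
        eps = rng.standard_normal(p + 2 * q)       # Gaussian white noise
        for i in range(q + 1):
            X[row] += phi ** i * eps[2 * q - 2 * i : 2 * q - 2 * i + p]
    return X
```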
[Figure 13: Power of the $\Delta^{MS}_n$ test for the MA($\lfloor p/2 \rfloor$) series; panels (a) n = 500, (b) n = 50.]

Figure 13 shows the power of the $\Delta^{MS}_n$ test for this example for various values of $p$. It can be seen that the $\Delta^{MS}_n$ test performs better when the value of $p$ increases. We point out that for $p < 4$ the MA($\lfloor p/2 \rfloor$) series is a white noise, which explains why the power of the $\Delta^{MS}_n$ test stays constantly low when $p < 4$.

Proofs. The following lemma is useful to prove Theorem 2.2. We prove a more general statement involving an arbitrary constant $K$ in $(0,1)$; it is sufficient to take $K = 1/2$ to deduce the theorem.

Lemma 6.1.
Let $\Sigma \in \mathcal S_p^{++}$ and let $\Sigma^{1/2}$ be its square root. Let $A \in \mathcal S_p$ and $M = \Sigma^{1/2} A \Sigma^{1/2}$. Then, for an arbitrary $K \in\, ]0,1[$, the matrix $I_p - 2tM$ is invertible and
\[
\det(I_p - 2tM)^{-1/2} \le \exp\Big( t\,\mathrm{Tr}(A\Sigma) + \frac{t^2 \|A\Sigma\|_F^2}{1-K} \Big), \quad \text{for all } |t| < \frac{K}{2\|A\Sigma\|_\infty}.
\]
Proof.
Let $\lambda_1, \dots, \lambda_p$ be the real eigenvalues of the symmetric matrix $M$, associated to the eigenvectors $x_1, \dots, x_p$. Then, for an arbitrary $K \in\, ]0,1[$ and all $|t| < K/(2\|A\Sigma\|_\infty)$, the values $1 - 2t\lambda_1, \dots, 1 - 2t\lambda_p$ are the strictly positive eigenvalues of the matrix $I_p - 2tM$, associated to the eigenvectors $x_1, \dots, x_p$. We have
\[
\det(I_p - 2tM)^{-1/2} = \exp\Big( -\frac12 \sum_{k=1}^p \log(1 - 2t\lambda_k) \Big) = \exp\Big( \frac12 \sum_{k=1}^p \sum_{i=1}^\infty \frac{(2t\lambda_k)^i}{i} \Big) = \exp\Big( t\,\mathrm{Tr}(A\Sigma) + \sum_{k=1}^p 2t^2\lambda_k^2 \sum_{i=0}^\infty \frac{(2t\lambda_k)^i}{i+2} \Big),
\]
so that
\[
\det(I_p - 2tM)^{-1/2} \le \exp\Big( t\,\mathrm{Tr}(A\Sigma) + \sum_{k=1}^p t^2\lambda_k^2 \sum_{i=0}^\infty (2t\lambda_k)^i \Big) = \exp\Big( t\,\mathrm{Tr}(A\Sigma) + t^2 \sum_{k=1}^p \frac{\lambda_k^2}{1 - 2t\lambda_k} \Big).
\]
Using the facts that $\|A\Sigma\|_F^2 = \|M\|_F^2 = \sum_{k=1}^p \lambda_k^2$ and $\|A\Sigma\|_\infty = \|M\|_\infty = \max_k |\lambda_k|$, together with $1 - 2t\lambda_k \ge 1 - K$, we get
\[
\det(I_p - 2tM)^{-1/2} \le \exp\Big( t\,\mathrm{Tr}(A\Sigma) + \frac{t^2\|A\Sigma\|_F^2}{1-K} \Big),
\]
which ends the proof.
Proof of Theorem 2.2. Let us note that if $X \sim \mathcal N(0_p, \Sigma)$, then $Y = \Sigma^{-1/2} X \sim \mathcal N(0_p, I_p)$. For all $|t| < \frac{nK}{2\|A\Sigma\|_\infty}$, we have
\[
E[\exp(t\varphi_A(\widehat\Sigma_n - \Sigma))] = E\Big[ \exp\Big( \frac{t}{n}\, X^T A X \Big) \Big]^n \exp(-t\,\mathrm{Tr}(A\Sigma)) = E\Big[ \exp\Big( \frac{t}{n}\, Y^T \Sigma^{1/2} A \Sigma^{1/2} Y \Big) \Big]^n \exp(-t\,\mathrm{Tr}(A\Sigma)) = E\Big[ \exp\Big( \frac{t}{n}\, Y^T M Y \Big) \Big]^n \exp(-t\,\mathrm{Tr}(A\Sigma)) =: T, \text{ say}.
\]
Now, we use the probability density of $Y$ and calculate $T$ explicitly:
\[
T = \exp(-t\,\mathrm{Tr}(A\Sigma)) \left( \Big(\frac{1}{2\pi}\Big)^{p/2} \int \dots \int \exp\Big( \frac{t}{n}\, y^T M y - \frac12\, y^T y \Big)\, dy_1 \dots dy_p \right)^n = \exp(-t\,\mathrm{Tr}(A\Sigma)) \left( \Big(\frac{1}{2\pi}\Big)^{p/2} \int \dots \int \exp\Big( -\frac12\, y^T \Big(I_p - \frac{2t}{n} M\Big) y \Big)\, dy_1 \dots dy_p \right)^n = \exp(-t\,\mathrm{Tr}(A\Sigma))\, \det\Big( I_p - \frac{2t}{n} M \Big)^{-n/2}.
\]
By applying Lemma 6.1 with $t/n$ in place of $t$, we have
\[
E[\exp(t\varphi_A(\widehat\Sigma_n - \Sigma))] \le \exp(-t\,\mathrm{Tr}(A\Sigma)) \exp\Big( t\,\mathrm{Tr}(A\Sigma) + \frac{t^2\|A\Sigma\|_F^2}{n(1-K)} \Big) = \exp\Big( \frac{t^2\|A\Sigma\|_F^2}{n(1-K)} \Big) = \exp\Big( \frac{\nu^2 t^2}{2} \Big), \quad \text{with } \nu^2 = \frac{2\|A\Sigma\|_F^2}{n(1-K)}.
\]
Proof of Proposition 2.3. 1. To bound the operator norm of the matrix $A_W$, we use Gershgorin's circle theorem: if $M = (m_{i,j})_{1\le i,j\le p}$ is a $p\times p$ matrix, then all eigenvalues of $M$ lie within at least one of the Gershgorin discs $D(m_{ii}, \sum_{j\neq i} |m_{ij}|)$. Applied to the matrix $A_W$, whose diagonal is null and whose rows contain at most two non-zero entries $\frac{1}{2(p-j)}$ for each $j \in W$, it gives
\[
\|A_W\|_\infty = \max_k |\lambda_k| \le \sum_{j \in W} \frac{1}{p-j} \le \frac{w}{p-S}.
\]
To bound the squared Frobenius norm, we sum all the squared elements of $A_W$, which gives
\[
\|A_W\|_F^2 = \sum_{j\in W} 2(p-j) \cdot \frac{1}{4(p-j)^2} = \sum_{j\in W} \frac{1}{2(p-j)} \le \frac{w}{2(p-S)}.
\]
2. To bound the operator norm of the matrix $A_W\Sigma$ for some $\Sigma$ in $\mathcal F(s,S,\sigma)$, we use the Cauchy-Schwarz inequality together with Gershgorin's circle theorem:
\[
\|A_W\Sigma\|_\infty \le \|A_W\|_\infty\, \|\Sigma\|_\infty \le \sigma_0 (2s+1)\, \frac{w}{p-S}.
\]
To bound the squared Frobenius norm of the matrix $A_W\Sigma$, we will use the following lemma.

Lemma 6.2.
Let $M$ and $N$ be two $p\times p$ symmetric matrices. Then $\|MN\|_F^2 = \mathrm{Tr}(M^2N^2)$ and
\[
\|MN\|_F^2 \le \max_{1\le k\le p} \lambda_k^2(M)\, \|N\|_F^2 = \|M\|_\infty^2\, \|N\|_F^2.
\]
Proof.
We have $\|MN\|_F^2 = \mathrm{Tr}(MNN^TM^T) = \mathrm{Tr}(M^2N^2)$, with $M^2$ and $N^2$ symmetric and positive semi-definite ($M^2 \ge 0$, $N^2 \ge 0$). Recall that, if $A \le B$ (in the sense that $B - A \ge 0$), then $\mathrm{Tr}(AC) \le \mathrm{Tr}(BC)$, for any $C \ge 0$. Here, $M^2 \le \lambda_{\max}(M^2)\, I_p \le \lambda_{\max}^2(M)\, I_p$, and this gives $\mathrm{Tr}(M^2N^2) \le \lambda_{\max}^2(M)\,\mathrm{Tr}(N^2)$.

If $w > 1$, using Lemma 6.2 with $M = \Sigma$ and $N = A_W$, we have
\[
\|A_W\Sigma\|_F^2 \le \|A_W\|_F^2\, \|\Sigma\|_\infty^2 \le \sigma_0^2\, \frac{w(2s+1)^2}{2(p-S)}.
\]
If $w = 1$ and $W = \{j\}$, using Lemma 6.2 with $M = \Sigma^{1/2}$ and $N = \Sigma^{1/2}A_j$, we have
\[
\|A_j\Sigma\|_F^2 \le \|A_j\Sigma^{1/2}\|_F^2\, \|\Sigma^{1/2}\|_\infty^2 \le \sigma_0(2s+1)\, \|A_j\Sigma^{1/2}\|_F^2.
\]
It suffices to prove that $\|A_j\Sigma^{1/2}\|_F^2 = \mathrm{Tr}(A_j^2\Sigma) \le \sigma_0\, K/(p-S)$, so as to conclude that $\|A_j\Sigma\|_F^2 \le \sigma_0^2\, K(2s+1)/(p-S)$. Let $B^j = A_j^2 = (b^j_{k,l})_{1\le k,l\le p}$. For every $1 \le k,l \le p$, we have
\[
b^j_{k,l} = \sum_{i=1}^p a^j_{k,i}\, a^j_{i,l} = \frac{1}{4(p-j)^2} \sum_{i=1}^p \delta_{|k-i|=j}\, \delta_{|l-i|=j}.
\]
If $k = l$,
\[
b^j_{k,k} = \begin{cases} \dfrac{2}{4(p-j)^2}, & \text{if } j < p/2 \text{ and } j < k \le p-j,\\[4pt] 0, & \text{if } j \ge p/2 \text{ and } p-j < k \le j,\\[4pt] \dfrac{1}{4(p-j)^2}, & \text{otherwise.} \end{cases}
\]
If $k \neq l$, for $\delta_{|k-i|=j}\,\delta_{|l-i|=j}$ to be non-null, we need
\[
\begin{cases} k - i = j \text{ and } l - i = -j, \\ \text{or } l - i = j \text{ and } k - i = -j, \end{cases}
\iff\ |k-l| = 2j \text{ and } i = \frac{k+l}{2}.
\]
Therefore,
\[
b^j_{k,l} = \begin{cases} \dfrac{1}{4(p-j)^2}, & \text{if } j < p/2 \text{ and } |k-l| = 2j,\\[4pt] 0, & \text{otherwise.} \end{cases}
\]
Summing up these results gives
\[
\|A_j\Sigma^{1/2}\|_F^2 = \mathrm{Tr}(A_j^2\Sigma) = \sum_{m=1}^p \sum_{i=1}^p b^j_{m,i}\, \sigma_{i,m} \le \sigma_0 \sum_{m=1}^p b^j_{m,m} + \sigma_0 \sum_{m\neq i} b^j_{m,i} \le \sigma_0 \cdot \begin{cases} \dfrac{2(p-2j)+2(p-j)}{4(p-j)^2}, & \text{if } j < p/2,\\[4pt] \dfrac{2(p-j)}{4(p-j)^2}, & \text{otherwise,} \end{cases} \le \sigma_0 \cdot \begin{cases} \dfrac{3}{2(p-j)}, & \text{if } j < p/2,\\[4pt] \dfrac{1}{2(p-j)}, & \text{otherwise.} \end{cases}
\]
This means that
\[
\|A_j\Sigma\|_F^2 \le \sigma_0(2s+1)\, \|A_j\Sigma^{1/2}\|_F^2 \le \sigma_0^2\, \frac{K(2s+1)}{p-S}, \quad \text{where } K = \begin{cases} 3/2, & \text{if } W \subseteq \{1,\dots,p/2-1\},\\ 1/2, & \text{if } W \subseteq \{p/2,\dots,p-1\}. \end{cases}
\]
Proof of Theorem 3.1. We know from Corollary 2.4 that the type I error probability is such that
\[
P_{I_p}\big[ \varphi_{A_{\{1,\dots,S\}}}(\widehat\Sigma_n - I_p) \ge t^{MS+}_{n,p} \big] \le \exp(-u/2),
\]
and that, for any $\Sigma$ in $\mathcal F_+(s,S,\sigma)$, we have
\[
P_\Sigma\big[ \varphi_{A_{\{1,\dots,S\}}}(\widehat\Sigma_n - \Sigma) \ge (2s+1)\, t^{MS+}_{n,p} \big] \le \exp(-u/2), \quad \text{for all } u > 0.
\]
Writing $A = A_{\{1,\dots,S\}}$, we can bound the type II error probability under the assumption that $\sigma \ge \frac{2(s+1)}{s}\, t^{MS+}_{n,p}$:
\[
P_\Sigma\big[ \varphi_{A}(\widehat\Sigma_n - I_p) \le t^{MS+}_{n,p} \big] = P_\Sigma\big[ \varphi_{A}(\Sigma - \widehat\Sigma_n) \ge \varphi_{A}(\Sigma) - t^{MS+}_{n,p} \big] \le P_\Sigma\big[ \varphi_{A}(\Sigma - \widehat\Sigma_n) \ge s\sigma - t^{MS+}_{n,p} \big] \le P_\Sigma\big[ \varphi_{A}(\Sigma - \widehat\Sigma_n) \ge (2s+1)\, t^{MS+}_{n,p} \big] \le \exp(-u/2).
\]
Finally,
\[
R(\Delta^{MS+}_n, \mathcal F_+) = P_{I_p}\big( \varphi_{A}(\widehat\Sigma_n - I_p) \ge t^{MS+}_{n,p} \big) + \sup_{\Sigma \in \mathcal F_+} P_\Sigma\big( \varphi_{A}(\widehat\Sigma_n - I_p) \le t^{MS+}_{n,p} \big) \le 2\exp(-u/2).
\]

Proof of Theorem 3.2. Similarly to the proof of Theorem 3.1, we use Corollary 2.4 to bound the type I error probability; note that $t^{MS}_{n,p}/S$ is the threshold $t$ of Corollary 2.4 with $w = 1$ evaluated at $u\log(2S)$:
\[
P_{I_p}\Big[ \sum_{i=1}^S |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \ge t^{MS}_{n,p} \Big] \le \sum_{i=1}^S P_{I_p}\Big[ |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \ge \frac{t^{MS}_{n,p}}{S} \Big] \le \sum_{i=1}^S 2\exp\Big( -\frac{u\log(2S)}{2} \Big) = 2\exp\Big( -\frac{(u-1)\log(2S)}{2} \Big).
\]
To bound the type II error probability, we use the condition on $\sigma$:
\[
P_\Sigma\Big[ \sum_{i=1}^S |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \le t^{MS}_{n,p} \Big] \le P_\Sigma\Big[ \bigcap_{i=1}^S \big\{ |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \le t^{MS}_{n,p} \big\} \Big] \le \sup_{1\le i\le S} P_\Sigma\big[ |\varphi_{A_i}(\widehat\Sigma_n - \Sigma)| \ge |\varphi_{A_i}(\Sigma - I_p)| - t^{MS}_{n,p} \big] \le \sup_{1\le i\le S} P_\Sigma\big[ |\varphi_{A_i}(\widehat\Sigma_n - \Sigma)| \ge \sigma - t^{MS}_{n,p} \big],
\]
and, since the condition on $\sigma$ makes $\sigma - t^{MS}_{n,p}$ exceed the threshold $\tilde t$ of Corollary 2.4 ($w = 1$) evaluated at $(u-1)\log(2S)$, the latter probability is at most $2\exp\big( -\frac{(u-1)\log(2S)}{2} \big)$. This finally gives
\[
R(\Delta^{MS}_n, \mathcal F) \le 4\exp\Big( -\frac{(u-1)\log(2S)}{2} \Big).
\]
Proof of Theorem 3.3. The type I error probability is bounded by
\[
P_{I_p}[\Delta^{HS+}_n = 1] \le \sum_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s} P_{I_p}\big[ \varphi_{A_{\mathcal S}}(\widehat\Sigma_n - I_p) \ge t^{HS+}_{n,p} \big] \le \sum_{\mathcal S} \exp\Big( -\frac{u}{2}\log\binom{S}{s} \Big) = \exp\Big( -\frac{u-1}{2}\log\binom{S}{s} \Big),
\]
while the type II error probability is bounded by
\[
\sup_{\Sigma \in \mathcal F_+(s,S,\sigma)} P_\Sigma[\Delta^{HS+}_n = 0] = \sup_\Sigma P_\Sigma\Big[ \bigcap_{\mathcal S \subseteq \{1,\dots,S\},\, |\mathcal S| = s} \big\{ \varphi_{A_{\mathcal S}}(\widehat\Sigma_n - I_p) \le t^{HS+}_{n,p} \big\} \Big] \le \sup_\Sigma P_\Sigma\big[ \varphi_{A_{\mathcal S_0}}(\Sigma - \widehat\Sigma_n) \ge \varphi_{A_{\mathcal S_0}}(\Sigma) - t^{HS+}_{n,p} \big] \le \sup_\Sigma P_\Sigma\big[ \varphi_{A_{\mathcal S_0}}(\Sigma - \widehat\Sigma_n) \ge s\sigma - t^{HS+}_{n,p} \big],
\]
for an arbitrary set $\mathcal S_0$ in $\{1,\dots,S\}$ containing $s$ values. Under the condition
\[
s\sigma - t^{HS+}_{n,p} \ge (2s+1) \max\left\{ \sqrt{\frac{u}{1-K}}\, \sqrt{\frac{s}{n(p-S)}},\ \frac{2u}{K}\cdot \frac{s}{n(p-S)} \right\}
\]
and Corollary 2.4, we have
\[
\sup_{\Sigma \in \mathcal F_+(s,S,\sigma)} P_\Sigma[\Delta^{HS+}_n = 0] \le \sup_\Sigma P_\Sigma\big[ \varphi_{A_{\mathcal S_0}}(\Sigma - \widehat\Sigma_n) \ge \tilde t\, \big] \le \exp(-u/2).
\]

Proof of Theorem 3.4. The proof is similar to the proof of Theorem 3.2. The type I error probability is bounded by
\[
P_{I_p}[\Delta^{HS}_n = 1] \le \sum_{\mathcal S,\, |\mathcal S| = s}\ P_{I_p}\Big[ \sum_{i\in\mathcal S} |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \ge t^{HS}_{n,p} \Big] \le \sum_{\mathcal S}\ \sum_{i\in\mathcal S} P_{I_p}\Big[ |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \ge \frac{t^{HS}_{n,p}}{s} \Big] \le \sum_{\mathcal S}\ \sum_{i\in\mathcal S} 2\exp\Big[ -\frac{u}{2}\log\Big( 2s\binom{S}{s} \Big) \Big] = 2\exp\Big[ -\frac{u-1}{2}\log\Big( 2s\binom{S}{s} \Big) \Big].
\]
The type II error probability is bounded by
\[
P_\Sigma[\Delta^{HS}_n = 0] = P_\Sigma\Big[ \max_{\mathcal S,\, |\mathcal S| = s}\ \sum_{i\in\mathcal S} |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \le t^{HS}_{n,p} \Big] \le P_\Sigma\Big[ \bigcap_{\mathcal S}\ \bigcap_{i\in\mathcal S} \big\{ |\varphi_{A_i}(\widehat\Sigma_n - I_p)| \le t^{HS}_{n,p} \big\} \Big] \le \sup_{\mathcal S,\, |\mathcal S|=s}\ \sup_{i\in\mathcal S} P_\Sigma\big[ |\varphi_{A_i}(\widehat\Sigma_n - \Sigma)| \ge \sigma - t^{HS}_{n,p} \big] \le 2\exp\Big[ -\frac{u-1}{2}\log\Big( 2s\binom{S}{s} \Big) \Big].
\]
Proof of Theorem 4.1. Using Theorem 2.2 and Proposition 2.3, we have
\[
R_{LS}(\hat\eta, \mathcal F) = \sum_{j=1}^S E_\Sigma[|\hat\eta_j - \eta_j|] = \sum_{j\in\mathcal S} E_\Sigma[|\hat\eta_j - 1|] + \sum_{j\notin\mathcal S,\ j\le S} E_\Sigma[|\hat\eta_j|] = \sum_{j\in\mathcal S} P_\Sigma\big[|\varphi_{A_j}(\widehat\Sigma_n)| < \tau_n\big] + \sum_{j\notin\mathcal S,\ j\le S} P_\Sigma\big[|\varphi_{A_j}(\widehat\Sigma_n)| > \tau_n\big].
\]
For the first sum, using $\varphi_{A_j}(\Sigma) \ge \sigma \ge 2\tau_n$ for $j \in \mathcal S$,
\[
\sum_{j\in\mathcal S} P_\Sigma\big[|\varphi_{A_j}(\widehat\Sigma_n)| < \tau_n\big] \le \sum_{j\in\mathcal S} P_\Sigma\big[|\varphi_{A_j}(\widehat\Sigma_n - \Sigma)| > \sigma - \tau_n\big] \le \sum_{j\in\mathcal S} P_\Sigma\Big[ |\varphi_{A_j}(\widehat\Sigma_n - \Sigma)| > \max\Big\{ \sqrt{u\log(s)}\, \frac{\|A_j\Sigma\|_F}{\sqrt n},\ u\log(s)\, \frac{\|A_j\Sigma\|_\infty}{n} \Big\} \Big] \le \sum_{j\in\mathcal S} 2\exp\Big( -\frac{u\log(s)}{4} \Big),
\]
and, for the second sum,
\[
\sum_{j\notin\mathcal S,\ j\le S} P_\Sigma\big[|\varphi_{A_j}(\widehat\Sigma_n - \Sigma)| > \tau_n\big] \le \sum_{j\notin\mathcal S,\ j\le S} P_\Sigma\Big[ |\varphi_{A_j}(\widehat\Sigma_n - \Sigma)| > \max\Big\{ \sqrt{u\log(S-s)}\, \frac{\|A_j\Sigma\|_F}{\sqrt n},\ u\log(S-s)\, \frac{\|A_j\Sigma\|_\infty}{n} \Big\} \Big] \le \sum_{j\notin\mathcal S,\ j\le S} 2\exp\Big( -\frac{u\log(S-s)}{4} \Big).
\]
Therefore,
\[
R_{LS}(\hat\eta, \mathcal F) \le 2\exp\Big( -\frac{(u-1)\log(s)}{4} \Big) + 2\exp\Big( -\frac{(u-1)\log(S-s)}{4} \Big).
\]

References

[1] Ery Arias-Castro, Sébastien Bubeck, and Gábor Lugosi. Detecting positive correlations in a multivariate sample. Bernoulli, 21(1):209–241, 2015.
[2] Ery Arias-Castro, Sébastien Bubeck, and Gábor Lugosi. Detection of correlations. Ann. Statist., 40(1):412–435, 2012.
[3] Pierre C. Bellec. Concentration of quadratic forms under a Bernstein moment assumption. ArXiv e-prints, 2019.
[4] Cristina Butucea and Yuri I. Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652–2688, 2013.
[5] Cristina Butucea and Rania Zgheib. Sharp minimax tests for large covariance matrices and adaptation. Electron. J. Statist., 10(2):1927–1972, 2016.
[6] Cristina Butucea and Rania Zgheib. Sharp minimax tests for large Toeplitz covariance matrices with repeated observations. J. Multivariate Anal., 146(C):164–176, 2016.
[7] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
[8] Tony Cai, Weidong Liu, and Yin Xia. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association, 108(501):265–277, 2013.
[9] Tony Cai and Zongming Ma. Optimal hypothesis testing for high dimensional covariance matrices. Bernoulli, 19(5B):2359–2388, 2013.
[10] Minshuo Chen, Lin Yang, Mengdi Wang, and Tuo Zhao. Dimensionality reduction for stationary time series via stochastic nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3496–3506. Curran Associates, Inc., 2018.
[11] David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32(3):962–994, 2004.
[12] Mihai Giurcanu and Vladimir Spokoiny. Confidence estimation of the covariance function of stationary and locally stationary processes. Statist. Decisions, 22(4):283–300, 2004.
[13] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:6 pp., 2012.
[14] Yu. I. Ingster. Adaptive detection of a signal of growing dimension. I. Math. Methods Statist., 10:395–421, 2001. Meeting on Mathematical Statistics (Marseille, 2000).
[15] Yu. I. Ingster. Adaptive detection of a signal of growing dimension. II. Math. Methods Statist., 11(1):37–68, 2002.
[16] Jens-Peter Kreiss, Efstathios Paparoditis, and Dimitris N. Politis. On the range of validity of the autoregressive sieve bootstrap. Ann. Statist., 39(4):2103–2130, 2011.
[17] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-gaussian concentration. ArXiv e-prints, 2013.
[18] V. Spokoiny and M. Zhilova. Sharp deviation bounds for quadratic forms. Mathematical Methods of Statistics, 22(2):100–113, 2013.
[19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.