Hotelling's test for highly correlated data
P. Bubeliny, e-mail: bubeliny@karlin.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, KPMS, Sokolovska 83, 186 75 Prague, Czech Republic.
Abstract:
This paper is motivated by the analysis of gene expression sets, especially by finding differentially expressed gene sets between two phenotypes. Gene log expression levels are highly correlated and, very likely, have approximately normal distributions. Therefore, it seems reasonable to use the two-sample Hotelling's test for such data. We discover some unexpected properties of this test which make it different from the majority of tests previously used for such data. It appears that Hotelling's test does not always reach its maximal power when all marginal distributions are differentially expressed. For highly correlated data its maximal power is attained when about a half of the marginal distributions are essentially different. When the correlation coefficient is greater than 0.5, the test can be more powerful if only one marginal distribution is shifted than if all marginal distributions are equally shifted. Moreover, when the correlation coefficient increases, the power of Hotelling's test increases as well.

In many situations statisticians need to test multidimensional hypotheses. In many cases the components of the observed random vectors are highly dependent, which may change the properties of the tests used. One example of such data is provided by gene expression levels. Gene expressions are highly correlated between genes (see, for example, Klebanov and Yakovlev (2007)). Moreover, the genes are often investigated not just separately, but also as sets of dependent genes. Therefore one has to deal with multidimensional hypotheses in order to test whether gene sets are differentially expressed. The most popular tests for gene sets are Hotelling's test, the N-test and tests derived from marginal $t$-statistics. The papers Ackermann and Strimmer (2009) and Glazko and Emmert-Streib (2009) compare these tests in various situations. Our goal is not to make another comparison, but rather to describe some interesting properties of Hotelling's test which seem to be unexpected.
One of the most well-known tests is the $t$-test. Hotelling's test is a multidimensional extension of the $t$-test. Similarly to the $t$-test, we can consider both a one-sample and a two-sample Hotelling's test. The one-sample test deals with the hypothesis that the expected value of a sample from a multidimensional normal distribution is equal to some given vector. The two-sample test deals with the hypothesis of equality of the expected values of two samples from multidimensional normal distributions (with equal covariance structure). In this paper we focus on the two-sample Hotelling's test.

Suppose we have two independent samples (of sizes $n_x$ and $n_y$, respectively) from two $n$-dimensional normal distributions with identical covariance matrices equal to $\Sigma$. In other words, we consider $X_1, \dots, X_{n_x}$ as i.i.d. random vectors having $N_n(\mu_x, \Sigma)$ and $Y_1, \dots, Y_{n_y}$ as i.i.d. random vectors having $N_n(\mu_y, \Sigma)$ ($X_i$ and $Y_j$ are independent for all $i = 1, \dots, n_x$; $j = 1, \dots, n_y$). For simplicity we assume that $n < n_x + n_y - 1$. Our goal is to test the hypothesis $H: \mu_x = \mu_y$ against the alternative $A: \mu_x \neq \mu_y$. For this we use Hotelling's test based on the statistic
\[
T^2 = \frac{n_x n_y}{n_x + n_y}\,(\bar X - \bar Y)^T S^{-1} (\bar X - \bar Y), \qquad (1)
\]
where $\bar X = \frac{1}{n_x}\sum_{i=1}^{n_x} X_i$, $\bar Y = \frac{1}{n_y}\sum_{i=1}^{n_y} Y_i$ and
\[
S = \frac{\sum_{i=1}^{n_x} (X_i - \bar X)(X_i - \bar X)^T + \sum_{i=1}^{n_y} (Y_i - \bar Y)(Y_i - \bar Y)^T}{n_x + n_y - 2}.
\]
$T^2$ is related to the $F$-distribution by
\[
\frac{n_x + n_y - n - 1}{n\,(n_x + n_y - 2)}\, T^2 \sim F(n,\, n_x + n_y - n - 1). \qquad (2)
\]
For more details about Hotelling's test see, for example, Chatfield and Collins (1980). We made the assumption $n < n_x + n_y - 1$ because for $n \geq n_x + n_y - 1$ the estimate $S$ of $\Sigma$ is a singular matrix, so that $S^{-1}$ does not exist; moreover, the numerator in (2) is non-positive, as is the second degree of freedom of the $F$-distribution. In such situations it is possible to use a pseudo-inverse of $S$, and in order to estimate the $p$-value of $H$ one can use permutations of $(X_1, \dots, X_{n_x}, Y_1, \dots, Y_{n_y})$.

As mentioned above, genes are highly dependent, and we will suppose that their log expression levels have approximately normal distributions. Many papers work with gene sets (for example Barry et al. (2008)) instead of single genes and therefore deal with multidimensional hypotheses. It seems reasonable to use Hotelling's test in this situation.

Assume that we have two multidimensional samples and need to test the hypothesis of equality of the expected values of these two samples. Assume for simplicity that all elements on the main diagonal of the covariance matrix $\Sigma$ of both samples are equal to 1 and all other elements are equal to $\rho > 0$:
\[
\Sigma = \begin{pmatrix} 1 & \rho & \dots & \rho \\ \rho & 1 & \dots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \dots & 1 \end{pmatrix}.
\]
Further on, we assume that $\mu_x = (0, \dots, 0)^T$, while $\mu_y$ has its first $m$ elements equal to 1 and the others equal to 0, i.e.
\[
\mu_y = (\underbrace{1, \dots, 1}_{m}, \underbrace{0, \dots, 0}_{n-m})^T.
\]
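As a concrete illustration, the statistic (1) and its $F$-transform (2) can be computed as follows. This is a sketch, not the author's code; the function name and the use of NumPy/SciPy are our own choices.

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(X, Y):
    """Two-sample Hotelling's T^2 test, equations (1)-(2).

    X has shape (n_x, n), Y has shape (n_y, n). Equal covariances are
    assumed, and n < n_x + n_y - 1 is required so that S is invertible.
    Returns (T2, F, p_value).
    """
    n_x, n = X.shape
    n_y = Y.shape[0]
    if n >= n_x + n_y - 1:
        raise ValueError("need n < n_x + n_y - 1, otherwise S is singular")
    d = X.mean(axis=0) - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Pooled covariance estimate S from equation (1).
    S = (Xc.T @ Xc + Yc.T @ Yc) / (n_x + n_y - 2)
    T2 = n_x * n_y / (n_x + n_y) * (d @ np.linalg.solve(S, d))
    # F-transform of T^2, equation (2).
    F = (n_x + n_y - n - 1) / (n * (n_x + n_y - 2)) * T2
    p_value = stats.f.sf(F, n, n_x + n_y - n - 1)
    return T2, F, p_value
```

For $n = 1$ the statistic reduces to the square of the usual two-sample $t$-statistic with pooled variance, which gives a quick sanity check of the implementation.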
For large $n_x$ and $n_y$ the matrix $\Sigma$ and its estimate $S$ are approximately the same, and so are the difference of the expected values $(\mu_x - \mu_y)$ and the difference of the sample means $(\bar X - \bar Y)$. When dealing with real data, $n_x$ and $n_y$ might not be large enough, but for theoretical purposes we may use the approximations $S \approx \Sigma$ and $\bar X - \bar Y \approx \mu_x - \mu_y$. In this case $S^{-1} \approx \Sigma^{-1}$, where
\[
\Sigma^{-1} = \begin{pmatrix} \alpha & -\beta & \dots & -\beta \\ -\beta & \alpha & \dots & -\beta \\ \vdots & \vdots & \ddots & \vdots \\ -\beta & -\beta & \dots & \alpha \end{pmatrix},
\qquad
\alpha = \frac{1 + (n-2)\rho}{(1-\rho)(1+(n-1)\rho)}, \quad \beta = \frac{\rho}{(1-\rho)(1+(n-1)\rho)}.
\]
For fixed $n_x$ and $n_y$ we can consider the factor $\frac{n_x n_y}{n_x + n_y} = k$ of Hotelling's statistic (1) as a normalizing constant. Let $T^*$ denote Hotelling's statistic with $\Sigma^{-1}$ instead of $S^{-1}$ and $\mu_x - \mu_y$ instead of $\bar X - \bar Y$, divided by the constant $k$. Therefore we have
\[
T^2/k \approx T^* = (\mu_x - \mu_y)^T \Sigma^{-1} (\mu_x - \mu_y)
= m\alpha - (m^2 - m)\beta
= \frac{m(1+(n-2)\rho) - m(m-1)\rho}{(1-\rho)(1+(n-1)\rho)}
= \frac{m\,(1+(n-m-1)\rho)}{(1-\rho)(1+(n-1)\rho)}. \qquad (3)
\]
Let us note that it does not matter whether $\mu_y$ consists of ones and zeros or of a constant $a$ and zeros; in the latter case the statistic $T^*$ would simply be multiplied by $a^2$. We now work with the statistic $T^*$ and investigate its behavior.

If we changed $m$ to $m+1$ (one more different marginal distribution) we would expect the statistic $T^*$ to increase, and with it the power of Hotelling's test. We need to check whether this is indeed the case. For better understanding, let the number of ones in $\mu_y$ be the index of $T^*$ (we will write it only when it is needed). Changing $m$ to $m + 1 = h$ we have
\[
T^*_{m+1} = T^*_m + \alpha - 2m\beta.
\]
If $T^*$ is to be an increasing function of $m$, then $\alpha - 2m\beta$ should be greater than zero. But we have
\[
\alpha - 2m\beta = \frac{1 + (n-2)\rho}{(1-\rho)(1+(n-1)\rho)} - \frac{2m\rho}{(1-\rho)(1+(n-1)\rho)} = \frac{1 + (n-2m-2)\rho}{(1-\rho)(1+(n-1)\rho)}.
\]
Since the denominator is greater than zero, $\alpha - 2m\beta > 0$ if and only if $1 + (n-2m-2)\rho > 0$, i.e. (for $2m + 2 > n$) if and only if $\rho < \frac{1}{2m+2-n} = \frac{1}{2h-n}$. It means that for not very small values of $\rho$, once $m$ exceeds roughly $n/2$, $T^*$ is a decreasing function of $m$. This means that the maximal power of Hotelling's test (as a function of $m$) is not always attained for $m = n$; for values of $\rho$ that are not very small, the maximal power is attained for $m$ near $n/2$. Some examples of the behavior of $T^*$ as a function of $m$ are illustrated in Figure 1.

However, this issue is not the only surprising one about Hotelling's test. Now we check whether $T^*_1$ is always lower than $T^*_n$, i.e. whether one different marginal distribution can influence the statistic more than all $n$ different distributions. We need to compare $\alpha$ with $n\alpha - n(n-1)\beta$. We have
\[
T^*_1 - T^*_n = \alpha - n\alpha + n(n-1)\beta = \frac{(n-1)(2\rho - 1)}{(1-\rho)(1+(n-1)\rho)}.
\]
So $T^*_1 - T^*_n < 0$ if and only if $\rho < 0.5$.
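The non-monotonicity of (3) in $m$ is easy to verify numerically. The short sketch below is ours; the helper name `t_star` is a hypothetical label for the statistic of equation (3).

```python
import numpy as np

def t_star(m, n, rho):
    """Theoretical statistic T* of equation (3): m shifted coordinates
    out of n, with common correlation rho."""
    return m * (1 + (n - m - 1) * rho) / ((1 - rho) * (1 + (n - 1) * rho))

n, rho = 25, 0.9
values = [t_star(m, n, rho) for m in range(1, n + 1)]
# T* peaks near m = (n - 1)/2 + 1/(2*rho), i.e. around n/2, not at m = n.
best_m = 1 + int(np.argmax(values))
```

For $n = 25$ and $\rho = 0.9$ the maximum is attained at $m = 13$, and $T^*_1$ exceeds $T^*_n$, in line with the $\rho > 0.5$ criterion derived above.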
Therefore we can say that for $\rho > 0.5$ we have $T^*_1 > T^*_n$. Moreover, $T^*_1 = \alpha$ is an increasing function of $\rho$, which may seem surprising as well.

Let us look at Hotelling's test in the two-dimensional case. As in the previous case, we consider the two-sample problem, but now we generalize the difference of the expected values of the two samples. Suppose that $\mu_x - \mu_y = (a_1, a_2)^T$ and that the covariance matrix is
\[
\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.
\]

Figure 1: Plots of $T^*$ for $n = 10, 15, 25, 40$; $\rho = 0.1, 0.3, 0.5, 0.7, 0.9$; and $m = 1, \dots, n$. Notice: each plot is differently scaled!

Then the inverse of $\Sigma$ is the matrix with diagonal elements $\alpha = \frac{1}{(1-\rho)(1+\rho)}$ and off-diagonal elements $-\beta = -\frac{\rho}{(1-\rho)(1+\rho)}$. Then
\[
T^* = \alpha a_1^2 + \alpha a_2^2 - 2\beta a_1 a_2.
\]
First we consider $a_1 = 1$ and $a_2 = 0$; then $T^* = \alpha$.
Now we investigate for which $a_1, a_2 \in \mathbb{R}$ the statistic $T^*$ equals $\alpha$. That is, we need to solve the equation
\[
\alpha a_1^2 + \alpha a_2^2 - 2\beta a_1 a_2 = \alpha. \qquad (4)
\]
After dividing both sides of equation (4) by $\alpha$ (and using $\beta/\alpha = \rho$) we get
\[
a_1^2 + a_2^2 - 2\rho a_1 a_2 - 1 = 0. \qquad (5)
\]
For fixed $a_1$, equation (5) is quadratic in $a_2$ with the roots
\[
a_{2,1,2} = \frac{2\rho a_1 \pm \sqrt{(2\rho a_1)^2 - 4(a_1^2 - 1)}}{2}.
\]
The root $a_2$ is defined only if $(2\rho a_1)^2 - 4(a_1^2 - 1) \geq 0$, i.e. for $|a_1| \leq \sqrt{1/(1-\rho^2)}$. Some plots of the solutions of equation (5) for different values of the correlation coefficient $\rho$ are given in Figure 2. We can see that the plots of these solutions produce elliptic curves. Let us rotate these ellipses by the angle $\varphi = \pi/4$:
\[
a_1 = x \cos\varphi - y \sin\varphi = \tfrac{1}{\sqrt 2}x - \tfrac{1}{\sqrt 2}y, \qquad
a_2 = x \sin\varphi + y \cos\varphi = \tfrac{1}{\sqrt 2}x + \tfrac{1}{\sqrt 2}y,
\]
where $x$ and $y$ are the new rotated coordinates. Substitution into (5) gives
\[
\Bigl(\tfrac{x}{\sqrt 2} - \tfrac{y}{\sqrt 2}\Bigr)^2 + \Bigl(\tfrac{x}{\sqrt 2} + \tfrac{y}{\sqrt 2}\Bigr)^2 - 2\rho \Bigl(\tfrac{x}{\sqrt 2} - \tfrac{y}{\sqrt 2}\Bigr)\Bigl(\tfrac{x}{\sqrt 2} + \tfrac{y}{\sqrt 2}\Bigr) = x^2(1-\rho) + y^2(1+\rho) = \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,
\]
where $a = \sqrt{1/(1-\rho)}$ and $b = \sqrt{1/(1+\rho)}$ are respectively the major and the minor radius of the ellipse. Since $a > b$, Hotelling's test has the weakest power in the direction $a_1 = a_2$, while the fastest increase of its power is observed in the direction $a_1 = -a_2$. For example, for $\rho = 0.9$ we get $a = \sqrt{10} \approx 3.162$ and $b \approx 0.725$. For $a_1 = a_2 = \sqrt{5} \approx 2.236$ Hotelling's test has approximately the same power as for $a_1 = 1$, $a_2 = 0$ (and for $a_1 = -a_2 = \sqrt{1/3.8} \approx 0.513$ as well).
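That these three shifts lie on the same equal-power curve can be checked directly; the snippet below is a small numerical sketch of equation (4) for $\rho = 0.9$, with `t_star_2d` a hypothetical helper name.

```python
import numpy as np

rho = 0.9
Sigma_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def t_star_2d(a1, a2):
    """T* = (mu_x - mu_y)^T Sigma^{-1} (mu_x - mu_y) in the 2-d case."""
    a = np.array([a1, a2])
    return a @ Sigma_inv @ a

# alpha is the diagonal element of Sigma^{-1}; all three shifts below
# solve equation (4), i.e. they give the same value T* = alpha.
alpha = Sigma_inv[0, 0]
same_power = [t_star_2d(1.0, 0.0),
              t_star_2d(np.sqrt(5), np.sqrt(5)),
              t_star_2d(np.sqrt(1 / 3.8), -np.sqrt(1 / 3.8))]
```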
So, if there is only one marginal distribution shifted by one unit, then the power of Hotelling's test is approximately the same as if both marginal distributions were equally shifted (in the same direction) by 2.236 units (for shifts in opposite directions the shift needs to be only 0.513 units). These results are in contrast with other multidimensional tests. For example, consider a test based on marginal $t$-statistics. The power of such a test is higher if both distributions are shifted by the same amount (both $t$-statistics are "large", regardless of the direction of the shift) than if only one marginal distribution is shifted (one $t$-statistic is "near" zero).

The analytical results obtained above should be verified by checking whether the actual Hotelling's test behaves accordingly on realistic data. In this section we compare the behavior of the theoretical Hotelling's statistic $T^*$ with the real Hotelling's statistic $T^2$. For large $n_x$ and $n_y$ we assumed that $T^* \approx T^2/k$, where $k = \frac{n_x n_y}{n_x + n_y}$. The constant $k$ changes as $n_x$ and $n_y$ change. It is reasonable to divide Hotelling's statistic $T^2$ by $k$, instead of multiplying $T^*$ by $k$, in order to be able to compare how $T^2$ and $T^*$ differ for various $n_x$ and $n_y$.

In order to compare the actual results with the analytical ones, we performed the following simulations. All data were simulated from $n$-dimensional normal distributions. We consider three different values for the number of genes in a gene set: $n = 10$, $n = 15$ and $n = 25$. All simulations were performed for three different values of the correlation coefficient $\rho$: $\rho = 0.1$, $\rho = 0.5$ and $\rho = 0.9$. In order to compare the behavior of Hotelling's test for various sample sizes we took three choices of $n_x$ and $n_y$: $n_x = n_y = n$ and two larger sample sizes proportional to $n$. The value $m$, the number of false marginal distributions, varies from one to $n$. The shift for each of the different marginal distributions is set to one. The theoretical Hotelling's statistic is calculated according to (3). The real Hotelling's statistic is estimated from 1000 simulations for each case (as the mean of $T^2/k$ obtained from the simulations).

Figure 2: Plots of the solutions of equation (5) for the two-dimensional case for $\rho = 0.25$, $0.5$ and $0.9$. Notice: each plot is differently scaled!

Plots of our simulated cases are shown in Figure 3. We can see that for all simulated situations the shapes of the real and theoretical Hotelling's statistics are similar. The only difference is in the heights of these curves. For small $n_x$ and $n_y$ the statistic $T^2$ attains higher values than for large $n_x$ and $n_y$. The reason stems from the inaccurate estimates of the expected values and of the covariance matrix. However, we observe that as $n_x$ and $n_y$ increase, the statistic $T^2/k$ approaches $T^*$ relatively fast. Therefore, the behavior of Hotelling's test for real data is expected to be very similar to the behavior of the statistic $T^*$.

In the previous section we saw that, in the two-dimensional case, the shifts with equal values of the power of the theoretical Hotelling's test form elliptic curves when plotted. Hotelling's statistics $T^2$ are random variables; therefore, we can only estimate whether their expected values form elliptic curves when plotted. To check this we performed the following simulations. Instead of calculating the shifts for which Hotelling's test has equal power, we took the points provided by the elliptic curves observed for the theoretical Hotelling's statistic. For each pair of these points $(a_1, a_2)$ we ran 1000 simulations and calculated Hotelling's statistic. We estimated the expected value $E[T^2/k]$ as the mean over these 1000 repetitions. We divided Hotelling's statistics by $k$ in order to see how fast these statistics approach $T^*$. We ran this simulation for two values of the correlation coefficient $\rho$ and for three sample sizes, the two larger ones being $n_x = n_y = 10$ and $n_x = n_y = 20$.
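The simulation scheme just described can be sketched in a few lines. This is illustrative code, not the author's script: the helper names, the seed and the replication count are our own choices, and the sample sizes used below are examples rather than the exact ones from the paper.

```python
import numpy as np

def mean_t2_over_k(n, rho, m, n_s, reps=1000, seed=0):
    """Estimate E[T^2 / k] by simulation: two samples of size n_s from
    N_n(0, Sigma) and N_n(mu_y, Sigma), first m coordinates shifted by 1."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((n, n), rho) + (1.0 - rho) * np.eye(n)
    L = np.linalg.cholesky(Sigma)
    mu_y = np.zeros(n)
    mu_y[:m] = 1.0
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n_s, n)) @ L.T
        Y = rng.standard_normal((n_s, n)) @ L.T + mu_y
        d = X.mean(axis=0) - Y.mean(axis=0)
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        S = (Xc.T @ Xc + Yc.T @ Yc) / (2 * n_s - 2)
        # T^2 / k is the quadratic form of (1) without the factor k.
        vals.append(d @ np.linalg.solve(S, d))
    return float(np.mean(vals))

def t_star(m, n, rho):
    """Theoretical limit (3) of T^2 / k."""
    return m * (1 + (n - m - 1) * rho) / ((1 - rho) * (1 + (n - 1) * rho))
```

As in Figure 3, the simulated mean of $T^2/k$ lies above $T^*$ for small samples (inaccurate estimates of the means and of the covariance matrix) and approaches $T^*$ as the sample size grows.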
Results of our simulation are given in Table 1. We observe that the estimated mean values of $T^2/k$ are not very different from one another, that they approach $T^*$, and that their variance decreases with an increasing number of observations. Clearly, these points form elliptic curves. Hence we can claim that the real Hotelling's test behaves very similarly to the theoretical one, and the theory derived for the theoretical test holds for the real Hotelling's test as well.

Figure 3: Comparisons of the theoretical statistic $T^*$ and the real Hotelling's statistic $T^2/k$ for number of genes $n = 10, 15, 25$ (from the top to the bottom) and correlation coefficient $\rho = 0.1, 0.5, 0.9$, for $n_x = n_y = n$ (denoted by '+') and two larger sample sizes (denoted by 'x' and '•'). The theoretical statistic $T^*$ is denoted by '◦'. The number of different marginal distributions $m$ runs from one to $n$. Notice: each plot is differently scaled!

In this paper we have discovered that the two-sample Hotelling's test (for testing the equality of the expected values of two samples from multidimensional normal distributions with equal covariance structure) has some unexpected properties. At first sight, one could expect that with a larger number of false marginal distributions the power of this test increases. But we have discovered that this is not true in general. For highly correlated and high-dimensional data (such as gene expression data sets) the maximal power of Hotelling's test is reached when only about one half of the marginal distributions are shifted. We have found out that when the correlation inside the sample is greater than 0.5, Hotelling's test can have better power if only one marginal distribution is different than when all marginal hypotheses are false. Moreover, the power of Hotelling's test increases for higher correlations. That observation may seem somewhat unexpected as well. We have investigated Hotelling's test in detail in the two-dimensional case. We have found that the properties of this test are very different from those of tests based on marginal $t$-statistics. All reasonable tests based on marginal $t$-statistics do not depend on the direction of the shift. But the power of Hotelling's test increases very slowly if both marginal distributions are equally shifted, and much faster if the marginal distributions are shifted in opposite directions. Moreover, the alternatives with equal values of the power form ellipsoids.

Table 1: Results of simulations of the two-dimensional adjusted Hotelling's statistics $T^2/k$ with $n_s = n_x = n_y$ observations in each sample and correlation coefficient $\rho$. $T^*$ stands for the theoretical Hotelling's statistic and $(a_1, a_2)$ is the difference $\mu_x - \mu_y$ between the expected values of the samples. The bottom line gives the estimate of the variance of each column.

Acknowledgments
The author thanks Prof. Lev Klebanov, DrSc., for valuable comments, remarks and overall help. The work was supported by the grant SVV 261315/

References
Ackermann, M. and Strimmer, K. (2009), A general modular framework for gene set enrichment analysis, BMC Bioinformatics, 10, 47.

Barry, W. T., Nobel, A. B., and Wright, F. A. (2008), A statistical framework for testing functional categories in microarray data, The Annals of Applied Statistics, 2, No. 1, 286-315.

Chatfield, C. and Collins, A. J. (1980), Introduction to Multivariate Analysis, Chapman & Hall/CRC.

Glazko, G. and Emmert-Streib, F. (2009), Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets, Bioinformatics, 25, No. 18, 2348-2354.

Klebanov, L. and Yakovlev, A. (2007), Diverse correlation structures in gene expression data and their utility in improving statistical inference.