ANCOVA: A global test based on a robust measure of location or quantiles when there is curvature
Rand R. Wilcox
Dept. of Psychology, University of Southern California
June 25, 2015

ABSTRACT

For two independent groups, let M_j(x) be some conditional measure of location for the jth group associated with some random variable Y, given that some covariate X = x. When M_j(x) is a robust measure of location, or even some conditional quantile of Y given X, methods have been proposed and studied that are aimed at testing H_0: M_1(x) = M_2(x) and that deal with curvature in a flexible manner. In addition, methods have been studied where the goal is to control the probability of one or more Type I errors when testing H_0 for each x ∈ {x_1, ..., x_p}. This paper suggests a method for testing the global hypothesis H_0: M_1(x) = M_2(x) for all x ∈ {x_1, ..., x_p} when using a robust or quantile location estimator. An obvious advantage of testing p hypotheses, rather than the global hypothesis, is that it can provide information about where the regression lines differ and by how much. But the paper summarizes three general reasons to suspect that testing the global hypothesis can have more power. Data from the Well Elderly 2 study illustrate that testing the global hypothesis can make a practical difference.

Keywords: ANCOVA, trimmed mean, non-parametric regression, Harrell–Davis estimator, bootstrap methods, comparing quantiles, Well Elderly 2 study

For two independent groups, consider the situation where for the jth group (j = 1, 2) Y_j is some outcome variable of interest and X_j is some covariate. The classic ANCOVA method assumes that

Y_j = β_{0j} + β_1 X_j + ε,   (1)

where β_{0j} and β_1 are unknown parameters and ε is a random variable having a normal distribution with mean zero and unknown variance σ². So the regression lines are assumed to be parallel, and the goal is to compare the intercepts based in part on a least squares estimate of the regression lines. It is well known, however, that there are serious concerns with this approach.
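To make the classic model (1) concrete, here is a minimal Python sketch (the simulated data, names, and numbers are purely illustrative, not from the paper, whose own software is written in R). It generates two groups with parallel regression lines and estimates the intercept difference by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for two groups satisfying model (1): a common slope
# (parallel regression lines) with different intercepts.
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y1 = 1.0 + 0.5 * x1 + rng.normal(size=n)  # group 1 intercept = 1.0
y2 = 2.0 + 0.5 * x2 + rng.normal(size=n)  # group 2 intercept = 2.0

# Classic ANCOVA fit by least squares: one design matrix with an
# intercept column, a group-2 indicator, and the covariate (common slope).
y = np.concatenate([y1, y2])
x = np.concatenate([x1, x2])
g2 = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), g2, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] estimates the intercept difference (here about 1.0);
# beta[2] estimates the common slope (here about 0.5).
```

The point of the sketch is only the form of the classic model; the robust concerns discussed next apply precisely to this kind of least squares fit.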
First, there is a vast literature establishing that methods based on means, including least squares regression, are not robust (e.g., Staudte and Sheather, 1990; Maronna et al., 2006; Heritier et al., 2009; Hampel et al., 1986; Huber and Ronchetti, 2009; Wilcox, 2012a, 2012b). A general concern is that violations of underlying assumptions can result in relatively poor power and poor control over the Type I error probability. Moreover, even a single outlier can yield a poor fit to the bulk of the points when using least squares regression.

As is evident, one way of dealing with non-normality is to use some rank-based technique. But rank-based ANCOVA methods are aimed at testing the hypothesis of identical distributions (e.g., Lawson, 1983). So when such a method rejects, it is reasonable to conclude that the distributions differ in some manner, but the details regarding how they differ, and by how much, are unclear. One way of gaining some understanding of how the groups differ, but certainly not the only way, is to compare the groups using some measure of location. Here the goal is to make inferences about some robust (conditional) measure of location associated with Y.

Yet another fundamental concern with (1) is that the true regression lines are assumed to be straight. Certainly, in some situations, this is a reasonable approximation. When there is curvature, simply meaning that the regression line is not straight, using some obvious parametric regression model might suffice. (For example, include a quadratic term.)
But this approach can be inadequate, which has led to a substantial collection of nonparametric methods, often called smoothers, for dealing with curvature in a more flexible manner (e.g., Härdle, 1990; Efromovich, 1999; Eubank, 1999; Fox, 2001; Györfi et al., 2002). Here, the model given by (1) is replaced with the less restrictive model

Y_j = f_j(X_j) + ε_j,   (2)

where f_j(X_j) is some unknown function that reflects some conditional measure of location associated with Y given that the covariate value is X_j. The random variable ε_j has some unknown distribution with variance σ_j². So unlike the classic approach, where it is assumed that f_j(X_j) = β_{0j} + β_{1j} X_j, no parametric model for f_j(X_j) is specified and σ_1 = σ_2 is not assumed. Let M_j(x) be some (conditional) measure of location associated with Y_j given that X_j = x. Here, curvature is addressed using a running interval smoother. Roughly, like all smoothers, the basic strategy is to focus on the X_j values close to x and use the corresponding Y_j values to estimate M_j(x). An appeal of the running interval smoother is that it is easily applied when using any robust measure of location. The details are given in the next section of this paper.

The goal here is to test the global hypothesis

H_0: M_1(x) = M_2(x) for all x ∈ {x_1, ..., x_p},   (3)

where x_1, ..., x_p are p values of the covariate chosen empirically in a manner aimed at capturing any curvature that might exist. Roughly, these p values are chosen using a component of the so-called running interval smoother, which is described in section 2. Put in more substantive terms, the goal is to determine whether two groups differ (e.g., depressive symptoms among males and females), taking into account the possibility that the extent to which they differ might depend in a non-trivial manner on some covariate (such as the cortisol awakening response).

In the context of ANCOVA, use of the running interval smoother is not new.
In particular, Wilcox (1997) proposed and studied a method that tests H_0: M_1(x_k) = M_2(x_k) for each k, k = 1, ..., p. So p hypotheses are tested rather than the global hypothesis corresponding to (3). The method is based in part on Yuen's (1974) method for comparing trimmed means, with the familywise error rate (the probability of one or more Type I errors) controlled using a strategy similar to Dunnett's (1980) T3 technique. More recently, a bootstrap variation was proposed and studied by Wilcox (2009). Now the familywise error rate can be controlled using some improvement on the Bonferroni method (e.g., Rom, 1990; Hochberg, 1988). The bootstrap method can, in principle, be used with any robust measure of location.

However, a practical concern with testing p individual hypotheses, rather than a global hypothesis, is that power might be relatively low for three general reasons. First, each individual hypothesis uses only a subset of the available data. In contrast, the global hypothesis used here is based on all of the data that are used to test the individual hypotheses. That is, a larger sample size is used, suggesting that the global test might reject in situations where none of the individual tests is significant. Second, if, for example, the familywise error rate is set at .05, then Wilcox's method uses a Type I error probability less than .05 for the individual tests, which again can reduce power. The third reason has to do with using a confidence region for two or more parameters as opposed to confidence intervals for each individual parameter of interest. It is known that in various situations, confidence regions can result in a significant difference even when there are non-significant results for the individual parameters. (For an illustration, see for example Wilcox, 2012b, p. 690.) The method proposed here for testing (3) deals with this issue in a manner that is made clear in section 3.
Data from the Well Elderly 2 study (Jackson et al., 2009; Clark et al., 2011) are used to illustrate that the new method can make a practical difference.

Another goal in this paper is to include simulation results on comparing (conditional) quartiles. Comparing medians is an obvious way of proceeding. But in some situations, differences in the tails of two distributions can be more important and informative than comparisons based on a measure of location that is centrally located (e.g., Doksum & Sievers, 1976; Lombard, 2005). This proved to be the case in the Well Elderly 2 study for reasons explained in section 4.

Note that rather than testing (3), a seemingly natural goal is to test the hypothesis that M_1(x) = M_2(x) for all possible values of x, not just those values in the set {x_1, ..., x_p}. Numerous papers contain results on methods for accomplishing this goal when M_j(x) is taken to be the conditional mean of Y given that X = x. (For a list of references, see Wilcox, 2012a, p. 610.) But the mean is not robust, and evidently little or nothing is known about how best to proceed when using some robust measure of location. Wilcox (2012a, section 11.11.5) describes a robust method based on a running interval smoother, but the choice for the span (the value of ℓ_j described in the next section) is dictated by the sample size given the goal of controlling the Type I error probability. That is, a suboptimal fit to the data might be needed. The method used here avoids this problem. Here, some consideration was given to an approach where a robust smoother is applied to each group and predicted Y values are computed for all of the observed x values. If the null hypothesis is true, the regression line for the differences M_1(x) − M_2(x), versus x, should have a zero slope and intercept.
Several bootstrap methods were considered based on this approach, but control over the Type I error probability was very poor, so no details are provided.

Description of the Proposed Method
Following Wilcox (1997), the general strategy is to approximate the regression lines with a running interval smoother and then use the components of the smoother to test some relevant hypothesis. A portion of the method requires choosing a location estimator. As will be made evident, in principle any robust location estimator could be used, but here attention is focused on only two estimators: a 20% trimmed mean and the quantile estimator derived by Harrell and Davis (1982).

Let Z_1, ..., Z_n be any n observations. The γ-trimmed mean is

(1/(n − 2g)) Σ_{i=g+1}^{n−g} Z_(i),

where Z_(1) ≤ ... ≤ Z_(n) are the Z values written in ascending order and g = ⌊γn⌋ is the greatest integer less than or equal to γn, 0 ≤ γ < .5. The 20% trimmed mean corresponds to γ = .2. One advantage of the 20% trimmed mean is that its efficiency compares well to the sample mean under normality (e.g., Rosenberger & Gasko, 1983). But as we move toward a more heavy-tailed distribution, the standard error of the 20% trimmed mean can be substantially smaller than the standard error of the mean, which can translate into substantially higher power when outliers tend to occur. Another appeal of the 20% trimmed mean over the mean, when testing hypotheses, is that both theory and simulations indicate that the 20% trimmed mean is better at handling skewed distributions in terms of controlling the Type I error probability. This is not to suggest that the 20% trimmed mean dominates all other robust estimators that might be used. Clearly this is not the case. The only point is that it is a reasonable measure of location to consider for the situation at hand.

The Harrell and Davis (1982) estimate of the qth quantile uses a weighted average of all the order statistics. Let U be a random variable having a beta distribution with parameters a = (n + 1)q and b = (n + 1)(1 − q), and let

v_i = P((i − 1)/n ≤ U ≤ i/n).

The estimate of the qth quantile, based on Z_1, ..., Z_n, is

θ̂_q = Σ_{i=1}^n v_i Z_(i).   (4)

In terms of its standard error, Sfakianakis and Verginis (2006) show that in some situations the Harrell–Davis estimator competes well with alternative estimators that again use a weighted average of all the order statistics, but there are exceptions. (Sfakianakis and Verginis derived alternative estimators that have advantages over the Harrell–Davis estimator in some situations. But we found that when sampling from heavy-tailed distributions, the standard error of their estimators can be substantially larger than the standard error of θ̂_q.) Comparisons with other quantile estimators are reported by Parrish (1990), Sheather and Marron (1990), as well as Dielman, Lowry and Pfaffenberger (1994).
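Both estimators can be computed directly from their definitions. The following Python sketch is illustrative only (the function names are ours, not the paper's; `scipy.stats.beta` is assumed available to supply the beta probabilities used for the Harrell–Davis weights):

```python
import numpy as np
from scipy import stats

def trimmed_mean(z, gamma=0.2):
    """gamma-trimmed mean: drop the g = floor(gamma * n) smallest and
    largest order statistics and average the rest."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    g = int(np.floor(gamma * n))
    return z[g:n - g].mean()

def harrell_davis(z, q):
    """Harrell-Davis estimate of the q-th quantile: a weighted average of
    all order statistics, with weights v_i = P((i-1)/n <= U <= i/n) for
    U ~ Beta((n+1)q, (n+1)(1-q)), as in (4)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    a, b = (n + 1) * q, (n + 1) * (1 - q)
    i = np.arange(1, n + 1)
    v = stats.beta.cdf(i / n, a, b) - stats.beta.cdf((i - 1) / n, a, b)
    return float(np.sum(v * z))
```

Note how the trimmed mean simply discards an extreme outlier, while the Harrell–Davis estimator spreads weight smoothly over all order statistics.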
The only certainty is that no single estimator dominates in terms of efficiency. For example, the Harrell–Davis estimator has a smaller standard error than the usual sample median when sampling from a normal distribution or a distribution that has relatively light tails, but for sufficiently heavy-tailed distributions, the reverse is true (Wilcox, 2012a, p. 87).

To describe the details of the method for testing (3), let (X_ij, Y_ij) (i = 1, ..., n_j; j = 1, 2) be a random sample of size n_j from the jth group. For a chosen value for x, suppose the goal is to estimate M_j(x). The strategy is simple. Roughly, for each j, compute a measure of location based on the Y_ij values for which the corresponding X_ij values are close to x. More formally, for fixed j, compute a measure of location based on the Y_ij values such that i is an element of the set

P_j(x) = {i : |X_ij − x| ≤ ℓ_j × MADN_j},

where ℓ_j is a constant chosen by the investigator and often called the span, MADN_j = MAD_j/.6745, MAD_j (the median absolute deviation) is the median of |X_1j − m_j|, ..., |X_{n_j j} − m_j|, and m_j is the usual sample median based on X_1j, ..., X_{n_j j}. Under normality, MADN_j estimates the population standard deviation, in which case X_ij is close to x if it is within ℓ_j standard deviations of x. Generally, the choice ℓ_j = .8 or ℓ_j = 1 gives good results in terms of capturing any curvature, but of course exceptions are encountered.

Let N_j(x) be the cardinality of the set P_j(x) and suppose that M_j(x) is estimated with some measure of location based on the Y_ij values for which i ∈ P_j(x). The two regression lines are defined to be comparable at x if simultaneously N_1(x) ≥ 12 and N_2(x) ≥ 12. The idea is that if the sample sizes used to estimate M_1(x) and M_2(x) are sufficiently large, then a reasonably accurate confidence interval for M_1(x) − M_2(x) can be computed provided a reasonably level robust technique is used. For example, Yuen's (1974) method might be used with a 20% trimmed mean. (It is known that under fairly general conditions methods for comparing means are not level robust with relatively small sample sizes. See Wilcox, 2012b, for details.)

For notational convenience, let θ̂_jk be some location estimator based on the Y_ij values for which i ∈ P_j(x_k). Let δ̂_k = θ̂_1k − θ̂_2k and let δ_k denote the population analog of δ̂_k (k = 1, ..., p). Then (3) corresponds to

H_0: δ_1 = δ_2 = · · · = δ_p = 0.   (5)

The basic strategy for testing (5) is to generate bootstrap samples from each group, compute δ̂_k based on these bootstrap samples, repeat this B times, and then measure how deeply the null vector is nested in the bootstrap cloud of points via Mahalanobis distance. Based on these distances, results in Liu and Singh (1997) indicate how to compute a p-value.

To elaborate, let (X*_ij, Y*_ij) be a bootstrap sample from the jth group, obtained by resampling with replacement n_j pairs of points from (X_ij, Y_ij) (i = 1, ..., n_j; j = 1, 2). Let δ̂*_k be the estimate of δ_k based on the bootstrap samples from the two groups. Repeat this process B times, yielding ∆̂*_b = (δ̂*_1b, ..., δ̂*_pb), b = 1, ..., B. Let S be the covariance matrix based on the B vectors ∆̂*_1, ..., ∆̂*_B. Note that the center of the bootstrap cloud being estimated by these B bootstrap samples is known. It is ∆̂ = (δ̂_1, ..., δ̂_p), the estimate of ∆ = (δ_1, ..., δ_p) based on the (X_ij, Y_ij) values. Let

d_b = (∆̂*_b − ∆̂) S⁻¹ (∆̂*_b − ∆̂)′,

where for b = 0, ∆̂*_0 is taken to be the null vector.
Then a (generalized) p-value is

(1/B) Σ_{b=1}^B I(d_0 ≤ d_b),   (6)

where the indicator function I(d_0 ≤ d_b) = 1 if d_0 ≤ d_b; otherwise I(d_0 ≤ d_b) = 0.

There remains the problem of choosing the x_k values. They might be chosen based on substantive grounds, but of course studying this strategy via simulations is difficult at best. Here, we follow Wilcox (1997) and choose p = 5 points in a manner suggested by the running interval smoother in terms of capturing any curvature in a flexible manner. For notational convenience, assume that for fixed j, the X_ij values are in ascending order. That is, X_1j ≤ · · · ≤ X_{n_j j}. Suppose x_1 is taken to be the smallest X_i1 value for which the regression lines are comparable. That is, search the first group for the smallest X_i1 such that N_1(X_i1) ≥ 12. If N_2(X_i1) ≥ 12, in which case the two regression lines are comparable at X_i1, set x_1 = X_i1. If N_2(X_i1) < 12, consider the next largest X_i1 value and continue until it is simultaneously true that N_1(X_i1) ≥ 12 and N_2(X_i1) ≥ 12. Let i_1 be the smallest integer i such that N_1(X_i1) ≥ 12 and N_2(X_i1) ≥ 12. Similarly, let x_5 be the largest X_i1 value for which the regression lines are comparable. That is, x_5 is the largest X_i1 value such that N_1(X_i1) ≥ 12 and N_2(X_i1) ≥ 12, and let i_5 be the corresponding value of i. Let i_3 = (i_1 + i_5)/2, i_2 = (i_1 + i_3)/2, and i_4 = (i_3 + i_5)/2. Round i_2, i_3, and i_4 down to the nearest integer and set x_2 = X_{i_2 1}, x_3 = X_{i_3 1}, and x_4 = X_{i_4 1}.

When the covariate values are chosen in the manner just described, and p = 5 separate tests are performed based on some measure of location, this will be called method W henceforth. Computing a p-value using (6), with the goal of performing a global test, will be called method G. Unless stated otherwise, both methods G and W will be based on a 20% trimmed mean.

Note that in essence, we have a 2-by-p ANOVA design. But for the p levels of the second factor, the groups are not necessarily independent. The reason is that for any two covariate values, say x_k and x_m, the intersection of the sets P_j(x_k) and P_j(x_m) is not necessarily equal to the empty set. Here, the strategy for dealing with this feature is to model it via a bootstrap method. Another approach would be to divide the data into p independent groups. But there is uncertainty about how this might be done so as to effectively capture any curvature. The approach used here mimics a basic component used by a wide range of smoothers designed to deal with curvature in a flexible manner.

Of course, the obvious decision rule, when using method G, is to reject the null hypothesis if the p-value is less than or equal to the nominal level. When testing at the α = .05 level, preliminary simulations indicated that this approach performs well, in terms of controlling the Type I error probability, when p = 3 and the x_k values are taken to be the quartiles corresponding to the X_i1 values. But when p = 5 and the x_k values are chosen as just described, the actual level exceeded .075 when testing at the α = .05 level with n_1 = n_2 = 30. This problem persisted with n_1 = n_2 = 50. However, for the range of distributions considered (described in section 3), the actual level was found to be relatively stable. This suggests using a strategy similar to Gosset's (Student's) approach to comparing means: assume normality, determine an appropriate critical value using a reasonable test statistic, and continue using this critical value when dealing with non-normal distributions.

Given n_1 and n_2, this strategy is implemented by first generating, for each j, n_j pairs of observations from a bivariate normal distribution having correlation ρ = 0. Based on these generated data, determine p = 5 values of the covariate in the manner just described and then compute the p-value given by (6). Denote this p-value by p̂_1. Repeat this process A times, yielding p̂_1, ..., p̂_A. Then an α level critical p-value, say p̂_c, is taken to be the α quantile of the p̂_1, ..., p̂_A values, which here is estimated via the Harrell–Davis estimator. (With A = 1000, and when a trimmed mean is used, this can be done in 14.8 seconds using an R function, described in the final section of this paper, running on the first author's MacBook Pro.) That is, letting p_o denote the p-value based on the observed data, reject (3) if p_o ≤ p̂_c.

Note that once p̂_c has been determined, a 1 − α confidence region for the vector ∆ = (δ_1, ..., δ_p) can be computed. The confidence region consists of the convex hull containing the (1 − p̂_c)B vectors ∆̂*_b that have the smallest d_b values. As previously indicated, this confidence region provides a perspective on why the global test considered here can have more power than method W. Situations are encountered where the null vector is not contained in the confidence region, yet the confidence intervals for each of the p differences contain zero.

Simulation Results

Simulations were used to study the small-sample properties of the proposed method with n_1 = n_2 = 30.
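As a summary of the procedure just described, here is a minimal, illustrative Python sketch of method G with a 20% trimmed mean. It is a sketch under stated assumptions, not the paper's implementation (which is the R function ancGLOB): the covariate points x_k are supplied by the user rather than chosen by the five-point rule, the critical p-value calibration is omitted, and all function names are ours.

```python
import numpy as np

def trimmed_mean(z, gamma=0.2):
    # 20% trimmed mean by default: drop floor(gamma * n) values per tail.
    z = np.sort(np.asarray(z, dtype=float))
    g = int(np.floor(gamma * len(z)))
    return z[g:len(z) - g].mean()

def madn(x):
    # MAD/.6745, which estimates the population SD under normality.
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x))) / 0.6745

def neighbors(x, x0, span=1.0):
    # Boolean mask for P_j(x0) = {i : |x_i - x0| <= span * MADN}.
    return np.abs(np.asarray(x, dtype=float) - x0) <= span * madn(x)

def delta_hat(x1, y1, x2, y2, pts, span=1.0):
    # delta_k = M_1(x_k) - M_2(x_k) via running-interval trimmed means.
    return np.array([trimmed_mean(y1[neighbors(x1, p, span)]) -
                     trimmed_mean(y2[neighbors(x2, p, span)]) for p in pts])

def global_test(x1, y1, x2, y2, pts, B=500, seed=0):
    """Bootstrap p-value, as in (6), for H0: delta_1 = ... = delta_p = 0,
    measuring how deeply the null vector is nested in the bootstrap cloud
    via squared Mahalanobis distance."""
    rng = np.random.default_rng(seed)
    center = delta_hat(x1, y1, x2, y2, pts)  # known center of the cloud
    boot = []
    for _ in range(B):
        i1 = rng.integers(len(x1), size=len(x1))  # resample (X, Y) pairs
        i2 = rng.integers(len(x2), size=len(x2))
        boot.append(delta_hat(x1[i1], y1[i1], x2[i2], y2[i2], pts))
    boot = np.array(boot)
    S_inv = np.linalg.inv(np.cov(boot, rowvar=False))

    def dist(v):
        u = v - center
        return u @ S_inv @ u

    d0 = dist(np.zeros_like(center))          # distance of the null vector
    db = np.array([dist(v) for v in boot])
    return float(np.mean(d0 <= db))           # generalized p-value (6)
```

In practice the p-value returned here would be compared with the simulated critical p-value p̂_c described above rather than with the nominal level directly.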
Smaller sample sizes are dubious because they make it particularly difficult to effectively deal with curvature. Also, finding five covariate values where the groups are comparable can be problematic. That is, N_j(x) might be so small as to make comparisons meaningless. A few results are reported with n_1 = n_2 = 100 and 200 as well.

Estimated Type I error probabilities, α̂, were based on 4000 replications. The estimated critical p-value was based on A = 1000 and B = 500 bootstrap samples. Four types of distributions were used: normal, symmetric and heavy-tailed, asymmetric and light-tailed, and asymmetric and heavy-tailed. If Z has a standard normal distribution, then by definition

V = ((exp(gZ) − 1)/g) exp(hZ²/2), if g > 0,
V = Z exp(hZ²/2), if g = 0,

has a g-and-h distribution, where g and h are parameters that determine the first four moments. That is, a g-and-h distribution is a transformation of the standard normal random variable that can be used to generate data having a range of skewness and kurtosis values. The four distributions used here were the standard normal (g = h = 0), a symmetric heavy-tailed distribution (g = 0, h = 0.2), an asymmetric light-tailed distribution (g = 0.2, h = 0), and an asymmetric heavy-tailed distribution (g = h = 0.2). Table 1 summarizes the skewness (κ_1) and kurtosis (κ_2) for each distribution. Additional properties of the g-and-h distribution are summarized by Hoaglin (1985).

Table 1: Some properties of the g-and-h distribution (skewness κ_1 and kurtosis κ_2 for each g, h pair).

Data were generated according to three types of association. For the first two, Y_ij = βX_ij + ε, with the two choices for the slope being β = 0 and 1. The third type of association was Y_ij = X_ij² + ε. These three situations are labeled S1, S2 and S3, respectively. The estimated Type I errors were very similar for S1 and S2, so for brevity the results for S2 are not reported. The X_ij values were generated from a standard normal distribution and ε was generated from one of the four g-and-h distributions previously indicated.

Table 2: Estimated Type I error probabilities when testing at the α = .05 level, n_1 = n_2 = 30

g     h     Estimator   S1     S3
0.0   0.0   γ = .2      …      …
0.0   0.0   q = .50     .038   .044
0.0   0.0   q = .75     .049   .048
0.0   0.2   γ = .2      …      …
0.0   0.2   q = .50     .023   .028
0.0   0.2   q = .75     .029   .028
0.2   0.0   γ = .2      …      …
0.2   0.0   q = .25     .053   .056
0.2   0.0   q = .50     .036   .044
0.2   0.0   q = .75     .046   .045
0.2   0.2   γ = .2      .020   …
0.2   0.2   q = .25     .040   .040
0.2   0.2   q = .50     .022   .028
0.2   0.2   q = .75     .026   .025

The simulation results are reported in Table 2. As can be seen, when testing at the .05 level, the actual level was estimated to be less than or equal to .056 among all of the situations considered. Although the seriousness of a Type I error depends on the situation, Bradley (1978) suggests that, as a general guide, when testing at the .05 level the actual level should be between .025 and .075. Based on this criterion, the only concern is that for a very heavy-tailed distribution the estimated level drops below .025, the lowest estimate being .020. Increasing both sample sizes to 50 corrects this problem. For example, with g = h = .2 and γ = .
2, the estimate for situation S1 increases from .020 to .034.

Notice that the lowest estimates in Table 2 occur for γ = .2 with g = h = .2. Simulations were run again with n_1 = n_2 = 100 as well as n_1 = n_2 = 200 as a partial check on the impact of using larger sample sizes. The estimated Type I error probabilities for these two situations were .036 and .040, respectively.

As previously explained, there are at least three reasons to expect that the global test will have more power than method W. The extent to which this is true depends on the situation. To provide at least some perspective, consider the case where the covariate has a normal distribution and the error term has a g-and-h distribution. First consider g = h = 0 (normality), and suppose the first group has β_0 = β_1 = 0, while for the second group Y = . ε. With n_1 = n_2 = 50, and testing at the .05 level, the power of method G was estimated to be .51. The probability of rejecting at one or more design points using method W was estimated to be .38. If instead Y = . X + . ε for the second group, the power estimates are now .75 and .66, respectively. If Y = . X + . ε, the estimates are .89 and .78. For this last situation, if (g, h) = (0, .2) … (g, h) = (.2, .2), the estimates are .75 and .70. So all indications are that G has more power, with the increase in power estimated to be as high as .12 among the situations considered here.

As already noted, a well-known argument for using a 20% trimmed mean, rather than the mean, is that under normality its efficiency compares very well to that of the mean, but as we move toward a heavy-tailed distribution, the standard error of the mean can be substantially larger than the standard error of the 20% trimmed mean. That is, in terms of power, there is little separating the mean and the 20% trimmed mean under normality, but for heavier-tailed distributions, power might be substantially higher using a 20% trimmed mean. For the situation at hand, consider again g = h = 0 and Y = . ε, only now method W is applied using means rather than 20% trimmed means. Now power is estimated to be .43, slightly better than the .38 obtained using a 20% trimmed mean. Using instead method G, power was estimated to be .56. So again, method G offers more power than method W, and power is a bit higher compared to using a 20% trimmed mean, which was .51. For (g, h) = (0, .2) …

Illustrations

(p = .
03, when fitting a straight line regression via a generalization of the Theil–Sen estimator that is designed to handle tied values.)

Note that in Figure 1, there appears to be curvature for the males. A test of the hypothesis that the regression line is straight was performed using the R function qrchk in Wilcox (2012b, p. 544). If again the six outliers among the independent variable are eliminated, the hypothesis of a straight line is rejected at the .05 level (p = .…; p = .…; p = .16), and the intercepts do differ significantly (p = .…).

Figure 1: Regression lines for predicting perceived health and well-being. The independent variable is the cortisol awakening response. The solid line is the .5 quantile regression line for males.
Figure 2: The least squares regression lines for predicting perceived health and well-being using the same data shown in Figure 1. Again, the solid line is the regression line for males.

Figure 1 suggests that there is a distinct bend approximately where CAR is equal to −.… (p = .…; p = .…).

Concluding Remarks
In summary, all indications are that method G avoids Type I error probabilities well above the nominal level. The highest estimated Type I error probability was .056 when testing at the .05 level. The only known concern is that when dealing with a very heavy-tailed distribution, the Type I error probability might be less than .025 with sample sizes of 30. Increasing the sample sizes to 50 avoided this problem among the situations considered.

It is unclear under what circumstances some asymptotic result might be used to determine an appropriate critical value. The answer depends on the sample sizes, the span used by the running interval smoother (ℓ_1 and ℓ_2) and the number of covariate values used. But this would seem to be a minor inconvenience in most situations because a critical value can be determined fairly quickly using the method described in the paper. Even with sample sizes of 300, execution time was only 39.5 seconds on a MacBook Pro.

It is not being suggested that method G dominates all approaches relevant to ANCOVA. It seems fairly evident that no single method dominates, one reason being that different methods are sensitive to different features of the data. Rather, method G provides an approach to ANCOVA that might have practical value in various situations, as was illustrated using the Well Elderly data. Here, for example, by dealing with curvature in a flexible manner, coupled with a robust measure of location, the results indicated that when CAR is negative, typical SF-36 scores for males tend to be higher than scores for females, and the extent to which they differ appears to have little to do with the value of CAR. But for CAR greater than zero, this is no longer the case: the differences between males and females tend to decrease as CAR increases. Both classic ANCOVA and robust methods indicate that males tend to have higher SF-36 scores. But the robust methods provide a more detailed picture regarding when this is the case.
Method G is just one tool that helps provide a more detailed understanding of data beyond the non-robust and less flexible approach based on classic ANCOVA methods. Put in broader terms, is there a single number or a single method that tells us everything we would like to know about how groups compare? We would suggest that the answer is no. Method G is aimed at dealing with this issue.

Finally, R software is available for applying method G. The function ancGLOB performs the calculations and is stored on the first author's web page. For faster execution time, C++ subroutines have been written that compute the critical p-value. To take advantage of these subroutines, first install the R package devtools with the R command install.packages("devtools"). Then the C++ subroutines can be installed with the following commands:

library("devtools")
install_github("WRScpp", "mrxiaohe")

Finally, when using the R function ancGLOB, set the argument cpp=TRUE.

ACKNOWLEDGMENT

We thank Xiao He for supplying the C++ code used in this study.

REFERENCES

Bradley, J. V. (1978). Robustness?
British Journal of Mathematical and Statistical Psy-chology, 31 , 144–152.Brunner, E., Domhof, S & Langer, F. (2002).
Nonparametric Analysis of LongitudinalData in Factorial Experiments . New York: Wiley.Brunner, E. & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: asymp-totic theory and small-sample approximation.
Biometrical Journal , , 17–25.Chida, Y. & Steptoe, A. (2009). Cortisol awakening response and psychosocial factors:A systematic review and meta-analysis. Biological Psychology, 80 , 265–278.Clark, F., Jackson, J., Carlson, M., et al. (2011). Effectiveness of a lifestyle interventionin promoting the well-being of independently living older people: results of the Well Elderly 2Randomise Controlled Trial.
Journal of Epidemiology and Community Health, 66 , 782–790.doi:10.1136/jech.2009.099754Cliff, N. (1996).
Ordinal Methods for Behavioral Data Analysis . Mahwah, NJ: Erlbaum.Clow, A., Thorn, L., Evans, P. & Hucklebridge, F. (2004). The awakening cortisolresponse: Methodological issues and significance.
Stress, 7 , 29–37.19ielman, T., Lowry, C. & Pfaffenberger, R. (1994). A comparison of quantile estimators.
Communications in Statistics–Simulation and Computation, 23, 355–371.

Doksum, K. A. & Sievers, G. L. (1976). Plotting with confidence: graphical comparisons of two populations. Biometrika, 63, 421–434.

Dunnett, C. W. (1980). Pairwise multiple comparisons in the unequal variance case. Journal of the American Statistical Association, 75, 796–800.

Eakman, A. M., Carlson, M. E. & Clark, F. A. (2010). The meaningful activity participation assessment: a measure of engagement in personally valued activities. International Journal of Aging and Human Development, 70, 299–317.

Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory and Applications. New York: Springer-Verlag.

Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing. New York: Marcel Dekker.

Foley, K., Reed, P., Mutran, E., et al. (2002). Measurement adequacy of the CES-D among a sample of older African Americans. Psychiatric Research, 109, 61–69.

Fox, J. (2001).
Multiple and Generalized Nonparametric Regression. Thousand Oaks, CA: Sage.

Fung, K. Y. (1980). Small sample behaviour of some nonparametric multi-sample location tests in the presence of dispersion differences. Statistica Neerlandica, 34, 189–196.

Györfi, L., Kohler, M., Krzyżak, A. & Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. & Stahel, W. A. (1986). Robust Statistics. New York: Wiley.

Härdle, W. (1990). Applied Nonparametric Regression. Econometric Society Monographs No. 19. Cambridge, UK: Cambridge University Press.

Harrell, F. E. & Davis, C. E. (1982). A new distribution-free quantile estimator.
Biometrika, 69, 635–640.

Hays, R. D., Sherbourne, C. D. & Mazel, R. M. (1993). The Rand 36-item health survey 1.0. Health Economics, 2, 217–227.

Heritier, S., Cantoni, E., Copt, S. & Victoria-Feser, M.-P. (2009). Robust Methods in Biostatistics. New York: Wiley.

Hettmansperger, T. P. (1984). Statistical Inference Based on Ranks. New York: Wiley.

Hettmansperger, T. P. & McKean, J. W. (1998). Robust Nonparametric Statistical Methods. London: Arnold.

Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distribution. In D. Hoaglin, F. Mosteller & J. Tukey (Eds.), Exploring Data Tables, Trends, and Shapes (pp. 461–515). New York: Wiley.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802.

Huber, P. J. & Ronchetti, E. (2009). Robust Statistics, 2nd Ed. New York: Wiley.

Jackson, J., Mandel, D., Blanchard, J., Carlson, M., Cherry, B., Azen, S., Chou, C.-P., Jordan-Marsh, M., Forman, T., White, B., Granger, D., Knight, B. & Clark, F. (2009). Confronting challenges in intervention research with ethnically diverse older adults: the USC Well Elderly II trial. Clinical Trials, 6.
The Advanced Theory of Statistics, Vol. 2. New York: Hafner.

Lawson, A. (1983). Rank analysis of covariance: alternative approaches. Statistician, 32, 331–337.

Lewinsohn, P. M., Hoberman, H. M. & Rosenbaum, M. (1988). A prospective study of risk factors for unipolar depression. Journal of Abnormal Psychology, 97, 251–264.

Liu, R. G. & Singh, K. (1997). Notions of limiting P values based on data depth and bootstrap. Journal of the American Statistical Association, 92, 266–277.

Lombard, F. (2005). Nonparametric confidence bands for a quantile comparison function. Technometrics, 47, 364–369.

Maronna, R. A., Martin, D. R. & Yohai, V. J. (2006). Robust Statistics: Theory and Methods. New York: Wiley.

McHorney, C. A., Ware, J. E. & Raczek, A. E. (1993). The MOS 36-item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Medical Care, 31, 247–263.

Neuhäuser, M., Lösch, C. & Jöckel, K.-H. (2007). The Chen-Luo test in case of heteroscedasticity. Computational Statistics & Data Analysis, 51, 5055–5060.

Parrish, R. S. (1990). Comparison of quantile estimators in normal sampling. Biometrics, 46, 247–257.

Radloff, L. (1977). The CES-D scale: a self report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401.

Rom, D. M. (1990). A sequentially rejective test procedure based on a modified Bonferroni inequality.
Biometrika, 77 , 663–666.Rosenberger, J. L. & Gasko, M. (1983). Comparing location estimators: Trimmed means,medians, and trimean. In D. Hoaglin, F. Mosteller and J. Tukey (Eds.)
UnderstandingRobust and exploratory data analysis. (pp. 297–336). New York: Wiley.Sfakianakis, M. E. & Verginis, D. G. (2006). A New Family of Nonparametric QuantileEstimators.
Communications in Statistics–Simulation and Computation, 37 , 337–345.Sheather, S. J. & Marron, J. S. (1990). Kernel quantile estimators.
Journal of theAmerican Statistical Association, 85 , 410–416.Staudte, R. G. & Sheather, S. J. (1990).
Robust Estimation and Testing . New York:Wiley. 22ilcox, R. R. (1997). ANCOVA based on comparing a robust measure of location atempirically determined design points.
British Journal of Mathematical and Statistical Psy-chology , , 93–103.Wilcox, R. R. (2009). Robust ANCOVA using a smoother with bootstrap bagging. British Journal of Mathematical and Statistical Psychology 62 , 427–437.Wilcox, R. R. (2012a).
Introduction to Robust Estimation and Hypothesis Testing , 3rdEd. San Diego, CA: Academic Press.Wilcox, R. R. (2012b). Modern Statistics for the Social and Behavioral Sciences: APractical Introduction. New York: Chapman & Hall/CRC pressYuen, K. K. (1974). The two sample trimmed t for unequal population variances.